Spotlight: Dmitry Zinoviev (Author) Interview and AMA!

Author Spotlight:
Dmitry Zinoviev (@aqsaqal)

Today we’re putting our spotlight on Dmitry Zinoviev, author of Data Science Essentials in Python, Pythonic Programming, Complex Network Analysis in Python, and Resourceful Code Reuse. Pragmatic sat down and talked with Professor Zinoviev about everything Python.

This is also an AMA. Everyone commenting or asking a question will automatically be entered into our draw to win one of his books!

Without further ado…


Hello Dmitry! Please introduce yourself.

I am a professor of Computer Science, full professor. I work at Suffolk University in Boston, Massachusetts, which people often confuse with Suffolk University in England. But this is New England; this is not Old England. I teach computer science, mostly, but also a science course and sometimes mathematics. I’ve been here for twenty-one years. This is my twenty-second year.

I have a PhD in computer science from Stony Brook University on Long Island, and I have a master’s degree in physics from Moscow State University in Moscow, Russia—at that time, the USSR. That’s pretty much the summary of what I am.

I’ve published four books with Pragmatic, and I just signed a contract for the fifth one. The first book is Data Science Essentials in Python. The second was about networks and network analysis, in Python as well. The third one was in the Pragmatic Express Series. In fact, I believe it was the second book in that series. I was pretty much like a guinea pig. And it was about code reuse in C and in Python as well. And the fourth one was Pythonic Programming: Tips for Becoming an Idiomatic Python Programmer, which was essentially writing idiomatic Python code. The fifth one, which has not been written yet, is gonna be about simulation and modeling in Python and NetLogo.

I sense a theme here. So tell me about Python, and tell me about the language. And tell me about why it resonates so well with you.

Python is apparently one of the most popular languages nowadays. I think it is very accessible. In the first place, it is very permissive. It does allow you to do things that other languages would prefer you not do.

It sounds and reads a little bit like English. So, there is actually a way to read Python programs aloud so that they sound almost like English descriptions.

In a sense, it is a bridge between algorithmic thinking and proper computer programming that other languages like C and especially Java do not provide. That would be like two sides of the same coin in Java. In Python, it is essentially one side of the same coin. Like a Moebius coin, if you want.

It is also very interactive, in a sense that when you write something in Python, you can immediately see the results of your program, while in C and in Java, you have to go through a sometimes lengthy compiling process.

Also, Python has very nice and easy-to-understand graphics. Even native Python graphics are simple to understand. On top of that we have Matplotlib, which is a scientific plotting library, which is also pretty intuitive. Not many computer languages would provide interactivity and convenient graphics at the same time.

I think because of these features, Python became widespread.

Ah, there’s one more thing. Not too many people write programs in pure Python nowadays, aside from, say, high schoolers and freshmen in universities. Most of the time, Python is used as a glue that takes libraries developed in other languages, usually C or C++, and puts them together in a very easy, convenient way. It’s like Lego blocks.

Python would be not even a block in that collection. Python is the glue that keeps the blocks together. That way, Python can be easily extended by connecting some other libraries. It can be easily updated by replacing the libraries. And this is what makes me fascinated about this language.
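As a small illustration of the glue idea, here is a sketch that uses only the standard library. The `zlib` module wraps the zlib compression library, which is written in C, and `hashlib` also relies on compiled code. The function name `pack` and the sample text are made up for this example.

```python
import hashlib
import zlib


def pack(text: str) -> tuple[bytes, str]:
    """Compress text and fingerprint the result.

    The heavy lifting happens in compiled code; Python only
    passes bytes from one library to the other.
    """
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw)                  # C zlib library
    digest = hashlib.sha256(compressed).hexdigest()  # compiled hash code
    return compressed, digest


sample = "to be or not to be " * 1000
compressed, digest = pack(sample)
print(len(sample), "characters ->", len(compressed), "bytes")
```

Swapping `zlib` for another compression module would change one line of the glue and none of the surrounding code.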

I have a long history of being in love with Python. Well, just like almost any other love story, it started with hate. When I first discovered Python (in 1995, most probably), I strongly disliked it because I was a C person. Over time, literally in a couple of years, I got madly in love with it. And to date, it is my favorite development language.

How is Python used in teaching?

In my twenty years, twenty-one years at Suffolk, we changed the CS1 language, Computer Science 101, three times.

We started with C (that was before I came), then we attempted to switch to C++. It did not go well, because C++ actually is harder than C. Then we switched to Java, and it was even worse because Java is harder than C++, in my view. Eventually I suggested switching to Python, and that’s where we are.

The major schools in the area, such as Northeastern and Boston University and MIT, all started teaching Computer Science 1 in Python. And there are good things and bad things about this transition.

The good thing is Python is easy. Python is permissive. Students can start producing results faster—self-satisfying results.

On the other hand, Python is easy and permissive. I use the same words, but now I use them in a negative sense, because it allows students to make mistakes without telling them that they’re making mistakes, which kind of makes them less disciplined, less careful about their programs. In the first semester, it is a good thing, but as they progress into higher-level courses, that kind of backfires.

I would say Python is a great language to teach programming to someone who does not intend to become a hardcore programmer. And those who are going to be real professional programmers probably should learn some Python but otherwise stay in the C/C++/Java land.

We encounter Python in the professional world, often in Dev Tools. Are there other places that Python slots into the working professional zone of development?

If we talk about machine learning, artificial intelligence, this is where Python shines. But then we argue that this is not actually software development, it is machine learning in the first place, and the reason why it shines in machine learning is because it is a glue.

You can take many machine learning libraries developed in other languages (Java and C and C++ in the first place) and instantly build reasonably well-performing systems.

If the performance is kind of an issue, then the system may be reprogrammed in C or in Java, almost without changing the libraries. Just take the same libraries and put them together using C or C++ as the glue. I would say this is where it shines.

Robotics is another place, mostly for the same reason: robotics libraries, such as computer vision and motion libraries, can come together using Python as a glue. Python performance in this case is not really important as such, as the libraries are written in compiled languages anyway.

Anywhere Python is used as a glue, putting together high-performing libraries written in other languages, it shines, or at least it has a chance to shine.

In any kind of experimental areas, any kind of research, Python is great. Because when you do research, you sometimes do not even know what you are looking for, and what your algorithms are going to be.

Python gives you unique flexibility. You can change your program instantly, on the fly. You can play out many different scenarios. When you work on something production quality, you have to follow proper software engineering rules and principles. And Python is good for prototyping but not necessarily for a general production system.

You mentioned that it’s for prototyping, good for learning, but where performance is not critical. So how do you teach about, for example, computational complexity in Python?

Computational complexity is a pain because Python hides most of the computational complexity from the user, but it does so in a very stealthy way. For example, we know that lists in Python have linear complexity. So if you want to search for an element on a list, it takes linear time. That’s true not only about Python lists but lists in general.

On the other hand, Python has sets and dictionaries that visually differ from lists only by the shape of the enclosing brackets. However, they have constant complexity. Searching for an item in a dictionary or in a set takes constant time on average, regardless of the size of the set or dictionary.

It raises some interesting questions. When a student decides whether to use a list, which is introduced early in the semester, or dictionary or set, which I introduce later in the semester, they often go for lists because lists visually are very similar to sets and dictionaries.

However, performance is not only inferior, it may be drastically inferior for large datasets. And it is not clear from the notation that dictionaries and sets in Python are much better for searching and insertion tasks.
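The difference is easy to measure. The following is a minimal sketch, not a rigorous benchmark: the container size, the repeat count, and the helper name `time_lookup` are arbitrary choices for illustration, and the absolute timings depend on the machine.

```python
import timeit

N = 100_000
as_list = list(range(N))
as_set = set(as_list)
as_dict = dict.fromkeys(as_list)

missing = -1  # worst case for the list: every element gets inspected


def time_lookup(container, repeats=200):
    """Return the total seconds spent on `repeats` membership tests."""
    return timeit.timeit(lambda: missing in container, number=repeats)


list_time = time_lookup(as_list)
set_time = time_lookup(as_set)
dict_time = time_lookup(as_dict)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s  dict: {dict_time:.6f}s")
```

All three containers are queried with the same `in` operator, which is exactly why the cost difference is invisible in the source code.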

What I basically do when I teach the intro course in Python, I briefly mention this fact. But since I do it briefly, students often forget about it. However, they see me again in a data science course, where Python is a tool rather than an objective. We don’t learn Python in that course. We use it.

And I give them an assignment, where they have to analyze some piece of big data, usually complete Shakespeare, all of his works. And they feel that if they use dictionaries, they get results in a matter of a minute or two. If they use lists, it takes hours, and even days, depending on how fast their computer is. At this point, they start appreciating the difference between the two data types.
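A sketch of the two approaches to such a word count is shown below. The one-line sample text stands in for the full corpus, and both function names are hypothetical. The list version rescans its pairs for every word, so its running time grows roughly with the product of the text length and the vocabulary size, while the dictionary version does one hash lookup per word.

```python
from collections import Counter


def count_with_list(words):
    """Keep [word, count] pairs in a list; every update scans the list."""
    pairs = []
    for word in words:
        for pair in pairs:
            if pair[0] == word:
                pair[1] += 1
                break
        else:
            pairs.append([word, 1])  # word not seen before
    return pairs


def count_with_dict(words):
    """Keep counts in a dictionary; every update is a hash lookup."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


words = "to be or not to be that is the question".split()

from_list = count_with_list(words)
from_dict = count_with_dict(words)

# collections.Counter is the idiomatic dictionary-based tool for this job.
print(from_dict == dict(Counter(words)))
```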

You may argue, or one may argue, that it is too late. It’s like a junior-year course. By that time, they have already learned that the choice of data structures is unimportant.

I would say it is a problem. If performance is an issue, Python as such is not a good language. Python libraries written in other languages are good, and Python as a glue is good. But the fact that it hides the computational complexity in many ways (I’ve given you just one example) is not, I think, a good feature of the language, in particular when performance becomes crucial for the success of an application.

What about pydoc? We’d imagine that dictionaries and lists would use code-documentation to specify complexity. Do they do this? How is this supported? And how does this tie in if there’s no standard IDE?

The problem with junior programmers is that they do not read documentation.

It is probably documented. I’m sure it is a part of Python documentation, and of Python standard, that lists have linear access time and dictionaries have constant access time. If it is a part of the standard, eventually you would get to remember that lists are bad and dictionaries are good.

But if it is not part of the standard—and unfortunately at the moment, I do not know precisely whether it is written in the standard or not—then it would be up to the implementation.

And it is not hard to imagine an implementation of dictionaries that does not use hash tables as it does in most versions. Say dictionaries may have been implemented on top of lists. Then in that case, they would have inferior performance as well.

We’re basically relying on the fact that most standard implementations of Python follow the hash table implementation of dictionaries.
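To make the hypothetical concrete, here is a sketch of a mapping stored on top of a list. The class name `ListBackedDict` is invented for this example and does not describe any real Python implementation. It honors the key-to-value contract, yet every access walks the list.

```python
class ListBackedDict:
    """A mapping stored as a list of (key, value) pairs.

    It satisfies the functional contract of a dictionary, but every
    lookup walks the list, so access time grows with the number of keys.
    """

    def __init__(self):
        self._pairs = []

    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)  # overwrite existing key
                return
        self._pairs.append((key, value))

    def __getitem__(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __len__(self):
        return len(self._pairs)


d = ListBackedDict()
d["hamlet"] = 1
d["ophelia"] = 2
d["hamlet"] = 3
print(d["hamlet"], len(d))
```

Code written against this class would look identical to code written against `dict`, which is the point: the notation says nothing about the cost.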

Speaking of IDEs, there is no standard IDE for Python, but it’s okay because most other languages do not have a standard IDE either. In particular, C people use whatever they are used to.

When it comes to common IDEs like PyCharm or Spyder (which I prefer for some reason; I think it is reasonably simple) or Emacs, they do support documentation.

But as I said previously, most of the time, the performance features would be described somewhere in the middle of the documentation, not at the beginning. Because what is required from a dictionary is to provide a mapping from keys to values. This is the functional feature of a dictionary. Performance would be a nonfunctional feature.

Many developers, even if they have never used dictionaries before, would read the first paragraph and say, “Okay, I can map my keys to my values. That’s exactly what I need. What would be a problem?” And then they just go and implement.

If they use dictionaries instead of lists, it’s usually okay. There is some space penalty. Dictionaries need more space than lists. But if they go in the opposite direction, use lists as dictionaries, the performance penalty would definitely be there.

They may not even discover it at the time of development. Because if your dataset is…I don’t know, a couple hundred items…there will be no substantial difference between linear and constant access time.

But if in production code they’re going to deal with big data, then the difference becomes a matter of survival. But that would be much, much later in the pipeline when it may be already too late to change the code.

You touched on this already, very briefly, but let’s circle back to it, which is, when you compare Python to, for example, Rust. Rust is a very type-safe, fast, and deployable language. How do you deal with the fact that Python doesn’t seem to have the same sort of feedback for type safety, for error reporting, and so forth?

Rust is essentially the new C, so it has the same benefits as C, and among them is type safety. I mean, type safety in Rust is superior. C is actually a language where you can cheat the type-checking system. But in Rust, no way. And if you manage to cheat it anyway, you’re gonna be responsible for that.

Python was not designed as a type-safe language, so anything can become anything as necessary. It is not as bad as Visual Basic or PHP, where the concept of type is almost virtual. But it is still very loose on types.

Python uses the optimistic defense mechanism, which basically says, “We’re not going to encounter a type error most of the time, because type errors are introduced by bad programmers, and we are good programmers. So we’re not going to make any type errors. But if there is a type error, we should anticipate it and use exception handling.”

Exception handling is a universal tool; most junior developers do not know how to use it correctly. But at least it gives you some degree of control, and at least you can terminate your program without all these intimidating error messages that reveal that you are an inferior programmer.
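The optimistic style can be sketched as follows. The function name and the sample data are made up for illustration. The code assumes every item supports `len()` and handles the `TypeError` only when that assumption turns out to be wrong at run time.

```python
def total_length(items):
    """Sum the lengths of items, setting aside those that have no length."""
    total = 0
    skipped = []
    for item in items:
        try:
            total += len(item)    # optimistic: assume item supports len()
        except TypeError:
            skipped.append(item)  # the type error surfaced only at run time
    return total, skipped


total, skipped = total_length(["abc", [1, 2], 42, None, (1,)])
print(total, skipped)
```

Nothing flags the integer or the `None` in the list before the program runs; the mistake is caught where it manifests, not where it was made.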

On the other hand, Rust and C++, and C to some extent, use the pessimistic approach. They imply that there will be type errors. Because there are very few good programmers, and even they make mistakes.

And the purpose of the type system is to eliminate potential type errors at the moment they’re made, not at the moment they manifest themselves. And the time between making a mistake and seeing the manifestation of that mistake is often days and weeks and even years.

So these are just two different paradigms that are seen in computer science time and again. An optimistic approach, where you assume that errors don’t happen, and if they happen, you handle them. And the pessimistic approach, which puts a great deal of effort into eliminating errors before they are made. So Python is an optimistic language, and Rust is a pessimistic language.

There are not too many situations, fortunately, in Python when type errors make a huge difference or go unnoticed. If an error is noticed, it will sooner or later be fixed, so it is not such a big deal. Unlike C, for example, where type errors may happen and may go unnoticed for a very long time. And they may manifest themselves when your software is on the way to Mars or Jupiter, when it’s too late to fix anything.

So not too long ago, Python got Async IO. How has that impacted its use?

It made it possible to use it in systems that have to take care of many activities at the same time. I would say that Async IO had to be introduced because of a bad design choice that was made twenty years ago or so.

All data structures in Python are global. And that means that when you modify a data structure, this modification is visible to all threads in Python, in the Python program at the same time.

So the decision was made to lock this data structure with a global interpreter lock (GIL) such that any other access to the data structure while it is being accessed is delayed or prohibited. This is a very harsh approach.

In other languages that support or are expected to support multithreading, global data structures are protected by the application, by the program itself—in other words, by the programmer.

If I want to modify a global variable, it is my responsibility as a programmer to make sure that no other thread modifies this variable at the same time. Actually, I don’t even want any thread to read this variable, because its internal state may be inconsistent.
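Programmer-managed protection of shared state can be sketched with an explicit lock. The example is written in Python only to keep one language throughout; the same pattern applies in Java or C++. The names `counter`, `counter_lock`, and `add_many` are made up for this illustration.

```python
import threading

counter = 0                      # shared (global) state
counter_lock = threading.Lock()  # the programmer's own lock, not the GIL


def add_many(times):
    """Increment the shared counter `times` times, one thread at a time."""
    global counter
    for _ in range(times):
        with counter_lock:       # only one thread may update at a time
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

The lock covers exactly one variable, so threads working on unrelated data never wait for it.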

In Python’s case, the approach is pessimistic rather than optimistic. Basically, the assumption was that if there is a global variable, someone will try to modify it in an inconsistent way. So why don’t we have a global lock that gets locked when any thread tries to access any global variable?

Because of this global lock feature, multithreading in Python is actually impossible. And because of the global lock, you cannot start multiple tasks in Python at will and run them concurrently, which is what threads are for. When one thread does anything to any global variable, all other threads have to wait until it releases the lock, even if they do not intend to modify the same global variable, because there’s only one lock, and that lock is global.

This makes multithreading and concurrent programming in Python virtually impossible if you have multiple activities that have to be handled at the same time. Take robotics, for example. You have a robot. You have to do some computer vision to recognize the environment. You have to have some actuators that control the motors, the hands, the legs of the robot. You may need to control the lidar, again, to monitor the environment. Sound recognition, speech recognition. And all that has to happen more or less at the same time.

If I write a program in Java or in C++, I simply start multiple threads, and each thread is responsible for an activity. In Python, multiple threads are not useful. So what can I do instead? I can use Async IO. I can monitor the conditions, and these conditions would allow me to activate the parts of my program that have to respond to them.

So if I hear something, I listen to it. If I see something, I take a picture and deal with the processing of that picture. If I need to move to the left, I put everything else aside and move to the left. This is a limited kind of multiprocessing, but it is better than no multiprocessing at all.
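A minimal sketch of this style with the standard `asyncio` module follows. The two coroutines are hypothetical stand-ins for a sound sensor and a camera, and `asyncio.sleep` simulates waiting for input. Both run on a single thread and hand control back to the event loop at every `await`.

```python
import asyncio


async def listen(log):
    """Hypothetical sound sensor: reports three times."""
    for i in range(3):
        await asyncio.sleep(0.01)  # yield control while waiting for input
        log.append(f"heard sound {i}")


async def watch(log):
    """Hypothetical camera: reports three times."""
    for i in range(3):
        await asyncio.sleep(0.01)
        log.append(f"saw frame {i}")


async def main():
    log = []
    # Run both coroutines on one thread; they interleave at each await.
    await asyncio.gather(listen(log), watch(log))
    return log


events = asyncio.run(main())
print(events)
```

Because only one coroutine runs at any instant, a handler that computes for a long time without awaiting would stall all the others.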

The good news is that there is a strong momentum in the Python community towards eliminating that global lock. This would require essentially releasing a huge major version of Python. Maybe not even 4.0, but something beyond that. And it may introduce some backward incompatibility. I mean, Python 3 was already a huge disruption. It caused a lot of pain in the software development community. I’m not sure if Python developers would appreciate another fundamental change to version 4.0 that would not have the global lock and allow true multithreading.

So I do not know where it is going, but Async IO is definitely a good temporary solution. It’s a patch, in a sense, but it is a well-designed patch. Obviously better designed than the feature that it’s trying to hide.

What thoughts do you have about readability and the decision to use indentation and whitespace as delimiters?

You hit me where it hurts. Yeah. Well, actually, I know the reason for that decision. Most programmers still use indentation to make the code more readable. And they use indentation together with braces or whatever is the delimiter in the language. Most style guides insist that a well-written program should be indented properly, not just indented, but according to a particular style, and there are different styles: do you use two spaces, or four spaces, or eight spaces for a tab? And actually any combination of those is allowed as long as the program looks well indented.

Van Rossum’s idea was to eliminate some redundancy. In fact, that was one of his principles when he worked on Python, to eliminate redundancy. It is funny how the very same Python now provides so many ways of doing the same things using different means, which kind of defeats Occam’s razor. Instead of eliminating redundancy, it turned to using a lot of redundancy.

But back to indentation…clearly, a combination of indentation and bracketing is redundant. It should be either one or the other, and since a program without indentation does not look good, does not look tidy, Van Rossum’s decision was to eliminate bracing. He said, I mean, “Who needs these silly braces if you can read the program and understand everything without braces just by looking at the amount of whitespace?”

And this caused two problems. The first problem is that you cannot really tell by looking at your program whether it is one tab or eight spaces. But from the Python interpreter’s point of view, eight spaces are eight characters and a tab is one character. So if you mix and match spaces and tabs on different lines, visually you get an impression of a perfectly indented program, but from the Python interpreter’s point of view, it is not.

This naturally causes a lot of strange error messages about the amount of indentation not being as expected. They’re really hard to decode because people trust their eyes, and if they see four nicely aligned lines, they refuse to believe that the amount of indentation differs among those four lines. If I were a Python interpreter developer, I would attempt to convert tabs into spaces using some convention, like four spaces per tab or eight spaces per tab.
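The effect can be reproduced without an editor by compiling a small source string. This is a sketch under Python 3; the string `source` is made up so that its two body lines look aligned when a tab is displayed as eight columns.

```python
# Two body lines that look equally indented when a tab shows as eight columns.
source = (
    "if True:\n"
    "        x = 1\n"  # eight spaces
    "\ty = 2\n"        # one tab
)

error = None
try:
    compile(source, "<demo>", "exec")
except TabError as exc:
    error = exc  # Python 3 rejects the inconsistent mix

print(type(error).__name__, "-", error.msg if error else "no error")
```

`TabError` is a subclass of `IndentationError`, so the report points at indentation even though the lines look identical on screen.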

The same feature would have to be integrated into development environments because a tab is not necessarily eight spaces; it would be four spaces or two or whatever the developer wants. So this is the first problem; the second problem is more subtle.

Let’s take for example a loop, a for loop or a while loop, whatever. If I want to have a loop with no body, an empty loop, just the header, then in C or in C++ or in Java I can use a pair of braces to provide an empty loop body. The body is still required by the rules of the language, but since I don’t want any body, I can put a pair of matching braces. And the compiler would know that it’s an empty body.

In Python, since I do not have braces, the only natural way for me to introduce an empty body would be to have a line with indentation, but no further content. But that kind of line is not easy to see because an empty line and a line consisting of spaces usually look the same. So Python developers had to introduce a new keyword, pass, which simply means something must be here by the rules of the language, but since the programmer doesn’t want to put anything here, we’ll put a placeholder, a special keyword that does nothing.

So the purpose of that keyword is often obscure for junior developers, especially for students, who start using it here and there because they believe it is a part of some ancient, obscure ritual, which it is not.
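Two places where the placeholder is genuinely required are sketched below. The class and function names are invented for illustration. In both cases the language demands a body and there is nothing to put in it.

```python
class NotReadyYet(Exception):
    pass                  # a class body is required; this one is empty


def drain(queue):
    """Pop items until the list is empty; the loop needs no body."""
    while queue and queue.pop() is not None:
        pass              # placeholder: all the work is in the condition
    return queue


leftover = drain([1, 2, 3])
print(leftover)
```

Outside such cases, a `pass` next to real statements does nothing and can simply be deleted.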

The second implication of having indentation without braces is actually worse than the first one, because the first one causes syntax errors, which can be easily fixed. But the second one causes semantic errors, which are much harder to detect and fix.
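One way in which indentation alone can change the meaning of a program is sketched below. It is offered as an illustration, not as the speaker's own example, and the function names are made up. The two functions differ only in the indentation of one line; both are syntactically valid, so the interpreter reports nothing.

```python
def running_totals(values):
    totals = []
    total = 0
    for v in values:
        total += v
        totals.append(total)  # inside the loop: runs for every value
    return totals


def running_totals_misindented(values):
    totals = []
    total = 0
    for v in values:
        total += v
    totals.append(total)      # dedented by accident: runs once, after the loop
    return totals


intended = running_totals([1, 2, 3])
accidental = running_totals_misindented([1, 2, 3])
print(intended, accidental)
```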

You hinted at another major release of Python, so could you talk about Python 4 and why people believe that it may not actually happen?

I do not know much about the plans for Python 4, but as I said, there is a strong resistance towards having a new major version of Python. The transition from 2.7 to 3 was painful, and we still feel its consequences. Every now and then I see a question on Stack Overflow, or homework assignments submitted by students where they use 2.7 syntax and semantics, and programs don’t work for them, which makes them confused, like, “I followed the tutorial. I did everything. Why doesn’t it work?”

Syntax errors actually are not so hard to fix. Unfortunately, there was also a change of semantics, where the same function call behaves differently in 3.x than it did in 2.7. So if Python 4.0 is going to change semantics, then it’s gonna be a no-go, I am afraid. And if it is going to change syntax, about that I’m not sure. The syntax of Python is being changed gradually, one step at a time; for example, the walrus assignment operator was introduced recently.
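A few well-known instances of both kinds of change, written for Python 3.8 or later, are sketched here. The variable names are arbitrary; the Python 2 behavior is described in the comments only, since this code is meant to run under Python 3.

```python
# Under Python 3, "/" is true division; Python 2 truncated for two integers.
quotient = 7 / 2   # 3.5 in Python 3 (Python 2 gave 3)
floored = 7 // 2   # 3 in both versions

# In Python 3, map() returns a lazy iterator; Python 2 returned a list.
squares = map(lambda n: n * n, [1, 2, 3])
is_list = isinstance(squares, list)  # False in Python 3
squares_list = list(squares)         # materialize explicitly when needed

# The walrus operator (Python 3.8+) assigns inside an expression.
data = [4, 8, 15, 16, 23, 42]
if (n := len(data)) > 5:
    summary = f"{n} items"

print(quotient, floored, is_list, squares_list, summary)
```

The first two are the troublesome kind: the same source text is valid in both versions but produces different results.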

I do not remember off the top of my head what some other new features were, but yeah. It is normal to make some minor changes, but they wouldn’t qualify for version 4.0. Version 4.0 has to be something really different, and with so much Python code circulating around, if there is any kind of backward incompatibility, I see it as a big potential disaster. Perhaps if it becomes different to the same extent as 3.x was different from 2.7, it should be given a different name, not Python, but some other name to clearly distinguish it from what we have now.

Perhaps they could name it after some smaller snake, I suppose. Python is a very large snake, so maybe they could name it the Garter language.

Yeah, garter snake. Yeah. Actually, I had the same idea. Since it’s the most popular snake in New England…I think it’s actually the official snake of Massachusetts, so I wouldn’t mind that. (laughter) Garter. (laughs) Yeah.

If people want to follow you and see what you’re up to and what you’re doing next, where can they find information about you?

I think they should go to LinkedIn. I am active on LinkedIn, I have many followers and I would appreciate more followers. I use LinkedIn in a very professional way, so it is basically about data science, Python, and retrocomputing. There is sometimes education as well, computer science education in the first place. And I always welcome new followers.

What about Twitter?

Twitter…I try to use it, but for some reason it does not like me, in a sense that I got twenty new followers in half a year, and that’s not how people become famous on Twitter, I believe.

No matter what I do, new people don’t follow me. So there is something about me that Twitter doesn’t like. Maybe it’s my accent, I don’t know. (laughs)


Thank you so much for taking the time to talk to us, Dmitry. This has been absolutely fascinating!

Now it’s over to everyone else. This is an AMA where anyone can ask Dmitry questions and we’ll randomly pick one lucky person to win one of his books!

For those who can’t wait, don’t forget you can get 35% off any of Dmitry’s books with the coupon code devtalk.com.


It was a pleasure!


Brilliant interview @PragmaticErica and @aqsaqal!

Not only did I learn more about Dmitry, but I also got a really good overview of the state of Python as well as some of its history.

What’s your second favourite? Or if Python wasn’t around, which language do you think you would opt for? (From modern languages too!)

That’s interesting! I always thought Java was meant to be the easiest of the three (I haven’t used any of them myself).


I’m glad you were asked that! When I started learning programming, Python and Ruby were on my shortlist… and the whitespace-dependent syntax was definitely one of the things that I wasn’t keen on. When teaching Python to students, do lots of people dislike it? Or by that time have they already had enough experience to have gotten used to it?


C is always an excellent choice if you are comfortable with pointers and low-level memory management. You simply cannot beat its performance. Rust is a great “next C” thing. And for distributed systems, Erlang rules. And no, I haven’t seen a student who dislikes Python (but I know many mature programmers who do).


Great interview. I would like to ask for Dmitry’s view on the lack of a type system/strong typing in Python. I feel it was a big miss, and it restricts reasoning about and formal verification of Python programs. I think the Python team realized this later in 3.x and are now introducing more of it in every iteration of 3.x and more in 4. (In my opinion, they really are backpedaling, as they were in denial for a long time about the need for it.)

cheers,
bammi
