Python stories, July 2018

I’m running @pythonetc, a Telegram channel about Python and programming in general. Here are the best posts of July 2018.

Regular languages

A regular language is a formal language that can be recognized by a finite-state machine (FSM). Simply put, that means that to process text character by character, you only need to remember the current state, and the number of such states is finite.

A beautiful and simple example is a machine that checks whether an input is a simple number like -3, 2.2 or 001. The picture at the beginning of the article is an FSM diagram. Double circles mark accept states: the states in which the machine is allowed to stop.

The machine starts at ①, possibly matches a minus sign, then processes as many digits as required at ③. After that, it may match a dot (③ → ④), which must be followed by at least one digit (④ → ⑤), and possibly more (⑤ → ⑤).
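The machine described above can be sketched as a transition table in Python. This is only an illustration: the names TRANSITIONS, ACCEPT_STATES, classify and is_simple_number are made up for this example, and state 2 ("minus seen") as well as the choice of 3 and 5 as accept states are assumptions inferred from the description, since the diagram itself is not reproduced here.

```python
# Transition table: (state, character class) -> next state.
# State numbers follow the description above; state 2 ("minus seen")
# is an assumption, because the text does not name it explicitly.
TRANSITIONS = {
    (1, "minus"): 2,
    (1, "digit"): 3,
    (2, "digit"): 3,
    (3, "digit"): 3,
    (3, "dot"): 4,
    (4, "digit"): 5,
    (5, "digit"): 5,
}
# Assumed accept states: "digits seen" and "fractional digits seen".
ACCEPT_STATES = {3, 5}


def classify(char):
    """Map a single character to the symbol class used in the table."""
    if char in "0123456789":
        return "digit"
    if char == "-":
        return "minus"
    if char == ".":
        return "dot"
    return "other"


def is_simple_number(text):
    """Return True if the machine ends in an accept state."""
    state = 1
    for char in text:
        state = TRANSITIONS.get((state, classify(char)))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPT_STATES
```

The only thing the loop remembers between characters is state, which is exactly what makes this a finite-state machine.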

The classic example of a non-regular language is a family of strings like:

a-b
aaa-bbb
aaaaa-bbbbb

Formally, we need a line that contains N occurrences of a, then -, then N occurrences of b, where N is an integer greater than zero. You can't do it with a finite machine, because you have to remember how many a characters you have encountered, and that requires an infinite number of states.
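To see what the finite machine is missing, here is a minimal sketch of a checker for this language. The function name matches_a_n_b_n is hypothetical; the point is the count variable, which can grow without bound and therefore has no counterpart in a fixed set of states.

```python
def matches_a_n_b_n(line):
    """Check for N 'a' chars, then '-', then N 'b' chars, with N > 0."""
    count = 0      # unbounded counter: this is what an FSM lacks
    position = 0
    # Phase 1: count the leading 'a' characters.
    while position < len(line) and line[position] == "a":
        count += 1
        position += 1
    # Exactly one '-' must follow at least one 'a'.
    if count == 0 or position >= len(line) or line[position] != "-":
        return False
    position += 1
    # Phase 2: every 'b' cancels one counted 'a'.
    while position < len(line) and line[position] == "b":
        count -= 1
        position += 1
    # Accept only if the whole line was consumed and the counts agree.
    return position == len(line) and count == 0
```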

Regular expressions can match only regular languages. Remember to check whether the line you are trying to process can be handled by an FSM at all. JSON, XML or even mere arithmetic expressions with nested brackets cannot be.
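For comparison, the simple-number language from the beginning of this section is regular, so the standard re module handles it. The pattern and the helper name below are illustrative choices, not the only way to write it; [0-9] is used instead of \d so that only ASCII digits match.

```python
import re

# Optional minus, one or more digits, optional fraction with 1+ digits.
SIMPLE_NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def looks_like_simple_number(text):
    """Return True if the whole string matches the pattern."""
    return SIMPLE_NUMBER.fullmatch(text) is not None
```

fullmatch is used rather than match so that trailing garbage such as "2.2x" is rejected.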

The funny thing is, a lot of modern regular expression engines are not regular. For example, the third-party regex module for Python supports recursion (which helps with the aaa-bbb problem above).

Dynamic dispatch

When Python executes a method call, say a.f(b, c, d), it should first select the right f function. Due to polymorphism, what is selected depends on the type of a. The process of choosing the method is usually called dynamic dispatch.

Python supports only single-dispatch polymorphism, which means that a single object alone (a in the example) affects the method selection. Some other languages, however, may also consider the types of b, c and d. This mechanism is called multiple dispatch. C# is a notable example of a language that supports that technique.

However, multiple dispatch can be emulated via single dispatch. The visitor design pattern was created exactly for this. Visitor essentially just uses single dispatch twice to imitate double dispatch.

Note that the ability to overload methods (like in Java and C++) is not the same as multiple dispatch. Dynamic dispatch happens at runtime, while overloading is resolved at compile time.
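A minimal sketch of the idea, with made-up class names (Circle, Square, AreaName, PerimeterName): the call shape.accept(visitor) is dispatched on the type of the shape, and the visit_* call inside it is dispatched on the type of the visitor, so the result depends on both types.

```python
class Circle:
    def accept(self, visitor):
        # First dispatch: this method was chosen by the type of the shape.
        return visitor.visit_circle(self)


class Square:
    def accept(self, visitor):
        return visitor.visit_square(self)


class AreaName:
    """A visitor; the second dispatch is chosen by the visitor's type."""

    def visit_circle(self, shape):
        return "area of a circle"

    def visit_square(self, shape):
        return "area of a square"


class PerimeterName:
    """Another visitor with the same interface."""

    def visit_circle(self, shape):
        return "perimeter of a circle"

    def visit_square(self, shape):
        return "perimeter of a square"


# Each (shape, visitor) pair can produce a different result.
results = [
    shape.accept(visitor)
    for shape in (Circle(), Square())
    for visitor in (AreaName(), PerimeterName())
]
```

The price of the emulation is visible here: every shape class needs an accept method, and every visitor needs one method per shape type.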

These are some code examples to understand the topic better: Python visitor example, Java overloading doesn’t work as multiple dispatch, C# multiple dispatch.

Built-in names

In Python, you can easily modify all standard variables that are available in the global namespace:

>>> print = 42
>>> print(42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not callable

That may be helpful if your module defines some functions that have the same name as built-in ones. That also happens if you practice metaprogramming and you accept an arbitrary string as an identifier.

However, even if you shadow some built-in names, you still may want to have access to things they initially referred to. The builtins module exists exactly for that:

>>> import builtins
>>> print = 42
>>> builtins.print(1)
1

The __builtins__ variable is also available in most modules. There is a catch though. First, this is a CPython implementation detail and usually should not be used at all. Second, __builtins__ might refer to either builtins or builtins.__dict__, depending on how exactly the current module was loaded.

strace

Sometimes software starts to behave weirdly in production. Instead of simply restarting it, you probably want to understand what exactly is happening so you can fix it later.

The obvious way to do that is to analyze what the program does and try to guess which piece of code is executing. Proper logging certainly makes that task easier, but your application's logs may not be verbose enough, either by design or because a high logging level is set in the configuration.

In that case, strace may be quite helpful. It's a Unix utility that traces system calls for you. You can start a program under it (strace python script.py), but attaching to an already running application is usually more suitable: strace -p PID.

$ cat test.py
with open('/tmp/test', 'w') as f:
    f.write('test')
$ strace python test.py 2>&1 | grep open | tail -n 1
open("/tmp/test", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3

Each line in the trace contains the system call name, followed by its arguments in parentheses and its return value. Because some arguments are used to return a result from the system call rather than to pass data into it, strace may pause in the middle of a line until the system call finishes.

In this example, the output is interrupted until someone writes to STDIN:

$ strace python -c 'input()'
read(0,

Tuple literals

One of the most inconsistent parts of the Python syntax is tuple literals.

Basically, to create a tuple you just write values separated by commas: 1, 2, 3. OK, so far, so good. What about a tuple containing only one element? You just add a trailing comma to the only value: 1,. Well, that's somewhat ugly and error-prone, but it makes sense.

What about an empty tuple? Is it a bare ,? No, it's (). Do parentheses create a tuple as well as commas do? No, they don't: (4) is not a tuple, it's just 4.

In : a = [
...:     (1, 2, 3),
...:     (1, 2),
...:     (1),
...:     (),
...: ]

In : [type(x) for x in a]
Out: [tuple, tuple, int, tuple]

To make things more obscure, tuple literals often require additional parentheses. If you want a tuple to be the only argument of a function, then f(1, 2, 3) doesn't work for an obvious reason: the commas separate arguments. You need f((1, 2, 3)) instead.
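A small sketch makes the difference visible. The helper count_arguments is a hypothetical function written for this example; it simply reports how many positional arguments it received.

```python
def count_arguments(*args):
    """Return how many positional arguments were received."""
    return len(args)


three_arguments = count_arguments(1, 2, 3)   # commas separate arguments
one_tuple = count_arguments((1, 2, 3))       # one argument: a tuple
one_int = count_arguments((1))               # one argument: the int 1
one_element_tuple = count_arguments((1,))    # one argument: the tuple (1,)
empty_tuple = count_arguments(())            # one argument: the empty tuple
```

Only the first call passes three arguments; in every other call the extra parentheses turn the content into a single value.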