Wednesday, August 11, 2010

Using Python Generators

Python 2.2 saw the introduction of the yield statement. This statement returns a generator from a function or method. Returning a generator is like returning a list. You can iterate through the result in a loop. The difference between returning a generator and a list is the flow of control. When a method yields data, it also yields control flow temporarily. The method will resume control once the generator's next() method is called. When a method returns data, control flow is permanently returned. Generators aren't a replacement for returning data. They are a tool for Python developers to return a data stream.

Let's create a simple example system to get a better feel for what generators are all about. We'll design a book catalog that searches book files. The format of each book file is a serialized Python dictionary. The book fields, or dictionary keys, are the book title, synopsis, and cover image. Using generators where a simple list is sufficient can be avoided. During it's lifetime, a generator method goes through several states. These states help us envision generator properties, such as control flow and the overall responsiveness of the program.

Before we embark on our book catalog design, we should make an analogy. A Unix pipe is a data channel between two processes. The first process writes data to the pipe while the second process reads data from the pipe. Python generators are similar. Instead of processes we have a method that writes data to the generator and an iterator that reads data from the generator. The iterator that reads data from the pipe is a for loop. Each loop iteration is executed when data is made available by the generator. Generators, used as a loop operand, behave the same as lists or tuples. The only difference is the flow of control.

Our book catalog application should have an optional sorting component to sort search results. For this, we will create a Sorter class with a sort() method. This method accepts a list of books and returns a sorted list. Python lists have a sort() method that will sort the elements in the list, so we'll use it instead of reinventing the wheel. Once the list has been sorted, should Sorter.sort() return a generator? Before you decide, ask yourself if the return data is available. Does the method use a pipe and filter approach? Are we iterating over some data set, manipulating each element? Does the method produce a stream of data, or a monotonous piece of data? Our Sorter.sort() method isn't any of these things and will simply return the sorted dictionary.

The next book catalog component is the most important piece of the puzzle. Searching for books. A Filter class with a filter() method will handle this. This method accepts book iterator and filter string parameters and returns a book iterator as the filter result. Returning a generator from this method is a good idea because the return data isn't immediately available. This is because we're iterating over a set of books. We filter each book by checking if the supplied string exists in the title or description. If so, the book is yielded. This method produces streaming behavior because other objects that invoke the Filter.filter() method can begin reading from the returned generator before all data is available.

How does all this interleaving data and control flow work? Let's take a look at the states a generator method goes through during it's lifetime. The method starts in a running state. This is where the method computes data to send to the generator. Once data is ready to yield, the method goes into a yielding state. This state isn't active for very long. It is only active while the data is being written to the generator. Finally, the method goes into a frozen state. The method is frozen so that it may resume its flow of control once data has been read from the generator. The method will then enter the running state again.

The final component of our book catalog is a FileReader class. This class has a read() method that loads all books files and creates the corresponding book dictionaries. Each dictionary is then sent to the generator returned by this method. Now that we have all our book catalog components, the search work flow is easy. The method yields a book dictionary. The Filter.filter() method searches the dictionary for "MyBook". A match is found and the book is sent to the generator. The user interface, which invoked Filter.filter() displays "MyBook" in the search results. resumes with the next file in the directory. This chain of generators produces a stream of data and a responsive search. Think about a catalog with 5000 books. If one of the first 100 books matches the criteria, the user sees this book before the search has completed.

Designing a simple book catalog program has shown us that generators add overhead to methods when they aren't used to create a data stream. If your method operates on individual elements of an input data set and produces another set, use a generator. This is a pipe and filter approach where the generator is the pipe and the method logic is the filter. Our example illustrates the effect data streams can have on the responsiveness of some behaviors, like searching for books. Adding multiple threads of control to an application can also increase the responsiveness but using generators is a more intuitive, data-centric approach.