Boduch's Blog: August 2010

In object-oriented software development, there are classes, and there are methods. Methods are like functions in a functional programming language with an important difference - they don't exist in isolation. Methods belong to an object. Object-oriented programming methodology encapsulates structure and behavior into a single descriptive unit, the class. Classes describe what structure and what behavior an object may have. They are the object blueprints.

Encapsulating the structure and behavior of an object is how software design scales in size and complexity. A developer that uses an object, doesn't know how the behavior of that object is implemented. Nor do they care. They know everything beneath the surface works as expected. Encapsulation isn't limited to objects. A developer using a function only cares that it works as expected, not how the job is carried out.

Objects, in addition to hiding method implementation, hide structure. These are the attributes that give an object it's identity. Functions have no such structure to hide. An object represents a concept, something that exists in the real world, or some intangible thing. Attributes that describe objects in a software system don't exist in functions. They suffer from a lack of identity, and that is the key difference between functions and methods.

With everything object oriented programming languages promises, are functional programming languages still relevant? Objects are nothing more than a design tool, built with functions, that aid in designing complex systems. Functions are a concise, direct way to supply input and get output. The very idea of a function is an elegant one. However, humans need a better way to understand what they've built and this is where abstract objects help. Functions don't cause a program to fall apart - how they're used do.

Methods are intended to manipulate, or retrieve the values of object attributes. If the attribute values of an object embody the state of the object, methods that change attribute values also change the state of the object. Methods are more likely to have side effects than functions. However, this is the intention of methods - to encapsulate the object behavior, including the changing states of the object. Other objects can be passed to methods as parameters. The state of these objects can, in turn, be changed by the method. Does this violate the encapsulation principle? It doesn't because the object being manipulated doesn't care who does it, as long as the public interfaces of the object are used.

What best describes a software object? The methods that realize it's public interface or the attributes that define it's structure? An object represents some concept while an interface represents some behavior, not bound to any particular concept. Imagine I have a class called Object and I ask you what this object represents based on it's run() method. With this information, we can at best determine that it is something that can run but we can't deduce why the object is able to run. A software program can run. An engine, an animal, a person - all things that can run. What if instead of identifying what the object represents by a method, I asked you to identity it by a structural feature - wheel. A object with a wheel attribute has more of an identity than a object with a run() method.

An interface, not necessarily explicit, describes the behavior of an object, not it's structure. By nature, the methods that implement an interface are a behavioral concept. If a given class implements a Runnable interface, which requires a run() method, it can run. The concept of running is tightly coupled with your class. Interfaces, and their corresponding methods, always need to be implemented, even if that implementation is inherited.

Lets take a step back and compare methods and functions again. Functions are pure behavior with input and output. They don't belong to an object so they can't alter it's state. Conversely, methods belong to an object and alter it's state. As object-based systems grow, so do the number of methods. The side effects of these methods also increase in size.

How can we make methods more like functions and maintain an abstract object design that allows the size of the system to scale? Objects are good at representing some concept with structure. The provided interfaces of those objects, the methods it implements, are the challenge. Lets try making methods their own concepts.

Classes define a special type of method - a constructor. This method is called when the object is first created to do any setup the object may need. An alternative perspective: a constructor is the object's owned behavior. It cannot be called externally once the object exists. A constructor is the essence of the object - it is responsible for making the object what it is.

The constructor method is intended to initialize the computing resources the object needs. Languages like C++ use it to allocate memory. Dynamic languages, like Python, don't share this need. Constructors in these languages are used to initialize the attributes of the object because they aren't statically defined. The constructor may call some other methods to perform some validation, or to read some data from the network or hard drive. Other objects might even be created as the result of a constructor being called. Initialization is the primary focus of the constructor method.

If a class represented nothing but behavior, if it had no structure, the constructor of that class is the behavior. The constructor parameters would be the parameters to the behavior. For example, if Runnable were a class instead of an interface, the constructor would run instead of calling methods defined by the interface. Picture the constructor as a function. Its input the parameters supplied to the constructor. Its output the changes in state to the objects supplied as input. Lets call this type of class a behavioral class.

What about functions that return values? Can behavioral classes return values? Doing so is counter-intuitive because the constructor returns the instance of the behavioral class it self. Instead, the return value is placed in a result attribute of the object. There is a problem with this approach, however. If you were to place the result of creating the behavioral class in an attribute, the object must still exist when the result is used.

On the other hand, traditional methods bound to their objects, need to stick around for the entire life of the object. From an identity perspective, it doesn't make much sense. Once the object is dead, the behavior of that object doesn't really exist either. It might still exist in the blueprint, the class of that object, but that may not necessarily be optimal. Behavior is a concept in its own right.

Methods can be inherited from parent classes. The specialized class has all the generic behavior of the parent. Would a behavioral class negate this aspect of inheritance? Not exactly. If behavior were implemented as a class, it can also inherit general behavior. We have to think differently about how the generalization would work. Traditionally, inherited methods are called in a polymorphic context. That is, they are called when an object that supports the method in question is used. If a method is a class, similar to a function, the constructor needs to call the parent constructor, usually sharing the same signature. Some languages take care of the parent constructor implicitly, like C++.

So can methods as classes, or concepts, actually work? Possibly. Method objects would take care of the behavior identity problem. This approach addresses the notion that behavior is a separate idea from that of object structure. They are similar to functions in that they have no real structure. Methods as concepts, or behavioral classes, can be used in conjunction with existing objects and methods. At least in an experimental sense.

Python 2.2 saw the introduction of the yield statement. This statement returns a generator from a function or method. Returning a generator is like returning a list. You can iterate through the result in a loop. The difference between returning a generator and a list is the flow of control. When a method yields data, it also yields control flow temporarily. The method will resume control once the generator's next() method is called. When a method returns data, control flow is permanently returned. Generators aren't a replacement for returning data. They are a tool for Python developers to return a data stream.

Let's create a simple example system to get a better feel for what generators are all about. We'll design a book catalog that searches book files. The format of each book file is a serialized Python dictionary. The book fields, or dictionary keys, are the book title, synopsis, and cover image. Using generators where a simple list is sufficient can be avoided. During it's lifetime, a generator method goes through several states. These states help us envision generator properties, such as control flow and the overall responsiveness of the program.

Before we embark on our book catalog design, we should make an analogy. A Unix pipe is a data channel between two processes. The first process writes data to the pipe while the second process reads data from the pipe. Python generators are similar. Instead of processes we have a method that writes data to the generator and an iterator that reads data from the generator. The iterator that reads data from the pipe is a for loop. Each loop iteration is executed when data is made available by the generator. Generators, used as a loop operand, behave the same as lists or tuples. The only difference is the flow of control.

Our book catalog application should have an optional sorting component to sort search results. For this, we will create a Sorter class with a sort() method. This method accepts a list of books and returns a sorted list. Python lists have a sort() method that will sort the elements in the list, so we'll use it instead of reinventing the wheel. Once the list has been sorted, should Sorter.sort() return a generator? Before you decide, ask yourself if the return data is available. Does the method use a pipe and filter approach? Are we iterating over some data set, manipulating each element? Does the method produce a stream of data, or a monotonous piece of data? Our Sorter.sort() method isn't any of these things and will simply return the sorted dictionary.

The next book catalog component is the most important piece of the puzzle. Searching for books. A Filter class with a filter() method will handle this. This method accepts book iterator and filter string parameters and returns a book iterator as the filter result. Returning a generator from this method is a good idea because the return data isn't immediately available. This is because we're iterating over a set of books. We filter each book by checking if the supplied string exists in the title or description. If so, the book is yielded. This method produces streaming behavior because other objects that invoke the Filter.filter() method can begin reading from the returned generator before all data is available.

How does all this interleaving data and control flow work? Let's take a look at the states a generator method goes through during it's lifetime. The method starts in a running state. This is where the method computes data to send to the generator. Once data is ready to yield, the method goes into a yielding state. This state isn't active for very long. It is only active while the data is being written to the generator. Finally, the method goes into a frozen state. The method is frozen so that it may resume its flow of control once data has been read from the generator. The method will then enter the running state again.

The final component of our book catalog is a FileReader class. This class has a read() method that loads all books files and creates the corresponding book dictionaries. Each dictionary is then sent to the generator returned by this method. Now that we have all our book catalog components, the search work flow is easy. The FileReader.read() method yields a book dictionary. The Filter.filter() method searches the dictionary for "MyBook". A match is found and the book is sent to the generator. The user interface, which invoked Filter.filter() displays "MyBook" in the search results. FileReader.read() resumes with the next file in the directory. This chain of generators produces a stream of data and a responsive search. Think about a catalog with 5000 books. If one of the first 100 books matches the criteria, the user sees this book before the search has completed.

Designing a simple book catalog program has shown us that generators add overhead to methods when they aren't used to create a data stream. If your method operates on individual elements of an input data set and produces another set, use a generator. This is a pipe and filter approach where the generator is the pipe and the method logic is the filter. Our example illustrates the effect data streams can have on the responsiveness of some behaviors, like searching for books. Adding multiple threads of control to an application can also increase the responsiveness but using generators is a more intuitive, data-centric approach.

Boduch's Blog

Tuesday, August 31, 2010

Methods As Concepts

Wednesday, August 11, 2010

Using Python Generators