Friday, April 29, 2011

Statistical Objects

Software is really good at keeping statistical records. We can write code that stores raw data we want to keep track of. We can then display this data in a user-friendly way, maybe as a chart of some kind. In addition, the stored raw data can be mined for trends and anomalies. Imagine trying to function in a modern scientific discipline without this capability - software that aids in statistical analysis. It would be nearly impossible. There is simply too much data in the world for us to process without tools that extract meaning for us. This same complexity phenomenon is prevalent in modern software systems. External tools will monitor the externally-visible performance characteristics of our running system for us. Inside the code, however, is a completely different ball-game. We really don't know what is happening at a level we can grasp. Logs help - they're good at reporting events that take place, things like simple tasks, errors, and values for debugging. But the traditional style of logging can only take us so far in understanding how our software really behaves in its environment. Object systems should retain this information instead of broadcasting it to a log file and forgetting about it.

Statistics software is different from software statistics. Software statistics are data about the software itself, things like how long it takes to process an event or respond to a request. Running software can track and store data like this in logs. Why do we need this data? The short answer: we need it to gauge characteristics, the otherwise intangible attributes our software exhibits during its lifespan. The system logs, in their chronological format, can answer questions like “when was the last failure?” and “what was the value of the url parameter in the last parse() call?”. Further questions, questions about an aggregate measure of quality or otherwise, require a tool that will parse these logs. Something that will answer questions such as “what is the average response time for the profile view?” or “which class produces the most failures?”
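To make the aggregate kind of question concrete, here is a minimal sketch (Python 3) of a tool that parses log lines and computes an average response time per view. The log format, the view names, and the function name are all assumptions made up for illustration; real logs will look different.

```python
from collections import defaultdict

# Hypothetical log format: "<time> <view> <response time in ms>"
LOG_LINES = [
    "10:34:11 profile 120",
    "10:34:12 search 300",
    "10:34:15 profile 80",
]

def average_response_times(lines):
    """Return the mean response time in milliseconds for each view."""
    totals = defaultdict(lambda: [0, 0])  # view -> [sum of ms, request count]
    for line in lines:
        _time, view, ms = line.split()
        totals[view][0] += int(ms)
        totals[view][1] += 1
    return {view: total / count for view, (total, count) in totals.items()}

averages = average_response_times(LOG_LINES)
print(averages["profile"])  # 100.0
```

No single log line holds the answer here; the average only exists once several events have been accumulated.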

These latter questions need a little more work to produce answers. But first, are runtime characteristics the same thing as application logs?

Well, yes and no. Logs are statements that we're making about an occurrence in time, an event. This is why logs are typically timestamped and what makes them such an effective debugging tool. Something went wrong at 10:34:11? The error logs just before then say something about permission problems. I now make a few adjustments, perhaps even modify the logging a little, and problem solved. Characteristics of a running system, on the other hand, equate to qualities that cannot be measured by a single event. The characteristics of running software change over time. We can say things like “the system was performing well yesterday” or “two hours ago, the disk activity spiked for five minutes”.

Software logs are like a newspaper. Not an individual paper, but the entire publication process. Events take place in the world and each event gets logged. We, as external observers, read about them and draw our own conclusions. Software characteristics are more like the readout on a car dashboard that tells you what the fuel economy has been like over the past several kilometres. This can answer questions such as “how much money will I need this week for gas?”.

It's not as though we cannot log these characteristics to files for external tools to analyze and provide us with meaningful insight. We can, but that doesn't really suit the intent of logging events. Events are one-time occurrences. Characteristics, or traits, are established over time. Our system needs to run and interact with its environment before anything interesting can be accumulated and measured. The question is, how is this done? If we want to measure characteristics of our software, how it behaves in its environment over time, we'll need an external observer to do it for us, to take measurements. External tools can give us performance characteristics or point to trends that cause our software to fail. These things are limiting in some respects because they say nothing about how the objects inside the system interact with one another and the resulting idiosyncrasies.

In a system composed of objects, wouldn't it be nice to know how they interact with one another? That is, store a statistical record of the system's behaviour at both an individual object level and at a class level? This type of information about our software is meta-statistical – stats that only exist during the lifetime of our software. Once it's no longer running, the behavioural data stored about our objects is meaningless because this could change entirely once the software starts up again. If we can't use a report generated by a running system to improve something, say, our code, or development process, or whatever, what value does it have?

For example, suppose we want to know how often an instance of the Road class is created. We might also be interested in how the instance is created – which object was responsible for its instantiation? If I want to find out, I generally have to write some code that will log what I'm looking for and remove it when I'm done. This is typical of the debugging process – make a change that will produce some information that we'd otherwise not be interested in. This is why we remove the debugging code when we're finished with it – it's in the way. Running systems tested as being stable don't need the added overhead of keeping track of stuff like which object has been around the longest or which class has the most instances. These statistics don't seem valuable at first, but we add code to produce them when we need them. When something goes wrong, they certainly come in handy. Maybe we don't need to get rid of that code entirely.
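As a rough illustration of the Road example, the sketch below (Python 3) keeps the instance count and the creating class on the Road class itself rather than in a log. Everything other than the Road name is hypothetical: the created_by parameter, the CityPlanner and Importer classes, and the decision to record only the creator's class name.

```python
from collections import Counter

class Road:
    """A domain class that keeps a statistical record of its own creation."""

    instance_count = 0      # how many Road objects have been created
    creators = Counter()    # creator class name -> number of roads it created

    def __init__(self, created_by=None):
        # `created_by` is a hypothetical parameter: the object asking for the road.
        Road.instance_count += 1
        Road.creators[type(created_by).__name__] += 1

class CityPlanner:
    def build_road(self):
        return Road(created_by=self)

class Importer:
    def load_road(self):
        return Road(created_by=self)

planner = CityPlanner()
roads = [planner.build_road() for _ in range(3)]
roads.append(Importer().load_road())

print(Road.instance_count)            # 4
print(Road.creators.most_common(1))   # [('CityPlanner', 3)]
```

The record lives as long as the process does, which is the point: it can be queried at any time without adding and removing debug code.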

Maybe we want to use statistical objects in our deployed, production systems after all. Maybe this will prevent us from having to piece together logic that helps us diagnose what went wrong. Logs are also handy in this sense, for figuring out what happened leading up to the failure. Recall that logs are system occurrences, a chronological ordering of stuff that happens. We can log as much or as little as we please about things that take place while our software is running. But too much information in the log files is just as useless as not having enough information to begin with.

The trouble with storing statistical data about our running system – data about software objects in their environment – is the overhead. If overhead weren't an issue, we probably wouldn't bother removing our debug code in the first place. In fact, we might build it into the system. Code that stores interesting information. Insightful information. Developers dream of having the capability to query a running system for any characteristic they can think of. Obviously this isn't feasible, let alone possible. To store everything we want to know about our running system would be an exercise in futility. It simply cannot be done, even if we had unlimited memory. The complexities involved are too great.

So what can we do with statistical software objects that speak to meta properties of the system? Properties that will guide us in maintenance or further modifications to the software in the future. It's simple, really. We only store the interesting meta data on our objects. Just as we only log events we feel are relevant, we only keep a record of system properties that are statistically relevant to a better performing system. Perhaps a more stable system or any other unforeseen improvement made as a result of the meta data being made available to us. This way we can be selective in our resource usage. Storing things that don't help us doesn't necessarily make sense, although the uses sometimes won't reveal themselves until it's too late, when you don't have the data you need and have a heroic debugging effort to deal with.

For example, say you're building a painter application, one that allows you to draw simple shapes and move them around. You can keep track of things like how many Circle and Rectangle objects are created. These are domain statistics, however, because this is what the application does. This information is useful to the user, potentially, but not so much to the developer, or the application itself. It is more a characteristic of the user than the software. But what if we knew how each shape was instantiated? Did the user select the shape from a menu, or did they use the toolbar button? Perhaps these two user interface components have different factories that create the objects. Which one is more popular? How can we, and by we I mean the software, exploit this information to function better? Using this information is really implementation dependent, if it's used at all. For example, the painter application could implement a cache for each factory that creates shapes. The cache stores a shape prototype that gets copied onto the drawing canvas when created. Armed with meta-statistical information about our system, we can treat one factory preferentially over another, perhaps allocating it a larger cache size.
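One way the factory idea might look in code is sketched below (Python 3). The ShapeFactory class, the allocate_cache function, and the proportional split are assumptions for illustration only; a real painter application would pick its own policy.

```python
class ShapeFactory:
    """Creates shapes and records how often it is used."""

    def __init__(self, name):
        self.name = name
        self.created = 0   # meta-statistic: shapes produced by this factory

    def create(self, kind):
        self.created += 1
        return {"kind": kind, "source": self.name}

def allocate_cache(factories, total_slots):
    """Split `total_slots` cache slots in proportion to factory usage."""
    total_created = sum(f.created for f in factories)
    if total_created == 0:
        # No data yet: share the slots evenly.
        return {f.name: total_slots // len(factories) for f in factories}
    return {f.name: total_slots * f.created // total_created for f in factories}

menu = ShapeFactory("menu")
toolbar = ShapeFactory("toolbar")
for _ in range(2):
    menu.create("circle")
for _ in range(6):
    toolbar.create("rectangle")

sizes = allocate_cache([menu, toolbar], total_slots=16)
print(sizes)  # {'menu': 4, 'toolbar': 12}
```

Here the software itself, not the developer, consumes the statistic: the toolbar factory is used three times as often, so it gets three times the cache.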

The preceding example is still reflective of the domain itself. Sure, the implementation of our software could certainly benefit from having it, but what about problematic scenarios that are independent of the domain? For example, disk latency may be high in one class of objects, while not as high in another class. Again, this does depend on the user and what they're doing, but also on factors external to the software, such as hardware or other software processes competing for resources. Whatever the cause, we give our system a fighting chance to adapt, given sufficient data. Sometimes, however, there really isn't anything that can be done to improve the software during runtime. Sometimes, external factors are simply too limiting, or maybe there is a problem with the design. In either case, the developers can query the system and say “wow, we should definitely be running with more memory available” or “the BoundedBox class works great, except when running in threads”.

Of course, we're assuming we have the physical resources to store all this meta data about our software, data that highlights the running characteristics of it. We might not have the luxury of free memory or maybe writing to the disk frequently is out of the question. In these situations, it might make sense to have the ability to turn off statistical objects. You could run your software with them turned on in an environment that can handle them. When it comes to deploying to its live, production environment, shut off the extraneous stuff that causes unacceptable overhead. More often than not, however, there are more than enough physical resources in today's hardware deployments to handle the extra data and processing power required by statistical objects. If you've got the space, utilize it for better software. The reality is, as our software grows more complex, we'll have no choice but to generate and use this type of information to cope with factors we cannot understand.
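A switch for turning statistical objects off could be as small as the following sketch (Python 3). The Stats holder, the tracked decorator, and the resize method are hypothetical names; only BoundedBox is borrowed from the text above.

```python
import functools
from collections import Counter

class Stats:
    """Process-wide switch and storage for call statistics (hypothetical)."""
    enabled = True
    calls = Counter()   # function name -> number of recorded calls

def tracked(func):
    """Record calls to `func`, but only while statistics are enabled."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if Stats.enabled:
            Stats.calls[func.__name__] += 1
        return func(*args, **kwargs)
    return wrapper

class BoundedBox:
    @tracked
    def resize(self, width, height):
        return width * height

box = BoundedBox()
box.resize(2, 3)
box.resize(4, 5)

Stats.enabled = False       # statistics off: calls still work, nothing is recorded
area = box.resize(6, 7)

print(Stats.calls["resize"], area)  # 2 42
```

With the switch off, the remaining cost is one attribute check per call, which is usually an acceptable price for being able to turn the statistics back on later.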

Monday, April 18, 2011

Pick Your Cloud Model

Cloud computing is a category your application is either a part of, or it isn't. We might even go so far as to call it a paradigm, a way to classify software – your code is either object-oriented or something else, like a functional paradigm. As a model of computing, cloud means a lot of things; it's too broad to describe anything in particular. Does your car fall into the category of “vehicle”? Yes, but that doesn't tell us much about the car. You can't exactly tell someone you travel by means of “vehicle” without getting a laugh or two. You can, however, say something like “I drive a four-wheel drive truck”. This is something the other party can relate to, sparing the technical jargon. Perhaps there aren't any further-refined sub-genres of the cloud just yet. Maybe a good starting point is looking at what we're hoping to accomplish with our software. Maybe then we can better identify auxiliary cloud models, and when it makes sense to use one over the other.

The term cloud computing is notoriously ambiguous. With good reason too: I don't think there is an easy way to define it, as it means different things to different people. Software systems that utilize cloud technology have their own vision of what the cloud is exactly. Rather than trying to write an all-encompassing definition of what cloud means, perhaps the better approach is to identify some of its well-defined components and try to piece them together as something that is a “kind of” cloud. Why do we need to stuff everything we do over the Internet into a single pigeonhole? I say leave it vague – the cloud means events that take place over network-connected nodes. Let's focus on the more tangible models of cloud computing.

The cloud's most compelling pronouncement is budgetary - saving money. We can use a cloud provider's resources for less than it would cost us to buy, set up, and maintain our own. There are other, technological benefits to cloud computing too. Providers give us APIs we use to store and retrieve our data, to control our virtual machines, or to let someone else do it for us. How does this technology coalesce with your business goals? Do cloud offerings somehow solve deficiencies in your software?

One universal problem throughout all information technology is lack of physical hardware resources. The goal of any business is to scale up their operations, meaning that they have a large demand for whatever it is their software does. Say you're operating an online store and you've got a lot of customers. The software you started with isn't going to fulfil the demand of your newly acquired customer base. You've now got a scaling problem, which is in the most optimistic sense, a good problem to have. The need to scale up equals lack of compute resources – CPU time, memory, storage, and bandwidth.

The virtual cloud model can help with provisioning more physical hardware resources as they're needed. With this model, you deploy your existing application, without modification, to a cloud provider. There is no need to hack your software for the sake of portability. As long as you can get it running as a virtual machine you can deploy it to the cloud. When your software is virtual, it's no longer at the mercy of the hardware. When your hardware fails, and it's always a safe assumption that it will, you've got other copies of your application – new copies can be cloned as required. Of course, I'm over-generalizing a little when I say you deploy using the virtual model with no modification. There is no such thing as software that knows how to adapt to a new environment and perform optimally without error.

Having said that, as a consequence of virtual machines being so easy to deploy, so readily available, so cheap - they're replaceable components in this model. You can find yourself a virtual machine that plugs into your infrastructure. This is probably something specialized, something that serves a particular purpose, like a database server or a cache node. The virtual cloud model promotes reuse and interchangeability.

What's different about the virtual cloud model from anything we've done in the past is running your entire infrastructure on a service provider's hardware. Not just a few services, but the whole thing. This can significantly lower your operating expenses, absolutely. If you don't have any hardware to operate, you've essentially eliminated that cost. The virtual model of cloud computing isn't for everyone, because not everyone is capable of migrating their existing software to an entirely new platform. Even armed with the know-how, would you really want to do it? In addition to the risk of carrying out such an endeavor, you've still got to go through the process of migration, even though you don't need to write new code.

The alternative cloud model, as I see it, is that of the service model, the service-oriented architecture if you will. The service model is fine grained while the virtual model is coarse grained. Services offer small bits and pieces of data and functionality where the fundamental unit of a virtual environment is that of an operating system.

There are all kinds of services available on the web. The web itself, that is, individual web sites offering content, is a prime example of a service. If you subscribe to the opinion that the cloud is nothing more than a synonym for the Internet, you're probably a service-minded person already. The service approach solves the same basic problems as the virtual approach – finding ways to cut computing costs and provisioning more resources at a moment's notice. Instead of offering virtual machines, the provider offers a service, another type of resource analogous to a virtual machine.

Storage is the most common service you'll find on cloud providers. Your application makes an API call to store a chunk of data. It makes another call to retrieve that chunk later on. An interesting usage scenario is using cloud storage APIs as your secondary storage, as a replicated copy of your primary data hosted elsewhere. The service cloud model isn't just about storage, it's about providing an API for anything, any computation that you don't want to perform locally. Maybe you can't perform it locally because you cannot keep up with the demand. Or maybe you simply do not have the required software to do it. Imagine a charting API that produces chart images based on supplied input parameters. The Google chart API does just that. All the client cares about is producing the necessary input data for the chart. Generating the chart image is up to the API, with my input parameters processed somewhere in the cloud.
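The secondary-storage scenario can be sketched without committing to any particular provider. In the Python 3 example below, InMemoryStore is a stand-in for both the local store and the provider's storage API, and the put/get method names are assumptions; a real provider's client library will have its own calls.

```python
class InMemoryStore:
    """Stand-in for a storage backend; a real one would make network calls."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]

class ReplicatedStore:
    """Writes to both stores; reads fall back to the secondary copy."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def put(self, key, data):
        self.primary.put(key, data)
        self.secondary.put(key, data)   # replicated copy hosted elsewhere

    def get(self, key):
        try:
            return self.primary.get(key)
        except KeyError:
            return self.secondary.get(key)

local = InMemoryStore()
cloud = InMemoryStore()   # pretend this one is the provider's storage API
store = ReplicatedStore(local, cloud)
store.put("photo-1.jpg", b"raw bytes")

local._blobs.clear()      # simulate losing the primary copy
print(store.get("photo-1.jpg"))  # b'raw bytes'
```

The application only ever talks to ReplicatedStore, so swapping the stand-in for a real cloud client would not change the calling code.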

What are the differences between the two cloud models? How do I know which one to use? Do the two models have anything in common with one another? The key difference between a virtual cloud model and a service cloud model is the application itself. Take a photo printing store, for instance. This store allows users to upload their digital photos to be printed and mailed to the customer. This business's website needs several different components to operate. It needs the website itself, which includes things like the user interface, static pages that display information about the company, the services offered, and so on. The facility to upload photos is a component on its own, the service for processing payments is also a separate component, along with the facility to queue the physical photos for printing.

Let's now imagine we're going to deploy our photo store using the virtual cloud model. Each component, the website, the upload facility, the payment gateway, and the print queue, they're all virtual machines. Each component of our photo printing system is an operating system that can run in any virtual environment. If one goes down, it can be replaced with another copy of that same component. In our own local environment, we create the photo print queue and test it. This same utility gets deployed, without modification to the cloud service provider. However, deploying without modification is a little misleading. Our components still need to orchestrate with one another, not a trivial task unless I've found some system to facilitate the communication amongst the different photo printing components. Unlikely given that this is what makes our service unique and thus competitive. I've got to write something that does this.

The alternative method is to employ the cloud service model approach. Rather than deploy each of our application's components as a virtual machine, we deploy the photo printing website on our own host, calling different service APIs from the cloud to help our application along. Maybe we call a storage API to save uploaded photo files. Maybe there is a cloud API we can use to queue the photo print jobs, a message queue perhaps. There are a ton of payment gateway APIs that we can use.

So which model is better? There isn't a definitive measure of value when comparing the two models with one another. Not without a context. The question instead, should be something like, which model is better suited for what I'm trying to achieve? How does cloud technology help me when my own resources are running dry? You have to ask yourself, how equipped am I to operate a virtual model and is it really just a matter of deploying my software without modification? Or is there a lot of mediation code I'm going to need. Maybe something like this already exists, how much will it cost?

If the virtual model doesn't help you, if you're not technically capable of operating such a setup, or maybe it's just not cost-effective, you can look at the service cloud model. The service cloud is all about providing APIs to applications, APIs that offer something of value to the application. Another factor favouring the service cloud model is deployment and availability. If my software utilizes one or more cloud APIs, they must have already existed before my application decided to use them. That is, after all, the value proposition of the provider offering the API. Consequently, APIs should always be available because this is the reason for the provider's existence: to offer applications such as ours compute resources on demand.

The cloud service model might also be lacking depending on what you're trying to build. Consider our photo printing store. We've got some fairly well defined components that will help it match the customer demand. We've come to the conclusion, during the design of our system, that we need APIs that do certain things. Odds are we'll be able to find a payment gateway that will suit our application's needs just fine. But what about the printing queue component, the one where we send photos to be printed? This is a requirement specific to what we're trying to do with our own service. So we're going to have to build our own service to do something like this.

Can the two cloud models conspire to offer real value? My thoughts are, one can serve as a complement to the other. Take our photo printing pursuit for example. We've examined what it would be like running in the cloud as both a virtual cloud model and as a service cloud model. Both have their advantages – the virtual model gives a clear distinction between components and total control, as they're created by us. We can also find virtual components off the shelf that do what we want. The service model is lightweight, as in, we don't have to deploy changes to the services because they're hosted by the cloud. We can design our applications with these cloud service APIs in mind, giving us fine-grained control over how our software interacts with the cloud.

The two models can certainly help one another out. A solution for our printing service job queue component might be to implement our own virtual service. That is, a combination of a service and virtual model where a virtual machine, implementing the job queue service we're looking for gets deployed and we have total control over it. We're taking advantage of the control and flexibility of deploying virtual while keeping the design philosophy of our application's features and using an API, service driven approach to doing certain tasks.

In the end, there are probably several more of these sub-genres of cloud that we can concoct and describe once we've encountered real problems and devised real solutions to them. We simply use the higher-level, vague cloud term for inspiration.

Tuesday, April 12, 2011

Easy as ABCMeta

My earlier attempt at explaining abstract classes in Python didn't take into account the abc module, used for creating abstract base classes.  What follows is an example I find useful for using the abc module to define virtual subclasses.  A virtual subclass is different from a subclass in the traditional sense.  Normally, a subclass is something that inherits from a base class.  In Python, we can use this hierarchical structure to implement abstract base classes whose methods do nothing, or raise a NotImplementedError exception.  It is up to the subclasses to provide the implementation.

Virtual subclasses are different in that there is no inheritance - the abstract base class registers virtual subclasses.  This registration takes place outside of the class hierarchy, so it's easy to replace virtual subclasses, or remove them entirely.  This is all done with the ABCMeta class, part of the abc module.
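Before getting to the shapes, here is a small sketch of what registration does and does not do.  It is written in Python 3 syntax, where the metaclass is passed as a class keyword argument rather than set as a __metaclass__ attribute; the Drawable and Circle names are made up for the illustration.

```python
from abc import ABCMeta

class Drawable(metaclass=ABCMeta):   # Python 3 spelling of __metaclass__ = ABCMeta
    def draw(self):
        return "drawing"

class Circle:                        # note: does not inherit from Drawable
    pass

Drawable.register(Circle)            # Circle becomes a virtual subclass

print(issubclass(Circle, Drawable))    # True
print(isinstance(Circle(), Drawable))  # True
print(Drawable in Circle.__mro__)      # False: not part of the hierarchy
print(hasattr(Circle(), "draw"))       # False: nothing is inherited
```

Registration only affects isinstance() and issubclass() checks.  The virtual subclass gains no methods or attributes from the abstract base class.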

Suppose I have a Shape class that can move to a specified point.  Such an implementation might look like this.

class Shape(object):
    
    def move(self, x, y):
        
        print 'Moving to X: %s Y: %s'%(x, y)
        
if __name__ == '__main__':
    
    my_shape = Shape()
    
    my_point = (50, 100)
    
    my_shape.move(*my_point)

The Shape.move() method expects a point to move to.  The point is kept as a tuple of x, y coordinates and unpacked into the call.  However, Shape is a legacy class that's been around forever and is being replaced with ShapeNew.

class ShapeNew(object):
    
    def move(self, point):
        
        print 'New Moving to X: %s Y: %s'%(point.x, point.y)

The ShapeNew.move() now expects a point object with x and y attributes, not a tuple with zero-based index lookups.  This is where the ABCMeta class comes in handy.

from abc import ABCMeta

class Point(object):
    
    __metaclass__ = ABCMeta
    
    def __init__(self, *args):
        
        self.data = args
        
    def __getitem__(self, index):
        
        return self.data[index]
        
    def __len__(self):
        
        return len(self.data)
        
    def get_x(self):
        
        return self[0]
        
    def get_y(self):
        
        return self[1]
        
    x = property(get_x)
    y = property(get_y)        
        
Point.register(tuple)

We now have a Point class that we can use with the ShapeNew.move() method.  You'll notice that Point sets its __metaclass__ attribute to ABCMeta.  Following the definition of Point, we see Point.register(tuple).  This tells Python that tuple is a virtual subclass of Point, so isinstance() and issubclass() checks against Point also accept plain tuples.  Because Point implements __getitem__() and __len__(), it can also be unpacked like a tuple, so a Point can be used anywhere tuples were used for the old Shape.move() method.

from abc import ABCMeta

class Point(object):
    
    __metaclass__ = ABCMeta
    
    def __init__(self, *args):
        
        self.data = args
        
    def __getitem__(self, index):
        
        return self.data[index]
        
    def __len__(self):
        
        return len(self.data)
        
    def get_x(self):
        
        return self[0]
        
    def get_y(self):
        
        return self[1]
        
    x = property(get_x)
    y = property(get_y)        
        
Point.register(tuple)

class Shape(object):
    
    def move(self, x, y):
        
        print 'Moving to X: %s Y: %s'%(x, y)
        
class ShapeNew(object):
    
    def move(self, point):
        
        print 'New Moving to X: %s Y: %s'%(point.x, point.y)
    
if __name__ == '__main__':
    
    my_shape = Shape()
    my_shape_new = ShapeNew()
    
    my_point = Point(50, 100)
    
    my_shape.move(*my_point)
    my_shape_new.move(my_point)

Here, we're using ABCMeta to help us transition from legacy code to a newer implementation.  So when Point no longer needs to be compatible with tuples, we can remove ABCMeta as its __metaclass__ and remove the Point.register(tuple) registration.
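For readers on Python 3, where the __metaclass__ attribute is ignored and print is a function, a condensed sketch of the same Point might look like the following.  The unpack() helper is hypothetical and stands in for the legacy Shape.move(x, y) signature.

```python
from abc import ABCMeta

class Point(metaclass=ABCMeta):
    def __init__(self, *args):
        self.data = args

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    @property
    def x(self):
        return self[0]

    @property
    def y(self):
        return self[1]

Point.register(tuple)

def unpack(x, y):
    """Hypothetical stand-in for the legacy Shape.move(x, y) signature."""
    return (x, y)

point = Point(50, 100)
legacy = (50, 100)

print(isinstance(legacy, Point))  # True: tuple is a virtual subclass of Point
print(isinstance(point, tuple))   # False: registration works in one direction
print(unpack(*point))             # (50, 100): unpacking relies on __getitem__
print(hasattr(legacy, "x"))       # False: plain tuples gain no x or y
```

The registration makes type checks against Point accept legacy tuples, while the sequence methods on Point are what let it be unpacked into the old call.  A plain tuple passed to ShapeNew.move() would still fail, since it has no x or y attributes.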