Friday, April 29, 2011

Statistical Objects

Software is really good at keeping statistical records. We can write code that stores the raw data we want to keep track of. We can then display this data in a user-friendly way, maybe as a chart of some kind. In addition, the stored raw data can be mined for trends and anomalies. Imagine trying to function in a modern scientific discipline without this capability, without software that aids in statistical analysis. It would be nearly impossible. There is simply too much data in the world for us to process without tools that extract meaning for us. This same complexity phenomenon is prevalent in modern software systems. External tools will monitor the externally-visible performance characteristics of our running system for us. Inside the code, however, is a completely different ball game. We really don't know what is happening at a level we can grasp. Logs help. They're good at reporting events that take place: things like simple tasks, errors, and values for debugging. But the traditional style of logging can only take us so far in understanding how our software behaves in its environment. Object systems should retain this information instead of broadcasting it to a log file and forgetting about it.

Statistics software is different from software statistics. Software statistics are data about the software itself, things like how long it takes to process an event or respond to a request. Running software can track and store data like this in logs. Why do we need this data? The short answer: we need it to gauge characteristics, the otherwise intangible attributes our software exhibits during its lifespan. The system logs, in their chronological format, can answer questions like “when was the last failure?” and “what was the value of the url parameter in the last parse() call?”. Further questions, questions about an aggregate measure of quality or otherwise, require a tool that will parse these logs. Something that will answer questions such as “what is the average response time for the profile view?” or “which class produces the most failures?”
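To make the aggregate case concrete, here is a minimal sketch of a log-parsing tool that answers the "average response time for the profile view" question. The log format, the sample lines, and the function name are all assumptions made up for illustration; a real log would need its own parsing rules.

```python
from collections import defaultdict

# Hypothetical log format: "<timestamp> <view> <response time in ms>"
LOG_LINES = [
    "2011-04-29T10:34:11 profile 120",
    "2011-04-29T10:34:15 home 45",
    "2011-04-29T10:35:02 profile 80",
]


def average_response_times(lines):
    """Return the mean response time in milliseconds for each view."""
    totals = defaultdict(lambda: [0.0, 0])  # view -> [sum of times, sample count]
    for line in lines:
        _timestamp, view, elapsed = line.split()
        totals[view][0] += float(elapsed)
        totals[view][1] += 1
    return {view: total / count for view, (total, count) in totals.items()}


averages = average_response_times(LOG_LINES)
```

Notice that the answer only exists after the whole log has been read. No single log entry contains it.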

These latter questions need a little more work to produce answers. But first, are runtime characteristics the same thing as application logs?

Well, yes and no. Logs are statements that we're making about an occurrence in time, an event. This is why logs are typically timestamped, and it is what makes them such an effective debugging tool. Something went wrong at 10:34:11? The error logs just before then say something about permission problems. I make a few adjustments, perhaps even modify the logging a little, and problem solved. Characteristics of a running system, on the other hand, equate to qualities that cannot be measured by a single event. The characteristics of running software change over time. We can say things like “the system was performing well yesterday” or “two hours ago, the disk activity spiked for five minutes”.

Software logs are like a newspaper. Not an individual paper, but the entire publication process. Events take place in the world and each event gets logged. We, as external observers, read about them and draw our own conclusions. Software characteristics are more like the readout on a car dashboard that tells you what the fuel economy has been over the past several kilometres. This can answer questions such as “how much money will I need this week for gas?”.

It's not as though we cannot log these characteristics to files for external tools to analyze and provide us with meaningful insight. We can, but that doesn't really suit the intent of logging events. Events are one-time occurrences. Characteristics, or traits, are established over time. Our system needs to run and interact with its environment before anything interesting can be accumulated and measured. The question is, how is this done? If we want to measure characteristics of our software, how it behaves in its environment over time, we'll need an external observer to do it for us, to take measurements. External tools can give us performance characteristics or point to trends that cause our software to fail. These tools are limited in some respects because they say nothing about how the objects inside the system interact with one another and the resulting idiosyncrasies.

In a system composed of objects, wouldn't it be nice to know how they interact with one another? That is, to store a statistical record of the system's behaviour at both an individual object level and at a class level? This type of information about our software is meta-statistical: stats that only exist during the lifetime of our software. Once it's no longer running, the behavioural data stored about our objects is meaningless, because the behaviour could change entirely once the software starts up again. If we can't use a report generated by a running system to improve something, say, our code or our development process, what value does it have?

For example, suppose we want to know how often an instance of the Road class is created. We might also be interested in how the instance is created: which object was responsible for its instantiation? If I want to find out, I generally have to write some code that will log what I'm looking for and remove it when I'm done. This is typical of the debugging process: make a change that will produce some information that we'd otherwise not be interested in. This is why we remove the debugging code when we're finished with it; it's in the way. Running systems tested as being stable don't need the added overhead of keeping track of stuff like which object has been around the longest or which class has the most instances. These statistics don't seem valuable at first, but we add code to produce them when we need them. When something goes wrong, that code certainly comes in handy. Maybe we don't need to get rid of it entirely.
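As a rough sketch of what keeping that code around might look like, the base class below counts how many instances of each class are created and which kind of object asked for them. Only the Road class comes from the example above; the Tracked base class, the MapBuilder class, and the creator argument are hypothetical names invented for illustration.

```python
import time
from collections import Counter


class Tracked:
    """Hypothetical base class that keeps creation statistics in memory."""

    created = Counter()   # class name -> number of instances created
    creators = Counter()  # (class name, creator class name) -> count

    def __init__(self, creator=None):
        cls_name = type(self).__name__
        creator_name = type(creator).__name__ if creator is not None else "unknown"
        Tracked.created[cls_name] += 1
        Tracked.creators[(cls_name, creator_name)] += 1
        self.created_at = time.time()  # lets us ask which object is oldest


class Road(Tracked):
    pass


class MapBuilder:
    """Hypothetical object responsible for instantiating roads."""

    def build_road(self):
        return Road(creator=self)


builder = MapBuilder()
roads = [builder.build_road() for _ in range(3)]
orphan = Road()  # created without naming a responsible object
```

The counters live on the class rather than in a log file, so the running system can be asked at any time how many roads exist and where they came from.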

Maybe we want to use statistical objects in our deployed, production systems after all. Maybe this will prevent us from having to piece together logic that helps us diagnose what went wrong. Logs are also handy in this sense, for figuring out what happened leading up to the failure. Recall that logs record system occurrences, a chronological ordering of stuff that happens. We can log as much or as little as we please about things that take place while our software is running. But too much information in the log files is just as useless as not having enough information to begin with.

The trouble with storing statistical data about our running system, data about software objects in their environment, is the overhead. If overhead weren't an issue, we probably wouldn't bother removing our debug code in the first place. In fact, we might build it into the system: code that stores interesting, insightful information. Developers dream of having the capability to query a running system for any characteristic they can think of. Obviously this isn't possible, let alone feasible. To store everything we want to know about our running system would be an exercise in futility. It simply cannot be done, even if we had unlimited memory. The complexities involved are too great.

So what can we do with statistical software objects that speak to meta properties of the system? Properties that will guide us in maintenance or further modifications to the software in the future? It's simple, really. We only store the interesting meta data on our objects. Just as we only log events we feel are relevant, we only keep a record of system properties that are statistically relevant to a better performing system, or perhaps to a more stable system, or to any other unforeseen improvement made possible by having the meta data available. This way we can be selective in our resource usage. Storing things that don't help us doesn't make sense. That said, the uses sometimes won't reveal themselves until it's too late, when you don't have the data you need and have a heroic debugging effort to deal with.

For example, say you're building a painter application, one that allows you to draw simple shapes and move them around. You can keep track of things like how many Circle and Rectangle objects are created. These are domain statistics, however, because they describe what the application does. This information is potentially useful to the user, but not so much to the developer, or to the application itself. It is more a characteristic of the user than of the software. But what if we knew how each shape was instantiated? Did the user select the shape from a menu, or did they use the toolbar button? Perhaps these two user interface components have different factories that create the objects. Which one is more popular? How can we, and by we I mean the software, exploit this information to function better? Using this information is really implementation dependent, if it is used at all. For example, the painter application could implement a cache for each factory that creates shapes. The cache stores a shape prototype that gets copied onto the drawing canvas when a shape is created. Armed with meta-statistical information about our system, we can treat one factory preferentially over another, perhaps allocating it a larger cache size.
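The factory idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the ShapeFactory class, the allocate_cache function, the proportional split, and the dictionary standing in for a real shape object. The point is only that the usage counts, not the developer, decide how the cache slots are divided.

```python
class ShapeFactory:
    """Hypothetical factory that counts how many shapes it has created."""

    def __init__(self, name):
        self.name = name
        self.created = 0     # the statistic we keep about this factory
        self.cache_size = 0  # set later, based on observed usage

    def create(self, shape):
        self.created += 1
        return {"shape": shape, "source": self.name}  # stand-in for a real shape


def allocate_cache(factories, total_slots):
    """Split total_slots across factories in proportion to their usage."""
    total_created = sum(f.created for f in factories)
    for f in factories:
        # With no usage data yet, fall back to an even split.
        share = f.created / total_created if total_created else 1 / len(factories)
        f.cache_size = int(total_slots * share)


menu = ShapeFactory("menu")
toolbar = ShapeFactory("toolbar")

for _ in range(3):
    toolbar.create("Circle")
menu.create("Rectangle")

allocate_cache([menu, toolbar], total_slots=8)
```

With three shapes from the toolbar and one from the menu, the toolbar factory ends up with six of the eight cache slots and the menu factory with two.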

The preceding example is still reflective of the domain itself. Sure, the implementation of our software could certainly benefit from having it, but what about problematic scenarios that are independent of the domain? For example, disk latency may be high in one class of objects, while not as high in another class. Again, this does depend on the user and what they're doing, but also on factors external to the software, such as hardware or other software processes competing for resources. Whatever the cause, we give our system a fighting chance to adapt, given sufficient data. Sometimes, however, there really isn't anything that can be done to improve the software during runtime. Sometimes external factors are simply too limiting, or maybe there is a problem with the design. In either case, the developers can query the system and say “wow, we should definitely be running with more memory available” or “the BoundedBox class works great, except when running in threads”.
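One way to picture querying the system for something like per-class disk latency is a small in-memory record of samples grouped by class. This is only a sketch: LatencyStats, ImageStore, and SettingsStore are hypothetical names, and the latency values are passed in by hand rather than measured.

```python
from collections import defaultdict


class LatencyStats:
    """Hypothetical in-memory record of latency samples, grouped by class name."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def record(self, obj, seconds):
        name = type(obj).__name__
        self._count[name] += 1
        self._total[name] += seconds

    def mean(self, name):
        """Average latency for a class, or 0.0 if it has no samples."""
        count = self._count.get(name, 0)
        return self._total[name] / count if count else 0.0

    def slowest_class(self):
        """Name of the class with the highest mean latency, or None."""
        return max(self._count, key=self.mean, default=None)


class ImageStore:
    pass


class SettingsStore:
    pass


stats = LatencyStats()
stats.record(ImageStore(), 0.25)
stats.record(ImageStore(), 0.75)
stats.record(SettingsStore(), 0.125)
```

A developer, or the software itself, can then ask stats.slowest_class() instead of reconstructing the answer from log files.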

Of course, we're assuming we have the physical resources to store all this meta data about our software, data that highlights its running characteristics. We might not have the luxury of free memory, or maybe writing to the disk frequently is out of the question. In these situations, it might make sense to have the ability to turn off statistical objects. You could run your software with them turned on in an environment that can handle them. When it comes to deploying to the live, production environment, shut off the extraneous stuff that causes unacceptable overhead. More often than not, however, there are more than enough physical resources in today's hardware deployments to handle the extra data and processing power required by statistical objects. If you've got the space, utilize it for better software. The reality is, as our software grows more complex, we'll have no choice but to generate and use this type of information to cope with factors we cannot understand.
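Turning statistical objects off could be as simple as a flag checked before anything is recorded. In this sketch the OBJECT_STATS environment variable and the Stats class are assumptions chosen for illustration, not an established convention.

```python
import os
from collections import Counter

# Hypothetical switch: statistics stay on unless OBJECT_STATS is set to "0".
STATS_ENABLED = os.environ.get("OBJECT_STATS", "1") != "0"


class Stats:
    """Collects counters when enabled; does nothing when disabled."""

    def __init__(self, enabled):
        self.enabled = enabled
        self.counts = Counter()

    def record(self, key):
        if self.enabled:
            self.counts[key] += 1


stats = Stats(STATS_ENABLED)
stats.record("Road.created")
```

When the flag is off, each call to record() costs one attribute check and stores nothing, so the same code can ship to an environment that can't afford the overhead.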