Wednesday, October 3, 2012

Designing For Big Data

Information is all around us. We're producing more data on a daily basis than ever before, and there is no sign of this phenomenon slowing down. Most likely, this is due to the increasing number of network-connected devices. Everyone has a laptop, a tablet, and two phones, it seems, each generating data in its own way. Beyond the personal computing devices that we control for the most part, there is a growing number of unmanned devices pushing data across networks. These autonomous agents usually sense data about their surrounding environment and publish the raw data, either to some centralized software that knows how to make sense of it, or in a more distributed fashion to other agents that might want to know what's going on. These autonomous agents have a profound impact on the global measure of available information because they never sleep. There is an endless stream of data about the current state of the world. Yet we're measuring just a portion of our existence through these autonomous tools, a tiny fraction of what we could potentially measure. Imagine what the automated information stream will look like 5, 10, 20 years from now...

We know there is a lot of information moving around the networks that connect us all. We label this high volume of data as big data. It's a good name, as it accurately captures what seems to be happening in information technology today. We're collecting data about anything and everything. That much is a given. We've always done this, as long as computers have existed. The new part of the equation is that our data has grown up. We used to have a concept of storing only meaningful data, while discarding the stuff that seemed wasteful to the domain in question. That concept is eroding quickly. The idea that data can be wasteful seems to belong to a distant past where hard drive space was not a commodity. Since that notion is becoming less and less relevant, the thirst for data is palpable. You can see it everywhere. We've got software such as Hadoop dedicated to solving the problem of big data. Google's MapReduce has pushed its way into the mindset of every system architect, it would seem. Given that we're not all accustomed to the idea of processing terabytes of data on a daily basis, perhaps we need to take a step back and examine what big data really means for our traditional approaches to software development.

The world is a complex place, constantly presenting us with complex information. Now that we're able to capture a larger slice of that information, we have to think about the best way to approach it. Not just from a capacity standpoint, but from a few other practical perspectives as well. To do that, it might help if we were to think about why big data is big to begin with. Has data grown big simply due to the increased volume of available data? Or is big data a manifestation of complexity in the form of bits we can compute with?

The reason I think these are important questions to ask in this era of computing with big data is that they have a rather noticeable impact on software, both the legacy software that has been deployed for years and the software we're writing right now. In either scenario, whether big data exists as just a side-effect of availability or is the realization of the complexity in our world, we require more compute power to interact with it. Is this necessarily a bad thing? No, I think not, but I do think that the focus around what we're actually computing with big data could be sharpened. For example, if you view big data as one giant heaping pile of information, stored away for later analysis, you're going to require specialized software that is meant for dealing with enormous data sets. Alternatively, we could think about how we scale out our existing architectures to support larger volumes of data.

The trouble is, the IO patterns that we have in place today do not match up with some of the larger-scale information processing operations that are becoming prevalent, again, due to the availability of raw data. If my modest web application wanted to perform a sort on some data set that spans the terabyte range, I'm in trouble unless I've gotten many things right in terms of both design and deployment. So, imagine this same problem with even more data. It's challenging enough to get these types of problems right, and to perform reasonably well, with a mere terabyte of data. But is a terabyte even considered big data anymore when it comes to performing the kinds of IO operations that we're accustomed to, like sorting? Let's just assume that big data is nothing more than "more of the same". We're still building software as we normally would, it's just that now we have forty times our current peak to process. The inability of our current programming models to handle this type of volume renders the concepts that we're used to meaningless. We have to keep the end user in mind too. If we're writing a big data application, they cannot expect progress indicators to function as they would with regular-sized data.
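To make the sorting problem concrete, here is a minimal sketch of an external merge sort, one common way to sort more data than fits in memory. It is only an illustration under stated assumptions: the function names, the one-integer-per-line chunk format, and the chunk_size parameter are all hypothetical, and a real terabyte-scale sort would also have to deal with disk throughput, parallelism, and failure handling.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_sorted_chunk(chunk):
    """Sort one in-memory chunk and spill it to a temp file, one integer per line."""
    chunk.sort()
    fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{value}\n" for value in chunk)
    return path


def _read_chunk(path):
    """Stream integers back from a chunk file without loading it whole."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values, chunk_size):
    """Yield the integers in `values` in ascending order.

    Hypothetical helper: at most `chunk_size` values are held in memory
    while chunking, and one value per chunk while merging.
    """
    iterator = iter(values)
    paths = []
    try:
        while True:
            chunk = list(islice(iterator, chunk_size))
            if not chunk:
                break
            paths.append(_write_sorted_chunk(chunk))
        # heapq.merge lazily merges inputs that are already sorted.
        yield from heapq.merge(*(_read_chunk(path) for path in paths))
    finally:
        # Clean up the temporary chunk files once the merge is done.
        for path in paths:
            os.remove(path)
```

The point of the sketch is the IO pattern rather than the algorithm: the data is written to disk and read back at least once more than in an in-memory sort, which is exactly the kind of cost that a design built for "regular-sized" data never had to budget for.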

A big part of what Google and Hadoop do in their big data practices is filter out noise. That is, the actor working with the big data wants to find something smaller. Given the actual magnitude of some of this big data, small isn't really small. Take Google's search results: "hotels in Toronto" gives me an impressive 109,000,000 results. Of those, maybe 5 are what I'm really looking for. It seems like I might have some trouble finding 5 search results amongst millions. It turns out that I do not, because what I'm looking for is actually on the first page. Where I'm going with this is that it can be advantageous for software to give users the impression of big data. A web search that returned only 10 results wouldn't look very impressive, now would it? Going forward, developers need to acknowledge that big data is a multi-dimensional problem that extends all the way to the user. It's not just an issue of available capacity.
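The "first page out of millions" idea can be sketched in a few lines. This is not how Google or Hadoop actually rank or filter results. It is a hypothetical first_page helper that assumes each result already arrives with a relevance score, and it shows that reporting a huge total while keeping only a handful of results needs memory proportional to the page size, not to the total.

```python
import heapq


def first_page(scored_results, page_size=10):
    """Return (total_matches, best) from a stream of (score, result) pairs.

    Hypothetical helper: `best` holds the `page_size` highest-scoring
    pairs, best first. The stream is consumed once and never stored whole.
    """
    total = 0

    def counted():
        # Count every match as it streams past, so the total can be reported.
        nonlocal total
        for pair in scored_results:
            total += 1
            yield pair

    # heapq.nlargest keeps only page_size candidates in memory at a time.
    best = heapq.nlargest(page_size, counted(), key=lambda pair: pair[0])
    return total, best
```

Called with a generator of a million scored results and page_size=5, it returns the count of one million alongside just five pairs, which is roughly the impression a search page gives: a very large number at the top and a very small list underneath.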