Thursday, August 2, 2012

Global Storage

I'm often amazed at how much of the world's information can fit on my laptop's hard drive. I can literally shovel in a profusion of human knowledge and think nothing of it. We observe with our senses and record our interpretations. We've been doing this really well, for a really long time. The puzzling problem, for most of that history, was that there was no feasible way to search such a mountain of data. Now we can search it. We can organize our observations about the universe, seemingly without restriction. Imagine you were assigned the task of recording everything you know, from kindergarten to the present, on your laptop. You have as much time as you need to type out every piece of knowledge encoded just beneath your skull. Not only would this take forever, but you would probably never run out of storage capacity.

The information storage capacity we have at our disposal is so vast that we seldom take note. If you write code, or otherwise work closely with devices that preserve information, the thought may cross your mind from time to time. Running out of global storage capacity is such a remote problem that nobody asks the question: what will happen when we do run out? The problem seems far-fetched simply because of the sheer amount of drive space available to us, either directly or through a web service.

What got me on this topic in the first place was thinking about the types of information we create and use as part of our day-to-day lives. I was organizing some of my digital media, in fact, moving my old physical media onto my external 3TB drive. The whole time, I could not shake the feeling that this is a lot of data, even by today's standards. The scary part, however, was my second thought: all of this is going to have to be copied over to another drive someday. This exercise instilled in me two inklings about persistent data. First, the more software we write and the more sophisticated it becomes, the more data is ultimately produced and stored. Second, all this data should be backed up, given that hardware will eventually fail.

These two facts, when combined, paint a somewhat scary picture. First, producing manifold data without much effort isn't likely to slow down. Backing up all this data, combined with our powerhouse data generation capabilities, means that we're putting massive pressure on hardware manufacturers to produce more and more disk space. The end result? A focus on creating more and more hardware on which we can store our information and back it up. Until we have hardware that doesn't fail, we need to replicate information if it is to have any hope of survival.

Talking about data backup this way leads me to think of something else this industry may be confronted with in the near future: the value of data. Of all the information we're worried about, all of it backed by investment in storage hardware, both primary and replicated, how can we jettison some of it? I think at some point, companies are going to need to start asking these kinds of questions. What are the trade-offs that will result in either less information produced, weaker replication requirements, or disk space affordable enough to let them continue as they do today?

These are big questions, none of which have an easy answer because none pose an immediate threat to the daily operations of institutions that rely heavily on information technology. Storage is cheap. Replication is cheap. Therefore, it might be helpful as a thought experiment to simply adjust one of the constants we're used to.

Let's start on the information production end of things. If we think about that, we might be able to answer the simple question of why we have so much data to begin with. Because storage is so cheap, it seems the question should be: why shouldn't we have more data? After all, isn't that how you gain a competitive edge? More data means more opportunity to analyze it and generate insights that lead to, well, more data. We have so many inputs making their way through the pipes and filters of our information systems that it's difficult to measure exactly how much information our companies generate. Again, we don't think about it because we don't need to; storage is cheap.

As time goes by, the value of information decreases. Something that was relevant today becomes less relevant tomorrow, even less so the day after that, and so on. Theoretically, information is never worthless; you can always devise some computation that takes into account the events of a decade ago. Think about the field of statistical analysis. Imagine the statistical software we have in place today operating on data produced at today's rate, starting from fifty years ago. Would we even have the processing power to handle it? Hard drives aren't the only resource we need to worry about in terms of scarcity; there's also raw processing power. Just as a single hard disk has only so much capacity, so too does a CPU have only so much throughput. But the nature of what we're storing, the semantic content, plays a role here too.
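That decay of relevance can be sketched as a simple scoring function. Everything here is a made-up illustration: the `value_at` name and the one-year half-life are arbitrary choices of mine, not a real metric from any system.

```python
import math

def value_at(age_days: float, half_life_days: float = 365.0) -> float:
    """Exponentially decay an item's relative value with age.

    With a one-year half-life, an item keeps half its value after a
    year, a quarter after two years, and so on. The value never reaches
    exactly zero, mirroring the idea that old data is rarely worthless,
    just less relevant.
    """
    return 0.5 ** (age_days / half_life_days)

print(round(value_at(0), 3))     # fresh data: 1.0
print(round(value_at(365), 3))   # one year old: 0.5
print(round(value_at(3650), 4))  # a decade old: 0.001
```

The exact curve matters less than having one at all: any monotonic decay gives you a principled way to rank old data against new.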

I mentioned that it's easy to produce a lot of data today, on any computer. But that doesn't equate to valuable data that will be interesting to compute with down the road. For instance, how much of the world's information is opaque binary data, only meaningful to human eyes when rendered on a screen? Data like this needs to be treated differently because of its semantic content. If a company stores opaque data such as image and video files, it's highly unlikely to execute large processing jobs on those data on a daily basis, if ever. So we need to think about isolating these resources away from our raw processing power. We also need to think about demand for these resources. In any given system, there is likely more information than there are users to look at it. That leads me to believe we can do a better job of putting a value system in place for data. That is, high demand should equate to value, and value should equate to the allocation of better hardware resources for that particular data.
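One way to make that value system concrete is to map observed demand directly onto storage tiers. The tier names and access-count thresholds below are hypothetical, chosen only to illustrate the demand-to-hardware mapping:

```python
def assign_tier(accesses_last_30_days: int) -> str:
    """Map recent demand onto a storage tier.

    Thresholds are illustrative: heavily-used data earns fast
    hardware, rarely-touched data gets cheap archival storage.
    """
    if accesses_last_30_days >= 100:
        return "ssd"      # high demand: fastest hardware
    if accesses_last_30_days >= 10:
        return "hdd"      # moderate demand: commodity disk
    return "archive"      # little demand: cold storage

print(assign_tier(500))  # ssd
print(assign_tier(25))   # hdd
print(assign_tier(1))    # archive
```

A real policy would fold in the decay of value over time as well, but even this crude mapping captures the principle: demand determines which hardware a piece of data deserves.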

What all of this amounts to is managing data independently of the applications that produce it and those that consume it. Not just administering the supporting software that sits between the data and the hard disk, and not just ensuring that the data remains replicated and always available. Active data management, as opposed to treating data passively. When we take the passive perspective toward the data we store, we're not taking into account the liveliness of that data. Are we using valuable hardware resources to serve up sterile data? Imagine, should it actually be feasible, if we could have that type of monitoring in place: an observer that lets us know our software is working harder than it has to, sifting through piles and piles of data that isn't serving the greater good.
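A crude version of that observer could walk a catalog of stored items and report how much capacity is serving data nobody has touched recently. The item records, sizes, and 90-day staleness cutoff here are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class StoredItem:
    name: str
    size_gb: float
    days_since_access: int

def sterile_report(items, stale_after_days: int = 90):
    """Return (stale_gb, total_gb): capacity spent on untouched data."""
    total = sum(i.size_gb for i in items)
    stale = sum(i.size_gb for i in items
                if i.days_since_access > stale_after_days)
    return stale, total

catalog = [
    StoredItem("request-logs", 40.0, 200),
    StoredItem("user-photos", 120.0, 3),
    StoredItem("old-exports", 60.0, 400),
]

stale, total = sterile_report(catalog)
print(f"{stale / total:.0%} of {total:.0f} GB is sterile")
```

An observer like this wouldn't move anything on its own; it would simply make the sterile fraction visible so that migration or archival could be triggered deliberately.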

The process I'm describing is almost like that of data warehouses. You use a data warehouse to perform operations on data outside of the context in which it originated. For instance, your warehouse might contain request logs for the month of January. That means you can safely perform analyses on those logs without impacting the real application or its data. This is great, but it still involves a batch-style migration of data, a somewhat passive approach to storing data effectively. That is, instead of moving data that no longer impacts the current state of the application to a better location as soon as it goes cold, we're leaving it in place for a predetermined amount of time. During that time, we're working with sub-optimal data. As much like micro-optimizations as these principles may sound, they will not be so micro in a broader context. At a future time, when we have more and more information to store, we'll wish we had taken a closer look at some of these seemingly trivial improvements.

One thing about information storage around the globe that isn't likely to slow down is the rate at which we produce data. So we'll keep acquiring more storage devices. But we'll only be able to keep that up for so long before our software cannot cope. We need smarter solutions for how we store our data, not just ways to store more of it. Producing less information isn't the answer, because that would deter creativity and the general advancement of knowledge. But doing research now, into how we can automate keeping valuable data visible and archiving it as it degrades, seems like a good idea to me.