Thursday, February 24, 2011

Privacy and Ownership In The Cloud

I'll admit, when cloud computing became a mainstream technology, I only saw the risks, not how to utilize it. The obvious risk being - I cannot control my information in the cloud. I lose at least some level of control, not all of it - I can still delete my data, manipulate it, etc. But the cloud is a foreign place, out on the periphery of the Internet. Concerns about information control in the cloud fit into one of two categories - privacy and ownership. Privacy means I don't want my data viewed or modified by anyone unintended. Security plays a crucial role in information privacy - if my data is insecure, it is no longer private. We'll stick with the more general term here since this is more about privacy implications, not technical security implementations. The ownership concern is that my raw data doesn't exist on a physical hardware device in my possession. Instead, the device is owned by someone else, and is in their possession. Theoretically, I own the data I put in the cloud. In reality, my data could disappear at any time. The cloud isn't scary and unprotected, it's just new. We're only starting to see its real potential - how we can use the cloud to minimize mundane administration tasks and focus on building quality software, and how we can acquire new computing resources as we need them and discard them when we're finished. As a first step, we need to dispel privacy and ownership concerns in the cloud.

Before the cloud became the all-encompassing entity it is today, privacy and ownership concerns on the Internet as a whole were already prevalent. I obviously need to make sure that the code running my website isn't visible to anyone but myself. I need to make sure the user-profile information stored in my database is immutable to anyone aside from the owner. Operating on the web brings endless security responsibilities, and perfect security is ultimately unattainable. What you can do is stay informed of best security practices and adapt when new vulnerabilities are identified. This is the status quo today and likely always will be, as new attackers come into existence. Ownership, having possession of the storage devices that house your data, isn't a new consideration either. If you build a web site, you're not going to blindly hand over your data to someone claiming to be a hosting service provider. You need to establish at least some level of trust between yourself and the provider business. This generally goes without saying, but is worth repeating. Anxiety will always take hold when you can't physically hold your data and take it home with you. Issues with information control - privacy and ownership - have been around since the early days of the Internet.

There is no real differentiating factor between the cloud and traditional dedicated hosting services with regard to privacy and ownership. Deploying to the cloud requires that we atomize our applications, break them down into isolated units. Rather than a dedicated host, your application might be driven by specialized virtual machines, all running on physical hosts. As an alternative to the virtualization approach, you might use a cloud development platform like Google App Engine. Either way, your application and its data live on a physical device out of reach. These mainstream cloud services are no less secure than a web application running on a dedicated host. The cloud is a new phenomenon and we generally don't trust it yet.

Let's further highlight the differences between running my application on a dedicated host and running it in a cloud environment. I have an e-commerce store that is ready to go live in production. I've tested it in my own local development environment. It works, it's secure, and I know all the external hardware and software dependencies like the back of my hand. Now I have to decide where I want it deployed - a dedicated host or the cloud. The simplicity of going with a dedicated host when the system is first deployed seems like a good idea. It'll run just fine with limited demand for resources. But I'm hoping that business will pick up, and I'll therefore need to scale up. The simplicity of running on a dedicated host is suddenly not so simple. On the other hand, if I were able to deploy some of my application's components as virtual machines in the cloud, things would suddenly be more manageable. But something doesn't seem right about putting my data into cloud territory. I just can't get my head around it, even though I know others who've done similar deployments. They say their systems are secure and have the utmost confidence in their cloud infrastructure.

It seems like my preconceptions of privacy and ownership in the cloud are obstructing what should be an easy decision. For some reason, I've come to the conclusion that dedicated hosting environments are immune to people stealing my information. This simply isn't true. The same concerns exist in any web environment.

There has to be a solution to mitigate both privacy and ownership concerns. Fixing the ownership problem is simple - buy some hardware, run your software on it, and store your data on it. Why doesn't everyone do this? If we want to own our data, truly own it, possession and everything, this is the answer. The argument against this approach is money - hardware is expensive, and operating it is more costly still. We can't all be data center experts; that's why the service provider industry exists to begin with. They're good at what they do, and offer their expertise so we don't have to bother with it. I could always hire an expert to manage my hardware for me: own my data on my own hardware, while paying someone knowledgeable enough to manage it. But what about cloud technology? Is it just a pipe dream I'll never realize?

Private clouds address businesses concerned with privacy and ownership. They get the cloud, the easy deployment, redundancy, configuration, black magic, whatever - without the need to surrender their data to the Internet. Does this really address the concerns inherent in the service provider business? Ownership, yes. Privacy, not exactly. Remember, privacy of your data is directly coupled with your system security. Responsibility is another constituent when operating your own hardware-software amalgamation - if your data is compromised, you and only you are to blame. What you have done is take away the possibility of some other business, operating out of the same service provider you are, inadvertently or perhaps maliciously damaging your information privacy. So we can see how ownership directly affects the potential for failed privacy in information technology.

Aside from where our data is stored, how do we protect ourselves in the cloud from ownership and privacy disquiet? The trouble is, simple requirements need only simple solutions. My low-traffic website has no business running in the cloud. Having said that, things change. Suppose I do have a lot of traffic. I don't want to rewrite my entire infrastructure just because I have a lot of visitors. Unfortunately, this is usually what running on the cloud means. We can't take a web site and expect it to magically scale up without any changes.

The cloud is really good at provisioning new resources. This is the selling point for cloud technology - resources on demand with minimal overhead. This is what sets cloud services and traditional web hosting services apart. I put my application out into the cloud with high hopes that I'll have a need for more compute resources. When that day does come, I can expand and contract as necessary. The cost of operating these resources expands and contracts accordingly, so this is a win-win. But as we've seen, I'm a business that values the privacy and ownership of its information. How can I use these resources that are so readily available, so easy to deploy?

We can use the cloud as a raw number-crunching machine. By this I mean using the cloud as something that computes, not something that stores. It takes on the raw number-crunching tasks that suck up CPU cycles and would otherwise interfere with the trivial stuff associated with day-to-day application jobs. Using the cloud in this way addresses both the privacy and the ownership concerns inherent in any service on the web. Ownership is retained because you're not relying on the cloud service as a permanent storage mechanism. Instead, you're sending out input for computing and receiving the results in return. This, however, doesn't completely solve privacy concerns in the cloud space. With this approach, using the cloud as a raw compute machine, we can at least make the data of no value to prying eyes.

So how do we do this exactly? Do we take our existing data and send it out to the compute farm that is the cloud for processing? Remember, the cloud is really good at providing us with additional resources as we need them - there is very little overhead in creating them. You need more compute resources when your application can't process information fast enough. This is a basic constraint of all software - the speed at which output is delivered depends on how much input there is to process. Large quantities of input, a large number of queries say, require a large number of CPU cycles to keep up. Even efficient algorithms are limited by the number of instructions we're able to execute per second. The cloud can help provide additional processing resources. We add more CPU cycles as we need them, to handle the demand for more processing power.
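To put rough numbers on that reasoning, here is a back-of-the-envelope sketch in Python. The function name and every figure in it are made up for illustration; real workloads won't divide this cleanly, but it shows why more input means asking the cloud for more workers.

```python
def workers_needed(queries_per_sec, cycles_per_query, cycles_per_sec_per_worker):
    """Estimate how many workers keep up with the incoming load.

    All three inputs are hypothetical measurements, not values from any
    real provider.
    """
    demand = queries_per_sec * cycles_per_query  # CPU cycles required each second
    # Ceiling division with integers: round up to a whole worker.
    return -(-demand // cycles_per_sec_per_worker)


# Quiet day vs. busy day, assuming each worker executes 1e9 cycles per second.
quiet = workers_needed(200, 2_000_000, 1_000_000_000)
busy = workers_needed(1_200, 2_000_000, 1_000_000_000)
print(quiet, busy)
```

When demand drops back down, the same arithmetic tells me how many workers I can discard.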

What does all this mean for ownership and privacy? It means that my business doesn't need to store all its information somewhere else. It means that I can persist my information locally while using the cloud to process it. Using the cloud as a pool of CPUs means I don't need to worry should the privacy of my information be violated for whatever reason. The processing input I send to the cloud has no value in this context. Say I have an application that stores a large inventory of products. In the background, while the store isn't serving a high customer demand, I want to sort large portions of this data. I divide the work up into tasks and then send it out into the cloud to run. This heavy compute job doesn't interfere with my core business application. Not only do I still have ownership of my application's data, but I can feel at ease with what I'm sending to the cloud because the data is segmented and has no context. It only exists in the cloud temporarily as input to my task. Once finished, it all goes away. No new hardware, no new privacy or ownership worries.
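Here is a minimal sketch of that inventory sort in Python. It is an illustration under assumptions, not a real cloud client: `remote_sort` is a hypothetical stand-in for a worker running in the cloud, and a local thread pool plays the part of the cloud itself. The only thing sent out is the sort key paired with an opaque row number, so the segments carry no product names or other context. The full records never leave my own storage.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def remote_sort(chunk):
    """Hypothetical cloud worker: sorts one chunk of (key, row_number) pairs."""
    return sorted(chunk)


def split(items, size):
    """Divide the work into fixed-size tasks."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sort_inventory(inventory, key, chunk_size=2, workers=4):
    # Strip the context: only the sort key and a row number go out.
    stripped = [(item[key], row) for row, item in enumerate(inventory)]
    chunks = split(stripped, chunk_size)

    # The thread pool stands in for workers provisioned in the cloud.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(remote_sort, chunks))

    # Merge the sorted segments locally and re-attach the full records.
    merged = heapq.merge(*sorted_chunks)
    return [inventory[row] for _, row in merged]


inventory = [
    {"sku": "A1", "name": "Kettle", "price": 30},
    {"sku": "B2", "name": "Toaster", "price": 25},
    {"sku": "C3", "name": "Blender", "price": 60},
    {"sku": "D4", "name": "Mixer", "price": 45},
    {"sku": "E5", "name": "Grinder", "price": 25},
]
by_price = sort_inventory(inventory, "price")
print([item["sku"] for item in by_price])
```

The merge step is cheap compared with the sorting, so the heavy lifting happens on the workers while the store keeps serving customers. Whether a bare price is truly worthless to prying eyes depends on the business, so treat the stripping step as a starting point rather than a guarantee.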

All this comes at a cost, however. Dividing your applications so that you keep your information entirely in your possession isn't easy. There is a development effort involved here that may or may not prove useful. Some software is designed to do exactly what I'm suggesting - raw number crunching. Using the cloud for the CPU is far more effective than using it for storage. Until the risks with privacy and ownership are permanently resolved, which I think is impossible in the near term, you're just as safe using the cloud as you'd be on another hosting service.