Friday, October 12, 2012

Privacy In The Cloud

Not much thought goes into privacy control over the pieces of information our virtual machines expose when running on a cloud provider's infrastructure. I'm not talking about the internal data that only the guest operating system has privileged access to. That's a security concern. I'm referring to the mountain of data that we generate as a by-product of operating a virtual infrastructure: the meta-data. This infrastructure is offered to customers as a service, and just like any other web service, the provider is going to collect data on all its tenants. What does this data look like? What information can be derived from it, and do we as tenants have any say in the matter?

Social networks have privacy controls now, mostly thanks to Facebook bringing issues of privacy to light. But even in the social domain, we don't have a clue what information about our interactions is accessible. It's protected, most likely, but it exists nonetheless. If I'm operating mission-critical virtual infrastructures, I need to know what information exists. What is my software stack generating, and do I have access to it? I believe part of the challenge cloud providers face in instituting any kind of privacy control is sophistication. It's easy, in theory, to give social network users control over the data they publish: what is visible, and to whom? More importantly, users have a much finer level of control over what they decide to publish socially in the first place. That is, unless you post while you're asleep, in which case you have significantly less control over what gets put on the web. But that isn't far off from how virtual infrastructures work. Virtual machines are living things, constantly working, constantly collaborating with other virtual machines. They don't sleep. As far as I'm concerned, privacy controls in this scenario go beyond anything that could reasonably be implemented, and that's probably why we don't see them in practice.

Let's think about what information is collected from my virtual infrastructure. There is the obvious stuff, the essential state information that reflects the health of individual virtual components and the overall health of the system. It goes without saying that this is disclosed to the tenant. I don't need to ask because I can see it. I can also see more fine-grained state information, like my memory consumption or my IO throughput. This stuff is probably all documented by the provider, because I can access it either through an API or through the management portal. But what I really need to do is take this accessible information and think about what else might be there.

From the provider's perspective, all this data that's exposed to the tenants is useful. It can be used to generate more interesting profiles about tenants, groups of tenants, and groups of virtual resources. The possibilities are endless. If we have characteristics of the current state of a tenant's infrastructure, in addition to past states, we can ultimately offer a better service to our tenants. This is a key cloud advantage: providers can coach and instruct tenants based on what those tenants are trying to achieve. Given the right data, this can be done in a semi-automated fashion. I find the prospect of this optimal operating environment titillating, to say the least. But I also put myself in the mindset of the tenant frequently, which leads me to the ramifications of this data existing at all. For example, let's say we're a provider and we've developed an algorithm that takes a collection of virtual machine data samples and builds a profile. This profile ultimately benefits the tenant because we can use it to make decisions for them that they otherwise couldn't make. We could use the profile to offer suggestions or warn of potential problem scenarios. To make this great capability a reality, we need to store meta-data about the tenant's deployed environment.
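
To make the idea concrete, here is a rough sketch of what such a profile-building step might look like. Everything in it is hypothetical: the Sample fields, the build_profile function and the 90 percent warning threshold are assumptions made purely for illustration, not how any particular provider does it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One hypothetical meta-data sample for a single virtual machine."""
    vm_id: str
    memory_pct: float  # memory in use, 0-100 (assumed unit)
    io_mbps: float     # IO throughput in MB/s (assumed unit)


def build_profile(samples, memory_warn_pct=90.0):
    """Summarize samples per VM and attach simple warnings.

    The threshold is an illustrative assumption, not a recommendation.
    """
    # Group the raw samples by virtual machine.
    by_vm = {}
    for sample in samples:
        by_vm.setdefault(sample.vm_id, []).append(sample)

    # Reduce each group to a small profile the provider could act on.
    profile = {}
    for vm_id, vm_samples in by_vm.items():
        avg_memory = mean(s.memory_pct for s in vm_samples)
        peak_io = max(s.io_mbps for s in vm_samples)
        warnings = []
        if avg_memory >= memory_warn_pct:
            warnings.append("sustained high memory use")
        profile[vm_id] = {
            "avg_memory_pct": avg_memory,
            "peak_io_mbps": peak_io,
            "warnings": warnings,
        }
    return profile


samples = [
    Sample("vm-a", 92.0, 10.0),
    Sample("vm-a", 96.0, 30.0),
    Sample("vm-b", 40.0, 5.0),
]
profile = build_profile(samples)
```

Notice that the profile is new data. It didn't exist before the provider computed it, and it only exists because the samples were stored somewhere first.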

Encapsulation is probably the best principle to describe what I mean when I talk about these capabilities of cloud providers. I have a goal as a tenant of the cloud, and the provider provides the interface to accomplish that goal while hiding the implementation details. Those implementation details are part of the provider's intellectual property, and they're going to protect it. That makes it tricky to disclose what data is collected, let alone offer controls over what data is sampled and what isn't. If data about my infrastructure exists as a result of a particular provider capability, that presents a concern. Not a threat, merely a potential privacy issue. I say potential because we're not talking about security, but if new data exists that didn't before, that brings new potential for problems. The benefits almost certainly outweigh the risk, and so the question in my mind is this: how can you disclose what information is there while maintaining a seamless infrastructure offering that knows how to best place, configure and ultimately operate the virtual resources it holds?

I don't think it makes sense for providers to allow too much fine-tuning of what data they're allowed to sample from their customers' virtual machines. That defeats the whole purpose of advancing cloud technology. Too restrictive an approach means that the tools the provider has put in place cannot operate effectively, and it might even have a negative impact on other tenants, simply due to an incomplete picture. What providers can do is disclose what ingredients they're using: of the data the tenant already has access to, this is what we're using to carry out this particular capability. There is no risk there, and the provider retains any proprietary functionality. This informs tenants, allowing them to ask the right questions. And they have a responsibility to do so, to determine the right fit for their infrastructure. If you've informed the tenant about what you're sampling, they can draw their own conclusions about what the privacy risks are. When it comes to privacy in the cloud, you have to establish a trust relationship between provider and tenant first and foremost. Secondary is the detailed information that gets collected and stored.
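
As a rough illustration of what disclosing the ingredients could mean, the sketch below lists, for each provider capability, which tenant-visible metrics it samples, without saying anything about how they're used. The capability names, metric names and the disclose function are all made up for the example.

```python
# Metrics the tenant can already see through the API or portal (hypothetical names).
TENANT_VISIBLE = {"memory_pct", "io_mbps", "vm_state"}

# Which metrics each provider capability samples (hypothetical mapping).
CAPABILITY_INPUTS = {
    "placement_advice": {"memory_pct", "io_mbps"},
    "health_warnings": {"vm_state", "memory_pct"},
}


def disclose(capabilities, visible):
    """Return the ingredient list for each capability.

    Only inputs the tenant can already see are listed; the logic that
    consumes them stays hidden behind the capability name.
    """
    return {
        name: sorted(inputs & visible)
        for name, inputs in capabilities.items()
    }


report = disclose(CAPABILITY_INPUTS, TENANT_VISIBLE)
```

A tenant reading a report like this learns what is being sampled and for which purpose, which is enough to start asking the right questions.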