The web used to be composed of simple pages. It still is, to a degree, but less and less so. Wikipedia, for example, is the active giant still spewing forth pages rich with information, and there are plenty of news sites, and blogs just like this one, that I wouldn't consider applications. Instead, we're craving the interactivity brought about by applications. The information is still there, just arranged and curated differently. Instead of a monolithic blob, such as this post, the pieces of information are fragmentary. Not necessarily incomplete, but defined by a workflow different from what we're used to. Will there be a need for pages in the future, when the web is composed of powerful applications that allow us to better carve out the chunks we need at that moment?
Showing posts with label information. Show all posts
Monday, February 25, 2013
Tuesday, February 12, 2013
Organizational Features
Depending on the software, the information captured needs to be organized in one way or another. Users don't supply input only to have it dumped in a large holding bin. They don't want to sift through their information when they actually need to find something. Another reason users want their information organized, probably according to a schema of their own creation, is for when they don't know exactly what they're looking for. There will come a time when you have a flash, a sudden recollection, or a partial memory of something more complete. You know it exists; you just need to come across it to bring it to the forefront of your memory. In these situations of "I don't know what I'm looking for", we browse our stored information. How we browse depends on how we organize. But organizing isn't some trivial activity. Depending on the organizational facilities we've provided to users, they just might overwhelm themselves.
Thursday, August 2, 2012
Global Storage
I'm often amazed at how much of the world's information can fit onto my laptop's hard drive. I can literally shovel in a profusion of human knowledge and think nothing of it. We observe with our senses and record our interpretations. We've been doing this really well, for a really long time. The puzzling problem used to be that there was no feasible way to search this mountain of data. Now we can search it. We can organize our observations about the universe, seemingly without restriction. Imagine you were assigned the task of recording everything you know, from kindergarten to the present, on your laptop. You have as much time as you need to type out every piece of knowledge you have encoded just beneath your skull. Not only would this take forever, but you would probably never run out of storage capacity.
The information storage capacity we have at our disposal is so vast that we seldom take note. If you write code, or otherwise work closely with devices that preserve information, the thought may cross your mind from time to time. Global storage capacity, and the lack thereof, seems such a remote problem that nobody asks the question: what will happen when we run out? The problem seems far-fetched simply because of the sheer amount of drive space available to us, either directly or through a web service.
What got me on this topic in the first place was thinking about the types of information we create and use as part of our day-to-day lives. I was organizing some of my digital media, taking my old physical media and putting it on my external 3TB drive, actually. While doing so, I could not shake the feeling that this is a lot of data, even by today's standards. The scary part, however, was my second thought: all of this is going to have to be copied over to another drive. This exercise instilled in me two inklings about persistent data. First, the more software we write and the more sophisticated it becomes, the more data is ultimately produced and stored. Second, this data should all be backed up, given that hardware will eventually fail.
These two facts, when combined, paint a somewhat scary picture. First, our ability to produce ever more data without much effort isn't likely to slow down. Second, backing up all this data, combined with our powerhouse data generation capabilities, means that we're putting massive pressure on hardware manufacturers to produce more and more disk space. The end result? A focus on creating more and more hardware on which we can store our information and back it up. Until we have hardware that doesn't fail, we need to replicate information if it is to have any hope of survival.
Talking about data backup this way leads me to think of something else this industry may be confronted with in the near future: the value of data. That is, of all the information we're worried about, and all the investment in storage hardware, both primary and replicated, how much of it can we jettison? I think at some point, companies are going to need to start asking these types of questions. What are the trade-offs that will result in less information produced, weaker replication requirements, or affordable disk space that will enable them to continue as they do today?
These are big questions, none of which have an easy answer because none pose an immediate threat to the daily operations of institutions that rely heavily on information technology. Storage is cheap. Replication is cheap. Therefore, it might be helpful as a thought experiment to simply adjust one of the constants we're used to.
Let's start on the information production end of things. If we think about that, we might be able to answer the simple question of why we have so much data to begin with. Because storage is so cheap, it seems the question should be: why shouldn't we have more data? After all, isn't that how you gain a competitive edge? More data means more opportunity to analyze it and generate insights that lead to, well, more data. We have so many inputs making their way through the pipes and filters of our information systems that it's difficult to measure exactly how much information our companies generate. Again, we don't think about it because we don't need to. Storage is cheap.
As time goes by, the value of information decreases. Something that is relevant today becomes less relevant tomorrow. Even less so the day after that. And so on. Theoretically, information is never worthless, since you can always devise some computation that takes into account the events of a decade ago. Think about the field of statistical analysis. Imagine that the statistical software we have in place today used data produced at today's rate, starting from fifty years ago. Would we even have the processing power to handle it? Hard drives aren't the only scarce resource we need to worry about; raw processing power is another. Just as a single hard disk only has so much capacity, so too does a CPU only have so much throughput. But the nature of what we're storing, the semantic content, plays a role here too.
I mentioned that it's easy to produce a lot of data today, on any computer. But that doesn't equate to valuable data that will be interesting to compute with down the road. For instance, how much of the world's information is opaque binary data, only meaningful to human eyes when rendered on a screen? Data such as this needs to be treated differently because of its semantic content. If a company stores opaque data such as image and video files, it is highly unlikely to execute large processing jobs on that data on a daily basis, if ever. So we need to think about isolating these resources from our raw processing power. We also need to think about demand for these resources. In any given system, there is likely more information than there are users to look at it. That leads me to believe that we can do a better job of putting a value system in place for data. That is, high demand should equate to value. Value should equate to allocation of better hardware resources for that particular data.
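To make the "demand equals value, value equals better hardware" idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `StoredItem` type, the 30-day read count as the demand signal, the thresholds, and the tier names are all hypothetical, not a description of any real storage system.

```python
from dataclasses import dataclass


@dataclass
class StoredItem:
    name: str
    reads_last_30_days: int  # hypothetical demand signal


# Hypothetical tiers, ordered from best hardware to cheapest.
# Each entry is (minimum reads needed, tier name).
TIERS = [
    (1000, "ssd"),
    (10, "disk"),
    (0, "archive"),
]


def assign_tier(item: StoredItem) -> str:
    """Pick the first tier whose demand threshold the item meets."""
    for min_reads, tier in TIERS:
        if item.reads_last_30_days >= min_reads:
            return tier
    return TIERS[-1][1]  # fallback for unexpected (e.g. negative) counts


items = [
    StoredItem("session-index", 52000),
    StoredItem("2012-report.pdf", 42),
    StoredItem("old-video.mov", 0),
]
placement = {item.name: assign_tier(item) for item in items}
```

The point of the sketch is only that placement is driven by observed demand rather than by which application happened to write the data.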
What all of this amounts to is managing data independently of the applications that produce it and those that consume it. Not just administering the supporting software that sits between the data and the hard disk, and not just ensuring that the data remains replicated and is always available. Active data management, as opposed to treating data passively. When we take the passive perspective toward the data we store, we're not taking into account the liveliness of the data. Are we using valuable hardware resources to serve up sterile data? Imagine, should it actually be feasible, having that type of monitoring in place: an observer that lets us know that our software is working harder than it has to, sifting through piles and piles of data that isn't serving the greater good.
The process I'm describing is almost like that of data warehouses. You use a data warehouse to perform operations on data outside of the context in which it originated. For instance, your warehouse might contain request logs for the month of January. That means you can safely perform analyses on those logs without impacting the real application or its data. This is great, but it still involves a batch-style migration of data, a somewhat passive approach to storing data effectively. That means, instead of moving data that doesn't impact the current state of the application to a better location, we're leaving it in place for a predetermined amount of time. During that time, we're working with sub-optimal data. As much like micro-optimization as these principles may sound, they will not be so micro in a broader context. At a future time, when we have more and more information to store, we'll wish we had taken a closer look at some of these seemingly trivial improvements.
One thing about information storage around the globe that isn't likely to slow down is the rate at which we produce data. So we'll keep acquiring more storage devices. We'll only be able to keep this up for so long before our software cannot cope. We need smarter solutions for how we store our data, not just ways to store more of it. Not producing information isn't the answer, because that would deter creativity and the general advancement of knowledge. But doing research now into how we can automate keeping valuable data visible, and archiving it as it degrades, seems like a good idea to me.
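As a rough illustration of what automating this might look like, here is a small Python sketch that separates live data from data that has gone stale. It assumes we track a last-access time per item; the keys, the dates, and the 90-day idle cutoff are hypothetical values chosen for the example, and a real system would need a richer notion of value than idle time alone.

```python
from datetime import datetime, timedelta


def split_live_and_stale(last_access, now, max_idle=timedelta(days=90)):
    """Partition keys into (live, stale) by time since last access."""
    live, stale = [], []
    for key, accessed_at in last_access.items():
        if now - accessed_at > max_idle:
            stale.append(key)  # candidate for archiving
        else:
            live.append(key)   # keep visible on the primary store
    return sorted(live), sorted(stale)


# Hypothetical access log: item key -> time it was last read.
now = datetime(2012, 8, 2)
last_access = {
    "orders/2012-07": datetime(2012, 7, 30),
    "logs/2012-01": datetime(2012, 1, 31),
    "photos/2009": datetime(2010, 3, 14),
}
live, stale = split_live_and_stale(last_access, now)
```

Run continuously rather than as a scheduled batch, a check like this is one way the active approach could differ from the warehouse-style migration described earlier.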
Tuesday, September 13, 2011
What URIs Say About Resources
The web today is made up of boundless resources, each identified by its URI. I favor the term URI over URL because it denotes identity. So what information can we gather from the URI alone? Are we better off calling it a URL, since the main purpose is to do a lookup? I don't think so, because the fact that URIs are used to look something up is implied knowledge. It's the identity of the resource we want to learn about. But can we attain this type of information from the URI alone, or is it a meaningless question? URIs should be designed to give readers foreknowledge of what the resource is.
Uniqueness
URIs are unique. That is, there is a one-to-one mapping between a URI and the resource it points to. There are, of course, exceptions to this rule. For example, you might have a radio station web application that displays the currently playing artist. During the artist's air time, there might be an artist/current URI that points to the artist's detail page. Alternatively, there might be a single canonical URI associated with the artist's page: artist/123, for instance.
So in the case of the former — where the artist can have two URIs pointing to their page — there is no one-to-one mapping of URI to resource. There might even be more than two URIs pointing to the artist's page — charts/top, for instance. But these URIs are unique in that they're referring to one resource. The underlying resource might change — the artist/current URI stays the same but the resource it points to will change frequently.
The artist/current URI is an example of a virtual resource. To the external agent, this appears to be where the resource lives. But this isn't where the resource lives, because it isn't its canonical URI. The URI artist/123 is static and probably will never change. The virtual URI points to the canonical URI.
To better illustrate this concept, let's talk in hockey terms. Imagine the center for the home team. He is number 15. So his canonical resource URI looks like home/15. Now imagine that you want a URI for the home team player currently in possession of the puck. Our star center has the puck — so we can represent this as a virtual URI — puck/control. There isn't anything special about this URI — it just contains some logic that points to the home/15 URI.
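The hockey example can be sketched in a few lines of Python. This is only an illustration under some assumptions: the lookup tables, the `game_state` dictionary, the `resolve` function, and the second player at home/22 are all made up for the example. Only home/15 and puck/control come from the text above.

```python
# Canonical URIs map directly to stored resources.
CANONICAL = {
    "home/15": {"position": "center", "number": 15},
    "home/22": {"position": "winger", "number": 22},  # hypothetical teammate
}

# State that the virtual URI's logic consults; it changes during play.
game_state = {"puck_holder": "home/15"}

# Virtual URIs hold no data of their own, only logic that
# returns the canonical URI they currently point to.
VIRTUAL = {
    "puck/control": lambda: game_state["puck_holder"],
}


def resolve(uri: str) -> str:
    """Return the canonical URI behind `uri`."""
    if uri in VIRTUAL:
        return VIRTUAL[uri]()
    if uri in CANONICAL:
        return uri  # canonical URIs resolve to themselves
    raise KeyError(f"unknown URI: {uri}")


first = resolve("puck/control")    # the center has the puck
game_state["puck_holder"] = "home/22"
second = resolve("puck/control")   # same URI, different resource
```

The virtual URI never changes, but what it resolves to does, while each canonical URI always resolves to itself.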
Meaning
So it turns out that URIs carry some important information after all. And this is what I'm trying to figure out. Exactly how much information is of value to the reader of the URI? In theory, every URI on the web could be some arbitrary string — CD4F2ACF4, for example. It wouldn't matter because information is properly linked to other information. The readers don't care what the URI is — they only care about the anchor text.
I think this might have some degree of truth behind it, but the reality is that people do care about the URI and what it looks like. I know I do. In fact, before clicking on links, I find myself hovering over them to see where they go, trying to examine the URI to guess its worth before I go there. Mind you, I take a very active interest in URI design, so I doubt every single user will scrutinize, or even care for that matter, what a URI looks like.
But it turns out that even the most arbitrary URIs make subtle attempts to attach meaning. Consider our earlier URI — artist/123. What do we know about this page before visiting it? Even if you're a lay user — you're probably able to guess that it has something to do with an artist. We achieve two things with this URI — the vocabulary and the multiplicity.
The vocabulary establishes the kind of thing users can expect to see should they choose to follow the link — in this case, an artist. The multiplicity is established in two ways. First, we're explicitly choosing the term artist, not artists. Second, the reader can see the trailing identifier — 123.
So the most meaningful piece of information in this URI is artist. The arbitrary part, arbitrary from the reader's perspective, is both meaningless and important at the same time. The arbitrary identifier assigned to the resource is an important part of the URI — it's what makes it canonical. The number itself has no meaning to the user but it has utility in sharing that URI with others.
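Here is a small Python sketch of what a reader, or a program, can pull out of a URI like artist/123. The `UriParts` type and the `describe` function are hypothetical names, and the test for a singular noun (no trailing "s") is a deliberately naive assumption that only works for simple English vocabulary.

```python
from typing import NamedTuple, Optional


class UriParts(NamedTuple):
    vocabulary: str            # the kind of thing, e.g. "artist"
    identifier: Optional[str]  # the trailing segment, e.g. "123"
    singular: bool             # does the URI suggest exactly one resource?


def describe(uri: str) -> UriParts:
    """Split a simple 'noun/id' style URI path into its parts."""
    segments = [s for s in uri.strip("/").split("/") if s]
    if not segments:
        raise ValueError("empty URI path")
    vocabulary = segments[0]
    identifier = segments[1] if len(segments) > 1 else None
    # Naive multiplicity check: a singular-looking noun plus an identifier.
    singular = identifier is not None and not vocabulary.endswith("s")
    return UriParts(vocabulary, identifier, singular)


parts = describe("artist/123")
```

Note that the sketch treats any trailing segment as the identifier, so it cannot tell a canonical URI such as artist/123 from a virtual one such as artist/current. That distinction lives in the application, not in the string.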
Evolution
It turns out that URI design has evolved quite a lot since the emergence of the web. We've seen a lot of resources — immeasurable resources — created over the years. This directly impacts our ability to create meaningful URIs for users. If it were simply a matter of incrementing the resource count once a new resource is created, we'd be all set. Unfortunately, that isn't true at all. There are new types of resources that need to be created as applications and organizations evolve. These new resource types are going to form an ever more complex mesh of relationships — links to other resources both new and old.
These new resource types — once invented to help solve the technological problems of the day — will also need virtual resources. The virtual resources are the logic of the web — they're not real data, just pointers to other canonical resources that store the real information that external agents update and use.
Keeping URIs meaningful for users is important as available information continues to expand. If we succumb to churning out completely arbitrary URIs, we're taking a step backward. Likewise, the URI itself is real data that needs to be shared and passed around, so we must be careful to add meaning, but not too much.
Labels: design, information, resource, uri