Friday, February 3, 2012

The Document Concept

How do you keep a chunk of information self-contained?  How do you package an accumulation of thoughts and facts into one whole?  Perhaps more importantly, now that the world is connected over the Internet, how do we make those information packages available for consumption?  The answer, of course, is the document: a generalized abstraction that represents a persistent embodiment of information we've collected.

Is the document concept something we can continue with, given the degree of connectedness we all endure?  That is, are documents a suitable abstraction with which to enclose and exchange our information?  Most people understand the concept.  Documents are a generalized notion: you have a document and a suitable application that'll read and possibly alter it.  The drawback is that they are in fact a black box.  Until a document exists somewhere locally, it doesn't hold any meaning.  It is a closed source of information.

Locally Addressable
If you're looking for something, it helps if it's been labeled.  Maybe not so much if you're searching an area that has a limited number of items and if each item is uniquely identifiable by its form alone.  But if you're seeking out an item that is surrounded by similar-looking things, a label will identify it.  Without the ability to label things, we're required to do some deep inspecting on each individual item.  This isn't without its downsides: the time consumption alone would render us unavailable for anything else.

And so the theory holds inside digital documents: labels can identify what they contain, who they belong to, where they came from, and why they exist.  A document stored in a software system is simply a file.  The key trait documents inherit from their older file ancestors is that of being addressable.  In this context, an address is synonymous with a label you might attach to a physical object, a storage bin for instance.  A document has a name, a name that might reflect something about the information contained within.  That is, on your computer, if you've got a dozen or so files in a single folder, you need a means to differentiate between them.  Without an outward-facing label, the owner would never be able to find what they're looking for without opening each document, looking through the contents, and making sure that it is in fact the information they sought in the first place.

So our documents have an address, one we've designated, possibly injecting some meta-data about the document's contents in the process.  This really isn't unique to documents; any file on any file system has an address.  So what?  Bear in mind, however, that a document is simply a concept, as is a file, though a file is more concrete.  The idea behind the document abstraction is more geared toward human production and consumption.  Sure, applications read documents too, but I think it makes more sense not to confuse the terminology where humans are the creator and the reader.  Having said that, when we think addressable, we're thinking of things that we can point to.  This includes things inside the document itself.

For example, a document might have one or more headings, it might have a glossary, or it might be a spreadsheet with individually addressable cells.  So the idea that a document is addressable extends down into what the document contains: its components are also addressable.  No matter who is consuming these documents, human or machine, labels help guide them.  They are meta-data about the information, instructing the reader on its meaning.
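
As a rough illustration of components being addressable inside a document, here is a minimal Python sketch.  The document layout, the `resolve` helper, and the `section/label` address style are all hypothetical, made up for this example rather than taken from any real document format.

```python
# Hypothetical model of a document whose parts carry local labels.
# The structure and names here are assumptions for illustration only.
document = {
    "name": "budget-2012",
    "headings": {
        "introduction": "Why we budget",
        "summary": "Where the money went",
    },
    "cells": {"A1": "Rent", "B1": 1200},
}


def resolve(doc, address):
    """Look up a component by a local address such as 'cells/B1'."""
    section, _, label = address.partition("/")
    return doc[section][label]


# The address only means something once we hold the document itself.
rent = resolve(document, "cells/B1")
heading = resolve(document, "headings/summary")
```

Notice that `resolve` needs the document passed in.  The label `cells/B1` on its own points at nothing.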

The addresses centered around the document concept are only local to where the document is currently stored.  This is great for portability, for making sure that no matter where the information ends up, there is always going to be a label that describes a particular data item.  And this is why documents are only considered to be locally addressable.  The address of something pertaining to a document is meaningless without the document itself, since the document provides the context.  This is meaningful, still, but if we focus too much on ensuring that all our information is stuffed into discrete packages before it moves around, we lose our grip on what it means to be globally connected.

Globally Connected
Having things that are locally addressable is useful only to an extent.  Documents are encapsulated objects containing the information we're addressing, so the context is local in scope and the addressable items are limited.  One of the great attributes of the web is that we're able to define canonical addresses that point to information, regardless of where it is physically located.  The address abstraction we're presented with, when working on the web, enables a broader context with which we can point to information and make use of it.
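
To make the contrast with local labels concrete, here is a small Python sketch using the standard library's `urllib.parse`.  The `example.com` URL and the `global_address` helper are assumptions for illustration.  The idea is simply that a canonical web address plus a fragment gives a label that can be pointed to from anywhere.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical canonical address for a document published on the web.
canonical = "https://example.com/docs/budget-2012"


def global_address(base, local_label):
    """Combine a canonical document address with a label local to it."""
    return urljoin(base, "#" + local_label)


address = global_address(canonical, "summary")

# Anyone holding the address can split it back into the document
# location and the local label, without holding the document itself.
location, label = urldefrag(address)
```

The local label survives as the fragment, but it now travels together with the context that gives it meaning.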

Can we keep intact the same benefits that the document concept affords?  Things such as information encapsulation are indeed valuable.  The portable document format lets our information be consumed by anyone, simply because everything associated with that information, including the presentation rules, is bundled together.  And, of course, let's not forget the obvious advantage traditional documents have over other resources found on the web: they work in an offline environment.  But like other handy document features, this is becoming less of a concern as well, since people have access to a network connection more often than not.

Being connected, having a link to the community of others with which we share information, is the norm.  What does this mean for the fundamental document property of self-containment?  It seems to me that this priority of being able to ship information back and forth as though we're exchanging physical goods isn't so prudent.  Maybe the document concept needs to be broadened slightly, to include some of the more flexible features of the web.

Creating web versions of documents, where the form of the document is assumed on some server and its editors and readers are presented with something browser-friendly, isn't a new idea.  In fact, it was envisioned to solve the exact issues I'm talking about now: to take a concept traditionally thought of as a local item, and make it globally addressable.  Instead of creating and modifying information locally, and then shipping it off to target recipients, we all work with the canonical document.  Now we don't send documents, we point to their address.

Is this model perfect?  Perhaps we're starting to move away from the document abstraction altogether.  The whole concept of a document is centered around self-contained information that we may choose to send, or we may not.  Another area in which the document concept is falling behind in our daily operations on the web is the very fact that documents are so self-contained.  Maybe we need to reevaluate what is considered a tightly knit unit of information.  Maybe the document needs to be decomposed into a new concept.  Something with smaller parts, all of which are addressable.