Tuesday, September 13, 2011

What URIs Say About Resources

The web today is made up of boundless resources, each identified by its URI.  I favor the term URI over URL because it denotes identity.  So what information can we gather from the URI alone?  Are we better off calling it a URL since the main purpose is to do a lookup?  I don't think so.  That URIs are used to look something up is implied knowledge; it's the identity of the resource we want to learn about.  But can we attain this type of information from the URI alone, or is it a meaningless question?  URIs should be designed to give the reader some foreknowledge of what the resource is.

Uniqueness
URIs are unique.  That is, there is a one-to-one mapping between a URI and the resource it points to.  There are, of course, exceptions to this rule.  For example, you might have a radio station web application that displays the currently playing artist.  During the artist's air time, the application might have an artist/current URI that points to the artist's detail page.  Alternatively, there might be a single canonical URI associated with the artist's page: artist/123, for instance.

So in the case of the former, where the artist can have two URIs pointing to their page, there is no one-to-one mapping of URI to resource.  There might even be more than two URIs pointing to the artist's page: charts/top, for instance.  But these URIs are still unique in that each one refers to exactly one resource at any given moment.  The underlying resource might change, though.  The artist/current URI stays the same, but the resource it points to will change frequently.

The artist/current URI is an example of a virtual resource.  To the external agent, this appears to be where the resource lives.  But this isn't where the resource lives; it isn't its canonical URI.  The URI artist/123 is static and probably will never change.  The virtual URI points to the canonical URI.

To better illustrate this concept, let's talk in hockey terms.  Imagine the center for the home team.  He is number 15.  So his canonical resource URI looks like home/15.  Now imagine that you want a URI for the home team player currently in possession of the puck.  Our star center has the puck — so we can represent this as a virtual URI — puck/control.  There isn't anything special about this URI — it just contains some logic that points to the home/15 URI.
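
Here is a rough sketch of that idea in Python.  Only the home/15 and puck/control paths come from the example above; the second player (home/22), the game_state dictionary, and the function names are all made up for illustration.  This is one way such a lookup might be wired up, not a description of any particular framework.

```python
# Canonical resources: static URIs that hold the real data.
canonical_resources = {
    "home/15": {"team": "home", "number": 15, "position": "center"},
    # Assumed second player, added only so the puck can change hands.
    "home/22": {"team": "home", "number": 22, "position": "winger"},
}

# Hypothetical game state that the virtual URI consults.
game_state = {"puck_holder": "home/15"}


def puck_control():
    """Logic behind the virtual URI: return the puck holder's canonical URI."""
    return game_state["puck_holder"]


# Virtual resources store no data of their own, only logic.
virtual_resources = {"puck/control": puck_control}


def resolve(uri):
    """Return the canonical URI for any known URI, virtual or canonical."""
    if uri in virtual_resources:
        return virtual_resources[uri]()
    if uri in canonical_resources:
        return uri
    raise KeyError(f"unknown URI: {uri}")
```

Calling resolve("puck/control") gives back home/15 while our center has the puck.  If game_state is updated to point at another player, the same virtual URI resolves to that player's canonical URI, while home/15 itself never changes.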

Meaning
So it turns out that URIs carry some important information after all.  And this is what I'm trying to figure out.  Exactly how much information is of value to the reader of the URI?  In theory, every URI on the web could be some arbitrary string — CD4F2ACF4, for example.  It wouldn't matter because information is properly linked to other information.  The readers don't care what the URI is — they only care about the anchor text.

I think this might have some degree of truth behind it, but the reality is that people do care about the URI and what it looks like.  I know I do.  In fact, before clicking on links, I find myself hovering over them to see where they go, trying to examine the URI to guess its worth before I go there.  Mind you, I take a very active interest in URI design, so I doubt every single user will scrutinize, or even care about, what a URI looks like.

But it turns out that even the most arbitrary URIs make subtle attempts to attach meaning.  Consider our earlier URI — artist/123.  What do we know about this page before visiting it?  Even if you're a lay user — you're probably able to guess that it has something to do with an artist.  We achieve two things with this URI — the vocabulary and the multiplicity.

The vocabulary establishes the kind of thing users can expect to see should they choose to follow the link — in this case, an artist.  The multiplicity is established in two ways.  First, we're explicitly choosing the term artist, not artists.  Second, the reader can see the trailing identifier — 123.

So the most meaningful piece of information in this URI is artist.  The arbitrary part, arbitrary from the reader's perspective, is both meaningless and important at the same time.  The arbitrary identifier assigned to the resource is an important part of the URI — it's what makes it canonical.  The number itself has no meaning to the user but it has utility in sharing that URI with others.
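
To make the split between the meaningful part and the arbitrary part concrete, here is a small Python sketch that pulls the two pieces out of a path like artist/123.  The UriParts type, the describe function, and the trailing "s" test for plurals are my own assumptions for illustration; real URIs won't always follow this two-segment shape.

```python
from typing import NamedTuple, Optional


class UriParts(NamedTuple):
    vocabulary: str            # the meaningful term, e.g. "artist"
    identifier: Optional[str]  # the arbitrary part, e.g. "123"
    singular: bool             # a guess at the multiplicity


def describe(path: str) -> UriParts:
    """Split a simple 'vocabulary/identifier' path into its parts."""
    segments = [s for s in path.strip("/").split("/") if s]
    if not segments:
        raise ValueError("empty path")
    vocabulary = segments[0]
    identifier = segments[1] if len(segments) > 1 else None
    # Crude heuristic (an assumption): a trailing "s" suggests a collection.
    singular = not vocabulary.endswith("s")
    return UriParts(vocabulary, identifier, singular)
```

For artist/123 this yields the vocabulary artist, the identifier 123, and a singular reading.  The identifier is passed through untouched because it carries no meaning to inspect; its only job is to make the URI canonical and shareable.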

Evolution
It turns out that URI design has evolved quite a lot since the emergence of the web.  We've seen a lot of resources — immeasurable resources — created over the years.  This directly impacts our ability to create meaningful URIs for users.  If it were simply a matter of incrementing the resource count once a new resource is created, we'd be all set.  Unfortunately, that isn't true at all.  There are new types of resources that need to be created as applications and organizations evolve.  These new resource types are going to form an ever more complex mesh of relationships — links to other resources both new and old.

These new resource types — once invented to help solve the technological problems of the day — will also need virtual resources.  The virtual resources are the logic of the web — they're not real data, just pointers to other canonical resources that store the real information that external agents update and use.

Keeping URIs meaningful for users is important as available information continues to expand.  If we succumb to churning out completely arbitrary URIs, we're taking a step backward.  At the same time, the URI itself is real data that needs to be shared and passed around, so we must be careful to add meaning, but not too much.