from Hacker News

Choosing between names and identifiers in URLs

by bussetta on 10/17/17, 5:25 PM with 153 comments

by sgentle on 10/17/17, 11:50 PM
The root of the issue here is that URLs are trying to be human-meaningful and machine-meaningful at the same time, but those requirements are fundamentally incompatible.
Humans work well with ambiguity and context. When your coworker says "Bob's birthday is this weekend", you know she means her husband Bob, not Bob from accounting who nobody likes. And you even prefer that system to having an unambiguous human identifier, even a friendly one like "Bob-4592-daring-weasel-horseradish".
Machines, on the other hand, hate ambiguity and context. Every bit of context is an extra bit of state that has to be stored somewhere, and now all your results are actually statistical guesses - how inelegant!
In the early days of computing, there was no separation between the internals of the machine and its interface. If you worked on a computer, you were as much the mechanic as the driver. We got used to usernames, filenames, and hostnames because they were a decent compromise; they were meaningful enough to humans, and unambiguous enough for machines, so we could use them as a kind of human-computer pidgin.
But we don't need them anymore, and they were never really very good at either job anyway. Google's (probably accidental) discovery was that we were using the web wrong. Everyone was building web directories and portals because they thought that URLs weren't discoverable, but the real problem was that they weren't usable. Search was the first human interface to the web.
So Google's going to kill the URL, Facebook's going to kill the username, and someone (apparently not Microsoft) is going to kill the filename. There'll be much wailing and gnashing of teeth from the old guard while it happens, but someday our grandchildren will grow up never having to memorise an arbitrary sequence of characters for a computer, and I think that's a future to look forward to.
by yathern on 10/17/17, 6:30 PM
Great post - I quite like the stackoverflow.com style of `stackoverflow.com/questions/<question-id>/<question-title>`, where <question-title> can be changed to anything, and the link still works.
This allows for easy URL readability, while also having a unique ID.
In the context of this post (the library example) that would look like
library.com/books/1as03jf08e/Moby-Dick/
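A minimal sketch of how such a route could be resolved, assuming a hypothetical in-memory `BOOKS` table and a made-up `resolve_book` helper (neither is from the article or from Stack Overflow's code). Only the ID segment takes part in the lookup; the title segment is decoration.

```python
import re

# Hypothetical catalogue keyed by the opaque ID from the example URL.
BOOKS = {"1as03jf08e": {"title": "Moby-Dick"}}

# /books/<id>/<optional-title>/ ; the title segment is matched but never used.
BOOK_PATH = re.compile(
    r"^/books/(?P<book_id>[A-Za-z0-9]+)(?:/(?P<slug>[^/]*))?/?$"
)


def resolve_book(path):
    """Return the book for a path, ignoring whatever title follows the ID."""
    match = BOOK_PATH.match(path)
    if match is None:
        return None
    return BOOKS.get(match.group("book_id"))
```

With this, `/books/1as03jf08e/Moby-Dick/` and `/books/1as03jf08e/anything-else` resolve to the same record, while an unknown ID resolves to nothing regardless of the title.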
by nayuki on 10/17/17, 7:57 PM
The article talks about referring to resources by using URLs containing opaque ID numbers versus URLs containing human-readable hierarchical paths and names. They give examples like bank accounts and library books.
This problem about naming URLs is also present in file system design. File names can be short, meaningful, context-sensitive, and human-friendly; or they can be long, unique, and permanent. For example, a photo might be named IMG_1234.jpg or Mountain.jpg, or it can be named 63f8d706e07a308964e3399d9fbf8774d37493e787218ac055a572dfeed49bbe.jpg. The problem with the short names is that they can easily collide, and often change at the whim of the user. The article highlights the difference between the identity of an object (the permanent long name) versus searching for an object (the human-friendly path, which could return different results each time).
For decades, the core assumption in file system design has been to provide hierarchical paths that refer to mutable files. A number of alternative systems have sprouted which upend this assumption - by having all files be immutable, addressed by hash, and searchable through other mechanisms. Examples include Git version control, BitTorrent, IPFS, Camlistore, and my own unnamed proposal: https://www.nayuki.io/page/designing-a-better-nonhierarchica... . (Previous discussion: https://news.ycombinator.com/item?id=14537650 )
Personally, I think immutable files present a fascinating opportunity for exploration, because they make it possible to create stable metadata. In a mutable hierarchical file system, metadata (such as photo tags or song titles) can be stored either within the file itself, or in a separate file that points to the main file. But "pointers" in the form of hard links or symlinks are brittle, hence storing metadata as a separate file is perilous. Moreover, the main file can be overwritten with completely different data, and the metadata can become out of date. By contrast, if the metadata points to the main data by hash, then the reference is unambiguous, and the metadata can never accidentally point to the "wrong" file in the future.
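A rough sketch of that idea, assuming SHA-256 as the hash and a toy in-memory `ContentStore` class (both are illustrative choices, not taken from any of the systems listed above). Metadata is attached to the hash, so it can only ever describe the exact bytes it was written for.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Name a blob by the SHA-256 hex digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy immutable store: blobs are addressed by hash, tags point at hashes."""

    def __init__(self):
        self._blobs = {}  # hash -> bytes
        self._tags = {}   # hash -> set of labels

    def put(self, data: bytes) -> str:
        key = content_id(data)
        self._blobs[key] = data  # identical bytes always land under the same key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def tag(self, key: str, label: str) -> None:
        self._tags.setdefault(key, set()).add(label)

    def tags(self, key: str) -> set:
        return set(self._tags.get(key, ()))
```

Storing different bytes yields a different key, so a tag written for one photo cannot silently end up describing an overwritten file.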
by wyndham on 10/17/17, 6:37 PM
The article's main insight: "URLs based on hierarchical names are actually the URLs of search results rather than the URLs of the entities in those search results".
by andrewstuart2 on 10/17/17, 7:35 PM
"The case for identifiers" is really more of a case for surrogate keys. Surrogate keys need not be opaque, but rather are distinguished by the fact that they're assigned by an authority and may be completely unrelated to the properties of an entity.
Natural keys, meaning entity identification by some unique combination of properties, are hard to get right (oops, your email address isn't unique, or it's a mailing list) and a pain to translate into a name (`where x = x' and y = y' and z = z'`, or `/x/x'/y/y'/z/z'`, etc.).
Surrogate keys, on the other hand, make it easy to identify one and only one object forever, but only so long as everybody uses the same key for the same thing.
And as mentioned in the article, the most appropriate is usually both. Often you don't have the surrogate key, so you need to look up by the natural key, but when you do have the surrogate key, it's fastest and most likely to be correct if you use that in your naming scheme.
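To make the distinction concrete, here is a small sketch assuming a hypothetical `Registry` class that assigns a random UUID as the surrogate key and indexes email as the natural key. Lookup by surrogate key returns at most one record; lookup by natural key is really a search and may return several.

```python
import uuid


class Registry:
    """Toy registry with a surrogate key (UUID) and a natural key (email)."""

    def __init__(self):
        self._by_id = {}     # surrogate key -> record
        self._by_email = {}  # natural key -> set of surrogate keys

    def add(self, name, email):
        key = str(uuid.uuid4())  # assigned by the authority, unrelated to the record
        self._by_id[key] = {"name": name, "email": email}
        self._by_email.setdefault(email, set()).add(key)
        return key

    def get(self, key):
        """Exact lookup: one record or None."""
        return self._by_id.get(key)

    def find_by_email(self, email):
        """Search: zero, one, or many surrogate keys."""
        return sorted(self._by_email.get(email, ()))
```

Two people sharing a mailing-list address get distinct surrogate keys, and `find_by_email` returns both.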

by jey on 10/17/17, 8:17 PM

  There are only two hard things in Computer Science: cache invalidation and
  naming things.
  
  -- Phil Karlton

https://martinfowler.com/bliki/TwoHardThings.html

by bo1024 on 10/17/17, 11:55 PM
Something was bugging me about this, but I had to think hard to figure it out.
The article is largely based on a misguided premise: the idea that URLs should be conceptualized as either names or identifiers. URLs are neither: they are addresses of web pages. The things located at the URL may have names or identifiers, but by design of the web the stuff located at an address is mutable while the address is immutable.
This is an important point because it breaks the analogies to books or bank accounts. A physical copy of Moby Dick is a thing that may be located at a given address, or not. The work of fiction "Moby Dick" has an ISBN number, but the ISBN number is metadata, not an address. A bank account number is also metadata, not an address.
So I get the feeling that URLs should be conceptualized as addresses first and foremost. This isn't a magic bullet for the problem the blog post addresses (how to design URLs) but I think it gives some perspective:
* If the "thing" at the URL will always be conceptually the same "thing", but its name or other metadata may change, it makes sense to assign that thing a unique identifier and use this as part of the URL. (Because the thing with this ID will always be found at this address.)
* If the name of the stuff located at the URL is never going to change, it makes sense to use the name as part of the URL. (Because the stuff with this name will always be found there.)
* "Search results" as discussed in the blog post are a special case of the previous point: if a URL will always contain search results for a certain query, it makes sense to use the name of the query as part of the URL.
* There are also URLs that fall outside the name or identifier paradigms. http://www.ycombinator.com/about/ is the address of a bunch of stuff, which is not necessarily a single coherent thing with either an ID number or a name, but is a very reasonable address at which some content may be located.
Maybe this is all obvious, but to me it really helps think about the issue whereas the blog post confused some things for me, so I thought I'd share.
by spiralpolitik on 10/17/17, 7:39 PM
"The downside of the second example URL is that if a book or shelf changes its name, references to it based on hierarchical names like this one in the example URL will break."
The author appears to have forgotten about 3xx redirection codes which were intended to solve that very problem.
by tejtm on 10/17/17, 8:13 PM
http://journals.plos.org/plosbiology/article?id=10.1371/jour...
Abstract
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
Disclaimer: I am one of the many authors.
by bvrmn on 10/17/17, 8:22 PM
Many commenters here, and the author of the OP, talk about URLs in the browser address bar. However, the article has "API design" in its title.
by jgrodziski on 10/17/17, 9:13 PM
Identifying changing "stuff" in the real world is, for me, a fundamental topic of any serious data modeling for any kind of software (be it an API, a traditional database, etc). Identity is also at the center of the entity concept of Domain-Driven Design (see the seminal book of Eric Evans on that: https://www.amazon.com/Domain-Driven-Design-Tackling-Complex...).
I started changing my way of looking at identity by reading the rationale of clojure (https://clojure.org/about/state#_working_models_and_identity) -> "Identities are mental tools we use to superimpose continuity on a world which is constantly, functionally, creating new values of itself."
The timeless book "Data and reality" is also priceless: https://www.amazon.com/Data-Reality-Perspective-Perceiving-I....
More specifically concerning the article, I do agree with the point of view of the author distinguishing access by identifier and hierarchical compound name better represented as a search. On the id stuff, I find the amazon approach of using URN (in summary: a namespaced identifier) very appealing: http://philcalcado.com/2017/03/22/pattern_using_seudo-uris_w.... And of course, performance matters concerning IDs and UUID: https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-ca....
Happy data modeling :)
EDIT: - add an excerpt from the clojure rationale
by lwansbrough on 10/17/17, 6:44 PM
Nice, this reflects the choice I've made with a recent API design. This is especially important for entity names you don't control.
For example, we ingest gamertags and IDs from players of Xbox Live, PSN, Steam, Origin, Battle.net, etc. - each have their own requirements in terms of what is allowed in a username, and even whether or not they're unique. Often you can't ensure a user is unique by their gamertag alone. You can't even ensure uniqueness based on gamertag and platform name. Reality is that search is almost always required in these cases, and that's why we've implemented search in the way described in this article, with each result pointing to a GUID representing a gamer persona.
by jlg23 on 10/17/17, 7:36 PM
Missing for me: Timestamps. A lot of data is sufficiently unique if prefixed with a timestamp, which could be as simple and readable as /2017/10/17/my-great-blog-post/
by HumanDrivenDev on 10/18/17, 2:51 AM
A bit of an aside: why is it not standard practice to format uuids in a radix 64 encoding? It cuts down the identifier size from 32 to 22 characters
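For what it's worth, the arithmetic can be checked with the standard library. This sketch assumes URL-safe base64 with the padding stripped; the helper names `uuid_to_b64` and `b64_to_uuid` are made up for illustration.

```python
import base64
import uuid


def uuid_to_b64(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes as URL-safe base64 and drop the '==' padding."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def b64_to_uuid(s: str) -> uuid.UUID:
    """Restore the padding and decode back to a UUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))
```

A UUID is 16 bytes, which is 32 hex digits (36 characters with hyphens) or 22 base64 characters once the two padding characters are removed, and the encoding round-trips.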
by buro9 on 10/18/17, 8:42 AM
The author took an easy way out by recommending a canonical identifier based URL and a named URL, and then choosing a library as an example.
Books in a library are seldom renamed, if ever. The named URL would be almost as permanent as the canonical URL.
However in their earlier example of a bank account, a personal account name is typically the account holder name and the type of account, and both of these could be subject to change as a result of marriage, death, or the change in products offered by a bank. Even then, the rate of change is low.
A better example that the author could have (should have?) used is that of a news website where the article title may change frequently and yet there is a desire to make the link indicate the type of content at the destination... this is the real crux of the issue.
On a news site a canonical identifier driven URL may be correct... but does not sell or communicate the story behind the link and the link is likely to be shared without context. Sure you may see `example.com/news/a49a9762-3790-4b4f-adbf-4577a35b1df7` but this could be any news... it is far less obvious what is behind the link than the banking example as diversity in news stories is huge.
Yet the named URL would likely fail too, as once created and shared it should not mutate or at least should remain working... and yet the story title is likely to be sub-edited multiple times as news evolves.
The best scheme was not even mentioned in the article... combining both an identifier with a vanity named part: `example.org/news/a49a9762-3790-4b4f-adbf-4577a35b1df7_choosing_between_names_identifiers_URLs` . The named part can vary as it is not actually used for lookup, only the prefix identifier is used for lookup.
Though that has its own downside... one can conjure up misleading named sections for valid identifiers to misdirect and mislead.
by dreamfactored on 10/17/17, 11:33 PM
Odd that the article doesn't seem to mention whether IDs are a) globally unique and b) unguessable, or the huge difference between the URL param and directory styles: in the directory style, which param an ID belongs to is inferred from its position in the path, which makes every param required, and leaving off the final one defaults to it equalling *.
by baradas on 10/18/17, 4:10 PM
There's also the locality aspect of the problem, which is unaddressed. Typically humans resolve ambiguity in a finite namespace. E.g. there are only a few Bobs I know of. If a single human were asked to resolve a Bob without context, it would be a hard problem. I think all name resolution problems are related to identification on the basis of attributes, and a URL in a certain sense is supposed to model enough attributes to help us resolve this. We have modeled systems unlike humans: not with distributed and local information, but with URL resolution going through a central brain of sorts.
by DelightOne on 10/18/17, 4:37 AM
> You also need to be careful about how you store your identifiers—the identifiers that should be stored persistently by the API implementation are almost always the identifiers that were used to form the permalinks. Using names to represent references or identity in a database is rarely the right thing to do—if you see names in a database used this way, you should examine that usage carefully.
What does this mean? Is it just to say don't use the name hierarchy but rather the permalink key as the identity in the database?
by mcdan on 10/17/17, 6:49 PM
Isn't one problem with this that intermediate caches now have two resources that represent the same thing, so invalidating intermediate caches will be nearly impossible?

by nazri1 on 10/18/17, 2:26 AM

    Those who do not understand UNIX are condemned to reinvent it, poorly. -- Henry Spencer

Hard links, symlinks and inodes.
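A small sketch of the analogy, assuming a POSIX-style filesystem that supports both link types; `link_demo` is a made-up name. The inode plays the role of the permanent identifier, a hard link is a second name for the same inode, and a symlink is a name that points at another name.

```python
import os
import tempfile


def link_demo():
    """Return (same_inode, hard_link_survives, symlink_survives) after a rename."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "moby-dick.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(original, "w") as f:
            f.write("Call me Ishmael.")

        os.link(original, hard)     # second name for the same inode
        os.symlink(original, soft)  # name that refers to the first name

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino

        # Rename the original: the identity (inode) is unchanged, the name is gone.
        os.rename(original, os.path.join(d, "renamed.txt"))

        hard_ok = os.path.exists(hard)  # still reaches the inode
        soft_ok = os.path.exists(soft)  # follows the symlink to a missing name
        return same_inode, hard_ok, soft_ok
```

On such a filesystem one would expect the hard link to keep working after the rename while the symlink dangles, which mirrors identifier URLs versus name URLs.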

by monkeycantype on 10/18/17, 4:50 PM
in the article:
/shelf/{something}
{something} could be a name - 'american literature'
{something} could be an identifier - '20211fcf-0116-4217-9816-be11a4954344'
if someone calls:
https://library.com/locations: { "kind": "Shelf", "name": "20211fcf-0116-4217-9816-be11a4954344", }
now we have a shelf named with the id of a different shelf
and the meaning of
/shelf/20211fcf-0116-4217-9816-be11a4954344/book
is now ambiguous
i don't know a great way to avoid this
this is unambiguous, but i don't think my co-workers would like it:
/shelf/name/{id}/books
/shelf/id/{id}/books
I think this would only be slightly more popular
/shelf/name/{id}/books
/shelf/{id}/books
because the thing after shelf/ would not consistently be an id
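A sketch of the first, fully explicit variant, assuming a hypothetical `SHELVES` table in which the second shelf (with an invented ID) has been named after the first shelf's ID. The `resolve_shelf` helper is illustrative only: the segment after `shelf/` says whether the next segment is an ID or a name, so the two can never be confused.

```python
from urllib.parse import unquote

# Hypothetical data: the second shelf's *name* is the first shelf's *id*.
SHELVES = {
    "20211fcf-0116-4217-9816-be11a4954344": {"name": "american literature"},
    "7d0c5a7e-0000-4000-8000-000000000001": {
        "name": "20211fcf-0116-4217-9816-be11a4954344"
    },
}


def resolve_shelf(path):
    """Resolve /shelf/id/<id>/books or /shelf/name/<name>/books to shelf IDs."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "shelf" or parts[3] != "books":
        return []
    kind, value = parts[1], unquote(parts[2])
    if kind == "id":
        # Exact lookup: at most one shelf.
        return [value] if value in SHELVES else []
    if kind == "name":
        # Search: names are not guaranteed unique, so return every match.
        return sorted(sid for sid, s in SHELVES.items() if s["name"] == value)
    return []
```

The same string now resolves to different shelves depending on whether it arrives under `/shelf/id/` or `/shelf/name/`, at the cost of the longer URLs mentioned above.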
by amelius on 10/17/17, 6:36 PM
Why not make every URL that's shown in the title bar a permalink by default?
That way, you have the best of both worlds in all cases.
If an object tries to use a URL that another object claimed first, then a new URL must be generated (just add something at the end of the name).
by a13n on 10/17/17, 6:35 PM
For Canny, I wrote some awesome code that I'm proud of that turns a "post title" into a unique URL.
https://react-native.canny.io/feature-requests/p/headless-js...
For example, a post with title "post title" will get url "post-title".
Then a second post with title "post title" will get url "post-title-1".
Since there's only one URL part associated with each post, it's a unique identifier.
This gets rid of the ugly id in the URL, for epic URL awesomeness.
Furthermore, if you edit the first post to have "new post title" then its URL will update to "new-post-title", but "post-title" will still redirect to "new-post-title".
Someday I'm gonna open source a lib that lets you easily add awesome URLs to your app. :)
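The behaviour described above could be sketched roughly as follows. This is not Canny's code; `slugify` and `SlugRegistry` are hypothetical names, and the redirect is modelled by returning the canonical slug alongside the post ID.

```python
import re


def slugify(title):
    """Lowercase the title and join runs of letters/digits with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


class SlugRegistry:
    """Toy registry: unique slugs per post, old slugs kept for redirects."""

    def __init__(self):
        self._owner = {}    # every slug ever issued -> post id
        self._current = {}  # post id -> slug currently shown in the URL

    def _claim(self, post_id, title):
        base = slugify(title) or "post"
        slug, n = base, 0
        # Append -1, -2, ... until the slug is free or already ours.
        while slug in self._owner and self._owner[slug] != post_id:
            n += 1
            slug = f"{base}-{n}"
        self._owner[slug] = post_id
        self._current[post_id] = slug
        return slug

    def create(self, post_id, title):
        return self._claim(post_id, title)

    def rename(self, post_id, new_title):
        # Earlier slugs stay in _owner, so old links keep resolving.
        return self._claim(post_id, new_title)

    def resolve(self, slug):
        """Return (post_id, canonical_slug), or None for an unknown slug."""
        post_id = self._owner.get(slug)
        if post_id is None:
            return None
        return post_id, self._current[post_id]
```

A caller would issue a redirect whenever the requested slug differs from the canonical one returned by `resolve`.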
by joshzilla2017 on 10/18/17, 1:26 AM
Nice example of usability concerns, but I think bookshelves/<bookshelf>/<book> is much more intuitive than bookshelves/<bookshelf>/book/<book>.
by mirko22 on 10/18/17, 2:57 PM
Off topic, but I wish I could open a simple blog page without enabling a ton of JavaScript :/
by afandian on 10/17/17, 6:32 PM
Good advice. Interesting that Canonical URLs aren't mentioned.
But the sheer arrogance of serving a webpage that doesn't render any text unless you execute their JavaScript really annoys me. It's not a fancy interactive web-app, it's a webpage with some text on it.