As part of a project I’m working on for fun (shhhh) I’ve been trying to solve an interesting problem – what is the most compact way I can uniquely refer to something in a URI?
This one’s going to get complicated, so here’s the tl;dr version. To get the shortest possible, web-safe identifiers:
- encode identifiers with base62 or a customized version of base64 that uses valid characters.
- Use a big enough hash space to get an acceptable level of collisions.
Ok. Here’s the long-winded version.
It’s a little like the “URL Shortener” use case, where you want to take a big fat link and describe it in a tiny link. In this case, I’m converting individual sentences into a tiny, linkable representation.
So, I need a very short identifier that obeys certain (conflicting) properties:
- As much as possible, the identifier should be the same regardless of the order that I create the sentences in
- It should be as short as possible, because I may need to refer to a lot of these in a single URI
In the URL shortener mode, they tend to solve this by just keeping a global “ID” counter and incrementing it every time someone creates a new link. I need to use something a lot more like a hash. (MD5 or SHA1 are typical choices here.)
So what are the constraints?
- How long can a URI be?
- What characters are we allowed to use?
- How much unique data is going to be represented?
- How tolerant can we be of collisions? (When 2 “objects” might map to the same shortened identifier?)
URI constraints
The first 2 are simple and global. There’s no real standard on URI length, but smart people have done the legwork for us so we’ll steal their conclusions, and say “keep URI’s shorter than 2,000 characters.”
As far as the characters that can be used, we can consult RFC 3986 in the “Unreserved Characters” section and find we’re allowed 0-9, a-z, A-Z, and “- _ . ~” (dash, underscore, dot, tilde.) I’m going to avoid using those because I’d like to keep them for separating my identifiers. So, if we just go with the “normal characters”, there’s 62 of them.
Collision Constraints
If you assume a hashing function generates numbers pretty randomly spread throughout the number space available to it (which I will for this) there’s actually some pretty good math we can use to figure out how likely a collision will be. If you’re interested in where this comes from, check out the Birthday Paradox, which is nicely written up at Wikipedia. Using the formulas there you can calculate the probability of having 2 keys that collide, as a function of how many things you’re trying to insert, and the size of the overall hash space.
Putting it all together
First, figure out how many items you want to be able to store. (for me, 20,000 is a high estimate). Second, figure out how comfortable you are with having some keys collide. 50/50 odds? (0.5) 1/1000 chance? (0.001) I’m actually pretty ok having a very small number of collisions – I’m going to use 0.99, a 99% chance that SOMETHING will collide. (That’s still going to be a small number, less than a 100% chance that anything at all will collide, let alone more than one.)
I sketched up a little script for this article (which you can grab from my git repo. It’s pretty self-documenting, so I’ll let it speak for itself.
The goal is, given our constraints, and the fact that we need to use only the characters 0 .. 9, a .. z, A .. Z, (a) how many of those characters do we need, and (b) given that it probably won’t be an exact fit, what will our collision probability end up being?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | |
The most mysterious parts of this are probably the formulas. The one in the get_probability subroutine is transcribed right from the Wikipedia page, but the other one is the same formula, solved for a different value. In general, if you need to do this, WolframAlpha is a math nerd’s dream come true. I just asked it to “solve (the equation) for d” and got the new formula I needed.

The solution actually comes from “show your steps” – I can find an intermediate form that’s easier to represent in a non-math-centric programming language. (I’m sure you can do imaginary numbers in perl, but it was kind of outside the scope of my plans for this evening.)
Here’s the formula I ended up using:

Here’s a few sample runs of the script:
First, using my personal constraints for this project:
1 2 3 4 5 6 7 | |
Cool! So I said I’m ok with a 99% chance of a collision, and the algorithm figured out that in order to do that, I’d need to be using 5 digits of base62. And if I’m using 5 digits of base62, I get 3 more bits than I strictly “need”, which means I end up with only about a 1/5 chance of EVER getting a collision.
Let’s say I wanted to be more strict, and go “one in a million”.
1 2 3 4 5 6 7 | |
In this case base62 comes pretty close to exactly the dimensions that we want, so we more or less get 1/1,000,000 on the nose with 8 digits.
If you grab the script to play with, you may need to tweak the printf’s and make sure they have enough resolution for the data you’re trying to examine.
Ok, so that tells us how bit our bitspace needs to be, but not how to get a hash, or how to do base62.
Hashing Function
I chose md5 because…. it seems to work fine. I’m sure there’s a better option, but this is working for now. However, md5 has a lot more bits available than I need. How to just steal a few of them?
First, you need to know how many bits you want. Thankfully, I know I want 29 bits (thanks, helper script!). I can extract 29 bits of information from it by making a “mask” of 29 1’s, which can be done easily like so:
1
| |
So, now I just need a raw integer slice of an md5, and do a “logical and” of that:
1 2 3 4 5 6 | |
Groovy. Now we know what number we want to represent, we need to actually represent that in this weird “base62” format.
The algorithm for converting to base(anything) is actually pretty easy. We’re used to thinking in base10, so I’ll show an example of running the algorithm to from and to base10, just so the flow is clear.
| 125 | Starting Value |
| 125 % 10 = 5 | Find the remainder when dividing by the “destination base.” Keep “5” as the “new number” |
| 125 / 10 = 12 | Divide by the “destination base”. This chops the number off the end that we just snagged as a remainder |
| 12 % 10 = 2 | Do it again with the result of the division. Stick the result to the front of number we’re keeping track of (now 25) |
| 12 / 10 = 1 | And do the division again |
| 1 % 10 = 1 | Last remainder, glue to the front again, and we have the answer “125”. (Back where we started.) |
| 1 / 10 = 0 | As soon as we get 0 from the division, our work here is done. |
Simple enough to understand in base10, but the exact same technique works when going from base10 to any other base (2 would work to convert to binary, we’re using 62, and if you could find enough characters, you could go to something crazy like base300 using the same idea.)
Here’s the source for it:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | |
So there you have it. Provably optimal URI-compatible identifiers with 3 easy steps:
- Figure out what the constraints for your problem space are
- Grab enough bits from md5
- Convert to (and from) base62