Wednesday, January 11, 2017

In Defense of Passphrases


Ever since the XKCD comic on Password Strength became popular, I've heard more and more disparaging remarks about how passphrases are worse than more "random" passwords. I don't understand all the hating on passphrases; the basic idea of them, as I see it, is that words are easier for humans to memorize, and create associations between, than random gibberish characters.

Now it's a given that "gibberish" passwords and long passphrases can both be done poorly -- "correct horse battery staple" is now a terrible password, because it was featured in the comic. But so is "2143658701badcfe" (even if you somehow think that string was random, the fact that it now appears in this blog post makes it a bad choice). But I think these naysayers do not understand the value that passphrases add -- it is easier (for most people) to remember words than random characters of the same entropy (if you aren't familiar with "entropy", think "randomness"). Let's try to show it with some simple examples.

First, how much entropy is enough? That's a complex question, but for our purposes let's just say 80 bits; this is based on this Q&A entry. Whether 80 bits is enough or not doesn't really matter -- if you want 160 bits, just double the lengths of all the values below.

What does 80 bits of entropy look like in English? The English language allegedly has around 1,000,000 words. Now we can't use them all; for one thing, very rare words are hard to remember. So let's pick from the most common 10,000 words. I'm using the 10,000 words at the top of this github page. I made that list by taking another list, and spending just a few minutes cleaning it. But I don't think anyone would object to 10,000 being a reasonable number of words for someone to know, however you come up with the list.

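Now each word has a 1 in 10,000 chance of being selected, which is about 13.3 bits of entropy per word. This doesn't go quite evenly into 80 bits, but 6 words gives 10^24 combinations, which is about 83% of 2^80 (roughly 79.7 bits). (7 words would be about 827,000% of 2^80, far more than we need, so let's stick with 6 words.)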
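
The arithmetic above can be checked with a short sketch. The names listSize, targetBits and fractionOfTarget are my own, chosen for illustration; the only inputs taken from the text are the 10,000-word list and the 80-bit target.

```javascript
// Assumed inputs from the text: a 10,000-word list and an 80-bit target.
const listSize = 10000;
const targetBits = 80;

// Entropy contributed by one uniformly chosen word, in bits (about 13.29).
const bitsPerWord = Math.log2(listSize);

// Number of possible passphrases of `words` words, as a fraction of 2^targetBits.
function fractionOfTarget(words) {
  return Math.pow(listSize, words) / Math.pow(2, targetBits);
}

console.log('Bits per word: ' + bitsPerWord.toFixed(2));
console.log('6 words: ' + (6 * bitsPerWord).toFixed(2) + ' bits, ' +
  (100 * fractionOfTarget(6)).toFixed(1) + '% of 2^80');
console.log('7 words: ' + (7 * bitsPerWord).toFixed(2) + ' bits, ' +
  (100 * fractionOfTarget(7)).toFixed(0) + '% of 2^80');
```

Six words land just under the target and seven overshoot it by a factor of several thousand, which is why six is the natural choice here.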

I grabbed 6 random numbers from 1 to 10,000, and got:
  • 6225, 1738, 4836, 6378, 7361, 8406.
Looking up those numbers on my 10,000-word list gives:
  • objections, shoulders, breathe, comrade, angrily, vs
That's what 80 bits of entropy looks like in English. So how does that compare to more "conventional" randomly generated passwords?
  • Hex: 63485AE5638C1EDCC61E
    • 20 hex digits is exactly 80 bits of entropy.
  • Decimal: 236663118018716201382515
    • 24 decimal digits is 10^24 combinations, about 83% of 2^80 (roughly 79.7 bits); close enough
  • Base64: kiydPJHQh4jL7
    • A lot of systems try to use all the numbers, upper and lower case letters, and sometimes other characters thrown in. This is pretty awful for humans, both because it takes longer to type in, and because it's often hard to tell a 1 from I from l, 0 from O, etc. But I'm including it here for comparison. The math works out that 13 characters in base64 is 2^78; that is only 25% of 2^80, but it's close enough.
  • Passphrase: objections shoulders breathe comrade angrily vs
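
The entropy figures quoted for the four options can be reproduced with a few lines of JavaScript. This is only a sketch: the helper entropyBits and the options table are names I made up for illustration, and it assumes every character (or word) is chosen uniformly and independently.

```javascript
// Entropy in bits of `length` symbols drawn uniformly from `alphabetSize` choices.
function entropyBits(alphabetSize, length) {
  return length * Math.log2(alphabetSize);
}

// The four options from the list above: symbol count and alphabet size.
const options = [
  { name: 'Hex', alphabetSize: 16, length: 20 },
  { name: 'Decimal', alphabetSize: 10, length: 24 },
  { name: 'Base64', alphabetSize: 64, length: 13 },
  { name: 'Passphrase', alphabetSize: 10000, length: 6 }
];

for (const o of options) {
  console.log(o.name + ': ' + entropyBits(o.alphabetSize, o.length).toFixed(2) + ' bits');
}
```

All four come out between 78 and 80 bits, so the comparison is between roughly equal strengths.
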
So now, you be the judge. You have to memorize one of these 4 choices; if you succeed, you will live a long and safe life, but if you fail, your identity will be stolen and you will be miserable. Which one do you choose?

Or maybe you're trying to be "practical"; which one is easier to type in? Well, I just timed myself typing in each one on my normal keyboard, and my times, in the order listed above, were: 10 seconds, 8 seconds, 5 seconds, 7 seconds. So the terrible base64 was the fastest, presumably only due to the very low character count; but typing the long passphrase was second in speed. But I certainly wouldn't choose the base64 option, especially if you consider what it's like to type that into a phone/tablet; all the numbers and capital letters require multiple taps, and it's terrible. Whereas the whole words could be swiped in, since they are all recognizable, common words. I'm too lazy to try timing myself on a phone right now, but I would speculate that I can swipe-type 6 words much faster than I can enter any of the other three random sets of characters above.

Now a real scientific test would be to formalize this a bit more, and run real memory tests on humans. But I think I have proved my point.

Just for fun/reference, here's the JavaScript code I used to help with the above:

// Map a digit value (0 to 63) to a character; this alphabet starts with 0-9A-F
// so the same function also covers binary, decimal and hex
var base64 = function (n) { return '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/'[n]; };
// Random integer from 0 to max - 1 (Math.random is not cryptographically secure)
var rnd = function (max) { return Math.floor(Math.random() * max); };
// Build a string of `digits` random digits in the given base
var rndDigits = function (base, digits) {
  var s = '';
  for (var i = 0; i < digits; i++) {
    s += base64(rnd(base));
  }
  return s;
};
console.log('Binary: ' + rndDigits(2, 80));
console.log('Decimal: ' + rndDigits(10, 24));
console.log('Hex: ' + rndDigits(16, 20));
console.log('Base64: ' + rndDigits(64, 13));
// Six random word indexes
console.log(rnd(10000) + ', ' + rnd(10000) + ', ' + rnd(10000) + ', ' + rnd(10000) + ', ' + rnd(10000) + ', ' + rnd(10000));
// for marking times, I just pressed enter before and after each combination
document.addEventListener('keydown', function (e) { if (e.key === 'Enter') { console.log(new Date()); } }, false);
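
The script above relies on Math.random(), which is fine for a demonstration but is not a cryptographically secure source. For a passphrase you actually intend to use, something along these lines would be a safer starting point. It is only a sketch under stated assumptions: secureRnd and passphrase are hypothetical helper names, the six-entry words array is a stand-in for the real 10,000-word list, and the Web Crypto API is assumed to be available.

```javascript
// Assumption: Web Crypto is available as globalThis.crypto (browsers, recent Node);
// older Node versions expose the same API as require('crypto').webcrypto.
const webCrypto = globalThis.crypto || require('crypto').webcrypto;

// Uniform integer from 0 to max - 1, using rejection sampling to avoid modulo bias.
function secureRnd(max) {
  const range = 0x100000000;            // number of distinct Uint32 values (2^32)
  const limit = range - (range % max);  // largest multiple of max not above 2^32
  const buf = new Uint32Array(1);
  do {
    webCrypto.getRandomValues(buf);
  } while (buf[0] >= limit);            // retry the rare values that would skew the result
  return buf[0] % max;
}

// Hypothetical stand-in for the 10,000-word list; a list this short gives almost no entropy.
const words = ['objections', 'shoulders', 'breathe', 'comrade', 'angrily', 'vs'];

// Pick `count` words independently and uniformly from wordList.
function passphrase(wordList, count) {
  const picked = [];
  for (let i = 0; i < count; i++) {
    picked.push(wordList[secureRnd(wordList.length)]);
  }
  return picked.join(' ');
}

console.log(passphrase(words, 6));
```

The entropy estimates in this post only hold if each word is drawn uniformly and independently, which is why the sketch rejects out-of-range samples instead of simply taking the remainder.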