I previously blogged about visualizing part of Adobe's recent security blunder. Here is another attempt at showing the implications of this leak. I randomly extracted about 1500 hashes from the complete data set and used the Levenshtein distance between the hashed strings as a dissimilarity measure. From these pairwise distances, I obtained coordinates for all hashes using Kruskal's multidimensional scaling.
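To make the dissimilarity step concrete, here is a minimal sketch of how a pairwise Levenshtein distance matrix could be computed in plain JavaScript. The function names `levenshtein` and `dissimilarityMatrix` and the sample strings are made up for illustration; this is not the code I used for the analysis, and it does not include the scaling step itself.

```javascript
// Levenshtein distance via dynamic programming, keeping only two rows.
// Compares UTF-16 code units, which is sufficient for ASCII hash strings.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the empty prefix of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Symmetric matrix of pairwise distances; the diagonal stays 0.
function dissimilarityMatrix(hashes) {
  const n = hashes.length;
  const d = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      d[i][j] = d[j][i] = levenshtein(hashes[i], hashes[j]);
    }
  }
  return d;
}

// Made-up placeholder strings, NOT real hashes from the leak.
const sample = ["abc123", "abd123", "", "xyz"];
const matrix = dissimilarityMatrix(sample);
console.log(matrix);
```

A matrix like this is the input that a multidimensional scaling routine expects. Note that the cost grows quadratically with the number of hashes, which is one reason to work with a sample of about 1500 rather than the complete data set.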

The result is a password landscape in which each node represents a single password hash. Nodes are scaled by their frequencies, i.e. how often their corresponding password occurs in the complete data set. Upon hovering over each data point, a tooltip shows you additional information about the hash.
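One way to scale nodes by frequency is sketched below. It assumes a square-root mapping, so that a circle's area (rather than its radius) is proportional to the frequency, plus a small minimum radius so that rare hashes remain visible. The helper `radiusFor`, its default values, and the sample records are hypothetical; the scaling used on this page may differ.

```javascript
// Map a frequency to a circle radius so that area grows linearly with frequency.
// maxRadius and minRadius are arbitrary illustration values in pixels.
function radiusFor(frequency, maxFrequency, maxRadius = 30, minRadius = 2) {
  const r = maxRadius * Math.sqrt(frequency / maxFrequency);
  return Math.max(minRadius, r); // keep very rare hashes visible
}

// Made-up placeholder records, NOT taken from the leaked data.
const nodes = [
  { hash: "placeholder-a", frequency: 400 },
  { hash: "placeholder-b", frequency: 100 },
  { hash: "placeholder-c", frequency: 1 },
];

const maxFrequency = Math.max(...nodes.map((n) => n.frequency));
const radii = nodes.map((n) => radiusFor(n.frequency, maxFrequency));
console.log(radii);
```

Scaling the radius linearly instead would visually exaggerate frequent hashes, since the perceived size of a circle follows its area.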

This is my first attempt with the awesome D3.js library, and the results can arguably be improved a lot. For example, the "lumpy" part on the left side of the data set is what I like to call the Island of not-so-smart-passwords. The hashes in this region are all empty. Note how often this occurs even though I sampled randomly from the complete data set. This also explains the many overlaps in this area.

Still, you can get a pretty decent picture of what sorts of passwords are commonly used by people. This data set is simply a treasure trove, showing the dangers of "shared" passwords as well as the amount of data a large company is sitting on.

Some resources:

  • See the source code of this page for the D3.js source. You may use it under the MIT License.
  • The file data.csv contains the raw data, along with coordinates for the layout. I have excluded the e-mail addresses, which are present in the original leaked data, for reasons of anonymity.