Understanding Hash Collisions
A GUID (globally unique identifier) is a 128 bit value that can represent 340,282,366,920,938,463,463,374,607,431,768,211,456 unique values = that’s 340 undecillion. You can see why programmers use GUIDs as unique identifiers. It would take billions upon billions of years to generate two GUIDs that are the same.
A hash is a value that is calculated by running some data (like a GUID) through an algorithm, which produces a checksum (kind of like a digital fingerprint).
There is a common misconception that hashes are unique, or at least, unique enough not to worry about hash collisions (two identical hashes generated from different data) most of the time. It’s easy to see why most would think that – after all, a 32 bit hash, for example, can represent 4,294,967,295 unique values – that’s over 4 billion possibilities.
In the case of a checksum that is used to check the integrity of a file, hashes are extremely reliable. The probability of someone tampering with your file, changing data, or injecting a virus, and still having that file produce the same hash, is astronomically low.
In the case of a GUID, a 32 bit hash algorithm will assign 79,228,162,514,264,337,593,543,950,336 (that’s 79 octillion) GUIDs on average to each and every hash value – that’s just basic math, and why we generally don’t use hashes as a representation of the data they are hashed from.
I recently came across a scenario where, to implement scoped access in our web app, we store up to ~7 million GUIDs in up to ~110 million GUID -> GUID pairings in our caching layer. Because of space constraints, 32 bit GUID hashes were being stored instead of the GUID itself.
The maximum file size for SQL CE is 4 GB, and at first, that seems to be ok. A GUID consumes 16 bytes of space, so 220 million GUIDs uses about 3.2 GB, but include row overhead in SQL CE (6 bytes), column overhead (1 byte), and various other overheads (table overhead, indexing, etc.) the 4GB limit is exceeded.
Storing 4 byte GUID hashes effectively quartered the space requirements, but at the expense of guaranteed hash collisions.
That’s because, statistically, every ~75,000 GUIDs will have at least one hash collision – try it out, compile this code snippet and run it a few times.
We ended up resolving this by switching to 8-byte hashes, which cut the space used in half, and only yields a hash collisions once in every 5 billion GUIDs, or in terms of our requirements, never.
Thank you for taking a moment to learn about hash collisions. Ready to transform your SCSM experience? View all of the exciting apps Cireson has to offer.