Have you seen Damn Cool Algorithms: Cardinality Estimation yet? If not, take a few minutes and read through it. Now, what if we try using that approach instead of COUNT(DISTINCT) in MySQL to see how many distinct values there are in a column?
I recently needed this information in real life, and the table is
large with many duplicate values. The column is some 32-character
hex string, a hash value that represents a session ID. I’ll call
the column sess_id
. I wanted to know how many
distinct values it had, but I thought it would be cool (damn
cool, really) to try this approach and see what happened.
I read the blog post, convinced myself that it made sense, and
tried to code it. Here’s my rough translation of the algorithm
into MySQL-speak. Note that I’m using crc32()
, which
may not be a great choice …