How to hash mysql varchar/string into different bins

First off, what am I talking about?
Here’s a scenario.

You have a bunch of rows in your table that are URLs. You want to select an evenly distributed random set of them every so often and check if they’re still alive. How do you do this?

You could just take a range.

Get the first 10

SELECT url FROM myTable
ORDER BY url
LIMIT 0, 10

Get the next 10

SELECT url FROM myTable
ORDER BY url
LIMIT 10, 10

And the next 10…

SELECT url FROM myTable
ORDER BY url
LIMIT 20, 10

But what if you wanted to evenly distribute your checks because the URLs from the same site are sequentially ordered.

Well, you could ensure an auto-increment ID and just get chunks of them based on a mod.

So to get every 10th one

SELECT id, url FROM myTable
WHERE id % 10 = 0 

And the next 10th sequence

SELECT id, url FROM myTable
WHERE id % 10 = 1 

And the next 10th sequence

SELECT id, url FROM myTable
WHERE id % 10 = 2 

This is still not that well distributed since you might run into large clusters of site url’s (ie more than 10).
But this also requires a nice integer column value.

Instead, you could also just use the VARCHAR value like the URL itself

So to get the first 10

SELECT id, ... FROM myTable
WHERE CAST(CONV(SUBSTRING(MD5(url), 1, 16), 16, 10) AS SIGNED INTEGER) % 10 = 0

To get the next 10

SELECT id, ... FROM myTable
WHERE CAST(CONV(SUBSTRING(MD5(url), 1, 16), 16, 10) AS SIGNED INTEGER) % 10 = 1

And so on…

Let’s go over what this does.

The MD5() is a hash function that will convert your VARCHAR instead a seemingly random sequence of numbers of letters. It’s not random though. It always converts to the same sequence, but distributes the VARCHAR sequence of characters more uniformly.

The SUBSTRING(…, 1, 16) takes the first 16 digits of the MD5 hash value. I believe this gives you the first 64 bits of it, otherwise there’s a possible overflow error.

The CONV(…, 16, 10) function converts the hash (which is a hex or base-16 value) into a base-10 value.

The CAST(… AS SIGNED INTEGER) function converts it to a signed integer. (If you’re going to read this value into java, you want a signed integer otherwise you’ll get an overflow)

Then simply mod (%) it by the number of bins you want. In my example, I modded it with 10.

Advertisements
Tagged

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: