Daniel Morrison

The Road to Randomness

Goal: Get 3 random records from a database table.

There are a number of ways to pick random records in Rails, but none that are perfect.

1. Do it in Ruby

things = Thing.find(:all)
random_things = []
3.times do
  random_things << things[rand(things.size)]
end

Two problems with this approach:

1. We fetch the entire table just to use 3 records.
2. rand() can land on the same index twice, so we might get duplicates.

The second problem is easy to fix (unless there are fewer than 3 things total, or our rand() takes too long to land on 3 distinct numbers):

things = Thing.find(:all)
random_things = []
while random_things.size < 3
  random_things |= [things[rand(things.size)]]
end

This solution isn’t great. And we didn’t solve the problem with fetching the whole set.
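Another in-Ruby option is to shuffle the whole set and take the first 3. It still loads every record, so it shares the big problem above, but it sidesteps duplicates entirely. A minimal sketch with a plain array standing in for `Thing.find(:all)`:

```ruby
# Stand-in for things = Thing.find(:all)
things = (1..10).to_a

# Sort the set by a random key, then keep the first 3.
# Distinctness is free: each element appears once in the shuffled array.
random_things = things.sort_by { rand }.first(3)
```

This is the classic pre-1.8.7 Ruby shuffle idiom; it still pays to load and sort the whole collection, which is exactly the cost we are trying to avoid.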

2. All Ruby with a 2nd query

thing_ids = Thing.find(:all, :select => 'id').map(&:id)
random_ids = []
while random_ids.size < 3
  random_ids |= [thing_ids[rand(thing_ids.size)]]
end

random_things = Thing.find(:all, :conditions => ['id IN (?)', random_ids])

OK, now we only select ids, so we don't have to load the whole set. We could make the first query faster still if we wrote raw SQL instead of instantiating each row as a Thing. (Also, does SQLite support the 'IN' operator?)
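The two-query flow above can be sketched in plain Ruby, with a hash standing in for the table (the real app would run the two ActiveRecord queries shown above):

```ruby
# Stand-in table: id => row. In the real app this is the database.
rows = {}
(1..100).each { |id| rows[id] = "thing #{id}" }

# First "query": fetch only the ids.
thing_ids = rows.keys

# Pick 3 distinct ids; |= with a one-element array skips duplicates.
random_ids = []
random_ids |= [thing_ids[rand(thing_ids.size)]] while random_ids.size < 3

# Second "query": fetch full rows for just those ids (the IN condition).
random_things = random_ids.map { |id| rows[id] }
```

The second lookup is the moral equivalent of `id IN (?)`: we only ever hydrate the 3 rows we actually want.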

3. Do it in the database

I stumbled across this idea, and thought it was fantastic.

random_things = Thing.find(:all, :limit => 3, :order => 'random()')

Take a look at this for a minute. We can do it all simply, and in one line! This seems to be a near-standard SQL construct, as it works similarly across a few databases (silly MySQL uses rand() instead of random()), including SQLite.

This has the benefit of being simple, and would seem to be faster as it is happening in the database.

Caveats

Unfortunately, it's not that simple. Each database implements this differently (generally by attaching a random value to each row of your table, in memory), and it doesn't scale well (remember that :limit can't be applied until the result set has been fully built and sorted).
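To see why the database can't short-circuit, here is what ORDER BY random() LIMIT 3 effectively forces it to do, sketched in plain Ruby: tag every row with a random key, sort the entire set, and only then keep 3.

```ruby
# Stand-in for a table of 1000 rows.
rows = (1..1000).to_a

picked = rows.map { |row| [rand, row] }  # random key for EVERY row
              .sort                      # full O(n log n) sort
              .first(3)                  # only now does the limit apply
              .map { |pair| pair.last }
```

The work is proportional to the whole table, not to the 3 rows you asked for, which is exactly the scaling concern.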

After I put together a plugin and a patch to make this transparent, I came across a Rails ticket that got flagged ‘wontfix’. Jeremy Kemper’s response to this idea was:

order by rand() limit 1 is very taxing on the database. It shouldn’t be a standard nor encouraged feature.

Jeremy certainly isn’t wrong, but I’m not sure he’s fully right either. We have a solution that won’t be usable with large tables (or a high-traffic site), but technically works well.

The Plugin:

Here's the working plugin, which lets you pass :random to any finder: Thing.find :all, :order => :random

./script/plugin install -x http://source.collectiveidea.com/public/rails/plugins/random_finders

It works with MySQL, SQLite and PostgreSQL. Do listen to Jeremy's warning, though. I'm using it on an app with a dataset that will stay relatively small. If you run into scaling issues, don't say we didn't warn you.

Anyone have a better solution?

9 Comments

  1. Mislav — May 20, 2007

    Here’s a better solution:

    Thing.find :first, :offset => rand(Thing.count)

    It’s 2 queries, but it scales because the first query is just a count (which databases optimize quickly). You should order by some column and have an index on that column, naturally (that goes without saying).

  2. Daniel Morrison — May 21, 2007

    @Mislav: That does sound quicker, though it doesn’t help with multiple records.

  3. Mislav — May 21, 2007

Depending on how many records you need and how large the table is, this may be a better solution even for multiple records. Just cache the count and issue the finder multiple times. Even 5 queries can be faster than selecting 5 rows with RAND(). I just don't see what the purpose would be for fetching so many random rows.
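[A sketch of the approach Mislav describes, with a plain array standing in for the table; in Rails each pick would be `Thing.find(:first, :offset => rand(count))` against a cached count:]

```ruby
# Stand-in table and cached COUNT(*) (cheap for the database to produce).
rows = (1..50).to_a
count = rows.size

random_things = []
while random_things.size < 3
  candidate = rows[rand(count)]   # one :offset => rand(count) query
  random_things |= [candidate]    # re-roll on a duplicate
end
```

Each pick is an indexed offset lookup rather than a full-table sort, which is why a handful of these queries can beat one ORDER BY RAND().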

  4. hiroshi — May 21, 2007

I just wanted to tell you that the URL you specified is wrong:

    - http://source.collectiveidea.com/public/rails/plugins/imap_authenticatable
    + http://source.collectiveidea.com/public/rails/plugins/random_finders

  5. Daniel Morrison — May 22, 2007

    @hiroshi: thanks, fixed.

    @Mislav: I get what you’re saying now, do the RAND in Ruby, and get the rows that way (IN condition seems best). I’ll update the plugin shortly.

    In reality, I can’t see why you’d want that many rows either. I think most people would want a few (5-10) max, so we can optimize to that.

  6. sjs — June 05, 2007

    I’m working on an app right now where customers buy anywhere from 1-100 (possibly more) “things”, and they should receive these randomly from a table with millions of rows. I’m pretty sure that I’ll run into scaling issues using find :all, :order => 'rand()', :limit => n so I have a feeling I’ll benchmark the 2 methods fairly soon to see which is better for us. Even with 100 queries I think Mislav’s solution sounds faster.

  7. Daniel Morrison — June 07, 2007

    @sjs: That would be great! If you get any benchmarks I’d love to link to them.

    I expect Mislav’s solution to be quicker too.

  8. ruby bonk — December 24, 2007

    Great discussion here. Thanks to you and Mislav for pointing out many solutions and drawbacks/benefits.

    Since mine is a large-data-set thing, i’m going w/ Mislav’s idea.

    Thanks again for the post!

  9. Dan Parker — March 21, 2008

    I know SQL does a lot of math quite happily too – there should be a way to create a query that has a mathematical algorithm centered around random() that can do it quite efficiently. Thinking out loud here as I research…

For starters, you can use "FLOOR(20 + (RAND() * 180))" to get a random integer between 20 and 200. If you can assume that most numbers exist in your id's, you could grab a random match: "id IN (FLOOR(RAND() * highest_id_in_table))".

    Now, you’d need to gather that highest_id_in_table. The following is a start:

SELECT * FROM logs WHERE id IN (FLOOR(RAND() * (@highest := (SELECT MAX(id) FROM logs))), FLOOR(RAND() * @highest));

- but I'm not sure if it's executing the (SELECT MAX(id) FROM logs) every time it compares a record, or not. With just a little more work (I have to go now, the evening activity is calling :P), you could get the @highest (a client variable in SQL) to be set only if it hasn't been set yet. In any case, it gets carried over the rest of the expression. In fact, if you run that query, you'll see very random results - on a table of over 3000 records, I got anywhere from 0 to 5 records returned. That was some good learning, anyway.

My real suggestion is to run two queries:
1) top = SELECT MAX(id) AS top FROM logs; #not a big overhead on an indexed integer column
2) Gather 6 random ruby bits: r = [rand(top),rand(top),…]
3) records = [“SELECT * FROM logs WHERE id IN (?) GROUP BY id LIMIT 3”, r]
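[The three steps above can be sketched in plain Ruby, with an array standing in for the logs table; note that if the id column has gaps, the IN query can return fewer rows than requested, which is why the sketch over-samples:]

```ruby
# Step 1: SELECT MAX(id) FROM logs -- cheap on an indexed column.
top = 1000

# Step 2: over-sample 6 random ids so gaps in the id sequence
# don't leave us short of 3 matches.
r = Array.new(6) { rand(top) + 1 }.uniq

# Step 3: SELECT * FROM logs WHERE id IN (r) LIMIT 3 -- simulated here
# with an intersection against the ids that actually exist.
existing_ids = (1..top).to_a
records = (r & existing_ids).first(3)
```

With a gappy id column, even over-sampling gives no hard guarantee of 3 rows, which matches the "anywhere from 0 to 5 records" behavior described above.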

Make a Comment