The Road to Randomness
Goal: Get 3 random records from a database table.
There are a number or ways to pick random records in Rails, but none that are perfect.
1. Do it in Ruby
things = Thing.find(:all)
random_things = []
3.times do
random_things << things[rand(things.size)]
end
Two problems with this approach:
- Have to fetch the entire set of things
- random_things might have duplicates
The second problem is easy to fix (unless there are fewer than 3 things total, or our rand() takes too long to get 3 distinct numbers)
things = Thing.find(:all)
random_things = []
while random_things.size < 3
random_things |= things[rand(things.size)]
end
This solution isn’t great. And we didn’t solve the problem with fetching the whole set.
2. All Ruby with a 2nd query
thing_ids = Thing.find(:all, :select => 'id').map(&:id)
random_ids = []
while random_ids.size < 3
random_things |= things[rand(things.size)]
end
random_things = Thing.find(:all, :conditions => ['id IN (?)', random_ids])
Ok, now we only select ids, so we don’t have to load the whole set. We could make the first query faster if we wrote SQL instead of fetching them as Things. (Also, does SQLite support the ‘IN’ operator?)
3. Do it in the database
I stumbled across this idea, and thought it was fantastic.
random_things = Thing.find(:all, :limit => 3, :order => 'random()')
Take a look at this a minute. We can do it all simply, and in one line! This seems to be a standard SQL construct, as it works similarly across a few of them (silly MySQL uses rand() instead) including SQLite.
This has the benefit of being simple, and would seem to be faster as it is happening in the database.
Caveats
Unfortunately, its not that clear. Each database does this differently (generally adds a column to each row in your table, in memory), and it doesn’t scale well (remember that :limit doesn’t can’t get applied until the result set is fully built).
After I put together a plugin and a patch to make this transparent, I came across a Rails ticket that got flagged ‘wontfix’. Jeremy Kemper’s response to this idea was:
order by rand() limit 1 is very taxing on the database. It shouldn’t be a standard nor encouraged feature.
Jeremy certainly isn’t wrong, but I’m not sure he’s fully right either. We have a solution that won’t be usable with large tables (or a high-traffic site), but technically works well.
The Plugin:
Here’s the working plugin. which lets you pass :random to any finder.
Thing.find :all, :order => :random
./script/plugin install -x http://source.collectiveidea.com/public/rails/plugins/random_finders
It works as with MySQL, SQLite and PostgreSQL. Do listen to Jeremy’s warning. I’m using it on a an app with a dataset that will stay relatively small. If you run into scaling issues, don’t say we didn’t warn you.
Anyone have a better solution?
Comments
-
Here’s a better solution:
Thing.find :first, :offset => rand(Thing.count)It’s 2 queries, but it scales because the first query is just a count (which databases optimize quickly). You should order by some column and have an index on that column, naturally (that goes without saying).
-
@Mislav: That does sound quicker, though it doesn’t help with multiple records.
-
Depending on how much records do you need and how large is the table, this may be a better solution even for multiple records. Just cache the count and issue the finder multiple times. Even 5 queries can be faster than selecting 5 rows with RAND from a huge table. But if you need 100 rows, then RAND is the way. I just don’t see what might be the purpose for fetching so much random rows.
-
I just tell you that the URL you specified is wrong
- http://source.collectiveidea.com/public/rails/plugins/imap_authenticatable + http://source.collectiveidea.com/public/rails/plugins/random_finders
-
@hiroshi: thanks, fixed.
@Mislav: I get what you’re saying now, do the RAND in Ruby, and get the rows that way (IN condition seems best). I’ll update the plugin shortly.
In reality, I can’t see why you’d want that many rows either. I think most people would want a few (5-10) max, so we can optimize to that.
-
I’m working on an app right now where customers buy anywhere from 1-100 (possibly more) “things”, and they should receive these randomly from a table with millions of rows. I’m pretty sure that I’ll run into scaling issues using
find :all, :order => 'rand()', :limit => nso I have a feeling I’ll benchmark the 2 methods fairly soon to see which is better for us. Even with 100 queries I think Mislav’s solution sounds faster. -
@sjs: That would be great! If you get any benchmarks I’d love to link to them.
I expect Mislav’s solution to be quicker too.
-
Great discussion here. Thanks to you and Mislav for pointing out many solutions and drawbacks/benefits.
Since mine is a large-data-set thing, i’m going w/ Mislav’s idea.
Thanks again for the post!
-
I know SQL does a lot of math quite happily too – there should be a way to create a query that has a mathematical algorithm centered around random() that can do it quite efficiently. Thinking out loud here as I research…
For starters, you can use “FLOOR180))” to get a random integer between 20 and 200. If you can assume that most numbers exist in your id’s, you could grab a random match: “id IN (FLOORhighest_id_in_table)))”.
Now, you’d need to gather that highest_id_in_table. The following is a start:
SELECT * FROM logs WHERE id IN (FLOOR(@highest := (SELECT MAX FROM logs)))), FLOOR@highest)));
- but I’m not sure if it’s executing the (SELECT MAX FROM logs) every time it compares a record, or not. With just a little more work (I have to go now, the evening activity is calling :P), you could get the @highest (a client variable in sql) to set only if it hasn’t been set yet. In any case, it gets carried over the rest of the equation. In fact, if you run that query, you’ll see very random results -on a table of over 3000 records, I got anywhere from 0 to 5 records returned. That was some good learning, anyway.My real suggestion is to run two queries: 1) top = SELECT MAX as top FROM logs; #not a big overhead on an indexed integer column 2) Gather 6 random ruby bits: r = [rand(top),rand(top),...] 3) records = [“SELECT * FROM logs WHERE id IN (?) GROUP BY id LIMIT 3”, r]