Friday, May 4, 2018

Loading a 1% Random Sample


I needed to load a one percent random sample of a large file last week. I’m sure others have figured out this method but this was the first time I coded it like this:
MYSAMPLE:
Load * from MYFILE.qvd (qvd)
Where Rand()<=0.01;

The Rand function returns a random value between 0 and 1. In the load statement above it would only be true 1% of the time (approximately) -- that would load only 1% of the rows. If it was coded as Where Rand()<=0.20 then it would be a 20% random sample, etc.

This method can be used with almost any Load statement even one of the steps in a preceding load. 
Note that we're talking about random numbers here so the total number of rows loaded using this technique will not be exactly 1% of the total rows in the file, although it will be close especially when the total number of rows in the file is very large. Be aware too that the rows that get loaded will not be an evenly distributed sample of rows although it will be close.

  ★★★

6 comments:

Karl Pover said...

Out of curiosity, how large was the file?

-TB said...

It was about 12 million rows -- not a giant but too big for the kind of testing I wanted to do with the data (and I actually only loaded 0.1% of the rows).

Karl, since the Where clause isn't based on the data being loaded, do you think there is any easy way to make something like this work as an optimized load? I thought about creating a temporary table of integer numbers using the rand function and then making the where clause something like
Where exists(TMP_INTEGER, RecNo());

I think that might work but the extra typing cancels out the benefit of an optimized load.

-Tim

Vlad Gutkovsky said...

Just keep in mind that the reason this worked is specifically because you had 12 million records. This technique requires that you have a large enough data set that a random distribution truly provides 10%. For instance, if you had 10 rows, there's no guarantee that this would load 1 record. I'd say a more robust technique would be to use rand() to assign an ID to each record, then do a FIRST X load, ordering by the Rand ID you created, where X is 10% of noofrows() of the first table.

-TB said...

That is a good point, Vlad. If you were loading a file containing 10 rows using this technique it might load 2 or 5 or possibly, though unlikely, even 10 rows.
If the requirement is to load exactly 1% or 10% of the rows then you would need something like you suggest. In that's a client requirement then it would be worth clarifying with the client if simply loading every 100th or every 10th row would satisfy the requirement since that would be simpler to code.

I modified the blog text a bit this morning to say that the number of rows loaded is not a precise calculation and also that the rows that get loaded are not an even distribution of the data.

Thanks for your comment!

Gab said...

Hi All,

Just to add to this excellent discussion.

There's a function in Qlik Help called SAMPLE, which is not talked about at all. It's a quick way of loading sample data from the database.

Sample p ( loadstatement | selectstatement ). Where P stands for probability. Probability of a record occurring in the database. The P has to resolve to integer greater than 0 and less than 1. This function eliminate the need for RAND() function.

Example:
Sample 0.15 SQL SELECT * from Longtable;

Sample(0.15) LOAD * from Longtab.csv;

Hope this helps

Best Regards,
Gabriel

Mike Steedle said...

Hi, Tim.

I tested the reload time to sample 1% of a large QVD using a sample load (with added "WHERE 1 = 1" to force deoptimization, for SAMPLE to work) and the WHERE clause with the Rand function.

SAMPLE .01
LOAD *
FROM [FileName.qvd] (qvd)
WHERE 1 = 1;

The SAMPLE load took 2:15 and the WHERE clause took 2:31, which I was able to recreate a few times. Small difference, but if you're working with extremely large data or systematically sampling a large number of QVDs, it might be worth using the OOB feature for sampling.