Friday, May 4, 2018

Loading a 1% Random Sample


I needed to load a one percent random sample of a large file last week. I’m sure others have figured out this method but this was the first time I coded it like this:
MYSAMPLE:
Load * from MYFILE.qvd (qvd)
Where Rand()<=0.01;

The Rand function returns a random value between 0 and 1. In the load statement above it would only be true 1% of the time (approximately) -- that would load only 1% of the rows. If it was coded as Where Rand()<=0.20 then it would be a 20% random sample, etc.

This method can be used with almost any Load statement even one of the steps in a preceding load. 
Note that we're talking about random numbers here so the total number of rows loaded using this technique will not be exactly 1% of the total rows in the file, although it will be close especially when the total number of rows in the file is very large. Be aware too that the rows that get loaded will not be an evenly distributed sample of rows although it will be close.

  ★★★