tag:blogger.com,1999:blog-3174073825360154810.post2993438916195764293..comments2022-12-06T05:49:47.306-05:00Comments on QlikView Maven: Loading a 1% Random Sample-TBhttp://www.blogger.com/profile/08386812862574514289noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-3174073825360154810.post-36673062002769435032018-05-31T08:07:34.822-04:002018-05-31T08:07:34.822-04:00Hi, Tim.
I tested the reload time to sample 1% of...Hi, Tim.<br /><br />I tested the reload time to sample 1% of a large QVD using a sample load (with added "WHERE 1 = 1" to force deoptimization, for SAMPLE to work) and the WHERE clause with the Rand function.<br /><br />SAMPLE .01<br />LOAD *<br />FROM [FileName.qvd] (qvd)<br />WHERE 1 = 1;<br /><br />The SAMPLE load took 2:15 and the WHERE clause took 2:31, which I was able to recreate a few times. Small difference, but if you're working with extremely large data or systematically sampling a large number of QVDs, it might be worth using the OOB feature for sampling.Mike Steedlehttp://blog.axisgroup.comnoreply@blogger.comtag:blogger.com,1999:blog-3174073825360154810.post-72467306752032659162018-05-20T07:54:31.616-04:002018-05-20T07:54:31.616-04:00Hi All,
Just to add to this excellent discussion....Hi All,<br /><br />Just to add to this excellent discussion.<br /><br />There's a function in Qlik Help called SAMPLE, which is not talked about at all. It's a quick way of loading sample data from the database.<br /><br />Sample p ( loadstatement | selectstatement ). Where P stands for probability. Probability of a record occurring in the database. The P has to resolve to integer greater than 0 and less than 1. This function eliminate the need for RAND() function. <br /><br />Example:<br />Sample 0.15 SQL SELECT * from Longtable;<br /><br />Sample(0.15) LOAD * from Longtab.csv;<br /><br />Hope this helps<br /><br />Best Regards,<br />GabrielGabhttps://www.blogger.com/profile/07364473640942386149noreply@blogger.comtag:blogger.com,1999:blog-3174073825360154810.post-88005842734004862632018-05-07T11:18:18.016-04:002018-05-07T11:18:18.016-04:00That is a good point, Vlad. If you were loading a ...That is a good point, Vlad. If you were loading a file containing 10 rows using this technique it might load 2 or 5 or possibly, though unlikely, even 10 rows.<br />If the requirement is to load exactly 1% or 10% of the rows then you would need something like you suggest. In that's a client requirement then it would be worth clarifying with the client if simply loading every 100th or every 10th row would satisfy the requirement since that would be simpler to code. <br /><br />I modified the blog text a bit this morning to say that the number of rows loaded is not a precise calculation and also that the rows that get loaded are not an even distribution of the data.<br /><br />Thanks for your comment! -TBhttps://www.blogger.com/profile/08386812862574514289noreply@blogger.comtag:blogger.com,1999:blog-3174073825360154810.post-30602497904983342512018-05-06T19:38:53.352-04:002018-05-06T19:38:53.352-04:00Just keep in mind that the reason this worked is s...Just keep in mind that the reason this worked is specifically because you had 12 million records. This technique requires that you have a large enough data set that a random distribution truly provides 10%. For instance, if you had 10 rows, there's no guarantee that this would load 1 record. I'd say a more robust technique would be to use rand() to assign an ID to each record, then do a FIRST X load, ordering by the Rand ID you created, where X is 10% of noofrows() of the first table.Vlad Gutkovskyhttps://www.blogger.com/profile/10157150095997429564noreply@blogger.comtag:blogger.com,1999:blog-3174073825360154810.post-34018781495612448312018-05-05T11:45:41.953-04:002018-05-05T11:45:41.953-04:00It was about 12 million rows -- not a giant but to...It was about 12 million rows -- not a giant but too big for the kind of testing I wanted to do with the data (and I actually only loaded 0.1% of the rows).<br /><br />Karl, since the <i>Where</i> clause isn't based on the data being loaded, do you think there is any easy way to make something like this work as an optimized load? I thought about creating a temporary table of integer numbers using the rand function and then making the where clause something like <br /> Where exists(TMP_INTEGER, RecNo()); <br /><br />I think that might work but the extra typing cancels out the benefit of an optimized load.<br /><br />-Tim-TBhttps://www.blogger.com/profile/08386812862574514289noreply@blogger.comtag:blogger.com,1999:blog-3174073825360154810.post-38865399631289723372018-05-05T11:19:31.759-04:002018-05-05T11:19:31.759-04:00Out of curiosity, how large was the file? Out of curiosity, how large was the file? Karl Poverhttps://www.blogger.com/profile/09248474588988796613noreply@blogger.com