Wednesday, January 7, 2015

Optimized QVD Load With a Where Clause and Dates

If you are working with large quantities of data and especially if you are working with large data sets of historical data where you add a portion of new data to the historical data each week (an incremental load), then you will want to be working with qvd files in your loadscript. And if you are working with qvd files then you certainly want to load data from the qvd files with an optimized load. The optimized load is many times faster than loading the same data with an unoptimized load. 

A common problem is that adding a where clause to the load statement will cause the load to be unoptimized. Some types of where clauses though, can be used with an optimized load. A simple where exists type of where clause, for example, can usually be used with an optimized load.

Every incremental load process is a little different but if your process involves dates then this idea may help. This idea is for situations where you want to load data from a qvd file and you want your where clause to refer to a date within the data. Here’s an example:

Imagine that you have a date like 01/01/2014 in a variable named start_dt and another date like 12/31/2015 in a variable named end_dt and you want to load invoice data where the invoice date, inv_date, is between start_dt and end_dt. First, build a little table of all possible dates between start_dt and end_dt. Use this in the loadscript before loading any other data containing a INV_DATE field:

/* Build table of dates just for the "where exists" clause */
Load date(date('$(start_dt)')+IterNo()-1) as INV_DATE 
While date(date('$(start_dt)')+IterNo()-1) <= date('$(end_dt)'); 

That code can build a large table of dates in your loadscript in just a few seconds. Now, you can code the load from qvd like this:

INVOICE_DATA:   //this is an optimized load
Load * from invoice_data.qvd (qvd) 
where exists(INV_DATE);
Drop table TMP_DATE_TABLE; //no longer needed

That table of dates can be varied as needed for your specific application. For example, using (IterNo()-1)*7 could give you a set of week start dates. Use whatever method makes sense to create a list of the dates that you would want to load if they exist within the qvd file data. Remember, the basic idea here is to build a table of values you would like to load from the qvd file. It is not limited to dates and not all of the values in your temp table have to actually match something in the qvd file.

____________ _________________ _________________

Note to Santhosh:  If you need to quickly find the maximum or latest invoice date from the qvd file in order to use it with date arithmetic then you can get it in a variable by doing something like this in your loadscript before loading any other tables with the INV_DATE field:

TMP_INV_DATE:  //this is an optimized load
Load INV_DATE FROM invoice_data.qvd (qvd);
Load Max(INV_DATE) as maxinvdate;  //preceding load
Load FieldValue('INV_DATE',IterNo()) as INV_DATE
While not Isnull(FieldValue('INV_DATE',IterNo()));
Let var_maxinvdate = Peek('maxinvdate',-1,'TMP_MAXINV_DATE');
Drop tables TMP_INV_DATE, TMP_MAXINV_DATE; //not needed



Anonymous said...

Hi Tim,

Great post!

Could you explain how the section of code at the bottom of post is working?

It finds the min and max dates exponentially faster than doing it via a resident load, but I'm not clear on how it's pulling back the results

-TB said...

QlikView internally keeps track of all values for each field - it helps QlikView with selection processing and the great data compression. That load statement utilizes QlikView's list of all values for a field instead of processing all of the table rows that happens with a resident load.

I wrote about something similar a few years ago (search on "qlikview maven Load a Table With All of the Values for a Field") but I described it in terms of a method for avoiding synthetic keys and I think many people missed the value of the concept.


-TB said...

I should add that the FieldValue function returns distinct values. So, although your table may contain three million rows, if the data covers one year then there would only be 365 date values.