Monday, May 6, 2013

Loading an Unstructured Text File


This is the start of a series of postings about dealing with unstructured text data. I've used QlikView to help analyze message log files, error files, and even program source code; there are some things QlikView can do that Notepad-style text editors cannot.
First, we need to load the data. It isn't always a completely unstructured file like my example; often the text is one or more fields inside a structured file or database. There are various ways to load a text file; this is what I usually put into the load script:
TEXTFILE:
LOAD
    RecNo() as recno,    // record (line) number within the input file
    @1 as rectext        // the full text of the line
FROM
[$(File_Pathname)]
(txt, no labels, delimiter is \x7, no quotes);

That code will load each record from the text file into the rectext field. I use a variable for the file pathname because I usually use a load script like this in a document where the user can supply a filename through an input box. Alternatively, you could add browse-for-file functionality like I wrote about in March 2009 (search on QlikView Maven Browse-for-File Macro Button). Because the document objects are built for unstructured data, the document works well as a general-purpose tool that can be used with many types of data in varying situations.
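For example, a minimal sketch of the variable setup might look like this; the pathname below is purely hypothetical, and in a real document an input box would let the user overwrite it:

// Hypothetical default pathname. An input box object bound to the
// File_Pathname variable lets the user point at a different file.
SET File_Pathname = C:\logs\server_messages.txt;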

I defined the file as though it were delimited and specified an uncommon delimiter, \x7 (the ASCII bell character), that I don't expect to find in the data. Because the delimiter never actually appears, each entire line is loaded as the single field @1.

The record number from the input file, loaded as recno, is a useful piece of data, and we'll be using it in some of the example objects over the next few postings.
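As a preview of the kind of thing recno makes possible, here is a minimal sketch (the search string 'error' is just an assumption for illustration) that flags matching lines in a second pass while keeping the record number, so a flagged line can later be located among its neighbors:

// Sketch: flag lines containing 'error', keeping recno so each
// flagged line can be found in context in the TEXTFILE table.
FLAGGED:
LOAD
    recno,
    1 as is_error_line
RESIDENT TEXTFILE
WHERE Index(Lower(rectext), 'error') > 0;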

3 comments:

Shilpan said...

TB,

Thanks for sharing. Any thoughts on how to handle unstructured data via the Hadoop file system? Hadoop is omnipresent, and I think it is just a matter of time before QlikView will have to develop a connector to deal with big data.

-TB said...

I don't have any personal experience with Hadoop. Google tells me that several hundred people come by this blog every day - maybe someone will add their thoughts on Hadoop.

the4thv (@the4thv) said...

Hi Maven and Shilpan,

I like the principle of the exercise here and I'm finding it very useful to learn more about QlikView load scripts, so thanks! :)

However, QlikView isn't ideal for processing massive data sets (terabytes, definitely not petabytes). This is where Hadoop comes in. (By the way, you can connect to Hive, Hadoop's data-warehouse sister project, via a JDBC connector.)

Hadoop is massively scalable (across many nodes) and fault tolerant (it can handle a node failing without human intervention), and you can run it on any number of cheap commodity boxes. In other words, it is made to cope with data of any size cheaply: as the data set grows, either the time to complete goes up or the number of nodes goes up. There are gotchas in how you design the MapReduce tasks that can interfere with this principle, but that's the basic idea.

Personally, if you are reaching terabytes of data that need processing, I'd use Hadoop to create aggregated, structured data first and then import that into QlikView.
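To make that last step concrete, here is a minimal sketch. It assumes the aggregation has already run in Hadoop and that an ODBC driver (rather than the JDBC route mentioned above, since QlikView's script connects through ODBC or OLE DB) exposes the Hive table; the DSN, table, and column names are all made up:

// Hypothetical sketch: pull a pre-aggregated Hive summary table
// into QlikView through an ODBC DSN. All names here are invented.
ODBC CONNECT TO HiveDSN;

DAILY_ERRORS:
SQL SELECT log_date, error_code, error_count
FROM daily_error_summary;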