Perl – The right tool for the right job

April 12th, 2007 by comment Nola

On one of my projects the requirements was to take two csv files, one with words and data and one with words. The goal is to use the list of words to retrieve the data from the other. Long-term storage for the datasets is not needed, so I didn’t see a reason to put in a database. Typically the language used for this company is PHP (though I have sneaked in a few ruby scripts.. shhhh) so my first try was to load each file into an array in memory and do a lookup. It worked fine for my sample files but croaked when I fed it the real data. My real data file was 180k some words and my word list was 400k words. So, I tried using a temporary table. I started it when I left for the day and it was still going the next morning. This was unbearable slow and never actually completed. I could have probably tried to optimize or choose different table types, but I didn’t…because I was thinking hmmmmm… wonder if Perl will be faster? It was, not only did it complete — but finished in minutes (with printing each word out to the display) or in seconds (in a more silent mode). I showed the people who would be using it – they were amazed. This sort of thing takes hours in excel and crashes sometimes.

# load data list into hash
$parser = Parse::CSV->new( binary => 1,
                    file => $infile_name );
my %DATA = {};
print "\\n\\nNow indexing data file\\n";
while ( $line = $parser->fetch ) {
	next if !$line;
	$word = shift @$line;
	next if ( (!defined($word)) or ($word eq ''));
	$DATA{$word} = $line;
	print $verbose ? "Indexing $word\\n" : '.';
}

The files sometimes have foreign characters in it so I open it in binary mode. Parse::CSV is a wrapper for Text::CSV_XS class and is much easier to use. Its written by Adam Kennedy and I had a question about it and was able to find him in IRC and ask him myself, and got some help from someone else in the room too — pretty cool! Back to the code, I skip if the line is not defined, shift off the first element of the array. The fetch method always returns an array reference, so I must prefix it with @ to access it as an array and shift off the first element. If thats not defined or blank, I use as the key to store the reference to the array. If verbose option was used, I display a longer notification to the user, otherwise I display just a period. That way the user knows its working!

#setup excel file
my $workbook = Spreadsheet::WriteExcel->new($outfile_name);
my $worksheet = $workbook->add_worksheet();

#process word list

$parser = Parse::CSV->new( binary => 1,
                 file => $wordfile_name );
my $row_count = 0;
my (@row, $row_ref);
print "\\nNow starting lookup...\\n";
while ( $line = $parser->fetch ) {
	@row = ();
	$word = shift @$line;
	next if ( (!defined($word)) or ($word eq ''));

	if (exists $DATA{$word}) {
		#word found
		$row_ref = $DATA{$word};  #get ref to array
	  	print $verbose ? "Found $word\n" : '+';
		push @row, $word;
		push @row, @$row_ref;
		$worksheet->write_row($row_count,
                                      0,
                                      \\@row);
	} else {
	  	print $verbose ? "Not Found $word\n" : '-';
		$worksheet->write_row($row_count,
                                      0,
                                      [$word, 'not found']);
	}
	$row_count++;
}

Similar to the loading of the hash, I open the file with Parser::CSV as before. I initialize my variables and set a row_count used to write to the excel file. This data file is supposed to be a single column list of words, so I could probably get away with just doing a simple file read but just in case I treat as a CSV file. Once I get the word variable, I see if it exists as a key in my %DATA hash. Funny thing about Perl regarding hashes and arrays – when referring to the whole thing you use % for hash, @ for array. But when referring to a particular element, you use $. Kinda funny but in a way makes sense…hehe. :P I simplified this code a bit so you get the concept and these lines better no doubt. I used @row as a temporary holder for the data I want to write to the excel file, I had some other code here I took out which had more to do with the @row. Finally I write the row array to the the excel file. The method requires an array ref, and \@row gives me a reference.

Thats basically it. The loading of %DATA hash is of course, dependent on how much memory you have. If you have problems you could try tieing it to a DB file:

use DB_File;

tie %DATA, 'DB_File', 'output.db'
         or die ('cannot open output.db');

Building a DB file also has the advantage of persistence. If you need to run a lot of lookups, multiple times you may be interested in using it. It will be slower to some extent, but it will work! :)

My script could probably be optimized some and perhaps golfed some which could make it faster. But I try and write Perl in english, at least for the first pass. Then perhaps I can use the shortcuts and see if it runs faster. This was a fun project and I got to learn more Perl. My friends Liz and Yaakov helped me and I thank them :)

Comments

4 Responses to “Perl – The right tool for the right job”

  1. Yaakov says:


    “Funny thing you use % for hash, @ for array. But when referring to a particular element, you use $. Kinda funny but in a way makes sense…hehe. ”

    This is easy to understand if you think about return values. The “sigil” (It means a “magic sign”, perl people are… odd) which will be % for hash, @ for array, and $ for scalar (nmemonic: %ash, @rray, $calar) indicates what you expect to return from the variable.

    If I want to assign an array, I stuff a list into it, so:

    @array = (1,2,3,4,5); [or, since I am writing perl: @array = 1..5;]

    Arrays store scalar values, so if I want a value from the array:

    $thing = $array[2];

    The $ sigil is referring to the *value* the expression will return, in this case a scalar. Consider:

    @things = @array[2..4]

    This is an “array slice”. It returns a list (of which an array is a sort, though somewhere in the philosophical realm there may be a distinction). The list is not a scalar, it is a collection of scalars and so @ is used.

    Just because we are talking about slices, here is some fun. If I have a hash, the variable type formerly known as “associative array” and in some languages called a “dictionary object”, I can take a slice of it as well. Slices do not have to be contiguous. Consider:

    %hash = ( Hello => ‘Shalom’ , Goodbye => ‘Aloha’, Peace => ‘Salam’ );

    (The => operator is some perlish syntactic sugar. It is a comma with big ideas. It automagically quotes the left hand side but is otherwise just a neat way to layout hash assignments.)

    Now lets make an assignment:

    @words = @hash{ ‘Hello’, ‘Peace’ };

    The array @words now contains two scalars, Shalom and Salam.

    Note that the sigil on the hash was an @ because we want back a list.

    When you look at sigils from the perspective of what is returned, they aren’t so odd after all.

  2. chromatic says:


    Yaakov’s explanation is absolutely correct.

    Nola, you have good instincts though; Perl 6 will have invariant sigils. $ will always mean that the variable is a scalar, @ an array, and % a hash. Perl 6 can tell that you want to perform a slicing operation on a hash, for example, based on the rest of the expression.

  3. Alex Jamestin says:


    Nice :) Learnt something new reading your code.
    I was actually looking for some way to tie a hash contents to a file.

    PS: Is it just me or am i the only one who likes to read the code first and then skim over the explanation (if at all)

  4. Nola says:


    Thanks.. Yeah sometimes I do that to see if the code looks like what I am doing to see if its worth reading the explanation!

    glad it helped!

Got something to say?