Perl: Climbing Trees


LanguagesPerlTips and Tricks

HTML tree, that is… Ahh HTML. The tree with often not so perfect branches. Recently for one of my projects I had to grab certain bits of information from a series of HTML pages. The HTML was proper in some and not so good with others (huh? Font tags? Poorly nested tags?). I thought about parsing it as I would an XML document but given the “crooked branches” I figured that wouldn’t work. I could write a mess of regex, but gee — isn’t there a better way? Through a series of searches and poking around, I discovered HTML::Tree module which seemed to suit my purpose well! I will not bore you with the use I had for using HTML::Tree, so I made a fun sample:

#!/usr/bin/perl

use warnings;
use strict;
use diagnostics;

use HTML::TreeBuilder;

The typical top of a Perl script. I usually ″use diagnostics″ when developing and take it out when I am done. Just give you more verbose error messages. I need all the help I can get!

Now the fun stuff:

I made a function to load the DevChix homepage that I had saved to a local file. Ideally, this data would be pulled live from the site. For now though, I start simple.

sub load_tree {
   my $page = HTML::TreeBuilder->new();
   $page->parse_file('DevChix.htm');
   return $page;
}

This returns the TreeBuilder object with my data loaded. Using the most awesome tool Firebug (Also, see Jen′s post about it awhile back), I see that the sidebar list is a div tag with id of “sidebarposts”. Lets look down our Html Tree and find that element:

my $page = load_tree();

my $sidebar = $page->look_down( '_tag', 'div',
   sub { $_[0]->id eq 'sidebarposts' } );

Not too complex. Look down the tree, look for a tag thats a div with the id of “sidebarposts” .. gee, thats nearly english (and people say that Perl is jibberish! Bah!).

Now, lets grab the li elements in that div:

my @ul_list = $sidebar->look_down('_tag','li');

foreach my $li(@ul_list) {
  print $li->as_text, "\n";
}

I know I′ll be getting back more than 1 element so I assign it to an array instead of a scalar. Then in the for loop, I want to iterate through the list and print the element as text, which gives me the name of the link.

Output is something like this:

...
Regular Expressions: The Wart on the Bum of Every Language in Existence
RUBY: DRY up your Enumerations
*waving, not drowning*
Beautiful Python: The programming language that taught me how to love again
RailsConf
Test More for Java?!
Book Review: Beginning Ruby On Rails E-Commerce
...

Using a code ref to find elements came in extremely handy when I had some bunked up HTML, sometimes I had to look_down a table, find a tr that contained a table which had a certain class, or find all href tags that had a certain domain in the URL and then grab the tr container so I could grab the following tr .. etc.

The Tree::Builder interface has a ton more methods than I’ve talked about and you should take a look at it next time you need to grab bits from html pages!

Perl – The right tool for the right job


Perl

On one of my projects the requirements was to take two csv files, one with words and data and one with words. The goal is to use the list of words to retrieve the data from the other. Long-term storage for the datasets is not needed, so I didn’t see a reason to put in a database. Typically the language used for this company is PHP (though I have sneaked in a few ruby scripts.. shhhh) so my first try was to load each file into an array in memory and do a lookup. It worked fine for my sample files but croaked when I fed it the real data. My real data file was 180k some words and my word list was 400k words. So, I tried using a temporary table. I started it when I left for the day and it was still going the next morning. This was unbearable slow and never actually completed. I could have probably tried to optimize or choose different table types, but I didn’t…because I was thinking hmmmmm… wonder if Perl will be faster? It was, not only did it complete — but finished in minutes (with printing each word out to the display) or in seconds (in a more silent mode). I showed the people who would be using it – they were amazed. This sort of thing takes hours in excel and crashes sometimes.

# load data list into hash
$parser = Parse::CSV->new( binary => 1,
                    file => $infile_name );
my %DATA = {};
print "\\n\\nNow indexing data file\\n";
while ( $line = $parser->fetch ) {
	next if !$line;
	$word = shift @$line;
	next if ( (!defined($word)) or ($word eq ''));
	$DATA{$word} = $line;
	print $verbose ? "Indexing $word\\n" : '.';
}

The files sometimes have foreign characters in it so I open it in binary mode. Parse::CSV is a wrapper for Text::CSV_XS class and is much easier to use. Its written by Adam Kennedy and I had a question about it and was able to find him in IRC and ask him myself, and got some help from someone else in the room too — pretty cool! Back to the code, I skip if the line is not defined, shift off the first element of the array. The fetch method always returns an array reference, so I must prefix it with @ to access it as an array and shift off the first element. If thats not defined or blank, I use as the key to store the reference to the array. If verbose option was used, I display a longer notification to the user, otherwise I display just a period. That way the user knows its working!

#setup excel file
my $workbook = Spreadsheet::WriteExcel->new($outfile_name);
my $worksheet = $workbook->add_worksheet();

#process word list

$parser = Parse::CSV->new( binary => 1,
                 file => $wordfile_name );
my $row_count = 0;
my (@row, $row_ref);
print "\\nNow starting lookup...\\n";
while ( $line = $parser->fetch ) {
	@row = ();
	$word = shift @$line;
	next if ( (!defined($word)) or ($word eq ''));

	if (exists $DATA{$word}) {
		#word found
		$row_ref = $DATA{$word};  #get ref to array
	  	print $verbose ? "Found $word\n" : '+';
		push @row, $word;
		push @row, @$row_ref;
		$worksheet->write_row($row_count,
                                      0,
                                      \\@row);
	} else {
	  	print $verbose ? "Not Found $word\n" : '-';
		$worksheet->write_row($row_count,
                                      0,
                                      [$word, 'not found']);
	}
	$row_count++;
}

Similar to the loading of the hash, I open the file with Parser::CSV as before. I initialize my variables and set a row_count used to write to the excel file. This data file is supposed to be a single column list of words, so I could probably get away with just doing a simple file read but just in case I treat as a CSV file. Once I get the word variable, I see if it exists as a key in my %DATA hash. Funny thing about Perl regarding hashes and arrays – when referring to the whole thing you use % for hash, @ for array. But when referring to a particular element, you use $. Kinda funny but in a way makes sense…hehe. :P I simplified this code a bit so you get the concept and these lines better no doubt. I used @row as a temporary holder for the data I want to write to the excel file, I had some other code here I took out which had more to do with the @row. Finally I write the row array to the the excel file. The method requires an array ref, and \@row gives me a reference.

Thats basically it. The loading of %DATA hash is of course, dependent on how much memory you have. If you have problems you could try tieing it to a DB file:

use DB_File;

tie %DATA, 'DB_File', 'output.db'
         or die ('cannot open output.db');

Building a DB file also has the advantage of persistence. If you need to run a lot of lookups, multiple times you may be interested in using it. It will be slower to some extent, but it will work! :)

My script could probably be optimized some and perhaps golfed some which could make it faster. But I try and write Perl in english, at least for the first pass. Then perhaps I can use the shortcuts and see if it runs faster. This was a fun project and I got to learn more Perl. My friends Liz and Yaakov helped me and I thank them :)