Hadoop Streaming and Python: No Jython necessary!


Here is another jewel of an article, written by Michael Noll:

As hardware costs plummet and greater bandwidth becomes more affordable, cloud computing becomes even more alluring and feasible for the hobbyist. Working with large data sets is an interesting problem space, now attainable by the average developer with a free Linux hard drive partition and some spare time.

Hadoop is essentially a framework, written in Java, for processing large data sets. For the average developer, it’s usually used in conjunction with HDFS, which is a custom file system designed for storing large data sets on cheaper hardware. HDFS data is usually stored in units of entire web pages written to disk “as-is”. Once stored, this large data set is accessible for parsing and analysis.

Indexing of this data is done via what is called a map-reduce algorithm. The easiest map-reduce algorithm to envision is the word count algorithm. Each word on a page is assigned the value “1” in the map portion; in the reduce portion, the words are sorted so that identical words sit together, and their values are added up. The result is a frequency count of each word in a page. This is the simplest of ten well known map-reduce algorithms, found in this white paper:
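The word count described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, sort, and reduce steps, not Hadoop code; the function names are made up for this example, and lowercasing the words is an assumption of mine.

```python
from itertools import groupby
from operator import itemgetter


def map_words(page_text):
    """Map step: pair every word on the page with the value 1."""
    # Lowercasing is an assumption; the description above does not say
    # whether "The" and "the" should count as the same word.
    return [(word, 1) for word in page_text.lower().split()]


def reduce_counts(pairs):
    """Reduce step: sort the pairs by word, then add up the values per word."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        word: sum(value for _, value in group)
        for word, group in groupby(ordered, key=itemgetter(0))
    }


def word_count(page_text):
    """Run the map step, then the reduce step, on one page of text."""
    return reduce_counts(map_words(page_text))
```

Calling `word_count("the cat and the hat")` returns `{'and': 1, 'cat': 1, 'hat': 1, 'the': 2}`.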


The magic of Michael’s article is in how it avoids the stringent, often painful Jython interface to Hadoop. The trick is Hadoop Streaming: his simpler, more Pythonic map-reduce scripts exchange data with Hadoop over standard input and standard output on the command line, so the complex Jython interface is not needed at all. Very clever indeed! It would be wonderful to see all of the relatively standard map-reduce algorithms written in Python.
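To give a feel for the streaming style, here is a rough sketch of a mapper and a reducer that talk only through line-oriented streams. It is not Michael's code; the function names and the tab-separated `word<TAB>count` line format are assumptions for this illustration, and `simulate_job` is a local stand-in for the sort that happens between the two steps.

```python
import io
from itertools import groupby


def run_mapper(stream_in, stream_out):
    """Write one 'word<TAB>1' line for every word read from stream_in."""
    for line in stream_in:
        for word in line.split():
            stream_out.write(f"{word}\t1\n")


def run_reducer(stream_in, stream_out):
    """Sum the counts per word; the input lines must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream_in)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        stream_out.write(f"{word}\t{total}\n")


def simulate_job(text):
    """Local stand-in for a streaming job: map, sort, then reduce."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    # Sorting the mapper output puts identical words on adjacent lines.
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    run_reducer(shuffled, reduced)
    return reduced.getvalue()
```

In a real streaming job, each function would presumably live in its own script and be called with `sys.stdin` and `sys.stdout`. The same pair of scripts can then be tried out locally with an ordinary shell pipe and `sort` before going anywhere near a cluster.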

He has a couple of other great articles on exactly how to install and configure Hadoop on Ubuntu here:

and here:


On top of Hadoop is usually a metadata store, holding high level information about which page sets or data “chunks” can be found on which volumes, for efficiency. After attending a cloud computing conference in November, I realized that many companies have chosen to write their own metadata layer using HiveDB, or custom metadata tools, and many distributed MySQL databases, rather than settling for the expensive commercial solutions. This great work is the primary cause of the proliferation of easily accessible Open Source cloud computing tools.

Another project to watch is called Mahout, which is an attempt to make all of the well known map-reduce algorithms generally accessible across multicore Hadoop systems.

I run Fedora 10, and have a bit more work to do, but it is outstanding to see such an easily accessible installation for such a relatively complex set of tools. So many software tools and toys, so little time!


The Creative Commons license, and change.gov: Our new president gets it!


I was really excited and moved to see this:


This copyright license affects most content on the website change.gov, guaranteeing transparency and the freedom to redistribute and discuss its content.

All I can say is, wow. This president not only believes in transparency, but understands how to implement it using existing tools and conventions. He understands the spirit of Open Source software development, and everything that has come about since this movement began. Imagine an “Open Government”, genuinely implemented and influenced by the people, and it’s hard not to want to get involved.

The change.gov web site has a very nice, simple form, requesting your ideas and suggestions. They already have proposals for nationwide fiber optic broadband, automation and transparency in the patent process, and automation of health insurance data to bring healthcare costs down. There are many more places where technology can automate away inefficient and error prone processes.

We here on DevChix all know how technology has changed our lives. Let them know how it can change a nation. Submit your ideas and suggestions, and let’s help educate and influence this new administration.


Shameless plug


My article appears in this month’s Python Magazine.
This article was written months ago, and a few small utilities have been created since then, which would have made this article shorter. With the current 8.3 PostgreSQL distribution comes a new external db creation tool, which eliminates some of the documented weirdness of isolation level 0 DB creation and access. But other than that, the code associated with this article still serves as a great example of database agnostic ORM methodology in Python.

Please give the sample code a try, and feel free to ask any questions, etc.


From Python 2.6 to PHP 5.2: A circuitous journey


When I started heavily using PHP 5.2 (not by choice, I’ll admit), I was impressed, but I suffered from some incorrect assumptions about what PHP5 is and is not capable of doing. The good news is that it is more object oriented than its predecessor, but it has some caveats to consider. Here are some things to be aware of when switching from a pure OO language to PHP5:

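1: Reading a nonexistent PHP array key generates no visible error or warning under the usual settings. PHP raises only an E_NOTICE, which the default error_reporting level hides. When trying to iterate over a nonexistent array key, a warning occurs. In other languages, both of these conditions throw an exception.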

Try this code for example:

<?php
// Iterate over the keys that do exist.
$dictionary=array('one'=>'got one','two'=>'have two','four'=>'missing three?');
foreach (array_keys($dictionary) as $key)
	print "Key is:".$key.", value is:".$dictionary[$key]."\n";
// Read a key that does not exist, then try to iterate over it.
print "Try undefined key three, no warning occurs:".$dictionary['three']."\n";
foreach ($dictionary['three'] as $value) {
	print "Now we're iterating over a nonexistent key:";
	print "Key is: three, value is:".$dictionary['three']."\n";
}

Running it results in this output:

php test.php
Key is:one, value is:got one
Key is:two, value is:have two
Key is:four, value is:missing three?
Try undefined key three, no warning occurs:

Warning: Invalid argument supplied for foreach() in /root/test.php on line 8

If it is vital to me to make sure I am aware of missing keys, I have only two choices. If I need a proactive solution, I have to use the array_key_exists() function to do existence checking before use. If I want a reactive solution, I write a log scanner to pick up on these warnings. In every other OO language I have used, an exception was thrown for this condition, and my exception handling determined whether the error was vital enough to exit immediately or not. This seems like a more efficient way to handle this condition. I would imagine PHP5 does not do this because of its need to be backward compatible with PHP4, but this is a guess.

It would be wonderful to have a -OO flag for PHP, which gives you the option to run PHP and expect more standard, stricter OO behavior in these instances.
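For comparison, here is roughly what the exception-based behavior described above looks like in Python. This is just a sketch: the `describe` helper and its `required` flag are invented for this example, and the dictionary mirrors the PHP one.

```python
dictionary = {"one": "got one", "two": "have two", "four": "missing three?"}


def describe(mapping, key, required=True):
    """Look up key; a missing key is fatal only when required is True."""
    try:
        return mapping[key]  # raises KeyError if the key is absent
    except KeyError:
        if required:
            raise  # vital: let the exception propagate to the caller
        return None  # not vital: carry on with a placeholder value


# The proactive check, similar in spirit to array_key_exists() in PHP:
has_three = "three" in dictionary
```

Here `describe(dictionary, "three", required=False)` quietly returns `None`, while `describe(dictionary, "three")` raises `KeyError`, so the calling code decides how serious a missing key is.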

2: Warnings cannot be “caught” like exceptions. Exceptions and warnings are distinctly separate beasts, and never the twain shall meet. Fine, I thought, maybe I could detect warnings similar to how we detect errors. But it seems like warnings cannot be detected when they happen. There is no PHP code I know of which can check whether a warning has occurred at runtime. I tried to detect it using error_get_last(), but to no avail. If you know how, post your trick here.

3: In PHP, ‘true’ prints as ‘1’ when it is converted to a string. To get the literal text ‘true’ from a true statement, one needs to var_export() it. Similarly, or maybe not, ‘false’ converts to an empty string and prints nothing. Here is an example:

print "\nThe raw value of a true statement in PHP:".true;
print "\nThe raw value of a false statement in PHP:".false;
print "\nThe exported value of a true statement in PHP:".var_export(true,true);
print "\nThe exported value of a false statement in PHP:".var_export(false,true);
print "\n";

And the output:

The raw value of a true statement in PHP:1
The raw value of a false statement in PHP:
The exported value of a true statement in PHP:true
The exported value of a false statement in PHP:false

This may not be noticeable to you in a standard expression. But if you’re doing funky stuff, like using the evaluated expression values as key references into the dictionary of a decision tree, for example, 1 does not equal ‘true’, and the difference matters quite a bit.
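Python has a wrinkle of the same flavor, which makes for a handy comparison. The sketch below is a Python analogue, not a translation of the PHP behavior; the `decision_tree` dictionary and `branch_for` helper are hypothetical names for this example.

```python
decision_tree = {True: "take the left branch", False: "take the right branch"}


def branch_for(result):
    """Return the branch stored for result, or None when no key matches."""
    return decision_tree.get(result)


print(branch_for(3 > 2))       # a real boolean: matches the True key
print(branch_for(1))           # 1 == True in Python, so this matches too
print(branch_for("true"))      # a string is a different key: no match
print(branch_for(str(3 > 2)))  # 'True' as a string: no match either
```

The lesson is the same in both languages: decide whether the keys are booleans, integers, or strings, and convert explicitly before the lookup.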

4: Long running processes with recursive circular references (such as Doctrine code) run out of memory. This is documented in many places, and the free() function works sometimes. A fix is coming in PHP 5.3. The foolproof solution for my code in production today (youch!) is to periodically restart the daemon. If you’re cringing right now, know that you’re not cringing alone.

There may be a part II to this article. Feel free to add your own PHP5 observations.


GEOPY: Rockin' out geocoding for Python



I have a bit of a breather between contracts, so I’m back from a writing hiatus to show you more nifty Python tools and juicy goodness.

This is yet another great example of Python’s elegance and ease of use:


It doesn’t just geocode. It gives you a common API to just about all of the publicly accessible URL-based geocode services. The web page labels which ones require a key for access, and which ones do not. Beautiful!

The library is also able to calculate distances using several different distance approximation formulae, as well as parse out and reformat geographical data. Now this is some nifty py-candy, eh?
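To show what a distance approximation formula looks like, here is a standalone sketch of the haversine formula, which treats the Earth as a perfect sphere. This is not geopy's code or API, only an illustration of the kind of calculation involved; the function name and the mean-radius constant are my own assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Because the Earth is not actually a sphere, a spherical formula like this one trades some accuracy for simplicity, which is presumably why a library would offer more than one formula to choose from.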

Nice job, Brian, you rocked this out.

My only gripes are not with this tool at all, but with geocoding in general:

1: Where to get accurate, free IP address geocoding?

2: Why isn’t this data free for download, damn it? What if I want to use a cross section of it, and store it in my own database, without paying a fortune for it? One day, will Open Street Map be the source for Open Source geocode data which is freely downloadable, thus eliminating the 1000 queries per day limit?

Maybe we should all join forces, and store the results from 1000 queries per day per person, and build this database ourselves? Hmmm…


PS: The geopy project has a code sprint scheduled for this Sunday, Nov. 16. Join in!