Here is another jewel of an article, written by Michael Noll:
As hardware costs plummet and greater bandwidth becomes more affordable, cloud computing becomes ever more alluring and feasible for the hobbyist. Working with large data sets is an interesting problem space, now attainable by the average developer with a free Linux partition and some spare time.
Hadoop is essentially a framework, written in Java, for processing large data sets. For the average developer, it’s usually used in conjunction with HDFS, a custom file system designed for storing large data sets on cheap hardware. HDFS data is usually stored in units of entire web pages written to disk “as-is”. Once stored, this large data set is accessible for parsing and analysis.
Indexing of this data is done via what is called a map-reduce algorithm. The easiest map-reduce algorithm to envision is word count. In the map portion, each word on a page is assigned the value “1”; in the reduce portion, the words are sorted and their values added up. The result is a frequency count of each word on a page. This is the simplest of ten well-known map-reduce algorithms, found in this white paper:
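To make the word-count idea concrete, here is a minimal pure-Python sketch of the two phases; the function names and the sample page are just illustrative, and the `sorted()` call stands in for the shuffle/sort step a real map-reduce framework performs between the phases:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word on every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort groups identical words together; reduce then
    # sums the 1s for each word to get its frequency.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

page = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(page)))
print(counts["the"])  # "the" appears three times, so three 1s sum to 3
```

The payoff of writing it this way is that neither phase needs the whole data set in memory at once, which is exactly what lets the real framework spread the work across machines.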
The magic of Michael’s article is in how to avoid the stringent, often painful Jython interface to Hadoop. The trick is to stream data on the command line from his simpler, more Pythonic map-reduce scripts directly into Hadoop. Very clever indeed! It would be wonderful to see all of the relatively standard map-reduce algorithms written in Python.
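The streaming style works because the mapper and reducer are just programs that read lines on stdin and write tab-separated lines on stdout, so Hadoop can pipe data through them like any shell command. A rough sketch of that shape (the sample input is illustrative, and the `sorted()` call here only simulates the sort that Hadoop Streaming performs between the two scripts on a real cluster):

```python
def mapper(lines):
    # Plays the role of mapper.py: one "word\t1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Plays the role of reducer.py: input arrives sorted by key, so
    # equal words are adjacent and a running total suffices.
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Locally this is the pipeline: cat input | mapper | sort | reducer.
# On a cluster, Hadoop Streaming supplies the piping and the sort.
sample = ["the cat sat", "the cat"]
for out in reducer(sorted(mapper(sample))):
    print(out)
```

Because the two halves only ever see one line at a time, the same scripts can be tested with ordinary shell pipes on a laptop before being handed to the cluster.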
He has a couple of other great articles on exactly how to install and configure Hadoop on Ubuntu here:
On top of Hadoop there is usually a metadata store, holding high-level information about which page sets or data “chunks” can be found on which volumes, for efficiency. After attending a cloud computing conference in November, I realized that many companies have chosen to write their own metadata layer using HiveDB, custom metadata tools, and many distributed MySQL databases, rather than settling for the expensive commercial solutions. This great work is a primary driver of the proliferation of easily accessible Open Source cloud computing tools.
Another project to watch is Mahout, an attempt to make all of the well-known map-reduce algorithms generally accessible across distributed Hadoop systems.
I run Fedora 10, and have a bit more work to do, but it is outstanding to see such an easily accessible installation for such a relatively complex set of tools. So many software tools and toys, so little time!