Hadoop Streaming and Python: No Jython necessary!



Here is another jewel of an article, written by Michael Noll:
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

As hardware costs plummet and greater bandwidth becomes more affordable, cloud computing becomes even more alluring and feasible for the hobbyist. Working with large data sets is an interesting problem, now within reach of the average developer with a free Linux hard drive partition and some spare time.

Hadoop is essentially a framework, written in Java, for processing large data sets. It is usually used in conjunction with HDFS, a distributed file system designed for storing large data sets on cheaper hardware. HDFS data is often stored in units of entire web pages written to disk “as-is”. Once stored, this large data set is accessible for parsing and analysis.

Indexing of this data is done via what is called a map-reduce algorithm. The easiest map-reduce algorithm to envision is the word count algorithm. Each word on a page is assigned the value “1” in the map portion, then the words are sorted and the values for each word are added up in the reduce portion. The result is a frequency count of each word in a page. Word count is the simplest example; ten well-known machine learning algorithms expressed in map-reduce form are described in this white paper:

http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
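To make the word count idea concrete, here is a minimal sketch in plain Python, with no Hadoop involved. The function names map_words and reduce_counts and the sample sentence are made up for illustration; they are not taken from Michael's article or from the white paper.

```python
from itertools import groupby
from operator import itemgetter


def map_words(text):
    """Map step: emit a (word, 1) pair for every word in the text."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_counts(pairs):
    """Reduce step: sort the pairs by word, then sum the 1s for each word."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        word: sum(count for _, count in group)
        for word, group in groupby(ordered, key=itemgetter(0))
    }


if __name__ == "__main__":
    # Hypothetical sample "page" used only to show the two steps together.
    page = "the quick brown fox jumps over the lazy dog the end"
    print(reduce_counts(map_words(page)))
```

On a real cluster the map and reduce steps run on different machines and the framework does the sorting in between; this sketch only shows the shape of the two steps.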

The magic of Michael’s article is in how it avoids the stringent, often painful Jython interface to Hadoop. The trick is Hadoop Streaming: Hadoop passes the data through standard input and output to his simpler, more Pythonic mapper and reducer scripts, so no Jython is needed. Very clever indeed! It would be wonderful to see all of the relatively standard map-reduce algorithms written in Python.
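As a rough illustration of the streaming style (a sketch of the general approach, not Michael's actual code), the mapper and reducer below only read lines and write tab-separated "word, count" lines. Putting both steps in one script and choosing between them with a "map" or "reduce" argument is an assumption made here to keep the example short.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; assumes the input lines are sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: first argument is "map" or "reduce".
    step = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(out + "\n" for out in step(sys.stdin))
```

Saved under a hypothetical name such as wordcount.py, this can be tried without Hadoop by piping a text file through "python wordcount.py map", then "sort", then "python wordcount.py reduce". The sort in the middle stands in for the sorting Hadoop does between the map and reduce phases.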

He has a couple of other great articles on exactly how to install and configure Hadoop on Ubuntu here:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

and here:

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

On top of Hadoop there is usually a metadata store, holding high-level information about which page sets or data “chunks” can be found on which volumes, for efficiency. After attending a cloud computing conference in November, I realized that many companies have chosen to write their own metadata layer using HiveDB, or custom metadata tools, and many distributed MySQL databases, rather than settling for the expensive commercial solutions. This great work is the primary cause of the proliferation of easily accessible Open Source cloud computing tools.

Another project to watch is Mahout, an attempt to make all of the well-known map-reduce algorithms generally accessible across multicore Hadoop systems.

I run Fedora 10, and have a bit more work to do, but it is outstanding to see such an easily accessible installation for such a relatively complex set of tools. So many software tools and toys, so little time!

Gloria

How to see exception_notification plugin work in development mode.



I use HopToad by the Thoughtbot Guys (I say guys because I know they don’t have any girls on the team *wink*) to handle exceptions from my Rails apps these days, but today I found myself in a situation where I needed to use the exception_notification plugin instead. I haven’t used the plugin for quite some time, so I wanted to make sure I had everything set up correctly before pushing out to staging and production. I remembered that I had done this before in development, but I couldn’t remember everything I needed to do, so I, of course, asked uncle Google. After reading the readme and a little googling, I figured out what I needed to do in order to see it work in development. It took me far longer than I wanted, and I don’t want to go through that again, so I figured I would write a quick blog post to remind me next time I want to do it.

So here goes:

First, get exception_notification set up (this is all from the readme file):

script/plugin install git://github.com/rails/exception_notification.git

then in application.rb put
include ExceptionNotifiable

then in environment.rb put
ExceptionNotifier.exception_recipients = %w(joe@schmoe.com bill@schmoe.com)

Once you have it set up, you can do all the other stuff that lets you see it work in your development environment.

put the following two lines in your application.rb file
alias :rescue_action_locally :rescue_action_in_public
local_addresses.clear

then in your development.rb file change
config.action_controller.consider_all_requests_local = true
to be
config.action_controller.consider_all_requests_local = false

Exception Notifier doesn’t send email notifications on ActiveRecord::RecordNotFound and ActionController::UnknownAction errors, so you will need to create a 500 error to see the notification going out in your log. You can just add an action to a controller that raises a divide-by-zero error, restart your server, and hit that action; you should then see the notification trigger in your development log.

Once you have seen it work, make sure to undo everything in the second section.

Cheers

The Creative Commons license, and change.gov: Our new president gets it!



I was really excited and moved to see this:

http://change.gov/newsroom/entry/towards_a_21st_century_government/

This copyright license affects most content on the website change.gov, guaranteeing transparency and the freedom to redistribute and discuss its content.

All I can say is, wow. This president not only believes in transparency, but understands how to implement it using existing tools and conventions. He understands the spirit of Open Source software development, and everything that has come about since this movement began. Imagine an “Open Government”, genuinely implemented and influenced by the people, and it’s hard not to want to get involved.

The change.gov web site has a very nice, simple form, requesting your ideas and suggestions. They already have proposals for nationwide fiber optic broadband, automation and transparency in the patent process, and automation of health insurance data to bring healthcare costs down. There are many more places where technology can automate away inefficient and error-prone processes.

We here on DevChix all know how technology has changed our lives. Let them know how it can change a nation. Submit your ideas and suggestions, and let’s help educate and influence this new administration.

Gloria