Hadoop Streaming and Python: No Jython necessary!



Here is another jewel of an article, written by Michael Noll:
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

As hardware costs plummet and greater bandwidth becomes more affordable, cloud computing becomes even more alluring and feasible for the hobbyist. Working with large data sets is an interesting problem set, now attainable by the average developer with a free Linux hard drive partition and some spare time.

Hadoop is essentially a framework, written in Java, for processing large data sets. It’s usually used in conjunction with HDFS, a distributed file system designed for storing large data sets on cheaper hardware. For web data, HDFS content is often stored in units of entire web pages written to disk “as-is”. Once stored, this large data set is accessible for parsing and analysis.

Processing of this data is done via what is called a map-reduce algorithm. The easiest map-reduce algorithm to envision is the word count algorithm. In the map portion, each word on a page is emitted with the value “1”. The words are then sorted so that identical words sit together, and in the reduce portion the values for each word are added up. The result is a frequency count of each word in a page. For more involved examples, this paper shows how ten well-known machine learning algorithms can be expressed in map-reduce form:

http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf

The magic of Michael’s article is in how to avoid the stringent, often painful Jython interface to Hadoop. The trick is Hadoop Streaming: Hadoop pipes the data through his simpler, more Pythonic map and reduce scripts via standard input and standard output, so the complex Jython interface is never needed. Very clever indeed! It would be wonderful to see all of the relatively standard map-reduce algorithms written in Python.
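To make the map, sort and reduce steps concrete, here is a minimal local sketch of word count in the streaming style: lines of text go in, tab-separated key/value lines come out. It is written in Ruby rather than Python, and the method names map_lines and reduce_lines are placeholders for illustration, not anything taken from Michael's article or from Hadoop. A real streaming job would put each step in its own script reading standard input.

```ruby
# Map step: emit "word\t1" for every word on every input line.
def map_lines(lines)
  lines.flat_map do |line|
    line.downcase.scan(/[a-z0-9']+/).map { |word| "#{word}\t1" }
  end
end

# Reduce step: expects its input sorted by key, so that identical
# words arrive next to each other. Sums the values for each word.
def reduce_lines(sorted_lines)
  counts = []
  sorted_lines.each do |line|
    word, value = line.chomp.split("\t")
    if counts.last && counts.last[0] == word
      counts.last[1] += value.to_i
    else
      counts << [word, value.to_i]
    end
  end
  counts.map { |word, total| "#{word}\t#{total}" }
end

# Local stand-in for: cat input | mapper | sort | reducer
input = ["the quick brown fox", "the lazy dog and the fox"]
puts reduce_lines(map_lines(input).sort)
```

The reducer never holds the whole data set in a hash; it only compares each line with the previous key, which is why the sort between the two steps matters.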

He has a couple of other great articles on exactly how to install and configure Hadoop on Ubuntu here:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

and here:

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

On top of Hadoop is usually a metadata store, holding high level information about which page sets or data “chunks” can be found on which volumes, for efficiency. After attending a cloud computing conference in November, I realized that many companies have chosen to write their own metadata layer using HiveDB or custom metadata tools, along with many distributed MySQL databases, rather than settling for the expensive commercial solutions. This great work is a primary cause of the proliferation of easily accessible Open Source cloud computing tools.

Another project to watch is called Mahout, which is an attempt to make all of the well known map-reduce algorithms generally accessible across multicore Hadoop systems.

I run Fedora 10, and have a bit more work to do, but it is outstanding to see such an easily accessible installation for such a relatively complex set of tools. So many software tools and toys, so little time!

Gloria

An Attachment Walked Into A Bar. Was That U, Fu?



I recently spent some time working on a Rails application that needed to have various kinds of attachments, such as PDFs and images, for various types of resources, such as Programs and Questions. The app in question lets you create programs and questionnaires for the purpose of granting continuing medical education (CME) credits to doctors — so they’re apprised of the bleeding edge techniques for sawing off your arm or tying your tooth to a doorknob. Not that this matters for the piece, but context — like sugar or salt — makes things go down easier.

If you can picture an office of harried administrators dealing with ever-changing AMA requirements: “This brochure needs to go with this program!”, “This xray picture needs to go with Question 7”, “This program needs to sing ‘Missing You’ while showing you diagrams of a root canal” (OK, maybe not the last item), you’ll see why I wanted to build a system that could expand easily. I did not want to find myself in the position of trying to look tough while shamefacedly muttering, “I’m sorry, PDFs can only go with *Programs*, as you initially specified” and watching these not-so-gentle administrative souls stare at me with fully warranted incredulity. Blaming the client is like [fill-in-the-blank]: weirdly fun for about 10 minutes before the hangover sets in.

This project turned into a tour of attachment_fu, single table inheritance, polymorphism and functional testing of uploads. Thanks to input from friends, blogs and the usual Internet suspects, I got it up and running in its first form and decided to write it up. Hopefully what I’ve learned can help someone else. “Link-outs”, the reclusive programmer’s form of the “Shout Out” (aka a list of great resources), can be found at the bottom of the article.

Goal

To implement a system that would allow for many types of files to be uploaded and attached to many types of resources.

Requirements

  1. Many types of assets (image, pdf, etc.)
  2. Assets belong to many types of resources (program, question, etc)
  3. Assets upload to the file system
  4. Assets validate in the context of their parent resource [you upload the pdf in the program form; the image in the question form, etc]
  5. All functional tests pass for uploads

Design

Rather than have database tables for each attachment type (PDFs and images, then eventually MP3s) I made one table called Assets. Listening to my requirements, a friend suggested setting up my Assets model using polymorphism and single table inheritance and that I look into attachment_fu for the uploads. For this exercise I will only be focusing on the PDF upload. Readers can extrapolate from there for image or other media types. This article does not address image resizing, or anything related to uploading images. There are some great links at the bottom that deal with these topics in depth.

Modeling the domain:

Assets (the table) has a field called type, which holds the asset's class name [Pdf, Image, etc.]; a field called resource_type, which is the name of the class it belongs to [Program, Question, etc.]; and resource_id, which is the id of the resource it belongs to.

The Asset class will inherit from ActiveRecord::Base. Pdf and Image (and any future file types we add to the system) inherit from Asset. Rows in Assets (the table) will look like this (leaving out the other fields for now):

type   | resource_type | resource_id
Pdf    | Program       | 82
Image  | Question      | 79
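To show what those three columns buy you, here is a plain-Ruby sketch of how such rows could be resolved. It does not use ActiveRecord or a database. The RESOURCES hash and the instantiate and resource_for helpers are hypothetical stand-ins for what Rails does behind the scenes.

```ruby
# Plain-Ruby stand-ins; none of these classes touch a database.
class Asset
  attr_reader :resource_type, :resource_id

  def initialize(resource_type, resource_id)
    @resource_type = resource_type
    @resource_id = resource_id
  end
end

class Pdf < Asset; end
class Image < Asset; end

Program  = Struct.new(:id)
Question = Struct.new(:id)

# Hypothetical in-memory "tables", keyed by class name and then by id.
RESOURCES = {
  "Program"  => { 82 => Program.new(82) },
  "Question" => { 79 => Question.new(79) }
}

# Single table inheritance: the type column names the class to instantiate.
def instantiate(row)
  Object.const_get(row[:type]).new(row[:resource_type], row[:resource_id])
end

# Polymorphism: resource_type names the owner's class, resource_id its id.
def resource_for(asset)
  RESOURCES.fetch(asset.resource_type).fetch(asset.resource_id)
end

rows = [
  { type: "Pdf",   resource_type: "Program",  resource_id: 82 },
  { type: "Image", resource_type: "Question", resource_id: 79 }
]

rows.each do |row|
  asset = instantiate(row)
  owner = resource_for(asset)
  puts "#{asset.class} belongs to #{owner.class.name} with id #{owner.id}"
end
```

Adding an Mp3 asset type or a new kind of owner means adding a class, not a table, which is the point of the design.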

For my assets, I created the following models:
asset.rb, image.rb and pdf.rb. The latter two inherit from Asset.

# [Asset.rb]
class Asset < ActiveRecord::Base
  belongs_to :resource, :polymorphic=>true
end

# [Pdf.rb]
class Pdf < Asset
  belongs_to :program
  has_attachment :content_type => 'application/pdf',
                 :storage      => :file_system,
                 :path_prefix  => 'public/uploads/pdfs/',
                 :size         => 1.megabyte..3.megabytes
  validates_presence_of :content_type, :filename

  # Custom validation: reject any filename that does not end in "pdf"
  def validate
    if filename && /pdf$/.match(filename).nil?
      errors.add(:filename, "must be a PDF")
    end
  end
end
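One thing to be aware of: the /pdf$/ pattern in the filename check only requires the name to end in the letters "pdf", so it is case sensitive and does not insist on a dot. The sketch below compares it with a stricter pattern in plain Ruby. The pdf_filename? helper and the STRICT pattern are suggestions for illustration, not part of attachment_fu.

```ruby
LOOSE  = /pdf$/       # the pattern used in the validation
STRICT = /\.pdf\z/i   # hypothetical stricter variant: ".pdf" at the very end, any case

# Hypothetical helper: true only when the name ends in ".pdf".
def pdf_filename?(filename)
  filename.to_s.match?(STRICT)
end

%w[brochure.pdf BROCHURE.PDF notes.notpdf squiggle.gif].each do |name|
  puts "#{name}: loose=#{name.match?(LOOSE)} strict=#{pdf_filename?(name)}"
end
```

A filename check like this is only a first line of defence. The :content_type option passed to has_attachment is the other half of the validation.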

Because I am using attachment_fu, I need to have a set of basic fields in my table that will handle image and file uploads. The migration I created looks like this:


t.column :parent_id,     :integer  # for the thumbnails of resized images, required for plugin
t.column :type,          :string   # for STI on this table (current types are Image, Pdf)
t.column :resource_type, :string   # for polymorphism on this table (current types are Program, Question)
t.column :resource_id,   :integer  # for polymorphism (program_id, question_id, etc)
t.column :content_type,  :string   # this field [content_type] and all below are attachment_fu requirements
t.column :filename,      :string
t.column :thumbnail,     :string
t.column :size,          :integer
t.column :width,         :integer
t.column :height,        :integer

To create the relationship for my parent models (question and program) I set up the polymorphic relationship like so:

# [program.rb]
class Program < ActiveRecord::Base
  has_many :pdfs, :as => :resource, :dependent => :destroy
end

# [question.rb]
class Question < ActiveRecord::Base
  has_many :images, :as => :resource, :dependent => :destroy
end

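Note the :dependent => :destroy option, and the difference between delete and destroy. Destroy loads each asset and runs its callbacks, so it removes the entry from the database and the file from the filesystem. Delete only removes the database entry and leaves the file behind.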
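The distinction is easier to see in a toy version. The sketch below is plain Ruby with no Rails involved: FakeAsset and its TABLE hash are hypothetical stand-ins for a model and its table, and the file removal inside destroy plays the role of the cleanup callback.

```ruby
require "tmpdir"
require "fileutils"

# Plain-Ruby stand-in for a record whose file lives on disk.
class FakeAsset
  TABLE = {} # id => record, standing in for the assets table

  attr_reader :id, :path

  def initialize(id, path)
    @id = id
    @path = path
    File.write(path, "placeholder contents")
    TABLE[id] = self
  end

  # Like delete: remove the row only, skipping any cleanup.
  def delete
    TABLE.delete(id)
  end

  # Like destroy: run the cleanup step first, then remove the row.
  def destroy
    FileUtils.rm_f(path)
    delete
  end
end

Dir.mktmpdir do |dir|
  a = FakeAsset.new(1, File.join(dir, "a.pdf"))
  b = FakeAsset.new(2, File.join(dir, "b.pdf"))
  a.delete
  b.destroy
  puts "after delete:  file still on disk? #{File.exist?(a.path)}"
  puts "after destroy: file still on disk? #{File.exist?(b.path)}"
end
```

With delete you end up with orphaned files in public/uploads, which is why the associations use :dependent => :destroy.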

Once this is set up, you can test it in the console and see the magic of Rails at work. Through the column names and relationships we've built, entries in Assets are automatically created. It's one of the things that's great about Rails. It doesn't simply describe different modeling patterns: it implements them if you set up your code correctly. It's a beautiful thing.

Uploading

Without moving files to the server, the system is not complete. So, onto attachment_fu. The attachment_fu tutorials I found online all made the assumption that the attachment you are uploading is an entity of its own, in its own form: a mugshot, for instance. My case was different: I needed to embed the attachment in the forms for their parent resources (pdf in the program form; image in the question form, etc) rather than independently. So I needed to validate a form for more than one model. In my _form.rhtml partial for program I have the pdf field like so:

# [views/program/_form.rhtml]

<label for="program_notes">Notes</label>
<%= f.text_area :notes %>

<%# ... more fields ... %>

<label for="pdf_uploaded_data">Add Brochure? [PDF Files only]</label>
<%= file_field("pdf", "uploaded_data") %>

This means that both a pdf object and a program object are being passed back to the program controller. Since we know that a program has_many pdfs, we can use program.pdfs.build in order to create -- then validate and ultimately save -- pdf objects.

Validating in context of parent resource:

I needed to check for errors with my program fields and my pdf field. This snippet is what I ended up with. The second line, validate_and_build_pdf, is just a result of refactoring: it allows the validation methods to be run if you select a file through the upload file field (so if you try to upload a gif, the validation in Pdf.rb will squawk) -- but not if you just leave that field empty.

# [programs_controller.rb]
def create
  @program = Program.new(params[:program])
  validate_and_build_pdf(params[:pdf], params[:pdf][:uploaded_data])
  if @program.save
    flash[:notice] = 'Program was successfully created.'
    redirect_to :action => 'list' and return
  else
    render :action => 'new'
  end
end

# this method stops program from saving the asset if the field is blank,
# and returns type-errors if relevant (files not PDF)
def validate_and_build_pdf(fileParams, file)
  if Asset.is_valid_file?(file)
    @pdf = @program.pdfs.build(fileParams)
  end
end

The reason for the Asset.is_valid_file? method was the same: if no file was uploaded, an entry was still being made in the assets table due to the polymorphic relationship. So I had to make sure I was dealing with an actual file object, not an empty field. Since this would be used across the application for all types of file uploads, I made this a class method of Asset.

# [Asset.rb]
# Checks that the object is one of the following types before running validation on it.
# Each || sits at the end of its line so the condition continues onto the next one.
def self.is_valid_file?(fileObj)
  if fileObj.is_a?(Tempfile) || fileObj.is_a?(StringIO) ||
     (defined?(ActionController::TestUploadedFile) && fileObj.instance_of?(ActionController::TestUploadedFile))
    true
  else
    false
  end
end

You may be wondering what's up with the ActionController::TestUploadedFile condition -- it's because when you run functional tests of uploaded assets, that is what they are -- they are not instances of Tempfile or StringIO.
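Leaving the test-only class aside, the core of the check can be tried outside Rails. This is a minimal sketch in plain Ruby. The uploaded_file? name is a hypothetical stand-in for Asset.is_valid_file?, and the empty string and nil are assumed examples of what a blank file field might hand you.

```ruby
require "tempfile"
require "stringio"

# Hypothetical standalone version of the check: accept only objects of the
# kinds an upload arrives as, and reject anything else.
def uploaded_file?(obj)
  obj.is_a?(Tempfile) || obj.is_a?(StringIO)
end

tmp = Tempfile.new("upload")
begin
  puts "Tempfile:     #{uploaded_file?(tmp)}"
  puts "StringIO:     #{uploaded_file?(StringIO.new('%PDF'))}"
  puts "empty string: #{uploaded_file?('')}"
  puts "nil:          #{uploaded_file?(nil)}"
ensure
  tmp.close! # close and remove the temporary file
end
```

Because the blank cases fall through to false, the controller never calls build for them and no empty row lands in the assets table.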

Once it's determined that you're dealing with a valid file and you hit custom validation methods (such as informing the user that the file needs to be a pdf), you can include them with your program error messages so you only get one message box at the top of your form, not two:

<%= error_messages_for :program, :pdf %>

Pass Functional Tests

This part is where you test all your uploads from your functional tests. To test uploaded files, you need to put sample files in your test/fixtures/files directory, then build your tests.

# [programs_controller_test.rb]
def test_create_with_pdf
  fdata = fixture_file_upload('/files/semi.pdf', 'application/pdf')
  num_programs = Program.count
  post :create, :multipart => true,
                :pdf => {"uploaded_data" => fdata},
                :program => {:name => "Newest Created Program",
                             :is_ongoing => 1,
                             :status => "Edit"}
  assert_response :redirect
  assert_redirected_to :action => 'list'
  assert_equal num_programs + 1, Program.count
end

I had a gotcha here -- I realized that I had to keep the pdf hash separate from the program hash (since in the form the upload is its own pdf field, not a program field). I fixed it by submitting two objects, program and pdf, to the controller, as seen above.

Finally, running the functional tests leaves files on the file system and they should be deleted at the end of all tests. So I added a destroy call in the teardown, which removes them.

def teardown
  Pdf.find(:all).each { |p| p.destroy }
end

Failing Upload Tests

Just to check that only PDFs can be uploaded for programs, I added a handful of other file types to my directory and created an array of invalid file types in my setup:

# [programs_controller_test.rb]
def setup
  # other stuff....
  @invalid_files = [
    ['lipsum.doc',   'application/doc'],
    ['lipsum.rtf',   'application/rtf'],
    ['lipsum.txt',   'text/plain'],
    ['squiggle.gif', 'image/gif'],
    ['squiggle.jpg', 'image/jpg'],
    ['squiggle.png', 'image/png'],
    ['squiggle.psd', 'image/x-photoshop'],
    ['squiggle.zip', 'application/zip']
  ]
end

and created a test to check these uploads will fail:

  def test_create_with_invalid_file_types
    @invalid_files.each do |name, type|
      fdata = fixture_file_upload('/files/' + name, type)
      post :create,   :multipart => true,
                      :pdf=>{"uploaded_data"=>fdata},
                      :program => {:name=>"Newerest Created Program",
                             :is_ongoing=>1,
                             :status=>"Edit",
                             :start_date=>Time.now.tomorrow,
                             :end_date=>''}
      assert_response 200
    end
  end

This gave me peace of mind that my files are being uploaded, only the right types are accepted, and all is good in the land of PDFs and Programs. And voila, we are done and on to the next task.


Link Outs

Here's a listing of various online resources on the topics covered in this post. This should be enough for a good start. Happy Uploading!