Hadoop Streaming and Python: No Jython necessary!


DesignMySQLPythonThoughts

Here is another jewel of an article, written by Michael Noll:
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

As hardware costs plummet and greater bandwidth becomes more affordable, cloud computing becomes even more alluring and feasible for the hobbyist. Working with large data sets is an interesting problem set, now attainable by the average developer with a free LINUX hard drive partition and some spare time.

Hadoop is essentially a framework, written in Java, for accessing large data sets. For the average developer, it’s usually used in conjunction with HDFS which is a custom file system designed for stroing large data sets on cheaper hardware. HDFS data is usually stored in units of entire web pages written to disk “as-is”. Once stored, this large data set is accessible for parsing and analysis.

Indexing of this data is done via what is called a map-reduce algorithm. The easiest map-reduce algorithm to envision is the word count algorithm. Each word on a page is assigned the value “1″ in the map portion, then all words are counted by sorting the words and adding their values. The result is a frequency count of each word in a page. This is the simplest of ten well known map-reduce algorithms, found in this white paper:

http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf

The magic of Michael’s article is in how to prevent the stringent, often painful Jython interface to Hadoop. This trick is by streaming this data on the command line from his simpler, more Pythonic map-reduce algorithm directly to Hadoop, thereby avoiding the complex Jython interface. Very clever indeed! It would be wonderful to see all of the relatively standard map-reduce algorithms written in Python.

He has a couple of other great articles on exactly how to install and configure Hadoop on Ubuntu here:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

and here:

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

On top of Hadoop is usually a metdata store, holding high level information about which page sets or data “chunks” can be found on which volumes, for efficiency. After attending a cloud computing conference in November, I realized that many companies have chosen to write their own metadata layer using HiveDB, or custom metadata tools, and many distributed MySQL databases, rather than settling for the expensive commercial solutions. This great work is the primary cause of the proliferation of easily accessible Open Source cloud computing tools.

Another project to watch is called Mahout, which is an attempt to make all of the well known map-reduce algorithms generally accessible across multicore Hadoop systems.

I run Fedora 10, and have a bit more work to do, but it is outstanding to see such an easily accessible installation for such a relatively complex set of tools. So many software tools and toys, so little time!

Gloria

Auto-creating MySql and Postgresql Databases with Powerful Python Tools


CherryPyMySQLPostgreSQLPythonThoughts

When designing and development web framework software for clients, I first spend a bit of time focused on the client data. I try to determine the traits of the client data, asking these questions:
Is it temporary?
If so, for how long? Should it expire and disappear automatically or manually?
Should it be archived?
How many users/applications will be entering data?
Is each user’s data mutually exclusive? If not, what is the logical grouping of users and data?
What level of control do users have over their own, or other user/app data?
How is the data used by the clients? Are there “super users”?

The database is such an essential part of the design, and being able to automatically create, archive, purge and delete databases is tremendously handy in so many of these designs.

I wrote this example to generate dynamic databases for short courses, training sessions, and demos. I use the example itself to demonstrate the functionality of CherryPy and SqlAlchemy.

First of all, I start my CherryPy process on my local port 7001, allowing me to do this:

http://grrlcamp.org:7001/dbi/create_my_db?db_type=postgres

or this:

http://grrlcamp.org:7001/dbi/create_my_db?db_type=mysql

First, the small bits of quirkiness in Postgresql and MySql

Starting a Postgresql session on the command line, we see what happens before the URL is referenced:

[cwc2008@localhost user1]$ psql -d template1 -U my_super_user
Welcome to psql 8.2.6, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

template1=# SELECT datname FROM pg_database ;
  datname
-----------
 postgres
 template1
 template0
(3 rows)

template1=#

and after (Note: I had to exit the command line session, run the URL, then re-enter the command line session):

template1=# SELECT datname FROM pg_database ;
   datname
--------------
 postgres
 template1
 template0
 cwcuser_7001
(4 rows)

template1=#

The cwcuser_7001 name is auto-generated from the user’s ID and port number, making it useful for training sessions with many users.

my_super_user was created using the Postgresql createuser command, and, of course, must be a super user to do this operation. The template1 database is the Postgresql master database, from which all other databases in that instance are visible.

Caveat: In Postgresql, only one user can access template1 at any given time. Violating this rule gives you this error:

template1=# create database blahblah;
ERROR:  source database "template1" is being accessed by other users
template1=#

Be sure to close your database descriptor when finished with template1. If this is not a production app, and/or you feel comfortable bouncing the instance, this lock can also be cleared by running:

-bash-3.2$ /usr/local/postgresql/8.2.6/bin/pg_ctl restart -m immediate -D/usr/local/postgresql/8.2.6/data

waiting for server to shut down... done
server stopped
server starting
-bash-3.2$ LOG:  database system was interrupted at 2008-02-14 15:07:47 EST
LOG:  checkpoint record is at 0/4B9054
LOG:  redo record is at 0/4B9054; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2046; next OID: 57344
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  record with zero length at 0/4B909C
LOG:  redo is not required
LOG:  database system is ready

This is nasty. In live production environments, do this with great care.

MySql is a bit forgiving when adding, removing or accessing new databases. It does not require the isolation level to be set, as in Postgresql.

We see what databases exist before:

/usr/local/mysql6/bin/mysql --socket=/tmp/mysql10.sock  -u my_super_user -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 170
Server version: 6.0.3-alpha-log Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| test               |
+--------------------+
3 rows in set (0.00 sec)

mysql>

and after:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cwcuser_7001       |
| mysql              |
| test               |
+--------------------+
4 rows in set (0.00 sec)

mysql>

MySql does not require my hopping in and out of the command line like Postgresql. It handles user scope much better, in my opinion. It does, however require each new non-super user to be granted access to each database they need, but this is (1) trivial, and (2) outside of the scope of this article.

The SqlAlchemy code to handle the quirky bits

This is invoked by CherryPy, but can be run in any other manner of Python execution. I have a Class called DBInterface, method is called create_engine:

class DBInterface:
    #~~~~~~~~~~~~~~~~~~~~~~~
    def create_engine(self,choice,local_dbname=None):

        if choice == 'super_mysql':

            '''
            Super user MySql commands do not require connection to a database.
            '''
            db = sqlalchemy.create_engine('mysql://%s:%s@%s' % ('my_super_user','some_password','127.0.0.1:3333'),strategy='thr
eadlocal')
            # 'self' is not threadsafe! Don't use it!
            #self.show_databases = "show databases ;"
            cherrypy.session['show_databases'] = "show databases ;"

        elif choice == 'mysql':
            db = sqlalchemy.create_engine('mysql://%s:%s@%s/%s' % ('my_super_user','some_password','127.0.0.1:3333',local_dbnam
e),strategy='threadlocal')

        elif choice == 'super_postgres':
            '''
            Super user Postgresql commands require that you connect to the
            built-in template1 database, and set the isolation level to 0.
            This can't be done in SqlAlchemy, so we do it in Psycopg.
            '''
            DSN_string = "dbname=template1 port=2222 user=my_super_user  host=127.0.0.1 password=some_password"
            def connection():
                self.local_conn = psycopg.connect(DSN_string)
                self.local_conn.set_isolation_level(0)
                return self.local_conn

            db = sqlalchemy.create_engine('postgres://',creator=connection, strategy='threadlocal')
            # NO!
            #self.show_databases = "SELECT datname FROM pg_database ;"
            cherrypy.session['show_databases'] = "SELECT datname FROM pg_database ;"

        elif choice == 'postgres':
            db = sqlalchemy.create_engine('postgres://%s:%s@%s/%s' % ('my_super_user','some_password','127.0.0.1:2222',local_db
name),strategy='threadlocal')

        else:
            db = None

        return db

The exposed method in the same class which invokes the create_engine method, is called create_my_db, which we saw in the URL above (I will address the CherryPy URL functionality after this point):

    @cherrypy.expose
    def create_my_db(self,db_type):

        '''
        Creating a db is a super user act.
        '''
        cherrypy.session['db_type'] = 'super_' + db_type
        cherrypy.session['my_dbname'] = "cwcuser_%s" % (cherrypy.request.config['server.socket_port'])

        db = self.create_engine(cherrypy.session['db_type'])
        db.execute("create database %s ;" % cherrypy.session['my_dbname'])
        databases = db.execute(cherrypy.session['show_databases'])

        dblist = []
        for d in databases:
             dblist.append(d[0])

        print ', '.join(dblist)
        self.dbclose(cherrypy.session['db_type'],databases)
        return ', '.join(dblist)

Because I clump together super user and non-super user database engine connection functionality, I distinguish the difference by modifying the users db type to a “super_” setting in a CherryPy session variable. This way, only one method is called in various internal modes of operation. When in non-super user mode, the SqlAlchemy connect string is the same for both Postgresql and MySql.

No more db quirks in this app: The rest is clean SqlAlchemy

All db operations after this point use the exact same SqlAlchemy code for both MySql and Postgresql. That is a profoundly beautiful thing.

The destroy_my_db method is generic:

    @cherrypy.expose
    def destroy_my_db(self,db_type):

        '''
        Destroying a db is a super user act.
        '''
        cherrypy.session['db_type'] = 'super_' + db_type

        if not cherrypy.session.has_key('my_dbname'):
            cherrypy.session['my_dbname'] = "cwcuser_%s" % (cherrypy.request.config['server.socket_port'])

        db = self.create_engine(cherrypy.session['db_type'])
        db.execute("drop database %s ;" % cherrypy.session['my_dbname'])
        databases = db.execute(cherrypy.session['show_databases'])

        dblist = []
        for d in databases:
             dblist.append(d[0])

        print ', '.join(dblist)
        self.dbclose(cherrypy.session['db_type'],databases)
        return ', '.join(dblist)

As above, all other database operations from this point forward are generic SqlAlchemy code. Yaaay!

The CherryPy layer

My CherryPy main module, which sets up the needed objects and assoiated URLs, looks like this:

import cherrypy
import pdb
import sys,os,re,traceback
import user_data
sys.path.append('../')

#hostname = os.environ.get('HOSTNAME')
#hostname = re.sub('\..*$','',hostname)

# Add your app here.
import db_wrapper

# Only generic headers and footers go here.
# To use generic templates in your app, you may need to symbolicly link
# them into your Templates dir, or references them by absolute path.
TemplatePath = os.path.join(os.path.dirname(__file__), '/cherrypy/dev/templates')

# Add your class here. Note that the 'root' should be the authentication module.

# To test, go to http://grrlcamp.org:1112/run_test
root = db_wrapper.WrapperTest()

# http://grrlcamp.org:port/dbi/create_my_db, destroy_my_db, etc.
root.dbi = db_wrapper.DBInterface()

# http://grrlcamp.org:port/user/get_user_data
root.user = user_data.UserData()

try:
        cherrypy.tree.mount(root)
        cherrypy.config.update(os.path.join(os.path.dirname(__file__),'cp.conf'))
        cherrypy.server.quickstart()
        cherrypy.engine.start()

except SystemExit:
        #smtpInterface.sendSysAdminEmail('cherrypy@xyz.com','sysadmin@xyz.com',"From %s: CherryPy is going down due to shutdown." % hostname,"")
        sys.exit()

except:
        print "ERROR: CherryPy crashed, error:"
        raise

finally:
        #smtpInterface.sendSysAdminEmail('cherrypy@xyz.com','sysadmin@xyz.com',"From %s: CherryPy is going down (from finally clause, check for another error message after this one)." % hostname,"")
        pass

The CherryPy “root” determines what objects are accessible via URL. I mounted a simple test under the root, and nested the above database code under root.dbi, hence the /dbi/ reference in the URL.

root.dbi contains an instance of the DBInterface class we saw earlier. All exposed methods in the DBInterface class become accessible via the /dbi/method_name URL when CherryPy is started.

I’ve attached the complete code used for training, configured to run on localhost, and connect to both MySql and Postgresql daemons. It contains some explicit redundancy and commented code for learning purposes. Your configuration variables will vary slightly. Enjoy.

auto-create-any-db.tar

GrrlCamp NYC 2008


EventsMySQLPostgreSQLPythonThoughts

We are kicking off GrrlCamp 2008. There are two GrrlCamps being started from this list. This GrrlCamp starts now, runs until May 2008, and meets in-person in NYC in May of 2008.

Qualifications:
- Female (willing to voice verify. You’d be surprised that this is necessary, but at times, it is.)
- Programmer, or programmer wannabe.
- Able to meet online weekly for one hour.
- Able to dedicate a few hours a week to this project.
- Willing to read, try, do, roll up your sleeves and design and develop, even if you’ve never done this before.
- Able to travel to NYC in May for the software wrap-up and release (optional). Your expenses = your round-trip flight/ride, and minimal donations for food. Lodging is free, food will be prepared and cooked by participants.
- Willing to work in Python (If we get a Rails following, we may use both).
- Willing to learn Dojo, CSS and HTML for the user-facing interface.
- Willing to release the software to the public, free of charge.
- Willing to provide small bits of support after development is complete.

This is a large project, touching on everything from basic Linux admin, svn setup, LINUX shell, Apache configuration, threading, database schema design and access, process/content management, front end design and development, inter-language data transfer, and the list goes on. If you are willing, this is an outstanding learning opportunity.

If interested, reply to this post with a valid email address, and we will send you an invitation.

Please, no observers/reporters. This is strictly a development group, and all members must be willing to write code.

For want of a comma


MySQLTips and Tricks

my kingdom was lost.

Watch those commas. A little SQL detail that can bite you:

SELECT
SID
NAME,
ETC
FROM MYTABLE

MySQL, and presumably other databases, is just fine with that statement. It’ll alias the field SID to NAME. If you name columns explicitly with the AS keyword, a missing comma will yield an error:

SELECT
SID AS studentid <--baaarf
NAME AS studentname,
ETC AS studentetc
FROM MYTABLE

Mysql gotcha


DatabaseMySQL

I fix bugs for a living. I found this doozy recently. Get out your sticky yellow pad and write this down:

Don’t do this:

select now() - max(my_timestamp) from my_table;

Go find a table with a TIMESTAMP field in it.

mysql> select my_timestamp from my_table limit 1;
my_timestamp: 2006-02-15 02:22:15
1 row in set (0.00 sec)

Looks nice. Now try:

mysql> select max(my_timestamp) from my_table;
max(my_timestamp): 2007-01-31 22:39:17
1 row in set (0.35 sec)

Now this:

mysql> select max(my_timestamp)+0 from my_table;
max(my_timestamp)+0: 2007
1 row in set (0.35 sec)

What happened? max(my_timestamp) was evaluated in a numeric context. The timestamp string was truncated at the first non-numeric character.

Something a bit different happens to now:

mysql> select now();
now(): 2007-02-06 22:47:25
1 row in set (0.00 sec)
mysql> select now()+0;
now()+0: 20070206224744
1 row in set (0.00 sec)

So the original exssion now() – max(my_timestamp) will yield

20070206224744 – 2007

which I doubt is what you had in mind.

If you need to calculate the difference in days, try:

to_days(now()) - max(to_days(my_timestamp))

For something a bit more fine-grained, try:

select (unix_timestamp(now())- unix_timestamp(max(stamp)))