I Love Python: Twitter Feed/Content Auth and Scrape in One HTTP Request



This is some lingering code from a GrrlCamp project, written by Gaba. It is such a handy, great little nugget of code. It logs into a Twitter account and scrapes either the content or the RSS feed (your choice), in one fell swoop, using Base64 encoding of the login and password in the HTTP request header. Very clever:

import urllib2, base64
import sys
import feedparser
#import configuration

class Page:

    def __init__(self):
        self.data = {}
        self.data["url"] = 'http://www.twitter.com/gloriajw'
        self.data["username"] = 'gloriajw'
        self.data["password"] = 'XXXXXXXXXX'

        self.data["urlrss"] = 'http://twitter.com/statuses/user_timeline/18107956.rss'

    def __getitem__(self, key):
        return self.data[key]

    def __setitem__(self,key, value):
        self.data[key] = value

    def _fetch(self, url):
        # Send the credentials up front in a Basic Authorization header,
        # so login and download happen in a single request.
        # b64encode never inserts newlines, unlike encodestring.
        credentials = '%s:%s' % (self.data['username'], self.data['password'])
        authheader = "Basic %s" % base64.b64encode(credentials)

        req = urllib2.Request(url)
        req.add_header("Authorization", authheader)
        try:
            handle = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            if e.code == 401:               # the server rejected the credentials
                print "It looks like the username or password is wrong."
            else:
                print "The server returned HTTP error %d." % e.code
            sys.exit(1)
        except IOError, e:                  # network failure, not bad credentials
            print "Could not reach the server: %s" % e
            sys.exit(1)

        return handle.read()

    def getContent(self):
        return self._fetch(self.data["url"])

    def getRSS(self):
        return self._fetch(self.data["urlrss"])

    def getData(self):
        # Alternative, kept for reference: give feedparser a Basic auth handler
        # instead of putting the credentials in the URL.
        # auth = urllib2.HTTPBasicAuthHandler()
        # auth.add_password('BasicTest', 'twitter.com', self.data['username'], self.data['password'])
        # return feedparser.parse('http://www.twitter.com/statuses/user_timeline/18107956.rss', handlers=[auth])

        return feedparser.parse('http://%s:%s@twitter.com/statuses/user_timeline/18107956.rss' % (self.data['username'], self.data['password']))


To invoke it, first a small class to hold the parsed entries:

class Data:
    def __init__(self, entries):
        self.entries = entries

    def save(self):
        pass

    def parse(self):
        pass

    def imprimir(self):
        for item in self.entries:
            print item.title

And this entry point:

def main():
    page = Page()

    statuses = page.getData().entries

    data = Data(statuses)
    data.save()

    data.imprimir()


if __name__ == "__main__":
    main()

This code will also be attached, in case of copy/paste mangling.

In both getContent() and getRSS(), Gaba constructs the HTTP request header so that the Base64-encoded username and password are passed in the Authorization field. getData() takes a different route and embeds the credentials in the URL handed to feedparser. This is easier than making two requests and maintaining session cookies. Keep in mind that Base64 is an encoding, not encryption, so the credentials are only protected in transit when the request goes over HTTPS. Very nice indeed. This can be used to sign into any web site which accepts HTTP Basic authentication headers (there are several types of authentication: BASIC, DIGEST, FORM, and CLIENT-CERT).
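If you are on Python 3, where urllib2 no longer exists, the same header trick can be sketched roughly like this. This is only a sketch, not Gaba's code: the function names basic_auth_header, build_request and fetch are made up for illustration, and urllib.request stands in for urllib2.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of a Basic Authorization header."""
    # Basic auth is just "user:password", Base64-encoded. It is not encrypted.
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(url, username, password):
    """Build a request that carries the credentials up front."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


def fetch(url, username, password):
    """Log in and download in one request (hypothetical helper)."""
    with urllib.request.urlopen(build_request(url, username, password)) as handle:
        return handle.read()
```

Calling fetch(url, username, password) with your own URL and credentials returns the raw bytes of the page or feed, much like getContent() and getRSS() above.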

It is left as an exercise for you to get the content (not the feed) and use BeautifulSoup to extract the data portions. If you want to try this, and need help, post questions here.
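To give you a starting point, here is a rough Python 3 sketch of the extraction step. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs. The markup it looks for, a span with class "entry-content", is purely an assumption for illustration, as are the names EntryTextParser and extract_entries. Check the real page source before relying on it.

```python
from html.parser import HTMLParser


class EntryTextParser(HTMLParser):
    """Collect the text inside every <span class="entry-content"> element.

    The class name "entry-content" is an assumed, hypothetical marker.
    """

    def __init__(self):
        super().__init__()
        self.entries = []
        self._depth = 0       # > 0 while we are inside a matching span
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # a nested span inside a match
        elif "entry-content" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.entries.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_entries(html):
    """Return the text of each matching span, in document order."""
    parser = EntryTextParser()
    parser.feed(html)
    return parser.entries
```

Feed it the decoded string you get back from getContent(). With BeautifulSoup the same idea collapses to a single find-all call on the parsed page.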

Enjoy,
Gloria


2 Responses to “I Love Python: Twitter Feed/Content Auth and Scrape in One HTTP Request”

  1. gloriajw

    Wow, hi! Thank your sig.other and all others involved for this great tool. I’ve used it for several years now, in many environments, and it simply rocks.
