I Love Python: Twitter Feed/Content Auth and Scrape in One HTTP Request
February 21st, 2009
This is some lingering code from a GrrlCamp project, written by Gaba. It is such a handy, great little nugget of code. It logs into a Twitter account and scrapes either the page content or the RSS feed (your choice) in one fell swoop, by sending the Base64-encoded login and password in the HTTP request header. Very clever:
import urllib2, base64
import sys
import feedparser
#import configuration

class Page:
    def __init__(self):
        self.data = {}
        self.data["url"] = 'http://www.twitter.com/gloriajw'
        self.data["username"] = 'gloriajw'
        self.data["password"] = 'XXXXXXXXXX'
        self.data["urlrss"] = 'http://twitter.com/statuses/user_timeline/18107956.rss'

    def __getitem__(self, key):
        return self.data[key]

    def __setitem__(self, key, value):
        self.data[key] = value

    def getContent(self):
        base64string = base64.encodestring('%s:%s' % (self.data['username'], self.data['password']))[:-1]
        authheader = "Basic %s" % base64string
        req = urllib2.Request(self.data["url"])
        req.add_header("Authorization", authheader)
        try:
            handle = urllib2.urlopen(req)
        except IOError, e: # here we shouldn't fail if the username/password is right
            print "It looks like the username or password is wrong."
            sys.exit(1)
        return handle.read()

    def getRSS(self):
        base64string = base64.encodestring('%s:%s' % (self.data['username'], self.data['password']))[:-1]
        authheader = "Basic %s" % base64string
        req = urllib2.Request(self.data["urlrss"])
        req.add_header("Authorization", authheader)
        try:
            handle = urllib2.urlopen(req)
        except IOError, e: # here we shouldn't fail if the username/password is right
            print "It looks like the username or password is wrong."
            sys.exit(1)
        return handle.read()

    def getData(self):
        """auth = urllib2.HTTPBasicAuthHandler()
        auth.add_password('BasicTest', 'twitter.com', self.data['username'], self.data['password'])
        return feedparser.parse('http://www.twitter.com/statuses/user_timeline/18107956.rss', handlers=[auth])
        """
        return feedparser.parse('http://%s:%s@twitter.com/statuses/user_timeline/18107956.rss' % (self.data['username'], self.data['password']))

To invoke it:
class Data:
    def __init__(self, entries):
        self.entries = entries

    def save(self):
        pass

    def parse(self):
        pass

    def imprimir(self):
        for item in self.entries:
            print item.title

And this:
def main():
    page = Page()
    statuses = page.getData().entries
    data = Data(statuses)
    data.save()
    data.imprimir()

if __name__ == "__main__":
    main()

This code will also be attached, in case of copy/paste mangling.
In both getContent() and getRSS(), Gaba constructs the HTTP request header so that the Base64-encoded username and password are passed in the Authorization field of the header; getData() instead puts the credentials in the feed URL and leaves the authentication to feedparser. This is easier than making two requests and maintaining session cookies. Very nice indeed. Keep in mind that Base64 is an encoding, not encryption, so the credentials are only protected in transit if the request goes over HTTPS. This can be used to sign into any web site which accepts HTTP Basic authentication headers (there are different types of HTTP authentication: BASIC, DIGEST, FORM, and CLIENT-CERT).
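If you are on Python 3, where urllib2 no longer exists, the same header trick can be sketched roughly as follows. This is only a minimal illustration, not Gaba's code: the helper names basic_auth_header and build_request, the example.com URL, and the credentials are all made up for the example (the credentials are the well-known sample pair from the HTTP authentication RFC). The request is built but never sent, so you can inspect the header without touching the network.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value for an HTTP Basic Authorization header."""
    credentials = "%s:%s" % (username, password)
    # b64encode works on bytes and, unlike encodestring, adds no trailing newline.
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


def build_request(url, username, password):
    """Build (but do not send) a request that carries the credentials."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


if __name__ == "__main__":
    # Hypothetical URL and sample credentials, for illustration only.
    req = build_request("https://example.com/feed.rss", "Aladdin", "open sesame")
    print(req.get_header("Authorization"))  # prints: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

To actually fetch the page you would pass the request to urllib.request.urlopen(req) and read the response, just as getContent() does with urllib2.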
It is left as an exercise for you to get the content (not the feed) and use BeautifulSoup to extract the data portions. If you want to try this, and need help, post questions here.
Enjoy,
Gloria

March 5th, 2009 at 8:20 am
As the wife of the Beautiful Soup developer & maintainer, I’m glad it’s so useful to the DevChix community! You might be interested in this post about the newest version:
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
March 17th, 2009 at 8:38 am
Wow, hi! Thank your sig.other and all others involved for this great tool. I’ve used it for several years now, in many environments, and it simply rocks.