Case Study: Rapid Development with Python

It's been said all over the web: Python is the best programming choice when it comes to rapid development. You want something done well now (as opposed to done shoddily a week later): use Python. I'll present here a case study describing exactly how Python delivered in one such instance.

The Problem and The End Result

The assignment was to parse data out of a Yellow Pages site with crappily coded HTML, and write it out properly to a CSV file. I coded the solution in a total of around 6 hours (three sittings of roughly 2 hours each), with the initial regular expression evaluation (we'll come to that later) and the basic framework taking half that time. It not only crawls a "root" url to find the proper descendant pages, it also creates separate CSV files, one per type of industry. You can view live samples of the input pages here and here and a sample of the output is attached.

Dissecting It

I'll unceremoniously start with the dissection. You'll need to be able to follow some Python code.

from sys import argv, exit
from urllib import urlopen
from urlparse import urljoin
import re
from csv import writer

The best thing about Python is the simplicity of the language. The next best thing is the Library. In the statements above I've imported the command-line argument list, an exit function which allows me to stop execution anywhere, a url opener which turns a url into a file-like object, a url joining function, the regular expression engine, and the CSV writer. Note that I've been extremely selective, and since I'm a bit of a minimalist, this is generally a Good Thing. The advantage of having such a library is obvious: in C, or in most scripting languages for that matter, you'd be hard pressed to find all these ingredients directly in the standard library.

def main():
    try:
        options = file('options.txt', 'r').readlines()
    except IOError:
        print "The options file does not exist. Create options.txt in the same directory, and try again."
        exit()
    rooturl = options[0].strip()  # drop the trailing newline that readlines() keeps

In six statements, I've opened a file, handled an error and read in my start parameter for the program: the root url (the url to the first "here" that you saw above).
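One detail worth knowing about readlines() is that every line it returns keeps its trailing newline, so a value such as the root url is best stripped before it is used. Below is a minimal sketch of the same step, written for Python 3 (where file() no longer exists and open() is used instead). The helper name read_root_url and the example.com address are made up for illustration and are not part of the original script.

```python
import os
import tempfile


def read_root_url(path):
    """Return the first line of the options file without its trailing newline."""
    try:
        with open(path, "r") as handle:
            lines = handle.readlines()
    except IOError:
        # In Python 3, IOError is an alias of OSError, so a missing file lands here.
        return None
    if not lines:
        # An empty options file has no root url to offer.
        return None
    return lines[0].strip()


# Demonstration with a throwaway file; the url is a made-up placeholder.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "options.txt")
with open(demo_path, "w") as handle:
    handle.write("http://example.com/yellowpages/index.html\n")

root_url = read_root_url(demo_path)
missing = read_root_url(os.path.join(demo_dir, "no-such-file.txt"))
```

Returning None instead of calling exit() is just a choice made for this sketch, so that the caller decides what to do when the file is absent or empty.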

    urls, services = [], []    
    print "Parsing root url: ", rooturl
    rootdata = urlopen(rooturl).readlines()
    for line in rootdata:
        if line.find("PATTERN") != -1:
            sublines = line.split("PATTERN2")
            for subline in sublines:
                matches = re.compile(r"REGEX").finditer(subline)
                for match in matches:
                    urls.append(urljoin(rooturl, match.group(1)))
                    services.append(match.group(2).lower())

Note that PATTERN, PATTERN2 and REGEX stand in for the actual patterns and regexes; they were too unwieldy to include here, so get them from the source download.

I initialized two lists here and read rootdata from a urlopen-ed file. Note that urlopen() returns a file-like object, so the same readlines() I used on the options file works here too, and I can then parse line by line. Since the source pages make it a helluva lot difficult to find sensible data, I split the lines by a delimiter (PATTERN2). I have an absolute mental block with regexes, but Python's re module is simple even for me. Read those lines above: I match the regex against the subline, and the finditer() function automatically creates an iterator for me, which means that I can use it in the subsequent for statement: "for match in matches" simply sounds too damn elegant. urljoin() is then used to join the root url with the relative url and come up with an absolute one. Note that urls holds the address of each sub-page being crawled and services the name associated with that url. In this page for example, the services list would have "airlines", "ambulance service", "ambulance service" and so on (the code lowercases the names), and urls would contain their corresponding addresses. The rest of the code is run for each of those pages, extracting data from them.
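To make the split, match and join sequence concrete, here is a small self-contained sketch, written for Python 3 (where urljoin lives in urllib.parse rather than urlparse). The sample line, the "|" delimiter and the link regex are all invented for illustration; they are not the PATTERN2 and REGEX used in the real script, and the example.com address is a placeholder.

```python
import re
from urllib.parse import urljoin  # Python 2 kept this in the urlparse module

# Made-up input and pattern, for illustration only.
ROOT_URL = "http://example.com/yellowpages/index.html"
SAMPLE_LINE = (
    '<td><a href="list/airlines.html">Airlines</a></td>|'
    '<td><a href="list/ambulance.html">Ambulance Service</a></td>'
)
DELIMITER = "|"
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

urls, services = [], []
for subline in SAMPLE_LINE.split(DELIMITER):
    # finditer() yields one match object per non-overlapping match.
    for match in LINK_RE.finditer(subline):
        # Group 1 is the relative link; urljoin resolves it against the root url.
        urls.append(urljoin(ROOT_URL, match.group(1)))
        # Group 2 is the visible link text, stored in lowercase.
        services.append(match.group(2).lower())

print(urls)
print(services)
```

Because urljoin replaces the last path segment of the base, the relative link "list/airlines.html" ends up under "/yellowpages/" rather than under "/yellowpages/index.html/".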

    for count in range(0, len(urls)):        
        if len(urls[count]) > 1:
            print "Parsing url: ", urls[count], '... ',
            data = urlopen(urls[count]).readlines()
        else:
            print "no data found!"
            break            

        # Binary append mode: Python 2's csv module expects the 'b' flag, which matters on Windows
        csv = writer(file(services[count].replace('/', '-') + ".csv", "ab"))
        
        for line in data:
            if line.find("PATTERN") != -1:
                sublines = line.split("PATTERN2")
                for subline in sublines:
                    matches = re.compile(r"REGEX").finditer(subline)
                    
                    for match in matches:
                        # finditer() only yields successful matches, so no None check is needed
                        csv.writerow(match.groups())

        print "done."         

It uses much the same logic, since the input pages have the same kind of crappy HTML everywhere (Thank God!), and the only thing left to mention is the csv functions. Note writerow used on the csv file. Hell, note the little one line that gets me the csv object.
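The csv step can be sketched on its own as follows, again for Python 3 (where the file is opened with open() and newline="" instead of file()). The entry regex, the sample markup and the helper name csv_name are made up for this example and do not come from the real script.

```python
import csv
import os
import re
import tempfile

# Made-up markup and pattern, for illustration only.
ENTRY_RE = re.compile(r"<b>([^<]+)</b>\s*<i>([^<]+)</i>")
SAMPLE = "<b>Acme Air</b> <i>555-0100</i><b>Sky, Inc.</b> <i>555-0101</i>"


def csv_name(service):
    """Build a file name from a service name, swapping '/' for '-' as the script does."""
    return service.replace("/", "-") + ".csv"


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, csv_name("airlines/charter"))

# Append mode, so a service that shows up twice adds to the same file.
# newline="" is what the Python 3 csv documentation asks for.
with open(out_path, "a", newline="") as handle:
    out = csv.writer(handle)
    for match in ENTRY_RE.finditer(SAMPLE):
        # groups() returns a tuple of captures; each one becomes a column.
        out.writerow(match.groups())

# Read the file back to check what was written.
with open(out_path, newline="") as handle:
    rows = list(csv.reader(handle))
```

The second sample entry contains a comma on purpose: the writer quotes that field, which is the kind of detail that makes the csv module preferable to joining strings by hand.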

Conclusions

While the code has next to no error checking and isn't the least bit fault tolerant, the client needed it within a day, and in no other language could I have come up with a solution this soon on the schedule I had. Python not only has the shallowest learning curve ever (the Jump-In factor is great), the urge to learn more of it seems to grow the more experienced you get with it, and I simply Love that fact.

A little side note: Portability. Python programs are famously portable across Unixes: every modern Linux and BSD system has a Python implementation installed by default (or easily available). For Windows, an installer is available, and after that Python scripts can be run by hand. Since this is cumbersome, lo and behold py2exe, which gets rid of that step: it converts a .py to a .exe. Once you zip the output, adding a hefty 1MB plus to your tiny script, you have a working executable right out of the box. The clients in fact used Windows, and aside from a missing .dll on an old Win 98 computer, when it worked, it worked without a hitch.

Dear reader, I'm sure you've had a lot of experience with Python fanboys. I'm not a fanboy, but I'll recommend, and even order, people to use Python when they have a critical programming task that just needs doing right now. Even if you have zero knowledge of the language, getting up to the level needed to write something like the above would take you no more time than it took me. Six hours. Trust me.

Attachments
web123.py.txt (2.93 KB)
airlines.csv (1.26 KB)