Null Disquisition

Python, AWS, Grad School, and your face

Archive for the ‘python’ tag

Weekend Project – CloudCached

without comments

A friend and I have been bouncing around the idea of a caching system that ran on Amazon’s cloud for a while now. Basically something like memcached, but without the (very real) limitations of physical memory or the need of a whole server. Sure, it’s hard to beat the speed of memory-level read access, but I think the appeal of a distributed, limitless cache might outweigh the slowdown.

Idea

Provide an interface for storing/retrieving serialized data on S3

Pretty simple idea, pretty simple implementation. Thanks to the S3 interface provided by Boto, things were a lot easier. I’m going to keep this open source under the MIT license. You can check out the code on GitHub repository – please feel free to fork, improve, submit, etc.

Overview

A quick walkthrough of the code will reveal truly how simple this is. The Client class provides basic CRUD methods for interfacing with S3: put, get, update, delete. The put and update methods store a timestamp as the “expires” header for the file to keep track of cache expiration. Also these two methods write a “type” header to the meta-data so CloudCached knows how to de-serialize the file.

class Client:
"Here's the class schema"
    def get(self, key)
    def put(self, key, value, time_to_expire=3600, replace=False)
    def update(self, key, value, time_to_expire=3600)
    def delete(self, key)

There are 6 basic data types used in this code for serializing any bit of python data: basestring (for str and unicode), int (for int and long), complex, float, and other. The other data type represents anything that is not a base type in Python. These “other” types get pickled while everything else just gets str’d.

The put method checks the md5sum to make sure everything went through cleanly (maybe a bit costly, but worth it in my opinion). cPickle is used in favor of pickle for obvious reasons (it’s much faster).

Results

Some very early tests show that this might just be usable.

    CloudCached Benchmarks (10 runs)
    --------------------------------------------------------
    Test                 |  Average (s)     | Total (s)
-------------------------------------------------------- GET integer | 0.0283360004425 | 0.283360004425 GET string (32 byte) | 0.0315794944763 | 0.315794944763 GET string (512KB) | 0.1265994787220 | 1.265994787220 PUT integer | 0.0650457143784 | 0.650457143784 PUT string (32 byte) | 0.0563205003738 | 0.563205003738 PUT string (512KB) | 0.1773290872570 | 1.773290872570 --------------------------------------------------------

Advantages

  • Highly distributed. S3 data is distributed across multiple availability zones and could therefor be utilized by an application running across multiple availability zones.
  • No size limit. Unlike the physical limitations of a memcached machine (or cluster of memcached machines), S3 does not have limits on the number of files (caches) you can store. Also, with S3, you can write files from 1 byte to 5 GB (although I think a 5GB cache file would defeat the purpose).
  • Parallel read access. If applicable to the application, cache reads can be largely parallelized which could potentially give linear speedup to the cache loading.
  • No server necessary. Since the application is reading and writing directly to S3, there is no need to a “cache server”. This could lead to a great deal of savings for people running multiple memcached machines. Memcached servers typically have a large memory capacity which means a m1.xlarge or c1.xlarge EC2 instance (assuming it’s running in EC2).

Considerations

It’s going to be hard to beat the speed of memcached. As far as speed is concerned, I’m using built-in Python stuff including urllib, httplib, xml.sax, etc (all of which are used by Boto). It might be worthwhile to write a C implementation of the S3 communication methods (but maybe not). The most costly part of this code aside from network communication is probably the serialization, and since cPickle is used there is not really improvement to be made there.

It might be cool to couple the meta-data with SimpleDB.

I registered cloudcached.com in case this gains some momentum. I will post updates and benchmarks there as they arrive.

-David

Written by david

June 20th, 2009 at 4:34 pm

Posted in Amazon Web Services, python

Tagged with , , ,

Python unit testing super fun time

without comments

There’s a weird thing that happens after a long night of mind-blowing back-breaking coding. Well, hacking in this case. Every time I stay up late working really hard on something, I feel compelled to blog/tweet/emote about my experience so others might feel sympathy/compassion for me. Even though I’m dizzyingly tired, and have to get up in ~5 hours, I cannot deny this urge to massage my ego.

So tonight I bring to you the joy of unit testing in Python. I’ve been using py.test, and loving it. It extends the basic functionality of Python’s built-in module, unittest (which is really not that bad). The main improvements are in the simplicity of writing the tests. Py.test supports unit testing on methods, classes, even whole modules.

Here’s your first test

def test_iszero():
    assert 1==0

If you haven’t guessed, this test will fail (1 does not equal 0). A cool thing about py.test is that you just prefix the method name with “test_” and that becomes a test. If it’s in a class or module, you need setup and teardown methods, but beyond that just write methods starting with “test_”. There’s lots more fancy stuff you can do, I suggest checking out the docs (link above).

However, by my favorite thing py.test does is support generative testing. By using generators, a test can spawn “sub” tests with a yield statement. Let’s say I want to test if a bunch of numbers are even.

def isEven(x):
    assert x%2==0
def test_evenNumbers():
    n = [1,2,3,4,5,6]
    for x in n:
        yield isEven,x
This can be tremendously helpful when you need to do a repetitive test on many input parameters. Enjoy! -David

Written by david

February 10th, 2009 at 1:59 am

Posted in python

Tagged with , ,

Python static class members and You

with one comment

After getting yelled at for not grading my student’s homework, I decided to ignore the threatening emails and continue doing what I feel like. Undergrads, know this: TAs don’t really care about you – sorry. I was debugging some code built on top of my awesome HTMLParser, and kept having a really frustrating problem. Some of my class variables were not getting reset during the init call. So I poke around and after a while discover (buried in my libraries)

class Foo:
    a = True
    b = []
    c = []
    def init(self):
        ""
It seems the class members a, b, and c are not getting reset when I instanciate becasue, quite simply, I am not resetting them in init. I originally put them there for prettiness (self.a, self.b, self.c is so cumbersome), and moving them back into init fixed my problem.

A little more digging reveals what is going on here. If you define a variable outside of a class method, the variable is implicitly made static.

class Foo:
    a = "Hello"
print Foo.a
>> Hello
These static members are accessed just like regular members, with the “self” object. For things like str, int, float, the value will seem to be reset when you create a new instance of the class. But what’s really happening is when you alter the static variable, you are actually creating a new class variable (in memory) which overrides the static for the duration of that object. This is not true for lists and dicts. I assume this is because Python uses pointers for array-like structures and the static member is just a pointer here. So when you alter the static list (via getitem, append, remove, et al.) you are operating on the pointer, not a copy of the list.
class Foo:
    a = []
    def init(self):
        print self.a
        self.a.append(1)
f = Foo()
f = Foo()
f = Foo()
>> []
>> [1]
>> [1,1]
Depending on how you’re structuring your code (or how good at Python you are) you might want this functionality. For me though, this was not the case, so I put everything back in init. Another good thing to point out is Python has a very convienent syntax for making a copy of an array.
    a = [1,2,3,4]
    b = a
    c = a[:]
    b[0] = 5
    c[0] = 6
    print a
    print b
    print c
    >> [5,2,3,4]
    >> [5,2,3,4]
    >> [6,2,3,4]
Sometimes I miss pointers, but not really. -David

Written by david

December 4th, 2008 at 8:45 pm

Posted in python

Tagged with ,

HTMLParser, not for the faint of heart

with one comment

In recent efforts to create a general purpose HTML scraper for mein Geschäftsführer, I’ve been getting my hands dirty in some Py. After much research and experimentation, I’ve decided to go with the built-in HTMLParser instead of the XML expat parser or the SGMLParser. Also, I should clarify this is not the HTMLParser from htmllib, this is HTMLParser’s HTMLParser. For all it’s wonderment, Python really fails on consistant naming schemes. Oh well.

One of the things I like most about HTMLParser is that it is not a module per say, but it is a factory for creating a wrapper. There is no default HTMLParser which you can feed HTML to and get output – you only get the factories for parsing.

class MyParser(HTMLParser):
    def handle_starttag(self,tag,attrs):
        if tag == "a":
        print "Found link:",attrs
    def handle_startendtag(self,tag,attrs):
        if tag == "img":
        print "Found image:",attrs
parser = MyParser()
parser.feed(rawhtml)
Lovely, no? There are a few more methods which you overwrite in order to achieve desired functionality. The nice thing about parsing HTML like this is that it is a one-pass operation. Unlike a series of regexp to find desired content, this allows us find multiple targets in a streaming fashion.

There was one really annoying thing about this module however. The built-in getpos() returns a tuple of line number and column position. I can’t think of an instance when this would be useful for anything really (unless you’re making a HTML editor in python or something), so natrually I modified it to my liking. My first solution was to just remove all the newlines and then work based on the column offset alone. Unfortunately, HTMLParser chokes on some really long lines. My next idea (the one I’m currently using) was to strip out tabs and trailing whitespace and precalculate the length of each line before I feed the parser.

linepos = []
charpos = 0
for line in self.html.split("\n"):
    self.linepos.append(charpos)
    charpos += len(line)
parser = MyParser(linepos=linepos)
This produces an array like [0,10,20,30,...] (if each line were 10 characters long). The next modification is to create a new method for MyParser.
def getcharpos(self):
    return self.linepos[self.lineno-1] + self.offset
The two properties lineno and offset are inherited from HTMLParser (actually inherited from markupbase), and they represent exactly what you’d think.

Now that I have absolute position of tags in the HTML, I can all kinds of fun things like use K-means grouping to find clusters of images. Or maybe I want to see the average distance between occuraces of the word “the” in an article. It’s 276.21 for this one, btw.

-David

Written by david

December 1st, 2008 at 4:32 pm

Posted in python

Tagged with ,

1d Fokker-Plank equation

without comments

As promised, I bring pretty pictures. The past few days I’ve been working on a solution to the 1d diffusion equation with a drift term, better known as the Fokker-Planck equation.

The 1d Fokker-Planck equation

Sexy, I know. Anyhow, I finally worked out the Python code to get it rolling (literally!). The test system I did has periodic boundary conditions and an initial condition of a sharply-peaked Gaussian (a = 20). I’ll spare the details and jump to the fun part.

Here’s the Python code that made it happen (scipy and matplotlib required).

-David

Written by david

October 10th, 2008 at 4:45 am

Posted in School, python

Tagged with , , ,