Null Disquisition

Lots of talk about nothing

Making Python's pickle safe(r)

Everyone loves pickle; I mean, what's not to love? Super fast object serialization (via cPickle). However, there are some legitimate concerns regarding the security of pickle, specifically the load/loads functions. The basic problem is that if you unpickle untrusted data, you are liable to create objects that can do nasty things (like make system calls). Python even gives us a nice warning right in the docs:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
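
To see why that warning matters, here is a minimal and harmless sketch (Python 3 syntax assumed; the Payload class is made up for illustration). An object can tell pickle how to rebuild it via __reduce__, and loads will call whatever callable it names. Here it is eval on a harmless expression, but it could just as easily be os.system.

```python
import pickle


class Payload(object):
    """Hypothetical class whose pickled form asks for a function call."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Unpickling does not give back a Payload; it runs the call named above.
result = pickle.loads(blob)
```

After this runs, result is the number 42 rather than a Payload instance: the expression was evaluated during loads.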

Now there are plenty of things you can do to improve the security of the unpickling process. Python lets you subclass pickle.Unpickler to give the user finer-grained control over what gets unpickled. This is a fine approach (a nice example here), and will work for most, but I will give my take on the issue. For most of the applications I write that use pickle, I'm just looking for a way to store arbitrary Python data as a string. One example might be storing small data objects on S3, or perhaps implementing user sessions for a webapp. Either way, I should be able to trust my own data for unpickling, but it's always best to be double-extra-sure when dealing with something that can blindly execute arbitrary bits of code (think of the evil eval function). So, for my case, I simply want to verify that the pickled data I stored is coming back to me unmodified. My solution: sign the pickled data. Using the same signing method as AWS (HMAC with SHA-256), I present the following:
import hmac
import hashlib
import base64
from cPickle import dumps

# The unsigned pickled data
string_to_sign = dumps({'foo': "bar", 'spam': "eggs", 'the answer': 42})
# The signature object; the key is a secret known only to your application
signature = hmac.HMAC(key="my application's super secret key",
                      msg=string_to_sign, digestmod=hashlib.sha256)
# The signed string: store this
signed_string = string_to_sign + base64.encodestring(signature.digest())
Now you have your pickled data as the first part of the string, with the last 45 characters being the signature (44 base64 characters for the 32-byte SHA-256 digest, plus the newline that encodestring appends). The key for HMAC signing is specific to your application, so if someone gets access to your pickled data and tries to mess with it and re-sign it, it won't work. Here's the unpickling process:
import hmac
import hashlib
import base64
from cPickle import loads

# Break up the signed string into message and signature
signature = signed_string[-45:]
message = signed_string[:-45]
# Calculate the signature of the message
msg_sig = hmac.HMAC(key="my application's super secret key",
                    msg=message, digestmod=hashlib.sha256)
# See that it matches the given signature; raise instead of assert,
# because assert statements are skipped when Python runs with -O
if base64.encodestring(msg_sig.digest()) != signature:
    raise ValueError("signature mismatch: refusing to unpickle")
# Only now is it safe to unpickle
data = loads(message)
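
The two snippets above can be folded into a pair of helpers. This is a rough sketch, not a drop-in implementation: it assumes Python 3 (pickle in place of cPickle), the names sign_dumps and verify_loads are made up, and it appends the raw 32-byte digest instead of the base64 text, so the split point is 32 rather than 45. It also compares signatures with hmac.compare_digest, which is designed to avoid leaking timing information.

```python
import hashlib
import hmac
import pickle

# Placeholder key for illustration; a real key should come from configuration.
SECRET_KEY = b"my application's super secret key"
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign_dumps(obj, key=SECRET_KEY):
    """Pickle obj and append an HMAC-SHA256 signature of the pickled bytes."""
    payload = pickle.dumps(obj)
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + signature


def verify_loads(blob, key=SECRET_KEY):
    """Check the trailing signature and unpickle only if it matches."""
    payload, signature = blob[:-DIGEST_SIZE], blob[-DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


stored = sign_dumps({"foo": "bar", "spam": "eggs", "the answer": 42})
restored = verify_loads(stored)
```

Flipping any byte of stored, or verifying with a different key, makes verify_loads raise ValueError before pickle.loads is ever reached.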
-David

Filed under  //   programming   python  

API Functional Testing with Python

Recently at work, we wrote a totally badass XML API for clients to interface with our data (sorry, no public side yet). After some gentle reassuring (and some not-so-gentle arm twisting), I convinced my boss-man we could do this in Python with AWS on the back-end. We settled on the TurboGears 2.0 meta-framework using Amazon S3/SimpleDB. The whole experience was very educational for many reasons: one, we had never used anything besides MySQL for a data store; two, we had never used a Python framework before; and three, we had never really developed an app with a proper set of tests. That final point, testing, is the subject of this entry.

Py.Test, from the vaingloriously-named "py" module, is my unit testing framework of choice (I have written about it before). It provides a convenient way to collect tests and to write generative tests (which are super useful) for unit testing. After getting a few sets of unit tests rolled out for our API, we recognized that we would need some higher-level tests: so-called functional, or acceptance, tests.

### Functional Tests

Functional tests are high-level tests that rely on the interaction of many components of the system, whereas a unit test only exercises a smaller, lower-level component. For example, one (very high-level) functional test for an XML API would be to see that the resulting XML is well-formed. The well-formedness of an XML response from an API request depends on several components of the system: it requires proper request parsing, validation, error handling, template rendering, et al. A more typical test might be to see that the number of items returned by the API does not exceed a user-provided maximum, i.e., if the user requests http://api.example.com/?[request params]&max_count=10, no more than 10 results are shown. Now, how to go about running these tests?

The number of functional testing frameworks is too great to mention (here's a bunch), but one that is well known and widely used is Selenium. It is written in Java and can do some pretty fancy stuff. However, one big drawback of Selenium is its weight. It's heavy (it is Java, after all) and requires a client server (whether you sacrifice your own cycles or a remote server). For the simple functional tests we were writing, it was complete overkill. After searching around for a Python functional testing framework (or at least something lighter than Selenium), it occurred to me that I could just use the test-collecting abilities of Py.Test plus some additional libraries. And that's what we did.

### Bottom Line

Mix together PyXML, urllib2, and Py.Test and you have a pretty powerful (and portable) testing suite in Python. PyXML extends the built-in 'xml' module with some really nice packages, including an XPath parser which I love.

### Exempli Gratia

Consider an API that has a "users" noun and just one verb, "show". We will allow one optional parameter, order_by, and one required parameter, max_count. A valid URL would look like http://api.example.com/users/show?max_count=10&order_by=date. We'll start by creating the class that will contain the tests, and writing a function to get an XML doc given some URL parameters.

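import urllib2
from collections import defaultdict
from xml.dom import minidom
from xml import xpath  # provided by PyXML

class TestUserNoun:
    def get_xml_doc(self, url_params):
        # Build the request URL; defaultdict(str, ...) substitutes an empty
        # string for any parameter missing from url_params
        url = "http://api.example.com/users/show?"
        url += "max_count=%(max_count)s&order_by=%(order_by)s"
        url_p = urllib2.urlopen(url % defaultdict(str, url_params))
        # Parsing raises an error if the response is not well-formed XML
        try:
            doc = minidom.parseString(url_p.read())
        finally:
            url_p.close()
        return doc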
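
The defaultdict argument in get_xml_doc is what lets a caller leave out order_by. Here is a small sketch of that behavior, plus one way to attach a custom User-Agent. It assumes Python 3, where urllib2's Request lives in urllib.request, and the User-Agent string is made up. Nothing is sent over the network here.

```python
from collections import defaultdict
from urllib.request import Request

template = ("http://api.example.com/users/show?"
            "max_count=%(max_count)s&order_by=%(order_by)s")

# Missing keys fall back to str(), i.e. an empty string, instead of KeyError.
url = template % defaultdict(str, {"max_count": 10})

# A Request object carries custom headers; pass it to urlopen() to send it.
# The User-Agent value is a hypothetical name for the test client.
request = Request(url, headers={"User-Agent": "api-functional-tests/0.1"})
```

With only max_count supplied, url ends in "max_count=10&order_by=", which the API can treat as "no ordering requested".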
N.B.: you can set a specific User-Agent with urllib2 if so desired, and defaultdict is used so we don't have to check whether the incoming dict (url_params) has everything we need for the URL string. Now we can start writing some tests:
class TestUserNoun:
    ...
    def test_user_count(self):
        # Test several values of max_count
        counts = (5, 10, 15, 20)
        def count_users(n):
            # Test that the number of results returned is less than or equal to n
            doc = self.get_xml_doc({'max_count': n})
            user_count = len(xpath.Evaluate('/xpath/expr', doc.documentElement))
            assert user_count <= n
        # Generative test: Py.Test runs count_users once per yielded value
        for n in counts:
            yield count_users, n
And you get the idea: one can write tests ad nauseam (although I'm not sure there's such a thing as too many tests). Of course, these tests will not work as written, since the XPath expressions are placeholders; I didn't really feel like spelling out a whole XML schema just for this example. There are plenty of good XPath tutorials out there. The basic idea here is that you want to exercise all of your API's request parameters to check a number of things:

* Does the controller handle the requests properly? What about missing/extra parameters?
* Are errors handled properly?
* Is the resulting XML well-formed? This is implicitly checked by parsing the XML document
* Does the resulting data correspond to the request parameters? This one will require the most tests to be written - don't forget about generative tests!
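
The last item in the list above can be tried out without a live API. The following is only a sketch under stated assumptions: it uses Python 3 and the standard library's ElementTree instead of PyXML, the response layout (a users element holding user children) is invented, and fake_fetch is a made-up stand-in for the real HTTP call.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode, urlparse

TOTAL_USERS = 12  # hypothetical size of the canned data set


def build_url(params):
    """Build a request URL, letting urlencode escape the parameter values."""
    query = urlencode(sorted(params.items()))
    return "http://api.example.com/users/show?" + query


def fake_fetch(url):
    """Stand-in for urlopen(url).read(): returns a canned XML response."""
    query = parse_qs(urlparse(url).query)
    max_count = int(query["max_count"][0])
    count = min(max_count, TOTAL_USERS)
    users = "".join('<user id="%d"/>' % i for i in range(count))
    return "<users>%s</users>" % users


def count_users(max_count, fetch=fake_fetch):
    """Return how many user elements come back for a given max_count."""
    # fromstring raises ParseError if the response is not well-formed
    doc = ET.fromstring(fetch(build_url({"max_count": max_count})))
    return len(doc.findall("user"))


def test_user_count():
    # One check per value, in the spirit of a generative test
    for max_count in (5, 10, 15, 20):
        assert count_users(max_count) <= max_count
```

Swapping fake_fetch for a function that really calls the API turns the same checks into a functional test. Newer pytest releases dropped yield-based generative tests in favor of pytest.mark.parametrize, which is why the sketch uses a plain loop.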

A powerful test suite means a robust application. When you have a nice set of tests, you can push your code with confidence, and believe me, that is a very rewarding and relieving feeling. Writing this API has been an extremely rewarding experience, and probably the most educational thing I've done programming-wise since I wrote a cross-browser JavaScript event library like 5 years ago.

So go forth, programmer - embrace testing and empower yourself.

-David

Filed under  //   functional testing   programming   py.test   python   unit testing   xml  
Posted July 18, 2009

Python static class members and You

After getting yelled at for not grading my students' homework, I decided to ignore the threatening emails and continue doing what I feel like. Undergrads, know this: TAs don't really care about you - sorry. I was debugging some code built on top of my awesome HTMLParser, and kept having a really frustrating problem. Some of my class variables were not getting reset during the __init__ call. So I poked around and after a while discovered (buried in my libraries):

class Foo:
    a = True
    b = []
    c = []
    def __init__(self):
        pass  # nothing is reset here
It seems the class members a, b, and c are not getting reset when I instantiate because, quite simply, I am not resetting them in __init__. I originally put them there for prettiness (self.a, self.b, self.c is so cumbersome), and moving them back into __init__ fixed my problem. A little more digging reveals what is going on here. If you define a variable in the class body, outside of any method, it becomes a class attribute: a single object shared by every instance, much like a static member in other languages.
class Foo:
    a = "Hello"
print Foo.a
>> Hello
These static members are accessed just like regular members, through the "self" object. For things like str, int, and float, the value will seem to be reset when you create a new instance of the class. But what's really happening is that when you assign to self.a, you are creating a new instance attribute which shadows the class attribute for that one object; the class attribute itself is untouched. Lists and dicts only seem different because we usually modify them in place rather than assign to them. Every name in Python is a reference to an object, so when you alter the static list in place (via __setitem__, append, remove, et al.) you are operating on the one list that the class and all of its instances share, not on a copy. Strings and numbers are immutable, so the only way to "change" them is to assign a new object, and that assignment is what creates the per-instance attribute.
class Foo:
    a = []
    def __init__(self):
        print self.a
        self.a.append(1)
f = Foo()
f = Foo()
f = Foo()
>> []
>> [1]
>> [1, 1]
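
Here is a small sketch of the distinction (Python 3 syntax; the class names Shared and PerInstance are made up for illustration). Assigning through self creates a per-instance attribute, while an in-place change touches the one object stored on the class. Setting the list up in __init__ gives every instance its own.

```python
class Shared:
    items = []          # one list object, stored on the class
    label = "default"   # also stored on the class

    def add(self, value):
        self.items.append(value)   # mutates the shared list in place

    def rename(self, label):
        self.label = label         # new instance attribute shadows the class one


class PerInstance:
    def __init__(self):
        self.items = []            # a fresh list for every instance

    def add(self, value):
        self.items.append(value)


a, b = Shared(), Shared()
a.add(1)
a.rename("changed")

c, d = PerInstance(), PerInstance()
c.add(1)
```

After this runs, b.items also contains the 1 that was added through a, while b.label is still "default" and d.items is still empty.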
Depending on how you're structuring your code (or how good at Python you are) you might want this functionality. For me though, this was not the case, so I put everything back in __init__. Another good thing to point out is that Python has a very convenient syntax for making a copy of a list:
a = [1,2,3,4]
b = a       # b is another name for the same list
c = a[:]    # c is a new list with the same elements
b[0] = 5
c[0] = 6
print a
print b
print c
>> [5, 2, 3, 4]
>> [5, 2, 3, 4]
>> [6, 2, 3, 4]
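
One caveat worth knowing: the slice copy is shallow. Here is a quick sketch, assuming a list of lists (the variable names are arbitrary), with copy.deepcopy from the standard library for when the inner objects need copying too.

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = nested[:]            # new outer list, same inner lists
deep = copy.deepcopy(nested)   # new outer list and new inner lists

shallow[0].append(99)    # visible through nested: the inner list is shared
shallow.append([5, 6])   # not visible through nested: the outer list is separate
```

After this, nested[0] has gained the 99 but nested still has only two elements, and deep is unchanged.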
Sometimes I miss pointers, but not really.
-David

Filed under  //   programming   python