<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Null Disquisition &#187; python</title>
	<atom:link href="http://mumrah.net/topics/programming/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://mumrah.net</link>
	<description>Python, AWS, Grad School, and your face</description>
	<lastBuildDate>Fri, 05 Mar 2010 00:23:23 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Making Python&#8217;s pickle safe(r)</title>
		<link>http://mumrah.net/2009/09/making-pythons-pickle-safer/</link>
		<comments>http://mumrah.net/2009/09/making-pythons-pickle-safer/#comments</comments>
		<pubDate>Wed, 09 Sep 2009 16:24:22 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://mumrah.net/?p=329</guid>
		<description><![CDATA[Securing pickled data with a HMAC SHA256 signature]]></description>
			<content:encoded><![CDATA[<p><img class="alignright s3-img" style="border: 0px initial initial;margin-left:10px;margin-bottom:10px;" src="http://mumrah-dot-net.s3.amazonaws.com/terrified-pickles_LRG.jpg" border="0" alt="Scared Pickles" width="137" height="181" /> Everyone loves pickle. I mean, what&#8217;s not to love? Super fast object serialization (via cPickle). However, there are some legitimate concerns regarding the security of pickle &#8211; specifically the load/loads methods. The basic problem is, if you try to unpickle untrusted data, you are liable to create some objects that can do nasty things (<a title="Importing OS with Pickle" href="http://docs.python.org/3.1/library/pickle.html#restricting-globals" target="_blank">like make system calls</a>). Python even gives us a nice warning right in the docs:</p>

<blockquote><strong>Warning</strong>
The <em>pickle</em> module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.</blockquote>

<p><br/></p>

<p>Now there are plenty of things you can do to improve the security of the unpickling process. Python lets you subclass pickle.Unpickler to give the user finer grained control over what gets unpickled. This is a fine approach (<a title="Example of a safer Unpickler class" href="http://nadiana.com/python-pickle-insecure" target="_blank">a nice example here</a>), and will work for most, but I will give my take on the issue.</p>

<p>For most of the applications I write that use pickle, I&#8217;m just looking for a way to store arbitrary Python data as a string. One example might be storing small data objects on S3, or perhaps implementing user sessions for a webapp. Either way, I <em>should</em> be able to trust my own data for unpickling, but it&#8217;s always best to be double-extra-sure when dealing with something where you can blindly execute arbitrary bits of code (think, the evil eval method).</p>

<p>So, for my case, I simply want to verify that the pickled data I stored is coming back to me unmodified. My solution: sign the pickled data. Using the same signing method as AWS, I present the following:
<pre class="python" name="code">import hmac
import hashlib
import base64
from cPickle import dumps
# The unsigned pickled data
string_to_sign = dumps({'foo': "bar", 'spam': "eggs", 'the answer': 42})
# The signature object, keyed with a secret only your application knows
signature = hmac.new("my application's super secret key",
                     string_to_sign, hashlib.sha256)
# The signed string: store this
signed_string = string_to_sign + base64.encodestring(signature.digest())</pre>
Now you have your pickled data as the first part of the string with the last 45 characters being the signature (a SHA-256 digest is 32 bytes, which base64-encodes to 44 characters, plus the newline that encodestring appends). The key for HMAC signing is specific to your application, so if someone gets access to your pickled data and tries to mess with it and re-sign it, it won&#8217;t work. Here&#8217;s the unpickling process:
<pre class="python" name="code">import hmac
import hashlib
import base64
from cPickle import loads
# Break up the signed string into message and signature
signature = signed_string[-45:]
message = signed_string[:-45]
# Calculate the signature of the message
msg_sig = hmac.new("my application's super secret key",
                   message, hashlib.sha256)
# Only unpickle if it matches the given signature
if base64.encodestring(msg_sig.digest()) != signature:
    raise ValueError("Signature mismatch, refusing to unpickle")
data = loads(message)</pre>
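For anyone on Python 3, the same idea can be sketched with a fixed-length raw digest and a constant-time comparison. This is only a minimal sketch under assumptions: the helper names sign_pickle and load_signed and the SECRET_KEY value are made up for illustration, and the raw 32-byte digest replaces the base64 text used above.

```python
import hashlib
import hmac
import pickle

# Hypothetical application secret; keep the real one out of source control.
SECRET_KEY = b"my application's super secret key"
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign_pickle(obj, key=SECRET_KEY):
    """Pickle obj and append a raw HMAC-SHA256 signature."""
    payload = pickle.dumps(obj)
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + signature


def load_signed(blob, key=SECRET_KEY):
    """Verify the trailing signature, then unpickle the payload."""
    payload, signature = blob[:-DIGEST_SIZE], blob[-DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


signed = sign_pickle({"foo": "bar", "spam": "eggs", "the answer": 42})
data = load_signed(signed)
```

Raising an exception rather than using assert matters here, since assert statements are skipped when Python runs with -O.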
-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2009/09/making-pythons-pickle-safer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>API Functional Testing with Python</title>
		<link>http://mumrah.net/2009/07/api-functional-testing-with-python/</link>
		<comments>http://mumrah.net/2009/07/api-functional-testing-with-python/#comments</comments>
		<pubDate>Sat, 18 Jul 2009 22:00:56 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[functional testing]]></category>
		<category><![CDATA[py.test]]></category>
		<category><![CDATA[unit testing]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://mumrah.net/?p=285</guid>
		<description><![CDATA[Recently, at work we have written a totally badass XML API for clients to interface with our data (sorry no public side yet). After some gentle reassuring (and some not-so-gentle arm twisting), I convinced my boss-man we could do this in Python with AWS on the back-end. We settled on the Turbogears 2.0 meta-framework using [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, <a href="http://loud3r.com" target="_blank" title="Loud3r">at work</a> we have written a totally badass XML API for clients to interface with our data (sorry no public side yet). After some gentle reassuring (and some not-so-gentle arm twisting), I convinced my boss-man we could do this in Python with AWS on the back-end. We settled on the Turbogears 2.0 meta-framework using Amazon S3/SimpleDB. The whole experience was very educational for many reasons &#8211; one, we had never used anything besides MySQL for a data store, two, we had never used a Python framework before, and three, we had never really developed an app with a proper set of tests. That final point, testing, is the subject of this entry.</p>

<p><a href="http://codespeak.net/py/dist/test/test.html" title="[test]" target="_blank">Py.Test</a>, from the vaingloriously-named &#8220;py&#8221; module, is my unit testing framework of choice (I have <a href="http://mumrah.net/2009/02/python-unit-testing-super-fun-time/" title="Python unit testing super fun time" target="_self">written about it before</a>). It provides a convenient way to collect tests and to write generative tests (which are super useful) for unit testing. After getting a few sets of unit tests rolled out for our API, we recognized that we would need some higher level tests &#8211; so called functional, or acceptance tests.</p>

<h3>Functional Tests</h3>

<p>Functional tests describe high-level tests that rely on the interaction of many components of the system, whereas a unit test will only test smaller, lower level components. For example, one (very high-level) functional test for an XML API would be to see that the resulting XML is well-formed. The well-formedness of an XML response from an API request is dependent on several components of the system. It requires proper request parsing, validation, error handling, template rendering, et al. A more typical test might be to see that the number of items returned by the API does not exceed a user-provided maximum, i.e., if the user requests http://api.example.com/?[request params]&amp;max_count=10, no more than 10 results are shown.</p>

<p>Now, how to go about running these tests. The number of functional testing frameworks is too great to mention (<a href="http://www.opensourcetesting.org/functional.php" target="_blank" title="Exhaustive list of functional testing frameworks">here&#8217;s a bunch</a>), but one that is well known and widely used is Selenium. It is written in Java and can do some pretty fancy stuff. However, one big drawback of Selenium is its weight. It&#8217;s <em>heavy</em> &#8211; it is Java after all, and requires a client server (whether you sacrifice your own cycles or a remote server). For the simple functional tests we were writing, it was complete overkill. After searching around for a Python functional testing framework (or at least something lighter than Selenium), it occurred to me that I could just use the test-collecting abilities of Py.Test plus some additional libraries. And that&#8217;s what we did.</p>

<h3>Bottom Line</h3>

<p>Mix together <a href="http://pyxml.sourceforge.net/topics/" target="_blank" title="PyXML">PyXML</a>, Urllib2, and Py.Test and you have a pretty powerful (and portable) testing suite in Python. PyXML extends the built-in &#8216;xml&#8217; module with some really nice packages including an XPath parser which I love.</p>

<h3>Exempli Gratia</h3>

<p>Consider an API that has a &#8220;users&#8221; noun, and just one verb &#8220;show&#8221;. We will allow one optional parameter <em>order_by</em> and one required parameter <em>max_count</em>. A valid URL would look like http://api.example.com/users/show?max_count=10&amp;order_by=date.</p>

<p>We&#8217;ll start by creating the class that will contain the tests, and writing a function to get an XML doc given some url parameters.
<pre name="code" class="python">
import urllib2
from collections import defaultdict
from xml.dom import minidom
from xml import xpath
class TestUserNoun:
    def get_xml_doc(self,url_params):
        url = "http://api.example.com/users/show?"
        url += "max_count=%(max_count)s&amp;order_by=%(order_by)s"
        url_p = urllib2.urlopen( url % defaultdict(str,url_params) )
        doc = minidom.parseString( url_p.read() )
        url_p.close()
        return doc
</pre>
N.B., you can create a specific User-Agent with urllib2 if so desired, and defaultdict is used so we don&#8217;t have to check if the incoming dict (url_params) has everything we need for the url string.</p>
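
<p>To make the defaultdict trick concrete, here is a small Python 3 sketch of just the URL-building step, with no network access. The build_url helper is a name invented for this example, and the endpoint is the placeholder URL from above.</p>

```python
from collections import defaultdict

# Hypothetical endpoint, matching the example URL used in this post
URL_TEMPLATE = (
    "http://api.example.com/users/show?"
    "max_count=%(max_count)s&order_by=%(order_by)s"
)


def build_url(url_params):
    # defaultdict(str, ...) returns "" for any key the caller left out,
    # so the % formatting never raises KeyError
    return URL_TEMPLATE % defaultdict(str, url_params)


full_url = build_url({"max_count": 10, "order_by": "date"})
partial_url = build_url({"max_count": 5})
```

<p>A missing parameter simply becomes an empty value in the query string, which is handy for testing how the API reacts to blank input.</p>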

<p>Now we can start writing some tests
<pre name="code" class="python">
class TestUserNoun:
    ...
    def test_user_count(self):
        # Test several values of max_count
        counts = (5,10,15,20)
        def count_users(n):
            # Test that the number of results returned is less than or equal to n
            doc = self.get_xml_doc({'max_count':n})
            user_count = len( xpath.Evaluate('/xpath/expr',doc.documentElement) )
            assert user_count &lt;= n
        for c in counts:
            yield count_users,c
    def test_order_by_date(self):
        # See that each item is no newer than the previous one
        doc = self.get_xml_doc({'max_count':10,'order_by':"date"})
        items = xpath.Evaluate('/xpath/expr',doc.documentElement)
        # Get the date of the first item (Evaluate returns a list of attribute nodes)
        last_date = xpath.Evaluate('@date_attr',items[0])[0].value
        # Compare the date of each item to the previous one
        for item in items[1:]:
            item_date = xpath.Evaluate('@date_attr',item)[0].value
            assert item_date &lt;= last_date
            last_date = item_date
</pre>
And you get the idea &#8211; one can write tests ad nauseam (although I&#8217;m not sure if there&#8217;s such a thing as too many tests). Of course neither of these tests will work since the XPath expressions are not valid &#8211; I didn&#8217;t really feel like spelling out a whole XML schema just for this example. There are plenty of good XPath tutorials out there. The basic idea here is you want to test all of your request parameters for the API to see a number of things:</p>

<ul>
<li>Does the controller handle the requests properly? What about missing/extra parameters?</li>
<li>Are errors handled properly?</li>
<li>Is the resulting XML well-formed? This is implicitly done by parsing the XML document</li>
<li>Does the resulting data correspond to the request parameters? This one will require the most tests to be written &#8211; don&#8217;t forget about generative tests!</li>
</ul>
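
<p>PyXML is no longer maintained, so here is a rough Python 3 sketch of the same kinds of checks using only the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The sample response, the users/user element names, the date attribute, and the helper functions are all invented for this example; a real test would fetch the document over HTTP.</p>

```python
import xml.etree.ElementTree as ET

# Made-up API response standing in for a real HTTP fetch
SAMPLE_RESPONSE = """
<users>
  <user name="alice" date="2009-07-18"/>
  <user name="bob" date="2009-07-10"/>
  <user name="carol" date="2009-06-20"/>
</users>
"""


def is_well_formed(xml_text):
    # fromstring raises ET.ParseError if the document is not well-formed
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def count_users(xml_text):
    root = ET.fromstring(xml_text)
    return len(root.findall("./user"))


def dates_descending(xml_text):
    root = ET.fromstring(xml_text)
    dates = [user.get("date") for user in root.findall("./user")]
    # ISO-style dates sort correctly as plain strings
    return dates == sorted(dates, reverse=True)
```

<p>Each helper maps onto one bullet above: parsing covers well-formedness, while the count and ordering checks compare the data against the request parameters.</p>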

<p>A powerful test suite means a robust application. When you have a nice set of tests, you can push your code with confidence &#8211; and believe me, that is a very rewarding and relieving feeling. Writing this API has been an extremely rewarding experience, and probably the most educational thing I&#8217;ve done programming-wise since I wrote a cross-browser JavaScript event library like 5 years ago.</p>

<p>So go forth, programmer &#8211; embrace testing and empower yourself.</p>

<p>-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2009/07/api-functional-testing-with-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weekend Project &#8211; CloudCached</title>
		<link>http://mumrah.net/2009/06/weekend-project-cloudcached/</link>
		<comments>http://mumrah.net/2009/06/weekend-project-cloudcached/#comments</comments>
		<pubDate>Sat, 20 Jun 2009 20:34:15 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[s3]]></category>

		<guid isPermaLink="false">http://mumrah.net/2009/06/weekend-project-cloudcached/</guid>
		<description><![CDATA[A friend and I have been bouncing around the idea of a caching system that ran on Amazon&#8217;s cloud for a while now. Basically something like memcached, but without the (very real) limitations of physical memory or the need of a whole server. Sure, it&#8217;s hard to beat the speed of memory-level read access, but [...]]]></description>
			<content:encoded><![CDATA[<p>A friend and I have been bouncing around the idea of a caching system that ran on Amazon&#8217;s cloud for a while now. Basically something like memcached, but without the (very real) limitations of physical memory or the need of a whole server. Sure, it&#8217;s hard to beat the speed of memory-level read access, but I think the appeal of a distributed, <a href="http://aws.amazon.com/s3/#functionality" title="Max 5GB per item" target="_blank">limitless</a> cache might outweigh the slowdown.</p>

<h3>Idea</h3>

<p>Provide an interface for storing/retrieving serialized data on S3</p>

<p>Pretty simple idea, pretty simple implementation. Thanks to the S3 interface provided by <a href="http://code.google.com/p/boto/" title="Boto rocks!">Boto</a>, things were a lot easier. I&#8217;m going to keep this open source under the MIT license. You can check out the code in the <a href="http://github.com/mumrah/cloudcached/tree/master" title="CloudCached on GitHub">GitHub repository</a> &#8211; please feel free to fork, improve, submit, etc.</p>

<h3>Overview</h3>

<p>A quick walkthrough of the code will reveal truly how simple this is. The Client class provides basic CRUD methods for interfacing with S3: <strong>put</strong>, <strong>get</strong>, <strong>update</strong>, <strong>delete</strong>. The put and update methods store a timestamp as the &#8220;expires&#8221; header for the file to keep track of cache expiration. Also these two methods write a &#8220;type&#8221; header to the meta-data so CloudCached knows how to de-serialize the file. 
<pre name="code" class="python">
class Client:
    "Here's the class schema (method bodies omitted)"
    def get(self, key): pass
    def put(self, key, value, time_to_expire=3600, replace=False): pass
    def update(self, key, value, time_to_expire=3600): pass
    def delete(self, key): pass
</pre><br />
There are five basic data types used in this code for serializing any bit of Python data: basestring (for str and unicode), int (for int and long), complex, float, and other. The other data type represents anything that is not a base type in Python. These &#8220;other&#8221; types get pickled while everything else just gets str&#8217;d.</p>
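
<p>The type-tag idea can be sketched roughly as follows in Python 3. This is not CloudCached's actual code: the serialize and deserialize names are invented here, and since Python 3 has no basestring or long, str and int cover those cases.</p>

```python
import pickle

# Types stored as their text form; everything else is pickled.
SIMPLE_TYPES = {"str": str, "int": int, "float": float, "complex": complex}


def serialize(value):
    """Return (type_tag, payload_bytes) for a value."""
    for tag, simple_type in SIMPLE_TYPES.items():
        # Exact type match only, so bool and subclasses fall through
        if type(value) is simple_type:
            return tag, str(value).encode("utf-8")
    return "other", pickle.dumps(value)


def deserialize(tag, payload):
    """Rebuild the value using the stored type tag."""
    if tag == "other":
        return pickle.loads(payload)
    return SIMPLE_TYPES[tag](payload.decode("utf-8"))
```

<p>The tag plays the role of the &#8220;type&#8221; header described above: it travels with the stored bytes so the reader knows whether to unpickle or just convert.</p>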

<p>The put method checks the md5sum to make sure everything went through cleanly (maybe a bit costly, but worth it in my opinion). cPickle is used in favor of pickle for obvious reasons (it&#8217;s much faster).</p>
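
<p>Here is a rough in-memory sketch of those two bookkeeping steps, the checksum and the expiration time. A plain dict stands in for the S3 bucket, and FakeBucketClient with its fields is hypothetical, not the real CloudCached client.</p>

```python
import hashlib
import time


class FakeBucketClient:
    """Toy stand-in for the S3-backed client; a dict plays the bucket."""

    def __init__(self):
        self._bucket = {}

    def put(self, key, payload, time_to_expire=3600, now=None):
        now = time.time() if now is None else now
        checksum = hashlib.md5(payload).hexdigest()
        self._bucket[key] = {
            "body": payload,
            "expires": now + time_to_expire,
        }
        # Re-read what was stored and confirm it matches what we sent
        stored = self._bucket[key]["body"]
        if hashlib.md5(stored).hexdigest() != checksum:
            raise IOError("checksum mismatch after put")
        return checksum

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._bucket.get(key)
        if entry is None or now >= entry["expires"]:
            return None  # missing or expired counts as a cache miss
        return entry["body"]
```

<p>The optional now argument exists only so the expiry logic can be exercised without waiting an hour.</p>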

<h3>Results</h3>

<p>Some very early tests show that this might just be usable. 
<pre name="code" class="python">
    CloudCached Benchmarks (10 runs)
    --------------------------------------------------------
    Test                 |  Average (s)     | Total (s)
    --------------------------------------------------------
    GET integer          |  0.0283360004425 | 0.283360004425
    GET string (32 byte) |  0.0315794944763 | 0.315794944763
    GET string (512KB)   |  0.1265994787220 | 1.265994787220
    PUT integer          |  0.0650457143784 | 0.650457143784
    PUT string (32 byte) |  0.0563205003738 | 0.563205003738
    PUT string (512KB)   |  0.1773290872570 | 1.773290872570
    --------------------------------------------------------
</pre></p>

<h3>Advantages</h3>

<ul>
<li>Highly distributed. S3 data is distributed across multiple availability zones and could therefore be utilized by an application running across multiple availability zones.</li>
<li>No size limit. Unlike the physical limitations of a memcached machine (or cluster of memcached machines), S3 does not have limits on the number of files (caches) you can store. Also, with S3, you can write files from 1 byte to 5 GB (although I think a 5GB cache file would defeat the purpose).</li>
<li>Parallel read access. If applicable to the application, cache reads can be largely parallelized which could potentially give linear speedup to the cache loading.</li>
<li>No server necessary. Since the application is reading and writing directly to S3, there is no need for a &#8220;cache server&#8221;. This could lead to a great deal of savings for people running multiple memcached machines. Memcached servers typically have a large memory capacity which means a m1.xlarge or c1.xlarge EC2 instance (assuming it&#8217;s running in EC2). </li>
</ul>
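
<p>The parallel read point could look something like this in Python 3. It is a sketch only: fetch is a placeholder for a real S3 GET, and FAKE_CACHE is made-up data.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cache contents; in practice each lookup is an S3 GET
FAKE_CACHE = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}


def fetch(key):
    # Placeholder for a network-bound cache read
    return FAKE_CACHE.get(key)


def fetch_many(keys, max_workers=8):
    # Threads suit this job because the time is spent waiting on I/O
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the keys
        return dict(zip(keys, pool.map(fetch, keys)))


results = fetch_many(["user:1", "user:2", "user:3", "user:4"])
```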

<h3>Considerations</h3>

<p>It&#8217;s going to be hard to beat the speed of memcached. As far as speed is concerned, I&#8217;m using built-in Python stuff including urllib, httplib, xml.sax, etc (all of which are used by Boto). It might be worthwhile to write a C implementation of the S3 communication methods (but maybe not). The most costly part of this code aside from network communication is probably the serialization, and since cPickle is used there is not really any improvement to be made there.</p>

<p>It might be cool to couple the meta-data with SimpleDB.</p>

<p>I registered cloudcached.com in case this gains some momentum. I will post updates and benchmarks there as they arrive.</p>

<p>-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2009/06/weekend-project-cloudcached/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Python unit testing super fun time</title>
		<link>http://mumrah.net/2009/02/python-unit-testing-super-fun-time/</link>
		<comments>http://mumrah.net/2009/02/python-unit-testing-super-fun-time/#comments</comments>
		<pubDate>Tue, 10 Feb 2009 05:59:38 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[py.test]]></category>
		<category><![CDATA[unit testing]]></category>

		<guid isPermaLink="false">http://mumrah.net/?p=50</guid>
		<description><![CDATA[There&#8217;s a weird thing that happens after a long night of mind-blowing back-breaking coding. Well, hacking in this case. Every time I stay up late working really hard on something, I feel compelled to blog/tweet/emote about my experience so others might feel sympathy/compassion for me. Even though I&#8217;m dizzyingly tired, and have to get up [...]]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s a weird thing that happens after a long night of mind-blowing back-breaking coding. Well, hacking in this case. Every time I stay up late working really hard on something, I feel compelled to blog/tweet/emote about my experience so others might feel sympathy/compassion for me. Even though I&#8217;m dizzyingly tired, and have to get up in ~5 hours, I cannot deny this urge to massage my ego.</p>

<p>So tonight I bring to you the joy of unit testing in Python. I&#8217;ve been using <a title="py.test - awesome python unit testing" href="http://codespeak.net/py/dist/test.html" target="_blank">py.test</a>, and loving it. It extends the basic functionality of Python&#8217;s built-in module, unittest (which is really not that bad). The main improvements are in the simplicity of writing the tests. Py.test supports unit testing on methods, classes, even whole modules.</p>

<p>Here&#8217;s your first test</p>

<p><pre name="code" class="python:nogutter:nocontrols">
def test_iszero():
    assert 1==0
</pre></p>

<p>If you haven&#8217;t guessed, this test will fail (1 does not equal 0). A cool thing about py.test is that you just prefix the function name with &#8220;test_&#8221; and that becomes a test. If your tests live in a class or module, you can add setup and teardown methods, but beyond that just write methods starting with &#8220;test_&#8221;. There&#8217;s lots more fancy stuff you can do; I suggest checking out the docs (link above).</p>

<p>However, my favorite thing py.test does by far is support generative testing. By using generators, a test can spawn &#8220;sub&#8221; tests with a yield statement. Let&#8217;s say I want to test if a bunch of numbers are even.
<pre name="code" class="python:nogutter:nocontrols">
def isEven(x):
    assert x%2==0
def test_evenNumbers():
    n = [1,2,3,4,5,6]
    for x in n:
        yield isEven,x
</pre>
This can be <em>tremendously</em> helpful when you need to do a repetitive test on many input parameters. Enjoy!
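One caveat for readers on newer tools: later pytest releases dropped yield-style tests in favor of pytest.mark.parametrize. The standard library's unittest offers a similar per-input report through subTest. A minimal Python 3 sketch, with class and function names chosen just for illustration:

```python
import unittest


def is_even(x):
    return x % 2 == 0


class TestEvenNumbers(unittest.TestCase):
    def test_even_numbers(self):
        for x in [2, 4, 6, 8]:
            # Each value is reported separately if it fails
            with self.subTest(x=x):
                self.assertTrue(is_even(x))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEvenNumbers)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```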
-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2009/02/python-unit-testing-super-fun-time/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Python static class members and You</title>
		<link>http://mumrah.net/2008/12/python-static-class-members-and-you/</link>
		<comments>http://mumrah.net/2008/12/python-static-class-members-and-you/#comments</comments>
		<pubDate>Fri, 05 Dec 2008 00:45:15 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://mumrah.net/?p=44</guid>
		<description><![CDATA[After getting yelled at for not grading my student&#8217;s homework, I decided to ignore the threatening emails and continue doing what I feel like. Undergrads, know this: TAs don&#8217;t really care about you &#8211; sorry. I was debugging some code built on top of my awesome HTMLParser, and kept having a really frustrating problem. Some [...]]]></description>
			<content:encoded><![CDATA[<p>After getting yelled at for not grading my students&#8217; homework, I decided to ignore the threatening emails and continue doing what I feel like. Undergrads, know this: TAs don&#8217;t really care about you &#8211; sorry. I was debugging some code built on top of my <a title="More awesome than this one actually" href="http://mumrah.net/2008/12/htmlparser-not-for-the-faint-of-heart/" target="_self">awesome HTMLParser</a>, and kept having a really frustrating problem. Some of my class variables were not getting reset during the __init__ call. So I poke around and after a while discover (buried in my libraries)
<pre name="code" class="python:nocontrols:nogutter">
class Foo:
    a = True
    b = []
    c = []
    def __init__(self):
        ""
</pre>
It seems the class members a, b, and c are not getting reset when I instantiate because, quite simply, I am not resetting them in __init__. I originally put them there for prettiness (self.a, self.b, self.c is so cumbersome), and moving them back into __init__ fixed my problem.</p>

<p>A little more digging reveals what is going on here. If you define a variable in the class body, outside of any method, the variable is implicitly made static: it belongs to the class and is shared by every instance.
<pre name="code" class="python:nocontrols:nogutter">
class Foo:
    a = "Hello"
print Foo.a
&gt;&gt; Hello
</pre>
These static members are accessed just like regular members, through the &#8220;self&#8221; object. For things like str, int, and float, the value will seem to be reset when you create a new instance of the class. What&#8217;s really happening is that when you assign to the static variable through self, you are actually creating a new instance variable (in memory) which shadows the static one for the lifetime of that object. This is not true for lists and dicts that you modify in place. Every Python name is a reference to an object, and the static member is just a reference to one shared list. So when you alter the static list (via __setitem__, append, remove, etc.) you are operating on that shared object, not a copy of the list.
<pre name="code" class="python:nocontrols:nogutter">
class Foo:
    a = []
    def __init__(self):
        print self.a
        self.a.append(1)
f = Foo()
f = Foo()
f = Foo()
&gt;&gt; []
&gt;&gt; [1]
&gt;&gt; [1, 1]
</pre>
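The difference between rebinding and mutating can be seen side by side in a short Python 3 sketch. The attribute names greeting and items are made up for this example:

```python
class Foo:
    greeting = "Hello"  # class attribute, shared until shadowed
    items = []          # class attribute, one list shared by all instances


first = Foo()
second = Foo()

# Assignment through the instance creates an instance attribute
first.greeting = "Goodbye"

# Mutation through the instance changes the single shared list
first.items.append(1)
```

After this runs, second.greeting is still "Hello" because only first gained its own attribute, while second.items shows the appended 1 because both instances look up the same list on the class.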
Depending on how you&#8217;re structuring your code (or how good at Python you are) you might want this functionality. For me though, this was not the case, so I put everything back in __init__. Another good thing to point out is Python has a very convenient syntax for making a copy of a list.
<pre name="code" class="python:nocontrols:nogutter">
a = [1,2,3,4]
b = a      # b is another name for the same list
c = a[:]   # c is a new list with the same items
b[0] = 5
c[0] = 6
print a
print b
print c
&gt;&gt; [5, 2, 3, 4]
&gt;&gt; [5, 2, 3, 4]
&gt;&gt; [6, 2, 3, 4]
</pre>
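One thing to keep in mind is that the slice copy is shallow: it copies the outer list only. A small Python 3 sketch with nested lists (the variable names are just for illustration) shows where copy.deepcopy from the standard library comes in:

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = nested[:]             # new outer list, same inner lists
deep = copy.deepcopy(nested)    # new outer list and new inner lists

shallow[0][0] = 99   # visible through nested, the inner list is shared
deep[1][1] = -1      # stays local to the deep copy
```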
Sometimes I miss pointers, but not really.
-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2008/12/python-static-class-members-and-you/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>HTMLParser, not for the faint of heart</title>
		<link>http://mumrah.net/2008/12/htmlparser-not-for-the-faint-of-heart/</link>
		<comments>http://mumrah.net/2008/12/htmlparser-not-for-the-faint-of-heart/#comments</comments>
		<pubDate>Mon, 01 Dec 2008 20:32:05 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[trade secrets]]></category>

		<guid isPermaLink="false">http://mumrah.net/?p=39</guid>
		<description><![CDATA[

In recent efforts to create a general purpose HTML scraper for mein Geschäftsführer, I&#8217;ve been getting my hands dirty in some Py. After much research and experimentation, I&#8217;ve decided to go with the built-in HTMLParser instead of the XML expat parser or the SGMLParser. Also, I should clarify this is not the HTMLParser from htmllib, [...]]]></description>
			<content:encoded><![CDATA[<div style="float:right"><img class="alignnone size-thumbnail wp-image-40" title="Pie crust" src="http://mumrah.net/wp-content/uploads/2008/12/pie-crust-de-150x150.gif" alt="" width="150" height="150" /></div>

<p>In recent efforts to create a general purpose HTML scraper for mein Geschäftsführer, I&#8217;ve been getting my hands dirty in some Py. After much research and experimentation, I&#8217;ve decided to go with the built-in HTMLParser instead of the XML expat parser or the SGMLParser. Also, I should clarify this is not the HTMLParser from htmllib, this is HTMLParser&#8217;s HTMLParser. For all its wonderment, Python really fails on consistent naming schemes. Oh well.</p>

<p>One of the things I like most about HTMLParser is that it is not a ready-made parser per se, but a base class for building your own. There is no default HTMLParser which you can feed HTML to and get output &#8211; you only get the hooks for parsing, and you subclass it to fill them in.
<pre name="code" class="python">
from HTMLParser import HTMLParser
class MyParser(HTMLParser):
    def handle_starttag(self,tag,attrs):
        if tag == "a":
            print "Found link:",attrs
    def handle_startendtag(self,tag,attrs):
        if tag == "img":
            print "Found image:",attrs
parser = MyParser()
parser.feed(rawhtml)
</pre>
Lovely, no? There are a few more methods which you override in order to achieve the desired functionality. The nice thing about parsing HTML like this is that it is a one-pass operation. Unlike a series of regexps to find desired content, this allows us to find multiple targets in a streaming fashion.</p>

<p>There was one really annoying thing about this module however. The built-in <em>getpos()</em> returns a tuple of line number and column position. I can&#8217;t think of an instance when this would be useful for anything really (unless you&#8217;re making an HTML editor in Python or something), so naturally I modified it to my liking. My first solution was to just remove all the newlines and then work based on the column offset alone. Unfortunately, HTMLParser chokes on some really long lines. My next idea (the one I&#8217;m currently using) was to strip out tabs and trailing whitespace and precalculate the length of each line before I feed the parser.
<pre name="code" class="python">
# Record the absolute character position at which each line starts
linepos = []
charpos = 0
for line in html.split("\n"):
    linepos.append(charpos)
    charpos += len(line) + 1  # +1 for the newline removed by split
parser = MyParser()
parser.linepos = linepos
</pre>
This produces an array like [0,10,20,30,...] (if each line were 10 characters long, newline included). The next modification is to create a new method for MyParser.
<pre name="code" class="python:nocontrols">
def getcharpos(self):
    return self.linepos[self.lineno-1] + self.offset
</pre>
The two properties <em>lineno</em> and <em>offset</em> are inherited from HTMLParser (actually inherited from markupbase), and they represent exactly what you&#8217;d think.</p>
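
<p>Putting the pieces together, a Python 3 version might look roughly like this. It is a sketch under assumptions: it uses html.parser from the Python 3 standard library, the PositionParser class and its found list are invented for this example, and it relies on getpos() reporting the start of the tag currently being handled.</p>

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Records the absolute character position of every start tag."""

    def __init__(self, html):
        super().__init__()
        # Absolute position at which each line starts
        self.linepos = []
        charpos = 0
        for line in html.split("\n"):
            self.linepos.append(charpos)
            charpos += len(line) + 1  # +1 for the newline removed by split
        self.found = []
        self.feed(html)
        self.close()

    def getcharpos(self):
        # getpos() gives (line number starting at 1, column offset)
        lineno, offset = self.getpos()
        return self.linepos[lineno - 1] + offset

    def handle_starttag(self, tag, attrs):
        self.found.append((tag, self.getcharpos()))


sample = "<p>intro</p>\n<a href='x'>link</a>\n<img src='y'>"
parser = PositionParser(sample)
```

<p>Each entry in parser.found pairs a tag name with the index of its opening bracket in the original string.</p>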

<p>Now that I have the absolute position of tags in the HTML, I can do all kinds of fun things like use <a title="K-means algorithm" href="http://en.wikipedia.org/wiki/K-means_algorithm" target="_blank">K-means grouping</a> to find clusters of images. Or maybe I want to see the average distance between occurrences of the word &#8220;the&#8221; in an article. It&#8217;s 276.21 for this one, btw.</p>
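
<p>The word-distance measurement is easy to sketch on plain text in Python 3. The average_gap function is a name made up here, and this version works on a raw string rather than on parser positions, so it will not reproduce the number above.</p>

```python
import re


def average_gap(text, word):
    """Mean distance in characters between consecutive matches of word."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    starts = [m.start() for m in pattern.finditer(text)]
    if len(starts) in (0, 1):
        return None  # a gap needs at least two occurrences
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)
```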

<p>-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2008/12/htmlparser-not-for-the-faint-of-heart/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>1d Fokker-Planck equation</title>
		<link>http://mumrah.net/2008/10/1d-fokker-plank-equation/</link>
		<comments>http://mumrah.net/2008/10/1d-fokker-plank-equation/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 08:45:51 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[School]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[pde]]></category>
		<category><![CDATA[scipy]]></category>

		<guid isPermaLink="false">http://mumrah.net/?p=29</guid>
		<description><![CDATA[As promised, I bring pretty pictures. The past few days I&#8217;ve been working on a solution to the 1d diffusion equation with a drift term, better known as the Fokker-Planck equation.



Sexy, I know. Anyhow, I finally worked out the Python code to get it rolling (literally!). The test system I did has periodic boundary conditions [...]]]></description>
			<content:encoded><![CDATA[<p>As promised, I bring pretty pictures. The past few days I&#8217;ve been working on a solution to the 1d diffusion equation with a drift term, better known as the Fokker-Planck equation.</p>

<p style="text-align: center;"><img class="alignnone size-medium wp-image-30 aligncenter" title="fokker-planck-1d" src="http://mumrah.net/wp-content/uploads/2008/10/latex-image-1-300x58.png" alt="The 1d Fokker-Planck equation" width="300" height="58" /></p>

<p style="text-align: left;">Sexy, I know. Anyhow, I finally worked out the Python code to get it rolling (literally!). The test system I did has periodic boundary conditions and an initial condition of a sharply-peaked Gaussian (a = 20). I&#8217;ll spare the details and jump to the fun part.</p>
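
<p style="text-align: left;">For a rough idea of what such a solver involves, here is a minimal pure-Python sketch, not the linked script. It assumes the constant-coefficient form dp/dt = -v dp/dx + D d2p/dx2 with drift v and diffusion D, a unit-length periodic domain, and a simple explicit finite-difference scheme. All parameter values are picked for illustration, including the reading of a = 20 as the sharpness of the Gaussian.</p>

```python
import math


def step(p, drift, diffusion, dx, dt):
    """One explicit finite-difference step with periodic boundaries."""
    n = len(p)
    r = diffusion * dt / dx ** 2   # diffusion number
    c = drift * dt / dx            # Courant number
    new_p = []
    for i in range(n):
        left = p[i - 1]            # index -1 wraps to the last cell
        right = p[(i + 1) % n]     # wrap past the last cell to the first
        new_p.append(
            p[i]
            + r * (right - 2.0 * p[i] + left)   # diffusion term
            - 0.5 * c * (right - left)          # drift term
        )
    return new_p


# Assumed setup: unit-length periodic domain, constant coefficients
n_cells = 100
dx = 1.0 / n_cells
dt = 0.002
drift, diffusion = 0.5, 0.01
sharpness = 20.0   # assumed reading of the "a = 20" Gaussian parameter

x = [i * dx for i in range(n_cells)]
p0 = [math.exp(-sharpness * (xi - 0.5) ** 2) for xi in x]

p = p0
for _ in range(200):
    p = step(p, drift, diffusion, dx, dt)
```

<p style="text-align: left;">With these step sizes every update is a weighted average of a cell and its two neighbors, so the peak can only spread out and drift, and the periodic wrap keeps the total probability constant.</p>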

<p style="text-align:center;">
<embed id="VideoPlayback" src="http://video.google.com/googleplayer.swf?docid=8670238611717151756&#038;hl=en&#038;fs=true" style="width:400px;height:326px" allowFullScreen="true" allowScriptAccess="always" type="application/x-shockwave-flash"> </embed></p>

<p style="text-align: left;">Here&#8217;s the <a title="Python code for 1d Fokker-Planck" href="http://mumrah-dot-net.s3.amazonaws.com/fokker-planck.py" target="_self">Python code</a> that made it happen (scipy and matplotlib required).</p>

<p style="text-align: left;">-David</p>
]]></content:encoded>
			<wfw:commentRss>http://mumrah.net/2008/10/1d-fokker-plank-equation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>