Null Disquisition

In Mother Russia, Thesis writes You!

Archive for December, 2008

Time Machine in your pocket

without comments

A close friend of mine was recently the victim of a home invasion. Aside from the regular pickings (TV, computers), he had his external hard drives stolen. I can only imagine how that felt - just that extra twist of the dagger in your chest. After hearing about this, I reevaluated my personal backup procedures - specifically, how to avoid losing everything if something like this happens.

I’ve been using Time Machine on my MacBook Pro for the past several months. TM is certainly a great program, but we all know it has certain shortcomings, one of them being the lack of a whitelist (you can only blacklist directories from the backup schedule). Say, for instance, I only want to back up a single directory (like all of my code). This is effectively impossible with Time Machine.

So until Apple gets their shit together, we Unix folk have rsync. Michael over at IMHO has a great writeup on how to achieve incremental backups in a Time Machine fashion using rsync and cron. I threw together the following script to add in a little OS X sexiness and so that it uses a USB thumb drive.

#!/bin/bash
if [ -d /Volumes/PNY8GB ]
then
    # DEST rather than HOME, so the script does not clobber the real $HOME
    DEST=/Volumes/PNY8GB/Backups.backupdb/TM
    date=`date "+%Y-%m-%d-%H%M%S"`
    # Hard-link unchanged files against the previous snapshot; keep only the
    # transfer count from the --stats output (empty when nothing was transferred)
    n=`rsync --stats -aR --exclude "*.swp" --exclude "*.bak" --exclude "*.pyc" \
        --exclude "*~" --exclude ".svn" --delete-excluded --link-dest=$DEST/Latest \
        /Users/davidarthur/Code/Loud3r $DEST/$date |
        sed -n 's/Number of files transferred: \([^0]\)/\1/p'`
    rm -f $DEST/Latest
    ln -s $date $DEST/Latest
else
    exit
fi
if [ -n "$n" ]
then
    /usr/local/bin/growlnotify -m "rsync complete, number of files: $n" &> /dev/null
fi

This first looks for the target thumb drive (PNY8GB in my case), then runs the rsync command, and finally pops up a nice little growl notification showing how many files were backed up. For non-OS X people, just ignore the growlnotify part.

#!/bin/bash
if [ -d /Volumes/PNY8GB ]
then
    # DEST rather than HOME, so the script does not clobber the real $HOME
    DEST=/Volumes/PNY8GB/Backups.backupdb/TM
    date=`date "+%Y-%m-%d-%H%M%S"`
    rsync --stats -aR --exclude "*.swp" --exclude "*.bak" --exclude "*.pyc" \
        --exclude "*~" --exclude ".svn" --delete-excluded --link-dest=$DEST/Latest \
        /Users/davidarthur/Code/Loud3r $DEST/$date
    rm -f $DEST/Latest
    ln -s $date $DEST/Latest
else
    exit
fi
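
The sed expression in the growl version of the script pulls the transfer count out of rsync’s --stats output. Here is a rough Python 3 sketch of the same step, in case you drive rsync from Python instead of bash. The function name and the sample text are made up for illustration. As far as I know, rsync 3.1 and later label the line "Number of regular files transferred" and may put thousands separators in the number, so the pattern allows for both.

```python
import re

# Matches both "Number of files transferred: 12" and the newer
# "Number of regular files transferred: 1,234" wording.
TRANSFERRED = re.compile(r"Number of (?:regular )?files transferred: ([\d,]+)")


def files_transferred(stats_output):
    """Return the transfer count from rsync --stats text, or 0 if it is missing."""
    match = TRANSFERRED.search(stats_output)
    if match is None:
        return 0
    return int(match.group(1).replace(",", ""))


# Hypothetical sample of the relevant part of rsync's --stats output
sample = "Number of files: 120\nNumber of files transferred: 12\nTotal file size: 4096 bytes\n"
count = files_transferred(sample)
if count:  # like the bash script, stay quiet when nothing was transferred
    print("rsync complete, number of files:", count)
```

Unlike the sed version, this returns 0 rather than an empty string when nothing was copied, which makes the "should I notify?" check a plain truth test.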

I’m still working on getting the directory structure to match Time Machine’s so I can actually use Time Machine to explore/restore files from the backups.

This gives me a little peace of mind, and hopefully it will give you some too.

-David

Edit: OS X has some issues with preserving permissions and ownerships with HFS+ volumes - so the incremental part doesn’t really work. Should work on other *nix systems though.

2nd Edit: Got it working finally. Here is my complete script (running every 15mins). N.B. must be run by root (so put it in root’s crontab).

#!/bin/bash -x

DEST="/Volumes/PNY8GB/Backups"
LATEST="Latest"
EXCLUDES_FILE="/Users/davidarthur/.rsyncexcludes"
RSYNC="/usr/bin/rsync"

# Make sure user is root
if (( `id -u` != 0 )); then
    echo "Sorry, must be root. Exiting..."
    exit 1
fi

# Make sure backup device is attached
if ! test -d "$DEST"; then
    echo "Please mount the backup drive!"
    exit 1
fi

# Run rsync; any extra arguments to this script are passed through to rsync
DATE=`date "+%Y-%m-%d-%H%M%S"`
n=`$RSYNC -a -x -S --stats --delete --link-dest="$DEST/$LATEST" \
    --exclude-from "$EXCLUDES_FILE" "$@" /Users/davidarthur/Code/Loud3r "$DEST/$DATE" \
    | sed -n 's/Number of files transferred: \([^0]\)/\1/p'`

# Update 'Latest' link
rm -f "$DEST/$LATEST"
ln -s "$DEST/$DATE" "$DEST/$LATEST"

# Send a growl notification
if [ -n "$n" ]
then
    /usr/local/bin/growlnotify -m "rsync complete, number of files: $n" &> /dev/null
fi

Written by david

December 21st, 2008 at 6:01 pm

Posted in General


Python static class members and You

without comments

After getting yelled at for not grading my students’ homework, I decided to ignore the threatening emails and continue doing what I feel like. Undergrads, know this: TAs don’t really care about you - sorry. I was debugging some code built on top of my awesome HTMLParser, and kept having a really frustrating problem. Some of my class variables were not getting reset during the __init__ call. So I poke around and after a while discover (buried in my libraries):

class Foo:
    a = True
    b = []
    c = []
    def __init__(self):
        …

It seems the class members a, b, and c are not getting reset when I instantiate because, quite simply, I am not resetting them in __init__. I originally put them there for prettiness (self.a, self.b, self.c is so cumbersome), and moving them back into __init__ fixed my problem.

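A little more digging reveals what is going on here. If you define a variable in the class body, outside of any method, it becomes a class attribute (what other languages would call a static member), shared by the class and all of its instances.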

class Foo:
    a = "Hello"
print Foo.a
>> Hello



These class attributes are accessed just like regular members, through the “self” object. For things like str, int, and float, the value will seem to be reset when you create a new instance of the class. What’s really happening is that assigning to self.a never touches the class attribute at all: it creates a new instance attribute which shadows the class attribute for the lifetime of that object. Lists and dicts seem to behave differently, but only because you usually modify them in place rather than assign to them. Every Python name is a reference to an object, so when you alter the shared list (via __setitem__, append, remove, et al.) you are operating on the one list that the class and all of its instances point to, not on a copy.

class Foo:
    a = []
    def __init__(self):
        print self.a
        self.a.append(1)
f = Foo()
f = Foo()
f = Foo()
>> []
>> [1]
>> [1,1]
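
To make the difference between assigning and modifying in place concrete, here is a small sketch written for Python 3 (so print is a function). The class names Shared and Fixed are invented for this example; Fixed shows the "put it back in __init__" fix.

```python
class Shared:
    count = 0    # class attribute, one object shared by the class
    items = []   # class attribute, one list shared by every instance

    def bump(self):
        self.count += 1       # rebinds: creates an instance attribute on self
        self.items.append(1)  # mutates the shared list in place


class Fixed:
    def __init__(self):
        self.items = []       # a fresh list for every instance

    def bump(self):
        self.items.append(1)


a, b = Shared(), Shared()
a.bump()
print(Shared.count, a.count, b.count)  # class value untouched, only a.count changed
print(a.items is b.items, b.items)     # same list object, so b sees a's append

c, d = Fixed(), Fixed()
c.bump()
print(c.items, d.items)                # independent lists
```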



Depending on how you’re structuring your code (or how good at Python you are) you might want this functionality. For me though, this was not the case, so I put everything back in __init__. Another good thing to point out is that Python has a very convenient syntax for making a (shallow) copy of a list.

a = [1,2,3,4]
b = a
c = a[:]
b[0] = 5
c[0] = 6
print a
print b
print c
>> [5,2,3,4]
>> [5,2,3,4]
>> [6,2,3,4]
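
One caveat worth a quick sketch: the slice copy only duplicates the outer list. If the list contains other lists, those inner lists are still shared, and copy.deepcopy from the standard library is one way around that. The variable names here are just for illustration, and the snippet is written for Python 3.

```python
import copy

grid = [[1, 2], [3, 4]]
shallow = grid[:]            # new outer list, same inner lists
deep = copy.deepcopy(grid)   # new outer list and new inner lists

shallow[0][0] = 99           # reaches through to grid's first inner list
deep[1][1] = -1              # touches only the deep copy

print(grid)
print(shallow)
print(deep)
```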


Sometimes I miss pointers, but not really.
-David

Written by david

December 4th, 2008 at 8:45 pm

Posted in General


HTMLParser, not for the faint of heart

with one comment

In recent efforts to create a general purpose HTML scraper for mein Geschäftsführer, I’ve been getting my hands dirty in some Py. After much research and experimentation, I’ve decided to go with the built-in HTMLParser instead of the XML expat parser or the SGMLParser. Also, I should clarify this is not the HTMLParser from htmllib, this is HTMLParser’s HTMLParser. For all its wonderment, Python really fails on consistent naming schemes. Oh well.

One of the things I like most about HTMLParser is that it is not a ready-made parser per se, but a base class that you subclass to build your own. There is no default HTMLParser which you can feed HTML to and get useful output from - the stock handlers do nothing, so you override the ones you care about.

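from HTMLParser import HTMLParser  # the HTMLParser module, not htmllib

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            print "Found link:", attrs
    def handle_startendtag(self, tag, attrs):
        # only called for self-closing tags such as <img ... />
        if tag == "img":
            print "Found image:", attrs

parser = MyParser()
parser.feed(rawhtml)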



Lovely, no? There are a few more methods which you can override in order to achieve the desired functionality. The nice thing about parsing HTML like this is that it is a one-pass operation. Unlike a series of regexps to find the desired content, this allows us to find multiple targets in a streaming fashion.
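
For anyone on Python 3, where the class lives in html.parser, a one-pass collector might look roughly like the sketch below. The class name, attribute names, and sample HTML are all invented for illustration. It handles images in handle_starttag because a plain `<img ...>` without the trailing slash never reaches handle_startendtag.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class LinkImageCollector(HTMLParser):
    """Hypothetical subclass that gathers link and image URLs in a single pass."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

    # handle_startendtag is left alone: its default implementation calls
    # handle_starttag and then handle_endtag, so <img ... /> lands above too.


rawhtml = '<p><a href="/one">one</a> <img src="a.png"> <img src="b.png"/></p>'
collector = LinkImageCollector()
collector.feed(rawhtml)
collector.close()
print(collector.links)
print(collector.images)
```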

There was one really annoying thing about this module however. The built-in getpos() returns a tuple of line number and column position. I can’t think of an instance when this would be useful for anything really (unless you’re making an HTML editor in Python or something), so naturally I modified it to my liking. My first solution was to just remove all the newlines and then work based on the column offset alone. Unfortunately, HTMLParser chokes on some really long lines. My next idea (the one I’m currently using) was to strip out tabs and trailing whitespace and precalculate the length of each line before I feed the parser.

# Suppose incoming HTML is clean
linepos = []
charpos = 0
for line in self.html.split("\n"):
    linepos.append(charpos)   # offset at which this line starts
    charpos += len(line)      # newline characters themselves are not counted
parser = MyParser(linepos=linepos)



This produces an array like [0,10,20,30,...] (if each line were 10 characters long). The next modification is to create a new method for MyParser.

def getcharpos(self):
    return self.linepos[self.lineno-1] + self.offset



The two properties lineno and offset are inherited from HTMLParser (actually inherited from markupbase), and they represent exactly what you’d think.
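
Putting the pieces together, here is a self-contained Python 3 sketch of the idea. The names PositionParser and line_start_offsets are my own for this example, and it differs from the snippet above in one respect: it adds 1 per line for the newline, so the result is an index into the original, unmodified string. It also assumes the whole document is passed in a single feed() call.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


def line_start_offsets(text):
    """Return the absolute offset at which each line of text begins."""
    offsets = []
    charpos = 0
    for line in text.split("\n"):
        offsets.append(charpos)
        charpos += len(line) + 1  # +1 counts the newline removed by split
    return offsets


class PositionParser(HTMLParser):
    """Hypothetical subclass that records the absolute offset of each start tag."""

    def __init__(self, linepos):
        super().__init__()
        self.linepos = linepos
        self.found = []  # list of (tag, absolute offset)

    def getcharpos(self):
        lineno, offset = self.getpos()  # lineno is 1-based, offset is 0-based
        return self.linepos[lineno - 1] + offset

    def handle_starttag(self, tag, attrs):
        self.found.append((tag, self.getcharpos()))


raw = "<p>hi</p>\n<a href='x'>link</a>\n  <img src='a.png'>\n"
parser = PositionParser(line_start_offsets(raw))
parser.feed(raw)
parser.close()
for tag, pos in parser.found:
    print(tag, pos, raw[pos:pos + len(tag) + 1])
```

Slicing the raw string at each recorded offset should land exactly on the opening "<" of the tag, which is a handy sanity check.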

Now that I have the absolute position of tags in the HTML, I can do all kinds of fun things like use K-means grouping to find clusters of images. Or maybe I want to see the average distance between occurrences of the word “the” in an article. It’s 276.21 for this one, btw.
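
The average-distance calculation is simple once you have absolute offsets. Here is a rough Python 3 sketch that works on plain text with a regex rather than on parser positions; the function name and sample sentence are made up, and it is not the code that produced the number above.

```python
import re


def average_gap(text, word):
    """Mean distance in characters between consecutive occurrences of word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    starts = [m.start() for m in pattern.finditer(text)]
    if len(starts) < 2:
        return None  # fewer than two occurrences, so there is no gap to average
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


sample = "the cat and the dog and the bird"
print(average_gap(sample, "the"))
```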

-David

Written by david

December 1st, 2008 at 4:32 pm

Posted in General


The Official FSU Re-fill cup

without comments

Busy times. Been busting ass to keep up with everything this semester - teaching assistantship, classes, thesis, not to mention my new(ish) job at Loud3r. Had a few interesting things happen lately that I felt warranted an update. To enumerate: I hacked some kid’s Facebook account, wrote an HTML scraper to get the latest Naruto Shippüden episodes from Dattebayo, got my code working for my thesis, and did my final edits on my first real academic paper.

Facebook failure:

While sitting in the class I TA for (along with my cohort Billzebub), I was sniffing the wifi traffic (like you do) and took a look at the pcap dump that was captured. Amongst the garbage were some request/response headers for Facebook. Being the curious little monkey I am, I fired up Firefox and copy/pasted all the cookie information into my session (using the Web Developer 2 extension). I headed over to facebook.com and, lo and behold, I was Matt Whatshisname. Full access too, not just a temporary hiccup in the login system. After resisting messing with stuff and/or snooping, I cleared my cookies and sat back in awe. Awe at how ridiculous it is that the Facebook login system is so exposed and broken.

I tried to replicate the cookie spoof for some pics for this post, but apparently one or more of the cookies are time-sensitive.

Edit: Hack successfully reproduced! Epic fail!!

http://skitch.com/mumrah/7gcg/his-name-is-robert-paulson

http://skitch.com/mumrah/7gcj/full-access


More Naruto Shenanigans:

No need to waste time on how I did it; here’s a link that pretty much explains it all. Naruto Shippüden XML feed.

Thesis:

After going back and forth with my professor for weeks not getting anywhere, I sit him down and start at the beginning and force him to work through all the details with me. 4 hours later we have some functioning code. Obligatory photos to follow.

The above picture demonstrates the orthogonality and normalization of the eigenvectors (meaning we finally have the partition function correct as well as the normalization criterion). The following three pictures are just the first three eigenvectors.

Looking forward to the break.

-David

Written by david

December 1st, 2008 at 3:25 pm

Posted in General, School
