Null Disquisition

In Mother Russia, Thesis writes You!

HTMLParser, not for the faint of heart

without comments

In recent efforts to create a general purpose HTML scraper for mein Geschäftsführer, I’ve been getting my hands dirty in some Py. After much research and experimentation, I’ve decided to go with the built-in HTMLParser instead of the XML expat parser or the SGMLParser. Also, I should clarify this is not the HTMLParser from htmllib, this is HTMLParser’s HTMLParser. For all its wonderment, Python really fails on consistent naming schemes. Oh well.

One of the things I like most about HTMLParser is that it is not a ready-made parser per se, but a base class you subclass. There is no default HTMLParser which you can feed HTML to and get output, because the stock handlers do nothing - you only get the hooks, and you write the handlers yourself.

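from HTMLParser import HTMLParser  # Python 2 module; in Python 3 it lives in html.parser

class MyParser(HTMLParser):
    # Called for every opening tag, e.g. <a href="...">
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            print "Found link:", attrs

    # Called for self-closing tags, e.g. <img src="..." />
    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            print "Found image:", attrs

parser = MyParser()
parser.feed(rawhtml)  # rawhtml: the HTML string to scan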



Lovely, no? There are a few more methods which you override in order to achieve the desired functionality. The nice thing about parsing HTML like this is that it is a one-pass operation. Unlike a series of regexps, each hunting for its own content, this lets us find multiple targets in a streaming fashion.

There was one really annoying thing about this module, however. The built-in getpos() returns a tuple of line number and column position. I can’t think of an instance when this would be useful for anything really (unless you’re making an HTML editor in Python or something), so naturally I modified it to my liking. My first solution was to just remove all the newlines and then work based on the column offset alone. Unfortunately, HTMLParser chokes on some really long lines. My next idea (the one I’m currently using) was to strip out tabs and trailing whitespace and precalculate the length of each line before I feed the parser.

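# Suppose incoming HTML is clean
linepos = []
charpos = 0
for line in self.html.split("\n"):
    linepos.append(charpos)   # absolute offset at which this line starts
    charpos += len(line)      # note: the stripped "\n" itself is not counted
parser = MyParser(linepos=linepos)  # assumes MyParser.__init__ stores linepos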



This produces an array like [0,10,20,30,...] (if each line were 10 characters long). The next modification is to create a new method for MyParser.

def getcharpos(self):
    return self.linepos[self.lineno-1] + self.offset



The two properties lineno and offset are inherited from HTMLParser (actually inherited from markupbase), and they represent exactly what you’d think.
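Putting the pieces above together, here is a minimal sketch in Python 3 (where the module is html.parser). The names line_offsets, PositionParser, and find_tags are made up for illustration. Unlike the snippet above, this version assumes you want offsets into the raw string, so it counts one extra character per line for the newline.

```python
from html.parser import HTMLParser


def line_offsets(html):
    """Absolute offset at which each line of html starts."""
    offsets = []
    pos = 0
    for line in html.split("\n"):
        offsets.append(pos)
        pos += len(line) + 1  # +1 for the "\n" removed by split (an assumption)
    return offsets


class PositionParser(HTMLParser):
    """Records (tag, absolute offset) for links and self-closing images."""

    def __init__(self, linepos):
        super().__init__()
        self.linepos = linepos
        self.found = []

    def getcharpos(self):
        # getpos() returns (line number starting at 1, column starting at 0)
        lineno, offset = self.getpos()
        return self.linepos[lineno - 1] + offset

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.found.append((tag, self.getcharpos()))

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.found.append((tag, self.getcharpos()))


def find_tags(html):
    parser = PositionParser(line_offsets(html))
    parser.feed(html)
    parser.close()
    return parser.found
```

Calling find_tags(html) gives back a list of (tag, offset) pairs, and html[offset:] starts at the matching "<".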

Now that I have the absolute position of tags in the HTML, I can do all kinds of fun things, like use K-means grouping to find clusters of images. Or maybe I want to see the average distance between occurrences of the word “the” in an article. It’s 276.21 for this one, btw.
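The “average distance” idea is easy to sketch. This is only an illustration on plain text, with a made-up function name (average_gap), and it assumes distance means the gap between the starting offsets of consecutive whole-word matches:

```python
import re


def average_gap(text, word):
    """Mean distance between the start offsets of consecutive matches of word."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    starts = [m.start() for m in pattern.finditer(text)]
    if len(starts) < 2:
        return None  # fewer than two occurrences: no gap to measure
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)
```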

-David

Written by david

December 1st, 2008 at 4:32 pm

Posted in General


The Official FSU Re-fill cup

without comments

Busy times. Been busting ass to keep up with everything this semester - teaching assistantship, classes, thesis, not to mention my new(ish) job at Loud3r. Had a few interesting things happen lately that I felt warranted an update. To enumerate: I hacked some kid’s Facebook account, wrote an HTML scraper to get the latest Naruto Shippüden episodes from Dattebayo, got my code working for my thesis, and did my final edits on my first real academic paper.

Facebook failure:

I was sitting in the class I TA for (along with my cohort Billzebub), sniffing the wifi traffic (like you do), and took a look at the pcap dump that was captured. Amongst the garbage were some request/response headers for Facebook. Being the curious little monkey I am, I fire up Firefox and copy/paste all the cookie information into my session (using the Web Developer 2 extension). I head over to facebook.com and, lo and behold, I am Matt Whatshisname. Full access too, not just a temporary hiccup in the login system. After resisting messing with stuff and/or snooping, I clear my cookies and sit back in awe. Awe at how ridiculous it is that the Facebook login system is so exposed and broken.

I tried to replicate the cookie spoof for some pics for this post, but apparently one or more of the cookies is time-sensitive.

Edit: Hack successfully reproduced! Epic fail!!

http://skitch.com/mumrah/7gcg/his-name-is-robert-paulson

http://skitch.com/mumrah/7gcj/full-access


More Naruto Shenanigans:

No need to waste time on how I did it, here’s a link that pretty much explains it all. Naruto Shippüden XML feed.

Thesis:

After going back and forth with my professor for weeks not getting anywhere, I sit him down and start at the beginning and force him to work through all the details with me. 4 hours later we have some functioning code. Obligatory photos to follow.

The above picture demonstrates the orthogonality and normalization of the eigenvectors (meaning we finally have the partition function correct as well as the normalization criterion). The following three pictures are just the first three eigenvectors in reverse order because WordPress won’t let me edit the gallery.

Looking forward to the break.

-David

Written by david

December 1st, 2008 at 3:25 pm

Posted in General, School


1d Fokker-Planck equation

without comments

As promised, I bring pretty pictures. The past few days I’ve been working on a solution to the 1d diffusion equation with a drift term, better known as the Fokker-Planck equation.

The 1d Fokker-Planck equation
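For reference, the standard one-dimensional form, assuming a constant diffusion coefficient D and a drift term μ(x) (symbol names chosen here, not necessarily the ones in the original figure), is:

```latex
\frac{\partial p(x,t)}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[\mu(x)\,p(x,t)\bigr]
    + D\,\frac{\partial^{2} p(x,t)}{\partial x^{2}}
```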

Sexy, I know. Anyhow, I finally worked out the Python code to get it rolling (literally!). The test system I did has periodic boundary conditions and an initial condition of a sharply-peaked Gaussian (a = 20). I’ll spare the details and jump to the fun part.

Here’s the Python code that made it happen (scipy and matplotlib required).
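For a feel of what such a solver involves, here is a minimal sketch, not the script linked above. It uses plain Python instead of scipy/matplotlib, a simple explicit finite-difference scheme, and a constant drift and diffusion coefficient. All of those are assumptions, as is reading “a = 20” as the initial condition exp(-a (x - x0)^2). The function names and default parameters are made up for illustration.

```python
import math


def fokker_planck_step(p, drift, diffusion, dx, dt):
    """One explicit Euler step of dp/dt = -drift * dp/dx + diffusion * d2p/dx2."""
    n = len(p)
    new = [0.0] * n
    for i in range(n):
        left = p[i - 1]          # index -1 wraps around: periodic boundary
        right = p[(i + 1) % n]   # wrap on the right edge as well
        advection = -drift * (right - left) / (2.0 * dx)
        spreading = diffusion * (right - 2.0 * p[i] + left) / dx ** 2
        new[i] = p[i] + dt * (advection + spreading)
    return new


def evolve(n=200, length=1.0, drift=1.0, diffusion=0.01, a=20.0,
           dt=5e-4, steps=400):
    """Evolve a normalized Gaussian on a periodic grid; returns (xs, initial, final)."""
    dx = length / n
    xs = [i * dx for i in range(n)]
    p = [math.exp(-a * (x - 0.5 * length) ** 2) for x in xs]
    norm = sum(p) * dx
    p = [v / norm for v in p]    # total probability is 1
    initial = list(p)
    for _ in range(steps):
        p = fokker_planck_step(p, drift, diffusion, dx, dt)
    return xs, initial, p
```

The defaults keep diffusion * dt / dx**2 at 0.2, which is inside the stable range for this explicit scheme; a larger time step or a finer grid would need rechecking. The peak drifts to the right while spreading out, and total probability stays at 1 because whatever leaves one edge re-enters at the other.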

-David

Written by david

October 10th, 2008 at 4:45 am

Posted in School


Naruto easy player

without comments

Well. Damn. Not much to update about my thesis. Been working on a 1d solution to the diffusion equation with a drift potential (see Fokker-Planck). When I get some pretty pictures for that I’ll post them.

In the meantime, I made an awesome time-saving app for watching Naruto. Took me about 5 mins (granted I’ve been doing a lot of PHP programming lately). Goes like this:

  • cURL request to MySpaceTV for “Naruto Episode ###”
  • uses preg_match to find the first result
  • strips out the video id and pastes it into a flash embed
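The last two steps above can be sketched in a few lines. This is Python rather than the PHP the app was written in, it works on HTML that has already been downloaded, and the markup is hypothetical: the videoid=... link format and the player URL are placeholders, not the real MySpaceTV ones.

```python
import re

# Hypothetical markup: assumes each search result links to "...videoid=<digits>".
VIDEO_ID_PATTERN = re.compile(r"videoid=(\d+)", re.IGNORECASE)


def first_video_id(search_html):
    """Return the id of the first search result, or None if nothing matches."""
    match = VIDEO_ID_PATTERN.search(search_html)
    return match.group(1) if match else None


def embed_markup(video_id, player_url="http://example.com/player.swf"):
    """Build a flash embed tag; player_url is a placeholder, not a real player."""
    return ('<embed src="%s?id=%s" type="application/x-shockwave-flash"></embed>'
            % (player_url, video_id))
```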

Check out the source or watch an episode yourself!

-David

Written by david

October 9th, 2008 at 3:59 am

Posted in General


Afternoon decafe

without comments

I can’t believe I’m actually doing work this afternoon instead of playing Spore. Oh well, I’m already up to Civilization Phase (only took 12 hours). The wife is at Starbucks off campus studying with her cohorts (with shitty T-Mobile wifi), so I went to the Starbucks on campus (with awesome free campus Wifi).

Earlier this week, I promised that I was going to keep everyone updated with how my research is going and what I’m doing. So, here we are.

As previously mentioned, this first paper I’m writing is a “Why and When” sort of paper. We talk about three popular methods for doing MCMC simulations and which is best under what circumstances. The three methods are traditional Molecular Dynamics, Multiensemble, and Hamiltonian Replica Exchange Molecular Dynamics. The method I’m researching is the last one, hREMD (love the title). You can tell it’s a recent method cause the name is really long and convoluted. I’m pretty sure all the good names in MCMC were taken by the mid 90s. I won’t go into full detail on each method here (trying not to lose anyone’s interest), just know they exist.

The basic gist of the paper follows. You can break down the computational resources of a simulation into two parts: the equilibration phase and the production phase. When doing MCMC, you must let your system fully equilibrate before you can start sampling data. To see why, suppose in 2D you stick a particle in a box and let it move around a tiny bit each iteration. For the next several iterations, the position of the particle is going to be correlated with the starting point. This is called configuration bias (or startup bias, et al.). The following figure is the autocorrelation of some MC time series data (the thickness is from the error bars).

A visual analogy follows: Suppose you take a swatch of color, magenta. If you break it down into 3 color channels (red, green, blue), the corresponding hex code would be something around #A03. Call this our starting point.

Iteration 1 (#AA0033)

Now let’s make up an update rule for our Markov chain. Each iteration, we pick a color channel and shift it by some amount x, where x is a random integer in [-1, 1]. After 100 iterations, we have moved around in the 3d color space (where each channel is a dimension), and have ended up at #953.

Iteration 100 (#995533)

Hmm. Not much has changed. Let’s look at 10000 iterations.

Iteration 10000 (#222255)

Ok, that’s better. What I’m trying to demonstrate is that when you do a stochastic simulation like this, the starting point is going to bias what the system does for the first several iterations. You need to let the system run for a very long time in order for your current state to have no “history” of the first state. The plot above shows the correlation of the system as it goes through time. Notice that at the beginning the correlation is very high (in fact at time 0 a normalized autocorrelation is exactly 1, its maximum, since the series is being compared with itself). The reasoning for this is the same as for why Iteration 1 and Iteration 100 of our color simulation are very similar.
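For anyone who wants to compute that kind of curve from their own time series, here is a minimal sketch of a normalized autocorrelation estimate. It is plain Python, the function name is made up, and it is not the code used for the figure. It uses the common “biased” estimator (divide by the full series length) and assumes the series is not constant.

```python
def autocorrelation(series, max_lag):
    """Normalized autocorrelation of series for lags 0..max_lag (max_lag < len)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered) / n  # must be non-zero
    result = []
    for lag in range(max_lag + 1):
        # Average product of the series with a copy of itself shifted by lag
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        result.append(cov / variance)
    return result
```

How quickly the returned values fall toward zero is a rough measure of how long the chain remembers where it has been.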

That said, my paper talks about how long it takes each of the 3 different methods to reach equilibrium - when the current state has lost all memory of the original state. I think I’ll make a little demo of the color thing.
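Here is a minimal sketch of what the color walk could look like, following the update rule described above. The function names are made up, and clamping each channel to 0..255 is an assumption, since the rule does not say what happens at the edges of the color cube.

```python
import random


def step(color, rng):
    """Shift one randomly chosen channel by -1, 0 or +1."""
    channel = rng.randrange(3)
    shift = rng.randint(-1, 1)
    new = list(color)
    # Clamp to the valid byte range (an assumption, see above)
    new[channel] = min(255, max(0, new[channel] + shift))
    return tuple(new)


def walk(start, iterations, seed=None):
    """Run the chain for a number of iterations and return the final color."""
    rng = random.Random(seed)
    color = start
    for _ in range(iterations):
        color = step(color, rng)
    return color


def to_hex(color):
    return "#%02X%02X%02X" % color
```

Something like to_hex(walk((0xAA, 0x00, 0x33), 100)) stays close to the starting magenta, because 100 steps of size one can only move the color 100 units in total; it takes many more iterations before the start is forgotten.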

-David

Written by david

September 7th, 2008 at 2:26 pm

Posted in School


So much for this weekend

without comments

I consented to a verbal NDA at the game store, so I can’t give any details, but I got Spore a day early. Native OS X support… *drool*. Consequently, this will be the first time I’ll have used my dvd drive on my new laptop. Lulz.

I literally jumped in the air when the guy handed me the bag and prompted me to vacate the premises. Screenshots after the break.

-Break-

Intro Movie

My Creature

Run!!

Yum, algae

On land

Making babies

-David

Written by david

September 6th, 2008 at 9:42 pm

Posted in General


My first real paper

without comments

My prof is putting me down as primary author on a paper we’re working on. Or rather, I’m putting my professor down as a corresponding author on a paper I’m writing. heh. The topic is comparing the efficiency/effectiveness of Replica Exchange Molecular Dynamics and Multiensemble methods (obligatory wiki links), and when it’s best to use each method. The funny thing about statistical mechanics (and a lot of science in general) is that the concepts are fundamentally simple, but the literature is so obfuscated with jargon and assumptions that hardly anyone can understand it. I mean, shit, I hardly follow half of what I read - and now I’m supposed to be writing it.

My generation of grad students is coming from the first batch of kids who grew up with the internet, and really the first generation of Wikipedia. As such, I’m going to try a new type of research dogma that attempts to make my research available (and accessible) to anyone. This type of transparent research is becoming more common, and I hope to see more of it.

Here’s a quick rundown of my goals for this experiment:

  • Provide all of my publications and projects freely (source too)
  • Keep the language deflated, no jargon
  • Document my methods, keep the research process transparent
  • Contribute info (not necessarily new research) back into Wikipedia
  • Not get caught by my committee for giving away research ^_^

Hopefully, by the time I finish my thesis I will have enough content here for anyone (idiots excluded) to somewhat understand what it’s all about. Hope you’re all ready - hope I’m ready.

Edit: Loving my macbook.

Written by david

September 4th, 2008 at 9:33 pm

Posted in School


Thesis and First iMpressions

without comments

This semester is all about writing. My professor wants to get two papers out this fall. Luckily, however, these papers will be chapters 2 and 3 of my thesis. Huzzah.

I got a Macbook Pro last week (from an undisclosed source), so now I fit in with the other grad students. The adjustment to OS X has been relatively painless (coming from Ubuntu). My first inclination was to ditch OS X completely and load Linux, but I’ve been persuaded by the Apple Demons (evil and benign) to give Mac a chance. I’ve had to do a lot of customization to get the terminal anywhere near the functionality of gnome-terminal. In fact, I ditched the default terminal for a project called iTerm. Indeed, I will miss gnome-terminal.

The multi touch is incredible. The hardware is incredible.

I’ve quit my crappy job (see Access Nightmares), and taken an awesome job (see How Happy). This is going to be a busy few months. Here we go.

-David

Written by david

September 4th, 2008 at 2:17 pm

Posted in School


Old posts

without comments

In a tragic episode of miscommunication and server migration, I have managed to lose my once beloved WordPress blog (http://enja.org/david). I did, however, by the magic of the Wayback Machine uncover an archive from 2007. All of the important posts are there (the AJAX articles) and the other diatribes remain as well.

http://web.archive.org/web/20070105031325rn_1/www.enja.org/david/

Mad props to Archive.org.

-David

Written by david

July 2nd, 2008 at 12:25 am

Posted in General, archives


From the Archives: Ajax IE Caching Issue - Recap

without comments

Author’s Note:
Here is another article from my long lost blog. Ajaxian still links in here, as well as many other personal blogs. See my previous post for background.

-David

Post:
In light of the comments on the original post, I thought it would be nice to reiterate some other people’s workarounds.

This was the first one brought to my attention, and actually is how I work around the caching these days. Basically, you add a superfluous variable to your string of GET vars that is unique - a timestamp works fine. I actually go a little bit further and use milliseconds, since the app I’m working on makes several requests per second.

servlet/imagemaker.jsp?foo=bar&goo=car
Change to:
servlet/imagemaker.jsp?foo=bar&goo=car&time=111111111

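Here is a function to uncache the URL:

function uncache(url){
    // Milliseconds since the epoch: unique enough to defeat the cache
    var time = new Date().getTime();
    // Start the query string if the URL does not have one yet
    var sep = url.indexOf('?') === -1 ? '?' : '&';
    return url + sep + 'time=' + time;
}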

The second workaround is about as close to a valid solution as there is. It switches the request to POST and sends an arbitrary variable in the body (via the send method).

http.open('post', 'myfile.php');
http.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
http.onreadystatechange = handleResponse;
http.send('var=1');

A null or empty string for the send method will cause IE to cache.

My only recommendation for that second solution would be to change the POST var to send('ie=teh_suck');

Kudos to contributors.

Comment Thread:

Sujai Said…

What about xml files. I am trying to call xml using ajax. I tried POST its not working

  • September 14th, 2006 at 11:57 pm
Steve Said…

Thanks for this, I had already spent far too long trying to work out why ie was caching the records (thought it may have been a bug in ie7). For those that are interested below is a script i used the above suggestion with, it uses a php loop to change the field which it will output the fetched data to.
cheers

function getContent(){
    var page='./query.php';
    var selected=document.getElementById('selected').value;

    if (selected==''){selected='none';}

    //params has to have following format
    //i.e.: c=1&id=3….

    //Clear our fetching variable
    var xmlhttp=false;

    //Try to create active x object
    try {
        xmlhttp = new ActiveXObject('Msxml2.XMLHTTP');
    } catch (e) {
        try {
            xmlhttp = new ActiveXObject('Microsoft.XMLHTTP');
        } catch (E) {
            xmlhttp = false;
        }
    }

    if (!xmlhttp && typeof XMLHttpRequest!='undefined') {
        alert('Error');
    }

    var rowNum='';

    xmlhttp.onreadystatechange = function(){
        if(xmlhttp.readyState==4 && xmlhttp.status== 200) {
            document.getElementById('qty').innerHTML = xmlhttp.responseText;
        }
    }

    params='selected='+ selected +'&rownum='+ rowNum +'&rnd='+ Math.random()*99999;

    xmlhttp.open('POST', './query.php', false);
    xmlhttp.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
    xmlhttp.send(params);
}

  • October 25th, 2006 at 6:56 pm
Chris Said…

Glad I have found this article, bookmarking for work =D

We plan on doing an entirely new website (hopefully I can get them to separate html/css/back-end) this time around as well as get some ajax in there to make life a bit easier!

  • November 14th, 2006 at 7:41 pm
Mark Said…

Excellent. Thanks for this.

Been banging our heads against the proverbial Brick 2.0 Walls trying to work out why our Ajax looping updates don’t update in IE7.

  • December 13th, 2006 at 1:42 pm
Gareth Said…

That worked a treat.

I’m still a bit annoyed that I have to do this for every request. Such is the life we lead.

  • December 18th, 2006 at 3:43 pm
Gene Said…

Wow, thank you VERY much. I’ve been having this problem for some time now. This little line “req.send('var=1');” really helped me out. Thanks VERY VERY much.

  • December 20th, 2006 at 6:56 pm

Written by david

July 2nd, 2008 at 12:12 am

Posted in archives
