Archive for the ‘Computing’ Category
1080p content on your PS3
New toys bring new adventures. My awesome wife got me a PS3 for my birthday recently and I’ve been tinkering around with getting some non-Bluray HD content to play on it. My initial attempts to stream stuff over my network proved unsatisfying. Since the PS3 is only capable of 802.11g, I gave up on streaming the high bitrate stuff (it’s perfectly capable of streaming DVD quality content at 1.5~2.5 Mb/s).
Software used (on Mac OS X 10.5):
- tsMuxer
- Disk Utility
- newfs_udf
- hdiutil
Hardware used:
- Macbook Pro
- Blank CD/DVD Media
- PS3
Files used:
- MKV file with AC3 audio stream and H264 video stream
Attempt 1 (successful): Load the MKV with tsMuxerGUI, select M2TS muxing. If the video level is above 4.1, lower it to 4.1 (the PS3 cannot handle anything higher than H264 level 4.1). Generate the m2ts and meta file, burn them both to a CD/DVD. This will be readable by the PS3 as a data disk – it will not autoplay, but you can access it and play it. To me, this is not an ideal solution as it does not support menus, chapters, or seeking.
Attempt 2 (unsuccessful): Same deal as before, but select “AVCHD disk”. This option will create a BD friendly file structure (folders named BDMV and CERTIFICATE). The trick here is to burn the disk as UDF 2.5 (this is not super easy on OS X or Linux). I wasted many CDs trying regular ISO9660. Following the instructions here, you must create the image and format it as UDF 2.5.
dd if=/dev/zero of=myfile.img bs=1k count=716800
newfs_udf -r 2.5 myfile.img -v volume_label
hdiutil mount -nobrowse myfile.img
cp -R /path/to/avchd-files/ /Volumes/volume_label/
hdiutil unmount /Volumes/volume_label
In plain-speak, create an empty (large) image, format it to UDF 2.5, mount it, copy the BD-compatible files into the volume, and unmount. You then use Disk Utility to burn the resulting image. No success here, but at this point I was so close, I could taste it.
Attempt 3 (successful): Identical procedure as Attempt 2, with one important exception – the image you create with dd must be sized in whole multiples of 1GB – that is, with bs=1k, count = N*1024*1024.
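To make the sizing rule concrete, here is a small Python sketch of the arithmetic. The helper name dd_count_for_gib is made up for illustration; it only computes the count value to hand to dd for an image of N GB (taking 1GB as 1024*1024*1024 bytes, which is what count = N*1024*1024 with bs=1k works out to).

```python
def dd_count_for_gib(n_gib, block_size_bytes=1024):
    """Return the dd `count` that yields an image of exactly n_gib GiB.

    Hypothetical helper: block_size_bytes=1024 matches `bs=1k` in the
    dd command used in the post.
    """
    if n_gib < 1:
        raise ValueError("image must be at least 1 GiB")
    total_bytes = n_gib * 1024 ** 3
    return total_bytes // block_size_bytes


# Print the dd invocation for a 1 GiB and a 5 GiB image
for n in (1, 5):
    print(f"dd if=/dev/zero of=myfile.img bs=1k count={dd_count_for_gib(n)}")
```

Pick the smallest N that fits the AVCHD folder structure and still fits on the blank media.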
A few things to try next:
- Menus
- DTS audio stream
- Subtitles
Anyone wanting to test this out with a super high quality 1080p rip, I recommend Big Buck Bunny. You’ll need Handbrake to convert it to a compatible container format if you download the AVI (tsMuxer doesn’t like AVIs).
Weekend Project – CloudCached
A friend and I have been bouncing around the idea of a caching system that runs on Amazon’s cloud for a while now. Basically something like memcached, but without the (very real) limitations of physical memory or the need for a whole server. Sure, it’s hard to beat the speed of memory-level read access, but I think the appeal of a distributed, limitless cache might outweigh the slowdown.
Idea
Provide an interface for storing/retrieving serialized data on S3
Pretty simple idea, pretty simple implementation. Thanks to the S3 interface provided by Boto, things were a lot easier. I’m going to keep this open source under the MIT license. You can check out the code in the GitHub repository – please feel free to fork, improve, submit, etc.
Overview
A quick walkthrough of the code will reveal truly how simple this is. The Client class provides basic CRUD methods for interfacing with S3: put, get, update, delete. The put and update methods store a timestamp as the “expires” header for the file to keep track of cache expiration. Also these two methods write a “type” header to the meta-data so CloudCached knows how to de-serialize the file.
class Client:
    "Here's the class schema (method bodies omitted)"
    def get(self, key): pass
    def put(self, key, value, time_to_expire=3600, replace=False): pass
    def update(self, key, value, time_to_expire=3600): pass
    def delete(self, key): pass
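To show how the expiration timestamp might drive these four methods, here is a minimal sketch in which a plain dict stands in for the S3 bucket. It is not the CloudCached code: the class name DictBackedClient, the injectable clock, and the behaviour of replace and of expired entries are all assumptions made for illustration.

```python
import time


class DictBackedClient:
    """Hypothetical stand-in: a dict plays the role of the S3 bucket."""

    def __init__(self, clock=time.time):
        self._store = {}     # key -> (value, expires_at)
        self._clock = clock  # injectable so expiry can be exercised in tests

    def put(self, key, value, time_to_expire=3600, replace=False):
        # Assumption: put refuses to overwrite a live entry unless replace=True
        if not replace and self.get(key) is not None:
            return False
        self._store[key] = (value, self._clock() + time_to_expire)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Assumption: an expired entry counts as a miss and is dropped
            del self._store[key]
            return None
        return value

    def update(self, key, value, time_to_expire=3600):
        # Update is modelled as an unconditional overwrite with a fresh expiry
        return self.put(key, value, time_to_expire, replace=True)

    def delete(self, key):
        # True if something was removed, False if the key was absent
        return self._store.pop(key, None) is not None
```

In the real thing the tuple would be replaced by an S3 object plus its “expires” and “type” metadata, and every call would cost a network round trip.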
Any bit of Python data is serialized as one of the following basic data types: basestring (for str and unicode), int (for int and long), complex, float, and other. The other data type represents anything that is not a base type in Python. These “other” types get pickled while everything else just gets str’d.
The put method checks the md5sum to make sure everything went through cleanly (maybe a bit costly, but worth it in my opinion). cPickle is used in favor of pickle for obvious reasons (it’s much faster).
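The type tagging and the checksum can be sketched roughly as follows. This is an illustration under assumptions, not the project’s code: it targets Python 3 (so pickle instead of cPickle, and str/int instead of basestring/long), the tag names are invented, and the checksum helper relies on S3 returning the hex MD5 of the body as the ETag for a simple, non-multipart PUT.

```python
import hashlib
import pickle

# Hypothetical tag names for the "type" metadata header
_SIMPLE = {str: "str", int: "int", float: "float", complex: "complex"}
_PARSERS = {"str": str, "int": int, "float": float, "complex": complex}


def serialize(value):
    """Return (type_tag, payload_bytes) for a Python value."""
    tag = _SIMPLE.get(type(value))
    if tag is None:
        # Anything that is not an exact base type gets pickled
        return "other", pickle.dumps(value)
    # Base types are simply str()'d
    return tag, str(value).encode("utf-8")


def deserialize(tag, payload):
    """Invert serialize() using the stored type tag."""
    if tag == "other":
        # Only unpickle data you wrote yourself
        return pickle.loads(payload)
    return _PARSERS[tag](payload.decode("utf-8"))


def upload_is_clean(payload, etag):
    """Compare the local MD5 with the ETag from a simple (non-multipart) PUT."""
    return hashlib.md5(payload).hexdigest() == etag.strip('"').lower()
```

Note that the lookup uses the exact type, so a bool (a subclass of int) falls through to “other” and is pickled rather than being stored as the string "True".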
Results
Some very early tests show that this might just be usable.
CloudCached Benchmarks (10 runs)
--------------------------------------------------------
Test | Average (s) | Total (s)
--------------------------------------------------------
GET integer | 0.0283360004425 | 0.283360004425
GET string (32 byte) | 0.0315794944763 | 0.315794944763
GET string (512KB) | 0.1265994787220 | 1.265994787220
PUT integer | 0.0650457143784 | 0.650457143784
PUT string (32 byte) | 0.0563205003738 | 0.563205003738
PUT string (512KB) | 0.1773290872570 | 1.773290872570
--------------------------------------------------------
Advantages
- Highly distributed. S3 data is distributed across multiple availability zones and could therefore be utilized by an application running across multiple availability zones.
- No size limit. Unlike the physical limitations of a memcached machine (or cluster of memcached machines), S3 does not have limits on the number of files (caches) you can store. Also, with S3, you can write files from 1 byte to 5 GB (although I think a 5GB cache file would defeat the purpose).
- Parallel read access. If applicable to the application, cache reads can be largely parallelized which could potentially give linear speedup to the cache loading.
- No server necessary. Since the application is reading and writing directly to S3, there is no need for a “cache server”. This could lead to a great deal of savings for people running multiple memcached machines. Memcached servers typically have a large memory capacity which means a m1.xlarge or c1.xlarge EC2 instance (assuming it’s running in EC2).
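As a rough illustration of the parallel read point, cache reads that are dominated by network latency can be fanned out over a thread pool. The sketch below is an assumption-laden example, not part of CloudCached: get_many is a made-up helper, and a dict lookup stands in for the network-bound get call. Whether the speedup is anywhere near linear depends on the network and on S3 itself.

```python
from concurrent.futures import ThreadPoolExecutor


def get_many(fetch, keys, max_workers=8):
    """Call fetch(key) for every key concurrently; return {key: value}."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so values line up with keys
        values = list(pool.map(fetch, keys))
    return dict(zip(keys, values))


# Hypothetical usage: a dict lookup stands in for a network-bound cache read
fake_bucket = {"user:1": "alice", "user:2": "bob"}
print(get_many(fake_bucket.get, ["user:1", "user:2", "user:3"]))
```

Threads are enough here because the work is waiting on I/O rather than burning CPU.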
Considerations
It’s going to be hard to beat the speed of memcached. As far as speed is concerned, I’m using built-in Python stuff including urllib, httplib, xml.sax, etc (all of which are used by Boto). It might be worthwhile to write a C implementation of the S3 communication methods (but maybe not). The most costly part of this code aside from network communication is probably the serialization, and since cPickle is used there is not really any improvement to be made there.
It might be cool to couple the meta-data with SimpleDB.
I registered cloudcached.com in case this gains some momentum. I will post updates and benchmarks there as they arrive.
-David
First (real) MPI run on EC2
After a few days of tinkering with EC2MPI, I spent some time polishing up a stat mech MPI simulation. The code in question is a 2d Ising model simulation using Replica Exchange. Right now it stands at around 400 lines of C++ using STL vectors (which I love). Once I know it works (or at least works well enough) I might post it up here, but for now I’m just trying to generate pretty hysteresis plots and observe the critical behavior of a 2d Ising model system. Here’s a picture with points on it.

Energy per spin plotted against magnetization
I leave the interpretation to you. The best part of this is that I can do these MPI runs without burning a hole in my lap (the MacBook gets rather warm).
-David
Time Machine In Your Pocket – Addendum
Addendum to two previous posts.
The other day, I noticed my 8GB USB volume that I use for temporary incremental backups was quite full. Curious, since the folders I back up to that volume do not total but 200MB or so, and rsync was supposed to be doing incremental backups (link-dest ftw).
After a little searching around, I found someone who had a similar problem (and a solution). When you format a volume with OS X it will, by default, ignore file ownership (the linked article explores why this might be). This proves to be a problem for rsync, which considers file permissions and ownership as part of the file stat (as it should). Luckily the fix is easy – “Get Info” for the volume in question, then at the bottom unselect “Ignore ownership on this volume”.
You will probably want to delete any backups that have been created (since they won’t have the correct file ownership). Source: Terminalapp.net
MPI running on Amazon EC2
For my Master’s thesis, I’m going to be running a lot of MPI code, and naturally I need a place to run it. Let me first say that my university has an excellent high-performance computing center run by one of my committee chairs that is more than capable of serving my needs – and yet, I am unfulfilled. With our scheduling system, there is a “backfill” that is always available for running small jobs (like the ones I run), but for my thesis, I want to test the massive scalability of an algorithm (Replica Exchange). When I say massive, I mean massive – think 1000 compute nodes or more.
Big ideas, people.
In order to satisfy my need for a massively parallel platform, I looked no further than Amazon EC2. As should be apparent from many of my previous posts, I have been doing a lot of work with Amazon’s cloud services – for both school and work.
A few weeks ago, I started an MIT-licensed open source project on GitHub aptly named EC2MPI. Today I made a major step forward with this project which was the motivation for this post. I finally have everything configured properly and got my first no-hassle MPI cluster up and running.
The script I wrote (EC2MPI) is written in Python and presents an interactive prompt to the user. You select the architecture (i386 or x64) and the number of instances, and there is also support for user-defined SSH keypairs (not AWS keypairs) for cluster security. The instances are spawned, and EC2MPI sets up the SSH keys as well as the MPI configuration. It is so freaking sweet.
I wanted to share some issues I’ve had so far while developing this and how I solved them.
Intra-EC2 communication – For this, I needed each instance to be able to talk to one another for point-to-point as well as collective communication. My solution for this was to allow the user to generate SSH keypairs which were stored in a private S3 bucket (owned by the user). My user-data script sent to the instances took care of downloading and installing the keys upon startup.
Shared storage among instances – In order to run MPI code, the nodes in the cluster need access to a shared storage volume which will contain the binary files compiled with MPI. Since EC2 has no shared storage (for now), I had to find an alternate solution. The solution I settled on was to use s3fs: a fuse-based filesystem which allows you to mount an S3 bucket as a volume. Reading and writing to the shared volume is pretty slow (unless it’s cached), so for certain kinds of code this might not be ideal. However, I believe it is the best solution for now. I imagine one day Amazon will add a feature to the Elastic Block Store volumes that allows them to act as shared volumes.
Starting up and tearing down clusters – I used Amazon SimpleDB to keep meta-data about the cluster: how many instances are in the cluster, internal/external IP addresses, etc. This is also how I define the master node and worker nodes. This will allow me to add features such as adding and removing instances from a cluster without having to tear the whole thing down. Also I did all startup config with a user-data script so the script does not have to log into each instance upon startup. This allows cluster startup to scale well.
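To give a feel for how cluster meta-data like this can feed the MPI configuration, here is a hedged sketch. It is not taken from EC2MPI: the record layout (role, internal_ip), the function name build_machinefile, and the idea of listing the master first are all assumptions, and plain dicts stand in for whatever SimpleDB returns. The output is the usual one-host-per-line file that MPI launchers accept as a host list.

```python
def build_machinefile(nodes):
    """Return machinefile text: one internal IP per line, master first.

    `nodes` is a hypothetical list of dicts with "role" and "internal_ip"
    keys, standing in for records read from the cluster meta-data store.
    """
    masters = [n for n in nodes if n["role"] == "master"]
    if len(masters) != 1:
        raise ValueError("expected exactly one master node")
    workers = sorted(
        (n for n in nodes if n["role"] == "worker"),
        key=lambda n: n["internal_ip"],
    )
    return "\n".join(n["internal_ip"] for n in masters + workers) + "\n"


# Hypothetical three-node cluster
cluster = [
    {"role": "worker", "internal_ip": "10.0.0.12"},
    {"role": "master", "internal_ip": "10.0.0.10"},
    {"role": "worker", "internal_ip": "10.0.0.11"},
]
print(build_machinefile(cluster), end="")
```

Because the host list is derived from the meta-data rather than hard-coded, adding or removing an instance only means updating the records and regenerating the file.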
Check back soon for some benchmarks and more detailed write-ups as the project progresses. First, I need to get my maximum number of instances increased (right now I can do 20 max). Fast times ahead, friends.
-David
Managing multiple AWS accounts
On my personal computer, I have three sets of x509 certificates/private keys. This makes using the EC2-API-tools quite the hassle. Echoes of EC2_CERT and EC2_PRIVATE_KEY haunt my dreams.
So, like you do with this sort of thing, I wrote a bash script to work some magic.
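#!/bin/bash
# Source this file rather than executing it, so the exports persist in your shell
account="$1"
if [ -z "$account" ]; then
    echo "Choose Account:"
    read account
fi
# Look up the cert/key base name (first column of the README) for this account
base=$(grep -i -- "$account" ~/.ec2/README | awk '{print $1}')
if [ -z "$base" ]; then
    echo "Sorry, that account does not exist"
    return 1
fi
# Use $HOME here: a quoted ~ is not expanded by the shell
export EC2_CERT="$HOME/.ec2/cert-$base.pem"
export EC2_PRIVATE_KEY="$HOME/.ec2/pk-$base.pem"
echo "EC2 environment updated"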
Requires that you keep your private keys/certs in ~/.ec2, and that they are named cert-{something}.pem and pk-{something}.pem. Also, you need a README file in ~/.ec2 that looks like
something account1
something-else account2
I set up an alias so I just run “ec2-account personal” to switch to my personal credentials, and “ec2-account work” to switch to my work account.
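For anyone who would rather do the lookup outside of bash, the README format above is easy to parse. The following is only a sketch with invented helper names (find_base, credential_paths); unlike the grep in the script, it matches the account name exactly (case-insensitively) against the second column.

```python
import os


def find_base(readme_text, account):
    """Return the cert/key base name for `account`, or None if absent."""
    wanted = account.lower()
    for line in readme_text.splitlines():
        fields = line.split()
        # Column 1 is the file base name, column 2 is the account name
        if len(fields) >= 2 and fields[1].lower() == wanted:
            return fields[0]
    return None


def credential_paths(base, ec2_dir="~/.ec2"):
    """Build the EC2_CERT / EC2_PRIVATE_KEY values for a base name."""
    root = os.path.expanduser(ec2_dir)
    return {
        "EC2_CERT": os.path.join(root, f"cert-{base}.pem"),
        "EC2_PRIVATE_KEY": os.path.join(root, f"pk-{base}.pem"),
    }


readme = "something account1\nsomething-else account2\n"
print(credential_paths(find_base(readme, "account2")))
```

A Python process cannot change the environment of the shell that launched it, so this would still need a small shell wrapper to export the values.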
-David
Funded!
Amazon issued me 300 dollars in EC2 credits to support my Master’s project. Very exciting.
If you’re a university researcher, student, or professor, visit http://aws.amazon.com/education for more information. One of my professors talked to me about giving a seminar on cloud computing in the fall. I believe these types of grants are issued for that sort of thing as well.
Totally putting this on my CV.
Serve gzipped content from Amazon S3
Set the “Content-encoding” header to “gzip”. Really, it’s that easy.
Kthxbye.
Well, since you came all this way, I’ll give a little more detail. First, make a file.
Now gzip it.
Upload it.
Find a utility that can modify file headers on S3: S3Hub (OS X), Cloudberry S3 Explorer (Windows), or any of the various 3rd party libraries.
Set the Content-type header to whatever the appropriate content type is: text/plain, text/css, text/javascript, image/jpeg, etc.
Set the Content-encoding to gzip.
Pat yourself on the back.
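The gzip and header steps can also be scripted. The sketch below uses only the Python standard library and stops short of the upload itself, since that depends on whichever S3 tool or library you use; prepare_gzipped_upload is a made-up helper name, and the fallback content type is an assumption.

```python
import gzip
import mimetypes


def prepare_gzipped_upload(filename, data):
    """Gzip `data` and return (body, headers) ready to hand to an S3 client."""
    body = gzip.compress(data)
    # Guess the content type from the original (uncompressed) file name
    content_type, _ = mimetypes.guess_type(filename)
    headers = {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Encoding": "gzip",
    }
    return body, headers


original = b"hello gzip " * 100
body, headers = prepare_gzipped_upload("gziptest.txt", original)
print(len(original), "bytes ->", len(body), "bytes", headers)
```

The two headers are what tell the browser to decompress the body and how to treat the result, whatever the object’s key happens to be called.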
Here’s three versions of a text file I made and gzipped. Note that with appropriate headers, file extensions don’t mean squat.
- http://mumrah-dot-net.s3.amazonaws.com/gziptest.txt.gz
- http://mumrah-dot-net.s3.amazonaws.com/gziptest.txt
- http://mumrah-dot-net.s3.amazonaws.com/gziptest
Go ahead and download one – you’ll see that the file is actually gzipped and your browser is doing the decompressing on the fly. This is the same effect produced by mod_deflate in Apache.
-David
Updates, Upgrades, and Migrates
New Server, new WordPress install. I must say, the export/import feature in WordPress is very slick. I’ve been using it since well before v1.0, and it has come a long way.
The motivation for the upgrade came with a server migration I’m in the middle of. I’m in the process of starting up a consulting company for Amazon Web Services, and decided it would be rather obscene if I didn’t at least host my blog on EC2. So here we are – in the cloud. It’s kinda cold, and wet.
A web server on EC2, you ask. But what about the htdocs, and virt-host files? We need persistent storage! I created two EBS volumes (both formatted as XFS): one for MySQL data stores and Apache config, and another for /home. I decided to put all of the htdocs in /home (along with users’ public_html) instead of the traditional /var/www. It was easier than creating a volume for /var as well.
So we have a full LAMP stack running on a small EC2 instance, costing us the same as our machine at ServerBeach. The main difference being we now have a development environment within AWS making things much easier to test and deploy.
Here’s a sad-face icon I made for my growl-notification when/if my instance goes down
-David
Time Machine In Your Pocket – Part 2
After a little tinkering here, a little tinkering there, I’ve finally settled on a good solution for my portable backup drive (8GB usb thumb drive). As outlined in my previous post, I wanted a portable backup solution that could do incremental backups (like Apple’s TimeMachine does). I looked, of course, to the wonderful unix utility rsync. Here’s my latest version.
#!/bin/bash -x
DEST="/Volumes/PNY8GB/Backups"
LATEST="Latest"
EXCLUDES_FILE="$HOME/.rsyncexcludes"
FILES_FROM="$HOME/.rsyncfiles"
RSYNC="/usr/bin/rsync --max-size 10m"

# Make sure user is root
if (( `id -u` != 0 )); then { echo "Sorry, must be root. Exiting..."; exit; } fi;

# Make sure backup device is attached
! test -d "$DEST" && echo "Please mount the backup drive!" && exit

# Run rsync
DATE=`date "+%Y-%m-%d-%H%M%S"`
n=`$RSYNC -r -a -x -S -R --stats --delete --link-dest=$DEST/$LATEST \
  --exclude-from $EXCLUDES_FILE --files-from $FILES_FROM $* $HOME \
  $DEST/$DATE | sed -n 's/Number of files transferred: \([^0]\)/\1/p'`

# Update 'Latest' link
rm $DEST/$LATEST
ln -s $DEST/$DATE $DEST/$LATEST

# Send a growl notification
if [ $n ]
then
  /usr/local/bin/growlnotify -m 'rsync complete, number of files: '$n
fi
By using --exclude-from and --files-from, you get more fine-grained control of what gets backed up. My Code folder is ~1GB, and my School folder is about 3GB. When I exclude all of my compiled code, data files, images, .git and .svn folders, and other various annoying swap files my base backup footprint is less than 500MB (for both Code and School).
Here’s my excludes file – it’s just one line per exclude filter
*.sql
*.bak
*.swp
.svn
*.pyc
*.log
*.tar.gz
*.dvi
*.o
*.out
*.d
*.tmp
.git
Similarly, the files-from file is one file path per line (remember the trailing slash!). An important note (found in the rsync manual) is that when you specify --files-from, -r is no longer implied with -a. So make sure to add -r to your argument list.
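The reason the --link-dest snapshots stay small is that unchanged files are hard-linked to the copy in the previous backup instead of being stored again. The sketch below demonstrates that mechanism in isolation with Python’s os.link; it is an illustration in a temporary directory with made-up file names, not a replacement for rsync.

```python
import os
import tempfile


def snapshot_with_hard_link(src_path, dest_path):
    """Hard-link dest to src, as rsync --link-dest does for unchanged files."""
    os.link(src_path, dest_path)
    return os.stat(dest_path)


with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "backup-1.txt")   # file in the older snapshot
    second = os.path.join(root, "backup-2.txt")  # same file in the new snapshot
    with open(first, "w") as fh:
        fh.write("unchanged file contents\n")
    info = snapshot_with_hard_link(first, second)
    # Both names point at the same inode, so the data exists on disk only once
    same_inode = os.stat(first).st_ino == info.st_ino
    print("same inode:", same_inode, "link count:", info.st_nlink)
```

If rsync decides a file has changed (for example because its ownership looks different), it writes a fresh copy instead of a link, and the snapshot grows by the full file size.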
And yet again, I leave the scheduling to you.
-David


