Null Disquisition

Python, AWS, Grad School, and your face

MPI running on Amazon EC2

Amazon Web Services

For my Master’s thesis, I’m going to be running a lot of MPI code, and naturally I need a place to run it. Let me first say that my university has an excellent high-performance computing center, run by one of my committee chairs, that is more than capable of serving my needs. And yet, I am unfulfilled. With our scheduling system, there is a “backfill” that is always available for running small jobs (like the ones I run), but for my thesis I want to test the massive scalability of an algorithm (Replica Exchange). When I say massive, I mean massive: think 1000 compute nodes or more.

Big ideas, people.

To satisfy my need for a massively parallel platform, I looked no further than Amazon EC2. As should be apparent from many of my previous posts, I have been doing a lot of work with Amazon’s cloud services, for both school and work.

A few weeks ago, I started an MIT-licensed open source project on GitHub, aptly named EC2MPI. Today I took a major step forward with the project, which is the motivation for this post: I finally have everything configured properly and got my first no-hassle MPI cluster up and running.

EC2MPI is written in Python and presents an interactive prompt to the user. You select the architecture (i386 or x64) and the number of instances, and there is also support for user-defined SSH keypairs (not AWS keypairs) for cluster security. The instances are spawned, and EC2MPI sets up the SSH keys as well as the MPI configuration. It is so freaking sweet.

I wanted to share some issues I’ve had so far while developing this and how I solved them.

Intra-EC2 communication – The instances need to be able to talk to one another for point-to-point as well as collective communication. My solution was to let the user generate SSH keypairs, which are stored in a private S3 bucket (owned by the user). The user-data script sent to the instances takes care of downloading and installing the keys upon startup.
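To make the key-installation step concrete, here is a minimal sketch of how a user-data script like that could be generated. This is not EC2MPI's actual code: the function name `build_user_data`, the `ubuntu` login user, the `<key>` / `<key>.pub` object layout in the bucket, and the use of the `aws s3 cp` command (which assumes the AWS CLI and credentials are already on the image) are all assumptions made for illustration.

```python
import shlex


def build_user_data(bucket: str, key_name: str, user: str = "ubuntu") -> str:
    """Return a boot-time shell script that installs a shared SSH keypair.

    Assumes (hypothetically) that the private key is stored in the bucket as
    <key_name> and the public key as <key_name>.pub.
    """
    ssh_dir = f"/home/{user}/.ssh"
    private_src = shlex.quote(f"s3://{bucket}/{key_name}")
    public_src = shlex.quote(f"s3://{bucket}/{key_name}.pub")
    lines = [
        "#!/bin/bash",
        "set -e",
        f"mkdir -p {ssh_dir}",
        # Download both halves of the keypair from the private bucket.
        f"aws s3 cp {private_src} {ssh_dir}/id_rsa",
        f"aws s3 cp {public_src} {ssh_dir}/id_rsa.pub",
        # Every node trusts the same key, so any node can SSH to any other.
        f"cat {ssh_dir}/id_rsa.pub >> {ssh_dir}/authorized_keys",
        # sshd refuses keys with loose permissions.
        f"chmod 700 {ssh_dir}",
        f"chmod 600 {ssh_dir}/id_rsa {ssh_dir}/authorized_keys",
        f"chown -R {user}:{user} {ssh_dir}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_user_data("my-private-bucket", "cluster-key"))
```

The returned string is what would be passed as the instance's user data at launch, so every node installs the same key without anyone logging in.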

Shared storage among instances – In order to run MPI code, the nodes in the cluster need access to a shared storage volume containing the compiled MPI binaries. Since EC2 has no shared storage (for now), I had to find an alternative. I settled on s3fs, a FUSE-based filesystem that lets you mount an S3 bucket as a volume. Reading from and writing to the shared volume is pretty slow (unless it’s cached), so for certain kinds of code this might not be ideal. However, I believe it is the best solution for now. I imagine one day Amazon will add a feature to Elastic Block Store volumes that allows them to act as shared volumes.
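As a rough illustration of the mount step, the sketch below builds an s3fs command line with an optional local cache directory. The helper name `s3fs_mount_command`, the bucket name, and the paths are made up for the example. The `passwd_file` and `use_cache` options are taken from current s3fs-fuse, and the version used for this project may have accepted different options.

```python
import shlex


def s3fs_mount_command(bucket, mountpoint, cache_dir=None, passwd_file=None):
    """Build the argv list for mounting an S3 bucket with s3fs.

    Nothing is executed here; the caller decides how to run the command
    (for example from a user-data script or via subprocess.run).
    """
    cmd = ["s3fs", bucket, mountpoint]
    if passwd_file:
        # File holding the access key and secret key for the bucket.
        cmd += ["-o", f"passwd_file={passwd_file}"]
    if cache_dir:
        # Local disk cache, which helps with repeated reads of the same binary.
        cmd += ["-o", f"use_cache={cache_dir}"]
    return cmd


if __name__ == "__main__":
    argv = s3fs_mount_command("my-mpi-bucket", "/mnt/shared", cache_dir="/tmp/s3fs")
    print(" ".join(shlex.quote(part) for part in argv))
```

Keeping the command as a list rather than a single string avoids quoting problems if a path ever contains spaces.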

Starting up and tearing down clusters – I used Amazon SimpleDB to keep metadata about the cluster: how many instances are in the cluster, internal/external IP addresses, etc. This is also how I define the master node and worker nodes. It will allow me to add features such as adding and removing instances from a cluster without having to tear the whole thing down. I also did all startup configuration with a user-data script, so EC2MPI does not have to log into each instance upon startup. This allows cluster startup to scale well.
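The bookkeeping described here can be sketched with a plain dictionary standing in for SimpleDB (one "domain" per cluster, one item per instance, string attributes). Everything below is hypothetical rather than EC2MPI's real schema: the attribute names, the rule that the first registered node becomes the master, and the one-host-per-line machinefile are assumptions chosen to keep the example small.

```python
def add_node(store, cluster, instance_id, private_ip, public_ip):
    """Record an instance; the first node registered becomes the master."""
    items = store.setdefault(cluster, {})
    has_master = any(attrs["role"] == "master" for attrs in items.values())
    role = "worker" if has_master else "master"
    items[instance_id] = {
        "private_ip": private_ip,
        "public_ip": public_ip,
        "role": role,
    }
    return role


def remove_node(store, cluster, instance_id):
    """Drop one instance from the cluster without touching the others."""
    store.get(cluster, {}).pop(instance_id, None)


def machinefile(store, cluster):
    """List private IPs, one per line, with the master first."""
    items = store.get(cluster, {})
    ordered = sorted(
        items.items(),
        key=lambda kv: (kv[1]["role"] != "master", kv[0]),
    )
    return "".join(attrs["private_ip"] + "\n" for _, attrs in ordered)


if __name__ == "__main__":
    store = {}
    add_node(store, "thesis", "i-0001", "10.0.0.5", "54.0.0.5")
    add_node(store, "thesis", "i-0002", "10.0.0.6", "54.0.0.6")
    print(machinefile(store, "thesis"), end="")
```

Because each instance is its own item, growing or shrinking the cluster is a single add or delete, and the machinefile can be regenerated from the stored metadata at any time.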

Check back soon for some benchmarks and more detailed write-ups as the project progresses. First, I need to get my maximum number of instances increased (right now I can do 20 max). Fast times ahead, friends.

-David

Written by david

May 30th, 2009 at 7:40 pm
