Null Disquisition

Python, AWS, Grad School, and your face

Weekend Project – CloudCached

A friend and I have been bouncing around the idea of a caching system that runs on Amazon’s cloud for a while now. Basically something like memcached, but without the (very real) limitations of physical memory or the need for a whole server. Sure, it’s hard to beat the speed of memory-level read access, but I think the appeal of a distributed, limitless cache might outweigh the slowdown.

Idea

Provide an interface for storing/retrieving serialized data on S3

Pretty simple idea, pretty simple implementation. Thanks to the S3 interface provided by Boto, things were a lot easier. I’m going to keep this open source under the MIT license. You can check out the code in the GitHub repository – please feel free to fork, improve, submit, etc.

Overview

A quick walkthrough of the code will reveal just how simple this is. The Client class provides basic CRUD methods for interfacing with S3: put, get, update, delete. The put and update methods store a timestamp as the “expires” header for the file to keep track of cache expiration. These two methods also write a “type” header to the meta-data so CloudCached knows how to de-serialize the file.

class Client:
    """Here's the class schema (method bodies omitted)."""
    def get(self, key): ...
    def put(self, key, value, time_to_expire=3600, replace=False): ...
    def update(self, key, value, time_to_expire=3600): ...
    def delete(self, key): ...
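To make the expiration idea concrete, here is a minimal sketch of the same four methods backed by a plain dict instead of S3. This is not the CloudCached code: the InMemoryClient name, the injectable clock, and the choices to treat an expired entry as a miss and to refuse overwrites unless replace=True are all my assumptions for illustration.

```python
import time


class InMemoryClient:
    """Dict-backed stand-in for the S3-backed Client (illustration only)."""

    def __init__(self, clock=time.time):
        self._store = {}      # key -> (value, expires_at); stands in for a bucket
        self._clock = clock   # injectable so expiry can be tested without sleeping

    def put(self, key, value, time_to_expire=3600, replace=False):
        # Assumption: refuse to overwrite a live entry unless replace=True.
        if not replace and self.get(key) is not None:
            return False
        self._store[key] = (value, self._clock() + time_to_expire)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Assumption: an expired entry is dropped and treated as a miss.
            del self._store[key]
            return None
        return value

    def update(self, key, value, time_to_expire=3600):
        # Same as put, but always overwrites.
        return self.put(key, value, time_to_expire, replace=True)

    def delete(self, key):
        # Returns True if something was actually removed.
        return self._store.pop(key, None) is not None
```

Note that this sketch uses None as the "miss" marker, so it cannot distinguish a cached None from a missing key.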

The code uses a small set of basic data types for serializing any bit of Python data: basestring (for str and unicode), int (for int and long), complex, float, and other. The other data type represents anything that is not one of these base types in Python. These “other” types get pickled while everything else just gets str’d.
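The type-tag scheme can be sketched roughly like this. It is written for Python 3, so str stands in for basestring and pickle for cPickle; the function names, the tag strings, and the use of repr instead of str (so floats round-trip exactly) are my assumptions, not the actual CloudCached code.

```python
import pickle


def serialize(value):
    """Return (type_tag, payload_bytes) for a Python value."""
    # bool is a subclass of int, so route it to pickle to preserve its type.
    if isinstance(value, bool):
        return "other", pickle.dumps(value)
    if isinstance(value, str):
        return "str", value.encode("utf-8")
    for tag, cls in (("int", int), ("float", float), ("complex", complex)):
        if isinstance(value, cls):
            return tag, repr(value).encode("ascii")
    # Anything that is not a base type gets pickled.
    return "other", pickle.dumps(value)


_DECODERS = {
    "str": lambda b: b.decode("utf-8"),
    "int": lambda b: int(b.decode("ascii")),
    "float": lambda b: float(b.decode("ascii")),
    "complex": lambda b: complex(b.decode("ascii")),
}


def deserialize(type_tag, payload):
    """Invert serialize() using the stored type tag."""
    decoder = _DECODERS.get(type_tag)
    if decoder is None:
        return pickle.loads(payload)
    return decoder(payload)
```

One caveat worth keeping in mind: unpickling data that someone else could have written to the bucket is unsafe, so this only makes sense for a bucket you control.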

The put method checks the md5sum to make sure everything went through cleanly (maybe a bit costly, but worth it in my opinion). cPickle is used in favor of pickle for obvious reasons (it’s much faster).
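The md5 check can be illustrated without touching the network. For a simple single-part PUT (no multipart upload, no KMS encryption), S3 returns an ETag that is the hex MD5 of the body wrapped in double quotes, so the comparison is roughly the following. The helper names are hypothetical, and this is a sketch of the idea rather than what Boto does internally.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Hex MD5 digest of the bytes that were sent."""
    return hashlib.md5(payload).hexdigest()


def upload_is_intact(payload: bytes, etag: str) -> bool:
    # S3 returns the ETag wrapped in double quotes, so strip them first.
    return etag.strip('"').lower() == md5_hex(payload)
```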

Results

Some very early tests show that this might just be usable.

    CloudCached Benchmarks (10 runs)
    --------------------------------------------------------
    Test                 |  Average (s)     | Total (s)
    --------------------------------------------------------
    GET integer          |  0.0283360004425 | 0.283360004425
    GET string (32 byte) |  0.0315794944763 | 0.315794944763
    GET string (512KB)   |  0.1265994787220 | 1.265994787220
    PUT integer          |  0.0650457143784 | 0.650457143784
    PUT string (32 byte) |  0.0563205003738 | 0.563205003738
    PUT string (512KB)   |  0.1773290872570 | 1.773290872570
    --------------------------------------------------------
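For anyone who wants to reproduce numbers like these, a timing harness along the following lines would do. This is not the script that produced the table; the benchmark function and its signature are assumptions, and the callable you pass in would be something like a lambda wrapping a get or put call.

```python
import time


def benchmark(func, runs=10):
    """Call func() `runs` times; return (average_seconds, total_seconds)."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / runs, total
```

With a client in hand, a row of the table would come from something like benchmark(lambda: client.get("some-key")), where client and the key are whatever you are testing.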

Advantages

  • Highly distributed. S3 data is distributed across multiple availability zones and could therefore be utilized by an application running across multiple availability zones.
  • No size limit. Unlike the physical limitations of a memcached machine (or cluster of memcached machines), S3 does not have limits on the number of files (caches) you can store. Also, with S3, you can write files from 1 byte to 5 GB (although I think a 5GB cache file would defeat the purpose).
  • Parallel read access. If applicable to the application, cache reads can be largely parallelized which could potentially give linear speedup to the cache loading.
  • No server necessary. Since the application is reading and writing directly to S3, there is no need for a “cache server”. This could lead to a great deal of savings for people running multiple memcached machines. Memcached servers typically have a large memory capacity which means an m1.xlarge or c1.xlarge EC2 instance (assuming it’s running in EC2).
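The parallel read point could look something like this in practice. It is a sketch that assumes a get callable which is safe to call from several threads at once (I have not verified that for the CloudCached client); get_many and max_workers are names I made up.

```python
from concurrent.futures import ThreadPoolExecutor


def get_many(get, keys, max_workers=8):
    """Fetch several keys concurrently; returns {key: value}."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its value.
        return dict(zip(keys, pool.map(get, keys)))
```

Since each read is mostly waiting on the network, threads are enough here; whether the speedup is anywhere near linear depends on your connection and on S3.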

Considerations

It’s going to be hard to beat the speed of memcached. As far as speed is concerned, I’m using built-in Python stuff including urllib, httplib, xml.sax, etc (all of which are used by Boto). It might be worthwhile to write a C implementation of the S3 communication methods (but maybe not). The most costly part of this code aside from network communication is probably the serialization, and since cPickle is used there is not really any improvement to be made there.

It might be cool to couple the meta-data with SimpleDB.

I registered cloudcached.com in case this gains some momentum. I will post updates and benchmarks there as they arrive.

-David

Written by david

June 20th, 2009 at 4:34 pm

Posted in Amazon Web Services, python

Serve gzipped content from Amazon S3

Set the “Content-encoding” header to “gzip”. Really, it’s that easy.

Kthxbye.

Well, since you came all this way, I’ll give a little more detail. First, make a file.

Now gzip it.

Upload it.

Find a utility that can modify file headers on S3: S3Hub (OS X), Cloudberry S3 Explorer (Windows), or any of the various 3rd party libraries.

Set the Content-type header to whatever the appropriate content type is: text/plain, text/css, text/javascript, image/jpeg, etc.

Set the Content-encoding to gzip.

Pat yourself on the back.
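If you would rather script the steps above than click through a GUI, here is a rough sketch in Python. It assumes boto3 is installed and AWS credentials are configured; the function names are mine, and the bucket and key are whatever you pass in. The gzip part runs on its own, and the upload only happens if you call upload_gzipped.

```python
import gzip


def gzip_bytes(data: bytes) -> bytes:
    """Compress raw bytes into gzip format."""
    return gzip.compress(data)


def upload_gzipped(bucket: str, key: str, data: bytes, content_type: str) -> None:
    """Upload gzipped data with the two headers that matter."""
    # Assumption: boto3 is installed and credentials are configured.
    import boto3

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=gzip_bytes(data),
        ContentType=content_type,   # e.g. "text/plain" or "text/css"
        ContentEncoding="gzip",     # tells the browser to decompress
    )
```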

Here are three versions of a text file I made and gzipped. Note that with appropriate headers, file extensions don’t mean squat.

  1. http://mumrah-dot-net.s3.amazonaws.com/gziptest.txt.gz
  2. http://mumrah-dot-net.s3.amazonaws.com/gziptest.txt
  3. http://mumrah-dot-net.s3.amazonaws.com/gziptest

Go ahead and download one – you’ll see that the file is actually gzipped and your browser is decompressing it on the fly. This is the same effect produced by mod_deflate in Apache.

-David

Written by david

May 5th, 2009 at 12:15 am