Null Disquisition

Lots of talk about nothing

MongoDC - Afterthoughts

Cool conference, and clearly lots of smart people are working on this product. I think they are trying to do too much too soon (geospatial indexing, full text, etc.). I'd like to see some of the core pieces bolstered before features like this. Maybe they can work on lowering mongoDB's pH and getting closer to ACID.

Filed under  //   Databases   NoSQL   mongoDB  

MongoDC - ACID

So, what about command isolation? "We don't do isolation". What about command-level atomicity? Nope, only document-level atomic updates.

How do you guarantee consistency with replica sets in a sharded environment? Only read from the master.

And we all know single-server durability is not yet possible.

ACID fail

Filed under  //   Databases   NoSQL   mongoDB  

MongoDC - First update

Some good talks here at MongoDC. Starting to realize that MongoDB isn't really a DBMS at all, but rather a structured document store with flexible indexing and querying.

Filed under  //   Databases   NoSQL   mongoDB  

MongoDC - eCommerce

Spoke with the CTO of Totsy about the semantics of their eCommerce app. They use a document per inventory item rather than keeping a "count" attribute for each inventory. This approach reduces resource contention on inventory documents from atomic document updates, but increases the number of documents and the amount of redundant data.

Seems the theme of the day is: redundant data.

Filed under  //   Databases   NoSQL   mongoDB  

JSON Encoding mongoDB Documents in Python

One thing that kept puzzling me about pymongo was that I couldn’t serialize a document as JSON. Aren’t these things just fancy JSON objects on the backend? Well, they are – but ObjectId is part of mongoDB’s extension of JavaScript, so there is no JSON equivalent. And since Python’s json module only knows about the standard JSON spec, it won’t know what to do with ObjectId instances. When attempting to encode a Python dictionary which has an ObjectId as one of its values, I get a TypeError saying ObjectId “is not JSON serializable”.

My solution is to extend the JSONEncoder included in Python’s json module (in 2.6 or later)

It adds a special case to handle encoding an ObjectId into a literal “ObjectId” in the encoded JSON.
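For reference, here is a minimal sketch of what such an encoder might look like. It is not the original code from this post: it assumes a modern pymongo, where ObjectId lives in bson.objectid, and it falls back to a small stand-in class (purely hypothetical, just so the sketch runs without pymongo installed). It also emits the ObjectId as a quoted JSON string, which keeps the output valid JSON, rather than as a bare literal.

```python
import json

try:
    # Assumption: modern pymongo, which ships ObjectId in the bson package.
    from bson.objectid import ObjectId
except ImportError:
    # Hypothetical stand-in so the sketch runs without pymongo installed.
    class ObjectId(object):
        def __init__(self, oid="4c4f4f5e2554c813e4000001"):
            self._oid = oid

        def __str__(self):
            return self._oid


class MongoEncoder(json.JSONEncoder):
    """JSONEncoder that also knows how to encode ObjectId values."""

    def default(self, o):
        # default() is only called for objects json cannot encode by itself.
        if isinstance(o, ObjectId):
            return 'ObjectId("%s")' % str(o)
        # Anything else falls through to the stock TypeError.
        return json.JSONEncoder.default(self, o)


doc = {"a": 1, "b": "foo", "c": ObjectId("4c4f4f5e2554c813e4000001")}
encoded = json.dumps(doc, cls=MongoEncoder, sort_keys=True)
print(encoded)
```

Because the value is a plain string on the JSON side, whoever decodes it has to know to turn it back into an ObjectId.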

Custom JSON encoders can be used when calling json.dump or json.dumps by specifying cls:

json.dumps(obj, cls=MongoEncoder)

E.g.,

>>> import json
>>> from pymongo.objectid import ObjectId
>>> from mongoencoder import MongoEncoder
>>> x = {'a':1,'b':"foo",'c':ObjectId()}
>>> print x
{'a': 1, 'c': ObjectId('4c4f4f5e2554c813e4000001'), 'b': 'foo'}
>>> print json.dumps(x)
Traceback (most recent call last):
[...]
TypeError: ObjectId('4c4f4f5e2554c813e4000001') 
  is not JSON serializable
>>> print json.dumps(x, cls=MongoEncoder)
{"a": 1, "c": ObjectId("4c4f4f5e2554c813e4000001"), "b": "foo"}

Voilà! Enjoy.

Filed under  //   NoSQL   mongoDB   pymongo  
Posted July 27, 2010

Server-side Document Dereferencing in mongoDB

Seems like no one can agree on the best way to structure documents in mongoDB. The consensus seems to be: do what works for you. The nice folks at 10gen offer some guidance on laying out your documents, and they seem to sit in the camp of “redundant data over references”. Redundancy over references is fine for some things, but it can be a real pain in the ass for certain situations. E.g., if your would-be embedded documents are updated frequently, you’re talking about a ridiculous amount of effort to make all the right changes in the right places. Write situations like this make me nervous about data consistency, but that’s another story.

For a little prototype I was working on this past week, I was using ObjectIds to reference documents instead of going the embedded document route. One big disadvantage of this approach with mongoDB is that there is no capacity for JOIN-like operations (it’s part of their NoSQL philosophy). I think this is somewhat bullshit, so I took it upon myself to find a workaround. The goal: get some super basic JOIN-like functionality that I can use from a client library (such as pymongo).

Let’s begin. Suppose I’ve got a document class Person that looks like

{
    _id : ObjectId(...),
    name : "string",
    school: ObjectId(...)
}

And a document class School that looks like

{
    _id : ObjectId(...),
    name : "string",
}

With the reference document approach (called DBRef by the 10gen folks), the dereferencing takes place on the client side, meaning a call back to the server for each document that needs dereferencing. That’s a lot of churn on the wire just for a little bit of data. My solution was to do the dereferencing on the database server using JavaScript and db.eval().

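// Build a one-argument function suitable for cursor.map():
// it replaces doc[field] with the referenced document from `collection`.
var deref = function (field, collection) {
    // C-C-C-Closure!!
    return function (doc) {
        return _deref(doc, field, collection);
    };
};

// Swap the ObjectId stored in doc[field] for the document it points to.
var _deref = function (doc, field, col) {
    var oid = ObjectId(doc[field]);
    // findOne returns null when no document has this _id
    doc[field] = db[col].findOne({_id: oid});
    return doc;
};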

Once you have saved these functions on the server, you can use them in MapReduce, $where, or db.eval() calls. Here’s an example call using pymongo (the collection name is “people”):

>>> db.eval("db.people.find({}).map(deref('school','people'))")

Now instead of an ObjectId as the ‘school’ field, you get the document whose ‘_id’ is that ObjectId. The deref function takes in a field and collection so it knows which field contains the reference Id and where it should look for that document. N.B., calling map on a cursor will unroll that cursor (so use skip() and limit() accordingly). Also, db.eval calls will block (though I don’t think it should be problematic since findOne is cheap).
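To make the closure trick easier to see outside the database, here is a rough Python rendering of the same idea. Everything in it is a stand-in: the "collections" are plain dicts keyed by _id, the ids are short made-up strings rather than ObjectIds, and no server is involved. Unlike the JavaScript version, it copies each document instead of modifying it in place.

```python
def deref(field, collection):
    """Return a function that swaps doc[field] for the referenced document."""
    def _deref(doc):
        resolved = dict(doc)  # shallow copy so the input document is untouched
        # dict.get returns None for a dangling reference, much like findOne
        # returning null when nothing matches.
        resolved[field] = collection.get(doc[field])
        return resolved
    return _deref


# Hypothetical in-memory data: a dict keyed by _id plays the role of a collection.
schools = {"s1": {"_id": "s1", "name": "State U"}}
people = [
    {"_id": "p1", "name": "Ann", "school": "s1"},
    {"_id": "p2", "name": "Bob", "school": "s1"},
]

# Same shape as cursor.map(deref(...)) in the shell.
resolved = list(map(deref("school", schools), people))
print(resolved[0]["school"]["name"])
```

The point of returning a function from deref is the same as in the shell version: map only hands over one argument (the document), so the field and collection have to be captured ahead of time.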

The code as a Gist: http://gist.github.com/477121

Filed under  //   Databases   NoSQL   mongoDB   pymongo  
Posted July 15, 2010