Null Disquisition

Lots of talk about nothing

MongoDC - Afterthoughts

Cool conference, clearly lots of smart people working on this product. I think they are trying to do too much too soon (geospatial indexing, full text, etc.). I'd like to see some of the core pieces bolstered before features like this. Maybe they can work on lowering mongoDB's pH and getting closer to ACID.

Filed under  //   Databases   NoSQL   mongoDB  

MongoDC - ACID

So, what about command isolation? "We don't do isolation". What about command level atomicity? Nope, only document level atomic updates.

How do you guarantee consistency with replica sets in a sharded environment? Only read from the master.

And we all know single server durability is not yet possible.

ACID fail

Filed under  //   Databases   NoSQL   mongoDB  

MongoDC - First update

Some good talks here at MongoDC. Starting to realize that MongoDB isn't really a DBMS at all, but a structured document store with flexible indexing and querying.

Filed under  //   Databases   NoSQL   mongoDB  

MongoDC - eCommerce

Spoke with the CTO of Totsy about the semantics of their eCommerce app. They use a document per inventory item rather than keeping a "count" attribute for each inventory. This approach reduces resource contention on inventory documents from atomic document updates, but increases the number of documents and the amount of redundant data.
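To make the trade-off concrete, here is a rough sketch of the two layouts using plain Python dicts in place of real collections. Everything in it is hypothetical (the field names, the "sku-123" identifiers, the helper functions); it is not Totsy's actual schema, just an illustration of where the contention and the redundancy come from.

```python
# Layout A: one document per product, with a "count" attribute.
# Every sale has to update this same document.
counted = {"_id": "sku-123", "name": "onesie", "count": 3}

# Layout B: one document per unit in stock.
# The shared fields ("sku", "name") are repeated in every document.
units = [
    {"_id": "sku-123-%d" % i, "sku": "sku-123", "name": "onesie", "sold": False}
    for i in range(3)
]


def sell_counted(doc):
    """Decrement the shared counter; all buyers contend for one document."""
    if doc["count"] <= 0:
        return False
    doc["count"] -= 1
    return True


def sell_unit(docs, sku):
    """Mark the first unsold unit as sold; each sale touches a different document."""
    for doc in docs:
        if doc["sku"] == sku and not doc["sold"]:
            doc["sold"] = True
            return doc["_id"]
    return None  # out of stock


print(sell_counted(counted), counted["count"])
print(sell_unit(units, "sku-123"))
```

In layout B the product name lives in three places instead of one, which is exactly the redundancy being traded for less contention.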

Seems the theme of the day is: redundant data

Filed under  //   Databases   NoSQL   mongoDB  

JSON Encoding mongoDB Documents in Python

One thing that kept puzzling me about pymongo was that I couldn’t serialize a document as JSON. Aren’t these things just fancy JSON objects on the backend? Well, they are, but ObjectId is one of mongoDB’s extensions to JSON (it is a BSON type), so there is no JSON equivalent. And since Python only knows about the standard JSON spec, it won’t know what to do with ObjectId instances. When attempting to encode a Python dictionary which has an ObjectId as one of its values, I get a TypeError saying ObjectId “is not JSON serializable”.

My solution is to extend the JSONEncoder included in Python’s json module (2.6 or later). The subclass adds a special case that encodes an ObjectId as a literal “ObjectId” in the encoded JSON.
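Here is a minimal sketch of what such an encoder can look like. Two assumptions to flag: the ObjectId class below is a hypothetical stand-in so the snippet runs without pymongo installed (real code would import pymongo's own ObjectId), and this variant emits the {"$oid": ...} form used by MongoDB's extended JSON rather than a bare ObjectId(...) literal, so the output stays parseable by any JSON reader.

```python
import json


class ObjectId(object):
    """Hypothetical stand-in for pymongo's ObjectId, for illustration only."""

    def __init__(self, oid):
        self._oid = oid

    def __str__(self):
        return self._oid


class MongoEncoder(json.JSONEncoder):
    """JSONEncoder that knows how to serialize ObjectId values."""

    def default(self, o):
        # default() is only called for objects json cannot encode natively.
        if isinstance(o, ObjectId):
            return {"$oid": str(o)}
        # Anything else still raises the usual TypeError.
        return json.JSONEncoder.default(self, o)


doc = {"a": 1, "b": "foo", "c": ObjectId("4c4f4f5e2554c813e4000001")}
encoded = json.dumps(doc, cls=MongoEncoder, sort_keys=True)
print(encoded)
# {"a": 1, "b": "foo", "c": {"$oid": "4c4f4f5e2554c813e4000001"}}
```

Overriding default() is the hook the json module provides for this; the encoder falls back to the parent class for every type it does not recognize.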

Custom JSON encoders can be used when calling json.dump or json.dumps by specifying cls:

json.dumps(obj, cls=MongoEncoder)

E.g.,

>>> import json
>>> from pymongo.objectid import ObjectId
>>> from mongoencoder import MongoEncoder
>>> x = {'a':1,'b':"foo",'c':ObjectId()}
>>> print x
{'a': 1, 'c': ObjectId('4c4f4f5e2554c813e4000001'), 'b': 'foo'}
>>> print json.dumps(x)
Traceback (most recent call last):
[...]
TypeError: ObjectId('4c4f4f5e2554c813e4000001') is not JSON serializable
>>> print json.dumps(x, cls=MongoEncoder)
{"a": 1, "c": ObjectId("4c4f4f5e2554c813e4000001"), "b": "foo"}

Voilà! Enjoy.

Filed under  //   NoSQL   mongoDB   pymongo  
Posted July 27, 2010

Server-side Document Dereferencing in mongoDB

Seems like no one can agree on the best way to structure documents in mongoDB. The consensus seems to be: do what works for you. The nice folks at 10gen offer some guidance on laying out your documents, and they seem to sit in the camp of “redundant data over references”. Redundancy over references is fine for some things, but it can be a real pain in the ass for certain situations. E.g., if your would-be embedded documents are updated frequently, you’re talking about a ridiculous amount of effort to make all the right changes in the right places. Write situations like this make me nervous about data consistency, but that’s another story.

For a little prototype I was working on this past week, I was using ObjectIds to reference documents instead of going the embedded document route. One big disadvantage of this approach with mongoDB is that there is no capacity for JOIN-like operations (it’s part of their NoSQL philosophy). I think this is somewhat bullshit, so I took it upon myself to find a workaround. The goal: get some super basic JOIN-like functionality that I can use from a client library (such as pymongo).

Let’s begin. Suppose I’ve got a document class Person that looks like

{
    _id : ObjectId(...),
    name : "string",
    school: ObjectId(...)
}

And document class School that looks like

{
    _id : ObjectId(...),
    name : "string",
}

With the reference document approach (called DBRef by the 10gen folks), the dereferencing takes place on the client side, meaning a call back to the server for each document that needs dereferencing. That’s a lot of churn on the wire just for a little bit of data. My solution was to do the dereferencing on the database using JavaScript and db.eval().

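// Build a mapper that dereferences `field` against `collection`.
var deref = function (field, collection) {
    // C-C-C-Closure!!
    return function (doc) {
        return _deref(doc, field, collection);
    };
};
// Swap the ObjectId stored in doc[field] for the document it points to.
var _deref = function (doc, field, col) {
    var oid = ObjectId(doc[field]);
    // findOne returns null when nothing in `col` has this _id
    doc[field] = db[col].findOne({_id: oid});
    return doc;
};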

Once you have saved these functions on the server, you can use them in MapReduce, $where, or db.eval() calls. Here’s an example call using pymongo (the Person documents live in a collection named “people”; I’m assuming here that the School documents live in one named “schools”):

>>> db.eval("db.people.find({}).map(deref('school','schools'))")

Now instead of an ObjectId as the ‘school’ field, you get the document whose ‘_id’ is that ObjectId. The deref function takes in a field and collection so it knows which field contains the reference Id and where it should look for that document. N.B., calling map on a cursor will unroll that cursor (so use skip() and limit() accordingly). Also, db.eval calls will block (though I don’t think it should be problematic since findOne is cheap).

The code as a Gist: http://gist.github.com/477121
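If you want to poke at the closure logic without a server, here is a rough Python rendering of the same idea. It is only a sketch: plain dicts keyed by _id stand in for collections, the sample names and ids are made up, and unlike the JavaScript version it copies the document instead of modifying it in place.

```python
def deref(field, collection):
    """Return a mapper that swaps doc[field] for the document it references."""

    def mapper(doc):
        resolved = dict(doc)  # shallow copy so the stored document is untouched
        # dict.get returns None when nothing matches, much like findOne
        resolved[field] = collection.get(doc[field])
        return resolved

    return mapper


# Hypothetical in-memory "collections" keyed by _id.
schools = {"s1": {"_id": "s1", "name": "Example High"}}
people = [
    {"_id": "p1", "name": "Ann", "school": "s1"},
    {"_id": "p2", "name": "Bob", "school": "s9"},  # dangling reference
]

joined = [deref("school", schools)(person) for person in people]
for person in joined:
    print(person["name"], person["school"])
```

The second person ends up with None for the school, which mirrors what happens on the server when a reference points at a document that does not exist.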

Filed under  //   Databases   NoSQL   mongoDB   pymongo  
Posted July 15, 2010

Schema, Less

Or should that be "Schema Free"? Lately, I've been digging into the features of several NoSQL systems, and each time I read the bullet points I see something to the effect of "schema-free" or "unstructured documents". This is often touted as one of the features that makes document databases so great - you are released from the bonds of relational databases: no more key constraints, no more type checking, you are essentially free to insert whatever the hell you want. And how. On the flip side, one of the shitty things about document databases is that since you are so free, it is very difficult to code in the oh-so-familiar OO paradigm if you don't know what the data looks like. There have been many efforts to mitigate this, including a plethora of frameworks which let you define structured models in your application (with types and everything!) so that the documents are homogeneous in the database. Ok, order is restored.
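For a feel of what those frameworks do under the hood, here is a bare-bones sketch of application-side schema checking. The schema, field names, and validate helper are all made up for illustration; real modeling frameworks do far more than this.

```python
# Hypothetical schema: field name -> required Python type.
PERSON_SCHEMA = {"name": str, "age": int}


def validate(doc, schema):
    """Raise if doc is missing a field or a field has the wrong type."""
    missing = [key for key in schema if key not in doc]
    if missing:
        raise ValueError("missing fields: %s" % ", ".join(sorted(missing)))
    for key, expected in schema.items():
        if not isinstance(doc[key], expected):
            raise TypeError("%s should be %s" % (key, expected.__name__))
    return doc


good = validate({"name": "Ann", "age": 30}, PERSON_SCHEMA)

try:
    validate({"name": "Bob", "age": "thirty"}, PERSON_SCHEMA)
except TypeError as err:
    print("rejected:", err)
```

Every insert now pays for this check in the application, which is the work a relational database would have done for you.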

But hold on a second - did we just move a very pain-in-the-ass, expensive piece of our data flow out of the database and into the application? Oh yes, yes we certainly did. We have overthrown the Monarchy in favor of Anarchy, but then realized we need order and rules or else everything turns to shit. So the People take on the burden of maintaining the State. That's right, document databases lead to Communism.

Analogies aside, inserting documents all willy-nilly is great for write performance (particularly batch loading), but having a database that doesn't allow any kind of constraints on a document, its fields, or its relationships really puts a lot of work on the application. The big question here is: is it worth it? I think so (maybe). Putting this work on the application will certainly slow down the application layer, and at a small scale the net performance will probably be worse. However, document databases are somewhat easier to scale than traditional SQL-based systems. So as the application+database scale out, the net performance should be considerably better than a traditional SQL-backed stack. At least in theory.

A lot of this remains to be seen.

Filed under  //   Databases   NoSQL  
Posted July 15, 2010