It seems like no one can agree on the best way to structure documents in MongoDB. The consensus seems to be: do what works for you. The nice folks at 10gen offer some guidance on laying out your documents, and they seem to sit in the camp of “redundant data over references”. Redundancy over references is fine for some things, but it can be a real pain in the ass in certain situations. E.g., if your would-be embedded documents are updated frequently, every update has to be repeated in every document that carries a copy, which is a ridiculous amount of effort to get all the right changes into all the right places. Write patterns like this make me nervous about data consistency, but that’s another story.
For a little prototype I was working on this past week, I used ObjectIds to reference documents instead of going the embedded document route. One big disadvantage of this approach with MongoDB is that there is no support for JOIN-like operations (it’s part of their NoSQL philosophy). I think this is somewhat bullshit, so I took it upon myself to find a workaround. The goal: get some super basic JOIN-like functionality that I can use from a client library (such as pymongo).
Let’s begin. Suppose I’ve got a document class, Person, that looks like this:
{
    _id : ObjectId(...),
    name : "string",
    school : ObjectId(...)
}
And a document class, School, that looks like this:
{
    _id : ObjectId(...),
    name : "string"
}
With the reference approach (10gen’s docs cover both plain ObjectId links like the one above and the more formal DBRef convention), the dereferencing takes place on the client side, meaning a round trip to the server for each document that needs dereferencing. That’s a lot of churn on the wire just for a little bit of data. My solution was to do the dereferencing on the database using JavaScript and db.eval().
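To make that churn concrete, here is a rough sketch in plain JavaScript (no MongoDB involved) that counts the round trips client-side dereferencing would make. The `findSchool` function, the `roundTrips` counter, and the sample data are all made up for illustration; each call to `findSchool` stands in for one query sent to the server.

```javascript
// Hypothetical in-memory "server": every lookup counts as one round trip.
var roundTrips = 0;
var schools = {
  s1: { _id: "s1", name: "Springfield High" },
  s2: { _id: "s2", name: "Shelbyville Tech" }
};

function findSchool(id) {
  roundTrips += 1;              // one query per dereference
  return schools[id] || null;   // null when the reference dangles
}

// Pretend this array came back from a single find() on the people collection.
roundTrips += 1;
var people = [
  { _id: "p1", name: "Ada", school: "s1" },
  { _id: "p2", name: "Grace", school: "s2" },
  { _id: "p3", name: "Linus", school: "s1" }
];

// Client-side dereferencing: one extra lookup for every person.
var resolved = people.map(function (p) {
  return { _id: p._id, name: p.name, school: findSchool(p.school) };
});

console.log("round trips:", roundTrips);
```

With N people that is N + 1 trips in total. Moving the lookup onto the server, as the functions that follow do, collapses that into a single call from the client.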
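// Returns a mapping function that replaces doc[field] with the
// document it references in the given collection.
var deref = function (field, collection) {
    return function (doc) {
        return _deref(doc, field, collection);
    };
};
// Looks up the referenced document and embeds it in place of the id.
var _deref = function (doc, field, col) {
    var oid = ObjectId(doc[field]);
    delete doc[field];
    doc[field] = db[col].findOne({_id: oid});  // null if nothing matches
    return doc;
};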
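If you want to poke at these two functions without a running server, a minimal sketch like the following works in plain Node. The `ObjectId` stub, the `makeCollection` helper, the sample documents, and the collection name `schools` are assumptions made up for this sketch; they only mimic the small slice of the shell (`db[col].findOne({_id: ...})`) that the functions touch.

```javascript
// Stand-ins for the mongo shell globals (assumptions for this sketch only).
function ObjectId(id) { return String(id); }

function makeCollection(docs) {
  return {
    // Supports only the {_id: ...} query shape used by _deref.
    findOne: function (query) {
      for (var i = 0; i < docs.length; i++) {
        if (docs[i]._id === query._id) return docs[i];
      }
      return null;
    }
  };
}

var db = {
  schools: makeCollection([{ _id: "s1", name: "Springfield High" }])
};

// The two functions from the post, same behaviour.
var _deref = function (doc, field, col) {
  var oid = ObjectId(doc[field]);
  delete doc[field];
  doc[field] = db[col].findOne({ _id: oid });
  return doc;
};
var deref = function (field, collection) {
  return function (doc) { return _deref(doc, field, collection); };
};

var person = { _id: "p1", name: "Ada", school: "s1" };
var joined = deref("school", "schools")(person);

// A reference that points at nothing comes back as null.
var orphan = deref("school", "schools")({ _id: "p2", name: "Bob", school: "nope" });

console.log(JSON.stringify(joined));
console.log(JSON.stringify(orphan));
```

Note that the mapping function modifies the document it is given rather than returning a copy, and that a dangling reference silently becomes null.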
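Once you have saved these functions on the server (stored JavaScript lives in the db.system.js collection), you can use them in MapReduce, $where, or db.eval() calls. Here’s an example call using pymongo (the collection being queried is “people”; the second argument to deref names the collection that holds the referenced documents):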
>>> db.eval("db.people.find({}).map(deref('school','people'))")
Now instead of an ObjectId as the ‘school’ field, you get the document whose ‘_id’ is that ObjectId. The deref function takes a field and a collection so it knows which field contains the reference ID and where it should look for the referenced document. N.B., calling map on a cursor will exhaust that cursor and build the whole result in memory (so use skip() and limit() accordingly). Also, db.eval calls will block (though I don’t think it should be problematic, since a findOne on ‘_id’ uses the default index and is cheap).
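One way to keep the unrolled result small is to page through it. The sketch below is a hypothetical helper, `pagedDerefQuery`, that just builds the JavaScript string you would hand to db.eval(); the helper, its parameter names, and the `schools` collection name are my own assumptions, not part of the gist.

```javascript
// Hypothetical helper: builds the string to pass to db.eval() for one page.
// `page` is zero-based; `pageSize` is the number of documents per page.
function pagedDerefQuery(collection, field, refCollection, page, pageSize) {
  var skip = page * pageSize;
  return "db." + collection + ".find({})" +
    ".skip(" + skip + ")" +
    ".limit(" + pageSize + ")" +
    ".map(deref(" + JSON.stringify(field) + "," +
    JSON.stringify(refCollection) + "))";
}

// Third page of 25 people, with each school reference resolved.
var query = pagedDerefQuery("people", "school", "schools", 2, 25);
console.log(query);
```

The resulting string can be passed to db.eval() exactly like the one-liner above. The helper does no validation, so it should only ever be fed names you control.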
The code as a Gist: http://gist.github.com/477121