

NoSQL Database Comparison

People often ask me to compare the various NoSQL solutions. There are a number of comparisons out there already, but I think that many of them focus on details that are secondary to the concerns of most developers.

There’s a lot of focus out there on mechanisms and implementation details, and not a lot of focus on the abstract guarantees that these databases can provide. I tend to break down the various NoSQL datastores according to the data models exposed to the developer and the invariants that the system can guarantee against those data models.

You can break these down into 4 characteristics of your data store: Data Model, Consistency, Availability, and Partitioning. In this post we’ll look at the various design choices within each category and the implications of each choice.

Full disclosure: I work for 10gen, the makers of MongoDB, and before that I operated a MongoDB cluster in production for about a year. So I have much more in-depth knowledge of MongoDB than of the other datastores discussed in this document. Please point out any inaccurate statements in the content below and I’ll happily correct them.

We’ll start with an overall picture of some of the more popular NoSQL stores and where they fit along these axes. We’ll then dig into each axis and look at the choices and what each means for our application.

NoSQL Datastore Overview

              MongoDB        Cassandra          Riak               HBase
Data Model    Documents      Wide Columns       Key Value          Wide Columns
Consistency   Strong         Eventual / Quorum  Eventual / Quorum  Strong
Availability  Single Master  Multi-Master       Multi-Master       Single Master
Partitioning  Range or Hash  Hash               Hash               Range

Data Model

The data model specifies how your application will format and store data in the database and how you can query and update that data later. Most of the NoSQL data stores fall into one of a few different data model options:

Key Value

Data is modelled as opaque blobs identified by a unique identifier. Key-value stores expose the simplest data model available, since there’s relatively little work for the database to do. Your query language is limited to lookups by primary key, though some vendors have extended their key-value stores with simple secondary indexes (e.g., Riak).
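As a minimal sketch, the entire contract of a pure key-value store boils down to something like the following (the trait and method names here are hypothetical, not any particular vendor’s API):

trait KeyValueStore {
  // Values are opaque blobs; the store never interprets the bytes it holds.
  def put(key: String, value: Array[Byte]): Unit

  // The only query available: lookup by primary key.
  def get(key: String): Option[Array[Byte]]

  def delete(key: String): Unit
}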

Key Features

  • High performance

Wide Column

Data is modelled as a multi-level map comprising a Row Key, Column Family, and Column name. This model was popularized by Google’s BigTable.

Queries must include the row key and can optionally include names of column families and columns to further restrict the query.

This is a big improvement over the key-value API in terms of query flexibility, as we can now filter on column names. However, we still lack the general ability to query on values as we can in traditional SQL queries. This model works well for things like time series data, where you need to store lots of values associated with a stream, but it can be challenging to model business data.
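A rough sketch of the model in Scala terms (the type names are mine, purely illustrative): the store behaves like a nested map keyed by row, family, and column.

// row key -> column family -> column name -> value
type Value   = Array[Byte]
type Columns = Map[String, Value]
type Family  = Map[String, Columns]
type Table   = Map[String, Family]

// A time-series row: one row per sensor, one column per timestamp.
val table: Table = Map(
  "sensor-42" -> Map(
    "readings" -> Map(
      "2011-10-01T00:00Z" -> "20.1".getBytes("UTF-8"),
      "2011-10-01T00:01Z" -> "20.3".getBytes("UTF-8"))))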

Key Features

  • More flexible than key-value
  • Good at storing time series data

Document

Data is modelled as hierarchical documents that consist of name-value pairs nested within each other. This could be JSON, XML, or any other similar syntax, though most of the document-oriented stores that are popular today work off of JSON. Documents benefit from being a closer abstraction to the objects we use in code. You can often take object trees and serialize them as JSON directly into the database, rather than requiring a mapping layer as is typical with relational or wide-column stores.
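For example, with Casbah’s builders (the same MongoDB driver used in the Play tutorial later in this archive), a nested object tree goes straight into the database without a mapping layer. The field names below are made up for illustration:

import com.mongodb.casbah.Imports._

val user = MongoDBObject(
  "name"    -> "Ada",
  "emails"  -> MongoDBList("ada@example.com"),     // arrays nest naturally
  "address" -> MongoDBObject("city" -> "London"))  // and so do sub-documents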

Key Features

  • Maps closer to programming language
  • Easier to query, index

Consistency

Consistency in the context of distributed databases refers to whether two clients trying to perform operations on objects in the database see the same or conflicting views of the world. The more they see the same view, the more “consistent” we say they are. Eric Brewer’s work on the CAP theorem has guided much of the discussion around consistency today, pointing out that when a distributed database experiences failures, it must choose whether it wants to maintain Availability (e.g. clients can still perform operations on objects) or Consistency (e.g. clients see the same version of objects). We find that if we relax our consistency model, we can get stronger availability, and vice versa.

Eventual Consistency

Eventual consistency is a relaxed consistency model typically employed in order to achieve higher availability in the database. In an eventually consistent system, writes to the database are eventually visible to all readers. This means that if I send a write request to the database and immediately try to read, I may not read my own write.

Key Features

  • High availability (as in, no downtime)
  • Might have inconsistent data

Strong Consistency

Strong consistency ensures that if I perform a write operation to the database, other clients are guaranteed to see my write on the very next read. Databases that provide this level of guarantee often require small periods of unavailability during failures, when they are temporarily unable to enforce this level of consistency.

Key Features

  • Periods of unavailability while fail-overs happen
  • Clients can get stronger guarantees on object consistency

Availability

Intimately tied to the consistency model of your database is its availability model. Most NoSQL stores use one of two approaches to achieve availability: single-master or multi-master. Note that pretty much any database can be highly available for eventually consistent reads at all times, assuming part of the cluster is running. The really interesting part of availability is whether the system is available for writes, and whether it’s available for consistent reads.

Single Master

In this type of data store, there is a single master that owns each object in the database and multiple slaves that have eventually consistent copies of objects. This master is the only node that can process write requests or strongly consistent read requests for an object. If this master node fails or is otherwise unavailable, the system must go through a leader election process to find a new master. During this election process, the objects hosted at the master are unavailable for writes or consistent reads. Most systems still allow eventually consistent reads from slaves even during failures.

Key Features

  • Enables strong consistency
  • Some unavailability during fail-overs

Multi-Master

There are multiple masters for a single object, and all of them are writable simultaneously. A client must write to a quorum of available masters in order to write the object. This allows multiple nodes to fail without losing the ability to write. However, during failures writes may fail if a quorum is not available, and you may see inconsistency if you write to less than a quorum or employ techniques like sloppy quorums.
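The arithmetic behind this, as a sketch: with N replicas, W write acknowledgements, and R read acknowledgements, every read set is guaranteed to overlap every write set on at least one replica when R + W > N.

case class QuorumConfig(n: Int, w: Int, r: Int) {
  // A strict quorum: reads and writes always intersect on some replica.
  def isStrictQuorum: Boolean = r + w > n
}

val cfg = QuorumConfig(n = 3, w = 2, r = 2)
assert(cfg.isStrictQuorum)  // 2 + 2 > 3: a read always touches the latest acked write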

Key Features

  • Enables high availability
  • Some inconsistency expected

Partitioning

Partitioning refers to how data is distributed across the cluster.

Range Based Partitioning

In Range Based Partitioning, contiguous ranges of values are stored together on nodes. This means that the system can easily support range query operations. However, it also means that load may be concentrated on particular ranges of values. For example, if I have partitioned on a value with low cardinality, I may end up with an uneven distribution of requests across partitions. On the other hand, it’s pretty easy to implement hash partitioning on top of range partitioning by simply using a hashed value as your key.
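A sketch of that last trick: hash the natural key and use the digest as the range-partitioned key, so that contiguous natural keys spread uniformly across partitions. The helper below is hypothetical, not any store’s built-in API.

import java.security.MessageDigest

// Derive a uniformly distributed key from the natural key.
def hashedKey(naturalKey: String): String =
  MessageDigest.getInstance("MD5")
    .digest(naturalKey.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// "user:1", "user:2", ... would land in the same range; their hashes will not.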

Key Features

  • Efficient range queries
  • Hash partitioning is easy to add

Hash based partitioning

In Hash Based Partitioning, objects are distributed across the cluster according to a consistent hashing function. This enables efficient distribution of work for writes and reads by primary key, but it also means that a range query must hit every node on the cluster.
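As a toy illustration (a hypothetical three-node cluster, not full consistent hashing): a point lookup routes to exactly one node, while a range query has no key ordering to exploit and must scatter to all of them.

val nodes = Vector("node0", "node1", "node2")

// Point lookup: hash the key to pick a single node.
def nodeFor(key: String): String =
  nodes((key.hashCode & Int.MaxValue) % nodes.size)

// Range query: hashing destroys key order, so every node must be consulted.
def nodesForRange(from: String, to: String): Seq[String] = nodes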

Key Features

  • Good load distribution
  • Range queries are impossible or inefficient

Summary

Choosing the right NoSQL store requires understanding your use case and the tradeoffs of the platform you select. Most people move to NoSQL because some aspect of their relational datastore is inadequate. It’s a useful exercise to think about what it is that you don’t like about your existing database and what you want to get from NoSQL. I find it useful to think about this by asking the following questions:

  1. What does my data look like and what kind of queries do I want to run? Are range queries important?
  2. When I have failures, would I rather have consistency violations, or unavailability?

Compare the answers to these questions to the categories above and see where you fit.

Schemas for schemaless databases

Most of the modern NoSQL databases have eschewed the traditional RDBMS schema for a schemaless design. Databases like MongoDB, CouchDB, HBase, and Riak all allow you to store arbitrary new fields in your database without having to change any configuration.

This brings some great advantages. Development cycles and data management go more quickly because there’s less code to change (I can just update my Java or Python code without reconfiguring my database).

But there are still some challenges that a schema would make easier:

Validation

Today most NoSQL stores leave validation as an exercise for the reader. This means that in your application code, you need to write a lot of defensive code and logic to make sure data is valid before you put it into the store.

Once it’s there, it’s very difficult to figure out whether the data in your database is actually valid. Human error, faulty software, or botched upgrades can all result in invalid data.
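Here is the kind of defensive code I mean, as a sketch in Casbah against a hypothetical message document like the one in the Play tutorial below:

import com.mongodb.casbah.Imports._

// Hand-rolled validation the database won't do for us.
def isValidMessage(doc: MongoDBObject): Boolean =
  doc.getAs[String]("msg").exists(_.nonEmpty)

def saveMessage(coll: MongoCollection, doc: MongoDBObject): Unit =
  if (isValidMessage(doc)) coll.save(doc)
  else throw new IllegalArgumentException("invalid message document")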

Validation is difficult, especially with document-oriented data stores. Traditional SQL schemas don’t really fit the bill for a few reasons:

  • Data is embedded rather than stored as references, so a referential schema is more or less useless
  • SQL schemas are typically not “round-trip” compatible. In other words, I cannot generate a schema from my code, and then generate code from my schema
  • It’s difficult to retain the highly dynamic nature of document oriented stores in conjunction with a strict schema

A great validation engine for document-oriented databases would overcome these challenges.

Multi-language development

If you’re writing clients for your data in multiple languages, you essentially need to recreate your schema in each language that accesses your database. Modern interface-definition tools (and even some much older ones) provide smart workflows for dealing with multi-language environments, because you can build a generic description of your data and then generate language-specific stubs for any environment in which you want to access it.

It would be great to have this ability for document-oriented stores.

Language neutral

Ideally, a schema definition language would be external to the language being used to access the database. The schema should be the same regardless of which language I’m using to access the DB, letting me work in whatever language is right for the job at hand.

Type systems for add-on tools

When building things like MapReduce jobs or processing pipelines, it’s useful to be able to reason about the types of objects passing between phases in the pipeline. Jobs are significantly simpler if I can have some guarantee that, for example, each document contains specific fields, so the system can validate objects before they enter the pipeline.
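For instance, a declared stage input type lets a framework reject malformed documents at the pipeline boundary. The Reading type and the raw-map representation here are hypothetical:

case class Reading(sensor: String, value: Double)  // the declared input type for a stage

// Guard at the pipeline boundary: only well-typed documents get in.
def toReading(doc: Map[String, Any]): Option[Reading] =
  (doc.get("sensor"), doc.get("value")) match {
    case (Some(s: String), Some(v: Double)) => Some(Reading(s, v))
    case _                                  => None  // reject before the job runs
  }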

Thoughts and next steps

I’ve been discussing a system that would bridge some of these gaps, specifically for MongoDB. Look for some follow-up posts where we expand on these thoughts.

Getting started with Play Framework, Scala and Casbah

I’ve been dorking around with the Play Framework, Scala, and Casbah recently, and I wanted to share my progress. I’m new to Scala and Play, so I am probably running afoul of some of the best practices out there. If you’ve got any advice for me after reading this post, please do share.

Here’s what we’re going to do in this tutorial:

  1. Set up a new play framework project using scala
  2. Add the dependencies so you can use Casbah
  3. Create a controller that lists / posts messages to MongoDB
  4. Create templates for the app
  5. Create routes for our new controller
  6. You win!

Set up a new play framework project using scala

bash$ play install scala 
bash$ play new myCasbahDemo --with scala 

Now you should have a new directory called “myCasbahDemo” populated with the template for a play app.

Add the dependencies so you can use Casbah

We need to modify the conf/dependencies.yml file to tell Play how to load the Casbah dependencies. Here’s my conf/dependencies.yml file:

 
# Application dependencies
require:
    - play
    - play -> scala 0.9
    - com.mongodb.casbah -> casbah_2.8.1 2.1.2

repositories:
  - scalatools:
     type: iBiblio
     root: http://scala-tools.org/repo-releases/
     contains:
       - com.mongodb.casbah -> *
       - org.scalaj -> *

Now we can run

bash$ play dependencies

Play will fetch the casbah dependencies and install them in the local project.

Create a controller that lists / posts messages to MongoDB

I created a new controller in app/controllers/Messages.scala with the following content:

package controllers

import play.mvc._
import com.mongodb.casbah.Imports._
import scala.collection.JavaConverters._

object Messages extends Controller {

  val _mongoConn = MongoConnection()

  def index = {
    // Find documents that have a non-empty "msg" field
    val msgs = _mongoConn("casbah_test")("test_data").find( "msg" $exists true $ne "" )
    // Extract just the message text from each document
    val msgStrings = msgs.map( (obj: DBObject) => obj.getOrElse("msg","") )
    // Convert to a Java collection so the Groovy template can iterate it
    Template( 'msgStrings -> msgStrings.asJava )
  }

  def save(msg:String) = {
    val doc = MongoDBObject("msg" -> msg)
    _mongoConn("casbah_test")("test_data").save( doc )
    Redirect("/")  // back to the message list, per the routes below
  }
}

Our controller has two methods: “index” and “save”.

Index grabs a list of messages from mongo, extracts the text, and renders them in the corresponding template (we’ll look at those in a moment).

There’s a bit of magic going on here. Play uses Groovy as its template language, but many Scala types don’t directly translate to Groovy types. So you’ll notice these lines of code:

    val msgStrings = msgs.map( (obj: DBObject) => obj.getOrElse("msg","") )
    Template( 'msgStrings -> msgStrings.asJava )

First, we got back a cursor from our Mongo query, so we use map to extract the “msg” attribute from each of the docs returned. This gives us a Scala collection. But the Groovy language currently used by Play templates does not know how to iterate over a Scala collection. To help programmers working with other JVM languages, Scala 2.8.1+ provides the scala.collection.JavaConverters library, which adds an asJava method to Scala collections (and an equivalent asScala method to Java collections). By calling asJava, we wrap our collection in a Java-compatible object that can be iterated.
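In isolation, the conversion looks like this:

import scala.collection.JavaConverters._

val scalaList = List("a", "b", "c")
val javaList  = scalaList.asJava  // a java.util.List[String] the Groovy template can iterate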

We are working to provide an add-on to Casbah or a patch to Play to make this easier in the future. In the meantime, be sure to import scala.collection.JavaConverters and convert your Scala types to Java types before calling your template.

Save accepts a new string message from the client and saves it to the collection before redirecting the user back to the message list template.

Note that I’m opening a connection to MongoDB in the controller. This isn’t a great design choice, since the connection should be shared and re-used by other controllers rather than created here. In a future post, I’m going to add a real model layer and abstract the MongoDB connection away. But for the time being, this will suffice.

Create templates for the app

There’s just one template for my app, a simple page with a form field for submitting a new message, followed by a list of the existing messages. My template is in app/views/Messages/index.html, following the play pattern.

 
#{extends 'main.html' /}
#{set title:'Home' /}

<form action="@{Messages.save()}" method="POST">
  <input type="text" name="msg"/>
  <input type="submit" value="Add message" />
</form>

<ul>
  #{list items:msgStrings, as:'mess' }
  <li>${ mess }</li>
  #{/list}
</ul>

Create routes for our new controller

In conf/routes, we need to add routes for our new controller. Here’s what mine looks like:

 
# Routes
# This file defines all application routes (Higher priority routes first)
# ~~~~

# Home page
GET     /                                       Messages.index
POST    /                                       Messages.save

Try it out!

First off, make sure that mongod is running on your localhost (we used the default constructor of the MongoDB connection, which means it will look for mongod on localhost on the default port).
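If your mongod lives somewhere else, Casbah’s connection factory also accepts an explicit host and port, for example (hypothetical host):

val conn = MongoConnection("db.example.com", 27017)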

bash$ play run 

Wait for the app to start up, then point your browser at localhost:9000.

And bam! You’ve got your first Play + Scala + Casbah app up and running!
