
[repost] Survey of Distributed Databases

original: http://dbpedias.com/wiki/NoSQL:Survey_of_Distributed_Databases

 


 


Overview

This document, researched and authored by Randy Guck, provides a summary of distributed databases. These are commercial products, open source projects, and research technologies that support massive data storage (petabyte+) using an architecture that distributes storage and processing across multiple servers. These can be considered “Internet age” databases that are being used by Amazon, Facebook, Google and the like to address performance and scalability requirements that cannot be met by traditional relational databases. Due to their contrast in priorities and architecture compared to relational databases, these technologies are loosely referred to as “NoSQL” databases, though an absence of SQL is not a requirement.

Distributed Database Concepts

This section describes concepts that constitute the nature of modern distributed databases.

NoSQL Databases

Meaning “no SQL”, this is a term that casually describes the new breed of databases that are appearing largely in response to the limitations of existing relational databases. Strictly speaking, there is no reason any distributed database couldn’t support SQL, but the implication of the term NoSQL is that relational databases are the antithesis of the goals of the modern, distributed database.[1] For example, instead of supporting ACID transactions, which is the mainstay transactional model for relational databases, many of the new-generation databases support BASE principles (described separately). Virtually all of the databases in this paper are NoSQL databases, though some support SQL or an SQL-like query language.

There is no precise definition of what constitutes a NoSQL database, but these databases tend to share most or all of the following characteristics:

  • Schema-less: “Tables” don’t have a pre-defined schema. Records have a variable number of fields that can vary from record to record. Record contents and semantics are enforced by applications.
  • Shared nothing architecture: Instead of using a common storage pool (e.g., SAN), each server uses only its own local storage. This allows storage to be accessed at local disk speeds instead of network speeds, and it allows capacity to be increased by adding more nodes. Cost is also reduced since commodity hardware can be used.
  • Elasticity: Both storage and server capacity can be added on-the-fly by merely adding more servers. No downtime is required. When a new node is added, the database begins assigning it data to store and requests to fulfill.
  • Sharding: Instead of viewing the storage as a monolithic space, records are partitioned into shards. Usually, a shard is small enough to be managed by a single server, though shards are usually replicated. Sharding can be automatic (e.g., an existing shard splits when it gets too big), or applications can assist in data sharding by assigning each record a partition ID.
  • Asynchronous replication: Compared to RAID storage (mirroring and/or striping) or synchronous replication, NoSQL databases employ asynchronous replication. This allows writes to complete more quickly since they don’t depend on extra network traffic. One side effect of this strategy is that data is not immediately replicated and could be lost in certain windows. Also, locking is usually not available to protect all copies of a specific unit of data.
  • BASE instead of ACID: NoSQL databases emphasize performance and availability. This requires prioritizing the components of the CAP theorem (described elsewhere) in a way that tends to make true ACID transactions impractical.
  1.  To prevent the misunderstanding that SQL somehow cannot be used, some have re-defined NoSQL as meaning “not only SQL”.
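The hash-based sharding described above can be sketched in a few lines; this is a minimal illustration, with a hypothetical shard count and key format, not any product's partitioner:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical fixed shard count

def shard_for(record_key: str) -> int:
    """Map a record key to a shard by hashing, so placement is deterministic."""
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every node can compute the same placement without any coordination.
s = shard_for("user:42")
```

Real systems refine this idea (e.g., consistent hashing, described later) so that adding a shard does not remap most keys.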

Database Types by Entity Type

The new breed of NoSQL databases can be categorized several ways. This section proposes a taxonomy that describes databases from lower to higher levels of functionality based on the type of entity that each supports. An entity could be a tuple, record, document, or something else. Generally, more functionality is available with the more complex entity types. Strictly speaking, however, higher-level databases are not necessarily supersets of the databases “below” them.

Distributed Memory Caches

A memory cache sits on top of a persistent store such as a SQL database, caching most-recently-used data, typically records or field values. At a high level, a memory cache is just a hash table: every value is accessed via a key. The open source project Memcached was one of the first to expand the memory cache model and create the notion of distributed caching. It allows a request on one node to fetch the value from any node in the network. Consequently, the total size of the in-memory cache is the sum of the memory cache on all nodes. Using commodity hardware, more nodes can be added to increase the cache size and therefore overall performance of the application. Intended to accelerate web applications, Memcached demonstrates that any data-intensive application can enjoy significant performance and scalability gains using the distributed cache architecture.
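The cache-aside pattern that Memcached popularized can be sketched with a plain dict standing in for the distributed cache client; class and method names here are illustrative, not Memcached's API:

```python
class CacheAside:
    """Cache-aside lookup: check the cache first, fall back to the backing
    store on a miss, then populate the cache for subsequent readers.
    A dict stands in for a distributed cache client such as memcached."""

    def __init__(self, backing_store):
        self.cache = {}
        self.backing_store = backing_store  # e.g. a SQL lookup function
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]        # fast path: served from memory
        self.misses += 1
        value = self.backing_store(key)   # slow path: hits the database
        self.cache[key] = value
        return value

db = {"user:1": "alice"}                  # stand-in for the persistent store
cache = CacheAside(db.get)
cache.get("user:1")  # miss: loads from the store
cache.get("user:1")  # hit: no database access
```

In a distributed cache, the `self.cache` dict is replaced by a client that hashes each key to one of many cache nodes, so total capacity grows with the node count.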

Memcached is the most-used open source distributed cache, but there are several commercial distributed caches that provide similar functionality. Examples are Oracle’s Coherence, GigaSpaces’ XAP In-Memory Grid, and IBM’s WebSphere eXtreme Scale.

Distributed caches prompted developers to view distributed databases in a new way, by inverting the concept of a distributed cache. Instead of providing a caching layer to an existing data store, some new distributed databases treat the distributed hash table as the database, backed by a persistent store. Such implementations do not require a backend SQL database but instead provide their own persistence. This gives rise to the key/value database.

Key/Value Databases

These databases provide an efficient, persistent key-to-value map. Compared to more sophisticated databases, they are very limited because they provide only a single way to efficiently access values. Auxiliary paths to rows must be managed externally, e.g., via Lucene or an application-managed index. Most of the databases described in this paper are key/value stores at their core, often providing additional functionality for access by secondary values.
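The single-access-path limitation can be made concrete with a sketch of an application-managed secondary index layered over a plain map; the names and record shapes are hypothetical, not any product's API:

```python
# A key/value store offers only primary-key access; any secondary access
# path must be maintained by the application itself.
store = {}        # primary: row_id -> record
email_index = {}  # application-managed secondary index: email -> row_id

def put(row_id, record):
    store[row_id] = record
    # The application, not the database, must keep this index in sync.
    email_index[record["email"]] = row_id

def get_by_email(email):
    row_id = email_index.get(email)
    return store.get(row_id)

put("u1", {"name": "Ada", "email": "ada@example.com"})
```

Note that the application also inherits all the consistency problems: if `put` fails between the two writes, the index and the store disagree.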

Examples of key/value databases without additional indexing facilities are:

  • Berkeley DB
  • HBase
  • MemcacheDB
  • Redis
  • SimpleDB
  • Tokyo Cabinet/Tyrant
  • Voldemort
  • Riak

“Big Table” Databases

These are also called record-oriented or tabular databases. The term big table has become popular due to Google’s proprietary BigTable implementation. Similar to relational databases, a big table database consists of multiple tables, each containing a set of addressable rows. Each row consists of a set of values that are considered columns. However, due to the scalability and storage requirements imposed by the applications for which they are used, big table databases differ from relational tables in several important ways:

  • Each row can have a different set of columns. Although all rows in a given table may be required to have a pre-defined set of column groups, individual rows can differ by specific columns within these groups. This means that a table can have a different set of columns from row to row. Furthermore, the columns do not need to be pre-defined in a schema but instead can be added dynamically.
  • Tables are intended to have many more columns than a typical relational database. A row could contain thousands of columns. Some big table databases support millions of column values per row.
  • All big table databases support compound values, which means that, compared to a key/value store, a record can have multiple fields. However, some only support scalar values within a record, requiring BLOBs and other large values to be stored separately.
  • Rows are typically versioned. This means that multiple copies of the same row may exist (i.e., with the same row ID). Rows are typically versioned by a system-assigned timestamp.
  • Data storage is typically divided into shards, which are independently managed.
  • Sometimes only updates to a single row are considered atomic. In this case, multi-row updates are performed in separate transactions, and inter-row consistency is handled through soft-transaction techniques. Alternatively, a big table may allow multi-row updates but only when restricted to rows in the same shard.
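The characteristics above can be illustrated by modeling a big table row as nested maps, with versions keyed by timestamp; the column-family and column names are invented for illustration and not tied to any particular product:

```python
import time

# A big table row modeled as nested maps:
#   row key -> column family -> column qualifier -> {timestamp: value}
table = {}

def put_cell(row_key, family, qualifier, value, ts=None):
    ts = ts if ts is not None else time.time()
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, {}))
    cell[ts] = value  # older versions are retained, keyed by timestamp

def get_latest(row_key, family, qualifier):
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]  # highest timestamp wins

put_cell("row1", "contents", "html", "<old>", ts=1)
put_cell("row1", "contents", "html", "<new>", ts=2)
# Rows need not share columns: row2 has a column that row1 lacks.
put_cell("row2", "anchor", "cnn.com", "CNN", ts=1)
```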

In addition to Google’s BigTable database (which is surfaced in the Google App Engine as the datastore), other examples are:

  • Azure Tables (Microsoft)
  • Cassandra (Apache)
  • HBase (Apache Hadoop project)
  • Hypertable
  • SimpleDB (Amazon)
  • Voldemort (LinkedIn, now open source)

Document Databases

Also known as document-oriented databases, these databases focus on storage and access optimized for documents as opposed to rows or records. Essentially this is an API issue, since document database entities are just records with multiple fields. However, the focus is programmatic access better suited for documents. Consequently, they emphasize up-time, scalability, and similar features less than other distributed databases. All of the current document databases support documents in JSON format.

Some document databases provide RDBMS-like capabilities such as SQL or an SQL-like query language. Some document databases are also big table databases; some provide MapReduce processing; some are also columnar databases. In other words, the storage and access implementation is independent of the database’s document orientation.

Examples:

  • CouchDB
  • MongoDB
  • Terrastore
  • ThruDB

Graph Databases

Graph databases use nodes, edges, and properties as primary elements, providing a sharp contrast to the tables, rows, and columns of the relational model. Additionally, they emphasize high performance for associative data access, eliminating the need for joins. (Eifrém and Skogman say that graph DBs are “whiteboard friendly”.)

As with document databases, graph databases may be implemented with different storage techniques. The graph nature of these databases exists mainly in the data model presented to applications. Under the hood, a graph database may be a key/value store, a columnar database, a “big table” database, or some combination of these and other architectures.

One feature that graph DBs possess is the ability for a field value to directly store the ID of another entity. Moreover, graph DBs typically allow collections of IDs to be stored as a single value using an array or map layout. Consequently, navigating from one entity, say a node, to its related entities, such as its edges, can be performed quickly without maintaining secondary indexes. When multi-value collections are stored in both directions of an edge type, the database can quickly navigate many-to-many relationships in either direction.
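A minimal sketch of this bidirectional ID storage, using in-memory sets of IDs on each node (the node and edge names are invented for illustration):

```python
# Direct ID references stored as collections on each node let a graph DB
# traverse relationships in either direction without secondary indexes.
nodes = {n: {"follows": set(), "followers": set()}
         for n in ("alice", "bob", "carol")}

def add_edge(src, dst):
    # Store the relationship in both directions so either side can be
    # navigated without a join or an index lookup.
    nodes[src]["follows"].add(dst)
    nodes[dst]["followers"].add(src)

add_edge("alice", "bob")
add_edge("alice", "carol")
add_edge("bob", "alice")

# Traversal is a direct map lookup, not a query.
who_bob_is_followed_by = nodes["bob"]["followers"]
```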

Examples:

  • AllegroGraph RDFStore
  • HyperGraphDB
  • InfoGrid
  • Neo4J

CAP Theorem

The CAP theorem was proposed by Eric Brewer in a 2000 keynote at the Symposium on Principles of Distributed Computing. Basically, the theorem states that you can optimize for only two of three priorities in a distributed database:

  • Consistency: This is essentially the same as “atomicity” in ACID transactions. It is the principle that, for example, prevents two customers from both buying the last copy of a book in inventory. Atomicity guarantees that only one user can (a) lock the book object, (b) decrement its inventory count, and (c) add the book item to the user’s shopping cart, all in one transaction.
  • Availability: This means that the database, or more importantly the services that use it, are entirely available. Ensuring availability requires technologies such as replication and parallel processing so that changes in demand can be met while maintaining a minimum response time.
  • Partition Tolerance: In a distributed network, “partitions” form when certain failures occur, such as a network cable being unplugged. This gives rise to conflicts and ambiguities as to where data lives and who has control over it. Partition tolerance is defined as: “No set of failures less than total network failure is allowed to cause the system to respond incorrectly”.
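The book-inventory example under Consistency can be sketched as a single locked section in which steps (a) through (c) happen atomically; this is an illustrative model of atomicity, not any database's implementation:

```python
import threading

inventory = {"book-123": 1}  # one copy left
carts = {}
lock = threading.Lock()

def buy(user, book_id):
    with lock:                    # (a) lock the book object
        if inventory[book_id] == 0:
            return False          # sold out: the transaction aborts
        inventory[book_id] -= 1   # (b) decrement the inventory count
        carts.setdefault(user, []).append(book_id)  # (c) add to the cart
        return True

first = buy("u1", "book-123")
second = buy("u2", "book-123")  # fails: the last copy is already sold
```

In a distributed, partitioned system there is no single `lock` to take, which is exactly why the theorem forces a trade-off.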

The significance of the CAP theorem is that applications must understand their priorities carefully and optimize their distributed database solution correspondingly. Amazon, for example, claims that just an extra one tenth of a second on their response times will cost them 1% in sales. Similarly, Google says they’ve noticed that just a half second increase in latency caused traffic to drop by a fifth. In these cases, availability may be more important than consistency.

Columnar Databases

Also known as column-oriented databases, columnar databases store column values for multiple rows in the same block instead of storing multiple rows within the same block. This is faster because (a) many tables contain lots of columns, and (b) few queries need all of the columns at once. So, rather than fetching lots of blocks and throwing away most of the data, columnar databases only perform I/Os to the blocks corresponding to columns that are actually being read (or updated). In addition to the smaller I/O overhead, this uses memory more efficiently and allows more effective blocking based on each column’s type (date, text, etc.).
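The difference between the two layouts can be sketched by pivoting the same rows into columns; the data is illustrative:

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Ada",   "age": 36},
    {"id": 2, "name": "Alan",  "age": 41},
    {"id": 3, "name": "Grace", "age": 85},
]

# The same data in a columnar layout: one contiguous list per column.
columns = {
    "id":   [1, 2, 3],
    "name": ["Ada", "Alan", "Grace"],
    "age":  [36, 41, 85],
}

# A query that sums one column touches every field of every row in the
# row layout, but only one contiguous "block" in the column layout.
row_total = sum(r["age"] for r in rows)
col_total = sum(columns["age"])
```

The columnar list is also what makes type-specific compression effective: all values in `columns["age"]` share one type and tend to be similar.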

Some columnar databases also support compression, which affords many advantages. Because multiple values for the same column are stored in each block, it often compresses far greater than a single row or a block of rows would compress. Furthermore, custom compression techniques can be used for each column type, such as one that works well for text data and another that works well for timestamps. The denser storage yielded by compression reduces I/O transfer time and overall database size. To minimize the CPU time required when compressed blocks are read, special “real time” compression techniques are sometimes used. These algorithms do not compress as densely as say GZIP, but they provide significant compression with much less CPU time.

Examples:

  • Alterian
  • C-Store
  • HBase
  • QD Technology
  • SmartFocus
  • Vertica

Elasticity and BASE

Elasticity describes databases that can be dynamically expanded. That is, additional servers and/or storage can be added without taking the DBMS software down. Seminal research on this topic was performed by Amazon in defining their proprietary Dynamo key/value database. When new storage is added to the mesh, some subset of data begins to be replicated to it. When a new server is added, it begins to “steal” work from other servers. The elasticity of storage and server resources has given rise to a new database paradigm known as BASE.

Without question, the acronym BASE was deliberately chosen to contrast to the ACID paradigm.

  • Basically Available: This means that most data is available most of the time. A failure could cause some data to not be available, but only catastrophic failures cause everything to be down.
  • Soft state: This means the DB provides a relaxed view of data in terms of consistency. For example, an inventory value may be examined that is not precisely up-to-date. But for ordering purposes, approximations are OK.
  • Eventually consistent: As data is replicated throughout the system, such as when new storage nodes are added, it is eventually copied to all applicable nodes. But there is no requirement for all nodes to have identical copies of any given data all the time.

Elastic databases that emphasize BASE principles necessarily must give up the traditional paradigm of ACID transactions: atomicity, consistency, isolation, and durability. However, various implementations can prioritize certain requirements such as availability over other requirements such as consistency. This balance of priorities must observe the paradox of the CAP Theorem, described separately.

Although Amazon’s Dynamo database is proprietary, some other databases have used its principles to support elasticity:

  • Cassandra
  • Dynomite
  • Voldemort

MapReduce

MapReduce (or MR) is an algorithm for dividing a work load into units suitable for parallel processing. Though the general principle of divide-and-conquer for large data sets has been around for decades, Google popularized it with an OSDI paper in 2004. MR is especially useful for queries against large sets of data: the query can be distributed to hundreds or thousands of nodes, each of which works on a subset of the target data. The results are then merged together, ultimately yielding a single “answer” to the original query.

Perhaps more important than the MR algorithm are MR frameworks, which provide a common foundation for solving a variety of map-reducible problems. The MR framework takes care of common problems such as scheduling work for each node, combining results, and handling node failures. The MR framework application only needs to provide key plug-in methods, such as how to process each parcel of work and how to merge the result subsets.
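The canonical word-count job shows the two plug-in methods an application supplies; the driver below is a toy stand-in for a real framework's scheduling, distribution, and failure handling:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Application plug-in 1: turn one parcel of input into (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Application plug-in 2: merge all values collected for one key.
    return word, sum(counts)

def run_mapreduce(lines):
    # "Shuffle" phase: group intermediate pairs by key. A real framework
    # does this across many nodes; here it is a single dict.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(l) for l in lines):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = run_mapreduce(["the quick fox", "the lazy dog"])
```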

MR can provide an important tool to distributed databases for very large data access problems. However, MR frameworks tend to have high start-up and shut-down costs, hence they are not appropriate for low-latency transactions such as online searches. Some papers argue that MR is a step backwards compared to SQL and OLTP.

Survey of Distributed Database Technologies

Overview

The databases surveyed in this section are summarized in the table below. The terms used in this table are described below:

  • Database types: The four subsections of the table classify databases by their basic data model:
    • Key/value Databases: These manage a simple value or row, indexed by a key. Key/value DBs generally do not support secondary indexes.
    • Big table Databases: These manage large, multi-column rows in the spirit of Google’s BigTable database.
    • Document Databases: These manage multi-field documents (or objects) and provide JSON access.
    • Graph Databases: These manage nodes, edges, and properties and are optimized for associative access. All the current offerings are commercial products.
  • Name: The common name of the database product or open source project.
  • Owner: The company or open source owner of the database.
  • Written in: The primary programming language in which the database is written.
  • Languages/APIs: The primary programming languages and APIs with which the database can be accessed.
  • Platforms: The known operating systems on which the database is supported.
  • License: The type of commercial or open source license with which database usage is controlled.
  • Schemaless?: Indicates if the database supports run-time addition of columns/fields to each row, document, or other primary entity.
  • Sharding?: Indicates if the database supports automatic sharding (partitioning) of data across multiple servers. (Application-managed sharding is always possible but doesn’t count as a database feature.)
  • Indexes?: Indicates if the database provides support for automatic maintenance of secondary indexes. All databases provide efficient access by primary ID (e.g., row ID). Applications can always manually manage secondary indexes.
  • Active?: Indicates if the database appears to have an active user community or ecosystem.
  • Interest?: Indicates the amount of apparent activity and applicability for commercial use.
  • Notes: Noteworthy things about the product.

[Survey tables: key/value databases; big table and document databases; graph databases]

Key/Value Databases

Dynomite (Open Source)

Dynomite is an open source clone of Amazon’s Dynamo framework, written in Erlang. It is brand new; version 0.6.0 is currently shipping. Dynomite is an “eventually consistent distributed key/value store” that is targeting the following features for the 1.0 release:

  • Vector clocks
  • Merkle trees
  • Consistent hashing
  • Tunable quorum
  • Gossiping of membership
  • Gossiped synchronization of partitions
  • Pluggable storage engines
  • Thrift interface
  • Web console with canvas visualizations

It’s not clear how active Dynomite development is. It appears to be the work of one person (Cliff Moon).[1] The last entry on the web site seems to be ~1 year ago, and the last build posted was May, 2009. Documentation is minimal, as are code comments. There isn’t even a clear mention of what machines it runs on.

Dynomite seems to be similar to Cassandra but with far less activity. It may be dead.

  1.  On 3/16/10, Cliff Moon released a Scala client for Cassandra called Scromium. Does this mean he’s given up on Dynomite and is now working with Cassandra?

Riak (Basho Technologies)

Riak (pronounced “ree-ahk”) is a fully-distributed, scalable key/value store created by Basho Technologies that was open sourced under the Apache 2 license in August of 2009. Basho describes Riak thusly: “Riak is a Dynamo-inspired key/value store that scales predictably and easily. A truly fault-tolerant system, Riak has no single point of failure, with no machines being special or central in Riak.”

It is currently shipping at version 0.12 and appears to have active usage. It is written in Erlang, has full-featured HTTP and Protocol Buffers APIs, and many language interfaces (Erlang, Python, Ruby, Java, PHP, etc.).

Central to any Riak cluster is a 160-bit integer space, known as the “ring,” which is divided into equally-sized partitions. Physical servers, referred to in the cluster as “nodes,” run a certain number of virtual nodes, or “vnodes”. Each vnode will claim a partition on the ring. The number of active vnodes is determined by the number of physical nodes in the cluster at any given time.

All nodes in a Riak cluster are equal. Each node is fully capable of serving any client request. This is possible due to the way Riak uses “consistent hashing” to distribute data around the cluster.
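Consistent hashing can be sketched as a sorted ring of vnode points, with each key served by the first vnode clockwise from its hash; this is an illustrative model of the idea, not Riak's implementation, and the node names and vnode count are arbitrary:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # SHA-1 gives a 160-bit integer, matching Riak's ring size.
    return int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node claims several points (vnodes) on the ring,
        # which smooths out the distribution of keys across nodes.
        self.points = sorted(
            (ring_hash(f"{n}-{i}"), n)
            for n in nodes for i in range(vnodes_per_node)
        )

    def node_for(self, key):
        # The owner is the first vnode clockwise from the key's hash.
        h = ring_hash(key)
        idx = bisect.bisect(self.points, (h,)) % len(self.points)
        return self.points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # every client computes the same owner
```

Because placement is a pure function of the key and the ring, any node can serve any request, and adding a node remaps only the keys adjacent to its new vnodes.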

Each read/write request indicates how many nodes must respond with the requested operation for “success”. This is illustrated below, using N as the number of replicas, R as the requested number of responses for a read, and W as the requested number of responses for a write:

[Figure: Riak quorum examples for various N (replicas), R (read responses), and W (write responses)]

Riak uses “vector clocks” as its version control mechanism. This gives developers the option to resolve version conflicts in their application logic. The alternative is to let Riak resolve conflicts automatically. Other mechanisms at work in Riak include “Hinted Handoff,” a technique for dealing with node failure in the Riak cluster in which neighboring nodes temporarily take over storage operations for the failed node; and “Read Repair,” an anti-entropy mechanism used to optimistically update stale replicas when they reply to a read request with stale data.
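Vector-clock conflict detection can be sketched as a pairwise comparison of per-actor counters; this is a generic illustration of the technique, not Riak's internal representation:

```python
# Each actor increments its own counter when it updates a value. Two
# clocks conflict when neither has seen everything the other has.
def dominates(a, b):
    """True if clock a has seen every event clock b has."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a-newer"
    if dominates(b, a):
        return "b-newer"
    return "conflict"  # concurrent writes: resolution is left to the app

v1 = {"client-x": 2, "client-y": 1}
v2 = {"client-x": 1, "client-y": 1}  # v1 descends from v2
v3 = {"client-x": 1, "client-y": 2}  # concurrent with v1
```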

Riak also supports MapReduce in both JavaScript and Erlang for data processing that requires query functionality beyond the standard GET, PUT and DELETE operations offered by a key/value store.

Riak natively supports links, which are one-way relationships between objects. For example, an “artist” object can link to all of its “albums,” each of which can link to all of its “songs.”

Riak is available in a commercial version called Riak EnterpriseDS that builds on the core open source code and includes support.

Riak can run on a variety of operating systems: it is known to run on several *NIX platforms and should be able to run on most anything that isn’t Windows. Basho also provides pre-packaged binary builds for most platforms. After one year of being open sourced, Riak has seen success and adoption in various markets, ranging from FORTUNE 100 enterprises like Comcast, to open source authorities like Mozilla, to innovative startups like Mochi Media.

Tokyo Cabinet/Tyrant (Open Source)

Tokyo Cabinet is an open source key/value database created and maintained by Mikio Hirabayashi. It is a set of libraries, intended for embedding in database applications. Berkeley DB is often cited as its biggest “competitor”.

There are no tables and no schema: all “objects” are merely keys whose values are strings or byte arrays. Tokyo Cabinet provides “row” locking and built-in support for compression. It is written in C, supports both hash tables and B-trees, and is “crazy fast” (reportedly storing 1 million records in as little as 0.7 seconds). It has interfaces in many languages including Ruby, Perl, C, etc. Below is a diagram of its B-tree organization:

[Figure: Tokyo Cabinet B-tree organization]

Tokyo Cabinet is used by Mixi.jp (Japan’s version of Facebook). It is accompanied by Tokyo Tyrant, which provides networking, replication, and failover capabilities. Tokyo Cabinet and Tyrant are apparently used in many production applications, though there are anecdotal descriptions of maturity problems. They are both licensed under the GNU Lesser General Public License (LGPL).

Unfortunately, the licensing seems to be a show stopper for commercial use. Also, it’s not clear that Cabinet provides support for secondary indices: applications must roll their own, thereby also handling all transaction processing.

Vertica (Vertica Systems)

Vertica Analytic Database is a commercial, columnar database provided by Vertica Systems, Inc. (HP purchase announced in February, deal to close in April). One of Vertica’s founders is Michael Stonebraker, who also founded Ingres, Postgres, and many other database initiatives. It is built upon (or inspired by) C-Store, adding clustering and other robust features. It is positioned as a data warehouse DB, more specifically as “the only columnar DB that supports massively parallel processing (MPP)”. Vertica supports compression, replication, automatic failover, and MapReduce. The company claims that Vertica offers “50x-200x faster speed on 90% less hardware” (presumably compared to other data warehouse databases). The only supported platforms are Linux and VMware.

A “cloud” version of Vertica is available via Amazon’s EC2 platform. This SaaS version is intended for no-DBA, pay-as-you-go applications that scale out as much as needed. The EC2 Vertica solution is available as an Amazon Machine Image (AMI).

Vertica is only available as a commercial product, and only available on Linux and EC2 (not Windows).

Voldemort (Open Source)

The Voldemort database was built by LinkedIn for their needs. It is a key/value store that uses Dynamo principles for distributed processing, replication, and failover. It can be considered “Cassandra 1.0” since the same designer left LinkedIn to create a similar database for Facebook. Since it is a key/value store, Voldemort does not directly support relationships. Instead, 1-to-many relationships must be handled by using an embedded map (e.g., java.util.HashMap) as a value. For very large fan-outs (say millions of values), a separate, application-managed table must be used. The designers claim that this simplistic design is an advantage since performance is very predictable.

Voldemort was recently open sourced and is available under the Apache 2.0 license.

Stemming from the Dynamo research, Voldemort uses “read repair” for conflict detection and resolution instead of two-phase commit or Paxos-style consensus algorithms. This also means that versioning is used and that applications must participate in conflict resolution.

Voldemort’s architecture employs a distinct layer to resolve each logical function: conflict resolution, serialization, etc. This is illustrated below:

[Figure: Voldemort logical architecture, with one layer per function]

As shown, Voldemort requires a storage engine such as Berkeley DB or MySQL. Multiple physical architecture options are available to support various load-balancing and performance goals. This is shown below:

[Figure: Voldemort physical architecture options]

The native API of Voldemort is Java, but access is also possible via HTTP.

Voldemort was only recently open-sourced, and its adoption in the presence of other options is unclear. Its dependency on BDB or MySQL as a backend store limits its use for commercial applications.

Big Table Databases

Azure Tables (Microsoft)

Azure is Microsoft’s cloud computing platform, first coming into production use in 2010. The underlying Azure fabric provides distributed computing services such as communication, service management, replication, and failover. On this foundation, Azure offers several storage options, including:

  • Windows Azure XDrive: This is a distributed NTFS-like file system built on top of Azure Blobs.
  • Block and Page Blobs: These are monolithic storage objects, optimized for streaming and random access, respectively.
  • Azure Tables: These are “big table” like storage structures that use row-oriented access and storage.
  • SQL Azure: This is a limited version of SQL Server that can be used in Azure.

Of these options, only SQL Azure is a true database system. However, it is limited to a 10GB maximum size, doesn’t support full text search, and is “crippled” in several other ways. Microsoft intends to expand the features of SQL Azure over time, moving it in the direction of a distributed database. This means that Azure doesn’t provide a distributed database solution today. Instead, it provides building blocks with which a distributed database could be built.

For large storage, Azure Tables provide the most flexible building block. Each table allows an unlimited number of rows (also called “entities”) that can be separated into partitions. Each partition is serviced independently, allowing parallel processing. Any row can be efficiently accessed via {partition key, row key}. Transactions are limited to rows within the same partition. Tables do not have a fixed schema: each row can have its own set of properties, and the same-named property can have a different value type from one row to the next. However, a row can have no more than 255 properties and cannot be larger than 1MB in size, including property names[1]. Moreover, property types are limited to strings and simple scalar values. A large value such as an image or document must be stored as a Blob and referenced from a table property. Rows are versioned with a system-assigned timestamp.
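The {partition key, row key} addressing can be pictured as a two-level map in which each row carries its own property set; the entity shapes below are illustrative, not Azure's actual API:

```python
# Azure-Table-style addressing: partition key -> row key -> properties.
table = {}

def upsert(partition_key, row_key, properties):
    table.setdefault(partition_key, {})[row_key] = properties

def get(partition_key, row_key):
    # The only efficient access path: both keys must be known.
    return table[partition_key][row_key]

# Rows in the same partition can carry different property sets.
upsert("orders-2010-03", "order-001", {"customer": "Ada", "total": 12.5})
upsert("orders-2010-03", "order-002", {"customer": "Alan"})
```

Because each partition is serviced independently, choosing the partition key is effectively choosing both the unit of parallelism and the transaction boundary.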

Azure Tables do not provide indexes or any other means of efficient row access other than by {partition key, row ID}. Linear searches (aka “table scans”) are supported, but indexed auxiliary searching must be application-maintained. Similarly, full text searching must be implemented at the application level.

Windows Azure can be accessed in .Net languages as well as Java.

As a “big table” implementation similar to many other cloud- and on-premise databases, Azure tables are the obvious choice for a big table-storage solution when Azure is used.

  1.  Compared to the much looser limits of other “big table” databases, Azure Tables might be more appropriately described as “mini tables”.

Cassandra (Apache)

Cassandra is an Apache project that aims to provide a scalable, distributed database solution that employs the technology advances of Amazon’s Dynamo and Google’s BigTable projects. It was open-sourced by Facebook in 2008 and designed by one of Dynamo’s original authors (Avinash Lakshman) as well as Facebook engineer Prashant Malik. It is used by many large companies such as Rackspace, Digg, Facebook, Twitter, and Cisco. The largest production server has over 150 TB of data. Cassandra is available under the Apache 2.0 license.

Key features of Cassandra are summarized below:

  • Decentralized: All nodes are equal; there is no single point of failure.
  • Fault Tolerant: Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
  • Eventually Consistent: Data uses an eventually consistent model with optimizations such as Hinted Handoff and Read Repair to minimize inconsistency windows.
  • Elasticity: New machines can be added dynamically with no downtime, and read/write operations scale linearly as new servers are added.
  • Rich Data Model: More complex records are supported than simple key/value stores.
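The decentralized, replicated placement that Cassandra inherits from Dynamo can be illustrated with a toy consistent-hash ring. This is a hedged sketch of the general technique, not Cassandra’s actual partitioner; the node names and replication factor are arbitrary:

```python
import hashlib
from bisect import bisect_right

# Illustrative sketch (not Cassandra's code): Dynamo-style placement on
# a consistent-hash ring with a configurable replication factor.
def token(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication=3):
        self.replication = replication
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        # Walk clockwise from the key's token, taking N distinct nodes.
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication=3)
print(ring.replicas("user:42"))  # three distinct nodes hold this key
```

Because every node can compute the same placement, there is no central coordinator, and adding a node only remaps the keys adjacent to its ring position.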

Cassandra reports impressive performance numbers compared to MySQL. For example:

  • MySQL (> 50 GB of data): writes average ~300 ms; reads average ~350 ms
  • Cassandra (> 50 GB of data): writes average 0.12 ms; reads average 15 ms

Cassandra is written in Java. It uses the Apache Thrift service framework to provide access in many languages. Based on its roots, it emphasizes the Linux platform though it is being used on Windows as well. At the PyCon conference, Cassandra was described as “not polished”. For example, it doesn’t yet support compression.

In mid-March, 2010, Cassandra was moved from an incubator project to a top-level project. Consequently, it now has its own top-level web page: http://cassandra.apache.org. Cassandra is very active and progressing rapidly.

Cassandra appears to be very promising, especially with its backing by major companies currently managing lots of data on the web.

Datastore (Google)

Google’s cloud computing framework, known as the Google App Engine (GAE), provides a data storage solution known as the datastore. Under the hood, it is the famous BigTable database, which is a foundation technology that helped launch the “no SQL” movement. The primary features of datastore are summarized below:

  • The datastore is a distributed database that provides queries and transactions.
  • Data objects are called entities and have a kind and a set of properties.
  • Data objects of the same kind form a data class, essentially analogous to a “table”.
  • Many property types are supported including scalar types (string, Boolean, integer, float, etc.) and specialty types such as lists, blobs, links, emails, and geographic coordinates.
  • Each datastore is schema-less: the application specifies the properties of each entity at runtime.
  • Multiple updates within a single transaction are supported when all affected entities are in the same entity group.
  • Every query uses an index that returns entities in the desired order. Indexes can be predefined for better performance.
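The entity-group restriction on transactions can be illustrated with a small sketch. This is not the GAE SDK; the classes below are invented to model the rule that a single transaction may only touch entities in one entity group:

```python
# Conceptual sketch (not the GAE datastore API): schema-less entities
# with a kind and properties; a transaction may span only one entity group.
class Entity:
    def __init__(self, kind, key, group, **properties):
        self.kind, self.key, self.group = kind, key, group
        self.properties = properties

class Datastore:
    def __init__(self):
        self.entities = {}

    def put_in_transaction(self, *entities):
        groups = {e.group for e in entities}
        if len(groups) > 1:
            raise ValueError("all entities in a transaction must share an entity group")
        for e in entities:
            self.entities[e.key] = e

ds = Datastore()
post = Entity("Post", "post1", group="blog1", title="Hello")
comment = Entity("Comment", "c1", group="blog1", text="First!")
ds.put_in_transaction(post, comment)  # same entity group: allowed
```

An attempt to update entities from two different groups in one transaction would raise an error, which is the behavior the bullet list above describes.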

GAE datastores can be accessed with Python or Java (Java is a relatively new addition). Python applications use a Python-specific API to access a datastore. Java applications can use Java Data Objects (JDO), the Java Persistence API (JPA), or a low-level datastore API. Datastores support the JDO Query Language (JDOQL). Datastores also have their own SQL-like query language called GQL, which is used primarily in the Python interface. Google maintains accessible statistics on each datastore including entity counts and total size.

Datastores have some limitations as summarized below:

  • An entity cannot be bigger than 1MB.
  • An entity’s columns cannot be referenced in indexes more than 1,000 times.
  • No more than 500 entities can be added (put) or deleted in a single batch.

New accounts are not charged until they reach 500MB of storage or 5 million page views a month. Thereafter, cost is apportioned based on the space used and amount of data transferred in and out of the application.

GAE tables are the obvious candidate for storage for GAE applications. Because they are similar to Azure Tables and other BigTable implementations, it may be possible to target multiple “big table” implementations with a single storage solution.

HBase (Apache)

HBase is a “big table” columnar database, modeled after Google’s BigTable and built upon Apache’s Hadoop framework. It is still fairly new, currently shipping the 0.90.x (Feb ’11) release (see http://wiki.apache.org/hadoop/Hbase/HBaseVersions). HBase supports extremely large tables and distributed database principles such as:

  • Elastic storage (dynamic addition of new servers)
  • No single point of failure
  • Rolling restart for configuration changes and minor upgrades
  • Support for MapReduce processing
  • Apache Thrift network for service deployment
  • REST-ful web service gateway

Hadoop is used by Yahoo search, Facebook, Microsoft Bing (via the Powerset acquisition?), and other major vendors. Hadoop and HBase are Apache open source projects, and both are offered by Amazon via its Amazon Web Services (AWS) offering.

HBase is a columnar database designed for scale in all directions: billions of rows, each of which can have millions of columns. Each row/column intersection is a cell that can be versioned. The number of versions stored is configurable.

Each HBase table is named. Rows are sorted by row ID, which is a user-provided byte array value. Tables are organized into regions, which are stored separately. When a region becomes full, it splits and some rows are moved. Every region holds a contiguous range of row IDs defined by (first row, last row]. All locking is on a per-row basis, regardless of the number of columns.

Columns are organized into column families, which must be declared. For example, a column family “station:” can be declared. Subsequently, an application can add values with column names such as “station:id”, “station:name”, “station:description”, etc.

Each database has a single “master” node and one or more regionserver nodes. HBase can be mapped to different storage solutions. When mapped to HDFS, nodes are replicated for redundancy. Each HBase instance has catalog tables called -ROOT- and .META.. The -ROOT- table holds the list of .META. table regions, and .META. holds the list of user-space regions (including each region’s first row key). This architecture allows regions to split as data grows, yet it allows the correct node to be quickly found given a row key.
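The catalog lookup described above amounts to a binary search over sorted region start keys. A minimal sketch follows, with invented keys and region names; it is an illustration of the idea, not HBase code:

```python
from bisect import bisect_right

# Given a sorted list of region start keys (as the catalog tables hold),
# find the region serving a row key with a binary search.
def find_region(start_keys, regions, row_key):
    idx = bisect_right(start_keys, row_key) - 1
    return regions[idx]

start_keys = [b"", b"g", b"p"]           # three regions covering the key space
regions = ["region-1", "region-2", "region-3"]
print(find_region(start_keys, regions, b"apple"))   # -> region-1
print(find_region(start_keys, regions, b"kiwi"))    # -> region-2
```

Splitting a region simply inserts a new start key into this sorted list, which is why the scheme scales without re-mapping existing rows.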

HBase uses the Apache ZooKeeper framework for distributed coordination. When mapped to the Hadoop HDFS, HBase implements a special file class that provides features such as compression and bloom filters. GZIP compression is used by default; LZO compression, which must be installed separately on each HBase node because it is GPL-licensed, offers faster, real-time compression.

HBase, as with most of Hadoop, is written in Java and favors Linux systems. However, it reportedly will also work on Windows systems.

HBase is more complicated than some solutions because of its layering: it requires Hadoop’s HDFS, which is one set of services, as well as its own set of services, connected and managed by Apache ZooKeeper. It is older than Cassandra and potentially more mature, especially when secondary indexes are required.

Hypertable (Open Source)

Hypertable is an open source, “web scale” database modeled after Google’s BigTable. It is available under the GNU GPLv2. It is a relatively new project, currently shipping version 0.9.2.7. Hypertable can be deployed on top of the Hadoop HDFS or CloudStore KFS file systems. It supports an SQL-like language for creating tables called HQL. An overview of a Hypertable deployment is shown below:

HyperTable.JPG

Hypertable uses a Google Chubby equivalent called Hyperspace, which holds metadata and acts as a global lock space. Hypertable uses a table/column model similar to that of other big table implementations:

  • A table consists of rows, each of which has a unique row ID.
  • Each row consists of a set of column families.
  • Each column family has an unlimited set of column qualifiers, each of which defines a unique column (field) value.
  • Each column value is versioned by a system-defined timestamp.
  • Rows are stored in row ID order within each table.
  • When a table reaches a certain size, it splits into ranges, each served by a separate range server that handles all rows within the corresponding range. (This is analogous to HBase regions and region servers.)

A table is essentially a key/value map, where each key is the row ID concatenated with the column family, column qualifier, and value timestamp. A full map key therefore looks like this:

HyperTable-MapKey.JPG
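The composite key scheme can be sketched as an ordered map keyed by (row ID, column family, column qualifier, timestamp). Negating the timestamp so the newest cell version sorts first is a common trick, used here purely for illustration; this is a conceptual sketch, not Hypertable code:

```python
# Sketch of the composite map key described above: each cell is keyed by
# (row ID, column family, column qualifier, timestamp) in a sorted map.
# Timestamps are negated so the newest version sorts first.
cells = {}

def put(row, family, qualifier, timestamp, value):
    cells[(row, family, qualifier, -timestamp)] = value

put("com.example", "tag", "color", 2, "blue")
put("com.example", "tag", "color", 1, "red")

# Scanning in key order yields the newest version of each cell first.
for key in sorted(cells):
    print(key, cells[key])
```

Because the entire table reduces to one sorted key space, range scans over rows, column families, or versions all become ordinary ordered-map traversals.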

The native API is C++. The Apache Thrift API is used to provide support for Java, Perl, PHP, Python and Ruby. Linux and Mac OSX are supported; porting to other platforms is possible.

Hypertable seems functionally very similar to Cassandra but with a much more restrictive GPL license.

MckoiDDB (Diehl and Associates)

MckoiDDB is an open source distributed database developed by a small company (Diehl and Associates) and distributed under the GNU GPLv2 license. It manages large sets of data distributed within a cluster, supporting features such as low-latency queries, snapshots, sharding, versioning, and data forks. MckoiDDB supports add-on modules, one of which provides SQL support. (The SQL module started shipping as Beta in May, 2010.) Version 1.0 was released 10/5/09; version 1.1 started shipping in May, 2010. It appears to be managed by a single person (Tobias Downer).

MckoiDDB utilizes optimistic locking, narrowing all contention down to a “consensus function”, which is engaged before data is committed. Sharding is supported, which allows data and contention to be distributed among multiple servers. Multiple data models are provided; for example, a “simple database API” provides traditional file and table models.

In comparison to BigTable and Hadoop HBase, MckoiDDB claims to be more oriented towards online transaction processing[1]. MckoiDDB can be used in cloud deployments such as EC2. Like other distributed databases, it writes data to three drives, providing fault tolerance and distributed access. MckoiDDB is written in Java and reportedly works anywhere Java 1.6 is available.

MckoiDDB is supported by one person and doesn’t appear to have a lot of activity. It is also surrounded by the dreaded GPL firewall.

  1.  On the surface, this doesn’t make sense since all of the “big tables” provide highly deterministic times for all read/write operations, so performance tends to be related to the amount of sharding and caching used.

SimpleDB (Amazon)

SimpleDB is Amazon’s “big table” database, currently in beta, available through the Amazon Web Services (AWS) portfolio. It is written in Erlang and is similar to BigTable, Cassandra, and other big table databases in many ways: each domain (table) has no pre-defined schema; each item (row) can have a different set of attributes (columns); etc. Domains are currently limited to 10 GB each, and each account is limited to 100 domains.

Attribute values must be simple text or scalar types. Large objects such as documents and images should be stored in Amazon’s S3 storage and referenced via pointers. Not only is S3 cheaper, but this helps avoid hitting the 10GB limit. Moreover, data transfer within the AWS is free, so there is no penalty for storing large objects separately.

Interestingly, some articles, such as the Wikipedia entry, classify SimpleDB as a document database. This may be because it supports REST and SOAP, thereby allowing rows to be retrieved as XML documents. However, SimpleDB does not appear to support JSON, whereas all other document databases do. Moreover, large text and binary values must be stored in S3 as described above, pushing to applications the responsibility of managing the content of “documents”, be they textual, image, audio, etc. SimpleDB therefore seems more accurately classified as a “big table” database.

Currently, SimpleDB is free for experimental use:

Free Tier

You can get started with SimpleDB for free. Amazon SimpleDB users pay no charges on the first 25 Machine Hours, 1 GB of Storage, and 1 GB of Data Transfer Out consumed every month. Under the new free inbound data transfer promotion, all data transfer into Amazon SimpleDB is free of charge until June 30, 2010. In most use cases, the free tier enables approximately 2,000,000 GET or SELECT API requests to be completed per month before incurring any usage charges. Many applications should be able to operate perpetually within this free tier, such as a daily website analysis and traffic reporting tool, a web indexing service, or an analytics utility for online marketing programs.

SimpleDB employs “eventually consistent” replication and transaction semantics. However, on 2/24/10, Amazon added new features such as conditional puts and deletes and consistent read that prevent ambiguities and small timing gaps that can result in inconsistent data. These operations trade speed for consistency but open SimpleDB to a larger class of applications.

SimpleDB looks promising as a big table-like database for applications hosting data in the Amazon cloud.

Document Databases

CouchDB (Apache)

CouchDB is a document database, sometimes referred to as the document DB “poster child”. It is an ad-hoc, schema-less database with a flat address space. Written in Erlang, CouchDB can be queried and indexed using MapReduce expressions. It supports JSON and a REST-style access. An overview of its architecture from the Apache web site is shown below:

CouchDB-Arch.JPG

CouchDB uses incremental, asynchronous replication with conflict detection and management. A multi-master architecture allows an individual master to be off-line (writes are queued until the master comes back on-line).

Under the hood, CouchDB is a key/value store that maps document IDs to documents. A CouchDB document is an object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document is a blog post:

CouchDB-DocExample.JPG

A CouchDB database is a flat collection of these documents, each document identified by a unique ID. Full text search is provided by a Lucene index. Additional indexes are views defined using the MapReduce pattern, with JavaScript for the map and reduce functions. (This approach has been described as “awkward”.)
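A MapReduce-defined view can be sketched as follows. CouchDB expresses the map and reduce functions in JavaScript; Python is used here purely for illustration, over an invented set of blog-post documents:

```python
from collections import defaultdict

# Sketch of a MapReduce-defined view: map emits (key, value) pairs per
# document, and reduce folds the values collected for each key.
docs = [
    {"type": "post", "author": "ana", "words": 120},
    {"type": "post", "author": "ben", "words": 80},
    {"type": "post", "author": "ana", "words": 40},
]

def map_fn(doc):
    if doc["type"] == "post":
        yield doc["author"], doc["words"]

def reduce_fn(values):
    return sum(values)

view = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        view[key].append(value)

result = {key: reduce_fn(values) for key, values in view.items()}
print(result)  # -> {'ana': 160, 'ben': 80}
```

In CouchDB the emitted pairs are materialized as a persistent index and updated incrementally as documents change, rather than recomputed per query.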

Emil Eifrém and Adam Skogman described CouchDB as being difficult to use. Other internal research at Quest also categorized it as a special purpose database.

CouchDB seems to be an early example of a document DB and is best used for applications specifically needing that functionality.

DovetailDB (Millstone Creative Works)

DovetailDB is a schema-less, JSON-based database, similar (in spirit) to Amazon’s SimpleDB. It is produced by Millstone Creative Works and is open-sourced under the Apache 2.0 license. It allows server-side JavaScript and provides full transaction support. Compared to CouchDB, it provides automatic data indexing, though with less flexibility. The product is listed in a couple of “No SQL” articles, but the web site is sparse, and the newsgroup has no entries since 9/2009. It’s not clear that the project is still active.

It’s not clear what advantages DovetailDB has over other document-oriented databases, and it doesn’t appear to be active.

MongoDB (Open Source)

MongoDB is a document database system written in C++ once described as “CouchDB without the crazy”. Its web site defines MongoDB as a scalable, high-performance, open source, schema-free, document-oriented database with the following primary features:

  • Document-oriented storage (the simplicity and power of JSON-like data schemas)
  • Dynamic queries
  • Full index support, extending to inner-objects and embedded arrays
  • Query profiling
  • Fast, in-place updates
  • Efficient storage of large binary objects (e.g. photos and videos)
  • Replication and failover support
  • Auto-sharding for cloud-level scalability
  • MapReduce for complex aggregation
  • Commercial Support, Training, and Consulting

Programmatic access is available in a number of languages such as C, C++, C#, Erlang, Java, Python, Ruby, and more. MongoDB uses an AGPL/Apache license model, but it is friendlier to applications that want to use it. From the web site:

Our goal with using AGPL is to preserve the concept of copyleft with MongoDB. With traditional GPL, copyleft was associated with the concept of distribution of software. The problem is that nowadays, distribution of software is rare: things tend to run in the cloud. AGPL fixes this “loophole” in GPL by saying that if you use the software over a network, you are bound by the copyleft. Other than that, the license is virtually the same as GPL v3.

Note however that it is never required that applications using mongo be published. The copyleft applies only to the mongod and mongos database programs. This is why Mongo DB drivers are all licensed under an Apache license. Your application, even though it talks to the database, is a separate program and “work”.

One web site compares the MongoDB MapReduce functionality to SQL this way:

MongoDB.JPG

MongoDB is supported on 32- and 64-bit versions of Mac OSX, Linux, and Windows. One web review points out a few limitations with MongoDB:

  • It is not as fast as other key/value store databases such as Tokyo Cabinet or Redis. However, it is plenty fast, inserting 2,500 simple documents per second and performing 2,800 reads per second on commodity Linux hardware.
  • Its support for sharding is new and essentially alpha-ish.
  • The query syntax is not “pretty”, but the ease-of-use rivals Redis.

MongoDB’s only apparent drawbacks are its relative immaturity and its uncertain suitability for objects other than documents.

Terrastore (Open Source)

Terrastore is a document database built on top of Terracotta. It uses an Apache 2.0 license, but its dependency on Terracotta needs further research. Terrastore is distributed and supports dynamic addition and removal of nodes (elasticity). Documents are automatically distributed to new nodes, though it doesn’t appear to support replication or automatic failover features. Terrastore supports JSON, accessible via an HTTP interface.

It’s not clear how active/used Terrastore is. It seems to be supported by one person (Sergio Bossa). There are a few papers available, but not much documentation.

ThruDB (Open Source)

ThruDB is an open source (New BSD license) document database built on top of the Apache Thrift framework. Like Thrift, ThruDB was started by (and still used by) Facebook but is now open source on the Google code pages. ThruDB provides web-scale data management by providing separate services for various concerns:

  • Thrucene: provides Lucene-based indexing
  • Throxy: provides partitioning and load balancing
  • Thrudoc: provides document storage
  • Thruqueue: provides a persistent message queue service
  • Thrift: provides cross-language services framework

ThruDB provides several storage back-end solutions, including local disk, S3 (cloud), and a hybrid local disk/S3 solution. The differences in these solutions are summarized below:

ThruDB-BackendStorageTable.JPG

ThruDB uses Memcached as an in-memory, cross-server cache. When the hybrid disk/S3 storage solution is used, this means that documents are checked in three places: memory, local disk, and then S3. This allows the S3 store to be a seamless backup/recovery source in case a local disk fails. This is illustrated below:

ThruDB-S3Hybrid.JPG

Compared to SimpleDB, ThruDB removes many of the former’s restrictions: no 1024 byte limit per attribute, ability to use proprietary data formats, no query time limits, and more. Since it also runs on EC2 and public Thrift/ThruDB Amazon Machine Images (AMIs) exist, ThruDB is in a sense a competitor to SimpleDB.

Graph Databases

AllegroGraph RDFStore (Franz Inc.)

AllegroGraph RDFStore is a persistent RDF graph database that purportedly scales to “billions of triples while maintaining superior performance”[1]. It is intended for building RDF-centric semantic web applications. AllegroGraph supports SPARQL, RDFS++, and Prolog reasoning from Java applications. It is a commercial product of Franz Inc.: a free version stores up to 50 million triples; a Developer version stores up to 600 million triples; an Enterprise version stores an unlimited number of triples.

An overview of its architecture is shown below:

AllegroGraphDB.JPG

AllegroGraph RDFStore supports database federation. Independent databases can be deployed, and a single query can be distributed across databases. The product is intended for fast loading of and access to RDF information and not as a general purpose database. The requirement of a commercial license also puts it at a disadvantage compared to various open source alternatives.

  1.  Resource Description Framework (RDF) is a structure for describing and interchanging metadata on the Web. In RDF, a triple consists of three values: a subject, a predicate, and an object. All three are typically fairly short strings.

InfoGrid (NetMesh Inc.)

InfoGrid is a graph database developed by NetMesh, Inc. It provides graph-based traversal for access instead of joins. Each unit of data is separately addressable, thereby supporting REST-ful access. InfoGrid uses XPRISO and Probe frameworks to allow external data to appear as if it was native to the database. InfoGrid supports data sharding, object/relational mapping, data import, and other functions. An overview of the architecture is shown below:

InfoGrid-Arch.JPG

An InfoGrid database instance is called a MeshBase and contains a graph of MeshObjects. Each MeshObject has a type and a set of properties. A Relationship connects two objects and may have a type and a direction. MeshBases can be mapped to an RDBMS, file system, or a distributed file system such as Hadoop HDFS. Viewlets allow search results to be rendered to a specific target such as a web page.

Probes are used to access real-world data, as illustrated below:

InfoGrid-Probes.JPG

InfoGrid is available under the GNU Affero General Public License (AGPL), which is a network version of the GPL. This means that commercial use requires a license from NetMesh.

HyperGraphDB (Kobrix)

HyperGraphDB is a graph database designed specifically for artificial intelligence and semantic web projects. It is also positioned as a “general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism.” It also slices, dices, and bakes bread. It is developed by Kobrix and covered by the Lesser GNU Public License (LGPL) and is shipping at version 1.0.

Here are some key facts from the product’s main web site:

  • The mathematical definition of a hypergraph is an extension to the standard graph concept that allows an edge to point to more than two nodes. HyperGraphDB extends this even further by allowing edges to point to other edges as well and making every node or edge carry an arbitrary value as payload.
  • The original requirements that triggered the development of the system came from the OpenCog project, which is an attempt at building an AGI (Artificial General Intelligence) system based on self-modifying probabilistic hypergraphs.
  • The basic unit of storage in HyperGraphDB is called an atom. Each atom is typed, has an arbitrary value and can point to zero or more other atoms.
  • Data types are managed by a general, extensible type system that is itself embedded as a hypergraph structure. Types are themselves atoms like any other, but with a particular role.
  • The storage scheme is platform independent and can thus be accessed by any programming language from any platform. Low-level storage is currently based on Berkeley DB from Sleepycat Software.
  • Size limitations are virtually non-existent. There is no software limit on the size of the graph managed by a HyperGraphDB instance. Each individual value’s size is limited by the underlying storage, i.e. by Berkeley DB’s 2GB limit. However, the architecture allows bypassing Berkeley DB for particular types of atoms if one so desires.
  • The current implementation is solely Java based. It offers an automatic mapping of idiomatic Java types to a HyperGraphDB data schema which makes HyperGraphDB into an object-oriented database suitable for regular business applications. A C++ implementation has been frequently contemplated, but never initiated due to lack of manpower. Note that the storage scheme being open and precisely specified, all languages and platforms are able to share the same data.
  • Embedded in-process: the database comes in the form of a software library to be used directly through its API.
  • A P2P framework for distributed processing has been implemented for replication/data partitioning algorithms as well as client-server style computing.
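The atom model described above can be sketched in a few lines. This is a conceptual illustration with invented names, not the HyperGraphDB API; note how an edge’s target set can include another edge:

```python
import itertools

# Conceptual sketch of the atom model: every atom has a type and an
# arbitrary value, and its targets may include other atoms -- including
# edges, so edges can point to edges.
class Atom:
    _ids = itertools.count()

    def __init__(self, type_name, value, targets=()):
        self.id = next(self._ids)
        self.type = type_name
        self.value = value
        self.targets = list(targets)  # zero or more atoms (nodes or edges)

alice = Atom("Person", "Alice")
bob = Atom("Person", "Bob")
carol = Atom("Person", "Carol")
# A hyperedge may point to more than two nodes...
meeting = Atom("Meeting", "standup", targets=[alice, bob, carol])
# ...and another edge may point at that edge.
note = Atom("Annotation", "ran long", targets=[meeting])
print(len(meeting.targets), note.targets[0].value)  # -> 3 standup
```

This is what distinguishes a hypergraph from an ordinary graph: relationships are first-class values that can themselves be related to.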

Given its immaturity (version 1.0) and dependency on Berkeley DB, which is GPL licensed, HyperGraphDB doesn’t seem feasible for commercial applications.

Neo4j (Neo Technologies)

Neo4j is a graph database intended for embedded applications. It is written in Java, but it has bindings for Python, Ruby, Clojure, and Scala. It supports ACID transactions and provides master-slave replication. It is provided by Neo Technologies and has both AGPL and commercial licenses. The Neo4j jar file is only 500KB. It reportedly supports billions of nodes, relationships, and properties on a single system. It has been in production use since 2003. An example of a typical information graph is shown below:

Neo4J.JPG
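The property-graph model can be sketched as typed nodes connected by typed relationships that are traversed rather than joined. This is a conceptual illustration with invented data, not the Neo4j API:

```python
from collections import deque

# Minimal sketch of graph traversal: follow typed relationships outward
# from a start node instead of computing relational joins.
graph = {
    "ana":   [("KNOWS", "ben")],
    "ben":   [("KNOWS", "carol")],
    "carol": [],
}

def reachable(start, rel_type):
    # Breadth-first traversal following one relationship type.
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for rel, other in graph.get(node, []):
            if rel == rel_type and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

print(sorted(reachable("ana", "KNOWS")))  # -> ['ben', 'carol']
```

In a relational database the same "friends of friends" question requires a self-join per hop; in a graph database each hop is a constant-time pointer chase.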

One reference to Neo4j suggested that it supports very basic master/slave replication, but this is apparently done through a soft mechanism such as a disk replicator. Anecdotal mention of sharding is also made, but a blog entry explains that this must essentially be done by the client. As a result, built-in scalability support appears to be limited to what can be done on a single machine.

The lack of scale-out strategy, the focus on a graph model, and the need for a commercial license seem to make Neo4j of limited use.

Honorable Mentions

The following technologies are related to distributed database technology but do not currently warrant deeper study. Some are not databases but are related in an interesting way. Some are databases but do not qualify as sufficiently interesting for this survey. Some are very new and/or evolving such that they may warrant more study later.

Alterian

Alterian specializes in web content management, campaign management, and various marketing areas. They use a proprietary columnar database as an “integrated marketing platform”, but it is not available as a general purpose database.

Berkeley DB

Originally distributed by Sleepycat Software, this key/value store DB is now owned by Oracle. It targets embedded databases with high-performance blob-like access. It offers no network API, and applications must manage the contents of records. It is really an embedded DB building block, and it requires licensing for commercial application use.

C-Store

C-Store is an experimental columnar database developed by a consortium of universities. It is built on top of Berkeley DB, is optimized for reads, and supports SQL. However, it is no longer supported, though it is cited as inspiration for a number of other columnar databases.

Chordless

Chordless is a distributed hash table, written in Java, built on the Chord adaptive, distributed, peer-to-peer hashing algorithm. It is a GPL-licensed utility available for free download.

CloudStore KFS

CloudStore KFS, formerly called Kosmos, is an open-source high performance distributed file system implemented in C++. It is modeled after Google’s GFS file system. Integration with Hadoop is available, providing an alternative to the Hadoop HDFS file system. It uses the Apache License Version 2.0. Applications can use C++, or via glue code, Java and Python. Activity is questionable since the shipping release is 0.4, and no builds have been done since 10/2009. Presumably, KFS is faster than HDFS since it is implemented in C++ instead of Java.

Coherence

Coherence was originally developed by Tangosol and is now owned by Oracle. Coherence is a “grid caching” solution that places frequently-used data for Java applications in memory, distributed across multiple servers. It is a commercial analog of Memcached.

Db4o

Db4o is an open source (GPL) database intended to be embedded within object applications. It allows .Net and Java objects to be stored without decomposition or normalization. It is mentioned only because it (apparently) uses a creative storage technique that affords high performance. However, the GPL license prevents commercial usage.

Drizzle

Drizzle is a lightweight SQL database based on MySQL. Its goal is to produce a sufficiently lightweight database that can take advantage of cloud computing and massive concurrency/parallelism while retaining the advantages of SQL. It is a relatively new project, licensed with the GNU GPLv2.

Dynamo

Dynamo is a distributed key/value storage system developed by Amazon. Though the actual implementation is proprietary, Dynamo provides foundation research that has been used in several other implementations such as Voldemort and Cassandra.

Force Database

The Force.com Database is an object/relational hybrid database that is used by Salesforce.com but can be used for other Force.com cloud applications. It has programming tools, a programming language called Apex, and two query languages called Salesforce Object Query Language (SOQL) and Salesforce Object Search Language (SOSL).

GigaSpaces XAP

GigaSpaces XAP In-Memory Grid is a Memcached-like distributed memory caching architecture. It front-ends a persistent store such as a database, providing high-performance access for a subset of data, say the newest data, while supporting horizontal scale-out.

Gremlin

Gremlin is not a DBMS but rather an open source graph programming language for navigating graph DBs. It gets mentioned by some of the graph DB articles.

Google App Engine

Google App Engine is Google’s application development framework for building and running web applications within Google’s cloud infrastructure. Applications can be written in Python or Java and choose from a variety of services: web services, persistence with queries and transactions, automatic scaling and load balancing, task queues, and more. Applications can be developed locally and deployed to the Google cloud. Applications that store less than 500MB and experience fewer than 5 million page views per month are free.

GT.M

GT.M is a commercial, recently open-sourced database provided by FIS Financial. It is a multi-dimensional key/value database that qualifies as a “no SQL”, schema-free database, yet it supports ACID transactions. It started in 1986 on VAX/VMS and now runs on Linux. It favors development in the ANSI “M” language, also known as MUMPS, which is used in the financial and banking industries.

HamsterDB

HamsterDB is a variable-key, memory or disk-based database intended for embedded applications. It is available under a GNU GPL license; a commercial license is also available. As a non-relational database, it is much faster than an RDBMS for certain applications.

Heroku

Heroku is a cloud application platform for Ruby-based applications.

Hive

Hive is a data warehouse infrastructure built on top of Hadoop. It is intended for ETL applications and utilizes the Hadoop MapReduce infrastructure for its processing. It is not intended for real-time processing. Latencies tend to be high (minutes) even when data batches are relatively small (a few hundred megabytes).

Infinispan

Infinispan is a distributed, in-memory key/value caching mechanism used by JBoss, similar to Memcached or WebSphere eXtreme Scale.

Jiql

Jiql is a JDBC library that allows Java applications to access Google’s BigTable. Originally, Google App Engine (GAE) applications had to be written in Python, so Jiql supplies a Python controller that acts as a JDBC gateway to BigTable. This appears to be obsolete since Google now provides a Java interface to BigTable.

Keyspace

Keyspace is a distributed, persistent key/value store developed by Scalien. It is available under the GNU AGPL license as well as under commercial licenses. Keyspace supports automatic replication and failover, and it is accessible via HTTP, C, and Python. It does not apparently support secondary indexes.

LightCloud

LightCloud is a lightweight layer on top of Tokyo Cabinet/Tyrant, providing a key/value store that is comparable to a persistent version of Memcached. LightCloud is written in Python, but a Ruby port is underway. A port to Redis as an alternative to Tokyo Tyrant is also available. LightCloud is available under the BSD license.

Lucandra

Lucandra is a project that allows Lucene to store its data in Cassandra. Essentially, Lucandra extends Lucene’s IndexReader and IndexWriter interfaces to use Cassandra as the backing store, at the cost of only about a 10% loss in read performance compared to regular disk files; write performance shows no loss.

LucidDB

LucidDB is an open source, columnar RDBMS for Java and C++, intended for data warehouse applications. It does not support clustering, its stability is unknown, and project activity seems limited.

M/DB

M/DB is a SimpleDB clone built on top of GT.M. It is available under the GNU AGPLv3 license. Its apparent purpose is to serve as a training/test implementation of SimpleDB.

MDEX

MDEX is a semi-structured database used by the Endeca Information Access Engine (IAE), which is an enterprise search engine. Each MDEX record can have its own set of attributes and can be versioned. MDEX supports a distributed architecture for large deployments, and it can enforce access-controlled searches. Access is supported via a SOAP interface and Java and .Net APIs. MDEX is apparently available for commercial use.

Memcached

Memcached is a widely used distributed, memory cache system, intended for web access acceleration but used in many other applications. It was originally developed by Brad Fitzpatrick for LiveJournal in 2003. Distributed caching means that a request for an object on one server can get a “hit” if the object lives in the cache of any known server. Over time, objects expire and are deleted from the cache. Memcached is open source and freely available.
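
The distributed-caching behavior described above can be sketched in a few lines of Python. This is a toy model, not memcached’s real protocol or client API: the key point is that every client hashes a key to the same node, so a value cached by one front-end server is a “hit” for all the others, and entries lapse after a time-to-live:

```python
import hashlib
import time

class MiniMemcache:
    """Toy sketch of memcached-style distributed caching (illustrative
    only). The client hashes each key to pick one of several cache
    nodes; entries expire after a TTL and are evicted lazily on read."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node(self, key):
        # Deterministic key -> node mapping: any client that knows the
        # server list computes the same node, so caches are shared.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def set(self, key, value, ttl=60):
        self._node(key)[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._node(key).get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.time() > expires:
            del self._node(key)[key]  # lazy expiry, as memcached does
            return None
        return value
```

Real memcached clients additionally use consistent hashing (see Ringo below) so that adding a server invalidates only a fraction of the cache.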

MemcacheDB

MemcacheDB is a persistent version of Memcached that uses Berkeley DB as its persistence mechanism. The last version shipped was 1.2.1 (beta) on 12/25/2008. The project does not appear to be active. It uses a BSD-like license.

MonetDB

MonetDB is an open source, column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It supports SQL but has been described as “crashes a lot” and “corrupts your data”. It appears to be a research project with little activity.

Mnesia

Mnesia is a distributed, “soft real-time” database management system written in Erlang. It is comparable to Berkeley DB but intended as a persistence engine for Erlang applications. It was initially written by Ericsson for distributed, high-availability applications in the telecom space.

MySQL NDB Cluster

MySQL NDB Cluster deserves an honorable mention. MySQL provides a clustered storage solution called Network Database (NDB). MySQL clusters utilize a shared-nothing architecture, meaning that each MySQL server instance has its own local disk. However, MySQL clusters can use NDB to store data on shared data nodes, which can use mirrored disks. This provides horizontal scaling with no single point of failure. Because the resulting database is still an RDBMS using ACID transactions, its scalability ceiling is much lower than that of BASE-style databases.

PNUTS

PNUTS (Platform for Nimble Universal Table Storage), internally called Sherpa, is a Yahoo! project for BigTable-like distributed storage. It is currently a research project. Its goals are high availability, auto-sharding, multi-data center replication, and failover, similar to BigTable, Cassandra, and others.

PostgreSQL

PostgreSQL is probably the oldest open source, object/relational database system. It is an ANSI-compliant SQL database that supports replication and other production-level features. Databases have been reported in the terabyte range. PostgreSQL provides a modular, customizable architecture, though that same feature has been criticized for making deployments complex.

QD Technology

QD Technology is a company that provides a “quick response” database, which is an RDBMS that copies and compresses a production database so that it can be stored and queried on another computer, say a laptop. It is apparently a columnar database, so it is occasionally mentioned in articles about columnar technologies.

RDDB

RDDB is a RESTful Ruby database that provides document-oriented functionality for Ruby applications. It is a new, alpha-stage project.

Redis

Redis is a persistent key/value database written in C, analogous to Tokyo Cabinet/Tyrant and MemcacheDB. It is reportedly “blazingly fast”: 110,000 SETs per second and 81,000 GETs per second on commodity Linux hardware. Redis is currently shipping version 1.2.0 and is available under the New BSD license. Sharding is supported only as a client-side option, and replication is also not supported; reportedly, clustering is one of the next priorities. Retwis is a Twitter clone built on Redis, and Retwis-RB is a version of it using Ruby, Sinatra, and Redis. Redis turned 1 year old on 2/25/2010.

Relational Database AMI

A Relational Database AMI is a popular, pre-packaged database offered by Amazon Web Services (AWS) as an Amazon Machine Image (AMI). Users can choose between various versions of DB2, Oracle, MySQL, SQL Server, PostgreSQL, and Sybase. Vertica is offered as a service. Essentially, these are the same as regular databases but deployed in Amazon’s cloud. The user assumes all DB administration tasks.

Relational Data Service (RDS)

Relational Data Service (RDS) is a full-featured version of the MySQL relational database available within the Amazon Web Services (AWS) portfolio. Five DB instance classes are available, from “small” to “quadruple extra-large”. Each instance class can support storage up to 1 TB. Pricing is based on instance class plus actual usage of storage, I/Os, and data transferred in and out of RDS. Although automatic backups are performed, RDS databases currently have no redundancy, failover, or high-availability capabilities. Amazon is planning to provide a high availability offering in the future.

Ringo

Ringo is an experimental distributed key/value store modeled after Amazon’s Dynamo. It uses consistent hashing for elasticity and replication for availability. It is intended for immutable data. It is written in Erlang and is available under the BSD license.
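
The consistent hashing that Ringo (and Dynamo) use for elasticity can be sketched as follows. This is an illustrative Python model, not Ringo’s Erlang implementation: nodes are placed at several points on a hash ring, a key belongs to the first node at or after the key’s hash, and adding or removing a node therefore only remaps keys in its immediate neighborhood rather than rehashing everything:

```python
import bisect
import hashlib

def _hash(value):
    # Map an arbitrary string to a point on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring (illustrative sketch only)."""

    def __init__(self, nodes, replicas=4):
        # Each node gets `replicas` virtual points on the ring to
        # smooth out the key distribution.
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(replicas):
                bisect.insort(self.ring, (_hash("%s:%d" % (node, i)), node))

    def owner(self, key):
        # First node clockwise from the key's hash, wrapping around.
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

For example, `HashRing(["a", "b", "c"]).owner("user:42")` always returns the same node for the same server list, and dropping one node leaves most key assignments untouched.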

Scalaris

Scalaris is a distributed key/value database initiated by Zuse Institute Berlin, written in Erlang. It uses Paxos for strong consistency during commits. It is open source under the Apache 2.0 license. Access is available via a web interface, Java, Erlang, or a JSON-based RPC. It is supported only on Linux variants. The backend can be switched to Tokyo Cabinet. It doesn’t seem to be widely used or very active at the moment.

Sesame

Sesame is an open source framework for storing and querying RDF data. It is available under a BSD-like license. Release 2.0 is shipping. It appears to sit above MySQL and does not add replication, sharding, or any other “big DB” features.

Terracotta

Terracotta is a “JVM persistence” database, allowing Java objects to be seamlessly saved to disk. It has been called a “database killer” because it persists objects without a database backend. An open source version is available; a commercial enterprise version adds clustering, replication, fault tolerance, and other features.

VPork

VPork is a utility for load-testing a distributed hash table (namely, Project Voldemort). Written in Java and Groovy, it launches a variable number of threads that read, write, and modify hash table entries.

VoltDB

VoltDB is a new OLTP RDBMS intended to allow horizontal scale-out. Michael Stonebraker is one of its co-founders. The project is currently in stealth mode, though an overview paper is available.

WebSphere eXtreme Scale

WebSphere eXtreme Scale was formerly WebSphere Extended Deployment Data Grid. This IBM product appears to be a commercial equivalent of Memcached. It provides distributed in-memory caching across a large number of computers, providing a caching grid with features such as automatic partitioning and replication.