Category Archives: LevelDB

[repost ]Performance Data For LevelDB, Berkley DB And BangDB For Random Operations

original:http://highscalability.com/blog/2012/11/29/performance-data-for-leveldb-berkley-db-and-bangdb-for-rando.html

This is a guest post by Sachin Sinha, Founder of Iqlect and developer of BangDB.

The goal for the paper is to provide the performances data for following embedded databases under various scenarios for random operations such as write and read. The data is presented in graphical manner to make the data self explanatory to some extent.

  • LevelDB:LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Leveldb is based on LSM (Log-Structured Merge-Tree) and uses SSTable and MemTable for the database implementation. It’s written in C++ and availabe under BSD license. LevelDB treats key and value as arbitrary byte arrays and stores keys in ordered fashion. It uses snappy compression for the data compression. Write and Read are concurrent for the db, but write performs best with single thread whereas Read scales with number of cores
  • BerkleyDB:BerkleyDB (BDB) is a library that provides high performance embedded database for key/value data. Its the most widely used database library with millions of deployed copies. BDB can be configured to run from concurrent data store to transactional data store to fully ACID compliant db. It’s written in C and availabe under Sleepycat Public License. BDB treats key and value as arbitrary byte arrays and stores keys in both ordered fashion using BTREE and un-ordered way using HASH. Write and Read are concurrent for the db, and scales well with number of cores especially the Read operation
  • BangDB:BangDB is a high performance embedded database for key value data. It’s a new entrant into the embedded db space. It’s written in C++ and available under BSD license. BangDB treats key and value as arbitrary byte arrays and stores keys in both ordered fashion using BTREE and un-ordered way using HASH. Write, Read are concurrent and scales well with the number of cores

The comparison has been done on the similar grounds (as much as possible) for all the dbs to measure the data as crisply and accurately as possible.

The results of the test show BangDB faster in both reads and writes:

Since BangDB may not be familiar to most people, here’s a quick summary of its features:

  1. Highly concurrent operations on B link -Tree
    1. Manipulation of the tree by any thread uses only a small constant number of page locks any time
    2. Search through the tree does not prevent reading any node. Search procedure in-fact does no locking most of the time
    3. Based on Lehman and Yao paper but extended further for performance improvement
  2. Concurrent Buffer Pools
    1. Separate pools for different types of data. This gives flexibility and better performance when it comes to managing data in the buffer pool in different scenarios
    2. Semi adaptive data flush to ensure performance degrades gracefully in case of data overflowing out of buffer
    3. Access to individual buffer header in the pool is extensively optimized which allows multiple threads to get right headers in highly efficient manner resulting in better performance
    4. 2 LRU lists algorithm for better temporal locality
  3. Other
    1. Write of data/log is always sequential
    2. Vectored read and write as far as possible
    3. Aries algorithm for WAL has been extended for performance. For ex; only index pages have metadata related to logging, data pages are totally free of any such metadata
    4. Key pages for index are further cached to shortcut various steps which results in less locking on highly accessed pages giving way towards better performance
    5. Slab allocator for most of the memory requirements

Since leveldb writes data in sorted manner always(based on user defined comparator) hence I have used BTREE (and not HASH) for BDB and BangDB across all tests (though you should note that the BTREE way of organizing data is different than the way LSM-Tree does it). Since the best write performance by LevelDB is achieved using single thread and best read performance by using 4 threads(on 4 core machine), hence I have used the best possible numbers for all these dbs unless mentioned for a particular test.

For BangDB and BDB we have used 4 threads for both read and write. Of course these are the active number of threads performing the write and read but there could be more number of threads doing the background operations for each db. For ex; LevelDB uses background thread for compaction, similarly BangDB uses background threads for log flush, managing the buffer pool, read/write from disk, checkpointing etc. I have used the same machine and the similar db configuration (as much as possible) for all the dbs for the performance analysis.

 

Following machine ($750 commodity hardware) used for the test;

  • Model: 4 CPU cores, Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 64bit
  • CPU cache : 6MB
  • OS : Linux, 3.2.0-32-generic, Ubuntu, x86_64
  • RAM : 8GB
  • Disk : 500GB, 7200 RPM, 16MB cache
  • File System: ext4

Following are the configuration that are kept constant throughout the analysis (unless restated before the test)

  • Assertion: OFF
  • Compression for LevelDB: ON
  • Write and Read: Random
  • Write Ahead Log for BangDB and BDB: ON
  • Single Process and multiple threads
  • Transaction for BDB and BangDB: ON, with only in-memory update, Writes are not synchronous
  • Checkpointing, overflow to disk, log flush, write ahead logging: ON
  • Access method: Tree/Btree
  • Key Size: 24 bytes, random
  • Value Size: 100 – 400 bytes, random
  • Page Size: 8KB

There are overall six test cases as described below. Before each test case, the different configuration parameters set for the test case are mentioned. For most of test cases large cache sizes(around 1 – 2GB) are used simply because mostly I have used relatively large number of operations (in the range of couple of millions and 100K operations here consumes around 50MB including key and value) and for many tests I needed to ensure that data doesn’t overflow out of the buffer. But for few test cases I have repeated the tests using small memory as well (for test A and B) to show the numbers when cache size is smaller (in the range of 64MB). Other test cases can be repeated for small (or even smaller) cache sizes and similar trends (compared to the bigger cache size tests) could be seen.

The reason for showing the graph for individual test rather than putting a number against the test case is to assess the variability of the performance with different number of operations and other variable parameters. The typical number of operations are in millions starting with thousands.

Random Write And Read Using Single Thread

No Overflow of data (out of the memory), 100K – 2M operations, following are the DB configurations for the test;

  • Write Buffer size for LevelDB: 512MB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB

Same test case result for smaller cache size is shown below;

  • Write Buffer size for LevelDB: 64MB
  • Cache size for all the dbs: 64MB

Random Write And Read Using Multiple Threads

No Overflow of data (out of the memory), 100K – 2M operations, following are the DB configurations for the test;

  • Write Buffer size for LevelDB: 512MB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB
  • LevelDB – 1 Thread for Write and 4 Threads for Read
  • 4 Threads for Read and Write for BDB and BangDB

Same test case result for smaller cache size is shown below;

  • Write Buffer size for LevelDB: 64MB
  • Cache size for all the dbs: 64MB

Random Write And Read Of 3GB Of Data Using 1.5GB Buffer Cache

50% Overflow of data (50% to/from disk), 1M – 10M operations, following are the DB configurations for the test;

  • Write Buffer size for LevelDB: 1GB
  • Cache size for all the dbs: 1.5GB
  • Log Buffer size for BDB and BangDB: 256MB
  • LevelDB – 1 Thread for Write and 4 Threads for Read
  • 4 Threads for Read and Write for BDB and BangDB

Random Write And Read Vs The Number Of Concurrent Threads

No Overflow of data (out of the memory), 1M operations, following are the DB configurations for the test;

  • Write Buffer size for LevelDB: 512MB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB

Random Write And Read For Large Values

No Overflow of data (out of the memory), 10K – 100K operations, following are the DB configurations for the test;

  • Value size: 10,000 bytes
  • Write Buffer size for LevelDB: 1.5GB
  • Cache size for all the dbs: 2GB
  • Log Buffer size for BDB and BangDB: 512MB
  • Page Size: 32KB

Random Write And Read Simultaneously In Overlapped Manner

No Overflow of data (out of the memory), 1M operations. The test basically first inserts around (100-x)% of data for x% write and (100-x)% read and then tries to write remaining x% of data with (100-x)% of simultaneous read by 4 concurrent threads and time elapsed is computed only for the later x% write with (100-x)% read. If data is already present then it updates the existing data. following are DB configuration for the test;

  • Write Buffer size for LevelDB: 1GB
  • Cache size for all the dbs: 1GB
  • Log Buffer size for BDB and BangDB: 256MB

 

Summary

I have just covered very few use cases for the performance analysis, for example the sequential read and writes, batch operations are not covered. Also the changes in various configurable parameters may change the graphs completely. BDB and BangDB support Hash based access of the data as well but not covered here because LevelDB stores data in sorted order only.

If we see the amount of data written by individual dbs, we will find that BangDB is writing more data than the other two dbs. BangDB doesn’t use compression at the moment hence this can be improved in future if compression is supported.

BDB write performance goes down drastically when writing more than the allocated buffer. For multiple thread tests the dbs perform best (as expected) when number of threads is close to the number of cores in the machine, with the exception of LevelDB write, which works best when single thread performs the operation.

All dbs are available for download freely hence one can/may cover more scenarios comparing these three or other/more such dbs. The test files for the performance analysis is available fordownload.

Related Articles

[repost ]数据分析与处理之二(Leveldb 实现原理)

original:http://www.cnblogs.com/haippy/archive/2011/12/04/2276064.html

郑重声明:本篇博客是自己学习 Leveldb 实现原理时参考了郎格科技系列博客整理的,原文地址:http://www.samecity.com/blog/Index.asp?SortID=12,只是为了加深印象,本文的配图是自己重新绘制的,大部分内容与原文相似,大家可以浏览原始页面 :-),感兴趣的话可以一起讨论 Leveldb 的实现原理!

LevelDb日知录之一LevelDb 101

说起LevelDb也许您不清楚,但是如果作为IT工程师,不知道下面两位大神级别的工程师,那您的领导估计会Hold不住了:Jeff Dean和Sanjay Ghemawat。这两位是Google公司重量级的工程师,为数甚少的Google Fellow之二。

Jeff Dean其人:http://research.google.com/people/jeff/index.html,Google大规模分布式平台Bigtable和MapReduce主要设计和实现者。

Sanjay Ghemawat其人:http://research.google.com/people/sanjay/index.html,Google大规模分布式平台GFS,Bigtable和MapReduce主要设计和实现工程师。

LevelDb就是这两位大神级别的工程师发起的开源项目,简而言之,LevelDb是能够处理十亿级别规模Key-Value型数据持久性存储的C++ 程序库。正像上面介绍的,这二位是Bigtable的设计和实现者,如果了解Bigtable的话,应该知道在这个影响深远的分布式存储系统中有两个核心的部分:Master Server和Tablet Server。其中Master Server做一些管理数据的存储以及分布式调度工作,实际的分布式数据存储以及读写操作是由Tablet Server完成的,而LevelDb则可以理解为一个简化版的Tablet Server。

LevelDb有如下一些特点:

首先,LevelDb是一个持久化存储的KV系统,和Redis这种内存型的KV系统不同,LevelDb不会像Redis一样狂吃内存,而是将大部分数据存储到磁盘上。

其次,LevleDb在存储数据时,是根据记录的key值有序存储的,就是说相邻的key值在存储文件中是依次顺序存储的,而应用可以自定义key大小比较函数,LevleDb会按照用户定义的比较函数依序存储这些记录。

再次,像大多数KV系统一样,LevelDb的操作接口很简单,基本操作包括写记录,读记录以及删除记录。也支持针对多条操作的原子批量操作。

另外,LevelDb支持数据快照(snapshot)功能,使得读取操作不受写操作影响,可以在读操作过程中始终看到一致的数据。

除此外,LevelDb还支持数据压缩等操作,这对于减小存储空间以及增快IO效率都有直接的帮助。

LevelDb性能非常突出,官方网站报道其随机写性能达到40万条记录每秒,而随机读性能达到6万条记录每秒。总体来说,LevelDb的写操作要大大快于读操作,而顺序读写操作则大大快于随机读写操作。至于为何是这样,看了我们后续推出的LevelDb日知录,估计您会了解其内在原因。

LevelDb日知录之二:整体架构

LevelDb本质上是一套存储系统以及在这套存储系统上提供的一些操作接口。为了便于理解整个系统及其处理流程,我们可以从两个不同的角度来看待LevleDb:静态角度和动态角度。从静态角度,可以假想整个系统正在运行过程中(不断插入删除读取数据),此时我们给LevelDb照相,从照片可以看到之前系统的数据在内存和磁盘中是如何分布的,处于什么状态等;从动态的角度,主要是了解系统是如何写入一条记录,读出一条记录,删除一条记录的,同时也包括除了这些接口操作外的内部操作比如compaction,系统运行时崩溃后如何恢复系统等等方面。

本节所讲的整体架构主要从静态角度来描述,之后接下来的几节内容会详述静态结构涉及到的文件或者内存数据结构,LevelDb日知录后半部分主要介绍动态视角下的LevelDb,就是说整个系统是怎么运转起来的。

LevelDb作为存储系统,数据记录的存储介质包括内存以及磁盘文件,如果像上面说的,当LevelDb运行了一段时间,此时我们给LevelDb进行透视拍照,那么您会看到如下一番景象:

图1.1:LevelDb结构

    从图中可以看出,构成LevelDb静态结构的包括六个主要部分:内存中的MemTable和Immutable MemTable以及磁盘上的几种主要文件:Current文件,Manifest文件,log文件以及SSTable文件。当然,LevelDb除了这六个主要部分还有一些辅助的文件,但是以上六个文件和数据结构是LevelDb的主体构成元素。

LevelDb的Log文件和Memtable与Bigtable论文中介绍的是一致的,当应用写入一条Key:Value记录的时候,LevelDb会先往log文件里写入,成功后将记录插进Memtable中,这样基本就算完成了写入操作,因为一次写入操作只涉及一次磁盘顺序写和一次内存写入,所以这是为何说LevelDb写入速度极快的主要原因。

Log文件在系统中的作用主要是用于系统崩溃恢复而不丢失数据,假如没有Log文件,因为写入的记录刚开始是保存在内存中的,此时如果系统崩溃,内存中的数据还没有来得及Dump到磁盘,所以会丢失数据(Redis就存在这个问题)。为了避免这种情况,LevelDb在写入内存前先将操作记录到Log文件中,然后再记入内存中,这样即使系统崩溃,也可以从Log文件中恢复内存中的Memtable,不会造成数据的丢失。

当Memtable插入的数据占用内存到了一个界限后,需要将内存的记录导出到外存文件中,LevleDb会生成新的Log文件和Memtable,原先的Memtable就成为Immutable Memtable,顾名思义,就是说这个Memtable的内容是不可更改的,只能读不能写入或者删除。新到来的数据被记入新的Log文件和Memtable,LevelDb后台调度会将Immutable Memtable的数据导出到磁盘,形成一个新的SSTable文件。SSTable就是由内存中的数据不断导出并进行Compaction操作后形成的,而且SSTable的所有文件是一种层级结构,第一层为Level 0,第二层为Level 1,依次类推,层级逐渐增高,这也是为何称之为LevelDb的原因。

SSTable中的文件是Key有序的,就是说在文件中小key记录排在大Key记录之前,各个Level的SSTable都是如此,但是这里需要注意的一点是:Level 0的SSTable文件(后缀为.sst)和其它Level的文件相比有特殊性:这个层级内的.sst文件,两个文件可能存在key重叠,比如有两个level 0的sst文件,文件A和文件B,文件A的key范围是:{bar, car},文件B的Key范围是{blue,samecity},那么很可能两个文件都存在key=”blood”的记录。对于其它Level的SSTable文件来说,则不会出现同一层级内.sst文件的key重叠现象,就是说Level L中任意两个.sst文件,那么可以保证它们的key值是不会重叠的。这点需要特别注意,后面您会看到很多操作的差异都是由于这个原因造成的。

SSTable中的某个文件属于特定层级,而且其存储的记录是key有序的,那么必然有文件中的最小key和最大key,这是非常重要的信息,LevelDb应该记下这些信息。Manifest就是干这个的,它记载了SSTable各个文件的管理信息,比如属于哪个Level,文件名称叫啥,最小key和最大key各自是多少。下图是Manifest所存储内容的示意:

图2.1:Manifest存储示意图

图中只显示了两个文件(manifest会记载所有SSTable文件的这些信息),即Level 0的test.sst1和test.sst2文件,同时记载了这些文件各自对应的key范围,比如test.sstt1的key范围是“an”到 “banana”,而文件test.sst2的key范围是“baby”到“samecity”,可以看出两者的key范围是有重叠的。

Current文件是干什么的呢?这个文件的内容只有一个信息,就是记载当前的manifest文件名。因为在LevleDb的运行过程中,随着Compaction的进行,SSTable文件会发生变化,会有新的文件产生,老的文件被废弃,Manifest也会跟着反映这种变化,此时往往会新生成Manifest文件来记载这种变化,而Current则用来指出哪个Manifest文件才是我们关心的那个Manifest文件。

以上介绍的内容就构成了LevelDb的整体静态结构,在LevelDb日知录接下来的内容中,我们会首先介绍重要文件或者内存数据的具体数据布局与结构。

LevelDb日知录之三:log文件

上节内容讲到log文件在LevelDb中的主要作用是系统故障恢复时,能够保证不会丢失数据。因为在将记录写入内存的Memtable之前,会先写入Log文件,这样即使系统发生故障,Memtable中的数据没有来得及Dump到磁盘的SSTable文件,LevelDB也可以根据log文件恢复内存的Memtable数据结构内容,不会造成系统丢失数据,在这点上LevelDb和Bigtable是一致的。

下面我们带大家看看log文件的具体物理和逻辑布局是怎样的,LevelDb对于一个log文件,会把它切割成以32K为单位的物理Block,每次读取的单位以一个Block作为基本读取单位,下图展示的log文件由3个Block构成,所以从物理布局来讲,一个log文件就是由连续的32K大小Block构成的。

图3.1 log文件布局

        在应用的视野里是看不到这些Block的,应用看到的是一系列的Key:Value对,在LevelDb内部,会将一个Key:Value对看做一条记录的数据,另外在这个数据前增加一个记录头,用来记载一些管理信息,以方便内部处理,图3.2显示了一个记录在LevelDb内部是如何表示的。

 

图3.2 记录结构

       记录头包含三个字段,ChechSum是对“类型”和“数据”字段的校验码,为了避免处理不完整或者是被破坏的数据,当LevelDb读取记录数据时候会对数据进行校验,如果发现和存储的CheckSum相同,说明数据完整无破坏,可以继续后续流程。“记录长度”记载了数据的大小,“数据”则是上面讲的Key:Value数值对,“类型”字段则指出了每条记录的逻辑结构和log文件物理分块结构之间的关系,具体而言,主要有以下四种类型:FULL/FIRST/MIDDLE/LAST。

        如果记录类型是FULL,代表了当前记录内容完整地存储在一个物理Block里,没有被不同的物理Block切割开;如果记录被相邻的物理Block切割开,则类型会是其他三种类型中的一种。我们以图3.1所示的例子来具体说明。

       假设目前存在三条记录,Record A,Record B和Record C,其中Record A大小为10K,Record B 大小为80K,Record C大小为12K,那么其在log文件中的逻辑布局会如图3.1所示。Record A是图中蓝色区域所示,因为大小为10K<32K,能够放在一个物理Block中,所以其类型为FULL;Record B 大小为80K,而Block 1因为放入了Record A,所以还剩下22K,不足以放下Record B,所以在Block 1的剩余部分放入Record B的开头一部分,类型标识为FIRST,代表了是一个记录的起始部分;Record B还有58K没有存储,这些只能依次放在后续的物理Block里面,因为Block 2大小只有32K,仍然放不下Record B的剩余部分,所以Block 2全部用来放Record B,且标识类型为MIDDLE,意思是这是Record B中间一段数据;Record B剩下的部分可以完全放在Block 3中,类型标识为LAST,代表了这是Record B的末尾数据;图中黄色的Record C因为大小为12K,Block 3剩下的空间足以全部放下它,所以其类型标识为FULL。

     从这个小例子可以看出逻辑记录和物理Block之间的关系,LevelDb一次物理读取为一个Block,然后根据类型情况拼接出逻辑记录,供后续流程处理。

LevelDb日知录之四:SSTable文件

SSTable是Bigtable中至关重要的一块,对于LevelDb来说也是如此,对LevelDb的SSTable实现细节的了解也有助于了解Bigtable中一些实现细节。

本节内容主要讲述SSTable的静态布局结构,我们曾在“LevelDb日知录之二:整体架构”中说过,SSTable文件形成了不同Level的层级结构,至于这个层级结构是如何形成的我们放在后面Compaction一节细说。本节主要介绍SSTable某个文件的物理布局和逻辑布局结构,这对了解LevelDb的运行过程很有帮助。

LevelDb不同层级有很多SSTable文件(以后缀.sst为特征),所有.sst文件内部布局都是一样的。上节介绍Log文件是物理分块的,SSTable也一样会将文件划分为固定大小的物理存储块,但是两者逻辑布局大不相同,根本原因是:Log文件中的记录是Key无序的,即先后记录的key大小没有明确大小关系,而.sst文件内部则是根据记录的Key由小到大排列的,从下面介绍的SSTable布局可以体会到Key有序是为何如此设计.sst文件结构的关键。

图4.1 .sst文件的分块结构

  图4.1展示了一个.sst文件的物理划分结构,同Log文件一样,也是划分为固定大小的存储块,每个Block分为三个部分,红色部分是数据存储区, 蓝色的Type区用于标识数据存储区是否采用了数据压缩算法(Snappy压缩或者无压缩两种),CRC部分则是数据校验码,用于判别数据是否在生成和传输中出错。

以上是.sst的物理布局,下面介绍.sst文件的逻辑布局,所谓逻辑布局,就是说尽管大家都是物理块,但是每一块存储什么内容,内部又有什么结构等。图4.2展示了.sst文件的内部逻辑解释。

图4.2 逻辑布局

  从图4.2可以看出,从大的方面,可以将.sst文件划分为数据存储区和数据管理区,数据存储区存放实际的Key:Value数据,数据管理区则提供一些索引指针等管理数据,目的是更快速便捷的查找相应的记录。两个区域都是在上述的分块基础上的,就是说文件的前面若干块实际存储KV数据,后面数据管理区存储管理数据。管理数据又分为四种不同类型:紫色的Meta Block,红色的MetaBlock 索引和蓝色的数据索引块以及一个文件尾部块。

LevelDb 1.2版对于Meta Block尚无实际使用,只是保留了一个接口,估计会在后续版本中加入内容,下面我们看看数据索引区和文件尾部Footer的内部结构。

图4.3 数据索引

  图4.3是数据索引的内部结构示意图。再次强调一下,Data Block内的KV记录是按照Key由小到大排列的,数据索引区的每条记录是对某个Data Block建立的索引信息,每条索引信息包含三个内容,以图4.3所示的数据块i的索引Index i来说:红色部分的第一个字段记载大于等于数据块i中最大的Key值的那个Key,第二个字段指出数据块i在.sst文件中的起始位置,第三个字段指出Data Block i的大小(有时候是有数据压缩的)。后面两个字段好理解,是用于定位数据块在文件中的位置的,第一个字段需要详细解释一下,在索引里保存的这个Key值未必一定是某条记录的Key,以图4.3的例子来说,假设数据块i 的最小Key=“samecity”,最大Key=“the best”;数据块i+1的最小Key=“the fox”,最大Key=“zoo”,那么对于数据块i的索引Index i来说,其第一个字段记载大于等于数据块i的最大Key(“the best”)同时要小于数据块i+1的最小Key(“the fox”),所以例子中Index i的第一个字段是:“the c”,这个是满足要求的;而Index i+1的第一个字段则是“zoo”,即数据块i+1的最大Key。

  文件末尾Footer块的内部结构见图4.4,metaindex_handle指出了metaindex block的起始位置和大小;inex_handle指出了index Block的起始地址和大小;这两个字段可以理解为索引的索引,是为了正确读出索引值而设立的,后面跟着一个填充区和魔数。

图4.4 Footer

  上面主要介绍的是数据管理区的内部结构,下面我们看看数据区的一个Block的数据部分内部是如何布局的(图4.1中的红色部分),图4.5是其内部布局示意图。

图4.5 数据Block内部结构

  从图中可以看出,其内部也分为两个部分,前面是一个个KV记录,其顺序是根据Key值由小到大排列的,在Block尾部则是一些“重启点”(Restart Point),其实是一些指针,指出Block内容中的一些记录位置。

“重启点”是干什么的呢?我们一再强调,Block内容里的KV记录是按照Key大小有序的,这样的话,相邻的两条记录很可能Key部分存在重叠,比如key i=“the Car”,Key i+1=“the color”,那么两者存在重叠部分“the c”,为了减少Key的存储量,Key i+1可以只存储和上一条Key不同的部分“olor”,两者的共同部分从Key i中可以获得。记录的Key在Block内容部分就是这么存储的,主要目的是减少存储开销。“重启点”的意思是:在这条记录开始,不再采取只记载不同的Key部分,而是重新记录所有的Key值,假设Key i+1是一个重启点,那么Key里面会完整存储“the color”,而不是采用简略的“olor”方式。Block尾部就是指出哪些记录是这些重启点的。

图4.6 记录格式

  在Block内容区,每个KV记录的内部结构是怎样的?图4.6给出了其详细结构,每个记录包含5个字段:key共享长度,比如上面的“olor”记录, 其key和上一条记录共享的Key部分长度是“the c”的长度,即5;key非共享长度,对于“olor”来说,是4;value长度指出Key:Value中Value的长度,在后面的Value内容字段中存储实际的Value值;而key非共享内容则实际存储“olor”这个Key字符串。

上面讲的这些就是.sst文件的全部内部奥秘。

LevelDb日知录之五:MemTable详解

LevelDb日知录前述小节大致讲述了磁盘文件相关的重要静态结构,本小节讲述内存中的数据结构Memtable,Memtable在整个体系中的重要地位也不言而喻。总体而言,所有KV数据都是存储在Memtable,Immutable Memtable和SSTable中的,Immutable Memtable从结构上讲和Memtable是完全一样的,区别仅仅在于其是只读的,不允许写入操作,而Memtable则是允许写入和读取的。当Memtable写入的数据占用内存到达指定数量,则自动转换为Immutable Memtable,等待Dump到磁盘中,系统会自动生成新的Memtable供写操作写入新数据,理解了Memtable,那么Immutable Memtable自然不在话下。

LevelDb的MemTable提供了将KV数据写入,删除以及读取KV记录的操作接口,但是事实上Memtable并不存在真正的删除操作,删除某个Key的Value在Memtable内是作为插入一条记录实施的,但是会打上一个Key的删除标记,真正的删除操作是Lazy的,会在以后的Compaction过程中去掉这个KV。

需要注意的是,LevelDb的Memtable中KV对是根据Key大小有序存储的,在系统插入新的KV时,LevelDb要把这个KV插到合适的位置上以保持这种Key有序性。其实,LevelDb的Memtable类只是一个接口类,真正的操作是通过背后的SkipList来做的,包括插入操作和读取操作等,所以Memtable的核心数据结构是一个SkipList。

SkipList是由William Pugh发明。他在Communications of the ACM June 1990, 33(6) 668-676 发表了Skip lists: a probabilistic alternative to balanced trees,在该论文中详细解释了SkipList的数据结构和插入删除操作。

SkipList是平衡树的一种替代数据结构,但是和红黑树不相同的是,SkipList对于树的平衡的实现是基于一种随机化的算法的,这样也就是说SkipList的插入和删除的工作是比较简单的。

关于SkipList的详细介绍可以参考这篇文章:http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html,讲述的很清楚,LevelDb的SkipList基本上是一个具体实现,并无特殊之处。

SkipList不仅是维护有序数据的一个简单实现,而且相比较平衡树来说,在插入数据的时候可以避免频繁的树节点调整操作,所以写入效率是很高的,LevelDb整体而言是个高写入系统,SkipList在其中应该也起到了很重要的作用。Redis为了加快插入操作,也使用了SkipList来作为内部实现数据结构。

LevelDb日知录之六 写入与删除记录

在之前的五节LevelDb日知录中,我们介绍了LevelDb的一些静态文件及其详细布局,从本节开始,我们看看LevelDb的一些动态操作,比如读写记录,Compaction,错误恢复等操作。

       本节介绍levelDb的记录更新操作,即插入一条KV记录或者删除一条KV记录。levelDb的更新操作速度是非常快的,源于其内部机制决定了这种更新操作的简单性。

图6.1 LevelDb写入记录

        图6.1是levelDb如何更新KV数据的示意图,从图中可以看出,对于一个插入操作Put(Key,Value)来说,完成插入操作包含两个具体步骤:首先是将这条KV记录以顺序写的方式追加到之前介绍过的log文件末尾,因为尽管这是一个磁盘读写操作,但是文件的顺序追加写入效率是很高的,所以并不会导致写入速度的降低;第二个步骤是:如果写入log文件成功,那么将这条KV记录插入内存中的Memtable中,前面介绍过,Memtable只是一层封装,其内部其实是一个Key有序的SkipList列表,插入一条新记录的过程也很简单,即先查找合适的插入位置,然后修改相应的链接指针将新记录插入即可。完成这一步,写入记录就算完成了,所以一个插入记录操作涉及一次磁盘文件追加写和内存SkipList插入操作,这是为何levelDb写入速度如此高效的根本原因。

从上面的介绍过程中也可以看出:log文件内是key无序的,而Memtable中是key有序的。那么如果是删除一条KV记录呢?对于levelDb来说,并不存在立即删除的操作,而是与插入操作相同的,区别是,插入操作插入的是Key:Value 值,而删除操作插入的是“Key:删除标记”,并不真正去删除记录,而是后台Compaction的时候才去做真正的删除操作。

levelDb的写入操作就是如此简单。真正的麻烦在后面将要介绍的读取操作中。

LevelDb日知录之七:读取记录

LevelDb是针对大规模Key/Value数据的单机存储库,从应用的角度来看,LevelDb就是一个存储工具。而作为称职的存储工具,常见的调用接口无非是新增KV,删除KV,读取KV,更新Key对应的Value值这么几种操作。LevelDb的接口没有直接支持更新操作的接口,如果需要更新某个Key的Value,你可以选择直接生猛地插入新的KV,保持Key相同,这样系统内的key对应的value就会被更新;或者你可以先删除旧的KV, 之后再插入新的KV,这样比较委婉地完成KV的更新操作。

假设应用提交一个Key值,下面我们看看LevelDb是如何从存储的数据中读出其对应的Value值的。图7-1是LevelDb读取过程的整体示意图。

图7-1  LevelDb读取记录流程

  LevelDb首先会去查看内存中的Memtable,如果Memtable中包含key及其对应的value,则返回value值即可;如果在Memtable没有读到key,则接下来到同样处于内存中的Immutable Memtable中去读取,类似地,如果读到就返回,若是没有读到,那么只能万般无奈下从磁盘中的大量SSTable文件中查找。因为SSTable数量较多,而且分成多个Level,所以在SSTable中读数据是相当蜿蜒曲折的一段旅程。总的读取原则是这样的:首先从属于level 0的文件中查找,如果找到则返回对应的value值,如果没有找到那么到level 1中的文件中去找,如此循环往复,直到在某层SSTable文件中找到这个key对应的value为止(或者查到最高level,查找失败,说明整个系统中不存在这个Key)。

那么为什么是从Memtable到Immutable Memtable,再从Immutable Memtable到文件,而文件中为何是从低level到高level这么一个查询路径呢?道理何在?之所以选择这么个查询路径,是因为从信息的更新时间来说,很明显Memtable存储的是最新鲜的KV对;Immutable Memtable中存储的KV数据对的新鲜程度次之;而所有SSTable文件中的KV数据新鲜程度一定不如内存中的Memtable和Immutable Memtable的。对于SSTable文件来说,如果同时在level L和Level L+1找到同一个key,level L的信息一定比level L+1的要新。也就是说,上面列出的查找路径就是按照数据新鲜程度排列出来的,越新鲜的越先查找。

为啥要优先查找新鲜的数据呢?这个道理不言而喻,举个例子。比如我们先往levelDb里面插入一条数据{key=”www.samecity.com”  value=”我们”},过了几天,samecity网站改名为:69同城,此时我们插入数据{key=”www.samecity.com”  value=”69同城”},同样的key,不同的value;逻辑上理解好像levelDb中只有一个存储记录,即第二个记录,但是在levelDb中很可能存在两条记录,即上面的两个记录都在levelDb中存储了,此时如果用户查询key=”www.samecity.com”,我们当然希望找到最新的更新记录,也就是第二个记录返回,这就是为何要优先查找新鲜数据的原因。

前文有讲:对于SSTable文件来说,如果同时在level L和Level L+1找到同一个key,level L的信息一定比level L+1的要新。这是一个结论,理论上需要一个证明过程,否则会招致如下的问题:为神马呢?从道理上讲呢,很明白:因为Level L+1的数据不是从石头缝里蹦出来的,也不是做梦梦到的,那它是从哪里来的?Level L+1的数据是从Level L 经过Compaction后得到的(如果您不知道什么是Compaction,那么……..也许以后会知道的),也就是说,您看到的现在的Level L+1层的SSTable数据是从原来的Level L中来的,现在的Level L比原来的Level L数据要新鲜,所以可证,现在的Level L比现在的Level L+1的数据要新鲜。

SSTable文件很多,如何快速地找到key对应的value值?在LevelDb中,level 0一直都爱搞特殊化,在level 0和其它level中查找某个key的过程是不一样的。因为level 0下的不同文件可能key的范围有重叠,某个要查询的key有可能多个文件都包含,这样的话LevelDb的策略是先找出level 0中哪些文件包含这个key(manifest文件中记载了level和对应的文件及文件里key的范围信息,LevelDb在内存中保留这种映射表), 之后按照文件的新鲜程度排序,新的文件排在前面,之后依次查找,读出key对应的value。而如果是非level 0的话,因为这个level的文件之间key是不重叠的,所以只从一个文件就可以找到key对应的value。

最后一个问题,如果给定一个要查询的key和某个key range包含这个key的SSTable文件,那么levelDb是如何进行具体查找过程的呢?levelDb一般会先在内存中的Cache中查找是否包含这个文件的缓存记录,如果包含,则从缓存中读取;如果不包含,则打开SSTable文件,同时将这个文件的索引部分加载到内存中并放入Cache中。 这样Cache里面就有了这个SSTable的缓存项,但是只有索引部分在内存中,之后levelDb根据索引可以定位到哪个内容Block会包含这条key,从文件中读出这个Block的内容,在根据记录一一比较,如果找到则返回结果,如果没有找到,那么说明这个level的SSTable文件并不包含这个key,所以到下一级别的SSTable中去查找。

从之前介绍的LevelDb的写操作和这里介绍的读操作可以看出,相对写操作,读操作处理起来要复杂很多,所以写的速度必然要远远高于读数据的速度,也就是说,LevelDb比较适合写操作多于读操作的应用场合。而如果应用是很多读操作类型的,那么顺序读取效率会比较高,因为这样大部分内容都会在缓存中找到,尽可能避免大量的随机读取操作。

LevelDb日知录之八:Compaction操作

前文有述,对于LevelDb来说,写入记录操作很简单,删除记录仅仅写入一个删除标记就算完事,但是读取记录比较复杂,需要在内存以及各个层级文件中依照新鲜程度依次查找,代价很高。为了加快读取速度,levelDb采取了compaction的方式来对已有的记录进行整理压缩,通过这种方式,来删除掉一些不再有效的KV数据,减小数据规模,减少文件数量等。

levelDb的compaction机制和过程与Bigtable所讲述的是基本一致的,Bigtable中讲到三种类型的compaction: minor ,major和full。所谓minor Compaction,就是把memtable中的数据导出到SSTable文件中;major compaction就是合并不同层级的SSTable文件,而full compaction就是将所有SSTable进行合并。

LevelDb包含其中两种,minor和major。

我们将为大家详细叙述其机理。

    先来看看minor Compaction的过程。Minor compaction 的目的是当内存中的memtable大小到了一定值时,将内容保存到磁盘文件中,图8.1是其机理示意图。

图8.1 minor compaction

     从8.1可以看出,当memtable数量到了一定程度会转换为immutable memtable,此时不能往其中写入记录,只能从中读取KV内容。之前介绍过,immutable memtable其实是一个多层级队列SkipList,其中的记录是根据key有序排列的。所以这个minor compaction实现起来也很简单,就是按照immutable memtable中记录由小到大遍历,并依次写入一个level 0 的新建SSTable文件中,写完后建立文件的index 数据,这样就完成了一次minor compaction。从图中也可以看出,对于被删除的记录,在minor compaction过程中并不真正删除这个记录,原因也很简单,这里只知道要删掉key记录,但是这个KV数据在哪里?那需要复杂的查找,所以在minor compaction的时候并不做删除,只是将这个key作为一个记录写入文件中,至于真正的删除操作,在以后更高层级的compaction中会去做。

当某个level下的SSTable文件数目超过一定设置值后,levelDb会从这个level的SSTable中选择一个文件(level>0),将其和高一层级的level+1的SSTable文件合并,这就是major compaction。

    我们知道在大于0的层级中,每个SSTable文件内的Key都是由小到大有序存储的,而且不同文件之间的key范围(文件内最小key和最大key之间)不会有任何重叠。Level 0的SSTable文件有些特殊,尽管每个文件也是根据Key由小到大排列,但是因为level 0的文件是通过minor compaction直接生成的,所以任意两个level 0下的两个sstable文件可能再key范围上有重叠。所以在做major compaction的时候,对于大于level 0的层级,选择其中一个文件就行,但是对于level 0来说,指定某个文件后,本level中很可能有其他SSTable文件的key范围和这个文件有重叠,这种情况下,要找出所有有重叠的文件和level 1的文件进行合并,即level 0在进行文件选择的时候,可能会有多个文件参与major compaction。

  levelDb在选定某个level进行compaction后,还要选择是具体哪个文件要进行compaction,levelDb在这里有个小技巧, 就是说轮流来,比如这次是文件A进行compaction,那么下次就是在key range上紧挨着文件A的文件B进行compaction,这样每个文件都会有机会轮流和高层的level 文件进行合并。

如果选好了level L的文件A和level L+1层的文件进行合并,那么问题又来了,应该选择level L+1哪些文件进行合并?levelDb选择L+1层中和文件A在key range上有重叠的所有文件来和文件A进行合并。

   也就是说,选定了level L的文件A,之后在level L+1中找到了所有需要合并的文件B,C,D…..等等。剩下的问题就是具体是如何进行major 合并的?就是说给定了一系列文件,每个文件内部是key有序的,如何对这些文件进行合并,使得新生成的文件仍然Key有序,同时抛掉哪些不再有价值的KV 数据。

    图8.2说明了这一过程。

图8.2 SSTable Compaction

  Major compaction的过程如下:对多个文件采用多路归并排序的方式,依次找出其中最小的Key记录,也就是对多个文件中的所有记录重新进行排序。之后采取一定的标准判断这个Key是否还需要保存,如果判断没有保存价值,那么直接抛掉,如果觉得还需要继续保存,那么就将其写入level L+1层中新生成的一个SSTable文件中。就这样对KV数据一一处理,形成了一系列新的L+1层数据文件,之前的L层文件和L+1层参与compaction 的文件数据此时已经没有意义了,所以全部删除。这样就完成了L层和L+1层文件记录的合并过程。

那么在major compaction过程中,判断一个KV记录是否抛弃的标准是什么呢?其中一个标准是:对于某个key来说,如果在小于L层中存在这个Key,那么这个KV在major compaction过程中可以抛掉。因为我们前面分析过,对于层级低于L的文件中如果存在同一Key的记录,那么说明对于Key来说,有更新鲜的Value存在,那么过去的Value就等于没有意义了,所以可以删除。

 

LevelDb日知录之九 levelDb中的Cache

书接前文,前面讲过对于levelDb来说,读取操作如果没有在内存的memtable中找到记录,要多次进行磁盘访问操作。假设最优情况,即第一次就在level 0中最新的文件中找到了这个key,那么也需要读取2次磁盘,一次是将SSTable的文件中的index部分读入内存,这样根据这个index可以确定key是在哪个block中存储;第二次是读入这个block的内容,然后在内存中查找key对应的value。

levelDb中引入了两个不同的Cache:Table Cache和Block Cache。其中Block Cache是配置可选的,即在配置文件中指定是否打开这个功能。

图9.1 table cache

   图9.1是table cache的结构。在Cache中,key值是SSTable的文件名称,Value部分包含两部分,一个是指向磁盘打开的SSTable文件的文件指针,这是为了方便读取内容;另外一个是指向内存中这个SSTable文件对应的Table结构指针,table结构在内存中,保存了SSTable的index内容以及用来指示block cache用的cache_id ,当然除此外还有其它一些内容。

比如在get(key)读取操作中,如果levelDb确定了key在某个level下某个文件A的key range范围内,那么需要判断是不是文件A真的包含这个KV。此时,levelDb会首先查找Table Cache,看这个文件是否在缓存里,如果找到了,那么根据index部分就可以查找是哪个block包含这个key。如果没有在缓存中找到文件,那么打开SSTable文件,将其index部分读入内存,然后插入Cache里面,去index里面定位哪个block包含这个Key 。如果确定了文件哪个block包含这个key,那么需要读入block内容,这是第二次读取。

图9.2 block cache

   Block Cache是为了加快这个过程的,图9.2是其结构示意图。其中的key是文件的cache_id加上这个block在文件中的起始位置block_offset。而value则是这个Block的内容。

如果levelDb发现这个block在block cache中,那么可以避免读取数据,直接在cache里的block内容里面查找key的value就行,如果没找到呢?那么读入block内容并把它插入block cache中。levelDb就是这样通过两个cache来加快读取速度的。从这里可以看出,如果读取的数据局部性比较好,也就是说要读的数据大部分在cache里面都能读到,那么读取效率应该还是很高的,而如果是对key进行顺序读取效率也应该不错,因为一次读入后可以多次被复用。但是如果是随机读取,您可以推断下其效率如何。

LevelDb日知录之十 Version、VersionEdit、VersionSet

  Version 保存了当前磁盘以及内存中所有的文件信息,一般只有一个Version叫做”current” version(当前版本)。Leveldb还保存了一系列的历史版本,这些历史版本有什么作用呢?

当一个Iterator创建后,Iterator就引用到了current version(当前版本),只要这个Iterator不被delete那么被Iterator引用的版本就会一直存活。这就意味着当你用完一个Iterator后,需要及时删除它。

当一次Compaction结束后(会生成新的文件,合并前的文件需要删除),Leveldb会创建一个新的版本作为当前版本,原先的当前版本就会变为历史版本。

  VersionSet 是所有Version的集合,管理着所有存活的Version。

  VersionEdit 表示Version之间的变化,相当于delta 增量,表示有增加了多少文件,删除了文件。下图表示他们之间的关系。

Version0 +VersionEdit–>Version1

VersionEdit会保存到MANIFEST文件中,当做数据恢复时就会从MANIFEST文件中读出来重建数据。

leveldb的这种版本的控制,让我想到了双buffer切换,双buffer切换来自于图形学中,用于解决屏幕绘制时的闪屏问题,在服务器编程中也有用处。

比如我们的服务器上有一个字典库,每天我们需要更新这个字典库,我们可以新开一个buffer,将新的字典库加载到这个新buffer中,等到加载完毕,将字典的指针指向新的字典库。

leveldb的version管理和双buffer切换类似,但是如果原version被某个iterator引用,那么这个version会一直保持,直到没有被任何一个iterator引用,此时就可以删除这个version。

注:博文参考了郎格科技博客:http://www.samecity.com/blog/Index.asp?SortID=12

[repost ]levelDB源码分析-SSTable

original:http://blog.csdn.net/tankles/article/details/7663905

SSTable是Bigtable中至关重要的一块,对于LevelDB来说也是如此,对LevelDB的SSTable实现细节的了解也有助于了解Bigtable中一些实现细节。
本节内容主要讲述SSTable的静态布局结构,SSTable文件形成了不同Level的层级结构,至于这个层级结构是如何形成的我们放在后面Compaction一节细说。本节主要介绍SSTable某个文件的物理布局和逻辑布局结构,这对了解LevelDB的运行过程很有帮助。
LevelDB不同层级都有一个或多个SSTable文件(以后缀.sst为特征),所有.sst文件内部布局都是一样的。上节介绍Log文件是物理分块的,SSTable也一样会将文件划分为固定大小的物理存储块Block,但是两者逻辑布局大不相同,根本原因是:Log文件中的记录是Key无序的,即先后记录的key大小没有明确大小关系,而.sst文件内部则是根据记录的Key由小到大排列的,从下面介绍的SSTable布局可以体会到Key有序是为何如此设计.sst文件结构的关键。

图1 .sst文件的分块结构

图1展示了一个.sst文件的物理划分结构,同Log文件一样,也是划分为固定大小的存储块,每个Block分为三个部分,包括Block、Type和CRC。Block为数据存储区,Type区用于标识Block中数据是否采用了数据压缩算法(Snappy压缩或者无压缩两种),CRC部分则是Block数据校验码,用于判别数据是否在生成和传输中出错。
以上是.sst的物理布局,下面介绍.sst文件的逻辑布局,所谓逻辑布局,就是说尽管大家都是物理块,但是每一块存储什么内容,内部又有什么结构等。图4.2展示了.sst文件的内部逻辑解释。

图2 逻辑布局

从图2可以看出,从大的方面,可以将.sst文件划分为数据存储区和数据管理区,数据存储区存放实际的Key:Value数据,数据管理区则提供一些索引指针等管理数据,目的是更快速便捷的查找相应的记录。两个区域都是在上述的分块基础上的,就是说文件的前面若干块实际存储KV数据,后面数据管理区存储管理数据。管理数据又分为四种不同类型:紫色的Meta Block,红色的MetaBlock Index和蓝色的Index block以及一个文件尾部块Footer。
LevelDB 1.2版对于Meta Block尚无实际使用,只是保留了一个接口,估计会在后续版本中加入内容,下面我们看看Index block和文件尾部Footer的内部结构。

 
图3 Index block结构

图3是Index block的内部结构示意图。再次强调一下,Data Block内的KV记录是按照Key由小到大排列的,Index block的每条记录是对某个Data Block建立的索引信息,每条索引信息包含三个内容:Data Block中key上限值(不一定是最大key)、Data Block在.sst文件的偏移和大小,以图3所示的数据块i的索引Index i来说:红色部分的第一个字段记载大于等于数据块i中最大的Key值的那个Key,第二个字段指出数据块i在.sst文件中的起始位置,第三个字段指出Data Block i的大小(有时候是有数据压缩的)。后面两个字段好理解,是用于定位数据块在文件中的位置的,第一个字段需要详细解释一下,在索引里保存的这个Key值未必一定是某条记录的Key,以图3的例子来说,假设数据块i 的最小Key=“samecity”,最大Key=“the best”;数据块i+1的最小Key=“the fox”,最大Key=“zoo”,那么对于数据块i的索引Index i来说,其第一个字段记载大于等于数据块i的最大Key(“the best”),同时要小于数据块i+1的最小Key(“the fox”),所以例子中Index i的第一个字段是:“the c”,这个是满足要求的;而Index i+1的第一个字段则是“zoo”,即数据块i+1的最大Key。
文件末尾Footer块的内部结构见图4,metaindex_handle指出了metaindex block的起始位置和大小;inex_handle指出了index Block的起始地址和大小;这两个字段可以理解为索引的索引,是为了正确读出索引值而设立的,后面跟着一个填充区和魔数(0xdb4775248b80fb57)。

 
图4 Footer

上面主要介绍的是数据管理区的内部结构,下面我们看看数据区的一个Block的数据部分内部是如何布局的,图5是其内部布局示意图。


图5 Data Block内部结构

从图中可以看出,其内部也分为两个部分,前面是一个个KV记录,其顺序是根据Key值由小到大排列的,在Block尾部则是一些“重启点”(Restart Point),其实是一些指针,指出Block内容中的一些记录位置。
“重启点”是干什么的呢?简单来说就是进行数据压缩,减少存储空间。我们一再强调,Block内容里的KV记录是按照Key大小有序的,这样的话,相邻的两条记录很可能Key部分存在重叠,比如key i=“the car”,Key i+1=“the color”,那么两者存在重叠部分“the c”,为了减少Key的存储量,Key i+1可以只存储和上一条Key不同的部分“olor”,两者的共同部分从Key i中可以获得。记录的Key在Block内容部分就是这么存储的,主要目的是减少存储开销。“重启点”的意思是:在这条记录开始,不再采取只记载不同的Key部分,而是重新记录所有的Key值,假设Key i+1是一个重启点,那么Key里面会完整存储“the color”,而不是采用简略的“olor”方式。但是如果记录条数比较多,随机访问一条记录,需要从头开始一直解析才行,这样也产生很大的开销,所以设置了多个重启点,Block尾部就是指出哪些记录是这些重启点的。


图6 记录格式

在Block内容区,每个KV记录的内部结构是怎样的?图6给出了其详细结构,每个记录包含5个字段:key共享长度,key非共享长度,value长度,key非共享内容,value内容。比如上面的“the car”和“the color”记录,key共享长度5;key非共享长度是4;而key非共享内容则实际存储“olor”;value长度及内容分别指出Key:Value中Value的长度和存储实际的Value值。
上面讲的这些就是.sst文件的全部内部奥秘。

Block格式及相关操作请参阅《levelDB源码分析-SSTable:Block》。

SSTable造作相关请参阅《levelDB源码分析-SSTable:.sst文件的构建与读取》

[repost ]LevelDB – Fast And Lightweight Key/Value Database From The Authors Of MapReduce And BigTable

original:http://highscalability.com/blog/2011/8/10/leveldb-fast-and-lightweight-keyvalue-database-from-the-auth.html

LevelDB is an exciting new entrant into the pantheon of embedded databases, notable both for its pedigree, being authored by the makers of the now mythical Google MapReduce and BigTable products, and for its emphasis on efficient disk based random access using log-structured-merge (LSM) trees.

The plan is to keep LevelDB fairly low-level. The intention is that it will be a useful building block for higher-level storage systems. Basho is already investigating using LevelDB as one if its storage engines.

In the past many systems were built around embedded databases, though most developers now use database servers connected to via RPCs. An embedded database is a database distributed as a library and linked directly into your application. The application is responsible for providing a service level API, sharding, backups, initiating consistency checking, initiation rollback,  startup, shutdown, queries, etc. Applications become the container for the database and the manager of the database.

Architectures using embedded databases typically never expose a raw database abstraction at all. They have a service API and the services use the embedded database library call transparently behind the scene. Often an embedded database will provide multiple access types, like indexed access for key-value uses and btrees for range queries and cursors.

BerkelyDB is one well known example of an embedded database, SQLite is another, the file system is perhaps the most commonly used database, and there have been many many other btree libraries in common use. I’ve used C-tree on several projects. In a battle of old versus new, a user named IM46 compared Leveldb to BerkelyDB and found that LevelDB solidly outperforms Berkeley DB for larger databases.

Programmers usually thought doing this stuff was easy, wrote their own failed on-disk btree library (raises hand), and then look around for a proven product. It’s only relatively recently the databases have gone up market and included a network layer and higher level services.

Building a hybrid application/database architecture is still a very viable option when you want everything to be just so. If you are going to load balance requests across sharded application servers anyway, using a heavy weight external database infrastructure may not be necessary.

The LevelDB mailing list started off very active and has died down a bit, but is still nicely active and informative. Here are some excellent FAQish tips, performance suggestions, and porting issues extracted from the list:

  • Largest tested database: 1 billion entries with 16 byte keys and 100 byte values (roughly 100 GB of raw data, about half that after compression).
  • LevelDB has been Open Sourced.
  • Relationship between LevelDB and BigTable: The implementation of LevelDB is similar in spirit to the representation of a single Bigtable tablet (section 5.3). However the organization of the files that make up the representation is somewhat different and is explained [in source code comments]. They wanted to put together something like the BigTable tablet stack that had minimal dependencies and would be suitable for open sourcing, and also would be suitable for use in Chrome (for the IndexedDB implementation). LevelDB has the same general design as the BigTable tablet stack, but does not share any of the code.
  • Didier Spezia on log-structured-merge (LSM) trees: They are mostly useful to optimize random I/Os at insertion/delete time at the price of a slightly degradation of read access time. They are extremely efficient at indexing data in random order stored on rotational disks (i.e. better than b-trees).
  • Optimized for random writes. TokyoCabinet could be filled with a million 100-byte writes in less than two seconds if writing sequentially, but the time  ballooned to ~2000 seconds when writing randomly. The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to ~2.5 seconds.
  • In the tradition of BerkelyDB it’s a library you embed in your program, it’s not a server. You will have to add the networker layer, sharding etc if a single process won’t suffice.
  • Quite appropriately threading decisions are left to the application, the library is not thread safe. Threads sharing iterators, for example, will need to lock.
  • Data is written in sorted order.
  • C++ only.
  • Variable sized keys are used to save memory.
  • What leveldb does differently from B+trees is that it trades off write latency for write throughput: write latency is reduced by doing bulk writes, but the same data may be rewritten multiple times (at high throughput) in the background due to compactions.
  • Log-Structured Merge Trees offer better random write performance (compared to btrees).  It always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). Leveldb recovery code uses checksums to detect this and will skip the incomplete records.
  • Search performance is still O(lg N) with a very large branching factor (so the constant factor is small and number of seeks should be <= 10 even for gigantic databases).
  • One early user found performance degraded at around 200 million keys.
  • Bigger block sizes are better, increasing the block size to 256k (from 64k).
  • Batching writes increases performance substantially.
  • Every write will cause a log file to grow, regardless of whether or not you are writing to a key which already exists in the database, and regardless of whether or not you are overwriting a key with the exact same value.  Only background compactions will get rid of overwritten data.  So you should expect high cpu usage while you are inserting data, and also for a while afterwards as background compactions rearrange things.
  • LevelDB Benchmarks look good:
    • Using 16 byte keys at 100 byte values:
      • Sequential Reads: LevelDB 4,030,000 ops/sec; Kyoto TreeDB 1,010,000 ops/sec; SQLite3 186,000 ops/sec
      • Random Reads: LevelDB 129,000 ops/sec; Kyoto TreeDB 151,000 ops/sec; SQLite3146,000 ops/sec
      • Sequential Writes: LevelDB 779,000 ops/sec; Kyoto TreeDB 342,000 ops/sec; SQLite3 26,900 ops/sec
      • Random Writes: LevelDB 164,000 ops/sec; Kyoto TreeDB 88,500 ops/sec; SQLite3420 ops/sec
    • Writing large values of 100,000 bytes each: LevelDB is even Kyoto TreeDB. SQLite3 is nearly 3 times as fast. LevelDB writes keys and values at least twice.
    • A single batch of N writes may be significantly faster than N individual writes.
    • LevelDB’s performance improves greatly with more memory, a larger write buffer reduces the need to merge sorted files (since it creates a smaller number of larger sorted files).
    • Random read performance is much better in Kyoto TreeDB because it cached in RAM.
    • View many more results by following the link, but tha’s the jist of it.
  • InnoDB benchmarks as run by Basho.
    • LevelDB showed a higher throughput than InnoDB and a similar or lower latency than InnoDB.
    • LevelDB may become a preferred choice for Riak users whose data set has massive numbers of keys and therefore is a poor match with Bitcask’s model.
    • Before LevelDB can be a first-class storage engine under Riak it must be portable to all of the same platforms that Riak is supported on.
  • LEVELDB VS KYOTO CABINET MY FINDINGS. Ecstortive says wait a minute here, Kyoto is actually faster.
  • A good sign of adoption, language bindings are being built: JavaTie::LevelDB on CPAN
  • Comparing LevelDB and Bitcask: LevelDB is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration). Bitcask stores a fixed size record in memory for every key. So for databases with large number of keys, it may use too much memory for some applications. Bitcask can guarantee at most one disk seek per lookup I think. LevelDB may have to do a small handful of disk seeks. To clarify, leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks. Bitcask is a combination of Erlang and C.
  • Writes can be lost, but that doesn’t trash the data files: Leveldb never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). Leveldb recovery code uses checksums to detect this and will skip the incomplete records.
  • LevelDB is being used as the back-end for IndexedDB in Chrome. For designing how to map secondary indices into LevelDB key/values, look at how the IndexedDB support within Chrome is implemented.
  • In case of a crash partial writes are ignored.
  • Possible scalability issues:
    • LevelDB keeps a separate file for every couple of MB of data, and these are all in one directory.  Depending on the underlying file system, this might start causing trouble at some point.
    • Scalability is more limited by the frequency of reads and writes that are being done, rather than the number of bytes in the system.
  • Transactions are not supported. Writes (including batches) are atomic.  Consistency is up to you.  There is limited isolation support.  Durability is a configurable option.  Full blown ACID transactions require a layer on top of LevelDB (see WebKit’s IndexedDB).
  • Michi Mutsuzaki compared LevelDB to MySQL as a key-value store. LevelDB had better overall insert throughput, but it was less stable (high variation in throughput and latency) than mysql. There was no significant performance difference for 80% read / 20% update workload.
  • LevelDB hasn’t been tuned for lots of concurrent readers and writers. Possible future enhancements:
    1. Do not hold the mutex while the writer is appending to the log (allow concurrent readers to proceed)
    2. Implement group commit (so concurrent writers have their writes grouped together).

Related Articles

[repost ]Survey of Distributed Databases

orignal:http://dbpedias.com/wiki/NoSQL:Survey_of_Distributed_Databases

 

Survey of Distributed Databases

 

Contents

[hide]

Overview

This document, researched and authored by Randy Guck, provides a summary of distributed databases. These are commercial products, open source projects, and research technologies that support massive data storage (petabyte+) using an architecture that distributes storage and processing across multiple servers. These can be considered “Internet age” databases that are being used by Amazon, Facebook, Google and the like to address performance and scalability requirements that cannot be met by traditional relational databases. Due to their contrast in priorities and architecture compared to relational databases, these technologies are loosely referred to as “NoSQL” databases, though an absence of SQL is not a requirement.

Distributed Database Concepts

This section describes concepts that constitute the nature of modern distributed databases.

NoSQL Databases

Meaning “no SQL”, this is a term that casually describes the new breed of databases that are appearing largely in response to the limitations of existing relational databases. Strictly speaking, there is no reason any distributed database couldn’t support SQL, but the implication of the term NoSQL is that relational databases are the antithesis of the goals of the modern, distributed database.[1] For example, instead of supporting ACID transactions, which is the mainstay transactional model for relational databases, many of the new-generation databases support BASE principles (described separately). Virtually all of the databases in this paper are NoSQL databases, though some support SQL or an SQL-like query language.

There is no precise definition of what constitutes a NoSQL database, but these databases tend to have most or all of the following tendencies:

  • Schema-less: “Tables” don’t have a pre-defined schema. Records have a variable number of fields that can vary from record to record. Record contents and semantics are enforced by applications.
  • Shared nothing architecture: Instead of using a common storage pool (e.g., SAN), each server uses only its own local storage. This allows storage to be accessed at local disk speeds instead of network speeds, and it allows capacity to be increased by adding more nodes. Cost is also reduced since commodity hardware can be used.
  • Elasticity: Both storage and server capacity can be added on-the-fly by merely adding more servers. No downtime is required. When a new node is added, the database begins giving it something to do and requests to fulfill.
  • Sharding: Instead of viewing the storage as a monolithic space, records are partitioned into shards. Usually, a shard is small enough to be managed by a single server, though shards are usually replicated. Sharding can be automatic (e.g., an existing shard splits when it gets too big), or applications can assist in data sharding by assigning each record a partition ID.
  • Asynchronous replication: Compared to RAID storage (mirroring and/or striping) or synchronous replication, NoSQL databases employ asynchronous replication. This allows writes to complete more quickly since they don’t depend on extra network traffic. One side effect of this strategy is that data is not immediately replicated and could be lost in certain windows. Also, locking is usually not available to protect all copies of a specific unit of data.
  • BASE instead of ACID: NoSQL databases emphasize performance and availability. This requires prioritizing the components of the CAP theorem (described elsewhere) that tends to make true ACID transactions implausible.
  1.  To prevent the misunderstanding that SQL somehow cannot be used, some have re-defined NoSQL as meaning “not only SQL”.

Database Types by Entity Type

The new breed of NoSQL databases can be categorized several ways. This section proposes a taxonomy that describes databases from lower to high levels of functionality based on the type of entity that each supports. An entity could be a tuple, record, document, or something else. Generally, more functionality is available with the more complex entity types. Strictly speaking, however, higher level databases are not necessarily supersets of the database “below” them.

Distributed Memory Caches

A memory cache sits on top of a persistent store such as a SQL database, caching most-recently-used data, typically records or field values. At a high level, a memory cache is just a hash table: every value is accessed via a key. The open source project Memcached was one of the first to expand the memory cache model and create the notion of distributed caching. It allows a request on one node to fetch the value from any node in the network. Consequently, the total size of the in-memory cache is the sum of the memory cache on all nodes. Using commodity hardware, more nodes can be added to increase the cache size and therefore overall performance of the application. Intended to accelerate web applications, Memcached demonstrates that any data-intensive application can enjoy significant performance and scalability gains using the distributed cache architecture.

Memcached is the most-used open source distributed cache, but there are several commercial distributed caches that provide similar functionality. Examples are Oracle’s Coherence, GigaStore’sXAP In-Memory Grid, and IBM’s WebSphere eXtreme Scale.

Distributed caches prompted developers to view distributed databases in a new way, by inverting the concept of a distributed cache. Instead of providing a caching layer to an existing data store, some new distributed databases treat the distributed hash table as the database, backed by a persistent store. Such implementations do not require a backend SQL database but instead provide their own persistence. This gives rise to the key/value database.

Key/Value Databases

These databases provide an efficient key to value persistent map. Compared to more sophisticated databases, they are very limited because they provide only a single way to efficiently access values. Auxiliary paths to rows must be managed externally, e.g., via Lucene or an application-managed index. Most of the databases described in this paper are key/value stores at their core, often providing additional functionality for access by secondary values.

Examples of key/value databases without additional indexing facilities are:

  • Berkeley DB
  • HBase
  • MemcacheDB
  • Redis
  • SimpleDB
  • Tokyo Cabinet/Tyrant
  • Voldemort
  • Riak

“Big Table” Databases

These are also called record-oriented or tabular databases. The term big table has become popular due to Google’s proprietary BigTable implementation. Similar to relational databases, a big table database consists of multiple tables, each containing a set of addressable rows. Each row consists of a set of values that are considered columns. However, due to the scalability and storage requirements imposed by the applications for which they are used, big table databases differ from relational tables in several important ways:

  • Each row can have a different set of columns. Although all rows in a given table may be required to have a pre-defined set of column groups, individual rows can differ by specific columns within these groups. This means that a table can have different values from row to row. Furthermore, the columns do not need to be pre-defined in a schema but instead can be added dynamically.
  • Tables are intended to have many more columns than a typical relational database. A row could contain thousands of columns. Some big table databases support millions of column values per row.
  • All big table database support compound values, which means, compare to a key/value store, a record can have multiple fields. However, some only support scalar values within a record, requiring BLOBs and other large values to be stored separately.
  • Rows are typically versioned. This means that multiple copies of the same row may exist (i.e., with the same row ID). Rows are typically versioned by a system-assigned timestamp.
  • Data storage is typically divided into shards, which are independently managed.
  • Sometimes only updates to a single row are considered atomic. In this case, multi-row updates are performed in separate transactions, and inter-row consistency is handled through soft-transaction techniques. Alternatively, a big table may allow multi-row updates but only when restricted to rows in the same shard.

In addition to Google’s BigTable database (which is surfaced in the Google App Engine asdatastore), other examples are:

  • Azure Tables (Microsoft)
  • Cassandra (Apache)
  • HBase (Apache Hadoop project)
  • Hypertable
  • SimpleDB (Amazon)
  • Voldemort (LinkedIn, now open source)

Document Databases

Also known as document-oriented databases, these databases focus on storage and access optimized for documents as opposed to rows or records. Essentially this an API issue since document database entities are just records with multiple fields. However, the focus is programmatic access better suited for documents. Consequently, they emphasize up-time, scalability, and similar features less than other distributed databases. All of the current document databases support documents in JSON format.

Some document databases provide RDBMS-like capabilities such as SQL or an SQL-like query language. Some document databases are also big table databases; some provide MapReduce processing; some are also columnar databases. This means that the storage and access implementation is independent of the database’s orientation towards document-oriented databases.

Examples:

  • CouchDB
  • MongoDB
  • Terrastore
  • ThruDB

Graph Databases

Graph databases use nodesedges, and properties as primary elements, providing a sharp contrast to the tables, rows, and columns of the relational model. Additionally, they emphasize high performance for associative data access, preventing the need for joins. (Eifr√©m and Skogman say that graph DBs are “whiteboard friendly”.)

As with document databases, graph databases may be implemented with different storage techniques. The graph nature of these databases exists mainly in the data model presented to applications. Under the hood, a graph database may be a key/value store, a columnar database, a “big table” database, or some combination of these and other architectures.

One feature that graph DBs possess is the ability for a field value to directly store the ID of another entity. Moreover, graph DBs typically allow collections of IDs to be stored as a single value using an array or map layout. Consequently, navigating from one entity, say a node, to its related entities, such as its edges, can be performed quickly without maintaining secondary indexes. When multi-value collections are stored in both directions of an edge type, the database can quickly navigate many-to-many relationships in either direction.

Examples:

  • AllegroGraph RDFStore
  • HyperGraphDB
  • InfoGrid
  • Neo4J

CAP Theorem

The CAP theorem was proposed in a keynote by Eric Brewer in 2000 on the Principles of Distributed Computing. Basically, the theorem states that you can optimize for only two of three priorities in a distributed database:

  • Consistency: This is essentially the same as “atomicity” in ACID transactions. It is the principle that, for example, prevents two customers from both buying the last copy of a book in inventory. Atomicity guarantees that only one user can (a) lock the book object, (b) decrement its inventory count, and (c) add the book item to the user’s shopping cart, all in one transaction.
  • Availability: This means that the database, or more importantly the services that use it, are entirely available. Ensuring availability requires technologies such as replication and parallel processing so that changes in demand can be met while maintaining a minimum response time.
  • Partition Tolerance: In a distributed network, “partitions” form when certain failures occur, such as a network cable being unplugged. This gives rise to conflicts and ambiguities as to where data lives and who has control over it. Partition tolerance is defined as: “No set of failures less than total network failure is allowed to cause the system to respond incorrectly”.

The significance of the CAP theorem is that applications must understand their priorities carefully and optimize their distributed database solution correspondingly. Amazon, for example, claims that that just an extra one tenth of a second on their response times will cost them 1% in sales. Similarly, Google says they’ve noticed that just a half second increase in latency caused traffic to drop by a fifth. In these cases, availability may be more important than consistency.

Columnar Databases

Also known as column-oriented databases, columnar databases store column values for multiple rows in the same block instead of storing multiple rows within the same block. This is faster because (a) many tables contain lots of columns, and (b) few queries need all of the columns at once. So, rather than fetch lots of blocks and throwing away most of the data, columnar databases only perform I/Os to the blocks corresponding to columns that are actually being read (or updated). In addition to the smaller I/O overhead, this uses memory more efficiently and allows more effective blocking based on each column’s type (date, text, etc.)

Some columnar databases also support compression, which affords many advantages. Because multiple values for the same column are stored in each block, it often compresses far greater than a single row or a block of rows would compress. Furthermore, custom compression techniques can be used for each column type, such as one that works well for text data and another that works well for timestamps. The denser storage yielded by compression reduces I/O transfer time and overall database size. To minimize the CPU time required when compressed blocks are read, special “real time” compression techniques are sometimes used. These algorithms do not compress as densely as say GZIP, but they provide significant compression with much less CPU time.

Examples:

  • Alterian
  • C-Store
  • HBase
  • QD Technology
  • SmartFocus
  • Vertica

Elasticity and BASE

Elasticity describes databases that can be dynamically expanded. That is, additional servers and/or storage can be added without taking the DBMS software down. Seminal research on this topic was performing by Amazon in defining their proprietary Dynamo key/value database. When new storage is added to the mesh, some subset of data begins to be replicated to it. When a new server is added, it begins to “steal” work from other servers. The elasticity of storage and server resources has given rise to a new database paradigm known as BASE.

Without question, the acronym BASE was deliberately chosen to contrast to the ACID paradigm.

  • Basically Available: This means that most data is available most of the time. A failure could cause some data to not be available, but only catastrophic failures cause everything to be down.
  • Soft state: This means the DB provides a relaxed view of data in terms of consistency. For example, an inventory value may be examined that is not precisely up-to-date. But for ordering purposes, approximations are OK.
  • Eventually consistent: As data is replicated throughout the system, such as when new storage nodes are added, it is eventually copied to all applicable nodes. But there is no requirement for all nodes to have identical copies of any given data all the time.

Elastic databases that emphasize BASE principles necessarily must give up the traditional paradigm of ACID transactions: atomicity, consistency, isolation, and durability. However, various implementations can prioritize certain requirements such as availability over other requirements such as consistency. This balance of priorities must observe the paradox of the CAP Theorem, described separately.

Although Amazon’s Dynamo database is proprietary, some other databases have used its principles to support elasticity:

  • Cassandra
  • Dynomite
  • Voldemort

MapReduce

MapReduce (or MR) is an algorithm for dividing a work load into units suitable for parallel processing. Though the general principle of divide-and-conquer for large data sets has been around for decades, Google popularized it with an OSDI paper in 2004. MR is especially useful for queries against large sets of data: the query can be distributed to hundreds or thousands of nodes, each of which works on a subset of the target data. The results are then merged together, ultimately yielding a single "answer" to the original query.

Perhaps more important than the MR algorithm are MR frameworks, which provide a common foundation for solving a variety of map-reducible problems. The MR framework takes care of common problems such as scheduling work for each node, combining results, and handling node failures. The application only needs to provide key plug-in methods, such as how to process each parcel of work and how to merge the result subsets.
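
As an illustration of those two plug-in methods, here is a minimal single-process word-count sketch in Python (illustrative only, not any particular MR framework's API): map_fn processes one parcel of work and reduce_fn merges the partial results for each key.

```python
from collections import defaultdict

def map_fn(document):
    """Process one parcel of work: emit (word, 1) pairs for a document."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """Merge the partial results for one key."""
    return word, sum(counts)

def map_reduce(documents):
    # "Shuffle" phase: group intermediate pairs by key.
    grouped = defaultdict(list)
    for doc in documents:                 # a real framework spreads these across nodes
        for word, count in map_fn(doc):
            grouped[word].append(count)
    # Reduce phase: merge each key's values into a single result.
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(map_reduce(["the quick brown fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```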

MR can provide an important tool to distributed databases for very large data access problems. However, MR frameworks tend to have high start-up and shut-down costs, hence they are not appropriate for low-latency transactions such as online searches. Some papers argue that MR is a step backwards compared to SQL and OLTP.

Survey of Distributed Database Technologies

Overview

The databases surveyed in this section are summarized in the table below. The terms used in this table are described below:

  • Database types: The four subsections of the table classify databases by their basic data model:
    • Key/value Databases: These manage a simple value or row, indexed by a key. Key/value DBs generally do not support secondary indexes.
    • Big table Databases: These manage large, multi-column rows in the spirit of Google’s BigTable database.
    • Document Databases: These manage multi-field documents (or objects) and provide JSON access.
    • Graph Databases: These manage nodes, edges, and properties and are optimized for associative access. All the current offerings are commercial products.
  • Name: The common name of the database product or open source project.
  • Owner: The company or open source owner of the database.
  • Written in: The primary programming language in which the database is written.
  • Languages/APIs: The primary programming languages and APIs with which the database can be accessed.
  • Platforms: The known operating systems on which the database is supported.
  • License: The type of commercial or open source license with which database usage is controlled.
  • Schemaless?: Indicates if the database supports run-time addition of columns/fields to each row, document, or other primary entity.
  • Sharding?: Indicates if the database supports automatic sharding (partitioning) of data across multiple servers. (Application-managed sharding is always possible but doesn’t count as a database feature.)
  • Indexes?: Indicates if the database provides support for automatic maintenance of secondary indexes. All databases provide efficient access by primary ID (e.g., row ID). Applications can always manually manage secondary indexes.
  • Active?: Indicates if the database appears to have an active user community or ecosystem.
  • Interest?: Indicates the amount of apparent activity and applicability for commercial use.
  • Notes: Noteworthy things about the product.

[Survey tables: key/value databases; big table and document databases; graph databases]

Key/Value Databases

Dynomite (Open Source)

Dynomite is an open source clone of Amazon’s Dynamo framework, written in Erlang. It is brand new; version 0.6.0 is currently shipping. Dynomite is an “eventually consistent distributed key/value store” that is targeting the following features for the 1.0 release:

  • Vector clocks
  • Merkle trees
  • Consistent hashing
  • Tunable quorum
  • Gossiping of membership
  • Gossiped synchronization of partitions
  • Pluggable storage engines
  • Thrift interface
  • Web console with canvas visualizations

It’s not clear how active Dynomite development is. It appears to be the work of one person (Cliff Moon).[1] The last entry on the web site seems to be ~1 year ago, and the last build posted was May, 2009. Documentation is minimal, as are code comments. There isn’t even a clear mention of what machines it runs on.

Dynomite seems to be similar to Cassandra but with far less activity. It may be dead.

  1.  On 3/16/10, Cliff Moon released a Scala client for Cassandra called Scromium. Does this mean he’s given up on Dynomite and is now working with Cassandra?

Riak (Basho Technologies)

Riak (pronounced “ree-ahk”) is a fully-distributed, scalable key/value store created by Basho Technologies that was open sourced under the Apache 2 license in August of 2009. Basho describes Riak thusly: “Riak is a Dynamo-inspired key/value store that scales predictably and easily. A truly fault-tolerant system, Riak has no single point of failure, with no machines being special or central in Riak.”

It is currently shipping at version 0.12 and appears to have active usage. It is written in Erlang, has full-featured HTTP and Protocol Buffers APIs, and many language interfaces (Erlang, Python, Ruby, Java, PHP, etc.).

Central to any Riak cluster is a 160-bit integer space, known as the “ring,” which is divided into equally-sized partitions. Physical servers, referred to in the cluster as “nodes,” run a certain number of virtual nodes, or “vnodes”. Each vnode claims a partition on the ring. The number of active vnodes is determined by the number of physical nodes in the cluster at any given time.

All nodes in a Riak cluster are equal. Each node is fully capable of serving any client request. This is possible due to the way Riak uses “consistent hashing” to distribute data around the cluster.
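
The sketch below (plain Python, not Riak's actual implementation) shows the basic idea of a consistent-hashing ring: keys and vnodes are hashed into the same integer space, each key is owned by the first vnode clockwise from its hash, and adding or removing a node therefore remaps only a small fraction of the keys.

```python
import bisect
import hashlib

def ring_hash(value):
    """Hash a string into a large integer space (SHA-1 gives 160 bits, like Riak's ring)."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, vnodes):
        # Sorted list of (position, vnode) pairs around the ring.
        self.ring = sorted((ring_hash(v), v) for v in vnodes)

    def owner(self, key):
        """Return the first vnode clockwise from the key's position on the ring."""
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect(positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

# Three physical nodes, each running four vnodes.
ring = HashRing([f"node{n}-vnode{v}" for n in range(3) for v in range(4)])
print(ring.owner("artist/987"))   # the same key always maps to the same vnode
```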

Each read/write request indicates how many nodes must respond with the requested operation for “success”. This is illustrated below, using N as the number of replicas, R as the requested number of responses for a read, and W as the requested number of responses for a write:

[Figure: Riak N/R/W quorum example]
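
A hedged sketch of the N/R/W idea follows (illustrative Python, not Riak's API): with N replicas, a read succeeds once R replicas have answered and a write once W replicas have acknowledged; choosing R + W > N makes the read and write quorums overlap, so a read sees at least one replica holding the latest successful write.

```python
from collections import namedtuple

Answer = namedtuple("Answer", "value version")

class Replica:
    """A toy replica that stores a single versioned value."""
    def __init__(self, value, version):
        self.value, self.version = value, version

    def get(self):
        return Answer(self.value, self.version)

def quorum_read(replicas, r):
    """Succeed once at least `r` of the N replicas answer; keep the newest version."""
    answers = []
    for replica in replicas:              # conceptually these requests run in parallel
        answers.append(replica.get())
        if len(answers) >= r:
            return max(answers, key=lambda a: a.version).value
    raise RuntimeError("fewer than r replicas responded")

# N = 3 replicas, one of them stale; R = 2 still returns the newest value,
# provided the earlier write waited for W = 2 acknowledgements (R + W > N).
replicas = [Replica("old", 1), Replica("new", 2), Replica("new", 2)]
print(quorum_read(replicas, r=2))         # new
```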

Riak uses “vector clocks” as its version control mechanism. This gives developers the option to resolve version conflicts in their application logic. The alternative is to let Riak resolve conflicts automatically. Other mechanisms at work in Riak include “Hinted Handoff,” a technique for dealing with node failure in the Riak cluster in which neighboring nodes temporarily takeover storage operations for the failed node; and “Read Repair,” an anti-entropy mechanism used to optimistically update stale replicas when they reply to a read request with stale data.
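
As a rough illustration of the vector-clock mechanism (a generic sketch, not Riak's data structures), each node increments its own counter when it updates an object; comparing two clocks tells you whether one version descends from the other or whether the versions conflict and must be resolved, either by the application or automatically.

```python
def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict: sibling versions must be resolved"

v1 = {"nodeA": 2, "nodeB": 1}   # written via A, then B
v2 = {"nodeA": 2, "nodeB": 2}   # a later write via B: descends from v1
v3 = {"nodeA": 3, "nodeB": 1}   # a concurrent write via A: conflicts with v2

print(compare(v2, v1))   # a supersedes b
print(compare(v2, v3))   # conflict: sibling versions must be resolved
```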

Riak also supports MapReduce in both JavaScript and Erlang for data processing that requires query functionality beyond the standard GET, PUT and DELETE operations offered by a key/value store.

Riak natively supports links, which are one-way relationships between objects. For example, an “artist” object can link to all of its “albums,” each of which can link to all of its “songs.”

Riak is available in a commercial version called Riak EnterpriseDS that builds on the core open source code and includes support.

Riak can run on a variety of operating systems: it is known to run on several *NIX platforms and should be able to run on most anything that isn’t Windows. Basho also provides pre-packaged binary builds for most platforms. After one year of being open sourced, Riak has seen success and adoption in various markets, ranging from FORTUNE 100 enterprises like Comcast, to open source authorities like Mozilla, to innovative startups like Mochi Media.

Tokyo Cabinet/Tyrant (Open Source)

Tokyo Cabinet is an open source key/value database created and maintained by Mikio Hirabayashi. It is a set of libraries, intended for embedding in database applications. Berkeley DB is often cited as its biggest “competitor”.

There are no tables and no schema: all “objects” are merely keys whose values are strings or byte arrays. Tokyo Cabinet provides “row” locking and built-in support for compression. It is written in C, supports both hash tables and B-trees, and is “crazy fast” (reportedly storing 1 million records in as little as 0.7 seconds). It has interfaces in many languages including Ruby, Perl, C, etc. Below is a diagram of its B-tree organization:

[Figure: Tokyo Cabinet B-tree organization]

Tokyo Cabinet is used by Mixi.jp (Japan’s version of Facebook). It is accompanied by Tokyo Tyrant, which provides networking, replication, and failover capabilities. Tokyo Cabinet and Tyrant are apparently used in many production applications, though there are anecdotal descriptions of maturity problems. They are both licensed under the GNU Lesser General Public License (LGPL).

Unfortunately, the licensing seems to be a show stopper for commercial use. Also, it’s not clear that Cabinet provides support for secondary indices: applications must roll their own, thereby also handling all transaction processing.

Vertica (Vertica Systems)

Vertica Analytic Database is a commercial, columnar database provided by Vertica Systems, Inc. (an HP purchase was announced in February, with the deal expected to close in April). One of Vertica’s founders is Michael Stonebraker, who also founded Ingres, Postgres, and many other database initiatives. It is built upon (or inspired by) C-Store, adding clustering and other robust features. It is positioned as a data warehouse DB, more specifically as “the only columnar DB that supports massively parallel processing (MPP)”. Vertica supports compression, replication, automatic failover, and MapReduce. The company claims that Vertica offers “50x-200x faster speed on 90% less hardware” (presumably compared to other data warehouse databases). The only supported platforms are Linux and VMware.

A “cloud” version of Vertica is available via Amazon’s EC2 platform. This SaaS version is intended for no-DBA, pay-as-you-go applications and can scale out as much as needed. The EC2 Vertica solution is available as an Amazon Machine Image (AMI).

Vertica is only available as a commercial product, and only available on Linux and EC2 (not Windows).

Voldemort (Open Source)

The Voldemort database was built by LinkedIn for their needs. It is a key/value store that uses Dynamo principles for distributed processing, replication, and failover. It can be considered “Cassandra 1.0” since the same designer left LinkedIn to create a similar database for Facebook. Since it is a key/value store, Voldemort does not directly support relationships. Instead, 1-to-many relationships must be handled by using an embedded map (e.g., java.util.HashMap) as a value, as sketched below. For very large fan-outs (say millions of values), a separate, application-managed table must be used. The designers claim that this simplistic design is an advantage since performance is very predictable.
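
A minimal sketch of that 1-to-many pattern against a generic key/value interface (illustrative Python, not Voldemort's Java API): the whole map of related items is serialized and stored as the value of a single key, and updates are read-modify-write operations on that value.

```python
import json

store = {}   # stand-in for a key/value store's put/get interface

def put(key, value):
    store[key] = json.dumps(value)       # serialize the embedded map as the value

def get(key):
    return json.loads(store[key])

# One-to-many: all of a member's connections live inside a single value.
put("member:42:connections", {"alice": "2009-11-02", "bob": "2010-01-15"})

connections = get("member:42:connections")
connections["carol"] = "2010-03-01"      # read-modify-write of the whole map
put("member:42:connections", connections)

print(sorted(get("member:42:connections")))   # ['alice', 'bob', 'carol']
```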

Voldemort was recently open sourced and is available under the Apache 2.0 license.

Stemming from the Dynamo research, Voldemort uses “read repair” for conflict detection and resolution instead of two-phase commit or Paxos-style consensus algorithms. This also means that versioning is used and that applications must participate in conflict resolution.

Voldemort’s architecture employs a distinct layer to resolve each logical function: conflict resolution, serialization, etc. This is illustrated below:

[Figure: Voldemort logical architecture]

As shown, Voldemort requires a storage engine such as Berkeley DB or MySQL. Multiple physical architecture options are available to support various load-balancing and performance goals. This is shown below:

[Figure: Voldemort physical architecture options]

The native API of Voldemort is Java, but access is also possible via HTTP.

Voldemort was only recently open-sourced, and its adoption in the presence of other options is unclear. Its dependency on BDB or MySQL as a backend store limits its use for commercial applications.

Big Table Databases

Azure Tables (Microsoft)

Azure is Microsoft’s cloud computing platform, first coming into production use in 2010. The underlying Azure fabric provides distributed computing services such as communication, service management, replication, and failover. On this foundation, Azure offers several storage options, including:

  • Windows Azure XDrive: This is a distributed NTFS-like file system built on top of Azure Blobs.
  • Block and Page Blobs: These are monolithic storage objects, optimized for streaming and random access, respectively.
  • Azure Tables: These are “big table” like storage structures that use row-oriented access and storage.
  • SQL Azure: This is a limited version of SQL Server that can be used in Azure.

Of these options, only SQL Azure is a true database system. However, it is limited to a 10GB maximum size, doesn’t support full text search, and is “crippled” in several other ways. Microsoft intends to expand the features of SQL Azure over time, moving it in the direction of a distributed database. This means that Azure doesn’t provide a distributed database solution today. Instead, it provides building blocks with which a distributed database could be built.

For large storage, Azure Tables provide the most flexible building block. Each table allows an unlimited number of rows (also called “entities”) that can be separated into partitions. Each partition is serviced independently, allowing parallel processing. Any row can be efficiently accessed via {partition key, row key}. Transactions are limited to rows within the same partition. Tables do not have a fixed schema: each row can have its own set of properties, and the same-named property can have a different value type from one row to the next. However, a row can have no more than 255 properties and cannot be larger than 1MB in size, including property names[1]. Moreover, property types are limited to strings and simple scalar values. A large value such as an image or document must be stored as a Blob and referenced from a table property. Rows are versioned with a system-assigned timestamp.

Azure Tables do not provide indexes or any other means of efficient row access other than by {partition key, row key}. Linear searches (aka “table scans”) are supported, but indexed auxiliary searching must be application-maintained. Similarly, full text searching must be implemented at the application level.

Windows Azure can be accessed in .Net languages as well as Java.

As a “big table” implementation similar to many other cloud- and on-premise databases, Azure tables are the obvious choice for a big table-storage solution when Azure is used.

  1.  Compared to the rather unlimited constraints of other “big table” databases, Azure Tables might be more appropriately described as “mini tables”.

Cassandra (Apache)

Cassandra is an Apache project that aims to provide a scalable, distributed database solution that employs the technology advances of Amazon’s Dynamo and Google’s BigTable projects. It was open-sourced by Facebook in 2008 and designed by one of Dynamo’s original authors (Avinash Lakshman) as well as Facebook engineer Prashant Malik. It is used by many large companies such as Rackspace, Digg, Facebook, Twitter, and Cisco. The largest production server has over 150 TB of data. Cassandra is available under the Apache 2.0 license.

Key features of Cassandra are summarized below:

  • Decentralized: All nodes are equal; there is no single point of failure.
  • Fault Tolerant: Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
  • Eventually Consistent: Data uses an eventually consistent model with optimizations such as Hinted Handoff and Read Repair to minimize inconsistency windows.
  • Elasticity: New machines can be added dynamically with no downtime, and read/write operations scale linearly as new servers are added.
  • Rich Data Model: More complex records are supported than simple key/value stores.

Cassandra reports impressive performance numbers compared to MySQL. For example:

  • MySQL > 50 GB data
    • Writes average: ~300 ms
    • Reads average: ~350 ms
  • Cassandra > 50 GB data
    • Writes average: 0.12 ms
    • Reads average: 15 ms

Cassandra is written in Java. It uses the Apache Thrift service framework to provide access in many languages. Based on its roots, it emphasizes the Linux platform though it is being used on Windows as well. At the PyCon conference, Cassandra was described as “not polished”. For example, it doesn’t yet support compression.

In mid-March, 2010, Cassandra was moved from an incubator project to a top-level project. Consequently, it now has its own top-level web page: http://cassandra.apache.org. Cassandra is very active and progressing rapidly.

Cassandra appears to be very promising, especially with its backing by major companies currently managing lots of data on the web.

Datastore (Google)

Google’s cloud computing framework, known as the Google App Engine (GAE), provides a data storage solution known as the datastore. Under the hood, it is the famous BigTable database, which is a foundation technology that helped launch the “no SQL” movement. The primary features of datastore are summarized below:

  • The datastore is a distributed database that provides queries and transactions.
  • Data objects are called entities and have a kind and a set of properties.
  • Data objects of the same kind form a data class, essentially analogous to a “table”.
  • Many property types are supported including scalar types (string, Boolean, integer, float, etc.) and specialty types such as lists, blobs, links, emails, and geographic coordinates.
  • Each datastore is schema-less: the application specifies the properties of each entity at runtime.
  • Multiple updates within a single transaction are supported when all affected entities are in the same entity group.
  • Every query uses an index that returns entities in the desired order. Indexes can be predefined for better performance.

GAE datastores can be accessed with Python or Java (Java is a relatively new addition). Python applications use a Python-specific API to access a datastore. Java applications can use Java Data Objects (JDO), the Java Persistence API (JPA), or a low-level datastore API. Datastores support the JDO Query Language (JDOQL). Datastores also have their own SQL-like query language called GQL, which is used primarily in the Python interface. Google maintains accessible statistics on each datastore including entity counts and total size.

Datastores have some limitations as summarized below:

  • An entity cannot be bigger than 1MB.
  • An entity’s columns cannot be referenced in indexes more than 1,000 times.
  • No more than 500 entities can be added (put) or deleted in a single batch.

New accounts are not charged until they reach 500MB of storage or 5 million page views a month. Thereafter, cost is apportioned based on the space used and the amount of data transferred in and out of the application.

GAE tables are the obvious candidate for storage for GAE applications. Because they are similar to Azure Tables and other BigTable implementations, it may be possible to target multiple “big table” implementations with a single storage solution.

HBase (Apache)

HBase is a “big table” columnar database, modeled after Google’s BigTable and built upon Apache’s Hadoop framework. It is still fairly new, currently shipping the 0.90.x release (Feb ’11); see http://wiki.apache.org/hadoop/Hbase/HBaseVersions for the version history. HBase supports extremely large tables and distributed database principles such as:

  • Elastic storage (dynamic addition of new servers)
  • No single point of failure
  • Rolling restart for configuration changes and minor upgrades
  • Support for MapReduce processing
  • Apache Thrift network for service deployment
  • REST-ful web service gateway

Hadoop is used by Yahoo search, Facebook, Microsoft Bing (via the Powerset acquisition?), and other major vendors. Hadoop and HBase are Apache open source projects, and both are offered by Amazon via its Amazon Web Services (AWS) offering.

HBase is a columnar database designed for scale in all directions: billions of rows, each of which can have millions of columns. Each row/column intersection is a cell that can be versioned. The number of versions stored is configurable.

Each HBase table is named. Rows are sorted by row ID, which is a user-provided byte array value. Tables are organized into regions, which are stored separately. When a region becomes full, it splits and some rows are moved. Every region holds a contiguous range of row IDs, defined by [first row, last row).

Columns are organized into column families, which must be declared. For example, a column family “station:” can be declared. Subsequently, an application can add values with column names such as “station:id”, “station:name”, “station:description”, etc.
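
As a rough schematic (plain Python, not HBase's API or storage format), a table can be pictured as a map from row ID to family-qualified columns, where each cell keeps a configurable number of timestamped versions.

```python
from collections import defaultdict

MAX_VERSIONS = 3   # versions kept per cell (configurable per column family in HBase)

# table[row_id]["family:qualifier"] -> list of (timestamp, value), newest first
table = defaultdict(lambda: defaultdict(list))

def put(row_id, column, timestamp, value):
    cell = table[row_id][column]
    cell.append((timestamp, value))
    cell.sort(reverse=True)                 # keep the newest version first
    del cell[MAX_VERSIONS:]                 # trim versions beyond the limit

put(b"row-0017", "station:id", 100, "KSFO")
put(b"row-0017", "station:name", 100, "San Francisco")
put(b"row-0017", "station:name", 200, "San Francisco Intl")   # a newer version

print(table[b"row-0017"]["station:name"][0])   # (200, 'San Francisco Intl')
```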

Each database has a single “master” node and one or more regionserver nodes. HBase can be mapped to different storage solutions. When mapped to HDFS, nodes are replicated for redundancy. Each HBase instance has catalog tables called -ROOT- and .META.: the -ROOT- table has the list of .META. table regions, and .META. holds the list of user-space regions (including each region’s first row ID). This architecture allows regions to split as data grows, yet it allows the correct region to be quickly found for a given row ID.

HBase uses the Apache ZooKeeper framework for distributed access. When mapped to the Hadoop HDFS, HBase implements a special file class that provides features such as compression and bloom filters. GZIP compression is used by default, but LZO compression (which must be installed separately on each HBase node because it is GPL licensed) can be used for faster, real-time compression.

HBase, as with most of Hadoop, is written in Java and favors Linux systems. However, it reportedly will also work on Windows systems.

HBase is more complicated than some solutions because of its layering: it requires Hadoop’s HDFS, which is one set of services, as well as its own set of services, connected and managed by Apache ZooKeeper. It is older than Cassandra and potentially more mature, especially when secondary indexes are required.

Hypertable (Open Source)

Hypertable is an open source, “web scale” database modeled after Google’s BigTable. It is available under the GNU GPLv2. It is a relatively new project, currently shipping version 0.9.2.7. Hypertable can be deployed on top of the Hadoop HDFS or CloudStore KFS file systems. It supports an SQL-like language for creating tables called HQL. An overview of a Hypertable deployment is shown below:

[Figure: Hypertable deployment overview]

Hypertable uses a Google Chubby equivalent called Hyperspace, which holds metadata and acts as a global lock space. Hypertable uses a similar table/column model as other big table implementations:

  • A table consists of rows, each of which has a unique row ID.
  • Each row consists of a set of column families.
  • Each column family has an unlimited set of column qualifiers, each of which defines a unique column (field) value.
  • Each column value is versioned by a system-defined timestamp.
  • Rows are stored in row ID order within each table.
  • When a table reaches a certain size, it is split into ranges, each served by a separate range server that handles all rows within the corresponding range. (This is analogous to HBase regions and region servers.)

A table is essentially a key/value map, where each key is the row ID concatenated with the column family, column qualifier, and value timestamp. A full map key therefore looks like this:

[Figure: Hypertable full map key layout]
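
A small sketch of that flattened key (illustrative Python, not Hypertable's on-disk encoding): concatenating the row ID, column family, column qualifier, and timestamp yields a single sortable key, so the whole table behaves like one ordered key/value map.

```python
def map_key(row_id, column_family, qualifier, timestamp):
    """Build a flattened, sortable key: row ID + family + qualifier + timestamp."""
    return f"{row_id}\x00{column_family}\x00{qualifier}\x00{timestamp:020d}"

table = {}   # the entire table viewed as one ordered key/value map

table[map_key("com.example.www", "content", "html", 1262304000)] = "<html>...v1"
table[map_key("com.example.www", "content", "html", 1262390400)] = "<html>...v2"
table[map_key("com.example.www", "anchor", "blog", 1262304000)] = "Example blog"

# Keys sort by row, then column family, then qualifier, then timestamp.
for key in sorted(table):
    print(key.split("\x00"))
```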

The native API is C++. The Apache Thrift API is used to provide support for Java, Perl, PHP, Python, and Ruby. Linux and Mac OS X are supported; porting to other platforms is possible.

Hypertable seems identical functionality-wise to Cassandra but with a much more restrictive GPL license.

MckoiDDB (Diehl and Associates)

MckoiDDB is an open source distributed database developed by a small company (Diehl and Associates) and distributed under the GNU GPLv2 license. It manages large sets of data distributed within a cluster, supporting features such as low-latency queries, snapshots, sharding, versioning, and data forks. MckoiDDB supports add-on modules, one of which provides SQL support. (The SQL module started shipping as Beta in May, 2010.) Version 1.0 was released 10/5/09; version 1.1 started shipping in May, 2010. It appears to be managed by a single person (Tobias Downer).

MckoiDDB utilizes optimistic locking, narrowing all contention down to a “consensus function”, which is engaged before data is committed. Sharding is supported, which allows data and contention to be distributed among multiple servers. Multiple data models are provided; for example, a “simple database API” provides traditional file and table models.
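
A rough sketch of that optimistic-locking idea (a generic compare-and-commit in Python, not MckoiDDB's actual consensus function): work proceeds against a snapshot, and a commit is accepted only if no conflicting commit happened in the meantime.

```python
class OptimisticStore:
    """Toy versioned store: all contention is narrowed to the commit check."""
    def __init__(self):
        self.value, self.version = 0, 0

    def snapshot(self):
        return self.value, self.version

    def commit(self, new_value, based_on_version):
        # The "consensus function": accept only if nothing changed underneath us.
        if based_on_version != self.version:
            return False                      # conflict: the caller must retry
        self.value, self.version = new_value, self.version + 1
        return True

store = OptimisticStore()
value, version = store.snapshot()
print(store.commit(value + 10, version))      # True: first committer wins
print(store.commit(value + 99, version))      # False: stale snapshot, retry needed
```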

In comparison to BigTable and Hadoop HBase, MckoiDDB claims to be more oriented towards online transaction processing[1]. MckoiDDB can be used in cloud deployments such as EC2. Like other distributed databases, it writes data to three drives, providing fault tolerance and distributed access. MckoiDDB is written in Java and reportedly works anywhere Java 1.6 is available.

MckoiDDB is supported by one person and doesn’t appear to have a lot of activity. It is also surrounded by the dreaded GPL firewall.

  1.  On the surface, this doesn’t make sense since all of the “big tables” provide highly deterministic times for all read/write operations, so performance tends to be related to the amount of sharding and caching used.

SimpleDB (Amazon)

SimpleDB is Amazon’s “big table” database, currently in beta, available through the Amazon Web Services (AWS) portfolio. It is written in Erlang and is similar to BigTable, Cassandra, and other big table databases in many ways: each domain (table) has no pre-defined schema; each item (row) can have a different set of attributes (columns); etc. Domains are currently limited to 10 GB each, and each account is limited to 100 domains.

Attribute values must be simple text or scalar types. Large objects such as documents and images should be stored in Amazon’s S3 storage and referenced via pointers. Not only is S3 cheaper, but this helps avoid hitting the 10GB limit. Moreover, data transfer within the AWS is free, so there is no penalty for storing large objects separately.

Interestingly, some articles, such as the Wikipedia entry, classify SimpleDB as a document database. This may be because it supports REST and SOAP, thereby allowing rows to be retrieved as XML documents. However, SimpleDB does not appear to support JSON, whereas all other document databases do. Moreover, large text and binary values must be stored in S3 as described above, pushing to applications the responsibility of managing the content of “documents”, be they textual, image, audio, etc. SimpleDB seems to more accurately belong to the somewhat lower taxonomic label of “big table” database.

Currently, SimpleDB is free for experimental use:

Free Tier

You can get started with SimpleDB for free. Amazon SimpleDB users pay no charges on the first 25 Machine Hours, 1 GB of Storage, and 1 GB of Data Transfer Out consumed every month. Under the new free inbound data transfer promotion, all data transfer into Amazon SimpleDB is free of charge until June 30, 2010. In most use cases, the free tier enables approximately 2,000,000 GET or SELECT API requests to be completed per month before incurring any usage charges. Many applications should be able to operate perpetually within this free tier, such as a daily website analysis and traffic reporting tool, a web indexing service, or an analytics utility for online marketing programs.

SimpleDB employs “eventually consistent” replication and transaction semantics. However, on 2/24/10, Amazon added new features such as conditional puts and deletes and consistent read that prevent ambiguities and small timing gaps that can result in inconsistent data. These operations trade speed for consistency but open SimpleDB to a larger class of applications.

SimpleDB looks promising as a big table-like database for applications hosting data in the Amazon cloud.

Document Databases

CouchDB (Apache)

CouchDB is a document database, sometimes referred to as the document DB “poster child”. It is an ad-hoc, schema-less database with a flat address space. Written in Erlang, CouchDB can be queried and indexed using MapReduce expressions. It supports JSON and a REST-style access. An overview of its architecture from the Apache web site is shown below:

[Figure: CouchDB architecture overview]

CouchDB uses incremental, asynchronous replication with conflict detection and management. A multi-master architecture allows an individual master to be off-line (writes are queued until master comes back on-line).

Under the hood, CouchDB is a key/value store that maps document IDs to documents. A CouchDB document is an object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document is a blog post:

[Figure: example CouchDB document (a blog post)]

A CouchDB database is a flat collection of these documents, each document identified by a unique ID. Full text search is provided by a Lucene index. Additional indexes are views defined using the MapReduce pattern and JavaScript for methods. (This approach has been described as “awkward”.)
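
To make the view mechanism concrete, here is a conceptual sketch of how a map function builds a view index over documents, written in Python for consistency with the other examples (CouchDB itself defines views in JavaScript and maintains the index incrementally).

```python
# Two documents from a hypothetical blog database.
docs = [
    {"_id": "a1", "type": "post", "author": "ana", "title": "Hello"},
    {"_id": "b2", "type": "post", "author": "bo",  "title": "CouchDB notes"},
]

def map_posts(doc):
    """View map function: emit (key, value) rows for documents of interest."""
    if doc.get("type") == "post":
        yield (doc["author"], doc["title"])

# The "view": map over every document and keep the emitted rows sorted by key.
view_index = sorted(row for doc in docs for row in map_posts(doc))
print(view_index)   # [('ana', 'Hello'), ('bo', 'CouchDB notes')]
```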

Emil Eifrém and Adam Skogman described CouchDB as being difficult to use. Other internal research at Quest also categorized it as a special purpose database.

CouchDB seems to be an early example of a document DB and is best used for applications specifically needing that functionality.

DovetailDB (Millstone Creative Works)

DovetailDB is a schema-less, JSON-based database, similar (in spirit) to Amazon’s SimpleDB. It is produced by Millstone Creative Works and is open-sourced under the Apache 2.0 license. It allows server-side JavaScript and provides full transaction support. Compared to CouchDB, it provides automatic data indexing, though with less flexibility. The product is listed in a couple of “No SQL” articles, but the web site is sparse, and the newsgroup has no entries since 9/2009. It’s not clear that the project is still active.

It’s not clear what advantages DovetailDB has over other document-oriented databases, and it doesn’t appear to be active.

MongoDB (Open Source)

MongoDB is a document database system written in C++, once described as “CouchDB without the crazy”. Its web site defines MongoDB as a scalable, high-performance, open source, schema-free, document-oriented database with the following primary features (a brief usage sketch follows the list):

  • Document-oriented storage (the simplicity and power of JSON-like data schemas)
  • Dynamic queries
  • Full index support, extending to inner-objects and embedded arrays
  • Query profiling
  • Fast, in-place updates
  • Efficient storage of binary data and large objects (e.g. photos and videos)
  • Replication and failover support
  • Auto-sharding for cloud-level scalability
  • MapReduce for complex aggregation
  • Commercial Support, Training, and Consulting

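As a quick illustration of the document model and dynamic queries, here is a hedged sketch using the PyMongo driver (it assumes a MongoDB server on localhost and the pymongo package; the database, collection, and field names are made up for the example).

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)      # assumes a local mongod instance
posts = client["blog"]["posts"]               # database and collection created lazily

# Schema-free, JSON-like documents; each insert can carry its own fields.
posts.insert_one({"author": "ana", "title": "Hello", "tags": ["intro"]})
posts.insert_one({"author": "bo", "title": "Sharding notes", "comments": 3})

posts.create_index("author")                  # secondary index on a field

# Dynamic query: find by field value, with no schema declared anywhere.
for doc in posts.find({"author": "ana"}):
    print(doc["title"])
```
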
Programmatic access is available in a number of languages such as C, C++, C#, Erlang, Java, Python, Ruby, and more. MongoDB uses an AGPL/Apache license model, which is friendlier to applications that want to use it than a plain GPL model. From the web site:

Our goal with using AGPL is to preserve the concept of copyleft with MongoDB. With traditional GPL, copyleft was associated with the concept of distribution of software. The problem is that nowadays, distribution of software is rare: things tend to run in the cloud. AGPL fixes this “loophole” in GPL by saying that if you use the software over a network, you are bound by the copyleft. Other than that, the license is virtually the same as GPL v3.

Note however that it is never required that applications using mongo be published. The copyleft applies only to the mongod and mongos database programs. This is why Mongo DB drivers are all licensed under an Apache license. Your application, even though it talks to the database, is a separate program and “work”.

One web site compares the MongoDB MapReduce functionality to SQL this way:

[Figure: MongoDB MapReduce functionality compared to SQL]

MongoDB is supported on 32- and 64-bit versions of Mac OSX, Linux, and Windows. One web review points out a few limitations with MongoDB:

  • It is not as fast as other key/value store databases such as Tokyo Cabinet or Redis. However, it is plenty fast, inserting 2,500 simple documents per second and performing 2,800 reads per second on commodity Linux hardware.
  • Its support for sharding is new and essentially alpha-ish.
  • The query syntax is not “pretty”, but the ease-of-use rivals Redis.

MongoDB’s only apparent potential drawbacks are its immaturity and its uncertain suitability for objects other than documents.

Terrastore (Open Source)

Terrastore is a document database built on top of Terracotta. It uses an Apache 2.0 license, but its dependency on Terracotta needs further research. Terrastore is distributed and supports dynamic addition and removal of nodes (elasticity). Documents are automatically distributed to new nodes, though it doesn’t appear to support replication or automatic failover features. Terrastore supports JSON, accessible via an HTTP interface.

It’s not clear how active/used Terrastore is. It seems to be supported by one person (Sergio Bossa). There are a few papers available, but not much documentation.

ThruDB (Open Source)

ThruDB is an open source (New BSD license) document database built on top of the Apache Thrift framework. Like Thrift, ThruDB was started by (and still used by) Facebook but is now open source on the Google code pages. ThruDB provides web-scale data management by providing separate services for various concerns:

  • Thrucene: provides Lucene-based indexing
  • Throxy: provides partitioning and load balancing
  • Thrudoc: provides document storage
  • Thruqueue: provides a persistent message queue service
  • Thrift: provides cross-language services framework

ThruDB provides several storage back-end solutions, including local disk, S3 (cloud), and a hybrid local disk/S3 solution. The differences in these solutions are summarized below:

[Table: ThruDB back-end storage options]

ThruDB uses Memcached as an in-memory, cross-server cache. When the hybrid disk/S3 storage solution is used, this means that documents are checked in three places: memory, local disk, and then S3. This allows the S3 store to be a seamless backup/recovery source in case a local disk fails. This is illustrated below:

[Figure: ThruDB hybrid local disk/S3 storage]

Compared to SimpleDB, ThruDB removes many of the former’s restrictions: no 1024 byte limit per attribute, ability to use proprietary data formats, no query time limits, and more. Since it also runs on EC2 and public Thrift/ThruDB Amazon Machine Images (AMIs) exist, ThruDB is in a sense a competitor to SimpleDB.

Graph Databases

AllegroGraph RDFStore (Franz Inc.)

AllegroGraph RDFStore is a persistent RDF graph database that purportedly scales to “billions of triples while maintaining superior performance”[1]. It is intended for building RDF-centric semantic web applications. AllegroGraph supports SPARQL, RDFS++, and Prolog reasoning from Java applications. It is a commercial product of Franz Inc.: a free version stores up to 50 million triples; a Developer version stores up to 600 million triples; an Enterprise version stores an unlimited number of triples.

An overview of its architecture is shown below:

[Figure: AllegroGraph RDFStore architecture overview]

AllegroGraph RDFStore supports database federation. Independent databases can be deployed, and a single query can be distributed across databases. The product is intended for fast loading of and access to RDF information and not as a general purpose database. The requirement of a commercial license also puts it at a disadvantage compared to various open source alternatives.

  1.  Resource Description Framework (RDF) is a structure for describing and interchanging metadata on the Web. In RDF, a triple consists of three values: a subject, a predicate, and an object. All three are typically fairly short strings.

InfoGrid (NetMesh Inc.)

InfoGrid is a graph database developed by NetMesh, Inc. It provides graph-based traversal for access instead of joins. Each unit of data is separately addressable, thereby supporting REST-ful access. InfoGrid uses XPRISO and Probe frameworks to allow external data to appear as if it was native to the database. InfoGrid supports data sharding, object/relational mapping, data import, and other functions. An overview of the architecture is shown below:

[Figure: InfoGrid architecture overview]

An InfoGrid database instance is called a MeshBase and contains a graph of MeshObjects. Each MeshObject has a type and a set of properties. A Relationship connects two objects and may have a type and a direction. MeshBases can be mapped to an RDBMS, file system, or a distributed file system such as Hadoop HDFS. Viewlets allow search results to be rendered to a specific target such as a web page.

Probes are used to access real-world data, as illustrated below:

[Figure: InfoGrid probes accessing external data]

InfoGrid is available under the GNU Affero General Public License (AGPL), which is a network version of the GPL. This means that commercial use requires a license from NetMesh.

HyperGraphDB (Kobrix)

HyperGraphDB is a graph database designed specifically for artificial intelligence and semantic web projects. It is also positioned as a “general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism.” It also slices, dices, and bakes bread. It is developed by Kobrix and covered by the Lesser GNU Public License (LGPL) and is shipping at version 1.0.

Here are some key facts from the product’s main web site:

  • The mathematical definition of a hypergraph is an extension to the standard graph concept that allows an edge to point to more than two nodes. HyperGraphDB extends this even further by allowing edges to point to other edges as well and making every node or edge carry an arbitrary value as payload.
  • The original requirements that triggered the development of the system came from the OpenCog project, which is an attempt at building an AGI (Artificial General Intelligence) system based on self-modifying probabilistic hypergraphs.
  • The basic unit of storage in HyperGraphDB is called an atom. Each atom is typed, has an arbitrary value and can point to zero or more other atoms.
  • Data types are managed by a general, extensible type system, itself embedded as a hypergraph structure. Types are themselves atoms like any other, but with a particular role (then again, every atom plays some role).
  • The storage scheme is platform independent and can thus be accessed by any programming language from any platform. Low-level storage is currently based on Berkeley DB from Sleepycat Software.
  • Size limitations are virtually non-existent. There is no software limit on the size of the graph managed by a HyperGraphDB instance. Each individual value’s size is limited by the underlying storage, i.e. by Berkeley DB’s 2GB limit. However, the architecture allows bypassing Berkeley DB for particular types of atoms if one so desires.
  • The current implementation is solely Java based. It offers an automatic mapping of idiomatic Java types to a HyperGraphDB data schema which makes HyperGraphDB into an object-oriented database suitable for regular business applications. A C++ implementation has been frequently contemplated, but never initiated due to lack of manpower. Note that the storage scheme being open and precisely specified, all languages and platforms are able to share the same data.
  • Embedded in-process: the database comes in the form of a software library to be used directly through its API.
  • A P2P framework for distributed processing has been implemented for replication/data partitioning algorithms as well as client-server style computing

Given its immaturity (version 1.0) and dependency on Berkeley DB, which is GPL licensed, HyperGraphDB doesn’t seem feasible for commercial applications.

Neo4j (Neo Technologies)

Neo4j is a graph database intended for embedded applications. It is written in Java, but it has bindings for Python, Ruby, Clojure, and Scala. It supports ACID transactions and provides master-slave replication. It is provided by Neo Technologies and has both AGPL and commercial licenses. The Neo4j jar file is only 500KB. It reportedly supports billions of nodes, relationships, and properties on a single system. It has been in production use since 2003. An example of a typical information graph is shown below:

[Figure: an example Neo4j information graph]

One reference to Neo4j suggested that it had supported very basic master/slave replication, but this is apparently done through a very soft mechanism such as a disk replicator. Anecdotal mention of sharding is also made, but a blog entry explains that this essentially must be done at the client. As a result, built-in scalability support appears to be limited to what can be done on a single machine.

The lack of scale-out strategy, the focus on a graph model, and the need for a commercial license seem to make Neo4j of limited use.

Honorable Mentions

The following technologies are related to distributed database technology but do not currently warrant deeper study. Some are not databases but are related in an interesting way. Some are databases but do not qualify as sufficiently interesting for this survey. Some are very new and/or evolving such that they may warrant more study later.

Alterian

Alterian specializes in web content management, campaign management, and various marketing areas. They use a proprietary columnar database as an “integrated marketing platform”, but it is not available as a general purpose database.

Berkeley DB

Berkeley DB was originally distributed by Sleepycat Software; this key/value store is now owned by Oracle. It targets embedded databases with high-performance blob-like access. It offers no network API, and applications must manage the contents of records. It is really an embedded DB building block, and it requires licensing for commercial application use.

C-Store

C-Store is an experimental columnar database developed by a consortium of universities. It is built on top of Berkeley DB, is optimized for reads, and supports SQL. However, it is no longer supported, though it is cited as the inspiration for a number of other columnar databases.

Chordless

Chordless is a distributed hash table, written in Java, built on the Chord adaptive, distributed, peer-to-peer hashing algorithm. It is a GPL-licensed utility available for free download.

CloudStore KFS

CloudStore KFS, formerly called Kosmos, is an open-source, high performance distributed file system implemented in C++. It is modeled after Google’s GFS file system. Integration with Hadoop is available, providing an alternative to the Hadoop HDFS file system. It uses the Apache License Version 2.0. Applications can use C++ or, via glue code, Java and Python. Activity is questionable since the shipping release is 0.4, and no builds have been done since 10/2009. Presumably, KFS is faster than HDFS since it is implemented in C++ instead of Java.

Coherence

Coherence was originally developed by Tangosol and is now owned by Oracle. Coherence is a “grid caching” solution that places frequently-used data for Java applications in memory, distributed across multiple servers. It is a commercial analog of Memcached.

Db4o

Db4o is an open source (GPL) database intended to be embedded within object applications. It allows .Net and Java objects to be stored without decomposition or normalization. It is mentioned only because it (apparently) uses a creative storage technique that affords high performance. However, the GPL license prevents commercial usage.

Drizzle

Drizzle is a lightweight SQL database based on MySQL. Its goal is to produce a sufficiently lightweight database that can take advantage of cloud computing and massive concurrency/parallelism while retaining the advantages of SQL. It is a relatively new project, licensed with the GNU GPLv2.

Dynamo

Dynamo is a distributed key/value storage system developed by Amazon. Though the actual implementation is proprietary, Dynamo provides foundation research that has been used in several other implementations such as Voldemort and Cassandra.

Force Database

The Force.com Database is an object/relational hybrid database that is used by Salesforce.com but can be used for other Force.com cloud applications. It has programming tools, a programming language called Apex, and two query languages called Salesforce Object Query Language (SOQL) and Salesforce Object Search Language (SOSL).

GigaSpaces XAP

GigaSpaces XAP In-Memory Grid is a Memcached-like distributed memory caching architecture. It front-ends a persistent store such as a database, providing high-performance access for a subset of data, say the newest data, while supporting horizontal scale-out.

Gremlin

Gremlin is not a DBMS but rather an open source graph programming language for navigating graph DBs. It gets mentioned by some of the graph DB articles.

Google App Engine

Google App Engine is Google’s application development framework for building and running web applications within Google’s cloud infrastructure. Applications can be written in Python or Java and choose from a variety of services: web services, persistence with queries and transactions, automatic scaling and load balancing, task queues, and more. Applications can be developed locally and deployed to the Google cloud. Applications that store less than 500MB and experience less than 5M page views per month are free.

GT.M

GT.M is a commercial, recently open-sourced database provided by FIS Financial. It is a multi-dimensional key/value database that qualifies as a “no SQL”, schema-free database, yet it supports ACID transactions. It started in 1986 on VAX/VMS and now runs on Linux. It favors development in the ANSI “M” language, also known as MUMPS, which is used in the financial and banking industries.

HamsterDB

HamsterDB is a variable-key, memory or disk-based database intended for embedded applications. It is available under a GNU GPL license; a commercial license is also available. As a non-relational database, it is much faster than an RDBMS for certain applications.

Heroku

Heroku is a cloud application platform for Ruby-based applications.

Hive

Hive is a data warehouse infrastructure built on top of Hadoop. It is intended for ETL applications and utilizes the Hadoop MapReduce infrastructure for its processing. It is not intended for real-time processing. Latencies tend to be high (minutes) even when data batches are relatively small (a few hundred megabytes).

Infinispan

Infinispan is a distributed, in-memory key/value caching mechanism used by JBoss, similar to Memcached or WebSphere eXtreme Scale.

Jiql

Jiql is a JDBC library that allows Java applications to access Google’s BigTable. Originally, Google App Engine (GAE) applications had to be written in Python. Jiql provides a Python controller that acts as a JDBC gateway to BigTable. This appears to be obsolete since Google now provides a Java interface to BigTable.

Keyspace

Keyspace is a distributed, persistent key/value store developed by Scalien. It is available under the GNU AGPL license as well as under commercial licenses. Keyspace supports automatic replication and failover, and it is accessible via HTTP, C, and Python. It does not apparently support secondary indexes.

LightCloud

LightCloud is a lightweight layer on top of Tokyo Cabinet/Tyrant, providing a key/value store that is comparable to a persistent version of Memcached. LightCloud is written in Python, but a Ruby port is underway. A port to Redis as an alternative to Tokyo Tyrant is also available. LightCloud is available under the BSD license.

Lucandra

Lucandra is a project that allows Lucene to store its data in Cassandra. Essentially, Lucandra extends Lucene’s IndexReader and IndexWriter interfaces to use Cassandra, resulting in only a 10% loss of read performance compared to using regular disk files. (Write performance with Cassandra shows no loss of performance.)

LucidDB

LucidDB is an open source, columnar RDBMS database for Java and C++, intended for data warehouse applications. It does not support clustering and has unknown stability. Activity seems to be limited.

M/DB

M/DB is a SimpleDB clone built on top of GT.M. It is available under the GNU AGPLv3 license. Its apparent purpose is to serve as a training/test implementation of SimpleDB.

MDEX

MDEX is a semi-structured database used by the Endeca Information Access Engine (IAE), which is an enterprise search engine. Each MDEX record can have its own set of attributes and can be versioned. MDEX supports a distributed architecture for large deployments, and it can enforce access-controlled searches. Access is supported via a SOAP interface and Java and .Net APIs. MDEX is apparently available for commercial use.

Memcached

Memcached is a widely used distributed, memory cache system, intended for web access acceleration but used in many other applications. It was originally developed by Brad Fitzpatrick for LiveJournal in 2003. Distributed caching means that a request for an object on one server can get a “hit” if the object lives in the cache of any known server. Over time, objects expire and are deleted from the cache. Memcached is open source and freely available.

MemcacheDB

MemcacheDB is a persistent version of Memcached that uses the Berkeley DB as the persistent mechanism. The last version shipped was 1.2.1 (beta) on 12/25/2008. The project does not appear to be active. It uses a BSD-like license.

MonetDB

MonetDB is an open source, column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It supports SQL but has been described as “crashes a lot” and “corrupts your data”. It appears to be a research project with little activity.

Mnesia

Mnesia is a distributed, “soft real-time” database management system written in Erlang. It is comparable to Berkeley DB but intended as a persistence engine for Erlang applications. It was initially written by Ericsson for distributed, high-availability applications in the telecom space.

MySQL NDB Cluster

MySQL NDB Clusters deserve an honorable mention. MySQL provides a clustered storage solution called Network Database (NDB). MySQL clusters utilize a shared-nothing architecture, meaning that each MySQL server instance has local disk. However, MySQL clusters can use NDB to store data on shared data nodes, which can use mirrored disks. This provides a horizontal scaling capability with no single point of failure. Because the resulting database is still an RDBMS using ACID transactions, scalability hits a much lower ceiling than BASE-style databases.

PNUTS

PNUTS (Platform for Nimble Universal Table Storage), internally called Sherpa, is a Yahoo! project for big table-like distributed storage. It is currently a research project. Its goals are high availability, auto-sharding, multi-data center replication, and failover, similar to BigTable, Cassandra, and others.

PostgreSQL

PostgreSQL is probably the oldest open source, object/relational database system. It is an ANSI-compliant SQL database that supports replication and other production-level features. Databases have been reported in the terabyte range. PostgreSQL provides a modular, customizable architecture, though that same feature has been criticized for making deployments complex.

QD Technology

QD Technology is a company that provides a “quick response” database, which is an RDBMS that copies and compresses a production database so that it can be stored and queried on another computer, say a laptop. It is apparently a columnar database, so it is occasionally mentioned in articles about these things.

RDDB

RDDB is a REST-ful Ruby Database that provides a document-oriented functionality for Ruby applications. It is a new, alpha-stage project.

Redis

Redis is a persistent key/value database written in C, analogous to Tokyo Cabinet/Tyrant and MemcacheDB. It is reportedly “blazingly fast”: 110,000 “SETs” per second and 81,000 “GETs” per second on commodity Linux hardware. Redis is currently shipping version 1.2.0 and is available under the New BSD license. Sharding is supported only as a client-side option. Replication is also not supported. Reportedly, clustering is one of the next priorities. Retwis is a Twitter clone built using Redis; Retwis-RB is a similar clone using Ruby, Sinatra, and Redis. Redis was 1 year old on 2/25/2010.

Relational Database AMI

A Relational Database AMI is a popular, pre-packaged database offered by Amazon Web Services (AWS) as an Amazon Machine Image (AMI). Users can choose between various versions of DB2, Oracle, MySQL, SQL Server, PostgreSQL, and Sybase. Vertica is offered as a service. Essentially, these are the same as regular databases but deployed in Amazon’s cloud. The user assumes all DB administration tasks.

Relational Data Service (RDS)

Relational Data Service (RDS) is a full-featured version of the MySQL relational database available within the Amazon Web Services (AWS) portfolio. Five DB instance classes are available, from “small” to “quadruple extra-large”. Each instance class can support storage up to 1 TB. Price is based on the instance class and on actual usage of storage, I/Os, and data transferred in and out of RDS. Although automatic backups are performed, RDS databases currently have no redundancy, failover, or high-availability capabilities. Amazon is planning to provide a high availability offering in the future.

Ringo

Ringo is an experimental distributed key/value store modeled after Amazon’s Dynamo. It uses consistent hashing for elasticity and replication for availability. It is intended for immutable data. It is written in Erlang and is available under the BSD license.

Scalaris

Scalaris is a distributed key/value database initiated by Zuse Institute Berlin, written in Erlang. It uses Paxos for strong consistency during commits. It is open source under the Apache 2.0 license. Access is available via a web interface, Java, Erlang, or a JSON-based RPC. It is supported only on Linux variants. The backend can be switched to Tokyo Cabinet. It doesn’t seem to be widely used or very active at the moment.

Sesame

Sesame is an open source framework for storing and querying RDF data. It is available under a BSD-like license. Release 2.0 is shipping. It appears to sit above MySQL and does not add replication, sharding, or any other “big DB” features.

Terracotta

Terracotta is a “JVM persistence” database, allowing Java objects to be seamlessly saved to disk. It has been euphemistically called a “database killer” since it persists objects without a database backend. An open source version is available; a commercial enterprise version adds clustering, replication, fault-tolerance, and other features.

VPork

VPork is a utility for load-testing a distributed hash table (namely, project Voldemort). It is written in Java and launches a variable number of threads for reading, writing, and modifying hash table entries. It also uses Groovy.

VoltDB

VoltDB is a new OLTP RDBMS intended to allow horizontal scale-out. Michael Stonebraker is one of its co-founders. The project is currently in stealth mode, though an overview paper is available.

WebSphere eXtreme Scale

WebSphere eXtreme Scale was formerly WebSphere Extended Deployment Data Grid. This IBM product appears to be a commercial equivalent of Memcached. It provides distributed in-memory caching across a large number of computers, providing a caching grid with features such as automatic partitioning and replication.

[repost ]LevelDB – Fast And Lightweight Key/Value Database From The Authors Of MapReduce And BigTable

original:http://highscalability.com/blog/2011/8/10/leveldb-fast-and-lightweight-keyvalue-database-from-the-auth.html

LevelDB is an exciting new entrant into the pantheon of embedded databases, notable both for its pedigree, being authored by the makers of the now mythical Google MapReduce and BigTable products, and for its emphasis on efficient disk based random access using log-structured-merge (LSM) trees.

The plan is to keep LevelDB fairly low-level. The intention is that it will be a useful building block for higher-level storage systems. Basho is already investigating using LevelDB as one of its storage engines.

In the past many systems were built around embedded databases, though most developers now use database servers accessed via RPCs. An embedded database is a database distributed as a library and linked directly into your application. The application is responsible for providing a service-level API, sharding, backups, initiating consistency checking, initiating rollback, startup, shutdown, queries, etc. Applications become the container for the database and the manager of the database.

Architectures using embedded databases typically never expose a raw database abstraction at all. They have a service API, and the services use the embedded database library calls transparently behind the scenes. Often an embedded database will provide multiple access types, like indexed access for key-value use and btrees for range queries and cursors.
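
As a concrete but minimal sketch of this pattern, the snippet below links LevelDB directly into the application and hides it behind a tiny service-style API. The UserStore class, its method names, and the database path are made up for illustration; the leveldb::DB::Open, Put, and Get calls are the library’s standard C++ interface.

```cpp
#include <cassert>
#include <string>

#include "leveldb/db.h"

// Hypothetical service-level wrapper: callers see Put/Get and never
// touch the embedded database underneath.
class UserStore {
 public:
  explicit UserStore(const std::string& path) {
    leveldb::Options options;
    options.create_if_missing = true;  // create the database on first run
    leveldb::Status s = leveldb::DB::Open(options, path, &db_);
    assert(s.ok());
  }
  ~UserStore() { delete db_; }

  bool Put(const std::string& key, const std::string& value) {
    return db_->Put(leveldb::WriteOptions(), key, value).ok();
  }

  bool Get(const std::string& key, std::string* value) {
    return db_->Get(leveldb::ReadOptions(), key, value).ok();
  }

 private:
  leveldb::DB* db_;  // the embedded database, linked into this process
};

int main() {
  UserStore store("/tmp/userdb");
  store.Put("user:42", "alice");
  std::string v;
  if (store.Get("user:42", &v)) {
    // v == "alice"
  }
  return 0;
}
```

The application, not a server, owns sharding, backups, and startup/shutdown around a class like this, which is exactly the application-as-container arrangement described above.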

BerkeleyDB is one well-known example of an embedded database, SQLite is another, the file system is perhaps the most commonly used database, and there have been many, many other btree libraries in common use. I’ve used C-tree on several projects. In a battle of old versus new, a user named IM46 compared LevelDB to BerkeleyDB and found that LevelDB solidly outperforms Berkeley DB for larger databases.

Programmers usually thought doing this stuff was easy, wrote their own failed on-disk btree library (raises hand), and then looked around for a proven product. It’s only relatively recently that databases have gone upmarket and included a network layer and higher-level services.

Building a hybrid application/database architecture is still a very viable option when you want everything to be just so. If you are going to load balance requests across sharded application servers anyway, using a heavyweight external database infrastructure may not be necessary.

The LevelDB mailing list started off very active and has died down a bit, but is still nicely active and informative. Here are some excellent FAQish tips, performance suggestions, and porting issues extracted from the list:

  • Largest tested database: 1 billion entries with 16 byte keys and 100 byte values (roughly 100 GB of raw data, about half that after compression).
  • LevelDB has been Open Sourced.
  • Relationship between LevelDB and BigTable: The implementation of LevelDB is similar in spirit to the representation of a single Bigtable tablet (section 5.3). However the organization of the files that make up the representation is somewhat different and is explained [in source code comments]. They wanted to put together something like the BigTable tablet stack that had minimal dependencies and would be suitable for open sourcing, and also would be suitable for use in Chrome (for the IndexedDB implementation). LevelDB has the same general design as the BigTable tablet stack, but does not share any of the code.
  • Didier Spezia on log-structured-merge (LSM) trees: They are mostly useful to optimize random I/Os at insertion/delete time at the price of a slight degradation of read access time. They are extremely efficient at indexing data in random order stored on rotational disks (i.e. better than b-trees).
  • Optimized for random writes. TokyoCabinet could be filled with a million 100-byte writes in less than two seconds if writing sequentially, but the time ballooned to ~2000 seconds when writing randomly. The corresponding slowdown for LevelDB is from ~1.5 seconds (sequential) to ~2.5 seconds.
  • In the tradition of BerkeleyDB it’s a library you embed in your program, it’s not a server. You will have to add the networking layer, sharding, etc. if a single process won’t suffice.
  • Quite appropriately, threading decisions are left to the application; the library is not thread safe. Threads sharing iterators, for example, will need to lock (one way to do this is sketched after the list).
  • Data is written in sorted order.
  • C++ only.
  • Variable sized keys are used to save memory.
  • What leveldb does differently from B+trees is that it trades off write latency for write throughput: write latency is reduced by doing bulk writes, but the same data may be rewritten multiple times (at high throughput) in the background due to compactions.
  • Log-Structured Merge Trees offer better random write performance (compared to btrees). LevelDB always appends to a log file, or merges existing files together to produce new ones; crash behavior and recovery are covered in a later point.
  • Search performance is still O(lg N) with a very large branching factor (so the constant factor is small and number of seeks should be <= 10 even for gigantic databases).
  • One early user found performance degraded at around 200 million keys.
  • Bigger block sizes are better, e.g. increasing the block size from 64k to 256k.
  • Batching writes increases performance substantially (a WriteBatch sketch appears after this list).
  • Every write will cause a log file to grow, regardless of whether or not you are writing to a key which already exists in the database, and regardless of whether or not you are overwriting a key with the exact same value.  Only background compactions will get rid of overwritten data.  So you should expect high cpu usage while you are inserting data, and also for a while afterwards as background compactions rearrange things.
  • LevelDB Benchmarks look good:
    • Using 16 byte keys and 100 byte values:
      • Sequential Reads: LevelDB 4,030,000 ops/sec; Kyoto TreeDB 1,010,000 ops/sec; SQLite3 186,000 ops/sec
      • Random Reads: LevelDB 129,000 ops/sec; Kyoto TreeDB 151,000 ops/sec; SQLite3 146,000 ops/sec
      • Sequential Writes: LevelDB 779,000 ops/sec; Kyoto TreeDB 342,000 ops/sec; SQLite3 26,900 ops/sec
      • Random Writes: LevelDB 164,000 ops/sec; Kyoto TreeDB 88,500 ops/sec; SQLite3 420 ops/sec
    • Writing large values of 100,000 bytes each: LevelDB is even with Kyoto TreeDB, and SQLite3 is nearly 3 times as fast, because LevelDB writes keys and values at least twice.
    • A single batch of N writes may be significantly faster than N individual writes.
    • LevelDB’s performance improves greatly with more memory: a larger write buffer reduces the need to merge sorted files (since it creates a smaller number of larger sorted files).
    • Random read performance is much better in Kyoto TreeDB because the database is cached in RAM.
    • View many more results by following the link, but that’s the gist of it.
  • InnoDB benchmarks as run by Basho.
    • LevelDB showed a higher throughput than InnoDB and a similar or lower latency than InnoDB.
    • LevelDB may become a preferred choice for Riak users whose data set has massive numbers of keys and therefore is a poor match with Bitcask’s model.
    • Before LevelDB can be a first-class storage engine under Riak it must be portable to all of the same platforms that Riak is supported on.
  • LevelDB vs Kyoto Cabinet: My Findings. Ecstortive says wait a minute here, Kyoto is actually faster.
  • A good sign of adoption: language bindings are being built, e.g. Java and Tie::LevelDB on CPAN.
  • Comparing LevelDB and Bitcask: LevelDB is a persistent ordered map; Bitcask is a persistent hash table (no ordered iteration). Bitcask stores a fixed-size record in memory for every key, so for databases with a large number of keys it may use too much memory for some applications. Bitcask can guarantee at most one disk seek per lookup, I think. LevelDB may have to do a small handful of disk seeks. To clarify, LevelDB stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, LevelDB will need to do one seek (for the last level, since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, LevelDB will need two seeks. Bitcask is a combination of Erlang and C.
  • Writes can be lost, but that doesn’t trash the data files: Leveldb never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). Leveldb recovery code uses checksums to detect this and will skip the incomplete records.
  • LevelDB is being used as the back-end for IndexedDB in Chrome. For designing how to map secondary indices into LevelDB key/values, look at how the IndexedDB support within Chrome is implemented.
  • In case of a crash partial writes are ignored.
  • Possible scalability issues:
    • LevelDB keeps a separate file for every couple of MB of data, and these are all in one directory.  Depending on the underlying file system, this might start causing trouble at some point.
    • Scalability is more limited by the frequency of reads and writes that are being done, rather than the number of bytes in the system.
  • Transactions are not supported. Writes (including batches) are atomic.  Consistency is up to you.  There is limited isolation support.  Durability is a configurable option.  Full blown ACID transactions require a layer on top of LevelDB (see WebKit’s IndexedDB).
  • Michi Mutsuzaki compared LevelDB to MySQL as a key-value store. LevelDB had better overall insert throughput, but it was less stable (high variation in throughput and latency) than MySQL. There was no significant performance difference for an 80% read / 20% update workload.
  • LevelDB hasn’t been tuned for lots of concurrent readers and writers. Possible future enhancements:
    1. Do not hold the mutex while the writer is appending to the log (allow concurrent readers to proceed)
    2. Implement group commit (so concurrent writers have their writes grouped together).
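
Several of the tips in the list above (batching writes for throughput, a larger write buffer, and atomic batches in place of transactions) come together in a short sketch like the one below. The database path, keys, and buffer size are made up for illustration; WriteBatch and DB::Write are LevelDB’s standard batching interface.

```cpp
#include <string>

#include "leveldb/db.h"
#include "leveldb/write_batch.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  // A larger write buffer means fewer, larger sorted files to merge later.
  options.write_buffer_size = 8 * 1024 * 1024;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/batchdb", &db);
  if (!s.ok()) return 1;

  // Group many small updates into one batch: the batch is applied atomically
  // and is typically much faster than issuing N individual Put calls.
  leveldb::WriteBatch batch;
  for (int i = 0; i < 1000; ++i) {
    batch.Put("key:" + std::to_string(i), "value:" + std::to_string(i));
  }
  batch.Delete("key:999");  // deletes can ride in the same atomic batch

  leveldb::WriteOptions wopts;
  wopts.sync = false;  // durability is configurable; sync=true trades speed for safety
  s = db->Write(wopts, &batch);
  // s.ok() reports whether the whole batch was applied.

  delete db;
  return s.ok() ? 0 : 1;
}
```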

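Likewise, since threading decisions are left to the application, an iterator shared between threads needs external locking, as noted in the list. Below is a minimal sketch of one way to do that; the SharedScanner wrapper and the database path are hypothetical, while NewIterator, Valid, key, and Next are the standard iterator calls.

```cpp
#include <iostream>
#include <mutex>
#include <string>

#include "leveldb/db.h"

// Hypothetical guard around a single iterator shared by several threads.
// Giving each thread its own iterator would avoid the lock entirely.
class SharedScanner {
 public:
  explicit SharedScanner(leveldb::DB* db)
      : it_(db->NewIterator(leveldb::ReadOptions())) {
    it_->SeekToFirst();
  }
  ~SharedScanner() { delete it_; }

  // Hands out the next key, or returns false when the scan is finished.
  bool NextKey(std::string* key) {
    std::lock_guard<std::mutex> lock(mu_);  // only one thread advances at a time
    if (!it_->Valid()) return false;
    *key = it_->key().ToString();
    it_->Next();
    return true;
  }

 private:
  std::mutex mu_;
  leveldb::Iterator* it_;
};

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  if (!leveldb::DB::Open(options, "/tmp/batchdb", &db).ok()) return 1;

  {
    SharedScanner scanner(db);  // in practice, shared across worker threads
    std::string key;
    int count = 0;
    while (scanner.NextKey(&key)) ++count;
    std::cout << count << " keys scanned" << std::endl;
  }  // the iterator is released before the database is closed

  delete db;
  return 0;
}
```
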
Related Articles

[repost ]leveldb resources

1. LevelDB implementation analysis (leveldb 实现解析)
http://rdc.taobao.com/blog/cs/wp-content/plugins/leveldb%E5%AE%9E%E7%8E%B0%E8%A7%A3%E6%9E%90.pdf
Taobao - Core Systems R&D - Storage
那岩
neveray@gmail.com

2. 【朗格科技】LevelDb

http://www.samecity.com/blog/Index.asp?SortID=12