标签归档:NoSQL

[repost ]MySQL vs. Neo4j on a Large-Scale Graph Traversal

original:http://java.dzone.com/articles/mysql-vs-neo4j-large-scale

This post presents an analysis of MySQL (a relational database) and Neo4j (a graph database) in a side-by-side comparison on a simple graph traversal.

The data set that was used was an artificially generated graph with natural statistics. The graph has 1 million vertices and 4 million edges. The degree distribution of this graph on a log-log plot is provided below. A visualization of a 1,000 vertex subset of the graph is diagrammed above.

Loading the Graph

The graph data set was loaded both into MySQL and Neo4j. In MySQL a single table was used with the following schema.

1.CREATE TABLE graph (
2.outV INT NOT NULL,
3.inV INT NOT NULL
4.);
5.CREATE INDEX outV_index USING BTREE ON graph (outV);
6.CREATE INDEX inV_index USING BTREE ON graph (inV);

After loading the data, the table appears as below. The first line reads: “vertex 0 is connected to vertex 1.”

01.mysql> SELECT FROM graph LIMIT 10;
02.+------+-----+
03.| outV | inV |
04.+------+-----+
05.|    0 |   1 |
06.|    0 |   2 |
07.|    0 |   6 |
08.|    0 |   7 |
09.|    0 |   8 |
10.|    0 |   9 |
11.|    0 |  10 |
12.|    0 |  12 |
13.|    0 |  19 |
14.|    0 |  25 |
15.+------+-----+
16.10 rows in set (0.04 sec)

The 1 million vertex graph data set was also loaded into Neo4j. In Gremlin, the graph edges appear as below. The first line reads: “vertex 0 is connected to vertex 992915.”

01.gremlin> g.E[1..10]
02.==>e[183][0-related->992915]
03.==>e[182][0-related->952836]
04.==>e[181][0-related->910150]
05.==>e[180][0-related->897901]
06.==>e[179][0-related->871349]
07.==>e[178][0-related->857804]
08.==>e[177][0-related->798969]
09.==>e[176][0-related->773168]
10.==>e[175][0-related->725516]
11.==>e[174][0-related->700292]

Warming Up the Caches

Before traversing the graph data structure in both MySQL and Neo4j, each database had a “warm up” procedure run on it. In MySQL, a “SELECT * FROM graph” was evaluated and all of the results were iterated through. In Neo4j, every vertex in the graph was iterated through and the outgoing edges of each vertex were retrieved. Finally, for both MySQL and Neo4j, the experiment discussed next was run twice in a row and the results of the second run were evaluated.

Traversing the Graph

The traversal that was evaluated on each database started from some root vertex and emanated n-steps out. There was no sorting, no distinct-ing, etc. The only two variables for the experiments are the length of the traversal and the root vertex to start the traversal from. In MySQL, the following 5 queries denote traversals of length 1 through 5. Note that the “?” is a variable parameter of the query that denotes the root vertex.

01.SELECT a.inV FROM graph as a WHERE a.outV=?
02.
03.SELECT b.inV FROM graph as a, graph as b WHERE a.inV=b.outV AND a.outV=?
04.
05.SELECT c.inV FROM graph as a, graph as b, graph as c WHERE a.inV=b.outV AND b.inV=c.outV AND a.outV=?
06.
07.SELECT d.inV FROM graph as a, graph as b, graph as c, graph as d WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND a.outV=?
08.
09.SELECT e.inV FROM graph as a, graph as b, graph as c, graph as d, graph as e WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND d.inV=e.outV AND a.outV=?

For Neo4j, the Blueprints Pipes framework was used. A pipe of length n was constructed using the following static method.

01.public static Pipeline createPipeline(final Integer steps) {
02.final ArrayList pipes = new ArrayList();
03.for (int i = 0; i < steps; i++) {
04.Pipe pipe1 = new VertexEdgePipe(VertexEdgePipe.Step.OUT_EDGES);
05.Pipe pipe2 = new EdgeVertexPipe(EdgeVertexPipe.Step.IN_VERTEX);
06.pipes.add(pipe1);
07.pipes.add(pipe2);
08.}
09.return new Pipeline(pipes);
10.}

For both MySQL and Neo4j, the results of the query (SQL and Pipes) were iterated through. Thus, all results were retrieved for each query. In MySQL, this was done as follows.

1.while (resultSet.next()) {
2.resultSet.getInt(finalColumn);
3.}

In Neo4j, this is done as follows.

1.while (pipeline.hasNext()) {
2.pipeline.next();
3.}

Experimental Results

The artificial graph dataset was constructed with a “rich get richer“, preferential attachment model. Thus, the vertices created earlier are the most dense (i.e. highest number of adjacent vertices). This property was used to limit the amount of time it would take to evaluate the tests for each traversal. Only the first 250 vertices were used as roots of the traversals. Before presenting timing results, note that all of these experiments were run on a MacBook Pro with a 2.66GHz Intel Core 2 Duo and 4Gigs of RAM at 1067 MHz DDR3. The packages used were Java 1.6, MySQL JDBC 5.0.8, and Blueprints Pipes 0.1.2.

1.java version "1.6.0_17"
2.Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
3.Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

The following Java Virtual Machine parameters were used:

1.-Xmx1000M -Xms500M

Below are the total running times for both MySQL (red) and Neo4j (blue) for traversals of length 1, 2, 3, and 4.

The raw data is presented below along with the total number of vertices returned by each traversal—which, of course, is the same for both MySQL and Neo4j given that its the same graph data set being processed. Also realize that traversals can loop and thus, many of the same vertices are returned multiple times. Finally, note that only Neo4j has the running time for a traversal of length 5. MySQL did not finish after waiting 2 hours to complete. In comparison, Neo4j took 14.37 minutes to complete a 5 step traversal.

01.[mysql steps-1] time(ms):124 -- vertices_returned:11360
02.[mysql steps-2] time(ms):922 -- vertices_returned:162640
03.[mysql steps-3] time(ms):8851 -- vertices_returned:2206437
04.[mysql steps-4] time(ms):112930 -- vertices_returned:28125623
05.[mysql steps-5] N/A
06.
07.[neo4j steps-1] time(ms):27 -- vertices_returned:11360
08.[neo4j steps-2] time(ms):474 -- vertices_returned:162640
09.[neo4j steps-3] time(ms):3366 -- vertices_returned:2206437
10.[neo4j steps-4] time(ms):49312 -- vertices_returned:28125623
11.[neo4j steps-5] time(ms):862399 -- vertices_returned:358765631

Next, the individual data points for both MySQL and Neo4j are presented in the plot below. Each point denotes how long it took to return n number of vertices for the varying traversal lengths.

Finally, the data below provides the number of vertices returned per millisecond (on average) for each of the traversals. Again, MySQL did not finish in its 2 hour limit for a traversal of length 5.

01.[mysql steps-1] vertices/ms:91.6128847554668
02.[mysql steps-2] vertices/ms:176.399127537985
03.[mysql steps-3] vertices/ms:249.286746556076
04.[mysql steps-4] vertices/ms:249.053599519823
05.[mysql steps-5] N/A
06.
07.[neo4j steps-1] vertices/ms:420.740351166341
08.[neo4j steps-2] vertices/ms:343.122344772028
09.[neo4j steps-3] vertices/ms:655.507125256186
10.[neo4j steps-4] vertices/ms:570.360621871775
11.[neo4j steps-5] vertices/ms:416.00886711325

Conclusion

In conclusion, given a traversal of an artificial graph with natural statistics, the graph database Neo4j is more optimal than the relational database MySQL. However, no attempts have been made to optimize the Java VM, the SQL queries, etc. These experiments were run with both Neo4j and MySQL “out of the box” and with a “natural syntax” for both types of queries.

Source: http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/

[repost ]12款免费与开源的NoSQL数据库介绍

original:http://www.infoq.com/cn/news/2014/01/12-free-and-open-source-nosql

Naresh Kumar是位软件工程师与热情的博主,对于编程与新事物拥有极大的兴趣,非常乐于与其他开发者和程序员分享技术上的研究成果。近日,Naresh撰文谈到了12款知名的免费、开源NoSQL数据库,并对这些数据库的特点进行了分析。

现在,NoSQL数据库变得越来越流行,我在这里总结出了一些非常棒的、免费且开源的NoSQL数据库。在这些数据库中,MongoDB独占鳌头,拥有相当大的使用量。这些免费且开源的NoSQL数据库具有很好的可伸缩性与灵活性,非常适合于大数据存储与处理。相较于传统的关系型数据库,这些NoSQL数据库在性能上具有很大的优势。然而,这些NoSQL数据库未必最适合你。大多数常见的应用仍然可以使用传统的关系型数据库进行开发。NoSQL数据库依然不太适合于那些任务关键型的事务要求。我对这些数据库进行了一些简单介绍,下面就来看看。

1. MongoDB

MongoDB是个面向文档的数据库,使用JSON风格的数据格式。它非常适合于网站的数据存储、内容管理与缓存应用,并且通过配置可以实现复制与高可用性功能。

MongoDB具有很强的可伸缩性,性能表现优异。它使用C++编写,基于文档存储。此外,MongoDB还支持全文检索、跨WAN与LAN的高可用性、易于实现的复制、水平扩展、基于文档的丰富查询、在数据处理与聚合等方面具有很强的灵活性。

2. Cassandra

这是个Apache软件基金会的项目,Cassandra是个分布式数据库,支持分散的数据存储,可以实现容错以及无单点故障等。换句话说,“Cassandra非常适合于那些无法忍受数据丢失的应用”。

3. CouchDB

这也是Apache软件基金会的一个项目,CouchDB是另一个面向文档的数据库,以JSON格式存储数据。它兼容于ACID,像MongoDB一样,CouchDB也可以用于存储网站的数据与内容,以及提供缓存等。你可以通过JavaScript在CouchDB上运行MapReduce查询。此外,CouchDB还提供了一个非常方便的基于Web的管理控制台。它非常适合于Web应用。

4. Hypertable

Hypertable模仿的是Google的BigTable数据库系统。Hypertable的创建者将“成为高可用、PB规模的数据库开源标准”作为Hypertable的目标。换言之,Hypertable的设计目标是跨越多个廉价的服务器可靠地存储大量数据。

5. Redis

这是个开源、高级的键值存储。由于在键中使用了hash、set、string、sorted set及list,因此Redis也称作数据结构服务器。这个系统可以帮助你执行原子操作,比如说增加hash中的值、集合的交集运算、字符串拼接、差集与并集等。Redis通过内存中的数据集实现了高性能。此外,该数据库还兼容于大多数编程语言。

6. Riak

Riak是最为强大的分布式数据库之一,它提供了轻松且可预测的伸缩能力,向用户提供了快速测试、原型与应用部署能力,从而简化应用的开发过程。

7. Neo4j

Neo4j是一款NoSQL图型数据库,具有非常高的性能。它拥有一个健壮且成熟的系统的所有特性,向程序员提供了灵活且面向对象的网络结构,可以让开发者充分享受到拥有完整事务特性的数据库的所有好处。相较于RDBMS,Neo4j还对某些应用提供了不少性能改进。

8. Hadoop HBase

HBase是一款可伸缩、分布式的大数据存储。它可以用在数据的实时与随机访问的场景下。HBase拥有模块化与线性的可伸缩性,并且能够保证读写的严格一致性。HBase提供了一个Java API,可以实现轻松的客户端访问;提供了可配置且自动化的表分区功能;还有Bloom过滤器以及block缓存等特性。

9. Couchbase

虽然Couchbase是CouchDB的派生,不过它已经成为了一款功能完善的数据库产品。它向文档数据库转移的趋势会让MongoDB感到压力。每个节点上它都是多线程的,这是个非常主要的可伸缩性优势,特别是当托管在自定义或是Bare-Metal硬件上时更是如此。借助于一些非常棒的集成特性,诸如与Hadoop的集成,Couchbase对于数据存储来说是个非常不错的选择。

10. MemcacheDB

这是个分布式的键值存储系统,我们不应该将其与缓存解决方案搞混;相反,它是个持久化存储引擎,用于数据存储并以非常快速且可靠的方式检索数据。它遵循memcache协议。其存储后端用于Berkeley DB中,支持诸如复制与事务等特性。

11. REVENDB

RAVENDB是第二代开源数据库,它面向文档存储并且无模式,这样就可以轻松将对象存储到其中了。它提供了非常灵活且快速的查询,通过对复制、多租与分片提供开箱即用的支持使得我们可以非常轻松地实现伸缩功能。它对ACID事务提供了完整的支持,同时又能保证数据的安全性。除了高性能之外,它还通过bundle提供了轻松的可扩展性。

12. Voldemort

这是个自动复制的分布式存储系统。它提供了自动化的数据分区功能,透明的服务器失败处理、可插拔的序列化功能、独立的节点、数据版本化以及跨越各种数据中心的数据分发功能。

各位InfoQ读者,不知在你的项目中曾经、现在或是未来使用了哪些NoSQL数据库。现今的NoSQL世界纷繁复杂,NoSQL数据库也多如牛毛,而且有一些数据库提供了相似的特性,本文所列出的只是其中比较有代表性的12款NoSQL产品。你是否使用过他们呢?是否使用了本文没有介绍的产品呢?他们有哪些特性打动了你,让你决定使用他们呢?非常欢迎将你的经历与看法与我们一起分享。