Tag Archive: NoSQL

[repost] MySQL vs. Neo4j on a Large-Scale Graph Traversal

original:http://java.dzone.com/articles/mysql-vs-neo4j-large-scale

This post presents an analysis of MySQL (a relational database) and Neo4j (a graph database) in a side-by-side comparison on a simple graph traversal.

The data set that was used was an artificially generated graph with natural statistics. The graph has 1 million vertices and 4 million edges. The degree distribution of this graph on a log-log plot is provided below. A visualization of a 1,000 vertex subset of the graph is diagrammed above.

Loading the Graph

The graph data set was loaded both into MySQL and Neo4j. In MySQL a single table was used with the following schema.

CREATE TABLE graph (
  outV INT NOT NULL,
  inV INT NOT NULL
);
CREATE INDEX outV_index USING BTREE ON graph (outV);
CREATE INDEX inV_index USING BTREE ON graph (inV);

After loading the data, the table appears as below. The first line reads: “vertex 0 is connected to vertex 1.”

mysql> SELECT * FROM graph LIMIT 10;
+------+-----+
| outV | inV |
+------+-----+
|    0 |   1 |
|    0 |   2 |
|    0 |   6 |
|    0 |   7 |
|    0 |   8 |
|    0 |   9 |
|    0 |  10 |
|    0 |  12 |
|    0 |  19 |
|    0 |  25 |
+------+-----+
10 rows in set (0.04 sec)

The 1 million vertex graph data set was also loaded into Neo4j. In Gremlin, the graph edges appear as below. The first line reads: “vertex 0 is connected to vertex 992915.”

gremlin> g.E[1..10]
==>e[183][0-related->992915]
==>e[182][0-related->952836]
==>e[181][0-related->910150]
==>e[180][0-related->897901]
==>e[179][0-related->871349]
==>e[178][0-related->857804]
==>e[177][0-related->798969]
==>e[176][0-related->773168]
==>e[175][0-related->725516]
==>e[174][0-related->700292]

Warming Up the Caches

Before traversing the graph data structure in both MySQL and Neo4j, each database had a “warm up” procedure run on it. In MySQL, a “SELECT * FROM graph” was evaluated and all of the results were iterated through. In Neo4j, every vertex in the graph was iterated through and the outgoing edges of each vertex were retrieved. Finally, for both MySQL and Neo4j, the experiment discussed next was run twice in a row and the results of the second run were evaluated.
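A rough sketch of what the Neo4j-side warm-up could look like; the graph variable and the getVertices()/getOutEdges() calls are my assumptions about the Blueprints API of that era, not code from the original post.

// Touch every vertex and its outgoing edges so the caches are hot.
for (Vertex vertex : graph.getVertices()) {
    for (Edge edge : vertex.getOutEdges()) {
        // iterating is enough; no per-edge work is needed
    }
}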

Traversing the Graph

The traversal that was evaluated on each database started from some root vertex and emanated n-steps out. There was no sorting, no distinct-ing, etc. The only two variables for the experiments are the length of the traversal and the root vertex to start the traversal from. In MySQL, the following 5 queries denote traversals of length 1 through 5. Note that the “?” is a variable parameter of the query that denotes the root vertex.

SELECT a.inV FROM graph as a WHERE a.outV=?

SELECT b.inV FROM graph as a, graph as b WHERE a.inV=b.outV AND a.outV=?

SELECT c.inV FROM graph as a, graph as b, graph as c WHERE a.inV=b.outV AND b.inV=c.outV AND a.outV=?

SELECT d.inV FROM graph as a, graph as b, graph as c, graph as d WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND a.outV=?

SELECT e.inV FROM graph as a, graph as b, graph as c, graph as d, graph as e WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND d.inV=e.outV AND a.outV=?

For Neo4j, the Blueprints Pipes framework was used. A pipe of length n was constructed using the following static method.

public static Pipeline createPipeline(final Integer steps) {
    final ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    for (int i = 0; i < steps; i++) {
        Pipe pipe1 = new VertexEdgePipe(VertexEdgePipe.Step.OUT_EDGES);
        Pipe pipe2 = new EdgeVertexPipe(EdgeVertexPipe.Step.IN_VERTEX);
        pipes.add(pipe1);
        pipes.add(pipe2);
    }
    return new Pipeline(pipes);
}
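For context, a hypothetical use of this method pairs it with the root vertex of the traversal. The setStarts call and the graph/rootId variables below are my assumptions about the Pipes 0.1-era API and its surroundings, not code from the original post.

// Build a 4-step pipeline and start it at the root vertex (assumed API).
Pipeline pipeline = createPipeline(4);
pipeline.setStarts(Arrays.asList(graph.getVertex(rootId)).iterator());
while (pipeline.hasNext()) {
    pipeline.next(); // drain every result, exactly as the experiment does
}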

For both MySQL and Neo4j, the results of the query (SQL and Pipes) were iterated through. Thus, all results were retrieved for each query. In MySQL, this was done as follows.

while (resultSet.next()) {
    resultSet.getInt(finalColumn);
}

In Neo4j, this is done as follows.

while (pipeline.hasNext()) {
    pipeline.next();
}

Experimental Results

The artificial graph dataset was constructed with a “rich get richer,” preferential attachment model. Thus, the vertices created earlier are the most dense (i.e., have the highest number of adjacent vertices). This property was used to limit the amount of time it would take to evaluate the tests for each traversal: only the first 250 vertices were used as roots of the traversals. Before presenting timing results, note that all of these experiments were run on a MacBook Pro with a 2.66GHz Intel Core 2 Duo and 4 GB of RAM at 1067 MHz DDR3. The packages used were Java 1.6, MySQL JDBC 5.0.8, and Blueprints Pipes 0.1.2.

java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

The following Java Virtual Machine parameters were used:

-Xmx1000M -Xms500M

Below are the total running times for both MySQL (red) and Neo4j (blue) for traversals of length 1, 2, 3, and 4.

The raw data is presented below along with the total number of vertices returned by each traversal, which is of course the same for both MySQL and Neo4j given that it is the same graph data set being processed. Also realize that traversals can loop, and thus many of the same vertices are returned multiple times. Finally, note that only Neo4j has a running time for a traversal of length 5: MySQL did not finish after 2 hours of waiting. In comparison, Neo4j took 14.37 minutes to complete the 5-step traversal.

[mysql steps-1] time(ms):124 -- vertices_returned:11360
[mysql steps-2] time(ms):922 -- vertices_returned:162640
[mysql steps-3] time(ms):8851 -- vertices_returned:2206437
[mysql steps-4] time(ms):112930 -- vertices_returned:28125623
[mysql steps-5] N/A

[neo4j steps-1] time(ms):27 -- vertices_returned:11360
[neo4j steps-2] time(ms):474 -- vertices_returned:162640
[neo4j steps-3] time(ms):3366 -- vertices_returned:2206437
[neo4j steps-4] time(ms):49312 -- vertices_returned:28125623
[neo4j steps-5] time(ms):862399 -- vertices_returned:358765631

Next, the individual data points for both MySQL and Neo4j are presented in the plot below. Each point denotes how long it took to return n number of vertices for the varying traversal lengths.

Finally, the data below provides the number of vertices returned per millisecond (on average) for each of the traversals. Again, MySQL did not finish in its 2 hour limit for a traversal of length 5.

[mysql steps-1] vertices/ms:91.6128847554668
[mysql steps-2] vertices/ms:176.399127537985
[mysql steps-3] vertices/ms:249.286746556076
[mysql steps-4] vertices/ms:249.053599519823
[mysql steps-5] N/A

[neo4j steps-1] vertices/ms:420.740351166341
[neo4j steps-2] vertices/ms:343.122344772028
[neo4j steps-3] vertices/ms:655.507125256186
[neo4j steps-4] vertices/ms:570.360621871775
[neo4j steps-5] vertices/ms:416.00886711325

Conclusion

In conclusion, for a traversal of an artificial graph with natural statistics, the graph database Neo4j outperforms the relational database MySQL. However, no attempt was made to optimize the Java VM, the SQL queries, etc. These experiments were run with both Neo4j and MySQL “out of the box” and with a “natural syntax” for both types of queries.

Source: http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/

[repost] An Introduction to 12 Free and Open-Source NoSQL Databases

original:http://www.infoq.com/cn/news/2014/01/12-free-and-open-source-nosql

Naresh Kumar is a software engineer and passionate blogger with a great interest in programming and new technologies, and he enjoys sharing his technical findings with other developers and programmers. Recently, Naresh wrote about 12 well-known free and open-source NoSQL databases and analyzed their characteristics.

NoSQL databases are becoming more and more popular, so I have put together a list of some excellent free and open-source NoSQL databases. Among them, MongoDB leads the pack with a substantial share of usage. These free and open-source NoSQL databases offer great scalability and flexibility, making them well suited to big data storage and processing, and they hold significant performance advantages over traditional relational databases. However, they are not necessarily the best fit for you: most common applications can still be built on a traditional relational database, and NoSQL databases remain a poor fit for mission-critical transactional requirements. Below is a brief introduction to each of these databases.

1. MongoDB

MongoDB is a document-oriented database that uses a JSON-style data format. It is well suited to website data storage, content management, and caching applications, and it can be configured for replication and high availability.

MongoDB is highly scalable, with excellent performance. It is written in C++ and based on document storage. In addition, MongoDB supports full-text search, high availability across WANs and LANs, easy replication, horizontal scaling, rich document-based queries, and great flexibility in data processing and aggregation.

2. Cassandra

An Apache Software Foundation project, Cassandra is a distributed database that supports decentralized data storage with fault tolerance and no single point of failure. In other words, “Cassandra is a great fit for applications that cannot afford to lose data.”

3. CouchDB

Also an Apache Software Foundation project, CouchDB is another document-oriented database that stores data in JSON format. It is ACID-compliant, and like MongoDB it can be used to store website data and content and to provide caching. You can run MapReduce queries on CouchDB using JavaScript. CouchDB also provides a very convenient web-based administration console. It is a great fit for web applications.

4. Hypertable

Hypertable is modeled on Google's BigTable database system. Its creators set the goal of making Hypertable “the open-source standard for highly available, petabyte-scale databases.” In other words, Hypertable is designed to reliably store massive amounts of data across many inexpensive servers.

5. Redis

Redis is an open-source, advanced key-value store. Because keys can hold hashes, sets, strings, sorted sets, and lists, Redis is also called a data structure server. The system helps you perform atomic operations, such as incrementing a value inside a hash, set intersection, string appending, and set difference and union. Redis achieves high performance through an in-memory dataset, and client support is available for most programming languages (a brief sketch follows below).
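To make those operations concrete, here is a minimal sketch using the Jedis Java client (my choice of client; the server address and key names are hypothetical):

import redis.clients.jedis.Jedis;

public class RedisSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        jedis.hincrBy("user:1", "visits", 1);                  // atomically increment a hash field
        jedis.sinterstore("likes:both", "likes:1", "likes:2"); // set intersection, stored in a new key
        jedis.append("log:1", "new entry;");                   // string append
        jedis.close();
    }
}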

6. Riak

Riak is one of the most powerful distributed databases. It offers easy and predictable scaling and gives users the ability to quickly test, prototype, and deploy applications, simplifying the development process.

7. Neo4j

Neo4j is a NoSQL graph database with very high performance. It has all the features of a robust and mature system, offering programmers a flexible, object-oriented network structure while giving developers all the benefits of a database with full transaction support. For certain applications it also delivers considerable performance improvements over an RDBMS.

8. Hadoop HBase

HBase is a scalable, distributed big data store that can be used for real-time, random access to data. It offers modular and linear scalability and guarantees strictly consistent reads and writes. HBase provides a Java API for easy client access (a brief sketch follows below), configurable and automatic table sharding, and features such as Bloom filters and block caching.
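As a rough illustration of that Java API, here is a minimal put-then-get sketch against the classic (circa 0.94) HBase client; the table, family, and column names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical table
        // Write one cell: row "row-1", family "info", qualifier "name".
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);
        // Read it back with a random-access Get.
        Result result = table.get(new Get(Bytes.toBytes("row-1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
    }
}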

9. Couchbase

Although Couchbase is a derivative of CouchDB, it has grown into a fully featured database product in its own right. Its shift toward the document database space will put pressure on MongoDB. It is multi-threaded on each node, which is a major scalability advantage, especially when hosted on custom or bare-metal hardware. With excellent integration features, such as Hadoop integration, Couchbase is a very good choice for data storage.

10. MemcacheDB

MemcacheDB is a distributed key-value store that should not be confused with a caching solution; rather, it is a persistent storage engine for storing data and retrieving it in a very fast and reliable way. It conforms to the memcache protocol. Its storage backend is Berkeley DB, which supports features such as replication and transactions.

11. RavenDB

RavenDB is a second-generation open-source database. It is document-oriented and schema-free, so objects can be stored in it with ease. It offers very flexible and fast queries, and with out-of-the-box support for replication, multi-tenancy, and sharding, scaling is very easy. It provides full support for ACID transactions while keeping data safe, and beyond its high performance it also offers easy extensibility via bundles.

12. Voldemort

Voldemort is an automatically replicating distributed storage system. It provides automatic data partitioning, transparent handling of server failures, pluggable serialization, independent nodes, data versioning, and data distribution across multiple data centers.

Dear InfoQ readers: which NoSQL databases have you used in your projects, whether past, present, or future? Today's NoSQL world is vast and varied, with countless NoSQL databases, some offering similar features; this article lists only 12 of the more representative products. Have you used any of them? Have you used products not covered here? Which of their features won you over and led you to adopt them? You are very welcome to share your experiences and opinions with us.

[repost] NoSQL Job Trends: August 2013

original:http://architects.dzone.com/articles/nosql-job-trends-august-2013


Today is the NoSQL installment of the August job trends. For the NoSQL job trends, I am focusing on Cassandra, Redis, Couchbase, SimpleDB, CouchDB, MongoDB, HBase, and Riak. As was stated previously, Voldemort has been replaced by Couchbase. Hadoop is still the clear leader in demand, but it may be included in the next update given how close the others are getting.

First, let’s look at the trends from Indeed:

Indeed NoSql Job Trends - August 2013

MongoDB continues to lead the pack with a big jump in demand since the beginning of 2013. The other major players, Cassandra, HBase and Redis have seen similar increases in 2013. CouchDB stays flat but maintains a slim lead over Riak, which may finally be gaining some steam. Couchbase is also showing positive signs but still lags behind most. SimpleDB remains flat as well, with Couchbase surpassing its demand earlier this year.

Now, it’s time for the short term trends from SimplyHired:

SimplyHired NoSql Job Trends - August 2013

Overall, SimplyHired is showing completely different trends than Indeed. MongoDB shows a drop in demand early in the year and has stayed flat since February. Cassandra dipped a bit in the spring but stays ahead of HBase for now. HBase is really the only tool that has a consistently positive trend since February. It should overtake Cassandra by the end of the year. Redis and CouchDB both show a very slight decline in the past few months. Consistent with the Indeed trend, Riak is showing some growth the past few months. Couchbase and SimpleDB trail the group with flat trends.

Lastly, we look at the relative growth from Indeed:

Indeed NoSql Job Growth - August 2013

HBase continues to grow like a weed with the other tools lagging significantly. Cassandra is still growing rapidly and maintains a solid growth lead over the other tools. MongoDB continues to grow and outpaces Redis and Riak, which have almost identical growth trends. Couchbase is starting to grow this year, and is outpacing CouchDB and SimpleDB, which barely register any growth on this chart.

It really looks like the major players in the NoSQL space are not going to change much. The Couchbase and CouchDB differentiation sounds like it will blur even more as Couchbase starts updating CouchDB again. SimpleDB is not gaining adoption like the other tools, but if Amazon continues to support it, I would not be surprised if it does start to grow at some point. Obviously, the continued growth of each of these tools really shows how much activity there is in the NoSQL space, and that these are tools that we all need to start learning.

As always, if there are other NoSQL tools that should be included, please let me know in the comments.

 

[repost] Recent Developments in SQL on Hadoop Systems (1)

original:http://yanbohappy.sinaapp.com/?p=381

Why insist on putting SQL on Hadoop? Because SQL is easy to use.
Then why insist on building it on Hadoop? Because of the robust and scalable architecture of Hadoop.

The main SQL on Hadoop products today are:
Hive, Tez/Stinger, Impala, Shark/Spark, Phoenix, Hawq/Greenplum, HadoopDB, Citusdata, and others. This article mainly discusses the features and latest progress of Hive, Tez/Stinger, Impala, Shark, and the traditional open-source data warehouse brighthouse; the next article will cover Hawq/Greenplum, Phoenix, HadoopDB, and Citusdata.

In Internet companies, the data in a typical Hadoop-based data warehouse comes mainly from the following sources:
1. Log collection and analysis systems such as Flume/Scribe/Chukwa gather logs from Apache/nginx server clusters onto HDFS; a SerDe specified when the Hive table is created then converts the unstructured log data into structured data.
2. Tools such as Sqoop periodically import user and business dimension data (usually stored in Oracle/MySQL) into Hive, giving the OLTP data a copy to use for OLAP (see the sketch after this list).
3. Data imported from other external DW data sources via ETL tools.
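As a rough illustration of source 2, a Sqoop invocation along these lines copies a MySQL table into Hive. The host, credentials, and table names here are hypothetical, but the flags are standard Sqoop options:

# Import the MySQL table `orders` into the Hive table `dw.orders`
# (hypothetical connection details; --hive-import creates and loads the Hive table).
sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --username etl \
  --password-file /user/etl/.mysql-pw \
  --table orders \
  --hive-import \
  --hive-table dw.orders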

All of today's SQL on Hadoop products are really only suitable in one or a few specific domains; there is no silver bullet. A product that satisfies nearly every enterprise application the way Oracle/Teradata once did is unrealistic at this stage, so each SQL on Hadoop product tries to fit the characteristics of a particular class of application.
Typical requirements:
1. Interactive query (ms ~ 3 min)
2. Data analyst, reporting query (3 min ~ 20 min)
3. Data mining, modeling, and large ETL (20 min ~ hours ~ days)
4. Machine learning (served by computation models such as MapReduce/MPI/Spark)

Hive
Hive is currently the most common solution in Internet companies for processing big data and building data warehouses; many companies even deploy Hadoop clusters not to run native MapReduce programs at all, but solely to run Hive SQL queries.

Companies with many data scientists and analysts generate many queries against the same tables, and having everyone query Hive directly is obviously both slow and wasteful of resources. When we deploy online database systems, we put Redis or memcache in front of the DB to cache frequently accessed data; OLAP applications can follow a similar approach and keep frequently accessed data in an in-memory cluster for users to query.
Facebook built Presto for exactly this need: a system that keeps hot data in memory to serve SQL queries, a design quite similar to Impala and Stinger. Simple queries in Presto take only a few hundred milliseconds, and even very complex queries complete within minutes; it runs in memory and never writes to disk. More than 850 Facebook engineers use it every day to scan over 320 TB of data, covering 80% of their ad-hoc query needs.

Hive's main shortcomings today:
1. The data shuffle is a network bottleneck: Reduce must wait for Map to finish before it can start, so network bandwidth is not used efficiently.
2. A single SQL statement usually compiles into multiple MR jobs, and Hadoop writes each job's output straight to HDFS, which is bad for performance.
3. Every job execution has to launch tasks, which costs a lot of time and makes real-time response impossible.
4. When SQL is translated into MapReduce jobs, map, shuffle, and reduce each carry out different SQL functions, which creates a need for patterns like Map->MapReduce or MapReduce->Reduce; these would cut the number of HDFS writes and thereby improve performance.

Hive's main improvements under way:

1. Merging the multiple MR jobs compiled from a single Hive SQL statement.
Many of the MR jobs Hive produces are of the Map->MapReduce type, so this process can be merged into a single MR job. https://issues.apache.org/jira/browse/HIVE-3952

2. Hive query optimizer

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/optimize-joins.html

  • Joins where one side fits in memory
  • Improved star schema joins: previously, joining one large table with several small tables matched on different columns had to compile into several map joins plus MR jobs; now they can be merged into a single MR job.

The aim of this line of work is that users should not have to supply many hints: Hive can choose the fastest join method by itself from table sizes, row counts, and so on (use a map join when the small table fits in memory, and merge a map join into other MR jobs whenever possible). This resembles a cost-based query optimizer, which must estimate which execution strategy is more efficient before the user's SQL is translated into an execution plan. A sketch of the explicit hint this would make unnecessary follows below.
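For reference, this is the style of explicit hint Hive has historically required for a map join; a minimal sketch with hypothetical table names:

-- Ask Hive to load the small dim_city table into a hash table and stream
-- the large fact_visits table past it (a map-side join, no reduce phase).
SELECT /*+ MAPJOIN(c) */ v.user_id, c.city_name
FROM fact_visits v JOIN dim_city c ON (v.city_id = c.city_id);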

3. ORCFile
ORCFile is a columnar storage file format, and columnar storage is a huge advantage for analytical applications. http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
The older RCFile treated every column as a binary blob with no semantics, so only general-purpose compression methods such as zlib, LZO, and Snappy could be applied.
ORCFile knows the type of each column (int versus string, say), so it can apply lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding. These techniques bring two advantages: a better compression ratio, and the ability to filter out irrelevant data.
ORCFile currently has three main encodings:

  • Bit encoding, usable with all data types. It follows Google's protocol buffers, using the high bit of each byte to indicate whether more bytes follow and the lower 7 bits to encode data.
  • Run-length encoding, dedicated to int types.
  • Dictionary encoding, dedicated to string types. The dictionary also helps filter predicate conditions in queries.

For some columns, run-length encoding reduces storage by 3-4 orders of magnitude and memory use by 2-3 orders of magnitude, while dictionary encoding typically reduces disk space by about 20x and memory by about 5x. According to Google's PowerDrill experiments, these specialized encodings can speed up common aggregation queries by 2-3 orders of magnitude.

Predicate pushdown: Hive used to read all data into memory and only then decide which rows matched the query. In ORCFile, data is read into memory one stripe at a time, and the ORCFile RecordReader uses each stripe's metadata (Index Data, kept resident in memory) to judge whether the stripe can satisfy the query; stripes that cannot are skipped without being read, saving IO.

For ORCFile's compression results, usage, and performance, see the Hortonworks blog:

http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
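Adopting ORC from the Hive side is a one-line change in the table DDL; a minimal sketch with a hypothetical table (STORED AS ORC is the standard syntax from Hive 0.11 on):

-- Store the table in ORC format, so Hive gets the column encodings,
-- stripe statistics, and predicate pushdown described above.
CREATE TABLE page_views (
  user_id BIGINT,
  url STRING,
  ts TIMESTAMP
)
STORED AS ORC;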

In the future, ORCFile will also support lightweight indexes: the maximum and minimum values over each group of 10,000 rows within every column.

From the above analysis of ORCFile, you have probably already glimpsed the shadow of brighthouse: both keep per-column indexes, statistics, dictionaries, and the like in memory to help filter on query conditions, skipping whatever cannot match and saving a great deal of IO. For brighthouse, see the analysis below.

4. HiveServer2's security and concurrency features

http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/

HiveServer2 supports access by concurrent clients (JDBC/ODBC).
Cloudera has also built Sentry to handle security and authorization management for the Hadoop ecosystem.
These two features are the chief concerns of enterprises deploying Hadoop/Hive.

5. HCatalog, Hadoop's unified metadata management platform
Currently there is no schema consistency guarantee between the table metadata stored by Hive and the table data stored on HDFS; it is up to administrators to keep them aligned. Changes Hive makes to columns today only modify Hive's metadata, never the actual data. For example, if you add a column from the hive command line, only the Hive metadata changes; the format stored on HDFS does not. You also have to change the programs that import data into HDFS to alter the stored file format, and then restart the Hive service, wearing out the system administrator.

  • Hadoop today treats tables as 'schema on read'; with HCatalog you can get an EDW-style 'schema on write'.
  • HCatalog serves metadata over a REST interface, which helps share different kinds of data (unstructured/semi-structured/structured) across different platforms (HDFS/HBase/Oracle/MySQL), and lets Hadoop and an EDW be used together.
  • HCatalog decouples schema from storage format for the user. For example, when writing MR jobs today, all row data is handled as Text, and parsing the individual columns out of the Text is left to the programmer. With HCatalog the programmer no longer has to deal with this: just name the Database->Table, and the schema can be obtained by querying HCatalog. It also avoids the situation where existing programs stop working after the storage format changes.

6. Vectorized Query Execution in Hive

https://issues.apache.org/jira/browse/HIVE-4160

  • Hive currently processes data one row at a time, invoking lazy deserialization to materialize each column's Java object, which clearly hurts efficiency badly.
  • Reading and processing many rows at once (for basic comparisons or numeric computation) cuts the excessive per-row function calls and improves CPU utilization and cache hit rates (see the sketch after this list).
  • Vector-based primitives need to be implemented: vectorized scan, filter, scalar aggregate, group-by-aggregate, hash join, and so on.
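Once this work ships, it sits behind a session-level switch; a minimal sketch, assuming a Hive release that includes the feature and a hypothetical table:

-- Process batches of rows per operator call instead of one row at a time.
SET hive.vectorized.execution.enabled = true;
SELECT COUNT(*) FROM page_views WHERE user_id > 0;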

Tez/Stinger

  • The underlying execution engine is no longer MR but a more general, YARN-based DAG execution engine.
  • MR abstracts everything into two operations, Map and Reduce, while Tez provides richer interfaces on top of them: Map is broken down into Input, Processor, Sort, Merge, and Output, and Reduce into Input, Shuffle, Sort, Merge, Processor, and Output. In an MR program the programmer only writes the Processor logic, with the rest handled by choosing among a few concrete implementations; Tez gives us far more freedom. This is actually somewhat like Spark: both expose richer operational units to the user.
  • A traditional Reduce can only write its output to HDFS, whereas a Tez Reduce Processor can hand its output to the next Reduce Processor as input.
  • Hot tables are also cached in memory.
  • Tez service: pre-launched containers and container reuse cut the task startup time after each query plan is generated, improving responsiveness.
  • Tez itself is just a library under the YARN framework and needs no deployment; simply set mapreduce.framework.name=yarn-tez.

http://dongxicheng.org/mapreduce-nextgen/apache-tez-newest-progress/

Future directions:
A cost-based optimizer that picks an execution strategy from statistics, e.g. which order is most efficient for a multi-table JOIN.
Collecting statistics such as the row/column counts of each intermediate table during execution, to decide how many MR tasks to launch.

Impala
Impala can be viewed as a hybrid of the Google Dremel architecture and an MPP (Massively Parallel Processing) design.

https://github.com/cloudera/impala

Dremel paper: http://research.google.com/pubs/pub36632.html

Pros:

  • Two join types are currently supported: broadcast join and partition join. When joining large tables that exceed memory limits, part of the data has to be dumped to disk, which is slow.
  • The Parquet columnar format, which can also handle nested data. With nested data and extended SQL query semantics, certain scenarios can avoid JOINs altogether, removing part of the performance bottleneck.
  • Cloudera Manager 4.6 and later will include slow-query analysis.
  • Runtime code generation http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/
  • Impala can read data from local disk directly rather than going through HDFS.

Cons:

  • Impala does not sort results by the GROUP BY columns.
  • UDFs are not yet supported; Impala 1.2 will support Hive UDFs (written in Java) as well as Impala native UDFs and UDAs (with an interface similar to PostgreSQL's).
  • There is no equivalent of Hive's Serializer/Deserializer, which makes ETL from unstructured to structured data awkward.
  • No fault tolerance for running queries: if any node taking part in a query fails, Impala discards the whole query.
  • Security support is still weak: data transferred between impalad processes is not encrypted, and there is no table- or column-level authorization.
  • Each PlanFragment is executed with as much parallelism as possible, but that is not always easy; for example, a hash join cannot start until one of the tables has been completely scanned.

Despite this long list of shortcomings, many companies have begun trying Impala. Baidu, for example, has experimented with plugging MySQL into Impala as a backend storage engine and implementing the PlanFragments for the corresponding operations, so incoming user queries are parsed into PlanFragments just as before and then scheduled directly onto the matching nodes (HDFS DataNode/HBase RegionServer/MySQL). Some source or intermediate data is kept in MySQL, and when a user's query touches that data it is fetched straight from MySQL.

Shark/Spark
Data is kept in memory whenever it fits; memory use is very aggressive. The upside is that JOINs are fast; the downside is that it consumes a great deal of memory, manages that memory itself, and does not release it once taken.
UDFs are supported.
Performance:
For very simple select…where queries, Shark's improvement is not significant (because Hive does not spend much time on these anyway).
But for complex queries like select…join…where…group by, Hive produces more jobs and reads and writes HDFS more often, so running time naturally grows. While memory is still large enough, Shark performs best; when memory cannot hold all the data, performance drops, but it is still much better than Hive.

What SQL on Hadoop products should learn from traditional data warehouses
Take the open-source data warehouse brighthouse (a data warehouse storage engine built on MySQL) as an example.
VLDB 2008 paper: <<Brighthouse: An Analytic Data Warehouse for Ad-hoc Queries>>

brighthouse reuses MySQL's code for SQL parsing, with its own purpose-built optimizer, executor, and storage engine.
brighthouse organizes its data storage in three layers: Data Pack, Data Pack Node, and Knowledge Node.

  • DP (Data Pack): brighthouse stores data by column; each DP holds 64K cells of one column.
  • DPN (Data Pack Node): DPNs correspond one-to-one with DPs; each DPN records statistics (max, min, count, sum) for its DP's data.
  • KN (Knowledge Node): more detailed information about a DP's data and about relationships between DPs.

A KN is further divided into three parts:

  • HISTs (Histograms): statistical histograms for numeric columns, which can quickly determine whether a DP can satisfy the query conditions.
  • CMAPs (Character Maps): bitmaps for text columns, used to locate characters quickly (optimizing the LIKE keyword).
  • Pack-To-Pack: bitmaps of the relationship between two columns (DPs) produced by equi-JOIN operations.

DPNs and KNs are essentially statistics over the DPs, taking up about 1% of the DPs' storage, so they fit easily in memory. Their purpose is to quickly identify which DPs are relevant to a query, which are irrelevant, and which are suspect (possibly relevant), thereby shrinking the volume of data read through IO and improving performance.

Performance tests: http://www.fuchaoqun.com/tag/brighthouse/
Two things stand out from these tests:
1. Compression ratio: infobright compresses much better than either MyISAM or tar.gz.
2. Query performance: queries run 3-6x faster than against an indexed MyISAM table.

In short, what everyone is still missing:
1. Workload management and query optimization.
How should a multi-table JOIN be executed? A JOIN of 3 tables, for example, has 6 possible execution orders, and the most efficient one can only be found by costing each order. Traditional database and data warehouse systems (Oracle/Teradata/PostgreSQL) all have very good query optimizers; in a distributed system, how to relate the relevant metrics (disk IO, network bandwidth, memory) to final query efficiency is a problem that deserves serious study.
2. Correlated sub-queries, which nobody has implemented yet.
TPC-H contains many examples of correlated sub-queries, but none of today's SQL on Hadoop products supports them. According to the Impala folks, their customers' demand for this is not strong, and most correlated sub-queries can be rewritten as JOINs. Commercial products such as Hawq/Greenplum, however, do support correlated sub-queries.

[repost] Big Data Debate: Will HBase Dominate NoSQL?

original:http://www.informationweek.com/software/enterprise-applications/big-data-debate-will-hbase-dominate-nosq/240159475

HBase is modeled after Google BigTable and is part of the world’s most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop’s popularity and HBase’s scalability and consistency ensure success. The growing HBase community will surpass other open-source movements and will overcome a few technical wrinkles that have yet to be worked out.

Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase flaws are too numerous and intrinsic to Hadoop’s HDFS architecture to overcome. These flaws will forever limit HBase’s applicability to high-velocity workloads, he says.

Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.

 

For The Motion


Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies

Integration With Hadoop Will Drive Adoption

 

The answer to the question is a crystal-clear “Yes, but…”

In order to appreciate this response, we need to step back a bit and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, took up the polyglot persistence argument that “one size does not fit all.”

Hence, I’m going to interpret the “dominant” in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the line of, “Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?”

This is a bold assertion given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos to large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.

Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a huge and diverse community under its belt in all respects: users, developers, multiple commercial vendors and availability in the cloud, the last through Amazon Web Services (AWS), for example.

Historically, both HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft) and was initially part of Hadoop before becoming a Top-Level Project. Cassandra originated at Facebook in 2007, was open sourced and then incubated at Apache, and is nowadays also a Top-Level Project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and providing elasticity.

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone with read-optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for its internal use.

From an application developer's point of view, HBase is preferable as it offers strong consistency, making life easier. One of the misconceptions about eventual consistency is that it improves write speed: given sustained write traffic, latency is affected and one ends up paying the “eventual consistency tax” without getting its benefits.

There are some technical limitations with almost all NoSQL solutions, like compactions affecting consistent low latency, inability to shard automatically, reliability issues and long recovery times for node outages. Here at MapR, we’ve created a “next version” of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it into GA under the label M7 in May 2013 and it’s available in the cloud via AWS Elastic MapReduce.

Last but not least, HBase has — through its legacy as a Hadoop contribution project — a strong and solid integration into the entire Hadoop ecosystem, including Apache Hive and Apache Pig.

Summarizing, HBase will be the dominant NoSQL platform for use cases where fast and small-size updates and look-ups at scale are required. Recent innovations have also provided architectural advantages to eliminate compactions and provide truly decentralized co-ordination.

Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.

Against The Motion


Jonathan Ellis
Co-founder & CTO,
DataStax

HBase Is Plagued By Too Many Flaws

 

NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase’s lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

Engineering Problems

– Operations are complex and failure prone. Deploying HBase involves configuring at a minimum a Zookeeper ensemble, primary HMaster, secondary HMaster, RegionServers, active NameNode, standby NameNode, HDFS quorum journal manager and DataNodes. Installation can be automated, but if it’s too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise to even know what to monitor, and God help you if you need regular backups.

– RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its region; when it goes down, a new one must be selected and write-ahead logs must be replayed before writes or reads can be served again.

– Developing against HBase is painful. HBase's API is clunky and Java-centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.

– The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks, and advanced users maintain their own patch trees on top. Leadership is divided and there is no clear roadmap. Conversely, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital, and others working together without cliques or forks.

Overall, the engineering gap between HBase and other NoSQL platforms has increased since I’ve been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress, but today that lead has widened to about two years.

Architectural Flaws

– Master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you perform workload separation across different replicas in a cluster. By contrast, Cassandra’s peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL while allowing you to opt in to lightweight transactions in the rare cases when you need linearizability.

– Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase’s design; each RegionServer is a single point of failure. A fully distributed design instead means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally with the other replicas and can catch up the failed one later.

– HDFS is primarily designed for streaming access to large files. HBase is built on a distributed file system optimized for batch analytics. This is directly responsible for HBase's poor performance, particularly for reads, and particularly on solid-state disks. Just as relational databases haven't been able to optimize btree engines designed 30 years ago for pre-big-data workloads, HDFS won't be able to undo the tradeoffs it made for what is still its primary purpose and close the gap on critical functionality:

– Mixing solid state and hard disks in a single cluster and pinning tables to workload-appropriate media.

– Snapshots, incremental backups, and point-in-time recovery.

– Compaction throttling to avoid spikes in application response time.

– Dynamically routing requests to the best-performing replicas.

The same design that makes HBase’s foundation, HDFS, a good fit for batch analytics will ensure that it remains inherently unsuited for the high velocity, random access workloads that characterize the NoSQL market.

Jonathan Ellis is chief technology officer and co-founder at DataStax, where he sets the technical direction and leads Apache Cassandra as project chair.

[repost] orientdb wiki

original:https://github.com/orientechnologies/orientdb/wiki/Tutorials

Tutorials

This is a collection of tutorials about OrientDB. Look also at the Main documentation page.

Basics

Java tutorial

PHP tutorial

Node.js tutorial

.NET tutorial

External tutorials

English

OrientDB User's Guide, written as a step-by-step tutorial:

Miscellaneous

Italian

Step-by-step tutorial about the usage of OrientDB:

HTML.it guide to OrientDB:

Tecnicume blog by Marco Berri:

Japanese

Manipulating OrientDB from Java (RawGraphDatabase part):

OrientDB GraphDB app deployment experience, Part 1: http://fungoing.jp/2011/08/379

Step-by-step tutorial by Junji Takakura: