[repost] Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig


Posted on December 27, 2012

Big Data is the hotness, there is no doubt about it.  Every year it's just gotten bigger and bigger and shows no sign of slowing.  There is a lot out there about big data, but despite the hype, there isn't a lot of good technical content for those who want to get started.  The lack of technical how-to info is made worse by the fact that many Hadoop projects have moved their documentation around over time, and Google searches commonly point to obsolete docs.  My intent here is to provide some solid guidance on how to actually get started with practical uses of Hadoop and to encourage others to do the same.

From an SA perspective, the most interesting Hadoop sub-projects have been those for log transport, namely Scribe, Chukwa, and Flume.  Let's examine each.

Log Transport Choices

Scribe was created at Facebook and gained a lot of popularity early on thanks to adoption at high-profile sites like Twitter, but development has apparently ceased, and word is that Facebook has stopped using it themselves.  So Scribe is off my list.

Chukwa is a confusing beast.  It's said to be distributed with Hadoop's core, but that is just an old version in the same sub-directory of the FTP site; the actual current version is found under the incubator sub-tree.  It is a very comprehensive solution, including a web interface for log analysis, but that functionality is based on HBase, which is fine if you want to use HBase but may be a bit more than you wish to chew off for simple Hive/Pig analysis.  Most importantly, the major Hadoop distributions from HortonWorks, MapR, and Cloudera use Flume instead.  So if you're looking for a comprehensive toolset for log analysis, Chukwa is worth checking out, but if you simply need to efficiently get data into Hadoop for use by other Hadoop components, Flume is the clear choice.

That brings us to Flume, or more specifically Flume-NG.  The first thing to know about Flume is that there were major changes pre and post 1.0, major enough that the pre-1.0 releases are referred to as "Flume OG" ("Old Generation" or "Original Gangsta", depending on your mood) and the post-1.0 releases as "Flume NG".  Whenever you look at documentation or help on the web about Flume, be certain which one you are looking at!  In particular, stay away from the Flume CWiki pages and refer only to flume.apache.org.  I say that because there is so much old cruft in the CWiki pages that you can easily be misled and become frustrated, so just avoid it.

Now that we’ve thinned out the available options, what can we do with Flume?

Getting Started with Flume

Flume is a very sophisticated tool for transporting data.  We are going to focus on log data, but it can transport just about anything you throw at it.  For our purposes we're going to use it to transport Apache log data from a web server back to our Hadoop cluster and store it in HDFS, where we can then operate on it using other Hadoop tools.

Flume NG is a Java application that, like other Hadoop tools, can be downloaded, unpacked, configured and run without compiling or other forms of tinkering.  Download the latest "bin" tarball, untar it into /opt, and rename or symlink it to /opt/flume (it doesn't matter where you put it; this is just my preference).  You will need to have Java already installed.

Before we can configure Flume it's important to understand its architecture.  Flume runs as an agent.  The agent is sub-divided into 3 components: sources, channels, and sinks.  Inside the Flume agent process there is a pub-sub flow between these 3 components.  A source accepts or retrieves data and sends it into a channel.  Data then queues in the channel.  A sink takes data from the channel and does something with it.  There can be multiple sources, multiple channels, and multiple sinks per agent.  The one important thing to remember is that a source can write to multiple channels, but a sink can draw from only one channel.

Let's take an example.  A "source" might tail a file.  New log lines are sent into a channel where they are queued up.  A "sink" then extracts the log lines from the channel and writes them into HDFS.

At first glance this might appear overly complicated, but the distinct advantage here is that the channel decouples input and output, which is important if you have performance slowdowns in the sinks.  It also allows the entire system to be plugin-based.  Any number of new sinks can be created to do something with data: for instance, Cassandra sinks are available, and there is an IRC sink for writing data into an IRC channel.  Flume is extremely flexible thanks to this architecture.

In the real world we want to collect data from a local file, send it across the network, and then store it centrally.  In Flume we accomplish this by chaining agents together: the "sink" of one agent sends to the "source" of another.  The standard method of sending data across the network with Flume is using Avro.  For our purposes here you don't need to know anything about Avro except that one of the things it can do is move data over the network.  Here is what this ultimately looks like:

So on our web server, we create a /opt/flume/conf/flume.conf that looks like this:

## Flume NG Apache Log Collection
## Refer to https://cwiki.apache.org/confluence/display/FLUME/Getting+Started
# http://flume.apache.org/FlumeUserGuide.html#exec-source
agent.sources = apache
agent.sources.apache.type = exec
agent.sources.apache.command = gtail -F /var/log/httpd/access_log
agent.sources.apache.batchSize = 1
agent.sources.apache.channels = memoryChannel
agent.sources.apache.interceptors = itime ihost itype
# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
agent.sources.apache.interceptors.itime.type = timestamp
# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
agent.sources.apache.interceptors.ihost.type = host
agent.sources.apache.interceptors.ihost.useIP = false
agent.sources.apache.interceptors.ihost.hostHeader = host
# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
agent.sources.apache.interceptors.itype.type = static
agent.sources.apache.interceptors.itype.key = log_type
agent.sources.apache.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on (Hadoop Slave Node)
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
agent.sinks = AvroSink
agent.sinks.AvroSink.type = avro
agent.sinks.AvroSink.channel = memoryChannel
agent.sinks.AvroSink.hostname =
agent.sinks.AvroSink.port = 4545

## Debugging Sink, Comment out AvroSink if you use this one
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
#agent.sinks = localout
#agent.sinks.localout.type = file_roll
#agent.sinks.localout.sink.directory = /var/log/flume
#agent.sinks.localout.sink.rollInterval = 0
#agent.sinks.localout.channel = memoryChannel

This configuration looks overwhelming at first, but it breaks down simply into an “exec” source, a “memory” channel, and an “Avro” sink, with additional parameters specified for each. The syntax for each is in the following form:

agent_name.sources = source1 source2 ...
agent_name.sources.source1.type = exec

agent_name.channel = channel1 channel2 ...
agent_name.channel.channel1.type = memory

agent_name.sinks = sink1 sink2 ...
agent_name.sinks.sink1.type = avro

In my example the agent name was “agent”, but you can name it anything you want. You will specify the agent name when you start the agent, like this:

$ cd /opt/flume
$ bin/flume-ng agent -f conf/flume.conf -n agent

Now that our agent is running on the web server, we need to set up the other agent, which will deposit log lines into HDFS. This type of agent is commonly called a "collector". Here is the config:

## Sources #########################################################
## Accept Avro data In from the Edge Agents
# http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ########################################################
## Source writes to 2 channels, one for each sink (Fan Out)
collector.channels = mc1 mc2

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ###########################################################
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem (Debugging)
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
# http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /flume/events/%{log_type}/%{host}/%y-%m-%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

This configuration is a little different because the source accepts Avro network events and then sends them into 2 memory channels (“fan out”) which feed 2 different sinks, one for HDFS and another for a local log file (for debugging). We start this agent like so:

# bin/flume-ng agent -f conf/flume.conf -n collector

Once both sides are up, you should see data moving. Use “hadoop fs -lsr /flume” to examine files there and if you included the file_roll sink, look in /var/log/flume.

# hadoop fs -lsr /flume/events
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com
drwxr-xr-x   - root supergroup          0 2012-12-24 09:50 /flume/events/apache_access_combined/cuddletech.com/12-12-24
-rw-r--r--   3 root supergroup     224861 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845948
-rw-r--r--   3 root supergroup      85437 2012-12-24 06:27 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845949
-rw-r--r--   3 root supergroup     195381 2012-12-24 06:37 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845950

Flume Tunables & Gotcha’s

There are a lot of tunables to play with and carefully consider in the example configs above. I included the documentation links for each component and I highly recommend you review them. Let's specifically look at some things that might cause you frustration while getting started.

First, interceptors. If you look at our HDFS sink path, you'll see the path includes "log_type", "host", and a date. That data is associated with an event when the source grabs it, as meta-data headers on each event, and you associate it with the event using an "interceptor". So look back at the source where we 'gtail' our log file and you'll see that we're using interceptors to associate the log_type, host, and date with each event.

Secondly, by default Flume's HDFS sink writes out SequenceFiles. This seems fine until you run Pig or Hive and get inconsistent or unusual results back. Ensure that you specify the "fileType" as "DataStream" and the "writeFormat" as "Text".

Lastly, there are 3 triggers that will cause Flume to "roll" the HDFS output file: size, count, and interval. When Flume writes data, if any one of the triggers fires it will roll to a new file. By default the interval is 30 (seconds), the size is 1024 (bytes), and the count is 10. Think about that: if any one of those conditions is met, the file is rolled. So you end up with a LOT of HDFS files, which may or may not be what you want. Setting any value to 0 disables that type of rolling.
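For instance, if you wanted the collector's HDFS sink to roll only on a 10-minute interval and never on size or count, the relevant properties (the same hdfs sink keys used in the HadoopOut config above) would look like this:

```properties
# Roll HDFS output files on a time interval only:
# disable size- and count-based rolling by setting them to 0.
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 0
collector.sinks.HadoopOut.hdfs.rollInterval = 600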

Analysis using Pig

Pig is a great tool for the Java-challenged. It's quick, easy, and repeatable. The only real challenge is in accurately describing the data you're asking it to chew on.

The PiggyBank library can provide you with a set of loaders which can save you from regex hell. The following is an example of using Pig on my Flume-ingested Apache combined format logs using the PiggyBank "CombinedLogLoader":

# cd /opt/pig
# ./bin/pig 
2012-12-23 10:32:56,053 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-SNAPSHOT (r: unknown) compiled Dec 23 2012, 10:29:56
2012-12-23 10:32:56,054 [main] INFO  org.apache.pig.Main - Logging error messages to: /opt/pig-0.10.0/pig_1356258776048.log
2012-12-23 10:32:56,543 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://
2012-12-23 10:32:57,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at:

grunt> REGISTER /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar;
grunt> raw = LOAD '/flume/events/apache_access_combined/cuddletech.com/12-12-24/'
    USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader 
    AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent); 
grunt> agents = FOREACH raw GENERATE userAgent;
grunt> agents_uniq = DISTINCT agents;
grunt> DUMP agents_uniq;

(Mozilla/5.0 ())
(Recorded Future)

While Pig is easy enough to install (unpack and run), you must build the PiggyBank JAR yourself, which means you'll need a JDK and Ant. On a SmartMachine with Pig installed in /opt/pig, it'd look like this:

# pkgin in sun-jdk6-6.0.26 apache-ant
# cd /opt/pig/
# ant
# cd /opt/pig/contrib/piggybank/java
# ant
     [echo]  *** Creating pigudf.jar ***
      [jar] Building jar: /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar

Total time: 5 seconds

Analysis using Hive

Similar to Pig, the challenge with Hive is really just describing the schema around the data. Thankfully there is assistance out there for just this problem.

[root@hadoop02 /opt/hive]# bin/hive
Logging initialized using configuration in jar:file:/opt/hive-0.9.0-bin/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201212241029_318322444.txt
hive> CREATE EXTERNAL TABLE apache_access (
    >   host STRING,
    >   identity STRING,
    >   user STRING,
    >   time STRING,
    >   request STRING,
    >   status STRING,
    >   size STRING,
    >   referer STRING,
    >   agent STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    > )
    > LOCATION '/flume/events/apache_access_combined/cuddletech.com/12-12-24/';
Time taken: 7.514 seconds

Now you can query to your heart's content. Please note that in the above example, if you omit the "EXTERNAL" keyword when creating the table, Hive will move your data into its own data warehouse directory, which may not be what you want.

Next Steps

Hadoop provides an extremely powerful set of tools to solve very big problems. Pig and Hive are easy to use and very powerful. Flume-NG is an excellent tool for reliably moving data and is extremely extensible. There is a lot I'm not getting into here, like using file-backed or database-backed channels in Flume to protect against node failure and thus increase delivery reliability, or multi-tiered aggregation using intermediate Flume agents (meaning, Avro Source to Avro Sink)... there are a lot of fun things to explore here. My hope is that I've provided you with an additional source of data to help you on your way.

If you start getting serious with Hadoop, I highly recommend you buy the O'Reilly books for Hadoop, which are very good and will save you a lot of time wasted in trial and error.

A Friendly Warning

In closing, I feel it necessary to point out the obvious. For most people there is no reason to do any of this. Hadoop is a Peterbilt for data. You don't use a Peterbilt for a job that can be done with a Ford truck; it's not worth the time, money and effort.

When I've asked myself "How big must data be for it to be big data?" I've come up with the following rule: if a "grep" of a file takes more than 5 minutes, it's big. If the file cannot reasonably be sub-divided into smaller files, or any query requires examining multiple files, then it might be Hadoop time.

For most logging applications, I strongly recommend either Splunk (if you can afford it) or using Rsyslog/Logstash and ElasticSearch; they are far more suited to the task, with less hassle, less complexity and much more functionality.

[repost] Aliyun Releases ODPS, Capable of Analyzing PB-Scale Data


On the 8th, Aliyun (Alibaba Cloud Computing) released its "nuclear-weapon-class" big data product, ODPS. Through the ODPS online service, a small company can analyze massive data for just a few hundred yuan. ODPS can process 100 PB of data within 6 hours, the equivalent of 100 million HD movies. Until now, only a handful of companies worldwide, such as Google and Amazon, have had this capability.

Over five years, Aliyun's engineers wrote 2.5 million lines of code, continuously polishing ODPS. In an open letter, the team described it this way: "Pour the water of the data ocean into ODPS, set a few parameters, turn on the tap, and what comes out is freshly squeezed juice!" By analogy to the industrial age, ODPS is the assembly line of the big data era: the "juice" that flows from the tap varies endlessly with the raw data and the algorithms.

The open letter went on: in the 200-plus years since the Industrial Revolution, humanity's exploitation of physical resources has reached its peak, while our exploitation of data resources is still in its infancy. Internet companies such as Google, Facebook and Alibaba moved first and have touched the appeal of big data. Yet the vast majority of the data humanity holds still cannot produce value.

Processing large-scale data with traditional solutions generally costs tens of millions of yuan to build your own data center, plus professional staff to maintain it. Once the total data volume exceeds 100 TB, the technical challenges become severe. The Hadoop open-source movement has lowered this cost, but building a decent Hadoop cluster of your own still requires seed capital in the millions, and skilled Hadoop engineers are scarcer still.

By comparison, the cost of and barrier to using ODPS are far lower. ODPS charges by usage, currently priced at 0.3 yuan/GB, ready to use on demand, and free for the first month. Based on the data volumes of most companies, a typical bill is only a few hundred yuan per month.

Before its commercial release, ODPS had been Alibaba's internal secret weapon. Ali Small Loan (AliLoan) was the first to apply ODPS commercially. Today more than 360,000 people have borrowed from AliLoan, with loans as small as 1 yuan, a 3-minute application, 1-second disbursement, and zero human intervention. To achieve this, AliLoan processes 30 PB of data a day, covering 80 billion data items such as shop ratings, favorites and reviews, runs more than 100 data models, and even assesses how much small-business owners conceal or lie in hypothetical scenarios. Each AliLoan loan costs 0.3 yuan to make, less than 1/1000 of an ordinary bank's cost.

Reportedly, Alibaba's most critical data businesses, including Taobao and Alipay, all run on the ODPS platform, for example the core algorithms of Alimama advertising and the training of click-prediction models. Commercializing ODPS means Aliyun is opening this big data processing capability to the outside world, a move that will dramatically lower the cost of innovation across society.

ODPS also has broad potential in the public sphere: BGI (Huada Genomics) uses ODPS for gene sequencing in less than one tenth of the time of traditional methods, which could buy humanity precious decoding time if a biological crisis ever erupts; drug regulators can use ODPS to track pharmaceutical distribution end to end and tackle the problem of counterfeit drugs. "We look forward to a future where the data for every barrel of oil and every dish of food runs on ODPS; food safety needs to be solved in innovative ways."

At present only Google and Amazon offer similar services globally, and there is no comparable domestic product. Aliyun says ODPS will be more powerful than Google BigQuery: it not only supports richer SQL syntax but will also provide a MapReduce programming model and machine-learning modeling capabilities, serving a wider range of application scenarios.


Within Alibaba Group, ODPS is already widely used for processing and analyzing massive structured data. As a key module of Apsara (Feitian), ODPS has grown up alongside Aliyun, and the road has been rocky:

1. During Chinese New Year 2010, the first version of ODPS's predecessor, Sql Engine, went live. Its first application supported Aliyun Finance's credit-loan and order-loan business ("Shepherd Dog"), running on a 30-machine Apsara cluster;

2. By mid-2010 the Shepherd Dog project was growing rapidly while Sql Engine's stability struggled to keep pace with the business. To guarantee business continuity and stability, we ran 4 clusters in parallel; the overnight escort shifts were bittersweet;

3. In January 2011, as the business demanded more, Sql Engine's architecture could no longer keep up. The Sql team spent 3 months rewriting it; the new version ran fairly stably on 100 machines and was renamed Data Engine;

4. In Q3 2011 the Apsara team began exploring support for the group's internal data warehouse workloads, using the Moye system to run Yunti 1's production jobs in parallel on 1,500 machines, achieving performance and stability no worse than Hadoop's;

5. At the end of 2011 the tension between the fast-growing finance business and the slowly maturing platform erupted again, and several AliLoan executives paid a late-night visit to Aliyun. To win back the customer's confidence, in January 2012 forty Aliyun engineers moved into Binjiang to develop the new version and maintain the production system at the customer's side. After 4 months, Data Engine was upgraded and the customer's business migrated to a 2,500-machine cluster; on the new cluster, AliLoan's business volume soared;

6. In Q1 2012 the "Ice-Fire Bird" project kicked off. The team chose between Data Engine and Moye, and decided to use Moye as the core engine of the ODPS product; Data Engine exited the stage of history. After 8 months, phase one of Ice-Fire Bird was complete: the AliLoan and Taobao data warehouse workloads were migrated to ODPS, and the first version of ODPS went into production;

7. By mid-2013 the rapid growth of the finance business again strained cluster resources; storage was about to hit the wall! A joint project team from Aliyun and the data exchange platform launched the "5k" project, and after 5 months of intensive engineering the finance cluster was smoothly expanded and the production system migrated without interruption. ODPS entered the era of 5,000-machine clusters and cross-datacenter scheduling.

The long-running internal debate at Alibaba between Yunti 1 (Hadoop-based) and Yunti 2 in fact spurred the ODPS team's growth.

In October 2013, to merge AliLoan's and Alipay's data, Alipay asked the ODPS team to help it move house, migrating Alipay's data warehouse workloads from a Hadoop cluster to ODPS; the "Moon Landing 1" project was launched. In May 2014 the project succeeded, and all of Xiaowei Jinfu's (Ant Financial's) data businesses began building on ODPS.

At the end of 2013, inspired by Moon Landing 1, Alibaba's data platform team, together with the technical support department and the group's business units, started a series of ambitious "Moon Landing" projects to unify the data of search, advertising, logistics and other BUs; in the future ODPS will become the unified processing platform carrying all of Alibaba Group's data. The Moon Landing program comprises more than 20 projects, involves every business unit of Alibaba and Xiaowei Jinfu, and covers all of the group's data staff; a commitment of people and resources on this scale is rare within the group. Its full launch marks the point at which the group's home-grown Apsara + ODPS platform has gradually surpassed Hadoop in functionality and performance, putting Aliyun's technology at the world's leading edge.

From Oracle to Hadoop, we solved the problem of how to store and analyze massive data, and Alibaba's data businesses were no longer constrained by scale; from Hadoop to ODPS was a qualitative leap, clearing the way for the big data businesses that followed.

The commercial launch of ODPS is a landmark event in the development of China's big data era.


At the bottom of the stack is Linux + PC servers. Above that sits Apsara (Feitian), a distributed system that Aliyun began developing in 2009; it mainly provides distributed storage plus scheduling and programming frameworks for distributed computation. It is written in C++, and in 2013 the production system could schedule clusters of 5,000 machines.

Seen from Hadoop's perspective, Apsara provides similar functionality. Before YARN, Hadoop's main programming model was MapReduce; Apsara's programming model is a directed acyclic graph, and besides batch jobs it also supports long-running services. The implementation details are of course completely different, starting with the implementation language, for which Apsara chose C++; security and the operations tooling also differ greatly.

ODPS is a set of services provided on top of Apsara, with functionality including SQL, a Java-based MapReduce programming framework, a graph-computation programming model, implementations of a series of machine-learning algorithms, and so on. All functionality is exposed as RESTful APIs, so in terms of system boundaries this API layer separates the ODPS platform from users' systems, another clear difference from Hadoop. ODPS was designed from the start to be opened to the outside world as a multi-tenant public data processing service over the Internet, so security has the highest priority in ODPS's design and implementation.

Current use of ODPS within Alibaba Group: most of the group's data businesses run on it, including AliLoan, Data Cube (Shuju Mofang), the Alimama advertising network, ad search, click-prediction model training, all of Alipay's business, the Taobao Index, Ali Wireless, AutoNavi (Gaode), CITIC 21CN and more — including the core Alimama advertising algorithms, such as the training of click-prediction models.


1) Massive computation at your fingertips: users need not worry about the storage difficulties and lengthening compute times that come with data growth. ODPS automatically scales the cluster's storage and compute capacity with the user's data volume, letting users focus on data analysis and mining and maximizing the value of their data.

2) Service "out of the box": users need not worry about building, configuring or operating a cluster; a few simple steps are enough to upload data to ODPS, analyze it and obtain the results.

3) Safe, reliable data storage: ODPS protects user data with multiple layers of storage and access security — triple replication, read/write request authentication, an application sandbox and a system sandbox — so that data is not lost, not leaked and not stolen.

4) Pay as you go: ODPS charges for the storage and computation actually consumed, minimizing the user's cost of using data.


Going forward, Aliyun will continue to develop and optimize the ODPS platform, providing more features to meet more levels of customer needs:

1. An ADC platform: an integrated data solution based on ODPS, including an integrated development environment for code, job release, job scheduling, job monitoring and management, BI/business intelligence, and so on.

2. Richer SQL syntax, plus a MapReduce programming model and large-scale machine-learning modeling capabilities.

3. Support for near-real-time queries, in support of BI/business intelligence features.

5. More complex computation models: not only Map/Reduce but also DAG-model jobs.

6. User-defined iterative/graph-model programming interfaces, which users can rely on to write clustering (e.g. k-means) or PageRank.

9. Opening up ODPS's lower-level logical compute units, so that users can build distributed systems such as Spark or Pig on top of ODPS.

10. Support for geographic/spatial data types.

11. A new storage technology, "raidfile", with 1.5-copy replication: today every piece of data is stored in three copies; raidfile will store only 1.5 copies while providing the same data-safety guarantees, greatly reducing storage consumption.


1. The "Tianchi" platform is a big data open platform built on Aliyun ODPS that provides research data and data processing services to academia free of charge. ODPS exposes batch processing of PB-scale data in the form of RESTful APIs, mainly for data analysis, statistics over massive data, data mining and business intelligence.

2. Research data already opened on Tianchi: Tianchi's data and compute resources are free for academia. The first phase opens three classes of research datasets, including user purchase records, product review records and product browsing logs. The data has been anonymized, and all of it may be used only within the Tianchi platform. (ODPS itself is a paid, commercial platform; Tianchi is a platform built on ODPS and opened to universities and research institutes — in effect, Alibaba picks up the ODPS bill for them.)

Building on ODPS, Alibaba provides third-party software vendors and brands with a cloud development platform for big data computation, mining and storage, building out the Alibaba data ecosystem. Through the "Yushanfang" data market, data consumers and data providers can safely trade and use massive data, realizing the value of that data.


[repost] Explaining JavaScript VMs in JavaScript – Inline Caches


I have a thing for virtual machines that are implemented in the language (or a subset of the language) they are built to execute. If I were in academia or just had a little bit more free time, I would definitely start working on a JavaScript VM written in JavaScript. Actually this would not be a unique project for JavaScript, because people from Université de Montréal kinda got there first with Tachyon, but I have some ideas I would like to pursue myself.

I do, however, have another dream closely connected to (meta)circular virtual machines. I want to help JavaScript developers understand how JS engines work. I think understanding the tools you are wielding is of the utmost importance in our trade. The more people stop seeing the JS VM as a mysterious black box that converts JavaScript source into some zeros-and-ones, the better.

I should say that I am not alone in my desire to explain how things work internally and to help people write more performant code. A lot of people all over the world are trying to do the same. But there is, I think, a problem that prevents this knowledge from being absorbed efficiently by developers: we are trying to convey it in the wrong form. I am guilty of this myself:

  • sometimes I wrap things I know about V8 into hard-to-digest lists of "do this, not that" recommendations. The problem with such a serving is that it does not really explain anything. Most probably it will be followed like a sacred ritual, and it might easily become outdated without anybody noticing.
  • sometimes, trying to explain how VMs work internally, we choose the wrong level of abstraction. I love the thought that seeing a slide full of assembly code might encourage people to learn assembly and reread the slide later, but I am afraid that sometimes these slides just fall through and get forgotten as something not useful in practice.

I have been thinking about these problems for quite some time, and I decided that it might be worth trying to explain a JavaScript VM in JavaScript. The talk "V8 Inside Out" that I gave at WebRebels 2012 [slides] pursues exactly this idea, and in this post I would like to revisit the things I talked about in Oslo, but now without any audible obstructions (I like to believe that my way of writing is much less funky than my way of speaking ☺).

Implementing dynamic language in JavaScript

Imagine that you want to implement, in JavaScript, a VM for a language that is very similar to JavaScript in terms of semantics but has a much simpler object model: instead of JS objects it has tables mapping keys of any type to values. For simplicity let's just think about Lua, which is actually both very similar to JavaScript and very different as a language. My favorite "make an array of points and then compute a vector sum" example would look approximately like this:

function MakePoint(x, y)
  local point = {}
  point.x = x
  point.y = y
  return point
end

function MakeArrayOfPoints(N)
  local array = {}
  local m = -1
  for i = 0, N do
    m = m * -1
    array[i] = MakePoint(m * i, m * -i)
  end
  array.n = N
  return array
end

function SumArrayOfPoints(array)
  local sum = MakePoint(0, 0)
  for i = 0, array.n do
    sum.x = sum.x + array[i].x
    sum.y = sum.y + array[i].y
  end
  return sum
end

function CheckResult(sum)
  local x = sum.x
  local y = sum.y
  if x ~= 50000 or y ~= -50000 then
    error("failed: x = " .. x .. ", y = " .. y)
  end
end

local N = 100000
local array = MakeArrayOfPoints(N)
local start_ms = os.clock() * 1000;
for i = 0, 5 do
  local sum = SumArrayOfPoints(array)
  CheckResult(sum)
end
local end_ms = os.clock() * 1000;
print(end_ms - start_ms)

Note that I have a habit of checking at least some of the final results computed by my μbenchmarks. This saves me from embarrassment when somebody discovers that my revolutionary jsperf test cases are nothing but my own bugs.

If you take the code above and put it into the Lua interpreter, you will get something like this:

∮ lua points.lua

Good, but it does not help us understand how VMs work. So let's think about what it would look like if we had a quasi-Lua VM written in JavaScript. "Quasi" because I don't want to implement full Lua semantics; I prefer to focus only on the objects-are-tables aspect of it. A naïve compiler could translate our code down to JavaScript like this:

function MakePoint(x, y) {
  var point = new Table();
  STORE(point, 'x', x);
  STORE(point, 'y', y);
  return point;
}

function MakeArrayOfPoints(N) {
  var array = new Table();
  var m = -1;
  for (var i = 0; i <= N; i++) {
    m = m * -1;
    STORE(array, i, MakePoint(m * i, m * -i));
  }
  STORE(array, 'n', N);
  return array;
}

function SumArrayOfPoints(array) {
  var sum = MakePoint(0, 0);
  for (var i = 0; i <= LOAD(array, 'n'); i++) {
    STORE(sum, 'x', LOAD(sum, 'x') + LOAD(LOAD(array, i), 'x'));
    STORE(sum, 'y', LOAD(sum, 'y') + LOAD(LOAD(array, i), 'y'));
  }
  return sum;
}

function CheckResult(sum) {
  var x = LOAD(sum, 'x');
  var y = LOAD(sum, 'y');
  if (x !== 50000 || y !== -50000) {
    throw new Error("failed: x = " + x + ", y = " + y);
  }
}

var N = 100000;
var array = MakeArrayOfPoints(N);
var start = LOAD(os, 'clock')() * 1000;
for (var i = 0; i <= 5; i++) {
  var sum = SumArrayOfPoints(array);
  CheckResult(sum);
}
var end = LOAD(os, 'clock')() * 1000;
print(end - start);

However, if you just try to run the translated code with d8 (V8's standalone shell), it will politely refuse:

∮ d8 points.js
points.js:9: ReferenceError: Table is not defined
  var array = new Table();
ReferenceError: Table is not defined
    at MakeArrayOfPoints (points.js:9:19)
    at points.js:37:13

The reason for this refusal is simple: we are still missing the runtime system code, which is actually responsible for implementing the object model and the semantics of loads and stores. It might seem obvious, but I want to highlight it: a VM that looks like a single black box from the outside is, on the inside, actually an orchestra of boxes playing together to deliver the best possible performance. There are compilers, runtime routines, an object model, a garbage collector, etc. Fortunately our language and example are very simple, so our runtime system is only a couple dozen lines long:

function Table() {
  // Map from ES Harmony is a simple dictionary-style collection.
  this.map = new Map;
}

Table.prototype = {
  load: function (key) { return this.map.get(key); },
  store: function (key, value) { this.map.set(key, value); }
};

function CHECK_TABLE(t) {
  if (!(t instanceof Table)) {
    throw new Error("table expected");
  }
}

function LOAD(t, k) {
  CHECK_TABLE(t);
  return t.load(k);
}

function STORE(t, k, v) {
  CHECK_TABLE(t);
  t.store(k, v);
}

var os = new Table();

STORE(os, 'clock', function () {
  return Date.now() / 1000;
});
Notice that I have to use a Harmony Map instead of a normal JavaScript Object, because a table can potentially contain any key, not just string ones.
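A quick aside of mine (not from the original talk) illustrating why: a plain Object coerces every key to a string, so the number 1 and the string '1' collide, while a Map keeps them distinct:

```javascript
// Plain Object: property keys are coerced to strings.
var obj = {};
obj[1] = 'number key';
obj['1'] = 'string key';  // overwrites the previous entry: 1 became '1'

// Map: keys are compared without coercion, so 1 and '1' are different.
var map = new Map();
map.set(1, 'number key');
map.set('1', 'string key');

console.log(Object.keys(obj).length);  // 1
console.log(map.size);                 // 2
```

This is exactly the behavior a Lua table needs, since Lua distinguishes t[1] from t["1"].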

∮ d8 --harmony quasi-lua-runtime.js points.js

Now our translated code works, but it is disappointingly slow because of all those levels of abstraction every load and store has to cross before it gets to the value. Let's try to reduce this overhead by applying the very same fundamental optimization that most JavaScript VMs apply these days: inline caching. Even JS VMs written in Java will eventually use it, because invokedynamic is essentially a structural inline cache exposed at the bytecode level. Inline caching (usually abbreviated as IC in the V8 sources) is actually a very old technique, developed roughly 30 years ago for Smalltalk VMs.

Good duck always quacks the same way

The idea behind inline caching is very simple: we want to create a bypass, or fast path, that allows us to quickly load an object's property without entering the runtime system, provided our assumptions about the object and its properties are correct. It's quite hard to formulate any meaningful assumptions about object layout in a program written in a language full of dynamic typing, late binding and other quirks like eval, so instead we want to let our loads/stores observe & learn: once they see some object, they can adapt themselves in a way that makes subsequent loads from similarly structured objects faster.  In a sense, we are going to cache knowledge about the layout of the previously seen object inside the load/store itself, hence the name inline caching. ICs can actually be applied to virtually any operation with dynamic behavior, as long as you can figure out a meaningful fast path: arithmetic operators, calls to free functions, calls to methods, etc. Some ICs can also cache more than a single fast path, that is, become polymorphic.
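To make the observe-and-learn idea concrete, here is a toy monomorphic load site in plain JavaScript. This is my own illustrative sketch, not how V8 actually does it: a real IC patches generated code and compares hidden classes, whereas here I abuse the object's constructor as a crude stand-in for a "shape" and only mimic the check-then-fast-path structure:

```javascript
// A load site that caches the "shape" of the last object it saw.
function makeLoadSite(key) {
  var cachedShape = null;   // what we learned from the previous load
  var slowLoads = 0;        // how many times we fell into the slow path

  function loadSlow(obj) {
    slowLoads++;
    cachedShape = obj.constructor;  // remember the shape for next time
    return obj[key];
  }

  function load(obj) {
    if (obj.constructor === cachedShape) {
      return obj[key];      // fast path: our assumption held
    }
    return loadSlow(obj);   // slow path: relearn and adapt
  }

  load.stats = function () { return slowLoads; };
  return load;
}

function Point(x, y) { this.x = x; this.y = y; }

var loadX = makeLoadSite('x');
var sum = 0;
for (var i = 0; i < 100; i++) {
  sum += loadX(new Point(i, -i));
}
console.log(sum);           // 4950
console.log(loadX.stats()); // 1: only the very first load was slow
```

In this toy, the fast path is not actually any faster (both paths end up doing obj[key]); the point is the structure: a cheap shape check guarding a cached assumption, with a slow path that re-teaches the cache when the assumption breaks.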

If we start thinking about how to apply ICs to the translated code above, it soon becomes obvious that we need to change our object model. There is no way we can do a fast load from a Map; we always have to go through the get method. [If we could peek into the raw hashtable behind Map, we could make ICs work for us even without a new object layout, by caching the bucket index.]

Discovering hidden structure

For efficiency, tables that are used like structured data should become more like C structs: a sequence of named fields at fixed offsets. The same goes for tables that are used as arrays: we want numeric properties to be stored in an array-like fashion. But it's obvious that not every table fits such a representation: some are actually used as tables, either containing non-string, non-number keys or containing too many string-named properties that come and disappear as the table is mutated. Unfortunately we can't perform any kind of expensive type inference; instead we have to discover the structure behind each and every table while the program runs, creating and mutating them. Fortunately there is a well-known technique that allows us to do precisely that ☺. This technique is known as hidden classes.

The idea behind hidden classes boils down to two simple things:

  1. the runtime system associates a hidden class with each and every object, just like a Java VM would associate an instance of java.lang.Class with every object;
  2. if the layout of the object changes, then the runtime system will create or find a new hidden class that matches the new layout and attach it to the object.

Hidden classes have a very important feature: they allow the VM to quickly check assumptions about object layout by doing a simple comparison against a cached hidden class. This is exactly what we need for our inline caches. Let's implement a simple hidden class system for our quasi-Lua runtime. Every hidden class is essentially a collection of property descriptors, where each descriptor is either a real property or a transition that points from a class that does not have some property to a class that does:

function Transition(klass) {
  this.klass = klass;
}

function Property(index) {
  this.index = index;
}

function Klass(kind) {
  // Classes are "fast" if they are C-struct like and "slow" if they are Map-like.
  this.kind = kind;
  this.descriptors = new Map;
  this.keys = [];
}

Transitions exist to enable sharing of hidden classes between objects that are created in the same way: if two objects share a hidden class and you add the same property to both of them, you don't want them to end up with different hidden classes.

Klass.prototype = {
  // Create hidden class with a new property that does not exist on
  // the current hidden class.
  addProperty: function (key) {
    var klass = this.clone();
    klass.append(key);
    // Connect hidden classes with transition to enable sharing:
    //           this == add property key ==> klass
    this.descriptors.set(key, new Transition(klass));
    return klass;
  },

  hasProperty: function (key) {
    return this.descriptors.has(key);
  },

  getDescriptor: function (key) {
    return this.descriptors.get(key);
  },

  getIndex: function (key) {
    return this.getDescriptor(key).index;
  },

  // Create clone of this hidden class that has same properties
  // at same offsets (but does not have any transitions).
  clone: function () {
    var klass = new Klass(this.kind);
    klass.keys = this.keys.slice(0);
    for (var i = 0; i < this.keys.length; i++) {
      var key = this.keys[i];
      klass.descriptors.set(key, this.descriptors.get(key));
    }
    return klass;
  },

  // Add real property to keys and descriptors.
  append: function (key) {
    this.keys.push(key);
    this.descriptors.set(key, new Property(this.keys.length - 1));
  }
};
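To see concretely what transitions buy us, here is a small self-contained sketch (my own condensation of the hidden class machinery into a single addProperty helper) showing that a second object adding the same property can follow the recorded transition and reuse the very same hidden class:

```javascript
// Standalone condensation of the hidden class machinery, just enough
// to demonstrate transition-based sharing.
function Transition(klass) { this.klass = klass; }
function Property(index) { this.index = index; }

function Klass(kind) {
  this.kind = kind;
  this.descriptors = new Map();
  this.keys = [];
}

Klass.prototype.addProperty = function (key) {
  var klass = new Klass(this.kind);  // clone without transitions
  klass.keys = this.keys.slice(0);
  for (var i = 0; i < this.keys.length; i++) {
    klass.descriptors.set(this.keys[i], this.descriptors.get(this.keys[i]));
  }
  klass.keys.push(key);              // append the real property
  klass.descriptors.set(key, new Property(klass.keys.length - 1));
  this.descriptors.set(key, new Transition(klass));  // remember the edge
  return klass;
};

var root = new Klass("fast");
var klassA = root.addProperty('x');  // first object: clones a new class

// A second object adding 'x' follows the recorded transition instead of
// cloning yet another class with an identical layout:
var klassB = root.descriptors.get('x').klass;

console.log(klassA === klassB);  // true: one shared layout
```

Without the transition edge, every object built the same way would get its own class, and the identity check an IC relies on would never succeed across objects.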

Now we can make our tables flexible and allow them to adapt to the way they are constructed:

var ROOT_KLASS = new Klass("fast");

function Table() {
  // All tables start from the fast empty root hidden class.
  this.klass = ROOT_KLASS;
  this.properties = [];  // Array of named properties: 'x','y',...
  this.elements = [];  // Array of indexed properties: 0, 1, ...
  // We will actually cheat a little bit and allow any int32 to go here,
  // we will also allow V8 to select appropriate representation for
  // the array's backing store. There are too many details to cover in
  // a single blog post :-)
}

Table.prototype = {
  load: function (key) {
    if (this.klass.kind === "slow") {
      // Slow class => properties are represented as Map.
      return this.properties.get(key);
    }

    // This is fast table with indexed and named properties only.
    if (typeof key === "number" && (key | 0) === key) {  // Indexed property.
      return this.elements[key];
    } else if (typeof key === "string") {  // Named property.
      var idx = this.findPropertyForRead(key);
      return (idx >= 0) ? this.properties[idx] : void 0;
    }

    // There can be only string&number keys on fast table.
    return void 0;
  },

  store: function (key, value) {
    if (this.klass.kind === "slow") {
      // Slow class => properties are represented as Map.
      this.properties.set(key, value);
      return;
    }

    // This is fast table with indexed and named properties only.
    if (typeof key === "number" && (key | 0) === key) {  // Indexed property.
      this.elements[key] = value;
      return;
    } else if (typeof key === "string") {  // Named property.
      var index = this.findPropertyForWrite(key);
      if (index >= 0) {
        this.properties[index] = value;
        return;
      }
    }

    // Too many properties, or a key that is neither string nor number:
    // convert to slow mode and store through the Map.
    this.convertToSlow();
    this.store(key, value);
  },

  // Find property or add one if possible, returns property index
  // or -1 if we have too many properties and should switch to slow.
  findPropertyForWrite: function (key) {
    if (!this.klass.hasProperty(key)) {  // Try adding property if it does not exist.
      // Too many properties! Achtung! Fast case kaput.
      if (this.klass.keys.length > 20) return -1;

      // Switch class to the one that has this property.
      this.klass = this.klass.addProperty(key);
      return this.klass.getIndex(key);
    }

    var desc = this.klass.getDescriptor(key);
    if (desc instanceof Transition) {
      // Property does not exist yet but we have a transition to the class that has it.
      this.klass = desc.klass;
      return this.klass.getIndex(key);
    }

    // Get index of existing property.
    return desc.index;
  },

  // Find property index if property exists, return -1 otherwise.
  findPropertyForRead: function (key) {
    if (!this.klass.hasProperty(key)) return -1;
    var desc = this.klass.getDescriptor(key);
    if (!(desc instanceof Property)) return -1;  // Here we are not interested in transitions.
    return desc.index;
  },

  // Copy all properties into the Map and switch to slow class.
  convertToSlow: function () {
    var map = new Map;
    for (var i = 0; i < this.klass.keys.length; i++) {
      var key = this.klass.keys[i];
      var val = this.properties[i];
      map.set(key, val);
    }

    Object.keys(this.elements).forEach(function (key) {
      var val = this.elements[key];
      map.set(key | 0, val);  // Funky JS, force key back to int32.
    }, this);

    this.properties = map;
    this.elements = null;
    this.klass = new Klass("slow");
  }
};

[I am not going to explain every line of the code because it is commented JavaScript, not C++ or assembly... which is the whole point of using JavaScript. You can ask about anything unclear in the comments or by dropping me a mail.]
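The fast-to-slow switch is worth isolating. The following stripped-down sketch (mine, with the property limit lowered from 20 to 2 so the conversion is easy to trigger) keeps named fields in a parallel array until they overflow, then spills everything into a Map:

```javascript
// Minimal standalone sketch of fast->slow conversion.
function MiniTable() {
  this.kind = "fast";
  this.keys = [];        // property names; array index == offset
  this.properties = [];  // property values for the fast case
}

MiniTable.prototype.store = function (key, value) {
  if (this.kind === "slow") { this.properties.set(key, value); return; }
  var index = this.keys.indexOf(key);
  if (index >= 0) { this.properties[index] = value; return; }
  if (this.keys.length < 2) {  // still room in the fast layout
    this.keys.push(key);
    this.properties.push(value);
    return;
  }
  // Too many properties: copy everything into a Map and go slow.
  var map = new Map();
  for (var i = 0; i < this.keys.length; i++) {
    map.set(this.keys[i], this.properties[i]);
  }
  this.kind = "slow";
  this.properties = map;
  this.properties.set(key, value);
};

MiniTable.prototype.load = function (key) {
  if (this.kind === "slow") return this.properties.get(key);
  var index = this.keys.indexOf(key);
  return index >= 0 ? this.properties[index] : void 0;
};

var t = new MiniTable();
t.store('x', 1); t.store('y', 2);  // fast: array-backed
t.store('z', 3);                   // third property triggers conversion
console.log(t.kind, t.load('z'));  // slow 3
```

The real Table above does the same thing, except the set of names and offsets lives in the shared hidden class instead of in each table.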

Now that we have hidden classes in our runtime system, allowing quick checks of object layout and quick loads of properties by their index, we just have to implement the inline caches themselves. This requires some additional functionality in both the compiler and the runtime system (remember how I was talking about cooperation between different parts of the VM?).

Patchwork quilts of generated code

One of many ways to implement an inline cache is to split it into two pieces: a modifiable call site in the generated code and a set of stubs (small pieces of generated native code) that can be called from that call site. It is essential that the stubs themselves (or the runtime system) can find the call site from which they were called: stubs contain only fast paths compiled under certain assumptions, and if those assumptions do not hold for an object the stub sees, it can initiate modification (patching) of the call site that invoked it, adapting that site to the new circumstances. Our pure JavaScript ICs will also consist of two parts:

  1. a global variable per IC will be used to emulate a modifiable call instruction;
  2. closures will be used instead of stubs.

In native code V8 finds IC sites to patch by inspecting the return address sitting on the stack. We can't do anything like that in pure JavaScript (arguments.caller is not fine-grained enough), so we'll just pass the IC's id into the IC stub explicitly. Here is what the IC-ified code looks like:

// Initially all ICs are in uninitialized state.
// They are not hitting the cache and always missing into runtime system.

function MakePoint(x, y) {
  var point = new Table();
  STORE$0(point, 'x', x, 0);  // The last number is IC's id: STORE$0 => id is 0
  STORE$1(point, 'y', y, 1);
  return point;
}

function MakeArrayOfPoints(N) {
  var array = new Table();
  var m = -1;
  for (var i = 0; i <= N; i++) {
    m = m * -1;
    // Now we are also distinguishing between expressions x[p] and x.p.
    // The first one is called keyed load/store and the second one is called
    // named load/store.
    // The main difference is that named load/stores use a fixed known
    // constant string key and thus can be specialized for a fixed property
    // offset.
    KEYED_STORE$2(array, i, MakePoint(m * i, m * -i), 2);
  }
  STORE$3(array, 'n', N, 3);
  return array;
}

function SumArrayOfPoints(array) {
  var sum = MakePoint(0, 0);
  for (var i = 0; i <= LOAD$4(array, 'n', 4); i++) {
    STORE$5(sum, 'x', LOAD$6(sum, 'x', 6) + LOAD$7(KEYED_LOAD$8(array, i, 8), 'x', 7), 5);
    STORE$9(sum, 'y', LOAD$10(sum, 'y', 10) + LOAD$11(KEYED_LOAD$12(array, i, 12), 'y', 11), 9);
  }
  return sum;
}

function CheckResults(sum) {
  var x = LOAD$13(sum, 'x', 13);
  var y = LOAD$14(sum, 'y', 14);
  if (x !== 50000 || y !== -50000) throw new Error("failed x: " + x + ", y:" + y);
}

The changes above are again self-explanatory: every property load/store site got its own IC with an id. One last small step remains: implementing the MISS stubs and the stub "compiler" that produces specialized stubs:

function NAMED_LOAD_MISS(t, k, ic) {
  var v = LOAD(t, k);
  if (t.klass.kind === "fast") {
    // Create a load stub that is specialized for a fixed class and key k and
    // loads property from a fixed offset.
    var stub = CompileNamedLoadFastProperty(t.klass, k);
    PatchIC("LOAD", ic, stub);
  }
  return v;
}

function NAMED_STORE_MISS(t, k, v, ic) {
  var klass_before = t.klass;
  STORE(t, k, v);
  var klass_after = t.klass;
  if (klass_before.kind === "fast" &&
      klass_after.kind === "fast") {
    // Create a store stub that is specialized for a fixed transition between classes
    // and a fixed key k that stores property into a fixed offset and replaces
    // object's hidden class if necessary.
    var stub = CompileNamedStoreFastProperty(klass_before, klass_after, k);
    PatchIC("STORE", ic, stub);
  }
}

function KEYED_LOAD_MISS(t, k, ic) {
  var v = LOAD(t, k);
  if (t.klass.kind === "fast" && (typeof k === 'number' && (k | 0) === k)) {
    // Create a stub for the fast load from the elements array.
    // Does not actually depend on the class but could if we had more complicated
    // storage system.
    var stub = CompileKeyedLoadFastElement();
    PatchIC("KEYED_LOAD", ic, stub);
  }
  return v;
}

function KEYED_STORE_MISS(t, k, v, ic) {
  STORE(t, k, v);
  if (t.klass.kind === "fast" && (typeof k === 'number' && (k | 0) === k)) {
    // Create a stub for the fast store into the elements array.
    // Does not actually depend on the class but could if we had more complicated
    // storage system.
    var stub = CompileKeyedStoreFastElement();
    PatchIC("KEYED_STORE", ic, stub);
  }
}

function PatchIC(kind, id, stub) {
  this[kind + "$" + id] = stub;  // non-strict JS funkiness: this is global object.
}

function CompileNamedLoadFastProperty(klass, key) {
  // Key is known to be constant (named load). Specialize index.
  var index = klass.getIndex(key);

  function KeyedLoadFastProperty(t, k, ic) {
    if (t.klass !== klass) {
      // Expected klass does not match. Can't use cached index.
      // Fall through to the runtime system.
      return NAMED_LOAD_MISS(t, k, ic);
    }
    return t.properties[index];  // Veni. Vidi. Vici.
  }

  return KeyedLoadFastProperty;
}

function CompileNamedStoreFastProperty(klass_before, klass_after, key) {
  // Key is known to be constant (named store). Specialize index.
  var index = klass_after.getIndex(key);

  if (klass_before !== klass_after) {
    // Transition happens during the store.
    // Compile stub that updates hidden class.
    return function (t, k, v, ic) {
      if (t.klass !== klass_before) {
        // Expected klass does not match. Can't use cached index.
        // Fall through to the runtime system.
        return NAMED_STORE_MISS(t, k, v, ic);
      }
      t.properties[index] = v;  // Fast store.
      t.klass = klass_after;  // T-t-t-transition!
    };
  } else {
    // Write to an existing property. No transition.
    return function (t, k, v, ic) {
      if (t.klass !== klass_before) {
        // Expected klass does not match. Can't use cached index.
        // Fall through to the runtime system.
        return NAMED_STORE_MISS(t, k, v, ic);
      }
      t.properties[index] = v;  // Fast store.
    };
  }
}

function CompileKeyedLoadFastElement() {
  function KeyedLoadFastElement(t, k, ic) {
    if (t.klass.kind !== "fast" || !(typeof k === 'number' && (k | 0) === k)) {
      // If table is slow or key is not a number we can't use fast-path.
      // Fall through to the runtime system, it can handle everything.
      return KEYED_LOAD_MISS(t, k, ic);
    }
    return t.elements[k];
  }

  return KeyedLoadFastElement;
}

function CompileKeyedStoreFastElement() {
  function KeyedStoreFastElement(t, k, v, ic) {
    if (t.klass.kind !== "fast" || !(typeof k === 'number' && (k | 0) === k)) {
      // If table is slow or key is not a number we can't use fast-path.
      // Fall through to the runtime system, it can handle everything.
      return KEYED_STORE_MISS(t, k, v, ic);
    }
    t.elements[k] = v;
  }

  return KeyedStoreFastElement;
}

It's a lot of code (and comments), but it should be simple to understand given all the explanations above: ICs observe, and the stub compiler/factory produces specialized, adapted stubs. [An attentive reader may even notice that I could have initialized all keyed store ICs with the fast-element stubs from the very start, or that an IC gets stuck in the fast state once it enters it.]
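The patching trick can also be isolated into a tiny standalone sketch (mine, not from the runtime above): a property of the global object plays the role of a mutable call instruction, and the miss handler swaps in a closure specialized for the key it observed:

```javascript
// A patchable "call site": LOAD$0 starts out pointing at the miss handler.
globalThis['LOAD$0'] = function (obj, key, id) { return LOAD_MISS(obj, key, id); };

function LOAD_MISS(obj, key, id) {
  var value = obj[key];  // generic lookup (the "runtime system")
  // "Compile" a stub specialized for this key and patch the call site,
  // just like PatchIC writing into the global object.
  globalThis['LOAD$' + id] = function (o, k, icId) {
    if (k !== key) return LOAD_MISS(o, k, icId);  // assumption broken: miss again
    return o[k];                                  // fast path
  };
  return value;
}

var first = LOAD$0({x: 42}, 'x', 0);  // miss: generic path runs, site is patched
var second = LOAD$0({x: 7}, 'x', 0);  // hit: specialized stub runs
console.log(first, second);           // 42 7
```

This is the same two-piece structure as the ICs above, with the hidden class check dropped so only the call-site patching mechanics remain.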

If we throw all the code we got together and rerun our “benchmark” we’ll get very pleasing results:

∮ d8 --harmony quasi-lua-runtime-ic.js points-ic.js

This is a factor of 6 speedup compared to our first naïve attempt!

There is never a conclusion to JavaScript VMs optimizations

Hopefully you are reading this part because you read everything above... I tried to look at some of the ideas powering JavaScript engines these days from a different perspective, that of a JavaScript developer. The more code I wrote, the more it felt like the story of the blind men and the elephant. Just to give you a feeling of looking into the abyss: V8 has 10 descriptor kinds and 5 elements kinds (+ 9 external elements kinds); ic.cc, which contains most of the IC state selection logic, is more than 2500 LOC; ICs in V8 have more than 2 states (there are uninitialized, premonomorphic, monomorphic, polymorphic and generic states, not mentioning special states for keyed load/store ICs or a completely different hierarchy of states for arithmetic ICs); the ia32-specific hand-written IC stubs take more than 5000 LOC; and so on. These numbers only grow as time passes and V8 learns to distinguish and adapt to more and more object layouts. And I am not even touching the object model itself (objects.cc, 13 kLOC), the garbage collector, or the optimizing compiler.

Nevertheless I am sure the fundamentals will not change in the foreseeable future, and when they do it will be a breakthrough with a loud bang, so you'll notice. Thus I think this exercise of trying to understand the fundamentals by (re)writing them in JavaScript is very, very important.

I hope that tomorrow, or maybe the week after, you will stop and shout Eureka!, and tell your coworkers why conditionally adding properties to an object in one place in the code can affect the performance of some other distant hot loop touching those objects. Because hidden classes, you know, they change!
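That coworker conversation fits in a few lines of code (my own sketch, not from any engine): conditionally adding a property gives one logical "type" two different layouts, which is exactly what degrades a monomorphic IC into a polymorphic one.

```javascript
// Sketch: conditionally adding a property splits objects into different
// "shapes" -- the reason a hot loop elsewhere can suddenly slow down.
function makeConfig(verbose) {
  var cfg = {name: "app"};
  if (verbose) cfg.logLevel = "debug";  // only some objects get this field
  return cfg;
}

function shapeOf(obj) {
  return Object.keys(obj).join(',');  // stand-in for a hidden class
}

var a = shapeOf(makeConfig(false));  // "name"
var b = shapeOf(makeConfig(true));   // "name,logLevel"
console.log(a === b);                // false: two layouts for one "type"
```

A loop that only ever sees one of these shapes stays monomorphic; feed it both and every IC guarding a property access now has two fast paths to check.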

[repost ] On Collective Memory


Original: http://baojie.org/blog/2013/07/31/on-collective-memory/

  • 1 Principles
    • 1.1 People first
    • 1.2 Basic properties of Web 3.0
      • 1.2.1 Smart Data
      • 1.2.2 Distributed
      • 1.2.3 Refined and Personalized
      • 1.2.4 Open
    • 1.3 Personal memory
    • 1.4 Collective memory
  • 2 Technical means
    • 2.1 The role of RDF
    • 2.2 Building and growing knowledge bases
    • 2.3 The importance of HCI
    • 2.4 Knowledge indexing

The core of knowledge reuse is not machine-friendly knowledge representation but human-friendly knowledge representation. The traditional KR field has tended to ignore the characteristics of knowledge representation in human-computer interaction and in person-to-person interaction. The former is an HCI problem; the latter is the "social machine" problem TBL talks about. Solve these two and the biggest bottleneck in knowledge management becomes tractable. To say it again: the key to solving AI problems is people, not machines; however much human effort goes in, that is how much intelligence comes out.


The core of software architecture is people. The Semantic Web, one of the most ambitious software architectures, had an early design that mirrored the organizations that created it: government research institutes, universities, and standards bodies. None of these organizations is good at responding to complex, changing requirements, so the architecture they designed was hard to scale. For the later Semantic Web to scale, its design needs to be driven by more distributed, more flexible organizations.


At #semtech# this time I had a long talk with the leader of a large AI project. Reviewing the project's lessons, one of the core problems is how to make good use of human intelligence. Collecting and exchanging intelligence is a more central problem than extracting it (whether by the manual methods of knowledge engineering or the "automatic" methods of machine learning). In the end it is about human intelligence.

Knowledge extraction thrives on the social web. The semantic web, based on knowledge representation, never took off. I think the smart web is not the semantic web; it is based on knowledge reuse.

Today I read a few chapters of The Art of Readable Code. Recommended. I think it can be generalized to The Art of Readable Knowledge. Traditionally knowledge has been viewed as something expressed for machines: knowledge representation uses a machine-understandable language to state human knowledge. This view is one-sided. A knowledge representation language is a kind of computer program, and like all programs its readability to humans is extremely important.

By this standard, neither the recent RDF, OWL and RIF nor the more traditional KIF is a language well suited to human reading. Writing such knowledge bases is hard; understanding them is even harder. "Readable knowledge" is an important principle of the lean semantic web.

Basic Properties of Web 3.0

Distributed processes, smart data, discoverable resources, and unanticipated applications are the core advantages that distinguish Web 3.0 from Web 2.0, and the features that new platform protocols and software must have.

Future of the World Wide Web by Jim Hendler again raises the concept of the social machine. Web 3.0 is a network of human-machine collaboration; the Web 2.0 we have so far is mainly a network of human-human collaboration. People do the hard part, machines do the easy part. http://www.slideshare.net/jahendler/future-of-the-world-wide-web-india

I suspect the search engines of Web 3.0 (if they are still called that) will not be "crawled" into existence. From Yahoo to Google, and from Google to Facebook, each shift was not simple extrapolation and addition; each rested on a new model of organizing and discovering information.

The key is still to create value. What will Web 3.0 bring? Not just social networking, documents and e-commerce, but every aspect of people's lives; even the lives of those who are not on Facebook today will be thoroughly Web-ified, and that space is far from saturated. Technically, it is the evolution from the social Web to the (big) data Web. In the real world the market economy is not only advertising, and the same will hold on the Web.

Smart Data

Nova Spivack has a formulation: Web 3.0 is the Data Web, the era of structured data, and only Web 4.0 is the Intelligent Web, in which the semantic web is fully realized. I think he has a point. How to get from user data to structured data, and how to carry out that piece of social engineering, is probably an important feature of Web 3.0.

I think the separation of data and applications is the key: moving from data slavery toward a capitalist market economy. //@ShangguanRPI: native versus web is only about technology; what matters more for the future mobile app is its form, a form that follows the user.

Refined and Personalized

One phrase to sum up Web 3.0: subtraction. This runs against the Web 2.0 trend. Over the next 10 years the Web will reach the 80% of seven billion people who are not yet active on Web 2.0, and continuing the Web 2.0 approach will not work there. How to subtract: all kinds of techniques, including new user interfaces, new ways of entering and collecting data, data understanding, user understanding, semantic analysis, high-quality (structured) data, push technology, and so on.

In the Web 2.0 era, whoever could waste the most of users' time made money. In the Web 3.0 era, whoever can save users the most time will make money. //@西瓜大丸子汤: Advertising is addition, pushing more information at users. On a small screen you probably have to subtract, reducing the information shown to the user.



The Problem of Walled Gardens http://www.w3.org/2005/Incubator/socialweb/XGR-socialweb-20101206/#Problem Hypertext existed long before the Web; why did it only take off with the Web? Because URIs and HTTP let any text interoperate with any other text, not just texts within a single system. Likewise, interconnecting today's closed social networks could unleash tens or hundreds of times the power of information, perhaps a power beyond the Web itself, just as the Web is more than the Internet.

I finished Weaving the Web. So many thoughts; although the book is 12 years old, it still offers a great deal of guidance today, especially for understanding the different fates of the social web and the semantic web, even though TBL planned the latter rather than the former in the book. Openness, communication, cooperation. In the Web, technology is only a small part; the shift in social patterns is the most fundamental thing. Later generations can do better.

A Standards-based, Open and Privacy-aware Social Web http://www.w3.org/2005/Incubator/socialweb/XGR-socialweb-20101206/ The December 2010 report of the W3C Social Web Incubator Group ( http://www.w3.org/2005/Incubator/socialweb/ ): how to break down data walls like Facebook's. Future social networks should be like email: anyone can communicate with anyone on any network.

The creation of a decentralized and federated Social Web, as part of Web architecture, is a revolutionary opportunity to provide both increased social data portability and enhanced end-user privacy. http://www.w3.org/2005/Incubator/socialweb/XGR-socialweb-20101206/ If you ask what the Web's biggest opportunity of the next decade is, I think it is the open data Web.

Oracle's headquarters building symbolizes a past era: data kept in one data silo after another. What should the headquarters of the next Web 3.0 giant look like to embody its philosophy? How about a Rubik's cube?

On the open data Web, users will not need multiple accounts. Distributed, unified identity will let anyone connect with anyone. The locking-in, overuse and even abuse of personal data by a single service provider will be curbed. Facebook, G+ and the like are the current portals of social networking. Just as Web 1.0 content did not stay locked in a few portals, Social Web content will not stay locked in a few portals forever.

The open Web and the semantic Web are natural twins. The semantic Web is not really just about "semantics" (ontologies, reasoning, and so on); it is mainly about letting data flow between Web parties with minimal friction. Machine queryability, automation and data aggregation are all about reducing data friction. Why is XML not enough? The formation of an XML Schema is itself friction. The open Web does not require everything to be based on an agreed schema or ontology.



An important feature of Web 3.0 will be helping users reduce their working-memory load. First, a revolution in back-end data structures: on-demand, just-in-time data modeling and retrieval, in particular work on graphs and semantics. Second, a revolution in front-end HCI: wearable computers such as Google Glass supporting parallel thinking and short-term memory. Front end and back end complement and reinforce each other.

Writing was one revolution: recording thought linearly. The Web was another: connecting thought into a network. The next revolution will Web-ify each person's brain itself, recording thought non-linearly. Intelligence will increasingly depend on the brain's ability to act as an index for rapid access to the extended mind.

Faceted browsers took a small step. Google Glass and other wearable computers will take a big one: helping users discover needs they themselves cannot even articulate, inspiring them, and letting working memory shift from recall to recognition. That would be an astonishing revolution. //@西瓜大丸子汤: An important feature of Web 3.0 will be helping users reduce their working-memory load.

By Miller's law, human working memory capacity is about 7 items. So, to avoid stress, the morning's unread mail plus microblog posts (or whatever other notifications) should number fewer than 7. Massive information streams are actually harmful to mental health and productivity.

The hard part is a low-threshold, continuous conversion of short-term memory into external storage, plus semantic understanding and retrieval of that data. Once platforms like Google Glass arrive, they may bring a revolution bigger than the Web. //@unk89v: Isn't external storage enough? Add a fast search mechanism and you get the effect of recall; isn't that what remembering is anyway?

Collective memory forms not through consensus but through shared understanding. No two people have the same worldview; new knowledge merges into someone's memory not by forced transfer of knowledge structures but by trial and error, as the new knowledge finds its connection points within the personal knowledge system. The key to knowledge management is therefore reducing friction in the process of shared understanding; predefined ontologies, for example, are best avoided.

Structured things are hard to crowdsource, because a structure is a worldview. Freebase's success was not crowdsourcing; the crowdsourced ConceptNet has not been very successful after more than a decade. //@潘越_: Traditional knowledge engineering (such as building an enterprise vocabulary) is hard to do by distributing the work, because it is hard to keep the standards consistent. //@陈利人: It must be collaborative, shared, crowdsourced.

Traditional ontology-based knowledge representation cannot handle the diverse worldviews of individuals. Unless collective memory relies on force or extremely time-consuming coordination, it must adopt a Web-style distributed design, reduce worldview friction, and let everyone organize things in the way they find comfortable. //@me: Memory is the digitization of a worldview; sharing is the partial merging of worldviews.

RDF should return to its proper role as a knowledge interchange language. Coming down the road from KIF and DAML, the original intent was knowledge exchange, but it ended up being used as a database. The so-called lean semantic web emphasizes simpler data-processing languages and more human-friendly data-interchange languages; RDF is not well suited to either task.

Semantics has many benefits, but one could also consider semantics closer to general-purpose programming languages, such as operational semantics. //@潘越_: Agreed that RDF should return to being a knowledge interchange language; for that purpose a rigorous semantic framework, such as model-theoretic semantics, is still necessary. //@西瓜大丸子汤: RDF should return to its proper role as a knowledge interchange language.

Knowledge bases are growing by an order of magnitude every one to two years. These are still foundational knowledge bases; in the coming years personal knowledge bases may appear and grow even faster. //@潘越_: How can knowledge bases grow faster than databases? //@西瓜大丸子汤: Knowledge bases are currently growing faster than data, and will grow faster still in the coming years. In about five years the advantage of knowledge quality will outweigh the advantage of data quantity.

Reply to @Gary南京: Looking at knowledge base growth, from the Cyc approach to today's Knowledge Graph approach, knowledge bases went from 10^6 to 10^10 entries in only 6-7 years, roughly an order of magnitude every year or two. Because the base is small, the absolute growth is not large, but the trend is fierce. As more human-friendly knowledge-reuse methods spread, knowledge may accumulate even faster.

The new kind of knowledge base follows neither the traditional KR model nor the Freebase model. Under the new model, construction costs are extremely low: an aggregation of "cognitive surplus". //@the王晗: Then what will the cost of building knowledge bases be? //@西瓜大丸子汤: In short, the relative cost of data rises while the relative cost of models and knowledge bases falls.

Reply to @SiDT: I agree with the first half. Structuring information still requires human work. People actually produce a lot of structure in the process of creating information; a good knowledge-reuse method that reduces the waste of that structure would already be valuable, and might work much better than machine-learning-based methods. //@SiDT: Reply to @西瓜大丸子汤: So at this stage we should still focus on knowledge extraction, structuring the large amount of information already on the Web into knowledge?

By my incomplete count, semantic data has grown roughly an order of magnitude per year over the last 5 years, far outstripping the growth of memory. In three to five years, the analysis and use of semantic data will face serious big-data challenges. And this is high-quality data, not filler data, so it matters a lot. //@西瓜大丸子汤: As of January 2012, how big is the semantic web visible to search engines? 17B triples; held in memory that is only a few TB.


Search relies on long-term memory. New user interfaces use short-term memory to reduce the demands on long-term memory. Producing those short-term memories requires user analysis and content analysis, which is why semantic technology and new HCI are inseparable.

On the surface attention is the scarce resource; the underlying cognitive reason is that human working memory is limited. Google helped people reduce the burden on long-term memory. The next company that helps people reduce the burden on working memory has the potential to become the next Google. Will it be Evernote?

Knowing which keywords to use generally relies on long-term memory. //@Alisoncastle: Why is search long-term memory? When I finish searching, I look and then forget.



The new HCI and social-machine methods aim to interconnect human intelligence and to raise the speed of digitizing intelligence by several orders of magnitude. Work that traditionally depended on experts will be distributed across the whole population, including people who are not heavy users of social networks.



Without the mouse there would be no Web. Without the touchscreen there would be no mobile Web. Without Siri (and the large crop of new user interfaces it represents that have not yet appeared) there will be no Semantic Web.

Semantic Wiki is itself an HCI contribution to the Semantic Web, not a contribution in the traditional directions of knowledge structure or reasoning. One important change is that on a wiki, changes to programs and data are immediately visible in the interface, and building the interface itself takes minimal effort (even if the result is not pretty). Imagine how much work it would take to build the same thing from scratch out of HTML, Web Forms, SPARQL and a triple store.

David Karger: compared with FTP, the Web made only "small" workflow changes: URLs directly reference documents, a click immediately follows a link, and the browser keeps the user within one application. The Web created nothing new, but it made old things simpler. Can the semantic web do the same?

One of RDF's biggest challenges is database indexing. Triple modeling makes indexes complex and forces heavy redundancy. In CPU, disk and memory consumption, mainstream triple stores such as Virtuoso cost several times, even ten times, more than relational databases or stores like MongoDB. Considering cost, performance and functionality together, it is no surprise that RDF databases are rarely used in production systems.

Because triples lack organization and context, RDF database index structures are inevitably complex and redundant. For data of the same size, a triple store often consumes 10x the disk space of a graph database, with correspondingly larger I/O and network costs, so it is understandable that its performance falls short of requirements.

[repost ] The Semantic Web Does Not Need Description Logic



It has been almost 10 years since I first encountered Description Logic (DL); my doctoral thesis was on it, and I later worked for a while in the OWL Working Group, so I know it fairly well. As a knowledge representation tool, DL has value in certain fields, medicine in particular. But the more deeply I have come to understand the tool, the more I feel it has no practical significance for the Semantic Web as an application area; it is a dragon-slaying skill with no dragons to slay.



The first kind of complexity is computational complexity, a problem the DL research community knows well. DL arose precisely to address the complexity and undecidability of first-order logic, existing as a decidable subset of it. But by OWL (SHOIN) and OWL 2 (SROIQ), the complexity of the reasoning algorithms reached N2ExpTime (non-deterministic doubly exponential time), which has no practical significance at all. OWL 2 introduced three Profiles that allow PTime (polynomial-time) reasoning as a selling point, but for use on the Web, polynomial time is nowhere near enough. Of course one can simplify further, down to some LogTime, LogSpace or NLogSpace subset (as in Semantic MediaWiki), but those subsets can no longer be called DL.

The second kind of complexity is engineering complexity. The Web's defining feature is its enormous data volume; running thousands of machines together is routine (at Google, using fewer than two thousand servers did not even require approval). At this scale even a LogTime algorithm struggles to respond in real time; you need sharding, indexing, caching, concurrency, and the whole cartload of big-data tools. Never mind description logic: merely adding synonym and hypernym/hyponym relations to TF-IDF increases the required index size by an order of magnitude, which is why Google still does not support even these forms of reasoning that look trivially simple to logicians. More complex reasoning, such as relation subsumption, relation composition, strong negation, or cardinality, requires indexes (whether database indexes or search indexes) of astronomical size; today's engineering platforms cannot deliver it, and it would not be economical anyway. Many logicians do not write code or take part in industrial practice, so it is hard for them to appreciate the real difficulty industry faces in scaling even the smallest semantic relations (such as a classification tree).

The third kind of complexity is cognitive complexity. For logic to work, the required data quality is extremely high, much higher than for databases. Producing high-quality data is expensive: a manually produced triple costs $40, and higher figures are not rare. In one project I know well, an axiom or template costs $150 or more; all this is to guarantee data quality by employing experts. And without experts? Various collective-intelligence methods. Unfortunately, the more people involved, the lower the data quality, and confusion and inconsistency inevitably arise. Ordinary Web users cannot even manage the two most basic knowledge-organization methods: naming things and classifying things. Look at any sizable tagging system and the conceptual chaos there will drive anyone with logical cleanliness mad. At RPI I once ran a knowledge-modeling experiment with 40 undergraduates, mostly computer science majors, trained for 6 hours; by the end many still could not put into practice the most basic distinction between a concept and a property. The data quality of the real Web is fine for search, but for logical reasoning it has no practical use.


[repost ] The Apple-IBM Alliance, Joy for Some and Sorrow for Others: Why IBM?

This puts no small pressure on new CEO Tim Cook: he must keep launching new products to sustain Apple's growth expectations, but nobody can guarantee Apple's future results. Apple's dilemma therefore requires finding a second growth engine outside the consumer market, and the iPhone 6 is Apple's first true enterprise-grade secure phone.

Why say the iPhone 6 was born for the enterprise market? The extra-large screen and extra-long battery life both fit enterprise users' needs, and Apple has also spared no effort on enterprise security.

With the iOS 7.1 release, besides bug fixes and user-interface updates, Apple improved MDM (mobile device management), allowing schools, governments and enterprises to customize application installation and configuration; administrators can restrict or disable hardware such as the camera and block access to certain websites and applications. Apple's MDM still lags specialized MDM vendors in functionality, but at least Apple keeps trying.

Judging from enterprise-market feedback, Apple leads in business usage share. According to Good's May report, Android phones held 26% of the enterprise segment and Android tablets only 8% of the business segment, while Apple reached 72%, pushing Windows Phone and BlackBerry out of the enterprise market.



In May 2010 IBM acquired the cloud-computing specialist Cast Iron Systems, a provider of hosted online software integration services.

At WWDC, Apple had already shown its ambition to tie together consumers' smart lives through integrated hardware and software; the iOS empire is clearly taking shape. Now Apple urgently needs the last piece of the puzzle: partnering with IBM, which excels at data analytics and cloud services, to launch the enterprise mobile solution IBM MobileFirst for iOS, making the iPhone and iPad the most powerful tools in the office.


1. Build high-quality apps for handling work tasks
2. Maintain the quality and security of the system architecture to improve mobile productivity
3. Use mobile tools to build closer customer relationships
4. Effectively convert mobility into business growth

1. Native iPhone and iPad apps tailored for more than 100 enterprise categories
2. IBM cloud services optimized for iOS, covering device management, security, data analytics and mobile integration
3. A new AppleCare maintenance service designed for enterprise needs
4. New IBM packaged tools for device activation, provisioning and management


However, this kind of exclusive agreement will inevitably hurt the interests of Apple's existing partners and push them toward Google, and the most wounded of all is Oracle. Tomorrow we publish "The Apple-IBM Alliance, Joy for Some and Sorrow for Others: Why Oracle Grieves."

[research ]IBM Statistical Information and Relation Extraction (SIRE)


To improve a computer's ability to process rich unstructured text (and speech), one needs to detect mentions in the text of entities of interest (mention detection; e.g. Person, Organization, Medication, etc.), group together all mentions that refer to the same real-world entity (co-reference resolution), and extract relations between the detected entities from the text (relation extraction). For example, the snippet "... its chairman ..." implies the relation ManagerOf between the entity with the nominal Person mention "chairman" and the entity with the pronominal Organization mention "its."

We are developing the Statistical Information and Relation Extraction (SIRE) toolkit to build trainable extractors for new applications and domains. SIRE provides components for mention detection using Maximum Entropy models that can be trained from annotated data created with a highly optimized web-browser annotation tool called HAT, a trainable co-reference component for grouping detected mentions in a document that correspond to the same entity, and a trainable relation extraction system. The SIRE training tools produce annotators that can be deployed as UIMA annotators.

We have developed SIRE systems for multiple domains, including a news domain (about 50 entity types and about 40 relations); more recently SIRE is being applied in healthcare analytics for fact extraction, such as vital signs, allergies and medications in medical reports.