Category Archives: Analysis

[repost ]Computing Document Similarity using Lucene Term Vectors


Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

   <field name="content" type="text" indexed="true" stored="true" 
       termVectors="true" termPositions="true" termOffsets="true"/>

or enable term vectors in your Lucene indexing code like so:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
      TermVector.WITH_POSITIONS_OFFSETS));

Note that you don’t need positions and offsets enabled for term vectors for doing what I am doing, but I had it on in my schema.xml file already, so decided (to avoid confusion) to show the equivalent Lucene call in the code snippet above.

Once you have the index, find the list of all the terms in the "content" field across the entire index. These terms represent the entries in the document vector for a document. Then, for each document you want to consider in your similarity computation, extract its term vector. The term vector gives you two arrays: an array of terms within this document, and the corresponding frequency of each term in this document. Using these three data structures, it is easy to construct a (sparse) document vector for each document.

Once you have the two document vectors, the similarity between the two can be computed using Cosine Similarity (or other measure – most of them work with document vectors).
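As a language-agnostic illustration (a minimal Python sketch with invented term counts, separate from the Java code in this post), cosine similarity over sparse term-frequency maps looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    # dot product over the terms the two documents share
    dot = sum(freq * b[term] for term, freq in a.items() if term in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b)

# toy "documents" as term -> frequency maps (counts are invented)
doc0 = {"mitral": 4, "valve": 5, "surgery": 3, "invasive": 2}
doc1 = {"mitral": 3, "valve": 4, "surgery": 3, "open": 1}
doc3 = {"shaken": 2, "baby": 3, "syndrome": 2}

print(cosine_similarity(doc0, doc1))  # high: shared vocabulary
print(cosine_similarity(doc0, doc3))  # 0.0: no terms in common
```

Vectors that share no terms score 0 and identical vectors score 1, independent of document length.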

For my test, I chose 4 documents out of my corpus, the titles of which (and their associated Lucene document IDs) are shown below:

Document#0: Mitral valve surgery - minimally invasive (31825)
Document#1: Mitral valve surgery - open (31835)
Document#2: Laryngectomy (31706)
Document#3: Shaken baby syndrome (30)

Notice that the first two are both about surgery and about the same body part, so intuitively I should expect a very high similarity between the documents 0 and 1. Document 2 is still about surgery, so I should expect a good, but smaller similarity value between documents 0 and 2. Document 3 is completely different, so I should expect an even smaller similarity value between documents 0 and 3.

The code for this is shown below (as a JUnit test).

package com.mycompany.termpos;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;
import org.apache.commons.math.linear.SparseRealVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;
import org.junit.Assert;
import org.junit.Test;

public class TermVectorBasedSimilarityTest {

  @Test
  public void testSimilarity() throws Exception {
    IndexReader reader = IndexReader.open(
      FSDirectory.open(new File("/path/to/my/index")));
    // first find all terms in the "content" field across the index;
    // each term gets a position in the document vector
    Map<String,Integer> terms = new HashMap<String,Integer>();
    TermEnum termEnum = reader.terms(new Term("content"));
    int pos = 0;
    while (termEnum.next()) {
      Term term = termEnum.term();
      if (! "content".equals(term.field())) {
        break;
      }
      terms.put(term.text(), pos++);
    }
    // build a sparse document vector for each document of interest
    int[] docIds = new int[] {31825, 31835, 31706, 30};
    DocVector[] docs = new DocVector[docIds.length];
    int i = 0;
    for (int docId : docIds) {
      TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
      Assert.assertTrue(tfvs.length == 1);
      docs[i] = new DocVector(terms);
      for (TermFreqVector tfv : tfvs) {
        String[] termTexts = tfv.getTerms();
        int[] termFreqs = tfv.getTermFrequencies();
        Assert.assertEquals(termTexts.length, termFreqs.length);
        for (int j = 0; j < termTexts.length; j++) {
          docs[i].setEntry(termTexts[j], termFreqs[j]);
        }
      }
      docs[i].normalize();
      i++;
    }
    reader.close();
    // now get similarity between doc[0] and doc[1]
    double cosim01 = getCosineSimilarity(docs[0], docs[1]);
    System.out.println("cosim(0,1)=" + cosim01);
    // between doc[0] and doc[2]
    double cosim02 = getCosineSimilarity(docs[0], docs[2]);
    System.out.println("cosim(0,2)=" + cosim02);
    // between doc[0] and doc[3]
    double cosim03 = getCosineSimilarity(docs[0], docs[3]);
    System.out.println("cosim(0,3)=" + cosim03);
  }

  private double getCosineSimilarity(DocVector d1, DocVector d2) {
    return (d1.vector.dotProduct(d2.vector)) /
      (d1.vector.getNorm() * d2.vector.getNorm());
  }

  private class DocVector {
    public Map<String,Integer> terms;
    public SparseRealVector vector;

    public DocVector(Map<String,Integer> terms) {
      this.terms = terms;
      this.vector = new OpenMapRealVector(terms.size());
    }

    public void setEntry(String term, int freq) {
      // ignore terms outside the "content" field's vocabulary
      if (terms.containsKey(term)) {
        int pos = terms.get(term);
        vector.setEntry(pos, (double) freq);
      }
    }

    public void normalize() {
      double sum = vector.getL1Norm();
      vector = (SparseRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
      RealVectorFormat formatter = new RealVectorFormat();
      return formatter.format(vector);
    }
  }
}

The results of running this test are shown below, and match up with my expectations:


Using Lucene’s term vectors to generate document vectors can be useful not only for similarity calculations, but for other tasks where you need document vectors, such as clustering and classification. Computing document vectors from raw data is typically more involved (requires development time and can have scalability issues) than loading data into a Lucene index.

The other scalability issue is the size of the document vectors. My index is relatively tiny (about 3,000 documents), but we are still talking about 10,000 unique terms after stopwording and stemming. However, document vectors are typically quite sparse (my test document vectors have an average density of 0.01), so an OpenMapRealVector is a good choice of data structure, and should scale to much larger sizes as well.

[repost ]UC Berkeley: Public Health: Statistical Data Analysis


Title / Video download
[Episode 1] Introduction to the public health course
[Episode 2] Binary outcome data and incidence rates
[Episode 3] Incidence in detail: intervals and hazard functions
[Episode 4] Binary attributes and risk factors; the odds of disease
[Episode 5] Exposed and unexposed groups; relative risk
[Episode 6] Attributable risk and sampling
[Episode 7] Cohort studies and case-control studies
[Episode 8] Sampling in case-control and cohort studies
[Episode 9] The chi-square test
[Episode 10] Comparing the power of cohort and case-control studies
[Episode 11] Comparing the power of cohort and case-control studies; significant differences
[Episode 12] Using logarithms to compute odds ratios
[Episode 13] Confidence intervals and estimates for attributable risk and excess risk
[Episode 14] Statistical reasoning about causation
[Episode 15] Confounding variables, causal diagrams, and colliders
[Episode 16] Confounding, causal diagrams, and measures of dependence
[Episode 17] Collapsibility; tests of independence on stratified tables
[Episode 18] Wee tables, strata, the null hypothesis, and the CMH test
[Episode 19] The Woolf and Mantel-Haenszel estimators
[Episode 20] Synergy and antagonism
[Episode 21] Tests of homogeneity, including Woolf's method
[Episode 22] Overall tests, dose response, tests for trend, and goodness of fit
[Episode 23] Logistic regression
[Episode 24] Comparing linear, log-linear, and logistic regression models
[Episode 25] Linear, log-linear, and logistic regression models
[Episode 26] Logistic regression models for risk factors
[Episode 27] Public Health: Statistical Data Analysis, Lecture 27
[Episode 28] Public Health: Statistical Data Analysis, Lecture 28
[Episode 29] Public Health: Statistical Data Analysis, Lecture 29
[Episode 30] Public Health: Statistical Data Analysis, Lecture 30
[Episode 31] Public Health: Statistical Data Analysis, Lecture 31
[Episode 32] Public Health: Statistical Data Analysis, Lecture 32
[Episode 33] Public Health: Statistical Data Analysis, Lecture 33
[Episode 34] Public Health: Statistical Data Analysis, Lecture 34
[Episode 35] Public Health: Statistical Data Analysis, Lecture 35
[Episode 36] Public Health: Statistical Data Analysis, Lecture 36
[Episode 37] Public Health: Statistical Data Analysis, Lecture 37
[Episode 38] Public Health: Statistical Data Analysis, Lecture 38
[Episode 39] Public Health: Statistical Data Analysis, Lecture 39
[Episode 40] Public Health: Statistical Data Analysis, Lecture 40
[Episode 41] Public Health: Statistical Data Analysis, Lecture 41
[Episode 42] Public Health: Statistical Data Analysis, Lecture 42

[repost ]The levels of a data analyst




Category: user research  |  Source: ……cb80c4bf.html

[repost ]Telling the story in pictures: the analysis tools we played with over the years


[Editor's note] Those years: Word, Excel, SAS, and R code? Those years: data, reports, presentations, talks? Did you find love? Get the house? Get the raise? Long-suffering data analysts, do you ever think back on the data-analysis software that kept you company through those years? Some of it is drifting further and further away from us... This issue remembers them; see the image (view the full-size version for a clearer look)!

       Remember the data, reports, presentations, and talks we produced over those years?
       Is a leaf's departure the wind's longing, or the tree's refusal to hold on? Is the software's departure life hurrying us on, or youth refusing to stay...
Goodbye, my childhood. Goodbye, my playmates. Goodbye, my youth. Goodbye, my software.



[repost ]Built to address real-world challenges, jStart solutions offer clients tested and proven emerging technologies to generate immediate value.


By: Chris “Spence” Spencer

The challenges facing businesses today are complex and diverse. But for every challenge there exists an opportunity, and jStart believes that emerging technologies can surface those opportunities while simultaneously addressing the challenges businesses face today. By meeting the need, and exceeding it, jStart technologies are designed not only to address existing issues, but also to extend business opportunities.

Intelligently analyzing data

So how can companies get started tackling such a potentially huge challenge? Counterintuitively, by doing the exact opposite: starting small. jStart can help your business explore how emerging technologies can allow your company to begin tackling this looming issue. We all know that emerging technology can be challenging. After all, it's emerging. But that's where the jStart team's expertise can be of benefit: we specialize in helping companies get started with emerging technologies, not by solving some theoretical problem, but by tackling the real, challenging issues that companies face today.

What’s with this whole ‘New Intelligence’ concept from IBM?

It’s not just about having a ton of data. It’s about doing something intelligent with that information. IBM has recognized this, and has launched its New Intelligence initiative as part of its Smarter Planet efforts. Realizing returns on your web investments starts with more intelligent components running behind the scenes. Web activity metering, monitoring, and monetization tools let you take control. You can track consumption of your web assets even when they reside on platforms that you don’t control. These tools protect your web investment by allowing non-intrusive oversight of external vendors, plus security metrics to guard against hackers. By providing you with consumer response, demand, and behavior pattern data, they let you gather intelligence on how to differentiate your brand from the competition. Let jStart show you how surprisingly lightweight and easy it can be to get started.

our solutions+services

  • NEW: NLP Utility: a web-based tool to create, edit, and manage NLP rules and annotations
  • NEW: NLP Dictionary Server: a tool designed to let Line of Business users create, edit, manage, and collaborate on their NLP dictionaries
  • Apache Hadoop™: get your arms around massive amounts of data with distributed computing, and leverage jStart’s expertise to do it
  • Maqetta: design and create HTML5 user interfaces using a web-based application
  • Big Data: Big Data isn’t just about a lot of information; it’s about doing something interesting and useful with a large amount of information, such as looking for key trends or patterns and understanding customers better. Big Data gives businesses opportunities they’ve never had before.
  • IBM BigSheets: gather, analyze, and visualize big data for business insight. BigSheets is now included as part of IBM InfoSphere’s BigInsights solution.
  • Text Analytics: understand the context of the data that you have. Is it an entity (person, place, or thing)? Is it a date/time? How does it relate to the other information within your data? Text analytics provides useful tools for understanding what your data is trying to tell you.
  • Blue Spruce: a cooperative Web platform

Intelligently Instrumented.

Streamlining processes for instrumented operations has become tougher as hybrid “click and brick” business models gain popularity. In order to provide focus and clear directives across your value chain, web channels need to become highly integrated for conflicts to be solved with specific, timely solutions. Business event processing tools trigger self-starting and self-regulating conflict resolutions between channels. jStart’s web solutions and services equip you with the information to take action plus the tools to automate those actions effectively.

Industry Approved.

jStart’s combined industry perspective and experience can be leveraged to manage risk while stimulating web-based innovation using emerging technologies. Its knowledge of emerging growth markets and understanding of those markets can also aid in identifying opportunities. Such continuous innovation is required to create and sustain value. Without web-based innovations, you risk falling behind and losing market presence, especially during turbulent economic times. When you need a partner instead of another provider, an “open innovation” engagement utilizing jStart solutions and services helps your organization excel in its industry.

[repost ]Sociolinguistics in the Internet age: text mining on SNS data


In the first half of this year I interned at Renren, where I had access to a great deal of valuable data and did some work I think was meaningful; I would like to share it here. My thanks to Renren for the data and the working environment, and to Dr. Zhao Jicheng and Prof. Zhan Weidong for their support and advice. This work gave me many opportunities to exchange ideas with others; special thanks to OpenParty and TEDxBeijing for providing platforms. The article was published in Programmer magazine in two parts, in the July and August 2012 issues; I thank editor Lu Dongxiang for his hard work. For well-known reasons, the published version was censored (reading on, you will easily guess which parts), so I decided to post the complete version on my blog and share it with more people. If this interests you, feel free to email me. Now, on to the article.

As a student of applied linguistics in a Chinese department, and a math geek, I am keen on analyzing Chinese-language material computationally. Chinese is a unique and fascinating language, and processing it raises difficulties other languages do not have, word segmentation above all: Chinese puts no spaces between words, so how is a computer to know whether 已结婚的和尚未结婚的青年都要实行计划生育 should be read as 已/结婚/的/和/尚未/结婚/的/青年 ("young people, married and not yet married alike, must practice family planning") or as 已/结婚/的/和尚/未/结婚/的/青年 (which absurdly parses out 和尚, "monk")? This is the segmentation-ambiguity problem, and many language models now handle it fairly well. But Chinese segmentation has something even more troublesome than ambiguity: out-of-vocabulary words. Chinese has no capitalization, and the proper-name mark has long been abolished, so how is a computer to recognize person and place names? Worse still are organization names, brand names, technical terms, abbreviations, and Internet neologisms, which seem to be coined with no discernible regularity. The field has spent the past decade attacking this problem, and automatic discovery of new words has become the key step.

The traditional approach to mining new words is to segment the text first and then guess that the leftover fragments that failed to match are new words. This is circular: segmentation accuracy itself depends on the completeness of the lexicon, so if the lexicon contains no new words at all, how can we trust the segmentation result? A bolder idea is to rely on no existing lexicon whatsoever: using only the shared statistical properties of words, extract from a large corpus every text fragment that could plausibly be a word, old or new alike, and then compare everything extracted against an existing lexicon; whatever is left over must be new. With such an extraction algorithm in hand, we can also do more interesting word-level data mining. The corpus I used is the status updates of a subset of Renren users from the first half of December 2011; many thanks to Renren for providing this valuable material.

To extract words from text, our first question is: what makes a text fragment a word? The first criterion that comes to mind is frequency: extract every fragment whose count exceeds some threshold and output it as vocabulary of the corpus. But high frequency alone is not enough; a frequent fragment may not be a word but a phrase made of several words. In the Renren statuses, 的电影 ("[possessive] movie") appears 389 times while 电影院 ("movie theater") appears only 175 times, yet we would rather treat 电影院 as a word, because intuitively 电影 ("movie") and 院 ("hall") stick together more tightly.

To show that 电影院 really does have high internal cohesion, we can compute how unlikely it would be for 电影 and 院 to land next to each other if each occurred independently at random. In the full 24-million-character corpus, 电影 appears 2774 times, a probability of about 0.000113, and 院 appears 4797 times, a probability of about 0.0001969. If the two were truly unrelated, the probability of their appearing adjacent would be 0.000113 × 0.0001969, about 2.223 × 10^-8. In fact, 电影院 appears 175 times in the corpus, a probability of about 7.183 × 10^-6, more than 300 times the predicted value. Similarly, the character 的 has a probability of about 0.0166, so the theoretical probability of 的 and 电影 combining at random is 0.0166 × 0.000113, about 1.875 × 10^-6, which is close to the observed probability of 的电影, roughly 1.6 × 10^-5, only 8.5 times the prediction. The computation suggests that 电影院 is much more likely to be a meaningful collocation, while 的电影 looks like 的 and 电影 happening to land next to each other.

Of course, a knowledge-free extraction program does not know that 电影院 decomposes as 电影 plus 院, or 的电影 as 的 plus 电影, and the wrong split can overestimate a fragment's cohesion: treating 电影院 as 电 plus 影院 yields an even higher cohesion score. So to score a fragment, we must enumerate every way of splitting it into two parts. Let p(x) be the probability of fragment x in the whole corpus; we then define the cohesion of 电影院 as the smaller of p(电影院)/(p(电)·p(影院)) and p(电影院)/(p(电影)·p(院)), and the cohesion of 的电影 as the smaller of p(的电影)/(p(的)·p(电影)) and p(的电影)/(p(的电)·p(影)).
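As a sketch of this definition (my own Python, not the author's code; the probabilities marked "assumed" are invented, since the article quotes only some of them):

```python
def cohesion(candidate, prob):
    """Internal cohesion: minimum over all binary splits of
    p(candidate) / (p(left) * p(right))."""
    return min(
        prob[candidate] / (prob[candidate[:i]] * prob[candidate[i:]])
        for i in range(1, len(candidate))
    )

# probabilities quoted in the article (24M-character Renren corpus),
# plus a few assumed values the article does not quote
prob = {
    "电影院": 7.183e-6, "电影": 0.000113, "院": 0.0001969,
    "的电影": 1.6e-5, "的": 0.0166,
    "电": 0.000463,   # assumed value
    "影院": 3.3e-5,   # assumed value
    "的电": 1.4e-5,   # assumed value
    "影": 0.000251,   # assumed value
}
print(cohesion("电影院", prob))  # hundreds of times the chance prediction
print(cohesion("的电影", prob))  # single digits: a chance collocation
```

Taking the minimum over splits means a fragment only scores high when no decomposition explains it away.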


"Information entropy" is a wonderfully useful concept: it measures how much information, on average, learning the outcome of an event brings you. If an outcome has probability p, then learning that it actually occurred gives you an amount of information defined as -log(p); the smaller p is, the more information you receive. If a die's six faces read 1, 1, 1, 2, 2, 3, then learning that a throw came up 1 is not very surprising: it carries -log(1/2) ≈ 0.693 units of information. Learning that the throw came up 2 carries -log(1/3) ≈ 1.0986 units, and learning it came up 3 carries -log(1/6) ≈ 1.79. But you have only a 1/2 chance of receiving the 0.693, a 1/3 chance of receiving the 1.0986, and a 1/6 chance of receiving the 1.79, so on average a throw brings you 0.693/2 + 1.0986/3 + 1.79/6 ≈ 1.0114 units of information. That 1.0114 is the die's information entropy. Now suppose a die has 100 faces, 99 of them reading 1 and one reading 2. Learning that a throw came up 2 would bring you an enormous amount of information, -log(1/100) ≈ 4.605, but you have only a one-percent chance of receiving it; the rest of the time you receive only -log(99/100) ≈ 0.01005. On average you receive just 0.056 units of information, and that is this die's entropy. Finally, consider the extreme case of a die whose six faces all read 1: throwing it tells you nothing, and its entropy is -log(1) = 0. When is entropy large? Put differently, after what kind of event are you most eager to ask how it turned out? Intuitively, the events whose outcomes are the most uncertain. Exactly: information entropy directly measures how random an event's outcome is.
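The dice arithmetic is easy to check in a few lines (a Python sketch using natural logarithms, as the article does):

```python
import math

def entropy(probs):
    """Information entropy: the expected value of -log(p)."""
    return -sum(p * math.log(p) for p in probs)

# die with faces 1, 1, 1, 2, 2, 3
print(entropy([3/6, 2/6, 1/6]))   # ≈ 1.0114
# die with 99 faces reading 1 and one face reading 2
print(entropy([99/100, 1/100]))   # ≈ 0.056
# die with all six faces reading 1: no information at all
print(entropy([1.0]))
```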

We use information entropy to measure how random a text fragment's set of left-neighbor characters and set of right-neighbor characters are. Consider the tongue twister 吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮 ("eat grapes without spitting out the skins; eat no grapes yet spit out skins"). The word 葡萄 ("grape") appears four times, with left neighbors {吃, 吐, 吃, 吐} and right neighbors {不, 皮, 倒, 皮}. By the formula, the entropy of 葡萄's left neighbors is -(1/2)·log(1/2) - (1/2)·log(1/2) ≈ 0.693, while that of its right neighbors is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) ≈ 1.04. In this sentence, then, the right neighbors of 葡萄 are the more varied.

In the Renren statuses, 被子 ("quilt") appears 956 times and 辈子 ("lifetime") 2330 times, and their right-neighbor entropies, 3.87404 and 4.11644, are numerically very close. But the left neighbors of 被子 are rich: the most common usage is 晒被子 ("air the quilt"), 162 times; next is 的被子, 85 times; then 条被子, 在被子, and 床被子, appearing 69, 64, and 52 times; plus a long tail of more than 100 other usages such as 叠被子, 盖被子, 加被子, 新被子, 掀被子, 收被子, 薄被子, 踢被子, and 抢被子. The entropy of all its left neighbors is 3.67453. The left neighbors of 辈子, by contrast, are impoverished: of its 2330 occurrences, 1276 are 一辈子, 596 are 这辈子, 235 are 下辈子, 149 are 上辈子, 32 are 半辈子, 10 are 八辈子, 7 are 几辈子, 6 are 哪辈子, and the remaining 13 usages, such as n辈子 and 两辈子, are rarer still. The entropy of its left neighbors is a mere 1.25963, so whether 辈子 deserves word status is clearly debatable. 下子 is an even more typical case: of its 310 occurrences, 294 come from 一下子, 5 from 两下子, 5 from 这下子, and the rest are usages seen only once each. Its left-neighbor entropy is just 0.294421, and we should not regard it as a word that can be used freely. Conversely, some fragments have perfectly varied left neighbors but very poor right-neighbor usage, such as 交响, 后遗, and 鹅卵; treating those as independent words seems equally wrong. We may therefore define a fragment's freedom of use as the smaller of its left-neighbor entropy and its right-neighbor entropy.
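A sketch of the neighbor-entropy computation on the grape-skin tongue twister (my own Python, not the author's code):

```python
import math
from collections import Counter

def entropy(counter):
    """Entropy of an empirical distribution given as a Counter."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def neighbor_entropies(text, word):
    """Entropies of the characters immediately to the left and right
    of each occurrence of `word` in `text`."""
    left, right = Counter(), Counter()
    i = text.find(word)
    while i != -1:
        if i > 0:
            left[text[i - 1]] += 1
        if i + len(word) < len(text):
            right[text[i + len(word)]] += 1
        i = text.find(word, i + 1)
    return entropy(left), entropy(right)

text = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
l, r = neighbor_entropies(text, "葡萄")
print(l)          # ≈ 0.693, left neighbors {吃, 吐, 吃, 吐}
print(r)          # ≈ 1.04, right neighbors {不, 皮, 倒, 皮}
print(min(l, r))  # freedom of use: the smaller of the two
```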

We treat every substring of the text with length at most d as a candidate word (d is a self-chosen cap on candidate length; I used 5), set one threshold each for frequency, cohesion, and freedom, and then simply extract every candidate that clears all three. For efficiency, we can view the entire corpus as a single string and sort all of its suffixes lexicographically. The table below shows the result of sorting all suffixes of 四是四十是十十四是十四四十是四十. In practice we only need to keep the first d + 1 characters of each suffix in memory, or better still, store only each suffix's starting position in the corpus.


Sorted this way, identical candidates end up adjacent, so a single scan from start to finish computes every candidate's frequency and right-neighbor entropy. Reversing the corpus and sorting all the suffixes again, a second scan yields every candidate's left-neighbor entropy. And with the frequencies in hand, cohesion is easy to compute as well. We thus have a word-extraction algorithm that needs no knowledge base at all: given a sufficiently long input text, it extracts plausible words in roughly O(n · log n) time.
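A toy sketch of this pipeline in Python (thresholds and the entropy passes omitted; the Counter stands in for the streaming pass over adjacent equal prefixes that the sorted order makes possible):

```python
from collections import Counter

def candidate_counts(text, d=5):
    """Count every substring of length 1..d. Sorting the suffixes puts
    identical candidates next to each other, which is what lets a real
    implementation stream counts and right-neighbor entropy in one pass;
    here a Counter plays that role for clarity."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    counts = Counter()
    for s in suffixes:
        for k in range(1, min(d, len(s)) + 1):
            counts[s[:k]] += 1
    return counts

counts = candidate_counts("四是四十是十十四是十四四十是四十")
print(counts["四十"])  # 3
print(counts["十四"])  # 2
```

A real implementation would store suffix start positions rather than suffix strings, and add the cohesion computation plus a second pass over the reversed corpus for left-neighbor entropy.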








Oh, and one more, my favorite: the words extracted from the news pages of the People's Daily for April 2000:




In fact, the program extracted about 1200 words from the Renren status data, and most of them are indeed standard modern Chinese vocabulary. But remember, our goal is new-word extraction. Comparing everything extracted against an existing lexicon leaves the vocabulary peculiar to Renren (again sorted by frequency, descending):


I then thought of an even more entertaining use. Why not compare the words in each day's statuses against the previous day's, and extract the words peculiar to that day? That would give us Renren's daily trending topics! Judging from the size of the data in hand, this is entirely feasible. I picked 12 fairly representative words and list their frequencies in the user statuses of December 13, 2011 (left column) and December 14, 2011 (right column):

下雪 (snow) 33 92
那些年 (those years) 139 146
李宇春 (Li Yuchun) 1 4
看见 (saw) 145 695
魔兽 (Warcraft) 23 20
高数 (advanced math) 82 83
生日快乐 (happy birthday) 235 210
今天 (today) 1416 1562
北半球 (Northern Hemisphere) 2 18
脖子 (neck) 23 69
悲伤 (sad) 61 33
电磁炉 (induction cooker) 0 3

You can judge at a glance which of these count as hot words for December 14. For example, 下雪 ("snow") appeared only 33 times on December 13 but 92 times on December 14, 2.8 times as many; that is unlikely to be coincidence, and the natural first guess is that somewhere it really did snow on December 14. 那些年 did appear more often on December 14 than on the 13th, but the difference is small, and we have no reason to call it a hot word of the day.

A question now confronts us: how do we quantify a word's "heat" for a given day? The first idea is of course to simply look at the ratio between each word's frequency today and its frequency yesterday, but a moment's thought reveals the problem: ratios cannot cope with the randomness of small samples. On December 14, 李宇春 (Li Yuchun) appeared 4 times as often as on December 13, beating the 2.8 of 下雪, yet we are far more inclined to believe the 李宇春 spike is a fluke. Even more troublesome is the 电磁炉 ("induction cooker") row: its December 14 frequency is infinitely many times its December 13 frequency, but obviously we cannot conclude that 电磁炉 was the hottest word of December 14.

Ignore every word with too few samples? That will not do either; a small-sample word may genuinely be hot. Take 北半球 ("Northern Hemisphere"): although its counts are tiny on both days, that factor of 9 is hard to ignore. In fact, the human eye easily picks out the real hot words of December 14: besides 下雪, the words 看见 ("saw"), 北半球, and 脖子 ("neck") should also qualify. You probably feel certain that something lies behind the sudden spikes of those three (and are itching to know what it was), while confidently writing off the "anomalies" of 李宇春 and 电磁炉 as chance. Your intuition is right: on December 14, 2011 there was a spectacular Geminid meteor shower, one of the Northern Hemisphere's three great meteor showers. During the day, users kept resharing the news, so 北半球 heated up; at night they kept posting "saw one!" and "saw another!", so the frequency of 看见 surged; and finally, after a whole evening of staring upward, their necks gave out, and all the way home they posted about sore necks.


The table below shows the first four words. The four columns on the right give each word's frequency on December 13, its frequency on December 14, its total frequency across the two days, and the share of that total contributed by the second day. The third column is the sum of the first two, and the fourth is the second divided by the third. That last column is a number between 0 and 1, the probability that an occurrence of the word fell on December 14; treat it as the word's score. Notice that 下雪 now scores lower than 李宇春, which is not what we want. 李宇春's sample is tiny, and we would like to use that fact as grounds for dragging its score down.

下雪 (snow) 33 92 125 0.736
那些年 (those years) 139 146 285 0.512
李宇春 (Li Yuchun) 1 4 5 0.8
看见 (saw) 145 695 840 0.827
(average) 313.75 0.719

How? We take a weighted average of each word's score with the global average score. First compute the average total frequency of the four words, 313.75, and their average score, 0.719. Now pretend that 313.75 judges have already given every word a score of 0.719; in other words, every word starts out with 313.75 ratings, all of them 0.719. The word 下雪 then receives an additional 125 ratings of 0.736 each, so its final score is:

下雪 (0.736 × 125 + 0.719 × 313.75) / (125 + 313.75) ≈ 0.724


那些年 (0.512 × 285 + 0.719 × 313.75) / (285 + 313.75) ≈ 0.62
李宇春 (0.8 × 5 + 0.719 × 313.75) / (5 + 313.75) ≈ 0.7202
看见 (0.827 × 840 + 0.719 × 313.75) / (840 + 313.75) ≈ 0.798

It is easy to see that the larger a word's sample, the more power it has to pull its final score toward its own raw score, while a word with too small a sample ends up scoring very close to the global average. After this adjustment, 下雪 scores higher than 李宇春. In practice, the constant 313.75 can be set however you like; the larger it is, the more weight you place on guarding against small samples. This idea of averaging against a global mean is called the Bayesian average, and from the formulas above it is easy to see that it is a special form of one of the most common smoothing methods: adding a constant to both numerator and denominator.
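The adjustment is one line of arithmetic; a sketch using the numbers from the table above:

```python
def bayesian_average(score, n, prior_score=0.719, prior_n=313.75):
    """Weighted average of a word's own score (backed by n ratings)
    with a global prior score (weighted as prior_n virtual ratings)."""
    return (score * n + prior_score * prior_n) / (n + prior_n)

print(bayesian_average(0.736, 125))  # 下雪   ≈ 0.724
print(bayesian_average(0.512, 285))  # 那些年 ≈ 0.620
print(bayesian_average(0.8, 5))      # 李宇春 ≈ 0.720
print(bayesian_average(0.827, 840))  # 看见   ≈ 0.798
```

With only 5 occurrences, 李宇春 is pulled almost all the way back to the prior of 0.719, so 下雪 now outranks it.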

Using the earlier extraction program to pull out the words in each day's Renren statuses, comparing their frequencies with the previous day's, and smoothing with the method just described, we obtain each day's hot words. The data in my hands covers the first half of December 2011, so I can produce the hot words from December 2 through December 15 (top 5 per day, sorted by score, high to low).


So a meteor shower really did occur on December 14.


In fact, with the tools above, we can compare the vocabulary of any two bodies of text. More interestingly, most Renren status authors filled in their gender and age, so why not split the statuses into male and female groups, or post-80s and post-90s groups, and mine what people with different attributes like to say? In the past, questions like these could be answered only by large-scale sociolinguistic surveys; today, with the Internet's flood of user-generated content behind us, we can dig out the answers with ease.





Here are the words favored by users born in the 1990s:


And here are the words favored by users born in the 1980s:





[repost ]The Unreasonable Effectiveness of Data


EXPERT OPINION
1541-1672/09/$25.00 © 2009 IEEE. IEEE INTELLIGENT SYSTEMS. Published by the IEEE Computer Society. Contact editor: Brian Brannon.

The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira, Google

Eugene Wigner's article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc^2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words. Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus, along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions, captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks, if only we knew how to extract the model from the data.

Learning from Text at Web Scale

The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need (think of the operations of the European Union or of news agencies). The same is true of speech transcription (think of closed-caption broadcasts). In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties we discuss later in relation to the Semantic Web. The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results, or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.
Another important lesson from statistical methods in speech recognition
and machine translation is that memorization is a good policy if you have
a lot of training data. The statistical
language models that are used in both
tasks consist primarily of a huge database of probabilities of short sequences
of consecutive words (n-grams). These
models are built by counting the number of occurrences of each n-gram sequence from a corpus of billions or trillions of words. Researchers have done
a lot of work in estimating the probabilities of new n-grams from the frequencies of observed n-grams (using, for example, Good-Turing or Kneser-Ney smoothing), leading to elaborate
probabilistic models. But invariably,
simple models and a lot of data trump
more elaborate models based on less
data. Similarly, early work on machine
translation relied on elaborate rules for
the relationships between syntactic and
semantic patterns in the source and
target languages. Currently, statistical
translation models consist mostly of
large memorized phrase tables that give
candidate mappings between specific
source- and target-language phrases.
Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation
models introduce general rules only
when they improve translation over just
memorizing particular phrases (for instance, in rules for dates and numbers).
Similar observations have been made
in every other application of machine
learning to Web data: simple  n-gram
models or linear classifiers based on
millions of specific features perform
better than elaborate models that try
to discover general rules. In many cases
there appears to be a threshold of sufficient data. For example, James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos. With a corpus of thousands of photos, the results were poor. But once they accumulated millions of photos, the same algorithm performed quite well. We know that the number of grammatical English sentences is theoretically infinite and the number of possible 2-Mbyte photos is 256^2,000,000. However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without generative rules.
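To make the "memorization" strategy described earlier concrete, here is a toy bigram model (a Python sketch, not the authors' code); it is little more than a table of counts:

```python
from collections import Counter

def bigram_model(tokens):
    """Estimate P(next | word) from raw counts: the 'memorization'
    that large corpora make so effective."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = bigram_model(tokens)
print(p("the", "cat"))  # 2/3: "the" occurs 3 times, "the cat" twice
print(p("cat", "sat"))  # 1/2
```

Production systems add smoothing (Good-Turing, Kneser-Ney) to assign sensible probabilities to unseen n-grams, but the core really is a huge lookup table.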
For those who were hoping that a
small number of general rules could explain language, it is worth noting that
language is inherently complex, with
hundreds of thousands of vocabulary
words and a vast variety of grammatical constructions. Every day, new words
are coined and old usages are modified.
This suggests that we can’t reduce what
we want to say to the free combination
of a few abstract primitives.
For those with experience in smallscale machine learning who are worried about the curse of dimensionality
and overfitting of models to data, note
that all the experimental evidence
from the last decade suggests that
throwing away rare events is almost
always a bad idea, because much Web
data consists of individually rare but
collectively frequent events. For many
tasks, words and word combinations
provide all the representational machinery we need to learn from text.
Human language has evolved over
millennia to have words for the important concepts; let’s use them. Abstract representations (such as clusters
from latent analysis) that lack linguistic counterparts are hard to learn or
validate and tend to lose information.
Relying on overt statistics of words
and word co-occurrences has the further advantage that we can estimate
models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally
The success of  n-gram models has
unfortunately led to a false dichotomy.
Many people now believe there are
only two approaches to natural language processing:
•     a deep approach that relies on handcoded grammars and ontologies,
represented as complex networks of
relations; and
•     a statistical approach that relies on
learning n-gram statistics from large
In reality, three orthogonal problems
•     choosing a representation language,
•     encoding a model in that language,
•     performing inference on the model.
Each problem can be addressed in several ways, resulting in dozens of approaches. The deep approach that was
popular in the 1980s used first-order
logic (or something similar) as the representation language, encoded a model
with the labor of a team of graduate
students, and did inference with complex inference rules appropriate to the
representation language. In the 1980s
and 90s, it became fashionable to
use finite state machines as the representation language, use counting and
smoothing over a large corpus to encode a model, and use simple Bayesian
statistics as the inference method.
But many other combinations are
possible, and in the 2000s, many are
being tried. For example, Lise Getoor
and Ben Taskar collect work on statistical relational learning—that is, representation languages that are powerful enough to represent relations
between objects (such as first-order
logic) but that have a sound, probabilistic definition that allows models to
be built by statistical learning. Ben Taskar and his colleagues show how the same kind of maximum-margin classifier used in support vector machines can improve traditional parsing. Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld show how a relational logic and a 100-million-page corpus can answer questions such as "what vegetables help prevent osteoporosis?" by isolating and combining the relational assertions that "kale is high in calcium" and "calcium helps prevent osteoporosis."
Semantic Web versus
Semantic Interpretation
The Semantic Web is a convention for
formal representation languages that
lets software services interact with each other "without needing artificial intelligence." A software service that enables us to make a hotel reservation
is transformed into a Semantic Web
service by agreeing to use one of several standards for representing dates,
prices, and locations. The service can
then interoperate with other services
that use either the same standard or
a different one with a known translation into the chosen standard. As Tim
Berners-Lee, James Hendler, and Ora
Lassila write, “The Semantic Web will
enable machines to  comprehend semantic documents and data, not human speech and writings.”
The problem of understanding human speech and writing—the semantic interpretation problem—is quite
different from the problem of software
service interoperability. Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making
data precise enough that the programs
operating on the data will function effectively. Unfortunately, the fact that
the word “semantic” appears in both
“Semantic Web” and “semantic interpretation” means that the two problems have often been conflated, causing needless and endless consternation
and confusion. The “semantics” in Semantic Web services is embodied in the
code that implements those services in
accordance with the specifications expressed by the relevant ontologies and
attached informal documentation. The
“semantics” in semantic interpretation
of natural languages is instead embodied in human cognitive and cultural
processes whereby linguistic expression elicits expected responses and expected changes in cognitive state. Because of a huge shared cognitive and
cultural context, linguistic expression
can be highly ambiguous and still often be understood correctly.
Given all this, building Semantic Web services is an engineering and sociological challenge. So, even though we understand the required technology, we must deal with significant hurdles:
•	Ontology writing. The important easy cases have been done. For example, the Dublin Core defines dates, locations, publishers, and other concepts that are sufficient for card catalog entries. The Gene Ontology defines chromosomes, species, and gene sequences. Other organizations provide ontologies for their specific fields. But there's a long tail of rarely used concepts that are too expensive to formalize with current technology. Project Halo did an excellent job of encoding and reasoning with knowledge from a chemistry textbook, but the cost was US$10,000 per page. Obviously we can't afford that cost for a trillion Web pages.
•	Difficulty of implementation. Publishing a static Web page written in natural language is easy; anyone with a keyboard and Web connection can do it. Creating a database-backed Web service is substantially harder, requiring specialized skills. Making that service compliant with Semantic Web protocols is harder still. Major sites with competent technology experts will find the extra effort worthwhile, but the vast majority of small sites and individuals will find it too difficult, at least with current tools.
•	Competition. In some domains, competing factions each want to promote their own ontology. In other domains, the entrenched leaders of the field oppose any ontology because it would level the playing field for their competitors. This is a problem in diplomacy, not technology. As Tom Gruber says, "Every ontology is a treaty, a social agreement, among people with some common motive in sharing." When a motive for sharing is lacking, so are common ontologies.
•     Inaccuracy and deception. We
know how to build sound inference mechanisms that take true premises and infer true conclusions. But we don't have an established methodology to deal with mistaken premises or with actors who lie, cheat, or otherwise deceive. Some work in reputation management and trust exists, but for the time being we can expect Semantic Web technology to work best where an honest, self-correcting group of cooperative users exists and not as well where competition and deception exist.
The challenges for achieving accurate semantic interpretation are different. We've already solved the sociological problem of building a network infrastructure that has encouraged hundreds of millions of authors to share a trillion pages of content. We've solved the technological problem of aggregating and indexing all this content. But we're left with a scientific problem of interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it. The semantic interpretation problem remains regardless of whether or not we're using a Semantic Web framework. The same meaning can be expressed in many different ways, and the same expression can express many different meanings.
For example, a table of company information might be expressed in ad hoc HTML with column headers called "Company," "Location," and so on. Or it could be expressed in a Semantic Web format, with standard identifiers for "Company Name" and "Location," using the Dublin Core Metadata Initiative point-encoding scheme. But even if we have a formal Semantic Web "Company Name" attribute, we can't expect to have an ontology for every possible value of this attribute. For example, we can't know for sure what company the string "Joe's Pizza" refers to, because hundreds of businesses have that name and new ones are being added all the time. We also can't always tell which business is meant by the string "HP." It could refer to Helmerich & Payne Corp. when the column is populated by stock ticker symbols but probably refers to Hewlett-Packard when the column is populated by names of large technology companies. The problem of semantic interpretation remains; using a Semantic Web formalism just means that semantic interpretation must be done on shorter strings that fall between angle brackets.
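The HP example suggests how context can drive interpretation. A toy sketch of such a resolver follows; the lookup tables and the ticker heuristic are invented for illustration and are not how any production system actually works:

```python
# Toy sketch: disambiguate the string "HP" using the other values
# in its column. The lookup tables and the heuristic below are
# illustrative assumptions, not real reference data.

TICKERS = {"HP": "Helmerich & Payne", "IBM": "International Business Machines"}
COMPANY_NAMES = {"HP": "Hewlett-Packard", "IBM": "IBM Corp."}

def looks_like_ticker(value: str) -> bool:
    """Heuristic: tickers are short, all-uppercase alphabetic strings."""
    return value.isalpha() and value.isupper() and len(value) <= 5

def interpret(value: str, column_values: list[str]) -> str:
    """Pick a reading for `value` based on what dominates its column."""
    others = [v for v in column_values if v != value]
    ticker_like = sum(looks_like_ticker(v) for v in others)
    if others and ticker_like / len(others) > 0.5:
        return TICKERS.get(value, value)
    return COMPANY_NAMES.get(value, value)

# A column of ticker symbols vs. a column of company names:
print(interpret("HP", ["HP", "IBM", "MSFT"]))           # Helmerich & Payne
print(interpret("HP", ["HP", "Apple Inc.", "Google"]))  # Hewlett-Packard
```

The point is not the particular heuristic but that the decision depends on the surrounding column, not on the string "HP" alone.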
What we need are methods to infer relationships between column headers or mentions of entities in the world. These inferences may be incorrect at times, but if they're done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data. Interestingly, here too Web-scale data might be an important part of the solution. The Web contains hundreds of millions of independently created tables, and possibly a similar number of lists that can be transformed into tables.14 These tables represent structured data in myriad domains. They also represent how different people organize data: the choices they make for which columns to include and the names given to the columns. The tables also provide a rich collection of column values, and of values that their creators decided belong in the same column of a table. We've never before had such a vast collection of tables (and their schemata) at our disposal to help us resolve semantic heterogeneity. Using such a corpus, we hope to be able to accomplish tasks such as deciding when "Company" and "Company Name" are synonyms, deciding when "HP" means Helmerich & Payne or Hewlett-Packard, and determining that an object with attributes "passengers" and "cruising altitude" is probably an aircraft.
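The last of these tasks, guessing a class from its attributes, can be sketched as a nearest-profile lookup. The class profiles below are invented for illustration; a real system would derive them from millions of Web tables:

```python
# Sketch: guess what kind of object a schema describes by comparing
# its attribute set against attribute profiles of known classes.
# The profiles here are made up for illustration.

CLASS_ATTRIBUTES = {
    "aircraft": {"passengers", "cruising altitude", "wingspan", "range"},
    "company": {"company name", "location", "ceo", "stock price"},
}

def jaccard(a: set, b: set) -> float:
    """Set overlap: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def guess_class(attributes: set) -> str:
    """Return the known class whose attribute profile overlaps most."""
    return max(CLASS_ATTRIBUTES,
               key=lambda c: jaccard(attributes, CLASS_ATTRIBUTES[c]))

print(guess_class({"passengers", "cruising altitude"}))  # aircraft
```

An object with "passengers" and "cruising altitude" overlaps the aircraft profile and nothing in the company profile, so aircraft wins.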
How can we use such a corpus of tables? Suppose we want to find synonyms for attribute names: for example, when "Company Name" could be equivalent to "Company," or "price" could be equivalent to "discount." Such synonyms differ from those in a thesaurus because here they are highly context dependent (both in tables and in natural language). Given the corpus, we can extract a set of schemata from the tables' column labels; for example, researchers reliably extracted 2.5 million distinct schemata from a collection of 150 million tables, not all of which had schemata.
We can now examine the co-occurrences of attribute names in these schemata. If we see a pair of attributes A and B that rarely occur together but always occur with the same other attribute names, this might mean that A and B are synonyms. We can further justify this hypothesis if we see that their data elements have a significant overlap or are of the same data type. Similarly, we can also offer a schema autocomplete feature for database designers. For example, by analyzing such a large corpus of schemata, we can discover that schemata that have the attributes Make and Model also tend to have the attributes Year, Color, and Mileage. Providing such feedback to schema creators can save them time, but it can also help them use more common attribute names, thereby decreasing a possible
source of heterogeneity in Web-based data. Of course, we'll find immense opportunities to create interesting data sets if we can automatically combine data from multiple tables in this collection. This is an area of active research.
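The co-occurrence ideas above can be sketched on a toy schema corpus. Everything here is invented for illustration (four schemata instead of 2.5 million, and a deliberately crude score); it only shows the shape of the computation:

```python
from collections import Counter
from itertools import combinations

# A tiny, invented corpus of table schemata (sets of column labels).
schemata = [
    {"company", "location", "ceo"},
    {"company name", "location", "ceo"},
    {"company", "location", "revenue"},
    {"company name", "location", "revenue"},
]

cooccur = Counter()   # how often two attributes appear in the same schema
contexts = {}         # attribute -> multiset of attributes seen alongside it
for schema in schemata:
    for a, b in combinations(sorted(schema), 2):
        cooccur[(a, b)] += 1
    for attr in schema:
        contexts.setdefault(attr, Counter()).update(schema - {attr})

def synonym_score(a: str, b: str) -> float:
    """High when a and b never co-occur yet share surrounding attributes."""
    if cooccur[tuple(sorted((a, b)))] > 0:
        return 0.0  # they appear together, so they are not synonyms
    shared = sum((contexts[a] & contexts[b]).values())
    total = sum((contexts[a] | contexts[b]).values())
    return shared / total if total else 0.0

def autocomplete(partial: set) -> Counter:
    """Suggest attributes that co-occur with every attribute in `partial`."""
    suggestions = Counter()
    for schema in schemata:
        if partial <= schema:
            suggestions.update(schema - partial)
    return suggestions

print(synonym_score("company", "company name"))  # 1.0: identical contexts
print(synonym_score("company", "ceo"))           # 0.0: they co-occur
```

A real system would add the data-overlap and data-type checks described above before calling two attributes synonyms.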
Another opportunity is to combine data from multiple tables with data from other sources, such as unstructured Web pages or Web search queries. For example, Marius Paşca considered the task of identifying attributes of classes. That is, his system first identifies classes such as "Company," then finds examples such as "Adobe Systems," "Macromedia," "Apple Computer," "Target," and so on, and finally identifies class attributes such as "location," "CEO," "headquarters," "stock price," and "company profile." Michael Cafarella and his colleagues showed this can be gleaned from tables, but Paşca showed it can also be extracted from plain text on Web pages and from user queries in search logs. That is, from the user query "Apple Computer stock price" and from the other information we know about existing classes and attributes, we can confirm that "stock price" is an attribute of the "Company" class. Moreover, the technique works not just for a few dozen of the most popular classes but for thousands of classes and tens of thousands of attributes, including classes like "Aircraft Model," which has attributes "weight," "length," "fuel consumption," "interior photos," "specifications," and "seating arrangement." Paşca shows that including query logs can lead to excellent performance, with 90 percent precision over the top 10 attributes per class.
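The core extraction step can be sketched as follows. The instance list and queries are invented, and this captures only the simplest pattern (instance name followed by a candidate attribute), not Paşca's full ranking machinery:

```python
from collections import Counter

# Invented instance list and query log for illustration only.
COMPANY_INSTANCES = ["apple computer", "adobe systems", "target"]

queries = [
    "apple computer stock price",
    "adobe systems stock price",
    "apple computer ceo",
    "target headquarters",
    "weather in paris",
]

def harvest_attributes(queries, instances) -> Counter:
    """Count candidate class attributes: the suffix of any query that
    begins with a known instance of the class."""
    counts = Counter()
    for q in queries:
        q = q.lower().strip()
        for inst in instances:
            if q.startswith(inst + " "):
                counts[q[len(inst) + 1:]] += 1
    return counts

print(harvest_attributes(queries, COMPANY_INSTANCES).most_common(1))
# [('stock price', 2)]
```

Attributes asked about for many different instances ("stock price" here) rise to the top, while noise queries contribute nothing.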
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
References
1. E. Wigner, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," Comm. Pure and Applied Mathematics, vol. 13, no. 1, 1960, pp. 1–14.
2. R. Quirk et al., A Comprehensive Grammar of the English Language, Longman, 1985.
3. H. Kucera, W.N. Francis, and J.B. Carroll, Computational Analysis of Present-Day American English, Brown Univ. Press, 1967.
4. T. Brants and A. Franz, Web 1T 5-Gram Version 1, Linguistic Data Consortium, 2006.
5. S. Riezler, Y. Liu, and A. Vasserman, "Translating Queries into Snippets for Improved Query Expansion," Proc. 22nd Int'l Conf. Computational Linguistics (Coling 08), Assoc. for Computational Linguistics, 2008, pp. 737–744.
6. P.P. Talukdar et al., "Learning to Create Data-Integrating Queries," Proc. 34th Int'l Conf. Very Large Databases (VLDB 08), Very Large Database Endowment, 2008, pp. 785–796.
7. J. Hays and A.A. Efros, "Scene Completion Using Millions of Photographs," Comm. ACM, vol. 51, no. 10, 2008.
8. L. Getoor and B. Taskar, Introduction to Statistical Relational Learning, MIT Press, 2007.
9. B. Taskar et al., "Max-Margin Parsing," Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP 04), Assoc. for Computational Linguistics, 2004, pp. 1–8.
10. S. Schoenmackers, O. Etzioni, and D.S. Weld, "Scaling Textual Inference to the Web," Proc. 2008 Conf. Empirical Methods in Natural Language Processing (EMNLP 08), Assoc. for Computational Linguistics, 2008, pp. 79–88.
11. T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific Am., 17 May 2001.
12. P. Friedland et al., "Towards a Quantitative, Platform-Independent Analysis of Knowledge Systems," Proc. Int'l Conf. Principles of Knowledge Representation, AAAI Press, 2004, pp. 507–514.
13. "Interview of Tom Gruber," AIS SIGSEMIS Bull., vol. 1, no. 3, 2004.
14. M.J. Cafarella et al., "WebTables: Exploring the Power of Tables on the Web," Proc. Very Large Data Base Endowment (VLDB 08), ACM Press, 2008.
15. M. Paşca, "Organizing and Searching the World Wide Web of Facts. Step Two: Harnessing the Wisdom of the Crowds," Proc. 16th Int'l World Wide Web Conf., ACM Press, 2007, pp. 101–110.
Alon Halevy is a research scientist at Google.
Peter Norvig is a research director at Google.
Fernando Pereira is a research director at Google.
Choose a representation
that can use unsupervised
learning on unlabeled
data, which is so
much more plentiful
than labeled data.