[repost] Building a Speaker Recognition Platform with ALIZE and Related Tools


A while ago, several people asked me how to implement speaker recognition with ALIZE. I was rushing a paper before the winter break, so I had no time for a detailed answer, let alone a demo. Shortly after the new term began I wrote a demo and uploaded it to GitHub: VoicePrintReco-branch-master.



1. Feature extraction


2. Silence detection and removal

NormFeat.exe — first apply energy normalization
EnergyDetector.exe — then remove silence based on energy detection
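To give a rough idea of what energy-based silence removal does, here is a minimal pure-Python sketch. The frame length and the threshold rule (mean plus a fraction of the dynamic range) are hypothetical placeholders, not EnergyDetector's actual algorithm or configuration:

```python
import math

def frame_energies(samples, frame_len=160):
    """Split a flat list of samples into frames and return per-frame log energies."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return energies

def speech_frame_indices(energies, alpha=0.5):
    """Keep frames whose log energy exceeds mean + alpha * (max - mean)."""
    mean_e = sum(energies) / len(energies)
    threshold = mean_e + alpha * (max(energies) - mean_e)
    return [i for i, e in enumerate(energies) if e > threshold]
```

For example, a signal made of one silent frame followed by one loud frame keeps only the second frame's index.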

3. Feature normalization

NormFeat.exe — then use this tool again to normalize the features
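Feature normalization of this kind is typically per-dimension mean/variance normalization (CMVN). A minimal sketch of the idea, not NormFeat's actual options or file formats:

```python
import math

def cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of equal-length feature vectors (lists of floats).
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Guard against zero variance: fall back to 1.0 so we divide safely.
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dim)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]
```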

4. World model training

TrainWorld.exe — train the UBM (universal background model)

5. Target model training

TrainTarget.exe — on top of the trained UBM, train GMMs for the training set and the testing set


6. Testing

ComputeTest.exe — test and score the testing-set GMMs against the training-set GMMs

7. Score normalization

ComputeNorm.exe — normalize the scores
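One common score-normalization scheme of this kind is Z-norm, which shifts and scales each raw score by impostor-score statistics estimated for the target model. A minimal sketch of the idea (the function name and inputs are illustrative, not ALIZE's interface):

```python
import math

def znorm(raw_score, impostor_scores):
    """Return (raw_score - mu_imp) / sigma_imp for one target model."""
    n = len(impostor_scores)
    mu = sum(impostor_scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in impostor_scores) / n)
    return (raw_score - mu) / sigma
```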

8. Compute the EER (equal error rate)

You can look up Matlab code for computing the EER; DETware_v2.1.tar.gz can be downloaded from the NIST SRE website.
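If you prefer not to use Matlab, the EER can also be approximated in a few lines of Python by sweeping a decision threshold over the scores until the false-acceptance and false-rejection rates meet. This is a simple sketch of the idea, not the DETware implementation:

```python
def compute_eer(target_scores, impostor_scores):
    """Approximate the equal error rate for two lists of scores."""
    thresholds = sorted(set(target_scores + impostor_scores))
    best = None
    for t in thresholds:
        # FAR: impostors accepted; FRR: targets rejected at threshold t.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```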


As for the parameters used in each step, you can run "tool -help" on the command line to see what each parameter of a tool means. You can also consult the examples provided in the test directory of each tool in the ALIZE source code. For what each tool does and the theory behind it, refer to the relevant papers.

FAQ: Frequently asked questions – by ALIZE

For further questions, please post to the Google Group so everyone can discuss them together!

You can also join QQ group 279644057 for real-time discussion. When requesting to join, please state your school and name, to keep advertisers out.


[1] ALIZE – User Manual: userguide_alize.001.pdf
[2] LIA_SPKDET Package documentation: userguide_LIA_SpkDet.002.pdf
[3] Reference System based on speech modality ALIZE/LIA RAL
[4] Jean-Francois Bonastre, etc. ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition
[5] Tommie Gannert. A Speaker Verification System Under The Scope: Alize
[6] Alize Wiki

Original Link: http://ibillxia.github.io/blog/2013/04/26/building-speaker-recognition-system-using-alize-etc/
Attribution – NON-Commercial – ShareAlike – Copyright © Bill Xia

[repost] Deep Learning Extensions on Kaldi (Kaldi+PDNN)


I stumbled upon this last night. Thanks to the contribution of Dr. Yajie Miao of CMU, we can now use deep learning modules with Kaldi. I have not yet succeeded in using DBNs with HTK, and I hope to get that working soon. Below is the introduction to Kaldi+PDNN from Dr. Miao's homepage. I hope everyone will contribute their own efforts as well, so that we students can learn more.

Kaldi+PDNN — Implementing DNN-based ASR Systems with Kaldi and PDNN
Kaldi+PDNN contains a set of fully-fledged Kaldi ASR recipes, which realize DNN-based acoustic modeling using the PDNN toolkit. The overall pipeline has 3 stages: 
1. The initial GMM model is built with the existing Kaldi recipes
2. DNN acoustic models are trained by PDNN
3. The trained DNN model is ported back to Kaldi for hybrid decoding or further tandem system building

Model diversity. Deep Neural Networks (DNNs); Deep Bottleneck Features (DBNFs); Deep Convolutional Networks (DCNs)
PDNN toolkit. Easy and fast to implement new DNN ideas
Open license. All the code is released under Apache 2.0, the same license as Kaldi
Consistency with Kaldi. Recipes follow the Kaldi style and can be integrated seamlessly with the existing setups
Release Log
Dec 2013  —  version 1.0 (the initial release)
Feb 2014  —  version 1.1 (clean up the scripts, add the dnn+fbank recipe run-dnn-fbank.sh, enrich PDNN)
Prerequisites
1. A GPU card should be available on your computing machine.
2. Initial model building should be run, ideally up to train_sat and align_fmllr.
3. Software requirements:
Theano. For information about Theano installation on Ubuntu Linux, refer to this document edited by Wonkyum Lee from CMU.
pfile_utils. This script (that is, kaldi-trunk/tools/install_pfile_utils.sh) installs pfile_utils automatically.
Kaldi+PDNN is hosted on Sourceforge. You can enter your Kaldi Switchboard setup (such as egs/swbd/s5b) and download the latest version via svn:

svn co svn://svn.code.sf.net/p/kaldipdnn/code-0/trunk/pdnn pdnn
svn co svn://svn.code.sf.net/p/kaldipdnn/code-0/trunk/steps_pdnn steps_pdnn
svn co svn://svn.code.sf.net/p/kaldipdnn/code-0/trunk/run_swbd run_swbd
ln -s run_swbd/* ./

Now the new run-*.sh scripts appear in your setup. You can run them directly.


run-dnn.sh DNN hybrid system over fMLLR features
Targets: context-dependent states from the SAT model exp/tri4a
Input: spliced fMLLR features
Network:  360:1024:1024:1024:1024:1024:${target_num}
Pretraining: pre-training with stacked denoising autoencoders
run-dnn-fbank.sh DNN hybrid system over filterbank features
Targets: context-dependent states from the SAT model exp/tri4a
Input: spliced log-scale filterbank features with cepstral mean and variance normalization
Network:  330:1024:1024:1024:1024:1024:${target_num}
Pretraining: pre-training with stacked denoising autoencoders
run-bnf-tandem.sh GMM Tandem system over Deep Bottleneck features   [ reference paper ]
Targets: BNF network training uses context-dependent states from the SAT model exp/tri4a
Input: spliced fMLLR features
BNF Network: 360:1024:1024:1024:1024:42:1024:${target_num}
Pretraining: pre-training the prior-to-bottleneck layers (360:1024:1024:1024:1024) with stacked denoising autoencoders
run-bnf-dnn.sh DNN hybrid system over Deep Bottleneck features   [ reference paper ]
BNF network: trained in the same manner as in run-bnf-tandem.sh
Hybrid input: spliced BNF features
Hybrid network: 378:1024:1024:1024:1024:${target_num}
Pretraining: pre-training with stacked denoising autoencoders
run-cnn.sh Hybrid system based on deep convolutional networks (DCNs)  [ reference paper ]
The CNN recipe is not stable and needs more investigation.
Targets: context-dependent states from the SAT model exp/tri4a
Input: spliced log-scale filterbank features with cepstral mean and variance normalization; each frame is taken as an input feature map
Network: two convolution layers followed by three fully-connected layers. See this page for how to configure the network structure.
Pretraining: no pre-training is performed for DCNs
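The Network lines above use the convention input:hidden:...:output for layer sizes. As a toy illustration of that notation only (random weights and tanh activations chosen arbitrarily; this has nothing to do with PDNN's actual training), here is a scaled-down forward pass:

```python
import math
import random

def build_mlp(layer_sizes, seed=0):
    """Create one random weight matrix per consecutive pair of layer sizes."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

def forward(weights, x):
    """Propagate an input vector through each layer with a tanh nonlinearity."""
    for w in weights:
        x = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return x

# A miniature "3:4:2" stack, standing in for e.g. 360:1024:...:${target_num}.
net = build_mlp([3, 4, 2])
outputs = forward(net, [1.0, 0.0, -1.0])
```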

Experiments & Results
The recipes are developed based on the Kaldi 110-hour Switchboard setup. This is the standard system you get if you run egs/swbd/s5b/run.sh. Our experiments follow similar configurations as described in this paper. We have the following data partitions. The “validation” set is used to measure frame accuracy and determine termination in DNN fine-tuning.
training — train_100k_nodup (110 hours)         validation — train_dev_nodup        testing — eval2000 (HUB5’00)

WER% on HUB5’00-SWB    WER% on HUB5’00
21.4                   28.4

Our hybrid recipe run-dnn.sh gives WER comparable with this paper (Table 5 for fMLLR features). We are confident that our recipes perform comparably with the Kaldi internal DNN setups.

Want to Contribute?

We look forward to your contributions. Improvement can be made on the following aspects (but not limited to):
1. Optimization to the above recipes
2. New recipes
3. Porting the recipes to other datasets
4. Experiments and results
5. Contributions to the PDNN toolkit

Contact Yajie Miao (ymiao@cs.cmu.edu) if you have any questions or suggestions.




[repost] A Brief Introduction to and Comparison of Several Common Speech Interaction Platforms



I recently worked on two projects related to speech recognition. Although the main task in both was speech recognition, or more precisely keyword spotting, they targeted different platforms: one was for Windows and the other for Android, so I chose different speech recognition engines: Microsoft's Speech API for the former, and CMU's PocketSphinx for the latter. This article briefly introduces and compares some common speech interaction platforms.

"Speech interaction" here covers both speech recognition (SR, also called Automatic Speech Recognition, ASR) and speech synthesis (SS, also called Text-To-Speech, TTS). Voiceprint recognition (Voice Print Recognition, VPR) will also be mentioned.

Speech recognition technology turns the speech signals a computer receives into corresponding text or commands through recognition and understanding. It is an interdisciplinary field involving phonetics and linguistics, signal processing, pattern recognition, probability and information theory, speech production and auditory mechanisms, and artificial intelligence. With the help of a speech recognition system, even users who do not understand computers or cannot use them can operate a computer by voice.

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary text into standard, fluent speech in real time, in effect giving a machine an artificial mouth. It involves acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a cutting-edge technology in Chinese information processing. The main problem it solves is how to convert textual information into audible sound, i.e., how to make machines speak like humans.



1) Microsoft Speech API

Microsoft's Speech API (SAPI) is an application programming interface that includes speech recognition (SR) and speech synthesis (SS) engines, and it is widely used on Windows. Microsoft has released multiple SAPI versions (the latest being SAPI 5.4), shipped either as part of a Speech SDK or bundled directly with the Windows operating system. SAPI supports recognition and synthesis in several languages, including English, Chinese, and Japanese. SAPI versions fall into two families: versions 1 through 4 form one family, similar to each other with only incremental new features; the second family is SAPI 5, a completely new series quite different from the first four versions.

The earliest version, SAPI 1.0, was released in 1995 and supported Windows 95 and Windows NT 3.51. It included fairly rudimentary APIs for direct speech recognition and direct speech synthesis, with which applications could control the recognition or synthesis engine directly, plus simplified higher-level APIs for voice commands and voice dialogue. SAPI 3.0 was released in 1997, adding dictation speech recognition (non-continuous speech recognition) and some example applications. In 1998 Microsoft released SAPI 4.0; this version not only contained the core COM API, wrapped in C++ classes to make programming in C++ easier, but also ActiveX controls that could be dragged and dropped in VB. Its SS engine shipped with Windows 2000, while the SR and SS engines together were also released in SDK form.

SAPI 5.0 was released in 2000. The new version embodied more thoroughly the principle of strictly separating applications from engines: all calls are made by dynamically invoking sapi.dll. The aim was to make the API more engine-independent, preventing applications from depending on an engine with particular features; the change was also intended to make application development easier by moving some configuration and initialization code to runtime.

2) IBM ViaVoice

IBM was among the earliest institutions to research speech recognition, beginning as early as the late 1950s, with computers designed to detect specific language patterns and derive statistical correlations between sounds and the corresponding text. At the 1964 World's Fair, IBM showed the world its "Shoebox" digital speech recognizer. In 1984, IBM released a speech recognition system that achieved 95% accuracy on a 5,000-word vocabulary.

In 1992, IBM introduced its first dictation system, called the "IBM Speech Server Series (ISSS)". A new version of the dictation system, "VoiceType 3.0", was released in 1996; it was the prototype of ViaVoice. This version required no training and could handle isolated-word dictation and continuous-command recognition. VoiceType 3.0 supported Windows 95 and was integrated into the OS/2 WARP system. At the same time, IBM released the world's first continuous dictation system, "MedSpeak Radiology". Finally, just in time for the holiday shopping season, IBM released the affordable, practical "VoiceType Simply Speaking" system, the world's first consumer dictation product.

In 1999, IBM released a free version of VoiceType. In 2003, IBM granted ScanSoft, owner of the competing product "Dragon NaturallySpeaking", exclusive worldwide distribution rights to ViaVoice-based desktop products. Two years later, ScanSoft merged with Nuance, and the company was officially renamed Nuance Communications, Inc. Download links for the IBM ViaVoice SDK are now hard to find; it has faded from view, replaced by Nuance.


3) Nuance

Nuance Communications is a multinational computer software technology company headquartered in Burlington, Massachusetts, USA, providing speech and imaging solutions and applications. Its business currently focuses on server-side and embedded speech recognition, telephone call-steering systems, automated telephone directory services, medical transcription software and systems, optical character recognition software, and desktop imaging software.

Beyond speech recognition, Nuance's speech technology also includes speech synthesis, voiceprint recognition, and more. In the worldwide speech technology market, more than 80% of speech recognition uses Nuance's recognition engine technology. Nuance holds more than 1,000 patents, and its speech products support more than 50 languages, with over two billion users worldwide. Reportedly, the Siri speech recognition on Apple's iPhone 4S uses Nuance's speech recognition service. In addition, Nuance has announced that its automotive-grade Dragon Drive will provide a hands-free communication interface in the new Audi A3, enabling voice-driven retrieval and delivery of information.

The Nuance Voice Platform (NVP) is Nuance's voice web platform. It consists of three functional blocks: the Nuance Conversation Server, the Nuance Application Environment (NAE), and the Nuance Management Station. The Conversation Server includes a VoiceXML interpreter integrated with Nuance's speech recognition module, a text-to-speech (TTS) engine, and voiceprint verification software. The NAE includes graphical development tools that make designing a voice application as convenient as designing an application framework. The Management Station provides powerful system management and analysis capabilities designed to meet the unique needs of voice services.


4) iFlytek (科大讯飞)

Everyone in China is familiar with iFlytek, whose full name is Anhui USTC iFlytek Co., Ltd. Its predecessor, Anhui USTC iFlytek Information Technology Co., Ltd., was founded in December 1999 and restructured into the current company in 2007. It specializes in intelligent speech and language technology research, software and chip product development, and speech information services; it is the clear leader in the Chinese speech technology field and has considerable influence worldwide.

As China's largest intelligent speech technology provider, iFlytek has accumulated long-term research in the field and holds internationally leading results in Chinese speech synthesis, speech recognition, spoken-language assessment, and more. In 2003, iFlytek received the only State Science and Technology Progress Award (second class) in the Chinese speech industry to date, and in 2005 the top honor for independent innovation in the Chinese information industry, the "Major Technological Invention Award of the Information Industry". From 2006 to 2011, it won first place in six consecutive Blizzard Challenge international English speech synthesis contests. In 2008 it won the international speaker recognition evaluation (NIST SRE 2008), and in 2009 it took first place on the high-difficulty confusable-dialect test and second place on the general test of the NIST 2009 language identification evaluation.

iFlytek provides a comprehensive speech interaction platform covering speech recognition, speech synthesis, voiceprint recognition, and more. With its independently owned intelligent speech technology, iFlytek has launched products ranging from large carrier-grade applications to small embedded applications, from industries such as telecom and finance to enterprise and home users, and from PCs to phones to MP3/MP4/PMP players and toys, meeting the needs of many application environments. iFlytek holds more than 60% of the Chinese speech technology market, and over 70% of the speech synthesis product market.


Other influential commercial speech interaction platforms include Google Voice Search and the voice input methods from Baidu and Sogou. Compared with the four platforms above, these have relatively narrow application scopes and less influence, so they are not covered in detail here.



CMU Sphinx, often simply called Sphinx, is an open-source speech recognition system developed at Carnegie Mellon University (CMU). It includes a series of speech recognizers and acoustic model training tools.

Sphinx has several versions: Sphinx 1 through 3 are written in C, Sphinx-4 in Java, and there is also PocketSphinx, a slimmed-down, optimized version for embedded devices. Sphinx-I was developed by Kai-Fu Lee around 1987, using fixed HMMs (with three codebooks of size 256); it was billed as the first high-performance continuous speech recognition system (achieving over 90% accuracy on the Resource Management database). Sphinx-II was developed by Xuedong Huang around 1992, using semi-continuous HMMs with a five-state topology, an N-gram language model, and a fast lextree real-time decoder; its recognition accuracy on the WSJ dataset also reached over 90%.

Sphinx-III was developed mainly by Eric Thayer and Mosur Ravishankar around 1996. It uses fully continuous HMMs (semi-continuous ones are also supported), with flexible feature vectors and flexible HMM topologies, and includes two optional decoders: the slower flat search and the faster lextree search. This version achieved a WER (word error rate) of 19% on the BN (1998 evaluation) dataset. The initial Sphinx-III release had many limitations, such as supporting only triphones, supporting only N-gram language models (no CFG/FSA/SCFG), using the same HMM topology for all sound units, and using uniform acoustic models. The latest Sphinx-III release, version 0.8 from early 2009, improves considerably on these fronts.

 Pocketsphinx — recognizer library written in C.
 Sphinxbase — support library required by Pocketsphinx
 Sphinx4 — adjustable, modifiable recognizer written in Java
 CMUclmtk — language model tools
 Sphinxtrain — acoustic model training tools


HTK is short for Hidden Markov Model Toolkit. HTK is mainly used for speech recognition research, but it has since been applied to many other areas, including speech synthesis, character recognition, and DNA sequencing.

HTK was originally developed in 1989 by the Machine Intelligence Laboratory (formerly the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED), where it was used to build CUED's large-vocabulary speech recognition systems. In 1993, Entropic Research Laboratory Inc. acquired the rights to sell HTK, and in 1995 transferred them entirely to the newly founded Entropic Cambridge Research Laboratory Ltd. Entropic sold HTK until Microsoft acquired Entropic in 1999; Microsoft then licensed the HTK copyright back to CUED and provided support, so CUED could re-release HTK and offer development support on the web.



Julius is a high-performance, two-pass, large-vocabulary continuous speech recognition (LVCSR) open-source project, suited to researchers and developers alike. Using 3-gram language models and context-dependent HMMs, it can perform real-time speech recognition on a current PC with a vocabulary of 60k words.

Julius integrates the major search algorithms, and its high modularity keeps its structural model independent. It supports multiple HMM types (such as shared-state triphones and tied-mixture models), multiple microphone channels, and various combinations of models and structures. It adopts standard formats, which makes interoperating with other toolkits easier. Its main supported platforms are Linux and other Unix-like systems, and it also runs on Windows. It is open source, released under a BSD license.

Since 1997, Julius has been developed as part of a free software toolkit for Japanese LVCSR research, and in 2000 the work was taken over by the Continuous Speech Recognition Consortium (CSRC) of Japan. From version 3.4, a grammar-based recognition parser named "Julian" was introduced. Julian is a variant of Julius that uses hand-crafted DFAs as its language model; it can be used to build small-vocabulary command recognition systems or spoken dialogue systems.


RWTH ASR is a toolbox containing implementations of state-of-the-art automatic speech recognition algorithms. It is developed by the Human Language Technology and Pattern Recognition Group at RWTH Aachen University.

The RWTH ASR toolbox includes key components such as acoustic model construction and a decoder, as well as components for speaker adaptation, speaker-adaptive training, unsupervised training, personalized training, and word-root processing. It supports operating systems such as Linux and Mac OS, and its project website offers fairly complete documentation and examples, along with ready-made models for research purposes.



The open-source toolkits mentioned above are mainly for speech recognition; other open-source speech recognition projects include Kaldi, Simon, iATROS-speech, SHoUT, and Zanzibar OpenIVR.

Common open-source toolkits for speech synthesis include MARY, SpeakRight, Festival, FreeTTS, Festvox, eSpeak, and Flite.



This article has introduced several common speech interaction platforms, mainly software and toolkits for speech recognition and speech synthesis, and has also briefly touched on voiceprint recognition.



[3] Microsoft Speech API:http://en.wikipedia.org/wiki/Speech_Application_Programming_Interface#SAPI_1
[4] MSDN-SAPI:http://msdn.microsoft.com/zh-cn/library/ms723627.aspx
[5] Microsoft speech technology, an introduction to Windows speech programming: http://blog.csdn.net/yincheng01/article/details/3511525
[6]IBM Human Language Technologies History:http://www.research.ibm.com/hlt/html/history.html
[7] Nuance: http://en.wikipedia.org/wiki/Nuance_Communications
[8] iFlytek: http://baike.baidu.com/view/362434.htm
[9] CMU-Sphinx: http://en.wikipedia.org/wiki/CMU_Sphinx
[10] CMU Sphinx homepage:http://cmusphinx.sourceforge.net/wiki/
[11] HTK Toolkit:http://htk.eng.cam.ac.uk/
[12] Julius:http://en.wikipedia.org/wiki/Julius_(software)
[13] RWTH ASR:http://en.wikipedia.org/wiki/RWTH_ASR
[14] List of speech recognition software: http://en.wikipedia.org/wiki/List_of_speech_recognition_software
[15] Speech recognition: http://en.wikipedia.org/wiki/Speech_recognition
[16] Speech synthesis: http://en.wikipedia.org/wiki/Speech_synthesis
[17] Speaker recognition: http://en.wikipedia.org/wiki/Speaker_recognition

Original Link: http://ibillxia.github.io/blog/2012/11/24/several-plantforms-on-audio-and-speech-signal-processing/
Attribution – NON-Commercial – ShareAlike – Copyright © Bill Xia

[repost ]Introduction to Integrating Watson Developer Cloud with Watson Explorer


IBM Watson Explorer combines search and content analytics with unique cognitive computing capabilities offered by the Watson Developer Cloud to help users find and understand the information they need to work more efficiently and make better, more confident decisions. Watson Explorer Application Builder is the delivery tool that allows developers to quickly construct a 360-degree view combining data and analytics from many sources into a single view. These applications can be enhanced using content from external sources, external visualization libraries (such as D3.js), and external APIs. Integrating with the Watson Developer Cloud provides opportunities for further enhancing Watson Explorer applications to include cognitive-based features. Watson Developer Cloud applications can be integrated with Watson Explorer in a number of ways depending on the use cases and desired functionality.

In this set of examples, we introduce the basics for integrating a Watson Explorer application with applications deployed to the Watson Developer Cloud emphasizing some of the Watson Cognitive services. The examples provided are basic technical proofs of concept; we give you the technical foundation you need to build truly awesome cognitive applications. In each example we walk you through the process of deploying an application to the Watson Developer Cloud. We show you how to integrate that application with Watson Explorer. We then provide you with some food for thought — What should you think about when deploying this kind of integration into a production environment? What are some additional ideas for integration?

By the end of each example you should understand what each service does, how it could benefit your organization, and how to integrate it with an Application Builder application.

Before beginning the tutorials you should review the prerequisites provided below. For more information on the available Watson Developer Cloud cognitive services, please visit the services catalog.

Tutorial Listing

  1. Message Resonance Integration
  2. Machine Translation Integration
  3. Question and Answer Integration
  4. Relationship Extraction Integration
  5. User Modeling Integration
  6. Concept Expansion Integration


The integration between Watson Developer Cloud (WDC) and Watson Explorer follows relatively straightforward web services patterns. All of the WDC services use a basic REST API. This makes it relatively easy to use WDC services from WDC applications. The example WDC applications here also use a simple REST API to facilitate communication between the WDC application and Watson Explorer. Communication between a Watson Explorer application and a deployed WDC application is accomplished in two ways.

  1. Watson Explorer Engine can communicate with WDC applications via a parser node (parser nodes in Engine allow for advanced and basic web requests to be made).
  2. Watson Explorer Application Builder widgets communicate with WDC applications by going through a proxy deployed to the same web server as Application Builder.

The sample proxy enables two important properties. First, browsers enforce a same-origin policy for web requests made from JavaScript, so to allow effective asynchronous user interactions from a client browser (via Ajax), a URL from the same domain must be available. Rather than modifying the Application Builder core, the proxy allows you to effectively create your own API for Ajax calls. This same proxy can also be used directly by Application Builder widgets to improve maintainability. In this capacity, the proxy creates an abstraction on top of WDC applications to buffer Application Builder widgets from WDC endpoint changes and better promote testing.

There are five basic integration patterns for combining Watson Explorer and Watson Developer Cloud. The specific integration pattern used will depend on the use cases and desired functionality.

  1. Application Builder Widget. The most common place to use a cognitive service is from within an Application Builder widget. Most examples here demonstrate this.
  2. In the client browser. Once a page is rendered in a user’s browser there may be use cases in which you would want to allow a user to interact with a WDC application without refreshing the page. For example, a user might dialog with Watson Q&A from an entity page. At this time the included proxy must be used to satisfy the end-user’s browser same-origin policy.
  3. At crawl time. The Relationship Extraction Integration provides an example of an Engine converter that indexes the data returned from the Relationship Extraction service.
  4. At query time. It is also possible to access WDC applications at query time from Engine.
  5. Pre- or post-process. In some cases it is useful to use a WDC application as a pre- or post-processing step and the output of this is used by the Watson Explorer application in some way.

Overview of the two options for integration

The integrations were developed using two runtimes on Bluemix: Java web services running on WebSphere Liberty Profile, and Ruby Sinatra. The following sections detail the setup for each of these approaches.

Setup for IBM Bluemix Development

Bluemix provides multiple different runtimes for your cloud-based application. In order to gain access to the Bluemix environment you will need to register for an account. After registering and configuring the Cloud Foundry tools, you can set up your Java or Ruby development environment by following the instructions provided below.

Steps for Bluemix setup:

  1. Register for an account on Bluemix
  2. Install the Cloud Foundry command line tool

Java Web-Based Applications

Some examples, like the Question and Answer Service and the Machine Translation Service, are Java-based Bluemix applications. The following steps get you set up for developing these applications.

Required development tools

  • A JDK is required to compile the Java code. Download and install IBM JDK 1.7
  • We use Ant to build the package. Download and install Apache Ant 1.9.4

Optional development tools

The following tools and plug-ins can make testing and deployments easier.

Required Libraries

The packages include the jars that are required in the dep-jar folder. These jars are derived from the following development libraries.

Ruby Sinatra Web-Based Applications

Some examples, like the User Modeling Service and the Concept Expansion Service, are Ruby Sinatra-based Bluemix applications. The following steps get you set up for developing these applications.

Required development tools

  • Ruby is required to compile the code. Download and install Ruby 1.9.3
  • Ruby DevKit is useful for development on Windows. Download and install Ruby DevKit 4.5.2
  • JRuby is used in the Proxy development. Download and install JRuby 1.7.13

Required libraries

  • Bundler is required to install the gem bundle. Download and install it using “gem install bundler” after setting up JRuby.

Watson Explorer

In Watson Explorer, you should:

[repost] Probability Distributions







pnorm(1.96)   gives the value of the distribution function at quantile 1.96: under the standard normal distribution, P(X < 1.96) = 0.975

[1] 0.9750021





pnorm(4,2,5) gives, for X normally distributed with mean 2 and standard deviation 5, P(X < 4) = 0.6554

[1] 0.6554217

rnorm(10,mean=2,sd=2)  generates 10 normally distributed random numbers with mean 2 and standard deviation 2

[1]  2.8553699  4.4006244  2.1879336  5.1519583  2.7741916  3.1413673

[7]  3.2797569  5.1823565  0.2609885 -1.4029419





dbinom(9,20,0.5)   gives the probability that X = 9 when n = 20 and p = 0.5

[1] 0.1601791

pbinom(9,20,0.5)   gives the probability that X ≤ 9

[1] 0.4119015







Distribution law:    0 ≤ xi ≤ n, ∑xi = n

dmultinom(c(5,3),8,c(0.6,0.4))  gives the probability of observing counts of 5 and 3 when two events with probabilities c(0.6,0.4) occur 8 times in total

[1] 0.2786918




dnbinom(3,5,0.6)   gives the probability of 3 failures occurring before the 5th success, with success probability 0.6

[1] 0.1741824


Meaning: Bernoulli trials are repeated independently until the first success occurs; the number of failures then follows a geometric distribution with parameter p, i.e., a negative binomial distribution with size k = 1


dgeom(5,0.6)  gives the probability of 5 failures before the first success, with success probability 0.6

[1] 0.006144







E(X) = λ

dpois(3,5)   gives the probability of 3 events when λ = 5

[1] 0.1403739

ppois(5,5)   gives the probability of at most 5 events, P(X ≤ 5)

[1] 0.6159607





















Density function: when b = 1, the Weibull distribution reduces to the exponential distribution

curve(dexp(x),xlim=c(0,3),ylim=c(0,2))  the exponential distribution
curve(dweibull(x,1),add=T,lty=3,lwd=3)   the Weibull distribution with b = 1






















Meaning: if random variables X and Y are independent, X follows the standard normal distribution, and Y follows a chi-squared distribution with n degrees of freedom, then X/√(Y/n) follows a t distribution with n degrees of freedom










[repost ]Google Chrome: How to Use the Web Speech API



This February, Google released Chrome version 25. One of the newest and most interesting features introduced in this version was Web Speech API support. The Web Speech API is a JavaScript API that enables speech recognition and speech-to-text conversion; conversely, it also lets you transform text to speech.

This helped us develop the Handsfree plugin for Confluence, which creates pages and page comments by voice in Chrome 25+.

Speech recognition supports several popular languages and is quite effective. Currently, developers have two options for implementing speech recognition on web pages.

The First Method

The easiest way to use this technology is to rely on the functionality already implemented for the HTML <input> tag. You only need to add the x-webkit-speech attribute:

<input x-webkit-speech>

And you get a text box that allows you to dictate text.

By default the recognition language will be the same as that set in your browser. But you can change it in two ways:

1) By adding the attribute lang=”en” where the attribute value defines the language to be recognized:

<input lang="en" x-webkit-speech>

2) By using the <meta> tag on your html page:

<meta http-equiv = 'Content-Language' content = 'en'/>

The x-webkit-speech attribute’s advantages are:

  • It’s easy to implement
  • The browser doesn’t ask the user for permission to use the microphone.

However, there are significant disadvantages:

  • Speech recognition stops after a pause.
  • When you resume speech recognition in the same box, the old value is replaced by the new one, so you cannot append to the text.
  • It’s supported only by the <input> tag.
  • Interim results are not displayed, which makes for poor feedback: the user sees the recognition result only after they stop talking and the recognition process has finished.

You can see how it works here.

The Second Method – using Web Speech API on JavaScript

This method is based on the interaction with Web Speech API with the help of JavaScript (demo). To start using API, you need to create a new object that will be employed for recognition:

var recognition = new webkitSpeechRecognition();

Further you can set the following speech recognition parameters:

1) Enable continuous recognition, which allows the user to make long pauses and dictate large texts. By default this property is set to false (i.e., a pause in speech will stop the recognition process).

recognition.continuous = true;

2) Enable fetching of interim results. You then have access to interim recognition results and can display them in the text box immediately after receiving them, so the user sees a constantly refreshing text; otherwise, the recognized text becomes available only after a pause. The default value is false.

recognition.interimResults = true;

3) Set the recognition language. By default it corresponds to the browser language.

recognition.lang = "en";

To start recognizing you need to call the function:


The following function is called to stop recognition:

recognition.stop();

Now you need to initialize the recognition results handler:

recognition.onresult = function (event) {};

Add the result handling logic inside this function. The event object has the following fields:

  • event.results[i] – the array containing recognition result objects. Each array element corresponds to a word recognized at recognition stage i.
  • event.resultIndex – the current recognition result index.
  • event.results[i][j] – the j-th alternative for a recognized word. The first element is the most probable recognized word.
  • event.results[i].isFinal – the Boolean value that shows whether this result is final or interim.
  • event.results[i][j].transcript – the text representation of a word.
  • event.results[i][j].confidence – the probability that the given word was decoded correctly (a value from 0 to 1).

Now let’s write an elementary function that adds only final results to a text box (<input> as well as <textarea> can be used):

recognition.onresult = function (event) {
    for (var i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].isFinal) {
            insertAtCaret(textAreaID, event.results[i][0].transcript);
        }
    }
};

This function contains a loop that iterates over all objects of recognized words. If the result is final, it’s displayed in the text box.

Here insertAtCaret() is the function that inserts a text (the 2nd argument ) into <input> or <textarea> with the textAreaID identificator.

Now let’s consider a more complex example that outputs interim results to a text box. The implementation of final results output is the same, but we added a code that outputs interim results.

recognition.onresult = function (event) {
    // Calculate and save the cursor position where the text will be displayed
    var pos = textArea.getCursorPosition() - interimResult.length;
    // Delete the previous interim result from the textArea field
    textArea.val(textArea.val().replace(interimResult, ''));
    interimResult = '';
    // Restore the cursor position, then handle the new results
    for (var i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].isFinal) {
            insertAtCaret(textAreaID, event.results[i][0].transcript);
        } else {
            // Output the interim result to the text field and add
            // an interim result marker - a zero-length space
            insertAtCaret(textAreaID, event.results[i][0].transcript + '\u200B');
            interimResult += event.results[i][0].transcript + '\u200B';
        }
    }
};

The advantages of JavaScript Web Speech API:

  • the possibility to continuously dictate a text
  • the possibility to implement a multi-session recognition and save the result.
  • the possibility to insert the recognized speech anywhere in a text
  • it can be used with any HTML element, not just <input> (so you can implement voice commands)
  • the possibility to display interim recognition results

The disadvantages are:

  • the user should allow the browser to use the microphone before starting the session
  • the session length is limited to 60 seconds.

The most important disadvantage of the whole Web Speech API implementation in Chrome (including the x-webkit-speech attribute of the <input> tag) is that recognition is performed on the server rather than locally in the browser. This is why recognition has time limits. Currently, popular browsers such as Firefox, Internet Explorer, and Safari do not support the Web Speech API, but they are expected to implement speech recognition before long.

[repost ]Voice Driven Web Apps: Introduction to the Web Speech API


The new JavaScript Web Speech API makes it easy to add speech recognition to your web pages. This API allows fine control and flexibility over the speech recognition capabilities in Chrome version 25 and later. Here’s an example with the recognized text appearing almost immediately while speaking.


Let’s take a look under the hood. First we check to see if the browser supports the Web Speech API by checking if the webkitSpeechRecognition object exists. If not, we suggest the user upgrade their browser. (Since the API is still experimental, it’s currently vendor prefixed.) Lastly, we create the webkitSpeechRecognition object, which provides the speech interface, and set some of its attributes and event handlers.

if (!('webkitSpeechRecognition' in window)) {
  // suggest the user upgrade the browser
} else {
  var recognition = new webkitSpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;

  recognition.onstart = function() { ... }
  recognition.onresult = function(event) { ... }
  recognition.onerror = function(event) { ... }
  recognition.onend = function() { ... }
}

The default value for continuous is false, meaning that when the user stops talking, speech recognition will end. This mode is great for simple text like short input fields. In this demo, we set it to true, so that recognition will continue even if the user pauses while speaking.

The default value for interimResults is false, meaning that the only results returned by the recognizer are final and will not change. The demo sets it to true so we get early, interim results that may change. Watch the demo carefully, the grey text is the text that is interim and does sometimes change, whereas the black text are responses from the recognizer that are marked final and will not change.

To get started, the user clicks on the microphone button, which triggers this code:

function startButton(event) {
  final_transcript = '';
  recognition.lang = select_dialect.value;
  recognition.start();
}

We set the spoken language for the speech recognizer “lang” to the BCP-47 value that the user has selected via the selection drop-down list, for example “en-US” for English-United States. If this is not set, it defaults to the lang of the HTML document root element and hierarchy. Chrome speech recognition supports numerous languages (see the “langs” table in the demo source), as well as some right-to-left languages that are not included in this demo, such as he-IL and ar-EG.

After setting the language, we call recognition.start() to activate the speech recognizer. Once it begins capturing audio, it calls the onstart event handler, and then for each new set of results, it calls the onresult event handler.

recognition.onresult = function(event) {
    var interim_transcript = '';

    for (var i = event.resultIndex; i < event.results.length; ++i) {
      if (event.results[i].isFinal) {
        final_transcript += event.results[i][0].transcript;
      } else {
        interim_transcript += event.results[i][0].transcript;
      }
    }
    final_transcript = capitalize(final_transcript);
    final_span.innerHTML = linebreak(final_transcript);
    interim_span.innerHTML = linebreak(interim_transcript);
  };

This handler concatenates all the results received so far into two strings: final_transcript and interim_transcript. The resulting strings may include “\n”, such as when the user speaks “new paragraph”, so we use the linebreak function to convert these to the HTML tags <br> or <p>. Finally it sets these strings as the innerHTML of their corresponding <span> elements: final_span, which is styled with black text, and interim_span, which is styled with gray text.

interim_transcript is a local variable, and is completely rebuilt each time this event is called because it’s possible that all interim results have changed since the last onresult event. We could do the same for final_transcript simply by starting the for loop at 0. However, because final text never changes, we’ve made the code here a bit more efficient by making final_transcript a global, so that this event can start the for loop at event.resultIndex and only append new final text.

That’s it! The rest of the code is there just to make everything look pretty. It maintains state, shows the user some informative messages, and swaps the GIF image on the microphone button between the static microphone, the mic-slash image, and mic-animate with the pulsating red dot.

The mic-slash image is shown when recognition.start() is called, and then replaced with mic-animate when onstart fires. Typically this happens so quickly that the slash is not noticeable, but the first time speech recognition is used, Chrome needs to ask the user for permission to use the microphone, in which case onstart fires only when and if the user allows permission. Pages hosted on HTTPS do not need to ask repeatedly for permission, whereas HTTP-hosted pages do.

So make your web pages come alive by enabling them to listen to your users!

We’d love to hear your feedback…