Anthony Larcher, Jean-Francois Bonastre and Haizhou Li
SLTC Newsletter, May 2013
ALIZE is a collaborative open-source toolkit for speaker recognition that has been under development since 2004. The latest release (3.0) includes state-of-the-art methods such as Joint Factor Analysis, i-vector modelling and Probabilistic Linear Discriminant Analysis. The multi-platform C++ implementation of ALIZE is designed to handle the increasing quantity of data required for speaker and language detection and to facilitate the development of state-of-the-art systems. This article presents the motivation behind the ALIZE open-source platform, its architecture, the activities of its collaborative community, and the functionalities available in the 3.0 release.
WHY AN OPEN-SOURCE PLATFORM?
Speaker and language detection systems have greatly improved in the last few decades. The performance observed in the NIST Speaker Recognition and Language Recognition Evaluations demonstrates the achievements of the scientific community in making systems more accurate and more robust to noise and channel nuisances. These advances have two major consequences for system development. First, automatic systems are more complex, and training their probabilistic models requires a huge amount of data that can only be handled through parallel computation. Second, evaluating system performance calls for enormous numbers of trials to maintain confidence in the results and provide statistical significance when dealing with low error rates. ALIZE offers a simple solution to these issues by providing a set of free, efficient, multi-platform tools that can be used to build a state-of-the-art speaker or language recognition system.
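To give a feel for why low error rates call for so many trials, the following minimal sketch estimates how many trials are needed for the confidence interval on an observed error rate to reach a given relative width. It is not part of ALIZE: the function name `trials_needed` is hypothetical, and the calculation assumes independent trials and the normal approximation to the binomial distribution, which real evaluation trials only approximately satisfy.

```python
import math


def trials_needed(error_rate, relative_half_width=0.1, z=1.96):
    """Trials needed so that the normal-approximation confidence interval
    on an observed error rate has the requested relative half-width.

    z = 1.96 corresponds to roughly 95% confidence. Trials are assumed
    to be independent (an illustrative simplification).
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    if relative_half_width <= 0.0:
        raise ValueError("relative_half_width must be positive")
    # Interval half-width: z * sqrt(p * (1 - p) / n).
    # Setting it equal to relative_half_width * p and solving for n gives:
    n = (z ** 2) * (1.0 - error_rate) / (relative_half_width ** 2 * error_rate)
    return math.ceil(n)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(f"error rate {p:.3f}: about {trials_needed(p)} trials")
```

Under these assumptions the required number of trials grows roughly in inverse proportion to the error rate, so each tenfold reduction in error rate calls for about ten times as many trials.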
ALIZE was initiated by the University of Avignon – LIA in 2004. Since then, many research institutes and companies have contributed to the project. This collaboration is facilitated by a number of tools available from the ALIZE website (http://alize.univ-avignon.fr/):
- Online documentation, wiki and tutorials
- Doxygen developer documentation to become familiar with the low-level API
- A mailing list to keep informed of the latest developments and receive support from the community
- An SVN server to download the latest version of the source code
- A download platform to get the latest release of the toolkit
A LinkedIn group is also available to provide a means to learn about the facilities and people involved in the field of speaker recognition.
The ALIZE project consists of a low-level API (ALIZE) and a set of high-level executables that form the LIA_RAL toolkit. Together, they make it easy to set up a speaker recognition system for research purposes as well as to develop industry-based applications.
LIA_RAL is a high-level toolkit based on the low-level ALIZE API. It consists of three sets of executables: LIA_SpkSeg, LIA_Utils and LIA_SpkDet. LIA_SpkSeg includes executables dedicated to speaker segmentation and LIA_Utils includes utility programs to handle ALIZE objects, while LIA_SpkDet is developed to fulfil the main functions of a state-of-the-art speaker recognition system as described in the following figure.
Figure 1: General architecture of a speaker recognition system.
- Feature mean and variance normalization
- Energy-based speech activity detection
- Remove overlapping sections of speech features in two-channel recordings
- GMM adaptation (Maximum A Posteriori, Maximum Likelihood Linear Regression)
- Joint Factor Analysis
- Latent Factor Analysis
- SVM modelling based on LibSVM and Nuisance Attribute Projection
- I-vector extraction and normalization through Eigen Factor Radial, Length Normalization, Spherical Nuisance Normalization, Within Class Covariance Normalization, Linear Discriminant Analysis
- Frame-by-frame scoring
- Cosine similarity 
- Mahalanobis scoring 
- Two-covariance scoring 
- Probabilistic Linear Discriminant Analysis (PLDA)  following implementations proposed in  and scoring functions described in 
- EM based GMM training
- Total Variability matrix estimation  using minimum divergence criteria 
- Probabilistic Linear Discriminant Analysis training 
- Estimation of normalization meta-parameters [7,8,9]
- T-norm 
- Z-norm 
- ALIZE does not include acoustic feature extraction but is compatible with SPro, HTK and RAW formats
- Score matrices can be exported in binary format easily handled by the BOSARIS toolkit
LIA_RAL also includes a number of tools to manage objects in ALIZE format.
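As an illustration of three of the functions listed above (length normalization, cosine similarity scoring, and Z-norm/T-norm score normalization), here is a minimal, self-contained sketch. It is written in plain Python for readability and does not use or reflect the ALIZE API: all function names are hypothetical, and the vectors stand in for i-vectors under the assumption that they have already been extracted.

```python
import math
from statistics import mean, pstdev


def length_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_score(enrol, test):
    """Cosine similarity: dot product of the length-normalized vectors."""
    a, b = length_normalize(enrol), length_normalize(test)
    return sum(x * y for x, y in zip(a, b))


def _standardize(raw_score, cohort_scores):
    """Shift and scale a score by the mean and deviation of a cohort."""
    spread = pstdev(cohort_scores)
    if spread == 0.0:
        raise ValueError("cohort scores have zero variance")
    return (raw_score - mean(cohort_scores)) / spread


def znorm(raw_score, enrol, impostor_tests):
    """Z-norm: cohort = enrolment model scored against impostor segments."""
    return _standardize(raw_score, [cosine_score(enrol, t) for t in impostor_tests])


def tnorm(raw_score, test, impostor_models):
    """T-norm: cohort = test segment scored against impostor models."""
    return _standardize(raw_score, [cosine_score(m, test) for m in impostor_models])


if __name__ == "__main__":
    # Toy three-dimensional vectors; real i-vectors have hundreds of dimensions.
    enrol = [1.0, 0.2, 0.0]
    test = [0.9, 0.3, 0.1]
    cohort = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.5, 0.2]]
    raw = cosine_score(enrol, test)
    print("raw:", raw)
    print("z-norm:", znorm(raw, enrol, cohort))
    print("t-norm:", tnorm(raw, test, cohort))
```

The two normalizations differ only in which side of the trial is held fixed while the impostor cohort is scored: Z-norm statistics depend on the enrolment model and can be computed offline, whereas T-norm statistics depend on the test segment and are computed at test time.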
MULTI-PLATFORM C++ TOOLKIT
The ALIZE source code is available under the LGPL license, which imposes minimal restrictions on the redistribution of covered software.
The ALIZE software architecture is based on UML modelling and strict coding conventions in order to facilitate collaborative development and code maintenance. The platform includes a Visual Studio® solution as well as autotools support for easy compilation on UNIX-like platforms. Parallelization of the code has been implemented based on the POSIX standard library, and some executables can be linked to LAPACK for accurate and fast matrix operations. The whole toolkit has been tested under Windows, Mac OS and several Linux distributions on both 32-bit and 64-bit architectures.
An open-source, cross-platform test suite enables ALIZE contributors to quickly run regression tests in order to increase the reliability of future releases and make the code easier to maintain. Test cases include low-level unit tests on the ALIZE core and the most important algorithmic classes, as well as integration tests on the high-level executable tools. Doxygen documentation is available online and can be compiled from the sources.
The ALIZE 3.0 release was partly supported by the BioSpeak project, part of the EU-funded Eurostar/Eureka program.
Subscribe to the mailing list by sending an email to: dev-alize-subscribe[AT]listes.univ-avignon.fr
The best way to contact the ALIZE maintainers is to send an email to: alize[AT]univ-avignon.fr
[1] W. Campbell, D. Sturim and D. Reynolds, “Support Vector Machines Using GMM Supervectors for Speaker Verification,” in IEEE Signal Processing Letters, 13(5), pp. 308-311, 2006
[2] D. Matrouf, N. Scheffer, B. Fauve and J.-F. Bonastre, “A straightforward and efficient implementation of the factor analysis model for speaker verification,” in Annual Conference of the International Speech Communication Association (Interspeech), pp. 1242-1245, 2007
[3] P. Kenny, G. Boulianne, P. Ouellet and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” in IEEE Transactions on Audio, Speech, and Language Processing, 15(4), pp. 1435-1447, 2007
[4] R. Auckenthaler, M. Carey and H. Lloyd-Thomas, “Score Normalization for Text-Independent Speaker Verification Systems,” in Digital Signal Processing, pp. 42-54, 2000
[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” in IEEE Transactions on Audio, Speech, and Language Processing, 19, pp. 788-798, 2011
[6] N. Brummer, “The EM algorithm and minimum divergence,” in Agnitio Labs Technical Report, Online: http://niko.brummer.googlepages
[7] P.-M. Bousquet, D. Matrouf and J.-F. Bonastre, “Intersession compensation and scoring methods in the i-vectors space for speaker recognition,” in Annual Conference of the International Speech Communication Association (Interspeech), pp. 485-488, 2011
[8] D. Garcia-Romero and C.Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Annual Conference of the International Speech Communication Association (Interspeech), pp. 249-252, 2011
[9] P.-M. Bousquet, A. Larcher, D. Matrouf, J.-F. Bonastre and O. Plchot, “Variance-Spectra based Normalization for I-vector Standard and Probabilistic Linear Discriminant Analysis,” in Odyssey Speaker and Language Recognition Workshop, 2012
[10] N. Brummer and E. de Villiers, “The speaker partitioning problem,” in Odyssey Speaker and Language Recognition Workshop, 2010
[11] S.J. Prince and J.H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in International Conference on Computer Vision, pp. 1-8, 2007
[12] Y. Jiang, K. A. Lee, Z. Tang, B. Ma, A. Larcher and H. Li, “PLDA Modeling in I-vector and Supervector Space for Speaker Verification,” in Annual Conference of the International Speech Communication Association (Interspeech), pp. 1680-1683, 2012
[13] K.-P. Li and J. E. Porter, “Normalizations and selection of speech segments for speaker recognition scoring,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pp. 595-598, 1988
[14] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” in IEEE Transactions on Speech and Audio Processing, 2(2), pp. 291-298, 1994
[15] C. J. Leggetter and P.C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” in Computer Speech and Language, pp. 171-185, 1995
[16] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” in ACM Transactions on Intelligent Systems and Technology, pp. 1-27, 2011
[17] A. Larcher, K. A. Lee, B. Ma and H. Li, “Phonetically-Constrained PLDA Modelling for Text-Dependent Speaker Verification with Multiple Short-Utterances,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2013
Anthony Larcher is a member of the research staff in the Department of Human Language Technology at the Institute for Infocomm Research, Singapore. His interests are mainly in speaker and language recognition.
Jean-Francois Bonastre is an IEEE Senior Member. He is Professor of Computer Science at the University of Avignon and Vice President of the university. He is also a member of the Institut Universitaire de France (Junior 2006). He has been President of the International Speech Communication Association (ISCA) since September 2011.
Haizhou Li is the Head of the Department of Human Language Technology at the Institute for Infocomm Research, Singapore. He is a Board Member of the International Speech Communication Association.