October 17, 2013

I started comparing the functionality of some Python extensions (the list is not exhaustive):

The first one (scikit-learn) covers many features and its documentation is quite clear. When a model is missing, you can look into PyBrain for Reinforcement Learning, Gensim for Dirichlet Allocation (Latent, Hierarchical) and NLTK for any text processing (tokenization, for example). For those who do not want to code, Orange would be a good option. The module Theano does gradient-based optimization on the GPU.
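To give an idea of what working with scikit-learn looks like, here is a minimal sketch that trains a classifier and estimates its accuracy by cross validation. It assumes a recent scikit-learn release (where cross validation lives in `sklearn.model_selection`); the choice of the bundled iris dataset, logistic regression and 5 folds is mine, purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small dataset shipped with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Illustrative model choice; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=1000)

# 5-fold cross validation: one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print("accuracy per fold:", scores)
print("mean accuracy:", mean_accuracy)
```

Most scikit-learn estimators follow the same pattern, so swapping `LogisticRegression` for another model usually only changes one line.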

A couple of forums, which work as a kind of FAQ for machine learning:

It would be difficult to do machine learning without visualization tools; matplotlib and ggplot are a good way to start. We also manipulate tables, with numpy and pandas. For an interactive command line, ipython and bpython are two common options.
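As a small illustration of the table manipulation mentioned above, the sketch below builds a pandas table from a numpy array and computes one average per group. The numbers, the column names and the labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up measurements: one row per observation, one column per feature.
values = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.3, 3.3],
    [5.8, 2.7],
])

# Wrap the numpy array in a DataFrame to get named columns.
table = pd.DataFrame(values, columns=["length", "width"])
table["label"] = ["a", "a", "b", "b"]  # hypothetical class labels

# Average of each feature for each label.
means = table.groupby("label")[["length", "width"]].mean()
print(means)
```

The resulting `means` table can then be passed to a plotting library, for instance through its `plot` method, which relies on matplotlib.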

If you are looking for data, see the UC Irvine Machine Learning Repository.

The following table summarizes which features can be found in which module (it may contain some errors):

scikit-learn mlpy MDP PyBrain Theano MILK NLTK Gensim Orange
AdaBoost yes yes
C4.5 yes
Canonical Correlation Analysis yes yes
Cross Validation yes yes yes
DBSCAN yes
Decision Trees yes yes yes yes
Deep Belief Networks yes
Dictionary Learning yes
Dynamic Time Warping (yes) yes
Elastic Net yes yes yes
Evolution Strategies (ES) yes
Fast ICA yes yes
Fast Principal Component Analysis (Fast PCA) yes
Gaussian Mixture Model yes yes
Gaussian Naive Bayes yes
Genetic Algorithm yes
Golub Classifier yes
GPU computation yes
Gradient Based Optimization yes yes
Gradient Boosting Regression yes
Grid Search yes
Hidden Markov Model with Gaussian Mixture Emissions (HMM GMM) yes yes
Hierarchical Clustering (Ward…) yes yes yes yes
Hierarchical Dirichlet Process (HDP) yes
ICA yes yes
Isotonic Regression yes
KDTree yes
Kernel Density yes
Kernel Fisher Discriminant yes
Kernel PCA yes yes
Kernel Ridge Regression yes
k-Means yes yes yes yes yes
k-NN yes yes yes yes
Kohonen (SOM) yes yes
Label Spreading yes yes
Longest Common Subsequence (LCS) yes
Lasso yes yes
Latent Dirichlet Allocation (LDA) yes
Least Angle Regression (LARS) yes yes
Linear Discriminant Analysis (LDA) yes yes yes
Linear Regression yes yes yes yes yes
Logistic Regression yes yes yes yes
Naive Bayesian Learner yes yes
Natural Language Processing (NLP) yes
Neural Network (NN) yes yes yes yes
Non-Negative Matrix Factorization by Projected Gradient (NMF) yes yes
Partial Least Squares (PLS) yes
Partial Least Squares (SVD) yes
Particle Swarm Optimization (PSO) yes
Passive Aggressive Classification yes
Passive Aggressive Regression yes
Pipeline yes yes
Principal Component Analysis (PCA) yes yes yes yes yes
Probabilistic Principal Component Analysis (pPCA) yes yes yes
p-Value yes yes
Quadratic Discriminant Analysis (QDA) yes yes
Random Forests yes yes yes
Recurrent Neural Network yes
Regression Tree yes yes
Reinforcement Learning yes
Ridge Regression yes yes yes
ROC / Precision / Recall yes yes
SARSA yes
Singular Value Decomposition (SVD) yes yes
Sparse PCA yes yes
Spectral BiClustering yes
Spectral Clustering yes
Spectral Coclustering yes
Spectral Regression Discriminant Analysis yes
Support Vector Classification (SVC) yes yes yes
Support Vector Regression (SVR) yes yes
Support Vector Machine (SVM) yes yes yes yes yes
TF-IDF yes yes
Wavelets yes
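
Several rows of the table (Pipeline, Grid Search, Cross Validation, Support Vector Classification) can be combined in a few lines. The sketch below shows one possible way to do it with a recent scikit-learn release; the dataset, the step names `scale` and `svc` and the parameter grid are my own choices for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small dataset shipped with scikit-learn.
X, y = load_iris(return_X_y=True)

# Pipeline: standardize the features, then fit a support vector classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC()),
])

# Illustrative grid; "svc__C" means parameter C of the step named "svc".
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],
    "svc__kernel": ["linear", "rbf"],
}

# Grid search scores every combination with 5-fold cross validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("best parameters:", best_params)
print("best cross-validated accuracy:", best_score)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the cross-validated score is not contaminated by the held-out data.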