Python and Machine Learning

A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data; the difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too complex to describe generally in programming languages, so that in effect programs must automatically describe programs.

Python is great for brainstorming and trying out new ideas. I will give an overview of the tools that are available to date that can assist in rapid prototyping and design of machine learning programs in Python.



Presentation Transcript

  • Python and Machine Learning by Semen A. Trygubenko
  • Machine learning around us
    • Voice recognition
    • Spam filtering
    • Player ranking in online games
    • Vehicle stability systems
    • Computer vision (barcode, fingerprint and number plate readers)
    • Optimisation
  • Learning? Machine Learning?
    • Getting better at a task through practice
    • The act of remembering: data / experience
    • Generalisation, similarity and new inputs
    • Learning and flexibility: adaptation
  • Objective function game
    • Supervised learning, training
      • Regression
      • Classification: features, decision boundaries
    • Reinforcement learning
    • Semi-supervised learning
    • Evolutionary learning, fitness
    • Unsupervised
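The supervised-learning split above can be made concrete with a small sketch (not from the talk), assuming only NumPy: the same four points serve once as a regression target and once to illustrate a one-dimensional decision boundary. The data and the threshold are invented for the example.

```python
# Regression vs. classification on toy data, using only NumPy.
import numpy as np

# Regression: recover the line y = 2x + 1 by least squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
A = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit

# Classification: label each point by which side of the
# (hand-picked) decision boundary x = 1.5 it falls on.
labels = (x > 1.5).astype(int)
```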
  • Everything is miscellaneous
    • Clustering
    • Graphical models
    • Artificial neural networks
    • Kernel methods
    • Dimensionality reduction
    • Optimisation
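As a hedged, library-free illustration of the clustering entry above, here is Lloyd's k-means on toy one-dimensional data; the data points and the two initial centres are invented for the example.

```python
# A few Lloyd iterations of k-means (k = 2) on 1-D data, NumPy only.
import numpy as np

points = np.array([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
centres = np.array([0.0, 10.0])    # initial centre guesses

for _ in range(10):
    # assign each point to its nearest centre
    assign = np.abs(points[:, None] - centres[None, :]).argmin(axis=1)
    # move each centre to the mean of its assigned points
    centres = np.array([points[assign == k].mean() for k in range(2)])
```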
  • Interdisciplinary & as a branch of CS
  • ML is a vaaast field
  • ML Tools: Python
    • NLTK
    • FANN
    • Orange
    • PyMC
    • PyML
    • LIBSVM
    • PyBrain
    • ffnet
    • MDP
    • Shogun toolbox
    • Theano
    • mlpy
    • Elefant
    • Bayes Blocks
    • Monte Python
    • hcluster
    • Plearn
    • Pycplex
    • pymorph
  • We need total coverage ...
  • Nothing about everything
    • A sprint usability study:
    • (1) assess how easy it is
      • to install and obtain source code;
      • to get going with toy examples
    • (2) check the quality of documentation and source code
    • (3) establish where the project is in its lifecycle
    • (4) demo stuff that I think is cool
  • Testbed
    • OS and package repo:
      • Debian
      • Ubuntu 10.04
    • Make 3.81:
      • $ make install
      • $ make source
    • Free as in freedom
  • Orange
    • A machine learning and data mining suite
    • Visual programming framework + scriptable environment
    • Orange canvas: widgets and channels
    • Communication via tokens
    • Implements sampling, filtering, scaling, discretisation, regression, classification, clustering, scoring functions, SVMs
  • Orange
    • Faculty of Computer and Information Science
    • AI Lab, University of Ljubljana, Slovenia
    • (L)GPL
    • C++ components accessible from Python
    • 98K LOCs in Python, 66K LOCs in C++
    • 700 revisions in 2010, 14 developers
    • 10,000 commits since 2003, when the project migrated from CVS
  • Orange
        # Homepage:
        # Dependencies: Python, PythonWin, NumPy, PyQt, PyQwt...
        .PHONY: install
        install: repos
            sudo apt-get install orange-svn python-orange-svn
        .PHONY: source
        source:
            svn checkout Orange
        .PHONY: repos
        repos:
            sudo echo -e "deb lenny main deb-src lenny main deb lucid main deb-src lucid main" | sudo tee -a /etc/apt/sources.list
            sudo apt-key adv --keyserver --recv-keys 5BB92C09DB82666C && sudo apt-get update
            sudo apt-get install python2.5
  • Orange: I/O
        import orange
        data = orange.ExampleTable("in")
        print data.domain
        for item in data:
            print item
        orange.saveTabDelimited("", data)
  • Orange: basic statistics
        selection = orange.MakeRandomIndices2(data, 0.03)
        sample =, 0)"")
  • Orange: learners and classifiers
        classifier = orange.BayesLearner(data)
        print classifier(newItem)
  • MDP
    • Modular toolkit for data processing
    • Supervised and unsupervised
    • PCA and ICA, Slow feature analysis, LLE, restricted Boltzmann machine
    • Hierarchical networks
    • Support vector machines (through wrappers around Shogun and LIBSVM)
    • Concept of nodes and flows; the basic ones are parallelised
  • MDP
    • Pietro Berkes, Rike-Benjamin Schuppner, Niko Wilbert, Tiziano Zito, community contributions
    • Institute for Theoretical Biology of the Humboldt University, Berlin
    • Schuppner, Wilbert and Zito are active, 150 commits this year
    • LGPL, version 2.6 released in May this year
    • 30K LOCs of Python, 3K of comments in the code!
  • MDP
        # Homepage:
        # Dependencies: NumPy, SciPy
        .PHONY: install
        install:
            sudo apt-get install python-mdp
        .PHONY: source
        source:
            git clone git://
  • MDP: dimensionality reduction
        import mdp
        x = mdp.numx_rand.random((100, 25))  # 25 variables, 100 observations
        y = mdp.pca(x)
        z = mdp.fastica(x, dtype='float32')
  • MDP: nodes, training and usage
        n = mdp.nodes.PCANode()
        n.train(x)             # learn the principal components of x
        n.stop_training()
        print n.output_dim
        print n.explained_variance
        z = n.execute(y)       # project y on the PCs learned in training
  • MDP: inverting the flow
        print n.is_invertible()  # True for a PCA node
        print n.inverse(z)       # get y back
  • MDP: flows, feed-forward architectures
        flow = mdp.Flow([mdp.nodes.PCANode(output_dim=5), mdp.nodes.CuBICANode()])
        flow = mdp.nodes.PCANode(output_dim=5) + mdp.nodes.CuBICANode()  # equivalent
        flow.train(x)
  • More dimensions: defs
    • Layer = wrapper for a set of nodes trained and executed in parallel
    • FlowNode = a node with internal structure
    • Switchboard = DS for arbitrary routing
  • MDP: hierarchical networks
        a = mdp.nodes.PCANode()
        b = mdp.nodes.SFANode()
        c = mdp.hinet.FlowNode(mdp.Flow([a, b]))
        layer = mdp.hinet.Layer([n, c])
  • PyMC
    • Markov chain Monte Carlo for Python
    • Can fit Bayesian statistical models with MCMC
    • Large suite of statistical distributions
    • Building blocks to construct probability models: stochastic, deterministic and potential
    • Python for scalar vars, numpy and hand-optimised Fortran code for arrays
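PyMC generates samplers like the following automatically; purely as an illustration of what MCMC does, here is a hand-rolled random-walk Metropolis sampler targeting a standard normal. The step size, iteration counts and burn-in are arbitrary choices for the sketch, not PyMC defaults.

```python
# Random-walk Metropolis sampling from N(0, 1), using only NumPy.
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x * x            # log-density of N(0, 1), up to a constant

samples = []
x = 0.0
for _ in range(20000):
    proposal = x + rng.normal(scale=1.0)   # random-walk proposal
    # accept with probability min(1, target(proposal) / target(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

burned = np.array(samples[5000:])          # discard burn-in
```

The retained draws should have mean near 0 and standard deviation near 1.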
  • PyMC
    • Christopher Fonnesbeck, David Huard, Anand Patil
    • USA
    • MIT lic
    • 3 developers, 100 commits this year, all in January
    • Version: 2.1 released in Jan
    • 26K LOCs of Python, 10K of Fortran and 3K LOCs of C
  • PyMC
        # Homepage:
        # Dependencies: NumPy, SciPy, matplotlib, pytables, pydot, nose
        .PHONY: install
        # grr... 2.1alpha is missing and
        ##dir := pymc-2.1alpha
        dir := pymc
        install: source  ##${dir}.tar.gz
            sudo apt-get install python-dev gcc gfortran
            ## tar -xzvf $<
            cd ${dir} && python config_fc --fcompiler gnu95 build && sudo python install
        ##pymc-2.1alpha.tar.gz:
        ##    wget
        .PHONY: source
        source:
            svn checkout pymc
  • PyMC: model module, part 1
        x = numpy.array([-.86, -.3, -.05, .73])
        alpha = pymc.Normal('alpha', mu=0, tau=.01)
        beta = pymc.Normal('beta', mu=0, tau=.01)
        @pymc.deterministic
        def theta(a=alpha, b=beta):
            return pymc.invlogit(a + b * x)
  • PyMC: model module, part 2
        # Binomial likelihood for the data
        d = pymc.Binomial('d',
                          n=numpy.ones(4, dtype=int) * 5,
                          p=theta,
                          value=numpy.array([0., 1., 3., 5.]),
                          observed=True)
  • PyMC: sampling from a distribution
        import pymc
        import model
        S = pymc.MCMC(model, db='pickle')
        S.sample(iter=10000, burn=5000, thin=2)
  • PyML
    • A Python machine learning package
    • Interactive OO framework, focuses on SVM and other kernel methods
    • tools for feature selection, model selection, syntax for combining classifiers
    • methods for assessing classifier performance
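Since PyML is organised around kernel methods, a minimal sketch of the central object may help: a Gaussian (RBF) kernel matrix computed with plain NumPy. The data and the gamma value are invented for the example; this is not PyML's API.

```python
# RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2), NumPy only.
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 0.5

# pairwise squared Euclidean distances via broadcasting
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dists)
```

Any valid kernel matrix is symmetric with ones on the diagonal for the RBF kernel, which makes it easy to sanity-check.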
  • PyML
    • Asa Ben-Hur, Colorado, USA
    • Depends: matplotlib, numpy
    • 1 developer (+ research group?) + NSF funding
    • out last month
    • 18K LOCs of Python, 8K LOCs of C++
    • LGPL
  • PyML
        # Homepage:
        # Dependencies: numpy, matplotlib
        .PHONY: install
        install: PyML-
            tar -xzvf $<
            cd PyML- && python build && sudo python install
        PyML-
            wget
  • hcluster
    • Damian Eads, UCSC
    • 11K LOCs of Python, 8K of C and 500 of C++
    • 2500 lines of documentation
    • 1 developer, active 2007-2008
    • New BSD lic
    • 8 routines for agglomerative clustering, routines that compute statistics on hierarchies, 21 distance functions
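hcluster implements its routines in optimised C; as a rough pure-Python sketch of what single-linkage agglomerative clustering computes (toy one-dimensional data, not hcluster's API):

```python
# Single-linkage agglomerative clustering: repeatedly merge the two
# clusters whose closest members are nearest, until n_clusters remain.
def single_linkage(points, n_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return [sorted(c) for c in clusters]

groups = single_linkage([0.0, 0.2, 5.0, 5.1, 9.0], 3)
```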
  • hcluster
        # Homepage:
        .PHONY: install
        install:
            sudo apt-get install python-hcluster
        .PHONY: source
        source:
            svn checkout hcluster
  • NLTK
        # Homepage:
        # Dependencies: PyYAML
        .PHONY: install
        install:
            sudo apt-get install python-nltk
        .PHONY: source
        source:
            sudo apt-get source python-nltk
  • mlpy
        # Homepage:
        # Dependencies: NumPy, GSL
        .PHONY: install
        install:
            sudo echo "deb jaunty main" | sudo tee -a /etc/apt/sources.list
            sudo apt-key adv --keyserver --recv-keys 4AEC3064
            sudo apt-get update
            sudo apt-get install python-mlpy
        .PHONY: source
        source:
            svn co mlpy
  • LIBSVM
        # Homepage:
        .PHONY: install
        install:
            sudo apt-get install libsvm2 python-libsvm
        .PHONY: source
        source:
            sudo apt-get source libsvm2
  • PyEvolve
        # Homepage:
        # Dependencies: matplotlib
        .PHONY: install
        install:
            wget
            sudo easy_install Pyevolve-0.5-py2.6.egg
        .PHONY: source
        source:
            svn co
  • FANN library
        # Homepage:
        # Dependencies: PyYAML
        .PHONY: install
        install:
            sudo apt-get install libfann1-dev
        .PHONY: source
        source:
            sudo apt-get source libfann1
        .PHONY: pythonBindings
        pythonBindings:
            # need to build from source!
  • Theano
        # Homepage:
        # Dependencies: BLAS
        .PHONY: install
        install:
            sudo apt-get install libblas-dev
            cd Theano && sudo python install
        .PHONY: source
        source:
            hg clone Theano
  • PyBrain
        # Homepage:
        .PHONY: install
        install:
            cd pybrain && sudo python install
        .PHONY: source
        source:
            git clone git://
  • Shogun
        # Homepage:
        .PHONY: install
        install:
            sudo apt-get install shogun-doc-en shogun-cmdline shogun-python
        .PHONY: source
        source:
            svn checkout shogun
  • ffnet
        # Homepage:
        .PHONY: install
        install:
            cd ffnet && sudo python install
        .PHONY: source
        source:
            svn co ffnet
  • Observations
    • There is a lot of activity in ML + Python
    • Not everything is packaged for your favourite OS
    • But: it is pretty easy to get going / help
    • Documentation is heterogeneous and of varying quality
    • But: there's the source
    • Good libraries are usually centred on research groups / backed by universities
  • Links
  • Books
    • Machine Learning: An Algorithmic Perspective
    • by Stephen Marsland, 2009
    • Artificial Intelligence: A Modern Approach
    • by Stuart Russell and Peter Norvig, 2009
    • Programming Collective Intelligence
    • by Toby Segaran, 2007
    • Information Theory, Inference, and Learning Algorithms
    • by David MacKay, 2005
  • The End
  • Are we there yet?
      Todd L. Veldhuizen, in “Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf's Law”, 2005:
    • program size is a rough lower bound for effort
    • by using a component from a library we achieve a reduction in size
    • components may be whatever form of abstraction we can invent: subroutines, generators of subroutines, generators of generators of sub ...
  • Grunge end cont'd
    • complexity of an object is measured by the length of the smallest program that generates it
    • libraries allow us to “compress the incompressible” by taking advantage of the commonality exhibited by programs within a problem domain
    • problem domain entropy is defined as a measure of program diversity
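One conventional way to write the two quantities the slides refer to (this notation is an addition, not on the original slides): for a universal machine U, the Kolmogorov complexity of an object x, and the Shannon entropy of a distribution p over programs in the domain, are

```latex
K(x) = \min \{\, |p| \;:\; U(p) = x \,\},
\qquad
H = -\sum_{x} p(x) \log p(x).
```

A high-entropy problem domain is then one whose programs share little structure, so a library can compress only a small part of it.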
  • Grunge end cont'd
    • “libraries are essentially incomplete, and there will always be room for more components in any problem domain”
    • “better tools and culture can have only marginal impact on reuse rates if the domain is inherently resistant to reuse”
  • Machine learning is a high-entropy domain
  • The future: doable and fascinating
    • “There is a fantastic existence proof that learning is possible, which is the bag of water and electricity (together with a few trace chemicals) sitting between your ears.” – Stephen Marsland
    • “You are a creative genius. Your creative genius is so accomplished that it appears, to you and to others, as effortless. Yet it far outstrips the most valiant efforts of today's fastest supercomputers. To invoke it, you need only open your eyes.” – Donald D. Hoffman
    • Thank you!