Python and Machine Learning

A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. The difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too complex to describe generally in programming languages, so in effect programs must automatically describe programs.

Python is great for brainstorming and trying out new ideas. I will give an overview of the tools that are available to date that can assist in rapid prototyping and design of machine learning programs in Python.

Note from a commenter: given the popularity of this talk, and of machine learning in general at EuroPython 2010, people may like to follow up their machine learning interests with http://www.ncaf.org.uk/, the Natural Computing Applications Forum (see also http://uk.groups.yahoo.com/group/ncaforum/).
Python and Machine Learning
by Semen A. Trygubenko

Machine learning around us
- Voice recognition
- Spam filtering
- Player ranking in online games
- Vehicle stability systems
- Computer vision (barcode, fingerprint and number plate readers)
- Optimisation

Learning? Machine Learning?
- Getting better at a task through practice
- The act of remembering: data / experience
- Generalisation, similarity and new inputs
- Learning and flexibility: adaptation

Objective function game
- Supervised learning, training
  - Regression
  - Classification: features, decision boundaries
- Reinforcement learning
- Semi-supervised learning
- Evolutionary learning, fitness
- Unsupervised
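The supervised-classification items above (features, decision boundaries) can be illustrated with a minimal nearest-centroid classifier. This is a sketch in plain NumPy, not something from the talk:

```python
import numpy as np

# Toy training data: two classes of 2-D feature vectors
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))

# "Training" = remembering one centroid per class
centroids = np.stack([class_a.mean(axis=0), class_b.mean(axis=0)])

def classify(x):
    """Assign x to the class with the nearest centroid.

    The decision boundary is the perpendicular bisector of the
    segment joining the two centroids.
    """
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print(classify(np.array([0.2, -0.1])))  # lands near class 0's centroid
print(classify(np.array([2.8, 3.1])))   # lands near class 1's centroid
```

Generalisation here is exactly the "similarity and new inputs" point above: a new input is judged by its distance to what was remembered during training.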
Everything is miscellaneous
- Clustering
- Graphical models
- Artificial neural networks
- Kernel methods
- Dimensionality reduction
- Optimisation

Interdisciplinary & as a branch of CS

ML is a vaaast field
ML Tools: Python
- NLTK
- FANN
- Orange
- PyMC
- PyML
- LIBSVM
- PyBrain
- ffnet
- MDP
- Shogun toolbox
- Theano
- mlpy
- Elefant
- Bayes Blocks
- Monte Python
- hcluster
- Plearn
- Pycplex
- pymorph
We need total coverage ...

Nothing about everything
- A sprint usability study:
  1. Assess how easy it is
     - to install and obtain source code
     - to get going with toy examples
  2. Check the quality of documentation and source code
  3. Establish where the project is in its lifecycle
  4. Demo stuff that I think is cool

Testbed
- OS and package repo:
  - Debian
  - Ubuntu 10.04
- Make 3.81:
  - $ make install
  - $ make source
- Free as in freedom
Orange
- A machine learning and data mining suite
- Visual programming framework + scriptable environment
- Orange canvas: widgets and channels
- Communication via tokens
- Implements sampling, filtering, scaling, discretisation, regression, classification, clustering, scoring functions, SVMs

Orange
- Faculty of Computer and Information Science, AI lab, University of Ljubljana, Slovenia
- (L)GPL
- C++ components accessible from Python
- 98K LOCs in Python, 66K LOCs in C++
- 700 revisions in 2010, 14 developers
- 10,000 commits since 2003, when the project migrated from CVS
Orange

    # Homepage: http://www.ailab.si/orange/
    # Dependencies: Python, PythonWin, NumPy, PyQt, PyQwt...
    .PHONY: install
    install: repos
            sudo apt-get install orange-svn python-orange-svn
    .PHONY: source
    source:
            svn checkout http://www.ailab.si/svn/orange/trunk Orange
    .PHONY: repos
    repos:
            echo -e "deb http://www.ailab.si/orange/debian lenny main\ndeb-src http://www.ailab.si/orange/debian lenny main\ndeb http://ppa.launchpad.net/fkrull/deadsnakes/ubuntu lucid main\ndeb-src http://ppa.launchpad.net/fkrull/deadsnakes/ubuntu lucid main" | sudo tee -a /etc/apt/sources.list
            sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 5BB92C09DB82666C && sudo apt-get update
            sudo apt-get install python2.5
I/O

    import orange
    data = orange.ExampleTable("in")
    print data.domain
    for item in data:
        print item
    orange.saveTabDelimited("out.tab", data)

Basic statistics

    selection = orange.MakeRandomIndices2(data, 0.03)
    sample = data.select(selection, 0)
    sample.save("sample.tab")

Learners and classifiers

    classifier = orange.BayesLearner(data)
    print classifier(newItem)
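Orange's BayesLearner builds a naive Bayes classifier. For readers without Orange installed, the underlying idea — class priors plus per-feature likelihoods combined under an independence assumption — can be sketched in plain NumPy. Everything below is illustrative and is not Orange's API:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior P(c)
                     Xc.mean(axis=0),        # per-feature means
                     Xc.var(axis=0) + 1e-9)  # per-feature variances
    return params

def predict(params, x):
    """Pick the class maximising log P(c) + sum_i log P(x_i | c)."""
    def log_posterior(p):
        prior, mu, var = p
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_posterior(params[c]))

X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]])
y = np.array([0, 0, 1, 1])
model = fit_gaussian_nb(X, y)
print(predict(model, np.array([1.1, 2.0])))  # close to class 0
```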
MDP
- Modular toolkit for Data Processing
- Supervised and unsupervised
- PCA and ICA, slow feature analysis, LLE, restricted Boltzmann machine
- Hierarchical networks
- Support vector machines (through wrappers around Shogun and LIBSVM)
- Concept of nodes and flows; the basic ones are parallelised

MDP
- Pietro Berkes, Rike-Benjamin Schuppner, Niko Wilbert, Tiziano Zito, community contributions
- Institute for Theoretical Biology of the Humboldt University, Berlin
- Schuppner, Wilbert and Zito are active, 150 commits this year
- LGPL, version 2.6 released in May this year
- 30K LOCs of Python, 3K of comments in the code!
MDP

    # Homepage: http://mdp-toolkit.sourceforge.net/
    # Dependencies: NumPy, SciPy
    .PHONY: install
    install:
            sudo apt-get install python-mdp
    .PHONY: source
    source:
            git clone git://mdp-toolkit.git.sourceforge.net/gitroot/mdp-toolkit/mdp-toolkit
Dimensionality reduction

    import mdp
    x = mdp.numx_rand.random((100, 25))  # 25 variables, 100 observations
    y = mdp.pca(x)
    z = mdp.fastica(x, dtype='float32')

Nodes: training and usage

    n = mdp.nodes.PCANode()
    n.train(x)  # learn the principal components of x
    n.stop_training()
    print n.output_dim
    print n.explained_variance
    z = n.execute(y)  # project y onto the PCs learned in training

Inverting the flow

    print n.is_invertible()  # true for a PCA node
    print n.inverse(z)  # get y back

Flows: feed-forward architectures

    flow = mdp.Flow([mdp.nodes.PCANode(output_dim=5), mdp.nodes.CuBICANode()])
    flow = mdp.nodes.PCANode(output_dim=5) + mdp.nodes.CuBICANode()  # equivalent
    flow.train(x)
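What the mdp.pca call above computes can be sketched without MDP, using NumPy's SVD. This is an illustrative sketch of the technique, not MDP's implementation:

```python
import numpy as np

def pca(x, output_dim=None):
    """Project the rows of x onto their principal components.

    Centre the data, take the SVD, and rotate into the basis of
    right singular vectors (the directions of maximal variance).
    """
    centred = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:output_dim].T  # columns = principal directions
    return centred @ components

rng = np.random.default_rng(1)
x = rng.random((100, 25))  # 100 observations, 25 variables
y = pca(x, output_dim=5)
print(y.shape)  # (100, 5)
```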
More dimensions: defs
- Layer = wrapper for a set of nodes trained and executed in parallel
- FlowNode = a node with internal structure
- Switchboard = a data structure for arbitrary routing

Hierarchical networks

    a = mdp.nodes.PCANode()
    b = mdp.nodes.SFANode()
    c = mdp.hinet.FlowNode(mdp.Flow([a, b]))
    layer = mdp.hinet.Layer([n, c])  # n is the PCANode from the earlier slide
PyMC
- Markov chain Monte Carlo for Python
- Can fit Bayesian statistical models with MCMC
- Large suite of statistical distributions
- Building blocks to construct probability models: stochastic, deterministic and potential
- Python for scalar variables; NumPy and hand-optimised Fortran code for arrays

PyMC
- Christopher Fonnesbeck, David Huard, Anand Patil
- USA
- MIT licence
- 3 developers, 100 commits this year, all in January
- Version 2.1 released in January
- 26K LOCs of Python, 10K of Fortran and 3K LOCs of C
PyMC

    # Homepage: http://code.google.com/p/pymc/
    # Dependencies: NumPy, SciPy, matplotlib, pytables, pydot, nose
    .PHONY: install
    # grr... 2.1alpha is missing gp_submodel.py and step_methods.py
    ##dir := pymc-2.1alpha
    dir := pymc
    install: source ##${dir}.tar.gz
            sudo apt-get install python-dev gcc gfortran
            ## tar -xzvf $<
            cd ${dir} && python setup.py config_fc --fcompiler gnu95 build && sudo python setup.py install
    ##pymc-2.1alpha.tar.gz:
    ##        wget http://pymc.googlecode.com/files/pymc-2.1alpha.tar.gz
    .PHONY: source
    source:
            svn checkout http://pymc.googlecode.com/svn/trunk/ pymc
Model module, part 1

    x = numpy.array([-.86, -.3, -.05, .73])
    alpha = pymc.Normal('alpha', mu=0, tau=.01)
    beta = pymc.Normal('beta', mu=0, tau=.01)
    @pymc.deterministic
    def theta(a=alpha, b=beta):
        return pymc.invlogit(a + b * x)

Model module, part 2

    # Binomial likelihood for the data
    d = pymc.Binomial('d',
                      n=numpy.ones(4, dtype=int) * 5,
                      p=theta,
                      value=numpy.array([0., 1., 3., 5.]),
                      observed=True)

Sampling from a distribution

    import pymc
    import model
    S = pymc.MCMC(model, db='pickle')
    S.sample(iter=10000, burn=5000, thin=2)
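Under the hood, PyMC's default step method is Metropolis-Hastings. A toy random-walk Metropolis sampler targeting a standard normal, written in plain NumPy, shows what the iter/burn/thin arguments above mean. Nothing here is PyMC code; it is an illustrative sketch:

```python
import numpy as np

def metropolis(log_target, n_iter, burn, thin, step=1.0, seed=0):
    """Random-walk Metropolis: propose x' = x + N(0, step^2) and
    accept with probability min(1, target(x') / target(x))."""
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = []
    for _ in range(n_iter):
        proposal = x + step * rng.normal()
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal  # accept; otherwise keep the current state
        samples.append(x)
    # discard the burn-in, then keep every thin-th draw
    return np.array(samples[burn::thin])

# Target: standard normal, log-density up to an additive constant
draws = metropolis(lambda x: -0.5 * x * x, n_iter=10000, burn=5000, thin=2)
print(draws.mean(), draws.std())  # roughly 0 and 1
```

Burn-in discards the transient before the chain reaches its stationary distribution; thinning reduces autocorrelation between stored draws.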
PyML
- A Python machine learning package
- Interactive OO framework; focuses on SVMs and other kernel methods
- Tools for feature selection, model selection, syntax for combining classifiers
- Methods for assessing classifier performance
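As background to the kernel methods PyML focuses on: a kernel turns a dataset into a matrix of pairwise similarities, for example the RBF (Gaussian) kernel. This is a plain NumPy sketch of the concept, unrelated to PyML's API:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i . x_j
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, gamma=0.5)
print(K.round(3))  # symmetric, ones on the diagonal
```

An SVM trained with this kernel draws its decision boundary in the implicit feature space induced by K, which is how a linear method handles non-linear class boundaries.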
PyML
- Asa Ben-Hur, Colorado, USA
- Depends: matplotlib, NumPy
- 1 developer (+ research group?) + NSF funding
- 0.7.4.1 out last month
- 18K LOCs of Python, 8K LOCs of C++
- LGPL
PyML

    # Homepage: http://pyml.sourceforge.net/
    # Dependencies: numpy, matplotlib
    .PHONY: install
    install: PyML-0.7.4.1.tar.gz
            tar -xzvf $<
            cd PyML-0.7.4.1 && python setup.py build && sudo python setup.py install
    PyML-0.7.4.1.tar.gz:
            wget https://sourceforge.net/projects/pyml/files/PyML-0.7.4.1.tar.gz/download
hcluster
- Damian Eads, UCSC
- 11K LOCs of Python, 8K of C and 500 of C++
- 2,500 lines of documentation
- 1 developer, active 2007-2008
- New BSD licence
- 8 routines for agglomerative clustering, routines that compute statistics on hierarchies, 21 distance functions
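The kind of routine hcluster provides can be illustrated with a bare-bones single-linkage agglomerative clustering in plain Python. This is a sketch of the idea, not hcluster's API:

```python
import numpy as np

def single_linkage(points, n_clusters):
    """Agglomerative clustering: start from singleton clusters and
    repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(single_linkage(pts, 2))  # two tight pairs of points
```

Real implementations avoid this O(n^3) recomputation; the sketch only shows the merge criterion.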
hcluster

    # Homepage: http://code.google.com/p/scipy-cluster/
    .PHONY: install
    install:
            sudo apt-get install python-hcluster
    .PHONY: source
    source:
            svn checkout http://scipy-cluster.googlecode.com/svn/trunk/ hcluster

NLTK

    # Homepage: http://www.nltk.org/
    # Dependencies: PyYAML
    .PHONY: install
    install:
            sudo apt-get install python-nltk
    .PHONY: source
    source:
            sudo apt-get source python-nltk

mlpy

    # Homepage: https://mlpy.fbk.eu/
    # Dependencies: NumPy, GSL
    .PHONY: install
    install:
            echo "deb http://ppa.launchpad.net/davide-albanese/ppa/ubuntu jaunty main" | sudo tee -a /etc/apt/sources.list
            sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4AEC3064
            sudo apt-get update
            sudo apt-get install python-mlpy
    .PHONY: source
    source:
            svn co https://mlpy.fbk.eu/svn/ mlpy

LIBSVM

    # Homepage: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    .PHONY: install
    install:
            sudo apt-get install libsvm2 python-libsvm
    .PHONY: source
    source:
            sudo apt-get source libsvm2

PyEvolve

    # Homepage: http://pyevolve.sourceforge.net/
    # Dependencies: matplotlib
    .PHONY: install
    install:
            wget http://downloads.sourceforge.net/pyevolve/Pyevolve-0.5-py2.6.egg
            sudo easy_install Pyevolve-0.5-py2.6.egg
    .PHONY: source
    source:
            svn co https://pyevolve.svn.sourceforge.net/svnroot/pyevolve

FANN library

    # Homepage: http://leenissen.dk/fann/
    .PHONY: install
    install:
            sudo apt-get install libfann1-dev
    .PHONY: source
    source:
            sudo apt-get source libfann1
    .PHONY: pythonBindings
    pythonBindings:
            # need to build from source!

Theano

    # Homepage: http://deeplearning.net/software/theano
    # Dependencies: blas
    .PHONY: install
    install:
            sudo apt-get install libblas-dev
            cd Theano && sudo python setup.py install
    .PHONY: source
    source:
            hg clone http://hg.assembla.com/theano Theano

PyBrain

    # Homepage: http://www.pybrain.org/pages/download
    .PHONY: install
    install:
            cd pybrain && sudo python setup.py install
    .PHONY: source
    source:
            git clone git://github.com/pybrain/pybrain.git

Shogun

    # Homepage: http://www.shogun-toolbox.org
    .PHONY: install
    install:
            sudo apt-get install shogun-doc-en shogun-cmdline shogun-python
    .PHONY: source
    source:
            svn checkout https://svn.tuebingen.mpg.de/shogun/trunk shogun

ffnet

    # Homepage: http://ffnet.sourceforge.net
    .PHONY: install
    install:
            cd ffnet && sudo python setup.py install
    .PHONY: source
    source:
            svn co https://ffnet.svn.sourceforge.net/svnroot/ffnet/trunk ffnet
Observations
- There is a lot of activity in ML + Python
- Not everything is packaged for your favourite OS
- But: it is pretty easy to get going / get help
- Documentation is of varying quality, heterogeneous
- But: there's source
- Good libraries are usually centred on research groups / backed by universities

Links
- http://mloss.org
- http://videolectures.net
- http://archive.ics.uci.edu/ml/
Books
- Machine Learning: An Algorithmic Perspective, by Stephen Marsland, 2009
- Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig, 2009
- Programming Collective Intelligence, by Toby Segaran, 2007
- Information Theory, Inference, and Learning Algorithms, by David MacKay, 2003

The End
Are we there yet?
Todd L. Veldhuizen, in "Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf's Law", 2005:
- program size is a rough lower bound for effort
- using a component from a library, we achieve a reduction in size
- components may be whatever form of abstraction we may invent: subroutines, generators of subroutines, generators of generators of sub ...

Grunge end cont'd
- complexity of an object is measured by the length of the smallest program that generates it
- libraries allow us to "compress the incompressible" by taking advantage of the commonality exhibited by programs within a problem domain
- problem domain entropy is defined as a measure of program diversity

Grunge end cont'd
- "libraries are essentially incomplete, and there will always be room for more components in any problem domain"
- "better tools and culture can have only marginal impact on reuse rates if the domain is inherently resistant to reuse"
Machine learning is a high-entropy domain

The future: doable and fascinating
- "There is a fantastic existence proof that learning is possible, which is the bag of water and electricity (together with a few trace chemicals) sitting between your ears." – Stephen Marsland
- "You are a creative genius. Your creative genius is so accomplished that it appears, to you and to others, as effortless. Yet it far outstrips the most valiant efforts of today's fastest supercomputers. To invoke it, you need only open your eyes." – Donald D. Hoffman

Thank you!