Data is everywhere: clickstream data; users are bad at managing FB permissions, so you can get a lot out of the Graph API.
There’s value in learning about data: how people use your site, feature or advertisement personalization.
One thing that enables this is that resources are cheap these days.
Python is a fantastic programming environment for data processing and analytics.
On one end of the spectrum, quick and dirty scripts... on the other, full-featured applications ready for deployment at scale.
Wide variety of toolkits for off-the-shelf analysis or building out your own data processing applications.
For this talk... discussing one flavor of analytics and machine learning: the classification problem.
Intuition: the training set is what you know about the world; train a classifier to predict things that you don’t.
As a concrete example, I started playing around with some baseball stats to illustrate how one might go about building ML applications in python.
Even if you’re not into baseball, you know that the iconic visions of success and failure are the home run and the strikeout: in all the movies, hitting a home run is equivalent to getting the girl, and striking out is seen as a major setback.
As with any machine learning problem, you want to get your data into a classifier-consumable format. That is, labeled feature sets. For each play in the game, keep track of the game state and output a labeled feature bundle representing the situation and its outcome: HR, K, (other)
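A minimal sketch of what one labeled feature bundle might look like in Python. The feature names (`inning`, `strikes`, `batter_hand`, and so on) and the event codes passed in are hypothetical, chosen only for illustration; they are not taken from the SluggerML source.

```python
from typing import Dict, Tuple, Union

# A feature set maps feature names to simple values; a labeled
# feature set pairs it with the outcome label.
FeatureSet = Dict[str, Union[str, int, bool]]
LabeledFeatureSet = Tuple[FeatureSet, str]

LABELS = ("HR", "K", "OTHER")


def label_outcome(event: str) -> str:
    """Collapse a play outcome into HR, K, or OTHER (event codes are hypothetical)."""
    return event if event in ("HR", "K") else "OTHER"


def make_example(game_state: FeatureSet, event: str) -> LabeledFeatureSet:
    """Snapshot the game state at the plate appearance and attach its label."""
    return (dict(game_state), label_outcome(event))


# Hypothetical game state for a single plate appearance.
state = {
    "inning": 7,
    "strikes": 2,
    "day_game": True,
    "batter_hand": "L",
    "pitcher_hand": "R",
}
example = make_example(state, "K")
print(example)
```

Copying the state with `dict(game_state)` means a tracker that keeps mutating one state object between plays will not silently change examples it has already emitted.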
Speed:
- offline: deadline ~ hours, days
- realtime: user waiting on the other side (user actions => milliseconds)
Transparency:
- seeing what’s going on with an algorithm in case the docs aren’t clear
- modifying or patching an algorithm to meet your needs
Support:
- maturity, active development
- how strong is the community around the project? are there tutorials available?
Interface with external packages: useful if you’ve done some analysis already and want to transition to python without throwing away code.
Python toolkits provide sets of algorithms, mostly python implementations; they often use external packages with C bindings, and some even use other toolkits.
DIY: use the external packages yourself.
To give a sampling of what’s available, I chose some toolkits that were last updated within a year.
As a disclaimer...
- Not exhaustive, just a sampling
- Some of these tools I’ve used, some I haven’t!
- I’m sure I’ve missed your favorite, and for that I apologize
Different packages focus on different things, so one isn’t necessarily going to suit all of your needs.
buzz around scikit-learn last year - checked it out recently and it’s been built out a lot
NumPy: fast and efficient arrays.
SciPy: scientific tools and algorithms built on NumPy.
Can also use popular C/C++ implementations through python bindings.
Python is a modular language, so you can always sub out your implementation without disrupting your workflow too much.
Now, as an example of applying these toolkits...
Speed isn’t critical.
Speed is critical (imagine that you’re a coach): baseball is slow, but it’s not THAT slow.
Feature selection identifies predictive features: certain values are strongly correlated with certain labels.
sklearn: wasn’t clear on the documented usage, so I looked at the code.
for a coach
Don’t we need to train our classifier every time we run our web application?
- no: save the trained classifier’s internals on disk! pickle them or pull out a textual representation (another argument for using a package that allows you to do this)
- why compute things twice?
Use generators:
- lots and lots of data: avoid keeping it all in memory
- single-pass algorithm (bayes)
- first-pass conversion to compact data (numpy vectors, not python objects)
- not always possible, but keep it in mind
Take advantage of multiple cores:
- if your processing step has a minimal memory footprint (just one line at a time), do it on multiple cores
- multiple processes on different input files work, and the multiprocessing module is great at this
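The "save it on disk" tip can be sketched with the standard `pickle` module. The `model` dictionary here is a hypothetical stand-in for whatever internals a trained classifier exposes, and the file name is made up; whether a given toolkit's classifier object can be pickled directly depends on its implementation.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained classifier's internals.
model = {"label_counts": {"HR": 2, "K": 5, "OTHER": 93}}


def save_model(obj, path):
    """Serialize the trained model once, after the expensive training step."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Reload the model at application startup instead of retraining."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Write to a fresh temporary directory so the sketch leaves no clutter behind.
model_path = Path(tempfile.mkdtemp()) / "classifier.pkl"
save_model(model, model_path)
restored = load_model(model_path)
print(restored == model)
```

Unpickling can execute arbitrary code, so only load pickle files that your own training job produced.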
You don't need to know everything about the algorithms you use...
...but you can't just blindly apply these things and hope that they magically work.
ml-class.org: free class, provides an excellent foundation and starting point for understanding ML.
In no time, you, too, can be a number muncher.
Source code for SluggerML is on github; it’s kind of a mess, and I’m sorry about that.
And I’m @mattspitz on the twitters.
Transcript of "Practical Machine Learning in Python"
Practical Machine Learning in Python
Matt Spitz via @mattspitz
This is the Age of Aquarius Data
• Data is plentiful
  • application logs
  • external APIs
    • Facebook, Twitter
  • public datasets
• Analysis adds value
  • understanding your users
  • dynamic application decisions
• Storage / CPU time is cheap
Machine Learning in Python
• Python is well-suited for data analysis
• Versatile
  • quick and dirty scripts
  • full-featured, realtime applications
• Mature ML packages
  • tons of choices (see: mloss.org)
  • plug-and-play or DIY
Classification Problem: Terminology
• Data points
  • feature set: “interesting” facts about an event/thing
  • label: a description of that event/thing
• Classification
  • training set: a bunch of labeled feature sets
  • given a training set, build a classifier to predict labels for unlabeled feature sets
SluggerML
• Two questions
  • What features are strong predictors for home runs and strikeouts?
  • Given a particular situation, with what probability will the batter hit a home run or strike out?
• Feature sets represent game state for a plate appearance
  • game: day vs. night, wind direction...
  • at-bat: inning, #strikes, left-right matchup...
  • batter/pitcher: age, weight, fielding position...
• Labels represent outcome
  • HR (home run), K (strikeout), OTHER
• Poor Man’s Sabermetrics
SluggerML: Gathering Data
• Sources
  • Retrosheet
    • play-by-play logs for every game since 1956
  • Sean Lahman’s Baseball Archive
    • detailed stats about individual players
• Coalescing
  • 1st pass, Lahman: create player database
    • shelve module
  • 2nd pass, Retrosheet: track game state, join on player db
• Scrubbing
  • ensure consistency
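The two-pass coalescing step could look roughly like the sketch below, which uses the standard `shelve` module for the on-disk player database. The record layouts, field names, and player IDs are hypothetical and far simpler than the real Lahman and Retrosheet formats; this is an illustration of the pattern, not the SluggerML code.

```python
import os
import shelve
import tempfile


def build_player_db(player_rows, db_path):
    """First pass: index player records by ID in an on-disk shelf."""
    with shelve.open(db_path) as db:
        for row in player_rows:
            db[row["player_id"]] = row


def labeled_plays(play_rows, db_path):
    """Second pass: lazily join each play with its batter's record."""
    with shelve.open(db_path) as db:
        for play in play_rows:
            batter = db.get(play["batter_id"])
            if batter is None:
                continue  # scrubbing: drop plays whose batter is unknown
            features = {
                "inning": play["inning"],
                "batter_bats": batter["bats"],
                "batter_weight": batter["weight"],
            }
            yield features, play["outcome"]


# Made-up records standing in for the two data sources.
players = [
    {"player_id": "batter_a", "bats": "L", "weight": 190},
    {"player_id": "batter_b", "bats": "R", "weight": 215},
]
plays = [
    {"batter_id": "batter_a", "inning": 1, "outcome": "K"},
    {"batter_id": "batter_b", "inning": 1, "outcome": "HR"},
    {"batter_id": "missing", "inning": 2, "outcome": "OTHER"},
]

db_path = os.path.join(tempfile.mkdtemp(), "players")
build_player_db(players, db_path)
training_set = list(labeled_plays(plays, db_path))
print(training_set)
```

Keeping the player table in a shelf rather than a dictionary means the second pass does not need the whole player database in memory.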
SluggerML: Gathering Data
• Training set
  • regular-season games from 1980-2011
  • 5,669,301 plate appearances
    • 135,602 home runs
    • 871,226 strikeouts
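These counts imply the base rate of each label, which is worth computing before training: the classes are heavily imbalanced, and P(HR) is the denominator of the correlation measure used on the feature selection slide. The arithmetic below uses only the three counts from the slide; the variable names are my own.

```python
# Counts taken from the training set slide.
plate_appearances = 5_669_301
home_runs = 135_602
strikeouts = 871_226
others = plate_appearances - home_runs - strikeouts

# Base rate of each label across all plate appearances.
p_hr = home_runs / plate_appearances
p_k = strikeouts / plate_appearances
p_other = others / plate_appearances

print(f"P(HR)    = {p_hr:.2%}")
print(f"P(K)     = {p_k:.2%}")
print(f"P(OTHER) = {p_other:.2%}")
```

With roughly 2.4% home runs and 15.4% strikeouts, a classifier that always answers OTHER would be right about 82% of the time, which is why predicted probabilities are more informative here than a single predicted label.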
Selecting a Toolkit: Tradeoffs
• Speed
  • offline vs. realtime
• Transparency
  • internal visibility
  • customizability
• Support
  • maturity
  • community
Selecting a Toolkit: High-Level Options
• External bindings
  • python interfaces to popular packages
    • Matlab, R, Octave, SHOGUN Toolbox
  • transition legacy workflows
• Python implementations
  • collections of algorithms
  • (mostly) python
  • external subcomponents
• DIY
  • building blocks
Selecting a Toolkit: Python Implementations
• nltk
  • focus on NLP
  • book: Natural Language Processing with Python (O’Reilly ’09)
• mlpy
  • regression, classification, clustering
• PyML
  • focus on SVM
• PyBrain
  • focus on neural networks
Selecting a Toolkit: Python Implementations
• mdp-toolkit
  • data processing management
  • nodes represent tasks in a data workflow
  • scheduling, parallelization
• scikit-learn
  • supervised, unsupervised, feature selection, visualization
  • heavy development, large team
  • excellent documentation
  • active community
Selecting a Toolkit: Do It Yourself
• Basic building blocks
  • NumPy
  • SciPy
• C/C++ implementations
  • LIBLINEAR
  • LIBSVM
  • OpenCV
  • ...your own?
SluggerML: Two Questions
• What features are strong predictors for home runs and strikeouts?
• Given a particular situation, with what probability will the batter hit a home run or strike out?
SluggerML: Feature Selection
• Identifies predictive features
  • strongly correlated with labels
  • predictive: max_benchpress
  • not predictive: favorite_cookie
• scikit-learn: chi-square feature selection
• Visualizing significance
  • for each well-supported value, find correlation with HR/K
  • “well-supported”: >= 0.05% of samples with feature=value
  • correlation: ( P(HR | feature=value) / P(HR) ) - 1
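The "visualizing significance" measure can be written in a few lines of plain Python. This sketch follows the formula and the 0.05% support threshold from the slide; the function name, argument names, and the toy data are assumptions of mine, and it does not reproduce scikit-learn's chi-square selection.

```python
from collections import Counter


def value_correlations(examples, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1} for well-supported values."""
    examples = list(examples)
    total = len(examples)
    label_total = sum(1 for _, y in examples if y == label)
    if total == 0 or label_total == 0:
        return {}
    p_label = label_total / total

    value_counts = Counter()        # how often each value occurs
    value_label_counts = Counter()  # how often it occurs together with the label
    for features, y in examples:
        if feature not in features:
            continue
        value = features[feature]
        value_counts[value] += 1
        if y == label:
            value_label_counts[value] += 1

    correlations = {}
    for value, n in value_counts.items():
        if n / total < min_support:
            continue  # not "well-supported": too few samples with this value
        correlations[value] = (value_label_counts[value] / n) / p_label - 1
    return correlations


# Made-up toy data: 4 home-team plate appearances (2 HR), 6 visiting (1 HR).
toy = (
    [({"team": "home"}, "HR")] * 2
    + [({"team": "home"}, "OTHER")] * 2
    + [({"team": "visiting"}, "HR")] * 1
    + [({"team": "visiting"}, "K")] * 5
)
print(value_correlations(toy, "team", "HR"))
```

A value of 0 means the feature value leaves the label's probability unchanged; positive values mean the label is more likely than average with that value, and negative values mean less likely.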
SluggerML: Feature Selection
[Chart: "Batter: Home vs. Visiting". Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for home team and visiting team.]
SluggerML: Feature Selection
[Chart: "Batter: Fielding Position". Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH.]
SluggerML: Feature Selection
[Chart: "Game: Year". Correlation with Home Run and Strikeout (y-axis from -50.0% to 50.0%) for 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011.]
SluggerML: Realtime Classification
• Given features, predict label probabilities
• nltk: NaiveBayesClassifier
• Web frontend
  • gunicorn, nginx
Tips and Tricks
• Persistent classifier internals
  • once trained, save and reuse
  • depends on implementation
    • string representation may exist
    • create your own
• Using generators where possible
  • avoid keeping data in memory
  • single-pass algorithms
  • conversion pass before training
• Multicore text processing
  • scrubbing: low memory footprint
  • multiprocessing module
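To illustrate the generator and single-pass tips together, here is a small counting naive Bayes trained from a stream of lines. It is a hand-rolled sketch with add-one smoothing, not nltk's NaiveBayesClassifier, and the tab-separated input format, class name, and feature names are all hypothetical.

```python
import io
import math
from collections import Counter, defaultdict


def read_examples(lines):
    """Lazily yield (features, label) from 'LABEL<TAB>key=value,key=value' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, _, raw = line.partition("\t")
        features = dict(pair.split("=", 1) for pair in raw.split(",") if pair)
        yield features, label


class CountingNaiveBayes:
    """Naive Bayes that only accumulates counts, so one pass over the data suffices."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)              # feature -> values seen in training

    def train(self, examples):
        for features, label in examples:  # consumes the generator exactly once
            self.label_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[(label, name)][value] += 1
                self.values[name].add(value)

    def predict_proba(self, features):
        total = sum(self.label_counts.values())
        if total == 0:
            raise ValueError("classifier has not been trained")
        log_scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)
            for name, value in features.items():
                if name not in self.values:
                    continue  # feature never seen in training: ignore it
                count = self.feature_counts[(label, name)][value]
                # Add-one smoothing; assumes every example carries every feature.
                score += math.log((count + 1) / (n + len(self.values[name])))
            log_scores[label] = score
        peak = max(log_scores.values())
        weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Made-up input; in practice `log` would be an open file read line by line.
log = io.StringIO(
    "K\tstrikes=2,hand=L\n"
    "K\tstrikes=2,hand=R\n"
    "HR\tstrikes=0,hand=L\n"
    "OTHER\tstrikes=1,hand=R\n"
    "OTHER\tstrikes=0,hand=R\n"
)
model = CountingNaiveBayes()
model.train(read_examples(log))
probs = model.predict_proba({"strikes": "2", "hand": "L"})
print(probs)
```

Because training only updates counters, memory use grows with the number of distinct feature values rather than with the number of plate appearances.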
The Fine Print™
• Plug-and-play is easy!
• Don’t blindly apply ML
  • understand your data
  • understand your algorithms
  • ml-class.org is an excellent resource
Thanks!
github.com/mattspitz/sluggerml
slideshare.net/mattspitz/practical-machine-learning-in-python
@mattspitz