Who will win XLIX?
R, Data Wrangling &
Data Science
January 18, 2015
@ksankar // doubleclix.wordpress.com
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." - Friedrich Nietzsche
etude
http://en.wikipedia.org/wiki/%C3%89tude, http://www.etudesdemarche.net/articles/etudes-sectorielles.htm,
http://upload.wikimedia.org/wikipedia/commons/2/26/La_Cour_du_Palais_des_%C3%A9tudes_de_l%E2%80%99%C3%89cole_des_beaux-arts.jpg
We will focus on “short”, “acquiring
skill” & “having fun” !
Goals & non-goals
Goals
¤ Get familiar with the R
language & dplyr
¤ Work on a couple of interesting
data science problems
¤ Give you a focused time to
work
§ Work with me. I will wait
if you want to catch-up
¤ Less theory, more usage - let
us see if this works
¤ As straightforward as possible
§ The programs can be
optimized
Non-goals
¡ Go deep into the algorithms
•  We don’t have
sufficient time. The topic
can be easily a 5 day
tutorial !
¡ Dive into R internals
•  That is for another day
¡ A passive talk
•  Nope. Interactive &
hands-on
Activities & Results
o  Activities:
•  Get familiar with R, R Studio
•  Work on a couple of data sets
•  Get familiar with the mechanics of Data Science Competitions
•  Explore the intersection of Algorithms, Data, Intelligence, Inference &
Results
•  Discuss Data Science Horse Sense ;o)
o  Results :
•  Hands-on R
•  Familiar with some of the interesting algorithms
•  Submitted entries for 1 competition
•  Knowledge of Model Evaluation
•  Cross Validation, ROC Curves
About Me
o  Chief Data Scientist at BlackArrow.tv
o  Have been speaking at OSCON, PyCon, Pydata et al
o  Reviewing Packt Book “Machine Learning with Spark”
o  Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o  Have done lots of things:
•  Big Data (Retail, Bioinformatics, Financial, AdTech),
•  Written Books (Web 2.0, Wireless, Java,…)
•  Standards, some work in AI,
•  Guest Lecturer at Naval PG School,…
•  Planning MS-CFinance or Statistics
•  Volunteer as Robotics Judge at First Lego league World Competitions
o  @ksankar, doubleclix.wordpress.com
The Nuthead band!
Setup & Data
R & IDE
o  Install R
o  Install R Studio
Tutorial Materials
o  Github : https://github.com/xsankar/hairy-octo-hipster
o  Clone or download zip
Setup an account in Kaggle (www.kaggle.com)
We will be using the data from 2 Kaggle competitions
①  Titanic: Machine Learning from Disaster
Download data from http://www.kaggle.com/c/titanic-gettingStarted
Directory ~/hairy-octo-hipster/titanic-r
②  Predicting Bike Sharing @ Washington DC
Download data from http://www.kaggle.com/c/bike-sharing-demand/data
Directory ~/hairy-octo-hipster/bike
③  2014 NFL Boxscore
http://www.pro-football-reference.com/years/2014/games.htm
Directory ~/hairy-octo-hipster/nfl
Data
Agenda
o  Jan 18 : 9:00-12:30 3 hrs
o  Intro, Goals, Logistics, Setup [10] [9:00-9:10)
o  Introduction to R & dplyr [30] [9:10-9:40)
o  Who will win Superbowl XLIX ?
The Art of ELO Ranking [30] [9:40-10:10)
•  The Algorithm
•  The Data
•  The Results (Compare with FiveThirtyEight)
o  Anatomy of a Kaggle Competition [40] [10:10-10:50)
•  Competition Mechanics
•  Register, download data, create subdirectories
•  Trial Run : Submit Titanic
o  Break [20] [10:50-11:10)
o  Algorithms for the Amateur Data Scientist [20] [11:10-11:30)
•  Algorithms, Tools & frameworks in perspective
•  “Folk Wisdom”
o  Model Evaluation & Interpretation [30] [11:30 - 12:00)
•  Confusion Matrix, ROC Graph
o  Homework : The Art of a Competition – Bike Sharing
o  Homework : The Art of a Competition – Walmart
Overload Warning … There is enough material for a week’s training … which is good & bad !
Read thru at your pace, refer, ponder & internalize
Close Encounters
—  1st
◦  This Tutorial
—  2nd
◦  Do More Hands-on Walkthrough
—  3rd
◦  Listen To Lectures
◦  More competitions …
Introduction to R
9:10
R Syntax – A quick overview
o aString <- "A String"
o aNumber <- 12
o class(aString)
o class(aNumber)
o aVector <- c(1,2,3,4)
o class(aVector)
o aVector * 2
o sqrt(aVector)
o Packages : dplyr & tidyr
Data wrangling with dplyr
o  dplyr – versatile package for various data operations
o  We will see dplyr in use
o  Resources:
•  “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 Tutorial Slides
•  http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
•  Slides https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
•  Slides of Tutorial by Rstudio’s Garrett Grolemund
•  https://github.com/rstudio/webinars
•  And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
dplyr verbs
o select()
o filter()
o summarise()
o group_by()
o mutate()
o arrange()
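The verbs chain together with the %>% pipe. A minimal sketch on a made-up data frame (dplyr must be installed; the `games` table here is illustrative, not part of the tutorial data):

```r
library(dplyr)

# Toy data frame standing in for a real dataset
games <- data.frame(
  team   = c("NE", "SEA", "NE", "SEA"),
  week   = c(1, 1, 2, 2),
  points = c(33, 36, 30, 21)
)

high_scoring <- games %>%
  select(team, points) %>%               # pick columns
  filter(points > 25) %>%                # pick rows
  mutate(diff_from_30 = points - 30) %>% # add a derived column
  arrange(desc(points))                  # sort

team_avg <- games %>%
  group_by(team) %>%                     # summarise then collapses each group
  summarise(avg_points = mean(points))
```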
dplyr joins
Hiroaki Yutani @yutannihilatio inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
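A quick sketch of the join verbs on two toy tables (dplyr assumed installed; the ids and values are illustrative):

```r
library(dplyr)

teams   <- data.frame(id = c(1, 2, 3), team = c("NE", "SEA", "GB"))
ratings <- data.frame(id = c(1, 2, 4), elo = c(1700, 1680, 1550))

ij <- inner_join(teams, ratings, by = "id")  # only ids present in both: 1 and 2
lj <- left_join(teams, ratings, by = "id")   # all teams; GB gets NA for elo
aj <- anti_join(teams, ratings, by = "id")   # teams with no rating: GB
```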
Who will win Super Bowl
XLIX
9:40
The Art of ELO Ranking
& Super Bowl XLIX
o Let us look at this from 3 angles:
•  The Algorithm
•  The R program
•  The Data
•  The Results
•  Comparing with the
FiveThirtyEight Results
http://www.imdb.com/title/tt1285016/trivia?item=qt1318850
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
The ELO Algorithm (1 of 3)
1.  Basic Chess Algorithm proposed by Elo
•  Arpad Emrick Elo proposed the system for Chess ranking
•  Rnew = Rold + K(S − μ); μij = 1 / (1 + 10^((Rj,old − Ri,old)/400))
•  K – varies depending on the match
•  Sij = 1, ½ or 0
2.  Soccer Ranking
•  http://www.eloratings.net/system.html
3.  NFL Ranking with adjusted factor for scores, 538
Ranking
Ref : Who is #1, Princeton University Press
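The update rule above can be written directly in base R (kFactor = 20, as used later in the slides):

```r
# Expected score mu of team i against team j, then Rnew = Rold + K(S - mu)
elo_expected <- function(r_i, r_j) {
  1 / (1 + 10^((r_j - r_i) / 400))
}

elo_update <- function(r_i, r_j, s_i, k = 20) {
  # s_i = 1 (win), 0.5 (tie) or 0 (loss) for team i
  r_i + k * (s_i - elo_expected(r_i, r_j))
}

elo_expected(1500, 1500)   # evenly rated teams: 0.5
elo_update(1500, 1500, 1)  # winner gains K/2 = 10 points: 1510
```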
The ELO Algorithm (2 of 3)
NFL Ranking
http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
The ELO Algorithm (3 of 3)
NFL Ranking
The Data
http://www.pro-football-reference.com/years/2014/games.htm
The R Code
https://github.com/xsankar/hairy-octo-hipster
The Analysis - Ranks
The Analysis – Week 1, Week 18
Analysis – Week 20 Results
Wisdom from Nate Silver & the 538 Gang …
o  [Homework #1] Improve our core algorithm
to add the Margin of victory from the 538
gang !
•  Remember, kFactor = 20
o  [Homework #2] Weigh recent games more
heavily w/ Exponential Decay
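For Homework #1, a sketch of a 538-style margin-of-victory multiplier that scales the K(S − μ) term. The constants (log damping, the 2.2 autocorrelation term) follow FiveThirtyEight's published description; treat them as an assumption and verify against their article:

```r
# Margin-of-victory multiplier; winner_elo_diff is winner's Elo minus loser's
mov_multiplier <- function(point_diff, winner_elo_diff) {
  log(abs(point_diff) + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))
}

# The update then becomes: Rnew = Rold + K * mov * (S - mu)
mov_multiplier(1, 0)      # narrow win between evenly rated teams
mov_multiplier(28, 100)   # blowout by the Elo favourite is damped
mov_multiplier(28, -100)  # the same blowout by the underdog counts for more
```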
The Art of ELO Ranking
& Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
Ref : Who is #1, Princeton University Press
References:
o  ELO ranking – NFL, Soccer
•  http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
•  http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/
•  http://www.eloratings.net/system.html
o  dplyr
•  http://www.rstudio.com/resources/webinars/ <- github for the slides
•  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/
•  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/
•  http://www.rstudio.com/resources/cheatsheets/
•  http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
Anatomy Of a Kaggle
Competition 10:10
Kaggle Data Science Competitions
o  Hosts Data Science Competitions
o  Competition Attributes:
•  Dataset
•  Train
•  Test (Submission)
•  Final Evaluation Data Set (We don’t
see)
•  Rules
•  Time boxed
•  Leaderboard
•  Evaluation function
•  Discussion Forum
•  Private or Public
Titanic Passenger Metadata
•  Small
•  3 Predictors
•  Class
•  Sex
•  Age
•  Survived?
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction (Washington DC)
Walmart Store Forecasting
Train.csv
Taken from Titanic Passenger Manifest

Variable – Description
Survived – 0 = No, 1 = Yes
Pclass – Passenger Class (1st, 2nd, 3rd)
Sibsp – Number of Siblings/Spouses Aboard
Parch – Number of Parents/Children Aboard
Embarked – Port of Embarkation
o  C = Cherbourg
o  Q = Queenstown
o  S = Southampton

Test.csv
Submission
o 418 lines; 1st column should have 0 or 1 in each line
o Evaluation:
•  % correctly predicted
Approach
o  This is a classification problem - 0 or 1
o  Comb the forums !
o  Opportunity for us to try different algorithms & compare them
•  Simple Model
•  CART[Classification & Regression Tree]
•  Greedy, top-down binary, recursive partitioning that divides feature space into sets
of disjoint rectangular regions
•  RandomForest
•  Different parameters
•  SVM
•  Multiple kernels
•  Table the results
o  Use cross validation to predict our model performance & correlate with what Kaggle
says
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Simple Model – Our First Submission
o #1 : Simple Model (M=survived)
o #2 : Simple Model (F=survived)
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
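A sketch of the simple gender model on a made-up five-row stand-in for train.csv (the column names follow the Kaggle data; the rows do not):

```r
# Model #2 (F = survived): predict 1 for female passengers, 0 for male
train <- data.frame(
  Survived = c(1, 0, 1, 0, 1),
  Sex      = c("female", "male", "female", "male", "female")
)

pred <- ifelse(train$Sex == "female", 1, 0)
acc  <- mean(pred == train$Survived)   # training accuracy of the rule

# For the real submission, predict on test.csv and write one row per passenger:
# write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred),
#           "submission.csv", row.names = FALSE)
```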
#3 : Simple CART Model
o CART (Classification & Regression Tree)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
May be better, because we have improved on the survival of men!
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
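A minimal CART sketch with rpart (bundled with R); the toy rows below stand in for train.csv:

```r
library(rpart)

train <- data.frame(
  Survived = factor(c(1, 0, 1, 0, 1, 0, 1, 0)),
  Sex      = c("female", "male", "female", "male",
               "female", "male", "female", "male"),
  Pclass   = c(1, 3, 2, 3, 1, 2, 3, 1)
)

# Fit a classification tree; minsplit lowered only because the toy set is tiny
fit  <- rpart(Survived ~ Sex + Pclass, data = train,
              method = "class", minsplit = 2)
pred <- predict(fit, train, type = "class")
table(pred, train$Survived)   # confusion matrix on the training data
```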
#4 : Random Forest Model
o  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
•  Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
o  https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
o  https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
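A random forest sketch; assumes the randomForest package is installed, and uses synthetic rows in place of the real train.csv:

```r
library(randomForest)
set.seed(42)

train <- data.frame(
  Survived = factor(rep(c(1, 0), 20)),
  Sex      = factor(rep(c("female", "male"), 20)),
  Pclass   = rep(c(1, 3, 2, 3), 10)
)

# ntree and other parameters are the knobs to experiment with
fit  <- randomForest(Survived ~ Sex + Pclass, data = train, ntree = 150)
fit$confusion            # OOB confusion matrix & class errors
pred <- predict(fit, train)
```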
#5 : SVM
o Multiple Kernels
o kernel = ‘radial’ # Radial Basis Function
o kernel = ‘sigmoid’
o  agconti's blog - Ultimate Titanic !
o  http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
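A sketch of trying both kernels with e1071 (package assumed installed); a one-predictor synthetic set stands in for the Titanic data:

```r
library(e1071)
set.seed(42)

train <- data.frame(
  Survived = factor(c(rep(1, 20), rep(0, 20))),
  Age      = c(rnorm(20, 25, 5), rnorm(20, 45, 5))
)

fit_rbf <- svm(Survived ~ Age, data = train, kernel = "radial")
fit_sig <- svm(Survived ~ Age, data = train, kernel = "sigmoid")
mean(predict(fit_rbf, train) == train$Survived)   # training accuracy, RBF
```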
Feature Engineering - Homework
o  Add attribute : Age
•  In train 714/891 have age; in test 332/418 have age
•  Missing values can be just Mean Age of all passengers
•  We could be more precise and calculate Mean Age based on Title (Ms,
Mrs, Master et al)
•  Box plot age
o  Add attribute : Mother, Family size et al
o  Feature engineering ideas
•  http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o  More ideas at
http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o  And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
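Both imputation ideas above can be sketched in base R; the names and ages below are made up for illustration:

```r
train <- data.frame(
  Name = c("Smith, Mr. John", "Doe, Mrs. Jane", "Roe, Master. Tim",
           "Poe, Mr. Edgar", "Lee, Mrs. Ann", "Kay, Master. Bob"),
  Age  = c(40, 30, NA, NA, 35, 5)
)

# Option 1: fill missing ages with the overall mean
train$AgeSimple <- ifelse(is.na(train$Age),
                          mean(train$Age, na.rm = TRUE), train$Age)

# Option 2 (more precise): mean age per title parsed out of the name
train$Title <- sub(".*, (\\w+)\\..*", "\\1", train$Name)
title_mean  <- ave(train$Age, train$Title,
                   FUN = function(a) mean(a, na.rm = TRUE))
train$AgeByTitle <- ifelse(is.na(train$Age), title_mean, train$Age)
train[, c("Title", "Age", "AgeSimple", "AgeByTitle")]
```

A Master gets a child-like age instead of the adult overall mean, which is the point of imputing by title.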
What does it mean ? Let us ponder ….
o  We have a training data set representing a domain
•  We reason over the dataset & develop a model to predict outcomes
o  How good is our prediction when it comes to real life scenarios ?
o  The assumption is that the dataset is taken at random
•  Or Is it ? Is there a Sampling Bias ?
•  i.i.d ? Independent ? Identically Distributed ?
•  What about homoscedasticity ? Do they have the same finite variance ?
o  Can we assure that another dataset (from the same domain) will give us the same
result ?
o  Will our model & its parameters remain the same if we get another data set ?
o  How can we evaluate our model ?
o  How can we select the right parameters for a selected model ?
Break
10:50 - 11:10
Algorithms for the
Amateur Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any
man who can hitch the length and breadth of the Galaxy, rough it … win through, and still
know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …
11:10
Ref: Anthony’s Kaggle Presentation
Data Scientists apply different techniques
•  Support Vector Machine
•  adaBoost
•  Bayesian Networks
•  Decision Trees
•  Ensemble Methods
•  Random Forest
•  Logistic Regression
•  Genetic Algorithms
•  Monte Carlo Methods
•  Principal Component Analysis
•  Kalman Filter
•  Evolutionary Fuzzy Modelling
•  Neural Networks
Quora
•  http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Algorithm spectrum
o  Regression
o  Logit
o  CART
o  Ensemble : Random Forest
o  Clustering
o  KNN
o  Genetic Alg
o  Simulated Annealing
o  Collab Filtering
o  SVM
o  Kernels
o  SVD
o  NNet
o  Boltzman Machine
o  Feature Learning
The spectrum runs from Machine Learning through “Cute Math” to Artificial Intelligence
Classifying Classifiers
o  Statistical
•  Regression
•  Naïve Bayes
•  Bayesian Networks
•  Logistic Regression (1: Max Entropy Classifier)
o  Structural
•  Rule-based : Production Rules, Decision Trees
•  Distance-based
•  Functional : Linear, Spectral, Wavelet
•  Nearest Neighbor : kNN, Learning Vector Quantization
•  Neural Networks : Multi-layer Perceptron
•  Ensemble : Random Forests, SVM, Boosting
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
[Diagram] Classifiers (Categorical Variables) vs. Regression (Continuous Variables): Decision Trees, k-NN (Nearest Neighbors), CART, Boosting & Bagging placed along axes of Bias vs. Variance and Model Complexity / Over-fitting
Data Science
“folk knowledge”
Data Science “folk knowledge” (1 of A)
o  "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer
Mediated Transactions
o  Learning = Representation + Evaluation + Optimization
o  It’s Generalization that counts
•  The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o  Data alone is not enough
•  Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond
it
o  Machine Learning is not magic – one cannot get something from nothing
•  In order to infer, one needs the knobs & the dials
•  One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (2 of A)
o  Over fitting has many faces
•  Bias – Model not strong enough. So the learner has the tendency to learn the
same wrong things
•  Variance – Learning too much from one dataset; model will fall apart (ie much
less accurate) on a different dataset
•  Sampling Bias
o  Intuition Fails in High Dimensions – Bellman
•  Blessing of non-conformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold, e.g. the space of digits is much smaller than the space of images
o  Theoretical Guarantees are not What they Seem
•  One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.
o  Feature engineering is the Key
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (3 of A)
o  More Data Beats a Cleverer Algorithm
•  Or conversely select algorithms that improve with data
•  Don’t optimize prematurely without getting more data
o  Learn many models, not Just One
•  Ensembles ! – Change the hypothesis space
•  Netflix prize
•  E.g. Bagging, Boosting, Stacking
o  Simplicity Does not necessarily imply Accuracy
o  Representable Does not imply Learnable
•  Just because a function can be represented does not mean
it can be learned
o  Correlation Does not imply Causation
o  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o  A few useful things to know about machine learning - by Pedro Domingos
§  http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (4 of A)
o  The simplest hypothesis that fits the data is also the most
plausible
•  Occam’s Razor
•  Don’t go for a 4 layer Neural Network unless
you have that complex data
•  But that doesn’t also mean that one should
choose the simplest hypothesis
•  Match the impedance of the domain, data & the
algorithms
o  Think of over fitting as memorizing as opposed to learning.
o  Data leakage has many forms
o  Sometimes the Absence of Something is Everything
o  [Corollary] Absence of Evidence is not the Evidence of
Absence
New to Machine Learning? Avoid these three mistakes, James Faghmous
https://medium.com/about-data/73258b3848a4
§  Simple Model
•  High error line that cannot be compensated with more data
•  Gets to a lower error rate with fewer data points
§  Complex Model
•  Lower error line
•  But needs more data points to reach decent error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Importance of feature selection & weak models
o “Good features allow a simple model to beat a complex model”-Ben Lorica1
o “… using many weak predictors will always be more accurate than using a few
strong ones …” –Vladimir Vapnik2
o “A good decision rule is not a simple one, it cannot be described by a very few
parameters” 2
o “Machine learning science is not only about computers, but about humans, and
the unity of logic, emotion, and culture.” 2
o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well,
but it can’t surprise you” – Hadley Wickham3
1 http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html
2 http://nautil.us/issue/6/secret-codes/teaching-me-softly
3 http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
Check your assumptions
o  The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data
o  For example, for regression one should check that:
① Variables are normally distributed
•  Test for normality via visual inspection, skew & kurtosis, outlier inspections via
plots, z-scores et al
② There is a linear relationship between the dependent & independent
variables
•  Inspect residual plots, try quadratic relationships, try log plots et al
③ Variables are measured without error
④ Assumption of Homoscedasticity
§  Homoscedasticity assumes constant or near constant error variance
§  Check the standard residual plots and look for heteroscedasticity
§  For example in the figure, left box has the errors scattered randomly around zero; while the
right two diagrams have the errors unevenly distributed
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test,
http://pareonline.net/getvn.asp?v=8&n=2
Data Science “folk knowledge” (5 of A)
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
The World (Knowns & Unknowns) vs. You (Known & Unknown):
o  World Knowns / You Known : What we do
o  World Knowns / You Unknown : Others know, you don’t
o  World Unknowns / You Known : Potential facts, outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o  World Unknowns / You Unknown : Facts, outcomes or scenarios we have not encountered, nor considered; “Black swans”, outliers, long tails of probability distributions; lack of experience, imagination
o  Known Knowns
o  There are things we know that we know
o  Known Unknowns
o  That is to say, there are things that we
now know we don't know
o  But there are also Unknown Unknowns
o  There are things we do not know we
don't know
Data Science “folk knowledge” (6 of A) - Pipeline
Collect | Store | Transform
o  Volume, Velocity, Streaming Data
o  Canonical form, Data catalog, Data Fabric across the organization
o  Access to multiple sources of data
o  Think Hybrid – Big Data Apps, Appliances & Infrastructure
Reason | Model | Deploy
o  Metadata
o  Monitor counters & Metrics
o  Structured vs. Multi-structured
o  Flexible & Selectable : Data Subsets, Attribute sets
o  Refine model with Extended Data subsets, Engineered Attribute sets
o  Validation run across a larger data set
o  Scalable Model Deployment
o  Big Data automation & purpose built appliances (soft/hard)
o  Manage SLAs & response times
Data Management / Data Science
o  Dynamic Data Sets
o  2-way key-value tagging of datasets
o  Extended attribute sets
o  Advanced Analytics
Explore | Visualize | Recommend | Predict
o  Performance, Scalability, Refresh Latency, In-memory Analytics
o  Advanced Visualization, Interactive Dashboards, Map Overlay, Infographics
¤  Bytes to Business a.k.a. Build the full stack
¤  Find Relevant Data For Business
¤  Connect the Dots
Volume, Velocity, Variety
Data Science “folk knowledge” (7 of A)
Context, Connectedness, Intelligence, Interface, Inference
“Data of unusual size”
that can't be brute forced
o  Three Amigos
o  Interface = Cognition
o  Intelligence = Compute(CPU) & Computational(GPU)
o  Infer Significance & Causality
Data Science “folk knowledge” (8 of A)
Jeremy’s Axioms
o  Iteratively explore data
o  Tools
•  Excel Format, Perl, Perl Book
o  Get your head around data
•  Pivot Table
o  Don’t over-complicate
o  If people give you data, don’t assume that you
need to use all of it
o  Look at pictures !
o  History of your submissions – keep a tab
o  Don’t be afraid to submit simple solutions
•  We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Data Science “folk knowledge” (9 of A)
①  Common Sense (some features make more sense then others)
②  Carefully read these forums to get a peek at other peoples’ mindset
③  Visualizations
④  Train a classifier (e.g. logistic regression) and look at the feature weights
⑤  Train a decision tree and visualize it
⑥  Cluster the data and look at what clusters you get out
⑦  Just look at the raw data
⑧  Train a simple classifier, see what mistakes it makes
⑨  Write a classifier using handwritten rules
⑩  Pick a fancy method that you want to apply (Deep Learning/Nnet)
-- Maarten Bosma
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
Data Science “folk knowledge” (A of A)
Lessons from Kaggle Winners
①  Don’t over-fit
②  All predictors are not needed
•  All data rows are not needed, either
③  Tuning the algorithms will give different results
④  Reduce the dataset (Average, select transition data,…)
⑤  Test set & training set can differ
⑥  Iteratively explore & get your head around data
⑦  Don’t be afraid to submit simple solutions
⑧  Keep a tab & history of your submissions
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
Data Scientist (noun): Person who is better at
statistics than any software engineer & better
at software engineering than any statistician
– Josh Wills (Cloudera)
Data Scientist (noun): Person who is worse at
statistics than any statistician & worse at
software engineering than any software
engineer – Will Cukierski (Kaggle)
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
Essential Reading List
o  A few useful things to know about machine learning - by Pedro Domingos
•  http://dl.acm.org/citation.cfm?id=2347755
o  The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
•  http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o  http://www.no-free-lunch.org/
o  Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
•  http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o  A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
•  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o  Avoid these three mistakes, James Faghmous
•  https://medium.com/about-data/73258b3848a4
o  Leakage in Data Mining: Formulation, Detection, and Avoidance
•  http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
①  An Introduction to Statistical Learning
•  http://www-bcf.usc.edu/~gareth/ISL/
②  ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
•  http://online.stanford.edu/course/statistical-learning-winter-2014
③  Prof. Pedro Domingos
•  https://class.coursera.org/machlearning-001/lecture/preview
④  Prof. Andrew Ng
•  https://class.coursera.org/ml-003/lecture/preview
⑤  Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
•  https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥  Mathematicalmonk @ YouTube
•  https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦  The Elements Of Statistical Learning
•  http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
Of Models,
Performance, Evaluation
& Interpretation
11:30
Bias/Variance (1 of 2)
o Model Complexity
•  Complex Model increases the
training data fit
•  But then it overfits & doesn't
perform as well with real data
o  Bias vs. Variance
o  Classical diagram
o  From ESL II, by Hastie, Tibshirani & Friedman
o  Bias – Model learns wrong things; not
complex enough; error gap small; more
data by itself won’t help
o  Variance – Different dataset will give
different error rate; over fitted model;
larger error gap; more data could help
Learning Curve : Prediction Error & Training Error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Bias/Variance (2 of 2)
o High Bias
•  Due to Underfitting
•  Add more features
•  More sophisticated model
•  Quadratic Terms, complex equations,…
•  Decrease regularization
o High Variance
•  Due to Overfitting
•  Use fewer features
•  Use more training sample
•  Increase Regularization
Learning Curve : Prediction Error & Training Error
Ref: Strata 2013 Tutorial by Olivier Grisel
High bias : need more features or a more complex model to improve
High variance : need more data to improve
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
Partition Data !
•  Training (60%)
•  Validation(20%) &
•  “Vault” Test (20%) Data sets
k-fold Cross-Validation
•  Split data into k equal parts
•  Fit model to k-1 parts &
calculate prediction error on kth
part
•  Non-overlapping dataset
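The fold bookkeeping can be hand-rolled in a few lines of base R (the model fit itself is stubbed out; only the split mechanics are shown):

```r
set.seed(42)
n <- 100   # pretend the training set has 100 rows
k <- 5
folds <- sample(rep(1:k, length.out = n))   # non-overlapping fold label per row

cv_errors <- sapply(1:k, function(i) {
  train_idx <- which(folds != i)   # fit the model on these k-1 parts
  test_idx  <- which(folds == i)   # calculate prediction error on the kth part
  # stub: a real run would return mean(predict(fit, data[test_idx, ]) != y[test_idx])
  length(test_idx) / n
})
mean(cv_errors)   # the cross-validated estimate (here 1/k = 0.2 by construction)
```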
Data Partition &
Cross-Validation
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
K-fold CV (k=5) : partition the data into Train / Validate / Test; in each of the 5 rounds a different fold (#1 … #5) is held out for validation while the remaining four train
Bootstrap
•  Draw datasets (with replacement) and fit model for each dataset
•  Remember : Data Partitioning (#1) & Cross Validation (#2) are without
replacement
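A base-R sketch of the bootstrap, estimating the standard error of a mean by resampling with replacement:

```r
set.seed(42)
x <- rnorm(50)   # stand-in sample

# Draw 1000 datasets (with replacement) and fit the "model" (here, the mean)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)            # bootstrap standard error of the mean
sd(x) / sqrt(length(x))   # analytic estimate, for comparison
```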
Bootstrap & Bagging
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
Bagging (Bootstrap aggregation)
◦  Average prediction over a collection of bootstrapped samples, thus reducing variance

Boosting
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
◦  “Output of weak classifiers into a powerful committee”
◦  Final Prediction = weighted majority vote
◦  Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦  AdaBoost (Adaptive Boosting)
◦  Boosting vs Bagging
–  Bagging – independent trees
–  Boosting – successively weighted
◦  Builds a large collection of de-correlated trees & averages them
◦  Improves Bagging by selecting i.i.d* random variables for splitting
◦  Simpler to train & tune
◦  “Do remarkably well, with very little tuning required” – ESL II
◦  Less susceptible to over fitting (than boosting)
◦  Many RF implementations
–  Original version - Fortran-77 ! By Breiman/Cutler
–  Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Random Forests+
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
◦  Two Step
–  Develop a set of learners
–  Combine the results to develop a composite predictor
◦  Ensemble methods can take the form of:
–  Using different algorithms
–  Using the same algorithm with different settings
–  Assigning different parts of the dataset to different classifiers
◦  Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Ensemble Methods
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
Random Forests
o  While Boosting splits based on best among all variables, RF splits based on best among
randomly chosen variables
o  Simpler because it requires two variables – no. of Predictors (typically √k) & no. of trees
(500 for large dataset, 150 for smaller)
o  Error prediction
•  For each iteration, predict for dataset that is not in the sample (OOB data)
•  Aggregate OOB predictions
•  Calculate Prediction Error for the aggregate, which is basically the OOB
estimate of error rate
•  Can use this to search for optimal # of predictors
•  We will see how close this is to the actual error in the Heritage Health Prize
o  Assumes equal cost for mis-prediction. Can add a cost function
o  Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
[Diagram] Classifiers (Categorical Variables) vs. Regression (Continuous Variables): Decision Trees, k-NN (Nearest Neighbors), CART, Boosting & Bagging placed along axes of Bias vs. Variance and Model Complexity / Over-fitting
Model Evaluation &
Interpretation
Relevant Digression
Cross Validation
o Reference:
•  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
•  Chris Clark's blog:
http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
•  Predictive Modelling in Python with scikit-learn, Olivier Grisel, Strata 2013
•  titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
Model Evaluation - Accuracy
o Accuracy = (tp + tn) / (tp + fp + fn + tn)
o For cases where tn is large compared to tp, a degenerate return(false) will be
very accurate !
o Hence the F-measure is a better reflection of the model strength

            Predicted=1            Predicted=0
Actual=1    True+ (tp)             False- (fn) – Type II
Actual=0    False+ (fp) – Type I   True- (tn)
Model Evaluation – Precision & Recall
o  Precision = How many of the items we identified are relevant
o  Recall = How many of the relevant items did we identify
o  Inverse relationship – the tradeoff depends on the situation
•  Legal – Coverage is more important than correctness
•  Search – Accuracy is more important
•  Fraud
•  Support cost (high fp) vs. wrath of credit card co. (high fn)

Precision = tp / (tp + fp)   •  a.k.a. Accuracy, Relevancy
Recall = tp / (tp + fn)   •  a.k.a. True +ve Rate, Coverage, Sensitivity, Hit Rate
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

fp rate = fp / (fp + tn)   •  a.k.a. Type I Error Rate, False +ve Rate, False Alarm Rate
•  Specificity = 1 – fp rate
•  Type I Error = fp ; Type II Error = fn
Predicted=1	
   Predicted=0	
  
Actual	
  =1	
   True+	
  (tp)	
   False-­‐	
  (fn)	
  -­‐	
  Type	
  II	
  
Actual=0	
   False+	
  (fp)	
  -­‐	
  Type	
  I	
   True-­‐	
  (tn)	
  
Confusion Matrix

            Predicted
Actual      C1    C2    C3    C4
C1          10     5     9     3
C2           4    20     3     7
C3           6     4    13     3
C4           2     1     4    15

Correct ones are on the diagonal (cii)
Precision(i) = cii / Σj cji   (sum down column i)
Recall(i) = cii / Σj cij   (sum across row i)
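The per-class formulas generalize directly to code. Using the 4-class matrix from this slide (rows = actual, columns = predicted; the helper name is illustrative, not from the deck):

```python
# The 4-class confusion matrix from the slide
C = [[10,  5,  9,  3],
     [ 4, 20,  3,  7],
     [ 6,  4, 13,  3],
     [ 2,  1,  4, 15]]

def per_class_metrics(C):
    """Precision = diagonal / column sum; Recall = diagonal / row sum."""
    n = len(C)
    precision = [C[i][i] / sum(C[j][i] for j in range(n)) for i in range(n)]
    recall    = [C[i][i] / sum(C[i][j] for j in range(n)) for i in range(n)]
    return precision, recall

precision, recall = per_class_metrics(C)
# e.g. for C1: precision = 10/22 (column sum 22), recall = 10/27 (row sum 27)
```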
Model Evaluation : F-Measure
Precision = tp / (tp+fp) ; Recall = tp / (tp+fn)
F-Measure
Balanced, Combined, Weighted Harmonic Mean; measures effectiveness

F = 1 / (α·(1/P) + (1 – α)·(1/R)) = (β² + 1)PR / (β²P + R)

Common Form (Balanced F1) : β=1 (α = ½) ; F1 = 2PR / (P + R)
Hands-on Walkthru - Model Evaluation
Train : 712 (80%) | Test : 179 | Total : 891
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval
Kappa measure is interesting
Refer to 2-Model_Evaluation.R
at https://github.com/xsankar/hairy-octo-hipster/
ROC Analysis
o “How good is my model?”
o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing,
organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate
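Each point on a ROC graph is the (fp rate, tp rate) pair you get at one score threshold; sweeping the threshold traces the curve. A small illustrative sketch (pure Python; `roc_points` is a name I chose, not a library call):

```python
def roc_points(scores, labels):
    """(fp rate, tp rate) pairs obtained by sweeping a threshold over the scores.
    labels are 1 (positive) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for thr in sorted(set(scores), reverse=True):   # highest threshold first
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.55], [1, 1, 0, 1, 0])
```

A perfect ranker hugs the north-west corner; a random one tracks the diagonal.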
ROC Graph - Discussion
o  E = Conservative, Everything NO
o  H = Liberal, Everything YES
o  Am not making any political statement !
o  F = Ideal
o  G = Worst
o  The diagonal is the chance line
o  North-West corner is good
o  South-East is bad
•  For example E
•  Believe it or not - I have actually seen a graph with the curve in this region !
[ROC graph with points E, F, G, H marked]
ROC Graph – Clinical Example
IFCC : Measures of diagnostic accuracy: basic definitions
ROC Graph Walk thru
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
The Beginning As The End
Who will win Super Bowl XLIX ?
12:15
References:
o  An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
•  http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-
learning
o  Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
•  http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o  Just The Basics, Strata 2013, William Cukierski & Ben Hamner
•  http://strataconf.com/strata2013/public/schedule/detail/27291
o  The Problem of Multiple Testing
•  http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/
PIIS1934148209014609.pdf
Homework:
Bike Sharing at Washington DC
12:30
A few interesting links - comb the forums
o  Quick First prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing
•  Solution by Brandon Harris
o  Random forest http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language
o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-
prediction
o  GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm
o  Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare
o  Ggplot http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r
o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances
o  Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour
o  Casual & Registered Users :
http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count
o  RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please
o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r
o  Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data
o  Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402
o  Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
Data Organization – train, test & submission
•  datetime - hourly date + timestamp
•  Season
•  1 = spring, 2 = summer, 3 = fall, 4 = winter
•  holiday - whether the day is considered a holiday
•  workingday - whether the day is neither a weekend nor holiday
•  Weather
•  1: Clear, Few clouds, Partly cloudy, Partly cloudy
•  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
•  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds,
Light Rain + Scattered clouds
•  4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
•  temp - temperature in Celsius
•  atemp - "feels like" temperature in Celsius
•  humidity - relative humidity
•  windspeed - wind speed
•  casual - number of non-registered user rentals initiated
•  registered - number of registered user rentals initiated
•  count - number of total rentals
Approach
o Convert to factors
o Engineer new features from date
o Explore other synthetic features
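"Engineer new features from date" means splitting the hourly datetime string into hour, month, year, and weekday columns. The session code does this in R (3-Session-I-Bikes.R); an equivalent illustrative sketch in Python (the helper name is mine):

```python
from datetime import datetime

def engineer(dt_string):
    """Derive hour / month / year / weekday features from the Kaggle
    bike-sharing 'datetime' column (format: YYYY-MM-DD HH:MM:SS)."""
    dt = datetime.strptime(dt_string, "%Y-%m-%d %H:%M:%S")
    return {"hour": dt.hour, "month": dt.month,
            "year": dt.year, "weekday": dt.weekday()}

feats = engineer("2011-01-20 17:00:00")
```

Hour in particular matters: commute peaks around 8am and 5pm dominate the rental counts, and a model that never sees the hour cannot capture them.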
#1 : ctree
Refer to 3-Session-I-Bikes.R
at https://github.com/xsankar/hairy-octo-hipster/
#2 : Add Month + year
#3 : RandomForest

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 

R, Data Wrangling & Predicting NFL with Elo like Nate Silver & 538

  • 5. About Me o  Chief Data Scientist at BlackArrow.tv o  Have been speaking at OSCON, PyCon, PyData et al o  Reviewing Packt Book “Machine Learning with Spark” o  Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark” o  Have done lots of things: •  Big Data (Retail, Bioinformatics, Financial, AdTech) •  Written books (Web 2.0, Wireless, Java, …) •  Standards, some work in AI •  Guest Lecturer at Naval PG School, … •  Planning MS-CFinance or Statistics •  Volunteer as Robotics Judge at First Lego League World Competitions o  @ksankar, doubleclix.wordpress.com The Nuthead band!
  • 6. Setup & Data R & IDE o  Install R o  Install R Studio Tutorial Materials o  Github : https://github.com/xsankar/hairy-octo-hipster o  Clone or download zip Setup an account in Kaggle (www.kaggle.com) We will be using the data from 2 Kaggle competitions ①  Titanic: Machine Learning from Disaster Download data from http://www.kaggle.com/c/titanic-gettingStarted Directory ~/hairy-octo-hipster/titanic-r ②  Predicting Bike Sharing @ Washington DC Download data from http://www.kaggle.com/c/bike-sharing-demand/data Directory ~/hairy-octo-hipster/bike ③  2014 NFL Boxscore http://www.pro-football-reference.com/years/2014/games.htm Directory ~/hairy-octo-hipster/nfl Data
  • 7. Agenda o  Jan 18 : 9:00-12:30 3 hrs o  Intro, Goals, Logistics, Setup [10] [9:00-9:10) o  Introduction to R & dplyr [30] [9:10-9:40) o  Who will win Superbowl XLIX ? The Art of ELO Ranking [30] [9:40-10:10) •  The Algorithm •  The Data •  The Results (Compare with FiveThirtyEight) o  Anatomy of a Kaggle Competition [40] [10:10-10:50) •  Competition Mechanics •  Register, download data, create sub directories •  Trial Run : Submit Titanic o  Break [20] [10:50-11:10) o  Algorithms for the Amateur Data Scientist [20] [11:10-11:30) •  Algorithms, Tools & frameworks in perspective •  “Folk Wisdom” o  Model Evaluation & Interpretation [30] [11:30 - 12:00) •  Confusion Matrix, ROC Graph o  Homework : The Art of a Competition – Bike Sharing o  Homework : The Art of a Competition – Walmart
  • 8. Overload Warning … There is enough material for a week’s training … which is good & bad ! Read thru at your pace, refer, ponder & internalize
  • 9. Close Encounters — 1st ◦ This Tutorial — 2nd ◦ Do More Hands-on Walkthrough — 3rd ◦ Listen To Lectures ◦ More competitions …
  • 11. R Syntax – A quick overview o aString <- "A String" o aNumber <- 12 o class(aString) o class(aNumber) o aVector <- c(1,2,3,4) o class(aVector) o aVector * 2 o sqrt(aVector) o Packages : dplyr & tidyr
  • 12. Data wrangling with dplyr o  dplyr – a versatile package for various data operations o  We will see dplyr in use o  Resources: •  “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 Tutorial •  http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/ •  Slides https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a •  Slides of Tutorial by RStudio’s Garrett Grolemund •  https://github.com/rstudio/webinars •  And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
  • 14. dplyr joins Hiroaki Yutani @yutannihilatio inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
  • 15. dplyr joins Hiroaki Yutani @yutannihilatio inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
  • 16. Who will win Super Bowl XLIX 9:40
  • 17. The Art of ELO Ranking & Super Bowl XLIX o Let us look at this from a few angles: •  The Algorithm •  The R program •  The Data •  The Results •  Comparing with the FiveThirtyEight Results http://www.imdb.com/title/tt1285016/trivia?item=qt1318850 I need the Algorithm, I need the Algorithm – Mark Z to Eduardo S
  • 18.
  • 19. The ELO Algorithm (1 of 3) 1.  Basic Chess Algorithm proposed by Elo •  Arpad Emrick Elo proposed the system for Chess ranking •  Rnew = Rold + K(S − μ), where μij = 1 / (1 + 10^((Rj,old − Ri,old)/400)) •  K – varies depending on the match •  Sij = 1, ½ or 0 2.  Soccer Ranking •  http://www.eloratings.net/system.html 3.  NFL Ranking with adjusted factor for scores, 538 Ranking Ref : Who is #1, Princeton University Press
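The update rule above can be sketched in a few lines of Python (an illustrative translation of the formula, not the deck's R program; the function names are mine, and k = 20 matches the factor used later in the deck):

```python
# Sketch of the basic Elo update: R_new = R_old + K(S - mu),
# with expected score mu_i = 1 / (1 + 10^((R_j - R_i)/400)).

def expected_score(r_i, r_j):
    """Expected score of player i against player j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400.0))

def elo_update(r_i, r_j, s_i, k=20):
    """Return updated ratings after one game.

    s_i is 1 for an i win, 0.5 for a tie, 0 for a loss.
    """
    mu_i = expected_score(r_i, r_j)
    new_i = r_i + k * (s_i - mu_i)
    new_j = r_j + k * ((1 - s_i) - (1 - mu_i))  # zero-sum update
    return new_i, new_j

# Equal ratings: expected score is 0.5, so a win moves i up by k/2.
print(elo_update(1500, 1500, 1))  # -> (1510.0, 1490.0)
```

Note the update is zero-sum: whatever rating one team gains, the other loses.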
  • 20. The ELO Algorithm (2 of 3) NFL Ranking http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
  • 21. The ELO Algorithm (3 of 3) NFL Ranking
  • 24. The Analysis - Ranks
  • 25. The Analysis – Week 1, Week 18
  • 26. Analysis – Week 20 Results
  • 27. Wisdom from Nate Silver & the 538 Gang … o  [Homework #1] Improve our core algorithm to add the Margin of victory from the 538 gang ! •  Remember, kFactor = 20 o  [Homework #2] Weigh recent games more heavily w/ Exponential Decay
  • 28. The Art of ELO Ranking & Super Bowl XLIX o The real formula is o Not what is written on the glass ! o But then that is Hollywood ! I need the Algorithm, I need the Algorithm – Mark Z to Eduardo S Ref : Who is #1, Princeton University Press
  • 29. References: o  ELO ranking – NFL, Soccer •  http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/ •  http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/ •  http://www.eloratings.net/system.html o  dplyr •  http://www.rstudio.com/resources/webinars/ <- github for the slides •  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/ •  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/ •  http://www.rstudio.com/resources/cheatsheets/ •  http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
  • 30. Anatomy Of a Kaggle Competition 10:10
  • 31. Kaggle Data Science Competitions o  Hosts Data Science Competitions o  Competition Attributes: •  Dataset •  Train •  Test (Submission) •  Final Evaluation Data Set (We don’t see) •  Rules •  Time boxed •  Leaderboard •  Evaluation function •  Discussion Forum •  Private or Public
  • 32. Titanic Passenger Metadata •  Small •  3 Predictors •  Class •  Sex •  Age •  Survived? http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html City Bike Sharing Prediction (Washington DC) Walmart Store Forecasting
  • 33. Train.csv – taken from the Titanic Passenger Manifest. Variables: Survived – 0=No, 1=Yes; Pclass – Passenger Class (1st, 2nd, 3rd); Sibsp – Number of Siblings/Spouses Aboard; Parch – Number of Parents/Children Aboard; Embarked – Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Titanic Passenger Metadata: Small; 3 Predictors – Class, Sex, Age; Survived?
  • 34. Test.csv   Submission o 418 lines; 1st column should have 0 or 1 in each line o Evaluation: •  % correctly predicted
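The evaluation above (% correctly predicted) is easy to sketch. This illustrative Python snippet (not the official Kaggle scorer; the column names are assumptions) writes a 0/1 submission file and scores predictions against known labels:

```python
import csv, io

def write_submission(ids, preds, fh):
    """Write one 0/1 prediction per passenger id."""
    w = csv.writer(fh)
    w.writerow(["PassengerId", "Survived"])  # assumed header
    for pid, p in zip(ids, preds):
        w.writerow([pid, p])

def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

buf = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buf)
print(accuracy([0, 1, 0], [0, 1, 1]))  # -> 0.6666666666666666
```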
  • 35. Approach o  This is a classification problem - 0 or 1 o  Comb the forums ! o  Opportunity for us to try different algorithms & compare them •  Simple Model •  CART[Classification & Regression Tree] •  Greedy, top-down binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions •  RandomForest •  Different parameters •  SVM •  Multiple kernels •  Table the results o  Use cross validation to predict our model performance & correlate with what Kaggle says http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
  • 36. Simple Model – Our First Submission o #1 : Simple Model (M=survived) o #2 : Simple Model (F=survived) https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
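The two baseline submissions above amount to a one-rule classifier. A hedged Python sketch on a tiny made-up passenger list (the real rows come from Kaggle's train.csv):

```python
# Baseline #2 from the slide: predict survived=1 for every female.
passengers = [  # toy data, invented for illustration
    {"Sex": "male", "Survived": 0},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "male", "Survived": 1},
]

def predict_female_survives(p):
    return 1 if p["Sex"] == "female" else 0

correct = sum(predict_female_survives(p) == p["Survived"] for p in passengers)
print(correct / len(passengers))  # -> 0.75
```

Simple rules like this set the floor that CART, Random Forest, and SVM submissions must beat.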
  • 37. #3 : Simple CART Model o CART (Classification & Regression Tree) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees May be better, because we have improved on the survival of men ! Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
  • 38. #4 : Random Forest Model o  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience •  Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ o  https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests o  https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
  • 39. #5 : SVM o Multiple Kernels o kernel = ‘radial’ # Radial Basis Function o kernel = ‘sigmoid’ o  agconti's blog - Ultimate Titanic ! o  http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713 Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
  • 40. Feature Engineering - Homework o  Add attribute : Age •  In train 714/891 have age; in test 332/418 have age •  Missing values can be just Mean Age of all passengers •  We could be more precise and calculate Mean Age based on Title (Ms, Mrs, Master et al) •  Box plot age o  Add attribute : Mother, Family size et al o  Feature engineering ideas •  http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python o  More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/ o  And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
  • 41. What does it mean ? Let us ponder …. o  We have a training data set representing a domain •  We reason over the dataset & develop a model to predict outcomes o  How good is our prediction when it comes to real life scenarios ? o  The assumption is that the dataset is taken at random •  Or Is it ? Is there a Sampling Bias ? •  i.i.d ? Independent ? Identically Distributed ? •  What about homoscedasticity ? Do they have the same finite variance ? o  Can we assure that another dataset (from the same domain) will give us the same result ? o  Will our model & its parameters remain the same if we get another data set ? o  How can we evaluate our model ? o  How can we select the right parameters for a selected model ?
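The title-based mean-age imputation suggested in the homework above can be sketched as follows (made-up sample rows; only the Name/Age fields mimic the Titanic data, and the helper names are mine):

```python
import re
from statistics import mean

rows = [  # toy rows standing in for train.csv
    {"Name": "Braund, Mr. Owen", "Age": 22.0},
    {"Name": "Allen, Mr. William", "Age": 35.0},
    {"Name": "Doe, Mr. John", "Age": None},  # missing -> mean of "Mr" ages
    {"Name": "Heikkinen, Miss. Laina", "Age": 26.0},
]

def title_of(name):
    """Extract the title (Mr, Mrs, Miss, Master, ...) from a name."""
    m = re.search(r",\s*([A-Za-z]+)\.", name)
    return m.group(1) if m else "Unknown"

# Mean age per title, over rows where age is known.
by_title = {}
for r in rows:
    if r["Age"] is not None:
        by_title.setdefault(title_of(r["Name"]), []).append(r["Age"])
means = {t: mean(v) for t, v in by_title.items()}

# Fill missing ages from the title-level mean.
for r in rows:
    if r["Age"] is None:
        r["Age"] = means.get(title_of(r["Name"]))

print(rows[2]["Age"])  # -> 28.5
```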
  • 43. Algorithms for the Amateur Data Scientist “A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.” - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have … 11:10
  • 44. Ref: Anthony’s Kaggle Presentation Data Scientists apply different techniques •  Support Vector Machine •  adaBoost •  Bayesian Networks •  Decision Trees •  Ensemble Methods •  Random Forest •  Logistic Regression •  Genetic Algorithms •  Monte Carlo Methods •  Principal Component Analysis •  Kalman Filter •  Evolutionary Fuzzy Modelling •  Neural Networks Quora •  http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
  • 45. Algorithm spectrum o  Regression o  Logit o  CART o  Ensemble : Random Forest o  Clustering o  KNN o  Genetic Alg o  Simulated Annealing o  Collab Filtering o  SVM o  Kernels o  SVD o  NNet o  Boltzmann Machine o  Feature Learning (spectrum labels: Machine Learning | Cute Math | Artificial Intelligence)
  • 46. Classifying Classifiers o  Statistical: Regression, Logistic Regression¹, Naïve Bayes, Bayesian Networks (¹Max Entropy Classifier) o  Structural: Rule-based (Production Rules, Decision Trees), Distance-based (Functional: Linear, Spectral, Wavelet; Nearest Neighbor: kNN, Learning Vector Quantization), Neural Networks (Multi-layer Perceptron) o  Ensemble: Random Forests, Boosting, SVM Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
  • 47. [Diagram] Classifiers (categorical variables) & Regression (continuous variables): Decision Trees, k-NN (Nearest Neighbors), CART, Bagging, Boosting; axes: Bias/Variance vs. Model Complexity & Over-fitting
  • 49. Data Science “folk knowledge” (1 of A) o  "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions o  Learning = Representation + Evaluation + Optimization o  It’s Generalization that counts •  The fundamental goal of machine learning is to generalize beyond the examples in the training set o  Data alone is not enough •  Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it o  Machine Learning is not magic – one cannot get something from nothing •  In order to infer, one needs the knobs & the dials •  One also needs a rich expressive dataset. A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 50. Data Science “folk knowledge” (2 of A) o  Over fitting has many faces •  Bias – Model not strong enough. So the learner has the tendency to learn the same wrong things •  Variance – Learning too much from one dataset; model will fall apart (ie much less accurate) on a different dataset •  Sampling Bias o  Intuition Fails in high Dimensions – Bellman •  Blessing of non-conformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold eg. Space of digits is much smaller than the space of images o  Theoretical Guarantees are not What they seem •  One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. o  Feature engineering is the Key A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 51. Data Science “folk knowledge” (3 of A) o  More Data Beats a Cleverer Algorithm •  Or conversely select algorithms that improve with data •  Don’t optimize prematurely without getting more data o  Learn many models, not Just One •  Ensembles ! – Change the hypothesis space •  Netflix prize •  E.g. Bagging, Boosting, Stacking o  Simplicity Does not necessarily imply Accuracy o  Representable Does not imply Learnable •  Just because a function can be represented does not mean it can be learned o  Correlation Does not imply Causation o  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o  A few useful things to know about machine learning - by Pedro Domingos §  http://dl.acm.org/citation.cfm?id=2347755
  • 52. Data Science “folk knowledge” (4 of A) o  The simplest hypothesis that fits the data is also the most plausible •  Occam’s Razor •  Don’t go for a 4 layer Neural Network unless you have that complex data •  But that doesn’t also mean that one should choose the simplest hypothesis •  Match the impedance of the domain, data & the algorithms o  Think of over fitting as memorizing as opposed to learning. o  Data leakage has many forms o  Sometimes the Absence of Something is Everything o  [Corollary] Absence of Evidence is not the Evidence of Absence New to Machine Learning? Avoid these three mistakes, James Faghmous https://medium.com/about-data/73258b3848a4 §  Simple model: high error line that cannot be compensated with more data, but gets to a lower error rate with fewer data points §  Complex model: lower error line, but needs more data points to reach decent error Ref: Andrew Ng/Stanford, Yaser S./CalTech
  • 53. Importance of feature selection & weak models o “Good features allow a simple model to beat a complex model” - Ben Lorica1 o “… using many weak predictors will always be more accurate than using a few strong ones …” – Vladimir Vapnik2 o “A good decision rule is not a simple one, it cannot be described by a very few parameters” 2 o “Machine learning science is not only about computers, but about humans, and the unity of logic, emotion, and culture.” 2 o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you” – Hadley Wickham3 1 http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html 2 http://nautil.us/issue/6/secret-codes/teaching-me-softly 3 http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
  • 54. Check your assumptions o  The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data o  For example, for regression one should check that: ① Variables are normally distributed •  Test for normality via visual inspection, skew & kurtosis, outlier inspections via plots, z-scores et al ② There is a linear relationship between the dependent & independent variables •  Inspect residual plots, try quadratic relationships, try log plots et al ③ Variables are measured without error ④ Assumption of Homoscedasticity §  Homoscedasticity assumes constant or near constant error variance §  Check the standard residual plots and look for heteroscedasticity §  For example in the figure, the left box has the errors scattered randomly around zero, while the right two diagrams have the errors unevenly distributed Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2
  • 55. Data Science “folk knowledge” (5 of A) Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/ o  Known Knowns – there are things we know that we know (what we do) o  Known Unknowns – there are things we now know we don't know; potential facts or outcomes we are aware of, but not with certainty (stochastic processes, probabilities) o  Unknown Knowns – others know, you don’t o  Unknown Unknowns – there are things we do not know we don't know; facts, outcomes or scenarios we have not encountered, nor considered; “black swans”, outliers, long tails of probability distributions; lack of experience, imagination
  • 56. Data Science “folk knowledge” (6 of A) - Pipeline o  Collect → Store → Transform (Data Management); Reason → Model → Deploy (Data Science) o  Collect: volume, velocity, streaming data; access to multiple sources of data; data fabric across the organization; think hybrid – Big Data apps, appliances & infrastructure o  Store: canonical form, data catalog, metadata, monitor counters & metrics, structured vs. multi-structured o  Transform: flexible & selectable data subsets & attribute sets; refine model with extended data subsets & engineered attribute sets; validation run across a larger data set o  Model & Deploy: scalable model deployment; Big Data automation & purpose-built appliances (soft/hard); manage SLAs & response times; dynamic data sets; 2-way key-value tagging of datasets; advanced analytics; explore, visualize, recommend, predict; performance, scalability, refresh latency; in-memory analytics; advanced visualization, interactive dashboards, map overlay, infographics ¤  Bytes to Business a.k.a. Build the full stack ¤  Find Relevant Data For Business ¤  Connect the Dots
  • 57. Data Science “folk knowledge” (7 of A) o  Volume, Velocity, Variety – “data of unusual size” that can't be brute forced o  Context, Connectedness, Intelligence, Interface, Inference o  Three Amigos: Interface = Cognition; Intelligence = Compute (CPU) & Computational (GPU); Infer Significance & Causality
  • 58. Data Science “folk knowledge” (8 of A) Jeremy’s Axioms o  Iteratively explore data o  Tools •  Excel Format, Perl, Perl Book o  Get your head around data •  Pivot Table o  Don’t over-complicate o  If people give you data, don’t assume that you need to use all of it o  Look at pictures ! o  History of your submissions – keep a tab o  Don’t be afraid to submit simple solutions •  We will do this during this workshop Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 59. Data Science “folk knowledge” (9 of A) ①  Common Sense (some features make more sense than others) ②  Carefully read these forums to get a peek at other people’s mindsets ③  Visualizations ④  Train a classifier (e.g. logistic regression) and look at the feature weights ⑤  Train a decision tree and visualize it ⑥  Cluster the data and look at what clusters you get out ⑦  Just look at the raw data ⑧  Train a simple classifier, see what mistakes it makes ⑨  Write a classifier using handwritten rules ⑩  Pick a fancy method that you want to apply (Deep Learning/Nnet) -- Maarten Bosma -- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
  • 60. Data Science “folk knowledge” (A of A) Lessons from Kaggle Winners ①  Don’t over-fit ②  All predictors are not needed •  All data rows are not needed, either ③  Tuning the algorithms will give different results ④  Reduce the dataset (Average, select transition data,…) ⑤  Test set & training set can differ ⑥  Iteratively explore & get your head around data ⑦  Don’t be afraid to submit simple solutions ⑧  Keep a tab & history your submissions
  • 61. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician – Josh Wills (Cloudera) Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer – Will Cukierski (Kaggle) http://doubleclix.wordpress.com/2014/01/25/the-­‐curious-­‐case-­‐of-­‐the-­‐data-­‐scientist-­‐profession/ Large is hard; Infinite is much easier ! – Titus Brown
  • 62. Essential Reading List o  A few useful things to know about machine learning - by Pedro Domingos •  http://dl.acm.org/citation.cfm?id=2347755 o  The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert •  http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf o  http://www.no-free-lunch.org/ o  Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y. •  http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf o  A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe •  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o  Avoid these three mistakes, James Faghmous •  https://medium.com/about-data/73258b3848a4 o  Leakage in Data Mining: Formulation, Detection, and Avoidance •  http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 63. For your reading & viewing pleasure … An ordered List ①  An Introduction to Statistical Learning •  http://www-bcf.usc.edu/~gareth/ISL/ ②  ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning •  http://online.stanford.edu/course/statistical-learning-winter-2014 ③  Prof. Pedro Domingos •  https://class.coursera.org/machlearning-001/lecture/preview ④  Prof. Andrew Ng •  https://class.coursera.org/ml-003/lecture/preview ⑤  Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data •  https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 ⑥  Mathematicalmonk @ YouTube •  https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA ⑦  The Elements Of Statistical Learning •  http://statweb.stanford.edu/~tibs/ElemStatLearn/ http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 64. Of Models, Performance, Evaluation & Interpretation 11:30
  • 65. Bias/Variance (1 of 2) o Model Complexity •  Complex Model increases the training data fit •  But then it overfits & doesn't perform as well with real data o  Bias vs. Variance o  Classical diagram from ESLII, by Hastie, Tibshirani & Friedman o  Bias – Model learns wrong things; not complex enough; error gap small; more data by itself won’t help o  Variance – Different dataset will give different error rate; over fitted model; larger error gap; more data could help Prediction Error / Training Error Ref: Andrew Ng/Stanford, Yaser S./CalTech Learning Curve
  • 66. Bias/Variance (2 of 2) o High Bias •  Due to Underfitting •  Add more features •  More sophisticated model •  Quadratic Terms, complex equations,… •  Decrease regularization o High Variance •  Due to Overfitting •  Use fewer features •  Use more training samples •  Increase Regularization Prediction Error / Training Error Ref: Strata 2013 Tutorial by Olivier Grisel Learning Curve: need more features or a more complex model to improve (high bias); need more data to improve (high variance) 'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
  • 67. Data Partition & Cross-Validation o  Partition Data ! •  Training (60%) •  Validation (20%) & •  “Vault” Test (20%) data sets o  k-fold Cross-Validation •  Split data into k equal parts •  Fit model to k-1 parts & calculate prediction error on the kth part •  Non-overlapping datasets —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) K-fold CV (k=5): each of the 5 parts serves once as the validation set while the remaining 4 train
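The k-fold scheme above can be sketched in a few lines (illustrative Python; the function name is mine, and real runs would shuffle the indices first):

```python
# k non-overlapping validation folds; each is paired with the
# remaining k-1 folds as the training set.
def kfold_indices(n, k):
    # Fold sizes differ by at most 1 when n is not divisible by k.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

for train, val in kfold_indices(10, 5):
    print(val)  # prints the 5 disjoint validation folds
```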
  • 68. Bootstrap & Bagging o  Bootstrap •  Draw datasets (with replacement) and fit model for each dataset •  Remember : Data Partitioning (#1) & Cross Validation (#2) are without replacement o  Bagging (Bootstrap aggregation) ◦  Average prediction over a collection of bootstrapped samples, thus reducing variance —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
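Bagging as described above, in a deliberately tiny sketch where each "model" is just a sample mean (purely illustrative; function names are mine):

```python
import random

def bootstrap_sample(data, rng):
    # Same size as the original, drawn WITH replacement.
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=200, seed=42):
    """Fit a trivial 'model' (the mean) on each bootstrap sample
    and average the predictions over all models."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        preds.append(sum(sample) / len(sample))
    return sum(preds) / len(preds)

data = [1, 2, 3, 4, 5]
print(round(bagged_mean(data), 2))  # close to the plain mean, 3.0
```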
  • 69. Boosting ◦  “Output of weak classifiers into a powerful committee” ◦  Final Prediction = weighted majority vote ◦  Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them ◦  AdaBoost (Adaptive Boosting) ◦  Boosting vs Bagging – Bagging: independent trees – Boosting: successively weighted —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
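One round of the reweighting idea above can be made concrete (a sketch of the AdaBoost-style weight update, not a full implementation; the function name is mine):

```python
import math

def adaboost_reweight(weights, correct):
    """weights: current sample weights; correct: bool per sample.
    Misclassified points get higher weight so the next weak
    learner concentrates on them."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)  # this learner's committee vote
    new_w = [w * math.exp(-alpha if c else alpha)
             for w, c in zip(weights, correct)]
    total = sum(new_w)                       # renormalize to sum to 1
    return [w / total for w in new_w], alpha

w = [0.25, 0.25, 0.25, 0.25]
w2, alpha = adaboost_reweight(w, [True, True, True, False])
print([round(x, 3) for x in w2])  # -> [0.167, 0.167, 0.167, 0.5]
```

After one round the single misclassified point carries as much weight as the three correct ones combined.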
  • 70. Random Forests+ ◦  Builds a large collection of de-correlated trees & averages them ◦  Improves Bagging by selecting i.i.d* random variables for splitting ◦  Simpler to train & tune ◦  “Do remarkably well, with very little tuning required” – ESLII ◦  Less susceptible to over fitting (than boosting) ◦  Many RF implementations – Original version - Fortran-77 ! By Breiman/Cutler – Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab * i.i.d – independent identically distributed + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
  • 71. Ensemble Methods — Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
  ◦  Two steps – Develop a set of learners – Combine the results to develop a composite predictor ◦  Ensemble methods can take the form of: – Using different algorithms, – Using the same algorithm with different settings, – Assigning different parts of the dataset to different classifiers ◦  Bagging & Random Forests are examples of ensemble methods
  Ref: Machine Learning In Action
  • 72. Random Forests o  While Bagging splits on the best among all variables, RF splits on the best among a randomly chosen subset of variables o  Simpler because it requires only two tuning parameters – the no. of predictors sampled per split (typically √p, where p = no. of features) & the no. of trees (500 for a large dataset, 150 for a smaller one) o  Error prediction •  For each iteration, predict for the data that is not in the sample (OOB data) •  Aggregate the OOB predictions •  Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate •  Can use this to search for the optimal # of predictors •  We will see how close this is to the actual error in the Heritage Health Prize o  Assumes equal cost for mis-prediction. Can add a cost function o  Proximity matrix & applications like adding missing data, dropping outliers
  Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective : Berk; A Brief Overview of RF by Dan Steinberg
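The OOB idea, that each bootstrap draw leaves out roughly a third of the rows which then act as a free test set, can be seen directly (illustrative sketch in plain Python):

```python
import random

# Out-of-bag (OOB) rows: the rows a bootstrap draw never touches.
# RF implementations do this per tree; this is a one-draw illustration.
random.seed(7)
n = 100
sample = [random.randrange(n) for _ in range(n)]   # bootstrap: with replacement
in_bag = set(sample)
oob = [i for i in range(n) if i not in in_bag]     # rows never drawn
oob_fraction = len(oob) / n                        # expect about (1 - 1/n)**n, i.e. ~36.8%
```

Predicting each row only with the trees that did NOT see it, then aggregating, gives the OOB error estimate described above without needing a separate validation set.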
  • 73. [Diagram: Classifiers (categorical variables) vs. Regression (continuous variables); Decision Trees, k-NN (Nearest Neighbors), CART, Bagging, Boosting; the bias-variance trade-off as model complexity grows, leading to over-fitting]
  • 75. Cross Validation o References: •  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience •  Chris Clark’s blog : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ •  Predictive Modeling in Python with scikit-learn, Olivier Grisel, Strata 2013 •  titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
  • 76. Model Evaluation - Accuracy o Accuracy = (tp + tn) / (tp + fp + fn + tn) o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate ! o Hence the F-measure is a better reflection of the model strength
              Predicted=1            Predicted=0
  Actual=1    True+ (tp)             False- (fn) – Type II
  Actual=0    False+ (fp) – Type I   True- (tn)
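With toy counts where tn dominates, the degenerate always-return-false classifier indeed scores higher than a real one (the counts below are made up for illustration):

```python
# Accuracy from the 2x2 confusion matrix, vs. the degenerate
# "always predict 0" classifier mentioned above (toy numbers).
tp, fn, fp, tn = 10, 5, 15, 970                      # tn dominates
total = tp + fp + fn + tn

accuracy = (tp + tn) / total                         # the real model: 0.98
always_no = (tn + fp) / total                        # predict 0 everywhere: 0.985
```

The "predict 0 for everything" model is right on every actual-0 case (tn + fp of them), so it beats the real model on raw accuracy, which is why the slide reaches for the F-measure instead.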
  • 77. Model Evaluation – Precision & Recall
  o  Precision = tp / (tp + fp) : How many of the items we identified are relevant (aka Accuracy, Relevancy)
  o  Recall = tp / (tp + fn) : How many of the relevant items did we identify (aka True +ve Rate, Coverage, Sensitivity, Hit Rate)
  o  False +ve Rate = fp / (fp + tn) (aka Type 1 Error Rate, False Alarm Rate); Specificity = 1 – fp rate; Type 1 Error = fp, Type 2 Error = fn
  o  Inverse relationship – the tradeoff depends on the situation •  Legal – coverage is more important than correctness •  Search – accuracy is more important •  Fraud – support cost (high fp) vs. wrath of the credit card co. (high fn)
  http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
  • 78. Confusion Matrix (rows = Actual, columns = Predicted)
        C1   C2   C3   C4
  C1    10    5    9    3
  C2     4   20    3    7
  C3     6    4   13    3
  C4     2    1    4   15
  Correct ones are the diagonal entries (cii). Precision (per column i) = cii / Σj cji ; Recall (per row i) = cii / Σj cij
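Computing the column-wise precision and row-wise recall for the 4-class matrix above (sketch in Python; the session's own code is R):

```python
# Per-class precision and recall from the multi-class confusion matrix
# on the slide (rows = actual class, columns = predicted class).
M = [[10,  5,  9,  3],
     [ 4, 20,  3,  7],
     [ 6,  4, 13,  3],
     [ 2,  1,  4, 15]]

def precision(M, i):
    """Diagonal entry over its column sum: of all predicted-as-i, how many were i."""
    return M[i][i] / sum(row[i] for row in M)

def recall(M, i):
    """Diagonal entry over its row sum: of all actual-i, how many we caught."""
    return M[i][i] / sum(M[i])
```

For class C1, for example, precision = 10/22 (column sum) and recall = 10/27 (row sum).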
  • 79. Model Evaluation : F-Measure
  Precision = tp / (tp + fp) : Recall = tp / (tp + fn)
  F-Measure: balanced, combined, weighted harmonic mean; measures effectiveness
  1/F = α (1/P) + (1 – α) (1/R), or equivalently F = (β² + 1) P R / (β² P + R)
  Common form (balanced F1) : β = 1 (α = ½) ; F1 = 2PR / (P + R)
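The formulas reduce to a one-line function (`f_beta` is my name for it; this is a sketch, not the session's code):

```python
# F-beta from precision and recall; beta=1 gives the balanced F1.
def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.5, 0.4
f1 = f_beta(p, r)      # 2PR / (P + R)
```

Note that β > 1 weights recall more heavily: swapping P and R changes the score for β = 2, whereas F1 is symmetric in the two.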
  • 80. Hands-on Walkthru - Model Evaluation Train : 712 (80%) Test : 179 Total : 891
  http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval; the Kappa measure is interesting
  Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
  • 81. ROC Analysis o “How good is my model?” o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf o “A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance” o Much better than evaluating a model based on simple classification accuracy o Plots tp rate vs. fp rate
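Each threshold on the classifier's score yields one (fp rate, tp rate) point of the ROC graph; sweeping the threshold traces the curve. A toy sketch (the scores and labels below are made up for illustration):

```python
# One ROC point per score threshold: tp rate vs. fp rate.
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]   # classifier scores
labels = [1,   1,   0,   1,    0,   0,   1,   0]      # true classes (4 pos, 4 neg)

def roc_point(threshold):
    """Predict 1 when score >= threshold; return (fp rate, tp rate)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos
```

A threshold above every score gives (0, 0), a threshold below every score gives (1, 1), and good classifiers bend the points between them toward the north-west corner.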
  • 82. ROC Graph - Discussion o  E = Conservative, everything NO o  H = Liberal, everything YES o Am not making any political statement ! o  F = Ideal o  G = Worst o  The diagonal is the chance line o  The north-west corner is good o  The south-east is bad •  For example E •  Believe it or not - I have actually seen a graph with the curve in this region !
  • 83. ROC Graph – Clinical Example IFCC : Measures of diagnostic accuracy: basic definitions
  • 84. ROC Graph Walk thru Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
  • 85. The Beginning As The End Who will win Super Bowl XLIX ? 12:15
  • 86. References: o  An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas •  http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning o  Advanced Machine Learning with scikit-learn, PyCon 2013 & Strata 2014, Olivier Grisel •  http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn o  Just The Basics, Strata 2013, William Cukierski & Ben Hamner •  http://strataconf.com/strata2013/public/schedule/detail/27291 o  The Problem of Multiple Testing •  http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
  • 88. Homework: Bike Sharing at Washington DC 12:30
  • 89. A few interesting links - Comb the forums o  Quick first prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing •  Solution by Brandon Harris o  Random forest : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction o  GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm o  Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare o  ggplot : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances o  Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour o  Casual & Registered Users : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count o  RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r o  Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data o  Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402 o  Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
  • 90. Data Organization – train, test & submission •  datetime - hourly date + timestamp •  Season •  1 = spring, 2 = summer, 3 = fall, 4 = winter •  holiday - whether the day is considered a holiday •  workingday - whether the day is neither a weekend nor holiday •  Weather •  1: Clear, Few clouds, Partly cloudy, Partly cloudy •  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist •  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds •  4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog •  temp - temperature in Celsius •  atemp - "feels like" temperature in Celsius •  humidity - relative humidity •  windspeed - wind speed •  casual - number of non-registered user rentals initiated •  registered - number of registered user rentals initiated •  count - number of total rentals
  • 91. Approach o Convert to factors o Engineer new features from date o Explore other synthetic features
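The date-derived features can be pulled straight out of the Kaggle `datetime` column with the standard library. An illustrative Python sketch (the session itself does this in R; `date_features` is a made-up helper name):

```python
from datetime import datetime

# Engineer hour/month/year/weekday features from the `datetime` column,
# whose format in the train set is "YYYY-MM-DD HH:MM:SS".
def date_features(s):
    d = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    return {"hour": d.hour, "month": d.month,
            "year": d.year, "weekday": d.weekday()}   # weekday: 0 = Monday
```

Hour and weekday in particular can then be treated as factors, matching the "convert to factors" step above.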
  • 92. #1 : ctree Refer to 3-Session-I-Bikes.R at https://github.com/xsankar/hairy-octo-hipster/
  • 93. #2 : Add Month + year