Who will win XLIX?
R, Data Wrangling &
Data Science
January 18, 2015
@ksankar // doubleclix.wordpress.com
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." - Friedrich Nietzsche
etude
http://en.wikipedia.org/wiki/%C3%89tude, http://www.etudesdemarche.net/articles/etudes-sectorielles.htm,
http://upload.wikimedia.org/wikipedia/commons/2/26/La_Cour_du_Palais_des_%C3%A9tudes_de_l%E2%80%99%C3%89cole_des_beaux-arts.jpg
We will focus on “short”, “acquiring
skill” & “having fun” !
Goals & non-goals
Goals
¤ Get familiar with the R
language & dplyr
¤ Work on a couple of interesting
data science problems
¤ Give you a focused time to
work
§ Work with me. I will wait
if you want to catch-up
¤ Less theory, more usage - let
us see if this works
¤ As straightforward as possible
§ The programs can be
optimized
Non-goals
¡ Go deep into the algorithms
•  We don’t have
sufficient time. The topic
can be easily a 5 day
tutorial !
¡ Dive into R internals
•  That is for another day
¡ A passive talk
•  Nope. Interactive &
hands-on
Activities & Results
o  Activities:
•  Get familiar with R, R Studio
•  Work on a couple of data sets
•  Get familiar with the mechanics of Data Science Competitions
•  Explore the intersection of Algorithms, Data, Intelligence, Inference &
Results
•  Discuss Data Science Horse Sense ;o)
o  Results :
•  Hands-on R
•  Familiar with some of the interesting algorithms
•  Submitted entries for 1 competition
•  Knowledge of Model Evaluation
•  Cross Validation, ROC Curves
About Me
o  Chief Data Scientist at BlackArrow.tv
o  Have been speaking at OSCON, PyCon, Pydata et al
o  Reviewing Packt Book “Machine Learning with Spark”
o  Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o  Have done lots of things:
•  Big Data (Retail, Bioinformatics, Financial, AdTech),
•  Written Books (Web 2.0, Wireless, Java,…)
•  Standards, some work in AI,
•  Guest Lecturer at Naval PG School,…
•  Planning MS-CFinance or Statistics
•  Volunteer as Robotics Judge at First Lego league World Competitions
o  @ksankar, doubleclix.wordpress.com
The Nuthead band!
Setup & Data
R & IDE
o  Install R
o  Install R Studio
Tutorial Materials
o  Github : https://github.com/xsankar/hairy-octo-hipster
o  Clone or download zip
Setup an account in Kaggle (www.kaggle.com)
We will be using the data from 2 Kaggle competitions
①  Titanic: Machine Learning from Disaster
Download data from http://www.kaggle.com/c/titanic-gettingStarted
Directory ~/hairy-octo-hipster/titanic-r
②  Predicting Bike Sharing @ Washington DC
Download data from http://www.kaggle.com/c/bike-sharing-demand/data
Directory ~/hairy-octo-hipster/bike
③  2014 NFL Boxscore
http://www.pro-football-reference.com/years/2014/games.htm
Directory ~/hairy-octo-hipster/nfl
Data
Agenda
o  Jan 18 : 9:00-12:30 3 hrs
o  Intro, Goals, Logistics, Setup [10] [9:00-9:10)
o  Introduction to R & dplyr [30] [9:10-9:40)
o  Who will win Superbowl XLIX ?
The Art of ELO Ranking [30] [9:40-10:10)
•  The Algorithm
•  The Data
•  The Results (Compare with FiveThirtyEight)
o  Anatomy of a Kaggle Competition [40] [10:10-10:50)
•  Competition Mechanics
•  Register, download data, create subdirectories
•  Trial Run : Submit Titanic
o  Break [20] [10:50-11:10)
o  Algorithms for the Amateur Data Scientist [20] [11:10-11:30)
•  Algorithms, Tools & frameworks in perspective
•  “Folk Wisdom”
o  Model Evaluation & Interpretation [30] [11:30 - 12:00)
•  Confusion Matrix, ROC Graph
o  Homework : The Art of a Competition – Bike Sharing
o  Homework : The Art of a Competition – Walmart
Overload Warning … There is enough material for a week’s training … which is good & bad !
Read thru at your pace, refer, ponder & internalize
Close Encounters
—  1st
◦  This Tutorial
—  2nd
◦  Do More Hands-on Walkthrough
—  3rd
◦  Listen To Lectures
◦  More competitions …
Introduction to R
9:10
R Syntax – A quick overview
o aString <- "A String"
o aNumber <- 12
o class(aString)
o class(aNumber)
o aVector <- c(1,2,3,4)
o class(aVector)
o aVector * 2
o sqrt(aVector)
o Packages : dplyr & tidyr
Data wrangling with dplyr
o  dplyr – versatile package for various data operations
o  We will see dplyr in use
o  Resources:
•  “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 Tutorial Slides
•  http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
•  Slides https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
•  Slides of Tutorial by Rstudio’s Garrett Grolemund
•  https://github.com/rstudio/webinars
•  And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
dplyr verbs
o select()
o filter()
o summarise()
o group_by()
o mutate()
o arrange()
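The verbs chain together with the %>% pipe. A minimal sketch on a made-up data frame (dplyr must be installed; the `games` table here is illustrative, not part of the tutorial data):

```r
library(dplyr)

# Toy data frame standing in for a real dataset
games <- data.frame(
  team   = c("NE", "SEA", "NE", "SEA"),
  week   = c(1, 1, 2, 2),
  points = c(33, 36, 30, 21)
)

high_scoring <- games %>%
  select(team, points) %>%               # pick columns
  filter(points > 25) %>%                # pick rows
  mutate(diff_from_30 = points - 30) %>% # add a derived column
  arrange(desc(points))                  # sort

team_avg <- games %>%
  group_by(team) %>%                     # summarise then collapses each group
  summarise(avg_points = mean(points))
```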
dplyr joins
Hiroaki Yutani @yutannihilatio inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
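A quick sketch of the join verbs on two toy tables (dplyr assumed installed; the ids and values are illustrative):

```r
library(dplyr)

teams   <- data.frame(id = c(1, 2, 3), team = c("NE", "SEA", "GB"))
ratings <- data.frame(id = c(1, 2, 4), elo = c(1700, 1680, 1550))

ij <- inner_join(teams, ratings, by = "id")  # only ids present in both: 1 and 2
lj <- left_join(teams, ratings, by = "id")   # all teams; GB gets NA for elo
aj <- anti_join(teams, ratings, by = "id")   # teams with no rating: GB
```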
Who will win Super Bowl
XLIX
9:40
The Art of ELO Ranking
& Super Bowl XLIX
o Let us look at this from 3 angles:
•  The Algorithm
•  The R program
•  The Data
•  The Results
•  Comparing with the
FiveThirtyEight Results
http://www.imdb.com/title/tt1285016/trivia?item=qt1318850
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
The ELO Algorithm (1 of 3)
1.  Basic Chess Algorithm proposed by Elo
•  Arpad Emrick Elo proposed the system for Chess ranking
•  Rnew = Rold + K(S − μ); μij = 1 / (1 + 10^((Rj,old − Ri,old)/400))
•  K – varies depending on the match
•  Sij = 1, ½ or 0
2.  Soccer Ranking
•  http://www.eloratings.net/system.html
3.  NFL Ranking with adjusted factor for scores, 538
Ranking
Ref : Who is #1, Princeton University Press
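The update rule above can be written directly in base R (kFactor = 20, as used later in the slides):

```r
# Expected score mu of team i against team j, then Rnew = Rold + K(S - mu)
elo_expected <- function(r_i, r_j) {
  1 / (1 + 10^((r_j - r_i) / 400))
}

elo_update <- function(r_i, r_j, s_i, k = 20) {
  # s_i = 1 (win), 0.5 (tie) or 0 (loss) for team i
  r_i + k * (s_i - elo_expected(r_i, r_j))
}

elo_expected(1500, 1500)   # evenly rated teams: 0.5
elo_update(1500, 1500, 1)  # winner gains K/2 = 10 points: 1510
```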
The ELO Algorithm (2 of 3)
NFL Ranking
http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
The ELO Algorithm (3 of 3)
NFL Ranking
The Data
http://www.pro-football-reference.com/years/2014/games.htm
The R Code
https://github.com/xsankar/hairy-octo-hipster
The Analysis - Ranks
The Analysis – Week 1, Week 18
Analysis – Week 20 Results
Wisdom from Nate Silver & the 538 Gang …
o  [Homework #1] Improve our core algorithm
to add the Margin of victory from the 538
gang !
•  Remember, kFactor = 20
o  [Homework #2] Weigh recent games more
heavily w/ Exponential Decay
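For Homework #1, a sketch of a 538-style margin-of-victory multiplier that scales the K(S − μ) term. The constants (log damping, the 2.2 autocorrelation term) follow FiveThirtyEight's published description; treat them as an assumption and verify against their article:

```r
# Margin-of-victory multiplier; winner_elo_diff is winner's Elo minus loser's
mov_multiplier <- function(point_diff, winner_elo_diff) {
  log(abs(point_diff) + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))
}

# The update then becomes: Rnew = Rold + K * mov * (S - mu)
mov_multiplier(1, 0)      # narrow win between evenly rated teams
mov_multiplier(28, 100)   # blowout by the Elo favourite is damped
mov_multiplier(28, -100)  # the same blowout by the underdog counts for more
```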
The Art of ELO Ranking
& Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
Ref : Who is #1, Princeton University Press
References:
o  ELO ranking – NFL, Soccer
•  http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
•  http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/
•  http://www.eloratings.net/system.html
o  dplyr
•  http://www.rstudio.com/resources/webinars/ <- github for the slides
•  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/
•  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/
•  http://www.rstudio.com/resources/cheatsheets/
•  http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
Anatomy Of a Kaggle
Competition 10:10
Kaggle Data Science Competitions
o  Hosts Data Science Competitions
o  Competition Attributes:
•  Dataset
•  Train
•  Test (Submission)
•  Final Evaluation Data Set (We don’t
see)
•  Rules
•  Time boxed
•  Leaderboard
•  Evaluation function
•  Discussion Forum
•  Private or Public
Titanic Passenger Metadata
•  Small
•  3 Predictors
•  Class
•  Sex
•  Age
•  Survived?
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction (Washington DC)
Walmart Store Forecasting
Train.csv
Taken from Titanic Passenger Manifest

Variable – Description
Survived – 0 = No, 1 = Yes
Pclass – Passenger Class (1st, 2nd, 3rd)
Sibsp – Number of Siblings/Spouses Aboard
Parch – Number of Parents/Children Aboard
Embarked – Port of Embarkation
o  C = Cherbourg
o  Q = Queenstown
o  S = Southampton

Test.csv
Submission
o 418 lines; 1st column should have 0 or 1 in each line
o Evaluation:
•  % correctly predicted
Approach
o  This is a classification problem - 0 or 1
o  Comb the forums !
o  Opportunity for us to try different algorithms & compare them
•  Simple Model
•  CART[Classification & Regression Tree]
•  Greedy, top-down binary, recursive partitioning that divides feature space into sets
of disjoint rectangular regions
•  RandomForest
•  Different parameters
•  SVM
•  Multiple kernels
•  Table the results
o  Use cross validation to predict our model performance & correlate with what Kaggle
says
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Simple Model – Our First Submission
o #1 : Simple Model (M=survived)
o #2 : Simple Model (F=survived)
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
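A sketch of the simple gender model on a made-up five-row stand-in for train.csv (the column names follow the Kaggle data; the rows do not):

```r
# Model #2 (F = survived): predict 1 for female passengers, 0 for male
train <- data.frame(
  Survived = c(1, 0, 1, 0, 1),
  Sex      = c("female", "male", "female", "male", "female")
)

pred <- ifelse(train$Sex == "female", 1, 0)
acc  <- mean(pred == train$Survived)   # training accuracy of the rule

# For the real submission, predict on test.csv and write one row per passenger:
# write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred),
#           "submission.csv", row.names = FALSE)
```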
#3 : Simple CART Model
o CART (Classification & Regression Tree)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
May be better, because we have improved on the survival of men!
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
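A minimal CART sketch with rpart (bundled with R); the toy rows below stand in for train.csv:

```r
library(rpart)

train <- data.frame(
  Survived = factor(c(1, 0, 1, 0, 1, 0, 1, 0)),
  Sex      = c("female", "male", "female", "male",
               "female", "male", "female", "male"),
  Pclass   = c(1, 3, 2, 3, 1, 2, 3, 1)
)

# Fit a classification tree; minsplit lowered only because the toy set is tiny
fit  <- rpart(Survived ~ Sex + Pclass, data = train,
              method = "class", minsplit = 2)
pred <- predict(fit, train, type = "class")
table(pred, train$Survived)   # confusion matrix on the training data
```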
#4 : Random Forest Model
o  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
•  Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
o  https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
o  https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
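A random forest sketch; assumes the randomForest package is installed, and uses synthetic rows in place of the real train.csv:

```r
library(randomForest)
set.seed(42)

train <- data.frame(
  Survived = factor(rep(c(1, 0), 20)),
  Sex      = factor(rep(c("female", "male"), 20)),
  Pclass   = rep(c(1, 3, 2, 3), 10)
)

# ntree and other parameters are the knobs to experiment with
fit  <- randomForest(Survived ~ Sex + Pclass, data = train, ntree = 150)
fit$confusion            # OOB confusion matrix & class errors
pred <- predict(fit, train)
```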
#5 : SVM
o Multiple Kernels
o kernel = ‘radial’ # Radial Basis Function
o kernel = ‘sigmoid’
o  agconti's blog - Ultimate Titanic !
o  http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
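A sketch of trying both kernels with e1071 (package assumed installed); a one-predictor synthetic set stands in for the Titanic data:

```r
library(e1071)
set.seed(42)

train <- data.frame(
  Survived = factor(c(rep(1, 20), rep(0, 20))),
  Age      = c(rnorm(20, 25, 5), rnorm(20, 45, 5))
)

fit_rbf <- svm(Survived ~ Age, data = train, kernel = "radial")
fit_sig <- svm(Survived ~ Age, data = train, kernel = "sigmoid")
mean(predict(fit_rbf, train) == train$Survived)   # training accuracy, RBF
```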
Feature Engineering - Homework
o  Add attribute : Age
•  In train 714/891 have age; in test 332/418 have age
•  Missing values can be just Mean Age of all passengers
•  We could be more precise and calculate Mean Age based on Title (Ms,
Mrs, Master et al)
•  Box plot age
o  Add attribute : Mother, Family size et al
o  Feature engineering ideas
•  http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o  More ideas at
http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o  And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
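Both imputation ideas above can be sketched in base R; the names and ages below are made up for illustration:

```r
train <- data.frame(
  Name = c("Smith, Mr. John", "Doe, Mrs. Jane", "Roe, Master. Tim",
           "Poe, Mr. Edgar", "Lee, Mrs. Ann", "Kay, Master. Bob"),
  Age  = c(40, 30, NA, NA, 35, 5)
)

# Option 1: fill missing ages with the overall mean
train$AgeSimple <- ifelse(is.na(train$Age),
                          mean(train$Age, na.rm = TRUE), train$Age)

# Option 2 (more precise): mean age per title parsed out of the name
train$Title <- sub(".*, (\\w+)\\..*", "\\1", train$Name)
title_mean  <- ave(train$Age, train$Title,
                   FUN = function(a) mean(a, na.rm = TRUE))
train$AgeByTitle <- ifelse(is.na(train$Age), title_mean, train$Age)
train[, c("Title", "Age", "AgeSimple", "AgeByTitle")]
```

A Master gets a child-like age instead of the adult overall mean, which is the point of imputing by title.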
What does it mean ? Let us ponder ….
o  We have a training data set representing a domain
•  We reason over the dataset & develop a model to predict outcomes
o  How good is our prediction when it comes to real life scenarios ?
o  The assumption is that the dataset is taken at random
•  Or Is it ? Is there a Sampling Bias ?
•  i.i.d ? Independent ? Identically Distributed ?
•  What about homoscedasticity ? Do they have the same finite variance ?
o  Can we assure that another dataset (from the same domain) will give us the same
result ?
o  Will our model & its parameters remain the same if we get another data set ?
o  How can we evaluate our model ?
o  How can we select the right parameters for a selected model ?
Break
10:50 - 11:10
Algorithms for the
Amateur Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any
man who can hitch the length and breadth of the Galaxy, rough it … win through, and still
know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …
11:10
Ref: Anthony’s Kaggle Presentation
Data Scientists apply different techniques
•  Support Vector Machine
•  adaBoost
•  Bayesian Networks
•  Decision Trees
•  Ensemble Methods
•  Random Forest
•  Logistic Regression
•  Genetic Algorithms
•  Monte Carlo Methods
•  Principal Component Analysis
•  Kalman Filter
•  Evolutionary Fuzzy Modelling
•  Neural Networks
Quora
•  http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Algorithm spectrum
o  Regression
o  Logit
o  CART
o  Ensemble : Random Forest
o  Clustering
o  KNN
o  Genetic Alg
o  Simulated Annealing
o  Collab Filtering
o  SVM
o  Kernels
o  SVD
o  NNet
o  Boltzman Machine
o  Feature Learning
The spectrum runs from Machine Learning through “Cute Math” to Artificial Intelligence
Classifying Classifiers
o  Statistical
•  Regression
•  Naïve Bayes
•  Bayesian Networks
•  Logistic Regression (1: Max Entropy Classifier)
o  Structural
•  Rule-based : Production Rules, Decision Trees
•  Distance-based
•  Functional : Linear, Spectral, Wavelet
•  Nearest Neighbor : kNN, Learning Vector Quantization
•  Neural Networks : Multi-layer Perceptron
•  Ensemble : Random Forests, SVM, Boosting
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
[Diagram] Classifiers (Categorical Variables) vs. Regression (Continuous Variables): Decision Trees, k-NN (Nearest Neighbors), CART, Boosting & Bagging placed along axes of Bias vs. Variance and Model Complexity / Over-fitting
Data Science
“folk knowledge”
Data Science “folk knowledge” (1 of A)
o  "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer
Mediated Transactions
o  Learning = Representation + Evaluation + Optimization
o  It’s Generalization that counts
•  The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o  Data alone is not enough
•  Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond
it
o  Machine Learning is not magic – one cannot get something from nothing
•  In order to infer, one needs the knobs & the dials
•  One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (2 of A)
o  Over fitting has many faces
•  Bias – Model not strong enough. So the learner has the tendency to learn the
same wrong things
•  Variance – Learning too much from one dataset; model will fall apart (ie much
less accurate) on a different dataset
•  Sampling Bias
o  Intuition Fails in High Dimensions – Bellman
•  Blessing of non-conformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold, e.g. the space of digits is much smaller than the space of images
o  Theoretical Guarantees are not What they Seem
•  One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.
o  Feature engineering is the Key
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (3 of A)
o  More Data Beats a Cleverer Algorithm
•  Or conversely select algorithms that improve with data
•  Don’t optimize prematurely without getting more data
o  Learn many models, not Just One
•  Ensembles ! – Change the hypothesis space
•  Netflix prize
•  E.g. Bagging, Boosting, Stacking
o  Simplicity Does not necessarily imply Accuracy
o  Representable Does not imply Learnable
•  Just because a function can be represented does not mean
it can be learned
o  Correlation Does not imply Causation
o  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o  A few useful things to know about machine learning - by Pedro Domingos
§  http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (4 of A)
o  The simplest hypothesis that fits the data is also the most
plausible
•  Occam’s Razor
•  Don’t go for a 4 layer Neural Network unless
you have that complex data
•  But that doesn’t also mean that one should
choose the simplest hypothesis
•  Match the impedance of the domain, data & the
algorithms
o  Think of over fitting as memorizing as opposed to learning.
o  Data leakage has many forms
o  Sometimes the Absence of Something is Everything
o  [Corollary] Absence of Evidence is not the Evidence of
Absence
New to Machine Learning? Avoid these three mistakes, James Faghmous
https://medium.com/about-data/73258b3848a4
§  Simple Model
•  High error line that cannot be compensated with more data
•  Gets to a lower error rate with fewer data points
§  Complex Model
•  Lower error line
•  But needs more data points to reach decent error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Importance of feature selection & weak models
o “Good features allow a simple model to beat a complex model”-Ben Lorica1
o “… using many weak predictors will always be more accurate than using a few
strong ones …” –Vladimir Vapnik2
o “A good decision rule is not a simple one, it cannot be described by a very few
parameters” 2
o “Machine learning science is not only about computers, but about humans, and
the unity of logic, emotion, and culture.” 2
o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well,
but it can’t surprise you” – Hadley Wickham3
1 http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html
2 http://nautil.us/issue/6/secret-codes/teaching-me-softly
3 http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
Check your assumptions
o  The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data
o  For example, for regression one should check that:
① Variables are normally distributed
•  Test for normality via visual inspection, skew & kurtosis, outlier inspections via
plots, z-scores et al
② There is a linear relationship between the dependent & independent
variables
•  Inspect residual plots, try quadratic relationships, try log plots et al
③ Variables are measured without error
④ Assumption of Homoscedasticity
§  Homoscedasticity assumes constant or near constant error variance
§  Check the standard residual plots and look for heteroscedasticity
§  For example in the figure, left box has the errors scattered randomly around zero; while the
right two diagrams have the errors unevenly distributed
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test,
http://pareonline.net/getvn.asp?v=8&n=2
Data Science “folk knowledge” (5 of A)
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
The World (Knowns & Unknowns) vs. You (Known & Unknown):
o  World Knowns / You Known : What we do
o  World Knowns / You Unknown : Others know, you don’t
o  World Unknowns / You Known : Potential facts, outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o  World Unknowns / You Unknown : Facts, outcomes or scenarios we have not encountered, nor considered; “Black swans”, outliers, long tails of probability distributions; lack of experience, imagination
o  Known Knowns
o  There are things we know that we know
o  Known Unknowns
o  That is to say, there are things that we
now know we don't know
o  But there are also Unknown Unknowns
o  There are things we do not know we
don't know
Data Science “folk knowledge” (6 of A) - Pipeline
Collect | Store | Transform
o  Volume, Velocity, Streaming Data
o  Canonical form, Data catalog, Data Fabric across the organization
o  Access to multiple sources of data
o  Think Hybrid – Big Data Apps, Appliances & Infrastructure
Reason | Model | Deploy
o  Metadata
o  Monitor counters & Metrics
o  Structured vs. Multi-structured
o  Flexible & Selectable : Data Subsets, Attribute sets
o  Refine model with Extended Data subsets, Engineered Attribute sets
o  Validation run across a larger data set
o  Scalable Model Deployment
o  Big Data automation & purpose built appliances (soft/hard)
o  Manage SLAs & response times
Data Management / Data Science
o  Dynamic Data Sets
o  2-way key-value tagging of datasets
o  Extended attribute sets
o  Advanced Analytics
Explore | Visualize | Recommend | Predict
o  Performance, Scalability, Refresh Latency, In-memory Analytics
o  Advanced Visualization, Interactive Dashboards, Map Overlay, Infographics
¤  Bytes to Business a.k.a. Build the full stack
¤  Find Relevant Data For Business
¤  Connect the Dots
Volume, Velocity, Variety
Data Science “folk knowledge” (7 of A)
Context, Connectedness, Intelligence, Interface, Inference
“Data of unusual size”
that can't be brute forced
o  Three Amigos
o  Interface = Cognition
o  Intelligence = Compute(CPU) & Computational(GPU)
o  Infer Significance & Causality
Data Science “folk knowledge” (8 of A)
Jeremy’s Axioms
o  Iteratively explore data
o  Tools
•  Excel Format, Perl, Perl Book
o  Get your head around data
•  Pivot Table
o  Don’t over-complicate
o  If people give you data, don’t assume that you
need to use all of it
o  Look at pictures !
o  History of your submissions – keep a tab
o  Don’t be afraid to submit simple solutions
•  We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Data Science “folk knowledge” (9 of A)
①  Common Sense (some features make more sense then others)
②  Carefully read these forums to get a peek at other peoples’ mindset
③  Visualizations
④  Train a classifier (e.g. logistic regression) and look at the feature weights
⑤  Train a decision tree and visualize it
⑥  Cluster the data and look at what clusters you get out
⑦  Just look at the raw data
⑧  Train a simple classifier, see what mistakes it makes
⑨  Write a classifier using handwritten rules
⑩  Pick a fancy method that you want to apply (Deep Learning/Nnet)
-- Maarten Bosma
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
Data Science “folk knowledge” (A of A)
Lessons from Kaggle Winners
①  Don’t over-fit
②  All predictors are not needed
•  All data rows are not needed, either
③  Tuning the algorithms will give different results
④  Reduce the dataset (Average, select transition data,…)
⑤  Test set & training set can differ
⑥  Iteratively explore & get your head around data
⑦  Don’t be afraid to submit simple solutions
⑧  Keep a tab & history of your submissions
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
Data Scientist (noun): Person who is better at
statistics than any software engineer & better
at software engineering than any statistician
– Josh Wills (Cloudera)
Data Scientist (noun): Person who is worse at
statistics than any statistician & worse at
software engineering than any software
engineer – Will Cukierski (Kaggle)
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
Essential Reading List
o  A few useful things to know about machine learning - by Pedro Domingos
•  http://dl.acm.org/citation.cfm?id=2347755
o  The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
•  http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o  http://www.no-free-lunch.org/
o  Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
•  http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o  A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
•  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o  Avoid these three mistakes, James Faghmous
•  https://medium.com/about-data/73258b3848a4
o  Leakage in Data Mining: Formulation, Detection, and Avoidance
•  http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
①  An Introduction to Statistical Learning
•  http://www-bcf.usc.edu/~gareth/ISL/
②  ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
•  http://online.stanford.edu/course/statistical-learning-winter-2014
③  Prof. Pedro Domingos
•  https://class.coursera.org/machlearning-001/lecture/preview
④  Prof. Andrew Ng
•  https://class.coursera.org/ml-003/lecture/preview
⑤  Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
•  https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥  Mathematicalmonk @ YouTube
•  https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦  The Elements Of Statistical Learning
•  http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
Of Models,
Performance, Evaluation
& Interpretation
11:30
Bias/Variance (1 of 2)
o Model Complexity
•  Complex Model increases the
training data fit
•  But then it overfits & doesn't
perform as well with real data
o  Bias vs. Variance
o  Classical diagram
o  From ESL II, by Hastie, Tibshirani & Friedman
o  Bias – Model learns wrong things; not
complex enough; error gap small; more
data by itself won’t help
o  Variance – Different dataset will give
different error rate; over fitted model;
larger error gap; more data could help
Learning Curve : Prediction Error & Training Error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Bias/Variance (2 of 2)
o High Bias
•  Due to Underfitting
•  Add more features
•  More sophisticated model
•  Quadratic Terms, complex equations,…
•  Decrease regularization
o High Variance
•  Due to Overfitting
•  Use fewer features
•  Use more training sample
•  Increase Regularization
Learning Curve : Prediction Error & Training Error
Ref: Strata 2013 Tutorial by Olivier Grisel
High bias : need more features or a more complex model to improve
High variance : need more data to improve
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
Partition Data !
•  Training (60%)
•  Validation(20%) &
•  “Vault” Test (20%) Data sets
k-fold Cross-Validation
•  Split data into k equal parts
•  Fit model to k-1 parts &
calculate prediction error on kth
part
•  Non-overlapping dataset
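The fold bookkeeping can be hand-rolled in a few lines of base R (the model fit itself is stubbed out; only the split mechanics are shown):

```r
set.seed(42)
n <- 100   # pretend the training set has 100 rows
k <- 5
folds <- sample(rep(1:k, length.out = n))   # non-overlapping fold label per row

cv_errors <- sapply(1:k, function(i) {
  train_idx <- which(folds != i)   # fit the model on these k-1 parts
  test_idx  <- which(folds == i)   # calculate prediction error on the kth part
  # stub: a real run would return mean(predict(fit, data[test_idx, ]) != y[test_idx])
  length(test_idx) / n
})
mean(cv_errors)   # the cross-validated estimate (here 1/k = 0.2 by construction)
```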
Data Partition &
Cross-Validation
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
K-fold CV (k=5) : partition the data into Train / Validate / Test; in each of the 5 rounds a different fold (#1 … #5) is held out for validation while the remaining four train
Bootstrap
•  Draw datasets (with replacement) and fit model for each dataset
•  Remember : Data Partitioning (#1) & Cross Validation (#2) are without
replacement
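A base-R sketch of the bootstrap, estimating the standard error of a mean by resampling with replacement:

```r
set.seed(42)
x <- rnorm(50)   # stand-in sample

# Draw 1000 datasets (with replacement) and fit the "model" (here, the mean)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)            # bootstrap standard error of the mean
sd(x) / sqrt(length(x))   # analytic estimate, for comparison
```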
Bootstrap & Bagging
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
Bagging (Bootstrap aggregation)
◦  Average prediction over a collection of bootstrapped samples, thus reducing variance

Boosting
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
◦  “Output of weak classifiers into a powerful committee”
◦  Final Prediction = weighted majority vote
◦  Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦  AdaBoost (Adaptive Boosting)
◦  Boosting vs Bagging
–  Bagging – independent trees
–  Boosting – successively weighted
◦  Builds a large collection of de-correlated trees & averages them
◦  Improves Bagging by selecting i.i.d* random variables for splitting
◦  Simpler to train & tune
◦  “Do remarkably well, with very little tuning required” – ESL II
◦  Less susceptible to over fitting (than boosting)
◦  Many RF implementations
–  Original version - Fortran-77 ! By Breiman/Cutler
–  Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Random Forests+
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
◦  Two Step
–  Develop a set of learners
–  Combine the results to develop a composite predictor
◦  Ensemble methods can take the form of:
–  Using different algorithms
–  Using the same algorithm with different settings
–  Assigning different parts of the dataset to different classifiers
◦  Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Ensemble Methods
—  Goal
◦  Model Complexity (-)
◦  Variance (-)
◦  Prediction Accuracy (+)
Random Forests
o  While Boosting splits based on best among all variables, RF splits based on best among
randomly chosen variables
o  Simpler because it requires two variables – no. of Predictors (typically √k) & no. of trees
(500 for large dataset, 150 for smaller)
o  Error prediction
•  For each iteration, predict for dataset that is not in the sample (OOB data)
•  Aggregate OOB predictions
•  Calculate Prediction Error for the aggregate, which is basically the OOB
estimate of error rate
•  Can use this to search for optimal # of predictors
•  We will see how close this is to the actual error in the Heritage Health Prize
o  Assumes equal cost for mis-prediction. Can add a cost function
o  Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
[Diagram] Classifiers (Categorical Variables) vs. Regression (Continuous Variables): Decision Trees, k-NN (Nearest Neighbors), CART, Boosting & Bagging placed along axes of Bias vs. Variance and Model Complexity / Over-fitting
Model Evaluation &
Interpretation
Relevant Digression
Cross Validation
o Reference:
•  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
•  Chris Clark's blog:
http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
•  Predictive Modelling in Python with scikit-learn, Olivier Grisel, Strata 2013
•  titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
Model Evaluation - Accuracy
o Accuracy = (tp + tn) / (tp + fp + fn + tn)
o For cases where tn is large compared to tp, a degenerate return(false) will be
very accurate !
o Hence the F-measure is a better reflection of the model strength

            Predicted=1            Predicted=0
Actual=1    True+ (tp)             False- (fn) – Type II
Actual=0    False+ (fp) – Type I   True- (tn)
Model Evaluation – Precision & Recall
o  Precision = How many of the items we identified are relevant
o  Recall = How many of the relevant items did we identify
o  Inverse relationship – the tradeoff depends on the situation
•  Legal – Coverage is more important than correctness
•  Search – Accuracy is more important
•  Fraud
•  Support cost (high fp) vs. wrath of credit card co. (high fn)

Precision = tp / (tp + fp)   •  a.k.a. Accuracy, Relevancy
Recall = tp / (tp + fn)   •  a.k.a. True +ve Rate, Coverage, Sensitivity, Hit Rate
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

fp rate = fp / (fp + tn)   •  a.k.a. Type I Error Rate, False +ve Rate, False Alarm Rate
•  Specificity = 1 – fp rate
•  Type I Error = fp ; Type II Error = fn
Predicted=1	
   Predicted=0	
  
Actual	
  =1	
   True+	
  (tp)	
   False-­‐	
  (fn)	
  -­‐	
  Type	
  II	
  
Actual=0	
   False+	
  (fp)	
  -­‐	
  Type	
  I	
   True-­‐	
  (tn)	
  
Confusion Matrix

            Predicted
Actual      C1    C2    C3    C4
C1          10     5     9     3
C2           4    20     3     7
C3           6     4    13     3
C4           2     1     4    15

Correct ones are on the diagonal (cii)
Precision(i) = cii / Σj cji   (sum down column i)
Recall(i) = cii / Σj cij   (sum across row i)
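The per-class formulas generalize directly to code. Using the 4-class matrix from this slide (rows = actual, columns = predicted; the helper name is illustrative, not from the deck):

```python
# The 4-class confusion matrix from the slide
C = [[10,  5,  9,  3],
     [ 4, 20,  3,  7],
     [ 6,  4, 13,  3],
     [ 2,  1,  4, 15]]

def per_class_metrics(C):
    """Precision = diagonal / column sum; Recall = diagonal / row sum."""
    n = len(C)
    precision = [C[i][i] / sum(C[j][i] for j in range(n)) for i in range(n)]
    recall    = [C[i][i] / sum(C[i][j] for j in range(n)) for i in range(n)]
    return precision, recall

precision, recall = per_class_metrics(C)
# e.g. for C1: precision = 10/22 (column sum 22), recall = 10/27 (row sum 27)
```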
Model Evaluation : F-Measure
Precision = tp / (tp+fp) ; Recall = tp / (tp+fn)
F-Measure
Balanced, Combined, Weighted Harmonic Mean; measures effectiveness

F = 1 / (α·(1/P) + (1 – α)·(1/R)) = (β² + 1)PR / (β²P + R)

Common Form (Balanced F1) : β=1 (α = ½) ; F1 = 2PR / (P + R)
Hands-on Walkthru - Model Evaluation
Train : 712 (80%) | Test : 179 | Total : 891
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval
Kappa measure is interesting
Refer to 2-Model_Evaluation.R
at https://github.com/xsankar/hairy-octo-hipster/
ROC Analysis
o “How good is my model?”
o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing,
organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate
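Each point on a ROC graph is the (fp rate, tp rate) pair you get at one score threshold; sweeping the threshold traces the curve. A small illustrative sketch (pure Python; `roc_points` is a name I chose, not a library call):

```python
def roc_points(scores, labels):
    """(fp rate, tp rate) pairs obtained by sweeping a threshold over the scores.
    labels are 1 (positive) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for thr in sorted(set(scores), reverse=True):   # highest threshold first
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.55], [1, 1, 0, 1, 0])
```

A perfect ranker hugs the north-west corner; a random one tracks the diagonal.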
ROC Graph - Discussion
o  E = Conservative, Everything NO
o  H = Liberal, Everything YES
o  Am not making any political statement !
o  F = Ideal
o  G = Worst
o  The diagonal is the chance line
o  North-West corner is good
o  South-East is bad
•  For example E
•  Believe it or not - I have actually seen a graph with the curve in this region !
[ROC graph with points E, F, G, H marked]
ROC Graph – Clinical Example
IFCC : Measures of diagnostic accuracy: basic definitions
ROC Graph Walk thru
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
The Beginning As The End
Who will win Super Bowl XLIX ?
12:15
References:
o  An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
•  http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-
learning
o  Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
•  http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o  Just The Basics, Strata 2013, William Cukierski & Ben Hamner
•  http://strataconf.com/strata2013/public/schedule/detail/27291
o  The Problem of Multiple Testing
•  http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/
PIIS1934148209014609.pdf
Homework:
Bike Sharing at Washington DC
12:30
A few interesting links - comb the forums
o  Quick First prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing
•  Solution by Brandon Harris
o  Random forest http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language
o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-
prediction
o  GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm
o  Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare
o  Ggplot http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r
o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances
o  Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour
o  Casual & Registered Users :
http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count
o  RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please
o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r
o  Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data
o  Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402
o  Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
Data Organization – train, test & submission
•  datetime - hourly date + timestamp
•  Season
•  1 = spring, 2 = summer, 3 = fall, 4 = winter
•  holiday - whether the day is considered a holiday
•  workingday - whether the day is neither a weekend nor holiday
•  Weather
•  1: Clear, Few clouds, Partly cloudy, Partly cloudy
•  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
•  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds,
Light Rain + Scattered clouds
•  4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
•  temp - temperature in Celsius
•  atemp - "feels like" temperature in Celsius
•  humidity - relative humidity
•  windspeed - wind speed
•  casual - number of non-registered user rentals initiated
•  registered - number of registered user rentals initiated
•  count - number of total rentals
Approach
o Convert to factors
o Engineer new features from date
o Explore other synthetic features
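"Engineer new features from date" means splitting the hourly datetime string into hour, month, year, and weekday columns. The session code does this in R (3-Session-I-Bikes.R); an equivalent illustrative sketch in Python (the helper name is mine):

```python
from datetime import datetime

def engineer(dt_string):
    """Derive hour / month / year / weekday features from the Kaggle
    bike-sharing 'datetime' column (format: YYYY-MM-DD HH:MM:SS)."""
    dt = datetime.strptime(dt_string, "%Y-%m-%d %H:%M:%S")
    return {"hour": dt.hour, "month": dt.month,
            "year": dt.year, "weekday": dt.weekday()}

feats = engineer("2011-01-20 17:00:00")
```

Hour in particular matters: commute peaks around 8am and 5pm dominate the rental counts, and a model that never sees the hour cannot capture them.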
#1 : ctree
Refer to 3-Session-I-Bikes.R
at https://github.com/xsankar/hairy-octo-hipster/
#2 : Add Month + year
#3 : RandomForest

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 

R, Data Wrangling & Predicting NFL with Elo like Nate Silver & 538

  • 5. About Me o  Chief Data Scientist at BlackArrow.tv o  Have been speaking at OSCON, PyCon, PyData et al o  Reviewing Packt Book “Machine Learning with Spark” o  Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark” o  Have done lots of things: •  Big Data (Retail, Bioinformatics, Financial, AdTech) •  Written books (Web 2.0, Wireless, Java, …) •  Standards, some work in AI •  Guest Lecturer at Naval PG School, … •  Planning MS-CFinance or Statistics •  Volunteer as Robotics Judge at First Lego League World Competitions o  @ksankar, doubleclix.wordpress.com The Nuthead band!
  • 6. Setup & Data R & IDE o  Install R o  Install R Studio Tutorial Materials o  Github : https://github.com/xsankar/hairy-octo-hipster o  Clone or download zip Setup an account in Kaggle (www.kaggle.com) We will be using the data from 2 Kaggle competitions ①  Titanic: Machine Learning from Disaster Download data from http://www.kaggle.com/c/titanic-gettingStarted Directory ~/hairy-octo-hipster/titanic-r ②  Predicting Bike Sharing @ Washington DC Download data from http://www.kaggle.com/c/bike-sharing-demand/data Directory ~/hairy-octo-hipster/bike ③  2014 NFL Boxscore http://www.pro-football-reference.com/years/2014/games.htm Directory ~/hairy-octo-hipster/nfl Data
  • 7. Agenda o  Jan 18 : 9:00-12:30 3 hrs o  Intro, Goals, Logistics, Setup [10] [9:00-9:10) o  Introduction to R & dplyr [30] [9:10-9:40) o  Who will win Superbowl XLIX ? The Art of ELO Ranking [30] [9:40-10:10) •  The Algorithm •  The Data •  The Results (Compare with FiveThirtyEight) o  Anatomy of a Kaggle Competition [40] [10:10-10:50) •  Competition Mechanics •  Register, download data, create sub directories •  Trial Run : Submit Titanic o  Break [20] [10:50-11:10) o  Algorithms for the Amateur Data Scientist [20] [11:10-11:30) •  Algorithms, Tools & frameworks in perspective •  “Folk Wisdom” o  Model Evaluation & Interpretation [30] [11:30 - 12:00) •  Confusion Matrix, ROC Graph o  Homework : The Art of a Competition – Bike Sharing o  Homework : The Art of a Competition – Walmart
  • 8. Overload Warning … There is enough material for a week’s training … which is good & bad ! Read thru at your pace, refer, ponder & internalize
  • 9. Close Encounters — 1st ◦ This Tutorial — 2nd ◦ Do More Hands-on Walkthrough — 3rd ◦ Listen To Lectures ◦ More competitions …
  • 11. R Syntax – A quick overview o aString <- "A String" o aNumber <- 12 o class(aString) o class(aNumber) o aVector <- c(1,2,3,4) o class(aVector) o aVector * 2 o sqrt(aVector) o Packages : dplyr & tidyr
  • 12. Data wrangling with dplyr o  dplyr – a versatile package for various data operations o  We will see dplyr in use o  Resources: •  “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 Tutorial •  http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/ •  Slides https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a •  Slides of Tutorial by RStudio’s Garrett Grolemund •  https://github.com/rstudio/webinars •  And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
  • 14. dplyr joins Hiroaki Yutani @yutannihilatio inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
  • 15. dplyr joins Hiroaki Yutani @yutannihilatio inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
  • 16. Who will win Super Bowl XLIX 9:40
  • 17. The Art of ELO Ranking & Super Bowl XLIX o Let us look at this from a few angles: •  The Algorithm •  The R program •  The Data •  The Results •  Comparing with the FiveThirtyEight Results http://www.imdb.com/title/tt1285016/trivia?item=qt1318850 I need the Algorithm, I need the Algorithm – Mark Z to Eduardo S
  • 18.
  • 19. The ELO Algorithm (1 of 3) 1.  Basic Chess Algorithm proposed by Elo •  Arpad Emrick Elo proposed the system for Chess ranking •  Rnew = Rold + K(S − μ), where μij = 1 / (1 + 10^((Rj,old − Ri,old)/400)) •  K – varies depending on the match •  Sij = 1, ½ or 0 2.  Soccer Ranking •  http://www.eloratings.net/system.html 3.  NFL Ranking with adjusted factor for scores, 538 Ranking Ref : Who is #1, Princeton University Press
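The update rule above can be sketched in a few lines of Python (an illustrative translation of the formula, not the deck's R program; the function names are mine, and k = 20 matches the factor used later in the deck):

```python
# Sketch of the basic Elo update: R_new = R_old + K(S - mu),
# with expected score mu_i = 1 / (1 + 10^((R_j - R_i)/400)).

def expected_score(r_i, r_j):
    """Expected score of player i against player j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400.0))

def elo_update(r_i, r_j, s_i, k=20):
    """Return updated ratings after one game.

    s_i is 1 for an i win, 0.5 for a tie, 0 for a loss.
    """
    mu_i = expected_score(r_i, r_j)
    new_i = r_i + k * (s_i - mu_i)
    new_j = r_j + k * ((1 - s_i) - (1 - mu_i))  # zero-sum update
    return new_i, new_j

# Equal ratings: expected score is 0.5, so a win moves i up by k/2.
print(elo_update(1500, 1500, 1))  # -> (1510.0, 1490.0)
```

Note the update is zero-sum: whatever rating one team gains, the other loses.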
  • 20. The ELO Algorithm (2 of 3) NFL Ranking http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
  • 21. The ELO Algorithm (3 of 3) NFL Ranking
  • 24. The Analysis - Ranks
  • 25. The Analysis – Week 1, Week 18
  • 26. Analysis – Week 20 Results
  • 27. Wisdom from Nate Silver & the 538 Gang … o  [Homework #1] Improve our core algorithm to add the Margin of victory from the 538 gang ! •  Remember, kFactor = 20 o  [Homework #2] Weigh recent games more heavily w/ Exponential Decay
  • 28. The Art of ELO Ranking & Super Bowl XLIX o The real formula is o Not what is written on the glass ! o But then that is Hollywood ! I need the Algorithm, I need the Algorithm – Mark Z to Eduardo S Ref : Who is #1, Princeton University Press
  • 29. References: o  ELO ranking – NFL, Soccer •  http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/ •  http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/ •  http://www.eloratings.net/system.html o  dplyr •  http://www.rstudio.com/resources/webinars/ <- github for the slides •  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/ •  http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/ •  http://www.rstudio.com/resources/cheatsheets/ •  http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
  • 30. Anatomy Of a Kaggle Competition 10:10
  • 31. Kaggle Data Science Competitions o  Hosts Data Science Competitions o  Competition Attributes: •  Dataset •  Train •  Test (Submission) •  Final Evaluation Data Set (We don’t see) •  Rules •  Time boxed •  Leaderboard •  Evaluation function •  Discussion Forum •  Private or Public
  • 32. Titanic Passenger Metadata •  Small •  3 Predictors •  Class •  Sex •  Age •  Survived? http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html City Bike Sharing Prediction (Washington DC) Walmart Store Forecasting
  • 33. Train.csv – taken from the Titanic Passenger Manifest. Variables: Survived – 0=No, 1=Yes; Pclass – Passenger Class (1st, 2nd, 3rd); Sibsp – Number of Siblings/Spouses Aboard; Parch – Number of Parents/Children Aboard; Embarked – Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Titanic Passenger Metadata: Small; 3 Predictors – Class, Sex, Age; Survived?
  • 34. Test.csv   Submission o 418 lines; 1st column should have 0 or 1 in each line o Evaluation: •  % correctly predicted
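The evaluation above (% correctly predicted) is easy to sketch. This illustrative Python snippet (not the official Kaggle scorer; the column names are assumptions) writes a 0/1 submission file and scores predictions against known labels:

```python
import csv, io

def write_submission(ids, preds, fh):
    """Write one 0/1 prediction per passenger id."""
    w = csv.writer(fh)
    w.writerow(["PassengerId", "Survived"])  # assumed header
    for pid, p in zip(ids, preds):
        w.writerow([pid, p])

def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

buf = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buf)
print(accuracy([0, 1, 0], [0, 1, 1]))  # -> 0.6666666666666666
```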
  • 35. Approach o  This is a classification problem - 0 or 1 o  Comb the forums ! o  Opportunity for us to try different algorithms & compare them •  Simple Model •  CART[Classification & Regression Tree] •  Greedy, top-down binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions •  RandomForest •  Different parameters •  SVM •  Multiple kernels •  Table the results o  Use cross validation to predict our model performance & correlate with what Kaggle says http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
  • 36. Simple Model – Our First Submission o #1 : Simple Model (M=survived) o #2 : Simple Model (F=survived) https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
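The two baseline submissions above amount to a one-rule classifier. A hedged Python sketch on a tiny made-up passenger list (the real rows come from Kaggle's train.csv):

```python
# Baseline #2 from the slide: predict survived=1 for every female.
passengers = [  # toy data, invented for illustration
    {"Sex": "male", "Survived": 0},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "male", "Survived": 1},
]

def predict_female_survives(p):
    return 1 if p["Sex"] == "female" else 0

correct = sum(predict_female_survives(p) == p["Survived"] for p in passengers)
print(correct / len(passengers))  # -> 0.75
```

Simple rules like this set the floor that CART, Random Forest, and SVM submissions must beat.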
  • 37. #3 : Simple CART Model o CART (Classification & Regression Tree) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees May be better, because we have improved on the survival of men ! Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
  • 38. #4 : Random Forest Model o  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience •  Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ o  https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests o  https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
  • 39. #5 : SVM o Multiple Kernels o kernel = ‘radial’ # Radial Basis Function o kernel = ‘sigmoid’ o  agconti's blog - Ultimate Titanic ! o  http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713 Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
  • 40. Feature Engineering - Homework o  Add attribute : Age •  In train 714/891 have age; in test 332/418 have age •  Missing values can be just Mean Age of all passengers •  We could be more precise and calculate Mean Age based on Title (Ms, Mrs, Master et al) •  Box plot age o  Add attribute : Mother, Family size et al o  Feature engineering ideas •  http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python o  More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/ o  And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
  • 41. What does it mean ? Let us ponder …. o  We have a training data set representing a domain •  We reason over the dataset & develop a model to predict outcomes o  How good is our prediction when it comes to real life scenarios ? o  The assumption is that the dataset is taken at random •  Or Is it ? Is there a Sampling Bias ? •  i.i.d ? Independent ? Identically Distributed ? •  What about homoscedasticity ? Do they have the same finite variance ? o  Can we assure that another dataset (from the same domain) will give us the same result ? o  Will our model & its parameters remain the same if we get another data set ? o  How can we evaluate our model ? o  How can we select the right parameters for a selected model ?
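The title-based mean-age imputation suggested in the homework above can be sketched as follows (made-up sample rows; only the Name/Age fields mimic the Titanic data, and the helper names are mine):

```python
import re
from statistics import mean

rows = [  # toy rows standing in for train.csv
    {"Name": "Braund, Mr. Owen", "Age": 22.0},
    {"Name": "Allen, Mr. William", "Age": 35.0},
    {"Name": "Doe, Mr. John", "Age": None},  # missing -> mean of "Mr" ages
    {"Name": "Heikkinen, Miss. Laina", "Age": 26.0},
]

def title_of(name):
    """Extract the title (Mr, Mrs, Miss, Master, ...) from a name."""
    m = re.search(r",\s*([A-Za-z]+)\.", name)
    return m.group(1) if m else "Unknown"

# Mean age per title, over rows where age is known.
by_title = {}
for r in rows:
    if r["Age"] is not None:
        by_title.setdefault(title_of(r["Name"]), []).append(r["Age"])
means = {t: mean(v) for t, v in by_title.items()}

# Fill missing ages from the title-level mean.
for r in rows:
    if r["Age"] is None:
        r["Age"] = means.get(title_of(r["Name"]))

print(rows[2]["Age"])  # -> 28.5
```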
  • 43. Algorithms for the Amateur Data Scientist “A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.” - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have … 11:10
  • 44. Ref: Anthony’s Kaggle Presentation Data Scientists apply different techniques •  Support Vector Machine •  adaBoost •  Bayesian Networks •  Decision Trees •  Ensemble Methods •  Random Forest •  Logistic Regression •  Genetic Algorithms •  Monte Carlo Methods •  Principal Component Analysis •  Kalman Filter •  Evolutionary Fuzzy Modelling •  Neural Networks Quora •  http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
  • 45. Algorithm spectrum o  Regression o  Logit o  CART o  Ensemble : Random Forest o  Clustering o  KNN o  Genetic Alg o  Simulated Annealing o  Collab Filtering o  SVM o  Kernels o  SVD o  NNet o  Boltzmann Machine o  Feature Learning (spectrum labels: Machine Learning | Cute Math | Artificial Intelligence)
  • 46. Classifying Classifiers o  Statistical: Regression, Logistic Regression¹, Naïve Bayes, Bayesian Networks (¹Max Entropy Classifier) o  Structural: Rule-based (Production Rules, Decision Trees), Distance-based (Functional: Linear, Spectral, Wavelet; Nearest Neighbor: kNN, Learning Vector Quantization), Neural Networks (Multi-layer Perceptron) o  Ensemble: Random Forests, Boosting, SVM Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
  • 47. [Diagram] Classifiers (categorical variables) & Regression (continuous variables): Decision Trees, k-NN (Nearest Neighbors), CART, Bagging, Boosting; axes: Bias/Variance vs. Model Complexity & Over-fitting
  • 49. Data Science “folk knowledge” (1 of A) o  "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions o  Learning = Representation + Evaluation + Optimization o  It’s Generalization that counts •  The fundamental goal of machine learning is to generalize beyond the examples in the training set o  Data alone is not enough •  Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it o  Machine Learning is not magic – one cannot get something from nothing •  In order to infer, one needs the knobs & the dials •  One also needs a rich expressive dataset. A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 50. Data Science “folk knowledge” (2 of A) o  Over fitting has many faces •  Bias – Model not strong enough. So the learner has the tendency to learn the same wrong things •  Variance – Learning too much from one dataset; model will fall apart (ie much less accurate) on a different dataset •  Sampling Bias o  Intuition Fails in high Dimensions – Bellman •  Blessing of non-conformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold eg. Space of digits is much smaller than the space of images o  Theoretical Guarantees are not What they seem •  One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. o  Feature engineering is the Key A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 51. Data Science “folk knowledge” (3 of A) o  More Data Beats a Cleverer Algorithm •  Or conversely select algorithms that improve with data •  Don’t optimize prematurely without getting more data o  Learn many models, not Just One •  Ensembles ! – Change the hypothesis space •  Netflix prize •  E.g. Bagging, Boosting, Stacking o  Simplicity Does not necessarily imply Accuracy o  Representable Does not imply Learnable •  Just because a function can be represented does not mean it can be learned o  Correlation Does not imply Causation o  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o  A few useful things to know about machine learning - by Pedro Domingos §  http://dl.acm.org/citation.cfm?id=2347755
  • 52. Data Science “folk knowledge” (4 of A) o  The simplest hypothesis that fits the data is also the most plausible •  Occam’s Razor •  Don’t go for a 4 layer Neural Network unless you have that complex data •  But that doesn’t also mean that one should choose the simplest hypothesis •  Match the impedance of the domain, data & the algorithms o  Think of over fitting as memorizing as opposed to learning. o  Data leakage has many forms o  Sometimes the Absence of Something is Everything o  [Corollary] Absence of Evidence is not the Evidence of Absence New to Machine Learning? Avoid these three mistakes, James Faghmous https://medium.com/about-data/73258b3848a4 §  Simple model: high error line that cannot be compensated with more data, but gets to a lower error rate with fewer data points §  Complex model: lower error line, but needs more data points to reach decent error Ref: Andrew Ng/Stanford, Yaser S./CalTech
  • 53. Importance of feature selection & weak models o “Good features allow a simple model to beat a complex model” - Ben Lorica1 o “… using many weak predictors will always be more accurate than using a few strong ones …” – Vladimir Vapnik2 o “A good decision rule is not a simple one, it cannot be described by a very few parameters” 2 o “Machine learning science is not only about computers, but about humans, and the unity of logic, emotion, and culture.” 2 o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you” – Hadley Wickham3 1 http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html 2 http://nautil.us/issue/6/secret-codes/teaching-me-softly 3 http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
  • 54. Check your assumptions o  The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data o  For example, for regression one should check that: ① Variables are normally distributed •  Test for normality via visual inspection, skew & kurtosis, outlier inspections via plots, z-scores et al ② There is a linear relationship between the dependent & independent variables •  Inspect residual plots, try quadratic relationships, try log plots et al ③ Variables are measured without error ④ Assumption of Homoscedasticity §  Homoscedasticity assumes constant or near constant error variance §  Check the standard residual plots and look for heteroscedasticity §  For example in the figure, the left box has the errors scattered randomly around zero, while the right two diagrams have the errors unevenly distributed Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2
  • 55. Data Science “folk knowledge” (5 of A) Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/ o  Known Knowns – there are things we know that we know (what we do) o  Known Unknowns – there are things we now know we don't know; potential facts or outcomes we are aware of, but not with certainty (stochastic processes, probabilities) o  Unknown Knowns – others know, you don’t o  Unknown Unknowns – there are things we do not know we don't know; facts, outcomes or scenarios we have not encountered, nor considered; “black swans”, outliers, long tails of probability distributions; lack of experience, imagination
  • 56. Data Science “folk knowledge” (6 of A) - Pipeline o  Collect → Store → Transform (Data Management); Reason → Model → Deploy (Data Science) o  Collect: volume, velocity, streaming data; access to multiple sources of data; data fabric across the organization; think hybrid – Big Data apps, appliances & infrastructure o  Store: canonical form, data catalog, metadata, monitor counters & metrics, structured vs. multi-structured o  Transform: flexible & selectable data subsets & attribute sets; refine model with extended data subsets & engineered attribute sets; validation run across a larger data set o  Model & Deploy: scalable model deployment; Big Data automation & purpose-built appliances (soft/hard); manage SLAs & response times; dynamic data sets; 2-way key-value tagging of datasets; advanced analytics; explore, visualize, recommend, predict; performance, scalability, refresh latency; in-memory analytics; advanced visualization, interactive dashboards, map overlay, infographics ¤  Bytes to Business a.k.a. Build the full stack ¤  Find Relevant Data For Business ¤  Connect the Dots
  • 57. Data Science “folk knowledge” (7 of A) o  Volume, Velocity, Variety – “data of unusual size” that can't be brute forced o  Context, Connectedness, Intelligence, Interface, Inference o  Three Amigos: Interface = Cognition; Intelligence = Compute (CPU) & Computational (GPU); Infer Significance & Causality
  • 58. Data Science “folk knowledge” (8 of A) Jeremy’s Axioms o  Iteratively explore data o  Tools •  Excel Format, Perl, Perl Book o  Get your head around data •  Pivot Table o  Don’t over-complicate o  If people give you data, don’t assume that you need to use all of it o  Look at pictures ! o  History of your submissions – keep a tab o  Don’t be afraid to submit simple solutions •  We will do this during this workshop Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 59. Data Science “folk knowledge” (9 of A) ①  Common Sense (some features make more sense than others) ②  Carefully read these forums to get a peek at other people’s mindsets ③  Visualizations ④  Train a classifier (e.g. logistic regression) and look at the feature weights ⑤  Train a decision tree and visualize it ⑥  Cluster the data and look at what clusters you get out ⑦  Just look at the raw data ⑧  Train a simple classifier, see what mistakes it makes ⑨  Write a classifier using handwritten rules ⑩  Pick a fancy method that you want to apply (Deep Learning/Nnet) -- Maarten Bosma -- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
  • 60. Data Science “folk knowledge” (A of A) Lessons from Kaggle Winners ①  Don’t over-fit ②  All predictors are not needed •  All data rows are not needed, either ③  Tuning the algorithms will give different results ④  Reduce the dataset (Average, select transition data,…) ⑤  Test set & training set can differ ⑥  Iteratively explore & get your head around data ⑦  Don’t be afraid to submit simple solutions ⑧  Keep a tab & history your submissions
  • 61. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician – Josh Wills (Cloudera) Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer – Will Cukierski (Kaggle) http://doubleclix.wordpress.com/2014/01/25/the-­‐curious-­‐case-­‐of-­‐the-­‐data-­‐scientist-­‐profession/ Large is hard; Infinite is much easier ! – Titus Brown
  • 62. Essential Reading List o  A few useful things to know about machine learning - by Pedro Domingos •  http://dl.acm.org/citation.cfm?id=2347755 o  The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert •  http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf o  http://www.no-free-lunch.org/ o  Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y. •  http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf o  A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe •  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o  Avoid these three mistakes, James Faghmous •  https://medium.com/about-data/73258b3848a4 o  Leakage in Data Mining: Formulation, Detection, and Avoidance •  http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 63. For your reading & viewing pleasure … An ordered List ①  An Introduction to Statistical Learning •  http://www-bcf.usc.edu/~gareth/ISL/ ②  ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning •  http://online.stanford.edu/course/statistical-learning-winter-2014 ③  Prof. Pedro Domingos •  https://class.coursera.org/machlearning-001/lecture/preview ④  Prof. Andrew Ng •  https://class.coursera.org/ml-003/lecture/preview ⑤  Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data •  https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 ⑥  Mathematicalmonk @ YouTube •  https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA ⑦  The Elements Of Statistical Learning •  http://statweb.stanford.edu/~tibs/ElemStatLearn/ http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 64. Of Models, Performance, Evaluation & Interpretation 11:30
  • 65. Bias/Variance (1 of 2) o Model Complexity •  Complex Model increases the training data fit •  But then it overfits & doesn't perform as well with real data o  Bias vs. Variance o  Classical diagram from ESLII, by Hastie, Tibshirani & Friedman o  Bias – Model learns wrong things; not complex enough; error gap small; more data by itself won’t help o  Variance – Different dataset will give different error rate; over fitted model; larger error gap; more data could help Prediction Error / Training Error Ref: Andrew Ng/Stanford, Yaser S./CalTech Learning Curve
  • 66. Bias/Variance (2 of 2) o High Bias •  Due to Underfitting •  Add more features •  More sophisticated model •  Quadratic Terms, complex equations,… •  Decrease regularization o High Variance •  Due to Overfitting •  Use fewer features •  Use more training samples •  Increase Regularization Prediction Error / Training Error Ref: Strata 2013 Tutorial by Olivier Grisel Learning Curve: need more features or a more complex model to improve (high bias); need more data to improve (high variance) 'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
  • 67. Data Partition & Cross-Validation o  Partition Data ! •  Training (60%) •  Validation (20%) & •  “Vault” Test (20%) data sets o  k-fold Cross-Validation •  Split data into k equal parts •  Fit model to k-1 parts & calculate prediction error on the kth part •  Non-overlapping datasets —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) K-fold CV (k=5): each of the 5 parts serves once as the validation set while the remaining 4 train
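The k-fold scheme above can be sketched in a few lines (illustrative Python; the function name is mine, and real runs would shuffle the indices first):

```python
# k non-overlapping validation folds; each is paired with the
# remaining k-1 folds as the training set.
def kfold_indices(n, k):
    # Fold sizes differ by at most 1 when n is not divisible by k.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

for train, val in kfold_indices(10, 5):
    print(val)  # prints the 5 disjoint validation folds
```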
  • 68. Bootstrap & Bagging o  Bootstrap •  Draw datasets (with replacement) and fit model for each dataset •  Remember : Data Partitioning (#1) & Cross Validation (#2) are without replacement o  Bagging (Bootstrap aggregation) ◦  Average prediction over a collection of bootstrapped samples, thus reducing variance —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
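Bagging as described above, in a deliberately tiny sketch where each "model" is just a sample mean (purely illustrative; function names are mine):

```python
import random

def bootstrap_sample(data, rng):
    # Same size as the original, drawn WITH replacement.
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=200, seed=42):
    """Fit a trivial 'model' (the mean) on each bootstrap sample
    and average the predictions over all models."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        preds.append(sum(sample) / len(sample))
    return sum(preds) / len(preds)

data = [1, 2, 3, 4, 5]
print(round(bagged_mean(data), 2))  # close to the plain mean, 3.0
```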
  • 69. Boosting ◦  “Output of weak classifiers into a powerful committee” ◦  Final Prediction = weighted majority vote ◦  Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them ◦  AdaBoost (Adaptive Boosting) ◦  Boosting vs Bagging – Bagging: independent trees – Boosting: successively weighted —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
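One round of the reweighting idea above can be made concrete (a sketch of the AdaBoost-style weight update, not a full implementation; the function name is mine):

```python
import math

def adaboost_reweight(weights, correct):
    """weights: current sample weights; correct: bool per sample.
    Misclassified points get higher weight so the next weak
    learner concentrates on them."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)  # this learner's committee vote
    new_w = [w * math.exp(-alpha if c else alpha)
             for w, c in zip(weights, correct)]
    total = sum(new_w)                       # renormalize to sum to 1
    return [w / total for w in new_w], alpha

w = [0.25, 0.25, 0.25, 0.25]
w2, alpha = adaboost_reweight(w, [True, True, True, False])
print([round(x, 3) for x in w2])  # -> [0.167, 0.167, 0.167, 0.5]
```

After one round the single misclassified point carries as much weight as the three correct ones combined.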
  • 70. Random Forests+ ◦  Builds a large collection of de-correlated trees & averages them ◦  Improves Bagging by selecting i.i.d* random variables for splitting ◦  Simpler to train & tune ◦  “Do remarkably well, with very little tuning required” – ESLII ◦  Less susceptible to over fitting (than boosting) ◦  Many RF implementations – Original version - Fortran-77 ! By Breiman/Cutler – Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab * i.i.d – independent identically distributed + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
  • 71. Ensemble Methods — Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+)
  ◦  Two steps – Develop a set of learners – Combine the results to develop a composite predictor ◦  Ensemble methods can take the form of: – Using different algorithms, – Using the same algorithm with different settings, – Assigning different parts of the dataset to different classifiers ◦  Bagging & Random Forests are examples of ensemble methods
  Ref: Machine Learning In Action
  • 72. Random Forests o  While Bagging splits on the best among all variables, RF splits on the best among a randomly chosen subset of variables o  Simpler because it requires only two tuning parameters – the no. of predictors sampled per split (typically √p, where p = no. of features) & the no. of trees (500 for a large dataset, 150 for a smaller one) o  Error prediction •  For each iteration, predict for the data that is not in the sample (OOB data) •  Aggregate the OOB predictions •  Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate •  Can use this to search for the optimal # of predictors •  We will see how close this is to the actual error in the Heritage Health Prize o  Assumes equal cost for mis-prediction. Can add a cost function o  Proximity matrix & applications like adding missing data, dropping outliers
  Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective : Berk; A Brief Overview of RF by Dan Steinberg
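The OOB idea, that each bootstrap draw leaves out roughly a third of the rows which then act as a free test set, can be seen directly (illustrative sketch in plain Python):

```python
import random

# Out-of-bag (OOB) rows: the rows a bootstrap draw never touches.
# RF implementations do this per tree; this is a one-draw illustration.
random.seed(7)
n = 100
sample = [random.randrange(n) for _ in range(n)]   # bootstrap: with replacement
in_bag = set(sample)
oob = [i for i in range(n) if i not in in_bag]     # rows never drawn
oob_fraction = len(oob) / n                        # expect about (1 - 1/n)**n, i.e. ~36.8%
```

Predicting each row only with the trees that did NOT see it, then aggregating, gives the OOB error estimate described above without needing a separate validation set.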
  • 73. [Diagram: Classifiers (categorical variables) vs. Regression (continuous variables); Decision Trees, k-NN (Nearest Neighbors), CART, Bagging, Boosting; the bias-variance trade-off as model complexity grows, leading to over-fitting]
  • 75. Cross Validation o References: •  https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience •  Chris Clark’s blog : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/ •  Predictive Modeling in Python with scikit-learn, Olivier Grisel, Strata 2013 •  titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
  • 76. Model Evaluation - Accuracy o Accuracy = (tp + tn) / (tp + fp + fn + tn) o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate ! o Hence the F-measure is a better reflection of the model strength
              Predicted=1            Predicted=0
  Actual=1    True+ (tp)             False- (fn) – Type II
  Actual=0    False+ (fp) – Type I   True- (tn)
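With toy counts where tn dominates, the degenerate always-return-false classifier indeed scores higher than a real one (the counts below are made up for illustration):

```python
# Accuracy from the 2x2 confusion matrix, vs. the degenerate
# "always predict 0" classifier mentioned above (toy numbers).
tp, fn, fp, tn = 10, 5, 15, 970                      # tn dominates
total = tp + fp + fn + tn

accuracy = (tp + tn) / total                         # the real model: 0.98
always_no = (tn + fp) / total                        # predict 0 everywhere: 0.985
```

The "predict 0 for everything" model is right on every actual-0 case (tn + fp of them), so it beats the real model on raw accuracy, which is why the slide reaches for the F-measure instead.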
  • 77. Model Evaluation – Precision & Recall
  o  Precision = tp / (tp + fp) : How many of the items we identified are relevant (aka Accuracy, Relevancy)
  o  Recall = tp / (tp + fn) : How many of the relevant items did we identify (aka True +ve Rate, Coverage, Sensitivity, Hit Rate)
  o  False +ve Rate = fp / (fp + tn) (aka Type 1 Error Rate, False Alarm Rate); Specificity = 1 – fp rate; Type 1 Error = fp, Type 2 Error = fn
  o  Inverse relationship – the tradeoff depends on the situation •  Legal – coverage is more important than correctness •  Search – accuracy is more important •  Fraud – support cost (high fp) vs. wrath of the credit card co. (high fn)
  http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
  • 78. Confusion Matrix (rows = Actual, columns = Predicted)
        C1   C2   C3   C4
  C1    10    5    9    3
  C2     4   20    3    7
  C3     6    4   13    3
  C4     2    1    4   15
  Correct ones are the diagonal entries (cii). Precision (per column i) = cii / Σj cji ; Recall (per row i) = cii / Σj cij
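Computing the column-wise precision and row-wise recall for the 4-class matrix above (sketch in Python; the session's own code is R):

```python
# Per-class precision and recall from the multi-class confusion matrix
# on the slide (rows = actual class, columns = predicted class).
M = [[10,  5,  9,  3],
     [ 4, 20,  3,  7],
     [ 6,  4, 13,  3],
     [ 2,  1,  4, 15]]

def precision(M, i):
    """Diagonal entry over its column sum: of all predicted-as-i, how many were i."""
    return M[i][i] / sum(row[i] for row in M)

def recall(M, i):
    """Diagonal entry over its row sum: of all actual-i, how many we caught."""
    return M[i][i] / sum(M[i])
```

For class C1, for example, precision = 10/22 (column sum) and recall = 10/27 (row sum).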
  • 79. Model Evaluation : F-Measure
  Precision = tp / (tp + fp) : Recall = tp / (tp + fn)
  F-Measure: balanced, combined, weighted harmonic mean; measures effectiveness
  1/F = α (1/P) + (1 – α) (1/R), or equivalently F = (β² + 1) P R / (β² P + R)
  Common form (balanced F1) : β = 1 (α = ½) ; F1 = 2PR / (P + R)
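The formulas reduce to a one-line function (`f_beta` is my name for it; this is a sketch, not the session's code):

```python
# F-beta from precision and recall; beta=1 gives the balanced F1.
def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.5, 0.4
f1 = f_beta(p, r)      # 2PR / (P + R)
```

Note that β > 1 weights recall more heavily: swapping P and R changes the score for β = 2, whereas F1 is symmetric in the two.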
  • 80. Hands-on Walkthru - Model Evaluation Train : 712 (80%) Test : 179 Total : 891
  http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval; the Kappa measure is interesting
  Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
  • 81. ROC Analysis o “How good is my model?” o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf o “A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance” o Much better than evaluating a model based on simple classification accuracy o Plots tp rate vs. fp rate
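Each threshold on the classifier's score yields one (fp rate, tp rate) point of the ROC graph; sweeping the threshold traces the curve. A toy sketch (the scores and labels below are made up for illustration):

```python
# One ROC point per score threshold: tp rate vs. fp rate.
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]   # classifier scores
labels = [1,   1,   0,   1,    0,   0,   1,   0]      # true classes (4 pos, 4 neg)

def roc_point(threshold):
    """Predict 1 when score >= threshold; return (fp rate, tp rate)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos
```

A threshold above every score gives (0, 0), a threshold below every score gives (1, 1), and good classifiers bend the points between them toward the north-west corner.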
  • 82. ROC Graph - Discussion o  E = Conservative, everything NO o  H = Liberal, everything YES o Am not making any political statement ! o  F = Ideal o  G = Worst o  The diagonal is the chance line o  The north-west corner is good o  The south-east is bad •  For example E •  Believe it or not - I have actually seen a graph with the curve in this region !
  • 83. ROC Graph – Clinical Example IFCC : Measures of diagnostic accuracy: basic definitions
  • 84. ROC Graph Walk thru Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
  • 85. The Beginning As The End Who will win Super Bowl XLIX ? 12:15
  • 86. References: o  An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas •  http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning o  Advanced Machine Learning with scikit-learn, PyCon 2013 & Strata 2014, Olivier Grisel •  http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn o  Just The Basics, Strata 2013, William Cukierski & Ben Hamner •  http://strataconf.com/strata2013/public/schedule/detail/27291 o  The Problem of Multiple Testing •  http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
  • 88. Homework: Bike Sharing at Washington DC 12:30
  • 89. A few interesting links - Comb the forums o  Quick first prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing •  Solution by Brandon Harris o  Random forest : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction o  GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm o  Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare o  ggplot : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances o  Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour o  Casual & Registered Users : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count o  RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please o  http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r o  Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data o  Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402 o  Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
  • 90. Data Organization – train, test & submission •  datetime - hourly date + timestamp •  Season •  1 = spring, 2 = summer, 3 = fall, 4 = winter •  holiday - whether the day is considered a holiday •  workingday - whether the day is neither a weekend nor holiday •  Weather •  1: Clear, Few clouds, Partly cloudy, Partly cloudy •  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist •  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds •  4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog •  temp - temperature in Celsius •  atemp - "feels like" temperature in Celsius •  humidity - relative humidity •  windspeed - wind speed •  casual - number of non-registered user rentals initiated •  registered - number of registered user rentals initiated •  count - number of total rentals
  • 91. Approach o Convert to factors o Engineer new features from date o Explore other synthetic features
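The date-derived features can be pulled straight out of the Kaggle `datetime` column with the standard library. An illustrative Python sketch (the session itself does this in R; `date_features` is a made-up helper name):

```python
from datetime import datetime

# Engineer hour/month/year/weekday features from the `datetime` column,
# whose format in the train set is "YYYY-MM-DD HH:MM:SS".
def date_features(s):
    d = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    return {"hour": d.hour, "month": d.month,
            "year": d.year, "weekday": d.weekday()}   # weekday: 0 = Monday
```

Hour and weekday in particular can then be treated as factors, matching the "convert to factors" step above.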
  • 92. #1 : ctree Refer to 3-Session-I-Bikes.R at https://github.com/xsankar/hairy-octo-hipster/
  • 93. #2 : Add Month + year