Page 1
Data Science: A view from the trenches
Ram Sriharsha
Twitter: @halfabrane
Vinay Shukla
Twitter: @neomythos
Page 2
Agenda
•  Problems we work on
•  Common Challenges
•  Reductions
•  Handling label sparsity
–  Co-Training
–  Adaptive Learning
•  When you have to be fast and accurate
–  Online Clustering
–  Sketches
–  Online Learning
•  Visualization
Page 3
Some Problems
• Search Advertising
– Click Prediction: Given a query, ad, and user context, how likely is the user to click on the ad?
– Feature Engineering: Query/ad categorization, query -> feature vector
• Entity Resolution and Disambiguation
• Over/Under Payment of claims detection
• Document Matching
• Login Risk Detection
Page 4
Common Challenges
•  Labeling is expensive and not clean
– Selectively ask for labels (active learning)
– Co-Training to expand label set
•  Not enough high-quality implementations of algorithms
– Modular extensions of base implementations (Reductions)
– Boosting
•  Speed of training/scoring is important
– Online learning
– Online clustering
– Sketches
•  Freshness of models
– Online and adaptive learning
•  Visualizing performance and feature importance
– Zeppelin
Page 5
Reductions
• One-vs-Rest (OVR): reduce multiclass classification to many 0/1 classifiers A, one per class; at prediction time, randomize over the classifiers that output "yes" (a sketch follows below)
• Importance Weighting: let A = an algorithm for optimizing 0/1 loss and R = a rejection sampling algorithm. For each example h, sample according to the cost of h and feed the result to A; inverting the sampling (R^-1) turns A back into a cost-sensitive classifier
[Figure: reduction pipelines, OVR fanning out to copies of A, and importance weighting as R -> A -> R^-1]
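A minimal sketch of the OVR reduction, assuming scikit-learn's LogisticRegression as the base 0/1 learner; the names ovr_train/ovr_predict are illustrative, and the most confident "yes" is picked deterministically here rather than by randomizing:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ovr_train(X, y, n_classes):
        """Train one binary (0/1) classifier per class: class k vs. the rest."""
        models = []
        for k in range(n_classes):
            clf = LogisticRegression()
            clf.fit(X, (y == k).astype(int))  # relabel: 1 if class k, else 0
            models.append(clf)
        return models

    def ovr_predict(models, X):
        """Score every binary classifier and return the highest-scoring class."""
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        return scores.argmax(axis=1)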
Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask for those examples to be labeled, and feed them back into training
• Choose query points that shrink the space of classifiers rapidly
• Exploit natural structure in data
[Figure: a data distribution split into regions of 45%, 45%, 5%, 2.5%, and 2.5%, illustrating natural structure in the data]
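A minimal sketch of the "least confident" selection step, assuming a probabilistic classifier with a scikit-learn-style predict_proba (least_confident is an illustrative name):

    import numpy as np

    def least_confident(model, X_pool, n_queries):
        """Pick the n_queries pool examples whose top-class probability is
        lowest, i.e. the ones the model is least confident about."""
        proba = model.predict_proba(X_pool)
        confidence = proba.max(axis=1)              # confidence in the predicted class
        return np.argsort(confidence)[:n_queries]   # lowest confidence first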
Page 7
Co-Training
• Suppose you have two "views" of the data
– e.g., web pages have content, and hyperlinks pointing to and from them
– Suppose the problem is to label a web page as being about literature or not (binary classification)
• One approach:
– Label web pages manually. Train a classifier to use both content text and hyperlinks as features
– This requires a large # of labeled pages
• Other approach (see the sketch below):
– Since we have two views, try to learn two classifiers
– Each classifier learns on a subset of labeled examples
– The scores of each classifier are used to label a subset of unlabeled web pages and extend the labels for the other classifier
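A minimal sketch of one co-training round under these assumptions: X1_lab/X2_lab are the two feature views of the same labeled examples, X1_unl/X2_unl are the views of the unlabeled pool, and scikit-learn's LogisticRegression stands in for each view's learner (all names are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cotrain_round(X1_lab, X2_lab, y, X1_unl, X2_unl, n_add=10):
        """One round: each view's classifier labels the unlabeled examples
        it is most confident about, for the benefit of the other view."""
        c1 = LogisticRegression().fit(X1_lab, y)  # view-1 classifier
        c2 = LogisticRegression().fit(X2_lab, y)  # view-2 classifier

        p1 = c1.predict_proba(X1_unl)
        p2 = c2.predict_proba(X2_unl)
        top1 = np.argsort(p1.max(axis=1))[-n_add:]  # c1's most confident picks
        top2 = np.argsort(p2.max(axis=1))[-n_add:]  # c2's most confident picks

        # (pool indices, pseudo-labels) each classifier hands to the other
        return (top2, p2[top2].argmax(axis=1)), (top1, p1[top1].argmax(axis=1))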
Page 8
Sketches
• Store a “summary” of the dataset
• Querying the sketch is “almost” as good as querying the dataset
• Example: frequent items in a stream (the Misra-Gries sketch). A minimal runnable version of the counter update:

    def frequent_items(stream, k):
        """Misra-Gries: maintain at most k-1 candidate heavy hitters."""
        counts = {}                      # the associative array A, |A| <= k-1
        for j in stream:
            if j in counts:
                counts[j] += 1
            elif len(counts) < k - 1:
                counts[j] = 1
            else:
                # Decrement every counter; drop keys that reach zero.
                for key in list(counts):
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        return counts
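For instance, a hypothetical run with k = 3 (so at most two counters are kept):

    counts = frequent_items([1, 1, 2, 3, 1, 2], k=3)
    print(counts)   # {1: 2, 2: 1}; item 1 survives with the largest count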
Page 9
Clustering is not fast enough
• Sample and then cluster
• Do clusters need to dynamically adapt?
– Online clustering
– Streaming K Means
Page 10
K Means
• Initialize cluster centers somehow
– random
– k-means++
• Alternate:
– Assign each point to the closest cluster center
– Move each cluster center to the average of the points assigned to it
• Stop when a convergence criterion is reached:
– Points don't move "much"
– Maximum number of iterations reached
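A minimal sketch of the alternating loop (Lloyd's algorithm) with random initialization, assuming X is an (n, d) numpy array; for brevity it does not handle the corner case of a cluster going empty:

    import numpy as np

    def kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
        for _ in range(n_iter):
            # Assignment step: each point goes to its closest center.
            dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
            assign = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its points.
            new = np.array([X[assign == j].mean(axis=0) for j in range(k)])
            if np.linalg.norm(new - centers) < tol:  # centers barely moved
                break
            centers = new
        return centers, assign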
Page 11
Initialize Cluster Centers
[Figure: X-Y scatter plot; three initial cluster centers k1, k2, k3 are picked at random]
Page 12
Assign each point
[Figure: X-Y scatter plot; each point is assigned to the closest of the cluster centers k1, k2, k3]
Page 13
Recompute Cluster Centers
[Figure: X-Y scatter plot; each cluster center k1, k2, k3 moves to the mean of its cluster]
Page 14
Streaming K Means
• For each new point
– Assign it to the closest cluster center
– Move that cluster center incrementally in the direction of the new point
• Online version of Lloyd’s algorithm
• Good enough in practice
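A minimal sketch of this online update, assuming centers is a (k, d) numpy array and counts records how many points each center has absorbed (both names are illustrative):

    import numpy as np

    def streaming_update(centers, counts, p):
        """Assign point p to its closest center, then nudge that center
        toward p with the incremental-mean step size 1/count."""
        j = np.linalg.norm(centers - p, axis=1).argmin()
        counts[j] += 1
        centers[j] += (p - centers[j]) / counts[j]  # running-mean update
        return j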
Page 15
Recompute Cluster Centers
[Figure: X-Y scatter plot; after the streaming updates, each cluster center k1, k2, k3 sits at the mean of its cluster]
Page 16
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization Phase:
– The first point becomes its own cluster
– Pick some normalization factor f
• Update Phase for point p (see the sketch below):
– Let d = distance from p to the closest center so far
– With probability min(d/f, 1), form a new cluster center at p
– Otherwise (i.e., with probability max(1 - d/f, 0)), attach p to the closest center
• Merge Phase:
– Once enough clusters have opened up, or enough cost has accumulated, merge clusters
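A minimal sketch of the update phase, assuming Euclidean distance, a fixed normalization factor f, and centers kept as a plain Python list (illustrative names):

    import random
    import numpy as np

    def online_update(centers, p, f):
        """With probability min(d/f, 1) open a new cluster at p, where d is
        the distance to the closest existing center; otherwise attach p."""
        p = np.asarray(p, dtype=float)
        if not centers:
            centers.append(p)            # first point is its own cluster
            return
        d = min(np.linalg.norm(c - p) for c in centers)
        if random.random() < min(d / f, 1.0):
            centers.append(p)            # far from everything: new center
        # else: p attaches to its closest center (centers do not move here)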
Page 17
Properties
• Provably close to optimal in the online setting
• Does not open more than O(log(OPT)) clusters, and pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that can be indicative of structure, i.e., useful as a machine learning feature
Page 18
My classifier is not fast enough
• Even for batch problems, online learning might be good enough!
• For real-time problems, online learning or incremental learning is needed.
Page 19
What is online learning?
• Batch Learning:
– The classifier sees a set of labeled examples and trains a model
– The trained model then predicts on unseen examples
• Online Learning (see the sketch below):
– The classifier sees one example at a time
– Limited look-back window (often 0)
– It predicts on the example and is then revealed the cost
– It learns from its mistakes
– This yields a one-pass batch learning algorithm: simply run the online algorithm on each example in the batch
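A minimal sketch of this predict-then-learn loop: logistic regression trained by plain SGD, one example at a time (the class name OnlineLogistic and the fixed learning rate are illustrative choices):

    import numpy as np

    class OnlineLogistic:
        def __init__(self, dim, lr=0.1):
            self.w = np.zeros(dim)
            self.lr = lr

        def predict(self, x):
            return 1.0 / (1.0 + np.exp(-self.w @ x))  # P(y = 1 | x)

        def learn(self, x, y):
            """Predict, observe the true label y, and update on the mistake."""
            p = self.predict(x)
            self.w -= self.lr * (p - y) * x           # gradient of log loss
            return p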
Page 20
Challenges of online learning
• Normalization
– In the batch setup, data can be normalized in a pass over the full dataset
– In the online setting, a second pass is not possible
– Solution: adaptive normalization
• Late-arriving features
– In the batch setting, all features are recorded in the dataset
– In the online setting, different features may arrive at different times
– Solution: AdaGrad (adaptive gradient technique); see the sketch below
• Stochastic gradient descent convergence can be slow
– More data helps
– Adaptive normalization improves convergence
– AdaGrad improves convergence and reduces sensitivity to step size
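A minimal sketch of the AdaGrad update, assuming numpy arrays for the weights w, the current gradient g, and the running sum of squared gradients G (illustrative names): each coordinate's step size shrinks as that feature accumulates gradient, so rarely-seen (late-arriving) features keep taking large steps.

    import numpy as np

    def adagrad_step(w, g, G, lr=0.1, eps=1e-8):
        """One AdaGrad update: per-coordinate adaptive step sizes."""
        G += g * g                         # accumulate squared gradients
        w -= lr * g / (np.sqrt(G) + eps)   # bigger history => smaller step
        return w, G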
Page 21
Visualization
• Speed up feature discovery
• Intuitive visualization of model performance
• Improve debuggability
Page 22
The Data Science Workflow…
[Figure: the data science workflow, from "Start here" to "End here"]
• Plan: What is the question I'm answering? What data will I need?
• Acquire the data
• Clean data: analyze data quality; reformat, impute, etc.
• Analyze data: visualize, create features, create a model, evaluate results
• Create report
• Deploy in production
• Publish & share
(The diagram tags each step as either "Script" or "Visualize".)
Page 23
Introducing Apache Zeppelin
A web-based notebook for interactive analytics
Use cases:
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
"Modern Data Science Studio"
Page 24
Zeppelin today in Data Science Workflow…
[Figure: the workflow diagram from Page 22, highlighting where Zeppelin fits in the data science workflow today]
Page 25
Zeppelin – Road Ahead
Enterprise Ready
• Operations
– Deploy to the cluster with Ambari
• Security
– Authentication against LDAP
– SSL
– Run in a Kerberized cluster
– Authorization of notebooks
• Sharing/Collaboration
– Share selected notebooks with selected users/groups
– Ability to read/publish notebooks to GitHub
Ease of Use
• Data Import
– Visual data import/download
– Clean data as it comes in
• Usability
– Summary data: see per-column summaries
– Keyboard shortcuts, auto-complete, syntax highlighting, line numbers
• Visualization
– Pluggable visualizations; more charts, maps & tables
• R Support
– Harden the SparkR interpreter
Page 26
Upcoming Work
• Entity Resolution package GA
– Supports Entity Graph based resolution
– Includes Random Walk algorithm for computing similarity score
• Online learning and clustering Spark Packages
• Contribute more Reduction algorithms to Spark ML
– Cost Sensitive Classification
–  Filter-tree-based Multiclass Reduction
• Zeppelin GA
Page 27
Thank you!
• Ram Sriharsha
@halfabrane
• Vinay Shukla
@neomythos
