Page 1
Data Science: A view from the trenches
Ram Sriharsha
Twitter: @halfabrane
Vinay Shukla
Twitter: @neomythos
Page 2
Agenda
•  Problems we work on
•  Common Challenges
•  Reductions
•  Handling label sparsity
–  Co Training
–  Adaptive Learning
•  When you have to be fast and accurate
–  Online Clustering
–  Sketches
–  Online Learning
•  Visualization
Page 3
Some Problems
• Search Advertising
– Click Prediction: Given a query, ad and user context, how likely is the user to click on the ad?
– Feature Engineering: Query/ad categorization, query -> feature vector
• Entity Resolution and Disambiguation
• Over/Under Payment of claims detection
• Document Matching
• Login Risk Detection
Page 4
Common Challenges
•  Labeling is expensive and not clean
– Selectively ask for labels (active learning)
– Co-Training to expand label set
• Not enough high quality implementations of algorithms
– Modular extensions of base implementations (Reductions)
– Boosting
• Speed of training/scoring important
– Online learning
– Online clustering
– Sketches
• Freshness of models
– Online and adaptive learning
• Visualizing performance and feature importance
– Zeppelin
Page 5
Reductions
• One-vs-Rest (OVR): reduce multiclass to binary by training a 0/1 classifier A per class, then randomizing over the classifiers that output yes
• Importance weighting: let A = an algorithm for optimizing 0/1 loss and R = a rejection sampling algorithm; for each example h, sample according to the cost of h and feed the result to A (the pipeline R -> A -> R^-1)
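The importance-weighting reduction can be sketched in a few lines. This is an illustrative version (the function name and example format are mine, not the deck's): each example is kept with probability cost / max_cost, so an ordinary 0/1 learner trained on the accepted, unweighted examples behaves like a cost-sensitive learner on the original distribution.

```python
import random

def rejection_sample(weighted_examples, max_cost, rng=random.random):
    """Reduce importance-weighted classification to 0/1 classification:
    accept each (x, y, cost) example with probability cost / max_cost,
    then train any plain 0/1 learner on the accepted (x, y) pairs."""
    accepted = []
    for x, y, cost in weighted_examples:
        if rng() < cost / max_cost:
            accepted.append((x, y))
    return accepted
```

Passing a deterministic `rng` makes the sampling reproducible for testing; in production the default `random.random` is used.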
Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask for those examples to be labeled, and feed them to training
• Choose query points that shrink the space of classifiers rapidly
• Exploit natural structure in data
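A minimal uncertainty-sampling sketch of picking the examples the classifier is least sure about (the function name and the `predict_proba` interface, a callable returning class probabilities, are hypothetical):

```python
def least_confident(pool, predict_proba, budget):
    """Uncertainty sampling: score each unlabeled example by the
    classifier's confidence in its top prediction, and return the
    `budget` examples it is least confident about."""
    scored = [(max(predict_proba(x)), x) for x in pool]
    scored.sort(key=lambda t: t[0])          # least confident first
    return [x for _, x in scored[:budget]]
```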
Page 7
Co Training
• Suppose you have two “views” of the data
– e.g., web pages have content, and hyperlinks pointing to and from them
– Suppose the problem is to label a web page as about literature or not (binary classification)
• One approach:
– Label web pages manually; train a classifier to use both content text and hyperlinks as features
– This requires a large # of labeled pages
• Other approach:
– Since we have two views, try to learn two classifiers
– Each classifier learns on a subset of labeled examples
– The scores of each classifier are used to label a subset of unlabeled web pages and extend the labels for the other classifier
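The two-classifier exchange can be sketched as a single co-training step. The classifier interface here (a callable returning a `(label, confidence)` pair for its view of an example) is an assumption for illustration:

```python
def co_train_step(view1_clf, view2_clf, unlabeled, threshold=0.8):
    """One co-training step. Each classifier labels the unlabeled
    examples it is confident about; those pseudo-labels become new
    training data for the *other* classifier."""
    new_for_1, new_for_2, still_unlabeled = [], [], []
    for v1, v2 in unlabeled:                  # each example has two views
        l1, c1 = view1_clf(v1)
        l2, c2 = view2_clf(v2)
        if c1 >= threshold:
            new_for_2.append((v2, l1))        # view-1 classifier teaches view-2
        elif c2 >= threshold:
            new_for_1.append((v1, l2))        # view-2 classifier teaches view-1
        else:
            still_unlabeled.append((v1, v2))
    return new_for_1, new_for_2, still_unlabeled
```

Iterating this step (retraining both classifiers on their grown label sets each round) gives the full co-training loop.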
Page 8
Sketches
• Store a “summary” of the dataset
• Querying the sketch is “almost” as good as querying the dataset
• Example: frequent items in a stream
– Initialize associative array A of size k-1
– Process: for each item j in the stream
- if j is in keys(A), A[j] += 1
- else if |keys(A)| < k - 1, A[j] = 1
- else, for each l in keys(A): A[l] -= 1; if A[l] == 0, remove l
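This frequent-items procedure is the Misra-Gries sketch; a direct, runnable translation:

```python
def frequent_items(stream, k):
    """Misra-Gries sketch: keep at most k-1 counters. Any item that
    occurs more than len(stream)/k times is guaranteed to survive in
    the returned summary (counts are lower bounds, not exact)."""
    A = {}
    for j in stream:
        if j in A:
            A[j] += 1
        elif len(A) < k - 1:
            A[j] = 1
        else:
            for l in list(A):      # decrement every counter
                A[l] -= 1
                if A[l] == 0:
                    del A[l]
    return A
```

A second pass over the stream can turn the surviving candidates into exact counts if needed.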
Page 9
Clustering is not fast enough
• Sample and then cluster
• Do clusters need to dynamically adapt?
– Online clustering
– Streaming K Means
Page 10
K Means
• Initialize cluster centers somehow
– random
– K means ++
• Alternate
– Assign each point to closest cluster center
– Move cluster center to average of points assigned to it
• Stop when convergence criteria reached
– Points don't move "much"
– Number of iterations reached
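The alternate-and-stop loop above, as a plain Lloyd's-algorithm sketch on 1-D points (a fixed iteration count stands in for the convergence test, to keep it short):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Lloyd's algorithm: random initialization, then alternate the
    assignment step and the mean-update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # random initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign to closest center
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```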
Page 11
Initialize Cluster Centers
[Figure: pick 3 initial cluster centers k1, k2, k3 at random in the X-Y plane]
Page 12
Assign each point
[Figure: assign each point to the closest of the cluster centers k1, k2, k3]
Page 13
Recompute Cluster Centers
[Figure: move each cluster center k1, k2, k3 to the mean of its cluster]
Page 14
Streaming K Means
• For each new point
– Assign to closest cluster center
– Update cluster center to incrementally move in direction of new point
• Online version of Lloyd's algorithm
• Good enough in practice
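One streaming update as described, using the standard incremental-mean rule (1-D points; the per-center count is an implementation detail not spelled out on the slide):

```python
def streaming_kmeans_update(centers, counts, point):
    """Assign the new point to its closest center, then move that center
    a fraction 1/n of the way toward the point, where n is the number of
    points that center has seen -- an incremental running mean."""
    i = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
    counts[i] += 1
    centers[i] += (point - centers[i]) / counts[i]   # incremental mean
    return i
```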
Page 15
Recompute Cluster Centers
[Figure: after streaming updates, each cluster center k1, k2, k3 sits at the mean of its cluster]
Page 16
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization Phase:
– First point is its own cluster
– Pick some normalization factor f
• Update Phase for point p:
– Let d = distance from p to closest center so far
– With probability min(d/f, 1), form a new cluster center at p
– Otherwise, attach p to the closest center
• Merge Phase:
– Once sufficient clusters have opened up, or sufficient cost accumulated, merge clusters
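A sketch of the update phase. The new-cluster probability here is min(d/f, 1), chosen so that points far from every existing center are the ones likely to open new clusters; the slide's exact probability expression appears garbled, so treat this as an illustration rather than the authors' precise algorithm (the published version uses a squared distance):

```python
import random

def online_cluster_update(centers, point, f, rng=random.random):
    """One update of the online clustering sketch (1-D points).
    f is the normalization factor from the initialization phase."""
    if not centers:
        centers.append(point)          # first point is its own cluster
        return
    d = min(abs(point - c) for c in centers)   # distance to closest center
    if rng() < min(d / f, 1.0):
        centers.append(point)          # far point: open a new cluster at p
    # else: point attaches to its closest existing center
```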
Page 17
Properties
• Provably close to optimal in online setting
• Does not open more than O(log(OPT)) clusters; pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that might be indicative of structure, i.e. useful as a machine learning feature
Page 18
My classifier is not fast enough
• Even for batch problems online learning might be good enough!
• For real-time problems, online learning or incremental learning is needed
Page 19
What is online learning?
• Batch Learning:
– Classifier sees a set of labeled examples, and trains a model
– Predicts on trained model for unseen examples
• Online Learning:
– Classifier sees an example at a time
– Limited look-back window (often 0)
– Predicts on each example and is then revealed the cost
– Learns from the mistake
– Yields a one-pass batch learning algorithm: simply run the online algorithm on each example in the batch
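The online protocol, instantiated with a perceptron as an illustration (the perceptron is my choice, not the deck's): predict on each arriving example, observe the true label, and update only on mistakes. Running it once over a batch of examples is exactly the one-pass batch learner described here.

```python
def perceptron_online(stream, dim):
    """Online perceptron: one example at a time, predict before the
    label is revealed, learn from each mistake."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:                     # y in {-1, +1}
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score >= 0 else -1      # predict before seeing y
        if pred != y:                       # label revealed: learn from mistake
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```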
Page 20
Challenges of online learning
• Normalization
– In batch setup, can normalize data by making a pass over the full dataset
– In online setting, cannot make a second pass
– Solution: adaptive normalization
• Late-arriving features
– In batch setting, all features are recorded in the dataset
– In online setting, different features may arrive at different times
– Solution: Adagrad (adaptive gradient technique)
• Stochastic Gradient Descent convergence can be slow
– More data helps
– Adaptive normalization improves convergence
– Adagrad improves convergence and reduces sensitivity to step size
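A per-coordinate Adagrad step for sparse gradients, showing why it helps with late-arriving features: each feature's effective step size depends only on the gradients that feature has actually seen, so a feature appearing for the first time still takes a large step (the sparse-dict representation is my choice for illustration):

```python
def adagrad_update(w, g2, grad, lr=0.1, eps=1e-8):
    """One Adagrad step. w: weights, g2: accumulated squared gradients
    per coordinate, grad: sparse gradient {feature index: value}.
    Effective learning rate per coordinate is lr / sqrt(g2[i])."""
    for i, g in grad.items():
        g2[i] = g2.get(i, 0.0) + g * g
        w[i] = w.get(i, 0.0) - lr * g / ((g2[i] ** 0.5) + eps)
```

Repeated gradients on the same coordinate shrink its step size automatically, which is what reduces sensitivity to the global step size.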
Page 21
Visualization
• Speed up feature discovery
• Intuitive visualization of model performance
• Improve debuggability
Page 22
The Data Science Workflow…
[Workflow: Plan (What is the question I'm answering? What data will I need?) -> Acquire the data -> Clean Data (analyze data quality; reformat, impute, etc.) -> Analyze data (visualize, create features, create model, evaluate results) -> Publish & Share (create report, deploy in production)]
Page 23
Introducing Apache Zeppelin
Web-based notebook for interactive analytics
Use Case:
- Data exploration and discovery
- Visualization
- Interactive snippet-at-a-time experience
"Modern Data Science Studio"
Page 24
Zeppelin today in Data Science Workflow…
[Workflow: the same Plan -> Acquire the data -> Clean Data -> Analyze data -> Publish & Share flow, highlighting the script and visualize steps Zeppelin covers today]
Page 25
Zeppelin – Road Ahead
Operations
-  Deploy to the cluster with Ambari
Security
-  Authentication against LDAP
-  SSL
-  Run in Kerberized cluster
-  Authorization of notebooks
Sharing/Collaboration
-  Share selected notebooks with selected users/groups
-  Ability to read/publish notebooks to GitHub
Data Import
-  Visual data import/download
-  Clean data as it comes
Usability
-  Summary data: see column summary
-  Keyboard shortcuts, autocomplete, syntax highlighting, line numbers
Visualization
-  Pluggable visualizations & more charts, maps & tables
R support
-  Harden SparkR interpreter
Page 26
Upcoming Work
• Entity Resolution package GA
– Supports Entity Graph based resolution
– Includes Random Walk algorithm for computing similarity score
• Online learning and clustering Spark packages
• Contribute more Reduction algorithms to Spark ML
– Cost Sensitive Classification
– Filter tree based Multiclass Reduction
• Zeppelin GA
Page 27
Thank you!
• Ram Sriharsha
@halfabrane
• Vinay Shukla
@neomythos
Published on ApacheBigData - Budapest, 2015

Data Science from the trenches
What are the issues?
How to select best algorithm?
How to tune?
What are the problems with visualization?
How does Zeppelin help?
