Page 1
Data Science: A view from the trenches
Ram Sriharsha
Twitter: @halfabrane
Vinay Shukla
Twitter: @neomythos
Page 2
Agenda
•  Problems we work on
•  Common Challenges
•  Reductions
•  Handling label sparsity
–  Co-Training
–  Adaptive Learning
•  When you have to be fast and accurate
–  Online Clustering
–  Sketches
–  Online Learning
•  Visualization
Page 3
Some Problems
• Search Advertising
– Click Prediction: Given a query, ad, and user context, how likely is the user to click on the ad?
– Feature Engineering: Query/ad categorization, query -> feature vector
• Entity Resolution and Disambiguation
• Over/Under Payment of claims detection
• Document Matching
• Login Risk Detection
Page 4
Common Challenges
•  Labeling is expensive and not clean
– Selectively ask for labels (active learning)
– Co-Training to expand label set
•  Not enough high-quality implementations of algorithms
– Modular extensions of base implementations (Reductions)
– Boosting
•  Speed of training/scoring is important
– Online learning
– Online clustering
– Sketches
•  Freshness of models
– Online and adaptive learning
•  Visualizing performance and feature importance
– Zeppelin
Page 5
Reductions
• One-vs-Rest (OVR): reduce multiclass classification to many 0/1 classifiers A, one per class; at prediction time, randomize over the classifiers that output "yes" (a sketch follows below)
• Importance Weighting: let A = an algorithm for optimizing 0/1 loss and R = a rejection sampling algorithm. For each example h, sample according to the cost of h and feed the result to A; inverting the sampling (R^-1) turns A back into a cost-sensitive classifier
[Figure: reduction pipelines, OVR fanning out to copies of A, and importance weighting as R -> A -> R^-1]
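A minimal sketch of the OVR reduction, assuming scikit-learn's LogisticRegression as the base 0/1 learner; the names ovr_train/ovr_predict are illustrative, and the most confident "yes" is picked deterministically here rather than by randomizing:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ovr_train(X, y, n_classes):
        """Train one binary (0/1) classifier per class: class k vs. the rest."""
        models = []
        for k in range(n_classes):
            clf = LogisticRegression()
            clf.fit(X, (y == k).astype(int))  # relabel: 1 if class k, else 0
            models.append(clf)
        return models

    def ovr_predict(models, X):
        """Score every binary classifier and return the highest-scoring class."""
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        return scores.argmax(axis=1)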
Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask for those examples to be labeled, and feed them back into training
• Choose query points that shrink the space of classifiers rapidly
• Exploit natural structure in data
[Figure: a data distribution split into regions of 45%, 45%, 5%, 2.5%, and 2.5%, illustrating natural structure in the data]
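A minimal sketch of the "least confident" selection step, assuming a probabilistic classifier with a scikit-learn-style predict_proba (least_confident is an illustrative name):

    import numpy as np

    def least_confident(model, X_pool, n_queries):
        """Pick the n_queries pool examples whose top-class probability is
        lowest, i.e. the ones the model is least confident about."""
        proba = model.predict_proba(X_pool)
        confidence = proba.max(axis=1)              # confidence in the predicted class
        return np.argsort(confidence)[:n_queries]   # lowest confidence first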
Page 7
Co-Training
• Suppose you have two "views" of the data
– e.g., web pages have content, and hyperlinks pointing to and from them
– Suppose the problem is to label a web page as being about literature or not (binary classification)
• One approach:
– Label web pages manually. Train a classifier to use both content text and hyperlinks as features
– This requires a large # of labeled pages
• Other approach (see the sketch below):
– Since we have two views, try to learn two classifiers
– Each classifier learns on a subset of labeled examples
– The scores of each classifier are used to label a subset of unlabeled web pages and extend the labels for the other classifier
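A minimal sketch of one co-training round under these assumptions: X1_lab/X2_lab are the two feature views of the same labeled examples, X1_unl/X2_unl are the views of the unlabeled pool, and scikit-learn's LogisticRegression stands in for each view's learner (all names are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cotrain_round(X1_lab, X2_lab, y, X1_unl, X2_unl, n_add=10):
        """One round: each view's classifier labels the unlabeled examples
        it is most confident about, for the benefit of the other view."""
        c1 = LogisticRegression().fit(X1_lab, y)  # view-1 classifier
        c2 = LogisticRegression().fit(X2_lab, y)  # view-2 classifier

        p1 = c1.predict_proba(X1_unl)
        p2 = c2.predict_proba(X2_unl)
        top1 = np.argsort(p1.max(axis=1))[-n_add:]  # c1's most confident picks
        top2 = np.argsort(p2.max(axis=1))[-n_add:]  # c2's most confident picks

        # (pool indices, pseudo-labels) each classifier hands to the other
        return (top2, p2[top2].argmax(axis=1)), (top1, p1[top1].argmax(axis=1))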
Page 8
Sketches
• Store a “summary” of the dataset
• Querying the sketch is “almost” as good as querying the dataset
• Example: frequent items in a stream (the Misra-Gries sketch). A minimal runnable version of the counter update:

    def frequent_items(stream, k):
        """Misra-Gries: maintain at most k-1 candidate heavy hitters."""
        counts = {}                      # the associative array A, |A| <= k-1
        for j in stream:
            if j in counts:
                counts[j] += 1
            elif len(counts) < k - 1:
                counts[j] = 1
            else:
                # Decrement every counter; drop keys that reach zero.
                for key in list(counts):
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        return counts
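For instance, a hypothetical run with k = 3 (so at most two counters are kept):

    counts = frequent_items([1, 1, 2, 3, 1, 2], k=3)
    print(counts)   # {1: 2, 2: 1}; item 1 survives with the largest count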
Page 9
Clustering is not fast enough
• Sample and then cluster
• Do clusters need to dynamically adapt?
– Online clustering
– Streaming K Means
Page 10
K Means
• Initialize cluster centers somehow
– random
– k-means++
• Alternate:
– Assign each point to the closest cluster center
– Move each cluster center to the average of the points assigned to it
• Stop when a convergence criterion is reached:
– Points don't move "much"
– Maximum number of iterations reached
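A minimal sketch of the alternating loop (Lloyd's algorithm) with random initialization, assuming X is an (n, d) numpy array; for brevity it does not handle the corner case of a cluster going empty:

    import numpy as np

    def kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
        for _ in range(n_iter):
            # Assignment step: each point goes to its closest center.
            dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
            assign = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its points.
            new = np.array([X[assign == j].mean(axis=0) for j in range(k)])
            if np.linalg.norm(new - centers) < tol:  # centers barely moved
                break
            centers = new
        return centers, assign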
Page 11
Initialize Cluster Centers
[Figure: X-Y scatter plot; three initial cluster centers k1, k2, k3 are picked at random]
Page 12
Assign each point
[Figure: X-Y scatter plot; each point is assigned to the closest of the cluster centers k1, k2, k3]
Page 13
Recompute Cluster Centers
[Figure: X-Y scatter plot; each cluster center k1, k2, k3 moves to the mean of its cluster]
Page 14
Streaming K Means
• For each new point
– Assign it to the closest cluster center
– Move that cluster center incrementally in the direction of the new point
• Online version of Lloyd’s algorithm
• Good enough in practice
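A minimal sketch of this online update, assuming centers is a (k, d) numpy array and counts records how many points each center has absorbed (both names are illustrative):

    import numpy as np

    def streaming_update(centers, counts, p):
        """Assign point p to its closest center, then nudge that center
        toward p with the incremental-mean step size 1/count."""
        j = np.linalg.norm(centers - p, axis=1).argmin()
        counts[j] += 1
        centers[j] += (p - centers[j]) / counts[j]  # running-mean update
        return j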
Page 15
Recompute Cluster Centers
[Figure: X-Y scatter plot; after the streaming updates, each cluster center k1, k2, k3 sits at the mean of its cluster]
Page 16
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization Phase:
– The first point becomes its own cluster
– Pick some normalization factor f
• Update Phase for point p (see the sketch below):
– Let d = distance from p to the closest center so far
– With probability min(d/f, 1), form a new cluster center at p
– Otherwise (i.e., with probability max(1 - d/f, 0)), attach p to the closest center
• Merge Phase:
– Once enough clusters have opened up, or enough cost has accumulated, merge clusters
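A minimal sketch of the update phase, assuming Euclidean distance, a fixed normalization factor f, and centers kept as a plain Python list (illustrative names):

    import random
    import numpy as np

    def online_update(centers, p, f):
        """With probability min(d/f, 1) open a new cluster at p, where d is
        the distance to the closest existing center; otherwise attach p."""
        p = np.asarray(p, dtype=float)
        if not centers:
            centers.append(p)            # first point is its own cluster
            return
        d = min(np.linalg.norm(c - p) for c in centers)
        if random.random() < min(d / f, 1.0):
            centers.append(p)            # far from everything: new center
        # else: p attaches to its closest center (centers do not move here)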
Page 17
Properties
• Provably close to optimal in the online setting
• Does not open more than O(log(OPT)) clusters, and pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that can be indicative of structure, i.e., useful as a machine learning feature
Page 18
My classifier is not fast enough
• Even for batch problems, online learning might be good enough!
• For real-time problems, online learning or incremental learning is needed.
Page 19
What is online learning?
• Batch Learning:
– The classifier sees a set of labeled examples and trains a model
– The trained model then predicts on unseen examples
• Online Learning (see the sketch below):
– The classifier sees one example at a time
– Limited look-back window (often 0)
– It predicts on the example and is then revealed the cost
– It learns from its mistakes
– This yields a one-pass batch learning algorithm: simply run the online algorithm on each example in the batch
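A minimal sketch of this predict-then-learn loop: logistic regression trained by plain SGD, one example at a time (the class name OnlineLogistic and the fixed learning rate are illustrative choices):

    import numpy as np

    class OnlineLogistic:
        def __init__(self, dim, lr=0.1):
            self.w = np.zeros(dim)
            self.lr = lr

        def predict(self, x):
            return 1.0 / (1.0 + np.exp(-self.w @ x))  # P(y = 1 | x)

        def learn(self, x, y):
            """Predict, observe the true label y, and update on the mistake."""
            p = self.predict(x)
            self.w -= self.lr * (p - y) * x           # gradient of log loss
            return p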
Page 20
Challenges of online learning
• Normalization
– In the batch setup, data can be normalized in a pass over the full dataset
– In the online setting, a second pass is not possible
– Solution: adaptive normalization
• Late-arriving features
– In the batch setting, all features are recorded in the dataset
– In the online setting, different features may arrive at different times
– Solution: AdaGrad (adaptive gradient technique); see the sketch below
• Stochastic gradient descent convergence can be slow
– More data helps
– Adaptive normalization improves convergence
– AdaGrad improves convergence and reduces sensitivity to step size
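A minimal sketch of the AdaGrad update, assuming numpy arrays for the weights w, the current gradient g, and the running sum of squared gradients G (illustrative names): each coordinate's step size shrinks as that feature accumulates gradient, so rarely-seen (late-arriving) features keep taking large steps.

    import numpy as np

    def adagrad_step(w, g, G, lr=0.1, eps=1e-8):
        """One AdaGrad update: per-coordinate adaptive step sizes."""
        G += g * g                         # accumulate squared gradients
        w -= lr * g / (np.sqrt(G) + eps)   # bigger history => smaller step
        return w, G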
Page 21
Visualization
• Speed up feature discovery
• Intuitive visualization of model performance
• Improve debuggability
Page 22
The Data Science Workflow…
[Figure: the data science workflow, from "Start here" to "End here"]
• Plan: What is the question I'm answering? What data will I need?
• Acquire the data
• Clean data: analyze data quality; reformat, impute, etc.
• Analyze data: visualize, create features, create a model, evaluate results
• Create report
• Deploy in production
• Publish & share
(The diagram tags each step as either "Script" or "Visualize".)
Page 23
Introducing Apache Zeppelin
A web-based notebook for interactive analytics
Use cases:
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
"Modern Data Science Studio"
Page 24
Zeppelin today in Data Science Workflow…
[Figure: the workflow diagram from Page 22, highlighting where Zeppelin fits in the data science workflow today]
Page 25
Zeppelin – Road Ahead
Enterprise Ready
• Operations
– Deploy to the cluster with Ambari
• Security
– Authentication against LDAP
– SSL
– Run in a Kerberized cluster
– Authorization of notebooks
• Sharing/Collaboration
– Share selected notebooks with selected users/groups
– Ability to read/publish notebooks to GitHub
Ease of Use
• Data Import
– Visual data import/download
– Clean data as it comes in
• Usability
– Summary data: see per-column summaries
– Keyboard shortcuts, auto-complete, syntax highlighting, line numbers
• Visualization
– Pluggable visualizations; more charts, maps & tables
• R Support
– Harden the SparkR interpreter
Page 26
Upcoming Work
• Entity Resolution package GA
– Supports Entity Graph based resolution
– Includes Random Walk algorithm for computing similarity score
• Online learning and clustering Spark Packages
• Contribute more Reduction algorithms to Spark ML
– Cost Sensitive Classification
–  Filter-tree-based Multiclass Reduction
• Zeppelin GA
Page 27
Thank you!
• Ram Sriharsha
@halfabrane
• Vinay Shukla
@neomythos
