1©MapR Technologies - Confidential
Hadoop Performance
Agenda
 What is performance? Optimization?
 Case 1: Aggregation
 Case 2: Recommendations
 Case 3: Clustering
 Case 4: Matrix decomposition
What is Performance?
 Is doing something faster better?
 Is it the right task?
 Do you have a wide enough view?
 What is the right performance metric?
Aggregation
 Word-count and friends
– How many times did X occur?
– How many unique X’s occurred?
 Associative metrics permit decomposition
– Partial sums and grand totals for example
– Use combiners
– Use high resolution aggregates to compute low resolution aggregates
 Rank-based statistics do not permit decomposition
– Avoid them
– Use approximations
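A minimal sketch of why associative metrics decompose, assuming each partition reports a (count, sum, sum of squares) triple. The function names are illustrative and do not come from the talk.

```python
from functools import reduce

def partial(values):
    """Summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Combine two partial summaries; the operation is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(summary):
    """Recover the mean and population variance from a merged summary."""
    n, s, ss = summary
    mean = s / n
    return mean, ss / n - mean * mean

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
total = reduce(merge, (partial(p) for p in partitions))
mean, variance = mean_and_variance(total)
```

Because `merge` gives the same result in any grouping, it can run in a combiner, and coarse aggregates can be built from finer ones. A median has no comparable `merge`, which is why rank statistics need approximations.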
Inside Map-Reduce
[Figure: map-reduce data flow: Input → Map → Combine → Shuffle and sort → Reduce → Output]
Input text:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Map output: the, 1; time, 1; has, 1; come, 1; …
After shuffle and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …
Reduce output: come, 6; has, 8; the, 4; time, 14; …
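The data flow on this slide can be imitated in a few lines of plain Python. This is a single-process sketch under assumed names (`map_words`, `combine`, `shuffle`, `reduce_counts`), not Hadoop code; it only shows where the combiner sits and what it saves.

```python
from collections import Counter, defaultdict

def map_words(line):
    """Map: emit (word, 1) for every word in a line."""
    return [(w, 1) for w in line.lower().split()]

def combine(pairs):
    """Combine: pre-sum counts on the map side to shrink shuffle traffic."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle(all_pairs):
    """Shuffle and sort: group the partial counts by key."""
    groups = defaultdict(list)
    for word, count in all_pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))

def reduce_counts(groups):
    """Reduce: sum the partial counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two input splits, each handled by its own (simulated) map task.
splits = [["the time has come", "the walrus said"],
          ["to talk of many things", "the time"]]
combined = [pair
            for split in splits
            for pair in combine(p for line in split for p in map_words(line))]
counts = reduce_counts(shuffle(combined))
```

The combiner is only valid because addition is associative; the reducer receives partial sums such as `[2, 1]` for "the" instead of three separate ones.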
Don’t Do This
[Figure: Daily, Weekly, and Monthly aggregates, each computed separately from Raw]
Do This Instead
[Figure: Daily aggregate computed from Raw; Weekly and Monthly aggregates computed from Daily]
Aggregation
 First rule:
– Don’t read the big input multiple times
– Compute longer term aggregates from short term aggregates
 Second rule:
– Don’t read the big input multiple times
– Compute multiple windowed aggregates at the same time
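Both rules can be illustrated with a small sketch: the raw events are scanned once to build daily counts, and the weekly and monthly windows are rolled up from those. The event layout (a date plus a payload) and the helper names are assumptions made for the example.

```python
from collections import Counter
from datetime import date

def daily_counts(events):
    """Single pass over the raw events: count per calendar day."""
    return Counter(day for day, _ in events)

def roll_up(daily, key):
    """Derive a coarser aggregate from the daily one, never from raw data."""
    out = Counter()
    for day, n in daily.items():
        out[key(day)] += n
    return out

def week_key(d):
    iso = d.isocalendar()
    return (iso[0], iso[1])  # ISO year and ISO week number

def month_key(d):
    return (d.year, d.month)

events = [(date(2012, 10, 29), "a"), (date(2012, 10, 31), "b"),
          (date(2012, 11, 1), "c"), (date(2012, 11, 1), "d")]
daily = daily_counts(events)          # the only read of the big input
weekly = roll_up(daily, week_key)     # both windows come from the same
monthly = roll_up(daily, month_key)   # small intermediate result
```

This works for counts and sums because they are associative; a windowed median could not be rolled up this way.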
Rank Statistics Can Be Tamed
 Approximate quartiles are easily computed
– (but sorted data is evil)
 Approximate unique counts are easily computed
– use Bloom filter and extrapolate from number of set bits
– use multiple filters at different down-sample rates
 Approximate high or low quantiles are easily computed
– keep largest 1000 elements
– keep largest 1000 elements from 10x down-sampled data
– and so on
 Approximate top-40 also possible
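A hedged sketch of the Bloom filter idea from the rank-statistics slide: hash every item into a bit array and extrapolate the distinct count from the number of set bits, using the standard estimate n ≈ -(m/k) ln(1 - X/m) for m bits, k hashes, and X set bits. The class name, sizes, and the SHA-256 hashing are choices made for this example only.

```python
import hashlib
import math

class CountingSketch:
    """Bloom-filter-style bit array used only to estimate distinct counts."""

    def __init__(self, m_bits=1 << 16, k_hashes=2):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def add(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            self.bits[int.from_bytes(digest[:8], "big") % self.m] = 1

    def estimate(self):
        set_bits = sum(self.bits)
        if set_bits == self.m:
            return float("inf")  # saturated: enlarge or down-sample the input
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)

sketch = CountingSketch()
for i in range(5000):
    sketch.add(f"user-{i % 1000}")  # 1000 distinct values, each seen 5 times
approx = sketch.estimate()
```

Two such arrays can be merged with a bitwise OR, so the estimate decomposes across partitions. When a filter saturates, a second filter fed with down-sampled data covers the larger range.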
Recommendations
 Common patterns in the past may predict common patterns in the future
 People who bought item x also bought item y
 But also, people who bought Chinese food in the past, …
 Or people in SoMa really liked this restaurant in the past
People who bought …
 Key operation is counting number of people who bought x and y
– for all x’s and all y’s
 The raw problem appears to be O(N^3)
 At the least, O(k_max^2)
– for most prolific user, there are k^2 pairs to count
– k_max can be near N
 Scalable problems must be O(N)
But …
 What do we learn from users who buy everything
– they have no discrimination
– they are often the QA team
– they tell us nothing
 What do we learn from items bought by everybody
– the dual of omnivorous buyers
– these are often teaser items
– they tell us nothing
Also …
 What would you learn about a user from purchases
– 1 … 20?
– 21 … 100?
– 101 … 1000?
– 1001 … ∞?
 What about learning about an item?
– how many people do we need to see before we understand the item?
So …
 Cheat!
 Downsample every user to at most 1000 interactions
– most recent
– most rare
– random selection
– whatever is easiest
 Now k_max ≤ 1000
The Fundamental Things Apply
 Don’t read the raw data repeatedly
 Sessionize and denormalize per hour/day/week
– that is, group by user
– expand items with categories and content descriptors if feasible
 Feed all down-stream processing in one pass
– baby join to item characteristics
– downsample
– count grand totals
– compute cooccurrences
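The downsample, count, and cooccurrence steps can be sketched in one pass over per-user histories. This is an in-memory illustration with hypothetical names (`downsample`, `cooccurrences`) and random selection as the sampling rule; a real job would run the same logic per user inside a reducer.

```python
import random
from collections import Counter
from itertools import combinations

def downsample(items, limit, rng):
    """Keep at most `limit` distinct items per user (random selection)."""
    items = sorted(set(items))
    return items if len(items) <= limit else rng.sample(items, limit)

def cooccurrences(histories, limit=1000, seed=0):
    """Count item totals and unordered item pairs in one pass over users."""
    rng = random.Random(seed)
    item_counts, pair_counts = Counter(), Counter()
    for items in histories.values():
        kept = sorted(downsample(items, limit, rng))
        item_counts.update(kept)
        # At most limit * (limit - 1) / 2 pairs per user, whatever the history size.
        pair_counts.update(combinations(kept, 2))
    return item_counts, pair_counts

histories = {"u1": ["x", "y", "z"], "u2": ["x", "y"], "u3": ["y"]}
item_counts, pair_counts = cooccurrences(histories)

# A user who "bought everything" is capped, so the pair count stays bounded.
_, capped_pairs = cooccurrences({"qa": [str(i) for i in range(50)]}, limit=5)
```

With the cap in place, the work per user is bounded by a constant, so the total cost grows linearly with the number of users.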
Deployment Matters, Too
 For restaurant case, basic recommendation info includes:
– user x merchant histories
– user x cuisine histories
– top local restaurant by anomalous repeat visits
– restaurant x indicator merchant cooccurrence matrix
– restaurant x indicator cuisine cooccurrence matrix
 These can all be stored and accessed using text retrieval techniques
 Fast deployment using mirrors and NFS (not standard Hadoop)
Non-Traditional Deployment Demo
DEMO
EM Algorithms
 Start with random model estimates
 Use model estimates to classify examples
 Use classified examples to find maximum likelihood model estimates
 Use model estimates to classify examples
 Use classified examples to find maximum likelihood model estimates
 … And so on …
K-means as EM Algorithm
 Assign a random seed to each cluster
 Assign points to nearest cluster
 Move cluster to average of contained points
 Assign points to nearest cluster
… and so on …
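The loop above can be written as a short, single-machine sketch of Lloyd's algorithm. The assignment step accumulates per-cluster sums and counts, which is the part that would map onto a mapper plus combiner; the names and the fixed iteration count are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the closest centroid."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # a random seed point for each cluster
    for _ in range(iterations):
        # Assignment: per-cluster partial sums and counts (associative).
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        counts = [0] * k
        for p in points:
            c = nearest(p, centroids)
            counts[c] += 1
            sums[c] = [s + x for s, x in zip(sums[c], p)]
        # Update: move each centroid to the average of its points.
        centroids = [
            [s / counts[c] for s in sums[c]] if counts[c] else centroids[c]
            for c in range(k)
        ]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = sorted(kmeans(points, k=2))
```

Each pass of the outer loop corresponds to one full scan of the data, which is what makes the iteration count matter so much on map-reduce.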
K-means as Map-Reduce
 Assignment of points to cluster is trivially parallel
 Computation of new clusters is also parallel
 Moving points to averages is ideal for map-reduce
But …
 With map-reduce, iteration is evil
 Starting a program can take 10-30s
 Saving data to disk and then immediately reading from disk is silly
 Input might even fit in cluster memory
Fix #1
 Don’t do that!
 Use Spark
– in memory interactive map-reduce
– 100x to 1000x faster
– must fit in memory
 Use Giraph
– BSP programming model rather than map-reduce
– essentially map-reduce-reduce-reduce…
 Use GraphLab
– Like BSP without the speed brakes
– 100x faster
Fix #2
 Use a sketch-based algorithm
 Do one pass over the data to compute sketch of the data
 Cluster the sketch
 Done. With good theoretic bounds on accuracy
 Speedup of 3000x or more
An Example
The Problem
 Spirals are a classic “counter” example for k-means
 Classic low dimensional manifold with added noise
 But clustering still makes modeling work well
An Example
The Cluster Proximity Features
 Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters
 Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
Lots of Clusters Are Fine
Surrogate Method
 Start with sloppy clustering into κ = k log n clusters
 Use this sketch as a weighted surrogate for the data
 Cluster surrogate data using ball k-means
 Results are provably good for highly clusterable data
 Sloppy clustering is on-line
 Surrogate can be kept in memory
 Ball k-means pass can be done at any time
Algorithm Costs
 O(k d) per point per iteration for Lloyd’s algorithm
 Number of iterations not well known
 Assuming at least log n iterations is reasonable, giving O(k d log n) per point
Algorithm Costs
 Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
• O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
• O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice
Algorithm Costs
 How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 ≈ 500,000
– d (log k + log log n) = 10 x (11 + 5) = 160
– about 3,000 times faster is a bona fide big deal
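The arithmetic can be checked with a few lines of Python. This sketch assumes base-2 logarithms and includes the factor d in the sketch search cost, as in the O(d log κ) per-point bound. Under base-2 logs, log n ≈ 26 corresponds to n on the order of 10^8 rather than 10^5, so n is left as a parameter and both readings are printed.

```python
import math

def lloyd_cost(k, d, n):
    """Per-point work for Lloyd's algorithm, assuming about log2(n) iterations."""
    return k * d * math.log2(n)

def sketch_search_cost(k, d, n):
    """Per-point work for the sketch: d * log2(kappa), with kappa = k * log2(n)."""
    return d * (math.log2(k) + math.log2(math.log2(n)))

k, d = 2000, 10
for n in (10**5, 10**8):
    ratio = lloyd_cost(k, d, n) / sketch_search_cost(k, d, n)
    print(f"n={n:>11,}  lloyd={lloyd_cost(k, d, n):,.0f}  "
          f"sketch={sketch_search_cost(k, d, n):,.0f}  ratio={ratio:,.0f}")
```

Either way the sketch phase is three or more orders of magnitude cheaper per point, and the gap widens as k grows.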
Pragmatics
 But this requires a fast search internally
 Have to cluster on the fly for sketch
 Have to guarantee sketch quality
 Previous methods had very high complexity
How It Works
 For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new cluster, with u drawn uniformly from [0, 1)
– Else add to nearest centroid
 If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
 Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering the original data
– or we can just use the result directly
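A minimal sketch of the single-pass procedure, assuming that a new centroid is started with probability min(1, distance / threshold) and that the threshold grows by a fixed factor whenever there are too many centroids. The nearest-centroid search here is brute force for brevity; the talk calls for a fast approximate search instead. All names and parameter values are illustrative.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def add_point(centroids, point, weight, threshold, rng):
    """Merge a weighted point into the nearest centroid or start a new one."""
    if centroids:
        i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j][0]))
        c, w = centroids[i]
        # New centroid with probability min(1, distance / threshold).
        if rng.random() >= dist(point, c) / threshold:
            total = w + weight
            merged = tuple((w * a + weight * b) / total for a, b in zip(c, point))
            centroids[i] = (merged, total)
            return
    centroids.append((tuple(point), weight))

def sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over the data; returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster them.
            threshold *= growth
            old, centroids = centroids, []
            for c, w in old:
                add_point(centroids, c, w, threshold, rng)
    return centroids

data_rng = random.Random(1)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5)] for _ in range(250)]
summary = sketch(points, max_centroids=20)
```

The weights always sum to the number of points seen, so `summary` can stand in for the data as a weighted surrogate in a later in-memory clustering step.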
Matrix Decomposition
 Big matrices can often be compressed
 Often used in recommendations
Nearest Neighbor
 Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
 Fast search algorithms work up to dimension 50-100 but not above that
Random Projections
 Many problems in high dimension can be reduced to low dimension
 Reductions with good distance approximation are available
 Surprisingly, these methods can be done using random vectors
Fundamental Trick
 Random orthogonal projection preserves action of A
$\|Ax - Ay\| \approx \|Q^{T}Ax - Q^{T}Ay\|$
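A related and simpler illustration: the sketch below uses a plain Gaussian random projection (in the Johnson-Lindenstrauss style) rather than the orthogonal Q built from A, and checks that the distance between two vectors is roughly preserved after projecting from 500 to 200 dimensions. The dimensions and names are arbitrary choices for the example.

```python
import math
import random

def random_projection(dim_in, dim_out, seed=0):
    """Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, x):
    """Multiply the projection matrix by the vector x."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(1)
x = [rng.uniform(-1, 1) for _ in range(500)]
y = [rng.uniform(-1, 1) for _ in range(500)]
R = random_projection(500, 200)

# Ratio of projected distance to original distance; values near 1 mean
# the projection kept the geometry.
ratio = distance(project(R, x), project(R, y)) / distance(x, y)
```

The relative error shrinks roughly like one over the square root of the output dimension, independent of the input dimension.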
Projection Search
total ordering!
LSH Bit-match Versus Cosine
[Figure: cosine similarity (-1 to 1) plotted against number of matching LSH bits (0 to 64)]
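The relationship between bit matches and cosine can be reproduced with random-hyperplane hashing, which is one common form of LSH; whether the talk used exactly this scheme is an assumption. Each of 64 random hyperplanes contributes one sign bit, and the expected number of matching bits is 64 (1 - θ/π) for vectors at angle θ.

```python
import math
import random

def random_hyperplanes(dim, bits=64, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def signature(planes, x):
    """One bit per hyperplane: which side of the plane x falls on."""
    return [sum(p * v for p, v in zip(plane, x)) >= 0 for plane in planes]

def matching_bits(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b))

def cosine_from_bits(matches, bits=64):
    """Invert E[matches] = bits * (1 - angle / pi) to estimate the cosine."""
    return math.cos(math.pi * (1 - matches / bits))

planes = random_hyperplanes(dim=20)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(20)]
y = [a + 0.5 * rng.gauss(0, 1) for a in x]   # a noisy copy of x
m = matching_bits(signature(planes, x), signature(planes, y))
estimate = cosine_from_bits(m)
```

Comparing two 64-bit signatures is a single XOR and popcount, which is why the bit-match count is attractive as a cheap stand-in for the cosine.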
But How?
$Y = A\Omega$
$Q_1 R = Y$
$B = Q_1^{T} A$
$L Q_2 = B$
$U S V^{T} = L$
$(Q_1 U)\, S\, (Q_2^{T} V)^{T} \approx A$
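The chain of factorizations can be read as one short derivation. It assumes the usual randomized SVD setup, which the slide does not spell out: the matrix multiplying A in the first step is a random test matrix (written Ω here), Q_1 has orthonormal columns spanning the range of Y, and Q_2 has orthonormal rows, so the right singular vectors come out as Q_2^T V.

```latex
\begin{aligned}
A &\approx Q_1 Q_1^{T} A = Q_1 B \\
  &= Q_1 L Q_2 \\
  &= Q_1 U S V^{T} Q_2 \\
  &= (Q_1 U)\, S\, (Q_2^{T} V)^{T}
\end{aligned}
```

Only the first step is an approximation; the others are exact factorizations of small matrices, so the expensive work is the two passes over A that form Y and B.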
Summary
 Don’t repeat big scans
– Cascade aggregations
– Compute several aggregates at once
 Use approximate measures for rank statistics
 Downsample where appropriate
 Use non-traditional deployment
 Use sketches
 Use random projections
Contact Me!
 We’re hiring at MapR in US and Europe
 Come get the slides at
http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
 Get the code at
https://github.com/tdunning
 Contact me at tdunning@maprtech.com or @ted_dunning

More Related Content

What's hot

Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Ted Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learningJeremy Nixon
 
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016MLconf
 
Mining Top-k Closed Sequential Patterns in Sequential Databases
Mining Top-k Closed Sequential Patterns in Sequential Databases Mining Top-k Closed Sequential Patterns in Sequential Databases
Mining Top-k Closed Sequential Patterns in Sequential Databases IOSR Journals
 
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeWorkflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeFrederic Desprez
 
Access strategies ppt_ind
Access strategies ppt_indAccess strategies ppt_ind
Access strategies ppt_indItamarCohen16
 
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...AzarulIkhwan
 
Evaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past RequestsEvaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past RequestsSmartenIT
 
GBM package in r
GBM package in rGBM package in r
GBM package in rmark_landry
 
Object recognition of CIFAR - 10
Object recognition of CIFAR  - 10Object recognition of CIFAR  - 10
Object recognition of CIFAR - 10Ratul Alahy
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkSigOpt
 

What's hot (17)

Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learning
 
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
 
Mining Top-k Closed Sequential Patterns in Sequential Databases
Mining Top-k Closed Sequential Patterns in Sequential Databases Mining Top-k Closed Sequential Patterns in Sequential Databases
Mining Top-k Closed Sequential Patterns in Sequential Databases
 
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeWorkflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
 
Access strategies ppt_ind
Access strategies ppt_indAccess strategies ppt_ind
Access strategies ppt_ind
 
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
 
Evaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past RequestsEvaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past Requests
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 
Object recognition of CIFAR - 10
Object recognition of CIFAR  - 10Object recognition of CIFAR  - 10
Object recognition of CIFAR - 10
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott Clark
 

Viewers also liked

Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Practical Computing Wiith Chaos
Practical Computing Wiith ChaosPractical Computing Wiith Chaos
Practical Computing Wiith ChaosMapR Technologies
 
Buzz Words Dunning Multi Modal Recommendations
Buzz Words Dunning Multi Modal RecommendationsBuzz Words Dunning Multi Modal Recommendations
Buzz Words Dunning Multi Modal RecommendationsMapR Technologies
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchMapR Technologies
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 

Viewers also liked (8)

Dunning strata-2012-27-02
Dunning strata-2012-27-02Dunning strata-2012-27-02
Dunning strata-2012-27-02
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Practical Computing Wiith Chaos
Practical Computing Wiith ChaosPractical Computing Wiith Chaos
Practical Computing Wiith Chaos
 
Buzz Words Dunning Multi Modal Recommendations
Buzz Words Dunning Multi Modal RecommendationsBuzz Words Dunning Multi Modal Recommendations
Buzz Words Dunning Multi Modal Recommendations
 
Strata New York 2012
Strata New York 2012Strata New York 2012
Strata New York 2012
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 March
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 

Similar to CMU Lecture on Hadoop Performance

Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningTed Dunning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutMapR Technologies
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptxArthur240715
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
Online advertising and large scale model fitting
Online advertising and large scale model fittingOnline advertising and large scale model fitting
Online advertising and large scale model fittingWush Wu
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportMapR Technologies
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoopTed Dunning
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopMapR Technologies
 

Similar to CMU Lecture on Hadoop Performance (20)

News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
London hug
London hugLondon hug
London hug
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache Mahout
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Polyvalent Recommendations
Polyvalent RecommendationsPolyvalent Recommendations
Polyvalent Recommendations
 
Online advertising and large scale model fitting
Online advertising and large scale model fittingOnline advertising and large scale model fitting
Online advertising and large scale model fitting
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering Report
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoop
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time Hadoop
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
CMU Lecture on Hadoop Performance

  • 1. Hadoop Performance
  • 2. Agenda
      • What is performance? Optimization?
      • Case 1: Aggregation
      • Case 2: Recommendations
      • Case 3: Clustering
      • Case 4: Matrix decomposition
  • 3. What is Performance?
      • Is doing something faster better?
      • Is it the right task?
      • Do you have a wide enough view?
      • What is the right performance metric?
  • 4. Aggregation
      • Word-count and friends
          – How many times did X occur?
          – How many unique X’s occurred?
      • Associative metrics permit decomposition
          – Partial sums and grand totals for example
          – Use combiners
          – Use high resolution aggregates to compute low resolution aggregates
      • Rank-based statistics do not permit decomposition
          – Avoid them
          – Use approximations
  • 5. Inside Map-Reduce
      • Stages: Input, Map, Combine, Shuffle and sort, Reduce, Output
      • Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
      • Map output: the, 1; time, 1; has, 1; come, 1; …
      • After shuffle and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …
      • Reduce output: come, 6; has, 8; the, 4; time, 14; …
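The map, combine, shuffle, and reduce stages can be imitated in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code: the function names and the two input splits are made up for illustration. It shows why a combiner is safe for word count: addition is associative, so each mapper can pre-sum its own output before the shuffle.

```python
from collections import Counter, defaultdict

def map_phase(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        word = word.strip('.,:;"!?')
        if word:
            yield word, 1

def combine(pairs):
    """Pre-aggregate one mapper's output locally (valid because + is associative)."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return partial.items()

def shuffle(combined_outputs):
    """Group the partial counts from all mappers by key."""
    grouped = defaultdict(list)
    for output in combined_outputs:
        for word, count in output:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the partial counts for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Two hypothetical input splits, one per mapper.
splits = [
    ['"The time has come," the Walrus said,'],
    ['"To talk of many things:', "the time, the time"],
]
combined = [list(combine(p for line in split for p in map_phase(line)))
            for split in splits]
totals = reduce_phase(shuffle(combined))
print(totals["the"], totals["time"])
```

With the combiner in place, each mapper ships at most one record per distinct word instead of one record per occurrence.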
  • 6. Don’t Do This (diagram: Raw, Daily, Weekly, Monthly)
  • 7. Do This Instead (diagram: Raw, Daily, Weekly, Monthly)
  • 8. Aggregation
      • First rule:
          – Don’t read the big input multiple times
          – Compute longer term aggregates from short term aggregates
      • Second rule:
          – Don’t read the big input multiple times
          – Compute multiple windowed aggregates at the same time
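A small sketch of the cascade, assuming events arrive as (date, key) pairs; the function names and the sample events are hypothetical. The raw events are scanned once to build daily counts, and the weekly and monthly counts are then derived from the much smaller daily table.

```python
from collections import Counter
from datetime import date

def daily_counts(events):
    """The only pass over the raw events: (day, key) -> count."""
    counts = Counter()
    for day, key in events:
        counts[(day, key)] += 1
    return counts

def roll_up(daily, bucket):
    """Derive a coarser aggregate from the daily one, never touching raw data."""
    out = Counter()
    for (day, key), n in daily.items():
        out[(bucket(day), key)] += n
    return out

def iso_week(day):
    year, week, _ = day.isocalendar()
    return (year, week)

def month(day):
    return (day.year, day.month)

# Hypothetical raw events.
events = [
    (date(2012, 10, 29), "x"),
    (date(2012, 10, 31), "x"),
    (date(2012, 11, 1), "x"),
    (date(2012, 11, 1), "y"),
]
daily = daily_counts(events)
weekly = roll_up(daily, iso_week)   # computed from daily, not from raw
monthly = roll_up(daily, month)     # likewise
```

Weeks do not nest inside months, which is why both coarser aggregates are built from the daily table here rather than one from the other.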
  • 9. Rank Statistics Can Be Tamed
      • Approximate quartiles are easily computed
          – (but sorted data is evil)
      • Approximate unique counts are easily computed
          – use Bloom filter and extrapolate from number of set bits
          – use multiple filters at different down-sample rates
      • Approximate high or low quantiles are easily computed
          – keep largest 1000 elements
          – keep largest 1000 elements from 10x down-sampled data
          – and so on
      • Approximate top-40 also possible
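One way to extrapolate a unique count from the number of set bits is the linear-counting estimate n ≈ -m ln(1 - X/m), where m is the number of bits and X the number set. The sketch below assumes a single hash function (a one-hash Bloom filter); the class name, the bitmap size, and the use of SHA-1 are illustrative choices, not taken from the slides.

```python
import hashlib
import math

class BitmapCounter:
    """One-hash bitmap; estimates distinct items from the fraction of set bits."""

    def __init__(self, m=1 << 16):
        self.m = m                      # number of bits, assumed a multiple of 8
        self.bits = bytearray(m // 8)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") % self.m
        self.bits[h >> 3] |= 1 << (h & 7)

    def estimate(self):
        set_bits = sum(bin(b).count("1") for b in self.bits)
        if set_bits == self.m:
            return float("inf")         # bitmap saturated, estimate undefined
        return -self.m * math.log(1.0 - set_bits / self.m)

counter = BitmapCounter()
for i in range(10000):
    for _ in range(3):                  # duplicates do not change the bitmap
        counter.add(i)
print(round(counter.estimate()))
```

The bitmaps of two partitions can be merged with a bitwise OR, so this estimate decomposes across mappers in a way that an exact distinct count does not.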
  • 10. Recommendations
      • Common patterns in the past may predict common patterns in the future
      • People who bought item x also bought item y
      • But also, people who bought Chinese food in the past, …
      • Or people in SoMa really liked this restaurant in the past
  • 11. People who bought …
      • Key operation is counting number of people who bought x and y
          – for all x’s and all y’s
      • The raw problem appears to be O(N^3)
      • At the least, O(k_max^2)
          – for most prolific user, there are k^2 pairs to count
          – k_max can be near N
      • Scalable problems must be O(N)
  • 12. But …
      • What do we learn from users who buy everything
          – they have no discrimination
          – they are often the QA team
          – they tell us nothing
      • What do we learn from items bought by everybody
          – the dual of omnivorous buyers
          – these are often teaser items
          – they tell us nothing
  • 13. Also …
      • What would you learn about a user from purchases
          – 1 … 20?
          – 21 … 100?
          – 101 … 1000?
          – 1001 … ∞?
      • What about learning about an item?
          – how many people do we need to see before we understand the item?
  • 14. So …
      • Cheat!
      • Downsample every user to at most 1000 interactions
          – most recent
          – most rare
          – random selection
          – whatever is easiest
      • Now k_max ≤ 1000
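A minimal sketch of cooccurrence counting with the per-user cap, using the random-selection variant. The function name, the dictionary layout of the histories, and the tiny cap in the example are assumptions made for illustration. Because each user contributes at most k_max items, the work per user is bounded by k_max^2 pairs regardless of how prolific the user is.

```python
import random
from collections import Counter
from itertools import combinations

def cooccurrences(histories, k_max=1000, seed=0):
    """Count how many users have both x and y, capping each user at k_max items."""
    rng = random.Random(seed)
    counts = Counter()
    for items in histories.values():
        distinct = sorted(set(items))
        if len(distinct) > k_max:
            # Random-selection downsampling; "most recent" would work as well.
            distinct = sorted(rng.sample(distinct, k_max))
        for x, y in combinations(distinct, 2):
            counts[(x, y)] += 1
    return counts

# Hypothetical histories; "qa" stands in for a user who buys everything.
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "qa": list(range(5000)),
}
counts = cooccurrences(histories, k_max=3)
print(counts[("a", "b")])
```

Without the cap, the "qa" user alone would generate roughly 12.5 million pairs; with it, that user generates three.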
  • 15. The Fundamental Things Apply
      • Don’t read the raw data repeatedly
      • Sessionize and denormalize per hour/day/week
          – that is, group by user
          – expand items with categories and content descriptors if feasible
      • Feed all down-stream processing in one pass
          – baby join to item characteristics
          – downsample
          – count grand totals
          – compute cooccurrences
  • 16. Deployment Matters, Too
      • For restaurant case, basic recommendation info includes:
          – user x merchant histories
          – user x cuisine histories
          – top local restaurant by anomalous repeat visits
          – restaurant x indicator merchant cooccurrence matrix
          – restaurant x indicator cuisine cooccurrence matrix
      • These can all be stored and accessed using text retrieval techniques
      • Fast deployment using mirrors and NFS (not standard Hadoop)
  • 17. Non-Traditional Deployment Demo
  • 18. EM Algorithms
      • Start with random model estimates
      • Use model estimates to classify examples
      • Use classified examples to find probability maximum estimates
      • Use model estimates to classify examples
      • Use classified examples to find probability maximum estimates
      • … And so on …
  • 19. K-means as EM Algorithm
      • Assign a random seed to each cluster
      • Assign points to nearest cluster
      • Move cluster to average of contained points
      • Assign points to nearest cluster
      • … and so on …
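The alternation described above is Lloyd's algorithm. The following pure-Python sketch runs it in a single process with a fixed number of iterations; the function names and the toy data are illustrative, and seeding is done by sampling k of the input points.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # random seeds
    for _ in range(iterations):
        # Assignment step: every point goes to its nearest cluster.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Update step: move each cluster to the average of its points.
        for i, group in enumerate(groups):
            if group:
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(sorted(kmeans(points, k=2)))
```

Each iteration is one full scan of the points, which is the cost that makes a naive map-reduce implementation pay for job startup and disk I/O once per iteration.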
  • 20. K-means as Map-Reduce
      • Assignment of points to cluster is trivially parallel
      • Computation of new clusters is also parallel
      • Moving points to averages is ideal for map-reduce
  • 21. But …
      • With map-reduce, iteration is evil
      • Starting a program can take 10-30s
      • Saving data to disk and then immediately reading from disk is silly
      • Input might even fit in cluster memory
  • 22. Fix #1
      • Don’t do that!
      • Use Spark
          – in memory interactive map-reduce
          – 100x to 1000x faster
          – must fit in memory
      • Use Giraph
          – BSP programming model rather than map-reduce
          – essentially map-reduce-reduce-reduce…
      • Use GraphLab
          – Like BSP without the speed brakes
          – 100x faster
  • 23. Fix #2
      • Use a sketch-based algorithm
      • Do one pass over the data to compute sketch of the data
      • Cluster the sketch
      • Done. With good theoretic bounds on accuracy
      • Speedup of 3000x or more
  • 24. An Example
  • 25. The Problem
      • Spirals are a classic “counter” example for k-means
      • Classic low dimensional manifold with added noise
      • But clustering still makes modeling work well
  • 26. An Example
  • 27. An Example
  • 28. The Cluster Proximity Features
      • Every point can be described by the nearest cluster
          – 4.3 bits per point in this case
          – Significant error that can be decreased (to a point) by increasing number of clusters
      • Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
          – Error is negligible
          – Unwinds the data into a simple representation
  • 29. Lots of Clusters Are Fine
  • 30. Surrogate Method
      • Start with sloppy clustering into κ = k log n clusters
      • Use this sketch as a weighted surrogate for the data
      • Cluster surrogate data using ball k-means
      • Results are provably good for highly clusterable data
      • Sloppy clustering is on-line
      • Surrogate can be kept in memory
      • Ball k-means pass can be done at any time
  • 31. Algorithm Costs
      • O(k d log n) per point per iteration for Lloyd’s algorithm
      • Number of iterations not well known
      • Iteration > log n reasonable assumption
  • 32. Algorithm Costs
      • Surrogate methods
          – fast, sloppy single pass clustering with κ = k log n
          – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
          – fast, in-memory, high-quality clustering of κ weighted centroids
              O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
              O(κ d log k) or O(d log κ log k) for larger k, looser quality
          – result is k high-quality centroids
              • Even the sloppy clusters may suffice
  • 33. Algorithm Costs
      • How much faster for the sketch phase?
          – take k = 2000, d = 10, n = 100,000
          – k d log n = 2000 x 10 x 26 = 500,000
          – log k + log log n = 11 + 5 = 16
          – 30,000 times faster is a bona fide big deal
  • 34. Pragmatics
      • But this requires a fast search internally
      • Have to cluster on the fly for sketch
      • Have to guarantee sketch quality
      • Previous methods had very high complexity
  • 35. How It Works
      • For each point
          – Find approximately nearest centroid (distance = d)
          – If (d > threshold) new centroid
          – Else if (u < d/threshold) new centroid, where u is a uniform random number in [0, 1)
          – Else add to nearest centroid
      • If centroids > κ ≈ C log N
          – Recursively cluster centroids with higher threshold
      • Result is large set of centroids
          – these provide approximation of original distribution
          – we can cluster centroids to get a close approximation of clustering original
          – or we can just use the result directly
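A single-pass sketch along these lines can be written in plain Python. Several things here are assumptions rather than details from the slides: a new centroid is opened with probability d/threshold (and always when d exceeds the threshold), the nearest centroid is found by an exact linear scan instead of a fast approximate search, the threshold grows by a factor of 1.5 on each collapse, and max_centroids must be at least 1.

```python
import math
import random

def absorb(centroids, point, weight, threshold, rng):
    """Add one weighted point: merge into the nearest centroid or open a new one."""
    if centroids:
        best = min(centroids, key=lambda c: math.dist(c[0], point))
        d = math.dist(best[0], point)
        # Far points (d >= threshold) always open a centroid; nearer points
        # open one with probability d / threshold, otherwise they are merged.
        if rng.random() >= d / threshold:
            total = best[1] + weight
            best[0] = [(m * best[1] + x * weight) / total
                       for m, x in zip(best[0], point)]
            best[1] = total
            return
    centroids.append([list(point), weight])

def sketch(points, threshold, max_centroids, seed=0):
    """One pass over points; returns a list of [mean_vector, weight] centroids."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        absorb(centroids, p, 1, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: re-cluster the centroids with a higher threshold.
            threshold *= 1.5
            old, centroids = centroids, []
            for mean, weight in old:
                absorb(centroids, mean, weight, threshold, rng)
    return centroids

data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(500)]
summary = sketch(points, threshold=0.05, max_centroids=20)
```

The weights record how many original points each centroid stands for, so the summary can be handed to a weighted k-means as a surrogate for the full data set.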
  • 36. Matrix Decomposition
      • Many big matrices can often be compressed
      • Often used in recommendations
  • 37. Nearest Neighbor
      • Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
      • Fast search algorithms work up to dimension 50-100, don’t work above that
  • 38. Random Projections
      • Many problems in high dimension can be reduced to low dimension
      • Reductions with good distance approximation are available
      • Surprisingly, these methods can be done using random vectors
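One standard instance of such a reduction is a Gaussian random projection, sketched here in plain Python. The dimensions, the seed, and the function names are arbitrary choices for illustration. Each output coordinate is a dot product with an independent random vector, scaled so that squared lengths are preserved in expectation.

```python
import math
import random

def random_projection_matrix(dim_in, dim_out, seed=0):
    """dim_out random Gaussian rows, scaled by 1/sqrt(dim_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, vector):
    """Multiply the projection matrix by a vector."""
    return [sum(r * v for r, v in zip(row, vector)) for row in matrix]

data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
y = [data_rng.gauss(0.0, 1.0) for _ in range(500)]

R = random_projection_matrix(dim_in=500, dim_out=200)
ratio = math.dist(project(R, x), project(R, y)) / math.dist(x, y)
print(ratio)   # close to 1 when distances are approximately preserved
```

The distortion shrinks roughly like 1/sqrt(dim_out), so the target dimension is chosen from the accuracy required rather than from the input dimension.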
  • 39. Fundamental Trick
      • Random orthogonal projection preserves action of A
          ||Ax - Ay|| ≈ ||Q^T A x - Q^T A y||
  • 40. Projection Search
      • total ordering!
  • 41. LSH Bit-match Versus Cosine (figure: number of matching bits, 0 to 64, against cosine, -1 to 1)
  • 42. But How?
          Y = A W
          Q_1 R = Y
          B = Q_1^T A
          L Q_2^T = B
          U S V^T = L
          (Q_1 U) S (Q_2 V)^T ≈ A
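These steps can be sketched with NumPy as follows. The assumptions are labeled in the comments: W is taken to be a Gaussian random matrix, a few extra columns are sampled for accuracy (the `oversample` parameter is an addition, not part of the slide), and Q_2 is taken with orthonormal columns so that B = L Q_2^T. Both dimensions of A must be at least rank + oversample.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Approximate truncated SVD of A via a random projection of its range."""
    rng = np.random.default_rng(seed)
    k = rank + oversample
    W = rng.standard_normal((A.shape[1], k))   # assumed: Gaussian random matrix
    Y = A @ W                                  # Y = A W
    Q1, _ = np.linalg.qr(Y)                    # Q1 R = Y
    B = Q1.T @ A                               # B = Q1^T A, only k rows
    Q2, Lt = np.linalg.qr(B.T)                 # B^T = Q2 L^T, so B = L Q2^T
    U, S, Vt = np.linalg.svd(Lt.T)             # U S V^T = L, a small k x k problem
    left = Q1 @ U                              # (Q1 U)
    right = Q2 @ Vt.T                          # (Q2 V)
    return left[:, :rank], S[:rank], right[:, :rank]

# A hypothetical 60 x 40 matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))

U, S, V = randomized_svd(A, rank=5)
print(np.linalg.norm((U * S) @ V.T - A))
```

The only operations that touch the full matrix are the two products A W and Q_1^T A; everything else works on matrices with k rows or columns.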
  • 43. Summary
      • Don’t repeat big scans
          – Cascade aggregations
          – Compute several aggregates at once
      • Use approximate measures for rank statistics
      • Downsample where appropriate
      • Use non-traditional deployment
      • Use sketches
      • Use random projections
  • 44. Contact Me!
      • We’re hiring at MapR in US and Europe
      • Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
      • Get the code at https://github.com/tdunning
      • Contact me at tdunning@maprtech.com or @ted_dunning