SlideShare a Scribd company logo
Practical Machine Learning
with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
(we’re hiring)
• Contact me at
tdunning@maprtech.com
tdunning@apache.com
ted.dunning@gmail.com
@ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping
What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more
complex than these
What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more
complex than these
• These are harder than they look
Recommendations
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
bought y”
• But soooo much more is possible
Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and
Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs
(Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
Recommendations Analysis
• R(x,y) = # people who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
Recommendations Analysis
• R(x,y) = People who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
Recommendations Analysis
• R(x,y) = People who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
Recommendations Analysis
• R(x,y) = People who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
Recommendations Analysis
• R(x,y) = People who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
Recommendations Analysis
• R(x,y) = People who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
Recommendations Analysis
Rij = AuiBuju
å
= AT
B
Fundamental Algorithmic Structure
• Cooccurrence
• Matrix approximation by factoring
• LLR
K = AT
A
A » USVT
K » VS2
VT
r = VS2
VT
h
r =sparsify(AT
A)h
But Wait!
• Cooccurrence
• Cross occurrence
K = AT
A
K = BT
A
For example
• Users enter queries (A)
– (actor = user, item=query)
• Users view videos (B)
– (actor = user, item=video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”
The punch-line
• B’A recommends videos in response to a
query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-
data)
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Real-life example
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever
users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue
An Example
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
Clusters as Distribution Surrogate
THEORY
For Example
Grouping these
two clusters
seriously hurts
squared distance
D4
2
(X) >
1
s 2
D5
2
(X)
ALGORITHMS
Typical k-means Failure
Selecting two seeds
here cannot be
fixed with Lloyds
Result is that these two
clusters get glued
together
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in each “core” of each real
clusters
• Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing
tendency
for each of a very few iterations:
for each data point:
assign point to nearest cluster
recompute centroids using only points much closer than
closest cluster
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
exponentially with k
• Alternative strategy has high probability of
success, but takes O(nkd + k3d) time
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
exponentially with k
• Alternative strategy has high probability of
success, but takes O( nkd + k3d ) time
• But for big data, k gets large
Surrogate Method
• Start with sloppy clustering into lots of
clusters
κ = k log n clusters
• Use this sketch as a weighted surrogate for the
data
• Results are provably good for highly
clusterable data
Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster,
O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted
centroids
O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality
O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy surrogate may suffice
Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster,
O(d log κ) = O(d ( log k + log log n )) per point
– fast, in-memory, high-quality clustering of κ weighted
centroids
O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality
O(κ d log k) or O( d log k ( log k + log log n ) ) for larger k,
looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 500,000
– d (log k + log log n) = 10(11 + 5) = 170
– 3,000 times faster is a bona fide big deal
Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 500,000
– d (log k + log log n) = 10(11 + 5) = 170
– 3,000 times faster is a bona fide big deal
How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u > d/threshold) new cluster
– Else add to nearest centroid
• If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
IMPLEMENTATION
But Wait, …
• Finding nearest centroid is inner loop
• This could take O( d κ ) per point and κ can be
big
• Happily, approximate nearest centroid works
fine
Projection Search
total ordering!
LSH Bit-match Versus Cosine
0 8 16 24 32 40 48 56 64
1
- 1
- 0.8
- 0.6
- 0.4
- 0.2
0
0.2
0.4
0.6
0.8
X Axis
YAxis
RESULTS
Parallel Speedup?
1 2 3 4 5 20
10
100
20
30
40
50
200
Threads
Timeperpoint(μs)
2
3
4
5
6
8
10
12
14
16
Threaded version
Non- threaded
Perfect Scaling
✓
Quality
• Ball k-means implementation appears significantly
better than simple k-means
• Streaming k-means + ball k-means appears to be about
as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance
to nearest cluster
Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout

More Related Content

What's hot

Building Scalable Semantic Geospatial RDF Stores
Building Scalable Semantic Geospatial RDF StoresBuilding Scalable Semantic Geospatial RDF Stores
Building Scalable Semantic Geospatial RDF Stores
Kostis Kyzirakos
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
남주 김
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
Rajeev Raman
 
Querying Temporal Databases via OWL 2 QL
Querying Temporal Databases via OWL 2 QLQuerying Temporal Databases via OWL 2 QL
Querying Temporal Databases via OWL 2 QL
Szymon Klarman
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
MLconf
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
Domain Adaptation
Domain AdaptationDomain Adaptation
Domain Adaptation
Mark Chang
 
Project - Deep Locality Sensitive Hashing
Project - Deep Locality Sensitive HashingProject - Deep Locality Sensitive Hashing
Project - Deep Locality Sensitive Hashing
Gabriele Angeletti
 
Generative adversarial text to image synthesis
Generative adversarial text to image synthesisGenerative adversarial text to image synthesis
Generative adversarial text to image synthesis
Universitat Politècnica de Catalunya
 
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander KorotkovPostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
Nikolay Samokhvalov
 
Representing and Querying Geospatial Information in the Semantic Web
Representing and Querying Geospatial Information in the Semantic WebRepresenting and Querying Geospatial Information in the Semantic Web
Representing and Querying Geospatial Information in the Semantic WebKostis Kyzirakos
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindingsDmitriy Lyubimov
 
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
WiMLDSMontreal
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud confamarsri
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷
Eiji Sekiya
 

What's hot (16)

Building Scalable Semantic Geospatial RDF Stores
Building Scalable Semantic Geospatial RDF StoresBuilding Scalable Semantic Geospatial RDF Stores
Building Scalable Semantic Geospatial RDF Stores
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
 
Querying Temporal Databases via OWL 2 QL
Querying Temporal Databases via OWL 2 QLQuerying Temporal Databases via OWL 2 QL
Querying Temporal Databases via OWL 2 QL
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
Domain Adaptation
Domain AdaptationDomain Adaptation
Domain Adaptation
 
Project - Deep Locality Sensitive Hashing
Project - Deep Locality Sensitive HashingProject - Deep Locality Sensitive Hashing
Project - Deep Locality Sensitive Hashing
 
Generative adversarial text to image synthesis
Generative adversarial text to image synthesisGenerative adversarial text to image synthesis
Generative adversarial text to image synthesis
 
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander KorotkovPostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
 
Representing and Querying Geospatial Information in the Semantic Web
Representing and Querying Geospatial Information in the Semantic WebRepresenting and Querying Geospatial Information in the Semantic Web
Representing and Querying Geospatial Information in the Semantic Web
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
 
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
 
Jvm heap
Jvm heapJvm heap
Jvm heap
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷
 

Viewers also liked

Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
MapR Technologies
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering Report
MapR Technologies
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
MapR Technologies
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
Ted Dunning
 
Technical Overview of Apache Drill by Jacques Nadeau
Technical Overview of Apache Drill by Jacques NadeauTechnical Overview of Apache Drill by Jacques Nadeau
Technical Overview of Apache Drill by Jacques Nadeau
MapR Technologies
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
MapR Technologies
 

Viewers also liked (7)

Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering Report
 
Oscon Data 2011 Ted Dunning
Oscon Data 2011 Ted DunningOscon Data 2011 Ted Dunning
Oscon Data 2011 Ted Dunning
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 
Technical Overview of Apache Drill by Jacques Nadeau
Technical Overview of Apache Drill by Jacques NadeauTechnical Overview of Apache Drill by Jacques Nadeau
Technical Overview of Apache Drill by Jacques Nadeau
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 

Similar to Paris Data Geeks

Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
MapR Technologies
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
MapR Technologies
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
Ted Dunning
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
Damian R. Mingle, MBA
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
Valerio Maggio
 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
Maarten Breddels
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
Ali-ziane Myriam
 
managing big data
managing big datamanaging big data
managing big data
Suveeksha
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptx
Subrata Kumer Paul
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
eXascale Infolab
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
K Nearest neighbour
K Nearest neighbourK Nearest neighbour
K Nearest neighbour
Learnbay Datascience
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
Ajay Ohri
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
Tech in Asia ID
 
Skillwise Big data
Skillwise Big dataSkillwise Big data
Skillwise Big data
Skillwise Group
 
The ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years OldThe ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years Old
John Kunze
 
07-Classification.pptx
07-Classification.pptx07-Classification.pptx
07-Classification.pptx
Shree Shree
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
Sudhakar Chavan
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
ssuserf07225
 
Big data
Big dataBig data

Similar to Paris Data Geeks (20)

Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
 
managing big data
managing big datamanaging big data
managing big data
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptx
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
K Nearest neighbour
K Nearest neighbourK Nearest neighbour
K Nearest neighbour
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
 
Skillwise Big data
Skillwise Big dataSkillwise Big data
Skillwise Big data
 
The ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years OldThe ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years Old
 
07-Classification.pptx
07-Classification.pptx07-Classification.pptx
07-Classification.pptx
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
Big data
Big dataBig data
Big data
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
MapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
MapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
MapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
MapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
MapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 

Recently uploaded (20)

Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 

Paris Data Geeks

  • 2. whoami – Ted Dunning • Chief Application Architect, MapR Technologies • Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill (we’re hiring) • Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning
  • 3. Agenda • What works at scale • Recommendation • Unsupervised - Clustering
  • 4. What Works at Scale • Logging • Counting • Session grouping
  • 5. What Works at Scale • Logging • Counting • Session grouping • Really. Don’t bet on anything much more complex than these
  • 6. What Works at Scale • Logging • Counting • Session grouping • Really. Don’t bet on anything much more complex than these • These are harder than they look
  • 8. Recommendations • Special case of reflected intelligence • Traditionally “people who bought x also bought y” • But soooo much more is possible
  • 9. Examples • Customers buying books (Linden et al) • Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix) • Internet radio listeners not skipping songs (Musicmatch) • Internet video watchers watching >30 s
  • 10. Dyadic Structure • Functional – Interaction: actor -> item* • Relational – Interaction ⊆ Actors x Items • Matrix – Rows indexed by actor, columns by item – Value is count of interactions • Predict missing observations
  • 11. Recommendations Analysis • R(x,y) = # people who bought x also bought y select x, y, count(*) from ( (select distinct(user_id, item_id) as x from log) A join (select distinct(user_id, item_id) as y from log) B on user_id ) group by x, y
  • 12. Recommendations Analysis • R(x,y) = People who bought x also bought y select x, y, count(*) from ( (select distinct(user_id, item_id) as x from log) A join (select distinct(user_id, item_id) as y from log) B on user_id ) group by x, y
  • 13. Recommendations Analysis • R(x,y) = People who bought x also bought y select x, y, count(*) from ( (select distinct(user_id, item_id) as x from log) A join (select distinct(user_id, item_id) as y from log) B on user_id ) group by x, y
  • 14. Recommendations Analysis • R(x,y) = People who bought x also bought y select x, y, count(*) from ( (select distinct(user_id, item_id) as x from log) A join (select distinct(user_id, item_id) as y from log) B on user_id ) group by x, y
  • 15. Recommendations Analysis • R(x,y) = People who bought x also bought y select x, y, count(*) from ( (select distinct(user_id, item_id) as x from log) A join (select distinct(user_id, item_id) as y from log) B on user_id ) group by x, y
  • 16. Recommendations Analysis • R(x,y) = People who bought x also bought y select x, y, count(*) from ( (select distinct(user_id, item_id) as x from log) A join (select distinct(user_id, item_id) as y from log) B on user_id ) group by x, y
  • 17. Recommendations Analysis Rij = AuiBuju å = AT B
  • 18. Fundamental Algorithmic Structure • Cooccurrence • Matrix approximation by factoring • LLR K = AT A A » USVT K » VS2 VT r = VS2 VT h r =sparsify(AT A)h
  • 19. But Wait! • Cooccurrence • Cross occurrence K = AT A K = BT A
  • 20. For example • Users enter queries (A) – (actor = user, item=query) • Users view videos (B) – (actor = user, item=video) • A’A gives query recommendation – “did you mean to ask for” • B’B gives video recommendation – “you might like these videos”
  • 21. The punch-line • B’A recommends videos in response to a query – (isn’t that a search engine?) – (not quite, it doesn’t look at content or meta- data)
  • 22. Real-life example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres del paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 24. Hypothetical Example • Want a navigational ontology? • Just put labels on a web page with traffic – This gives A = users x label clicks • Remember viewing history – This gives B = users x items • Cross recommend – B’A = label to item mapping • After several users click, results are whatever users think they should be
  • 27. What is Quality? • Robust clustering not a goal – we don’t care if the same clustering is replicated • Generalization is critical • Agreement to “gold standard” is a non-issue
  • 34. For Example Grouping these two clusters seriously hurts squared distance D4 2 (X) > 1 s 2 D5 2 (X)
  • 36. Typical k-means Failure Selecting two seeds here cannot be fixed with Lloyds Result is that these two clusters get glued together
  • 37. Ball k-means • Provably better for highly clusterable data • Tries to find initial centroids in each “core” of each real clusters • Avoids outliers in centroid computation initialize centroids randomly with distance maximizing tendency for each of a very few iterations: for each data point: assign point to nearest cluster recompute centroids using only points much closer than closest cluster
  • 38. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes O(nkd + k3d) time
  • 39. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes O( nkd + k3d ) time • But for big data, k gets large
  • 40. Surrogate Method • Start with sloppy clustering into lots of clusters κ = k log n clusters • Use this sketch as a weighted surrogate for the data • Results are provably good for highly clusterable data
  • 41. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of κ weighted centroids O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality O(κ d log k) or O(d log κ log k) for larger k, looser quality – result is k high-quality centroids • Even the sloppy surrogate may suffice
  • 42. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d ( log k + log log n )) per point – fast, in-memory, high-quality clustering of κ weighted centroids O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality O(κ d log k) or O( d log k ( log k + log log n ) ) for larger k, looser quality – result is k high-quality centroids • For many purposes, even the sloppy surrogate may suffice
  • 43. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000 – k d log n = 2000 x 10 x 26 = 500,000 – d (log k + log log n) = 10(11 + 5) = 170 – 3,000 times faster is a bona fide big deal
  • 44. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000 – k d log n = 2000 x 10 x 26 = 500,000 – d (log k + log log n) = 10(11 + 5) = 170 – 3,000 times faster is a bona fide big deal
  • 45. How It Works • For each point – Find approximately nearest centroid (distance = d) – If (d > threshold) new centroid – Else if (u > d/threshold) new cluster – Else add to nearest centroid • If centroids > κ ≈ C log N – Recursively cluster centroids with higher threshold
  • 47. But Wait, … • Finding nearest centroid is inner loop • This could take O( d κ ) per point and κ can be big • Happily, approximate nearest centroid works fine
  • 49. LSH Bit-match Versus Cosine 0 8 16 24 32 40 48 56 64 1 - 1 - 0.8 - 0.6 - 0.4 - 0.2 0 0.2 0.4 0.6 0.8 X Axis YAxis
  • 51. Parallel Speedup? 1 2 3 4 5 20 10 100 20 30 40 50 200 Threads Timeperpoint(μs) 2 3 4 5 6 8 10 12 14 16 Threaded version Non- threaded Perfect Scaling ✓
  • 52. Quality • Ball k-means implementation appears significantly better than simple k-means • Streaming k-means + ball k-means appears to be about as good as ball k-means alone • All evaluations on 20 newsgroups with held-out data • Figure of merit is mean and median squared distance to nearest cluster
  • 53. Contact Me! • We’re hiring at MapR in US and Europe • MapR software available for research use • Get the code as part of Mahout trunk (or 0.8 very soon) • Contact me at tdunning@maprtech.com or @ted_dunning • Share news with @apachemahout