Copyright © 2012, SAS Institute Inc. All rights reserved.
DETERMINING THE NUMBER OF CLUSTERS
IN A DATASET USING ABC
I. KABUL, P. HALL, J. SILVA, W. SARLE
ENTERPRISE MINER R&D
SAS INSTITUTE
CLUSTERING
Objects within a cluster are as similar as possible
Objects from different clusters are as dissimilar as possible
Hossein Parsaei
CHALLENGES IN CLUSTERING
• No prior knowledge
• Which similarity measure?
• Which clustering algorithm?
• How to evaluate the results?
• How many clusters?
The Aligned Box Criterion (ABC) addresses the unsolved, important
problem of determining the number of clusters in a data set.
ABC can be applied in Market Segmentation and many other types of
statistical, data mining and machine learning analyses.
CONTENTS
• Background
• Aligned Box Criterion (ABC) Method
• Results
• ABC Method in Parallel and Distributed Architecture
• Conclusions
BACKGROUND
FINDING THE RIGHT NUMBER OF CLUSTERS
• Many methods have been proposed:
• Calinski-Harabasz index [Calinski 1974]
• Cubic clustering criterion (CCC) [Sarle 1983]
• Silhouette statistic [Rousseeuw 1987]
• Gap statistic [Tibshirani 2001]
• Jump method [Sugar 2003]
• Prediction strength [Tibshirani 2005]
• Dirichlet process [Teh 2006]
WITHIN CLUSTER SUM OF SQUARES
• A good clustering yields clusters where observations have a small within-cluster sum-of-squares (and a high between-cluster sum-of-squares).
• Wk is low when the partition is good, BUT it is by construction monotone nonincreasing in k (within-cluster dissimilarity always decreases with more clusters).
$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$$
$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$
Within-cluster SSE:
Measure of compactness of
clusters
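As a minimal sketch of the definition on this slide, the Python snippet below computes Wk from the pairwise form of D_r and checks it against the equivalent sum of squared distances to the cluster centroids. The function names and the toy data are illustrative assumptions, not part of the original deck.

```python
import numpy as np

def within_cluster_dispersion(X, labels):
    """Return W_k = sum over clusters of D_r / (2 n_r)."""
    w = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        n_r = len(pts)
        # D_r: squared Euclidean distances summed over all ordered pairs (i, j)
        diffs = pts[:, None, :] - pts[None, :, :]
        d_r = np.sum(diffs ** 2)
        w += d_r / (2.0 * n_r)
    return w

def pooled_sse(X, labels):
    """Same quantity, computed as squared distances to each cluster centroid."""
    return sum(
        np.sum((X[labels == r] - X[labels == r].mean(axis=0)) ** 2)
        for r in np.unique(labels)
    )

# Toy data: two Gaussian blobs with known labels (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(within_cluster_dispersion(X, labels), pooled_sse(X, labels))
```

The pairwise form costs O(n_r^2) per cluster, so the centroid form is the one to prefer on anything but small data.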
BACKGROUND USING WK TO DETERMINE # OF CLUSTERS
Elbow method (L-curve method)
Idea:
use the k corresponding to the “elbow”
Problem:
no reference clustering to compare
the differences Wk − Wk−1 are not normalized for comparison
BACKGROUND REFERENCE DISTRIBUTIONS
• Cubic Clustering Criterion (CCC), Gap Statistic and ABC amplify the elbow
phenomenon by using differences between within cluster sum of squares of a
clustering solution in the training data (Wk) and a clustering solution in a reference
distribution (Wk*).
• Aligned box criterion (ABC)
• Gap statistic
• Cubic clustering criterion (CCC)
Reference
distribution
complexity
Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983
Gap Statistic: Tibshirani et al, J.R. Statist. Soc., 2001
CCC METHOD
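Instead of using Wk directly, CCC uses R^2.
$$R^2 = 1 - \frac{\operatorname{trace}(W)}{\operatorname{trace}(T)}, \qquad \operatorname{trace}(W) = W_k$$
For CCC calculation, R^2 and E(R^2) are approximated by heuristic formulas.
$$CCC = \log\!\left[\frac{1 - E(R^2)}{1 - R^2}\right] \frac{\sqrt{n p^{*} / 2}}{\left(0.001 + E(R^2)\right)^{1.2}}$$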
Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983
The heuristic formulas were derived from numerous Monte Carlo simulations. CCC generates one hyper-cube reference distribution, based on the dimensions of the given training dataset, to test all k of interest.
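A small Python sketch of the CCC formula follows. It assumes the standard form of the criterion with a square root on the n p*/2 term, and it takes R^2, E(R^2), n and p* as given inputs: the heuristic approximation of E(R^2) from the technical report is not reproduced here. The function name and the example numbers are hypothetical.

```python
import math

def ccc(r2, expected_r2, n, p_star):
    """Cubic clustering criterion from observed and expected R^2.

    r2          observed R^2 of the clustering solution
    expected_r2 E(R^2) under the uniform hyper-cube reference (supplied by caller)
    n           number of observations
    p_star      the p* term of the formula (supplied by caller)
    """
    ratio = (1.0 - expected_r2) / (1.0 - r2)
    scale = math.sqrt(n * p_star / 2.0) / (0.001 + expected_r2) ** 1.2
    return math.log(ratio) * scale

# Hypothetical inputs, for illustration only
value = ccc(r2=0.80, expected_r2=0.60, n=1000, p_star=2)
print(value)
```

The sign is what matters in practice: the value is positive when the observed R^2 exceeds the expected R^2 and negative otherwise.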
GAP STATISTICS METHOD
The Gap Statistic computes the (log) ratio Wk* / Wk:
$$\mathrm{Gap}(k) = \log W_k^{*} - \log W_k$$
Wk* is calculated from a clustering solution in the reference distribution.
Finds k that maximizes Gap(k) (within some tolerance)
TWO TYPES OF UNIFORM DISTRIBUTIONS
1. Align with feature axes (data-geometry independent)
[Figure panels: Observations; Bounding Box (aligned with feature axes); Monte Carlo Simulations]
TWO TYPES OF UNIFORM DISTRIBUTIONS
2. Align with principal axes (data-geometry dependent)
[Figure panels: Observations; Bounding Box (aligned with principal axes); Monte Carlo Simulations]
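The two kinds of uniform reference distribution can be sketched in Python as follows. This is an illustration under stated assumptions: the principal axes are taken from an SVD of the centered data, and the function names and the toy data are hypothetical.

```python
import numpy as np

def uniform_feature_box(X, rng):
    """Uniform sample over the bounding box aligned with the feature axes."""
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def uniform_principal_box(X, rng):
    """Uniform sample over the bounding box aligned with the principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    Z = (X - mean) @ vt.T                      # rotate into principal axes
    Z_ref = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return Z_ref @ vt + mean                   # rotate back to feature space

# Toy data: an elongated cloud along the diagonal (hypothetical example)
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, t + 0.1 * rng.normal(size=500)])
ref_axes = uniform_feature_box(X, rng)
ref_pca = uniform_principal_box(X, rng)
```

On data like this, the feature-aligned box fills the whole square around the diagonal, while the principal-axes box hugs the cloud, which is why the second variant is described as data-geometry dependent.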
COMPUTATION OF THE GAP STATISTIC
for b = 1 to B
    Compute Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)
for k = 1 to K
    Cluster the observations into k groups and compute log Wk
    for b = 1 to B
        Cluster the M.C. sample into k groups and compute log Wkb
    Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the standard deviation of {log Wkb}, b = 1,…,B
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\; sd(k)$
Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
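The loop above can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the results in this deck: it assumes a feature-axis-aligned uniform reference box, a small hand-written Lloyd's k-means with deterministic farthest-first seeding, and hypothetical function names and toy data.

```python
import numpy as np

def farthest_first_centers(X, k):
    """Deterministic seeding: start at X[0], then repeatedly add the point
    farthest from all centers chosen so far."""
    centers = [X[0]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        centers.append(X[d2.argmax()])
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers, dtype=float)

def log_wk(X, k, n_iter=50):
    """log of the pooled within-cluster sum of squares after Lloyd's k-means."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for r in range(k):
            if np.any(labels == r):  # an emptied cluster keeps its old center
                centers[r] = X[labels == r].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.log(d2.min(axis=1).sum())

def gap_statistic(X, K, B, rng):
    """Return arrays (gap, s) whose entry k-1 holds Gap(k) and s_k, k = 1..K."""
    low, high = X.min(axis=0), X.max(axis=0)
    refs = [rng.uniform(low, high, size=X.shape) for _ in range(B)]
    gap, s = np.empty(K), np.empty(K)
    for k in range(1, K + 1):
        log_w = log_wk(X, k)
        log_wkb = np.array([log_wk(ref, k) for ref in refs])
        gap[k - 1] = log_wkb.mean() - log_w
        s[k - 1] = np.sqrt(1.0 + 1.0 / B) * log_wkb.std()
    return gap, s

# Toy data: three well-separated blobs (hypothetical example)
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10]], dtype=float)
X = np.vstack([c + 0.5 * rng.normal(size=(60, 2)) for c in blob_centers])
gap, s = gap_statistic(X, K=5, B=5, rng=rng)
print(np.round(gap, 2), np.round(s, 3))
```

With clusters this well separated, the Gap curve would be expected to peak at the true number of clusters; on overlapping data the peak is typically much flatter.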
GAP STATISTIC
NO-CLUSTER EXAMPLE
(JOURNAL VERSION)
ABC (ALIGNED BOX CRITERION)
ABC METHOD
ABC improves upon CCC and the Gap Statistic by generating better estimates for Wk*.
ABC uses k reference distributions, one for each tested k (k is number of clusters).
• Data-driven Monte Carlo simulation of reference distribution at each tested k.
• The reference distribution is k uniform hyper boxes aligned with the Principal
Components from the clustering solution of the input data.
[Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution]
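One way to picture the ABC reference distribution is the Python sketch below: for each cluster of the input clustering solution, it draws a uniform sample inside that cluster's bounding box in its own principal-component coordinates. Drawing as many reference points as the cluster has observations, and the function name itself, are assumptions made for illustration; each cluster is assumed to have at least as many points as dimensions.

```python
import numpy as np

def abc_reference(X, labels, rng):
    """Draw one reference sample: a uniform box per cluster, aligned with
    that cluster's principal components."""
    ref = np.empty_like(X, dtype=float)
    for r in np.unique(labels):
        mask = labels == r
        pts = X[mask]
        mean = pts.mean(axis=0)
        # Principal directions of this cluster only
        _, _, vt = np.linalg.svd(pts - mean, full_matrices=False)
        Z = (pts - mean) @ vt.T
        Z_ref = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
        ref[mask] = Z_ref @ vt + mean
    return ref

# Toy data: two blobs with a known assignment (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(8, 1, (80, 2))])
labels = np.repeat([0, 1], 80)
ref = abc_reference(X, labels, rng)
```

Because every box is fitted to one cluster, the reference sample tracks the data much more closely than a single box around all observations, which is the property the slide attributes to ABC.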
ABC METHOD
Why multiple reference distributions?
• The gap statistic performs hypothesis testing between k clusters/no-clusters for the whole input space
• ABC is similar to recursive hypothesis testing between 1 cluster/2 clusters for each of
the k candidate clusters
• More stringent test. It is harder for larger k to pass this test. This is desirable.
[Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution]
ESTIMATING k REFERENCE DISTRIBUTIONS
Sample Data
ESTIMATING k REFERENCE DISTRIBUTIONS
Aligned Box Criterion
ALIGNED BOX CRITERION (ABC)
for k = 1 to K
    Cluster the observations into k groups and compute log Wk
    for b = 1 to B
        Considering each of the k clusters separately,
        compute Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)
        Cluster the M.C. sample into k groups and compute log Wkb
    Compute $ABC(k) = \log W_k^{+} - \log W_k$
    Compute sd(k), the s.d. of {log Wkb}, b = 1,…,B
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\; sd(k)$
Find the smallest k such that $ABC(k) \ge ABC(k+1) - s_{k+1}$
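The final selection step can be sketched in Python as below. The function name is hypothetical and the criterion values are made-up numbers chosen only to show the rule at work; they are not results from this deck.

```python
import numpy as np

def smallest_k(criterion, s):
    """criterion[k-1] and s[k-1] hold the values for k = 1..K.
    Return the smallest k with criterion(k) >= criterion(k+1) - s(k+1),
    or None if no k in 1..K-1 satisfies the rule."""
    for k in range(1, len(criterion)):
        if criterion[k - 1] >= criterion[k] - s[k]:
            return k
    return None

# Hypothetical ABC(k) and s_k values for k = 1..5
abc = np.array([0.10, 0.45, 1.30, 1.25, 1.20])
s = np.array([0.02, 0.03, 0.03, 0.04, 0.04])
k_hat = smallest_k(abc, s)
print(k_hat)
```

The rule stops at the first k whose criterion is not beaten by k+1 by more than one standard error, so it favors smaller k over a marginally higher value further along the curve.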
ABC METHOD
RESULTS
ESTIMATING k REFERENCE DISTRIBUTIONS
Wk* decreases faster.
[Figure panels: Gap Statistic; Aligned Box Criterion]
[Figure panels: Gap Statistic; Aligned Box Criterion]
Clearer maxima.
RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS
• Observations: 7,000
• Variables: 2
• Monte Carlo Replications: 20
[Figure panels: CCC method; ABC method]
ESTIMATING k CLAIMS PREDICTION CHALLENGE DATA
• Anonymized customer data
• 32 customer and product features
• 13,184,290 customer records
ESTIMATING k EXECUTING CALCULATIONS
• Cubic clustering criterion: PROC FASTCLUS
• Gap statistic: R cluster package in the Open Source Integration Node in
SAS Enterprise Miner
• Aligned box criterion: PROC HPCLUS
ESTIMATING k INTERPRETING RESULTS
Cubic Clustering Criterion
Gap Statistic
Aligned Box Criterion
REFERENCE DISTRIBUTION EFFECT OF CHANGING NUMBER OF OBSERVATIONS
• How does the number of observations in the reference distribution affect the result?
• Given n observations in the input dataset, we generated w*n observations in the reference distribution, where w is between 0 and 1
RESULTS SIMPLE CASE
RESULTS DATA SET WITH MORE CLUSTERS
RESULTS DATA SET WITH MORE OBSERVATIONS
RESULTS REAL DATA
Kaggle Claims Prediction Challenge (n = 13,184,290, p = 35), 50 runs
RESULTS SCALABILITY
RESULTS STABILITY
ABC METHOD
FOR PARALLEL AND DISTRIBUTED ARCHITECTURES
PARALLEL ABC PART 1-2
[Diagram: Root connected to Node1, Node2, Node3, …, NodeN]
1) Run k-means clustering (in parallel) for k clusters
2) Assign each observation to a cluster
3) Compute 𝑊𝑘
1) Assign each cluster to a node
2) Collect the XX’ matrix for each cluster
in the assigned node using a tree-based algorithm
3) Do PCA using XX’ matrix
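The collect-then-PCA steps can be sketched in Python by simulating the nodes as chunks of one array. This is an illustration under assumptions: observations are stored as rows, so the accumulated cross-product matrix is p-by-p; the pairwise reduction stands in for the tree-based collection; and all names are hypothetical.

```python
import numpy as np

def local_stats(chunk):
    """Per-node sufficient statistics for one cluster: count, sum, cross-products."""
    return len(chunk), chunk.sum(axis=0), chunk.T @ chunk

def tree_reduce(stats):
    """Pairwise (tree-shaped) reduction of the per-node statistics."""
    while len(stats) > 1:
        merged = []
        for i in range(0, len(stats) - 1, 2):
            a, b = stats[i], stats[i + 1]
            merged.append((a[0] + b[0], a[1] + b[1], a[2] + b[2]))
        if len(stats) % 2:            # odd one out moves up a level unchanged
            merged.append(stats[-1])
        stats = merged
    return stats[0]

def pca_from_stats(n, total, xtx):
    """Eigen-decomposition of the covariance rebuilt from the merged statistics."""
    mean = total / n
    cov = xtx / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return eigvals[order], eigvecs[:, order]

# One cluster's observations spread over 4 simulated nodes (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3)) @ np.diag([3.0, 1.0, 0.3])
node_chunks = np.array_split(X, 4)
n, total, xtx = tree_reduce([local_stats(c) for c in node_chunks])
eigvals, eigvecs = pca_from_stats(n, total, xtx)
```

Only the small per-node statistics travel between nodes, never the observations themselves, which is what makes this step cheap to distribute.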
PARALLEL ABC PART 3-4
[Diagram: Node1, Node2, Node3, …, NodeN]
1) Eigenvectors are broadcast to every node
2) Based on their assigned clusters,
the observations in each node are projected into
the new space
1) Bounding boxes are computed locally at each node for each cluster k
2) Bounding box information from each node is collected at the root
and the root computes the bounding box coordinates for each cluster k
3) This information is distributed to each node and each node generates
reference distributions
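The bounding-box exchange can be sketched in Python as below, again simulating the nodes as chunks of one array of already-projected observations for a single cluster. The function names and the choice to generate as many reference points per node as it holds observations are assumptions for illustration.

```python
import numpy as np

def local_box(Z):
    """Bounding box of the projected observations held by one node."""
    return Z.min(axis=0), Z.max(axis=0)

def merge_boxes(boxes):
    """Root step: combine the per-node boxes into the cluster's global box."""
    lows, highs = zip(*boxes)
    return np.min(lows, axis=0), np.max(highs, axis=0)

# Projected observations of one cluster, split over 4 simulated nodes
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 2))
node_chunks = np.array_split(Z, 4)

low, high = merge_boxes([local_box(c) for c in node_chunks])

# After the global box is distributed, each node draws its share locally
local_refs = [rng.uniform(low, high, size=c.shape) for c in node_chunks]
```

Each node sends only two vectors of length p per cluster, so the communication cost of this step does not grow with the number of observations.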
PARALLEL ABC PART 5
[Diagram: Root connected to Node1, Node2, Node3, …, NodeN]
Run k-means clustering in parallel for the reference distribution and compute Wk+
Do this for B reference distributions
Compute ABC(k)
PARALLEL ABC PART 6
What about the O(n^3) complexity of SVD?
- Computation of XX’ is parallelized
- Or, do stochastic SVD
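The slide does not say which stochastic SVD variant is meant. As one possible reading, the Python sketch below shows a randomized SVD with a Gaussian test matrix and a few power iterations; the function name, the oversampling and iteration defaults, and the test matrix are all assumptions for illustration.

```python
import numpy as np

def randomized_svd(A, rank, rng, oversample=10, n_power_iter=2):
    """Approximate the top `rank` singular triplets of A.
    Requires rank + oversample <= min(A.shape)."""
    n_cols = A.shape[1]
    omega = rng.normal(size=(n_cols, rank + oversample))  # random test matrix
    Y = A @ omega                                         # sample the range of A
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(Y)                            # re-orthonormalize
        Y = A @ (A.T @ Y)                                 # sharpen the spectrum
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A                                           # small projected matrix
    U_small, svals, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :rank], svals[:rank], Vt[:rank]

# Exactly rank-5 test matrix (hypothetical example)
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50))
U, svals, Vt = randomized_svd(A, rank=5, rng=rng)
```

The expensive dense SVD is applied only to the small projected matrix, so the cost is dominated by a handful of matrix products with A, which parallelize well.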
ABC METHOD
CONCLUSION
RESULTS
More accurate reference distributions lead to:
• Better defined maxima.
• Wk* values decreasing rapidly, especially for K > k.
• Exposure of possible alternative solutions.
CONCLUSION
For large, high-dimensional, or noisy data, ABC is found to be:
• Stable
• Scalable
Moreover, it exhibits desirable properties:
• Clearer peaks
• More stringent hypothesis test promotes smaller k values
www.SAS.com
Q&A
THANK YOU

Jorge Silva, Sr. Research Statistician Developer, SAS at MLconf ATL - 9/18/15

Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Recently uploaded

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Jorge Silva, Sr. Research Statistician Developer, SAS at MLconf ATL - 9/18/15

  • 1. Copyright © 2012, SAS Institute Inc. All rights reserved. DETERMINING THE NUMBER OF CLUSTERS IN A DATASET USING ABC. I. KABUL, P. HALL, J. SILVA, W. SARLE, ENTERPRISE MINER R&D, SAS INSTITUTE
  • 2. CLUSTERING: Objects within a cluster are as similar as possible. Objects from different clusters are as dissimilar as possible. Hossein Parsaei
  • 3. CHALLENGES IN CLUSTERING • No prior knowledge • Which similarity measure? • Which clustering algorithm? • How to evaluate the results? • How many clusters? The Aligned Box Criterion (ABC) addresses the unsolved, important problem of determining the number of clusters in a data set. ABC can be applied in market segmentation and many other types of statistical, data mining and machine learning analyses.
  • 4. CONTENTS • Background • Aligned Box Criterion (ABC) Method • Results • ABC Method in Parallel and Distributed Architecture • Conclusions
  • 5. BACKGROUND
  • 6. FINDING THE RIGHT NUMBER OF CLUSTERS • Many methods have been proposed: • Calinski-Harabasz index [Calinski 1974] • Cubic clustering criterion (CCC) [Sarle 1983] • Silhouette statistic [Rousseeuw 1987] • Gap statistic [Tibshirani 2001] • Jump method [Sugar 2003] • Prediction strength [Tibshirani 2005] • Dirichlet process [Teh 2006]
  • 7. WITHIN CLUSTER SUM OF SQUARES • A good clustering yields clusters where observations have small within-cluster sum-of-squares (and high between-cluster sum-of-squares). • Low values when the partition is good, BUT these are by construction monotone nonincreasing (within-cluster dissimilarity always decreases with more clusters). $D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$, $W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$. Within-cluster SSE: measure of compactness of clusters.
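The two forms of the within-cluster sum of squares on slide 7 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, points are plain tuples, and cluster labels are assumed to be given. It computes $W_k$ once from centroid distances and once from the pairwise sums $D_r$.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def group_by_label(points, labels):
    """Collect the points of each cluster C_r."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return clusters


def wk_from_centroids(points, labels):
    """W_k as the sum of squared distances to each cluster mean."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        n_r = len(members)
        centroid = [sum(coord) / n_r for coord in zip(*members)]
        total += sum(sq_dist(p, centroid) for p in members)
    return total


def wk_from_pairs(points, labels):
    """W_k as sum over clusters of D_r / (2 n_r), D_r over ordered pairs."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        d_r = sum(sq_dist(a, b) for a in members for b in members)
        total += d_r / (2 * len(members))
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
print(wk_from_centroids(points, labels), wk_from_pairs(points, labels))
```

Both functions return the same value for any labeling, which is the identity $D_r = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$ stated on the slide.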
  • 8. BACKGROUND: USING WK TO DETERMINE # OF CLUSTERS Elbow method (L-curve method). Idea: use the k corresponding to the “elbow”. Problem: there is no reference clustering to compare against, and the differences $W_k - W_{k+1}$ are not normalized for comparison.
  • 9. BACKGROUND: REFERENCE DISTRIBUTIONS • Cubic Clustering Criterion (CCC), Gap Statistic and ABC amplify the elbow phenomenon by using differences between the within-cluster sum of squares of a clustering solution in the training data (Wk) and a clustering solution in a reference distribution (Wk*). • Aligned box criterion (ABC) • Gap statistic • Cubic clustering criterion (CCC) Reference distribution complexity. Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983. Gap Statistic: Tibshirani et al, J.R. Statist. Soc., 2001.
  • 10. CCC METHOD Instead of using $W_k$ directly, CCC uses $R^2$: $R^2 = 1 - \frac{\operatorname{Trace}(W)}{\operatorname{Trace}(T)}$, where $\operatorname{Trace}(W) = W_k$. For CCC calculation, $R^2$ and $E(R^2)$ are approximated by heuristic formulas: $CCC = \log\left[\frac{1 - E(R^2)}{1 - R^2}\right] \cdot \frac{\sqrt{n p^* / 2}}{(0.001 + E(R^2))^{1.2}}$. Derived from numerous Monte Carlo simulations to generate one hyper-cube reference distribution based on the dimensions of the given training dataset to test all k of interest. Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983.
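As a small worked form of the CCC expression on slide 10, the sketch below only evaluates the final formula. It assumes that $R^2$, $E(R^2)$ and $p^*$ have already been obtained elsewhere (the heuristic formulas that produce them are not reproduced here), that "log" is the natural logarithm, and the function name `ccc` is hypothetical.

```python
import math


def ccc(r2, expected_r2, n, p_star):
    """Evaluate CCC = log[(1 - E(R^2)) / (1 - R^2)] * sqrt(n p* / 2) / (0.001 + E(R^2))^1.2.

    r2, expected_r2 and p_star are assumed inputs; this function does not
    compute the heuristic approximations themselves.
    """
    log_ratio = math.log((1.0 - expected_r2) / (1.0 - r2))
    scale = math.sqrt(n * p_star / 2.0) / (0.001 + expected_r2) ** 1.2
    return log_ratio * scale


# CCC is positive when the observed R^2 exceeds the expected R^2.
print(ccc(r2=0.8, expected_r2=0.5, n=100, p_star=2))
```

The sign of the result depends only on whether the observed $R^2$ is above or below its expectation under the reference distribution; the remaining factor rescales that difference.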
  • 11. GAP STATISTICS METHOD The Gap Statistic computes the (log) ratio $W_k^* / W_k$: $\mathrm{Gap}(k) = \log W_k^* - \log W_k$. $W_k^*$ is calculated from a clustering solution in the reference distribution. Finds the k that maximizes Gap(k) (within some tolerance).
  • 12. TWO TYPES OF UNIFORM DISTRIBUTIONS 1. Align with feature axes (data-geometry independent). Figure panels: Observations; Bounding Box (aligned with feature axes); Monte Carlo Simulations.
  • 13. TWO TYPES OF UNIFORM DISTRIBUTIONS 2. Align with principal axes (data-geometry dependent). Figure panels: Observations; Bounding Box (aligned with principal axes); Monte Carlo Simulations.
  • 14. COMPUTATION OF THE GAP STATISTIC
      for b = 1 to B
          Compute Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)
      for k = 1 to K
          Cluster the observations into k groups and compute $\log W_k$
          for b = 1 to B
              Cluster the M.C. sample into k groups and compute $\log W_{kb}$
          Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
          Compute sd(k), the standard deviation of $\{\log W_{kb}\}_{b=1,\dots,B}$
          Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
      Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
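The last steps of the procedure on slide 14 (gap value, standard error, and the "smallest k" rule) can be sketched as follows. This is a minimal illustration that assumes the clustering has already been run, so the inputs are lists of $\log W_k$ values; the function names are hypothetical, the population standard deviation is used for sd(k), and returning K when no k satisfies the rule is a fallback chosen here, not something stated on the slide.

```python
import math
from statistics import mean, pstdev


def gap_curve(log_w, log_w_ref):
    """Return (Gap(k), s_k) for k = 1..K.

    log_w[k-1] is log W_k on the data; log_w_ref[k-1] is the list of
    log W_kb over the B Monte Carlo reference samples.
    """
    gaps, errors = [], []
    for observed, reference in zip(log_w, log_w_ref):
        b = len(reference)
        gaps.append(mean(reference) - observed)
        errors.append(pstdev(reference) * math.sqrt(1.0 + 1.0 / b))
    return gaps, errors


def choose_k(gaps, errors):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; K if none qualifies."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1  # lists are 0-based, k is 1-based
    return len(gaps)


log_w = [5.0, 3.0, 2.8, 2.7]
log_w_ref = [[5.1, 5.1], [4.9, 4.9], [4.5, 4.5], [4.2, 4.2]]
gaps, errors = gap_curve(log_w, log_w_ref)
print(choose_k(gaps, errors))
```

In this toy input the gap jumps between k = 1 and k = 2 and then declines, so the rule stops at k = 2.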
  • 15. GAP STATISTIC
  • 16. NO-CLUSTER EXAMPLE (JOURNAL VERSION)
  • 17. ABC (ALIGNED BOX CRITERION)
  • 18. ABC METHOD ABC improves upon CCC and the Gap Statistic by generating better estimates for Wk*. ABC uses k reference distributions, one for each tested k (k is the number of clusters). • Data-driven Monte Carlo simulation of the reference distribution at each tested k. • The reference distribution is k uniform hyper-boxes aligned with the principal components from the clustering solution of the input data. Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution.
  • 19. ABC METHOD Why multiple reference distributions? • The Gap Statistic performs hypothesis testing between k clusters/no clusters for the whole input space. • ABC is similar to recursive hypothesis testing between 1 cluster/2 clusters for each of the k candidate clusters. • More stringent test: it is harder for larger k to pass this test. This is desirable. Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution.
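The reference distribution described on slides 18 and 19 (one uniform box per cluster, aligned with that cluster's principal components) might be generated along the following lines. This is a sketch under stated assumptions, not the PROC HPCLUS implementation: it uses NumPy, obtains principal axes from an SVD of the centered cluster, gives each box as many points as its cluster has, and the names `aligned_box_sample` and `abc_reference` are hypothetical.

```python
import numpy as np


def aligned_box_sample(cluster_points, n_samples, rng):
    """Sample uniformly from one cluster's bounding box in principal-axis space."""
    center = cluster_points.mean(axis=0)
    centered = cluster_points - center
    # Rows of vt are the principal axes of this cluster.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    box = rng.uniform(lo, hi, size=(n_samples, projected.shape[1]))
    # Rotate the box sample back into the original feature space.
    return box @ vt + center


def abc_reference(points, labels, rng):
    """Build one reference data set: one aligned box per cluster."""
    parts = []
    for label in np.unique(labels):
        members = points[labels == label]
        parts.append(aligned_box_sample(members, len(members), rng))
    return np.vstack(parts)


rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(8.0, 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)
reference = abc_reference(points, labels, rng)
print(reference.shape)
```

Clustering `reference` into k groups and computing its within-cluster sum of squares would then give one Monte Carlo value of the reference quantity for that k.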
  • 20. ESTIMATING k REFERENCE DISTRIBUTIONS Sample Data
  • 21. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 22. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 23. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 24. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 25. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 26. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 27. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 28. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 29. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 30. ESTIMATING k REFERENCE DISTRIBUTIONS Aligned Box Criterion
  • 31. ALIGNED BOX CRITERION (ABC)
      for k = 1 to K
          Cluster the observations into k groups and compute $\log W_k$
          for b = 1 to B
              Considering each cluster separately, compute Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)
              Cluster the M.C. sample into k groups and compute $\log W_{kb}$
          Compute $ABC(k) = \log W_k^+ - \log W_k$
          Compute sd(k), the s.d. of $\{\log W_{kb}\}_{b=1,\dots,B}$
          Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
      Find the smallest k such that $ABC(k) \ge ABC(k+1) - s_{k+1}$
  • 32. ABC METHOD RESULTS
  • 33. ESTIMATING k REFERENCE DISTRIBUTIONS Wk* decreases faster. Figure panels: Gap Statistic; Aligned Box Criterion.
  • 34. ESTIMATING k REFERENCE DISTRIBUTIONS Clearer maxima. Figure panels: Gap Statistic; Aligned Box Criterion.
  • 35. RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS
  • 36. RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS • Observations: 7,000 • Variables: 2 • Monte Carlo Replications: 20. Figure panels: CCC method; ABC method.
  • 37. RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS
  • 38. ESTIMATING k CLAIMS PREDICTION CHALLENGE DATA • Anonymized customer data • 32 customer and product features • 13,184,290 customer records
  • 39. ESTIMATING k EXECUTING CALCULATIONS • Cubic clustering criterion: PROC FASTCLUS • Gap statistic: R cluster package in the Open Source Integration Node in SAS Enterprise Miner • Aligned box criterion: PROC HPCLUS
  • 40. ESTIMATING k INTERPRETING RESULTS Cubic Clustering Criterion
  • 41. ESTIMATING k INTERPRETING RESULTS Gap Statistic
  • 42. ESTIMATING k INTERPRETING RESULTS Aligned Box Criterion
  • 43. REFERENCE DISTRIBUTION: EFFECT OF CHANGING NUMBER OF OBSERVATIONS • How the number of observations in the reference distribution affects the result • Given n observations in the input dataset, we generated w*n observations in the reference distribution, where w is between 0 and 1
  • 44. RESULTS SIMPLE CASE
  • 45. RESULTS DATA SET WITH MORE CLUSTERS
  • 46. RESULTS DATA SET WITH MORE OBSERVATIONS
  • 47. RESULTS REAL DATA Kaggle Claims Prediction Challenge (n = 13,184,290, p = 35), 50 runs
  • 48. RESULTS SCALABILITY
  • 49. RESULTS STABILITY
  • 50. ABC METHOD FOR PARALLEL AND DISTRIBUTED ARCHITECTURES
  • 51. PARALLEL ABC PART 1-2 Part 1: 1) Run k-means clustering (in parallel) for k clusters 2) Assign each observation to a cluster 3) Compute $W_k$. Part 2: 1) Assign each cluster to a node 2) Collect the XX′ matrix for each cluster in the assigned node using a tree-based algorithm 3) Do PCA using the XX′ matrix. Diagrams: Root with Node1, Node2, Node3, …, NodeN.
  • 52. PARALLEL ABC PART 3-4 Part 3: 1) Eigenvectors are broadcast to every node 2) Based on their assigned clusters, the observations in each node are projected into the new space. Part 4: 1) Bounding boxes are computed locally at each node for each cluster k 2) Bounding box information from each node is collected at the root, and the root computes the bounding box coordinates for each cluster k 3) This information is distributed to each node, and each node generates reference distributions. Diagrams: Root with Node1, Node2, Node3, …, NodeN.
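The bounding-box step in Part 4 is a simple reduction: each node reports per-dimension minima and maxima, and the root merges them. The sketch below shows that reduction for one cluster in plain Python; the nodes are simulated as lists of already-projected rows, and the function names are hypothetical.

```python
def local_box(rows):
    """Per-dimension (min, max) of the projected rows held by one node."""
    lo = [min(column) for column in zip(*rows)]
    hi = [max(column) for column in zip(*rows)]
    return lo, hi


def merge_boxes(boxes):
    """Root-side merge: the global box is the min of mins and max of maxes."""
    lows, highs = zip(*boxes)
    lo = [min(column) for column in zip(*lows)]
    hi = [max(column) for column in zip(*highs)]
    return lo, hi


# Two simulated nodes holding rows of the same cluster.
node_a = [(0.0, 5.0), (1.0, 2.0)]
node_b = [(-3.0, 4.0), (0.5, 9.0)]
global_box = merge_boxes([local_box(node_a), local_box(node_b)])
print(global_box)
```

Only two vectors per cluster travel from each node to the root, so the communication cost of this step does not grow with the number of observations.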
  • 53. PARALLEL ABC PART 5 Run k-means clustering in parallel for the reference distribution and compute $W_k^+$. Do this for B reference distributions. Compute ABC for cluster k. Diagram: Root with Node1, Node2, Node3, …, NodeN.
  • 54. PARALLEL ABC PART 6 What about the O(n^3) complexity of SVD? - Computation of XX′ is parallelized - Or, do stochastic SVD
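One way to read "computation of XX′ is parallelized" is that each node accumulates a small cross-product matrix over its own rows and the root sums them before a single eigendecomposition. The sketch below illustrates that reading with NumPy. It assumes rows are observations (so the accumulated matrix is p × p), simulates nodes by splitting an array, and uses hypothetical function names; it is not the tree-based collection algorithm mentioned on slide 51.

```python
import numpy as np


def node_summary(chunk):
    """What one node sends: row count, column sums, and its cross-product matrix."""
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk


def combine(summaries):
    """Root-side reduction of node summaries into a mean and covariance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    gram = sum(s[2] for s in summaries)
    mean = total / n
    cov = gram / n - np.outer(mean, mean)
    return mean, cov


def principal_axes(cov):
    """Eigenvalues (descending) and matching eigenvectors of a covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


rng = np.random.default_rng(1)
data = rng.normal(size=(100, 3))
chunks = np.array_split(data, 4)  # stand-in for four nodes
mean, cov = combine([node_summary(c) for c in chunks])
eigvals, eigvecs = principal_axes(cov)
print(eigvals)
```

With this arrangement the eigendecomposition acts on a p × p matrix, so its cost depends on the number of variables rather than on the number of observations.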
  • 55. ABC METHOD CONCLUSION
  • 56. RESULTS More accurate reference distributions lead to: • Better defined maxima. • Wk* values decreasing rapidly, especially for K > k. • Exposure of possible alternative solutions.
  • 57. CONCLUSION For large, high-dimensional or noisy data, ABC is found to be: • Stable • Scalable. Moreover, it exhibits desirable properties: • Clearer peaks • A more stringent hypothesis test that promotes smaller k values
  • 58. www.SAS.com Q&A THANK YOU

Editor's Notes

  1. Within cluster sum of squares
  2. use the k corresponding to the “elbow” (the most significant increase in goodness-of-fit)
  3. SAS put forth the proprietary measure Cubic Clustering Criterion (CCC) as its primary method to estimate k. CCC uses the difference between the actual (Wk) and expected (Wk*) Within Cluster Sum of Squares over a range of possible values for k to suggest a best k. While Wk is calculated from a k-cluster solution in the training dataset, Wk* must be calculated from a k-cluster solution in a generated reference distribution. As direct simulation was computationally expensive when CCC was first developed, the technique employs a heuristic formula derived from numerous Monte Carlo simulations to generate one hyper-cube reference distribution based on the dimensions of the given training dataset to test all k of interest. Despite the intrinsic shortcomings of heuristic approximation, CCC remains perhaps the best method for estimating k, with only one meaningful improvement being proposed since its introduction: using Monte Carlo simulation, directly instead of heuristically, to generate a hyper-cube reference distribution. Tibshirani et al, Estimating the number of clusters in a dataset via the Gap Statistic, J.R. Statist. Soc. B 63, Oxford, UK: Wiley-Blackwell, 2001. 12 pp.
  4. Null hypothesis: reference distribution. Normalize the curve log Wk vs. k.
  5. Error-tolerant normalized elbow!
  6. To generate more realistic null hypothesis values, ABC performs a more precise test at each k than CCC does. Instead of comparing a k-cluster solution in training data to a k-cluster solution in an approximated hyper-cube, ABC compares a k-cluster solution in training data to a k-cluster solution in a data-adaptive reference distribution composed of k hyper-cubes whose dimensions change based on the training data, the clustering algorithm, and k. Such descriptive reference distributions allow for enhanced detection of differences between Wk and Wk*, which in turn leads to more accurate determinations of k.
  7. To generate more realistic null hypothesis values, ABC performs a more precise test at each k than CCC does. Instead of comparing a k-cluster solution in training data to a k-cluster solution in an approximated hyper-cube, ABC compares a k-cluster solution in training data to a k-cluster solution in a data-adaptive reference distribution composed of k hyper-cubes whose dimensions change based on the training data, the clustering algorithm, and k. Such descriptive reference distributions allow for enhanced detection of differences between Wk and Wk*, which in turn leads to more accurate determinations of k.
  8. Discuss reference distributions first and then example.
  9. Harder to reject the null hypothesis of no clusters – the clustering solution in the training data has to be better than the clustering solution in this reference distribution. Which it is just slightly … Probably because of the boxy shape of the reference distribution.
  10. Error-tolerant normalized elbow! Wk+ is calculated from a clustering solution in the reference distribution. The difference between ABC and competing techniques is the reference distribution.
  11. Makes for an easier to interpret solution
  12. Makes for an easier to interpret solution
  13. Now we are discussing an example.
  14. Mention data prep Show R code in EM
  15. Point out 3 and 9
  16. Point out 3 and 9
  17. Point out 2, 4 and 9. So something between 2, 3, 4 and at 9.