Log Analytics in Data Center with Apache Spark and Machine Learning
DataMass Summit 2017
Agnieszka Potulska
Intel Technology Poland
agnieszka.potulska@intel.com
Piotr Tylenda
Intel Technology Poland
piotr.tylenda@intel.com
© 2017 Intel Corporation 2
Agenda
1. What problem would we like to solve?
2. Data pipeline and key components
3. Cluster analysis
4. TF-IDF, word2vec
5. k-means algorithm
6. PySpark example
7. Clustering visualization sneak-peek
8. Lessons learned
Source: https://www.iconsfind.com/20151011/check-checklist-document-list-menu-todo-todo-list-icon/
Problem statement
Workload - stimulus applied to the
observed target, with predefined
actions and observable parameters.
In other words - actions that we execute
on the specified server.
1. Need for workload log failure
information management
2. Duplication of work – engineers
analyzing similar problems
independently
Standard workflow: Workload execution → Logs collection → Expert analysis → Decision
Machine Learning workflow: Workload execution → Logs collection pipeline → Machine learning analysis → Automated decision
Log Collection & Analysis
[Architecture diagram; labelled components: Workload scheduler, Metadata, Machine Learning, Full Text Search, Clusters]
* Other names and brands may be claimed as the property of others.
Key Components
• Apache Kafka*
– Enables publishing data from numerous producers
– Logs are streamed as small messages in real time
• Apache Spark Streaming*
– Feeds data to HDFS* in micro-batches
• ELK* stack
– Full-text search and visualization of data in cluster
• Apache Spark*
– Machine learning batch processing
• Apache Zeppelin*
– Web-based workbook repository for data scientists
Cluster Analysis
• Workload logs can be treated as text documents –
there are suitable clustering algorithms!
• The objective is to group the dataset into clusters.
• Objects assigned to the same cluster are more similar (using a predefined
similarity measure) to each other than to objects in other clusters.
• Major technique used in exploratory data mining.
• Unsupervised machine learning.
Cluster Analysis – s1 Dataset Example
Dataset: P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006
Cluster Analysis Algorithms
• Hierarchical clustering (e.g. single-linkage clustering)
• Density-based clustering (e.g. DBSCAN)
• Distribution-based clustering (e.g. EM algorithm)
• Centroid-based clustering (methods derived from the k-means algorithm)
Source: https://upload.wikimedia.org/wikipedia/commons/1/12/Iris_dendrogram.png
Log Data Machine Learning Steps
Filtering and Stopwords Removal
Tokenization
TF-IDF / word2vec Conversion
Normalization (optional)
k-means Clustering
Feature Vectorization
How to represent texts and words as vectors?
Vector Space Model
         DOC1   DOC2   DOC3
home     14     19     45
stop     9      0      0
event    0      32     4

Documents are represented as vectors (the columns); each dimension of the vector space corresponds to a word (the rows).
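A minimal sketch of the vector space model in plain Python. The three toy documents and the helper name `term_count_vectors` are illustrative assumptions, not taken from the slides. Each document becomes a vector of term counts over a shared, sorted vocabulary.

```python
from collections import Counter


def term_count_vectors(documents):
    """Map each document (a string) to a term-count vector over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # One dimension per distinct word, in a fixed (sorted) order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical toy documents, chosen only for illustration.
docs = ["home stop home", "event home event", "home"]
vocabulary, vectors = term_count_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

A word missing from a document simply gets a count of 0, which is why such vectors tend to be sparse for real log data.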
Term Frequency – Inverse Document Frequency
• Widely used in text mining, search and classification tasks.
• Adds weights to the document vectors that reflect the importance of each term.
W      TF (D1)   TF (D2)   IDF         TF-IDF (D1)   TF-IDF (D2)
I      1         1         log(3/3)    0             0
like   1         1         log(3/3)    0             0
red    1         1         log(3/3)    0             0
do     0         1         log(3/2)    0             0.176
not    0         1         log(3/2)    0             0.176
D1: I like red.
D2: I do not like red.
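The numbers in the table can be reproduced with a short sketch. It assumes the smoothed form idf = log10((N + 1) / (df + 1)), where N is the number of documents and df is the number of documents containing the term. That reading is inferred from the values log(3/3) and log(3/2) for N = 2 and is not stated on the slide. The function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(documents):
    """Return {doc_name: {term: tf-idf}} using idf = log10((N + 1) / (df + 1))."""
    tokenized = {
        name: text.lower().rstrip(".").split() for name, text in documents.items()
    }
    n_docs = len(tokenized)
    vocabulary = {term for tokens in tokenized.values() for term in tokens}
    # Document frequency: in how many documents does each term occur?
    df = {
        term: sum(term in tokens for tokens in tokenized.values())
        for term in vocabulary
    }
    idf = {term: math.log10((n_docs + 1) / (df[term] + 1)) for term in vocabulary}
    return {
        name: {term: tokens.count(term) * idf[term] for term in vocabulary}
        for name, tokens in tokenized.items()
    }


weights = tf_idf({"D1": "I like red.", "D2": "I do not like red."})
for name in ("D1", "D2"):
    print(name, {t: round(w, 3) for t, w in sorted(weights[name].items())})
```

Terms that occur in every document ("I", "like", "red") get weight 0, so only "do" and "not" distinguish D2 from D1.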
Word2vec
• Developed by Mikolov et al. (Google 2013)*
• Word Embedding – produces vector
representation of words.
• Which words occurred near other words?
• Spark ML implementation of word2vec
supports cosine distance as similarity
measure – produced better results in our use
case.
• Provides dimensionality reduction – more
suitable for Spark k-means.
[Figure: example word embedding showing the words apple, banana, orange, bicycle, book, notepad]
*Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space"
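Cosine similarity between word vectors can be sketched as follows. The 3-dimensional embeddings are invented for illustration only. Real word2vec vectors are learned from the corpus and typically have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, not produced by any trained model.
embeddings = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "bicycle": [0.1, 0.9, 0.2],
}

fruit_similarity = cosine_similarity(embeddings["apple"], embeddings["banana"])
mixed_similarity = cosine_similarity(embeddings["apple"], embeddings["bicycle"])
print(fruit_similarity, mixed_similarity)
```

With these toy values the two fruit words score higher than the fruit and vehicle pair, which is the behaviour one hopes a trained embedding shows for words used in similar contexts.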
k-means Algorithm
• Clustering algorithm that groups n objects (points) into k groups, where k is a predefined parameter.
• Each object is assigned to the group whose centroid is closest (most similar) to it.
• k-means defines a whole family of algorithms, such as k-medians, k-medoids or c-means.
k-means Problem Definition
Given a dataset of $n$ objects $X = \{x_1, x_2, x_3, \dots, x_n\}$, where each object is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$).
The k-means problem is then defined as: divide the dataset $X$ into $k$ groups (clusters) $C = \{C_1, C_2, C_3, \dots, C_k\}$ in such a way that the within-set sum of squares is minimized:
$$E(C) = \sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, \bar{C}_i)$$
where $\bar{C}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the centroid of a given group and $d(x, y)$ is a distance function between $x$ and $y$.
This is an NP-hard problem.
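The objective can be evaluated for any given clustering. The sketch below assumes Euclidean distance for the distance function. The function names and the 2D sample points are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dimension = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dimension)]


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def within_set_sum_of_squares(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(squared_euclidean(x, center) for x in points)
    return total


# Hypothetical 2D clustering with k = 2.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],  # centroid (1, 0): contributes 1 + 1
    [[5.0, 5.0], [5.0, 7.0]],  # centroid (5, 6): contributes 1 + 1
]
cost = within_set_sum_of_squares(clusters)
print(cost)
```

Evaluating the objective is cheap. The hard part is searching over all possible partitions, which is why a heuristic is used in practice.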
Heuristic Solution – k-means Algorithm
Input: $X$ (dataset), $k$ (expected number of clusters)
Output: $C = \{C_1, C_2, C_3, \dots, C_k\}$ (set of clusters)

begin
    initialization*: divide dataset $X$ into $k$ random, exclusive groups;
    do
        foreach group, compute its $L_2$-norm* centroid;
        foreach object in dataset, assign it to the closest* group (using centroids);
    while any $C_i$ group assignment changed;
end
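The pseudocode above can be sketched in plain Python as follows. This is an illustrative implementation, not the one used in Spark. It assumes Euclidean distance, at least k objects in the dataset, and a random-partition initialization. The names `k_means`, `seed` and `max_iter` are hypothetical.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]


def k_means(dataset, k, seed=0, max_iter=100):
    """Heuristic k-means iteration; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Initialization: random exclusive groups. Every label is used at least
    # once so that no group starts empty.
    assignment = list(range(k)) + [rng.randrange(k) for _ in dataset[k:]]
    rng.shuffle(assignment)
    centroids = [None] * k
    for _ in range(max_iter):
        # Compute the centroid of every group.
        for group in range(k):
            members = [x for x, g in zip(dataset, assignment) if g == group]
            if members:  # a group that became empty keeps its previous centroid
                centroids[group] = mean(members)
        # Assign every object to the group with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda g: squared_distance(x, centroids[g]))
            for x in dataset
        ]
        if new_assignment == assignment:  # no assignment changed: converged
            break
        assignment = new_assignment
    return assignment, centroids


# Hypothetical 2D points forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

Keeping the previous centroid for an emptied group is one of several possible choices. Re-seeding the empty group from a random point is another.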
k-means Algorithm – Example (1)
Let's consider the following 2D points dataset.
Visualization based on: https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
k-means Algorithm – Example (2)
The dataset will be divided into 4 clusters (k=4); Euclidean distance will be used as the similarity measure.
k-means Algorithm – Example (3)
Algorithm is initialized by choosing 4 center points randomly. The initial clustering will be random.
k-means Algorithm – Example (4)
Each point is assigned to the closest center using Euclidean distance. This determines
the initial group assignment.
k-means Algorithm – Example (5)
The new group assignment determines new centroid positions.
Each point is assigned to the closest centroid again.
k-means Algorithm – Example (6)
These steps are repeated until the algorithm converges,
i.e. no points are assigned to a different cluster.
k-means Algorithm – Example (7)
Some points have been reassigned again...
k-means Algorithm – Example (8)
And again...
k-means Algorithm – Example (9)
In the final iteration, the new positions of centroids have been calculated
and no point has been assigned to a different group. DONE!
k-means Algorithm – Initialization
• The most important part of the k-means algorithm.
• Initialization determines how the algorithm will converge.
• Different initializations give different clusterings.
• Simple approaches: Random, Forgy, MacQueen, Kaufman.
• Advanced approaches*: k-means++ and k-means||
*Source: David Arthur and Sergei Vassilvitskii, "k-means++: The Advantages of Careful Seeding" (2007)
[Figures: Random method; Forgy method]
k-means++ Initialization
1. Choose the first center ("seed") $c_1$ from dataset $X$ uniformly at random.
2. Choose the next center $c_i$ by choosing $x \in X$ with probability
$$\frac{d(x, c(x))^2}{\sum_{x' \in X} d(x', c(x'))^2}$$
where $c(x)$ is the closest already chosen center for $x$ ("$d^2$-weighting").
3. Repeat step 2 until $k$ initial centers are selected.
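The seeding steps above can be sketched as follows, assuming Euclidean distance and a dataset with at least k distinct points. The function name `k_means_pp_seeds` and the `seed` parameter are hypothetical. This illustrates the sampling rule only and is not the distributed k-means|| variant used by Spark.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means_pp_seeds(dataset, k, seed=0):
    """Pick k initial centers from the dataset using d^2-weighting."""
    rng = random.Random(seed)
    # Step 1: the first center is drawn uniformly from the dataset.
    centers = [rng.choice(dataset)]
    while len(centers) < k:
        # Squared distance from every point to its closest chosen center.
        weights = [min(squared_distance(x, c) for c in centers) for x in dataset]
        # Step 2: draw the next center with probability proportional to the weight.
        centers.append(rng.choices(dataset, weights=weights, k=1)[0])
    return centers


# Hypothetical 2D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
seeds = k_means_pp_seeds(points, k=3, seed=1)
print(seeds)
```

A point that is already a center has weight 0 and cannot be drawn again, while points far from all current centers are the most likely next picks.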
Example (PySpark)
import pprint

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

k = 25 # Number of clusters
# sqlContext is assumed to be provided by the environment (e.g. the PySpark shell)
df = sqlContext.read.parquet("data.parquet") # Input dataframe, 'content' column defines a single document
tokenizer = RegexTokenizer(inputCol="content", outputCol="words", gaps=True, pattern="\\W") # Word tokenization by splitting on non-word characters
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered_words") # Standard stopwords removal
word2vec = Word2Vec(vectorSize=100, minCount=5, windowSize=10, maxIter=2, inputCol=remover.getOutputCol(), outputCol="features")
kmeans = KMeans(k=k, predictionCol="prediction", initMode="k-means||", initSteps=10, tol=1e-7, maxIter=600)
pipeline = Pipeline(stages=[tokenizer, remover, word2vec, kmeans])
print("Starting k-means... (k={0})".format(k))
model = pipeline.fit(df)
clustering_df = model.transform(df) # Column 'prediction' contains cluster number for each document
kmeans_model = model.stages[-1]
print("Cluster centers: {0}".format(pprint.pformat(kmeans_model.clusterCenters())))
# computeCost is the Spark 2.x API (deprecated since Spark 2.4)
print("Within set sum of squared errors ({0}) = {1}".format(k, kmeans_model.computeCost(clustering_df)))
Cluster Validation - WSSSE
• k-means minimizes the within-set sum of squared errors (WSSSE).
• The simplest internal cluster validation index (no external labelling needed).
• Decreases monotonically with the number of detected clusters.
• Can be used to choose the number of clusters (𝑘 parameter) with the "elbow method".
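One way to automate the elbow method is to look for the k at which the WSSSE curve bends most sharply. The sketch below uses the largest second difference as that criterion. This is one heuristic among several and is not claimed to be the authors' method. The WSSSE values are invented for illustration. In practice they would come from fitting k-means once per candidate k.

```python
def elbow_k(costs_by_k):
    """Return the k with the largest second difference of the cost curve.

    costs_by_k maps consecutive integer k values to WSSSE values.
    """
    ks = sorted(costs_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = costs_by_k[prev_k] - 2 * costs_by_k[k] + costs_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical WSSSE values for k = 1..6.
wssse = {1: 1000.0, 2: 600.0, 3: 250.0, 4: 220.0, 5: 200.0, 6: 185.0}
chosen_k = elbow_k(wssse)
print(chosen_k)
```

Because WSSSE keeps decreasing as k grows, the smallest value is not informative on its own. The elbow marks the point after which adding clusters yields little further improvement.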
Interactive Datacenter Log Clustering Visualization
Timeframe 72h
Workload servers 71
Raw log data 127 GB
Log messages 172 million
Clusters 56
Lessons learned
INFO: Error happened. Reboot platform
ERROR: This is a debug message
WARN: Critical error occurred
• Efficient logging requires consistency.
• Spark ML k-means implementation supports only Euclidean distance – currently no support for cosine similarity.
• Document clustering is very sensitive to data preprocessing quality.
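A common workaround for a Euclidean-only k-means, which the slides do not describe, is to scale every vector to unit length first. For unit vectors the squared Euclidean distance equals 2 − 2·cos(u, v), so ranking by Euclidean distance matches ranking by cosine similarity. The sketch below only checks that identity numerically on made-up vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical feature vectors.
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
left = squared_euclidean(u, v)
right = 2 - 2 * cosine_similarity(u, v)
print(left, right)
```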
Backup
Cluster Analysis – Use Cases
• Data exploration
• Statistical data analysis
• Recommender systems
• Text mining
• Pattern recognition
• Image segmentation and analysis
• Bioinformatics
• Medicine
• Market research
Clustering Validation – ℱ1-score
• External clustering validation index (requires external labelling).
• Set-overlapping based measure.
• ℱ1-score for a cluster $C_j$ with respect to an external classification $V_i$ is defined as the harmonic mean of precision and recall:
$$\mathcal{F}_1(V_i, C_j) = \frac{2}{\frac{1}{\mathit{precision}} + \frac{1}{\mathit{recall}}}$$
• Micro-averaged ℱ1-score for clustering $C = \{C_1, C_2, C_3, \dots, C_k\}$ and classification $V = \{V_1, V_2, V_3, \dots, V_m\}$ is then defined as:
$$\mathcal{F}_1(V, C) = \sum_{i=1}^{m} \frac{|V_i|}{n} \max_{1 \le j \le k} \mathcal{F}_1(V_i, C_j)$$
Source: https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg (CC BY-SA 4.0)
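A small sketch of the micro-averaged score. It assumes the usual set-overlap definitions, which the slide does not spell out: precision is the overlap divided by the cluster size, and recall is the overlap divided by the class size. It also assumes the external classes partition the dataset, so n is the sum of the class sizes. The function names and the sample labelling are hypothetical.

```python
def f1(reference, cluster):
    """F1 of one cluster against one reference class, both given as sets of object ids."""
    overlap = len(reference & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 / (1 / precision + 1 / recall)


def micro_averaged_f1(classes, clusters):
    """Weight the best-matching cluster's F1 for each class by the class's share of objects."""
    n = sum(len(v) for v in classes)
    return sum(len(v) / n * max(f1(v, c) for c in clusters) for v in classes)


# Hypothetical labelling of six objects (ids 0..5).
classes = [{0, 1, 2}, {3, 4, 5}]
clusters = [{0, 1}, {2, 3, 4, 5}]
score = micro_averaged_f1(classes, clusters)
print(score)
```

A clustering that reproduces the external classes exactly scores 1.0. The example above scores lower because object 2 lands in the wrong cluster.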

 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 

Log Analytics in Datacenter with Apache Spark and Machine Learning

  • 1. Log Analytics in Data Center with Apache Spark and Machine Learning. DataMass Summit 2017. Agnieszka Potulska, Intel Technology Poland, agnieszka.potulska@intel.com. Piotr Tylenda, Intel Technology Poland, piotr.tylenda@intel.com.
  • 2. © 2017 Intel Corporation. Agenda: 1. What problem would we like to solve? 2. Data pipeline and key components. 3. Cluster analysis. 4. TF-IDF, word2vec. 5. k-means algorithm. 6. PySpark example. 7. Clustering visualization sneak peek. 8. Lessons learned. Source: https://www.iconsfind.com/20151011/check-checklist-document-list-menu-todo-todo-list-icon/
  • 3. Problem statement. Workload: a stimulus applied to the observed target, with predefined actions and observable parameters; in other words, the actions that we execute on the specified server. Problems: 1. Failure information from workload logs needs to be managed. 2. Work is duplicated: engineers analyze similar problems independently.
  • 4. Standard workflow: Workload execution → Logs collection → Expert analysis → Decision. Machine Learning workflow: Workload execution → Logs collection pipeline → Machine learning analysis → Automated decision.
  • 5. Log Collection & Analysis (architecture diagram; labels: Workload scheduler, Metadata, Full Text Search, Machine Learning, Clusters). * Other names and brands may be claimed as the property of others.
  • 6. Key Components. Apache Kafka*: enables publishing data from numerous producers; logs are streamed as small messages in real time. Apache Spark Streaming*: feeds data to HDFS* in micro-batches. ELK* stack: full-text search and visualization of data in the cluster. Apache Spark*: machine learning batch processing. Apache Zeppelin*: web-based workbook repository for data scientists.
  • 7. Cluster Analysis. Workload logs can be treated as text documents, for which suitable clustering algorithms exist. The objective is to group the dataset into clusters. Objects assigned to the same cluster are more similar to each other (under a predefined similarity measure) than to objects in other clusters. Cluster analysis is a major technique in exploratory data mining and a form of unsupervised machine learning.
  • 8. Cluster Analysis: s1 Dataset Example. Dataset: P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
  • 9. Cluster Analysis Algorithms. Hierarchical clustering (e.g. single-linkage clustering). Density-based clustering (e.g. DBSCAN). Distribution-based clustering (e.g. EM algorithm). Centroid-based clustering (methods derived from the k-means algorithm). Source: https://upload.wikimedia.org/wikipedia/commons/1/12/Iris_dendrogram.png
  • 10. Log Data Machine Learning Steps: Filtering and Stopwords Removal → Tokenization → TF-IDF / word2vec Conversion → Normalization (optional) → k-means Clustering.
  • 11. Feature Vectorization. How to represent texts and words as vectors? Vector Space Model: documents are represented as vectors, and each dimension of the vector space corresponds to a word.
             DOC1  DOC2  DOC3
      home     14    19    45
      stop      9     0     0
      event     0    32     4
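The vector space model on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three sample documents are invented, whitespace tokenization is a simplification, and `term_count_vectors` is a hypothetical helper name rather than anything from the talk.

```python
from collections import Counter


def term_count_vectors(docs):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Each distinct word becomes one dimension of the vector space.
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical documents, invented for illustration only.
docs = ["home stop home", "home event event", "event home"]
vocabulary, vectors = term_count_vectors(docs)

for doc, vector in zip(docs, vectors):
    print(doc, "->", dict(zip(vocabulary, vector)))
```

Each document becomes a list of counts aligned with the sorted vocabulary, which is the same layout as the DOC1/DOC2/DOC3 columns in the table.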
  • 12. Term Frequency - Inverse Document Frequency (TF-IDF). Widely used in text mining, search and classification tasks. Adds weights to the document vectors that reflect the importance of each term. Example with D1: "I like red." and D2: "I do not like red."
      W      D1  D2  IDF        TF-IDF (D1)  TF-IDF (D2)
      I      1   1   log(3/3)   0            0
      like   1   1   log(3/3)   0            0
      red    1   1   log(3/3)   0            0
      do     0   1   log(3/2)   0            0.176
      not    0   1   log(3/2)   0            0.176
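The numbers in the table can be reproduced with a short sketch. Two assumptions are inferred from the values rather than stated on the slide: the IDF is smoothed as log((N + 1) / (df + 1)) with N = 2 documents, and the logarithm is base 10 (log10(3/2) ≈ 0.176). The helper name `tf_idf` is hypothetical.

```python
import math


def tf_idf(docs):
    """Return (idf, weights): IDF per term and TF-IDF weights per document."""
    # Simplified tokenization: lowercase, drop full stops, split on whitespace.
    tokenized = [doc.lower().replace(".", "").split() for doc in docs]
    n_docs = len(tokenized)
    vocabulary = sorted({word for words in tokenized for word in words})

    idf = {}
    for word in vocabulary:
        doc_freq = sum(1 for words in tokenized if word in words)
        # Assumed smoothing and base 10, chosen to match the slide's values.
        idf[word] = math.log10((n_docs + 1) / (doc_freq + 1))

    weights = [
        {word: words.count(word) * idf[word] for word in vocabulary}
        for words in tokenized
    ]
    return idf, weights


idf, weights = tf_idf(["I like red.", "I do not like red."])
print({word: round(value, 3) for word, value in weights[1].items()})
```

Terms that occur in every document get weight 0, so only "do" and "not" distinguish D2 from D1.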
  • 13. Word2vec. Developed by Mikolov et al. (Google, 2013)*. Word embedding: produces vector representations of words, based on which words occurred near other words. The Spark ML implementation of word2vec supports cosine distance as a similarity measure, which produced better results in our use case. It also provides dimensionality reduction, which makes the vectors more suitable for Spark k-means. Figure: example word vectors for apple, banana, orange, bicycle, book, notepad. *Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space"
  • 14. k-means Algorithm. A clustering algorithm which groups n objects (points) into k groups, where k is a predefined parameter. Each object is assigned to the group whose centroid is closest (most similar) to it. k-means defines a whole family of algorithms such as k-medians, k-medoids or c-means.
  • 15. k-means Problem Definition. Given a dataset of $n$ objects $X = \{x_1, x_2, x_3, \dots, x_n\}$, where each object is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$), the k-means problem is defined as: divide the dataset $X$ into $k$ groups (clusters) $C = \{C_1, C_2, C_3, \dots, C_k\}$ in such a way that the within-set sum of squares is minimized: $E(C) = \sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, \bar{C}_i)$, where $\bar{C}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the centroid of a given group and $d(x, y)$ is a distance function between $x$ and $y$. This is an NP-hard problem.
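As a worked instance of the objective above, the following sketch evaluates the within-set sum of squares for a fixed clustering. It assumes Euclidean distance and non-empty clusters; the function names and the sample points are illustrative, not taken from the talk.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wssse(clusters):
    """Within-set sum of squared errors of a clustering (list of point lists)."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(squared_distance(p, center) for p in points)
    return total


# Two illustrative clusters; each point lies at distance 1 from its centroid.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
print(wssse(clusters))  # 4.0
```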
  • 16. Heuristic Solution: k-means Algorithm. Input: $X$ (dataset), $k$ (expected number of clusters). Output: $C = \{C_1, C_2, C_3, \dots, C_k\}$ (set of clusters).
      begin
        initialization*: divide dataset X into k random, exclusive groups;
        do
          foreach group, compute its L2-norm* centroid;
          foreach object in dataset, assign it to the closest* group (using centroids);
        while any C_i group assignment changed;
      end
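The pseudocode above can be turned into a small runnable sketch in plain Python. Several details are assumptions added for illustration: Euclidean distance, a shuffled round-robin split so that every initial group is non-empty, reuse of the previous centroid if a group becomes empty, and a `max_iter` safety cap. The name `k_means` and its signature are hypothetical, not the Spark API.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) following the do/while loop on the slide."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Initialization: divide the dataset into k random, exclusive groups.
    order = list(range(len(points)))
    rng.shuffle(order)
    assignment = [0] * len(points)
    for position, index in enumerate(order):
        assignment[index] = position % k

    centroids = []
    for _ in range(max_iter):
        # For each group, compute its centroid.
        new_centroids = []
        for group in range(k):
            members = [p for p, a in zip(points, assignment) if a == group]
            # Assumption: an emptied group keeps its previous centroid.
            new_centroids.append(centroid(members) if members else centroids[group])
        centroids = new_centroids

        # For each object, assign it to the group with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda g: squared_distance(p, centroids[g]))
            for p in points
        ]
        if new_assignment == assignment:  # no assignment changed: converged
            break
        assignment = new_assignment
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, k=2)
print(assignment)
```

On this toy dataset the two well-separated blobs end up in different groups; which group gets which label depends on the random initial split.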
  • 17. k-means Algorithm: Example (1). Let's consider the following 2D points dataset. Visualization based on: https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
  • 18. k-means Algorithm: Example (2). The dataset will be divided into 4 clusters (k=4); Euclidean distance will be used as the similarity measure.
  • 19. k-means Algorithm: Example (3). The algorithm is initialized by choosing 4 center points randomly. The initial clustering will be random.
  • 20. k-means Algorithm: Example (4). Each point is assigned to the closest center using Euclidean distance. This determines the initial group assignment.
  • 21. k-means Algorithm: Example (5). The new group assignment determines new centroid positions. Each point is assigned to the closest centroid again.
  • 22. k-means Algorithm: Example (6). These steps are repeated until the algorithm converges, i.e. no points are assigned to a different cluster.
  • 23. k-means Algorithm: Example (7). Some points have been reassigned again...
  • 24. k-means Algorithm: Example (8). And again...
  • 25. k-means Algorithm: Example (9). In the final iteration, the new positions of centroids have been calculated and no point has been assigned to a different group. DONE!
  • 26. k-means Algorithm: Initialization. The most important part of the k-means algorithm. Initialization predetermines how the algorithm will converge. Different initializations give different clusterings. Simple approaches: Random, Forgy, MacQueen, Kaufman. Advanced approaches*: k-means++ and k-means||. Figure panels: Random method, Forgy method. *Source: David Arthur and Sergei Vassilvitskii, "The Advantages of Careful Seeding" [2007].
  • 27. k-means++ Initialization. 1. Choose the first center ("seed") $c_1$ uniformly at random from the dataset $X$. 2. Choose the next center $c_i$ by selecting $x \in X$ with probability $\frac{d(x, c(x))^2}{\sum_{x' \in X} d(x', c(x'))^2}$, where $c(x)$ is the closest already chosen center for $x$ ("$d^2$-weighting"). 3. Repeat step 2 until $k$ initial centers are selected.
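The three seeding steps can be sketched as follows. This is a plain-Python illustration, not the Spark implementation (Spark's `k-means||` mode is a parallel variant). It assumes Euclidean distance and a dataset with at least k distinct points; `k_means_pp_init` is a hypothetical name.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means_pp_init(points, k, seed=0):
    """Pick k initial centers using d^2-weighting."""
    rng = random.Random(seed)
    # Step 1: the first center is drawn uniformly from the dataset.
    centers = [rng.choice(points)]
    # Steps 2-3: draw each further center with probability proportional to
    # the squared distance to its closest already chosen center.
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (20.0, 0.0)]
centers = k_means_pp_init(points, k=3)
print(centers)
```

A point that is already a center has weight 0, so it cannot be drawn again; points far from all current centers are the most likely next picks.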
  • 28. Example (PySpark):
      import pprint

      from pyspark.ml import Pipeline
      from pyspark.ml.clustering import KMeans
      from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

      k = 25  # Number of clusters

      # Input dataframe; the 'content' column holds a single document per row.
      # sqlContext is assumed to be provided by the notebook environment.
      df = sqlContext.read.parquet("data.parquet")

      # Word tokenization: split on runs of non-word characters
      tokenizer = RegexTokenizer(inputCol="content", outputCol="words", gaps=True, pattern="\\W")
      # Standard stopwords removal
      remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered_words")
      word2vec = Word2Vec(vectorSize=100, minCount=5, windowSize=10, maxIter=2,
                          inputCol=remover.getOutputCol(), outputCol="features")
      kmeans = KMeans(k=k, predictionCol="prediction", initMode="k-means||",
                      initSteps=10, tol=1e-7, maxIter=600)
      pipeline = Pipeline(stages=[tokenizer, remover, word2vec, kmeans])

      print("Starting k-means... (k={0})".format(k))
      model = pipeline.fit(df)
      # Column 'prediction' contains the cluster number for each document
      clustering_df = model.transform(df)

      kmeans_model = model.stages[-1]
      print("Cluster centers: {0}".format(pprint.pformat(kmeans_model.clusterCenters())))
      # computeCost is the Spark 2.x call used in the talk; later releases deprecate it
      print("Within set sum of squared errors ({0}) = {1}".format(k, kmeans_model.computeCost(clustering_df)))
  • 29. Cluster Validation: WSSSE. k-means minimizes the within-set sum of squared errors (WSSSE). It is the simplest internal cluster validation index (no external labelling needed). It decreases monotonically with the number of detected clusters. It can be used to choose the number of clusters (the $k$ parameter) with the "elbow method".
  • 30. Interactive Datacenter Log Clustering Visualization. Timeframe: 72 h. Workload servers: 71. Raw log data: 127 GB. Log messages: 172 million. Clusters: 56.
  • 31. Lessons learned. Examples of inconsistent log messages: "INFO: Error happened. Reboot platform", "ERROR: This is a debug message", "WARN: Critical error occurred". Efficient logging requires consistency. The Spark ML k-means implementation supports only Euclidean distance; there is currently (as of 2017) no support for cosine similarity. Document clustering is very sensitive to data preprocessing quality.
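One common way to approximate cosine-based clustering with a Euclidean-only k-means is to scale every vector to unit length first, which is presumably what the optional normalization step in the pipeline is for. For unit vectors, the squared Euclidean distance equals 2 · (1 − cosine similarity), so both measures rank neighbours identically. The sketch below checks that identity numerically; the vectors and function names are illustrative assumptions.

```python
import math


def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    length = norm(v)
    return [x / length for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Illustrative vectors, not real word2vec output.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 * (1.0 - cosine_similarity(a, b))
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

In a Spark ML pipeline this would correspond to placing a `Normalizer` stage (from `pyspark.ml.feature`, with `p=2.0`) between the word2vec and k-means stages. Later Spark releases also add a distance-measure option to k-means, which may make this workaround unnecessary.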
  • 34. Cluster Analysis: Use Cases. Data exploration. Statistical data analysis. Recommender systems. Text mining. Pattern recognition. Image segmentation and analysis. Bioinformatics. Medicine. Market research.
  • 35. Clustering Validation: $\mathcal{F}_1$-score. An external clustering validation index (requires external labelling). A set-overlap based measure. The $\mathcal{F}_1$-score for a cluster $C_j$ with respect to an external classification $V_i$ is defined as the harmonic mean of precision and recall: $\mathcal{F}_1(V_i, C_j) = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}$. The micro-averaged $\mathcal{F}_1$-score for a clustering $C = \{C_1, C_2, C_3, \dots, C_k\}$ and a classification $V = \{V_1, V_2, V_3, \dots, V_m\}$ is then defined as: $\mathcal{F}_1(V, C) = \sum_{i=1}^{m} \frac{|V_i|}{n} \max_{1 \le j \le k} \mathcal{F}_1(V_i, C_j)$. Source: https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg (CC BY-SA 4.0)
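The two formulas can be evaluated directly on sets of object identifiers. The slide does not spell out precision and recall in set terms, so the sketch assumes the usual definitions: precision = |V_i ∩ C_j| / |C_j| and recall = |V_i ∩ C_j| / |V_i|, with a score of 0 when the sets do not overlap. The function names and the sample labelling are illustrative.

```python
def f1(reference, cluster):
    """F1-score of one cluster against one reference class (both sets of ids)."""
    overlap = len(reference & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)   # assumed: share of the cluster in the class
    recall = overlap / len(reference)    # assumed: share of the class in the cluster
    return 2.0 / (1.0 / precision + 1.0 / recall)


def micro_f1(classes, clusters):
    """Micro-averaged F1: each class is weighted by its size |V_i| / n."""
    n = sum(len(v) for v in classes)
    return sum(len(v) / n * max(f1(v, c) for c in clusters) for v in classes)


# Illustrative labelling: six objects, two reference classes, two clusters.
classes = [{0, 1, 2}, {3, 4, 5}]
clusters = [{0, 1}, {2, 3, 4, 5}]
score = micro_f1(classes, clusters)
print(round(score, 4))  # 0.8286
```

Here the first class is best matched by the first cluster (F1 = 0.8) and the second class by the second cluster (F1 = 6/7), giving 0.5 · 0.8 + 0.5 · 6/7.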