CSMR: A Scalable Algorithm for 
Text Clustering with Cosine 
Similarity and MapReduce 
Giannakouris – Salalidis Victor - Undergraduate Student 
Plerou Antonia - PhD Candidate 
Sioutas Spyros - Associate Professor
Introduction 
• Big Data: Massive volumes of data produced by a very high 
rate of data growth 
• Big Data must be handled in many domains: Business 
Intelligence, Bioinformatics, Social Media Analytics, etc. 
• Text Mining: Classification/Clustering in digital libraries, 
e-mail, Sentiment Analysis on Social Media 
• CSMR: Computes pairwise text similarity by representing text 
data in a vector space and measuring similarity in parallel 
using MapReduce
Background 
• Vector Space Model: An algebraic model for representing 
text documents as vectors 
• Efficient method for text similarity measurement
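As a small illustration of the vector space model, the sketch below turns each document into a vector of term counts over a shared vocabulary. The function name `vectorize` and the whitespace tokenizer are assumptions made for this example; they are not taken from the slides.

```python
def vectorize(docs):
    """Map each document to a term-count vector over a shared vocabulary."""
    # Assumption: lowercase whitespace tokenization, chosen for brevity.
    tokenized = [text.lower().split() for text in docs]
    vocab = sorted({term for terms in tokenized for term in terms})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for terms in tokenized:
        vector = [0] * len(vocab)
        for term in terms:
            vector[index[term]] += 1  # one dimension per vocabulary term
        vectors.append(vector)
    return vocab, vectors


vocab, vectors = vectorize(["big data", "data mining data"])
```

Each position in a vector corresponds to one vocabulary term, so documents of different lengths become comparable points in the same space.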
TF-IDF 
• Term Frequency – Inverse Document Frequency 
• A numerical statistic that reflects the significance of a 
term in a corpus of documents 
• Usually used in search engines, text mining, text 
similarity in the vector space 
$$TF \times IDF = \frac{n_{i,j}}{|\{t \in d_j\}|} \times \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
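The weighting can be sketched on a single machine as follows. This is a minimal example, assuming documents are already tokenized into lists of terms; the name `tf_idf` and the toy corpus are hypothetical. It uses the term count divided by the document length as TF and the logarithm of the corpus size over the document frequency as IDF.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return {doc_id: {term: weight}} for a corpus given as {doc_id: [terms]}."""
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for terms in corpus.values():
        df.update(set(terms))
    num_docs = len(corpus)
    weights = {}
    for doc_id, terms in corpus.items():
        counts = Counter(terms)
        total = len(terms)  # number of term occurrences in this document
        weights[doc_id] = {
            term: (count / total) * math.log(num_docs / df[term])
            for term, count in counts.items()
        }
    return weights


# Toy corpus, invented for illustration.
corpus = {"d1": ["big", "data", "text"], "d2": ["text", "mining"]}
weights = tf_idf(corpus)
```

A term that occurs in every document, such as "text" here, gets weight 0 because its IDF is log(1).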
Cosine Similarity 
• Cosine Similarity: A measure of similarity between two 
documents represented as vectors 
• Measures the cosine of the angle between the two vectors 
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}$$
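A minimal sketch of the measure, assuming each document vector is stored sparsely as a `{term: weight}` dictionary. The function name `cosine` and the choice to return 0.0 for a zero vector are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is defined as 0
    return dot / (norm_a * norm_b)


same_direction = cosine({"x": 1.0}, {"x": 2.0})
no_overlap = cosine({"x": 1.0}, {"y": 1.0})
```

Vectors that point in the same direction score 1 regardless of length, and vectors with no shared terms score 0.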
Hadoop 
• Framework developed by Apache 
• Large-Scale Data Processing and Analytics 
• Scalable and parallel processing of data on large 
computer clusters using MapReduce 
• Runs on commodity, low-end hardware 
• Main Components: HDFS (Hadoop Distributed File 
System), MapReduce 
• Currently used by: Adobe, Yahoo!, Amazon, eBay, 
Facebook and many other companies
MapReduce 
• Programming Paradigm running on Apache Hadoop 
• One of Hadoop's main components 
• Useful for processing of large data-sets 
• Breaks the data into key-value pairs 
• Model derived from map and reduce functions of 
Functional Programming 
• Every MR program consists of Mappers and Reducers
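The map, group-by-key, and reduce flow can be imitated in a single process. The sketch below is only an illustration of the programming model, not Hadoop code; `run_mapreduce`, `length_mapper`, and `count_reducer` are hypothetical names.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job (illustration only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map: emit key-value pairs
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))  # reduce: one call per key
    return output


def length_mapper(word):
    yield len(word), word


def count_reducer(length, words):
    yield length, len(words)


result = run_mapreduce(["map", "reduce", "key", "value"], length_mapper, count_reducer)
```

On a real cluster the mapper and reducer calls run on different machines and the grouping step is done by the framework, but the contract between the two functions is the same.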
MapReduce Diagram
CSMR 
• The proposed method, CSMR, combines all of the above-mentioned 
techniques 
• Scalable algorithm for text clustering using the MapReduce model 
• Applies the MR model to TF-IDF and Cosine Similarity 
• 4 Phases: 
1. Word Counting 
2. Text Vectorization using term frequencies 
3. Apply TF-IDF on document vectors 
4. Cosine Similarity Measurement
Phase 1: Word Counting 
Algorithm 1: Word Count 
class Mapper 
    method Map( document ) 
        for each term ∈ document do 
            write ( ( term , docId ) , 1 ) 

class Reducer 
    method Reduce( ( term , docId ) , ones[ 1 , 1 , … , 1 ] ) 
        o = 0 
        for each one ∈ ones do 
            o = o + 1 
        return ( ( term , docId ) , o ) 

/* o ∈ N : the number of occurrences of term in document docId */
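Phase 1 can be tried out locally with a few lines of Python. This is a sketch under stated assumptions: the names `wc_map`, `wc_reduce`, and `word_count` are hypothetical, documents are tokenized on whitespace, and the grouping by key that Hadoop would perform is simulated with a dictionary.

```python
from collections import defaultdict


def wc_map(doc_id, text):
    """Emit ((term, doc_id), 1) for every term occurrence in the document."""
    for term in text.lower().split():
        yield (term, doc_id), 1


def wc_reduce(key, ones):
    """Sum the ones collected for a (term, doc_id) key."""
    yield key, sum(ones)


def word_count(docs):
    groups = defaultdict(list)  # stands in for the shuffle step
    for doc_id, text in docs.items():
        for key, one in wc_map(doc_id, text):
            groups[key].append(one)
    return dict(kv for key, ones in groups.items() for kv in wc_reduce(key, ones))


counts = word_count({"d1": "big data big", "d2": "data mining"})
```

Keying on the pair (term, docId) rather than on the term alone keeps the counts separate per document, which the later phases rely on.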
Phase 2: Term Frequency 
Algorithm 2: Term Frequency 
class Mapper 
    method Map( ( term , docId ) , o ) 
        write ( docId , ( term , o ) ) 

class Reducer 
    method Reduce( docId , tuples[ ( term , o ) , … ] ) 
        N = 0 
        for each ( term , o ) ∈ tuples do 
            N = N + o 
        for each ( term , o ) ∈ tuples do 
            write ( ( docId , N ) , ( term , o ) ) 

/* N : the total number of term occurrences in document docId */
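A local sketch of Phase 2, assuming its input is a dictionary of `(term, doc_id)` counts as produced by the word-count phase. The names `tf_map`, `tf_reduce`, and `term_frequency` are hypothetical, and the shuffle is again simulated in memory.

```python
from collections import defaultdict


def tf_map(key, o):
    """Re-key a word-count record by document."""
    term, doc_id = key
    yield doc_id, (term, o)


def tf_reduce(doc_id, pairs):
    """Attach the document's total term count N to each of its terms."""
    n_total = sum(o for _, o in pairs)
    for term, o in pairs:
        yield (doc_id, n_total), (term, o)


def term_frequency(counts):
    groups = defaultdict(list)  # stands in for the shuffle step
    for key, o in counts.items():
        for doc_id, pair in tf_map(key, o):
            groups[doc_id].append(pair)
    return [kv for doc_id, pairs in groups.items() for kv in tf_reduce(doc_id, pairs)]


counts = {("big", "d1"): 2, ("data", "d1"): 1, ("data", "d2"): 1}
records = term_frequency(counts)
```

The reducer needs two passes over a document's terms: one to compute N and one to emit each term together with it.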
Phase 3: TF-IDF 
Algorithm 3: TF-IDF 
class Mapper 
    method Map( ( docId , N ) , ( term , o ) ) 
        write ( term , ( docId , o , N ) ) 

class Reducer 
    method Reduce( term , tuples[ ( docId , o , N ) , … ] ) 
        n = 0 
        for each ( docId , o , N ) ∈ tuples do 
            n = n + 1 
        idf = log( |D| / ( 1 + n ) ) 
        for each ( docId , o , N ) ∈ tuples do 
            tf = o / N 
            write ( docId , ( term , tf × idf ) ) 

/* |D| is the number of documents in the corpus; n is the number of documents that contain term */
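Phase 3 can be sketched locally in the same style. The names `tfidf_map`, `tfidf_reduce`, and `tf_idf` are hypothetical, and the input records are invented for illustration. The IDF here follows the formula on the TF-IDF slide, log(|D| / n), where n is the number of documents containing the term; a smoothed variant would change only that one line.

```python
import math
from collections import defaultdict


def tfidf_map(key, value):
    """Re-key a ((doc_id, N), (term, o)) record by term."""
    doc_id, n_total = key
    term, o = value
    yield term, (doc_id, o, n_total)


def tfidf_reduce(term, postings, num_docs):
    """Emit (doc_id, (term, tf * idf)) for every document containing term."""
    n = len(postings)  # number of documents that contain the term
    idf = math.log(num_docs / n)  # assumption: unsmoothed IDF
    for doc_id, o, n_total in postings:
        yield doc_id, (term, (o / n_total) * idf)


def tf_idf(records, num_docs):
    groups = defaultdict(list)  # stands in for the shuffle step
    for key, value in records:
        for term, posting in tfidf_map(key, value):
            groups[term].append(posting)
    weights = defaultdict(dict)
    for term, postings in groups.items():
        for doc_id, (t, w) in tfidf_reduce(term, postings, num_docs):
            weights[doc_id][t] = w
    return dict(weights)


records = [(("d1", 3), ("big", 2)), (("d1", 3), ("data", 1)), (("d2", 1), ("data", 1))]
weights = tf_idf(records, num_docs=2)
```

Grouping by term is what makes the document frequency available: the reducer sees every document that contains the term in a single call.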
Phase 4: Cosine Similarity 
Algorithm 4: Cosine Similarity 
class Mapper 
    method Map( docs ) 
        n = docs.length 
        for i = 0 to n - 2 
            for j = i + 1 to n - 1 
                write ( ( docs[i].id , docs[j].id ) , ( docs[i].tfidf , docs[j].tfidf ) ) 

class Reducer 
    method Reduce( ( docId_A , docId_B ) , ( docA.tfidf , docB.tfidf ) ) 
        A = docA.tfidf 
        B = docB.tfidf 
        cosine = sum( A × B ) / ( sqrt( sum( A^2 ) ) × sqrt( sum( B^2 ) ) ) 
        return ( ( docId_A , docId_B ) , cosine )
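A local sketch of Phase 4, assuming the TF-IDF vectors are available as `{doc_id: {term: weight}}`. The names `pair_map` and `cosine_reduce` and the toy vectors are hypothetical. Each pair key carries exactly one value here, so no grouping step is needed.

```python
import math
from itertools import combinations


def pair_map(docs):
    """Emit every unordered pair of documents once, keyed by the two ids."""
    for id_a, id_b in combinations(sorted(docs), 2):
        yield (id_a, id_b), (docs[id_a], docs[id_b])


def cosine_reduce(key, value):
    """Compute the cosine similarity of the two vectors attached to a pair."""
    a, b = value
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    # Assumption: a pair involving a zero vector gets similarity 0.
    yield key, (dot / norm if norm else 0.0)


# Toy vectors, invented for illustration.
docs = {"d1": {"big": 1.0, "data": 1.0}, "d2": {"data": 2.0}, "d3": {"text": 1.0}}
sims = dict(kv for key, value in pair_map(docs) for kv in cosine_reduce(key, value))
```

The number of emitted pairs grows quadratically with the number of documents (n(n-1)/2), which is why this phase benefits most from being distributed.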
Phase 4: Diagram 
[Figure: The Map step takes the documents as input and emits one record per document pair, with key (DocM, DocN) and value ([DocM TF-IDF], [DocN TF-IDF]), for example (Doc1, Doc2), (Doc1, Doc3), (Doc1, Doc4), (Doc4, Doc10). The Reduce step outputs Cosine(DocM, DocN) for each key.]
Conclusions & Future Work 
• Finalized proposed method 
• Implementation of the method 
• Experimental tests on real data and computer clusters 
• Deployment of an open-source project 
• Additional implementation using more efficient tools such 
as Apache Spark and Scala 
• Publication of test results

More Related Content

What's hot

Ir 08
Ir   08Ir   08
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
Vijay Koushik
 
Scoring, term weighting and the vector space
Scoring, term weighting and the vector spaceScoring, term weighting and the vector space
Scoring, term weighting and the vector spaceUjjawal
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering engines
Yash Darak
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Krish_ver2
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
Praveen Kumar
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
FACTS Computer Software L.L.C
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
Poonam Kshirsagar
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
IJTET Journal
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
Krish_ver2
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
marxliouville
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering enginesunyil96
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
SriTeja Allaparthi
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
Deeksha thakur
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
hadifar
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
Krish_ver2
 

What's hot (20)

Ir 08
Ir   08Ir   08
Ir 08
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
 
Scoring, term weighting and the vector space
Scoring, term weighting and the vector spaceScoring, term weighting and the vector space
Scoring, term weighting and the vector space
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering engines
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Lect4
Lect4Lect4
Lect4
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Ghost
GhostGhost
Ghost
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
 

Viewers also liked

OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
Florian Leitner
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
makoto onizuka
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
Subhas Kumar Ghosh
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
mobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
MLconf
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehHadi Mohammadzadeh
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
Subhas Kumar Ghosh
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Varad Meru
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
Varad Meru
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
Milind Bhandarkar
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
Tien-Yang (Aiden) Wu
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
Prakhar Dhama
 
IntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotationsIntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotations
European Institute for Systems Biology & Medicine.
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
Matteo Romanello
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
Nicolas Robinson-Garcia
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Seattle DAML meetup
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseDATAVERSITY
 

Viewers also liked (20)

OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
 
IntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotationsIntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotations
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBase
 

Similar to CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
Paco Nathan
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
Ralf Gommers
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
Mark Levy
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
CheeWeiTan10
 
Hadoop
HadoopHadoop
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
greenwop
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
Enrico Daga
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
Matthäus Zloch
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
Feng Li
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
Yoshitomo Matsubara
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
Subhas Kumar Ghosh
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Data Science
Data ScienceData Science
Data Science
Subhajit75
 
Introduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingIntroduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searching
Mvenkatarao
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]
Pramati Technologies
 

Similar to CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce (20)

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Unit 2
Unit 2Unit 2
Unit 2
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Data Science
Data ScienceData Science
Data Science
 
Introduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingIntroduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searching
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]
 

Recently uploaded

社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 

Recently uploaded (20)

社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf

CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

  • 1. CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
    Giannakouris – Salalidis Victor - Undergraduate Student
    Plerou Antonia - PhD Candidate
    Sioutas Spyros - Associate Professor
  • 2. Introduction
    • Big Data: massive volumes of data resulting from the rapid rate of data growth
    • Big Data must be handled in many domains: Business Intelligence, Bioinformatics, Social Media Analytics, etc.
    • Text Mining: classification/clustering in digital libraries and e-mail, Sentiment Analysis on social media
    • CSMR: performs pairwise text similarity; it represents text data in a vector space and measures similarity in parallel using MapReduce
  • 3. Background
    • Vector Space Model: an algebraic model for representing text documents as vectors
    • An efficient method for measuring text similarity
  • 4. TF-IDF
    • Term Frequency – Inverse Document Frequency
    • A numerical statistic that reflects the significance of a term in a corpus of documents
    • Commonly used in search engines, text mining, and text similarity in the vector space
    $$\mathrm{TF} \times \mathrm{IDF} = \frac{n_{i,j}}{|\{t \in d_j\}|} \times \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
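As a rough illustration of the TF-IDF formula on this slide, the following Python sketch computes the weight of one term in one document. The function name `tf_idf`, the toy corpus, and the use of the natural logarithm are assumptions made for the example; the slides do not specify a logarithm base.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`; `corpus` is a list of non-empty tokenized documents."""
    counts = Counter(doc)
    tf = counts[term] / len(doc)               # n_{i,j} / number of terms in d_j
    df = sum(1 for d in corpus if term in d)   # |{d in D : t in d}|
    if df == 0:                                # term absent from the corpus
        return 0.0
    return tf * math.log(len(corpus) / df)     # natural log is an assumption

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["big", "data", "text"],
    ["text", "mining", "text"],
    ["hadoop", "cluster"],
]
weight = tf_idf("text", corpus[1], corpus)     # tf = 2/3, idf = log(3/2)
print(weight)
```

Here "text" makes up two of the three terms in the second document and appears in two of the three documents, so the weight is (2/3) × log(3/2).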
  • 5. Cosine Similarity
    • Cosine Similarity: a measure of similarity between two documents represented as vectors
    • Measures the cosine of the angle between the two vectors
    $$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$
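A minimal Python sketch of the cosine similarity formula for two dense vectors of equal length follows. The function name `cosine` and the choice to return 0.0 for a zero vector are assumptions of this example, not something the slides define.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))      # sum of A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||B||
    if norm_a == 0 or norm_b == 0:              # assumption: zero vector -> 0.0
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine([1, 2, 3], [2, 4, 6]))   # parallel vectors, similarity close to 1
print(cosine([1, 0], [0, 1]))         # orthogonal vectors, similarity 0
```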
  • 6. Hadoop
    • Framework developed by Apache
    • Large-scale data processing and analytics
    • Scalable, parallel processing of data on large computer clusters using MapReduce
    • Runs on commodity, low-end hardware
    • Main components: HDFS (Hadoop Distributed File System), MapReduce
    • Currently used by: Adobe, Yahoo!, Amazon, eBay, Facebook and many other companies
  • 7. MapReduce
    • Programming paradigm running on Apache Hadoop
    • A main component of Hadoop
    • Useful for processing large data sets
    • Breaks the data into key-value pairs
    • Model derived from the map and reduce functions of Functional Programming
    • Every MR program consists of Mappers and Reducers
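The key-value flow described on this slide can be imitated in a single Python process. The sketch below is only an in-memory stand-in, not Hadoop's API: `run_mapreduce`, `parity_mapper` and `sum_reducer` are hypothetical names, and the grouping step plays the role of the shuffle between Mappers and Reducers.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply `mapper` to each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):    # map: record -> (key, value) pairs
            groups[key].append(value)        # shuffle: collect values per key
    return [reducer(key, values) for key, values in sorted(groups.items())]

def parity_mapper(number):
    # Emit one key-value pair per input record.
    yield ("even" if number % 2 == 0 else "odd", number)

def sum_reducer(key, values):
    # Combine all values that share a key.
    return (key, sum(values))

result = run_mapreduce([1, 2, 3, 4, 5], parity_mapper, sum_reducer)
print(result)
```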
  • 9. CSMR
    • The proposed method, CSMR, combines all of the techniques mentioned above
    • Scalable algorithm for text clustering using the MapReduce model
    • Applies the MR model to TF-IDF and Cosine Similarity
    • 4 Phases:
      1. Word Counting
      2. Text Vectorization using term frequencies
      3. Apply TF-IDF on document vectors
      4. Cosine Similarity Measurement
  • 10. Phase 1: Word Counting
    Algorithm 1: Word Count
        class Mapper
            method Map( document )
                for each term ∈ document
                    write ( ( term , docId ) , 1 )

        class Reducer
            method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] )
                o = 0
                for each one ∈ ones do
                    o = o + one
                return ( ( term , docId ) , o )

        /* { o ∈ N : the number of occurrences } */
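Phase 1 could be sketched in plain Python as follows. This is a single-process imitation under stated assumptions: documents arrive as a hypothetical `{docId: text}` dictionary, tokenization is lowercase whitespace splitting, and the grouping inside `word_count` stands in for Hadoop's shuffle.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    # Emit ((term, docId), 1) for every term occurrence in the document.
    for term in text.lower().split():
        yield ((term, doc_id), 1)

def reduce_word_count(key, ones):
    # Sum the ones to get o, the number of occurrences of the term in the document.
    return (key, sum(ones))

def word_count(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, one in map_word_count(doc_id, text):
            groups[key].append(one)
    return dict(reduce_word_count(key, ones) for key, ones in groups.items())

# Hypothetical input documents.
docs = {"d1": "big data big clusters", "d2": "text data"}
counts = word_count(docs)
print(counts)
```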
  • 11. Phase 2: Term Frequency
    Algorithm 2: Term Frequency
        class Mapper
            method Map( ( term , docId ) , o )
                for each element ∈ ( term , docId )
                    write ( docId , ( term , o ) )

        class Reducer
            method Reduce( docId , ( term , o ) )
                N = 0
                for each tuple ∈ ( term , o ) do
                    N = N + o
                return ( ( docId , N ) , ( term , o ) )
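A possible Python reading of Phase 2 is sketched below. It assumes the input is a dictionary shaped like the Phase 1 output, `{(term, docId): o}`, and that the reducer emits one `((docId, N), (term, o))` record per term; the function names are hypothetical.

```python
from collections import defaultdict

def map_term_frequency(key, o):
    # Re-key each count by document: ((term, docId), o) -> (docId, (term, o)).
    term, doc_id = key
    yield (doc_id, (term, o))

def reduce_term_frequency(doc_id, tuples):
    # N is the total number of term occurrences in the document.
    n_total = sum(o for _, o in tuples)
    return [((doc_id, n_total), (term, o)) for term, o in tuples]

def term_frequency(counts):
    groups = defaultdict(list)
    for key, o in counts.items():
        for doc_id, value in map_term_frequency(key, o):
            groups[doc_id].append(value)
    records = []
    for doc_id, tuples in groups.items():
        records.extend(reduce_term_frequency(doc_id, tuples))
    return records

# Hypothetical Phase 1 output.
counts = {("big", "d1"): 2, ("data", "d1"): 1, ("data", "d2"): 1}
records = term_frequency(counts)
print(records)
```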
  • 12. Phase 3: TF-IDF
    Algorithm 3: Tf-Idf
        class Mapper
            method Map( ( docId , N ) , ( term , o ) )
                for each element ∈ ( term , o )
                    write ( term , ( docId , o , N ) )

        class Reducer
            method Reduce( term , ( docId , o , N ) )
                n = 0
                for each element ∈ ( docId , o , N ) do
                    n = n + 1
                tf = o / N
                idf = log( |D| / ( 1 + n ) )
                return ( docId , ( term , tf × idf ) )

        /* Where |D| is the number of documents in the corpus */
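Phase 3 might look like the following in Python. The sketch assumes records shaped like the Phase 2 output, takes |D| as an explicit `num_docs` argument, and reads the slide's idf line as log(|D| / (1 + n)) with a natural logarithm; both the smoothed denominator and the log base are interpretations, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def map_tf_idf(key, value):
    # Re-key by term: ((docId, N), (term, o)) -> (term, (docId, o, N)).
    (doc_id, n_total), (term, o) = key, value
    yield (term, (doc_id, o, n_total))

def reduce_tf_idf(term, postings, num_docs):
    n = len(postings)                      # number of documents containing the term
    idf = math.log(num_docs / (1 + n))     # assumed reading of the slide's idf line
    return [(doc_id, (term, (o / n_total) * idf)) for doc_id, o, n_total in postings]

def tf_idf_phase(records, num_docs):
    groups = defaultdict(list)
    for key, value in records:
        for term, posting in map_tf_idf(key, value):
            groups[term].append(posting)
    weights = []
    for term, postings in groups.items():
        weights.extend(reduce_tf_idf(term, postings, num_docs))
    return weights

# Hypothetical Phase 2 output for two documents out of a four-document corpus.
records = [
    (("d1", 3), ("big", 2)),
    (("d1", 3), ("data", 1)),
    (("d2", 1), ("data", 1)),
]
weights = tf_idf_phase(records, num_docs=4)
print(weights)
```

With this reading of idf, a term that occurs in every document gets a slightly negative weight, since |D| / (1 + n) is then below 1.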
  • 13. Phase 4: Cosine Similarity
    Algorithm 4: Cosine Similarity
        class Mapper
            method Map( docs )
                n = docs.length

                for i = 0 to n - 1
                    for j = i + 1 to n - 1
                        write ( ( docs[i].id , docs[j].id ) , ( docs[i].tfidf , docs[j].tfidf ) )

        class Reducer
            method Reduce( ( docId_A , docId_B ) , ( docA.tfidf , docB.tfidf ) )
                A = docA.tfidf
                B = docB.tfidf
                cosine = sum( A × B ) / ( sqrt( sum( A^2 ) ) × sqrt( sum( B^2 ) ) )
                return ( ( docId_A , docId_B ) , cosine )
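The pairwise step of Phase 4 can be sketched in Python as below. It assumes each document's TF-IDF vector is stored sparsely as a hypothetical `{term: weight}` dictionary, uses `itertools.combinations` to emit each unordered pair once, and returns 0.0 when either vector has zero norm (an assumption of this example).

```python
import math
from itertools import combinations

def map_pairs(docs):
    # Emit every unordered document pair once, with both TF-IDF vectors.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(docs.items()), 2):
        yield ((id_a, id_b), (vec_a, vec_b))

def reduce_cosine(pair, vectors):
    a, b = vectors
    dot = sum(w * b[t] for t, w in a.items() if t in b)   # shared terms only
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return (pair, cosine)

# Hypothetical sparse TF-IDF vectors.
docs = {
    "d1": {"big": 1.0, "data": 1.0},
    "d2": {"data": 2.0, "big": 2.0},
    "d3": {"hadoop": 3.0},
}
similarities = dict(reduce_cosine(pair, vecs) for pair, vecs in map_pairs(docs))
print(similarities)
```

For n documents this emits n(n-1)/2 pairs, so the number of reducer inputs grows quadratically with the corpus size.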
  • 14. Phase 4: Diagram
    Input → Map → Reduce → Output
    Map emits (document pair → the two TF-IDF vectors):
      Doc1,Doc2 → [Doc1 TF-IDF], [Doc2 TF-IDF]
      Doc1,Doc3 → [Doc1 TF-IDF], [Doc3 TF-IDF]
      Doc1,Doc4 → [Doc1 TF-IDF], [Doc4 TF-IDF]
      Doc4,Doc10 → [Doc4 TF-IDF], [Doc10 TF-IDF]
      DocM,DocN → [DocM TF-IDF], [DocN TF-IDF]
    Reduce emits (document pair → cosine similarity):
      Doc1,Doc2 → Cosine(Doc1, Doc2)
      Doc1,Doc3 → Cosine(Doc1, Doc3)
      Doc1,Doc4 → Cosine(Doc1, Doc4)
      Doc4,Doc10 → Cosine(Doc4, Doc10)
      DocM,DocN → Cosine(DocM, DocN)
  • 15. Conclusions & Future Work
    • Finalized proposed method
    • Implementation of the method
    • Experimental tests on real data and computer clusters
    • Deployment of an open-source project
    • Additional implementation using more efficient tools such as Apache Spark and Scala
    • Publication of test results