Big Data Processing using an AWS Dataset: Analysis
of the Co-occurrence Problem with MapReduce
Vishva Abeyrathne
School of Science
RMIT University
(Student)
Melbourne, Australia
s3735195@student.rmit.edu.au
Abstract— This paper discusses the problems of scaling
algorithms to big data, a topic on which much research has
been carried out. Cluster computing has been identified as the
best solution for big data processing, but it has drawbacks of
its own, and MapReduce was introduced as a programming model
to address them. A co-occurrence matrix records, for a given
word, the words that co-occur with it and how often. The Pairs
and Stripes approaches are compared by varying the size of the
dataset and the number of nodes assigned to the cluster.
Further optimizations are suggested for future work on the
dataset.
Keywords—MapReduce, Co-occurrence Matrix, Pairs,
Stripes, Combiner, Mapper, Common Crawl
I. INTRODUCTION
Data-driven approaches have contributed immensely to the
field of natural language processing over the last few years.
Much ongoing research aims to optimize processing tasks on
comparatively large datasets. Web-scale language models stand
out as one of the major scenarios in big data processing.
Applying those models to larger datasets has become
problematic, mainly because a single machine cannot handle
data of that size on its own. Distributing the computation
across multiple nodes has been identified as a workable
solution. This paper focuses on implementing scalable language
processing algorithms with MapReduce and cost-effective
cluster computing with Hadoop [1].
The rest of the paper is organized as follows. Section 2
describes MapReduce and its importance. Section 3 introduces
the co-occurrence problem, and Section 4 covers the
implementation. Section 5 describes the dataset, followed by
the results in Section 6. Finally, Section 7 discusses the
experiment.
II. MAPREDUCE
Distributed computing, in which computation is spread across
multiple processors, is regarded as a reliable and efficient
way to process large datasets. Even so, parallel algorithms
raised issues of their own, such as the cost of large
shared-memory machines. Much research went into finding an
alternative programming model for parallel computation. As a
result, MapReduce was introduced in 2004, with the capability
of applying computations to very large volumes of data coming
from multiple sources.
Key-value pairs are the basic data structure in MapReduce,
and the mapper and the reducer are the two main operations
behind all processing tasks. The mapper is applied to every
input key-value pair and generates intermediate key-value
pairs. Values with the same key are then gathered together,
and the reducer processes each group and emits the output
key-value pairs. Programmers only need to implement the mapper
and the reducer, while the runtime executes the job on the
nodes of a cluster with the help of a distributed file system.
As further optimizations, combiners and partitioners can be
implemented as part of a MapReduce program. A combiner
aggregates values with the same key on each cluster node
before they are passed to the reducers. The intermediate
key-value pairs are written to local disk. A partitioner
assigns intermediate key-value pairs to the available
reducers, so that values with the same key are reduced
together regardless of which mapper produced them [2].
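The flow described above can be sketched in a few lines. The following is a minimal in-memory simulation in Python, not Hadoop code and not the implementation used in this paper; the function names (mapper, combiner, partition, reducer, run_job) are hypothetical, each input line is treated as its own map task, and word counting is used only as a simple stand-in job.

```python
from collections import defaultdict

def mapper(_key, line):
    """Emit an intermediate (word, 1) pair for every token in the line."""
    for word in line.split():
        yield word, 1

def combiner(pairs):
    """Locally aggregate one map task's output before it is shuffled."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Assign a key to a reducer; the same key always goes to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers

def reducer(key, values):
    """Emit the final (key, total) pair for one key."""
    return key, sum(values)

def run_job(lines, num_reducers=2):
    # One dictionary of grouped values per simulated reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for offset, line in enumerate(lines):
        # Assumption: every input line is handled by its own map task.
        for key, value in combiner(mapper(offset, line)):
            buckets[partition(key, num_reducers)][key].append(value)
    output = {}
    for bucket in buckets:
        for key, values in bucket.items():
            out_key, total = reducer(key, values)
            output[out_key] = total
    return output

counts = run_job(["a b a", "b c", "a"])
print(counts)
```

The partition function here is deliberately a simple deterministic hash, so that all values for one key meet at the same reducer no matter which map task produced them.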
III. CO-OCCURRENCE PROBLEM
This case study measures the performance of two approaches
to the co-occurrence problem. The co-occurrence problem
involves building an N × N term co-occurrence matrix over all
the words within a given context, where N is the number of
distinct words. A co-occurring word, or neighbour, of a word
is defined using either a sliding window of a fixed size or a
sentence. The matrix has been used to calculate semantic
distances, which are useful for many tasks in language
processing. Pairs and Stripes are the two major approaches to
computing a co-occurrence matrix. In the pairs approach, the
key is a pair of co-occurring words and the value is the count
of that pair. The stripes approach differs in that it uses an
associative array to hold intermediate results: the key is a
single word, and the value is an associative array of all the
words that co-occur with it together with their counts [2].
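As an illustration of these definitions, the sketch below builds a small co-occurrence matrix on a single machine in Python. It assumes that a window of size w means up to w positions on either side of a word; the paper does not spell out this detail, and the helper names (neighbours, cooccurrence_matrix) are hypothetical.

```python
from collections import Counter, defaultdict

def neighbours(tokens, index, window):
    """Return the tokens within `window` positions on either side of tokens[index]."""
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    return [tokens[j] for j in range(start, end) if j != index]

def cooccurrence_matrix(tokens, window):
    """Build the matrix as a dict of rows, i.e. a sparse N x N matrix."""
    matrix = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for other in neighbours(tokens, i, window):
            matrix[word][other] += 1
    return matrix

tokens = "the cat sat on the mat".split()
matrix = cooccurrence_matrix(tokens, window=1)

# The same information seen through the two representations:
pair_key, pair_value = ("the", "cat"), matrix["the"]["cat"]  # pairs: one cell
stripe_key, stripe_value = "the", dict(matrix["the"])        # stripes: one row
print(pair_key, pair_value)
print(stripe_key, stripe_value)
```

A pair addresses one cell of the matrix, while a stripe carries a whole row, which is why the two approaches trade the number of intermediate records against their size.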
IV. IMPLEMENTATION
This section describes the implementation of the pairs and
stripes approaches on Common Crawl data. In the pairs
approach, the mapper takes the input words and generates
intermediate key-value pairs, with a pair of co-occurring
words as the key and 1 as the value. The reducer sums all the
values for each unique key, that is, for each pair of
co-occurring words, and so produces the total count of every
co-occurrence in the dataset.
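A minimal sketch of the pairs approach follows. It is an in-memory Python simulation rather than the Java code used in the experiment; the names pairs_mapper, pairs_reducer and run_pairs are hypothetical, and the window is assumed to extend 2 positions on either side of each word.

```python
from collections import defaultdict

WINDOW = 2  # window size from the experiment; its exact semantics are assumed

def pairs_mapper(line, window=WINDOW):
    """Emit ((word, neighbour), 1) for every co-occurring pair in the line."""
    tokens = line.split()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (word, tokens[j]), 1

def pairs_reducer(pair, counts):
    """Sum all partial counts for one (word, neighbour) key."""
    return pair, sum(counts)

def run_pairs(lines):
    grouped = defaultdict(list)  # simulated shuffle: group values by key
    emitted = 0                  # number of intermediate records
    for line in lines:
        for key, value in pairs_mapper(line):
            grouped[key].append(value)
            emitted += 1
    result = dict(pairs_reducer(k, v) for k, v in grouped.items())
    return result, emitted

result, emitted = run_pairs(["a b c", "a b"])
print(result, emitted)
```

Every co-occurrence becomes its own intermediate record, so the number of records grows with the number of word pairs in the input.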
Compared to the pairs approach, the stripes approach emits
fewer intermediate results, although each one is larger
because it carries an associative array. For each word, the
mapper collects all of its co-occurring words into an
associative array and emits the word as the key and the array
as the value. Finally, the reducer aggregates the intermediate
key-value pairs by summing, element by element, all the
associative arrays that share the same key.
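The stripes approach can be sketched in the same way. Again this is an in-memory Python simulation under the same assumed window semantics (2 positions on either side), with hypothetical names (stripes_mapper, stripes_reducer, run_stripes); it is not the Java code used in the experiment.

```python
from collections import Counter, defaultdict

WINDOW = 2  # window size from the experiment; its exact semantics are assumed

def stripes_mapper(line, window=WINDOW):
    """Emit (word, stripe) once per word occurrence; the stripe counts its neighbours."""
    tokens = line.split()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        stripe = Counter(tokens[j] for j in range(lo, hi) if j != i)
        if stripe:
            yield word, stripe

def stripes_reducer(word, stripes):
    """Element-wise sum of all stripes that share the same key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return word, dict(total)

def run_stripes(lines):
    grouped = defaultdict(list)  # simulated shuffle: group stripes by word
    emitted = 0                  # number of intermediate records
    for line in lines:
        for word, stripe in stripes_mapper(line):
            grouped[word].append(stripe)
            emitted += 1
    result = dict(stripes_reducer(w, s) for w, s in grouped.items())
    return result, emitted

result, emitted = run_stripes(["a b c", "a b"])
print(result, emitted)
```

On the input ["a b c", "a b"] this emits 5 intermediate records, one per word occurrence, whereas emitting one record per co-occurring pair would produce 8 for the same input.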
Java was used as the programming language for the MapReduce
implementation, and only a few lines of code were required.
The MapReduce runtime is responsible for all the partitioning
that takes place before the reducer runs, and it guarantees
that values with the same key are aggregated together. These
features allow programmers to focus on the implementation
while the runtime manages all the other cluster-related
requirements.
V. DATASET
Data was collected from the Common Crawl corpus for the
experiment. Subsets of different sizes were selected to
observe how the processing time changes as the dataset grows.
The WET format, which contains plain text, was used to
compare the performance of the pairs and stripes approaches.
The experimental data consists of three subsets of 150 MB,
100 MB and 75 MB. The 150 MB subset is used to compare pairs
and stripes with respect to the number of nodes in the
cluster. All three subsets are used to analyse the
performance of the two approaches with respect to data size.
VI. RESULTS
As discussed in Section 4, the pairs and stripes approaches
were both tested on the same dataset with different numbers of
nodes in the cluster. A window size of 2 was used for the
co-occurrence matrix. Both approaches were also assessed with
increasing data size while the number of nodes in the cluster
was kept the same.
TABLE I. COMPARISON OF APPROACHES WITH CLUSTER NODES
Cluster Nodes   Computation Time (Pairs)   Computation Time (Stripes)
2 Nodes         41m 09s                    16m 38s
4 Nodes         39m 43s                    16m 29s
6 Nodes         39m 40s                    16m 27s
8 Nodes         39m 31s                    16m 14s
10 Nodes        39m 05s                    16m 11s
According to the results, the stripes approach clearly
outperformed the pairs approach in this case study: for every
cluster size, its elapsed time was less than half that of the
pairs approach.
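The effect of cluster size can be quantified directly from Table I. The short Python calculation below converts the reported times to seconds and computes the speedup of each configuration relative to the 2-node run; the helper name to_seconds and the variable names are hypothetical, and the figures are copied from the table.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XXm YYs' reading from the table into seconds."""
    return minutes * 60 + seconds

# nodes -> (pairs time, stripes time), copied from Table I
table_i = {
    2: (to_seconds(41, 9), to_seconds(16, 38)),
    4: (to_seconds(39, 43), to_seconds(16, 29)),
    6: (to_seconds(39, 40), to_seconds(16, 27)),
    8: (to_seconds(39, 31), to_seconds(16, 14)),
    10: (to_seconds(39, 5), to_seconds(16, 11)),
}

base_pairs, base_stripes = table_i[2]
speedups = {}
for nodes, (pairs_s, stripes_s) in table_i.items():
    # Speedup relative to the 2-node run of the same approach.
    speedups[nodes] = (base_pairs / pairs_s, base_stripes / stripes_s)
    print(f"{nodes:>2} nodes: pairs speedup {speedups[nodes][0]:.3f}, "
          f"stripes speedup {speedups[nodes][1]:.3f}, "
          f"stripes/pairs time {stripes_s / pairs_s:.2f}")
```

With these figures, going from 2 to 10 nodes speeds up the pairs run by about 5% and the stripes run by about 3%, so on a 150 MB input the choice of approach matters far more than the number of nodes.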
TABLE II. COMPARISON OF APPROACHES WITH DATA SIZE
Dataset Size    Computation Time (Pairs)   Computation Time (Stripes)
75 MB           19m 56s                    9m 56s
100 MB          27m 16s                    12m 43s
150 MB          41m 09s                    16m 38s
As Table II shows, computation time increases with the size
of the data for both approaches. The stripes approach,
however, performed much better as the dataset grew, while the
time of the pairs approach grew roughly linearly with the
data size.
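The scaling behaviour can be checked with a little arithmetic on Table II. The Python snippet below computes the time per megabyte for each approach and the ratio between the two; the helper name to_seconds and the variable names are hypothetical, and the figures are copied from the table.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XXm YYs' reading from the table into seconds."""
    return minutes * 60 + seconds

# (dataset size in MB, pairs time, stripes time), copied from Table II
table_ii = [
    (75, to_seconds(19, 56), to_seconds(9, 56)),
    (100, to_seconds(27, 16), to_seconds(12, 43)),
    (150, to_seconds(41, 9), to_seconds(16, 38)),
]

per_mb = {}
for size_mb, pairs_s, stripes_s in table_ii:
    # Seconds spent per megabyte of input for each approach.
    per_mb[size_mb] = (pairs_s / size_mb, stripes_s / size_mb)
    print(f"{size_mb:>3} MB: pairs {per_mb[size_mb][0]:.1f} s/MB, "
          f"stripes {per_mb[size_mb][1]:.1f} s/MB, "
          f"pairs/stripes = {pairs_s / stripes_s:.2f}")
```

The pairs approach stays close to 16 s/MB at every size, which is what linear growth looks like, while the cost per megabyte of the stripes approach falls from about 7.9 to about 6.7 s/MB, so the gap between the two widens from roughly 2.0x to roughly 2.5x.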
VII. DISCUSSION
Compared to other work in this domain, this study needs
further optimization to arrive at a more efficient solution.
Most ongoing research, and research carried out over the past
few years, has made use of an in-mapper combiner that
aggregates results before intermediate output is generated. A
combiner reduces the number of intermediate key-value pairs by
keeping a local count for all the words processed by each
mapper separately.
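In-mapper combining can be sketched as follows. This Python class is an illustration only, not code from this study or from Hadoop: the class and method names are hypothetical, the window is assumed to extend 2 positions on either side of a word, and the pairs approach is used as the example.

```python
from collections import Counter

class InMapperCombiningPairsMapper:
    """Pairs mapper that keeps local counts and emits them once at the end."""

    def __init__(self, window=2):
        self.window = window
        self.local_counts = Counter()  # state kept across all map() calls

    def map(self, line):
        """Count co-occurring pairs locally instead of emitting (pair, 1)."""
        tokens = line.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.local_counts[(word, tokens[j])] += 1

    def close(self):
        """Emit one (pair, count) record per distinct pair seen by this mapper."""
        yield from self.local_counts.items()

mapper = InMapperCombiningPairsMapper()
for line in ["a b c", "a b"]:
    mapper.map(line)
emitted = list(mapper.close())
print(emitted)
```

For the two input lines above, the mapper emits 6 records, one per distinct pair, where emitting a record for every single co-occurrence would give 8. The price is that the local counts must fit in the mapper's memory.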
The implementation of a partitioner has been discussed in
several studies related to this case study. A partitioner
makes the reducers' work more efficient because it decides
exactly which reducer a particular key-value pair is sent to.
As a further improvement, pre-processing could be applied to
the Common Crawl dataset, mainly to remove unnecessary tokens
before the data is processed with either approach. The
results suggest that the stripes approach is the more
effective of the two in both scenarios.
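One possible form of such pre-processing is sketched below in Python. The cleaning rules are assumptions chosen for illustration (lower-casing, dropping URLs, keeping only alphabetic tokens with an optional apostrophe); the study does not specify which rules it would apply, and the names URL_RE, TOKEN_RE and clean_tokens are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")          # assumed rule: drop URLs entirely
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # assumed rule: alphabetic tokens only

def clean_tokens(line):
    """Lower-case a line, remove URLs, and return its alphabetic tokens."""
    text = URL_RE.sub(" ", line.lower())
    return TOKEN_RE.findall(text)

sample = "Hello, World! It's 2019 -- see https://example.com/page?id=1 now"
print(clean_tokens(sample))
```

Running the mappers on the cleaned token list instead of the raw line would shrink the vocabulary, and with it the number of distinct keys that reach the reducers.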
CONCLUSION
This paper analysed the processing of Common Crawl data for
the co-occurrence problem. The Pairs and Stripes approaches
were compared by increasing the size of the dataset as well as
by adding more nodes to the cluster. Further optimization
with a partitioner and a combiner could provide considerably
better running times. More pre-processing could be applied to
achieve more accurate results, with unnecessary words and
tokens removed prior to the main analysis.
REFERENCES
[1] Lin, J. (2008, October). Scalable language processing algorithms for
the masses: A case study in computing word co-occurrence matrices
with MapReduce. In proceedings of the conference on empirical
[2] Lin, J., & Dyer, C. (2010). Data-intensive text processing with
MapReduce. Synthesis Lectures on Human Language
Technologies, 3(1), 1.
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

Big Data Co-occurrence Analysis with MapReduce

Big Data Processing using a AWS Dataset: Analysis of Co-occurrence problem with MapReduce

Vishva Abeyrathne
School of Science (student)
RMIT University (Student)
Melbourne, Australia
s3735195@student.rmit.edu.au

Abstract: Scaling algorithms to big data is a persistent problem, and much research has been carried out to overcome it. Cluster computing has been identified as the best solution for big data processing. Despite that, some drawbacks remained, and MapReduce was introduced as a programming model to address them. A co-occurrence matrix is used to identify the words that co-occur with a given word and their frequencies. The Pairs and Stripes approaches are compared by varying the size of the dataset and the number of nodes assigned to the cluster. Further optimizations are suggested for future work on the dataset.

Keywords: MapReduce, Co-occurrence Matrix, Pairs, Stripes, Combiner, Mapper, Common Crawl

I. INTRODUCTION

Data-driven approaches have contributed immensely to the field of natural language processing over the last few years. Much ongoing research aims to optimize processing tasks by using comparatively larger datasets. Web-scale language models stand out as one of the major scenarios in big data processing. Applying those models to larger datasets has become problematic, mainly because a single machine cannot handle data of that size on its own. Distributing the computation across multiple nodes has been identified as a workable solution. This paper focuses on implementing scalable language processing algorithms using MapReduce and cost-effective cluster computing with Hadoop [1].

The rest of the paper is organized as follows. The next section describes MapReduce and its importance.
Section 3 introduces the co-occurrence problem, whereas Section 4 covers the implementation. Section 5 describes the dataset that is used, followed by the results in Section 6. Finally, Section 7 discusses the experiment.

II. MAPREDUCE

Distributed computing is regarded as the most reliable and efficient way to process large datasets, since the computation can be spread across multiple processors. Although it is a good solution, parallel algorithms still raised issues such as the cost of large shared-memory machines. Much research has gone into finding an alternative programming model for parallel computation. As a result, MapReduce was introduced in 2004 with the ability to apply computations and perform the necessary processing on very large volumes of data coming from multiple sources.

Key-value pairs are the central data structure in MapReduce, and the mapper and the reducer are the main operations behind all processing tasks. The mapper is applied to every input key-value pair and generates intermediate key-value pairs as a result. The reducer emits the output key-value pairs; values with the same key are gathered together before the calculation. Programmers only need to implement the relevant mapper and reducer, while the runtime handles execution across the nodes of a cluster on top of a distributed file system.

As further optimizations, combiners and partitioners can be implemented as part of a MapReduce program. Combiners aggregate values that share the same key on the respective cluster node before the data moves on to the reducer. All the generated key-value pairs are written to local disk. Partitioners assign intermediate key-value pairs to the available reducers so that values with the same key are reduced together regardless of which mapper produced them [2].

III. CO-OCCURRENCE PROBLEM

This case study is primarily concerned with measuring the performance of the two approaches to the co-occurrence problem. The co-occurrence problem consists of building an N * N term co-occurrence matrix from all the words within a given context. A co-occurring word, or neighbour, of a word is defined using either a sliding window of a specific size or a sentence. The resulting matrix has been used to calculate semantic distances, which are useful for many language processing tasks.

Pairs and Stripes are the two major approaches to building the co-occurrence matrix. In the pairs approach, the key is always a pair of co-occurring words and the value is the count for that pair. The stripes approach differs in that it uses an associative array to hold intermediate results. Its key is a specific word and its value is an associative array that holds all the words co-occurring with that word together with their counts [2].

IV. IMPLEMENTATION

This section focuses on implementing the pairs and stripes approaches using Common Crawl data. In the pairs approach, the mapper takes all input words and generates intermediate key-value pairs with each pair of co-occurring words as the key and 1 as the value. The reducer sums all the values that belong to a unique key, that is, to a unique pair of co-occurring words, and so produces the aggregated count of every co-occurrence in the given dataset.
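The two key-value layouts described above can be illustrated with a minimal single-machine sketch. It is not the paper's Hadoop code: the class name `CooccurrenceSketch`, the helper `neighbours`, and the comma-joined string used as the pair key are all hypothetical choices made for illustration, and the map and reduce steps are collapsed into in-memory maps so that the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, no Hadoop dependency.
final class CooccurrenceSketch {

    // Hypothetical helper: the neighbours of tokens[i] are the tokens
    // at most `window` positions to its left or right.
    static List<String> neighbours(String[] tokens, int i, int window) {
        List<String> out = new ArrayList<>();
        int from = Math.max(0, i - window);
        int to = Math.min(tokens.length - 1, i + window);
        for (int j = from; j <= to; j++) {
            if (j != i) {
                out.add(tokens[j]);
            }
        }
        return out;
    }

    // Pairs: the "mapper" emits ((word, neighbour), 1) for every co-occurrence,
    // and the "reducer" sums the ones per key. merge() plays the reducer here.
    static Map<String, Integer> pairs(String[] tokens, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < tokens.length; i++) {
            for (String u : neighbours(tokens, i, window)) {
                counts.merge(tokens[i] + "," + u, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Stripes: the "mapper" emits (word, {neighbour -> count}) once per position,
    // and the "reducer" adds the associative arrays of a word element by element.
    static Map<String, Map<String, Integer>> stripes(String[] tokens, int window) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> stripe = new HashMap<>();
            for (String u : neighbours(tokens, i, window)) {
                stripe.merge(u, 1, Integer::sum);
            }
            Map<String, Integer> total =
                    result.computeIfAbsent(tokens[i], k -> new TreeMap<>());
            stripe.forEach((u, c) -> total.merge(u, c, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "a b a c".split(" ");
        System.out.println(pairs(tokens, 1));
        System.out.println(stripes(tokens, 1));
    }
}
```

For the same input, the pairs version produces one record per word pair, while the stripes version produces one record per word, which is why stripes emits fewer but larger intermediate results.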
Compared to the pairs approach, the stripes approach emits fewer intermediate results, although each one is larger because it carries an associative array. All the co-occurring words are placed in an associative array, and the mapper outputs key-value pairs in which the key is a word and the value is the associative array for that word. Finally, the reducer aggregates the intermediate key-value pairs by summing all the associative arrays that belong to each unique key, or word.

Java was used as the main programming language for the MapReduce implementation; it required only a few lines of code. The program handles the necessary partitioning before the reduce phase and guarantees that values with the same key are aggregated together. These features let programmers focus on the implementation while the runtime manages all the other cluster-related requirements.

V. DATASET

Data was collected from the Common Crawl corpus to perform the experiment. Different subsets were selected to observe how the data processing tasks perform as the size of the dataset increases. Data in WET format, which contains plain text, is used to compare the performance of the pairs and stripes approaches with respect to the number of nodes in the cluster. The experimental data consists of a 150 MB, a 100 MB and a 75 MB dataset. The 150 MB dataset is used to compare pairs and stripes with respect to the number of nodes in the cluster. All three datasets are used to analyze the performance of the two approaches with different data sizes.

VI. RESULTS

As discussed in Section 4, the performance of both the pairs and the stripes approach was tested on the same dataset with different numbers of nodes in the cluster. A window size of 2 was used for the co-occurrence matrix in the experiment.
The performance of both approaches was also assessed with increasing data size while keeping the number of nodes in the cluster the same.

TABLE I. COMPARISON OF APPROACHES WITH CLUSTER NODES

Cluster Nodes   Computation Time (Pairs)   Computation Time (Stripes)
2 Nodes         41m 9s                     16m 38s
4 Nodes         39m 43s                    16m 29s
6 Nodes         39m 40s                    16m 27s
8 Nodes         39m 31s                    16m 14s
10 Nodes        39m 05s                    16m 11s

According to the results, the stripes approach clearly worked better than the pairs approach in this case study, by a considerable margin: roughly 16 minutes against roughly 40 minutes at every cluster size. In terms of elapsed time, the stripes approach was far more efficient than the pairs approach.

TABLE II. COMPARISON OF APPROACHES WITH DATA SIZE

Dataset Size   Computation Time (Pairs)   Computation Time (Stripes)
75 MB          19m 56s                    9m 56s
100 MB         27m 16s                    12m 43s
150 MB         41m 9s                     16m 38s

As Table II shows, computation time tends to increase as the size of the data increases, and efficiency goes down. The stripes approach nevertheless performed much better as the dataset grew, while the pairs approach followed a roughly linear model.

VII. DISCUSSION

Compared to other work in this domain, this study needs further optimization before it reaches an efficient solution. Most ongoing research, as well as research conducted over the past few years, has had the advantage of an in-mapper combiner that runs before the intermediate results are generated. A combiner can reduce the number of intermediate key-value pairs by keeping a local count for all the words processed by each mapper separately. The implementation of a partitioner has also been discussed in several studies related to this case study. A partitioner would make the reducers' work more efficient, since it decides exactly which reducer a particular key-value pair is sent to.
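The two optimizations mentioned in the discussion can be sketched in plain Java as follows. This is an illustration under stated assumptions, not code from the study: the class `InMapperCombiningSketch` and its methods are hypothetical, the per-task buffer stands in for state that a real mapper would hold between calls, and `partition` uses the usual non-negative hash modulo the number of reducers, which is the same arithmetic as Hadoop's default hash partitioner.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: plain Java, no Hadoop dependency.
final class InMapperCombiningSketch {

    // Local counts kept for the lifetime of one map task.
    private final Map<String, Integer> buffer = new HashMap<>();

    // Called once for every (pair, 1) record that a plain mapper would emit.
    // Instead of emitting, the count is accumulated locally.
    void map(String pairKey) {
        buffer.merge(pairKey, 1, Integer::sum);
    }

    // Called once when the map task finishes: emits a single record per
    // distinct key and resets the buffer.
    Map<String, Integer> close() {
        Map<String, Integer> out = new HashMap<>(buffer);
        buffer.clear();
        return out;
    }

    // Chooses the reducer for a key. Masking with Integer.MAX_VALUE clears
    // the sign bit so the result is never negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        for (String k : new String[] {"a,b", "a,b", "b,a", "a,b"}) {
            mapper.map(k);
        }
        // Four input records collapse into two emitted records.
        mapper.close().forEach((k, v) ->
                System.out.println(k + " -> " + v + " (reducer " + partition(k, 4) + ")"));
    }
}
```

Because equal keys always hash to the same partition, every partial count for a given word pair reaches the same reducer no matter which mapper produced it. The trade-off of buffering inside the mapper is memory: the buffer grows with the number of distinct keys seen by one task.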
As an improvement, some pre-processing can be applied to the Common Crawl dataset, mainly to remove unnecessary syntax before the data is processed with either approach. The results suggest that the stripes approach is the more effective of the two in both scenarios.

CONCLUSION

This paper analyzes the processing of Common Crawl data for the co-occurrence problem. The Pairs and Stripes approaches were compared by increasing the size of the dataset as well as by adding more nodes to the cluster. Further optimization with a partitioner and a combiner could give far more efficient results in terms of running time. Additional pre-processing, in which unnecessary words and tokens are removed before the main analysis, could yield more accurate results.

REFERENCES

[1] Lin, J. (2008, October). Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. In proceedings of the conference on empirical
[2] Lin, J., & Dyer, C. (2010). Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies, 3(1), 1.