Big Data Processing Using an AWS Dataset: Analysis of the Co-occurrence Problem with MapReduce
Vishva Abeyrathne (Student)
School of Science
RMIT University
Melbourne, Australia
s3735195@student.rmit.edu.au
Abstract— This paper discusses the problems of scaling algorithms to big data and the extensive research carried out to overcome them. Cluster computing has been identified as the best solution for big data processing; nevertheless, it retains some drawbacks, and MapReduce was introduced as a programming model to tackle them. A co-occurrence matrix is used to identify the co-occurring words and their frequencies for a given word. The Pairs and Stripes approaches are used to comparatively analyse the performance of the program while varying the size of the dataset and the number of nodes assigned to the cluster. Further optimizations are suggested to better conduct research on the dataset.
Keywords—MapReduce, Co-occurrence Matrix, Pairs,
Stripes, Combiner, Mapper, Common Crawl
I. INTRODUCTION
Data-driven approaches have contributed immensely to the field of natural language processing over the last few years. Much of the ongoing research aims to optimize processing tasks using comparatively larger datasets. Web-scale language models stand out as one of the major scenarios in big data processing, and applying those models to larger datasets has become problematic, chiefly because a single machine cannot handle such big data on its own. It has been identified that distributing the computation across multiple nodes works well. This paper focuses on implementing scalable language processing algorithms using MapReduce and cost-effective cluster computing with Hadoop [1].
The rest of the paper is organized as follows. The next section explains what MapReduce is and why it is important. Section 3 introduces the co-occurrence problem, while section 4 covers the implementation. Section 5 describes the dataset used, followed by the results in section 6. Finally, section 7 discusses the experiment.
II. MAPREDUCE
Distributed computing is identified as the most reliable and efficient way to process large datasets, since the computation can be spread across multiple processors. Despite being a good solution, parallel algorithms still raised issues such as the cost of large shared-memory machines. Much research has been carried out to find an alternative programming model for parallel computation. Consequently, MapReduce was introduced in 2004, capable of applying computations and performing the necessary processing over huge volumes of data coming from multiple sources.
Key-value pairs are the principal data structure in MapReduce, with the mapper and the reducer being the main operations behind all processing tasks. The mapper is applied to every input key-value pair and generates intermediate key-value pairs as its output. The reducer emits output key-value pairs after all values sharing the same key have been gathered together. Programmers only need to implement the relevant mapper and reducer, while the runtime executes them across the cluster nodes on top of a distributed file system. As further optimisations, combiners and partitioners can be implemented as part of a MapReduce program. A combiner aggregates values with the same key on the respective cluster node before they are passed on to the reducer, and the resulting key-value pairs are written to local disk. A partitioner assigns intermediate key-value pairs to the available reducers so that values with the same key are reduced together regardless of which mapper produced them [2].
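As a minimal sketch of this programming model (illustrative class names, not part of the experiments reported here), the following Hadoop mapper and reducer count word occurrences; because the summation is associative, the same reducer can also be registered as a combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in the input record.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reducer (also usable as a combiner): sums the counts gathered for each key.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}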
III. CO-OCCURRENCE PROBLEM
This case study is primarily concerned with measuring the performance of the two approaches to the co-occurrence problem. The co-occurrence problem consists of building an N * N term co-occurrence matrix over all the words in a given corpus. A co-occurring word, or neighbour of a word, is defined either by a sliding window of a specific size or by the enclosing sentence. This problem has been used to compute semantic distances, which are useful for many tasks in language processing. Pairs and Stripes are the two major approaches to building the co-occurrence matrix. In the pairs approach, the key is the pair of co-occurring words and the value is the count of that co-occurrence. The stripes approach is different in that it uses an associative array to hold intermediate results: the key is a specific word, and the value is an associative array containing all its co-occurring words and their respective counts [2].
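As a small worked illustration (constructed here, not drawn from the experimental dataset), consider the fragment "big data needs big clusters" with a window of one. The pairs approach emits one key-value pair per observed co-occurrence, such as ((big, data), 1), ((data, big), 1), ((data, needs), 1) and ((big, clusters), 1), and the reducer sums the counts of each distinct pair. The stripes approach emits one associative array per word occurrence, such as (data, {big: 1, needs: 1}) and (big, {needs: 1, clusters: 1}), and the reducer merges all arrays for a word element-wise, yielding for example (big, {data: 1, needs: 1, clusters: 1}).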
IV. IMPLEMENTATION
This section focuses on implementing the pairs and stripes approaches over the Common Crawl data. In the pairs approach, the mapper reads the input words and generates intermediate key-value pairs whose keys are the co-occurring word pairs, each with a value of 1. The reducer sums all the values belonging to each unique key and produces the aggregated count of every co-occurrence in the given dataset.
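A minimal sketch of the pairs mapper is given below, assuming whitespace-tokenised WET text and the window size of 2 used in the experiments (class names are illustrative); the sum reducer sketched in section 2 can be reused unchanged.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs mapper: emits ((w, u), 1) for every neighbour u of w inside the window.
class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final int WINDOW = 2;
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            int from = Math.max(0, i - WINDOW);
            int to = Math.min(terms.length - 1, i + WINDOW);
            for (int j = from; j <= to; j++) {
                if (i == j) continue;
                pair.set(terms[i] + "," + terms[j]);  // the word pair itself is the key
                context.write(pair, ONE);
            }
        }
    }
}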
Compared to the pairs approach, the stripes approach emits fewer intermediate results, although each of them is larger because of the associative arrays. All the co-occurring words of a term are collected into an associative array, and the mapper outputs the word as the key and the associative array relevant to that word as the value. Finally, the reducer aggregates the intermediate key-value pairs by summing, element by element, all the associative arrays that belong to each unique key.
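A corresponding sketch of the stripes mapper and reducer, again assuming a window of 2 and using Hadoop's MapWritable for the associative arrays (class names are illustrative):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Stripes mapper: emits (w, {u1: c1, u2: c2, ...}) for every occurrence of w.
class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    private static final int WINDOW = 2;
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            MapWritable stripe = new MapWritable();
            int from = Math.max(0, i - WINDOW);
            int to = Math.min(terms.length - 1, i + WINDOW);
            for (int j = from; j <= to; j++) {
                if (i == j) continue;
                Text neighbour = new Text(terms[j]);
                IntWritable count = (IntWritable) stripe.getOrDefault(neighbour, new IntWritable(0));
                stripe.put(neighbour, new IntWritable(count.get() + 1));
            }
            word.set(terms[i]);
            context.write(word, stripe);
        }
    }
}

// Stripes reducer: merges all stripes for a word by element-wise summation.
class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
            throws IOException, InterruptedException {
        MapWritable total = new MapWritable();
        for (MapWritable stripe : stripes) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                IntWritable sum = (IntWritable) total.getOrDefault(e.getKey(), new IntWritable(0));
                total.put(e.getKey(), new IntWritable(sum.get() + ((IntWritable) e.getValue()).get()));
            }
        }
        context.write(key, total);
    }
}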
Java is used as the main programming language for the MapReduce implementation, and only a small number of lines of code are required. The framework is responsible for all the necessary partitioning before the data reaches the reducer, and it guarantees that values with the same key are aggregated together. These features allow programmers to focus on the implementation while the runtime manages all the other cluster-level concerns.
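A minimal driver for the pairs job might look as follows, reusing the classes sketched above (the job name and the use of command-line arguments for the input and output paths are illustrative assumptions, not details reported in this paper):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the pairs mapper and sum reducer into a Hadoop job.
public class CooccurrenceDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pairs co-occurrence");
        job.setJarByClass(CooccurrenceDriver.class);
        job.setMapperClass(PairsMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input WET text
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}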
V. DATASET
Data was collected from the Common Crawl corpus to perform the experiment. Different subsets were selected in order to observe how the data processing tasks behave as the size of the dataset increases. The WET format, which contains plain text, was used to compare the performance of the pairs and stripes approaches with respect to the number of nodes in the cluster. The experimental data consists of three subsets of 150 MB, 100 MB and 75 MB. The 150 MB subset was used for the experiment on pairs and stripes with respect to the number of nodes in the cluster, while all three subsets were used to analyse the performance of the two approaches as the data size changes.
VI. RESULTS
As discussed in section 4, the performance of both the pairs and stripes approaches was tested on the same dataset while varying the number of nodes in the cluster. A window size of 2 was used for the co-occurrence matrix throughout the experiment. The performance of both approaches was also assessed as the data size increased while keeping the number of nodes in the cluster constant.
TABLE I. COMPARISON OF APPROACHES WITH CLUSTER NODES

Cluster Nodes | Computation Time (Pairs) | Computation Time (Stripes)
2 Nodes       | 41m 9s                   | 16m 38s
4 Nodes       | 39m 43s                  | 16m 29s
6 Nodes       | 39m 40s                  | 16m 27s
8 Nodes       | 39m 31s                  | 16m 14s
10 Nodes      | 39m 05s                  | 16m 11s
According to the results, the stripes approach clearly outperforms the pairs approach in this case study, with a considerably shorter elapsed time in every cluster configuration.
TABLE II. COMPARISON OF APPROACHES WITH DATA SIZE

Dataset Size | Computation Time (Pairs) | Computation Time (Stripes)
75 MB        | 19m 56s                  | 9m 56s
100 MB       | 27m 16s                  | 12m 43s
150 MB       | 41m 9s                   | 16m 38s
As Table II shows, computation time increases, and efficiency drops, as the size of the data grows. Even so, the stripes approach continues to perform much better as the dataset grows, while the running time of the pairs approach increases roughly linearly with the data size.
VII. DISCUSSION
Compared with earlier work in this domain, this research still requires further optimization to reach a minimal solution. Most of the related work, both ongoing and from the past few years, has had the benefit of an in-mapper combiner applied before the intermediate results are emitted. Such a combiner reduces the volume of intermediate key-value pairs by keeping a local count of all the words processed by each mapper separately, as sketched below.
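A minimal sketch of in-mapper combining for the pairs approach (illustrative only; counts are held in an in-memory map and flushed once per mapper in cleanup(), which assumes the map fits in memory):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: pair counts are accumulated locally and emitted
// only once, in cleanup(), shrinking the intermediate output.
class InMapperPairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int WINDOW = 2;
    private final Map<String, Integer> localCounts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            int from = Math.max(0, i - WINDOW);
            int to = Math.min(terms.length - 1, i + WINDOW);
            for (int j = from; j <= to; j++) {
                if (i == j) continue;
                localCounts.merge(terms[i] + "," + terms[j], 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : localCounts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}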
The implementation of a partitioner has been discussed in several studies related to this case study. A partitioner would make the reduce phase more efficient, since it decides exactly which reducer a particular key-value pair is sent to; a sketch is given at the end of this section. As a further improvement, some pre-processing could be applied to the Common Crawl dataset, mainly to remove unnecessary tokens before the data is processed with either approach. The results suggest that the stripes approach is the more effective of the two in both scenarios.
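A minimal sketch of such a partitioner for the pairs keys (an assumption about how the keys could be routed, not something implemented in this experiment): it partitions every pair on its left word, so all pairs sharing the same first word reach the same reducer; it would be registered in the driver with job.setPartitionerClass(LeftWordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions pair keys of the form "word,neighbour" on the left word only.
class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String leftWord = key.toString().split(",", 2)[0];
        return (leftWord.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}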
CONCLUSION
This paper has presented an analysis of processing Common Crawl data for the co-occurrence problem. The pairs and stripes approaches were compared while increasing the size of the dataset as well as adding more nodes to the cluster. Further optimizations, namely a partitioner and a combiner, could deliver considerably better running times. More pre-processing work could also be used to achieve more accurate results, with unnecessary words and tokens removed prior to the main analysis.
REFERENCES
[1] Lin, J. (2008, October). Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
[2] Lin, J., & Dyer, C. (2010). Data-intensive text processing with
MapReduce. Synthesis Lectures on Human Language
Technologies, 3(1), 1.