Big Data Mining - Classification, Techniques and Issues
Karan Deep Singh (ka_ingh@encs.concordia.ca), Yeghia Koronian (y_koroni@encs.concordia.ca), Gelareh Tavako Saberi (g_tavako@encs.concordia.ca)
Masters in Computer Science, Concordia University
Abstract—At this moment, the data deluge is continuously producing large amounts of data in various sectors of modern society. Such data are called big data. Big data comprise datasets originating both in our physical real world and in social media, and they are difficult to manage with current methodologies or data mining software tools due to their large size and complexity. Big data mining is the capability of extracting useful information from these large datasets or streams of data, and it provides robust solutions for overcoming the issues caused by volume, variability and velocity. We present in this paper a broad overview of the topic, its current status, and techniques such as NoSQL, MapReduce and Hadoop.
Keywords - Big Data Mining, Mining Techniques, NoSQL,
Hadoop, MapReduce.
1. INTRODUCTION
In the present age, large amounts of data are produced every moment in various fields, such as science, the Internet, and physical systems. Such phenomena are collectively called the data deluge [Mcfedries 2011]. According to research carried out by IDC [IDC 2008, IDC 2012], the amount of data generated and reproduced all over the world every year is estimated at 161 exabytes, and data are predicted to grow rapidly at a rate of 10x every five years [1]. Meanwhile, the computing capacity of general-purpose computers grows by about 58% annually [2]. Consider Internet data: the web pages indexed by Google numbered around one million in 1998, quickly reached one billion in 2000, and had already exceeded one trillion by 2008. This rapid expansion is accelerated by the dramatic rise of social networking applications, such as Facebook, Twitter, Weibo, etc., that allow users to create content freely and amplify the already huge Web volume.
Thus, the term “Big Data” names a critical issue that needs serious attention [3,4]. The coining of the term is credited to two people: first, John Mashey, chief scientist at Silicon Graphics in the 1990s, who gave the talk “Big Data and the Next Wave of InfraStress” in 1998; second, Francis X. Diebold, an economist at the University of Pennsylvania, for his paper “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting” (2000) [5].
We introduce big data mining and its applications in Section 2. We discuss some data mining techniques in Section 3. Then we discuss issues and challenges in Section 4.
2. BIG DATA MINING
The origin of the term ‘Big Data’ is due to the fact that we are creating a huge amount of data every day. Usama Fayyad [11], in his invited talk at the KDD BigMine’12 Workshop, presented striking figures about Internet usage, among them the following: Google receives more than 1 billion queries per day, Twitter more than 250 million tweets per day, Facebook more than 800 million updates per day, and YouTube more than 4 billion views per day. The data produced nowadays is estimated to be on the order of zettabytes, and it is growing by around 40% every year.
There are mainly three categories of big data: structured, semi-structured and unstructured data. In today’s world, structured data represent only 5 to 10% of all informatics data. Structured data is data that can be stored in a SQL database, in tables with specific rows and columns [7]. Semi-structured data likewise represents a small share of all data (approximately 5 to 10%). This type of data does not have the precise organizational structure of structured data, which fits into tables. In other words, semi-structured data is associated with metadata. Metadata is the term we use to describe the content and context of data files, e.g., means of creation, purpose, time and date of creation, and author [9]. In particular, XML documents are typical semi-structured documents. Moreover, NoSQL databases are considered semi-structured [7].
The eminent challenge is to find ways to cope with unstructured data, which is everywhere and is the dominant kind, streaming in as text, images, audio and video. It represents about 80% of all data [7].
2.1 Big Data Definition - 3 V’s:
In today's world, organizations are bombarded with bulk information, yet the share of data that can be analyzed is declining. The reason is that 80% of the data is in semi-structured and unstructured formats, and thus we need new algorithms and new toolsets to deal with all this data.
The features of big data can be summarized as follows:
• Volume: The quantity of data is extraordinary, but the percentage of data that our tools can process is not.
• Variety: The kinds of data have expanded into unstructured text, audio, video, graphs and XML.
• Velocity: Data arrive continuously as streams, and the speed at which data are generated is very high.
Therefore, big data are often characterized by the “3 Vs”, after the initial letters of these three terms: Volume, Variety, and Velocity. Apart from these, there is another factor, Variability, which corresponds to changes in the structure of the data and in how users want to interpret that data.
Gartner [15] summarized this in its 2012 definition of Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
2.2 Data Mining
Data mining is, in a nutshell, the discovery of frequent patterns and meaningful structures appearing in the large amounts of data used by applications.
Association Analysis: This discovers frequent co-occurrences in structured data used in business applications, which are usually managed by a DBMS. An algorithm called Apriori is used in many cases for this purpose. For example, it discovers combinations of items that frequently co-occur in a group of items (i.e., the contents of shopping carts) purchased at the same time in retail stores. Based on the resulting association rules, many application systems recommend sets of items by revising their arrangements. Association rule mining has been extended and applied to histories of product purchases and of click streams on Web pages in order to discover frequent patterns in series data.
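To make the Apriori idea concrete, here is a minimal, self-contained Python sketch of frequent itemset mining over toy shopping-cart data; the baskets and the min_support threshold are hypothetical, and a full implementation would also derive association rules from the frequent itemsets.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support is at least min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Support = fraction of transactions containing the candidate itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: build (k+1)-itemsets from frequent k-itemsets, then
        # prune any candidate with an infrequent k-subset (Apriori property)
        prev = list(level)
        joined = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

# Hypothetical shopping-cart data
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}]
print(apriori(baskets, min_support=0.5))
```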
Classification: Here, by contrast, a classifier is learned from data whose classes (i.e., categories) are known in advance. Then, when new data arrive, the classes to which they should belong are determined using the learned classifier. This task, called classification, is one of the basic data mining techniques. Naïve Bayes and decision trees are typical classifiers. Classification is used by a variety of applications, such as identification of promising customers, detection of spam e-mails, and determination of the categories of new specimens in science or medicine. Determination of continuous values, such as temperatures and stock prices, is instead called prediction of future values.
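As a brief illustration, the sketch below trains a decision tree on a handful of hypothetical labeled examples (spam-like message features invented for the example, assuming scikit-learn is available) and then classifies a new, unseen specimen with the learned model.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training features: [messages_per_hour, links_per_message]
X_train = [[2, 0], [3, 1], [50, 8], [70, 12], [5, 0], [60, 9]]
y_train = ["ham", "ham", "spam", "spam", "ham", "spam"]

# Learn the classifier from data whose classes are known in advance
clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Determine the class of new, unseen data with the learned classifier
print(clf.predict([[55, 10]]))  # -> ['spam']
```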
Clustering: It may be possible to define degrees of similarity between data even when the categories of the data are not known in advance. The opposite concept of similarity is dissimilarity, or distance. Based on the defined similarity, grouping data that are similar to each other into the same group within a collection is called cluster analysis, or clustering, which is also one of the basic technologies of data mining. Unlike classification, clustering does not demand that the names and characteristics of clusters be known in advance. Techniques such as hierarchical agglomerative methods and the nonhierarchical k-means method are often used for clustering. Promising applications of clustering include the discovery of groups of similar customers for marketing.
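A minimal k-means sketch, again assuming scikit-learn and using invented customer features, shows how similar data are grouped without any cluster labels being known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
customers = np.array([[200, 2], [220, 3], [1500, 12],
                      [1600, 10], [210, 2], [1550, 11]])

# k-means groups similar customers; no class labels are given in advance
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # which cluster each customer fell into
print(km.cluster_centers_)  # the centroid of each discovered group
```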
Outlier Detection: This data mining task detects exceptional values, i.e., values that deviate from standard values. There are methods for outlier detection based on statistical models, data distances, and data densities, as well as alternative ways to find outliers using clustering and classification. Outlier detection has been used in applications such as the detection of credit card fraud or network intrusions.
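One simple statistical-model approach is to flag values that lie far from the mean. The sketch below uses only the Python standard library; the card transaction amounts and the threshold are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean
    (a simple statistical-model approach to outlier detection)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; the last one is anomalous
amounts = [25, 30, 27, 31, 26, 29, 28, 24, 30, 950]
print(zscore_outliers(amounts, threshold=2.5))  # -> [950]
```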
2.3 Big data vs traditional DBMS
Big data brings compelling opportunities for data manipulation. It lets us work with huge volumes of semi-structured and unstructured data that traditional databases are unable to store. Moreover, it gives us a chance to uncover hidden insights in large sets of data [10]. Enterprises tend to track their customers and monitor their transactions in order to obtain the desired statistics. Evaluating customer behavior thus offers a vantage point over whole systems and supports advanced research toward long-term goals [6]. To illustrate, the loyalty program of Tesco, a British multinational grocery and general merchandise retailer, generates a tremendous amount of customer data that the company mines to inform decisions from promotions to strategic segmentation of customers. Amazon uses customer data to power its recommendation engine “you may also like …”, based on a type of predictive modelling technique called collaborative filtering [6]. In this method, “the system observes what the user has done together with what all users have done (what items they have bought, what music they have listened to) and predicts how the user might behave in the future” [11].
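The following is a minimal, hypothetical sketch of user-based collaborative filtering: similarities between users are computed over co-rated items, and unseen items are scored by similarity-weighted ratings. The ratings data and function names are invented for illustration; production engines such as Amazon's are far more elaborate.

```python
import math

# Hypothetical user -> {item: rating} data
ratings = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob":   {"book_a": 4, "book_b": 5, "book_d": 4},
    "carol": {"book_c": 5, "book_d": 2},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user, ratings):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice", ratings))  # e.g. "you may also like" book_d
```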
2.4 Limitations of the traditional DBMS
In a relational database, we can cope with structured and sometimes semi-structured data. The data is neatly formatted and fits the schema. The data should fit into the tables, and if it does not, a more complex and harder-to-handle database design is required. This approach might result in the loss of some hidden data. In addition, the schema of a traditional relational database is not suitable for certain dynamic information, like weather patterns, that changes constantly. However, “there are some more flexible mechanisms, such as the ability to store XML documents and binary data, but the capabilities for handling these types of data are usually quite limited [10]”. Furthermore, to process data in a traditional database, the data must be placed at a central node. As the data grows, this central processing node has to be extended, and consequently there are limitations imposed by the chosen hardware platform, such as memory size [12].
“It’s important to understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform [14].”
In big data systems, there is no such limitation on storing the data. We can store all sorts of data, structured, semi-structured and, particularly, unstructured, and still easily query it. Big data solutions store the data in its raw format and apply a schema only when the data is read, which preserves all of the information within the data [10].
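To illustrate the schema-on-read idea, the sketch below stores raw JSON records untouched and applies a field list only at read time; the records and field names are hypothetical.

```python
import json

# Raw records stored as-is, with no fixed schema
raw_lines = [
    '{"user": "u1", "action": "click", "page": "/home"}',
    '{"user": "u2", "action": "purchase", "amount": 19.99}',
]

def read_with_schema(lines, fields):
    """Apply a 'schema' (a field list) only at read time; fields absent
    from a record come back as None, so no information is discarded."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

for row in read_with_schema(raw_lines, ["user", "action", "amount"]):
    print(row)
```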
3. DATA MINING TECHNIQUES
Traditionally, data mining handles transactions that are recorded in databases when customers actually purchase products or services. Analyzing transactional data leads to the discovery of frequently purchased products or services, especially by repeat customers. But transaction mining cannot obtain information about customers who are likely to be interested in products or services but have not purchased anything yet. In other words, it is impossible to discover prospective customers who are likely to become new customers in the future.
In the physical real world, however, customers look at or touch interesting items displayed on the racks. They trial-listen to interesting videos or audio if they can. They may even smell or taste interesting items if possible, and even when interesting items are unavailable for some reason, customers talk about them or collect information about them.
These behaviors can be considered parts of interactions between customers and systems. Such interactions indicate the interests of latent customers, who either purchase the interesting items or, for some reason, do not in the end. Analyzing interactions in the physical real world leads to an understanding of which items customers are interested in. Such analysis, however, leaves unknown which aspects of the items the customers are interested in, why they bought the items, or why they did not. Therefore, if the interests of users can be extracted from heterogeneous data sources and the reasons for purchasing or not purchasing the items uncovered, it will be possible to obtain valuable information about latent customers. Traditional mining of transactional data and the newer mining of interactional data are distinguished as transaction mining and interaction mining, respectively.
3.1 NoSQL as a Database
It has been reported that 65% of the queries processed by Amazon depend on primary keys [Vogels 2007]. Therefore, the key-value store mechanism, in which data access is based on keys, is used by Internet giants such as Google and Amazon. Concrete key-value stores include DynamoDB [DynamoDB 2014] from Amazon, BigTable [Chang et al. 2006] from Google, HBase [HBase 2014] from the Hadoop project, and Cassandra [Cassandra 2014] from Facebook.
Generally, given key data, key-value stores are suitable for retrieving the non-key data (attribute values) associated with that key. First, a hash function is applied to each node that stores data, mapping the node to a point (i.e., a logical place) on a ring-shaped network. When storing data, the same hash function is applied to the key value of each datum, which is then mapped to a point on the ring in the same way. Each datum is stored in the nearest node reached by clockwise rotation of the ring. Thus, to access a datum, one searches for the nearest node located by applying the hash function to its key value. This access structure is called consistent hashing, which is also adopted by P2P systems used for various purposes such as file sharing.
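A minimal Python sketch of such a ring, with hypothetical node names and MD5 chosen arbitrarily as the stable hash, illustrates the clockwise lookup; real systems such as Dynamo additionally place many virtual nodes per server to balance load.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to a point on the ring via a stable hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hashing sketch: nodes and keys share one hash
    ring; each key lives on the nearest node clockwise from its point."""
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        h = ring_hash(key)
        idx = bisect.bisect(self.points, (h,))  # first node clockwise
        if idx == len(self.points):             # wrap around the ring
            idx = 0
        return self.points[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # the node that stores this key
```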
3.2 MapReduce
MapReduce can be viewed as a design pattern that processes tasks efficiently by scaling out in a straightforward manner. For example, human users browsing web sites, and robots crawling for search engines, leave access log data on Web servers when they access the sites. It is then necessary to extract the sessions (i.e., coherent series of page accesses) of each user from the recorded access log data and store them in databases for further analysis. Generally, such a task is called extraction, transformation, and loading (ETL).
MapReduce is well suited to applications that perform such ETL tasks. It divides a task into subtasks and processes them in a parallel, distributed manner. MapReduce fits cases where only the data or parameters of each subtask differ while the method of processing is exactly the same. First, the Map phase is carried out, and its outputs are rearranged so that they are suitable as input to the Reduce phase. For applications where similarity (i.e., identity of processing in this case) and diversity (i.e., differences in data and parameters) are inherent, MapReduce exploits these characteristics to improve the efficiency of processing. Parallelization and distribution of large-scale computations are the two contributing factors behind this kind of model.
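The word-count example below simulates the pattern in a single Python process rather than on a real cluster: the same map and reduce functions are applied to every input split, and only the data differs, which is exactly the similarity/diversity structure MapReduce exploits. The log lines are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, 1) pair for every word in one log line
    for word in line.split():
        yield word, 1

def reduce_phase(key, values):
    # Reduce: aggregate all values the shuffle grouped under one key
    return key, sum(values)

lines = ["error disk full", "ok", "error timeout", "ok ok"]

# Shuffle: group every mapped pair by key before reducing
groups = defaultdict(list)
for line in lines:                 # each line could be a separate subtask
    for key, value in map_phase(line):
        groups[key].append(value)

print(dict(reduce_phase(k, vs) for k, vs in groups.items()))
# -> {'error': 2, 'disk': 1, 'full': 1, 'ok': 3, 'timeout': 1}
```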
3.3 Hadoop
Hadoop [Hadoop 2014] is open-source software for distributed processing on a computer cluster consisting of two or more servers. Hadoop comprises a distributed file system called HDFS (Hadoop Distributed File System), an implementation of MapReduce, and common libraries called Hadoop Common. Data is divided into blocks. While one block of original data is stored on a server determined by Hadoop, copies of it are simultaneously stored on two other servers (by default) inside racks other than the rack holding the server with the original data. Although this data arrangement primarily aims to improve availability, it has the further objective of improving parallelism.
A special server called the NameNode manages the data arrangement in HDFS. The NameNode server keeps the books on all the metadata of the data files. The metadata are kept resident in main memory for high-speed access. Therefore, the NameNode server should be more reliable than the other servers.
If copies of the same data exist on two or more servers, the number of candidate solutions increases for problems that process tasks in parallel by dividing them into multiple subtasks. When Hadoop is fed a task, it looks up the location of the relevant data by consulting the NameNode and sends the program for execution to a server that stores that data. This is because the communication cost of sending programs is generally lower than that of sending data.
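As a hypothetical sketch of this locality-driven scheduling (not Hadoop's actual API), a NameNode-like table maps each block to its replica holders, and a task is dispatched to a server that already stores its input block.

```python
# Hypothetical block -> replica-holder table, as a NameNode would track
block_locations = {
    "block-1": ["rack1/server-a", "rack2/server-c", "rack3/server-e"],
    "block-2": ["rack2/server-d", "rack1/server-b", "rack3/server-f"],
}

def schedule(task_block, busy_servers=()):
    """Prefer sending the program to a replica holder (cheap)
    over moving the data to another server (expensive)."""
    for server in block_locations[task_block]:
        if server not in busy_servers:
            return server
    return block_locations[task_block][0]  # fall back to any replica holder

print(schedule("block-1", busy_servers={"rack1/server-a"}))
# -> rack2/server-c: replication gives the scheduler more candidates
```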
4. ISSUES AND CHALLENGES
Variety and heterogeneity: In the past, the datasets we had were quite simple and homogeneous. Now we have to deal with structured, semi-structured and unstructured data together. Structured data is compatible with conventional DBMSs; semi-structured and unstructured datasets need to be accommodated by adequate, state-of-the-art platforms.
Volume/Scalability: Data now arrive at a tremendous scale, which gives us an opportunity to discover hidden knowledge and to serve and understand people better. Two approaches, if exploited properly, may deliver the remarkable scalability required for future systems to manage and mine big data. The first is advanced user interaction [5,6]: data mining carried out in a straightforward manner implies an extremely time-consuming search over a large space, but with user interaction we can narrow the search down to more promising subspaces. The second is cloud computing, which has shown admirable elasticity and which, combined with massively parallel computing architectures, can make our systems scalable.
Velocity/Speed: We must finish processing/mining within a desired time, or else the information becomes useless. Speed depends on (a) data access time and (b) the efficiency of the mining algorithms. Exploitation of advanced indexing schemes is the key to the speed issue; multidimensional indexing structures such as the R-tree are useful for big data and for reducing data access time. An additional way to boost the speed of big data access and mining is to maximally identify and exploit the potential parallelism in the access and mining algorithms.
Accuracy, trust and provenance: In the past, we dealt with datasets from reliable sources. In the era of big data, we are forced to cope with all the rigors of a considerable amount of unstructured and unreliable data. How, then, can we trust unreliable data? The use of learning algorithms is an appropriate way to determine the credibility of a data source, and these algorithms should be able to update that credibility in a timely manner.
Privacy crisis: Because data is interconnected, every piece of information about someone can be mined from the Internet, and once this information is put together, privacy disappears. We are working on developing mining systems that can mine a huge portion of the Web, and these same tools could be used to retrieve personal and confidential information about you.
Interactiveness: This is the capability of a data mining system to allow user interaction, such as feedback and guidance. Interactiveness can help narrow down the search space, accelerating the mining and increasing system scalability; heterogeneity can also be overcome by allowing users to interpret intermediate and final results through interaction. Interactiveness boosts the value of data mining results: even if a data mining system is professionally designed, without interactiveness the value of its results can be discounted, or the results simply rejected.
Garbage mining: On the WWW, data is generated very fast and becomes outdated very fast, so cyberspace cleaning is required. This is not easy, for foreseeable reasons: garbage is hidden, and there is an ownership issue; are you allowed to collect or dispose of garbage that does not belong to you? We propose applying data mining approaches to mine garbage and recycle it. We believe garbage mining is a serious research topic: mining for garbage is mining for knowledge.
REFERENCES
1. S. Hendrickson, Getting Started with Hadoop with Amazon’s Elastic MapReduce, EMR, 2010.
2. M. Hilbert and P. López, “The world’s technological capacity to store, communicate, and compute information,” Science, vol. 332, no. 6025, pp. 60–65, 2011.
3. J. M. Wing, “Computational thinking and thinking about computing,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 366, no. 1881, pp. 3717–3725, 2008.
4. J. Mervis, “Agencies rally to tackle big data,” Science, vol. 336, no. 6077, p. 22, 2012.
5. http://www.marklogic.com/blog/birth-of-big-data/
6. D. Che, M. Safran, and Z. Peng, “From big data to big data mining: challenges, issues and opportunities,” in DASFAA Workshops 2013, LNCS 7827, pp. 1–15, 2013.
7. https://jeremyronk.wordpress.com/2014/09/01/structured-semi-structured-and-unstructured-data/
8. http://whatis.techtarget.com/definition/semi-structured-data
9. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, p. 33, June 2011.
10. https://msdn.microsoft.com/en-us/library/dn749785.aspx
11. https://en.wikipedia.org/wiki/Collaborative_filtering
12. A. Salehinia, Comparisons of Relational Databases with Big Data: A Teaching Approach, South Dakota State University, Brookings, SD 57007.
13. P. C. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data, p. 5, 2012.
14. P. C. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data, p. 16, 2012.
15. H. Ishikawa, Social Big Data Mining, 2015.