Abstract - The ongoing data deluge is producing large amounts
of data in every sector of modern society. Such data are
called big data. Big data include datasets originating both
in the physical world and in social media, and their size and
complexity make them difficult to manage with current
methodologies or data mining software tools. Big data mining
is the capability of extracting useful information from these
large datasets or streams of data. Big data technologies
provide robust solutions to the problems caused by volume,
variability and velocity. In this paper we present a broad
overview of the topic, its current status and techniques
such as NoSQL, MapReduce and Hadoop.
Keywords - Big Data Mining, Mining Techniques, NoSQL,
Hadoop, MapReduce.
1. INTRODUCTION
In the present age, large amounts of data are produced every
moment in various fields, such as science, the Internet, and
physical systems. These phenomena are collectively called the
data deluge [Mcfedries 2011]. According to research carried
out by IDC [IDC 2008, IDC 2012], the amount of data generated
and reproduced all over the world every year is estimated to
be 161 exabytes. Data are predicted to increase rapidly, at a
rate of 10x every five years [1]. Meanwhile, the computing
capacity of general-purpose computers rises by 58% annually
[2]. Consider Internet data. The web pages indexed by Google
numbered around one million in 1998, quickly reached one
billion in 2000 and had already exceeded one trillion by 2008.
This rapid expansion is accelerated by the dramatic increase
in the adoption of social networking applications, such as
Facebook, Twitter and Weibo, which allow users to create
content freely and amplify the already huge Web volume.
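As a quick check on the two growth figures quoted in this paragraph, a tenfold increase over five years can be converted to an equivalent annual rate r. This is a worked calculation added for illustration, not a result taken from the cited sources:

```latex
\[
(1 + r)^{5} = 10
\quad\Longrightarrow\quad
r = 10^{1/5} - 1 \approx 0.585
\]
```

Under this reading, growth of 10x every five years corresponds to roughly 58% per year, which is of the same order as the annual rise in computing capacity cited from [2].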
Thus, “Big Data” is a critical issue that needs serious
attention [3,4]. The term is usually credited to two people.
The first is John Mashey, chief scientist at Silicon Graphics
in the 1990s, who gave a talk titled “Big Data and the Next
Wave of InfraStress” in 1998. The second is Francis X.
Diebold, an economist at the University of Pennsylvania, for
his paper “Big Data Dynamic Factor Models for Macroeconomic
Measurement and Forecasting” (2000) [5].
We introduce Big Data Mining and its applications in Section
2. We discuss some data mining techniques in Section 3.
Then we discuss the issues and challenges in Section 4.
2. BIG DATA MINING
The term ‘Big Data’ originates from the fact that we are
creating a huge amount of data every day. Usama Fayyad
[11], in his invited talk at the KDD BigMine’12 Workshop,
presented striking figures about Internet usage, among
them the following: Google handles more than 1 billion
queries per day, Twitter has more than 250 million tweets per
day, Facebook has more than 800 million updates per day
and YouTube has more than 4 billion views per day. The
data produced nowadays is estimated to be on the order of
zettabytes and is growing by around 40% every year.
Three types of data are associated with big data:
structured, semi-structured and unstructured. Today,
structured data represent only 5 to 10% of all data.
Structured data are data that can be stored in a SQL
database, in tables with specific rows and columns [7].
Semi-structured data likewise represent a small share of all
data (approximately 5 to 10%). This type of data does not have
the strict organization of structured data, which fits into
tables. Instead, semi-structured data are associated with
metadata. Metadata describe the content and context of data
files, e.g. means of creation, purpose, time and date of
creation, and author [9]. XML documents, in particular, are
semi-structured documents. Moreover, NoSQL databases are
considered semi-structured [7].
The foremost challenge is to find ways to cope with
unstructured data, which are everywhere and are the most
prevalent of the three types: streams of text, images, audio
and video. They represent 80% of data [7].
2.1 Big Data Definition - 3 V’s:
In today's world, organizations are bombarded with
information, but the percentage of data that can be analyzed
is declining. The reason is that 80% of the data is in
semi-structured or unstructured format. Thus, we need new
algorithms and new toolsets to deal with all this data.
The features of big data can be summarized as follows:
• Volume: The quantity of data is extraordinary, but the
percentage of data that our tools can process is not.
• Variety: The kinds of data have expanded to include
unstructured text, audio, video, graphs and XML.
• Velocity: Data arrive continuously as streams, and the
speed at which data are generated is very high.

Big Data Mining - Classification, Techniques and Issues
Karan Deep Singh, ka_ingh@encs.concordia.ca, Masters in Computer Science, Concordia University
Yeghia Koronian, y_koroni@encs.concordia.ca, Masters in Computer Science, Concordia University
Gelareh Tavako Saberi, g_tavako@encs.concordia.ca, Masters in Computer Science, Concordia University
Therefore, big data are often characterized as the 3 V’s, after
the initial letters of Volume, Variety and Velocity. Apart from
these, there is another factor, Variability, which corresponds
to changes in the structure of the data and in how users want
to interpret that data.
Gartner [15] summarizes this in its 2012 definition of Big
Data as high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and
decision making.
2.2 Data Mining
Data mining is, in a nutshell, the discovery of frequent
patterns and meaningful structures appearing in the large
amounts of data used by applications.
Association Analysis: This task discovers frequent
co-occurrences in the structured data used in business
applications, which are usually managed by a DBMS. An
algorithm called Apriori is used in many cases for that
purpose. For example, it discovers combinations of items that
frequently co-occur in a group of items purchased at the same
time in retail stores (i.e., the contents of shopping carts).
Based on association rules, many application systems
recommend a set of items by revising how they are arranged.
Association rule mining has been extended and applied to
product purchase histories and to click streams on Web pages
in order to discover frequent patterns in series data.
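The shopping-cart example can be made concrete with a minimal sketch of the frequent-itemset step of Apriori. The function name `apriori`, the toy carts and the support threshold are illustrative assumptions, not taken from the cited sources, and rule generation from the itemsets is left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many carts contain each single item.
    counts = {}
    for cart in transactions:
        for item in cart:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count the support of each surviving candidate.
        counts = {c: sum(1 for cart in transactions if c <= cart) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical shopping carts, used only for illustration.
carts = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(carts, min_support=3)
```

With a minimum support of three carts, the pair {diapers, beer} is kept while {bread, beer} is discarded. The pruning step is what keeps the candidate set small as itemsets grow.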
Classification: A classifier is learned from data whose
classes (i.e., categories) are known in advance. When new
data arrive, the classes to which they should belong are
determined using the learned classifier. This task, called
classification, is one of the basic data mining techniques.
Naïve Bayes and decision trees are typical classifiers.
Classification is used by a variety of applications, such as
identifying promising customers, detecting spam e-mails and
determining the categories of new specimens in science or
medicine. Determining continuous values, such as
temperatures and stock prices, is called prediction of
future values.
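As an illustration of learning a classifier from labelled data and applying it to new data, the following is a minimal sketch of a multinomial Naïve Bayes spam filter with add-one smoothing. The training messages, labels and function names are hypothetical, and a production system would normally rely on an established library instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: iterable of (text, label). Returns the counts the model needs."""
    class_docs = Counter()              # number of training documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_docs, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log-posterior for the given text."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        log_prob = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                count = word_counts[label][word]
                # Add-one (Laplace) smoothing avoids zero probabilities.
                log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = log_prob
    return max(scores, key=scores.get)


# Hypothetical labelled e-mails (classes known in advance).
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]
model = train_naive_bayes(training)
label_a = classify(model, "cheap money")
label_b = classify(model, "meeting notes today")
```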
Clustering: It may be possible to define degrees of
similarity between data even if the categories of the data are
not known in advance. The opposite concept of similarity is
dissimilarity, or distance. Grouping the data in a collection
so that items similar to each other fall into the same group,
based on the defined similarity, is called cluster analysis or
clustering, and is also one of the basic technologies of data
mining. Unlike classification, clustering doesn’t require
the names and characteristics of clusters to be known in
advance. Techniques such as the hierarchical agglomerative
method and the nonhierarchical k-means method are often used
for clustering. Promising applications of clustering include
the discovery of groups of similar customers for marketing.
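The nonhierarchical k-means method mentioned above can be sketched in a few lines. This is a minimal illustration that assumes squared Euclidean distance as the dissimilarity and, for reproducibility, uses the first k points as initial centroids. Both choices, the function names and the sample points are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance, used here as the dissimilarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_of(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, max_iterations=100):
    # Assumption: take the first k points as initial centroids (deterministic).
    centroids = list(points[:k])
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [centroid_of(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged: the assignments no longer change
            break
        centroids = updated
    return centroids, clusters


# Hypothetical customers described by two numeric features.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(customers, k=2)
```

No class labels are supplied anywhere. The two groups emerge only from the distances between points, which is the difference from classification noted in the text.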
Outlier Detection: This data mining task detects
exceptional values, i.e., values that differ from standard
values. There are methods for outlier detection based on
statistical models, data distances, and data densities.
Outliers can also be found using clustering and
classification. Outlier detection has been used for
applications such as detecting credit card fraud or
network intrusions.
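One simple instance of the statistical approach is sketched below: a modified z-score based on the median and the median absolute deviation (MAD), which is less distorted by the outliers themselves than a mean-based score. The threshold of 3.5 is a commonly used convention, and the transaction amounts are hypothetical. Neither comes from the cited sources.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        return []  # no spread around the median; this test cannot separate outliers
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]


# Hypothetical card transaction amounts with one suspicious value.
amounts = [12.5, 14.0, 13.2, 12.9, 250.0, 13.7, 14.1, 12.8]
suspicious = mad_outliers(amounts)
```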
2.3 Big data vs traditional DBMS
Big data offer compelling opportunities for data
manipulation. They allow us to work with huge volumes of
semi-structured and unstructured data that a traditional
database is not able to store. Moreover, they give us a
chance to uncover hidden insights in large sets of data [10].
Enterprises tend to track their customers and monitor their
transactions in order to obtain the statistics they need.
Evaluating customers’ behavior thus provides a vantage point
over the whole system and supports advanced research aimed
at long-term goals [6]. To illustrate, the loyalty program of
Tesco, a British multinational grocery and general
merchandise retailer, generates a tremendous amount of
customer data that the company mines to inform decisions
ranging from promotions to strategic segmentation of
customers. Amazon uses customer data to power its
recommendation engine (“you may also like …”), based on a
type of predictive modelling technique called collaborative
filtering [6]. In this method, the system observes what the
user has done together with what all users have done (what
items they have bought, what music they have listened to)
and predicts how the user might behave in the future [11].
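A minimal sketch of user-based collaborative filtering follows. It assumes purchase histories are plain sets of item names and uses Jaccard similarity between users. This is only one of many possible similarity measures and is not a description of how Amazon's engine works. All names and data are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two purchase sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, top_n=3):
    """Suggest items bought by users similar to `target` that `target` lacks."""
    own = purchases[target]
    scores = {}
    for user, items in purchases.items():
        if user == target:
            continue
        similarity = jaccard(own, items)
        for item in items - own:
            # Each unseen item is weighted by how similar its buyer is.
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(
        ((item, score) for item, score in scores.items() if score > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "lamp", "desk"},
    "carol": {"pen", "notebook"},
    "dave": {"guitar"},
}
suggestions = recommend("alice", purchases)
```

Here bob shares two purchases with alice, so his remaining item ranks first. dave shares nothing with her, so his item is not suggested at all.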
2.4 Limitations of the traditional DBMS
In a relational database, we can cope with structured and
sometimes semi-structured data. The data is neatly formatted
and fits into the schema. The data must fit into tables; if
it does not, a database must be designed that is more complex
and more difficult to handle. This approach might result in
the loss of some hidden data. In addition, the schema of a
traditional relational database is not suitable for certain
dynamic information, like weather patterns, that changes
continuously. However, “There are some more flexible
mechanisms, such as the ability to store XML documents and
binary data, but the capabilities for handling these types of
data are usually quite limited” [10]. Furthermore, to process
data in a traditional database, the data has to be placed on a
central node. As the data grows, the central processing node
has to be extended and, consequently, there are limitations
that depend on the chosen hardware platform, such as memory
size [12].
“It’s important to understand that conventional database
technologies are an important, and relevant, part of an overall
analytic solution. In fact, they become even more vital when
used in conjunction with your Big Data platform” [14].
Big Data platforms do not impose these limitations on storing
data. We can store all sorts of data (structured,
semi-structured and, in particular, unstructured) and query
them easily. Big data solutions store the data in its raw
format and apply a schema only when the data is read, which
preserves all of the information within the data [10].
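The idea of applying a schema only when the data is read can be illustrated with a small sketch. Raw records are kept as JSON text exactly as they arrived, and each reader supplies its own schema (a mapping from field name to type) at read time. The record contents and the helper `read_with_schema` are hypothetical.

```python
import json

# Raw records are stored untouched; fields may be missing or differently typed.
raw_records = [
    '{"user": "u1", "temp_c": "21.5", "ts": "2015-03-01T10:00:00"}',
    '{"user": "u2", "temp_c": 19, "note": "sensor moved"}',
    '{"user": "u3"}',
]


def read_with_schema(lines, schema):
    """Yield one dict per record with only the schema's fields, cast on read."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# One reader's view of the data; another reader could ask for "note" instead.
temperature_view = list(read_with_schema(raw_records, {"user": str, "temp_c": float}))
```

Because nothing is discarded at write time, a later analysis can still read fields such as `note` or `ts` that this particular view ignores.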
3. DATA MINING TECHNIQUES
Traditionally, data mining handles transactions, which are
recorded in databases when customers actually purchase
products or services. Analyzing transactional data leads to
the discovery of frequently purchased products or services,
especially those bought by repeat customers. But transaction
mining cannot obtain information about customers who are
likely to be interested in products or services but have not
purchased any yet. In other words, it cannot discover
prospective customers who are likely to become customers in
the future.
In the physical real world, however, customers look at or
touch interesting items displayed on the racks. They
trial-listen to or preview interesting audio or video if they
can. They may even smell or taste interesting items if
possible, and even if interesting items are unavailable for
some reason, customers talk about them or collect information
about them.
These behaviors can be considered parts of the interactions
between customers and systems. Such interactions indicate
the interests of latent customers, who in the end either
purchase the interesting items or, for some reason, do not.
Analyzing interactions in the physical real world leads to
understanding which items customers are interested in. Such
analysis alone, however, does not reveal which aspects of the
items the customers are interested in, why they bought the
items, or why they didn’t. Therefore, if the interests of
the users are extracted from heterogeneous data sources and
the reasons for purchasing or not purchasing the items are
uncovered, it will be possible to obtain valuable information
about latent customers. Traditional mining of transactional
data and the new mining of interactional data are called
transaction mining and interaction mining, respectively.
3.1 NoSQL as a Database
It has been reported that 65% of queries processed by
Amazon depend on primary keys [Vogels 2007]. Therefore,
key value stores, which access data based on keys, are
used by Internet giants such as Google and Amazon.
Concrete key value stores include DynamoDB [DynamoDB
2014] of Amazon, BigTable [Chang et al. 2006] of Google,
HBase [HBase 2014] of the Hadoop project and Cassandra
[Cassandra 2014], originally developed at Facebook.
Generally, given key data, key value stores are suitable for
searching for the non-key data (attribute values) associated
with the key data. First, a hash function is applied to each
node that stores data, and the node is mapped to a point
(i.e., a logical place) on a ring-type network. When storing
data, the same hash function is applied to the key value of
each data item, and the item is similarly mapped to a point
on the ring. Each data item is stored in the nearest node
found by moving clockwise around the ring. Thus, to access
data, the store applies the hash function to the key value
and searches for the nearest node clockwise from the
resulting point. This access structure is called consistent
hashing, which is also adopted by P2P systems used for
various purposes such as file sharing.
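The ring lookup described above can be sketched as follows. This is a minimal illustration that assumes MD5 as the hash function and one ring point per node. Real systems typically place several virtual points per node, and the class and node names here are hypothetical.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string (node name or data key) to a point on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16)


class ConsistentHashRing:
    def __init__(self, nodes):
        # Nodes are hashed with the same function as keys and sorted by position.
        self._ring = sorted((ring_position(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def node_for(self, key):
        """Return the first node at or after the key's position, clockwise."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        # Past the last node, wrap around to the first one on the ring.
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

A useful property of this layout is that removing one node only moves the keys that node owned. Every other key keeps its owner, which is why the scheme suits clusters whose membership changes.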
3.2 MapReduce
MapReduce can be regarded as a design pattern that
processes tasks efficiently by scaling out in a
straightforward manner. For example, human users browsing
web sites and robots crawling for search engines leave
access log data on Web servers when they access the sites.
It is therefore necessary to extract each user's sessions
(i.e., coherent series of page accesses) from the recorded
access log data and store them in databases for further
analysis. Such a task is generally called extraction,
transformation, and loading (ETL).
MapReduce is suitable for applications that perform such
ETL tasks. It divides a task into subtasks and processes them
in a parallel, distributed manner. MapReduce suits cases
where the method of processing is exactly the same for every
subtask and only the data or parameters differ. First, the
Map phase is carried out, and its outputs are rearranged so
that they are suitable as input to the Reduce phase. For
applications where similarity (i.e., identity of processing
in this case) and diversity (i.e., difference of data and
parameters for processing) are inherent, MapReduce exploits
these characteristics to improve the efficiency of
processing. Parallelization and distribution of large-scale
computations are the two factors that motivated this kind
of model.
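The session-extraction example can be sketched as a single-process imitation of the Map, shuffle and Reduce steps. This is not Hadoop code: the log format (`user timestamp page`), the 30-minute session gap and all function names are assumptions made for illustration. In a real deployment the map and reduce calls would run in parallel on different servers.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds of inactivity that end a session (assumed value)


def map_phase(log_line):
    """Map: parse one access-log line into a (user, (timestamp, page)) pair."""
    user, timestamp, page = log_line.split()
    yield user, (int(timestamp), page)


def shuffle(pairs):
    """Rearrange map output so that all values of one key are grouped together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(user, visits):
    """Reduce: split one user's time-ordered visits into sessions."""
    sessions, current, last_timestamp = [], [], None
    for timestamp, page in sorted(visits):
        if last_timestamp is not None and timestamp - last_timestamp > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(page)
        last_timestamp = timestamp
    if current:
        sessions.append(current)
    return user, sessions


# Hypothetical access log: "user unix_timestamp page".
access_log = [
    "alice 1000 /home",
    "bob 1100 /home",
    "alice 1300 /products",
    "alice 9000 /home",
    "bob 1200 /cart",
]
mapped = [pair for line in access_log for pair in map_phase(line)]
sessions_by_user = dict(reduce_phase(user, visits) for user, visits in shuffle(mapped).items())
```

Every map call runs the same code on a different log line, and every reduce call runs the same code on a different user. This matches the text's point about identical processing applied to differing data.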
3.3 Hadoop
Hadoop [Hadoop 2014] is open source software for
distributed processing on a computer cluster, which consists
of two or more servers. Hadoop consists of a distributed file
system called HDFS (Hadoop Distributed File System),
an implementation of MapReduce, and Hadoop Common, a set of
common libraries. A computer cluster is a collection of
racks, each of which consists of two or more servers. Data
is divided into blocks. While one block of the original data
is stored on a server determined by Hadoop, copies of that
block are simultaneously stored on two other servers (by
default) in racks other than the rack holding the server for
the original. This data arrangement is intended to improve
availability, and it also improves parallelism.
A special server called the NameNode manages data
arrangement in HDFS. The NameNode keeps track of all the
metadata of data files. The metadata reside in main memory
for high speed access. Therefore, the NameNode server should
be more reliable than the other servers.
If copies of the same data exist on two or more servers,
there are more candidate ways to process a task in parallel
by dividing it into multiple subtasks. When Hadoop is given
a task, it looks up the location of the relevant data by
consulting the NameNode and sends the program to be executed
to the server that stores the data. This is because the
communication cost of sending programs is generally lower
than that of sending data.
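The interplay of blocks, replicas and "send the program to the data" can be sketched with a toy model. This is not the HDFS API: the placement below is simple round-robin and is not rack-aware, and all function and server names are hypothetical. The dictionary returned by `place_replicas` plays the role of the block-location metadata that the NameNode keeps.

```python
def split_into_blocks(data, block_size):
    """Divide the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, servers, replication=3):
    """Toy placement: block b goes to `replication` consecutive servers.

    Assumes replication <= len(servers). Real HDFS also spreads replicas
    across racks, which this sketch ignores.
    """
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


def schedule(block_locations):
    """Send each block's task to the least-loaded server holding a replica."""
    load = {}
    assignment = {}
    for block, holders in block_locations.items():
        target = min(holders, key=lambda server: (load.get(server, 0), server))
        assignment[block] = target
        load[target] = load.get(target, 0) + 1
    return assignment


blocks = split_into_blocks("abcdefghij", block_size=4)
locations = place_replicas(len(blocks), ["s1", "s2", "s3", "s4"])
assignment = schedule(locations)
```

Because each block has three holders, the scheduler has a choice for every task and can spread the three tasks over three different servers. This is the parallelism benefit of replication noted in the text.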
4. ISSUES AND CHALLENGES
Variety and heterogeneity: In the past, the datasets that we
had were quite simple and homogeneous. Now we have to
interact with structured, semi-structured and unstructured
data. Structured data is compatible with conventional
DBMS. Semi-structured and unstructured datasets need to be
handled on adequate, state-of-the-art platforms.
Volume/Scalability: Data now exist at a tremendous scale,
which gives us an opportunity to discover hidden knowledge
and to serve and understand people better. Two approaches,
if exploited properly, may lead to the remarkable
scalability required of future data and mining systems to
manage and mine big data. The first is advanced user
interaction [5.6]: data mining in a straightforward manner
is an extremely time-consuming task over a large search
space, but with user interaction we can narrow the search
space down to more promising subspaces. The second is cloud
computing, which has shown admirable elasticity and which,
combined with massively parallel computing architectures,
can make our systems scalable.
Velocity/Speed: We must finish processing and mining within
a desired time, or else the information is useless. Speed
depends on (a) data access time and (b) the efficiency of
the mining algorithms. Exploiting advanced indexing schemes
is the key to the speed issue: multidimensional indexing
structures such as the R-tree are useful for reducing big
data access time. An additional approach to boosting the
speed of big data access and mining is to maximally identify
and exploit the potential parallelism in the access and
mining algorithms.
Accuracy, trust and provenance: In the past, we dealt
with datasets that were reliable. In the era of big data,
we are forced to deal with a considerable amount of
unstructured and unreliable data. How, then, can we trust
unreliable data? Learning algorithms are an appropriate way
to determine the credibility of a data source; these
algorithms should be able to update the credibility of the
source in a timely manner.
Privacy crisis: Because data are interconnected, almost any
piece of information about a person can be mined from the
Internet, and once this information is put together, privacy
disappears. We are working on developing mining systems that
can mine a huge portion of the Web, and these same tools can
be used to retrieve personal and confidential information
about you.
Interactiveness: Interactiveness is the capability of a data
mining system to allow user interaction such as feedback and
guidance. Interactiveness can help narrow down the search
space, accelerate processing and increase system scalability.
Heterogeneity can also be overcome by allowing users to
interpret intermediate and final results through interaction.
Interactiveness improves data mining results: even if a data
mining system is professionally designed, without
interactiveness the value of its results can be discounted
or simply rejected.
Garbage mining: On the WWW, data are generated very fast and
become outdated very fast, so cyberspace cleaning is
required. It is not easy, for foreseeable reasons: garbage
is hidden, and there is an ownership issue: are you allowed
to dispose of or collect garbage that does not belong to
you? We propose applying data mining approaches to mine
garbage and recycle it. We believe garbage mining is a
serious research topic: mining for garbage is mining for
knowledge.
REFERENCES
1. S. Hendrickson, Getting Started with Hadoop with Amazon’s
Elastic MapReduce, EMR, 2010.
2. M. Hilbert and P. López, “The world’s technological capacity
to store, communicate, and compute information,” Science,
vol. 332, no. 6025, pp. 60–65, 2011.
3. J. M. Wing, “Computational thinking and thinking about
computing,” Philosophical Transactions of the Royal Society of
London A: Mathematical, Physical and Engineering Sciences,
vol. 366, no. 1881, pp. 3717–3725, 2008.
4. J. Mervis, “Agencies rally to tackle big data,” Science,
vol. 336, no. 6077, p. 22, 2012.
5. http://www.marklogic.com/blog/birth-of-big-data/
6. D. Che, M. Safran, Z. Peng, “From big data to big data
mining: Challenges, Issues and Opportunities,” in DASFAA
Workshops 2013, LNCS 7827, pp. 1–15, 2013.
7. https://jeremyronk.wordpress.com/2014/09/01/structured-semi-structured-and-unstructured-data/
8. http://whatis.techtarget.com/definition/semi-structured-data
9. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
C. Roxburgh, A. H. Byers, Big Data: The next frontier for
innovation, competition and productivity, McKinsey Global
Institute, p. 33, June 2011.
10. https://msdn.microsoft.com/en-us/library/dn749785.aspx
11. https://en.wikipedia.org/wiki/Collaborative_filtering
12. A. Salehinia, Comparisons of Relational Databases with Big
Data: a Teaching Approach, South Dakota State University,
Brookings, SD 57007.
13. P. C. Zikopoulos, Ch. Eaton, Ch. deRoos, Th. Deutsch,
G. Lapis, “Understanding Big Data,” p. 5, 2012.
14. P. C. Zikopoulos, Ch. Eaton, Ch. deRoos, Th. Deutsch,
G. Lapis, “Understanding Big Data,” p. 16, 2012.
15. H. Ishikawa, Social Big Data Mining, 2015.

More Related Content

What's hot

Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 
A Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremA Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremAnthonyOtuonye
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentationKlawal13
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningAbcdDcba12
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdfAkuhuruf
 
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsBig data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsIJERA Editor
 
1 Introduction to-data-mining lecture
1   Introduction to-data-mining lecture1   Introduction to-data-mining lecture
1 Introduction to-data-mining lectureMahmoud Alfarra
 
Big Data and Classification
Big Data and ClassificationBig Data and Classification
Big Data and Classification303Computing
 
Terrorism in the Age of Big Data
Terrorism in the Age of Big DataTerrorism in the Age of Big Data
Terrorism in the Age of Big DataMichael Maman
 
A Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesA Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesDr. Amarjeet Singh
 

What's hot (18)

Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
U0 vqmtq3m tc=
U0 vqmtq3m tc=U0 vqmtq3m tc=
U0 vqmtq3m tc=
 
A Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremA Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE Theorem
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentation
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
 
Datamining
DataminingDatamining
Datamining
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdf
 
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsBig data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing Platforms
 
1 Introduction to-data-mining lecture
1   Introduction to-data-mining lecture1   Introduction to-data-mining lecture
1 Introduction to-data-mining lecture
 
Ngdm09 han gao
Ngdm09 han gaoNgdm09 han gao
Ngdm09 han gao
 
Big Data and Classification
Big Data and ClassificationBig Data and Classification
Big Data and Classification
 
Data Mining
Data MiningData Mining
Data Mining
 
Terrorism in the Age of Big Data
Terrorism in the Age of Big DataTerrorism in the Age of Big Data
Terrorism in the Age of Big Data
 
3 classification
3  classification3  classification
3 classification
 
A Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesA Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: Challenges
 

Viewers also liked

Fundamentals of data security policy in i.t. management it-toolkits
Fundamentals of data security policy in i.t. management   it-toolkitsFundamentals of data security policy in i.t. management   it-toolkits
Fundamentals of data security policy in i.t. management it-toolkitsIT-Toolkits.org
 
Legal issues Text and Data Mining
Legal issues Text and Data MiningLegal issues Text and Data Mining
Legal issues Text and Data Miningopenminted_eu
 
Personal Information Collection: A Trade-Off Analysis
Personal Information Collection: A Trade-Off AnalysisPersonal Information Collection: A Trade-Off Analysis
Personal Information Collection: A Trade-Off AnalysisShannon Szabo-Pickering
 
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Chris Shillum
 
Merit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data ProtectionMerit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data Protectionmeritnorthwest
 
A business driven approach to security policy management a technical perspec...
A business driven approach to security policy management  a technical perspec...A business driven approach to security policy management  a technical perspec...
A business driven approach to security policy management a technical perspec...AlgoSec
 
1.3 applications, issues
1.3 applications, issues1.3 applications, issues
1.3 applications, issuesKrish_ver2
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Data security in local network using distributed firewall ppt
Data security in local network using distributed firewall ppt Data security in local network using distributed firewall ppt
Data security in local network using distributed firewall ppt Sabreen Irfana
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 

Viewers also liked (11)

Fundamentals of data security policy in i.t. management it-toolkits
Fundamentals of data security policy in i.t. management   it-toolkitsFundamentals of data security policy in i.t. management   it-toolkits
Fundamentals of data security policy in i.t. management it-toolkits
 
Legal issues Text and Data Mining
Legal issues Text and Data MiningLegal issues Text and Data Mining
Legal issues Text and Data Mining
 
Personal Information Collection: A Trade-Off Analysis
Personal Information Collection: A Trade-Off AnalysisPersonal Information Collection: A Trade-Off Analysis
Personal Information Collection: A Trade-Off Analysis
 
Lecture1
Lecture1Lecture1
Lecture1
 
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
 
Merit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data ProtectionMerit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data Protection
 
A business driven approach to security policy management a technical perspec...
A business driven approach to security policy management  a technical perspec...A business driven approach to security policy management  a technical perspec...
A business driven approach to security policy management a technical perspec...
 
1.3 applications, issues
1.3 applications, issues1.3 applications, issues
1.3 applications, issues
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Data security in local network using distributed firewall ppt
Data security in local network using distributed firewall ppt Data security in local network using distributed firewall ppt
Data security in local network using distributed firewall ppt
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 

Big Data Mining - Classification, Techniques and Issues

  • 1. Abstract: The ongoing data deluge is continuously producing large amounts of data in many sectors of modern society. Such data are called big data. Big data comprise datasets originating both in the physical world and in social media, and they are difficult to manage with current methodologies or data mining software tools because of their size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. Big Data technologies provide robust solutions to the problems caused by volume, variability and velocity. We present a broad overview of the topic, its current status, and techniques such as NoSQL, MapReduce and Hadoop.

Keywords: Big Data Mining, Mining Techniques, NoSQL, Hadoop, MapReduce.

1. INTRODUCTION

In the present age, large amounts of data are produced every moment in various fields, such as science, the Internet, and physical systems. These phenomena are collectively called the data deluge [Mcfedries 2011]. According to research carried out by IDC [IDC 2008, IDC 2012], the amount of data generated and reproduced all over the world every year is estimated to be 161 exabytes. Data volume is predicted to grow rapidly, at a rate of 10x every five years [1]. Meanwhile, the computing capacity of general-purpose computers grows by 58% annually [2]. Consider Internet data. Google indexed around one million web pages in 1998, reached one billion in 2000, and had exceeded one trillion by 2008. This rapid expansion is accelerated by the dramatic rise in the adoption of social networking applications, such as Facebook, Twitter and Weibo, which allow users to create content freely and so amplify the already huge Web volume. Thus, “Big Data” is a critical issue that needs serious attention [3,4].
The term “Big Data” is attributed to two people. The first is John Mashey, chief scientist at Silicon Graphics in the 1990s, who gave a talk titled “Big Data and the Next Wave of InfraStress” in 1998. The second is Francis X. Diebold, an economist at the University of Pennsylvania, for his paper “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting” (2000) [5]. We introduce Big Data mining and its applications in Section 2, discuss some data mining techniques in Section 3, and then discuss the issues and challenges in Section 4.

2. BIG DATA MINING

The term ‘Big Data’ arose because we are creating a huge amount of data every day. Usama Fayyad [11], in his invited talk at the KDD BigMine’12 Workshop, presented striking figures about internet usage, among them the following: Google handles more than 1 billion queries per day, Twitter more than 250 million tweets per day, Facebook more than 800 million updates per day, and YouTube more than 4 billion views per day. The data produced nowadays is estimated to be on the order of zettabytes and is growing by around 40% every year.

Three kinds of data are associated with big data: structured, semi-structured and unstructured. Today, structured data represent only 5 to 10% of all data. Structured data can be stored in an SQL database, in tables with defined rows and columns [7]. Semi-structured data likewise represent a small share of data (approximately 5 to 10%). This type of data does not have the strict organization of structured data, which fits into tables. Instead, semi-structured data are accompanied by metadata, the term used for information that describes the content and context of data files, e.g. means of creation, purpose, time and date of creation, and author [9]. XML documents, in particular, are semi-structured.
Moreover, NoSQL databases are considered semi-structured [7]. The main challenge is to find ways to cope with unstructured data, such as text, images, audio and video, which are ubiquitous, often arrive as streams, and dominate the other kinds. They represent 80% of data [7].

2.1 Big Data Definition - 3 V’s:

Today, organizations are bombarded with information, but the percentage of data that can be analyzed is declining. The reason is that 80% of the data is in semi-structured or unstructured form. Thus, we need new algorithms and new tools to deal with all this data. The features of big data can be summarized as follows:
• Volume: The quantity of data is extraordinary, but the percentage of it that our tools can process is not.
• Variety: The kinds of data have expanded to include unstructured text, audio, video, graphs and XML.
• Velocity: Data arrives continuously as streams, and the speed at which it is generated is very high.
 Big Data Mining - Classification, Techniques and Issues
 Karan Deep Singh (ka_ingh@encs.concordia.ca), Yeghia Koronian (y_koroni@encs.concordia.ca), Gelareh Tavako Saberi (g_tavako@encs.concordia.ca)
 Masters in Computer Science, Concordia University
  • 2. Therefore, big data are often characterized as V3, after the initial letters of the three terms Volume, Variety, and Velocity. Apart from these, there is another factor, Variability, which corresponds to changes in the structure of the data and in how users want to interpret that data. Gartner [15] summarized this in its 2012 definition of Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

2.2 Data Mining

Data mining, in a nutshell, is the discovery of frequent patterns and meaningful structures in the large amounts of data used by applications.

Association Analysis: This task discovers frequent co-occurrences in the structured data used in business applications, which are usually managed by a DBMS. The Apriori algorithm is used in many cases for this purpose. For example, it discovers combinations of items that frequently occur together among the items purchased at the same time in retail stores (i.e., the contents of shopping carts). Based on association rules, many application systems recommend sets of items or revise how items are arranged. Association rule mining has also been extended and applied to product purchase histories and to click streams on Web pages in order to discover frequent patterns in series data.

Classification: In classification, a classifier is learned from data whose classes (i.e., categories) are known in advance. When new data arrive, the classes to which they should belong are determined using the learned classifier. Classification is one of the basic data mining techniques. Naïve Bayes and decision trees are typical classifiers.
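As a rough illustration of the classification task just described, the following Python sketch learns a Naïve Bayes classifier from a handful of labelled records and applies it to new ones. The toy spam data, the feature names and the `train`/`predict` helpers are assumptions made for this example and do not come from the paper; Laplace smoothing is added so that an unseen feature value does not zero out a class.

```python
from collections import Counter, defaultdict
import math

# Hypothetical training data: (features, known class label).
TRAINING = [
    ({"has_link": "yes", "known_sender": "no"}, "spam"),
    ({"has_link": "yes", "known_sender": "no"}, "spam"),
    ({"has_link": "no", "known_sender": "no"}, "spam"),
    ({"has_link": "no", "known_sender": "yes"}, "ham"),
    ({"has_link": "yes", "known_sender": "yes"}, "ham"),
    ({"has_link": "no", "known_sender": "yes"}, "ham"),
]

def train(records):
    """Count classes and per-class feature values from labelled records."""
    class_counts = Counter(label for _, label in records)
    # value_counts[label][feature][value] = number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for features, label in records:
        for name, value in features.items():
            value_counts[label][name][value] += 1
            feature_values[name].add(value)
    return class_counts, value_counts, feature_values

def predict(model, features):
    """Return the class with the highest smoothed log-probability."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for name, value in features.items():
            seen = value_counts[label][name][value]
            distinct = len(feature_values[name])
            # Laplace-smoothed conditional probability
            score += math.log((seen + 1) / (count + distinct))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
label = predict(model, {"has_link": "yes", "known_sender": "no"})
```

The classifier treats the features as independent given the class, which is the simplifying assumption that gives Naïve Bayes its name.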
Classification is used in a wide variety of applications, such as identifying promising customers, detecting spam e-mail, and determining the categories of new specimens in science or medicine. Determining continuous values such as temperatures and stock prices is also called prediction of future values.

Clustering: It may be possible to define degrees of similarity between data even if the categories of the data are not known in advance. The opposite concept of similarity is dissimilarity, or distance. Grouping the data in a collection so that items similar to each other, according to the defined similarity, fall into the same group is called cluster analysis or clustering, which is also one of the basic technologies of data mining. Unlike classification, clustering does not require the names and characteristics of the clusters to be known in advance. Techniques such as the hierarchical agglomerative method and the non-hierarchical k-means method are often used for clustering. Promising applications of clustering include discovering groups of similar customers for marketing.

Outlier Detection: This data mining task detects exceptional values, or values that differ from standard values. There are methods for outlier detection based on statistical models, data distances, and data densities. Outliers can alternatively be found using clustering and classification. Outlier detection has been used in applications such as the detection of credit card fraud and network intrusions.

2.3 Big data vs traditional DBMS

Big data offers compelling opportunities for data manipulation. It lets us work with huge volumes of semi-structured and unstructured data that a traditional database is not able to store. Moreover, it gives us a chance to uncover hidden insights in large sets of data [10]. Enterprises and companies tend to track their customers and monitor their transactions in order to obtain the statistics they need.
Thus, evaluating customers' behavior provides a vantage point over the whole system and supports advanced research aimed at long-term goals [6]. For example, the loyalty program of Tesco, a British multinational grocery and general merchandise retailer, generates a tremendous amount of customer data that the company mines to inform decisions ranging from promotions to strategic segmentation of customers. Amazon uses customer data to power its recommendation engine (“you may also like …”), which is based on a type of predictive modelling technique called collaborative filtering [6]. In this method, “the system observe what the user has been done together with what all users have done (what items they have bought, what music they have listened) and predict how the user’s might behave in the future[11]”.

2.4 Limitations of the traditional DBMS

In a relational database, we can cope with structured and sometimes semi-structured data. The data is neatly formatted and fits the schema. Data must fit into tables; if it does not, a more complex database that is more difficult to handle has to be designed. This approach might result in the loss of some hidden information. In addition, the schema of a traditional relational database is not suitable for certain dynamic information, such as weather patterns, that changes constantly. However, “There are some more flexible mechanisms, such as the ability to store XML documents and binary data, but the capabilities for handling these types of data are usually quite limited [10]”. Furthermore, in a traditional database, data must be placed at a central node to be processed. As the data grows, the central processing node has to be extended, and consequently there are limits that depend on the chosen hardware platform, such as memory size [12].
“It’s important that you understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform [14].” In a Big Data platform, there is no such restriction on the data that can be stored. We can keep all sorts of data, structured, semi-structured and, in particular, unstructured, and query it easily. Big data solutions store the data in its raw format and apply a schema only when the data is read, which preserves all of the information within the data [10].
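The schema-on-read idea in the last sentence can be sketched in a few lines of Python. Raw records are kept exactly as they arrived, and a schema is applied only when a query reads them. The record contents, the `RAW_STORE` list and the `read_with_schema` helper are hypothetical names chosen for this illustration.

```python
import json

# Raw store: every record is kept verbatim as a JSON line; nothing is
# dropped because it fails to match a predefined table layout.
RAW_STORE = [
    '{"user": "u1", "item": "book", "price": "12.50"}',
    '{"user": "u2", "item": "cd", "price": "8", "comment": "gift"}',
    '{"user": "u3", "clicked": "banner-7"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) at read time.

    Records lacking a required field are skipped for this query only;
    they remain untouched in the raw store.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: convert(record[field])
                   for field, convert in schema.items()}

# Two different schemas applied to the same raw data.
purchases = list(read_with_schema(RAW_STORE, {"user": str, "price": float}))
clicks = list(read_with_schema(RAW_STORE, {"user": str, "clicked": str}))
```

Because the schema lives in the query rather than in the store, a later analysis can ask for fields (such as `comment`) that no earlier query anticipated.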
  • 3. 3. DATA MINING TECHNIQUES

Traditionally, data mining handles transactions, which are recorded in databases when customers actually purchase products or services. Analyzing transactional data leads to the discovery of frequently purchased products or services, especially those bought by repeat customers. But transaction mining cannot obtain information about customers who are likely to be interested in products or services but have not yet purchased any. In other words, it cannot discover prospective customers who are likely to become new customers in the future.

In the physical world, however, customers look at or touch interesting items displayed on the racks. They sample interesting videos or audio if they can. They may even smell or taste interesting items if possible, and even if interesting items are unavailable for some reason, customers talk about them or collect information about them. These behaviors can be considered parts of the interactions between customers and systems. Such interactions indicate the interests of latent customers, who in the end either purchase the items or, for some reason, do not. Analyzing interactions in the physical world leads to an understanding of which items customers are interested in. Such analysis, however, does not reveal which aspects of the items the customers are interested in, why they bought the items, or why they did not. Therefore, if the interests of the users are extracted from heterogeneous data sources and the reasons for purchasing or not purchasing the items are uncovered, it will be possible to obtain valuable information about latent customers. Traditional mining of transactional data and the new mining of interactional data are called transaction mining and interaction mining, respectively.

3.1 NoSQL as a Database

It has been reported that 65% of queries processed by Amazon depend on primary keys [Vogels 2007].
Therefore, key-value stores, which provide data access based on keys, are used by Internet giants such as Google and Amazon. Concrete key-value stores include Amazon's DynamoDB [DynamoDB 2014], Google's BigTable [Chang et al. 2006], HBase [HBase 2014] from the Hadoop project, and Cassandra [Cassandra 2014], originally developed at Facebook. Generally, given key data, key-value stores are suitable for looking up the non-key data (attribute values) associated with the key. First, a hash function is applied to each node that stores data, and the node is mapped to a point (i.e., a logical place) on a ring-type network. When data is stored, the same hash function is applied to the key value of each data item, and the item is likewise mapped to a point on the ring. Each item is stored on the nearest node in the clockwise direction around the ring. Thus, for data access, the hash function is applied to the key value and the nearest node is looked up. This access structure is called consistent hashing, and it is also adopted by P2P systems used for purposes such as file sharing.

3.2 MapReduce

MapReduce can be regarded as a design pattern that processes tasks efficiently by scaling out in a straightforward manner. For example, human users browsing web sites and robots crawling for search engines leave access log data on Web servers when they access the sites. It is therefore necessary to extract the sessions (i.e., coherent series of page accesses) of each user from the recorded access log data and store them in databases for further analysis. Generally, such a task is called extraction, transformation, and loading (ETL). MapReduce is suitable for applications that perform such ETL tasks. It divides a task into subtasks and processes them in a parallel, distributed manner. MapReduce suits cases where the processing method is exactly the same for every subtask and only the data or parameters differ.
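The session-extraction task mentioned above can be sketched as a map step, a grouping step and a reduce step in plain Python. This is a single-process illustration of the pattern, not Hadoop code; the log format, the 30-minute session gap and the function names are assumptions made for the example.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds of inactivity that end a session (assumed)

# Hypothetical access log: "<unix time> <user> <page>"
LOG = [
    "1000 alice /home",
    "1060 bob /home",
    "1120 alice /products",
    "4000 alice /home",
    "1090 bob /cart",
]

def map_phase(line):
    """Emit (user, (timestamp, page)) for one access-log line."""
    timestamp, user, page = line.split()
    return user, (int(timestamp), page)

def shuffle(pairs):
    """Group mapped values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(user, visits):
    """Split one user's visits into sessions separated by long gaps."""
    sessions, current, last_time = [], [], None
    for timestamp, page in sorted(visits):
        if last_time is not None and timestamp - last_time > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(page)
        last_time = timestamp
    if current:
        sessions.append(current)
    return user, sessions

grouped = shuffle(map_phase(line) for line in LOG)
sessions = dict(reduce_phase(user, visits) for user, visits in grouped.items())
```

Each call to `map_phase` and each call to `reduce_phase` depends only on its own input, so in a real cluster the calls could run on different servers; only the data differs, not the processing.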
First, the Map phase is carried out, and its outputs are rearranged so that they are suitable as input to the Reduce phase. For applications with inherent similarity (here, identical processing) and diversity (here, different data and parameters), MapReduce exploits these characteristics to improve processing efficiency. Parallelization and distribution of large-scale computations are the two factors that motivated this kind of model.

3.3 Hadoop

Hadoop [Hadoop 2014] is open source software for distributed processing on a computer cluster, which consists of two or more servers. Hadoop consists of a distributed file system called HDFS (Hadoop Distributed File System), the MapReduce framework itself, and common libraries called Hadoop Common. A computer system is a collection of clusters, each consisting of two or more servers. Data is divided into blocks. One block of the original data is stored on a server determined by Hadoop, and at the same time copies are stored on two other servers (by default) in racks other than the rack holding the server with the original. This data placement aims to improve availability, and it also aims to improve parallelism. A special server called the NameNode manages data placement in HDFS. The NameNode keeps track of all the metadata of data files. The metadata reside in main memory for high-speed access. Therefore, the NameNode server should be more reliable than the other servers. If copies of the same data exist on two or more servers, there are more candidate solutions when tasks are divided into multiple subtasks and processed in parallel. When Hadoop is given a task, it finds the location of the relevant data by consulting the NameNode and sends the program to be executed to the server that stores the data.
This is because communication cost for sending programs is generally lower than that for sending data.
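The idea of sending the program to the data can be sketched in a few lines of Python. This is an illustrative toy, not Hadoop's actual scheduler or API. The block identifiers, server names, the `BLOCK_LOCATIONS` table (standing in for the metadata the NameNode keeps), and the least-loaded selection rule are all assumptions.

```python
# Hypothetical block-to-server metadata, standing in for what the NameNode holds.
# Each block has three replicas, matching the default described above.
BLOCK_LOCATIONS = {
    "block-1": ["server-a", "server-c", "server-e"],
    "block-2": ["server-b", "server-c", "server-f"],
}


def pick_server(block_id, load):
    """Choose the least-loaded server among those that already store block_id."""
    replicas = BLOCK_LOCATIONS[block_id]
    # min() returns the first replica with the smallest load, so ties go to
    # the earliest server in the replica list.
    return min(replicas, key=lambda server: load.get(server, 0))


def schedule(block_ids):
    """Assign one subtask per block to a server that holds a copy of the block."""
    load = {}
    plan = []
    for block_id in block_ids:
        server = pick_server(block_id, load)
        plan.append((block_id, server))
        load[server] = load.get(server, 0) + 1
    return plan


if __name__ == "__main__":
    for block_id, server in schedule(["block-1", "block-2", "block-1"]):
        print(f"run subtask for {block_id} on {server}")
```

Because each block has several replicas, the second subtask on `block-1` can be placed on a different server than the first. This reflects how replication increases the number of candidate assignments for parallel processing.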
4. ISSUES AND CHALLENGES

Variety and heterogeneity: In the past, the datasets we dealt with were quite simple and homogeneous. Now we have to work with structured, semi-structured, and unstructured data. Structured data is compatible with conventional DBMSs. Semi-structured and unstructured datasets need to be handled by adequate, state-of-the-art platforms.

Volume/Scalability: Data now exists at a tremendous scale, which gives us an opportunity to discover hidden knowledge and to serve and understand people better. Two approaches, if exploited properly, may lead to the scalability that future data and mining systems need in order to manage and mine big data [5,6]. The first is advanced user interaction: mining data in a straightforward manner is an extremely time-consuming task over a large search space, but with user interaction the search space can be narrowed to more promising subspaces. The second is cloud computing, which has shown admirable elasticity and which, combined with massively parallel computing architectures, can make our systems scalable.

Velocity/Speed: We must finish processing and mining within a desired time, or else the information is useless. Speed depends on (a) data access time and (b) the efficiency of the mining algorithms. Exploiting advanced indexing schemes is the key to the speed issue; multidimensional indexing structures such as the R-tree are useful for reducing big data access time. An additional way to boost the speed of big data access and mining is to identify and exploit as much of the potential parallelism in the access and mining algorithms as possible.

Accuracy, trust and provenance: In the past, we dealt with datasets that were reliable. In the era of big data, we are urged to deal with a considerable amount of unstructured and unreliable data. Moreover, how can we trust unreliable data?
The use of learning algorithms is an appropriate way to determine the credibility of a data source; these algorithms should be able to update the credibility of the source in a timely manner.

Privacy crisis: Almost any piece of information about someone can be mined from the Internet because data is interconnected; once this information is put together, privacy disappears. We are working on developing a mining system that can mine a huge portion of the web, and these same tools can be used to retrieve personal and confidential information about an individual.

Interactiveness: This is the capability of a data mining system to allow user interaction such as feedback and guidance. Interactiveness can help narrow down the search space, which accelerates processing and increases system scalability. Heterogeneity can also be overcome by allowing users to interpret intermediate and final results through interaction. Interactiveness improves data mining results; even if a data mining system is professionally designed, without interactiveness the value of its results can be discounted or simply rejected.

Garbage mining: On the WWW, data is generated very fast and becomes outdated very fast, so cyberspace cleaning is required. This is not easy for foreseeable reasons: garbage is hidden, and there is an ownership issue. Are you allowed to dispose of or collect garbage that does not belong to you? We propose applying data mining approaches to mine garbage and recycle it. We believe garbage mining is a serious research topic: mining for garbage is mining for knowledge.

REFERENCES
1. S. Hendrickson, Getting Started with Hadoop with Amazon's Elastic MapReduce, EMR, 2010.
2. M. Hilbert and P. López, "The world's technological capacity to store, communicate, and compute information," Science, vol. 332, no. 6025, pp. 60–65, 2011.
3. J. M. Wing, "Computational thinking and thinking about computing," Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 366, no. 1881, pp. 3717–3725, 2008.
4. J. Mervis, "Agencies rally to tackle big data," Science, vol. 336, no. 6077, p. 22, 2012.
5. http://www.marklogic.com/blog/birth-of-big-data/
6. Che, D., Safran, M., Peng, Z., From big data to big data mining: Challenges, Issues and Opportunities, In: DASFAA Workshops 2013, LNCS 7827, pp. 1–15, 2013.
7. https://jeremyronk.wordpress.com/2014/09/01/structured-semi-structured-and-unstructured-data/
8. http://whatis.techtarget.com/definition/semi-structured-data
9. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A. H., Big Data: The next frontier for innovation, competition and productivity, McKinsey Global Institute, p. 33, June 2011.
10. https://msdn.microsoft.com/en-us/library/dn749785.aspx
11. https://en.wikipedia.org/wiki/Collaborative_filtering
12. Salehinia, A., Comparisons of Relational Databases with Big Data: a Teaching Approach, South Dakota State University, Brookings, SD 57007.
13. Zikopoulos, P. C., Eaton, Ch., deRoos, Ch., Deutsch, Th., Lapis, G., "Understanding Big Data," p. 5, 2012.
14. Zikopoulos, P. C., Eaton, Ch., deRoos, Ch., Deutsch, Th., Lapis, G., "Understanding Big Data," p. 16, 2012.
15. Ishikawa, H., Social Big Data Mining, 2015.