The volume of business data worldwide, across all companies, doubles every
1.2 years (previously 1.5 years)
About 2,500 quadrillion bytes of data are produced daily, and more than 90 percent of
all data was produced within the past two years.
An average person today processes more data in a day than a 16th-century individual
did in an entire lifetime
In recent years, the cost of storage and processing power has dropped significantly
Bad data or poor data quality costs US businesses $600 billion annually
By 2015, 4.4 million IT jobs will be created globally to support big data
Facebook processes 10 TB of data every day; Twitter processes 7 TB
Google had over 3 million servers processing over 2 trillion searches per year
in 2012 (only 22 million in 2000)
The term Big data is used to describe a massive
volume of both structured and unstructured data
that is so large that it's difficult to process using
traditional database and software techniques.
Characteristics of Big Data
Volume - The quantity of data
Variety - categorizing the data
Velocity - speed of generation of data or the speed of processing
Variability - Inconsistency
Complexity - Managing the data
DATA MINING CHALLENGES WITH BIG DATA
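The main challenge for an intelligent database is handling Big Data. The
important thing is scaling to the large amount of data and providing
solutions to these problems, which the HACE theorem (Heterogeneous,
Autonomous sources with Complex and Evolving relationships) helps to frame.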
Location of Big Data sources - Big Data is commonly
stored in different locations
Volume of the Big Data - the size of Big Data keeps growing
Hardware resources - RAM capacity
Privacy - medical reports, bank transactions
Having domain knowledge
Getting meaningful information
Parallel computing programming
An efficient computing platform will not have
centralized data storage; instead, the platform
will be distributed across large-scale storage.
Restricting access to the data
Big Data Mining Tools
Hadoop is an open source software platform for scalable, distributed computing,
developed as an Apache Software Foundation project.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
Hadoop provides fast and reliable analysis of both structured and unstructured data.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Hadoop uses the MapReduce programming model to mine data.
A MapReduce program splits the input dataset into independent subsets,
which are then processed in parallel by map tasks.
Map() procedure: performs filtering and sorting
Reduce() procedure: performs a summary operation
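The Map() and Reduce() roles described above can be sketched in plain Python. This is only an illustration of the programming model on a single machine, not the Hadoop API: the function names map_phase, shuffle, reduce_phase and word_count are hypothetical, and the word-count task and sample input are assumptions chosen for brevity.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map(): emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group emitted values by key (a real framework does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): summarize all values that share a key, here by summing them."""
    return key, sum(values)


def word_count(splits):
    # Each split is mapped independently, so this step could run in parallel.
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: two independent subsets of a larger dataset.
splits = ["big data mining", "mining big data with Hadoop"]
counts = word_count(splits)
print(counts)
```

Because every call to map_phase depends only on its own split, the splits could be handed to separate workers, which is the property a cluster framework exploits.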
Big Data Mining Algorithm
Big data applications gather information from many sources.
To mine the data conventionally, we would need to gather all of the distributed data at a
centralized site. But this is prohibitive because of high data transmission costs
and privacy concerns.
Instead, mining is carried out at several levels, so that correlations and
patterns can be discovered from a combined variety of sources.
Global data mining is done through a two-step process.
At the data level, each local site uses its local data to calculate data statistics and
shares this information with the other sites to obtain a global view of the data distribution.
At the model level, each site produces local patterns. These patterns are produced
by mining the local data.
By sharing these local patterns with other local sites, we can produce a single
global pattern.
At the knowledge level, model correlation analysis investigates the relevance
between models generated from various data sources to determine how
closely the data sources are correlated with each other, and how to form
accurate decisions based on models built from autonomous sources.
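The exchange of local statistics between sites can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the function names local_statistics and global_statistics, the choice of count, sum and sum of squares as the shared summary, and the sample values are all hypothetical.

```python
import math


def local_statistics(values):
    """Computed at each site; only these three numbers leave the site, not the raw data."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def global_statistics(summaries):
    """Combine per-site summaries into a global mean and standard deviation."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    # Population variance from the aggregated moments; clamp tiny negative rounding errors.
    variance = max(total_sq / n - mean ** 2, 0.0)
    return {"n": n, "mean": mean, "std": math.sqrt(variance)}


# Hypothetical local data held at two autonomous sites.
site_a = [2.0, 4.0, 6.0]
site_b = [8.0, 10.0]

result = global_statistics([local_statistics(site_a), local_statistics(site_b)])
print(result)
```

Each site transmits three numbers regardless of how much data it holds, which addresses both the transmission cost and, to a degree, the privacy concern raised above.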
Applications of Big Data
Healthcare organizations can achieve better insight into disease trends and
Public sector agencies can catch fraud and other threats in real-time.
Applications of Multimedia data
Finding the travel patterns of travelers
CCTV camera footage
Photos and videos from social networks
Integration and mining of biological data from various sources in biological networks,
by the NSF (National Science Foundation).
Classifying Big Data streams at run time, by the Australian Research Council.
IEEE, Data Mining with Big Data, January 2014
McKinsey Global Institute, Big Data: The next frontier for innovation, competition and
productivity, May 2011
Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding, 2013, Data Mining with Big Data
Rezwan Ahmed, George Karypis, 2012, Algorithms for mining the evolution
of conserved relational states in dynamic networks
Wu X., 2000, Building Intelligent Learning Database Systems, AI Magazine
Oracle, June 2013, Unstructured Data Management with Oracle Database 12c
Valery A. Petrushin, Jia-Yu Pan, Cees G. M. Snoek, 2010, Tenth International Workshop on
Multimedia Data Mining
Big data [Online]. Available: www.en.wikipedia.org/wiki/Big_data
Big data [Online]. Available: www.webopedia.com/TERM/B/big_data.html
IBM big data and information management [Online]. Available: www-
Big data [Online]. Available: www.adainbigdata.com
Big Data Explained [Online]. Available: www.mongodb.com/big-data-explained
Big data [Online]. Available: www.sas.com/en_us/insights/big-data/what-is-big-data.html
Big Data Mining Tools [Online]. Available: www.albertbifet.com/big-data-mining-tools