3. Big Data is a term for data sets that are so large
or complex that traditional data processing application
software is inadequate to deal with them.
4. Data mining is the computing process of
discovering patterns in large data sets involving
methods at the intersection of machine
learning, statistics, and database systems.
5. Important Info
About 2,500 quadrillion (2.5 quintillion) bytes of data are produced every day, and more than
90 percent of all data was produced within the past two years.
An average person today processes more data in a day than a 16th-century individual did in an
entire lifetime
The volume of business data worldwide, across all companies, doubles every 1.2
years (was 1.5 years)
Bad data or poor data quality costs US businesses $600 billion annually
By 2015, 4.4 million IT jobs globally will be created to support big data (Gartner)
Facebook processes 10 TB of data every day / Twitter 7 TB
Google has over 3 million servers processing over 2 trillion searches per year in
2012 (only 22 million in 2000)
6. The 4 Vs of Big Data
Volume
• Data Quantity
Velocity
• Data Speed
Variety
• Data Types
Variability
• Inconsistency
7. Big Data Mining Algorithm
Big data applications gather information from many distributed sources.
To mine this data conventionally, all of the distributed data would have to be moved to a
centralized site. This is prohibitive because of high data transmission costs and
privacy concerns.
Mining models and their correlations is the key step that allows patterns
discovered from a variety of sources to be combined.
Global data mining is done through a two-step process:
Model level
Knowledge level
At the data level, each local site first uses its local data to calculate data statistics and
shares this information with the other sites to obtain the global data
distribution.
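The data-level sharing described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes each site shares only a count, a sum and a sum of squares, and the names LocalStats, summarize and combine are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalStats:
    """Summary a site shares instead of its raw records (hypothetical type)."""
    count: int
    total: float
    total_sq: float


def summarize(values: Iterable[float]) -> LocalStats:
    """Runs at each local site; only this summary leaves the site."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return LocalStats(count, total, total_sq)


def combine(summaries: List[LocalStats]) -> Tuple[float, float]:
    """Merges local summaries into the global mean and population variance."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data reported by any site")
    mean = sum(s.total for s in summaries) / n
    variance = sum(s.total_sq for s in summaries) / n - mean * mean
    return mean, variance


# Two sites with made-up values; the raw lists are never pooled.
site_a = [2.0, 4.0, 6.0]
site_b = [8.0, 10.0]
global_mean, global_var = combine([summarize(site_a), summarize(site_b)])
```

Only three numbers per site cross the network, which is the point of the approach: transmission cost stays small and individual records stay private.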
8. At the model level, each site mines its local data to produce a local
pattern.
By sharing these local patterns with the other local sites, a single
global pattern can be produced.
At the knowledge level, model correlation analysis investigates the relevance
between models generated from various data sources to determine how
closely the data sources are related to each other, and how to form
accurate decisions based on models built from autonomous sources.
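A rough sketch of the model and knowledge levels is given below. It rests on assumptions that are not in the slides: a "local pattern" is taken to be simple per-item counts, the global pattern keeps items above a minimum support, and the relatedness of two sources is measured with Jaccard overlap. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def mine_local_pattern(transactions: List[Set[str]]) -> Counter:
    """Model level: count how many local transactions contain each item."""
    counts: Counter = Counter()
    for transaction in transactions:
        counts.update(transaction)
    return counts


def merge_patterns(local_patterns: List[Counter],
                   total_transactions: int,
                   min_support: float) -> Dict[str, float]:
    """Combine local counts and keep items whose global support is high enough."""
    merged: Counter = Counter()
    for pattern in local_patterns:
        merged.update(pattern)
    supports = {item: c / total_transactions for item, c in merged.items()}
    return {item: s for item, s in supports.items() if s >= min_support}


def pattern_similarity(a: Counter, b: Counter) -> float:
    """Knowledge level: Jaccard overlap of the item sets two sites reported."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


# Made-up transactions at two sites.
site_1 = [{"milk", "bread"}, {"milk", "eggs"}]
site_2 = [{"milk", "bread"}, {"bread", "jam"}]
p1, p2 = mine_local_pattern(site_1), mine_local_pattern(site_2)
global_pattern = merge_patterns([p1, p2], total_transactions=4, min_support=0.5)
similarity = pattern_similarity(p1, p2)
```

Here the sites exchange only their pattern counts, and the similarity score gives a crude indication of how related the two sources are.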
9. DATA MINING CHALLENGES WITH BIG DATA
The main challenge for an intelligent database is handling Big Data.
The key tasks are scaling to large amounts of data and
providing solutions for these problems, as characterized by the HACE theorem
10. Challenges
Hardware resources: RAM capacity
Location of Big Data sources: Big Data is commonly
stored in different locations
Volume of Big Data: the size of Big Data grows
continuously
Privacy: medical reports, bank transactions
Having domain knowledge
Getting meaningful information
11. Solutions
Parallel computing programming
An efficient computing platform will
not have centralized data storage; instead,
the platform will be distributed across large-scale
storage.
Restricting access to the data
12. BIG Data Mining Tools
Hadoop
Apache S4
Storm
Apache Mahout
MOA
13. Hadoop
It is an Apache Software Foundation project and an open source
software platform for scalable, distributed computing.
Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
Hadoop provides fast and reliable analysis of both structured and
unstructured data.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Hadoop uses the MapReduce programming model to mine data.
A MapReduce program splits the input data set
into independent subsets, which are processed in parallel by map tasks.
Map() procedure that performs filtering and sorting
Reduce() procedure that performs a summary operation
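The Map() and Reduce() steps can be illustrated with a word count in plain Python. This sketch does not use the Hadoop API; map_phase, shuffle and reduce_phase are hypothetical names, and everything runs in one process only to show the data flow.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_phase(split: str) -> Iterator[Tuple[str, int]]:
    """Map(): emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key in sorted key order, as happens between the phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce(): summarize all values for one key, here by summing them."""
    return key, sum(values)


# Each string stands in for an independent subset of the input.
splits = ["big data mining", "big data tools", "data"]
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster each split would be handled by a separate map task on a different machine; the logic per split stays the same.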
14. Applications of Big Data
Healthcare organizations can achieve better insight into disease trends and
patient treatments.
Public sector agencies can catch fraud and other threats in real-time.
Applications of multimedia data
Finding the travel patterns of travelers
CCTV camera footage
Photos and videos from social networks
Recommender systems
Integration and mining of biological data from various sources in biological networks,
by the NSF (National Science Foundation).
Classifying Big Data streams at run time, by the Australian Research Council.
15. Advantages
Fast response
Extract useful information
Prediction of required data from large amounts of data.
Presentation of better results in the form of visualizations.
17. We Are: The Genius
Gopesh Singha ………………….1519
Md. Mizanur Rahman ………..…1524
Kawsar Ahmed ……………….…1531
Hasan Pervez…………………….1520
Editor's Notes
In 2012, the presidential election debate between Barack Obama and Mitt Romney triggered about 10 million tweets within 2 hours. The well-known photo-sharing website Flickr faced a similar problem: it receives 1.8 million photographs every day, each about 2 MB in size, so it needs approximately 3.6 TB of storage capacity per day. These situations show the reason for the rise of Big Data applications.
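The Flickr storage estimate can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 1,000,000 MB); the constant names are illustrative.

```python
PHOTOS_PER_DAY = 1_800_000   # figure quoted in the notes
PHOTO_SIZE_MB = 2            # approximate size per photo, from the notes
MB_PER_TB = 1_000_000        # assumption: decimal (SI) units

daily_storage_tb = PHOTOS_PER_DAY * PHOTO_SIZE_MB / MB_PER_TB
print(f"{daily_storage_tb} TB per day")
```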
Sources
Social network
Satellite data
Geographical data
Live streaming data