1. Big Data Basics
AUTHOR : MITHUN BANERJEE
DATE: 05-OCTOBER-2016
COPYRIGHT PROTECTED BY ECLIPSE TECHNO CONSULTING GLOBAL (P) LTD.
2. What is Big data?
Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
--Wikipedia
Is the above definition fully comprehensive?
Let's go deeper in the next slides.
3. Data units to measure exponential growth of data over the years
VOLUME of DATA
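To make the data units concrete, here is a minimal sketch that expresses a byte count in the largest fitting unit. It assumes decimal (SI) units, where each step is a factor of 1000; the function name `human_readable` is only illustrative and is not part of the original slides.

```python
# Decimal (SI) data units; each step up is a factor of 1000 (an assumption;
# binary units such as KiB use a factor of 1024 instead).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} {UNITS[-1]}"  # not reached; keeps the return type explicit


if __name__ == "__main__":
    for size in (1500, 2.5 * 10**12, 10**15):
        print(size, "->", human_readable(size))
```

Each unit in the list is 1000 times the previous one, which is why volumes that once fit in gigabytes are now discussed in petabytes and beyond.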
4. Type of data
Variety (complexities) of data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data: Social Network, Semantic Web (RDF), …
• Streaming Data: you can only scan the data once
• A single application can be generating/collecting many types of data
• Big Public Data (online, weather, finance, etc.)
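The "scan the data once" constraint on streaming data can be illustrated with a one-pass computation. The sketch below uses Welford's online algorithm to keep a running mean and variance without storing the stream; this choice of algorithm and the name `one_pass_stats` are assumptions for illustration, not something the slides prescribe.

```python
from typing import Iterable, Tuple


def one_pass_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after a single pass.

    Uses Welford's online update, so each value is read once and then
    discarded; memory use does not grow with the stream length.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


if __name__ == "__main__":
    # A generator can only be consumed once, like a real stream.
    readings = (x for x in [2, 4, 4, 4, 5, 5, 7, 9])
    print(one_pass_stats(readings))
```

Because the input may be a generator, there is no way to go back and re-read earlier values, which mirrors the streaming restriction.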
5. Velocity of data
Late decisions mean missed opportunities.
Example: healthcare monitoring. Sensors monitor your activities and body; any abnormal measurement requires an immediate reaction.
REALTIME / FAST DATA
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
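The healthcare monitoring example can be sketched as a filter that reacts to each reading as it arrives, instead of waiting for a batch job. The function name `abnormal_readings` and the default thresholds (40 and 120) are hypothetical values chosen only for illustration; real limits depend on the measurement and the patient.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp, measured value)


def abnormal_readings(
    readings: Iterable[Reading],
    low: float = 40.0,    # hypothetical lower bound
    high: float = 120.0,  # hypothetical upper bound
) -> Iterator[Reading]:
    """Yield each out-of-range reading as soon as it is seen."""
    for timestamp, value in readings:
        if value < low or value > high:
            yield timestamp, value


if __name__ == "__main__":
    sensor_feed = [(0, 72.0), (1, 130.0), (2, 38.0), (3, 80.0)]
    for timestamp, value in abnormal_readings(sensor_feed):
        print(f"ALERT at t={timestamp}: value {value}")
```

The generator yields an alert the moment an abnormal value is read, so the reaction time does not depend on how much data follows.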
8. Generation and Consumption of Data
In past
In present
OLTP: ONLINE TRANSACTION PROCESSING (DBMS)
OLAP: ONLINE ANALYTICAL PROCESSING (DATA WAREHOUSING)
RTAP: REAL-TIME ANALYTICS PROCESSING (BIG DATA ARCHITECTURE & TECHNOLOGY)
9. Driver of Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
10. The Evolution of Business Intelligence
Timeline: 1990's, 2000's, 2010's (axes: Speed and Scale)
• BI Reporting, OLAP & Data Warehouse: Business Objects, SAS, Informatica, Cognos, other SQL Reporting Tools
• Interactive Business Intelligence & In-memory RDBMS: QlikView, Tableau, HANA
• Big Data: Real Time & Single View: Graph Databases
• Big Data: Batch Processing & Distributed Data Store: Hadoop/Spark; HBase/Cassandra
11. Topic 1: Data Analytics & Data Mining
• EXPLORATORY DATA ANALYSIS
• LINEAR CLASSIFICATION (PERCEPTRON & LOGISTIC REGRESSION)
• LINEAR REGRESSION
• C4.5 DECISION TREE
• APRIORI
• K-MEANS CLUSTERING
• EM ALGORITHM
• PAGERANK & HITS
• COLLABORATIVE FILTERING
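As a taste of one topic in this list, here is a minimal sketch of k-means clustering (Lloyd's algorithm) on one-dimensional points. It assumes the caller supplies the starting centroids, and the name `kmeans_1d` is illustrative only; production work would normally use a library implementation.

```python
from typing import List, Sequence


def kmeans_1d(
    points: Sequence[float],
    centroids: Sequence[float],
    iterations: int = 10,
) -> List[float]:
    """Run Lloyd's algorithm on 1-D points from the given starting centroids."""
    centers = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters: List[List[float]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments no longer change the centers
        centers = new_centers
    return centers


if __name__ == "__main__":
    print(kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5]))
```

The two alternating steps, assign then update, are the whole algorithm; everything else is bookkeeping.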
12. Topic 2: Hadoop/MapReduce Programming & Data Processing
• ARCHITECTURE OF HADOOP, HDFS, AND YARN
• PROGRAMMING ON HADOOP
• BASIC DATA PROCESSING: SORT AND JOIN
• INFORMATION RETRIEVAL USING HADOOP
• DATA MINING USING HADOOP (KMEANS+HISTOGRAMS)
• MACHINE LEARNING ON HADOOP (EM)
• HIVE/PIG
• HBASE AND CASSANDRA
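The MapReduce programming model behind these topics can be sketched in a few lines. The example below simulates the map, shuffle, and reduce phases of a word count inside a single Python process; it does not use the Hadoop API, and all function names here are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases; keys are processed in sorted order."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

On a real cluster the map and reduce functions run in parallel on different machines and the framework performs the shuffle over the network; the division of work into the three phases is the same.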
14. References for in-depth homework
• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al.
• https://www.mongodb.com/collateral/big-data-examples-and-guidelines-enterprise-decision-maker
• http://www.aptude.com/blog/entry/hadoop-vs-mongodb-which-platform-is-better-for-handling-big-data
• http://www.slideshare.net/wlaforest/an-introduction-to-big-data-nosql-and-mongodb
• http://www.infoworld.com/article/2608460/application-development/the-10-worst-big-data-practices.html