Big Data Basics
AUTHOR: MITHUN BANERJEE
DATE: 05-OCTOBER-2016
COPYRIGHT PROTECTED BY ECLIPSE TECHNO CONSULTING GLOBAL (P) LTD.
What is Big data?
Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
--Wikipedia
Is the above definition fully comprehensive?
Let's go deeper in the next slides.
Data units to measure the exponential growth of data over the years
VOLUME of DATA
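To make the unit ladder concrete, here is a minimal sketch that converts a raw byte count into the largest fitting unit. It assumes decimal (SI) units, where each step is a factor of 1000; the function name `human_readable` is illustrative and not from the slides.

```python
# Decimal (SI) data units, each 1000 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


if __name__ == "__main__":
    print(human_readable(1500))      # a small file
    print(human_readable(10 ** 21))  # one zettabyte
```

Binary units (KiB, MiB, and so on, with a factor of 1024) would follow the same pattern with a different divisor.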
Variety (complexities) of data
Type of data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
  Social Network, Semantic Web (RDF), …
• Streaming Data
  You can only scan the data once
• A single application can be generating/collecting many types of data
• Big Public Data (online, weather, finance, etc.)
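The "scan the data once" constraint on streaming data can be illustrated with a one-pass statistic. The sketch below uses Welford's online algorithm to keep a running mean and variance without storing the stream; it is one possible illustration, and the function name is an assumption.

```python
from typing import Iterable, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Single pass over the stream: returns (count, mean, population variance)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


if __name__ == "__main__":
    # A generator can only be consumed once, like a real data stream.
    readings = (x for x in [2, 4, 4, 4, 5, 5, 7, 9])
    print(running_mean_variance(readings))
```

Only three numbers are kept in memory, regardless of how long the stream is.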
Velocity of data
Late decisions → missed opportunities
Example: Healthcare monitoring: sensors monitor your activities and body → any abnormal measurement requires an immediate reaction
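A minimal sketch of the healthcare example: readings are checked as they arrive, and abnormal ones are flagged immediately rather than in a later batch. The thresholds (50 and 120 beats per minute) and the `(timestamp, bpm)` record shape are hypothetical values chosen for illustration, not clinical guidance.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[int, int]  # (timestamp in seconds, heart rate in bpm); assumed shape


def abnormal_readings(
    readings: Iterable[Reading], low: int = 50, high: int = 120
) -> Iterator[Reading]:
    """Yield each reading outside [low, high] as soon as it is seen."""
    for timestamp, bpm in readings:
        if bpm < low or bpm > high:
            yield timestamp, bpm


if __name__ == "__main__":
    sensor_feed = [(0, 72), (1, 130), (2, 45), (3, 80)]
    for timestamp, bpm in abnormal_readings(sensor_feed):
        print(f"ALERT at t={timestamp}s: heart rate {bpm} bpm")
```

Because the function is a generator, an alert can be raised per reading without waiting for the whole feed to finish.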
Velocity of data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
REALTIME / FAST DATA
3Vs: Volume, Variety, Velocity
4Vs
Generation and
Consumption of Data
In past
In present
OLTP: Online Transaction Processing (DBMS)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & Technology)
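The difference between OLTP and OLAP workloads can be sketched with Python's built-in `sqlite3` module. The `orders` table and its values are made up for illustration; a real OLAP system would run on a separate data warehouse rather than on the transactional database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP style: many small writes, each committed as its own transaction.
for region, amount in [("east", 20.0), ("west", 15.0), ("east", 30.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP style: one read-only query that aggregates over many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(totals)
```

RTAP would apply the aggregation continuously as rows arrive, instead of querying stored data after the fact.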
Driver of Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
The Evolution of Business Intelligence
Timeline: 1990's, 2000's, 2010's (axes: Speed and Scale)
• BI Reporting, OLAP & Data Warehouse: Business Objects, SAS, Informatica, Cognos, other SQL Reporting Tools
• Interactive Business Intelligence & In-memory RDBMS: QlikView, Tableau, HANA
• Big Data: Real Time & Single View: Graph Databases
• Big Data: Batch Processing & Distributed Data Store: Hadoop/Spark; HBase/Cassandra
Topic 1: Data Analytics & Data Mining
• EXPLORATORY DATA ANALYSIS
• LINEAR CLASSIFICATION (PERCEPTRON & LOGISTIC REGRESSION)
• LINEAR REGRESSION
• C4.5 DECISION TREE
• APRIORI
• K-MEANS CLUSTERING
• EM ALGORITHM
• PAGERANK & HITS
• COLLABORATIVE FILTERING
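As a taste of one item on this list, here is a minimal sketch of k-means clustering in plain Python. It assumes 2-D points, caller-supplied starting centroids, and a fixed number of iterations; production implementations add random initialization and a convergence check.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def kmeans(points: List[Point], centroids: List[Point], iterations: int = 10) -> List[Point]:
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid (squared distance).
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                updated.append((mean_x, mean_y))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, centroids=[(0, 0), (10, 10)]))
```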
Topic 2: Hadoop/MapReduce Programming & Data Processing
ARCHITECTURE OF HADOOP, HDFS, AND YARN
PROGRAMMING ON HADOOP
BASIC DATA PROCESSING: SORT AND JOIN
INFORMATION RETRIEVAL USING HADOOP
DATA MINING USING HADOOP (KMEANS + HISTOGRAMS)
MACHINE LEARNING ON HADOOP (EM)
HIVE/PIG
HBASE AND CASSANDRA
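The MapReduce programming model can be sketched in plain Python with a word count. This is a single-process simulation of the map, shuffle, and reduce phases and does not use the Hadoop API; all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big graph data"]))
```

In a real cluster, the map and reduce calls run on different machines and the shuffle moves data across the network between them.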
Topic 3: Graph Database and Graph Analytics
GRAPH DATABASE (http://en.wikipedia.org/wiki/Graph_database)
Native Graph Database (Neo4j)
Pregel/Giraph (Distributed Graph Processing Engine)
NEO4J/TITAN/GRAPHLAB/GRAPHSQL
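A typical graph query is "friends of friends" in a social network. The sketch below stores the graph as a plain Python adjacency dictionary and is only an illustration of the traversal idea; it does not use the Neo4j, Titan, or Giraph APIs, and the sample names are made up.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def friends_of_friends(graph: Graph, person: str) -> Set[str]:
    """Return people exactly two hops away: not the person, not a direct friend."""
    direct = graph.get(person, set())
    two_hops: Set[str] = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


if __name__ == "__main__":
    social = {
        "ann": {"bob"},
        "bob": {"ann", "cat"},
        "cat": {"bob", "dan"},
        "dan": {"cat"},
    }
    print(friends_of_friends(social, "ann"))
```

A native graph database answers this kind of query by following stored relationships directly, where a relational database would need self-joins.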
References for in-depth homework reading
• Hadoop: The Definitive Guide, Tom White, O'Reilly
• Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al.
• https://www.mongodb.com/collateral/big-data-examples-and-guidelines-enterprise-decision-maker
• http://www.aptude.com/blog/entry/hadoop-vs-mongodb-which-platform-is-better-for-handling-big-data
• http://www.slideshare.net/wlaforest/an-introduction-to-big-data-nosql-and-mongodb
• http://www.infoworld.com/article/2608460/application-development/the-10-worst-big-data-practices.html