4. Big Data – Srinath & Arjun
The Context
• Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
• Google collects 270PB data in a month (2007), and processes about 20PB a day (2008)
• 2010 census data is expected to be a huge gold mine of information
• Data mining huge amounts of data collected in a wide range of domains
from astronomy to healthcare has become essential for planning and
performance.
5. Big Data – Srinath & Arjun
• We are in a knowledge economy.
– Data is an important asset to any organization
– Discovery of knowledge; Enabling discovery; annotation of data
• We are looking at newer
– programming models, and
– Supporting algorithms and data structures.
The Context
6. Big Data – Srinath & Arjun
• Big Data is New
• Big Data is only about Massive Data Volume
• Big data means Hadoop
• Big data needs a Data Warehouse
• Big data means Unstructured Data
• Big data is for Social Media and Data mining Analyses
The Myth about Big Data
7. Big Data – Srinath & Arjun
It is all about better analytics on a broader spectrum of data, and
therefore represents an opportunity to create even more differentiation
among industries.
Big Data is…
8. Big Data – Srinath & Arjun
Where is the data coming from?
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
9. Big Data – Srinath & Arjun
Facebook
• 4.5 billion Facebook likes every day
• 350 million photos uploaded on a daily basis
• 250 billion photos stored by Facebook
• 10 billion messages sent every day
• 1 trillion posts in Facebook’s graph search database
• 500 TB of data processed daily
• 100 PB of data stored in Facebook’s Hadoop disk cluster (1 PB = 1000 TB = 1,000,000 GB)
Example of Big Data Generation
10. Big Data – Srinath & Arjun
Flights
• 1 Boeing plane engine generates 20TB of data for every hour of flying
• How much data do all the flights in the world generate every year if
there are 100,000 two-engine flights daily?
Example of Big Data Generation
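The flight question cannot be answered from the slide alone, because no flight duration is given. A rough back-of-the-envelope sketch, assuming a hypothetical average of 2 flying hours per flight (this figure is an assumption, not from the slide):

```python
# Back-of-the-envelope estimate for the flight data question.
TB_PER_ENGINE_HOUR = 20      # from the slide
ENGINES_PER_FLIGHT = 2       # from the slide
FLIGHTS_PER_DAY = 100_000    # from the slide
AVG_FLIGHT_HOURS = 2         # ASSUMPTION: not stated on the slide
DAYS_PER_YEAR = 365


def yearly_flight_data_tb(avg_flight_hours=AVG_FLIGHT_HOURS):
    """Estimated TB generated per year by all flights."""
    per_flight_tb = TB_PER_ENGINE_HOUR * ENGINES_PER_FLIGHT * avg_flight_hours
    return per_flight_tb * FLIGHTS_PER_DAY * DAYS_PER_YEAR


tb = yearly_flight_data_tb()
print(f"{tb:,} TB per year, about {tb // 1000:,} PB (1 PB = 1000 TB)")
```

Under that assumption the estimate comes to 2,920,000,000 TB a year. The result scales linearly with the assumed flight duration, so halving the duration halves the total.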
11. Big Data – Srinath & Arjun
• Black Box Data
• Social Media Data
• Stock Exchange Data
• Power Grid Data
• Transport Data
• Search Engine Data
What comes under Big data?
12. Big Data – Srinath & Arjun
• Capturing Data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Big Data Challenges
13. Big Data – Srinath & Arjun
Characteristics of Big Data
• Volume: 12+ terabytes of Tweets created daily.
• Variety: 100’s of different types of data.
• Veracity: Only 1 in 3 decision makers trust their information.
• Velocity: 5+ million trade events per second.
14. Big Data – Srinath & Arjun
• Structured data : Relational Data
• Semi Structured data : XML data
• Unstructured Data : Word, PDF, Text, Media Logs
Types of Data
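A small sketch of how the three kinds of data differ in practice. The record (a user named Asha in Chennai) and the table and tag names are made up purely for illustration:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a relational row with a fixed schema (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Asha', 'Chennai')")
row = conn.execute("SELECT name, city FROM users WHERE id = 1").fetchone()

# Semi-structured: XML tags describe the data, but no fixed schema is enforced.
doc = ET.fromstring("<user id='1'><name>Asha</name><city>Chennai</city></user>")
xml_name = doc.findtext("name")

# Unstructured: free text; any structure has to be inferred by the program.
note = "Asha moved to Chennai last year."
mentions_city = "Chennai" in note

print(row, xml_name, mentions_city)
```

The relational row can be queried by column, the XML by tag, while the free text only supports searching or heavier text analysis.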
15. Big Data – Srinath & Arjun
The Data Explosion
• 2.5 quintillion bytes of data created each day
• 90 % of data in the world was created in the last two years
17. Big Data – Srinath & Arjun
Hadoop
• Open Source Software Framework
• Inspired by Google’s MapReduce programming model and the Google File System (GFS)
• Originally written for the Nutch search engine project
• Written in Java
• Efficiently processes large volumes of Data
• Breaks up Big data into multiple parts
• Two key parts
• HDFS
• MapReduce
21. Big Data – Srinath & Arjun
HDFS – Hadoop Distributed File System
• A file system designed for storing very large files, running on clusters of
commodity hardware
• Highly fault-tolerant, distributed, reliable, scalable file system for data
storage
• Splits files into blocks (default block size 64MB in early Hadoop releases) and
stores multiple copies of each block on different nodes (default replication factor 3)
• Typically has a single namenode and a number of datanodes that together form
the HDFS cluster
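To make the storage model concrete, here is a minimal sketch of how many blocks and replicas a file occupies. It assumes a 64MB block size and HDFS's default replication factor of 3; the function name `hdfs_footprint` is hypothetical and not part of any Hadoop API:

```python
import math

BLOCK_SIZE_MB = 64   # assumed block size (default in early Hadoop releases)
REPLICATION = 3      # HDFS default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas across the cluster, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies as much disk as it actually holds,
    # so raw storage is the file size times the replication factor.
    return blocks, blocks * replication, file_size_mb * replication


print(hdfs_footprint(1000))
```

For a 1000MB file this gives 16 blocks, 48 block replicas spread over the datanodes, and 3000MB of raw disk usage.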
22. Big Data – Srinath & Arjun
HDFS Architecture
• Two types of Nodes
Master or Namenode
Slave or Datanode
26. Big Data – Srinath & Arjun
Hadoop Cluster Modes
• Standalone Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode
27. Big Data – Srinath & Arjun
MapReduce
Programming Model designed for processing large volumes of data in
parallel by dividing the work into a set of independent tasks
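The idea can be sketched with the classic word count. This is a single-process illustration of the map, shuffle, and reduce steps in plain Python, not Hadoop's Java API; all function names are chosen for this example:

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so this step could run in parallel.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(word_count(["big data is big", "data is an asset"]))
```

Because each line is mapped on its own and each key is reduced on its own, both phases can be spread across many machines, which is what makes the tasks independent.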
33. Big Data – Srinath & Arjun
Input Formats
• Text Input Format
• Sequence file input format
• Key value text input format
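A rough sketch of how the two text-based formats present records to a mapper: the text input format uses the byte offset of each line as the key and the line as the value, while the key value text input format splits each line at the first separator (a tab by default). This is a plain Python imitation for illustration, not the Hadoop classes themselves:

```python
def text_input_records(data: bytes):
    """Text input format style: key = byte offset of the line, value = line text."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, sep: str = "\t"):
    """Key value text input format style: split each line at the first separator."""
    for _, line in text_input_records(data):
        key, _, value = line.partition(sep)
        yield key, value


sample = b"k1\tv1\nk2\tv2 with\ttab\n"
print(list(text_input_records(sample)))
print(list(key_value_records(sample)))
```

Sequence files are a binary key/value container and are not shown here, since reading them needs the Hadoop libraries.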
34. Big Data – Srinath & Arjun
NoSQL
• NoSQL means “not only SQL”
• This includes key value stores, document-oriented databases, graph
databases, Bigtable-style structures, and caching data stores
E.g. MongoDB, Cassandra
36. Big Data – Srinath & Arjun
What is HIVE?
• Data Warehousing Infrastructure
• Data Summarization, ad-hoc querying and analysis of large
volumes of data
37. Big Data – Srinath & Arjun
HiveQL
• HiveQL is the Hive query language.
• Hive did not originally support transactions; ACID transactions were added in later releases.
38. Big Data – Srinath & Arjun
Hive Application
• Log Processing
• Text Mining
• Document indexing
• Customer-facing business intelligence (e.g. Google Analytics)
• Predictive modelling, hypothesis testing