Big Data – Srinath & Arjun
Agenda
• The BIG-DATA
• Hadoop
• Hadoop Components
• Hadoop Ecosystem
The BIG-DATA
The Context
• Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
• Google collected 270PB of data in a month (2007) and processed about 20PB (20,000TB) a day (2008)
• 2010 census data is expected to be a huge gold mine of information
• Mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
The Context
• We are in a knowledge economy.
– Data is an important asset to any organization
– Discovery of knowledge; enabling discovery; annotation of data
• We are looking at newer
– programming models, and
– supporting algorithms and data structures.
Myths about Big Data
• Big Data is new
• Big Data is only about massive data volume
• Big Data means Hadoop
• Big Data needs a data warehouse
• Big Data means unstructured data
• Big Data is for social media and data mining analyses
Big Data is…
It is all about better analytics on a broader spectrum of data, and
therefore represents an opportunity to create even more differentiation
among industries.
Where is the data coming from?
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009, 200M by 2014
Example of Big Data Generation
Facebook
• 4.5 billion Facebook likes every day
• 350 million photos uploaded daily
• 250 billion photos stored by Facebook
• 10 billion messages sent every day
• 1 trillion posts in Facebook’s graph search database
• 500 TB of data processed daily
• 100 PB of data stored in Facebook’s Hadoop disk cluster (1 PB = 1,000 TB = 1,000,000 GB)
Example of Big Data Generation
Flights
• One Boeing plane engine generates 20TB of data for every hour of flying
• How much data do all the flights in the world generate every year if there are 100,000 two-engine flights daily?
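The flights question can be worked as a back-of-envelope calculation. The slide gives the per-engine rate, the engine count, and the number of daily flights, but not how long a flight lasts, so the sketch below takes the average flight duration as a parameter. The 2-hour value is purely an illustrative assumption, and 1 EB is taken as 1,000,000 TB to match the decimal units used in the Facebook example.

```python
# Back-of-envelope estimate for the flights question.
TB_PER_ENGINE_HOUR = 20      # from the slide
ENGINES_PER_FLIGHT = 2       # from the slide
FLIGHTS_PER_DAY = 100_000    # from the slide
DAYS_PER_YEAR = 365


def yearly_flight_data_tb(avg_flight_hours):
    """Total data in TB per year for an assumed average flight duration."""
    tb_per_flight = TB_PER_ENGINE_HOUR * ENGINES_PER_FLIGHT * avg_flight_hours
    return tb_per_flight * FLIGHTS_PER_DAY * DAYS_PER_YEAR


# ASSUMPTION: a 2-hour average flight; the slide does not give a duration.
assumed_hours = 2.0
total_tb = yearly_flight_data_tb(assumed_hours)
print(f"{total_tb:,.0f} TB per year")
print(f"{total_tb / 1_000_000:,.0f} EB per year")
```

Every hour of average flight time adds 1.46 billion TB per year, so the answer scales linearly with whatever duration is assumed.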
What comes under Big Data?
• Black Box Data
• Social Media Data
• Stock Exchange Data
• Power Grid Data
• Transport Data
• Search Engine Data
Big Data Challenges
• Capturing Data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Characteristics of Big Data
• Volume: 12+ terabytes of Tweets created daily.
• Variety: 100s of different types of data.
• Veracity: only 1 in 3 decision makers trust their information.
• Velocity: 5+ million trade events per second.
Types of Data
• Structured data: Relational Data
• Semi-structured data: XML data
• Unstructured data: Word, PDF, Text, Media Logs
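As a rough illustration of the three types, the sketch below reads one record of each kind using only the Python standard library. The sample records (names, city, phone number) are invented for the example and do not come from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table (hypothetical sample row).
structured = io.StringIO("id,name,city\n1,Asha,Chennai\n")
row = next(csv.DictReader(structured))

# Semi-structured: self-describing tags, but fields may vary from record to record.
semi = ET.fromstring("<user id='1'><name>Asha</name><phone>555-0100</phone></user>")
name_from_xml = semi.findtext("name")

# Unstructured: free text with no schema; meaning has to be extracted.
unstructured = "Asha called from Chennai about her order."
mentions_city = "Chennai" in unstructured

print(row["city"], name_from_xml, mentions_city)
```

The structured row can be addressed by column name, the XML by tag, and the free text only by searching its content.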
The Data Explosion
• 2.5 quintillion bytes (2.5 exabytes) of data created each day
• 90 % of data in the world was created in the last two years
Hadoop
Hadoop
• Open-source software framework
• Inspired by Google’s MapReduce programming model and the Google File System (GFS)
• Originally written for the Nutch search engine project
• Written in Java
• Efficiently processes large volumes of data
• Breaks up Big Data into multiple parts
• Two key parts
– HDFS
– MapReduce
History of Hadoop
Hadoop Architecture
Hadoop Components
HDFS – Hadoop Distributed File System
• A file system designed for storing very large files, running on clusters of commodity hardware
• A highly fault-tolerant, distributed, reliable, and scalable file system for data storage
• Splits files into blocks (default 64MB; 128MB from Hadoop 2 onward) and stores multiple copies of each block on different nodes (default replication factor 3)
• Typically has a single NameNode and a number of DataNodes that form the HDFS cluster
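To make the block and replica idea concrete, here is a minimal sketch that estimates how a single file is laid out. It uses the 64MB block size quoted on the slide and assumes the HDFS default replication factor of 3, which the slide does not state. The function name `hdfs_footprint` is made up for this example and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # ASSUMPTION: HDFS default replication factor, not on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies its actual size, so raw usage
    # is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


print(hdfs_footprint(1000))
```

A 1,000MB file therefore needs 16 blocks, and the cluster holds 48 block replicas spread over different DataNodes.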
HDFS Architecture
• Two types of nodes
– Master, or NameNode
– Slave, or DataNode
HDFS Architecture
Read a File
Write a File
Hadoop Cluster Modes
• Standalone Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode
MapReduce
A programming model designed for processing large volumes of data in
parallel by dividing the work into a set of independent tasks.
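The classic illustration of this model is word count. The sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It is not Hadoop's Java API; the function names are chosen for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped independently, and each key reduced independently.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


result = word_count(["big data big", "data hadoop"])
print(result)
```

Because no mapper depends on another mapper's input and no reducer on another reducer's key, a real cluster can run these tasks on different nodes at the same time.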
Terminology
• Job
• Task
• Task Attempt
• NameNode
• MasterNode
• SlaveNode
• Clusters
• Commodity Hardware
Components
• Master Nodes
• Slave Nodes
Workflow
Example
Closer Look
Input Formats
• Text input format (TextInputFormat)
• Sequence file input format (SequenceFileInputFormat)
• Key-value text input format (KeyValueTextInputFormat)
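The difference between the text and key-value text formats lies in how each input line becomes a key/value record for the mapper. The sketch below mimics that behaviour in plain Python: the text format uses the line's byte offset as the key, while the key-value format splits each line at the first tab. These functions are stand-ins written for this example, not Hadoop classes, and the sample log lines are invented.

```python
def text_input_format(data):
    """Mimic the text format: key = byte offset of the line, value = line text."""
    records, offset = [], 0
    for raw in data.splitlines(keepends=True):
        records.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)
    return records


def key_value_text_input_format(data, separator="\t"):
    """Mimic the key-value text format: split each line at the first separator."""
    records = []
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records


sample = b"user1\tlogin\nuser2\tlogout\n"
print(text_input_format(sample))
print(key_value_text_input_format(sample))
```

A line without a separator ends up with the whole line as the key and an empty value in the key-value variant.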
NoSQL
• NoSQL means “not only SQL”
• This includes key-value stores, document-oriented databases, graph databases, Bigtable-style structures, and caching data stores
• E.g. MongoDB, Cassandra
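Two of these data models can be contrasted with plain Python dictionaries. This is only a conceptual sketch under invented sample data; it does not use the MongoDB or Cassandra client APIs.

```python
# Key-value store: an opaque value looked up by its key.
kv_store = {}
kv_store["session:42"] = "user=asha;ttl=3600"

# Document store: each record is a nested, self-describing document,
# and documents in one collection need not share the same fields.
documents = [
    {"_id": 1, "name": "Asha", "tags": ["admin"]},
    {"_id": 2, "name": "Ravi", "address": {"city": "Chennai"}},
]

# Query on a nested field that only some documents have.
in_chennai = [
    d["name"] for d in documents
    if d.get("address", {}).get("city") == "Chennai"
]
print(kv_store["session:42"], in_chennai)
```

The key-value store can only fetch by key, whereas the document model allows queries on fields inside the record.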
Hadoop Ecosystem
What is Hive?
• Data warehousing infrastructure
• Data summarization, ad-hoc querying, and analysis of large volumes of data
HiveQL
• HiveQL is the Hive query language.
• Hive did not originally support transactions; later releases added limited ACID transaction support.
Hive Applications
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g. Google Analytics)
• Predictive modelling, hypothesis testing
Thank You….

Big data, Hadoop and Hive
