Harinderjit Kaur
M.Tech(CSE)
PIT KAPURHALA
What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
RDBMS
• Data is stored in a structured format: primary keys (PK), rows, columns, tuples, and foreign keys (FK).
• It was designed mainly for transactional data analysis.
• Data warehouses were later used for offline (historical) data.
• With the massive growth of the Internet and social networking (Facebook, LinkedIn), data became less structured.
What is Big Data?
• ‘Big Data’ is similar to ‘small data’, but bigger.
• Datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, and analytics.
3 V's of Big Data
• Volume: data quantity
• Velocity: data speed
• Variety: data types
Hadoop History
• In 2003, Doug Cutting was building Nutch, an open-source "Google": a web crawler plus an indexer.
• Crawling and indexing at web scale proved difficult: a massive storage and processing problem.
• In 2003 Google published its GFS paper, and in 2004 its MapReduce paper.
• Based on Google's papers, Doug redesigned Nutch's storage and processing layers, and the result was delivered in 2006 as Hadoop.
What is Hadoop?
• A framework of tools
• Open source, maintained by the Apache Software Foundation under the Apache License
• Supports running applications on Big Data
• Addresses the Big Data challenges: Volume, Velocity, Variety
What is Hadoop?
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  - Large datasets → terabytes or petabytes of data
  - Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• It is an open-source software framework written in Java
Hadoop makes it easier to store, process, and analyze large amounts of data on commodity hardware!
Apache Hadoop
• Developer(s): Apache Software Foundation
• Initial release: December 10, 2011
• Stable release: 2.6.0 / November 18, 2014
• Development status: Active
• Written in: Java
• Operating system: Cross-platform
• Type: Distributed file system
• License: Apache License 2.0
• Website: hadoop.apache.org
Characteristics of Hadoop
• Scalable
  A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
• Cost effective
  Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
• Flexible
  Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analysis than any one system can provide.
• Fault tolerant
  When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave, shared-nothing architecture
• A single master node coordinates many slave nodes
Hadoop Components
• HDFS (Storage): self-healing, high-bandwidth clustered storage
• MapReduce (Processing): fault-tolerant distributed processing
HDFS Basics
• HDFS (Hadoop Distributed File System) is a file system written in Java
• Sits on top of a native file system
• Provides redundant storage for massive amounts of data
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
• The NameNode constantly checks on the DataNodes
Hadoop Distributed File System (HDFS)
• Centralized NameNode: maintains metadata about files
• Many DataNodes (1000s): store the actual data
  - Files are divided into blocks
  - Each block is replicated N times (default = 3)
[Figure: a file F split into five 64 MB blocks, distributed and replicated across DataNodes]
HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times, with replicas stored on different DataNodes
• HDFS is intended for large files (100 MB+)
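A minimal sketch of how these block-size and replication settings surface through the HDFS Java API is shown below; the file path, the 128 MB block size, and the 3-way replication are illustrative assumptions, not values taken from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file to HDFS with an explicit block size and
// replication factor, then read the replication back from its metadata.
// The path and the 128 MB / 3-replica values are illustrative assumptions.
public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728");   // 128 MB blocks
        conf.set("dfs.replication", "3");         // default replication factor

        FileSystem fs = FileSystem.get(conf);              // cluster from core-site.xml
        Path file = new Path("/user/demo/sample.txt");     // hypothetical path

        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");                    // HDFS splits data into blocks behind the scenes
        }

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
    }
}
```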
2 Kinds of Nodes
Master Nodes Slave Nodes
Master Nodes
• NameNode
  - only 1 per cluster
  - metadata server and database
• JobTracker
  - only 1 per cluster
  - job scheduler
Slave Nodes
• DataNodes
  - 1-4000 per cluster
  - block data storage
• TaskTrackers
  - 1-4000 per cluster
  - task execution
NameNode
• A single NameNode stores all metadata
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information is maintained in RAM for fast lookup
• File system metadata size is therefore limited to the amount of RAM available on the NameNode
DataNode
• DataNodes store file contents
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
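As a hedged illustration of this block placement, the sketch below asks the NameNode (via the standard FileSystem API) which DataNodes hold each block of a file; the path is hypothetical and the output depends entirely on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list the DataNodes holding each block of a file.
// The file path is an illustrative assumption.
public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            // Each block reports the hostnames of the DataNodes holding a replica.
            System.out.println("Block " + i + " on: " + String.join(", ", blocks[i].getHosts()));
        }
    }
}
```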
MapReduce
• Programming model used by Google
• Input: a set of key/value pairs
• The user supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
• Map
  - Processes a key/value pair to generate intermediate key/value pairs
• Reduce
  - Merges all intermediate values associated with the same key
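To make the two-function contract above concrete, here is a small, self-contained Java sketch that runs map and reduce in a single process to count words. It only illustrates the model; a sketch of the actual Hadoop API version appears with the WordCount example later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-process sketch of the MapReduce contract:
//   map(k, v)            -> list(k1, v1)
//   reduce(k1, list(v1)) -> v2
// Word count is used as the example; no Hadoop classes are involved.
public class MapReduceModelSketch {

    // map: (lineNumber, lineText) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, list of counts) -> total count
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = { "the quick brown fox", "the lazy dog", "the fox" };

        // Shuffle phase: group every intermediate value by its key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (int i = 0; i < lines.length; i++) {
            for (Map.Entry<String, Integer> pair : map(i, lines[i])) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce phase: one call per distinct key.
        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```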
MapReduce
[Figure: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; the JobTracker hands task instances to TaskTrackers running on the slave nodes]
Properties of MapReduce Engine
• The JobTracker is the master node (runs alongside the NameNode)
  - Receives the user's job
  - Decides how many tasks will run (the number of mappers)
  - Example: a file with 5 blocks → run 5 map tasks, spread across the nodes holding those blocks
Properties of MapReduce Engine (Cont'd)
• The TaskTracker is the slave node (runs on each DataNode)
  - Receives tasks from the JobTracker
  - Runs each task to completion (either a map or a reduce task)
  - Stays in constant communication with the JobTracker, reporting progress
[Figure: in this example, one MapReduce job consists of 4 map tasks whose output is routed by a parse/hash step to 3 reduce tasks]
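The parse/hash step in the figure decides which reduce task receives each intermediate key. The sketch below mirrors the behaviour of Hadoop's default HashPartitioner (hash of the key modulo the number of reducers); it is an illustration, not code from the slides.

```java
// Sketch of how the "parse-hash" step routes each intermediate key to one of
// the reduce tasks, mirroring Hadoop's default HashPartitioner:
//   partition = hash(key) mod numReduceTasks
public class HashPartitionSketch {

    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;   // matches the 3 reduce tasks in the example
        for (String key : new String[] { "the", "quick", "brown", "fox" }) {
            System.out.println("key '" + key + "' -> reducer " + partitionFor(key, reducers));
        }
    }
}
```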
How Map and Reduce Work Together
• Map emits intermediate information
• Reduce accepts that information
• Reduce applies a user-defined function to reduce the amount of data
MapReduce Example - WordCount
Lifecycle of a MapReduce Job
• Write a Map function and a Reduce function
• Run the program as a MapReduce job (see the sketch below)
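The original WordCount listing did not survive extraction, so the sketch below reconstructs the standard Hadoop WordCount example (a Mapper, a Reducer, and a driver that submits the job) using the org.apache.hadoop.mapreduce API. It is a reconstruction of the canonical example, not the author's original code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (offset, line of text) -> (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure the job and submit it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical run loads the input into HDFS, submits the compiled jar with the `hadoop jar` command, and reads the results back from the output directory, which is exactly the workflow summarized on the next slide.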
Hadoop Workflow
[Figure: the developer ("You") interacts with the Hadoop cluster as follows]
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job
   3a. Go back to step 2 as needed
4. Retrieve the results from HDFS
Thank you