Hadoop and Big Data Overview

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.1
Prabhu Thukkaram
Director, Product Development
Oracle Complex Processing & SOA Suite
Feb 28, 2014

What is Big Data ?
HDFS
Map Reduce
HBase
Columnar DB
PIG Hive
ETL Tools BI Reporting
Self Healing
Clustered Storage
System
Distributed Data
Processing
Higher level
abstraction
Top-level interfaces
Structured Data,, Excel, etc
Unstructured & Semi-Structured Data, Web
Logs, Images, etc
SQOOP
Zoo
Keeper

HBase Quick Overview
 Relational Database – Product Table
Product Id Price SKU Inventory
Count
0001 1300 SKU0001 10
0002 2800 SKU0002 25
0003 5600 SKU0003 8
 Ideal for OLTP transactions
 Faster writes and record updates
 But slow for OLAP
 E.g. Select sum(InventoryCount) from Products;
 Reason:- Data for column “InventoryCount” is not contiguous

HBase Quick Overview
 HBase – Product Table
Product Id 0001 0002 0003
Price 1300 2800 5600
SKU SKU0001 SKU0002 SKU0003
Inventory
Count
10 25 8
 Extremely fast for OLAP
 E.g. Select sum(InventoryCount) from Products;
 Supports big data analytics – single table with million columns and
billion rows
43

Hadoop Cluster
Master
Job
Tracker
Name
Node
Slave 1
Task
Tracker
Data
Node
Slave 2
Task
Tracker
Data
Node
 Name Node
 Does not store data, maintains
directory tree of all files in the
cluster
 Tracks data blocks of a file across
the cluster
 Client apps like Hadoop Shell/CLI
talk to Name Node to locate,
create, move, rename, and delete
a file
 Returns a list of Data Node
servers where data lives
 Single point of failure, addressed
in Hadoop V2 or YARN
Hadoop
Shell, CLI,
etc

Hadoop/HDFS cluster
Master
Job
Tracker
Name
Node
Slave 1
Task
Tracker
Data
Node
Slave 2
Task
Tracker
Data
Node
 Data Node
 Stores & replicates data on the file
system
 Connects to Name Node on
startup & responds to file system
operations
 Hadoop Shell/CLI clients can talk
directly to Data Node if they know
the location of the data
 Data Nodes talk to each other
when replicating data
Hadoop
Shell, CLI,
etc

Writing a file to HDFS
Hadoop Client
NameNode
DataNode 1 DataNode 6DataNode 5 ……
File.txt
Blk A
Blk B
Blk C
Wants to write
Blocks A, B, C of
File.txt
Ok, write to data
nodes 1,5,6
Blk A Blk B Blk C
Blk A
Replication of Blk A
 Client consults Name Node
 Client writes block directly to one Data Node
 That Data Node replicates the block
 Client writes next block and the cycle repeats

Hadoop/HDFS cluster
Master
Job
Tracker
Name
Node
Slave 1
Task
Tracker
Data
Node
Slave 2
Task
Tracker
Data
Node
 Job Tracker – Accepts MR
jobs from Clients
 Contacts & submits MR tasks to
Task Trackers on located Data
Nodes
 Monitors Task Tracker nodes for
heartbeat/failures and resubmits
to a different Task Tracker as
needed
 Updates the status of a job when
complete
 Single point of failure, fixed in
MRV2 or YARN
 Note:- Jobs run as batch and clients
can retrieve the status by querying the
Job Tracker
Hadoop
Shell, CLI,
etc

Hadoop/HDFS cluster
Master
Job
Tracker
Name
Node
Slave 1
Task
Tracker
Data
Node
Slave 2
Task
Tracker
Data
Node
 Task Tracker – Accepts
map, & reduce tasks from
Job Tracker
 A predefined set of slots
determine the number of tasks it
can accept
 Spawns a separate JVM process
for the task
 Notifies the Job Tracker when the
process finishesHadoop
Shell, CLI,
etc

Map Reduce Example
• User submits a MR job/jar for input file URL hdfs://File.txt to Job Submitter on the Client JVM. Job Submitter Client
contacts Job Tracker to obtain a Job Id
• Job Tracker creates Map Tasks based on the number of input splits. Reduce jobs are defined by the job itself, configured
or in API call setNumReduceTasks()
• Client contacts Name Node to compute input splits. Copies the job jar, computed input splits, etc. to Job Tracker’s file
system with a directory named after the Job Id. Submits the Job. Job Tracker adds job to its Queue.

Map Reduce Example
• Map Task splits records and passes each record to user’s Map logic code
• In above example, user’s Map logic tokenizes each record to generate one or more key value pairs.
• Output of a Map task is partitioned as per the defined # of reducers, shuffled, and sorted. Each partition output is then
routed to its Reducer
• Reducer merges partition output from other Map tasks in the cluster and calls user’s reduce logic
• M/R guarantees that the input to every Reducer is sorted by key

Map Reduce – Under the Hood
Source: Hadoop, The Definitive Guide.
User’s
Map
Logic
Record Split
Sorted & partitioned
Single Map Task
Merged
User’s
Reduce
Logic
Single Reduce Task
From other Map Tasks
To other Reducers
Data Block

Map Reduce – Summary
Map – Process of organizing entity data as key-value pairs. Key could be
Customer Id, Purchase Order Id, etc.
M/R Framework – Ensures all data relevant to an entity/key is delivered
to a single Reducer
Reduce – Process of aggregating data related to an Entity and deducing
information.

Risk Analysis - Large no of CC txns, no
direct deposit into checking over the last
two months implies the customer is
unemployed and at high risk of
defaulting
Map Reduce - Risk Analysis
HDFS Global
View of
Customer
Credit Card Txns
Chat
Session
Checking
account
deposits and
withdrawals
Map
Phase
Reduce
Phase
Risk
Score
Gathers all data (CC
txns, chats, withdrawals,
etc.) pertaining to a single
customer.
Data in HDFS changes frequently and
hence the need to reevaluate risk using
batch jobs. Reevaluated results are
written to a DB to enable business
decisions. E.g. To approve a new credit
card or loan

Hadoop V1 - Limitations
 Cluster resource management is tightly coupled to Map Reduce Job
Tracker
 Job Tracker Functionality in V1
 Cluster resource management
 Application life-cycle Management (Job Scheduling/Re-scheduling/Monitoring)
 Can only run Map Reduce applications, poor utilization of cluster
 Need to run other kinds of applications – Real-
time, Graph, Messaging, etc.
 Scalability & single point of failure (Name Node & Job Tracker)
 Lack of wire compatibility protocols

Hadoop MRV2 or YARN

Hadoop MRV2 or YARN
 Job Tracker in V1 split into
 Global Resource Manager
 Application Master per Job Request
 Node Manager
 Application - classic MR job or a
DAG of jobs
Resource
Manager
Node
Manager
Container
App
Master
Node
Manager
Container
App
Master
Node
Manager
Container Container
Client Client
Job Status
Job Submission
Node Status
Resource Request

Hadoop MRV2 or YARN
Resource
Manager
 Resource Manager
 Ultimate authority for managing and scheduling resources in
cluster
 Works with the Node Manager to track and utilize available
containers
 Container is the unit of resource in YARN. E.g. 2 Cores & 2
GB memory, Disk, etc.
 Accepts Jobs from clients and delegates it to an Application
Master

Hadoop MRV2 or YARN
App Master
 Application Master
 Negotiates the required resources for job with RM
 Tracks job status and monitors progress
 By shifting job control to App Master in local slaves, YARN
provides better scale out and fault tolerance

Big Data Overview - Next

Hadoop and Big Data Overview

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Hadoop and Big Data Overview

Similar to Hadoop and Big Data Overview (20)

Recently uploaded

Recently uploaded (20)

Hadoop and Big Data Overview

Editor's Notes