Apache Hadoop 
Design Pathshala 
April 22, 2014 
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | 
http://designpathshala.com 
Course Details 
 The Motivation for Hadoop 
 Hadoop: Basic Concepts 
 Writing a MapReduce Program 
 Common MapReduce Algorithms 
 PIG Concepts 
 Hive Concepts 
 Working with Sqoop 
 Working with Flume 
 OOZIE Concepts 
 HUE Concepts 
 Reporting Tools 
 Project 
Apache Hadoop 
The Motivation for Hadoop 
Apache Hadoop Bigdata 
Training By Design Pathshala 
Contact us on: admin@designpathshala.com 
Or Call us at: +91 120 260 5512 or +91 98 188 23045 
Visit us at: http://designpathshala.com 
Design Pathshala 
 Every one of our courses is written by experts in their respective fields. 
 We try our best to help you connect real-life examples with real business 
practices. 
 Learn and apply it to your work or your own business. 
 We provide online classes on different subjects, including Oracle HRMS, 
PeopleSoft HRMS & Java. 
 We have both weekday and weekend classes. 
Where does data come from? 
Machine generated and historical data 
Three V’s of Bigdata 
Volume 
Velocity 
Variety 
Volume .. Amount of data 
• ~3 ZB of data exist in the digital universe today. 
• >300 TB of data in U.S. Library of Congress. 
• Facebook has 30+ PB. 
• ~2.5 PB of data in DWH. 
• 10+ PB DWH size. 
Velocity .. How rapidly data is growing 
• 48 hours of new video every minute 
• 571 new websites every minute 
• 500+ TB to Facebook. 
• 175 million tweets every day 
• 1+ million customer transactions every hour 
• Data production will be 44 times greater in 2020 than it was in 2009. 
Variety .. Different types of data 
Structured 
• Traditional databases 
• Numeric data 
Semi-structured 
• JSON 
• XML 
Unstructured 
• Text documents 
• Email 
• Video 
• Audio 
• Machine generated 
How companies are cashing in on Big Data 
Predict exactly what customers want before they ask for it 
Marketing Campaign 
Improve customer service 
Fraud Detection 
Get customers excited about their own data 
Identify customer pain points and solve them 
Reduce health care costs and improve treatment 
Social Graph Analysis & Sentiment Analysis 
Research and development 
How data is used by some big Companies for 
different business analysis. 
Big Data Market Forecast 
Career options 
Big data jobs, big pay jobs 
Top Recruiters in India 
Hadoop & Hive History 
 Dec 2004 – Google MapReduce paper published 
 July 2005 – Nutch uses MapReduce 
 Feb 2006 – Hadoop becomes a Lucene subproject 
 Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster 
 Jan 2008 – Hadoop becomes an Apache Top Level Project 
 Jul 2008 – A 4000-node test cluster 
 Sept 2008 – Hive becomes a Hadoop subproject 
You Say, “tomato…” 
Google calls it:    Hadoop equivalent: 
GFS                 HDFS 
Bigtable            HBase 
Chubby              ZooKeeper 
Problems with current systems 
1 Machine 
• Read 1 TB data 
• 4 I/O channels 
• 100 MB/s per channel 
~45 mins 
Apache Hadoop Wins Terabyte Sort Benchmark (July 2008) 
 Yahoo! sorted 1 TB of data in 209 seconds 
 Beat Google's previous record of 297 seconds 
 The sort used 1800 maps and 1800 reduces 
 Cluster configuration used for the benchmark sort 
 910 nodes 
 2 quad-core Xeons @ 2.0 GHz per node 
 8 GB RAM per node 
Why Hadoop? 
1 Machine 
• Read 1 TB data 
• 4 I/O channels 
• 100 MB/s per channel 
~45 mins 
10 Machines 
• 4 I/O channels each 
• 100 MB/s per channel 
~4.5 mins 
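The ~45 and ~4.5 minute figures are consistent with a back-of-the-envelope calculation, sketched here in Python. It assumes each machine reads through 4 I/O channels at roughly 100 MB/s (megabytes per second) each, that 1 TB is 10^12 bytes, and that the data is spread evenly across the machines. The function name and parameters are illustrative only.

```python
# Assumptions (not from a benchmark): 4 I/O channels per machine,
# ~100 MB/s per channel, 1 TB = 10**12 bytes, data spread evenly.
TB = 10**12
MB = 10**6


def read_minutes(total_bytes, machines, channels_per_machine=4, mb_per_sec_per_channel=100):
    """Minutes needed to scan total_bytes when all channels read in parallel."""
    aggregate_bytes_per_sec = machines * channels_per_machine * mb_per_sec_per_channel * MB
    return total_bytes / aggregate_bytes_per_sec / 60


one_machine = read_minutes(TB, machines=1)
ten_machines = read_minutes(TB, machines=10)
print(f"1 machine:   {one_machine:.1f} min")
print(f"10 machines: {ten_machines:.1f} min")
```

Under these assumptions the single machine needs about 42 minutes and ten machines about 4.2 minutes, which the slide rounds to ~45 and ~4.5.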
Distributed File System (DFS) 
Separate shares: 
designpathshalaproject 
designpathshalasoftware 
designpathshalaimages 
designpathshalawebsites 
Namespace dp.global.in: 
 dp.global.inhomeproject 
 dp.global.inhomeimages 
 dp.global.inhomesoftware 
 dp.global.inhomewebsites 
Who uses Hadoop? 
• 42,000 nodes as of July 2011 
• 4100 nodes 
• 1400 nodes 
What is Hadoop 
 Hadoop is a framework for distributed processing of large datasets across 
large clusters of commodity computers using a simple programming model. 
 Large datasets: terabytes or petabytes of data 
 Large clusters: hundreds or thousands of nodes 
 Hadoop is an open-source implementation of Google's MapReduce 
 Hadoop is based on a simple programming model called MapReduce 
 Hadoop is based on a simple data model: any data will fit 
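To make the MapReduce programming model concrete, here is a minimal single-process sketch of a word count in plain Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical, and a real job would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

The only job-specific logic lives in the map and reduce functions; the grouping step in between is what the framework provides.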
What makes it especially useful 
 Scalable: It can reliably store and process petabytes. 
 Economical: It distributes the data and processing across clusters of commonly available 
computers (in thousands). 
 Efficient: By distributing the data, it can process it in parallel on the nodes where the 
data is located. 
 Reliable: It automatically maintains multiple copies of data and automatically redeploys 
computing tasks when failures occur. 
Hadoop: Assumptions 
 Hardware will fail. 
 Applications need a write-once-read-many access model. 
 Data transfer and I/O are the bottleneck 
 Very Large Distributed File System 
– 10K nodes, 100 million files, 10 PB 
 Assumes Commodity Hardware 
– Files are replicated to handle hardware failure 
– Detects failures and recovers from them 
 Move the logic to the data rather than the data to the logic 
HDFS Architecture 
[Diagram: Client, NameNode (metadata), Secondary NameNode, DataNodes] 
NameNode: holds the metadata, i.e. information about the data 
DataNode: stores the physical data 
SecondaryNameNode: keeps reading metadata from the NameNode (NN) 
Distributed File System 
 Single Namespace for entire cluster 
 Data Coherency 
– Write-once-read-many access model 
– Client can only append to existing files 
 Files are broken up into blocks 
– Typically 64 MB block size 
– Each block replicated on multiple DataNodes 
 Intelligent Client 
– Client can find location of blocks 
– Client accesses data directly from DataNode 
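As a rough illustration of how a file is broken into blocks, the following Python sketch computes block boundaries for a given file size. The 64 MB block size comes from the slide; the replication factor of 3 is assumed here as the usual default, and the function name is hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide
REPLICATION = 3                # assumed default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


file_size = 200 * 1024 * 1024  # a 200 MB file
blocks = split_into_blocks(file_size)
raw_storage = file_size * REPLICATION  # bytes stored across all replicas
print(len(blocks), "blocks; last block is", blocks[-1][1] // (1024 * 1024), "MB")
print("raw storage:", raw_storage // (1024 * 1024), "MB")
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB block; the last block only occupies as much space as it needs.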
Hadoop architecture 
Why Hadoop 2.0? 
 Major re-architecture of the distributed file system and processing 
 The YARN architecture makes it possible to run multiple kinds of 
workloads on Hadoop nodes 
 Interactive SQL support 
 Integrated streaming support 
 In-memory processing 
 Search 
 Enterprise security 
 Data lifecycle management 
 Readily available tools and libraries 
Hadoop 
Apache Hadoop and the Hadoop Ecosystem 
 MapReduce 
 A distributed data processing model and execution environment that runs on large 
clusters of commodity machines. 
 HDFS 
 A distributed filesystem that runs on large clusters of commodity machines. 
Apache Hadoop and the Hadoop Ecosystem 
 Pig 
 A data flow language and execution environment for exploring very large datasets. 
Pig runs on HDFS and MapReduce clusters. 
 Hive 
 A distributed data warehouse. Hive manages data stored in HDFS and provides a 
query language based on SQL, which the runtime engine translates to 
MapReduce jobs, for querying the data. 
 Sqoop 
 A tool for efficiently moving data between relational databases and HDFS. 
 Oozie 
 Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie 
Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. 
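To show what "a DAG of actions" means in practice, here is a small Python sketch that orders a workflow so every action runs after the actions it depends on. This is not Oozie's own workflow definition format; the action names are hypothetical and the ordering uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish first.
# All action names are hypothetical examples.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_load": {"pig_clean"},
    "export_report": {"hive_load"},
}

# static_order() yields the actions in an order that respects every dependency
# and raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Because the graph must be acyclic, a scheduler can always find such an order; a cycle would mean two actions each waiting on the other.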
Apache Hadoop and the Hadoop Ecosystem 
 HBase 
 A distributed, column-oriented database. HBase uses HDFS for its underlying 
storage, and supports both batch-style computations using MapReduce and point 
queries (random reads). 
 ZooKeeper 
 A distributed, highly available coordination service. ZooKeeper provides primitives 
such as distributed locks that can be used for building distributed applications. 
 Flume 
 Flume is a distributed, reliable, and available service for efficiently collecting, 
aggregating, and moving large amounts of log data. 
 Storm 
 Apache Storm is a free and open source distributed realtime computation system. 
Storm makes it easy to reliably process unbounded streams of data, doing for 
realtime processing what Hadoop did for batch processing. 
Apache Hadoop and the Hadoop Ecosystem 
 Spark 
 Apache Spark™ is a fast and general engine for large-scale data processing. 
 Drill 
 Apache Drill provides direct queries on self-describing and semi-structured data in 
files (such as JSON, Parquet) and HBase tables without needing to specify metadata 
definitions in a centralized store such as Hive metastore. 
 Avro 
 A serialization system for efficient, cross-language RPC, and persistent data 
storage. 
Oozie Workflows 
[Diagram. Recoverable labels: Pig, Sqoop Jobs/Hive Scripts; Source Databases 
(Reporting); HDFS (domain/xbec/dwh) holding Source Table Data, Temporary Table 
Data and DW Table Data; Hive Tables; MySQL Data Warehouse; Dashboard Reporting] 
Big Data Platform 
[Architecture diagram. Recoverable labels: 
 Base: HDFS / MapReduce / Hive / Pig on Hortonworks HDP 2.0 
 Core: Build & Deploy, VM Provisioning & Software Deployment, Cloud Foundry / OpenShift 
 Security Layer: SSO, User Auth, Authorization, Domain & User Mgmt, Kerberos Security, 
Hive Security, Pig/MR Security, HDFS Data Privacy 
 Data As A Service: Summary Data Services, Repository API, Command & Job API, 
Hive Metadata API, Workflow DSL API, Workflow API, Storage API, External MR Submissions 
 Integration: Hadoop Platform Integration, Gateway API, RDBMS Integration, API Queues, 
Flume Integration, Queuing & Ingestion, R Integration 
 Analytics: Real Time Analytics, Log Analytics, Data Analytics, Spatial Analytics, 
Tracking Analytics, Analytics Functions, Fast SQL Layer 
 Algorithms: Mahout Execution 
 Storage Management: Detached Storage, Archiving, External Storage, Existing DW 
 Events: Elastic Search & Indexing, Application & Notifications + CEP 
 Data Export] 
Multi-tenant Big Data Platform w/ Hadoop & PaaS Platforms 
[Architecture diagram. Recoverable labels: 
 Data Inputs Gateway (Talend, Fuse): JMS, RDBMS/Sqoop, log files/Flume, REST/WebHDFS, etc. 
 Ingestion Engine (Kafka) 
 Real-time Stream Processing Engine (Storm) 
 Intermediate Data Store & Search (MySQL/HBase, Cassandra, Solr) & Elastic Search 
 Hadoop HDFS (MapReduce/Pig/Hive) 
 SQL/Hive Engines (Data Warehouse) (Stringer/HAWQ) 
 Predictive Analytics & Machine Learning (Mahout/R) 
 Hadoop 2.0 Platform (Hortonworks) 
 Data Integration/ETL/Workflows (Oozie) 
 Cloud Orchestration (ZooKeeper, YARN, Ambari) 
 Reporting/BI Tools (ex: Jaspersoft) 
 Batch Metrics & ETL, Real-time Metrics, Predictive Metrics 
 Application/Rules/Webservices (REST) Layer (JBoss & JBRMS) 
 Devices, Visualization, Search, Reporting & Alerts 
 Analytics & Business Applications, Analytics Libraries, Ad-hoc/Interactive Analytics 
 Existing Data Warehouse Platforms (Traditional DW, Domain Specific), Data Access, 
Data Export, API Queues 
 PaaS Platform, Virtualization (Public/Private/Hybrid)] 
Hadoop complex queries: comparison with 
traditional databases 
Which Hadoop Distribution? 
Type: Pureplay (Apache/Open Source) 
 Hortonworks 
– Pros: 100% open source version; integration/services focused; extensive 
partnership network 
– Cons: slower interactive queries 
 Cloudera 
– Pros: widely used distribution; faster interactive queries; extensive tooling 
– Cons: proprietary extensions like Impala; commercial version only 
 MapR 
– Pros: enterprise and production-ready focused; works with NFS & native Unix 
commands 
– Cons: less focused on using new Hadoop features such as YARN, etc. 
Type: Proprietary 
 PivotalHD 
– Pros: faster interactive query support with Greenplum; integrates with 
CloudFoundry PaaS platform 
– Cons: proprietary extensions; not easy to decouple 
 IBM 
– Pros: offers open source without a branch version; integrated with PaaS and 
IBM tools 
– Cons: limited releases; expensive; may not be easy to decouple 
[Diagram: File F is split into blocks 1 to 5 (64 MB each). Replicas of data 
blocks 1, 2 and 3 are spread over Disks 1 to 12, which sit in Racks 1, 2 and 3.] 
Block Placement 
 Current Strategy 
-- One replica on local node 
-- Second replica on a remote rack 
-- Third replica on same remote rack 
-- Additional replicas are randomly placed 
 Clients read from nearest replica 
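The placement strategy listed above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS's actual placement code: the topology dictionary, node names, and function name are hypothetical, and real placement also weighs factors such as free space and node load.

```python
import random


def place_replicas(local_node, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology maps a rack name to the list of node names in that rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if local_node in nodes)
    chosen = [local_node]  # first replica on the local node
    if replication >= 2:
        # second and third replicas share one remote rack
        remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
        candidates = list(topology[remote_rack])
        rng.shuffle(candidates)
        chosen.extend(candidates[:min(2, replication - 1)])
    # any additional replicas are placed at random on the remaining nodes
    remaining = [n for nodes in topology.values() for n in nodes if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[:replication - len(chosen)])
    return chosen


topology = {
    "rack1": ["n1", "n2", "n3", "n4"],
    "rack2": ["n5", "n6", "n7", "n8"],
    "rack3": ["n9", "n10", "n11", "n12"],
}
replicas = place_replicas("n1", topology, rng=random.Random(0))
print(replicas)
```

Keeping two of the three replicas on one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.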
Main Properties of HDFS 
 Large: An HDFS instance may consist of thousands of server machines, each 
storing part of the file system’s data 
 Replication: Each data block is replicated multiple times (default is 3) 
 Failure: Failure is the norm rather than the exception 
 Fault Tolerance: Detection of faults and quick, automatic recovery from 
them is a core architectural goal of HDFS 
 DataNodes send heartbeats to the NameNode 
NameNode Metadata 
 Meta-data in Memory 
Types of Metadata 
– List of files 
– List of Blocks for each file 
– List of DataNodes for each block 
– File attributes, e.g. creation time, replication factor 
 A Transaction Log 
– Records file creations, file deletions, etc. 
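The kinds of metadata listed above can be pictured as a few in-memory maps plus an append-only log. The Python sketch below is only a mental model: the class and method names are hypothetical and do not correspond to the NameNode's real data structures.

```python
class NameNodeMetadataSketch:
    """Toy model of the metadata a NameNode keeps in memory."""

    def __init__(self):
        self.files = {}            # path -> blocks and file attributes
        self.block_locations = {}  # block id -> set of DataNode names
        self.transaction_log = []  # append-only record of namespace changes

    def create_file(self, path, block_ids, replication=3, creation_time=0):
        self.files[path] = {
            "blocks": list(block_ids),
            "replication": replication,
            "creation_time": creation_time,
        }
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())
        self.transaction_log.append(("create", path))

    def delete_file(self, path):
        for block_id in self.files.pop(path)["blocks"]:
            self.block_locations.pop(block_id, None)
        self.transaction_log.append(("delete", path))

    def block_report(self, datanode, block_ids):
        """Record which known blocks a DataNode says it holds."""
        for block_id in block_ids:
            if block_id in self.block_locations:
                self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """List each block of a file with the DataNodes holding it."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path]["blocks"]]


meta = NameNodeMetadataSketch()
meta.create_file("/logs/a.txt", ["blk_1", "blk_2"])
meta.block_report("dn1", ["blk_1", "blk_2"])
meta.block_report("dn2", ["blk_1"])
print(meta.locate("/logs/a.txt"))
```

In this model the block locations are rebuilt from DataNode reports, while file creations and deletions are what the transaction log records.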
DataNode 
 A Block Server 
– Stores data in the local file system 
– Stores meta-data of a block 
– Serves data to Clients 
 Block Report 
– Periodically sends a report of all existing blocks to the 
NameNode 
 Facilitates Pipelining of Data 
– Forwards data to other specified DataNodes 
Hadoop Master/Slave Architecture 
 Hadoop is designed as a master-slave shared-nothing architecture 
Master node (single node) 
Many slave nodes 
JobTracker 
 The master node runs a JobTracker instance, which accepts job requests from 
clients 
 There is only one JobTracker daemon running per Hadoop cluster 
 Determines the execution plan by determining which files to process 
 Assigns nodes to different tasks 
 Monitors all tasks as they are running 
TaskTracker 
 Manages execution of individual tasks on each data node 
 One TaskTracker per data node 
 Each TaskTracker can spawn multiple JVMs to handle many map or reduce 
tasks in parallel 
 The TaskTracker constantly communicates with the JobTracker 
 If the JobTracker fails to receive a heartbeat from a TaskTracker within a 
specified amount of time, it assumes the TaskTracker has crashed. In that case, 
the JobTracker resubmits the task to another TaskTracker. 
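The heartbeat-and-resubmit behaviour can be sketched as follows. This is a simplified, hypothetical model in Python (class name, method names, and the timeout value are all assumptions), with timestamps passed in explicitly so the behaviour is easy to follow.

```python
class JobTrackerSketch:
    """Toy model: track heartbeats and move tasks off silent TaskTrackers."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated (assumed value)
        self.last_seen = {}     # tracker name -> time of last heartbeat
        self.assignments = {}   # task id -> tracker name

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now

    def assign(self, task, tracker):
        self.assignments[task] = tracker

    def check(self, now):
        """Reassign tasks whose tracker has not sent a heartbeat in time."""
        dead = {t for t, seen in self.last_seen.items() if now - seen > self.timeout}
        alive = sorted(set(self.last_seen) - dead)
        moved = {}
        for task, tracker in self.assignments.items():
            if tracker in dead and alive:
                moved[task] = alive[0]  # simplistic choice of a replacement
        self.assignments.update(moved)
        for tracker in dead:
            del self.last_seen[tracker]
        return moved


jt = JobTrackerSketch(timeout=10)
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.assign("map-0", "tt1")
jt.heartbeat("tt2", now=12)  # tt1 stays silent
moved = jt.check(now=15)
print(moved)
```

Here `tt1` has been silent for 15 seconds against a 10-second limit, so its task moves to `tt2`, which reported 3 seconds ago.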
Job Tracker 
[Diagram: The user copies the input files to the DFS and submits the job to the 
client. The client gets the input file info, creates splits, uploads the job info 
(Job.xml, Job.jar) to the DFS, and submits the job to the Job Tracker.] 
Job Tracker Cont.. 
[Diagram: The client submits the job to the Job Tracker. The Job Tracker 
initializes the job in the job queue, reads the job files (Job.xml, Job.jar) and 
the splits (S) from the DFS, and creates the map (M) and reduce (R) tasks. 
Number of maps = number of input splits.] 
Job Tracker Cont.. 
[Diagram: Task Trackers send heartbeats to the Job Tracker. The Job Tracker 
picks a task from the job queue and assigns it to a Task Tracker.] 
Job Tracker Cont.. 
[Diagram: The Job Tracker assigns a task to a Task Tracker. The Task Tracker 
fetches Job.xml and Job.jar from the DFS and reads from local disk.] 
Heartbeats 
 DataNodes send heartbeats to the NameNode 
 NameNode uses heartbeats to detect DataNode failure 
Replication Engine 
 NameNode detects DataNode failures 
 Chooses new DataNodes for new replicas 
 Balances disk usage 
 Balances communication traffic to DataNodes 
Data Pipeline & Write Anatomy 
[Diagram: The HDFS client asks the NameNode to add a block, writes the block 
through a pipeline of three DataNodes, receives the acks back, and reports 
completion.] 
Data Pipelining 
 Client retrieves a list of DataNodes on which to place 
replicas of a block 
 Client writes block to the first DataNode 
 The first DataNode forwards the data to the next 
DataNode in the Pipeline 
 When all replicas are written, the Client moves on to 
write the next block in file 
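The pipelining steps above can be sketched as a chain of forwarding writes. This Python model is illustrative only: the class and function names are hypothetical, data moves as whole blocks rather than as streamed packets, and the list of DataNodes is supplied by a stand-in function instead of a NameNode.

```python
class DataNodeSketch:
    """Toy DataNode: stores a block, then forwards it down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        self.blocks[block_id] = data
        # forward to the next DataNode, if any, and collect its acks
        acks = downstream[0].write(block_id, data, downstream[1:]) if downstream else []
        return [self.name] + acks


def write_file(blocks, choose_datanodes):
    """Write blocks one after another; move on only when all replicas ack."""
    written = []
    for block_id, data in blocks:
        pipeline = choose_datanodes(block_id)  # stand-in for asking the NameNode
        acks = pipeline[0].write(block_id, data, pipeline[1:])
        if len(acks) != len(pipeline):
            raise IOError(f"block {block_id} was not fully replicated")
        written.append((block_id, acks))
    return written


nodes = [DataNodeSketch(f"dn{i}") for i in range(1, 5)]
result = write_file(
    [("blk_1", b"aaa"), ("blk_2", b"bbb")],
    lambda block_id: nodes[:3],
)
print(result)
```

The client only ever sends data to the first DataNode; the replicas further down the pipeline are filled by the DataNodes themselves.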
Read Anatomy 
[Diagram: The HDFS client gets the block locations from the NameNode, then 
reads the blocks directly from the DataNodes.] 
Data Correctness 
 Use Checksums to validate data 
– Use CRC32 
 File Creation 
– Client computes a checksum per 512 bytes 
– DataNode stores the checksum 
 File access 
– Client retrieves the data and checksum from DataNode 
– If Validation fails, Client tries other replicas 
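The checksum scheme above can be illustrated with Python's standard `zlib.crc32`. This is a minimal sketch, not HDFS's on-disk checksum format: the function names are hypothetical and replicas are modelled as plain byte strings.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, as on the slide


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 value per 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_with_verification(replicas, expected):
    """Return the first replica whose checksums match; try others on failure."""
    for data in replicas:
        if checksums(data) == expected:
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
stored_sums = checksums(original)         # computed at file creation

corrupted = bytearray(original)
corrupted[700] ^= 0xFF                    # flip one byte in the second chunk

data = read_with_verification([bytes(corrupted), original], stored_sums)
print(len(stored_sums), data == original)
```

Checksumming small chunks means a single corrupted byte invalidates only one 512-byte chunk, and the client can fall back to another replica.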

Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala

  • 1.
    Apache Hadoop DesignPathshala April 22, 2014 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 1
  • 2.
    Course Details The Motivation for Hadoop  Hadoop: Basic Concepts  Writing a MapReduce Program  Common MapReduce Algorithms  PIG Concepts  Hive Concepts  Working with Sqoop  Working with Flume  OOZIE Concepts  HUE Concepts  Reporting Tools  Project www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 2
  • 3.
    Apache Hadoop TheMotivation for Hadoop Design Pathshala April 22, 2014 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 3
  • 4.
    Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 4
  • 5.
    Design Pathshala Every one of our courses, written by experts in their respective fields.  We try our best to make you connect real life examples with real business practices.  Learn and apply to work or your own business.  We provide online classes on different subjects, including Oracle HRMS, Peoplesoft HRMS & JAVA.  We have both Weekday as well as Weekend classes. 5 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 6.
    How data comes? 6 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 7.
    Machine generated andhistorical data 7 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 8.
    Three V’s ofBigdata Volume Velocity Variety 8 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 9.
    Volume .. Amountof data ~3 ZB of data exist in the digital universe today. >300 TB of data in U.S. Library of Congress. Facebook has 30+ PB. ~2.5 PB of data in DWH. +10PB DWH size. 9 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 10.
    Velocity .. HowRapidly data is growing 48 hours of new video every minute 571 new websites every minute 500+ TB to Facebook. 175 million tweets every day 1+ million customer transactions every hour Data production will be 44 times greater in 2020 than it was in 2009. 10 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 11.
    Variety.. How Rapidlydata is growing Structured • Traditional Databases • Numeric data Semi - structured • Json • XML Unstructured • Text documents • Email • Video • Audio • Machine Generated 11 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 12.
    Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 12
  • 13.
    How Companies mintingon Bigdata! Predict exactly what customers want before they ask for it Marketing Campaign Improve customer service Fraud Detection Get customers excited about their own data Identify customer pain points and solve them Reduce health care costs and improve treatment Social Graph Analysis & Sentiment Analysis Research and development 13 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 14.
    How data isused by some big Companies for different business analysis. 14 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 15.
    Big Data MarketForecast 15 www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 16.
    Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 16
  • 17.
    Career options
  • 18.
    Big data jobs, big-pay jobs
  • 19.
    Top Recruiters in India
  • 20.
    Hadoop & Hive History
    • Dec 2004 – Google MapReduce paper published (the GFS paper had appeared in 2003)
    • July 2005 – Nutch uses MapReduce
    • Feb 2006 – Hadoop becomes a Lucene subproject
    • Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
    • Jan 2008 – Hadoop becomes an Apache Top Level Project
    • Jul 2008 – A 4000-node test cluster
    • Sept 2008 – Hive becomes a Hadoop subproject
  • 21.
    You Say, “tomato…”
    Google calls it:    Hadoop equivalent:
    GFS                 HDFS
    Bigtable            HBase
    Chubby              ZooKeeper
  • 22.
    Problems with current systems
    1 machine
    • Read 1 TB of data
    • 4 I/O channels
    • 100 MB/s per channel
    • Time to read: ~45 mins
  • 24.
    Apache Hadoop Wins Terabyte Sort Benchmark (July 2008)
    • Yahoo! sorted 1 TB of data in 209 seconds
    • Beat Google's previous record of 297 seconds
    • The sort used 1800 maps and 1800 reduces
    • Cluster configuration used for the benchmark sort
      – 910 nodes
      – 2 quad-core Xeons @ 2.0 GHz per node
      – 8 GB RAM per node
  • 25.
    Why Hadoop?
    1 machine
    • Read 1 TB of data
    • 4 I/O channels
    • 100 MB/s per channel
    • Time to read: ~45 mins
    10 machines
    • 4 I/O channels each
    • 100 MB/s per channel
    • Time to read: ~4.5 mins
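    The comparison on this slide is plain bandwidth arithmetic. A minimal sketch in Python, assuming the slide's figures mean 4 I/O channels per machine at 100 MB/s each, decimal units, and data spread evenly across machines (the function and parameter names are illustrative):

```python
TB = 10**12  # bytes, decimal units
MB = 10**6


def read_time_seconds(data_bytes, machines, channels_per_machine, bytes_per_sec_per_channel):
    """Time to scan data that is split evenly over every channel of every machine."""
    aggregate_bandwidth = machines * channels_per_machine * bytes_per_sec_per_channel
    return data_bytes / aggregate_bandwidth


one = read_time_seconds(1 * TB, machines=1, channels_per_machine=4,
                        bytes_per_sec_per_channel=100 * MB)
ten = read_time_seconds(1 * TB, machines=10, channels_per_machine=4,
                        bytes_per_sec_per_channel=100 * MB)

print(f"1 machine:   {one / 60:.1f} min")
print(f"10 machines: {ten / 60:.1f} min")
```

    Under these assumptions the result is about 41.7 and 4.2 minutes, in line with the slide's rounded ~45 and ~4.5 minutes. The speed-up is linear only because each machine reads its own share of the data from its own disks.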
  • 26.
    Distributed File System (DFS)
    Namespace: dp.global.in
    • designpathshala\project → dp.global.in\home\project
    • designpathshala\images → dp.global.in\home\images
    • designpathshala\software → dp.global.in\home\software
    • designpathshala\websites → dp.global.in\home\websites
  • 27.
    Who uses Hadoop?
    Cluster sizes shown: 42,000 nodes (as of July 2011), 4100 nodes, 1400 nodes
  • 28.
    What is Hadoop
    • Hadoop is a framework for distributed processing of large datasets across large clusters of commodity computers using a simple programming model.
    • Large datasets
      – Terabytes or petabytes of data
    • Large clusters
      – Hundreds or thousands of nodes
    • Hadoop is an open-source implementation of Google's MapReduce
    • Hadoop is based on a simple programming model called MapReduce
    • Hadoop is based on a simple data model: any data will fit
  • 30.
    What makes it especially useful
    • Scalable: It can reliably store and process petabytes.
    • Economical: It distributes the data and processing across clusters of commonly available computers (numbering in the thousands).
    • Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
    • Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.
  • 31.
    Hadoop: Assumptions
    • Hardware will fail.
    • Applications need a write-once-read-many access model.
    • Data transfer and I/O are the bottleneck
    • Very large distributed file system
      – 10K nodes, 100 million files, 10 PB
    • Assumes commodity hardware
      – Files are replicated to handle hardware failure
      – Detects failures and recovers from them
    • Move logic to the data rather than data to the logic
  • 32.
    HDFS Architecture
    (diagram: Client, NameNode with metadata, Secondary NameNode, DataNodes)
    • NameNode: holds the metadata, that is, information about the data
    • DataNode: holds the physical data
    • SecondaryNameNode: periodically reads the NameNode's metadata to build checkpoints (it is not a standby NameNode)
  • 33.
    Distributed File System
    • Single namespace for the entire cluster
    • Data coherency
      – Write-once-read-many access model
      – Client can only append to existing files
    • Files are broken up into blocks
      – Typically 64 MB block size
      – Each block replicated on multiple DataNodes
    • Intelligent client
      – Client can find the location of blocks
      – Client accesses data directly from the DataNode
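    To make the block arithmetic concrete, here is a small sketch in Python of how a file is cut into 64 MB blocks and how much raw storage the replicas occupy. The 200 MB file and the helper name are made up for illustration; the replication factor of 3 is the default quoted later in the deck.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the typical block size on the slide
REPLICATION = 3                 # default replication factor
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes is cut into."""
    full_blocks, leftover = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if leftover:
        sizes.append(leftover)   # the last block only holds the leftover bytes
    return sizes


sizes = split_into_blocks(200 * MB)          # a hypothetical 200 MB file
print(len(sizes))                            # number of blocks: 4
print(sizes[-1] // MB)                       # last block in MB: 8
print(sum(sizes) * REPLICATION // MB)        # raw MB stored across the cluster: 600
```

    Note that the last block is not padded to 64 MB: it occupies only the space it needs.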
  • 34.
    Hadoop architecture
  • 36.
    Why Hadoop 2.0?
    • Major re-architecture of the distributed file system and processing layers
    • The YARN architecture makes it possible to run multiple kinds of workloads on Hadoop nodes
    • Interactive SQL support
    • Integrated streaming support
    • In-memory processing
    • Search
    • Enterprise security
    • Data lifecycle management
    • Readily available tools and libraries
  • 37.
    Hadoop
  • 39.
    Apache Hadoop and the Hadoop Ecosystem
    • MapReduce
      – A distributed data processing model and execution environment that runs on large clusters of commodity machines.
    • HDFS
      – A distributed filesystem that runs on large clusters of commodity machines.
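    The MapReduce model itself is small enough to sketch in a few lines. The following Python example is a single-process word count that only mimics the three phases (map, shuffle, reduce); it is not Hadoop code, and the function names are illustrative. On a real cluster the framework runs many map and reduce tasks in parallel and performs the shuffle over the network.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

    The programmer writes only the map and reduce functions; splitting the input, grouping by key, and rerunning failed tasks are the framework's job.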
  • 41.
    Apache Hadoop and the Hadoop Ecosystem
    • Pig
      – A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
    • Hive
      – A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying the data.
    • Sqoop
      – A tool for efficiently moving data between relational databases and HDFS.
    • Oozie
      – A workflow scheduler system to manage Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
  • 42.
    Apache Hadoop and the Hadoop Ecosystem
    • HBase
      – A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
    • ZooKeeper
      – A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
    • Flume
      – A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
    • Storm
      – Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  • 43.
    Apache Hadoop and the Hadoop Ecosystem
    • Spark
      – Apache Spark™ is a fast and general engine for large-scale data processing.
    • Drill
      – Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables without needing to specify metadata definitions in a centralized store such as the Hive metastore.
    • Avro
      – A serialization system for efficient, cross-language RPC and persistent data storage.
  • 46.
    Oozie Workflows (diagram)
    Labels: Pig / Sqoop jobs / Hive scripts; source databases (reporting); HDFS (domain/xbec/dwh) holding source table data, temporary table data, and DW table data; Hive tables; MySQL data warehouse; dashboard reporting.
  • 47.
    Big Data Platform (architecture diagram)
    Execution base: HDFS / MapReduce / Hive / Pig on Hortonworks HDP 2.0.
    Layers shown: Data as a Service (summary data services, repository, command & job, Hive metadata, workflow, and storage APIs); security (SSO, user authentication and authorization, Kerberos, Hive, Pig/MR, and HDFS data privacy); queuing & ingestion (API queues, Flume integration, RDBMS integration); analytics (log, data, spatial, tracking, and real-time analytics; R and Mahout integration; fast SQL layer); Elastic Search & indexing; storage management (detached storage, archiving, external storage, existing DW); build & deploy (VM provisioning & software deployment, Cloud Foundry / OpenShift).
  • 48.
    Multi-tenant Big Data Platform with Hadoop & PaaS Platforms (architecture diagram)
    Components shown: data inputs gateway (Talend, Fuse; JMS, RDBMS/Sqoop, log files/Flume, REST/WebHDFS, etc.); ingestion engine (Kafka); real-time stream processing engine (Storm); intermediate data store & search (Cassandra, HBase, MySQL, Solr, Elastic Search); Hadoop HDFS with MapReduce/Pig/Hive/Stinger; predictive analytics & machine learning (Mahout/R); SQL/Hive engines (Stinger/HAWQ); data integration/ETL/workflows (Oozie); cloud orchestration (ZooKeeper, YARN, Ambari); Hadoop 2.0 platform (Hortonworks); application/rules/web services (REST) layer (JBoss & JBRMS); reporting/BI tools (e.g. Jaspersoft); devices, visualization, search, reporting & alerts; existing data warehouse platforms; PaaS platform virtualization (public/private/hybrid).
  • 50.
    Hadoop complex queries: comparison with traditional databases
  • 51.
    Which Hadoop Distribution?
    Type: Pureplay (Apache/open source)
    • Hortonworks
      – Pros: 100% open source version; integration/services focused; extensive partnership network
      – Cons: slower interactive queries
    • Cloudera
      – Pros: widely used distribution; faster interactive queries; extensive tooling
      – Cons: proprietary extensions like Impala; commercial version only
    • MapR
      – Pros: enterprise and production-ready focused; works with NFS and native Unix commands
      – Cons: less focused on using new Hadoop features such as YARN
    Type: Proprietary
    • PivotalHD
      – Pros: faster interactive query support with Greenplum; integrates with the CloudFoundry PaaS platform
      – Cons: proprietary extensions; not easy to decouple
    • IBM
      – Pros: offers open source without a branch version; integrated with PaaS and IBM tools
      – Cons: limited releases; expensive; may not be easy to decouple
  • 52.
    Block replication across racks (diagram)
    File F is split into blocks 1 to 5 (64 MB each); the replicas of each data block are spread over Disks 1 to 12 in Rack 1, Rack 2, and Rack 3.
  • 53.
    Block Placement Current Strategy -- One replica on local node -- Second replica on a remote rack -- Third replica on same remote rack -- Additional replicas are randomly placed  Clients read from nearest replica www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 53
  • 55.
    Main Properties of HDFS
    • Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system's data
    • Replication: Each data block is replicated several times (the default is 3)
    • Failure: Failure is the norm rather than the exception
    • Fault tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
      – DataNodes send heartbeats to the NameNode
  • 57.
    NameNode Metadata
    • Metadata is kept in memory
    • Types of metadata
      – List of files
      – List of blocks for each file
      – List of DataNodes for each block
      – File attributes, e.g. creation time, replication factor
    • A transaction log
      – Records file creations, file deletions, etc.
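    As an illustration of the metadata listed on this slide, the sketch below models it with plain Python data classes. The class and method names are invented for this example and do not correspond to HDFS classes; the point is only which facts live in memory and which operations are appended to the transaction log.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    replication: int                               # file attribute: replication factor
    creation_time: float                           # file attribute: creation time
    blocks: list = field(default_factory=list)     # ordered list of block ids


@dataclass
class NameNodeSketch:
    files: dict = field(default_factory=dict)            # path -> FileMeta
    block_locations: dict = field(default_factory=dict)  # block id -> set of DataNodes
    edit_log: list = field(default_factory=list)         # transaction log

    def create(self, path, replication, now):
        self.files[path] = FileMeta(replication, now)
        self.edit_log.append(("create", path))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def delete(self, path):
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", path))


nn = NameNodeSketch()
nn.create("/logs/a.txt", replication=3, now=0.0)
nn.add_block("/logs/a.txt", "blk_1", ["dn1", "dn5", "dn6"])
print(nn.files["/logs/a.txt"].blocks, nn.block_locations["blk_1"])
nn.delete("/logs/a.txt")
print(nn.edit_log)
```

    Because everything except the log lives in memory, the number of files and blocks a cluster can hold is bounded by the NameNode's RAM.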
  • 58.
    DataNode
    • A block server
      – Stores data in the local file system
      – Stores metadata of a block
      – Serves data to clients
    • Block report
      – Periodically sends a report of all existing blocks to the NameNode
    • Facilitates pipelining of data
      – Forwards data to other specified DataNodes
  • 60.
    Hadoop Master/Slave Architecture
    • Hadoop is designed as a master-slave, shared-nothing architecture
      – Master node (single node)
      – Many slave nodes
  • 61.
    JobTracker
    • The master node runs the JobTracker instance, which accepts job requests from clients
    • There is only one JobTracker daemon running per Hadoop cluster
    • Determines the execution plan by determining which files to process
    • Assigns nodes to different tasks
    • Monitors all tasks as they are running
  • 62.
    TaskTracker
    • Manages execution of individual tasks on each data node
    • One TaskTracker per data node
    • Each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel
    • TaskTrackers constantly communicate with the JobTracker
    • If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits its tasks to another TaskTracker.
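    The heartbeat rule in the last bullet can be sketched as a toy model in Python. This is not the Hadoop API: the class, the 600-second timeout, and the tracker and task names are assumptions chosen for illustration, and time is passed in explicitly so the example is deterministic.

```python
class JobTrackerSketch:
    """Toy model of heartbeat-based failure detection."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}   # tracker name -> time of its last heartbeat
        self.assignments = {}      # task id -> tracker name

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, task, tracker):
        self.assignments[task] = tracker

    def reassign_lost_tasks(self, now):
        """Move tasks off trackers whose last heartbeat is older than the timeout."""
        alive = [name for name, seen in self.last_heartbeat.items()
                 if now - seen <= self.timeout]
        moved = []
        for task, tracker in list(self.assignments.items()):
            if tracker not in alive and alive:
                self.assignments[task] = alive[0]   # simplest choice: first live tracker
                moved.append(task)
        return moved


jt = JobTrackerSketch(timeout_seconds=600)   # assumed timeout value
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.assign("map_0", "tt1")
jt.assign("map_1", "tt2")
jt.heartbeat("tt2", now=500)                 # tt1 stays silent
moved = jt.reassign_lost_tasks(now=700)
print(moved, jt.assignments)
```

    Here `tt1` has been silent for 700 seconds, so its task `map_0` is handed to `tt2`. A real scheduler would also consider data locality and free task slots when choosing the replacement.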
  • 63.
    Job Tracker (diagram: job submission)
    Labels: User, Copy Input Files, DFS, Client, Submit Job, Get Input File Info, Create Splits, Upload Job Info (Job.xml, Job.jar), Job Tracker.
  • 64.
    Job Tracker, continued (diagram: job initialization)
    Labels: Client, Submit Job, Job Tracker, Job Queue, Initialize Job, Read Files (Job.xml, Job.jar from DFS), Create Map & Reduce tasks. Number of maps = number of input splits.
  • 65.
    Job Tracker, continued (diagram: task assignment)
    Labels: Task Tracker, Heart Beat, Job Tracker, Job Queue, Picks Task, Assign Task.
  • 67.
    Job Tracker, continued (diagram: task execution)
    Labels: Job Tracker, Assign Task, Task Tracker, DFS (Job.xml, Job.jar), Read from local disk.
  • 68.
    Heartbeats
    • DataNodes send heartbeats to the NameNode
    • The NameNode uses heartbeats to detect DataNode failure
  • 69.
    Replication Engine
    • NameNode detects DataNode failures
      – Chooses new DataNodes for new replicas
      – Balances disk usage
      – Balances communication traffic to DataNodes
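    A minimal Python sketch of the re-replication decision follows. It assumes the NameNode knows which DataNodes are alive and how full each one is, and it uses "least-used disk first" as a stand-in for the balancing rules on the slide. The function, block, and node names are hypothetical.

```python
def plan_rereplication(block_locations, live_nodes, used_bytes, replication=3):
    """Return {block: [new target nodes]} for blocks that lost replicas."""
    plan = {}
    for block, nodes in block_locations.items():
        surviving = [node for node in nodes if node in live_nodes]
        missing = replication - len(surviving)
        if missing <= 0 or not surviving:
            continue   # fully replicated, or no surviving copy left to read from
        candidates = sorted(
            (node for node in live_nodes if node not in surviving),
            key=lambda node: used_bytes.get(node, 0),
        )
        plan[block] = candidates[:missing]   # least-full nodes first
    return plan


block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
live_nodes = {"dn2", "dn3", "dn4", "dn5"}            # dn1 has stopped sending heartbeats
used_bytes = {"dn2": 50, "dn3": 70, "dn4": 10, "dn5": 30}

plan = plan_rereplication(block_locations, live_nodes, used_bytes)
print(plan)
```

    Only `blk_1` lost a replica, and `dn4` is chosen because it is the emptiest live node that does not already hold the block.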
  • 70.
    Data Pipeline & Write Anatomy (diagram)
    Labels: HDFS Client, Add Block, NameNode, three DataNodes, Write, Ack, Complete.
  • 72.
    Data Pipelining
    • Client retrieves a list of DataNodes on which to place replicas of a block
    • Client writes the block to the first DataNode
    • The first DataNode forwards the data to the next DataNode in the pipeline
    • When all replicas are written, the client moves on to write the next block in the file
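    The pipeline steps can be sketched in Python as follows. This is a toy in-memory model, not HDFS client code: it reuses one fixed pipeline for every block (in HDFS the client obtains a DataNode list per block), uses a tiny block size, and all class and function names are illustrative.

```python
class DataNodeSketch:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block id -> bytes stored on this node

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            # The ack travels back up the pipeline as the return value.
            return downstream[0].write(block_id, data, downstream[1:])
        return True


def write_file(data, pipeline, block_size):
    """Client side: send one block at a time to the first DataNode only."""
    for offset in range(0, len(data), block_size):
        block_id = offset // block_size
        chunk = data[offset:offset + block_size]
        if not pipeline[0].write(block_id, chunk, pipeline[1:]):
            raise IOError(f"block {block_id} was not written to all replicas")


nodes = [DataNodeSketch(f"dn{i}") for i in (1, 2, 3)]
write_file(b"abcdefghij", nodes, block_size=4)
print(nodes[2].blocks)
```

    The client sends each block once; the DataNodes do the fan-out, so the client's outbound bandwidth is not multiplied by the replication factor.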
  • 73.
    Read Anatomy (diagram)
    Labels: HDFS Client, Get Block, NameNode, three DataNodes, Read.
  • 74.
    Data Correctness
    • Use checksums to validate data
      – Use CRC32
    • File creation
      – Client computes a checksum per 512 bytes
      – DataNode stores the checksum
    • File access
      – Client retrieves the data and checksum from the DataNode
      – If validation fails, the client tries other replicas
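    The checksum scheme can be sketched with Python's standard `zlib.crc32`. This is a simplified illustration, assuming replicas are plain byte strings and the expected checksums are kept in one list; the helper names are invented for the example.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, as on the slide


def checksums(data):
    """CRC32 of every 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def read_verified(replicas, expected):
    """Return the first replica whose checksums match; skip corrupt ones."""
    for data in replicas:
        if checksums(data) == expected:
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5       # 1280 bytes, so 3 chunks
stored = checksums(original)           # computed by the client at write time
corrupt = b"\xff" + original[1:]       # same data with the first byte damaged

data = read_verified([corrupt, original], stored)
print(len(stored), data == original)
```

    The reader rejects the damaged copy and falls back to the next replica, which is the behaviour described in the last bullet of the slide.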