Learn Hadoop and Big Data analytics: join Design Pathshala's training programs on Big Data and analytics.
This slide deck covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Hive - Apache Hadoop Bigdata training by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala's training programs on Big Data and analytics.
This slide deck covers advanced knowledge of Apache Hive.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
This document provides an overview of Hadoop and Big Data. It begins by introducing key concepts such as structured, semi-structured, and unstructured data. It then discusses the growth of data and the need for Big Data solutions. The core components of Hadoop, such as HDFS and MapReduce, are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
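The "basic MapReduce program" referenced above is traditionally the word-count job. Below is a minimal sketch of its mapper and reducer in Java against the standard org.apache.hadoop.mapreduce API; it follows the well-known textbook example rather than any code from the deck itself, and the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emits (word, 1) for every token in a line of input.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // one count per occurrence
      }
    }
  }

  // Reducer: sums the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```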
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData... (Mahantesh Angadi)
This document provides an introduction to big data and the installation of a single-node Apache Hadoop cluster. It defines key terms like big data, Hadoop, and MapReduce. It discusses traditional approaches to handling big data like storage area networks and their limitations. It then introduces Hadoop as an open-source framework for storing and processing vast amounts of data in a distributed fashion using the Hadoop Distributed File System (HDFS) and MapReduce programming model. The document outlines Hadoop's architecture and components, provides an example of how MapReduce works, and discusses advantages and limitations of the Hadoop framework.
- Hadoop is a framework for managing and processing big data distributed across clusters of computers. It allows for parallel processing of large datasets.
- Big data comes from various sources like customer behavior, machine data from sensors, etc. It is used by companies to better understand customers and target ads.
- Hadoop uses a master-slave architecture with a NameNode master and DataNode slaves. Files are divided into blocks and replicated across DataNodes for reliability. The NameNode tracks where data blocks are stored.
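To make the block/replication model in the last bullet concrete, here is a hedged sketch that asks the NameNode, through Hadoop's public FileSystem API, which DataNodes hold each block of a file. The NameNode address and file path are hypothetical, not details from the summarized deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/logs/2014-04-22.log"); // illustrative path
    FileStatus status = fs.getFileStatus(file);

    // Ask the NameNode which DataNodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```

Each block typically appears on several hosts because of replication; losing one DataNode leaves the other replicas readable.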
This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Big Data Analytics with Hadoop, MongoDB and SQL Server (Mark Kromer)
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
The document discusses tools for working with big data without needing to know Java. It states that Hadoop can be learned without Java through tools like Pig and Hive that provide high-level languages. Pig uses Pig Latin to simplify complex MapReduce programs, allowing data operations like filters, joins and sorting with only 10 lines of code compared to 200 lines of Java. Hive also does not require Java knowledge, defining a SQL-like language called HiveQL to query and analyze stored data. The document promotes these tools as alternatives to writing custom MapReduce code in Java for non-programmers working with big data.
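As an illustration of how HiveQL replaces hand-written MapReduce, the sketch below submits a SQL-like aggregation through Hive's standard JDBC interface (Pig Latin scripts can similarly be driven from Java via Pig's PigServer API). The server address, credentials, and web_logs table are assumptions for illustration, not details from the summarized deck; the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, database, and user are hypothetical.
    // Requires the hive-jdbc jar (org.apache.hive.jdbc.HiveDriver) on the classpath.
    String url = "jdbc:hive2://hiveserver.example.com:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
         Statement stmt = conn.createStatement()) {
      // A HiveQL aggregation that would otherwise need a custom MapReduce job.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs " +
          "GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```

Hive compiles the query into MapReduce (or newer execution engines) behind the scenes, which is exactly the trade-off the summarized deck advertises.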
The presentation covers the following topics: 1) Hadoop introduction 2) Hadoop nodes and daemons 3) Architecture 4) Hadoop's best features 5) Hadoop characteristics. For further knowledge of Hadoop, refer to this link: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
1) Hadoop is well-suited for data science tasks like exploring large datasets directly, mining larger datasets to achieve better machine learning outcomes, and performing large-scale data preparation efficiently.
2) Traditional data architectures present barriers to speeding data-driven innovation due to the high cost of schema changes, whereas Hadoop's "schema on read" model has a lower barrier.
3) A Hortonworks Sandbox provides a free virtual environment to learn Hadoop and accelerate validating its use for an organization's unique data architecture and use cases.
Introduction to Big Data Analytics on Apache Hadoop (Avkash Chauhan)
The document discusses Hadoop and big data. It defines Hadoop as an open source, scalable, and fault tolerant platform for storing and processing large amounts of unstructured data distributed across machines. It describes Hadoop's core components like HDFS for data storage and MapReduce/YARN for data processing. It also discusses how Hadoop fits into big data scenarios and landscapes, applying Hadoop to save money, the concept of data lakes, Hadoop in the cloud, and big data analytics with Hadoop.
This presentation covers big data along with the basics of Hadoop and its components and their architecture. Contents of the PPT are:
1. Understanding Big Data
2. Understanding Hadoop & Its Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8. Hadoop Security Management Tools: Knox, Ranger
Big Data Processing Using Hadoop Infrastructure (Dmitry Buzdin)
The document discusses using Hadoop infrastructure for big data processing. It describes Intrum Justitia SDC, which has data across 20 countries in various formats and a high number of data objects. Hadoop provides solutions like MapReduce and HDFS for distributed storage and processing at scale. The Hadoop ecosystem includes tools like Hive, Pig, HBase, Impala and Oozie that help process and analyze large datasets. Examples of using Hadoop with Java and integrating it into development environments are also included.
This document provides an overview of Hadoop and how it can be used for data consolidation, schema flexibility, and query flexibility compared to a relational database. It describes the key components of Hadoop including HDFS for storage and MapReduce for distributed processing. Examples of industry use cases are also presented, showing how Hadoop enables affordable long-term storage and scalable processing of large amounts of structured and unstructured data.
Having trouble distinguishing Big Data, Hadoop & NoSQL, and finding the connections among them? This slide deck from the Savvycom team can definitely help you.
Enjoy reading!
The document provides an overview of big data analytics using Hadoop. It discusses how Hadoop allows for distributed processing of large datasets across computer clusters. The key components of Hadoop discussed are HDFS for storage, and MapReduce for parallel processing. HDFS provides a distributed, fault-tolerant file system where data is replicated across multiple nodes. MapReduce allows users to write parallel jobs that process large amounts of data in parallel on a Hadoop cluster. Examples of how companies use Hadoop for applications like customer analytics and log file analysis are also provided.
Learn Big Data and Hadoop online at Easylearning Guru. We offer instructor-led online training and a lifetime LMS (Learning Management System). Join our free live demo classes of Big Data Hadoop.
The document provides information about Hadoop, its core components, and MapReduce programming model. It defines Hadoop as an open source software framework used for distributed storage and processing of large datasets. It describes the main Hadoop components like HDFS, NameNode, DataNode, JobTracker and Secondary NameNode. It also explains MapReduce as a programming model used for distributed processing of big data across clusters.
The document discusses big data and Hadoop, providing an introduction to big data, use cases across industries, an overview of the Hadoop ecosystem and architecture, and learning paths for professionals. It also includes examples of how companies like Facebook use large Hadoop clusters to store and process massive amounts of user data at petabyte scale. The presentation aims to help attendees understand big data, Hadoop, and career opportunities working with these technologies.
Hadoop has shown itself to be a great tool for resolving problems with different data aspects, such as velocity, variety, and volume, that cause trouble for relational database storage. In this presentation you'll learn what data problems are occurring nowadays and how Hadoop can solve them. You'll also learn about Hadoop's basic components and the principles that make Hadoop such a great tool.
This document provides an agenda for a Big Data summer training session presented by Amrit Chhetri. The agenda includes modules on Big Data analytics with Apache Hadoop, installing Apache Hadoop on Ubuntu, using HBase, advanced Python techniques, and performing ETL with tools like Sqoop and Talend. Amrit introduces himself and his background before delving into the topics to be covered in the training.
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop... (Simplilearn)
This presentation about Hadoop for beginners will help you understand what is Hadoop, why Hadoop, what is Hadoop HDFS, Hadoop MapReduce, Hadoop YARN, a use case of Hadoop and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is the resource management unit of Hadoop. In this video, we will look into these units individually and also see a demo on each of these units.
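Tying the HDFS, MapReduce, and YARN units described above together, here is a minimal driver sketch that configures and submits a word-count job to the cluster, where YARN allocates the resources on Hadoop 2.x. It is a hedged companion to that walkthrough rather than the presentation's own demo code: it reuses the mapper and reducer classes sketched earlier on this page, and the input/output paths are placeholders passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Mapper/reducer classes from the earlier word-count sketch.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Placeholder HDFS paths; pass real ones on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submits the job to the cluster (YARN on Hadoop 2.x) and waits for completion.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```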
The following topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive, and Sqoop and schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
The document discusses big data and its applications. It defines big data as large and complex data sets that are difficult to process using traditional data management tools. It outlines the three V's of big data - volume, variety, and velocity. Various types of structured, semi-structured, and unstructured data are described. Examples are given of how big data is used in various industries like automotive, finance, manufacturing, policing, and utilities to improve products, detect fraud, perform simulations, track suspects, and monitor assets. Popular big data software like Hadoop and MongoDB are also mentioned.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
This document discusses Scala and big data technologies. It provides an overview of Scala libraries for working with Hadoop and MapReduce, including Scalding which provides a Scala DSL for Cascading. It also covers Spark, a cluster computing framework that operates on distributed datasets in memory for faster performance. Additional Scala projects for data analysis using functional programming approaches on Hadoop are also mentioned.
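The Scala code itself is not reproduced here, but Spark's in-memory RDD model is also reachable from Java. Below is a small sketch, under the assumptions of a Spark 2.x-style Java API, a local master, and an illustrative input path, showing the same data-analysis style as chained RDD transformations.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // Local master for demonstration; on a cluster this comes from spark-submit.
    SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // illustrative path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum); // intermediate datasets stay in memory between stages
      counts.take(10).forEach(t -> System.out.println(t._1() + "\t" + t._2()));
    }
  }
}
```

Because the intermediate RDDs are kept in memory rather than written back to disk between stages, iterative and interactive workloads run faster than the equivalent chain of MapReduce jobs.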
Big data PPT prepared by Hritika Raj (Shivalik College of Engg.)
This document provides an overview of big data, including its definition, characteristics, sources, tools used, applications, risks and benefits. Big data is characterized by volume, velocity and variety of structured and unstructured data that is growing exponentially. It is generated from sources like mobile devices, sensors, social media and more. Tools like Hadoop, MapReduce and data analytics are used to extract value from big data. Potential applications include healthcare, security, manufacturing and more. Risks include privacy and scale, while benefits include improved decision making and new business opportunities. The big data industry is rapidly growing and transforming IT and business.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ... (Renato Bonomini)
The document discusses capacity planning and performance tuning for Hadoop big data systems. It begins with an agenda that covers why capacity planners need to prepare for Hadoop, an overview of the Hadoop ecosystem, capacity planning and performance tuning of Hadoop, getting started, and the importance of measurement. The document then discusses various components of the Hadoop ecosystem and provides guidance on analyzing different types of workloads and components.
This document discusses big data, including its definition, characteristics of volume, velocity, and variety. It describes sources of big data like administrative data, transactions, public data, sensor data, and social media. It discusses processing big data using techniques like Hadoop MapReduce. It outlines benefits like real-time decision making but also drawbacks like security, privacy, and performance issues. It provides some facts about the size of data generated daily by companies and potential impacts and future growth of the big data industry and job market.
Somappa Srinivasan of sparrowanalytics.com presents their goal of creating a scalable recommendation engine using Hadoop and real-time analytics. Their system will acquire data from various sources into a data lake stored on Hadoop. A real-time engine will then select models, score recommendations, and return personalized suggestions to users as they browse. The components outlined include data acquisition, ingestion into a data hub of Hive and HBase tables, model selection, scoring, recommendation generation, and a UI dashboard.
This document provides an overview of bio big data and related technologies. It discusses what big data is and why bio big data is necessary given the large size of genomic data sets. It then outlines and describes Hadoop, Spark, machine learning, and streaming in the context of bio big data. For Hadoop, it explains HDFS, MapReduce, and the Hadoop ecosystem. For Spark, it covers RDDs, Spark SQL, MLlib, and Spark Streaming. The document is intended as an introduction to key concepts and tools for working with large biological data sets.
Somappa Srinivasan of sparrowanalytics.com presents their goal of creating a scalable recommendation engine using Hadoop and real-time analytics. Their system will acquire data from various sources into a data lake stored on Hadoop. A real-time engine will then process user requests, select predictive models, score items, and recommend contextual options to users browsing movies. The system components include data acquisition, ingestion into a data hub of Hive and HBase tables, a real-time engine for validation, modeling, scoring and recommendations, and a UI dashboard.
Digital has engendered a fundamental shift in the way we behave, think, and conduct business. One of the most essential transformations for today's organisations is adapting to how the customer has changed. This obviously has a massive impact on the salesforce and its methods. The customer journey has changed dramatically, becoming far more digitized, and it requires consistent use of many tools, technologies, and methods to effectively reach the target audience.
Big Data - How Marketing Has Revolutionised - by Sean Singleton (Digital Annexe)
Through BIG DATA, Marketeers can now blend human psychology and understanding with behavioural insights to create communication messages and platforms that do not only resonate with the consumer but find them where and when the information is most needed. At Digital Annexe, we believe BIG DATA is going to completely revolutionize the marketing industry forever in 5 key ways.
Exploiting the potential of Big Data and marketing automation in B2B (Sparklane)
During this morning of conference talks, five experts took the stage in turn. This is a summary of the talk by Roland Koltchakian (Oracle Marketing Cloud), devoted to exploiting the potential of Big Data and marketing automation in B2B.
1. B2B marketing does not escape complexity
2. Faced with this growing complexity, marketers feel under pressure
3. But this demanding context also brings new opportunities…
4. An example of a partnership serving marketing departments: Oracle + Zebaz
In summary…
Internet of Things. Definition of a concept (Jesús Fontecha)
The document discusses the Internet of Things (IoT) and how it will further change people's lives beyond how the internet has already changed them. It defines the IoT as the evolution of the internet where everyday physical objects are connected through sensors and can exchange data. It provides examples of how the IoT could enhance various areas like smart parking, farming, lighting, payment systems, wearable devices, environmental monitoring and more. It also discusses issues around how much personal data would be collected through smart devices and shared, raising questions about privacy, ethics and whether the IoT will benefit all people.
This document provides teaching material on distributed systems replication from the book "Distributed Systems: Concepts and Design". It includes slides on replication concepts such as performance enhancement through replication, fault tolerance, and availability. The slides cover replication transparency, consistency requirements, system models, group communication, fault-tolerant and highly available services, and consistency criteria like linearizability.
Real Time Marketing Big Data Analytics Social Marketing Intelligence Disruption (Chase McMichael)
Social Analytics-driven Real-time Marketing with Domain-specific Use Cases. The takeaways from this event:
1. What is Real Time Marketing and how marketers are using it
2. Why social analysis "The Science" is here to stay and how it works
3. Beyond the buzzword of Big Data - real use cases on how SMBs and big companies are harnessing insight, trends, and content to engage with their customers.
Start your Monday off right and be the smartest person in the room. @chasemcmichael
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full-text search at "Big Data" scale. This talk gives an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to which project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with MapReduce, load Solr via Flume in near real time, and much more.
eMarketer Webinar: Data Management Platforms—Using Big Data to Power Marketin... (eMarketer)
Join eMarketer for a discussion on how Data Management Platforms (DMPs) are enabling marketers to use their big data to make smarter and more efficient marketing decisions.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, gives customers a new option for tackling data growth and deploying big data analysis to help them better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: solve big-data problems with Hadoop; deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data; and implement Hadoop using the HDS Hadoop reference architecture. For more information on the Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
The document provides an overview of a conference on big data from September 30th to October 4th. It discusses how big data is more than just data and outlines some of the key challenges organizations face in working with big data, including not utilizing enough of their data and legacy techniques being insufficient. It also notes drivers for big data like growing data volumes and expectations for real-time access. The document highlights challenges often overlooked, such as organizational resistance to change and ensuring solutions are sustainable. It provides examples of HP's big data solutions and successes customers have seen.
On June 10, big data expert Bjarne Berg delivered a lecture at the Digital October center.
http://digitaloctober.ru/events/knowledge_stream_informatsiya_dlya_biznesa
Bn1028 demo hadoop administration and development (conline training)
The document is an introduction to Hadoop administration and development training. It discusses big data challenges and how Hadoop provides a framework to process and analyze large datasets across clusters of commodity hardware. Specifically, it covers what Hadoop is and its ecosystem including HDFS, MapReduce, Pig, Hive, HBase and other related technologies. It also discusses use cases, architecture, benefits and how companies are using Hadoop to solve big data problems.
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database (Kinetica)
Freed from the constraints of storage, network, and memory, many big data analytics systems now routinely reveal themselves to be compute bound. To compensate, big data analytic systems often sprawl wide horizontally (300-node Spark or NoSQL clusters are not unusual!) to bring in enough compute for the task at hand. High system complexity and crushing operational costs often result. As the world shifts from physical to virtual assets and methods of engagement, there is an increasing need for systems of intelligence to live alongside the more traditional systems of record and systems of analysis. New approaches to data processing are required to support the real-time processing that drives these systems of intelligence.
Join 451 Research and Kinetica to learn:
• An overview of the business and technical trends driving widespread interest in real-time analytics
• Why systems of analysis need to be transformed and augmented with systems of intelligence bringing new approaches to data processing
• How a new class of solution (a GPU-accelerated, scale-out, in-memory database) can bring you orders of magnitude more compute power, a significantly smaller hardware footprint, and unrivaled analytic capabilities
• How other companies in a variety of industries, such as financial services, entertainment, pharmaceutical, and oil and gas, benefit from augmenting their legacy systems with a modern analytics database
Hadoop and the Data Warehouse: Point/Counter Point (Inside Analysis)
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop Master Class: A concise overview (Abhishek Roy)
Abhishek Roy will teach a master class on Big Data and Hadoop. The class will cover what Big Data is, the history and background of Hadoop, how to set up and use Hadoop, and tools like HDFS, MapReduce, Pig, Hive, Mahout, Sqoop, Flume, Hue, Zookeeper and Impala. The class will also discuss real world use cases and the growing market for Big Data tools and skills.
Analysis of historical movie data by BHADRA (Bhadra Gowdra)
A recommendation system provides the ability to understand a person's taste and automatically find new, desirable content for them based on the patterns between their likes and ratings of different items. In this paper, we propose a recommendation system for the large amount of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (product, event, individual, or service), using the Hadoop framework.
EVENT OBJECTIVES
Strengthen studies in the area of Business Intelligence;
Promote the development of techniques, methodologies, and interfaces within the community;
Generate interaction among students, professionals, and companies, raising the quality of networking.
Big Data Integration Webinar: Getting Started With Hadoop Big Data (Pentaho)
This document discusses getting started with big data analytics using Hadoop and Pentaho. It provides an overview of installing and configuring Hadoop and Pentaho on a single machine or cluster. Dell's Crowbar tool is presented as a way to quickly deploy Hadoop clusters on Dell hardware in about two hours. The document also covers best practices like leveraging different technologies, starting with small datasets, and not overloading networks. A demo is given and contact information provided.
Danny Bickson - Python based predictive analytics with GraphLab Create (PyData)
Dato is presenting on their machine learning platform GraphLab Create. Key points include:
- GraphLab Create allows users to build intelligent applications using machine learning across different data types like images, text, graphs and tables.
- It provides tools for data ingestion, transformation, model building, and deployment in a machine learning pipeline.
- Some benefits of GraphLab Create are its efficient storage, ability to handle large datasets that exceed RAM size, and support for multi-core processing. It also has additional algorithms and automatic feature expansion compared to sklearn.
- Example intelligent applications that can be built include recommenders, fraud detection, ad targeting, personalized medicine, and more.
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011 (Jonathan Seidman)
The document discusses how Orbitz Worldwide integrated Hadoop into its enterprise data infrastructure to handle large volumes of web analytics and transactional data. Some key points:
- Orbitz used Hadoop to store and analyze large amounts of web log and behavioral data to improve services like hotel search. This allowed analyzing more data than their previous 2-week data archive.
- They faced initial resistance but built a Hadoop cluster with 200TB of storage to enable machine learning and analytics applications.
- The challenges now are providing analytics tools for non-technical users and further integrating Hadoop with their existing data warehouse.
How to design and implement a DataOps architecture with SDC and GCP (Joseph Arriola)
Do you know how to use StreamSets Data Collector with Google Cloud Platform (GCP)? In this session we'll explain how YaloChat designed and implemented a streaming architecture that is sustainable, operable, and scalable. Discover how we deployed Data Collector to integrate GCP components such as Pub/Sub and BigQuery to achieve DataOps in the cloud.
Storing, accessing, and analyzing large amounts of data from diverse sources and making it easily accessible to deliver actionable insights for users can be challenging for data driven organizations. The solution for customers is to optimize scaling and create a unified interface to simplify analysis. Qubole helps customers simplify their big data analytics with speed and scalability, while providing data analysts and scientists self-service access on the AWS Cloud. Join Qubole and AWS to discuss how Auto Scaling and Amazon EC2 Spot pricing can enable customers to efficiently turn data into insights. We'll talk about best practices for migrating from an on-premises Big Data architecture to the AWS Cloud.
Join us to learn:
• How to more easily create elastic Hadoop, Spark, and other Big Data clusters for dynamic, large-scale workloads
• Best practices for Auto Scaling and Amazon EC2 Spot Instances for cost optimization of Big Data workloads
• Best practices for deploying or migrating to Big Data on the AWS Cloud
Who should attend: IT Administrators, IT Architects, Data Warehouse Developers, Database Administrators, Business Analysts and Data Architects
The Building Blocks of QuestDB, a Time Series Database (Javier Ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (Sameer Shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
State of Artificial Intelligence Report 2023 (kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
2. Course Details
The Motivation for Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Common MapReduce Algorithms
PIG Concepts
Hive Concepts
Working with Sqoop
Working with Flume
OOZIE Concepts
HUE Concepts
Reporting Tools
Project
3. Apache Hadoop
The Motivation for Hadoop
Design Pathshala
April 22, 2014
4. Apache Hadoop Bigdata
Training By Design Pathshala
Contact us on: admin@designpathshala.com
Or Call us at: +91 120 260 5512 or +91 98 188 23045
Visit us at: http://designpathshala.com
5. Design Pathshala
Every one of our courses is written by experts in their respective fields.
We do our best to connect real-life examples with real business practices.
Learn and apply the skills at work or in your own business.
We provide online classes on different subjects, including Oracle HRMS, PeopleSoft HRMS and Java.
We offer both weekday and weekend classes.
6. Where does data come from?
9. Volume .. Amount of data
~3 ZB of data exists in the digital universe today.
>300 TB of data in the U.S. Library of Congress.
Facebook has 30+ PB.
~2.5 PB of data in a DWH.
10+ PB DWH size.
10. Velocity .. How rapidly data is growing
48 hours of new video every minute.
571 new websites every minute.
500+ TB of new data to Facebook every day.
175 million tweets every day.
1+ million customer transactions every hour.
Data production will be 44 times greater in 2020 than it was in 2009.
11. Variety .. Different forms of data
Structured: traditional databases, numeric data.
Semi-structured: JSON, XML.
Unstructured: text documents, email, video, audio, machine-generated data.
13. How companies are minting money on Big Data!
Predict exactly what customers want before they ask for it
Marketing Campaign
Improve customer service
Fraud Detection
Get customers excited about their own data
Identify customer pain points and solve them
Reduce health care costs and improve treatment
Social Graph Analysis & Sentiment Analysis
Research and development
14. How data is used by some big companies for different kinds of business analysis.
27. Who uses Hadoop?
[Slide graphic: logos of well-known Hadoop users with their cluster sizes: 42,000 nodes as of July 2011, 4,100 nodes, and 1,400 nodes.]
28. What is Hadoop
Hadoop is a framework for the distributed processing of large datasets across large clusters of commodity computers using a simple programming model.
Large datasets: terabytes or petabytes of data.
Large clusters: hundreds or thousands of nodes.
Hadoop is an open-source implementation of Google's MapReduce.
Hadoop is based on a simple programming model called MapReduce.
Hadoop is based on a simple data model: any data will fit.
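To make the programming model concrete, here is a minimal sketch of the classic word-count example, written against Hadoop's original org.apache.hadoop.mapred API (the API in use when these slides were written). The class names WordCountMapper and WordCountReducer are our own; only the Hadoop types are real.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Map: emit (word, 1) for every word in the input line.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          out.collect(word, ONE);
        }
      }
    }

    // Reduce: sum the counts emitted for each word.
    class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) sum += values.next().get();
        out.collect(key, new IntWritable(sum));
      }
    }

The framework handles splitting the input, moving the map output to the reducers, and re-running failed tasks; the programmer writes only these two small functions.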
30. What makes it especially useful
Scalable: It can reliably store and process petabytes.
Economical: It distributes the data and processing across clusters of commonly available computers (in thousands).
Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
31. Hadoop: Assumptions
Hardware will fail.
Applications need a write-once-read-many access model.
Data transfer and I/O are the bottleneck.
Very large distributed file system:
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware:
– Files are replicated to handle hardware failure
– Failures are detected and recovered from automatically
Move logic rather than data.
32. HDFS Architecture
[Diagram: a Client talks to the NameNode for metadata and to the DataNodes for data; a Secondary NameNode sits alongside the NameNode.]
NameNode: holds the metadata about the data (namespace and block locations).
DataNode: holds the physical data blocks.
SecondaryNameNode: periodically reads the NameNode's metadata and checkpoints it.
33. Distributed File System
Single namespace for the entire cluster.
Data coherency:
– Write-once-read-many access model
– Clients can only append to existing files
Files are broken up into blocks:
– Typically 64 MB block size
– Each block replicated on multiple DataNodes
Intelligent client:
– The client can find the location of blocks
– The client accesses data directly from the DataNode
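The "intelligent client" point is visible directly in the HDFS Java API: a client can ask the NameNode where a file's blocks live and then read each block from the closest DataNode. A small sketch (the input path passed on the command line is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Ask the NameNode where the blocks of a file live; a real application
    // would then read each block from the nearest DataNode directly.
    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);              // e.g. /data/input.txt
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset " + b.getOffset()
              + ", length " + b.getLength()
              + ", hosts " + String.join(",", b.getHosts()));
        }
      }
    }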
39. Apache Hadoop and the Hadoop Ecosystem
MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
41. Apache Hadoop and the Hadoop Ecosystem
Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
Oozie
A workflow scheduler system to manage Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
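As an illustration of "SQL translated into MapReduce jobs", here is a hedged sketch of querying Hive from Java over the HiveServer2 JDBC driver. The server address, credentials, and the pageviews table are made-up placeholders; only the driver class and URL scheme are Hive's.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Querying Hive through its HiveServer2 JDBC driver.
    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // Hive compiles this SQL into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url")) {
          while (rs.next()) {
            System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
          }
        }
      }
    }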
42. Apache Hadoop and the Hadoop Ecosystem
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Storm
Apache Storm is a free and open-source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
43. Apache Hadoop and the Hadoop Ecosystem
Spark
Apache Spark™ is a fast and general engine for large-scale data processing.
Drill
Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables, without needing to specify metadata definitions in a centralized store such as the Hive metastore.
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
51. Which Hadoop Distribution?
Pure-play (Apache/open source):
– Hortonworks. Pros: 100% open-source version; integration/services focused; extensive partnership network. Cons: slower interactive queries.
– Cloudera. Pros: widely used distribution; faster interactive queries; extensive tooling. Cons: proprietary extensions like Impala; commercial version only.
– MapR. Pros: enterprise- and production-ready focus; works with NFS and native Unix commands. Cons: less focused on new Hadoop features such as YARN.
Proprietary:
– PivotalHD. Pros: faster interactive query support with Greenplum; integrates with the Cloud Foundry PaaS platform. Cons: proprietary extensions; not easy to decouple.
– IBM. Pros: offers the open-source version without branching it; integrated with PaaS and IBM tools. Cons: limited releases; expensive; may not be easy to decouple.
52. [Diagram: File F is split into five 64 MB blocks (1-5); the blocks' replicas are spread across Disks 1-12, which sit on Rack 1, Rack 2 and Rack 3.]
53. Block Placement
Current strategy:
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are placed randomly
Clients read from the nearest replica.
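A toy sketch of this strategy, assuming a hypothetical three-rack topology. This is an illustration only, not HDFS's actual BlockPlacementPolicy; all rack and node names are made up.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class PlacementSketch {
      // Assumed topology: rack name -> nodes on that rack.
      static final Map<String, List<String>> RACKS = Map.of(
          "rack1", Arrays.asList("node11", "node12"),
          "rack2", Arrays.asList("node21", "node22"),
          "rack3", Arrays.asList("node31", "node32"));

      static List<String> chooseTargets(String writerNode, String writerRack) {
        Random rnd = new Random();
        List<String> targets = new ArrayList<>();
        targets.add(writerNode);                  // replica 1: the local node
        List<String> otherRacks = new ArrayList<>(RACKS.keySet());
        otherRacks.remove(writerRack);
        String remoteRack = otherRacks.get(rnd.nextInt(otherRacks.size()));
        List<String> remoteNodes = new ArrayList<>(RACKS.get(remoteRack));
        Collections.shuffle(remoteNodes, rnd);
        targets.add(remoteNodes.get(0));          // replica 2: a remote rack
        targets.add(remoteNodes.get(1));          // replica 3: same remote rack
        return targets;
      }

      public static void main(String[] args) {
        System.out.println(chooseTargets("node11", "rack1"));
      }
    }

The trade-off behind the real policy is the same one sketched here: one off-rack copy survives a whole-rack failure, while keeping two copies on one remote rack limits cross-rack write traffic.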
55. Main Properties of HDFS
Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Replication: each data block is replicated many times (the default is 3).
Failure: failure is the norm rather than the exception.
Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. DataNodes send heartbeats to the NameNode.
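Replication is configurable, both as a cluster-wide default and per file. A small sketch using the standard client API (the file path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // default for newly created files
        FileSystem fs = FileSystem.get(conf);
        // Keep an extra copy of a particularly hot file:
        fs.setReplication(new Path("/data/hot/lookup.dat"), (short) 4);
      }
    }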
57. NameNode Metadata
Metadata is held in memory. Types of metadata:
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time and replication factor
A transaction log:
– Records file creations, file deletions, etc.
58. DataNode
A block server:
– Stores data in the local file system
– Stores the metadata of a block
– Serves data to clients
Block report:
– Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data:
– Forwards data to other specified DataNodes
60. Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave, shared-nothing architecture:
– Master node (single node)
– Many slave nodes
61. JobTracker
The master node runs a JobTracker instance, which accepts job requests from clients.
There is only one JobTracker daemon running per Hadoop cluster. It:
– Determines the execution plan by deciding which files to process
– Assigns nodes to the different tasks
– Monitors all tasks as they run
62. TaskTracker
Manages the execution of individual tasks on each data node.
There is one TaskTracker per data node. Each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.
TaskTrackers constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits its tasks to other TaskTrackers.
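Tying the two daemons together: a client builds a job and submits it, the JobTracker plans it, and the TaskTrackers execute it. A minimal driver for the word-count classes sketched earlier, using the classic JobConf/JobClient API (the class name and command-line paths are ours):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Driver that hands the job to the JobTracker, which then schedules
    // map and reduce tasks on the TaskTrackers.
    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);  // blocks until the JobTracker reports the result
      }
    }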
69. Replication Engine
NameNode detects DataNode failures
Chooses new DataNodes for new replicas
Balances disk usage
Balances communication traffic to DataNodes
70. Data Pipeline & Write Anatomy
[Diagram: the HDFS client asks the NameNode to add a block, writes the data to the first DataNode, which forwards it along a pipeline of DataNodes; acknowledgements flow back up the pipeline, and the client finally reports completion to the NameNode.]
72. Data Pipelining
The client retrieves a list of DataNodes on which to place the replicas of a block.
The client writes the block to the first DataNode.
The first DataNode forwards the data to the next DataNode in the pipeline.
When all replicas are written, the client moves on to write the next block in the file.
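From the application's point of view the pipeline is invisible: the program just writes to a stream, and the client library splits the stream into blocks and drives the DataNode pipeline described above. A minimal sketch (the path and contents are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
          out.writeBytes("hello hdfs\n");  // buffered, shipped block by block
        }
      }
    }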
73. Read Anatomy
[Diagram: the HDFS client asks the NameNode for the block locations of a file, then reads the blocks directly from the DataNodes.]
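Reading mirrors writing: the client asks the NameNode for block locations and then streams each block from a (preferably nearby) DataNode. A minimal sketch reading back the hypothetical file written above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
          IOUtils.copyBytes(in, System.out, 4096, false);  // copy to stdout
        }
      }
    }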
74. Data Correctness
Checksums are used to validate data (CRC32).
File creation:
– The client computes a checksum for every 512 bytes
– The DataNode stores the checksums
File access:
– The client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
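A standalone sketch of per-chunk checksumming using Java's built-in CRC32. HDFS's real implementation lives in its own checksum classes; this only illustrates the 512-byte chunking described above.

    import java.util.zip.CRC32;

    public class ChunkChecksums {
      static final int BYTES_PER_CHECKSUM = 512;

      // Compute a CRC32 for every 512-byte chunk of the buffer,
      // as HDFS does when a file is created.
      public static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
          int off = i * BYTES_PER_CHECKSUM;
          int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
          crc.reset();
          crc.update(data, off, len);
          sums[i] = crc.getValue();  // stored by the DataNode, re-checked on read
        }
        return sums;
      }
    }

On read, the client recomputes the same per-chunk values and compares them with the stored ones; any mismatch sends it to another replica.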
75. Apache Hadoop Bigdata
Training By Design Pathshala
Contact us on: admin@designpathshala.com
Or Call us at: +91 120 260 5512 or +91 98 188 23045
Visit us at: http://designpathshala.com