A Basic Introduction to the Hadoop Eco-System
Sameer Tiwari
Hadoop Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
Break it down
• Raw Storage - HDFS
• Columnar Store - HBase
• Query engines - Hive, Pig
• Schedulers - Map-Reduce, YARN
• Streaming - Flume
• Machine Learning - Mahout
• Workflow - Oozie
• Distributed Locking - Zookeeper
Break it down
[Stack diagram: Unix OS and file system at the base; HDFS (with its API) on top of it; Map Reduce / YARN above HDFS; Pig, Hive, Oozie, and Mahout at the top layer; HBase alongside, backed by HDFS; Zookeeper, Flume, and Sqoop running alongside the stack.]
Hadoop Distributed File System (HDFS)
• History
o Based on Google File System Paper (2003)
o Built at Yahoo by a small team
• Goals
o Tolerance to Hardware failure
o Sequential access as opposed to Random
o High aggregate throughput for large data sets
o “Write Once Read Many” paradigm
HDFS - Key Components
[Diagram: Client1 (writing FileA) and Client2 (writing FileB) talk to one NameNode and four DataNodes spread across two racks. File.create() is a metadata operation handled by the NameNode (NN OPs); File.write() sends data blocks directly to the DataNodes (DN OPs), which replicate them onward via replication pipelining.]
NameNode metadata (block locations per file):
FileA: metadata e.g. size, owner...; block AB1 on DataNodes 1, 3, 4; block AB2 on DataNodes 1, 3, 4
FileB: metadata e.g. size, owner...; block BB1 on DataNodes 1, 2, 4
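The create/write split in the diagram is visible in the Java client API (org.apache.hadoop.fs). A minimal sketch, assuming a hypothetical NameNode address and file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    // File.create(): a metadata operation against the NameNode;
    // the returned stream then writes data blocks to DataNodes.
    Path file = new Path("/user/demo/fileA");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs"); // write once...
    }

    // ...read many: sequential read of the same file.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}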
Map Reduce
[Dataflow: Input -> Mappers -> Shuffle/Sort -> Reducers -> Output]
map(key1, value1) -> list<key2, value2>
reduce(key2, list<value2>) -> list<value3>
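The canonical illustration of these signatures is word count. A minimal sketch against the org.apache.hadoop.mapreduce API; class and variable names are our own:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(key1, value1) -> list<key2, value2>: emit (word, 1) per token
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }
}

// reduce(key2, list<value2>) -> list<value3>: sum the counts per word
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));
  }
}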
Map Reduce
[Diagram: Clients 1 and 2 talk to a central JobTracker; TaskTrackers on the worker nodes heartbeat to the JobTracker and run the individual tasks; HDFS holds the job resources. Numbered arrows correspond to the steps below.]
1. Client submits the job to the JobTracker (JT) via JobClient
2. JT responds with a job id
3. JobClient copies the job resources to HDFS
4. JobClient submits the job to the JT
5. TaskTracker (TT) heartbeats to the JT and is handed a task
6. TT fetches the task's resources from HDFS
7. TT executes the task (Map or Reduce)
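On the client side, steps 1 through 4 are hidden behind the Job API. A sketch of a driver for the word count classes above; input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Steps 1-4 happen behind this call: get a job id, copy job
    // resources to HDFS, submit, then wait for the tasks to run.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}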
YARN
[Diagram: a client submits an application to the ResourceManager; NodeManagers on the worker nodes host each application's ApplicationMaster and its containers. Numbered arrows correspond to the steps on the next slide.]
Notes on previous YARN slide
1. A client program submits the application, including the necessary specifications to launch the application-specific
ApplicationMaster itself.
2. The ResourceManager assumes the responsibility to negotiate a specified container in which to start the
ApplicationMaster and then launches the ApplicationMaster.
3. The ApplicationMaster, on boot-up, registers with the ResourceManager – the registration allows the client
program to query the ResourceManager for details, which allow it to directly communicate with its own
ApplicationMaster.
4. During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.
5. On successful container allocations, the ApplicationMaster launches the container by providing the container
launch specification to the NodeManager. The launch specification, typically, includes the necessary information
to allow the container to communicate with the ApplicationMaster itself.
6. The application code executing within the container then provides necessary information (progress, status etc.)
to its ApplicationMaster via an application-specific protocol.
7. During the application execution, the client that submitted the program communicates directly with the
ApplicationMaster to get status, progress updates etc. via an application-specific protocol.
8. Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters
with the ResourceManager and shuts down, allowing its own container to be repurposed.
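For concreteness, the client half of steps 1 and 7 maps onto the org.apache.hadoop.yarn.client.api.YarnClient API. A minimal sketch; the application name, AM command, and container size are placeholder values:

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // Step 1: ask the ResourceManager for a new application...
    YarnClientApplication app = yarn.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app"); // placeholder name

    // ...with a launch spec for the ApplicationMaster container (step 2).
    ContainerLaunchContext am = Records.newRecord(ContainerLaunchContext.class);
    am.setCommands(Collections.singletonList("/bin/true")); // placeholder AM command
    ctx.setAMContainerSpec(am);

    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(256); // MB for the AM container
    capability.setVirtualCores(1);
    ctx.setResource(capability);

    ApplicationId id = yarn.submitApplication(ctx);

    // Step 7: the client polls for status (here via the RM).
    YarnApplicationState state =
        yarn.getApplicationReport(id).getYarnApplicationState();
    System.out.println(id + " -> " + state);
  }
}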
Flume
• Distributed service for collecting, aggregating, and moving large volumes of streaming event data (e.g. logs) into HDFS
• An agent is a pipeline of source -> channel -> sink, wired up in a properties file
http://flume.apache.org/
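A minimal agent configuration in that style, adapted from the standard netcat example in the Flume user guide; the agent and component names a1, r1, c1, k1 are arbitrary:

# One agent (a1) = source r1 -> channel c1 -> sink k1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events (swap in an hdfs sink to land data in HDFS)
a1.sinks.k1.type = logger

# Wire them together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1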
HBase
• History
o Based on Google’s Big Table (2006)
o Built at Powerset (later acquired by Microsoft)
o Facebook and Yahoo use it extensively (~1000 machines)
• Goals
o Random R/W access
o Tables with Billions of Rows X Millions of Columns
o Often referred to as a “NoSQL” Data store
o High-speed ingest rate (Facebook: on the order of a billion messages and chats per day)
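Random reads and writes by row key are the heart of the HBase client API. A minimal sketch using the Connection/Table client classes; the table name, column family, and row key are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomRW {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("messages"))) {

      // Random write: one cell in column family "d"
      Put put = new Put(Bytes.toBytes("user1#msg42"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("hi"));
      table.put(put);

      // Random read of the same row by key
      Result row = table.get(new Get(Bytes.toBytes("user1#msg42")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"))));
    }
  }
}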
HBase - Key Components
[Diagram: the master node(s), one active plus backups, run the NameNode, JobTracker, and HMaster; each of the many slave nodes runs a DataNode, a TaskTracker, and an HRegionServer; a ZooKeeper (ZK) cluster coordinates master election, and clients talk to ZK and the region servers.]
• Google BigTable on GFS == HBase on HDFS
• Generally co-located with HDFS
• Depends on HDFS for storing its data
• Follows a Master Slave model
• Depends on a ZK quorum for Master election
Mahout
• Parallel Machine Learning and Data mining library
• Core groups of algorithms
o Recommendation - Netflix, Pandora
o Classification - “look-alike”, pattern recognition
o Clustering - Marketing and Sales
• Uses Map Reduce under the covers
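As a flavor of the recommendation group, a minimal sketch using Mahout's in-memory Taste API (Mahout's distributed versions of the same algorithms run as MapReduce jobs); the ratings file is hypothetical:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,itemID,rating" per line (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " score=" + item.getValue());
    }
  }
}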
Hive and Pig
• Higher-level languages for using MapReduce
• Hive
o Convenience of storing data in tables with schemas
o Has a SQL-like language called HiveQL
o Builds a simple optimized execution plan
• Pig
o Scripting-language interface (Pig Latin)
o Commonly used for ETL
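To contrast the two surfaces, here is the same top-URLs aggregate in each, over a hypothetical page_views data set; both compile down to MapReduce jobs.

HiveQL:
-- Table with a schema, then a SQL-like aggregate
CREATE TABLE page_views (user_id STRING, url STRING, ts BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;

Pig Latin:
-- The same aggregate as an ETL-style dataflow
views  = LOAD 'page_views' AS (user_id:chararray, url:chararray, ts:long);
by_url = GROUP views BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS hits;
top10  = ORDER counts BY hits DESC;
result = LIMIT top10 10;
DUMP result;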
Additional Components
• HCatalog for Pig and Map-Reduce
• Workflow - Oozie
• Distributed Locking - Zookeeper
• Spark and Shark from UC Berkeley
Questions?
Hadoop Eco-System
Sameer Tiwari
Hadoop Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
