2. HADOOP
An open-source implementation of frameworks for reliable, scalable,
distributed computing and data storage.
A flexible and highly available architecture for large-scale computation
and data processing on a network of commodity hardware.
Redundant and reliable.
Batch-processing centric.
4. 2005: Doug Cutting and Michael J. Cafarella developed Hadoop with funding from
Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
Hadoop was inspired by three Google technologies:
Google File System
Google MapReduce
BigTable
5. Hadoop is a system for large-scale data processing.
It has two components:
HDFS – Hadoop Distributed File System (for storage)
Distributed across nodes
Natively redundant
MapReduce (for processing)
Splits a task across processors
“near” the data and assembles results
[Diagram: HDFS and MapReduce running on each machine of the clustered storage]
6. The MapReduce server on a typical machine is called the TaskTracker.
The HDFS server on a typical machine is called the DataNode.
[Diagram: each machine in the cluster runs both a DataNode and a TaskTracker]
10. Pig: high-level language that translates down into MapReduce jobs.
Allows you to write the job description in a high-level language instead of writing all the
MapReduce job code in Java.
Hive: SQL-like interface for computations. Takes an SQL-like language and converts
it into MapReduce jobs.
HBase: provides a simple interface to distributed data that allows real-time
access to the data.
ZooKeeper: provides a scalable, highly available coordination service for servers
by keeping the metadata about the different servers.
HADOOP ECOSYSTEM
11. Mahout – for machine learning
Sqoop – helps to move data to and from relational SQL databases
Oozie – workflow coordination tool (defining scheduled tasks)
Flume – for importing streaming data into Hadoop
Protocol Buffers, Avro, Thrift – for data serialization
OTHER USEFUL TOOLS
12. NameNode functions:
Allocation of blocks to files
Monitoring DataNodes for failure
New node addition
Replication management
Managing user requests (read/write)
DataNode functions:
Write/read blocks to/from the local file system
Perform operations as directed by the NameNode
Register/heartbeat itself with the NameNode and provide a block report to the NameNode
[Diagram: a client issues metadata operations to the NameNode and reads/writes blocks directly from/to DataNode1 … DataNoden]
FUNCTION OF HDFS COMPONENTS
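The NameNode's bookkeeping role can be sketched as a small in-memory model. This is only an illustration under assumed names (`NameNode`, `allocate`, `handle_failure` are hypothetical, not the real HDFS implementation): it shows block allocation across DataNodes and re-replication after a node failure.

```python
import random

class NameNode:
    """Toy model of the NameNode's metadata duties (illustration only)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}   # filename -> list of (block_id, [replica nodes])

    def allocate(self, filename, num_blocks):
        """Allocation of blocks to a file: place each block's replicas
        on distinct DataNodes."""
        blocks = []
        for i in range(num_blocks):
            block_id = f"{filename}_blk_{i}"
            replicas = random.sample(
                self.datanodes, min(self.replication, len(self.datanodes)))
            blocks.append((block_id, replicas))
        self.block_map[filename] = blocks
        return blocks

    def handle_failure(self, dead_node):
        """Replication management: when monitoring detects a failed
        DataNode, re-replicate any block that lost a copy."""
        self.datanodes.remove(dead_node)
        for blocks in self.block_map.values():
            for block_id, replicas in blocks:
                if dead_node in replicas:
                    replicas.remove(dead_node)
                    candidates = [d for d in self.datanodes if d not in replicas]
                    if candidates:
                        replicas.append(random.choice(candidates))

nn = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.allocate("/logs/app.log", num_blocks=2)
nn.handle_failure("dn2")
# Every block keeps its full replica count, with no copy on the failed node.
```

The real NameNode also persists this metadata and receives it back from DataNode block reports; here it lives only in memory to keep the sketch short.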
14. Parallel programming model for large clusters
Co-designed, co-developed, and co-deployed with HDFS
Processing takes place where the data is located.
Conducted in two phases:
Map
Reduce
Map: process a key/value pair to generate intermediate key/value pairs.
Reduce: merge all intermediate values associated with the same key.
[Diagram: input data is split, processed by map threads 1…P, sorted and merged, then reduced by reduce threads 1…Q into the output data]
MAPREDUCE
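The two phases above can be sketched in a single process using the classic word-count example (an illustration only; real Hadoop distributes the map and reduce tasks across the cluster):

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Group intermediate pairs by key (the sort/merge step in the
    diagram), then reduce: sum all values associated with each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase("the quick fox and the lazy dog"))
print(counts["the"])  # -> 2
```

In Hadoop the grouping step happens between machines (the shuffle), but the contract is the same: every value for one key reaches a single reducer.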
15. The Driver configures how the program will run in Hadoop.
The Mapper is fed input data from Hadoop and produces intermediate output.
The Reducer is fed the intermediate output from the mappers
and produces the final output.
When data moves between the mappers and reducers,
it might cross the network, so all the values must be serializable.
STRUCTURE OF A MAPREDUCE PROGRAM
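The serialization requirement can be made concrete with a short sketch: between the map and reduce phases, intermediate key/value pairs must survive a round trip through a byte representation before crossing the network. Hadoop uses its own Writable format for this; `pickle` below is only a stand-in for illustration.

```python
import pickle

# Intermediate (key, value) pairs as a mapper might emit them.
intermediate = [("hello", 1), ("world", 1), ("hello", 1)]

# Mapper side: serialize the pairs into bytes for the network/disk.
wire_bytes = pickle.dumps(intermediate)

# Reducer side: deserialize the bytes back into the same pairs.
received = pickle.loads(wire_bytes)

# The round trip is lossless only because every value is serializable.
assert received == intermediate
```

A value that cannot be serialized (an open file handle, for example) would fail at exactly this boundary, which is why the requirement applies to every key and value type a job uses.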