2. HADOOP
An open-source implementation of frameworks for reliable, scalable,
distributed computing and data storage.
A flexible and highly available architecture for large-scale computation
and data processing on a network of commodity hardware.
Redundant and reliable.
Batch-processing centric.
4. 2005: Doug Cutting and Michael J. Cafarella developed Hadoop with funding from
Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
Hadoop was inspired by three Google technologies:
Google File System
Google MapReduce
BigTable
5. Hadoop is a system for large-scale data processing.
It has two components:
HDFS – Hadoop Distributed File System (for storage)
Distributed across nodes
Natively redundant
MapReduce (for processing)
Splits a task across processors
“near” the data and assembles results
[Diagram: HDFS and MapReduce running on each machine of the clustered storage]
6. The MapReduce server on a typical machine is called the TaskTracker.
The HDFS server on a typical machine is called the DataNode.
[Diagram: each machine in the cluster runs both a DataNode and a TaskTracker]
10. Pig: high-level language that translates down into MapReduce jobs.
Allows you to write the job description in a high-level language instead of writing all the
MapReduce job code in Java.
Hive: SQL-like interface for computations. Takes an SQL-like language and converts
it into MapReduce jobs.
HBase: provides a simple interface to distributed data that allows real-time
access to the data.
ZooKeeper: provides a scalable, highly available coordination service for servers
by keeping the metadata about the different servers.
HADOOP ECOSYSTEM
11. Mahout – for machine learning
Sqoop – helps to move data to and from relational SQL databases
Oozie – workflow coordination tool (defining scheduled tasks)
Flume – for importing streaming data into Hadoop
Protocol Buffers, Avro, Thrift – for data serialization
OTHER USEFUL TOOLS
12. NameNode functions:
Allocation of blocks to files
Monitoring DataNodes for failure
New node addition
Replication management
Managing user requests (read/write)
DataNode functions:
Write/read blocks to/from the local file system
Perform operations as directed by the NameNode
Register/heartbeat itself with the NameNode and provide a block report to the NameNode
[Diagram: a client issues metadata operations to the NameNode and reads/writes blocks directly from/to DataNode1 … DataNoden]
FUNCTION OF HDFS COMPONENTS
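The NameNode's bookkeeping role can be sketched as a small in-memory model. This is only an illustration under assumed names (`NameNode`, `allocate`, `handle_failure` are hypothetical, not the real HDFS implementation): it shows block allocation across DataNodes and re-replication after a node failure.

```python
import random

class NameNode:
    """Toy model of the NameNode's metadata duties (illustration only)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}   # filename -> list of (block_id, [replica nodes])

    def allocate(self, filename, num_blocks):
        """Allocation of blocks to a file: place each block's replicas
        on distinct DataNodes."""
        blocks = []
        for i in range(num_blocks):
            block_id = f"{filename}_blk_{i}"
            replicas = random.sample(
                self.datanodes, min(self.replication, len(self.datanodes)))
            blocks.append((block_id, replicas))
        self.block_map[filename] = blocks
        return blocks

    def handle_failure(self, dead_node):
        """Replication management: when monitoring detects a failed
        DataNode, re-replicate any block that lost a copy."""
        self.datanodes.remove(dead_node)
        for blocks in self.block_map.values():
            for block_id, replicas in blocks:
                if dead_node in replicas:
                    replicas.remove(dead_node)
                    candidates = [d for d in self.datanodes if d not in replicas]
                    if candidates:
                        replicas.append(random.choice(candidates))

nn = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.allocate("/logs/app.log", num_blocks=2)
nn.handle_failure("dn2")
# Every block keeps its full replica count, with no copy on the failed node.
```

The real NameNode also persists this metadata and receives it back from DataNode block reports; here it lives only in memory to keep the sketch short.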
14. Parallel programming model for large clusters
Co-designed, co-developed, and co-deployed with HDFS
Processing takes place where the data is located.
Conducted in two phases:
Map
Reduce
Map: process a key/value pair to generate intermediate key/value pairs.
Reduce: merge all intermediate values associated with the same key.
[Diagram: input data is split, processed by map threads 1…P, sorted and merged, then reduced by reduce threads 1…Q into the output data]
MAPREDUCE
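The two phases above can be sketched in a single process using the classic word-count example (an illustration only; real Hadoop distributes the map and reduce tasks across the cluster):

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Group intermediate pairs by key (the sort/merge step in the
    diagram), then reduce: sum all values associated with each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase("the quick fox and the lazy dog"))
print(counts["the"])  # -> 2
```

In Hadoop the grouping step happens between machines (the shuffle), but the contract is the same: every value for one key reaches a single reducer.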
15. The Driver configures how the program will run in Hadoop.
The Mapper is fed input data from Hadoop and produces intermediate output.
The Reducer is fed the intermediate output from the mappers
and produces the final output.
When data moves between the mappers and reducers,
it might cross the network, so all the values must be serializable.
STRUCTURE OF A MAPREDUCE PROGRAM
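The serialization requirement can be made concrete with a short sketch: between the map and reduce phases, intermediate key/value pairs must survive a round trip through a byte representation before crossing the network. Hadoop uses its own Writable format for this; `pickle` below is only a stand-in for illustration.

```python
import pickle

# Intermediate (key, value) pairs as a mapper might emit them.
intermediate = [("hello", 1), ("world", 1), ("hello", 1)]

# Mapper side: serialize the pairs into bytes for the network/disk.
wire_bytes = pickle.dumps(intermediate)

# Reducer side: deserialize the bytes back into the same pairs.
received = pickle.loads(wire_bytes)

# The round trip is lossless only because every value is serializable.
assert received == intermediate
```

A value that cannot be serialized (an open file handle, for example) would fail at exactly this boundary, which is why the requirement applies to every key and value type a job uses.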