Hadoop Architecture
Presented by:
Yojana Nanaware
ME(CSE-I)
Agenda
• What is Hadoop?
• Why, When, Where?
• Hadoop : How?
• Hadoop Architecture
• Hadoop Common
• HDFS
• Hadoop Map/Reduce
• Process
• Hadoop Community
• Conclusion
• References
What is Hadoop?
• A SMART WAY TO STORE & ANALYZE
DATA
• Created by Doug Cutting, who also
originated the open-source projects
Lucene and Nutch
• Open-source project administered by the
Apache Software Foundation. Hadoop
consists of two key services:
What is Hadoop?
– Hadoop Distributed File System (HDFS)
– Map/Reduce
• Hadoop runs large-scale, high-performance
processing jobs reliably, in spite of system
changes or failures
Why Hadoop?
• Need to process 100 TB datasets
(arithmetic below)
• On 1 node:
– Scanning @ 50 MB/s = 23 days
– MTBF = 3 years
• On a 1000-node cluster:
– Scanning @ 50 MB/s = 33 mins
– MTBF ≈ 1 day
• Need an efficient, reliable & usable
framework
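Those figures follow directly from the data size, assuming a full sequential scan and independent node failures:

  100 TB / 50 MB/s = 2,000,000 s ≈ 23 days   (1 node)
  2,000,000 s / 1000 nodes = 2,000 s ≈ 33 mins   (1000 nodes)
  Per-node MTBF of 3 years / 1000 nodes ≈ 1 day between failures somewhere in the cluster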
Where & When?
• Where
– Batch data processing, not
real-time / user-facing
– Highly parallel, data-intensive
distributed applications
– Very large production
deployments
• When
– Processing lots of
unstructured data
– When your processing
can easily be made
parallel
– Running batch jobs is
acceptable
– When you have access
to lots of cheap
hardware
Hadoop : How?
• Commodity hardware cluster
• Distributed File System
– Modeled on GFS
• Distributed Processing Framework
– Using Map/Reduce metaphor
• Open-source Java
– Grew out of the Apache Lucene/Nutch project
Hadoop Architecture
Hadoop consists of:
•Hadoop Common
– Supports the other Hadoop subprojects
•HDFS
– Provides high-throughput access to application
data
•MapReduce
– Distributed computation on large data sets
across clusters
Hadoop Common
• A set of common utilities
• Includes the file system, RPC, & serialization
libraries
HDFS
• Primary storage system
• Creates multiple replicas of data blocks &
distributes them on compute nodes
throughout a cluster to enable reliable,
extremely rapid computations.
• Replication & locality (see the sketch below)
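As a hedged sketch of how an application sees this through the Java FileSystem API (the NameNode URI, path, and replication factor here are illustrative assumptions, not taken from these slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; on a real cluster this normally
    // comes from core-site.xml rather than being set in code.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    conf.set("dfs.replication", "3"); // keep three replicas of each block

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/hello.txt"); // illustrative path
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeUTF("hello HDFS"); // blocks get replicated across DataNodes
    }
    fs.close();
  }
}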
HDFS Architecture
Hadoop MapReduce
• The Map/Reduce programming model
– Framework
– Pluggable user code
• A common design pattern in data processing:
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
Map/Reduce Implementation
1. Split the input files
2. Assign masters &
workers
3. Map tasks
4. Write intermediate
data to disk
5. Read & sort
intermediate data
6. Reduce tasks
7. Return
Example of Map/Reduce word count
• Read text files & count how often each
word occurs; a Java sketch follows below
– The input is text files
– The output is a text file
• Each line: word, tab, count
• Map – produce (word, count) pairs
• Reduce – for each word, sum up the
counts
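A sketch of this job in Java, closely following the standard Apache WordCount example (class and path names are the usual example names, not specific to these slides):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in a line, emit (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: for each word, sum up the counts
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // output line: word <tab> count
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with bin/hadoop jar and input/output paths as arguments, following the process described on the next slide.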
Process
• Installation
– Requirements: Linux,
Java 1.6, sshd, rsync
– Configure SSH for
password-free
authentication
– Unpack the Hadoop
distribution
– Edit a few configuration
files
– Format the DFS on the
name node
– Start all the daemon
processes
• Execution
– Compile your job into a
jar file
– Copy input data into
HDFS
– Execute bin/hadoop jar
with relevant arguments
– Monitor tasks via the web
interface (optional)
– Examine the output when
the job is complete
(example commands below)
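As a sketch, the same flow as commands for a Hadoop 1.x-style layout (jar name, main class, and paths are illustrative; the output file name varies by version):

  $ bin/hadoop namenode -format              # format the DFS on the name node
  $ bin/start-all.sh                         # start the daemon processes
  $ bin/hadoop fs -put input/ input          # copy input data into HDFS
  $ bin/hadoop jar wordcount.jar WordCount input output
  $ bin/hadoop fs -cat output/part-r-00000   # examine the output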
Hadoop Community
• Hadoop Users
– Adobe
– Alibaba
– Amazon
– AOL
– Facebook
– Google
– IBM
• Major Contributors
– Apache
– Cloudera
– Yahoo
Conclusion
• Designed to run on cheap commodity
hardware
• Handles data replication & node failure
• Cost-saving, efficient & reliable data
processing
References
• http://www.newyorksys.com/hadoop-online-training
• Hadoop on Wikipedia
(http://en.wikipedia.org/wiki/Hadoop)
• http://hadoop.apache.org/core/docs/current/api/
