SlideShare a Scribd company logo
© 2011 IBM Corporation
Hadoop architecture
IBM Information Management – Cloud Computing Center of Competence
IBM Canada Labs
© 2011 IBM Corporation
Agenda
• Terminology review
• HDFS
• MapReduce
• Type of nodes
• Topology awareness
2
© 2011 IBM Corporation
Agenda
• Terminology review
• HDFS
• MapReduce
• Type of nodes
• Topology awareness
3
© 2011 IBM Corporation
4
Node 1
Terminology review
© 2011 IBM Corporation
5
Node 2
Node 1
Terminology review
© 2011 IBM Corporation
6
Node 2
Node n
…
Node 1
Terminology review
© 2011 IBM Corporation
7
Rack 1
Node 2
Node n
…
Node 1
Terminology review
© 2011 IBM Corporation
8
Rack 1
Node 2
Node n
…
Node 1
Node 2
Node n
…
Rack 2
Node 1
Terminology review
© 2011 IBM Corporation
9
Rack 1
Node 2
Node n
…
Node 1
Node 2
Node n
…
Rack 2
Node 1
Node 2
Node n
…
Rack n
Node 1
…
Terminology review
© 2011 IBM Corporation
10
Hadoop cluster
Rack 1
Node 2
Node n
…
Node 1
Node 2
Node n
…
Rack 2
Node 1
Node 2
Node n
…
Rack n
Node 1
…
Terminology review
© 2011 IBM Corporation
Hadoop architecture
• Two main components:
– Hadoop Distributed File System (HDFS)
11
– MapReduce Engine
© 2011 IBM Corporation
Agenda
• Terminology review
• HDFS
• MapReduce
• Type of nodes
• Topology awareness
12
© 2011 IBM Corporation
Hadoop distributed file system (HDFS)
13
• Hadoop file system that runs on top of
existing file system
• Designed to handle very large files with
streaming data access patterns
• Uses blocks to store a file or parts of a file
© 2011 IBM Corporation
HDFS - Blocks
14
• File Blocks
– 64MB (default), 128MB (recommended) – compare to
4KB in UNIX
– Behind the scenes, 1 HDFS block is supported by
multiple operating system (OS) blocks
128 MB
OS Blocks
HDFS Block
. . .
© 2011 IBM Corporation
HDFS - Blocks
15
• Fits well with replication to provide fault tolerance and
availability
• Advantages of blocks:
– Fixed size – easy to calculate how many fit on a disk
– A file can be larger than any single disk in the network
– If a file or a chunk of the file is smaller than the block size, only
needed space is used. Eg: 420MB file is split as:
128 MB 128 MB128 MB 36 MB
© 2011 IBM Corporation
HDFS - Replication
• Blocks with data are replicated to multiple nodes
• Allows for node failure without data loss
16
Node 1
Node 2
Node 3
© 2011 IBM Corporation
Writing a file to HDFS
17
© 2011 IBM Corporation
Writing a file to HDFS
18
© 2011 IBM Corporation
Writing a file to HDFS
19
© 2011 IBM Corporation
Writing a file to HDFS
20
© 2011 IBM Corporation
Writing a file to HDFS
21
© 2011 IBM Corporation
Writing a file to HDFS
22
© 2011 IBM Corporation
Writing a file to HDFS
23
© 2011 IBM Corporation
Writing a file to HDFS
24
© 2011 IBM Corporation
Writing a file to HDFS
25
© 2011 IBM Corporation
Writing a file to HDFS
26
© 2011 IBM Corporation
Writing a file to HDFS
27
© 2011 IBM Corporation
HDFS Command line interface
28
• File System Shell (fs)
• Invoked as follows:
hadoop fs <args>
• Example:
• Listing the current directory in hdfs
hadoop fs –ls .
© 2011 IBM Corporation
HDFS Command line interface
29
• FS shell commands take paths URIs as argument
• URI format: scheme://authority/path
• Scheme:
• For the local filesystem, the scheme is file
• For HDFS, the scheme is hdfs
hadoop fs –copyFromLocal
file://myfile.txt
hdfs://localhost/user/keith/myfile.txt
• Scheme and authority are optional
• Defaults are taken from configuration file core-site.xml
© 2011 IBM Corporation
HDFS Command line interface
30
• Many POSIX-like commands
• cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail
• Some HDFS-specific commands
• copyFromLocal, copyToLocal, get, getmerge, put, setrep
© 2011 IBM Corporation
HDFS – Specific commands
31
• copyFromLocal / put
• Copy files from the local file system into fs
hadoop fs -copyFromLocal <localsrc> .. <dst>
hadoop fs -put <localsrc> .. <dst>
Or
© 2011 IBM Corporation
HDFS – Specific commands
32
• copyToLocal / get
• Copy files from fs into the local file system
hadoop fs -copyToLocal [-ignorecrc] [-crc]
<src> <localdst>
hadoop fs -get [-ignorecrc] [-crc]
<src> <localdst>
Or
© 2011 IBM Corporation
HDFS – Specific commands
33
• getMerge
• Get all the files in the directories that match the source file pattern
• Merge and sort them to only one file on local fs
• <src> is kept
hadoop fs -getmerge <src> <localdst>
© 2011 IBM Corporation
HDFS – Specific commands
34
• setRep
• Set the replication level of a file.
• The -R flag requests a recursive change of replication level for an
entire tree.
• If -w is specified, waits until new replication level is achieved.
hadoop fs -setrep [-R] [-w] <rep> <path/file>
© 2011 IBM Corporation
Agenda
• Terminology review
• HDFS
• MapReduce
• Type of nodes
• Topology awareness
35
© 2011 IBM Corporation
MapReduce engine
36
• Technology from Google
• A MapReduce program consists of map and reduce
functions
• A MapReduce job is broken into tasks that run in
parallel
© 2011 IBM Corporation
Agenda
• Terminology review
• HDFS
• MapReduce
• Type of nodes
• Topology awareness
37
© 2011 IBM Corporation
Types of nodes - Overview
38
• HDFS nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
• There are other nodes not discussed in this course
© 2011 IBM Corporation
Types of nodes - Overview
39
© 2011 IBM Corporation
Types of nodes - Overview
40
© 2011 IBM Corporation
Types of nodes - Overview
41
© 2011 IBM Corporation
Types of nodes - Overview
42
© 2011 IBM Corporation
Types of nodes - NameNode
43
• NameNode
– Only one per Hadoop cluster
– Manages the filesystem namespace and metadata
– Single point of failure, but mitigated by writing state to
multiple filesystems
– Single point of failure: Don’t use inexpensive
commodity hardware for this node, large memory
requirements
© 2011 IBM Corporation
Types of nodes - DataNode
44
• DataNode
– Many per Hadoop cluster
– Manages blocks with data and
serves them to clients
– Periodically reports to name
node the list of blocks it stores
– Use inexpensive commodity
hardware for this node
© 2011 IBM Corporation
Types of nodes - JobTracker
45
• JobTracker node
– One per Hadoop cluster
– Receives job requests submitted by client
– Schedules and monitors MapReduce jobs on task trackers
© 2011 IBM Corporation
Types of nodes - TaskTracker
46
• TaskTracker node
– Many per Hadoop cluster
– Executes MapReduce operations
– Reads blocks from DataNodes
© 2011 IBM Corporation
Agenda
• Terminology review
• HDFS
• MapReduce
• Type of nodes
• Topology awareness
47
© 2011 IBM Corporation
Topology awareness (or Rack awareness)
48
Bandwidth becomes progressively smaller in the following scenarios:
© 2011 IBM Corporation
Topology awareness
49
Bandwidth becomes progressively smaller in the following scenarios:
1.Process on the same node.
© 2011 IBM Corporation
Bandwidth becomes progressively smaller in the following scenarios:
1.Process on the same node
2.Different nodes on the same rack
Topology awareness
50
© 2011 IBM Corporation
Bandwidth becomes progressively smaller in the following scenarios:
1.Process on the same node
2.Different nodes on the same rack
3.Nodes on different racks in the same data center
Topology awareness
51
© 2011 IBM Corporation
Bandwidth becomes progressively smaller in the following scenarios:
1.Process on the same node
2.Different nodes on the same rack
3.Nodes on different racks in the same data center
4.Nodes in different data centers
Topology awareness
52
© 2011 IBM Corporation
Thank you!

More Related Content

What's hot

In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
DataWorks Summit
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 

What's hot (20)

Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3
 
Parallel HDF5 Developments
Parallel HDF5 DevelopmentsParallel HDF5 Developments
Parallel HDF5 Developments
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
 
Caching and Buffering in HDF5
Caching and Buffering in HDF5Caching and Buffering in HDF5
Caching and Buffering in HDF5
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined InfrastructureRed Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
 
HDF Tools Tutorial
HDF Tools TutorialHDF Tools Tutorial
HDF Tools Tutorial
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral ProgramBig Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
Big Lab Problems Solved with Spectrum Scale: Innovations for the Coral Program
 
HDF4 and HDF5 Performance Preliminary Results
HDF4 and HDF5 Performance Preliminary ResultsHDF4 and HDF5 Performance Preliminary Results
HDF4 and HDF5 Performance Preliminary Results
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
HDF5 I/O Performance
HDF5 I/O PerformanceHDF5 I/O Performance
HDF5 I/O Performance
 

Similar to hadoop architecture -Big data hadoop

IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Leons Petražickis
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
NIKHILGR3
 

Similar to hadoop architecture -Big data hadoop (20)

IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
HADOOP.pptx
HADOOP.pptxHADOOP.pptx
HADOOP.pptx
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hdfs architecture
Hdfs architectureHdfs architecture
Hdfs architecture
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
big data ppt.ppt
big data ppt.pptbig data ppt.ppt
big data ppt.ppt
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Tutorial Haddop 2.3
Tutorial Haddop 2.3Tutorial Haddop 2.3
Tutorial Haddop 2.3
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Hadoop -HDFS.ppt
Hadoop -HDFS.pptHadoop -HDFS.ppt
Hadoop -HDFS.ppt
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 

More from jasikadogra

Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
jasikadogra
 

More from jasikadogra (20)

Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing
Email marketingEmail marketing
Email marketing
 
Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing- Excelrange
Email marketing- ExcelrangeEmail marketing- Excelrange
Email marketing- Excelrange
 
FB marketing notes- Excelrange
FB marketing notes- ExcelrangeFB marketing notes- Excelrange
FB marketing notes- Excelrange
 
Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing
Email marketingEmail marketing
Email marketing
 
Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing
Email marketingEmail marketing
Email marketing
 
FB NOTES
FB NOTESFB NOTES
FB NOTES
 
Email marketing Notes - excelrange
Email marketing Notes - excelrangeEmail marketing Notes - excelrange
Email marketing Notes - excelrange
 
Email marketing Notes - excelrange
Email marketing Notes - excelrangeEmail marketing Notes - excelrange
Email marketing Notes - excelrange
 
Email marketing notes- excelrange
Email marketing notes- excelrangeEmail marketing notes- excelrange
Email marketing notes- excelrange
 
Email marketing Notes- Excelrange
Email marketing Notes- ExcelrangeEmail marketing Notes- Excelrange
Email marketing Notes- Excelrange
 
Email marketing
Email marketingEmail marketing
Email marketing
 
Facebook Ads Notes
Facebook Ads NotesFacebook Ads Notes
Facebook Ads Notes
 
Wordpress tutorial - Excelrange
Wordpress tutorial - ExcelrangeWordpress tutorial - Excelrange
Wordpress tutorial - Excelrange
 
digital marketing curriculum
digital marketing curriculumdigital marketing curriculum
digital marketing curriculum
 

Recently uploaded

Industrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training ReportIndustrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training Report
Avinash Rai
 

Recently uploaded (20)

Benefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational ResourcesBenefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational Resources
 
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdfTelling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
 
[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation
 
Salient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptxSalient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptx
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
The impact of social media on mental health and well-being has been a topic o...
The impact of social media on mental health and well-being has been a topic o...The impact of social media on mental health and well-being has been a topic o...
The impact of social media on mental health and well-being has been a topic o...
 
Industrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training ReportIndustrial Training Report- AKTU Industrial Training Report
Industrial Training Report- AKTU Industrial Training Report
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
size separation d pharm 1st year pharmaceutics
size separation d pharm 1st year pharmaceuticssize separation d pharm 1st year pharmaceutics
size separation d pharm 1st year pharmaceutics
 
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptxslides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matrices
 
NCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdfNCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdf
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptxMatatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 

hadoop architecture -Big data hadoop

  • 1. © 2011 IBM Corporation Hadoop architecture IBM Information Management – Cloud Computing Center of Competence IBM Canada Labs
  • 2. © 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 2
  • 3. © 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 3
  • 4. © 2011 IBM Corporation 4 Node 1 Terminology review
  • 5. © 2011 IBM Corporation 5 Node 2 Node 1 Terminology review
  • 6. © 2011 IBM Corporation 6 Node 2 Node n … Node 1 Terminology review
  • 7. © 2011 IBM Corporation 7 Rack 1 Node 2 Node n … Node 1 Terminology review
  • 8. © 2011 IBM Corporation 8 Rack 1 Node 2 Node n … Node 1 Node 2 Node n … Rack 2 Node 1 Terminology review
  • 9. © 2011 IBM Corporation 9 Rack 1 Node 2 Node n … Node 1 Node 2 Node n … Rack 2 Node 1 Node 2 Node n … Rack n Node 1 … Terminology review
  • 10. © 2011 IBM Corporation 10 Hadoop cluster Rack 1 Node 2 Node n … Node 1 Node 2 Node n … Rack 2 Node 1 Node 2 Node n … Rack n Node 1 … Terminology review
  • 11. © 2011 IBM Corporation Hadoop architecture • Two main components: – Hadoop Distributed File System (HDFS) 11 – MapReduce Engine
  • 12. © 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 12
  • 13. © 2011 IBM Corporation Hadoop distributed file system (HDFS) 13 • Hadoop file system that runs on top of existing file system • Designed to handle very large files with streaming data access patterns • Uses blocks to store a file or parts of a file
  • 14. © 2011 IBM Corporation HDFS - Blocks 14 • File Blocks – 64MB (default), 128MB (recommended) – compare to 4KB in UNIX – Behind the scenes, 1 HDFS block is supported by multiple operating system (OS) blocks 128 MB OS Blocks HDFS Block . . .
  • 15. © 2011 IBM Corporation HDFS - Blocks 15 • Fits well with replication to provide fault tolerance and availability • Advantages of blocks: – Fixed size – easy to calculate how many fit on a disk – A file can be larger than any single disk in the network – If a file or a chunk of the file is smaller than the block size, only needed space is used. Eg: 420MB file is split as: 128 MB 128 MB128 MB 36 MB
  • 16. © 2011 IBM Corporation HDFS - Replication • Blocks with data are replicated to multiple nodes • Allows for node failure without data loss 16 Node 1 Node 2 Node 3
  • 17. © 2011 IBM Corporation Writing a file to HDFS 17
  • 18. © 2011 IBM Corporation Writing a file to HDFS 18
  • 19. © 2011 IBM Corporation Writing a file to HDFS 19
  • 20. © 2011 IBM Corporation Writing a file to HDFS 20
  • 21. © 2011 IBM Corporation Writing a file to HDFS 21
  • 22. © 2011 IBM Corporation Writing a file to HDFS 22
  • 23. © 2011 IBM Corporation Writing a file to HDFS 23
  • 24. © 2011 IBM Corporation Writing a file to HDFS 24
  • 25. © 2011 IBM Corporation Writing a file to HDFS 25
  • 26. © 2011 IBM Corporation Writing a file to HDFS 26
  • 27. © 2011 IBM Corporation Writing a file to HDFS 27
  • 28. © 2011 IBM Corporation HDFS Command line interface 28 • File System Shell (fs) • Invoked as follows: hadoop fs <args> • Example: • Listing the current directory in hdfs hadoop fs –ls .
  • 29. © 2011 IBM Corporation HDFS Command line interface 29 • FS shell commands take paths URIs as argument • URI format: scheme://authority/path • Scheme: • For the local filesystem, the scheme is file • For HDFS, the scheme is hdfs hadoop fs –copyFromLocal file://myfile.txt hdfs://localhost/user/keith/myfile.txt • Scheme and authority are optional • Defaults are taken from configuration file core-site.xml
  • 30. © 2011 IBM Corporation HDFS Command line interface 30 • Many POSIX-like commands • cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail • Some HDFS-specific commands • copyFromLocal, copyToLocal, get, getmerge, put, setrep
  • 31. © 2011 IBM Corporation HDFS – Specific commands 31 • copyFromLocal / put • Copy files from the local file system into fs hadoop fs -copyFromLocal <localsrc> .. <dst> hadoop fs -put <localsrc> .. <dst> Or
  • 32. © 2011 IBM Corporation HDFS – Specific commands 32 • copyToLocal / get • Copy files from fs into the local file system hadoop fs -copyToLocal [-ignorecrc] [-crc] <src> <localdst> hadoop fs -get [-ignorecrc] [-crc] <src> <localdst> Or
  • 33. © 2011 IBM Corporation HDFS – Specific commands 33 • getMerge • Get all the files in the directories that match the source file pattern • Merge and sort them to only one file on local fs • <src> is kept hadoop fs -getmerge <src> <localdst>
  • 34. © 2011 IBM Corporation HDFS – Specific commands 34 • setRep • Set the replication level of a file. • The -R flag requests a recursive change of replication level for an entire tree. • If -w is specified, waits until new replication level is achieved. hadoop fs -setrep [-R] [-w] <rep> <path/file>
  • 35. © 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 35
  • 36. © 2011 IBM Corporation MapReduce engine 36 • Technology from Google • A MapReduce program consists of map and reduce functions • A MapReduce job is broken into tasks that run in parallel
  • 37. © 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 37
  • 38. © 2011 IBM Corporation Types of nodes - Overview 38 • HDFS nodes – NameNode – DataNode • MapReduce nodes – JobTracker – TaskTracker • There are other nodes not discussed in this course
  • 39. © 2011 IBM Corporation Types of nodes - Overview 39
  • 40. © 2011 IBM Corporation Types of nodes - Overview 40
  • 41. © 2011 IBM Corporation Types of nodes - Overview 41
  • 42. © 2011 IBM Corporation Types of nodes - Overview 42
  • 43. © 2011 IBM Corporation Types of nodes - NameNode 43 • NameNode – Only one per Hadoop cluster – Manages the filesystem namespace and metadata – Single point of failure, but mitigated by writing state to multiple filesystems – Single point of failure: Don’t use inexpensive commodity hardware for this node, large memory requirements
  • 44. © 2011 IBM Corporation Types of nodes - DataNode 44 • DataNode – Many per Hadoop cluster – Manages blocks with data and serves them to clients – Periodically reports to name node the list of blocks it stores – Use inexpensive commodity hardware for this node
  • 45. © 2011 IBM Corporation Types of nodes - JobTracker 45 • JobTracker node – One per Hadoop cluster – Receives job requests submitted by client – Schedules and monitors MapReduce jobs on task trackers
  • 46. © 2011 IBM Corporation Types of nodes - TaskTracker 46 • TaskTracker node – Many per Hadoop cluster – Executes MapReduce operations – Reads blocks from DataNodes
  • 47. © 2011 IBM Corporation Agenda • Terminology review • HDFS • MapReduce • Type of nodes • Topology awareness 47
  • 48. © 2011 IBM Corporation Topology awareness (or Rack awareness) 48 Bandwidth becomes progressively smaller in the following scenarios:
  • 49. © 2011 IBM Corporation Topology awareness 49 Bandwidth becomes progressively smaller in the following scenarios: 1.Process on the same node.
  • 50. © 2011 IBM Corporation Bandwidth becomes progressively smaller in the following scenarios: 1.Process on the same node 2.Different nodes on the same rack Topology awareness 50
  • 51. © 2011 IBM Corporation Bandwidth becomes progressively smaller in the following scenarios: 1.Process on the same node 2.Different nodes on the same rack 3.Nodes on different racks in the same data center Topology awareness 51
  • 52. © 2011 IBM Corporation Bandwidth becomes progressively smaller in the following scenarios: 1.Process on the same node 2.Different nodes on the same rack 3.Nodes on different racks in the same data center 4.Nodes in different data centers Topology awareness 52
  • 53. © 2011 IBM Corporation Thank you!