Intro to Apache™ Hadoop®
A Brown Bag Session at EAI Technologies
by Sufi Nawaz
What is this Hadoop you speak of?
"Apache Hadoop is an open-
source software framework that
supports data-intensive
distributed applications, licensed
under the Apache v2 license. It
supports the running of
applications on large clusters of
commodity hardware."
- Wikipedia
Doug Cutting
(Creator)
More about Hadoop
● It is a highly scalable, fault-tolerant,
distributed compute and storage platform.
● Based on Google's GFS and MapReduce papers.
● Brings computation to data and not the other
way around.
● Created by Doug Cutting and Mike Cafarella
in 2005.
● Originally developed to support distribution
for the Nutch search engine project.
Why use Hadoop?
● Process lots of data - in petabytes even
● Distributed processing
● Uses simple programming models
● Scalable - add new nodes simply
● Cost effective - uses commodity hardware
● Flexible - Hadoop is schema-less and can
absorb any kind of data
● Fault tolerant - redistribution of failed jobs
and data recovery by data replication
When to use Hadoop, and when not to?
Good for:
● Indexing Data
● Log Analysis
● Image Manipulation
● Sorting Large Scale Data
● Data Mining
Bad for:
● Real-time processing
● Processing-intensive tasks with little data
Hadoop Modules
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce
Hadoop Distributed File System
(HDFS)
Hadoop Distributed File System
Apache HDFS is the primary distributed
storage component used by applications in the
Apache Hadoop project.
Apache HDFS can serve as a stand-alone
distributed file system as well.
Hadoop Distributed File System
A single Namenode maintains the directory
tree and manages the namespace and client
access to files. It holds the metadata (the list of
files, blocks, and datanodes) entirely in memory.
Datanodes store and manage the data blocks
as local files on servers throughout the rest of
the cluster. Each datanode reports to the
Namenode with a periodic heartbeat.
Hadoop Distributed File System
What is HDFS bad for?
● Low-latency data access. HDFS trades
latency for higher data throughput.
● Lots of small files. The namenode keeps an
entry for every file and block in memory, so
many files far smaller than the block size
(64 MB by default in Hadoop 1.x, 128 MB from
2.x) increase its memory requirements.
● Multiple writers and arbitrary modification.
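The small-files point can be made concrete with a rough calculation. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of namenode heap per namespace object (one per file, one per block). That figure, the function name, and the file sizes are assumptions for illustration and do not come from the slides.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block size, as quoted on the slide


def namenode_bytes(num_files, file_size):
    """Roughly estimate namenode heap for num_files files of file_size bytes."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


one_tb = 1024 ** 4
# The same 1 TB of data stored as 1 GB files and as 100 KB files.
large = namenode_bytes(num_files=1024, file_size=one_tb // 1024)
small = namenode_bytes(num_files=one_tb // (100 * 1024), file_size=100 * 1024)
print(f"1 GB files:   {large / 1024 ** 2:.1f} MB of namenode heap")
print(f"100 KB files: {small / 1024 ** 2:.1f} MB of namenode heap")
```

Under these assumptions the same volume of data costs the namenode roughly a thousand times more memory when it is stored as small files.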
Hadoop Distributed File System
Anatomy of write:
● DFSOutputStream splits data into packets.
● It writes the packets into an internal data queue.
● DataStreamer consumes the data queue and asks
the namenode to allocate a block.
● The namenode returns a list of datanodes, which
form the pipeline.
● DFSOutputStream also maintains an internal queue
of packets waiting to be acknowledged.
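The interplay of the two queues in the write path can be sketched with a toy model. This is not the HDFS client code: the Pipeline class, the function names, and the 4-byte packet size are hypothetical, and the real client streams packets asynchronously rather than waiting for each acknowledgement.

```python
from collections import deque


def split_into_packets(data, packet_size):
    """Cut a byte string into fixed-size packets (the last may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


class Pipeline:
    """Toy stand-in for a chain of datanodes; every node stores every packet."""

    def __init__(self, datanodes):
        self.stores = {name: [] for name in datanodes}

    def send(self, packet):
        for store in self.stores.values():
            store.append(packet)
        return True  # acknowledgement from the whole pipeline


def write(data, datanodes, packet_size=4):
    data_queue = deque(split_into_packets(data, packet_size))
    ack_queue = deque()
    pipeline = Pipeline(datanodes)  # list of datanodes supplied by the namenode
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)    # held until the pipeline acknowledges it
        if pipeline.send(packet):
            ack_queue.popleft()
    return pipeline.stores


stores = write(b"hello hdfs!", ["dn1", "dn2", "dn3"])
print(stores["dn1"])
```

Keeping unacknowledged packets in a separate queue is what lets a writer resend them if a datanode in the pipeline fails mid-write.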
Hadoop Distributed File System
Anatomy of read:
● The namenode returns the locations of the file's blocks.
● The datanodes for each block are sorted by their proximity to the
client.
● FSDataInputStream wraps DFSInputStream, which
manages datanode and namenode I/O.
● The client calls read repeatedly; data streams from the datanode
until the end of the block is reached.
● DFSInputStream then finds the best datanode for the next block.
● All of this happens transparently to the client.
● The client calls close when it has finished reading.
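The read path above can be illustrated with a small in-memory model. The data structures, the distance table, and the function name are hypothetical; a real client obtains block locations from the namenode over RPC and streams bytes from datanodes over the network.

```python
def read_file(block_locations, datanodes, distance):
    """Read a file block by block, picking the closest replica each time.

    block_locations: list of (block_id, [datanode names]) in file order,
                     standing in for the namenode's answer.
    datanodes:       dict of datanode name -> {block_id: bytes}.
    distance:        dict of datanode name -> distance from the client.
    """
    chunks, used = [], []
    for block_id, holders in block_locations:
        nearest = min(holders, key=lambda name: distance[name])
        chunks.append(datanodes[nearest][block_id])
        used.append(nearest)
    return b"".join(chunks), used


datanodes = {
    "dn1": {"blk_1": b"Hello, "},
    "dn2": {"blk_1": b"Hello, ", "blk_2": b"HDFS"},
    "dn3": {"blk_2": b"HDFS"},
}
block_locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]
distance = {"dn1": 4, "dn2": 2, "dn3": 0}

data, used = read_file(block_locations, datanodes, distance)
print(data, used)
```

The caller only sees one continuous byte stream; the switch from one datanode to the next happens inside the read loop.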
Hadoop Distributed File System
Accessibility
● DFS Shell
● DFS Admin
● Browser Interface
● Mountable HDFS
MapReduce
MapReduce
Main Components
● JobClient
● JobTracker
● TaskTracker
MapReduce
JobTracker (Master)
● A single JobTracker per cluster
● Schedules Map and Reduce tasks on TaskTrackers
● Monitors tasks and tracks TaskTracker status
● Re-executes tasks on failure
TaskTracker (Slave)
● A single TaskTracker per node (multiple in a cluster)
● Runs Map and Reduce tasks
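The JobTracker's scheduling and re-execution role can be sketched as a simple loop. This is an illustrative model only: the function names, the round-robin assignment, and the limit of three attempts are assumptions, not the actual Hadoop scheduler.

```python
from collections import deque


def run_job(tasks, trackers, run_task, max_attempts=3):
    """Assign tasks to trackers and re-queue any task whose attempt fails.

    run_task(tracker, task) returns a result or raises RuntimeError.
    A failed task is retried on a tracker it has not yet failed on.
    """
    pending = deque((task, ()) for task in tasks)
    results = {}
    turn = 0
    while pending:
        task, failed_on = pending.popleft()
        candidates = [t for t in trackers if t not in failed_on] or trackers
        tracker = candidates[turn % len(candidates)]
        turn += 1
        try:
            results[task] = run_task(tracker, task)
        except RuntimeError:
            if len(failed_on) + 1 >= max_attempts:
                raise
            pending.append((task, failed_on + (tracker,)))
    return results


# Usage: "tt2" is a hypothetical TaskTracker that always fails.
attempts = []


def flaky_run(tracker, task):
    attempts.append((task, tracker))
    if tracker == "tt2":
        raise RuntimeError(f"{tracker} lost task {task}")
    return task.upper()


results = run_job(["m0", "m1", "m2"], ["tt1", "tt2"], flaky_run)
print(results)
print(attempts)
```

The job still completes even though one tracker never succeeds, because the failed task is re-queued and handed to a different tracker.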
Who uses Hadoop?
● Yahoo!
○ Support research for Ad Systems and Web Search
● Facebook
○ 2 major clusters (1100 + 300 machines w/ 8 cores)
○ Heavy users of both streaming and Java APIs.
○ Have developed a FUSE implementation on HDFS.
● eBay
○ 532-node cluster (8 * 532 cores, 5.3 PB)
● Hulu
○ 13-machine cluster (8 cores/machine, 4 TB/machine)
○ Log storage and analysis
● Many more
○ http://wiki.apache.org/hadoop/PoweredBy
Where can I find resources?
● Hadoop Docs
○ http://hadoop.apache.org/docs/current/
● Mailing List:
○ http://hadoop.apache.org/mailing_lists.html
● White papers from Cloudera, Intel, Dell, etc.
● Hadoop in 20 Pages
(http://blog.imaginea.com/hadoop-a-short-guide/)
● Yahoo! CDN Hadoop Tutorial
● Google Search Engine (!)
Some Additional Info
● Hadoop Streaming
○ Run MapReduce jobs in any language that can read standard
input and write standard output, e.g. Ruby or Python.
● Hadoop Distributed Cache
○ Copies the specified files to the local disk of every node
that runs tasks for the job.
● Hadoop Security
○ Secure Hadoop with Kerberos
● Hadoop Federation
○ Multiple independent NameNodes, each managing part of the
namespace. NameNode High Availability (HA), which removes the
NameNode as a Single Point of Failure, is a separate feature.
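As an illustration of the Hadoop Streaming item above, here is a minimal word-count sketch in Python. Streaming passes records to the mapper and reducer as lines of text, with key and value separated by a tab by default. The function names and the run_locally helper are hypothetical; in a real Streaming job the mapper and reducer would be separate scripts that read sys.stdin and print to standard output, and Hadoop would do the sorting between them.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    # sorted() stands in for the shuffle: the reducer sees its input sorted by key.
    return list(reducer(sorted(mapper(lines))))


counts = run_locally(["Hadoop streams data", "hadoop sorts data"])
print("\n".join(counts))
```

Because the reducer only relies on equal keys arriving next to each other, the same logic works whether the sorting is done by sorted() here or by the framework on a cluster.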
