SlideShare a Scribd company logo
1 of 20
 Open source software framework designed
for storage and processing of large scale data
on clusters of commodity hardware
 Created by Doug Cutting and Mike Carafella
in 2005.
 Cutting named the program after his son’s
toy elephant.
http://www.flaxit.com/ Ph: +1-720-463-3800
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
http://www.flaxit.com/ Ph: +1-720-463-3800
•Contains Libraries and other modulesHadoop Common
•Hadoop Distributed File SystemHDFS
•Yet Another Resource NegotiatorHadoop YARN
•A programming model for large scale
data processing
Hadoop
MapReduce
http://www.flaxit.com/ Ph: +1-720-463-3800
 What were the limitations of earlier large-
scale computing?
 What requirements should an alternative
approach have?
 How does Hadoop address those
requirements?
http://www.flaxit.com/ Ph: +1-720-463-3800
 Historically computation was processor-
bound
◦ Data volume has been relatively small
◦ Complicated computations are performed on that
data
 Advances in computer technology has
historically centered around improving the
power of a single machine
http://www.flaxit.com/ Ph: +1-720-463-3800
 Typically divided into Data Nodes and
Compute Nodes
 At compute time, data is copied to the
Compute Nodes
 Fine for relatively small amounts of data
 Modern systems deal with far more data than
was gathering in the past
http://www.flaxit.com/ Ph: +1-720-463-3800
 Must support partial
failure
 Must be scalable
http://www.flaxit.com/ Ph: +1-720-463-3800
 Failure of a single component must not cause
the failure of the entire system only a
degradation of the application performance
 Failure should not
result in the loss
of any data
http://www.flaxit.com/ Ph: +1-720-463-3800
 If a component fails, it should be able to
recover without restarting the entire system
 Component failure or recovery during a job
must not affect the final output
http://www.flaxit.com/ Ph: +1-720-463-3800
 Increasing resources should increase load
capacity
 Increasing the load on the system should
result in a graceful decline in performance for
all jobs
◦ Not system failure
http://www.flaxit.com/ Ph: +1-720-463-3800
 Based on work done by Google in the early
2000s
◦ “The Google File System” in 2003
◦ “MapReduce: Simplified Data Processing on Large
Clusters” in 2004
 The core idea was to distribute the data as it
is initially stored
◦ Each node can then perform computation on the
data it stores without moving the data for the initial
processing
http://www.flaxit.com/ Ph: +1-720-463-3800
 Applications are written in a high-level
programming language
◦ No network programming or temporal dependency
 Nodes should communicate as little as
possible
◦ A “shared nothing” architecture
 Data is spread among the machines in
advance
◦ Perform computation where the data is already
stored as often as possible
http://www.flaxit.com/ Ph: +1-720-463-3800
 When data is loaded onto the system it is
divided into blocks
◦ Typically 64MB or 128MB
 Tasks are divided into two phases
◦ Map tasks which are done on small portions of data
where the data is stored
◦ Reduce tasks which combine data to produce the
final output
 A master program allocates work to
individual nodes
http://www.flaxit.com/ Ph: +1-720-463-3800
 Failures are detected by the master program
which reassigns the work to a different node
 Restarting a task does not affect the nodes
working on other portions of the data
 If a failed node restarts, it is added back to
the system and assigned new tasks
 The master can redundantly execute the
same task to avoid slow running nodes
http://www.flaxit.com/ Ph: +1-720-463-3800
 Responsible for storing data on the cluster
 Data files are split into blocks and distributed
across the nodes in the cluster
 Each block is replicated multiple times
http://www.flaxit.com/ Ph: +1-720-463-3800
 HDFS is a file system written in Java based on
the Google’s GFS
 Provides redundant storage for massive
amounts of data
http://www.flaxit.com/ Ph: +1-720-463-3800
 HDFS works best with a smaller number of
large files
◦ Millions as opposed to billions of files
◦ Typically 100MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files
and not random reads
http://www.flaxit.com/ Ph: +1-720-463-3800
 Files are split into blocks
 Blocks are split across many machines at load
time
◦ Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple
machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
http://www.flaxit.com/ Ph: +1-720-463-3800
 When a client wants to retrieve data
◦ Communicates with the NameNode to determine
which blocks make up a file and on which data
nodes those blocks are stored
◦ Then communicated directly with the data nodes to
read the data
http://www.flaxit.com/ Ph: +1-720-463-3800

More Related Content

What's hot

Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop project
Kamal A
 

What's hot (20)

Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoStorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3
 
Hadoop
Hadoop Hadoop
Hadoop
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop project
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoop
 
Hadoop training
Hadoop trainingHadoop training
Hadoop training
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Alluxio Use Cases and Future Directions
Alluxio Use Cases and Future DirectionsAlluxio Use Cases and Future Directions
Alluxio Use Cases and Future Directions
 
Anju
AnjuAnju
Anju
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 

Viewers also liked

Viewers also liked (17)

Final group presentation
Final group presentationFinal group presentation
Final group presentation
 
Project Management 2.0 - Da Wiki a Trello
Project Management 2.0 - Da Wiki a TrelloProject Management 2.0 - Da Wiki a Trello
Project Management 2.0 - Da Wiki a Trello
 
703 alqurashi i_book chapter
703 alqurashi  i_book chapter703 alqurashi  i_book chapter
703 alqurashi i_book chapter
 
Online learning and curriculum
Online learning and curriculumOnline learning and curriculum
Online learning and curriculum
 
Politropia e Web
Politropia e WebPolitropia e Web
Politropia e Web
 
Wikispaces to collaborate
Wikispaces to collaborateWikispaces to collaborate
Wikispaces to collaborate
 
Asian-Pacific Islands Women's Health Forum Talk
Asian-Pacific Islands Women's Health Forum TalkAsian-Pacific Islands Women's Health Forum Talk
Asian-Pacific Islands Women's Health Forum Talk
 
Gdit711 mooc test design
Gdit711 mooc test designGdit711 mooc test design
Gdit711 mooc test design
 
Musica on line e diritto d’autore
Musica on line e diritto d’autoreMusica on line e diritto d’autore
Musica on line e diritto d’autore
 
Un modello retorico per l’analisi delle piattaforme partecipative on-line
Un modello retorico per l’analisi delle piattaforme partecipative on-lineUn modello retorico per l’analisi delle piattaforme partecipative on-line
Un modello retorico per l’analisi delle piattaforme partecipative on-line
 
Gpsy 662 learning disabilities
Gpsy 662 learning disabilitiesGpsy 662 learning disabilities
Gpsy 662 learning disabilities
 
Mindful entrepreneur
Mindful entrepreneurMindful entrepreneur
Mindful entrepreneur
 
The Internet Explained To Your Mum In 5 Slides
The Internet Explained To Your Mum In 5 SlidesThe Internet Explained To Your Mum In 5 Slides
The Internet Explained To Your Mum In 5 Slides
 
Grev 721 enthography presentation
Grev 721 enthography presentationGrev 721 enthography presentation
Grev 721 enthography presentation
 
An Introduction to Agile Project Management
An Introduction to Agile Project ManagementAn Introduction to Agile Project Management
An Introduction to Agile Project Management
 
Gdit 726 education in a global society, south korea
Gdit 726 education in a global society, south koreaGdit 726 education in a global society, south korea
Gdit 726 education in a global society, south korea
 
Renewable Sources of Energy- Dynamo in Bicycle
Renewable Sources of Energy- Dynamo in BicycleRenewable Sources of Energy- Dynamo in Bicycle
Renewable Sources of Energy- Dynamo in Bicycle
 

Similar to Hadoop Online Training

Open stack summit-2015-dp
Open stack summit-2015-dpOpen stack summit-2015-dp
Open stack summit-2015-dp
Dirk Petersen
 
Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the Cloud
Alluxio, Inc.
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
Amrut Patil
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias
 

Similar to Hadoop Online Training (20)

Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Open stack summit-2015-dp
Open stack summit-2015-dpOpen stack summit-2015-dp
Open stack summit-2015-dp
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the Cloud
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
Slides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesSlides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data Lakes
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 

Recently uploaded (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 

Hadoop Online Training

  • 1.
  • 2.  Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and Mike Carafella in 2005.  Cutting named the program after his son’s toy elephant. http://www.flaxit.com/ Ph: +1-720-463-3800
  • 3.  Data-intensive text processing  Assembly of large genomes  Graph mining  Machine learning and data mining  Large scale social network analysis http://www.flaxit.com/ Ph: +1-720-463-3800
  • 4. •Contains Libraries and other modulesHadoop Common •Hadoop Distributed File SystemHDFS •Yet Another Resource NegotiatorHadoop YARN •A programming model for large scale data processing Hadoop MapReduce http://www.flaxit.com/ Ph: +1-720-463-3800
  • 5.  What were the limitations of earlier large- scale computing?  What requirements should an alternative approach have?  How does Hadoop address those requirements? http://www.flaxit.com/ Ph: +1-720-463-3800
  • 6.  Historically computation was processor- bound ◦ Data volume has been relatively small ◦ Complicated computations are performed on that data  Advances in computer technology has historically centered around improving the power of a single machine http://www.flaxit.com/ Ph: +1-720-463-3800
  • 7.  Typically divided into Data Nodes and Compute Nodes  At compute time, data is copied to the Compute Nodes  Fine for relatively small amounts of data  Modern systems deal with far more data than was gathering in the past http://www.flaxit.com/ Ph: +1-720-463-3800
  • 8.  Must support partial failure  Must be scalable http://www.flaxit.com/ Ph: +1-720-463-3800
  • 9.  Failure of a single component must not cause the failure of the entire system only a degradation of the application performance  Failure should not result in the loss of any data http://www.flaxit.com/ Ph: +1-720-463-3800
  • 10.  If a component fails, it should be able to recover without restarting the entire system  Component failure or recovery during a job must not affect the final output http://www.flaxit.com/ Ph: +1-720-463-3800
  • 11.  Increasing resources should increase load capacity  Increasing the load on the system should result in a graceful decline in performance for all jobs ◦ Not system failure http://www.flaxit.com/ Ph: +1-720-463-3800
  • 12.  Based on work done by Google in the early 2000s ◦ “The Google File System” in 2003 ◦ “MapReduce: Simplified Data Processing on Large Clusters” in 2004  The core idea was to distribute the data as it is initially stored ◦ Each node can then perform computation on the data it stores without moving the data for the initial processing http://www.flaxit.com/ Ph: +1-720-463-3800
  • 13.  Applications are written in a high-level programming language ◦ No network programming or temporal dependency  Nodes should communicate as little as possible ◦ A “shared nothing” architecture  Data is spread among the machines in advance ◦ Perform computation where the data is already stored as often as possible http://www.flaxit.com/ Ph: +1-720-463-3800
  • 14.  When data is loaded onto the system it is divided into blocks ◦ Typically 64MB or 128MB  Tasks are divided into two phases ◦ Map tasks which are done on small portions of data where the data is stored ◦ Reduce tasks which combine data to produce the final output  A master program allocates work to individual nodes http://www.flaxit.com/ Ph: +1-720-463-3800
  • 15.  Failures are detected by the master program which reassigns the work to a different node  Restarting a task does not affect the nodes working on other portions of the data  If a failed node restarts, it is added back to the system and assigned new tasks  The master can redundantly execute the same task to avoid slow running nodes http://www.flaxit.com/ Ph: +1-720-463-3800
  • 16.  Responsible for storing data on the cluster  Data files are split into blocks and distributed across the nodes in the cluster  Each block is replicated multiple times http://www.flaxit.com/ Ph: +1-720-463-3800
  • 17.  HDFS is a file system written in Java based on the Google’s GFS  Provides redundant storage for massive amounts of data http://www.flaxit.com/ Ph: +1-720-463-3800
  • 18.  HDFS works best with a smaller number of large files ◦ Millions as opposed to billions of files ◦ Typically 100MB or more per file  Files in HDFS are write once  Optimized for streaming reads of large files and not random reads http://www.flaxit.com/ Ph: +1-720-463-3800
  • 19.  Files are split into blocks  Blocks are split across many machines at load time ◦ Different blocks from the same file will be stored on different machines  Blocks are replicated across multiple machines  The NameNode keeps track of which blocks make up a file and where they are stored http://www.flaxit.com/ Ph: +1-720-463-3800
  • 20.  When a client wants to retrieve data ◦ Communicates with the NameNode to determine which blocks make up a file and on which data nodes those blocks are stored ◦ Then communicated directly with the data nodes to read the data http://www.flaxit.com/ Ph: +1-720-463-3800

Editor's Notes

  1. Example of map and reduce
  2. Default replication is 3-fold