SlideShare a Scribd company logo
1 of 30
unit-4
 What is hadoop
 Motivation of hadoop
 Hadoop distribution file system
 Map reduce
 Hadoop ecosystem
 Open source software framework designed
for storage and processing of large scale data
on clusters of commodity hardware
 Created by Doug Cutting and Mike Carafella
in 2005.
 Cutting named the program after his son’s
toy elephant.
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
•Contains Libraries and other modulesHadoop Common
•Hadoop Distributed File SystemHDFS
•Yet Another Resource NegotiatorHadoop YARN
•A programming model for large scale
data processing
Hadoop
MapReduce
 What were the limitations of earlier large-
scale computing?
 What requirements should an alternative
approach have?
 How does Hadoop address those
requirements?
 Historically computation was processor-
bound
◦ Data volume has been relatively small
◦ Complicated computations are performed on that
data
 Advances in computer technology has
historically centered around improving the
power of a single machine
 Power consumption limits the speed increase
we get from transistor density
 Allows developers to
use multiple
machines for a single
task
 Programming on a distributed system is
much more complex
◦ Synchronizing data exchanges
◦ Managing a finite bandwidth
◦ Controlling computation timing is complicated
“You know you have a distributed system when
the crash of a computer you’ve never
heard of stops you from getting any work
done.” –Leslie Lamport
 Distributed systems must be designed with
the expectation of failure
 Typically divided into Data Nodes and
Compute Nodes
 At compute time, data is copied to the
Compute Nodes
 Fine for relatively small amounts of data
 Modern systems deal with far more data than
was gathering in the past
 Facebook
◦ 500 TB per day
 Yahoo
◦ Over 170 PB
 eBay
◦ Over 6 PB
 Getting the data to the processors becomes
the bottleneck
 If a component fails, it should be able to
recover without restarting the entire system
 Component failure or recovery during a job
must not affect the final output
 Increasing resources should increase load
capacity
 Increasing the load on the system should
result in a graceful decline in performance for
all jobs
◦ Not system failure
 Based on work done by Google in the early
2000s
◦ “The Google File System” in 2003
◦ “MapReduce: Simplified Data Processing on Large
Clusters” in 2004
 The core idea was to distribute the data as it
is initially stored
◦ Each node can then perform computation on the
data it stores without moving the data for the initial
processing
 Applications are written in a high-level
programming language
◦ No network programming or temporal dependency
 Nodes should communicate as little as
possible
◦ A “shared nothing” architecture
 Data is spread among the machines in
advance
◦ Perform computation where the data is already
stored as often as possible
 HDFS is a file system written in Java based on
the Google’s GFS
 Provides redundant storage for massive
amounts of data
 HDFS works best with a smaller number of
large files
◦ Millions as opposed to billions of files
◦ Typically 100MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files
and not random reads
 Files are split into blocks
 Blocks are split across many machines at load
time
◦ Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple
machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
 Default replication is 3-fold
 When a client wants to retrieve data
◦ Communicates with the NameNode to determine
which blocks make up a file and on which data
nodes those blocks are stored
◦ Then communicated directly with the data nodes to
read the data
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
◦ Map
◦ Reduce
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for programmers
to use
 Node based flat
 Suitable for structured, unstructured data.
Supports variety of data formats in real time
such as XML, JSON, text based flat file
formats, etc.
 Analytical, big data processing
 Big data processing, which does not require
any consistent relationships between data.
 In a hadoop cluster, a node requires only a
processor, a network card, and few hard
drives.
 Key aspects of hadoop
 Hadoop components
 Hadoop conceptual layer
 High-level architecture of hadoop
 Open source software
 Frame work
 Distributed
 Massive storage
 Faster processing
 Hadoop core components
1.HDFS:
a)storage components.
b)distributes data across several nodes.
c)natively redundant.
2.Mapreduce:
a) computational framework.
b)splits a task across multiple nodes.
c)processes data in parallel.
 Hadoop ecosystem: hadoop ecosystem are
supports projects to enhance the
functionality of hadoop components. the eco
projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE

More Related Content

What's hot

Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBhavya Gulati
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)SahilRaina21
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology LandscapeShivanandaVSeeri
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL ServerMark Kromer
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 

What's hot (20)

Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Big data
Big dataBig data
Big data
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Big data
Big dataBig data
Big data
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Mongo db
Mongo dbMongo db
Mongo db
 

Similar to Big Data Unit 4 - Hadoop

Hadoop tutorial for beginners-tibacademy.in
Hadoop tutorial for beginners-tibacademy.inHadoop tutorial for beginners-tibacademy.in
Hadoop tutorial for beginners-tibacademy.inTIB Academy
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nageSantosh Nage
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyJay Nagar
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSpraveen bhat
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenmaharajothip1
 

Similar to Big Data Unit 4 - Hadoop (20)

Hadoop tutorial for beginners-tibacademy.in
Hadoop tutorial for beginners-tibacademy.inHadoop tutorial for beginners-tibacademy.in
Hadoop tutorial for beginners-tibacademy.in
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop Online Training
Hadoop Online TrainingHadoop Online Training
Hadoop Online Training
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 

Recently uploaded

How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17Celine George
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxUmeshTimilsina1
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 

Recently uploaded (20)

How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 

Big Data Unit 4 - Hadoop

  • 2.  What is hadoop  Motivation of hadoop  Hadoop distribution file system  Map reduce  Hadoop ecosystem
  • 3.  Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and Mike Carafella in 2005.  Cutting named the program after his son’s toy elephant.
  • 4.  Data-intensive text processing  Assembly of large genomes  Graph mining  Machine learning and data mining  Large scale social network analysis
  • 5.
  • 6. •Contains Libraries and other modulesHadoop Common •Hadoop Distributed File SystemHDFS •Yet Another Resource NegotiatorHadoop YARN •A programming model for large scale data processing Hadoop MapReduce
  • 7.  What were the limitations of earlier large- scale computing?  What requirements should an alternative approach have?  How does Hadoop address those requirements?
  • 8.  Historically computation was processor- bound ◦ Data volume has been relatively small ◦ Complicated computations are performed on that data  Advances in computer technology has historically centered around improving the power of a single machine
  • 9.  Power consumption limits the speed increase we get from transistor density
  • 10.  Allows developers to use multiple machines for a single task
  • 11.  Programming on a distributed system is much more complex ◦ Synchronizing data exchanges ◦ Managing a finite bandwidth ◦ Controlling computation timing is complicated
  • 12. “You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” –Leslie Lamport  Distributed systems must be designed with the expectation of failure
  • 13.  Typically divided into Data Nodes and Compute Nodes  At compute time, data is copied to the Compute Nodes  Fine for relatively small amounts of data  Modern systems deal with far more data than was gathering in the past
  • 14.  Facebook ◦ 500 TB per day  Yahoo ◦ Over 170 PB  eBay ◦ Over 6 PB  Getting the data to the processors becomes the bottleneck
  • 15.  If a component fails, it should be able to recover without restarting the entire system  Component failure or recovery during a job must not affect the final output
  • 16.  Increasing resources should increase load capacity  Increasing the load on the system should result in a graceful decline in performance for all jobs ◦ Not system failure
  • 17.  Based on work done by Google in the early 2000s ◦ “The Google File System” in 2003 ◦ “MapReduce: Simplified Data Processing on Large Clusters” in 2004  The core idea was to distribute the data as it is initially stored ◦ Each node can then perform computation on the data it stores without moving the data for the initial processing
  • 18.  Applications are written in a high-level programming language ◦ No network programming or temporal dependency  Nodes should communicate as little as possible ◦ A “shared nothing” architecture  Data is spread among the machines in advance ◦ Perform computation where the data is already stored as often as possible
  • 19.  HDFS is a file system written in Java based on the Google’s GFS  Provides redundant storage for massive amounts of data
  • 20.  HDFS works best with a smaller number of large files ◦ Millions as opposed to billions of files ◦ Typically 100MB or more per file  Files in HDFS are write once  Optimized for streaming reads of large files and not random reads
  • 21.  Files are split into blocks  Blocks are split across many machines at load time ◦ Different blocks from the same file will be stored on different machines  Blocks are replicated across multiple machines  The NameNode keeps track of which blocks make up a file and where they are stored
  • 23.  When a client wants to retrieve data ◦ Communicates with the NameNode to determine which blocks make up a file and on which data nodes those blocks are stored ◦ Then communicated directly with the data nodes to read the data
  • 24.  A method for distributing computation across multiple nodes  Each node processes the data that is stored at that node  Consists of two main phases ◦ Map ◦ Reduce
  • 25.  Automatic parallelization and distribution  Fault-Tolerance  Provides a clean abstraction for programmers to use
  • 26.  Node based flat  Suitable for structured, unstructured data. Supports variety of data formats in real time such as XML, JSON, text based flat file formats, etc.  Analytical, big data processing  Big data processing, which does not require any consistent relationships between data.  In a hadoop cluster, a node requires only a processor, a network card, and few hard drives.
  • 27.  Key aspects of hadoop  Hadoop components  Hadoop conceptual layer  High-level architecture of hadoop
  • 28.  Open source software  Frame work  Distributed  Massive storage  Faster processing
  • 29.  Hadoop core components 1.HDFS: a)storage components. b)distributes data across several nodes. c)natively redundant. 2.Mapreduce: a) computational framework. b)splits a task across multiple nodes. c)processes data in parallel.
  • 30.  Hadoop ecosystem: hadoop ecosystem are supports projects to enhance the functionality of hadoop components. the eco projects are as follows: 1. HIVE 2. PIG 3. SQOOP 4. HBASE 5. FLUME 6. OOZIE

Editor's Notes

  1. Example of failure issues. Linux lab is distributed file system, if the file server fails, what happens.
  2. Default replication is 3-fold