SlideShare a Scribd company logo
1 of 18
Knowledge ShareHadoop Yu Zhao Platform
Hadoop What is Hadoop The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop Includes MapReduce HDFS Hadoop Common
Hadoop Position APPLICATION HADOOP OS OS OS OS HOST HOST HOST HOST
MapReduce A simple programming model that applies to many large-scale computing problems Hide messy details in MapReduce runtime library: automatic parallelization load balancing network and disk transfer optimization handling of machine failures robustness
MapReduce Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize, filter, or transform Write the results
MapReduce Outline stays the same, map and reduce change to fit the problem More specifically… Programmer specifies two primary methods: map: (K1, V1) -> list(K2, V2)  reduce: (K2, list(V2)) -> list(K3, V3)
MapReduce Example- word count Counting the number of occurrences of each word in a large  collection of documents.  Key:“document1” Value:“to be or not to be” MAP Key	Value “to”	“1” “be” 	“1” “or”	“1” “not”	”1” “to”	”1” “be”	”1” Key	Value  “be”  	“1”“1” “not”	“1” “or”	“1” “to”	“1””1” Key	Value “be”	“2” “not”	“1” “or”	“1” “to”	“2” REDUCE SHUFFLE & SORT
MapReduce Example- word count Pseudo-code Map (String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, “1”); Reduce (String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result));
MapReduce Shuffle and Sort
HDFS Design  Very large files  Streaming data access  Commodity hardware Ignores  Low-latency data access  Lots of small files  Multiple writers, arbitrary file modifications
HDFS Concepts Blocks Namenodes and Datanodes
HDFS Network Topology and Hadoop Network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Distances compute Processes on the same node Different nodes on the same rack Nodes on different racks in the same data center Nodes in different data centers
HDFS Network Topology and Hadoop
HDFS Data Flow- read
HDFS Data Flow- write
Setup Required Software JavaTM 1.6.x, preferably from Sun, must be installed. ssh Cygwin - Required on windows for shell support in addition to the required software above.
Setup Configure Setup passphraselessssh Configuration Files 	core-site.xml 	hdfs-site.xml  	mapred-site.xml 	masters 	slaves
References Hadoop: The Definitive Guide, Tom White MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean

More Related Content

What's hot

Cred_hadoop_presenatation
Cred_hadoop_presenatationCred_hadoop_presenatation
Cred_hadoop_presenatation
Ashish Saraf
 

What's hot (20)

Cred_hadoop_presenatation
Cred_hadoop_presenatationCred_hadoop_presenatation
Cred_hadoop_presenatation
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop map reduce
Hadoop map reduceHadoop map reduce
Hadoop map reduce
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop
Hadoop Hadoop
Hadoop
 

Viewers also liked (9)

El agua si
El agua siEl agua si
El agua si
 
Git使用
Git使用Git使用
Git使用
 
云计算
云计算云计算
云计算
 
沙龙升级
沙龙升级沙龙升级
沙龙升级
 
Avometer
AvometerAvometer
Avometer
 
Fcc narrow banding mandate for two way radios - by bearcom
Fcc narrow banding mandate for two way radios - by bearcomFcc narrow banding mandate for two way radios - by bearcom
Fcc narrow banding mandate for two way radios - by bearcom
 
Shell奇技淫巧
Shell奇技淫巧Shell奇技淫巧
Shell奇技淫巧
 
Ccih 2014-images-for-health-ellen-vorder-bruegge
Ccih 2014-images-for-health-ellen-vorder-brueggeCcih 2014-images-for-health-ellen-vorder-bruegge
Ccih 2014-images-for-health-ellen-vorder-bruegge
 
Stay Up To Date on the Latest Happenings in the Boardroom: Recommended Summer...
Stay Up To Date on the Latest Happenings in the Boardroom: Recommended Summer...Stay Up To Date on the Latest Happenings in the Boardroom: Recommended Summer...
Stay Up To Date on the Latest Happenings in the Boardroom: Recommended Summer...
 

Similar to Hadoop

Similar to Hadoop (20)

Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop vs Apache Spark
Hadoop vs Apache SparkHadoop vs Apache Spark
Hadoop vs Apache Spark
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hadoop, Map Reduce and Apache Pig tutorial
Hadoop, Map Reduce and Apache Pig tutorialHadoop, Map Reduce and Apache Pig tutorial
Hadoop, Map Reduce and Apache Pig tutorial
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Hadoop

  • 1. Knowledge ShareHadoop Yu Zhao Platform
  • 2. Hadoop What is Hadoop The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop Includes MapReduce HDFS Hadoop Common
  • 3. Hadoop Position APPLICATION HADOOP OS OS OS OS HOST HOST HOST HOST
  • 4. MapReduce A simple programming model that applies to many large-scale computing problems Hide messy details in MapReduce runtime library: automatic parallelization load balancing network and disk transfer optimization handling of machine failures robustness
  • 5. MapReduce Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize, filter, or transform Write the results
  • 6. MapReduce Outline stays the same, map and reduce change to fit the problem More specifically… Programmer specifies two primary methods: map: (K1, V1) -> list(K2, V2) reduce: (K2, list(V2)) -> list(K3, V3)
  • 7. MapReduce Example- word count Counting the number of occurrences of each word in a large collection of documents. Key:“document1” Value:“to be or not to be” MAP Key Value “to” “1” “be” “1” “or” “1” “not” ”1” “to” ”1” “be” ”1” Key Value “be” “1”“1” “not” “1” “or” “1” “to” “1””1” Key Value “be” “2” “not” “1” “or” “1” “to” “2” REDUCE SHUFFLE & SORT
  • 8. MapReduce Example- word count Pseudo-code Map (String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, “1”); Reduce (String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result));
  • 10. HDFS Design Very large files Streaming data access Commodity hardware Ignores Low-latency data access Lots of small files Multiple writers, arbitrary file modifications
  • 11. HDFS Concepts Blocks Namenodes and Datanodes
  • 12. HDFS Network Topology and Hadoop Network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Distances compute Processes on the same node Different nodes on the same rack Nodes on different racks in the same data center Nodes in different data centers
  • 13. HDFS Network Topology and Hadoop
  • 16. Setup Required Software JavaTM 1.6.x, preferably from Sun, must be installed. ssh Cygwin - Required on windows for shell support in addition to the required software above.
  • 17. Setup Configure Setup passphraselessssh Configuration Files core-site.xml hdfs-site.xml  mapred-site.xml masters slaves
  • 18. References Hadoop: The Definitive Guide, Tom White MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean