SlideShare a Scribd company logo
MapReduce and HDFS

This presentation includes course content © University of Washington
 Redistributed under the Creative Commons Attribution 3.0 license.
                         All other contents:
                       © 2009 Cloudera, Inc.
Overview
• Why MapReduce?
• What is MapReduce?
• The Hadoop Distributed File System




                © 2009 Cloudera, Inc.
How MapReduce is Structured
• Functional programming meets distributed
  computing
• A batch data processing system
• Factors out many reliability concerns from
  application logic




                 © 2009 Cloudera, Inc.
MapReduce Provides:
•   Automatic parallelization & distribution
•   Fault-tolerance
•   Status and monitoring tools
•   A clean abstraction for programmers




                    © 2009 Cloudera, Inc.
Programming Model
• Borrows from functional programming
• Users implement interface of two
  functions:
  – map (in_key, in_value) ->
     (out_key, intermediate_value) list

  – reduce (out_key, intermediate_value list) ->
     out_value list




                   © 2009 Cloudera, Inc.
map
• Records from the data source (lines out of
  files, rows of a database, etc) are fed into
  the map function as key*value pairs: e.g.,
  (filename, line).
• map() produces one or more intermediate
  values along with an output key from the
  input.


                  © 2009 Cloudera, Inc.
map
map (in_key, in_value) ->
  (out_key, intermediate_value) list




              © 2009 Cloudera, Inc.
Example: Upper-case Mapper
let map(k, v) =
  emit(k.toUpper(), v.toUpper())

(“foo”, “bar”)      (“FOO”, “BAR”)
(“Foo”, “other”)      (“FOO”, “OTHER”)
(“key2”, “data”)      (“KEY2”, “DATA”)




                   © 2009 Cloudera, Inc.
Example: Explode Mapper
let map(k, v) =
  foreach char c in v:
    emit(k, c)

(“A”, “cats”)    (“A”, “c”), (“A”, “a”),
                 (“A”, “t”), (“A”, “s”)

(“B”, “hi”)     (“B”, “h”), (“B”, “i”)




                  © 2009 Cloudera, Inc.
Example: Filter Mapper
let map(k, v) =
  if (isPrime(v)) then emit(k, v)

(“foo”, 7)     (“foo”, 7)
(“test”, 10)     (nothing)




                  © 2009 Cloudera, Inc.
Example: Changing Keyspaces
let map(k, v) = emit(v.length(), v)

(“hi”, “test”)    (4, “test”)
(“x”, “quux”)    (4, “quux”)
(“y”, “abracadabra”)    (10, “abracadabra”)




                 © 2009 Cloudera, Inc.
reduce
• After the map phase is over, all the
  intermediate values for a given output key
  are combined together into a list
• reduce() combines those intermediate
  values into one or more final values for
  that same output key
• (in practice, usually only one final value
  per key)

                 © 2009 Cloudera, Inc.
reduce
reduce (out_key, intermediate_value list) ->
         out_value list




                  © 2009 Cloudera, Inc.
Example: Sum Reducer
let reduce(k, vals) =
  sum = 0
  foreach int v in vals:
    sum += v
  emit(k, sum)

(“A”, [42, 100, 312])    (“A”, 454)
(“B”, [12, 6, -2])    (“B”, 16)




                 © 2009 Cloudera, Inc.
Example: Identity Reducer
let reduce(k, vals) =
  foreach v in vals:
    emit(k, v)

(“A”, [42, 100, 312])    (“A”, 42),
                     (“A”, 100), (“A”, 312)

(“B”, [12, 6, -2])        (“B”, 12), (“B”, 6),
                         (“B”, -2)


                 © 2009 Cloudera, Inc.
© 2009 Cloudera, Inc.
Parallelism
• map() functions run in parallel, creating
  different intermediate values from different
  input data sets
• reduce() functions also run in parallel,
  each working on a different output key
• All values are processed independently
• Bottleneck: reduce phase can’t start until
  map phase is completely finished.


                  © 2009 Cloudera, Inc.
Example: Count word occurrences
map(String input_key, String input_value):
 // input_key: document name
 // input_value: document contents
 for each word w in input_value:
   emit(w, 1);

reduce(String output_key, Iterator<int>
  intermediate_values):
 // output_key: a word
 // output_values: a list of counts
 int result = 0;
 for each v in intermediate_values:
   result += v;
 emit(output_key, result);
                   © 2009 Cloudera, Inc.
Combining Phase
• Run on mapper nodes after map phase
• “Mini-reduce,” only on local map output
• Used to save bandwidth before sending
  data to full reducer
• Reducer can be combiner if commutative
  & associative
  – e.g., SumReducer


                 © 2009 Cloudera, Inc.
Combiner, graphically




       © 2009 Cloudera, Inc.
WordCount Redux
map(String input_key, String input_value):
 // input_key: document name
 // input_value: document contents
 for each word w in input_value:
   emit(w, 1);

reduce(String output_key, Iterator<int>
  intermediate_values):
 // output_key: a word
 // output_values: a list of counts
 int result = 0;
 for each v in intermediate_values:
   result += v;
 emit(output_key, result);
                   © 2009 Cloudera, Inc.
MapReduce Conclusions
• MapReduce has proven to be a useful
  abstraction in many areas
• Greatly simplifies large-scale computations
• Functional programming paradigm can be
  applied to large-scale applications
• You focus on the “real” problem, library deals
  with messy details




                    © 2009 Cloudera, Inc.
HDFS



Some slides designed by Alex Moschuk, University of Washington
Redistributed under the Creative Commons Attribution 3.0 license
HDFS: Motivation

• Based on Google’s GFS
• Redundant storage of massive amounts of
  data on cheap and unreliable computers
• Why not use an existing file system?
  – Different workload and design priorities
  – Handles much bigger dataset sizes than other
    filesystems


                  © 2009 Cloudera, Inc.
Assumptions
• High component failure rates
   – Inexpensive commodity components fail all the
     time
• “Modest” number of HUGE files
   – Just a few million
   – Each is 100MB or larger; multi-GB files typical
• Files are write-once, mostly appended to
   – Perhaps concurrently
• Large streaming reads
• High sustained throughput favored over low latency
                    © 2009 Cloudera, Inc.
HDFS Design Decisions

• Files stored as blocks
   – Much larger size than most filesystems (default is 64MB)
• Reliability through replication
   – Each block replicated across 3+ DataNodes
• Single master (NameNode) coordinates access, metadata
   – Simple centralized management
• No data caching
   – Little benefit due to large data sets, streaming reads
• Familiar interface, but customize the API
   – Simplify the problem; focus on distributed apps

                        © 2009 Cloudera, Inc.
HDFS Client Block Diagram
          $   "


                                          "




    ###                 !    !




                  © 2009 Cloudera, Inc.
Based on GFS Architecture




            Figure from “The Google File System,”
            Ghemawat et. al., SOSP 2003

         © 2009 Cloudera, Inc.
Metadata

• Single NameNode stores all metadata
  – Filenames, locations on DataNodes of each file
• Maintained entirely in RAM for fast lookup

• DataNodes store opaque file contents in
  “block” objects on underlying local filesystem




                    © 2009 Cloudera, Inc.
HDFS Conclusions

• HDFS supports large-scale processing
  workloads on commodity hardware
  – designed to tolerate frequent component failures
  – optimized for huge files that are mostly appended and
    read
  – filesystem interface is customized for the job, but still
    retains familiarity for developers
  – simple solutions can work (e.g., single master)


• Reliably stores several TB in individual clusters
                       © 2009 Cloudera, Inc.
2 Map Reduce And Hdfs

More Related Content

What's hot

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
Andrea Iacono
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
Emilio Coppa
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
Prashant Gupta
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Vigen Sahakyan
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
Subhas Kumar Ghosh
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
Will Du
 
Hive
HiveHive
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
Chirag Ahuja
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
Mohamed Elsaka
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
Chirag Ahuja
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
npinto
 

What's hot (20)

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
01 hbase
01 hbase01 hbase
01 hbase
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Hive
HiveHive
Hive
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 

Viewers also liked

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Adam Kawa
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Uwe Printz
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
James Serra
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Kevin Weil
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 

Viewers also liked (6)

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Similar to 2 Map Reduce And Hdfs

Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
IIIT-H
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
Santiago Coffey
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
Svetlin Nakov
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
DataWorks Summit
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
Collin Bennett
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
Madhur Nawandar
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
harithakannan
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Data Con LA
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Skills Matter
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystemAndrew Brust
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
Cloudera Japan
 

Similar to 2 Map Reduce And Hdfs (20)

Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
 
Spark etl
Spark etlSpark etl
Spark etl
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
 

Recently uploaded

Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 

Recently uploaded (20)

Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 

2 Map Reduce And Hdfs

  • 1. MapReduce and HDFS This presentation includes course content © University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: © 2009 Cloudera, Inc.
  • 2. Overview • Why MapReduce? • What is MapReduce? • The Hadoop Distributed File System © 2009 Cloudera, Inc.
  • 3. How MapReduce is Structured • Functional programming meets distributed computing • A batch data processing system • Factors out many reliability concerns from application logic © 2009 Cloudera, Inc.
  • 4. MapReduce Provides: • Automatic parallelization & distribution • Fault-tolerance • Status and monitoring tools • A clean abstraction for programmers © 2009 Cloudera, Inc.
  • 5. Programming Model • Borrows from functional programming • Users implement interface of two functions: – map (in_key, in_value) -> (out_key, intermediate_value) list – reduce (out_key, intermediate_value list) -> out_value list © 2009 Cloudera, Inc.
  • 6. map • Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input. © 2009 Cloudera, Inc.
  • 7. map map (in_key, in_value) -> (out_key, intermediate_value) list © 2009 Cloudera, Inc.
  • 8. Example: Upper-case Mapper let map(k, v) = emit(k.toUpper(), v.toUpper()) (“foo”, “bar”) (“FOO”, “BAR”) (“Foo”, “other”) (“FOO”, “OTHER”) (“key2”, “data”) (“KEY2”, “DATA”) © 2009 Cloudera, Inc.
  • 9. Example: Explode Mapper let map(k, v) = foreach char c in v: emit(k, c) (“A”, “cats”) (“A”, “c”), (“A”, “a”), (“A”, “t”), (“A”, “s”) (“B”, “hi”) (“B”, “h”), (“B”, “i”) © 2009 Cloudera, Inc.
  • 10. Example: Filter Mapper let map(k, v) = if (isPrime(v)) then emit(k, v) (“foo”, 7) (“foo”, 7) (“test”, 10) (nothing) © 2009 Cloudera, Inc.
  • 11. Example: Changing Keyspaces let map(k, v) = emit(v.length(), v) (“hi”, “test”) (4, “test”) (“x”, “quux”) (4, “quux”) (“y”, “abracadabra”) (10, “abracadabra”) © 2009 Cloudera, Inc.
  • 12. reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key • (in practice, usually only one final value per key) © 2009 Cloudera, Inc.
  • 13. reduce reduce (out_key, intermediate_value list) -> out_value list © 2009 Cloudera, Inc.
  • 14. Example: Sum Reducer let reduce(k, vals) = sum = 0 foreach int v in vals: sum += v emit(k, sum) (“A”, [42, 100, 312]) (“A”, 454) (“B”, [12, 6, -2]) (“B”, 16) © 2009 Cloudera, Inc.
  • 15. Example: Identity Reducer let reduce(k, vals) = foreach v in vals: emit(k, v) (“A”, [42, 100, 312]) (“A”, 42), (“A”, 100), (“A”, 312) (“B”, [12, 6, -2]) (“B”, 12), (“B”, 6), (“B”, -2) © 2009 Cloudera, Inc.
  • 17. Parallelism • map() functions run in parallel, creating different intermediate values from different input data sets • reduce() functions also run in parallel, each working on a different output key • All values are processed independently • Bottleneck: reduce phase can’t start until map phase is completely finished. © 2009 Cloudera, Inc.
  • 18. Example: Count word occurrences map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: emit(w, 1); reduce(String output_key, Iterator<int> intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += v; emit(output_key, result); © 2009 Cloudera, Inc.
  • 19. Combining Phase • Run on mapper nodes after map phase • “Mini-reduce,” only on local map output • Used to save bandwidth before sending data to full reducer • Reducer can be combiner if commutative & associative – e.g., SumReducer © 2009 Cloudera, Inc.
  • 20. Combiner, graphically © 2009 Cloudera, Inc.
  • 21. WordCount Redux map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: emit(w, 1); reduce(String output_key, Iterator<int> intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += v; emit(output_key, result); © 2009 Cloudera, Inc.
  • 22. MapReduce Conclusions • MapReduce has proven to be a useful abstraction in many areas • Greatly simplifies large-scale computations • Functional programming paradigm can be applied to large-scale applications • You focus on the “real” problem, library deals with messy details © 2009 Cloudera, Inc.
  • 23. HDFS Some slides designed by Alex Moschuk, University of Washington Redistributed under the Creative Commons Attribution 3.0 license
  • 24. HDFS: Motivation • Based on Google’s GFS • Redundant storage of massive amounts of data on cheap and unreliable computers • Why not use an existing file system? – Different workload and design priorities – Handles much bigger dataset sizes than other filesystems © 2009 Cloudera, Inc.
  • 25. Assumptions • High component failure rates – Inexpensive commodity components fail all the time • “Modest” number of HUGE files – Just a few million – Each is 100MB or larger; multi-GB files typical • Files are write-once, mostly appended to – Perhaps concurrently • Large streaming reads • High sustained throughput favored over low latency © 2009 Cloudera, Inc.
  • 26. HDFS Design Decisions • Files stored as blocks – Much larger size than most filesystems (default is 64MB) • Reliability through replication – Each block replicated across 3+ DataNodes • Single master (NameNode) coordinates access, metadata – Simple centralized management • No data caching – Little benefit due to large data sets, streaming reads • Familiar interface, but customize the API – Simplify the problem; focus on distributed apps © 2009 Cloudera, Inc.
  • 27. HDFS Client Block Diagram $ " " ### ! ! © 2009 Cloudera, Inc.
  • 28. Based on GFS Architecture Figure from “The Google File System,” Ghemawat et. al., SOSP 2003 © 2009 Cloudera, Inc.
  • 29. Metadata • Single NameNode stores all metadata – Filenames, locations on DataNodes of each file • Maintained entirely in RAM for fast lookup • DataNodes store opaque file contents in “block” objects on underlying local filesystem © 2009 Cloudera, Inc.
  • 30. HDFS Conclusions • HDFS supports large-scale processing workloads on commodity hardware – designed to tolerate frequent component failures – optimized for huge files that are mostly appended and read – filesystem interface is customized for the job, but still retains familiarity for developers – simple solutions can work (e.g., single master) • Reliably stores several TB in individual clusters © 2009 Cloudera, Inc.