SlideShare a Scribd company logo
Chapter 12
Big Data with
Map / Reduce
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• MapReduce
• Apache Hadoop
• Apache Spark
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Repetition: Definition of Big Data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
What are the innovative
forms of information
processing?
Parallelism is Mandatory
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
In pioneer days they used oxen for heavy pulling, and when
one ox couldn't budge a log, they didn't try to grow a larger ox.
We shouldn't be trying for bigger computers, but for more
systems of computers
– Grace Hopper
Parallel Programming Models
• Message Passing
• Independent tasks on local data
• Tasks interact by exchanging messages
• Shared memory
• Tasks share common address space
• Tasks interact by reading/writing in this space
• Data parallelization
• Tasks execute independent operations on partitions of data
• Well suited for problems that are “embarrassingly parallel”
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Traditional Infrastructures
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data storage
…
Compute cluster
Result / Insight
Database or Storage
Area Network (SAN)
Compute Nodes
Message Passing / Shared Memory
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data storage
…
Compute cluster
Result / Insight
Each node may load complete
data!
 Does not scale
Data Parallelization
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data storage
…
Compute cluster
Result / Insight
Each node only loads partition
 Scales better
 Still requires transfering all
data over the network
Data Locality as Solution
• Not supported by traditional clusters / infrastructures
• Clusters for sharing CPUs, not storage
• Storage made for IO, not computations
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
If moving data is the problem,
stop moving the data!
Core concept of Big Data Technologies
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Parallelization Data Locality
Compute Cluster with Distributed Storage
…
Our examples:
Outline
• Overview
• MapReduce
• Apache Hadoop
• Apache Spark
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
MapReduce
• Programming model for data parallelization
• Published in 2004 by Google
• map() and reduce() functions for data processing
• Based on transformations of key-value pairs
• shuffle() function for arranging intermediate results
• Distribution via master/worker paradigm
• Supports high availability / recoverability
• Discussed together with hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing
on large clusters." Communications of the ACM, 51.1 (2008): 107-113.
Overview of MapReduce
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Value
map() shuffle()
Key
reduce()
Initial pairs
<key1,value1>
Intermediate pairs
<key2,value2>
Pairs grouped
by key2
Results by
key2
The map() Function
• Concept from functional programming
• Applies a function to every item in the input separately
• map(fun, <key1,value1>)  list(<key2, value2>)
• Functions are usually user-defined
• Input keys and output keys can be different
• Also different types
• Output is a list, i.e., one mapping can have multiple outputs
• All list elements must have same types
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data parallelization
trivial
In the initial MapReduce implementation, all
keys and values were strings, users where
expected to convert the types if required as
part of the map/reduce functions
The shuffle() Function
• Organize data by key
• shuffle(list(<key2,value2>))  list(<key2, list(value2)>)
• Often includes sorting by key for efficiency
• Shuffle does not wait for map() to finish
• Once a <key2,value2> is available it can be shuffled
• Reduces waiting times
• Provided by the MapReduce framework
• Can be overridden by users to optimize for use case
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
The reduce() Function
• Related to fold from functional programming
• Aggregates <key2,value2> pairs with the same key
• Single value per key
• Results in a list of values, one for each key
• reduce(fun, list(<key2,list(value2)>))  list(value3)
• Functions are usually user defined
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Parallelization with MapReduce
• Input can be read in chunks
• Parallelism for creation of initial key-value pairs
• map() can be computed for each key-value pair independently
• Parallelism potential only limited by amount of data
• shuffle() can start working as soon as first key-value pair is
processed
• Limits waiting times
• reduce() can run in parallel for different keys
• Need not wait for map to complete
• Can already start when all results for a key are available
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Word Counts with MapReduce
• Our Data:
• Initial <key1,value1> pairs:
• <sentence1, “what is your name”>
• <sentence2, “the name is bond james bond”>
• map() function: emit <word,1> for each word in a sentence
• <sentence1, “what is your name”>
 <“what”,1>, <“is”,1>,<“your”,1>,<“name”,1>
• <sentence2, “the name is bond james bond”>
 <“the”, 1>,<“name”,1>,<“is”,1>,<“bond”,1>,<“james”,1>,<“bond”,1>
• reduce() function: key concatenated with sum of values
• <“what”, list(1)>  “what: 1”
• <“bond”, list(1,1)> ”bond: 2”
• …
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
This is the “Hello
World” for
MapReduce
What is your name?
The name is Bond, James Bond.
Outline
• Overview
• MapReduce
• Apache Hadoop
• Apache Spark
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Overview of Hadoop
• Open-source implementation of MapReduce
• Supported by all major cloud providers
• Used by many large companies, e.g., Twitter, Facebook, Amazon, …
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
HDFS
YARN
Hadoop
MapReduce
Others
Data Processing
Cluster resource management
Distributed file system
Hadoop Distributed File System (HDFS)
• Core component of Hadoop
• Goals of HDFS
• High throughput instead of low latency
• Support for large files and data sets
• Moving computation instead of moving data ( Data locality)
• Resiliency against hardware failures
• Uses a master/slave architecture
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Overview of HDFS
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
NameNode
DataNodes
Users • Access point for clients
• Exposes file system operations
• Organizes block creation / deletion /
replication of DataNodes
• Can have secondary NameNode to
avoid single point of failure
• Stores blocks of data
• Serves read/write requests
• Perform computations on
blocks
Example: Write File
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
NameNode
DataNodes
Users
1. Create file
2. Data stream for writing
7. Close and complete
4. Create replica
5. Acknowledge
File
Blocks created
by data stream Replication level 2
 Each block twice
Computing with Hadoop
• HDFS „only“ distributed block storage
• Each DataNode should also serve as compute node
• Map / Reduce / Shuffle tasks
• Ideally also general compute tasks
• Each resource should only be used by one task
• CPU cores
• Memory
 No overutilization
• Resources should be used as efficiently as possible
 No underutilization
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
YARN for Resource Management
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
Users
NodeManager NodeManager
• Resource scheduler
• Manages resources for different
applications
• Runs on DataNodes
• Gets tasks from Resource Manager
• Executes tasks on local resources
Running Applications with YARN
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
Users
NodeManager NodeManager
Application
1. Submit
2. Launch
Application
Master
5. Provide
resources
Application
Master
3. Request resources
Application
Application
6. Launch application in container
• Called Container
• Allocated by Resource Manager
• Started by Application Master
4.1. Allocate container 4.2. Allocate container
Resource Requests
• Send from Application Master to Resource Manager
• Name of the resource
• Can be used to select specific hosts or racks
• Priority
• Only within the application
• Allows application to have an internal scheduler
• Resource requirements
• Memory
• CPU cores
• Number of containers
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
4. Provide
Resources
Application
Master
3. Request
resources
Launching Containers
• Executes parts of the application
• For example, map, reduce, or shuffle tasks
• Requires commands to start application
• Environment configuration
• E.g., environment variables of application call
• Can access local resources
• Binaries, HDFS files/blocks
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
NodeManager NodeManager
Application
Master
Application
5. Launch application in container
MapReduce with Hadoop
• Implementation of MapReduce on top of YARN
• Users define applications as sequences of map/reduce tasks
• Hadoop specifies as MRAppMaster YARN container
• MRAppMaster manages execution of tasks
• Two execution modes
• Java applications
• Streaming mode
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Java Applications
• MapReduce applications defined by Jobs programmatically
• Job class used to
• Specify input
• Register mapper
• Register reducer
• Specify output
• Mapper and reducer tasks defined by extending classes
• Compiled Jar is submitted to resource manager for execution
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Example for a Mapper
• Hadoop Mapper for Word Count example
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Type of input key
Type of input value Type of output key
Type of output value
Example for a Reducer
• Hadoop Reducer for the Word Count example
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Type of input key
Type of input value Type of output key
Type of output value
Example for a Job definition
• Hadoop Job for the Word Count example
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Reads file line by line
Execution of Word Count with Hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
DataNode
Users
NodeManager
DataNode
NodeManager
DataNode
WordCount.jar
1. Submit
2. Launch
Application
Master
MRAppMaster
Execution of Word Count with Hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
DataNode
Users
NodeManager
DataNode
NodeManager
DataNode
5. Provide
resources
MRAppMaster
3. Request resources
for Map tasks
WordCount
Map Task
WordCount
Map Task
6. Launch Map tasks in container
4.1. Allocate container for map 4.2. Allocate container for map
7. Compute intermediary results
7. Compute intermediary results
Execution of Word Count with Hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
DataNode
Users
NodeManager
DataNode
NodeManager
DataNode
MRAppMaster
8. Report tasks
as finished
9.1. Free container for map 9.2. Free container for map
Execution of Word Count with Hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
DataNode
Users
NodeManager
DataNode
NodeManager
DataNode
12. Provide
resources
MRAppMaster
10. Request resources
for Reduce tasks
WordCount
Reduce Task
13. Launch Reduce tasks in container
11. Allocate container for reduce
14.2. Shuffle intermediary data
14.1. Shuffle intermediary data
15. Write output
Shuffle is an automated service
running in the NodeManager
Execution of Word Count with Hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
Node Manager
DataNode
Users
Node Manager
DataNode
Node Manager
DataNode
MRAppMaster
16. Report job
as finished
Execution of Word Count with Hadoop
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Resource
Manager
NodeManager
DataNode
Users
NodeManager
DataNode
NodeManager
DataNode
17.1. Free container for reduce
17.1. Free
App.
Master
Streaming Mode of Hadoop
• Provided Java Application that implements Hadoop MapReduce
• Input and output by line-wise file processing
• Same as the handler we showed in the word count example
• Formats can be defined using command line parameters
• Mapper/Reducer can be any executable application
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Java Application that implements
Hadoop MapReduce
Input and output locations
Executables used as
mapper/reducer
Copies local files to compute
nodes
Word Count with Python
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Additional Important Parts of Hadoop
• Combiner
• Reducer function that runs locally before shuffling the data to another
node
• Often same as reducer, but not always
• Requires functions to be chainable / idempotent
• Can reduce network traffic
• Example:
• Word count reducer can also be used as combiner
• The shuffled pairs would not all have a count of 1 anymore, but the word counts
of the data of that node
• MapReduce Job History Server
• Collects information about the history of MapReduce jobs
• Log files, start/end times, job state
• Can be used by users to see status ob their jobs
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Limitations of Hadoop
• Multiple Map and/or Reduce steps require multiple jobs
• Can still be defined in a single Java file
• Output of one job can be input of another job
• Jobs can have dependencies, e.g., one waiting for the completion of
other jobs
• Dependencies must be modeled by the programmer
• All communication between jobs via the file system
• Bad for multiple computations on the same data, e.g., chained map
functions
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• MapReduce
• Apache Hadoop
• Apache Spark
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Apache Spark
• Engine for large-scale data processing
• Designed to resolve limitations of Hadoop for data analysis
• Supports in-memory analysis
• Good support for iterative algorithms
• Supports arbitrary combinations of Map and Reduce tasks
• Users do not need to care about specific jobs and their dependencies
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Spark Stack (I)
• SparkSQL for SQL-like queries and data frame generation
• Spark Streaming for live processing of streaming data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Spark Stack (II)
• MLlib for machine learning algorithms
• > 20 algorithms for clustering, regression, and classification
• GraphX for graph algorithms
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data Structures used by Spark
• Not a file system like Hadoop, instead two important in-memory
data structures
• Resilient Distributed Dataset (RDD)
• Abstraction layer for data operations
• Immutable partitions of elements
• All elements in a RDD can be processed in parallel
• Support map, reduce, filtering, user-defined functions, and persistence
• Data frame
• Higher abstraction built on RDDs
• Similar to R / pandas data frames
• Usually generated using the SparkSQL API
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Infrastructures to Execute Apache Spark
• Spark allows setup of clusters for computing
• Not the standard way to use Spark
• Compatible with many existing infrastructures instead
• Computing
• Hadoop/YARN and Apache Mesos
• Storage
• Hadoop/HDFS, Cassandra, HBase, MongoDB, …
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Programming with Apache Spark
• Natively implemented in Scala
• JVM language with type inference and functional programming
concepts
• Also provides APIs for
• Java
• Python (PySpark)
• R (SparkR)
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Word Count with PySpark
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
flatMap can map input to multiple outputs
Python lambda functions: anonymous
function with parameters a,b that returns
the result of the computation after the
colon, i.e., a+b
reduceByKey reduces the data set by
merging (i.e. reducing) key-value pairs
with the same key
Two major differences to Hadoop
• In-memory
• RDDs and data frames are handled in-memory if possible
• Vast speed-up
• Example: Logistic Regression
• Does not provide own distributed storage back-end
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Spark Ecosystem
• Different execution modes
• Large open source community
• Many technologies on top of Spark
• https://spark-packages.org/
• Still rapidly developing
• Core concepts stable
• 2.4.0 release this fall
• 3.0.0 release planned for next year
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• MapReduce
• Apache Hadoop
• Apache Spark
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Summary
• Big data processing requires dedicated infrastructures
• Moving data not feasible
• Move computation to data
• MapReduce as programming model for big data processing
• Apache Hadoop as big data framework
• HDFS distributed and reliable file system
• YARN for resource management
• Provides a MapReduce framework
• Apache Spark for in-memory computations with big data
• Compatible with different infrastructures, including Hadoop
• Provides API for machine learning
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science

More Related Content

Similar to 12-BigDataMapReduce.pptx

An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Josh Devins
 
Spark!
Spark!Spark!
Hadoop
HadoopHadoop
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
Vakul Vankadaru
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
Mindgrub Technologies
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
Talentica Software
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Hadoop
HadoopHadoop

Similar to 12-BigDataMapReduce.pptx (20)

An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Spark!
Spark!Spark!
Spark!
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
 
Bw tech hadoop
Bw tech hadoopBw tech hadoop
Bw tech hadoop
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Hadoop
HadoopHadoop
Hadoop
 

More from Shree Shree

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
Shree Shree
 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
Shree Shree
 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
Shree Shree
 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
Shree Shree
 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
Shree Shree
 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
Shree Shree
 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
Shree Shree
 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
Shree Shree
 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
Shree Shree
 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
Shree Shree
 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
Shree Shree
 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
Shree Shree
 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
Shree Shree
 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
Shree Shree
 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
Shree Shree
 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
Shree Shree
 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
Shree Shree
 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
Shree Shree
 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
Shree Shree
 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
Shree Shree
 

More from Shree Shree (20)

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
 

Recently uploaded

Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
The Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptxThe Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptx
DhatriParmar
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
Kartik Tiwari
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 

Recently uploaded (20)

Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
The Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptxThe Diamond Necklace by Guy De Maupassant.pptx
The Diamond Necklace by Guy De Maupassant.pptx
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 

12-BigDataMapReduce.pptx

  • 1. Chapter 12 Big Data with Map / Reduce Dr. Steffen Herbold herbold@cs.uni-goettingen.de Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 2. Outline • Overview • MapReduce • Apache Hadoop • Apache Spark • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 3. Repetition: Definition of Big Data Introduction to Data Science https://sherbold.github.io/intro-to-data-science What are the innovative forms of information processing?
  • 4. Parallelism is Mandatory Introduction to Data Science https://sherbold.github.io/intro-to-data-science In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers – Grace Hopper
  • 5. Parallel Programming Models • Message Passing • Independent tasks on local data • Tasks interact by exchanging messages • Shared memory • Tasks share common address space • Tasks interact by reading/writing in this space • Data parallelization • Tasks execute independent operations on partitions of data • Well suited for problems that are “embarrassingly parallel” Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 6. Traditional Infrastructures Introduction to Data Science https://sherbold.github.io/intro-to-data-science Data storage … Compute cluster Result / Insight Database or Storage Area Network (SAN) Compute Nodes
  • 7. Message Passing / Shared Memory Introduction to Data Science https://sherbold.github.io/intro-to-data-science Data storage … Compute cluster Result / Insight Each node may load complete data!  Does not scale
  • 8. Data Parallelization Introduction to Data Science https://sherbold.github.io/intro-to-data-science Data storage … Compute cluster Result / Insight Each node only loads partition  Scales better  Still requires transfering all data over the network
  • 9. Data Locality as Solution • Not supported by traditional clusters / infrastructures • Clusters for sharing CPUs, not storage • Storage made for IO, not computations Introduction to Data Science https://sherbold.github.io/intro-to-data-science If moving data is the problem, stop moving the data!
  • 10. Core concept of Big Data Technologies Introduction to Data Science https://sherbold.github.io/intro-to-data-science Parallelization Data Locality Compute Cluster with Distributed Storage … Our examples:
  • 11. Outline • Overview • MapReduce • Apache Hadoop • Apache Spark • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 12. MapReduce • Programming model for data parallelization • Published in 2004 by Google • map() and reduce() functions for data processing • Based on transformations of key-value pairs • shuffle() function for arranging intermediate results • Distribution via master/worker paradigm • Supports high availability / recoverability • Discussed together with hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM, 51.1 (2008): 107-113.
  • 13. Overview of MapReduce Introduction to Data Science https://sherbold.github.io/intro-to-data-science Value map() shuffle() Key reduce() Initial pairs <key1,value1> Intermediate pairs <key2,value2> Pairs grouped by key2 Results by key2
  • 14. The map() Function • Concept from functional programming • Applies a function to every item in the input separately • map(fun, <key1,value1>)  list(<key2, value2>) • Functions are usually user-defined • Input keys and output keys can be different • Also different types • Output is a list, i.e., one mapping can have multiple outputs • All list elements must have same types Introduction to Data Science https://sherbold.github.io/intro-to-data-science Data parallelization trivial In the initial MapReduce implementation, all keys and values were strings, users where expected to convert the types if required as part of the map/reduce functions
  • 15. The shuffle() Function • Organize data by key • shuffle(list(<key2,value2>))  list(<key2, list(value2)>) • Often includes sorting by key for efficiency • Shuffle does not wait for map() to finish • Once a <key2,value2> is available it can be shuffled • Reduces waiting times • Provided by the MapReduce framework • Can be overridden by users to optimize for use case Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 16. The reduce() Function • Related to fold from functional programming • Aggregates <key2,value2> pairs with the same key • Single value per key • Results in a list of values, one for each key • reduce(fun, list(<key2,list(value2)>))  list(value3) • Functions are usually user defined Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 17. Parallelization with MapReduce • Input can be read in chunks • Parallelism for creation of initial key-value pairs • map() can be computed for each key-value pair independently • Parallelism potential only limited by amount of data • shuffle() can start working as soon as first key-value pair is processed • Limits waiting times • reduce() can run in parallel for different keys • Need not wait for map to complete • Can already start when all results for a key are available Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 18. Word Counts with MapReduce • Our Data: • Initial <key1,value1> pairs: • <sentence1, “what is your name”> • <sentence2, “the name is bond james bond”> • map() function: emit <word,1> for each word in a sentence • <sentence1, “what is your name”>  <“what”,1>, <“is”,1>,<“your”,1>,<“name”,1> • <sentence2, “the name is bond james bond”>  <“the”, 1>,<“name”,1>,<“is”,1>,<“bond”,1>,<“james”,1>,<“bond”,1> • reduce() function: key concatenated with sum of values • <“what”, list(1)>  “what: 1” • <“bond”, list(1,1)> ”bond: 2” • … Introduction to Data Science https://sherbold.github.io/intro-to-data-science This is the “Hello World” for MapReduce What is your name? The name is Bond, James Bond.
  • 19. Outline • Overview • MapReduce • Apache Hadoop • Apache Spark • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 20. Overview of Hadoop • Open-source implementation of MapReduce • Supported by all major cloud providers • Used by many large companies, e.g., Twitter, Facebook, Amazon, … Introduction to Data Science https://sherbold.github.io/intro-to-data-science HDFS YARN Hadoop MapReduce Others Data Processing Cluster resource management Distributed file system
  • 21. Hadoop Distributed File System (HDFS) • Core component of Hadoop • Goals of HDFS • High throughput instead of low latency • Support for large files and data sets • Moving computation instead of moving data ( Data locality) • Resiliency against hardware failures • Uses a master/slave architecture Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 22. Overview of HDFS Introduction to Data Science https://sherbold.github.io/intro-to-data-science NameNode DataNodes Users • Access point for clients • Exposes file system operations • Organizes block creation / deletion / replication of DataNodes • Can have secondary NameNode to avoid single point of failure • Stores blocks of data • Serves read/write requests • Perform computations on blocks
  • 23. Example: Write File Introduction to Data Science https://sherbold.github.io/intro-to-data-science NameNode DataNodes Users 1. Create file 2. Data stream for writing 7. Close and complete 4. Create replica 5. Acknowledge File Blocks created by data stream Replication level 2  Each block twice
  • 24. Computing with Hadoop • HDFS „only“ distributed block storage • Each DataNode should also serve as compute node • Map / Reduce / Shuffle tasks • Ideally also general compute tasks • Each resource should only be used by one task • CPU cores • Memory  No overutilization • Resources should be used as efficiently as possible  No underutilization Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 25. YARN for Resource Management Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager Users NodeManager NodeManager • Resource scheduler • Manages resources for different applications • Runs on DataNodes • Gets tasks from Resource Manager • Executes tasks on local resources
  • 26. Running Applications with YARN Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager Users NodeManager NodeManager Application 1. Submit 2. Launch Application Master 5. Provide resources Application Master 3. Request resources Application Application 6. Launch application in container • Called Container • Allocated by Resource Manager • Started by Application Master 4.1. Allocate container 4.2. Allocate container
  • 27. Resource Requests • Send from Application Master to Resource Manager • Name of the resource • Can be used to select specific hosts or racks • Priority • Only within the application • Allows application to have an internal scheduler • Resource requirements • Memory • CPU cores • Number of containers Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager 4. Provide Resources Application Master 3. Request resources
  • 28. Launching Containers • Executes parts of the application • For example, map, reduce, or shuffle tasks • Requires commands to start application • Environment configuration • E.g., environment variables of application call • Can access local resources • Binaries, HDFS files/blocks Introduction to Data Science https://sherbold.github.io/intro-to-data-science NodeManager NodeManager Application Master Application 5. Launch application in container
  • 29. MapReduce with Hadoop • Implementation of MapReduce on top of YARN • Users define applications as sequences of map/reduce tasks • Hadoop specifies as MRAppMaster YARN container • MRAppMaster manages execution of tasks • Two execution modes • Java applications • Streaming mode Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 30. Java Applications • MapReduce applications defined by Jobs programmatically • Job class used to • Specify input • Register mapper • Register reducer • Specify output • Mapper and reducer tasks defined by extending classes • Compiled Jar is submitted to resource manager for execution Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 31. Example for a Mapper • Hadoop Mapper for Word Count example Introduction to Data Science https://sherbold.github.io/intro-to-data-science Type of input key Type of input value Type of output key Type of output value
  • 32. Example for a Reducer • Hadoop Reducer for the Word Count example Introduction to Data Science https://sherbold.github.io/intro-to-data-science Type of input key Type of input value Type of output key Type of output value
  • 33. Example for a Job definition • Hadoop Job for the Word Count example Introduction to Data Science https://sherbold.github.io/intro-to-data-science Reads file line by line
  • 34. Execution of Word Count with Hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager DataNode Users NodeManager DataNode NodeManager DataNode WordCount.jar 1. Submit 2. Launch Application Master MRAppMaster
  • 35. Execution of Word Count with Hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager DataNode Users NodeManager DataNode NodeManager DataNode 5. Provide resources MRAppMaster 3. Request resources for Map tasks WordCount Map Task WordCount Map Task 6. Launch Map tasks in container 4.1. Allocate container for map 4.2. Allocate container for map 7. Compute intermediary results 7. Compute intermediary results
  • 36. Execution of Word Count with Hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager DataNode Users NodeManager DataNode NodeManager DataNode MRAppMaster 8. Report tasks as finished 9.1. Free container for map 9.2. Free container for map
  • 37. Execution of Word Count with Hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager DataNode Users NodeManager DataNode NodeManager DataNode 12. Provide resources MRAppMaster 10. Request resources for Reduce tasks WordCount Reduce Task 13. Launch Reduce tasks in container 11. Allocate container for reduce 14.2. Shuffle intermediary data 14.1. Shuffle intermediary data 15. Write output Shuffle is an automated service running in the NodeManager
  • 38. Execution of Word Count with Hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager Node Manager DataNode Users Node Manager DataNode Node Manager DataNode MRAppMaster 16. Report job as finished
  • 39. Execution of Word Count with Hadoop Introduction to Data Science https://sherbold.github.io/intro-to-data-science Resource Manager NodeManager DataNode Users NodeManager DataNode NodeManager DataNode 17.1. Free container for reduce 17.1. Free App. Master
  • 40. Streaming Mode of Hadoop • Provided Java Application that implements Hadoop MapReduce • Input and output by line-wise file processing • Same as the handler we showed in the word count example • Formats can be defined using command line parameters • Mapper/Reducer can be any executable application Introduction to Data Science https://sherbold.github.io/intro-to-data-science Java Application that implements Hadoop MapReduce Input and output locations Executables used as mapper/reducer Copies local files to compute nodes
  • 41. Word Count with Python Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 42. Additional Important Parts of Hadoop • Combiner • Reducer function that runs locally before shuffling the data to another node • Often same as reducer, but not always • Requires functions to be chainable / idempotent • Can reduce network traffic • Example: • Word count reducer can also be used as combiner • The shuffled pairs would not all have a count of 1 anymore, but the word counts of the data of that node • MapReduce Job History Server • Collects information about the history of MapReduce jobs • Log files, start/end times, job state • Can be used by users to see status ob their jobs Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 43. Limitations of Hadoop • Multiple Map and/or Reduce steps require multiple jobs • Can still be defined in a single Java file • Output of one job can be input of another job • Jobs can have dependencies, e.g., one waiting for the completion of other jobs • Dependencies must be modeled by the programmer • All communication between jobs via the file system • Bad for multiple computations on the same data, e.g., chained map functions Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 44. Outline • Overview • MapReduce • Apache Hadoop • Apache Spark • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 45. Apache Spark • Engine for large-scale data processing • Designed to resolve limitations of Hadoop for data analysis • Supports in-memory analysis • Good support for iterative algorithms • Supports arbitrary combinations of Map and Reduce tasks • Users do not need to care about specific jobs and their dependencies Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 46. Spark Stack (I) • SparkSQL for SQL-like queries and data frame generation • Spark Streaming for live processing of streaming data Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 47. Spark Stack (II) • MLlib for machine learning algorithms • > 20 algorithms for clustering, regression, and classification • GraphX for graph algorithms Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 48. Data Structures used by Spark • Not a file system like Hadoop, instead two important in-memory data structures • Resilient Distributed Dataset (RDD) • Abstraction layer for data operations • Immutable partitions of elements • All elements in a RDD can be processed in parallel • Support map, reduce, filtering, user-defined functions, and persistence • Data frame • Higher abstraction built on RDDs • Similar to R / pandas data frames • Usually generated using the SparkSQL API Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 49. Infrastructures to Execute Apache Spark • Spark allows setup of clusters for computing • Not the standard way to use Spark • Compatible with many existing infrastructures instead • Computing • Hadoop/YARN and Apache Mesos • Storage • Hadoop/HDFS, Cassandra, HBase, MongoDB, … Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 50. Programming with Apache Spark • Natively implemented in Scala • JVM language with type inference and functional programming concepts • Also provides APIs for • Java • Python (PySpark) • R (SparkR) Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 51. Word Count with PySpark Introduction to Data Science https://sherbold.github.io/intro-to-data-science flatMap can map input to multiple outputs Python lambda functions: anonymous function with parameters a,b that returns the result of the computation after the colon, i.e., a+b reduceByKey reduces the data set by merging (i.e. reducing) key-value pairs with the same key
  • 52. Two major differences to Hadoop • In-memory • RDDs and data frames are handled in-memory if possible • Vast speed-up • Example: Logistic Regression • Does not provide own distributed storage back-end Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 53. Spark Ecosystem • Different execution modes • Large open source community • Many technologies on top of Spark • https://spark-packages.org/ • Still rapidly developing • Core concepts stable • 2.4.0 release this fall • 3.0.0 release planned for next year Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 54. Outline • Overview • MapReduce • Apache Hadoop • Apache Spark • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 55. Summary • Big data processing requires dedicated infrastructures • Moving data not feasible • Move computation to data • MapReduce as programming model for big data processing • Apache Hadoop as big data framework • HDFS distributed and reliable file system • YARN for resource management • Provides a MapReduce framework • Apache Spark for in-memory computations with big data • Compatible with different infrastructures, including Hadoop • Provides API for machine learning Introduction to Data Science https://sherbold.github.io/intro-to-data-science