Themis
An I/O-Efficient MapReduce
Alex Rasmussen, Michael Conley,
Rishi Kapoor, Vinh The Lam,
George Porter, Amin Vahdat*
University of California San Diego
*& Google, Inc.
1
MapReduce is Everywhere
  First published in OSDI 2004
  De facto standard for large-scale bulk
data processing
  Key benefit: simple programming model
that handles large volumes of data
2
I/O Path Efficiency
  I/O-bound, so disks are the bottleneck
  Existing implementations do a lot of disk I/O
– Materialize map output, re-read during shuffle
– Multiple sort passes if intermediate data large
– Swapping in response to memory pressure
  Hadoop Sort 2009: <3 MBps per node
(3% of a single disk’s throughput!)
3
How Low Can You Go?
  Aggarwal and Vitter: minimum of two reads and
two writes per record for out-of-core sort
  Systems that meet this
lower bound have
the “2-IO property”
  MapReduce has a sort in the middle,
so the same principle applies
4
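For intuition (my notation, not the slides'): with input size N, memory size M, and disk block size B, the Aggarwal-Vitter bound and the condition under which two passes suffice can be summarized roughly as:

```latex
% Aggarwal-Vitter lower bound for out-of-core sorting
% (N = input size, M = memory size, B = block size -- assumed notation).
\[
  \text{I/Os} \;=\; \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
\]
% When N <= M^2/B, the log term is at most 2, so each record can be
% read twice and written twice -- the "2-IO property".
\[
  N \le \frac{M^2}{B}
  \;\Longrightarrow\;
  \log_{M/B}\frac{N}{B} \le 2
\]
```

With tens of gigabytes of RAM per node and megabyte-scale disk transfers, M²/B is on the order of hundreds of terabytes per node, so the two-pass condition comfortably holds at the scales discussed here.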
Themis
  Goal: Build a MapReduce implementation with
the 2-IO property
  TritonSort (NSDI ’11): 2-IO sort
– World record holder in large-scale sorting
  Runs on a cluster of machines with many disks
per machine and fast NICs
  Performs a wide range of I/O-bound
MapReduce jobs at nearly TritonSort speeds
5
Outline
  Architecture Overview
  Memory Management
  Fault Tolerance
  Evaluation
6
Phase One
Map and Shuffle
[Figure: map tasks emit records (example keys 3, 39, 25, 15) that are shuffled over the network into key-range partitions 1-10, 11-20, 21-30, 31-40]
7-10
Phase Two
Sort and Reduce
[Figure: each unsorted key-range partition (1-10, 11-20, 21-30, 31-40) from phase one is sorted in memory, reduced, and written out]
11-14
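As a minimal sketch of the dataflow in the two phases above (hypothetical Python, not the Themis implementation): phase one maps each input record and scatters it to a key-range partition; phase two sorts and reduces each partition in memory.

```python
# Hypothetical sketch of the two-phase dataflow (illustration only).
from collections import defaultdict

def phase_one(inputs, map_fn, partition_fn, num_partitions):
    """Map and shuffle: each mapped record lands in exactly one partition."""
    partitions = defaultdict(list)            # stands in for per-partition files
    for record in inputs:
        for key, value in map_fn(record):
            partitions[partition_fn(key, num_partitions)].append((key, value))
    return partitions

def phase_two(partitions, reduce_fn):
    """Sort and reduce: each partition fits in RAM, so one in-memory sort."""
    output = []
    for pid in sorted(partitions):
        records = sorted(partitions[pid])      # in-memory sort of one partition
        current_key, values = None, []
        for key, value in records:
            if key != current_key and values:  # key changed: reduce the group
                output.extend(reduce_fn(current_key, values))
                values = []
            current_key = key
            values.append(value)
        if values:
            output.extend(reduce_fn(current_key, values))
    return output
```

In the real system each record is read twice and written twice: read from the input and written into its partition in phase one, then read back, sorted, reduced, and written as final output in phase two; the sketch keeps partitions in memory purely for brevity.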
Implementation Details
Phase One pipeline: Reader → Byte Stream Converter → Mapper → Sender → (network) → Receiver → Byte Stream Converter → Tuple Demux → Chainer → Coalescer → Writer
Each step in the pipeline is a Stage
[Figure: every node (Worker 1-4) runs the same stage pipeline, connected over the network]
15-17
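The boxes in the pipeline above are stages connected by queues of buffers. Here is a highly simplified sketch of that stage abstraction (hypothetical names; threading and batching details omitted). In Themis, back-pressure comes from memory allocation blocking (next section) rather than from bounded queues.

```python
# Hypothetical sketch of a stage with a pool of worker threads (illustration only).
import queue
import threading

class Stage:
    def __init__(self, name, work_fn, downstream=None, num_workers=1):
        self.name = name
        self.work_fn = work_fn            # buffer -> list of output buffers
        self.downstream = downstream      # next Stage, or None for a sink
        self.inbox = queue.Queue()        # unbounded here for simplicity
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(num_workers)]

    def start(self):
        for w in self.workers:
            w.start()

    def _run(self):
        while True:
            buf = self.inbox.get()        # blocks when there is no work
            for out in self.work_fn(buf):
                if self.downstream is not None:
                    self.downstream.inbox.put(out)
```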
Challenges
  Can’t spill to disk or swap under pressure
– Themis memory manager
  Partitions must fit in RAM
– Sampling (see paper)
  How does Themis handle failure?
– Job-level fault tolerance
18
Outline
  Architecture Overview
  Memory Management
  Fault Tolerance
  Evaluation
19
Example
[Figure: a simple pipeline of three stages, A → B → C, passing buffers of records downstream]
20
Memory Management - Goals
  If we exceed physical memory, we have to swap
– OS approach: virtual memory, swapping
– Hadoop approach: spill files
  Since this incurs extra I/O, it's unacceptable
21
Example – Runaway Stage
[Figure: one stage in the A → B → C pipeline allocates buffers faster than its downstream stage drains them: "Too Much Memory!"]
22
Example – Large Record
[Figure: a single very large record moving through the A → B → C pipeline: "Too Much Memory!"]
23
Requirements
  Main requirement: can't allocate more than
the amount of physical memory
– Swapping or spilling breaks 2-IO
  Provide flow control via back-pressure
– Prevent stage from monopolizing memory
– Memory allocation can block indefinitely
  Support large records
  Should have high memory utilization
24
Approach
  Application-level memory management
  Three memory management schemes
– Pools
– Quotas
– Constraints
25
Pool-Based Memory Management
[Figure: fixed-size buffer pools sit between adjacent stages of the A → B → C pipeline: Pool AB between A and B, Pool BC between B and C]
26-27
Pool-Based Memory Management
  Prevents using more than what’s in a pool
  Empty pools cause back-pressure
  Record size limited to size of buffer
  Unless tuned, memory utilization might be low
– Tuning is hard; must set buffer, pool sizes
  Used when receiving data from network
– Must be fast to keep up with 10Gbps links
28
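A minimal sketch of the pool scheme (assumed API, hypothetical names; not the Themis code): a fixed number of fixed-size buffers is carved out up front, and an empty pool blocks the caller, which is where the back-pressure comes from.

```python
# Hypothetical sketch of pool-based memory management (illustration only).
import queue

class BufferPool:
    """A fixed count of fixed-size buffers; get() blocks when the pool is empty."""
    def __init__(self, buffer_size, num_buffers):
        self.buffer_size = buffer_size
        self._free = queue.Queue()
        for _ in range(num_buffers):
            self._free.put(bytearray(buffer_size))   # total memory fixed up front

    def get(self):
        return self._free.get()   # blocks until a buffer is returned: back-pressure

    def put(self, buf):
        self._free.put(buf)

# Example: a hand-tuned pool between the Receiver and Demux stages (made-up sizes).
receiver_pool = BufferPool(buffer_size=64 * 1024, num_buffers=1024)   # 64 MB total
```

Both knobs (buffer size and buffer count) must be chosen up front, which is the tuning burden noted above, and no single record can exceed one buffer.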
Quota-Based Memory Management
[Figure: stages A → B → C with a quota between source A and sink C (Quota AC = 1000), shown split as 300 and 700 units]
29
Quota-Based Memory
Management
  Provides back-pressure by limiting memory
between source and sink stage
  Supports large records (up to size of quota)
  High memory utilization if quotas set well
  Size of the data between source and sink
cannot change – otherwise leaks
  Used in the rest of phase one (map + shuffle)
30
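A minimal sketch of the quota scheme (assumed API, hypothetical names): a byte budget between a source stage and a sink stage, charged when the source allocates and credited when the sink frees, much like a counting semaphore over bytes.

```python
# Hypothetical sketch of quota-based memory management (illustration only).
import threading

class Quota:
    """Byte quota between a source and a sink stage (like Quota AC = 1000 above)."""
    def __init__(self, total_bytes):
        self.total = total_bytes
        self.used = 0
        self._cv = threading.Condition()

    def allocate(self, nbytes):
        # Called at the source; blocks (back-pressure) while the quota is exhausted.
        with self._cv:
            while self.used + nbytes > self.total:
                self._cv.wait()
            self.used += nbytes

    def release(self, nbytes):
        # Called at the sink. Intermediate stages must not change the data size,
        # or the accounting drifts and the quota "leaks", as the slide warns.
        with self._cv:
            self.used -= nbytes
            self._cv.notify_all()
```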
Constraint-Based Memory
Management
  Single global memory quota
  Requests that would exceed quota are
enqueued and scheduled
  Dequeue based on policy
– Current: stage distance to network/disk write
– Rationale: process record completely before
admitting new records
31
Constraint-Based Memory
Management
  Globally limits memory usage
  Applies back-pressure dynamically
  Supports record sizes up to size of memory
  Extremely high utilization
  Higher overhead (2-3x over quotas)
  Can deadlock for complicated stage graphs or allocation patterns
  Used in phase two (sort + reduce)
32
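A minimal sketch of the constraint scheme (assumed API and a simplified policy, hypothetical names): a single global byte limit, with requests that would exceed it parked on a priority queue and granted in policy order, here by distance to the downstream writer.

```python
# Hypothetical sketch of constraint-based memory management (illustration only).
import heapq
import threading

class ConstraintAllocator:
    """One global byte limit; blocked requests are granted in policy order."""
    def __init__(self, total_bytes):
        self.total = total_bytes
        self.used = 0
        self._lock = threading.Lock()
        self._waiting = []                # heap of (priority, seq, nbytes, event)
        self._seq = 0

    def allocate(self, nbytes, distance_to_writer):
        # Smaller distance to the writer = closer to leaving the system = served
        # first, so in-flight records drain before new ones are admitted.
        with self._lock:
            if self.used + nbytes <= self.total and not self._waiting:
                self.used += nbytes
                return
            event = threading.Event()
            heapq.heappush(self._waiting,
                           (distance_to_writer, self._seq, nbytes, event))
            self._seq += 1
        event.wait()                      # blocks until release() grants the request

    def release(self, nbytes):
        with self._lock:
            self.used -= nbytes
            while self._waiting:          # grant as many waiters as now fit
                dist, _, need, event = self._waiting[0]
                if self.used + need > self.total:
                    break
                heapq.heappop(self._waiting)
                self.used += need
                event.set()
```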
Outline
  Architecture Overview
  Memory Management
  Fault Tolerance
  Evaluation
33
Fault Tolerance
  Cluster MTTF is determined by cluster size and node MTTF
– Node MTTF ~ 4 months (Google, OSDI ’10)
MTTF small: Google-scale cluster
– 10,000+ nodes
– 2 minute cluster MTTF
– Failures common; the job must survive faults
MTTF large: average Hadoop cluster
– 30-200 nodes
– 80 hour cluster MTTF
– Failures uncommon; OK to just re-run the job
34
Why are Smaller Clusters OK?
  Hardware trends let you do more with less
– Increased hard drive density (32TB in 2U)
– Faster bus speeds
– 10Gbps Ethernet at end host
– Larger core counts
  Can store and process petabytes
  But is it really OK to just restart on failure?
35
Analytical Modeling
  Modeling goal: when is performance gain
nullified by restart cost?
  Example: a 2x performance improvement is nullified at
– 30 minute jobs: ~4300 nodes
– 2.5 hour jobs: ~800 nodes
  Can sort 100TB on 52 machines in ~2.5 hours
– Petabyte-scale jobs are easily possible in this regime
  See paper for additional discussion
36
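To make the restart tradeoff concrete, here is one toy model (my assumption, not the paper's model, which differs in its details): with N nodes of MTTF m, exponential failures, and restart-from-scratch on any failure, the expected completion time of a job of length T is (e^{NT/m} - 1)·m/N.

```python
# Toy restart model (not the paper's): expected runtime under restart-from-scratch
# with exponential node failures.
import math

def expected_runtime_hours(job_hours, num_nodes, node_mttf_hours):
    lam = num_nodes / node_mttf_hours      # cluster-wide failure rate (per hour)
    return (math.exp(lam * job_hours) - 1) / lam

# The slides' 100 TB example: 52 machines, ~2.5 hour sort, node MTTF ~4 months.
node_mttf_hours = 4 * 30 * 24              # ~2880 hours
t = expected_runtime_hours(2.5, 52, node_mttf_hours)
print(f"expected runtime: {t:.2f} h")      # ~2.56 h, only a few percent overhead
```

At this scale the expected restart penalty is small, which is why simply re-running the job is an acceptable fault-tolerance strategy.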
Outline
  Architecture Overview
  Memory Management
  Fault Tolerance
  Evaluation
37
Workload
  Sort: uniform and highly-skewed
  CloudBurst: short-read gene alignment
(ported from Hadoop implementation)
  PageRank: synthetic graphs, Wikipedia
  Word count
  n-Gram count (5-grams)
  Session extraction from synthetic logs
38
Performance
[Figures: performance results on slides 39-41 (charts not preserved in the text)]
39-41
Performance vs. Hadoop
42
Application    Hadoop Runtime (sec)    Themis Runtime (sec)    Improvement
Sort-500G      28881                   1789                    16.14x
CloudBurst     2878                    944                     3.05x
Summary
  Themis – MapReduce with 2-IO property
  Avoids swapping and spilling by carefully
managing memory
  Potential place for job-level fault tolerance
  Executes a wide variety of workloads at
extremely high speed
43
Themis - Questions?
44
http://themis.sysnet.ucsd.edu/
