Leveraging Endpoint Flexibility
in Data-Intensive Clusters
Mosharaf Chowdhury
Srikanth Kandula
Ion Stoica
Presented by Ran Ziv UC Berkeley
What’s Ahead?
• Intro: Data-Intensive Clusters
• Proposed solution
• Evaluation
• Conclusion
What is a Data-Intensive Cluster?
• Scalable data storage and processing
• “Core” consists of two main parts
• Distributed File System (DFS)
• Processing (MapReduce)
Motivation
Store and analyze PBs of information
How Did It Originate?
• Heavily influenced by Google’s architecture
• Other Web companies quickly saw the benefits
DFS: How does it work?
• Moore’s law… and not
Disk Capacity and Price
• We’re generating more data than ever before
• Fortunately, the size and cost of storage have kept pace
Disk Capacity and Performance
• Disk performance has also increased in the last 15
years
• Unfortunately, transfer rates haven’t kept pace with
capacity
Architecture of a Typical HPC System
You Don’t Just Need Speed…
• The problem is that we have way more data than
code
You Need Speed At Scale
DISTRIBUTED FILESYSTEM
Benefits of DFS
• Enables analysis that was previously impossible or impractical
• Analysis conducted at lower cost
• Analysis conducted in less time
• Linear scalability
Collocated Storage and Processing
• Solution: store and process data on the same nodes
• Data Locality: “Bring the computation to the data”
• Reduces I/O and boosts performance
DFS High-Level Architecture
• DFS follows a master-slave architecture
• Master: NameNode
• Responsible for namespace and metadata
• Namespace: file hierarchy
• Metadata: ownership, permissions, block locations, etc.
• Slave: DataNode
• Responsible for storing the actual data blocks
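To make the namespace/metadata split concrete, here is a minimal illustrative sketch of the kind of mapping a NameNode-style master keeps in memory; the structures, paths, and node names are invented for the example and are far simpler than real HDFS metadata:

```python
# Illustrative only: a real NameNode also tracks permissions, leases,
# generation stamps, etc., and persists changes to an edit log.

namespace = {                                  # namespace: file hierarchy
    "/logs/2013-08-01.log": ["blk_001", "blk_002"],
    "/warehouse/users.tbl": ["blk_003"],
}

block_locations = {                            # metadata: block -> replica hosts
    "blk_001": ["datanode-A", "datanode-C", "datanode-D"],
    "blk_002": ["datanode-B", "datanode-D", "datanode-E"],
    "blk_003": ["datanode-A", "datanode-C", "datanode-E"],
}

def locate(path):
    """For each block of a file, list the DataNodes holding a replica."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/2013-08-01.log"))
```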
DFS Blocks
• When a file is added to DFS, it’s split into blocks
• DFS uses a much larger block size (>= 64MB), for
performance
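As a rough sketch of the splitting step (the 64MB constant and the read loop are simplified for illustration; a real DFS client streams blocks to DataNodes rather than reading a local file like this):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB; production clusters often use 128 MB or larger

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield successive fixed-size chunks of a file; only the last chunk may be smaller."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield chunk

# Usage: each yielded chunk would become one DFS block with its own replicas.
```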
DFS Replication
• Those blocks are then replicated across machines
• The first block might be replicated to A, C and D
DFS Replication
• The next block might be replicated to B, D and E
DFS Replication
• The last block might be replicated to A, C and E
DFS Reliability
• Replication helps to achieve reliability
• Even when a node fails, two copies of the block remain
• These will be re-replicated to other nodes automatically
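A toy sketch of the bookkeeping behind automatic re-replication, assuming the master already knows each block's replica hosts and which nodes are alive (all names are illustrative):

```python
REPLICATION_FACTOR = 3

def under_replicated(block_locations, live_nodes):
    """Map each block to the number of replicas it is missing after node failures."""
    missing = {}
    for blk, hosts in block_locations.items():
        alive = sum(1 for h in hosts if h in live_nodes)
        if alive < REPLICATION_FACTOR:
            missing[blk] = REPLICATION_FACTOR - alive
    return missing

# Example: if datanode-D fails, any block that kept a replica there
# shows up here with one missing copy and gets re-replicated elsewhere.
```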
MapReduce High-Level Architecture
Like DFS, MapReduce has a master-slave architecture
• Master: JobTracker
• Responsible for dividing, scheduling and monitoring work
• Slave: TaskTracker
• Responsible for actual processing
Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
• One function (Map) processes data
• That output is ultimately input to another function
(Reduce)
The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming
Intermediate Processing
• The Map function’s output is grouped and sorted
• This is the automatic “sort and shuffle” process
The Reduce Function
• Operates on all records in a group
• Often used for sum, average or other aggregate functions
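The classic word count ties the three stages together. This is a single-process sketch of the programming model only, not how a MapReduce engine actually executes jobs:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: turn one input record into (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: aggregate all values that share a key
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

pairs = [kv for line in lines for kv in map_fn(line)]      # 1) map each record
pairs.sort(key=itemgetter(0))                              # 2) "sort and shuffle"
grouped = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0)))
print([reduce_fn(k, vs) for k, vs in grouped])             # 3) reduce each group
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```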
MapReduce Flow
[Diagram: Input Files feed Mapper tasks, whose Intermediate Files are shuffled to Reducer tasks that write the Output Files; the JobTracker coordinates the tasks across machines]
Communication is Crucial for Performance
Facebook analytics jobs spend 33% of their runtime in communication
Cross-Rack Traffic
Share of cross-rack traffic:

                Facebook   Bing
DFS Reads          14%      31%
Intermediate       46%      15%
DFS Writes         40%      54%
DFS
[Diagram: the blocks of a file replicated across Rack 1, Rack 2, and Rack 3 through the core]
• Files are divided into blocks
  • 64MB to 1GB in size
• Each block is replicated
  • To 3 machines for fault tolerance
  • In 2 fault domains for partition tolerance
• Synchronous operations
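For later contrast with Sinbad, here is a sketch of a default-style placement that picks replica locations uniformly at random while respecting the constraints above (3 copies spread over 2 fault domains); the topology and function name are invented for the example:

```python
import random

def place_randomly(racks, replication=3):
    """racks: dict rack -> list of machines. Returns 3 machines spanning 2 racks."""
    r1, r2 = random.sample(list(racks), 2)                # two distinct fault domains
    chosen = [random.choice(racks[r1])]                   # one replica in the first rack
    chosen += random.sample(racks[r2], replication - 1)   # the rest in the second rack
    return chosen

racks = {"rack1": ["m1", "m2", "m3"], "rack2": ["m4", "m5", "m6"], "rack3": ["m7", "m8"]}
print(place_randomly(racks))   # e.g. ['m2', 'm7', 'm8']
```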
How to Handle DFS Flows?
Existing approaches (Hedera, VLB, Orchestra, Coflow, MicroTE, DevoFlow, …) treat replication like any other flows:
• Fixed: sources, destinations
• Flexible: paths, rates
But Replica Locations Don't Matter
• As long as the constraints are met, replicas can go anywhere
• So the destinations of replication traffic are flexible too, not just its paths and rates
Sinbad
Steers flexible replication traffic away from hotspots
• Improves write rates
• Makes the network more balanced
The Distributed Writing Problem
[Diagram: blocks of a file being written to Racks 1-3 through the core]
Given
• Blocks of different sizes
• Links of different capacities
Place blocks to minimize
• The average block write time
• The average file write time

Analogy: given
• Jobs of different lengths, and
• Machines of different speeds,
schedule jobs to minimize
• The average job completion time
This is the job shop scheduling problem, which is NP-hard.
How to Make it Easy?
Assumptions:
• All blocks have the same size
• Link utilizations are stable
Theorem:
Greedy placement minimizes
average block/file write times
How to Make it Easy? – In Practice
• Link utilizations are stable
In Reality: Average link utilizations are temporarily stable [1, 2]
• All blocks have the same size
In Reality: Fixed-size large blocks write 93% of all bytes
1. Utilization is considered stable if its average over the next x seconds remains within ±5% of the initial value.
2. Typically, x ranges from 5 to 10 seconds.
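A small sketch of the stability definition in footnote 1; the sampling details (a list of utilization samples covering the next x seconds) are an assumption for illustration:

```python
def is_stable(samples, tolerance=0.05):
    """samples[0] is the initial utilization; the rest cover the next x seconds.
    The link is 'stable' if the window's average stays within +/-5% of the initial value."""
    initial = samples[0]
    average = sum(samples) / len(samples)
    return abs(average - initial) <= tolerance * initial

print(is_stable([0.60, 0.62, 0.59, 0.61]))  # True: average 0.605 is within 5% of 0.60
print(is_stable([0.60, 0.90, 0.95, 1.00]))  # False: the link became much busier
```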
Greedy Algorithm
Two-step greedy replica placement:
1. Pick the least-loaded link
2. Send a block from the file with the least remaining blocks through the selected link
[Diagram: example block placements over time intervals T and T+1]
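A minimal sketch of the two-step policy above; the link-load and pending-file bookkeeping are simplified stand-ins for what the real system tracks:

```python
def greedy_place(link_loads, pending_blocks):
    """link_loads: link -> current utilization (lower is better).
    pending_blocks: file -> number of blocks still to write.
    Returns (link, file) for the next block placement."""
    # Step 1: pick the least-loaded link.
    link = min(link_loads, key=link_loads.get)
    # Step 2: among files with blocks left, pick the one with the fewest
    # remaining blocks, so small files (and their writers) finish earliest.
    remaining = {f: n for f, n in pending_blocks.items() if n > 0}
    chosen_file = min(remaining, key=remaining.get)
    return link, chosen_file

links = {"core->rack1": 0.7, "core->rack2": 0.2, "core->rack3": 0.5}
files = {"black": 2, "orange": 3}
print(greedy_place(links, files))   # ('core->rack2', 'black')
```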
Sinbad Overview
Sinbad follows a master-slave architecture
• Master
  • Collocated with the DFS master
  • Decides where to place each block
• Slave
  • Periodically reports information to the master
[Diagram: the Sinbad master sits beside the DFS master; every machine runs a DFS slave and a Sinbad slave]
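A hedged sketch of the control flow between the Sinbad slaves and master; the message fields, report interval, and method names are assumptions for illustration rather than the actual Sinbad interface:

```python
class SinbadMaster:
    def __init__(self):
        self.link_loads = {}                 # host downlink -> latest utilization estimate

    def handle_report(self, host, utilization):
        # Slaves periodically push local estimates; the master keeps a cluster-wide view.
        self.link_loads[host] = utilization

    def choose_destinations(self, num_replicas=3):
        # Greedy: return the currently least-loaded hosts
        # (fault-domain constraints and hysteresis omitted in this sketch).
        ranked = sorted(self.link_loads, key=self.link_loads.get)
        return ranked[:num_replicas]

master = SinbadMaster()
for host, util in [("m1", 0.8), ("m2", 0.1), ("m3", 0.4), ("m4", 0.6)]:
    master.handle_report(host, util)         # what a Sinbad slave would send periodically
print(master.choose_destinations())          # ['m2', 'm3', 'm4']
```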
Evaluation
A 3000-node trace-driven simulation matched against a
100-node EC2 deployment
1. Does it improve performance?
2. Does it balance the network?
3. Does the storage remain balanced?
The short answer to all three: YES
Faster
                Job Improv.   DFS Improv.
Simulation         1.39X         1.79X
Experiment         1.26X         1.60X
More Balanced
[Charts: CDFs of imbalance, measured as the coefficient of variation of link utilization. Left: EC2 deployment, load across rack-to-host links. Right: Facebook trace simulation, load across core-to-rack links. In both, the Network-Aware curve lies to the left of Default, i.e., the network is more balanced.]
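For reference, the imbalance metric on the x-axis (coefficient of variation) is just the standard deviation of the link utilizations divided by their mean; a quick sketch:

```python
from statistics import mean, pstdev

def coeff_of_variation(utilizations):
    """0 means perfectly balanced links; larger values mean more imbalance."""
    return pstdev(utilizations) / mean(utilizations)

print(coeff_of_variation([0.5, 0.5, 0.5]))   # 0.0  -> no imbalance
print(coeff_of_variation([0.1, 0.1, 1.0]))   # ~1.06 -> one hotspot link
```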
What About Storage Balance?
Imbalanced in the short term
But, in the long term,
hotspots are uniformly distributed
Conclusions
Three Approaches Toward Contention Mitigation

#1 Increase Capacity
• Fatter links/interfaces
• Increase bisection bandwidth
• Fat tree, VL2, DCell, BCube, F10, …

#2 Decrease Load
• Data locality
• Static optimization
• Fair scheduling, Delay scheduling, Mantri, Quincy, PeriSCOPE, RoPE, Rhea, …

#3 Balance Usage
• Manage elephant flows
• Optimize intermediate communication
• Valiant load balancing (VLB), Hedera, Orchestra, Coflow, MicroTE, DevoFlow, …
Sinbad: greedily steers replication traffic away from hotspots
• Improves job performance by making the network more balanced
• Improves DFS write performance while keeping the storage balanced
• Sinbad will become increasingly important as storage becomes faster
Planning to deploy Sinbad at [company logo]
Questions?
Editor's Notes
  1. Good afternoon. I’m … Today, I’m going to talk about network transfers that do not have fixed destinations. This is joint work with … Written at Berkeley (SIGCOMM) and presented last August in Hong Kong.
  2. How it started: internet companies… The main motivation, in addition to the usual goals: lower cost, less time, greater flexibility, linear scalability. But how? And what makes it possible?
  3. Open source. Started at Google…
  4. Gordon Moore 1965
  5. Capacity has increased while price has decreased
  6. They analyzed data from Facebook and Bing and found the 33% figure. ----------------- Many data-intensive jobs depend on communication for faster end-to-end completion time. For example, in one of our earlier works, we found that typical jobs at Facebook spend up to a third of their running time in shuffle or intermediate data transfers. As in-memory systems proliferate and disks are removed from the I/O pipeline, the network is likely to be the primary bottleneck. But what does the network usage of data-intensive clusters look like, and where does it come from? To better understand the problem, we have analyzed traces from two data-intensive production clusters at Facebook and Microsoft. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011
  7. We have found something very interesting. While a LOT of attention has gone into decreasing reads over the network or managing intermediate communication, DFS replication creates almost half of all cross-rack traffic. Note that this doesn’t mean everyone was wrong; communication of intermediate data or shuffle is still a major source of job-level communication. But the sources of these writes are ingestion of new data into the cluster and preprocessing of existing data for later use, neither of which shows up when someone looks only at the jobs. Only a very small amount is actually created by typical jobs. We’ve also found that during ingestion many writers spend up to 90% of their time in writing. Well, that is their job. What is this DFS?
  8. Distributed file systems are ubiquitous in data-intensive clusters and form the narrow waist. Diverse computing frameworks read from and write to the same DFS. Examples include GFS, HDFS, Cosmos etc. Typically, distributed file systems store data as files. Each file is divided into large blocks. Typical size of a block would be 256MB. Each block of a file is then replicated to three different machines for fault tolerance. These three machines are located in two different fault domains, typically racks, for partition tolerance. Finally, replicas are placed uniformly randomly throughout the cluster to avoid storage imbalance. Writes to a DFS are typically synchronous.
  9. We address the traffic of distributed file systems in modern clusters like any other elephant flows in the network. We assume that the endpoints are fixed. All the existing work balances the network after the locations of the sources and destinations have already been decided. Because sources and destinations are fixed, they try to find different paths between them or try to change the rates on different paths. But we can do more. Let us revisit the requirements of replica placement.
  10. Notice that, as long as these constraints are met, the DFS does not care where actually the replicas have been placed. This means, we can effectively change the destinations of all replication traffic if we satisfy the constraints. We refer to such transfers as constrained anycast in that replicas can go anywhere, but they are constrained.
  11. In this work, we present Sinbad. By steering replication traffic away from congested hotspots in the network, Sinbad can improve the performance of the network. However, this can be useful only if there is significant hotspot activity in the cluster.
  12. We refer to this as the distributed writing problem. Given blocks of different sizes and links of different capacities, Sinbad must place the replicas in a way that minimizes the average block write time as well as the average file write time. Note that blocks can have different sizes because blocks are not padded in a DFS. Now, if for each block we consider a job of that length, and for each link a machine of that capacity, we see that the distributed writing problem is similar to the job shop scheduling problem. And it is NP-hard.
  13. Let’s take an example. We have the same network as before. We are going to assume that the three core-to-rack links are the possible bottlenecks. Replica placement requests from two different files come online. The black file has two blocks and the orange one has three. Now, let us assume that time is divided into discrete intervals. We must decide on the three requests during time interval T. We are also going to assume that intervals are independent; i.e., placement decisions during T will not affect the ones during T+1. Finally, we are going to assume that link utilizations are stable for the duration of replication or during T, and all blocks have the same size. It is clear that we should pick the least-loaded link because that will finish the fastest. Because all blocks are of the same size, it doesn’t matter which block we choose for minimizing the average block write time. If we also care about minimizing the file write times, we should always pick the smallest file (the one with the least remaining blocks) to go through the fattest link. Under these assumptions, greedy placement is optimal.
  14. We propose a simple two-step greedy placement policy. At any point, we pick the least-loaded link and then send a block from the file with the fewest remaining blocks through it.
  15. That brings us to Sinbad. Sinbad performs network-aware replica placement for DFS. <EXPLAIN> // Mention master-slave architecture etc.
  16. That brings us to Sinbad. Given a replica placement request, the master greedily places it and returns back the locations. It also adds some hysteresis to avoid placing too many replicas in the same rack. Further details on the process can be found in the paper. One thing to note is that the interface is incredibly simple, which makes it all the more deployable. All in all, we needed to change only a couple hundred lines of code to implement the whole thing.
  17. We have implemented Sinbad in HDFS, which is the de facto open source DFS used by traditional frameworks like Hadoop as well as upcoming systems like Spark. We have also performed a flow-level simulation of the 3000-node Facebook cluster. The three high-level questions one might ask are: Does it improve performance? Does it improve network balance? Will the storage remain balanced? The short answer to all three is YES.
  19. We consider performance from the perspective of the user (i.e., job performance) and that of the system (DFS performance). <EXPLAIN results> We have found that if we applied a similar technique to in-memory storage systems like Tachyon, the improvements could be even higher because disks are never the bottlenecks. So, network balance improved and performance improved as well. Upper bound: 1.89X
  20. X-axis: coefficient of variation of the utilization. We’ve found that the network is highly imbalanced in both clusters. We are looking at a CDF of imbalance in core-to-rack downlinks in the Facebook cluster. On the x-axis we have imbalance measured by the coefficient of variation of link utilizations. The coefficient of variation is the ratio of the standard deviation to the mean of some samples, which is zero when all samples are the same, i.e., there is NO imbalance. In general, a smaller CoV means smaller imbalance. We’ve measured link utilization as the average of 10s bins. We see that it is almost never zero, and more than 50% of the time it is more than 1 (which is a typical threshold for high imbalance). The same is true for the Bing cluster as well. Given that a large fraction of traffic allows flexibility in endpoint placement and the network indeed has hotspots, we can now formally define the problem Sinbad is trying to address. ----------------- The network became more balanced as well. Notice that in both EC2 experiments and trace-based simulations, the orange curve moved toward the left, which indicates decreased network imbalance.
  21. Sinbad optimizes for the network: it checks every 10 seconds and decides where to put replicas based on the network, not only on storage. Short term = 1 hour.
  22. There has been a LOT of work on better optimizing the network, and the solutions largely fall into three categories. The first approach is to increase the capacity of the network. This includes moving from 1GigE to 10GigE links and increasing the bisection bandwidth of datacenter networks. In fact, there have been a lot of proposals on designing full-bisection-bandwidth networks. However, full bisection bandwidth does not mean infinite bandwidth, and the size of the workload is always increasing. In practice, many clusters still have some amount of oversubscription in their core-to-rack links. The next approach is decreasing the amount of network traffic. All the work on data locality (and there has been a lot of it) tries to decrease network communication by moving computation closer to its input. Recently, many researchers have looked into static analysis of data-intensive applications to decrease communication. These are all best-effort mechanisms, and there is always some data that must traverse the network. This brings us to the third approach, that is, load balancing the network. Typically it focuses on managing large flows and optimizing the communication of intermediate data. Our recent work on Orchestra and Coflow also falls in this category. This work is about going one step further in this direction.