The Evolution of Big Data Frameworks

The talk presents the evolution of Big-Data systems from single-purpose MapReduce frameworks to fully general computational infrastructures. In particular, I will follow the evolution of Hadoop, and show the benefits and challenges of a new architectural paradigm that decouples the resource management component (YARN) from the specifics of the application frameworks (e.g., MapReduce, Tez, REEF, Giraph, Naiad, Dryad, Spark, ...). We argue that besides the primary goals of increasing scalability and programming-model flexibility, this transformation dramatically facilitates innovation.

In this context, I will present some of our contributions to the evolution of Hadoop (namely, work-preserving preemption and predictable resource allocation), and comment on the fascinating experience of working on open-source technologies from within Microsoft. The current Hadoop APIs (HDFS and YARN) provide the cluster equivalent of an OS API. With this as a backdrop, I will present our attempt to create the equivalent of a stdlib for the cluster: the REEF project.

Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL, leading the Relational Cloud project. He worked at Yahoo! Research as a Research Scientist, focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the Cloud and Information Services Lab (CISL), where he is working on big-data platforms and cloud computing.


The Evolution of Big Data Frameworks (slide transcript)

  1. From the dive bars of Silicon Valley to the World Tour (Carlo Curino)
  2. My perspective: PhD + postdoc in databases (“we did all of this 30 years ago!”); Yahoo! Research (“webscale, webscale, webscale!”); Microsoft – CISL (“enterprise + cloud + search engine + big-data”).
  3. Agenda:
     - Cluster as an embedded system (map-reduce): single-purpose clusters
     - General-purpose cluster OS (YARN, Mesos, Omega, Corona): standardizing access to computational resources
     - Real-time OS for the cluster!? (Rayon): predictable resource allocation
     - Cluster stdlib (REEF): factoring out common functionalities
  4. Cluster as an Embedded System: “the era of map-reduce-only clusters”
  5. The origin: purpose-built technology within large web companies, with a well-targeted mission (process the webcrawl) → scale and fault tolerance. Google leading the pack: Google File System + MapReduce (2003/2004). Open-source and parallel efforts: Yahoo!’s Hadoop ecosystem, HDFS + MR (2006/2007); Microsoft Scope/Cosmos (2008) (more than MR).
  6. In-house growth: what was the key to success for Hadoop?
  7. In-house growth (following the Hadoop story): access, access, access… All the data sit in the DFS and it is trivial to use massive compute power → lots of new applications. But… everything has to be MR, so any computation gets cast as a map-only job (MPI, graph processing, streaming, launching web-servers!?!).
  8. Popularization: everybody wants Big Data; insight from raw data is cool. Outside MS and Google, Big-Data == Hadoop, and Hadoop becomes the catch-all big-data solution (and cluster manager).
  9. New challenges? Not just massive in-house clusters: new deployment environments such as small clusters (10s of machines) and the public cloud.
  10. New deployment challenges:
     - Small clusters: efficiency matters more than scalability; admin/tuning done by mere mortals
     - Cloud: untrusted users (security); users are paying (availability, predictability); users are unrelated to each other (performance isolation)
  11.–13. Classic MapReduce (figure-only slides)
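For readers who have not written against it, the programming model these slides depict is the classic Hadoop 1.x word-count pattern. Below is a minimal sketch against the old org.apache.hadoop.mapred API; the class and job names are illustrative, not taken from the talk.

```java
// Minimal sketch of a classic Hadoop 1.x MapReduce job (old org.apache.hadoop.mapred API).
// Names like WordCount / the input and output paths are illustrative only.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);          // emit (word, 1) for every token
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();      // sum all counts for this word
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);   // the job is described by a JobConf...
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);              // ...and submitted to the JobTracker, which
                                         // schedules map/reduce tasks on the TaskTrackers
  }
}
```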
  14. Classic Hadoop (1.0) Architecture (diagram: a Client submits Job1 to the JobTracker, which contains the Scheduler and drives Map and Reduce slots on the TaskTrackers). The JobTracker handles resource management (global invariants such as fairness/capacity; determines who runs, with which resources, and where) and manages the MapReduce application flow (maps before reducers, re-run upon failure, etc.).
  15. Hadoop 1.0 shortcomings: what are the key shortcomings of (old) Hadoop?
  16. Hadoop 1.0 shortcomings (similar to the original MR):
     - Programming-model rigidity: the JobTracker manages both resources and the application workflow (data dependencies)
     - Performance and availability: separate Map vs Reduce slots lead to low cluster utilization (~70%); the JobTracker has too much to do (a scalability concern) and is a single point of failure
  17. General-purpose cluster OS: “Cluster OS (YARN)”
  18. Four proposals:
     - YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)*: request-based central scheduler
     - Mesos (2011, UCB, open-sourced, tested at Twitter)*: offer-based two-level scheduler
     - Omega (2013, Google, simulation?)*: shared-state-based scheduling
     - Corona (2013, Facebook, production): YARN-like but offer-based
     (* all three starred papers won best-paper or best-student-paper awards)
  19. (Stack diagram comparing the Hadoop 1 and Hadoop 2 worlds: in Hadoop 1.x, Hive/Pig and ad-hoc apps sit directly on MapReduce v1 over HDFS 1; in Hadoop 2.x, YARN acts as the cluster OS (resource management) over HDFS 2 and hosts multiple programming models (MR v2, Tez, Giraph, Storm, Dryad, REEF, ...), with Hive/Pig and ad-hoc apps on top.)
  20. YARN (or Hadoop 2.x): a new architecture for Hadoop that decouples resource management from the programming model (MapReduce is an “application” running on YARN).
  21. YARN (Hadoop 2) Architecture (diagram: a Client submits Job1 to the Resource Manager, which contains the Scheduler; NodeManagers host the per-job App Master and its Tasks; the App Master negotiates access to more resources via ResourceRequests).
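To make the ResourceRequest negotiation concrete, here is a minimal sketch of the heartbeat loop an Application Master can run against the YARN client library (AMRMClient/NMClient); the container sizes, priority, and launch command are illustrative assumptions, not values from the talk.

```java
// Sketch of an Application Master negotiating containers with the YARN ResourceManager.
// Resource sizes, priorities, and the launch command below are illustrative only.
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Channel to the ResourceManager (scheduling) and to the NodeManagers (launching).
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    rmClient.registerApplicationMaster("", 0, "");

    // Ask for 4 containers of 1 GB / 1 vcore each (ResourceRequests under the hood).
    Priority priority = Priority.newInstance(0);
    Resource capability = Resource.newInstance(1024, 1);
    for (int i = 0; i < 4; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat loop: the RM grants containers asynchronously across allocate() calls.
    int launched = 0;
    while (launched < 4) {
      AllocateResponse response = rmClient.allocate(launched / 4.0f);
      for (Container container : response.getAllocatedContainers()) {
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            null, null,
            Collections.singletonList("sleep 30"),  // placeholder task command
            null, null, null);
        nmClient.startContainer(container, ctx);
        launched++;
      }
      Thread.sleep(1000);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
  }
}
```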
  22. Why does this matter? Flexibility, performance and availability: multiple programming models; central components do less → scale better; easier high availability (e.g., RM vs AM).

     | System     | Jobs/Day    | Tasks/Day | Cores pegged |
     |------------|-------------|-----------|--------------|
     | Hadoop 1.0 | 77k         | 4M        | 3.2          |
     | YARN       | 125k (150k) | 12M (15M) | 6 (10)       |
  23. Anything else you can think of?
  24. Anything else you can think of? Maintenance, upgrade, and experimentation: run with multiple framework versions (at one time); trying out a new idea is as easy as launching a job.
  25. Real-time OS for the cluster (?): “predictable resource allocation”
  26. Motivation: YARN (and Cosmos, Mesos, Corona) supports instantaneous scheduling invariants (fairness/capacity) and maximizes cluster throughput (with an eye to locality). Current trends: new applications (which require “gang” semantics and dependencies); consolidation of production/test clusters + cloud (SLA jobs mixed with best-effort jobs).
  27. Example use cases: a job/pipeline with SLAs, e.g., 200 CPU-hours by 6am (e.g., Oozie); a service with daily ebbs/flows that reserves capacity accordingly (e.g., Samza); a gang, e.g., “I need 50 concurrent containers for 3 hours” (e.g., Giraph).
  28. High-level goals, in a consolidated cluster: time-based SLAs for production jobs (completion deadline); good latency for best-effort jobs; high cluster utilization/throughput; (support rich applications: gang and skylines).
  29. Divide and conquer time-based SLAs: decompose them into resource definition (via RDL) and predictable resource allocation (planning + scheduling).
  30. Resource Definition Language (RDL) 1/2. Expose application needs to the planner: time, i.e., start (s) and finish (f); resources, i.e., capacity (w), total parallelism (h), minimum parallelism (l), and minimum lease duration (t).
  31. Resource Definition Language (RDL) 2/2. Skylines / pipelines: dependencies among atomic allocations (ALL, ANY, ORDER).
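For concreteness, this is roughly how the gang use case from slide 27 (50 concurrent containers for 3 hours, before a deadline) can be expressed against the YARN ReservationService client API that grew out of this work (YARN-1051, later shipped in Hadoop 2.6). The signatures below follow the Hadoop 2.6-era records and may differ from the prototype in the talk; the queue name and resource sizes are assumptions.

```java
// Hedged sketch: expressing an RDL-style reservation through the YARN ReservationService
// client API (the YARN-1051 work referenced in the talk).
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class GangReservationExample {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    long now = System.currentTimeMillis();
    long arrival  = now;                          // earliest start (s)
    long deadline = now + 6 * 3600 * 1000L;       // finish (f): within 6 hours

    // Atomic allocation: a gang of 50 containers (1 GB / 1 core each) held for 3 hours,
    // with minimum concurrency 50 (all-or-nothing, i.e., gang semantics).
    ReservationRequest gang = ReservationRequest.newInstance(
        Resource.newInstance(1024, 1),  // capability per container
        50,                             // number of containers (h)
        50,                             // minimum concurrency (l)
        3 * 3600 * 1000L);              // lease duration (t)

    // R_ALL / R_ANY / R_ORDER mirror the ALL / ANY / ORDER dependencies of RDL.
    ReservationRequests requests = ReservationRequests.newInstance(
        Collections.singletonList(gang), ReservationRequestInterpreter.R_ALL);

    ReservationDefinition definition =
        ReservationDefinition.newInstance(arrival, deadline, requests, "giraph-gang");

    // Submit against a plan-backed queue (the queue name is an assumption here).
    ReservationSubmissionRequest submission =
        ReservationSubmissionRequest.newInstance(definition, "production");
    yarnClient.submitReservation(submission);
  }
}
```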
  32. Coming up with RDL specs: prediction. Important classes: framework semantics (Perforator modeling of Scope/Hive); machine learning (gang + bounded iterations, PREDict); periodic jobs (history-based resource definition).
  33. Planning vs Scheduling (diagram). Example queue hierarchy: Root 100% = Staging 15% + Production 60% (J1 10%, J2 40%, J3 10%) + Post 5% + Best Effort 20%. Flow: a Resource Definition goes through the ReservationService into Planning (coarse but time-aware), which produces a Plan subject to a Sharing Policy; a Plan Follower feeds the Plan to Scheduling (fine-grained but time-oblivious), with Preemption and a system model / feedback loop inside the Resource Manager.
  34. Some example runs: lots of queues for GridMix; Microsoft pipelines; dynamic queues.
  35. Comparing against the Hadoop CapacityScheduler: improves production job SLAs, best-effort job latency, and cluster utilization and throughput.
  36. Greedy Agent: under-promise, over-deliver; plan for late execution, and run as early as you can.
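As an illustration of the “plan for late execution” half of that slogan (an illustrative sketch of the stated heuristic, not the actual Rayon agent code), a greedy placer can walk backwards from the job's deadline and claim whatever capacity is still free in the plan:

```java
// Illustrative sketch only: a toy "greedy agent" that places a job's total work as late
// as possible before its deadline, given a discretized plan of free capacity per step.
public final class GreedyLatePlacement {

  /**
   * @param freeCapacity free capacity of the plan at each time step (mutated in place)
   * @param arrival      earliest step the job may use (inclusive)
   * @param deadline     latest step the job may use (exclusive)
   * @param totalWork    total capacity-steps the job needs
   * @param maxParallel  per-step cap (the job's maximum parallelism)
   * @return allocation per time step, or null if the demand cannot fit by the deadline
   */
  public static int[] place(int[] freeCapacity, int arrival, int deadline,
                            int totalWork, int maxParallel) {
    int[] allocation = new int[freeCapacity.length];
    int remaining = totalWork;

    // Walk backwards from the deadline so the promised allocation sits as late as possible.
    for (int t = deadline - 1; t >= arrival && remaining > 0; t--) {
      int grab = Math.min(Math.min(freeCapacity[t], maxParallel), remaining);
      allocation[t] = grab;
      freeCapacity[t] -= grab;
      remaining -= grab;
    }
    return remaining == 0 ? allocation : null;  // null: reject or re-negotiate
  }
}
```

At run time the scheduler remains free to hand the job these resources earlier than planned whenever the cluster is idle, which is the “run as early as you can” half.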
  37. Dealing with “reality”:
     - Coping with imperfections (system): compensate RDL based on black-box models of overheads
     - Coping with failures (system): re-plan (move/kill allocations) in response to system-observable resource issues
     - Coping with failures/mispredictions (user): continue in best-effort mode when the reservation expires; re-negotiate existing reservations
  38. User quotas (trading off flexibility for fairness). Sharing policy: CapacityOverTimePolicy constrains the instantaneous maximum and the running average, e.g., no user can exceed an instantaneous 30% allocation, nor an average of 10% over any 24h period; enforced with a single partial scan of the plan: O(|alloc| + |window|).
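Stated a bit more formally (a paraphrase of the slide, writing $u(t)$ for a user's allocation at time $t$, $C$ for the cluster capacity, and $W$ for the averaging window), the policy admits a reservation only if

$$
\forall t:\quad u(t) \;\le\; \sigma_{\max}\, C
\qquad \text{and} \qquad
\frac{1}{W}\int_{t-W}^{t} u(\tau)\, d\tau \;\le\; \sigma_{\mathrm{avg}}\, C,
$$

with $\sigma_{\max} = 0.3$, $\sigma_{\mathrm{avg}} = 0.1$, and $W = 24$h in the example on the slide.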
  39. Conclusion: introduce admission control and time-based SLAs (YARN-1051); a new ReservationService API (to reserve resources); Agents + Plan + SharingPolicy to organize future allocations; leverage the underlying scheduler. Future directions: work with MSR-India on RDL estimates for Hive and MR; advanced agents for placement ($$-based and optimal algorithms); enforcing decisions (Linux Containers, Drawbridge, Pacer).
  40. Cluster stdlib: REEF, “factoring out recurring components”
  41. Focusing on YARN: many applications.
     - Dryad: DAG computations
     - Tez: DAG computations (focus on interactive workloads and Hive support)
     - Storm: stream processing
     - Spark: interactive / in-memory / iterative
     - Giraph: graph processing, Bulk Synchronous Parallel (à la Pregel)
     - Impala: scalable, interactive, SQL-like queries
     - HoYA: HBase on YARN
     - Stratosphere: parallel iterative computations
     - REEF, Weave, Spring-Hadoop: meta-frameworks to help build apps
  42. Are YARN/Mesos/Omega enough? Lots of repeated work: communication, configuration, data and control flow, error handling / fault tolerance. Common “better than Hadoop” tricks: avoid scheduling overheads, control excessive disk IO.
  43. The Challenge (stack diagram: SQL / Hive, Machine Learning, … on top of YARN / HDFS), with call-outs: fault tolerance, row/column storage, high-bandwidth networking.
  44. The Challenge (same stack), with call-outs: fault awareness, local data caching, low-latency networking.
  45. The Challenge (the bare stack again: SQL / Hive, Machine Learning, … directly on YARN / HDFS).
  46. REEF in the Stack (diagram: REEF layered between YARN / HDFS and the frameworks: SQL / Hive, Machine Learning, …).
  47. REEF in the Stack (Future) (diagram: an Operator API and Library plus a Logical Abstraction layer on top of REEF, below SQL / Hive, Machine Learning, …).
  48. REEF (recap of the YARN architecture diagram from slide 21: Client, Resource Manager/Scheduler, NodeManagers, App Master and Tasks, ResourceRequest negotiation).
  49. REEF (diagram: the same YARN skeleton with REEF concepts overlaid). The Driver holds the user's control-flow logic: event-based control flow, fault detection, injection-based checkable configuration. Evaluators host a Task plus services, retain state, and run the user's data-crunching logic. The REEF runtime (REEF RT) provides name-based communication between them.
  50. REEF: computation and data management. Extensible control flow; data management services (storage, network, state management). Glossary: Job Driver = the control-plane implementation, user code executed on YARN’s Application Master; Activity = user code executed within an Evaluator; Evaluator = the execution environment for Activities, with one Evaluator bound to one YARN container.
  51. REEF summary (open-sourced with an Apache license):
     - Control flow is centralized in the Driver: Evaluator and Task configuration and launch
     - Error handling is centralized in the Driver: all exceptions are forwarded to the Driver
     - All APIs are asynchronous
     - Support for caching / checkpointing / group communication
     - Example apps running on REEF: MR, async PageRank, ML regressions, PCA, distributed shell, …
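To give a feel for the Driver-centric model, here is a minimal driver sketch in the style of Apache REEF's HelloREEF example, written against the API as it later stabilized in Apache REEF (where the “Activity” of slide 50 became “Task”); treat names and signatures as approximate for the REEF version shown in the talk.

```java
// Hedged sketch of a REEF Driver, modeled on Apache REEF's HelloREEF example.
import javax.inject.Inject;

import org.apache.reef.driver.evaluator.AllocatedEvaluator;
import org.apache.reef.driver.evaluator.EvaluatorRequest;
import org.apache.reef.driver.evaluator.EvaluatorRequestor;
import org.apache.reef.driver.task.TaskConfiguration;
import org.apache.reef.tang.Configuration;
import org.apache.reef.tang.annotations.Unit;
import org.apache.reef.task.Task;
import org.apache.reef.wake.EventHandler;
import org.apache.reef.wake.time.event.StartTime;

@Unit
public final class HelloDriver {

  private final EvaluatorRequestor requestor;

  @Inject
  private HelloDriver(final EvaluatorRequestor requestor) {
    this.requestor = requestor;   // injected by Tang, REEF's configuration framework
  }

  /** On driver start: ask YARN (through REEF) for one small Evaluator. */
  public final class StartHandler implements EventHandler<StartTime> {
    @Override
    public void onNext(final StartTime startTime) {
      requestor.submit(EvaluatorRequest.newBuilder()
          .setNumber(1)
          .setMemory(128)
          .setNumberOfCores(1)
          .build());
    }
  }

  /** When the Evaluator is allocated: configure and submit a Task to run inside it. */
  public final class EvaluatorAllocatedHandler implements EventHandler<AllocatedEvaluator> {
    @Override
    public void onNext(final AllocatedEvaluator evaluator) {
      final Configuration taskConf = TaskConfiguration.CONF
          .set(TaskConfiguration.IDENTIFIER, "HelloTask")
          .set(TaskConfiguration.TASK, HelloTask.class)
          .build();
      evaluator.submitTask(taskConf);
    }
  }

  /** The user's "data crunching" logic, executed inside the Evaluator. */
  public static final class HelloTask implements Task {
    @Inject
    HelloTask() { }

    @Override
    public byte[] call(final byte[] memento) {
      System.out.println("Hello from a REEF Task!");
      return null;
    }
  }
}
```

All control flow lives in the driver-side event handlers; the Task only runs user logic wherever its Evaluator was allocated, matching the centralization described on slide 51.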
  52. Conclusions: ongoing focus on Big-Data systems. Future work: leverage high-level app semantics; coordinate tiered storage and scheduling.
  53. Adding preemption to YARN, and open-sourcing it to Apache.
  54. State of the art: limited mechanisms to “revise the current schedule”, namely patience or container killing. To enforce global properties you either leave resources fallow (e.g., CapacityScheduler) → low utilization, or kill containers (e.g., FairScheduler) → wasted work. The (old) new trick: support work-preserving preemption via checkpointing → more than preemption.
  55. Changes throughout YARN (diagram: Client, RM/Scheduler, NodeManagers, App Master and Tasks). The RM sends the AM a PreemptionMessage { Strict { Set<ContainerID> }, Flexible { Set<ResourceRequest>, Set<ContainerID> } }. Context of use: outdated information, delayed effects of actions, multi-actor orchestration. The interesting type of preemption is collaborative: the RM issues a declarative request and the AM binds it to containers, with policy-based binding for Flexible preemption requests.
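A rough sketch of how an Application Master might react to such a message in its allocate() heartbeat, using the YARN records that carry this protocol (PreemptionMessage, StrictPreemptionContract, PreemptionContract); the checkpoint-then-release policy shown is illustrative, not the MapReduce AM's actual implementation.

```java
// Hedged sketch: an AM inspecting the PreemptionMessage piggybacked on the allocate() response.
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionContract;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.api.records.StrictPreemptionContract;

public final class PreemptionHandler {

  public void onAllocateResponse(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null) {
      return;  // nothing to preempt this heartbeat
    }

    // Strict contract: these containers will be taken away regardless; save state now.
    StrictPreemptionContract strict = msg.getStrictContract();
    if (strict != null) {
      for (PreemptionContainer pc : strict.getContainers()) {
        checkpointAndRelease(pc.getId());
      }
    }

    // Flexible contract: the RM states how much it needs back (ResourceRequests plus
    // candidate containers); the AM picks which of its own containers to give up.
    PreemptionContract flexible = msg.getContract();
    if (flexible != null) {
      for (PreemptionContainer candidate : flexible.getContainers()) {
        // Illustrative policy: prefer candidates whose tasks have made the least progress.
        if (hasLittleProgress(candidate.getId())) {
          checkpointAndRelease(candidate.getId());
        }
      }
    }
  }

  // Placeholders for application-specific logic (e.g., the MR checkpoint service).
  private void checkpointAndRelease(ContainerId id) { /* save task state, then stop it */ }
  private boolean hasLittleProgress(ContainerId id) { return true; }
}
```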
  56. Changes throughout YARN (diagram: the MR AM and its Tasks on the NodeManagers). When can I preempt? Tag safe UDFs or user-saved state: @Preemptable public class MyReducer { … }. Common checkpoint service: WriteChannel cwc = cs.create(); cwc.write(…state…); CheckpointID cid = cs.commit(cwc); ReadChannel crc = cs.open(cid);
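Putting the two snippets together, a task could react to a preemption signal roughly like this; the CheckpointService interface below is a hypothetical stand-in mirroring the create/commit/open calls on the slide, not the exact API contributed in the MAPREDUCE JIRAs.

```java
// Hypothetical sketch mirroring the slide's checkpoint calls; the CheckpointService
// interface here is a stand-in, not the exact API from the MAPREDUCE JIRAs.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

interface CheckpointService {
  WritableByteChannel create() throws IOException;                     // start a new checkpoint
  CheckpointID commit(WritableByteChannel channel) throws IOException; // seal it, get an ID
  ReadableByteChannel open(CheckpointID id) throws IOException;        // reopen it on restart
}

interface CheckpointID { }   // opaque handle persisted by the framework

// Marker annotation from the slide: this task's state is safe to checkpoint and its
// code safe to re-run after preemption.
@interface Preemptable { }

@Preemptable
final class MyPreemptableReducer {
  private final CheckpointService cs;
  private ByteBuffer partialState = ByteBuffer.allocate(4096); // running aggregation state

  MyPreemptableReducer(CheckpointService cs) {
    this.cs = cs;
  }

  /** Called when the AM decides to honor a preemption request for this task. */
  CheckpointID onPreempt() throws IOException {
    WritableByteChannel cwc = cs.create();
    partialState.flip();
    cwc.write(partialState);          // persist the partial reduce state
    return cs.commit(cwc);            // the returned ID is handed back to the AM
  }

  /** Called when the task is restarted elsewhere with the earlier checkpoint. */
  void onRestart(CheckpointID cid) throws IOException {
    ReadableByteChannel crc = cs.open(cid);
    partialState.clear();
    crc.read(partialState);           // resume from the saved state instead of rerunning
  }
}
```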
  57. (Chart: memory utilization over time for three configurations: CapacityScheduler + unreservation + preemption; CapacityScheduler allowing overcapacity; CapacityScheduler with no overcapacity.)
  58. (Metapoint) Experience contributing to Apache (architecture diagram annotated with the relevant JIRAs: YARN-569, MR-5176, MR-5189, MR-5192, MR-5194, MR-5196, MR-5197). Engaging with OSS: talk with active developers; show early/partial work; small patches; it is OK to leave things unfinished.
  59. Tagging UDFs: with @Preemptable we tag imperative code with a semantic property. Generalize this trick: expose semantic properties to the platform (e.g., @PreserveSortOrder) and allow the platform to optimize execution (e.g., map-reduce pipelining). REEF seems the logical place to do this.
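As a tiny illustration of the idea (an example, not code from the talk), a platform can declare such semantic tags as runtime-visible annotations and check them reflectively before deciding on an optimization:

```java
// Illustrative example of semantic tagging: annotations the platform can query at run time
// before applying optimizations such as preemption or operator pipelining.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Preemptable { }           // safe to checkpoint/kill and re-run

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PreserveSortOrder { }     // output order matches input order: enables pipelining

@Preemptable
@PreserveSortOrder
class MyReducer { /* user's imperative UDF code */ }

class Platform {
  static void schedule(Class<?> udf) {
    if (udf.isAnnotationPresent(Preemptable.class)) {
      System.out.println(udf.getSimpleName() + ": eligible for work-preserving preemption");
    }
    if (udf.isAnnotationPresent(PreserveSortOrder.class)) {
      System.out.println(udf.getSimpleName() + ": eligible for map-reduce pipelining");
    }
  }

  public static void main(String[] args) {
    schedule(MyReducer.class);
  }
}
```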
  60. Other uses for checkpointing: a (basic) building block that enables efficient preemption, dynamic optimizations (task splitting, efficiency improvements), and fault tolerance.