The Evolution of Big Data Frameworks


The talk presents the evolution of Big-Data systems from single-purpose MapReduce frameworks to fully general computational infrastructures. In particular, I will follow the evolution of Hadoop, and show the benefits and challenges of a new architectural paradigm that decouples the resource management component (YARN) from the specifics of the application frameworks (e.g., MapReduce, Tez, REEF, Giraph, Naiad, Dryad, Spark, ...). We argue that besides the primary goals of increasing scalability and programming model flexibility, this transformation dramatically facilitates innovation.

In this context, I will present some of our contributions to the evolution of Hadoop (namely: work-preserving preemption, and predictable resource allocation), and comment on the fascinating experience of working on open-source technologies from within Microsoft. The current Hadoop APIs (HDFS and YARN) provide the cluster equivalent of an OS API. With this as a backdrop, I will present our attempt to create the equivalent of a stdlib for the cluster: the REEF project.

Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL leading the Relational Cloud project. He worked at Yahoo! Research as a Research Scientist focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the Cloud and Information Services Lab (CISL), where he works on big-data platforms and cloud computing.


Transcript

  • 1. From the dive bars of Silicon Valley to the World Tour Carlo Curino
  • 2. PhD + PostDoc in databases “we did all of this 30 years ago!” Yahoo! Research “webscale, webscale, webscale!” Microsoft – CISL “enterprise + cloud + search engine + big-data” My perspective
  • 3. Cluster as an embedded system (map-reduce): single-purpose clusters General-purpose cluster OS (YARN, Mesos, Omega, Corona): standardizing access to computational resources Real-time OS for the cluster!? (Rayon): predictable resource allocation Cluster stdlib (REEF): factoring out common functionalities Agenda
  • 4. Cluster as an Embedded System “the era of map-reduce-only clusters”
  • 5. Purpose-built technology Within large web companies Well-targeted mission (process the web crawl) → scale and fault tolerance The origin Google leading the pack Google File System + MapReduce (2003/2004) Open-source and parallel efforts Yahoo! Hadoop ecosystem HDFS + MR (2006/2007) Microsoft Scope/Cosmos (2008) (more than MR)
  • 6. In-house growth What was the key to success for Hadoop?
  • 7. In-house growth (following the Hadoop story) Access, access, access… All the data sits in the DFS Trivial to use massive compute power → lots of new applications But… everything has to be MR Cast any computation as a map-only job MPI, graph processing, streaming, launching web-servers!?!
  • 8. Popularization Everybody wants Big Data Insight from raw data is cool Outside MS and Google, Big-Data == Hadoop Hadoop as catch-all big-data solution (and cluster manager)
  • 9. Not just massive in-house clusters New challenges? New deployment environments Small clusters (10s of machines) Public Cloud
  • 10. New deployment challenges Small clusters Efficiency matters more than scalability Admin/tuning done by mere mortals Cloud Untrusted users (security) Users are paying (availability, predictability) Users are unrelated to each other (performance isolation)
  • 11. Classic MapReduce
  • 12. Classic MapReduce
  • 13. Classic MapReduce
  • 14. Classic Hadoop (1.0) Architecture [diagram: a Client submits Job1 to the JobTracker (Scheduler), which assigns Map and Reduce slots on the TaskTrackers] The JobTracker handles resource management Global invariants (fairness/capacity) Determines who runs, with which resources, and where Manages the MapReduce application flow (maps before reducers, re-run upon failure, etc.)
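For concreteness, a minimal sketch of how a job is handed to this architecture through the classic org.apache.hadoop.mapred API (input/output paths are arbitrary placeholders, and the identity mapper/reducer stand in for real user code); the JobTracker both grants the slots and drives the job's map/reduce workflow.

// Sketch: submitting a job to Hadoop 1.x through the classic "mapred" API.
// The JobTracker both schedules the slots and manages the map/reduce workflow.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ClassicJobSubmission {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ClassicJobSubmission.class);
    conf.setJobName("identity-passthrough");
    conf.setMapperClass(IdentityMapper.class);    // user map logic would go here
    conf.setReducerClass(IdentityReducer.class);  // user reduce logic would go here
    conf.setOutputKeyClass(LongWritable.class);   // TextInputFormat keys: byte offsets
    conf.setOutputValueClass(Text.class);         // TextInputFormat values: lines
    FileInputFormat.setInputPaths(conf, new Path("/data/webcrawl")); // placeholder path
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));     // placeholder path
    JobClient.runJob(conf); // blocks until the JobTracker reports completion
  }
}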
  • 15. What are the key shortcomings of (old) Hadoop? Hadoop 1.0 Shortcomings
  • 16. Programming model rigidity JobTracker manages resources JobTracker manages application workflow (data dependencies) Performance and Availability Map vs Reduce slots lead to low cluster utilization (~70%) JobTracker had too much to do: scalability concern JobTracker is a single point of failure Hadoop 1.0 Shortcomings (similar to original MR)
  • 17. General-purpose cluster OS “Cluster OS (YARN)”
  • 18. YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)* Request-based central scheduler Mesos (2011, UCB, open-sourced, tested at Twitter)* Offer-based two-level scheduler Omega (2013, Google, simulation?)* Shared-state-based scheduling Corona (2013, Facebook, production) YARN-like but offer-based Four proposals * the first three were best-paper or best-student-paper awards
  • 19. [diagram contrasting the Hadoop 1 world and the Hadoop 2 world, layer by layer: Users; Programming Model(s) / Application Frameworks (Hive / Pig on MR v1 vs. Hive / Pig on MR v2, Tez, Giraph, Storm, Dryad, REEF, ... plus ad-hoc apps); Cluster OS / Resource Management (Hadoop 1.x MapReduce vs. YARN); File System (HDFS 1 vs. HDFS 2); Hardware]
  • 20. A new architecture for Hadoop Decouples resource management from programming model (MapReduce is an “application” running on YARN) YARN (or Hadoop 2.x)
  • 21. YARN (Hadoop 2) Architecture [diagram: a Client submits Job1 to the Resource Manager (Scheduler); NodeManagers host the per-job App Master and its Tasks; the App Master negotiates access to more resources via ResourceRequests]
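A sketch of that negotiation from the ApplicationMaster's side, using the AMRMClient helper that ships with YARN; sizes, counts and the empty registration arguments are arbitrary placeholders, and launching work on the granted containers (via NMClient) is omitted.

// Sketch: an ApplicationMaster negotiating containers with the ResourceManager
// through YARN's AMRMClient. Memory, vcores and priority values are arbitrary.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // host/port/tracking URL omitted

    // Ask the scheduler for 4 containers of 1 GB / 1 vcore each (a ResourceRequest).
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 4; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // The RM grants containers asynchronously over successive allocate() heartbeats.
    AllocateResponse response = rmClient.allocate(0.1f);
    for (Container c : response.getAllocatedContainers()) {
      // Application-specific work would be launched on each container via NMClient.
      System.out.println("Granted container " + c.getId() + " on " + c.getNodeId());
    }
  }
}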
  • 22. Flexibility, Performance and Availability Multiple programming models Central components do less → scale better Easier high availability (e.g., RM vs AM) Why does this matter? Hadoop 1.0: 77k jobs/day, 4M tasks/day, 3.2 cores pegged; YARN: 125k (150k) jobs/day, 12M (15M) tasks/day, 6 (10) cores pegged
  • 23. Anything else you can think of?
  • 24. Maintenance, Upgrade, and Experimentation Run with multiple framework versions (at one time) Trying out a new idea is as easy as launching a job Anything else you can think of?
  • 25. Real-time OS for the cluster (?) “predictable resource allocation”
  • 26. YARN (like Cosmos, Mesos, and Corona) supports instantaneous scheduling invariants (fairness/capacity) and maximizes cluster throughput (with an eye to locality) Current trends New applications (require “gang” and dependencies) Consolidation of production/test clusters + Cloud (SLA jobs mixed with best-effort jobs) Motivation
  • 27. Job/Pipeline with SLAs: 200 CPU hours by 6am (e.g., Oozie) Service: daily ebb/flows, reserve capacity accordingly (e.g., Samza) Gang: I need 50 concurrent containers for 3 hours (e.g., Giraph) Example Use Cases
  • 28. In a consolidated cluster: Time-based SLAs for production jobs (completion deadline) Good latency for best-effort jobs High cluster utilization/throughput (Support rich applications: gang and skylines) High-Level Goals
  • 29. Decompose time-based SLAs into a resource definition (via RDL) and predictable resource allocation (planning + scheduling) Divide and Conquer time-based SLAs
  • 30. Expose application needs to the planner time: start (s), finish (f) resources: capacity (w), total parallelism (h), minimum parallelism (l), min lease duration (t) Resource Definition Language (RDL) 1/2
  • 31. Skylines / pipelines: dependencies: among atomic allocations (ALL, ANY, ORDER) Resource Definition Language (RDL) 2/2
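RDL itself is internal to this work, but it maps closely onto the ReservationDefinition / ReservationRequests records that the effort later contributed to YARN (YARN-1051, mentioned below). The sketch below expresses a two-stage pipeline in that style, with arbitrary times and sizes; it approximates the talk's RDL rather than reproducing it.

// Sketch: an RDL-like reservation expressed with the reservation records that this
// line of work later contributed to YARN (YARN-1051). Times and sizes are arbitrary.
import java.util.Arrays;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;

public class RdlSketch {
  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long start = now;                       // s: earliest start
    long deadline = now + 6 * 3600 * 1000L; // f: completion deadline (6 hours from now)

    // Stage 1: a gang of 50 concurrent 1GB/1core containers for one hour
    // (capacity w, parallelism h = minimum parallelism l = 50, min lease t = 1h).
    ReservationRequest gang =
        ReservationRequest.newInstance(Resource.newInstance(1024, 1), 50, 50, 3600 * 1000L);
    // Stage 2: 100 container-hours of malleable work (minimum parallelism of 1).
    ReservationRequest malleable =
        ReservationRequest.newInstance(Resource.newInstance(1024, 1), 100, 1, 3600 * 1000L);

    // ORDER: stage 2 may only be placed after stage 1 (a pipeline dependency).
    ReservationRequests stages = ReservationRequests.newInstance(
        Arrays.asList(gang, malleable), ReservationRequestInterpreter.R_ORDER);

    ReservationDefinition rdl =
        ReservationDefinition.newInstance(start, deadline, stages, "nightly-pipeline");
    // A client would hand this to the planner (e.g., via YarnClient#submitReservation)
    // and then run its jobs against the reserved queue once it is admitted.
    System.out.println(rdl);
  }
}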
  • 32. Important classes Framework semantics: Perforator modeling of Scope/Hive Machine Learning: gang + bounded iterations (PREDict) Periodic jobs: history-based resource definition Coming up with RDL specs: prediction
  • 33. Planning vs Scheduling [diagram: a ReservationService feeds a time-aware Plan over a queue hierarchy (Root 100%: Production 60% with jobs J1 10% / J2 40% / J3 10%, Staging 15%, Post 5%, Best Effort 20%); Resource Definition → Planning (coarse but time-aware) → Plan Follower + Sharing Policy → Scheduling (fine-grained but time-oblivious) in the Resource Manager, with preemption and a system model / feedback loop]
  • 34. Some example runs: lots of queues for GridMix, Microsoft pipelines, dynamic queues
  • 35. Improves production-job SLAs, best-effort job latency, and cluster utilization and throughput Comparing against the Hadoop CapacityScheduler
  • 36. Under-promise, over-deliver Plan for late execution, and run as early as you can Greedy Agent GB
  • 37. Coping with imperfections (system) compensate RDL based on black-box models of overheads Coping with failures (system) re-plan (move/kill allocations) in response to system-observable resource issues Coping with failures/mispredictions (user) continue in best-effort mode when a reservation expires re-negotiate existing reservations Dealing with “Reality”
  • 38. Sharing Policy: CapacityOverTimePolicy constrains the instantaneous max and the running average, e.g., no user can exceed an instantaneous 30% allocation, or an average of 10% in any 24h period Computed in a single partial scan of the plan: O(|alloc| + |window|) User Quotas (trading flexibility for fairness)
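A toy sketch of the invariant this policy enforces (not the actual CapacityOverTimePolicy code): one user's plan must stay under an instantaneous cap at every step and under an average cap over every sliding window, checked in one pass over the plan.

// Toy sketch of the CapacityOverTimePolicy invariant: every time step must respect an
// instantaneous cap, and every sliding window of W steps must respect an average cap.
// One pass over the plan plus the window: O(|alloc| + |window|). Illustrative only.
public final class OverTimeQuotaCheck {

  // userAlloc:  user's planned allocation per time step, as a fraction of the cluster
  // instantCap: e.g., 0.30 -> never more than 30% at any instant
  // avgCap:     e.g., 0.10 -> never more than 10% on average over the window
  // window:     number of time steps in the averaging window (e.g., 24h worth of steps)
  public static boolean satisfies(double[] userAlloc, double instantCap, double avgCap, int window) {
    double windowSum = 0.0;
    for (int t = 0; t < userAlloc.length; t++) {
      if (userAlloc[t] > instantCap) {
        return false;                       // violates the instantaneous constraint
      }
      windowSum += userAlloc[t];
      if (t >= window) {
        windowSum -= userAlloc[t - window]; // slide the window forward
      }
      int steps = Math.min(t + 1, window);
      if (windowSum / steps > avgCap) {
        return false;                       // violates the running-average constraint
      }
    }
    return true;
  }

  public static void main(String[] args) {
    double[] plan = {0.05, 0.08, 0.10, 0.05, 0.02};
    System.out.println(satisfies(plan, 0.30, 0.10, 24)); // true for this small example
  }
}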
  • 39. Introduce Admission Control and Time-based SLAs (YARN-1051) New ReservationService API (to reserve resources) Agents + Plan + SharingPolicy to organize future allocations Leverage underlying scheduler Future Directions Work with MSR-India on RDL estimates for Hive and MR Advanced agents for placement ($$-based and optimal algos) Enforcing decisions (Linux Containers, Drawbridge, Pacer) Conclusion
  • 40. Cluster stdlib: REEF “factoring out recurring components”
  • 41. Dryad DAG computations Tez DAG computations (focus on interactive and Hive support) Storm stream processing Spark interactive / in-memory / iterative Giraph graph processing, Bulk Synchronous Parallel (à la Pregel) Impala scalable, interactive, SQL-like queries HoYA HBase on YARN Stratosphere parallel iterative computations REEF, Weave, Spring-Hadoop meta-frameworks to help build apps Focusing on YARN: many applications
  • 42. Lots of repeated work Communication Configuration Data and control flow Error handling / fault tolerance Common “better than Hadoop” tricks: avoid scheduling overheads, control excessive disk IO Are YARN/Mesos/Omega enough?
  • 43. The Challenge [stack diagram: SQL / Hive, Machine Learning, … on top of YARN / HDFS] Fault tolerance, row/column storage, high-bandwidth networking
  • 44. The Challenge [same stack] Fault awareness, local data caching, low-latency networking
  • 45. The Challenge [the same stack: SQL / Hive, Machine Learning, … on top of YARN / HDFS]
  • 46. REEF in the Stack [REEF sits between YARN / HDFS and the application frameworks (SQL / Hive, Machine Learning, …)]
  • 47. REEF in the Stack (Future) [an Operator API and Library plus a Logical Abstraction on top of REEF]
  • 48. REEF [recap of the YARN architecture: a Client submits Job1 to the Resource Manager (Scheduler); NodeManagers host the App Master and Tasks; the App Master negotiates access to more resources via ResourceRequests]
  • 49. REEF [the same cluster seen through REEF: NodeManagers host Evaluators running Tasks and services, coordinated by the REEF runtime and the Driver] Driver: user control-flow logic, retains state, name-based communication, fault detection Evaluator/Task: user data-crunching logic Injection-based checkable configuration Event-based control flow
  • 50. REEF: Computation and Data Management Extensible control flow Data management Services Storage Network State management Job Driver: control-plane implementation, user code executed on YARN’s Application Master Activity: user code executed within an Evaluator Evaluator: execution environment for Activities; one Evaluator is bound to one YARN Container
  • 51. Control flow is centralized in the Driver Evaluator and Task configuration and launch Error handling is centralized in the Driver All exceptions are forwarded to the Driver All APIs are asynchronous Support for: caching / checkpointing / group communication Example apps running on REEF MR, async PageRank, ML regressions, PCA, distributed shell, … REEF Summary (open-sourced with an Apache License)
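A sketch of what this model looks like to the user, loosely based on the API that Apache REEF later stabilized (what the slides call an Activity became a Task in the Apache version; class and package names here follow the Apache release and may differ from the version discussed in the talk).

// Sketch: REEF's split between the Driver (control flow, reacting to asynchronous
// events) and the Task (data-crunching logic running inside an Evaluator).
import javax.inject.Inject;
import org.apache.reef.driver.evaluator.AllocatedEvaluator;
import org.apache.reef.driver.evaluator.EvaluatorRequest;
import org.apache.reef.driver.evaluator.EvaluatorRequestor;
import org.apache.reef.driver.task.TaskConfiguration;
import org.apache.reef.tang.annotations.Unit;
import org.apache.reef.task.Task;
import org.apache.reef.wake.EventHandler;
import org.apache.reef.wake.time.event.StartTime;

@Unit
final class SketchDriver {
  private final EvaluatorRequestor requestor;

  @Inject
  SketchDriver(final EvaluatorRequestor requestor) {
    this.requestor = requestor;
  }

  // On driver start: ask the resource manager (via REEF) for one small Evaluator.
  final class StartHandler implements EventHandler<StartTime> {
    @Override
    public void onNext(final StartTime startTime) {
      requestor.submit(EvaluatorRequest.newBuilder()
          .setNumber(1)
          .setMemory(128)
          .setNumberOfCores(1)
          .build());
    }
  }

  // When an Evaluator is allocated: submit a Task to it (Tang-checked configuration).
  final class EvaluatorAllocatedHandler implements EventHandler<AllocatedEvaluator> {
    @Override
    public void onNext(final AllocatedEvaluator evaluator) {
      evaluator.submitTask(TaskConfiguration.CONF
          .set(TaskConfiguration.IDENTIFIER, "sketch-task")
          .set(TaskConfiguration.TASK, SketchTask.class)
          .build());
    }
  }
}

// The user's data-crunching logic, executed inside the Evaluator.
final class SketchTask implements Task {
  @Inject
  SketchTask() {}

  @Override
  public byte[] call(final byte[] memento) {
    return "hello from an Evaluator".getBytes();
  }
}

In the Apache API these handlers get bound to the runtime through a DriverConfiguration (e.g., ON_DRIVER_STARTED, ON_EVALUATOR_ALLOCATED), which Tang validates before launch; that is what the slides mean by injection-based checkable configuration.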
  • 52. Big-Data Systems Ongoing focus Future work Leverage high-level app semantics Coordinate tiered-storage and scheduling Conclusions
  • 53. Adding Preemption to YARN, and open-sourcing it to Apache
  • 54. Limited mechanisms to “revise the current schedule” Patience Container killing To enforce global properties Leave resources fallow (e.g., CapacityScheduler) → low utilization Kill containers (e.g., FairScheduler) → wasted work (Old) new trick Support work-preserving preemption (via checkpointing) → more than preemption State of the Art
  • 55. Changes throughout YARN [diagram: the RM sends the App Master a PreemptionMessage { Strict { Set<ContainerID> } Flexible { Set<ResourceRequest>, Set<ContainerID> } }] Collaborative application Policy-based binding for Flexible preemption requests Use of preemption Context: outdated information, delayed effects of actions, multi-actor orchestration Interesting type of preemption: the RM makes a declarative request, the AM binds it to containers
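A sketch of how a collaborative ApplicationMaster sees this message in its allocate() heartbeat; the real MapReduce AM applies a richer policy when choosing which containers to yield, whereas this sketch only logs the request.

// Sketch: observing a PreemptionMessage in the AM's allocate() heartbeat response.
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionContract;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.api.records.StrictPreemptionContract;

public final class PreemptionHandlingSketch {

  // Called on every AM heartbeat response.
  static void onAllocateResponse(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null) {
      return; // nothing to preempt this round
    }
    // Strict part: these exact containers will be reclaimed; checkpoint and release them.
    StrictPreemptionContract strict = msg.getStrictContract();
    if (strict != null) {
      for (PreemptionContainer c : strict.getContainers()) {
        System.out.println("Must yield container " + c.getId() + " (checkpoint its task)");
      }
    }
    // Flexible part: the RM states declaratively how much it needs back (as
    // ResourceRequests) plus candidate containers; the AM decides which to sacrifice.
    PreemptionContract flexible = msg.getContract();
    if (flexible != null) {
      System.out.println("RM asks back: " + flexible.getResourceRequest());
      for (PreemptionContainer c : flexible.getContainers()) {
        System.out.println("Candidate to yield: " + c.getId());
      }
    }
  }
}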
  • 56. Changes throughout YARN [diagram: the MR AM and its Tasks coordinate with a common Checkpoint Service] When can I preempt? Tag safe UDFs or user-saved state: @Preemptable public class MyReducer { … } Common Checkpoint Service: WriteChannel cwc = cs.create(); cwc.write(…state…); CheckpointID cid = cs.commit(cwc); ReadChannel crc = cs.open(cid);
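Expanding the slide's snippet into a self-contained sketch: the interfaces below merely paraphrase the calls shown on the slide (create / write / commit / open) and are not the exact signatures of the checkpoint service shipped with the preemption patches.

// Sketch of the checkpoint-service pattern behind work-preserving preemption.
// Interfaces paraphrase the slide; the real service may differ in names/signatures.
import java.io.IOException;
import java.nio.ByteBuffer;

interface CheckpointID {}                       // opaque handle to a saved checkpoint

interface WriteChannel {
  void write(ByteBuffer state) throws IOException;
}

interface ReadChannel {
  int read(ByteBuffer into) throws IOException;
}

interface CheckpointService {
  WriteChannel create() throws IOException;                 // start a new checkpoint
  CheckpointID commit(WriteChannel ch) throws IOException;  // seal it, get a handle
  ReadChannel open(CheckpointID id) throws IOException;     // reopen it after restart
}

// A reducer-like task that saves its partial state when told it is about to be preempted.
final class PreemptableWorker {
  private final CheckpointService cs;

  PreemptableWorker(CheckpointService cs) {
    this.cs = cs;
  }

  // Called when the AM forwards a preemption request for this container.
  CheckpointID onPreempt(ByteBuffer partialState) throws IOException {
    WriteChannel cwc = cs.create();
    cwc.write(partialState);   // persist user-saved state
    return cs.commit(cwc);     // the CheckpointID travels back to the AM
  }

  // Called when the task is rescheduled: resume from where it left off.
  void onRestart(CheckpointID cid, ByteBuffer into) throws IOException {
    ReadChannel crc = cs.open(cid);
    crc.read(into);
  }
}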
  • 57. CapacityScheduler + Unreservation + Preemption: memory utilization [plots comparing the CapacityScheduler with overcapacity allowed vs. no overcapacity]
  • 58. (Metapoint) Experience contributing to Apache [the preemption work spans many patches across the stack: MR-5176, MR-5189, MR-5192, MR-5194, MR-5196, MR-5197, YARN-569] Engaging with OSS: talk with active developers, show early/partial work, small patches, OK to leave things unfinished
  • 59. With @Preemptable we tag imperative code with a semantic property Generalize this trick: expose semantic properties to the platform (@PreserveSortOrder) allow the platform to optimize execution (map-reduce pipelining) REEF seems the logical place to do this. Tagging UDFs
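A sketch of what such tags amount to in practice: plain Java annotations, retained at runtime, that the platform can discover by reflection before deciding whether a UDF can be checkpointed/preempted or its output pipelined. @Preemptable is the tag from the preemption work; @PreserveSortOrder is the illustrative tag named on the slide.

// Sketch: semantic tags as runtime-retained annotations discoverable by reflection.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Preemptable {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PreserveSortOrder {}

@Preemptable
@PreserveSortOrder
class MyReducer { /* user-defined reduce logic */ }

class PlatformSketch {
  public static void main(String[] args) {
    // The platform checks the tags before deciding to checkpoint/preempt or to pipeline.
    boolean safeToPreempt = MyReducer.class.isAnnotationPresent(Preemptable.class);
    boolean keepsSortOrder = MyReducer.class.isAnnotationPresent(PreserveSortOrder.class);
    System.out.println("preemptable=" + safeToPreempt + ", preservesSortOrder=" + keepsSortOrder);
  }
}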
  • 60. A (basic) building block that enables: efficient preemption Dynamic optimizations (task splitting, efficiency improvements) Fault tolerance Other uses for Checkpointing