1. Graphene – Microsoft
SCOPE on Tez
Hitesh Sharma (Principal Software Eng. Manager)
Anupam (Senior Software Engineer)
2. Agenda
• Overview of SCOPE and Cosmos
• SCOPE Job Manager responsibilities
• Design of Graphene
• Features required in Tez
3. Cosmos Environment
• A Microsoft-internal platform for building big-data applications
• Available externally as Azure Data Lake Analytics
• Enable customers to transform data of any scale into new business assets easily, at low cost, in the cloud
4. Cosmos: World’s Biggest YARN Cluster!
• Single DC > 40K machines; multiple DCs
• > 500,000 jobs/day
• ~3 billion containers/day
• High avg. CPU utilization
• Three nines
• Exabytes in storage
• 100s of PB processed/day
• Exabytes of data moved
5. SCOPE
• Scripting language for Cosmos
• Influenced by SQL and relational concepts
• Works great with C# and .NET
• Very extensible
• Auto scale
• Naturally parallelizable computation
• Lowers the barrier to writing efficient programs

Sample script:
RawData =
    EXTRACT Clicks:int,
            Domain:string
    FROM @"RAWWEBDATA.TSV"
    USING DefaultTextExtractor();

WebData =
    SELECT *,
           Domain.Trim().ToUpper() AS NormalizedDomain
    FROM RawData;

OUTPUT WebData
    TO "WEBDATA.TSV"
    USING DefaultTextOutputter();
7. Job scale
• A single job can consume > 1 PB of data
• > 15,000 concurrent tasks (degree of parallelism)
• Thousands of vertices
• DAGs can be very wide, very deep, or both
• Millions of tasks in a job
• Billions of edges
8. Job Manager
• DAG execution
  • Builds the execution graph
  • Topologically executes the DAG (see the sketch below)
  • Keeps track of the state of the job/vertices
• Dynamic DAG updates
  • Rack-level aggregation
  • Broadcast tree
• Fault tolerance
  • Handles failures and does revocations
  • Detects and mitigates outliers
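The "topologically executes the DAG" bullet is the core loop: a vertex is scheduled only once all of its input vertices have completed (the editor's notes below describe the same rule). Here is a minimal, hypothetical Java sketch of that traversal; Vertex and the execute hook are illustrative types, not Job Manager internals.

import java.util.*;

class TopoScheduler {
    static class Vertex {
        final String name;
        final List<Vertex> downstream = new ArrayList<>();
        int pendingInputs; // upstream vertices not yet complete
        Vertex(String name) { this.name = name; }
    }

    // Kahn-style traversal: start with source vertices, release a
    // downstream vertex when its last input finishes.
    void run(Collection<Vertex> vertices) {
        Deque<Vertex> ready = new ArrayDeque<>();
        for (Vertex v : vertices) {
            if (v.pendingInputs == 0) ready.add(v); // sources are runnable
        }
        while (!ready.isEmpty()) {
            Vertex v = ready.poll();
            execute(v); // launch this vertex's tasks on the cluster
            for (Vertex next : v.downstream) {
                if (--next.pendingInputs == 0) ready.add(next); // last input done
            }
        }
    }

    void execute(Vertex v) { System.out.println("running " + v.name); }
}

Releasing vertices as their last input completes, rather than sorting the whole graph up front, is what keeps very wide and very deep DAGs progressing incrementally.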
9. Job Manager
• Scheduling
  • Keeps track of cluster resources
  • Distributed scheduler
  • Requests bonus or opportunistic containers to increase utilization (see the sketch below)
  • Can upgrade opportunistic containers
• Container reuse
  • Opportunistic containers present some interesting choices for reuse
  • Tricky to implement
[Chart: number of parallel containers over time against the max-containers-allowed ceiling]
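For context, YARN (Hadoop 2.9+) lets an AM mark a container request as OPPORTUNISTIC: the container is queued at a node manager and runs only on spare capacity. The following is a hedged Java sketch of such a request, not Graphene's actual scheduler code; the exact ContainerRequest constructor should be verified against your Hadoop version.

import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class OpportunisticRequestSketch {
    // Ask the RM for one opportunistic container: it is queued at a node
    // manager and runs only when spare capacity exists, which is how an AM
    // can push utilization beyond its guaranteed share.
    static void requestOne(AMRMClient<AMRMClient.ContainerRequest> client) {
        Resource capability = Resource.newInstance(6 * 1024, 2); // ~1 token: 6 GB, 2 cores
        AMRMClient.ContainerRequest req = new AMRMClient.ContainerRequest(
                capability,
                null, null,                 // no node/rack preference
                Priority.newInstance(1),
                0L,                         // allocationRequestId
                true,                       // relaxLocality
                null,                       // node-label expression
                ExecutionTypeRequest.newInstance(
                        ExecutionType.OPPORTUNISTIC, true /* enforce type */));
        client.addContainerRequest(req);
    }
}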
10. Job Manager
• Finalization
  • Concatenate final outputs
  • Metadata operations
• Tooling
  • Near real-time feedback
  • Finding the critical path (see the sketch below)
  • Structured error reporting
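"Finding the critical path" can be done after the fact by walking back from the last-finishing task through, at each hop, the input that finished latest. A small hypothetical Java sketch; Task is an illustrative type, not actual tooling code.

import java.util.*;

class CriticalPath {
    static class Task {
        final String id;
        final long endTime; // wall-clock completion time
        final List<Task> inputs = new ArrayList<>();
        Task(String id, long endTime) { this.id = id; this.endTime = endTime; }
    }

    // Walk back from the last-finishing task, always following the input
    // that finished latest; reversing the walk yields the critical path.
    static List<Task> find(List<Task> tasks) {
        Task cur = Collections.max(tasks,
                Comparator.comparingLong((Task t) -> t.endTime));
        List<Task> path = new ArrayList<>();
        while (cur != null) {
            path.add(cur);
            cur = cur.inputs.stream()
                    .max(Comparator.comparingLong((Task t) -> t.endTime))
                    .orElse(null);
        }
        Collections.reverse(path);
        return path;
    }
}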
11. Current Challenges
• Higher cost of ownership
• No AM recovery
• Tied to Cosmos infrastructure
• Memory inefficient
• No native support for interactive workloads
15. Graphene – Integration Points
• Algebra: consume the output of compilation to generate the DAG
• Engine: launch and communicate with the ScopeEngine
• Tooling: produce status, debugging, and error details for existing tooling
• Store: interact with the storage layer
16. Graphene – Application Master
[Architecture diagram: inside the GRAPHENE AM, a GrapheneDAGAppMaster wraps Tez's DAGAppMaster and DAGImpl. A DAG Converter consumes the SCOPE algebra and produces the Tez DAG; custom Input Initializers and Edge/Vertex Managers plug into Tez; a Store Client talks to the storage layer; Tez internals ("Tez Magic") drive the Tasks. Legend: Tez Component / Uses Tez API / External Component.]
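The DAG Converter's target is Tez's public DAG API. Below is a generic, hedged Java example of the kind of two-vertex DAG a converter might emit; the scope.* processor names are placeholders, not actual Graphene classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class DagConverterSketch {
    static DAG build(Configuration conf) {
        // Two placeholder vertices; a real converter maps each algebra
        // stage to a vertex whose processor hosts the SCOPE engine.
        Vertex extract = Vertex.create("Extract",
                ProcessorDescriptor.create("scope.ExtractProcessor"), 100);
        Vertex aggregate = Vertex.create("Aggregate",
                ProcessorDescriptor.create("scope.AggregateProcessor"), 10);

        // Scatter-gather edge: each Extract task partitions its output
        // across the Aggregate tasks (the classic shuffle pattern).
        OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
                .newBuilder(Text.class.getName(), Text.class.getName(),
                        HashPartitioner.class.getName())
                .setFromConfiguration(conf)
                .build();

        return DAG.create("ScopeJobSketch")
                .addVertex(extract)
                .addVertex(aggregate)
                .addEdge(Edge.create(extract, aggregate,
                        edgeConf.createDefaultEdgeProperty()));
    }
}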
17. Graphene – Task Execution
[Diagram: the GRAPHENE AM (in its AM container) launches task containers. Each task container runs a SCOPE Task hosting the SCOPE Engine with its SCOPE Processor, SCOPE Input, and SCOPE Output. Tez ("Tez Magic!") carries InputDataInformationEvents, DataMovementEvents, and InputFailedEvents between the AM and the tasks, along with task commands and status/error reports. Legend: Tez Component / Uses Tez API / External Component.]
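In Tez terms, hosting an external engine inside a task means implementing a logical I/O processor. A skeletal Java sketch of what a SCOPE-engine-hosting processor could look like follows; the ScopeEngine call is hypothetical, and only the Tez base class and its overrides are real API.

import java.util.List;
import java.util.Map;
import org.apache.tez.runtime.api.AbstractLogicalIOProcessor;
import org.apache.tez.runtime.api.Event;
import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
import org.apache.tez.runtime.api.ProcessorContext;

// Skeleton of a processor that hands a task's logical inputs/outputs to an
// external engine, the way the SCOPE processor hosts the SCOPE engine.
public class ScopeProcessorSketch extends AbstractLogicalIOProcessor {

    public ScopeProcessorSketch(ProcessorContext context) {
        super(context);
    }

    @Override
    public void initialize() throws Exception {
        // Deserialize the vertex payload (e.g., which SCOPE operators to run).
    }

    @Override
    public void run(Map<String, LogicalInput> inputs,
                    Map<String, LogicalOutput> outputs) throws Exception {
        // Hand the Tez-managed inputs/outputs to the external engine and
        // block until it finishes; errors propagate as task failure.
        // ScopeEngine.execute(inputs, outputs);  // hypothetical call
    }

    @Override
    public void handleEvents(List<Event> processorEvents) {
        // DataMovementEvents etc. arrive here; route them to the engine.
    }

    @Override
    public void close() throws Exception {
        // Release engine resources.
    }
}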
18. Graphene – Tooling Integration
[Diagram: the SCOPE Task in its container emits periodic statistics and diagnostics through Tez ("Tez Magic") to a JobProfiler/EventListener in the GRAPHENE AM, which aggregates task-level and vertex-level stats into real-time and historic views. Legend: Tez Component / Uses Tez API / External Component.]
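On the consumer side, Tez already exposes near real-time progress through DAGClient, which a profiler can poll. A minimal Java sketch, assuming the DAGClient handle comes from the TezClient that submitted the DAG; this illustrates the API, not Graphene's JobProfiler.

import java.util.EnumSet;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.dag.api.client.StatusGetOpts;

// Poll a running DAG for state, progress, and counters -- the kind of
// signal a profiler/tooling layer can surface in near real time.
public class StatsPoller {
    static void poll(DAGClient dagClient) throws Exception {
        while (true) {
            DAGStatus status = dagClient.getDAGStatus(
                    EnumSet.of(StatusGetOpts.GET_COUNTERS));
            System.out.println("state=" + status.getState()
                    + " progress=" + status.getDAGProgress());
            if (status.isCompleted()) break;
            Thread.sleep(5_000); // polling interval is an arbitrary choice
        }
    }
}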
19. Experience So Far
• Reliability: as expected from production-ready software; no major bugs or reliability issues
• Onboarding: modular and tested code; documentation is an opportunity to contribute
• Community: very responsive; special thanks to Bikas Saha, Kuhu Shukla, and Jonathan Eagles
20. Scaling Tez
• Existing Cosmos workloads can have > 15K parallel tasks
  • Acquiring and managing these containers
  • Managing communications with these tasks
  • Providing real-time progress for all the tasks
21. Scaling Tez
• Optimize AM memory
  • Metadata management for large inputs
  • Memory pressure under large event throughput
• Large DAGs with > 2,000 vertices and > 1 million tasks
• Optimizations for deep DAGs
22. Integrating with YARN
• Opportunistic containers
  • A mechanism to drive up cluster utilization
  • The AM has a deep understanding of this capability
  • Effectively using opportunistic containers in the scheduler
  • Harder scheduling choices with container reuse
23. AM Recovery
• A high-priority customer ask
• Need to plug Graphene into Tez's AM recovery support
• Deterministic and reliable recovery despite dynamic DAG behavior
25. References
• Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications [SIGMOD, 2015]
• SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [VLDB, 2008]
• Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing [OSDI, 2014]
• Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks [EuroSys, 2007]
• Lessons Learned from Scaling YARN to 40K Machines in a Multi-Tenancy Environment [DataWorksSummit, 2017]
Editor's Notes
We are here to talk about how we are looking to power SCOPE with Tez.
We will do a quick overview of Cosmos and SCOPE
Then we will talk about the role the Job Manager plays in the system,
and how we are looking to fix some of the problems we have by leveraging Tez.
Anupam will dig a little deeper into the design of Graphene.
He will talk about the challenges in front of us and why we need your help to take Tez to the next level.
A Microsoft-internal platform for building big-data applications
Used across Microsoft by Bing, Azure, Windows, Office for data mining and analysis. Available externally as Azure Data Lake Analytics
Lets users focus on transforming data to gain insights while we focus on operating the platform at lower COGS.
SCOPE is the main scripting language for Cosmos, targeted at large-scale data analysis. You could run a script over 1 GB, a TB, or a PB, and we handle scaling it.
A SQL-like language that allows C# and .NET devs to get started easily.
On the right is a sample SCOPE script. In this case we are reading a TSV file, running a SELECT statement over it, adding a new column, and outputting the result as a new file.
Users can easily define their own functions and even implement their own versions of operators like extractors, processors, and outputters.
Users just write scripts as if they will run on a single machine, and we scale them out on the cluster.
This means the nitty-gritty of dealing with failures and retries is not something users have to worry about.
Users submit a SCOPE script from Visual Studio using the SCOPE Studio plugin. The script goes through the Cosmos Job Service and front end (FE), where it is compiled by the SCOPE compiler. The compiler produces an AST representation of the script along with the codegen DLLs for user code and other artifacts. The optimizer makes decisions about the execution plan and parallelism, and generates an algebra.
The Job Manager, which is us, parses the algebra and starts executing the DAG on the cluster. As part of the execution, the JM launches the SCOPE engine in the tasks, which provides implementations of many standard physical operators. The JM gives the SCOPE engine input paths to read and outputs to produce. Typically, the outputs of one vertex become inputs to some other vertex, and DAG execution continues.
---
The SCOPE compiler and optimizer are responsible for generating an efficient execution plan and the runtime.
So what are the responsibilities of the Job Manager?
DAG execution
The JM is the central coordinating process for all processing vertices within an application. The primary function of the JM is to construct the runtime DAG from the compile-time representation and execute over it. The JM schedules a DAG vertex onto the cluster nodes when all of its inputs are ready.
The JM can also make dynamic updates to the graph, like pod-level aggregation, or build a broadcast tree.
Fault tolerance
The Job Manager monitors the progress of all executing vertices. Failing vertices are re-executed a limited number of times, and if there are too many failures, the job is terminated.
The JM also detects slower tasks in a vertex and re-executes them elsewhere on the cluster.
Scheduling
When a task is ready, the JM looks for a machine in the cluster to run it on.
The global cluster load information used by each JM is provided through the cooperation of two additional entities in the system: a Resource Monitor (RM) for each cluster and a Process Node (PN) on each server. The RM aggregates load information from PNs across the cluster continuously, providing a global view of the cluster status for each JM to make informed scheduling decisions.
It also enforces token limits.
Users typically give a job some tokens to run with. Each token amounts to 2 cores and 6 GB of memory, so, for example, a job granted 100 tokens can never use more than 200 cores and 600 GB at once. The JM ensures that the resources used by the job never exceed the allocated number of tokens.
When the job finishes, the JM finalizes the outputs so they become visible to the user. It also supports custom metadata operations like catalog updates.
Go through details.
Explains design and implementation decisions for Graphene and how we use Tez
1m: Once we decided to implement the SCOPE AM using Tez, we settled on certain ground rules, or guiding principles, to accomplish this goal.
Hitesh already gave us an idea about the scale of Cosmos and SCOPE workloads, and how critical they are for Microsoft’s business.
The compiler, optimizer, execution engine and tooling will be minimally changed, in order to allow for a staged transition.
Tez has a very powerful set of APIs that allow any system to plug in. We will be using these extensibility points as much as possible.
Finally, for features we need to add to Tez, we will work with the community to make them generally useful for all Tez users as much as possible.
With these ground rules set, we started working on porting SCOPE to run on top of Tez.
3m: The need to seamlessly upgrade from the current Job Manager to Graphene implies that Graphene should be a drop-in replacement for it.
As Hitesh showed, doing this at Cosmos scale, while being the backbone of Microsoft’s analytics, calls for the least possible perturbation.
This meant that the SCOPE AM on Tez had to mimic the behavior of the existing Job Manager.
Graphene has four unique integration points in the Cosmos SCOPE stack that are not native to Tez.
This introduction to our guiding principles and integration points will be helpful for understanding our implementation and the rationale behind our design choices.