Compute-based sizing and system dashboard

© Hortonworks Inc. 2011–2018. All rights reserved.
Compute Based Sizing & Operation Dashboard
Janet Li HP Inc.
Pranay Vyas Hortonworks

Agenda
• Sizing puzzle
• Solution
• Operation dashboard
• Q&A

Sizing puzzle

• HP operates a very complex HDP environment with key stakeholders and critical data
across a variety of business areas: finance, supply chain, sales, and customer support
• We load over 8,000 files per day and execute 1.5M lines of SQL via 6,000 jobs running
against 637B rows of data in over 5,000 tables across 77 domains
• Defining our cluster size and monitoring job performance is essential for our success and
the satisfaction of our stakeholders across the different business and IT organizations
• With an estimated 15,000 jobs running every day, it is evident that traditional
storage-based sizing will not yield the computation power needed
Background
Why do we need a sizing strategy?

• Multiple applications with different target state architectures
• Data Lake with multiple layers – raw dataset, historical dataset, transformed dataset &
reporting dataset
• Hybrid environment, on premises as well as in the cloud
• Application owners only know the incoming data volumes; they lack data points on the
compute requirements of their use cases
• Need a method to project demand and growth as new programs are onboarded
• YARN capacity management and scheduling constraints, as most jobs are region based
and are triggered at the same time
Challenge
What are the Challenges for sizing the cluster?

• Review all the applications and collect data volumes and their estimated growth
• Capture every application's current footprint and projected capacity requirements based
on functional / technical requirements
• Understand where each application stands in terms of meeting its required SLA and
what the blocking concerns are
• Looked at peak CPU utilization, allocated and peak RAM usage, storage capacity,
network IOPS, bandwidth & throughput, average response time, and transaction
throughput
• Developed a framework that could be leveraged for ongoing capacity management and
deployment – with appropriate modifications to address different workload patterns
Approach
How to approach compute sizing?

Solution

• The ELT process is standard
  • Loading layer
  • History layer
  • Transformation layer
  • Reporting layer
• To extrapolate the sizing, we need to capture job metrics for the initial release run,
from the loading layer to the reporting layer.
• Estimate the computation requirements for queries and tune Hive and Tez properties
Collect Data
Method – Extrapolate based on initial run

• Categorize the sample runs based on duration
  • Major job > 30 min
  • Large job < 30 min
  • Small job < 15 min
  • Tiny job < 1 min
• Categorize the initial run based on # of containers
  • Major footprint > 400 containers
  • Large footprint < 400 containers
  • Small footprint < 100 containers
  • Tiny footprint < 20 containers
Collect Data
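
The two categorizations above can be sketched as a pair of small functions. This is only an illustration, not the presenters' code: the function names are hypothetical, and the handling of the exact boundary values (for example a job of exactly 30 minutes) is an assumption, since the slide does not say which side they fall on.

```python
def classify_duration(minutes: float) -> str:
    """Bucket a job by run time, reading the slide's thresholds as a cascade."""
    if minutes >= 30:      # assumption: exactly 30 min counts as Major
        return "Major"
    if minutes >= 15:
        return "Large"
    if minutes >= 1:
        return "Small"
    return "Tiny"


def classify_footprint(containers: int) -> str:
    """Bucket a job by the number of containers it used."""
    if containers >= 400:  # assumption: exactly 400 containers counts as Major
        return "Major"
    if containers >= 100:
        return "Large"
    if containers >= 20:
        return "Small"
    return "Tiny"


if __name__ == "__main__":
    # Hypothetical sample run: 22 minutes on 150 containers.
    print(classify_duration(22), classify_footprint(150))
```

Keeping duration and footprint as separate labels lets a short but very wide job be distinguished from a long but narrow one.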

• Total map tasks and reduce tasks needed for initial run
• Average, max & min time taken for map task and reduce task
• Extrapolate the # of HQL queries needed at every release based on the existing system
and the initial release runs
Collect Data

• YARN memory: 6 TB
• Tez container size: 4 GB
• Map/reduce ratio: 70%:30%
• Average map task time: 0.34 min
• Average reduce task time: 0.5 min
Collect Data
Current cluster estimates

Map tasks: 4,553,788 tasks, each running for 0.34 min
Reduce tasks: 1,327,104 tasks, each running for 0.5 min
Estimate # of map tasks and reduce tasks that can run
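
The slide does not show how these two capacity figures were obtained. They are reproduced exactly by the sketch below if one assumes 1 TB = 1024 GB, a 24-hour window, and that the 70%:30% ratio splits the available containers between map and reduce tasks. Those three assumptions and the function name are ours, not the presenters'.

```python
import math
from fractions import Fraction


def task_capacity(yarn_memory_gb, container_gb, share, avg_task_min,
                  window_min=24 * 60):
    """Number of tasks that fit in the window for one task type.

    share and avg_task_min are passed as strings so Fraction keeps them exact.
    """
    # Concurrent containers available to this task type.
    slots = Fraction(yarn_memory_gb) / Fraction(container_gb) * Fraction(share)
    # Each slot can run window_min / avg_task_min tasks back to back.
    return math.floor(slots * window_min / Fraction(avg_task_min))


map_capacity = task_capacity(6 * 1024, 4, "0.7", "0.34")
reduce_capacity = task_capacity(6 * 1024, 4, "0.3", "0.5")

if __name__ == "__main__":
    print(map_capacity, reduce_capacity)  # 4553788 1327104
```

Exact fractions are used instead of floats so that rounding noise cannot shift the result by one task.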

• Based on the initial run, get the map task count for each job classification
Estimate # of map tasks and reduce tasks needed

Category     Expected Count   Initial Run Count
Major Job    47               2
Large Job    43               5
Small Job    1061             20
Tiny Job     10345            20
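
One way to extrapolate from the initial run is to scale the task total observed in each category by the ratio of expected to sampled job counts. The sketch below assumes simple linear scaling, which the slides do not confirm; the job counts come from the table above, while the per-category task totals in the usage example are hypothetical placeholders, not measured values.

```python
# Job counts per category, taken from the table above.
EXPECTED_JOBS = {"Major": 47, "Large": 43, "Small": 1061, "Tiny": 10345}
INITIAL_JOBS = {"Major": 2, "Large": 5, "Small": 20, "Tiny": 20}


def extrapolate_tasks(initial_tasks: dict) -> dict:
    """Scale the task totals seen in the initial run to the expected job counts.

    initial_tasks maps category -> total tasks observed across the sampled jobs.
    """
    return {
        category: initial_tasks[category] / INITIAL_JOBS[category]
        * EXPECTED_JOBS[category]
        for category in EXPECTED_JOBS
    }


if __name__ == "__main__":
    # Hypothetical map task totals from the initial run (illustration only).
    sample = {"Major": 2000, "Large": 1000, "Small": 400, "Tiny": 40}
    projected = extrapolate_tasks(sample)
    print(projected, sum(projected.values()))
```

The same function can be applied separately to map tasks and reduce tasks.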

• With 6 TB of YARN memory
Map tasks: 4,553,788 tasks, each running for 0.34 min
Reduce tasks: 1,327,104 tasks, each running for 0.5 min
• What is needed to run the transformations 3 times
Map tasks: 6,733,140 × 3 = 20,199,420 tasks
Reduce tasks: 92,426 × 3 = 277,278 tasks
• With 6 TB of YARN memory it is evident that we may not be able to run the required
number of map and reduce tasks.
Estimate difference
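
The gap can be expressed as a ratio of required tasks to available capacity. The sketch below uses only the figures from this slide; the function name is hypothetical. By these numbers the shortfall sits on the map side (roughly 4.4 times the available capacity), while the reduce requirement is about a fifth of what the cluster can run.

```python
def demand_ratio(needed: int, capacity: int) -> float:
    """Return needed / capacity; values above 1.0 mean a shortfall."""
    return needed / capacity


MAP_CAPACITY = 4_553_788        # from the current cluster estimate
REDUCE_CAPACITY = 1_327_104
MAP_NEEDED = 6_733_140 * 3      # transformations run 3 times
REDUCE_NEEDED = 92_426 * 3

map_ratio = demand_ratio(MAP_NEEDED, MAP_CAPACITY)
reduce_ratio = demand_ratio(REDUCE_NEEDED, REDUCE_CAPACITY)

if __name__ == "__main__":
    print(f"map demand / capacity:    {map_ratio:.2f}")
    print(f"reduce demand / capacity: {reduce_ratio:.2f}")
```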

How did we get the data?

Operation Dashboard

• Provides critical input to platform sizing
• Provides job execution details and highlights differences between runs
• Makes it easy to track incremental records written to each table and unusually low/high
writes
• Helps identify peak loads and dips in capacity usage when scheduling new jobs
• Shows historical patterns of how a table is being loaded and helps identify problems
with the load
Overview

• Tez jobs write their application logs and job metrics to the timeline server
• Job execution details are stored in the HDFS directory “/ats/done” in a complex JSON format
• A custom parser extracts the metrics details from these JSON files
• The parser extracts application information, DAG details, counters, and detailed vertex
information.
How is data captured
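
To give a feel for the parsing step, here is a minimal sketch that flattens the counters of one timeline entity. It is not the presenters' parser. The JSON layout ("otherinfo" > "counters" > "counterGroups") is an assumption about how Tez DAG entities are typically serialized, and the sample entity is made up; check both against real files under “/ats/done” before relying on them.

```python
import json


def extract_counters(entity: dict) -> dict:
    """Flatten counter groups into {"GROUP.COUNTER": value}.

    Assumed layout (not confirmed by the slides):
    entity["otherinfo"]["counters"]["counterGroups"][i]["counters"][j]
    """
    flat = {}
    groups = (entity.get("otherinfo", {})
                    .get("counters", {})
                    .get("counterGroups", []))
    for group in groups:
        group_name = group.get("counterGroupName", "UNKNOWN")
        for counter in group.get("counters", []):
            key = f"{group_name}.{counter.get('counterName')}"
            flat[key] = counter.get("counterValue")
    return flat


# Made-up entity used only to exercise the function.
SAMPLE = json.loads("""
{"entity": "dag_0000000000000_0001_1", "entitytype": "TEZ_DAG_ID",
 "otherinfo": {"counters": {"counterGroups": [
   {"counterGroupName": "TaskCounter",
    "counters": [{"counterName": "OUTPUT_RECORDS", "counterValue": 1200}]}
 ]}}}
""")

if __name__ == "__main__":
    print(extract_counters(SAMPLE))
```

Using `.get` with defaults means an entity without counters yields an empty dictionary rather than an exception, which is convenient when scanning many files.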

What is captured

Categorize jobs
• Job category helps in understanding where most time and resources are being
spent by each tenant

Scheduling
Resource Usage Patterns
• Find peak load times and dips where the cluster remains underutilized, so that more
jobs can be scheduled
• Can I run more parallel jobs? Compare max parallelization vs. resources utilized and
max resource utilization vs. parallelization
[Chart annotation: can schedule more jobs]
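
A rough way to find such dips from captured job records is to add up the containers in use per hour and flag the hours that fall below a chosen share of capacity. The sketch below is illustrative only: the job tuples, the 50% threshold, and the function names are hypothetical, and the capacity of 1536 containers assumes 6 TB of YARN memory at 4 GB per container with 1 TB = 1024 GB.

```python
def hourly_usage(jobs, hours=24):
    """Sum containers in use per hour.

    jobs is an iterable of (start_hour, end_hour, containers); end is exclusive.
    """
    usage = [0] * hours
    for start_hour, end_hour, containers in jobs:
        for hour in range(start_hour, end_hour):
            usage[hour] += containers
    return usage


def idle_hours(usage, capacity, threshold=0.5):
    """Hours where usage is below threshold * capacity."""
    return [hour for hour, used in enumerate(usage)
            if used < capacity * threshold]


if __name__ == "__main__":
    # Hypothetical jobs over a 4-hour window.
    sample_jobs = [(0, 2, 400), (1, 3, 800)]
    usage = hourly_usage(sample_jobs, hours=4)
    print(usage, idle_hours(usage, capacity=1536))
```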

Write Patterns
Historical Comparison
• Historical pattern for records written, # of tasks, duration, # of vertices, average
task duration, CPU and IO operations for a table
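
A historical comparison like this can also be used to flag unusually low or high writes automatically. The sketch below uses a simple z-score against a table's past record counts. The z-score rule, the threshold of 3, and the sample history are our assumptions for illustration; the slides do not say how the dashboard detects unusual writes.

```python
from statistics import mean, stdev


def is_unusual_write(history, latest, z_threshold=3.0):
    """Return True if latest deviates strongly from past record counts."""
    if len(history) < 2:
        return False  # not enough history to judge
    average, spread = mean(history), stdev(history)
    if spread == 0:
        return latest != average
    return abs(latest - average) / spread > z_threshold


if __name__ == "__main__":
    # Hypothetical daily record counts for one table.
    past = [1000, 1020, 980, 1010, 990]
    print(is_unusual_write(past, 1005))  # within the usual range
    print(is_unusual_write(past, 100))   # far below the usual range
```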

Summary

Deliverable Description Artifact
Template for Sizing Methodology
• Foundational approach to application infrastructure assessment.
• Although every application is unique, we find that most fall into one or a
combination of different patterns (e.g. Data Warehouse, Analytics, Transaction
Processing).
• Subsequent documents extend this approach to their specific pattern type and
approach to infrastructure assessment.
Preliminary Application Assessment Calculator
• Calculator used to provide a heat-map of potential problem areas.
HPI Infrastructure Assessment
• Provides a detailed overview of the Infrastructure Assessment, covering a summary
of the issues, the current state, the challenges, and high-level
recommendations
Application Assessment Methodology
• Leveraging the template, this document captures the 4 phases of the infrastructure
assessment: Discovery, Analysis, Recommendations, and Socialize.
• This document becomes the basis of the written Infrastructure Assessment.
Application Sizing Calculator
• This calculator utilizes the data gathered in the discovery phase as input to help
assess if an application is sized correctly
Chargeback dashboard
• A built-in chargeback tool using SmartSense Activity Explorer
• Chargeback based on real-time usage metrics
Capacity Management Framework / Capability

• The source can be found at
https://github.com/pranayVyas/ats_extract
Tested HDP versions
• HDP 2.6
• Updates will be made to support HDP 3.0
• You can reach out to us if you would like to parse additional metrics info
Free the source

Questions?

Thank you
