Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
How to Develop and Operate Cloud Native Data Platforms and Applications
Speaker:
Du Li, Electronic Arts (EA)
1. How to Develop and Operate Cloud Native Data Platforms and Applications
Du Li
Serena Wang, Yen Feng, Tony Ma, Preethi Ganeshan, Tushar Agarwal,
Kaiyu Liu, Nitish Victor, Yu Jin, Sundeep Narravula
Electronic Arts (EA)
Data Orchestration Summit, 11/7/2019
4. General problems in cloud-based tech stacks
● Reasons: Varying degrees of automation exist at different layers
○ Infrastructure (as Code): fully automated
○ Platforms: manually managed
○ Applications: manually managed
● Consequences: High operational costs in platforms and applications
○ Lack of built-in automation, hard to scale up/down
○ Mismatch of automation, hard to keep different layers in sync
○ Engineers busy with ops, limited time for dev and innovation
● Solutions: Redesign or upgrade in cloud-native architectures
5. One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
6. The life of ops team with 1,000+ alerts per day ...
7. What a monitoring system is like ...
[Diagram: Metrics Sensors, a Metrics Aggregator, and an Alerts Generator, configured through several sets of configuration files (confs). The confs specify:]
● Which machine runs what and how to do health checks
● From which machines to collect metrics
● What metrics to collect and where to send metrics
8. Mismatch of automation causing ops nightmare ...
● Monitoring components: manually configured, little or no automation
● Infrastructure: fully automated, computers added and removed rapidly
● The mismatch makes it hard to keep configurations up to date
● Consequence: many false alerts, while many resources not monitored
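The mismatch above can be read as a set comparison between what the infrastructure currently contains and what the monitoring configurations name. A minimal sketch, assuming two hypothetical inputs (a list of live hosts and a list of configured hosts); the function name and the sample data are illustrative and do not come from the talk:

```python
def find_drift(live_hosts, monitored_hosts):
    """Compare infrastructure state against monitoring configuration.

    Both arguments are iterables of host names (hypothetical inputs).
    Returns (stale, unmonitored):
      stale       - configured hosts that no longer exist (sources of false alerts)
      unmonitored - live hosts that are missing from the configuration
    """
    live = set(live_hosts)
    monitored = set(monitored_hosts)
    return sorted(monitored - live), sorted(live - monitored)


# Illustrative data only
live = ["etl-01", "etl-02", "presto-01"]
configured = ["etl-01", "etl-03"]
stale, unmonitored = find_drift(live, configured)
print("stale:", stale)
print("unmonitored:", unmonitored)
```

When hosts are added and removed rapidly, both lists grow unless the configurations are regenerated at the same pace as the infrastructure changes.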
10. Monitoring As Code: Configuration Orchestration
[Diagram components: AWS, CI/CD Pipelines, Auto Configuration Generators, Metrics Sensors, Metrics Aggregators, Alerts Generators]
The diagram labels four kinds of information:
● Which machine runs what
● How monitoring components are wired
● Library of methods of how to check
● What to check for each cluster
11. Monitoring As Code
[Diagram: updates => review/merge => deploy by CI/CD pipelines]
● All monitoring configurations are managed as one GitLab project
● Standard structures are imposed on all the configurations so that they can be generated automatically
● Configurations are treated the same way as software code
○ Updates are reviewed and traceable
○ Auto deployed by CI/CD
● Essentially an orchestration layer on top of the monitoring components, without modifying any of those components
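The idea of generating configurations from standard structures can be sketched as follows. This is a minimal illustration, not the generators used at EA: the three inputs (host inventory, per-cluster service list, check library), all host and cluster names, and the ports are assumptions chosen for the example.

```python
import json

# Hypothetical library of health-check methods, keyed by service name.
CHECK_LIBRARY = {
    "presto": {"type": "http", "port": 8080, "path": "/v1/info"},
    "yarn-nodemanager": {"type": "tcp", "port": 8042},
}


def generate_configs(inventory, cluster_services, check_library):
    """Build one list of checks per host from three declarative inputs.

    inventory        - {host: cluster}: which machine belongs to which cluster
    cluster_services - {cluster: [service, ...]}: what to check for each cluster
    check_library    - {service: check definition}: how to check each service
    """
    configs = {}
    for host, cluster in sorted(inventory.items()):
        checks = []
        for service in cluster_services.get(cluster, []):
            check = dict(check_library[service])  # copy so the library stays untouched
            check["service"] = service
            check["host"] = host
            checks.append(check)
        configs[host] = checks
    return configs


# Illustrative inputs only
inventory = {"etl-01": "cluster-a", "q-01": "cluster-b"}
cluster_services = {"cluster-a": ["yarn-nodemanager"], "cluster-b": ["presto"]}
configs = generate_configs(inventory, cluster_services, CHECK_LIBRARY)
print(json.dumps(configs, indent=2, sort_keys=True))
```

Because the output is derived entirely from the inputs, adding a host means changing one inventory entry in the repository and letting the pipeline regenerate and deploy the rest.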
12. Daily average alerts, 1/1/2019 - 4/25/2019:
~1,000 alerts => ~100 alerts,
despite an increasing number of hosts and services
Since 6/2019: ~10-20 alerts per day on average
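As a quick check on what these figures imply, the relative reduction can be computed directly. The inputs are the approximate values quoted on the slide, so the results are approximate too; the helper name is hypothetical, and the comparison for 6/2019 onward assumes the same ~1,000 baseline.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after` alerts per day."""
    return 100.0 * (before - after) / before


print(reduction_pct(1000, 100))  # 90.0
print(reduction_pct(1000, 20))   # 98.0
print(reduction_pct(1000, 10))   # 99.0
```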
13. One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
15. Two clusters
● Ocean: Internal ETL jobs
○ Only recent data is cached in HDFS
○ ETL jobs read/write on HDFS
○ Back up data to S3 and update the analytics metastore with the new data
○ Legacy tech stacks
● Pond: External customer workloads
○ Run analytic queries directly on S3, possibly over the entire history of data
○ Some queries turned into light ETL jobs
○ Some analytic ETL jobs have dependencies on regular ETL jobs
○ Mixed legacy and modern tech stacks
16. Main pain points
● Two clusters with duplicates
● Hadoop hard to scale
● HDFS used as cache
○ Complex custom code for data backup, purge, retention, loading
● Fragmented address spaces
○ HDFS (for ETL) recent data, S3 (for analytics) all data
○ Two Hive metadata stores: one for ETL and one for analytic workloads
○ Complex sync between jobs and between metadata stores
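To illustrate why a fragmented address space adds complexity, here is a minimal sketch of the kind of lookup a job needs when recent data lives in HDFS and the full history lives in S3. Everything in it is hypothetical: the function, the URIs, the table name, and the 7-day retention window are assumptions for the example, not details from the talk.

```python
from datetime import date


def resolve_location(table, partition, today, retention_days):
    """Pick the storage URI for one daily partition in a split HDFS/S3 layout.

    All names, URIs, and the retention window are hypothetical.
    """
    age_days = (today - partition).days
    if 0 <= age_days < retention_days:
        # Recent partitions are still held in the HDFS cache.
        return f"hdfs://namenode/warehouse/{table}/dt={partition.isoformat()}"
    # Older partitions exist only in the S3 backup.
    return f"s3://example-bucket/warehouse/{table}/dt={partition.isoformat()}"


today = date(2019, 11, 7)
recent = resolve_location("events", date(2019, 11, 5), today, retention_days=7)
older = resolve_location("events", date(2019, 1, 1), today, retention_days=7)
print(recent)
print(older)
```

Every job and both metadata stores have to agree on this rule, and on when a partition crosses the boundary, which is where the sync burden comes from.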
17. Data Platform 2.0
● Consolidated ETL/Analytics
● CI/CD for auto deployment
● Auto scaling of YARN and Presto
● Data orchestration using Alluxio
18. Benefits of data orchestration
● Analytics Workloads
○ Presto + Alluxio: caching for better performance
● ETL Workloads
○ YARN + Alluxio: replace HDFS with Alluxio
■ Specialized cache service to auto backup, purge & load data
■ One unified address space to simplify syncs
■ Easier to scale and auto manage compute resources
○ Significant architectural benefits. But, will we lose any performance?
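The three cache duties named above (backup, purge, load) can be illustrated with a toy read-through, write-through cache. This is a conceptual sketch only, not Alluxio's implementation or API: a plain dict stands in for S3, eviction is simple least-recently-used, and all class, method, and path names are hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy cache in front of an under store (a plain dict stands in for S3)."""

    def __init__(self, under_store, capacity):
        self.under_store = under_store
        self.capacity = capacity
        self.cache = OrderedDict()  # path -> data, least recently used first

    def _put(self, path, data):
        self.cache[path] = data
        self.cache.move_to_end(path)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # purge: evict the least recently used entry

    def write(self, path, data):
        self.under_store[path] = data  # backup: persist to the under store
        self._put(path, data)

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)
            return self.cache[path]
        data = self.under_store[path]  # load: fetch from the under store on a miss
        self._put(path, data)
        return data


# Illustrative data only
s3 = {"/events/2019-01-01": b"old"}
fs = ReadThroughCache(s3, capacity=2)
fs.write("/events/2019-11-06", b"recent")
fs.write("/events/2019-11-07", b"today")
old = fs.read("/events/2019-01-01")  # miss: loaded from the under store
cached_paths = list(fs.cache)
print(cached_paths)
```

Callers use one path regardless of where the bytes currently live, which is the sense in which a single address space removes the HDFS-versus-S3 decision from job code.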
19. Preliminary benchmarking: Alluxio vs HDFS
● Same configurations
○ 10 H1.8xlarge instances, collocated YARN + HDFS/Alluxio
○ Alluxio with 3 replicas, no memory (using HDD for caching)
● Test dataset
○ Single file of varying sizes, single table with 1000 files of varying sizes
● Reads: S3 => Alluxio 3.3-4.7X faster than S3 => HDFS
● Writes: Alluxio => S3 2.7-4.2X faster than HDFS => S3
● Hive query: Alluxio => Alluxio 1.3-2.1X faster than HDFS => HDFS
20. One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
21. Key Takeaways
● In the new cloud age, old ways of devops are breaking
○ Wrong technologies => 20% dev, 80% ops
○ Strong implications on CapEx, OpEx, and HREx
● New emphasis in the cloud: Automation
○ Auto deployment, scaling, recovery => 90% dev, 10% ops
● TIPS: Adopt cloud-native technologies, e.g.,
○ Everything as Code: infrastructure, platforms, applications
○ Configuration Orchestration: auto generate all configurations
○ Data Orchestration: caching and unified address space