Alluxio Online Meetup
Feb 11, 2020
Speakers:
Du Li, Electronic Arts
Bin Fan, Alluxio
In cloud-based software stacks, there are varying degrees of automation across different layers: infrastructure, platform, and application. The mismatch in automation often breaks balance in devops, causing ops nightmares in platforms and applications. This talk will overview two projects at Electronic Arts (EA) that address the mismatch by data orchestration: One project automatically generates configurations for all components in a large monitoring system, which reduces the daily average number of alerts from ~1000 to ~20. The other project introduces Alluxio for caching and unifying address space across ETL and analytics workloads, which substantially simplifies architecture, improves performance, and reduces ops overheads.
5. DevOps problems in cloud-based tech stacks
● Reasons: Varying degrees of automation exist at different layers
○ Infrastructure (as Code): fully automated
○ Platforms
○ Applications
● Consequences: High operational costs in platforms and applications
○ Lack of built-in automation, hard to scale up/down
○ Mismatch of automation, hard to keep different layers in sync
○ Engineers busy with ops, limited time for dev and innovation
● Solutions: Redesign or upgrade in cloud-native architectures
6. One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
7. The life of ops team with 1,000+ alerts per day ...
9. Mismatch of automation causing ops nightmare ...
● Monitoring components: manually configured, little or no automation
● Infrastructure: fully automated, computers added and removed rapidly
● The mismatch makes it hard to keep configurations up to date
● Consequence: many false alerts, while many resources not monitored
11. Which machine
runs what
How monitoring
components are
wired
Library of
methods of
how to check
What to check
for each cluster
Monitoring As Code: Configuration Orchestration
AWS
CI/CD Pipelines
Alerts
Generator
Auto Configuration Generators
Metrics
Aggregators
Alerts
Generator
Metrics
Sensor
Metrics
Aggregator
Metrics
Sensors
12. Monitoring As Code
updates
review/merge
deploy by
CI/CD pipelines
● All monitoring configurations are managed
as one gitlab project
● Impose standard structures on all the
configurations such that they can be
automatically generated
● Configurations as treated the same way as
software code
○ Updates are reviewed, traceable
○ Auto deployed by CI/CD
● Essentially an orchestration layer on top of
monitoring components without modifying
any of those components
13. 1/1/2019 - 4/25/2019 daily averages improved:
~1,000 alerts => ~100 alerts
against increasing #hosts and #services
Since 6/2019, ~10-20 alerts per day on average
14. One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
15.
16. ● Ocean: Internal ETL jobs
○ Only cache recent data in HDFS
○ ETL Jobs read/write on HDFS
○ Backup data to S3 and update the
analytics metastore of new data
○ Legacy tech stacks
● Pond: External customer workloads
○ Run analytic queries directly on S3,
maybe the entire history of data
○ Some queries turned into light ETL jobs
○ Some analytic ETL jobs have
dependencies with regular ETL jobs
○ Mixed legacy and modern tech stacks
Two clusters
17. Main pain points
● Two clusters w/ duplicates
● Hadoop hard to scale
● HDFS used as cache
○ Complex custom code for data backup, purge, retention, loading
● Fragmented address spaces
○ HDFS (for ETL) recent data, S3 (for analytics) all data
○ Two hive metadata stores: one for ETL and one for analytic workloads
○ Complex sync between jobs and between metadata stores
18.
19. Data Platform 2.0
● Consolidated ETL/Analytics
● CI/CD for auto deployment
● Auto scaling of YARN and Presto
● Data orchestration using Alluxio
20. Benefits of data orchestration
● Analytics Workloads
○ Presto + Alluxio: caching for better performance
● ETL Workloads
○ YARN + Alluxio: replace HDFS with Alluxio
■ Specialized cache service to auto backup, purge & load data
■ One unified address space to simplify syncs
■ Easier to scale and auto manage compute resources
22. Preliminary benchmarking: Alluxio vs HDFS
● Same configurations
○ 10 H1.8xlarge instances, collocated YARN + HDFS/Alluxio
○ Alluxio with 3 replicas, no memory (using HDD for caching)
● Test dataset
○ Single file of varying sizes, single table with 1000 files of varying sizes
○ File sizes varying from 1KB to 500MB
● Reads: S3 => Alluxio 3.3-4.7X faster than S3 => HDFS
● Writes: Alluxio => S3 2.7-4.2X faster than HDFS => S3
● Hive query: Alluxio => Alluxio 1.3-2.1X faster than HDFS => HDFS
23. One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
24. Key Takeaways
● In the new cloud age, old ways of devops are breaking
○ Wrong technologies => 20% dev, 80% ops or worse
○ Strong implications on CapEx, OpEx, and HREx
● New emphasis in the cloud: Automation
○ Auto deployment, scaling, recovery => 80% dev, 20% ops or better
● Further: Automate on the right things, e.g.,
○ Everything as Code: infrastructure, platforms, applications
○ Configuration Orchestration: auto generate all configurations
○ Data Orchestration: caching and unified address space