How to Develop and Operate Cloud Native Data Platforms and Applications
Du Li
Architect of Data Infrastructure
Electronic Arts (EA)
2/11/2020
Addressing Big Data Challenges at EA
[Diagram: three-layer stack of Applications on top of Platforms on top of Infrastructure]
● Seasonal and daily fluctuations in data traffic
● Many Black Fridays
● Need to scale up and down compute resources frequently and quickly
DevOps problems in cloud-based tech stacks
● Reasons: Varying degrees of automation exist at different layers
○ Infrastructure (as Code): fully automated
○ Platforms
○ Applications
● Consequences: High operational costs in platforms and applications
○ Lack of built-in automation, hard to scale up/down
○ Mismatch of automation, hard to keep different layers in sync
○ Engineers busy with ops, limited time for dev and innovation
● Solutions: Redesign or upgrade in cloud-native architectures
One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
The life of an ops team with 1,000+ alerts per day ...
What a monitoring system is like ...
[Diagram: Hosts 1..n run Metrics Sensors, which feed a Metrics Aggregator, which feeds an Alerts Generator; every component carries its own confs]
The confs specify:
● which machine runs what and how to do health checks
● from which machines to collect metrics
● what metrics to collect and where to send metrics
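To make the "confs" concrete, here is a minimal, purely hypothetical sketch (not EA's actual configuration format) of the three kinds of configuration such a stack carries:

```python
# Hypothetical illustration only (not EA's actual configuration format).
# The three kinds of "confs" a monitoring stack like the one above carries.

# 1. Which machine runs what, and how to health-check it.
inventory = {
    "host-001": {"services": ["kafka-broker"], "health_check": "check_tcp_9092"},
    "host-002": {"services": ["hive-metastore"], "health_check": "check_tcp_9083"},
}

# 2. From which machines the Metrics Aggregator should collect metrics.
aggregator_conf = {
    "scrape_hosts": ["host-001", "host-002"],
    "scrape_interval_seconds": 60,
}

# 3. What metrics each Metrics Sensor collects and where it sends them.
sensor_conf = {
    "metrics": ["cpu_percent", "disk_used_percent", "jvm_heap_used_bytes"],
    "sink": "https://metrics-aggregator.internal:8080/ingest",  # hypothetical endpoint
}
```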
Mismatch of automation causing an ops nightmare ...
● Monitoring components: manually configured, little or no automation
● Infrastructure: fully automated, computers added and removed rapidly
● The mismatch makes it hard to keep configurations up to date
● Consequence: many false alerts, while many resources not monitored
The Solution: Monitoring As Code
Configuration Orchestration
● Keep monitoring confs in sync with hardware status
● Not modifying any underlying monitoring components
[Diagram: the orchestration layer pushes refreshed confs to Hosts 1..n, the Metrics Aggregators, and the Alerts Generator]
Monitoring As Code: Configuration Orchestration
[Diagram: Auto Configuration Generators take four inputs and emit confs that CI/CD Pipelines deploy to the Metrics Sensors, Metrics Aggregators, and Alerts Generator running on AWS]
Inputs to the generators:
● Which machine runs what
● How monitoring components are wired
● Library of methods of how to check
● What to check for each cluster
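A minimal sketch of such a generator under these assumptions: a cloud inventory lookup plus a check library. All names, ports, and endpoints below are hypothetical, not EA's code.

```python
# Hypothetical sketch of an auto configuration generator; all names, ports and
# endpoints are illustrative. A real version would query AWS (tags, ASGs, etc.)
# so that the generated confs always match what is actually running.
from typing import Dict, List


def fetch_inventory() -> Dict[str, List[str]]:
    """Return {host: [services]} from the cloud provider ("which machine runs what")."""
    return {"host-001": ["kafka-broker"], "host-002": ["hive-metastore"]}


# "Library of methods of how to check"
CHECK_LIBRARY = {
    "kafka-broker": {"type": "tcp", "port": 9092},
    "hive-metastore": {"type": "tcp", "port": 9083},
}


def generate_confs() -> Dict[str, dict]:
    """Render one conf per host; the CI/CD pipeline deploys the rendered files."""
    confs = {}
    for host, services in fetch_inventory().items():
        confs[host] = {
            "checks": [CHECK_LIBRARY[s] for s in services if s in CHECK_LIBRARY],
            "metrics_sink": "https://metrics-aggregator.internal:8080/ingest",
        }
    return confs


if __name__ == "__main__":
    for host, conf in generate_confs().items():
        print(host, conf)
```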
Monitoring As Code
Workflow: updates => review/merge => deploy by CI/CD pipelines
● All monitoring configurations are managed as one GitLab project
● Impose standard structures on all the configurations such that they can be automatically generated
● Configurations are treated the same way as software code
○ Updates are reviewed and traceable
○ Auto deployed by CI/CD
● Essentially an orchestration layer on top of monitoring components, without modifying any of those components
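As a sketch of what "auto deployed by CI/CD" can mean in practice (illustrative only, assuming a repository layout with one JSON conf per host), a pipeline step can regenerate the confs and fail the job on drift:

```python
# Hypothetical CI step (illustrative, not EA's pipeline): regenerate the confs
# from the current inventory and fail the job if the committed files drifted.
import json
import pathlib
import sys


def generate_expected_confs() -> dict:
    """Stub standing in for the auto configuration generator sketched earlier."""
    return {"host-001": {"checks": [{"type": "tcp", "port": 9092}]}}


def main() -> int:
    committed_dir = pathlib.Path("confs")          # hypothetical repo layout
    drift = []
    for host, conf in generate_expected_confs().items():
        path = committed_dir / f"{host}.json"
        if not path.exists() or json.loads(path.read_text()) != conf:
            drift.append(host)
    if drift:
        print("confs out of date for:", ", ".join(drift))
        return 1                                    # non-zero exit fails the CI job
    print("all confs up to date")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```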
1/1/2019 - 4/25/2019: daily averages improved, ~1,000 alerts => ~100 alerts, against increasing #hosts and #services
Since 6/2019: ~10-20 alerts per day on average
One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
Two clusters
● Ocean: Internal ETL jobs
○ Only cache recent data in HDFS
○ ETL jobs read/write on HDFS
○ Back up data to S3 and update the analytics metastore with new data
○ Legacy tech stacks
● Pond: External customer workloads
○ Run analytic queries directly on S3, maybe over the entire history of data
○ Some queries turned into light ETL jobs
○ Some analytic ETL jobs have dependencies on regular ETL jobs
○ Mixed legacy and modern tech stacks
Main pain points
● Two clusters w/ duplicates
● Hadoop hard to scale
● HDFS used as cache
○ Complex custom code for data backup, purge, retention, loading
● Fragmented address spaces
○ HDFS (for ETL) holds recent data; S3 (for analytics) holds all data
○ Two Hive metastores: one for ETL and one for analytic workloads
○ Complex sync between jobs and between metadata stores
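To illustrate those pain points, here is a purely hypothetical sketch of the custom cache-maintenance glue an HDFS-as-cache design requires; the helpers are stubs standing in for distcp/boto3 calls and Hive metastore updates, not EA's actual code:

```python
# Purely illustrative: the kind of custom glue an HDFS-as-cache design forces.
# All helpers are stubs; real versions would call distcp/boto3 and the two Hive metastores.
import datetime
from dataclasses import dataclass

RETENTION_DAYS = 7  # only recent data stays in the HDFS cache


@dataclass
class Partition:
    table: str
    date: datetime.date
    hdfs_path: str
    s3_path: str


def copy_to_s3(src: str, dst: str) -> None:
    print(f"backup  {src} -> {dst}")          # e.g., hadoop distcp / boto3


def drop_hdfs_path(path: str) -> None:
    print(f"purge   {path}")                  # e.g., hdfs dfs -rm -r


def update_metastore(store: str, table: str, location: str, add: bool) -> None:
    print(f"{'add' if add else 'drop'} partition in {store} metastore: {table} @ {location}")


def nightly_cache_maintenance(partitions, today: datetime.date) -> None:
    for p in partitions:
        copy_to_s3(p.hdfs_path, p.s3_path)                              # full history -> S3
        update_metastore("analytics", p.table, p.s3_path, add=True)     # analytics store sees S3
        if (today - p.date).days > RETENTION_DAYS:
            drop_hdfs_path(p.hdfs_path)                                 # evict from the cache
            update_metastore("etl", p.table, p.hdfs_path, add=False)    # keep ETL store in sync
```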
Data Platform 2.0
● Consolidated ETL/Analytics
● CI/CD for auto deployment
● Auto scaling of YARN and Presto
● Data orchestration using Alluxio
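As one way auto scaling of YARN could be driven (a sketch under assumptions, not EA's implementation), a small controller can poll the standard ResourceManager metrics endpoint and decide whether to grow or shrink the cluster; scale_to() is a placeholder for the actual resize mechanism (e.g., an EC2 auto scaling group):

```python
# Hedged sketch of an auto-scaling check for YARN, assuming the standard
# ResourceManager REST endpoint (/ws/v1/cluster/metrics). The RM address,
# thresholds, and scale_to() are illustrative placeholders.
import requests

RM_URL = "http://yarn-rm.internal:8088"   # hypothetical ResourceManager address


def desired_node_delta(scale_up_threshold: float = 0.85,
                       scale_down_threshold: float = 0.40) -> int:
    metrics = requests.get(f"{RM_URL}/ws/v1/cluster/metrics", timeout=10).json()["clusterMetrics"]
    used_ratio = metrics["allocatedMB"] / max(metrics["totalMB"], 1)
    if metrics["appsPending"] > 0 or used_ratio > scale_up_threshold:
        return +2          # add nodes: work is queueing or memory is nearly full
    if used_ratio < scale_down_threshold and metrics["appsPending"] == 0:
        return -1          # shrink slowly when the cluster is mostly idle
    return 0


def scale_to(delta: int) -> None:
    print(f"scaling cluster by {delta:+d} nodes")  # placeholder for the resize API


if __name__ == "__main__":
    scale_to(desired_node_delta())
```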
Benefits of data orchestration
● Analytics Workloads
○ Presto + Alluxio: caching for better performance
● ETL Workloads
○ YARN + Alluxio: replace HDFS with Alluxio
■ Specialized cache service to auto backup, purge & load data
■ One unified address space to simplify syncs
■ Easier to scale and auto manage compute resources
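A sketch of what the unified address space means for a job, assuming a Spark job on YARN with the Alluxio client on the classpath (required for the alluxio:// scheme) and S3 mounted under Alluxio's namespace; paths and column names are hypothetical:

```python
# Sketch only: one alluxio:// path replaces the hdfs://-vs-s3a:// split.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-with-alluxio").getOrCreate()

# Before: the job had to know whether data was "recent" (HDFS cache) or
# "historical" (S3), e.g. hdfs://nn:8020/warehouse/events/... vs
# s3a://bucket/warehouse/events/...

# After: one path; Alluxio decides what is cached and what is fetched from S3.
events = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events/dt=2020-02-01")

daily = events.groupBy("game_id").count()   # hypothetical columns
daily.write.mode("overwrite").parquet(
    "alluxio://alluxio-master:19998/warehouse/daily_counts/dt=2020-02-01"
)
```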
Will we lose performance?
Preliminary benchmarking: Alluxio vs HDFS
● Same configurations
○ 10 H1.8xlarge instances, collocated YARN + HDFS/Alluxio
○ Alluxio with 3 replicas, no memory (using HDD for caching)
● Test dataset
○ Single file of varying sizes, single table with 1000 files of varying sizes
○ File sizes varying from 1KB to 500MB
● Reads: S3 => Alluxio 3.3-4.7X faster than S3 => HDFS
● Writes: Alluxio => S3 2.7-4.2X faster than HDFS => S3
● Hive query: Alluxio => Alluxio 1.3-2.1X faster than HDFS => HDFS
One tale of two stories ...
● Monitoring Systems
● Data Platforms
● Key Takeaways
How orchestration helped solve the problems ...
Key Takeaways
● In the new cloud age, old ways of devops are breaking
○ Wrong technologies => 20% dev, 80% ops or worse
○ Strong implications for CapEx, OpEx, and HREx
● New emphasis in the cloud: Automation
○ Auto deployment, scaling, recovery => 80% dev, 20% ops or better
● Further: Automate the right things, e.g.,
○ Everything as Code: infrastructure, platforms, applications
○ Configuration Orchestration: auto generate all configurations
○ Data Orchestration: caching and unified address space
