Cloud Native Data Platform
at Fitbit
Challenges and lessons in a multi-tenant multi-cloud
environment
Intro to Fitbit
● Consumer fitness trackers & smartwatches
● Corporate wellness based on Fitbit trackers for enterprises
● Social apps and fitness coaching (on smartphones and smartwatches)
● Other projects
Example users of the Data Platform
● Data Science and machine learning
● Research Science (new health insights, health studies, etc.)
● Software Engineers
● Hardware and manufacturing
● Customer Support, Marketing, Legal, Security, etc.
Challenges
● Diverse user experiences and expectations: batch and micro-batch ETL, stream insights, analytics dashboards, ad hoc queries, ML/DL model training & serving, etc. From simple SQL to complex deep learning, and from small laptop-size datasets to datasets of hundreds of TB.
● Multiple compliance regimes: PII, PCI, HIPAA, GDPR, etc.
● Very lean data platform team
● Most valuable data is locked in transactional stores (MySQL and Cassandra)
Some Stats
● ~100 TB of time-based data generated each day
● ~30 million active users globally
● Many TB of derived data generated each day
● Hundreds of different primary datasets and even more derived datasets
● Real-time and batch reports and insights generated regularly
Data Platform Architecture
[Architecture diagram: Mobile, Trackers, Web, and Partners hit APIs served by microservices on Apache Mesos, backed by MySQL, Kafka, and Cassandra. Extraction pipelines and a Kafka mirror land data into S3 (the data lake). Compute over S3 (Presto/Spark) runs on EMR and feeds ad hoc analysis, batch ETL, Spark Structured Streaming for stream insights, machine learning on AWS ECS, and a BI warehouse. Cross-cutting tooling: Data Dictionary/Discovery, monitoring, provisioning, Airflow, security & compliance.]
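A minimal sketch of the stream-insights path above (Kafka mirror -> Spark Structured Streaming -> S3 data lake). The topic name, bootstrap servers, and S3 locations are purely illustrative, and the Kafka source requires the spark-sql-kafka package on the classpath:

    # Sketch: read the mirrored Kafka topic as a stream and land micro-batches in S3.
    # Topic, servers, and bucket names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-insights-sketch").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka-mirror:9092")  # hypothetical mirror
        .option("subscribe", "device.events")                    # hypothetical topic
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )

    # Write Parquet micro-batches to the data lake for downstream Presto/Spark queries.
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3://data-lake-bucket/events/")                  # hypothetical
        .option("checkpointLocation", "s3://data-lake-bucket/events/_chk/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()

The same S3 location can then be queried by the Presto/Spark compute on EMR shown in the diagram.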
Multi-Tenancy on AWS
[Diagram: the Data Platform master AWS account hosts the shared services (Data Dictionary, Schema Registry, provisioning portal + tools, artifacts, service registry, Kafka mirrors, logs and monitoring). Each tenant account (Team A in AWS Account A, Team B in AWS Account B) gets its own S3 buckets, EMR clusters, Airflow web/scheduler, gateway + proxy, and, where needed, containers, Lambdas, streams, and notebooks.]
Choices Made - Multi-tenancy
● Multi-tenancy (many AWS sub-accounts) for security, compliance (via IAM role access controls and S3 bucket policies), and cost attribution (see the bucket-policy sketch after this list)
● Very fine-grained buckets (~1-2 buckets per set of features/data streams), with tooling for abstraction on top of the buckets
● Self-service model for both data producer and data consumer teams
● Abstractions on top of AWS: Data Discovery/Data Dictionary, Airflow Pipeline Blocks, and cluster labels
● Ephemeral clusters (based on AWS EMR) for all batch jobs
● Tools to automate ephemeral cluster provisioning, monitoring, log aggregation, cost attribution, etc. (see the provisioning sketch after this list)
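A minimal sketch of the bucket-policy side of multi-tenancy, assuming a hypothetical fine-grained bucket and tenant account ID; in practice policies like this would be generated by the provisioning tooling rather than written by hand:

    # Sketch: grant one tenant account read-only access to one fine-grained bucket.
    # Bucket name and account ID are hypothetical placeholders.
    import json
    import boto3

    s3 = boto3.client("s3")

    BUCKET = "fitbit-data-heart-rate-stream"   # hypothetical: ~1-2 buckets per feature set
    TENANT_ACCOUNT = "111122223333"            # hypothetical Team A account ID

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TenantReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{TENANT_ACCOUNT}:root"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{BUCKET}",      # ListBucket applies to the bucket
                    f"arn:aws:s3:::{BUCKET}/*",    # GetObject applies to its objects
                ],
            }
        ],
    }

    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

The tenant account's own IAM roles still need matching allow statements; the bucket policy only grants the cross-account side from the bucket owner.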
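A minimal sketch of launching an ephemeral EMR cluster for one batch job with boto3. Release label, instance types, roles, tags, and artifact paths are illustrative; the platform's provisioning tools wrap this kind of call:

    # Sketch: short-lived EMR cluster that runs one Spark step and terminates itself.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="etl-daily-heart-rate",          # hypothetical job name
        ReleaseLabel="emr-5.29.0",            # illustrative EMR release
        Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
            ],
            # Ephemeral: shut the cluster down once the steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "s3://artifacts-bucket/jobs/etl.py"],   # hypothetical artifact
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Tags=[{"Key": "team", "Value": "team-a"}],   # cluster labels for cost attribution
    )
    print(response["JobFlowId"])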
Lessons Learned
● Scanning large S3 partitions is not efficient in either Spark or Presto; a layered metadata index outside S3 and the Hive Metastore helps (see the pruning sketch after this list)
● Layered custom input/output formats with consolidated metadata help query performance, but incur code maintenance overhead
● Multi-tenancy + data discovery early on helps adoption of the new platform
● Using s3-dist-cp between EMR and S3 can be more reliable than distcp; EMRFS helps
● Multi-stage ETL jobs need enough capacity on ephemeral HDFS
● EMR has its own issues and black-box surprises, though AWS is pretty responsive in general. Managing EMR job failures via an external scheduler (Airflow) helps (see the DAG sketch after this list)
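A minimal sketch of the layered-metadata idea: resolve the exact file list for a dataset slice from an external index and read only those files, instead of letting Spark list a huge S3 partition. The lookup_files helper and the index behind it are hypothetical:

    # Sketch: prune the read down to known files via an external metadata index.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata-pruned-read").getOrCreate()

    def lookup_files(dataset, start, end):
        # Hypothetical: in practice this would query the platform's metadata index,
        # which maps (dataset, date range) to the exact S3 files for that slice.
        return [
            f"s3://data-lake-bucket/{dataset}/2019-06-0{d}/part-00000.parquet"
            for d in range(1, 8)
        ]

    # Read only the resolved files; no recursive S3 listing and no HMS partition scan.
    files = lookup_files("heart_rate", "2019-06-01", "2019-06-07")
    df = spark.read.parquet(*files)
    df.groupBy("user_id").count().show()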
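A minimal sketch of driving an ephemeral EMR job from Airflow, combining the s3-dist-cp and external-scheduler lessons: Airflow creates the cluster, submits the steps, waits on a step sensor, and retries on failure. Import paths assume the Airflow Amazon provider package, and all IDs and paths are illustrative:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    STEPS = [
        {
            "Name": "copy-results-to-s3",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # s3-dist-cp tends to be more reliable than plain distcp for HDFS <-> S3.
                "Args": ["s3-dist-cp", "--src", "hdfs:///etl/output/",
                         "--dest", "s3://data-lake-bucket/etl/output/"],  # hypothetical
            },
        }
    ]

    with DAG(
        dag_id="ephemeral_emr_etl",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
        catchup=False,
    ) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_cluster",
            # Cluster config as in the boto3 sketch above; omitted here for brevity.
            job_flow_overrides={"Name": "etl-ephemeral"},
        )

        add_steps = EmrAddStepsOperator(
            task_id="add_steps",
            job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
            steps=STEPS,
        )

        # Fails the task (and triggers Airflow retries/alerts) if the EMR step fails.
        watch_step = EmrStepSensor(
            task_id="watch_step",
            job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
            step_id="{{ ti.xcom_pull(task_ids='add_steps')[0] }}",
        )

        create_cluster >> add_steps >> watch_step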
