SlideShare a Scribd company logo
Cloud Native Data Platform
at Fitbit
Challenges and lessons in a multi-tenant multi-cloud
environment
Intro of Fitbit
● Consumer Fitness trackers & smart watches
● Corporate Wellness based on Fitbit trackers for
enterprises
● Social apps and fitness coach (on Smart Phone and
Smartwatches)
● Other projects
Example users of the Data Platform
● Data Science and machine learning
● Research Science (new health insights, health studies, etc.)
● Software Engineers
● Hardware and manufacturing
● Customer Support, Marketing, Legal, Security, etc.
Challenges
● Diverse user experience and expectations: batch and micro batch ETL,
stream insights, analytics dashboards, adhoc queries, AML/DL model training
& serving, etc. From simple SQL to complex deep learning, from very small
laptop-size datasets to 100s TB sized datasets.
● Multiple compliances: PII, PCI, HIPAA, GDPR, etc.
● Very lean data platform team
● Most valuable data are locked in transaction store (MySQL and Cassandra)
Some Stats
● ~ 100 TB of time based data generated each day
● ~ 30 million active users globally
● Many TB derived data generated each day
● ~ 100s of different primary datasets and even more derived datasets
● Realtime and batch based reports and insights generated regularly
Data Platform Architecture
MySQL Kafka Cassandra
Micro services (Apache Mesos) API
Mobile
Trackers
Web
Partners
Ad hoc analysis
Batch ETL
S3
Extraction Pipelines
S3
Kafka
Mirror
S3
S3 Compute
(Presto/Spark)
on EMR
Tooling: Data Dictionary/Discovery,
Monitoring, Provisioning, Airflow, security &
compliance AWS ECS Machine Learning
Spark Structured
Streaming
Stream insights
BI Warehouse
Data Lake
Multi-Tenancy on AWS
Data Dictionary
Schema Registry
Provisioning
Portal+Tools
Artifacts
Service Registry
Kafka Mirrors
S3 S3
S3
EMR Clusters
Airflow Web/
Scheduler
Gateway +
Proxy
Logs and
Monitoring
S3 S3
S3
EMR Clusters
Airflow Web/
Scheduler
Gateway +
Proxy
Containers
Lambdas Streams
Team A (AWS Account A)
Data Platform (AWS Master Account)
Team B (AWS Account B)
Notebook
Choices Made - Multi-tenancy
● Multi-tenancy (many AWS sub accounts) for security, compliance (via IAM
role access controls and S3 bucket policies), and cost attribution
● Very fine granular buckets (~ 1-2 buckets per a set of features/data streams).
Tooling for abstraction on top of buckets
● Self-service model for both data producer and data consumer teams
● Abstractions on top of AWS: Data Discovery/Data Dictionary, Airflow Pipeline
Blocks, and cluster labels
● Ephemeral clusters (based on AWS EMR) for all batch jobs
● Provide tools to automate ephemeral clusters provisioning, monitoring, log
aggregation, cost attribution, etc.
Lessons Learned
● Large S3 partitions scanning is not efficient in either Spark or Presto. Using a
layered metadata outside S3 and HMS helps
● Layered custom input/output format with consolidated metadata helps on
query, but incur code maintenance overhead
● Multi-tenancy + data discovery early on helps adoption of the new platform
● Using s3distcp between EMR and S3 can be more reliable than distcp.
EMRFS helps
● Multi-stage ETL jobs need enough capacity on ephemeral HDFS
● EMR has its own issues and blackbox surprises though AWS is pretty
responsive in general. Managing EMR job failures via external scheduler
(Airlow) helps

More Related Content

What's hot

Riak CS Build Your Own Cloud Storage
Riak CS Build Your Own Cloud StorageRiak CS Build Your Own Cloud Storage
Riak CS Build Your Own Cloud Storage
buildacloud
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ Netflix
Jerome Boulon
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
Eva Tse
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
Knoldus Inc.
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
Databricks
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug Grall
Spark Summit
 
Riak at shareaholic
Riak at shareaholicRiak at shareaholic
Riak at shareaholic
freerobby
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
Alexander Dean
 
Keep your Metadata Repository Current with Event-Driven Updates using CDC and...
Keep your Metadata Repository Current with Event-Driven Updates using CDC and...Keep your Metadata Repository Current with Event-Driven Updates using CDC and...
Keep your Metadata Repository Current with Event-Driven Updates using CDC and...
confluent
 
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
HostedbyConfluent
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Spark Summit
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
Zhenxiao Luo
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
Databricks
 
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
Rainforest QA
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
Databricks
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
Zekeriya Besiroglu
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architecture
Oliver Buckley-Salmon
 
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
HostedbyConfluent
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
Databricks
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Spark Summit
 

What's hot (20)

Riak CS Build Your Own Cloud Storage
Riak CS Build Your Own Cloud StorageRiak CS Build Your Own Cloud Storage
Riak CS Build Your Own Cloud Storage
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ Netflix
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug Grall
 
Riak at shareaholic
Riak at shareaholicRiak at shareaholic
Riak at shareaholic
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Keep your Metadata Repository Current with Event-Driven Updates using CDC and...
Keep your Metadata Repository Current with Event-Driven Updates using CDC and...Keep your Metadata Repository Current with Event-Driven Updates using CDC and...
Keep your Metadata Repository Current with Event-Driven Updates using CDC and...
 
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architecture
 
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 

Similar to Cloud native data platform

AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
Amazon Web Services
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Informatica
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
Mark Kromer
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
Selvaraj Kesavan
 
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Piyush Kumar
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Certus Solutions
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Amazon Web Services
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
Amazon Web Services
 
Azure Overview Csco
Azure Overview CscoAzure Overview Csco
Azure Overview Cscorajramab
 
The State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWSThe State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWS
Trevor Parsons
 
Building a reliable and cost effect logging system at Box
Building a reliable and cost effect logging system at Box Building a reliable and cost effect logging system at Box
Building a reliable and cost effect logging system at Box
Elasticsearch
 
AWS Webcast - Sumo Logic
AWS Webcast - Sumo LogicAWS Webcast - Sumo Logic
AWS Webcast - Sumo Logic
Amazon Web Services
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
Amazon Web Services
 
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...
Insight Technology, Inc.
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
Data Con LA
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 

Similar to Cloud native data platform (20)

AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 
Azure Overview Csco
Azure Overview CscoAzure Overview Csco
Azure Overview Csco
 
The State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWSThe State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWS
 
Building a reliable and cost effect logging system at Box
Building a reliable and cost effect logging system at Box Building a reliable and cost effect logging system at Box
Building a reliable and cost effect logging system at Box
 
AWS Webcast - Sumo Logic
AWS Webcast - Sumo LogicAWS Webcast - Sumo Logic
AWS Webcast - Sumo Logic
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use...
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 

Recently uploaded

Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 

Recently uploaded (20)

Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 

Cloud native data platform

  • 1. Cloud Native Data Platform at Fitbit Challenges and lessons in a multi-tenant multi-cloud environment
  • 2. Intro of Fitbit ● Consumer Fitness trackers & smart watches ● Corporate Wellness based on Fitbit trackers for enterprises ● Social apps and fitness coach (on Smart Phone and Smartwatches) ● Other projects
  • 3. Example users of the Data Platform ● Data Science and machine learning ● Research Science (new health insights, health studies, etc.) ● Software Engineers ● Hardware and manufacturing ● Customer Support, Marketing, Legal, Security, etc.
  • 4. Challenges ● Diverse user experience and expectations: batch and micro batch ETL, stream insights, analytics dashboards, adhoc queries, AML/DL model training & serving, etc. From simple SQL to complex deep learning, from very small laptop-size datasets to 100s TB sized datasets. ● Multiple compliances: PII, PCI, HIPAA, GDPR, etc. ● Very lean data platform team ● Most valuable data are locked in transaction store (MySQL and Cassandra)
  • 5. Some Stats ● ~ 100 TB of time based data generated each day ● ~ 30 million active users globally ● Many TB derived data generated each day ● ~ 100s of different primary datasets and even more derived datasets ● Realtime and batch based reports and insights generated regularly
  • 6. Data Platform Architecture MySQL Kafka Cassandra Micro services (Apache Mesos) API Mobile Trackers Web Partners Ad hoc analysis Batch ETL S3 Extraction Pipelines S3 Kafka Mirror S3 S3 Compute (Presto/Spark) on EMR Tooling: Data Dictionary/Discovery, Monitoring, Provisioning, Airflow, security & compliance AWS ECS Machine Learning Spark Structured Streaming Stream insights BI Warehouse Data Lake
  • 7. Multi-Tenancy on AWS Data Dictionary Schema Registry Provisioning Portal+Tools Artifacts Service Registry Kafka Mirrors S3 S3 S3 EMR Clusters Airflow Web/ Scheduler Gateway + Proxy Logs and Monitoring S3 S3 S3 EMR Clusters Airflow Web/ Scheduler Gateway + Proxy Containers Lambdas Streams Team A (AWS Account A) Data Platform (AWS Master Account) Team B (AWS Account B) Notebook
  • 8. Choices Made - Multi-tenancy ● Multi-tenancy (many AWS sub accounts) for security, compliance (via IAM role access controls and S3 bucket policies), and cost attribution ● Very fine granular buckets (~ 1-2 buckets per a set of features/data streams). Tooling for abstraction on top of buckets ● Self-service model for both data producer and data consumer teams ● Abstractions on top of AWS: Data Discovery/Data Dictionary, Airflow Pipeline Blocks, and cluster labels ● Ephemeral clusters (based on AWS EMR) for all batch jobs ● Provide tools to automate ephemeral clusters provisioning, monitoring, log aggregation, cost attribution, etc.
  • 9. Lessons Learned ● Large S3 partitions scanning is not efficient in either Spark or Presto. Using a layered metadata outside S3 and HMS helps ● Layered custom input/output format with consolidated metadata helps on query, but incur code maintenance overhead ● Multi-tenancy + data discovery early on helps adoption of the new platform ● Using s3distcp between EMR and S3 can be more reliable than distcp. EMRFS helps ● Multi-stage ETL jobs need enough capacity on ephemeral HDFS ● EMR has its own issues and blackbox surprises though AWS is pretty responsive in general. Managing EMR job failures via external scheduler (Airlow) helps