Jun. 14, 2021

Delivering Insights from 20M+ Smart Homes with 500M+ Devices

We started out processing big data using AWS S3, EMR clusters, and Athena to serve Analytics data extracts to Tableau BI.

However, as our data and team sizes grew, as the Avro schemas in our source data evolved, and as we attempted to serve analytics data through Web apps, we hit a number of limitations in the AWS EMR and Glue/Athena approach.

This is a story of how we scaled out our data processing and boosted team productivity to meet our current demand for insights from 20M+ Smart Homes and 500M+ devices across the globe, from numerous internal business teams and our 150+ CSP partners.

We will describe lessons learnt and best practices established as we enabled our teams with Databricks autoscaling Job clusters and Notebooks and migrated our Avro/Parquet data to use the Metastore, SQL Endpoints, and the SQLA Console, while charting the path to Delta Lake…

  1. Delivering Insights from 20M+ Smart Homes with 500M+ devices. Sameer Vaidya and Raghav Karnam, Data Engineering and Data Science
  2. Universal Peace Mantra: May all beings everywhere be happy and free, and may the thoughts, words and actions of my own life contribute in some way to that happiness and to that freedom for all
  3. Present Day: Plume business and product imperatives and expectations from data teams
  4. Insights from worldwide smart home locations, device types, and behaviors over time. Public examples at: https://plume.com, https://discover.plume.com/wfh-dashboard
  5. Agenda: Our journey to developer & operations productivity and scale: ▪ Job Clusters ▪ Template Notebooks ▪ Avro/Parquet -> Delta ▪ SQL Analytics ▪ ML Lifecycle https://www.plume.com/careers @ Sameer: Data Engineering, Analytics & BI @ Raghav: Data Science & ML Engineering
  6. Challenges with our first generation Spark processing clusters and Data Warehouse
  7. Poor Dev/Ops productivity, visibility, fragility ▪ DevOps-owned AWS IaaS became a bottleneck ▪ Lack of automation created poor utilization in prod and dev ▪ Poor developer productivity: Notebooks integration was complicated and largely unused • AWS Athena *serverless • AWS EMR Spark Clusters • metadata management is critical to see all data • scheduling is tricky • easy to make a mess • creates lots of cruft tables when misconfigured or when extraneous files are in the path • AWS Glue Crawlers • due to lack of automation and developer IDE, control over resources and complexity • Data scientists couldn't answer complex questions because long-running queries time out • Supporting a Web app: limited queue slots cannot handle unpredictable Web app loads
  8. #1 Developer and operational productivity: Deploying worldwide E2 workspaces and empowering developers with Notebooks and self-service clusters
  9. Operate across N regions x [dev + prod] workspaces. Standardization and Automation: users:groups:clusters:buckets:subnets:jobs:databases:tables • Standardize Namespaces • Map SAML IDP SSO • Plan RBAC model • https://status.databricks.com/
  10. Developer productivity 30-50% up with Notebooks • Use GitHub Repos • Interactive dev/debug by uploading jars • Interactive SQL/Python • Easily convert to scheduled Jobs • Combine with IDEs • Databricks Connect • Simba JDBC • Schedule via Airflow
  11. DevOps-less self-service: developers with clusters. Databricks clusters reduce operational tickets and enhance productivity • Databricks Job Clusters • Standard / High Concurrency • $$$ needs high utilization • Lesson: optimized for multiple queries but runs individual queries slower • Use EC2 Reserved Instances for Driver nodes and Spot instances for all Workers, for long or short running jobs • Use Service Principals for team ownership of logs / jobs • Plan dedicated subnet space for expansion • Use 1 hr idle termination • Best Practices • Developers decide cluster size for their jobs; cluster policies put sanity bounds • Achieve High Availability • AWS Instance availability: Plan AWS per-AZ Instance type availability • Databricks API availability: Plan for Databricks API outages during upgrades • Retry Airflow Job submission for 30 mins • Use Idempotency tokens to avoid multiple runs during API outages
  12. Segment Usage/Billing by Teams, Projects, Owners • Use cost-center:region:team:env:project:owner AWS Tags in cluster creation APIs • (Chart: monthly usage, Jan ... Dec, by Cost Center, Region, Environment, Team, Owner) • With great authority comes great responsibility: - Usage plan makes owners accountable - Usage data is available to you - Customize using Notebooks
  13. #2 Query performance, scale and automated metadata management: Migrating from legacy Avro/Parquet to Delta Lake
  14. Migrate Glue metadata to Databricks Metastore • Move to Delta ASAP: - Poor performance for poorly partitioned Avro/Parquet - No Glue Crawlers • Interim support for legacy Avro/Parquet data: - Generate DDL from templates - Jobs to MSCK REPAIR TABLE + scripts to scan S3 and ADD PARTITION • Convert to Delta: - Migrate Jobs to read/write - AutoLoader
  15. Parquet -> Delta in-place conversion is optimal on resources but requires complex coordination: 1. Catalog all paths, databases, tables 2. Prepare DDL USING PARQUET & DELTA 3. Convert pipelines to read/write Delta instead of Parquet 4. Coordinate with external consumers 5. Pause and upgrade all pipelines 6. Migrate Parquet to Delta 7. Resume pipelines 8. Schedule Glue MSCK 9. Recovery Plan
  16. #3 Scalable SQL Analytics over large data sets: Migrating Data Scientists, Analysts and BI dashboards to consume Databricks SQL Analytics Endpoints
  17. SQLA Endpoints optimized for BI/Analytics workloads • Start with single “general-purpose” • 1 hour idle termination • Rich SQL IDEs supported - DBeaver • Can serve Web APIs! • Create dedicated SQL endpoint / clusters for each use case; size clusters per use case / workload
  18. #4. Summary: Scaling development and operations for BI and Analytics for worldwide deployments requires: - Workspace management - Clusters + Notebooks - Metadata management - Migrate BI/adhoc to SQL Endpoints
  19. SPEAKER CHANGE - TRANSITION TO RAGHAV’s PRESO (DELETE THIS SLIDE)
  20. Present Day: Plume ML Focus areas and expectations from Machine learning teams
  21. Challenges with our first generation ML lifecycle and MLOps. Our evolution to increase the productivity of our Data Scientists.
  22. (Diagram: ML lifecycle stages and owners) Curate Data: DE • Build Model: Data Scientist • Deployment / ML Model: ML Engineer • Integrate Model: SWE • Operate Model: Monitor for Data Drift • Checks shown between stages: Model Performance metrics / Thresholds, A/B testing, Pass/Fail
  23. #5. ML Lifecycle in Databricks
  24. Plume’s ML Architecture
  25. Models Across Databricks Workspaces
  26. Demo
  27. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
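
Slides 11 and 12 recommend tagging every cluster with cost-center:region:team:env:project:owner, keeping the driver off Spot capacity, running workers on Spot, and using 1 hr idle termination. The sketch below shows one way such a cluster specification might be assembled and validated before it is sent to a cluster creation API. It is not taken from the talk: the function name, the runtime version, the instance type, and the worker counts are placeholders, and the field names follow the Databricks Clusters API as we understand it, so check them against the current documentation.

```python
from typing import Dict

# Tag keys taken from slide 12; every cluster must carry all of them.
REQUIRED_TAGS = ("cost-center", "region", "team", "env", "project", "owner")


def build_cluster_spec(name: str, tags: Dict[str, str],
                       min_workers: int = 2, max_workers: int = 8) -> dict:
    """Return a cluster spec dict, refusing specs that cannot be billed back."""
    missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
    if missing:
        raise ValueError("missing required tags: " + ", ".join(missing))
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "9.1.x-scala2.12",   # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder instance type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": 60,         # "Use 1 hr idle termination"
        "custom_tags": {key: tags[key] for key in REQUIRED_TAGS},
        "aws_attributes": {
            # The first node (the driver) is on-demand, so Reserved Instance
            # pricing can apply to it; the remaining nodes use Spot capacity.
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
        },
    }
```

Rejecting an untagged spec at build time is one way to enforce the "usage plan makes owners accountable" point; a cluster policy could enforce the same bounds on the platform side.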
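
Slide 11 also suggests retrying job submission for 30 mins and using idempotency tokens so that a retry during an API outage does not start a second run. A minimal sketch of that pattern follows, assuming the caller supplies a `submit` function that posts the request to the Jobs API and raises `ConnectionError` when the API is unreachable. `submit_with_retry` and its parameters are illustrative names, not part of Airflow or Databricks; `idempotency_token` is the request field the Databricks Jobs API documents for this purpose.

```python
import time
import uuid
from typing import Callable


def submit_with_retry(submit: Callable[[dict], dict], payload: dict,
                      max_wait_s: float = 30 * 60, delay_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> dict:
    """Submit a run, retrying on ConnectionError until max_wait_s elapses."""
    # One token for the whole retry loop: if an earlier attempt actually
    # reached the service, a later attempt with the same token should map
    # to the existing run rather than launch a duplicate.
    request = dict(payload, idempotency_token=str(uuid.uuid4()))
    deadline = clock() + max_wait_s
    while True:
        try:
            return submit(request)
        except ConnectionError:
            # Give up when another wait would take us past the deadline.
            if clock() + delay_s > deadline:
                raise
            sleep(delay_s)
```

The token is generated outside the loop on purpose; generating a fresh token per attempt would defeat the deduplication.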
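
Slides 14 and 15 describe cataloging all paths, databases, and tables, repairing partitions for legacy tables in the interim, and then converting Parquet to Delta in place. The sketch below generates the SQL for those two steps from a small catalog. It is only an illustration: the `TableEntry` type, the bucket path, and the table names are hypothetical, and the statements assume the `MSCK REPAIR TABLE` and `CONVERT TO DELTA` commands available in Spark SQL with Delta Lake.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableEntry:
    """One cataloged table: where it lives and how it is partitioned."""
    database: str
    table: str
    path: str
    partitions: Tuple[Tuple[str, str], ...] = ()  # (column, SQL type) pairs


def repair_statement(entry: TableEntry) -> str:
    # Interim support: register partitions that were written straight to S3.
    return f"MSCK REPAIR TABLE {entry.database}.{entry.table}"


def convert_statement(entry: TableEntry) -> str:
    # In-place conversion; partition columns must be declared explicitly.
    stmt = f"CONVERT TO DELTA parquet.`{entry.path}`"
    if entry.partitions:
        cols = ", ".join(f"{name} {sql_type}" for name, sql_type in entry.partitions)
        stmt += f" PARTITIONED BY ({cols})"
    return stmt


def migration_plan(catalog: List[TableEntry]) -> List[str]:
    """Conversion statements for every cataloged table, in catalog order."""
    return [convert_statement(entry) for entry in catalog]


# Hypothetical catalog entry for illustration only.
example = TableEntry("analytics", "device_events",
                     "s3://example-bucket/device_events", (("dt", "STRING"),))
```

Generating the statements from one catalog keeps the pause, convert, and resume steps scriptable, which matters because the conversion has to be coordinated across all pipelines at once.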

