Delivering Insights from
20M+ Smart Homes with
500M+ devices
Sameer Vaidya and Raghav Karnam
Data Engineering and Data Science
Universal Peace Mantra
May all beings everywhere be
happy and free and may the
thoughts, words and actions of
my own life contribute in some
way to that happiness and to that
freedom for all
Present Day:
Plume business and product imperatives and
expectations from data teams
Insights from worldwide Smart Home Locations,
Device types and Behaviors over Time
public examples at: https://plume.com, https://discover.plume.com/wfh-dashboard
Agenda:
Our journey to
developer & operations
productivity and scale:
▪ Job Clusters
▪ Template Notebooks
▪ Avro/Parquet -> Delta
▪ SQL Analytics
▪ ML Lifecycle
https://www.plume.com/careers
@ Sameer
Data Engineering,
Analytics & BI
@ Raghav
Data Science &
ML Engineering
Challenges with our first generation Spark
processing clusters and Data Warehouse
Poor Dev/Ops productivity, visibility, fragility
due to lack of automation and developer IDE, control over resources and complexity
• AWS EMR Spark Clusters
▪ DevOps-owned AWS IaaS became a bottleneck
▪ Lack of automation created poor utilization in prod and dev
▪ Poor developer productivity: Notebooks integration was complicated and largely unused
• AWS Glue Crawlers
▪ Metadata management is critical to see all data
▪ Scheduling is tricky
▪ Easy to make a mess: creates lots of cruft tables when misconfigured or when extraneous files are in the path
• AWS Athena (serverless)
▪ Data scientists couldn’t answer complex questions because long-running queries timed out
▪ Enabling the support Web app: limited queue slots cannot handle unpredictable Web app loads
#1 Developer and operational productivity:
Deploying worldwide E2 workspaces and
empowering developers with Notebooks and self
service clusters
Operate across N regions X [dev + prod] workspaces
Standardization and Automation: users:groups:clusters:buckets:subnets:jobs:databases:tables
• Standardize
Namespaces
• Map SAML IDP SSO
• Plan RBAC model
• https://status.databricks.com/
Developer productivity up 30-50% with Notebooks
• Use GitHub Repos
• Interactive dev/debug by
uploading jars
• Interactive SQL/Python
• Easily convert to scheduled
Jobs
• Combine with IDEs
• Databricks Connect
• Simba JDBC
• Schedule via Airflow
• Databricks Job Clusters
DevOpsless self-service Developers with Clusters
Databricks clusters reduced operational tickets and enhanced
productivity
• Standard / High Concurrency
• $$$ needs high utilization
• Lesson: optimized for
multiple concurrent queries, but
individual queries run slower
• Use EC2 Reserved Instances
for Driver nodes - and Spot
instances for all Workers - for
long or short running jobs
• Use Service Principals for team
ownership of logs / jobs
• Plan dedicated subnet space
for expansion
• Use 1 hr idle termination
• Best Practices
• Developers decide
cluster size for their
jobs; cluster policies
put sanity bounds
• Achieve High Availability
• AWS Instance
availability
• Databricks API
availability
• Retry Airflow Job
submission for 30 mins
• Plan AWS per AZ
Instance type
Availability
• Plan for Databricks API
outages during
upgrades
• Use Idempotency
tokens to avoid multiple
runs during API outages
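The retry and idempotency advice above can be sketched as follows. This is a minimal illustration, not the team's actual Airflow code: `send` is a hypothetical callable standing in for whatever client posts the body to the Databricks Jobs run-now endpoint, and treating `ConnectionError` as the only retryable failure is an assumption. The point it shows is that the token is generated once and reused on every retry within the 30-minute window.

```python
import time
import uuid
from typing import Callable, Dict


def submit_with_retry(
    send: Callable[[Dict], Dict],   # hypothetical: posts the body, returns the response
    job_id: int,
    window_seconds: float = 30 * 60,  # the 30-minute retry window from the slide
    pause_seconds: float = 30,        # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Retry a run-now request for up to `window_seconds`, reusing one token."""
    # One token per logical run: every retry carries the same value, so the
    # service can recognise a repeat instead of launching a second run.
    payload = {"job_id": job_id, "idempotency_token": str(uuid.uuid4())}
    deadline = clock() + window_seconds
    while True:
        try:
            return send(payload)
        except ConnectionError:
            if clock() >= deadline:
                raise  # retry window exhausted; surface the failure
            sleep(pause_seconds)
```

Generating the token outside the loop is the design choice that matters: a request that reached the API but whose response was lost during an outage is then matched to the existing run on the next attempt.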
Segment Usage/Billing by Teams, Projects, Owners
Use cost-center:region:team:env:project:owner AWS Tags in cluster creation
APIs
              Jan   ...   Dec
Cost Center   $     $$    $$$
Region        $     $$    $$$
Environment   $     $$    $$$
Team
Owner
With great authority
comes great
responsibility:
- Usage plan makes
owners
accountable
- Usage data is
available to you
- Customize using
Notebooks
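The cost-center:region:team:env:project:owner tagging scheme on this slide could be enforced when building a cluster creation request, roughly as below. This is a sketch under assumptions: the helper name and the cluster naming pattern are hypothetical, and a real request also needs fields such as the Spark version and node type, which are omitted here. `custom_tags`, `aws_attributes.first_on_demand`, `aws_attributes.availability` and `autotermination_minutes` are fields of the Databricks Clusters API; the values reflect the earlier advice on an on-demand driver, Spot workers and 1 hr idle termination.

```python
from typing import Dict

# Tag keys taken from the slide; every cluster must carry all of them.
TAG_KEYS = ("cost-center", "region", "team", "env", "project", "owner")


def build_cluster_spec(tags: Dict[str, str], num_workers: int) -> Dict:
    """Return a partial cluster spec, refusing to build one with missing tags."""
    missing = [key for key in TAG_KEYS if not tags.get(key)]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    return {
        # Hypothetical naming pattern, not from the talk.
        "cluster_name": f"{tags['team']}-{tags['project']}-{tags['env']}",
        "num_workers": num_workers,
        "autotermination_minutes": 60,  # 1 hr idle termination
        "aws_attributes": {
            "first_on_demand": 1,                  # driver on an on-demand instance
            "availability": "SPOT_WITH_FALLBACK",  # remaining nodes prefer Spot
        },
        "custom_tags": {key: tags[key] for key in TAG_KEYS},
    }
```

Rejecting untagged requests at creation time is what makes the monthly usage breakdown by cost center, region, environment, team and owner possible later.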
#2 Query performance, scale and automated
metadata management:
Migrating from legacy Avro/Parquet to Delta Lake
Migrate Glue metadata to Databricks Metastore
Move to Delta ASAP
- Poor performance for poorly
partitioned Avro/Parquet
- No Glue Crawlers
Interim support for Legacy
Avro/Parquet data:
- Generate DDL from
templates
- Jobs to MSCK REPAIR
TABLE + scripts to scan S3
and ADD PARTITION
Convert to Delta:
- Migrate Jobs to read/write
- AutoLoader
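The interim approach for legacy data (DDL from templates, then scripts that scan S3 and add partitions) might look like the sketch below. The template text, function names, and the example database, table and bucket are all hypothetical; only the SQL statement forms (`CREATE TABLE ... USING PARQUET`, `ALTER TABLE ... ADD PARTITION`) are standard Spark SQL. Listing the S3 keys is left out, so the code works on paths that some other step has already collected.

```python
from string import Template
from typing import Dict

# Hypothetical DDL template for a legacy Parquet table.
CREATE_TEMPLATE = Template(
    "CREATE TABLE IF NOT EXISTS $db.$table ($columns) "
    "USING PARQUET PARTITIONED BY ($partition_cols) LOCATION '$location'"
)


def render_create(db: str, table: str, columns: str,
                  partition_cols: str, location: str) -> str:
    """Fill the template for one table."""
    return CREATE_TEMPLATE.substitute(
        db=db, table=table, columns=columns,
        partition_cols=partition_cols, location=location,
    )


def partition_spec(path: str, table_root: str) -> Dict[str, str]:
    """Extract key=value partition segments from a path under table_root."""
    if not path.startswith(table_root):
        raise ValueError(f"{path} is not under {table_root}")
    spec = {}
    for segment in path[len(table_root):].strip("/").split("/"):
        if "=" in segment:  # skips data file names such as part-0.parquet
            key, value = segment.split("=", 1)
            spec[key] = value
    return spec


def add_partition_sql(db: str, table: str, spec: Dict[str, str]) -> str:
    """Build the ALTER TABLE statement that registers one partition."""
    clause = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return f"ALTER TABLE {db}.{table} ADD IF NOT EXISTS PARTITION ({clause})"
```

A scheduled job would run the generated statements, or `MSCK REPAIR TABLE` where a full rescan is acceptable, in place of the Glue Crawlers.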
In-place Parquet -> Delta conversion is optimal on resources
but requires complex coordination
1. Catalog all paths,
databases, tables
2. Prepare DDL USING
PARQUET & DELTA
3. Convert pipelines to
read/write Delta
instead of Parquet
4. Coordinate with
external consumers
5. Pause and upgrade all
pipelines
6. Migrate parquet to
delta
7. Resume pipelines
8. Schedule Glue MSCK
9. Recovery Plan
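Steps 1 and 6 above can be tied together by generating the conversion statements from the catalog of paths. The sketch below assumes a hypothetical `TableEntry` record and example values; the generated statement uses Delta Lake's `CONVERT TO DELTA` syntax, which takes the Parquet path and, for partitioned data, the partition columns with their types.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableEntry:
    """One row of the catalog built in step 1 (hypothetical shape)."""
    database: str
    table: str
    path: str
    partitions: Tuple[Tuple[str, str], ...] = ()  # (column name, SQL type)


def convert_statement(entry: TableEntry) -> str:
    """Build the in-place conversion statement for one cataloged path."""
    sql = f"CONVERT TO DELTA parquet.`{entry.path}`"
    if entry.partitions:
        cols = ", ".join(f"{name} {sql_type}" for name, sql_type in entry.partitions)
        sql += f" PARTITIONED BY ({cols})"
    return sql


def conversion_plan(catalog: List[TableEntry]) -> List[str]:
    """Statements to run while pipelines are paused, in a stable order."""
    ordered = sorted(catalog, key=lambda e: (e.database, e.table))
    return [convert_statement(entry) for entry in ordered]
```

Keeping the plan as a reviewable list of statements also gives the recovery plan something concrete to check off if the conversion window has to be aborted.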
#3 Scalable SQL Analytics over large data sets:
Migrating Data Scientists, Analysts and BI
dashboards to consume Databricks SQL Analytics
Endpoints
SQLA Endpoints optimized for BI/Analytics workloads
• Start with a single
“general-purpose” endpoint
• 1 hour idle
termination
• Rich SQL IDEs
supported - DBeaver
• Can serve Web APIs!
Create dedicated SQL endpoint / clusters for each use case; size clusters per use case / workload
#4. Summary:
Scaling development and operations for BI and
Analytics for worldwide deployments requires:
- Workspace management
- Clusters + Notebooks
- Metadata management
- Migrate BI/adhoc to SQL Endpoints
Present Day:
Plume ML Focus areas and expectations from
Machine learning teams
Challenges with our first generation ML lifecycle
and MLOps.
Our evolution to increase productivity of our Data
Scientists.
[Diagram: ML lifecycle stages and owners: Curate Data (DE), Build Model (Data Scientist), Deployment / ML Model (ML Engineer), Integrate Model (SWE), Operate Model. Checks shown along the flow: model performance metrics / thresholds, A/B testing, pass/fail, monitor for data drift.]
#5. ML Lifecycle in Databricks
Plume’s ML Architecture
Models Across Databricks Workspaces
Demo
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
