We started out processing big data using AWS S3, EMR clusters, and Athena to serve analytics data extracts to Tableau BI.
However, as our data and team sizes increased, Avro schemas from source data evolved, and we attempted to serve analytics data through Web apps, we hit a number of limitations in the AWS EMR and Glue/Athena approach.
This is the story of how we scaled out our data processing and boosted team productivity to meet our current demand for insights from 20M+ Smart Homes and 500M+ devices across the globe, coming from numerous internal business teams and our 150+ CSP partners.
We describe the lessons learned and best practices established as we enabled our teams with Databricks autoscaling Job clusters and Notebooks, and migrated our Avro/Parquet data to use the Metastore, SQL Endpoints and the SQLA Console, while charting the path to the Delta Lake.
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
1. Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Sameer Vaidya and Raghav Karnam
Data Engineering and Data Science
2. Universal Peace Mantra
May all beings everywhere be happy and free, and may the thoughts, words and actions of my own life contribute in some way to that happiness and to that freedom for all.
4. Insights from worldwide Smart Home locations, device types and behaviors over time
Public examples at: https://plume.com, https://discover.plume.com/wfh-dashboard
5. Agenda: our journey to developer & operations productivity and scale
▪ Job Clusters
▪ Template Notebooks
▪ Avro/Parquet -> Delta
▪ SQL Analytics
▪ ML Lifecycle
https://www.plume.com/careers
@ Sameer: Data Engineering, Analytics & BI
@ Raghav: Data Science & ML Engineering
6. Challenges with our first-generation Spark processing clusters and Data Warehouse
7. Poor dev/ops productivity, visibility, and fragility
• AWS EMR Spark Clusters
  • DevOps-owned AWS IaaS became a bottleneck
  • Lack of automation created poor utilization in prod and dev
  • Poor developer productivity: Notebooks integration was complicated and largely unused, due to lack of automation and a developer IDE, limited control over resources, and complexity
• AWS Glue Crawlers
  • Metadata management is critical to see all data
  • Scheduling is tricky and it is easy to make a mess
  • Creates lots of cruft tables when misconfigured or when extraneous files sit in the path
• AWS Athena (serverless)
  • Data scientists couldn't answer complex questions: long-running queries time out
  • Limited queue slots cannot handle unpredictable Web app loads when serving Web apps
8. #1 Developer and operational productivity: deploying worldwide E2 workspaces and empowering developers with Notebooks and self-service clusters
9. Operate across N regions x [dev + prod] workspaces
Standardization and automation across users : groups : clusters : buckets : subnets : jobs : databases : tables
• Standardize namespaces (see the naming sketch below)
• Map SAML IdP SSO
• Plan the RBAC model
• https://status.databricks.com/
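A minimal sketch of what namespace standardization can look like in practice: a shared helper that derives predictable names for clusters, buckets, databases and jobs from the same region/env/team tokens. The convention shown is purely illustrative, not the actual scheme used here.

```python
# Hypothetical naming helper: one convention for every resource kind,
# so workspaces, clusters, buckets and databases line up across regions.
def resource_name(kind: str, region: str, env: str, team: str, project: str) -> str:
    """Build a predictable name like 'cluster-us-west-2-prod-analytics-churn'."""
    parts = [kind, region, env, team, project]
    for part in parts:
        # enforce lowercase, hyphenated tokens so names stay machine-parseable
        assert part == part.lower().replace(" ", "-"), f"non-standard token: {part}"
    return "-".join(parts)

print(resource_name("cluster", "us-west-2", "prod", "analytics", "churn"))
# -> cluster-us-west-2-prod-analytics-churn
```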
10. Developer productivity up 30-50% with Notebooks
• Use GitHub Repos
• Interactive dev/debug by uploading jars
• Interactive SQL/Python
• Easily convert to scheduled Jobs
• Combine with IDEs
  • Databricks Connect
  • Simba JDBC
• Schedule via Airflow (a minimal sketch follows this list)
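A minimal sketch of scheduling a notebook as a Databricks Job from Airflow, assuming the apache-airflow-providers-databricks package and a configured databricks_default connection. The DAG name, notebook path, and cluster spec are illustrative.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="nightly_device_rollup",                                   # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 6, "retry_delay": timedelta(minutes=5)},  # roughly 30 min of retries
    catchup=False,
) as dag:
    rollup = DatabricksSubmitRunOperator(
        task_id="run_rollup_notebook",
        databricks_conn_id="databricks_default",
        new_cluster={                                                  # ephemeral Job cluster, sized per job
            "spark_version": "8.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        notebook_task={"notebook_path": "/Repos/analytics/nightly_device_rollup"},
    )
```

The same notebook developed interactively becomes the scheduled Job, which is what makes the Notebooks-first workflow pay off operationally.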
11. Databricks Job Clusters: DevOps-less, self-service clusters for developers
Databricks clusters reduced operational tickets and enhanced productivity.
• Standard / High Concurrency clusters: $$$, they need high utilization
  • Lesson: High Concurrency is optimized for multiple concurrent queries but runs individual queries slower
• Use EC2 Reserved Instances for Driver nodes and Spot Instances for all Workers, for long- or short-running jobs
• Use Service Principals for team ownership of logs / jobs
• Plan dedicated subnet space for expansion
• Use 1 hr idle termination
• Best practice: developers decide the cluster size for their jobs; cluster policies put sanity bounds around their choices (see the policy sketch below)
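A minimal sketch of a cluster policy that keeps sanity bounds while leaving sizing to developers, created via the Cluster Policies API. The allowed instance types, worker limit, and workspace URL are illustrative assumptions.

```python
import json
import requests

# Policy definition: allowlist of node types, bounded autoscaling, fixed idle termination.
policy_definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 20, "defaultValue": 4},
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},  # 1 hr idle termination
}

resp = requests.post(
    "https://<workspace-host>/api/2.0/policies/clusters/create",   # replace with your workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"name": "team-analytics-bounded", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()
```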
• Achieve High Availability
  • Retry Airflow Job submission for 30 mins to ride out AWS instance availability and Databricks API availability issues
  • Plan for per-AZ AWS instance type availability
  • Plan for Databricks API outages during upgrades
  • Use idempotency tokens to avoid multiple runs during API outages (see the sketch below)
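A minimal sketch of submitting a one-time run with an idempotency token and retrying submission for up to ~30 minutes, so an API outage cannot trigger duplicate runs. Endpoint and payload fields follow the Jobs Runs Submit API; cluster and notebook values are illustrative.

```python
import time
import uuid
import requests

payload = {
    "run_name": "nightly_device_rollup",
    "idempotency_token": str(uuid.uuid4()),   # reuse the SAME token across all retries
    "new_cluster": {"spark_version": "8.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 4},
    "notebook_task": {"notebook_path": "/Repos/analytics/nightly_device_rollup"},
}

deadline = time.time() + 30 * 60               # retry window of ~30 minutes
while True:
    try:
        resp = requests.post(
            "https://<workspace-host>/api/2.0/jobs/runs/submit",
            headers={"Authorization": "Bearer <personal-access-token>"},
            json=payload,
            timeout=60,
        )
        resp.raise_for_status()
        print("run_id:", resp.json()["run_id"])
        break
    except requests.RequestException:
        if time.time() > deadline:
            raise                              # give up after the retry window
        time.sleep(60)                         # wait and retry with the same idempotency token
```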
12. Segment usage/billing by teams, projects, owners
Use cost-center : region : team : env : project : owner AWS Tags in the cluster creation APIs (see the tagging sketch below).
(Illustrative chart: monthly usage $ from Jan through Dec, broken down by Cost Center, Region, Environment, Team and Owner.)
With great authority comes great responsibility:
- A usage plan makes owners accountable
- Usage data is available to you
- Customize it using Notebooks
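A minimal sketch of tagging a cluster at creation time so usage can be segmented in AWS cost reports; custom_tags is part of the cluster spec and is propagated to the underlying EC2 instances. The tag keys and values are illustrative.

```python
# Cluster spec fragment carrying cost-attribution tags; the same dict can be
# passed as new_cluster to the Jobs API or an Airflow DatabricksSubmitRunOperator.
new_cluster = {
    "spark_version": "8.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "custom_tags": {
        "cost-center": "analytics",
        "region": "us-west-2",
        "team": "data-eng",
        "env": "prod",
        "project": "device-rollup",
        "owner": "sameer",
    },
}
```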
13. #2 Query performance, scale, and automated metadata management: migrating from legacy Avro/Parquet to the Delta Lake
14. Migrate Glue metadata to the Databricks Metastore; move to Delta ASAP
- Poor performance for poorly partitioned Avro/Parquet
- No Glue Crawlers
Interim support for legacy Avro/Parquet data (sketched below):
- Generate DDL from templates
- Jobs to MSCK REPAIR TABLE, plus scripts to scan S3 and ADD PARTITION
Convert to Delta:
- Migrate Jobs to read/write Delta
- Use AutoLoader
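A minimal sketch of the interim approach for legacy Avro/Parquet tables: register newly landed partitions without Glue Crawlers. Table, path, and partition column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let the metastore rediscover partitions that follow Hive-style paths (dt=2021-05-01/...)
spark.sql("MSCK REPAIR TABLE analytics.device_events")

# For locations MSCK cannot infer, add partitions explicitly after scanning S3
spark.sql("""
  ALTER TABLE analytics.device_events
  ADD IF NOT EXISTS PARTITION (dt='2021-05-01')
  LOCATION 's3://example-bucket/device_events/dt=2021-05-01/'
""")
```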
15. Parquet -> Delta in-place conversion is optimal on resources but requires complex coordination
1. Catalog all paths, databases, tables
2. Prepare DDL USING PARQUET & DELTA
3. Convert pipelines to read/write Delta instead of Parquet
4. Coordinate with external consumers
5. Pause and upgrade all pipelines
6. Migrate Parquet to Delta (see the conversion sketch below)
7. Resume pipelines
8. Schedule Glue MSCK
9. Recovery plan
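A minimal sketch of the in-place Parquet-to-Delta conversion performed in step 6 while pipelines are paused. The S3 path, table name, and partition schema are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert the existing Parquet files in place into a Delta table
spark.sql("""
  CONVERT TO DELTA parquet.`s3://example-bucket/device_events/`
  PARTITIONED BY (dt DATE)
""")

# Register (or re-point) the metastore table at the converted location
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.device_events
  USING DELTA
  LOCATION 's3://example-bucket/device_events/'
""")
```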
16. #3 Scalable SQL Analytics over large data sets: migrating Data Scientists, Analysts and BI dashboards to consume Databricks SQL Analytics Endpoints
17. SQLA Endpoints optimized for BI/Analytics workloads
• Start with a single "general-purpose" endpoint
• 1 hour idle termination
• Rich SQL IDEs supported, e.g. DBeaver
• Can serve Web APIs! (see the connection sketch below)
• Create a dedicated SQL endpoint / cluster for each use case; size clusters per use case / workload
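A minimal sketch of querying a SQL endpoint from an application, assuming the databricks-sql-connector package. The hostname, HTTP path, token, and table name are illustrative placeholders.

```python
from databricks import sql

conn = sql.connect(
    server_hostname="<workspace-host>",
    http_path="/sql/1.0/endpoints/<endpoint-id>",   # HTTP path of the SQL endpoint
    access_token="<personal-access-token>",
)
cursor = conn.cursor()
cursor.execute(
    "SELECT device_type, count(*) AS devices "
    "FROM analytics.device_inventory GROUP BY device_type"
)
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()
```

The same endpoint can be reached over Simba JDBC from BI tools and SQL IDEs, which is why one connection standard serves both dashboards and Web APIs.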
18. #4 Summary: scaling development and operations for BI and Analytics for worldwide deployments requires:
- Workspace management
- Clusters + Notebooks
- Metadata management
- Migrating BI/ad hoc workloads to SQL Endpoints
21. Challenges with our first-generation ML lifecycle and MLOps, and our evolution to increase the productivity of our Data Scientists
22. (ML lifecycle diagram: Curate Data [Data Engineer] -> Build Model [Data Scientist] -> Deploy Model [ML Engineer] -> Integrate Model [SWE, A/B testing, Pass/Fail] -> Operate Model [Monitor for Data Drift], with model performance metrics/thresholds gating each hand-off.)