
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks

The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software.

Published in: Data & Analytics

  1. Lessons from building large-scale, multi-cloud, SaaS software at Databricks
     Jeff Pang, Principal Software Engineer, Databricks
  2. Who am I?
     ▪ Jeff Pang, Principal Software Engineer, Databricks
     ▪ Databricks Platform Engineering: To help data teams solve the world’s toughest problems, the Databricks Platform team provides the world-class, multi-cloud platform that enables us to expand fast and iterate quickly
     http://databricks.com/careers
  3. About
     ▪ Founded in 2013 by the original creators of Apache Spark
     ▪ Data and AI platform as a service for 5000+ customers
     ▪ 1000+ employees, 200+ engineers, >$200M annual recurring revenue
  4. Our product
     Users: data scientists, data engineers, business users
  5. Agenda
     ▪ The architecture: Inside the Unified Analytics Platform
     ▪ Challenges & lessons
       ▪ Growing a SaaS data platform
       ▪ Operating on multiple clouds
       ▪ Accelerating a data platform with data & AI
  6. The architecture: Inside the Unified Analytics Platform
  7. Simple data engineering architecture
     [Diagram: a cluster reads CSV, JSON, TXT… files from a data lake (S3, HDFS, Blob Store, etc.) and refines them in three stages: Bronze (raw ingestion), Silver (filtered, cleaned, augmented), and Gold (business-level aggregates), which feed reporting and analytics.]
  8. Modern data engineering architecture
     [Diagram labels: CSV, JSON, TXT…; Kinesis; data lake with Bronze, Silver, Gold; reporting; notebooks, AI; streaming analytics; workflow scheduling; clusters; cluster management.]
  9. Multiply by thousands of customers...
     [Diagram: many customer networks, each with its own data lake and sources (CSV, JSON, TXT…, Kinesis), attached to one control plane that provides collaborative notebooks, AI; streaming analytics; workflow scheduling; cluster management; admin & security; reporting, business insights.]
  10. ...across many regions...
  11. ...on multiple clouds...
  12. → millions of VMs managed per day
  13. That’s the Databricks control plane
      ▪ 100,000s of users
      ▪ 100,000s of Spark clusters per day
      ▪ Millions of VMs launched per day
      ▪ Exabytes of data processed per day
      What did we learn from building a large-scale, multi-cloud data platform?
  14. Growing a SaaS data platform
  15. Evolution of the Databricks control plane
      We didn’t start with a global-scale, multi-cloud data platform
      Challenge: Scaling a data platform from one customer to 5000+
      Lesson: The factory that builds and evolves the data platform is more important than the data platform itself
  16. Fast time to market: Databricks control plane “in-a-box”
      ▪ Need to deliver value quickly
      ▪ Need to iterate quickly
      ▪ Can’t break things while iterating!
      Keys to success:
      ▪ Modern CI
      ▪ Fast developer tools
      ▪ Testing, testing, testing
      By the numbers:
      ▪ 25-500x Scala build speedups
      ▪ 10s of millions of tests per day
      ▪ 100s of Databricks “in-a-box” test envs per day
      [Diagram labels: V1, V2]
  17. Expand the total addressable market: Replicating control planes quickly
      ▪ Need different configurations for different environments
      ▪ Need to update many environments
      ▪ Can’t slow down platform development!
      Keys to success:
      ▪ Declarative infrastructure
      ▪ Modern CD infrastructure
      [Diagram labels: jsonnet; 10 million lines; 250k lines]
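To make "declarative infrastructure" concrete, here is a minimal sketch in Python of the general idea: declare one base configuration plus small per-environment overrides, then expand them into a full config per environment. The slide names jsonnet as the tool; this sketch only imitates that style of override, and every name in it (`BASE`, `ENVIRONMENTS`, the environment names, the config keys) is a hypothetical stand-in, not Databricks' actual configuration.

```python
import copy
import json

# Hypothetical base definition shared by every control-plane environment.
BASE = {
    "service": "api-server",
    "replicas": 2,
    "database": {"engine": "mysql", "size": "small"},
}

# Hypothetical per-environment overrides; only the differences are declared.
ENVIRONMENTS = {
    "dev": {},
    "aws-us-west": {"replicas": 6, "database": {"size": "large"}},
    "azure-west-europe": {"replicas": 4},
}


def deep_merge(base, override):
    """Return a new dict where override's values win, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_all(base, environments):
    """Expand the compact declaration into one full config per environment."""
    return {name: deep_merge(base, override) for name, override in environments.items()}


if __name__ == "__main__":
    print(json.dumps(render_all(BASE, ENVIRONMENTS), indent=2, sort_keys=True))
```

Adding an environment in this style is a one-line override, and a change to `BASE` reaches every environment the next time the configs are rendered.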
  18. Land and expand workloads: Scaling the control plane
      ▪ Need to support more users & workloads
      ▪ Need to build more features that scale
      ▪ Don’t want devs to reinvent the wheel!
      Keys to success:
      ▪ A service framework to do the hard stuff
      ▪ Decompose monoliths to microservices
      Service Framework: container & replica management, APIs & RPCs, rate limits, metrics, logging, secrets & security, ...
      [Diagram labels, version 1: Cloud VM API; Cluster Manager; Customer Clusters. Version 3: Cloud VM API; API Servers; CM Master; CM Shard; Workers; Customer Clusters. Axis: usage.]
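One item in the service framework list, rate limits, can be illustrated with a token bucket. The slides do not say which algorithm the framework uses, so this is a generic sketch of a common technique, and the class and parameter names are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock  # injectable so tests can control time
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Putting a limiter like this in a shared framework means each service gets the same behavior without reimplementing it, which is the "don't reinvent the wheel" point of the slide.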
  19. The Databricks data platform factory
      [Diagram labels: Data Platform (API Servers, CM Master, CM Shard, Workers); many customer networks; Envoy, GraphQL; HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...; Kubernetes; cloud VMs, network, storage, databases.]
  20. Operating on multiple clouds
  21. Why multi-cloud?
      The data platform needs to be where the data is
      ▪ Performance, latency, egress data costs
      ▪ Cloud-specific integrations
      ▪ Data governance policies
      Challenge: Supporting multiple clouds without sacrificing dev velocity
      Lesson: A cloud-agnostic layer is key to dev velocity, but it also needs to integrate with the standards of each cloud and deal with their quirks
  22. Challenge: dev velocity on multiple clouds
      Services? Many cloud services have no direct equivalents
      ▪ DynamoDB vs ?
      ▪ CosmosDB vs ?
      ▪ Aurora vs ?
      ▪ SQL DW vs ?
      APIs? Cloud APIs don’t look like each other
      ▪ SDK: no common interfaces
      ▪ Auth: IAM vs AAD
      ▪ ACLs: IAM vs Azure RBAC
      Ops? Operational tools for each cloud are very different
      ▪ Templates: CloudFormation vs ARM templates
      ▪ Logs: CloudWatch vs Azure Monitor
  23. Approach: cloud agnostic dev framework
      Use lowest common denominator cloud services
      ▪ EKS ← Kubernetes → AKS
      ▪ EC2 ≈ Azure Compute
      ▪ VPC ≈ VNet
      ▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
      ▪ ELB ≈ Azure Load Balancer
      [Diagram labels above the cloud services: Service framework API; Envoy; HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...; API Servers, CM Master, CM Shard, Workers.]
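The cloud-agnostic approach can be sketched as an interface that service code depends on, with one backend per cloud behind it. The sketch below is an assumption-heavy illustration: `VmProvider`, `InMemoryProvider`, `make_provider`, and `launch_cluster` are hypothetical names, and the backend is an in-memory stand-in rather than a wrapper around any real cloud SDK.

```python
from abc import ABC, abstractmethod
import itertools


class VmProvider(ABC):
    """Hypothetical cloud-agnostic interface that service code depends on."""

    @abstractmethod
    def launch(self, size):
        """Start one VM and return its provider-specific ID."""

    @abstractmethod
    def terminate(self, vm_id):
        """Stop the VM with the given ID."""


class InMemoryProvider(VmProvider):
    """Stand-in backend; a real one would wrap the EC2 or Azure Compute SDK."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.running = set()
        self._ids = itertools.count(1)

    def launch(self, size):
        vm_id = f"{self.prefix}-{size}-{next(self._ids)}"
        self.running.add(vm_id)
        return vm_id

    def terminate(self, vm_id):
        self.running.discard(vm_id)


def make_provider(cloud):
    """Pick a backend by cloud name so callers never branch on the cloud again."""
    prefixes = {"aws": "ec2", "azure": "azvm"}
    if cloud not in prefixes:
        raise ValueError(f"unsupported cloud: {cloud}")
    return InMemoryProvider(prefixes[cloud])


def launch_cluster(provider, workers, size="small"):
    """Cloud-agnostic logic: works with any VmProvider implementation."""
    return [provider.launch(size) for _ in range(workers)]
```

Only `make_provider` knows which cloud is in use; `launch_cluster` and anything built on it stay identical across clouds.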
  24. Challenge: not everything can be cloud agnostic
      ▪ Customers want to integrate with the standards of each cloud
      ▪ “Equivalent” cloud services have implementation quirks
  25. Approach: abstraction layer for key integrations
      ▪ AuthN / AuthZ / Identity: Okta, OneLogin, etc.; IAM roles; Azure Active Directory
      ▪ Bring your own key encryption: KMS; Azure Key Vault
      ▪ Unified usage service: AWS Marketplace, Custom Billing; Azure Commerce Billing
      ▪ Databricks file system: S3 (S3 commit service); Azure Storage
      Underlying cloud services:
      ▪ Fargate ← Kubernetes → AKS
      ▪ EC2 ≈ Azure Compute
      ▪ VPC ≈ VNet
      ▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
      ▪ ELB ≈ Azure Load Balancer
      [Diagram labels: API Servers, CM Master, CM Shard, Workers.]
  26. Approach: harmonize “equivalent” cloud service quirks
      Virtual machines: Promise of elastic compute is unevenly distributed
      ▪ Provisioning speed differs
      ▪ Deletion speed differs (speed to refill quota)
      → Need to adapt to cloud resource and API limits
      Network: TCP connections are hard
      ▪ “Invisible” NATs have connection & timeout limits
      → Need tuned keep alive, connection limit configs
      ▪ Kernel TCP SACK bug caused API hangs in one cloud only
      → Need deep robustness testing against both clouds (ex: poor NIC reliability)
      Databases: When MySQL != MySQL
      ▪ Host OS matters. Ex: case sensitivity defaults
      ▪ Default DB params matter. Ex: tablespace config → 100x difference in recovery time
      → Need expertise in DB tuning to ensure equivalence
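As an illustration of "tuned keep alive" configs, the sketch below enables TCP keepalive on a socket so that an idle connection is probed before a NAT's idle timeout can silently drop it. The helper name `enable_keepalive` and the default values are assumptions for illustration; suitable values depend on the NAT timeouts of the cloud in question, which the slides do not give.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive with illustrative (not recommended) timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform-specific, so set only those that exist.
    options = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in options:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

For the probes to help, the idle time has to be shorter than the NAT's idle timeout, so per-cloud values would likely live in configuration rather than in code.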
  27. Accelerating a data platform with data & AI
  28. Inception: Improving a data platform with data & AI
      We are one of our biggest customers
      Challenge: Building a data platform is hard without a data platform
      ▪ Need data to track usage, maintain security
      ▪ Need data to observe and improve how users use the data platform
      ▪ Need data to keep the data platform up and running
      Lesson: Data & AI can accelerate data platform features, product analytics, and devops
  29. How we use Databricks to accelerate itself
      Key platform features
      ▪ Usage and billing reports
      ▪ Audit logs
      Essential product analytics
      ▪ Feature usage, trends, prediction
      ▪ Growth and churn forecast, models
      Mission critical devops
      ▪ Service KPIs and SLAs
      ▪ API and application structured logs
      ▪ Spark debug logs
  30. Data foundation & analytics: Our distributed data pipelines
      ▪ 100s of TB logs per day
      ▪ Millions of time series per second
      ▪ Time-series, raw logs, request tracing, dashboards
      ▪ Kinesis, Event Hubs
      ▪ Declarative data pipeline deployments
      ▪ Real-time streaming
  31. Takeaways
      The architecture
      ▪ Managing millions of VMs around the world in multiple clouds
      Challenges & lessons
      ▪ The factory that builds and evolves the data platform is more important than the data platform itself
      ▪ A cloud-agnostic platform that integrates with cloud standards and quirks is the key to multi-cloud
      ▪ Data & AI accelerates data platform features, product analytics, and devops
      Join us! http://databricks.com/careers
  32. Feedback
      Your feedback is important to us. Don’t forget to rate and review the sessions.
  33. Our Product
      Built around open source:
      ▪ Users: data scientists, data engineers, business users
      ▪ Interfaces: interactive data science, scheduled jobs, SQL frontend
      [Diagram labels: Databricks Service; Customer’s Cloud Account; Compute Clusters; Databricks Runtime; Cloud Storage.]