Scaling and Modernizing Data Platform with Databricks


Today a Data Platform is expected to process and analyze a multitude of sources spanning batch files, streaming sources, backend databases, REST APIs, and more. There is a clear need for a standardized platform that scales and stays flexible, letting data engineers and data scientists focus on business problems rather than on managing infrastructure and backend services. Another key aspect of the platform is multi-tenancy: isolating workloads and tracking cost usage per tenant.

In this talk, Richa Singhal and Esha Shah will cover how to build a scalable Data Platform using Databricks and how to deploy data pipelines effectively while managing costs. The following topics will be covered:

Key tenets of a Data Platform
Setting up a multi-stage environment on Databricks
Building data pipelines locally and testing them on a Databricks cluster
CI/CD for data pipelines with Databricks
Orchestrating pipelines using Apache Airflow
Change Data Capture using Databricks Delta
Leveraging Databricks Notebooks for Analytics and Data Science teams
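At its core, Change Data Capture with Delta is an upsert: merge a batch of changed rows into the target table on a key, updating rows that match and inserting the rest. A minimal sketch of that merge logic in plain Python (this illustrates the semantics only, not the Delta API; column names are made up):

```python
def merge_changes(target, changes, key="id"):
    """Upsert `changes` into `target` on `key`: update matching rows,
    insert new ones. Both inputs are lists of dicts."""
    by_key = {row[key]: dict(row) for row in target}
    for change in changes:
        if change[key] in by_key:
            by_key[change[key]].update(change)   # matched -> update
        else:
            by_key[change[key]] = dict(change)   # not matched -> insert
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
changes = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]
print(merge_changes(target, changes))
# [{'id': 1, 'status': 'open'}, {'id': 2, 'status': 'closed'}, {'id': 3, 'status': 'open'}]
```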


Transcript

  1. Managing & Scaling Data Pipelines with Databricks. Esha Shah, Senior Data Engineer; Richa Singhal, Senior Data Engineer. Atlassian, Go-To-Market Data Engineering.
  2. Agenda: Atlassian Overview; Challenges; Adopting Databricks Data Platform; Summary.
  3. Growth over the last 5 years: data volume up 20x (multiple petabytes); 5x growth in number of internal users; 5x events per day (billions).
  4. Atlassian Data Architecture (Before Databricks)
  5. Key Challenges with Legacy Architecture: development; cross-team dependencies; cluster management; collaboration.
  6. Prepping for Scale: self-service; standardization; automation; agility; cost optimization.
  7. Current Atlassian Data Architecture
  8. Our Success Story. Rapid Development: reduced development time. Collaboration: increased team and project efficiency with simplified sharing and co-authoring. Scaling: supported growth while reducing infrastructure cost. Self-Service: removed data engineering dependency for Analytics and Data Science teams.
  9. Adopting Databricks at Atlassian: building data pipelines; orchestration; leveraging Databricks Delta; Databricks for Analytics and Data Science.
  10. Building Data Pipelines
  11. Data Pipelines with Databricks: pipelines using Notebooks; pipelines using DB-Connect.
  12. Development using Databricks Notebook (diagram: Jira ticket; Bitbucket branch; command-line import/export to the Databricks workspace; Databricks notebook; interactive and ephemeral clusters in AWS).
  13. Multi-stage Envs using Databricks Workspaces (diagram: notebooks promoted by a Bitbucket CI/CD pipeline from a Dev folder (local/development) to Stg and Prod folders (stage/production), each backed by its own cluster).
  14. Bitbucket CICD Pipeline:

```yaml
branches:
  main:
    - step:
        name: Check configuration file
        deployment: test
        script:
          - pip install -r requirements.txt
          - 'yamllint -d "{extends: default, rules: {}}" config.yaml'
          - python databricks_cicd/check_duplicates.py
    - step:
        name: Move code to Databricks
        deployment: production
        caches:
          - pip
        script:
          - pip install -r requirements.txt
          - bash databricks_cicd/move_code_to_databricks.sh prod
    - step:
        name: Update the job in Databricks
        script:
          - pip install -r requirements.txt
          - python databricks_cicd/configure_job_in_databricks.py
```
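The contents of check_duplicates.py are not shown in the deck. A plausible sketch of such a guard, assuming the goal is to fail the build when the same job name is declared twice in config.yaml (in CI, a non-empty result would make the script exit non-zero):

```python
def find_duplicates(job_names):
    """Return job names that appear more than once, preserving first-seen order."""
    seen, dupes = set(), []
    for name in job_names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes

# In CI this would be fed job names parsed out of config.yaml.
print(find_duplicates(["ingest_events", "score_model", "ingest_events"]))
# ['ingest_events']
```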
  15. Development using DB-Connect Library (diagram: Jira ticket; Bitbucket branch; local IDE connected to a Databricks cluster via db-connect; pull request/merge; interactive and ephemeral clusters in AWS).
  16. Multi-stage Envs using AWS S3 (diagram: local IDE with Docker; Bitbucket CI/CD pipeline promoting code from a Dev bucket (local/development) to Stg and Prod buckets (stage/production); Stg and Prod clusters).
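The S3-based promotion flow implies a mapping from deployment stage to bucket and cluster. A small resolution helper might look like this (the bucket and cluster names are placeholders, not Atlassian's actual resources):

```python
STAGE_CONFIG = {
    "dev":  {"bucket": "s3://data-platform-dev",  "cluster": "dev-cluster"},
    "stg":  {"bucket": "s3://data-platform-stg",  "cluster": "stg-cluster"},
    "prod": {"bucket": "s3://data-platform-prod", "cluster": "prod-cluster"},
}

def resolve_stage(stage):
    """Look up the S3 bucket and cluster for a deployment stage."""
    if stage not in STAGE_CONFIG:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(STAGE_CONFIG)}")
    return STAGE_CONFIG[stage]

print(resolve_stage("stg")["bucket"])
# s3://data-platform-stg
```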
  17. Orchestration
  18. Orchestration using Airflow (diagram: Airflow on Kubernetes running SparkSubmit and Notebook tasks against code on S3 and notebooks in the Databricks workspace; YODA, an in-house data quality platform; SignalFx; Opsgenie on-call; Slack notifications).
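Airflow's Databricks integration submits each notebook task as a JSON payload to the Databricks Jobs API. A sketch of assembling such a payload (the runtime version, node type, and notebook paths are illustrative, not values from the deck):

```python
def build_notebook_run(notebook_path, num_workers=2):
    """Build a runs-submit style payload for a Databricks notebook task."""
    return {
        "run_name": f"airflow-{notebook_path.rsplit('/', 1)[-1]}",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",   # illustrative runtime
            "node_type_id": "i3.xlarge",          # illustrative node type
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }

payload = build_notebook_run("/Prod/pipelines/daily_load")
print(payload["run_name"])
# airflow-daily_load
```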
  19. Tracking Resource Usage and Cost. Each Databricks job carries metadata in its custom tags, which feed the data lake for ad hoc reporting:

```python
'custom_tags': {
    'business_unit': 'Data Engineering',
    'environment': cluster_env,
    'pipeline': 'Team_name',
    'user': 'airflow',
    'resource_owner': '<resource_owner>',
    'service_name': '<service-name>'
}
```
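With every job tagged this way, per-team cost reporting reduces to a group-by over the tags in the usage records. A conceptual sketch (the record shape here is assumed for illustration, not the actual Databricks usage-log schema):

```python
from collections import defaultdict

def cost_by_tag(usage_records, tag="pipeline"):
    """Sum cost per value of a custom tag across usage records."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["custom_tags"].get(tag, "untagged")] += record["cost"]
    return dict(totals)

usage = [
    {"custom_tags": {"pipeline": "team_a"}, "cost": 12.5},
    {"custom_tags": {"pipeline": "team_b"}, "cost": 4.0},
    {"custom_tags": {"pipeline": "team_a"}, "cost": 7.5},
]
print(cost_by_tag(usage))
# {'team_a': 20.0, 'team_b': 4.0}
```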
  20. Leveraging Databricks Delta
  21. Delta: Time Travel; Merge; Auto-optimize.
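The Merge feature corresponds to Delta's MERGE INTO statement, which performs the CDC upsert in a single transactional operation. As a sketch, a helper composing the SQL one would run (table and column names are illustrative):

```python
def merge_sql(target, source, key):
    """Compose a Delta MERGE INTO statement for a keyed upsert."""
    return (
        f"MERGE INTO {target} t "
        f"USING {source} s ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

# In a Databricks job this string would be passed to spark.sql(...).
print(merge_sql("events", "events_updates", "event_id"))
```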
  22. Databricks for Analytics and Data Science
  23. Analytics Use Cases: exploratory and root-cause analysis; analysis for strategic decisions; POCs for new metrics and business logic; creating and refreshing ad hoc datasets; team onboarding templates.
  24. Big Wins (Analytics): self-service; collaboration.
  25. Data Science Use Cases: exploration and sizing; feature generation; model training; scoring; experiments; analyzing results; model serving.
  26. Big Wins (Data Science): faster local-stack-to-cloud cycle; no infrastructure overhead; increased ML adoption across teams; governance and tracking.
  27. Summary
  28. Key Takeaways: delivery time reduced by 30%; infrastructure costs decreased by 60%; Databricks used by 50% of all Atlassians; data team dependencies reduced by more than 70%.
  29. Thank you!
  30. Feedback: your feedback is important to us; don't forget to rate and review the sessions.
