Continuous Integration &
Continuous Delivery
Prakash Chockalingam
December 2017
Housekeeping
• Your connection will be muted
• Submit questions via the Q&A panel
• Questions will be answered at the end of the webinar
• Any outstanding questions will be answered in the Databricks Forum
(https://forums.databricks.com)
• Webinar will be recorded and attachments will be made available via
www.databricks.com
About Prakash
Prakash Chockalingam
● Product Manager at Databricks
● Works closely with customers
● Deep experience building large scale
distributed systems and machine learning
infrastructure at Netflix and Yahoo
Agenda
• About Databricks
• Different stages and challenges involved in building a data
pipeline
• Best practices for building a data pipeline in Databricks
– Development
– Unit testing and build
– Staging
– Production
VISION
Accelerate innovation by unifying data science, engineering and business

WHO WE ARE
• Founded by the creators of Apache Spark
• Contributes 75% of the open source code, 10x more than any other company
• Trained 40k+ Spark users on the Databricks platform

PRODUCT
Unified Analytics Platform powered by Apache Spark
CI/CD - Terminology
• Continuous Integration (CI): Allows multiple developers to merge code changes
into a central repository. Each merge typically triggers an automated build that compiles the
code and runs unit tests.
• Continuous Delivery (CD): Expands on CI by pushing code changes to multiple
environments, such as QA and staging, once the build completes, so that new changes can be
tested for stability, performance and security.
• Continuous Deployment: CD typically requires manual approval before new
changes are pushed to production. Continuous deployment automates the production push
as well.
CI/CD - Stages for general development
SOURCE: Developers commit changes
BUILD: Build code & run tests
STAGING: Deploy code to staging & verify behavior & stability
PRODUCTION: Deploy code to production
CI/CD - Stages for a data pipeline
DATA EXPLORATION: Identify characteristics of a data set
SOURCE: Developers commit changes
BUILD: Build code & run tests
STAGING: Deploy code to staging & verify behavior & stability
PRODUCTION: Deploy code to production

Databricks Concepts
Databricks Clusters
• Resilient to transient cloud failures
• Optimized for high throughput
• High availability
• Scalable to handle thousands of nodes
• Bulletproof security
• Optimized for lowering your cloud costs
Databricks Workspace
• Hosted notebook service
• Organize notebooks & libraries
with folders
• Collaboration
• Access controls
• One-click visualizations &
dashboarding
Databricks FileSystem (DBFS)
● Layer over cloud storage (S3, Azure Blob Storage)
● Files in DBFS persist to cloud storage so that you won’t lose
data even after clusters terminate
● You can access DBFS using dbutils from a cluster, e.g.:
dbutils.fs.ls("/mnt/myfolder")
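For example, a minimal sketch from a notebook cell (dbutils is predefined on Databricks clusters; the path is illustrative):

    # Write a small file to DBFS; it persists in cloud storage across cluster restarts.
    dbutils.fs.put("/mnt/myfolder/hello.txt", "survives cluster termination", True)
    # List the folder; ls returns FileInfo objects with path/name/size.
    for f in dbutils.fs.ls("/mnt/myfolder"):
        print(f.path, f.size)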
Databricks Jobs
• Built-in scheduler
• Schedule notebooks, jars, Python
files & eggs
• Support for spark-submit
• Alerting to email, Slack, PagerDuty
• Access controls
• Metrics & monitoring
Databricks Command Line Interface
● Open source developer tool that wraps Databricks REST API
● Hosted at https://github.com/databricks/databricks-cli
● Currently supports the following APIs (a short usage sketch follows the list):
○ DBFS
○ Workspace
○ Cluster
○ Jobs
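A minimal sketch of driving the CLI from a Python build script; it assumes the CLI is installed (pip install databricks-cli) and configured via databricks configure --token:

    import subprocess

    def cli(*args):
        # Run a databricks-cli command and return its stdout.
        return subprocess.run(["databricks"] + list(args),
                              check=True, capture_output=True, text=True).stdout

    print(cli("fs", "ls", "dbfs:/"))            # DBFS API
    print(cli("workspace", "ls", "/Users"))     # Workspace API
    print(cli("clusters", "list"))              # Cluster API
    print(cli("jobs", "list"))                  # Jobs API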
Development in Databricks

Development Environment

Development Setup
Recommended code organization
Libraries
- Core logic that needs to be unit tested
Notebook
- Parameters
- Contents of main class that calls different classes in libraries
Development Stages
1. Check out code from revision control to your computer
2. Compile code locally into libraries and copy them to DBFS (steps 2-3 are sketched below)
3. Import notebooks from your computer to Databricks
4. Create/start your cluster
5. Attach the library in DBFS to the cluster
6. Run notebooks and explore data
7. Once done, download/export notebooks to your local computer
8. Check in code from your local computer to revision control
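A hedged sketch of steps 2-3 with the Databricks CLI (all paths and names are illustrative):

    import subprocess

    # Step 2: copy the locally compiled library to DBFS.
    subprocess.run(["databricks", "fs", "cp", "--overwrite",
                    "target/my-pipeline.jar", "dbfs:/dev/libs/my-pipeline.jar"],
                   check=True)

    # Step 3: import the driver notebook into the workspace.
    subprocess.run(["databricks", "workspace", "import", "--overwrite",
                    "--language", "PYTHON", "--format", "SOURCE",
                    "notebooks/main.py", "/Users/me@example.com/dev/main"],
                   check=True)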
Best practices for notebook development
Unit testing: Refactor core logic in notebooks into classes with Dependency
Injection so that you can unit test it easily. Package the core logic as libraries in
your favorite IDE.
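A minimal sketch of the idea (class and data are illustrative): the core logic takes its dependencies, such as the SparkSession, as constructor arguments, so a test can inject a local session:

    # Library code, packaged in your IDE and unit tested off-cluster.
    class Deduplicator(object):
        def __init__(self, spark):
            self.spark = spark                 # injected dependency

        def dedupe(self, df, keys):
            return df.dropDuplicates(keys)

    # Unit test against a local SparkSession.
    from pyspark.sql import SparkSession

    def test_dedupe():
        spark = SparkSession.builder.master("local[2]").getOrCreate()
        df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
        assert Deduplicator(spark).dedupe(df, ["id", "val"]).count() == 2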
Best practices for notebook development
Parameterization: Keep the lightweight business logic and parameters
in notebooks so that you can iterate quickly, and move only the core logic into
libraries.
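For example, a hedged sketch of a notebook that keeps only parameters and glue (widget name and library are illustrative):

    # Widgets hold the parameters; jobs can override them at run time.
    dbutils.widgets.text("run_date", "2017-12-01")
    run_date = dbutils.widgets.get("run_date")

    # Lightweight glue: hand parameters to core logic packaged in a library.
    from mypipeline import run_pipeline      # hypothetical library entry point
    run_pipeline(spark, run_date)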
Best practices for notebook development
Simple Chaining: Chain a simple linear workflow with a fail-fast mechanism.
The workflow can chain code blocks from the same library or from different
libraries.
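One way to sketch this is by chaining notebooks (paths are illustrative): dbutils.notebook.run raises an exception when a child notebook fails, so the loop fails fast:

    for nb in ["./ingest", "./transform", "./load"]:
        # Each step must succeed before the next one starts.
        result = dbutils.notebook.run(nb, timeout_seconds=3600)
        print(nb, "->", result)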
Best practices for notebook development
Performance & Visual Troubleshooting: Use different cells to call
different code blocks so that you can easily look at the performance & logs.
Best practices for notebook development
Return value: You can return results from notebooks by calling
dbutils.notebook.exit(value). You can fetch the return value through the
Jobs API, so you can easily take further actions based on it.
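A minimal sketch (paths and payload are illustrative), returning structured results as a JSON string:

    import json

    # In the child notebook: exit with a result string.
    dbutils.notebook.exit(json.dumps({"status": "OK", "rows": 1234}))

    # In a caller notebook: dbutils.notebook.run returns that string.
    result = json.loads(dbutils.notebook.run("./child", timeout_seconds=600))
    if result["status"] != "OK":
        raise Exception("Child notebook failed: %s" % result)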
CI/CD with Databricks
Continuous Integration and build
Once code is checked in, the build server (e.g. Jenkins) can do
the following (a sketch follows the list):
● Compile all the core logic into libraries
● Run unit tests of the core logic in libraries
● Push the artifacts (libraries and notebooks) into a repository
like Maven or S3
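A hedged sketch of such a build step in Python (test command, packaging, bucket and key names are all illustrative):

    import subprocess
    import boto3

    # Fail the build if any unit test fails.
    subprocess.run(["pytest", "tests/"], check=True)

    # Package the core logic (an egg here; a jar for Scala/Java projects).
    subprocess.run(["python", "setup.py", "bdist_egg"], check=True)

    # Push the artifact to an S3 "repository" for later deployment.
    boto3.client("s3").upload_file("dist/mypipeline-0.1.egg",
                                   "my-artifact-bucket",
                                   "artifacts/mypipeline-0.1.egg")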
Staging Job
Push to Staging
● Libraries: The build server can programmatically push the libraries to a staging
folder in DBFS in Databricks using the DBFS API.
● Notebooks: The build server can also programmatically push the notebooks to a
staging folder in the Databricks workspace through the Workspace API.
● Jobs and cluster configuration: The build server can then leverage the Jobs
API to create a staging job with a given configuration, attach the libraries in
DBFS, and point to the main notebook to be triggered by the job (see the sketch after this list).
● Results: The build server can also get the output of the run and then take further
actions based on that.
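A hedged sketch against the REST API 2.0 that the CLI wraps (host, token, paths, and cluster settings are illustrative):

    import base64
    import requests

    HOST = "https://<your-shard>.cloud.databricks.com"
    AUTH = {"Authorization": "Bearer <api-token>"}

    def post(endpoint, payload):
        r = requests.post(HOST + "/api/2.0/" + endpoint, headers=AUTH, json=payload)
        r.raise_for_status()
        return r.json()

    def b64(path):
        with open(path, "rb") as f:
            return base64.b64encode(f.read()).decode()

    # Libraries: push the built egg to a staging folder in DBFS.
    # (dbfs/put caps inline payloads around 1 MB; use the CLI or the streaming
    # create/add-block/close calls for bigger files.)
    post("dbfs/put", {"path": "dbfs:/staging/libs/mypipeline.egg",
                      "contents": b64("dist/mypipeline-0.1.egg"),
                      "overwrite": True})

    # Notebooks: push the main notebook to a staging workspace folder.
    post("workspace/import", {"path": "/staging/main", "language": "PYTHON",
                              "format": "SOURCE", "overwrite": True,
                              "content": b64("notebooks/main.py")})

    # Jobs and cluster configuration: create the staging job.
    job = post("jobs/create", {
        "name": "staging-pipeline",
        "new_cluster": {"spark_version": "3.4.x-scala2.11",
                        "node_type_id": "r3.xlarge", "num_workers": 2},
        "libraries": [{"egg": "dbfs:/staging/libs/mypipeline.egg"}],
        "notebook_task": {"notebook_path": "/staging/main"}})

    # Results: trigger a run and, once it finishes (poll jobs/runs/get),
    # fetch its output for further checks.
    run = post("jobs/run-now", {"job_id": job["job_id"]})
    out = requests.get(HOST + "/api/2.0/jobs/runs/get-output", headers=AUTH,
                       params={"run_id": run["run_id"]}).json()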
Push to Production
Blue/Green deployment to production
● Push the new production ready libraries to a new DBFS location.
● Push the new production ready notebooks to a new folder under a restricted
production folder in Databricks’ workspace.
● Modify the job configuration to point to the new notebook and library locations so
that the next run of the job picks them up and runs the pipeline with the new
code (sketched below).
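A hedged sketch, reusing the post() helper from the staging sketch above (job_id, paths, and cluster settings are illustrative). jobs/reset replaces the job's full settings, so the next scheduled run picks up the new code:

    post("jobs/reset", {
        "job_id": 42,
        "new_settings": {
            "name": "prod-pipeline",
            "new_cluster": {"spark_version": "3.4.x-scala2.11",
                            "node_type_id": "r3.xlarge", "num_workers": 8},
            "libraries": [{"egg": "dbfs:/prod/libs/2017-12-01/mypipeline.egg"}],
            "notebook_task": {"notebook_path": "/prod/2017-12-01/main"}}})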
Try Apache Spark in Databricks
Blog: https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html
Sign up for a free 14-day trial of Databricks
https://databricks.com/try-databricks
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks