Continuous Integration & Continuous Delivery
Prakash Chockalingam
December 2017
Housekeeping
• Your connection will be muted
• Submit questions via the Q&A panel
• Questions will be answered at the end of the webinar
• Any outstanding questions will be answered in the Databricks Forum
(https://forums.databricks.com)
• Webinar will be recorded and attachments will be made available via
www.databricks.com
About Prakash
Prakash Chockalingam
● Product Manager at Databricks
● Works closely with customers
● Deep experience building large scale
distributed systems and machine learning
infrastructure at Netflix and Yahoo
Agenda
• About Databricks
• Different stages and challenges involved in building a data
pipeline.
• Best practices for building a data pipeline in Databricks
– Development
– Unit testing and build
– Staging
– Production
VISION
Accelerate innovation by unifying data science, engineering and business

WHO WE ARE
• Founded by the creators of Apache Spark
• Contributes 75% of the open source code, 10x more than any other company
• Trained 40k+ Spark users on the Databricks platform

PRODUCT
Unified Analytics Platform powered by Apache Spark
CI/CD - Terminologies
• Continuous Integration (CI): Allows multiple developers to merge code changes to a central repository. Each merge typically triggers an automated build that compiles the code and runs unit tests.
• Continuous Delivery (CD): Expands on CI by pushing code changes to multiple environments like QA and staging after the build has completed, so that new changes can be tested for stability, performance and security.
• Continuous Deployment: CD typically requires manual approval before new changes are pushed to production. Continuous deployment automates the production push as well.
CI/CD - Stages for general development
SOURCE: Developers commit changes
BUILD: Build code & run tests
STAGING: Deploy code to staging & verify behavior & stability
PRODUCTION: Deploy code to production
CI/CD - Stages for a data pipeline
DATA EXPLORATION: Identify characteristics of a data set
SOURCE: Developers commit changes
BUILD: Build code & run tests
STAGING: Deploy code to staging & verify behavior & stability
PRODUCTION: Deploy code to production
Databricks Concepts
Databricks Clusters
• Resilient to transient cloud failures
• Optimized for high throughput
• High availability
• Scalable to handle thousands of nodes
• Bulletproof security
• Optimized for lowering your cloud costs
Databricks Workspace
• Hosted notebook service
• Organize notebooks & libraries
with folders
• Collaboration
• Access controls
• One-click visualizations &
dashboarding
Databricks File System (DBFS)
● Layer over cloud storage (S3, Azure Blob Storage)
● Files in DBFS persist to cloud storage so that you won’t lose
data even after clusters terminate
● You can access DBFS using dbutils from a cluster. Ex:
dbutils.fs.ls("/mnt/myfolder")
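For example, a few dbutils.fs calls you might run from a notebook cell (a minimal sketch; the mount point and file names are placeholders, and dbutils is predefined only inside Databricks notebooks):

# List the contents of a mounted folder
display(dbutils.fs.ls("/mnt/myfolder"))

# Copy an uploaded build artifact into a versioned DBFS location
dbutils.fs.cp("/mnt/myfolder/etl-core.jar", "/libraries/v1/etl-core.jar")

# Small files can be written directly; they persist even after the cluster terminates
dbutils.fs.put("/mnt/myfolder/notes.txt", "persisted to cloud storage", True)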
Databricks Jobs
• Built-in scheduler
• Schedule notebooks, jars, Python files & eggs
• Support for spark-submit
• Alerting to email, Slack, PagerDuty
• Access controls
• Metrics & monitoring
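As a rough sketch, the settings for such a job (Jobs API 2.0-style fields) might look like the Python dict below; the notebook path, library, runtime, cron expression and email address are placeholders:

job_settings = {
    "name": "nightly-etl",                                       # shown in the Jobs UI
    "new_cluster": {                                             # cluster created per run
        "spark_version": "<a supported runtime version>",        # placeholder
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Production/main"},      # notebook to trigger
    "libraries": [{"jar": "dbfs:/libraries/prod/etl-core.jar"}],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?",        # built-in scheduler: 02:00 daily
                 "timezone_id": "UTC"},
    "email_notifications": {"on_failure": ["data-eng@example.com"]},  # alerting
}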
Databricks Command Line Interface
● Open source developer tool that wraps the Databricks REST API
● Hosted at https://github.com/databricks/databricks-cli
● Currently supports the following APIs:
○ DBFS
○ Workspace
○ Cluster
○ Jobs
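As a quick illustration of those API groups, the CLI can be driven from a shell or from a small script; a minimal Python sketch, assuming the CLI is installed and configured with a host and token:

import subprocess

def cli(*args):
    # Run a Databricks CLI command and raise if it fails
    subprocess.run(["databricks", *args], check=True)

cli("fs", "ls", "dbfs:/")              # DBFS
cli("workspace", "list", "/Users")     # Workspace
cli("clusters", "list")                # Cluster
cli("jobs", "list")                    # Jobs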
Development in Databricks
Development Environment
Development Setup
Recommended code organization
Libraries
- Core logic that needs to be unit tested
Notebook
- Parameters
- Contents of main class that calls different classes in libraries
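A minimal sketch of this split (all names are hypothetical): the library holds the testable core logic, and the notebook only wires parameters to it.

# Library code, packaged as a wheel/jar and attached to the cluster (e.g., etl/core.py)
class Deduplicator:
    """Core logic that can be unit tested without a notebook."""
    def __init__(self, key_columns):
        self.key_columns = key_columns

    def run(self, df):
        # Drop duplicate rows based on the configured key columns
        return df.dropDuplicates(self.key_columns)

# Notebook code: a thin wrapper holding parameters and orchestration
# (shown as comments because spark/dbutils only exist inside a notebook)
# from etl.core import Deduplicator
# input_path = "/mnt/raw/events"                      # parameter kept in the notebook
# clean = Deduplicator(["event_id"]).run(spark.read.json(input_path))
# clean.write.mode("overwrite").parquet("/mnt/clean/events")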
Development Stages
1. Check out code from revision control to your computer
2. Compile code locally into libraries and copy them to DBFS
3. Import notebooks from your computer to Databricks
4. Create/start your cluster
5. Attach the library in DBFS to the cluster
6. Run notebooks and explore data
7. Once done, download/export notebooks to your local computer
8. Check in code from your local computer to revision control
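Steps 2, 3 and 5 can be scripted with the Databricks CLI; a hedged sketch (paths, e-mail and cluster ID are placeholders, and the flag names follow the legacy databricks-cli, so they may differ across CLI versions):

import subprocess

def cli(*args):
    subprocess.run(["databricks", *args], check=True)

# Step 2: copy the locally built library to DBFS
cli("fs", "cp", "target/etl-core-0.1.0.jar",
    "dbfs:/libraries/dev/etl-core-0.1.0.jar", "--overwrite")
# Step 3: import the notebook into your workspace folder
cli("workspace", "import", "notebooks/main.py", "/Users/me@example.com/dev/main",
    "--language", "PYTHON", "--overwrite")
# Step 5: attach the library in DBFS to your running cluster
cli("libraries", "install", "--cluster-id", "1234-567890-abcde123",
    "--jar", "dbfs:/libraries/dev/etl-core-0.1.0.jar")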
Best practices for notebook development
Unit testing: Refactor the core logic in notebooks into classes with dependency injection so that you can unit test it easily. Package the core logic as libraries in your favorite IDE.
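For instance, a hypothetical transformation class that takes its SparkSession as a constructor argument can be tested on a build server with a local session (all names below are illustrative):

# Library code
class EventFilter:
    """Keeps only events at or after a cutoff; the session is injected, not created here."""
    def __init__(self, spark, cutoff):
        self.spark = spark
        self.cutoff = cutoff

    def run(self, df):
        return df.filter(df.ts >= self.cutoff)

# Unit test, runnable on a laptop or CI server with pyspark installed
def test_event_filter():
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1, 5), (2, 15)], ["event_id", "ts"])
    assert EventFilter(spark, cutoff=10).run(df).count() == 1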
Best practices for notebook development
Parameterization: Keep the lightweight business logic and parameters in notebooks so that you can iterate quickly and move only the core logic into libraries.
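In a Databricks notebook, one common way to do this is with widgets; a small sketch (widget names and defaults are hypothetical, and dbutils exists only inside a notebook):

# Parameters stay in the notebook; a job or a caller can override the widget values
dbutils.widgets.text("input_path", "/mnt/raw/events")
dbutils.widgets.text("cutoff", "10")

input_path = dbutils.widgets.get("input_path")
cutoff = int(dbutils.widgets.get("cutoff"))

# ...while the heavy lifting stays in the attached library, e.g.:
# from etl.core import EventFilter
# result = EventFilter(spark, cutoff).run(spark.read.json(input_path))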
Best practices for notebook development
Simple Chaining: Chain a simple linear workflow with a fail-fast mechanism. The workflow can chain different code blocks from the same library or from different libraries.
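One way to express such a chain in a notebook cell is to call each stage in order and let the first exception stop the run; the stages below are hypothetical stubs standing in for library code:

def extract(spark, path):
    return spark.read.json(path)

def transform(df):
    return df.dropDuplicates(["event_id"])

def load(df, path):
    df.write.mode("overwrite").parquet(path)

def run_pipeline(spark, input_path, output_path):
    # Linear chain: the first stage that raises an exception fails the run immediately
    df = extract(spark, input_path)
    df = transform(df)
    load(df, output_path)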
Best practices for notebook development
Performance & Visual Troubleshooting: Use different cells to call different code blocks so that you can easily look at the performance & logs.
Best practices for notebook development
Return value: You can return results from notebooks by calling dbutils.notebook.exit(value). You can get the return value through the API so that you can easily take further actions based on it.
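For example, from the last cell of a notebook (the JSON payload and notebook path are illustrative):

# Last cell of the notebook: report a result to the caller
import json
dbutils.notebook.exit(json.dumps({"status": "OK", "rows_written": 12345}))

# Caller side, from another notebook: the exit value is the return value of run()
# result = dbutils.notebook.run("/Staging/main", 3600)
# status = json.loads(result)["status"]

From outside Databricks, the same value can be read back through the Jobs API (jobs/runs/get-output), as sketched in the staging section later.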
CI/CD with Databricks
Continuous Integration and build
Once code is checked in, the build server (e.g., Jenkins) can do the following:
● Compile all the core logic into libraries
● Run unit tests of the core logic in libraries
● Push the artifacts (libraries and notebooks) into a repository
like Maven or S3
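A hedged sketch of what that build stage might script (tool names, bucket and paths are placeholders; a Jenkins pipeline would typically just shell out to steps like these):

import subprocess

def sh(*cmd):
    # Fail the build immediately if any step fails
    subprocess.run(list(cmd), check=True)

# 1. Run unit tests of the core logic
sh("python", "-m", "pytest", "tests/")
# 2. Package the libraries (a wheel here; sbt/maven for JVM code)
sh("python", "setup.py", "bdist_wheel")
# 3. Push the artifacts (libraries and notebooks) to an artifact store such as S3
sh("aws", "s3", "cp", "dist/etl_core-0.1.0-py3-none-any.whl",
   "s3://my-artifacts/builds/42/etl_core-0.1.0-py3-none-any.whl")
sh("aws", "s3", "cp", "notebooks/main.py", "s3://my-artifacts/builds/42/main.py")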
Push to Staging
● Libraries: The build server can programmatically push the libraries to a staging
folder in DBFS in Databricks using the DBFS API.
● Notebooks: The build server can also programmatically push the notebooks to a
staging folder in the Databricks workspace through the Workspace API.
● Jobs and cluster configuration: The build server can then leverage the Jobs API to create a staging job with a given configuration, attach the libraries in DBFS, and point to the main notebook to be triggered by the job.
● Results: The build server can also get the output of the run and then take further
actions based on that.
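A minimal sketch of those four steps against the REST API (2.0 endpoints; the host, token, cluster ID and paths are placeholders):

import base64, requests

HOST = "https://<your-workspace>.cloud.databricks.com"          # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}   # placeholder

def post(endpoint, payload):
    r = requests.post(f"{HOST}/api/2.0/{endpoint}", headers=HEADERS, json=payload)
    r.raise_for_status()
    return r.json()

# Libraries: upload the built wheel to a staging folder in DBFS
# (dbfs/put with inline contents is meant for small files; use the streaming calls for large jars)
with open("dist/etl_core-0.1.0-py3-none-any.whl", "rb") as f:
    post("dbfs/put", {"path": "dbfs:/libraries/staging/etl_core-0.1.0-py3-none-any.whl",
                      "contents": base64.b64encode(f.read()).decode(),
                      "overwrite": True})

# Notebooks: import the notebook source into a staging workspace folder
with open("notebooks/main.py", "rb") as f:
    post("workspace/import", {"path": "/Staging/main",
                              "format": "SOURCE", "language": "PYTHON",
                              "content": base64.b64encode(f.read()).decode(),
                              "overwrite": True})

# Jobs and cluster configuration: create a staging job and trigger a run
job = post("jobs/create", {
    "name": "staging-pipeline",
    "existing_cluster_id": "1234-567890-abcde123",               # placeholder
    "notebook_task": {"notebook_path": "/Staging/main"},
    "libraries": [{"whl": "dbfs:/libraries/staging/etl_core-0.1.0-py3-none-any.whl"}],
})
run = post("jobs/run-now", {"job_id": job["job_id"]})

# Results: once the run finishes, read back the value passed to dbutils.notebook.exit
out = requests.get(f"{HOST}/api/2.0/jobs/runs/get-output",
                   headers=HEADERS, params={"run_id": run["run_id"]}).json()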
Push to Production
Blue/Green deployment to production
● Push the new production-ready libraries to a new DBFS location.
● Push the new production-ready notebooks to a new folder under a restricted production folder in the Databricks workspace.
● Modify the job configuration to point to the new notebook and library location so
that the next run of the job can pick them up and run the pipeline with the new
code.
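One way to script that last step is the Jobs API's reset call, which replaces the job's settings in place; a sketch (job ID, cluster ID and paths are placeholders), with rollback being just another reset to the previous settings:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"          # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}   # placeholder

# Point the existing production job at the newly pushed notebook and library;
# the next scheduled run picks up the new code.
resp = requests.post(f"{HOST}/api/2.0/jobs/reset", headers=HEADERS, json={
    "job_id": 42,                                                # placeholder
    "new_settings": {
        "name": "prod-pipeline",
        "existing_cluster_id": "1234-567890-abcde123",           # placeholder
        "notebook_task": {"notebook_path": "/Production/v2/main"},
        "libraries": [{"whl": "dbfs:/libraries/prod/v2/etl_core-0.2.0-py3-none-any.whl"}],
    },
})
resp.raise_for_status()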
Try Apache Spark in Databricks
Blog: https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html
Sign up for a free 14-day trial of Databricks
https://databricks.com/try-databricks
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks
