Condé Nast is a global leader in the media production space, home to iconic brands such as The New Yorker, Wired, Vanity Fair, and Epicurious, among many others. Alongside our content production, Condé Nast invests heavily in companion products that enhance our audience’s experience. One such product is Spire, Condé Nast’s service for user segmentation and targeted advertising, covering over a hundred million users.
While Spire started as a set of Databricks notebooks, we later used DBFS to deploy Spire distributions as Python wheels, and more recently we have packaged the entire production environment into Docker images deployed onto our Databricks clusters. In this talk, we will walk through the process of evolving our Python distributions and production environment into Docker images, and discuss where this streamlined our deployment workflow, where there were growing pains, and how to deal with them.
3. Condé Nast
• Global leader in media featuring many iconic brands:
• The New Yorker, WIRED, Vanity Fair, Epicurious, and many more.
• In addition to content production, Condé Nast also invests in companion
software products to enhance our audience experiences.
• Examples: Content Recommendations, Audience Segmentation
4. Spire
Audience Segmentation
• Spire is a platform for user segmentation and targeted advertising
• Analyzes over one hundred million users
• See our Spark+AI Summit 2020 talk for more background on Spire
5. Presentation Overview
• 1000-Foot View of Spire’s Architecture
• What Spire looked like before Docker
• How Docker helped streamline Spire’s deployment
• Containerization of Spire on Databricks
• Walkthrough of Spire’s containerization strategy in production
• Learning from Experience: Pros & Cons
• What we’ve learned from containerizing Spire on Databricks
11. Recap
• Pre-packaged all dependencies
• Each image represents a tested combination of packages
• Linked to a specific Databricks Runtime and Spire
• Fewer pipelines to manage
• Explicit and upfront control over dependency versions
13. The Basics
• Step 1 - Choosing a base image
• Step 2 - Adding your dependency
• Step 3 - Push to a Docker Registry
• Step 4 - Launching a cluster
14. Step 1 - Choosing a Base Image
See Databricks’ container definitions at:
https://github.com/databricks/containers
• Available Images:
• Standard
• Minimal
• Python
• R
• DBFS FUSE
• SSH
• GPU
15. Step 2 - Adding Dependencies and Building
• Standard Docker workflow:
• “Do what you want”
• Things to watch out for:
• Make sure to match the package versions listed in your target Databricks Runtime version
# select base image
FROM databricksruntime/standard:latest
# install pip libraries
RUN pip install pandas urllib3
# install system binaries (update the index and pass -y for non-interactive builds)
RUN apt-get update && apt-get install -y git
16. Step 3 - Pushing to a Docker Registry
● The recommended way:
○ AWS ECR
○ Azure Container Registry
● Any registry with basic authentication is also supported
● What we do:
○ GitHub Container Registry + Basic Authentication
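When launching a cluster against a registry that uses basic authentication, the image and credentials go into the cluster spec via the Databricks Clusters API. A minimal sketch; the organization, image tag, node type, and credential values below are placeholders, not Spire’s real configuration:

```json
{
  "cluster_name": "spire-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "docker_image": {
    "url": "ghcr.io/example-org/spire:1.4.2",
    "basic_auth": {
      "username": "example-user",
      "password": "example-token"
    }
  }
}
```

Note that the `basic_auth` block sits in the cluster spec as plaintext, which is the security concern called out later in this deck.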
19. Containerization of Spire on Databricks: Docker Image
Goal: Produce a Docker image w/ DBR + Spire
• The image is built from databricks-minimal
• Ubuntu 18.04
• Custom DBR 7.x functionality
• Spire package and sub-package install
• Dependency management
20. Containerization of Spire on Databricks: Hosting
Goal: Host image on GitHub Packages (ghcr.io)
• Manage production and development packages
• Can host multiple images per package
• Image Tagging:
• GitHub release tag
• Spire package version (version.py)
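Tagging images from the package’s own version file can be scripted. A minimal sketch, assuming a version.py that exposes a `__version__` string; the registry path and package name are illustrative:

```python
import re

def image_tag(version_py: str, registry: str = "ghcr.io/example-org/spire") -> str:
    """Build a Docker image tag from the __version__ string in a version.py file."""
    match = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', version_py)
    if match is None:
        raise ValueError("no __version__ found in version.py")
    return f"{registry}:{match.group(1)}"

# Example: the contents of a version.py file
print(image_tag('__version__ = "1.4.2"'))  # ghcr.io/example-org/spire:1.4.2
```

The same string can then feed `docker build -t` and `docker push`, so the image tag always matches the released Spire version.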
21. Containerization of Spire on Databricks: CI/CD
Goal: Integrate image creation with the CI/CD pipeline
• GitHub Actions CI/CD Integration
• ubuntu-latest and macOS-10.15
• Automated pytest testing on push
• Automated build and deployment of Docker image on release tag
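The release-tag flow above can be sketched as a workflow file. The workflow name, secret name, and registry path here are illustrative, not Spire’s actual configuration:

```yaml
name: release-image
on:
  release:
    types: [published]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Log in to ghcr.io
        run: echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - name: Build and push
        run: |
          docker build -t ghcr.io/example-org/spire:${{ github.event.release.tag_name }} .
          docker push ghcr.io/example-org/spire:${{ github.event.release.tag_name }}
```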
23. What We Learned: Pro 1
• Library customization
• Automatic and simplified control of the Spire module
• Integration of module dependencies
24. What We Learned: Pro 2
• Fluid integration with existing deployment pipeline
• GitHub Actions CI/CD
• Test database integration
• Pytest integration
• Multiple OS support (Ubuntu and macOS)
• Release tagging
• Databricks Jobs
25. What We Learned: Pro 3
• Ease of debugging
• ssh into container via docker run -it <package> bash
• pdb set trace support
• Test database Docker volume for pytest integrations testing
26. What We Learned: Con 1
• DBR Version Incompatibility
• DBR 8.x not currently supported, only DBR 6.x and 7.x
• Need to create custom base image
27. What We Learned: Con 2
• Pip Package Management
• Must match target runtime specifications
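One way to catch version drift early is to compare the image’s installed packages against the versions documented for the target runtime. A minimal sketch; the pin list passed in would mirror the DBR release notes, and the example mapping below is made up:

```python
from importlib.metadata import version, PackageNotFoundError

def version_mismatches(expected):
    """Return {package: (expected, installed)} for packages whose installed
    version differs from the pin; installed is None when the package is missing."""
    mismatches = {}
    for package, want in expected.items():
        try:
            have = version(package)
        except PackageNotFoundError:
            have = None
        if have != want:
            mismatches[package] = (want, have)
    return mismatches

# A made-up pin list; a real one would come from the DBR release notes
print(version_mismatches({"definitely-not-installed-pkg": "1.0.0"}))
```

Running a check like this in CI, inside the built image, turns a runtime mismatch into a failed build instead of a broken cluster.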
28. What We Learned: Con 3
• Large image size
• Base images are large compared to Docker images used for deployment
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
databricksruntime/standard latest fd383efa2fcc 19 months ago 1.84GB
29. What We Learned: Con 4
• Requires prior Docker experience
• Disk space requirements for local builds grow quickly
• Need to continually prune these resources
• Deployment pipeline is slower
• More data is uploaded
Delete all Docker images:
docker rmi $(docker images -q)
Delete all exited Docker containers:
docker rm $(docker ps --filter status=exited -q)
Pruning:
docker image prune -f
docker container prune -f
docker builder prune -f
30. What We Learned: Con 5
● Credentials for Basic Auth are stored as plaintext
31. What We Learned: Con 6
● Each usage requires a pull of the container
○ Image pulls grow quickly
32. Summary
● What Databricks’ Containers offer:
● Streamlining of dependency management
● Streamlining of CI/CD
● More predictable runtime behavior
● Things to consider:
● Need careful synchronization of dependencies
○ Between your image and your target runtime
● Security concerns with Basic Auth (plaintext credentials)
● Additional overhead of Docker
● Each usage requires a pull of the container