Condé Nast is a global leader in the media production space, home to iconic brands such as The New Yorker, Wired, Vanity Fair, and Epicurious, among many others. Alongside our content production, Condé Nast invests heavily in companion products that enhance our audience’s experience. One such product is Spire, Condé Nast’s service for user segmentation and targeted advertising, covering over a hundred million users.
While Spire started as a set of Databricks notebooks, we later used DBFS to deploy Spire distributions as Python wheels, and more recently we have packaged the entire production environment into Docker images deployed onto our Databricks clusters. In this talk, we will walk through the process of evolving our Python distributions and production environment into Docker images, and discuss where this streamlined our deployment workflow, where there were growing pains, and how to deal with them.
3. Condé Nast
• Global leader in media featuring many iconic brands:
• The New Yorker, WIRED, Vanity Fair, Epicurious, and many more.
• In addition to content production, Condé Nast also invests in companion
software products to enhance our audience experiences.
• Examples: Content Recommendations, Audience Segmentation
4. Spire
Audience Segmentation
• Spire is a platform for user segmentation and targeted advertising
• Analyzes over one hundred million users
• See our Spark+AI Summit 2020 talk for more background on Spire
5. Presentation Overview
• 1000-Foot View of Spire’s Architecture
• What Spire looked like before Docker
• How Docker helped streamline Spire’s deployment
• Containerization of Spire on Databricks
• Walkthrough of Spire’s containerization strategy in production
• Learning from Experience: Pros & Cons
• What we’ve learned from containerizing Spire on Databricks
11. Recap
• Pre-packaged all dependencies
• Each image represents a tested combination of packages
• Linked to a specific Databricks Runtime and Spire
• Fewer pipelines to manage
• Explicit and upfront control over dependency versions
13. The Basics
• Step 1 - Choosing a base image
• Step 2 - Adding your dependency
• Step 3 - Push to a Docker Registry
• Step 4 - Launching a cluster
14. Step 1 - Choosing a Base Image
See Databricks’ container definitions at:
https://github.com/databricks/containers
• Available Images:
• Standard
• Minimal
• Python
• R
• DBFS FUSE
• SSH
• GPU
15. Step 2 - Adding Dependencies and Building
• Standard Docker workflow:
• “Do what you want”
• Things to watch out for:
• Make sure to match the package versions listed in your target Databricks Runtime version
# select base image
FROM databricksruntime/standard:latest
# install pip libraries
RUN pip install pandas urllib3
# install system binaries (update the index and pass -y for non-interactive builds)
RUN apt-get update && apt-get install -y git
16. Step 3 - Pushing to a Docker Registry
● The recommended way:
○ AWS ECR
○ Azure Container Registry
● Any registry with basic authentication is also supported
● What we do:
○ GitHub Container Registry + Basic Authentication
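When launching a cluster against a registry that uses basic authentication, the image and credentials go into the cluster spec via the Databricks Clusters API. A minimal sketch; the organization, image tag, node type, and credential values below are placeholders, not Spire’s real configuration:

```json
{
  "cluster_name": "spire-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "docker_image": {
    "url": "ghcr.io/example-org/spire:1.4.2",
    "basic_auth": {
      "username": "example-user",
      "password": "example-token"
    }
  }
}
```

Note that the `basic_auth` block sits in the cluster spec as plaintext, which is the security concern called out later in this deck.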
19. Containerization of Spire on Databricks: Docker Image
Goal: Produce a Docker image w/ DBR + Spire
• The image is built from databricks-minimal
• Ubuntu 18.04
• Custom DBR 7.x functionality
• Spire package and sub-package install
• Dependency management
20. Containerization of Spire on Databricks: Hosting
Goal: Host image on GitHub Packages (ghcr.io)
• Manage production and development packages
• Can host multiple images per package
• Image Tagging:
• GitHub release tag
• Spire package version (version.py)
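Tagging images from the package’s own version file can be scripted. A minimal sketch, assuming a version.py that exposes a `__version__` string; the registry path and package name are illustrative:

```python
import re

def image_tag(version_py: str, registry: str = "ghcr.io/example-org/spire") -> str:
    """Build a Docker image tag from the __version__ string in a version.py file."""
    match = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', version_py)
    if match is None:
        raise ValueError("no __version__ found in version.py")
    return f"{registry}:{match.group(1)}"

# Example: the contents of a version.py file
print(image_tag('__version__ = "1.4.2"'))  # ghcr.io/example-org/spire:1.4.2
```

The same string can then feed `docker build -t` and `docker push`, so the image tag always matches the released Spire version.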
21. Containerization of Spire on Databricks: CI/CD
Goal: Integrate image creation with the CI/CD pipeline
• GitHub Actions CI/CD Integration
• ubuntu-latest and macOS-10.15
• Automated pytest testing on push
• Automated build and deployment of Docker image on release tag
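The release-tag flow above can be sketched as a workflow file. The workflow name, secret name, and registry path here are illustrative, not Spire’s actual configuration:

```yaml
name: release-image
on:
  release:
    types: [published]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Log in to ghcr.io
        run: echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - name: Build and push
        run: |
          docker build -t ghcr.io/example-org/spire:${{ github.event.release.tag_name }} .
          docker push ghcr.io/example-org/spire:${{ github.event.release.tag_name }}
```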
23. What We Learned: Pro 1
• Library customization
• Automatic and simplified control of the Spire module
• Integration of module dependencies
24. What We Learned: Pro 2
• Fluid integration with existing deployment pipeline
• GitHub Actions CI/CD
• Test database integration
• Pytest integration
• Multiple OS support (Ubuntu and macOS)
• Release tagging
• Databricks Jobs
25. What We Learned: Pro 3
• Ease of debugging
• ssh into container via docker run -it <package> bash
• pdb set trace support
• Test database Docker volume for pytest integrations testing
26. What We Learned: Con 1
• DBR Version Incompatibility
• DBR 8.x not currently supported, only DBR 6.x and 7.x
• Need to create custom base image
27. What We Learned: Con 2
• Pip Package Management
• Must match target runtime specifications
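One way to catch version drift early is to compare the image’s installed packages against the versions documented for the target runtime. A minimal sketch; the pin list passed in would mirror the DBR release notes, and the example mapping below is made up:

```python
from importlib.metadata import version, PackageNotFoundError

def version_mismatches(expected):
    """Return {package: (expected, installed)} for packages whose installed
    version differs from the pin; installed is None when the package is missing."""
    mismatches = {}
    for package, want in expected.items():
        try:
            have = version(package)
        except PackageNotFoundError:
            have = None
        if have != want:
            mismatches[package] = (want, have)
    return mismatches

# A made-up pin list; a real one would come from the DBR release notes
print(version_mismatches({"definitely-not-installed-pkg": "1.0.0"}))
```

Running a check like this in CI, inside the built image, turns a runtime mismatch into a failed build instead of a broken cluster.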
28. What We Learned: Con 3
• Large image size
• Base images are large compared to Docker images used for deployment
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
databricksruntime/standard latest fd383efa2fcc 19 months ago 1.84GB
29. What We Learned: Con 4
• Requires prior Docker experience
• Disk space requirements for local builds grow quickly
• Need to continually prune these resources
• Deployment pipeline is slower
• More data is uploaded
Delete all Docker images:
docker rmi $(docker images -q)
Delete all exited Docker containers:
docker rm $(docker ps --filter status=exited -q)
Pruning:
docker image prune -f
docker container prune -f
docker builder prune -f
30. What We Learned: Con 5
● Credentials for Basic Auth are stored as plaintext
31. What We Learned: Con 6
● Each usage requires a pull of the container
○ Image pulls grow quickly
32. Summary
● What Databricks’ Containers offer:
● Streamlining of dependency management
● Streamlining of CI/CD
● More predictable runtime behavior
● Things to consider:
● Need careful synchronization of dependencies
○ Between your image and your target runtime
● Security concerns with Basic Auth (plaintext credentials)
● Additional overhead of Docker
● Each usage requires a pull of the container