A data science team’s ability to drive impact depends on how effectively it incorporates its solutions into new or existing products. When collaborating with other engineering teams, and especially when solutions must operate at scale, technology choices can be critical to the outcome. We walk through strategies and specific technologies (Airflow, Docker, Kubernetes) that can help promote successful collaboration between data science and engineering.
Data Science in Production: Technologies That Drive Adoption of Data Science Solutions at JW Player
1. Data Science in Production
How Docker, K8s, and Airflow Drive
Adoption of Data Science Solutions at
JW Player
Nir Yungster
2. Disclaimer!
For this talk, we’ll cover technical approaches that can help drive adoption of data science solutions...
But technology is not a remedy for everything (people, process,…)
8. Data Science Requires Three Pieces to Succeed
1. Access to data
2. Effectiveness in research, development of solutions
3. Ability to deliver solutions when and where they’re needed
9. Agenda
Part I: The Challenge of Data Science in Production
Part II: The Data Science Platform at JW Player
Part III: Data Science in Production at JW Player
11. What Does Production Data Science Mean?
● Model Performance
○ E.g. accuracy, precision, etc.
● Production-Level Code
○ Portability: ease of deploying across environments
○ Maintainability: testing, monitoring, documentation
○ Scalability: ability to handle high traffic volume
○ Reliability: service up-time
14. Scientist: "I want accuracy, interpretability, & validation!!"
Engineer: "I want efficiency, reliability, & SLEEP!!"
Both: "I want model performance!"
15. Collaboration: The Good, the Bad, and The Ugly
● The Good
○ Positive collaboration
○ Both sides’ primary goals achieved
● The Bad
○ Models in Limbo
○ Mutant models
● The Ugly
○ Misunderstanding, distrust
○ Barriers between teams
16. There are tools that can help!
● To make production data science more feasible
● To make Data Science teams more self-sufficient
● To enable better collaboration across teams
18. About JW Player
● Video player + platform
● Headquarters in NYC
● SaaS business
● 15k subscribers, 2M free
● 5% of video plays across the web
19. ● Video Recommendation Engine
Video Publisher Data
Products
● Automated Thumbnail Selection
● Shot/Scene Detection
20. Data Science Within JW Player
● Provide R&D for data products
● Centralized team (6 members)
○ Including 2 software developers
● Work with a variety of product and engineering teams across the company
21. Key Elements of JW Data Science Infrastructure
● Container Service (Docker): Portability
● Workflow Orchestration (Airflow): Maintainability
● Application Orchestration (Kubernetes): Scalability, Reliability
22. Docker is a Container Service
What’s a container?
● A standard wrapper for tasks & applications so that they run consistently across environments
23. Container Portability Reduces Integration Pain
● Applications / tasks can run in any environment
● Removes friction arising from development and deployment in different environments
○ Across teams, within teams
dockerize all the things!
24. Airflow Orchestrates Workflows
● Workflows consist of a series of tasks
○ E.g. data processing, model training
○ Workflows run on a schedule
● Airflow helps with Maintainability
○ Monitoring & alerting
○ Web interface for investigating logs, rerunning tasks / entire workflows
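As a rough illustration of the idea on this slide, a workflow can be modeled as a set of tasks plus their dependencies, run in dependency order. This is a standard-library sketch and not Airflow's API; the task names and the `run_workflow` helper are hypothetical. Airflow layers scheduling, monitoring, alerting, and the web interface on top of this basic idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task once, after all of the tasks it depends on.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    Returns the list of task names in the order they ran.
    """
    # Include every task so that ones with no dependencies still run
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = []
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        order.append(name)
    return order


# Hypothetical three-step workflow: process data, train, then publish
log = []
tasks = {
    "process_data": lambda: log.append("processed data"),
    "train_model": lambda: log.append("trained model"),
    "publish_model": lambda: log.append("published model"),
}
dependencies = {
    "train_model": {"process_data"},
    "publish_model": {"train_model"},
}
order = run_workflow(tasks, dependencies)
```

Here `order` lists the tasks in the sequence they ran, so a failed or out-of-order step is easy to spot when investigating a run.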
26. Kubernetes Orchestrates Applications
● Deploy & manage dockerized applications that run continuously (e.g. an API service)
● Built-in Scaling, Reliability, Monitoring
● JW Player maintains an internal deployment service powered by Kubernetes
28. Key Elements of JW Data Science Infrastructure
● Container Service (Docker): Portability
● Workflow Orchestration (Airflow): Maintainability
● Application Orchestration (Kubernetes): Scalability, Reliability
30. Three flavors of production data science
● Backend Microservices
○ Server-side API Running in Kubernetes
● Plugins (aka Frontend microservices)
○ Client-side plugin running alongside the Player
● “Integrations” with engineering
○ Data Science conducts R&D, develops a model
○ Works with Engineering to productionize
31. Backend Microservice
● What is involved?
○ Deploy model as application on Kubernetes
○ Backend service with API
● When is this approach common?
○ Easiest for a new model
● Benefits
○ Data Science in full control of model, updates
○ Decoupled architecture
○ Clear ownership, boundaries
[Diagram: Backend, Frontend, Microservice]
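A minimal sketch of what "backend service with API" might look like, using only the Python standard library. The `/predict` route, the JSON payload shape, and the placeholder `predict` function are assumptions made for illustration; they are not JW Player's service. In practice the app would be packaged in a container and deployed on Kubernetes.

```python
import json
from wsgiref.simple_server import make_server


def predict(features):
    """Placeholder model (hypothetical): returns the mean of the features."""
    return sum(features) / len(features)


def _json_response(start_response, status, body):
    """Send a JSON body with the given HTTP status line."""
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(body).encode("utf-8")]


def app(environ, start_response):
    """WSGI application exposing the model at POST /predict."""
    is_predict = environ.get("PATH_INFO") == "/predict"
    if not is_predict or environ.get("REQUEST_METHOD") != "POST":
        return _json_response(start_response, "404 Not Found", {"error": "not found"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        score = predict(payload["features"])
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        # Malformed JSON, missing "features", or an empty / non-numeric list
        return _json_response(start_response, "400 Bad Request", {"error": "bad request"})
    return _json_response(start_response, "200 OK", {"score": score})


def serve(port=8000):
    """Serve the app locally for testing; not called automatically."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Keeping the model behind a small API like this is what gives the decoupling and clear ownership listed above: other teams depend only on the request and response format, not on the model code.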
32. Client-side Plugin
● What is involved?
○ Effectively a client-side microservice
○ Written in JavaScript
● When is this approach common?
○ Easiest for a new model
○ If the model is lightweight
● Benefits
○ Decoupled architecture
○ Reduced network traffic, low latency
[Diagram: Backend, Frontend, Plugin]
33. Integration with Engineering
● What is involved?
○ Translating / integrating model
○ Requires very close coordination
○ Often involves rewriting model code
● When is this approach common?
○ Often the case when you’re improving upon an existing product
● Possible Pitfalls
○ Tangled web
○ Unclear path to update/iterate
[Diagram: Backend, Frontend, Model, ??]
35. Some Takeaways
● Owning models means more maintenance responsibility
○ Can take away from core DS mission
● Microservices don’t remove the need to collaborate with other teams on models
○ To ensure feature fidelity
○ To ensure proper usage
○ To agree on SLAs
36. Some Tips
● Think about production from the beginning of R&D
● Build intelligent fallbacks to ease reliability concerns
○ When one element of a service fails, allow a slightly degraded state (e.g. serving a stale model)
● Build a microservice that you jointly maintain with engineers
● Consider whether your next hire should be a software engineer
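The fallback tip can be sketched roughly as follows. The `ModelServer` class and the `load_model` callable are hypothetical names chosen for illustration: when a refresh fails, the service keeps answering with the last model it loaded successfully and flags itself as stale so that monitoring can pick it up.

```python
class ModelServer:
    """Serves the newest model available, falling back to the last good one."""

    def __init__(self, load_model):
        # load_model: hypothetical callable that returns a new model or raises
        self._load_model = load_model
        self._model = None
        self.is_stale = False

    def refresh(self):
        """Try to load a new model; keep the previous one if loading fails."""
        try:
            self._model = self._load_model()
            self.is_stale = False
        except Exception:
            # Degraded state: keep the old model in place and flag it
            self.is_stale = True
        return not self.is_stale

    def predict(self, x):
        if self._model is None:
            raise RuntimeError("no model has been loaded yet")
        return self._model(x)


# Hypothetical loader that succeeds once and then fails
calls = {"n": 0}


def flaky_loader():
    calls["n"] += 1
    if calls["n"] > 1:
        raise IOError("model store unavailable")
    return lambda x: 2 * x


server = ModelServer(flaky_loader)
server.refresh()  # first load succeeds
server.refresh()  # second load fails; the stale model stays in place
result = server.predict(21)
```

After the failed refresh, `server.is_stale` is `True` and `predict` still works, which is the slightly degraded state the tip describes.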