Uber’s Data Science Workbench
Randy Wei Peng Du
Mission
Unleash the productivity of the Data
Science community at Uber by
providing scalable infrastructure,
tools, customization and support.
Tools of the Trade: Jupyter Notebooks
Alternative to traditional CLIs
Interactive tool which combines
Prose (HTML, Markdown)
Code (Py, R, Scala)
Visualization (charts, maps, tables)
Shareable artifact of knowledge
Hosted webapp
Notebook, Notes, Cells
Each cell is an executable block of code (one or more lines)
Used for
Data exploration, Cleansing, Modeling
Dashboarding/reporting
[Figure: notebook screenshot labeled HTML, Code, Output]
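A minimal sketch of how a notebook combines prose and code cells, assuming the standard nbformat 4 JSON layout. The cell contents and the helper names `cell_kinds` and `run_code_cells` are hypothetical, and the `exec` loop is only a toy stand-in for a real kernel.

```python
import json

# Hypothetical notebook content, laid out as nbformat 4 JSON.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Prose cell: rendered as Markdown/HTML, never executed.
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Trip counts\nA quick look at a toy sample.",
        },
        {
            # Code cell: a block of one or more lines, executed as a unit.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "trips = [3, 5, 8]\ntotal = sum(trips)\ntotal",
        },
    ],
}


def cell_kinds(nb):
    """Return the type of each cell, in document order."""
    return [cell["cell_type"] for cell in nb["cells"]]


def run_code_cells(nb):
    """Run code cells in order in one shared namespace (toy kernel stand-in)."""
    namespace = {}
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            exec(cell["source"], namespace)
    return namespace


# Roughly what an .ipynb file stores on disk, which is what makes it shareable.
serialized = json.dumps(notebook, indent=1)
```

Because cells share one namespace, a later cell can reuse `trips` or `total` defined by an earlier one.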
Tools of the Trade: RStudio Server
Browser interface to a remote R server
Centrally manage compute infrastructure
IDE for R
Syntax highlight, code completion
Debugging
Charts
File Browser
RStudio also has Notebook functionality
R has a huge library repository
Used mostly for rapid prototyping of models
on small datasets (UbeR)
[Figure: RStudio screenshot labeled Data, Code, Output]
Tools of the Trade: Apache Spark
SparkR
Distributed statistical computing framework
Run R code without translating it to Java
Choice of the Intelligent Decision, Insurance, etc. teams
PySpark
Distributed machine learning framework
Easy to integrate with scientific Python libraries
Choice of the Fraud Detection, Sensing and Perception, etc. teams
Requirements
Productivity
● Py, R, Scala interpreters in Jupyter
● Hosted RStudio support
● Version Control
● Custom libraries/environment
● Single-pane lifecycle mgmt.
● PySpark, SparkR
Scale
● Scalable Jupyter Server infra.
● Large dist. computation backend
● Multitenancy
● File Persistence
● Security
Ecosystem Integration
● Scheduling: Piper
● Dashboards: Shiny
● Data Exploration: Query engine API
● Deploy: Machine learning platform
● Chargeback: Monitoring platform
Knowledge (Social)
● Search
● Access Controls
● Sharing Controls
● Publish
● Comments & Discussion
State of the Union
Problem
● Data Scientists (DSs) start
at Uber with diverse
skillsets and backgrounds
● Precious time wasted in
infra. setup, version control,
search, sharing...
● Teams are building their
own solutions
Vision
● Web-based hub for all Data
Scientists at Uber
● Ability to centrally:
○ provision tools
○ leverage dist. backend
○ search, comment,
share
○ monitor
● Integrated with Uber’s data
ecosystem
● Dedicated SRE
Opportunity
● Find and reuse knowledge
● Opportunity for a dedicated
team to advocate for and
build the tools needed to make
DSs hyper-productive
● Cloud experience
● Chargeback
Similar offerings...
Architecture
[Diagram; components shown:]
Management Service: Create, Delete, Search, Share, Publish, Schedule
RStudio (Docker) and Jupyter (Docker) containers
Uber Mesos Infra, Shared File System
Spark workers: MLlib, PySpark, SparkR
Uber Spark debugging toolkit, Uber Spark development toolkit
Layer labels: Manage, Mesos, Spark
Architecture
[Diagram; components shown:]
Web GUI
Application Management Service: session / file management, proxy
Mesos Cluster: Docker containers running Jupyter Server (notebooks NB1, NB2) and RStudio Server
Hadoop Cluster (Hive, Presto, Spark): distributed processing
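A rough sketch of the session management and proxy role of the management service, assuming a simple in-memory registry. Every name here (`ManagementService`, `Session`, the `mesos.example` endpoint format, port 8888) is hypothetical and not taken from Uber's implementation; a real service would launch Docker containers on Mesos rather than invent an address.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Session:
    session_id: int
    owner: str
    tool: str       # "jupyter" or "rstudio"
    endpoint: str   # hypothetical address of the session's Docker container
    shared_with: set = field(default_factory=set)


class ManagementService:
    """Toy registry covering create, delete, search, share, and proxy routing."""

    TOOLS = ("jupyter", "rstudio")

    def __init__(self):
        self._sessions = {}
        self._ids = count(1)

    def create(self, owner, tool):
        if tool not in self.TOOLS:
            raise ValueError(f"unsupported tool: {tool}")
        sid = next(self._ids)
        # Placeholder for launching a container; we only record an address.
        endpoint = f"http://{tool}-{sid}.mesos.example:8888"
        session = Session(sid, owner, tool, endpoint)
        self._sessions[sid] = session
        return session

    def delete(self, sid, user):
        if self._sessions[sid].owner != user:
            raise PermissionError("only the owner may delete a session")
        del self._sessions[sid]

    def share(self, sid, owner, other):
        session = self._sessions[sid]
        if session.owner != owner:
            raise PermissionError("only the owner may share a session")
        session.shared_with.add(other)

    def search(self, user):
        """Sessions the user owns or has been given access to."""
        return [s for s in self._sessions.values()
                if s.owner == user or user in s.shared_with]

    def route(self, sid, user):
        """Proxy step: resolve a session to its container endpoint."""
        session = self._sessions[sid]
        if user != session.owner and user not in session.shared_with:
            raise PermissionError("user has no access to this session")
        return session.endpoint
```

Keeping access checks in `route` means the web GUI never talks to a container directly, which is one way to get multitenancy and sharing controls in a single place.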
[Diagram: Data Science Workbench within the Uber Ecosystem; components shown:]
Data Science Workbench
Uber ML platform Palette
Hive, Cassandra, Spark, HDFS
Spark SDK, Spark Debug tool, Spark templates
Query Runner
Models, Production
PySpark for ML
Data Visualization
Workflow Demo
Q&A
