Uber's data science workbench


Ran Wei

https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56711


Uber’s Data Science Workbench
Randy Wei, Peng Du

Mission
Unleash the productivity of the Data
Science community at Uber by
providing scalable infrastructure,
tools, customization and support.

Tools of the Trade: Jupyter Notebooks
Alternative to traditional CLIs
Interactive tool which combines
Prose (HTML, Markdown),
Code (Py, R, Scala),
Visualization (charts, maps, tables)
Shareable artifact of knowledge
Hosted webapp
Notebook, Notes, Cells
Each cell is an executable unit of code (one or more lines)
Used for
Data exploration, Cleansing, Modeling
Dashboarding/reporting
[Screenshot of a notebook annotated with the labels: HTML, Code, Output]
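
As a rough illustration of the notebook and cell structure described on this slide, the sketch below builds a minimal notebook document by hand using only the Python standard library. The top-level keys follow the publicly documented Jupyter notebook JSON format (version 4). The cell contents and the helper names `markdown_cell` and `code_cell` are made up for illustration and do not come from the talk.

```python
import json


def markdown_cell(text):
    """Prose cell: rendered as Markdown/HTML by the notebook front end."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Code cell: run by the kernel; outputs are stored next to the code."""
    return {
        "cell_type": "code",
        "execution_count": None,  # filled in once the cell has been run
        "metadata": {},
        "outputs": [],
        "source": source,
    }


# Hypothetical two-cell notebook: one prose cell, one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("# Trip exploration\nHypothetical example notebook."),
        code_cell("trips = [3.2, 5.1, 1.8]\nsum(trips) / len(trips)"),
    ],
}

# A notebook on disk is this JSON document saved with an .ipynb extension.
notebook_json = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
```

Because prose, code, and stored outputs live in one JSON file, the notebook can be shared and versioned as a single artifact.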

Tools of the Trade: RStudio Server
Browser interface to a remote R server
Centrally manage compute infrastructure
IDE for R
Syntax highlighting, code completion
Debugging
Charts
File Browser
RStudio also has Notebook functionality
R has a huge library repository
Used mostly for rapid prototyping of models on small datasets (UbeR)
[Screenshot of RStudio annotated with the labels: Data, Code, Output]

Tools of the Trade: Apache Spark
SparkR
Distributed statistical computing framework
Run R code without translating it to Java
Choice of Intelligent Decision, Insurance, etc. teams
PySpark
Distributed machine learning framework
Easy to integrate with scientific Python libraries
Choice of Fraud Detection, Sensing and Perception, etc. teams
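
To make the "easy to integrate with Python" point concrete, here is a minimal sketch of wrapping an ordinary Python function as a Spark UDF. The distance function, the column names, and `add_distance_column` are hypothetical and chosen only for illustration; the talk does not show this code. The PySpark imports are done lazily so the plain-Python part can be tried without a Spark installation.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def add_distance_column(trips_df):
    """Apply the plain-Python function to every row of a Spark DataFrame.

    Assumes `trips_df` has numeric columns pickup_lat, pickup_lon,
    dropoff_lat, dropoff_lon (hypothetical names).
    """
    # Imported here so the function above works without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    distance_udf = udf(haversine_km, DoubleType())
    return trips_df.withColumn(
        "distance_km",
        distance_udf("pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon"),
    )
```

The same function can be unit-tested locally on a few values and then applied to a distributed DataFrame unchanged, which is one reason a team might prototype in plain Python first.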

Requirements
Productivity
● Py, R, Scala interpreters in Jupyter
● Hosted RStudio support
● Version Control
● Custom libraries/environment
● Single-pane lifecycle management
● PySpark, SparkR
Scale
● Scalable Jupyter Server infra.
● Large distributed computation backend
● Multitenancy
● File Persistence
● Security
Ecosystem Integration
● Scheduling: Piper
● Dashboards: Shiny
● Data Exploration: Query engine API
● Deploy: Machine learning platform
● Chargeback: Monitoring platform
Knowledge (Social)
● Search
● Access Controls
● Sharing Controls
● Publish
● Comments & Discussion

State of the Union
Problem
● Data Scientists (DSs) start at Uber with diverse skillsets and backgrounds
● Precious time wasted in infra. setup, version control, search, sharing...
● Teams are building their own solutions
Vision
● Web-based hub for all Data Scientists at Uber
● Ability to centrally:
○ provision tools
○ leverage the distributed backend
○ search, comment, share
○ monitor
● Integrated with Uber’s data ecosystem
● Dedicated SRE
Opportunity
● Find and reuse knowledge
● Opportunity for a dedicated team to advocate for and build the tools needed to make DSs hyper-productive
● Cloud experience
● Chargeback

Architecture
[Diagram labels: Management Service (Create, Delete, Search, Share, Publish, Schedule); RStudio (Docker) and Jupyter (Docker) containers; Uber Mesos Infra; Shared File System; Spark workers (MLlib, PySpark, SparkR); Uber Spark debugging toolkit; Uber Spark development toolkit; Manage, Mesos, Spark]
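
The operations named for the Management Service on this slide (Create, Delete, Search, Share, Publish, Schedule) can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the class, field, and method names are hypothetical, nothing here reflects Uber's actual service or API, and a real service would also launch and tear down the Docker containers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    """One hosted notebook or IDE session (hypothetical data model)."""
    session_id: int
    owner: str
    tool: str                      # "jupyter" or "rstudio"
    name: str
    shared_with: set = field(default_factory=set)
    published: bool = False
    schedule: Optional[str] = None  # e.g. a cron-style string


class ManagementService:
    """In-memory stand-in for create/delete/search/share/publish/schedule."""

    def __init__(self):
        self._sessions = {}
        self._next_id = 1

    def create(self, owner, tool, name):
        if tool not in ("jupyter", "rstudio"):
            raise ValueError("unsupported tool: " + tool)
        session = Session(self._next_id, owner, tool, name)
        self._sessions[session.session_id] = session
        self._next_id += 1
        return session

    def delete(self, session_id):
        del self._sessions[session_id]

    def share(self, session_id, user):
        self._sessions[session_id].shared_with.add(user)

    def publish(self, session_id):
        self._sessions[session_id].published = True

    def schedule(self, session_id, cron):
        self._sessions[session_id].schedule = cron

    def search(self, user, text):
        """Return sessions visible to `user` whose name contains `text`."""
        return [
            s for s in self._sessions.values()
            if text.lower() in s.name.lower()
            and (s.owner == user or user in s.shared_with or s.published)
        ]


# Usage: bob only sees alice's notebook after she shares it with him.
service = ManagementService()
notebook = service.create("alice", "jupyter", "Fraud features")
service.create("alice", "rstudio", "Pricing prototype")
before_share = [s.name for s in service.search("bob", "fraud")]
service.share(notebook.session_id, "bob")
after_share = [s.name for s in service.search("bob", "fraud")]
```

Keeping access checks inside `search` means sharing and publishing controls are enforced in one place rather than in each client.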

Architecture
[Diagram labels: Web GUI; Application Management Service (session / file management, proxy); Mesos Cluster with Docker containers running Jupyter Server (notebooks NB1, NB2) and RStudio Server; Hadoop Cluster (Hive, Presto, Spark) for distributed processing]

Data Science Workbench
[Diagram labels: Uber Ecosystem; Uber ML platform; Palette; Hive; Cassandra; HDFS; Spark; Spark SDK, Spark Debug tool, Spark templates; Query Runner; Models; Production; PySpark for ML; Data Visualization]



9. Similar offerings...


13. Workflow Demo


19. Q&A
