Uber’s Data Science Workbench
Randy Wei Peng Du
Mission
Unleash the productivity of the Data
Science community at Uber by
providing scalable infrastructure,
tools, customization and support.
Tools of the Trade: Jupyter Notebooks
Alternative to traditional CLIs
Interactive tool which combines
Prose (HTML, Markdown)
Code (Python, R, Scala)
Visualization (charts, maps, tables)
Shareable artifact of knowledge
Hosted webapp
Notebook, Notes, Cells
Each code cell is an executable unit of one or more lines of code
Used for
Data exploration, Cleansing, Modeling
Dashboarding/reporting
[Screenshot: a Jupyter notebook with its HTML, code, and output regions labeled]
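The notebook/cell structure described on this slide can be sketched with the standard library alone. This is a minimal illustration, assuming the public Jupyter notebook JSON layout (nbformat 4): the helper name `make_notebook` and the sample cell contents are hypothetical and are not taken from the talk.

```python
import json


def make_notebook(cells):
    """Build a minimal nbformat-4-style notebook dict.

    `cells` is a list of (cell_type, source) pairs, where cell_type is
    "markdown" (prose) or "code". This helper is illustrative only.
    """
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells additionally carry an execution counter and outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {"cells": nb_cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = make_notebook([
    ("markdown", "# Trip analysis\nProse lives in markdown cells."),
    ("code", "total = sum([1, 2, 3])\ntotal"),
])

# A notebook file (.ipynb) is this structure serialized as JSON.
serialized = json.dumps(notebook, indent=1)
```

Because the whole artifact is a single JSON document, prose, code, and stored outputs travel together, which is what makes a notebook shareable.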
Tools of the Trade: RStudio Server
Browser interface to a remote R server
Centrally manage compute infrastructure
IDE for R
Syntax highlight, code completion
Debugging
Charts
File Browser
RStudio also has Notebook functionality
R has a huge library repository
Used mostly for rapid prototyping of models
on small datasets (UbeR)
[Screenshot: RStudio with its data, code, and output panes labeled]
Tools of the Trade: Apache Spark
SparkR
Distributed statistical computing framework
Run R code without translating it to Java
Choice of the Intelligent Decision, Insurance, and other teams
PySpark
Distributed machine learning framework
Easy to integrate with scientific Python libraries
Choice of the Fraud Detection, Sensing and Perception, and other teams
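To make "distributed computing" concrete, here is a small sketch of the partition-then-combine pattern that frameworks such as Spark apply across a cluster. It is a pure-Python stand-in that uses local threads, not the Spark API, and the function names (`partition`, `partial_stats`, `distributed_mean`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_partitions):
    """Split a list into at most n_partitions contiguous chunks."""
    size = max(1, -(-len(data) // n_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """Work done independently on each partition: a (sum, count) pair."""
    return sum(chunk), len(chunk)


def distributed_mean(data, n_partitions=4):
    """Compute a mean by combining per-partition partial results."""
    if not data:
        raise ValueError("data must not be empty")
    chunks = partition(data, n_partitions)
    # Threads stand in for cluster workers in this sketch.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Only the small (sum, count) pairs are brought back together, so the full dataset never has to sit on one machine. That property is what lets the same pattern scale out.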
Requirements
Productivity
● Py, R, Scala interpreters in Jupyter
● Hosted RStudio support
● Version Control
● Custom libraries/environment
● Single-pane lifecycle management
● PySpark, SparkR
Scale
● Scalable Jupyter Server infrastructure
● Large distributed computation backend
● Multitenancy
● File Persistence
● Security
Ecosystem Integration
● Scheduling: Piper
● Dashboards: Shiny
● Data Exploration: Query engine API
● Deploy: Machine learning platform
● Chargeback: Monitoring platform
Social
● Knowledge
● Search
● Access Controls
● Sharing Controls
● Publish
● Comments & Discussion
State of the Union
Problem
● Data Scientists (DSs) start at Uber with diverse skillsets and backgrounds
● Precious time is wasted on infrastructure setup, version control, search, sharing...
● Teams are building their own solutions
Vision
● Web-based hub for all Data Scientists at Uber
● Ability to centrally:
○ provision tools
○ leverage the distributed backend
○ search, comment, share
○ monitor
● Integrated with Uber’s data ecosystem
● Dedicated SRE
Opportunity
● Find and reuse knowledge
● Opportunity for a dedicated team to advocate for and build the tools needed to make DSs hyper-productive
● Cloud experience
● Chargeback
Similar offerings...
Architecture
[Diagram. Labeled components: Management Service (Create, Delete, Search, Share, Publish, Schedule); RStudio and Jupyter instances, each in a Docker container; Uber Mesos Infra; Shared File System; Spark workers (MLlib, PySpark, SparkR); Uber Spark debugging toolkit; Uber Spark development toolkit. Layer labels: Manage, Mesos, Spark.]
Architecture
[Diagram. Labeled components: Web GUI (Data Science Workbench); Application Management Service (session / file management, proxy); Mesos Cluster with Docker containers for RStudio Server and Jupyter Server (notebooks NB1, NB2); Hadoop Cluster (Hive, Presto, Spark) for distributed processing.]
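The management service's responsibilities named in the architecture (session management and proxying, plus create, delete, search, and share) can be sketched as a small in-memory model. Everything here is an assumption for illustration: the class names, the `/session/<id>/...` URL scheme, and the `example.internal` backend addresses are hypothetical and do not describe Uber's actual implementation.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional, Set


@dataclass
class Session:
    session_id: int
    owner: str
    tool: str      # "jupyter" or "rstudio"
    backend: str   # hypothetical host:port of the session's container
    shared_with: Set[str] = field(default_factory=set)


class SessionManager:
    """Toy stand-in for a session management and proxy layer."""

    SUPPORTED_TOOLS = ("jupyter", "rstudio")

    def __init__(self) -> None:
        self._sessions: Dict[int, Session] = {}
        self._ids = count(1)

    def create(self, owner: str, tool: str) -> Session:
        if tool not in self.SUPPORTED_TOOLS:
            raise ValueError(f"unsupported tool: {tool}")
        sid = next(self._ids)
        # A real system would launch a container here; we only record an address.
        session = Session(sid, owner, tool, f"container-{sid}.example.internal:8888")
        self._sessions[sid] = session
        return session

    def delete(self, session_id: int) -> bool:
        return self._sessions.pop(session_id, None) is not None

    def share(self, session_id: int, user: str) -> None:
        self._sessions[session_id].shared_with.add(user)

    def _can_access(self, session: Session, user: str) -> bool:
        return user == session.owner or user in session.shared_with

    def search(self, user: str) -> List[Session]:
        """Return the sessions the user owns or has been given access to."""
        visible = [s for s in self._sessions.values() if self._can_access(s, user)]
        return sorted(visible, key=lambda s: s.session_id)

    def route(self, user: str, path: str) -> Optional[str]:
        """Map /session/<id>/<rest> to a backend URL, or None if not allowed."""
        parts = path.strip("/").split("/", 2)
        if len(parts) < 2 or parts[0] != "session" or not parts[1].isdigit():
            return None
        session = self._sessions.get(int(parts[1]))
        if session is None or not self._can_access(session, user):
            return None
        rest = "/" + parts[2] if len(parts) == 3 else "/"
        return f"http://{session.backend}{rest}"
```

Putting the access check inside `route` means every proxied request passes through the same sharing rules as search, which is one plausible reason to keep session management and proxying in a single service.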
Uber Ecosystem
[Diagram. Labeled components: Uber ML platform Palette; Hive; Cassandra; Spark; Spark SDK, Spark Debug tool, Spark templates; Models; HDFS; Query Runner; Production; PySpark for ML; Data Visualization.]
Workflow Demo
Q&A

More Related Content

What's hot

Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Databricks
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit
 
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Data Con LA
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark Summit
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIOJozo Kovac
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on AzureTrivadis
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...DataWorks Summit/Hadoop Summit
 
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Databricks
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiBrian Olsen
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
 

What's hot (20)

Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan Saldich
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
 
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel Pedreschi
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 

Viewers also liked

Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017
Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017
Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...
Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...
Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...eCommerce Institute
 
Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017
Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017
Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017
Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017
Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...
Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...
Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...eCommerce Institute
 
Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...
Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...
Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...eCommerce Institute
 
Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017
Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017
Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017
Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017
Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...
Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...
Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...eCommerce Institute
 
Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017
Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017
Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017
Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017
Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017
Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017
Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017
Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017
Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017eCommerce Institute
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
 

Viewers also liked (16)

Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Churn management
Churn managementChurn management
Churn management
 
Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017
Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017
Presentación Gregorio Trimarco | Mastercard - eCommerce Day Buenos Aires 2017
 
Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...
Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...
Presentación Deb Reyes | Google - eCommerce Day Buenos Aires 2017 12 30 4. de...
 
Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017
Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017
Presentación Juan Tomac | Unilever - eCommerce Day Buenos Aires 2017
 
Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017
Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017
Presentación Juan Pablo Lafosse | Almundo - eCommerce Day Buenos Aires 2017
 
Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...
Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...
Presentación Mariano Tordo, Farmacity & Andres Zaied, Musimundo - eCommerce D...
 
Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...
Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...
Presentación Jorgelina Striedinger | Digital Element - eCommerce Day Buenos A...
 
Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017
Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017
Presentación Alberto Banano Pardo | AdsMovil - eCommerce Day Buenos Aires 2017
 
Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017
Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017
Presentación Cristian Adamo | Avantrip - eCommerce Day Buenos Aires 2017
 
Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...
Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...
Presentaciones Gustavo Sambucetti | CACE & GoforEcommerce - eCommerce Day Bue...
 
Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017
Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017
Presentación Sergio Grinbaum | Think Thanks - eCommerce Day Buenos Aires 2017
 
Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017
Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017
Presentación Eliane Iwasaki | Return Path - eCommerce Day Buenos Aires 2017
 
Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017
Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017
Presentación Joan Miró | NetQuest - eCommerce Day Buenos Aires 2017
 
Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017
Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017
Presentación Francisco Berroeta | Samsonite - eCommerce Day Buenos Aires 2017
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
 

Similar to Uber's data science workbench

Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17Nathan Bijnens
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 editionDavid Talby
 
Bhadale group of companies our technology ecosystem
Bhadale group of companies our technology ecosystemBhadale group of companies our technology ecosystem
Bhadale group of companies our technology ecosystemVijayananda Mohire
 
Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Startup Club
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Mobcoder
 
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligenceTour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligenceAlex Danvy
 
VanyaSehgal_Resume
VanyaSehgal_ResumeVanyaSehgal_Resume
VanyaSehgal_ResumeVANYA SEHGAL
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Debraj GuhaThakurta
 
Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...
Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...
Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...Lviv Startup Club
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventTrivadis
 
Analyzing data with docker v4
Analyzing data with docker   v4Analyzing data with docker   v4
Analyzing data with docker v4Andreas Dewes
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introductionakira-ai
 
December 2013 HUG: Hunk - Splunk over Hadoop
December 2013 HUG: Hunk - Splunk over HadoopDecember 2013 HUG: Hunk - Splunk over Hadoop
December 2013 HUG: Hunk - Splunk over HadoopYahoo Developer Network
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraAnant Corporation
 
Microsoft AI Platform Overview
Microsoft AI Platform OverviewMicrosoft AI Platform Overview
Microsoft AI Platform OverviewDavid Chou
 
Bigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExpBigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExpbigdata sunil
 
Sudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdfSudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdfsudipto801
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesDataWorks Summit
 

Similar to Uber's data science workbench (20)

Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
Bhadale group of companies our technology ecosystem
Bhadale group of companies our technology ecosystemBhadale group of companies our technology ecosystem
Bhadale group of companies our technology ecosystem
 
Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligenceTour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
 
VanyaSehgal_Resume
VanyaSehgal_ResumeVanyaSehgal_Resume
VanyaSehgal_Resume
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
 
Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...
Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...
Borys Rybak “How to make your data smart with Artificial Intelligence and Mac...
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Analyzing data with docker v4
Analyzing data with docker   v4Analyzing data with docker   v4
Analyzing data with docker v4
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introduction
 
sudipto_resume
sudipto_resumesudipto_resume
sudipto_resume
 
December 2013 HUG: Hunk - Splunk over Hadoop
December 2013 HUG: Hunk - Splunk over HadoopDecember 2013 HUG: Hunk - Splunk over Hadoop
December 2013 HUG: Hunk - Splunk over Hadoop
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
 
Microsoft AI Platform Overview
Microsoft AI Platform OverviewMicrosoft AI Platform Overview
Microsoft AI Platform Overview
 
Bigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExpBigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExp
 
Sudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdfSudipta_Mukherjee_Resume_APR_2023.pdf
Sudipta_Mukherjee_Resume_APR_2023.pdf
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 

Uber's data science workbench

  • 1. Uber’s Data Science Workbench Randy Wei Peng Du
  • 3. Mission Unleash the productivity of the Data Science community at Uber by providing scalable infrastructure, tools, customization and support.
  • 4. Tools of the Trade: Jupyter Notebooks. Alternative to traditional CLIs. Interactive tool which combines prose (HTML, Markdown), code (Py, R, Scala) and visualization (charts, maps, tables). Shareable artifact of knowledge. Hosted webapp. Notebook, Notes, Cells: each cell is an executable line of code. Used for data exploration, cleansing, modeling, and dashboarding/reporting. (Figure labels: HTML, Code, Output.)
  • 5. Tools of the Trade: RStudio Server. Browser interface to a remote R server. Centrally manage compute infrastructure. IDE for R: syntax highlighting, code completion, debugging, charts, file browser. RStudio also has Notebook functionality. R has a huge library repository. Used mostly for rapid prototyping of models on small datasets (UbeR). (Figure labels: Data, Code, Output.)
  • 6. Tools of the Trade: Apache Spark. SparkR: distributed statistical computing framework; run R code without translating it to Java; choice of the Intelligent Decision, Insurance, etc. teams. PySpark: distributed machine learning framework; easy to integrate with scientific Python libraries; choice of the Fraud Detection, Sensing and Perception, etc. teams.
  • 7. Requirements. Productivity: Py, R, Scala interpreters in Jupyter; hosted RStudio support; version control; custom libraries/environment; single-pane lifecycle management; PySpark, SparkR. Scale: scalable Jupyter Server infrastructure; large distributed computation backend; multitenancy; file persistence; security. Ecosystem Integration: scheduling (Piper); dashboards (Shiny); data exploration (query engine API); deploy (machine learning platform); chargeback (monitoring platform). Knowledge (Social): search; access controls; sharing controls; publish; comments and discussion.
  • 8. State of the Union. Problem: Data Scientists (DSs) start at Uber with diverse skillsets and backgrounds; precious time is wasted on infrastructure setup, version control, search, sharing...; teams are building their own solutions. Vision: a web-based hub for all Data Scientists at Uber; ability to centrally provision tools, leverage the distributed backend, search, comment, share, and monitor; integrated with Uber’s data ecosystem; dedicated SRE. Opportunity: find and reuse knowledge; a dedicated team to advocate for and build the tools needed to make DSs hyper-productive; cloud experience; chargeback.
  • 10. Architecture (diagram). Management Service: create, delete, search, share, publish, schedule. It manages RStudio (Docker) and Jupyter (Docker) containers on Uber Mesos infra with a shared file system. Spark on Mesos: MLlib, PySpark and SparkR workers, plus the Uber Spark debugging toolkit and the Uber Spark development toolkit.
  • 11. Architecture (diagram). Web GUI; Application Management Service (session / file management, proxy); Mesos cluster running Docker containers for RStudio Server and Jupyter Server, with notebooks NB1 and NB2; Hadoop cluster (Hive, Presto, Spark) for distributed processing.
  • 12. Data Science Workbench in the Uber ecosystem (diagram). Labels: Uber ML platform, Palette, Hive, Cassandra, Spark, HDFS, Query Runner, Models, Production, Data Visualization, PySpark for ML; Spark SDK, Spark Debug tool, Spark templates.
  • 19. Q&A
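
The Management Service on slides 10 and 11 is described only by its operations (create, delete, search, share, publish) and by "session / file management". As a rough illustration of how those operations might fit together, here is a minimal in-memory sketch in Python. Every name in it (`Session`, `ManagementService`, the method signatures, the visibility rule) is a hypothetical stand-in rather than Uber's actual API, and scheduling, Docker/Mesos provisioning and the proxy are left out entirely.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Set


@dataclass
class Session:
    """One hosted notebook or IDE session (hypothetical record)."""
    session_id: int
    owner: str
    tool: str  # "jupyter" or "rstudio"
    title: str
    shared_with: Set[str] = field(default_factory=set)
    published: bool = False


class ManagementService:
    """In-memory stand-in for create / delete / search / share / publish."""

    SUPPORTED_TOOLS = {"jupyter", "rstudio"}

    def __init__(self) -> None:
        self._sessions: Dict[int, Session] = {}
        self._ids = count(1)

    def create(self, owner: str, tool: str, title: str) -> Session:
        if tool not in self.SUPPORTED_TOOLS:
            raise ValueError(f"unsupported tool: {tool}")
        session = Session(next(self._ids), owner, tool, title)
        self._sessions[session.session_id] = session
        return session

    def _owned(self, user: str, session_id: int) -> Session:
        # Assumed rule: only the owner may delete, share or publish.
        session = self._sessions[session_id]
        if session.owner != user:
            raise PermissionError(f"{user} does not own session {session_id}")
        return session

    def delete(self, user: str, session_id: int) -> None:
        self._owned(user, session_id)
        del self._sessions[session_id]

    def share(self, user: str, session_id: int, other_user: str) -> None:
        self._owned(user, session_id).shared_with.add(other_user)

    def publish(self, user: str, session_id: int) -> None:
        self._owned(user, session_id).published = True

    def _can_view(self, user: str, session: Session) -> bool:
        return (
            session.published
            or session.owner == user
            or user in session.shared_with
        )

    def search(self, user: str, text: str) -> List[Session]:
        """Case-insensitive title search over sessions the user may see."""
        needle = text.lower()
        hits = [
            s for s in self._sessions.values()
            if self._can_view(user, s) and needle in s.title.lower()
        ]
        return sorted(hits, key=lambda s: s.session_id)
```

In this sketch, search results depend on who is asking: a user sees their own sessions, sessions shared with them, and anything published. That is one plausible reading of the "Access Controls" and "Sharing Controls" requirements on slide 7, not a documented behavior.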