SlideShare a Scribd company logo
1 of 31
Download to read offline
Zhenxiao Luo
Engineering Manager @ Uber
Machine Learning and Big
Data @ Uber:
A Tale of Two Systems
Mission
Uber Business Highlights
Machine Learning @ Uber
Pain Points
Big Data @ Uber
Machine Learning support in Big Data
Ongoing Work
Agenda
Transportation as reliable as running water, everywhere, for everyone
Uber Mission
Uber Stats
6
Continents
73
Countries
450
Cities
12,000
Employees
10+ Million
Avg. Trips/Day
40+ Million
MAU Riders
1.5+ Million
MAU Drivers
Kafka
Data Infrastructure @ Uber
Schemaless
MySQL,
Postgres
Vertica
Streamio
Raw
Data
Raw
Tables
Sqoop
Reports
Hadoop
Hive Presto Spark
Notebook Ad Hoc Queries
Real Time
Applications
Machine
Learning Jobs
Business
Intelligence Jobs
Cluster
Management
All-Active
Observability
Security
Vertica
Samza
Pinot
Flink
MemSQL
Modeled
Tables
Streaming
Warehouse
Real-time
Decrease
ETD error by
26%
Decreased
p50 ETA error
by 20% in
NYC and
Paris
Research Program (Applied)
Platform
Michelangelo
EATs Fraud ATGAI LabsResearch team
invents new
techniques
which are added
to the platform
Program
teams build
ML apps on
the platform
Research teams focus their
research based on business
priorities from program teams
Applied
AML
Applied ML teams helps with problems
when Program team doesn’t have right
expertise; platform-izes the solutions
Michelangelo: Uber Machine Learning Platform
DEVELOPMENT PHASE PRODUCTION PHASE OPERATIONAL
PHASE
ETL
Data dictionary /
Lineage
Feature
generation /
extraction
Model /
Param
selection
Train / Test
Visualize
IDE / Collaboration tools / Source
control
Workflow / Scheduler tools
Model management
ETL /
Feature
extraction /
Train
Dashboard
Scalable
RT
scoring
Scalable
Batch
scoring
Alerting
Operational
Metrics
Model
Performance
monitoring
Experiments
Workflow of a Machine Learning Project
Same basic ML workflow & system requirements for
Traditional ML & Deep Learning
Supervised, Unsupervised & Semi-supervised learning
Online learning
Batch, Realtime and Mobile deployments
Timeseries Forecasting
Workflow
GET DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS
BATCH
REALTIME
Spark /
SQL
Training
Algo
Trained
Models
Outcomes
(Training Set)
Deploy
Realtime Predict
Service
Trained
Model
<< Features Client
ServicePrediction >>
Features
Cassandra
Feature Store
Hive
Feature Store
To Hive
& Kafka
Stream
Engine
ETL
Batch Predict Job
(Spark)
Data Prep Job
Trained
Model
Predictions
Sampled
Predictions
Performance
Monitor Job
(Spark)
Metrics
To Monitor
System
GET DATA TRAIN MODELS DEPLOY, PREDICT & MONITOREVAL MODELS
Batch Training Job
(Spark or Mesos)
Pure function expressions for
Feature selection
Feature transformations (for derived and composite features)
Standard set of accessor functions for
Feature store
Basis features
Column stats (min, max, mean, std-dev, etc)
Standard transformation functions + UDFs
Examples
@palette:store:orders:prep_time_avg_1week:rs_uuid
nFill(@basis:distance, mean(@basis:distance))
DSL for Feature Engineering
Batch Training Job
(Spark)
Training
Algo
DSL
Batch Prediction Job
(Spark)
Trained
Model
DSL
Data Pipeline for Predictions
SPARK SQL FEATURE DSL ML MODEL
RAW DATA
BASIS
FEATURES
TRANSFORMED
FEATURES
PREDICTIONS
FEATURE STORE
FEATURE STORE
FEATURES
DATA LAKE
Pain Points
● Users have to learn DSL for feature engineering
○ Waste of data scientist time
● ETL and Feature extraction could last for hours
○ Waste of data scientist time
● Features dump onto disk several times
○ Waste of data scientist time
● Feature store and data lake are separate
○ Waste of hardware resource
● Machine Learning and Data Analytics are separate
○ Waste of hardware resource and software resource
Ideally, Feature Engineering should ...
● Users do not need to learn anything new
● User do not need to worry about implementation details
● Machine Learning jobs are running as fast
● Features stored in Data Lake
● Features are cached in memory for repeated processing
● Machine Learning pipelines and Data Analytics share the
same hardware, and maybe software?
Try Declarative Language
● SQL extensible language
● Users no need to worry about implementation details
● Feature store resides in data lake
● Data kept in memory for repeated processing
● Machine Learning and Data Analytics
○ share the same hardware
○ share the same software
○ Streaming, Pipelined, Parallel, Vectorized
execution
The Data Model
● All datasets are tables
○ Training datasets
○ Validation datasets
● Labels and Features are columns
○ Label is integer
○ Feature is map<integer, float>
● Models are SQL extensible types
○ <size>:<model>
○ Model: Variable length serialized type
● Classify, Regress are SQL extensible functions
The Execution Model
● Data are read from Hadoop once
● Data are kept in memory for repeated processing
● Streaming, Pipelined, Parallel, Vectorized execution
● Machine Learning Functions are implemented as:
○ Map
■ Applies to each input row
○ Combine
■ Merge partial aggregations
○ Reduce
■ Final aggregation
In Summary
● Users use extensible SQLs to train and predict models
● All data are modeled as tables
● All data stored in data lake
● Data read from disk once, kept in memory for repeated
processing
● A framework with execution primitives, to support:
○ SQL Analytics
○ Machine Learning
What does it look like?
The Presto Machine Learning Plugin
● All data are modeled as tables
● Data read from disk just once
○ Data are kept in memory for repeated processing
○ Streaming, Pipelined, Parallel, Vectorized execution
● Presto Connectors
○ SQL or Machine Learning on any data, e.g. Hadoop, MySQL,
Elasticsearch, Cassandra, MongoDB, etc.
● Presto provides execution primitives:
○ Input Function -- Map
○ Combine Function -- Combine
○ Output function -- Reduce
Significantly Reduce Complexity
Machine
Learning in
Presto
RAW DATA PREDICTIONS
DATA LAKE
● Data Scientists just use SQL for Machine Learning
● Model development time speed up > 5X
● Machine Learning pipeline speeds up > 10X
● Shared data lake
● Shared resource management
● Shared primitive executions
What is Presto: Interactive SQL Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL, Extensible for Machine Learning
Battle-tested by Facebook, Uber, & Netflix
Completely open source
Access to petabytes of data in the Hadoop data lake
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
○ Inline virtual function calls
○ Inline constants
○ Rewrite inner loops
○ Rewrite type-specific branches
Scale of Presto @ Uber
● 2 clusters
○ Application cluster
■ Hundreds of machines
■ 100K queries per day
■ P90: 30s
○ Ad hoc cluster
■ Hundreds of machines
■ 100K queries per day
■ P90: 60s
● Access to both raw and model tables
○ 5 petabytes of data
● Total 200K+ queries per day
Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
○ Short running queries higher priority
● Memory Management
○ Max memory per query per node
○ If query exceeds max memory limit, query fails
○ No OutOfMemory in Presto process
Presto Connectors: No Need to Copy Data
Presto Plugin
● Presto Connectors
○ ConnectorMetadata
○ ConnectorSplitManager
○ ConnectorSplit
○ ConnectorRecordCursor
● Extensible Types
○ Native Container Type
○ Type Signature
● Extensible Functions
○ Input Function, Combine Function, Output Function
Presto Machine Learning Plugin
● Types
○ Model Type : variable length type
○ Classifier Parametric Type
● Functions
○ Classify
○ Regress
○ Features
○ Learn_libsvm_regressor
○ Learn_libsvm_classifier
○ Evaluate_classifier_predictions
Presto Ongoing Work
● Machine Learning support in Presto
● Presto Elasticsearch Connector
● Cost based optimizer
● Spill to disk
● Multi-tenancy Support
● Authentication and Authorization
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be
reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the
use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise
exempt from disclosure under applicable law. All recipients of this document are notified that the information contained
herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any
way disclose this document or any of the enclosed information to any person other than employees of addressee to the
extent necessary for consultations with authorized personnel of Uber.
We are Hiring
https://www.uber.com/careers/list/27366/
Send resumes to:
luoz@uber.com
Interested in learning more about Uber Eng?
Eng.uber.com
Follow us on Twitter:
@UberEng

More Related Content

What's hot

Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Amazon Web Services
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseSnowflake Computing
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...Altinity Ltd
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureZaloni
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 

What's hot (20)

Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data Architecture
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 

Similar to Machine learning and big data @ uber a tale of two systems

Presto Apache BigData 2017
Presto Apache BigData 2017Presto Apache BigData 2017
Presto Apache BigData 2017Zhenxiao Luo
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Zhenxiao Luo
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
Even Faster: When Presto meets Parquet @ Uber
Even Faster: When Presto meets Parquet @ UberEven Faster: When Presto meets Parquet @ Uber
Even Faster: When Presto meets Parquet @ UberDataWorks Summit
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Node.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scaleNode.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scaleDmytro Semenov
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Dipti Borkar
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at UberDataWorks Summit
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 

Similar to Machine learning and big data @ uber a tale of two systems (20)

Presto Apache BigData 2017
Presto Apache BigData 2017Presto Apache BigData 2017
Presto Apache BigData 2017
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Even Faster: When Presto meets Parquet @ Uber
Even Faster: When Presto meets Parquet @ UberEven Faster: When Presto meets Parquet @ Uber
Even Faster: When Presto meets Parquet @ Uber
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Node.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scaleNode.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scale
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at Uber
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
PyTables
PyTablesPyTables
PyTables
 

More from Zhenxiao Luo

Presto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto SummitPresto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto SummitZhenxiao Luo
 
Uber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks SummitUber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks SummitZhenxiao Luo
 
Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Zhenxiao Luo
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15Zhenxiao Luo
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Zhenxiao Luo
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 

More from Zhenxiao Luo (7)

Presto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto SummitPresto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto Summit
 
Uber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks SummitUber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks Summit
 
Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 

Recently uploaded

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptTanveerAhmed817946
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...mikehavy0
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?RemarkSemacio
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...LuisMiguelPaz5
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444saurabvyas476
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格q6pzkpark
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证pwgnohujw
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjaytendertech
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchersdarmandersingh4580
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 

Recently uploaded (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted KitAbortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

Machine learning and big data @ uber a tale of two systems

  • 1. Zhenxiao Luo Engineering Manager @ Uber Machine Learning and Big Data @ Uber: A Tale of Two Systems
  • 2. Mission Uber Business Highlights Machine Learning @ Uber Pain Points Big Data @ Uber Machine Learning support in Big Data Ongoing Work Agenda
  • 3. Transportation as reliable as running water, everywhere, for everyone Uber Mission
  • 4. Uber Stats 6 Continents 73 Countries 450 Cities 12,000 Employees 10+ Million Avg. Trips/Day 40+ Million MAU Riders 1.5+ Million MAU Drivers
  • 5. Kafka Data Infrastructure @ Uber Schemaless MySQL, Postgres Vertica Streamio Raw Data Raw Tables Sqoop Reports Hadoop Hive Presto Spark Notebook Ad Hoc Queries Real Time Applications Machine Learning Jobs Business Intelligence Jobs Cluster Management All-Active Observability Security Vertica Samza Pinot Flink MemSQL Modeled Tables Streaming Warehouse Real-time
  • 6. Decrease ETD error by 26% Decreased p50 ETA error by 20% in NYC and Paris
  • 7. Research Program (Applied) Platform Michelangelo EATs Fraud ATGAI LabsResearch team invents new techniques which are added to the platform Program teams build ML apps on the platform Research teams focus their research based on business priorities from program teams Applied AML Applied ML teams helps with problems when Program team doesn’t have right expertise; platform-izes the solutions Michelangelo: Uber Machine Learning Platform
  • 8. DEVELOPMENT PHASE PRODUCTION PHASE OPERATIONAL PHASE ETL Data dictionary / Lineage Feature generation / extraction Model / Param selection Train / Test Visualize IDE / Collaboration tools / Source control Workflow / Scheduler tools Model management ETL / Feature extraction / Train Dashboard Scalable RT scoring Scalable Batch scoring Alerting Operational Metrics Model Performance monitoring Experiments Workflow of a Machine Learning Project
  • 9. Same basic ML workflow & system requirements for Traditional ML & Deep Learning Supervised, Unsupervised & Semi-supervised learning Online learning Batch, Realtime and Mobile deployments Timeseries Forecasting Workflow GET DATA TRAIN MODELS EVALUATE MODELS DEPLOY MODELS MAKE PREDICTIONS MONITOR PREDICTIONS
  • 10. BATCH REALTIME Spark / SQL Training Algo Trained Models Outcomes (Training Set) Deploy Realtime Predict Service Trained Model << Features Client ServicePrediction >> Features Cassandra Feature Store Hive Feature Store To Hive & Kafka Stream Engine ETL Batch Predict Job (Spark) Data Prep Job Trained Model Predictions Sampled Predictions Performance Monitor Job (Spark) Metrics To Monitor System GET DATA TRAIN MODELS DEPLOY, PREDICT & MONITOREVAL MODELS Batch Training Job (Spark or Mesos)
  • 11. Pure function expressions for Feature selection Feature transformations (for derived and composite features) Standard set of accessor functions for Feature store Basis features Column stats (min, max, mean, std-dev, etc) Standard transformation functions + UDFs Examples @palette:store:orders:prep_time_avg_1week:rs_uuid nFill(@basis:distance, mean(@basis:distance)) DSL for Feature Engineering Batch Training Job (Spark) Training Algo DSL Batch Prediction Job (Spark) Trained Model DSL
  • 12. Data Pipeline for Predictions SPARK SQL FEATURE DSL ML MODEL RAW DATA BASIS FEATURES TRANSFORMED FEATURES PREDICTIONS FEATURE STORE FEATURE STORE FEATURES DATA LAKE
  • 13. Pain Points ● Users have to learn DSL for feature engineering ○ Waste of data scientist time ● ETL and Feature extraction could last for hours ○ Waste of data scientist time ● Features dump onto disk several times ○ Waste of data scientist time ● Feature store and data lake are separate ○ Waste of hardware resource ● Machine Learning and Data Analytics are separate ○ Waste of hardware resource and software resource
  • 14. Ideally, Feature Engineering should ... ● Users do not need to learn anything new ● User do not need to worry about implementation details ● Machine Learning jobs are running as fast ● Features stored in Data Lake ● Features are cached in memory for repeated processing ● Machine Learning pipelines and Data Analytics share the same hardware, and maybe software?
  • 15. Try Declarative Language ● SQL extensible language ● Users no need to worry about implementation details ● Feature store resides in data lake ● Data kept in memory for repeated processing ● Machine Learning and Data Analytics ○ share the same hardware ○ share the same software ○ Streaming, Pipelined, Parallel, Vectorized execution
  • 16. The Data Model ● All datasets are tables ○ Training datasets ○ Validation datasets ● Labels and Features are columns ○ Label is integer ○ Feature is map<integer, float> ● Models are SQL extensible types ○ <size>:<model> ○ Model: Variable length serialized type ● Classify, Regress are SQL extensible functions
  • 17. The Execution Model ● Data are read from Hadoop once ● Data are kept in memory for repeated processing ● Streaming, Pipelined, Parallel, Vectorized execution ● Machine Learning Functions are implemented as: ○ Map ■ Applies to each input row ○ Combine ■ Merge partial aggregations ○ Reduce ■ Final aggregation
  • 18. In Summary ● Users use extensible SQLs to train and predict models ● All data are modeled as tables ● All data stored in data lake ● Data read from disk once, kept in memory for repeated processing ● A framework with execution primitives, to support: ○ SQL Analytics ○ Machine Learning
  • 19. What does it look like?
  • 20. The Presto Machine Learning Plugin ● All data are modeled as tables ● Data read from disk just once ○ Data are kept in memory for repeated processing ○ Streaming, Pipelined, Parallel, Vectorized execution ● Presto Connectors ○ SQL or Machine Learning on any data, e.g. Hadoop, MySQL, Elasticsearch, Cassandra, MongoDB, etc. ● Presto provides execution primitives: ○ Input Function -- Map ○ Combine Function -- Combine ○ Output function -- Reduce
  • 21. Significantly Reduce Complexity Machine Learning in Presto RAW DATA PREDICTIONS DATA LAKE ● Data Scientists just use SQL for Machine Learning ● Model development time speed up > 5X ● Machine Learning pipeline speeds up > 10X ● Shared data lake ● Shared resource management ● Shared primitive executions
  • 22. What is Presto: Interactive SQL Engine for Big Data Interactive query speeds Horizontally scalable ANSI SQL, Extensible for Machine Learning Battle-tested by Facebook, Uber, & Netflix Completely open source Access to petabytes of data in the Hadoop data lake
  • 24. Why Presto is Fast ● Data in memory during execution ● Pipelining and streaming ● Columnar storage & execution ● Bytecode generation ○ Inline virtual function calls ○ Inline constants ○ Rewrite inner loops ○ Rewrite type-specific branches
  • 25. Scale of Presto @ Uber ● 2 clusters ○ Application cluster ■ Hundreds of machines ■ 100K queries per day ■ P90: 30s ○ Ad hoc cluster ■ Hundreds of machines ■ 100K queries per day ■ P90: 60s ● Access to both raw and model tables ○ 5 petabytes of data ● Total 200K+ queries per day
  • 26. Resource Management ● Presto has its own resource manager ○ Not on YARN ○ Not on Mesos ● CPU Management ○ Priority queues ○ Short running queries higher priority ● Memory Management ○ Max memory per query per node ○ If query exceeds max memory limit, query fails ○ No OutOfMemory in Presto process
  • 27. Presto Connectors: No Need to Copy Data
  • 28. Presto Plugin ● Presto Connectors ○ ConnectorMetadata ○ ConnectorSplitManager ○ ConnectorSplit ○ ConnectorRecordCursor ● Extensible Types ○ Native Container Type ○ Type Signature ● Extensible Functions ○ Input Function, Combine Function, Output Function
  • 29. Presto Machine Learning Plugin ● Types ○ Model Type : variable length type ○ Classifier Parametric Type ● Functions ○ Classify ○ Regress ○ Features ○ Learn_libsvm_regressor ○ Learn_libsvm_classifier ○ Evaluate_classifier_predictions
  • 30. Presto Ongoing Work ● Machine Learning support in Presto ● Presto Elasticsearch Connector ● Cost based optimizer ● Spill to disk ● Multi-tenancy Support ● Authentication and Authorization
  • 31. Thank you Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. We are Hiring https://www.uber.com/careers/list/27366/ Send resumes to: luoz@uber.com Interested in learning more about Uber Eng? Eng.uber.com Follow us on Twitter: @UberEng