Wei Yan
Zhenxiao Luo
Software Engineer @ Uber
Geospatial big data analysis @ Uber
Mission
Uber Business Highlights
Analytics Infrastructure @ Uber
Presto
Interactive SQL engine for Big Data
GeoSpatial Analytics
GeoSpatial Optimizations for Presto
Ongoing Work
Agenda
Transportation as reliable as running water, everywhere, for everyone
Uber Mission
Uber Stats
6
Continents
73
Countries
633
Cities
23,000
Employees
10+ Million
Avg. Trips/Day
40+ Million
MAU Riders
1.5+ Million
MAU Drivers
Kafka
Analytics Infrastructure @ Uber
Schemaless
MySQL,
Postgres
Vertica
Streamific
Raw
Data
Raw
Tables
Sqoop
Reports
Hadoop
Hive Presto Spark
Notebook Ad Hoc Queries
Real Time
Applications
Machine
Learning Jobs
Business
Intelligence Jobs
Cluster
Management
All-Active
Observability
Security
Vertica
Samza
Pinot
Flink
MemSQL
Modeled
Tables
Streaming
Warehouse
Real-time
YARN/HDFS Cluster (per DC)
● 2K+ machines
● 150+ PB storage space
Presto Cluster (per DC)
● 2 clusters
● Hundreds of machines
Applications
● Hive
○ 40K+ queries per day
● Presto
○ 180K+ queries per day
● Spark
○ 100K+ jobs
Scale of Hadoop @ Uber
● Marketplace pricing
○ Real-time driver incentives
● Communication platform
○ Driver quality and action platform
○ Rider/driver cohorting
○ Ops, comms, & marketing
● Growth marketing
○ BI dashboard for growth marketing
● Data science
○ Exploratory analytics using notebooks
● Machine learning platform
● Ad-hoc user queries
Applications of Hadoop @ Uber
● Fast growing demand
● Fast growing number of servers & services
● Fast query engine
● Multi-tenant shared infrastructure
○ Resource allocation
○ Bad applications
Our Challenges
What is Presto: Interactive SQL Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested by Facebook, Uber, & Netflix
Completely open source
Access to petabytes of data in the Hadoop data lake
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
○ Inline virtual function calls
○ Inline constants
○ Rewrite inner loops
○ Rewrite type-specific branches
No Need to Copy Data: Presto Connectors
GeoSpatial @ Uber
Cities
Trips
Use Cases
GeoSpatial Data
Point
POINT (77.3548351 28.6973627)
● Two Dimensional Point
● Longitude, latitude
Polygon
POLYGON ((36.814155579 -1.3174386070000002, 36.814863682 -1.317545867, 36.814863682 -1.318221605, 36.813973188 -1.317910551, 36.814155579 -1.3174386070000002))
● A collection of Points
● No holes in Polygons
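The two WKT shapes above can be turned into coordinate tuples with a few lines of Python. This is a minimal sketch, not the parser used by Hive or Presto: the helper names `parse_wkt_point` and `parse_wkt_polygon` are hypothetical, and the polygon parser assumes a single outer ring with no holes, as stated on the slide.

```python
import re

# Hypothetical helpers for illustration only; real engines use a full WKT parser.
NUMBER = r"[-+0-9.eE]+"

def parse_wkt_point(wkt):
    """Parse 'POINT (lng lat)' into a (lng, lat) tuple of floats."""
    match = re.fullmatch(rf"\s*POINT\s*\(\s*({NUMBER})\s+({NUMBER})\s*\)\s*", wkt)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))

def parse_wkt_polygon(wkt):
    """Parse 'POLYGON ((lng lat, ...))' with one ring into a list of (lng, lat)."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.*?)\s*\)\)\s*", wkt, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not a WKT polygon: {wkt!r}")
    ring = []
    for pair in match.group(1).split(","):
        # A polygon with holes would contain extra parentheses and fail here.
        lng, lat = pair.split()
        ring.append((float(lng), float(lat)))
    return ring

point = parse_wkt_point("POINT (77.3548351 28.6973627)")
ring = parse_wkt_polygon(
    "POLYGON ((36.814155579 -1.3174386070000002, 36.814863682 -1.317545867, "
    "36.814863682 -1.318221605, 36.813973188 -1.317910551, "
    "36.814155579 -1.3174386070000002))"
)
```

Note that WKT puts longitude first, and that a polygon ring repeats its first vertex at the end to close the shape.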
GeoSpatial Analytics
Get the number of events that happened at each airport:
SELECT airport_code, count(*)
FROM event_table
JOIN airport
ON st_contains(geofence, st_point(location.lng,location.lat))
WHERE datestr = '2017-02-01'
GROUP BY 1
Brute Force Solution
● Run as Hive/MapReduce jobs
● Have to compute st_contains for every (Point, geofence) pair
● Brute-force st_contains cost is linear in the number of Points in the geofence
● A geofence can have a huge number of Points
● A simple query can run for weeks
Time complexity = 2B events x 200 airports = 400B st_contains calls = ~40 weeks
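The brute-force approach can be sketched in Python as follows. The containment test uses the standard ray-casting algorithm as a stand-in, since the slides do not say how st_contains is implemented. The function names are hypothetical. The last lines redo the slide's arithmetic; the implied throughput (on the order of 16,500 calls per second) is derived from the slide's figures and is not stated in the talk.

```python
def point_in_polygon(point, ring):
    """Ray-casting containment test; work grows linearly with the vertex count."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def brute_force_counts(events, geofences):
    """Test every event against every geofence: len(events) * len(geofences) calls."""
    counts = {}
    for point in events:
        for code, ring in geofences.items():
            if point_in_polygon(point, ring):
                counts[code] = counts.get(code, 0) + 1
    return counts

# Arithmetic from the slide: 2B events x 200 airports.
EVENTS = 2_000_000_000
AIRPORTS = 200
total_calls = EVENTS * AIRPORTS            # 400B st_contains calls
seconds_in_40_weeks = 40 * 7 * 24 * 3600
implied_rate = total_calls / seconds_in_40_weeks  # calls per second implied by ~40 weeks
```

Every event pays for every geofence here, even though most events are nowhere near most airports. That wasted work is what a spatial index removes.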
Efficiency: QuadTree
QuadTree for Cities
Hive GeoSpatial Optimizations
● Start Service for building QuadTree Indexes
● User rewrites the query with 'set configuration' and QuadTree UDFs
● During Runtime:
○ Hive Hook detects QuadTree UDFs
○ Service builds the QuadTree and registers it as a temporary Hive UDF
○ Query runs with QuadTree optimization UDFs
Hive Query Rewrite
Query before:
SELECT airport_code, count(*)
FROM event_table
JOIN airport
ON st_contains(simplified_shape, st_point(location.lng,location.lat))
WHERE datestr = '2017-02-01'
GROUP BY 1
Query after:
set hive.geospatial.index.list=[Airports:airport airport_code simplified_shape];
SELECT AirportsContainsFirst(st_point(location.lng,location.lat)), count(*)
FROM event_table
WHERE datestr = '2017-02-01'
GROUP BY 1
GeoSpatial in Hive
● Efficiency: 15X runtime speedup
○ 5h vs. 20min
○ Could we get even faster?
● Reliability: external service dependency
○ Service could go down
○ RPC call timeout
● Usability: user needs to rewrite query
○ Users need to learn how to rewrite it
GeoSpatial in Presto
GeoSpatial in Presto
● Efficiency: query runs faster
○ Presto is much faster than Hive
● Reliability: no external service dependency
○ GeoSpatial Plugin for Presto
○ Unifying indexing stage and query stage
● Usability: users do not need to rewrite queries
○ Presto Optimizer automatically rewrites the user query
using the QuadTree Index
GeoSpatial Plugin for Presto
● Geometry Type
○ serialize/deserialize via Presto standard Slice
● Complete GeoSpatial Functions support
○ ST_Contains, ST_Centroid, ST_Distance, etc.
● Build_geo_index
○ Build QuadTree on the fly
● Geo_contains, Geo_intersects
○ Use QuadTree to filter geofences
○ Run ST_Contains, ST_Intersects for remaining geofences
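The filter-then-refine idea behind Geo_contains can be sketched in Python. This is an illustration under assumptions, not the plugin's code: the Python functions `build_geo_index` and `geo_contains` only borrow the slide's names, their signatures are hypothetical, the QuadTree indexes geofence bounding boxes, and ray casting stands in for the exact ST_Contains check.

```python
def point_in_polygon(point, ring):
    """Exact containment test (ray casting), used only on surviving candidates."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

def bbox(ring):
    """Bounding box (min_x, min_y, max_x, max_y) of a ring."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)

def overlaps(a, b):
    """True if two closed bounding boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class QuadTree:
    def __init__(self, bounds, items, depth, max_depth, capacity):
        self.bounds = bounds
        self.children = []
        # Keep only the geofences whose bounding box touches this cell.
        self.items = [(code, box) for code, box in items if overlaps(box, bounds)]
        if len(self.items) > capacity and depth < max_depth:
            x0, y0, x1, y1 = bounds
            xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
            quads = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                     (x0, ym, xm, y1), (xm, ym, x1, y1)]
            self.children = [QuadTree(q, self.items, depth + 1, max_depth, capacity)
                             for q in quads]

    def candidates(self, point):
        """Codes of geofences that might contain the point (the filter step)."""
        x, y = point
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return []
        for child in self.children:
            cx0, cy0, cx1, cy1 = child.bounds
            if cx0 <= x <= cx1 and cy0 <= y <= cy1:
                return child.candidates(point)
        return [code for code, _ in self.items]

def build_geo_index(geofences, max_depth=6, capacity=2):
    """Build a QuadTree over the bounding boxes of all geofences."""
    boxes = [(code, bbox(ring)) for code, ring in geofences.items()]
    bounds = (min(b[0] for _, b in boxes), min(b[1] for _, b in boxes),
              max(b[2] for _, b in boxes), max(b[3] for _, b in boxes))
    return QuadTree(bounds, boxes, 0, max_depth, capacity)

def geo_contains(index, geofences, point):
    """Filter with the QuadTree, then run the exact test on what remains."""
    return [code for code in index.candidates(point)
            if point_in_polygon(point, geofences[code])]
```

Because the index is built once and shared across all events, each event pays for a short tree descent plus a handful of exact checks instead of one exact check per geofence.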
Presto Optimizer rewrites user query
GeoSpatial in Presto
● Efficiency: 60X runtime speedup
○ 5h vs. 5min
● Reliability: no external service dependency
● Usability: users do not need to rewrite queries
Benchmarks
Presto Ongoing Work
● Presto Elasticsearch Connector
● Multi-tenancy Support
● All-Active Presto across data centers
● Authentication and Authorization
● Highly Available Coordinator
● Caching HDFS for Presto
● Presto on Mesos
Hadoop Infrastructure & Analytics
● HDFS Erasure Coding
● HDFS Tiered Storage
● All-Active Hadoop across data centers
● Hive On Spark
● Spark
● Data Visualization
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be
reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the
use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise
exempt from disclosure under applicable law. All recipients of this document are notified that the information contained
herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any
way disclose this document or any of the enclosed information to any person other than employees of addressee to the
extent necessary for consultations with authorized personnel of Uber.
We are Hiring
https://www.uber.com/careers/list/27366/
Send resumes to:
weiy@uber.com or luoz@uber.com
Interested in learning more about Uber Eng?
Eng.uber.com
Follow us on Twitter:
@UberEng

Presto GeoSpatial @ Strata New York 2017
