The document discusses geospatial analytics at Uber using Presto. Presto is an interactive SQL query engine that allows for fast querying of large datasets. The document outlines how Presto was optimized for geospatial queries using quadtree indexing to efficiently filter spatial data and speed up queries involving spatial functions like ST_Contains. This improved query performance for geospatial queries in Presto by 60x compared to the previous approach using Hive.
2. Agenda
● Mission
● Uber Business Highlights
● Analytics Infrastructure @ Uber
● Presto
○ Interactive SQL engine for Big Data
● GeoSpatial Analytics
● GeoSpatial Optimizations for Presto
● Ongoing Work
5. Analytics Infrastructure @ Uber
[Architecture diagram. Labels shown: Kafka, Schemaless, MySQL, Postgres, Vertica, Streamific, Sqoop, Raw Data, Raw Tables, Modeled Tables, Hadoop, Hive, Presto, Spark, Samza, Pinot, Flink, MemSQL, Reports, Notebook, Ad Hoc Queries, Real Time Applications, Machine Learning Jobs, Business Intelligence Jobs, Cluster Management, All-Active, Observability, Security, Streaming, Warehouse, Real-time]
6. Scale of Hadoop @ Uber
YARN/HDFS Cluster (per DC)
● 2K+ machines
● 150+ PB storage space
Presto Cluster (per DC)
● 2 clusters
● Hundreds of machines
Applications
● Hive
○ 40K+ queries per day
● Presto
○ 180K+ queries per day
● Spark
○ 100K+ jobs
7. Applications of Hadoop @ Uber
● Marketplace pricing
○ Real-time driver incentives
● Communication platform
○ Driver quality and action platform
○ Rider/driver cohorting
○ Ops, comms, & marketing
● Growth marketing
○ BI dashboard for growth marketing
● Data science
○ Exploratory analytics using notebooks
● Machine learning platform
● Ad-hoc user queries
8. Our Challenges
● Fast growing demand
● Fast growing number of servers & services
● Fast query engine
● Multi-tenant shared infrastructure
○ Resource allocation
○ Bad applications
9. What is Presto: Interactive SQL Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested by Facebook, Uber, & Netflix
Completely open source
Access to petabytes of data in the Hadoop data lake
17. GeoSpatial Data
Point
POINT (77.3548351 28.6973627)
● Two Dimensional Point
● Longitude, latitude
Polygon
POLYGON ((36.814155579 -1.3174386070000002, 36.814863682 -1.317545867, 36.814863682 -1.318221605, 36.813973188 -1.317910551, 36.814155579 -1.3174386070000002))
● A collection of Points
● No holes in Polygons
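The two shapes above are written in Well-Known Text (WKT). As a minimal sketch of how such strings map to in-memory data, the Python below parses a 2D POINT and a single-ring POLYGON (no holes, matching the slide). The function name `parse_wkt` is hypothetical and is not part of Presto or Hive; a real system would use a full geometry library.

```python
import re


def parse_wkt(wkt):
    """Parse a 2D WKT POINT or a single-ring POLYGON into plain tuples.

    Illustrative helper only: it handles just the two shapes on the slide
    and rejects polygons with holes.
    """
    match = re.fullmatch(r"\s*(POINT|POLYGON)\s*\((.*)\)\s*", wkt, flags=re.S)
    if match is None:
        raise ValueError("unsupported WKT: " + wkt)
    kind, body = match.group(1), match.group(2).strip()

    if kind == "POINT":
        # WKT order is "x y", i.e. longitude then latitude.
        lng, lat = (float(value) for value in body.split())
        return ("POINT", (lng, lat))

    # POLYGON: the body is one parenthesised ring of "lng lat" pairs.
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("malformed polygon ring")
    ring_text = body[1:-1]
    if "(" in ring_text or ")" in ring_text:
        raise ValueError("polygons with holes are not supported")
    ring = [
        tuple(float(value) for value in pair.split())
        for pair in ring_text.split(",")
    ]
    return ("POLYGON", ring)


point = parse_wkt("POINT (77.3548351 28.6973627)")
polygon = parse_wkt(
    "POLYGON ((36.814155579 -1.3174386070000002, 36.814863682 -1.317545867, "
    "36.814863682 -1.318221605, 36.813973188 -1.317910551, "
    "36.814155579 -1.3174386070000002))"
)
```

Note that the polygon ring is closed: its first and last coordinate pairs are identical.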
18. GeoSpatial Analytics
Get the number of events that happened at each airport:
SELECT airport_code, count(*)
FROM event_table
JOIN airport
ON st_contains(geofence, st_point(location.lng,location.lat))
WHERE datestr = '2017-02-01'
GROUP BY 1
19. Brute Force Solution
● Run as Hive/MapReduce jobs
● Have to compute st_contains for every (Point, geofence) pair
● Brute-force st_contains cost is linear in the number of Points in the geofence
● A geofence can have a huge number of Points
● A simple query can run for weeks
Cost = 2B events x 200 airports = 400B st_contains calls = ~40 weeks
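To make the cost argument concrete, here is a minimal Python sketch of the brute-force join. It uses ray casting as the containment test, which is one common point-in-polygon technique whose cost is linear in the number of ring vertices; the slides do not say which algorithm st_contains uses, so treat that choice as an assumption. The function names are hypothetical. The last lines redo the slide's arithmetic and derive the per-second rate that "~40 weeks" implies; that rate is computed here, not stated in the source.

```python
def point_in_polygon(point, ring):
    """Ray casting over a closed ring (first vertex repeated at the end).

    Visits every edge once, so the cost grows linearly with len(ring).
    """
    x, y = point
    inside = False
    for i in range(len(ring) - 1):
        (x1, y1), (x2, y2) = ring[i], ring[i + 1]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_events_per_fence(events, fences):
    """Brute force: test every event against every geofence."""
    counts = {}
    calls = 0
    for event in events:
        for name, ring in fences.items():
            calls += 1
            if point_in_polygon(event, ring):
                counts[name] = counts.get(name, 0) + 1
    return counts, calls


# The slide's arithmetic: 2B events x 200 airports.
EVENTS = 2_000_000_000
AIRPORTS = 200
TOTAL_CALLS = EVENTS * AIRPORTS  # 400B containment tests

# Rate implied by "~40 weeks" (derived here, not given in the slides).
SECONDS_IN_40_WEEKS = 40 * 7 * 24 * 3600
IMPLIED_CALLS_PER_SECOND = TOTAL_CALLS / SECONDS_IN_40_WEEKS
```

The nested loop is the whole problem: the number of containment tests is the product of the two table sizes, and each test walks every vertex of the geofence.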
22. Hive GeoSpatial Optimizations
● Start a service for building QuadTree indexes
● User rewrites the query with ‘set configuration’ and QuadTree UDFs
● During runtime:
○ Hive Hook detects QuadTree UDFs
○ Service builds the QuadTree and registers it as a temporary Hive UDF
○ Query runs with QuadTree optimization UDFs
23. Hive Query Rewrite
Query before:
SELECT airport_code, count(*)
FROM event_table
JOIN airport
ON st_contains(simplified_shape, st_point(location.lng,location.lat))
WHERE datestr = '2017-02-01'
GROUP BY 1
Query after:
set hive.geospatial.index.list=[Airports:airport airport_code simplified_shape];
SELECT AirportsContainsFirst(st_point(location.lng,location.lat)), count(*)
FROM event_table
WHERE datestr = '2017-02-01'
GROUP BY 1
24. GeoSpatial in Hive
● Efficiency: 15x runtime speedup
○ 5 h vs. 20 min
○ Could we get even faster?
● Reliability: external service dependency
○ Service could go down
○ RPC calls could time out
● Usability: users need to rewrite queries
○ Users need to learn how to rewrite them
26. GeoSpatial in Presto
● Efficiency: queries run faster
○ Presto is much faster than Hive
● Reliability: no external service dependency
○ GeoSpatial Plugin for Presto
○ Unifies the indexing stage and the query stage
● Usability: users do not need to rewrite queries
○ Presto optimizer automatically rewrites the user query using the QuadTree index
27. GeoSpatial Plugin for Presto
● Geometry Type
○ Serialized/deserialized via Presto's standard Slice
● Complete GeoSpatial function support
○ ST_Contains, ST_Centroid, ST_Distance, etc.
● Build_geo_index
○ Builds a QuadTree on the fly
● Geo_contains, Geo_intersects
○ Use the QuadTree to filter geofences
○ Run ST_Contains, ST_Intersects on the remaining geofences
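The filter-then-verify idea above can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: the class `QuadTree`, the function `contains_first`, the leaf capacity, and the choice to index geofence bounding boxes are all hypothetical, and ray casting stands in for the exact ST_Contains check.

```python
def point_in_polygon(point, ring):
    """Exact check (ray casting) over a closed ring; linear in len(ring)."""
    x, y = point
    inside = False
    for i in range(len(ring) - 1):
        (x1, y1), (x2, y2) = ring[i], ring[i + 1]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def bounding_box(ring):
    xs, ys = [p[0] for p in ring], [p[1] for p in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def _intersects(a, b):
    """Closed-interval overlap test for (xmin, ymin, xmax, ymax) boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def _covers(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


class QuadTree:
    """Indexes geofence bounding boxes; a box is stored in every leaf it overlaps."""

    def __init__(self, bounds, capacity=4, depth=0, max_depth=8):
        self.bounds, self.capacity = bounds, capacity
        self.depth, self.max_depth = depth, max_depth
        self.items = []       # (key, bbox) pairs, only populated on leaves
        self.children = None  # four sub-quadrants once the leaf is split

    def insert(self, key, bbox):
        if not _intersects(self.bounds, bbox):
            return
        if self.children is not None:
            for child in self.children:
                child.insert(key, bbox)
            return
        self.items.append((key, bbox))
        if len(self.items) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        xmin, ymin, xmax, ymax = self.bounds
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        quads = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                 (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
        self.children = [QuadTree(q, self.capacity, self.depth + 1, self.max_depth)
                         for q in quads]
        items, self.items = self.items, []
        for key, bbox in items:
            for child in self.children:
                child.insert(key, bbox)

    def candidates(self, point):
        """Cheap filter: keys whose bounding box covers the point."""
        if not _covers(self.bounds, point):
            return []
        if self.children is None:
            return [key for key, bbox in self.items if _covers(bbox, point)]
        for child in self.children:
            if _covers(child.bounds, point):
                return child.candidates(point)
        return []


def contains_first(index, fences, point):
    """Filter with the QuadTree, then run the exact check on survivors only."""
    for key in index.candidates(point):
        if point_in_polygon(point, fences[key]):
            return key
    return None


# Two toy square geofences with hypothetical keys "A" and "B".
fences = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "B": [(10, 10), (12, 10), (12, 12), (10, 12), (10, 10)],
}
index = QuadTree((0, 0, 16, 16), capacity=1)
for key, ring in fences.items():
    index.insert(key, bounding_box(ring))
```

With the index in place, most events reach only one leaf and a handful of bounding-box comparisons, and the expensive per-vertex check runs only for the few geofences that survive the filter instead of for all of them.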
31. Presto Ongoing Work
● Presto Elasticsearch Connector
● Multi-tenancy Support
● All-Active Presto Across Data Centers
● Authentication and Authorization
● Highly Available Coordinator
● Caching HDFS for Presto
● Presto on Mesos
32. Hadoop Infrastructure & Analytics
● HDFS Erasure Coding
● HDFS Tiered Storage
● All-Active Hadoop Across Data Centers
● Hive on Spark
● Spark
● Data Visualization