Presto GeoSpatial @ Strata New York 2017

GeoSpatial Big Data Analysis @ Uber

Published in: Internet

  1. Geospatial big data analysis @ Uber
     Wei Yan and Zhenxiao Luo, Software Engineers @ Uber
  2. Agenda
     ● Mission
     ● Uber Business Highlights
     ● Analytics Infrastructure @ Uber
     ● Presto: Interactive SQL engine for Big Data
     ● GeoSpatial Analytics
     ● GeoSpatial Optimizations for Presto
     ● Ongoing Work
  3. Uber Mission
     Transportation as reliable as running water, everywhere, for everyone
  4. Uber Stats
     ● 6 Continents
     ● 73 Countries
     ● 633 Cities
     ● 23,000 Employees
     ● 10+ Million Avg. Trips/Day
     ● 40+ Million MAU Riders
     ● 1.5+ Million MAU Drivers
  5. Analytics Infrastructure @ Uber
     [Architecture diagram. Labels: Kafka, Schemaless, MySQL, Postgres, Vertica, Streamific, Sqoop, Hadoop, Hive, Presto, Spark, Notebook, Samza, Pinot, Flink, MemSQL; Raw Data, Raw Tables, Modeled Tables; Streaming, Warehouse, Real-time; Reports, Ad Hoc Queries, Real Time Applications, Machine Learning Jobs, Business Intelligence Jobs; Cluster Management, All-Active, Observability, Security]
  6. Scale of Hadoop @ Uber
     YARN/HDFS Cluster (per DC)
     ● 2K+ machines
     ● 150+ PB storage space
     Presto Cluster (per DC)
     ● 2 clusters
     ● Hundreds of machines
     Applications
     ● Hive
       ○ 40K+ queries per day
     ● Presto
       ○ 180K+ queries per day
     ● Spark
       ○ 100K+ jobs
  7. Applications of Hadoop @ Uber
     ● Marketplace pricing
       ○ Real-time driver incentives
     ● Communication platform
       ○ Driver quality and action platform
       ○ Rider/driver cohorting
       ○ Ops, comms, & marketing
     ● Growth marketing
       ○ BI dashboard for growth marketing
     ● Data science
       ○ Exploratory analytics using notebooks
     ● Machine learning platform
     ● Ad-hoc user queries
  8. Our Challenges
     ● Fast-growing demand
     ● Fast-growing number of servers & services
     ● Fast query engine
     ● Multi-tenant shared infrastructure
       ○ Resource allocation
       ○ Bad applications
  9. What is Presto: Interactive SQL Engine for Big Data
     ● Interactive query speeds
     ● Horizontally scalable
     ● ANSI SQL
     ● Battle-tested by Facebook, Uber, & Netflix
     ● Completely open source
     ● Access to petabytes of data in the Hadoop data lake
  10. How Presto Works
  11. Why Presto is Fast
      ● Data in memory during execution
      ● Pipelining and streaming
      ● Columnar storage & execution
      ● Bytecode generation
        ○ Inline virtual function calls
        ○ Inline constants
        ○ Rewrite inner loops
        ○ Rewrite type-specific branches
  12. No Need to Copy Data: Presto Connectors
  13. GeoSpatial @ Uber
  14. Cities
  15. Trips
  16. Use Cases
  17. GeoSpatial Data
      Point
        POINT (77.3548351 28.6973627)
      ● Two-dimensional Point
      ● Longitude, latitude
      Polygon
        POLYGON ((36.814155579 -1.3174386070000002, 36.814863682 -1.317545867, 36.814863682 -1.318221605, 36.813973188 -1.317910551, 36.814155579 -1.3174386070000002))
      ● A collection of Points
      ● No holes in Polygons
  18. GeoSpatial Analytics
      Get the number of events that happened at each airport:
        SELECT airport_code, count(*)
        FROM event_table JOIN airport
        ON st_contains(geofence, st_point(location.lng,location.lat))
        WHERE datestr = '2017-02-01'
        GROUP BY 1
  19. Brute Force Solution
      ● Run as Hive/MapReduce jobs
      ● st_contains has to be computed for each Point and geofence pair
      ● The cost of a brute-force st_contains is linear in the number of Points in the geofence
      ● Geofences have a huge number of Points
      ● A simple query runs for weeks
      Cost: 2B events x 200 airports = 400B st_contains calls, roughly 40 weeks
  20. Efficiency: QuadTree
  21. QuadTree for Cities
  22. Hive GeoSpatial Optimizations
      ● Start a service for building QuadTree indexes
      ● Users rewrite the query with 'set configuration' and QuadTree UDFs
      ● At runtime:
        ○ A Hive hook detects the QuadTree UDFs
        ○ The service builds the QuadTree and registers it as a temporary Hive UDF
        ○ The query runs with the QuadTree optimization UDFs
  23. Hive Query Rewrite
      Query before:
        SELECT airport_code, count(*)
        FROM event_table JOIN airport
        ON st_contains(simplified_shape, st_point(location.lng,location.lat))
        WHERE datestr = '2017-02-01'
        GROUP BY 1
      Query after:
        set hive.geospatial.index.list=[Airports:airport airport_code simplified_shape];
        SELECT AirportsContainsFirst(st_point(location.lng,location.lat)), count(*)
        FROM event_table
        WHERE datestr = '2017-02-01'
        GROUP BY 1
  24. GeoSpatial in Hive
      ● Efficiency: 15X runtime speedup
        ○ 5 h vs. 20 min
        ○ Could we get even faster?
      ● Reliability: external service dependency
        ○ The service could go down
        ○ RPC calls can time out
      ● Usability: users need to rewrite the query
        ○ Users need to learn how to rewrite it
  25. GeoSpatial in Presto
  26. GeoSpatial in Presto
      ● Efficiency: queries run faster
        ○ Presto is much faster than Hive
      ● Reliability: no external service dependency
        ○ GeoSpatial Plugin for Presto
        ○ Unifies the indexing stage and the query stage
      ● Usability: users do not need to rewrite queries
        ○ The Presto optimizer automatically rewrites the user query to use the QuadTree index
  27. GeoSpatial Plugin for Presto
      ● Geometry type
        ○ Serialized/deserialized via Presto's standard Slice
      ● Complete GeoSpatial function support
        ○ ST_Contains, ST_Centroid, ST_Distance, etc.
      ● Build_geo_index
        ○ Builds the QuadTree on the fly
      ● Geo_contains, Geo_intersects
        ○ Use the QuadTree to filter geofences
        ○ Run ST_Contains, ST_Intersects on the remaining geofences
  28. Presto Optimizer rewrites user query
  29. GeoSpatial in Presto
      ● Efficiency: 60X runtime speedup
        ○ 5 h vs. 5 min
      ● Reliability: no external service dependency
      ● Usability: users do not need to rewrite queries
  30. Benchmarks
  31. Presto Ongoing Work
      ● Presto Elasticsearch Connector
      ● Multi-tenancy Support
      ● All-Active Presto Across Data Centers
      ● Authentication and Authorization
      ● Highly Available Coordinator
      ● Caching HDFS for Presto
      ● Presto on Mesos
  32. Hadoop Infrastructure & Analytics
      ● HDFS Erasure Coding
      ● HDFS Tiered Storage
      ● All-Active Hadoop Across Data Centers
      ● Hive on Spark
      ● Spark
      ● Data Visualization
  33. Thank you
      We are Hiring: https://www.uber.com/careers/list/27366/
      Send resumes to: weiy@uber.com or luoz@uber.com
      Interested in learning more about Uber Eng? Eng.uber.com
      Follow us on Twitter: @UberEng
      Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.
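
The cost estimate on slide 19 rests on a brute-force st_contains being linear in the number of geofence vertices. As a rough illustration of why, here is a minimal ray-casting point-in-polygon test in Python. It is a sketch, not the Hive or Presto implementation: the function name `contains` and the inside test point are made up for this example, while the polygon and the POINT are the ones shown on slide 17 and the 2B x 200 figures come from slide 19.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), as on slide 17


def contains(ring: List[Point], p: Point) -> bool:
    """Ray casting: cast a horizontal ray from p toward +x and count edge crossings.

    Visits every edge once, so the cost is linear in the number of vertices.
    Hypothetical helper for illustration; holes are not handled.
    """
    x, y = p
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Polygon from slide 17 (closed ring: first vertex repeated at the end).
polygon = [
    (36.814155579, -1.3174386070000002),
    (36.814863682, -1.317545867),
    (36.814863682, -1.318221605),
    (36.813973188, -1.317910551),
    (36.814155579, -1.3174386070000002),
]

print(contains(polygon, (36.8145, -1.3178)))         # made-up point inside the ring
print(contains(polygon, (77.3548351, 28.6973627)))   # POINT from slide 17, far away

# Number of st_contains calls in the brute-force join from slide 19.
events = 2_000_000_000
airports = 200
total_calls = events * airports
print(total_calls)
```

Every one of those calls walks the full vertex list of a geofence, which is why the brute-force join scales so poorly when geofences have many points.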

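
Slides 20 to 27 describe using a QuadTree to filter geofences before running the exact ST_Contains check on whatever remains. The sketch below shows that filter-then-refine idea in Python under simplifying assumptions: the index stores geofence bounding boxes, and the class `QuadTree`, the helper `find_containing_fences`, and the three toy fences are all hypothetical. The slides do not show how Build_geo_index or Geo_contains are implemented, so this is not a description of the plugin's internals.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bbox(ring: List[Point]) -> Box:
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a: Box, b: Box) -> bool:
    """True if the boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def box_has_point(b: Box, p: Point) -> bool:
    return b[0] <= p[0] <= b[2] and b[1] <= p[1] <= b[3]


def contains(ring: List[Point], p: Point) -> bool:
    """Exact check by ray casting; linear in the number of vertices."""
    x, y = p
    inside = False
    for i in range(len(ring)):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


class QuadTree:
    """Hypothetical index over fence bounding boxes; leaves split when over capacity."""

    def __init__(self, bounds: Box, capacity: int = 2, depth: int = 0, max_depth: int = 6):
        self.bounds, self.capacity = bounds, capacity
        self.depth, self.max_depth = depth, max_depth
        self.items: List[Tuple[str, Box]] = []
        self.children: List["QuadTree"] = []

    def insert(self, key: str, box: Box) -> None:
        if not overlaps(self.bounds, box):
            return
        if self.children:
            for child in self.children:
                child.insert(key, box)
            return
        self.items.append((key, box))
        if len(self.items) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self) -> None:
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quads = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [QuadTree(q, self.capacity, self.depth + 1, self.max_depth) for q in quads]
        items, self.items = self.items, []
        for key, box in items:  # push stored boxes down into every quadrant they overlap
            for child in self.children:
                child.insert(key, box)

    def candidates(self, p: Point) -> Set[str]:
        """Filter step: fences whose bounding box covers p."""
        if not box_has_point(self.bounds, p):
            return set()
        if not self.children:
            return {key for key, box in self.items if box_has_point(box, p)}
        found: Set[str] = set()
        for child in self.children:
            found |= child.candidates(p)
        return found


def find_containing_fences(tree: QuadTree, fences: Dict[str, List[Point]], p: Point) -> List[str]:
    """Refine step: run the exact test only on the candidates the tree returns."""
    return sorted(key for key in tree.candidates(p) if contains(fences[key], p))


# Toy fences (made up): two squares and a triangle whose bounding box is larger than its area.
fences = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(5, 5), (7, 5), (7, 7), (5, 7)],
    "C": [(0, 0), (4, 0), (0, 4)],
}
tree = QuadTree((0, 0, 8, 8))
for key, ring in fences.items():
    tree.insert(key, bbox(ring))

print(find_containing_fences(tree, fences, (1, 1)))  # ['A', 'C']
print(find_containing_fences(tree, fences, (3, 3)))  # [] (C passes the filter, fails the exact test)
```

The point of the index is that most fences are discarded by cheap box comparisons, so the vertex-by-vertex test runs only on a handful of candidates per event instead of on every fence.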