Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Zhenxiao Luo
Software Engineer @ Uber
Even Faster:
When Presto Meets Parquet
@ Uber
Mission
Uber Business Highlights
Analytics Infrastructure @ Uber
Presto
Interactive SQL engine for Big Data
Parquet
Column...
Transportation as reliable as running water, everywhere, for everyone
Uber Mission
Uber Stats
6
Continents
73
Countries
450
Cities
12,000
Employees
10+ Million
Avg. Trips/Day
40+ Million
MAU Riders
1.5+ Mi...
Kafka
Analytics Infrastructure @ Uber
Schemaless
MySQL,
Postgres
Vertica
Streamio
Raw
Data
Raw
Tables
Sqoop
Reports
Hadoop...
Parquet @ Uber
Raw Tables
● No preprocessing
● Highly nested
● ~30 minutes ingestion latency
● Huge tables
Modeled Tables
...
Scale of Presto @ Uber
● 2 clusters
○ Application cluster
■ Hundreds of machines
■ 100K queries per day
■ P90: 30s
○ Ad ho...
● Marketplace pricing
○ Real-time driver incentives
● Communication platform
○ Driver quality and action platform
○ Rider/...
What is Presto: Interactive SQL Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested ...
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode ...
Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
...
Limitations
● No fault tolerance
● Joins do not fit in memory
○ Query fails
○ No OutOfMemory in Presto process
○ Try it on...
No Need to Copy Data: Presto Connectors
Parquet: Columnar Storage for Big Data
Parquet Optimizations for Presto
Example Query:
SELECT base.driver_uuid
FROM hdrone.mezzanine_trips
WHERE datestr = '2017-...
Old Parquet Reader
Nested Column Pruning
Columnar Reads
Predicate Pushdown
Dictionary Pushdown
Lazy Reads
Benchmarking Results
Presto Ongoing Work
● GeoSpatial query optimization
● Presto Elasticsearch Connector
● Multi-tenancy Support
● All Active ...
Hadoop Infrastructure & Analytics
● HDFS Erasure Encoding
● HDFS Tiered Storage
● All Active Hadoop Cross Data Centers
● H...
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be...
Upcoming SlideShare
Loading in …5
×

Presto @ Uber Hadoop summit2017

1,659 views

Published on

Presto @ Uber Hadoop summit2017

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Hi, is this parquet reader integrated into presto already? The blog points to an already integrated reader -> https://prestodb.io/docs/current/release/release-0.138.html
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • What is streamio here?
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Presto @ Uber Hadoop summit2017

  1. 1. Zhenxiao Luo Software Engineer @ Uber Even Faster: When Presto Meets Parquet @ Uber
  2. 2. Mission Uber Business Highlights Analytics Infrastructure @ Uber Presto Interactive SQL engine for Big Data Parquet Columnar Storage for Big Data Parquet Optimizations for Presto Ongoing Work Agenda
  3. 3. Transportation as reliable as running water, everywhere, for everyone Uber Mission
  4. 4. Uber Stats 6 Continents 73 Countries 450 Cities 12,000 Employees 10+ Million Avg. Trips/Day 40+ Million MAU Riders 1.5+ Million MAU Drivers
  5. 5. Kafka Analytics Infrastructure @ Uber Schemaless MySQL, Postgres Vertica Streamio Raw Data Raw Tables Sqoop Reports Hadoop Hive Presto Spark Notebook Ad Hoc Queries Real Time Applications Machine Learning Jobs Business Intelligence Jobs Cluster Management All-Active Observability Security Vertica Samza Pinot Flink MemSQL Modeled Tables Streaming Warehouse Real-time
  6. 6. Parquet @ Uber Raw Tables ● No preprocessing ● Highly nested ● ~30 minutes ingestion latency ● Huge tables Modeled Tables ● Preprocessing via Hive ETL ● Flattened ● ~12 hours ingestion latency
  7. 7. Scale of Presto @ Uber ● 2 clusters ○ Application cluster ■ Hundreds of machines ■ 100K queries per day ■ P90: 30s ○ Ad hoc cluster ■ Hundreds of machines ■ 20K queries per day ■ P90: 60s ● Access to both raw and model tables ○ 5 petabytes of data ● Total 120K+ queries per day
  8. 8. ● Marketplace pricing ○ Real-time driver incentives ● Communication platform ○ Driver quality and action platform ○ Rider/driver cohorting ○ Ops, comms, & marketing ● Growth marketing ○ BI dashboard for growth marketing ● Data science ○ Exploratory analytics using notebooks ● Data quality ○ Freshness and quality check ● Ad hoc queries Applications of Presto @ Uber
  9. 9. What is Presto: Interactive SQL Engine for Big Data Interactive query speeds Horizontally scalable ANSI SQL Battle-tested by Facebook, Uber, & Netflix Completely open source Access to petabytes of data in the Hadoop data lake
  10. 10. How Presto Works
  11. 11. Why Presto is Fast ● Data in memory during execution ● Pipelining and streaming ● Columnar storage & execution ● Bytecode generation ○ Inline virtual function calls ○ Inline constants ○ Rewrite inner loops ○ Rewrite type-specific branches
  12. 12. Resource Management ● Presto has its own resource manager ○ Not on YARN ○ Not on Mesos ● CPU Management ○ Priority queues ○ Short running queries higher priority ● Memory Management ○ Max memory per query per node ○ If query exceeds max memory limit, query fails ○ No OutOfMemory in Presto process
  13. 13. Limitations ● No fault tolerance ● Joins do not fit in memory ○ Query fails ○ No OutOfMemory in Presto process ○ Try it on Hive ● Coordinator is a single point of failure
  14. 14. No Need to Copy Data: Presto Connectors
  15. 15. Parquet: Columnar Storage for Big Data
  16. 16. Parquet Optimizations for Presto Example Query: SELECT base.driver_uuid FROM hdrone.mezzanine_trips WHERE datestr = '2017-03-02' AND base.city_id in (12) Data: ● Up to 15 levels of Nesting ● Up to 80 fields inside each Struct ● Fields are added/deleted/updated inside Struct
  17. 17. Old Parquet Reader
  18. 18. Nested Column Pruning
  19. 19. Columnar Reads
  20. 20. Predicate Pushdown
  21. 21. Dictionary Pushdown
  22. 22. Lazy Reads
  23. 23. Benchmarking Results
  24. 24. Presto Ongoing Work ● GeoSpatial query optimization ● Presto Elasticsearch Connector ● Multi-tenancy Support ● All Active Presto Cross Data Centers ● Authentication and Authorization ● High Available Coordinator
  25. 25. Hadoop Infrastructure & Analytics ● HDFS Erasure Encoding ● HDFS Tiered Storage ● All Active Hadoop Cross Data Centers ● Hive On Spark ● Spark ● Data Visualization
  26. 26. Thank you Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. We are Hiring https://www.uber.com/careers/list/27366/ Send resumes to: abhik@uber.com or luoz@uber.com Interested in learning more about Uber Eng? Eng.uber.com Follow us on Twitter: @UberEng

×