Interactively Querying Large-scale Datasets on Amazon S3

Organizations often need to quickly analyze large amounts of data, such as logs generated from a wide variety of sources and in a wide variety of formats. However, traditional approaches require a lot of time and effort to design complex data transformation and loading processes and to configure data warehouses. Using AWS, you can start querying your datasets within minutes. In this session you will learn how to deploy a managed Presto environment in minutes and interactively query log data using standard ANSI SQL. Presto is a popular open-source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR.


  1. 1. Interactively Querying Large-Scale Datasets on Amazon S3 (Presto on EMR). Keith Steward, Ph.D., Specialist (EMR) Solution Architect, AWS. July 13, 2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Agenda • The challenges of using data warehouses, then a data warehouse approach • High-level steps (overview) for querying large-scale data on Amazon S3 • Amazon S3 • Amazon EMR • Presto: history, goals & benefits, architecture • Presto on EMR • Demo – Querying 29 years of U.S. air flights data on S3 by using Presto on EMR
  3. 3. Challenges in using data warehouses. Schema-on-write: data is loaded into a data warehouse schema; significant “time to answer”; cost $$$$. Schema-on-read: data is queried in place; shorter “time to answer”; cost $$.
  4. 4. How to query large-scale datasets on S3? 1. Store your large-scale data in S3. 2. Configure & launch an EMR cluster with Presto. 3. Log in to the EMR cluster. 4. Expose S3 data as a Hive table 5. Issue SQL queries against the Hive table using Presto. 6. Get query results.
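Step 2 above (configure and launch an EMR cluster with Presto) can be sketched as a small Python helper that assembles an `aws emr create-cluster` command line. This is a minimal sketch, not the presenter's actual setup: the cluster name, release label, instance type, instance count, and key pair name are all assumptions, and the Presto application name may differ between EMR releases.

```python
import shlex


def build_create_cluster_command(
    name="presto-demo",          # hypothetical cluster name
    release_label="emr-5.0.0",   # assumed EMR release that bundles Presto
    instance_type="m3.xlarge",   # assumed instance type
    instance_count=3,            # assumed size: 1 master + 2 core nodes
    key_name="my-key-pair",      # hypothetical EC2 key pair, used to log in (step 3)
):
    """Return the argument list for `aws emr create-cluster` with Hive and Presto."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        # Hive is needed to define the external table; Presto runs the queries.
        "--applications", "Name=Hive", "Name=Presto",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--ec2-attributes", f"KeyName={key_name}",
        "--use-default-roles",
    ]


def as_shell(argv):
    """Join an argument list into a single shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(as_shell(build_create_cluster_command()))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` unchanged or printed for review before anything is launched.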
  5. 5. Amazon S3: store anything (object storage); scalable; 99.999999999% durability; effectively infinite inbound bandwidth; extremely low cost ($0.03/GB-month, $30.72/TB-month); data layer for virtually all AWS services.
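The two storage prices on this slide are the same rate in different units. A quick check, assuming 1 TB = 1024 GB (which is what the slide's figures imply) and using the 2016 price quoted above:

```python
GB_PER_TB = 1024            # assumption implied by the slide: 1 TB = 1024 GB
PRICE_PER_GB_MONTH = 0.03   # USD per GB-month, as quoted on the slide


def monthly_storage_cost(size_tb, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return the monthly storage cost in USD for `size_tb` terabytes."""
    return round(size_tb * GB_PER_TB * price_per_gb_month, 2)


if __name__ == "__main__":
    print(monthly_storage_cost(1))    # cost of 1 TB for one month
    print(monthly_storage_cost(10))   # cost of 10 TB for one month
```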
  6. 6. Aggregate all data in S3 as your data lake, surrounded by a collection of the right tools: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm, Amazon S3, Import/Export Snowball.
  7. 7. Exposing large-scale datasets in S3 as Hive tables. S3 bucket with data (s3://flightdelays-kls/csv). Ask Hive to expose S3 data as table:

         hive> CREATE EXTERNAL TABLE airdelays (
                 yr INT,
                 quarter INT,
                 month INT,
                 dayofmonth INT,
                 dayofweek INT,
                 flightdate STRING,
                 uniquecarrier STRING,
                 airlineid INT,
                 . . .
                 div5tailnum STRING
               )
               PARTITIONED BY (year STRING)
               ROW FORMAT DELIMITED
                 FIELDS TERMINATED BY ',' ESCAPED BY '\\'
                 LINES TERMINATED BY '\n'
               LOCATION 's3://flightdelays-kls/csv';

     Hive now knows about table:

         hive> describe airdelays;
         OK
         yr              int
         quarter         int
         month           int
         dayofmonth      int
         dayofweek       int
         flightdate      string
         . . .
         div5wheelsoff   string
         div5tailnum     string
         year            string

         # Partition Information
         # col_name      data_type   comment
         year            string

         Time taken: 0.169 seconds, Fetched: 115 row(s)
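Because the table above is declared with `PARTITIONED BY (year STRING)`, Hive only returns rows from partitions that have been registered in the metastore. The slides do not show that step, so the following is a hedged sketch that generates `ALTER TABLE ... ADD PARTITION` statements. It assumes the data is laid out under `year=YYYY/` prefixes inside the bucket, and the year range is illustrative; neither is confirmed by the source.

```python
def add_partition_statements(table, base_location, years):
    """Build one Hive ALTER TABLE statement per year partition.

    Assumes each partition lives under `<base_location>/year=<YYYY>/`,
    which is a hypothetical layout for illustration.
    """
    base = base_location.rstrip("/")
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}') LOCATION '{base}/year={year}/';"
        for year in years
    ]


if __name__ == "__main__":
    # Illustrative year range only; adjust to the data actually in the bucket.
    for stmt in add_partition_statements(
        "airdelays", "s3://flightdelays-kls/csv", range(1987, 1990)
    ):
        print(stmt)
```

If the prefixes really do follow the `year=YYYY` naming convention, running `MSCK REPAIR TABLE airdelays;` in Hive is an alternative that discovers the partitions automatically.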
  8. 8. Amazon EMR: scalable Hadoop clusters as a service; Hadoop, Hive, Spark, Presto, HBase, etc.; easy to use and fully managed; on-demand, reserved, and spot pricing; HDFS, S3, and Amazon Elastic Block Store (Amazon EBS) file systems; end-to-end security.
  9. 9. EMRFS makes it easier to leverage Amazon S3 Better performance and error handling options Transparent to applications – just read/write to “s3://” Support for Amazon S3 server-side and client-side encryption Faster listing using EMRFS metadata HDFS is still available via local instance storage or Amazon EBS
  10. 10. Amazon S3 as your cluster’s persistent data store Amazon S3 Designed for 99.999999999% durability Separate compute and storage Resize and shut down Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)
  11. 11. Demo: Let’s spin up an EMR cluster (with Presto) …
  12. 12. (History) PB-scale interactive query engine designed by Facebook in 2012. Originally designed for exploring existing Hive tables without triggering slow MapReduce jobs. Open-sourced in late 2013.
  13. 13. (Benefits) In-memory distributed query engine Support standard ANSI-SQL Support rich analytical functions Support wide range of data sources Combine data from multiple sources in single query Response time ranges from seconds to minutes
  14. 14. (Features) High performance: 10x faster than Hive • E.g., Netflix runs 3500+ Presto queries/day on a 25+ PB dataset in S3 with 350 active platform users. Extensibility • Pluggable back ends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, SystemSchema, TPCH • JDBC and ODBC for commercial BI tools and dashboards, such as data visualization • Client protocol: HTTP+JSON, with support for various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, etc.). ANSI SQL • Complex queries, joins, aggregations, various functions (window functions).
  15. 15. High-level architecture A distributed system that runs on a cluster of machines. Components: a coordinator and multiple workers. Queries are submitted from a client, such as the Presto CLI, to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers. https://prestodb.io/overview.html
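The client-to-coordinator exchange described above uses the HTTP+JSON client protocol: the client posts the SQL text to the coordinator, then follows the `nextUri` link in each response until no link is returned, collecting rows as they arrive. The following is a minimal sketch of that loop using only the standard library. The coordinator address, port, and user name are assumptions, and a production client would also handle retries, cancellation, and authentication.

```python
import json
import urllib.request


def http_transport(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response."""
    request = urllib.request.Request(
        url, data=body, headers=headers or {}, method=method
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(sql, coordinator="http://localhost:8889", user="hadoop",
              catalog="hive", schema="default", transport=http_transport):
    """Submit `sql` to a Presto coordinator and return all result rows.

    The coordinator URL and user are assumed defaults, not values from the talk.
    """
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # The initial POST hands the query to the coordinator for planning.
    page = transport("POST", coordinator + "/v1/statement",
                     sql.encode("utf-8"), headers)
    rows = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:       # no further pages: the query has finished
            return rows
        page = transport("GET", next_uri, None, headers)
```

The `transport` parameter is injectable so the paging logic can be exercised without a running cluster.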
  16. 16. Presto architecture
  17. 17. (Why is it so fast?) In-memory parallel queries. Pipelined task execution. Data-local computation with multi-threading. Caching of hot queries and data. Just-in-time compilation of operators to bytecode. SQL optimization. Other optimizations (e.g., predicate pushdown).
  18. 18. Presto: in-memory processing and pipelining
  19. 19. Presto: accessing large-scale datasets in S3 Any table known to the Hive Metastore can be accessed/queried by Presto. Including data in S3 exposed via CREATE EXTERNAL TABLE statements in Hive.
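Once Hive knows about a table, the same table can be queried from the Presto CLI on the cluster. The sketch below assembles a `presto-cli` invocation against the `airdelays` table defined earlier in the deck. The query itself, the year value, and the `default` schema are illustrative assumptions; only the `uniquecarrier` and `year` column names come from the slides.

```python
# Illustrative query: number of flights per carrier for one year partition.
# `year` is the STRING partition column, so the literal is quoted.
FLIGHTS_PER_CARRIER = (
    "SELECT uniquecarrier, count(*) AS flights "
    "FROM airdelays "
    "WHERE year = '2008' "          # hypothetical year; assumes it exists in the data
    "GROUP BY uniquecarrier "
    "ORDER BY flights DESC "
    "LIMIT 10"
)


def presto_cli_command(sql, catalog="hive", schema="default", output_format="CSV"):
    """Return the argument list for a non-interactive presto-cli run."""
    return [
        "presto-cli",
        "--catalog", catalog,       # the Hive connector exposes metastore tables
        "--schema", schema,         # assumed schema; the slides do not name one
        "--output-format", output_format,
        "--execute", sql,
    ]


if __name__ == "__main__":
    print(presto_cli_command(FLIGHTS_PER_CARRIER))
```

Filtering on the partition column lets Presto read only the matching partition rather than scanning every year of data.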
  20. 20. Embedded Mode • Uses Derby • Not recommended for production Hive Metastore deployment modes http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html
  21. 21. Embedded Mode • Uses Derby • Not recommended for production Local Mode • Metastore service in same process as main HiveServer process • Metastore DB runs in separate process http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html Hive Metastore deployment modes
  22. 22. Hive Metastore deployment modes http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html Embedded Mode • Uses Derby • Not recommended for production Local Mode • Metastore service in same process as main HiveServer process • Metastore DB runs in separate process Remote Mode • Metastore service runs in own JVM process • Processes communicate with it via Thrift network API • Metastore service communicates with Metastore DB over JDBC
  23. 23. Supported data sources Currently Presto provides connectors for the following: Hive Cassandra MySQL PostgreSQL Kafka Redis
  24. 24. Common use cases When to use Presto? • Need fast interactive query ability with high concurrency • Need ANSI SQL When might you not want to use Presto? • You focus on batch processing (ETL, enriching, aggregation, etc.) for large data sets. Hive or Spark recommended. • Need to compute (e.g., machine learning, graph algorithms) over the retrieved data. Spark recommended. • Star-schema organization of data. Amazon Redshift data warehouse recommended.
  25. 25. Airpal – a Presto GUI designed & open-sourced by Airbnb Optional access controls for users Search and find tables See metadata, partitions, schemas & sample rows Write queries in an easy-to-read editor Submit queries through a web interface Track query progress Get the results back through the browser as a CSV Create new Hive table based on the results of a query Save queries once written Searchable history of all queries run within the tool
  26. 26. Demo: Let’s now query 29 years’ worth of air-traffic data in S3 using Presto on our EMR cluster…
  27. 27. Summary for Presto on EMR with data in S3 Data in S3 is queryable using Presto on Amazon EMR Presto is easy to deploy on Amazon EMR Presto provides fast ad-hoc queries Supports wide range of data sources In-memory data processing with pipelining Feature-rich Increasing adoption & active community Amazon S3 Amazon EMR
  28. 28. Remember to complete your evaluations! References: http://www.slideshare.net/GuorongLIANG/facebook-presto-presentation ; https://prestodb.io ; https://github.com/airbnb/airpal#airpal ; https://github.com/treasure-data/prestogres . If you want to run this demo later in your own AWS account, go to: http://bit.ly/1Xg0111
  29. 29. Thank you!
