Presto: Query Anything - Data Engineer’s perspective

Query Anything - Data Engineer’s perspective
Kamil Bajda-Pawlikowski, Co-founder / CTO
Martin Traverso, Creator of Presto
@prestosql @starburstdata
Data Orchestration Summit
Nov 2019 @ Mountain View

Why Presto?
Community-driven open source project
High performance ANSI SQL engine
• Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute and storage
• Scale storage and compute independently
• No ETL or data integration necessary to get to insights
• SQL-on-anything
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in

Built for Performance
● MPP-style pipelined in-memory execution
● Multi-threaded multi-core execution
● Columnar and vectorized data processing
● Runtime query bytecode compilation
● Memory efficient data structures
● Optimized readers for columnar formats (ORC and Parquet)
● Predicate and column projection pushdown
● Cost-Based Optimizer
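A minimal sketch of how the Cost-Based Optimizer can be fed and inspected from SQL. The table name hive.default.views is only illustrative (borrowed from the join example in this deck), the numeric movie_id filter is an assumption about the column type, and whether ANALYZE is available depends on the connector.

```sql
-- Assumption: hive.default.views is an illustrative table; substitute your own.

-- Collect table and column statistics that the cost-based optimizer relies on.
ANALYZE hive.default.views;

-- Inspect the statistics currently known for the table.
SHOW STATS FOR hive.default.views;

-- Print the query plan to see how the filter and the selected columns
-- are handled by the planner (assumes movie_id is numeric).
EXPLAIN
SELECT movie_id, count(*) AS view_count
FROM hive.default.views
WHERE movie_id = 42
GROUP BY movie_id;
```

Comparing the EXPLAIN output before and after ANALYZE is one way to see whether statistics change the chosen plan.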

Presto: SQL-on-Anything
Deploy Anywhere, Query Anything

Example - Join multiple sources
SELECT
    country,
    approx_percentile(date_diff('year', birthdate, now()), array[0.25, 0.5, 0.75])
FROM
    elasticsearch.default."movies: overview:space~ +fiction" movies
    JOIN hive.default.views USING (movie_id)
    JOIN mysql.default.users USING (user_id)
GROUP BY ROLLUP(country)
Per-country age distribution of people who watched space fiction movies

Example - Join historical with recent data
CREATE VIEW visits AS
    TABLE hive.visits_historical
    UNION ALL
    TABLE mysql.visits_recent;

SELECT city, count(*) total
FROM visits
GROUP BY city
ORDER BY total DESC;

Community
See more at our Wiki

Presto Software Foundation
“An independent, non-profit organization with the mission of supporting a community
of passionate users and developers devoted to the advancement of the Presto
distributed SQL query engine for big data.”
“It is dedicated to preserving the vision of high quality, performant, and dependable
software.”
“Ensuring the project remains open, collaborative and independent for decades to
come.”

Presto Community
● GitHub: https://github.com/prestosql
● Website: https://prestosql.io
● Blog: https://prestosql.io/blog
● Twitter: @prestosql
● Slack: https://prestosql.io/slack.html
○ #troubleshooting channel
○ #dev channel

Recent Improvements (last ~10 months)
● FETCH FIRST … WITH TIES syntax
● OFFSET syntax
● COMMENT ON <table> IS …
● [LEFT/RIGHT/FULL] JOIN LATERAL (…) ON
● IGNORE NULLS for window functions
● .* for ROW expressions
● Pass-through security (client-provided credentials)
● Impersonation for Hive Metastore
● Kerberos security improvements
● Support for Hadoop KMS
● Role-based security
● Secure query results in client API
● Current user security mode for views
● Support for Azure Data Lake
● Hive Bucketing V2
● Docker image
● Spill-to-disk improvements
● CLI output formats
● Syntax highlighting in CLI
● UUID type and functions
● format(), combinations() functions
● ORC bloom filters (non-legacy)
● Connector-provided view definitions
● Elasticsearch Connector
● Google Sheets Connector
● Amazon Kinesis Connector
● Apache Phoenix Connector
● LZ4/ZSTD support for ORC/Parquet
● More type mappings for various connectors
● Performance improvements for GCS and S3
● Performance improvements for UNNEST
… and more! https://prestosql.io/docs/current/release.htm
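A few of the syntax additions listed above, sketched as standalone statements. All table and column names here are made up for illustration, inline VALUES are used so most statements do not depend on any catalog, and COMMENT ON TABLE assumes a connector that supports table comments.

```sql
-- FETCH FIRST ... WITH TIES: keep the top 2 rows plus any rows that tie
-- with the last kept row on the ORDER BY key.
WITH scores(player, score) AS (
    VALUES ('ann', 90), ('bob', 85), ('cid', 85), ('dee', 70)
)
SELECT player, score
FROM scores
ORDER BY score DESC
FETCH FIRST 2 ROWS WITH TIES;

-- OFFSET: skip the first row of the ordered result.
WITH scores(player, score) AS (
    VALUES ('ann', 90), ('bob', 85), ('cid', 85), ('dee', 70)
)
SELECT player, score
FROM scores
ORDER BY score DESC
OFFSET 1 ROWS;

-- IGNORE NULLS for window functions: carry the last non-null reading forward.
WITH readings(ts, temp) AS (
    VALUES (1, 20), (2, NULL), (3, 22)
)
SELECT ts, last_value(temp) IGNORE NULLS OVER (ORDER BY ts) AS last_known_temp
FROM readings;

-- UUID, format() and combinations() functions.
SELECT
    uuid() AS id,
    format('%s watched %d movies', 'ann', 3) AS summary,
    combinations(ARRAY['a', 'b', 'c'], 2) AS pairs;

-- COMMENT ON TABLE (hypothetical table name; requires connector support).
COMMENT ON TABLE hive.default.views IS 'One row per movie view event';
```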

Enterprise edition
© 2019
Founded by Presto committers:
● Many years of contributions to Presto
● Presto distro for on-prem and cloud environments
● Supporting large customers in production
● Enterprise subscription add-ons (ODBC, Ranger, Sentry, Oracle, Teradata, K8S)
Notable features contributed:
● ANSI SQL syntax enhancements
● Execution engine improvements
● Security integrations
● Spill to disk
● Cost-Based Optimizer
https://www.starburstdata.com/presto-enterprise/

Starburst: SQL on Anything, Anywhere
Data Orchestration with caching, even with remote data
A dozen more orchestrated cloud data sources

Available Soon: Starburst Presto + Alluxio on AWS
▪ AWS AMI pre-configured to speed up Presto queries using Alluxio caching
▪ Start in minutes: AWS CloudFormation Template to create a Presto Alluxio cluster
▪ Seamless Hive Metastore / AWS Glue integration, no location / path changes needed
▪ Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/

Administrative challenges
● Configuring and managing clusters
● Autotuning properties based on the hardware provisioned
● High Availability for Presto Coordinator
● Scaling cluster elastically based on query load
● Gracefully decommissioning Presto Workers to avoid killing queries
● Monitoring of hardware and software layers
https://www.starburstdata.com/technical-blog/presto-on-kubernetes/

Presto on Kubernetes (K8S)
https://docs.starburstdata.com/latest/kubernetes.html
[Diagram components: Presto Operator (K8s Operator), Presto Service, Presto Coordinator Pod, Presto Worker Pods, Horizontal Pod Autoscaler (HPA), Hive Metastore Service Pod, Hadoop / Hive, RDBMS]
● RedHat OpenShift
● Google (GKE)
● Azure (AKS)
● Amazon (EKS)

https://www.starburstdata.com/2019-nyc-presto-summit/

Thank You!
Twitter: @starburstdata @prestosql
Blog: www.starburstdata.com/technical-blog/
Newsletter: www.starburstdata.com/newsletter
