Query Anything - Data Engineer’s perspective
Kamil Bajda-Pawlikowski
Co-founder / CTO
Martin Traverso
Creator of Presto
@prestosql @starburstdata
Data Orchestration Summit
Nov 2019 @ Mountain View
Why Presto?
Community-driven open source project
High performance ANSI SQL engine
• Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute and storage
• Scale storage and compute independently
• No ETL or data integration necessary to get to insights
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
Built for Performance
● MPP-style pipelined in-memory execution
● Multi-threaded multi-core execution
● Columnar and vectorized data processing
● Runtime query bytecode compilation
● Memory efficient data structures
● Optimized readers for columnar formats (ORC and Parquet)
● Predicate and column projection pushdown
● Cost-Based Optimizer
Presto: SQL-on-Anything
Deploy Anywhere, Query Anything
Example - Join multiple sources
SELECT
    country,
    approx_percentile(date_diff('year', birthdate, now()), array[0.25, 0.5, 0.75])
FROM
    elasticsearch.default."movies: overview:space~ +fiction" movies
    JOIN hive.default.views USING (movie_id)
    JOIN mysql.default.users USING (user_id)
GROUP BY ROLLUP(country)
Result: per-country age distribution of people who watched space fiction movies
Example - Join historical with recent data
CREATE VIEW visits AS
    TABLE hive.visits_historical
    UNION ALL
    TABLE mysql.visits_recent;

SELECT city, count(*) total
FROM visits
GROUP BY city
ORDER BY total DESC
Community
See more at our Wiki
Presto Software Foundation
“An independent, non-profit organization with the mission of supporting a community of passionate users and developers devoted to the advancement of the Presto distributed SQL query engine for big data.”
“It is dedicated to preserving the vision of high quality, performant, and dependable software.”
“Ensuring the project remains open, collaborative and independent for decades to come.”
Presto Community
● GitHub: https://github.com/prestosql
● Website: https://prestosql.io
● Blog: https://prestosql.io/blog
● Twitter: @prestosql
● Slack: https://prestosql.io/slack.html
○ #troubleshooting channel
○ #dev channel
Recent Improvements (last ~10 months)
● FETCH FIRST … WITH TIES syntax
● OFFSET syntax
● COMMENT ON <table> IS …
● [LEFT/RIGHT/FULL] JOIN LATERAL (…) ON
● IGNORE NULLS for window functions
● .* for ROW expressions
● Pass-through security (client-provided credentials)
● Impersonation for Hive Metastore
● Kerberos security improvements
● Support for Hadoop KMS
● Role-based security
● Secure query results in client API
● Current user security mode for views
● Support for Azure Data Lake
● Hive Bucketing V2
● Docker image
● Spill-to-disk improvements
● CLI output formats
● Syntax highlighting in CLI
● UUID type and functions
● format(), combinations() functions
● ORC bloom filters (non-legacy)
● Connector-provided view definitions
● Elasticsearch Connector
● Google Sheets Connector
● Amazon Kinesis Connector
● Apache Phoenix Connector
● LZ4/ZSTD support for ORC/Parquet
● More type mappings for various connectors
● Performance improvements for GCS and S3
● Performance improvements for UNNEST
… and more! https://prestosql.io/docs/current/release.htm
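Two of the syntax additions listed above, OFFSET and FETCH FIRST … WITH TIES, are easiest to grasp through their row-selection semantics. The following is a minimal pure-Python sketch of that behaviour, not Presto code: the function name `offset_fetch_with_ties` and the sample rows are hypothetical, and it assumes rows are ordered by a single sort key in descending order.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def offset_fetch_with_ties(
    rows: Sequence[T], key: Callable[[T], object], offset: int, count: int
) -> List[T]:
    """Emulate: ORDER BY key DESC OFFSET offset FETCH FIRST count ROWS WITH TIES."""
    ordered = sorted(rows, key=key, reverse=True)  # ORDER BY key DESC
    remaining = ordered[offset:]                   # OFFSET skips the leading rows
    if count <= 0 or not remaining:
        return []
    result = list(remaining[:count])               # FETCH FIRST count ROWS
    last_key = key(result[-1])
    # WITH TIES: keep following rows that share the sort key of the last kept row.
    for row in remaining[count:]:
        if key(row) != last_key:
            break
        result.append(row)
    return result


# Hypothetical sample data: (name, score)
scores = [("a", 10), ("b", 9), ("c", 9), ("d", 7)]
top = offset_fetch_with_ties(scores, key=lambda r: r[1], offset=1, count=1)
```

Here the offset skips the top-scoring row, and the single requested row is extended by the row that ties with it, so two rows come back even though only one was requested.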
Starburst
© 2019
Enterprise edition
Founded by Presto committers:
● Many years of contributions to Presto
● Presto distribution for on-prem and cloud environments
● Supporting large customers in production
● Enterprise subscription add-ons (ODBC, Ranger, Sentry, Oracle, Teradata, K8S)
Notable features contributed:
● ANSI SQL syntax enhancements
● Execution engine improvements
● Security integrations
● Spill to disk
● Cost-Based Optimizer
https://www.starburstdata.com/presto-enterprise/
Starburst: SQL on Anything, Anywhere
Data Orchestration with caching, even with remote data
A dozen more orchestrated cloud data sources
Available Soon: Starburst Presto + Alluxio on AWS
▪ AWS AMI pre-configured to speed up Presto queries using Alluxio caching
▪ Start in minutes: AWS CloudFormation Template to create a Presto Alluxio cluster
▪ Seamless Hive Metastore / AWS Glue integration, no location / path changes needed
▪ Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
Administrative challenges
● Configuring and managing clusters
● Autotuning properties based on the hardware provisioned
● High Availability for Presto Coordinator
● Scaling cluster elastically based on query load
● Gracefully decommissioning Presto Workers to avoid killing queries
● Monitoring of hardware and software layers
https://www.starburstdata.com/technical-blog/presto-on-kubernetes/
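For the graceful decommissioning item, Presto workers support a graceful shutdown: a PUT request to the worker's `/v1/info/state` endpoint with the JSON body `"SHUTTING_DOWN"` asks the worker to stop accepting new tasks and exit once active tasks have finished. The sketch below only builds that request with Python's standard library and does not send it. The worker URL and the function name are hypothetical placeholders, and depending on the Presto version and security setup, additional headers or authentication may be required.

```python
import json
import urllib.request


def build_graceful_shutdown_request(worker_url: str) -> urllib.request.Request:
    """Build (but do not send) a graceful-shutdown request for one worker."""
    body = json.dumps("SHUTTING_DOWN").encode("utf-8")  # JSON string literal
    return urllib.request.Request(
        url=worker_url.rstrip("/") + "/v1/info/state",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical worker address, used for illustration only.
request = build_graceful_shutdown_request("http://presto-worker.example:8080")
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch safe to run and makes it easy to issue the call per worker before a scale-down removes that pod.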
https://docs.starburstdata.com/latest/kubernetes.html
Presto on Kubernetes (K8S)
[Architecture diagram. Components shown: Presto Operator (K8s Operator), Presto Coordinator Pod, Presto Worker Pods, Horizontal Pod Autoscaler (HPA), Presto Service, Hive Metastore Service Pod, Hadoop / Hive, RDBMS]
● Red Hat OpenShift
● Google (GKE)
● Azure (AKS)
● Amazon (EKS)
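The Horizontal Pod Autoscaler named in the architecture above resizes a pod set from an observed metric. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Below is a sketch of that rule in Python applied to a worker pool. The CPU figures, the replica bounds, and the function name are illustrative assumptions, and the 10% tolerance is assumed to match the Kubernetes default rather than taken from the slides.

```python
import math


def desired_workers(
    current: int,
    current_metric: float,
    target_metric: float,
    min_workers: int = 1,
    max_workers: int = 100,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling rule, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    # Within the tolerance band no scaling happens, which avoids flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    return max(min_workers, min(max_workers, desired))


# Hypothetical numbers: 4 workers, 60% CPU target.
scaled_up = desired_workers(current=4, current_metric=90.0, target_metric=60.0)
scaled_down = desired_workers(
    current=4, current_metric=30.0, target_metric=60.0, min_workers=3
)
```

With these assumed inputs, load at 1.5 times the target grows the pool from 4 to 6 workers, while load at half the target would shrink it to 2 but is held at the configured minimum of 3.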
https://www.starburstdata.com/2019-nyc-presto-summit/
Thank You!
Twitter: @starburstdata @prestosql
Blog: www.starburstdata.com/technical-blog/
Newsletter: www.starburstdata.com/newsletter