Presto: Fast SQL-on-Anything
across Data Lakes, DBMS, and NoSQL data stores
Kamil Bajda-Pawlikowski
Co-founder and CTO
Data Orchestration Summit 2020
What is Presto?
Community-driven open source project
High performance MPP SQL engine
• Interactive ANSI SQL queries
• Proven scalability
• High concurrency
Deploy Anywhere
• Kubernetes
• Cloud (AWS, Azure, GCP)
• On premises
Separation of compute & storage
• Scale storage & compute independently
• SQL-on-anything
• Federated queries
About Starburst
Our Platform
• Enterprise Grade Security
• On-Prem or Cloud
• Rapid Time to Insights
• Low Cost of Ownership
• 24x7 Expert Support
• ANSI SQL MPP Query Engine
• High Concurrency
• Massive Scale
Named Open Source Startup to Watch 2020
600% Growth YoY
100+ Enterprise Customers
NPS Score 80+
Starburst Customers
• Tech
• Retail
• Media & Telco
• Finance & Insurance
• Healthcare & Pharma
• Other
Starburst Platform
Why Delta Lake?
▪ ACID properties over data lake
▪ Open source table format
▪ Stored as Parquet files
▪ Object storage support
▪ Schema evolution
▪ Time travel feature
▪ Metadata & statistics
▪ Data skipping & z-ordering
Native Presto Delta Lake Reader
▪ Supports data skipping & dynamic filtering
▪ Optimizes queries using file statistics
▪ Supports reading the Delta transaction log
▪ Native connector written from scratch
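
To illustrate how file statistics enable data skipping, here is a minimal Python sketch. It is not the connector's implementation. It assumes log entries shaped like the "add" actions of the Delta transaction log (a file path plus a JSON-encoded `stats` field with `minValues` and `maxValues`); the file names, the `event_date` column, and the `files_to_scan` helper are hypothetical. It also ignores "remove" actions and checkpoints, which a real reader has to handle.

```python
import json

# Hypothetical log lines: one commitInfo action and three "add" actions.
LOG_LINES = [
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-0000.parquet", "stats": json.dumps({
        "numRecords": 100,
        "minValues": {"event_date": "2020-01-01"},
        "maxValues": {"event_date": "2020-03-31"}})}}),
    json.dumps({"add": {"path": "part-0001.parquet", "stats": json.dumps({
        "numRecords": 80,
        "minValues": {"event_date": "2020-04-01"},
        "maxValues": {"event_date": "2020-06-30"}})}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),  # no statistics recorded
]


def files_to_scan(log_lines, column, lower_bound):
    """Return paths of files that may contain rows with column >= lower_bound."""
    selected = []
    for line in log_lines:
        add = json.loads(line).get("add")
        if add is None:
            continue  # not a file-level action (for example commitInfo)
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        max_values = stats.get("maxValues", {})
        if column not in max_values:
            selected.append(add["path"])  # no statistics: the file must be read
        elif max_values[column] >= lower_bound:
            selected.append(add["path"])  # value range overlaps the predicate
        # otherwise every row is below the bound, so the file is skipped
    return selected


# ISO-8601 date strings compare correctly as plain strings.
to_read = files_to_scan(LOG_LINES, "event_date", "2020-04-01")
print(to_read)
```

Only files whose maximum value can satisfy the predicate, plus files without statistics, are read; the first file is pruned without opening it.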
Query-time Data Federation
● Single point of access to numerous data sources
● Query Delta Lake and federate with legacy databases as well as many NoSQL data stores
● Enforce table-, column-, and row-level policies to ensure maximum data security
● Mask column data for different groups and users
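
The effect of row-level policies and column masking can be sketched in a few lines of Python. This is only a conceptual illustration, not the platform's policy API: the `POLICIES` table, the group names, the `region` and `email` columns, and the `apply_policies` helper are all hypothetical.

```python
# Hypothetical per-group policies: a row filter plus a set of masked columns.
POLICIES = {
    "analysts": {
        "row_filter": lambda row: row["region"] == "US",
        "masked_columns": {"email"},
    },
    "admins": {
        "row_filter": lambda row: True,
        "masked_columns": set(),
    },
}


def apply_policies(rows, group):
    """Drop rows the group may not see and mask restricted column values."""
    policy = POLICIES[group]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level policy: this group cannot see the row
        visible.append({
            name: ("****" if name in policy["masked_columns"] else value)
            for name, value in row.items()
        })
    return visible


customers = [
    {"id": 1, "region": "US", "email": "a@example.com"},
    {"id": 2, "region": "EU", "email": "b@example.com"},
]
analyst_view = apply_policies(customers, "analysts")
admin_view = apply_policies(customers, "admins")
print(analyst_view)
print(admin_view)
```

The same table yields different results per group: analysts see only US rows with the email masked, while admins see every row unmasked.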
Data Consumption & Analytics
BI Reporting Tools
SQL Query Tools
• Connect using a variety of BI and SQL tools, including Looker, Tableau, Power BI, and DBeaver
• JDBC, ODBC, and many libraries, including Python, R, and Java
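-- Federated query across Delta Lake, Snowflake, and Elasticsearch catalogs
SELECT c.id, COUNT(*), SUM(active_seconds)
FROM delta.iot.events e
JOIN snowflake.sales.customer c ON e.customer_id = c.id
WHERE e.event_date >= current_date
  AND c.region = 'US'
  -- keep only customers that also appear in the web logs since 2020-01-01
  AND c.id IN (
    SELECT l.customer_id
    FROM elastic.web.logs l
    WHERE l.visit_date >= DATE '2020-01-01')
GROUP BY c.id;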
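
From Python, a query like the one above would typically be submitted through a DB-API 2.0 (PEP 249) connection. The sketch below shows that pattern under stated assumptions: the `fetch_rows` helper is hypothetical, and an in-memory `sqlite3` database with a made-up `events` table stands in for a Presto connection so the example runs without a cluster. With a real Presto client, only the connection object and the SQL text would change.

```python
import sqlite3


def fetch_rows(connection, sql):
    """Run a query over a DB-API 2.0 connection and return rows as dicts."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        # cursor.description holds one tuple per column; item 0 is the name.
        columns = [column[0] for column in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in data source (assumption): sqlite3 instead of a Presto cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, active_seconds INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 30), (1, 45), (2, 10)])

rows = fetch_rows(
    conn,
    "SELECT customer_id AS id, COUNT(*) AS event_count, "
    "SUM(active_seconds) AS total_active_seconds "
    "FROM events GROUP BY customer_id ORDER BY customer_id",
)
print(rows)
conn.close()
```

Because the helper relies only on the standard cursor interface, it does not depend on which engine sits behind the connection.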
Thank You
Try Presto: www.starburstdata.com
