Presto for apps deck varada prestoconf

David Krakov, co-founder and CTO at Varada, presents how Varada used Presto to make interactive analytics on the data lake possible.

Presto for Applications
Presto Conference, Tel Aviv, April 2019
@DavidKrakov
Co-Founder & CTO

Who are we?
The platform for fast analytics on your cloud data lake
Founded in 2017
Beta, Pre-GA
$7.5M Seed
Backed by

Shift is happening in how big data is used
Data lake
Offline, Batch, Exploratory: Reporting, Ad-hoc queries, Batch processing
Interactive applications: Business dashboards, Production applications, Real time decision

A task for Presto on S3?
Data: Many sources, Complex schemas, Huge volumes
Workload: Diverse Queries, High Concurrency, Unpredictable load
User Expectations: Predictably Interactive

Previously: Presto Raptor
• Connector for low latency queries
• Store data copy on local SSDs
• Shared nothing (data is sharded)
• Columnar and sorted
• Use min/max to skip shards
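
The last bullet, skipping shards by min/max, can be illustrated with a minimal Python sketch. It is not Raptor's code: the Shard class, the shard names and the value ranges are all hypothetical, and it only assumes that each shard records the minimum and maximum of one sorted column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    """Hypothetical shard descriptor: a name plus min/max of one sorted column."""
    name: str
    min_value: int
    max_value: int


def shards_to_scan(shards: List[Shard], low: int, high: int) -> List[Shard]:
    """Keep only shards whose [min, max] range overlaps the query range [low, high]."""
    return [s for s in shards if s.max_value >= low and s.min_value <= high]


# Illustrative shards, not taken from the talk.
shards = [
    Shard("shard-a", 0, 99),
    Shard("shard-b", 100, 199),
    Shard("shard-c", 200, 299),
]

# A predicate such as `WHERE x BETWEEN 150 AND 250` never needs to read shard-a.
selected = [s.name for s in shards_to_scan(shards, 150, 250)]
print(selected)
```

Pruning of this kind only pays off when the data is sorted on the filtered column, which is one reason the next slide lists sorting and distribution as data modeling work.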

Previously: Presto Raptor
Needs:
• Data modeling: Sorting, Distribution
• ETL & copy management
• Workload optimization
• Deployment maintenance
➔ Long implementation cycles
➔ Expertise bottleneck
➔ Limited workloads

What is needed to shorten the cycle
No data modeling
No copy management
No workload bottlenecks
Simple Deployment

How Varada works
Data lake
Interactive response
SQL
Data view

Shared Everything: no bottlenecks
[Diagram: Coordinator and four Workers, each Worker with an NVMe SSD, interconnected in a Placement Group]

Shared Everything: no bottlenecks, no skew
Shared Everything Architecture With NVMeF
[Diagram: Coordinator and Workers with NVMe SSDs, interconnected in a Placement Group; Workers attached to "All NVMe SSDs"]

Materialized view: set the data with SQL
SSD
CREATE MATERIALIZED VIEW v
AS SELECT columns
FROM hive.orc_s3_table
WHERE condition
Define view with SQL
Data Lake

Inline Indexing: no data modeling
SSD
Break all dimensions to nanoblocks
Inline index across all nanoblocks
All the dimensions are indexed
Define view with SQL
Data Lake

Real Time Sync: no copy management
SSD
Auto synchronization from S3
Index & Data synchronized to the data lake on content and on schema
Track Data Change
Data Lake

Ingredients of Index on Big Data
Random I/O
Small Blocks
Low Latency
High Bandwidth

Break data to small blocks
Up to few KBs of columnar storage
Temp: 31°, 31°, 32°, 70°
Userid: ab32f27d, 12eaaad4, ffff5ab1, NULL
Keyword: “four”, “score”, “and”, “seven”
[Diagram: the column values are stored as separate blocks, each labeled "Block on SSD"]
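
As a rough illustration of splitting a column into small blocks, here is a minimal Python sketch. The slide sizes blocks in kilobytes; this sketch sizes them in rows to stay short, and the function name split_into_blocks and the block size of 2 are assumptions made for the example, not Varada's implementation.

```python
from typing import Iterator, List, Sequence


def split_into_blocks(values: Sequence, rows_per_block: int) -> Iterator[List]:
    """Yield consecutive slices of one column, each at most rows_per_block long."""
    if rows_per_block <= 0:
        raise ValueError("rows_per_block must be positive")
    for start in range(0, len(values), rows_per_block):
        yield list(values[start:start + rows_per_block])


# Column values taken from the slide; the block size of 2 is illustrative only.
temp = [31, 31, 32, 70]
blocks = list(split_into_blocks(temp, rows_per_block=2))
print(blocks)
```

Each block holds values of a single column, so a query that touches one column reads only that column's blocks.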

Index across all dimensions
Every 100k rows of every column have an optimal index of their own
Temp: 31°, 31°, 32°, 70° (index for the range 31° - 32°)
Userid: ab32f27d, 12eaaad4, ffff5ab1, NULL (index for user id, high cardinality)
Keyword: “four”, “score”, “and”, “seven” (index for string matching)
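
The idea that each chunk of each column gets an index suited to its content can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the index structures Varada uses: the function build_chunk_index is hypothetical, numeric chunks get a (min, max) summary, and all other chunks get a map from value to row positions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union

Value = Union[int, float, str]
Index = Union[Tuple[float, float], Dict[str, List[int]]]


def build_chunk_index(chunk: List[Optional[Value]]) -> Optional[Index]:
    """Build a small index for one chunk of one column (illustrative heuristic)."""
    present = [v for v in chunk if v is not None]
    if not present:
        return None  # nothing to index in an all-NULL chunk
    if all(isinstance(v, (int, float)) for v in present):
        # Numeric chunk: a (min, max) summary is enough to answer range checks.
        return (min(present), max(present))
    # Other chunks: map each value to the row offsets where it occurs.
    positions: Dict[str, List[int]] = defaultdict(list)
    for row, value in enumerate(chunk):
        if value is not None:
            positions[str(value)].append(row)
    return dict(positions)


# Sample values from the slide; None stands in for NULL.
temp_index = build_chunk_index([31, 31, 32, 70])
userid_index = build_chunk_index(["ab32f27d", "12eaaad4", "ffff5ab1", None])
print(temp_index, userid_index)
```

Because the choice is made per chunk, two chunks of the same column may end up with different index kinds when their value distributions differ.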

Random walk the indexes in parallel
Temp: 31°, 31°, 32°, 70° (index for the range 31° - 32°)
Userid: ab32f27d, 12eaaad4, ffff5ab1, NULL (index for user id, high cardinality)
Keyword: “four”, “score”, “and”, “seven” (index for string matching)
SELECT WHERE temp > 40
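
One way to picture probing many per-chunk indexes in parallel for the slide's temp > 40 predicate is the Python sketch below. The slide does not describe the actual mechanism, so everything here is an assumption: the chunk summaries are invented, each index is reduced to a (min, max) pair, and a thread pool stands in for whatever parallelism the engine uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Hypothetical per-chunk summaries: (chunk id, min temp, max temp).
chunk_ranges: List[Tuple[int, int, int]] = [
    (0, 31, 32),
    (1, 35, 70),
    (2, 20, 40),
]


def may_match_greater_than(summary: Tuple[int, int, int], threshold: int) -> bool:
    """A chunk can hold rows with temp > threshold only if its max exceeds it."""
    _, _, max_temp = summary
    return max_temp > threshold


def candidate_chunks(summaries: List[Tuple[int, int, int]], threshold: int) -> List[int]:
    """Probe every chunk summary concurrently and return ids of chunks worth reading."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in input order, so flags line up with summaries.
        flags = list(pool.map(lambda s: may_match_greater_than(s, threshold), summaries))
    return [s[0] for s, keep in zip(summaries, flags) if keep]


matches = candidate_chunks(chunk_ranges, 40)
print(matches)
```

Only the chunks returned by candidate_chunks need to be read from SSD; the rest are ruled out by their index alone.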

Every query finds an index
Temp: 31°, 31°, 32°, 70° (index for the range 31° - 32°)
Userid: ab32f27d, 12eaaad4, ffff5ab1, NULL (index for user id, high cardinality)
Keyword: “four”, “score”, “and”, “seven” (index for string matching)
JOIN on a.UserId = b.UserID

Varada fast analytics platform for data lake
[Diagram labels: Storage; Catalog; Metastore; AWS Cloud Data Lake; Data Applications; External Data Sources (Relational, NoSQL); Varada Platform; SQL engine (Presto); Auto Synchronization from S3 (Hive); Connector; Materialized View and Index; Connectors; i3 Clusters]

20. Thank you!

Editor's Notes

Taking on big data
New approach to big data
Index 1st, query 2nd / Indexing before/pre query
Pre-ready indexing
Applications or BI tools connected to Varada perform complex queries with sub-second response times—on any dimension and schema—saving months of data preparation and dramatically improving an analyst’s time to insight.
