2. Who are we?
The platform for fast analytics on your cloud data lake
Founded in 2017
Beta, Pre-GA
$7.5M Seed
Backed by
3. Shift is happening in how big data is used
Interactive applications
Offline, Batch, Exploratory
Data lake
Reporting
Ad-hoc queries
Batch processing
Business dashboards
Production applications
Real time decision
4. A task for Presto on S3?
Data Workload
Many sources
Complex schemas
Huge volumes
Diverse Queries
High Concurrency
Unpredictable load
User Expectations
Predictably Interactive
5. Previously: Presto Raptor
• Connector for low latency queries
• Store data copy on local SSDs
• Shared nothing (data is sharded)
• Columnar and sorted
• Use min/max to skip shards
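The min/max skipping in the last bullet can be sketched as follows. This is a minimal illustration of the general technique, not Raptor's actual code: the `Shard` class, the `shards_to_scan` helper and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    """A hypothetical shard holding one sorted column."""
    name: str
    values: List[int]  # sorted, so first/last are the min/max

    @property
    def min_value(self) -> int:
        return self.values[0]

    @property
    def max_value(self) -> int:
        return self.values[-1]


def shards_to_scan(shards: List[Shard], low: int, high: int) -> List[Shard]:
    """Keep only shards whose [min, max] range overlaps the predicate [low, high]."""
    return [s for s in shards if s.max_value >= low and s.min_value <= high]


shards = [
    Shard("shard-0", [1, 5, 9]),
    Shard("shard-1", [10, 14, 19]),
    Shard("shard-2", [20, 27, 30]),
]

# WHERE col BETWEEN 12 AND 22: shard-0 is skipped without being read.
selected = shards_to_scan(shards, 12, 22)
print([s.name for s in selected])
```

Skipping only pays off when the data is sorted on the filtered column, which is why the approach depends on the sorting and distribution choices described on the next slide.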
6. Previously: Presto Raptor
Needs:
Data modeling: Sorting, Distribution
ETL & copy management
Workload optimization
Deployment maintenance
➔ Long implementation cycles
➔ Expertise bottleneck
➔ Limited workloads
7. What is needed to shorten the cycle
No data modeling
No copy management
No workload bottlenecks
Simple Deployment
10. Shared Everything Architecture With NVMeF
Shared Everything: no bottlenecks, no skew
[Diagram: a Coordinator and Workers, each Worker with access to all NVMe SSDs, interconnected in a Placement Group]
11. Materialized view: set the data with SQL
SSD
CREATE MATERIALIZED VIEW v
AS SELECT columns
FROM hive.orc_s3_table
WHERE condition
Define view with SQL
Data Lake
12. Inline Indexing: no data modeling
SSD
Break all dimensions to nanoblocks
Inline index across all nanoblocks
All the dimensions are indexed
Define view with SQL
Data Lake
13. Real Time Sync: no copy management
SSD
Auto synchronization from S3
Index & Data synchronized to the data lake on content and on schema
Track Data Change
Data Lake
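One way to picture "Track Data Change" is a diff between two listings of the data lake. The slide does not say how changes are detected, so this is only a sketch under assumptions: the `diff_listings` helper, the object keys and the version tags are all hypothetical, and no real S3 call is made.

```python
from typing import Dict, List


def diff_listings(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {object key: version tag} snapshots and report what changed."""
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    changed = sorted(k for k in current if k in previous and current[k] != previous[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots taken at two points in time.
previous = {"orders/part-0.orc": "etag-a", "orders/part-1.orc": "etag-b"}
current = {
    "orders/part-0.orc": "etag-a",
    "orders/part-1.orc": "etag-c",
    "orders/part-2.orc": "etag-d",
}

changes = diff_listings(previous, current)
print(changes)
```

Only the added and changed objects would need to be re-read and re-indexed; unchanged objects keep their existing blocks and indexes.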
14. Ingredients of Index on Big Data
Random I/O
Small Blocks
Low Latency
High Bandwidth
15. Break data to small blocks
Up to a few KB of columnar storage per block
Temp    Userid      Keyword
31°     ab32f27d    “four”
31°     12eaaad4    “score”
32°     ffff5ab1    “and”
70°     NULL        “seven”
[Diagram: six blocks, each labeled “Block on SSD”]
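Breaking columns into small blocks can be sketched as below, using the sample values from this slide. The helper name `split_into_blocks` and the block size of two rows are assumptions chosen so the toy table yields six blocks; the real block size is described only as "a few KB".

```python
from typing import Dict, List, Sequence


def split_into_blocks(column: Sequence, rows_per_block: int) -> List[list]:
    """Cut one column into consecutive blocks of at most rows_per_block values."""
    if rows_per_block <= 0:
        raise ValueError("rows_per_block must be positive")
    return [
        list(column[i:i + rows_per_block])
        for i in range(0, len(column), rows_per_block)
    ]


# Sample table from the slide, stored column by column (None stands for NULL).
table: Dict[str, list] = {
    "temp": [31, 31, 32, 70],
    "userid": ["ab32f27d", "12eaaad4", "ffff5ab1", None],
    "keyword": ["four", "score", "and", "seven"],
}

# Each column is blocked independently, so a query touches only the columns it needs.
blocks = {name: split_into_blocks(values, 2) for name, values in table.items()}
print(blocks["temp"])
```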
16. Index across all dimensions
Every 100k rows of every column have an optimal index of their own
Temp    Userid      Keyword
31°     ab32f27d    “four”
31°     12eaaad4    “score”
32°     ffff5ab1    “and”
70°     NULL        “seven”
Temp: Index for the range 31° - 32°
Userid: Index for user id, high cardinality
Keyword: Index for string matching
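A per-block, per-column index can be sketched like this. The slide says each block gets an index suited to its data but not how the index type is picked, so the mapping below (a range index for `temp`, a value-set index for `userid`) is assigned by hand, and the function names and the two-row block size are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple


def range_index(block: list) -> Optional[Tuple]:
    """(min, max) of the non-NULL values, or None for an all-NULL block."""
    present = [v for v in block if v is not None]
    return (min(present), max(present)) if present else None


def value_set_index(block: list) -> frozenset:
    """Set of distinct non-NULL values, useful for equality lookups."""
    return frozenset(v for v in block if v is not None)


ROWS_PER_BLOCK = 2  # toy size; the slide uses 100k rows per block

# Column values paired with a hand-picked index builder (assumption).
columns: Dict[str, Tuple[list, Callable]] = {
    "temp": ([31, 31, 32, 70], range_index),
    "userid": (["ab32f27d", "12eaaad4", "ffff5ab1", None], value_set_index),
}

indexes: Dict[str, List] = {}
for name, (values, build) in columns.items():
    indexes[name] = [
        build(values[i:i + ROWS_PER_BLOCK])
        for i in range(0, len(values), ROWS_PER_BLOCK)
    ]

print(indexes["temp"])
```

Because each block carries its own index, a block with a narrow value range and a block with many distinct values do not have to share one compromise structure.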
17. Random walk the indexes in parallel
Temp    Userid      Keyword
31°     ab32f27d    “four”
31°     12eaaad4    “score”
32°     ffff5ab1    “and”
70°     NULL        “seven”
Temp: Index for the range 31° - 32°
Userid: Index for user id, high cardinality
Keyword: Index for string matching
SELECT WHERE temp > 40
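The query on this slide can be walked through with a small sketch: every block's index is checked in parallel, and only blocks that may contain matches are read. The thread pool, the `scan_block` helper and the two-row block size are illustrative assumptions, not the platform's execution model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

ROWS_PER_BLOCK = 2  # toy size for the four sample rows

# The temp column from the slide, already split into blocks.
temp_blocks = [[31, 31], [32, 70]]
# One (min, max) range index per block.
block_ranges = [(min(b), max(b)) for b in temp_blocks]


def scan_block(block_id: int, threshold: int) -> List[int]:
    """Return the global row numbers in one block where temp > threshold."""
    _, high = block_ranges[block_id]
    if high <= threshold:
        return []  # the index proves no row can match, so the block is never read
    base = block_id * ROWS_PER_BLOCK
    return [base + i for i, v in enumerate(temp_blocks[block_id]) if v > threshold]


# SELECT ... WHERE temp > 40, evaluated block by block in parallel.
with ThreadPoolExecutor() as pool:
    parts = pool.map(lambda b: scan_block(b, 40), range(len(temp_blocks)))
    matching_rows = sorted(r for part in parts for r in part)

print(matching_rows)
```

Each block lookup is an independent small read, which is the random I/O pattern that fast SSDs serve well.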
18. Every query finds an index
Temp    Userid      Keyword
31°     ab32f27d    “four”
31°     12eaaad4    “score”
32°     ffff5ab1    “and”
70°     NULL        “seven”
Temp: Index for the range 31° - 32°
Userid: Index for user id, high cardinality
Keyword: Index for string matching
JOIN on a.UserId = b.UserID
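The join on this slide can be served by a lookup index on the user id column. The sketch below shows an ordinary index-assisted equi-join; the second table `b_userid` and the helper `build_lookup_index` are hypothetical, and the slide does not state which join algorithm is actually used.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_lookup_index(values: List[Optional[str]]) -> Dict[str, List[int]]:
    """Map each non-NULL value to the row numbers where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for row, value in enumerate(values):
        if value is not None:
            index[value].append(row)
    return index


# Table a uses the slide's sample values; table b is a made-up example.
a_userid = ["ab32f27d", "12eaaad4", "ffff5ab1", None]
b_userid = ["ffff5ab1", "ab32f27d", "00000000"]

b_index = build_lookup_index(b_userid)

# JOIN ON a.UserId = b.UserID: probe b's index once per row of a.
# NULL keys are skipped because NULL never equals anything in SQL.
joined: List[Tuple[int, int]] = [
    (row_a, row_b)
    for row_a, value in enumerate(a_userid)
    if value is not None
    for row_b in b_index.get(value, [])
]

print(joined)
```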
19. Varada fast analytics platform for data lake
Storage
Catalog
Metastore
AWS Cloud Data Lake
Data Applications
External Data Sources
Varada Platform
Relational
NoSQL
SQL engine
Presto
Auto Synchronization from S3 (Hive)
Connector
Materialized View and Index
Connectors
i3 Clusters
Taking on big data
New approach to big data
Index 1st, query 2nd / Indexing before/pre query
Pre-ready indexing
Applications or BI tools connected to Varada perform complex queries with sub-second response times—on any dimension and schema—saving months of data preparation and dramatically improving an analyst’s time to insight.