10x Faster Trino Queries on Your Data Platform
Jianjian Xie
Staff Software Engineer @ Alluxio
Staff Engineer @ Alluxio
Trino Contributor
PrestoDB Contributor
Jianjian Xie
RubiX is OUT, Alluxio is IN
2020: Trino 332 introduced Hive connector storage caching by RubiX/Qubole
June 2023: Alluxio developers previewed the cache at Trino Fest 2023
February 2024: Trino 439 introduced Trino File System Cache using Alluxio
June 2024: Jonas@Dune shared the production results at Trino Fest 2024
⚡ 20~30% faster queries
💰 70% fewer S3 GET requests
● RubiX is no longer maintained
● RubiX does not support Iceberg/Hudi/Delta formats
● RubiX depends on the Hadoop and Hive ecosystem
Source: A cache refresh for Trino, Trino Fest 2023, Trino Fest 2024
Glossary - Let’s Talk Cache
Trino File System Cache: The latest built-in file system cache, introduced in the Trino 439 release. It is based on the Alluxio caching library and replaced the RubiX caching library in Trino. Read the Trino blog for details.
Alluxio or Alluxio Distributed Cache: A full-capability distributed system, deployed as a standalone cluster (both Open Source and Enterprise Edition are available). Read the edition comparison for details.
Alluxio Edge: Similar to Trino File System Cache; a lite version of Alluxio that is purpose-built for Trino and deployed as a sidecar to Trino.
Which Cache Fits Your Need?
Maintainers
  Trino File System Cache: Actively maintained by the Alluxio and Trino communities
  Alluxio or Alluxio Distributed Cache: Actively maintained by the Alluxio community
Availability
  Trino File System Cache: Since Trino 439
  Alluxio or Alluxio Distributed Cache: Available since 2015
Deployment
  Trino File System Cache: A library in Trino worker processes
  Alluxio or Alluxio Distributed Cache: A standalone service running in independent processes
Cache Capacity
  Trino File System Cache: Uses local NVMe disk or memory; bounded by local disk capacity
  Alluxio or Alluxio Distributed Cache: Cache capacity scales horizontally
Cache Sharing
  Trino File System Cache: Cached data is accessible only to the local Trino worker process
  Alluxio or Alluxio Distributed Cache: Cached data is shareable across Trino clusters, as well as Spark and other frameworks
APIs
  Trino File System Cache: TrinoFileSystem, internal to Trino
  Alluxio or Alluxio Distributed Cache: HadoopFileSystem, S3, POSIX (GA), Python FSSpec (experimental)
Trino File System Cache
Four Values of Trino File System Cache
● Boost Performance
● Save Costs
● Prevent Network Congestion
● Offload Under Storage
Key Features of Trino File System Cache
Caching Data: Local SSD, Memory
Connector Support: Iceberg, Hudi, Delta Lake, Hive
How to Enable Trino File System Cache?
From a Trino user's point of view, nothing really changes. The cache is enabled with three properties in the catalog properties file:
fs.cache.enabled=true
fs.cache.directories=/tmp/cache
fs.cache.max-sizes=10GB
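As a rough illustration of what these three properties express, the Python sketch below parses a catalog properties snippet and runs a few sanity checks before it is deployed. It is not part of Trino: the function names (`parse_properties`, `parse_data_size`, `check_cache_config`) are hypothetical, the unit table assumes Trino-style data sizes such as `10GB`, and the rule that each cache directory has a matching maximum size is an assumption about how the comma-separated lists pair up.

```python
import re

# Assumed unit strings for Trino-style data sizes (powers of 1024).
UNIT_BYTES = {
    "B": 1,
    "kB": 1024,
    "MB": 1024 ** 2,
    "GB": 1024 ** 3,
    "TB": 1024 ** 4,
    "PB": 1024 ** 5,
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_data_size(value):
    """Convert a size such as '10GB' into bytes; reject unknown units."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", value)
    if not match or match.group(2) not in UNIT_BYTES:
        raise ValueError(f"not a recognised data size: {value!r}")
    return int(float(match.group(1)) * UNIT_BYTES[match.group(2)])


def split_list(value):
    """Split a comma-separated property value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_cache_config(props):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if props.get("fs.cache.enabled", "false").lower() != "true":
        problems.append("fs.cache.enabled is not true")
    dirs = split_list(props.get("fs.cache.directories", ""))
    sizes = split_list(props.get("fs.cache.max-sizes", ""))
    if not dirs:
        problems.append("fs.cache.directories is empty")
    # Assumption: one maximum size per cache directory.
    if sizes and len(sizes) != len(dirs):
        problems.append("fs.cache.max-sizes does not match fs.cache.directories")
    for size in sizes:
        try:
            parse_data_size(size)
        except ValueError as exc:
            problems.append(str(exc))
    return problems


SAMPLE = """
fs.cache.enabled=true
fs.cache.directories=/tmp/cache
fs.cache.max-sizes=10GB
"""

if __name__ == "__main__":
    print(check_cache_config(parse_properties(SAMPLE)))
```

A check like this could run in a deployment pipeline so that a typo in a unit or a missing directory is caught before the workers restart.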
A Deeper Dive - How Trino File Cache Works
File System Caching at Uber Scale
3 Clusters, 1500 Nodes
50% Input Read Performance
10% Data Read Traffic to HDFS
Source: https://www.uber.com/blog/speed-up-presto-with-alluxio-local-cache/
Alluxio Distributed Cache
Alluxio Distributed Cache Architecture
Compute Node
Unify Data Lake Across Multiple Geographic Regions at Expedia
PROBLEMS ENCOUNTERED
● Data silos caused by different brands/teams ingesting data dispersed across multiple regions in AWS
● Central analytics platform performing queries across data silos suffered from poor user experience and long time to insight
● Manual replication resulted in inefficiencies, operational overhead, and expensive S3 egress costs
ALLUXIO'S SOLUTION
● Unify data silos without the need to copy or move data
RESULTS ACHIEVED
● Enhanced user experience with consistent, high-performance analytics, reducing time to insight
● Reduced cost per query
Diagram labels: MAIN REGION: CENTRAL ANALYTICS; MOUNTED; US-WEST-1; US-WEST-2; US-EAST-1; US-EAST-2; TEAM A; TEAM B; TEAM C
50%
Multi-Level Cache
Multi-level Cache: Best of Both Worlds
Diagram: a Trino Worker runs Trino with the Trino File System Cache, backed by the Alluxio Distributed Cache
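To make the multi-level idea concrete, here is a toy Python model of a read path that checks a worker-local cache first, then a shared distributed cache, and only then the under storage. It is a sketch under simplifying assumptions, not how Trino or Alluxio are implemented: plain dictionaries stand in for all three layers, the class name `MultiLevelReader` is hypothetical, whole objects are cached instead of pages, and eviction is ignored.

```python
class MultiLevelReader:
    """Toy read path: local cache -> distributed cache -> under storage."""

    def __init__(self, local_cache, distributed_cache, under_storage):
        self.local = local_cache              # private to one worker
        self.distributed = distributed_cache  # shared by many workers
        self.storage = under_storage          # e.g. an object store
        self.stats = {"local": 0, "distributed": 0, "storage": 0}

    def read(self, path):
        # Level 1: the worker's own cache, the cheapest place to hit.
        if path in self.local:
            self.stats["local"] += 1
            return self.local[path]
        # Level 2: the shared cache; on a hit, warm the local cache too.
        if path in self.distributed:
            self.stats["distributed"] += 1
            data = self.distributed[path]
            self.local[path] = data
            return data
        # Miss on both levels: read from under storage and fill both caches.
        self.stats["storage"] += 1
        data = self.storage[path]
        self.distributed[path] = data
        self.local[path] = data
        return data


if __name__ == "__main__":
    storage = {"s3://bucket/table/part-0": b"rows"}  # hypothetical object
    shared = {}
    worker_a = MultiLevelReader({}, shared, storage)
    worker_b = MultiLevelReader({}, shared, storage)
    worker_a.read("s3://bucket/table/part-0")  # goes to under storage
    worker_a.read("s3://bucket/table/part-0")  # served from worker A's cache
    worker_b.read("s3://bucket/table/part-0")  # served from the shared cache
    print(worker_a.stats, worker_b.stats)
```

In this model the second worker never touches the under storage, which is the sharing benefit the distributed layer adds on top of the per-worker cache.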
Ongoing Work
Upcoming Trino Native Alluxio Distributed Cache
Avoid the old, complex HDFS interface by implementing the native Trino interface
Takeaways
Takeaways: Which Cache Fits Your Need?
Maintainers
  Trino File System Cache: Actively maintained by the Alluxio and Trino communities
  Alluxio or Alluxio Distributed Cache: Actively maintained by the Alluxio community
Availability
  Trino File System Cache: Since Trino 439
  Alluxio or Alluxio Distributed Cache: Available since 2015
Deployment
  Trino File System Cache: A library in Trino worker processes
  Alluxio or Alluxio Distributed Cache: A standalone service running in independent processes
Cache Capacity
  Trino File System Cache: Uses local NVMe disk or memory; bounded by local disk capacity
  Alluxio or Alluxio Distributed Cache: Cache capacity scales horizontally
Cache Sharing
  Trino File System Cache: Cached data is accessible only to the local Trino worker process
  Alluxio or Alluxio Distributed Cache: Cached data is shareable across Trino clusters, as well as Spark and other frameworks
APIs
  Trino File System Cache: TrinoFileSystem, internal to Trino
  Alluxio or Alluxio Distributed Cache: HadoopFileSystem, S3, POSIX (GA), Python FSSpec (experimental)
Thank You
Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!

Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
