How to ensure Presto scalability in multi use case
Kai Sasaki
Treasure Data Inc.
Kai Sasaki (@Lewuathe)
Software Engineer at Treasure Data Inc.
Hadoop/Presto/Spark
Presto In TD
• 150,000+ queries / day
• 190+ TB processed / day
• 10+ MB processed / (query * sec)
• 100+ million records processed / query
Presto In TD
[Architecture diagram. Labels: BI Tool, HTTP, TD API, Prestobase Proxy, PerfectQueue (query), Presto, Plazma (data)]
How to make it scalable
• Prestobase Proxy
• Node scheduler
• Resource Group
Prestobase proxy
Prestobase proxy
Prestobase proxy aims to provide an interface to Presto,
especially for BI tools connecting through JDBC/ODBC,
and to replace Prestogres.
Prestobase proxy
• Written in Scala
• Finagle-based RPC proxy
• Runs as a Docker container
• Uses Airframe
• VCR-based lightweight test framework
Finagle
Finagle is an extensible RPC system for the JVM,
used to construct high-concurrency servers.
Finagle implements uniform client and server
APIs for several protocols, and is designed for
high performance and concurrency.
see: https://twitter.github.io/finagle/
Finagle
protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
  bind[AnotherHandler] andThen
  LastFilter andThen
  prestoClient
Build the request pipeline by binding
filters and handlers with Airframe
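To make the `andThen` chaining concrete, here is a minimal, dependency-free sketch of the filter/service idea. It does not use Finagle itself: `Request`, `Response`, `Filter`, `AuthFilter`, `TagFilter`, and the `X-User` header are hypothetical stand-ins, and the real Finagle types are asynchronous (they return a `Future`), which this sketch leaves out.

```scala
object PipelineSketch {
  // Simplified stand-ins for an HTTP request and response (hypothetical types).
  final case class Request(path: String, headers: Map[String, String] = Map.empty)
  final case class Response(status: Int, body: String)

  // A service is the end of the chain: it turns a request into a response.
  type Service = Request => Response

  // A filter wraps the rest of the chain and may alter the request or response.
  trait Filter { self =>
    def apply(request: Request, next: Service): Response

    // Put this filter in front of another filter.
    def andThen(other: Filter): Filter = new Filter {
      def apply(request: Request, next: Service): Response =
        self.apply(request, req => other.apply(req, next))
    }

    // Terminate the chain with a service.
    def andThen(service: Service): Service =
      request => self.apply(request, service)
  }

  // Hypothetical filter: rejects requests that carry no user header.
  object AuthFilter extends Filter {
    def apply(request: Request, next: Service): Response =
      if (request.headers.contains("X-User")) next(request)
      else Response(401, "unauthorized")
  }

  // Hypothetical filter: rewrites the response on the way back.
  object TagFilter extends Filter {
    def apply(request: Request, next: Service): Response = {
      val response = next(request)
      response.copy(body = "[tagged] " + response.body)
    }
  }

  // Stand-in for the Presto client at the end of the pipeline.
  val prestoClient: Service = request => Response(200, s"result of ${request.path}")

  val service: Service = AuthFilter andThen TagFilter andThen prestoClient
}
```

Each filter decides whether to call the next stage, so `AuthFilter` can short-circuit before the request ever reaches `prestoClient`.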
Airframe
Airframe is a trait-based dependency injection
framework that uses Scala macros
- https://github.com/wvlet/airframe
Airframe
- Dependency injection tailored to Scala
- Tagged binding with wvlet (https://github.com/wvlet/wvlet)
- Object lifecycle management
Airframe
import wvlet.airframe._

val design: Design =
  newDesign
    .bind[X].toInstance(new X) // Bind type X to a concrete instance
    .bind[Y].toSingleton       // Bind type Y to a singleton object
    .bind[Z].to[ZImpl]         // Bind type Z to an instance of ZImpl

trait App {
  val x = bind[X]
  val y = bind[Y]
  val z = bind[Z]
  // Do something with X, Y, and Z
}

val session = design.newSession
val app: App = session.build[App]
VCR testing framework
Record the HTTP interactions of the test suite to make
tests stable and deterministic.
For more detail, see:
https://testing.googleblog.com/2016/11/what-test-engineers-do-at-google.html
VCR testing framework
On CI
protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
  bind[AnotherHandler] andThen
  QueryRewriter andThen
  bind[RequestVCR] andThen
  prestoClient
On Production
protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
  bind[AnotherHandler] andThen
  QueryRewriter andThen
  bind[NoRecording] andThen
  prestoClient
VCR testing framework
[Diagram, recording: Prestobase, RequestVCRClient, SQLite]
VCR testing framework
[Diagram, replaying: Prestobase, RequestVCRClient, SQLite]
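The record/replay idea can be sketched without any HTTP or database dependency. This is an illustration only, not the Prestobase implementation: `Cassette`, `Mode`, and `vcr` are hypothetical names, requests and responses are plain strings, and an in-memory map stands in for the SQLite store shown in the diagrams.

```scala
import scala.collection.mutable

object VcrSketch {
  // A backend maps a request (here just a query string) to a response body.
  type Backend = String => String

  sealed trait Mode
  case object Recording extends Mode
  case object Replaying extends Mode

  // Hypothetical cassette: an in-memory map instead of SQLite.
  final class Cassette {
    private val tape = mutable.Map.empty[String, String]
    def record(request: String, response: String): Unit = tape.update(request, response)
    def replay(request: String): Option[String] = tape.get(request)
  }

  // Wraps a backend. When recording, calls go through and are stored;
  // when replaying, the backend is never called.
  def vcr(mode: Mode, cassette: Cassette, backend: Backend): Backend = request =>
    mode match {
      case Recording =>
        val response = backend(request)
        cassette.record(request, response)
        response
      case Replaying =>
        cassette.replay(request).getOrElse(
          throw new NoSuchElementException(s"no recorded response for: $request"))
    }
}
```

Because replay never touches the backend, a test suite run against a recorded cassette gives the same result on every run.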
Prestobase proxy
Will be open sourced soon
Node Scheduler
Node Scheduler
Submitting a query proceeds as follows:
- Analyze the query AST
- Build the logical/physical query plan
- Schedule each stage
Node Scheduler
[Diagram: a query divided into stage2, stage1, and stage0. Task labels: task2-0, task2-1, task2-0; task1-0, task1-1; task0-0. Other labels: Table Scan, output]
Node Scheduler
NodeScheduler creates a NodeSelector, which
selects the worker nodes on which tasks are
scheduled. NodeSelector picks worker
nodes when there are splits available.
Node Scheduler in TD
Keeps a map of worker nodes that are
candidates for launching the next tasks.
- Ignore min candidates
- Limit by available memory pool
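A rough sketch of what limiting candidates by available memory could look like follows. It is an assumption-laden illustration, not TD's scheduler: `Worker`, its fields, `minFreeBytes`, and the least-loaded ordering are all hypothetical. The result may contain fewer nodes than `limit`, which is one possible reading of "Ignore min candidates".

```scala
object NodeSelectionSketch {
  // Hypothetical view of a worker node; field names are illustrative, not Presto's.
  final case class Worker(id: String, freeMemoryBytes: Long, runningSplits: Int)

  // Keep only workers with enough free memory in their pool,
  // then prefer the least loaded ones, returning at most `limit` nodes.
  def candidates(workers: Seq[Worker], minFreeBytes: Long, limit: Int): Seq[Worker] =
    workers
      .filter(_.freeMemoryBytes >= minFreeBytes)
      .sortBy(_.runningSplits)
      .take(limit)
}
```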
Node Scheduler in TD
Memory pool usage returns to normal after the task completes.
Node Scheduler in TD
Challenges
- Smoothing CPU time metric
- Split type awareness
- Avoid problematic worker nodes
Resource Group
Resource Group
Resource Groups were introduced in 0.147
→ https://prestodb.io/docs/current/admin/resource-groups.html
Resource Groups aim to limit resource
usage per account/group/query.
Resource Group
[Diagram: rootGroup with child groups general and adhoc. Limit boxes shown:
 (softMemoryLimit: 100%, maxQueued: 5000, maxRunning: 1000);
 (softMemoryLimit: 100%, maxQueued: 100, maxRunning: 200);
 (softMemoryLimit: 100%, maxRunning: 1000)]
Resource Group limits
- maxQueued
- maxRunning
- softMemoryLimit: once exceeded, following queries are queued
- softCpuLimit: once exceeded, a penalty is imposed on max running queries
- hardCpuLimit: once exceeded, following queries are queued
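The interplay of the queue, running, and memory limits can be sketched as a small admission function. This is a simplified model under stated assumptions, not Presto's code: `Limits`, `State`, `Decision`, and `admit` are hypothetical names, the memory limit is expressed in bytes rather than a percentage, and the CPU limits are left out.

```scala
object ResourceGroupLimitsSketch {
  // Hypothetical limits and current state of a single resource group.
  final case class Limits(maxQueued: Int, maxRunning: Int, softMemoryLimitBytes: Long)
  final case class State(queued: Int, running: Int, memoryUsageBytes: Long)

  sealed trait Decision
  case object Run extends Decision
  case object Queue extends Decision
  case object Reject extends Decision

  // Run if a running slot is free and memory is under the soft limit;
  // otherwise queue if the queue has room; otherwise reject.
  def admit(limits: Limits, state: State): Decision = {
    val canRun =
      state.running < limits.maxRunning &&
        state.memoryUsageBytes < limits.softMemoryLimitBytes
    if (canRun) Run
    else if (state.queued < limits.maxQueued) Queue
    else Reject
  }
}
```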
Resource Group scheduling
- schedulingPolicy
  - fair: FIFO
  - weighted: selected stochastically
  - query_priority: selected according to priority
- schedulingWeight
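As an illustration of the weighted policy, the sketch below picks a sub-group with probability proportional to its schedulingWeight. It is an assumption of how stochastic selection could work, not Presto's implementation: `Group` and `pick` are hypothetical, and the random draw is passed in as a parameter so the behavior is reproducible.

```scala
object WeightedPolicySketch {
  // Hypothetical sub-group with its scheduling weight.
  final case class Group(name: String, schedulingWeight: Int)

  // Picks a group with probability proportional to its weight.
  // `draw` must be in [0, 1); pass scala.util.Random.nextDouble() in real use.
  def pick(groups: Seq[Group], draw: Double): Option[Group] = {
    val total = groups.map(_.schedulingWeight).sum
    if (total <= 0) None
    else {
      val target = draw * total
      // Running upper bounds: a group is chosen when target falls below its bound.
      val bounds = groups.scanLeft(0)(_ + _.schedulingWeight).tail
      groups.zip(bounds).collectFirst { case (g, upper) if target < upper => g }
    }
  }
}
```

With weights 3 and 1, draws below 0.75 select the first group and the rest select the second, so over many draws the first group gets roughly three quarters of the slots.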
Resource Group
Every query must be associated with a resource
group. Matching is done by
configured selectors.
{
  "user": "bob", "group": "general"
},
{
  "source": ".*adhoc.*", "group": "global.adhoc.adhoc_${USER}"
}
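The selector rules above can be read as "first match wins, with ${USER} expanded in the group name". The sketch below models that reading; `Selector` and `resolve` are hypothetical names, and treating `user` and `source` as optional regular expressions that must all match is an assumption based on the two rules shown.

```scala
object SelectorSketch {
  // One selector rule; `user` and `source` are optional regular expressions.
  final case class Selector(user: Option[String], source: Option[String], group: String)

  // Returns the group of the first selector whose patterns all match,
  // with the ${USER} placeholder replaced by the submitting user.
  def resolve(selectors: Seq[Selector], user: String, source: String): Option[String] =
    selectors.collectFirst {
      case s if s.user.forall(p => user.matches(p)) &&
                s.source.forall(p => source.matches(p)) =>
        s.group.replace("${USER}", user)
    }
}
```

Under this model, a query from bob always lands in `general` because that rule comes first, while any other user submitting from an adhoc source gets a per-user group.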
Resource Group
[Diagram: rootGroup with child groups general and adhoc, with Bob's queries routed into a group. Limit boxes shown:
 (softMemoryLimit: 100%, maxQueued: 5000, maxRunning: 1000);
 (softMemoryLimit: 100%, maxQueued: 100, maxRunning: 200);
 (softMemoryLimit: 100%, maxRunning: 1000)]
Resource Group DI
Resource group configuration behavior can be changed easily
with Guice injection.
- ResourceGroupConfigurationManager
  - configure(ResourceGroup, SelectionContext)
- ResourceGroupSelector
  - match(Statement, SelectionContext)
SelectionContext
SelectionContext holds the information used to associate a
submitted query with a resource group.
- Authenticated
- User
- Source
- Query Priority
These are currently available by default.
{
  "runningQueryIds": ["query1", "query2"],
  "accountId": 1,
  "children": [{
    "memoryUsage": 12345,
    "runningQueryIds": ["query1"],
    "children": [],
    "runningQueries": 1,
    "queuedQueries": 0,
    "maxRunningQueries": 2,
    "resourceId": "general"
  }, {
    "memoryUsage": 26296,
    "runningQueryIds": ["query2"],
    "children": [],
    "runningQueries": 1,
    "queuedQueries": 0,
    "maxRunningQueries": 2,
    "resourceId": "scheduled"
  }],
  "runningQueries": 2,
  "maxRunningQueries": 30
}
Annotations:
- Top-level runningQueryIds: queries in parent group
- query1: running query in general
- query2: running query in scheduled
Recap
A distributed system often requires each
component to be stable and scalable. We can
make the Presto ecosystem reliable through:
- Reliable code modification with DI
- VCR testing
- Multi-dimensional resource scheduling
- Resource isolation, which makes a multi-tenant
distributed SQL engine reliable
