How to ensure Presto scalability in multi use case

Kai Sasaki
Treasure Data Inc.

Kai Sasaki (@Lewuathe)
Software Engineer at Treasure Data Inc.
Hadoop/Presto/Spark

Presto In TD
• 150000+ queries / day
• 190+ TB processing / day
• 10+ MB processing / query * sec
• 100+ million processed records / query

Presto In TD
[Architecture diagram. Components: BI Tool, TD API, Prestobase Proxy, PerfectQueue, Presto, Plazma. Edge labels: HTTP, query, data.]

How to make it scalable
• Prestobase Proxy
• Node scheduler
• Resource Group

Prestobase proxy
Prestobase proxy provides an interface to Presto,
mainly for BI tools connecting through JDBC/ODBC,
and is also intended to replace Prestogres.

Prestobase proxy
• Written in Scala
• Finagle-based RPC proxy
• Runs as a Docker container
• Uses Airframe
• VCR-based lightweight test framework

Finagle
Finagle is an extensible RPC system for the JVM,
used to construct high-concurrency servers.
Finagle implements uniform client and server
APIs for several protocols, and is designed for
high performance and concurrency.
see: https://twitter.github.io/finagle/

Finagle
protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
    bind[AnotherHandler] andThen
    LastFilter andThen
    prestoClient
Build the request pipeline by binding
filters and handlers with Airframe
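
The chaining idea can be sketched in plain Scala without Finagle or Airframe. This is only an illustration of how `andThen` composes filters in front of a final service. The `Filter` trait, `TrimFilter`, `TagFilter`, and `fakePrestoClient` are hypothetical names, and the service is synchronous here, whereas Finagle services return a Future.

```scala
// Plain-Scala sketch of filter chaining; it mimics, but does not use, Finagle.
object PipelineSketch {
  // A service turns a request into a response (synchronous for brevity).
  type Service = String => String

  // A filter wraps the next service and may rewrite the request or response.
  trait Filter { self =>
    def apply(request: String, next: Service): String

    // Chain this filter in front of another filter.
    def andThen(other: Filter): Filter = new Filter {
      def apply(request: String, next: Service): String =
        self.apply(request, r => other.apply(r, next))
    }

    // Terminate the chain with a service.
    def andThen(service: Service): Service =
      request => self.apply(request, service)
  }

  // Hypothetical filters, for illustration only.
  object TrimFilter extends Filter {
    def apply(request: String, next: Service): String = next(request.trim)
  }
  object TagFilter extends Filter {
    def apply(request: String, next: Service): String = "[proxied] " + next(request)
  }

  // Stand-in for the Presto client at the end of the chain.
  val fakePrestoClient: Service = sql => s"executed: $sql"

  val service: Service = TrimFilter andThen TagFilter andThen fakePrestoClient
}
```

Each filter sees the request before the next stage and the response after it, so cross-cutting concerns stay out of the client itself.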

Airframe
Airframe is a trait-based dependency injection
framework using Scala macros
- https://github.com/wvlet/airframe

Airframe
- Dependency injection tailored to Scala
- Tagged binding with wvlet
https://github.com/wvlet/wvlet
- Object lifecycle management

Airframe
import wvlet.airframe._

val design: Design =
  newDesign
    .bind[X].toInstance(new X) // Bind type X to a concrete instance
    .bind[Y].toSingleton       // Bind type Y to a singleton object
    .bind[Z].to[ZImpl]         // Bind type Z to an instance of ZImpl

trait App {
  val x = bind[X]
  val y = bind[Y]
  val z = bind[Z]
  // Do something with X, Y, and Z
}

// Build App with its dependencies injected from the design
val session = design.newSession
val app: App = session.build[App]

VCR testing framework
Record the HTTP interactions of the test suite
to make tests stable and deterministic.
For more detail, see:
https://testing.googleblog.com/2016/11/what-test-engineers-do-at-google.html

On CI:
  QueryRewriter andThen
    bind[RequestVCR] andThen
    prestoClient

On Production:
  QueryRewriter andThen
    bind[NoRecording] andThen
    prestoClient

[Diagram: Recording. Components: Prestobase, RequestVCRClient, SQLite.]

[Diagram: Replaying. Components: Prestobase, RequestVCRClient, SQLite.]
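
The record/replay idea can be sketched as follows. This is a minimal illustration and not the Prestobase implementation: the tape is an in-memory map rather than SQLite, requests and responses are plain strings, and `VcrSketch`, `recording`, and `replaying` are hypothetical names.

```scala
import scala.collection.mutable

// Sketch of record/replay ("VCR") testing; all names are hypothetical.
object VcrSketch {
  type Client = String => String

  // Recording mode: forward each request to the real backend and store the response.
  def recording(backend: Client, tape: mutable.Map[String, String]): Client =
    request => {
      val response = backend(request)
      tape(request) = response
      response
    }

  // Replaying mode: answer from the tape only, never touching the backend.
  def replaying(tape: collection.Map[String, String]): Client =
    request => tape.getOrElse(request, sys.error(s"no recording for: $request"))
}
```

Because replay never calls the backend, a test run depends only on what was recorded, which is what makes it deterministic.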

Prestobase proxy
Will be open sourced soon

Node Scheduler
Query submission proceeds as follows:
- Analyze the query AST
- Build the logical/physical query plan
- Schedule each stage

Node Scheduler
[Diagram: a query is divided into stages (stage2, stage1, stage0), each made up of tasks (e.g. task2-0, task2-1, task1-0, task1-1, task0-0). Labels: Table Scan, output.]

Node Scheduler
NodeScheduler creates a NodeSelector, which
selects the worker nodes on which tasks are
scheduled. The NodeSelector picks worker
nodes when there are splits available.

Node Scheduler in TD
Keeps a map of worker nodes that are
candidates for launching the next tasks.
- Ignore min candidates
- Limit by available memory pool

Memory pool usage returns to normal after the task is completed.
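
Limiting candidates by available memory could look roughly like the sketch below. It is an assumption-laden illustration, not Presto's or TD's NodeSelector code: `Worker`, `freeMemoryBytes`, and `minFreeBytes` are hypothetical names.

```scala
// Hypothetical model of candidate filtering; not a Presto class.
object NodeCandidateSketch {
  final case class Worker(id: String, freeMemoryBytes: Long)

  // Keep only the workers whose free memory pool is at least `minFreeBytes`.
  def candidates(workers: Seq[Worker], minFreeBytes: Long): Seq[Worker] =
    workers.filter(_.freeMemoryBytes >= minFreeBytes)
}
```

A worker excluded here becomes a candidate again once its running tasks finish and its memory pool usage drops.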

Challenges
- Smoothing CPU time metric
- Split type awareness
- Avoid problematic worker nodes

Resource Group
Resource groups were introduced in Presto 0.147
→ https://prestodb.io/docs/current/admin/resource-groups.html
A resource group limits resource
usage by account/group/query.

Resource Group
[Diagram: resource group tree. rootGroup has child groups general and adhoc. Limits shown: softMemoryLimit: 100%; maxQueued: 5000, maxRunning: 1000; maxQueued: 100, maxRunning: 200; maxRunning: 1000.]

Resource Group limits
- maxQueued
- maxRunning
- softMemoryLimit: once exceeded, subsequent queries are queued
- softCpuLimit: once exceeded, a penalty is imposed on the max running queries
- hardCpuLimit: once exceeded, subsequent queries are queued
Resource Group scheduling
- schedulingPolicy
  - fair: FIFO
  - weighted: selected stochastically
  - query_priority: selected according to priority
- schedulingWeight
Resource Group
Every query must be associated with a resource
group. Matching is done by the
configured selectors.
{
  "user": "bob", "group": "general"
},
{
  "source": ".*adhoc.*", "group": "global.adhoc.adhoc_${USER}"
}
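
Selector matching can be sketched as a first-match lookup over the two entries above. This is a simplified model and not Presto's selector implementation: `Selector` and `select` are hypothetical names, and it assumes an absent field matches anything and that `${USER}` is replaced with the query's user name.

```scala
import scala.util.matching.Regex

object SelectorSketch {
  // Hypothetical model of one selector entry; an absent field matches anything.
  final case class Selector(user: Option[Regex], source: Option[Regex], group: String)

  // A pattern matches only if it covers the whole value.
  private def matches(pattern: Option[Regex], value: String): Boolean =
    pattern.forall(_.pattern.matcher(value).matches())

  // Return the group of the first selector whose patterns all match,
  // expanding the ${USER} placeholder with the query's user name.
  def select(selectors: Seq[Selector], user: String, source: String): Option[String] =
    selectors
      .find(s => matches(s.user, user) && matches(s.source, source))
      .map(_.group.replace("${USER}", user))

  // The two selectors shown on the slide.
  val selectors: Seq[Selector] = Seq(
    Selector(Some("bob".r), None, "general"),
    Selector(None, Some(".*adhoc.*".r), "global.adhoc.adhoc_${USER}")
  )
}
```

Order matters in this model: a query from bob lands in general even when its source also contains "adhoc".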

Resource Group
[Diagram: the resource group tree (rootGroup with child groups general and adhoc) with Bob's queries placed in it. Limits shown: maxQueued: 5000, maxRunning: 1000; maxQueued: 100, maxRunning: 200; maxRunning: 1000.]

Resource Group DI
Resource group configuration behavior can
easily be changed with Guice injection.
- ResourceGroupConfigurationManager
  - configure(ResourceGroup, SelectionContext)
- ResourceGroupSelector
  - match(Statement, SelectionContext)

SelectionContext
SelectionContext holds the information used to associate
a submitted query with a resource group.
The following fields are currently available by default:
- Authenticated
- User
- Source
- Query Priority

{
  "runningQueryIds": ["query1", "query2"],
  "accountId": 1,
  "children": [{
    "memoryUsage": 12345,
    "runningQueryIds": ["query1"],
    "children": [],
    "runningQueries": 1,
    "queuedQueries": 0,
    "maxRunningQueries": 2,
    "resourceId": "general"
  }, {
    "memoryUsage": 26296,
    "runningQueryIds": ["query2"],
    "children": [],
    "queuedQueries": 0,
    "resourceId": "scheduled"
  }]
}
Callouts:
- Top-level runningQueryIds: queries in the parent group
- First child: running query in general
- Second child: running query in scheduled

Recap
A distributed system often requires each
component to be stable and scalable. We can
make the Presto ecosystem reliable through:
- Reliable code modification with DI
- VCR testing
- Multi-dimensional resource scheduling
- Resource isolation, which makes a multi-tenant
distributed SQL engine reliable
