Presto Overview
Shixiong Zhu
Overview

Register
Ask active nodes

Discovery
Server
Coordinator
SQL
SQL
QueryInfo

SQLQueryManager

QueryResults
NextUri
CLI

SQLQueryExecution

StatementResource

QueryStarter

…

HttpRemoteTask
Fetch Data

QueryResults

Coordinator

Partial Data

OutputReceiver

Worker
SubPlan
ExchangeNode

AggregationNode(FINAL)

Plan
TableScanNode

OutputNode

FilterNode
SubPlan

AggregationNode

TableScanNode
FilterNode

OutputNode
AggregationNode(PARTIAL)
SinkNode
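The split into AggregationNode(PARTIAL) on the workers and AggregationNode(FINAL) behind the ExchangeNode can be illustrated with a minimal sketch. The function names and sample data are hypothetical and are not Presto code; the sketch only shows why per-worker partial results can be merged into the final answer.

```python
from collections import Counter


def partial_aggregate(rows):
    """Hypothetical stand-in for AggregationNode(PARTIAL): count rows per key on one worker."""
    counts = Counter()
    for key in rows:
        counts[key] += 1
    return counts


def final_aggregate(partials):
    """Hypothetical stand-in for AggregationNode(FINAL): merge the partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Each inner list plays the role of the rows scanned by one worker (made-up data).
worker_rows = [["a", "b", "a"], ["b", "c"], ["a"]]
partials = [partial_aggregate(rows) for rows in worker_rows]
result = final_aggregate(partials)
```

Only the small partial counts cross the exchange, not the raw rows.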
SubPlan

T: TableScanNode
A: AggregationNode
E: ExchangeNode

E

E

A(FINAL)

A(FINAL)

Plan
T

JoinNode
T
OutputNode

A

A
SubPlan

SubPlan

T

JoinNode

T

A(PARTIAL)

A(PARTIAL)

SinkNode

SinkNode

OutputNode
Stage

Task

Worker

Results

Stage
Stage

Worker

Coordinator

Worker

Worker

Worker
Worker
LocalExecutionPlan

SubPlan

Node1

Op1

Node2

Op2

Node3

Op3

…

…

Node3

Opn
LocalExecutionPlan

SubPlan
Node1

Node2

LocalExecutionPlan

Op1

Op2

SourceHash
JoinNode

HashJoinOperator

Node3

Op3

HashBuilderOperator
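The relationship between HashBuilderOperator, SourceHash and HashJoinOperator can be sketched as a classic build/probe hash join. This is an illustrative sketch only: the function names, the tuple-based rows and the sample tables are assumptions, not Presto's actual classes or signatures.

```python
from collections import defaultdict


def build_hash(build_rows, key_index):
    """Plays the role of HashBuilderOperator: index the build side by join key.

    The returned dict stands in for SourceHash.
    """
    table = defaultdict(list)
    for row in build_rows:
        table[row[key_index]].append(row)
    return table


def probe(probe_rows, key_index, source_hash):
    """Plays the role of HashJoinOperator: look up each probe row in the hash."""
    for row in probe_rows:
        for match in source_hash.get(row[key_index], []):
            yield row + match


# Made-up sample data: (customer_id, order) joined with (customer_id, name).
orders = [(1, "o1"), (2, "o2"), (1, "o3")]
customers = [(1, "alice"), (3, "carol")]

source_hash = build_hash(customers, key_index=0)
joined = list(probe(orders, key_index=0, source_hash=source_hash))
```

The build side must be fully consumed before probing starts, which is why the two operators are linked through a shared structure rather than by passing pages directly.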
Page (max page size: 1 MB, max rows: 16 * 1024)

Row

Block

Slice
A byte array

Block

Block

Block

Block
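The size relationship between Slice, Block and Page (a Page's size is the sum of its Block sizes, and a Block's size is the size of its byte array) can be sketched as follows. The class and method names are hypothetical simplifications; only the two limits come from the slide.

```python
from dataclasses import dataclass
from typing import List

MAX_PAGE_SIZE_BYTES = 1024 * 1024  # 1 MB, from the slide
MAX_PAGE_ROWS = 16 * 1024          # from the slide


@dataclass
class Block:
    """Simplified Block backed by a single byte array (the 'Slice')."""
    slice_bytes: bytes

    def size_in_bytes(self) -> int:
        return len(self.slice_bytes)


@dataclass
class Page:
    """Simplified Page: a list of Blocks, one per column."""
    blocks: List[Block]

    def size_in_bytes(self) -> int:
        return sum(block.size_in_bytes() for block in self.blocks)

    def within_size_limit(self) -> bool:
        return self.size_in_bytes() <= MAX_PAGE_SIZE_BYTES


page = Page([Block(bytes(100)), Block(bytes(28))])
```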
Split

Split

Split

Split

Split

Split

Split
Is the data ready?

Register a callback

N
When the data of this Split is ready, put the Split back.

Y
Fetch one Page

Execute Operator
Y

Has next Operator?

N

N

TaskExecutor

Is the Split done?

Thread number = core number * 4

Y

Y

N
Time's up?
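The TaskExecutor loop above (run a Split for a bounded slice, then put it back in the queue if it is not done) can be sketched as a round-robin scheduler. To keep the example deterministic, the time quantum is replaced by a hypothetical budget of pages per turn; the function name and the dict-based Split representation are also assumptions.

```python
from collections import deque


def run_splits(splits, pages_per_quantum):
    """Round-robin over splits.

    splits: dict mapping a split name to the number of pages it still has to process.
    pages_per_quantum: stand-in for the time limit; a split yields after this many pages.
    Returns the list of (split name, pages processed) turns in execution order.
    """
    queue = deque(splits.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        processed = min(remaining, pages_per_quantum)
        remaining -= processed
        schedule.append((name, processed))
        if remaining > 0:
            # "Time's up" and the Split is not done: put it back in the queue.
            queue.append((name, remaining))
    return schedule


schedule = run_splits({"s1": 5, "s2": 2}, pages_per_quantum=2)
```

Bounding each turn keeps a long-running Split from starving the short ones that share the same threads.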
Execution Operators
Op1

page = op1.getOutput
op2.addInput(page)

Op2
page = op2.getOutput
op3.addInput(page)
Op3

…

Opn
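The hand-off shown above (take a page from one operator with getOutput, feed it to the next with addInput) can be sketched as a small driver loop. The operator classes and the list-based pages are hypothetical simplifications rather than Presto's operator interface.

```python
class SourceOperator:
    """First operator in the chain: produces pages and takes no input."""

    def __init__(self, pages):
        self._pages = list(pages)

    def get_output(self):
        return self._pages.pop(0) if self._pages else None


class TransformOperator:
    """Middle operator: buffers one input page and emits the transformed page."""

    def __init__(self, transform):
        self._transform = transform
        self._pending = None

    def add_input(self, page):
        self._pending = self._transform(page)

    def get_output(self):
        page, self._pending = self._pending, None
        return page


def drive(operators):
    """Move pages down the chain: page = op[i].get_output(); op[i+1].add_input(page)."""
    results = []
    while True:
        page = operators[0].get_output()
        if page is None:
            return results
        for op in operators[1:]:
            op.add_input(page)
            page = op.get_output()
        results.append(page)


pipeline = [
    SourceOperator([[1, 2, 3], [4, 5, 6]]),
    TransformOperator(lambda page: [v for v in page if v % 2 == 0]),  # filter
    TransformOperator(lambda page: [v * 10 for v in page]),           # project
]
output_pages = drive(pipeline)
```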
Input
TableScanOperator

HiveSplit

DataStreamManager
RecordSetDataStreamProvider
RecordProjectOperator

ConnectorDataStreamProvider
HiveRecordSet

HiveClient

ConnectorDataStreamProvider
HiveSplit

InputFormat

RecordReader
HiveRecordSet

Lines

TableScanOperator
RecordProjectOperator
Page
Next Operator
Load Balance
NodeMap

Split

Map: Rack -> Nodes

NodeSelector

NodeScheduler
Map: Host -> Nodes

Map: Host:Port -> Nodes

Node
NodeSelector.selectNode
• Select acceptable nodes (at least 10 nodes by default)
– Nodes that have the same address as the Split
– If not enough, add nodes in the same rack
– If not enough, randomly select nodes in other racks

• Select the node with the smallest number of assignments (pending tasks)
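The selection steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the dict-based node records, the `assignments` map and the function signature are hypothetical, and only the order of the steps and the default of 10 candidates come from the slide.

```python
import random


def select_node(split_host, split_rack, nodes, assignments, min_candidates=10, rng=random):
    """Pick a node for a split.

    nodes: list of dicts with "id", "host" and "rack" keys (hypothetical shape).
    assignments: dict mapping node id to its number of pending tasks.
    """
    # 1. Nodes at the same address as the split.
    candidates = [n for n in nodes if n["host"] == split_host]
    # 2. If not enough, add nodes in the same rack.
    if len(candidates) < min_candidates:
        same_rack = [n for n in nodes if n["rack"] == split_rack and n not in candidates]
        candidates += same_rack[: min_candidates - len(candidates)]
    # 3. If still not enough, add randomly chosen nodes from other racks.
    if len(candidates) < min_candidates:
        others = [n for n in nodes if n not in candidates]
        rng.shuffle(others)
        candidates += others[: min_candidates - len(candidates)]
    # 4. Among the candidates, take the one with the fewest pending tasks.
    return min(candidates, key=lambda n: assignments.get(n["id"], 0))


nodes = [
    {"id": "n1", "host": "h1", "rack": "r1"},
    {"id": "n2", "host": "h2", "rack": "r1"},
    {"id": "n3", "host": "h3", "rack": "r2"},
]
assignments = {"n1": 5, "n2": 1, "n3": 0}
chosen = select_node("h1", "r1", nodes, assignments, min_candidates=2)
```

With `min_candidates=2` the local node and its rack neighbour are enough, so the idle node in the other rack is never considered; locality is preferred over pure load.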
Output
• Only SELECT statements are supported
– Currently, query results are streamed to the client
Communication
• Protocol: HTTP
• Data Format: JSON
• Every instance has one server and one client
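Combined with the QueryResults and NextUri labels on the overview slide, the HTTP and JSON exchange can be sketched as a client that keeps following the next URI until none is returned. The JSON field names `data` and `nextUri`, the URIs and the in-memory fake server are assumptions made for illustration; no real HTTP call is made.

```python
import json


def fetch_all_rows(first_uri, fetch):
    """Follow next-URI links and collect rows.

    fetch: callable taking a URI and returning the JSON response body as a string.
    """
    rows = []
    uri = first_uri
    while uri is not None:
        results = json.loads(fetch(uri))
        rows.extend(results.get("data", []))  # a response may carry no data yet
        uri = results.get("nextUri")          # absent once the query is finished
    return rows


# Hypothetical responses standing in for the coordinator.
FAKE_SERVER = {
    "/q/1": json.dumps({"data": [[1]], "nextUri": "/q/2"}),
    "/q/2": json.dumps({"nextUri": "/q/3"}),
    "/q/3": json.dumps({"data": [[2], [3]]}),
}

rows = fetch_all_rows("/q/1", FAKE_SERVER.__getitem__)
```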
Q&A


Editor's Notes

  • #9 A SubPlan is converted by LocalExecutionPlanner into a LocalExecutionPlan, which contains a sequence of operators.
  • #10 HashJoinOperator and HashBuilderOperator are connected by SourceHash, which contains the output of HashBuilderOperator.
  • #11 You can think of a Slice as a byte array. The Slice size is the array size, the Block size is the Slice size, and the Page size is the sum of all the Block sizes.
  • #12 By default, each Split is allowed to run for only 1 s at a time. When the time is up, the Split is put back in the queue.
  • #14 RecordSetDataStreamProvider is a subclass of ConnectorDataStreamProvider.
  • #16 When DiscoveryNodeManager receives a query for Node information, it checks whether its cache has expired (5 seconds). If so, it asks the ServiceSelector for the active nodes and drops the failed nodes. ServiceSelector fetches the new node list from the Discovery Server every 10 s by default. A thread in HeartbeatFailureDetector sends a heartbeat to every active node every 500 ms by default.
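The cache check described in note #16 can be sketched as a small time-to-live cache. The class name, constructor parameters and injectable clock are hypothetical; only the 5-second expiry comes from the note.

```python
import time


class NodeCache:
    """Caches the active node list and refreshes it once the TTL has elapsed."""

    def __init__(self, fetch_active_nodes, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch = fetch_active_nodes  # stand-in for asking the ServiceSelector
        self._ttl = ttl_seconds
        self._clock = clock               # injectable so the sketch can be tested
        self._nodes = None
        self._loaded_at = 0.0

    def get_nodes(self):
        now = self._clock()
        if self._nodes is None or now - self._loaded_at >= self._ttl:
            self._nodes = list(self._fetch())
            self._loaded_at = now
        return self._nodes


# Usage with a fake clock and a fetch function that records how often it is called.
fake_now = [0.0]
fetch_calls = []


def fetch_active_nodes():
    fetch_calls.append(fake_now[0])
    return ["worker-1", "worker-2"]


cache = NodeCache(fetch_active_nodes, ttl_seconds=5.0, clock=lambda: fake_now[0])
first = cache.get_nodes()   # cache is empty, so this fetches
fake_now[0] = 4.0
second = cache.get_nodes()  # still fresh, served from the cache
fake_now[0] = 5.0
third = cache.get_nodes()   # expired, so this fetches again
```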