Apache Helix: Simplifying Distributed Systems
Kanak Biscuitwala and Jason Zhang
helix.incubator.apache.org
@apachehelix
Outline
• Background
• Resource Assignment Problem
• Helix Concepts
• Putting Concepts to Work
• Getting Started
• Plugins
• Current Status
Building distributed systems is hard.
• Load balancing
• Responding to node entry and exit
• Alerting based on metrics
• Managing data replicas
• Supporting event listeners
Helix abstracts away problems distributed systems need to solve.
• Load balancing
• Responding to node entry and exit
• Alerting based on metrics
• Managing data replicas
• Supporting event listeners
System Lifecycle
• Single Node
• Multi-Node: partitioning, discovery, co-location, replication
• Fault Tolerance: fault detection, recovery
• Cluster Expansion: redistribute data, throttle movement
Resource Assignment Problem

Resource Assignment Problem
A set of resources must be assigned to a set of nodes.

Resource Assignment Problem
Sample Allocation
With four nodes, each node serves 25% of the resources.

Resource Assignment Problem
Failure Handling
When a node fails, the remaining three nodes split its share (33% / 33% / 34%).
Resource Assignment Problem
Making it Work: Take 1 (ZooKeeper)
• Application on ZooKeeper alone: ZooKeeper provides low-level primitives (file system, locks, ephemeral nodes).
• We need high-level primitives: node, partition, replica, state, transition.
• Application on Helix: Helix provides these high-level primitives on top of a consensus system (ZooKeeper).
Resource Assignment Problem
Making it Work: Take 2 (Decisions by Nodes)
• A service (S) running on each node watches the consensus system for config changes and node changes, and writes node updates back.
• Problems: multiple brains, app-specific logic on every node, unscalable traffic to the consensus system.
Resource Assignment Problem
Making it Work: Take 3 (Single Brain)
• A single controller watches the consensus system for config changes and node changes, and sends node updates to the services.
• Node logic is drastically simplified!
Resource Assignment Problem
Helix View
• Controllers manage the assignment of resources onto nodes (participants).
• Spectators observe the resulting assignment.
• Question: How do we make this controller generic enough to work for different resources?
Helix Concepts

Helix Concepts
Resources
A resource is divided into partitions. All partitions can be replicated.
Helix Concepts
Declarative State Model
Each replica moves through a declared set of states, e.g. Offline, Slave, and Master.
Helix Concepts
Constraints: Augmenting the State Model

State Constraints
MASTER: [1, 1]
SLAVE: [0, R]

Special Constraint Values
R: Replica count per partition
N: Number of participants

Transition Constraints
Cluster scope: OFFLINE-SLAVE, 3 concurrent
Resource R1 scope: SLAVE-MASTER, 1 concurrent
Participant P4 scope: OFFLINE-SLAVE, 2 concurrent

States and transitions are ordered by priority in computing replica states.
Transition constraints can be restricted to cluster, resource, and participant scopes. The most restrictive constraint is used.
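
Such transition constraints are typically registered with the cluster as message constraints. A minimal sketch, assuming a Helix 0.6-style admin API (the ZooKeeper address, cluster name, and constraint id below are illustrative, and imports are omitted):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
// Throttle OFFLINE-SLAVE transitions to 3 concurrent across the cluster.
ConstraintItemBuilder constraint = new ConstraintItemBuilder();
constraint.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
          .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
          .addConstraintAttribute("CONSTRAINT_VALUE", "3");
admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
    "OfflineToSlaveThrottle", constraint.build());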
Helix Concepts
Resources and the Augmented State Model
A resource is divided into partitions, and all partitions can be replicated.
Each replica is in a state (e.g. master, slave, offline) governed by the augmented state model.
Helix Concepts
Objectives
• Partition Placement: distribution policy for partitions and replicas; making effective use of the cluster and the resource.
• Failure and Expansion Semantics: creating new replicas and assigning states; changing existing replica states.
Putting Concepts to Work

Rebalancing Strategies
Meeting Objectives within Constraints

Mode          Replica Placement                            Replica State
Full-Auto     Helix                                        Helix
Semi-Auto     App                                          Helix
Customized    App                                          App
User-Defined  App code plugged into the Helix controller   App code plugged into the Helix controller
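
Which mode applies is a per-resource setting on its ideal state. A minimal sketch, assuming a Helix 0.6-style admin API (cluster and resource names are illustrative):

IdealState idealState = admin.getResourceIdealState("MyCluster", "MyDB");
// Delegate placement to the app but let Helix assign states.
idealState.setRebalanceMode(IdealState.RebalanceMode.SEMI_AUTO);
admin.setResourceIdealState("MyCluster", "MyDB", idealState);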
Rebalancing Strategies
Full-Auto
Example: Node 1 holds P1: M and P2: S, Node 2 holds P2: M and P3: S, Node 3 holds P3: M and P1: S.
When cluster membership changes, Helix recomputes both replica placement and replica states.
By default, Helix optimizes for minimal movement and even distribution of partitions and states.
Rebalancing Strategies
Semi-Auto
Example: Node 1 holds P1: M and P2: S, Node 2 holds P2: M and P3: S, Node 3 holds P3: M and P1: S.
If Node 3 goes down, the remaining replicas stay in place and Helix promotes the existing slave of P3 on Node 2 to master.
Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints.
This is ideal for resources that are expensive to move.
Rebalancing Strategies
Customized
The app specifies the location and state of each replica. Helix still ensures that transitions are fired according to constraints.
Need to respond to node changes? Use the Helix custom code invoker to run on one participant, or...
Rebalancing Strategies
User-Defined
• Node joins or leaves the cluster
• Helix controller invokes code plugged in by the app
• Rebalancer implemented by the app computes replica placement and state
• Helix fires transitions without violating constraints

The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix rebalancers implement the same interface.
Rebalancing Strategies
User-Defined: Distributed Lock Manager
Each lock is a partition! A lock is Locked, Released, or Offline, and the locks are spread across the nodes.
When a node (e.g. Node 3) joins or leaves, the user-defined rebalancer reassigns the locks across the current set of nodes.
Rebalancing Strategies
User-Defined: Distributed Lock Manager

public ResourceAssignment computeResourceMapping(
    Resource resource, IdealState currentIdealState,
    CurrentStateOutput currentStateOutput, ClusterDataCache clusterData) {
  ...
  int i = 0;
  // Round-robin: assign each lock (partition) to one live participant in the LOCKED state.
  for (Partition partition : resource.getPartitions()) {
    Map<String, String> replicaMap = new HashMap<String, String>();
    int participantIndex = i % liveParticipants.size();
    String participant = liveParticipants.get(participantIndex);
    replicaMap.put(participant, "LOCKED");
    assignment.addReplicaMap(partition, replicaMap);
    i++;
  }
  return assignment;
}
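
To have the controller call this rebalancer, the resource is put in USER_DEFINED mode and pointed at the rebalancer class. A sketch, assuming a later 0.6.x IdealState API that exposes a rebalancer class name setter (the cluster, resource, and class names are illustrative):

IdealState idealState = admin.getResourceIdealState("MyCluster", "LockGroup");
idealState.setRebalanceMode(IdealState.RebalanceMode.USER_DEFINED);
// Assumption: the lock-manager rebalancer above lives in a class named LockManagerRebalancer.
idealState.setRebalancerClassName(LockManagerRebalancer.class.getName());
admin.setResourceIdealState("MyCluster", "LockGroup", idealState);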
Controller
Fault Tolerance
Controllers follow their own Offline / Standby / Leader state model: the augmented state model concept applies to controllers too!
Controller
Scalability
A single controller can manage multiple clusters: Controller 1 manages Clusters 1-3, Controller 2 manages Clusters 4 and 5, and Controller 3 manages Cluster 6.
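
Starting a controller is a one-liner; a minimal sketch (HelixControllerMain also accepts a DISTRIBUTED mode for the multi-cluster setup above; the ZooKeeper address and names are illustrative):

HelixManager controller = HelixControllerMain.startHelixController(
    "localhost:2181", "Cluster1", "controller_1", HelixControllerMain.STANDALONE);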
ZooKeeper View
Ideal State
Replica placement and state for each partition (e.g. P1 -> N1: M, N2: S; P2 -> N2: M, N1: S) is stored as JSON:

{
  "id" : "SampleResource",
  "simpleFields" : {
    "REBALANCE_MODE" : "USER_DEFINED",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "2",
    "STATE_MODEL_DEF_REF" : "MasterSlave",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  },
  "mapFields" : {
    "SampleResource_0" : {
      "node1_12918" : "MASTER",
      "node2_12918" : "SLAVE"
    }
    ...
  },
  "listFields" : {}
}
ZooKeeper View
Current State and External View
Current state (reported per participant): N1 reports P1: MASTER, P2: MASTER; N2 reports P1: OFFLINE, P2: OFFLINE.
External view (aggregated per resource): P1 -> N1: M, N2: O; P2 -> N1: M, N2: O.
Helix's responsibility is to make the external view match the ideal state as closely as possible.
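
Both views are readable through the admin API, which makes the mismatch easy to inspect; a minimal sketch (cluster and resource names are illustrative):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
IdealState ideal = admin.getResourceIdealState("MyCluster", "SampleResource");
ExternalView actual = admin.getResourceExternalView("MyCluster", "SampleResource");
// Compare desired vs. observed placement and states.
System.out.println(ideal.getRecord());
System.out.println(actual.getRecord());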
Logical Deployment
• ZooKeeper holds the cluster state.
• The Helix controller runs a Helix agent that talks to ZooKeeper.
• Each participant runs a Helix agent alongside its replicas (e.g. P1: M and P2: S on one participant, P2: M and P3: S on another, P3: M and P1: S on a third).
• Spectators also run a Helix agent to observe the cluster.
Getting Started

Example: Distributed Data Store
Partitions P.1 through P.12 are spread across Node 1, Node 2, and Node 3, each node holding a mix of master and slave replicas.

Partition Management
• multiple replicas
• 1 master
• even distribution

Fault Tolerance
• fault detection
• promote slave to master
• even distribution
• no SPOF

Elasticity
• minimize downtime
• minimize data movement
• throttle movement
Example: Distributed Data Store
Helix-Based Solution
• Define: state model, state transitions
• Configure: create cluster, add nodes, add resource, config rebalancer
• Run: start controller, start participants
Example: Distributed Data Store
State Model Definition: Master-Slave
• States: all possible states, with priority
• Transitions: legal transitions, with priority
• Applicable to each partition of a resource
• Example states: Offline, Slave, Master
Example: Distributed Data Store
State Model Definition: Master-Slave

builder = new StateModelDefinition.Builder("MasterSlave");
// add states and their ranks to indicate priority
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);
// set the initial state when participant starts
builder.initialState(OFFLINE);
// add transitions
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Example: Distributed Data Store
Defining Constraints
State and transition constraints can be applied at several scopes:

Scope       State   Transition
Partition   Y       Y
Resource    -       Y
Node        Y       Y
Cluster     -       Y

For this example, per partition: Master has StateCount=1 and Slave has StateCount=2 (states: Offline, Slave, Master).
Example: Distributed Data Store
Defining Constraints: Code

// static constraints
builder.upperBound(MASTER, 1);
// dynamic constraints
builder.dynamicUpperBound(SLAVE, "R");
// unconstrained
builder.upperBound(OFFLINE, -1);
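
Once states, transitions, and constraints are in place, the finished definition is registered with the cluster; a minimal sketch, assuming the builder from the previous slides (the ZooKeeper address and cluster name are illustrative):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
admin.addStateModelDef("MyCluster", "MasterSlave", builder.build());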
Example: Distributed Data Store
Participant Plug-In Code

@StateModelInfo(initialState="OFFLINE", states={"OFFLINE", "SLAVE", "MASTER"})
class DistributedDataStoreModel extends StateModel {
  @Transition(from="OFFLINE", to="SLAVE")
  public void fromOfflineToSlave(Message m, NotificationContext ctx) {
    // bootstrap data, setup replication, etc.
  }

  @Transition(from="SLAVE", to="MASTER")
  public void fromSlaveToMaster(Message m, NotificationContext ctx) {
    // catch up previous master, enable writes, etc.
  }
  ...
}
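
A participant wires this plug-in code in by registering a state model factory and connecting; a minimal sketch, assuming the single-argument factory method of Helix 0.6.x (instance name and ZooKeeper address are illustrative):

HelixManager manager = HelixManagerFactory.getZKHelixManager(
    "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");
StateMachineEngine engine = manager.getStateMachineEngine();
engine.registerStateModelFactory("MasterSlave", new StateModelFactory<StateModel>() {
  @Override
  public StateModel createNewStateModel(String partitionName) {
    return new DistributedDataStoreModel();  // one state model instance per partition
  }
});
manager.connect();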
Example: Distributed Data Store
Configure and Run
HelixAdmin --zkSvr <zk-address>
Create Cluster
--addCluster MyCluster
Add Participants
--addNode MyCluster localhost_12000
...
Add Resource
--addResource MyDB 16 MasterSlave SEMI_AUTO
Configure Rebalancer
--rebalance MyDB 3
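
The same configuration can be done through the Java admin API instead of the CLI; a minimal sketch (assuming Helix 0.6.x; host, port, and names are illustrative):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
admin.addCluster("MyCluster");
InstanceConfig instance = new InstanceConfig("localhost_12000");
instance.setHostName("localhost");
instance.setPort("12000");
admin.addInstance("MyCluster", instance);
admin.addResource("MyCluster", "MyDB", 16, "MasterSlave", "SEMI_AUTO");
admin.rebalance("MyCluster", "MyDB", 3);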
Example: Distributed Data Store
Spectator Plug-In Code

class RoutingLogic {
  public void write(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition, "MASTER");
    nodes.get(0).write(request);
  }

  public void read(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition);
    random(nodes).read(request);
  }
}
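
Behind routingTableProvider sits Helix's routing table support: a spectator connects, registers the provider as an external view listener, and then queries it per resource, partition, and state. A minimal sketch, assuming the Helix 0.6.x RoutingTableProvider (names are illustrative):

HelixManager spectator = HelixManagerFactory.getZKHelixManager(
    "MyCluster", "router_1", InstanceType.SPECTATOR, "localhost:2181");
spectator.connect();
RoutingTableProvider routingTableProvider = new RoutingTableProvider();
spectator.addExternalViewChangeListener(routingTableProvider);
// e.g. find the MASTER replica of partition MyDB_0 of resource MyDB:
List<InstanceConfig> masters = routingTableProvider.getInstances("MyDB", "MyDB_0", "MASTER");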
Example: Distributed Data Store
Where is the Code?
• Participants run the participant plug-in code (state model callbacks) and report node updates to the consensus system.
• The controller reads config changes and node changes from the consensus system and issues node updates (transitions) to participants.
• Spectators run the spectator plug-in code (routing logic).
Example: Distributed Search
Index shards P.1 through P.6 are replicated and spread across Node 1, Node 2, and Node 3.

Partition Management
• multiple replicas
• rack-aware placement
• even distribution

Fault Tolerance
• fault detection
• auto create replicas
• controlled creation of replicas

Elasticity
• redistribute partitions
• minimize data movement
• throttle movement
Example: Distributed Search
State Model Definition: Bootstrap
States: Idle, Offline, Bootstrap, Online, Error.
Transitions include: setup node, cleanup, recover, consume data to build index (Online replicas can serve requests), stop consuming data, and stop indexing and serving.
StateCount constraints (3 and 5) bound how many replicas of a partition may be in particular states.
Example: Distributed Search
Configure and Run
Create Cluster
--addCluster MyCluster
Add Participants
--addNode MyCluster localhost_12000
...
Add Resource
--addResource MyIndex 16 Bootstrap CUSTOMIZED
Configure Rebalancer
--rebalance MyIndex 8
Example: Message Consumers
Consumers (C1, C2, C3) are assigned partitions of a partitioned consumer queue. The assignment must handle initial assignment, scaling, and fault tolerance.

Partition Management
• one consumer per queue
• even distribution

Elasticity
• redistribute queues among consumers
• minimize movement

Fault Tolerance
• redistribute
• minimize data movement
• limit max queues per consumer
Example: Message Consumers
State Model Definition: Online-Offline
• Offline -> Online: start consumption
• Online -> Offline: stop consumption
• Online: StateCount = 1 (one consumer per queue)
• Max 10 queues per consumer (see the sketch below)
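
A minimal sketch of the per-consumer cap, assuming a Helix 0.6-style IdealState property for maximum partitions per instance (cluster and resource names are illustrative):

IdealState queues = admin.getResourceIdealState("MyCluster", "MyQueues");
queues.setMaxPartitionsPerInstance(10);  // at most 10 queue partitions per consumer
admin.setResourceIdealState("MyCluster", "MyQueues", queues);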
Example: Message Consumers
Participant Plug-In Code

@StateModelInfo(initialState="OFFLINE", states={"OFFLINE", "ONLINE"})
class MessageConsumerModel extends StateModel {
  @Transition(from="OFFLINE", to="ONLINE")
  public void fromOfflineToOnline(Message m, NotificationContext ctx) {
    // register listener
  }

  @Transition(from="ONLINE", to="OFFLINE")
  public void fromOnlineToOffline(Message m, NotificationContext ctx) {
    // unregister listener
  }
}
Plugins

Plugins
Overview
• Data-Driven Testing and Debugging
• Chaos Monkey
• Rolling Upgrade
• On-Demand Task Scheduling
• Intra-Cluster Messaging
• Health Monitoring
Plugins
Data-Driven Testing and Debugging
• Instrument ZK, controller, and participant logs
• Simulate execution with Chaos Monkey
• Analyze invariants like state and transition constraints
The exact sequence of events can be replayed: debugging made easy!
Plugins
Data-Driven Testing and Debugging: Sample Log File

timestamp     partition    participantName     sessionId                             state
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
Plugins
Data-Driven Testing and Debugging: Count Aggregation

Time    State     Slave Count   Participant
42632   OFFLINE   0             10.117.58.247_12918
42796   SLAVE     1             10.117.58.247_12918
43124   OFFLINE   1             10.202.187.155_12918
43131   OFFLINE   1             10.220.225.153_12918
43275   SLAVE     2             10.220.225.153_12918
43323   SLAVE     3             10.202.187.155_12918
85795   MASTER    2             10.220.225.153_12918

Error! The state constraint for SLAVE has an upper bound of 2.
Plugins
Data-Driven Testing and Debugging: Time Aggregation

Slave Count   Time        Percentage
0             1082319     0.5
1             35578388    16.46
2             179417802   82.99
3             118863      0.05

Master Count  Time        Percentage
0             1082319     0.5
1             35578388    16.46

83% of the time, there were 2 slaves to a partition.
93% of the time, there was 1 master to a partition.
We can see for exactly how long the cluster was out of whack.
Current Status
Helix at LinkedIn
Updates flow into the primary DB (Oracle, Espresso); Databus streams data change events to the read replicas and to the search index, graph index, and standardization services.
Coming Up Next
• New APIs
• Automatic scaling with YARN
• Non-JVM participants
Summary
• Helix: A generic framework for building distributed systems
• Abstraction and modularity allow for modifying and enhancing system behavior
• Simple programming model: declarative state machine
Questions?
website: helix.incubator.apache.org
dev mailing list: dev@helix.incubator.apache.org
user mailing list: user@helix.incubator.apache.org
twitter: @apachehelix