Apache Helix: Simplifying Distributed Systems
Kanak Biscuitwala and Jason Zhang
helix.incubator.apache.org
@apachehelix
Outline
• Background
• Resource Assignment Problem
• Helix Concepts
• Putting Concepts to Work
• Getting Started
• Plugins
• Current Status
Building distributed systems is hard.
• Load balancing
• Responding to node entry and exit
• Alerting based on metrics
• Managing data replicas
• Supporting event listeners
Helix abstracts away problems distributed systems need to solve.
• Load balancing
• Responding to node entry and exit
• Alerting based on metrics
• Managing data replicas
• Supporting event listeners
System Lifecycle
• Single Node
• Multi-Node: partitioning, discovery, co-location, replication
• Fault Tolerance: fault detection, recovery
• Cluster Expansion: redistribute data, throttle movement
Resource Assignment Problem

Resource Assignment Problem
A set of resources must be assigned to a set of nodes.

Resource Assignment Problem
Sample Allocation
With four nodes, each node serves 25% of the resources.

Resource Assignment Problem
Failure Handling
When a node fails, the remaining three nodes split its share (33% / 33% / 34%).
Resource Assignment Problem
Making it Work: Take 1 (ZooKeeper)
• Application on ZooKeeper alone: ZooKeeper provides low-level primitives (file system, locks, ephemeral nodes).
• We need high-level primitives: node, partition, replica, state, transition.
• Application on Helix: Helix provides these high-level primitives on top of a consensus system (ZooKeeper).
Resource Assignment Problem
Making it Work: Take 2 (Decisions by Nodes)
• A service (S) running on each node watches the consensus system for config changes and node changes, and writes node updates back.
• Problems: multiple brains, app-specific logic on every node, unscalable traffic to the consensus system.
Resource Assignment Problem
Making it Work: Take 3 (Single Brain)
• A single controller watches the consensus system for config changes and node changes, and sends node updates to the services.
• Node logic is drastically simplified!
Resource Assignment Problem
Helix View
• Controllers manage the assignment of resources onto nodes (participants).
• Spectators observe the resulting assignment.
• Question: How do we make this controller generic enough to work for different resources?
Helix Concepts

Helix Concepts
Resources
A resource is divided into partitions. All partitions can be replicated.
Helix Concepts
Declarative State Model
Each replica moves through a declared set of states, e.g. Offline, Slave, and Master.
Helix Concepts
Constraints: Augmenting the State Model

State Constraints
MASTER: [1, 1]
SLAVE: [0, R]

Special Constraint Values
R: Replica count per partition
N: Number of participants

Transition Constraints
Cluster scope: OFFLINE-SLAVE, 3 concurrent
Resource R1 scope: SLAVE-MASTER, 1 concurrent
Participant P4 scope: OFFLINE-SLAVE, 2 concurrent

States and transitions are ordered by priority in computing replica states.
Transition constraints can be restricted to cluster, resource, and participant scopes. The most restrictive constraint is used.
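
Such transition constraints are typically registered with the cluster as message constraints. A minimal sketch, assuming a Helix 0.6-style admin API (the ZooKeeper address, cluster name, and constraint id below are illustrative, and imports are omitted):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
// Throttle OFFLINE-SLAVE transitions to 3 concurrent across the cluster.
ConstraintItemBuilder constraint = new ConstraintItemBuilder();
constraint.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
          .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
          .addConstraintAttribute("CONSTRAINT_VALUE", "3");
admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
    "OfflineToSlaveThrottle", constraint.build());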
Helix Concepts
Resources and the Augmented State Model
A resource is divided into partitions, and all partitions can be replicated.
Each replica is in a state (e.g. master, slave, offline) governed by the augmented state model.
Helix Concepts
Objectives
• Partition Placement: distribution policy for partitions and replicas; making effective use of the cluster and the resource.
• Failure and Expansion Semantics: creating new replicas and assigning states; changing existing replica states.
Putting Concepts to Work

Rebalancing Strategies
Meeting Objectives within Constraints

Mode          Replica Placement                            Replica State
Full-Auto     Helix                                        Helix
Semi-Auto     App                                          Helix
Customized    App                                          App
User-Defined  App code plugged into the Helix controller   App code plugged into the Helix controller
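
Which mode applies is a per-resource setting on its ideal state. A minimal sketch, assuming a Helix 0.6-style admin API (cluster and resource names are illustrative):

IdealState idealState = admin.getResourceIdealState("MyCluster", "MyDB");
// Delegate placement to the app but let Helix assign states.
idealState.setRebalanceMode(IdealState.RebalanceMode.SEMI_AUTO);
admin.setResourceIdealState("MyCluster", "MyDB", idealState);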
Rebalancing Strategies
Full-Auto
Example: Node 1 holds P1: M and P2: S, Node 2 holds P2: M and P3: S, Node 3 holds P3: M and P1: S.
When cluster membership changes, Helix recomputes both replica placement and replica states.
By default, Helix optimizes for minimal movement and even distribution of partitions and states.
Rebalancing Strategies
Semi-Auto
Example: Node 1 holds P1: M and P2: S, Node 2 holds P2: M and P3: S, Node 3 holds P3: M and P1: S.
If Node 3 goes down, the remaining replicas stay in place and Helix promotes the existing slave of P3 on Node 2 to master.
Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints.
This is ideal for resources that are expensive to move.
Rebalancing Strategies
Customized
The app specifies the location and state of each replica. Helix still ensures that transitions are fired according to constraints.
Need to respond to node changes? Use the Helix custom code invoker to run on one participant, or...
Rebalancing Strategies
User-Defined
• Node joins or leaves the cluster
• Helix controller invokes code plugged in by the app
• Rebalancer implemented by the app computes replica placement and state
• Helix fires transitions without violating constraints

The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix rebalancers implement the same interface.
Rebalancing Strategies
User-Defined: Distributed Lock Manager
Each lock is a partition! A lock is Locked, Released, or Offline, and the locks are spread across the nodes.
When a node (e.g. Node 3) joins or leaves, the user-defined rebalancer reassigns the locks across the current set of nodes.
Rebalancing Strategies
User-Defined: Distributed Lock Manager

public ResourceAssignment computeResourceMapping(
    Resource resource, IdealState currentIdealState,
    CurrentStateOutput currentStateOutput, ClusterDataCache clusterData) {
  ...
  int i = 0;
  // Round-robin: assign each lock (partition) to one live participant in the LOCKED state.
  for (Partition partition : resource.getPartitions()) {
    Map<String, String> replicaMap = new HashMap<String, String>();
    int participantIndex = i % liveParticipants.size();
    String participant = liveParticipants.get(participantIndex);
    replicaMap.put(participant, "LOCKED");
    assignment.addReplicaMap(partition, replicaMap);
    i++;
  }
  return assignment;
}
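
To have the controller call this rebalancer, the resource is put in USER_DEFINED mode and pointed at the rebalancer class. A sketch, assuming a later 0.6.x IdealState API that exposes a rebalancer class name setter (the cluster, resource, and class names are illustrative):

IdealState idealState = admin.getResourceIdealState("MyCluster", "LockGroup");
idealState.setRebalanceMode(IdealState.RebalanceMode.USER_DEFINED);
// Assumption: the lock-manager rebalancer above lives in a class named LockManagerRebalancer.
idealState.setRebalancerClassName(LockManagerRebalancer.class.getName());
admin.setResourceIdealState("MyCluster", "LockGroup", idealState);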
Controller
Fault Tolerance
Controllers follow their own Offline / Standby / Leader state model: the augmented state model concept applies to controllers too!
Controller
Scalability
A single controller can manage multiple clusters: Controller 1 manages Clusters 1-3, Controller 2 manages Clusters 4 and 5, and Controller 3 manages Cluster 6.
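
Starting a controller is a one-liner; a minimal sketch (HelixControllerMain also accepts a DISTRIBUTED mode for the multi-cluster setup above; the ZooKeeper address and names are illustrative):

HelixManager controller = HelixControllerMain.startHelixController(
    "localhost:2181", "Cluster1", "controller_1", HelixControllerMain.STANDALONE);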
ZooKeeper View
Ideal State
Replica placement and state for each partition (e.g. P1 -> N1: M, N2: S; P2 -> N2: M, N1: S) is stored as JSON:

{
  "id" : "SampleResource",
  "simpleFields" : {
    "REBALANCE_MODE" : "USER_DEFINED",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "2",
    "STATE_MODEL_DEF_REF" : "MasterSlave",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  },
  "mapFields" : {
    "SampleResource_0" : {
      "node1_12918" : "MASTER",
      "node2_12918" : "SLAVE"
    }
    ...
  },
  "listFields" : {}
}
ZooKeeper View
Current State and External View
Current state (reported per participant): N1 reports P1: MASTER, P2: MASTER; N2 reports P1: OFFLINE, P2: OFFLINE.
External view (aggregated per resource): P1 -> N1: M, N2: O; P2 -> N1: M, N2: O.
Helix's responsibility is to make the external view match the ideal state as closely as possible.
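
Both views are readable through the admin API, which makes the mismatch easy to inspect; a minimal sketch (cluster and resource names are illustrative):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
IdealState ideal = admin.getResourceIdealState("MyCluster", "SampleResource");
ExternalView actual = admin.getResourceExternalView("MyCluster", "SampleResource");
// Compare desired vs. observed placement and states.
System.out.println(ideal.getRecord());
System.out.println(actual.getRecord());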
Logical Deployment
• ZooKeeper holds the cluster state.
• The Helix controller runs a Helix agent that talks to ZooKeeper.
• Each participant runs a Helix agent alongside its replicas (e.g. P1: M and P2: S on one participant, P2: M and P3: S on another, P3: M and P1: S on a third).
• Spectators also run a Helix agent to observe the cluster.
Getting Started

Example: Distributed Data Store
Partitions P.1 through P.12 are spread across Node 1, Node 2, and Node 3, each node holding a mix of master and slave replicas.

Partition Management
• multiple replicas
• 1 master
• even distribution

Fault Tolerance
• fault detection
• promote slave to master
• even distribution
• no SPOF

Elasticity
• minimize downtime
• minimize data movement
• throttle movement
Example: Distributed Data Store
Helix-Based Solution
• Define: state model, state transitions
• Configure: create cluster, add nodes, add resource, config rebalancer
• Run: start controller, start participants
Example: Distributed Data Store
State Model Definition: Master-Slave
• States: all possible states, with priority
• Transitions: legal transitions, with priority
• Applicable to each partition of a resource
• Example states: Offline, Slave, Master
Example: Distributed Data Store
State Model Definition: Master-Slave

builder = new StateModelDefinition.Builder("MasterSlave");
// add states and their ranks to indicate priority
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);
// set the initial state when participant starts
builder.initialState(OFFLINE);
// add transitions
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Example: Distributed Data Store
Defining Constraints
State and transition constraints can be applied at several scopes:

Scope       State   Transition
Partition   Y       Y
Resource    -       Y
Node        Y       Y
Cluster     -       Y

For this example, per partition: Master has StateCount=1 and Slave has StateCount=2 (states: Offline, Slave, Master).
Example: Distributed Data Store
Defining Constraints: Code

// static constraints
builder.upperBound(MASTER, 1);
// dynamic constraints
builder.dynamicUpperBound(SLAVE, "R");
// unconstrained
builder.upperBound(OFFLINE, -1);
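
Once states, transitions, and constraints are in place, the finished definition is registered with the cluster; a minimal sketch, assuming the builder from the previous slides (the ZooKeeper address and cluster name are illustrative):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
admin.addStateModelDef("MyCluster", "MasterSlave", builder.build());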
Example: Distributed Data Store
Participant Plug-In Code

@StateModelInfo(initialState="OFFLINE", states={"OFFLINE", "SLAVE", "MASTER"})
class DistributedDataStoreModel extends StateModel {
  @Transition(from="OFFLINE", to="SLAVE")
  public void fromOfflineToSlave(Message m, NotificationContext ctx) {
    // bootstrap data, setup replication, etc.
  }

  @Transition(from="SLAVE", to="MASTER")
  public void fromSlaveToMaster(Message m, NotificationContext ctx) {
    // catch up previous master, enable writes, etc.
  }
  ...
}
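
A participant wires this plug-in code in by registering a state model factory and connecting; a minimal sketch, assuming the single-argument factory method of Helix 0.6.x (instance name and ZooKeeper address are illustrative):

HelixManager manager = HelixManagerFactory.getZKHelixManager(
    "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");
StateMachineEngine engine = manager.getStateMachineEngine();
engine.registerStateModelFactory("MasterSlave", new StateModelFactory<StateModel>() {
  @Override
  public StateModel createNewStateModel(String partitionName) {
    return new DistributedDataStoreModel();  // one state model instance per partition
  }
});
manager.connect();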
Example: Distributed Data Store
Configure and Run
HelixAdmin --zkSvr <zk-address>
Create Cluster
--addCluster MyCluster
Add Participants
--addNode MyCluster localhost_12000
...
Add Resource
--addResource MyDB 16 MasterSlave SEMI_AUTO
Configure Rebalancer
--rebalance MyDB 3
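
The same configuration can be done through the Java admin API instead of the CLI; a minimal sketch (assuming Helix 0.6.x; host, port, and names are illustrative):

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
admin.addCluster("MyCluster");
InstanceConfig instance = new InstanceConfig("localhost_12000");
instance.setHostName("localhost");
instance.setPort("12000");
admin.addInstance("MyCluster", instance);
admin.addResource("MyCluster", "MyDB", 16, "MasterSlave", "SEMI_AUTO");
admin.rebalance("MyCluster", "MyDB", 3);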
Example: Distributed Data Store
Spectator Plug-In Code

class RoutingLogic {
  public void write(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition, "MASTER");
    nodes.get(0).write(request);
  }

  public void read(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition);
    random(nodes).read(request);
  }
}
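
Behind routingTableProvider sits Helix's routing table support: a spectator connects, registers the provider as an external view listener, and then queries it per resource, partition, and state. A minimal sketch, assuming the Helix 0.6.x RoutingTableProvider (names are illustrative):

HelixManager spectator = HelixManagerFactory.getZKHelixManager(
    "MyCluster", "router_1", InstanceType.SPECTATOR, "localhost:2181");
spectator.connect();
RoutingTableProvider routingTableProvider = new RoutingTableProvider();
spectator.addExternalViewChangeListener(routingTableProvider);
// e.g. find the MASTER replica of partition MyDB_0 of resource MyDB:
List<InstanceConfig> masters = routingTableProvider.getInstances("MyDB", "MyDB_0", "MASTER");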
Example: Distributed Data Store
Where is the Code?
• Participants run the participant plug-in code (state model callbacks) and report node updates to the consensus system.
• The controller reads config changes and node changes from the consensus system and issues node updates (transitions) to participants.
• Spectators run the spectator plug-in code (routing logic).
Example: Distributed Search
Index shards P.1 through P.6 are replicated and spread across Node 1, Node 2, and Node 3.

Partition Management
• multiple replicas
• rack-aware placement
• even distribution

Fault Tolerance
• fault detection
• auto create replicas
• controlled creation of replicas

Elasticity
• redistribute partitions
• minimize data movement
• throttle movement
Example: Distributed Search
State Model Definition: Bootstrap
States: Idle, Offline, Bootstrap, Online, Error.
Transitions include: setup node, cleanup, recover, consume data to build index (Online replicas can serve requests), stop consuming data, and stop indexing and serving.
StateCount constraints (3 and 5) bound how many replicas of a partition may be in particular states.
Example: Distributed Search
Configure and Run
Create Cluster
--addCluster MyCluster
Add Participants
--addNode MyCluster localhost_12000
...
Add Resource
--addResource MyIndex 16 Bootstrap CUSTOMIZED
Configure Rebalancer
--rebalance MyIndex 8
Example: Message Consumers
Consumers (C1, C2, C3) are assigned partitions of a partitioned consumer queue. The assignment must handle initial assignment, scaling, and fault tolerance.

Partition Management
• one consumer per queue
• even distribution

Elasticity
• redistribute queues among consumers
• minimize movement

Fault Tolerance
• redistribute
• minimize data movement
• limit max queues per consumer
Example: Message Consumers
State Model Definition: Online-Offline
• Offline -> Online: start consumption
• Online -> Offline: stop consumption
• Online: StateCount = 1 (one consumer per queue)
• Max 10 queues per consumer (see the sketch below)
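
A minimal sketch of the per-consumer cap, assuming a Helix 0.6-style IdealState property for maximum partitions per instance (cluster and resource names are illustrative):

IdealState queues = admin.getResourceIdealState("MyCluster", "MyQueues");
queues.setMaxPartitionsPerInstance(10);  // at most 10 queue partitions per consumer
admin.setResourceIdealState("MyCluster", "MyQueues", queues);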
Example: Message Consumers
Participant Plug-In Code

@StateModelInfo(initialState="OFFLINE", states={"OFFLINE", "ONLINE"})
class MessageConsumerModel extends StateModel {
  @Transition(from="OFFLINE", to="ONLINE")
  public void fromOfflineToOnline(Message m, NotificationContext ctx) {
    // register listener
  }

  @Transition(from="ONLINE", to="OFFLINE")
  public void fromOnlineToOffline(Message m, NotificationContext ctx) {
    // unregister listener
  }
}
Plugins

Plugins
Overview
• Data-Driven Testing and Debugging
• Chaos Monkey
• Rolling Upgrade
• On-Demand Task Scheduling
• Intra-Cluster Messaging
• Health Monitoring
Plugins
Data-Driven Testing and Debugging
• Instrument ZK, controller, and participant logs
• Simulate execution with Chaos Monkey
• Analyze invariants like state and transition constraints
The exact sequence of events can be replayed: debugging made easy!
Plugins
Data-Driven Testing and Debugging: Sample Log File

timestamp     partition    participantName     sessionId                             state
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
1.32331E+12   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE
1.32331E+12   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
Plugins
Data-Driven Testing and Debugging: Count Aggregation

Time    State     Slave Count   Participant
42632   OFFLINE   0             10.117.58.247_12918
42796   SLAVE     1             10.117.58.247_12918
43124   OFFLINE   1             10.202.187.155_12918
43131   OFFLINE   1             10.220.225.153_12918
43275   SLAVE     2             10.220.225.153_12918
43323   SLAVE     3             10.202.187.155_12918
85795   MASTER    2             10.220.225.153_12918

Error! The state constraint for SLAVE has an upper bound of 2.
Plugins
Data-Driven Testing and Debugging: Time Aggregation

Slave Count   Time        Percentage
0             1082319     0.5
1             35578388    16.46
2             179417802   82.99
3             118863      0.05

Master Count  Time        Percentage
0             1082319     0.5
1             35578388    16.46

83% of the time, there were 2 slaves to a partition.
93% of the time, there was 1 master to a partition.
We can see for exactly how long the cluster was out of whack.
Current Status
Helix at LinkedIn
Updates flow into the primary DB (Oracle, Espresso); Databus streams data change events to the read replicas and to the search index, graph index, and standardization services.
Coming Up Next
• New APIs
• Automatic scaling with YARN
• Non-JVM participants
Summary
• Helix: A generic framework for building distributed systems
• Abstraction and modularity allow for modifying and enhancing system behavior
• Simple programming model: declarative state machine
Questions?
website: helix.incubator.apache.org
dev mailing list: dev@helix.incubator.apache.org
user mailing list: user@helix.incubator.apache.org
twitter: @apachehelix