Helix talk at RelateIQ
Presentation Transcript

  • Apache Helix: Simplifying Distributed Systems. Kanak Biscuitwala and Jason Zhang. helix.incubator.apache.org | @apachehelix
  • Outline: Background • Resource Assignment Problem • Helix Concepts • Putting Concepts to Work • Getting Started • Plugins • Current Status
  • Building distributed systems is hard: load balancing, responding to node entry and exit, alerting based on metrics, managing data replicas, supporting event listeners.
  • Helix abstracts away the problems distributed systems need to solve: load balancing, responding to node entry and exit, alerting based on metrics, managing data replicas, supporting event listeners.
  • System Lifecycle: Single Node → Multi-Node → Fault Tolerance → Cluster Expansion. Along the way a system must handle partitioning, replication, discovery, co-location, fault detection, recovery, redistributing data, and throttling movement.
  • Resource Assignment Problem
  • Resource Assignment Problem: how should RESOURCES be mapped onto NODES?
  • Resource Assignment Problem, Sample Allocation: the resource is split evenly across four nodes, 25% each.
  • Resource Assignment Problem, Failure Handling: after a node fails, the remaining nodes carry 33%, 33%, and 34% of the resource.
  • Making it Work, Take 1 (ZooKeeper): the application talks to ZooKeeper directly. ZooKeeper provides low-level primitives (file system, locks, ephemeral nodes), but we need high-level primitives: node, partition, replica, state, transition. Helix sits between the application and the consensus system to provide them.
  • Making it Work, Take 2 (Decisions by Nodes): each service instance running on a node watches the consensus system for config changes, node changes, and node updates, and makes its own decisions.
  • Making it Work, Take 2 (Decisions by Nodes): the drawbacks are multiple brains, app-specific logic on every node, and unscalable traffic against the consensus system.
  • Making it Work, Take 3 (Single Brain): a single controller watches the consensus system for config changes and node changes and sends node updates; node logic is drastically simplified!
  • Helix View: controllers manage RESOURCES hosted on NODES (participants), while spectators observe. Question: how do we make this controller generic enough to work for different resources?
  • Helix Concepts
  • Resources: a resource is made up of partitions, and all partitions can be replicated.
  • Declarative State Model: replicas move between the states Offline, Slave, and Master.
  • Constraints: Augmenting the State Model. State constraints: MASTER: [1, 1], SLAVE: [0, R]. Special constraint values: R = replica count per partition, N = number of participants. Transition constraints: at cluster scope, OFFLINE-SLAVE: 3 concurrent; at resource R1 scope, SLAVE-MASTER: 1 concurrent; at participant P4 scope, OFFLINE-SLAVE: 2 concurrent. States and transitions are ordered by priority in computing replica states. Transition constraints can be restricted to cluster, resource, and participant scopes; the most restrictive constraint is used.
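The deck shows these constraints declaratively but not how they are registered. A minimal sketch (not from the talk), assuming the ConstraintItemBuilder/HelixAdmin throttling API and illustrative names (localhost:2181, MyCluster, offlineToSlaveThrottle):

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.ClusterConstraints.ConstraintType;
    import org.apache.helix.model.builder.ConstraintItemBuilder;

    public class TransitionConstraintExample {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
        // allow at most 3 concurrent OFFLINE-SLAVE transitions across the cluster
        ConstraintItemBuilder builder = new ConstraintItemBuilder();
        builder.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
               .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
               .addConstraintAttribute("CONSTRAINT_VALUE", "3");
        admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
            "offlineToSlaveThrottle", builder.build());
      }
    }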
  • Resources and the Augmented State Model: a resource is made up of partitions, all partitions can be replicated, and each replica is in a state (offline, slave, or master) governed by the augmented state model.
  • Objectives. Partition placement: a distribution policy for partitions and replicas that makes effective use of the cluster and the resource. Failure and expansion semantics: creating new replicas and assigning them states, and changing existing replica states.
  • Putting Concepts to Work
  • Rebalancing Strategies, Meeting Objectives within Constraints: Full-Auto (replica placement: Helix, replica state: Helix), Semi-Auto (placement: app, state: Helix), Customized (placement: app, state: app), User-Defined (placement and state computed by app code plugged into the Helix controller).
  • Full-Auto: Node 1 holds P1: M and P2: S, Node 2 holds P2: M and P3: S, Node 3 holds P3: M and P1: S. By default, Helix optimizes for minimal movement and even distribution of partitions and states.
  • Full-Auto, failure handling: when Node 3 goes down, its replicas (P3: M and P1: S) are redistributed across Node 1 and Node 2, again optimizing for minimal movement and even distribution of partitions and states.
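As a concrete companion to the Full-Auto slides, here is a minimal sketch (not from the talk) of registering a resource in FULL_AUTO mode through the HelixAdmin API; the ZooKeeper address and the MyCluster/MyDB names are illustrative:

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.IdealState.RebalanceMode;

    public class FullAutoSetup {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
        // 8 partitions of a MasterSlave resource; Helix picks placement and state
        admin.addResource("MyCluster", "MyDB", 8, "MasterSlave",
            RebalanceMode.FULL_AUTO.toString());
        // ask Helix to compute an assignment with 3 replicas per partition
        admin.rebalance("MyCluster", "MyDB", 3);
      }
    }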
  • Semi-Auto: Node 1 holds P1: M and P2: S, Node 2 holds P2: M and P3: S, Node 3 holds P3: M and P1: S. Semi-Auto mode maintains the location of the replicas but allows Helix to adjust the states to follow the state constraints. This is ideal for resources that are expensive to move.
  • Semi-Auto, failure handling: when Node 3 goes down, Helix promotes the P3 slave on Node 2 to master instead of moving data. Semi-Auto mode maintains the location of the replicas but allows Helix to adjust the states to follow the state constraints; this is ideal for resources that are expensive to move.
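In Semi-Auto mode the placement is pinned by the application through the ideal state's preference lists, while Helix picks the states. A minimal sketch (not from the talk), assuming the IdealState preference-list API and illustrative names (MyCluster, MyDB, node1_12918, node3_12918):

    import java.util.Arrays;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.IdealState;

    public class SemiAutoPlacement {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
        IdealState idealState = admin.getResourceIdealState("MyCluster", "MyDB");
        // partition MyDB_0 lives on node1 and node3; Helix assigns MASTER/SLAVE
        idealState.setPreferenceList("MyDB_0",
            Arrays.asList("node1_12918", "node3_12918"));
        admin.setResourceIdealState("MyCluster", "MyDB", idealState);
      }
    }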
  • Customized: the app specifies the location and state of each replica; Helix still ensures that transitions are fired according to constraints. Need to respond to node changes? Use the Helix custom code invoker to run on one participant, or plug a user-defined rebalancer into the controller.
  • User-Defined: a node joins or leaves the cluster → the Helix controller invokes code plugged in by the app → the rebalancer implemented by the app computes replica placement and state → Helix fires transitions without violating constraints. The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix rebalancers implement the same interface.
  • User-Defined: Distributed Lock Manager. Nodes 1-3 hold locks that move between Released, Offline, and Locked; each lock is a partition!
  • User-Defined: Distributed Lock Manager rebalancer code:
    public ResourceAssignment computeResourceMapping(
        Resource resource, IdealState currentIdealState,
        CurrentStateOutput currentStateOutput,
        ClusterDataCache clusterData) {
      ...
      // assign each lock (partition) to a live participant, round-robin
      int i = 0;
      for (Partition partition : resource.getPartitions()) {
        Map<String, String> replicaMap = new HashMap<String, String>();
        int participantIndex = i % liveParticipants.size();
        String participant = liveParticipants.get(participantIndex);
        replicaMap.put(participant, "LOCKED");
        assignment.addReplicaMap(partition, replicaMap);
        i++;
      }
      return assignment;
    }
  • Controller Fault Tolerance: controllers themselves follow an Offline / Standby / Leader state model. The augmented state model concept applies to controllers too!
  • Controller Scalability: a pool of controllers manages many clusters, e.g. Controller 1 owns Clusters 1-3, Controller 2 owns Clusters 4-5, and Controller 3 owns Cluster 6.
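Controller startup is not shown as code in the deck. The sketch below assumes the HelixControllerMain helper, where STANDALONE gives leader/standby fault tolerance for a single cluster and DISTRIBUTED lets a pool of controllers share many clusters; all names are illustrative:

    import org.apache.helix.HelixManager;
    import org.apache.helix.controller.HelixControllerMain;

    public class ControllerLauncher {
      public static void main(String[] args) throws Exception {
        // Fault tolerance: run several STANDALONE controllers for one cluster;
        // Helix elects a Leader and keeps the others in Standby.
        HelixManager controller = HelixControllerMain.startHelixController(
            "localhost:2181", "MyCluster", "controller_1",
            HelixControllerMain.STANDALONE);

        // Scalability: DISTRIBUTED controllers join a controller "super cluster"
        // and divide ownership of the managed clusters among themselves.
        HelixManager distributedController = HelixControllerMain.startHelixController(
            "localhost:2181", "ControllerSuperCluster", "controller_2",
            HelixControllerMain.DISTRIBUTED);
      }
    }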
  • ZooKeeper View, Ideal State. Replica placement and state (P1 → N1: M, N2: S; P2 → N2: M, N1: S) is stored in ZooKeeper as:
    {
      "id" : "SampleResource",
      "simpleFields" : {
        "REBALANCE_MODE" : "USER_DEFINED",
        "NUM_PARTITIONS" : "2",
        "REPLICAS" : "2",
        "STATE_MODEL_DEF_REF" : "MasterSlave",
        "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
      },
      "mapFields" : {
        "SampleResource_0" : {
          "node1_12918" : "MASTER",
          "node2_12918" : "SLAVE"
        }
        ...
      },
      "listFields" : {}
    }
  • ZooKeeper View, Current State and External View. Current state (reported per participant): N1 has P1: MASTER and P2: MASTER; N2 has P1: OFFLINE and P2: OFFLINE. External view (aggregated per partition): P1 → N1: M, N2: O; P2 → N1: M, N2: O. Helix's responsibility is to make the external view match the ideal state as closely as possible.
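To see what the controller has converged to, a spectator or admin tool can read the external view back from ZooKeeper. A minimal sketch (not from the talk), assuming HelixAdmin#getResourceExternalView and reusing the SampleResource name from the previous slide:

    import java.util.Map;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.ExternalView;

    public class ExternalViewCheck {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
        ExternalView view = admin.getResourceExternalView("MyCluster", "SampleResource");
        for (String partition : view.getPartitionSet()) {
          // e.g. SampleResource_0 -> {node1_12918=MASTER, node2_12918=SLAVE}
          Map<String, String> stateMap = view.getStateMap(partition);
          System.out.println(partition + " -> " + stateMap);
        }
      }
    }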
  • Logical Deployment: ZooKeeper sits in the middle; the Helix controller, every participant, and every spectator embed a Helix agent. The participants host the replicas (P1: M and P2: S on one, P2: M and P3: S on another, P3: M and P1: S on the third).
  • Getting Started
  • Example: Distributed Data Store. Master partitions (P.1-P.12) and their slave replicas are spread across Node 1, Node 2, and Node 3. Requirements: partition management (multiple replicas, one master, even distribution), fault tolerance (fault detection, promote a slave to master, even distribution, no SPOF), elasticity (minimize downtime, minimize data movement, throttle movement).
  • Example: Distributed Data Store, Helix-Based Solution. Define: state model, state transitions. Configure: create cluster, add nodes, add resource, configure rebalancer. Run: start controller, start participants.
  • Example: Distributed Data Store, State Model Definition: Master-Slave. States: all possible states (Offline, Slave, Master), each with a priority. Transitions: legal transitions, each with a priority. The state model applies to each partition of a resource.
  • Example: Distributed Data Store, State Model Definition: Master-Slave (code):
    builder = new StateModelDefinition.Builder("MasterSlave");
    // add states and their ranks to indicate priority
    builder.addState(MASTER, 1);
    builder.addState(SLAVE, 2);
    builder.addState(OFFLINE);
    // set the initial state when participant starts
    builder.initialState(OFFLINE);
    // add transitions
    builder.addTransition(OFFLINE, SLAVE);
    builder.addTransition(SLAVE, OFFLINE);
    builder.addTransition(SLAVE, MASTER);
    builder.addTransition(MASTER, SLAVE);
  • Example: Distributed Data Store, Defining Constraints. Master-Slave state counts: Master StateCount=1, Slave StateCount=2. Where constraints can be applied: Partition (state: Y, transition: Y), Resource (state: -, transition: Y), Node (state: Y, transition: Y), Cluster (state: -, transition: Y).
  • Example: Distributed Data Store, Defining Constraints: Code
    // static constraints
    builder.upperBound(MASTER, 1);
    // dynamic constraints
    builder.dynamicUpperBound(SLAVE, "R");
    // unconstrained
    builder.upperBound(OFFLINE, -1);
  • Example: Distributed Data Store, Participant Plug-In Code
    @StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
    class DistributedDataStoreModel extends StateModel {
      @Transition(from = "OFFLINE", to = "SLAVE")
      public void fromOfflineToSlave(Message m, NotificationContext ctx) {
        // bootstrap data, setup replication, etc.
      }

      @Transition(from = "SLAVE", to = "MASTER")
      public void fromSlaveToMaster(Message m, NotificationContext ctx) {
        // catch up previous master, enable writes, etc.
      }
      ...
    }
  • Example: Distributed Data Store, Configure and Run (HelixAdmin --zkSvr <zk-address>)
    Create cluster:        --addCluster MyCluster
    Add participants:      --addNode MyCluster localhost_12000 ...
    Add resource:          --addResource MyDB 16 MasterSlave SEMI_AUTO
    Configure rebalancer:  --rebalance MyDB 3
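The "Run" step (start controller, start participants) is not shown as code in the deck. A minimal sketch, assuming the HelixManagerFactory/HelixControllerMain APIs and reusing the DistributedDataStoreModel from the participant plug-in slide; note that the StateModelFactory callback signature differs across Helix versions:

    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.controller.HelixControllerMain;
    import org.apache.helix.participant.statemachine.StateModelFactory;

    public class RunCluster {
      // hands Helix one state model instance per partition
      static class DataStoreModelFactory extends StateModelFactory<DistributedDataStoreModel> {
        @Override
        public DistributedDataStoreModel createNewStateModel(String partitionName) {
          return new DistributedDataStoreModel();
        }
      }

      public static void main(String[] args) throws Exception {
        // start the controller for MyCluster
        HelixControllerMain.startHelixController(
            "localhost:2181", "MyCluster", "controller_0",
            HelixControllerMain.STANDALONE);

        // start one participant and register the plug-in state model
        HelixManager participant = HelixManagerFactory.getZKHelixManager(
            "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");
        participant.getStateMachineEngine()
            .registerStateModelFactory("MasterSlave", new DataStoreModelFactory());
        participant.connect();
      }
    }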
  • Example: Distributed Data Store, Spectator Plug-In Code
    class RoutingLogic {
      public void write(Request request) {
        partition = getPartition(request.key);
        List<Node> nodes = routingTableProvider.getInstance(partition, "MASTER");
        nodes.get(0).write(request);
      }

      public void read(Request request) {
        partition = getPartition(request.key);
        List<Node> nodes = routingTableProvider.getInstance(partition);
        random(nodes).read(request);
      }
    }
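The routingTableProvider used in the pseudocode above comes from Helix's spectator support. A minimal sketch of wiring it up (not from the talk), assuming the RoutingTableProvider API and illustrative names (MyCluster, router_0, MyDB):

    import java.util.List;
    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.model.InstanceConfig;
    import org.apache.helix.spectator.RoutingTableProvider;

    public class SpectatorSetup {
      public static void main(String[] args) throws Exception {
        HelixManager spectator = HelixManagerFactory.getZKHelixManager(
            "MyCluster", "router_0", InstanceType.SPECTATOR, "localhost:2181");
        spectator.connect();

        // keep a routing table in sync with the external view
        RoutingTableProvider routingTableProvider = new RoutingTableProvider();
        spectator.addExternalViewChangeListener(routingTableProvider);

        // who currently masters partition MyDB_0?
        List<InstanceConfig> masters =
            routingTableProvider.getInstances("MyDB", "MyDB_0", "MASTER");
        System.out.println("MASTER replicas: " + masters);
      }
    }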
  • Example: Distributed Data Store, Where is the Code? Participant plug-in code runs on every participant, spectator plug-in code runs on the spectators, and the controller exchanges config changes, node changes, and node updates with the consensus system.
  • Example: Distributed Search. Index shards (P.1-P.6) are replicated across Node 1, Node 2, and Node 3. Requirements: partition management (multiple replicas, rack-aware placement, even distribution), fault tolerance (fault detection, auto-create replicas, controlled creation of replicas), elasticity (redistribute partitions, minimize data movement, throttle movement).
  • Example: Distributed Search, State Model Definition: Bootstrap. States: Offline, Idle, Bootstrap, Online, Error. Transitions cover setting up a node, consuming data to build the index, serving requests, stopping indexing and serving, cleanup, and recovery. State counts (StateCount=3, StateCount=5) bound how many replicas may be in particular states.
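The Bootstrap state model itself is not spelled out in code in the deck; the sketch below assembles one with the same StateModelDefinition.Builder shown earlier, with the transitions and upper bounds inferred from the slide (treat the exact edges and counts as assumptions):

    import org.apache.helix.model.StateModelDefinition;

    public class BootstrapStateModel {
      public static StateModelDefinition build() {
        StateModelDefinition.Builder builder =
            new StateModelDefinition.Builder("Bootstrap");
        // states in priority order; Helix tracks ERROR separately for failed transitions
        builder.addState("ONLINE", 1);
        builder.addState("BOOTSTRAP", 2);
        builder.addState("IDLE", 3);
        builder.addState("OFFLINE", 4);
        builder.initialState("OFFLINE");

        builder.addTransition("OFFLINE", "IDLE");     // setup node
        builder.addTransition("IDLE", "BOOTSTRAP");   // consume data to build index
        builder.addTransition("BOOTSTRAP", "ONLINE"); // can serve requests
        builder.addTransition("ONLINE", "IDLE");      // stop indexing and serving
        builder.addTransition("IDLE", "OFFLINE");     // cleanup

        // assumed mapping of the slide's StateCount annotations
        builder.upperBound("ONLINE", 3);
        builder.upperBound("BOOTSTRAP", 5);
        return builder.build();
      }
    }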
  • Example: Distributed Search, Configure and Run
    Create cluster:        --addCluster MyCluster
    Add participants:      --addNode MyCluster localhost_12000 ...
    Add resource:          --addResource MyIndex 16 Bootstrap CUSTOMIZED
    Configure rebalancer:  --rebalance MyIndex 8
  • Example: Message Consumers. Consumers C1-C3 are assigned to a partitioned consumer queue; the assignment must survive scaling and consumer failure. Requirements: partition management (one consumer per queue, even distribution), elasticity (redistribute queues among consumers, minimize movement), fault tolerance (redistribute, minimize data movement, limit max queues per consumer).
  • Example: Message Consumers, State Model Definition: Online-Offline. StateCount = 1, with a maximum of 10 queues per consumer. Transitions: Offline → Online (start consumption), Online → Offline (stop consumption).
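Since OnlineOffline is one of Helix's built-in state models, the interesting part is capping queues per consumer. A minimal sketch (not from the talk), assuming IdealState#setMaxPartitionsPerInstance and illustrative names (MyCluster, ConsumerQueues, 40 queues):

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.IdealState;
    import org.apache.helix.model.IdealState.RebalanceMode;

    public class ConsumerQueueSetup {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
        // one partition per queue, a single ONLINE replica each (StateCount = 1)
        admin.addResource("MyCluster", "ConsumerQueues", 40, "OnlineOffline",
            RebalanceMode.FULL_AUTO.toString());

        IdealState idealState = admin.getResourceIdealState("MyCluster", "ConsumerQueues");
        idealState.setMaxPartitionsPerInstance(10); // max 10 queues per consumer
        admin.setResourceIdealState("MyCluster", "ConsumerQueues", idealState);

        admin.rebalance("MyCluster", "ConsumerQueues", 1); // one replica per queue
      }
    }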
  • Example: Message Consumers, Participant Plug-In Code
    @StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "ONLINE"})
    class MessageConsumerModel extends StateModel {
      @Transition(from = "OFFLINE", to = "ONLINE")
      public void fromOfflineToOnline(Message m, NotificationContext ctx) {
        // register listener
      }

      @Transition(from = "ONLINE", to = "OFFLINE")
      public void fromOnlineToOffline(Message m, NotificationContext ctx) {
        // unregister listener
      }
    }
  • Plugins
  • Plugins Overview: data-driven testing and debugging, Chaos Monkey, rolling upgrade, on-demand task scheduling, intra-cluster messaging, health monitoring.
  • Plugins, Data-Driven Testing and Debugging: instrument ZooKeeper, controller, and participant logs; simulate execution with Chaos Monkey; analyze invariants like state and transition constraints. The exact sequence of events can be replayed: debugging made easy!
  • Plugins, Data-Driven Testing and Debugging: Sample Log File (columns: timestamp, partition, participantName, sessionId, state)
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1.32331E+12  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
  • Plugins, Data-Driven Testing and Debugging: Count Aggregation (columns: time, state, slave count, participant)
    42632  OFFLINE  0  10.117.58.247_12918
    42796  SLAVE    1  10.117.58.247_12918
    43124  OFFLINE  1  10.202.187.155_12918
    43131  OFFLINE  1  10.220.225.153_12918
    43275  SLAVE    2  10.220.225.153_12918
    43323  SLAVE    3  10.202.187.155_12918
    85795  MASTER   2  10.220.225.153_12918
    Error! The state constraint for SLAVE has an upper bound of 2.
  • Plugins, Data-Driven Testing and Debugging: Time Aggregation
    Slave count: 0 → 1082319 (0.5%), 1 → 35578388 (16.46%), 2 → 179417802 (82.99%), 3 → 118863 (0.05%)
    Master count: 0 → 1082319 (0.5%), 1 → 35578388 (16.46%)
    83% of the time there were 2 slaves per partition; 93% of the time there was 1 master per partition. We can see for exactly how long the cluster was out of whack.
  • Current Status
  • Helix at LinkedIn: updates go to the primary DBs (Oracle, Espresso); data change events flow through Databus to downstream systems such as the search index, graph index, standardization, and read replicas.
  • Coming Up Next: new APIs, automatic scaling with YARN, non-JVM participants.
  • Summary • Helix: A generic framework for building distributed systems • Abstraction and modularity allow for modifying and enhancing system behavior • Simple programming model: declarative state machine
  • Questions? Website: helix.incubator.apache.org; dev mailing list: dev@helix.incubator.apache.org; user mailing list: user@helix.incubator.apache.org; Twitter: @apachehelix