Untangling Cluster Management with Helix

This talk was given by Kishore Gopalakrishna (Staff Software Engineer @ LinkedIn) at the 3rd ACM Symposium on Cloud Computing (SOCC 2012).

Untangling Cluster Management with Helix: Presentation Transcript

  • Untangling Cluster Management with Helix. Helix team @ LinkedIn. Kishore Gopalakrishna, http://www.linkedin.com/in/kgopalak, @kishoreg1980
  • Outline
      • What is Helix
      • Use case 1: distributed data store
      • Architecture
      • Use case 2: consumer group
      • Helix at LinkedIn
      • Q&A
  • What is Helix: a cluster management framework for distributed systems, using a declarative state model
  • Distributed system examples
  • Motivation: A system starts out simple… but gets complex in the real world… as you address real requirements: scale, failover, bootstrapping. [Diagram: Application, client library, Call Routing, System, Replica 1 …, Replica 2 …]
  • Motivation: Scale, failover, and bootstrapping are cluster management problems. Helix solves them once, so you can focus on your system.
  • Outline What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A 7
  • Use-Case: Distributed Data Store Distributed P.1 Node 1 Node 2 Node 3 8
  • Use-Case: Distributed Data Store Distributed Partitioned P.1 P.2 P.3 P.5 P.6 P.7 P.9 P.1 P.11 0 P.4 P.8 P.1 2 Node 1 Node 2 Node 3 9
  • Use-Case: Distributed Data Store Distributed Partitioned Replicated P.1 P.2 P.3 P.5 P.6 P.7 P.9 P.1 P.11 0 P.4 P.5 P.6 P.8 P.1 P.2 P.1 P.3 P.4 2 P.9 P.1 P.11 P.1 P.7 P.8 0 2 Node 1 Node 2 Node 3 10
  • Partition Layout Highly Available Master accepts writes Balanced distribution Master Slave P.1 P.2 P.3 P.5 P.6 P.7 P.9 P.1 P.11 0 P.4 P.5 P.6 P.8 P.1 P.2 P.1 P.3 P.4 2 P.9 P.1 P.11 P.1 P.7 P.8 0 2 Node 1 Node 2 Node 3 11
  • Failover [Diagram: master and slave copies of P.1 to P.12 on Node 1, Node 2, Node 3]
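
Failover in a master/slave layout amounts to promoting a surviving slave for every partition whose master was on the failed node. A minimal sketch of that idea, under the assumption that the layout is a plain dictionary; `fail_over` is a hypothetical helper, not a Helix call, and picking the alphabetically first survivor is an arbitrary choice made for determinism.

```python
def fail_over(layout, failed_node):
    """Return a new layout without failed_node, promoting a slave where needed."""
    new_layout = {}
    for partition, owners in layout.items():
        survivors = {n: s for n, s in owners.items() if n != failed_node}
        if survivors and "MASTER" not in survivors.values():
            promoted = sorted(survivors)[0]  # arbitrary but deterministic pick
            survivors[promoted] = "MASTER"
        new_layout[partition] = survivors
    return new_layout


layout = {
    "P.1": {"Node 1": "MASTER", "Node 2": "SLAVE"},
    "P.5": {"Node 2": "MASTER", "Node 1": "SLAVE"},
}
after = fail_over(layout, "Node 1")
print(after)
```

Here P.1 loses its master and Node 2 is promoted, while P.5 only loses a slave and keeps its master.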
  • Add Capacity [Diagram: Node 4 added alongside Node 1, Node 2, Node 3, holding master and slave copies of some partitions]
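
When a node is added, the goal is to give it a fair share of partitions while moving as few as possible. The sketch below shows one simple way to do that; it is an assumption-laden illustration, not the algorithm Helix uses, and `add_node` is a hypothetical name.

```python
def add_node(assignment, new_node):
    """assignment: {node: [partitions]}.

    Moves partitions to new_node until it holds floor(total / node_count),
    always taking from the currently fullest node. Returns (result, moved).
    """
    result = {n: list(ps) for n, ps in assignment.items()}
    result[new_node] = []
    total = sum(len(ps) for ps in result.values())
    target = total // len(result)
    moved = []
    while len(result[new_node]) < target:
        donor = max((n for n in result if n != new_node),
                    key=lambda n: len(result[n]))
        partition = result[donor].pop()
        result[new_node].append(partition)
        moved.append(partition)
    return result, moved


before = {
    "Node 1": ["P.1", "P.2", "P.3", "P.4"],
    "Node 2": ["P.5", "P.6", "P.7", "P.8"],
    "Node 3": ["P.9", "P.10", "P.11", "P.12"],
}
after, moved = add_node(before, "Node 4")
print(after, moved)
```

Only three of the twelve partitions move, and every partition that stays keeps its original node.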
  • Use-case requirements
      • Partition constraints
          • 1 master per partition
          • Balance partitions across cluster
          • No single point of failure: replicas on different nodes
      • Handle failures: transfer mastership
      • Elasticity
          • Distribute workload across added nodes → minimize partition movement
          • Meet SLAs → throttle concurrent data movement
  • Declarative Problem Statement
      • State machine
          • States: offline (O), slave (S), master (M)
          • Transitions: O-S, S-O, S-M, M-S (the diagram labels transitions t1, t2, t3, t4)
      • Constraints
          • States: COUNT=1 for M, COUNT=2 for S
          • Transitions: t1 ≤ 5
      • Objective: partition placement
          • $\min \max_{n_j \in N} M(n_j)$
          • $\min \max_{n_j \in N} S(n_j)$
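
The three ingredients on this slide (state machine, constraints, objective) can be written down as plain data plus a few checks. This is a sketch under assumptions: the COUNT values are read as per-partition upper bounds, the transition limit is left out, and every name below is hypothetical rather than part of Helix.

```python
from collections import Counter

# State machine: legal (from, to) transitions.
TRANSITIONS = {
    ("OFFLINE", "SLAVE"), ("SLAVE", "OFFLINE"),
    ("SLAVE", "MASTER"), ("MASTER", "SLAVE"),
}

# Constraints: assumed to be upper bounds on replicas per partition.
MAX_PER_PARTITION = {"MASTER": 1, "SLAVE": 2}


def is_legal(src, dst):
    """True if the state machine allows moving directly from src to dst."""
    return (src, dst) in TRANSITIONS


def violated_states(replica_states):
    """States of one partition's replicas that exceed their COUNT bound."""
    counts = Counter(replica_states)
    return [s for s, limit in MAX_PER_PARTITION.items() if counts[s] > limit]


def objective(placement, state):
    """Max over nodes of the number of replicas in `state` (to be minimized).

    placement: list of (node, state) pairs, one per replica in the cluster.
    """
    per_node = Counter(node for node, s in placement if s == state)
    return max(per_node.values(), default=0)


placement = [
    ("Node 1", "MASTER"), ("Node 2", "SLAVE"), ("Node 3", "SLAVE"),
    ("Node 2", "MASTER"), ("Node 1", "SLAVE"), ("Node 3", "SLAVE"),
]
print(objective(placement, "MASTER"), objective(placement, "SLAVE"))
```

Keeping the model declarative like this means a controller can evaluate any candidate placement without knowing what the application does during a transition.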
  • Generalizing cluster management: STATE MACHINE, CONSTRAINTS, OBJECTIVE
  • Outline What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A 17
  • Helix Based System Roles [Diagram: COMMAND and RESPONSE flows; master and slave copies of P.1 to P.12 on Node 1, Node 2, Node 3]
  • Controller Execution Flow: P1: O→S, P1: S→M
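
The two steps on this slide take partition P1 from offline to slave and then from slave to master, since the state machine has no direct offline-to-master transition. One way to derive such a sequence is a breadth-first search over the legal transitions; the sketch below assumes the master/slave state machine from the earlier slide and uses hypothetical names throughout.

```python
from collections import deque

# Legal transitions of the master/slave state machine.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["OFFLINE", "MASTER"],
    "MASTER": ["SLAVE"],
}


def transition_path(current, target):
    """Shortest list of (from, to) steps from current to target, or None."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return list(zip(path, path[1:]))
        for nxt in TRANSITIONS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


steps = transition_path("OFFLINE", "MASTER")
print(steps)
```

A controller built this way would issue each step as a command and wait for the participant's response before issuing the next one.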
  • Controller fault tolerance
  • Participant Plug-in code
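
The code shown on this slide was not captured in the transcript. As a stand-in, the sketch below illustrates the general shape of a participant plug-in: one callback per state transition, invoked when a command arrives. It is not Helix's API; the class, method names, and log messages are all hypothetical.

```python
class MasterSlaveParticipant:
    """Hypothetical participant: keeps local state and follows commands."""

    def __init__(self):
        self.states = {}  # partition -> current state
        self.log = []     # what the application would have done

    def on_offline_to_slave(self, partition):
        self.log.append(f"{partition}: start replicating")

    def on_slave_to_master(self, partition):
        self.log.append(f"{partition}: accept writes")

    def on_master_to_slave(self, partition):
        self.log.append(f"{partition}: stop accepting writes")

    def on_slave_to_offline(self, partition):
        self.log.append(f"{partition}: stop replicating")

    def handle(self, partition, src, dst):
        """Run the callback for one commanded transition, then record the state."""
        current = self.states.get(partition, "OFFLINE")
        if current != src:
            raise ValueError(f"{partition} is {current}, not {src}")
        handler = getattr(self, f"on_{src.lower()}_to_{dst.lower()}", None)
        if handler is None:
            raise ValueError(f"no transition {src} -> {dst}")
        handler(partition)
        self.states[partition] = dst


participant = MasterSlaveParticipant()
participant.handle("P.1", "OFFLINE", "SLAVE")
participant.handle("P.1", "SLAVE", "MASTER")
print(participant.states, participant.log)
```

The participant needs only local knowledge: it never decides which transition to run, it only implements what each transition means for the application.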
  • Spectator Plug-in code
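
The spectator code on this slide was also not captured. A spectator is sketched here as a component that watches the cluster's current state and routes requests accordingly, for example sending writes to a partition's master. This is an illustration under that assumption, not Helix's API; `RoutingSpectator` and its methods are hypothetical.

```python
class RoutingSpectator:
    """Hypothetical spectator: tracks which node is master for each partition."""

    def __init__(self):
        self.master_of = {}

    def on_view_change(self, view):
        """view: {partition: {node: state}}, delivered whenever the cluster changes."""
        self.master_of = {
            partition: node
            for partition, owners in view.items()
            for node, state in owners.items()
            if state == "MASTER"
        }

    def route_write(self, partition):
        """Node that should receive writes, or None if there is no master."""
        return self.master_of.get(partition)


router = RoutingSpectator()
router.on_view_change({"P.1": {"Node 1": "MASTER", "Node 2": "SLAVE"}})
first = router.route_write("P.1")
router.on_view_change({"P.1": {"Node 2": "MASTER"}})  # after a failover
second = router.route_write("P.1")
print(first, second)
```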
  • Benefits Cluster operations “just work” – Bootstrapping – Failover – Add nodes Global vs Local – Helix Controller  Global knowledge  Makes cluster decisions – Participant  Local knowledge  Follows orders 24
  • Outline What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A 25
  • Consumer group
  • Consumer group: Scaling
  • Consumer group: Fault tolerance
  • Consumer group: state model
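
The state model diagram for the consumer group was not captured in the transcript. The sketch below assumes a two-state model (OFFLINE, ONLINE) in which each queue is ONLINE on exactly one consumer, and shows how scaling and fault tolerance both reduce to recomputing the assignment when the set of live consumers changes. All names are hypothetical, and the queue/consumer framing is an assumption.

```python
def assign_queues(queues, consumers):
    """Map each queue to the single consumer on which it is ONLINE.

    Queues are spread round-robin over the sorted consumer names.
    """
    if not consumers:
        return {}
    ordered = sorted(consumers)
    return {q: ordered[i % len(ordered)] for i, q in enumerate(queues)}


def load(assignment):
    """Number of queues per consumer."""
    counts = {}
    for consumer in assignment.values():
        counts[consumer] = counts.get(consumer, 0) + 1
    return counts


queues = [f"Q{i}" for i in range(1, 7)]
two = assign_queues(queues, ["C1", "C2"])               # initial group
three = assign_queues(queues, ["C1", "C2", "C3"])       # scaling: C3 joins
after_failure = assign_queues(queues, ["C1", "C3"])     # fault tolerance: C2 dies
print(load(two), load(three), load(after_failure))
```

This naive reassignment can move more queues than strictly necessary; a real rebalancer would also try to minimize movement.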
  • Outline What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A 30
  • Helix usage at LinkedIn (Pictures)
      • Espresso: a timeline-consistent, distributed data store
      • Databus: a change data capture service
      • Search as a Service: a multi-tenant service for multiple search applications
      • More planned
  • Summary
      • Generic framework
      • Easy to use: declarative model
      • Easy to operate
  • Helix: Future Roadmap
      • Features
          • Span multiple data centers
          • Load balancing
      • Announcement
          • Open source: https://github.com/linkedin/helix
          • Apache incubation
          • New contributors
  • Questions?