Untangling Cluster Management with Helix

This talk was given by Kishore Gopalakrishna (Staff Software Engineer @ LinkedIn) at the 3rd ACM Symposium on Cloud Computing (SOCC 2012).

  1. Untangling Cluster Management with Helix. Helix team @ LinkedIn. Kishore Gopalakrishna, http://www.linkedin.com/in/kgopalak, @kishoreg1980
  2. Outline: What is Helix; Use case 1: distributed data store; Architecture; Use case 2: consumer group; Helix at LinkedIn; Q&A
  3. What is Helix: A cluster management framework for distributed systems that uses a declarative state model
  4. Distributed system examples
  5. Motivation: A system starts out simple… but gets complex in the real world as you address real requirements: scale, failover, bootstrapping. Diagram labels: application, client library, call routing, system, Replica 1 …, Replica 2 …
  6. Motivation: Scale, failover, and bootstrapping are cluster management problems. Helix solves them once… so you can focus on your system
  7. Outline: What is Helix; Use case 1: distributed data store; Architecture; Use case 2: consumer group; Helix at LinkedIn; Q&A
  8. Use-Case: Distributed Data Store. Distributed. Diagram: partition P.1; Node 1, Node 2, Node 3
  9. Use-Case: Distributed Data Store. Distributed, partitioned. Node 1: P.1 P.2 P.3 P.4. Node 2: P.5 P.6 P.7 P.8. Node 3: P.9 P.10 P.11 P.12
  10. Use-Case: Distributed Data Store. Distributed, partitioned, replicated. Node 1: P.1 P.2 P.3 P.4, plus replicas of P.5 P.6 P.9 P.10. Node 2: P.5 P.6 P.7 P.8, plus replicas of P.1 P.2 P.11 P.12. Node 3: P.9 P.10 P.11 P.12, plus replicas of P.3 P.4 P.7 P.8
  11. Partition Layout: Highly available; master accepts writes; balanced distribution. Node 1: masters P.1 P.2 P.3 P.4, slaves P.5 P.6 P.9 P.10. Node 2: masters P.5 P.6 P.7 P.8, slaves P.1 P.2 P.11 P.12. Node 3: masters P.9 P.10 P.11 P.12, slaves P.3 P.4 P.7 P.8
  12. Failover. Diagram: the master/slave layout of slide 11 across Node 1, Node 2, Node 3
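
The failover on slide 12 can be illustrated with a minimal Python sketch. It is not Helix code: the layout dictionary, the `node1`..`node3` names, and both function names are assumptions made for illustration, and the layout is the one read off slides 9 to 11. When a node dies, any partition that lost its master gets a surviving slave promoted.

```python
def initial_layout():
    """12 partitions on 3 nodes, one MASTER and one SLAVE each (read off slides 9-11)."""
    master_of = {p: f"node{(p - 1) // 4 + 1}" for p in range(1, 13)}
    slave_of = {1: "node2", 2: "node2", 3: "node3", 4: "node3",
                5: "node1", 6: "node1", 7: "node3", 8: "node3",
                9: "node1", 10: "node1", 11: "node2", 12: "node2"}
    return {p: {master_of[p]: "MASTER", slave_of[p]: "SLAVE"} for p in master_of}


def fail_over(layout, dead_node):
    """Drop dead_node and promote a surviving SLAVE wherever the MASTER was lost."""
    new_layout = {}
    for partition, replicas in layout.items():
        survivors = {n: r for n, r in replicas.items() if n != dead_node}
        if "MASTER" not in survivors.values():
            # Sorted so the choice is deterministic when several slaves survive.
            slaves = sorted(n for n, r in survivors.items() if r == "SLAVE")
            if slaves:
                survivors[slaves[0]] = "MASTER"
        new_layout[partition] = survivors
    return new_layout


after = fail_over(initial_layout(), "node1")
for partition, replicas in after.items():
    print(partition, replicas)
```

In this sketch the four partitions mastered on the failed node are split evenly between the two survivors, because their slaves were spread across both. Re-creating the lost slave replicas is left out.
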
  13. Add Capacity. Node 4 joins and receives P.1 P.5 P.9 P.10 P.12 P.8. Nodes 1 to 3 are drawn with the master/slave layout of slide 11
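
One way to picture adding capacity with little data movement is the sketch below. It is an illustrative greedy strategy, not the algorithm Helix uses, and it does not try to reproduce the exact partitions shown for Node 4; `add_node` and the node names are hypothetical. It moves only as many partitions as the new node's fair share and takes them from the most loaded nodes.

```python
def add_node(assignment, new_node):
    """Move just enough partitions onto new_node to give it a fair share.

    assignment maps partition -> node. Returns (new_assignment, moved_partitions).
    """
    by_node = {new_node: []}
    for partition, node in sorted(assignment.items()):
        by_node.setdefault(node, []).append(partition)

    target = len(assignment) // len(by_node)  # fair share for the new node
    result = dict(assignment)
    moved = []
    while len(by_node[new_node]) < target:
        # Take from the currently most loaded existing node (name breaks ties).
        donor = max((n for n in by_node if n != new_node),
                    key=lambda n: (len(by_node[n]), n))
        partition = by_node[donor].pop()
        by_node[new_node].append(partition)
        result[partition] = new_node
        moved.append(partition)
    return result, moved


before = {p: f"node{(p - 1) // 4 + 1}" for p in range(1, 13)}
after, moved = add_node(before, "node4")
print("moved partitions:", moved)
```

With 12 partitions on three nodes, this moves three partitions, one from each existing node, and leaves every other partition where it was.
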
  14. Use-case requirements: • Partition constraints: 1 master per partition; balance partitions across the cluster; no single point of failure (replicas on different nodes) • Handle failures: transfer mastership • Elasticity: distribute workload across added nodes; minimize partition movement • Meet SLAs: throttle concurrent data movement
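
The partition constraints on slide 14 can be read as checks on a proposed placement. The following is a small sketch under stated assumptions: the placement format (partition mapped to a list of `(node, role)` pairs), the `check_placement` name, and the "differ by at most one master" reading of "balanced" are all chosen here for illustration and do not come from Helix.

```python
def check_placement(placement, nodes):
    """Return a list of violated requirements; an empty list means the placement passes.

    placement maps partition -> list of (node, role) pairs; nodes must be non-empty.
    """
    problems = []
    masters_per_node = {n: 0 for n in nodes}
    for partition, replicas in placement.items():
        roles = [role for _, role in replicas]
        if roles.count("MASTER") != 1:
            problems.append(f"{partition}: expected 1 master, found {roles.count('MASTER')}")
        hosts = [node for node, _ in replicas]
        if len(set(hosts)) != len(hosts):
            problems.append(f"{partition}: two replicas share a node")
        for node, role in replicas:
            if role == "MASTER":
                masters_per_node[node] = masters_per_node.get(node, 0) + 1
    # Assumed meaning of "balanced": master counts differ by at most one.
    if max(masters_per_node.values()) - min(masters_per_node.values()) > 1:
        problems.append("masters are not balanced across nodes")
    return problems


good = {"P.1": [("node1", "MASTER"), ("node2", "SLAVE")],
        "P.2": [("node2", "MASTER"), ("node1", "SLAVE")]}
bad = {"P.1": [("node1", "MASTER"), ("node1", "SLAVE")],
       "P.2": [("node1", "MASTER"), ("node2", "MASTER")]}
print(check_placement(good, ["node1", "node2"]))
print(check_placement(bad, ["node1", "node2"]))
```
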
  15. Declarative Problem Statement. State machine: states offline (O), slave (S), master (M); transitions O-S, S-O, S-M, M-S (drawn as t1 to t4). Constraints: on states, COUNT=1 for M and COUNT=2 for S; on transitions, t1 ≤ 5. Objective: partition placement, minimize(max_{n_j ∈ N} S(n_j)) and minimize(max_{n_j ∈ N} M(n_j))
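
To make the declarative statement on slide 15 concrete, here is a sketch that writes the state machine, constraints, and objective as plain Python data and functions. The dictionary layout and function names are hypothetical, not the Helix API. Two readings are assumptions: that t1 is the offline-to-slave transition, and that "t1 ≤ 5" caps how many of those transitions run at once.

```python
STATE_MODEL = {
    "states": ["OFFLINE", "SLAVE", "MASTER"],
    "transitions": [("OFFLINE", "SLAVE"), ("SLAVE", "OFFLINE"),
                    ("SLAVE", "MASTER"), ("MASTER", "SLAVE")],
    "state_counts": {"MASTER": 1, "SLAVE": 2},     # COUNT=1 and COUNT=2 per partition
    "max_concurrent": {("OFFLINE", "SLAVE"): 5},   # assumed meaning of "t1 <= 5"
}


def is_legal(model, src, dst):
    """True if the state machine declares a direct src -> dst transition."""
    return (src, dst) in model["transitions"]


def throttle(model, pending):
    """Pick the pending (partition, src, dst) transitions that fit under the caps."""
    issued, in_flight = [], {}
    for partition, src, dst in pending:
        cap = model["max_concurrent"].get((src, dst))
        used = in_flight.get((src, dst), 0)
        if is_legal(model, src, dst) and (cap is None or used < cap):
            issued.append((partition, src, dst))
            in_flight[(src, dst)] = used + 1
    return issued


def objective(placement, state):
    """max over nodes n_j of the number of partitions n_j holds in `state`."""
    per_node = {}
    for replicas in placement.values():      # replicas: node -> state
        for node, role in replicas.items():
            if role == state:
                per_node[node] = per_node.get(node, 0) + 1
    return max(per_node.values(), default=0)


pending = [(f"P.{i}", "OFFLINE", "SLAVE") for i in range(1, 9)]
print(len(throttle(STATE_MODEL, pending)), "of", len(pending), "transitions issued")
```

A placement that lowers `objective(placement, "MASTER")` and `objective(placement, "SLAVE")` spreads masters and slaves more evenly, which is what the two minimize terms on the slide express.
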
  16. Generalizing cluster management: state machine, constraints, objective
  17. Outline: What is Helix; Use case 1: distributed data store; Architecture; Use case 2: consumer group; Helix at LinkedIn; Q&A
  18. Helix Based System Roles. Diagram: COMMAND and RESPONSE arrows; Node 1, Node 2, Node 3 holding the master/slave layout of slide 11
  19. Controller Execution Flow. Example transitions: P1: O→S, P1: S→M
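
The two transitions on slide 19 (P1: O→S, then P1: S→M) suggest that a replica reaches its target state through a sequence of legal steps. A hedged sketch of that idea follows: a breadth-first search over the transition graph from slide 15. The names `plan` and `controller_step` and the "current states" and "target states" dictionaries are hypothetical; the actual controller pipeline is not described in the slides.

```python
from collections import deque

# Transition graph from slide 15: O-S, S-O, S-M, M-S.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["OFFLINE", "MASTER"],
    "MASTER": ["SLAVE"],
}


def plan(current, target):
    """Shortest list of (src, dst) steps from current to target, found by BFS."""
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for nxt in TRANSITIONS.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(state, nxt)]))
    raise ValueError(f"no path from {current} to {target}")


def controller_step(current_states, target_states):
    """For each replica, emit only the next transition toward its target state."""
    commands = {}
    for key, target in target_states.items():
        path = plan(current_states.get(key, "OFFLINE"), target)
        if path:
            commands[key] = path[0]
    return commands


print(plan("OFFLINE", "MASTER"))
print(controller_step({"P1@node1": "OFFLINE"}, {"P1@node1": "MASTER"}))
```

Issuing one step at a time, then re-planning from the reported state, keeps the sketch correct even if a replica's state changes between steps.
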
  20. Controller fault tolerance
  21. Controller fault tolerance
  22. Participant Plug-in code
  23. Spectator Plug-in code
  24. Benefits: Cluster operations “just work”: bootstrapping, failover, adding nodes. Global vs local: the Helix Controller has global knowledge and makes cluster decisions; a Participant has local knowledge and follows orders
  25. Outline: What is Helix; Use case 1: distributed data store; Architecture; Use case 2: consumer group; Helix at LinkedIn; Q&A
  26. Consumer group
  27. Consumer group: Scaling
  28. Consumer group: Fault tolerance
  29. Consumer group: State model
  30. Outline: What is Helix; Use case 1: distributed data store; Architecture; Use case 2: consumer group; Helix at LinkedIn; Q&A
  31. Helix usage at LinkedIn (pictures): Espresso, a timeline-consistent, distributed data store; Databus, a change data capture service; Search as a Service, a multi-tenant service for multiple search applications; more planned
  32. Summary: Generic framework; easy to use (declarative model); easy to operate
  33. Helix: Future Roadmap. Features: span multiple data centers; load balancing. Announcement: open source at https://github.com/linkedin/helix; Apache incubation; new contributors
  34. Questions?
