
Akka Cluster and Auto-scaling



Presentation slides from Scala Matsuri 2017

Published in: Technology


  1. Akka Cluster and Auto-scaling. Ikuo Matsumura, CyberAgent, Inc., 2017/02/26
  2. Akka Cluster: an Akka extension for building decentralized groups of nodes. This talk covers the pitfalls we hit and the lessons we learned over about 10 months of operation. • Decentralized cluster membership service • No single point of failure and no single bottleneck • Distributes actors over multiple JVMs • Applied to build a sub-system of ad serving • Tens of servers • In operation for about 10 months
  3. Requirements in our case • Host a large number of entities at low cost • Fit the existing Akka application • Downtime is acceptable to some extent* (*the workload is online machine learning)
  4. Our application of Akka Cluster: persistent actors are deployed with Cluster Sharding, and commodity services are used for data storage. [Diagram: frontend nodes (existing app, tens of nodes, auto-scaling); entities nodes (new sub-system, several nodes); data stores: ElastiCache (Journal), S3 (Snapshot)]
  5. Challenges • Strategy for removing unreachables • Lifecycle of journals. These are the two challenges we ran into during operation.
  6. Membership Lifecycle in Cluster Specification: an overview of the cluster member lifecycle. [Diagram: states joining, up, leaving, exiting, removed, down, unreachable; transitions join, leave]
  7. Joining and Leader Action: a joining member can communicate with the other members only after a "leader action". [Diagram labels: join, joining, up, unreachable, down]
  8. Joining and Leader Action: a joining member can communicate with the other members only after a "leader action". [Diagram labels: joining, up, unreachable, down, leader action]
  9. Joining and Leader Action: when a scale-in occurs, the terminated member stays unreachable. [Diagram labels: joining, up, unreachable, down, failure detector, leader action]
  10. Joining and Leader Action: leader actions can no longer be performed. As a result, new members remain unable to communicate with the other members. [Diagram labels: joining, up, unreachable, down, leader action]
  11. Joining and Leader Action: the scale-in has to trigger marking the member as down. [Diagram labels: joining, up, unreachable, down, leader action, scale-in trigger, mark as down]
  12. Joining and Leader Action: once the unreachable members are removed, leader actions can resume. [Diagram labels: joining, up, unreachable, down, leader action]
  13. Leader actions blocked by unreachables: example log messages while leader actions cannot be performed: "Leader can currently not perform its duties"; "Members that are "up" but have not seen the current state"
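
The sequence on slides 7 to 13 can be sketched as a small model. This is only an illustration of the rule described above, not Akka's implementation: the names `LeaderActionSketch`, `Member`, `leaderCanAct`, `leaderAction`, and `markDown` are hypothetical, and the rule is simplified to "the leader acts only when every member that is not marked down is reachable".

```scala
object LeaderActionSketch {
  sealed trait Status
  case object Joining extends Status
  case object Up extends Status
  case object Down extends Status

  // Hypothetical, simplified view of a cluster member.
  final case class Member(name: String, status: Status, reachable: Boolean = true)

  // Simplified rule: leader actions are blocked while any member that is
  // not marked Down is unreachable.
  def leaderCanAct(members: Seq[Member]): Boolean =
    members.forall(m => m.reachable || m.status == Down)

  // One round of leader actions: remove Down members, promote Joining to Up.
  // When the leader is blocked, the membership is returned unchanged.
  def leaderAction(members: Seq[Member]): Seq[Member] =
    if (!leaderCanAct(members)) members
    else members.filterNot(_.status == Down).map {
      case m if m.status == Joining => m.copy(status = Up)
      case m                        => m
    }

  // External action triggered by scale-in: mark the terminated node as Down.
  def markDown(members: Seq[Member], name: String): Seq[Member] =
    members.map(m => if (m.name == name) m.copy(status = Down) else m)
}
```

With members `a` (up), `b` (up but unreachable after scale-in), and `c` (joining), `leaderAction` leaves `c` in `Joining`. After `markDown(members, "b")`, the same call removes `b` and promotes `c` to `Up`.
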
  14. Causes and actions on unreachables. The failure detector cannot tell what caused a failure, so recovery has to be driven from outside the cluster according to the cause. Type of failure (example): possible external action
      • network partitions (no example given): wait for recovery, or abandon a part
      • machine crashes (e.g. scale-in): mark as down
      • quarantined in the Akka remote layer: restart the actor system
      • unresponsive process (e.g. long GC): restart the JVM
      • CPU starvation by credit shortage in EC2: re-create the instance
  15. Split Brain Resolver (commercial add-on) • Marks members as "down" when a part of the cluster becomes unreachable for some time • Strategies: Static Quorum, Keep Majority (default in Lagom), Keep Oldest, Keep Referee. The commercial add-on makes fairly comprehensive automatic downing possible; it acts once member state and reachability have stayed unchanged for a certain period.
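
To make the four strategy names concrete, here is a minimal sketch of the decision each one takes on one side of a partition. It is a simplified model and not the add-on's code: `SplitBrainSketch`, `Node`, `age`, and the function names are hypothetical, the tie-break in `keepMajority` (the side holding the lowest address survives) is an assumption of this sketch, and options such as "down if alone" are left out.

```scala
object SplitBrainSketch {
  // A member as seen from one side of a partition. `age` is a hypothetical
  // stand-in for join order: a smaller number means an older member.
  final case class Node(address: String, age: Int)

  sealed trait Decision
  case object DownUnreachable extends Decision // this side survives
  case object DownSelf extends Decision        // this side shuts itself down

  // Static Quorum: survive only if this side still sees at least `quorum` members.
  def staticQuorum(reachable: Set[Node], quorum: Int): Decision =
    if (reachable.size >= quorum) DownUnreachable else DownSelf

  // Keep Majority: the larger side survives. Assumed tie-break: the side
  // that holds the lowest address survives.
  def keepMajority(reachable: Set[Node], unreachable: Set[Node]): Decision =
    if (reachable.size > unreachable.size) DownUnreachable
    else if (reachable.size < unreachable.size) DownSelf
    else if (reachable.nonEmpty &&
             reachable.contains((reachable ++ unreachable).minBy(_.address))) DownUnreachable
    else DownSelf

  // Keep Oldest: the side that contains the oldest member survives.
  def keepOldest(reachable: Set[Node], unreachable: Set[Node]): Decision =
    if (reachable.nonEmpty &&
        reachable.contains((reachable ++ unreachable).minBy(_.age))) DownUnreachable
    else DownSelf

  // Keep Referee: the side that can still reach the designated referee survives.
  def keepReferee(reachable: Set[Node], referee: String): Decision =
    if (reachable.exists(_.address == referee)) DownUnreachable else DownSelf
}
```

Every function is evaluated independently on each side of the partition, which is why the two sides must reach opposite decisions for the cluster to stay consistent; in this sketch that holds for Keep Majority, Keep Oldest, and Keep Referee, while Static Quorum relies on the quorum being larger than half of the cluster.
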
  16. Reset cluster membership (poor man's): make the surviving nodes rejoin a new cluster. [Diagram labels: seed(s), old cluster (ddata), new cluster (ddata)]
  19. Caution • Restarting the application can have side effects • ddata is experimental (as of Akka 2.4) • Use Akka 2.4.8 or higher* (*includes a fix for distributed pub-sub, akka#20847)
  20. To keep cluster membership healthy (summary of the measures against unreachables) 1. Trigger mark-as-down (or leave) on scale-in 2. Automate restart/re-creation of the ActorSystem, JVM, and server instance 3. Set up a fallback mechanism such as a split brain resolver or rejoining into a new cluster
  21. Challenges • Strategy for removing unreachables • Lifecycle of journals. Next, the second challenge.
  22. Journal: the API that corresponds to the event store in Event Sourcing. We planned to operate the journal like a cache. [Diagram: entities nodes; ElastiCache (Journal); S3 (Snapshot)]
  23. Cleanup old journals in Redis plug-in*: the journal's deleteMessages can leave some data behind, and the sequenceNr must stay consistent with the snapshots.
      Key in Redis | Removed on deleteMessages
      journal:$persistenceId | Yes
      journal:$persistenceId.highestSequenceNr | No
      * Deleting highestSequenceNr could cause an old version of a snapshot to be loaded. → Keep only the latest snapshot.
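
One possible reading of the footnote on slide 23 is sketched below with an in-memory stand-in for the two Redis keys. The slides do not spell out the mechanism, so two assumptions are built in and marked in the comments: new events are numbered from the stored `highestSequenceNr` counter, and recovery is offered the snapshot with the highest `sequenceNr`. `JournalCleanupSketch`, `FakeRedisJournal`, and `deleteHighestSequenceNr` are hypothetical names, not part of any plug-in.

```scala
import scala.collection.mutable

object JournalCleanupSketch {
  final case class Snapshot(sequenceNr: Long, state: String)

  // In-memory stand-in for the two Redis keys named on the slide.
  final class FakeRedisJournal(persistenceId: String) {
    private val eventsKey  = s"journal:$persistenceId"
    private val highestKey = s"journal:$persistenceId.highestSequenceNr"
    private val events   = mutable.Map.empty[String, Vector[Long]]
    private val counters = mutable.Map.empty[String, Long]

    def keys: Set[String] = events.keySet.toSet ++ counters.keySet

    // Assumption: a missing counter key reads as 0.
    def highestSequenceNr: Long = counters.getOrElse(highestKey, 0L)

    // Assumption: the next event is numbered from the stored counter.
    def write(): Long = {
      val next = highestSequenceNr + 1
      events(eventsKey) = events.getOrElse(eventsKey, Vector.empty) :+ next
      counters(highestKey) = next
      next
    }

    // As on the slide: the events key is cleaned up, the counter key is not.
    def deleteMessages(toSequenceNr: Long): Unit = {
      val remaining = events.getOrElse(eventsKey, Vector.empty).filter(_ > toSequenceNr)
      if (remaining.isEmpty) events -= eventsKey else events(eventsKey) = remaining
    }

    // Hypothetical manual cleanup that also drops the counter key.
    def deleteHighestSequenceNr(): Unit = counters -= highestKey
  }

  // Assumption: recovery is offered the snapshot with the highest sequenceNr.
  def latestSnapshot(snapshots: Seq[Snapshot]): Option[Snapshot] =
    if (snapshots.isEmpty) None else Some(snapshots.maxBy(_.sequenceNr))
}
```

Under these assumptions, deleting the counter key restarts numbering at 1, so a snapshot taken afterwards has a lower `sequenceNr` than a stale one taken before the cleanup, and `latestSnapshot` picks the stale one. Keeping only the latest snapshot removes the stale candidate.
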
  24. Event Sourcing and Ecosystem: "it stores a complete history of the events associated with the aggregates in your domain" (Reference 3: Introducing Event Sourcing, CQRS Journey [CQJ]). An event store is meant to hold the complete history of events; deviating from that also weakens the support available from the ecosystem (plug-ins).
  25. Summary • Lessons learned from devops of an Akka Cluster app • Strategy for removing unreachables: put several removal mechanisms in place (scale-in trigger; automatic restart/re-creation; fallback mechanism: split brain resolver or rejoining) • Lifecycle of journals: deviating from Event Sourcing has a cost, and operating the journal like a cache can be surprisingly hard
  26. References [CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, [PSE] Persistence - Schema Evolution, Akka Documentation, http://