Akka Cluster and Auto-scaling

Ikuo Matsumura
CyberAgent, Inc.
2017/02/26

Akka Cluster
An Akka extension for building a decentralized group of nodes.
This talk covers the pitfalls and lessons from about 10 months of operation.
• Decentralized cluster membership service
• no single point of failure, bottleneck
• distribute actors over multiple JVMs
• Applied to build a sub-system for ad serving
• Tens of servers
• About 10 months in operation

Requirements in our case
• Host a large number of entities at low cost
• Fit into the existing Akka application
• Downtime is acceptable to some extent*
In short: many entities on Akka at low cost, with some downtime tolerated.
* online machine learning

Our application of Akka Cluster
Persistent actors are deployed with Cluster Sharding.
Commodity services are used for data storage.
[Diagram: frontend nodes, entities nodes, and data stores.]
• frontend
  • Existing app
  • Tens of nodes
  • Auto-scaling
• entities
  • New sub-system
  • Several nodes
• data stores
  • ElastiCache (Journal)
  • S3 (Snapshot)

Challenges
• Strategy for removing unreachable members
• Lifecycle of journals
This talk covers these two challenges that we stumbled on in operation.

Membership Lifecycle in Cluster Specification
Overview of the cluster member lifecycle.
http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle
[State diagram. States: joining, up, leaving, exiting, down, removed, unreachable. Actions: join, leave.]
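The lifecycle can be read as a small state machine. The sketch below is a simplified plain-Python model, not Akka code: the transition table follows the Akka 2.4 cluster specification linked on this slide, it leaves out the optional weakly-up state, and it treats "unreachable" as a failure-detector flag rather than a state. The names `TRANSITIONS`, `step`, and `run` are made up for illustration.

```python
# Simplified model of the member states named on the slide; not Akka code.
# "unreachable" is a flag set by the failure detector next to the state,
# so it is deliberately absent from this table.
TRANSITIONS = {
    ("joining", "leader action"): "up",
    ("up", "leave"): "leaving",
    ("leaving", "leader action"): "exiting",
    ("exiting", "leader action"): "removed",
    ("joining", "down"): "down",
    ("up", "down"): "down",
    ("leaving", "down"): "down",
    ("exiting", "down"): "down",
    ("down", "leader action"): "removed",
}


def step(state, action):
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not valid in state {state!r}") from None


def run(actions, state="joining"):
    """Apply a sequence of actions, starting right after 'join'."""
    for action in actions:
        state = step(state, action)
    return state


# Graceful exit: join, become up, leave, then two leader actions.
graceful = run(["leader action", "leave", "leader action", "leader action"])
# Crash or scale-in: the member is marked as down, then removed by the leader.
crashed = run(["leader action", "down", "leader action"])
```

Both paths end in removed, but every hop except leave and down needs a leader action, which is why a blocked leader matters so much in the following slides.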

Joining and Leader Action
A member can communicate with the other members only after a “leader action”.
[State diagram: join, joining, leader action, up; unreachable and down also shown.]

When a scale-in occurs, the terminated member stays unreachable.
[State diagram: joining, up, unreachable, down; labels: failure detector, leader action.]

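Leader actions can no longer be performed.
As a result, new members remain unable to communicate with the other members.
[State diagram: joining, up, unreachable, down; label: leader action.]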

The scale-in must trigger marking the member as down.
[State diagram: joining, up, unreachable, down; labels: leader action, scale-in trigger, mark as down.]

Once the unreachable member is removed, leader actions can resume.
[State diagram: joining, up, unreachable, down; label: leader action.]

Leader actions blocked by unreachables
Example log messages when leader actions cannot be performed:
Members that are “up” but have not seen the current state
“Leader can currently not perform its duties”
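The blocking behaviour can be illustrated with a toy model. This is a sketch under a simplifying assumption, not Akka's implementation: the leader is taken to be blocked whenever some member is unreachable and not yet marked as down. `Member`, `leader_can_act`, and `leader_action` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    state: str              # "joining", "up", "down", ...
    reachable: bool = True  # flag maintained by the failure detector


def leader_can_act(members):
    """Simplified rule: an unreachable member that is not down blocks the leader."""
    return all(m.reachable or m.state == "down" for m in members)


def leader_action(members):
    """Remove downed members and promote joining members, if not blocked."""
    if not leader_can_act(members):
        return members  # blocked: nothing changes
    survivors = [m for m in members if m.state != "down"]
    for m in survivors:
        if m.state == "joining":
            m.state = "up"
    return survivors


cluster = [
    Member("a", "up"),
    Member("b", "up", reachable=False),  # instance terminated by scale-in
    Member("c", "joining"),              # instance started by scale-out
]

cluster = leader_action(cluster)
blocked_state = cluster[2].state         # "c" is stuck in joining

cluster[1].state = "down"                # external "mark as down" for "b"
cluster = leader_action(cluster)
names_after = [m.name for m in cluster]  # "b" has been removed
state_after = cluster[1].state           # "c" has been promoted
```

In this model a single stale unreachable entry is enough to keep every new member in joining, which matches the symptom described on the slides.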

Causes and actions on unreachables
Type of failure       Example                                     Possible external action
network partitions    -                                           wait for recovery, or abandon a part
machine crashes       scale-in                                    mark as down
                      quarantined in Akka remote layer            restart the actor system
unresponsive process  long GC                                     restart the JVM
                      CPU starvation by credit shortage in EC2    re-create the instance
The failure detector cannot tell the causes of failures apart.
Recovery has to be driven from outside the cluster, according to the cause.

Split Brain Resolver* (commercial add-on)
• Mark members as “down” when a part of the cluster becomes unreachable for some time
• Strategies
• Static Quorum
• Keep Majority - default in Lagom
• Keep Oldest
• Keep Referee
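* http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
The commercial add-on provides fairly comprehensive automatic downing.
It kicks in when member states and reachability have not changed for a certain period.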
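To make two of the strategies concrete, here is a minimal sketch of the decision each side of a partition would make. It is a plain-Python illustration of the rules as the Split Brain Resolver documentation describes them, not the add-on's code. Node addresses are plain strings compared lexicographically, which is an assumption of this sketch, and the function names are hypothetical.

```python
def keep_majority(my_side, other_side):
    """Return True if the nodes in my_side should stay up.

    Each argument is the set of node addresses on one side of the partition,
    as seen from my_side (which always contains at least the local node).
    On a tie, the side holding the lowest address survives.
    """
    if len(my_side) != len(other_side):
        return len(my_side) > len(other_side)
    return min(my_side | other_side) in my_side


def static_quorum(my_side, quorum_size):
    """Return True if this side still has at least quorum_size nodes."""
    return len(my_side) >= quorum_size


left = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
right = {"10.0.0.4", "10.0.0.5"}

left_survives = keep_majority(left, right)    # larger side stays up
right_survives = keep_majority(right, left)   # smaller side downs itself
```

With auto-scaling the cluster size changes over time, so a fixed `quorum_size` has to be chosen with the largest expected cluster in mind, while keep-majority adapts to the current membership.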

Reset cluster membership (poor man’s)
Make the surviving nodes rejoin as a new cluster.
[Diagram: seed(s); old cluster (ddata) and new cluster (ddata).]

Caution
• Side effects caused by app restart
• ddata is experimental (as of Akka 2.4)
• Use Akka 2.4.8 or higher*
Watch out for restart side effects and for the Akka version.
* has a fix for distributed pub-sub, akka#20847

To keep cluster membership healthy
1. Trigger mark-as-down (or leave) on scale-in
2. Automate restart/recreation of the ActorSystem, JVM, and server instance
3. Set up a fallback mechanism such as a split brain resolver, or rejoining into a new cluster
Summary of countermeasures against unreachable members.

Challenges
• Strategy for removing unreachable members
• Lifecycle of journals
Next, the second challenge.

Journal
[Diagram: entities nodes, with ElastiCache (Journal) and S3 (Snapshot) as data stores.]
The journal is the API that corresponds to the event store in Event Sourcing.
We assumed we could operate the journal like a cache.

Cleanup old journals in Redis plug-in*
deleteMessages on the journal can leave some data behind.
Watch the consistency of sequenceNr with snapshots.
Key in Redis                                Removed on deleteMessages
journal:$persistenceId                      Yes
journal:$persistenceId.highestSequenceNr    No
* https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala
Deleting highestSequenceNr could cause an old version of the snapshot to be loaded.
→ Keep only the latest snapshot.
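The sequenceNr pitfall can be reproduced with a toy in-memory model. This is a sketch, not the plug-in's code: `RedisLikeJournal` only mimics the two key families from the table above, and `latest_snapshot` assumes that recovery picks the snapshot with the highest sequenceNr. All names are hypothetical.

```python
class RedisLikeJournal:
    """Toy stand-in for the two key families listed on the slide."""

    def __init__(self):
        self.events = {}   # journal:$persistenceId
        self.highest = {}  # journal:$persistenceId.highestSequenceNr

    def append(self, pid, event):
        """Store an event under the next sequence number and return it."""
        seq = self.highest.get(pid, 0) + 1
        self.events.setdefault(pid, {})[seq] = event
        self.highest[pid] = seq
        return seq

    def delete_messages(self, pid, to_seq):
        """Drop events up to to_seq; the highestSequenceNr key is kept."""
        old = self.events.get(pid, {})
        self.events[pid] = {s: e for s, e in old.items() if s > to_seq}


def latest_snapshot(snapshots):
    """Pick the snapshot with the largest sequenceNr (assumed recovery rule)."""
    return max(snapshots, key=lambda s: s["seq"])


def scenario(drop_highest_key):
    """Return the snapshot state that recovery would load."""
    journal, snapshots, pid = RedisLikeJournal(), [], "entity-1"
    for event in ("e1", "e2", "e3"):
        seq = journal.append(pid, event)
    snapshots.append({"seq": seq, "state": "v1"})
    journal.delete_messages(pid, seq)
    if drop_highest_key:
        journal.highest.pop(pid)  # hypothetical extra cache-style cleanup
    seq = journal.append(pid, "e4")
    snapshots.append({"seq": seq, "state": "v2"})
    return latest_snapshot(snapshots)["state"]


kept = scenario(drop_highest_key=False)    # numbering continues after the snapshot
dropped = scenario(drop_highest_key=True)  # numbering restarts below the old snapshot
```

When the counter is dropped, the newer snapshot gets a lower sequenceNr than the older one, so the older state wins. Keeping only the latest snapshot, as the slide recommends, removes the stale candidate.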

Event Sourcing and Ecosystem
“it stores a complete history of the events
associated with the aggregates in your domain”
Reference 3: Introducing Event Sourcing, CQRS Journey [CQJ]
An event store is meant to hold the complete history of events.
Deviating from that also weakens the support from the ecosystem (plug-ins).

Summary
• Lessons learned from devops of an Akka Cluster app
  • Strategy for removing unreachable members
    • scale-in trigger
    • automatic restart/recreation
    • fallback mechanism: split-brain resolver / rejoining
  • Cost of deviating from Event Sourcing
Put several mechanisms in place for removing unreachable members.
Operating the journal like a cache can be surprisingly hard (at times).

Reference
[CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx
[PSE] Persistence - Schema Evolution, Akka Documentation, http://doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html
