Requirements in our case
• Host a large number of entities at low cost
• Fit into the existing Akka application
• Downtime is acceptable to some extent*
We want to deploy a large number of entities at low cost on top of Akka.
Some downtime is acceptable.
*online machine learning
Our application of Akka Cluster
Persistent actors are deployed with Cluster Sharding.
Commodity services are used for data storage.
[Architecture diagram: frontend nodes (existing app, tens of nodes, auto-scaling); entities nodes (new sub-system, several nodes); data stores: ElastiCache (Journal) and S3 (Snapshot)]
Membership Lifecycle in Cluster Specification
Overview of the cluster member lifecycle
http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle
[Diagram: member states joining, up, leaving, exiting, removed, unreachable, and down; user actions join and leave]
Joining and Leader Action
After the “leader action”, a new member becomes able to communicate with the other members.
[Diagram: join leads to joining; a leader action moves the member from joining to up; unreachable and down are shown as failure states]
Joining and Leader Action
When scale-in occurs, the removed node stays unreachable.
[Diagram: the failure detector marks the member as unreachable]
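Joining and Leader Action
Leader actions can no longer be performed. As a result, new members remain unable to communicate with the other members.
[Diagram: the leader action (joining → up) is blocked while a member is unreachable]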
Joining and Leader Action
Scale-in has to trigger marking the member as down.
[Diagram: scale-in trigger → mark as down (unreachable → down)]
Joining and Leader Action
Removing the unreachable member lets leader actions resume.
[Diagram: with the unreachable member gone, the leader action (joining → up) proceeds again]
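The sequence above can be illustrated with a small model. This is a sketch, not Akka code: the names LeaderActionModel, Member, leaderCanAct, leaderAction, and markDown are hypothetical, and the model assumes the simplified rule that the leader can act only while every member not marked as down is reachable. In a real cluster, the mark-as-down step corresponds to calling Cluster(system).down(address).

```scala
// Simplified model of leader actions being blocked by unreachable members.
// All names here are illustrative and not part of the Akka API.
object LeaderActionModel {
  sealed trait Status
  case object Joining extends Status
  case object Up extends Status
  case object Down extends Status

  final case class Member(name: String, status: Status, reachable: Boolean)

  // Assumption: the leader may act only if every member that is not
  // marked as Down is currently reachable.
  def leaderCanAct(members: Seq[Member]): Boolean =
    members.forall(m => m.status == Down || m.reachable)

  // The leader action modeled here promotes Joining members to Up.
  // If the leader is blocked, the membership is returned unchanged.
  def leaderAction(members: Seq[Member]): Seq[Member] =
    if (leaderCanAct(members))
      members.map(m => if (m.status == Joining) m.copy(status = Up) else m)
    else members

  // External action, for example triggered by scale-in.
  def markDown(members: Seq[Member], name: String): Seq[Member] =
    members.map(m => if (m.name == name) m.copy(status = Down) else m)
}
```

With members a (up, reachable), b (up, unreachable), and c (joining), leaderAction leaves c in joining. After markDown(members, "b"), the same call moves c to up.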
Leader actions blocked by unreachable members
Example log messages while leader actions cannot be performed:
Members that are “up” but have not seen the current state
“Leader can currently not perform its duties”
Causes and actions on unreachables
Type of failure                    Example                                    Possible external action
network partitions                 -                                          wait for recovery or abandon a part
machine crashes                    scale-in                                   mark as down
quarantined in Akka remote layer                                              restart an actor system
unresponsive process               long GC                                    restart a JVM
                                   CPU starvation by credit shortage in EC2   re-create an instance
The failure detector cannot tell the causes of errors apart.
Depending on the cause, recovery has to be driven from outside the cluster.
Split Brain Resolver* (commercial add-on)
• Mark members as “down” when a part of the cluster becomes unreachable for some time
• Strategies
• Static Quorum
• Keep Majority - default in Lagom
• Keep Oldest
• Keep Referee
* http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
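The commercial add-on can mark members as down automatically and fairly comprehensively.
It kicks in when member state and reachability have not changed for a certain period.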
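The decision behind two of the strategies listed above can be sketched as pure functions. This is a simplified reading of Static Quorum and Keep Majority, not the add-on's implementation: the names DowningDecision, staticQuorum, and keepMajority are hypothetical, and the tie-break (the side holding the member with the lowest address survives) is stated here as an assumption about how equal-sized partitions are resolved.

```scala
// Sketch of the downing decision taken by each side of a partition.
// Names are illustrative; this is not the Split Brain Resolver API.
object DowningDecision {
  sealed trait Decision
  case object DownUnreachable extends Decision // this side survives
  case object DownReachable extends Decision   // this side shuts itself down

  // Static Quorum: survive only if this side still has at least
  // quorumSize reachable members.
  def staticQuorum(reachable: Int, quorumSize: Int): Decision =
    if (reachable >= quorumSize) DownUnreachable else DownReachable

  // Keep Majority: survive if this side holds the majority of members.
  // Assumption: on a tie, the side containing the member with the
  // lowest address survives.
  def keepMajority(
      reachable: Int,
      unreachable: Int,
      reachableHasLowestAddress: Boolean
  ): Decision =
    if (reachable > unreachable) DownUnreachable
    else if (reachable < unreachable) DownReachable
    else if (reachableHasLowestAddress) DownUnreachable
    else DownReachable
}
```

Each side of a partition evaluates the rule independently from its own view, so the two sides must reach opposite decisions for the cluster to stay consistent. That is why Static Quorum needs a quorum size larger than half of the expected cluster size.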
Reset cluster membership (poor man’s)
Have the surviving nodes rejoin a new cluster.
[Diagram: seed(s); nodes move from the old cluster (ddata) to the new cluster (ddata)]
Caution
• Side effects caused by application restarts
• ddata is experimental (as of Akka 2.4)
• Use Akka 2.4.8 or higher*
Watch out for side effects of restarts and for the Akka version.
* includes a fix for distributed pub-sub (akka#20847)
To keep cluster membership healthy
1. Trigger mark-as-down (or leave) on scale-in
2. Automate restart/re-creation of the ActorSystem, JVM, and server instance
3. Set up a fallback mechanism such as a split brain resolver or rejoining into a new cluster
Summary of countermeasures against unreachable members
Cleanup old journals in Redis plug-in*
In some cases, the journal's deleteMessages leaves part of the data behind.
Pay attention to sequenceNr consistency with snapshots.
Key in Redis                               Removed on deleteMessages
journal:$persistenceId                     Yes
journal:$persistenceId.highestSequenceNr   No
* https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala
Deleting highestSequenceNr could cause an old snapshot to be loaded.
→ Keep only the latest snapshot.
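The snapshot hazard above can be illustrated with a small model. This is a sketch under stated assumptions, not plug-in code: the names SnapshotSelection, Snapshot, latest, and saveKeepingOnlyLatest are hypothetical. It assumes that recovery loads the snapshot with the highest sequenceNr, and that removing the highestSequenceNr key makes new events (and therefore new snapshots) start again from low sequence numbers.

```scala
// Model of why a reset sequence counter can resurrect an old snapshot.
// Names are illustrative and not part of Akka Persistence.
object SnapshotSelection {
  final case class Snapshot(sequenceNr: Long, state: String)

  // Assumption: recovery picks the snapshot with the highest sequenceNr.
  def latest(snapshots: Seq[Snapshot]): Option[Snapshot] =
    snapshots.sortBy(_.sequenceNr).lastOption

  // Mitigation from the slide: keep only the latest snapshot, so an
  // older snapshot with a larger sequenceNr can never be selected.
  def saveKeepingOnlyLatest(existing: Seq[Snapshot], saved: Snapshot): Seq[Snapshot] =
    Seq(saved)
}
```

If an old snapshot was taken at sequenceNr 100 and, after the counter is reset, a new snapshot is saved at sequenceNr 3, latest over both returns the old one. Saving with saveKeepingOnlyLatest leaves only the new snapshot to be loaded.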
Event Sourcing and Ecosystem
“it stores a complete history of the events
associated with the aggregates in your domain”
Reference 3: Introducing Event Sourcing, CQRS Journey[CQJ]
A proper event store is expected to hold the complete history of events.
Deviating from that also weakens the support you get from the ecosystem (plug-ins).
Summary
• Lessons learned from devops of an Akka Cluster app
• Strategy for removing unreachable members
• scale-in trigger
• automatic restart/re-creation
• fallback mechanism: split brain resolver / rejoining
• Lifecycle of journals
• cost of deviation from Event Sourcing
Put several mechanisms in place for removing unreachable members.
Operating the journal like a cache can be surprisingly hard (in some cases).
Reference
[CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx
[PSE] Persistence - Schema Evolution, Akka Documentation, http://doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html