SlideShare a Scribd company logo
1 of 26
Download to read offline
Akka Cluster and
Auto-scaling
Ikuo Matsumura
CyberAgent, Inc.
2017/02/26
Akka Cluster
⾮中央集権的なノード群構築を⾏うAkka拡張
10ヶ⽉程運⽤してきた中からつまづいた点・学んだ点を紹介
• Decentralized cluster membership service
• no single point of failure, bottleneck
• distribute actors over multiple JVMs
• Applied to build a sub-system on AD serving
• Tens of servers
• Operations about 10 months
Requirements in our case
• Host a lot of Entity with low cost
• Fit existing Akka application
• Down-time is acceptable to some extent*
Akkaベースで多数のEntityを低コストで配備したい
多少のダウンタイムは許容できる
*online machine learning
Our application of Akka Cluster
永続ActorをCluster Shardingで配備
データ保管にコモディティサービスを使⽤
frontend frontend frontend
entities
entities
…
…
entities
frontend
• Existing app
• Tens of nodes
• Auto-scaling
• New sub-system
• Several nodes
ElastiCache
(Journal)
S3
(Snapshot)
data stores
…
Challenges
• Strategy on unreachables removal
• Lifecycle of journals
運⽤する中でつまづいた2つの課題についてお話します
Membership Lifecycle in Cluster Specification
クラスタメンバーのライフサイクルの概観
http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle
joining
up
down
removed
join
leaving
exiting
unreachable
leave
Joinning and Leader Action
“leader action”を経て、他メンバと通信できるようになる
joining
up
down
join
unreachable
Joinning and Leader Action
“leader action”を経て、他メンバと通信できるようになる
joining
up
down
unreachable
leader action
Joinning and Leader Action
Scale-in発⽣時、unreachable のままになる
joining
up
down
unreachable
failure detector
leader action
Joinning and Leader Action
leader actionが⾏えなくなる。
結果、新規メンバが他のメンバと通信できないままに。
joining
up
down
unreachable
leader action
Joinning and Leader Action
Scale-inをトリガにしたdown指定が必要
joining
up
down
unreachable
leader action
scale-in
trigger
mark as down
Joinning and Leader Action
unreachableを除くことでleader actionが再開可能に
joining
up
down
unreachable
leader action
Leader actions blocked by unreachables
leader actionが⾏えない状態のログの例
Members that are “up” but have not seen the current state
“Leader can currently not perform its duties”
Causes and actions on unreachables
Type of failures Example
Possible
external action
network partitions -
wait for recovery or
abandon a part
machine crashes
scale-in mark as down
quarantined in
akka remote layer
restart
an actor system
unresponseive
process
long GC restart a JVM
CPU starvation by
credit shortage in EC2
re-create an instance
failure detector はエラーの原因までは区別できない
原因に応じてクラスタ外部からの回復措置が要る
Split Brain Resolver* (commercial add-on)
• Mark members as “down” when a part of the
cluster become unreachable for some time
• Strategies
• Static Quorum
• Keep Majority - default in Lagom
• Keep Oldest
• Keep Referee
* http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html
商⽤add-onである程度包括的に⾃動のdown指定が可能
⼀定時間メンバの状態・到達可能性に変化がない時に発動
Reset cluster membership (poor man’s)
存命ノードを新しいクラスタに参加させ直す
seed(s)
old cluster (ddata) new cluster (ddata)
Reset cluster membership (poor man’s)
存命ノードを新しいクラスタに参加させ直す
seed(s)
old cluster (ddata) new cluster (ddata)
Reset cluster membership (poor man’s)
存命ノードを新しいクラスタに参加させ直す
seed(s)
old cluster (ddata) new cluster (ddata)
Caution
• Side-effect caused by app restart
• ddata is experimental (at Akka 2.4)
• Use Akka 2.4.8 or higher*
再起動による副作⽤やAkkaのバージョンに注意が必要
*has a fix on distributed pub-sub akka#20847
To keep cluster membership healthy
1. Trigger mark-as-down (or leave) on scale-in
2. Automate restart/recreation of

AcotrSystem, JVM, server instance
3. Setup a fallback mechanism such as

split brain resolver, or

rejoining into a new cluster
unreachable対策のまとめ
Challenges
• Strategy of unreachables removal
• Lifecycle of journals
次に、2つ⽬の課題についてお話しします。
Journal
entities
entities
…
entities
ElastiCache
(Journal)
S3
(Snapshot)
Event Sourcingにおけるイベントストアに対応するAPI
Journalをキャッシュのように運⽤する想定をした
Cleanup old journals in Redis plug-in*
JournalのDeleteMessageでは⼀部データが残るケースがある
snapshotとのsequenceNrの⼀貫性に注意
key in Redis removed on deleteMessages
journal:$persistenceId Yes
journal:$persistenceId.highestSequenceNr No
* https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala
Deleting highstSequenceNr could cause loading old version of snapshot.
→ Keep only the latest snapshot.
Event Sourcing and Ecosystem
“it stores a complete history of the events
associated with the aggregates in your domain”
Reference 3: Introducing Event Sourcing, CQRS Journey[CQJ]
本来のイベントストアはイベントの完全な履歴を持つ想定
そこから逸れるとエコシステム(plug-in)のサポートも弱くなる
Summary
• Lessons learned from devops of an Akka Cluster app
• Strategy on unreachables removal
• scale-in trigger
• automatic restart/recreation
• fallback mechanism;

split-brain resolver / rejoining
• Lifecycle of journals
• cost of deviation from Event Sourcing
unreachableメンバを取り除く仕組みを各種⼊れておく
Journalのキャッシュ的運⽤は意外と⼤変(なことがある)
Reference
[CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian
Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian,
2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx
[PSE] Persistence - Schema Evolution, Akka Documentation, http://
doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html

More Related Content

What's hot

Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesLightbend
 
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)petabridge
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterKonstantin Tsykulenko
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesLightbend
 
The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)Robert Scholte
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to YouAmazon Web Services
 
SolrCloud Cluster management via APIs
SolrCloud Cluster management via APIsSolrCloud Cluster management via APIs
SolrCloud Cluster management via APIsAnshum Gupta
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Productionbilyushonak
 
CloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCatCloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCatAndrew Bayer
 
CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!Andrew Bayer
 
Solr security frameworks
Solr security frameworksSolr security frameworks
Solr security frameworksAnshum Gupta
 
Scala play-framework
Scala play-frameworkScala play-framework
Scala play-frameworkAbdhesh Kumar
 
MySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB ClustersMySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB ClustersMatt Lord
 
Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013Andrew Bayer
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
MySQL Group Replication - an Overview
MySQL Group Replication - an OverviewMySQL Group Replication - an Overview
MySQL Group Replication - an OverviewMatt Lord
 
MySQL Operator for Kubernetes
MySQL Operator for KubernetesMySQL Operator for Kubernetes
MySQL Operator for KubernetesKenny Gryp
 

What's hot (20)

Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
 
獨體模式
獨體模式 獨體模式
獨體模式
 
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
Continuous Deployment with Akka.Cluster and Kubernetes (Akka.NET)
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka Cluster
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
 
The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)The do's and don'ts with java 9 (Devoxx 2017)
The do's and don'ts with java 9 (Devoxx 2017)
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
 
SolrCloud Cluster management via APIs
SolrCloud Cluster management via APIsSolrCloud Cluster management via APIs
SolrCloud Cluster management via APIs
 
Testing at Stream-Scale
Testing at Stream-ScaleTesting at Stream-Scale
Testing at Stream-Scale
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Production
 
CloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCatCloudStack, jclouds, Jenkins and CloudCat
CloudStack, jclouds, Jenkins and CloudCat
 
CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!CloudStack, jclouds and Whirr!
CloudStack, jclouds and Whirr!
 
Real World Java 9
Real World Java 9Real World Java 9
Real World Java 9
 
Solr security frameworks
Solr security frameworksSolr security frameworks
Solr security frameworks
 
Scala play-framework
Scala play-frameworkScala play-framework
Scala play-framework
 
MySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB ClustersMySQL High Availability -- InnoDB Clusters
MySQL High Availability -- InnoDB Clusters
 
Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013Apache jclouds SF Meetup, July 8, 2013
Apache jclouds SF Meetup, July 8, 2013
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive Platform
 
MySQL Group Replication - an Overview
MySQL Group Replication - an OverviewMySQL Group Replication - an Overview
MySQL Group Replication - an Overview
 
MySQL Operator for Kubernetes
MySQL Operator for KubernetesMySQL Operator for Kubernetes
MySQL Operator for Kubernetes
 

Viewers also liked

Preparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriPreparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriTIS Inc.
 
Developing an Akka Edge1-3
Developing an Akka Edge1-3Developing an Akka Edge1-3
Developing an Akka Edge1-3saaaaaaki
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldKonrad Malawski
 
Deadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching PatternDeadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching Patternchibochibo
 
Make your programs Free
Make your programs FreeMake your programs Free
Make your programs FreePawel Szulc
 
Scala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.jsScala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.jstakezoe
 
Van laarhoven lens
Van laarhoven lensVan laarhoven lens
Van laarhoven lensNaoki Aoyama
 
Going bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data typesGoing bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data typesPawel Szulc
 
Reducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer ExampleReducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer ExampleConnie Chen
 
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)Eugene Yokota
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineeringunivalence
 
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~Masahito Zembutsu
 
ScalaにまつわるNewsな話
ScalaにまつわるNewsな話ScalaにまつわるNewsな話
ScalaにまつわるNewsな話Yosuke Mizutani
 
Big Data Analytics Tokyo
Big Data Analytics TokyoBig Data Analytics Tokyo
Big Data Analytics TokyoAdam Gibson
 
Scala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscalaScala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscalaKazuhiro Sera
 
Arquitectura barroca
Arquitectura barrocaArquitectura barroca
Arquitectura barrocaMaria Carmona
 
Sbtのマルチプロジェクトはいいぞ
SbtのマルチプロジェクトはいいぞSbtのマルチプロジェクトはいいぞ
SbtのマルチプロジェクトはいいぞYoshitaka Fujii
 

Viewers also liked (20)

Preparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuriPreparing for distributed system failures using akka #ScalaMatsuri
Preparing for distributed system failures using akka #ScalaMatsuri
 
Scala Matsuri 2017
Scala Matsuri 2017Scala Matsuri 2017
Scala Matsuri 2017
 
Developing an Akka Edge1-3
Developing an Akka Edge1-3Developing an Akka Edge1-3
Developing an Akka Edge1-3
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming World
 
Deadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching PatternDeadly Code! (seriously) Blocking & Hyper Context Switching Pattern
Deadly Code! (seriously) Blocking & Hyper Context Switching Pattern
 
Make your programs Free
Make your programs FreeMake your programs Free
Make your programs Free
 
Scala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.jsScala Warrior and type-safe front-end development with Scala.js
Scala Warrior and type-safe front-end development with Scala.js
 
Van laarhoven lens
Van laarhoven lensVan laarhoven lens
Van laarhoven lens
 
Pratical eff
Pratical effPratical eff
Pratical eff
 
Going bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data typesGoing bananas with recursion schemes for fixed point data types
Going bananas with recursion schemes for fixed point data types
 
Reducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer ExampleReducing Boilerplate and Combining Effects: A Monad Transformer Example
Reducing Boilerplate and Combining Effects: A Monad Transformer Example
 
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
The state of sbt 0.13, sbt server, and sbt 1.0 (ScalaMatsuri ver)
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineering
 
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
Dockerの期待と現実~Docker都市伝説はなぜ生まれるのか~
 
ScalaにまつわるNewsな話
ScalaにまつわるNewsな話ScalaにまつわるNewsな話
ScalaにまつわるNewsな話
 
Big Data Analytics Tokyo
Big Data Analytics TokyoBig Data Analytics Tokyo
Big Data Analytics Tokyo
 
Scala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscalaScala が支える医療系ウェブサービス #jissenscala
Scala が支える医療系ウェブサービス #jissenscala
 
Arquitectura barroca
Arquitectura barrocaArquitectura barroca
Arquitectura barroca
 
究極のPHP本完成
究極のPHP本完成究極のPHP本完成
究極のPHP本完成
 
Sbtのマルチプロジェクトはいいぞ
SbtのマルチプロジェクトはいいぞSbtのマルチプロジェクトはいいぞ
Sbtのマルチプロジェクトはいいぞ
 

Similar to Akka Cluster Auto-scaling Lessons Learned

Cassandra Bootstap from Backups
Cassandra Bootstap from BackupsCassandra Bootstap from Backups
Cassandra Bootstap from BackupsInstaclustr
 
Cassandra Bootstrap from Backups
Cassandra Bootstrap from BackupsCassandra Bootstrap from Backups
Cassandra Bootstrap from BackupsInstaclustr
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowAdrian Cockcroft
 
Accelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red HatAccelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red HatNetApp
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018harvraja
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast EssentialsRahul Gupta
 
Openstack study-nova-02
Openstack study-nova-02Openstack study-nova-02
Openstack study-nova-02Jinho Shin
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryDoKC
 
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKitAtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKitWojciech Seliga
 
MySQL Database Architectures - 2020-10
MySQL Database Architectures -  2020-10MySQL Database Architectures -  2020-10
MySQL Database Architectures - 2020-10Kenny Gryp
 
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best PracticesMySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best PracticesKenny Gryp
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr PerformanceLucidworks
 
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScyllaDB
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...Paul Brebner
 
Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds Arun Gupta
 
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...DataStax
 
Effective Testing in DSE
Effective Testing in DSEEffective Testing in DSE
Effective Testing in DSEpedjak
 

Similar to Akka Cluster Auto-scaling Lessons Learned (20)

Cassandra Bootstap from Backups
Cassandra Bootstap from BackupsCassandra Bootstap from Backups
Cassandra Bootstap from Backups
 
Cassandra Bootstrap from Backups
Cassandra Bootstrap from BackupsCassandra Bootstrap from Backups
Cassandra Bootstrap from Backups
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
Accelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red HatAccelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
Accelerate Your OpenStack Deployment Presented by SolidFire and Red Hat
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast Essentials
 
AppFabric Velocity
AppFabric VelocityAppFabric Velocity
AppFabric Velocity
 
Openstack study-nova-02
Openstack study-nova-02Openstack study-nova-02
Openstack study-nova-02
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
 
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKitAtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
AtlasCamp 2012 - Testing JIRA plugins smarter with TestKit
 
MySQL Database Architectures - 2020-10
MySQL Database Architectures -  2020-10MySQL Database Architectures -  2020-10
MySQL Database Architectures - 2020-10
 
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best PracticesMySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
MySQL InnoDB Cluster - New Features in 8.0 Releases - Best Practices
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
 
Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds Running your Java EE 6 applications in the clouds
Running your Java EE 6 applications in the clouds
 
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
DataStax | Effective Testing in DSE (Lessons Learned) (Predrag Knezevic) | Ca...
 
Effective Testing in DSE
Effective Testing in DSEEffective Testing in DSE
Effective Testing in DSE
 

Recently uploaded

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Akka Cluster Auto-scaling Lessons Learned

  • 1. Akka Cluster and Auto-scaling Ikuo Matsumura CyberAgent, Inc. 2017/02/26
  • 2. Akka Cluster ⾮中央集権的なノード群構築を⾏うAkka拡張 10ヶ⽉程運⽤してきた中からつまづいた点・学んだ点を紹介 • Decentralized cluster membership service • no single point of failure, bottleneck • distribute actors over multiple JVMs • Applied to build a sub-system on AD serving • Tens of servers • Operations about 10 months
  • 3. Requirements in our case • Host a lot of Entity with low cost • Fit existing Akka application • Down-time is acceptable to some extent* Akkaベースで多数のEntityを低コストで配備したい 多少のダウンタイムは許容できる *online machine learning
  • 4. Our application of Akka Cluster 永続ActorをCluster Shardingで配備 データ保管にコモディティサービスを使⽤ frontend frontend frontend entities entities … … entities frontend • Existing app • Tens of nodes • Auto-scaling • New sub-system • Several nodes ElastiCache (Journal) S3 (Snapshot) data stores …
  • 5. Challenges • Strategy on unreachables removal • Lifecycle of journals 運⽤する中でつまづいた2つの課題についてお話します
  • 6. Membership Lifecycle in Cluster Specification クラスタメンバーのライフサイクルの概観 http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle joining up down removed join leaving exiting unreachable leave
  • 7. Joinning and Leader Action “leader action”を経て、他メンバと通信できるようになる joining up down join unreachable
  • 8. Joinning and Leader Action “leader action”を経て、他メンバと通信できるようになる joining up down unreachable leader action
  • 9. Joinning and Leader Action Scale-in発⽣時、unreachable のままになる joining up down unreachable failure detector leader action
  • 10. Joinning and Leader Action leader actionが⾏えなくなる。 結果、新規メンバが他のメンバと通信できないままに。 joining up down unreachable leader action
  • 11. Joinning and Leader Action Scale-inをトリガにしたdown指定が必要 joining up down unreachable leader action scale-in trigger mark as down
  • 12. Joinning and Leader Action unreachableを除くことでleader actionが再開可能に joining up down unreachable leader action
  • 13. Leader actions blocked by unreachables leader actionが⾏えない状態のログの例 Members that are “up” but have not seen the current state “Leader can currently not perform its duties”
  • 14. Causes and actions on unreachables Type of failures Example Possible external action network partitions - wait for recovery or abandon a part machine crashes scale-in mark as down quarantined in akka remote layer restart an actor system unresponseive process long GC restart a JVM CPU starvation by credit shortage in EC2 re-create an instance failure detector はエラーの原因までは区別できない 原因に応じてクラスタ外部からの回復措置が要る
  • 15. Split Brain Resolver* (commercial add-on) • Mark members as “down” when a part of the cluster become unreachable for some time • Strategies • Static Quorum • Keep Majority - default in Lagom • Keep Oldest • Keep Referee * http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html 商⽤add-onである程度包括的に⾃動のdown指定が可能 ⼀定時間メンバの状態・到達可能性に変化がない時に発動
  • 16. Reset cluster membership (poor man’s) 存命ノードを新しいクラスタに参加させ直す seed(s) old cluster (ddata) new cluster (ddata)
  • 17. Reset cluster membership (poor man’s) 存命ノードを新しいクラスタに参加させ直す seed(s) old cluster (ddata) new cluster (ddata)
  • 18. Reset cluster membership (poor man’s) 存命ノードを新しいクラスタに参加させ直す seed(s) old cluster (ddata) new cluster (ddata)
  • 19. Caution • Side-effect caused by app restart • ddata is experimental (at Akka 2.4) • Use Akka 2.4.8 or higher* 再起動による副作⽤やAkkaのバージョンに注意が必要 *has a fix on distributed pub-sub akka#20847
  • 20. To keep cluster membership healthy 1. Trigger mark-as-down (or leave) on scale-in 2. Automate restart/recreation of
 AcotrSystem, JVM, server instance 3. Setup a fallback mechanism such as
 split brain resolver, or
 rejoining into a new cluster unreachable対策のまとめ
  • 21. Challenges • Strategy of unreachables removal • Lifecycle of journals 次に、2つ⽬の課題についてお話しします。
  • 23. Cleanup old journals in Redis plug-in* JournalのDeleteMessageでは⼀部データが残るケースがある snapshotとのsequenceNrの⼀貫性に注意 key in Redis removed on deleteMessages journal:$persistenceId Yes journal:$persistenceId.highestSequenceNr No * https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala Deleting highstSequenceNr could cause loading old version of snapshot. → Keep only the latest snapshot.
  • 24. Event Sourcing and Ecosystem “it stores a complete history of the events associated with the aggregates in your domain” Reference 3: Introducing Event Sourcing, CQRS Journey[CQJ] 本来のイベントストアはイベントの完全な履歴を持つ想定 そこから逸れるとエコシステム(plug-in)のサポートも弱くなる
  • 25. Summary • Lessons learned from devops of an Akka Cluster app • Strategy on unreachables removal • scale-in trigger • automatic restart/recreation • fallback mechanism;
 split-brain resolver / rejoining • Lifecycle of journals • cost of deviation from Event Sourcing unreachableメンバを取り除く仕組みを各種⼊れておく Journalのキャッシュ的運⽤は意外と⼤変(なことがある)
  • 26. Reference [CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx [PSE] Persistence - Schema Evolution, Akka Documentation, http:// doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html