Preparing for distributed system failures using akka #ScalaMatsuri

Preparing for distributed system failures with Akka
Presentation of ScalaMatsuri 2017


  1. Preparing for distributed system failures using Akka. 2017.2.25, Scala Matsuri. Yugo Maede (@yugolf). Copyright © 2017 TIS Inc. All rights reserved.
  2. Who am I? @yugolf. TIS Inc. provides a "Reactive Systems Consulting Service" (https://twitter.com/okapies/status/781439220330164225): supporting PoC projects, reviewing designs, reviewing code, etc.
  3. Today's topics: What are architectural safety measures in a distributed system? How do you realize them with Akka?
  4. Microservices mean distributed systems: moving from a monolith to microservices means building a distributed system.
  5. "Moore's law is dead" means "distributed systems are just beginning": CPU performance has reached its limits.
  6. Confronting distributed systems: building large-scale systems requires distributed systems; without them there is no business success.
  7. Building a distributed system is not easy: increasing the number of servers increases the number of failure points, and a new enemy appears, the network.
  8. Architectural safety measures: define cross-functional requirements, such as availability and response time/latency.
  9. Systems based on failure: we need antifragile organizations and systems designed on the premise that failures happen.
  10. Architectural safety measures needed: timeout, bulkhead, circuit breaker, ...
  11. Akka is here: Akka has tools to deal with distributed system failures.
  12. Akka Actor: an actor processes messages in order of arrival, which makes asynchronous processing simple to implement. (Diagram labels: participant, host, a mailbox holding $10 messages, status $30.)
  13. Supervisor hierarchy ("let it crash"): a supervisor watches its child actors; when a child signals a failure, the supervisor handles it by choosing to restart, resume, stop, or escalate.
  14. timeout
  15. Request-response needs a timeout: the response may be slow, or may never come back.
  16. Message passing: prefer tell (!, fire and forget); when using ask (?), set an appropriate timeout (1 s in the diagram).
  17. Timeout configuration: setting the timeout for ask
 import akka.pattern.{ask, AskTimeoutException}
 import akka.util.Timeout
 import scala.concurrent.duration._
 import scala.util.{Failure, Success}
 import system.dispatcher // ExecutionContext for the Future callbacks
 
 // ask (?) fails the Future with AskTimeoutException if no reply arrives in time
 implicit val timeout: Timeout = Timeout(5.seconds)
 val response = kitchen ? KitchenActor.DripCoffee(count)
 
 response.mapTo[OrderCompleted] onComplete {
   case Success(result) =>
     log.info(s"success: ${result.message}")
   case Failure(e: AskTimeoutException) =>
     log.info(s"failure: ${e.getMessage}")
   case Failure(t) =>
     log.info(s"failure: ${t.getMessage}")
 }
  18. What if the receiver has a problem? (ask with a 1 s timeout)
  19. If a receiver has a problem: the supervisor (restart, resume, stop, escalate) never returns the failure to the sender.
  20. Timeout! If a receiver has a problem, no response comes back, so the sender needs a timeout.
  21. Implementation of the ask pattern 1/2: ? delegates to internalAsk
 def ?(message: Any)(implicit timeout: Timeout, sender: ActorRef = Actor.noSender): Future[Any] =
   internalAsk(message, timeout, sender)
 
 private[pattern] def internalAsk(message: Any, timeout: Timeout, sender: ActorRef): Future[Any] = actorSel.anchor match {
   case ref: InternalActorRef ⇒
     if (timeout.duration.length <= 0)
       Future.failed[Any](
         new IllegalArgumentException(s"""Timeout length must not be negative, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
     else {
       // a temporary PromiseActorRef receives the reply and completes the Future
       val a = PromiseActorRef(ref.provider, timeout, targetName = actorSel, message.getClass.getName, sender)
       actorSel.tell(message, a)
       a.result.future
     }
   case _ ⇒ Future.failed[Any](new IllegalArgumentException(s"""Unsupported recipient ActorRef type, question not sent to [$actorSel]. Sender[$sender] sent the message of type "${message.getClass.getName}"."""))
 }
  22. Implementation of the ask pattern 2/2: akka.pattern.PromiseActorRef sets up a scheduler task that fails the promise with AskTimeoutException once the timeout elapses
 def apply(provider: ActorRefProvider, timeout: Timeout, targetName: Any, messageClassName: String, sender: ActorRef = Actor.noSender): PromiseActorRef = {
   val result = Promise[Any]()
   val scheduler = provider.guardian.underlying.system.scheduler
   val a = new PromiseActorRef(provider, result, messageClassName)
   implicit val ec = a.internalCallingThreadExecutionContext
   // fail the promise when the timeout fires (a no-op if a reply already completed it)
   val f = scheduler.scheduleOnce(timeout.duration) {
     result tryComplete Failure(
       new AskTimeoutException(s"""Ask timed out on [$targetName] after [${timeout.duration.toMillis} ms]. Sender[$sender] sent message of type "${a.messageClassName}"."""))
   }
   // once completed either way, stop the temporary actor and cancel the timer
   result.future onComplete { _ ⇒ try a.stop() finally f.cancel() }
   a
 }
  23. circuit breaker
  24. A receiver is down: the service you want to query may be down.
  25. Response latency will rise: 100 ms when normal, 1 s when abnormal (timeout = 1 s). Responses degrade, and overload spreads the degradation.
  26. Apply a circuit breaker: use a circuit breaker so that services that are down are not queried.
  27. What is a circuit breaker? "Once the failures reach a certain threshold, the circuit breaker trips" (https://martinfowler.com/bliki/CircuitBreaker.html): after a set number of repeated failures, further calls are suppressed.
  28. A circuit breaker has three states (http://doc.akka.io/docs/akka/current/common/circuitbreaker.html). Closed: messages can be sent. Open: messages cannot be sent. Half-Open: after the reset timeout, a trial call is let through to test whether the service has recovered.
  29. Decrease the latency: stop pointless requests so that they add no latency. 100 ms when normal; 1 s when abnormal while the breaker is Closed (timeout = 1 s); x ms once it is Open.
  30. Apply a circuit breaker: implementation (http://doc.akka.io/docs/akka/current/common/circuitbreaker.html). After 5 failures the breaker opens and lets no messages through for 1 minute.
 import akka.pattern.{CircuitBreaker, pipe}
 import scala.concurrent.Future
 import scala.concurrent.duration._
 
 // inside an Actor; Future and pipeTo also need an implicit ExecutionContext in scope
 val breaker =
   new CircuitBreaker(
     context.system.scheduler,
     maxFailures = 5,
     callTimeout = 10.seconds,
     resetTimeout = 1.minute).onOpen(notifyMeOnOpen())
 
 def receive = {
   case "dangerousCall" =>
     breaker.withCircuitBreaker(Future(dangerousCall)) pipeTo sender()
 }
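The state transitions behind the circuit breaker slides can be sketched in plain Scala without Akka. This is a minimal single-threaded illustration, not Akka's implementation: `SimpleBreaker`, `protect`, and the caller-supplied `now` timestamp are hypothetical names chosen for the example.

```scala
// A minimal, single-threaded sketch of circuit breaker state transitions.
// SimpleBreaker and its members are hypothetical names, not Akka's API.
sealed trait State
case object Closed extends State
case object Open extends State
case object HalfOpen extends State

final class SimpleBreaker(maxFailures: Int, resetTimeoutMillis: Long) {
  private var state: State = Closed
  private var failures = 0
  private var openedAt = 0L

  def currentState: State = state

  // Runs `call` if the breaker allows it; `now` is passed in so the sketch is testable.
  def protect[A](now: Long)(call: => A): Either[String, A] = {
    // After the reset timeout, move from Open to HalfOpen and allow one trial call.
    if (state == Open && now - openedAt >= resetTimeoutMillis) state = HalfOpen
    state match {
      case Open => Left("circuit open: failing fast")
      case _ =>
        try {
          val result = call
          failures = 0
          state = Closed // a success closes the breaker again
          Right(result)
        } catch {
          case e: Exception =>
            failures += 1
            // A failed trial call, or too many failures, opens the breaker.
            if (state == HalfOpen || failures >= maxFailures) {
              state = Open
              openedAt = now
            }
            Left(s"call failed: ${e.getMessage}")
        }
    }
  }
}
```

While the breaker is Open, `protect` returns immediately without running the call, which is what removes the 1 s timeout wait from the caller's latency.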
  31. Blocked threads: blocking calls exhaust the threads, and the latency propagates.
  32. Prevention of propagation: cutting off the abnormal service keeps the problem from propagating upstream.
  33. CAP trade-off: return old information vs. don't return anything; just do my work vs. need to synchronize with others. Is it acceptable to return stale information? Is it fine to proceed without synchronizing with others? (Diagram labels: cache, push; read, write.)
  34. Rate limiting: a rate limiter protects the service from a burst of requests from the same client, e.g. no more than 100 requests in any 3 s interval.
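One way to read "no more than 100 requests in any 3 sec interval" is a per-client sliding window. The following is a sketch under that assumption; `SlidingWindowLimiter` and `tryAcquire` are hypothetical names, the class is not thread-safe, and the slides do not say which algorithm the rate limiter uses.

```scala
import scala.collection.immutable.Queue

// Hypothetical sliding-window limiter: at most `maxRequests` per client
// within any window of `windowMillis` milliseconds. Not thread-safe.
final class SlidingWindowLimiter(maxRequests: Int, windowMillis: Long) {
  private var history = Map.empty[String, Queue[Long]]

  // Returns true if the request is admitted; `now` is a caller-supplied timestamp.
  def tryAcquire(clientId: String, now: Long): Boolean = {
    // Forget timestamps that have fallen out of the window.
    val recent = history.getOrElse(clientId, Queue.empty[Long])
      .dropWhile(t => now - t >= windowMillis)
    val allowed = recent.size < maxRequests
    history = history.updated(clientId, if (allowed) recent.enqueue(now) else recent)
    allowed
  }
}
```

With `new SlidingWindowLimiter(100, 3000L)`, the 101st request from one client inside 3 seconds is rejected, while other clients are unaffected.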
  35. bulkhead
  36. Even if there is damage next door, are you OK? Being affected when an unrelated neighbor goes down is an unlucky event.
  37. A bulkhead blocks the damage: put a bulkhead between the actor that blocks threads and the actors that would be affected.
  38. Isolating the blocking calls in actors: move blocking code into its own actor, on its own dispatcher, so it shares no resources with the others
 val blockingActor = context.actorOf(Props[BlockingActor].
   withDispatcher("blocking-actor-dispatcher"),
   "blocking-actor")
 
 class BlockingActor extends Actor {
   def receive = {
     case GetCustomer(id) =>
       // calling database
       …
   }
 }
  39. Blocking in a Future: a blocking Future, Future { /* blocking */ }, exhausts the dispatcher's threads.
  40. Using the default dispatcher (http://www.slideshare.net/ktoso/zen-of-akka#44)
  41. Isolating the blocking Future: run the blocking Future { /* blocking */ } on separate threads.
  42. Using a dedicated dispatcher (http://www.slideshare.net/ktoso/zen-of-akka#44)
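The idea of isolating blocking Futures can be sketched with the Scala standard library alone. Here a dedicated thread pool stands in for a dedicated Akka dispatcher; `BlockingIsolation`, `loadCustomer`, and the pool size of 4 are assumptions made for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingIsolation {
  // Dedicated pool for blocking work; the size of 4 is an arbitrary assumption.
  private val pool = Executors.newFixedThreadPool(4)
  val blockingEc: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  // Hypothetical blocking call standing in for, e.g., a database query.
  def loadCustomer(id: Int): Future[String] =
    Future {
      Thread.sleep(50) // simulated blocking I/O
      s"customer-$id"
    }(blockingEc) // explicitly run on the dedicated pool, not the default one

  def shutdown(): Unit = pool.shutdown()
}
```

A caller could use it as `Await.result(BlockingIsolation.loadCustomer(7), 5.seconds)`. Because the blocking body never runs on the default execution context, other Futures and actors keep their threads even when all four blocking threads are busy.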
  43. CQRS (Command and Query Responsibility Segregation): separate commands (write) from queries (read).
  44. cluster
  45. Hardware will fail: if there are 365 machines that each fail once a year, on average one machine fails every day. Wouldn't a machine break even when it's hosted in the cloud?
  46. Availability of AWS: for example, an AWS availability monitoring site, https://cloudharmony.com/status-of-compute-and-storage-and-cdn-and-dns-for-aws
  47. Preparing for hardware failure: minimize single points of failure, and persist state so that it can be recovered.
  48. Cluster members (node1 to node4) monitor each other by sending heartbeats, and detect failures that way.
  49. Recovering state (akka-persistence): events are persisted, and replaying them restores the state on another node of the cluster.
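Recovery by replay amounts to folding the persisted events over an empty state. The sketch below shows that idea in plain Scala; it does not use the akka-persistence API, and `Account`, `Deposited`, `Withdrawn`, and `recover` are hypothetical names.

```scala
// Hypothetical event-sourced account; all names here are illustrative.
sealed trait Event
final case class Deposited(amount: Int) extends Event
final case class Withdrawn(amount: Int) extends Event

final case class Account(balance: Int) {
  // Applying one event yields the next state.
  def applyEvent(event: Event): Account = event match {
    case Deposited(a) => copy(balance = balance + a)
    case Withdrawn(a) => copy(balance = balance - a)
  }
}

object Account {
  val empty: Account = Account(0)

  // Recovery: replay the persisted journal, in order, starting from the empty state.
  def recover(journal: Seq[Event]): Account =
    journal.foldLeft(empty)((state, event) => state.applyEvent(event))
}
```

Since the state is derived only from the journal, any node that can read the journal can rebuild the same state after the original node fails.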
  50. The database may be down or overloaded: while the persistence store has not yet recovered (e.g. the db has not started yet), do not retry the replay blindly.
  51. BackoffSupervisor (http://doc.akka.io/docs/akka/current/general/supervision.html#Delayed_restarts_with_the_BackoffSupervisor_pattern): tries to start the child at increasing intervals of 3 s, 6 s, 12 s, ...
 import akka.pattern.{Backoff, BackoffSupervisor}
 import scala.concurrent.duration._
 
 val childProps = Props(classOf[EchoActor])
 
 val supervisor = BackoffSupervisor.props(
   Backoff.onStop(
     childProps,
     childName = "myEcho",
     minBackoff = 3.seconds,
     maxBackoff = 30.seconds,
     randomFactor = 0.2 // adds 20% "noise" to vary the intervals slightly
   ))
 
 system.actorOf(supervisor, name = "echoSupervisor")
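The 3, 6, 12, ... progression can be reproduced with a small calculation: double the minimum backoff per restart, cap it at the maximum, then stretch it by a random fraction. This is a sketch of that reading of the pattern, not Akka's exact code; `BackoffSketch.delay` and its `rnd` parameter (a random number in [0, 1) supplied by the caller) are hypothetical.

```scala
import scala.concurrent.duration._

object BackoffSketch {
  // Delay before restart number `restartCount` (0-based): min * 2^n, capped at max,
  // then stretched by up to `randomFactor`. `rnd` in [0, 1) is passed in for testability.
  def delay(restartCount: Int, min: FiniteDuration, max: FiniteDuration,
            randomFactor: Double, rnd: Double): FiniteDuration = {
    val exponential = min.toMillis * math.pow(2, restartCount.toDouble)
    val capped = math.min(exponential, max.toMillis.toDouble)
    (capped * (1.0 + rnd * randomFactor)).toLong.millis
  }
}
```

With `minBackoff = 3.seconds`, `maxBackoff = 30.seconds`, and no jitter, the first five delays are 3 s, 6 s, 12 s, 24 s, and 30 s. The jitter keeps many restarting children from hitting a recovering database at the same instant.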
  52. split brain resolver
  53. In the case of network partitions: the network between cluster nodes (node1 to node4) can be cut.
  54. Using a split brain resolver: when a partition splits the nodes (node1 to node5) into Cluster1 and Cluster2, apply a split brain resolver to keep the two sides consistent.
  55. Strategy 1/4, Static Quorum (quorum-size = 3). Which side can survive? The side whose number of nodes is quorum-size or more.
  56. Strategy 2/4, Keep Majority. Which side can survive? The side that holds more than 50% of the nodes.
  57. Strategy 3/4, Keep Oldest. Which side can survive? The side that contains the oldest node (node1 in the diagram).
  58. Strategy 4/4, Keep Referee. Which side can survive? The side that contains the given referee node (address = "akka.tcp://system@node1:port").
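The four survival rules can be written as one decision function. This is a hypothetical model for illustration, not the Lightbend Split Brain Resolver API; `SplitBrainSketch`, `survives`, and the strategy case classes are names invented here.

```scala
// Hypothetical model of the four strategies; not the Lightbend SBR API.
object SplitBrainSketch {
  sealed trait Strategy
  final case class StaticQuorum(quorumSize: Int) extends Strategy
  case object KeepMajority extends Strategy
  final case class KeepOldest(oldest: String) extends Strategy
  final case class KeepReferee(referee: String) extends Strategy

  // `side` is the set of nodes this partition can still reach (including itself);
  // `all` is the full membership before the partition.
  def survives(strategy: Strategy, side: Set[String], all: Set[String]): Boolean =
    strategy match {
      case StaticQuorum(q) => side.size >= q
      case KeepMajority    => side.size * 2 > all.size
      case KeepOldest(o)   => side.contains(o)
      case KeepReferee(r)  => side.contains(r)
    }
}
```

Assuming five nodes split into {node1, node4} and {node2, node3, node5}, Static Quorum (3) and Keep Majority keep the three-node side, while Keep Oldest and Keep Referee keep whichever side holds node1. In every case at most one side survives, which is the property that prevents split brain.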
  59. SBR is included in the Lightbend Reactive Platform (http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html). Also listed: akka-cluster-custom-downing (https://github.com/TanUkkii007/akka-cluster-custom-downing).
  60. idempotence
  61. Failed to receive the ack message: "Coffee, please!" If the ack for Order(coffee, 1) is not received and the message is resent, it becomes a duplicate order.
  62. Idempotence: "Coffee, please!" With Order(id1, coffee, 1), applying the message multiple times is not harmful; an idempotent design keeps things consistent even if a message is received more than once.
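The difference between Order(coffee, 1) and Order(id1, coffee, 1) is that the receiver can recognize a resend by its id. A minimal sketch of such a receiver, with `OrderBook`, `handle`, and `totalQuantity` as hypothetical names:

```scala
// Hypothetical order handler that deduplicates by order id. Not thread-safe.
final case class Order(id: String, item: String, quantity: Int)

final class OrderBook {
  private var accepted = Map.empty[String, Order]

  // Returns true if the order was newly accepted, false if it was a resend.
  def handle(order: Order): Boolean =
    if (accepted.contains(order.id)) false
    else { accepted += (order.id -> order); true }

  def totalQuantity(item: String): Int =
    accepted.values.filter(_.item == item).map(_.quantity).sum
}
```

The sender can now resend safely whenever an ack is lost: the second `handle` call for `id1` changes nothing, so only one coffee is ordered.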
  63. summary
  64. Summary: microservices mean distributed systems; define cross-functional requirements; design for failure. Failures will happen, so accept them.
  65. Summary: timeout, circuit breaker, bulkhead, cluster, backoff, split brain resolver, ... by using Akka. Akka provides a toolkit for dealing with distributed system failures.
  66. Reference materials: Building Microservices; Reactive Design Patterns; Reactive Application Development; Effective Akka; http://akka.io/
  67. Now translating: volunteers wanted to help translate akka.io! Please join us on Gitter. https://gitter.im/akka-ja/akka-doc-ja https://github.com/akka-ja/akka-doc-ja/
  68. 68. THANK YOU
