HM-2 (Health Manager v2, Cloud Foundry)

HM-2
NTT 車谷駿介


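#whoami
• Normally engaged in research and development
– Cloud Foundry, and PaaS in general
– Distributed processing technology (a personal interest)
– Thin-client technology for smartphones
• Originally came from the a11y field, or so it seems.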

What is HealthManager
Issues commands to the Cloud Controller based on the status of instances

[Figure: Cloud Foundry architecture diagram. Labels: cloud controller, stager, health manager, dea, cc - db, uaa - db, uaa - AuthN, redis, blobstore, staging jobs, package cache, staging logs]

Figure source: http://www.slideshare.net/marklucovsky/cloud-foundry-open-tour-london

HealthManager 2.0


Used on CF.com

https://github.com/cloudfoundry/cf-release/tree/master/src

maybe...

https://github.com/cloudfoundry/cloud_controller/tree/master/health_manager/lib

Simply, HM-1 is:
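• Tracks instance state by brute force
– NATS.subscribe
– Held locally in a plain Hash
• Periodically checks whether the state is valid
– Periodically fetches the expected state from the CC
• Attempts a correction as soon as it notices a problem
– NATS.publish
• lib/health_manager.rb holds all the authority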

Not much difference
• Refactoring. That is all!
– Separation of components
• The source is now easier to follow

Misc. Changes
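• Broadcasting state changes via notify
– Easier to customize according to the HM's situation
• Heavier use of the priority queue
– Prioritization according to the corrective action
• Long-running tasks are managed when executed
– So that #analyze_all_apps is not run concurrently
• Tuned restart frequency for flapping instances
• Redesign of varz (stats)
– Added a lock mechanism and provided an interface
• HM-1-compatible behavior is also provided... or at least attempted

[28f1b8d9469d196c504de628bb36425888482f46]
This material is based on the source code as of 2012-06-26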

Separation of components

REFACTORING

Structure of HM-1
Cloud Foundry
(Cloud Controller, etc.)

HealthManager
• Orchestration of applications
• Monitoring of state changes reported by instances
• Detection of instance anomalies
• Control when an anomaly occurs (issuing commands)

Structure of HM-2
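Cloud Foundry (Cloud Controller, etc.)

Nudger
• Sends messages outside the HM
• Has a priority queue
• Batch processing

Manager
• Registers the other components
• Configuration and initialization

Reporter
• Responds to requests for application state
• healthmanager.[status, health]

Known State Provider
• Tracks instance state changes
• Invokes event handlers

Expected State Provider
• Fetches the ExpectedState from the CC and provides it
• Via HTTP (Bulk API)

Shadower
• Provides HM-1-compatible behavior
• Centralizes communication with CF

Harmonizer
• Periodically compares the Known State with the Expected State
• Corrects the current state toward the desired state

Scheduler: encapsulates EventMachine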

Desc. Structure of HM-2
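Cloud Foundry (Cloud Controller, etc.)
• Sending requests to the CC
• Initialization and overall control
• Responding to requests from CF
• Tracking the actual state
• Tracking the state the CC expects
• Monitoring state and deciding on corrections
• Improving compatibility
• Task management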

Desc. Structure of HM-2
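Cloud Foundry (Cloud Controller, etc.)
• Sending requests to the CC
• Initialization and overall control
• Responding to requests from CF
• Tracking the actual state
• Tracking the state the CC expects
• Monitoring state and deciding on corrections
• Improving compatibility
• Task management

I/F
• Responds to requests from Cloud Foundry
• Issues commands to Cloud Foundry

Orchestration
• Tracks instance state
• Decides what to do to correct the state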

Defined at...
Cloud Foundry (Cloud Controller, etc.)

Nudger
• nudger.rb

Manager
• health_manager.rb

Reporter
• reporter.rb

Known State Provider
• app_state_provider.rb
• nats_based_known_state_provider.rb

Expected State Provider
• bulk_based_expected_state_provider.rb

Shadower
• shadower.rb

Harmonizer
• harmonizer.rb

Scheduler: scheduler.rb
Others: constants.rb (constant definitions), varz.rb (varz template),
common.rb, varz_common.rb

Easier to customize according to the HM's situation

BROADCASTING STATE CHANGES
VIA NOTIFY

Desc. Of AppState
• Class that holds the state of an application
– Used by the StateProvider
• lib/health_manager/app_state.rb

#notify
lib/health_manager/app_state.rb
# Instance method: broadcasts through the class, passing the instance itself.
def notify(event_type, *args)
  self.class.notify_listener(event_type, self, *args)
end

# Class-level methods (called as AppState.add_listener, self.class.notify_listener).
def add_listener(event_type, &block)
  check_event_type(event_type)
  @listeners ||= {}
  @listeners[event_type] ||= []
  @listeners[event_type] << block
end

def notify_listener(event_type, app_state, *args)
  check_event_type(event_type)
  listeners = @listeners[event_type] || []
  listeners.each do |block|
    block.call(app_state, *args)
  end
end
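The listener mechanism above can be tried in isolation. The following is a minimal sketch, not the HM-2 source: the class name MiniAppState, the EVENT_TYPES list, and the body of check_event_type are assumptions made for illustration.

```ruby
# Minimal sketch of class-level event listeners (hypothetical class name).
class MiniAppState
  # Assumed list of valid events; the real list lives in the HM-2 source.
  EVENT_TYPES = [:missing_instances, :extra_instances, :exit_crashed].freeze

  class << self
    # Register a block to be called whenever event_type is broadcast.
    def add_listener(event_type, &block)
      check_event_type(event_type)
      @listeners ||= {}
      (@listeners[event_type] ||= []) << block
    end

    # Call every block registered for event_type.
    def notify_listener(event_type, app_state, *args)
      check_event_type(event_type)
      (@listeners ||= {}).fetch(event_type, []).each do |block|
        block.call(app_state, *args)
      end
    end

    def check_event_type(event_type)
      return if EVENT_TYPES.include?(event_type)
      raise ArgumentError, "Unknown event type: #{event_type}"
    end
  end

  attr_reader :id

  def initialize(id)
    @id = id
  end

  # Instances broadcast through the class, passing themselves along.
  def notify(event_type, *args)
    self.class.notify_listener(event_type, self, *args)
  end
end

received = []
MiniAppState.add_listener(:missing_instances) do |app_state, indices|
  received << [app_state.id, indices]
end
MiniAppState.new(42).notify(:missing_instances, [0, 2])
```

Because listeners are stored on the class, a component such as the Harmonizer can register once and then react to events from every application.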

Desc. of State Provider
• Abstract class that holds instance state
– Stores AppState instances in @droplets
• Expected and Known each hold their data as AppState
– @droplets is a hash keyed by droplet_id
• A separate interface is provided for fetching AppState instances

Fetching AppState instances
• Direct lookup by droplet_id
– #has_droplet?(id) -> boolean
– #get_droplet(id) -> AppState
– #get_state(id) = get_droplet(id).state
• Fetching all droplets as a singly linked list
– @cur_droplet_index: records the last position referenced
– #next_droplet: returns the next AppState
– #rewind: resets @cur_droplet_index
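The two access styles listed above (lookup by id, and a cursor walked with #next_droplet and #rewind) can be sketched as follows. This is a simplified illustration, not the HM-2 implementation: MiniStateProvider and add_droplet are hypothetical names, and plain symbols stand in for AppState objects.

```ruby
# Hypothetical, simplified provider: droplets keyed by droplet_id plus a cursor.
class MiniStateProvider
  def initialize
    @droplets = {}          # droplet_id => state object
    @cur_droplet_index = 0  # position of the next droplet to hand out
  end

  def add_droplet(id, droplet)
    @droplets[id] = droplet
  end

  def has_droplet?(id)
    @droplets.key?(id)
  end

  def get_droplet(id)
    @droplets[id]
  end

  # Return the next droplet in insertion order, or nil when exhausted.
  def next_droplet
    return nil if @cur_droplet_index >= @droplets.size
    droplet = @droplets.values[@cur_droplet_index]
    @cur_droplet_index += 1
    droplet
  end

  # Restart the iteration from the first droplet.
  def rewind
    @cur_droplet_index = 0
  end
end

provider = MiniStateProvider.new
provider.add_droplet(1, :app_one)
provider.add_droplet(2, :app_two)
seen = []
while (droplet = provider.next_droplet)
  seen << droplet
end
```

Handing out one droplet per call lets a caller spread a full scan over many small steps instead of iterating over everything at once.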

Prioritization according to the corrective action

HEAVIER USE OF THE PRIORITY QUEUE

How to harmonize
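1. Periodically: check the state of the actual instances
– Performed by the Known State Provider
2. Periodically: check the state CF expects
– Performed by the Expected State Provider
3. On events: decide whether to harmonize
4. Periodically: drain the queue of NATS.publish calls and the like
– Performed by the Nudger
– Makes use of the Scheduler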

How to harmonize (1)
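[Figure: HM-2 structure diagram (Nudger: sends messages outside the HM, has a priority queue, batch processing; Harmonizer), annotated with the four steps]
1. Check the state of the actual instances
2. Check the state CF expects
3. Decide whether to harmonize
4. Drain the queue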

KnownStateProvider
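• Holds the state of the actual instances
– Abstract class that inherits from AppStateProvider
• Tracks instance state from NATS (three subjects)
– Increments a varz counter per message type
– Notifies listeners when an event occurs
• Fetches and analyzes the actual application state
– Execution is requested by the Harmonizer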

Tracking instance state from NATS

• dea.heartbeat
• droplet.update
• droplet.exited

dea.heartbeat
def process_heartbeat(beat)
  instance = get_instance(beat['version'], beat['index'])

  if running_state?(beat) # skipped when the reported state is not a running one
    if instance['state'] == RUNNING && # a different "instance" value is an error
        instance['instance'] != beat['instance']
      notify(:extra_instances, [[beat['instance'], 'Instance mismatch']])
    else
      instance['last_heartbeat'] = now # refresh the AppState timestamps
      instance['timestamp'] = now
      %w(instance state_timestamp state).each { |key|
        instance[key] = beat[key] # overwrite the AppState fields with the heartbeat's values
      }
    end
  elsif beat['state'] == CRASHED # when a crash is detected, record it
    @crashes[beat['instance']] = {
      'timestamp' => now, 'crash_timestamp' => beat['state_timestamp']
    }
  end
end

droplet.update
def process_droplet_updated(message)
  reset_missing_indices
  notify(:droplet_updated, message)
end

def reset_missing_indices
  @reset_timestamp = now
end

droplet.exited: STOPPED, DEA_*
def process_exit_dea(message)
  notify(:exit_dea, message)
end

def process_exit_stopped(message)
  reset_missing_indices
  @state = STOPPED
  notify(:exit_stopped, message)
end

def reset_missing_indices
  @reset_timestamp = now
end

droplet.exited: CRASHED
def process_exit_crash(message)
  instance = get_instance(message['version'], message['index']) # fetch the state
  instance['state'] = CRASHED # update the state

  instance['crashes'] = 0 if # reset once the flapping timeout (default: 500) has passed
    timestamp_older_than?(instance['crash_timestamp'],
                          AppState.flapping_timeout)
  instance['crashes'] += 1
  instance['crash_timestamp'] = message['crash_timestamp']

  if instance['crashes'] > AppState.flapping_death # default: 1
    instance['state'] = FLAPPING
  end

  @crashes[instance['instance']] = { # record it as a crashed instance
    'timestamp' => now, 'crash_timestamp' => message['crash_timestamp']
  }
  notify(:exit_crashed, message)
end
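The crash-counting rule in process_exit_crash can be exercised on its own. The sketch below is an illustration under assumptions: record_crash is a hypothetical helper, timestamps are plain integers in seconds, "older than" is taken to mean now - timestamp > timeout, and the two constants use the defaults quoted on the slide.

```ruby
# Hypothetical standalone version of the crash-counting rule.
FLAPPING_TIMEOUT = 500 # seconds; default quoted on the slide
FLAPPING_DEATH   = 1   # crashes tolerated before FLAPPING; default quoted on the slide

# Update an instance hash for a crash observed at time `now` (epoch seconds)
# and return the resulting state string.
def record_crash(instance, crash_timestamp, now)
  last = instance['crash_timestamp']
  # Forget old crashes once the previous one is older than the timeout.
  instance['crashes'] = 0 if last.nil? || now - last > FLAPPING_TIMEOUT
  instance['crashes'] += 1
  instance['crash_timestamp'] = crash_timestamp
  instance['state'] = instance['crashes'] > FLAPPING_DEATH ? 'FLAPPING' : 'CRASHED'
end

instance = { 'crashes' => 0 }
first  = record_crash(instance, 1_000, 1_000) # first crash
second = record_crash(instance, 1_010, 1_010) # second crash 10 seconds later
later  = record_crash(instance, 2_000, 2_000) # next crash long after the timeout
```

With these defaults a second crash inside the timeout window is enough to mark the instance as flapping, while a crash that arrives after the window starts the count again.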

Application state analysis (1)
• The Harmonizer runs the analysis task periodically
– Starts after :droplet_lost seconds (default: 30), then repeats every :droplet_analysis seconds (default: 10)
lib/health_manager/harmonizer.rb
def analyze_all_apps
  if scheduler.task_running? :droplet_analysis # warn when the interval is too short
    logger.warn("Droplet analysis still in progress. Consider increasing droplet_analysis interval.")
    return # abandon this analysis run
  end

  logger.debug { "harmonizer: droplet_analysis" }

  varz.reset_realtime_stats # reset the KnownState figures held in varz
  ...
end

def analyze_all_apps
  ...
  known_state_provider.rewind # reset the cursor so droplets are fetched in order

  scheduler.start_task :droplet_analysis do # default: 10
    known_droplet = known_state_provider.next_droplet
    if known_droplet
      known_droplet.analyze # update the KnownState
      varz.update_realtime_stats_for_droplet(known_droplet) # update varz
      true
    else
      # TODO: remove once ready for production
      varz.set(:droplets, known_state_provider.droplets) # update varz
      varz.publish_realtime_stats
      # TODO: add elapsed time (the analysis time is not logged yet)
      logger.info ["harmonizer: Analyzed #{varz.get(:running_instances)} running ", "#{varz.get(:down_instances)} down instances"].join
      false #signal :droplet_analysis task completion to the scheduler
    end
  end
end

#check for all anomalies and trigger appropriate events so that listeners can take action
def analyze
  check_for_missing_indices     # notify when instances are missing
  check_and_prune_extra_indices # described later
  prune_crashes                 # described later
end

def check_for_missing_indices # does nothing if a reset happened after the task started
  unless reset_recently? or missing_indices.empty?
    notify(:missing_instances, missing_indices)
    reset_missing_indices # updates @reset_timestamp (pauses some of the processing)
  end
end

def reset_recently?
  timestamp_fresher_than?(@reset_timestamp,
                          AppState.heartbeat_deadline || 0)
end

AppState.heartbeat_deadline = interval(:droplet_lost) # default: 30

def check_and_prune_extra_indices
  extra_instances = [] # instances to be removed, stored together with the reason

  # first, go through each version and prune indices
  @versions.each do |version, version_entry|
    ... # append to extra_instances (described later)
  end

  # now, prune versions
  @versions.delete_if do |version, version_entry|
    if version_entry['instances'].empty?
      @state == STOPPED || version != @live_version
    end # STOPPED droplets and non-live versions are dropped from later passes
  end

  unless extra_instances.empty?
    notify(:extra_instances, extra_instances)
  end # notify when there are extra instances
end

def check_and_prune_extra_indices
  ...
  @versions.each do |version, version_entry|
    version_entry['instances'].delete_if do |index, instance| # deleting extra instances
      if running_state?(instance) &&
          timestamp_older_than?(instance['timestamp'],
                                AppState.heartbeat_deadline) # no heartbeat for longer than the deadline
        instance['state'] = DOWN # judge the instance to be DOWN
        instance['state_timestamp'] = now
      end

      prune_reason = [[@state == STOPPED, 'Droplet state is STOPPED'], # droplet is STOPPED
                      [index >= @num_instances, 'Extra instance'], # more instances than expected
                      [version != @live_version, 'Live version mismatch'] # version differs
                     ].find { |condition, _| condition } # first matching [condition, reason] pair, or nil

      if prune_reason # some reason matched
        if running_state?(instance) # and the instance is still running
          reason = prune_reason.last
          extra_instances << [instance['instance'], reason] # add to extra_instances with the reason
        end
      end
      prune_reason #prune when non-nil
    end
  end
  ...
end

def prune_crashes
  # AppState.flapping_timeout default: 500
  @crashes.delete_if { |_, crash|
    timestamp_older_than?(crash['timestamp'], AppState.flapping_timeout)
  }
end




[Figure: HM-2 structure diagram (Harmonizer)]

#update_expected_state
def update_expected_state # runs at the :expected_update_time interval (default: 10)
  logger.debug { "harmonizer: expected_state_update pre-check" }

  if expected_state_update_in_progress? # varz is still being updated
    postpone_expected_state_update # put the update off (default: 2 seconds)
    return
  end

  expected_state_provider.update_user_counts # refresh the user count via the Bulk API
  varz.reset_expected_stats # walk through varz and reset it
  expected_state_provider.each_droplet do |app_id, expected|
    known = known_state_provider.get_droplet(app_id)
    expected_state_provider.set_expected_state(known, expected)
  end # copy the information: KnownState <- Expected
end

def expected_state_update_in_progress?
  varz.held?(:total)
end

BulkBasedExpectedStateProvider
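• Holds the instance state that CF believes to be current
– Abstract class that inherits from AppStateProvider
• All processing is requested by the Harmonizer
– Copies the values held in AppState from the Expected side to the Known side
– Stores the information held by the CC in varz
• Number of applications and instances, memory usage and remaining memory, number of users, and so on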

Copying AppState information
lib/health_manager/bulk_based_expected_state_provider.rb
def set_expected_state(known, expected)
  logger.debug { "bulk: #set_expected_state: known: #{known.inspect} expected: #{expected.inspect}" }

  known.set_expected_state(
    :num_instances => expected['instances'],
    :state         => expected['state'],
    :live_version  => "#{expected['staged_package_hash']}-#{expected['run_count']}",
    :framework     => expected['framework'],
    :runtime       => expected['runtime'],
    :package_state => expected['package_state'],
    :last_updated  => parse_utc(expected['updated_at']))
end

Recording CC information (users)
• The result of api.vcap.me/bulk/counts goes into :total_users
lib/health_manager/bulk_based_expected_state_provider.rb
def update_user_counts
  with_credentials do |user, password|
    options = {
      :head => { 'authorization' => [user, password] }, :query => { 'model' => 'user' }
    }
    http = EM::HttpRequest.new(counts_url).get(options)
    http.callback do
      if http.response_header.status != 200
        logger.error("bulk: request problem. Response: #{http.response_header} #{http.response}")
        next
      end
      response = parse_json(http.response) || {}
      logger.debug { "bulk: user counts received: #{response}" }
      counts = response['counts'] || {}
      varz.set(:total_users, (counts['user'] || 0).to_i)
    end
    http.errback do
      logger.error("bulk: error: talking to bulk API at #{counts_url}")
      @user = @password = nil #ensure re-acquisition of credentials
    end
  end
end

Recording CC information (applications)
• #each_droplet:
– Calls the private method #process_next_batch
• #process_next_batch:
– Fetches api.vcap.me/bulk/apps
– Fetches in batches of :batch_size (default: 50)
• varz's #update_expected_stats_for_droplet:
– Called for each droplet
– Reflects the values in varz
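The batch-by-batch walk described above can be sketched without a network. Everything here is an assumption for illustration: fetch_batch is a hypothetical in-memory stand-in for the bulk API, the numeric token is an invented paging scheme, and the real provider issues asynchronous HTTP requests through EventMachine rather than a synchronous loop.

```ruby
# Hypothetical stand-in for the bulk API: returns up to batch_size apps whose
# id is greater than the token, plus the token to use for the next call.
def fetch_batch(all_apps, token, batch_size)
  ids = all_apps.keys.sort.select { |id| id > token }.first(batch_size)
  batch = ids.each_with_object({}) { |id, h| h[id] = all_apps[id] }
  [batch, ids.last || token]
end

# Yield every droplet, one batch at a time, until an empty batch comes back.
def each_droplet(all_apps, batch_size = 50)
  token = 0
  loop do
    batch, token = fetch_batch(all_apps, token, batch_size)
    break if batch.empty?
    batch.each { |app_id, expected| yield app_id, expected }
  end
end

apps = (1..120).each_with_object({}) { |id, h| h[id] = { 'instances' => 1 } }
visited = []
each_droplet(apps, 50) { |app_id, _expected| visited << app_id }
```

With 120 applications and a batch size of 50, the loop makes four calls: three that return data (50, 50, and 20 entries) and a final empty one that ends the walk.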




[Figure: HM-2 structure diagram (Harmonizer)]

Handled immediately:
Updating the ExpectedState when a droplet is updated
def prepare
  ...
  AppState.add_listener(:droplet_updated) do |app_state, message|
    logger.info { "harmonizer: droplet_updated: #{message}" }
    app_state.mark_stale # the droplet changed, so mark it as stale
    update_expected_state # refresh the ExpectedState as well
  end
  ...
end

Handled immediately:
Stopping excess instances
def prepare
  ...
  AppState.add_listener(:extra_instances) do |app_state, extra_instances|
    if app_state.stale? # stale droplets are skipped
      logger.info { "harmonizer: stale: extra_instances ignored: #{extra_instances}" }
      next
    end

    logger.debug { "harmonizer: extra_instances"}
    nudger.stop_instances_immediately(app_state, extra_instances)
    # requests that the instances be stopped right away
  end
  ...
end

Handled immediately:
Applications in the Flapping state
def prepare
  ...
  AppState.add_listener(:exit_crashed) do |app_state, message|
    ...
    if flapping?(instance) # the state is FLAPPING
      unless restart_pending?(instance)
        instance['last_action'] = now
        if giveup_restarting?(instance) # past the retry limit: stop restarting
          # TODO: when refactoring this thing out, don't forget to
          # mute for missing indices restarts
          logger.info { "giving up on restarting: app_id=#{app_state.id} index=#{index}" }
        else # under the retry limit: restart after a delay
          delay = calculate_delay(instance) # how the delay is computed is described later
          schedule_delayed_restart(app_state, instance, index, delay)
        end
      end
    else
      ...
    end
  end
  ...
end

def giveup_restarting?(instance)
  interval(:giveup_crash_number) > 0 && # a negative value disables the retry limit
    instance['crashes'] > interval(:giveup_crash_number)
end

High priority:
Restarting instances when a DEA exits
def prepare
  ...
  AppState.add_listener(:exit_dea) do |app_state, message| # the DEA shut down normally
    index = message['index']

    logger.info { "harmonizer: exit_dea: app_id=#{app_state.id} index=#{index}" }
    nudger.start_instance(app_state, index, HIGH_PRIORITY)
    # queues the instance start (high priority)
  end
  ...
end

Normal priority:
Starting additional instances when some are missing
def prepare
  ...
  AppState.add_listener(:missing_instances) do |app_state, missing_indices|
    if app_state.stale? # stale droplets are skipped
      logger.info { "harmonizer: stale: missing_instances ignored app_id=#{app_state.id} indices=#{missing_indices}" }
      next
    end

    logger.debug { "harmonizer: missing_instances"}
    missing_indices.delete_if { |i|
      restart_pending?(app_state.get_instance(i))
    } # instances with the restart_pending flag (set by the Harmonizer) are skipped
    nudger.start_instances(app_state, missing_indices, NORMAL_PRIORITY)
    # requests that the instances be started (normal priority)
    #TODO: flapping logic, too
  end
end

Low priority:
Restarting instances that crashed
def prepare
  ...
    # (inside the :exit_crashed listener)
    logger.debug { "harmonizer: exit_crashed" }

    index = message['index']
    instance = app_state.get_instance(message['version'], message['index'])

    if flapping?(instance)
      ...
    else # the state is not FLAPPING
      nudger.start_instance(app_state, index, LOW_PRIORITY)
      # requests that the instance be started (low priority)
    end
  end
  ...
end




[Figure: HM-2 structure diagram (Harmonizer)]

Desc. of Nudger
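• Has a priority queue
• Priorities are now actually specified in earnest
• Basically batch processing
• Starting and stopping instances
– Functions for non-batch processing: *_immediately
• Publish cloudcontrollers.hm.requests right away
– Priority (a number of 0 or more): HIGH, NORMAL, LOW
• Requests accumulate in the queue and are executed by the Harmonizer as appropriate
• Each run processes @queue_batch_size queued items
• Processing the queue = publishing cloudcontrollers.hm.requests
• The Harmonizer triggers both queue insertion and batch processing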

PriorityQueue
• PriorityQueueFIFO: backed by a heap
– One bucket per priority
– Each bucket holds multiple items (FIFO)
• PrioritySet < PriorityQueueFIFO
– Watches for duplicate items on insert and remove
• Duplicates are detected by key (the item itself when no key is given)
lib/health_manager/nudger.rb
def queue(message, priority = NORMAL_PRIORITY)
  logger.debug { "nudger: queueing: #{message}, #{priority}" }
  # Intent: everything except app.last_updated forms the key.
  # Note that Hash#delete returns the removed value, not the remaining hash.
  key = message.clone.delete(:last_updated)
  @queue.insert(message, priority, key)
  varz.set(:queue_length, @queue.size)
end
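The behavior described for PrioritySet (one FIFO bucket per priority, with duplicate keys rejected) can be sketched as below. This is not the HM-2 class: MiniPrioritySet is a hypothetical name, the highest bucket is found with a plain max instead of a heap, and treating a larger number as the higher priority is an assumption.

```ruby
# Hypothetical, simplified priority set: higher priority first, FIFO within a
# priority, and duplicate keys rejected while the item is still queued.
class MiniPrioritySet
  def initialize
    @buckets = Hash.new { |h, k| h[k] = [] } # priority => [[item, key], ...]
    @keys = {}                               # keys currently queued
  end

  # Returns false (and queues nothing) when the key is already present.
  def insert(item, priority = 0, key = item)
    return false if @keys.key?(key)
    @keys[key] = true
    @buckets[priority] << [item, key]
    true
  end

  # Removes and returns the oldest item of the highest non-empty priority.
  def remove
    return nil if empty?
    priority = @buckets.keys.max
    item, key = @buckets[priority].shift
    @buckets.delete(priority) if @buckets[priority].empty?
    @keys.delete(key)
    item
  end

  def empty?
    @buckets.empty?
  end

  def size
    @keys.size
  end
end

queue = MiniPrioritySet.new
queue.insert({ op: 'START', index: 0 }, 1)
queue.insert({ op: 'START', index: 1 }, 2)
duplicate = queue.insert({ op: 'START', index: 0 }, 1) # same key as the first insert
order = []
order << queue.remove[:index] until queue.empty?
```

Rejecting duplicates keeps repeated detections of the same problem from flooding the queue with identical requests.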

Executing the queue
• Every :request_queue seconds, :queue_batch_size items are drained
– Defaults: 40 items per dequeue, one dequeue per second
– Added to the scheduler when the Harmonizer is initialized
def deque_batch_of_requests
  @queue_batch_size.times do |i|
    break if @queue.empty?
    message = encode_json(@queue.remove)
    publish_request_message(message)
  end
end

So that #analyze_all_apps is not run concurrently

MANAGED EXECUTION OF
LONG-RUNNING TASKS

Scheduler?
Cloud Foundry

Desc. of Scheduler
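• Encapsulates EventMachine
– So that scheduling works whether or not the reactor loop is running
– Improves the startup process
• Each schedule is assigned an ID
– Makes cancellation and lookup possible
• Processing flow
– Add an entry as a schedule
– Periodically drain the schedules
• Using a Task prevents a task that is already running from being queued again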

Methods (Scheduling)
• :immediate -> EM.next_tick
– immediately(&block)
• :periodic -> schedule()
– at_interval(interval_name, &block)
– every(interval, &block)
• :timer -> schedule()
– after_interval(interval_name, &block)
– after(interval, &block)


#schedule
lib/health_manager/scheduler.rb
def schedule(options, &block)
  raise ArgumentError unless options.length == 1
  raise ArgumentError, 'block required' unless block_given?
  arg = options.first
  sendee = {
    :periodic => [:add_periodic_timer],
    :timer => [:add_timer],
  }[arg.first]

  raise ArgumentError, "Unknown scheduling keyword, please use :immediate, :periodic or :timer" unless sendee
  sendee << arg[1]
  receipt = get_receipt # sequential number
  @schedule << [block, sendee, receipt]
  receipt
end

#run
• Waits 2 seconds after @schedule has been fully processed
def run
  until @schedule.empty?
    block, sendee, receipt = @schedule.shift
    @receipt_to_timer[receipt] = EM.send(*sendee, &block)
  end

  EM.add_timer(@run_loop_interval) { run } # default: 2
end

Methods (Task)
• #analyze_all_apps
– Used when the Harmonizer checks the state of all droplets
def start_task(task, &block)
  return if task_running?(task) # tasks with the same name never run in parallel
  mark_task_started(task)
  quantize_task(task, &block)
end

def quantize_task(task, &block)
  if yield
    EM.next_tick { quantize_task(task, &block) }
  else
    mark_task_stopped(task)
  end
end
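The quantized-task idea (run one slice per reactor tick until the block returns a falsy value) can be demonstrated without EventMachine. In this sketch MiniReactor is a hypothetical stand-in for EM.next_tick, and MiniScheduler keeps only the task-related methods; neither is the HM-2 class.

```ruby
# Hypothetical stand-in for the reactor: a FIFO of blocks run one per tick.
class MiniReactor
  def initialize
    @ticks = []
  end

  def next_tick(&block)
    @ticks << block
  end

  def run
    @ticks.shift.call until @ticks.empty?
  end
end

class MiniScheduler
  def initialize(reactor)
    @reactor = reactor
    @running = {}
  end

  def task_running?(task)
    @running.key?(task)
  end

  # Run the block repeatedly, one slice per tick, until it returns falsy.
  def start_task(task, &block)
    return if task_running?(task) # never run the same task twice at once
    @running[task] = true
    quantize_task(task, &block)
  end

  private

  def quantize_task(task, &block)
    if block.call
      @reactor.next_tick { quantize_task(task, &block) }
    else
      @running.delete(task)
    end
  end
end

reactor = MiniReactor.new
scheduler = MiniScheduler.new(reactor)
work = [1, 2, 3]
done = []
scheduler.start_task(:droplet_analysis) do
  item = work.shift
  done << item if item
  !work.empty? # true means "call me again on the next tick"
end
running_mid_task = scheduler.task_running?(:droplet_analysis)
second = scheduler.start_task(:droplet_analysis) { raise 'should not run' }
reactor.run
```

Yielding back to the reactor between slices lets other timers and message handlers run while a long scan is still in progress.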

What on earth happened at Cloudfoundry.com?

TUNING THE RESTART FREQUENCY OF
FLAPPING INSTANCES

Handled immediately:
Applications in the Flapping state
def prepare
  ...
    ...
    if flapping?(instance) # the state is FLAPPING
      unless restart_pending?(instance)
        instance['last_action'] = now
        if giveup_restarting?(instance) # past the retry limit: stop restarting
          # TODO: when refactoring this thing out, don't forget to
          # mute for missing indices restarts
          logger.info { "giving up on restarting: app_id=#{app_state.id} index=#{index}" }
        else # under the retry limit: restart after a delay
          delay = calculate_delay(instance) # how the delay is computed is described later
          schedule_delayed_restart(app_state, instance, index, delay)
        end
      end
    else
      ...
    end
  end
  ...
end

def giveup_restarting?(instance)
  interval(:giveup_crash_number) > 0 && # a negative value disables the retry limit
    instance['crashes'] > interval(:giveup_crash_number)
end

#calculate_delay
def calculate_delay(instance)
  # once the number of crashes exceeds the value of
  # :flapping_death interval, delay starts with min_restart_delay
  # interval value, and doubles for every additional crash. the
  # delay never exceeds :max_restart_delay though. But wait,
  # there's more: random noise is added to the delay, to avoid a
  # storm of simultaneous restarts. This is necessary because
  # delayed restarts bypass nudger's queue -- once delay period
  # passes, the start message is published immediately.

  delay = [interval(:max_restart_delay), # default: 480
           interval(:min_restart_delay) << # default: 60
             (instance['crashes'] - interval(:flapping_death) - 1) # :flapping_death default: 1
          ].min.to_f
  noise_amount = 2.0 * (rand - 0.5) * interval(:delay_time_noise).to_f # :delay_time_noise default: 10

  result = delay + noise_amount

  logger.info("delay: #{delay} noise: #{noise_amount} result: #{result}")
  result
end
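To make the arithmetic concrete, the sketch below recomputes the delay with the defaults quoted on the slide (60, 480, 1, and 10). The helper names base_restart_delay and restart_delay, and the keyword parameters, are hypothetical; the formula itself follows calculate_delay.

```ruby
# Deterministic part of the restart delay (no noise). Shifting left by n
# doubles the minimum delay n times; the result is capped at max_delay.
def base_restart_delay(crashes, min_delay: 60, max_delay: 480, flapping_death: 1)
  [max_delay, min_delay << (crashes - flapping_death - 1)].min.to_f
end

# Full delay: the base plus uniform noise in [-noise, +noise).
def restart_delay(crashes, noise: 10, rng: Random.new)
  base_restart_delay(crashes) + 2.0 * (rng.rand - 0.5) * noise
end

delays = (2..6).map { |crashes| base_restart_delay(crashes) }
# delays == [60.0, 120.0, 240.0, 480.0, 480.0]
```

The second crash waits about one minute, each further crash doubles the wait, and from the fifth crash onward the delay stays at the eight-minute cap, give or take the noise.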

#calculate_delay
[Figure: the calculate_delay code annotated with the restart delay per crash: #1 → 0, #2 → Min, #3+ → doubling from Min up to Max]

Added a lock mechanism and provided an interface

REDESIGN OF VARZ (STATS)

varz
• #declare_counter:
– Handles numeric values
• #declare_node:
– Handles hashes
• #declare_collection:
– Handles arrays

def prepare
  declare_counter :total_apps
  declare_counter :total_instances
  declare_counter :running_instances
  declare_counter :down_instances
  declare_counter :crashed_instances
  declare_counter :flapping_instances

  declare_node :running
  declare_node :running, :frameworks
  declare_node :running, :runtimes

  declare_counter :total_users
  declare_collection :users # FIXIT: ensure can be safely removed
  declare_collection :apps # FIXIT: ensure can be safely removed

  declare_node :total
  declare_node :total, :frameworks
  declare_node :total, :runtimes

  declare_counter :queue_length
  ...
  declare_counter :heartbeat_msgs_received
  declare_counter :droplet_exited_msgs_received
  declare_counter :droplet_updated_msgs_received
  declare_counter :healthmanager_status_msgs_received
  declare_counter :healthmanager_health_request_msgs_received

  declare_counter :varz_publishes
  declare_counter :varz_holds
  declare_node :droplets # FIXIT: remove once ready for production
end
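As a rough illustration of declared counters combined with a lock ("hold") flag, consider the sketch below. It is not the HM-2 varz class: MiniVarz and the exact behavior of inc, hold, held?, and release are assumptions; only the idea of declaring entries up front and checking held? before an update comes from the slides.

```ruby
# Hypothetical, simplified stats registry with declared counters and holds.
class MiniVarz
  def initialize
    @counters = {}
    @held = {}
  end

  # Counters must be declared before use, so typos fail loudly.
  def declare_counter(name)
    raise ArgumentError, "Counter #{name} already declared" if @counters.key?(name)
    @counters[name] = 0
  end

  def inc(name)
    check_declared(name)
    @counters[name] += 1
  end

  def get(name)
    check_declared(name)
    @counters[name]
  end

  # Mark a value as being rebuilt; callers can check held? and come back later.
  def hold(name)
    check_declared(name)
    @held[name] = true
  end

  def held?(name)
    @held.key?(name)
  end

  def release(name)
    @held.delete(name)
  end

  private

  def check_declared(name)
    raise ArgumentError, "Undeclared counter #{name}" unless @counters.key?(name)
  end
end

varz = MiniVarz.new
varz.declare_counter(:heartbeat_msgs_received)
2.times { varz.inc(:heartbeat_msgs_received) }
varz.hold(:heartbeat_msgs_received)
held_during_update = varz.held?(:heartbeat_msgs_received)
varz.release(:heartbeat_msgs_received)
```

A check such as held? is what allows an update to be postponed while the previous one is still writing its results.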

...or at least it appears to be trying to.

HM-1-COMPATIBLE BEHAVIOR IS ALSO PROVIDED

HM-2 w/ Shadower
[Figure: HM-2 structure diagram with the Shadower added (HM-1-compatible behavior), alongside Cloud Foundry (Cloud Controller, etc.), the Nudger, Manager, Reporter, Expected State Provider, Harmonizer, and Scheduler]

lib/health_manager.rb
def initialize(config = {})
  ... # start the HM components
  if should_shadow?
    @publisher = @shadower = Shadower.new(@config)
  else
    @publisher = NATS
  end
  ... # register the started HM components with the Manager
end

Desc. of Shadower
• Not really needed
• For now
– It only fills up the log
• In the long term
– It is apparently meant to complement harmonization?

Shadower subscribes:
• Only written to the log at Info level
– healthmanager.start
– healthmanager.status
– healthmanager.health
• Something more is being attempted
– cloudcontrollers.hm.requests
• Meant to accumulate requests? (not implemented)
– #check_shadowing
• Meant to warn about stale requests? (not implemented)

cloudcontrollers.hm.requests
def process_message(subj, message)
  logger.info{ "shadower: received: #{subj}: #{message}" }
  record_request(message, 1) if subj == 'cloudcontrollers.hm.requests'
end

def publish(subj, message) # note: nothing calls this
  logger.info("shadower: publish: #{subj}: #{message}")
  record_request(message, -1) if subj == 'cloudcontrollers.hm.requests'
end

def record_request(message, increment) # effectively non-functional
  request = @requests[message] ||= {} # @requests always stays at its initial value `{}`

  request[:timestamp] = now
  request[:count] ||= 0
  request[:count] += increment

  @requests.delete(message) if request[:count] == 0
end

#check_shadowing
lib/health_manager/harmonizer.rb (called when hm.start runs)
scheduler.at_interval :check_shadowing do # default: 30 (constants.rb)
  shadower.check_shadowing
end

def check_shadowing # run by the Harmonizer every :check_shadowing (default: 30) seconds
  max_delay = interval(:max_shadowing_delay) # default: 10

  # At present @requests is always `{}`.
  # Selects entries where now - entry[:timestamp] > max_delay.
  unmatched = @requests
    .find_all { |_, entry| timestamp_older_than?(entry[:timestamp],
                                                 max_delay) }
  if unmatched.empty?
    logger.info("shadower: check: OK")
  else
    logger.warn("shadower: check: unmatched: found #{unmatched.size} unmatched messages, details follow")

    unmatched.each do |message, entry|
      logger.warn("shadower: check: unmatched: #{message} #{entry[:count]} ")
      @requests.delete(message) # stale entries are logged as warnings and deleted
    end
  end
end

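Summary

• Refactoring
• Broadcasting state changes via notify
• Heavier use of the priority queue
• Managed execution of long-running tasks
• Tuned restart frequency for flapping instances
• Redesign of varz (stats)
• HM-1-compatible behavior is also provided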
