3. What is HealthManager
Issues commands to the Cloud Controller based on instance status
[Figure: Cloud Foundry architecture diagram. Components: cloud controller, stager, health manager, cc-db, staging jobs, package cache, uaa (AuthN), uaa-db, dea, redis, blobstore, staging logs]
Figure source: http://www.slideshare.net/marklucovsky/cloud-foundry-open-tour-london
30. Analyzing application state (1)
• The Harmonizer runs the analysis task periodically
– Starts after :droplet_lost seconds (default: 30), then repeats every :droplet_analysis seconds (default: 10)
lib/health_manager/harmonizer.rb
def analyze_all_apps
  if scheduler.task_running? :droplet_analysis # the previous run has not finished: the interval is too short, so warn
    logger.warn("Droplet analysis still in progress. Consider increasing droplet_analysis interval.")
    return # skip this analysis run
  end
  logger.debug { "harmonizer: droplet_analysis" }
  varz.reset_realtime_stats # reset the KnownState statistics held in varz
  ...
end
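The guard at the top of analyze_all_apps can be illustrated in isolation. The following is a minimal sketch, not the HealthManager implementation: TinyScheduler, mark_started and mark_finished are hypothetical names, and only the "is the task still running?" bookkeeping is modelled.

```ruby
# Hypothetical stand-in for the HealthManager scheduler; it only tracks
# which named tasks are currently running.
class TinyScheduler
  def initialize
    @running = {}
  end

  def task_running?(name)
    @running.key?(name)
  end

  def mark_started(name)
    @running[name] = true
  end

  def mark_finished(name)
    @running.delete(name)
  end
end

# Returns :skipped when the previous run is still in progress,
# otherwise marks the task as started and returns :started.
def analyze_all_apps(scheduler)
  return :skipped if scheduler.task_running?(:droplet_analysis)

  scheduler.mark_started(:droplet_analysis)
  :started
end

scheduler = TinyScheduler.new
first  = analyze_all_apps(scheduler) # nothing running yet
second = analyze_all_apps(scheduler) # previous run still in progress
scheduler.mark_finished(:droplet_analysis)
third  = analyze_all_apps(scheduler) # previous run finished
```

In this sketch a run that overlaps the previous one is dropped rather than queued, which matches the early return shown on the slide.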
31. Analyzing application state (2)
lib/health_manager/harmonizer.rb
def analyze_all_apps
  ...
  known_state_provider.rewind # reset the pointer so that droplets can be fetched in order
  scheduler.start_task :droplet_analysis do # :droplet_analysis interval, default: 10
    known_droplet = known_state_provider.next_droplet
    if known_droplet
      known_droplet.analyze # update the KnownState
      varz.update_realtime_stats_for_droplet(known_droplet) # update varz
      true
    else
      # TODO: remove once ready for production
      varz.set(:droplets, known_state_provider.droplets) # update varz
      varz.publish_realtime_stats
      # TODO: add elapsed time (the analysis time is not logged yet)
      logger.info ["harmonizer: Analyzed #{varz.get(:running_instances)} running ", "#{varz.get(:down_instances)} down instances"].join
      false # signal :droplet_analysis task completion to the scheduler
    end
  end
end
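The task block above processes one droplet per call and uses its return value to tell the scheduler whether to call it again. A minimal sketch of that contract follows. TinyProvider and run_until_done are hypothetical names, and a plain loop stands in for the scheduler, whose real dispatch mechanism is not shown on the slide.

```ruby
# Hypothetical provider that hands out droplets one at a time.
class TinyProvider
  def initialize(droplets)
    @droplets = droplets
    rewind
  end

  # Reset the read position to the first droplet.
  def rewind
    @position = 0
  end

  # Return the next droplet, or nil once all have been handed out.
  def next_droplet
    droplet = @droplets[@position]
    @position += 1
    droplet
  end
end

# Hypothetical driver: keep calling the block while it returns a truthy
# value, and report how many times it asked to be called again.
def run_until_done
  steps = 0
  steps += 1 while yield
  steps
end

provider = TinyProvider.new(%w[app-a app-b app-c])
analyzed = []

provider.rewind
steps = run_until_done do
  droplet = provider.next_droplet
  if droplet
    analyzed << droplet # stands in for known_droplet.analyze
    true                # more work may remain
  else
    false               # signal completion
  end
end
```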
32. Analyzing application state (3)
lib/health_manager/app_state.rb
# check for all anomalies and trigger appropriate events so that listeners can take action
def analyze
  check_for_missing_indices # notify when instances are missing
  check_and_prune_extra_indices # described later
  prune_crashes # described later
end
def check_for_missing_indices # skipped when the state was reset recently (after the task started)
  unless reset_recently? or missing_indices.empty?
    notify(:missing_instances, missing_indices)
    reset_missing_indices # updates @reset_timestamp (which pauses some processing)
  end
end
def reset_recently?
  timestamp_fresher_than?(@reset_timestamp,
                          AppState.heartbeat_deadline || 0)
end
lib/health_manager/harmonizer.rb
AppState.heartbeat_deadline = interval(:droplet_lost) # default: 30
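The effect of reset_recently? can be shown with concrete timestamps. This is a sketch under stated assumptions: the body of timestamp_fresher_than? is not shown on the slide, so the version below (fewer than `age` seconds elapsed) is an assumed reading, and should_notify_missing? is a hypothetical helper that takes the current time as an argument to keep the example deterministic.

```ruby
HEARTBEAT_DEADLINE = 30 # seconds; the :droplet_lost default from the slide

# Assumed semantics: the timestamp is "fresher than" age when fewer than
# age seconds have passed since it was taken.
def timestamp_fresher_than?(timestamp, age, now)
  now - timestamp < age
end

# Hypothetical helper mirroring the condition in check_for_missing_indices:
# notify only when there was no recent reset and something is missing.
def should_notify_missing?(reset_timestamp, missing_indices, now)
  reset_recently = timestamp_fresher_than?(reset_timestamp, HEARTBEAT_DEADLINE, now)
  !(reset_recently || missing_indices.empty?)
end

soon_after_reset = should_notify_missing?(100, [1, 2], 110) # 10 s after the reset
after_deadline   = should_notify_missing?(100, [1, 2], 140) # 40 s after the reset
nothing_missing  = should_notify_missing?(100, [], 140)     # no missing indices
```

Under these assumptions, missing instances are reported only once a full heartbeat deadline has passed since the last reset.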
33. Analyzing application state (3)
lib/health_manager/app_state.rb
def check_and_prune_extra_indices
  extra_instances = [] # instances to be removed, stored together with the reason
  # first, go through each version and prune indices
  @versions.each do |version, version_entry|
    ... # append to extra_instances (described later)
  end
  # now, prune versions
  @versions.delete_if do |version, version_entry|
    if version_entry['instances'].empty?
      @state == STOPPED || version != @live_version
    end # drop versions of STOPPED apps and non-live versions from future analysis
  end
  unless extra_instances.empty?
    notify(:extra_instances, extra_instances)
  end # notify when extra_instances is non-empty
end
34. Analyzing application state (4)
lib/health_manager/app_state.rb
def check_and_prune_extra_indices
  ...
  @versions.each do |version, version_entry|
    version_entry['instances'].delete_if do |index, instance| # deleting extra instances
      if running_state?(instance) &&
         timestamp_older_than?(instance['timestamp'],
                               AppState.heartbeat_deadline) # no heartbeat for longer than the deadline
        instance['state'] = DOWN # mark the instance as DOWN
        instance['state_timestamp'] = now
      end
      prune_reason = [[@state == STOPPED, 'Droplet state is STOPPED'], # the app is STOPPED
                      [index >= @num_instances, 'Extra instance'], # more instances than requested
                      [version != @live_version, 'Live version mismatch'] # the version differs
                     ].find { |condition, _| condition } # first pair whose condition holds, or nil
      if prune_reason # some reason matched
        if running_state?(instance) # and the instance is still running
          reason = prune_reason.last
          extra_instances << [instance['instance'], reason] # add to extra_instances with the reason
        end
      end
      prune_reason # prune when non-nil
    end
  end
  ...
end
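The prune_reason expression relies on Array#find returning the first [condition, reason] pair whose condition is true, or nil when none match. A small standalone sketch of that idiom follows. The keyword arguments and the 'STOPPED' / 'STARTED' string values are assumptions made for illustration; the slide only shows the instance variables and the STOPPED constant.

```ruby
# Returns the reason string for the first matching condition, or nil.
# The order of the pairs decides which reason wins when several apply.
def prune_reason(state:, index:, num_instances:, version:, live_version:)
  pair = [
    [state == 'STOPPED',      'Droplet state is STOPPED'],
    [index >= num_instances,  'Extra instance'],
    [version != live_version, 'Live version mismatch']
  ].find { |condition, _| condition }
  pair && pair.last
end

healthy = prune_reason(state: 'STARTED', index: 0, num_instances: 2,
                       version: 'v1', live_version: 'v1')
extra   = prune_reason(state: 'STARTED', index: 2, num_instances: 2,
                       version: 'v1', live_version: 'v1')
stopped = prune_reason(state: 'STOPPED', index: 5, num_instances: 2,
                       version: 'v0', live_version: 'v1')
```

When several conditions hold at once, as in the last call, only the first reason in the list is reported.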
42. How to harmonize (3)
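Nudger
• Sends messages outside the HM
• Holds a priority queue
• Batch processing
Cloud Foundry (Cloud Controller, etc.)
1. Check the state of the actual instances
2. Check the state that CF expects
3. Decide how to harmonize
4. Drain the queue
Known State Provider
• Tracks instance state changes
• Invokes the event handlers
Expected State Provider
• Fetches the ExpectedState from the CC and provides it
• Via HTTP (Bulk API)
Harmonizer
• Periodically compares the Known State with the Expected State
• Corrects the current state toward the desired state
Scheduler: encapsulates EventMachine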
43. Handled immediately:
Update the ExpectedState when a droplet is updated
lib/health_manager/harmonizer.rb
def prepare
  ...
  AppState.add_listener(:droplet_updated) do |app_state, message|
    logger.info { "harmonizer: droplet_updated: #{message}" }
    app_state.mark_stale # the droplet changed, so mark it as stale
    update_expected_state # and refresh the ExpectedState as well
  end
  ...
end
44. Handled immediately:
Stop excess instances
lib/health_manager/harmonizer.rb
def prepare
  ...
  AppState.add_listener(:extra_instances) do |app_state, extra_instances|
    if app_state.stale? # apps in the stale state are skipped
      logger.info { "harmonizer: stale: extra_instances ignored: #{extra_instances}" }
      next
    end
    logger.debug { "harmonizer: extra_instances"}
    nudger.stop_instances_immediately(app_state, extra_instances)
    # request that the instances be stopped immediately
  end
  ...
end
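The two listeners shown so far interact through the stale flag: a droplet update marks the app stale, and a stale app's extra instances are then ignored. The following is a minimal sketch of that interaction. TinyAppState and notify_listeners are hypothetical, and the listener registry is an assumed design; the slides only show the add_listener and notify calls.

```ruby
# Hypothetical, stripped-down AppState with a class-level listener registry.
class TinyAppState
  @listeners = Hash.new { |hash, key| hash[key] = [] }

  class << self
    def add_listener(event, &block)
      @listeners[event] << block
    end

    def notify_listeners(event, app_state, *args)
      @listeners[event].each { |listener| listener.call(app_state, *args) }
    end
  end

  attr_reader :id

  def initialize(id)
    @id = id
    @stale = false
  end

  def mark_stale
    @stale = true
  end

  def stale?
    @stale
  end

  def notify(event, *args)
    self.class.notify_listeners(event, self, *args)
  end
end

stopped = [] # stands in for nudger.stop_instances_immediately

TinyAppState.add_listener(:droplet_updated) { |app_state, _message| app_state.mark_stale }
TinyAppState.add_listener(:extra_instances) do |app_state, extra_instances|
  next if app_state.stale? # stale apps are skipped
  stopped.concat(extra_instances)
end

fresh = TinyAppState.new('app-1')
fresh.notify(:extra_instances, [['instance-a', 'Extra instance']])

updated = TinyAppState.new('app-2')
updated.notify(:droplet_updated, { 'droplet' => 'app-2' })
updated.notify(:extra_instances, [['instance-b', 'Extra instance']])
```

Only the extra instance of the app that was not updated ends up in the stop list.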
45. Handled immediately:
Applications in the flapping state
lib/health_manager/harmonizer.rb
def prepare
  ...
  AppState.add_listener(:exit_crashed) do |app_state, message|
    ...
    if flapping?(instance) # the instance state is FLAPPING
      unless restart_pending?(instance)
        instance['last_action'] = now
        if giveup_restarting?(instance) # stop restarting once the crash count exceeds the limit
          # TODO: when refactoring this thing out, don't forget to
          # mute for missing indices restarts
          logger.info { "giving up on restarting: app_id=#{app_state.id} index=#{index}" }
        else # the limit has not been reached: restart after a delay
          delay = calculate_delay(instance) # how the delay is computed is described later
          schedule_delayed_restart(app_state, instance, index, delay)
        end
      end
    else
      ...
    end
  end
  ...
end
def giveup_restarting?(instance)
  interval(:giveup_crash_number) > 0 && # zero or a negative value disables the retry limit
    instance['crashes'] > interval(:giveup_crash_number)
end
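The giveup_restarting? condition can be checked with a few concrete values. In this sketch the limit is passed as an argument instead of being read through interval(:giveup_crash_number), and the value 4 is only an illustrative setting, not a default taken from the slides.

```ruby
# Same condition as on the slide, with the configured limit passed in.
def giveup_restarting?(crashes, giveup_crash_number)
  giveup_crash_number > 0 && crashes > giveup_crash_number
end

at_limit    = giveup_restarting?(4, 4)    # not yet above the limit
above_limit = giveup_restarting?(5, 4)    # one crash too many
unlimited   = giveup_restarting?(100, -1) # non-positive limit: never give up
```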
46. High priority:
Restart instances when a DEA exits
lib/health_manager/harmonizer.rb
def prepare
  ...
  AppState.add_listener(:exit_dea) do |app_state, message| # the DEA shut down normally
    index = message['index']
    logger.info { "harmonizer: exit_dea: app_id=#{app_state.id} index=#{index}" }
    nudger.start_instance(app_state, index, HIGH_PRIORITY)
    # queue the instance start (high priority)
  end
  ...
end
47. Medium priority:
Start additional instances when some are missing
lib/health_manager/harmonizer.rb
def prepare
  ...
  AppState.add_listener(:missing_instances) do |app_state, missing_indices|
    if app_state.stale? # apps in the stale state are skipped
      logger.info { "harmonizer: stale: missing_instances ignored app_id=#{app_state.id} indices=#{missing_indices}" }
      next
    end
    logger.debug { "harmonizer: missing_instances"}
    missing_indices.delete_if { |i|
      restart_pending?(app_state.get_instance(i))
    } # skip instances whose restart_pending flag (set by the Harmonizer) is raised
    nudger.start_instances(app_state, missing_indices, NORMAL_PRIORITY)
    # request the instance starts (medium priority)
    #TODO: flapping logic, too
  end
end
48. Low priority:
Restart instances that exited abnormally
lib/health_manager/harmonizer.rb
def prepare
  ...
  AppState.add_listener(:exit_crashed) do |app_state, message|
    logger.debug { "harmonizer: exit_crashed" }
    index = message['index']
    instance = app_state.get_instance(message['version'], message['index'])
    if flapping?(instance)
      ...
    else # the instance state is not FLAPPING
      nudger.start_instance(app_state, index, LOW_PRIORITY)
      # request an instance start (low priority)
    end
  end
  ...
end
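The three priorities used by these listeners only matter once the Nudger drains its queue in batches. The sketch below shows one way such a queue could behave. It is not the Nudger's actual implementation: TinyNudgerQueue, enqueue and dequeue_batch are hypothetical names, and the numeric values of the priority constants are assumptions.

```ruby
# Hypothetical priority queue: higher priority first, and first-in
# first-out among messages of equal priority.
class TinyNudgerQueue
  HIGH_PRIORITY   = 2 # assumed value
  NORMAL_PRIORITY = 1 # assumed value
  LOW_PRIORITY    = 0 # assumed value

  def initialize
    @entries = []
    @sequence = 0
  end

  def enqueue(message, priority)
    # Negated priority sorts high priorities first; the unique sequence
    # number keeps insertion order within one priority.
    @entries << [-priority, @sequence, message]
    @sequence += 1
  end

  # Remove and return up to batch_size messages in priority order.
  def dequeue_batch(batch_size)
    @entries.sort!
    @entries.shift(batch_size).map(&:last)
  end
end

queue = TinyNudgerQueue.new
queue.enqueue('start crashed instance',   TinyNudgerQueue::LOW_PRIORITY)
queue.enqueue('start missing instance 0', TinyNudgerQueue::NORMAL_PRIORITY)
queue.enqueue('start after DEA exit',     TinyNudgerQueue::HIGH_PRIORITY)
queue.enqueue('start missing instance 1', TinyNudgerQueue::NORMAL_PRIORITY)

first_batch  = queue.dequeue_batch(2)
second_batch = queue.dequeue_batch(2)
```

With a batch size of 2, the restart after a DEA exit goes out in the first batch even though it was enqueued third, and the crashed-instance restart goes out last.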
49. How to harmonize (4)
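Nudger
• Sends messages outside the HM
• Holds a priority queue
• Batch processing
Cloud Foundry (Cloud Controller, etc.)
1. Check the state of the actual instances
2. Check the state that CF expects
3. Decide how to harmonize
4. Drain the queue
Known State Provider
• Tracks instance state changes
• Invokes the event handlers
Expected State Provider
• Fetches the ExpectedState from the CC and provides it
• Via HTTP (Bulk API)
Harmonizer
• Periodically compares the Known State with the Expected State
• Corrects the current state toward the desired state
Scheduler: encapsulates EventMachine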
62. #calculate_delay
lib/health_manager/harmonizer.rb
def calculate_delay(instance)
  # once the number of crashes exceeds the value of
  # :flapping_death interval, delay starts with min_restart_delay
  # interval value, and doubles for every additional crash. the
  # delay never exceeds :max_restart_delay though. But wait,
  # there's more: random noise is added to the delay, to avoid a
  # storm of simultaneous restarts. This is necessary because
  # delayed restarts bypass nudger's queue -- once delay period
  # passes, the start message is published immediately.
  delay = [interval(:max_restart_delay), # default: 480
           interval(:min_restart_delay) << # default: 60
             (instance['crashes'] - interval(:flapping_death) - 1) # :flapping_death default: 1
          ].min.to_f
  noise_amount = 2.0 * (rand - 0.5) * interval(:delay_time_noise).to_f # :delay_time_noise default: 10
  result = delay + noise_amount
  logger.info("delay: #{delay} noise: #{noise_amount} result: #{result}")
  result
end
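With the defaults noted on the slide (:max_restart_delay 480, :min_restart_delay 60, :flapping_death 1, :delay_time_noise 10), the delay before noise can be tabulated per crash count. The sketch below is a simplified restatement: base_delay and noise are hypothetical helper names, the defaults are hard-coded as constants, and the random value is passed in so that the noise bounds can be checked deterministically.

```ruby
MAX_RESTART_DELAY = 480 # seconds
MIN_RESTART_DELAY = 60  # seconds
FLAPPING_DEATH    = 1   # crashes
DELAY_TIME_NOISE  = 10  # seconds

# Delay before noise: min_restart_delay doubled once per crash beyond
# flapping_death + 1, capped at max_restart_delay.
def base_delay(crashes)
  [MAX_RESTART_DELAY,
   MIN_RESTART_DELAY << (crashes - FLAPPING_DEATH - 1)].min.to_f
end

# Noise for a given random value in [0, 1); rand would supply it in practice.
def noise(random_value)
  2.0 * (random_value - 0.5) * DELAY_TIME_NOISE.to_f
end

delays = (2..6).map { |crashes| base_delay(crashes) }
# crashes: 2 -> 60.0, 3 -> 120.0, 4 -> 240.0, 5 -> 480.0, 6 -> 480.0 (capped)

lowest_noise = noise(0.0) # lower bound of the noise
middle_noise = noise(0.5) # no noise
```

So under these defaults the delay doubles from 60 to 480 seconds between the second and fifth crash and stays at 480 afterwards, shifted by up to 10 seconds in either direction.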
63. #calculate_delay
lib/health_manager/harmonizer.rb
[Figure: the calculate_delay code from slide 62 with an overlay showing the restart delay by crash count. #1: 0, #2: Min, #3+: Min×2, doubling up to Max]