2. What to expect
• I will talk about two interesting use cases of Applied Math in Azure.
• Unfortunately, I can’t go into details of Azure or the numbers but I’m
hoping the gist will be clear.
3. What is Azure?
• Azure is a cloud service.
• Competitor to AWS
• Basic Architecture
4. Topic 1: Dirichlet Entropy for anomaly detection
Contributors:
• Rohit Pandey
• Gil Lapid Shafriri
5. Background
• At Azure, we keep track of various causes and components associated
with downtimes of customer VMs (categorical histograms).
• We use this data to prioritize fixes for top downtime reasons and
components.
• But what about patterns that manage to stay out of sight?
6. Background (continued)
• There is a tendency to confuse “small” with “ambient”. But over a
large timeframe, “small” becomes “large”.
• Ambient noise should behave like a fair die.
• Truly ambient noise won’t unduly favor any one component (e.g., a
rack).
• We need a single measure of how “skewed” a histogram is, which we can
then trend over time.
7. Approach
• Categorical histograms are like rolls of a die, and the canonical
distribution for the parameters of a die is the Dirichlet.
• A great metric for determining skewness is entropy (for a random
variable 𝑋 with distribution 𝑓𝑋(𝑥)):
$$H(X) = E\left[\log \frac{1}{f_X(X)}\right]$$
[Figure: two histograms over categories 1 to 6, one labeled “Low Entropy” and the other “High Entropy”.]
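As a rough illustration of the entropy measure, the sketch below computes the plug-in Shannon entropy of a categorical histogram directly from its counts. It does not use a Dirichlet model; the normalization by log(k) and the two sample histograms are assumptions made for this example, not values from the talk.

```python
import math

def entropy(counts):
    """Plug-in Shannon entropy (in nats) of the empirical distribution of counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    h = 0.0
    for c in counts:
        if c > 0:  # empty categories contribute nothing (0 * log 0 := 0)
            p = c / total
            h -= p * math.log(p)
    return h

def normalized_entropy(counts):
    """Entropy divided by log(k); 1.0 means perfectly uniform over k categories.

    The normalization is an assumption of this sketch so that histograms
    with different numbers of categories can be compared.
    """
    k = len(counts)
    if k < 2:
        return 0.0
    return entropy(counts) / math.log(k)

# Hypothetical histograms over six categories.
skewed = [33, 2, 1, 2, 1, 1]  # one category dominates
even = [7, 6, 8, 7, 6, 7]     # close to a fair die

print(normalized_entropy(skewed), normalized_entropy(even))
```

The skewed histogram scores lower than the near-uniform one, which is the signal that can be trended over time.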
8. Implementation and Results
• Set up a portal that shows a list of categorical histograms, sorted
in descending order of entropy.
• Caught multiple instances of rack failures.
• Nodes stuck in a reboot loop due to incorrect configuration.
• And more.
9. Topic 2: To reboot or not to reboot
Contributors:
• Rohit Pandey
• Durmus Karatay
• Gil Lapid Shafriri
• Randolph Yao
10. The Problem
• Machines in Azure can be in various “states”. For example, “Healthy”
and “Unwell”.
• When a machine becomes unwell, we wait a certain amount of time
(𝜏0) to give it a chance to organically recover.
• How do we optimize 𝜏0 so as to minimize downtime?
13. Formulation
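$$\begin{aligned}
E[T] &= P(X \le \tau)\, E[X \mid X \le \tau] + P(X > \tau)\,(\tau + Y) \\
     &= \int_0^{\tau} x f_X(x)\, dx + (\tau + Y) \int_{\tau}^{\infty} f_X(x)\, dx
\end{aligned}$$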
• In our estimate of Y, we consider both the happy and the sad paths.
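• We can find the threshold (𝜏) that minimizes the expected downtime by
setting $\partial E[T] / \partial \tau = 0$, which gives a condition on the hazard rate of 𝑋:
$$H_X(\tau) = \frac{f_X(\tau)}{P(X > \tau)} = \frac{1}{Y}$$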
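For completeness, the stationarity condition can be derived with the Leibniz rule. This sketch treats Y as a constant that does not depend on τ, which is an assumption made here rather than something stated on the slide.

```latex
\begin{aligned}
\frac{\partial E[T]}{\partial \tau}
  &= \tau f_X(\tau) + P(X > \tau) - (\tau + Y)\, f_X(\tau) \\
  &= P(X > \tau) - Y f_X(\tau) \\
  &= P(X > \tau)\,\bigl(1 - Y\, H_X(\tau)\bigr) = 0
  \quad\Rightarrow\quad H_X(\tau) = \frac{1}{Y}
\end{aligned}
```

The factored form shows that the stationary point is a minimum when the hazard rate is decreasing in τ: the derivative is negative while the hazard exceeds 1/Y and positive afterwards.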
[Figure: state diagram with states Unwell, Healthy, and Rebooting, annotated with 𝑋: 𝑓𝑋(𝑥), 𝜏, and 𝑌.]
14. Choice of X
• Considered 7-8 distributions and settled on Lomax because it can model
extreme values the best.
$$H_X(\tau) = \frac{c_1}{1 + c_2 \tau} = \frac{1}{Y} \;\Rightarrow\; \tau = \frac{c_1 Y - 1}{c_2}$$
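The closed-form threshold can be checked numerically. The sketch below assumes the hazard parameterization c1 / (1 + c2·τ), which implies a survival function (1 + c2·t)^(−c1/c2), and uses the identity E[T] = ∫₀^τ P(X > x) dx + Y·P(X > τ) (integration by parts of the formulation). The parameter values, the function names, and the clamp to zero are all assumptions made for illustration.

```python
def optimal_wait(c1, c2, y):
    """Threshold tau solving c1 / (1 + c2 * tau) = 1 / y.

    Clamped at 0 (an addition in this sketch): if c1 * y <= 1 the hazard
    never exceeds 1 / y, so waiting never pays off under this model.
    """
    return max(0.0, (c1 * y - 1.0) / c2)

def survival(t, c1, c2):
    """P(X > t) implied by the hazard c1 / (1 + c2 * t)."""
    return (1.0 + c2 * t) ** (-c1 / c2)

def expected_downtime(tau, c1, c2, y, steps=20000):
    """E[T] = integral_0^tau P(X > x) dx + y * P(X > tau), via the midpoint rule."""
    if tau <= 0:
        return y
    dx = tau / steps
    area = sum(survival((i + 0.5) * dx, c1, c2) for i in range(steps)) * dx
    return area + y * survival(tau, c1, c2)

# Hypothetical parameters (time in arbitrary units).
c1, c2, y = 0.5, 0.2, 10.0
tau_star = optimal_wait(c1, c2, y)

for tau in (tau_star - 5.0, tau_star, tau_star + 5.0):
    print(tau, expected_downtime(tau, c1, c2, y))
```

The expected downtime printed for the middle value should be the smallest of the three.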
• To estimate the parameters, we use:
• All samples that we saw go from Unwell to Ready.
• The instances of Unwell to Rebooting (𝑚 of them), all of which are cases where recovery would certainly have taken more than 𝜏0.
$$LL(c_1, c_2 \mid x_1, x_2, \ldots, x_n, m) = \sum_{i=1}^{n} \log f_X(x_i; c_1, c_2) + m \log P(X > \tau_0)$$
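The log-likelihood treats the reboots as right-censored observations: each contributes only the probability that recovery would have taken longer than 𝜏0. A minimal sketch follows, assuming the density c1·(1 + c2·x)^(−c1/c2 − 1) implied by the hazard c1 / (1 + c2·x). The sample data are made up, and the coarse grid search is a stand-in for a proper optimizer, not the method used in the talk.

```python
import math

def lomax_log_pdf(x, c1, c2):
    """log f_X(x) for the density implied by hazard c1 / (1 + c2 * x)."""
    return math.log(c1) - (c1 / c2 + 1.0) * math.log1p(c2 * x)

def lomax_log_survival(t, c1, c2):
    """log P(X > t) for the same distribution."""
    return -(c1 / c2) * math.log1p(c2 * t)

def log_likelihood(c1, c2, recovered, m, tau0):
    """Observed recovery times plus m reboots censored at tau0."""
    ll = sum(lomax_log_pdf(x, c1, c2) for x in recovered)
    return ll + m * lomax_log_survival(tau0, c1, c2)

def fit_grid(recovered, m, tau0, candidates):
    """Pick the (c1, c2) pair on a grid with the highest log-likelihood."""
    best = None
    for c1 in candidates:
        for c2 in candidates:
            ll = log_likelihood(c1, c2, recovered, m, tau0)
            if best is None or ll > best[0]:
                best = (ll, c1, c2)
    return best[1], best[2]

# Hypothetical data: four organic recoveries and two reboots at tau0 = 5.
recovered = [0.5, 1.0, 2.0, 0.3]
c1_hat, c2_hat = fit_grid(recovered, m=2, tau0=5.0,
                          candidates=[0.1, 0.5, 1.0, 2.0])
print(c1_hat, c2_hat)
```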
15. Choice of Y
• We think of “Healthy” as the absorbing state, others as transient.
• We denote by 𝑥𝑖 the time taken to get to the absorbing state from
transient state 𝑖.
$$x_i = \sum_{j=1}^{n} p_{ij}\,(t_{ij} + x_j) \;\Rightarrow\; (I - Q)\,x = (P \circ L)\,\mathbf{1}$$
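The linear system can be solved in several ways. The sketch below uses plain fixed-point iteration on x = b + Q·x, which converges for an absorbing chain, instead of a direct solve. It assumes that Q is the transient-to-transient block of the transition matrix, that the matrix of transition times holds the t_ij, and that the absorbing state has x = 0. The three-state chain and all of its numbers are made up for illustration.

```python
def expected_times_to_absorption(p, t, n_transient, tol=1e-12, max_iter=100000):
    """Solve x_i = sum_j p[i][j] * (t[i][j] + x_j), with x = 0 at absorbing states.

    p and t have one row per transient state and one column per state,
    with the transient states listed first.
    """
    # b_i = sum_j p_ij * t_ij over ALL states j, i.e. the rows of (P o L) 1.
    b = [sum(pij * tij for pij, tij in zip(p[i], t[i]))
         for i in range(n_transient)]
    x = [0.0] * n_transient
    for _ in range(max_iter):
        # Only transient columns (the block Q) feed back into x.
        new_x = [b[i] + sum(p[i][j] * x[j] for j in range(n_transient))
                 for i in range(n_transient)]
        if max(abs(a - c) for a, c in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical chain: states 0 and 1 are transient, state 2 is "Healthy".
p = [[0.0, 0.2, 0.8],     # transition probabilities out of state 0
     [0.5, 0.0, 0.5]]     # transition probabilities out of state 1
t = [[0.0, 5.0, 10.0],    # transition times out of state 0
     [20.0, 0.0, 30.0]]   # transition times out of state 1

x = expected_times_to_absorption(p, t, n_transient=2)
print(x)
```

For larger chains a direct linear solve of (I − Q)·x = b would be the more usual choice; the iteration is used here only to keep the example dependency-free.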