.NET Core Summer event, 2019 in Brno, CZ - 2019/7/9
Talk: War stories from .NET team by Karel Zikmund
https://www.wug.cz/brno/akce/1152--NET-Core-Summer-Event
1. War stories
from .NET team
.NET Core Summer event 2019 – Brno, CZ
Karel Zikmund – @ziki_cz
2. Agenda
• Stories
• Investigations on .NET team
• Not just from me
• Lessons learned on the way
Not on agenda – you won’t see any:
• Source code
• Debugger
Not needed: Deep .NET knowledge
3. My First Serious Investigation
• Build lab for Windows component
• Build break 1x per week
• AccessViolation dialog hangs machine
• Toolset updated to 2.0 RTM
• Repro:
• Once in ~50 runs
• Overnight run: 247 crashes out of 77,006 runs (0.3%)
4. My First Serious Investigation - quotes
• “The actual crash is occurring on some boilerplate stack checking code …”
• “Karel is relatively new to the code base so he indicated it might take some time to understand what’s going on”
6. My First Serious Investigation
• Who corrupts stack?
• GC?
• NO!
• Changed value between caller and callee
• Single bit changed
• Who corrupts it?
• GC card table updates?
• Of course NOT!
• What about HW?
• Naw!
• Or maybe?
7. My First Serious Investigation
• Does it by any chance reproduce on only one machine?
• Answer: How did you know?
• But why always the same callstack?
• Good question, no good answer … magic
• Lesson learned: Debugging HW errors is costly and hard
• Always ask: Does it repro on more than 1 machine?
8. Another MetaData story
MetaData format background:
• Basically database – rows and columns
• Example – TypeDef table:
  Flags      TypeName  TypeNamespace    Extends  MethodList
  (Public)   “Foo”     “Awesome.Story”  …        Method #10
  (Private)  “Bar”     “Awesome.Story”  …        Method #11
• Indexes into tables/heaps are either 2B or 4B
• What happens if last TypeDef has no methods?
• MethodList = Number of methods + 1 = max + 1
• What happens if there are 0xffff methods?
9. Another MetaData story
• II.24.2.6 “#~ stream”
• If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than 2^16 rows, otherwise it is stored using 4 bytes.
• II.22.37 TypeDef : 0x02
• 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid means 1 <= row <= rowcount+1 [ERROR]
• How do you fix it?
• “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17 years”
• https://github.com/dotnet/corefx/issues/29554
• Lesson learned: Not all bugs have to be fixed
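The conflict between the two quoted spec rules can be checked with a few lines of arithmetic. This is a minimal sketch that models only the two rules as quoted on the slide; the function names are hypothetical and it does not read real metadata files.

```python
def index_width(row_count: int) -> int:
    """Width in bytes of a simple table index, per the quoted II.24.2.6 rule."""
    return 2 if row_count < 2**16 else 4


def last_method_list(method_rows: int) -> int:
    """MethodList of a trailing TypeDef that owns no methods: rowcount + 1."""
    return method_rows + 1


def fits(value: int, width: int) -> bool:
    """True if value can be stored as an unsigned integer of `width` bytes."""
    return 0 <= value < 2 ** (8 * width)


# Walk across the boundary: only exactly 0xffff MethodDef rows is a problem.
for rows in (0xFFFE, 0xFFFF, 0x10000):
    width = index_width(rows)
    value = last_method_list(rows)
    print(f"rows={rows:#x} width={width}B MethodList={value:#x} fits={fits(value, width)}")
```

With exactly 0xffff rows the index is still 2 bytes wide, yet the "one past the end" value 0x10000 needs 17 bits. One row fewer or one row more and the value fits.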
10. TypeSystem – Collapsing interfaces
• Table of implemented interfaces:
class A : I, J {}                  0: I, 1: J
class B : A, K {}                  0: I, 1: J, 2: K
• With generics:
class C<T> : L<T> {}
class D<T> : C<T>, L<string> {}    0: L<T>, 1: L<string>
class E : D<string>, I {}          0: L<string>, 1: I
Fix:                               0: L<string>, 1: L<string>, 2: I
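The effect of collapsing on slot indexes can be reproduced with a toy model. This is a sketch under simplifying assumptions (interfaces are plain strings, substitution is textual, helper names are hypothetical); it is not how the CLR type loader is implemented.

```python
def substitute(table, arg):
    """Instantiate an open interface table: replace type parameter T with `arg`."""
    return [iface.replace("<T>", f"<{arg}>") for iface in table]


def extend_table(parent_table, own_interfaces, collapse):
    """Interface table of a derived type: the parent's slots, then new interfaces.

    collapse=True drops duplicates that appear after substitution (shifting
    later slots); collapse=False keeps every parent slot at its original index.
    """
    table = []
    for iface in parent_table:
        if collapse and iface in table:
            continue
        table.append(iface)
    for iface in own_interfaces:
        if iface not in table:
            table.append(iface)
    return table


# class C<T> : L<T> {}
c_open = extend_table([], ["L<T>"], collapse=False)
# class D<T> : C<T>, L<string> {}
d_open = extend_table(c_open, ["L<string>"], collapse=False)
# class E : D<string>, I {}
d_string = substitute(d_open, "string")
e_collapsed = extend_table(d_string, ["I"], collapse=True)
e_fixed = extend_table(d_string, ["I"], collapse=False)

print(d_open)       # ['L<T>', 'L<string>']
print(e_collapsed)  # ['L<string>', 'I']
print(e_fixed)      # ['L<string>', 'L<string>', 'I']
```

Code compiled against D<T> expects L<string> at slot 1. In the collapsed table for E that slot holds I, so an index-based dispatch lands on the wrong interface. Keeping the duplicate entry preserves the slot numbers.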
11. Breaking changes – Intro
• Everyone wants fix for their bug
• But nobody wants to be broken
• Observation: 10% of fixes have unintended side-effects
• Extreme case: Perf improvement can break app
• How many customers?
• Lesson learned: Everything has risk of breaking someone
12. Breaking changes – Last build
• Finance app crashing – “last” build of Windows 8 on ARM (Surface RT)
• Latent bug (introduced months ago)
• Bug triggered by:
1. Method in NGen image has to be across 8KB pages
2. GC has to be triggered at least twice when it’s on stack
• Unrelated change caused “unlucky” method order for:
• System.Net.Configuration.DefaultProxySectionInternal..ctor
• Lesson learned: Anything, really ANYTHING, has risk of breaking
13. Breaking changes – Huge impact
• Patch to .NET Framework broke certain tax SW
• Printing tax forms
• Update pushed a few days before the US tax deadline
• Note: Printing was tested on both sides (Microsoft & tax SW company)
• But only into file, not to printer
• Lessons learned: Be extra cautious around sensitive dates
14. Breaking changes – Below you
• RavenDB – blue screen after KB4487017 on .NET Core!
• dotnet/coreclr#22597
• PrefetchVirtualMemory
• Kernel memory management bug
15. Networking – Security issue
• January: Researcher running ML models on Cosmos
• Suspicion about buffers – more logging
• March: Repro gone
• May: Similar report
• +2 weeks: It blows up (more teams & impact)
• All hands on deck
• Small repro (20 min, then 1 min) … yay!
• TTD trace (iDNA / TTT) … bonus & life saver
16. Networking – Security issue
• Root-cause: HTTP pipelining under stress
• 13 years old bug (.NET 2.0)
[Diagram: Request 1 sent to Server, Response 1 returned; then Request 1 and Request 2 sent to Server on the same connection, Response 1 and Response 2 returned]
19. Networking – Security issue
• We have workaround (disable pipelining) – perf impact
• Worked fix …
• Verifying fix …
• Repro fails after 4h
• Same symptoms
• Repro sensitive to cloud network load (8-17)
• TTD (iDNA / TTT) does not work
• Suspicion about buffers again
20. Networking – Security issue
• Bad buffer lifetime management – on sending side!
• 5 years old bug (.NET 4.5.2)
• Trigger found:
• Thanks to Skype team – 24h deployment of experiments
• Change in .NET 4.7.1
• Fix around the problematic area
• Making the opportunity window SMALLER!
• … counter-intuitive
• Code review – similar bug on receiving side (5 years old)
• Same symptoms as HTTP pipelining
21. Networking – Security issue
• Why so many customers/services hit it at once?
• Maybe the rollout of Spectre & Meltdown fixes?
• or just … magic
• Lesson learned: Weird coincidences can happen …
22. Developer’s pride in multi-threading
• School project (2000-2003)
• Game simulation server – heavily multi-threaded
• https://github.com/karelz/WarPlusPlus (nostalgia)
• Classic deadlock – 2 threads locking A and B in different order
• Deadlock avoidance started to make sense
• WinRT binder (2010)
• Binder is tricky – GC interaction (NO_GC range)
• Type routed to WinMD file, assembly meaningless
• Negotiated on namespace only in 1 assembly
• Multiple reviews, discussions with architects
• Bugs start to come in after shipping (NullReferenceException)
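The classic two-lock deadlock mentioned on this slide is usually avoided by agreeing on one global lock order. A minimal sketch of that idea follows; the `RANK` table and helper names are hypothetical and unrelated to the WarPlusPlus or WinRT binder code.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
# Hypothetical global ranking: every thread must take lower ranks first.
RANK = {id(lock_a): 0, id(lock_b): 1}


def with_both(first, second, action):
    """Acquire two locks in rank order regardless of the order requested."""
    low, high = sorted((first, second), key=lambda lock: RANK[id(lock)])
    with low:
        with high:
            action()


counter = 0


def bump():
    global counter
    counter += 1


def worker(first, second, rounds=10_000):
    for _ in range(rounds):
        with_both(first, second, bump)


# The two threads ask for the locks in opposite orders, which would be the
# deadlock-prone pattern if each acquired them exactly as requested.
t1 = threading.Thread(target=worker, args=(lock_a, lock_b))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a))
t1.start()
t2.start()
t1.join(timeout=30)
t2.join(timeout=30)
print(counter, t1.is_alive(), t2.is_alive())
```

Because both threads end up taking `lock_a` before `lock_b`, neither can hold one lock while waiting for the other in the reverse order.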
23. Optimizations
• Once upon a time, … there was a service in Microsoft
• List vs. array data structure perf
• Perspectives:
1. The data structure will have in practice 3-5 items
2. There are 3 hops between servers for each request!!!
• Lesson learned: Avoid premature optimizations … at all cost
24. Lessons learned
• Always ask: Does it repro on more than 1 machine?
• Debugging HW bugs is costly
• Some bugs happen once in 17 years
• Spec bugs are hard to fix
• MetaData format bug
• Anything, really ANYTHING, has risk of breaking someone
• Innocent changes can trigger latent bugs elsewhere
• Impact may be huge – e.g. during tax season
• Always try to create small repro
• Make your and everyone’s life easier
• TTD (iDNA / TTT) is life saver
• Avoid premature optimizations … at all cost, save your time
• … sometimes there is just … magic
25. Thank you
• Feedback welcome
• Twitter DM, email, in-person, etc.
• Survey
• What you liked vs. not?
• Too rushed?
• Hard to understand?
• Boring?
• Didn’t meet your expectations?
Editor's Notes
Quickly about me:
.NET team for almost 14 years
Started as junior / out of college on Runtime – C++, pieces like Metadata, TypeSystem, Assembly Loader
Later on moved to manager role
Then moved to BCL (Base Class Libraries) – Networking area mainly (HttpClient) … working in open-source (.NET Core)
Community manager of dotnet/corefx repo
Lessons learned – maybe useful to you
Maybe just helps you understand what is happening on the other side / below you
I already had a few people confirm they hit some/all situations
Were able to identify with problems and recommendations
2006 January – 3 months in MS
Large code base, dozens of machines, productivity impact on larger team
Crash – “hang dialog” with AV
msbuild -> C# compiler
Recently upgraded toolset to 2.0 RTM (.NET Framework, not Core)
Repro – great
Getting heap dumps
We get to see callstack … but before that, some quotes
… in the metadata writer code
Simplified callstack for readability
AV in MetaData emitting – defining a parameter
Basically stack corruption (dangerous)
Proper RW lock
Who corrupts memory? …
GC? … not Roslyn – this is native, no GC
Why something else? C# compiler is deterministic
Go into assembly (x86) – what are arguments vs. locals
* Great exercise to learn/refresh all this in here
Costly and hard … and requires quite some expertise
Variants:
Different machine setup? … driver bugs
Extreme from Maoni: Real HW?
1 year old story – 2018 May
First background on MetaData
Compressed indexes = just schema which says 2B, 4B … variable between files, but static/stable and given per file
MethodList = Start of list of methods, INCLUSIVE
How do you fix that? … You don’t … spec bug / format bug
Changing rules means rewriting & recompiling all tools (CCI and command line tools like ildasm, or UI Reflector, ILSpy, Visual Studio, debuggers, profilers, …)
Compensate?
Rearranging fields/methods/params in a way the last one does not need the +1.
Nasty
Emitting fake type/method with field/method/param to push row count to 2^16.
Also nasty
Using 0 as valid value? Readers will be surprised, maybe other bugs?
2009/4 story
VirtualStubDispatch uses indexes of interfaces to call virtual methods – code in D<T>, casting to K<string> will use index 1 for interface and method index.
But if instance of E is passed, method on I (interface index 1)
SECURITY issue!
Missing feature – implementing in security patch … risk of breaking changes
Proper description complicates spec a lot
Kind of problem when spec had 1 reference implementation – described the implementation
Bug since introduction of generics, many smart people looked at it, yet missed it
Read slides
OEM getting builds 2 days
Paranoia
Sensitive dates like tax date, shopping season? (December) … online stores usually have stop on any changes
February case
Blue screen – not something CoreCLR should be able to cause
Used PrefetchVirtualMemory to optimize perf – rarely used
When you’re on cutting edge, pushing the limits, you should be prepared for anything …
Last July (2018)
Story starts 8 months earlier in December 2017
Is it server or client problem? … wireshark traces
Around Feb, we know it is client - .NET or Windows
March – repro is gone (they upgraded cluster)
(fast forward 2 months)
May another email thread – similar symptoms
Back and forth
Heated
Realize it is 2 different products on the thread
And then a couple more start coming in within a span of 2 weeks
Impact on one customer is huge
Potential:
Data loss
Information disclosure – mixing data in multi-tenant scenarios
3-4 weeks of all-hands on deck + 24/7
We had iDNA trace (TTD / TTT)
What happens when requests are cancelled?
If 1st – close connection
If last – remove it & mark for closing
If in middle – remove it & mark for closing
Bad things can happen – imagine you asked:
“Does the data exist?” … data loss
Multi-tenant scenarios: “Give me data about customer X” … data about Y
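Why a cancelled request is dangerous on a pipelined connection can be shown with a toy model. This sketch assumes responses are matched to pending requests purely by order; it illustrates the class of hazard, not the actual .NET bug, and all names in it are hypothetical.

```python
from collections import deque


def server(requests):
    """A toy server: answers every request it received, in arrival order."""
    return [f"data for {r}" for r in requests]


def match_responses(sent, cancelled=None):
    """Pair in-order responses with the client's queue of pending requests.

    `cancelled` is dropped from the client's queue only; the server has
    already received it and still sends its response.
    """
    pending = deque(r for r in sent if r != cancelled)
    paired = {}
    for response in server(sent):
        if not pending:
            break  # leftover response with no waiting request
        paired[pending.popleft()] = response
    return paired


print(match_responses(["X", "Y", "Z"]))
print(match_responses(["X", "Y", "Z"], cancelled="Y"))
```

In the second call the caller waiting on Z receives the data for Y, which is the multi-tenant mix-up described above. Closing the connection after a cancellation, as the notes describe, avoids reusing a stream whose order can no longer be trusted.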
Added logging (ETW) – reused buffers
Old code – track down bad buffer management
Story begins at University (2000-2003)
- Large project (2.5 years, 1M+ lines of code, 5 people)
Something like SharePoint Online, Bing, O365 OneNote – where response to customer matters
Caring about perf early (with measurements!) != premature optimization