.NET Core Summer event 2019 in Linz, AT - War stories from .NET team -- Karel Zikmund

.NET Core Summer event, 2019 in Linz, AT - 2019/7/23
Talk: War stories from .NET team by Karel Zikmund

https://www.meetup.com/NET-Stammtisch-Linz/events/261637908/

  1. War stories from .NET team
     .NET Core Summer event 2019 – Linz, AT
     Karel Zikmund – @ziki_cz
  2. Agenda
     • Stories
     • Investigations on .NET team
     • Not just from me
     • Lessons learned on the way
     Not on agenda. You won’t see any:
     • Source code
     • Debugger
  3. My First Serious Investigation
     • Build lab for Windows component
     • Build break 1x per week
     • AccessViolation dialog hangs machine
     • Repro:
       • Once in ~50 runs
       • Overnight run: 247 crashes out of 77,006 runs (0.3%)
  4. My First Serious Investigation
     mscorwks!UTSemReadWrite::UnlockRead+0xe
     mscorwks!CMDSemReadWrite::~CMDSemReadWrite+0x14
     mscorwks!RegMeta::DefineParam+0x19
     cscomp!EMITTER::EmitParamProp
     …
     …
     …
     cscomp!CController::Compile
     csc!main
  5. My First Serious Investigation
     • Does it by any chance reproduce on only one machine?
     • Answer: How did you know?
     • But why always the same callstack?
     • Good question, no good answer … magic
     • Lesson learned: Debugging HW errors is costly and hard
     • Always ask: Does it repro on more than 1 machine?
  6. Another MetaData story
     MetaData format background:
     • Basically database – rows and columns
     • Example – TypeDef table:

       Flags      TypeName  TypeNamespace    Extends  MethodList
       (Public)   “Foo”     “Awesome.Story”  …        Method #10
       (Private)  “Bar”     “Awesome.Story”  …        Method #11

     • Indexes into tables/heaps are either 2B or 4B
     • What happens if last TypeDef has no methods?
     • MethodList = Number of methods + 1 = max + 1
     • What happens if there are 0xffff methods?
  7. Another MetaData story
     • II.24.2.6 “#~ stream”
       • If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than 2^16 rows, otherwise it is stored using 4 bytes.
     • II.22.37 TypeDef : 0x02
       • 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid means 1 <= row <= rowcount+1 [ERROR]
     • How do you fix it?
       • “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17 years”
       • https://github.com/dotnet/corefx/issues/29554
     • Lesson learned: Not all bugs have to be fixed
  8. TypeSystem – Collapsing interfaces
     • Table of implemented interfaces:
         class A : I, J {}                  0: I          1: J
         class B : A, K {}                  0: I          1: J          2: K
     • With generics:
         class C<T> : L<T> {}
         class D<T> : C<T>, L<string> {}    0: L<T>       1: L<string>
         class E : D<string>, I {}          0: L<string>  1: I
         Fix:                               0: L<string>  1: L<string>  2: I
  9. Breaking changes – Intro
     • Everyone wants fix for their bug
     • But nobody wants to be broken
     • Observation: 10% of fixes have unintended side-effects
     • Extreme case: Perf improvement can break app
     • How many customers?
     • Lesson learned: Everything has risk of breaking someone
  10. Breaking changes – Last build
      • Finance app crashing – “last” build of Windows 8 on ARM (Surface RT)
      • Latent bug (introduced months ago)
      • Bug triggered by:
        1. Method in NGen image has to be across 8KB pages
        2. GC has to be triggered at least twice when it’s on stack
      • Unrelated change caused “unlucky” method order for:
        • System.Net.Configuration.DefaultProxySectionInternal..ctor
      • Lesson learned: Anything, really ANYTHING, has risk of breaking
  11. Breaking changes – Huge impact
      • Patch to .NET Framework broke certain tax SW
      • Printing tax forms
      • Update pushed a few days before tax deadline in US
      • Note: Printing was tested on both sides (Microsoft & tax SW company)
      • But only into file, not to printer
      • Lesson learned: Be extra cautious around sensitive dates
  12. Breaking changes – Below you
      • RavenDB – blue screen after KB4487017 on .NET Core!
      • dotnet/coreclr#22597
      • PrefetchVirtualMemory
      • Kernel memory management bug
  13. Networking – Security issue
      • January: Researcher running ML models on Cosmos
      • Suspicion about buffers – more logging
      • March: Repro gone
      • May: Similar report
      • +2 weeks: It blows up (more teams & impact)
      • All hands on deck
      • Small repro (20 min, then 1 min) … yay!
      • TTD trace (iDNA / TTT) … bonus & life saver
  14. Networking – Security issue
      • Root-cause: HTTP pipelining under stress
      • 13 years old bug (.NET 2.0)
      [Diagram: two client/server exchanges: Request 1 and Response 1; then Request 1, Request 2 and Response 1, Response 2]
  15. Networking – Security issue
      [Diagram: Request 1, Request 2, Request 3 sent to Server; Response 1, Response 2 returned]
  16. Networking – Security issue
      [Diagram: Request 1, Request 2, Request 3 sent to Server; Response 1, Response 2 returned]
  17. Networking – Security issue
      • We have workaround (disable pipelining) – perf impact
      • Worked fix …
      • Verifying fix …
      • Repro fails after 4h ☹
      • Same symptoms
      • Repro sensitive to cloud network load (8-17)
      • TTD (iDNA / TTT) does not work ☹
      • Suspicion about buffers again
  18. Networking – Security issue
      • Bad buffer lifetime management – on sending side!
      • 5 years old bug (.NET 4.5.2)
      • Trigger found:
        • Thanks to Skype team – 24h deployment of experiments
        • Change in .NET 4.7.1
        • Fix around the problematic area
        • Making the opportunity window SMALLER!
        • … counter-intuitive
      • Code review – similar bug on receiving side (5 years old)
      • Same symptoms as HTTP pipelining
  19. Networking – Security issue
      • Why so many customers/services hit it at once?
      • Maybe Spectre & Meltdown fixes roll out?
      • or just … magic
      • Lesson learned: Weird coincidences can happen …
  20. Optimizations
      • Once upon a time, … there was a service in Microsoft
      • List vs. array data structure perf
      • Perspectives:
        1. The data structure will have in practice 3-5 items
        2. There are 3 hops between servers for each request!!!
      • Lesson learned: Avoid premature optimizations … at all cost
  21. Lessons learned
      • Always ask: Does it repro on more than 1 machine?
      • Debugging HW bugs is costly
      • Some bugs happen once in 17 years
      • Spec bugs are hard to fix
      • MetaData format bug
      • Anything, really ANYTHING, has risk of breaking someone
      • Innocent changes can trigger latent bugs elsewhere
      • Impact may be huge – e.g. during tax season
      • Always try to create small repro
      • Make your and everyone’s life easier
      • TTD (iDNA / TTT) is life saver
      • Avoid premature optimizations … at all cost, save your time
      • … sometimes there is just … magic
      @ziki_cz
  22. Thank you
      • Feedback welcome
      • Twitter DM, email, in-person, etc.
      • Survey
      • What you liked vs. not?
      • Too rushed?
      • Hard to understand?
      • Boring?
      • Didn’t meet your expectations?
      @ziki_cz
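
The index-width rule quoted in the "Another MetaData story" slides can be made concrete with a small sketch. It is written in Python only for illustration. The function names are hypothetical and do not come from any real metadata reader or writer. It assumes the problem case is a MethodDef table with exactly 0xffff rows, where the "rowcount + 1" end marker no longer fits in the 2-byte index the spec prescribes.

```python
# Sketch of the simple-index width rule quoted from II.24.2.6.
# All names here are illustrative; none come from a real metadata library.

def index_width(row_count: int) -> int:
    """Bytes used for a simple index into a table with row_count rows."""
    return 2 if row_count < 2**16 else 4

def end_marker(row_count: int) -> int:
    """MethodList value for a trailing TypeDef that owns no methods (rowcount + 1)."""
    return row_count + 1

def fits(value: int, width: int) -> bool:
    """True if value can be stored in an unsigned field of the given byte width."""
    return 0 <= value < 2 ** (8 * width)

for rows in (0xFFFE, 0xFFFF, 0x10000):
    width = index_width(rows)
    marker = end_marker(rows)
    print(f"rows={rows:#x} width={width}B marker={marker:#x} fits={fits(marker, width)}")
```

Only the middle case fails: with 0xffff rows the index is still 2 bytes wide, yet the end marker is 0x10000. One row fewer and the marker fits; one row more and the index widens to 4 bytes.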

Editor's Notes

  • Quickly about me:
    .NET team for almost 14 years
    Started as junior / out of college on Runtime – C++, pieces like Metadata, TypeSystem, Assembly Loader
    Later on moved to manager role
    Then moved to BCL (Base Class Libraries) – Networking area mainly (HttpClient) … working in open-source (.NET Core)
    Community manager of dotnet/corefx repo
  • Lessons learned – maybe useful to you
    Maybe just helps you understand what is happening on the other side / below you
    I already had a few people confirm they hit some/all situations
    Were able to identify with problems and recommendations
  • 2006 January – 3 months in MS

    Large code base, dozens of machines, productivity impact on larger team
    Crash – “hang dialog” with AV

    Repro – great
    Getting heap dumps

    We get to see callstack … but before that, some quotes
  • Simplified callstack for readability
    AV in MetaData emitting – defining a parameter
    Basically stack corruption (dangerous)
    Proper RW lock
    Who corrupts memory? …
  • Costly and hard … and requires quite some expertise

    Variants:
    Different machine setup? … driver bugs
    Extreme from Maoni: Real HW?
  • 1 year old story – 2018 May

    First background on MetaData
    Compressed indexes = just schema which says 2B, 4B … variable between files, but static/stable and given per file

    MethodList = Start of list of methods, INCLUSIVE
  • How do you fix that? … You don’t … spec bug / format bug
    Changing rules means rewriting & recompiling all tools (CCI and command line tools like ildasm, or UI Reflector, ILSpy, Visual Studio, debuggers, profilers, …)

    Compensate?
    Rearranging fields/methods/params in a way the last one does not need the +1.
    Nasty
    Emitting fake type/method with field/method/param to push row count to 2^16.
    Also nasty
    Using 0 as valid value? Readers will be surprised, maybe other bugs?
  • 2009/4 story

    VirtualStubDispatch uses indexes of interfaces to call virtual methods – code in D<T>, casting to L<string> will use index 1 for interface and method index.
    But if instance of E is passed, index 1 resolves to a method on I (interface index 1)
    SECURITY issue!
    Missing feature – implementing in security patch … risk of breaking changes

    Proper description complicates spec a lot

    Kind of problem when spec had 1 reference implementation – described the implementation

    Bug since introduction of generics, many smart people looked at it, yet missed it
  • Read slides
  • OEM getting builds 2 days

    Paranoia
  • Sensitive dates like tax date, shopping season? (December) … online stores usually have stop on any changes
  • Federico Lois

    February case
    Blue screen – not something CoreCLR should be able to cause
    Used PrefetchVirtualMemory to optimize perf – rarely used

    When you’re on cutting edge, pushing the limits, you should be prepared for anything
  • Last July (2018)
    Story starts 8 months earlier in December 2017
    Is it server or client problem? … wireshark traces
    Around Feb, we know it is client - .NET or Windows
    March – repro is gone (they upgraded cluster)

    (fast forward 2 months)
    May another email thread – similar symptoms
    Back and forth
    Heated
    Realize it is 2 different products on the thread
    And then a couple more start coming in within the span of 2 weeks
    Impact on one customer is huge
    Potential:
    Data loss
    Information disclosure – mixing data in multi-tenant scenarios

    3-4 weeks of all-hands on deck + 24/7

    We had iDNA trace (TTD / TTT)
  • What happens when requests are cancelled?
    If 1st – close connection
    If last – remove it & mark for closing
    If in middle – remove it & mark for closing
  • Bad things can happen – imagine you asked:
    “Does the data exist?” … data loss
    Multi-tenant scenarios: “Give me data about customer X” … data about Y
  • Added logging (ETW) – reused buffers
    Old code – track down bad buffer management
  • Something like SharePoint Online, Bing, O365 OneNote – where response to customer matters

    Caring about perf early (with measurements!) != premature optimization
  • Help me do a better job next time
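
The notes on cancelled requests hint at why HTTP pipelining is fragile: responses carry no request ID, so the client pairs them with pending requests purely by order. The toy model below is a sketch under stated assumptions, not the actual .NET code or the actual bug. The function `pair_responses` and its parameters are hypothetical. It shows what would happen if a cancelled request were removed from the middle of the queue while the connection stayed in use.

```python
from collections import deque

def pair_responses(requests, cancelled, close_on_cancel):
    """Pair in-order server responses with the client's pending-request queue.

    Toy model: the server has already received every request, so it answers
    all of them in order, including the one the client later cancels.
    """
    pending = deque(requests)
    wire = deque(f"response to {r}" for r in requests)
    if cancelled in pending:
        if close_on_cancel:
            return []  # connection dropped, nothing is delivered on it
        pending.remove(cancelled)  # hypothetical unsafe handling
    delivered = []
    while pending and wire:
        delivered.append((pending.popleft(), wire.popleft()))
    return delivered

requests = ["customer W", "customer X", "customer Y"]
for request, response in pair_responses(requests, "customer X", close_on_cancel=False):
    print(f"{request!r} received {response!r}")
```

In this model the caller asking about customer Y receives the response meant for customer X, which is the multi-tenant mix-up described in the notes. Closing the connection on cancellation avoids the mismatch, at the cost of retrying the remaining requests elsewhere.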
