Cache Consistency – Requirements and its packet processing Performance implications

  1. Packet Processing & Cache Coherency 101 – A Primer. By: M Jay
  2. 2 Notices and Disclaimers No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. Intel, the Intel logo, {List of ALL the Intel trademarks in this document} are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others © Intel Corporation.
  3. 3 Agenda •  Cache Coherency – Is it really needed? – Message Passing Vs Shared Mem •  Read access & cache - benefits we all know •  What about Write & Cache? •  Write Through – Write Back Cache •  DPDK PMD and Cache Coherency •  Snoop Protocol •  NUMA •  LIFO •  Dynamic Vs Static •  DDIO & Cache Size
  4. 4 Thread Local Storage – why worry about coherency? Well ! I need to Share Data !!
  5. 5 Thread Local Storage – why worry about coherency? Well ! I need to Share Data !!
  6. 6 Why share data? Why don't developers use the Message Passing Paradigm? Can we visualize having no shared address space?
  7. 7 Why share data? Why don't developers use the Message Passing Paradigm? Scratch Scratch Scratch Scratch – what if developers did so?
  8. 8 No Need For Coherency Protocol !! No need for Coherency protocol !
  9. 9 No Need of Cache Coherency? Message Passing – no need of Coherency. Shared Memory Paradigm – H/W to manage Coherency.
  10. 10 So, really, what is the root cause of the Cache Coherency requirement? Where does the Cache Coherency requirement come from? Is it the software developers' problem of "not doing truly parallel programming"? Or is it the hardware designers' "overdo" problem?
  11. 11 Well! But… Message Passing needs Moving Data Around… Moving Data… Won't that be a lot of overhead? Shared Memory means Just Read / Write. No Moving Data Around! Right? Yeah! Right! Bring it On, Shared Memory!
  12. 12 Why do you need to share data with another thread?
  13. Network Platforms Group What Is The Task At Hand? Receive Process Transmit rx cost tx cost A Chain is only as strong as …..
  14. Network Platforms Group Benefits – Eliminating / Hiding Overheads. Interrupt Context Switch Overhead, Kernel/User Overhead, Core-To-Thread Scheduling Overhead – Eliminating How? Polling, User Mode Driver, Pthread Affinity. 4K Paging Overhead, PCI Bridge I/O Overhead – Eliminating / Hiding How? Huge Pages, Lockless Inter-core Communication, High Throughput Bulk Mode I/O calls. To tackle this challenge, what kind of devices / latencies do we have at our disposal?
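The "Pthread Affinity" item above is the usual way poll-mode apps keep one worker thread per core; as a minimal sketch (not from the deck – core 2 is an arbitrary example), using the glibc affinity API:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one physical core so the scheduler never
     * migrates it and its working set stays in that core's private caches. */
    static int pin_self_to_core(int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_self_to_core(2) != 0)          /* core 2: arbitrary example */
            fprintf(stderr, "failed to set affinity\n");
        /* ... poll-mode receive / process / transmit loop runs here ... */
        return 0;
    }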
  15. Network Platforms Group 15 PCIe* Connectivity and Core Usage – Using run-to-completion or pipeline software models. Run to Completion Model: I/O and Application workload can be handled on a single core; I/O can be scaled over multiple cores. Pipeline Model: the I/O application disperses packets to other cores; application work is performed on those other cores. Can handle more I/O on fewer cores with vectorization. [Diagram: two processors connected by QPI; physical cores run the Intel® DPDK PMD for Packet I/O, Flow Classification and Apps A/B/C against 10 GbE ports over PCIe, with NUMA pools of Caches, Queue/Rings and Buffers; one variant shows RSS mode hashing packets to application cores.]
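As a rough sketch of the run-to-completion model above (port/queue numbers are placeholders; setup and error handling omitted), each pinned lcore owns one RX queue and does receive, work and transmit itself:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    /* One DPDK lcore: poll its own RX queue, touch each packet, send it out.
     * Port 0 / queue 0 are placeholder values for illustration. */
    static int lcore_main(void *arg)
    {
        struct rte_mbuf *pkts[BURST];
        (void)arg;
        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(0 /*port*/, 0 /*queue*/, pkts, BURST);
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* application work on pkts[i] happens here */
            }
            uint16_t nb_tx = rte_eth_tx_burst(0, 0, pkts, nb_rx);
            while (nb_tx < nb_rx)              /* free what could not be sent */
                rte_pktmbuf_free(pkts[nb_tx++]);
        }
        return 0;
    }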
  16. 16 Why do you need to share data with another thread? So tell me… Why do you need to share data with another thread? It is the Pipeline Model that needs Sharing! – looks like!! Let us go with that for now!!
  17. 17 How can we map our s/w variables to h/w infrastructure?
  18. 18 How can we map our s/w variables to h/w infrastructure?
  19. 19 Individual Memory => For Thread Local Storage? Shared Memory => For Global Data? int shared; Function() { int private; } – see the C sketch below.
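A minimal C illustration of the mapping the slide hints at (variable names are illustrative only): file-scope data is shared by all threads and therefore needs the coherency machinery, while __thread variables and locals live in thread-local / private storage.

    #include <stdint.h>

    uint64_t shared_counter;           /* one copy, visible to every thread:
                                          lives in shared memory, kept coherent
                                          by the cache-coherency hardware      */
    __thread uint64_t local_counter;   /* one copy PER thread (thread local
                                          storage): no sharing, no coherency
                                          traffic between cores                */

    void per_packet_work(void)
    {
        uint64_t scratch = 0;          /* automatic variable: private to the
                                          calling thread's stack               */
        scratch++;
        local_counter += scratch;      /* cheap: stays in this core's cache    */
        __atomic_fetch_add(&shared_counter, scratch, __ATOMIC_RELAXED);
                                       /* costly: this cache line may bounce
                                          between cores                        */
    }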
  20. 20 Quiz Time
  21. 21 What do you wish for? Bigger Shared memory or bigger Individual memory? What about Locality ?
  22. 22 You look at the header once and forward the packet. Right away you sprint to the next packet. So what do you wish for? Bigger which one?
  23. 23 You look at the header once & forward the pkt. Right away you sprint to the next packet – not the same packet. With a fast line rate, you sprint from one packet to another very fast. Temporal Locality in Packet Processing? How are we doing? How much Locality? Smaller Individual Caches with less Locality mean more Individual cache misses, so you often end up going to the far Shared Cache / Memory. So it is as if you don't even have the individual cache, and end up with the slower memory all the time. So what do you wish for? Bigger which one?
  24. L1 Cache, L2 Cache, Last Level Cache – Challenge: What if there is an L1 Cache Miss and an LLC Hit? An LLC hit costs ~40 cycles. With a 40-cycle LLC Hit, how will you achieve the Rx budget of 19 cycles? So what do you wish for? Bigger which one?
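A back-of-envelope check (my numbers, not from the deck): 10 GbE at 64-byte frames is about 14.88 Mpps, i.e. roughly 67 ns per packet, which at ~2 GHz leaves only ~134 cycles for the entire receive-process-transmit chain. If the Rx stage is budgeted at roughly 19 of those cycles, a single 40-cycle LLC hit on the descriptor read already costs about twice the Rx budget – which is why the descriptor needs to be hitting in the core's nearest caches.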
  25. 25 Your Answer ???
  26. L1 Cache With 4 Cycle Latency – Caching Benefits on Read – Excellent!! Right? What? Now What? On an L1 Cache Hit, reading the Packet Descriptor takes 4 cycles, so achieving the Rx budget of 19 cycles is within reach. But what about the first read of the Packet Descriptor, which may cause a miss?
  27. 27 Cache is actually hashing! [Diagram: the "1st Line" of several different memory pages all map onto the same cache line.] The Cache Tag / Directory indicates which one is occupying the cache. What about Locality? Read Packet Descriptor… Read Packet Descriptor…
  28. 28 Cache and Tag! [Same diagram: many memory lines compete for the same cache line.] The Cache Tag / Directory indicates which one is occupying the cache. What about Locality? Read Packet Descriptor… Read Packet Descriptor…
  29. 29 Let us look at Write now
  30. 30 Where will Data be Coming From? Write-Through Vs Write-Back
  31. 31 Where will Data be Coming From? Write-Through Vs Write-Back
  32. 32 Where will Data be Coming From? Write-Through Vs Write-Back
  33. 33
  34. 34 Let us Look at Write-Through First. For P2, Where will Data be Coming From? On Hit? On Miss?
  35. 35 Let us Look at Write-Through First. For P2, Where will Data be Coming From? On Hit? On Miss?
  36. 36 So Writes happen at what speed with a Write-Through Cache? What happens if you write repeatedly?
  37. 37 Let us Look at Write-Back Next. For P2, Where will Data be Coming From? If Hit, From Cache. If Miss, From Where?
  38. 38 At What Speed does the Write Happen in Write-Back? How do we improve with more and more writes – compared to Write-Through?
  39. 39 Let us Look at Write-Back Next. For P2, Where will Data be Coming From? If Hit, From Cache. If Miss, From Where?
  40. 40 Where Else? Cache To Cache… So, it can come from 1) its own cache, or 2) shared memory, or 3) even from ANY of the other Individual Caches (WB). Requesting CPU vs. which CPUs can offer Data: P0 – P1 to Pn; P1 – P0 & [P2 to Pn]; P2 – P0, P1 & [P3 to Pn]; and so on; Pn – [P0 to Pn-1]. Total paths: [N x N]?? Looks like we have the complexity of Message Passing also. Remember Me? You thought there was no movement of data in "shared memory"?
  41. 41 Additional housekeeping “dirty bit” with Write Back
  42. 42 That is for Data Side… What About Control for Coherency?
  43. 43 M- Modified E- Exclusive S – Shared I - Invalid
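For reference, the four MESI states can be summarized as a small annotated enum (purely descriptive, not from the deck):

    /* MESI cache-line states, as tracked per line by the coherency hardware. */
    enum cache_line_state {
        MESI_MODIFIED,   /* only this cache holds the line, and it is dirty:
                            memory is stale, this cache must supply the data    */
        MESI_EXCLUSIVE,  /* only this cache holds the line, still clean:
                            can be written without notifying others (-> Modified) */
        MESI_SHARED,     /* one of possibly several clean, read-only copies     */
        MESI_INVALID     /* this cache's copy may not be used                   */
    };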
  44. 44 https://www.slideshare.net/sumitmittu/aca2-07-new
  45. 45 Write-Through: Memory Speed. Write-Back: Cache Speed. Can we go faster and faster…
  46. L1 Cache With 4 Cycle Latency – Post it! POSTED WRITE!! Write Packet Descriptor. But why should I "wait for 4 cycles" in case of a write?
  47. 47 How is the complexity now? The Posted Write Buffer is a data source too – participating in data sourcing as well as in MESI cache coherency.
  48. 48 Shared Memory – Data Sources From Local Write Buffer From Another Write Buffer From Local Cache From Another Cache From Shared cache From Shared memory
  49. 49 And you thought You will never see me again !
  50. 50 Coming to Packet Processing & Polled Mode Driver…
  51. 51 Shall we see a couple of use cases?
  52. 52 Use Case 1: Producer – Consumer Software Queue. Question: What policy will you design? FIFO? LIFO? Why?
  53. 53 LRU … MRU …. Where Are You?
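One way to read the LIFO/MRU hint: recycle buffers most-recently-freed first, so the buffer handed out is the one whose cache lines are still warm (this mirrors what DPDK's per-lcore mempool cache does). A minimal per-core free-list sketch, names hypothetical:

    #include <stddef.h>

    #define CACHE_SIZE 64

    /* Per-core stack of free buffers: last freed is first reused, so the
     * reused buffer is the one most likely still resident in this core's cache. */
    struct buf_cache {
        void  *objs[CACHE_SIZE];
        size_t len;
    };

    static void buf_put(struct buf_cache *c, void *buf)
    {
        if (c->len < CACHE_SIZE)
            c->objs[c->len++] = buf;   /* push: most recently used on top        */
        /* else: flush a batch to the shared pool (omitted)                      */
    }

    static void *buf_get(struct buf_cache *c)
    {
        if (c->len > 0)
            return c->objs[--c->len];  /* pop the hottest buffer (LIFO / MRU)    */
        return NULL;                   /* refill from the shared pool (omitted)  */
    }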
  54. 54 Few NICs. Many Cores…
  55. 55 Question – Statistics Collection Collective task? or Individual task?
  56. 56 Which Thread Gets Picked up by whom? CPU’s Task Priority Register
  57. 57 Which Thread Gets Picked up by whom? CPU's Task Priority Register. CPU's Task Priority Register.
  58. 58 So Going back to the question So, Collective task? or Individual task?
  59. 59 With Thread Pinning, we avoid Sharing ! Same lcore for same NIC ! No need to Share !!
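Statistics collection, then, works best as an individual (per-lcore) task: each worker updates only its own cache-line-aligned counters and whoever reports the stats sums them up. A sketch (structure and names are illustrative, not from the deck):

    #include <stdint.h>

    #define MAX_LCORES 64

    /* One counter block per lcore, padded to a full 64-byte cache line so two
     * cores never write the same line (no false sharing, no coherency bouncing). */
    struct lcore_stats {
        uint64_t rx_packets;
        uint64_t tx_packets;
        uint64_t dropped;
    } __attribute__((aligned(64)));

    static struct lcore_stats stats[MAX_LCORES];

    /* Fast path: each worker touches only its own entry. */
    static inline void count_rx(unsigned lcore_id, uint64_t n)
    {
        stats[lcore_id].rx_packets += n;
    }

    /* Slow path (control thread): aggregate on demand. */
    static uint64_t total_rx(void)
    {
        uint64_t sum = 0;
        for (unsigned i = 0; i < MAX_LCORES; i++)
            sum += stats[i].rx_packets;
        return sum;
    }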
  60. 60 With Thread Pinning, we avoid Sharing ! If Sharing is not needed, then why put it in memory? Why go through shared Memory? Why? Why not take it directly into Private Cache? Why not bypass shared memory?
  61. 61 Familiar About Bypass Road? Why go through congested inner cities? Why not bypass? Use Bypass Road !!
  62. 62 You say Bypass…. We Say DDIO .. Bypass memory Directly into cache
  63. 63 Do You? Really? With Polling and Thread Pinning, we avoid Sharing !
  64. 64 With RSS … back to the question -- responsibility Collective task? or Individual task?
  65. 65 Well, that is a special case use case - RSS But for RSS, we are good with only Thread Local Storage No need of shared data
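For the RSS case above, the NIC hashes incoming flows across RX queues in hardware, so each polling lcore only ever touches its own queue's packets. A minimal DPDK configuration sketch (constant names vary between DPDK releases – the older spelling is shown; port and queue counts are placeholders):

    #include <rte_ethdev.h>

    /* Ask the NIC to hash incoming packets (RSS) and spread them over the
     * RX queues, one queue per polling lcore. */
    static int configure_rss(uint16_t port_id, uint16_t nb_queues)
    {
        struct rte_eth_conf conf = {
            .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
            .rx_adv_conf.rss_conf = {
                .rss_key = NULL,                           /* default key */
                .rss_hf  = ETH_RSS_IP | ETH_RSS_TCP | ETH_RSS_UDP,
            },
        };
        /* nb_queues RX queues and nb_queues TX queues */
        return rte_eth_dev_configure(port_id, nb_queues, nb_queues, &conf);
    }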
  66. 66 Well, that is a special case use case - RSS Apart from that, we pin 1 core to 1 NIC – so no sharing !! Is that so? Really?
  67. 67 Classification – Cache Coherency Needed or Not?
  68. 68 Depends!! Depends on What? http://www.eetimes.com/document.asp?doc_id=1277622
  69. 69 Depends on Static Classification or Dynamic Classification?
  70. 70 What about Router Table? Is it a shared resource or private – per core resource?
  71. 71 What about the Router Table? Is it a shared resource or a private, per-core resource? Collective or Individual? Is the Router Table one table per system? If so, who are the writers? Who are the readers? How many writers? How many readers? What about a 2-socket or 4-socket system? One table per socket? Coherency between the 2 or 4 tables in a multi-socket system? Collective Responsibility or Individual Responsibility? (See the sketch below.)
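If the routing table is one shared structure with many readers (the data-plane lcores) and few writers (the control plane), one simple scheme is a reader-writer lock around lookups and updates (DPDK also offers rte_rwlock and RCU-style alternatives). A generic pthread sketch, with the table type and its lookup/update helpers hypothetical:

    #include <pthread.h>
    #include <stdint.h>

    struct route_table;                          /* hypothetical route/LPM table */
    uint16_t route_table_lookup(const struct route_table *t, uint32_t dst_ip);
    void     route_table_update(struct route_table *t, uint32_t prefix,
                                uint8_t depth, uint16_t next_hop);

    static pthread_rwlock_t rt_lock = PTHREAD_RWLOCK_INITIALIZER;
    static struct route_table *rt;

    /* Many readers (packet-processing lcores) may look up concurrently. */
    uint16_t lookup_next_hop(uint32_t dst_ip)
    {
        pthread_rwlock_rdlock(&rt_lock);
        uint16_t nh = route_table_lookup(rt, dst_ip);
        pthread_rwlock_unlock(&rt_lock);
        return nh;
    }

    /* The control-plane writer takes the write lock, excluding all readers. */
    void add_route(uint32_t prefix, uint8_t depth, uint16_t next_hop)
    {
        pthread_rwlock_wrlock(&rt_lock);
        route_table_update(rt, prefix, depth, next_hop);
        pthread_rwlock_unlock(&rt_lock);
    }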
  72. 72 Multiple Writers – What will benefit? Write back Cache? or Write Through Cache?
  73. 73 What if you keep it "Dirty" and a DMA controller sneaks in?
  74. 74 Before we get too far…
  75. 75 In the case of Siblings, does each have a private cache of its own? With Siblings, how does Thread Local Storage get mapped?
  76. 76 How do Siblings Share Caches – say, L1 and L2 ?