Cisco CRS-1


Speaker note: A single stage with VOQ approximates an output-buffered fabric. An output-buffered switch only buffers at the output, so it has minimal blocking impact, and can better schedule service when the queues are at the output.

    1. CRS-1 Overview. TAU, Mar 07. Rami Zemach.
    2. Agenda
       - Cisco's high-end router CRS-1
         - CRS-1's NP, Metro (SPP)
         - CRS-1's fabric
         - CRS-1's line card
       - Future directions
    3. What drove the CRS?
       - OC-768
       - Multi-chassis
       - Improved BW/Watt and BW/space
       - New OS (IOS-XR)
       - Scalable control plane
    4. Multiple router flavours: a sample taxonomy
       - Core
         - OC-12 (622 Mbps) and up (to OC-768, ~40 Gbps)
         - Big, fat, fast, expensive
         - E.g. Cisco HFR, Juniper T-640
           - HFR: 1.2 Tbps each, interconnect up to 72 giving 92 Tbps, starting at $450k
       - Transit/peering-facing
         - OC-3 and up, good GigE density
         - ACLs, full-on BGP, uRPF, accounting
       - Customer-facing
         - FR/ATM/…
         - Feature set as above, plus fancy queues, etc.
       - Broadband aggregator
         - High scalability: sessions, ports, reconnections
         - Feature set as above
       - Customer-premises (CPE)
         - 100 Mbps
         - NAT, DHCP, firewall, wireless, VoIP, …
         - Low cost, low-end, perhaps just software on a PC
    5. Routers are pushed to the edge
       - Over time, routers are pushed to the edge as:
         - BW requirements grow
         - The number of interfaces scales
       - Different routers have different offerings:
         - Interface types (the core is mostly Ethernet)
         - Features; sometimes the same feature is implemented differently
         - User interface
         - Redundancy models
         - Operating system
       - Customers look for:
         - Investment protection
         - Stable network topology
         - Feature parity
       - Transparent scale
    6. What does scaling mean?
       - Interfaces (BW, number, variance)
       - BW
       - Packet rate
       - Features (e.g. support link BW in a flexible manner)
       - More routes
       - Wider ecosystem
       - Effective management (e.g. capability to support more BGP peers and more events)
       - Fast control (e.g. distributing routing information)
       - Availability
       - Serviceability
       - Scaling is both up and down (logical routers)
    7. Low BW, feature-rich: centralized architecture
       [Diagram: line interfaces with MACs on a shared bus; a central CPU with route table, off-chip buffer, and buffer memory. Typically <0.5 Gb/s aggregate capacity.]
    8. High BW: distributed architecture
       [Diagram: line cards, each with MAC, local buffer memory, CPU, and forwarding table, connected by a switched "crossbar" backplane; a CPU card holds the routing table. Typically <50 Gb/s aggregate capacity.]
    9. Distributed architecture challenges (examples)
       - HW-wise
         - Switching fabric
         - High BW switching
         - QoS
         - Traffic loss
         - Speedup
       - Data plane (SW)
         - High BW / packet rate
         - Limited resources (CPU, memory)
       - Control plane (SW)
         - High event rate
         - Routing information distribution (e.g. forwarding tables)
    10. CRS-1 system view
       - Fabric shelves: contain fabric cards and system controllers
       - Line card shelves: contain route processors, line cards, and shelf controllers
       - NMS (full system view)
       - Out-of-band GE control bus to all shelf controllers (100 m)
    11. CRS-1 system architecture
       - Forwarding plane: up to 1152 x 40G, with 40G throughput per LC
       - Multistage switch fabric: 1296 x 1296 non-blocking buffered fabric in the fabric chassis (three stages, S1/S2/S3); the roots of the fabric architecture are in Jon Turner's early work
       - Distributed control plane: control SW distributed across multiple control processors (route processors)
       - Line card: modular service card (Cisco SPP, 8K queues) plus interface module, joined at the mid-plane
    12. Switch fabric challenges
       - Scale: many ports
       - Speed
       - Distributed arbitration
       - Minimum disruption to the QoS model
       - Minimum blocking
       - Balancing
       - Redundancy
    13. Previous solution: GSR, a cell-based crossbar with centralized scheduling
       - Each LC has variable-width links to and from the crossbar, depending on its bandwidth requirement
       - Central scheduling is iSLIP-based
         - Two request-grant-accept rounds
         - Each arbitration round lasts one cell time
       - Per-destination-LC virtual output queues
       - Supports
         - High/low priority
         - Unicast/multicast
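The request-grant-accept mechanics above can be sketched compactly. This is a minimal, illustrative single-iteration iSLIP round (the GSR ran two, and also handled priorities and multicast, omitted here); all names are assumptions for the sketch:

```python
def islip_round(requests, grant_ptr, accept_ptr, n):
    """One request-grant-accept iteration over an n-port crossbar.

    requests[i] is the set of output ports input i has queued cells for
    (i.e., its non-empty virtual output queues).
    """
    grants = {}
    for out in range(n):
        # Grant phase: each output grants the requesting input at or
        # after its round-robin grant pointer.
        reqs = [i for i in range(n) if out in requests[i]]
        if reqs:
            grants[out] = min(reqs, key=lambda i: (i - grant_ptr[out]) % n)
    accepts = {}
    for inp in range(n):
        # Accept phase: each input accepts the granting output at or
        # after its round-robin accept pointer.
        offers = [o for o in grants if grants[o] == inp]
        if offers:
            out = min(offers, key=lambda o: (o - accept_ptr[inp]) % n)
            accepts[inp] = out
            # iSLIP's key rule: pointers advance only past accepted grants,
            # which is what desynchronizes the outputs over time.
            grant_ptr[out] = (inp + 1) % n
            accept_ptr[inp] = (out + 1) % n
    return accepts
```

A cell time corresponds to one call (per iteration); the returned dict is the input-to-output matching used to configure the crossbar for that cell slot.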
    14. CRS cell-based multi-stage Benes
       - Multiple paths to a destination
       - For a given input-output port pair, the number of paths equals the number of center-stage elements
       - Distribution between the S1 and S2 stages; routing at S2 and S3
       - Cell routing
    15. Fabric speedup
       - The Q-fabric tries to approximate an output-buffered switch
         - To minimize sub-port blocking
         - Buffering at the output allows better scheduling
       - In single-stage fabrics, a 2x speedup very closely approximates an output-buffered fabric*
       - For multi-stage fabrics, the speedup factor needed to approximate output-buffered behavior is not known
         - The CRS-1 fabric has ~5x speedup
         - Constrained by available technology
       * Balaji Prabhakar and Nick McKeown, Computer Systems Technical Report CSL-TR-97-738, November 1997.
    16. Fabric flow control overview
       - Discard: time constant in the tens of ms range
         - Originates from 'from fab' and is directed at 'to fab'
         - Very fine granularity: discard down to the level of individual destination raw queues
       - Back pressure: time constant in the tens of µs range
         - Originates from the fabric and is directed at 'to fab'
         - Operates per priority at increasingly coarse granularity:
           - Fabric destination (one of 4608)
           - Fabric group (one of 48 in phase one, 96 in phase two)
           - Fabric (stop all traffic into the fabric, per priority)
    17. Reassembly window
       - Cells transiting the fabric take different paths between the Sprayer and the Sponge
       - Cells for the same packet will arrive out of order
       - The reassembly window for a given source is defined as the worst-case differential delay two cells from a packet encounter as they traverse the fabric
       - The fabric limits the reassembly window
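The Sponge's job, restoring cell order per source, can be sketched as a sequence-numbered reorder buffer. This is an illustrative minimal version (names and the sequence-number scheme are assumptions); the point of the bounded reassembly window is that `pending` can never grow beyond the worst-case differential delay the fabric allows:

```python
class ReassemblyWindow:
    """Per-source reorder buffer: release cells in sequence order."""

    def __init__(self):
        self.next_seq = 0   # next in-order sequence number expected
        self.pending = {}   # seq -> cell, held until deliverable in order

    def receive(self, seq, cell):
        """Accept one (possibly out-of-order) cell; return every cell
        that is now deliverable in order."""
        self.pending[seq] = cell
        released = []
        # Drain the head of the window as long as it is contiguous.
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

Because the fabric bounds the reassembly window, hardware can size this buffer statically: a cell whose sequence number is further ahead than the window allows can never arrive.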
    18. Line card challenges
       - Power
       - COGS
       - Multiple interfaces
       - Intermediate buffering
       - Speedup
       - CPU subsystem
    19. Cisco CRS-1 line card
       [Diagram: modular services card and PLIM joined at the midplane. Egress packet flow (steps 1-8) through the OC-192 framers and optics, RX/TX Metro, ingress and egress queuing, from-fabric interface module ASIC, CPU, and Squid GW.]
    20. Cisco CRS-1 line card
       [Same diagram with the physical board layout labeled: line card CPU, egress and ingress Metro, ingress and egress queuing, power regulators, fabric serdes, from-fabric ASIC.]
    21. Cisco CRS-1 line card
       [Board photo, labeled: egress Metro, ingress Metro, line card CPU, ingress queuing, power regulators, fabric serdes, from-fabric and egress queuing ASICs.]
    22. Cisco CRS-1 line card
       [Board photo with the ingress Metro highlighted.]
    23. Metro subsystem
    24. Metro subsystem
       - What is it?
         - A massively parallel NP
         - Codename Metro
         - Marketing name SPP (Silicon Packet Processor)
       - What were the goals?
         - Programmability
         - Scalability
       - Who designed and programmed it?
         - Cisco internal (Israel/San Jose)
         - IBM and Tensilica as partners
    25. Metro subsystem
       - Metro
         - 2500 balls
         - 250 MHz, 35 W
       - TCAM
         - 125 MSPS
         - 128K x 144-bit entries
         - 2 channels
       - FCRAM
         - 166 MHz DDR
         - 9 channels
         - Lookups and table memory
       - QDR2 SRAM
         - 250 MHz DDR
         - 5 channels
         - Policing state, classification results, queue-length state
    26. Metro top level
       - Packet in / packet out: 96 Gb/s BW each
       - 18 mm x 18 mm, IBM 0.13 µm
       - 18M gates
       - 8 Mbit of SRAM and RAs
       - Control processor interface: proprietary, 2 Gb/s
    27. Gee-whiz numbers
       - 188 32-bit embedded RISC cores
       - ~50 BIPS
       - 175 Gb/s memory BW
       - 78 MPPS peak performance
    28. Why programmability? Simple forwarding is not so simple
       - Example features: IPv4/IPv6 unicast, IPv4 multicast, MPLS (3 labels), link bundling (v4/v6), L3 load balancing (v4/v6), policer check, marking, TE/FRR, sampled NetFlow, WRED, ACL, per-prefix accounting, GRE/L2TPv3 tunneling, RPF check (loose/strict, v4), congestion control
       - Lookup chain: lookup algorithm (SRAM/DRAM) to leaf, to L3 load-balance entry, to L2 adjacency; policy-based routing via a TCAM table (TCAM associative lookup, SRAM/DRAM 1:1 data)
       - Scale: millions of routes, hundreds of load-balancing entries per route, 100k+ adjacencies, pointers to statistics counters
       - Increasing pressure to add 1-2 levels of extra indirection for high availability and higher update rates
       - Programmability also means:
         - Ability to juggle feature ordering
         - Support for heterogeneous mixes of feature chains
         - Rapid introduction of new features (feature velocity)
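The lookup chain on this slide (prefix lookup, leaf, load-balance entry, adjacency) can be made concrete with a toy table walk. Every structure and field name below is an assumption for illustration, not the CRS-1's actual layout:

```python
# Toy forwarding tables: each arrow in the chain is one level of indirection.
routes = {"10.0.0.0/8": "leaf1"}                 # prefix lookup result -> leaf
leaves = {"leaf1": ["lb0", "lb1"]}               # leaf -> load-balance entries
lb_entries = {"lb0": "adj_a", "lb1": "adj_b"}    # entry -> L2 adjacency
adjacencies = {
    "adj_a": {"l2": "00:11:22:33:44:55", "port": 3},
    "adj_b": {"l2": "66:77:88:99:aa:bb", "port": 7},
}

def forward(prefix, flow_hash):
    """Resolve a matched prefix to rewrite info for one flow."""
    entries = leaves[routes[prefix]]
    # Hash the flow onto one load-balance entry (keeps a flow on one path),
    # then follow the final indirection to the L2 rewrite and output port.
    adj = lb_entries[entries[flow_hash % len(entries)]]
    return adjacencies[adj]
```

The slide's point about indirection shows up here directly: swapping an adjacency (for high availability, or a fast update) only rewrites one small table entry, at the cost of an extra dependent memory read per packet.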
    29. Metro architecture basics: packet distribution
       [Diagram: 188 PPEs, on-chip packet buffer, and resources attached over a resource fabric; 96 Gb/s in and out.]
       - ~100 bytes of packet context sent to the PPEs; packet tails stored on-chip
       - Run-to-completion (RTC): a simple SW model, efficient heterogeneous feature processing
       - RTC and non-flow-based packet distribution make the architecture scalable
       - Costs:
         - High instruction-BW supply
         - Need read-modify-write and flow-ordering solutions
    30. Metro architecture basics: packet gather
       - Gathering packets involves:
         - Assembly of final packets (at 100 Gb/s)
         - Packet ordering after variable-length processing
         - Gathering without new packet distribution
    31. Metro architecture basics: resources
       - The packet buffer is accessible as a resource
       - The resource fabric is a set of parallel, wide, multi-drop buses
       - Resources consist of:
         - Memories
         - Read-modify-write operations
         - Performance-heavy mechanisms
    32. Metro resources
       [Diagram: statistics (512k), TCAM interface, tables, policing (100k+), lookup engine (2M prefixes), table DRAM (10's of MB), queue-depth state.]
       - The lookup engine uses the Tree Bitmap algorithm, over FCRAM and on-chip memory
         - High update rates
         - Configurable performance vs. density
       - "Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates", Will Eatherton et al., CCR April 2004 (vol. 34, no. 2), pp. 97-123.
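Tree Bitmap compresses a multibit trie into bitmap-coded nodes; a faithful implementation is long, but the operation it accelerates, longest-prefix match over a trie, fits in a few lines. This is an uncompressed binary version, purely illustrative of what the lookup engine computes:

```python
class TrieNode:
    """One node of a binary trie for longest-prefix match."""
    __slots__ = ("children", "nexthop")

    def __init__(self):
        self.children = [None, None]  # child per bit value
        self.nexthop = None           # set if a prefix ends here

def insert(root, prefix_bits, nexthop):
    """Install a route: walk/extend the trie along the prefix bits."""
    node = root
    for b in prefix_bits:
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.nexthop = nexthop

def lookup(root, addr_bits):
    """Walk the address bits, remembering the last (longest) prefix seen."""
    node, best = root, None
    for b in addr_bits:
        if node.nexthop is not None:
            best = node.nexthop
        node = node.children[b]
        if node is None:
            return best
    return node.nexthop if node.nexthop is not None else best
```

Tree Bitmap's contribution is packing many of these one-bit steps into a single multibit node encoded by two bitmaps, so each step costs one memory access of the wide FCRAM word rather than one access per bit, and updates stay incremental.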
    33. Packet Processing Element (PPE)
       - 16 PPE clusters, each a cluster of 12 PPEs
       - 0.5 mm² per PPE
    34. Packet Processing Element (PPE)
       - Tensilica Xtensa core with Cisco enhancements
         - 32-bit, 5-stage pipeline
         - Code density: 16/24-bit instructions
       - Small instruction cache and data memory
       - Cisco DMA engine: allows 3 outstanding descriptor DMAs
       - Tens of KB of fast instruction memory
       [Diagram: per-PPE 32-bit RISC core with I-cache, data memory, scratch pad, memory-mapped registers, distribution and packet headers, and the Cisco DMA engine on the instruction bus; cluster instruction memory fed from global instruction memory; a cluster data mux unit connecting the 12 PPEs to packet distribution, packet gather, and the resources.]
    35. Programming model and efficiency
       - Metro programming model
         - Run-to-completion programming model
         - Queued descriptor interface to resources
         - Industry-leveraged tool flow
       - Efficiency data points
         - 1 microcoder for 6 months: IPv4 with common features (ACL, PBR, QoS, etc.)
         - The CRS-1's initial shipping datapath code was done by ~3 people
    36. Challenges
       - Constant power battle
         - Memory and I/O
       - Die-size allocation
         - PPEs vs. HW acceleration
       - Scalability
         - On-chip BW vs. off-chip capacity
         - Procket's NPU reached 100 MPPS but with limited scaling
       - Performance
    37. Future directions
       - POP convergence
       - Edge and core differences blur
       - Smartness in the network
       - More services integrated into the routing platforms
       - The feature sets needing acceleration are expanding
       - Must leverage feature code across platforms/markets
       - Scalability (number of processors, amount of memory, BW)
    38. Summary
       - The router business is diverse
       - Network growth pushes routers to the edge
       - Customers expect scale on the one hand … and a smart network on the other
       - Routers are becoming massively parallel processing machines
    39. Questions? Thank you.
    41. CRS-1 positioning
       - Core router (overall BW, interface types)
         - 1.2 Tbps, OC-768c interfaces
       - Distributed architecture
       - Scalability/performance
         - Scalable control plane
       - High availability
       - Logical routers
       - Multi-chassis support
    42. Network planes
       - Networks are considered to have three planes / operating timescales:
         - Data: packet forwarding [µs, ns]
         - Control: flows/connections [ms, secs]
         - Management: aggregates, networks [secs, hours]
       - Plane coupling is in descending order (control-data more, management-control less)
    43. Exact matches in Ethernet switches: trees and tries
       - Binary search tree over N entries: lookup time is log2 N, dependent on table size; storage is O(N)
       - Binary search trie (branching on address bits, e.g. entries 010 and 111): lookup time is bounded and independent of table size, though it depends on address length W; storage is O(NW)
    44. Exact matches in Ethernet switches: multiway tries
       [Diagram: a 16-ary search trie, with nodes holding (nibble, ptr) slots; ptr = 0 means no children. Example lookups of 000011110000 and 111111111111 through the levels.]
       - Q: Why can't we just make it a 2^48-ary trie?
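The slide's closing question has a quick arithmetic answer: a single 2^48-ary node is a direct-indexed table over every possible 48-bit MAC address, and the memory cost is absurd regardless of how few addresses the switch actually learns. A back-of-the-envelope estimate (8 bytes per slot is an assumption):

```python
entries = 2 ** 48            # one slot per possible 48-bit MAC address
bytes_per_entry = 8          # assume a 64-bit result/pointer per slot
total_bytes = entries * bytes_per_entry
tib = total_bytes // 2 ** 40
# 2048 TiB (2 PiB) for the single node, almost all of it empty --
# versus storage proportional to the learned addresses for a hash
# table or a multiway trie with modest fan-out.
```

Intermediate fan-outs (like the 16-ary trie on the slide) trade a few extra memory accesses per lookup for a table whose size tracks the actual number of entries.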