GLORIAD's New Measurement and Monitoring System


  1. GLORIAD's New Measurement and Monitoring System for Addressing Individual Customer-based Performance across a Global Network Fabric. APAN Meeting, January 16, 2013. Greg Cole, Principal Investigator, GLORIAD (gcole@gloriad.org)
  2. Presentation: During the past year, GLORIAD has been working on a new system for measuring and monitoring global network infrastructure focused less on "links" and more on addressing the needs of individual users. To accomplish its goal of actively improving global infrastructure for individual customers, the new system is designed to: (1) understand the network needs and requirements of a global customer base by actively studying utilization; (2) identify poor performance of individual applications by constantly (and in near-real-time) analyzing information on such per-flow metrics as load, packet loss, jitter and routing asymmetries; (3) mitigate poor performance of applications by identifying fabric weaknesses; and (4) build richly visual analysis applications such as GLORIAD-Earth and the new GloTOP to help make sense of the enormous volume of data. To realize this new model of measurement and monitoring (focused less on links and more on individual customers), GLORIAD has recently moved from its netflow-based system (used since 1998 and storing approximately 1 million records per day) to a new, much more detailed system – collecting, storing and analyzing 200-400 million network utilization records per day – based on deployment of the open-source Argus software (www.qosient.com/argus). The talk will focus on the benefits and the technical challenges of this new and actively evolving work.
  3. International infrastructure (circuits): "No GLIF, no GLORIAD." Partners: SURFnet, NORDUnet, CSTnet (China), e-ARENA (Russia), KISTI (Korea), CANARIE (Canada), SingaREN, ENSTInet (Egypt), Tata Institute of Fundamental Research / Bangalore science community, NLR/Internet2/NASA/FedNets, CERN/LHC. Sponsors: US NSF ($18.5M, 1998-2015), Tata ($6M), USAID ($3.5M, 2011-2013), all international partners (~$240M, 1998-2015). History: 1994 US-Russia Friends and Partners; 1996 US-Russia Civic Networking; 1997 US-Russia MIRnet; 2004 GLORIAD; 2009 GLORIAD/Taj; 2011 GLORIAD/Africa.
  4. Thank you, GLORIAD-US team (Anita, Harika, Karen, Kim, Naveen, Predrag, Susie).
  5. GLORIAD Metrics: Utilization, Performance, Operations, Security. ("You can't manage [or improve] what you can't measure" – quoting a wise NSF program official.) ("It's all about 'situational awareness' and instrumenting towards that goal.")
  6. Utilization Monitoring (in (near) real-time): "Top Talkers"; protocol utilization; application identification/utilization; traffic analysis (DNS, etc.); real-time alerts, etc.; historical timeline analysis.
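The "Top Talkers" computation above amounts to an aggregation over flow records. A minimal sketch in Python; the tuple layout is a simplified stand-in for collector output, not Argus's actual record format:

```python
from collections import defaultdict

def top_talkers(flows, n=5):
    """Return the n heaviest source addresses by total bytes sent.

    `flows` is an iterable of (src, dst, nbytes) tuples -- a simplified,
    hypothetical stand-in for per-flow records from a flow collector.
    """
    totals = defaultdict(int)
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    # Largest senders first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

flows = [
    ("10.0.0.1", "10.0.0.9", 5000),
    ("10.0.0.2", "10.0.0.9", 20000),
    ("10.0.0.1", "10.0.0.3", 7000),
]
print(top_talkers(flows, n=2))  # [('10.0.0.2', 20000), ('10.0.0.1', 12000)]
```

The same grouping generalizes to protocols or applications by changing the aggregation key.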
  7. Utilization Monitoring. [Charts: "GLORIAD Utilization, 2004-2012, by Country Source of Traffic" in gigabytes (United States, China, Korea (South), Canada, Russian Federation, Switzerland, Germany, Singapore, Hong Kong, Great Britain (UK), Norway, Taiwan, France, Italy, Brazil, Netherlands, Romania, Sweden, Poland, Other); and "GLORIAD/MirNET Traffic, 1999-2003," % of traffic monthly (Russian Federation, United States, Germany, Switzerland, Taiwan, Israel, Great Britain (UK), Poland, Netherlands, France, Japan, Sweden, Other).]
  8. Utilization Monitoring (in (near) real-time). Thank you, CSTnet and KISTI (special thanks to Tong, Haina, Chunjing, Jiangning, Xiaodan, Gang, Lei, Hui, Dongkyun, Buseung).
  9. Utilization Monitoring (in (near) real-time)
  10. Utilization Monitoring (in (near) real-time)
  11. Utilization Monitoring (in (near) real-time)
  12. Utilization Monitoring (in (near) real-time)
  13. Utilization Monitoring (in (near) real-time)
  14. Utilization Monitoring (in (near) real-time)
  15. Performance Monitoring (in (near) real-time). Key theme: we want to address real performance needs *before* users have to figure out who in the world to call about their "bad connection", or before they decide that the "R&E Internet" is not adequate to their needs; i.e., proactive performance mitigation (instead of reactive). Another theme: we want to develop tools, technologies and experience that can be used throughout the global network fabric (local, campus, regional, national, international); the real "home" for these tools will ultimately be the local network operators who live closest to the customers.
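Proactive detection of this kind reduces to scanning flow records for large flows whose loss indicators cross a threshold. A sketch of the idea in Python; the field names and thresholds are illustrative assumptions, not GLORIAD's actual rules:

```python
def flag_poor_performers(flows, retrans_threshold=0.02, min_bytes=1_000_000):
    """Flag large flows whose retransmit ratio exceeds a threshold.

    Each flow is a dict with hypothetical fields 'src', 'dst', 'bytes',
    'retrans_bytes'. Both thresholds are illustrative defaults only.
    """
    flagged = []
    for f in flows:
        if f["bytes"] < min_bytes:
            continue  # small flows tell us little about path quality
        ratio = f["retrans_bytes"] / f["bytes"]
        if ratio > retrans_threshold:
            flagged.append((f["src"], f["dst"], ratio))
    return flagged

sample = [
    {"src": "a", "dst": "b", "bytes": 5_000_000, "retrans_bytes": 400_000},  # 8% retrans
    {"src": "c", "dst": "d", "bytes": 5_000_000, "retrans_bytes": 10_000},   # 0.2% retrans
    {"src": "e", "dst": "f", "bytes": 10_000,    "retrans_bytes": 5_000},    # too small
]
print(flag_poor_performers(sample))  # only the a->b flow is flagged
```

In a live system a check like this would run continuously against the near-real-time flow stream, feeding alerts rather than a batch report.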
  16. New GloTOP Application
  17. New dvNOC Application
  18. dvNOC System: joint effort by the US, China, Korea and Nordic teams (and, now, new GLORIAD/Taj partners). Based on solid measurement infrastructure, information management and information sharing. Fueled by the open-source Argus system of flow monitoring (5-second updates on all flows, 200-400 million flow records/day; handles multi-gigabit flow rates with room to spare). Focused on (1) understanding utilization, (2) improving performance systemically, (3) ensuring appropriate use, and (4) distributing (decentralizing) operations and management of R&E networks.
  19. Former Metrics Data Source
  20. "Taj" Measurement/Monitoring Update. Picture of GLORIAD/Taj's new "nprobe" network measurement device. Hardware: Dell PowerEdge R410 server, 8-core Intel processor, 10GE Intel fiber card (ixgbe driver). A network utilization and performance measurement box at 10G line speed, designed to improve and extend the open-source nprobe netflow emitter software to emit extended netflow records, including detailed information on packet retransmissions. Software base: Luca Deri's nprobe. The two screenshots above illustrate data generated from the Taj project's new "nprobe" boxes deployed in Chicago and Seattle: the first illustrates top flows on the network; the second illustrates large flows suffering from poor performance (i.e., high packet retransmits). This data was formerly generated from GLORIAD's Packeteer system (limited to 1 Gbps circuit capacity). 2012: transition to Argus (http://www.qosient.com/argus/). We use Luca Deri's pf_ring underneath (and we're also exploring FreeBSD's netmap).
  21. Argus: flexible open-source software packet sensors that generate network flow records at line rate, for operations, performance and security. Comprehensive (not statistical) and bi-directional, with many flow models allowing you to track any network traffic, not just 5-tuple IP traffic. Support for large-scale collection, data processing, storage and archiving, sharing, and visualization, with analytics, aggregation, and geospatial and netspatial analysis.
  22. Argus (author: Carter Bullard)
  23. Current GLORIAD-US Deployment of Argus. Knoxville radium server: Apple Xserve - (1) processors: 2 x 2.93GHz quad-core Intel Xeon; (2) memory: 24GB (6x4GB); (3) hard drive: 1TB Serial ATA; (4) OS: 10.8; Argus analysis tools (running on various (mostly Apple) machines). Seattle Argus node: Dell R410 server - (1) processors: 2 x Intel Xeon X55670, 2.93GHz (quad core); (2) memory: 8GB (4x2GB) UDIMMs; (3) hard drive: 500GB SAS; (4) Intel 82599EB 10G NIC; (5) OS: CentOS 6; (6) modified for PF_RING; (7) running the argus daemon, sending data to the radium server in Knoxville; fed from a 10G SPAN port on the Seattle Force-10 router. Chicago Argus node: same configuration, fed from a 10G SPAN port on the Chicago Force-10 router, likewise sending data to the radium server in Knoxville.
  24. Near-future GLORIAD-US Deployment of Argus. Seattle and Chicago Argus nodes: same Dell R410 configuration as above, but with the 10G SPAN ports replaced by taps, plus local storage, local analysis hardware and the ability to handle much more capacity. Knoxville radium server: Apple Xserve as above, augmented with a big farm of Cisco-provided blade servers for fast analysis and a parallel database architecture.
  25. Why all this power? • Preparing the data for this graph from a 250G Argus archive (which helped a large international R&E network systemically address a huge performance problem) took me 3 days with our current setup • We want any of our partners to be able to do this in 3 minutes (or less) • We want "room" to better research the area of performance, operations and security analytics with our international partners
  26. But we're still designing for lesser needs as well (targeting single 1G and 10G networks): Linux, Mac OS X, FreeBSD
  27. Current Process: the Chicago and Seattle Argus nodes each send a 3 Mbps stream to the Knoxville radium server (chained radium servers). A racluster process (10 seconds) feeds a MySQL archive of top users, which drives the live apps (GLO-Earth, GloTOP, dvNOC). A rastream process (5 minutes) feeds a disk archive (~300 million records/day) and a MySQL archive (1.8 million records/day), which feed the analysis applications, plus additional "ad hoc" processing.
  28. New Process
  29. New Process (Dec/2012-Jan/2013): Argus nodes (for GLORIAD, currently Chicago and Seattle) send Argus data to a core radium collector on a 32-core Cisco blade server (FreeBSD) with 128G RAM and 5T RAID storage. A "farm" of Perl/POE/IKC daemons performs near-real-time analytics and local storage of data: "top users", DNS analysis, bad performers, link analytics, BGP analysis, ICMP analysis, scan analysis, etc. On top sit user tools for analysis, operational support and visualization: dvNOC, GloTOP, GLOEarth, ticketing system, NOC access.
  30. More detail: the "farm" of Perl/POE/IKC daemons provides near-real-time analytics and local storage of data ("top users", DNS analysis, bad performers, link analytics, BGP analysis, ICMP analysis, scan analysis, etc.). User tools for analysis and visualization (dvNOC, GloTOP, GLOEarth, web reports, NOC access) are built with Runrev LiveCode: multi-platform (Mac, Windows, Linux, iOS, Android), event-driven, graphic/media-rich applications. On the back end: the Perl POE event loop, event-driven programming for "cooperative multi-tasking"; IKC for inter-kernel communications between "animals"; daemonized (fast); MySQL (or any other database) for long-term storage, and SQLite for a local (fast) in-memory database. Each "animal" on the "farm" is autonomous and very specialized; most read from a single Argus RABINS stream.
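The "farm" pattern above is cooperative multitasking: many specialized consumers of one fanned-out record stream. The deck's implementation is Perl/POE/IKC; as a rough, analogous sketch only, the same shape in Python's asyncio (the record strings and analysis are placeholders):

```python
import asyncio

async def animal(name, queue):
    """One autonomous, specialized 'animal': consumes records from its
    queue and does its own analysis (here, it merely counts them)."""
    count = 0
    while True:
        rec = await queue.get()
        if rec is None:        # sentinel: stream closed
            break
        count += 1
    return name, count

async def farm():
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    # Fan the same stream out to every daemon, the way the collector
    # feeds each analysis engine its own copy of the flow stream.
    for rec in ["flow1", "flow2", "flow3"]:
        for q in (q1, q2):
            q.put_nowait(rec)
    for q in (q1, q2):
        q.put_nowait(None)
    # Both 'animals' run concurrently on one event loop.
    return await asyncio.gather(animal("top-users", q1), animal("dns", q2))

results = asyncio.run(farm())
print(results)  # [('top-users', 3), ('dns', 3)]
```

POE's event loop and IKC message passing play the roles of the event loop and queues here; the point is that each analyzer stays independent while sharing one input stream.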
  31. All of the software, tools, data specifications, etc. are being put on GitHub (it's the right thing to do, since Argus, Perl, MySQL and SQLite are all open, and we want people to help us).
  32. GLORIAD github
  33. "Operationalizing" this Data
  34. New dvNOC Application
  35. "Request Tracker" fed by data from monitoring systems
  36. Poor-Performance Analysis
  37. Chicago: source x.x.3.226, destination x.x.244.210. Active monitoring system - My TraceRoute (MTR). Harika Tandra, GLORIAD
  38. Active monitoring system • For each under-performing flow identified, MTR runs are triggered to the source and destination IPs • Triggered in near-real-time to the detected flow; thus, test packets are sent under network conditions similar to those seen by the real traffic • Combining the two gives approximate end-to-end performance. Harika Tandra, GLORIAD
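The triggering step above can be sketched as building `mtr` report-mode invocations toward both endpoints of each flagged flow. The `--report` and `--report-cycles` options are standard mtr flags; the flow-tuple input and the idea of returning command strings (rather than GLORIAD's actual scheduling code) are illustrative assumptions:

```python
import shlex

def mtr_commands(flagged_flows, cycles=10):
    """Build `mtr` report-mode command lines toward both endpoints of each
    under-performing flow. Probing toward source and destination and
    combining the two approximates the end-to-end path.

    `flagged_flows` is a list of (src_ip, dst_ip) tuples -- in a live
    system these would come from the flow analyzer, and the commands
    would be launched asynchronously as flows are detected.
    """
    cmds = []
    for src, dst in flagged_flows:
        for target in (src, dst):
            cmds.append(f"mtr --report --report-cycles {cycles} {shlex.quote(target)}")
    return cmds

print(mtr_commands([("192.0.2.10", "198.51.100.7")]))
```

Running the probes immediately after detection is what keeps the measured conditions close to those the real traffic experienced.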
  39. Example network graphs for a few end hosts in the U.S. Harika Tandra, GLORIAD
  40. Example network graphs for a few end hosts in China. Representation: • Graph node: a router in the paths discovered by MTR. • Rectangular node: the end host. • Node label: 1st line, value of the cost function; 2nd line, IP (anonymized); 3rd line, average % packet loss at the node. • The color map ranges from yellow through orange to red; this graph is color-mapped on the average % packet loss value. • Edge labels: 'A-B', where A = total number of MTR runs through the parent to the child node, and B = number of runs with non-zero packet loss. • Gray nodes saw no packet loss. Harika Tandra, GLORIAD
  41. Data Model: We use MySQL (partly for the benefit of its fast MyISAM heap tables), but also now SQLite for the local autonomous analysis engines (very fast, especially with :memory: tables), and BerkeleyDB for some things (tying Perl hashes to disk data stores). Two large databases: pflow - primary IP addresses, AS numbers, domains, all large flows (1998-current, ~1.4 billion records), and support tables (IP mapping tables, country codes, world regions, science disciplines, protocols, services, etc.); summary - various tables to enable fast search/retrieval of flow information. Experimenting with the Argus rasql tools (powerful). Using Argus ralabel (with GeoIP) for live labeling of all flow updates with country codes, AS numbers, lat/long, etc. Looking at Hadoop and others for parallel capabilities.
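The SQLite :memory: tables mentioned above are what make the local analysis engines fast: each engine keeps its working set in an in-process, in-memory database and queries it with SQL. A self-contained illustration using Python's built-in sqlite3 module (the schema and data are made up for the example):

```python
import sqlite3

# An in-memory SQLite database, as used by the autonomous analysis
# engines; the schema here is illustrative, not GLORIAD's actual one.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flows (src TEXT, dst TEXT, bytes INTEGER)")
con.executemany("INSERT INTO flows VALUES (?, ?, ?)", [
    ("10.0.0.1", "10.0.0.9", 5000),
    ("10.0.0.2", "10.0.0.9", 20000),
    ("10.0.0.1", "10.0.0.3", 7000),
])

# Fast ad hoc SQL over the in-memory working set: heaviest sender.
row = con.execute(
    "SELECT src, SUM(bytes) AS total FROM flows "
    "GROUP BY src ORDER BY total DESC LIMIT 1").fetchone()
print(row)  # ('10.0.0.2', 20000)
```

Because each "animal" owns its own :memory: database, there is no contention with the long-term MySQL archive; results worth keeping get written out separately.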
  42. Key Database Management Tables
  43. Summary: Core Technologies. Argus as passive monitor (formerly Packeteer and then nprobe), running on top of pf_ring (or FreeBSD's netmap). MTR as active monitor. MySQL as the underlying database (exploring alternatives now), along with SQLite and BerkeleyDB. RunRev's LiveCode for front-end client development (we formerly used Flash; someday this should be HTML5 apps?). Perl/POE/IKC for the back-end "cooperative multitasking" servers.
  44. Summary: This work builds on efforts since 1999. Argus offers a *lot* of advantages over netflow or sflow. The data management problem *is* solvable. We hope to encourage an open, global community effort to deploy common standards and tools addressing metrics for R&E network performance, operations and security.
