Skydive
Real-time network topology and protocols analyzer
Sylvain Afchain
Sylvain Afchain
Principal software engineer
Red Hat
OpenStack Neutron contributor
OpenContrail contributor
WHY ?
SDN IS COMPLEX
Troubleshooting/monitoring is even more complex
Implementations
Management
Control plane
● OpenFlow
● XMPP
● BGP
● AMQP
● Etc...
Data plane
● VLAN
● VXLAN
● GRE
● MPLS
● OVS, Linuxbridge, other
Real network issues… just a sample!
Offloading issue leading to packet drops
Offloading issue leading to poor performance and TCP retransmissions
Offloading issue leading to bad checksums
Configuration issues such as MTU
Stale filtering rules or routes left behind
Forwarding database corrupted with VXLAN
Checksum offloading issue with a GRE tunnel, leading to poor performance for TCP connections.
Bonding/LACP issues: multicast packets dropped in VXLAN tunnels
SDN control plane issues: config not reflected in the data plane, e.g. stale or missing OpenFlow rules.
While trying to record a demo of Skydive… and this is not a joke!
http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [('Connection
aborted.', OSError(101, 'Network is unreachable'))]
Troubleshooting
Where...
are the packets dropped ?
are the packets fragmented ?
is the congestion point ?
What…
is the path of packets ?
kind of traffic for this virtual network ?
is the number of flows on this link ?
is the number of TCP Sessions ?
is the bandwidth for this tenant ?
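One way to picture the "where are the packets dropped?" question: compare packet counters hop by hop along the path of a flow. The sketch below is only an illustration, not Skydive code. The function name `locate_drops`, the interface names, and the counter values are all made up for the example.

```python
def locate_drops(hop_counters):
    """Return (from_hop, to_hop, lost_packets) for each hop pair that loses packets.

    hop_counters is an ordered list of (hop_name, packets_seen) along one path.
    Hypothetical input format, chosen for this illustration only.
    """
    drops = []
    for (a, seen_a), (b, seen_b) in zip(hop_counters, hop_counters[1:]):
        lost = seen_a - seen_b
        if lost > 0:
            drops.append((a, b, lost))
    return drops


# Made-up counters for one flow crossing a compute node.
path = [("tap1", 1000), ("br-int", 1000), ("vxlan0", 990), ("eth0", 990)]
print(locate_drops(path))
```

With these sample counters the loss is located between `br-int` and `vxlan0`, which is the kind of answer the existing per-host tools make tedious to obtain.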
Current toolbox
● Iproute2
● ovs-vsctl, ovs-ofctl, ovs-dpctl...
● ethtool
● brctl
● tcpdump
● SDN CLI/API
● etc.
The needs
1. SDN Agnostic solution
2. Non-intrusive
3. Lightweight
4. Flow centric
5. Easy to deploy
6. Open, API
7. Connectors to SDN
The needs
1. Topology probes
a. interfaces, bond, mtu, vlan
b. bridges
c. Network namespaces
d. etc..
2. Flow probes
a. on-demand traffic capture
b. on-demand counter capture
c. filtering
d. underlay/overlay information
3. Topology/flow aggregation
a. mapping topology/flow
b. analysis
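The topology/flow mapping listed above can be sketched as a small graph whose nodes carry the flows captured on them. This is a minimal illustration under assumed names (`Node`, `Flow`, `Topology`, and the sample interfaces are hypothetical), not the data model Skydive actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """One topology element: interface, bridge, namespace (hypothetical schema)."""
    name: str
    kind: str
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Flow:
    """A flow identified by its 5-tuple (illustrative only)."""
    src: str
    dst: str
    proto: str
    sport: int
    dport: int


class Topology:
    def __init__(self):
        self.nodes = {}
        self.links = defaultdict(set)   # node name -> neighbour names
        self.flows = defaultdict(set)   # node name -> flows captured there

    def add_node(self, name, kind, **metadata):
        self.nodes[name] = Node(name, kind, metadata)

    def link(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def record_flow(self, node_name, flow):
        self.flows[node_name].add(flow)

    def nodes_seeing(self, flow):
        """Names of all nodes where this flow was captured, sorted."""
        return sorted(n for n, fs in self.flows.items() if flow in fs)


topo = Topology()
topo.add_node("eth0", "interface", mtu=1500)
topo.add_node("br-int", "bridge")
topo.add_node("tap1", "interface", mtu=1450)
topo.link("eth0", "br-int")
topo.link("br-int", "tap1")

f = Flow("10.0.0.1", "10.0.0.2", "tcp", 40000, 80)
topo.record_flow("eth0", f)
topo.record_flow("br-int", f)

print(topo.nodes_seeing(f))      # where the flow was observed
print(len(topo.flows["tap1"]))   # number of flows captured on tap1
```

In this sample the flow is seen on `eth0` and `br-int` but never reaches `tap1`, which hints at where to look next.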
Skydive design
Agents
● On the nodes to monitor
● Topology probes
● Flow probes
● Southbound API, topology queries, Flow Probes
Analyzers
● collect agents data, time-series database
● Flow centric
● Northbound API, topology queries, flow capture, alerting
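The agent/analyzer split can be illustrated with an in-process sketch: agents report what they observe on their node, and the analyzer keeps timestamped records and answers queries. Everything here is an assumption made for the example (class names, method names, record layout, and the plain list standing in for a time-series database). It does not reproduce Skydive's real APIs.

```python
import time


class Analyzer:
    """Collects agent reports; a list stands in for the time-series database."""

    def __init__(self):
        self.records = []

    def ingest(self, host, kind, payload, ts=None):
        self.records.append({
            "ts": time.time() if ts is None else ts,
            "host": host,
            "kind": kind,
            "payload": payload,
        })

    def query(self, kind, since=0.0):
        """Return records of one kind with a timestamp >= since."""
        return [r for r in self.records if r["kind"] == kind and r["ts"] >= since]


class Agent:
    """Runs on a monitored node and forwards its observations to the analyzer."""

    def __init__(self, host, analyzer):
        self.host = host
        self.analyzer = analyzer

    def report_topology(self, interfaces, ts=None):
        self.analyzer.ingest(self.host, "topology", interfaces, ts)

    def report_flow(self, flow, ts=None):
        self.analyzer.ingest(self.host, "flow", flow, ts)


analyzer = Analyzer()
a1 = Agent("compute-1", analyzer)
a2 = Agent("compute-2", analyzer)
a1.report_topology({"eth0": {"mtu": 1500}}, ts=1.0)
a2.report_flow({"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 1200}, ts=2.0)
a1.report_flow({"src": "10.0.0.2", "dst": "10.0.0.1", "bytes": 400}, ts=3.0)

recent = analyzer.query("flow", since=2.5)
print([r["host"] for r in recent])
```

A real deployment would replace the direct method calls with network transport between agents and analyzers.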
Skydive Use-cases
Operator :
● Detection of common configuration errors.
● Detection of live network issues at any point of the infrastructure, in both the underlay and the overlay.
○ Poor performance: helps find the root cause.
○ DDoS and any unexpected traffic.
● Ability to capture traffic at any point for further analysis.
○ History of all captured metrics, keeping all flow events for further analysis.
User :
● Detection of misconfigured filtering mechanisms such as security groups.
● Detection of poor application performance, high RTT.
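As an example of detecting a common configuration error, the sketch below flags linked interfaces whose MTUs differ. It is a simplified illustration with hypothetical names and data. A differing MTU is only a candidate issue, since encapsulation overhead (VXLAN, GRE) can make a smaller inner MTU intentional.

```python
def mtu_mismatches(interfaces, links):
    """Return (a, b, mtu_a, mtu_b) for every link whose endpoints disagree on MTU.

    interfaces: name -> {"mtu": int}; links: list of (name_a, name_b).
    Both structures are assumptions made for this example.
    """
    issues = []
    for a, b in links:
        mtu_a = interfaces[a]["mtu"]
        mtu_b = interfaces[b]["mtu"]
        if mtu_a != mtu_b:
            issues.append((a, b, mtu_a, mtu_b))
    return issues


# Made-up topology: one veth pair was left with mismatched MTUs.
interfaces = {
    "eth0": {"mtu": 1500},
    "br-ex": {"mtu": 1500},
    "veth-a": {"mtu": 1500},
    "veth-b": {"mtu": 1400},
}
links = [("eth0", "br-ex"), ("veth-a", "veth-b")]

print(mtu_mismatches(interfaces, links))
```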
Skydive today
● Topology capture
○ Netlink, NetNS, ethtool, OVSDB
○ Connectors:
■ Neutron, Docker
○ Backend:
■ In-memory, Gremlin-based (Titan, TinkerPop, Neo4j)
● Live distributed capture
○ sFlow with OVS, PCAP
● Analysis
○ Flow table, flow event, session expiration, etc.
○ Backend:
■ Elasticsearch
● API/WebUI
○ On-demand capture, Topology (events, alerting)
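The flow table and session expiration mentioned under Analysis can be sketched as a dictionary keyed by flow, with an idle timeout. This is a generic illustration of the technique. The class name, field names, and the 30-second timeout are assumptions, not Skydive's implementation.

```python
class FlowTable:
    """Tracks per-flow counters and expires flows idle longer than idle_timeout."""

    def __init__(self, idle_timeout=30.0):
        self.idle_timeout = idle_timeout
        self.flows = {}   # flow key -> counters and timestamps

    def update(self, key, nbytes, now):
        """Account one packet of nbytes for the flow identified by key."""
        entry = self.flows.get(key)
        if entry is None:
            entry = {"packets": 0, "bytes": 0, "first": now, "last": now}
            self.flows[key] = entry
        entry["packets"] += 1
        entry["bytes"] += nbytes
        entry["last"] = now

    def expire(self, now):
        """Remove and return flows with no packet for more than idle_timeout."""
        expired = {k: v for k, v in self.flows.items()
                   if now - v["last"] > self.idle_timeout}
        for k in expired:
            del self.flows[k]
        return expired


table = FlowTable(idle_timeout=30.0)
web = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 80)
dns = ("10.0.0.1", "10.0.0.53", "udp", 53000, 53)

table.update(dns, 80, now=0.0)
table.update(web, 1500, now=0.0)
table.update(web, 1500, now=25.0)

gone = table.expire(now=40.0)
print(sorted(gone))          # flows that timed out
print(table.flows[web])      # the still-active flow and its counters
```

At time 40 the DNS flow has been idle for 40 seconds and expires, while the web flow, last seen at 25, stays in the table.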
Skydive Roadmap
● Topology capture
○ Adding more connectors
● Live distributed capture
○ PCAP, Filtering
● Analysis
○ Adding more protocols
○ Alerting
● Improvement of the security
○ RBAC
○ SSL
Key points
Non-intrusive, SDN-agnostic.
Helps to troubleshoot/monitor by giving information on the root cause and its impact.
Gives feedback by providing the information needed for capacity planning and billing.
Open source
Apache License
Written in Go
Questions
What would be the best place for the project?
OpenStack: single point where multiple SDN controllers meet, but is it really a network-focused project?
Events to engage/propose content ?
Questions ?
https://github.com/redhat-cip/skydive
safchain@redhat.com

Skydive, 31 Jan. 2016
