VMware expert Motonori Shindo presented on L2 over L3 encapsulation protocols such as VXLAN, NVGRE, STT, and Geneve. He explained how each protocol works, including header formats, and provided ecosystem updates. He believes Geneve has potential because it allows extensibility through option fields while still leveraging NIC offloading, but notes that VXLAN is already widely adopted and unlikely to be replaced. Critics argue that Geneve's goals could be achieved through other means, such as L2TPv3 static tunneling or VXLAN-gpe.
2. Tunneling vs Encapsulation
• Tunneling Protocols
– Signaling + Encapsulation
• Usually equipped with some sort of “signaling” mechanism, which manages the tunnel.
• Encapsulation is the other part of a tunneling protocol.
– E.g. PPTP, L2TP, IPsec (IKE), etc.
• Encapsulations
– A way of wrapping (i.e. encapsulating) something
– E.g. GRE, VXLAN, NVGRE, STT, (Ethernet, IP, TCP, ...)
• What I’m going to talk about today is “encapsulation”
• I am not going to talk about “control plane” today (though it’s very important)
CONFIDENTIAL 2
3. L2 over L3 encapsulations typically seen in Network Virtualization
• GRE (Generic Routing Encapsulation) *
• VXLAN (Virtual Extensible LAN)
• NVGRE (Network Virtualization using GRE)
• STT (Stateless Transport Tunneling)
* Strictly speaking, GRE is not an L2 over L3 encapsulation, as it can encapsulate not only L2 but also L3
4. VXLAN
• Proposed by Cumulus / Arista / Broadcom / Cisco / VMware / Citrix / RedHat
– draft-mahalingam-dutt-dcops-vxlan-09.txt
• Extends VLAN ID (12bit) to VNI (24bit)
• Encapsulation by UDP/IP
– L3 overlay
– Multipath
• Encapsulates Ethernet Frame only
• Simple so that it can be implemented by hardware
• Forming an “ecosystem”
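The step from a 12bit VLAN ID to a 24bit VNI can be sketched in a few lines of Python. This is only an illustration: the layout (8 flag bits with the "I" bit set, 24 reserved bits, the 24bit VNI, 8 reserved bits) is my reading of the published VXLAN specification, and the function name `pack_vxlan_header` is hypothetical.

```python
import struct

VXLAN_FLAG_I = 0x08  # "VNI present" flag bit in the first byte (assumed from the VXLAN spec)


def pack_vxlan_header(vni: int) -> bytes:
    """Build an 8-byte VXLAN header for a 24-bit VNI (illustrative helper)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)


# ID space: 12-bit VLAN ID versus 24-bit VNI
print(2**12, "VLAN IDs vs", 2**24, "VNIs")
print(pack_vxlan_header(5000).hex())
```

The header is fixed-size and has no variable parts, which is what makes it simple enough for hardware.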
6. Fabric Network
• Service Oriented Architecture
• From 2 or 3 layer network to Leaf & Spine
• High density and bandwidth required
• Layer 3 ECMP
• No oversubscription
• Low and uniform delay characteristic
• Wire & configure once network
• Uniform network configuration
7. Multipath Network
• Background
– In order to support the significant increase in East-West traffic, Fabric Networks based on multipath are getting popular
• Requisites
– A given flow must always traverse the same path
– Must have enough “entropy” to make efficient use of the fabric
8. Multipath by VXLAN
• Encapsulated packet (header sizes in bytes): IP (20) | UDP (8) | VXLAN (8) | original packet (Ether | IP | TCP | Data)
• UDP dst port = 4789
• UDP src port = Hash(src/dst MAC addr, src/dst IP addr, src/dst port number, etc.) * of the original packet
* Which fields to hash or which hash algorithm to use is not defined by the protocol. It is up to the implementation.
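A minimal sketch of how the outer UDP source port could be derived from the original packet, so that one flow always takes the same path while different flows spread across the fabric. The hash (CRC32), the chosen fields, and the port range 49152-65535 are all assumptions for illustration; as noted above, the protocol leaves these to the implementation. The function name is hypothetical.

```python
import zlib

VXLAN_DST_PORT = 4789  # fixed destination port, as on the slide


def vxlan_src_port(src_ip: str, dst_ip: str, proto: int,
                   src_port: int, dst_port: int) -> int:
    """Map an inner flow to an outer UDP source port (illustrative only)."""
    # Hash input: fields of the ORIGINAL (inner) packet. CRC32 is an assumption.
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = zlib.crc32(key)
    # Fold the hash into the assumed range 49152-65535 (16384 values).
    return 49152 + (digest % 16384)


flow = ("10.0.0.1", "10.0.0.2", 6, 33000, 80)
print(vxlan_src_port(*flow), "->", VXLAN_DST_PORT)
```

Because the port depends only on the inner flow, routers that hash the outer 5-tuple keep a flow on one path without knowing anything about VXLAN.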
9. VXLAN Ecosystem
• Switch / Router
– Arista, Brocade, Cisco, Cumulus, DELL, HP, Huawei, Juniper, Open vSwitch, Pica8
• Operating System
– Linux, VMware
• Appliances
– A10, Citrix, F5
• Testers
– IXIA, Spirent
• ASIC / NIC
– Broadcom, Intel (Fulcrum), Emulex, Mellanox
• Cloud Orchestrator
– CloudStack, OpenStack, vCAC
Note: this is not an exhaustive list
This is a list of vendors who participated in the VXLAN interoperability test at INTEROP Tokyo 2014, which was fully successful.
10. NVGRE
• Proposed by Microsoft / Arista / Intel / Google / HP / Broadcom / Emulex
– draft-sridharan-virtualization-nvgre-04.txt
• 24bit Virtual Subnet ID (VSID) and 8bit FlowID
• Encapsulation is GRE as is:
– Put VSID + FlowID in Key Field
– L3 Overlay
– Multipath possible (in theory) but difficult
• Windows affinity
12. Multipath in NVGRE
• Encapsulated packet (header sizes in bytes): IP (20) | GRE (8) | original packet (Ether | IP | TCP | Data)
• FlowID = Hash(src/dst MAC addr, src/dst IP addr, src/dst port number, etc.) * of the original packet
• Router / Switch needs to look up the Key Field in the GRE header to do an ideal multipath!
* Which fields to hash or which hash algorithm to use is not defined by the protocol. It is up to the implementation.
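The way NVGRE places VSID and FlowID into the GRE Key field can be sketched as follows. The flag value for "Key present" and the protocol type 0x6558 (Transparent Ethernet Bridging) are my understanding of the GRE and NVGRE specifications; the function name is hypothetical.

```python
import struct

GRE_FLAG_KEY = 0x2000   # K bit: the Key field is present (assumed from the GRE spec)
ETHERTYPE_TEB = 0x6558  # Transparent Ethernet Bridging: the payload is an Ethernet frame


def pack_nvgre_header(vsid: int, flow_id: int) -> bytes:
    """Build an 8-byte GRE header carrying VSID + FlowID in the Key field."""
    if not 0 <= vsid < 2**24:
        raise ValueError("VSID must fit in 24 bits")
    if not 0 <= flow_id < 2**8:
        raise ValueError("FlowID must fit in 8 bits")
    key = (vsid << 8) | flow_id  # upper 24 bits: VSID, lower 8 bits: FlowID
    return struct.pack("!HHI", GRE_FLAG_KEY, ETHERTYPE_TEB, key)


print(pack_nvgre_header(0x123456, 0xAB).hex())
```

The entropy sits in the last byte of the GRE header rather than in a UDP port, which is why a router or switch has to parse GRE to use it for multipath.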
13. NVGRE ecosystem
• Switch / Router
– Huawei
– Arista and Brocade say they are going to support it, but no product has come out yet (?)
• Operating System
– Microsoft (Windows Server 2012 R2)
• Appliances
– F5
• ASIC / NIC
– Emulex, Mellanox
• Cloud Orchestrator
– System Center 2012 R2
Note: this is not an exhaustive list
14. STT (Stateless Transport Tunneling)
• L2 over L3 encapsulation proposed by VMware
– draft-davie-stt-06.txt
• Why yet another L2 over L3 encapsulation ?
– Performance
– Richer context information
– Multipath
– Software oriented
15. TSO (TCP Segmentation Offload)
• Modern NICs (shipped within the last 4-5 years) are equipped with various hardware acceleration features:
– RSS, GSO/TSO, Checksum Offload, etc.
• With TSO, the NIC performs TCP segmentation on behalf of the operating system (which would otherwise do it in software)
– The operating system can now send packets of up to 64K bytes. This leads to a significant decrease in the number of packets processed (i.e. interrupts), hence far fewer context switches.
• To take advantage of TSO in the NIC, STT encapsulates packets so that they look like “TCP”!
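A rough back-of-the-envelope calculation shows why this matters. The MSS of 1460 bytes (a 1500-byte MTU minus 20 bytes of IP and 20 bytes of TCP header) is an assumption for illustration, and the helper name is hypothetical.

```python
import math


def segments_needed(payload_bytes: int, mss: int = 1460) -> int:
    """Number of TCP segments required to carry payload_bytes at a given MSS."""
    return math.ceil(payload_bytes / mss)


# Without TSO the OS builds every segment itself; with TSO it hands one
# 64K-byte chunk to the NIC, which cuts it into this many segments.
print(segments_needed(64 * 1024))  # 45
```

So under these assumptions the software stack handles one large packet instead of about 45 small ones.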
19. Throughput and CPU Utilization
[Figure: bar chart of throughput (Gbps) and CPU utilization (%) on receive and send for Linux Bridge, OVS Bridge, OVS-GRE, and OVS-STT. Source: http://networkheresy.com/2012/06/08/the-overhead-of-software-tunneling/]
20. Multipath in STT
• Encapsulated packet (header sizes in bytes): IP (20) | TCP’ (20) | STT (18) | original packet (Ether | IP | TCP | Data)
• dst port = 7471 (TBD)
• src port = Hash(src/dst MAC addr, src/dst IP addr, src/dst port number, etc.) * of the original packet
* Which fields to hash or which hash algorithm to use is not defined by the protocol. It is up to the implementation.
21. Geneve (Generic Network Virtualization Encapsulation)
• New encapsulation being proposed by VMware, Microsoft, RedHat, Intel
– draft-gross-geneve-00.txt
• Goals
– Extensibility
• Service Chaining, Metadata support, etc.
– Leverage NIC offload
– Above two at the same time! (each one is straightforward, but two at the same time is difficult)
• Highlights
– Information can be added as Option fields in TLV format
– Format carefully designed so that NIC can perform TSO
– OAM and Criticality (indicating that parsing the option fields is mandatory)
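The TLV idea can be sketched as below. The field layout (2bit version, 6bit option length in 4-byte units, O and C flags, 16bit protocol type, 24bit VNI, and options made of a 16bit class, 8bit type with the top bit marking "critical", and a 5bit length) follows the published Geneve specification as I understand it; the draft cited above may differ in detail. Function names, the option class 0xFFFF, and the option contents are hypothetical.

```python
import struct

ETHERTYPE_TEB = 0x6558  # payload is an Ethernet frame


def pack_option(opt_class: int, opt_type: int, data: bytes, critical: bool = False) -> bytes:
    """Encode one TLV option; data must be a multiple of 4 bytes."""
    if len(data) % 4 or len(data) // 4 >= 32:
        raise ValueError("option data must be a multiple of 4 bytes, at most 124")
    if critical:
        opt_type |= 0x80  # top bit of Type marks the option as critical
    return struct.pack("!HBB", opt_class, opt_type, len(data) // 4) + data


def pack_geneve(vni: int, options=(), oam: bool = False) -> bytes:
    """Encode base header + options; options is a list of (class, type, data, critical)."""
    opts = b"".join(pack_option(*o) for o in options)
    if len(opts) // 4 >= 64:
        raise ValueError("options too long")
    critical = any(o[3] for o in options)
    byte0 = (0 << 6) | (len(opts) // 4)  # Ver = 0, Opt Len in 4-byte units
    byte1 = (0x80 if oam else 0) | (0x40 if critical else 0)  # O and C flags
    return struct.pack("!BBHI", byte0, byte1, ETHERTYPE_TEB, vni << 8) + opts


hdr = pack_geneve(0x00ABCD, [(0xFFFF, 0x01, b"\x00\x00\x00\x2a", True)])
print(hdr.hex())
```

Because the total option length is stated up front in the base header, a parser (or a NIC) can skip all options without understanding them, which is what lets extensibility and offload coexist.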
23. Geneve Implementation
• Recently implemented in Open vSwitch (OVS) and merged into master branch on GitHub
– VNI can be specified
– Geneve Options can’t be specified (at this point)
– Can’t mark the OAM flag?? (I tried but it didn’t work)
– Looks like the Critical flag is supported as long as critical options are present
• Geneve dissector for Wireshark also implemented and merged into master branch on GitHub
• Geneve-aware NIC is not available yet
27. Information about Geneve
• English
– http://tools.ietf.org/html/draft-gross-geneve-00
– http://cto.vmware.com/geneve-vxlan-network-virtualization-encapsulations/
– http://www.enterprisenetworkingplanet.com/netsp/geneve-generic-network-virtualization-encapsulation-protocol-advances-video.html
– http://searchsdn.techtarget.com/news/2240219051/VMware-Microsoft-end-encapsulation-protocol-turf-war-with-GENEVE
– http://www.plexxi.com/2014/06/attention-overlay-tunnel-construction-ahead
– http://blog.shin.do/2014/07/geneve-on-open-vswitch/
• Japanese
– http://blog.shin.do/2014/05/geneve-encapsulation/
– http://blog.shin.do/2014/07/geneve-on-open-vswitch/
28. Geneve replaces VXLAN / STT / NVGRE ?
• Geneve replaces VXLAN ?
– NO
– VXLAN ecosystem has already grown big enough so it is unlikely to be replaced by something else
– VMware will continue to support VXLAN and ecosystem partners
• Geneve replaces STT?
– In the short term, NO. In the long run, maybe, if
• Geneve is accepted by the market and Geneve-aware NICs become widely available, at the same level as STT today.
• Geneve replaces NVGRE ?
– In the short term, NO. In the long run, maybe, if
• Geneve gets implemented on Windows and an ecosystem forms at the same level as NVGRE's today.
29. Encapsulation is like a wire: the right cable in the right place
http://cto.vmware.com/geneve-vxlan-network-virtualization-encapsulations/
30. World is not that simple
• Some people are against Geneve
• Their claims are more or less as follows:
– What Geneve tries to accomplish can be achieved by an existing encapsulation (such as L2TP static tunneling or VXLAN) as is or with a small extension !?
– Service Chaining and Metadata should not be bound to a particular encapsulation. They should be independent of the encapsulation !?
– 24 bits is not long enough for a VNI !?
31. L2TPv3 static tunneling
• L2TPv3 is a tunneling protocol, so it inherently has signaling. That said, it can be used as a plain encapsulation method (i.e. pseudo wire) without signaling. This is called “L2TPv3 static tunneling”, where configuration is made manually at both ends.
• L2TPv3 became an RFC in 2005 (RFC3931) and has been on the market for many years. Cisco IOS and Linux (l2tpd) have L2TPv3 static tunneling.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|T|x|x|x|x|x|x|x|x|x|x|x|  Ver  |          Reserved             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Session ID                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               Cookie (optional, maximum 64 bits)...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                                |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
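The header in the diagram (the L2TPv3 data header as carried over UDP) can be packed as in the sketch below. I am assuming T = 0 for data messages, Ver = 3, and a cookie of 0, 4, or 8 bytes, per my reading of RFC3931; the function name and example values are hypothetical.

```python
import struct


def pack_l2tpv3_udp_header(session_id: int, cookie: bytes = b"") -> bytes:
    """Build the L2TPv3-over-UDP data header: flags/Ver, Reserved, Session ID, Cookie."""
    if not 0 <= session_id < 2**32:
        raise ValueError("Session ID must fit in 32 bits")
    if len(cookie) not in (0, 4, 8):
        raise ValueError("cookie must be 0, 4, or 8 bytes")
    t_bit = 0    # 0 = data message (assumed)
    version = 3  # L2TPv3
    flags_ver = (t_bit << 15) | version  # T is the top bit, Ver the low 4 bits
    return struct.pack("!HHI", flags_ver, 0, session_id) + cookie


print(pack_l2tpv3_udp_header(0xBEEF, b"\x01\x02\x03\x04").hex())
```

With static tunneling, both ends would simply be configured with the same Session ID and cookie instead of negotiating them through signaling.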
32. L2TPv3 static tunneling as a L2 over L3 encapsulation
• Session ID (32bit) corresponds to VNI
• L2TPv3 can be transported directly over IP or over UDP. For multipath, UDP would be better.
• No explicit field for context information (metadata, etc.). It has to be configured manually on both ends (if possible) and expressed implicitly as a part of the Session ID
– Therefore the 32bit Session ID can’t be used entirely for VNI
• Strictly speaking, there is no way in L2TPv3 to tell (in the packet) where the subsequent packet starts so that the NIC can do TSO. However, L2TPv2 had an “offset” option for this purpose. Many L2TPv3 implementations still have this “offset” option for backward compatibility with L2TPv2. So TSO is possible (if the NIC understands this legacy option). Cisco and Linux l2tpd support the offset field.
33. VXLAN Generic Protocol Extension (a.k.a. eVXLAN)
• Proposed by Cisco, Huawei, Intel, Microsoft
– draft-quinn-vxlan-gpe-03.txt
• An extension to VXLAN
– Support protocols other than Ethernet
• IPv4 (0x01), IPv6 (0x02), Ethernet (0x03), Network Service Header [NSH] (0x04)
– Note that “Next Protocol” is only 8 bits wide. Protocol type (usually 16 bits) has to be specifically encoded to fit into 8 bits.
– OAM support
– Version field
• Used by Cisco ACI
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|P|R|O|Ver|   Reserved                |Next Protocol  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                VXLAN Network Identifier (VNI) |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
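A sketch of packing this header, with bit positions read directly off the diagram above (flags in the first byte, a 2bit Ver, an 8bit Next Protocol in the last byte of the first word). Setting both the I and P bits, and the helper name, are assumptions for illustration; the Next Protocol codes are those listed on this slide.

```python
import struct

# Next Protocol codes as listed on the slide
NEXT_PROTO = {"ipv4": 0x01, "ipv6": 0x02, "ethernet": 0x03, "nsh": 0x04}

FLAG_I = 0x08  # VNI present
FLAG_P = 0x04  # Next Protocol field present
FLAG_O = 0x01  # OAM


def pack_vxlan_gpe(vni: int, next_proto: str, oam: bool = False, version: int = 0) -> bytes:
    """Build an 8-byte VXLAN-gpe header following the slide's bit layout."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    if not 0 <= version < 4:
        raise ValueError("Ver must fit in 2 bits")
    flags = FLAG_I | FLAG_P | (FLAG_O if oam else 0)
    # Word 1: flags (bits 0-7), Ver (bits 8-9), reserved, Next Protocol (bits 24-31)
    word1 = (flags << 24) | (version << 22) | NEXT_PROTO[next_proto]
    # Word 2: VNI (24 bits) + reserved (8 bits), same as plain VXLAN
    return struct.pack("!II", word1, vni << 8)


print(pack_vxlan_gpe(5000, "nsh").hex())
```

Note that the second word is identical to plain VXLAN; only the formerly reserved bits of the first word are given meaning.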
34. VXLAN-gpe as L2 over L3 encapsulation
• Mostly identical to VXLAN
– VNI length (24bits)
– Multipath property
– Hardware friendliness
• The biggest motivation of VXLAN-gpe is probably to allow Service Chaining by NSH (network service header)
• No further extensibility