PASTE: A Network Programming Interface
for Non-Volatile Main Memory
Michio Honda (NEC Laboratories Europe)
Giuseppe Lettieri (Università di Pisa)
Lars Eggert and Douglas Santry (NetApp)
USENIX NSDI 2018
Review: Memory Hierarchy
Slow, block-oriented persistence
CPU
Caches: 5-50 ns (byte access w/ load/store)
Main Memory: 70 ns (byte access w/ load/store)
HDD / SSD: 100-1000s us (block access w/ system calls)
Review: Memory Hierarchy
Fast, byte-addressable persistence
CPU
Caches: 5-50 ns (byte access w/ load/store)
Main Memory: 70 ns -1000s ns (byte access w/ load/store)
HDD / SSD: 100-1000s us (block access w/ system calls)
Networking is faster than disks/SSDs
1.2KB durable write over TCP/HTTP
Client to Server (cables, NICs, TCP/IP, socket API): 23us
Server to SSD (syscall, PCIe bus, physical media): 1300us
Networking is slower than NVMM
1.2KB durable write over TCP/HTTP
Client to Server (cables, NICs, TCP/IP, socket API): 23us
Server to NVMM (memcpy, memory bus, physical media): 2us
Networking is slower than NVMM
1.2KB durable write over TCP/HTTP
[Diagram: multiple clients, one server, NVMM. Network path: cables, NICs, TCP/IP, socket API. Storage path: memcpy, memory bus, physical media.]
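/* Conventional server loop. epfd, evts, MAXEVENTS, len and reply_len
   are illustrative names that are not on the original slide. */
nevts = epoll_wait(epfd, evts, MAXEVENTS, -1);
for (i = 0; i < nevts; i++) {
    fd = evts[i].data.fd;
    len = read(fd, buf, sizeof(buf));   /* copy: socket buffer -> buf */
    ...
    memcpy(nvmm, buf, len);             /* copy: buf -> NVMM */
    ...
    write(fd, reply, reply_len);
}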
Innovations at both stacks
Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], Stackmap [ATC’16]
Storage stack: NVTree [FAST’15], NVWal [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]
Stacks are isolated
Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], Stackmap [ATC’16]
Storage stack: NVTree [FAST’15], NVWal [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]
Costs of moving data between the two stacks
Bridging the gap
Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], Stackmap [ATC’16]
Storage stack: NVTree [FAST’15], NVWal [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]
PASTE sits between the network stack and the storage stack
PASTE Design Goals
● Durable zero copy
○ DMA to NVMM
● Selective persistence
○ Exploit modern NIC’s DMA to L3 cache
● Persistent data structures
○ Indexed, named packet buffers backed by a file
● Generality and safety
○ TCP/IP in the kernel and netmap API
● Best practices from modern network stacks
○ Run-to-completion, blocking, busy-polling, batching etc
PASTE in Action
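[Diagram: user space holds the app thread, a Pring (slots [0]-[7], with a cur pointer) and a Plog (/mnt/pm/plog; fields len, off, pbuf). The kernel holds the NIC and TCP/IP. The Ppool (shared memory, /mnt/pm/pp) holds the Pbufs, indexed from [0], on the file system /mnt/pm. Data moves with zero copy. The Pring slots hold Pbufs 20-27.]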
PASTE in Action
● poll() system call
App thread:
1. Run NIC I/O and TCP/IP
[Diagram: Pring slots [0]-[7] hold Pbufs 20-27; the rest of the layout is unchanged.]
PASTE in Action
● poll() system call
○ Got 6 in-order TCP segments
App thread:
1. Run NIC I/O and TCP/IP
[Diagram: Pring slots [0]-[7] still hold Pbufs 20-27.]
PASTE in Action
● poll() system call
○ They are set to Pring slots
App thread:
1. Run NIC I/O and TCP/IP
[Diagram: Pring slots now hold Pbufs 0, 1, 2, 3, 4, 5, 6 and 27; a tail pointer is added next to cur.]
PASTE in Action
● Return from poll()
App thread:
1. Run NIC I/O and TCP/IP
[Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6 and 27, with cur and tail pointers.]
PASTE in Action
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
[Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6 and 27, with cur and tail pointers; the app reads the Pbufs with zero copy.]
PASTE in Action
● Flush Pbuf data from CPU cache to DIMM
○ clflush(opt) instruction
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
[Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6 and 27, with cur and tail pointers.]
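The flush step can be sketched in C as follows. This is a minimal sketch under stated assumptions, not PASTE's code: flush_range and CACHE_LINE are hypothetical names, a 64-byte cache line is assumed, and the plain _mm_clflush intrinsic stands in for the clflush(opt) instruction named on the slide. On non-x86 builds the flush compiles to a no-op so the cache-line arithmetic can still be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>          /* _mm_clflush, _mm_mfence */
#endif

#define CACHE_LINE 64u          /* assumption: 64-byte cache lines */

/* Hypothetical helper: flush every cache line that overlaps
 * [addr, addr + len) and then fence. Returns the number of
 * cache lines that were flushed. */
static size_t flush_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;

    /* Round the start down to a cache-line boundary. */
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE) {
#if defined(__SSE2__)
        _mm_clflush((const void *)p);
#endif
        lines++;
    }
#if defined(__SSE2__)
    _mm_mfence();               /* wait until the flushes are done */
#endif
    return lines;
}
```

An application would call flush_range on the payload of each Pbuf it wants to persist, which is what makes persistence selective: buffers that are only inspected never need to be flushed.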
PASTE in Action
● Pbuf is a persistent data representation
○ Base address is static, i.e., a file (/mnt/pm/pp)
○ Buffers can be recovered after reboot
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
4. Flush Plog entry(ies)
[Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6 and 27, with cur and tail pointers. The Plog (fields len, off, pbuf) now holds one entry: 1 120 96.]
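Because a Plog entry names a Pbuf by index rather than by pointer, it can be resolved again after a reboot once the Ppool file is mapped. The C sketch below illustrates that idea only. The struct layout, the field order, the 2048-byte PBUF_SIZE and the name plog_resolve are all assumptions made for illustration; the slides do not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PBUF_SIZE 2048u   /* assumption: fixed-size Pbufs */

/* Hypothetical Plog entry: where a piece of payload lives. */
struct plog_entry {
    uint32_t pbuf;        /* index of the Pbuf inside the Ppool */
    uint32_t off;         /* byte offset of the payload in the Pbuf */
    uint32_t len;         /* payload length in bytes */
};

/* Turn an entry into an address relative to the mapped Ppool.
 * Returns NULL if the entry points outside the pool or the Pbuf. */
static const uint8_t *plog_resolve(const uint8_t *pool_base, size_t npbufs,
                                   const struct plog_entry *e)
{
    assert(pool_base != NULL && e != NULL);
    if (e->pbuf >= npbufs)
        return NULL;
    if (e->off > PBUF_SIZE || e->len > PBUF_SIZE - e->off)
        return NULL;
    return pool_base + (size_t)e->pbuf * PBUF_SIZE + e->off;
}
```

Storing indices and offsets keeps the log valid even if the file is mapped at a different virtual address after recovery; only pool_base changes.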
PASTE in Action
● Prevent the kernel from recycling the buffer
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
4. Flush Plog entry(ies)
5. Swap out Pbuf(s)
[Diagram: the first Pring slot now holds Pbuf 8 instead of Pbuf 1; the slots hold Pbufs 0, 8, 2, 3, 4, 5, 6 and 27. The Plog holds the entry 1 120 96.]
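The swap can be pictured as exchanging the buffer index stored in a ring slot for the index of a spare buffer, so the kernel reuses the spare while the application keeps the persisted one. The C sketch below is self-contained and hypothetical: struct ring_slot, SLOT_BUF_CHANGED and swap_out are illustrative names loosely modeled on a netmap-style slot, not the actual PASTE or netmap definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_BUF_CHANGED 0x1u   /* hypothetical flag: slot got a new buffer */

/* Hypothetical ring slot: refers to a Pbuf by index. */
struct ring_slot {
    uint32_t buf_idx;           /* index of the Pbuf held by this slot */
    uint16_t len;               /* bytes of data in the Pbuf */
    uint16_t flags;
};

/* Put a spare Pbuf into the slot and return the index of the Pbuf
 * that was there, which the application now owns. */
static uint32_t swap_out(struct ring_slot *slot, uint32_t spare_idx)
{
    assert(slot != NULL);
    uint32_t kept = slot->buf_idx;
    slot->buf_idx = spare_idx;
    slot->flags |= SLOT_BUF_CHANGED;   /* tell the kernel side */
    return kept;
}
```

With the values from the diagram, swapping spare Pbuf 8 into the slot that held Pbuf 1 leaves Pbuf 1 out of the kernel's reach, so the data that the Plog entry refers to is not overwritten by later packets.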
PASTE in Action
● Same for Pbuf 2 and 6
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
4. Flush Plog entry(ies)
5. Swap out Pbuf(s)
[Diagram: Pring slots now hold Pbufs 0, 8, 9, 3, 4, 5, 10 and 27, with cur and tail pointers. The Plog holds three entries: 1 120 96, 2 768 96 and 6 987 96.]
PASTE in Action
● Advance cur
○ Return buffers in slot 0-6 to the kernel at next poll()
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
4. Flush Plog entry(ies)
5. Swap out Pbuf(s)
6. Update Pring
[Diagram: Pring slots hold Pbufs 0, 8, 9, 3, 4, 5, 10 and 27; cur has moved up to tail. The Plog holds three entries: 1 120 96, 2 768 96 and 6 987 96.]
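Advancing cur is ordinary circular-ring bookkeeping: the application walks the slots from cur up to tail and leaves cur where it stopped, and the slots behind cur are handed back on the next poll(). The C sketch below uses a mock ring to show only that index arithmetic. struct pring, ring_next and pring_consume are hypothetical names, and no real system call is made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring state shared between the app and the kernel. */
struct pring {
    uint32_t num_slots;   /* total slots in the ring */
    uint32_t cur;         /* next slot the application will look at */
    uint32_t tail;        /* first slot still owned by the kernel */
};

/* Index of the slot after i, wrapping at the end of the ring. */
static uint32_t ring_next(const struct pring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Visit every slot in [cur, tail), call fn on it if fn is set,
 * and advance cur. Returns the number of slots consumed. */
static uint32_t pring_consume(struct pring *r,
                              void (*fn)(uint32_t slot, void *arg),
                              void *arg)
{
    assert(r != NULL && r->num_slots > 0);
    uint32_t n = 0;
    while (r->cur != r->tail) {
        if (fn != NULL)
            fn(r->cur, arg);
        r->cur = ring_next(r, r->cur);
        n++;
    }
    return n;
}
```

In this sketch the callback is where the earlier steps (read, flush, log, swap) would run for each slot before cur moves past it.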
PASTE in Action
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
4. Flush Plog entry(ies)
5. Swap out Pbuf(s)
6. Update Pring
[Diagram: same state as the previous slide. The Plog, with entries 1 120 96, 2 768 96 and 6 987 96, is labeled "Write-Ahead Logs".]
PASTE in Action
● We can organize various data structures in Plog
App thread:
1. Run NIC I/O and TCP/IP
2. Read data on Pring
3. Flush Pbuf(s)
4. Flush Plog entry(ies)
5. Swap out Pbuf(s)
6. Update Pring
[Diagram: the Plog (/mnt/pm/plog) is organized as a B+tree with node keys 53 and 0, 5, 7 and entries (1, 96, 120), (2, 96, 987) and (6, 96, 512). Pring slots hold Pbufs 0, 8, 9, 3, 4, 5, 10 and 27.]
Evaluation
1. How does PASTE outperform existing systems?
2. Is PASTE applicable to existing applications?
3. Is PASTE useful for systems other than file/DB storage?
How does PASTE outperform existing systems?
[Plots: WAL and B+tree (all writes), for 64B and 1280B requests]
What if we use more complex data structures?
Is PASTE applicable to existing applications?
● Redis
[Plots: YCSB (read mostly) and YCSB (update heavy)]
Is PASTE useful for systems other than DB/file storage?
● Packet logging prior to forwarding
○ Fault-tolerant middlebox [Sigcomm’15]
○ Traffic recording
● Extend mSwitch [SOSR’15]
○ Scalable NFV backend switch
Conclusion
● PASTE is a network programming interface that:
○ Enables durable zero copy to NVMM
○ Helps apps organize persistent data structures on NVMM
○ Lets apps use TCP/IP and be protected
○ Offers a high-performance network stack even w/o NVMM
https://github.com/luigirizzo/netmap/tree/paste
micchie@sfc.wide.ad.jp or @michioh
Multicore Scalability
● WAL throughput
Further Opportunity with Co-designed Stacks
● What if we use higher access latency NVMM?
○ e.g., 3D-Xpoint
● Overlap flushes and processing with clflushopt and
mfence before system call (triggers packet I/O)
○ See the paper for results
[Timeline: system call (receive new requests); examine request, clflushopt; examine request, clflushopt; mfence (wait for flushes done); system call (send responses).]
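The overlap idea can be sketched in C as a batch loop that issues a flush after examining each request and waits only once, just before the next system call. This is a sketch under assumptions, not the paper's implementation: struct request, flush_async and process_batch are hypothetical names, a 64-byte cache line is assumed, _mm_clflushopt is used only when the compiler targets it (for example with -mclflushopt), and otherwise the code falls back to _mm_clflush or to a no-op on non-x86 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <immintrin.h>    /* _mm_clflush, _mm_clflushopt, _mm_mfence */
#endif

#define CACHE_LINE 64u    /* assumption: 64-byte cache lines */

/* Hypothetical request descriptor: payload to be made durable. */
struct request {
    const void *data;
    size_t len;
};

/* Start flushing [addr, addr + len) without waiting for completion.
 * Returns the number of cache lines touched. */
static size_t flush_async(const void *addr, size_t len)
{
    size_t lines = 0;
    if (len == 0)
        return 0;
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE, lines++) {
#if defined(__CLFLUSHOPT__)
        _mm_clflushopt((void *)p);
#elif defined(__SSE2__)
        _mm_clflush((const void *)p);
#endif
    }
    return lines;
}

/* Examine and flush each request, then fence once for the batch. */
static size_t process_batch(const struct request *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    size_t lines = 0;
    for (size_t i = 0; i < n; i++) {
        /* Application-specific request parsing would go here. */
        lines += flush_async(reqs[i].data, reqs[i].len);
    }
#if defined(__SSE2__)
    _mm_mfence();         /* single wait for all outstanding flushes */
#endif
    /* The system call that sends responses would follow here. */
    return lines;
}
```

Deferring the fence lets the flush of one request proceed while the next request is being examined, which matters more as NVMM access latency grows.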
Experiment Setup
● Intel Xeon E5-2640v4 (2.4 GHz)
● HPE 8GB NVDIMM (NVDIMM-N)
● Intel X540 10 GbE NIC
● Comparison
○ Linux and Stackmap [ATC’16] (current state of the art)
○ Fair to use the same kernel TCP/IP implementation

PASTE: A Network Programming Interface for Non-Volatile Main Memory

  • 1. PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018
  • 2. Review: Memory Hierarchy Slow, block-oriented persistence CPU Caches HDD / SSD Block access w/ system calls Byte access w/ load/store 100-1000s us 70 ns 5-50 ns Main Memory
  • 3. Review: Memory Hierarchy Fast, byte-addressable persistence CPU Caches Block access w/ system calls Byte access w/ load/store 100-1000s us 70 ns 5-50 ns -1000s ns Main Memory HDD / SSD
  • 4. Networking is faster than disks/SSDs 1.2KB durable write over TCP/HTTP Client Server SSD Syscall, PCIe bus, physical media Cables, NICs, TCP/IP, socket API 23us 1300us
  • 5. Networking is slower than NVMM 1.2KB durable write over TCP/HTTP 23us 2us Client Server NVMM Memcpy, memory bus, physical media Cables, NICs, TCP/IP, socket API
  • 6. Networking is slower than NVMM 1.2KB durable write over TCP/HTTP Client Server NVMM Memcpy, memory bus, physical media Cables, NICs, TCP/IP, socket API Client Client nevts = epoll_wait(fds) for (i =0; i < nevts; i++) { read(fds[i], buf); ... memcpy(nvmm, buf); ... write(fds[i], reply) }
  • 7. Innovations at both stacks MegaPipe [OSDI’12] Seastar mTCP [NSDI’14] IX [OSDI’14] Stackmap [ATC’16] NVTree [FAST’15] NVWal [ASPLOS’16] NOVA [FAST’16] Decibel [NSDI’17] LSNVMM [ATC’17] Network stack Storage stack
  • 8. Stacks are isolated MegaPipe [OSDI’12] Seastar mTCP [NSDI’14] IX [OSDI’14] Stackmap [ATC’16] NVTree [FAST’15] NVWal [ASPLOS’16] NOVA [FAST’16] Decibel [NSDI’17] LSNVMM [ATC’17] Network stack Storage stack Costs of moving data
  • 9. Bridging the gap MegaPipe [OSDI’12] Seastar mTCP [NSDI’14] IX [OSDI’14] Stackmap [ATC’16] NVTree [FAST’15] NVWal [ASPLOS’16] NOVA [FAST’16] Decibel [NSDI’17] LSNVMM [ATC’17] Network stack Storage stack PASTE
  • 10. PASTE Design Goals ● Durable zero copy ○ DMA to NVMM ● Selective persistence ○ Exploit modern NIC’s DMA to L3 cache ● Persistent data structures ○ Indexed, named packet buffers backed by a file ● Generality and safety ○ TCP/IP in the kernel and netmap API ● Best practices from modern network stacks ○ Run-to-completion, blocking, busy-polling, batching, etc.
  • 11. PASTE in Action [Diagram: an app thread in user space; NIC, TCP/IP and a file system mounted at /mnt/pm in the kernel. Also shown: the Pring (slots [0] to [7], with a cur pointer, holding Pbufs 20 to 27), the Ppool of Pbufs in shared memory (/mnt/pm/pp, zero copy), and the Plog (/mnt/pm/plog) with fields len, off, pbuf.]
  • 12. PASTE in Action [Diagram: same as slide 11.]
  • 13. PASTE in Action ● poll() system call 1. Run NIC I/O and TCP/IP [Diagram: Pring slots [0] to [7] still hold Pbufs 20 to 27.]
  • 14. PASTE in Action ● poll() system call ○ Got 6 in-order TCP segments 1. Run NIC I/O and TCP/IP [Diagram: Pring slots [0] to [7] still hold Pbufs 20 to 27.]
  • 15. PASTE in Action ● poll() system call ○ They are set to Pring slots 1. Run NIC I/O and TCP/IP [Diagram: Pring slots now hold Pbufs 0, 1, 2, 3, 4, 5, 6, 27; a tail pointer appears next to cur.]
  • 16. PASTE in Action ● Return from poll() 1. Run NIC I/O and TCP/IP [Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6, 27.]
  • 17. PASTE in Action 1. Run NIC I/O and TCP/IP 2. Read data on Pring [Diagram: the app thread reads the Pbufs referenced by the Pring directly from the Ppool (zero copy).]
  • 18. PASTE in Action ● Flush Pbuf data from CPU cache to DIMM ○ clflush(opt) instruction 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) [Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6, 27.]
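Step 3 can be sketched as a loop that flushes every cache line covering a buffer and then issues a store fence. This is a minimal sketch under stated assumptions rather than PASTE's own code: `flush_range` and `CACHE_LINE` are hypothetical names, a 64-byte cache line is assumed, and `_mm_clflush` stands in for `clflushopt` so the block builds without extra compiler flags. On non-x86-64 targets the sketch only counts lines and issues a generic barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
#endif

#define CACHE_LINE 64u   /* assumed cache line size in bytes */

/* Flush all cache lines overlapping [addr, addr + len) and return how
 * many lines were visited. */
size_t flush_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;

    /* Round the start down to a cache line boundary. */
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE) {
#if defined(__x86_64__)
        _mm_clflush((const void *)p);
#endif
        lines++;
    }

    /* Order the flushes before any later stores. */
#if defined(__x86_64__)
    _mm_sfence();
#else
    __sync_synchronize();
#endif
    return lines;
}
```

A 96-byte payload that starts on a line boundary touches two lines, so the cost of this step grows with the payload size rather than with the buffer size.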
  • 19. PASTE in Action ● Pbuf is a persistent data representation ○ Base address is static, i.e., the file (/mnt/pm/pp) ○ Buffers can be recovered after reboot 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) 4. Flush Plog entry(ies) [Diagram: Pring slots hold Pbufs 0, 1, 2, 3, 4, 5, 6, 27; the Plog gains its first entry: pbuf 1, off 120, len 96.]
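Because a Pbuf is named by its index relative to a fixed base, a Plog entry stays meaningful after a reboot once the pool file is mapped again. The sketch below shows one way such an entry could be turned back into a pointer. The struct layout, the field widths, the function name `plog_resolve`, and the 2048-byte buffer size used in the usage note are all assumptions for illustration; only the three fields pbuf, off and len come from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one Plog entry (field widths are assumed). */
struct plog_entry {
    uint32_t pbuf;   /* index of the packet buffer in the pool */
    uint16_t off;    /* offset of the payload inside that buffer */
    uint16_t len;    /* payload length in bytes */
};

/* Resolve an entry against the base address of the mapped pool file.
 * Returns NULL if the entry does not fit inside the pool. */
const char *plog_resolve(const char *pool_base, size_t pbuf_size,
                         size_t npbufs, const struct plog_entry *e)
{
    assert(pool_base != NULL && e != NULL);
    if (e->pbuf >= npbufs)
        return NULL;                       /* buffer index out of range */
    if ((size_t)e->off + e->len > pbuf_size)
        return NULL;                       /* payload overruns the buffer */
    return pool_base + (size_t)e->pbuf * pbuf_size + e->off;
}
```

With 2048-byte buffers, the entry shown on the slide (pbuf 1, off 120, len 96) resolves to byte 2168 of the pool, wherever the file happens to be mapped.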
  • 20. PASTE in Action ● Prevent the kernel from recycling the buffer 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) 4. Flush Plog entry(ies) 5. Swap out Pbuf(s) [Diagram: Pring slot [1] now holds Pbuf 8 in place of Pbuf 1; the Plog entry is pbuf 1, off 120, len 96.]
  • 21. PASTE in Action ● Same for Pbuf 2 and 6 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) 4. Flush Plog entry(ies) 5. Swap out Pbuf(s) [Diagram: Pring slots now hold Pbufs 0, 8, 9, 3, 4, 5, 10, 27; the Plog lists Pbufs 1, 2 and 6, each with len 96, with offsets 120, 768 and 987 shown.]
  • 22. PASTE in Action ● Advance cur ○ Return buffers in slot 0-6 to the kernel at next poll() 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) 4. Flush Plog entry(ies) 5. Swap out Pbuf(s) 6. Update Pring [Diagram: Pring slots hold Pbufs 0, 8, 9, 3, 4, 5, 10, 27; cur has moved up to tail.]
  • 23. PASTE in Action 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) 4. Flush Plog entry(ies) 5. Swap out Pbuf(s) 6. Update Pring [Diagram: same state as slide 22, with the Plog labelled "Write-Ahead Logs".]
  • 24. PASTE in Action ● We can organize various data structures in Plog 1. Run NIC I/O and TCP/IP 2. Read data on Pring 3. Flush Pbuf(s) 4. Flush Plog entry(ies) 5. Swap out Pbuf(s) 6. Update Pring [Diagram: the Plog is labelled "B+tree", showing the values 53, 0, 5, 7 and the entries (1, 96, 120), (2, 96, 987), (6, 96, 512); Pring slots hold Pbufs 0, 8, 9, 3, 4, 5, 10, 27.]
  • 25. Evaluation 1. How does PASTE outperform existing systems? 2. Is PASTE applicable to existing applications? 3. Is PASTE useful for systems other than file/DB storage?
  • 26. How does PASTE outperform existing systems? [Plots: WAL and B+tree (all writes), for 64B and 1280B requests.] What if we use more complex data structures?
  • 27. How does PASTE outperform existing systems? [Plots: WAL and B+tree (all writes), for 64B and 1280B requests.]
  • 28. Is PASTE applicable to existing applications? ● Redis [Plots: YCSB (read mostly) and YCSB (update heavy).]
  • 29. Is PASTE useful for systems other than DB/file storage? ● Packet logging prior to forwarding ○ Fault-tolerant middlebox [Sigcomm’15] ○ Traffic recording ● Extend mSwitch [SOSR’15] ○ Scalable NFV backend switch
  • 30. Conclusion ● PASTE is a network programming interface that: ○ Enables durable zero copy to NVMM ○ Helps apps organize persistent data structures on NVMM ○ Lets apps use TCP/IP and be protected ○ Offers high-performance network stack even w/o NVMM https://github.com/luigirizzo/netmap/tree/paste micchie@sfc.wide.ad.jp or @michioh
  • 32. Further Opportunity with Co-designed Stacks ● What if we use higher access latency NVMM? ○ e.g., 3D XPoint ● Overlap flushes and processing with clflushopt and mfence before system call (triggers packet I/O) ○ See the paper for results [Timeline: system call (receive new requests), examine request, clflushopt, examine request, clflushopt, mfence (wait for flushes done), system call (send responses).]
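The overlap described on this slide can be sketched as a batch loop that issues flushes while it examines requests and waits only once, just before the system call that sends the responses. This is a hedged illustration, not the measured implementation: `struct request` and `process_batch` are hypothetical, a 64-byte cache line is assumed, and `_mm_clflush` is used as a stand-in for `clflushopt` so the block builds without extra compiler flags (with plain `clflush` the flushes may overlap less than the slide intends).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

/* Hypothetical view of one received request. */
struct request {
    const void *data;
    size_t len;
};

/* Examine each request and start flushing its data, then wait once for
 * all flushes. Returns the number of fences issued for the batch. */
unsigned process_batch(const struct request *reqs, size_t n)
{
    unsigned fences = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* "Examine request" would happen here; flushes are issued
         * without waiting for them to complete. */
        uintptr_t p = (uintptr_t)reqs[i].data & ~(uintptr_t)63;
        uintptr_t end = (uintptr_t)reqs[i].data + reqs[i].len;
        for (; p < end; p += 64) {
#if defined(__x86_64__)
            _mm_clflush((const void *)p);
#endif
        }
    }

    if (n > 0) {
        /* One fence per batch, right before the system call. */
#if defined(__x86_64__)
        _mm_mfence();
#else
        __sync_synchronize();
#endif
        fences++;
    }
    return fences;
}
```

The point of the design is that the wait for persistence is paid once per batch instead of once per request, which matters more as NVMM access latency grows.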
  • 33. Experiment Setup ● Intel Xeon E5-2640v4 (2.4 GHz) ● HPE 8GB NVDIMM (NVDIMM-N) ● Intel X540 10 GbE NIC ● Comparison ○ Linux and Stackmap [ATC’16] (current state of the art) ○ Fair to use the same kernel TCP/IP implementation