Millions of transactions per second, with an
advanced new programming model
Seastar
"How multifarious and how mutually complicated are the considerations which the working of such an engine involve. There are frequently several distinct sets of effects going on simultaneously; all in a manner independent of each other, and yet to a greater or less degree exercising a mutual influence. To adjust each to every other, and indeed even to perceive and trace them out with perfect correctness and success, entails difficulties whose nature partakes to a certain extent of those involved in every question where conditions are very numerous and inter-complicated."
— Ada Lovelace
Hardware outgrowing software
+ CPU clocks not getting faster.
+ More cores, but hard to use them.
+ Locks have costs even when there is no contention
+ Data is allocated on one core, then copied and used on others
+ Result: software can't keep up with new hardware (SSDs, 10 Gbps networking, …)
[Diagram: the traditional stack; application threads sit on top of the kernel's scheduler, TCP/IP stack, queues, and NIC queues, with kernel-managed memory]
Workloads changing
+ Complex, multi-layered applications
+ NoSQL data stores
+ More users
+ Lower latencies needed
+ Microservices
- 81% of Redis processing is in the kernel.
- If 100 requests are needed to build a page, the "99th percentile" latency affects 63% of pageviews (since 1 - 0.99^100 ≈ 0.63).
[Diagram: the same traditional kernel stack as on the previous slide]
7 Million IOPS
Benchmark hardware
■ 2× Xeon E5-2695 v3, 2.3 GHz, 35 MB cache, 14 cores each (28 cores total, 56 hyperthreads)
■ 8× 8 GB Micron DDR4 memory
■ Intel Ethernet CNA XL710-QDA1
A new model
Threads
- Costly locking (example: POSIX requires multiple threads to be able to use the same socket)
+ Uses available skills and tools
Shared-nothing
+ Fewer wasted cycles
- Cross-core communication must be explicit, so it is harder to program
How
■ A single-threaded async engine running on each CPU
■ No threads
■ No shared data
■ All inter-CPU communication is by message passing (see the sketch below)
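A minimal sketch of what explicit message passing looks like in code, assuming current Seastar headers and the seastar::smp::submit_to API (the per-core counter and the shard choice are made up for illustration):

#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// Hypothetical per-core counter: each shard owns its own copy,
// so updating it never needs a lock.
static thread_local long hits = 0;

// Ask the engine on shard `target` to bump its counter and hand the
// new value back as a future. Cross-core communication is always an
// explicit message like this, never a shared write.
seastar::future<long> bump_on(unsigned target) {
    return seastar::smp::submit_to(target, [] {
        return ++hits;   // runs on the target core's engine
    });
}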
Linear scaling
+ One engine runs on each core
+ Shared-nothing per-core design
+ Fits the existing shared-nothing distributed-applications model
+ Full kernel bypass, supports zero-copy
+ No threads, no context switches, and no locks!
+ Instead, asynchronous lambda invocation
[Diagram: SeaStar's sharded stack; each core runs its own application shard, task scheduler, userspace TCP/IP stack, and DPDK-driven NIC queue, connected to the other cores by SMP queues; the kernel isn't involved]
Comparison with old school
Traditional stack: [Diagram: application threads on top of the kernel's scheduler, TCP/IP stack, queues, NIC queues, and kernel memory]
SeaStar's sharded stack: [Diagram: per-core application shard, task scheduler, userspace TCP/IP, SMP queues, and DPDK-driven NIC queue; the kernel is not involved]
Millions of connections
SeaStar's sharded stack: [Diagram: each CPU runs many promise/task pairs]
A promise is a pointer to an eventually computed value. A task is a pointer to a lambda function.
Traditional stack: [Diagram: each CPU runs a scheduler juggling many threads, each with its own stack]
A thread is a function pointer. A stack is a byte array from 64 KB to megabytes.
But how can you program it?
■ Ada Lovelace's problem, today
■ Need the maximum possible "easy" without giving up any "fast."
(If the answer were "no", would this book be 467 pages long?)
Basic model
■ Futures
■ Promises
■ Continuations
F-P-C defined: Future
A future is the result of a computation that may not be available yet:
■ a data buffer that we are reading from the network
■ the expiration of a timer (sketched below)
■ the completion of a disk write
■ the result of a computation that requires the values of one or more other futures
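For instance, a timer expiration can be consumed directly as a future; a minimal sketch, assuming Seastar's seastar::sleep() helper, which is not shown on the slides:

#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

// The future returned by sleep() becomes ready when the timer expires;
// .then() attaches the continuation to run at that point.
seastar::future<> delayed_hello() {
    using namespace std::chrono_literals;
    return seastar::sleep(1s).then([] {
        std::cout << "one second later\n";
    });
}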
F-P-C defined: Promise
A promise is an object or function that
provides you with a future, with the
expectation that it will fulfill the future.
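A sketch of the producer side with a bare promise (the names are illustrative and assume current Seastar's promise type): whoever holds the promise fulfills it, and whoever holds the matching future consumes the value, in either order.

#include <seastar/core/future.hh>

seastar::promise<int> answer;                 // held by the producer

// Consumer: obtain the future now and attach a continuation for later.
seastar::future<> consume() {
    return answer.get_future().then([] (int v) {
        // runs once the promise has been fulfilled
        (void)v;
    });
}

// Producer: fulfilling the promise makes the future ready and causes
// the continuation above to be scheduled as a task.
void produce() {
    answer.set_value(42);
}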
Basic future/promise
future<int> get();  // promises an int will be produced eventually
future<> put(int);  // promises to store an int
void f() {
  get().then([] (int value) {
    put(value + 1).then([] {
      std::cout << "value stored successfully\n";
    });
  });
}
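For context, a sketch of how such a function could be driven from main(), assuming Seastar's app_template (not part of the slides); get() and put() are the declarations from the slide above:

#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>

// The slide's declarations, repeated so the sketch is self-contained.
seastar::future<int> get();
seastar::future<> put(int);

// app_template starts the per-core engines and runs the lambda on core 0;
// the reactor keeps running until the returned future resolves.
int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return get().then([] (int value) {
            return put(value + 1);
        });
    });
}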
Chaining
future<int> get();  // promises an int will be produced eventually
future<> put(int);  // promises to store an int
void f() {
  get().then([] (int value) {
    return put(value + 1);
  }).then([] {
    std::cout << "value stored successfully\n";
  });
}
Zero copy friendly
future<temporary_buffer> socket::read(size_t n);
■ temporary_buffer points at driver-provided pages if possible
■ stack can linearize scatter-gather buffers using page tables
■ discarded after use
Zero copy friendly (2)
pair<future<size_t>, future<temporary_buffer>> socket::write(temporary_buffer);
■ First future becomes ready when the TCP window allows sending more data (usually immediately)
■ Second future becomes ready when the buffer can be discarded (after TCP ACK)
■ May complete in any order
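A sketch of how a caller might use the two futures, taking the simplified types and signatures on this slide at face value (the real Seastar socket API differs in detail):

// Hypothetical send step using the slide's simplified write() signature.
// The "sent" future gates further sends; the "released" future says when
// the buffer's pages may be reused (after the TCP ACK). No copies are made.
future<> send_one(socket& s, temporary_buffer buf) {
    auto futs = s.write(std::move(buf));
    auto sent = std::move(futs.first);
    auto released = std::move(futs.second);
    return sent.then([] (size_t) {
        // TCP window has room again; the next buffer could be queued here
    }).then([released = std::move(released)] () mutable {
        return std::move(released);          // now wait for the ACK side
    }).then([] (temporary_buffer) {
        // the buffer can finally be discarded
    });
}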
Fully async filesystem
No threads
read_metadata().then([] {
return lock_pages();
}).then([] {
return read_data();
});
Shared state: networking
■ No shared state except the index of net channels (one per CPU)
■ No migration of existing TCP connections
Handling shared state: block
■ Each CPU is responsible for handling specific files/directories/free blocks (assigned by hash)
■ Can delegate access to another CPU for locality, but not concurrent shared access (sketched below)
■ Flash optimized - no fancy layout
■ DMA only
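A sketch of the delegation idea from the second bullet, using a hash of the path to pick the owning CPU; seastar::smp::submit_to is assumed as the messaging primitive, and the shard-local map is purely illustrative:

#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative shard-local metadata; each CPU only ever touches its own copy.
static thread_local std::unordered_map<std::string, size_t> local_sizes;

// Map each path to the single CPU that owns its metadata.
unsigned owner_of(const std::string& path) {
    return std::hash<std::string>{}(path) % seastar::smp::count;
}

// Delegate the lookup to the owning CPU: no locks, no concurrent sharing.
seastar::future<size_t> file_size(const std::string& path) {
    return seastar::smp::submit_to(owner_of(path), [path] {
        return local_sizes[path];   // runs only on the owner, so this is safe
    });
}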
Deployment models
■ Linux process: Seastar TCP stack on DPDK
■ Linux process: Linux sockets
■ OSv: Seastar TCP stack on virtio or raw device access
■ OSv: OSv networking
Licensing
■ Apache
■ Goals: compatibility and contributor safety
Performance results
■ Linear scaling to 20 cores and beyond
■ 250,000 transactions/core (memcached)
■ Currently limited by client. More client
development in progress.
Applications
■ HTTP server
■ NoSQL system
■ Distributed filesystem
■ Object store
■ Transparent proxy
■ Cache (Memcache, CDN,..)
■ NFV
Thank you
http://www.seastar-project.org/
@CloudiusSystems


Editor's Notes

  1. 318,715 transactions/core at 2 cores, 274,114 transactions/core at 16 cores…
  2. 250,000 transactions/core
  3. Slide 7 - Locking is only part of the problem, and it is mostly eliminated by "lock-free" alternatives to locking. The other problems are cache-line bouncing, and slow atomic operations and memory barriers. A "shared nothing" design cannot eliminate all of these (we still communicate between cores), but it can minimize them by making it very explicit when these things happen. If I understood Avi correctly, he also says that another problem with the thread model is that the large stacks also mean large cache pollution on context switches, while our tiny "task" switches don't have large cache pollution. You even mention this later on, but I have to admit I'm not completely convinced this is the case (even if the stack is large, the threads use only a tiny portion of it?).
  4. http://aws.amazon.com/ec2/pricing/
  5. http://aws.amazon.com/ec2/pricing/
  6. http://aws.amazon.com/ec2/pricing/
  7. Promises and futures simplify asynchronous programming since they decouple the event producer (the promise) and the event consumer (whoever uses the future). Whether the promise is fulfilled before the future is consumed, or vice versa, does not change the outcome of the code.