Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

Nadav Har'El from the ScyllaDB team, on the path to 10x Cassandra performance.
Seastar / ScyllaDB intro at The Generalist Engineer meetup.

Published in: Engineering

1. Seastar, or how we implemented a 10-times faster Cassandra
   Nadav Har'El, ScyllaDB
   The Generalist Engineer meetup, Tel-Aviv
   Ides of March, 2016
2. ● Israeli but multi-national startup company
   – 15 developers cherry-picked from 10 countries.
   ● Founded 2013 (“Cloudius Systems”)
   – by Avi Kivity and Dor Laor of KVM fame.
   ● Fans of open-source: OSv, Seastar, ScyllaDB.
3. Your mission, should you choose to accept it:
   Make Cassandra 10 times faster
4. “Make Cassandra 10 times faster”
   ● Why 10?
   ● Why Cassandra?
   – Popular NoSQL database (2nd to MongoDB).
   – Powerful and widely applicable.
   – Example of a wider class of middleware.
   ● Why “mission impossible”?
   – Cassandra is not considered particularly slow;
   – considered faster than MongoDB, HBase, et al.
   – “Disk is the bottleneck” (no longer, with SSDs!)
5. Our first attempt: OSv
   ● A new OS designed specifically for cloud VMs:
   – Run a single application per VM (“unikernel”)
   – Run existing Linux applications (Cassandra)
   – Run these faster than Linux.
6. OSv
   ● Some of the many ideas we used in OSv:
   – Single address space.
   – System call is just a function call.
   – Faster context switches.
   – No spin locks.
   – Smaller code.
   – Redesigned network stack (Van Jacobson).
7. OSv
   ● Writing an entire OS from scratch was a really fun exercise for our generalist engineers.
   ● A full description of OSv is beyond the scope of this talk. Check out:
   – “OSv—Optimizing the Operating System for Virtual Machines”, USENIX ATC 2014.
8. Cassandra on OSv
   ● cassandra-stress, READ, 4 vCPUs:
   on OSv, 34% faster than Linux.
   ● Very nice, but not even close to our goal. What are the remaining bottlenecks?
9. Bottlenecks: API locks
   ● In one profile, we saw 20% of the run spent in lock() and unlock() operations, most of them uncontended.
   – POSIX APIs allow threads to share
   ● file descriptors
   ● sockets
   – As many as 20 lock/unlock pairs for each network packet!
   ● Uncontended locks were efficient on UP (a flag to disable preemption), but atomic operations are slow on many cores.
10. Bottlenecks: API copies
    ● The write/send system calls copy user data to the kernel
    – Even on OSv, with no user-kernel separation
    – Part of the socket API
    ● Similar for read
11. Bottlenecks: context switching
    ● One thread per CPU is optimal; more than one requires:
    – Context switch time
    – Stacks that consume memory and pollute the CPU cache
    – Thread imbalance
    ● Requires fully non-blocking APIs
    – Cassandra uses mmap() for disk I/O…
12. Bottlenecks: unscalable applications
    ● Contended locks ruin scalability to many cores
    – Memcached's counter and shared cache
    ● Solution: per-CPU data.
    ● Even lock-free atomic algorithms are unscalable
    – Cache-line bouncing
    ● Again, better to shard data, not share it.
    – Becomes worse as the core count grows
    ● NUMA
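To make the per-CPU-data point concrete, here is a small illustrative sketch (plain C++, not from the slides): a single shared atomic counter bounces its cache line between every core that touches it, while per-core counters padded to a cache line each stay local. kMaxCores, record_hit_* and total_hits are hypothetical names.

    #include <array>
    #include <atomic>
    #include <cstdint>

    constexpr unsigned kMaxCores = 64;              // illustrative upper bound

    // Unscalable: every core's increment drags the same cache line around.
    std::atomic<uint64_t> shared_hits{0};

    // Sharded: one slot per core, padded to a full cache line, so each core
    // only ever writes its own line; a reader sums the slots.
    struct alignas(64) PerCoreCounter { std::atomic<uint64_t> value{0}; };
    std::array<PerCoreCounter, kMaxCores> per_core_hits;

    void record_hit_shared() {
        shared_hits.fetch_add(1, std::memory_order_relaxed);
    }

    void record_hit_sharded(unsigned cpu) {
        per_core_hits[cpu].value.fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t total_hits() {
        uint64_t sum = 0;
        for (auto& c : per_core_hits) {
            sum += c.value.load(std::memory_order_relaxed);
        }
        return sum;
    }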
13. Therefore
    ● Need to provide better APIs for server applications
    – Not file descriptors, sockets, threads, etc.
    ● Need to write better applications.
14. Framework
    ● One thread per CPU
    – Event-driven programming
    – Everything (network & disk) is non-blocking
    – How to write complex applications?
15. Framework
    ● Sharded (shared-nothing) applications
    – Important!
16. Framework
    ● A language with no runtime overheads or built-in data sharing
17. Seastar
    ● C++14 library
    ● For writing new high-performance server applications
    ● Share-nothing model, fully asynchronous
    ● Based on futures & continuations
    – Unified API for all asynchronous operations
    – Compose complex asynchronous operations
    – The key to complex applications
    ● (Optionally) full zero-copy user-space TCP/IP (over DPDK)
    ● Open source: http://www.seastar-project.org/
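For orientation, a minimal Seastar program looks roughly like this. This is a hedged sketch against a recent Seastar checkout: the header paths and the seastar:: namespace prefix have changed over the years, so treat the includes as assumptions.

    #include <iostream>
    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>

    // Parse the command line, start a reactor thread per core, run one
    // continuation on core 0, and shut down when its future resolves.
    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            std::cout << "Hello from the Seastar reactor\n";
            return seastar::make_ready_future<>();
        });
    }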
18. Seastar linear scaling in #cores
19. Seastar linear scaling in #cores
20. Brief introduction to Seastar
21. Sharded application design
    ● One thread per CPU
    ● Each thread handles one shard of data
    – No shared data (“share nothing”)
    – Separate memory per CPU (NUMA-aware)
    – Message passing between CPUs
    – No locks or cache-line bounces
    ● Reactor (event loop) per thread
    ● User-space network stack also sharded
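The "message passing between CPUs" bullet typically looks like this in code: hash a key to its owning shard and run the update there with seastar::smp::submit_to, so the data is only ever touched by one core. A hedged sketch: submit_to and smp::count are Seastar APIs, while counters and shard_increment are hypothetical names, and the header paths assume a recent checkout.

    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <functional>
    #include <string>
    #include <unordered_map>

    // One private map per core (Seastar runs one thread per core, so
    // thread_local data is effectively per-shard); it is never locked.
    thread_local std::unordered_map<std::string, long> counters;

    // Hypothetical helper: bump a counter on whichever shard owns the key.
    seastar::future<long> shard_increment(std::string key) {
        unsigned owner = std::hash<std::string>{}(key) % seastar::smp::count;
        return seastar::smp::submit_to(owner, [key = std::move(key)] {
            return ++counters[key];          // runs only on the owning core
        });
    }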
22. Futures and continuations
    ● Futures and continuations are the building blocks of asynchronous programming in Seastar.
    ● They can be composed together into a large, complex, asynchronous program.
23. Futures and continuations
    ● A future is a result which may not be available yet:
    – Data buffer from the network
    – Timer expiration
    – Completion of a disk write
    – The result of a computation which requires the values from one or more other futures.
    ● future<int>
    ● future<>
24. Futures and continuations
    ● An asynchronous function (also a “promise”) is a function returning a future:
    – future<> sleep(duration)
    – future<temporary_buffer<char>> read()
    ● The function sets up for the future to be fulfilled
    – sleep() sets a timer to fulfill the future it returns
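The promise side can be sketched explicitly. eventually_42 is a hypothetical name; in a real asynchronous function the promise would be kept alive and fulfilled later by an I/O or timer completion rather than immediately.

    #include <seastar/core/future.hh>

    // The promise/future pair behind an asynchronous function: the caller
    // gets the future right away, and it becomes ready only when someone
    // calls set_value() on the matching promise.
    seastar::future<int> eventually_42() {
        seastar::promise<int> p;
        auto f = p.get_future();
        // A real async function would stash the promise and fulfill it from
        // a completion callback; fulfilling it here just shows the mechanics.
        p.set_value(42);
        return f;
    }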
25. Futures and continuations
    ● A continuation is a callback, typically a lambda, executed when a future becomes ready:
    – sleep(1s).then([] { std::cerr << "done"; });
    ● A continuation can hold state (lambda capture):
    – future<int> slow_incr(int i) {
          return sleep(10ms).then([i] { return i + 1; });
      }
26. Futures and continuations
    ● Continuations can be nested:
    – future<int> get();
      future<> put(int);
      get().then([] (int value) {
          put(value + 1).then([] {
              std::cout << "done";
          });
      });
    ● Or chained:
    – get().then([] (int value) {
          return put(value + 1);
      }).then([] {
          std::cout << "done";
      });
27. Futures and continuations
    ● Parallelism is easy:
    – sleep(100ms).then([] {
          std::cout << "100ms\n";
      });
      sleep(200ms).then([] {
          std::cout << "200ms\n";
      });
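When the caller does need to continue only after both parallel operations finish, Seastar's future utilities provide when_all. A hedged sketch: wait_for_both is a hypothetical name and the header paths are from a recent release.

    #include <iostream>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/when_all.hh>

    using namespace std::chrono_literals;

    // Start both sleeps in parallel; the continuation runs once both are done.
    seastar::future<> wait_for_both() {
        return seastar::when_all(seastar::sleep(100ms), seastar::sleep(200ms))
            .then([] (std::tuple<seastar::future<>, seastar::future<>>) {
                // Both inner futures are already ready here; ignored for brevity.
                std::cout << "both timers fired\n";
            });
    }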
28. Futures and continuations
    ● In Seastar, every asynchronous operation is a future:
    – Network read or write
    – Disk read or write
    – Timers
    – …
    – A complex combination of other futures
    ● Useful for everything from writing a network stack to writing a full, complex application.
29. Network zero-copy
    ● future<temporary_buffer> input_stream::read()
    – temporary_buffer points at driver-provided pages, if possible.
    – Automatically discarded after use (C++).
    ● future<> output_stream::write(temporary_buffer)
    – The future becomes ready when the TCP window allows further writes (usually immediately).
    – The buffer is discarded after the data is ACKed.
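Put together, the two calls above give the usual read-then-write step. The sketch below assumes Seastar's connected_socket, do_with and the stream API named on the slide; echo_once is a hypothetical helper and the header paths are from a recent checkout.

    #include <seastar/core/do_with.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <seastar/net/api.hh>

    // Read one buffer and write it straight back out, without copying it.
    seastar::future<> echo_once(seastar::connected_socket s) {
        auto in = s.input();
        auto out = s.output();
        return seastar::do_with(std::move(s), std::move(in), std::move(out),
            [] (auto& s, auto& in, auto& out) {
                // s is only kept here so the connection outlives the streams.
                return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                    if (buf.empty()) {               // peer closed the connection
                        return seastar::make_ready_future<>();
                    }
                    return out.write(std::move(buf)).then([&out] {
                        return out.flush();
                    });
                });
            });
    }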
30. Two TCP/IP implementations (diagram):
    ● A common networking API sits on top of two interchangeable stacks:
    – POSIX (hosted) stack: sockets provided by the Linux kernel.
    – Seastar (native) stack: user-space TCP/IP over an interface layer (DPDK, Virtio, Xen, igb, ixgb).
31. Disk I/O
    ● Asynchronous and zero-copy, using AIO and O_DIRECT.
    ● Not implemented well by all filesystems
    – XFS recommended
    ● Focusing on SSDs
    ● Future thoughts:
    – Direct NVMe support
    – Implement a filesystem in Seastar
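As a rough illustration of that path, Seastar's file API (open_file_dma, file::dma_read) issues the O_DIRECT/AIO operations described above. A hedged sketch: read_first_block is a hypothetical helper, the 4096-byte read respects O_DIRECT alignment, and the header paths assume a recent checkout.

    #include <seastar/core/file.hh>
    #include <seastar/core/seastar.hh>
    #include <seastar/core/sstring.hh>
    #include <seastar/core/temporary_buffer.hh>

    // Open a file for DMA (O_DIRECT) access, read its first 4 KiB block
    // asynchronously, then close the file.
    seastar::future<> read_first_block(seastar::sstring name) {
        return seastar::open_file_dma(name, seastar::open_flags::ro)
            .then([] (seastar::file f) {
                return f.dma_read<char>(0, 4096).then(
                    [f] (seastar::temporary_buffer<char> buf) mutable {
                        // buf holds the first block, read without extra copies.
                        return f.close();
                    });
            });
    }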
32. More info on Seastar
    ● http://seastar-project.com
    ● https://github.com/scylladb/seastar
    ● http://docs.seastar-project.org/
    ● http://docs.seastar-project.org/master/md_doc_tutorial.html
33. ScyllaDB
    ● NoSQL database, implemented in Seastar.
    ● Fully compatible with Cassandra:
    – Same CQL queries
    – Copy over a complete Cassandra database
    – Use existing drivers
    – Use the existing cassandra.yaml
    – Use the same nodetool or JMX console
    – Can be clustered (of course…)
34. Caching in Cassandra vs. ScyllaDB (diagram):
    ● Cassandra: key cache, row cache (on-heap / off-heap), Linux page cache, SSTables.
    ● ScyllaDB: one unified cache over the SSTables.
    ● Don't double-cache.
    ● Don't cache unrelated rows.
    ● Don't cache unparsed sstables.
    ● Can fit much more into cache.
    ● No page faults, threads, etc.
35. Scylla vs. Cassandra
    ● Single-node benchmark:
    – 2 x 12-core x 2-hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
    ● cassandra-stress results (operations per second):

      Benchmark   ScyllaDB    Cassandra
      Write       1,871,556     251,785
      Read        1,585,416      95,874
      Mixed       1,372,451     108,947
36. Scylla vs. Cassandra
    ● We really got a 7x – 16x speedup!
    ● Reads sped up more:
    – Cassandra writes are simpler
    – Row-cache benefits further improve Scylla's reads
    ● Almost 2 million writes per second on a single machine!
    – Google reported in their blogs achieving 1 million writes per second on 330 (!) machines
    – (2 years ago, and RF=3… but still impressive).
37. Scylla vs. Cassandra
    3-node cluster, 2x12 cores each; RF=3, CL=quorum
38. Better latency, at all load levels
39. What will you do with 10x performance?
    ● Shrink your cluster by a factor of 10
    ● Use stronger (but slower) data models
    ● Run more queries – more value from your data
    ● Stop using caches in front of databases
40.
41. Do we qualify?
    In 3 years, our small team wrote:
    ● A complete kernel and library (OSv).
    ● An asynchronous programming framework (Seastar).
    ● A complete Cassandra-compatible NoSQL database (ScyllaDB).
42.
43. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645402.
