Scaling Cassandra up and down into containers with ZFS
Chris Burroughs
AddThis
2015-09-24
Chris Burroughs (AddThis) #CassandraSummit 2015-09-24 1 / 52
Hello!
Chris Burroughs chris.burroughs@gmail.com @csby54
Engineer at AddThis
Co-organizer of the Cassandra DC Meetup
Occasional contributor
interrupt me!
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
AddThis by the Numbers
80,000 request/second (3 billion views/day)
Tools on over 14 million domains
Mostly Java on Linux
Moving towards SOA / microservices
Multiple engineering “squads” with significant discretion
Cassandra at AddThis
Cassandra in production since 0.6
About a dozen clusters, new one created per use-case or SLA
Primarily used for latency sensitive, read-mostly storage
Every cluster is multi-DC
Virtually every page load with AddThis tools results in at least one read to Cassandra
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
On Abstractions
Typical Storage:
Many moving parts: block devices, partitions, raid, volume manager.
Big Plan Up Front: changing partitions is either not done or painful.
Data integrity not warm and fuzzy: hope fsck works.
Typical Memory:
Virtual memory and malloc/free. Add more DRAM if needed.
Maybe worry about NUMA, at runtime.
ZFS
A storage sub-system (fs, raid, volume manager)
Always consistent on disk (no fsck)
End-to-end data integrity
Universal: file-system, block, NFS, SMB
Concise, simple administrative tools
Scalable data structures (2^78 byte max pool size)
Started by Jeff Bonwick and Matthew Ahrens at Sun around 2001.
Available for: Illumos, Solaris, FreeBSD, Linux, MacOS X
Does for storage what VM did for memory.
Timeline
2001: development started at Sun by Jeff Bonwick and Matthew Ahrens
2005: ZFS source code released
2008: ZFS released in FreeBSD 7.0
2010: Oracle proprietary fork, illumos project continues open-source development
2013: ZFS on (native) Linux GA
2013: Open-source ZFS bands together to form OpenZFS
2014: (new) OpenZFS for Mac OS X launch
Rampant Layering Violation
Universal Storage
Compression, snapshots, etc. are common features of all datasets.
MOOOOOOOOOOOOOOOOO
COW Bonus: Snapshots
Snapshots: Read only copy at a point in time
                           create   delete   incremental
Traditional (rsync-esque)  O(n)     O(n)     O(n)
ZFS                        O(1)     O(Δ)     O(Δ)
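The O(1)/O(Δ) claims can be seen at the command line; a minimal sketch, reusing the `tank/sstables` dataset from the later production example (the snapshot names and `backup-host` are hypothetical):

```shell
# Creating a snapshot is O(1): it just records the current state,
# no data is copied.
zfs snapshot tank/sstables@before-compaction

# Later operations only touch blocks that changed since the snapshot
# (O(delta)), e.g. an incremental send to another host:
zfs snapshot tank/sstables@after-compaction
zfs send -i tank/sstables@before-compaction tank/sstables@after-compaction \
    | ssh backup-host zfs recv tank/sstables-copy
```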
Clones
“Clone” a snapshot to create a writeable dataset
Only pay for the difference in accumulated changes
Clones with no changes take up no space!
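A minimal clone sketch (the `tank/base` dataset and snapshot name are hypothetical):

```shell
# Snapshot a "golden" dataset, then clone it into a writable copy.
zfs snapshot tank/base@golden
zfs clone tank/base@golden tank/scratch

# The clone shares all blocks with the snapshot; USED grows only as
# the clone diverges from it.
zfs list -o name,used,refer tank/base tank/scratch
</imports>
```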
Hybrid Storage
Wait, there is more!
End-To-End data integrity
Online Everything: Expansion, scrubbing, resilvering, “fsck”, etc.
Copy On Write with linearized writes
Transparent compression
Dataset send/recv
Flexible mount points
Nested property based configuration
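Nested, property-based configuration in practice; a short sketch using the pool layout from the later production example:

```shell
# Properties set on a parent dataset are inherited by its children:
zfs set compression=lz4 tank
zfs create tank/sstables

# The SOURCE column shows where the value came from
# ("inherited from tank"):
zfs get compression tank/sstables

# A child can override the parent, and revert to inheritance later:
zfs set compression=off tank/sstables
zfs inherit compression tank/sstables
```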
zpool A collection of devices that provides physical storage and data replication.
vdev A device (or collection of devices) with certain performance or fault-tolerance
characteristics. Building blocks of a zpool.
dataset The “data things” (usually filesystems) created on your zpool. Nested in a
hierarchy.
property Key-value pairs used for configuration or reporting of datasets. Inherited in
hierarchy.
ARC Adaptive Replacement Cache. Like a ZFS specific page cache.
The ARC: your new best friend
$ arcstat.py -f time,read,miss,hit% 1
time read miss hit%
16:26:04 16 7 56
16:26:05 1.4K 671 52
16:26:06 1.9K 900 53
16:26:07 2.1K 972 53
16:26:08 1.5K 697 54
16:26:09 2.1K 906 56
16:26:10 1.9K 844 54
$ arcstat.py -f time,read,miss,hit% 1
time read miss hit%
16:25:47 5.1K 0 100
16:25:48 413K 0 100
16:25:49 403K 0 100
16:25:50 402K 0 100
dstat plugin
$ dstat --cpu --zfs-arc --zfs-l2arc
----total-cpu-usage---- -----------ZFS-ARC----------- -------------ZFS-L2ARC--
usr sys idl wai hiq siq| mem hit miss reads hit%| size hit miss hit%
1 1 94 4 0 0|15.0G 796B 372B 1167B 68.2B| 206G 343B 28.8B 92.3B
1 1 95 4 0 0|15.0G 557B 395B 952B 58.5B| 206G 376B 19.0B 95.2B
1 1 95 3 0 0|15.0G 553B 358B 911B 60.7B| 206G 344B 14.0B 96.1B
1 1 95 4 0 0|15.0G 686B 412B 1098B 62.5B| 206G 396B 16.0B 96.1B
1 1 95 4 0 0|15.0G 712B 409B 1121B 63.5B| 206G 386B 23.0B 94.4B
1 1 96 3 0 0|15.0G 446B 331B 777B 57.4B| 206G 307B 24.0B 92.7B
1 1 86 13 0 0|15.0G 708B 332B 1040B 68.1B| 206G 310B 22.0B 93.4B
2 1 93 4 0 0|15.0G 1094B 294B 1388B 78.8B| 206G 280B 14.0B 95.2B
hat tip: @AlTobey
Intelligent Prefetch
Cassandra Test cat */*.junk > /dev/null:
page cache 98th percentile reads reported by Cassandra increased 4-6x
arc 98th percentile reads reported by Cassandra increased 2x
LSM trees (Cassandra, HBase, LevelDB, RocksDB, BerkeleyDB) mean that linear scans are
common on modern storage systems.
Commands
Only 2 you are ever likely to use
zpool Configure pools
zfs Configure file systems
zdb (Detailed debugging dump)
production-esque example
# zpool create -f tank mirror /dev/sdb /dev/sdc \
    mirror /dev/sdd /dev/sde \
    cache /dev/sdf
# zfs create tank/sstables
# zfs set mountpoint=/data/sstables tank/sstables
# zfs set compression=lz4 tank
# zfs set atime=off tank
# chown cassandra:cassandra /data/sstables/
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 374G 1.42T 30K /tank
tank/sstables 374G 1.42T 374G /data/sstables
# zfs get compressratio tank/sstables
NAME PROPERTY VALUE SOURCE
tank/sstables compressratio 1.08x -
ZFS on Linux History
In days gone by there was a FUSE project.
Port started by LLNL as a backend for their supercomputer.
Early 2013: “Ready for wide scale deployment on everything from desktops to super
computers.”
Late 2014: Illumos/FreeBSD/Linux developers form OpenZFS group to coordinate
development.
Today
0.6.5 released September 2015
Native packages for most distributions
Active user community
http://zfsonlinux.org
#zfsonlinux on freenode
zfs-discuss@zfsonlinux.org
Zero in front of the version number?
clusterhq.com/blog/state-zfs-on-linux/
Close to feature parity with Illumos and FreeBSD.
Key end-to-end data integrity features work on Linux like other platforms.
Performance is workload dependent.
ZFS on Linux may be better than other options today for your use cases. It is not
better in all cases.
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
Initial Problem Statement
Large-ish Cassandra cluster serving ML-derived data about URLs using AddThis tools.
Internal DC storage SLA: 98th percentile of 35ms
Data size and request volume growing, failing to meet SLA even while throwing hardware
at it.
Multiple revenue lines & products affected or launch blocked on cluster performance.
zipfian web traffic with a very long tail.
L2ARC
Setup
# zpool create -f tank mirror /dev/sdb /dev/sdc \
    mirror /dev/sdd /dev/sde \
    cache /dev/sdf
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
cache
sdf ONLINE 0 0 0
Results
Twice the performance with half the physical nodes.
(Mileage will vary with workload and DRAM:SSD:working-set ratios.)
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
Cute Little Clusters
Datacenter: IAD
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID
UN x.xx.xxx.125 154.14 MB 256 ? 87d41c52-2b25-466b-93c1-d65c72f5f
UN x.xx.xxx.124 154.17 MB 256 ? c1a44486-2133-40fc-ba9d-0e671c5b2
UN x.xx.xxx.126 154.17 MB 256 ? 824f6018-eba6-4b44-b716-3b4eeaf69
Constraints
Need more efficient hardware allocation
for small clusters (multi-tenancy)
Among the most latency sensitive services
Non-trivial legacy network requirements
Application transparency
Infrastructure transparency (inventory,
dns, dhcp, config management)
Container Spectrum
ZFS Enabling Containerization
Create from base image      → zfs clone
Backup running applications → zfs snapshot
Migrate to new host         → zfs send/recv
Manage quotas               → zfs properties
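The mapping above, sketched as commands (the dataset name follows the later `zfs list` output; `base@golden`, `@nightly`, and `other-host` are hypothetical):

```shell
# Create a container's root fs from a base image: clone a golden snapshot.
zfs clone tank/lxc/base@golden tank/lxc/T55e9928f940e0155

# Back up the running container: snapshots are cheap and atomic.
zfs snapshot tank/lxc/T55e9928f940e0155@nightly

# Migrate to a new host: stream the dataset over the network.
zfs send tank/lxc/T55e9928f940e0155@nightly \
    | ssh other-host zfs recv tank/lxc/T55e9928f940e0155

# Cap how much space a container can consume with a quota property.
zfs set quota=150G tank/lxc/T55e9928f940e0155
```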
lxc-create(1)
Glue
$ ./bin/port register-container --hostname=HOSTNAME
$ ./bin/port build-container --tag=CONTAINER_TAG \
    --resources=standard-small \
    --host-tag=PHYSICAL_HOST_TAG
$ ./bin/cobbling-time add-chef-roles --tag=CONTAINER_TAG \
    --roles='ROLES'
$ ./bin/cobbling-time signoff --tag=CONTAINER_TAG \
    --next-status=Allocated
# lxc-ls -l
drwxrwx--- 3 root root 5 Apr 17 15:23 drydock-2015-04-17T15:23:02
drwxrwx--- 3 root root 5 Mar 12 2015 T5501a951541e62d5
drwxrwx--- 3 root root 5 Mar 17 2015 T55081d472f641f10
drwxrwx--- 3 root root 5 Apr 14 10:27 T552d238170fc63be
drwxrwx--- 3 root root 5 Apr 16 11:38 T552fd746af5be6e9
drwxrwx--- 3 root root 5 Sep 4 08:46 T55e9928f940e0155
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank/lxc 52.8G 2.60T 79K /lxc
tank/lxc/T5501a951541e62d5 483M 150G 957M /lxc/T5501a951541e6
tank/lxc/T55081d472f641f10 15.4G 585G 15.8G /lxc/T55081d472f641
tank/lxc/T552d238170fc63be 22.1G 278G 22.5G /lxc/T552d238170fc6
tank/lxc/T552fd746af5be6e9 8.53G 591G 8.51G /lxc/T552fd746af5be
tank/lxc/T55e9928f940e0155 1.45G 124G 1.87G /lxc/T55e9928f940e0
tank/lxc/drydock-2015-04-17T15:23:02 734M 2.60T 734M /lxc/drydock-2015-0
cadvisor
Results
> 2x consolidation
... able to defer a hardware purchase for a year
Clean method for multiple Cassandra clusters per physical host. Can continue to break
apart clusters by use case and SLA!
Virtually every view (3 billion/day) of AddThis tools involves Cassandra, in a
container, on ZFS.
Summary & Future Work
ZFS allows us to significantly improve the efficiency of both large and small clusters
ZFS is fundamental to container storage
Future: Continued performance investigations (align block sizes?)
Future: Is this going to evolve into writing our own IaaS?
We Are Hiring
http://www.addthis.com/careers
Questions?