Castle enhanced Cassandra

Castle is an open-source project that provides an alternative to the lower layers of the storage stack -- RAID and POSIX filesystems -- for big data workloads, and distributed data stores such as Apache Cassandra.

This presentation from Berlin Buzzwords 2012 provides a high-level overview of Castle and how it is used with Cassandra to improve performance and predictability.

Published in: Technology
  • 15 years ago ...
  • What were they thinking?
  • ... your phone
  • ... your notebook computer
  • ... the Internet
  • ... databases: small databases indexed with btrees; Btree-based file systems; RAID; SCSI, IDE disks
  • 10 years later ...
  • What are you thinking?
  • ... your phone
  • ... your computer
  • ... the Internet (data-rich)
  • 2006: 161 exabytes (0.16 zettabytes); 2010: 988 exabytes (0.98 zettabytes); 2011: 1.8 zettabytes; 2015: >8 zettabytes
  • Times have changed: for Big Data, databases are distributed. Example: Cassandra’s content-addressable ring
  • Key-based partitioning...
  • ... replication. BASE replaces ACID
  • Write optimization: why it is important given disk limitations
  • Btrees: how they work, their properties, and how that relates to disk access
  • How Cassandra write-optimizes: LSM-tree
  • Sequential disk access wins.
  • Where all that leaves the database stack.
  • Enter the Castle-based stack.
  • Write optimization à la Castle’s doubling arrays, etc.
  • Indexing
  • Bloom filters
  • Castle’s management of block devices for performance and redundancy
  • Castle’s shared memory interface
  • libcastle (C lib), and bindings
  • Castle’s Java interface
  • Putting Castle to work for Cassandra
  • Cassandra’s ColumnFamilyStore abstraction / in-Java LSM-tree
  • When thresholds are hit, live memtables are “switched out”, queued for flushing, and then left to the JVM’s garbage collector for cleanup.
  • Garbage collection ramifications.
  • Replacing CFS with a Castle backend; memory is (de)allocated by Castle in kernel space
  • Cassandra holds all Bloom filters and indexes in memory; Castle does not have this requirement
  • Performance
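
The memtable lifecycle described in the notes (a live memtable is switched out at a threshold, queued for flushing, then left for the garbage collector) can be sketched roughly as follows. This is a toy model in Python, not Cassandra's ColumnFamilyStore: the class name, the `memtable_limit` parameter, and the entry-count threshold are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import deque


class ToyColumnFamilyStore:
    """Illustrative LSM-style store; every name here is hypothetical."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # assumed threshold: entry count
        self.memtable = {}                    # live, mutable memtable
        self.flush_queue = deque()            # switched-out memtables awaiting flush
        self.sstables = []                    # immutable sorted runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._switch_memtable()

    def _switch_memtable(self):
        # "Switch out" the live memtable: queue it and start a fresh one.
        self.flush_queue.append(self.memtable)
        self.memtable = {}

    def flush_pending(self):
        # Write each queued memtable as one sorted run (sequential output).
        while self.flush_queue:
            frozen = self.flush_queue.popleft()
            self.sstables.append(sorted(frozen.items()))
            # 'frozen' is now unreferenced; reclaiming it is left to the
            # runtime's garbage collector, as the notes describe for the JVM.

    def get(self, key):
        # Newest data wins: live memtable, then queued ones, then sorted runs.
        if key in self.memtable:
            return self.memtable[key]
        for frozen in reversed(self.flush_queue):
            if key in frozen:
                return frozen[key]
        for run in reversed(self.sstables):
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

In this model every switched-out memtable becomes garbage on the managed heap once flushed, which is the cost the notes attribute to the in-Java approach and the reason given for moving allocation into Castle.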
  • Castle enhanced Cassandra

    1. Castle-enhanced Cassandra. Berlin Buzzwords, June 4, 2012. Eric Evans, eric@acunu.com, @jericevans, @acunu
    2. 1997
    3. Before the Flood (1990): small databases; BTree indexes; BTree file systems; RAID; old hardware. Monday, 6 February 2012
    4. 2007
    5. Are we there yet? [Figures 6, 7, 8 and 3. Source: IDC, 2007]
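    6. Big Data distribution [ring diagram; labels as extracted: AM B C]
    7. Big Data distribution [ring diagram; labels as extracted: AM B C; Key = Aaa]
    8. Big Data distribution [ring diagram; labels as extracted: AM B C; Key = Aaa]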
    9. Big Data write optimizing: 7,500 - 10,000 RPM; 5 ms - 9 ms seeks; ~150 MB/s (sequential); 75-150 random IOPS
    10. Big Data write optimizing [B-tree diagram; node labels as extracted: A G, A C G K, A B D E G H K L]
    11. Big Data write optimizing [B-tree diagram with query(K); node labels as extracted: A G, A C G K, A B D E G H K L]
    12. Big Data write optimizing [B-tree diagram; node labels as extracted: A G, A C G K, A B D E G H K L]
    13. Big Data write optimizing [diagram: Memory; Disk: S1 S2 S3 S4 S5]
    14. Big Data write optimizing [diagram: Memory; Disk: S1 S2 S3 S4 S5]
    15. Two Revolutions (2010): distributed, shared-nothing databases; write-optimised indexes; BTree file systems; RAID; new hardware
    16. Bridging the Gap (2011): distributed, shared-nothing databases; Castle; new hardware
    17. Castle
    18. Castle is... Filesystem (no, not really); key-value store for the Linux kernel; write-optimized (for rotational disks, for SSDs); versioned (clones, snapshots); disk aggregation (for redundancy, for performance); FLOSS!
    19. Doubling Arrays [values: 3]. Buffer values in memory until we have > B
    20. Doubling Arrays [values: 3 9]. Buffer values in memory until we have > B
    21. Doubling Arrays [values: 3 9]. Then, promote them to disk.
    22. Doubling Arrays [values: 11 3 9]
    23. Doubling Arrays [values: 11 3 9 7]
    24. Doubling Arrays [values: 3 9 7 11]
    25. Doubling Arrays [values: 5 3 9 1 7 11]
    26. Doubling Arrays [values: 1 5 3 7 9 11]
    27. Indexes: query(k)
    28. Bloom filters: query(k)
    29. Disk Layout: RDA (random duplicate allocation) [layout diagram; blocks 1-16, each number appearing twice]
    30. Disk Layout: RDA (random duplicate allocation) [same layout with one copy of blocks 2, 5, 8, 9, 13 and 14 absent]
    31. Disk Layout: RDA (random duplicate allocation) [same layout with second copies of blocks 14, 9, 2, 13, 8 and 5 added back]
    32. Disk Layout: RDA (random duplicate allocation). Rebuild Times [bar chart; y-axis: Rebuild Time (Hours), 0 to 5; bars: RAID10, 8 Disks; RAID5, 8 Disks; RDA, 8 Disks; RDA, 15 Disks]
    33. Reflex (formerly Acunu Data Platform)
    34. ColumnFamilyStore [diagram: Memory; Disk: S1 S2 S3 S4 S5]
    35. Memtables [diagram: Memory, Memory, Memory; Disk: S1 S2 S3 S4 S5]
    36. AcunuColumnFamilyStore
    37. Small random inserts: inserting 3 billion rows [chart; series: Acunu powered Cassandra, ‘standard’ Cassandra]
    38. Insert latency (while inserting 3 billion rows) [chart; series: Acunu powered Cassandra (x), ‘standard’ Cassandra (+)]
    39. Questions? bitbucket.org/acunu, github.com/acunu, www.acunu.com/2/category/technical%20articles/1.html
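
The Doubling Arrays slides (19 to 26) walk through buffering values in memory, promoting them to disk as a sorted array, and merging arrays into larger ones. A minimal sketch of that idea in Python follows. It is a simplification and not Castle's implementation: the `DoublingArray` class, the `buffer_size` parameter, and the rule of merging as soon as a level is already occupied are assumptions, so intermediate states may differ from the slides even though the final state for the slides' values matches.

```python
from bisect import bisect_left
from heapq import merge


class DoublingArray:
    """Toy doubling array; 'levels' stands in for on-disk sorted arrays."""

    def __init__(self, buffer_size=2):
        self.buffer_size = buffer_size  # hypothetical stand-in for the slides' B
        self.buffer = []                # unsorted in-memory values
        self.levels = []                # levels[i]: None or a sorted list of
                                        # buffer_size * 2**i values

    def insert(self, value):
        self.buffer.append(value)
        if len(self.buffer) >= self.buffer_size:
            self._promote(sorted(self.buffer))
            self.buffer = []

    def _promote(self, run):
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                self.levels[level] = run
                return
            # Level occupied: merge the two sorted runs in one sequential
            # pass and carry the doubled run up to the next level.
            run = list(merge(self.levels[level], run))
            self.levels[level] = None
            level += 1

    def contains(self, value):
        if value in self.buffer:
            return True
        for run in self.levels:
            if run:
                i = bisect_left(run, value)
                if i < len(run) and run[i] == value:
                    return True
        return False
```

Inserting the slides' values in order (3, 9, 11, 7, 5, 1) with `buffer_size=2` leaves two sorted arrays, `[1, 5]` and `[3, 7, 9, 11]`, which is the grouping suggested by slide 26. Every write to a level is a sequential merge, while a lookup may have to probe every level; that is the cost the Indexes and Bloom filters slides address.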
