Cassandra 0.7



Friday, December 10, 2010
Features
                    • Live schema modification
                    • Secondary indexes
                    • Hadoop OutputFormat
                    • (Very) large rows
                            •   up to 2 billion columns

                    • NetworkTopologyStrategy
Operations

                    • Efficient streaming
                    • Per-ColumnFamily settings of memtable
                            thresholds
                    • Much more (optional) metadata about
                            columns



Operations backports
                    •       HH disable (0.6.2)

                    •       compaction priority (0.6.3)

                    •       HH hourly scan (0.6.3)

                    •       JMX metrics for row-level bloom filters (0.6.3)

                    •       Flow control (0.6.4, 0.6.5)

                    •       HH paging (0.6.5)

                    •       Dynamic snitch (0.6.5)

                    •       Tombstone removal in minor compaction (0.6.6)

Compatibility

                    • Fully backwards-compatible with 0.6 data
                    • Some Thrift API changes
                            •   String row keys become byte[]

                            •   keyspace is set once per connection

                    • Requires drain + cluster restart
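
A minimal sketch of what the row-key change means for client code: a key that used to be passed as a String is now encoded to bytes by the caller. The class and method names are hypothetical (they are not part of any Cassandra client), and UTF-8 is an assumed encoding choice, not something the slide mandates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any Cassandra client library.
class RowKeys {
    // Encode a 0.6-style String key into raw bytes (UTF-8 is an assumption).
    static byte[] toKey(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // Decode a raw key back into a String using the same assumed encoding.
    static String fromKey(byte[] key) {
        return new String(key, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = toKey("jbellis");
        System.out.println(key.length + " bytes, decodes to " + fromKey(key));
    }
}
```

Because the server now sees only bytes, the client has to use the same encoding consistently for reads and writes.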

Features



Live schema changes


                    • Details: http://www.riptano.com/blog/live-schema-updates-cassandra-07




Data model tradeoffs


                    • Twitter: “Fifteen months ago, it took two
                            weeks to perform ALTER TABLE on the
                            statuses [tweets] table.”




A static ColumnFamily




A dynamic ColumnFamily




SELECT * FROM tweets
 WHERE user_id IN (SELECT follower FROM followers WHERE user_id = ?)




[Diagram: followers, timeline, tweets]
SuperColumns = full denormalization




A little deeper


                    • http://twissandra.com
                    • http://github.com/jhermes/twissjava



Secondary indexes




A static ColumnFamily




demo time


                    • Reading the slides after the talk? See
                            http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes




Hadoop OutputFormat
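// Send job output to a Cassandra ColumnFamily (KS = keyspace name, CF = ColumnFamily name)
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KS, CF);
...
// Sum the counts for one word and emit them as a single mutation
// (outputKey and getMutation are not shown on the slide)
public void reduce(Text word, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException
{
    int sum = 0;
    for (IntWritable val : values)
        sum += val.get();
    context.write(outputKey, Collections.singletonList(getMutation(word, sum)));
}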




Large rows


                    • 0.6: smaller of {2GB, memory limit}
                    • 0.7: in_memory_compaction_limit_in_mb


NetworkTopologyStrategy

                    • RackAwareStrategy is tuned for 3 replicas
                            and 2 data centers
                            •   renamed to OldNetworkTopologyStrategy

                    • NTS allows configuring replicas per data
                            center, per Keyspace
                            •   ignores replication_factor directive
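
A rough sketch of per-data-center replica counts: walk the ring clockwise from the key's position and accept a node only while its data center still needs replicas. This is a simplified illustration, not Cassandra's implementation; it ignores racks, and the class, the `Node` type, and the data center names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration only; real placement also takes racks into account.
class PerDcPlacement {
    // A ring member (hypothetical type for this sketch).
    static final class Node {
        final String name;
        final String dc;
        Node(String name, String dc) { this.name = name; this.dc = dc; }
    }

    // ring: nodes in token order; start: index of the first node at or after the key's token;
    // replicasPerDc: desired replica count for each data center.
    static List<String> pickReplicas(List<Node> ring, int start, Map<String, Integer> replicasPerDc) {
        Map<String, Integer> remaining = new HashMap<>(replicasPerDc);
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < ring.size(); i++) {
            Node n = ring.get((start + i) % ring.size());
            int need = remaining.getOrDefault(n.dc, 0);
            if (need > 0) {
                // This data center still needs a replica, so take the node.
                chosen.add(n.name);
                remaining.put(n.dc, need - 1);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> ring = Arrays.asList(
            new Node("A", "DC1"), new Node("F", "DC2"), new Node("L", "DC1"),
            new Node("T", "DC2"), new Node("W", "DC1"));
        Map<String, Integer> replicasPerDc = new HashMap<>();
        replicasPerDc.put("DC1", 2);
        replicasPerDc.put("DC2", 1);
        System.out.println(pickReplicas(ring, 1, replicasPerDc));
    }
}
```

The per-data-center map plays the role that a single replication factor plays for the simpler strategies, which is why a global replication_factor has nothing left to say here.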



Operations



Efficient Streaming
                    • The following slides show how, in 0.7, we
                            send only the data portion of the sstables
                            being moved to the new node (contiguous on
                            disk, so no random I/O); the new node then
                            rebuilds the indexes, etc.
                    • This minimizes the impact on existing
                            nodes
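
To make the "rebuilds indexes" step concrete, here is a sketch of a receiver rebuilding a key-to-offset index in one sequential pass over the data it was sent. The row layout used here (2-byte key length, key, 4-byte value length, value) is invented for the example and is not the real sstable format; all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: the row layout below is made up for this sketch.
class IndexRebuild {
    // Append one row as [short keyLen][key bytes][int valueLen][value bytes].
    static void appendRow(ByteBuffer data, String key, byte[] value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        data.putShort((short) k.length).put(k).putInt(value.length).put(value);
    }

    // Scan the received data once, front to back, recording where each row starts.
    // Expects a buffer that is ready for reading (position at the first row).
    static Map<String, Integer> rebuildIndex(ByteBuffer data) {
        Map<String, Integer> index = new LinkedHashMap<>();
        ByteBuffer in = data.duplicate();
        while (in.hasRemaining()) {
            int offset = in.position();
            byte[] k = new byte[in.getShort()];
            in.get(k);
            int valueLen = in.getInt();
            in.position(in.position() + valueLen); // skip the value, only keys are indexed
            index.put(new String(k, StandardCharsets.UTF_8), offset);
        }
        return index;
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(64);
        appendRow(data, "a", new byte[3]);
        appendRow(data, "bb", new byte[5]);
        data.flip();
        System.out.println(rebuildIndex(data));
    }
}
```

The point of the design is that the sending node only does a sequential read of one file, while the extra work lands on the node that is joining.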


[Diagram: ring of nodes A, F, L, T, W; range label (A-L]]
[Diagram: same ring; range labels (A-F], (A-F], (F-L]]
[Diagram: same ring; labels Data, Index, Filter]
[Diagram: same ring; labels Index, Filter]
Per-CF memtable thresholds



          • Easier tuning for large numbers of ColumnFamilies




Column Metadata


                    • 0.6: comparator, subcomparator
                    • 0.7: default_validation_class,
                            column_metadata




Native code


                    • JNA introduced in 0.6.5 for mlockall
                    • Extended to hard links in 0.6.6


Flow Control (0.6.4)
                    • Replica nodes drop hopeless requests on
                            the floor
                            •   Coordinator node is unaffected

                            •   TimedOutException signals client to back off

                            •   Requires enough memory to buffer
                                RPCTimeout’s worth of requests

                    • (In the short term, you’re still screwed)
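
The "drop hopeless requests" idea can be sketched as a queue that discards anything older than the RPC timeout, since the coordinator has already given up on it. This is a toy model under that assumption, not Cassandra's code; the class and field names are hypothetical and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of a replica-side stage that drops requests it can no longer answer in time.
class DroppingStage {
    static final class Request {
        final String id;
        final long enqueuedAtMillis;
        Request(String id, long enqueuedAtMillis) {
            this.id = id;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final long rpcTimeoutMillis;
    private final Deque<Request> queue = new ArrayDeque<>();
    private int dropped = 0;

    DroppingStage(long rpcTimeoutMillis) { this.rpcTimeoutMillis = rpcTimeoutMillis; }

    // The queue must be able to hold everything that arrives within one timeout window.
    void enqueue(Request r) { queue.addLast(r); }

    // Process the queue at time nowMillis; returns the ids of requests still worth executing.
    List<String> drain(long nowMillis) {
        List<String> executed = new ArrayList<>();
        Request r;
        while ((r = queue.pollFirst()) != null) {
            if (nowMillis - r.enqueuedAtMillis > rpcTimeoutMillis) {
                dropped++; // the caller has already timed out, so doing the work is wasted effort
            } else {
                executed.add(r.id);
            }
        }
        return executed;
    }

    int droppedCount() { return dropped; }
}
```

For example, with a 10-second timeout, a request enqueued at t=0 is dropped when the queue is drained at t=10.5s, while one enqueued at t=9s is still executed.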
Flow control in 0.5


                    • Why backpressure doesn’t fit Cassandra



Dynamic snitch

public void sortByProximity(List<InetAddress> addresses);
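
One way to picture a latency-aware implementation of this method: keep a score per host and sort the candidate list in place, lowest score first. This is a sketch under that assumption, not the actual dynamic snitch; the class name, the `recordScore` method, and the scoring scheme are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: lower score means "closer" (for example, lower observed latency).
class LatencyScoredSnitch {
    private final Map<InetAddress, Double> scores = new HashMap<>();

    void recordScore(InetAddress host, double score) { scores.put(host, score); }

    // Sorts the list in place; hosts with no recorded score go last.
    public void sortByProximity(List<InetAddress> addresses) {
        addresses.sort(Comparator.comparingDouble(
            (InetAddress a) -> scores.getOrDefault(a, Double.MAX_VALUE)));
    }

    // Helper for examples: builds 10.0.0.<lastOctet> without any DNS lookup.
    static InetAddress ip(int lastOctet) {
        try {
            return InetAddress.getByAddress(new byte[] {10, 0, 0, (byte) lastOctet});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Recording scores of 5.0, 1.0 and 3.0 for 10.0.0.1, 10.0.0.2 and 10.0.0.3 and then sorting those three hosts yields the order 10.0.0.2, 10.0.0.3, 10.0.0.1.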




Everything else



0.7 performance
                    • Reads roughly 100% faster, thanks largely to
                            removing String creation
                    • Row-cached reads up to 8x faster after
                            optimizations by tjake and jbellis
                    • Optimizations for reads of large rows
                    • 0.7.1? ~15% improvement everywhere from
                            ByteBuffer optimizations


Thrift: the libpq of Cassandra



                    • OOMs on malformed packets
                    • Python Unicode string issues
                    • PHP support is buggy and maintainerless


Client support from Riptano

                    • Hector
                            •   Building JPA/JDO layer on top

                    • pycassa
                    • phpcassa
                    • Soon: cassandra gem

After 0.7.0

                    • IndexOperator.GT
                    • Triggers / plugins
                    • Entity groups
                    • On-disk data format improvements
                            (Compression, compound keys?)



Summary




Cassandra 0.7, Los Angeles High Scalability Group
