Cassandra Explained


    Berlin Buzzwords
      June 6, 2010

            Eric Evans
    eevans@rackspace.com
           @jericevans
    http://blog.sym-link.com
Outline
●   Background
●   Description
●   API
●   Examples
Background
Influential Papers
●   BigTable
    ● Strong consistency
    ● Sparse map data model


    ● GFS, Chubby, et al


●   Dynamo
    ●   O(1) distributed hash table (DHT)
    ●   BASE (aka eventual consistency)
    ●   Client tunable consistency/availability
NoSQL
●   HBase          ●   Hypertable
●   MongoDB        ●   HyperGraphDB
●   Riak           ●   Memcached
●   Voldemort      ●   Tokyo Cabinet
●   Neo4J          ●   Redis
●   Cassandra      ●   CouchDB
NoSQL Big data
●   HBase           ●   Hypertable
●   MongoDB         ●   HyperGraphDB
●   Riak            ●   Memcached
●   Voldemort       ●   Tokyo Cabinet
●   Neo4J           ●   Redis
●   Cassandra       ●   CouchDB
Bigtable / Dynamo
        Bigtable              Dynamo
●   HBase          ●   Riak
●   Hypertable     ●   Voldemort



            Cassandra ??
Dynamo-Bigtable Lovechild
CAP Theorem “Pick Two”
●   CP               ●   AP
    ●   Bigtable         ●   Dynamo
    ●   Hypertable       ●   Voldemort
    ●   HBase            ●   Cassandra
CAP Theorem “Pick Two”



   ●   Consistency
   ●   Availability
   ●   Partition Tolerance
Description
Properties
●   Symmetric
    ● No single point of failure
    ● Linearly scalable


    ● Ease of administration


●   Flexible partitioning, replica placement
●   Automated provisioning
●   High availability (eventual consistency)
P2P Routing
P2P Routing
Partitioning
●   Random
    ●   128bit namespace, (MD5)
    ●   Good distribution
●   Order Preserving
    ●   Tokens determine namespace
    ●   Natural order (lexicographical)
    ●   Range / cover queries
●   Yours ??
Replica Placement
●   SimpleSnitch
    ●   Default
    ●   N-1 successive nodes
●   RackInferringSnitch
    ●   Infers DC/rack from IP
●   PropertyFileSnitch
    ●   Configured w/ a properties file
Bootstrap
Bootstrap
Bootstrap
Choosing Consistency

         Write                      Read
Level     Description      Level     Description
ZERO      Hail Mary        ZERO      N/A
ANY       1 replica (HH)   ANY       N/A
ONE       1 replica        ONE       1 replica
QUORUM    (N / 2) +1       QUORUM    (N / 2) +1
ALL       All replicas     ALL       All replicas

                       R+W>N
Quorum ((N/2) + 1)
Quorum ((N/2) + 1)
Data Model
Overview
●   Keyspace
    ●   Uppermost namespace
    ●   Typically one per application
●   ColumnFamily
    ●   Associates records of a similar kind
    ●   Record-level Atomicity
    ●   Indexed
●   Column
    ●   Basic unit of storage
Sparse Table
Column
●   name
    ●   byte[]
    ●   Queried against (predicates)
    ●   Determines sort order
●   value
    ●   byte[]
    ●   Opaque to Cassandra
●   timestamp
    ●   long
    ●   Conflict resolution (Last Write Wins)
Column Comparators
●    Bytes
●    UTF8
●    TimeUUID
●    Long
●    LexicalUUID
●    Composite (third-party)


    http://github.com/edanuff/CassandraCompositeType
API
Low / High
●    Thrift
      ●   Compact binary RPC framework
      ●   12 different languages
●    Idiomatic
      ●   Hector (Java)
      ●   Pycassa (Python)
      ●   Others...


    http://wiki.apache.org/cassandra/ClientOptions
Thrift Read Methods
●   get() → Column
●   get_slice() → list<Column>
●   mulitget_slice() → map<key, list<Column>>
●   get_count() → int
●   multiget_count() → map<key, int>
●   get_range_slices()
Thrift Write Methods
●   insert()
●   batch_insert()
●   remove()
●   batch_mutate()
Examples
Pycassa – Python Client API
●    connect() → Thrift proxy
●    cf = ColumnFamily(proxy, ksp, cfname)
●    cf.insert() → long
●    cf.get() → dict
●    cf.get_range() → dict




    http://github.com/vomjom/pycassa
Address Book – Setup
<!-- conf/storage-conf.xml -->
<Keyspace Name=”AddressBook”>
  <ColumnFamily Name=”Addresses”
                CompareWith=”BytesType”
                RowsCached=”10000”
                KeysCached=”50%”
                Comment=”Too lame” />
</Keyspace>
Adding an entry
key = uuid()

columns = {
    'first':   'Eric',
    'last':    'Evans',
    'email':   'eevans@rackspace.com',
    'city':    'Austin',
    'zip':     78250
}

addresses.insert(key, columns)
Fetching a record
# fetching the record by key
record = addresses.get(key)

# accessing columns by name
zipcode = record['zip']
city = record['city']
Indexing
<!-- conf/storage-conf.xml -->
<Keyspace Name=”AddressBook”>
  <ColumnFamily Name=”Addresses”
                CompareWith=”BytesType”
                RowsCached=”10000”
                KeysCached=”50%”
                Comment=”Too lame” />
  <ColumnFamily Name=”ByCity”
                CompareWith=”UTF8Type” />
</Keyspace>
Updating the index
key = uuid()

columns = {
    'first':   'Eric',
    'last':    'Evans',
    'email':   'eevans@rackspace.com',
    'city':    'Austin',
    'zip':     78250
}

addresses.insert(key, columns)
byCity.insert('Austin', {key: ''})
Timeseries
<!-- conf/storage-conf.xml -->
<Keyspace Name=”Sites”>
  <ColumnFamily Name=”Stats”
                CompareWith=”LongType”/>
</Keyspace>
Logging values
# time as a long, binary, network-order
ts = pack('>d', long(time() * 1e6))

stats.insert('org.apache', {ts: value})
Slicing
begin = pack('>d', long(s * 1e6))

stats.get_range('org.apache',
                column_start=begin)

end = pack('>d', long((s + 86400) * 1e6))

stats.get_range(start='org.apache',
                finish='org.debian',
                column_start=begin,
                column_finish=end)
Questions?

Cassandra Explained

  • 1.
    Cassandra Explained Berlin Buzzwords June 6, 2010 Eric Evans eevans@rackspace.com @jericevans http://blog.sym-link.com
  • 2.
    Outline ● Background ● Description ● API ● Examples
  • 3.
  • 4.
    Influential Papers ● BigTable ● Strong consistency ● Sparse map data model ● GFS, Chubby, et al ● Dynamo ● O(1) distributed hash table (DHT) ● BASE (aka eventual consistency) ● Client tunable consistency/availability
  • 5.
    NoSQL ● HBase ● Hypertable ● MongoDB ● HyperGraphDB ● Riak ● Memcached ● Voldemort ● Tokyo Cabinet ● Neo4J ● Redis ● Cassandra ● CouchDB
  • 6.
    NoSQL Big data ● HBase ● Hypertable ● MongoDB ● HyperGraphDB ● Riak ● Memcached ● Voldemort ● Tokyo Cabinet ● Neo4J ● Redis ● Cassandra ● CouchDB
  • 7.
    Bigtable / Dynamo Bigtable Dynamo ● HBase ● Riak ● Hypertable ● Voldemort Cassandra ??
  • 8.
  • 9.
    CAP Theorem “PickTwo” ● CP ● AP ● Bigtable ● Dynamo ● Hypertable ● Voldemort ● HBase ● Cassandra
  • 10.
    CAP Theorem “PickTwo” ● Consistency ● Availability ● Partition Tolerance
  • 11.
  • 12.
    Properties ● Symmetric ● No single point of failure ● Linearly scalable ● Ease of administration ● Flexible partitioning, replica placement ● Automated provisioning ● High availability (eventual consistency)
  • 13.
  • 14.
  • 15.
    Partitioning ● Random ● 128bit namespace, (MD5) ● Good distribution ● Order Preserving ● Tokens determine namespace ● Natural order (lexicographical) ● Range / cover queries ● Yours ??
  • 16.
    Replica Placement ● SimpleSnitch ● Default ● N-1 successive nodes ● RackInferringSnitch ● Infers DC/rack from IP ● PropertyFileSnitch ● Configured w/ a properties file
  • 17.
  • 18.
  • 19.
  • 20.
    Choosing Consistency Write Read Level Description Level Description ZERO Hail Mary ZERO N/A ANY 1 replica (HH) ANY N/A ONE 1 replica ONE 1 replica QUORUM (N / 2) +1 QUORUM (N / 2) +1 ALL All replicas ALL All replicas R+W>N
  • 21.
  • 22.
  • 23.
  • 24.
    Overview ● Keyspace ● Uppermost namespace ● Typically one per application ● ColumnFamily ● Associates records of a similar kind ● Record-level Atomicity ● Indexed ● Column ● Basic unit of storage
  • 25.
  • 26.
    Column ● name ● byte[] ● Queried against (predicates) ● Determines sort order ● value ● byte[] ● Opaque to Cassandra ● timestamp ● long ● Conflict resolution (Last Write Wins)
  • 27.
    Column Comparators ● Bytes ● UTF8 ● TimeUUID ● Long ● LexicalUUID ● Composite (third-party) http://github.com/edanuff/CassandraCompositeType
  • 28.
  • 29.
    Low / High ● Thrift ● Compact binary RPC framework ● 12 different languages ● Idiomatic ● Hector (Java) ● Pycassa (Python) ● Others... http://wiki.apache.org/cassandra/ClientOptions
  • 30.
    Thrift Read Methods ● get() → Column ● get_slice() → list<Column> ● mulitget_slice() → map<key, list<Column>> ● get_count() → int ● multiget_count() → map<key, int> ● get_range_slices()
  • 31.
    Thrift Write Methods ● insert() ● batch_insert() ● remove() ● batch_mutate()
  • 32.
  • 33.
    Pycassa – PythonClient API ● connect() → Thrift proxy ● cf = ColumnFamily(proxy, ksp, cfname) ● cf.insert() → long ● cf.get() → dict ● cf.get_range() → dict http://github.com/vomjom/pycassa
  • 34.
    Address Book –Setup <!-- conf/storage-conf.xml --> <Keyspace Name=”AddressBook”> <ColumnFamily Name=”Addresses” CompareWith=”BytesType” RowsCached=”10000” KeysCached=”50%” Comment=”Too lame” /> </Keyspace>
  • 35.
    Adding an entry key= uuid() columns = { 'first': 'Eric', 'last': 'Evans', 'email': 'eevans@rackspace.com', 'city': 'Austin', 'zip': 78250 } addresses.insert(key, columns)
  • 36.
    Fetching a record #fetching the record by key record = addresses.get(key) # accessing columns by name zipcode = record['zip'] city = record['city']
  • 37.
    Indexing <!-- conf/storage-conf.xml --> <KeyspaceName=”AddressBook”> <ColumnFamily Name=”Addresses” CompareWith=”BytesType” RowsCached=”10000” KeysCached=”50%” Comment=”Too lame” /> <ColumnFamily Name=”ByCity” CompareWith=”UTF8Type” /> </Keyspace>
  • 38.
    Updating the index key= uuid() columns = { 'first': 'Eric', 'last': 'Evans', 'email': 'eevans@rackspace.com', 'city': 'Austin', 'zip': 78250 } addresses.insert(key, columns) byCity.insert('Austin', {key: ''})
  • 39.
    Timeseries <!-- conf/storage-conf.xml --> <KeyspaceName=”Sites”> <ColumnFamily Name=”Stats” CompareWith=”LongType”/> </Keyspace>
  • 40.
    Logging values # timeas a long, binary, network-order ts = pack('>d', long(time() * 1e6)) stats.insert('org.apache', {ts: value})
  • 41.
    Slicing begin = pack('>d',long(s * 1e6)) stats.get_range('org.apache', column_start=begin) end = pack('>d', long((s + 86400) * 1e6)) stats.get_range(start='org.apache', finish='org.debian', column_start=begin, column_finish=end)
  • 42.