June 19, 2014
Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications
Introduction
• Don’t want this to be a data modeling talk	

• We aren't experts - we are learning as we go	

• Hopefully this will be useful to both you and us	

• Informal, questions as we go	

• We will share our experiences so far moving to Cassandra	

• We are working on a bunch of existing and new projects

• We'll talk about 2 1/2 of them	

• Some dev stuff, some ops stuff	

• Some thoughts for the future	

• Athena Scala Driver
Who is Vast?
• Vast operates white-label performance-based marketplaces for
publishers and delivers big data mobile applications for
automotive and real estate sales professionals

• “Big Data for Big Purchases”	

• Marketplaces	

• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA,
Yahoo

• Hundreds of smaller partner sites	

• Analytics	

• Strong team of scarily smart data scientists	

• Integrating analytics everywhere
Big Data
• HDFS - 1100TB	

• Amazon S3 - 275TB	

• Amazon Glacier - 150TB	

• DynamoDB - 12TB

• Vertica - 2TB	

• Cassandra - 1.5TB	

• SOLR/Lucene - 400GB	

• Zookeeper	

• MySQL	

• Postgres	

• Redis	

• CouchDB
Data Flow
• Flows between different data store types (many include historical data too)	

• Systems of Record (SOR)	

• Both root nodes and leaf nodes	

• Derived data stores (mostly MVCC) for:	

• Real time customer facing queries	

• Real time analytics	

• Alerting	

• Offline analytics	

• Reporting	

• Debugging	

• Mixture of dumps and deltas	

• We have derived SORs	

• Cache a smaller subset of records/fields for a specific purpose

• SORs in multiple data centers - some derived SORs shared	

• Data flow is a graph, not a tree - there is feedback
Goals
• Reduce latency to <15 minutes for customer-facing data

• Reduce copying and duplication of data	

• Network/Storage/Time costs	

• More streaming & deltas, fewer dumps and derived SORs

• Want multi-purpose, multi-tenant central store	

• Something rock solid	

• Something that can handle lots of data fast 	

• Something that can do random access and bulk operations	

• Use for all data store types on previous slide	

• (Over?)build it; they will come	

• Consolidate rest on	

• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene
Why Cassandra?
• Regarded as rock solid	

• No single point of failure	

• Active development & open source Java	

• Good fit for the type of data we wanted to store	

• Ease of configuration; all nodes are the same	

• Easily tunable consistency at application level	

• Easy control of sharding at application level	

• Drivers for all our languages (we're mostly JVM but also node)	

• Data locality with other tools	

• Good cross data center support
Evolution
• July 2013 (alpha on C* 1.1)	

• September 2013 (MTC-1 on C* 2.0.0)	

• First use case (a nasty one) - talk about it later	

• Stress/Destructive testing	

• Found and helped fix a few bugs along the way	

• Learned a lot about tuning and operations	

• Half the nodes down at one point

• Corrupted SSTables on one node	

• We’ve been cautious	

• Started with internal-facing use only (don’t need 100% uptime)

• Moved to external-facing use, but with the ability to fall back off C* in minutes

• Getting braver	

• C* is now the only SOR and real-time customer-facing store in some cases

• We have on occasion custom-built C* with cherry-picked patches
HW Specs MTC-1
• Remember we want to build for the C* future	

• 6 nodes	

• 16x cores (Sandy Bridge)	

• 256G RAM	

• Lots of disk cache and mem-mapped NIO buffers	

• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)	

• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)

• RAID1 OS drives	

• 4x gigabit ethernet
SW Specs MTC-1
• CentOS 6.5	

• Cassandra 2.0.5	

• JDK 1.7.0_60-b19	

• 8 gig young generation / 6.4 gig eden	

• 16 gig old generation	

• Parallel new collector	

• CMS collector	

• Sounds like overkill but we are multi-tenant and have spiky loads
General
• LOCAL_QUORUM for reads and writes	

• Use LZ4 compression	

• Use key cache (not row cache)	

• Some tables use SizeTiered, some Leveled compaction (illustrative CQL below)

• Drivers	

• Athena (Scala / binary)	

• Astyanax 1.56.48 (Java / thrift)	

• node-cassandra-cql (Node / binary)
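• A hedged sketch of what those options look like in CQL on C* 2.0 - the table and its columns are hypothetical; only the WITH options mirror the bullets above
CREATE TABLE example_table (
  key text,
  value blob,
  PRIMARY KEY (key)
) WITH compression = {'sstable_compression': 'LZ4Compressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'}  -- or 'SizeTieredCompactionStrategy'
  AND caching = 'keys_only';  -- key cache on, row cache off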
Use Case 1 - Search API - Problem
• 40 million records (including duplicates per VIN) in HDFS

• Map/Reduce to 7 million SOLR XML updates in HDFS	

• Not delta-based today because of map/reduce-style business rules

• Export the SOLR XML from HDFS to the local FS

• Re-index via SOLR	

• 40 gig SOLR index - at least 3 slaves	

• OKish every few hours, not every 15 minutes	

• Even though we made very fast parallel indexer	

• The % of stored data read per indexing run is getting smaller
Use Case 1 - Search API - Solution
• Indexing in hadoop	

• SOLR(Lucene) segments created (no stored fields)	

• Job option for fallback to stored fields in SOLR index	

• Stored fields go to C* as JSON directly from hadoop	

• Astyanax - 1MB batches - LOCAL_QUORUM	

• Periodically create a new table (CF) with a full-data baseline (clustering) column

• 200MB/s across 3 replicas, sustained for one to two minutes

• 40000 partition keys/s (one per record)	

• Periodically add new (clustering) column to table with deltas from latest dump	

• Delta data size is 100x smaller and hits many fewer partition keys	

• Keep multiple recent tables for rollback (more for bad data than for recovery)

• 2 gig SOLR index (20x smaller)
Use Case 1 - Search API - Solution
• Very bare bones - not even any metadata :-(	

• Thrift style	

• Note we use blob	

• Everything is UTF-8	

• Avro - Utf8	

• Hadoop - Text	

• Astyanax - ByteBuffer	

• Most JVM drivers try to convert text to String
CREATE TABLE "20140618084015_20140618_081920_1403072360" (!
key text,!
column1 blob,!
value blob,!
PRIMARY KEY (key, column1)!
) WITH COMPACT STORAGE;
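• A hedged illustration of the write/read pattern against this layout - the key, generation tag, and JSON are all made up; textAsBlob just reflects that everything is stored as UTF-8 blobs
-- hypothetical delta write: one clustering column per (baseline or delta) generation
INSERT INTO "20140618084015_20140618_081920_1403072360" (key, column1, value)
VALUES ('VIN-1FTFW1ET5DFC12345', textAsBlob('delta_20140619_1200'), textAsBlob('{"price": 18995}'));

-- single-partition read of all generations for one record
SELECT column1, value
FROM "20140618084015_20140618_081920_1403072360"
WHERE key = 'VIN-1FTFW1ET5DFC12345';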
Use Case 1 - Search API - Solution
• Stored fields cached in SOLR JVM (verification/warm up tests)	

• MVCC to prevent read-from-future	

• Single clustering key limit for the SOLR core	

• Reads fallback from LOCAL_QUORUM to LOCAL_ONE	

• Better to return something, even a subset of results

• Never happened in production though	

• Issues	

• Don’t drop and recreate a table/CF until C* 2.1

• Early 2.0.x and Astyanax don’t like schema changes	

• Create new tables with CQL3 via Astyanax

• Monitoring harder since we now use UUID for table name	

• Full (non-delta) index write rate strains GC and causes some hinting

• C* remains rock solid	

• We can constrain load by mapper/reducer count, and will probably add a ZooKeeper mutex
Use Case 1.5 - RESA
• Newer version of the real estate pipeline

• Fully streaming delta pipeline (RabbitMQ)	

• Field-level SOLR index updates (including latest timestamp)

• C* row with JSON delta for that timestamp	

• History is used in customer facing features	

• Note this is really the same layout as the thrift table (history query sketch below)
CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
);
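• A hedged example of the customer-facing history read - the id is hypothetical; each row is the JSON delta that arrived at that timestamp
SELECT created_date, delta_json
FROM for_sale
WHERE id = 'listing-123'    -- hypothetical listing id
ORDER BY created_date DESC  -- newest deltas first
LIMIT 20;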
Use Case 2 - Feed Management - Problem
• Thousands of feeds of different sizes and frequencies

• Incoming feeds must be “polished”	

• Geocoding must be done	

• Images must be made available in S3	

• Need to reprocess individual feeds	

• Full output records are munged from asynchronously updated
parts	

• Previously huge HDFS job	

• 300M inputs for 70M full output records	

• Records need all data to be “ready” for full output	

• Wasteful, because most of the work is redundant with the previous run

• The only help with partitioning is brittle HDFS directory structures
Use Case 2 - Feed Management - Solution
• Scala & Akka & Athena (large throughput - high parallelism)	

• Compound partition key (2^n shards per feed)	

• Spreads data - limits partition “row” length	

• Read an entire feed without a key scan - small IN clause (query sketch below)

• Random access writes	

• Any sub-field may be updated asynchronously	

• Munged record emitted to HDFS whenever “ready”
CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
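• A hedged sketch of the access patterns above - the feed name, shard count (here 2^2 = 4), and values are hypothetical
-- read an entire feed with a small IN clause over the shards, no key scan
SELECT record_id, raw_record, polished_data, geocode_data, image_status
FROM feed_state
WHERE feed_name = 'example_feed'
  AND feed_record_id_shard IN (0, 1, 2, 3);

-- random-access write: asynchronously update one sub-field of one record
UPDATE feed_state
SET geocode_data = '{"lat": 30.27, "lon": -97.74}'
WHERE feed_name = 'example_feed'
  AND feed_record_id_shard = 2
  AND record_id = de305d54-75b4-431b-adb2-eb6b9e546014;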
Monitoring
• OpsCenter	

• log4j/syslog/graylog	

• Email alerts	

• Nagios/Zabbix

• Graphite (autogen graph pages)	

• Machine stats via collectl, JVM from codahale	

• Cassandra stats from codahale	

• Suspect a possible issue with Hadoop jobs using the same coordinator nodes

• GC logs	

• VisualVM
General Issues / Lessons Learned
• GC issues	

• Old generation fragmentation causes eventual promotion failure	

• Usually of 1MB Memtable “slabs” - These can be off heap in C* 2.1 :-)	

• Thrift API with bulk load probably not helping, but fragmentation is inevitable	

• Some slow initial mark and remark STW pauses	

• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)	

• As said we aim to be multi-tenant	

• Avoid client stupidity, but otherwise accommodate any client behavior	

• GC now well tuned 	

• 1 compacting GC per day at off-peak times, very rare 1-sec pauses, a handful >0.5 sec per day

• Cassandra and its own dog food

• Can’t wait for hints to be a commit-log-style regular file (C* 3.0)

• Compactions in progress table	

• OpsCenter rollups - turned off for search API tables
General Issues / Lessons Learned
• Don’t repair things that don’t need it

• We also run nodetool repair -pr -par on each node

• Beware when not following the rules	

• We were knowingly running on potentially buggy minor versions	

• If you don’t know what you’re doing you will likely screw up	

• Fortunately for us C* has always kept running fine	

• It is usually pretty easy to fix with some googling	

• Deleting data is counter-intuitively often a good fix!
Future
• Upgrade 2.0.x to use static columns (sketch at the end of this list)

• User defined types :-)	

• De-duplicate data into shared storage in C*	

• Analytics via data-locality	

• Hadoop, Pig, Spark/Scalding, R	

• More cross data center	

• More tuning	

• Full streaming pipeline with C* as side state store
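• A minimal static-column sketch (hypothetical table; static columns landed in C* 2.0.6) - one value is stored per partition instead of per clustering row
CREATE TABLE feed_status_example (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  last_polled timestamp static,  -- shared by every record in the partition
  raw_record text,
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
);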
Athena
• Why would we do such an obviously crazy thing?	

• Need to support async, reactive applications across different problem domains	

• Real-time API used by several disparate clients (iOS, Node.js, …)	

• Ground-up implementation of the CQL native binary protocol (v2)

• Scala 2.10/2.11	

• Akka 2.3.x 	

• Fully async, nonblocking API	

• Has obvious advantages but requires a different paradigm

• Implemented as an extension for Akka-IO	

• Low-level actor based abstraction	

• Cluster, Host and Connection actors	

• Reasonably stable	

• High-level streaming Session API
Athena
• Next steps	

• Move off of Play Iteratees and onto Akka Reactive Streams	

• Token based routing	

• Client API very much in flux - suggestions are welcome!	

• https://github.com/vast-engineering/athena	

• Release of first beta milestone to Sonatype Maven repository imminent	

• Pull requests welcome!
Appendix
GC Settings
-Xms24576M	

-Xmx24576M	

-Xmn8192M	

-Xss228k	

-XX:+UseParNewGC	

-XX:+UseConcMarkSweepGC	

-XX:+CMSParallelRemarkEnabled	

-XX:SurvivorRatio=8	

-XX:MaxTenuringThreshold=1	

-XX:CMSInitiatingOccupancyFraction=70	

-XX:+UseCMSInitiatingOccupancyOnly	

-XX:+UseTLAB	

-XX:+UseCondCardMark	

-XX:+CMSParallelInitialMarkEnabled	

-XX:+CMSEdenChunksRecordAlways	

-XX:+HeapDumpOnOutOfMemoryError	

-XX:+CMSPrintEdenSurvivorChunks	

-XX:+PrintGCDetails	

-XX:+PrintGCDateStamps	

-XX:+PrintHeapAtGC	

-XX:+PrintTenuringDistribution	

-XX:+PrintGCApplicationStoppedTime	

-XX:+PrintPromotionFailure	

-XX:PrintFLSStatistics=1