June 19, 2014
Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications
Introduction
• Don’t want this to be a data modeling talk	

• We aren't experts - we are learning as we go	

• Hopefully this will be useful to both you and us	

• Informal, questions as we go	

• We will share our experiences so far moving to Cassandra	

• We are working on a bunch of existing and new projects

• We'll talk about 2 1/2 of them	

• Some dev stuff, some ops stuff	

• Some thoughts for the future	

• Athena Scala Driver
Who is Vast?
• Vast operates white-label performance-based marketplaces for
publishers and delivers big data mobile applications for
automotive and real estate sales professionals

• “Big Data for Big Purchases”	

• Marketplaces	

• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA,
Yahoo

• Hundreds of smaller partner sites	

• Analytics	

• Strong team of scarily smart data scientists	

• Integrating analytics everywhere
Big Data
• HDFS - 1100TB	

• Amazon S3 - 275TB	

• Amazon Glacier - 150TB	

• DynamoDB - 12TB

• Vertica - 2TB	

• Cassandra - 1.5TB	

• SOLR/Lucene - 400GB	

• Zookeeper	

• MySQL	

• Postgres	

• Redis	

• CouchDB
Data Flow
• Flows between different data store types (many include historical data too)	

• Systems of Record (SOR)	

• Both root nodes and leaf nodes	

• Derived data stores (mostly MVCC) for:	

• Real time customer facing queries	

• Real time analytics	

• Alerting	

• Offline analytics	

• Reporting	

• Debugging	

• Mixture of dumps and deltas	

• We have derived SORs	

• Cache a smaller subset of records/fields for a specific purpose

• SORs in multiple data centers - some derived SORs shared	

• Data flow is a graph, not a tree - there is feedback
Goals
• Reduce latency to <15 minutes for customer-facing data

• Reduce copying and duplication of data	

• Network/Storage/Time costs	

• More streaming & deltas, fewer dumps and derived SORs

• Want multi-purpose, multi-tenant central store	

• Something rock solid	

• Something that can handle lots of data fast 	

• Something that can do random access and bulk operations	

• Use for all data store types on previous slide	

• (Over?)build it; they will come	

• Consolidate rest on	

• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene
Why Cassandra?
• Regarded as rock solid	

• No single point of failure	

• Active development & open source Java	

• Good fit for the type of data we wanted to store	

• Ease of configuration; all nodes are the same	

• Easily tunable consistency at application level	

• Easy control of sharding at application level	

• Drivers for all our languages (we're mostly JVM but also node)	

• Data locality with other tools	

• Good cross data center support
Evolution
• July 2013 (alpha on C* 1.1)	

• September 2013 (MTC-1 on C* 2.0.0)	

• First use case (a nasty one) - talk about it later	

• Stress/Destructive testing	

• Found and helped fix a few bugs along the way	

• Learned a lot about tuning and operations	

• Half the nodes down at one point

• Corrupted SSTables on one node	

• We’ve been cautious	

• Started with internal-facing use only (don’t need 100% uptime)

• Moved to external-facing use, but with the ability to fall back off C* in minutes

• Getting braver	

• C* is now the only SOR and real-time customer-facing store in some cases

• We have on occasion custom-built C* with cherry-picked patches
HW Specs MTC-1
• Remember we want to build for the C* future	

• 6 nodes	

• 16x cores (Sandy Bridge)	

• 256G RAM	

• Lots of disk cache and mem-mapped NIO buffers	

• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)	

• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)

• RAID1 OS drives	

• 4x gigabit ethernet
SW Specs MTC-1
• CentOS 6.5	

• Cassandra 2.0.5	

• JDK 1.7.0_60-b19	

• 8 gig young generation / 6.4 gig eden	

• 16 gig old generation	

• Parallel new collector	

• CMS collector	

• Sounds like overkill but we are multi-tenant and have spiky loads
General
• LOCAL_QUORUM for reads and writes	

• Use LZ4 compression	

• Use key cache (not row cache)	

• Some tables use SizeTiered, some Leveled compaction (illustrative CQL below)

• Drivers	

• Athena (Scala / binary)	

• Astyanax 1.56.48 (Java / thrift)	

• node-cassandra-cql (Node / binary)
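• A hedged sketch of what those options look like in CQL on C* 2.0 - the table and its columns are hypothetical; only the WITH options mirror the bullets above
CREATE TABLE example_table (
  key text,
  value blob,
  PRIMARY KEY (key)
) WITH compression = {'sstable_compression': 'LZ4Compressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'}  -- or 'SizeTieredCompactionStrategy'
  AND caching = 'keys_only';  -- key cache on, row cache off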
Use Case 1 - Search API - Problem
• 40 million records (including duplicates per VIN) in HDFS

• Map/Reduce to 7 million SOLR XML updates in HDFS	

• Not delta-based today because of map/reduce-style business rules

• Export the SOLR XML from HDFS to the local FS

• Re-index via SOLR	

• 40 gig SOLR index - at least 3 slaves	

• OKish every few hours, not every 15 minutes	

• Even though we made very fast parallel indexer	

• The % of stored data read per indexing run is getting smaller
Use Case 1 - Search API - Solution
• Indexing in hadoop	

• SOLR(Lucene) segments created (no stored fields)	

• Job option for fallback to stored fields in SOLR index	

• Stored fields go to C* as JSON directly from hadoop	

• Astyanax - 1MB batches - LOCAL_QUORUM	

• Periodically create a new table (CF) with a full-data baseline (clustering) column

• 200MB/s across 3 replicas, sustained for one to two minutes

• 40000 partition keys/s (one per record)	

• Periodically add new (clustering) column to table with deltas from latest dump	

• Delta data size is 100x smaller and hits many fewer partition keys	

• Keep multiple recent tables for rollback (more for bad data than for recovery)

• 2 gig SOLR index (20x smaller)
Use Case 1 - Search API - Solution
• Very bare bones - not even any metadata :-(	

• Thrift style	

• Note we use blob	

• Everything is UTF-8	

• Avro - Utf8	

• Hadoop - Text	

• Astyanax - ByteBuffer	

• Most JVM drivers try to convert text to String
CREATE TABLE "20140618084015_20140618_081920_1403072360" (!
key text,!
column1 blob,!
value blob,!
PRIMARY KEY (key, column1)!
) WITH COMPACT STORAGE;
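• A hedged illustration of the write/read pattern against this layout - the key, generation tag, and JSON are all made up; textAsBlob just reflects that everything is stored as UTF-8 blobs
-- hypothetical delta write: one clustering column per (baseline or delta) generation
INSERT INTO "20140618084015_20140618_081920_1403072360" (key, column1, value)
VALUES ('VIN-1FTFW1ET5DFC12345', textAsBlob('delta_20140619_1200'), textAsBlob('{"price": 18995}'));

-- single-partition read of all generations for one record
SELECT column1, value
FROM "20140618084015_20140618_081920_1403072360"
WHERE key = 'VIN-1FTFW1ET5DFC12345';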
Use Case 1 - Search API - Solution
• Stored fields cached in SOLR JVM (verification/warm up tests)	

• MVCC to prevent read-from-future	

• Single clustering key limit for the SOLR core	

• Reads fallback from LOCAL_QUORUM to LOCAL_ONE	

• Better to return something, even a subset of results

• Never happened in production though	

• Issues	

• Don’t drop and recreate a table/CF until C* 2.1

• Early 2.0.x and Astyanax don’t like schema changes	

• Create new tables with CQL3 via Astyanax

• Monitoring harder since we now use UUID for table name	

• Full (non-delta) index write rate strains GC and causes some hinting

• C* remains rock solid	

• We can constrain load by mapper/reducer count, and will probably add a ZooKeeper mutex
Use Case 1.5 - RESA
• Newer version of the real estate pipeline

• Fully streaming delta pipeline (RabbitMQ)	

• Field-level SOLR index updates (including latest timestamp)

• C* row with JSON delta for that timestamp	

• History is used in customer facing features	

• Note this is really the same layout as the thrift table (history query sketch below)
CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
);
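• A hedged example of the customer-facing history read - the id is hypothetical; each row is the JSON delta that arrived at that timestamp
SELECT created_date, delta_json
FROM for_sale
WHERE id = 'listing-123'    -- hypothetical listing id
ORDER BY created_date DESC  -- newest deltas first
LIMIT 20;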
Use Case 2 - Feed Management - Problem
• Thousands of feeds of different sizes and frequencies

• Incoming feeds must be “polished”	

• Geocoding must be done	

• Images must be made available in S3	

• Need to reprocess individual feeds	

• Full output records are munged from asynchronously updated
parts	

• Previously huge HDFS job	

• 300M inputs for 70M full output records	

• Records need all data to be “ready” for full output	

• Wasteful, because most of the work is redundant with the previous run

• The only help with partitioning is brittle HDFS directory structures
Use Case 2 - Feed Management - Solution
• Scala & Akka & Athena (large throughput - high parallelism)	

• Compound partition key (2^n shards per feed)	

• Spreads data - limits partition “row” length	

• Read an entire feed without a key scan - small IN clause (query sketch below)

• Random access writes	

• Any sub-field may be updated asynchronously	

• Munged record emitted to HDFS whenever “ready”
CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
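• A hedged sketch of the access patterns above - the feed name, shard count (here 2^2 = 4), and values are hypothetical
-- read an entire feed with a small IN clause over the shards, no key scan
SELECT record_id, raw_record, polished_data, geocode_data, image_status
FROM feed_state
WHERE feed_name = 'example_feed'
  AND feed_record_id_shard IN (0, 1, 2, 3);

-- random-access write: asynchronously update one sub-field of one record
UPDATE feed_state
SET geocode_data = '{"lat": 30.27, "lon": -97.74}'
WHERE feed_name = 'example_feed'
  AND feed_record_id_shard = 2
  AND record_id = de305d54-75b4-431b-adb2-eb6b9e546014;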
Monitoring
• OpsCenter	

• log4j/syslog/graylog	

• Email alerts	

• Nagios/Zabbix

• Graphite (autogen graph pages)	

• Machine stats via collectl, JVM from codahale	

• Cassandra stats from codahale	

• Suspect a possible issue with Hadoop jobs using the same coordinator nodes

• GC logs	

• VisualVM
General Issues / Lessons Learned
• GC issues	

• Old generation fragmentation causes eventual promotion failure	

• Usually of 1MB Memtable “slabs” - These can be off heap in C* 2.1 :-)	

• Thrift API with bulk load probably not helping, but fragmentation is inevitable	

• Some slow initial mark and remark STW pauses	

• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)	

• As said we aim to be multi-tenant	

• Avoid client stupidity, but otherwise accommodate any client behavior	

• GC now well tuned 	

• 1 compacting GC per day at off-peak times, very rare 1-sec pauses, a handful >0.5 sec per day

• Cassandra and its own dog food

• Can’t wait for hints to be a commit-log-style regular file (C* 3.0)

• Compactions in progress table	

• OpsCenter rollups - turned off for search API tables
General Issues / Lessons Learned
• Don’t repair things that don’t need it

• We also run nodetool repair -pr -par on each node

• Beware when not following the rules	

• We were knowingly running on potentially buggy minor versions	

• If you don’t know what you’re doing you will likely screw up	

• Fortunately for us C* has always kept running fine	

• It is usually pretty easy to fix with some googling	

• Deleting data is counter-intuitively often a good fix!
Future
• Upgrade 2.0.x to use static columns (sketch at the end of this list)

• User defined types :-)	

• De-duplicate data into shared storage in C*	

• Analytics via data-locality	

• Hadoop, Pig, Spark/Scalding, R	

• More cross data center	

• More tuning	

• Full streaming pipeline with C* as side state store
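• A minimal static-column sketch (hypothetical table; static columns landed in C* 2.0.6) - one value is stored per partition instead of per clustering row
CREATE TABLE feed_status_example (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  last_polled timestamp static,  -- shared by every record in the partition
  raw_record text,
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
);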
Athena
• Why would we do such an obviously crazy thing?	

• Need to support async, reactive applications across different problem domains	

• Real-time API used by several disparate clients (iOS, Node.js, …)	

• Ground-up implementation of the CQL native binary protocol (v2)

• Scala 2.10/2.11	

• Akka 2.3.x 	

• Fully async, nonblocking API	

• Has obvious advantages but requires a different paradigm

• Implemented as an extension for Akka-IO	

• Low-level actor based abstraction	

• Cluster, Host and Connection actors	

• Reasonably stable	

• High-level streaming Session API
Athena
• Next steps	

• Move off of Play Iteratees and onto Akka Reactive Streams	

• Token based routing	

• Client API very much in flux - suggestions are welcome!	

• https://github.com/vast-engineering/athena	

• Release of first beta milestone to Sonatype Maven repository imminent	

• Pull requests welcome!
Appendix
GC Settings
-Xms24576M	

-Xmx24576M	

-Xmn8192M	

-Xss228k	

-XX:+UseParNewGC	

-XX:+UseConcMarkSweepGC	

-XX:+CMSParallelRemarkEnabled	

-XX:SurvivorRatio=8	

-XX:MaxTenuringThreshold=1	

-XX:CMSInitiatingOccupancyFraction=70	

-XX:+UseCMSInitiatingOccupancyOnly	

-XX:+UseTLAB	

-XX:+UseCondCardMark	

-XX:+CMSParallelInitialMarkEnabled	

-XX:+CMSEdenChunksRecordAlways	

-XX:+HeapDumpOnOutOfMemoryError	

-XX:+CMSPrintEdenSurvivorChunks	

-XX:+PrintGCDetails	

-XX:+PrintGCDateStamps	

-XX:+PrintHeapAtGC	

-XX:+PrintTenuringDistribution	

-XX:+PrintGCApplicationStoppedTime	

-XX:+PrintPromotionFailure	

-XX:PrintFLSStatistics=1