Austin Cassandra Users 6/19: Apache Cassandra at Vast

For our June meetup, we'll have our local friends at www.vast.com presenting some of their current use cases for Cassandra. Additionally, Vast will be talking about a non-blocking Scala client that they have developed in house.

  1. Cassandra at Vast (June 19, 2014) - Graham Sanderson, CTO; David Pratt, Director of Applications
  2. Introduction
      • Don't want this to be a data modeling talk
      • We aren't experts - we are learning as we go
      • Hopefully this will be useful to both you and us
      • Informal, questions as we go
      • We will share our experiences so far moving to Cassandra
      • We are working on a bunch of existing and new projects
        • We'll talk about 2 1/2 of them
        • Some dev stuff, some ops stuff
      • Some thoughts for the future
      • Athena Scala Driver
  3. Who is Vast?
      • Vast operates white-label, performance-based marketplaces for publishers and delivers big data mobile applications for automotive and real estate sales professionals
      • “Big Data for Big Purchases”
      • Marketplaces
        • Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo
        • Hundreds of smaller partner sites
      • Analytics
        • Strong team of scarily smart data scientists
        • Integrating analytics everywhere
  4. Big Data
      • HDFS - 1100TB
      • Amazon S3 - 275TB
      • Amazon Glacier - 150TB
      • DynamoDB - 12TB
      • Vertica - 2TB
      • Cassandra - 1.5TB
      • SOLR/Lucene - 400GB
      • Zookeeper
      • MySQL
      • Postgres
      • Redis
      • CouchDB
  5. Data Flow
      • Flows between different data store types (many include historical data too)
      • Systems of Record (SOR)
        • Both root nodes and leaf nodes
      • Derived data stores (mostly MVCC) for:
        • Real time customer facing queries
        • Real time analytics
        • Alerting
        • Offline analytics
        • Reporting
        • Debugging
      • Mixture of dumps and deltas
      • We have derived SORs
        • Cached smaller subset of records/fields for a specific purpose
      • SORs in multiple data centers - some derived SORs are shared
      • Data flow is a graph, not a tree - there is feedback
  6. Goals
      • Reduce latency to <15 mins for customer facing data
      • Reduce copying and duplication of data
        • Network/storage/time costs
      • More streaming & deltas, fewer dumps and derived SORs
      • Want a multi-purpose, multi-tenant central store
        • Something rock solid
        • Something that can handle lots of data fast
        • Something that can do random access and bulk operations
        • Use for all data store types on the previous slide
        • (Over?)build it; they will come
      • Consolidate the rest on HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene
  7. Why Cassandra?
      • Regarded as rock solid
      • No single point of failure
      • Active development & open source Java
      • Good fit for the type of data we wanted to store
      • Ease of configuration; all nodes are the same
      • Easily tunable consistency at application level
      • Easy control of sharding at application level
      • Drivers for all our languages (we're mostly JVM but also node)
      • Data locality with other tools
      • Good cross data center support
  8. Evolution
      • July 2013 (alpha on C* 1.1)
      • September 2013 (MTC-1 on C* 2.0.0)
      • First use case (a nasty one) - we'll talk about it later
      • Stress/destructive testing
        • Found and helped fix a few bugs along the way
        • Learned a lot about tuning and operations
        • Half the nodes down at one point
        • Corrupted SSTables on one node
      • We've been cautious
        • Started with internal facing use only (don't need 100% uptime)
        • Moved to external facing use, but with the ability to fall back off C* in minutes
      • Getting braver
        • C* is the only SOR and real time customer facing store for some cases now
      • We have on occasion custom built C* with cherry-picked patches
  9. HW Specs MTC-1
      • Remember we want to build for the C* future
      • 6 nodes
        • 16x cores (Sandy Bridge)
        • 256G RAM
          • Lots of disk cache and mem-mapped NIO buffers
        • 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)
        • 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)
        • RAID1 OS drives
        • 4x gigabit ethernet
  10. SW Specs MTC-1
      • CentOS 6.5
      • Cassandra 2.0.5
      • JDK 1.7.0_60-b19
        • 8 gig young generation / 6.4 gig eden
        • 16 gig old generation
        • Parallel new collector
        • CMS collector
      • Sounds like overkill, but we are multi-tenant and have spiky loads
  11. General
      • LOCAL_QUORUM for reads and writes
      • Use LZ4 compression
      • Use key cache (not row cache)
      • Some SizeTiered, some Leveled compaction strategy (see the sketch below)
      • Drivers
        • Athena (Scala / binary)
        • Astyanax 1.56.48 (Java / thrift)
        • node-cassandra-cql (Node / binary)
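      A minimal CQL sketch (Cassandra 2.0-era syntax) of the per-table settings above: LZ4 compression, key-cache-only caching, and leveled compaction. The table and column names are illustrative, not Vast's actual schema.

        CREATE TABLE example_listings (
          id text,
          updated timestamp,
          payload text,
          PRIMARY KEY (id, updated)
        ) WITH compression = { 'sstable_compression': 'LZ4Compressor' }
          AND compaction = { 'class': 'LeveledCompactionStrategy' }
          AND caching = 'keys_only';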
  12. Use Case 1 - Search API - Problem
      • 40 million records (including duplicates per VIN) in HDFS
      • Map/Reduce to 7 million SOLR XML updates in HDFS
        • Not a delta today because of map/reduce-like business rules
      • Export the SOLR XML from HDFS to the local FS
      • Re-index via SOLR
        • 40 gig SOLR index - at least 3 slaves
      • OKish every few hours, not every 15 minutes
        • Even though we made a very fast parallel indexer
      • % of stored data read per indexing run is getting smaller
  13. Use Case 1 - Search API - Solution
      • Indexing in hadoop
        • SOLR (Lucene) segments created (no stored fields)
        • Job option for fallback to stored fields in the SOLR index
      • Stored fields go to C* as JSON directly from hadoop
        • Astyanax - 1MB batches - LOCAL_QUORUM
      • Periodically create a new table (CF) with a full data baseline (clustering) column
        • 200MB/s to 3 replicas continuously for one to two minutes
        • 40,000 partition keys/s (one per record)
      • Periodically add a new (clustering) column to the table with deltas from the latest dump
        • Delta data size is 100x smaller and hits many fewer partition keys
      • Keep multiple recent tables for rollback (bad data more than recovery)
      • 2 gig SOLR index (20x smaller)
  14. Use Case 1 - Search API - Solution
      • Very bare bones - not even any metadata :-(
      • Thrift style
      • Note we use blob
        • Everything is UTF-8
        • Avro - Utf8
        • Hadoop - Text
        • Astyanax - ByteBuffer
        • Most JVM drivers try to convert text to String
      • The table (see the example reads/writes below):

        CREATE TABLE "20140618084015_20140618_081920_1403072360" (
          key text,
          column1 blob,
          value blob,
          PRIMARY KEY (key, column1)
        ) WITH COMPACT STORAGE;
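      Illustrative only: Vast writes these rows via Astyanax over thrift in 1MB batches, but the equivalent CQL against this compact-storage table would look roughly like the following, assuming the clustering column identifies the baseline or delta dump. The key and JSON values are made up.

        -- Hypothetical write of one record's stored fields for a given baseline/delta
        INSERT INTO "20140618084015_20140618_081920_1403072360" (key, column1, value)
        VALUES ('record-key-123', textAsBlob('20140618084015'), textAsBlob('{"price": 25000}'));

        -- Read every stored-field column (baseline plus any deltas) for one record
        SELECT column1, value
        FROM "20140618084015_20140618_081920_1403072360"
        WHERE key = 'record-key-123';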
  15. Use Case 1 - Search API - Solution
      • Stored fields cached in the SOLR JVM (verification/warm up tests)
      • MVCC to prevent read-from-future
        • Single clustering key limit for the SOLR core
      • Reads fall back from LOCAL_QUORUM to LOCAL_ONE
        • Better to return something, even a subset of results
        • Never happened in production though
      • Issues
        • Don't recreate the table/CF until C* 2.1
          • Early 2.0.x and Astyanax don't like schema changes
          • Create new tables via CQL3 via Astyanax
          • Monitoring is harder since we now use a UUID for the table name
        • Full (non-delta) index write rate strains GC and causes some hinting
          • C* remains rock solid
          • We can constrain by mapper/reducer count, and will probably add a zookeeper mutex
  16. Use Case 1.5 - RESA
      • Newer version of the real estate pipeline
      • Fully streaming delta pipeline (RabbitMQ)
      • Field level SOLR index updates (include latest timestamp)
      • C* row with the JSON delta for that timestamp
        • History is used in customer facing features (see the example query below)
      • Note this is really the same table shape as the thrift one:

        CREATE TABLE for_sale (
          id text,
          created_date timestamp,
          delta_json text,
          PRIMARY KEY (id, created_date)
        )
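      A hedged example of how that history could be read back for a customer facing feature: all deltas for one listing, newest first. The id value is made up.

        SELECT created_date, delta_json
        FROM for_sale
        WHERE id = 'listing-123'
        ORDER BY created_date DESC;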
  17. Use Case 2 - Feed Management - Problem
      • Thousands of feeds of different sizes and frequencies
      • Incoming feeds must be “polished”
        • Geocoding must be done
        • Images must be made available in S3
      • Need to reprocess individual feeds
      • Full output records are munged from asynchronously updated parts
      • Previously a huge HDFS job
        • 300M inputs for 70M full output records
        • Records need all data to be “ready” for full output
        • Silly because most work is redundant from the previous run
        • The only help with partitioning is brittle HDFS directory structures
  18. Use Case 2 - Feed Management - Solution
      • Scala & Akka & Athena (large throughput - high parallelism)
      • Compound partition key (2^n shards per feed)
        • Spreads data - limits partition “row” length
        • Read an entire feed without a key scan - small IN clause (see the example below)
      • Random access writes
        • Any sub-field may be updated asynchronously
      • Munged record emitted to HDFS whenever “ready”

        CREATE TABLE feed_state (
          feed_name text,
          feed_record_id_shard int,
          record_id uuid,
          raw_record text,
          polished_data text,
          geocode_data text,
          image_status text,
          ...
          PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
        )
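      Hedged examples against the table above, assuming 4 shards per feed and made-up values: the small IN clause that reads a whole feed without a key scan, and an asynchronous update of a single sub-field as its pipeline stage completes.

        -- Read an entire feed by fanning out over the fixed shard set
        SELECT * FROM feed_state
        WHERE feed_name = 'example_feed'
          AND feed_record_id_shard IN (0, 1, 2, 3);

        -- Asynchronously update one sub-field (e.g. geocoding finished for a record)
        UPDATE feed_state
        SET geocode_data = '{"lat": 30.27, "lon": -97.74}'
        WHERE feed_name = 'example_feed'
          AND feed_record_id_shard = 2
          AND record_id = 4d6c1458-f66a-11e3-a3ac-0800200c9a66;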
  19. Monitoring
      • OpsCenter
      • log4j/syslog/graylog
        • Email alerts
      • nagios/zabbix
      • Graphite (autogen graph pages)
        • Machine stats via collectl, JVM stats from codahale
        • Cassandra stats from codahale
        • Suspect a possible issue with hadoop using the same coordinator nodes
      • GC logs
      • VisualVM
  20. General Issues / Lessons Learned
      • GC issues
        • Old generation fragmentation causes eventual promotion failure
          • Usually of 1MB Memtable “slabs” - these can be off heap in C* 2.1 :-)
          • Thrift API with bulk load probably isn't helping, but fragmentation is inevitable
        • Some slow initial mark and remark STW pauses
          • We do have a big young gen - new -XX:+ flags in 1.7.0_60 :-)
        • As said, we aim to be multi-tenant
          • Avoid client stupidity, but otherwise accommodate any client behavior
        • GC now well tuned
          • ~1 compacting GC per day (at off times), very rare 1 sec pauses per day, a handful >0.5 sec per day
      • Cassandra and its own dog food
        • Can't wait for hints to be a commit-log-style regular file (C* 3.0)
        • Compactions in progress table
        • OpsCenter rollup - turned off for search api tables
  21. General Issues / Lessons Learned
      • Don't repair things that don't need them
        • We also run -pr -par repair on each node
      • Beware when not following the rules
        • We were knowingly running on potentially buggy minor versions
        • If you don't know what you're doing you will likely screw up
        • Fortunately for us, C* has always kept running fine
        • It is usually pretty easy to fix with some googling
        • Deleting data is, counter-intuitively, often a good fix!
  22. Future
      • Upgrade 2.0.x to use static columns (see sketch below)
      • User defined types :-) (see sketch below)
      • De-duplicate data into shared storage in C*
      • Analytics via data locality
        • Hadoop, Pig, Spark/Scalding, R
      • More cross data center
      • More tuning
      • Full streaming pipeline with C* as a side state store
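      Hedged sketches of the two schema features called out above, using a hypothetical feed-style table rather than Vast's actual schema: static columns (available from C* 2.0.6) and user defined types (C* 2.1).

        -- Static column: one per-partition value shared by every record in a feed
        CREATE TABLE feed_records (
          feed_name text,
          feed_config text static,
          record_id uuid,
          raw_record text,
          PRIMARY KEY (feed_name, record_id)
        );

        -- User defined type (C* 2.1): group related fields instead of flat text columns
        CREATE TYPE geocode (
          lat double,
          lon double,
          accuracy text
        );

        ALTER TABLE feed_records ADD geocode_data frozen<geocode>;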
  23. Athena
      • Why would we do such an obviously crazy thing?
        • Need to support async, reactive applications across different problem domains
        • Real-time API used by several disparate clients (iOS, Node.js, …)
      • Ground-up implementation of the CQL binary protocol (v2)
        • Scala 2.10/2.11
        • Akka 2.3.x
      • Fully async, nonblocking API
        • Has obvious advantages but requires a different paradigm
      • Implemented as an extension for Akka-IO
        • Low-level actor based abstraction
        • Cluster, Host and Connection actors
        • Reasonably stable
      • High-level streaming Session API
  24. Athena
      • Next steps
        • Move off of Play Iteratees and onto Akka Reactive Streams
        • Token based routing
      • Client API very much in flux - suggestions are welcome!
      • https://github.com/vast-engineering/athena
        • Release of first beta milestone to Sonatype Maven repository imminent
        • Pull requests welcome!
  25. Appendix
  26. GC Settings

        -Xms24576M -Xmx24576M -Xmn8192M -Xss228k
        -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
        -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
        -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
        -XX:+UseTLAB -XX:+UseCondCardMark
        -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways
        -XX:+HeapDumpOnOutOfMemoryError -XX:+CMSPrintEdenSurvivorChunks
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
        -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
        -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1
  27. Cassandra at Vast - Graham Sanderson, CTO; David Pratt, Director of Applications
