June 19, 2014
Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications

Introduction
• Don’t want this to be a data modeling talk	

• We aren't experts - we are learning as we go	

• Hopefully this will be useful to both you and us	

• Informal, questions as we go	

• We will share our experiences so far moving to Cassandra	

• We are working on a bunch of existing and new projects

• We'll talk about 2 1/2 of them	

• Some dev stuff, some ops stuff	

• Some thoughts for the future	

• Athena Scala Driver

Who is Vast?
• Vast operates white-label, performance-based marketplaces for publishers, and delivers big data mobile applications for automotive and real estate sales professionals

• “Big Data for Big Purchases”	

• Marketplaces	

• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo

• Hundreds of smaller partner sites	

• Analytics	

• Strong team of scarily smart data scientists	

• Integrating analytics everywhere

Big Data
• HDFS - 1100TB	

• Amazon S3 - 275TB	

• Amazon Glacier - 150TB	

• DynamoDB - 12TB

• Vertica - 2TB	

• Cassandra - 1.5TB	

• SOLR/Lucene - 400GB	

• Zookeeper	

• MySQL	

• Postgres	

• Redis	

• CouchDB

Data Flow
• Flows between different data store types (many include historical data too)	

• Systems of Record (SOR)	

• Both root nodes and leaf nodes	

• Derived data stores (mostly MVCC) for:	

• Real time customer facing queries	

• Real time analytics	

• Alerting	

• Offline analytics	

• Reporting	

• Debugging	

• Mixture of dumps and deltas	

• We have derived SORs	

• Cached smaller subset records/fields for a specific purpose	

• SORs in multiple data centers - some derived SORs shared	

• Data flow is a graph, not a tree - there is feedback

Goals
• Reduce latency to <15 mins for customer facing data

• Reduce copying and duplication of data	

• Network/Storage/Time costs	

• More streaming & deltas, fewer dumps and derived SORs

• Want multi-purpose, multi-tenant central store	

• Something rock solid	

• Something that can handle lots of data fast 	

• Something that can do random access and bulk operations	

• Use for all data store types on previous slide	

• (Over?)build it; they will come	

• Consolidate rest on	

• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene

Why Cassandra?
• Regarded as rock solid	

• No single point of failure	

• Active development & open source Java	

• Good fit for the type of data we wanted to store	

• Ease of configuration; all nodes are the same	

• Easily tunable consistency at application level	

• Easy control of sharding at application level	

• Drivers for all our languages (we're mostly JVM but also node)	

• Data locality with other tools	

• Good cross data center support

Evolution
• July 2013 (alpha on C* 1.1)	

• September 2013 (MTC-1 on C* 2.0.0)	

• First use case (a nasty one) - talk about it later	

• Stress/Destructive testing	

• Found and helped fix a few bugs along the way	

• Learned a lot about tuning and operations	

• Half the nodes down at one point

• Corrupted SSTables on one node	

• We’ve been cautious	

• Started with internal facing only use (don’t need 100% uptime)	

• Moved to external facing use but with ability to fall back off C* in minutes	

• Getting braver	

• C* is only SOR and real time customer facing store for some cases now	

• We have on occasion custom built C* with cherry-picked patches

HW Specs MTC-1
• Remember we want to build for the C* future	

• 6 nodes	

• 16x cores (Sandy Bridge)	

• 256G RAM	

• Lots of disk cache and mem-mapped NIO buffers	

• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)	

• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)

• RAID1 OS drives	

• 4x gigabit ethernet

SW Specs MTC-1
• CentOS 6.5	

• Cassandra 2.0.5	

• JDK 1.7.0_60-b19	

• 8 gig young generation / 6.4 gig eden	

• 16 gig old generation	

• Parallel new collector	

• CMS collector	

• Sounds like overkill but we are multi-tenant and have spiky loads

General
• LOCAL_QUORUM for reads and writes	

• Use LZ4 compression	

• Use key cache (not row cache)	

• Some SizeTiered, some Leveled CompactionStrategy	

• Drivers	

• Athena (Scala / binary)	

• Astyanax 1.56.48 (Java / thrift)	

• node-cassandra-cql (Node / binary)
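
As a rough illustration of what LOCAL_QUORUM means for reads and writes, the sketch below computes how many replicas in the local data center must respond. The replication factor of 3 is an assumption for the example, and the helper names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM / LOCAL_QUORUM request."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


# Assumed replication factor of 3 in the local data center.
print(quorum(3), tolerated_failures(3))
```

With quorum reads and quorum writes, every read overlaps at least one replica that saw the latest write, which is why the pairing is a common default.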

Use Case 1 - Search API - Problem
• 40 million records (including duplicates per VIN) in HDFS

• Map/Reduce to 7 million SOLR XML updates in HDFS	

• Not delta today because of map/reduce-like business rules

• Export to SOLR XML from HDFS to local FS	

• Re-index via SOLR	

• 40 gig SOLR index - at least 3 slaves	

• OKish every few hours, not every 15 minutes	

• Even though we made a very fast parallel indexer

• % of stored data read per indexing is getting smaller

Use Case 1 - Search API - Solution
• Indexing in hadoop	

• SOLR(Lucene) segments created (no stored fields)	

• Job option for fallback to stored fields in SOLR index	

• Stored fields go to C* as JSON directly from hadoop	

• Astyanax - 1MB batches - LOCAL_QUORUM	

• Periodically create new table(CF) with full data baseline (clustering) column	

• 200MB/s 3 replicas continuously for one to two minutes	

• 40000 partition keys/s (one per record)	

• Periodically add new (clustering) column to table with deltas from latest dump	

• Delta data size is 100x smaller and hits many fewer partition keys	

• Keep multiple recent tables for rollback (bad data more than recovery)	

• 2 gig SOLR index (20x smaller)

Use Case 1 - Search API - Solution
• Very bare bones - not even any metadata :-(	

• Thrift style	

• Note we use blob	

• Everything is UTF-8	

• Avro - Utf8	

• Hadoop - Text	

• Astyanax - ByteBuffer	

• Most JVM drivers try to convert text to String
CREATE TABLE "20140618084015_20140618_081920_1403072360" (
  key text,
  column1 blob,
  value blob,
  PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;

Use Case 1 - Search API - Solution
• Stored fields cached in SOLR JVM (verification/warm up tests)	

• MVCC to prevent read-from-future	

• Single clustering key limit for the SOLR core	

• Reads fall back from LOCAL_QUORUM to LOCAL_ONE

• Better to return something, even a subset of results

• Never happened in production though	

• Issues	

• Don’t recreate table/CF until C* 2.1	

• Early 2.0.x and Astyanax don’t like schema changes	

• Create new tables via CQL3 via Astyanax	

• Monitoring harder since we now use UUID for table name	

• Full (non delta) index write rate strains GC and causes some hinting	

• C* remains rock solid	

• We can constrain by mapper/reducer count, and will probably add zookeeper mutex
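
The LOCAL_QUORUM to LOCAL_ONE read fallback mentioned on this slide could be sketched roughly as follows. This is not the code used at Vast: `ReadUnavailable` and the `read` callable are hypothetical stand-ins for whatever the driver raises and however the query is issued.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class ReadUnavailable(Exception):
    """Hypothetical error: the requested consistency level could not be met."""


def read_with_fallback(
    read: Callable[[str], T],
    levels: Sequence[str] = ("LOCAL_QUORUM", "LOCAL_ONE"),
) -> T:
    """Try each consistency level in order, returning the first success."""
    if not levels:
        raise ValueError("no consistency levels given")
    last_error = None
    for level in levels:
        try:
            return read(level)
        except ReadUnavailable as err:
            last_error = err  # remember the failure, try the weaker level
    raise last_error
```

The trade-off is explicit: a weaker read may return slightly stale stored fields, which the slide argues is better than returning nothing.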

Use Case 1.5 - RESA
• Newer version of real estate	

• Fully streaming delta pipeline (RabbitMQ)	

• Field level SOLR index updates (include latest timestamp)	

• C* row with JSON delta for that timestamp	

• History is used in customer facing features	

• Note this is really the same table as the thrift one
CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
);
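
To make the "JSON delta per timestamp" idea concrete, here is a minimal sketch of how a reader might rebuild a listing from `for_sale` rows. The slides do not say how deltas are merged, so the shallow key-overwrite merge, the `fold_deltas` name, and the `as_of` cutoff are all assumptions for illustration.

```python
import json
from datetime import datetime
from typing import Iterable, Optional, Tuple


def fold_deltas(
    rows: Iterable[Tuple[datetime, str]],
    as_of: Optional[datetime] = None,
) -> dict:
    """Apply (created_date, delta_json) rows in timestamp order.

    Rows newer than `as_of` are skipped, so a reader never sees data
    "from the future" relative to the index it is serving.
    """
    record: dict = {}
    for created_date, delta_json in sorted(rows, key=lambda r: r[0]):
        if as_of is not None and created_date > as_of:
            break
        record.update(json.loads(delta_json))  # assumed: shallow merge
    return record
```

Because every delta is kept under its own clustering key, the same rows can also serve the history features mentioned above.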

Use Case 2 - Feed Management - Problem
• Thousands of feeds of different size and frequency	

• Incoming feeds must be “polished”	

• Geocoding must be done	

• Images must be made available in S3	

• Need to reprocess individual feeds	

• Full output records are munged from asynchronously updated
parts	

• Previously huge HDFS job	

• 300M inputs for 70M full output records	

• Records need all data to be “ready” for full output	

• Silly because most work is redundant from previous run	

• Only help with partitioning is from brittle HDFS directory structures

Use Case 2 - Feed Management - Solution
• Scala & Akka & Athena (large throughput - high parallelism)	

• Compound partition key (2^n shards per feed)	

• Spreads data - limits partition “row” length	

• Read entire feed without key scan - small IN clause	

• Random access writes	

• Any sub-field may be updated asynchronously	

• Munged record emitted to HDFS whenever “ready”
CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
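
The compound partition key can be illustrated with a small sketch. How the shard is derived from a record is not stated in the slides, so taking the low bits of the record UUID is an assumption, as are the function names. Only the table and column names come from the schema above.

```python
import uuid


def shard_for(record_id: uuid.UUID, shard_bits: int) -> int:
    """Map a record to one of 2**shard_bits shards (assumed scheme)."""
    return record_id.int & ((1 << shard_bits) - 1)


def feed_scan_cql(shard_bits: int) -> str:
    """CQL that reads a whole feed by listing every shard in an IN clause."""
    shards = ", ".join(str(s) for s in range(1 << shard_bits))
    return (
        "SELECT * FROM feed_state "
        "WHERE feed_name = ? "
        f"AND feed_record_id_shard IN ({shards})"
    )


print(feed_scan_cql(2))
```

Writes go to a single (feed_name, shard) partition, while a full-feed read only needs the 2^n known shard values rather than a scan over all partition keys.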

Monitoring
• OpsCenter	

• log4j/syslog/graylog	

• Email alerts	

• nagios/zabbix	

• Graphite (autogen graph pages)	

• Machine stats via collectl, JVM from codahale	

• Cassandra stats from codahale	

• Suspect possible issue with hadoop using same coordinator nodes	

• GC logs	

• VisualVM

General Issues / Lessons Learned
• GC issues	

• Old generation fragmentation causes eventual promotion failure	

• Usually of 1MB Memtable “slabs” - These can be off heap in C* 2.1 :-)	

• Thrift API with bulk load probably not helping, but fragmentation is inevitable	

• Some slow initial mark and remark STW pauses	

• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)	

• As said we aim to be multi-tenant	

• Avoid client stupidity, but otherwise accommodate any client behavior	

• GC now well tuned 	

• 1 compacting GC at off times/day, very rare 1 sec pauses/day, handful >0.5 sec/day	

• Cassandra and its own dog food

• Can’t wait for hints to be commit log style regular file (C* 3.0)	

• Compactions in progress table	

• OpsCenter rollup - turned off for search api tables

General Issues / Lessons Learned
• Don’t repair things that don’t need them	

• We also run -pr -par repair on each node	

• Beware when not following the rules	

• We were knowingly running on potentially buggy minor versions	

• If you don’t know what you’re doing you will likely screw up	

• Fortunately for us C* has always kept running fine	

• It is usually pretty easy to fix with some googling	

• Deleting data is counter-intuitively often a good fix!

Future
• Upgrade 2.0.x to use static columns	

• User defined types :-)	

• De-duplicate data into shared storage in C*	

• Analytics via data-locality	

• Hadoop, Pig, Spark/Scalding, R	

• More cross data center	

• More tuning	

• Full streaming pipeline with C* as side state store

Athena
• Why would we do such an obviously crazy thing?	

• Need to support async, reactive applications across different problem domains	

• Real-time API used by several disparate clients (iOS, Node.js, …)	

• Ground-up implementation of the CQL 2.0 binary protocol	

• Scala 2.10/2.11	

• Akka 2.3.x 	

• Fully async, nonblocking API	

• Has obvious advantages but requires different paradigm	

• Implemented as an extension for Akka-IO	

• Low-level actor based abstraction	

• Cluster, Host and Connection actors	

• Reasonably stable	

• High-level streaming Session API

Athena
• Next steps	

• Move off of Play Iteratees and onto Akka Reactive Streams	

• Token based routing	

• Client API very much in flux - suggestions are welcome!	

• https://github.com/vast-engineering/athena	

• Release of first beta milestone to Sonatype Maven repository imminent	

• Pull requests welcome!

Appendix

GC Settings
-Xms24576M	

-Xmx24576M	

-Xmn8192M	

-Xss228k	

-XX:+UseParNewGC	

-XX:+UseConcMarkSweepGC	

-XX:+CMSParallelRemarkEnabled	

-XX:SurvivorRatio=8	

-XX:MaxTenuringThreshold=1	

-XX:CMSInitiatingOccupancyFraction=70	

-XX:+UseCMSInitiatingOccupancyOnly	

-XX:+UseTLAB	

-XX:+UseCondCardMark	

-XX:+CMSParallelInitialMarkEnabled	

-XX:+CMSEdenChunksRecordAlways	

-XX:+HeapDumpOnOutOfMemoryError	

-XX:+CMSPrintEdenSurvivorChunks	

-XX:+PrintGCDetails	

-XX:+PrintGCDateStamps	

-XX:+PrintHeapAtGC	

-XX:+PrintTenuringDistribution	

-XX:+PrintGCApplicationStoppedTime	

-XX:+PrintPromotionFailure	

-XX:PrintFLSStatistics=1
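
The sizes quoted on the SW Specs slide follow from these flags. The sketch below works through the arithmetic. The function name is illustrative, and the real JVM aligns region sizes, so actual values are only approximately these.

```python
def heap_layout_mb(xmx_mb: int, xmn_mb: int, survivor_ratio: int) -> dict:
    """Approximate generation sizes from -Xmx, -Xmn and -XX:SurvivorRatio.

    The young generation is split into eden plus two survivor spaces,
    with eden being `survivor_ratio` times the size of one survivor space.
    """
    survivor = xmn_mb / (survivor_ratio + 2)
    return {
        "eden": survivor * survivor_ratio,
        "survivor_each": survivor,
        "old": xmx_mb - xmn_mb,
    }


layout = heap_layout_mb(xmx_mb=24576, xmn_mb=8192, survivor_ratio=8)
print({name: round(mb / 1024, 1) for name, mb in layout.items()})
```

This gives roughly 6.4 GB of eden, 0.8 GB per survivor space, and a 16 GB old generation, matching the figures given earlier.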
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
Federico Razzoli
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 

Recently uploaded (20)

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 

Austin Cassandra Users 6/19: Apache Cassandra at Vast

  • 1. June 19, 2014 Cassandra at Vast Graham Sanderson - CTO, David Pratt - Director of Applications 1
  • 2. June 19, 2014 Introduction 2 • Don’t want this to be a data modeling talk • We aren't experts - we are learning as we go • Hopefully this will be useful to both you and us • Informal, questions as we go • We will share our experiences so far moving to Cassandra • We are working on a bunch of existing and new projects • We'll talk about 2 1/2 of them • Some dev stuff, some ops stuff • Some thoughts for the future • Athena Scala Driver
  • 3. June 19, 2014 Who is Vast? 3 • Vast operates white-label, performance-based marketplaces for publishers, and delivers big data mobile applications for automotive and real estate sales professionals • “Big Data for Big Purchases” • Marketplaces • Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo • Hundreds of smaller partner sites • Analytics • Strong team of scarily smart data scientists • Integrating analytics everywhere
  • 4. June 19, 2014 Big Data 4 • HDFS - 1100TB • Amazon S3 - 275TB • Amazon Glacier - 150TB • DynamoDB - 12TB • Vertica - 2TB • Cassandra - 1.5TB • SOLR/Lucene - 400GB • Zookeeper • MySQL • Postgres • Redis • CouchDB
  • 5. June 19, 2014 Data Flow 5 • Flows between different data store types (many include historical data too) • Systems of Record (SOR) • Both root nodes and leaf nodes • Derived data stores (mostly MVCC) for: • Real time customer facing queries • Real time analytics • Alerting • Offline analytics • Reporting • Debugging • Mixture of dumps and deltas • We have derived SORs • Cached smaller subset records/fields for a specific purpose • SORs in multiple data centers - some derived SORs shared • Data flow is a graph, not a tree - feedback
  • 6. June 19, 2014 Goals 6 • Reduce latency to <15 mins for customer facing data • Reduce copying and duplication of data • Network/Storage/Time costs • More streaming & deltas, fewer dumps and derived SORs • Want multi-purpose, multi-tenant central store • Something rock solid • Something that can handle lots of data fast • Something that can do random access and bulk operations • Use for all data store types on previous slide • (Over?)build it; they will come • Consolidate rest on • HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene
  • 7. June 19, 2014 Why Cassandra? 7 • Regarded as rock solid • No single point of failure • Active development & open source Java • Good fit for the type of data we wanted to store • Ease of configuration; all nodes are the same • Easily tunable consistency at application level • Easy control of sharding at application level • Drivers for all our languages (we're mostly JVM but also node) • Data locality with other tools • Good cross data center support
  • 8. June 19, 2014 Evolution 8 • July 2013 (alpha on C* 1.1) • September 2013 (MTC-1 on C* 2.0.0) • First use case (a nasty one) - talk about it later • Stress/Destructive testing • Found and helped fix a few bugs along the way • Learned a lot about tuning and operations • Half nodes down at one point • Corrupted SSTables on one node • We’ve been cautious • Started with internal facing only use (don’t need 100% uptime) • Moved to external facing use but with ability to fall back off C* in minutes • Getting braver • C* is only SOR and real time customer facing store for some cases now • We have on occasion custom built C* with cherry-picked patches
  • 9. June 19, 2014 HW Specs MTC-1 9 • Remember we want to build for the C* future • 6 nodes • 16x cores (Sandy Bridge) • 256G RAM • Lots of disk cache and mem-mapped NIO buffers • 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each) • 1x SSD commit volume (~100K IOPS, 550MB/sec sequential) • RAID1 OS drives • 4x gigabit ethernet
  • 10. June 19, 2014 SW Specs MTC-1 10 • CentOS 6.5 • Cassandra 2.0.5 • JDK 1.7.0_60-b19 • 8 gig young generation / 6.4 gig eden • 16 gig old generation • Parallel new collector • CMS collector • Sounds like overkill but we are multi-tenant and have spiky loads
  • 11. June 19, 2014 General 11 • LOCAL_QUORUM for reads and writes • Use LZ4 compression • Use key cache (not row cache) • Some SizeTiered, some Leveled CompactionStrategy • Drivers • Athena (Scala / binary) • Astyanax 1.56.48 (Java / thrift) • node-cassandra-cql (Node / binary)
  • 12. June 19, 2014 Use Case 1 - Search API - Problem 12 • 40 million records (including duplicates per VIN) in HDFS • Map/Reduce to 7 million SOLR XML updates in HDFS • Not delta today because of map/reduce like business rules • Export to SOLR XML from HDFS to local FS • Re-index via SOLR • 40 gig SOLR index - at least 3 slaves • OKish every few hours, not every 15 minutes • Even though we made a very fast parallel indexer • % of stored data read per indexing is getting smaller
  • 13. June 19, 2014 Use Case 1 - Search API - Solution 13 • Indexing in hadoop • SOLR(Lucene) segments created (no stored fields) • Job option for fallback to stored fields in SOLR index • Stored fields go to C* as JSON directly from hadoop • Astyanax - 1MB batches - LOCAL_QUORUM • Periodically create new table(CF) with full data baseline (clustering) column • 200MB/s 3 replicas continuously for one to two minutes • 40000 partition keys/s (one per record) • Periodically add new (clustering) column to table with deltas from latest dump • Delta data size is 100x smaller and hits many fewer partition keys • Keep multiple recent tables for rollback (bad data more than recovery) • 2 gig SOLR index (20x smaller)
  • 14. June 19, 2014 Use Case 1 - Search API - Solution 14 • Very bare bones - not even any metadata :-( • Thrift style • Note we use blob • Everything is UTF-8 • Avro - Utf8 • Hadoop - Text • Astyanax - ByteBuffer • Most JVM drivers try to convert text to String
      CREATE TABLE "20140618084015_20140618_081920_1403072360" (
        key text,
        column1 blob,
        value blob,
        PRIMARY KEY (key, column1)
      ) WITH COMPACT STORAGE;
  • 15. June 19, 2014 Use Case 1 - Search API - Solution 15 • Stored fields cached in SOLR JVM (verification/warm up tests) • MVCC to prevent read-from-future • Single clustering key limit for the SOLR core • Reads fall back from LOCAL_QUORUM to LOCAL_ONE • Better to return something, even a subset of results • Never happened in production though • Issues • Don’t recreate table/CF until C* 2.1 • Early 2.0.x and Astyanax don’t like schema changes • Create new tables via CQL3 via Astyanax • Monitoring harder since we now use UUID for table name • Full (non delta) index write rate strains GC and causes some hinting • C* remains rock solid • We can constrain by mapper/reducer count, and will probably add zookeeper mutex
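The read fallback on slide 15 (try LOCAL_QUORUM, then settle for LOCAL_ONE) might look roughly like the sketch below. It is driver-agnostic on purpose: `execute` is a hypothetical callable standing in for whatever the client library offers, and `Unavailable` is a stand-in exception, not a class from Astyanax or any real driver.

```python
class Unavailable(Exception):
    """Stand-in for a driver's 'not enough replicas' or timeout error (hypothetical)."""


def read_with_fallback(execute, query, levels=("LOCAL_QUORUM", "LOCAL_ONE")):
    """Try each consistency level in order and return (level_used, rows).

    `execute(query, level)` is an assumed callable supplied by the caller;
    it should raise Unavailable when the level cannot be satisfied.
    """
    last_error = None
    for level in levels:
        try:
            return level, execute(query, level)
        except Unavailable as err:
            # Weaker level next: better to return something than nothing.
            last_error = err
    raise last_error
```

Returning the level that actually served the read lets the caller log or count fallbacks, which would be one way to confirm the slide's observation that this never happened in production.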
  • 16. June 19, 2014 Use Case 1.5 - RESA 16 • Newer version of real estate • Fully streaming delta pipeline (RabbitMQ) • Field level SOLR index updates (include latest timestamp) • C* row with JSON delta for that timestamp • History is used in customer facing features • Note this is really the same table as the thrift one
      CREATE TABLE for_sale (
        id text,
        created_date timestamp,
        delta_json text,
        PRIMARY KEY (id, created_date)
      )
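Since slide 16 stores one JSON delta per `created_date`, a reader has to fold the deltas back into a record. A minimal sketch of that fold follows, assuming each `delta_json` is a flat JSON object containing only the changed fields (the slides do not specify the delta format, so that is a guess), and using made-up listing fields in the usage note.

```python
import json
from datetime import datetime


def materialize(rows, as_of=None):
    """Fold (created_date, delta_json) rows into a single record.

    rows:  iterable of (datetime, str) pairs for one `id`
    as_of: optional cutoff; deltas newer than this are ignored,
           which gives a historical view of the record
    """
    record = {}
    for created, delta_json in sorted(rows, key=lambda r: r[0]):
        if as_of is not None and created > as_of:
            break
        # Later deltas override earlier values field by field.
        record.update(json.loads(delta_json))
    return record
```

For example, with hypothetical rows `[(datetime(2014, 6, 1), '{"price": 300000, "beds": 3}'), (datetime(2014, 6, 10), '{"price": 289000}')]`, calling `materialize(rows)` yields the latest price with the original bed count, while passing `as_of=datetime(2014, 6, 5)` yields the record as it stood before the price change.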
  • 17. June 19, 2014 Use Case 2 - Feed Management - Problem 17 • Thousands of feeds of different size and frequency • Incoming feeds must be “polished” • Geocoding must be done • Images must be made available in S3 • Need to reprocess individual feeds • Full output records are munged from asynchronously updated parts • Previously huge HDFS job • 300M inputs for 70M full output records • Records need all data to be “ready” for full output • Silly because most work is redundant from previous run • Only help partitioning is by brittle HDFS directory structures
  • 18. June 19, 2014 Use Case 2 - Feed Management - Solution 18 • Scala & Akka & Athena (large throughput - high parallelism) • Compound partition key (2^n shards per feed) • Spreads data - limits partition “row” length • Read entire feed without key scan - small IN clause • Random access writes • Any sub-field may be updated asynchronously • Munged record emitted to HDFS whenever “ready”
      CREATE TABLE feed_state (
        feed_name text,
        feed_record_id_shard int,
        record_id uuid,
        raw_record text,
        polished_data text,
        geocode_data text,
        image_status text,
        ...
        PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
      )
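The compound partition key on slide 18 can be illustrated with a short sketch: derive a shard number from the record id on write, and enumerate all shards in an IN clause to read a whole feed. The slides do not say how the shard is computed or how many shards are used, so the MD5-based hash, `SHARD_BITS = 4`, and the selected columns here are all assumptions for illustration.

```python
import hashlib
import uuid

SHARD_BITS = 4                 # assumption: 2^4 = 16 shards per feed
NUM_SHARDS = 1 << SHARD_BITS


def shard_for(record_id: uuid.UUID) -> int:
    """Map a record id to a stable shard number in [0, NUM_SHARDS)."""
    digest = hashlib.md5(record_id.bytes).digest()
    # Mask the low bits; works because NUM_SHARDS is a power of two.
    return int.from_bytes(digest[:4], "big") & (NUM_SHARDS - 1)


def feed_scan_query() -> str:
    """CQL text that reads one feed by listing every shard in an IN clause."""
    placeholders = ", ".join("?" for _ in range(NUM_SHARDS))
    return (
        "SELECT record_id, raw_record FROM feed_state "
        f"WHERE feed_name = ? AND feed_record_id_shard IN ({placeholders})"
    )


def feed_scan_params(feed_name: str) -> list:
    """Bind values matching feed_scan_query(): the feed name, then each shard."""
    return [feed_name, *range(NUM_SHARDS)]
```

Keeping the shard count a power of two keeps the IN list small and fixed, which is what lets a reader fetch an entire feed without scanning keys, at the cost of touching every shard partition for that feed.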
  • 19. June 19, 2014 Monitoring 19 • OpsCenter • log4j/syslog/graylog • Email alerts • nagios/zabbix • Graphite (autogen graph pages) • Machine stats via collectl, JVM from codahale • Cassandra stats from codahale • Suspect possible issue with hadoop using same coordinator nodes • GC logs • VisualVM
  • 20. June 19, 2014 General Issues / Lessons Learned 20 • GC issues • Old generation fragmentation causes eventual promotion failure • Usually of 1MB Memtable “slabs” - These can be off heap in C* 2.1 :-) • Thrift API with bulk load probably not helping, but fragmentation is inevitable • Some slow initial mark and remark STW pauses • We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-) • As said we aim to be multi-tenant • Avoid client stupidity, but otherwise accommodate any client behavior • GC now well tuned • 1 compacting GC at off times/day, very rare 1 sec pauses/day, handful >0.5 sec/day • Cassandra and its own dog food • Can’t wait for hints to be commit log style regular file (C* 3.0) • Compactions in progress table • OpsCenter rollup - turned off for search api tables
  • 21. June 19, 2014 General Issues / Lessons Learned 21 • Don’t repair things that don’t need them • We also run -pr -par repair on each node • Beware when not following the rules • We were knowingly running on potentially buggy minor versions • If you don’t know what you’re doing you will likely screw up • Fortunately for us C* has always kept running fine • It is usually pretty easy to fix with some googling • Deleting data is counter-intuitively often a good fix!
  • 22. June 19, 2014 Future 22 • Upgrade 2.0.x to use static columns • User defined types :-) • De-duplicate data into shared storage in C* • Analytics via data-locality • Hadoop, Pig, Spark/Scalding, R • More cross data center • More tuning • Full streaming pipeline with C* as side state store
  • 23. June 19, 2014 Athena 23 • Why would we do such an obviously crazy thing? • Need to support async, reactive applications across different problem domains • Real-time API used by several disparate clients (iOS, Node.js, …) • Ground-up implementation of the CQL 2.0 binary protocol • Scala 2.10/2.11 • Akka 2.3.x • Fully async, nonblocking API • Has obvious advantages but requires different paradigm • Implemented as an extension for Akka-IO • Low-level actor based abstraction • Cluster, Host and Connection actors • Reasonably stable • High-level streaming Session API
  • 24. June 19, 2014 Athena 24 • Next steps • Move off of Play Iteratees and onto Akka Reactive Streams • Token based routing • Client API very much in flux - suggestions are welcome! • https://github.com/vast-engineering/athena • Release of first beta milestone to Sonatype Maven repository imminent • Pull requests welcome!
  • 26. June 19, 2014 GC Settings 26 -Xms24576M -Xmx24576M -Xmn8192M -Xss228k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+HeapDumpOnOutOfMemoryError -XX:+CMSPrintEdenSurvivorChunks -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1
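The heap sizes quoted on slide 10 (8 gig young generation, 6.4 gig eden, 16 gig old generation) follow from the `-Xmx`, `-Xmn` and `-XX:SurvivorRatio` flags above. The arithmetic can be sketched as below; the function name is made up for illustration, and the formula relies on HotSpot's definition of SurvivorRatio as the ratio of eden to one survivor space, with two survivor spaces in the young generation.

```python
def heap_layout(xmx_mb: int, xmn_mb: int, survivor_ratio: int) -> dict:
    """Derive generation sizes (in GB) from -Xmx, -Xmn and -XX:SurvivorRatio.

    young = eden + 2 * survivor, and eden = survivor_ratio * survivor,
    so survivor = young / (survivor_ratio + 2).
    """
    survivor_mb = xmn_mb / (survivor_ratio + 2)
    eden_mb = survivor_mb * survivor_ratio
    return {
        "young_gb": xmn_mb / 1024,
        "eden_gb": eden_mb / 1024,
        "survivor_gb": survivor_mb / 1024,
        "old_gb": (xmx_mb - xmn_mb) / 1024,
    }


layout = heap_layout(xmx_mb=24576, xmn_mb=8192, survivor_ratio=8)
for name, size in layout.items():
    print(f"{name}: {size:.1f}")
```

With the values from the slide this gives an 8.0 GB young generation split into 6.4 GB of eden plus two 0.8 GB survivor spaces, and a 16.0 GB old generation.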
  • 27. June 19, 2014 Cassandra at Vast Graham Sanderson - CTO, David Pratt - Director of Applications 27