Today we use Cassandra along with RabbitMQ and Sinatra to feed 5,000 documents a second across our 600 Solr instances. We will talk about how we manage a 21-node, 59 TB Cassandra ring and how we use it to send updates to all of our Solr clusters. We have setups in both our private datacenter and AWS.
Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra
1. Feeding Solr at Large Scale with Cassandra
@
Cassandra Day Atlanta
2015-03-19
2. About Me
Joseph Streeky
Manager, Search Framework Development
● Joined Careerbuilder in 2005
● BS Computer Science - Georgia Tech
● Natural Language Processing - Columbia University
● Software Engineering for SaaS - University of California, Berkeley
3. About Me
Joshua Smith
Database Administrator III
Joshua.smith@careerbuilder.com
● Joined Careerbuilder in 2011
● Took over management of Cassandra in 2013
● BS Computer Science - Georgia Tech
4. About
Careerbuilder is the global leader in human capital
solutions, helping companies target and attract their most
important assets - their people.
● More than 22 Million unique visitors a month
● More than 300,000 employers post more than 1 million
jobs on Careerbuilder
● Careerbuilder operates in the United States, Europe,
Canada and Asia. Its sites, combined with partnerships
and acquisitions, give Careerbuilder a presence in more
than 55 countries worldwide.
5. About Search @
● 1 million active jobs each month
● 60 million actively searchable resumes
● 500 globally distributed search servers (in the U.S., Europe, & the cloud)
● Thousands of unique, dynamically generated search indexes
● 1.5 billion search documents
● 2-3 million searches an hour
8. Feeding Platform Requirements
● Volume Requirements
o 1000+ documents / second
● Able to scale linearly
● Highly available
● Easy to deploy to multiple locations
(private datacenter vs AWS)
11. Feeding Steps
● Content Creation - Translate to Solr Indexing Format (we use XML)
● Shard - Determine Routing Rules related to this document
● Batch - Group together documents that have the same routing rules for
batch feeding
● Send - Send the batch to Solr
● Verify - Verify that Solr received the batch
● Reprocess - For any set of documents that failed during any step of
the process we place the document(s) here for reprocessing
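The shard, batch, send, verify and reprocess steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline from the talk: the `route` rule (routing on a `country` field), the `send` callback and the batch size are all hypothetical, and a real sender would post the batch to Solr and check the response.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Document = Dict[str, str]
Batch = Tuple[str, List[Document]]


def route(doc: Document) -> str:
    """Hypothetical routing rule: choose a target from one document field."""
    return doc.get("country", "default")


def batch_by_route(docs: Iterable[Document], batch_size: int) -> List[Batch]:
    """Shard + Batch: group documents by routing key, then cut fixed-size batches."""
    grouped: Dict[str, List[Document]] = defaultdict(list)
    for doc in docs:
        grouped[route(doc)].append(doc)
    batches: List[Batch] = []
    for key, members in grouped.items():
        for start in range(0, len(members), batch_size):
            batches.append((key, members[start:start + batch_size]))
    return batches


def feed(
    docs: Iterable[Document],
    send: Callable[[str, List[Document]], bool],
    batch_size: int = 2,
) -> List[Document]:
    """Send + Verify + Reprocess: return the documents that must be reprocessed.

    `send` is an assumed callback that returns True only when the target
    confirmed receipt of the whole batch.
    """
    reprocess: List[Document] = []
    for key, batch in batch_by_route(docs, batch_size):
        try:
            confirmed = send(key, batch)
        except Exception:
            confirmed = False  # a failure at any step sends the batch to reprocessing
        if not confirmed:
            reprocess.extend(batch)
    return reprocess
```

Returning the failed documents rather than raising keeps one bad batch from blocking the others, which matches the idea of a separate reprocessing step.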
15. Storage in Cassandra
● Two column families per pool
o Initial data
o Translated for Solr
● Both have a DocumentID as the key
● The initial data column family is the key and some
number of columns based on user data
● The Translated column family is just a DocumentID and
a single content field
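To make the two-column-family layout concrete, here is a minimal in-memory sketch that uses plain dictionaries as stand-ins for the column families and builds the translated content as a Solr XML update message. The names `initial_data`, `translated`, `store` and `to_solr_xml` are hypothetical, and using `id` as the Solr unique key field is an assumption about the schema; no Cassandra driver is involved.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Stand-ins for the two column families of one pool, both keyed by DocumentID.
initial_data: Dict[str, Dict[str, str]] = {}  # DocumentID -> user-defined columns
translated: Dict[str, str] = {}               # DocumentID -> single content field


def to_solr_xml(document_id: str, columns: Dict[str, str]) -> str:
    """Translate one document into a Solr XML <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumption: the Solr schema uses "id" as its unique key field.
    ET.SubElement(doc, "field", name="id").text = document_id
    for name, value in sorted(columns.items()):
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


def store(document_id: str, columns: Dict[str, str]) -> None:
    """Write the raw columns, then the Solr-ready translation, under the same key."""
    initial_data[document_id] = dict(columns)
    translated[document_id] = to_solr_xml(document_id, columns)
```

Keeping the translated form next to the raw data means a refeed can resend the stored XML without repeating the translation step.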
16. Read and Writing
● Quorum Read
● Quorum Write
● Needed for our specific use case: if we fail to read the
newest data, we have no automated way to recover
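The reason quorum reads paired with quorum writes always see the newest data is simple arithmetic: when the number of replicas read plus the number written exceeds the replication factor, the two sets must share at least one replica. A small sketch, using the RF = 3 reported for the production ring (the function names are illustrative):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(f"RF={rf}: quorum={q}, overlap={read_sees_latest_write(q, q, rf)}")
print(f"RF={rf}: ONE/ONE overlap={read_sees_latest_write(1, 1, rf)}")
```

With RF = 3 a quorum is 2 replicas, so a quorum read and a quorum write together touch 4 replica slots out of 3 and must overlap. Reading and writing at consistency level ONE gives no such guarantee.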
17. Cassandra Ring Specs
● 3-node test ring
● 21-node production ring
o 56 TB
o Datastax Cassandra
o Version 2.0.5
o Vnodes
o RF = 3
o 4K write/s - 3K read/s
18. Cassandra Node Specs
● Dell R620
● 2 x E5-2630 V2
● 2.60 GHz CPU
● 128 GB RAM
● 3 x 1.6TB SAS SSD in RAID 5
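As a rough sanity check on the ring and node specs, the arithmetic below works out the usable capacity of a 3-disk RAID 5 array and compares it with the 56 TB ring figure. The slides do not say whether 56 TB is stored data or capacity, so the per-node average is only a back-of-the-envelope number, and filesystem overhead is ignored.

```python
def raid5_usable_tb(disks: int, disk_tb: float) -> float:
    """RAID 5 keeps one disk's worth of parity, so usable space is (n - 1) disks."""
    return (disks - 1) * disk_tb


NODES = 21           # production ring size from the slides
RING_TB = 56         # ring figure from the slides (meaning not stated)

per_node_tb = raid5_usable_tb(disks=3, disk_tb=1.6)
ring_capacity_tb = per_node_tb * NODES
average_per_node_tb = RING_TB / NODES

print(f"usable per node: {per_node_tb:.1f} TB")
print(f"usable across ring: {ring_capacity_tb:.1f} TB")
print(f"56 TB spread over 21 nodes: {average_per_node_tb:.2f} TB per node")
```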
19. Pre-1.2 Performance Stuff
● Compaction Fun
● Cold Read problem
● Garbage collection
20. Compaction Fun
● Cassandra version 1.0.8
● Single threaded compaction
● Eating up heap space until OOM error
● JNA not installed
● Reduced memtable and cache size
● Increased the heap size to 12 GB
21. Cold Read Problem
● Refeeding involves all documents
● Each row will be read multiple times
● Cold reads mean lots of seeks
● Spinning disks HATE seeks
22. SSD
● Nightly repair times decreased from 23 hours to less than 3 hours
● Write latency decreased from 15 ms to 2.4 ms
● cassandra.yaml
o concurrent_reads: 96
o concurrent_writes: 192
23. AWS vs Private
● Combination of AMI and chef to configure
● r3.xlarge, EBS-optimized
o 4 vCPU, 30 GB RAM, Provisioned IOPS
● RF = 3
● 2 availability zones for high availability and local quorum
● Comparable performance to local datacenter
● Currently deploying version 2.0.12