Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Feeding Solr at Large
Scale with Cassandra
@
Cassandra Day Atlanta
2015-03-19

About Me
Joseph Streeky
Manager, Search Framework Development
●  Joined Careerbuilder in 2005
●  BS Computer Science - Georgia Tech
●  Natural Language Processing - Columbia University
●  Software Engineering for SaaS - University of California, Berkeley

About Me
Joshua Smith
Database Administrator III
Joshua.smith@careerbuilder.com
●  Joined Careerbuilder in 2011
●  Took over management of Cassandra in 2013
●  BS Computer Science - Georgia Tech

About
Careerbuilder is the global leader in human capital
solutions, helping companies target and attract their most
important assets - their people.
●  More than 22 Million unique visitors a month
●  More than 300,000 employers post more than 1 million
jobs on Careerbuilder
●  Careerbuilder operates in the United States, Europe,
Canada and Asia. Its sites, combined with partnership
and acquisitions, give Careerbuilder a presence in more
than 55 countries worldwide.

About Search @
•  1
million
ac*ve
jobs
each
month

•  60
million
ac*vely
searchable
resumes

•  500
globally
distributed
search
servers
(in
the
U.S.,
Europe,
&

the
cloud)

•  Thousands
of
unique,
dynamically
generated
search
indexes

•  1.5
billion
search
documents

•  2-‐3
million
searches
an
hour

Feeding Solr at Large
Scale with Cassandra

Feeding Platform Requirements
●  Volume Requirements
o  1000+ documents / second
●  Able to scale linearly
●  Highly available
●  Easily able to deploy to multiple location
(Private datacenter vs AWS)

Our
Feeding
Infrastructure

Feeding
Stack
Hadoop
SQL
RabbitMQ
Cassandra
Processing
Tier
Solr

Feeding Steps
●  Content Creation - Translate to Solr Indexing Format (we use XML)
●  Shard - Determine Routing Rules related to this document
●  Batch - Group together documents that have the same routing rules for
batch feeding
●  Send - Send the batch to Solr
●  Verify - Verify that Solr received the batch
●  Reprocess - For any set of documents that failed during any step of
the process we place the document(s) here for reprocessing

Our
Search
Infrastructure

Solr SolrSolr
Feeding Platform
Our
Search
Infrastructure

Query Load Balancer Query Load Balancer

Storage
in
Cassandra

●  Two column families per pool
o  Initial data
o  Translated for Solr
●  Both have a DocumentID as the key
●  The initial data column family is the key and some
number of columns based on user data
●  The Translated column family is just a DocumentID and
a single content field

Read
and
Wri6ng

●  Quorum Read
●  Quorum Write
●  Needed for our specific use case, if we fail to read the
newest data we have to automated way to recover

Cassandra Ring Specs
●  3 Node ring test
●  21 Nodes Production
o  56 TB
o  Datastax Cassandra
o  Version 2.0.5
o  Vnodes
o  RF = 3
o  4K write/s - 3K read/s

Cassandra Node Specs
●  Dell R620
●  2 x E5-2630 V2
●  2.60 GHZ CPU
●  128 GB RAM
●  3 x 1.6TB SAS SSD in RAID 5

Pre 1.2 Performance Stuff
●  Compaction Fun
●  Cold Read problem
●  Garbage collection

Compaction Fun
●  Cassandra version 1.0.8
●  Single threaded compaction
●  Eating up heap space until OOM
error
●  JNA not installed
●  Reduced memtable and cache size
●  increased the heap size to 12 GB

Cold Read Problem
●  Refeeding involves all documents
●  Each row will be read multiple times
●  Cold reads means lots of seeks
●  Spinning disks HATE seeks

SSD
●  Nightly repair times decreased from 23
hours to less than 3 hours
●  Write latency decreased from 15 ms to 2.4
ms
●  cassandra.yaml
o  concurrent_reads: 96
o  concurrent_writes: 192

AWS vs Private
●  Combination of AMI and chef to configure
●  R3.XLarge with EBS optimized
o  4 vCPU, 30 GB RAM, Provisioned IOPS
●  RF = 3
●  2 Availability zones for high availability and
local quorum
●  Comparable performance to local datacenter
●  Currently deploying version 2.0.12

Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Similar to Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra (20)

More from DataStax Academy

More from DataStax Academy (20)

Recently uploaded

Recently uploaded (20)

Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra