SlideShare a Scribd company logo
1 of 24
Download to read offline
Feeding Solr at Large
Scale with Cassandra
@
Cassandra Day Atlanta
2015-03-19
About Me
Joseph Streeky
Manager, Search Framework Development
●  Joined Careerbuilder in 2005
●  BS Computer Science - Georgia Tech
●  Natural Language Processing - Columbia University
●  Software Engineering for SaaS - University of California, Berkeley
About Me
Joshua Smith
Database Administrator III
Joshua.smith@careerbuilder.com
●  Joined Careerbuilder in 2011
●  Took over management of Cassandra in 2013
●  BS Computer Science - Georgia Tech
About
Careerbuilder is the global leader in human capital
solutions, helping companies target and attract their most
important assets - their people.
●  More than 22 Million unique visitors a month
●  More than 300,000 employers post more than 1 million
jobs on Careerbuilder
●  Careerbuilder operates in the United States, Europe,
Canada and Asia. Its sites, combined with partnership
and acquisitions, give Careerbuilder a presence in more
than 55 countries worldwide.
About Search @
•  1	
  million	
  ac*ve	
  jobs	
  each	
  month	
  	
  
•  60	
  million	
  ac*vely	
  searchable	
  resumes	
  
•  500	
  globally	
  distributed	
  search	
  servers	
  (in	
  the	
  U.S.,	
  Europe,	
  &	
  
the	
  cloud)	
  	
  
•  Thousands	
  of	
  unique,	
  dynamically	
  generated	
  search	
  indexes	
  
•  1.5	
  billion	
  search	
  documents	
  
•  2-­‐3	
  million	
  searches	
  an	
  hour	
  
Search Powers…
Feeding Solr at Large
Scale with Cassandra
Feeding Platform Requirements
●  Volume Requirements
o  1000+ documents / second
●  Able to scale linearly
●  Highly available
●  Easily able to deploy to multiple location
(Private datacenter vs AWS)
Technologies	
  Technologies	
  
Our	
  Feeding	
  Infrastructure	
  
Feeding
Stack
Hadoop
SQL
RabbitMQ
Cassandra
Processing
Tier
Solr
Feeding Steps
●  Content Creation - Translate to Solr Indexing Format (we use XML)
●  Shard - Determine Routing Rules related to this document
●  Batch - Group together documents that have the same routing rules for
batch feeding
●  Send - Send the batch to Solr
●  Verify - Verify that Solr received the batch
●  Reprocess - For any set of documents that failed during any step of
the process we place the document(s) here for reprocessing
Our	
  Search	
  Infrastructure	
  
Solr SolrSolr
Feeding Platform
Our	
  Search	
  Infrastructure	
  
Query Load Balancer Query Load Balancer
Feeding Solr at Large
Scale with Cassandra
Storage	
  in	
  Cassandra	
  
●  Two column families per pool
o  Initial data
o  Translated for Solr
●  Both have a DocumentID as the key
●  The initial data column family is the key and some
number of columns based on user data
●  The Translated column family is just a DocumentID and
a single content field
Read	
  and	
  Wri6ng	
  
●  Quorum Read
●  Quorum Write
●  Needed for our specific use case, if we fail to read the
newest data we have to automated way to recover
Cassandra Ring Specs
●  3 Node ring test
●  21 Nodes Production
o  56 TB
o  Datastax Cassandra
o  Version 2.0.5
o  Vnodes
o  RF = 3
o  4K write/s - 3K read/s
Cassandra Node Specs
●  Dell R620
●  2 x E5-2630 V2
●  2.60 GHZ CPU
●  128 GB RAM
●  3 x 1.6TB SAS SSD in RAID 5
Pre 1.2 Performance Stuff
●  Compaction Fun
●  Cold Read problem
●  Garbage collection
Compaction Fun
●  Cassandra version 1.0.8
●  Single threaded compaction
●  Eating up heap space until OOM
error
●  JNA not installed
●  Reduced memtable and cache size
●  increased the heap size to 12 GB
Cold Read Problem
●  Refeeding involves all documents
●  Each row will be read multiple times
●  Cold reads means lots of seeks
●  Spinning disks HATE seeks
SSD
●  Nightly repair times decreased from 23
hours to less than 3 hours
●  Write latency decreased from 15 ms to 2.4
ms
●  cassandra.yaml
o  concurrent_reads: 96
o  concurrent_writes: 192
AWS vs Private
●  Combination of AMI and chef to configure
●  R3.XLarge with EBS optimized
o  4 vCPU, 30 GB RAM, Provisioned IOPS
●  RF = 3
●  2 Availability zones for high availability and
local quorum
●  Comparable performance to local datacenter
●  Currently deploying version 2.0.12
Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

More Related Content

What's hot

An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBLee Theobald
 
Hadoop and Cassandra at Rackspace
Hadoop and Cassandra at RackspaceHadoop and Cassandra at Rackspace
Hadoop and Cassandra at RackspaceStu Hood
 
MongoDB Administration ~ Kevin Hanson
MongoDB Administration ~ Kevin HansonMongoDB Administration ~ Kevin Hanson
MongoDB Administration ~ Kevin Hansonhungarianhc
 
Dynamo db pros and cons
Dynamo db  pros and consDynamo db  pros and cons
Dynamo db pros and consSaniya Khalsa
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architectureOmid Vahdaty
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorPierre Baillet
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Alexey Zinoviev
 
MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...
MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...
MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...MongoDB
 
ElasticSearch for data mining
ElasticSearch for data mining ElasticSearch for data mining
ElasticSearch for data mining William Simms
 
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"Fwdays
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDBMongoDB
 
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachJeremy Zawodny
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataRamesh Veeramani
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Fwdays
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsDataStax
 
HBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullySQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullyMd Kamaruzzaman
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBJustin Smestad
 

What's hot (20)

An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Hadoop and Cassandra at Rackspace
Hadoop and Cassandra at RackspaceHadoop and Cassandra at Rackspace
Hadoop and Cassandra at Rackspace
 
MongoDB Administration ~ Kevin Hanson
MongoDB Administration ~ Kevin HansonMongoDB Administration ~ Kevin Hanson
MongoDB Administration ~ Kevin Hanson
 
Scalding @ Coursera
Scalding @ CourseraScalding @ Coursera
Scalding @ Coursera
 
Dynamo db pros and cons
Dynamo db  pros and consDynamo db  pros and cons
Dynamo db pros and cons
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
 
Mongodb lab
Mongodb labMongodb lab
Mongodb lab
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log Collector
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
 
MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...
MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...
MongoDB World 2019: Lessons Learned: Migrating Buffer's Production Database t...
 
ElasticSearch for data mining
ElasticSearch for data mining ElasticSearch for data mining
ElasticSearch for data mining
 
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
 
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
HBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgent
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullySQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 

Viewers also liked

Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandraAaron Ploetz
 
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016DataStax
 
The evolution of Angular 2 @ AngularJS Munich Meetup #5
The evolution of Angular 2 @ AngularJS Munich Meetup #5The evolution of Angular 2 @ AngularJS Munich Meetup #5
The evolution of Angular 2 @ AngularJS Munich Meetup #5Johannes Weber
 
Getting Started with Angular 2
Getting Started with Angular 2Getting Started with Angular 2
Getting Started with Angular 2FITC
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 

Viewers also liked (7)

AngularJS2 / TypeScript / CLI
AngularJS2 / TypeScript / CLIAngularJS2 / TypeScript / CLI
AngularJS2 / TypeScript / CLI
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
 
AngularJS2
AngularJS2AngularJS2
AngularJS2
 
The evolution of Angular 2 @ AngularJS Munich Meetup #5
The evolution of Angular 2 @ AngularJS Munich Meetup #5The evolution of Angular 2 @ AngularJS Munich Meetup #5
The evolution of Angular 2 @ AngularJS Munich Meetup #5
 
Getting Started with Angular 2
Getting Started with Angular 2Getting Started with Angular 2
Getting Started with Angular 2
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 

Similar to Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Amazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Web Services
 
MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...
MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...
MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...MongoDB
 
AWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDS
AWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDSAWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDS
AWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDSAmazon Web Services
 
Amazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Web Services
 
Amazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database EngineAmazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database EngineAmazon Web Services
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Emprovise
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Web Services
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseAmazon Web Services
 
Store stream data on Data Lake
Store stream data on Data LakeStore stream data on Data Lake
Store stream data on Data LakeMarcos Rebelo
 
Amazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database EngineAmazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database EngineDanilo Poccia
 
Aws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled AppsAws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled AppsAmazon Web Services
 
How & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit DublinHow & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit DublinAmazon Web Services
 
Appache Cassandra
Appache Cassandra  Appache Cassandra
Appache Cassandra nehabsairam
 
Cassandra for mission critical data
Cassandra for mission critical dataCassandra for mission critical data
Cassandra for mission critical dataOleksandr Semenov
 

Similar to Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra (20)

Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
AWS Database Services
AWS Database ServicesAWS Database Services
AWS Database Services
 
Amazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from Amazon
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
 
MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...
MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...
MongoDB World 2018: Solving Your Backup Needs Using MongoDB Ops Manager, Clou...
 
AWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDS
AWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDSAWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDS
AWSome Day 2016 - Module 4: Databases: Amazon DynamoDB and Amazon RDS
 
Cosmos db
Cosmos dbCosmos db
Cosmos db
 
Amazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from AmazonAmazon Aurora: The New Relational Database Engine from Amazon
Amazon Aurora: The New Relational Database Engine from Amazon
 
Amazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database EngineAmazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database Engine
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data Warehouse
 
Store stream data on Data Lake
Store stream data on Data LakeStore stream data on Data Lake
Store stream data on Data Lake
 
Amazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database EngineAmazon Aurora: Amazon’s New Relational Database Engine
Amazon Aurora: Amazon’s New Relational Database Engine
 
Aws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled AppsAws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled Apps
 
How & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit DublinHow & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit Dublin
 
Appache Cassandra
Appache Cassandra  Appache Cassandra
Appache Cassandra
 
Cassandra for mission critical data
Cassandra for mission critical dataCassandra for mission critical data
Cassandra for mission critical data
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

  • 1. Feeding Solr at Large Scale with Cassandra @ Cassandra Day Atlanta 2015-03-19
  • 2. About Me Joseph Streeky Manager, Search Framework Development ●  Joined Careerbuilder in 2005 ●  BS Computer Science - Georgia Tech ●  Natural Language Processing - Columbia University ●  Software Engineering for SaaS - University of California, Berkeley
  • 3. About Me Joshua Smith Database Administrator III Joshua.smith@careerbuilder.com ●  Joined Careerbuilder in 2011 ●  Took over management of Cassandra in 2013 ●  BS Computer Science - Georgia Tech
  • 4. About Careerbuilder is the global leader in human capital solutions, helping companies target and attract their most important assets - their people. ●  More than 22 Million unique visitors a month ●  More than 300,000 employers post more than 1 million jobs on Careerbuilder ●  Careerbuilder operates in the United States, Europe, Canada and Asia. Its sites, combined with partnership and acquisitions, give Careerbuilder a presence in more than 55 countries worldwide.
  • 5. About Search @ •  1  million  ac*ve  jobs  each  month     •  60  million  ac*vely  searchable  resumes   •  500  globally  distributed  search  servers  (in  the  U.S.,  Europe,  &   the  cloud)     •  Thousands  of  unique,  dynamically  generated  search  indexes   •  1.5  billion  search  documents   •  2-­‐3  million  searches  an  hour  
  • 7. Feeding Solr at Large Scale with Cassandra
  • 8. Feeding Platform Requirements ●  Volume Requirements o  1000+ documents / second ●  Able to scale linearly ●  Highly available ●  Easily able to deploy to multiple location (Private datacenter vs AWS)
  • 10. Our  Feeding  Infrastructure   Feeding Stack Hadoop SQL RabbitMQ Cassandra Processing Tier Solr
  • 11. Feeding Steps ●  Content Creation - Translate to Solr Indexing Format (we use XML) ●  Shard - Determine Routing Rules related to this document ●  Batch - Group together documents that have the same routing rules for batch feeding ●  Send - Send the batch to Solr ●  Verify - Verify that Solr received the batch ●  Reprocess - For any set of documents that failed during any step of the process we place the document(s) here for reprocessing
  • 13. Solr SolrSolr Feeding Platform Our  Search  Infrastructure   Query Load Balancer Query Load Balancer
  • 14. Feeding Solr at Large Scale with Cassandra
  • 15. Storage  in  Cassandra   ●  Two column families per pool o  Initial data o  Translated for Solr ●  Both have a DocumentID as the key ●  The initial data column family is the key and some number of columns based on user data ●  The Translated column family is just a DocumentID and a single content field
  • 16. Read  and  Wri6ng   ●  Quorum Read ●  Quorum Write ●  Needed for our specific use case, if we fail to read the newest data we have to automated way to recover
  • 17. Cassandra Ring Specs ●  3 Node ring test ●  21 Nodes Production o  56 TB o  Datastax Cassandra o  Version 2.0.5 o  Vnodes o  RF = 3 o  4K write/s - 3K read/s
  • 18. Cassandra Node Specs ●  Dell R620 ●  2 x E5-2630 V2 ●  2.60 GHZ CPU ●  128 GB RAM ●  3 x 1.6TB SAS SSD in RAID 5
  • 19. Pre 1.2 Performance Stuff ●  Compaction Fun ●  Cold Read problem ●  Garbage collection
  • 20. Compaction Fun ●  Cassandra version 1.0.8 ●  Single threaded compaction ●  Eating up heap space until OOM error ●  JNA not installed ●  Reduced memtable and cache size ●  increased the heap size to 12 GB
  • 21. Cold Read Problem ●  Refeeding involves all documents ●  Each row will be read multiple times ●  Cold reads means lots of seeks ●  Spinning disks HATE seeks
  • 22. SSD ●  Nightly repair times decreased from 23 hours to less than 3 hours ●  Write latency decreased from 15 ms to 2.4 ms ●  cassandra.yaml o  concurrent_reads: 96 o  concurrent_writes: 192
  • 23. AWS vs Private ●  Combination of AMI and chef to configure ●  R3.XLarge with EBS optimized o  4 vCPU, 30 GB RAM, Provisioned IOPS ●  RF = 3 ●  2 Availability zones for high availability and local quorum ●  Comparable performance to local datacenter ●  Currently deploying version 2.0.12