SlideShare a Scribd company logo
1 of 19
Lessons Learned from Migrating 2+ Billion Documents at Craigslist Jeremy Zawodny jzawodn@craigslist.org Jeremy@Zawodny.com http://blog.zawodny.com/
Outline Recap last year’s MongoSV Talk The Archive, Why MongoDB, etc. http://www.10gen.com/video/mongosv2010/craigslist The Infrastructure The Lessons Wishlist Q&A
Craigslist Numbers 2 data centers ~500 servers ~100 MySQL servers ~700 cities, worldwide ~1 billion hits/day ~1.5 million posts/day
Archive: Where Data Goes To Die Live Numbers ~1.75M posts/day ~14 day avg. lifetime ~60 day retention ~100M  posts We keep all postings Users reuse postings Daily archive migration Internal query tools
Archive Pain Coupled Schemas Big Indexes Hardware Failures Replication Lag Poor Search Human Time Costs
MongoDB Wins Scalable Fast Friendly Proven Pragmatic Approachable
MongoDB Details Plan for 5 billion documents Average size: 2KB 3 Replica sets, 3 Servers each Deploy to 2 datacenters Same deployment in each datacenter Posting ID is sharding key
MongoDB Architecture Typical Sharding with Replica Sets (external sphinx full-text indexers not pictured) config client client client client config config mongos mongos mongos shard001 shard003 shard002 replica set replica set replica set
Lesson: Know Your Hardware MongoDB on blades really sucks Single 10k RPM disks can’t take it when data is noticeably larger than RAM Mongo operations can hit the client timeout (30 sec default) Even minutely cron jobs start to spew Lots of time wasted in development environment, trying different kernels, tuning, etc. Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
Lesson: Replica Sets Rock Lots of reboots happened during dev environment troubleshooting Each time, one of the remaining nodes took over No “reclone” no config file or DNS changes Stuff “just worked” while nodes bounced up and down
Lesson: Know Your Data MongoDB is UTF-8 Some of our older data is decidedly NOT UTF-8 We have lots of sloppy encoding issues to clean up.  But we had to clean them all up. Start data load.  Wait 12-36 hours.  Witness fail.  Fix code.  Start over.  Sigh. This is a combination of having been sloppy and having old data.  Even with a lot less history, this can bite you.  Get your encoding house in order!
Lesson: Know Your Data Size MongoDB has a doc size limits 4MB in 1.6.x, 16MB in 1.8.x What to do with outliers? In our case, trim off some useless data. But going from relational to document means this sort of problem is easy to have.  One parent, many children. It’d be nice if this was easier to change, but clients have it hard-coded too. Compression would help, of course.
Lesson: Know Your Data Types Field Types and Conversions can be expensive to do after the fact! MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789” http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod
Data Types, continued “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.” Do you know how to do that in your language of choice? Some drivers may make a “guess” that gets it right most of the time.
Lesson: Know SomeSharding The Balancer can be your frenemy Initial insert rate: 8,000/sec Later drops to 200/sec Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again Pre-split your data if possible http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
Lesson: Know Some Replica Sets Replica Set re-sync requires index rebuilds on the secondary Most painful when a slave is down too long and can’t catch up using the oplog Typically during high write volumes In a large data set, the index rebuilding can take a couple of days w/out many indexes What if you lose another while that is happening?
MongoDBWishlist Replica set node re-sync without out index rebuilding Record (or field) compression (not everyone uses a filesystem that offers compression) Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.) Hash-based sharding (coming soon?) Cluster snapshot/backup tool
craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Front-end Engineering HTML, CSS, JavaScript, jQuery (Mobile too) Network Administration Routers, switches, load balancers, etc. Back-end Engineering Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc. Systems Administration Help keep all those systems running.
craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Laid back, non-corporateenvironment Engineering driven culture Lots of interesting technical challenges Easy SF commute Excellent benefits and pay High-impact work Millions use craigslist daily

More Related Content

Viewers also liked

Webinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDBWebinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDBBoxed Ice
 
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachJeremy Zawodny
 
Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.Dhaval Dalal
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Jeremy Zawodny
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At CraigslistJeremy Zawodny
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Migrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at WordnikMigrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at WordnikTony Tam
 
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...MongoDB
 
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB
 
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQLWebinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQLMongoDB
 
Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011Ted Naleid
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistJeremy Zawodny
 
MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016Norberto Leite
 
Production deployment
Production deploymentProduction deployment
Production deploymentMongoDB
 
Managing Big Data with MySQL
Managing Big Data with MySQLManaging Big Data with MySQL
Managing Big Data with MySQLmwasaha mwagambo
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Jeremy Zawodny
 
Migrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMigrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMongoDB
 
Social Media Trends - Content Curation
Social Media Trends - Content CurationSocial Media Trends - Content Curation
Social Media Trends - Content CurationChris Mikulin
 

Viewers also liked (20)

Webinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDBWebinar - Approaching 1 billion documents with MongoDB
Webinar - Approaching 1 billion documents with MongoDB
 
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
 
Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.Midas - on-the-fly schema migration tool for MongoDB.
Midas - on-the-fly schema migration tool for MongoDB.
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Migrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at WordnikMigrating from MySQL to MongoDB at Wordnik
Migrating from MySQL to MongoDB at Wordnik
 
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
Webinaire 3 de la série « Retour aux fondamentaux » : Conception de schémas :...
 
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
MongoDB 3.0 and WiredTiger (Event: An Evening with MongoDB Dallas 3/10/15)
 
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQLWebinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
Webinaire 1 de la série Retour aux fondamentaux : Introduction à NoSQL
 
Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011Redis and Groovy and Grails - gr8conf 2011
Redis and Groovy and Grails - gr8conf 2011
 
Tayra
TayraTayra
Tayra
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
 
SphinxSearch
SphinxSearchSphinxSearch
SphinxSearch
 
MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016MongoDB Certification Study Group - May 2016
MongoDB Certification Study Group - May 2016
 
Production deployment
Production deploymentProduction deployment
Production deployment
 
Managing Big Data with MySQL
Managing Big Data with MySQLManaging Big Data with MySQL
Managing Big Data with MySQL
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
 
Migrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMigrating to MongoDB: Best Practices
Migrating to MongoDB: Best Practices
 
Social Media Trends - Content Curation
Social Media Trends - Content CurationSocial Media Trends - Content Curation
Social Media Trends - Content Curation
 

Similar to Lessons Learned Migrating 2+ Billion Documents at Craigslist

MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge ShareingPhilip Zhong
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewPierre Baillet
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Consjohnrjenson
 
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relationalTony Tam
 
Mongo db transcript
Mongo db transcriptMongo db transcript
Mongo db transcriptfoliba
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)emiltamas
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterChris Henry
 
MongoDB 2.4 and spring data
MongoDB 2.4 and spring dataMongoDB 2.4 and spring data
MongoDB 2.4 and spring dataJimmy Ray
 
Silicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDBSilicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDBManish Pandit
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxsarah david
 
Mdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchMdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchDaniel M. Farrell
 
From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)MongoSF
 
how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfsarah david
 

Similar to Lessons Learned Migrating 2+ Billion Documents at Craigslist (20)

MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge Shareing
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of view
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Cons
 
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relational
 
Hadoop bank
Hadoop bankHadoop bank
Hadoop bank
 
Look Ma! No more blobs
Look Ma! No more blobsLook Ma! No more blobs
Look Ma! No more blobs
 
Mongo db transcript
Mongo db transcriptMongo db transcript
Mongo db transcript
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb Cluster
 
MongoDB 2.4 and spring data
MongoDB 2.4 and spring dataMongoDB 2.4 and spring data
MongoDB 2.4 and spring data
 
Silicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDBSilicon Valley Code Camp: 2011 Introduction to MongoDB
Silicon Valley Code Camp: 2011 Introduction to MongoDB
 
MongoDB
MongoDBMongoDB
MongoDB
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptx
 
Mdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchMdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_search
 
disertation
disertationdisertation
disertation
 
From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)
 
Whynosql
WhynosqlWhynosql
Whynosql
 
how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdf
 

Recently uploaded

Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseWSO2
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Lessons Learned Migrating 2+ Billion Documents at Craigslist

  • 1. Lessons Learned from Migrating 2+ Billion Documents at Craigslist Jeremy Zawodny jzawodn@craigslist.org Jeremy@Zawodny.com http://blog.zawodny.com/
  • 2. Outline Recap last year’s MongoSV Talk The Archive, Why MongoDB, etc. http://www.10gen.com/video/mongosv2010/craigslist The Infrastructure The Lessons Wishlist Q&A
  • 3. Craigslist Numbers 2 data centers ~500 servers ~100 MySQL servers ~700 cities, worldwide ~1 billion hits/day ~1.5 million posts/day
  • 4. Archive: Where Data Goes To Die Live Numbers ~1.75M posts/day ~14 day avg. lifetime ~60 day retention ~100M posts We keep all postings Users reuse postings Daily archive migration Internal query tools
  • 5. Archive Pain Coupled Schemas Big Indexes Hardware Failures Replication Lag Poor Search Human Time Costs
  • 6. MongoDB Wins Scalable Fast Friendly Proven Pragmatic Approachable
  • 7. MongoDB Details Plan for 5 billion documents Average size: 2KB 3 Replica sets, 3 Servers each Deploy to 2 datacenters Same deployment in each datacenter Posting ID is sharding key
  • 8. MongoDB Architecture Typical Sharding with Replica Sets (external sphinx full-text indexers not pictured) config client client client client config config mongos mongos mongos shard001 shard003 shard002 replica set replica set replica set
  • 9. Lesson: Know Your Hardware MongoDB on blades really sucks Single 10k RPM disks can’t take it when data is noticeably larger than RAM Mongo operations can hit the client timeout (30 sec default) Even minutely cron jobs start to spew Lots of time wasted in development environment, trying different kernels, tuning, etc. Most noticeable during heavy writes but can happen if pages fall out of RAM for other reasons
  • 10. Lesson: Replica Sets Rock Lots of reboots happened during dev environment troubleshooting Each time, one of the remaining nodes took over No “reclone” no config file or DNS changes Stuff “just worked” while nodes bounced up and down
  • 11. Lesson: Know Your Data MongoDB is UTF-8 Some of our older data is decidedly NOT UTF-8 We have lots of sloppy encoding issues to clean up. But we had to clean them all up. Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh. This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!
  • 12. Lesson: Know Your Data Size MongoDB has a doc size limits 4MB in 1.6.x, 16MB in 1.8.x What to do with outliers? In our case, trim off some useless data. But going from relational to document means this sort of problem is easy to have. One parent, many children. It’d be nice if this was easier to change, but clients have it hard-coded too. Compression would help, of course.
  • 13. Lesson: Know Your Data Types Field Types and Conversions can be expensive to do after the fact! MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789” http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod
  • 14. Data Types, continued “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.” Do you know how to do that in your language of choice? Some drivers may make a “guess” that gets it right most of the time.
  • 15. Lesson: Know SomeSharding The Balancer can be your frenemy Initial insert rate: 8,000/sec Later drops to 200/sec Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again Pre-split your data if possible http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
  • 16. Lesson: Know Some Replica Sets Replica Set re-sync requires index rebuilds on the secondary Most painful when a slave is down too long and can’t catch up using the oplog Typically during high write volumes In a large data set, the index rebuilding can take a couple of days w/out many indexes What if you lose another while that is happening?
  • 17. MongoDBWishlist Replica set node re-sync without out index rebuilding Record (or field) compression (not everyone uses a filesystem that offers compression) Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.) Hash-based sharding (coming soon?) Cluster snapshot/backup tool
  • 18. craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Front-end Engineering HTML, CSS, JavaScript, jQuery (Mobile too) Network Administration Routers, switches, load balancers, etc. Back-end Engineering Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc. Systems Administration Help keep all those systems running.
  • 19. craigslist is hiring! send resumes to: z@craigslist.org Plain Text or PDF, no Word Docs! Laid back, non-corporateenvironment Engineering driven culture Lots of interesting technical challenges Easy SF commute Excellent benefits and pay High-impact work Millions use craigslist daily