MySQL and Search at Craigslist


           Jeremy Zawodny
        jzawodn@craigslist.org
          http://craigslist.org/

         Jeremy@Zawodny.com
    http://jeremy.zawodny.com/blog/
Who Am I?
    ● Creator and co-author of High Performance MySQL
    ● Creator of mytop
    ● Perl Hacker
    ● MySQL Geek
    ● Craigslist Engineer (as of July, 2008)
        – MySQL, Data, Search, Perl
    ● Ex-Yahoo (Perl, MySQL, Search, Web Services)
What is Craigslist?
    ● Local Classifieds
        – Jobs, Housing, Autos, Goods, Services
    ● ~500 cities world-wide
    ● Free
        – Except for jobs in ~18 cities and brokered apartments in NYC
        – Over 20B pageviews/month
        – 50M monthly users
        – 50+ countries, multiple languages
        – 40+M ads/month, 10+M images
What is Craigslist?
    ● Forums
        – 100M posts
        – 100s of forums
Technical and other Challenges
    ● High ad churn rate
        – Post half-life can be short
    ● Growth
    ● High traffic volume
    ● Back-end tools and data analysis needs
    ● Growth
    ● Need to archive postings... forever!
        – 100s of millions, searchable
    ● Internationalization and UTF-8
Technical and other Challenges
    ● Small Team
        – Fires take priority
        – Infrastructure gets creaky
        – Organic code and schema growth over years
    ● Growth
    ● Lack of abstractions
        – Too much embedded SQL in code
    ● Documentation vs. Institutional Knowledge
        – “Why do we have things configured like this?”
Goals
    ● Use Open Source
    ● Keep infrastructure small and simple
        – Lower power is good!
        – Efficiency all around
        – Do more with less
    ● Keep site easy and approachable
        – Don't overload with features
        – People are easily confused
Craigslist Internals Overview
    ● Load Balancer
    ● Read Proxy Array: Perl + memcached
    ● Write Proxy Array: ...
    ● Web Read Array: Apache 1.3 + mod_perl
    ● Object Cache: Perl + memcached
    ● Search Cluster: Sphinx
    ● Read DB Cluster: MySQL 5.0.xx
    ● Not Included:
        – user db, image db
        – async tasks, email
        – accounting, internal tools
        – and more!
Vertical Partitioning: Roles
    Users | Classifieds | Forums
    Write | Read | Long | Trash
    Stats | Archive
Vertical Partitioning
    ● Different roles have different access patterns
        – Sub-roles based on query type
    ● Easier to manage and scale
    ● Logical, self-contained data
    ● Servers may not need to be as big/fast/expensive
    ● Difficult to do retroactively
    ● Various named db “handles” in code
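The named-handle idea can be sketched as a lookup from a (role, sub-role) pair to a database target. This is a minimal illustration in Python rather than the Perl used at Craigslist. The role and sub-role names come from the slides, but which sub-roles exist under which role, the host names, and the `handle_for` helper are all assumptions made for the example.

```python
# Hypothetical registry: (role, sub-role) -> database host.
# Role and sub-role names come from the slides; the pairing of
# sub-roles with roles and every host name here are invented.
HANDLES = {
    ("classifieds", "write"): "db-cl-write.example",
    ("classifieds", "read"): "db-cl-read.example",
    ("classifieds", "long"): "db-cl-long.example",
    ("forums", "read"): "db-forums-read.example",
    ("archive", "read"): "db-archive-read.example",
}


def handle_for(role: str, sub_role: str) -> str:
    """Return the host registered for a named handle, or raise KeyError."""
    try:
        return HANDLES[(role, sub_role)]
    except KeyError:
        raise KeyError(f"no db handle named {role}/{sub_role}") from None
```

Calling code asks for `handle_for("classifieds", "long")` instead of hard-coding a server, so slow queries can be routed to different hardware by editing the registry alone.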
Horizontal Partitioning: Hydra
    cluster_01 | cluster_02 | cluster_03 | ... | cluster_N
    client
Horizontal Partitioning: Hydra
    ● Need to retrofit a lot of code
    ● Need non-blocking Perl MySQL client
    ● Wrapped http://code.google.com/p/perl-mysql-async/
    ● Eventually can size DB boxes based on price/power and adjust mapping function(s)
        – Choose hardware first
        – Make the db “fit”
    ● Archiving lets us age a cluster instead of migrating its data to a new one.
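The slides do not say how Hydra maps data to clusters. One possible mapping function, sketched here in Python under that caveat, assigns contiguous ID ranges to clusters: new IDs land on the newest cluster, and older clusters simply age as their postings expire. The `RANGES` table, its boundaries, and the `cluster_for` helper are hypothetical. Only the `cluster_NN` naming comes from the deck.

```python
from bisect import bisect_right

# Hypothetical mapping table: (first ID served, cluster name), sorted by ID.
# Sizing a new cluster to its hardware means choosing where its range starts.
RANGES = [
    (0, "cluster_01"),
    (1_000_000, "cluster_02"),
    (2_500_000, "cluster_03"),
]


def cluster_for(posting_id: int) -> str:
    """Return the cluster whose ID range contains posting_id."""
    if posting_id < 0:
        raise ValueError("posting_id must be non-negative")
    starts = [start for start, _ in RANGES]
    # bisect_right finds the first range starting after posting_id;
    # the range just before it is the one that owns the ID.
    return RANGES[bisect_right(starts, posting_id) - 1][1]
```

Adding a cluster is then an append to the table, and no existing row has to move.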
Search Evolution
    ● Problem: Users want to find stuff.
    ● Solution: Use MySQL Full Text.
    ● ...time passes...
    ● Problem: MySQL Full Text Doesn't Scale!
    ● Solution: Use Sphinx.
    ● ...time passes...
    ● Problem: Sphinx doesn't scale!
    ● Solution: Patch Sphinx.
MySQL Full-Text Problems
    ● Hitting invisible limits
        – CPU not pegged, Memory available
        – Disk I/O not unreasonable
        – Locking / Mutex contention? Probably.
    ● MyISAM has occasional crashing / corruption
    ● 5 clusters of 5 machines
        – Partitioning based on city and category
        – All “hand balanced” and high-maintenance
    ● ~30M queries/day
        – Close to limits
Sphinx: My First CL Project
    ● Sphinx is designed for text search
    ● Fast and lean C++ code
    ● Forking model scales well on multi-core
    ● Control over indexing, weighting, etc.
    ● Also spent some time looking at Apache Solr
Search Implementation Details
    ● Partitioning based on cities (each has a numeric id)
    ● Attributes vs. Keywords
    ● Persistent Connections
        – Custom client and server modifications
    ● Minimal stopword List
    ● Partition into 2 clusters (1 master, 4 slaves)
Sphinx Incremental Indexing
    ● Re-index every N minutes
    ● Use main + delta strategy
        – Adopted as: index + today + delta
        – One set per city (~500 * 3)
    ● Slaves handle live queries, update via rsync
    ● Need lots of FDs
    ● Use all 4 cores to index
    ● Every night, perform “daily merge”
    ● Generate config files via Perl
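With roughly 500 cities and three indexes per city, hand-written config is impractical, which is why it is generated. The sketch below shows the idea in Python rather than Perl. The index naming scheme, the `src_` source names, and the data directory are assumptions. The `index` block with its `source` and `path` directives follows standard Sphinx config syntax, but the deck does not show Craigslist's actual config.

```python
# The three tiers named on the slide: a large base index, today's
# postings, and a small delta that is rebuilt every few minutes.
TIERS = ("index", "today", "delta")


def index_names(city_id: int) -> list[str]:
    """Return the three index names for one city (naming is hypothetical)."""
    return [f"city{city_id}_{tier}" for tier in TIERS]


def render_config(city_ids, data_dir: str = "/var/sphinx") -> str:
    """Render one Sphinx index stanza per (city, tier) pair."""
    stanzas = []
    for city_id in city_ids:
        for name in index_names(city_id):
            stanzas.append(
                f"index {name}\n"
                "{\n"
                f"    source = src_{name}\n"
                f"    path   = {data_dir}/{name}\n"
                "}\n"
            )
    return "\n".join(stanzas)
```

For 500 city ids this yields 1,500 stanzas, and each searchd process keeps files open for every one of them, which is consistent with the note about needing lots of file descriptors.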
Sphinx Issues
    ● Merge bugs [fixed]
    ● File descriptor corruption [fixed]
    ● Persistent connections [fixed]
        – Overhead of fork() was substantial in our testing
        – 200 queries/sec vs. 1,000 queries/sec per box
    ● Missing attribute updates [unreported]
    ● Bogus docids in responses
    ● We need to upgrade to latest Sphinx soon
    ● Andrew and team have been excellent!
Search Project Results
    ● From 25 MySQL Boxes to 10 Sphinx
    ● Lots more headroom!
    ● New Features
        – Nearby Search
    ● No seizing or locking issues
    ● 1,000+ qps during peak w/room to grow
    ● 50M queries per day w/steady growth
    ● Cluster partitioning built but not needed (yet?)
    ● Better separation of code
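To put the daily totals next to the peak rate, the figures can be converted to an average queries-per-second number. This is plain arithmetic on the numbers quoted in the slides, and the `average_qps` helper is only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(queries_per_day: int) -> float:
    """Convert a daily query count to an average per-second rate."""
    return queries_per_day / SECONDS_PER_DAY


print(round(average_qps(50_000_000)))  # 579 (Sphinx, 50M queries/day)
print(round(average_qps(30_000_000)))  # 347 (MySQL Full Text, ~30M queries/day)
```

An average near 579 qps against a peak of 1,000+ qps means peak traffic runs at roughly twice the daily average.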
Sphinx Wishlist
    ● Efficient delete handling (kill lists)
    ● Non-fatal “missing” indexes
    ● Index dump tool
    ● Live document add/change/delete
    ● Built-in replication
    ● Stats and counters
    ● Text attributes
    ● Protocol checksum
Data Archiving, Replication, Indexes
    ● Problem: We want to keep everything.
    ● Solution: Archive to an archive cluster.
    ● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
    ● Solution: Archive with home-grown eventually consistent replication.
Data Archiving: OOB Replication
    ● Eventual Consistency
    ● Master process
        – SET SQL_LOG_BIN=0
        – Select expired IDs
        – Export records from live master
        – Import records into archive master
        – Delete expired from live master
        – Add IDs to list
Data Archiving: OOB Replication
    ● Slave process
        – One per MySQL slave
        – Throttled to minimize impact
        – State kept on slave
            ● Clone friendly
        – Simple logic
            ● Select expired IDs added since my sequence number
            ● Delete expired records
            ● Update local “last seen” sequence number
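The master and slave steps above can be sketched end to end. This is a minimal illustration in Python with in-memory SQLite databases standing in for the MySQL servers, not the actual Craigslist tooling. The table names (`postings`, `expired_ids`, `state`), the `expired` flag, the batch-size throttle, and the choice to keep the ID list on the live master are all assumptions.

```python
import sqlite3

POSTINGS = "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expired INTEGER);"
EXPIRED_LIST = "CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER);"
SLAVE_STATE = "CREATE TABLE state (last_seen INTEGER); INSERT INTO state VALUES (0);"


def new_db(schema: str) -> sqlite3.Connection:
    """Create an in-memory database standing in for one MySQL server."""
    db = sqlite3.connect(":memory:")
    db.executescript(schema)
    return db


def master_pass(live, archive) -> int:
    """Move expired rows from the live master to the archive master."""
    # On MySQL the session would first run SET SQL_LOG_BIN=0 so these
    # changes stay out of the binary log and are not replayed on slaves.
    rows = live.execute(
        "SELECT id, body, expired FROM postings WHERE expired = 1 ORDER BY id"
    ).fetchall()  # select expired IDs and export the records
    archive.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    archive.commit()
    ids = [(row[0],) for row in rows]
    live.executemany("DELETE FROM postings WHERE id = ?", ids)
    # Record each archived ID with an increasing sequence number.
    live.executemany("INSERT INTO expired_ids (id) VALUES (?)", ids)
    live.commit()
    return len(rows)


def slave_pass(slave, id_list, batch: int = 100) -> int:
    """Delete up to `batch` newly expired rows on one slave."""
    (last_seen,) = slave.execute("SELECT last_seen FROM state").fetchone()
    rows = id_list.execute(
        "SELECT seq, id FROM expired_ids WHERE seq > ? ORDER BY seq LIMIT ?",
        (last_seen, batch),
    ).fetchall()
    if not rows:
        return 0
    slave.executemany("DELETE FROM postings WHERE id = ?", [(r[1],) for r in rows])
    # The sequence number lives on the slave itself, so a cloned slave
    # resumes from exactly where its source left off.
    slave.execute("UPDATE state SET last_seen = ?", (rows[-1][0],))
    slave.commit()
    return len(rows)
```

Each slave runs `slave_pass` on its own schedule, and a small `batch` is one simple way to throttle. Until a slave catches up it still serves rows the master has already archived, which is the eventual-consistency trade-off the slides name.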
Long Term Data Archiving
    ● Schema coupling is bad
        – ALTER TABLE takes forever
        – Lots of NULLs flying around
    ● CouchDB or similar long-term?
        – Schema-free feels like a good fit
    ● Tested some home grown solutions already
    ● Separate storage and indexing?
        – Indexing with Sphinx?
Drizzle, XtraDB, Future Stuff
    ● CouchDB looks very interesting. Maybe for archive?
    ● XtraDB / InnoDB plugin
        – Better concurrency
        – Better tuning of InnoDB internals
    ● libdrizzle + Perl
        – DBI/DBD may not fit an async model well
        – Can talk to both MySQL and Drizzle!
    ● Oracle buying Sun?!?!
We're Hiring!
    ● Work in San Francisco
    ● Flexible, Small Company
    ● Excellent Benefits
    ● Help Millions of People Every Week
    ● We Need Perl/MySQL Hackers
    ● Come Help us Scale and Grow
Questions?

Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 

MySQL and Search at Craigslist

  • 1. MySQL and Search at Craigslist Jeremy Zawodny jzawodn@craigslist.org http://craigslist.org/ Jeremy@Zawodny.com http://jeremy.zawodny.com/blog/
  • 2. Who Am I?
      ● Creator and co-author of High Performance MySQL
      ● Creator of mytop
      ● Perl Hacker
      ● MySQL Geek
      ● Craigslist Engineer (as of July, 2008)
          – MySQL, Data, Search, Perl
      ● Ex-Yahoo (Perl, MySQL, Search, Web Services)
  • 4. What is Craigslist?
      ● Local Classifieds
          – Jobs, Housing, Autos, Goods, Services
      ● ~500 cities world-wide
      ● Free
          – Except for jobs in ~18 cities and brokered apartments in NYC
          – Over 20B pageviews/month
          – 50M monthly users
          – 50+ countries, multiple languages
          – 40+M ads/month, 10+M images
  • 5. What is Craigslist?
      ● Forums
          – 100M posts
          – 100s of forums
  • 6. Technical and other Challenges
      ● High ad churn rate
          – Post half-life can be short
      ● Growth
      ● High traffic volume
      ● Back-end tools and data analysis needs
      ● Growth
      ● Need to archive postings... forever!
          – 100s of millions, searchable
      ● Internationalization and UTF-8
  • 7. Technical and other Challenges
      ● Small Team
          – Fires take priority
          – Infrastructure gets creaky
          – Organic code and schema growth over years
      ● Growth
      ● Lack of abstractions
          – Too much embedded SQL in code
      ● Documentation vs. Institutional Knowledge
          – “Why do we have things configured like this?”
  • 8. Goals
      ● Use Open Source
      ● Keep infrastructure small and simple
          – Lower power is good!
          – Efficiency all around
          – Do more with less
      ● Keep site easy and approachable
          – Don't overload with features
          – People are easily confused
  • 9. Craigslist Internals Overview
      [Architecture diagram; labels as extracted: Load Balancer; Read Proxy Array; Write Proxy Array; Perl + memcached; ...; Web Read Array; Apache 1.3 + mod_perl; Object Cache; Search Cluster; Perl + memcached; Sphinx; Read DB Cluster; MySQL 5.0.xx]
      Not Included:
          – user db, image db
          – async tasks, email
          – accounting, internal tools
          – and more!
  • 10. Vertical Partitioning: Roles
      [Diagram; labels as extracted: Users, Classifieds, Forums, Write, Read, Long, Trash, Stats, Archive]
  • 11. Vertical Partitioning
      ● Different roles have different access patterns
          – Sub-roles based on query type
      ● Easier to manage and scale
      ● Logical, self-contained data
      ● Servers may not need to be as big/fast/expensive
      ● Difficult to do retroactively
      ● Various named db “handles” in code
  • 12. Horizontal Partitioning: Hydra
      [Diagram; labels as extracted: client, cluster_01, cluster_02, cluster_03, ..., cluster_N]
  • 13. Horizontal Partitioning: Hydra
      ● Need to retrofit a lot of code
      ● Need non-blocking Perl MySQL client
      ● Wrapped http://code.google.com/p/perl-mysql-async/
      ● Eventually can size DB boxes based on price/power and adjust mapping function(s)
          – Choose hardware first
          – Make the db “fit”
      ● Archiving lets us age a cluster instead of migrating its data to a new one.
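The slides do not show Hydra's mapping function, so the following is only a minimal sketch of one way a key could be mapped to a cluster so that bigger boxes take a bigger share. The weights, the slot count, and the function names `build_slot_table` and `cluster_for` are all assumptions made for illustration, and the sketch is in Python rather than the Perl used at Craigslist.

```python
import bisect
import hashlib

# Hypothetical weights: a larger weight gives a cluster a larger share of keys.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}
SLOTS = 1024  # fixed number of virtual slots (an assumption for this sketch)


def build_slot_table(weights, slots=SLOTS):
    """Give each cluster a contiguous slot range proportional to its weight.

    Returns (upper_bounds, names); cluster i owns slots below upper_bounds[i]
    and at or above the previous bound.
    """
    total = sum(weights.values())
    bounds, names, acc = [], [], 0
    for name in sorted(weights):
        acc += weights[name]
        bounds.append(acc * slots // total)
        names.append(name)
    return bounds, names


def cluster_for(key, table, slots=SLOTS):
    """Hash the key to a slot, then look up which cluster owns that slot."""
    bounds, names = table
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    slot = int(digest, 16) % slots
    return names[bisect.bisect_right(bounds, slot)]


table = build_slot_table(CLUSTER_WEIGHTS)
home = cluster_for(123456, table)  # always the same cluster for this key
```

Changing the weights changes the slot ranges, so some keys would map to a different cluster afterwards; in this sketch that data would have to be moved, or the old cluster left to age out as the archiving bullet suggests.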
  • 14. Search Evolution
      ● Problem: Users want to find stuff.
      ● Solution: Use MySQL Full Text.
      ● ...time passes...
      ● Problem: MySQL Full Text Doesn't Scale!
      ● Solution: Use Sphinx.
      ● ...time passes...
      ● Problem: Sphinx doesn't scale!
      ● Solution: Patch Sphinx.
  • 15. MySQL Full-Text Problems
      ● Hitting invisible limits
          – CPU not pegged, Memory available
          – Disk I/O not unreasonable
          – Locking / Mutex contention? Probably.
      ● MyISAM has occasional crashing / corruption
      ● 5 clusters of 5 machines
          – Partitioning based on city and category
          – All “hand balanced” and high-maintenance
      ● ~30M queries/day
          – Close to limits
  • 16. Sphinx: My First CL Project
      ● Sphinx is designed for text search
      ● Fast and lean C++ code
      ● Forking model scales well on multi-core
      ● Control over indexing, weighting, etc.
      ● Also spent some time looking at Apache Solr
  • 17. Search Implementation Details
      ● Partitioning based on cities (each has a numeric id)
      ● Attributes vs. Keywords
      ● Persistent Connections
          – Custom client and server modifications
      ● Minimal stopword List
      ● Partition into 2 clusters (1 master, 4 slaves)
  • 18. Sphinx Incremental Indexing
      ● Re-index every N minutes
      ● Use main + delta strategy
          – Adopted as: index + today + delta
          – One set per city (~500 * 3)
      ● Slaves handle live queries, update via rsync
      ● Need lots of FDs
      ● Use all 4 cores to index
      ● Every night, perform “daily merge”
      ● Generate config files via Perl
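The "one set per city" and "generate config files" bullets can be illustrated with a small generator. This is a sketch only: the talk says the real generator is Perl, and the index naming scheme, the `src_` source names, and the `/var/sphinx` data directory are hypothetical. The `source` and `path` directives are standard Sphinx index settings; the matching `source` stanzas are left out.

```python
# The three per-city indexes named on the slide: index + today + delta.
PARTS = ("index", "today", "delta")


def index_names(city_id):
    """Return the three index names for one city (naming is hypothetical)."""
    return [f"city{city_id}_{part}" for part in PARTS]


def render_config(city_ids, data_dir="/var/sphinx"):
    """Render one Sphinx index stanza per (city, part) pair."""
    stanzas = []
    for city_id in city_ids:
        for name in index_names(city_id):
            stanzas.append(
                f"index {name}\n"
                "{\n"
                f"    source = src_{name}\n"
                f"    path   = {data_dir}/{name}\n"
                "}\n"
            )
    return "\n".join(stanzas)


# Roughly the scale from the slide: ~500 cities * 3 indexes each.
config_text = render_config(range(1, 501))
```

With about 500 cities this yields about 1,500 index stanzas, and each index is made up of several files on disk, which would be consistent with the "Need lots of FDs" bullet.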
  • 20. Sphinx Issues
      ● Merge bugs [fixed]
      ● File descriptor corruption [fixed]
      ● Persistent connections [fixed]
          – Overhead of fork() was substantial in our testing
          – 200 queries/sec vs. 1,000 queries/sec per box
      ● Missing attribute updates [unreported]
      ● Bogus docids in responses
      ● We need to upgrade to latest Sphinx soon
      ● Andrew and team have been excellent!
  • 21. Search Project Results
      ● From 25 MySQL Boxes to 10 Sphinx
      ● Lots more headroom!
      ● New Features
          – Nearby Search
      ● No seizing or locking issues
      ● 1,000+ qps during peak w/room to grow
      ● 50M queries per day w/steady growth
      ● Cluster partitioning built but not needed (yet?)
      ● Better separation of code
  • 22. Sphinx Wishlist
      ● Efficient delete handling (kill lists)
      ● Non-fatal “missing” indexes
      ● Index dump tool
      ● Live document add/change/delete
      ● Built-in replication
      ● Stats and counters
      ● Text attributes
      ● Protocol checksum
  • 23. Data Archiving, Replication, Indexes
      ● Problem: We want to keep everything.
      ● Solution: Archive to an archive cluster.
      ● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
      ● Solution: Archive with home-grown eventually consistent replication.
  • 24. Data Archiving: OOB Replication
      ● Eventual Consistency
      ● Master process
          – SET SQL_LOG_BIN=0
          – Select expired IDs
          – Export records from live master
          – Import records into archive master
          – Delete expired from live master
          – Add IDs to list
  • 25. Data Archiving: OOB Replication
      ● Slave process
          – One per MySQL slave
          – Throttled to minimize impact
          – State kept on slave
              ● Clone friendly
          – Simple logic
              ● Select expired IDs added since my sequence number
              ● Delete expired records
              ● Update local “last seen” sequence number
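The master and slave steps on slides 24 and 25 can be sketched end to end. This is an illustration under stated assumptions, not Craigslist's code: in-memory SQLite databases stand in for the MySQL servers, the table names `posting`, `expired_ids` and `last_seen` are hypothetical, the ID list is modelled as its own small database because the slides do not say where it lives, and a batch limit stands in for throttling.

```python
import sqlite3


def make_db(rows=()):
    """Stand-in for a MySQL server holding postings."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posting (id INTEGER PRIMARY KEY, body TEXT, expired INTEGER)")
    db.executemany("INSERT INTO posting VALUES (?, ?, ?)", rows)
    return db


def make_slave(rows=()):
    """A slave keeps its own 'last seen' sequence number locally."""
    db = make_db(rows)
    db.execute("CREATE TABLE last_seen (seq INTEGER)")
    db.execute("INSERT INTO last_seen VALUES (0)")
    return db


def make_log():
    """The list of expired IDs, ordered by a sequence number."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)")
    return db


def master_archive_pass(live, archive, log):
    """Master process: archive expired rows, delete them, record their IDs."""
    # On MySQL the slide's first step is SET SQL_LOG_BIN=0, which keeps these
    # statements out of the binary log so slaves do not replay them.
    rows = live.execute("SELECT id, body, expired FROM posting WHERE expired = 1").fetchall()
    archive.executemany("INSERT OR REPLACE INTO posting VALUES (?, ?, ?)", rows)
    ids = [(row[0],) for row in rows]
    live.executemany("DELETE FROM posting WHERE id = ?", ids)
    log.executemany("INSERT INTO expired_ids (posting_id) VALUES (?)", ids)
    return len(rows)


def slave_purge_pass(slave, log, batch=100):
    """Slave process: delete IDs added since this slave's sequence number."""
    (last_seen,) = slave.execute("SELECT seq FROM last_seen").fetchone()
    rows = log.execute(
        "SELECT seq, posting_id FROM expired_ids WHERE seq > ? ORDER BY seq LIMIT ?",
        (last_seen, batch),
    ).fetchall()
    if not rows:
        return 0
    slave.executemany("DELETE FROM posting WHERE id = ?", [(pid,) for _, pid in rows])
    slave.execute("UPDATE last_seen SET seq = ?", (rows[-1][0],))
    return len(rows)
```

Because each slave stores its own sequence number next to its data, a copy of a slave would resume from the right point, which may be what the slide means by "clone friendly".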
  • 26. Long Term Data Archiving
      ● Schema coupling is bad
          – ALTER TABLE takes forever
          – Lots of NULLs flying around
      ● CouchDB or similar long-term?
          – Schema-free feels like a good fit
      ● Tested some home grown solutions already
      ● Separate storage and indexing?
          – Indexing with Sphinx?
  • 27. Drizzle, XtraDB, Future Stuff
      ● CouchDB looks very interesting. Maybe for archive?
      ● XtraDB / InnoDB plugin
          – Better concurrency
          – Better tuning of InnoDB internals
      ● libdrizzle + Perl
          – DBI/DBD may not fit an async model well
          – Can talk to both MySQL and Drizzle!
      ● Oracle buying Sun?!?!
  • 28. We're Hiring!
      ● Work in San Francisco
      ● Flexible, Small Company
      ● Excellent Benefits
      ● Help Millions of People Every Week
      ● We Need Perl/MySQL Hackers
      ● Come Help us Scale and Grow