SlideShare a Scribd company logo
Wikimedia architecture
  Mark Bergsma <mark@wikimedia.org>
         Wikimedia Foundation Inc.
Overview
                                                                  Focus on
                                                                  architecture, not
                                                                  so much on

•
                                                                  operations or non-
      Intro                                                       technical stuff


•     Global architecture

•     Content Delivery Network (CDN)

•     Application servers

•     Persistent storage



    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Some figures
•     The Wikimedia Foundation:

     •     Was founded in June 2003, in Florida

     •     Currently has 9 employees, the rest is done by
           volunteers

     •     Yearly budget of around $2M, supported mostly
           through donations

     •     Supports the popular Wikipedia project, but also 8
           others: Wiktionary, Wikinews, Wikibooks, Wikiquote,
           Wikisource, Wikiversity, Wikispecies, Wikimedia

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Some figures

•     Wikipedia:

     •     8 million articles spread over hundreds of language
           projects (english, dutch, ...)

     •     110 million revisions

     •     10th busiest site in the world (source: Alexa)

     •     Exponential growth: doubling every 4-6 months in
           terms of visitors / traffic / servers


    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Some technical figures

•     30 000 HTTP requests/s during peak-time

•     3 Gbit/s of data traffic

•     3 data centers: Tampa, Amsterdam, Seoul

•     350 servers, ranging between 1x P4 to 2x Xeon Quad-
      Core, 0.5 - 16 GB of memory

•     ...managed by ~ 6 people


    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Pretty graphs!




Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                  Wikimedia architecture
Pretty graphs!




Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                  Wikimedia architecture
Architecture: LAMP...

                                              PHP

              Users                                           Linux
                                          Apache
                                           web
                                          server                      MySQL




Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                  Wikimedia architecture
...on steroids.
Users                                                      Image
                          GeoDNS                           server
                                                                              Distributed
                                                         (Lighttpd)
                                                                             Object Cache
                                                                             (Memcached)


                                                                     PHP
                           Squid
                           caching                               MediaWiki        Core
                           layers                                               databases
         LVS
                                                                                (MySQL)
                                                              Application
                                                LVS             servers
                                                               (Apache)
                              Invalidation notification
                                                                                External
                                                                 Search
 HTTP
 MySQL
                                     Profiling                                   storage
                                                                (Lucene)
 NFS
 Memcached
 DNS
 HTCP                                 Logging



Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                  Wikimedia architecture
Content Distribution
      Network (CDN)
•     3 clusters on 3 different continents:

     •     Primary cluster in Tampa, Florida

     •     Secondary caching-only clusters in Amsterdam, the
           Netherlands and Seoul, South Korea

•     Geographic Load Balancing, based on source IP of client
      resolver, directs clients to the nearest server cluster

     •     Works by statically mapping IP addresses to countries
           to clusters

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Squid caching

•     HTTP reverse proxy caching implemented using Squid

•     Split into two groups with different characteristics

     •     ‘Text’ for wiki content (mostly compressed HTML
           pages)

     •     ‘Media’ for images and other forms of relatively large
           static files



    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Squid caching

•     55 Squid servers currently, plus 20 waiting for setup

     •     ~ 1 000 HTTP requests/s per server, up to 2 500
           under stress

     •     ~ 100 - 250 Mbit/s per server

     •     ~ 14 000 - 32 000 open connections per server



    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Squid caching
•     Up to 40 GB of disk caches per Squid server

•     Disk seek I/O limited

     •     The more disk spindles, the better!

     •     Up to 4 disks per server (1U rack servers)

•     8 GB of memory, half of that used by Squid

•     Hit rates: 85% for Text, 98% for Media, since the use of
      CARP

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
People & Browsers


     PowerDNS
  (geo-distribution)

                                               Primary datacenter

                                               Text cache cluster
    Regional datacenter
                                                       LVS
 Regional text cache cluster

                                                   CARP Squid
            LVS

                                                   Cache Squid
        CARP Squid


        Cache Squid
                                                    Application

Regional media cache cluster                   Media cache cluster

            LVS                                        LVS


        CARP Squid                                 CARP Squid

        Cache Squid                                Cache Squid



      Purge multicast                              Media storage
Origin server(s)
                                                                Florida
                                                        Primary cluster            CARP
sq1                    sq2                        sq3                 sq4   Caching layer




sq1                    sq2                        sq3                 sq4
                                                                            Destination URL
                                                                            Hash routing layer



                                                                                   Clients
                                                           Amsterdam

knsq1                             knsq2                           knsq3




                                                                                    ...
knsq1                             knsq2                           knsq3




        Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                          Wikimedia architecture
Squid cache invalidation
•     Wiki pages are edited at an unpredictable rate

•     Only the latest revision of a page should be served at all
      times in order not to hinder collaboration

•     Invalidation through expiry times not acceptable, explicit
      cache purging needs to be done

•     Implemented using the UDP based HTCP protocol: on
      edit application servers send out a single message
      containing the URL to be invalidated, which is delivered
      over multicast to all subscribed Squid caches

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
The Wiki software
•     All Wikimedia projects run on a MediaWiki platform

•     Open Source software (GPL)

•     Designed primarily for use by Wikipedia/Wikimedia, but
      also used by many outside parties

     •     Arguably the most popular wiki engine out there

•     Written in PHP

•     Storage primarily in MySQL, other DBMSes supported

•     Very scalable, very good localization
    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
MediaWiki
•     MediaWiki in our application server platform:

     •     ~ 125 servers, 40 waiting to be setup

     •     MediaWiki scales well with multiple CPUs, so we buy
           dual quad-core servers now (8 CPU cores per box)

     •     One centrally managed & synchronized software
           installation for hundreds of wikis

     •     Hardware shared with External Storage and
           Memcached tasks

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
MediaWiki caching
                                                                                               Sometimes caching
                                                                  Tell about                   costs more than
                                                                  Memcached,                   recalculating or
                                                                  no dedicated                 looking up at the
                                                                  slide                        data source...
•     Caches everywhere                                                                        profiling!


•     Most of this data is cached in Memcached, a distributed
      object cache                          Content acceleration
                                                                             & distribution network

            Image metadata                   Users & Sessions



                                                                                  Primary interface
              Parser cache                       MediaWiki
                                                                                  language cache


                                                                                 Secondary interface
            Difference cache                Revision text cache
                                                                                  language cache



    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
MediaWiki dependencies
•    A lot of dependencies                                        FastStringSearch

     in our setup                                                     Proctitle
                                             Apache


                                              PHP
                                                                   Imagemagick

                                              APC
                                                                        Tex
                                           MediaWiki

                                                                       DjVu



                                                                        rsvg



                                                                      ploticus
    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
MediaWiki optimization
•     We try to optimize by...

     •     not doing anything stupid

     •     avoiding expensive algorithms, database queries, etc.

     •     caching every result that is expensive and has
           temporal locality of reference

     •     focusing on the hot spots in the code (profiling!)

•     If a MediaWiki feature is too expensive, it doesn’t get
      enabled on Wikipedia

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
http://noc.wikimedia.org/cgi-bin/report.py
MediaWiki profiling




Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                  Wikimedia architecture
Persistent data
•     Persistent data is stored in the following ways:

     •     Metadata, such as article revision history, article
           relations (links, categories etc.), user accounts and
           settings are stored in the core databases

     •     Actual revision text is stored as blobs in External
           storage

     •     Static (uploaded) files, such as images, are stored
           separately on the image server - metadata (size, type,
           etc.) is cached in the core database and object caches

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Core databases
•     Separate database per wiki (not separate server!)

•     One master, many replicated slaves

•     Read operations are load balanced over the slaves, write
      operations go to the master

     •     The master is used for some read operations in case
           the slaves are not yet up to date (lagged)

•     Runs on ~15 DB servers with 4 - 16 GB of memory, 6x
      73 - 146 GB disks and 2 CPUs each

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Core database scaling

•     Scaling by:

     •     Separating read and write operations (master/slave)

     •     Separating some expensive operations from cheap and
           more frequent operations (query groups)

     •     Separating big, popular wikis from smaller wikis

•     Improves caching: temporal and spatial locality of
      reference and reduces the data set size per server


    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Core database schema
      Langlinks      Externallinks   Categorylinks




         Text                        Restrictions


                        Page


       Revision                       Pagelinks




    Recent changes    Imagelinks     Templatelinks




       Watchlist        Image         Oldimage




         User           Blocks        File archive




     User groups       Logging           Jobs




          ...         Site stats       Profiling
External Storage
•     Article text is stored on separate data storage clusters

     •     Simple append-only blob storage

•     Saves space on expensive and busy core databases for
      largely unused data

•     Allows use of spare resources on application servers (2x
      250-500 GB per server)

•     Currently replicated clusters of 3 MySQL hosts are used;
      this might change in the future for better manageability

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Text compression
•     All revisions of all articles are stored

     •     Every Wikipedia article version since day 1 is available

•     Many articles have hundreds, thousands or tens of
      thousands of revisions

•     Most revisions differ only slightly from previous revisions

•     Therefore subsequent revisions of an article are
      concatenated and then compressed

     •     Achieving very high compression ratios of up to 100x

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Media storage
•     Currently:

     •     1 Image server containing 1.3 TB of data, 4 million files

     •     Overloaded serving just 100 requests/s

     •     Media uploads and deletions over NFS, mounted on all
           application servers

     •     Not scalable

     •     Backups take ~ 2 weeks

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Media storage
•     New API between media storage server and application
      servers, based on HTTP

     •     Methods store, publish, delete and generate thumbnail

•     New file / directory layout structure, using content hashes
      for file names

     •     Files with the same name/URL will have the same
           content, no invalidation necessary

•     Migration to some distributed, replicated setup

    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture
Thumbnail generation
                                                                              404? Ask
                                                                        application servers
                                                                           to generate
                                                                             thumbnail

•    stat() on each request                              Image
     is too expensive, so                                server
     assume every file exists
                                                                                              Application
•    If a thumbnail doesn’t                                                                    servers
     exist, ask the application                                         Squid caches
     servers to render it

                                                                  Clients


    Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc.
                      Wikimedia architecture

More Related Content

What's hot

Storage solutions for High Performance Computing
Storage solutions for High Performance ComputingStorage solutions for High Performance Computing
Storage solutions for High Performance Computing
gmateesc
 
Introduction to h base
Introduction to h baseIntroduction to h base
Introduction to h base
TrendProgContest13
 
Considerations for building your private cloud folsom update 041513
Considerations for building your private cloud   folsom update 041513Considerations for building your private cloud   folsom update 041513
Considerations for building your private cloud folsom update 041513
OpenStack Foundation
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
TrendProgContest13
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
Cluster Computing with Dryad
Cluster Computing with DryadCluster Computing with Dryad
Cluster Computing with Dryad
butest
 
ROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in RubyROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in Ruby
Rakuten Group, Inc.
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Konstantin V. Shvachko
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
Rakuten Group, Inc.
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
yarapavan
 
Building A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionBuilding A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage Solution
Phil Cryer
 
HDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsHDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once Semantics
DataWorks Summit
 
A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0
Karsten Dambekalns
 
Oracle RAC and Docker: The Why and How
Oracle RAC and Docker: The Why and HowOracle RAC and Docker: The Why and How
Oracle RAC and Docker: The Why and How
Seth Miller
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
kapa rohit
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Richard McDougall
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Cloudera, Inc.
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
lucenerevolution
 
Session 49 - Semantic metadata management practical
Session 49 - Semantic metadata management practical Session 49 - Semantic metadata management practical
Session 49 - Semantic metadata management practical
ISSGC Summer School
 
Session9part2 Servers Detailed
Session9part2  Servers DetailedSession9part2  Servers Detailed
Session9part2 Servers Detailed
ISSGC Summer School
 

What's hot (20)

Storage solutions for High Performance Computing
Storage solutions for High Performance ComputingStorage solutions for High Performance Computing
Storage solutions for High Performance Computing
 
Introduction to h base
Introduction to h baseIntroduction to h base
Introduction to h base
 
Considerations for building your private cloud folsom update 041513
Considerations for building your private cloud   folsom update 041513Considerations for building your private cloud   folsom update 041513
Considerations for building your private cloud folsom update 041513
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Cluster Computing with Dryad
Cluster Computing with DryadCluster Computing with Dryad
Cluster Computing with Dryad
 
ROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in RubyROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in Ruby
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
Building A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionBuilding A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage Solution
 
HDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsHDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once Semantics
 
A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0
 
Oracle RAC and Docker: The Why and How
Oracle RAC and Docker: The Why and HowOracle RAC and Docker: The Why and How
Oracle RAC and Docker: The Why and How
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Session 49 - Semantic metadata management practical
Session 49 - Semantic metadata management practical Session 49 - Semantic metadata management practical
Session 49 - Semantic metadata management practical
 
Session9part2 Servers Detailed
Session9part2  Servers DetailedSession9part2  Servers Detailed
Session9part2 Servers Detailed
 

Similar to wikimedia-architecture

Yahoo Communities Architecture Unlikely Bedfellows
Yahoo Communities Architecture Unlikely BedfellowsYahoo Communities Architecture Unlikely Bedfellows
Yahoo Communities Architecture Unlikely Bedfellows
ConSanFrancisco123
 
CRX 2 Content Application Platform
CRX 2 Content Application PlatformCRX 2 Content Application Platform
CRX 2 Content Application Platform
Cédric Hüsler
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
Jeff Hammerbacher
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion Queries
Bernd Ocklin
 
20080611accel
20080611accel20080611accel
20080611accel
Jeff Hammerbacher
 
Open stack in sina
Open stack in sinaOpen stack in sina
Open stack in sina
Hui Cheng
 
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBeyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Ben Stopford
 
A closer look to locaweb IaaS
A closer look to locaweb IaaSA closer look to locaweb IaaS
A closer look to locaweb IaaS
Gleicon Moraes
 
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Jeremy Zawodny
 
RunningQuantumOnQuantumAtNicira.pdf
RunningQuantumOnQuantumAtNicira.pdfRunningQuantumOnQuantumAtNicira.pdf
RunningQuantumOnQuantumAtNicira.pdf
OpenStack Foundation
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
Vinay Rao
 
CloudStack Architecture Future
CloudStack Architecture FutureCloudStack Architecture Future
CloudStack Architecture Future
Kimihiko Kitase
 
Web Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud PlatformWeb Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud Platform
Sudhir Tonse
 
Ram chinta hug-20120922-v1
Ram chinta hug-20120922-v1Ram chinta hug-20120922-v1
Ram chinta hug-20120922-v1
Ram Chinta
 
Learn OpenStack from trystack.cn ——Folsom in practice
Learn OpenStack from trystack.cn  ——Folsom in practiceLearn OpenStack from trystack.cn  ——Folsom in practice
Learn OpenStack from trystack.cn ——Folsom in practice
OpenCity Community
 
Getting started with MariaDB with Docker
Getting started with MariaDB with DockerGetting started with MariaDB with Docker
Getting started with MariaDB with Docker
MariaDB plc
 
20081022cca
20081022cca20081022cca
20081022cca
Jeff Hammerbacher
 
Cassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_deCassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_de
mahalomeetup
 
A Tale of 2 Systems
A Tale of 2 SystemsA Tale of 2 Systems
A Tale of 2 Systems
David Newman
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Inc
 

Similar to wikimedia-architecture (20)

Yahoo Communities Architecture Unlikely Bedfellows
Yahoo Communities Architecture Unlikely BedfellowsYahoo Communities Architecture Unlikely Bedfellows
Yahoo Communities Architecture Unlikely Bedfellows
 
CRX 2 Content Application Platform
CRX 2 Content Application PlatformCRX 2 Content Application Platform
CRX 2 Content Application Platform
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion Queries
 
20080611accel
20080611accel20080611accel
20080611accel
 
Open stack in sina
Open stack in sinaOpen stack in sina
Open stack in sina
 
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBeyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
 
A closer look to locaweb IaaS
A closer look to locaweb IaaSA closer look to locaweb IaaS
A closer look to locaweb IaaS
 
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
 
RunningQuantumOnQuantumAtNicira.pdf
RunningQuantumOnQuantumAtNicira.pdfRunningQuantumOnQuantumAtNicira.pdf
RunningQuantumOnQuantumAtNicira.pdf
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
 
CloudStack Architecture Future
CloudStack Architecture FutureCloudStack Architecture Future
CloudStack Architecture Future
 
Web Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud PlatformWeb Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud Platform
 
Ram chinta hug-20120922-v1
Ram chinta hug-20120922-v1Ram chinta hug-20120922-v1
Ram chinta hug-20120922-v1
 
Learn OpenStack from trystack.cn ——Folsom in practice
Learn OpenStack from trystack.cn  ——Folsom in practiceLearn OpenStack from trystack.cn  ——Folsom in practice
Learn OpenStack from trystack.cn ——Folsom in practice
 
Getting started with MariaDB with Docker
Getting started with MariaDB with DockerGetting started with MariaDB with Docker
Getting started with MariaDB with Docker
 
20081022cca
20081022cca20081022cca
20081022cca
 
Cassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_deCassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_de
 
A Tale of 2 Systems
A Tale of 2 SystemsA Tale of 2 Systems
A Tale of 2 Systems
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
 

More from Kapil Mohan

97 Things Every SRE Should Know
97 Things Every SRE Should Know97 Things Every SRE Should Know
97 Things Every SRE Should Know
Kapil Mohan
 
Why Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single RepositoryWhy Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single Repository
Kapil Mohan
 
Stories on Alignment
Stories on AlignmentStories on Alignment
Stories on Alignment
Kapil Mohan
 
Definitive guide to building a remote team
Definitive guide to building a remote teamDefinitive guide to building a remote team
Definitive guide to building a remote team
Kapil Mohan
 
What ya smokin'?
What ya smokin'?What ya smokin'?
What ya smokin'?
Kapil Mohan
 
SlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons LearntSlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons Learnt
Kapil Mohan
 
Devops at SlideShare: Talk at Devopsdays Bangalore 2011
Devops at SlideShare: Talk at Devopsdays Bangalore 2011Devops at SlideShare: Talk at Devopsdays Bangalore 2011
Devops at SlideShare: Talk at Devopsdays Bangalore 2011
Kapil Mohan
 
Awakening India - Jago Party
Awakening India - Jago PartyAwakening India - Jago Party
Awakening India - Jago Party
Kapil Mohan
 
Actionscript Standards
Actionscript StandardsActionscript Standards
Actionscript Standards
Kapil Mohan
 
Tashan - the movie in pics
Tashan - the movie in pics Tashan - the movie in pics
Tashan - the movie in pics
Kapil Mohan
 
Amazon S3 storage engine plugin for MySQL
Amazon S3 storage engine plugin for MySQLAmazon S3 storage engine plugin for MySQL
Amazon S3 storage engine plugin for MySQL
Kapil Mohan
 
Larry's Free Culture
Larry's Free CultureLarry's Free Culture
Larry's Free Culture
Kapil Mohan
 
Wrestle Mania 23
Wrestle Mania 23Wrestle Mania 23
Wrestle Mania 23
Kapil Mohan
 
Honeymoon Travels Movie Preview
Honeymoon Travels Movie PreviewHoneymoon Travels Movie Preview
Honeymoon Travels Movie Preview
Kapil Mohan
 
Parzania Movie Preview
Parzania Movie PreviewParzania Movie Preview
Parzania Movie Preview
Kapil Mohan
 
Traffic Signal Movie Preview
Traffic Signal Movie PreviewTraffic Signal Movie Preview
Traffic Signal Movie Preview
Kapil Mohan
 
Introduction to Slideshare at Barcamp Hyderabad
Introduction to Slideshare at Barcamp HyderabadIntroduction to Slideshare at Barcamp Hyderabad
Introduction to Slideshare at Barcamp Hyderabad
Kapil Mohan
 

More from Kapil Mohan (17)

97 Things Every SRE Should Know
97 Things Every SRE Should Know97 Things Every SRE Should Know
97 Things Every SRE Should Know
 
Why Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single RepositoryWhy Google Stores Billions of Lines of Code in a Single Repository
Why Google Stores Billions of Lines of Code in a Single Repository
 
Stories on Alignment
Stories on AlignmentStories on Alignment
Stories on Alignment
 
Definitive guide to building a remote team
Definitive guide to building a remote teamDefinitive guide to building a remote team
Definitive guide to building a remote team
 
What ya smokin'?
What ya smokin'?What ya smokin'?
What ya smokin'?
 
SlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons LearntSlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons Learnt
 
Devops at SlideShare: Talk at Devopsdays Bangalore 2011
Devops at SlideShare: Talk at Devopsdays Bangalore 2011Devops at SlideShare: Talk at Devopsdays Bangalore 2011
Devops at SlideShare: Talk at Devopsdays Bangalore 2011
 
Awakening India - Jago Party
Awakening India - Jago PartyAwakening India - Jago Party
Awakening India - Jago Party
 
Actionscript Standards
Actionscript StandardsActionscript Standards
Actionscript Standards
 
Tashan - the movie in pics
Tashan - the movie in pics Tashan - the movie in pics
Tashan - the movie in pics
 
Amazon S3 storage engine plugin for MySQL
Amazon S3 storage engine plugin for MySQLAmazon S3 storage engine plugin for MySQL
Amazon S3 storage engine plugin for MySQL
 
Larry's Free Culture
Larry's Free CultureLarry's Free Culture
Larry's Free Culture
 
Wrestle Mania 23
Wrestle Mania 23Wrestle Mania 23
Wrestle Mania 23
 
Honeymoon Travels Movie Preview
Honeymoon Travels Movie PreviewHoneymoon Travels Movie Preview
Honeymoon Travels Movie Preview
 
Parzania Movie Preview
Parzania Movie PreviewParzania Movie Preview
Parzania Movie Preview
 
Traffic Signal Movie Preview
Traffic Signal Movie PreviewTraffic Signal Movie Preview
Traffic Signal Movie Preview
 
Introduction to Slideshare at Barcamp Hyderabad
Introduction to Slideshare at Barcamp HyderabadIntroduction to Slideshare at Barcamp Hyderabad
Introduction to Slideshare at Barcamp Hyderabad
 

Recently uploaded

(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
Priyanka Aash
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
Safe Software
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
kumarjarun2010
 
Using LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and MilvusUsing LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and Milvus
Zilliz
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
CEPTES Software Inc
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAIApplying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
ssuserd4e0d2
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
bhumivarma35300
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Muhammad Ali
 
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
aslasdfmkhan4750
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
Shiv Technolabs
 

Recently uploaded (20)

(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
 
Using LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and MilvusUsing LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and Milvus
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAIApplying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
Applying Retrieval-Augmented Generation (RAG) to Combat Hallucinations in GenAI
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
 
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
 

wikimedia-architecture

  • 1. Wikimedia architecture Mark Bergsma <mark@wikimedia.org> Wikimedia Foundation Inc.
  • 2. Overview Focus on architecture, not so much on • operations or non- Intro technical stuff • Global architecture • Content Delivery Network (CDN) • Application servers • Persistent storage Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 3. Some figures • The Wikimedia Foundation: • Was founded in June 2003, in Florida • Currently has 9 employees, the rest is done by volunteers • Yearly budget of around $2M, supported mostly through donations • Supports the popular Wikipedia project, but also 8 others: Wiktionary, Wikinews, Wikibooks, Wikiquote, Wikisource, Wikiversity, Wikispecies, Wikimedia Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 4. Some figures • Wikipedia: • 8 million articles spread over hundreds of language projects (english, dutch, ...) • 110 million revisions • 10th busiest site in the world (source: Alexa) • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 5. Some technical figures • 30 000 HTTP requests/s during peak-time • 3 Gbit/s of data traffic • 3 data centers: Tampa, Amsterdam, Seoul • 350 servers, ranging between 1x P4 to 2x Xeon Quad- Core, 0.5 - 16 GB of memory • ...managed by ~ 6 people Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 6. Pretty graphs! Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 7. Pretty graphs! Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 8. Architecture: LAMP... PHP Users Linux Apache web server MySQL Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 9. ...on steroids. Users Image GeoDNS server Distributed (Lighttpd) Object Cache (Memcached) PHP Squid caching MediaWiki Core layers databases LVS (MySQL) Application LVS servers (Apache) Invalidation notification External Search HTTP MySQL Profiling storage (Lucene) NFS Memcached DNS HTCP Logging Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 10. Content Distribution Network (CDN) • 3 clusters on 3 different continents: • Primary cluster in Tampa, Florida • Secondary caching-only clusters in Amsterdam, the Netherlands and Seoul, South Korea • Geographic Load Balancing, based on source IP of client resolver, directs clients to the nearest server cluster • Works by statically mapping IP addresses to countries to clusters Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 11. Squid caching • HTTP reverse proxy caching implemented using Squid • Split into two groups with different characteristics • ‘Text’ for wiki content (mostly compressed HTML pages) • ‘Media’ for images and other forms of relatively large static files Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 12. Squid caching • 55 Squid servers currently, plus 20 waiting for setup • ~ 1 000 HTTP requests/s per server, up to 2 500 under stress • ~ 100 - 250 Mbit/s per server • ~ 14 000 - 32 000 open connections per server Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 13. Squid caching • Up to 40 GB of disk caches per Squid server • Disk seek I/O limited • The more disk spindles, the better! • Up to 4 disks per server (1U rack servers) • 8 GB of memory, half of that used by Squid • Hit rates: 85% for Text, 98% for Media, since the use of CARP Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 14. People & Browsers PowerDNS (geo-distribution) Primary datacenter Text cache cluster Regional datacenter LVS Regional text cache cluster CARP Squid LVS Cache Squid CARP Squid Cache Squid Application Regional media cache cluster Media cache cluster LVS LVS CARP Squid CARP Squid Cache Squid Cache Squid Purge multicast Media storage
  • 15. Origin server(s) Florida Primary cluster CARP sq1 sq2 sq3 sq4 Caching layer sq1 sq2 sq3 sq4 Destination URL Hash routing layer Clients Amsterdam knsq1 knsq2 knsq3 ... knsq1 knsq2 knsq3 Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 16. Squid cache invalidation • Wiki pages are edited at an unpredictable rate • Only the latest revision of a page should be served at all times in order not to hinder collaboration • Invalidation through expiry times not acceptable, explicit cache purging needs to be done • Implemented using the UDP based HTCP protocol: on edit application servers send out a single message containing the URL to be invalidated, which is delivered over multicast to all subscribed Squid caches Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 17. The Wiki software • All Wikimedia projects run on a MediaWiki platform • Open Source software (GPL) • Designed primarily for use by Wikipedia/Wikimedia, but also used by many outside parties • Arguably the most popular wiki engine out there • Written in PHP • Storage primarily in MySQL, other DBMSes supported • Very scalable, very good localization Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 18. MediaWiki • MediaWiki in our application server platform: • ~ 125 servers, 40 waiting to be setup • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box) • One centrally managed & synchronized software installation for hundreds of wikis • Hardware shared with External Storage and Memcached tasks Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 19. MediaWiki caching Sometimes caching Tell about costs more than Memcached, recalculating or no dedicated looking up at the slide data source... • Caches everywhere profiling! • Most of this data is cached in Memcached, a distributed object cache Content acceleration & distribution network Image metadata Users & Sessions Primary interface Parser cache MediaWiki language cache Secondary interface Difference cache Revision text cache language cache Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 20. MediaWiki dependencies • A lot of dependencies FastStringSearch in our setup Proctitle Apache PHP Imagemagick APC Tex MediaWiki DjVu rsvg ploticus Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 21. MediaWiki optimization • We try to optimize by... • not doing anything stupid • avoiding expensive algorithms, database queries, etc. • caching every result that is expensive and has temporal locality of reference • focusing on the hot spots in the code (profiling!) • If a MediaWiki feature is too expensive, it doesn’t get enabled on Wikipedia Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 23. MediaWiki profiling Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 24. Persistent data • Persistent data is stored in the following ways: • Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases • Actual revision text is stored as blobs in External storage • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 25. Core databases • Separate database per wiki (not separate server!) • One master, many replicated slaves • Read operations are load balanced over the slaves, write operations go to the master • The master is used for some read operations in case the slaves are not yet up to date (lagged) • Runs on ~15 DB servers with 4 - 16 GB of memory, 6x 73 - 146 GB disks and 2 CPUs each Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 26. Core database scaling • Scaling by: • Separating read and write operations (master/slave) • Separating some expensive operations from cheap and more frequent operations (query groups) • Separating big, popular wikis from smaller wikis • Improves caching: temporal and spatial locality of reference and reduces the data set size per server Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 27. Core database schema Langlinks Externallinks Categorylinks Text Restrictions Page Revision Pagelinks Recent changes Imagelinks Templatelinks Watchlist Image Oldimage User Blocks File archive User groups Logging Jobs ... Site stats Profiling
  • 28. External Storage • Article text is stored on separate data storage clusters • Simple append-only blob storage • Saves space on expensive and busy core databases for largely unused data • Allows use of spare resources on application servers (2x 250-500 GB per server) • Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 29. Text compression • All revisions of all articles are stored • Every Wikipedia article version since day 1 is available • Many articles have hundreds, thousands or tens of thousands of revisions • Most revisions differ only slightly from previous revisions • Therefore subsequent revisions of an article are concatenated and then compressed • Achieving very high compression ratios of up to 100x Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 30. Media storage • Currently: • 1 Image server containing 1.3 TB of data, 4 million files • Overloaded serving just 100 requests/s • Media uploads and deletions over NFS, mounted on all application servers • Not scalable • Backups take ~ 2 weeks Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 31. Media storage • New API between media storage server and application servers, based on HTTP • Methods store, publish, delete and generate thumbnail • New file / directory layout structure, using content hashes for file names • Files with the same name/URL will have the same content, no invalidation necessary • Migration to some distributed, replicated setup Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture
  • 32. Thumbnail generation 404? Ask application servers to generate thumbnail • stat() on each request Image is too expensive, so server assume every file exists Application • If a thumbnail doesn’t servers exist, ask the application Squid caches servers to render it Clients Mark Bergsma, mark@wikimedia.org, Wikimedia Foundation Inc. Wikimedia architecture