Architecture:
                             Surviving the High Load




Friday, May 6, 2011
Who are we?
                         Alexander Chinaryov
                         Lead Platform Developer
                         Since 2007




                         Alexander Hristoforov
                         Lead Platform Developer
                         Since 2009


                         Oleg Anastasyev
                         Lead Platform Developer
                         Since 2007




Load: some facts
                            2.8M users online
                         150k pages/s, 50 ms avg
                              32 Gbit/s out

     •    4000 messages/s (> # of tweets)
     •    160k photo downloads/s
     •    500 comments/s
     •    90,000 notifications/s
     •    1500 feed posts/s, 30k feed gets/s


Load: handled by
     • 3 datacenters
     • 2400 servers & storage nodes (and counting)

     • 1.5M SLOC (99.9% Java, 0.1% C)
     • 60 modules

     • 40 devs + 8 testers
     • 20 admins

Arch: layers
     • 150+50 web servers
     • 120 app servers
     • 25 kinds of business services
     • 6 SSO servers
     • >100 cache servers
     • 230 SQL servers
     • >400 NoSQL servers




Load: Balance
     • LVS
     • One-cluster
           – Weighted round-robin (RR)
           – Pluggable failure detectors
           – Integrated with one-remote-service
           – Locality groups
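A minimal sketch of weighted round-robin with a pluggable failure detector, to make the two bullets above concrete. This is not the one-cluster code: the class name `WeightedRoundRobin`, the slot-list approach, and the `Predicate`-based detector are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical weighted round-robin selector; not the real one-cluster code. */
class WeightedRoundRobin {
    private final List<String> slots = new ArrayList<>();
    private final Predicate<String> isAlive; // pluggable failure detector (assumed shape)
    private int next = 0;

    WeightedRoundRobin(Predicate<String> isAlive) {
        this.isAlive = isAlive;
    }

    /** Registers a server; a higher weight gives it more slots per cycle. */
    synchronized void add(String server, int weight) {
        for (int i = 0; i < weight; i++) {
            slots.add(server);
        }
    }

    /** Returns the next live server, or null if every server is reported down. */
    synchronized String pick() {
        for (int tried = 0; tried < slots.size(); tried++) {
            String candidate = slots.get(next);
            next = (next + 1) % slots.size();
            if (isAlive.test(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

A server added with weight 2 gets two of every three picks next to a weight-1 server; when the detector reports it dead, its slots are simply skipped.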




Arch: Presentation
     • Apache Tomcat 6
     • RDK framework:
           – GUI components
           – Independent portlets
           – AJAX updates → no full page reload
           – No JavaScript required
     • Google Web Toolkit for dynamic features
           – Toolbar, Photo pins, gifts
     • Flash (Apps, players, ads)
Arch: Business Logic
     • Odnoklassniki-ejb
           – JBoss 4.2
           – JTA, Stateless, Entity beans (BMP)
           – Business Op handling & orchestration
           – Event/handler pattern
           – Component logic
           – Data partitioning
           – Spring (DI)



Arch: Business Services
     • IM, discussions, feeds
           – JBoss Remoting 2.2
           – One remote service
           – 100k+ req/s on a recent 8-core CPU
            /**
             * Example of a remote service interface
             */
            public interface Server extends RemoteService
            {
                @RemoteMethod
                IListChunk<Friend> getFreshMyFriends(@PartitionSource long userId, IChunkProperties cp);

                @RemoteMethod(invokeAll=true,split=true,reduceStrategy=ListReduceStrategy.class)
                List<?> mapReduceMethod(@PartitionSource long userId, ... );

                @RemoteMethod(invokeAll=true,asyncMaxDelay=1000L,asyncMaxBatch=100)
                void asyncNotify(@PartitionSource long userId, ... );
            }
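The interface above suggests that the `@PartitionSource` argument decides which partition a call goes to. A rough sketch of how a client proxy could read such an annotation follows. Everything here is an assumption: the annotation definition is a stand-in (the real one is not shown on the slide), `NotifyService` and `PartitionRouter` are made-up names, and modulo routing is only one possible mapping.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

/** Stand-in for the annotation on the slide; the real definition is not shown there. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface PartitionSource {}

/** Hypothetical service used only for this sketch. */
interface NotifyService {
    void asyncNotify(@PartitionSource long userId, String text);
}

final class PartitionRouter {
    /** Finds the @PartitionSource argument and maps it to a partition (modulo is an assumption). */
    static int partitionFor(Method method, Object[] args, int partitionCount) {
        Annotation[][] annotations = method.getParameterAnnotations();
        for (int i = 0; i < annotations.length; i++) {
            for (Annotation a : annotations[i]) {
                if (a instanceof PartitionSource) {
                    long key = ((Number) args[i]).longValue();
                    return (int) Math.floorMod(key, (long) partitionCount);
                }
            }
        }
        throw new IllegalArgumentException("no @PartitionSource parameter: " + method.getName());
    }

    /** Builds a proxy that only records the chosen partition instead of making a network call. */
    static <T> T recordingProxy(Class<T> iface, int partitionCount, List<Integer> routedTo) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                (self, method, args) -> {
                    routedTo.add(partitionFor(method, args, partitionCount));
                    return null; // fine for void methods, which is all this sketch supports
                });
        return iface.cast(proxy);
    }
}
```

A real client would replace the recording step with a call to the server that owns the partition; options such as `invokeAll` or `split` from the slide are not modelled here.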




Arch: Caches
     • one-graph
           – Social graph storage
           – 30 GB, 17K ops/server, 7% CPU …
     • Odnoklassniki-cache
           – users, groups, photos, sessions...
           – Smart
           – Off-heap (Unsafe) → no full GC (FGC)
     • Near cache
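To illustrate why off-heap storage avoids full GC pauses, here is a small sketch. It uses a direct `ByteBuffer` rather than `Unsafe` so that it runs on a stock JVM without special access; the class `OffHeapLongArray` is a made-up example, not the Odnoklassniki-cache implementation.

```java
import java.nio.ByteBuffer;

/** Hypothetical off-heap array of longs; uses a direct ByteBuffer instead of Unsafe. */
final class OffHeapLongArray {
    private final ByteBuffer buf;
    private final int capacity;

    OffHeapLongArray(int capacity) {
        this.capacity = capacity;
        // Direct buffers are allocated outside the Java heap, so the collector
        // never has to scan or move the data stored in them.
        this.buf = ByteBuffer.allocateDirect(Math.multiplyExact(capacity, Long.BYTES));
    }

    void set(int index, long value) {
        buf.putLong(offset(index), value);
    }

    long get(int index) {
        return buf.getLong(offset(index));
    }

    /** Converts a record index to a byte offset, rejecting out-of-range indexes. */
    private int offset(int index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return index * Long.BYTES;
    }
}
```

The heap only holds one small buffer object no matter how many records are stored, which is the property the "no FGC" bullet relies on.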


Arch: Persistence
     • MS SQL Server 2005
           – High consistency
           – Flexible queries
     • NoSQL: one-db
           – Berkeley DB 4.5 C edition +
           – JBoss Remoting-based server +
           – Simple querying =
           – NoSQL storage server
     • … and others are being researched
Concept: DB Partitioning
     •    DB scaling is hard & expensive
     •    Vertical
     •    Horizontal
     •    ID:
           – long ID = (uid << 8) + domain
           – Domain = 0..255
           – Domain → servers map
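The ID scheme can be sketched as below. Note that in Java `+` binds tighter than `<<`, so the shift needs its own parentheses. The class name `PartitionedId` and the `serverOf` mapping (contiguous domain ranges per server) are assumptions; the slide only says that a domain → servers map exists.

```java
/** Sketch of the ID scheme from the slide; names are illustrative. */
final class PartitionedId {
    static final int DOMAIN_BITS = 8;
    static final long DOMAIN_MASK = (1L << DOMAIN_BITS) - 1; // 0xFF, domains 0..255

    /** Packs uid and domain into one long: the low 8 bits hold the domain. */
    static long encode(long uid, int domain) {
        if (domain < 0 || domain > DOMAIN_MASK) {
            throw new IllegalArgumentException("domain must be 0..255: " + domain);
        }
        return (uid << DOMAIN_BITS) | domain;
    }

    /** Extracts the domain from the low 8 bits. */
    static int domainOf(long id) {
        return (int) (id & DOMAIN_MASK);
    }

    /** Extracts the uid from the remaining high bits. */
    static long uidOf(long id) {
        return id >>> DOMAIN_BITS;
    }

    /** Hypothetical domain → server map: contiguous ranges of domains per server. */
    static int serverOf(long id, int serverCount) {
        return domainOf(id) * serverCount / 256;
    }
}
```

Because the domain travels inside every ID, any layer can find the owning server without a lookup, and moving a domain to another server only changes the map.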



Perf: SQL DB
     •    XA → local transactions only
     •    Dirty reads
     •    DB JOIN → join in app server memory
     •    FK, SP, Triggers
     •    DELETE:
           – No delete/insert workflow → update
           – Async batch process, retry
     • Indexes, clustered indexes

Perf: general
     • Sequential access speed:
           – RAM 10x > SSD 1.5x > 1 Gbit Ethernet 2x > disk

     • Random access speed:
           – RAM 20000x (~50 ns) > SSD 5-10x > disk (~5 ms)
           – Network round trip ~0.5 ms

     • So:
           – Near data/cache is the fastest solution (but brings a cache coherence problem)
           – Partitioned network cache
           – Database access is the slowest thing

     • Still, you have to sacrifice consistency


Surviving: GC
     • Young GC → high CPU load
           – Too much garbage (autoboxing, overlooked log.debug, ...)
           – FIX: find and fix the code → can take weeks

     • Old GC → pauses → carousel
           – 2-4 GB is the limit for ParallelGC (1-4 s)
           – 8-10 GB is the limit for CMS
                 • and it can still stop the world!
           – FIX: use Unsafe (off-heap memory) or partition

     • Perm GC → pauses → carousel again
           – Too many classes
           – FIX: -XX:+CMSClassUnloadingEnabled
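The two young-generation garbage sources named above are easy to overlook in code. The sketch below shows both patterns next to a cheaper variant; `GarbageExamples` and its tiny `Log` class are made up for illustration and stand in for whatever logging library is actually in use.

```java
/** Illustrative examples of hidden allocations; not code from the project. */
final class GarbageExamples {
    /** Allocates a new Long on almost every iteration because of autoboxing. */
    static long sumBoxed(int n) {
        Long sum = 0L;
        for (int i = 0; i < n; i++) {
            sum += i; // unboxes, adds, then boxes the result again
        }
        return sum;
    }

    /** Same result with a primitive accumulator: no allocation inside the loop. */
    static long sumPrimitive(int n) {
        long sum = 0L;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    /** Hypothetical minimal logger standing in for a real logging library. */
    static final class Log {
        private final boolean debugEnabled;
        int written = 0;

        Log(boolean debugEnabled) {
            this.debugEnabled = debugEnabled;
        }

        boolean isDebugEnabled() {
            return debugEnabled;
        }

        void debug(String message) {
            if (debugEnabled) {
                written++;
                System.out.println(message);
            }
        }
    }

    static void handle(Log log, long userId) {
        // Unguarded, log.debug("handling user " + userId) would build the String
        // even when debug output is off. The guard skips that allocation.
        if (log.isDebugEnabled()) {
            log.debug("handling user " + userId);
        }
    }
}
```

On a hot path executed tens of thousands of times per second, each of these small allocations adds up to constant young-generation pressure.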




Surviving: failures
     • SQL partition failure
           – FIX: fault tolerance: reads return incomplete data, writes fail
     • One-db
           – Unstable replication → no fix :-(
           – Data corruption → separate ID storage
           – Random disk access → SSD, tmpfs




Surviving: carousel
     • Reasons:
           – Network problems
           – Unusual activity, spammers
           – Full GCs
           – Cold caches
           – Unexpected slowdowns, bugs
           – Activity growth

     • Fixes:
           – Timeout = 3s
           – Client-side automatic failure detectors, server cutout
           – Gatekeepers
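A client-side failure detector with server cutout could look roughly like this. The caller is expected to report a failure whenever a call errors out or exceeds the 3 s timeout. The class name, the consecutive-failure threshold, and the cutout period are assumptions; the slides do not describe the actual detection rules.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side failure detector; thresholds are illustrative only. */
final class FailureDetector {
    private final int maxFailures;
    private final long cutoutMillis;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> cutUntil = new HashMap<>();

    FailureDetector(int maxFailures, long cutoutMillis) {
        this.maxFailures = maxFailures;
        this.cutoutMillis = cutoutMillis;
    }

    /** A successful call resets the consecutive-failure counter. */
    synchronized void onSuccess(String server) {
        failures.remove(server);
    }

    /** Call on an error or timeout; cuts the server out after too many failures in a row. */
    synchronized void onFailure(String server, long nowMillis) {
        int count = failures.merge(server, 1, Integer::sum);
        if (count >= maxFailures) {
            cutUntil.put(server, nowMillis + cutoutMillis);
            failures.remove(server);
        }
    }

    /** A cut-out server becomes eligible again once its cutout period has passed. */
    synchronized boolean isAvailable(String server, long nowMillis) {
        Long until = cutUntil.get(server);
        return until == null || nowMillis >= until;
    }
}
```

Cutting a slow server out on the client side stops requests from queueing behind it, which is what keeps one bad node from dragging its callers down with it.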




Surviving: gatekeepers
     • Fine-grained feature switches
     • Used for:
           – Fighting the carousel
           – Smooth launch of new features
           – Experiments

     • Can:
           – Turn specific features or individual third-party games on/off
           – On a per-server basis
           – On a per-user-domain basis
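A gatekeeper can be as simple as a set of switches consulted before a feature runs. The sketch below is an assumption about the shape of such a component, not the real one: the `Gatekeeper` class and its method names are invented, and "per server" is modelled simply by each server holding its own instance.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical feature-switch registry; one instance per server. */
final class Gatekeeper {
    private final Set<String> disabled = ConcurrentHashMap.newKeySet();
    private final Map<String, Set<Integer>> disabledDomains = new ConcurrentHashMap<>();

    /** Turns a feature off for everyone served by this instance. */
    void turnOff(String feature) {
        disabled.add(feature);
    }

    /** Turns a feature back on and clears its per-domain switches. */
    void turnOn(String feature) {
        disabled.remove(feature);
        disabledDomains.remove(feature);
    }

    /** Turns a feature off only for users in one domain (0..255). */
    void turnOffForDomain(String feature, int domain) {
        disabledDomains
                .computeIfAbsent(feature, f -> ConcurrentHashMap.<Integer>newKeySet())
                .add(domain);
    }

    /** Checked by the caller before running the feature for a given user domain. */
    boolean isEnabled(String feature, int userDomain) {
        if (disabled.contains(feature)) {
            return false;
        }
        Set<Integer> domains = disabledDomains.get(feature);
        return domains == null || !domains.contains(userDomain);
    }
}
```

Rolling out a new feature then means enabling it for a few domains first, and fighting an overload means switching off the most expensive features until the load drops.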




Surviving: measure!
     • One-log statistics




Thank you



                         Questions?


                            We are hiring
                          jobs@forticom.com



Test yourself ;-)
     • PhotoMarks table
            PhotoId:long   UserId:long   Mark:byte   timestamp

           – 32p x (500M rows, 42 Gb data + 25 Gb index)
           – Load (photoId, userId): 14kops, create: 1500kpos
           – Most load calls check for row absence

     • Rejected a priori
           – Add more SQL nodes – too expensive
           – Put all marks in cache – 2600 GB of RAM is not cheap either



