EEDC 2010. Scaling Web Applications
Seminar given as part of the CANS master's program at the Facultad de Informática de Barcelona.
Anatomy of a web application
Too many writes to the database: what can I do?
How can I take advantage of the "Cloud"?
Optimizing Facebook applications

License: CC Attribution-NonCommercial-NoDerivs

EEDC 2010. Scaling Web Applications: Presentation Transcript

  • 6.1. Web Scale
    34330
    EEDC
    Execution
    Environments for
    Distributed
    Computing
    6.1.1. Anatomy of a service
    6.1.2. Too many Writes to Database
    6.1.3. Cheaper peaks
    6.1.4. Facebook Platform
    Master in Computer Architecture, Networks and Systems - CANS
  • Anatomy of a Web Service
  • Problems may arise in…
    Various browsers, plugins,
    operating systems,
    performance, screen size,
    PEBKAC, etc.
  • Problems may arise in…
    Internet partitioning,
    performance bottlenecks,
    packet loss, jitter
  • Problems may arise in…
    DDoS targeting another customer,
    routing problems, capacity,
    power/cooling problems,
    «lazy» remote hands
  • Problems may arise in…
    Performance limits, bugs,
    configuration errors,
    faulty HW
  • Problems may arise in…
    Network limits, interrupt limits,
    OS limits, bugs,
    configuration errors,
    faulty HW, error recovery
  • Problems may arise in…
    Speed of clients, #threads,
    content not in sync,
    unresponsive Apps,
    too many sources of content,
    user persistence,
    configuration errors, bugs
  • Problems may arise in… Requests/sec
    [Diagram: per-request asset sizes of 100 KB, 5 MB, 50 KB, 5 KB, 50 KB, 50 KB]
    The default configuration of Tomcat allows 200 threads/instance
  • Problems may arise in…
    Speed of clients, #threads,
    content not in sync,
    unresponsive Apps,
    too many sources of content,
    user persistence,
    configuration errors, bugs
  • Problems may arise in…
    Database concurrency,
    access to 3rd-party data (APIs),
    CPU- or memory-bound problems,
    datacenter replication,
    logging user actions
  • Problems may arise in…
    Database concurrency,
    modifying schemas,
    Massive tables -> indexes,
    disk performance,
    CPU/memory bound,
    datacenter replication
  • Problems may arise in…
    Availability and performance,
    It takes more than 24h to analyze a day's logs
    Mail not reaching the Inbox (spam folders)
    Surpassing monitoring capacity
  • 6.1. Web Scale
    34330
    EEDC
    Execution
    Environments for
    Distributed
    Computing
    6.1.1. Anatomy of a service
    6.1.2. Too many Writes to Database
    6.1.3. Cheaper peaks
    6.1.4. Facebook Platform
    Master in Computer Architecture, Networks and Systems - CANS
  • Too many writes to database
    There’s no machine that can do 44k/sec over 1 TB of data.
    Scaling reads is easier:
    Big cache
    Replication
    On write you have to:
    Update data
    Update Transaction log
    Update indexes
    Invalidate cache
    Replicate
    Write to 2 or more disks (RAID x)
    http://www.scribd.com/doc/2592098/DVPmysqlucFederation-at-Flickr-Doing-Billions-of-Queries-Per-Day
  • Case
    Database Federation
    Sharding per User-ID
    Global Ring: knows where the data is
    PHP logic to connect the shards and keep the data consistent
    What’s a Shard?
    Horizontal partitioning of a table, usually per Primary Key (a minimal routing sketch follows this slide)
    Benefits
    You can scale as long as you have budget
    Disadvantages
    You lose the ability to do any JOIN, COUNT, or RANGE query between Shards
    Your application logic has to be aware of the sharding
    If you want to rebalance shards, you will need some kind of globally unique ID; beware of auto-increments
    More services needing HA, BCP, change control, and so on
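A minimal sketch (in Python, not Flickr's actual PHP) of what "sharding per User-ID" means in code; the shard addresses and the modulo scheme are illustrative assumptions. Note that plain modulo makes rebalancing painful, which is one reason for the lookup ring described on the next slide.

    # Hypothetical shard list; real deployments keep this in config/ring tables.
    SHARDS = [
        "mysql://shard0.internal/flickr",
        "mysql://shard1.internal/flickr",
        "mysql://shard2.internal/flickr",
    ]

    def shard_for_user(user_id: int) -> str:
        # Horizontal partitioning per primary key: the user ID picks the shard,
        # so all of one user's rows (photos, favorites, ...) stay together and
        # single-user queries stay local, but cross-shard JOINs are gone.
        return SHARDS[user_id % len(SHARDS)]

    print(shard_for_user(42))  # -> mysql://shard0.internal/flickr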
  • Case
    Global Ring?
    Storing Key-Value of:
    User_ID -> Shard_ID
    Photo_ID -> User_ID
    Group_ID -> Shard_ID
    Every data access has to know where the data lives -> memcached with a TTL of 30 minutes (sketched after this slide)
    Global IDs?
    You don’t want two objects with the same ID!
    Strategies
    GUIDs: 128-bit IDs, so bigger indexes, and poorly supported by MySQL
    Central auto-increment: a table where, for every ID needed, you do an insert and let MySQL take care of everything. At 60 photos/sec it will be a BIG table
    REPLACE INTO: a MySQL-only solution; small tables, and it allows for redundancy (one server provides odd IDs and another even)
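A hedged sketch of the Global Ring lookup described above: a key/value mapping from entity ID to shard, cached in memcached with the 30-minute TTL from the slide. The client library, key format, and ring-table stub are assumptions, not Flickr's real schema.

    import pylibmc  # assumption: any memcached client with get/set would do

    mc = pylibmc.Client(["127.0.0.1"])
    TTL = 30 * 60  # 30 minutes, per the slide

    def lookup_ring_db(entity: str, entity_id: int) -> int:
        # Stub standing in for a SELECT against the ring tables
        # (User_ID -> Shard_ID, Photo_ID -> User_ID, Group_ID -> Shard_ID).
        return entity_id % 4

    def shard_for(entity: str, entity_id: int) -> int:
        key = f"ring:{entity}:{entity_id}"  # e.g. ring:user:42
        shard_id = mc.get(key)
        if shard_id is None:  # cache miss: hit the ring DB, cache the answer
            shard_id = lookup_ring_db(entity, entity_id)
            mc.set(key, shard_id, time=TTL)
        return shard_id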
  • Case: Replace INTO
    The Tickets64 schema looks like:
    CREATE TABLE `Tickets64` (
    `id` bigint(20) unsigned NOT NULL auto_increment,
    `stub` char(1) NOT NULL default '',
    PRIMARY KEY (`id`), UNIQUE KEY `stub` (`stub`)
    ) ENGINE=MyISAM
    SELECT * from Tickets64 returns a single row that looks something like:
    +-------------------+------+
    | id | stub |
    +-------------------+------+
    | 72157623227190423 | a |
    +-------------------+------+
    When they need a new globally unique 64-bit ID they issue the following SQL:
    REPLACE INTO Tickets64 (stub) VALUES ('a'); SELECT LAST_INSERT_ID();
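For illustration, here is how a client might consume that ticket scheme, with two servers for redundancy (per the slides, one handing out odd IDs and one even, which MySQL supports via auto_increment_increment/auto_increment_offset). Hostnames and credentials are placeholders.

    import random
    import pymysql  # assumption: any MySQL client works the same way

    TICKET_SERVERS = [
        "ticket-odd.internal",   # auto_increment_offset=1
        "ticket-even.internal",  # auto_increment_offset=2
    ]

    def next_global_id() -> int:
        # Either server can answer; together they survive a single failure.
        conn = pymysql.connect(host=random.choice(TICKET_SERVERS),
                               user="app", database="tickets")
        try:
            with conn.cursor() as cur:
                # REPLACE keeps Tickets64 at a single row while
                # LAST_INSERT_ID() keeps advancing monotonically.
                cur.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
                cur.execute("SELECT LAST_INSERT_ID()")
                (new_id,) = cur.fetchone()
            conn.commit()
            return new_id
        finally:
            conn.close()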
  • Case
    PHP Logic
    You lose any kind of inter-shard relational query (no JOINs)
    You lose any kind of referential integrity (no Foreign Keys)
    You have to handle distributed transactions yourself
    You select a Favorite (so they need to update your Shard and the other user’s Shard)
    Open 2 connections to the two shards
    Begin a transaction on both Shards
    Add the data
    If everything is OK -> commit, else roll back and raise an error
    So we improve scalability, but impact code complexity and the performance of a single page view (hint: asynchronous database access); a sketch follows this slide
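The sketch below shows the Favorite flow just described: two connections, a transaction on each shard, commit only if both writes succeed. Table and column names are invented for illustration, and note this is not true two-phase commit; a crash between the two commits can still desynchronize the shards, which is part of the complexity cost mentioned above.

    import pymysql

    def add_favorite(owner_shard: str, faver_shard: str,
                     photo_id: int, faver_id: int) -> None:
        a = pymysql.connect(host=owner_shard, user="app", database="flickr")
        b = pymysql.connect(host=faver_shard, user="app", database="flickr")
        try:
            a.begin()  # transaction on the photo owner's shard
            b.begin()  # transaction on the faver's shard
            with a.cursor() as ca, b.cursor() as cb:
                ca.execute("INSERT INTO FavedBy (photo_id, user_id) VALUES (%s, %s)",
                           (photo_id, faver_id))
                cb.execute("INSERT INTO Faves (user_id, photo_id) VALUES (%s, %s)",
                           (faver_id, photo_id))
            a.commit()
            b.commit()  # everything OK -> both shards agree
        except Exception:
            a.rollback()
            b.rollback()  # error -> neither shard changes
            raise
        finally:
            a.close()
            b.close()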
  • Case
    They get an arbitrarily scalable infrastructure
    Their code is only marginally more complex
  • Hai!
    I’m working!
  • Case
    They get an arbitrarily scalable infrastructure
    Their code is only marginally more complex
    They “only” have 20 engineers, so scalability also means:
    Roughly 2.5 million Flickr members per engineer.
    Roughly 200 million photos per engineer.
    28 user facing pages.
    23 administrative pages.
    20 API methods, though only 7.5 public API methods.
    80 API calls per second.
    250 CPUs.
    850 annual deploys.
    16 feature flags.
  • 6.1. Web Scale
    34330
    EEDC
    Execution
    Environments for
    Distributed
    Computing
    6.1.1. Anatomy of a service
    6.1.2. Too many Writes to Database
    6.1.3. Cheaper peaks
    6.1.4. Facebook Platform
    Master in Computer Architecture, Networks and Systems - CANS
  • Cheaper peaks
    If your capacity planning comes from the aggregate of all your customers and you plan to have thousands of them, what can you do?
    And your performance impacts your customers’ brand (so you’ll have problems)
    You are a start-up without loads of money
  • Case
    What does a recommendation engine look like?
  • Case
    They have to store data for every page view their customers get
    Do MAGIC over millions of rows to calculate related items for YOU
    Show recommendations to the user
    Only 2 snippets of Javascript/HTML
    Less than 0.5 seconds per view
  • Case
    Option A
    Every hit to tracker becomes an Insert to a MySQL sharded by customer
    Every hit to recommender recalculates the list of items to show based on collective intelligence
    Benefits
    Straightforward to code and manage
    Quick and easy for a proof of concept
    Disadvantages
    One customer on their peak could surpass the capacity of the MySQL instance
    The same customer on their valley could be wasting money on an idle instance
    Our webserver could be overloaded with the sum of all our customers
    The recommender is a CPU and memory hog, and we need too many servers to cope with our estimated demand
  • Case
    Option B
    Every hit to tracker becomes an Insert to a MySQL sharded by customer
    We have a cron job that recalculates in advance different sets of related items
    Every hit to recommender gets from the DB the corresponding set of items
    Benefits
    Straightforward to code
    The compute-intensive task is out of the critical path; it’s asynchronous
    Disadvantages
    One customer on their peak could surpass the capacity of the MySQL instance
    The same customer on their valley could be wasting money on an idle instance
    Our webserver could be overloaded with the sum of all our customers
    We have to watch what our cron jobs are doing, check for errors, and tune them so they don’t bring down the database
  • Case
    Option C
    Every hit to tracker is only a static image file with various parameters /a.gif?b=1&c=2&…
    We have a cron job that gets the log files from the webservers and database stored items and recalculates in advance different sets of related items
    Every hit to recommender gets from the DB (sharded by customer) the corresponding set of items
    Benefits
    Straightforward to code; we only had to move and parse files (a parsing sketch follows this slide)
    A surge in page views doesn’t bring down the database with writes
    The compute-intensive task is out of the critical path; it’s asynchronous
    Disadvantages
    One customer on their peak could surpass the capacity of the MySQL instance
    The same customer on their valley could be wasting money on an idle instance
    We have to watch what our cron jobs are doing, check for errors, and tune them so they don’t bring down the database
    We could hit bandwidth limits
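As a sketch of the parsing step referenced in this option: the "tracker" is just a request for a static /a.gif, so a batch job can recover every hit from the webservers' access logs. The log format and parameter names are assumptions.

    import re
    from urllib.parse import parse_qs, urlparse

    LINE = re.compile(r'"GET (?P<path>/a\.gif\?\S*) HTTP')

    def tracker_hits(log_lines):
        # Yield one dict of tracking parameters per /a.gif request found.
        for line in log_lines:
            m = LINE.search(line)
            if m:
                qs = parse_qs(urlparse(m.group("path")).query)
                yield {k: v[0] for k, v in qs.items()}

    sample = '127.0.0.1 - - [01/Jan/2010] "GET /a.gif?b=1&c=2 HTTP/1.1" 200 43'
    print(list(tracker_hits([sample])))  # -> [{'b': '1', 'c': '2'}]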
  • Case
    Option D
    Every hit to tracker is only a static image file with various parameters /a.gif?b=1&c=2&…
    We have a cron job that gets the log files from the webservers and database stored items and recalculates in advance different sets of related items
    Every hit to recommender gets from the DB the corresponding set of items
    We went the Hadoop/HBase way, no more sharding
    Benefits
    Easy to add and remove data servers on demand, so no waste or limits here
    A surge in page views only costs money; as we get paid per page view, that’s OK 
    The compute-intensive task is out of the critical path; it’s asynchronous
    Disadvantages
    Beta software, poor documentation/examples
    We have more complexity in our infrastructure
    We could hit bandwidth limits
  • Case: Map/Reduce
    Hadoop:
    It’s “only” a framework for running Map/Reduce applications on large clusters. It provides replication and fault tolerance, as HW failure will be the norm, on top of a distributed file system, HDFS
    Map/Reduce: in a map/reduce application there are two kinds of jobs, Map and Reduce.
    Mappers read the HDFS blocks, do local processing, and run in parallel. E.g., from a webserver log file: <url, #hits>
    Reducers get the output of many mappers and consolidate the data. If there was a mapper per day, a reducer could calculate how many monthly hits a URL gets
    Hbase:
    The Hadoop/MR design favors throughput over latency, so it’s used as an analytical platform, but HBase allows low-latency random access to very big tables (billions of rows by millions of columns)
    Column-oriented DB: Table -> Row -> ColumnFamily -> Timestamp => Value
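To make the <url, #hits> example concrete, here is a minimal Hadoop Streaming-style pair of jobs in Python: the mapper emits one count per URL found in a log record, the reducer sums them. The URL field position is an assumption about the log format; under real Hadoop these run as plain stdin/stdout filters, which is exactly what Streaming expects.

    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:             # one access-log record per line
            url = line.split()[6]      # assumed position of the URL field
            print(f"{url}\t1")

    def reducer(lines):
        # Streaming guarantees reducer input arrives sorted by key.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for url, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{url}\t{sum(int(n) for _, n in group)}")

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)

Run locally, the pipeline mimics one Hadoop pass:

    cat access.log | python job.py map | sort | python job.py reduce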
  • Case
    Option E
    Every hit to tracker is only a static image file with various parameters /a.gif?b=1&c=2&…
    We have a cron job that gets the log files from the webservers and database stored items and recalculates in advance different sets of related items
    Every hit to recommender gets from the DB the corresponding set of items
    We went the Hadoop/HBase way, no more sharding
    All static files served by a CDN
    Benefits
    Easy to add and remove data servers on demand, so no waste or limits here
    A surge in page views only costs money; as we get paid per page view, that’s OK 
    The compute-intensive task is out of the critical path; it’s asynchronous
    Unlimited bandwidth
    Disadvantages
    Beta software, poor documentation/examples
    We have more complexity in our infrastructure
  • Case: CDN
    What’s a Content Delivery Network?
    Your server or http repository (Amazon S3,..) is the Origin of the content
    They give you a DNS name (bb.cdn.net) and you have to create a CNAME to this name (www.example.com -> bb.cdn.net.)
    When a user asks for www.example.com, the CDN will choose which of its nodes is nearest to the user and return its IP addresses
    The user asks the CDN node for a content item (/a.gif); the node checks whether it has a fresh copy to send, and on a MISS it asks its upstream caches, all the way up to your Origin
    So we get unlimited bandwidth and better latency (we can’t surpass the speed of light)
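As a concrete (illustrative) example of the CNAME step above, the zone entry might look like this, with bb.cdn.net standing in for whatever name your CDN assigns:

    ; send www traffic to the CDN; trailing dots make the names absolute
    www.example.com.   IN   CNAME   bb.cdn.net.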
  • Case
    They get a completely scalable infrastructure at AWS
    They can provision a new Cruncher, Datastore or Recommender in a matter of minutes and remove it as soon as it’s no longer needed
    They don’t have any upper limit on how many requests they can serve
    All the requests that can impact the User Experience of their customers are served by a CDN
    As there are only 3 kinds of servers, all managed as images, they don’t need as many engineers to take care of the infrastructure
  • 6.1. Web Scale
    34330
    EEDC
    Execution
    Environments for
    Distributed
    Computing
    6.1.1. Anatomy of a service
    6.1.2. Too many Writes to Database
    6.1.3. Cheaper peaks
    6.1.4. Facebook Platform
    Master in Computer Architecture, Networks and Systems - CANS
  • Facebook Platform
    If your primary data source is not under your control and it’s too far away, what happens?
    An API case
  • Case
    Duplicated Gifts
  • Case
    Loving it. More «Pongos»
    Hitting the bullseye?
  • Case
    It’s a social wish list application
    When you access it, it checks whether your friends have enabled the application and shows their wish lists
    You can share your wish lists on Facebook
    You can capture wishes (gifts) and be shown a feed of possible merchants
    Initial loading time is critical
    Expect virality, so we can’t afford long response times
  • Case
    Flow
  • Case
    Nice, but slow.
    3 to 7 seconds to load
  • Case
    Define goals
    Define metrics
    Analyze metrics
    Improve one at a time
  • Case: Goals
    Time to load < 1 second
    Everything works
  • Case: Metrics
    Time to session setup
    Validating to Facebook
    Getting Friends Information
    Lookups to local Database (lists, items, captured items)
    Time to load «home» page
    Get HTML
    Get widgets
    Get Javascripts
    Get various graphic assets
  • Case: Analyzing Metrics
    Time to session setup
    Validating to Facebook (300 ms)
    Getting Friends Information (3 sec)
    Lookups to local Database (lists, items, captured items) (30 ms)
    Time to load «home» page
    Get HTML (400 ms)
    Get widgets (300 ms)
    Get Javascripts (300 ms)
    Get various graphic assets (500 ms)
  • Case: Facebook access
    [Before/after call-flow diagrams]
    From 3 seconds to 500 ms!
  • Case: Facebook access
    In ASP.NET we “only” have 12 threads/CPU -> only 12 concurrent requests. From 4 users/sec to 24/sec
    We could use asynchronous calls (sketched after this slide), but:
    Low parallelism: until GetAppUsers returns we can’t ask for GetUserInfo, so no speedup
    We could increase the default #threads to another number (.NET 4.0 defaults to 5000/CPU)
    We can get failure resiliency by adjusting timeouts and increasing threads, connections, and so on
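A hedged sketch (in Python asyncio rather than the ASP.NET in question) of the parallelism point above: GetUserInfo cannot start until GetAppUsers has returned, but once it has, the per-user calls can fan out concurrently. get_app_users/get_user_info are hypothetical stand-ins for the Facebook API calls.

    import asyncio

    async def get_app_users() -> list:
        await asyncio.sleep(0.3)   # simulated API round trip
        return [1, 2, 3, 4]

    async def get_user_info(uid: int) -> dict:
        await asyncio.sleep(0.3)   # simulated API round trip
        return {"id": uid}

    async def main():
        users = await get_app_users()             # sequential: no choice here
        infos = await asyncio.gather(             # but the fan-out is parallel
            *(get_user_info(u) for u in users))
        print(infos)

    asyncio.run(main())  # ~0.6 s total instead of 0.3 + 4 * 0.3 s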
  • Case: Leveraging “free” tools
    Set future Expires on static files
    Users leverage their browser’s cache and are lighter at server’s side
    Use a “free” CDN to get jQuery et al.
    Microsoft and Google provide public, free repositories of Javascript tools
    Use CSS sprites
    Although graphic files are small, each one needs a TCP connection to retrieve. Combine most graphic assets into one big file and use CSS to select which one to show:
    /* one big sprite image shared by all nav items */
    #nav li a {background-image:url('../img/image_nav.gif')}
    /* shift the background to reveal the right portion of the sprite */
    #nav li a.item1 {background-position:0px 0px}
    #nav li a:hover.item1 {background-position:0px -72px}
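For the "future Expires" point above, a typical (illustrative) Apache mod_expires stanza looks like the following; other webservers have equivalents:

    # far-future Expires headers on static assets (mod_expires)
    ExpiresActive On
    ExpiresByType image/gif "access plus 1 year"
    ExpiresByType text/css "access plus 1 year"
    ExpiresByType application/javascript "access plus 1 year"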
  • Case: more on Sprites
    Avg size 2 KB/file
    HTTP/1.1 (RFC 2616) suggests that browsers download no more than 2 components in parallel per hostname
    Small files don’t use all the available bandwidth (TCP Slow Start…)
    Latency also plays an important role
  • About this session
    Sergi Morales, Founder & CTO of Expertos en TI
    Phone: +34 6688-XPNTI
    Email: sergi.morales+eedc@expertosenti.com
    Blog: http://blog.expertosenti.com
    Web: http://www.expertosenti.com
    Expertos en TI: We help Internet-oriented projects leverage all the research done by the big sites (Flickr, Facebook, Twitter, Salesforce, Google, and so on) so they can improve their bottom line and be prepared for growth
  • About the EEDC course
    34330 Execution Environments for Distributed Computing (EEDC), Master in Computer Architecture, Networks and Systems (CANS), Computer Architecture Department (AC)
    Universitat Politècnica de Catalunya – Barcelona Tech (UPC) ECTS credits: 6
    INSTRUCTOR
    Professor Jordi Torres
    Phone: +34 93 401 7223
    Email: torres@ac.upc.edu
    Office: Campus Nord, Modul C6, Room 217.
    Web: http://www.JordiTorres.org
  • 34330
    EEDC
    Execution
    Environments for
    Distributed
    Computing
    Sergi Morales
    Founder & CTO
    T: 668897684
    E: sergi.morales@expertosenti.com
    L: www.linkedin.com/in/sergimorales
    Master in Computer Architecture, Networks and Systems - CANS
  • Case
    Asynchronous access to the Facebook API server
    Expect to fail
    Tables with so many rows that a key/value approach is needed
    Consistent hashing to load-balance the data (see the sketch below)
    Sticky servers?
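Since the last slide names consistent hashing without unpacking it, here is a minimal sketch of the technique: keys map to the first server clockwise on a hash ring, so adding or removing a data server only remaps roughly 1/N of the keys instead of reshuffling everything (the property that makes adding and removing data servers on demand workable).

    import bisect
    import hashlib

    def h(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, servers, replicas=100):
            # "replicas" are virtual nodes that smooth out the distribution.
            self.ring = sorted((h(f"{s}#{i}"), s)
                               for s in servers for i in range(replicas))
            self.points = [p for p, _ in self.ring]

        def server_for(self, key: str) -> str:
            # First point clockwise from the key's hash (wrapping around).
            idx = bisect.bisect(self.points, h(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = Ring(["data1", "data2", "data3"])
    print(ring.server_for("user:42"))  # same key -> same server, every time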