Handling Data in Mega Scale Web Systems
Presentation Transcript

  • Vineet Gupta | GM – Software Engineering | Directi http://www.vineetgupta.com Licensed under Creative Commons Attribution Sharealike Noncommercial Intelligent People. Uncommon Ideas.
      • 22M+ users
      • Dozens of DB servers
      • Dozens of Web servers
      • Six specialized graph database servers to run recommendations engine
    Source: http://highscalability.com/digg-architecture
      • 1 TB / Day
      • 100 M blogs indexed / day
      • 10 B objects indexed / day
      • 0.5 B photos and videos
      • Data doubles in 6 months
      • Users double in 6 months
    Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
      • 2 PB Raw Storage
      • 470 M photos, 4-5 sizes each
      • 400 k photos added / day
      • 35 M photos in Squid cache (total)
      • 2 M photos in Squid RAM
      • 38k reqs / sec to Memcached
      • 4 B queries / day
    Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
      • Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
      • 2 PB of data
      • 26 B queries / day
      • 1 B page views / day
      • 3 B API calls / month
      • 15,000 App servers
    Source: http://highscalability.com/ebay-architecture/
      • 450,000 low cost commodity servers in 2006
      • Indexed 8 B web-pages in 2005
      • 200 GFS clusters (1 cluster = 1,000 – 5,000 machines)
      • Read / write throughput = 40 GB / sec across a cluster
      • Map-Reduce
        • 100k jobs / day
        • 20 PB of data processed / day
        • 10k MapReduce programs
    Source: http://highscalability.com/google-architecture/
      • Data Size ~ PB
      • Data Growth ~ TB / day
      • No of servers – 10s to 10,000
      • No of datacenters – 1 to 10
      • Queries – B+ / day
  • [Diagram: a single host running the app server and DB server, with several CPUs and RAM modules]
  • Sunfire X4640 M2: 8 x 6-core 2.6 GHz, $27k to $170k
  • PowerEdge R200: dual core 2.8 GHz, around $550
      • Increasing the hardware resources on a host
      • Pros
        • Simple to implement
        • Fast turnaround time
      • Cons
        • Finite limit
        • Hardware does not scale linearly (diminishing returns for each incremental unit)
        • Requires downtime
        • Increases Downtime Impact
        • Incremental costs increase exponentially
  • [Diagram: app layer in front of one node holding tables T1, T2, T3, T4]
  • [Diagram: app layer in front of several nodes, each holding its own copy of T1, T2, T3, T4]
      • Each node has its own copy of data
      • Shared Nothing Cluster
      • Read : Write = 4:1
        • Scale reads at cost of writes!
      • Duplicate Data – each node has its own copy
      • Master Slave
        • Writes sent to one node, cascaded to others
      • Multi-Master
        • Writes can be sent to multiple nodes
        • Can lead to deadlocks
        • Requires conflict management
  • [Diagram: app layer writing to one master, which feeds four slaves]
      • n x Writes – Async vs. Sync
      • SPOF
      • Async - Critical Reads from Master!
  • [Diagram: app layer in front of two masters and three slaves]
      • n x Writes – Async vs. Sync
      • No SPOF
      • Conflicts! O(N^2) or O(N^3) resolution
  • [Diagram: seven servers, each labelled with its write and read load]
    • Per Server:
      • 4R, 1W
      • 2R, 1W
      • 1R, 1W
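The per-server figures above follow from simple arithmetic: reads are divided among the replicas, but every replica still has to apply every write. A minimal sketch, assuming the 4:1 read/write mix from the earlier slide (the function name is illustrative, not from the talk):

```python
def per_server_load(reads, writes, replicas):
    """Return (reads, writes) handled by each replica.

    Reads are spread evenly across the replicas. Every write must be
    applied on every replica, so writes per server do not shrink.
    """
    return reads / replicas, writes


for n in (1, 2, 4):
    r, w = per_server_load(reads=4, writes=1, replicas=n)
    print(f"{n} replica(s): {r:g}R, {w}W per server")
```

Adding replicas keeps cutting the read share, but the write share per server never drops, which is why replication alone cannot scale writes.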
      • Vertical Partitioning
        • Divide data on tables / columns
        • Scale to as many boxes as there are tables or columns
        • Finite
      • Horizontal Partitioning
        • Divide data on rows
        • Scale to as many boxes as there are rows!
        • Limitless scaling
  • [Diagram: app layer in front of one node holding tables T1 to T5]
  • [Diagram: app layer in front of five nodes, one per table T1 to T5]
      • Facebook - User table, posts table can be on separate nodes
      • Joins need to be done in code (Why have them?)
  • [Diagram: app layer in front of three groups of nodes T1 to T5, holding the first, second and third million rows]
      • Value Based
        • Split on timestamp of posts
        • Split on first letter of user name
      • Hash Based
        • Use a hash function to determine cluster
      • Lookup Map
        • First Come First Serve
        • Round Robin
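The three ways of choosing a partition listed above can be sketched as small routing functions. This is only an illustration: `NUM_SHARDS`, the function names and the `LookupMap` class are hypothetical, and a real system would also have to handle re-sharding.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of clusters


def value_based_shard(user_name):
    """Value based: split on the first letter of the user name."""
    first = user_name[0].lower()
    if not ("a" <= first <= "z"):
        return 0  # anything that is not a-z goes to the first shard
    # Spread a-z evenly over the shards.
    return (ord(first) - ord("a")) * NUM_SHARDS // 26


def hash_based_shard(key):
    """Hash based: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class LookupMap:
    """Lookup map: remember where each key was placed (round robin)."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.placement = {}   # key -> shard chosen when first seen
        self.next_shard = 0

    def shard_for(self, key):
        if key not in self.placement:
            self.placement[key] = self.next_shard
            self.next_shard = (self.next_shard + 1) % self.num_shards
        return self.placement[key]
```

Value and hash based routing need no extra state, while the lookup map trades an extra lookup for the freedom to place each key anywhere.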
  • Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=
      • In distributed systems, much weaker forms of consistency are often acceptable, e.g.,
        • Only a few (or even one) possible writers of data, and/or
        • Read-mostly data (seldom modified), and/or
        • Stale data may be acceptable
      • Eventual consistency
        • If no updates take place for a long time, all replicas will eventually become consistent
      • Implementation
        • Need only ensure updates eventually reach all of the replicated copies of the data
      • Monotonic Reads
        • If a node sees a version x at time t, it will never see an older version at a later time
      • Monotonic Writes
        • A write operation by a process on a data item x is completed before any successive write operation on x by the same process
      • Read your writes
        • The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process
      • Writes follow Reads
        • Write occurs on a copy of x that is at least as recent as the last copy read by the process
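One way to picture monotonic reads and read-your-writes is a client session that remembers the newest version it has seen and refuses to read from a replica that is behind it. The sketch below is an assumption-laden toy: `Replica` and `Session` are hypothetical names, versions are plain counters, and a single writer is assumed.

```python
class Replica:
    """A replica holding one value with a version counter."""

    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Updates may arrive late; never move backwards.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Client session enforcing monotonic reads and read-your-writes."""

    def __init__(self):
        self.min_version = 0  # newest version this client has read or written

    def write(self, replica, value):
        # Single-writer assumption: the next version is replica.version + 1.
        version = replica.version + 1
        replica.apply(version, value)
        self.min_version = max(self.min_version, version)
        return version

    def read(self, replicas):
        # Use the first replica that is at least as new as what we have seen.
        for replica in replicas:
            if replica.version >= self.min_version:
                self.min_version = replica.version
                return replica.value
        raise RuntimeError("no replica is fresh enough for this session")
```

A session that wrote to one replica skips a stale replica on its next read, so it never sees data older than its own write.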
      • Many Kinds of Computing are “Append-Only”
        • Lots of observations are made about the world
          • Debits, credits, Purchase-Orders, Customer-Change-Requests, etc
        • As time moves on, more observations are added
          • You can’t change the history but you can add new observations
      • Derived Results May Be Calculated
        • Estimate of the “current” inventory
        • Frequently inaccurate
      • Historic Rollups Are Calculated
        • Monthly bank statements
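The append-only idea above can be illustrated with a tiny ledger: observations are only ever added, and both the "current" figure and the historic rollup are derived from them. The account names, months and amounts below are made up for the example.

```python
from collections import defaultdict

ledger = []  # append-only list of (account, month, amount) observations


def record(account, month, amount):
    """Add a new observation; existing entries are never changed."""
    ledger.append((account, month, amount))


def current_balance(account):
    """Derived result: recomputed from every observation so far."""
    return sum(amount for acct, _, amount in ledger if acct == account)


def monthly_statement(account):
    """Historic rollup: net movement per month."""
    totals = defaultdict(int)
    for acct, month, amount in ledger:
        if acct == account:
            totals[month] += amount
    return dict(totals)


record("acct-1", "2009-01", 100)   # credit
record("acct-1", "2009-01", -30)   # debit
record("acct-1", "2009-02", -30)   # debit entered by mistake
record("acct-1", "2009-02", 30)    # correction: a new entry, not an UPDATE
```

The mistaken debit stays in the history; it is cancelled by a later observation rather than being edited or deleted.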
      • 5 joins for 1 query!
        • Do you think FB would do this?
        • And how would you do joins with partitioned data?
      • De-normalization removes joins
      • But increases data volume
        • However disk is cheap and getting cheaper
      • And can lead to inconsistent data
        • But only if we do UPDATEs and DELETEs
      • Normalization’s Goal Is Eliminating Update Anomalies
        • Can Be Changed Without “Funny Behavior”
        • Each Data Item Lives in One Place
    Emp #  Emp Name  Mgr #  Mgr Name  Emp Phone  Mgr Phone
    47     Joe       13     Sam       5-1234     6-9876
    18     Sally     38     Harry     3-3123     5-6782
    91     Pete      13     Sam       2-1112     6-9876
    66     Mary      02     Betty     5-7349     4-0101
      • Classic problem with de-normalization
        • Can’t update Sam’s phone # since there are many copies
      • De-normalization is OK if you aren’t going to update!
    Source: http://blogs.msdn.com/pathelland/
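The update problem in the employee table above can be reproduced in a few lines. The rows are the ones from the slide; the new phone number and the function name are hypothetical, chosen only for illustration.

```python
# De-normalized rows: the manager's name and phone are copied into each row.
rows = [
    {"emp": 47, "name": "Joe", "mgr": 13, "mgr_name": "Sam",
     "emp_phone": "5-1234", "mgr_phone": "6-9876"},
    {"emp": 18, "name": "Sally", "mgr": 38, "mgr_name": "Harry",
     "emp_phone": "3-3123", "mgr_phone": "5-6782"},
    {"emp": 91, "name": "Pete", "mgr": 13, "mgr_name": "Sam",
     "emp_phone": "2-1112", "mgr_phone": "6-9876"},
    {"emp": 66, "name": "Mary", "mgr": 2, "mgr_name": "Betty",
     "emp_phone": "5-7349", "mgr_phone": "4-0101"},
]


def update_mgr_phone(rows, mgr, phone):
    """Change a manager's phone everywhere it is copied; return rows touched."""
    touched = 0
    for row in rows:
        if row["mgr"] == mgr:
            row["mgr_phone"] = phone
            touched += 1
    return touched


# Hypothetical new number for Sam (manager 13).
touched = update_mgr_phone(rows, mgr=13, phone="6-0000")
```

One logical change has to find and rewrite every copy, and missing a single copy leaves the data inconsistent. If rows are never updated, the problem does not arise.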
      • Partitioning for scaling
        • Replication for availability
      • No ACID transactions
      • No JOINs
      • Immutable data
        • No cascaded UPDATEs and DELETEs
      • Partitioning – for R/W scaling
      • Replication – for availability
      • Versioning – for immutable data
      • Eventual Consistency
      • Error detection and handling
      • Google – BigTable
      • Amazon – Dynamo
      • Facebook – Cassandra (BigTable + Dynamo)
      • LinkedIn – Voldemort (similar to Dynamo)
      • Many more
      • Tens of millions of customers served at peak times
      • Tens of thousands of servers
      • Both customers and servers distributed world wide
    • http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      • Eventually consistent data store
      • Always writable
      • Decentralized
      • All nodes have the same responsibilities
      • Similar to Chord
        • Each node gets an ID from the space of keys
        • Nodes are arranged in a ring
        • Data stored on the first node clockwise of the current placement of the data key
      • Replication
        • Preference lists of N nodes following the associated node
      • A problem with the Chord scheme
        • Nodes placed randomly on ring
        • Leads to uneven data & load distribution
      • In Dynamo
        • “Virtual” nodes
        • Each physical node has multiple virtual nodes
          • More powerful machines have more virtual nodes
        • Distribute virtual nodes across the ring
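A ring with virtual nodes, as described above, might look like the following sketch. It is not Dynamo's implementation: the `Ring` class, the `name#i` naming of virtual nodes and the use of SHA-256 for ring positions are assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self):
        self.positions = []  # sorted positions of all virtual nodes
        self.owner = {}      # position -> physical node name

    def add_node(self, name, virtual_nodes):
        """Place a physical node on the ring at several virtual positions."""
        for i in range(virtual_nodes):
            pos = ring_position(f"{name}#{i}")
            bisect.insort(self.positions, pos)
            self.owner[pos] = name

    def preference_list(self, key, n):
        """First n distinct physical nodes clockwise from the key."""
        start = bisect.bisect_left(self.positions, ring_position(key))
        nodes = []
        for step in range(len(self.positions)):
            pos = self.positions[(start + step) % len(self.positions)]
            if self.owner[pos] not in nodes:
                nodes.append(self.owner[pos])
            if len(nodes) == n:
                break
        return nodes


ring = Ring()
ring.add_node("A", virtual_nodes=4)
ring.add_node("B", virtual_nodes=4)
ring.add_node("C", virtual_nodes=8)  # a more powerful machine
```

`ring.preference_list("cart:42", 3)` returns the three distinct physical nodes that would hold the key's replicas; skipping repeated owners keeps two replicas from landing on the same machine.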
      • Updates generate a new timestamp
        • Vector clocks are used
      • Eventual consistency
        • Multiple versions of the same object might co-exist
      • Syntactic Reconciliation
        • System might be able to resolve conflicts automatically
      • Semantic Reconciliation
        • Conflict resolution pushed to application
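The difference between the two kinds of reconciliation can be shown with a bare-bones vector clock, represented here as a dict from node name to counter. The function names and return strings are illustrative assumptions.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen everything that clock b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(a, b):
    """Syntactic reconciliation when one version supersedes the other."""
    if descends(a, b):
        return "keep a"
    if descends(b, a):
        return "keep b"
    return "conflict"  # concurrent versions: the application must merge


v1 = increment({}, "X")   # written via node X
v2 = increment(v1, "Y")   # updated via node Y
v3 = increment(v1, "Z")   # updated concurrently via node Z
```

Here `v2` descends from `v1`, so the system can drop `v1` on its own, while `v2` and `v3` are concurrent and must be handed to the application for semantic reconciliation.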
      • Request arrives at a node (coordinator)
        • Ideally the node responsible for the particular key
        • Else forwards request to the node responsible for that key and that node will become the coordinator
      • The first N healthy and distinct nodes following the key position are considered for the request
      • Application defines
        • N = number of nodes that store a replica of each key
        • R = number of nodes required for successful Read
        • W = number of nodes required for successful write
      • R + W > N gives quorum
      • Writes
        • Requires generation of a new vector clock by coordinator
        • Coordinator writes locally
        • Forwards to N nodes, if W-1 respond then the write was successful
      • Reads
        • Forwards to N nodes, if R-1 respond then forwards to user
        • Only unique responses forwarded
        • User handles merging if multiple versions exist
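Why R + W > N gives a quorum can be checked by brute force: if every possible read set shares at least one node with every possible write set, a read always reaches a node that saw the latest successful write. A small sketch (the function name is illustrative):

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """Check every read set of size r against every write set of size w."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)  # empty intersection is falsy
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1), (3, 1, 2)]:
    print((n, r, w), "R + W > N:", r + w > n,
          "always overlap:", quorums_always_overlap(n, r, w))
```

For each configuration the two columns agree: overlap is guaranteed exactly when R + W > N.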
      • Sloppy Quorum
        • Read write ops performed on first N healthy nodes
        • Increases availability
      • Hinted Handoff
        • If node in preference list is not available, send replica to a node further down in the list
        • With a hint containing the identity of the original node
        • The receiving node keeps checking for the original
        • If the original becomes available, transfers replica to it
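Hinted handoff, as outlined above, can be sketched with in-memory nodes. This is a toy model under stated assumptions: the `Node` class, the `write` and `deliver_hints` functions and the separate `fallbacks` list are hypothetical, and the fallback nodes are assumed to be reachable.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # key -> value for replicas this node owns
        self.hints = []   # (intended owner, key, value) held for others


def write(preference_list, fallbacks, key, value):
    """Write to each preferred node, handing off to a fallback if it is down."""
    fallbacks = list(fallbacks)
    for node in preference_list:
        if node.up:
            node.store[key] = value
        elif fallbacks:
            stand_in = fallbacks.pop(0)
            # The hint records which node the replica really belongs to.
            stand_in.hints.append((node, key, value))


def deliver_hints(node):
    """Periodic check: pass hinted replicas back to owners that have recovered."""
    remaining = []
    for owner, key, value in node.hints:
        if owner.up:
            owner.store[key] = value
        else:
            remaining.append((owner, key, value))
    node.hints = remaining
```

The write still reaches N nodes while one preferred node is down, and the stand-in drops its copy of the hint once the original owner has received the replica.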
      • Replica Synchronization
        • Synchronize with another node
        • Each node maintains a separate Merkle tree for each key range it hosts
        • Nodes exchange roots of trees for common key-ranges
        • Quickly determine divergent keys by comparing hashes
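The comparison step above can be sketched with a small binary hash tree. The assumptions are significant: both replicas build their tree over the same non-empty, sorted list of keys, values are compared via their string form, and the names `build_tree` and `divergent_keys` are made up for the example.

```python
import hashlib


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_tree(keys, store):
    """Build a hash tree over `keys`, a non-empty sorted list shared by replicas."""
    if len(keys) == 1:
        # Leaf: hash of one key and the value this replica holds for it.
        return {"hash": digest(f"{keys[0]}={store.get(keys[0])}"), "keys": keys}
    mid = len(keys) // 2
    left = build_tree(keys[:mid], store)
    right = build_tree(keys[mid:], store)
    # Parent: hash of the two child hashes.
    return {"hash": digest(left["hash"] + right["hash"]),
            "keys": keys, "left": left, "right": right}


def divergent_keys(a, b):
    """Walk two trees from the root, descending only where hashes differ."""
    if a["hash"] == b["hash"]:
        return []  # identical subtree: nothing below needs comparing
    if "left" not in a:
        return list(a["keys"])
    return (divergent_keys(a["left"], b["left"])
            + divergent_keys(a["right"], b["right"]))


keys = ["k1", "k2", "k3", "k4"]
replica_a = {"k1": 1, "k2": 2, "k3": 3, "k4": 4}
replica_b = {"k1": 1, "k2": 2, "k3": 99, "k4": 4}
tree_a = build_tree(keys, replica_a)
tree_b = build_tree(keys, replica_b)
```

When the roots match, the replicas are in sync after a single comparison; when they differ, only the branches with differing hashes are followed down to the divergent keys.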
      • Ring Membership
        • Membership is explicit to avoid re-balancing of partition assignment
        • Use background gossip to build 1-hop DHT
        • Use external entity to bootstrap the system to avoid partitioned rings
      • Failure Detection
        • Node A finds node B unreachable (for servicing a request)
        • A uses other nodes to service requests and periodically checks B
        • A does not assume B to have failed
        • No globally consistent view of failure (because of explicit ring membership)
      • Application Configurable (N, R, W)
      • Every node is aware of the data hosted by its peers
        • requiring the gossiping of the full routing table with other nodes
        • scalability is limited by this to a few hundred nodes
        • hierarchy may help to overcome the limitation
      • Typical configuration for the Dynamo (N, R, W) is (3, 2, 2)
      • Some implementations vary (N, R, W)
        • An always-writable store might use W=1 (shopping cart)
        • Product catalog might have R=1 and W=N
      • Response requirement is 300ms for any request (read or write)
      • Consistency vs. Availability
        • 99.94% of requests saw one version
        • 0.00057% saw two
        • 0.00047% saw three
        • 0.00009% saw four
      • Server-driven or Client-driven coordination
        • Server-driven
          • uses load balancers
          • forwards requests to desired set of nodes
        • Client-driven 50% faster
          • requires the polling of Dynamo membership updates
          • the client is responsible for determining the appropriate nodes to send the request to
      • Successful responses (without time-out) 99.9995%
      • Enormous data (and high growth)
        • Traditional solutions don’t work
      • Distributed databases
        • Lots of interesting work happening
      • Great time for young programmers!
        • Problem solving ability