Handling Data in Mega Scale Web Systems
Transcript

  • 1. Vineet Gupta | GM – Software Engineering | Directi | http://www.vineetgupta.com | Licensed under Creative Commons Attribution Sharealike Noncommercial | Intelligent People. Uncommon Ideas.
  • 2.
      • 22M+ users
      • Dozens of DB servers
      • Dozens of Web servers
      • Six specialized graph database servers to run the recommendations engine
    Source: http://highscalability.com/digg-architecture
  • 3.
      • 1 TB / Day
      • 100 M blogs indexed / day
      • 10 B objects indexed / day
      • 0.5 B photos and videos
      • Data doubles in 6 months
      • Users double in 6 months
    Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
  • 4.
      • 2 PB Raw Storage
      • 470 M photos, 4-5 sizes each
      • 400 k photos added / day
      • 35 M photos in Squid cache (total)
      • 2 M photos in Squid RAM
      • 38k reqs / sec to Memcached
      • 4 B queries / day
    Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
  • 5.
      • Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
      • 2 PB of data
      • 26 B queries / day
      • 1 B page views / day
      • 3 B API calls / month
      • 15,000 App servers
    Source: http://highscalability.com/ebay-architecture/
  • 6.
      • 450,000 low cost commodity servers in 2006
      • Indexed 8 B web-pages in 2005
      • 200 GFS clusters (1 cluster = 1,000 – 5,000 machines)
      • Read / write throughput = 40 GB / sec across a cluster
      • Map-Reduce
        • 100k jobs / day
        • 20 PB of data processed / day
        • 10k MapReduce programs
    Source: http://highscalability.com/google-architecture/
  • 7.
      • Data Size ~ PB
      • Data Growth ~ TB / day
      • No. of servers – 10s to 10,000
      • No. of datacenters – 1 to 10
      • Queries – B+ / day
  • 8. [Diagram: a single host running the App Server and DB Server, scaled up by adding CPUs and RAM]
  • 9. Sunfire X4640 M2: 8 x 6-core 2.6 GHz, $27k to $170k. PowerEdge R200: dual-core 2.8 GHz, around $550.
  • 10.
      • Increasing the hardware resources on a host
      • Pros
        • Simple to implement
        • Fast turnaround time
      • Cons
        • Finite limit
        • Hardware does not scale linearly (diminishing returns for each incremental unit)
        • Requires downtime
        • Increases Downtime Impact
        • Incremental costs increase exponentially
  • 11. [Diagram: App Layer over a single node holding tables T1, T2, T3, T4]
  • 12. [Diagram: App Layer over several nodes, each holding its own full copy of T1, T2, T3, T4]
      • Each node has its own copy of data
      • Shared Nothing Cluster
  • 13.
      • Read : Write = 4:1
        • Scale reads at cost of writes!
      • Duplicate Data – each node has its own copy
      • Master Slave
        • Writes sent to one node, cascaded to others
      • Multi-Master
        • Writes can be sent to multiple nodes
        • Can lead to deadlocks
        • Requires conflict management
  • 14. [Diagram: App Layer writing to a single Master, which replicates to four Slaves]
      • n x Writes – Async vs. Sync
      • SPOF
      • Async - Critical Reads from Master!
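The master-slave layout above maps directly onto a read/write split in the application layer. Below is a minimal sketch assuming hypothetical `master` and `replicas` connection objects exposing an `execute()` method (not any particular driver): writes always hit the master, ordinary reads are spread across slaves, and critical reads go back to the master because asynchronous replication may lag.

```python
import random

class ReplicatedStore:
    """Sketch of read/write splitting for a master-slave cluster.

    `master` and `replicas` are assumed stand-ins for DB connections
    exposing execute(sql, params); swap in a real driver as needed.
    """

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def write(self, sql, params=()):
        # All writes go to the single master, which cascades them to slaves.
        return self.master.execute(sql, params)

    def read(self, sql, params=(), critical=False):
        # Critical reads (read-your-writes) must hit the master, since an
        # async slave may not have received the latest write yet.
        if critical or not self.replicas:
            return self.master.execute(sql, params)
        return random.choice(self.replicas).execute(sql, params)
```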
  • 15. [Diagram: App Layer writing to multiple Masters, with Slaves replicating from them]
      • n x Writes – Async vs. Sync
      • No SPOF
      • Conflicts! O(N²) or O(N³) resolution
  • 16. [Diagram: the same 4R + 1W workload spread over 1, 2 and 4 replicated nodes – every node still takes every write]
    • Per server (with 1, 2 and 4 replicas respectively):
      • 4R, 1W
      • 2R, 1W
      • 1R, 1W
  • 17.
      • Vertical Partitioning
        • Divide data on tables / columns
        • Scale to as many boxes as there are tables or columns
        • Finite
      • Horizontal Partitioning
        • Divide data on rows
        • Scale to as many boxes as there are rows!
        • Limitless scaling
  • 18. [Diagram: App Layer over a single node holding tables T1, T2, T3, T4, T5]
  • 19. [Diagram: vertical partitioning – App Layer over five nodes, each holding one of T1–T5]
      • Facebook - User table, posts table can be on separate nodes
      • Joins need to be done in code (Why have them?)
  • 20. [Diagram: horizontal partitioning – each node holds all of T1–T5, but only one slice of rows (first, second, third million rows)]
  • 21.
      • Value Based
        • Split on timestamp of posts
        • Split on first alphabet of user name
      • Hash Based
        • Use a hash function to determine cluster
      • Lookup Map
        • First Come First Serve
        • Round Robin
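As a concrete illustration of the hash-based scheme above (a sketch, not production sharding code): the key is hashed and mapped onto a fixed list of shards. Note that naive modulo hashing remaps most keys whenever a shard is added, which is the problem consistent hashing (slides 36-37) addresses.

```python
import hashlib

def shard_for(key: str, shards: list):
    """Pick a shard for `key` using a stable hash (naive modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Example: rows keyed by username always land on the same shard.
shards = ["db0", "db1", "db2", "db3"]   # placeholders for real shard connections
print(shard_for("alice", shards))       # deterministic for a given key and shard count
```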
  • 22. Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
  • 23.
      • In distributed systems, much weaker forms of consistency are often acceptable, e.g.,
        • Only a few (or even one) possible writers of data, and/or
        • Read-mostly data (seldom modified), and/or
        • Stale data may be acceptable
      • Eventual consistency
        • If no updates take place for a long time, all replicas will eventually become consistent
      • Implementation
        • Need only ensure updates eventually reach all of the replicated copies of the data
  • 24.
      • Monotonic Reads
        • If a process reads version x of a data item, any later read by that process returns x or a newer version, never an older one
      • Monotonic Writes
        • A write operation by a process on a data item x is completed before any successive write operation on x by the same process
      • Read your writes
        • The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process
      • Writes follow Reads
        • Write occurs on a copy of x that is at least as recent as the last copy read by the process
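To make one of these guarantees concrete, here is a toy model of "read your writes": the client session remembers the version of its last write and refuses to read from a replica that has not caught up. This is an illustrative sketch of the guarantee, not how any particular system implements it.

```python
class Replica:
    """Toy replica: a value plus a monotonically increasing version number."""
    def __init__(self):
        self.version = 0
        self.value = None

class Session:
    """Client session enforcing read-your-writes against toy replicas."""
    def __init__(self):
        self.last_written_version = 0

    def write(self, replica: Replica, value):
        replica.version += 1
        replica.value = value
        self.last_written_version = replica.version

    def read(self, replica: Replica):
        # A replica that lags behind our own writes cannot serve this session.
        if replica.version < self.last_written_version:
            raise RuntimeError("replica is stale for this session; pick another")
        return replica.value
```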
  • 25.
      • Many Kinds of Computing are “Append-Only”
        • Lots of observations are made about the world
          • Debits, credits, Purchase-Orders, Customer-Change-Requests, etc
        • As time moves on, more observations are added
          • You can’t change the history but you can add new observations
      • Derived Results May Be Calculated
        • Estimate of the “current” inventory
        • Frequently inaccurate
      • Historic Rollups Are Calculated
        • Monthly bank statements
  • 26.  
  • 27.
      • 5 joins for 1 query!
        • Do you think FB would do this?
        • And how would you do joins with partitioned data?
      • De-normalization removes joins
      • But increases data volume
        • However disk is cheap and getting cheaper
      • And can lead to inconsistent data
        • But only if we do UPDATEs and DELETEs
  • 28.
      • Normalization’s Goal Is Eliminating Update Anomalies
        • Can Be Changed Without “Funny Behavior”
        • Each Data Item Lives in One Place
    Emp #   Emp Name   Mgr #   Mgr Name   Emp Phone   Mgr Phone
    47      Joe        13      Sam        5-1234      6-9876
    18      Sally      38      Harry      3-3123      5-6782
    91      Pete       13      Sam        2-1112      6-9876
    66      Mary       02      Betty      5-7349      4-0101
    Classic problem with de-normalization: you can't update Sam's phone # since there are many copies. De-normalization is OK if you aren't going to update! Source: http://blogs.msdn.com/pathelland/
  • 29.
      • Partitioning for scaling
        • Replication for availability
      • No ACID transactions
      • No JOINs
      • Immutable data
        • No cascaded UPDATEs and DELETEs
  • 30.  
  • 31.
      • Partitioning – for R/W scaling
      • Replication – for availability
      • Versioning – for immutable data
      • Eventual Consistency
      • Error detection and handling
  • 32.
      • Google – BigTable
      • Amazon – Dynamo
      • Facebook – Cassandra (BigTable + Dynamo)
      • LinkedIn – Voldemort (similar to Dynamo)
      • Many more
  • 33.
      • Tens of millions of customers served at peak times
      • Tens of thousands of servers
      • Both customers and servers distributed world wide
  • 34.
    • http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      • Eventually consistent data store
      • Always writable
      • Decentralized
      • All nodes have the same responsibilities
  • 35.  
  • 36.
      • Similar to Chord
        • Each node gets an ID from the space of keys
        • Nodes are arranged in a ring
        • Data stored on the first node clockwise of the current placement of the data key
      • Replication
        • Preference lists of N nodes following the associated node
  • 37.
      • A problem with the Chord scheme
        • Nodes placed randomly on ring
        • Leads to uneven data & load distribution
      • In Dynamo
        • “Virtual” nodes
        • Each physical node has multiple virtual nodes
          • More powerful machines have more virtual nodes
        • Distribute virtual nodes across the ring
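A compact sketch of a consistent-hash ring with virtual nodes in the spirit of the scheme above: each physical node is hashed onto the ring at several points, and a key is owned by the first virtual node clockwise from its hash. Node names and the number of virtual nodes are arbitrary assumptions for illustration.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._points = []        # sorted ring positions
        self._owners = {}        # ring position -> physical node
        for node in nodes:
            # More powerful machines would simply get more virtual nodes.
            for i in range(vnodes_per_node):
                point = _hash(f"{node}#{i}")
                self._points.append(point)
                self._owners[point] = node
        self._points.sort()

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's position (wrapping around).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # the physical node owning this key
```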
  • 38.
      • Updates generate a new timestamp
        • Vector clocks are used
      • Eventual consistency
        • Multiple versions of the same object might co-exist
      • Syntactic Reconciliation
        • System might be able to resolve conflicts automatically
      • Semantic Reconciliation
        • Conflict resolution pushed to application
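A small vector-clock sketch to make the versioning concrete: each write increments the coordinating node's counter, and two versions conflict when neither clock dominates the other, which is exactly when syntactic reconciliation fails and the conflict is pushed to the application. This is illustrative and not Dynamo's exact encoding.

```python
def increment(clock: dict, node: str) -> dict:
    """New clock with `node`'s counter bumped (a write coordinated by that node)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a: dict, b: dict) -> bool:
    """True if version `a` has seen everything `b` has (a dominates or equals b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def in_conflict(a: dict, b: dict) -> bool:
    """Neither descends from the other: concurrent writes, application must reconcile."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "node-a")        # {"node-a": 1}
v2 = increment(v1, "node-a")        # later write via node-a: supersedes v1
v3 = increment(v1, "node-b")        # concurrent write via node-b
print(descends(v2, v1))             # True  -> v1 can be discarded
print(in_conflict(v2, v3))          # True  -> semantic reconciliation needed
```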
  • 39.  
  • 40.
      • Request arrives at a node (coordinator)
        • Ideally the node responsible for the particular key
        • Else forwards request to the node responsible for that key and that node will become the coordinator
      • The first N healthy and distinct nodes following the key position are considered for the request
      • Application defines
        • N = number of replica nodes participating in each read/write
        • R = number of nodes required for successful Read
        • W = number of nodes required for successful write
      • R + W > N gives quorum
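The quorum condition is plain arithmetic: if R + W > N, every read set and every write set overlap in at least one replica, so a read always sees at least one copy of the latest write. A tiny helper (hypothetical, not Dynamo code) makes the check explicit:

```python
def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees that read and write sets overlap in >= 1 replica."""
    return r + w > n

print(is_strict_quorum(3, 2, 2))   # True: the common (N, R, W) = (3, 2, 2) setup
print(is_strict_quorum(3, 2, 1))   # False: W=1 favors write availability over overlap
```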
  • 41.
      • Writes
        • Requires generation of a new vector clock by coordinator
        • Coordinator writes locally
        • Forwards to N nodes, if W-1 respond then the write was successful
      • Reads
        • Forwards to N nodes, if R-1 respond then forwards to user
        • Only unique responses forwarded
        • User handles merging if multiple versions exist
  • 42.
      • Sloppy Quorum
        • Read/write ops performed on the first N healthy nodes
        • Increases availability
      • Hinted Handoff
        • If node in preference list is not available, send replica to a node further down in the list
        • With a hint containing the identity of the original node
        • The receiving node keeps checking for the original
        • If the original becomes available, transfers replica to it
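A rough sketch of hinted handoff as described above: when the intended owner is down, the replica is written to the next healthy node in the list together with a hint naming the owner, and a periodic job hands the data back once the owner recovers. The `Node` class and function names are invented for illustration.

```python
class Node:
    """Minimal stand-in for a storage node."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}      # key -> value (replicas this node owns)
        self.hinted = {}    # key -> (value, intended owner name)

    def store(self, key, value, hint_for=None):
        if hint_for:
            self.hinted[key] = (value, hint_for)
        else:
            self.data[key] = value

def write_replica(key, value, preference_list):
    """Write to the first healthy node; attach a hint if it is not the intended owner."""
    intended = preference_list[0]
    for node in preference_list:
        if node.alive:
            hint = None if node is intended else intended.name
            node.store(key, value, hint_for=hint)
            return node
    raise RuntimeError("no healthy node available for this key")

def hand_off(stand_in, nodes_by_name):
    """Run periodically: return hinted replicas to owners that have come back."""
    for key, (value, owner_name) in list(stand_in.hinted.items()):
        owner = nodes_by_name[owner_name]
        if owner.alive:
            owner.store(key, value)
            del stand_in.hinted[key]
```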
  • 43.
      • Replica Synchronization
        • Synchronize with another node
        • Each node maintains a separate Merkle tree for each key range it hosts
        • Nodes exchange roots of trees for common key-ranges
        • Quickly determine divergent keys by comparing hashes
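A simplified Merkle-tree sketch of the anti-entropy step above: each node builds a hash tree over a key range, two nodes compare roots first, and only descend into subtrees whose hashes differ, so divergent keys are found without exchanging the whole range. The binary layout over a sorted key list is an assumption for illustration.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(items):
    """Merkle tree over (key, value) pairs, as nested tuples (hash, left, right)."""
    items = sorted(items)
    if len(items) == 1:
        key, value = items[0]
        return (_h(f"{key}={value}".encode()), None, None)   # leaf
    mid = len(items) // 2
    left, right = build_tree(items[:mid]), build_tree(items[mid:])
    return (_h(left[0] + right[0]), left, right)

def count_divergent(a, b):
    """Count differing leaves, descending only where subtree hashes differ."""
    if a[0] == b[0]:
        return 0            # identical hashes: skip the whole subtree
    if a[1] is None or b[1] is None:
        return 1            # reached a divergent leaf
    return count_divergent(a[1], b[1]) + count_divergent(a[2], b[2])

mine = build_tree([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
theirs = build_tree([("k1", "v1"), ("k2", "CHANGED"), ("k3", "v3"), ("k4", "v4")])
print(count_divergent(mine, theirs))   # 1 divergent key found without a full scan
```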
  • 44.
      • Ring Membership
        • Membership is explicit to avoid re-balancing of partition assignment
        • Use background gossip to build 1-hop DHT
        • Use external entity to bootstrap the system to avoid partitioned rings
      • Failure Detection
        • Node A finds node B unreachable (for servicing a request)
        • A uses other nodes to service requests and periodically checks B
        • A does not assume B to have failed
        • No globally consistent view of failure (because of explicit ring membership)
  • 45.
      • Application Configurable (N, R, W)
      • Every node is aware of the data hosted by its peers
        • requiring the gossiping of the full routing table with other nodes
        • scalability is limited by this to a few hundred nodes
        • hierarchy may help to overcome the limitation
  • 46.
      • Typical configuration for Dynamo (N, R, W) is (3, 2, 2)
      • Some implementations vary (N, R, W)
        • Always-writable stores might have W=1 (shopping cart)
        • Product catalog might have R=1 and W=N
      • Response requirement is 300ms for any request (read or write)
  • 47.
      • Consistency vs. Availability
        • 99.94% one version
        • 0.00057% two
        • 0.00047% three
        • 0.00009% four
      • Server-driven or Client-driven coordination
        • Server-driven
          • uses load balancers
          • forwards requests to desired set of nodes
        • Client-driven 50% faster
          • requires the polling of Dynamo membership updates
          • the client is responsible for determining the appropriate nodes to send the request to
      • Successful responses (without time-out) 99.9995%
  • 48.  
  • 49.  
  • 50.  
  • 51.  
  • 52.
      • Enormous data (and high growth)
        • Traditional solutions don’t work
      • Distributed databases
        • Lots of interesting work happening
      • Great time for young programmers!
        • Problem solving ability
  • 53.  
  • 54. Intelligent People. Uncommon Ideas. Licensed under Creative Commons Attribution Sharealike Noncommercial
