Handling Data in Mega Scale Systems
by Vineet Gupta, GM Software Engineer

    Presentation Transcript

    • Intelligent People. Uncommon Ideas.
      Handling Data in Mega Scale Web Apps(lessons learnt @ Directi)
      Vineet Gupta | GM – Software Engineering | Directi
      http://vineetgupta.spaces.live.com
      Licensed under Creative Commons Attribution Sharealike Noncommercial
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • Not Covering
      Offline Processing (Batching / Queuing)
      Distributed Processing – Map Reduce
      Non-blocking IO
      Fault Detection, Tolerance and Recovery
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • How Big Does it Get - Digg
      22M+ users
      Dozens of DB servers
      Dozens of Web servers
      Six specialized graph database servers to run recommendations engine
      Source:http://highscalability.com/digg-architecture
    • How Big Does it Get - Technorati
      1 TB / Day
      100 M blogs indexed / day
      10 B objects indexed / day
      0.5 B photos and videos
      Data doubles in 6 months
      Users double in 6 months
      Source:http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
    • How Big Does it Get - Flickr
      2 PB Raw Storage
      470 M photos, 4-5 sizes each
      400 k photos added / day
      35 M photos in Squid cache (total)
      2 M photos in Squid RAM
      38k reqs / sec to Memcached
      4 B queries / day
      Source:http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
    • How Big Does it Get - eBay
      Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
      2 PB of data
      26 B SQL queries / day
      1 B page views / day
      3 B API calls / month
      15,000 App servers
      Source:http://highscalability.com/ebay-architecture/
    • How Big Does it Get - Google
      450,000 low cost commodity servers in 2006
      Indexed 8 B web-pages in 2005
      200 GFS clusters (1 cluster = 1,000 – 5,000 machines)
      Read / write thruput = 40 GB / sec across a cluster
      Map-Reduce
      100k jobs / day
      20 PB of data processed / day
      10k MapReduce programs
      Source:http://highscalability.com/google-architecture/
    • Key Trends
      Data Size ~ PB
      Data Growth ~ TB / day
      No of servers – 10s to 10,000
      No of datacenters – 1 to 10
      Queries – B+ / day
      Specialized needs – more / other than RDBMS
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • Vertical Scaling (Scaling Up)
      [Diagram: a single host runs both the App Server and the DB Server; it is scaled up by adding CPUs and RAM]
    • Big Irons
      Sunfire E20k: 36x 1.8 GHz processors, $450,000 - $2,500,000
      PowerEdge SC1435: dual-core 1.8 GHz processor, around $1,500
    • Vertical Scaling (Scaling Up)
      Increasing the hardware resources on a host
      Pros
      Simple to implement
      Fast turnaround time
      Cons
      Finite limit
      Hardware does not scale linearly (diminishing returns for each incremental unit)
      Requires downtime
      Increases Downtime Impact
      Incremental costs increase exponentially
    • Vertical Partitioning of Services
      [Diagram: the App Server and the DB Server move to separate hosts]
    • Vertical Partitioning of Services
      Split services on separate nodes
      Each node performs different tasks
      Pros
      Increases per application Availability
      Task-based specialization, optimization and tuning possible
      Reduces context switching
      Simple to implement for out of band processes
      No changes to App required
      Flexibility increases
      Cons
      Sub-optimal resource utilization
      May not increase overall availability
      Finite Scalability
    • Horizontal Scaling of App Server
      [Diagram: a Load Balancer spreads requests across several Web Servers backed by one DB Server]
    • Horizontal Scaling of App Server
      Add more nodes for the same service
      Identical, doing the same task
      Load Balancing
      Hardware balancers are faster
      Software balancers are more customizable
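      For instance, a software balancer's simplest policy is plain round-robin. A minimal sketch in Python (the server names are made up):

        from itertools import cycle

        # Hypothetical pool of identical app-server nodes.
        SERVERS = ["app1:8080", "app2:8080", "app3:8080"]
        _rotation = cycle(SERVERS)

        def pick_server() -> str:
            """Hand out servers in rotation, one per request."""
            return next(_rotation)

        for _ in range(4):
            print(pick_server())   # app1, app2, app3, app1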
    • The problem - State
      [Diagram: the Load Balancer may send User 1 and User 2 to a different Web Server on every request, so per-user session state kept on one server gets lost]
    • Sticky Sessions
      [Diagram: the Load Balancer pins each user to one Web Server for the life of the session]
      Asymmetrical load distribution
      Downtime on a node loses the sessions pinned to it
    • Central Session Store
      [Diagram: every Web Server reads and writes session state in one shared Session Store]
      SPOF
      Reads and Writes generate network + disk IO
    • Clustered Sessions
      [Diagram: the Web Servers replicate session state to one another, so any of them can serve either user]
    • Clustered Sessions
      Pros
      No SPOF
      Easier to set up
      Fast Reads
      Cons
      n x Writes
      Increase in network IO with increase in nodes
      Stale data (rare)
    • Sticky Sessions with Central Store
      [Diagram: the Load Balancer pins users to a Web Server while session state is also kept in a central store]
    • More Session Management
      No Sessions
      Stuff state in a cookie and sign it!
      Cookie is sent with every request / response
      Super Slim Sessions
      Keep small amount of frequently used data in cookie
      Pull rest from DB (or central session store)
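      A note on "stuff state in a cookie and sign it": signing lets the server detect a tampered cookie. A minimal sketch using an HMAC (the secret and the state layout are illustrative, not from the talk):

        import hashlib
        import hmac

        SECRET = b"server-side-secret"  # never sent to the client

        def make_cookie(state: str) -> str:
            """Pack state plus an HMAC signature into one cookie value."""
            sig = hmac.new(SECRET, state.encode(), hashlib.sha256).hexdigest()
            return f"{state}|{sig}"

        def read_cookie(cookie: str) -> str:
            """Return the state if the signature checks out, else raise."""
            state, sig = cookie.rsplit("|", 1)
            expected = hmac.new(SECRET, state.encode(), hashlib.sha256).hexdigest()
            if not hmac.compare_digest(sig, expected):
                raise ValueError("tampered cookie")
            return state

        cookie = make_cookie("user_id=42;theme=dark")
        assert read_cookie(cookie) == "user_id=42;theme=dark"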
    • Sessions - Recommendation
      Bad
      Sticky sessions
      Good
      Clustered sessions for small number of nodes and / or small write volume
      Central sessions for large number of nodes or large write volume
      Great
      No Sessions!
    • App Tier Scaling - More
      HTTP Accelerators / Reverse Proxy
      Static content caching, redirect to a lighter HTTP server
      Async NIO on user-side, Keep-alive connection pool
      CDN
      Get closer to your user
      Akamai, Limelight
      IP Anycasting
      Async NIO
    • Scaling a Web App
      App-Layer
      Add more nodes and load balance!
      Avoid Sticky Sessions
      Avoid Sessions!!
      Data Store
      Tricky! Very Tricky!!!
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • Replication = Scaling by Duplication
      [Diagram: a single node holding tables T1-T4 becomes five nodes, each holding a full copy of T1-T4]
      Each node has its own copy of data
      Shared Nothing Cluster
    • Replication
      Read : Write = 4:1
      Scale reads at cost of writes!
      Duplicate Data – each node has its own copy
      Master Slave
      Writes sent to one node, cascaded to others
      Multi-Master
      Writes can be sent to multiple nodes
      Can lead to deadlocks
      Requires conflict management
    • Master-Slave
      [Diagram: the App Layer writes to one Master, which replicates to four Slaves]
      n x Writes - Async vs. Sync
      SPOF
      Async - Critical Reads from Master!
    • Multi-Master
      [Diagram: the App Layer writes to two Masters, which replicate to each other and to the Slaves]
      n x Writes - Async vs. Sync
      No SPOF
      Conflicts!
    • Replication Considerations
      Asynchronous
      Guaranteed, but out-of-band replication from Master to Slave
      Master updates its own db and returns a response to client
      Replication from Master to Slave takes place asynchronously
      Faster response to a client
      Slave data is marginally behind the Master
      Requires modification to App to send critical reads and writes to master, and load balance all other reads
      Synchronous
      Guaranteed, in-band replication from Master to Slave
      Master updates its own db, and confirms all slaves have updated their db before returning a response to client
      Slower response to a client
      Slaves have the same data as the Master at all times
      Requires modification to App to send writes to master and load balance all reads
    • Replication Considerations
      Replication at RDBMS level
      Support may exist in the RDBMS or through a 3rd party tool
      Faster and more reliable
      App must send writes to Master, reads to any db and critical reads to Master
      Replication at Driver / DAO level
      Driver / DAO layer ensures
      writes are performed on all connected DBs
      Reads are load balanced
      Critical reads are sent to a Master
      In most cases RDBMS agnostic
      Slower and in some cases less reliable
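      A minimal sketch of such a Driver / DAO layer, assuming DB-API-style connection objects (the class and its routing rules follow the bullets above; the names are invented):

        import random

        class ReplicatedDao:
            """Writes go to the master; ordinary reads are load balanced
            across slaves; critical reads go back to the master."""

            def __init__(self, master, slaves):
                self.master = master   # connection to the master
                self.slaves = slaves   # connections to the replicas

            def write(self, sql, params=()):
                cur = self.master.cursor()
                cur.execute(sql, params)
                self.master.commit()

            def read(self, sql, params=(), critical=False):
                # With async replication a slave may lag the master,
                # so critical reads must not be load balanced.
                conn = self.master if critical else random.choice(self.slaves)
                cur = conn.cursor()
                cur.execute(sql, params)
                return cur.fetchall()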
    • Diminishing Returns
      [Diagram: a 4-reads-per-1-write workload spread over replicas. One server handles 4R + 1W; with two servers each handles 2R + 1W; with four servers each handles 1R + 1W. Every replica repeats every write, so each added node buys less read capacity]
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • Partitioning = Scaling by Division
      Vertical Partitioning
      Divide data on tables / columns
      Scale to as many boxes as there are tables or columns
      Finite
      Horizontal Partitioning
      Divide data on rows
      Scale to as many boxes as there are rows!
      Limitless scaling
    • Vertical Partitioning
      [Diagram: one node holds tables T1-T5]
      Note: A node here typically represents a shared nothing cluster
    • Vertical Partitioning
      [Diagram: tables T1-T5 split across five nodes, one table per node]
      Facebook - User table, posts table can be on separate nodes
      Joins need to be done in code (Why have them?) - see the sketch below
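      A sketch of a join done in code across two partitions (the query helper and the schema are hypothetical):

        def posts_with_author_names(user_db, post_db, user_ids):
            """Join posts to user names across two nodes, in application code."""
            marks = ",".join("?" * len(user_ids))

            # Query 1: posts from the posts node.
            posts = post_db.query(
                f"SELECT id, user_id, body FROM posts WHERE user_id IN ({marks})",
                user_ids)

            # Query 2: only the users we actually need, from the users node.
            names = dict(user_db.query(
                f"SELECT id, name FROM users WHERE id IN ({marks})", user_ids))

            # The in-memory merge replaces the SQL JOIN.
            return [{"post_id": pid, "author": names[uid], "body": body}
                    for pid, uid, body in posts]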
    • Horizontal Partitioning
      [Diagram: three nodes each hold all of tables T1-T5; the first node holds the first million rows, the second node the second million rows, the third node the third million rows]
    • Horizontal Partitioning Schemes
      Value Based
      Split on timestamp of posts
      Split on first alphabet of user name
      Hash Based
      Use a hash function to determine cluster
      Lookup Map
      First Come First Serve
      Round Robin
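      Of these, the hash-based scheme fits in a few lines (shard names invented). Note that plain modulo hashing reshuffles keys whenever a shard is added, which is one reason lookup maps or consistent hashing get used instead:

        import hashlib

        SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical cluster names

        def shard_for(key: str) -> str:
            """Map a row key to a shard with a stable digest (Python's
            built-in hash() varies between runs, so avoid it here)."""
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return SHARDS[h % len(SHARDS)]

        print(shard_for("user:alice"))  # same shard every time, on every node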
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • CAP Theorem
      Of Consistency, Availability and Partition tolerance, a distributed system can guarantee at most two at once
      Source:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
    • Transactions
      Transactions make you feel alone
      No one else manipulates the data when you are
      Transactional serializability
      The behavior is as if a serial order exists
      Source:http://blogs.msdn.com/pathelland/ (Slide 46)
    • Life in the “Now”
      Transactions live in the “now” inside services
      Time marches forward
      Transactions commit
      Advancing time
      Transactions see the committed transactions
      A service’s biz-logic lives in the “now”
      Source:http://blogs.msdn.com/pathelland/ (Slide 47)
    • Sending Unlocked Data Isn’t “Now”
      Messages contain unlocked data
      Assume no shared transactions
      Unlocked data may change
      Unlocking it allows change
      Messages are not from the “now”
      They are from the past
      There is no simultaneity at a distance!
      • Similar to speed of light
      • Knowledge travels at speed of light
      • By the time you see a distant object it may have changed!
      • By the time you see a message, the data may have changed!
      Services, transactions, and locks bound simultaneity!
      • Inside a transaction, things appear simultaneous (to others)
      • Simultaneity only inside a transaction!
      • Simultaneity only inside a service!
      Source:http://blogs.msdn.com/pathelland/ (Slide 48)
    • Outside Data: a Blast from the Past
      All data from distant stars is from the past
      • 10 light years away; 10 year old knowledge
      • The sun may have blown up 5 minutes ago
      • We won’t know for 3 minutes more…
      All data seen from a distant service is from the “past”
      By the time you see it, it has been unlocked and may change
      Each service has its own perspective
      Inside data is “now”; outside data is “past”
      My inside is not your inside; my outside is not your outside
      This is like going from Newtonian to Einsteinian physics
      • Newton’s time marched forward uniformly
      • Instant knowledge
      • Classic distributed computing: many systems look like one
      • RPC, 2-phase commit, remote method calls…
      • In Einstein’s world, everything is “relative” to one’s perspective
      • Today: No attempt to blur the boundary
      Source:http://blogs.msdn.com/pathelland/ (Slide 49)
    • Versions and Distributed Systems
      Can’t have “the same” data at many locations
      Unless it is a snapshot
      Changing distributed data needs versions
      Creates a snapshot…
      Source:http://blogs.msdn.com/pathelland/
    • Subjective Consistency
      Given what I know here and now, make a decision
      Remember the versions of all the data used to make this decision
      Record the decision as being predicated on these versions
      Other copies of the object may make divergent decisions
      Try to sort out conflicts within the family
      If necessary, programmatically apologize
      Very rarely, whine and fuss for human help
      Subjective Consistency:
      Given the information I have at hand, make a decision and act on it!
      Remember the information at hand!
      Ambassadors Had Authority
      Back before radio, it could be months between communication with the king. Ambassadors would make treaties and much more... They had binding authority. The mess was sorted out later!
      Source:http://blogs.msdn.com/pathelland/
    • Eventual Consistency
      Eventually, all the copies of the object share their changes
      “I’ll show you mine if you show me yours!”
      Now, apply subjective consistency:
      “Given the information I have at hand, make a decision and act on it!”
      Everyone has the same information, everyone comes to the same conclusion about the decisions to take…
      Eventual Consistency:
      • Given the same knowledge, produce the same result!
      • Everyone sharing their knowledge leads to the same result...
      This is NOT magic; it is a design requirement!
      Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement (a sketch follows below)
      Source:http://blogs.msdn.com/pathelland/
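      A concrete toy for that requirement: a grow-only counter whose replicas merge by per-node maximum. Because max is idempotent, commutative and associative, any gossip order converges (this example is mine, not from the deck):

        def merge(a: dict, b: dict) -> dict:
            """Merge two replica states by taking per-node maxima."""
            return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

        # Each replica counts only its own increments under its own node id.
        r1 = {"node1": 5}
        r2 = {"node2": 3}
        r3 = {"node1": 4, "node2": 3}   # a stale view of the others

        # Shared in any order, everyone reaches the same state.
        assert merge(merge(r1, r2), r3) == merge(r3, merge(r2, r1))
        print(sum(merge(merge(r1, r2), r3).values()))  # 8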
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • Why Normalize?
      Classic problem with de-normalization:
      Can’t update Sam’s phone # since there are many copies

      Emp #   Emp Name   Mgr #   Mgr Name   Emp Phone   Mgr Phone
      47      Joe        13      Sam        5-1234      6-9876
      18      Sally      38      Harry      3-3123      5-6782
      91      Pete       13      Sam        2-1112      6-9876
      66      Mary       02      Betty      5-7349      4-0101

      Normalization’s Goal Is Eliminating Update Anomalies
      Can Be Changed Without “Funny Behavior”
      Each Data Item Lives in One Place
      De-normalization is OK if you aren’t going to update!
      Source:http://blogs.msdn.com/pathelland/
    • Eliminate Joins
      [Diagram: a fully normalized schema, where assembling one page touches many tables]
      6 joins for 1 query!
      Do you think FB would do this?
      And how would you do joins with partitioned data?
      De-normalization removes joins
      But increases data volume
      But disk is cheap and getting cheaper
      And can lead to inconsistent data
      If you are lazy
      However this is not really an issue
    • “Append-Only” Data
      Many Kinds of Computing are “Append-Only”
      Lots of observations are made about the world
      Debits, credits, Purchase-Orders, Customer-Change-Requests, etc
      As time moves on, more observations are added
      You can’t change the history but you can add new observations
      Derived Results May Be Calculated
      Estimate of the “current” inventory
      Frequently inaccurate
      Historic Rollups Are Calculated
      Monthly bank statements
    • Databases and Transaction Logs
      Transaction Logs Are the Truth
      High-performance & write-only
      Describe ALL the changes to the data
      Database - the Current Opinion
      Describes the latest value of the data as perceived by the application
      [Diagram: the transaction log feeding the DB]
      The Database Is a Caching of the Transaction Log !
      It is the subset of the latest committed values represented in the transaction log…
      Source:http://blogs.msdn.com/pathelland/
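      A toy illustration of that last claim: replaying the append-only log reproduces the current table, so the DB is derivable from the log. Not modeled on any particular engine:

        # Append-only log of committed changes: (key, value) pairs.
        log = [("sam_phone", "6-9876"),
               ("joe_phone", "5-1234"),
               ("sam_phone", "6-1111")]   # a later update of the same key

        def rebuild(log):
            """The 'database': the latest committed value per key."""
            db = {}
            for key, value in log:   # replay history in commit order
                db[key] = value
            return db

        print(rebuild(log))  # {'sam_phone': '6-1111', 'joe_phone': '5-1234'}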
    • We Are Swimming in a Sea of Immutable Data
      Source:http://blogs.msdn.com/pathelland/
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • Caching
      Makes scaling easier (cheaper)
      Core Idea
      Read data from persistent store into memory
      Store in a hash-table
      Read first from the cache; on a miss, load from the persistent store
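      The core idea in a few lines, assuming a hypothetical persistent-store interface:

        cache = {}   # the in-memory hash table

        def get(key, store):
            """Read first from the cache; on a miss, load and populate."""
            if key in cache:
                return cache[key]       # hit: no DB or disk touched
            value = store.load(key)     # miss: hit the persistent store
            cache[key] = value          # populate for the next reader
            return value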
    • Write-thru Cache
      [Diagram: App Server writes pass through the Cache, which updates the persistent store synchronously]
    • Write-back Cache
      [Diagram: App Server writes land in the Cache; the persistent store is updated later, asynchronously]
    • Sideline Cache
      [Diagram: App Server reads and writes the persistent store directly and updates the Cache on the side]
    • Memcached
    • How does it work
      In-memory Distributed Hash Table
      Memcached instance manifests as a process (often on the same machine as web-server)
      Memcached Client maintains a hash table
      Which item is stored on which instance
      Memcached Server maintains a hash table
      Which item is stored in which memory location
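      The client-side hash table in miniature: hashing the key picks the instance, so every client agrees on where an item lives without any coordination (the addresses are invented, and real memcached clients typically use consistent hashing rather than plain modulo):

        import hashlib

        INSTANCES = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

        def instance_for(key: str) -> str:
            """Key -> memcached instance; the server's own hash table
            then maps the key to a memory location."""
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return INSTANCES[h % len(INSTANCES)]

        print(instance_for("user:42:profile"))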
    • Outline
      Characteristics
      App Tier Scaling
      Replication
      Partitioning
      Consistency
      Normalization
      Caching
      Data Engine Types
    • It’s not all Relational!
      Amazon - S3, SimpleDB, Dynamo
      Google - App Engine Datastore, BigTable
      Microsoft – SQL Data Services, Azure Storages
      Facebook – Cassandra
      LinkedIn - Project Voldemort
      Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable
    • Tuplespaces
      Basic Concepts
      No tables - containers and entities
      No schema - each tuple has its own set of properties
      Amazon SimpleDB – strings only
      Microsoft Azure SQL Data Services
      Strings, blob, datetime, bool, int, double, etc.
      No cross-container joins as of now
      Google App Engine Datastore
      Strings, blob, datetime, bool, int, double, etc.
    • Key-Value Stores
      Google BigTable
      Sparse, Distributed, multi-dimensional sorted map (a toy model follows this slide)
      Indexed by row key, column key, timestamp
      Each value is an un-interpreted array of bytes
      Amazon Dynamo
      Data partitioned and replicated using consistent hashing
      Decentralized replica sync protocol
      Consistency thru versioning
      Facebook Cassandra
      Used for Inbox search
      Open Source
      Scalaris
      Keys stored in lexicographical order
      Improved Paxos to provide ACID
      Memory resident, no persistence
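      As an illustration, BigTable's data model can be reduced to a toy in-memory map keyed by (row, column, timestamp). The row and column names echo the BigTable paper's example; this is not BigTable's API, and a real table also keeps rows sorted for range scans:

        from itertools import count

        _ts = count(1)   # logical timestamp stand-in
        table = {}       # (row, column, timestamp) -> bytes

        def put(row: str, col: str, value: bytes):
            """Writes never overwrite: each write adds a new version."""
            table[(row, col, next(_ts))] = value

        def latest(row: str, col: str) -> bytes:
            """Read the highest-timestamped version of the cell."""
            return max((ts, v) for (r, c, ts), v in table.items()
                       if r == row and c == col)[1]

        put("com.cnn.www", "contents:", b"<html>v1</html>")
        put("com.cnn.www", "contents:", b"<html>v2</html>")
        print(latest("com.cnn.www", "contents:"))  # b'<html>v2</html>'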
    • In Summary
      Real Life Scaling requires trade-offs
      No Silver Bullet
      Need to learn new things
      Need to un-learn
      Balance!
    • QUESTIONS?
    • Intelligent People. Uncommon Ideas.
      Licensed under Creative Commons Attribution Sharealike Noncommercial