Twitter case study


Twitter's distributed databases: a case study

Published in: Technology, Business

  • Source_id and destination_id are unique user IDs, unless the graph is the one storing favorite tweets, in which case the destination ID may be a tweet ID. Position is a timestamp. For example, when users delete their accounts, their edges are put into the “archived” state, allowing them to be restored later. When an edge is deleted, the row isn’t actually deleted from MySQL; it's just marked as being in the deleted state, which has the effect of moving the primary key.
  • Data is partitioned by node, so these queries can each be answered by a single partition, using an indexed range query.
  • Unlike other systems, Storm is fault-tolerant and scalable. Storm can run a continuous query and stream the results to clients in real time.
  • Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
  • 1. Tracks the task tree. 2. Workers are controlled by a supervisor, so a task is never orphaned and left consuming memory. 3. Tasks send heartbeats to Nimbus. 4. No intermediate queuing: messages are transferred directly between tasks. Storm guarantees that messages will be processed even in the face of failures.
  • Twitter case study

    1. Term paper presented by: • Akhtar S. Quereshi • Anurag Arora • Divya Gandhi • Nishant Goyal (DDBMS term paper)
    2. Twitter tale of big data! • 3 years, 2 months and 1 day: the time it took from the first Tweet to the billionth Tweet. • 1 week: the time it took for users to send a billion Tweets in 2011. • 50 million: the average number of Tweets people sent per day in 2010. • 140 million: the average number of Tweets people sent per day in February 2011. • 177 million: Tweets sent on March 11, 2011. • Half a billion: Tweets sent per day in October 2012. • 572,000: new accounts created on March 12, 2011. • 460,000: average number of new accounts per day over February 2011.
    3. Real-time challenge
    5. Agenda: • Managing social graphs: FlockDB • Sharding: Gizzard • Real-time data processing and storage: Hadoop/Storm
    6. FlockDB, built over MySQL: maintaining the social graph and query processing
    8. Challenges: • The timeline needs to rapidly go through a user's *following* list and quickly display all their tweets, sorted by recency. • Answer queries like "What's the intersection of people I follow and people who are following President Obama?" • Handle heavy write traffic as followers are added or removed.
    10. These features are difficult to implement in a traditional relational database.
    11. What is FlockDB? • FlockDB is a distributed graph database for storing adjacency lists. • It is optimized not for graph traversal but for very large adjacency lists and fast reads and writes. • It is able to support: – a high rate of add/update/remove operations; – potentially complex set arithmetic queries; – paging through query result sets containing millions of entries; – the ability to "archive" and later restore archived edges.
    12. How does FlockDB deal with these challenges? • FlockDB stores all information as edge attributes in the graph. • Each entry in the adjacency list has four major attributes: source ID, destination ID, position and state.
    13. • Each edge is actually stored twice. Forward: Nick follows Robey at 9:54 today. Backward: Robey is followed by Nick at 9:54 today. • "Who follows me?" is just as efficient as "Who do I follow?"
    14. "What's the intersection of people I follow and people who are following President Obama?" This can be answered quickly by decomposing it into single-user queries, such as "Who is following President Obama?" • Data is partitioned by node, so these queries can each be answered by a single partition, using an indexed range query. • Paging through long result sets is done by using the position field (timestamp) as a cursor.
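
The double storage of edges, the intersection query and cursor paging described in slides 13 and 14 can be sketched in a few lines of Python. This is only an in-memory illustration, not FlockDB's actual API: the `EdgeStore` class, its method names and the sample user names and positions are all assumptions made for the example, and positions are assumed to be unique per adjacency list.

```python
from collections import defaultdict


class EdgeStore:
    """Toy in-memory adjacency-list store (hypothetical, not FlockDB's API)."""

    def __init__(self):
        # Each edge is kept twice: once per direction, keyed by node.
        self.forward = defaultdict(dict)   # source -> {destination: position}
        self.backward = defaultdict(dict)  # destination -> {source: position}

    def add_edge(self, source_id, destination_id, position):
        self.forward[source_id][destination_id] = position
        self.backward[destination_id][source_id] = position

    def following(self, user_id):
        """Answer "Who do I follow?" from the forward index."""
        return set(self.forward[user_id])

    def followers(self, user_id):
        """Answer "Who follows me?" from the backward index."""
        return set(self.backward[user_id])

    def page(self, index, node_id, cursor=None, count=2):
        """Return up to `count` neighbours, newest first, older than `cursor`."""
        rows = sorted(index[node_id].items(), key=lambda kv: kv[1], reverse=True)
        if cursor is not None:
            rows = [kv for kv in rows if kv[1] < cursor]
        chunk = rows[:count]
        # The position of the last row returned becomes the next cursor.
        next_cursor = chunk[-1][1] if len(rows) > count else None
        return [node for node, _ in chunk], next_cursor


store = EdgeStore()
store.add_edge("nick", "robey", 100)
store.add_edge("nick", "obama", 101)
store.add_edge("robey", "obama", 102)
store.add_edge("alice", "obama", 103)
store.add_edge("nick", "alice", 104)

# Two single-user queries, intersected by the caller.
common = store.following("nick") & store.followers("obama")

first_page, cursor = store.page(store.backward, "obama")
second_page, end = store.page(store.backward, "obama", cursor=cursor)
```

Here `common` holds the people Nick follows who also follow Obama, and the two `page` calls walk Obama's followers two at a time. Because each single-user query touches only one node's adjacency list, it maps onto a single partition.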
    15. The Gizzard framework is used to query the FlockDB distributed datastore and to handle the partitioning layer.
    16. What's "sharding"?
    17. Sharding = Partitioning + Replication. The problem is that sharding is difficult. Determining smart partitioning schemes for particular kinds of data requires a lot of thought. Even more difficult is ensuring that all of the copies of the data are consistent despite unreliable communication and occasional computer failures.
    18. Sharding: the advantages of sharding are: • High availability • Faster queries. How is sharding different from traditional architectures?
    19. How is sharding different from traditional architectures? • Data are parallelized across many datastores. • Data are more highly available. • It doesn't use replication. • Data are denormalized.
    20. Gizzard
    21. Gizzard is a framework that offers a basic template for solving a certain class of problem.
    22. Key features of Gizzard: • Gizzard supports any data storage backend. • Gizzard handles partitioning through a forwarding table. • Gizzard is middleware. • Gizzard handles replication through a replication tree. • Gizzard is fault-tolerant. • Gizzard supports migrations. • Gizzard handles write conflicts.
    23. How does Gizzard work?
    24. How does it work? Gizzard is middleware. It sits "in the middle" between clients (web front ends such as PHP and Ruby on Rails applications) and the many partitions and replicas of data, so all data manipulation flows through Gizzard.
    25. Architecture: Web/App Server → Gizzard (stateless) → MySQL
    26. How does it work? Gizzard handles partitioning through a forwarding table. It maps ranges of data to particular shards, and these mappings are stored in the forwarding table.
    27. Partitioning: • Define a function Fun(id). • Ranges do not have to be equal.
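
A forwarding table of the kind described in slides 26 and 27 can be sketched as a sorted list of range lower bounds searched with binary search. This is a minimal illustration, not Gizzard's actual code: the `ForwardingTable` class, the `fun` function (standing in for the slide's Fun(id)), the modulus of 1000 and the shard names are all assumptions made for the example.

```python
import bisect


def fun(user_id):
    """Hypothetical stand-in for the slide's Fun(id): maps an id into 0..999."""
    return user_id % 1000


class ForwardingTable:
    """Maps ranges of Fun(id) values to shard names (illustrative only)."""

    def __init__(self, entries):
        # entries: (lower_bound, shard_name); a shard owns every value from
        # its lower bound up to, but not including, the next lower bound.
        self.entries = sorted(entries)
        self.lower_bounds = [low for low, _ in self.entries]

    def shard_for(self, value):
        # Find the last entry whose lower bound is <= value.
        i = bisect.bisect_right(self.lower_bounds, value) - 1
        if i < 0:
            raise KeyError(f"no shard covers {value}")
        return self.entries[i][1]


# Ranges are deliberately unequal: 0-99, 100-599, 600-999.
table = ForwardingTable([(0, "shard_a"), (100, "shard_b"), (600, "shard_c")])

shard_small = table.shard_for(fun(42))      # 42 falls in 0-99
shard_middle = table.shard_for(fun(1234))   # 1234 % 1000 = 234 falls in 100-599
shard_large = table.shard_for(fun(999))     # 999 falls in 600-999
```

Because only the lower bounds are stored, a hot range can be split by adding one more entry without touching the others.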
    28. How does it work? Gizzard handles replication through a replication tree. Each shard referenced in the forwarding table can be either a physical shard or a logical shard. A physical shard is a reference to a particular data storage back end. A logical shard is just a tree of other shards.
    29. Partitioning: • Logical shard tree • Define a replication policy: Read Only, Write Only, Replicate
    30. How does it work? Gizzard is fault-tolerant. Gizzard is designed to avoid any single point of failure. If a certain replica in a partition has crashed, Gizzard routes requests to the remaining healthy replicas, bearing in mind the weighting function. Writes to an unavailable shard are buffered until the shard becomes available again.
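
The replication tree and the fault-tolerance behaviour from slides 28 to 30 can be sketched as follows. This is a simplified illustration under stated assumptions, not Gizzard's implementation: the class names `PhysicalShard` and `ReplicatingShard`, the `available` flag and the `flush` method are hypothetical, and reads simply try replicas in order instead of applying a weighting function.

```python
class PhysicalShard:
    """Stands in for one storage back end; `available` simulates a crash."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.available = True

    def write(self, key, value):
        if not self.available:
            raise ConnectionError(self.name)
        self.rows[key] = value

    def read(self, key):
        if not self.available:
            raise ConnectionError(self.name)
        return self.rows.get(key)


class ReplicatingShard:
    """Logical shard: a tree node that fans writes out to its children."""

    def __init__(self, children):
        self.children = list(children)
        # One buffer of pending writes per child.
        self.buffered = [[] for _ in self.children]

    def write(self, key, value):
        for i, child in enumerate(self.children):
            try:
                child.write(key, value)
            except ConnectionError:
                # Buffer the write until the child is reachable again.
                self.buffered[i].append((key, value))

    def read(self, key):
        # Simplification: try replicas in order, skipping unhealthy ones.
        for child in self.children:
            try:
                return child.read(key)
            except ConnectionError:
                continue
        raise ConnectionError("no healthy replica")

    def flush(self):
        """Replay buffered writes to children that have recovered."""
        for i, child in enumerate(self.children):
            remaining = []
            for key, value in self.buffered[i]:
                try:
                    child.write(key, value)
                except ConnectionError:
                    remaining.append((key, value))
            self.buffered[i] = remaining


a, b = PhysicalShard("a"), PhysicalShard("b")
tree = ReplicatingShard([a, b])
tree.write("k", 1)

a.available = False                 # replica a crashes
tree.write("k", 2)                  # applied to b, buffered for a
value_during_outage = tree.read("k")

a.available = True                  # replica a comes back, still stale
stale_value = a.rows["k"]
tree.flush()
recovered_value = a.rows["k"]
```

Since a `ReplicatingShard` exposes the same `write` and `read` methods as a `PhysicalShard`, it can itself be a child of another `ReplicatingShard`, which is what makes the structure a tree.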
    31. How does Gizzard handle write conflicts?
    32. Write operations have to be idempotent and commutative. Example: a user quickly follows and unfollows me. How is this write commutative? Follow and unfollow translate to the same write event to FlockDB, "set edge state to X". An update applies only if the state on disk is older than the state in flight. So in the case of follow then unfollow, it doesn't matter which one is applied to MySQL first; the unfollow state will always win because it is more recent.
    33. How does Gizzard handle write conflicts? Write conflicts occur when two manipulations of the same record try to change the record in differing ways. Because Gizzard does not guarantee that operations will be applied in order, write operations must be both idempotent and commutative in order to avoid conflicts. In many cases this is actually an easier requirement than trying to guarantee ordered delivery of messages with bounded latency and high availability.
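
The "set edge state to X" rule from slides 32 and 33 can be checked with a small sketch. The function name `apply_write`, the dictionary used as storage and the timestamps are assumptions for illustration, and the sketch assumes that no two writes to the same edge carry the same timestamp.

```python
from itertools import permutations


def apply_write(store, edge, state, timestamp):
    """Apply "set edge state to X" only if it is newer than what is stored."""
    current = store.get(edge)
    if current is None or current[1] < timestamp:
        store[edge] = (state, timestamp)


edge = ("nick", "robey")
writes = [
    (edge, "follow", 100),
    (edge, "unfollow", 101),
    (edge, "unfollow", 101),  # duplicate delivery of the same write
]

# Apply the writes in every possible order and collect the final states.
outcomes = set()
for order in permutations(writes):
    store = {}
    for e, state, ts in order:
        apply_write(store, e, state, ts)
    outcomes.add(store[edge])
```

Every ordering, including the ones with the duplicated write, ends in the same final state, so `outcomes` contains a single entry: the most recent unfollow. That is what idempotent and commutative mean in practice here.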
    34. Migration: migrating from Datastore A to Datastore A'
    35. Twitter's real-time data processing and storage needs. What type of data system does it need?
    44. Hadoop: • Hadoop Distributed File System (HDFS): it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way. – LZO compression • MapReduce workflow system: it breaks analyses over large sets of data into small chunks that can be done in parallel across all (say) 100 machines. It generates the precomputed view on which queries are executed.
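
The block arithmetic on the Hadoop slide can be worked through with a short calculation. The function name `hdfs_footprint` and the 1 GB example file are assumptions for illustration; the block sizes and the replication factor of three come from the slide.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, stored block copies, raw storage in MB) for one file."""
    # The last block may be only partly full, hence the ceiling.
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_copies = blocks * replication
    # A partly full last block only occupies its real size on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_copies, raw_storage_mb


large_blocks = hdfs_footprint(1024, block_size_mb=128)
small_blocks = hdfs_footprint(1024, block_size_mb=64)
```

For a 1024 MB file, 128 MB blocks give 8 blocks and 24 stored copies, while 64 MB blocks give 16 blocks and 48 copies; in both cases the cluster holds 3072 MB of raw data. Smaller blocks allow more parallel map tasks at the cost of more blocks to track.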
    62. Storm topology
    63. Example query: streaming word count
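
The idea of a streaming word count can be sketched with plain Python generators. This is not Storm's API and not the code from the slide: the names `sentence_spout`, `split_bolt` and `count_bolt` only borrow Storm's spout and bolt vocabulary, and everything runs in a single process, where a real topology would distribute the steps across workers.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Splits each incoming sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Keeps a running count and emits an update for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wire the steps together: spout -> split -> count.
topology = count_bolt(split_bolt(sentence_spout(["the cow jumped", "the moon"])))
updates = list(topology)
```

Each element of `updates` is a (word, running count) pair emitted as soon as the word arrives, so a client could consume results continuously instead of waiting for a batch job to finish.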
    64. USP of Storm: 1. Guaranteed message processing. 2. Robust process management. 3. Fault detection and automatic reassignment. 4. Efficient message passing.
    65. Monitoring popular queries: • A Storm topology tracks statistics on search queries. • Queries are sent to human evaluators on Amazon’s Mechanical Turk, who answer questions about each query and categorize it. • Machine learning models evaluate the responses and then push the information to back-end systems.
    66. References: 1. Twitter Engineering blog. 2. GitHub forums.
    67. Queries please! Thank you!