1. Term paper presented by:
• Akhtar S. Quereshi
• Anurag Arora
• Divya Gandhi
• Nishant Goyal
DDBMS term paper 1
2. Twitter's tale of big data!
3 years, 2 months and 1 day. The time it took from the first Tweet to the billionth Tweet.
1 week. The time it took for users to send a billion Tweets in 2011.
50 million. The average number of Tweets people sent per day, 2010.
140 million. The average number of Tweets people sent per day, February 2011.
177 million. Tweets sent on March 11, 2011.
Half a billion. Tweets sent per day, October 2012.
572,000. Number of new accounts created on March 12, 2011.
460,000. Average number of new accounts per day over February 2011.
8. Challenges
• The timeline needs to rapidly go through a user's
*following* list and quickly display all of their
tweets, sorted by recency.
• Answer queries like "What's the intersection of
people I follow and people who are following
President Obama?"
• Handle heavy write traffic, as followers are added or
removed.
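The timeline and intersection queries above can be sketched with plain Python sets and sorted lists. This is a toy in-memory model; the data and names like `follows` are illustrative, not Twitter's actual API:

```python
from itertools import chain

# Toy adjacency lists: who each user follows (illustrative data).
follows = {
    "me":    {"alice", "bob", "obama"},
    "alice": {"obama"},
}
followers_of_obama = {"me", "alice", "carol"}

# Tweets per user as (timestamp, text), used to assemble a timeline.
tweets = {
    "alice": [(3, "hi")],
    "bob":   [(1, "yo"), (5, "new post")],
    "obama": [(4, "hello")],
}

# Timeline: gather the tweets of everyone I follow, newest first.
timeline = sorted(
    chain.from_iterable(tweets.get(u, ()) for u in follows["me"]),
    key=lambda t: t[0],
    reverse=True,
)

# Set arithmetic: the intersection of people I follow
# and the people following Obama.
both = follows["me"] & followers_of_obama
```

At Twitter's scale both steps become hard: the timeline merge touches millions of adjacency-list entries, and the set intersection must run across partitions.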
10. These features are difficult to
implement in a traditional
relational database.
11. What is FlockDB?
• FlockDB is a distributed graph database for storing
adjacency lists.
• Optimized not for graph traversal but for very large
adjacency lists and fast reads/writes.
• It is able to support:
– a high rate of add/update/remove operations.
– potentially complex set arithmetic queries.
– paging through query result sets containing millions
of entries.
– the ability to "archive" and later restore archived edges.
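A minimal sketch of an adjacency-list store supporting those operations, with edge state handled the way the slides describe (rows are marked removed or archived, never physically deleted). The class and method names are invented for illustration:

```python
class EdgeStore:
    """Toy adjacency-list store with add/remove/archive operations."""
    NORMAL, REMOVED, ARCHIVED = 0, 1, 2

    def __init__(self):
        self.edges = {}  # (source, dest) -> [position, state]

    def add(self, src, dst, position):
        self.edges[(src, dst)] = [position, self.NORMAL]

    def remove(self, src, dst, position):
        # the row is not deleted, only marked as removed
        self.edges[(src, dst)] = [position, self.REMOVED]

    def archive(self, src):
        # e.g. the user deleted their account: archive all outgoing edges
        for (s, d), rec in self.edges.items():
            if s == src and rec[1] == self.NORMAL:
                rec[1] = self.ARCHIVED

    def neighbors(self, src):
        # only live edges are visible to queries
        return sorted(
            d for (s, d), (_, state) in self.edges.items()
            if s == src and state == self.NORMAL
        )
```

Because removal and archival only flip a state flag, edges can later be restored by flipping it back.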
12. How does FlockDB deal with these challenges?
• FlockDB stores all information as edge
attributes in the graph.
• The four major attributes in each adjacency-list row
are source_id, destination_id, position (a timestamp), and state.
13. • Each edge is actually stored twice.
forward: Nick follows Robey at 9:54 today.
backward: Robey is followed by Nick at 9:54 today.
• "Who follows me?" is just as efficient as
"Who do I follow?”
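One way to picture the double write (a sketch, not FlockDB's actual MySQL schema):

```python
forward = {}   # source -> set of destinations ("who do I follow?")
backward = {}  # destination -> set of sources ("who follows me?")

def follow(src, dst):
    # Each edge is written twice, once per direction.
    forward.setdefault(src, set()).add(dst)
    backward.setdefault(dst, set()).add(src)

follow("nick", "robey")

# Both questions are now single-key lookups:
following = forward["nick"]    # who Nick follows
followers = backward["robey"]  # who follows Robey
```

The cost is doubled write volume and storage; the payoff is that neither direction of the query ever needs a scan.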
14. "What's the intersection of people I follow and
people who are following President Obama?"
This can be answered quickly by decomposing it into two single-user
queries, "Who do I follow?" and "Who is following President Obama?",
and intersecting the results.
Data is partitioned by node, so these queries can
each be answered by a single partition, using an
indexed range query.
Paging through long result sets is done by using the
position field (timestamp) as a cursor.
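Cursor-based paging over the position (timestamp) field might look like the following sketch. The function name and data layout are invented for illustration; the real store would run this as an indexed range query inside one partition:

```python
def page(edges, cursor, limit):
    """Return up to `limit` edges with position older than `cursor`,
    newest first, plus the next cursor (position of the last row)."""
    rows = sorted((e for e in edges if e[0] < cursor), reverse=True)
    batch = rows[:limit]
    next_cursor = batch[-1][0] if batch else None
    return batch, next_cursor

# Toy adjacency list: (position, destination) pairs.
edges = [(t, f"user{t}") for t in range(1, 8)]

batch1, c1 = page(edges, cursor=float("inf"), limit=3)  # first page
batch2, c2 = page(edges, cursor=c1, limit=3)            # next page
```

Because the cursor is a position value rather than an offset, each page is an independent range query, so paging stays cheap even for result sets with millions of entries.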
15. The Gizzard framework is used to query
the FlockDB distributed datastore
and to handle the partitioning layer.
17. Sharding
= Partitioning + Replication
The problem is that sharding is difficult.
Determining smart partitioning schemes for
particular kinds of data requires a lot of
thought. Even more difficult is ensuring
that all copies of the data stay consistent
despite unreliable communication and
occasional machine failures.
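The two halves of the definition can be sketched separately: partitioning decides which slice of the data a key belongs to, and replication writes that slice in several places. Both choices below (CRC-mod partitioning, copy-to-every-replica) are toy stand-ins, not Gizzard's scheme:

```python
import zlib

NUM_PARTITIONS = 4   # partitioning: how many slices of the data
REPLICAS = 2         # replication: how many copies of each slice

# One list of rows per (partition, replica) pair.
stores = [[] for _ in range(NUM_PARTITIONS * REPLICAS)]

def shard_write(key, value):
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS  # pick the partition
    for r in range(REPLICAS):                      # write every replica
        stores[p * REPLICAS + r].append((key, value))

shard_write("nick", "follows robey")
```

Even this toy shows where the difficulty comes from: if one replica write fails midway, the copies disagree, and the system must detect and repair that.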
18. Sharding
The advantages of sharding are:
• High availability
• Faster Queries
How is sharding different from
traditional architectures?
19. How is sharding different from
traditional architectures?
• Data are parallelized across many datastores
• Data are more highly available.
• It doesn't rely on replication alone; data are also partitioned
• Data are denormalized
21. Gizzard
Gizzard is a framework that offers a basic
template for solving a certain class of problems.
22. Gizzard
Here are some key features of "Gizzard"
• Gizzard supports any data-storage backend
• Gizzard handles partitioning through a forwarding table
• Gizzard is middleware
• Gizzard handles replication through a replication tree
• Gizzard is fault-tolerant
• Gizzard supports migrations
• Gizzard handles write conflicts
24. How does it work ?
Gizzard is middleware
It sits "in the middle" between clients (web front-ends such as PHP and Ruby
on Rails applications) and the many partitions and replicas of data; hence
all data manipulation flows through Gizzard.
26. How does it work ?
Gizzard handles partitioning through a
forwarding table
Gizzard handles partitioning by mapping ranges of data to particular
shards. These mappings are stored in a forwarding table.
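A forwarding table can be modeled as sorted range boundaries mapped to shard names. This is a sketch; Gizzard's real table maps ranges of a hashed id space, and the shard names here are invented:

```python
import bisect

# Lower bound of each range -> shard; ranges need not be equal in size.
bounds = [0, 1000, 5000]
shards = ["shard_a", "shard_b", "shard_c"]

def lookup(key):
    # Find the rightmost range whose lower bound is <= key.
    i = bisect.bisect_right(bounds, key) - 1
    return shards[i]
```

A binary search over the boundaries makes each lookup O(log n) in the number of ranges, so routing stays cheap even with many shards.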
27. Partitioning
• Define a function Fun( id )
• Ranges do not have to be
equal in size
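Fun(id) is typically a hash of the id, so keys spread evenly across the range space even when the raw id space is skewed. A sketch under that assumption (CRC32 is a stand-in, not Gizzard's actual hash):

```python
import zlib

def fun(user_id: int) -> int:
    # Map the id into a fixed hash space; the forwarding table's
    # ranges then divide this space, and need not be equal in size.
    return zlib.crc32(str(user_id).encode()) % 2**16
```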
28. How does it work ?
Gizzard handles replication through a
replication tree
Each shard referenced in the forwarding table can be either a physical
shard or a logical shard.
A physical shard is a reference to a particular data storage back-end.
A logical shard is just a tree of other shards.
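The physical/logical distinction can be sketched as a small tree whose root the forwarding table points at. The class names and data are invented for illustration:

```python
class PhysicalShard:
    """Leaf: stands in for a particular data storage back-end."""
    def __init__(self, name):
        self.name, self.rows = name, []

    def write(self, row):
        self.rows.append(row)

class LogicalShard:
    """Interior node: forwards each write to every child shard."""
    def __init__(self, children):
        self.children = children

    def write(self, row):
        for child in self.children:
            child.write(row)

# The forwarding table would point at `root`; writes fan out to both replicas.
a, b = PhysicalShard("db_a"), PhysicalShard("db_b")
root = LogicalShard([a, b])
root.write(("nick", "robey"))
```

Because logical shards can contain other logical shards, arbitrary replication topologies fall out of the same two node types.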
30. How does it work ?
Gizzard is fault-tolerant
Gizzard is designed to avoid any single points of failure.
If a certain replica in a partition has crashed, Gizzard routes requests to
the remaining healthy replicas, bearing in mind the weighting function.
Writes to an unavailable shard are buffered until the shard again becomes
available.
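Buffering writes for an unavailable shard might look like the following toy model (no real retry or backoff logic; the class name is invented):

```python
class Replica:
    def __init__(self):
        self.up, self.rows, self.buffer = True, [], []

    def write(self, row):
        if self.up:
            self.rows.append(row)
        else:
            self.buffer.append(row)   # hold the write; nothing is lost

    def recover(self):
        self.up = True
        self.rows.extend(self.buffer)  # replay buffered writes in order
        self.buffer.clear()

r = Replica()
r.up = False          # simulate a crash
r.write("edge-1")     # buffered instead of failing
r.recover()           # replayed once the shard is back
```

Replaying buffered writes out of order is safe only because, as the next slides explain, writes are required to be idempotent and commutative.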
32. Write operations have to be idempotent and
commutative.
Example: A user quickly follows and unfollows me. How
is this write commutative?
Follow and unfollow translate to the same write event
to FlockDB, "set edge state to X". An update applies only
if the state on disk is older than the state in flight. So in
the case of follow then unfollow, it doesn't matter
which one is applied to MySQL first, the unfollow state
will always win as it is more recent.
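The "newer state wins" rule is just a timestamp compare; applying the same two operations in either delivery order yields the same final state. A sketch of that rule (dictionary stands in for the on-disk row):

```python
state = {}  # (src, dst) -> (timestamp, edge_state)

def set_edge(src, dst, timestamp, edge_state):
    # Idempotent and commutative: apply only if newer than what's stored.
    current = state.get((src, dst))
    if current is None or current[0] < timestamp:
        state[(src, dst)] = (timestamp, edge_state)

# Follow at t=1, unfollow at t=2, delivered in the "wrong" order:
set_edge("a", "b", 2, "removed")
set_edge("a", "b", 1, "normal")   # stale; ignored
```

Re-delivering either message changes nothing (idempotent), and swapping their order changes nothing (commutative), which is exactly what the buffered-replay fault handling relies on.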
33. How does Gizzard handle write
conflicts ?
Write conflicts occur when two manipulations of the same record try to
change it in differing ways, and Gizzard does not guarantee that
operations will apply in order.
As described above, write operations must be both idempotent and
commutative in order to avoid conflicts.
In many cases this is actually an easier requirement to meet than trying
to guarantee ordered delivery of messages with bounded latency and high availability.
44. Hadoop
• Hadoop Distributed File System (HDFS) - it breaks
each file you give it into 64- or 128-MB chunks called
blocks and sends them to different machines in the
cluster, replicating each block three times along the
way.
– LZO Compression
• MapReduce workflow system - it breaks analyses
over large data sets into small chunks that can
run in parallel across all (say) 100 machines.
Generates the precomputed view on which queries are
executed
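The block-splitting arithmetic and the map/reduce shape can be sketched together. This is a toy, single-process illustration; real HDFS and MapReduce distribute the blocks and tasks across machines:

```python
from collections import Counter
from functools import reduce

BLOCK = 64 * 2**20  # 64 MB block size

def num_blocks(file_size):
    # Ceiling division: a 200 MB file becomes 4 blocks of <= 64 MB,
    # and each block is then replicated three times across the cluster.
    return -(-file_size // BLOCK)

# Map phase: each chunk is processed independently (here, word counts);
# reduce phase: the per-chunk results are merged into one view.
chunks = ["to be or", "not to be"]
mapped = [Counter(c.split()) for c in chunks]
counts = reduce(lambda a, b: a + b, mapped)
```

The merged `counts` plays the role of the precomputed view: queries read the small merged result instead of rescanning the raw data.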
64. USP of Storm:
1. Guaranteed message processing.
2. Robust process management.
3. Fault detection and automatic reassignment.
4. Efficient message passing.
65. Monitoring popular queries
A Storm topology tracks statistics on search queries.
Queries are sent to human evaluators on Amazon's
Mechanical Turk, who categorize each query.
Machine-learning models evaluate the responses and
then push the information to back-end systems.
Notes:
• source_id and destination_id are unique user IDs, unless the graph stores favorite tweets, in which case the destination ID may be a tweet ID. Position is a timestamp.
• When users delete their accounts, their edges are put into the "archived" state, allowing them to be restored later. When an edge is deleted, the row isn't actually deleted from MySQL; it is just marked as being in the deleted state, which has the effect of moving the primary key.
• Unlike other stream processors, Storm is fault-tolerant and scalable. Storm can run a continuous query and stream the results to clients in real time.
• Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
• Storm guarantees messages will be processed even in the face of failures: it tracks the task tree; workers are controlled by a supervisor, so a task will never be orphaned, consuming memory; tasks send heartbeats to Nimbus; and there is no intermediate queuing, messages are transferred directly between tasks.