Twitter case study

Term paper presented by:
• Akhtar S.Quereshi
• Anurag Arora
• Divya Gandhi
• Nishant Goyal
DDBMS term paper 1

Twitter: a tale of big data!
3 years, 2 months and 1 day. The time it took from the first Tweet to the billionth Tweet.
1 week. The time it took for users to send a billion Tweets in 2011.
50 million. The average number of Tweets people sent per day, 2010.
140 million. The average number of Tweets people sent per day, February 2011.
177 million. Tweets sent on March 11, 2011.
500 million. Tweets sent per day, October 2012.
572,000. Number of new accounts created on March 12, 2011.
460,000. Average number of new accounts per day over February 2011.

Real-time challenge

Agenda
• Managing social graphs: FlockDB
• Sharding: Gizzard
• Real-time data processing and storage: Hadoop/Storm

FlockDB: built on top of MySQL
Maintaining the social graph and processing queries

Challenges
• The timeline must rapidly go through the list of
users someone is *following* and quickly display all
their tweets (sorted by recency).
• Answer queries like "What's the intersection of
people I follow and people who are following
President Obama?"
• Handle heavy write traffic as followers are added or
removed.

These features are difficult to
implement in a traditional
relational database.

What is FlockDB?
• FlockDB is a distributed graph database for storing
adjacency lists.
• Optimized not for graph traversal but for very large
adjacency lists and fast reads and writes.
• It is able to support:
– a high rate of add/update/remove operations.
– potentially complex set arithmetic queries.
– paging through query result sets containing millions
of entries.
– ability to "archive" and later restore archived edges.

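How does FlockDB deal with these challenges?
• FlockDB stores all information as edge
attributes in the graph.
• The four major attributes of each edge in the adjacency list:
– source ID
– destination ID
– position (used for sorting, e.g. a timestamp)
– state (e.g. normal, removed, archived)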

• Each edge is actually stored twice.
forward: Nick follows Robey at 9:54 today.
backward: Robey is followed by Nick at 9:54 today.
• "Who follows me?" is just as efficient as
"Who do I follow?"
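
The double-write idea can be sketched in a few lines of Python. This is a toy in-memory illustration only, not FlockDB's actual schema or API; the class name `EdgeStore`, its methods and the integer positions are all hypothetical.

```python
from collections import defaultdict


class EdgeStore:
    """Toy in-memory store that writes every edge in both directions."""

    def __init__(self):
        # forward[source_id][destination_id] = position
        self.forward = defaultdict(dict)
        # backward[destination_id][source_id] = position
        self.backward = defaultdict(dict)

    def follow(self, source_id, destination_id, position):
        # One logical edge, two physical rows.
        self.forward[source_id][destination_id] = position
        self.backward[destination_id][source_id] = position

    def following(self, user_id):
        # "Who do I follow?" reads only the forward index, newest first.
        row = self.forward.get(user_id, {})
        return sorted(row, key=row.get, reverse=True)

    def followers(self, user_id):
        # "Who follows me?" reads only the backward index, newest first.
        row = self.backward.get(user_id, {})
        return sorted(row, key=row.get, reverse=True)


store = EdgeStore()
store.follow("nick", "robey", position=954)
store.follow("nick", "obama", position=1001)
print(store.following("nick"))
print(store.followers("robey"))
```

Both questions become a single lookup keyed by one user, which is why neither direction is more expensive than the other; the cost is that every write touches two rows.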

"What's the intersection of people I follow and
people who are following President Obama?"
This can be answered quickly by decomposing it into single-user
queries such as "Who is following President Obama?"
• Data is partitioned by node, so each of these queries can
be answered by a single partition, using an
indexed range query.
• Paging through long result sets is done by using the
position field (timestamp) as a cursor.
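
A minimal sketch of both ideas, assuming each single-user query has already returned its result: the intersection is plain set arithmetic, and paging uses the position of the last row seen as the cursor. The function `page`, the sample data and the assumption that positions are unique are all illustrative, not FlockDB's real interface.

```python
def page(edges, cursor=None, limit=2):
    """Return one page of edges plus the cursor for the next page.

    `edges` is a list of (position, user_id) pairs sorted by position,
    newest first. Positions are assumed to be unique. The cursor is the
    position of the last row already returned.
    """
    if cursor is not None:
        edges = [edge for edge in edges if edge[0] < cursor]
    chunk = edges[:limit]
    # A full page may have more rows behind it; a short page is the last one.
    next_cursor = chunk[-1][0] if len(chunk) == limit else None
    return chunk, next_cursor


# Decompose the compound question into two single-user queries,
# then intersect the two result sets.
people_i_follow = {"alice", "bob", "carol"}
obama_followers = {"bob", "carol", "dave"}
common = sorted(people_i_follow & obama_followers)

# Page through a long follower list two rows at a time.
follower_edges = [(50, "eve"), (40, "dave"), (30, "carol"), (20, "bob"), (10, "alice")]
first_page, cursor = page(follower_edges)
second_page, cursor = page(follower_edges, cursor)
third_page, cursor = page(follower_edges, cursor)
```

Using the position as the cursor means a page request never has to count or skip rows; it only needs an indexed range scan starting below the cursor value.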

The Gizzard framework is used to query
the FlockDB distributed datastore
and to handle the partitioning layer.

What is ‘sharding’?

Sharding
= Partitioning + Replication
The problem is: sharding is difficult.
Determining smart partitioning schemes for
particular kinds of data requires a lot of
thought. And even more difficult is ensuring
that all of the copies of the data are consistent
despite unreliable communication and
occasional computer failures.

Sharding
The advantages of sharding are:
• High availability
• Faster Queries
How is sharding different than
traditional architectures?

How is sharding different than
traditional architectures?
• Data are parallelized across many datastores
• Data are more highly available.
• It doesn't rely on master-slave replication to scale
• Data are denormalized

Gizzard
Gizzard is a framework that offers a basic
template for solving a certain class of problem.

Gizzard
Here are some key features of Gizzard:
• Gizzard supports any data storage backend
• Gizzard handles partitioning through a forwarding table
• Gizzard is middleware
• Gizzard handles replication through a replication tree
• Gizzard is fault-tolerant
• Gizzard supports migrations
• Gizzard handles write conflicts

How does
‘Gizzard’ work?

How does it work?
Gizzard is middleware
It sits “in the middle” between clients (web front-ends such as PHP and Ruby
on Rails applications) and the many partitions and replicas of data, so
all data manipulation flows through Gizzard.

Architecture
Web/App Server → Gizzard (stateless) → MySQL

How does it work?
Gizzard handles partitioning through a
forwarding table
Gizzard handles partitioning by mapping ranges of data to particular
shards.
These mappings are stored in a forwarding table.

Partitioning
• Define a function Fun( id )
• Ranges do not have to be
equal
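
A range-based forwarding table can be sketched as a sorted list of lower bounds searched with binary search. This is a minimal illustration, not Gizzard's own code: the table contents, the shard names and the identity function `fun` are assumptions made for the example.

```python
import bisect

# Each entry is (lower bound of the range, shard name), sorted by lower bound.
# The ranges are deliberately unequal in size.
FORWARDING_TABLE = [
    (0, "shard_a"),       # covers 0 .. 999
    (1_000, "shard_b"),   # covers 1_000 .. 49_999
    (50_000, "shard_c"),  # covers 50_000 and above
]
LOWER_BOUNDS = [lower for lower, _ in FORWARDING_TABLE]


def fun(record_id):
    """Hypothetical mapping function Fun(id); here simply the identity."""
    return record_id


def shard_for(record_id):
    """Find the shard whose range contains Fun(id)."""
    key = fun(record_id)
    # The rightmost lower bound that is <= key identifies the range.
    index = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    if index < 0:
        raise KeyError(f"no shard covers key {key}")
    return FORWARDING_TABLE[index][1]


print(shard_for(42))
print(shard_for(1_000))
print(shard_for(75_000))
```

Because only the lower bounds are stored, a hot range can be given a narrower slice of the key space simply by adding another entry to the table.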

How does it work?
Gizzard handles replication through a
replication tree
Each shard referenced in the forwarding table can be either a physical
shard or a logical shard.
A physical shard is a reference to a particular data storage back-end.
A logical shard is just a tree of other shards.
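
The physical/logical distinction can be sketched as a small tree of objects. The class names `PhysicalShard` and `ReplicatingShard` and the rule of reading from the first child are simplifying assumptions for illustration, not Gizzard's actual classes or read policy.

```python
class PhysicalShard:
    """Leaf node: stands in for one concrete storage back-end."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows[key]


class ReplicatingShard:
    """Logical shard: an inner tree node whose children are other shards."""

    def __init__(self, children):
        self.children = children

    def write(self, key, value):
        # Writes fan out to every child, physical or logical.
        for child in self.children:
            child.write(key, value)

    def read(self, key):
        # Simplification: always read from the first child.
        return self.children[0].read(key)


# A two-level tree: one local replica plus a nested group of two replicas.
local = PhysicalShard("local")
remote_1 = PhysicalShard("remote_1")
remote_2 = PhysicalShard("remote_2")
tree = ReplicatingShard([local, ReplicatingShard([remote_1, remote_2])])

tree.write("user:1", "nick")
```

Since logical and physical shards expose the same `read` and `write` methods, the forwarding table can point at either kind without the caller knowing the difference.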

Partitioning
• Logical shard tree
• Define a replication policy:
read-only, write-only,
replicate

How does it work?
Gizzard is fault-tolerant
Gizzard is designed to avoid any single points of failure.
If a certain replica in a partition has crashed, Gizzard routes requests to
the remaining healthy replicas, bearing in mind the weighting function.
Writes to an unavailable shard are buffered until the shard again becomes
available.
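
The behaviour described above can be sketched as follows. This is a simplified illustration that ignores the weighting function; the names `Replica`, `ReplicaSet` and `recover` are hypothetical and do not come from Gizzard.

```python
class Replica:
    """One copy of a partition; `available` simulates a crash."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.rows = {}


class ReplicaSet:
    def __init__(self, replicas):
        self.replicas = replicas
        # Writes that could not be delivered, kept per replica.
        self.pending = {replica.name: [] for replica in replicas}

    def write(self, key, value):
        for replica in self.replicas:
            if replica.available:
                replica.rows[key] = value
            else:
                # Buffer the write until the replica comes back.
                self.pending[replica.name].append((key, value))

    def read(self, key):
        # Route the read to the first healthy replica.
        for replica in self.replicas:
            if replica.available:
                return replica.rows[key]
        raise RuntimeError("no healthy replica available")

    def recover(self, replica):
        # Mark the replica healthy and replay its buffered writes.
        replica.available = True
        for key, value in self.pending[replica.name]:
            replica.rows[key] = value
        self.pending[replica.name].clear()


primary, secondary = Replica("primary"), Replica("secondary")
replica_set = ReplicaSet([primary, secondary])

secondary.available = False           # simulate a crash
replica_set.write("edge:1", "normal")
rows_while_down = dict(secondary.rows)
value_read = replica_set.read("edge:1")
replica_set.recover(secondary)
```

Replaying buffered writes later means they can arrive out of order relative to newer writes, which is one reason the writes themselves need to be idempotent and commutative.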

How does
‘Gizzard’ handle
write conflicts?

Write operations have to be idempotent and
commutative.
Example: A user quickly follows and unfollows me. How
is this write commutative?
Follow and unfollow translate to the same write event
to FlockDB, "set edge state to X". An update applies only
if the state on disk is older than the state in flight. So in
the case of follow then unfollow, it doesn't matter
which one is applied to MySQL first, the unfollow state
will always win as it is more recent.
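
The "newest state wins" rule can be sketched directly. The function `apply_write`, the dictionary standing in for MySQL and the integer timestamps are illustrative assumptions; only the rule itself comes from the text above.

```python
def apply_write(store, edge, state, timestamp):
    """Set edge state to `state`, but only if this write is newer than what is stored."""
    current = store.get(edge)
    if current is None or current[1] < timestamp:
        store[edge] = (state, timestamp)


follow = (("nick", "robey"), "normal", 100)     # follow at t=100
unfollow = (("nick", "robey"), "removed", 101)  # unfollow at t=101

# Apply the two writes in both possible orders.
in_order = {}
apply_write(in_order, *follow)
apply_write(in_order, *unfollow)

reversed_order = {}
apply_write(reversed_order, *unfollow)
apply_write(reversed_order, *follow)

# Apply the same write twice.
repeated = {}
apply_write(repeated, *unfollow)
apply_write(repeated, *unfollow)
```

All three dictionaries end in the same state, which shows the write is commutative (order does not matter) and idempotent (repeating it does not matter).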

How does Gizzard handle write
conflicts?
Write conflicts occur when two manipulations of the same record try to
change the record in differing ways.
Gizzard does not guarantee that operations will be applied in order, so,
as described, write operations must be both idempotent and commutative
in order to avoid conflicts.
In many cases this is actually an easier requirement than trying to guarantee
ordered delivery of messages with bounded latency and high availability.

Migration
Migrating from Datastore A to
Datastore A'

Twitter’s real time data
processing and storage
needs.
What type of data system
does it need?

Hadoop
• Hadoop Distributed File System (HDFS): it breaks
each file you give it into 64- or 128-MB chunks called
blocks and sends them to different machines in the
cluster, replicating each block three times along the
way.
– LZO compression
• MapReduce workflow system: it breaks analyses
over large sets of data into small chunks which can be
done in parallel across all (say) 100 machines.
It generates the precomputed views on which queries are
executed.
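
The map-then-reduce pattern can be sketched in plain Python, assuming a tiny list of (user, tweet) records stands in for a large dataset. This is not the Hadoop API: the chunks are processed sequentially here, whereas a real cluster would hand each one to a different machine, and all function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the input into small chunks that could be processed independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count tweets per user within a single chunk."""
    return Counter(user for user, _text in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def tweets_per_user(records, chunk_size=2):
    partial_counts = [map_chunk(chunk) for chunk in split_into_chunks(records, chunk_size)]
    return reduce(reduce_counts, partial_counts, Counter())


records = [
    ("nick", "hello"),
    ("robey", "hi"),
    ("nick", "big data"),
    ("nick", "sharding"),
    ("robey", "storm"),
]
view = tweets_per_user(records)
print(view)
```

The resulting `view` plays the role of a precomputed view: later queries such as "how many tweets has nick sent?" read it directly instead of rescanning the raw records.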

Storm topology

Example Query: streaming word count
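
A streaming word count can be sketched with Python generators standing in for a source stage and two processing stages. This is not the Storm API and it runs in a single process; the stage names below simply borrow Storm's "spout" and "bolt" vocabulary for illustration.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source stage: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Processing stage: splits each sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Processing stage: emits the running count each time a word arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


topology = count_bolt(split_bolt(sentence_spout(["the cow", "the dog"])))
updates = list(topology)
print(updates)
```

Unlike the batch approach, each word updates the count as soon as it arrives, so the result is always current rather than recomputed periodically.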

USP of Storm:
1. Guaranteed message processing.
2. Robust process management.
3. Fault detection and automatic reassignment.
4. Efficient message passing.

Monitoring popular queries
• A Storm topology tracks statistics on
search queries.
• Popular queries are sent to human evaluators
on Amazon’s Mechanical Turk, who answer questions
about each query and categorize it.
• Machine learning models
evaluate the responses and
then push the information to
back-end systems.

References
1. Twitter Engineering blog.
2. GitHub forums.

Queries please!!
Thank you!
