Term paper presented by:
• Akhtar S.Quereshi
• Anurag Arora
• Divya Gandhi
• Nishant Goyal
DDBMS term paper 1
Twitter tale of big data!
3 years, 2 months and 1 day. The time it took from the first Tweet to the billionth Tweet.
1 week. The time it took for users to send a billion Tweets in 2011.
50 million. The average number of Tweets people sent per day, 2010.
140 million. The average number of Tweets people sent per day, February 2011.
177 million. Tweets sent on March 11, 2011.
Half a billion tweets sent per day in Oct 2012.
572,000. Number of new accounts created on March 12, 2011.
460,000. Average number of new accounts per day over February 2011.
Real-time challenge
Agenda
• Managing social graphs: FlockDB
• Sharding: Gizzard
• Real-time data processing and storage: Hadoop/Storm
FlockDB- built over MySQL
Maintaining social graph and query processing
Challenges
• The timeline needs to rapidly go through the list of users someone is *following* and quickly display all their tweets (sorted by recency).
• Answer queries like "What's the intersection of people I follow and people who are following President Obama?"
• Handle heavy write traffic, as followers are added or removed.
These features are difficult to
implement in a traditional
relational database.
What is FlockDB?
• FlockDB is a distributed graph database for storing
adjacency lists.
• Optimized not for graph traversal but for very large
adjacency lists and fast reads/writes.
• It is able to support:
– a high rate of add/update/remove operations.
– potentially complex set arithmetic queries.
– paging through query result sets containing millions
of entries.
– ability to "archive" and later restore archived edges.
How does FlockDB deal with these challenges?
• FlockDB database stores all information as edge
attributes in the graph.
• The four major attributes in the adjacency list
• Each edge is actually stored twice.
forward: Nick follows Robey at 9:54 today.
backward: Robey is followed by Nick at 9:54 today.
• "Who follows me?" is just as efficient as "Who do I follow?"
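A minimal sketch of the "stored twice" idea, assuming in-memory dictionaries in place of FlockDB's MySQL tables. The `EdgeStore` class, its method names and the integer `position` are hypothetical illustrations, not FlockDB's API.

```python
from collections import defaultdict


class EdgeStore:
    """Toy adjacency-list store that keeps every edge in both directions."""

    def __init__(self):
        # forward[a] maps each user that a follows to the time of the follow
        self.forward = defaultdict(dict)
        # backward[b] maps each follower of b to the time of the follow
        self.backward = defaultdict(dict)

    def follow(self, source, destination, position):
        # One logical edge is written as two physical rows.
        self.forward[source][destination] = position
        self.backward[destination][source] = position

    def following(self, user):
        # "Who do I follow?" reads only the forward rows of this user.
        return sorted(self.forward[user])

    def followers(self, user):
        # "Who follows me?" reads only the backward rows of this user.
        return sorted(self.backward[user])


store = EdgeStore()
store.follow("nick", "robey", position=954)
print(store.following("nick"))   # ['robey']
print(store.followers("robey"))  # ['nick']
```

Both questions become a lookup keyed on a single user, which is why neither direction is more expensive than the other.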
"What's the intersection of people I follow and
people who are following President Obama?"
This can be answered quickly by decomposing it into single-user queries such as "Who is following President Obama?"
• Data is partitioned by node, so these queries can each be answered by a single partition, using an indexed range query.
• Paging through long result sets is done by using the position field (timestamp) as a cursor.
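The decomposition and the cursor-based paging can be sketched as follows. This is an illustration under assumptions: each single-user result is a plain list of `(position, user)` pairs sorted newest first, positions are unique, and the function names `page` and `all_users` are invented for the example.

```python
def page(edges, cursor=None, count=2):
    """Return up to `count` users older than `cursor`, plus the next cursor.

    `edges` is a list of (position, user) pairs sorted newest first;
    the position of the last row returned doubles as the paging cursor.
    """
    if cursor is not None:
        edges = [edge for edge in edges if edge[0] < cursor]
    chunk = edges[:count]
    # A short page means the result set is exhausted.
    next_cursor = chunk[-1][0] if len(chunk) == count else None
    return [user for _, user in chunk], next_cursor


def all_users(edges, count=2):
    """Drain one single-user query page by page."""
    users, cursor = [], None
    while True:
        batch, cursor = page(edges, cursor, count)
        users.extend(batch)
        if cursor is None:
            return users


# Two single-user queries, each answerable by one partition.
i_follow = [(30, "carol"), (20, "bob"), (10, "alice")]
follows_obama = [(25, "dave"), (15, "bob"), (5, "carol")]

both = set(all_users(i_follow)) & set(all_users(follows_obama))
print(sorted(both))  # ['bob', 'carol']
```

The set arithmetic happens only after each partition has answered its own query, so no partition needs to know about the other.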
The Gizzard framework is used to query the FlockDB distributed datastore
and to handle the partitioning layer.
What’s ‘Sharding’
Sharding
= Partitioning + Replication
The problem is: sharding is difficult.
Determining smart partitioning schemes for
particular kinds of data requires a lot of
thought. And even more difficult is ensuring
that all of the copies of the data are consistent
despite unreliable communication and
occasional computer failures.
Sharding
The advantages of sharding are:
• High availability
• Faster Queries
How is sharding different than
traditional architectures?
• Data are parallelized across many datastores
• Data are more highly available.
• It doesn't rely on master-to-slave replication to scale
• Data are denormalized
Gizzard
Gizzard is a framework that offers a basic
template for solving a certain class of problem.
Key features of Gizzard:
• Gizzard supports any data storage back-end
• Gizzard handles partitioning through a forwarding table
• Gizzard is middleware
• Gizzard handles replication through a replication tree
• Gizzard is fault-tolerant
• Gizzard supports migrations
• Gizzard handles write conflicts
How does Gizzard work?
Gizzard is middleware
It sits "in the middle" between clients (web front-ends such as PHP and Ruby
on Rails applications) and the many partitions and replicas of data, so
all data manipulation flows through Gizzard.
Architecture
Web/App Server → Gizzard (stateless) → MySQL
Gizzard handles partitioning through a
forwarding table
Gizzard handles partitioning by mapping ranges of data to particular
shards. These mappings are stored in a forwarding table.
Partitioning
• Define a function Fun( id )
• Ranges do not have to be
equal
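A forwarding-table lookup can be sketched as below. This is a hypothetical illustration rather than Gizzard's own code: the table contents, the shard names and the identity `fun` are assumptions chosen only to show unequal ranges.

```python
import bisect

# Hypothetical forwarding table: (lower bound of the range, shard name),
# sorted by lower bound. The ranges are deliberately unequal in size.
FORWARDING_TABLE = [
    (0, "shard_a"),
    (1_000, "shard_b"),
    (50_000, "shard_c"),
]
LOWER_BOUNDS = [low for low, _ in FORWARDING_TABLE]


def fun(record_id):
    """Stand-in for the slide's Fun(id); here it is simply the identity."""
    return record_id


def shard_for(record_id):
    """Find the shard whose range contains fun(record_id)."""
    key = fun(record_id)
    # Index of the last lower bound that is <= key.
    index = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    if index < 0:
        raise KeyError(f"no shard covers key {key}")
    return FORWARDING_TABLE[index][1]


print(shard_for(999))     # shard_a
print(shard_for(1_000))   # shard_b
print(shard_for(75_000))  # shard_c
```

Because only lower bounds are stored, a crowded range can be split by adding one row to the table without touching the others.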
Gizzard handles replication through a
replication tree
Each shard referenced in the forwarding table can be either a physical
shard or a logical shard.
A physical shard is a reference to a particular data storage back-end
A logical shard is just a tree of other shards.
Partitioning
• Logical sharding tree
• Define a replication policy: read-only, write-only, replicate
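The tree of logical and physical shards might be modelled roughly as follows. The class names are invented for this sketch and a dictionary stands in for a real storage back-end; a read-only wrapper could be written the same way as the write-only one.

```python
class PhysicalShard:
    """Leaf of the tree: stands in for one storage back-end (a dict here)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows[key]


class WriteOnlyShard:
    """Logical shard that accepts writes but refuses reads."""

    def __init__(self, child):
        self.child = child

    def write(self, key, value):
        self.child.write(key, value)

    def read(self, key):
        raise PermissionError("write-only shard")


class ReplicatingShard:
    """Logical shard that writes to every child and reads from the first that answers."""

    def __init__(self, children):
        self.children = children

    def write(self, key, value):
        for child in self.children:
            child.write(key, value)

    def read(self, key):
        for child in self.children:
            try:
                return child.read(key)
            except (KeyError, PermissionError):
                continue  # try the next replica
        raise KeyError(key)


a, b = PhysicalShard("a"), PhysicalShard("b")
tree = ReplicatingShard([WriteOnlyShard(a), b])
tree.write(1, "edge")
print(a.rows, b.rows)  # {1: 'edge'} {1: 'edge'}
print(tree.read(1))    # edge
```

In this example the read is served by `b`, because `a` is wrapped as write-only; that arrangement is one way a new replica could be filled before it starts serving reads.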
Gizzard is fault-tolerant
Gizzard is designed to avoid any single points of failure.
If a certain replica in a partition has crashed, Gizzard routes requests to
the remaining healthy replicas, bearing in mind the weighting function.
Writes to an unavailable shard are buffered until the shard again becomes
available.
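The routing and buffering behaviour described above can be sketched as follows. It is a simplified illustration under assumptions: the `Replica` class, the per-replica `pending` queue and the use of the weight as a random-selection weight are inventions of this example, not Gizzard's implementation.

```python
import random
from collections import deque


class Replica:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.healthy = True
        self.rows = {}
        self.pending = deque()  # writes buffered while the replica is down


def pick_replica(replicas, rng=random):
    """Choose a healthy replica for a read, favouring higher weights."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy replica")
    return rng.choices(healthy, weights=[r.weight for r in healthy], k=1)[0]


def write(replicas, key, value):
    """Apply the write where possible and buffer it for crashed replicas."""
    for replica in replicas:
        if replica.healthy:
            replica.rows[key] = value
        else:
            replica.pending.append((key, value))


def recover(replica):
    """Mark the replica healthy again and replay its buffered writes."""
    replica.healthy = True
    while replica.pending:
        key, value = replica.pending.popleft()
        replica.rows[key] = value


r1, r2 = Replica("r1", weight=3), Replica("r2", weight=1)
r1.healthy = False                  # simulate a crash
write([r1, r2], "k", "v")
print(pick_replica([r1, r2]).name)  # r2
recover(r1)
print(r1.rows)                      # {'k': 'v'}
```

Replaying buffered writes after recovery is only safe if the order of application does not matter, which is the requirement the write-conflict slides address.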
How does Gizzard handle write conflicts?
Write operations have to be idempotent and
commutative.
Example: A user quickly follows and unfollows me. How
is this write commutative?
Follow and unfollow translate to the same write event
to FlockDB, "set edge state to X". An update applies only
if the state on disk is older than the state in flight. So in
the case of follow then unfollow, it doesn't matter
which one is applied to MySQL first, the unfollow state
will always win as it is more recent.
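The "newest state wins" rule can be checked with a few lines of code. This is a sketch under assumptions: an edge is reduced to a `(state, updated_at)` pair, and the state names and timestamps are illustrative.

```python
def apply_write(stored, incoming):
    """Keep whichever (state, updated_at) pair is more recent.

    The incoming write is applied only if the stored state is older.
    """
    if stored is None or incoming[1] > stored[1]:
        return incoming
    return stored


follow = ("normal", 100)     # the user follows at t=100
unfollow = ("removed", 101)  # the user unfollows at t=101

in_order = apply_write(apply_write(None, follow), unfollow)
reordered = apply_write(apply_write(None, unfollow), follow)
replayed = apply_write(in_order, unfollow)

print(in_order)                           # ('removed', 101)
print(in_order == reordered == replayed)  # True
```

`reordered` shows commutativity (the order of application does not change the result) and `replayed` shows idempotence (applying the same write twice changes nothing).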
A write conflict occurs when two manipulations of the same record try to
change it in differing ways.
Gizzard does not guarantee that operations will be applied in order, so,
as described above, write operations must be both idempotent and commutative
to avoid conflicts.
In many cases this is actually an easier requirement to meet than guaranteeing
ordered delivery of messages with bounded latency and high availability.
Migration
Migrating from Datastore A to
Datastore A'
Twitter’s real time data
processing and storage
needs.
What type of data system
does it need?
Hadoop
• Hadoop Distributed File System (HDFS): it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way.
– LZO compression
• MapReduce workflow system: it breaks analyses over large sets of data into small chunks that can be done in parallel across all (say) 100 machines. It generates the precomputed view on which queries are executed.
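The map and reduce steps can be imitated in plain Python to show the shape of the computation. This is not Hadoop code: the hashtag-counting job, the tiny block size and the function names are assumptions made for the example, and the blocks are processed sequentially here rather than on separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Cut the input into fixed-size blocks, as HDFS does with files."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """Map step: count hashtags in one block, independently of the others."""
    return Counter(
        word for tweet in block for word in tweet.split() if word.startswith("#")
    )


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


tweets = ["#storm is fast", "learning #hadoop", "#hadoop and #storm", "no tags here"]
blocks = split_into_blocks(tweets, block_size=2)
partials = [map_block(block) for block in blocks]  # parallel on a real cluster
view = reduce(reduce_counts, partials, Counter())

print(view["#hadoop"], view["#storm"])  # 2 2
```

`view` plays the role of the precomputed view: queries read the finished counts instead of rescanning the raw tweets.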
Storm topology
Example Query: streaming word count
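The streaming word count can be imitated with Python generators to show how data flows through a topology. This sketch does not use Storm's API: the stage names borrow Storm's spout and bolt vocabulary, the sentences are made up, and the stages run in a single process instead of as parallel tasks.

```python
from collections import Counter


def sentence_spout():
    """Source of the stream; a real spout would read from a queue or feed."""
    for sentence in ["the cow jumped", "the moon"]:
        yield sentence


def split_bolt(sentences):
    """Turn each incoming sentence into a stream of words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def count_bolt(words):
    """Keep a running count and emit an update for every word seen."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


updates = list(count_bolt(split_bolt(sentence_spout())))
print(updates[3])   # ('the', 2)
print(updates[-1])  # ('moon', 1)
```

Unlike the batch view produced by Hadoop, the counts here are updated as each word arrives, so a reader of the stream always sees the latest value.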
USP of Storm:
1. Guaranteed message processing.
2. Robust process management.
3. Fault detection and automatic reassignment.
4. Efficient message passing.
Monitoring popular queries
• A Storm topology tracks statistics on search queries.
• Popular queries are sent to human evaluators on Amazon's Mechanical Turk, who answer questions that categorize the query.
• Machine learning models evaluate the responses and then push the information to back-end systems.
Queries please!!
Thank you!

The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Twitter case study

  • 1. Term paper presented by: • Akhtar S.Quereshi • Anurag Arora • Divya Gandhi • Nishant Goyal DDBMS term paper 1
  • 2. Twitter tale of big data! 3 years, 2 months and 1 day. The time it took from the first Tweet to the billionth Tweet. 1 week. The time it took for users to send a billion Tweets in 2011. 50 million. The average number of Tweets people sent per day, 2010. 140 million. The average number of Tweets people sent per day, February 2011. 177 million. Tweets sent on March 11, 2011. Half a billion tweets sent per day in Oct 2012. 572,000. Number of new accounts created on March 12, 2011. 460,000. Average number of new accounts per day over February 2011.
  • 5. Agenda of ppt • Managing social graphs: FlockDB • Sharding: Gizzard • Real-time data processing/storing: Hadoop/Storm
  • 6. FlockDB: built over MySQL. Maintaining the social graph and query processing
  • 8. Challenges • The timeline needs to rapidly go through a user's *following* list and quickly display all their tweets (sorted by recency) • Answer queries like "What's the intersection of people I follow and people who are following President Obama?" • Handle heavy write traffic, as followers are added or removed.
  • 10. These features are difficult to implement in a traditional relational database.
  • 11. What is FlockDB? • FlockDB is a distributed graph database for storing adjacency lists. • Optimized not for graph traversal but for very large adjacency lists and fast reads/writes. • It is able to support: – a high rate of add/update/remove operations. – potentially complex set arithmetic queries. – paging through query result sets containing millions of entries. – the ability to "archive" and later restore archived edges.
  • 12. How does FlockDB deal with these challenges? • The FlockDB database stores all information as edge attributes in the graph. • The four major attributes in the adjacency list: source ID, destination ID, position, and state.
  • 13. • Each edge is actually stored twice. Forward: Nick follows Robey at 9:54 today. Backward: Robey is followed by Nick at 9:54 today. • "Who follows me?" is just as efficient as "Who do I follow?"
  • 14. "What's the intersection of people I follow and people who are following President Obama?" This can be answered quickly by decomposing it into single-user queries such as "Who is following President Obama?" • Data is partitioned by node, so these queries can each be answered by a single partition, using an indexed range query. • Paging through long result sets is done by using the position field (timestamp) as a cursor.
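
The double storage and cursor paging described on slides 13 and 14 can be sketched in a few lines. This is a toy in-memory model, not FlockDB's actual schema or API: the class `EdgeStore`, its method names, and the integer positions are all illustrative assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


class EdgeStore:
    """Toy in-memory stand-in for doubly stored edges (hypothetical, not FlockDB's API)."""

    def __init__(self):
        # Each index maps a user id to a sorted list of (position, other_user_id).
        self.forward = defaultdict(list)   # source -> users the source follows
        self.backward = defaultdict(list)  # destination -> users following the destination

    def follow(self, source_id, destination_id, position):
        # Store the edge twice so both directions are single-key lookups.
        self._insert(self.forward[source_id], (position, destination_id))
        self._insert(self.backward[destination_id], (position, source_id))

    @staticmethod
    def _insert(rows, row):
        rows.insert(bisect_right(rows, row), row)

    def following(self, user_id):
        return {other for _, other in self.forward[user_id]}

    def followers(self, user_id):
        return {other for _, other in self.backward[user_id]}

    def followers_page(self, user_id, cursor=-1, count=2):
        """Return up to `count` followers with position > cursor, plus the next cursor."""
        rows = self.backward[user_id]
        positions = [position for position, _ in rows]
        start = bisect_right(positions, cursor)
        page = rows[start:start + count]
        next_cursor = page[-1][0] if page else cursor
        return [other for _, other in page], next_cursor


store = EdgeStore()
store.follow("nick", "robey", 954)
store.follow("me", "robey", 955)
store.follow("me", "amy", 956)
store.follow("nick", "obama", 957)
store.follow("robey", "obama", 958)
store.follow("amy", "obama", 959)

# The intersection query decomposes into two single-user lookups.
mutual = store.following("me") & store.followers("obama")

# Paging uses the position of the last row seen as the cursor.
first_page, cursor = store.followers_page("obama", count=2)
second_page, cursor = store.followers_page("obama", cursor=cursor, count=2)
```

In a partitioned deployment each of the two lookups would go to the partition that owns the user in question; here both indexes simply live in one process.
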
  • 15. The Gizzard framework is used to query the FlockDB distributed datastore and to handle the partitioning layer.
  • 17. Sharding = Partitioning + Replication. The problem is: sharding is difficult. Determining smart partitioning schemes for particular kinds of data requires a lot of thought. And even more difficult is ensuring that all of the copies of the data are consistent despite unreliable communication and occasional computer failures.
  • 18. Sharding The advantages of sharding are: • High availability • Faster queries How is sharding different than traditional architectures?
  • 19. How is sharding different than traditional architectures? • Data are parallelized across many datastores • Data are more highly available • It doesn't rely on master-to-slave replication as the way to scale (copies of the data are still kept, as slide 17 notes) • Data are denormalized
  • 21. Gizzard Gizzard is a framework that offers a basic template for solving a certain class of problem.
  • 22. Gizzard Here are some key features of Gizzard: • Gizzard supports any data storage backend • Gizzard handles partitioning through a forwarding table • Gizzard is middleware • Gizzard handles replication through a replication tree • Gizzard is fault-tolerant • Gizzard supports migrations • Gizzard handles write conflicts
  • 24. How does it work? Gizzard is middleware. It sits "in the middle" between clients (web front-ends like PHP and Ruby on Rails applications) and the many partitions and replicas of data, so all data manipulation flows through Gizzard.
  • 26. How does it work? Gizzard handles partitioning through a forwarding table. It maps ranges of data to particular shards, and these mappings are stored in the forwarding table.
  • 27. Partitioning • Define a function Fun(id) • Ranges do not have to be equal in size
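
A forwarding table of this kind can be sketched as a sorted list of range lower bounds with a binary search on lookup. The table contents, the shard names, and the identity implementation of `fun` below are illustrative assumptions, not Gizzard's actual data or code.

```python
from bisect import bisect_right

# Hypothetical forwarding table: (lower bound of the range, shard name), sorted by bound.
# The ranges are deliberately unequal in size.
FORWARDING_TABLE = [
    (0, "shard_a"),
    (1_000, "shard_b"),
    (50_000, "shard_c"),
]
LOWER_BOUNDS = [bound for bound, _ in FORWARDING_TABLE]


def fun(record_id: int) -> int:
    """Stand-in for the slide's Fun(id); identity here, but it could be a hash."""
    return record_id


def shard_for(record_id: int) -> str:
    """Find the shard whose range contains fun(record_id)."""
    key = fun(record_id)
    if key < LOWER_BOUNDS[0]:
        raise ValueError(f"no range covers key {key}")
    # The owning range is the last one whose lower bound is <= key.
    index = bisect_right(LOWER_BOUNDS, key) - 1
    return FORWARDING_TABLE[index][1]
```

Because only lower bounds are stored, splitting a hot range amounts to adding one more row to the table.
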
  • 28. How does it work? Gizzard handles replication through a replication tree. Each shard referenced in the forwarding table can be either a physical shard or a logical shard. A physical shard is a reference to a particular data storage back-end. A logical shard is just a tree of other shards.
  • 29. Partitioning • Logical sharding tree • Define replication policy: Read-Only, Write-Only, Replicate
  • 30. How does it work? Gizzard is fault-tolerant. Gizzard is designed to avoid any single points of failure. If a certain replica in a partition has crashed, Gizzard routes requests to the remaining healthy replicas, bearing in mind the weighting function. Writes to an unavailable shard are buffered until the shard again becomes available.
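
The replication tree and the fault-tolerance behaviour on slides 28 to 30 can be illustrated with a small sketch. The class names `PhysicalShard` and `ReplicatingShard`, the `available` flag, and the `retry_pending` method are assumptions made for this example; the weighting function mentioned on the slide is left out, so replicas are simply tried in order.

```python
class PhysicalShard:
    """Leaf of the tree: stands in for one data storage back-end."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.available = True  # flipped to False to simulate a crash

    def write(self, key, value):
        if not self.available:
            raise ConnectionError(self.name)
        self.rows[key] = value

    def read(self, key):
        if not self.available:
            raise ConnectionError(self.name)
        return self.rows.get(key)


class ReplicatingShard:
    """Logical shard: a tree node whose children are physical or logical shards."""

    def __init__(self, children):
        self.children = children
        self.pending = []  # (child, key, value) writes buffered for unavailable children

    def write(self, key, value):
        for child in self.children:
            try:
                child.write(key, value)
            except ConnectionError:
                # Buffer the write instead of failing the whole operation.
                self.pending.append((child, key, value))

    def read(self, key):
        for child in self.children:
            try:
                return child.read(key)  # first healthy replica answers
            except ConnectionError:
                continue
        raise ConnectionError("no healthy replica")

    def retry_pending(self):
        """Replay buffered writes; keep the ones whose shard is still down."""
        still_pending = []
        for child, key, value in self.pending:
            try:
                child.write(key, value)
            except ConnectionError:
                still_pending.append((child, key, value))
        self.pending = still_pending
```

Replayed writes can arrive after newer ones, which is why this sketch is only safe for writes that tolerate reordering.
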
  • 31. How does Gizzard handle write conflicts?
  • 32. Write operations have to be idempotent and commutative. Example: a user quickly follows and unfollows me. How is this write commutative? Follow and unfollow translate to the same write event to FlockDB, "set edge state to X". An update applies only if the state on disk is older than the state in flight. So in the case of follow then unfollow, it doesn't matter which one is applied to MySQL first; the unfollow state will always win, as it is more recent.
  • 33. How does Gizzard handle write conflicts? Write conflicts occur when two manipulations to the same record try to change the record in differing ways. Because Gizzard does not guarantee that operations will apply in order, write operations must, as described, be both idempotent and commutative in order to avoid conflicts. This is actually an easier requirement in many cases than trying to guarantee ordered delivery of messages with bounded latency and high availability.
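
The "apply only if the stored state is older" rule can be checked with a short sketch. The function `apply_write`, the state names, and the integer timestamps are illustrative assumptions; the sketch also assumes no two writes share a timestamp.

```python
from itertools import permutations


def apply_write(row, write):
    """Apply "set edge state to X" only if the write is newer than the stored row.

    Both `row` and `write` are (state, updated_at) pairs; `row` is None when
    nothing has been stored yet.
    """
    _, updated_at = write
    if row is None or updated_at > row[1]:
        return write
    return row


follow = ("normal", 100)     # hypothetical state name and timestamp
unfollow = ("removed", 101)  # issued later, so it must win


def final_state(writes):
    row = None
    for write in writes:
        row = apply_write(row, write)
    return row


# Commutative: every delivery order ends in the same state.
outcomes = {final_state(order) for order in permutations([follow, unfollow])}

# Idempotent: redelivering a write changes nothing.
redelivered = final_state([follow, unfollow, follow, unfollow, unfollow])
```

Both `outcomes` and `redelivered` end on the unfollow write, whichever order the writes arrive in and however often they are repeated.
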
  • 34. Migration Migrating from Datastore A to Datastore A'
  • 35. Twitter's real-time data processing and storage needs. What type of data system does it need?
  • 44. Hadoop • Hadoop Distributed File System (HDFS): it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way. – LZO compression • MapReduce workflow system: it breaks analyses over large sets of data into small chunks which can be done in parallel across all 100 (say) machines. It generates the precomputed view on which queries are executed.
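
The block arithmetic on the Hadoop slide is easy to make concrete. The helper `hdfs_footprint` below is a hypothetical name for a back-of-the-envelope calculation, not part of any Hadoop API; it assumes the final block is not padded, so raw storage is the file size times the replication factor.

```python
MB = 1024 * 1024
GB = 1024 * MB


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, block replicas stored, raw bytes stored)."""
    # Ceiling division: a partial final block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    replicas = blocks * replication
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


# A 1 GB file with 128 MB blocks versus 64 MB blocks.
large_blocks = hdfs_footprint(1 * GB)
small_blocks = hdfs_footprint(1 * GB, block_size_bytes=64 * MB)
```

With these inputs a 1 GB file becomes 8 blocks (24 replicas) at 128 MB or 16 blocks (48 replicas) at 64 MB, and occupies 3 GB of raw storage either way.
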
  • 63. Example query: streaming word count
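
The original slide showed this query as a diagram. As a rough illustration of the idea only, the sketch below chains Python generators in the shape of a source and two processing stages; it does not use Storm's API, and the names `sentence_spout`, `split_bolt`, and `count_bolt` are assumptions borrowed from Storm's spout/bolt vocabulary.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source stage: emits sentences one at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Splits each incoming sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Keeps running counts and emits an update for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


updates = list(count_bolt(split_bolt(sentence_spout(["the cow jumped", "the moon"]))))
```

Each update is emitted as soon as its word arrives, which is the difference from a batch word count that reports totals only after reading all input.
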
  • 64. USP of Storm: 1. Guaranteed message processing. 2. Robust process management. 3. Fault detection and automatic reassignment. 4. Efficient message passing.
  • 65. Monitoring popular queries. A Storm topology tracks statistics on search queries. Queries are sent to human evaluators on Amazon's Mechanical Turk, who categorize the query. Machine learning models evaluate the responses and then push information to back-end systems.
  • 66. References 1. Twitter Engineering blog. 2. GitHub forums.
  • 67. Queries please!! Thank you!

Editor's Notes

  1. source_id/destination_id is a unique user ID, unless the graph is the one storing favorite tweets, in which case the destination ID may be a tweet ID. Position is a timestamp. For example, when users delete their accounts, their edges are put into the "archived" state, allowing them to be restored later. When an edge is deleted, the row isn't actually deleted from MySQL; it's just marked as being in the deleted state, which has the effect of moving the primary key.
  2. Data is partitioned by node, so these queries can each be answered by a single partition, using an indexed range query.
  3. Unlike others, it is fault-tolerant and scalable. Storm can do a continuous query and stream the results to clients in real time.
  4. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
  5. 1. Tracks the task tree. 2. Workers are controlled by the supervisor, so a task will never be orphaned, sucking up memory. 3. Tasks heartbeat to Nimbus. 4. No intermediate queuing: messages are transferred directly between tasks. Storm guarantees messages will be processed even in the face of failures.