Petabyte Scale Data at Facebook


Dhruba Borthakur,
Engineer at Facebook
BigDataCloud, Santa Clara, Oct 18, 2012
Agenda

1. Types of Data
2. Data Model and API for Facebook Graph Data
3. SLTP (Semi-OLTP) data
4. Immutable data store for photos, videos, etc.
5. Hadoop/Hive analytics storage
Four major types of storage systems

▪ Online Transaction Processing Databases (OLTP)
  ▪ The Facebook Social Graph
▪ Semi-online Light Transaction Processing Databases (SLTP)
  ▪ Facebook Messages and Sensory Time Series Data
▪ Immutable DataStore
  ▪ Photos, videos, etc.
▪ Analytics DataStore
  ▪ Data Warehouse, Logs storage
Size and Scale of Databases

                            Total Size               Technology                Bottlenecks
  Facebook Graph            Single-digit petabytes   MySQL and TAO/memcache    Random read IOPS
  Facebook Messages and     Tens of petabytes        Hadoop HDFS and HBase     Write IOPS and storage capacity
  Time Series Data
  Facebook Photos           High tens of petabytes   Haystack                  Storage capacity
  Data Warehouse            Hundreds of petabytes    Hive, HDFS and Hadoop     Storage capacity
Characteristics

                            Query Latency          Consistency                           Durability
  Facebook Graph            < few milliseconds     Quickly consistent across             No data loss
                                                   data centers
  Facebook Messages and     < 200 milliseconds     Consistent within a data center       No data loss
  Time Series Data
  Facebook Photos           < 250 milliseconds     Immutable                             No data loss
  Data Warehouse            < 1 min                Not consistent across data centers    No silent data loss
Facebook Graph: Objects and Associations
Data model: Objects & Associations

[Diagram: an example subgraph. A page object (id 18429207554) carries attributes name: Barack Obama, birthday: 08/04/1961, website: http://…, verified: 1. A user object (id 8636146) connects to the page through "fan" and "admin" associations, to another user (id 604191769) through "friend" associations, and to a story object (id 6205972929) through "likes" / "liked by" associations.]
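As a rough illustration of this data model (a minimal sketch, not Facebook's actual TAO schema or API), objects can be thought of as typed nodes with key/value attributes, and associations as typed, timestamped directed edges:

```python
# Minimal sketch of the objects & associations model (illustrative only).
# Type names and field choices are assumptions, not the real TAO schema.
from dataclasses import dataclass, field
from typing import Dict
import time

@dataclass
class FbObject:
    id: int                     # 64-bit globally unique id, e.g. 18429207554
    otype: str                  # "user", "page", "story", ...
    data: Dict[str, str] = field(default_factory=dict)   # name, birthday, ...

@dataclass
class Assoc:
    id1: int                    # source object id
    atype: str                  # "fan", "admin", "friend", "likes", ...
    id2: int                    # destination object id
    time: float = field(default_factory=time.time)        # creation time
    data: Dict[str, str] = field(default_factory=dict)

# The slide's example subgraph expressed in this model:
obama_page = FbObject(18429207554, "page",
                      {"name": "Barack Obama", "birthday": "08/04/1961", "verified": "1"})
user_a = FbObject(8636146, "user")
edges = [
    Assoc(user_a.id, "fan", obama_page.id),
    Assoc(user_a.id, "friend", 604191769),
    Assoc(user_a.id, "likes", 6205972929),      # story object
    Assoc(6205972929, "liked_by", user_a.id),   # inverse edge
]
```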
Facebook Social Graph: TAO and MySQL
An OLTP workload:
▪ Uneven, read-heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
Architecture: Cache & Storage

[Diagram: web servers query the TAO storage cache, which is backed by MySQL storage.]
Architecture: Sharding

▪ Object ids and Assoc id1s are mapped to shard ids (sketched below)

[Diagram: web servers query TAO cache servers holding shards s1–s8; the shards map onto MySQL databases db1–db8, with several shards sharing a database.]
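A hedged sketch of the id-to-shard mapping the bullet describes; the hashing scheme, shard count, and shard-to-database assignment are assumptions for illustration, not TAO's actual implementation:

```python
# Illustrative-only sharding sketch: map object ids and assoc id1s to shard ids,
# then map shards onto MySQL databases. Counts and hashing are assumptions.
NUM_SHARDS = 8
DATABASES = ["db1", "db2", "db3", "db4", "db5", "db6", "db7", "db8"]

def shard_for(object_or_id1: int) -> int:
    """Every object id (and every association's id1) lands on exactly one shard."""
    return object_or_id1 % NUM_SHARDS        # hypothetical hash function

def database_for(shard_id: int) -> str:
    """Shards are assigned to MySQL databases; several shards may share one database."""
    return DATABASES[shard_id % len(DATABASES)]

# An association (id1, atype, id2) lives on id1's shard, so a query over one
# object's outbound edges touches a single shard:
id1 = 8636146
print(shard_for(id1), database_for(shard_for(id1)))
```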
Workload

▪ Read-heavy workload
  ▪ Significant range queries (sketched below)
▪ LinkBench benchmark being open-sourced
  ▪ http://www.github.com/facebook/linkbench
▪ Real data distribution of Assocs and their access patterns
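A hedged sketch of the association range-query pattern that LinkBench exercises: fetch a newest-first page of one object's outbound edges of a given type. The function name and signature loosely follow published descriptions of TAO, but this is illustrative, not the real API:

```python
# Illustrative range-query sketch (not the real TAO API).
from typing import List, Tuple

# (id1, atype, id2, creation_time) -- hypothetical in-memory edge list
Edge = Tuple[int, str, int, float]

def assoc_range(edges: List[Edge], id1: int, atype: str,
                offset: int = 0, limit: int = 50) -> List[Edge]:
    """Return up to `limit` edges matching (id1, atype, *), newest first."""
    matching = [e for e in edges if e[0] == id1 and e[1] == atype]
    matching.sort(key=lambda e: e[3], reverse=True)
    return matching[offset:offset + limit]

edges: List[Edge] = [
    (8636146, "likes", 6205972929, 1350500000.0),
    (8636146, "friend", 604191769, 1350400000.0),
    (8636146, "likes", 555000111, 1350600000.0),
]
print(assoc_range(edges, id1=8636146, atype="likes", limit=10))
```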
Messages & Time Series Database: SLTP workload

Facebook Messages

▪ Messages, Chats, Emails, SMS
Why we chose HBase and HDFS

▪ High write throughput
▪ Horizontal scalability
▪ Automatic failover
▪ Strong consistency within a data center
▪ Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset
▪ Why is this SLTP?
  ▪ Semi-online: queries run even if part of the database is offline
  ▪ Light transactions: single-row transactions (sketched below)
  ▪ Storage-capacity bound rather than IOPS or CPU bound
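A hedged sketch of what a single-row, multi-cell write looks like in HBase, using the happybase Python client; the table name, column family, and row-key layout are invented for illustration and are not Facebook's actual schema or client:

```python
# Illustrative only: one HBase put touches a single row but may set many cells
# atomically -- the "light transaction" the slide refers to.
# Host, table name, column family, and row-key layout are assumptions.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
table = connection.table("messages")

# A row key that groups one user's mailbox data together.
row_key = b"user:8636146:thread:42:msg:0001"
table.put(row_key, {
    b"m:sender":  b"604191769",
    b"m:subject": b"hello",
    b"m:body":    b"short message body stored inline in HBase",
})   # all cells in this put are applied atomically to the single row
```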
What we store in HBase
▪  Small   messages
▪  Message    metadata (thread/message indices)
▪  Search   index
▪  Large   attachments stored in Haystack (photo store
Size and scale of Messages Database

▪ 6 billion messages/day
▪ 74 billion operations/day (rough arithmetic below)
▪ At peak: 1.5 million operations/sec
▪ 55% read, 45% write operations
▪ Average write operation inserts 16 records
▪ All data is LZO compressed
▪ Growing at 8 TB/day
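For scale, a quick back-of-the-envelope check of those numbers; the figures come from the slide and the arithmetic is only illustrative:

```python
# Rough arithmetic on the slide's numbers (no new data).
ops_per_day = 74e9
seconds_per_day = 24 * 3600
avg_ops_per_sec = ops_per_day / seconds_per_day          # ~856K ops/sec on average
peak_ops_per_sec = 1.5e6                                  # slide's stated peak

write_fraction = 0.45
records_per_write = 16
records_per_day = ops_per_day * write_fraction * records_per_write   # ~533B records/day

print(f"average ~{avg_ops_per_sec/1e3:.0f}K ops/sec vs {peak_ops_per_sec/1e6:.1f}M at peak")
print(f"~{records_per_day/1e9:.0f} billion records inserted per day")
```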
Haystack: The Photo Store
Facebook Photo DataStore

                   2009                      2012

  Total Size       15 billion photos         High tens of petabytes
                   1.5 petabytes

  Upload Rate      30 million photos/day     300 million photos/day
                   3 TB/day                  30 TB/day

  Serving Rate     555K images/sec
Haystack-based Design

[Diagram: the browser fetches photos through a CDN backed by the Haystack Cache; web servers consult the Haystack Directory to locate photos, which are read from and written to the Haystack Store.]
Haystack Internals

▪ Log-structured, append-only object store
▪ Built on commodity hardware
▪ Application-aware replication
▪ Images stored as needles in 100 GB xfs files
▪ An in-memory index for each needle file (sketched below)
  ▪ 32 bytes of index per photo
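A hedged sketch of the in-memory index idea: one small record per photo mapping a photo key to its offset and size inside the large append-only file, so a read needs at most one disk seek. The field layout below is an assumption chosen to land on the slide's 32 bytes per photo, not the real Haystack format:

```python
# Illustrative only: per-photo in-memory index entry for a log-structured store.
# Field choices are assumptions sized to match "32 bytes of index per photo".
import struct

# key (8B) + alternate key (4B) + flags (4B) + offset in file (8B) + size (8B) = 32 bytes
ENTRY = struct.Struct("<QIIQQ")
print(ENTRY.size)                      # 32

index = {}                             # photo_key -> (offset, size), kept in RAM

def record_upload(photo_key: int, offset: int, size: int) -> bytes:
    """Photos are appended to the large store file; only the index lives in memory."""
    index[photo_key] = (offset, size)
    return ENTRY.pack(photo_key, 0, 0, offset, size)

# Back-of-the-envelope: 15 billion photos x 32 bytes ~ 480 GB of index
# spread across the whole fleet, i.e. a few GB per storage machine.
print(15e9 * 32 / 1e9, "GB of index across the cluster")
```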
Hive Hadoop Analytics Warehouse
Life of a photo tag in Hadoop/Hive storage

[Diagram: flow of a photo-tag log line through the analytics stack.]
▪ A user tags a photo on www.facebook.com; a log line <user_id, photo_id> is generated
▪ The log line reaches Scribe log storage (HDFS) in ~10 seconds
▪ Puma real-time analytics (HBase) answers "count users tagging photos in the last hour" with ~1 minute latency
▪ The copier/loader delivers the log line to the Hive warehouse in ~15 minutes
▪ Scrapes of user info from the MySQL databases reach the warehouse in ~1 day
▪ Periodic analysis (Hive): daily report on count of photo tags by country, ~1 day latency (query sketched below)
▪ Ad hoc analysis (Hive): e.g. count photos tagged by females age 20-25 yesterday
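A hedged sketch of what the periodic "photo tags by country" report could look like as a Hive query, issued here through the PyHive client; the host, table, and column names (photo_tags, dim_users, ds, country) are invented for illustration and are not Facebook's actual warehouse schema:

```python
# Illustrative only: a daily HiveQL aggregation like the "photo tags by country"
# report, run through PyHive. Host and table/column names are hypothetical.
from pyhive import hive

cursor = hive.connect(host="hive-gateway", port=10000).cursor()
cursor.execute("""
    SELECT u.country, COUNT(*) AS tag_count
    FROM photo_tags t
    JOIN dim_users u ON t.user_id = u.user_id
    WHERE t.ds = '2012-10-17'              -- daily partition of the tag logs
    GROUP BY u.country
    ORDER BY tag_count DESC
""")
for country, tag_count in cursor.fetchall():
    print(country, tag_count)
```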
Analytics Data Growth (last 4 years)

            Facebook Users   Queries/Day   Scribe Data/Day   Nodes in warehouse   Size (Total)
  Growth    14X              60X           250X              260X                 2500X
Why use Hadoop instead of a Parallel DBMS?

▪ Stonebraker/DeWitt from the DBMS community:
  ▪ Quote: "major step backwards"
  ▪ Published benchmark results which show that Hive is not as performant as a traditional DBMS
    ▪ http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
Why Hadoop/Hive? Prospecting for Gold

▪ "Gold in the wild west"
▪ A platform for huge data experiments
▪ A majority of queries are searching for a single gold nugget
▪ Great advantage in keeping all data in one queryable system
▪ No structure to data; structure is specified at query time (see the sketch below)
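A hedged illustration of "specify structure at query time": in Hive, a schema can be declared over raw files already sitting in HDFS and queried immediately. The host, path, table name, and columns below are invented for the example:

```python
# Illustrative only: schema-on-read with a Hive external table over raw HDFS logs.
# Host, path, table name, and columns are hypothetical.
from pyhive import hive

cursor = hive.connect(host="hive-gateway", port=10000).cursor()

# Declare structure over data that was written with no schema at all.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_tag_log (
        user_id  BIGINT,
        photo_id BIGINT,
        ts       BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/warehouse/scribe/photo_tags/'
""")

# The same files can now be queried without any load or conversion step.
cursor.execute("SELECT COUNT(DISTINCT user_id) FROM raw_tag_log")
print(cursor.fetchone())
```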
Why Hadoop/Hive? Open to all users

▪ Open warehouse
▪ No need to take approval from a DBA to create a new table
▪ This is possible only because Hadoop is horizontally scalable
▪ Provisioning excess capacity is cheaper than with traditional databases
▪ Lower $$/GB ($$/query-latency is not relevant here)
Why Hadoop/Hive? Crowd Sourcing

▪ Crowd sourcing for data discovery
▪ There are 50K tables in a single warehouse
▪ Users are DBAs themselves
▪ Questions about a table are directed to users of that table
▪ Automatic query-lineage tools help here
Why Hadoop/Hive? Open Data Format

▪ No lock-in
▪ Hive/Hadoop is open source
▪ Data is in an open format; one can access the data below the database layer
▪ HBase can query the same data that is in HDFS
▪ Want a new UDF? No problem; very easily extendable (see the sketch below)
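A hedged example of that extensibility: instead of writing a Java UDF, Hive's TRANSFORM clause can stream rows through an arbitrary script. The script, mapping, and table names here are invented for illustration:

```python
#!/usr/bin/env python
# Illustrative only: a custom per-row transform streamed through Hive's
# TRANSFORM clause (an alternative to writing a Java UDF).
# Invoked from HiveQL roughly like:
#   ADD FILE normalize_country.py;
#   SELECT TRANSFORM (user_id, country)
#   USING 'python normalize_country.py'
#   AS (user_id, country_code)
#   FROM dim_users;                        -- hypothetical table
import sys

ALIASES = {"usa": "US", "united states": "US", "u.s.": "US"}   # made-up mapping

for line in sys.stdin:
    user_id, country = line.rstrip("\n").split("\t")
    code = ALIASES.get(country.strip().lower(), country.upper()[:2])
    print(f"{user_id}\t{code}")            # Hive reads tab-separated rows back
```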
Why Hadoop/Hive? Benchmarks

▪ Shortcomings of existing DBMS benchmarks:
  ▪ Most DBMS benchmarks measure latency
  ▪ Do not measure $$/gigabyte-of-data
  ▪ Do not test fault tolerance (kill machines)
  ▪ Do not measure elasticity (add and remove machines)
  ▪ Do not measure throughput (concurrent queries in parallel)
New trends in storage software

▪ Trends:
  ▪ SSDs getting cheaper, increasing number of CPUs per server
    ▪ How can BigData use SSDs?
  ▪ SATA disk capacities reaching 4-8 TB per disk, falling prices $/GB
    ▪ How can BigData become even bigger?
Questions?
     dhruba@fb.com
http://hadoopblog.blogspot.com/
