At Facebook, we use various types of databases and storage system to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data and we use TAO to provide consistency of this web-scale database across geographical distances. We use Haystack datastore for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of clicklogs and combine it with the power of Apache HBase to store all Facebook Messages.
This talk describes the reasons why each of these databases are appropriate for their workloads and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partitioning tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We briefly touch upon some futures of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional database.
1. Petabyte Scale Data at Facebook
Dhruba Borthakur ,
Engineer at Facebook,
BigDataCloud, Santa Clara, Oct 18 2012
2. Agenda
1 Types of Data
2 Data Model and API for Facebook Graph Data
3 SLTP (Semi-OLTP) data
4 Immutable data store for photos, videos, etc
5 Hadoop/Hive analytics storage
3.
4. Four major types of storage systems
▪ Online Transaction Processing Databases (OLTP)
▪ The Facebook Social Graph
▪ Semi-online Light Transaction Processing Databases (SLTP)
▪ Facebook Messages and Sensory Time Series Data
▪ Immutable DataStore
▪ Photos, videos, etc
▪ Analytics DataStore
▪ Data Warehouse, Logs storage
5. Size and Scale of Databases
Total Size Technology Bottlenecks
Single digit petabytes MySQL and TAO/ Random read IOPS
Facebook Graph memcache
Facebook
Write IOPS and
Messages and Tens of petabytes Hadoop HDFS and storage capacity
Time Series Hbase
Data
Facebook
High tens of Haystack storage capacity
Photos petabytes
Data
Hundreds of Hive, HDFS and storage capacity
Warehouse petabytes Hadoop
6. Characteristics
Query Consistency Durability
Latency
< few quickly No data loss
Facebook milliseconds consistent
Graph across data
centers
Facebook
consistent No data loss
Messages < 200 millisec within a data
and Time center
Series Data
Facebook
< 250 millisec immutable No data loss
Photos
Data
< 1 min not consistent No silent data
Warehouse across data loss
centers
8. Data model
Objects & Associations
18429207554
(page)
fan
name: Barack Obama
8636146 birthday: 08/04/1961
admin
(user) website: http://…
verified: 1
friend …
likes
liked by friend
604191769
(user)
6205972929
(story)
9. Facebook Social Graph: TAO and MySQL
An OLTP workload:
▪ Uneven read heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
11. Architecture
Sharding
▪ Object ids and Assoc id1s are mapped to shard ids
MySQL Storage
Web Servers TAO Cache
db2 db4
s1 s3
s5
db1 db3
db8
s2 s6
db7
s4 s7
s8
db5 db6
12. Workload
▪ Read-heavy workload
▪ Significant range queries
▪ LinkBench benchmark being open-sourced
▪ http://www.github.com/facebook/linkbench
▪ Real data distribution of Assocs and their access patterns
15. Why we chose Hbase and HDFS
▪ High write throughput
▪ Horizontal scalability
▪ Automatic Failover
▪ Strong consistency within a data center
▪ Benefits of HDFS : Fault tolerant, scalable, Map-Reduce toolset,
▪ Why is this SLTP?
▪ Semi-online: Queries run even if part of the database is offline
▪ Light Transactions: single row transactions
▪ Storage capacity bound rather than iops or cpu bound
16. What we store in HBase
▪ Small messages
▪ Message metadata (thread/message indices)
▪ Search index
▪ Large attachments stored in Haystack (photo store
17. Size and scale of Messages Database
▪ 6 Billion messages/day
▪ 74 Billion operations/day
▪ At peak: 1.5 million operations/sec
▪ 55% read, 45% write operations
▪ Average write operation inserts 16 records
▪ All data is lzo compressed
▪ Growing at 8 TB/day
19. Facebook Photo DataStore
2009 2012
Total Size 15 billion photos
1.5 Petabyte High tens of petabytes
30 million photos/day 300 million photos/day
Upload Rate 3 TB/day 30 TB/day
555K images/sec
Serving Rate
20. Haystack based Design
Haystack Store
Haystack
Directory
Web Haystack
Server Cache
Browser CDN
21. Haystack Internals
▪ Log structured, append-only object store
▪ Built on commodity hardware
▪ Application-aware replication
▪ Images stored in 100 GB xfs-files called needles
▪ An in-memory index for each needle file
▪ 32 bytes of index per photo
23. Life of a photo tag in Hadoop/Hive storage
Periodic
Analysis
(HIVE)
Adhoc
Analysis
(HIVE)
Daily report on count of Count photos tagged by
photo tags by country (1day) females age 20-25 yesterday
Scrapes
User info reaches
Hive
Warehouse
Warehouse (1day) MySQL
DB
copier/loader
Log line reaches
warehouse (15 min) User tags
a photo
puma
RealHme
AnalyHcs
(HBASE)
Scribe
Log
Storage
(HDFS)
www.facebook.com
Log line reaches Log line generated:
Count users tagging photos Scribeh (10s)
in the last hour (1min) <user_id, photo_id>
24. Analytics Data Growth(last 4 years)
Facebook Scribe Data/ Nodes in
Queries/Day Size (Total)
Users Day warehouse
Growth 14X 60X 250X 260X 2500X
25. Why use Hadoop instead of Parallel DBMS?
▪ Stonebraker/DeWitt from the
DBMS community:
▪ Quote “major step backwards”
▪ Published benchmark results
which show that Hive is not as
performant as a traditional DBMS
▪ http://database.cs.brown.edu/
projects/mapreduce-vs-dbms/
26. Why Hadoop/Hive? Prospecting for Gold..
▪ “Gold in the wild-west”
▪ A platform for huge data-experiments
▪ A majority of queries are searching
for a single gold nugget
▪ Great advantage in keeping all data in
one queryable system
▪ No structure to data, specify
structure at query time
27. Why Hadoop/Hive? Open to all users
▪ Open Warehouse
▪ No
need to take approval from DBA to
create new table
▪ This
is possible only because Hadoop is
horizontally scalable
▪ Provisioning
excess capacity is cheaper
than traditional databases
▪ Lower
$$/GB ($$/query-latency not
relevant here)
28. Why Hadoop/Hive? Crowd Sourcing
▪ Crowd Sourcing for data discovery
▪ There are 50K tables in a single
warehouse
▪ Users are DBAs themselves
▪ Questions about a table are directed
to users of that table
▪ Automatic query lineage tools help
here
29. Why Hadoop/Hive? Open Data Format
▪ No Lock-in
▪ Hive/Hadoop is open source
▪ Data is open-format, one can access
data below the database layer
▪ HBase can query the same data that
is in HDFS
▪ Want a new UDF? No problem, Very
easily extendable
30. Why Hadoop/Hive? Benchmarks
▪ Shortcomings of existing DBMS benchmarks
▪ most DBMS benchmarks measure latency
▪ Do not measure $$/Gigabyte-of-data
▪ Do not test fault tolerance – kill machines
▪ Do not measure elasticity – add and remove machines
▪ Do not measure throughput – concurrent queries in parallel
31. New trends in storage software
▪ Trends:
▪ SSDs cheaper, increasing number of CPUs per server
▪ How can BigData use SSDs?
▪ SATA disk capacities reaching 4 - 8 TB per disk, falling prices $/GB
▪ How can BigData become even bigger?