A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook


Published on

At Facebook, we use various types of databases and storage system to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data and we use TAO to provide consistency of this web-scale database across geographical distances. We use Haystack datastore for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of clicklogs and combine it with the power of Apache HBase to store all Facebook Messages.

This talk describes the reasons why each of these databases are appropriate for their workloads and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partitioning tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We briefly touch upon some futures of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional database.

  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook

  1. Petabyte Scale Data at FacebookDhruba Borthakur ,Engineer at Facebook,BigDataCloud, Santa Clara, Oct 18 2012
  2. Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) data 4 Immutable data store for photos, videos, etc 5 Hadoop/Hive analytics storage
  3. Four major types of storage systems▪  Online Transaction Processing Databases (OLTP) ▪  The Facebook Social Graph▪  Semi-online Light Transaction Processing Databases (SLTP) ▪  Facebook Messages and Sensory Time Series Data▪  Immutable DataStore ▪  Photos, videos, etc▪  Analytics DataStore ▪  Data Warehouse, Logs storage
  4. Size and Scale of Databases Total Size Technology Bottlenecks Single digit petabytes MySQL and TAO/ Random read IOPS Facebook Graph memcache Facebook Write IOPS and Messages and Tens of petabytes Hadoop HDFS and storage capacity Time Series Hbase Data Facebook High tens of Haystack storage capacity Photos petabytes Data Hundreds of Hive, HDFS and storage capacity Warehouse petabytes Hadoop
  5. Characteristics Query Consistency Durability Latency < few quickly No data loss Facebook milliseconds consistent Graph across data centers Facebook consistent No data loss Messages < 200 millisec within a data and Time center Series Data Facebook < 250 millisec immutable No data loss Photos Data < 1 min not consistent No silent data Warehouse across data loss centers
  6. Facebook Graph: Objects and Associations
  7. Data modelObjects & Associations 18429207554 (page) fan name: Barack Obama 8636146 birthday: 08/04/1961 admin (user) website: http://… verified: 1 friend … likes liked by friend 604191769 (user) 6205972929 (story)
  8. Facebook Social Graph: TAO and MySQLAn OLTP workload:▪  Uneven read heavy workload▪  Huge working set with creation-time locality▪  Highly interconnected data▪  Constantly evolving▪  As consistent as possible
  9. ArchitectureCache & Storage MySQL Storage Web servers TAO Storage Cache
  10. ArchitectureSharding▪  Object ids and Assoc id1s are mapped to shard ids MySQL Storage Web Servers TAO Cache db2 db4 s1 s3 s5 db1 db3 db8 s2 s6 db7 s4 s7 s8 db5 db6
  11. Workload▪  Read-heavy workload ▪  Significant range queries▪  LinkBench benchmark being open-sourced ▪  http://www.github.com/facebook/linkbench▪  Real data distribution of Assocs and their access patterns
  12. Messages & Time Series Database SLTP workload
  13. Facebook Messages Messages Chats Emails SMS
  14. Why we chose Hbase and HDFS▪  High write throughput▪  Horizontal scalability▪  Automatic Failover▪  Strong consistency within a data center▪  Benefits of HDFS : Fault tolerant, scalable, Map-Reduce toolset,▪  Why is this SLTP? ▪  Semi-online: Queries run even if part of the database is offline ▪  Light Transactions: single row transactions ▪  Storage capacity bound rather than iops or cpu bound
  15. What we store in HBase▪  Small messages▪  Message metadata (thread/message indices)▪  Search index▪  Large attachments stored in Haystack (photo store
  16. Size and scale of Messages Database▪  6 Billion messages/day▪  74 Billion operations/day▪  At peak: 1.5 million operations/sec▪  55% read, 45% write operations▪  Average write operation inserts 16 records▪  All data is lzo compressed▪  Growing at 8 TB/day
  17. Haystack: The Photo Store
  18. Facebook Photo DataStore 2009 2012 Total Size 15 billion photos 1.5 Petabyte High tens of petabytes 30 million photos/day 300 million photos/day Upload Rate 3 TB/day 30 TB/day 555K images/sec Serving Rate
  19. Haystack based Design Haystack Store Haystack Directory Web Haystack Server Cache Browser CDN
  20. Haystack Internals▪  Log structured, append-only object store▪  Built on commodity hardware▪  Application-aware replication▪  Images stored in 100 GB xfs-files called needles▪  An in-memory index for each needle file ▪  32 bytes of index per photo
  21. Hive Hadoop Analytics Warehouse
  22. Life of a photo tag in Hadoop/Hive storage Periodic  Analysis  (HIVE)   Adhoc  Analysis  (HIVE)   Daily report on count of Count photos tagged by photo tags by country (1day) females age 20-25 yesterday Scrapes   User info reaches Hive  Warehouse   Warehouse (1day) MySQL  DB   copier/loader   Log line reaches warehouse (15 min) User tags a photo puma  RealHme  AnalyHcs  (HBASE)   Scribe  Log  Storage  (HDFS)   www.facebook.com   Log line reaches Log line generated: Count users tagging photos Scribeh (10s) in the last hour (1min) <user_id, photo_id>
  23. Analytics Data Growth(last 4 years) Facebook Scribe Data/ Nodes in Queries/Day Size (Total) Users Day warehouse Growth 14X 60X 250X 260X 2500X
  24. Why use Hadoop instead of Parallel DBMS?▪  Stonebraker/DeWitt from the DBMS community: ▪  Quote “major step backwards” ▪  Published benchmark results which show that Hive is not as performant as a traditional DBMS ▪  http://database.cs.brown.edu/ projects/mapreduce-vs-dbms/
  25. Why Hadoop/Hive? Prospecting for Gold..▪  “Gold in the wild-west”▪  A platform for huge data-experiments▪  A majority of queries are searching for a single gold nugget▪  Great advantage in keeping all data in one queryable system▪  No structure to data, specify structure at query time
  26. Why Hadoop/Hive? Open to all users▪  Open Warehouse▪  No need to take approval from DBA to create new table▪  This is possible only because Hadoop is horizontally scalable▪  Provisioning excess capacity is cheaper than traditional databases▪  Lower $$/GB ($$/query-latency not relevant here)
  27. Why Hadoop/Hive? Crowd Sourcing▪  Crowd Sourcing for data discovery▪  There are 50K tables in a single warehouse▪  Users are DBAs themselves▪  Questions about a table are directed to users of that table▪  Automatic query lineage tools help here
  28. Why Hadoop/Hive? Open Data Format▪  No Lock-in▪  Hive/Hadoop is open source▪  Data is open-format, one can access data below the database layer▪  HBase can query the same data that is in HDFS▪  Want a new UDF? No problem, Very easily extendable
  29. Why Hadoop/Hive? Benchmarks▪  Shortcomings of existing DBMS benchmarks ▪  most DBMS benchmarks measure latency ▪  Do not measure $$/Gigabyte-of-data ▪  Do not test fault tolerance – kill machines ▪  Do not measure elasticity – add and remove machines ▪  Do not measure throughput – concurrent queries in parallel
  30. New trends in storage software▪  Trends: ▪  SSDs cheaper, increasing number of CPUs per server ▪  How can BigData use SSDs? ▪  SATA disk capacities reaching 4 - 8 TB per disk, falling prices $/GB ▪  How can BigData become even bigger?
  31. Questions? dhruba@fb.comhttp://hadoopblog.blogspot.com/