Petabyte Scale Data at Facebook


Dhruba Borthakur,
Engineer at Facebook
BigDataCloud, Santa Clara, Oct 18, 2012
Agenda

1. Types of Data
2. Data Model and API for Facebook Graph Data
3. SLTP (Semi-OLTP) data
4. Immutable data store for photos, videos, etc.
5. Hadoop/Hive analytics storage
Four major types of storage systems

▪ Online Transaction Processing Databases (OLTP)
  ▪ The Facebook Social Graph
▪ Semi-online Light Transaction Processing Databases (SLTP)
  ▪ Facebook Messages and Sensory Time Series Data
▪ Immutable DataStore
  ▪ Photos, videos, etc.
▪ Analytics DataStore
  ▪ Data Warehouse, Logs storage
Size and Scale of Databases

                            Total Size               Technology                Bottlenecks
  Facebook Graph            Single-digit petabytes   MySQL and TAO/memcache    Random read IOPS
  Facebook Messages and     Tens of petabytes        Hadoop HDFS and HBase     Write IOPS and storage capacity
  Time Series Data
  Facebook Photos           High tens of petabytes   Haystack                  Storage capacity
  Data Warehouse            Hundreds of petabytes    Hive, HDFS and Hadoop     Storage capacity
Characteristics

                            Query Latency          Consistency                           Durability
  Facebook Graph            < few milliseconds     Quickly consistent across             No data loss
                                                   data centers
  Facebook Messages and     < 200 milliseconds     Consistent within a data center       No data loss
  Time Series Data
  Facebook Photos           < 250 milliseconds     Immutable                             No data loss
  Data Warehouse            < 1 min                Not consistent across data centers    No silent data loss
Facebook Graph: Objects and Associations
Data model: Objects & Associations

[Diagram: an example subgraph. A page object (id 18429207554) carries attributes name: Barack Obama, birthday: 08/04/1961, website: http://…, verified: 1. A user object (id 8636146) connects to the page through "fan" and "admin" associations, to another user (id 604191769) through "friend" associations, and to a story object (id 6205972929) through "likes" / "liked by" associations.]
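As a rough illustration of this data model (a minimal sketch, not Facebook's actual TAO schema or API), objects can be thought of as typed nodes with key/value attributes, and associations as typed, timestamped directed edges:

```python
# Minimal sketch of the objects & associations model (illustrative only).
# Type names and field choices are assumptions, not the real TAO schema.
from dataclasses import dataclass, field
from typing import Dict
import time

@dataclass
class FbObject:
    id: int                     # 64-bit globally unique id, e.g. 18429207554
    otype: str                  # "user", "page", "story", ...
    data: Dict[str, str] = field(default_factory=dict)   # name, birthday, ...

@dataclass
class Assoc:
    id1: int                    # source object id
    atype: str                  # "fan", "admin", "friend", "likes", ...
    id2: int                    # destination object id
    time: float = field(default_factory=time.time)        # creation time
    data: Dict[str, str] = field(default_factory=dict)

# The slide's example subgraph expressed in this model:
obama_page = FbObject(18429207554, "page",
                      {"name": "Barack Obama", "birthday": "08/04/1961", "verified": "1"})
user_a = FbObject(8636146, "user")
edges = [
    Assoc(user_a.id, "fan", obama_page.id),
    Assoc(user_a.id, "friend", 604191769),
    Assoc(user_a.id, "likes", 6205972929),      # story object
    Assoc(6205972929, "liked_by", user_a.id),   # inverse edge
]
```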
Facebook Social Graph: TAO and MySQL
An OLTP workload:
▪ Uneven, read-heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
Architecture: Cache & Storage

[Diagram: web servers query the TAO storage cache, which is backed by MySQL storage.]
Architecture: Sharding

▪ Object ids and Assoc id1s are mapped to shard ids (sketched below)

[Diagram: web servers query TAO cache servers holding shards s1–s8; the shards map onto MySQL databases db1–db8, with several shards sharing a database.]
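A hedged sketch of the id-to-shard mapping the bullet describes; the hashing scheme, shard count, and shard-to-database assignment are assumptions for illustration, not TAO's actual implementation:

```python
# Illustrative-only sharding sketch: map object ids and assoc id1s to shard ids,
# then map shards onto MySQL databases. Counts and hashing are assumptions.
NUM_SHARDS = 8
DATABASES = ["db1", "db2", "db3", "db4", "db5", "db6", "db7", "db8"]

def shard_for(object_or_id1: int) -> int:
    """Every object id (and every association's id1) lands on exactly one shard."""
    return object_or_id1 % NUM_SHARDS        # hypothetical hash function

def database_for(shard_id: int) -> str:
    """Shards are assigned to MySQL databases; several shards may share one database."""
    return DATABASES[shard_id % len(DATABASES)]

# An association (id1, atype, id2) lives on id1's shard, so a query over one
# object's outbound edges touches a single shard:
id1 = 8636146
print(shard_for(id1), database_for(shard_for(id1)))
```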
Workload

▪ Read-heavy workload
  ▪ Significant range queries (sketched below)
▪ LinkBench benchmark being open-sourced
  ▪ http://www.github.com/facebook/linkbench
▪ Real data distribution of Assocs and their access patterns
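A hedged sketch of the association range-query pattern that LinkBench exercises: fetch a newest-first page of one object's outbound edges of a given type. The function name and signature loosely follow published descriptions of TAO, but this is illustrative, not the real API:

```python
# Illustrative range-query sketch (not the real TAO API).
from typing import List, Tuple

# (id1, atype, id2, creation_time) -- hypothetical in-memory edge list
Edge = Tuple[int, str, int, float]

def assoc_range(edges: List[Edge], id1: int, atype: str,
                offset: int = 0, limit: int = 50) -> List[Edge]:
    """Return up to `limit` edges matching (id1, atype, *), newest first."""
    matching = [e for e in edges if e[0] == id1 and e[1] == atype]
    matching.sort(key=lambda e: e[3], reverse=True)
    return matching[offset:offset + limit]

edges: List[Edge] = [
    (8636146, "likes", 6205972929, 1350500000.0),
    (8636146, "friend", 604191769, 1350400000.0),
    (8636146, "likes", 555000111, 1350600000.0),
]
print(assoc_range(edges, id1=8636146, atype="likes", limit=10))
```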
Messages & Time Series Database: SLTP workload

Facebook Messages

▪ Messages, Chats, Emails, SMS
Why we chose HBase and HDFS

▪ High write throughput
▪ Horizontal scalability
▪ Automatic failover
▪ Strong consistency within a data center
▪ Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset
▪ Why is this SLTP?
  ▪ Semi-online: queries run even if part of the database is offline
  ▪ Light transactions: single-row transactions (sketched below)
  ▪ Storage-capacity bound rather than IOPS or CPU bound
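A hedged sketch of what a single-row, multi-cell write looks like in HBase, using the happybase Python client; the table name, column family, and row-key layout are invented for illustration and are not Facebook's actual schema or client:

```python
# Illustrative only: one HBase put touches a single row but may set many cells
# atomically -- the "light transaction" the slide refers to.
# Host, table name, column family, and row-key layout are assumptions.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
table = connection.table("messages")

# A row key that groups one user's mailbox data together.
row_key = b"user:8636146:thread:42:msg:0001"
table.put(row_key, {
    b"m:sender":  b"604191769",
    b"m:subject": b"hello",
    b"m:body":    b"short message body stored inline in HBase",
})   # all cells in this put are applied atomically to the single row
```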
What we store in HBase
▪  Small   messages
▪  Message    metadata (thread/message indices)
▪  Search   index
▪  Large   attachments stored in Haystack (photo store
Size and scale of Messages Database

▪ 6 billion messages/day
▪ 74 billion operations/day (rough arithmetic below)
▪ At peak: 1.5 million operations/sec
▪ 55% read, 45% write operations
▪ Average write operation inserts 16 records
▪ All data is LZO compressed
▪ Growing at 8 TB/day
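For scale, a quick back-of-the-envelope check of those numbers; the figures come from the slide and the arithmetic is only illustrative:

```python
# Rough arithmetic on the slide's numbers (no new data).
ops_per_day = 74e9
seconds_per_day = 24 * 3600
avg_ops_per_sec = ops_per_day / seconds_per_day          # ~856K ops/sec on average
peak_ops_per_sec = 1.5e6                                  # slide's stated peak

write_fraction = 0.45
records_per_write = 16
records_per_day = ops_per_day * write_fraction * records_per_write   # ~533B records/day

print(f"average ~{avg_ops_per_sec/1e3:.0f}K ops/sec vs {peak_ops_per_sec/1e6:.1f}M at peak")
print(f"~{records_per_day/1e9:.0f} billion records inserted per day")
```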
Haystack: The Photo Store
Facebook Photo DataStore

                   2009                      2012

  Total Size       15 billion photos         High tens of petabytes
                   1.5 petabytes

  Upload Rate      30 million photos/day     300 million photos/day
                   3 TB/day                  30 TB/day

  Serving Rate     555K images/sec
Haystack-based Design

[Diagram: the browser fetches photos through a CDN backed by the Haystack Cache; web servers consult the Haystack Directory to locate photos, which are read from and written to the Haystack Store.]
Haystack Internals

▪ Log-structured, append-only object store
▪ Built on commodity hardware
▪ Application-aware replication
▪ Images stored as needles in 100 GB xfs files
▪ An in-memory index for each needle file (sketched below)
  ▪ 32 bytes of index per photo
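A hedged sketch of the in-memory index idea: one small record per photo mapping a photo key to its offset and size inside the large append-only file, so a read needs at most one disk seek. The field layout below is an assumption chosen to land on the slide's 32 bytes per photo, not the real Haystack format:

```python
# Illustrative only: per-photo in-memory index entry for a log-structured store.
# Field choices are assumptions sized to match "32 bytes of index per photo".
import struct

# key (8B) + alternate key (4B) + flags (4B) + offset in file (8B) + size (8B) = 32 bytes
ENTRY = struct.Struct("<QIIQQ")
print(ENTRY.size)                      # 32

index = {}                             # photo_key -> (offset, size), kept in RAM

def record_upload(photo_key: int, offset: int, size: int) -> bytes:
    """Photos are appended to the large store file; only the index lives in memory."""
    index[photo_key] = (offset, size)
    return ENTRY.pack(photo_key, 0, 0, offset, size)

# Back-of-the-envelope: 15 billion photos x 32 bytes ~ 480 GB of index
# spread across the whole fleet, i.e. a few GB per storage machine.
print(15e9 * 32 / 1e9, "GB of index across the cluster")
```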
Hive Hadoop Analytics Warehouse
Life of a photo tag in Hadoop/Hive storage

[Diagram: flow of a photo-tag log line through the analytics stack.]
▪ A user tags a photo on www.facebook.com; a log line <user_id, photo_id> is generated
▪ The log line reaches Scribe log storage (HDFS) in ~10 seconds
▪ Puma real-time analytics (HBase) answers "count users tagging photos in the last hour" with ~1 minute latency
▪ The copier/loader delivers the log line to the Hive warehouse in ~15 minutes
▪ Scrapes of user info from the MySQL databases reach the warehouse in ~1 day
▪ Periodic analysis (Hive): daily report on count of photo tags by country, ~1 day latency (query sketched below)
▪ Ad hoc analysis (Hive): e.g. count photos tagged by females age 20-25 yesterday
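A hedged sketch of what the periodic "photo tags by country" report could look like as a Hive query, issued here through the PyHive client; the host, table, and column names (photo_tags, dim_users, ds, country) are invented for illustration and are not Facebook's actual warehouse schema:

```python
# Illustrative only: a daily HiveQL aggregation like the "photo tags by country"
# report, run through PyHive. Host and table/column names are hypothetical.
from pyhive import hive

cursor = hive.connect(host="hive-gateway", port=10000).cursor()
cursor.execute("""
    SELECT u.country, COUNT(*) AS tag_count
    FROM photo_tags t
    JOIN dim_users u ON t.user_id = u.user_id
    WHERE t.ds = '2012-10-17'              -- daily partition of the tag logs
    GROUP BY u.country
    ORDER BY tag_count DESC
""")
for country, tag_count in cursor.fetchall():
    print(country, tag_count)
```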
Analytics Data Growth (last 4 years)

            Facebook Users   Queries/Day   Scribe Data/Day   Nodes in warehouse   Size (Total)
  Growth    14X              60X           250X              260X                 2500X
Why use Hadoop instead of a Parallel DBMS?

▪ Stonebraker/DeWitt from the DBMS community:
  ▪ Quote: "major step backwards"
  ▪ Published benchmark results which show that Hive is not as performant as a traditional DBMS
    ▪ http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
Why Hadoop/Hive? Prospecting for Gold

▪ "Gold in the wild west"
▪ A platform for huge data experiments
▪ A majority of queries are searching for a single gold nugget
▪ Great advantage in keeping all data in one queryable system
▪ No structure to data; structure is specified at query time (see the sketch below)
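A hedged illustration of "specify structure at query time": in Hive, a schema can be declared over raw files already sitting in HDFS and queried immediately. The host, path, table name, and columns below are invented for the example:

```python
# Illustrative only: schema-on-read with a Hive external table over raw HDFS logs.
# Host, path, table name, and columns are hypothetical.
from pyhive import hive

cursor = hive.connect(host="hive-gateway", port=10000).cursor()

# Declare structure over data that was written with no schema at all.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_tag_log (
        user_id  BIGINT,
        photo_id BIGINT,
        ts       BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/warehouse/scribe/photo_tags/'
""")

# The same files can now be queried without any load or conversion step.
cursor.execute("SELECT COUNT(DISTINCT user_id) FROM raw_tag_log")
print(cursor.fetchone())
```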
Why Hadoop/Hive? Open to all users

▪ Open warehouse
▪ No need to take approval from a DBA to create a new table
▪ This is possible only because Hadoop is horizontally scalable
▪ Provisioning excess capacity is cheaper than with traditional databases
▪ Lower $$/GB ($$/query-latency is not relevant here)
Why Hadoop/Hive? Crowd Sourcing

▪ Crowd sourcing for data discovery
▪ There are 50K tables in a single warehouse
▪ Users are DBAs themselves
▪ Questions about a table are directed to users of that table
▪ Automatic query-lineage tools help here
Why Hadoop/Hive? Open Data Format

▪ No lock-in
▪ Hive/Hadoop is open source
▪ Data is in an open format; one can access the data below the database layer
▪ HBase can query the same data that is in HDFS
▪ Want a new UDF? No problem; very easily extendable (see the sketch below)
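A hedged example of that extensibility: instead of writing a Java UDF, Hive's TRANSFORM clause can stream rows through an arbitrary script. The script, mapping, and table names here are invented for illustration:

```python
#!/usr/bin/env python
# Illustrative only: a custom per-row transform streamed through Hive's
# TRANSFORM clause (an alternative to writing a Java UDF).
# Invoked from HiveQL roughly like:
#   ADD FILE normalize_country.py;
#   SELECT TRANSFORM (user_id, country)
#   USING 'python normalize_country.py'
#   AS (user_id, country_code)
#   FROM dim_users;                        -- hypothetical table
import sys

ALIASES = {"usa": "US", "united states": "US", "u.s.": "US"}   # made-up mapping

for line in sys.stdin:
    user_id, country = line.rstrip("\n").split("\t")
    code = ALIASES.get(country.strip().lower(), country.upper()[:2])
    print(f"{user_id}\t{code}")            # Hive reads tab-separated rows back
```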
Why Hadoop/Hive? Benchmarks

▪ Shortcomings of existing DBMS benchmarks:
  ▪ Most DBMS benchmarks measure latency
  ▪ Do not measure $$/gigabyte-of-data
  ▪ Do not test fault tolerance (kill machines)
  ▪ Do not measure elasticity (add and remove machines)
  ▪ Do not measure throughput (concurrent queries in parallel)
New trends in storage software

▪ Trends:
  ▪ SSDs getting cheaper, increasing number of CPUs per server
    ▪ How can BigData use SSDs?
  ▪ SATA disk capacities reaching 4-8 TB per disk, falling prices $/GB
    ▪ How can BigData become even bigger?
Questions?
     dhruba@fb.com
http://hadoopblog.blogspot.com/
