Facebook’s Approach to Big Data
Storage Challenge


Weiyan Wang
Software Engineer (Data Infrastructure – HStore)
March 1, 2013
Agenda
1   Data Warehouse Overview and Challenge

2   Smart Retention

3   Sort Before Compression

4   HDFS Raid

5   Directory XOR & Compaction

6   Q&A
Life of a tag in Data Warehouse
A log line <user_id, photo_id> is generated at www.facebook.com when a
user tags a photo. It flows through the pipeline with increasing latency:

• Scribe Log Storage: the log line reaches Scribeh in ~10s; Realtime
  Analytics (puma) can count users tagging photos in the last hour
  with ~1min latency
• Warehouse: the copier/loader lands the log line in the warehouse in
  ~1hr; Scrapes bring user info from the UDB to the warehouse in ~1day
• Periodic Analysis (nocron): e.g., a daily report on the count of
  photo tags by country (~1day)
• Adhoc Analysis (hipal): e.g., count photos tagged by females age
  20-25 yesterday
History (2008/03-2012/03)
Data, Data, and more Data


            Facebook   Queries/   Scribe      Nodes in      Size
             Users       Day      Data/Day    Warehouse    (Total)
Growth        14X        60X        250X        260X        2500X
Directions to handle the data growth problem
• Improve the software
•   HDFS Federation
•   Prism

• Improve storage efficiency
•   Store more data without increasing capacity
•   Increasingly important – translates into millions of
    dollars in savings
Ways to Improve Storage Efficiency
• Better capacity management

• Reduce space usage of Hive tables

• Reduce replication factor of data
Smart Retention – Motivation
• Hive table “retention” metadata
 •   Partitions older than retention value are automatically
     purged by system

• Table owners are unaware of table usage
 •   Difficult to set retention value right at the beginning.

• An improper retention setting may waste space
 •   e.g., users only accessed the most recent 30 days of
     partitions of a 3-month-retention table
Smart Retention
• Add a post-execute hook that logs table/partition
  names and query start time to MySQL.

• Calculate the “empirical-retention” per table
  Given a partition P whose creation time is CT_P:
    Data_age_at_last_query_P =
              max{StartTime_Q - CT_P | ∀ query Q that accesses P}
  Given a table T:
    Empirical_retention_T =
              max{Data_age_at_last_query_P | ∀ P ∈ T}
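The two max expressions can be computed directly from the rows the post-execute hook logs. A minimal sketch in Python, assuming the logged rows have been pulled out of MySQL into memory (the record shape and field order are my assumptions):

```python
from collections import defaultdict

def empirical_retention(access_log, creation_time):
    """Compute per-table empirical retention from query access logs.

    access_log: iterable of (table, partition, query_start_time) rows,
                as logged by the post-execute hook (times in epoch seconds).
    creation_time: dict mapping (table, partition) -> partition creation time.
    Returns: dict mapping table -> Empirical_retention_T in seconds.
    """
    # Data_age_at_last_query_P = max over queries Q of (StartTime_Q - CT_P)
    data_age = defaultdict(int)
    for table, partition, start_time in access_log:
        age = start_time - creation_time[(table, partition)]
        data_age[(table, partition)] = max(data_age[(table, partition)], age)

    # Empirical_retention_T = max over partitions P in T of Data_age_at_last_query_P
    retention = defaultdict(int)
    for (table, _partition), age in data_age.items():
        retention[table] = max(retention[table], age)
    return dict(retention)
```

The result feeds the call to action on the next slide: the table owner either accepts the computed value or reviews the underlying query history.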
Smart Retention
• Report Empirical_retention_T to table owners with a
  call to action:
 •   Accept the empirical value and change the retention, or
 •   Review the table’s query history and figure out a better setting

• After 2 weeks, the system archives partitions
  that are older than Empirical_retention_T
 •   Space is freed up after partitions get archived
 •   Users need to restore archived data before querying it
Smart Retention – Things Learned
• Table query history enables table owners to
  identify outliers:
 •   e.g., a table mostly queries data less than 32 days old, but
     one time a 42-day-old partition was accessed

• Prioritize tables with the most space savings
 •   Save 8PB from the TOP 100 tables!
Sort Before Compression - Motivation
• In RCFile format, data are stored in columns inside
  every row block
 •   Sorting by one or two columns with many duplicate values
     reduces the final compressed data size

• Trade extra computation for space saving
Sort Before Compression
• Identify the best column to sort
 •   Take a sample of the table and sort it by every column.
     Pick the column with the most space saving.

• Transfer target partitions from service clusters to
  compute clusters
• Sort them into compressed RCFile format.
• Sorted partitions are transferred back to service
  clusters to replace original ones
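The first step, picking the sort column, can be approximated by compressing a sample sorted each way and keeping the column that compresses smallest. A rough sketch, with zlib standing in for the RCFile codec and rows modeled as dicts (both simplifications of the real pipeline):

```python
import zlib

def best_sort_column(sample_rows, columns):
    """Return the column whose sort order yields the smallest
    compressed sample.

    sample_rows: list of dicts (a sample of the table).
    columns: candidate column names to try sorting by.
    """
    def compressed_size(rows):
        # Serialize column by column, mimicking RCFile's columnar layout.
        blob = b"".join(
            "\x00".join(str(r[c]) for r in rows).encode() for c in columns
        )
        return len(zlib.compress(blob))

    return min(
        columns,
        key=lambda c: compressed_size(sorted(sample_rows, key=lambda r: r[c])),
    )
```

Columns with heavy duplication win here because sorting groups equal values into long runs, which any dictionary-based codec exploits.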
How we sort
set hive.exec.reducers.max=1024;
set hive.io.rcfile.record.buffer.size=67108864;

INSERT OVERWRITE TABLE hive_table PARTITION
    (ds='2012-08-06', source_type='mobile_sort')
SELECT `(ds|source_type)?+.+` FROM hive_table
WHERE ds='2012-08-06' AND source_type='mobile'
DISTRIBUTE BY IF(userid <> 0 AND userid IS NOT NULL,
                 userid, CAST(RAND() AS STRING))
SORT BY userid, ip_address;
Sort Before Compression – Things Learned
• Sorting achieves >40% space saving!

• It’s important to verify data correctness
 •   Compare the original and sorted partitions’ hash values
 •   This caught a Hive bug

• Sort cold data first, and gradually move to hot
  data
HDFS Raid
In HDFS, data are 3X replicated. The client sends meta operations to
the NameNode (which holds the metadata) and reads/writes block data
directly from/to the DataNodes. For /warehouse/file1 with blocks 1, 2,
and 3, each block is stored on three of the five DataNodes.
HDFS Raid – File-level XOR (10, 1)
Before: /warehouse/file1 has 10 blocks, each stored with 3 replicas → 3X.

After: the 10 source blocks keep 2 replicas each, and one XOR parity
block (block 11, also 2 replicas) is stored in the parity file
/raid/warehouse/file1. Effective replication = (10*2 + 1*2)/10 = 2.2X.
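The XOR scheme itself is simple to sketch: the parity block is the bitwise XOR of the ten source blocks, and any single missing block is rebuilt by XOR-ing the parity with the nine survivors. A toy illustration (blocks are short byte strings here, not real HDFS blocks):

```python
def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def make_parity(source_blocks):
    # (10, 1) XOR raid: one parity block per stripe of 10 source blocks
    return xor_blocks(source_blocks)

def reconstruct(surviving_blocks, parity):
    # A missing source block is the XOR of the parity and the survivors
    return xor_blocks(surviving_blocks + [parity])
```

This is why XOR tolerates exactly one lost block per stripe; surviving two losses requires the Reed-Solomon scheme shown later in the deck.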
HDFS Raid
• What if a file has 15 blocks?
•   Treat it as 20 blocks (two stripes) and generate a parity
    file with 2 blocks
•   Replication factor = (15*2 + 2*2)/15 ≈ 2.27

• Reconstruction
•   Online reconstruction – DistributedRaidFileSystem
•   Offline reconstruction – RaidNode

• Block Placement
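The replication arithmetic above generalizes: pad the file to a whole number of stripes, add one parity group per stripe, and divide total block copies by source blocks. A sketch of that bookkeeping (function and parameter names are mine):

```python
import math

def raided_replication(num_blocks, stripe_len, parity_len,
                       src_replicas, parity_replicas):
    """Effective replication factor of a raided file.

    num_blocks:      source blocks in the file
    stripe_len:      source blocks per stripe (10 in this deck)
    parity_len:      parity blocks per stripe (1 for XOR, 4 for RS)
    src_replicas:    replicas kept per source block after raiding
    parity_replicas: replicas kept per parity block
    """
    stripes = math.ceil(num_blocks / stripe_len)  # partial stripes are padded
    copies = num_blocks * src_replicas + stripes * parity_len * parity_replicas
    return copies / num_blocks
```

Plugging in the deck's configurations reproduces the 2.2X (XOR), 1.4X (Reed-Solomon, covered on the next slide), and 2.27X (15-block file) figures.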
HDFS Raid – File-level Reed Solomon (10, 4)
Before: /warehouse/file1 has 10 blocks, each stored with 3 replicas → 3X.

After: the 10 source blocks are kept at 1 replica each, and 4 Reed-
Solomon parity blocks (blocks 11–14, 1 replica each) are stored in the
parity file /raidrs/warehouse/file1. Effective replication =
(10*1 + 4*1)/10 = 1.4X.
HDFS Raid – Hybrid Storage
Life of file /warehouse/facebook.jpg:
• Born: 3X replicated
• 1 day old: XOR Raided → ×2.2
• 3 months old and older: RS Raided → ×1.4
HDFS Raid – Things Learned
• Replication factor 3 → 2.65 (12% space saving)

• Avoid flooding the namenode with requests
•   A daily pipeline scans the fsimage to pick raidable files
    rather than recursively searching via the namenode

• Small files prevent further replication reduction
•   50% of files in the warehouse have only 1 or 2
    blocks. They are too small to be raided.
Raid Warm Small Files: Directory-level XOR
Before (file-level XOR): /data/file1 … /data/file4 hold 10 blocks in
total, but each file is raided separately, so each raidable file needs
its own parity file (/raid/data/file1 with block 11, /raid/data/file3
with block 12), and files too small to raid keep 3 replicas. Overall
replication: 2.7X.

After (directory-level XOR): one parity block (block 11, 2 replicas)
is computed across all 10 blocks in the directory and stored in
/dir-raid/data. All source blocks drop to 2 replicas → 2.2X, the
same as file-level XOR on a large file.
Handle Directory Change
Directory changes happen very infrequently in the warehouse. For a
directory /namespace/infra/ds=2013-07-07 containing file1, file2,
file3 and their parity:

1. The RaidNode records the block-to-stripe mapping in a stripe store
   (MySQL): Blk_file_1 → Strp_1, Blk_file_2 → Strp_1,
   Blk_file_3 → Strp_1, Blk_parity → Strp_1.
2. The directory changes: file3 is moved to trash and file4 is added.
3. A client tries to read file2 and encounters missing blocks. It
   looks at the stripe table, figures out that file4 does not belong
   to the stripe and that file3 is in trash, and reconstructs file2!
4. The RaidNode re-raids the directory before file3 is actually
   deleted from the cluster.
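The stripe store’s role in this recovery can be modeled as a lookup: given a damaged file, find its stripe and collect the stripe’s other members, ignoring blocks that never joined it. A toy sketch with a dict standing in for the MySQL table (all identifiers hypothetical):

```python
def stripe_members(stripe_store, block_id):
    """Return the other blocks in block_id's stripe, per the stripe store.

    stripe_store: dict mapping block_id -> stripe_id (the MySQL table rows).
    Blocks absent from the store (e.g., a newly added file's blocks) are
    not part of any stripe and are simply ignored during reconstruction.
    """
    stripe = stripe_store[block_id]
    return [b for b, s in stripe_store.items() if s == stripe and b != block_id]
```

Once the members are known, the missing block is rebuilt by XOR-ing the surviving members with the parity block, exactly as in file-level XOR.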
Raid Cold Small Files: Compaction
• Compact cold small files into large files and apply
  file-level RS
•       No need to handle directory changes with file-level RS
    •    Re-raiding a Directory-RS-Raided directory is expensive
•       Raid-aware compaction achieves the best space saving
    •    Change the block size to produce files with block counts
         that are multiples of ten
•       Reduces the amount of metadata
Raid-Aware Compaction
▪       Compaction settings:
         set mapred.min.split.size = 39*blockSize;
         set mapred.max.split.size = 39*blockSize;
         set mapred.min.split.size.per.node = 39*blockSize;
         set mapred.min.split.size.per.rack = 39*blockSize;
         set dfs.block.size = blockSize;
         set hive.input.format =
              org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

▪       Calculate the best block size for a partition
    ▪    Make sure bestBlockSize * N ≈ partition size, where
         N = 39p + q (p ∈ ℕ+, q ∈ {10, 20, 30})
    ▪    Compaction will generate p 40-block files and one
         q-block file
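The block-size search can be sketched as follows, mirroring the slide's N = 39p + q constraint; the preference for the largest block size under a configured ceiling (fewer blocks means less metadata) is my assumption, since the slide only states the sizing constraint:

```python
def best_block_size(partition_size, max_block_size, max_p=1000):
    """Pick (blockSize, N) with N = 39p + q (p >= 1, q in {10, 20, 30})
    so that blockSize * N ≈ partition_size, preferring the largest
    blockSize that does not exceed max_block_size."""
    best = None  # (block_size, N)
    for p in range(1, max_p + 1):
        for q in (10, 20, 30):
            n = 39 * p + q
            # Round up so N blocks of this size cover the whole partition.
            size = -(-partition_size // n)
            if size <= max_block_size and (best is None or size > best[0]):
                best = (size, n)
        if best is not None:
            break  # larger p only yields smaller block sizes
    return best
```

The chosen size would then be fed into the dfs.block.size and split-size settings shown above.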
Raid-Aware Compaction
▪       Compact SeqFile format partition
    ▪    INSERT OVERWRITE TABLE seq_table
         PARTITION (ds = "2012-08-17")
            SELECT `(ds)?+.+` FROM seq_table
            WHERE ds = "2012-08-17";
▪       Compact RCFile format partition
    ▪    ALTER TABLE rc_table PARTITION
             (ds="2009-08-31") CONCATENATE;
Directory XOR & Compaction - Things Learned
 • Replication factor 2.65 → 2.35 (an additional 12% space
   saving). Still rolling out.

 • Bookkeeping blocks’ checksums can avoid data
   corruption caused by bugs

 • HDFS being unaware of Raid causes some issues
  •   Operational errors can cause data loss (e.g., forgetting
      to move parity data along with source data)

 • Directory XOR & Compaction only work for warehouse
   data
Questions?
