Facebook's data warehouse clusters store more than 100 PB of data, with 500+ terabytes entering the clusters every day. To meet the capacity requirements of future data growth, storing data in a cost-effective way has become a top priority for the Facebook data infrastructure team. This talk presents the solutions we use to reduce our warehouse clusters' data footprint: (1) smart retention: history-based Hive table retention control; (2) increasing the RCFile compression ratio through clever sorting; (3) HDFS file-level raiding to reduce the default replication factor of 3 to a lower ratio; (4) attacking the small-file raiding problem through directory-level raiding and raid-aware compaction.
1. Facebook’s Approach to Big Data
Storage Challenge
Weiyan Wang
Software Engineer (Data Infrastructure – HStore)
March 1, 2013
2. Agenda
1 Data Warehouse Overview and Challenge
2 Smart Retention
3 Sort Before Compression
4 HDFS Raid
5 Directory XOR & Compaction
6 Q&A
3. Life of a tag in Data Warehouse
[Diagram: the life of a photo tag. A log line <user_id, photo_id> is generated on www.facebook.com when a user tags a photo; it reaches Scribeh log storage in ~10 s, where puma computes realtime analytics (e.g., count of users tagging photos in the last hour, ~1 min). The copier/loader brings the log line into the warehouse in ~1 hr; scrapes bring user info from UDB into the warehouse in ~1 day. Periodic analysis (nocron) produces daily reports such as the count of photo tags by country; adhoc analysis (hipal) answers queries such as "count photos tagged by females age 20-25 yesterday".]
4. History (2008/03-2012/03)
Data, Data, and more Data
Growth from 2008/03 to 2012/03:
• Facebook Users: 14X
• Scribe Data/Day: 60X
• Queries/Day: 250X
• Nodes in Warehouse: 260X
• Size (Total): 2500X
5. Directions to handle data growth problem
• Improve the software
• HDFS Federation
• Prism
• Improve storage efficiency
• Store more data without increasing capacity
• Increasingly important: translates into millions of
dollars in savings
6. Ways to Improve Storage Efficiency
• Better capacity management
• Reduce space usage of Hive tables
• Reduce replication factor of data
7. Smart Retention – Motivation
• Hive table “retention” metadata
• Partitions older than retention value are automatically
purged by system
• Table owners are unaware of actual table usage
• It is difficult to set the retention value right at the beginning
• An improper retention setting may waste space
• e.g., users only accessed the most recent 30 days of
partitions of a table with a 3-month retention
8. Smart Retention
• Add a post-execute hook that logs table/partition
names and query start time to MySQL.
• Calculate the “empirical retention” per table
Given a partition P whose creation time is CT_P:
Data_age_at_last_query_P = max{StartTime_Q − CT_P | ∀ query Q that accesses P}
Given a table T:
Empirical_retention_T = max{Data_age_at_last_query_P | ∀ P ∈ T}
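The two max expressions above can be sketched in Python; the partition names and the log schema below are hypothetical stand-ins for what the post-execute hook records:

```python
from datetime import datetime

def empirical_retention(partitions, query_log):
    """Empirical retention of a table: the max, over all partitions P,
    of the data age (StartTime_Q - CT_P) at the last query touching P.

    partitions: dict partition_name -> creation time (datetime)
    query_log:  list of (partition_name, query start time) tuples,
                as logged by the post-execute hook.
    """
    retention = None
    for name, start in query_log:
        created = partitions.get(name)
        if created is None:
            continue
        age_at_query = start - created          # StartTime_Q - CT_P
        if retention is None or age_at_query > retention:
            retention = age_at_query
    return retention

# Toy example: two daily partitions, three logged queries.
parts = {
    "ds=2013-01-01": datetime(2013, 1, 1),
    "ds=2013-01-02": datetime(2013, 1, 2),
}
log = [
    ("ds=2013-01-01", datetime(2013, 1, 20)),   # 19 days old at query time
    ("ds=2013-01-02", datetime(2013, 1, 10)),   # 8 days old
    ("ds=2013-01-01", datetime(2013, 1, 25)),   # 24 days old -> the max
]
print(empirical_retention(parts, log).days)     # 24
```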
9. Smart Retention
• Send Empirical_retention_T to table owners with a
call to action:
• Accept the empirical value and change the retention
• Or review the table’s query history and figure out a better setting
• After 2 weeks, the system archives partitions
that are older than Empirical_retention_T
• Space is freed up once partitions are archived
• Users need to restore archived data before querying it
10. Smart Retention – Things Learned
• Table query history enables table owners to
identify outliers:
• e.g., a table’s queries mostly touched data < 32 days old,
but one query accessed a 42-day-old partition
• Prioritize tables with the most space savings
• Save 8PB from the TOP 100 tables!
11. Sort Before Compression - Motivation
• In RCFile format, data are stored in columns inside
every row block
• Sorting by one or two columns with many duplicate values
reduces the final compressed data size
• Trade extra computation for space saving
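The effect is easy to demonstrate with a toy experiment; zlib stands in here for the codec applied to each column inside an RCFile row block:

```python
import random
import zlib

# A column with many duplicate values (e.g., country codes).
random.seed(0)
values = [random.choice(["US", "IN", "BR", "GB", "DE"]) for _ in range(100_000)]

unsorted_blob = "".join(values).encode()
sorted_blob = "".join(sorted(values)).encode()

# Sorting groups duplicates into long runs, which compress far better.
print(len(zlib.compress(unsorted_blob)), len(zlib.compress(sorted_blob)))
```

On this toy column the sorted blob compresses to a small fraction of the unsorted one; the real saving depends on the data, which is why the sort column is picked empirically by sampling (next slide).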
12. Sort Before Compression
• Identify the best column to sort
• Take a sample of the table and sort it by each column;
pick the one with the most space saving
• Transfer target partitions from service clusters to
compute clusters
• Sort them into compressed RCFile format.
• Sorted partitions are transferred back to service
clusters to replace original ones
13. How we sort
set hive.exec.reducers.max=1024;
set hive.io.rcfile.record.buffer.size=67108864;

INSERT OVERWRITE TABLE hive_table PARTITION
  (ds='2012-08-06', source_type='mobile_sort')
SELECT `(ds|source_type)?+.+` FROM hive_table
WHERE ds='2012-08-06' AND source_type='mobile'
DISTRIBUTE BY IF(userid <> 0 AND userid IS NOT NULL,
                 userid, CAST(RAND() AS STRING))
SORT BY userid, ip_address;
14. Sort Before Compression – Things Learned
• Sorting achieves >40% space saving!
• It’s important to verify data correctness
• Compare the original and sorted partitions’ hash values
• This check uncovered a Hive bug
• Sort cold data first, and gradually move to hot
data
15. HDFS Raid
In HDFS, data are 3X replicated.
[Diagram: a client sends metadata operations for /warehouse/file1 to the
NameNode and reads/writes block data directly on the DataNodes; each of the
file’s three blocks (1, 2, 3) is replicated across three of the five DataNodes.]
19. HDFS Raid – Hybrid Storage
Life of file /warehouse/facebook.jpg:
• Born: 3X replicated (×3)
• 1 day old: XOR Raided (×2.2)
• 3 months old and older: RS Raided (×1.4)
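The ×3 / ×2.2 / ×1.4 ratios follow from simple block arithmetic. A minimal sketch, assuming a stripe length of 10, one XOR parity block with source and parity kept at 2 replicas, and a (10, 4) Reed-Solomon code at single replication (typical HDFS Raid parameters, but assumptions here, not stated on the slide):

```python
def effective_replication(stripe_len, parity_len, source_repl, parity_repl):
    """Physical blocks stored per logical source block after raiding."""
    total = stripe_len * source_repl + parity_len * parity_repl
    return total / stripe_len

print(effective_replication(10, 0, 3, 0))  # born: plain 3X replication -> 3.0
print(effective_replication(10, 1, 2, 2))  # XOR raided -> 2.2
print(effective_replication(10, 4, 1, 1))  # RS (10, 4) raided -> 1.4
```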
20. HDFS Raid – Things Learned
• Replication factor 3 → 2.65 (12% space saving)
• Avoid flooding the NameNode with requests
• A daily pipeline scans the fsimage to pick raidable files,
rather than recursively listing from the NameNode
• Small files prevent further replication reduction
• 50% of files in the warehouse have only 1 or 2
blocks; they are too small to be raided
22. Handle Directory Change
Directory changes happen very infrequently in the warehouse.
[Diagram: directory /namespace/infra/ds=2013-07-07 holds file1, file2, file3
and their parity file; a stripe store (MySQL) maps block ids to stripe ids
(Blk_file_1 → Strp_1, Blk_file_2 → Strp_1, Blk_file_3 → Strp_1,
Blk_parity → Strp_1).]
1. The RaidNode raids the directory and records the block-to-stripe
mapping in the stripe store.
2. The directory changes: file3 is moved to trash and file4 is added.
3. A client tries to read file2 and encounters missing blocks.
4. The client looks at the stripe table, figures out that file4 does not
belong to the stripe and that file3 is in trash, and reconstructs file2.
The RaidNode re-raids the directory before file3 is actually deleted
from the cluster.
23. Raid Cold Small Files: Compaction
• Compact cold small files into large files and apply
file-level RS
• No need to handle directory changes for file-level RS
• Re-raiding a directory-RS-raided directory is expensive
• Raid-aware compaction achieves the best space saving
• Change the block size to produce files whose block counts
are multiples of ten
• Reduces the amount of block metadata
24. Raid-Aware Compaction
▪ Compaction settings:
set mapred.min.split.size = 39*blockSize;
set mapred.max.split.size = 39*blockSize;
set mapred.min.split.size.per.node = 39*blockSize;
set mapred.min.split.size.per.rack = 39*blockSize;
set dfs.block.size = blockSize;
set hive.input.format =
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
▪ Calculate the best block size for a partition
▪ Make sure bestBlockSize × N ≈ partition size, where
N = 39p + q (p ∈ ℕ+, q ∈ {10, 20, 30})
▪ Compaction will then generate p 40-block files and one
q-block file
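The choice of N can be sketched as a small search. Picking the candidate whose implied block size is closest to a target (256 MB below) is an assumption about the selection criterion, which the slide does not spell out:

```python
def best_block_size(partition_size, target_block=256 * 2**20):
    """Choose N = 39p + q (p >= 1, q in {10, 20, 30}) so that
    partition_size / N is as close as possible to the target block size.
    Compaction with 39-block splits then yields p 40-block files
    plus one q-block file, all multiples of ten blocks."""
    best = None  # (distance to target, block size, p, q)
    max_p = partition_size // (39 * target_block) + 2
    for p in range(1, max_p + 1):
        for q in (10, 20, 30):
            n = 39 * p + q
            block = partition_size // n
            if block == 0:
                continue
            cand = (abs(block - target_block), block, p, q)
            if best is None or cand < best:
                best = cand
    _, block, p, q = best
    return block, p, q

# A ~24.5 GB partition: N = 98 = 39*2 + 20, i.e. two 40-block files
# and one 20-block file at a 250 MB block size.
print(best_block_size(24_500 * 2**20))  # (262144000, 2, 20)
```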
25. Raid-Aware Compaction
▪ Compact a SeqFile format partition:
INSERT OVERWRITE TABLE seq_table
PARTITION (ds = "2012-08-17")
SELECT `(ds)?+.+` FROM seq_table
WHERE ds = "2012-08-17";
▪ Compact an RCFile format partition:
ALTER TABLE rc_table PARTITION
(ds = "2009-08-31") CONCATENATE;
26. Directory XOR & Compaction - Things Learned
• Replication factor 2.65 → 2.35 (an additional 12% space
saving); still rolling out
• Bookkeeping block checksums can avoid data
corruption caused by bugs
• HDFS’s unawareness of Raid causes some issues
• Operational error can cause data loss (e.g., forgetting to move
parity data along with source data)
• Directory XOR & compaction only work for warehouse
data