Facebook’s Approach to Big Data
Storage Challenge


Weiyan Wang
Software Engineer (Data Infrastructure – HStore)
March 1, 2013
Agenda
1   Data Warehouse Overview and Challenge

2   Smart Retention

3   Sort Before Compression

4   HDFS Raid

5   Directory XOR & Compaction

6   Q&A
Life of a tag in Data Warehouse
A log line <user_id, photo_id> is generated at www.facebook.com when a
user tags a photo. It flows through the pipeline with increasing latency:

• Scribe Log Storage: the log line reaches Scribeh in ~10s; Realtime
  Analytics (puma) can count users tagging photos in the last hour
  with ~1min latency
• Warehouse: the copier/loader lands the log line in the warehouse in
  ~1hr; Scrapes bring user info from the UDB to the warehouse in ~1day
• Periodic Analysis (nocron): e.g., a daily report on the count of
  photo tags by country (~1day)
• Adhoc Analysis (hipal): e.g., count photos tagged by females age
  20-25 yesterday
History (2008/03-2012/03)
Data, Data, and more Data


            Facebook   Queries/   Scribe      Nodes in      Size
             Users       Day      Data/Day    Warehouse    (Total)
Growth        14X        60X        250X        260X        2500X
Directions to handle the data growth problem
• Improve the software
•   HDFS Federation
•   Prism

• Improve storage efficiency
•   Store more data without increasing capacity
•   Increasingly important – translates into millions of
    dollars in savings
Ways to Improve Storage Efficiency
• Better capacity management

• Reduce space usage of Hive tables

• Reduce replication factor of data
Smart Retention – Motivation
• Hive table “retention” metadata
 •   Partitions older than retention value are automatically
     purged by system

• Table owners are unaware of table usage
 •   Difficult to set retention value right at the beginning.

• An improper retention setting may waste space
 •   e.g., users only accessed the most recent 30 days of
     partitions of a 3-month-retention table
Smart Retention
• Add a post-execute hook that logs table/partition
  names and query start time to MySQL.

• Calculate the “empirical-retention” per table
  Given a partition P whose creation time is CT_P:
    Data_age_at_last_query_P =
              max{StartTime_Q - CT_P | ∀ query Q that accesses P}
  Given a table T:
    Empirical_retention_T =
              max{Data_age_at_last_query_P | ∀ P ∈ T}
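The two max expressions can be computed directly from the rows the post-execute hook logs. A minimal sketch in Python, assuming the logged rows have been pulled out of MySQL into memory (the record shape and field order are my assumptions):

```python
from collections import defaultdict

def empirical_retention(access_log, creation_time):
    """Compute per-table empirical retention from query access logs.

    access_log: iterable of (table, partition, query_start_time) rows,
                as logged by the post-execute hook (times in epoch seconds).
    creation_time: dict mapping (table, partition) -> partition creation time.
    Returns: dict mapping table -> Empirical_retention_T in seconds.
    """
    # Data_age_at_last_query_P = max over queries Q of (StartTime_Q - CT_P)
    data_age = defaultdict(int)
    for table, partition, start_time in access_log:
        age = start_time - creation_time[(table, partition)]
        data_age[(table, partition)] = max(data_age[(table, partition)], age)

    # Empirical_retention_T = max over partitions P in T of Data_age_at_last_query_P
    retention = defaultdict(int)
    for (table, _partition), age in data_age.items():
        retention[table] = max(retention[table], age)
    return dict(retention)
```

The result feeds the call to action on the next slide: the table owner either accepts the computed value or reviews the underlying query history.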
Smart Retention
• Report Empirical_retention_T to table owners with a
  call to action:
 •   Accept the empirical value and change the retention, or
 •   Review the table’s query history and figure out a better setting

• After 2 weeks, the system archives partitions
  that are older than Empirical_retention_T
 •   Space is freed up after partitions get archived
 •   Users need to restore archived data before querying it
Smart Retention – Things Learned
• Table query history enables table owners to
  identify outliers:
 •   e.g., a table mostly queries data less than 32 days old, but
     one time a 42-day-old partition was accessed

• Prioritize tables with the most space savings
 •   Save 8PB from the TOP 100 tables!
Sort Before Compression - Motivation
• In RCFile format, data are stored in columns inside
  every row block
 •   Sorting by one or two columns with many duplicate values
     reduces the final compressed data size

• Trade extra computation for space saving
Sort Before Compression
• Identify the best column to sort
 •   Take a sample of the table and sort it by every column.
     Pick the column with the most space saving.

• Transfer target partitions from service clusters to
  compute clusters
• Sort them into compressed RCFile format.
• Sorted partitions are transferred back to service
  clusters to replace original ones
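The first step, picking the sort column, can be approximated by compressing a sample sorted each way and keeping the column that compresses smallest. A rough sketch, with zlib standing in for the RCFile codec and rows modeled as dicts (both simplifications of the real pipeline):

```python
import zlib

def best_sort_column(sample_rows, columns):
    """Return the column whose sort order yields the smallest
    compressed sample.

    sample_rows: list of dicts (a sample of the table).
    columns: candidate column names to try sorting by.
    """
    def compressed_size(rows):
        # Serialize column by column, mimicking RCFile's columnar layout.
        blob = b"".join(
            "\x00".join(str(r[c]) for r in rows).encode() for c in columns
        )
        return len(zlib.compress(blob))

    return min(
        columns,
        key=lambda c: compressed_size(sorted(sample_rows, key=lambda r: r[c])),
    )
```

Columns with heavy duplication win here because sorting groups equal values into long runs, which any dictionary-based codec exploits.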
How we sort
set hive.exec.reducers.max=1024;
set hive.io.rcfile.record.buffer.size=67108864;

INSERT OVERWRITE TABLE hive_table PARTITION
    (ds='2012-08-06', source_type='mobile_sort')
SELECT `(ds|source_type)?+.+` FROM hive_table
WHERE ds='2012-08-06' AND source_type='mobile'
DISTRIBUTE BY IF(userid <> 0 AND userid IS NOT NULL,
                 userid, CAST(RAND() AS STRING))
SORT BY userid, ip_address;
Sort Before Compression – Things Learned
• Sorting achieves >40% space saving!

• It’s important to verify data correctness
 •   Compare the original and sorted partitions’ hash values
 •   This caught a Hive bug

• Sort cold data first, and gradually move to hot
  data
HDFS Raid
In HDFS, data are 3X replicated. The client sends meta operations to
the NameNode (which holds the metadata) and reads/writes block data
directly from/to the DataNodes. For /warehouse/file1 with blocks 1, 2,
and 3, each block is stored on three of the five DataNodes.
HDFS Raid – File-level XOR (10, 1)
Before: /warehouse/file1 has 10 blocks, each stored with 3 replicas → 3X.

After: the 10 source blocks keep 2 replicas each, and one XOR parity
block (block 11, also 2 replicas) is stored in the parity file
/raid/warehouse/file1. Effective replication = (10*2 + 1*2)/10 = 2.2X.
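The XOR scheme itself is simple to sketch: the parity block is the bitwise XOR of the ten source blocks, and any single missing block is rebuilt by XOR-ing the parity with the nine survivors. A toy illustration (blocks are short byte strings here, not real HDFS blocks):

```python
def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def make_parity(source_blocks):
    # (10, 1) XOR raid: one parity block per stripe of 10 source blocks
    return xor_blocks(source_blocks)

def reconstruct(surviving_blocks, parity):
    # A missing source block is the XOR of the parity and the survivors
    return xor_blocks(surviving_blocks + [parity])
```

This is why XOR tolerates exactly one lost block per stripe; surviving two losses requires the Reed-Solomon scheme shown later in the deck.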
HDFS Raid
• What if a file has 15 blocks?
•   Treat it as 20 blocks (two stripes) and generate a parity
    file with 2 blocks
•   Replication factor = (15*2 + 2*2)/15 ≈ 2.27

• Reconstruction
•   Online reconstruction – DistributedRaidFileSystem
•   Offline reconstruction – RaidNode

• Block Placement
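The replication arithmetic above generalizes: pad the file to a whole number of stripes, add one parity group per stripe, and divide total block copies by source blocks. A sketch of that bookkeeping (function and parameter names are mine):

```python
import math

def raided_replication(num_blocks, stripe_len, parity_len,
                       src_replicas, parity_replicas):
    """Effective replication factor of a raided file.

    num_blocks:      source blocks in the file
    stripe_len:      source blocks per stripe (10 in this deck)
    parity_len:      parity blocks per stripe (1 for XOR, 4 for RS)
    src_replicas:    replicas kept per source block after raiding
    parity_replicas: replicas kept per parity block
    """
    stripes = math.ceil(num_blocks / stripe_len)  # partial stripes are padded
    copies = num_blocks * src_replicas + stripes * parity_len * parity_replicas
    return copies / num_blocks
```

Plugging in the deck's configurations reproduces the 2.2X (XOR), 1.4X (Reed-Solomon, covered on the next slide), and 2.27X (15-block file) figures.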
HDFS Raid – File-level Reed Solomon (10, 4)
Before: /warehouse/file1 has 10 blocks, each stored with 3 replicas → 3X.

After: the 10 source blocks are kept at 1 replica each, and 4 Reed-
Solomon parity blocks (blocks 11–14, 1 replica each) are stored in the
parity file /raidrs/warehouse/file1. Effective replication =
(10*1 + 4*1)/10 = 1.4X.
HDFS Raid – Hybrid Storage
Life of file /warehouse/facebook.jpg:
• Born: 3X replicated
• 1 day old: XOR Raided → ×2.2
• 3 months old and older: RS Raided → ×1.4
HDFS Raid – Things Learned
• Replication factor 3 → 2.65 (12% space saving)

• Avoid flooding the namenode with requests
•   A daily pipeline scans the fsimage to pick raidable files
    rather than recursively searching via the namenode

• Small files prevent further replication reduction
•   50% of files in the warehouse have only 1 or 2
    blocks. They are too small to be raided.
Raid Warm Small Files: Directory-level XOR
Before (file-level XOR): /data/file1 … /data/file4 hold 10 blocks in
total, but each file is raided separately, so each raidable file needs
its own parity file (/raid/data/file1 with block 11, /raid/data/file3
with block 12), and files too small to raid keep 3 replicas. Overall
replication: 2.7X.

After (directory-level XOR): one parity block (block 11, 2 replicas)
is computed across all 10 blocks in the directory and stored in
/dir-raid/data. All source blocks drop to 2 replicas → 2.2X, the
same as file-level XOR on a large file.
Handle Directory Change
Directory changes happen very infrequently in the warehouse. For a
directory /namespace/infra/ds=2013-07-07 containing file1, file2,
file3 and their parity:

1. The RaidNode records the block-to-stripe mapping in a stripe store
   (MySQL): Blk_file_1 → Strp_1, Blk_file_2 → Strp_1,
   Blk_file_3 → Strp_1, Blk_parity → Strp_1.
2. The directory changes: file3 is moved to trash and file4 is added.
3. A client tries to read file2 and encounters missing blocks. It
   looks at the stripe table, figures out that file4 does not belong
   to the stripe and that file3 is in trash, and reconstructs file2!
4. The RaidNode re-raids the directory before file3 is actually
   deleted from the cluster.
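The stripe store’s role in this recovery can be modeled as a lookup: given a damaged file, find its stripe and collect the stripe’s other members, ignoring blocks that never joined it. A toy sketch with a dict standing in for the MySQL table (all identifiers hypothetical):

```python
def stripe_members(stripe_store, block_id):
    """Return the other blocks in block_id's stripe, per the stripe store.

    stripe_store: dict mapping block_id -> stripe_id (the MySQL table rows).
    Blocks absent from the store (e.g., a newly added file's blocks) are
    not part of any stripe and are simply ignored during reconstruction.
    """
    stripe = stripe_store[block_id]
    return [b for b, s in stripe_store.items() if s == stripe and b != block_id]
```

Once the members are known, the missing block is rebuilt by XOR-ing the surviving members with the parity block, exactly as in file-level XOR.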
Raid Cold Small Files: Compaction
• Compact cold small files into large files and apply
  file-level RS
•       No need to handle directory changes with file-level RS
    •    Re-raiding a Directory-RS-Raided directory is expensive
•       Raid-aware compaction achieves the best space saving
    •    Change the block size to produce files with block counts
         that are multiples of ten
•       Reduces the amount of metadata
Raid-Aware Compaction
▪       Compaction settings:
         set mapred.min.split.size = 39*blockSize;
         set mapred.max.split.size = 39*blockSize;
         set mapred.min.split.size.per.node = 39*blockSize;
         set mapred.min.split.size.per.rack = 39*blockSize;
         set dfs.block.size = blockSize;
         set hive.input.format =
              org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

▪       Calculate the best block size for a partition
    ▪    Make sure bestBlockSize * N ≈ partition size, where
         N = 39p + q (p ∈ ℕ+, q ∈ {10, 20, 30})
    ▪    Compaction will generate p 40-block files and one
         q-block file
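The block-size search can be sketched as follows, mirroring the slide's N = 39p + q constraint; the preference for the largest block size under a configured ceiling (fewer blocks means less metadata) is my assumption, since the slide only states the sizing constraint:

```python
def best_block_size(partition_size, max_block_size, max_p=1000):
    """Pick (blockSize, N) with N = 39p + q (p >= 1, q in {10, 20, 30})
    so that blockSize * N ≈ partition_size, preferring the largest
    blockSize that does not exceed max_block_size."""
    best = None  # (block_size, N)
    for p in range(1, max_p + 1):
        for q in (10, 20, 30):
            n = 39 * p + q
            # Round up so N blocks of this size cover the whole partition.
            size = -(-partition_size // n)
            if size <= max_block_size and (best is None or size > best[0]):
                best = (size, n)
        if best is not None:
            break  # larger p only yields smaller block sizes
    return best
```

The chosen size would then be fed into the dfs.block.size and split-size settings shown above.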
Raid-Aware Compaction
▪       Compact SeqFile format partition
    ▪    INSERT OVERWRITE TABLE seq_table
         PARTITION (ds = "2012-08-17")
            SELECT `(ds)?+.+` FROM seq_table
            WHERE ds = "2012-08-17";
▪       Compact RCFile format partition
    ▪    ALTER TABLE rc_table PARTITION
             (ds="2009-08-31") CONCATENATE;
Directory XOR & Compaction - Things Learned
 • Replication factor 2.65 → 2.35 (an additional 12% space
   saving). Still rolling out.

 • Bookkeeping blocks’ checksums can avoid data
   corruption caused by bugs

 • HDFS being unaware of Raid causes some issues
  •   Operational errors can cause data loss (e.g., forgetting
      to move parity data along with source data)

 • Directory XOR & Compaction only work for warehouse
   data
Questions?
