Storage Infrastructure Behind
Facebook Messages
Using HBase at Scale
Jianfeng Zhang
• HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (the Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.
• Facebook is an online social networking service. Its name
comes from a colloquialism for the directory given to
students at some American universities.
• Facebook was founded in February 2004 by Mark
Zuckerberg.
• Users must register before using the site, after which they
may create a personal profile, add other users as friends,
exchange messages, and receive automatic notifications
when they update their profile. Additionally, users may join
common-interest user groups, organized by workplace,
school or college, or other characteristics, and categorize
their friends into lists such as "People From Work" or "Close
Friends".
• As of January 2014, Facebook had about 1.2 billion monthly users.
• Based on its 2012 revenue of US$5 billion, Facebook joined the Fortune 500 for the first time in the list published in May 2013, placed at position 462.
Monthly data volume prior to launch:
• 15 billion messages × 1,024 bytes = 14 TB
• 120 billion messages × 100 bytes = 11 TB
Why HBase Was Picked for Facebook Messages
• High write throughput
• Low-latency random reads
• Elasticity
• Cheap and fault tolerant
• Strong consistency within a data center
• Experience with HDFS
Messaging Data
• Small and medium data (HBase)
  ▪ Message metadata and bodies
  ▪ Snapshot of recent messages
  ▪ Search indices
• Attachments and large messages (Haystack)
HBase Architecture
• ZooKeeper
  ▪ Metadata
• HMaster
  ▪ Recovery
  ▪ Balancing
• Region Server
  ▪ Log > Flush
  ▪ Store > Compaction
  ▪ Region > Split
(see the minimal client sketch below)
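The list above names the moving parts; as a concrete illustration, here is a minimal sketch (not from the talk) of how a client touches them, assuming the classic HBase 1.x Java API and hypothetical table/column-family names: the connection locates regions via the ZooKeeper quorum, and reads and writes then go directly to region servers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MessagesClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");   // ZooKeeper quorum (placeholder hosts)

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("messages"))) {

      // Write: appended to the region server's write-ahead log, then buffered
      // in the memstore until it is flushed to an HFile.
      Put put = new Put(Bytes.toBytes("user123"));
      put.addColumn(Bytes.toBytes("actions"), Bytes.toBytes("msg-0001"),
                    Bytes.toBytes("addMessage"));
      table.put(put);

      // Read: served from the memstore and/or HFiles on the region server.
      Result r = table.get(new Get(Bytes.toBytes("user123")));
      System.out.println(Bytes.toString(
          r.getValue(Bytes.toBytes("actions"), Bytes.toBytes("msg-0001"))));
    }
  }
}
```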
Facebook Messages: Quick Stats
• 6B+ messages/day
• Traffic to HBase:
  ▪ 75+ billion R+W ops/day
  ▪ At peak: 1.5M ops/sec
  ▪ ~55% read vs. 45% write ops
  ▪ Avg write op inserts ~16 records across multiple column families
Facebook Messages: Quick Stats (contd.)
• 2PB+ of online data in HBase (6PB+ with replication; excludes backups)
  ▪ Message data, metadata, search index
  ▪ All data LZO compressed
  ▪ Growing at 250TB/month
Facebook Messages: Quick Stats (contd.)
• Timeline:
  ▪ Started in Dec 2009
  ▪ Roll-out started in Nov 2010
  ▪ Fully rolled out by July 2011 (migrated 1B+ accounts from legacy messages!)
• While in production:
  ▪ Schema changes: not once, but twice!
  ▪ Implemented & rolled out HFile V2 and numerous other optimizations in an upward-compatible manner!
Backups (V2)
• Currently does periodic HFile-level backups
• Working on:
  ▪ Moving to HFile + commit-log based backups, to recover to finer-grained points in time
  ▪ Avoiding the need to log data to Scribe
  ▪ Zero-copy (hard-link based) fast backups
Messages Schema & Evolution
• "Actions" (data) column family is the source of truth
  ▪ Log of all user actions (addMessage, markAsRead, etc.)
  ▪ Metadata (thread index, message index, search index), etc., in other column families
• Metadata portion of the schema underwent 3 changes:
  ▪ Coarse-grained snapshots (early development; rollout up to 1M users)
  ▪ Hybrid (up to full rollout – 1B+ accounts; 800M+ active)
  ▪ Fine-grained metadata (after rollout)
• MapReduce jobs against production clusters!
  ▪ Ran in a throttled way
  ▪ Heavy use of HBase bulk import features
(see the multi-column-family write sketch below)
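As a rough illustration of the schema idea above, the sketch below (column-family and qualifier names are hypothetical, not the production schema) shows how a single logical user action can land as one row mutation touching the "actions" log CF plus derived-metadata CFs, which is also why an average write op inserts many records across multiple column families.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ActionWriteSketch {
  static void addMessage(Connection conn, String userId, long msgId,
                         String body) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("messages"))) {
      Put put = new Put(Bytes.toBytes(userId));          // one row per user/mailbox
      // Source of truth: append the action to the "actions" CF.
      put.addColumn(Bytes.toBytes("actions"), Bytes.toBytes("addMessage-" + msgId),
                    Bytes.toBytes(body));
      // Derived metadata kept in separate CFs (thread index, search index, ...).
      put.addColumn(Bytes.toBytes("threads"), Bytes.toBytes("lastMsgId"),
                    Bytes.toBytes(msgId));
      put.addColumn(Bytes.toBytes("search"), Bytes.toBytes("token-" + msgId),
                    Bytes.toBytes(body.toLowerCase()));
      table.put(put);   // one atomic row mutation across all three CFs
    }
  }
}
```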
Write Path Overview
• Flushes: Memstore -> HFile
Read Path Overview
Compactions
(the sketch below shows how a flush and a compaction can be triggered via the admin API)
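A minimal sketch, assuming the HBase 1.x Admin API, of manually exercising the write path's flush step (memstore -> HFile) and the compaction step (merging HFiles) that these overview slides describe. The table name is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushCompactSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName messages = TableName.valueOf("messages");
      admin.flush(messages);        // force the table's memstores to HFiles
      admin.compact(messages);      // request a (minor) compaction
      admin.majorCompact(messages); // request a major compaction (merge all HFiles)
    }
  }
}
```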
Reliability: Early Work
• HDFS sync support for durability of transactions
• Multi-CF transaction atomicity
• Several bug fixes in log recovery
• New block placement policy in HDFS
  ▪ To reduce probability of data loss
(see the durable-write sketch below)
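A sketch, assuming the HBase 1.x client API, of requesting the WAL/HDFS sync behaviour this slide refers to: SYNC_WAL makes the put durable in the write-ahead log before the call returns. Table, CF, and qualifier names are placeholders.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DurableWriteSketch {
  static void durablePut(Connection conn) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("messages"))) {
      Put put = new Put(Bytes.toBytes("user123"));
      put.addColumn(Bytes.toBytes("actions"), Bytes.toBytes("markAsRead"),
                    Bytes.toBytes("thread-42"));
      put.setDurability(Durability.SYNC_WAL);  // sync the WAL (backed by HDFS sync)
      table.put(put);
    }
  }
}
```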
Availability: Early Work
• Common reasons for unavailability:
  ▪ S/W upgrades
    ▪ Solution: rolling upgrades
  ▪ Schema changes
    ▪ Applications need new column families
    ▪ Need to change settings for a CF
    ▪ Solution: online “alter table”
  ▪ Load balancing or cluster restarts took forever
    ▪ Upon investigation: stuck waiting for compactions to finish
    ▪ Solution: interruptible compactions!
Performance: Early Work
• Read optimizations:
  ▪ Seek optimizations for rows with a large number of cells
  ▪ Bloom filters
    ▪ Minimize HFile lookups
  ▪ Timerange hints on HFiles (great for temporal data)
  ▪ Multigets
  ▪ Improved handling of compressed HFiles
(see the bloom-filter / multiget sketch below)
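A sketch, assuming the HBase 1.x API, of two of the read optimizations listed above: enabling a row-level bloom filter on a column family (so lookups can skip HFiles that cannot contain the row) and batching point reads with a multiget. Table/CF names and the LZO choice are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOptimizationSketch {
  static void createTableWithBloom(Admin admin) throws Exception {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("messages"));
    HColumnDescriptor actions = new HColumnDescriptor("actions");
    actions.setBloomFilterType(BloomType.ROW);               // row-level bloom filter
    actions.setCompressionType(Compression.Algorithm.LZO);   // LZO, as mentioned in the deck
    desc.addFamily(actions);
    admin.createTable(desc);
  }

  static Result[] multiget(Table table, List<String> users) throws Exception {
    List<Get> gets = new ArrayList<>();
    for (String u : users) {
      gets.add(new Get(Bytes.toBytes(u)));
    }
    return table.get(gets);   // batched gets instead of one RPC per row
  }
}
```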
Performance: Compactions
• Critical for read performance
• Old algorithm:
  #1. Start from the newest file (file 0); include the next file if:
      size[i] < size[i-1] * C (good!)
  #2. Always compact at least 4 files, even if rule #1 isn’t met.
• Solution:
  #1. Compact at least 4 files, but only if eligible files are found.
  #2. Also, new file selection based on summation of sizes:
      size[i] < (size[0] + size[1] + … + size[i-1]) * C
(see the selection-logic sketch below)
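A sketch (not HBase's actual implementation) of the summation-based selection rule above: walk from newest to oldest, keep including files while each file is smaller than C times the total size of the files already selected, and only compact if at least a minimum number of eligible files is found.

```java
import java.util.ArrayList;
import java.util.List;

public class CompactionSelectionSketch {
  /**
   * @param sizes    HFile sizes, index 0 = newest file
   * @param ratio    the compaction ratio C
   * @param minFiles minimum number of eligible files required to compact
   * @return indexes of files selected for compaction, or an empty list
   */
  static List<Integer> selectFiles(long[] sizes, double ratio, int minFiles) {
    List<Integer> selected = new ArrayList<>();
    long sumSoFar = 0;
    for (int i = 0; i < sizes.length; i++) {
      // Rule: size[i] < (size[0] + ... + size[i-1]) * C  (file 0 always starts the set)
      if (i == 0 || sizes[i] < sumSoFar * ratio) {
        selected.add(i);
        sumSoFar += sizes[i];
      } else {
        break;
      }
    }
    // Only compact if enough eligible files were found (no forced compaction).
    return selected.size() >= minFiles ? selected : new ArrayList<Integer>();
  }

  public static void main(String[] args) {
    long[] sizes = {10, 12, 30, 60, 500};                 // MB, newest first
    System.out.println(selectFiles(sizes, 1.5, 4));       // [0, 1, 2, 3] — the 500MB file is skipped
  }
}
```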
Performance: Compactions
• More problems!
  ▪ Read performance dips during peak
  ▪ Major compaction storms
  ▪ Large compactions bottleneck
• Enhancements/fixes:
  ▪ Staggered major compactions
  ▪ Multi-threaded compactions; separate queues for small & big compactions
  ▪ Aggressive off-peak compactions
(see the tuning sketch below)
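For context, later Apache HBase releases expose configuration knobs that correspond to these fixes; the sketch below sets a few of them. The exact values are illustrative, not Facebook's internal settings.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionTuningSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Separate queues for small and large compactions via dedicated thread pools.
    conf.setInt("hbase.regionserver.thread.compaction.small", 3);
    conf.setInt("hbase.regionserver.thread.compaction.large", 1);
    // Define an off-peak window and allow more aggressive compactions inside it.
    conf.setInt("hbase.offpeak.start.hour", 1);   // 1 am
    conf.setInt("hbase.offpeak.end.hour", 6);     // 6 am
    conf.setFloat("hbase.hstore.compaction.ratio", 1.2f);
    conf.setFloat("hbase.hstore.compaction.ratio.offpeak", 5.0f);
    System.out.println("off-peak ratio = " +
        conf.get("hbase.hstore.compaction.ratio.offpeak"));
  }
}
```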
Metrics, metrics, metrics…
• Initially, only had coarse-level overall metrics (get/put latency/ops; block cache counters)
• Slow query logging
• Added per-column-family stats for:
  ▪ ops counts, latency
  ▪ block cache usage & hit ratio
  ▪ memstore usage
  ▪ on-disk file sizes
  ▪ file counts
  ▪ bytes returned, bytes flushed, compaction statistics
  ▪ stats by block type (data blocks vs. index blocks vs. bloom blocks, etc.)
  ▪ bloom filter stats
Metrics (contd.)
• HBase Master statistics:
  ▪ Number of region servers alive
  ▪ Number of regions
  ▪ Load balancing statistics
  ▪ ...
• All stats stored in Facebook’s Operational Data Store (ODS)
  ▪ Lots of ODS dashboards for debugging issues
  ▪ Side note: ODS planning to use HBase for storage pretty soon!
Need to keep up as data grows on you!
• Rapidly iterated on several new features while in production:
  ▪ Block indexes up to 6GB per server! Cluster startup taking longer and longer; block cache hit ratio on the decline
    ▪ Solution: HFile V2 – multi-level block index, sharded bloom filters
  ▪ Network pegged after restarts
    ▪ Solution: locality on full & rolling restarts
  ▪ High disk utilization during peak
    ▪ Solution: several “seek” optimizations to reduce disk IOPS
      ▪ Lazy seeks (use time hints to avoid seeking into older HFiles)
      ▪ Special bloom filter for deletes to avoid an additional seek
      ▪ Utilize off-peak IOPS to do more aggressive compactions during off-peak hours
Scares & Scars!
• Not without our share of scares and incidents:
  ▪ S/W bugs (e.g., deadlocks, incompatible LZO used for bulk-imported data, etc.)
    ▪ Found an edge-case bug in log recovery as recently as last week!
  ▪ Performance spikes every 6 hours (even off-peak)!
    ▪ Cleanup of HDFS’s recycle bin was sub-optimal! Needed a code and config fix.
  ▪ Transient rack switch failures
  ▪ ZooKeeper leader election took more than 10 minutes when one member of the quorum died. Fixed in a more recent version of ZK.
  ▪ HDFS NameNode – SPOF
  ▪ Flapping servers (repeated failures)
Scares & Scars! (contd.)
• Sometimes, tried things which hadn’t been tested in dark launch!
  ▪ Added a rack of servers to help with a performance issue
    ▪ Pegged the top-of-rack network bandwidth!
    ▪ Had to add the servers at a much slower pace. Very manual.
    ▪ Intelligent load balancing needed to make this more automated.
• A high % of issues caught in shadow/stress testing
• Lots of alerting mechanisms in place to detect failure cases
  ▪ Automated recovery for a lot of the common ones
  ▪ Treat alerts on the shadow cluster as high-priority too!
• Sharding the service across multiple HBase cells also paid off
Choosing HBase
They evaluated and tested various solutions; HBase offered:
• Strong consistency model
• Automatic failover
• Multiple shards per server for load balancing
  ▪ Prevents cascading failures
• Compression – saves disk and network bandwidth
• Read-modify-write operation support, such as counter increment (see the sketch below)
• MapReduce supported out of the box
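A sketch, assuming the HBase 1.x client API, of the read-modify-write support mentioned above: an atomic, server-side counter increment. Table, CF, and qualifier names are placeholders.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
  static long bumpUnreadCount(Connection conn, String userId) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("messages"))) {
      // Atomically adds 1 to the stored 8-byte counter and returns the new value.
      return table.incrementColumnValue(
          Bytes.toBytes(userId),
          Bytes.toBytes("meta"),
          Bytes.toBytes("unread"),
          1L);
    }
  }
}
```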
HBase uses HDFS
They get the benefits of HDFS as a storage system for free:
• HDFS has attractive features out of the box
  ▪ Easy to scale out for additional storage
  ▪ Checksums to detect and recover from corruption
  ▪ Block placement enhanced to minimize data-loss probability
• HDFS is battle-tested inside Facebook
  ▪ Currently running petabyte-scale clusters
  ▪ Development and operational experience with Hadoop
Working with HBase and HDFS
Goal of zero data loss:
• HDFS sync support for the write-ahead log
• Row-level ACID property
• Early log rolling
• Various critical bug fixes
  ▪ Log recovery
  ▪ Region assignments
• HBase Master redesign
  ▪ ZooKeeper integration
Stability and performance
• Availability and operational improvements
  ▪ Rolling restarts – minimal downtime on upgrades
  ▪ Ability to interrupt long-running operations (e.g., compactions)
  ▪ HBase fsck, metrics
• Performance
  ▪ Various improvements to response time, column seeking, bloom filters
• Stability
  ▪ Fixed various timeouts and race conditions
  ▪ Constant region count & controlled rolling splits
Operational Challenges
• Dark launch
• Deployments and monitoring
  ▪ Lots of internal HBase clusters used for various purposes
  ▪ A ton of scripts/dashboards/graphs for monitoring
  ▪ Automatic recovery
• Moving a ton of data around
  ▪ Migration
  ▪ Incremental snapshots
Future work
• Reliability, Availability, Scalability!
• Lots of new use cases on top of HBase in the works
• HDFS NameNode HA
• Recovering gracefully from transient issues
• Fast hot-backups
• Delta encoding in the block cache
• Replication
• Performance (HBase and HDFS)
• HBase as a service / multi-tenancy
• Features: coprocessors, secondary indices