HBase at Facebook
Jonathan Gray
Hadoop World
October 12, 2010
  1. HBase at Facebook
     Jonathan Gray
     Hadoop World
     October 12, 2010
  2. Agenda
     1 Data at Facebook
     2 HBase Development
     3 Future Work
  3. Data at Facebook
  4. 500 Million active monthly users
     >500 Billion page views per month
     25 Billion pieces of content per month
  5. [Diagram: technology stack. Labels: Cache, OS, Web server, Database, Language, Data analysis]
  6. [Diagram: data flow. Labels: Federated MySQL, Scribe/Hadoop cluster (Cluster 1, Cluster 2), Platinum Hadoop Cluster, Silver Cluster; transfer intervals: Daily, 5-15 minutes]
  7. [Diagram: technology stack. Labels: Cache, OS, Web server, Database, Language, Data analysis]
  8. [Diagram: technology stack. Labels: Cache, OS, Web server, Database, Language, Data analysis]
  9. HBase
     Key Properties
     ▪ Linearly scalable
     ▪ Fast indexed writes
     ▪ Tight integration with Hadoop
     Bridges gap between online and offline
  10. Use Case #1
      Incremental Updates into Data Warehouse
      ▪ Currently
        ▪ Nightly dumps of UDBs into Warehouse
      ▪ With HBase
        ▪ Tail UDB replication logs into HBase
      UDB to Warehouse in minutes
  11. Use Case #2
      High Frequency Counters and Realtime Analytics
      ▪ Currently
        ▪ Scribe to HDFS, periodically aggregate to UDB
      ▪ With HBase
        ▪ Scribe to HBase, read in realtime with API or MR
      Storage, serving, and analysis in one
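The counter pattern in Use Case #2 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not HBase client code: `CounterTable`, `record_event`, and the `metric#YYYYMMDDHH` row-key layout are hypothetical names chosen for illustration. The idea it shows is that each log event becomes an increment at write time, so totals are readable immediately instead of after a periodic aggregation job.

```python
from collections import defaultdict
from datetime import datetime, timezone


class CounterTable:
    """Toy stand-in for a table of counters (hypothetical, not the HBase API)."""

    def __init__(self):
        # row key -> column -> count
        self._rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row, column, amount=1):
        """Add `amount` to one cell and return the new value."""
        self._rows[row][column] += amount
        return self._rows[row][column]

    def get(self, row, column):
        """Read one counter; missing cells read as 0."""
        return self._rows.get(row, {}).get(column, 0)

    def scan_prefix(self, prefix):
        """Yield (row, columns) in sorted key order for rows starting with prefix."""
        for row in sorted(self._rows):
            if row.startswith(prefix):
                yield row, dict(self._rows[row])


def hourly_row_key(metric, ts):
    """Assumed key layout: one row per metric per hour of `ts`."""
    return f"{metric}#{ts.strftime('%Y%m%d%H')}"


def record_event(table, metric, dimension, ts):
    """Apply one log event as a counter increment at write time."""
    return table.increment(hourly_row_key(metric, ts), dimension)


table = CounterTable()
events = [
    ("page_view", "us", datetime(2010, 10, 12, 9, 5, tzinfo=timezone.utc)),
    ("page_view", "us", datetime(2010, 10, 12, 9, 40, tzinfo=timezone.utc)),
    ("page_view", "uk", datetime(2010, 10, 12, 10, 1, tzinfo=timezone.utc)),
]
for metric, dimension, ts in events:
    record_event(table, metric, dimension, ts)

# A prefix scan over one day's rows gives a rollup without a separate batch job.
daily_total = sum(
    sum(cols.values()) for _, cols in table.scan_prefix("page_view#20101012")
)
```

Because row keys sort by metric and then by hour, a single prefix scan covers a day; that key design is a choice made for this sketch, not something the slides specify.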
  12. Use Case #3
      User-facing Database for Write Intensive Workloads
      ▪ Currently
        ▪ Constantly expanding UDB and Memcache tiers
      ▪ With HBase
        ▪ Fast writes, automatic partitioning, linear scaling
      Fast and scalable writes, just add nodes
  13. HBase Development
  14. Hive Integration
      HBase and Hive
      ▪ HBase Tables usable as Hive Tables
        ▪ ETL data target
        ▪ Query data source
      ▪ Support for different read/write patterns
        ▪ API random write or MR bulk load
        ▪ API random read or MR table scan
  15. HBase Master
      Re-architected for HA and Testability
      ▪ Increased usage of ZooKeeper for failover
        ▪ Region transitions in ZK
        ▪ Working master failover in all cases
      ▪ Refactor/Redesign of major components
        ▪ Load balancer, cluster startup, failover redesigned
        ▪ Emphasis on independent testability
  16. Random Read Optimizations
      Performance degrades with lots of files
      ▪ Bloom filters
        ▪ Dynamic Row or Row+Column as HFile metadata
        ▪ Skip files on disk that don’t match
      ▪ Timestamp ranges
        ▪ Stored as HFile metadata
        ▪ Skip files on disk that don’t cover time range
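The two file-skipping ideas on slide 16 can be sketched together. The code below is a toy model, assuming row-level Bloom filters only: `BloomFilter`, `StoreFile`, and `get_row` are hypothetical names, and real HFile metadata is laid out differently. It shows how per-file metadata lets a read discard files without opening them.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class StoreFile:
    """Toy immutable store file carrying a Bloom filter and a time range."""

    def __init__(self, cells):
        # cells: list of (row, timestamp, value)
        self.cells = list(cells)
        self.bloom = BloomFilter()
        for row, _, _ in self.cells:
            self.bloom.add(row)
        stamps = [ts for _, ts, _ in self.cells]
        self.min_ts, self.max_ts = min(stamps), max(stamps)

    def overlaps(self, start_ts, end_ts):
        """True if [min_ts, max_ts] intersects the query range [start_ts, end_ts)."""
        return self.min_ts < end_ts and self.max_ts >= start_ts


def get_row(files, row, start_ts, end_ts):
    """Return matching cells and the number of files actually opened."""
    results, opened = [], 0
    for f in files:
        if not f.overlaps(start_ts, end_ts):
            continue  # skipped using the time-range metadata
        if not f.bloom.might_contain(row):
            continue  # skipped using the Bloom filter
        opened += 1
        results += [c for c in f.cells if c[0] == row and start_ts <= c[1] < end_ts]
    return results, opened


files = [
    StoreFile([("user1", 100, "a"), ("user2", 110, "b")]),
    StoreFile([("user3", 200, "c")]),
    StoreFile([("user1", 300, "d")]),
]
cells, opened = get_row(files, "user1", 0, 250)
```

In this example the third file is always skipped by its time range. The second file is skipped unless the Bloom filter happens to return a false positive for `user1`, which is why a Bloom filter can only rule files out, never confirm a hit.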
  17. Random Read Optimizations
      Performance degrades with wide rows
      ▪ Aggressively seek/reseek
        ▪ Use query and block index to skip blocks
        ▪ Stop processing as soon as we finish query
      ▪ Expose seeking to Filter API
        ▪ Allow specialized optimizations
        ▪ Millions of versions in a row, grab 10
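The seek-and-stop behavior on slide 17 can be sketched with a sorted row split into blocks. This is an illustrative model only: `BlockedRow`, `scan_range`, and the fixed block size are assumptions, not HBase internals. It shows a block index being used to jump straight to the first relevant block, and the scan ending as soon as the requested number of columns has been collected.

```python
from bisect import bisect_left, bisect_right


class BlockedRow:
    """Toy wide row: sorted column qualifiers split into fixed-size blocks."""

    def __init__(self, columns, block_size=4):
        cols = sorted(columns)
        self.blocks = [cols[i:i + block_size] for i in range(0, len(cols), block_size)]
        # Block index: the first column of every block.
        self.index = [block[0] for block in self.blocks]
        self.blocks_read = 0

    def _read_block(self, i):
        self.blocks_read += 1
        return self.blocks[i]

    def scan_range(self, start, stop, limit):
        """Return up to `limit` columns in [start, stop), seeking via the index."""
        out = []
        # Seek: jump to the last block whose first column is <= start.
        i = max(bisect_right(self.index, start) - 1, 0)
        # Stop before reading another block once the query is satisfied.
        while i < len(self.blocks) and self.index[i] < stop and len(out) < limit:
            block = self._read_block(i)
            for col in block[bisect_left(block, start):]:
                if col >= stop or len(out) == limit:
                    return out
                out.append(col)
            i += 1
        return out


row = BlockedRow([f"c{n:04d}" for n in range(1000)], block_size=10)
result = row.scan_range("c0500", "c0900", limit=10)
```

With 100 blocks in the row, this query reads a single block (`row.blocks_read == 1`): the index lookup skips the first 50 blocks and the limit check avoids touching the rest.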
  18. Administration Tools
      Detect and repair potential issues
      ▪ HBCK
        ▪ FSCK for HBase
        ▪ Detect and repair cluster issues
      ▪ Cluster Verification
        ▪ Ensure cluster can be written to, read from
        ▪ Tables can be created/disabled/dropped
  19. Hadoop Improvements
      HDFS Appends
      ▪ Hadoop 0.20
        ▪ Widely deployed but no support for appends
      ▪ Hadoop 0.20 with append support
        ▪ Apache Hadoop 0.20-Append
        ▪ Cloudera’s CDH version 3
        ▪ Facebook’s version of Hadoop 0.20
          ▪ http://github.com/facebook/hadoop-20-append
  20. Hadoop Improvements
      HDFS rolling upgrades and NameNode HA
      ▪ HDFS in online application
        ▪ Need to support upgrades without downtime
        ▪ More sensitive to NameNode SPOF
      ▪ Hadoop AvatarNode
        ▪ Hot standby pair of NameNodes
        ▪ Failover to new version of NameNode
        ▪ Failover to hot standby in seconds under failure
  21. Coming Soon
      New Features
      ▪ East coast / west coast replication
        ▪ Asynchronous replication between data centers
      ▪ Faster recovery
        ▪ Distributed log splitting
      ▪ Master controlled rolling restart
        ▪ Fast and retaining assignment information
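The "distributed log splitting" item on slide 21 can be pictured with a short sketch. It assumes that a failed server's write-ahead logs interleave edits for many regions and that recovery must regroup them per region before replay; the function names, the `(seq_id, region, edit)` tuple layout, and the use of a thread pool as the "workers" are all hypothetical simplifications, since the slides do not describe the mechanism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_log(log_file):
    """Group one log file's edits by region, preserving their order."""
    per_region = defaultdict(list)
    for seq_id, region, edit in log_file:
        per_region[region].append((seq_id, edit))
    return per_region


def split_logs_distributed(log_files, workers=4):
    """Split each log file in its own task, then merge the per-region results."""
    merged = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for per_region in pool.map(split_log, log_files):
            for region, edits in per_region.items():
                merged[region].extend(edits)
    # Replay order within a region follows the sequence id.
    return {region: sorted(edits) for region, edits in merged.items()}


logs = [
    [(1, "r1", "put a"), (2, "r2", "put b"), (3, "r1", "put c")],
    [(4, "r2", "put d"), (5, "r1", "delete a")],
]
recovered = split_logs_distributed(logs)
```

Splitting one file per task is what makes the work parallel: recovery time then depends on the largest log file rather than on the total volume of logs.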
  22. Future Work
  23. Coprocessors
      Complex server-side operations
      ▪ Dynamically loaded server-side logic
        ▪ Hook into read/write and cluster operations
      ▪ Endless possibilities
        ▪ Server-side merges and joins
        ▪ Lightweight MapReduce for aggregation
        ▪ Efficient secondary indexing
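The hook idea on slide 23 can be sketched with a toy region that calls registered observers after each write. `Region`, `SecondaryIndexObserver`, and `post_put` are hypothetical names for this illustration and should not be read as the coprocessor API. The example uses a hook to maintain a secondary index on the server side, one of the uses the slide lists.

```python
class Region:
    """Toy region that runs registered hooks after every write (hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def register(self, observer):
        self.observers.append(observer)

    def put(self, row, column, value):
        old = self.rows.get(row, {}).get(column)
        self.rows.setdefault(row, {})[column] = value
        for observer in self.observers:
            observer.post_put(row, column, old, value)


class SecondaryIndexObserver:
    """Keeps a value -> rows index for one column, updated on every put."""

    def __init__(self, column):
        self.column = column
        self.index = {}

    def post_put(self, row, column, old, value):
        if column != self.column:
            return
        # Remove the row from the entry for its previous value, if indexed.
        if old is not None and old in self.index:
            self.index[old].discard(row)
            if not self.index[old]:
                del self.index[old]
        self.index.setdefault(value, set()).add(row)

    def lookup(self, value):
        return sorted(self.index.get(value, ()))


region = Region()
by_city = SecondaryIndexObserver("city")
region.register(by_city)

region.put("u1", "city", "Palo Alto")
region.put("u2", "city", "Palo Alto")
region.put("u1", "city", "New York")
region.put("u1", "name", "Ann")
```

The client only issues ordinary puts; the index stays current because the update logic runs next to the data, which is the point of loading logic into the server.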
  24. Intelligent Load Balancing
      Complex notion of load
      ▪ Currently based only on region count
        ▪ Different regions have different access patterns
        ▪ And data locality equally important
      ▪ Next generation load balancing algorithms
        ▪ Consider complex notion of read load / write load
        ▪ And HDFS block locations for locality
        ▪ Retain assignment information between restarts
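The gap between count-based and load-aware balancing on slide 24 can be made concrete with a small comparison. This is a sketch only: the per-region load numbers are invented, the greedy heuristic is a stand-in chosen for brevity rather than the algorithm the slides have in mind, and data locality is ignored entirely.

```python
import heapq


def balance_by_count(regions, servers):
    """Round-robin assignment: equal region counts, load ignored."""
    assignment = {s: [] for s in servers}
    for i, (name, _) in enumerate(regions):
        assignment[servers[i % len(servers)]].append(name)
    return assignment


def balance_by_load(regions, servers):
    """Greedy: place the heaviest regions first on the least-loaded server."""
    assignment = {s: [] for s in servers}
    heap = [(0.0, s) for s in servers]
    heapq.heapify(heap)
    for name, load in sorted(regions, key=lambda r: r[1], reverse=True):
        total, server = heapq.heappop(heap)
        assignment[server].append(name)
        heapq.heappush(heap, (total + load, server))
    return assignment


def server_loads(assignment, regions):
    """Sum the per-region load on each server."""
    load_of = dict(regions)
    return {s: sum(load_of[r] for r in names) for s, names in assignment.items()}


# (region name, hypothetical requests per second)
regions = [("r1", 90), ("r2", 10), ("r3", 80), ("r4", 20)]
servers = ["rs1", "rs2"]

count_loads = server_loads(balance_by_count(regions, servers), regions)
load_loads = server_loads(balance_by_load(regions, servers), regions)
```

Both strategies give each server two regions, but the count-based one ends with loads of 170 and 30 while the load-aware one ends with 100 and 100.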
  25. Other Future Work
      Cluster Performance
      ▪ Quality of service
        ▪ One MapReduce job can take down cluster
      ▪ Dynamic configuration changes
        ▪ Change important parameters on running cluster
      ▪ HDFS performance
        ▪ Critical target for long-term HBase performance
  26. (c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
