Facebook - Jonathan Gray - Hadoop World 2010
HBase in Production at Facebook

Jonathan Gray
Facebook

Transcript

  • 1. HBase at Facebook ▪ Jonathan Gray ▪ Hadoop World ▪ October 12, 2010
  • 2. Agenda ▪ 1 Data at Facebook ▪ 2 HBase Development ▪ 3 Future Work
  • 3. Data at Facebook
  • 4. 500 million monthly active users ▪ >500 billion page views per month ▪ 25 billion pieces of content per month
  • 5. Cache ▪ OS ▪ Web server ▪ Database ▪ Language ▪ Data analysis
  • 6. [Diagram labels] Daily ▪ Federated MySQL ▪ Silver Cluster ▪ Platinum Hadoop Cluster ▪ 5-15 minutes (twice) ▪ Cluster 1 ▪ Cluster 2 ▪ Scribe/Hadoop cluster (twice)
  • 7. Cache ▪ OS ▪ Web server ▪ Database ▪ Language ▪ Data analysis
  • 8. Cache ▪ OS ▪ Web server ▪ Database ▪ Language ▪ Data analysis
  • 9. HBase Key Properties ▪ Linearly scalable ▪ Fast indexed writes ▪ Tight integration with Hadoop ▪ Bridges gap between online and offline
  • 10. Use Case #1: Incremental Updates into Data Warehouse ▪ Currently ▪ Nightly dumps of UDBs into Warehouse ▪ With HBase ▪ Tail UDB replication logs into HBase ▪ UDB to Warehouse in minutes
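The move from nightly dumps to tailing a replication log on slide 10 can be illustrated with a minimal in-memory sketch. The log format here (tuples of sequence number, operation, key, value, sorted by sequence number) and the function name `apply_log` are assumptions made for illustration; they are not the UDB log format or any HBase API.

```python
def apply_log(table, log, last_applied):
    """Apply log entries newer than last_applied to table.

    table: dict standing in for the warehouse-side table (hypothetical).
    log:   iterable of (seq, op, key, value), assumed sorted by seq.
    Returns the new high-water mark so the next pass can resume from it.
    """
    for seq, op, key, value in log:
        if seq <= last_applied:
            continue  # already applied on an earlier pass
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
        last_applied = seq
    return last_applied


log = [
    (1, "put", "u1", "alice"),
    (2, "put", "u2", "bob"),
    (3, "delete", "u1", None),
]
table = {}
mark = apply_log(table, log[:2], 0)   # first pass sees two entries
mark = apply_log(table, log, mark)    # second pass only applies entry 3
```

Tracking a high-water mark makes each pass incremental: re-reading the whole log does not re-apply old entries, which is what lets the target lag the source by one pass rather than by a full nightly dump.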
  • 11. Use Case #2: High Frequency Counters and Realtime Analytics ▪ Currently ▪ Scribe to HDFS, periodically aggregate to UDB ▪ With HBase ▪ Scribe to HBase, read in realtime with API or MR ▪ Storage, serving, and analysis in one
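A rough sketch of the counter pattern on slide 11, using an in-memory stand-in rather than HBase itself. The class `CounterTable`, the helper `record_event`, and the hourly-bucket row-key layout (`url#hour`) are all assumptions chosen for illustration; the slides do not describe the actual schema.

```python
from collections import defaultdict


class CounterTable:
    """In-memory stand-in for a table of counters keyed by (row, column)."""

    def __init__(self):
        self._cells = defaultdict(int)

    def increment(self, row, column, amount=1):
        """Add amount to one counter cell and return the new value."""
        self._cells[(row, column)] += amount
        return self._cells[(row, column)]

    def get(self, row, column):
        """Random read of a single counter; missing cells read as 0."""
        return self._cells.get((row, column), 0)

    def scan_prefix(self, prefix):
        """Yield (row, column, value) for rows starting with prefix, in key order."""
        for (row, column), value in sorted(self._cells.items()):
            if row.startswith(prefix):
                yield row, column, value


def record_event(table, url, epoch_seconds):
    """Count one page view in an hourly bucket (row-key layout is an assumption)."""
    hour = epoch_seconds // 3600
    table.increment(f"{url}#{hour:010d}", "views")
```

Writes land in the same structure that serves reads, so a point lookup (`get`) or a range scan (`scan_prefix`) sees new counts immediately instead of waiting for a periodic aggregation job.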
  • 12. Use Case #3: User-facing Database for Write Intensive Workloads ▪ Currently ▪ Constantly expanding UDB and Memcache tiers ▪ With HBase ▪ Fast writes, automatic partitioning, linear scaling ▪ Fast and scalable writes, just add nodes
  • 13. HBase Development
  • 14. Hive Integration: HBase and Hive ▪ HBase Tables usable as Hive Tables ▪ ETL data target ▪ Query data source ▪ Support for different read/write patterns ▪ API random write or MR bulk load ▪ API random read or MR table scan
  • 15. HBase Master: Re-architected for HA and Testability ▪ Increased usage of ZooKeeper for failover ▪ Region transitions in ZK ▪ Working master failover in all cases ▪ Refactor/Redesign of major components ▪ Load balancer, cluster startup, failover redesigned ▪ Emphasis on independent testability
  • 16. Random Read Optimizations: Performance degrades with lots of files ▪ Bloom filters ▪ Dynamic Row or Row+Column as HFile metadata ▪ Skip files on disk that don’t match ▪ Timestamp ranges ▪ Stored as HFile metadata ▪ Skip files on disk that don’t cover time range
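The two file-skipping checks on slide 16 can be sketched in a few lines. This is a simplified model under stated assumptions, not HBase's HFile code: `BloomFilter`, `StoreFile`, and `files_to_read` are hypothetical names, the filter is keyed on row only, and the hash scheme (SHA-256 with a seed prefix) is chosen for brevity.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


@dataclass
class StoreFile:
    """Hypothetical stand-in for one on-disk file plus its metadata."""
    name: str
    cells: dict  # (row, timestamp) -> value; assumed non-empty
    bloom: BloomFilter = field(default_factory=BloomFilter)
    min_ts: int = 0
    max_ts: int = 0

    def __post_init__(self):
        # Metadata is computed once, when the file is "written".
        timestamps = [ts for _, ts in self.cells]
        self.min_ts, self.max_ts = min(timestamps), max(timestamps)
        for row, _ in self.cells:
            self.bloom.add(row)


def files_to_read(files, row, ts_lo, ts_hi):
    """Return only the files that could hold row within [ts_lo, ts_hi]."""
    selected = []
    for f in files:
        if f.max_ts < ts_lo or f.min_ts > ts_hi:
            continue  # file's time range does not overlap the query
        if not f.bloom.might_contain(row):
            continue  # Bloom filter says the row is definitely absent
        selected.append(f)
    return selected
```

Both checks use only per-file metadata, so a file that fails either one is never opened. A Bloom filter can still let an irrelevant file through (a false positive), which costs an extra read but never a wrong answer.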
  • 17. Random Read Optimizations: Performance degrades with wide rows ▪ Aggressively seek/reseek ▪ Use query and block index to skip blocks ▪ Stop processing as soon as we finish query ▪ Expose seeking to Filter API ▪ Allow specialized optimizations ▪ Millions of versions in a row, grab 10
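The seek-and-stop-early idea on slide 17 can be shown with a small model of a wide row split into blocks. `BlockedRow` and its `blocks_read` counter are hypothetical and exist only to make the saving visible; the real block index and Filter API are more involved.

```python
import bisect


class BlockedRow:
    """Hypothetical wide row: sorted column names split into fixed-size blocks."""

    def __init__(self, columns, block_size=4):
        cols = sorted(columns)
        self.blocks = [cols[i:i + block_size] for i in range(0, len(cols), block_size)]
        # Block index: the first key of every block.
        self.index = [block[0] for block in self.blocks]
        self.blocks_read = 0  # instrumentation for this sketch only

    def scan(self, start, limit):
        """Return up to limit columns >= start, reading as few blocks as possible."""
        # Seek: jump to the last block whose first key is <= start.
        block_no = max(bisect.bisect_right(self.index, start) - 1, 0)
        result = []
        while block_no < len(self.blocks) and len(result) < limit:
            self.blocks_read += 1
            for col in self.blocks[block_no]:
                if col >= start:
                    result.append(col)
                    if len(result) == limit:
                        break  # query satisfied: stop processing
            block_no += 1
        return result


row = BlockedRow([f"c{i:04d}" for i in range(1000)], block_size=10)
first_three = row.scan("c0500", 3)
```

In this usage the row has 100 blocks, but the scan touches only the one containing `c0500`, because the index lookup skips the preceding blocks and the limit check stops before the following ones.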
  • 18. Administration Tools: Detect and repair potential issues ▪ HBCK ▪ FSCK for HBase ▪ Detect and repair cluster issues ▪ Cluster Verification ▪ Ensure cluster can be written to, read from ▪ Tables can be created/disabled/dropped
  • 19. Hadoop Improvements: HDFS Appends ▪ Hadoop 0.20 ▪ Widely deployed but no support for appends ▪ Hadoop 0.20 with append support ▪ Apache Hadoop 0.20-Append ▪ Cloudera’s CDH version 3 ▪ Facebook’s version of Hadoop 0.20 ▪ http://github.com/facebook/hadoop-20-append
  • 20. Hadoop Improvements: HDFS rolling upgrades and NameNode HA ▪ HDFS in online application ▪ Need to support upgrades without downtime ▪ More sensitive to NameNode SPOF ▪ Hadoop AvatarNode ▪ Hot standby pair of NameNodes ▪ Failover to new version of NameNode ▪ Failover to hot standby in seconds under failure
  • 21. Coming Soon: New Features ▪ East coast / west coast replication ▪ Asynchronous replication between data centers ▪ Faster recovery ▪ Distributed log splitting ▪ Master controlled rolling restart ▪ Fast and retaining assignment information
  • 22. Future Work
  • 23. Coprocessors: Complex server-side operations ▪ Dynamically loaded server-side logic ▪ Hook into read/write and cluster operations ▪ Endless possibilities ▪ Server-side merges and joins ▪ Lightweight MapReduce for aggregation ▪ Efficient secondary indexing
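To make the "hook into read/write operations" idea on slide 23 concrete, here is a toy sketch in which a hook registered on the server side keeps a secondary index up to date on every write. `Region`, `register_post_put`, and `SecondaryIndex` are hypothetical names for this illustration; they do not reflect the coprocessor API that HBase later shipped.

```python
class Region:
    """Hypothetical in-memory region that runs registered hooks after each put."""

    def __init__(self):
        self.rows = {}
        self._post_put_hooks = []

    def register_post_put(self, hook):
        """Attach server-side logic; hook(row, column, old_value, new_value)."""
        self._post_put_hooks.append(hook)

    def put(self, row, column, value):
        old = self.rows.get(row, {}).get(column)
        self.rows.setdefault(row, {})[column] = value
        for hook in self._post_put_hooks:
            hook(row, column, old, value)


class SecondaryIndex:
    """Maps a column's values back to the rows holding them."""

    def __init__(self, column):
        self.column = column
        self.index = {}

    def on_put(self, row, column, old, new):
        if column != self.column:
            return  # this index only tracks one column
        if old is not None:
            self.index.get(old, set()).discard(row)  # drop the stale entry
        self.index.setdefault(new, set()).add(row)

    def lookup(self, value):
        return sorted(self.index.get(value, set()))


region = Region()
by_city = SecondaryIndex("city")
region.register_post_put(by_city.on_put)
region.put("u1", "city", "Paris")
region.put("u2", "city", "Paris")
region.put("u1", "city", "Rome")  # moves u1 from Paris to Rome in the index
```

Because the hook runs where the data lives, the client issues a single write and the index maintenance needs no extra round trip.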
  • 24. Intelligent Load Balancing: Complex notion of load ▪ Currently based only on region count ▪ Different regions have different access patterns ▪ And data locality equally important ▪ Next generation load balancing algorithms ▪ Consider complex notion of read load / write load ▪ And HDFS block locations for locality ▪ Retain assignment information between restarts
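The gap between count-based and load-aware balancing on slide 24 shows up even in a toy example. The greedy heuristic below (heaviest region first, onto the least-loaded server) is an assumption picked for brevity, not the algorithm HBase uses, and it leaves out the data-locality dimension the slide also mentions. The names `balance`, `by_count`, and `by_requests` are hypothetical.

```python
import heapq


def balance(regions, servers, weight):
    """Greedily place regions, heaviest first, on the least-loaded server.

    regions: dict of region name -> dict of metrics
    weight:  function mapping a region's metrics to a load number
    """
    heap = [(0.0, s) for s in sorted(servers)]
    heapq.heapify(heap)
    assignment = {s: [] for s in servers}
    for name in sorted(regions, key=lambda r: (-weight(regions[r]), r)):
        load, server = heapq.heappop(heap)
        assignment[server].append(name)
        heapq.heappush(heap, (load + weight(regions[name]), server))
    return assignment


def by_count(metrics):
    """Every region counts the same (balancing on region count only)."""
    return 1.0


def by_requests(metrics):
    """Load is the region's combined read and write request count."""
    return metrics["reads"] + metrics["writes"]


regions = {
    "hot": {"reads": 900, "writes": 100},
    "a": {"reads": 10, "writes": 0},
    "b": {"reads": 10, "writes": 0},
    "c": {"reads": 10, "writes": 0},
}
servers = ["s1", "s2"]
count_based = balance(regions, servers, by_count)
load_based = balance(regions, servers, by_requests)
```

With `by_count`, each server ends up with two regions, yet one of them also carries the hot region. With `by_requests`, the hot region gets a server to itself and the three quiet regions share the other.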
  • 25. Other Future Work: Cluster Performance ▪ Quality of service ▪ One MapReduce job can take down cluster ▪ Dynamic configuration changes ▪ Change important parameters on running cluster ▪ HDFS performance ▪ Critical target for long-term HBase performance
  • 26. (c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.