Facebook - Jonathan Gray - Hadoop World 2010

HBase in Production at Facebook

Jonathan Gray

  • 1. HBase at Facebook. Jonathan Gray, Hadoop World, October 2010. Facebook has several teams looking into HBase, and several things are going into production before the end of the year. About me: HBase committer; member of the Open Source Team supporting data infrastructure; work across teams supporting usage of and contribution to open source projects.
  • 3. Data at Facebook We have a lot of data
  • 4. Million active monthly users > Billion page views per month Billion pieces of content per month. Lots of data, lots of reads, lots of writes. Not just a lot of data, but lots of reads and writes. Now let’s look at FB’s core technology stack.
  • 5. Traditional Facebook LAMP Stack (diagram layers: cache, data analysis, OS, web server, database, language)
    User data is read from and written to clusters of federated MySQL.
    To deal with reads, add Memcache. Data being served to users is written to MySQL and cached in Memcache.
    To deal with analysis, add Hadoop. Other types of data, such as clicking on a link or viewing a photo, are sent through Scribe and into Hadoop. This is data we want for analysis but that doesn't need to be served: web server logs, MySQL data, PHP-generated logs.
  • 6. (Diagram labels: Scribe/Hadoop clusters, turnaround in minutes; federated MySQL, daily; Platinum Hadoop cluster; Silver cluster.)
    1. Web server logs, PHP logs, and services logs go via Scribe into Hadoop.
    2. Daily dumps of the MySQL tier.
    3. Mirrored into additional Hadoop clusters.
    A) What if we want MySQL data with 5-15 minute turnaround as well? Hadoop/Hive don’t support incremental updates, but the MySQL replication stream exists.
    B) What if we want to have random access to all of this data? Hadoop is efficient for streaming but not for random access.
  • 7. (Diagram layers: cache, data analysis, OS, web server, database, language.) Now let's focus on the data components of this stack.
  • 8. (Diagram layers: cache, data analysis, OS, web server, database, language.) HBase can help to augment several layers of the data stack.
    1. Persistent data store. For write-intensive workloads and high-capacity storage, the Bigtable-style architecture and scale-out approach can be advantageous. Column orientation is also good for certain types of data, like inbox search.
    2. Efficient, random, realtime read/write access. Recently written data is kept in memory; recently read data is kept in the in-memory block cache.
    3. Integration with Hadoop and Hive: MapReduce read/write, and writing directly to HDFS.
  • 9. HBase Key Properties
    ▪ Linearly scalable
    ▪ Fast indexed writes
    ▪ Tight integration with Hadoop
    Bridges gap between online and offline
    1. Add more servers; no manual partitioning/repartitioning.
    2. Designed for fast random access. Writes are in-memory; reads use indexes and lots of caching.
    3. Inherits fault-tolerance properties from Hadoop (persistence and WAL in HDFS). MapReduce input/output. Write directly to HDFS as HFiles.
    HBase is unique because it sits between two different types of system: a serving database and a batch processing database.
  • 10. Use Case: Incremental Updates into Data Warehouse
    ▪ Currently
      ▪ Nightly dumps of UDBs into Warehouse
    ▪ With HBase
      ▪ Tail UDB replication logs into HBase
    UDB to Warehouse in minutes
    Currently: doing full scrapes of UDBs into Hive nightly.
    With HBase: keep up to date by reading from MySQL replication logs and writing directly into HBase. Turnaround goes from one day to on the order of 15 minutes.
  • 11. Use Case: High Frequency Counters and Realtime Analytics
    ▪ Currently
      ▪ Scribe to HDFS, periodically aggregate to UDB
    ▪ With HBase
      ▪ Scribe to HBase, read in realtime with API or MR
    Storage, serving, and analysis in one
    Counters are expensive read/modify/write operations: a small amount of data but a high number of operations. We want to be able to look at data over time and also to calculate aggregates like topN.
    Currently: writing into HDFS or specialized in-memory systems that support a high number of operations, then periodically analyzing/aggregating and dumping into UDB.
    With HBase: write directly into HBase, with an optimized counter implementation. Data is available for fast random reads but can also feed big MapReduce jobs. A single system and one write cover persistent storage, realtime serving, and offline analysis. HDFS -> Hive tables -> intermediate data -> UDBs is wasteful and slow.
  • 12. Use Case: User-facing Database for Write Intensive Workloads
    ▪ Currently
      ▪ Constantly expanding UDB and Memcache tiers
    ▪ With HBase
      ▪ Fast writes, automatic partitioning, linear scaling
    Fast and scalable writes, just add nodes
    MySQL is powerful but not great at storing large amounts of data: multi-MB data or billions of writes.
    Currently: writing into the UDB tier and reading from UDB/Memcache, constantly expanding the tier to support the growing dataset and writes.
    With HBase: writes into HBase are efficient, and reads come directly from HBase. Just add more RegionServers to scale, with automatic load balancing/partitioning. HBase is designed for efficient writing; add more nodes to scale.
  • 13. HBase Development
  • 14. Hive Integration: HBase and Hive
    ▪ HBase Tables usable as Hive Tables
      ▪ ETL data target
      ▪ Query data source
    ▪ Support for different read/write patterns
      ▪ API random write or MR bulk load
      ▪ API random read or MR table scan
    1. Developed an HBase storage handler. Tables can be written to or read from, as a data sink or a data source.
    2. Support for two kinds of reads and writes. Can write randomly with the API, or use HFileOutputFormat to write directly to HDFS from MR. Can read randomly with the API, or read with TableInputFormat using MR.
  • 15. HBase Master: Re-architected for HA and Testability
    ▪ Increased usage of ZooKeeper for failover
      ▪ Region transitions in ZK
      ▪ Working master failover in all cases
    ▪ Refactor/Redesign of major components
      ▪ Load balancer, cluster startup, failover redesigned
      ▪ Emphasis on independent testability
    1. Simple failover scenarios worked, but not when the cluster was in a state of flux. Region transitions moved into ZK so a new master knows the state of regions. Support for the master dying at any point, including concurrent failures of RegionServers. Cleanup of ZK code to let unexpected exceptions propagate.
    2. Cleaned up lots of master code and refactored things into separate classes so they are testable. De-emphasis on heartbeats and continuous load balancing. Explicit phases for startup/failover, with the load balancer as a background process. New components are standalone testable.
  • 16. Random Read Optimizations: Performance degrades with lots of files
    ▪ Bloom filters
      ▪ Dynamic Row or Row+Column as HFile metadata
      ▪ Skip files on disk that don’t match
    ▪ Timestamp ranges
      ▪ Stored as HFile metadata
      ▪ Skip files on disk that don’t cover time range
    A number of read performance improvements were made once we started writing lots of data into HBase. Lots of writes lead to lots of files on disk for each region.
    1. Bloom filters. Stored at row or row/col level. Folding blooms with configurable error rate, default 99% accuracy at 8-9 bits/key. Allows us to skip entire files when doing random reads.
    2. Timestamp ranges. Store the minimum and maximum timestamps on each file as we write it. When doing a query for a specific timestamp or time range, we can skip files outside it.
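The two file-skipping checks described on this slide can be sketched roughly as follows. This is a minimal illustration and not HBase's implementation: `BloomFilter`, `StoreFile`, and `files_to_read` are hypothetical names, the SHA-256 double-hashing scheme and the `bits_per_key`/`num_hashes` defaults are assumptions chosen for readability, and each file is assumed to hold at least one cell.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class StoreFile:
    """Hypothetical stand-in for one on-disk file plus its metadata."""

    def __init__(self, name, cells, bits_per_key=16, num_hashes=7):
        # cells maps row key -> list of integer timestamps stored in this file.
        self.name = name
        self.bloom = BloomFilter(bits_per_key * max(1, len(cells)), num_hashes)
        for row in cells:
            self.bloom.add(row)
        all_ts = [ts for stamps in cells.values() for ts in stamps]
        self.min_ts = min(all_ts)  # recorded once, when the file is written
        self.max_ts = max(all_ts)


def files_to_read(files, row, ts_min, ts_max):
    """Return names of files that cannot be ruled out for this row and time range."""
    selected = []
    for f in files:
        if f.max_ts < ts_min or f.min_ts > ts_max:
            continue  # file's timestamp range does not overlap the query
        if not f.bloom.might_contain(row):
            continue  # Bloom filter proves the row is absent from this file
        selected.append(f.name)
    return selected


old = StoreFile("old", {"row-a": [10, 20]})
new = StoreFile("new", {"row-a": [150], "row-b": [100, 200]})
print(files_to_read([old, new], "row-a", 140, 160))
```

Both checks use only per-file metadata, so a file that fails either one is never opened. A Bloom filter can still admit a file that does not hold the row (a false positive), which costs an unnecessary read but never a wrong answer.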
  • 17. Random Read Optimizations: Performance degrades with wide rows
    ▪ Aggressively seek/reseek
      ▪ Use query and block index to skip blocks
      ▪ Stop processing as soon as we finish query
    ▪ Expose seeking to Filter API
      ▪ Allow specialized optimizations
      ▪ Millions of versions in a row, grab
    Certain schemas and read patterns are not efficient with lots of data. Wide rows led to inefficient reads by touching too much data.
    1. Seeking. Use query input and block indexes to skip unnecessary blocks. Reads used to always begin at the start of the row and end at the end of the row; now they seek and early-out. Early out whenever possible, skip blocks where possible.
    2. Expose as filter API. There are lots of different potential optimizations for specific use cases. Rather than continuing to modify core read-path code or adding to the client API, expose seeking through the filter API. Example: a row with millions of versions, where we want to grab 10 specific versions.
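The seek and early-out idea can be illustrated with a small sketch. It assumes a file is a list of sorted blocks and that the block index is simply the first key of each block; `scan_range` and this layout are hypothetical simplifications, not HBase's actual HFile format or read path.

```python
from bisect import bisect_right


def scan_range(blocks, start_key, stop_key):
    """Return (results, blocks_read) for keys in [start_key, stop_key).

    blocks is a list of non-empty blocks; each block is a list of (key, value)
    pairs, and keys are sorted within and across blocks.
    """
    index = [block[0][0] for block in blocks]  # block index: first key per block
    # Seek: jump to the last block whose first key is <= start_key.
    i = max(bisect_right(index, start_key) - 1, 0)
    results, blocks_read = [], 0
    while i < len(blocks):
        if index[i] >= stop_key:
            break  # early out: this block starts past the requested range
        blocks_read += 1
        for key, value in blocks[i]:
            if key >= stop_key:
                return results, blocks_read  # early out inside the block
            if key >= start_key:
                results.append((key, value))
        i += 1
    return results, blocks_read


blocks = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("d", 4)],
    [("e", 5), ("f", 6)],
    [("g", 7), ("h", 8)],
]
print(scan_range(blocks, "c", "e"))  # touches 1 of the 4 blocks
```

Without the index lookup and the early-out checks, the same query would read every block from the start of the row to its end, which is the wide-row cost the slide describes.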
  • 18. Administration Tools: Detect and repair potential issues
    ▪ HBCK
      ▪ FSCK for HBase
      ▪ Detect and repair cluster issues
    ▪ Cluster Verification
      ▪ Ensure cluster can be written to, read from
      ▪ Tables can be created/disabled/dropped
    It is important to be able to detect inconsistencies or unavailability. Extensive usage and monitoring of JMX metrics.
    1. Also created and continuously improving HBCK. Ensures all regions are properly assigned and tables are complete. Ensures the filesystem is in sync with the state of the running system. Able to fix issues like duplicate assignment, non-assignment, etc.
    2. Cluster verification tools. Quickly verify the sanity of a cluster. Also lots of load testing / benchmarking scripts. More to be open sourced.
  • 19. Hadoop Improvements HDFS Appends ▪ Hadoop . ▪ Widely deployed but no support for appends ▪ Hadoop . with append support ▪ Apache Hadoop . -Append ▪ Cloudera’s CDH version ▪ Facebook’s version of Hadoop . ▪ http://github.com/facebook/hadoop- -append HBase uses a WAL to ensure data durability under node failure This relies on HDFS append support 1. Hadoop 0.20 is current production release But no support for appends in mainline until 0.21/0.22 Hadoop 0.20 append branch developed for HBase and others 2. Will be available through three different versions Official apache release containing minimum set of patches necessary for append support Cloudera’s officially supported distribution v3 Facebook’s internal branch with DW fixes and append support (on github)
  • 20. Hadoop Improvements: HDFS rolling upgrades and NameNode HA
    ▪ HDFS in online application
      ▪ Need to support upgrades without downtime
      ▪ More sensitive to NameNode SPOF
    ▪ Hadoop AvatarNode
      ▪ Hot standby pair of NameNodes
      ▪ Failover to new version of NameNode
      ▪ Failover to hot standby in seconds under failure
    Use of HBase in an online application means HDFS is being used in an online application.
    1. Online application. Need to be able to upgrade a running system. Sensitive to the NameNode as a SPOF (recovery time too long with SNN).
    2. AvatarNode provides two backup hot NN. Can do rolling restarts of DNs, then an AvatarNode failover to upgrade the master. Failover in seconds if the NN dies, with no modification to HBase.
  • 21. Coming Soon: New Features
    ▪ East coast / west coast replication
      ▪ Asynchronous replication between data centers
    ▪ Faster recovery
      ▪ Distributed log splitting
    ▪ Master controlled rolling restart
      ▪ Fast and retaining assignment information
    1. EC/WC replication, and replication into offline processing clusters.
    2. Utilize ZK for configuration so it can be changed without a restart.
    3. Utilize the master as coordinator of cluster operations rather than external scripting. Can optimize to reduce downtime.
  • 22. Future Work Lots of work left to be done - Performance - Reliability - Recovery time - Features!
  • 23. Coprocessors: Complex server-side operations
    ▪ Dynamically loaded server-side logic
      ▪ Hook into read/write and cluster operations
    ▪ Endless possibilities
      ▪ Server-side merges and joins
      ▪ Lightweight MapReduce for aggregation
      ▪ Efficient secondary indexing
    1. Dynamically load user-provided logic onto region servers. Implemented as hooks into all read/write and cluster operations. Tied to tables/families and executed within the RS.
    2. Can be used for lots of different things (TM using to implement security). Stored procedures: combining search terms on the server. MapReduce: perform aggregate operations directly on the RS (sum/average/max/min/topN). Secondary indexes: declare indexes as metadata and implement them with coprocessors.
  • 24. Intelligent Load Balancing: Complex notion of load
    ▪ Currently based only on region count
      ▪ Different regions have different access patterns
      ▪ And data locality equally important
    ▪ Next generation load balancing algorithms
      ▪ Consider complex notion of read load / write load
      ▪ And HDFS block locations for locality
      ▪ Retain assignment information between restarts
    The new master design has made it possible to do much fancier load balancing algorithms.
    - Currently only based on region count. Different tables have different usage patterns. Data locality is currently not looked at.
    - Working on the next generation of load balancing. Complex notion of load: reads/writes, MemStore pressure, block cache usage. Simple stuff like retaining assignment info between cluster restarts is now possible.
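To make the difference between count-based and load-aware balancing concrete, here is a small sketch. It is not the HBase balancer: `balance_by_count`, `balance_by_load`, and the single per-region load number are hypothetical, and the greedy heaviest-first strategy is just one simple way to use a load signal. A real balancer would also weigh the factors the slide mentions, such as locality, MemStore pressure, and block cache usage.

```python
import heapq


def balance_by_count(region_load, servers):
    """Assign regions round-robin so every server gets an equal region count."""
    plan = {s: [] for s in servers}
    for i, region in enumerate(sorted(region_load)):
        plan[servers[i % len(servers)]].append(region)
    return plan


def balance_by_load(region_load, servers):
    """Greedy: place the heaviest regions first, always on the least-loaded server."""
    heap = [(0, s) for s in servers]  # (total load so far, server name)
    heapq.heapify(heap)
    plan = {s: [] for s in servers}
    by_weight = sorted(region_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for region, load in by_weight:
        total, server = heapq.heappop(heap)
        plan[server].append(region)
        heapq.heappush(heap, (total + load, server))
    return plan


def server_loads(plan, region_load):
    """Total load carried by each server under a given plan."""
    return {s: sum(region_load[r] for r in regions) for s, regions in plan.items()}


# Hypothetical per-region load, e.g. a weighted sum of read and write requests.
region_load = {"r1": 90, "r2": 10, "r3": 80, "r4": 20}
servers = ["rs1", "rs2"]
print(server_loads(balance_by_count(region_load, servers), region_load))
print(server_loads(balance_by_load(region_load, servers), region_load))
```

In this example both plans give each server two regions, but the count-based plan puts both hot regions on the same server while the load-aware plan splits them.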
  • 25. Other Future Work: Cluster Performance
    ▪ Quality of service
      ▪ One MapReduce job can take down cluster
    ▪ Dynamic configuration changes
      ▪ Change important parameters on running cluster
    ▪ HDFS performance
      ▪ Critical target for long-term HBase performance
  • 26. (c) Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. Thank you very much. Happy to answer questions.