Hadoop, HBase and Hive - Bay Area Hadoop User Group


Published in: Technology

  1. Hive/HBase Integration or, MaybeSQL? April 2010. John Sichi, Facebook
  2. Agenda
     - Use Cases
     - Architecture
     - Storage Handler
     - Load via INSERT
     - Query Processing
     - Bulk Load
     - Q & A
  3. Motivations
     - Data, data, and more data
       - 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
       - About 8x increase per year
     - Queries, queries, and more queries
       - More than 200 unique users querying per day
       - 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
     - They want it all and they want it now
       - Users expect faster response time on fresher data
       - Sampled subsets aren't always good enough
  4. How Can HBase Help?
     - Replicate dimension tables from transactional databases with low latency and without sharding
       - (Fact data can stay in Hive since it is append-only)
     - Only move changed rows
       - "Full scrape" is too slow and doesn't scale as data keeps growing
       - Hive by itself is not good at row-level operations
     - Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
     - Multiversioning for snapshot consistency?
  5. Use Case 1: HBase As ETL Data Target (diagram labels: HBase; Hive INSERT … SELECT …; Source Files/Tables)
  6. Use Case 2: HBase As Data Source (diagram labels: HBase; Other Files/Tables; Hive SELECT … JOIN … GROUP BY …; Query Result)
  7. Use Case 3: Low Latency Warehouse (diagram labels: HBase; Other Files/Tables; Periodic Load; Continuous Update; Hive Queries)
  8. HBase Architecture (figure; its "From" credit is cut off in the transcript)
  9. Hive Architecture
  10. All Together Now!
  11. Hive CLI With HBase
      - Minimum configuration needed:
          hive
            --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar
            -hiveconf hbase.zookeeper.quorum=zk1,zk2…
          hive> create table …
  12. Storage Handler
        CREATE TABLE users(
          userid int, name string, email string, notes string)
        STORED BY
          'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES (
          "hbase.columns.mapping" =
          "small:name,small:email,large:notes")
        TBLPROPERTIES (
          "" = "user_list"
        );
  13. Column Mapping
      - First column in table is always the row key
      - Other columns can be mapped to either:
        - An HBase column (any Hive type)
        - An HBase column family (must be MAP type in Hive)
      - Multiple Hive columns can map to the same HBase column or family
      - Limitations
        - Currently no control over type mapping (always string in HBase)
        - Currently no way to map HBase timestamp attribute
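The mapping rules on the Column Mapping slide can be illustrated with a small Python sketch. It is not the HBaseSerDe implementation: the helper name `parse_columns_mapping` and the tuple layout it returns are hypothetical. It assumes, as the slide states, that the first Hive column is implicitly the row key and that the remaining columns pair up in order with the entries of `hbase.columns.mapping`. Writing a whole-family entry as `family:` with an empty qualifier is also an assumption of this sketch, since the slides do not show that syntax.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair Hive columns with HBase targets (illustrative only).

    Assumes the first Hive column is the row key and that each remaining
    column matches one comma-separated "family:qualifier" entry, in order.
    An entry with an empty qualifier ("family:") stands for a whole column
    family, which the slide says must be a MAP column in Hive.
    """
    entries = [entry.strip() for entry in mapping.split(",")]
    key_column, value_columns = hive_columns[0], hive_columns[1:]
    if len(entries) != len(value_columns):
        raise ValueError("expected one mapping entry per non-key column")
    result = {key_column: ("<row key>", None)}
    for column, entry in zip(value_columns, entries):
        family, _, qualifier = entry.partition(":")
        # The qualifier is None when the entry names a whole family.
        result[column] = (family, qualifier or None)
    return result


# The table definition from the Storage Handler slide.
users_mapping = parse_columns_mapping(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
for column, target in users_mapping.items():
    print(column, "->", target)
```

Here `userid` becomes the row key, `name` and `email` land in the `small` family, and `notes` lands in the `large` family.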
  14. Load Via INSERT
        INSERT OVERWRITE TABLE users
        SELECT * FROM …;
      - Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
      - HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
      - Multiple rows with same key -> only one row written
      - Limitations
        - No write atomicity yet
        - No way to delete rows
        - Write parallelism is query-dependent (map vs reduce)
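Two points from the Load Via INSERT slide, string conversion of all values and the collapse of duplicate keys into a single row, can be modelled with a plain dictionary. This is only a sketch of the observable effect, not of TableOutputFormat or BatchUpdate; the function `write_rows` is hypothetical, and the choice that a later row overwrites an earlier one with the same key is an assumption the slide does not spell out.

```python
def write_rows(rows):
    """Simulate writing (key, columns) rows into a key-addressed store.

    Every value is converted to a string, as the slide describes for the
    current SerDe. A later row with the same key overwrites the matching
    columns of an earlier one; that ordering is an assumption of this sketch.
    """
    table = {}
    for key, columns in rows:
        stored = table.setdefault(str(key), {})
        for name, value in columns.items():
            stored[name] = str(value)
    return table


rows = [
    (1, {"small:name": "Ann", "small:email": "ann@example.com"}),
    (2, {"small:name": "Bob", "small:email": "bob@example.com"}),
    (1, {"small:name": "Anne", "small:email": "anne@example.com"}),
]
table = write_rows(rows)
print(len(table))  # 2: three input rows, two distinct keys
```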
  15. Map-Reduce Job for INSERT (diagram label: HBase; its "From" credit is cut off in the transcript)
  16. Map-Only Job for INSERT (diagram label: HBase)
  17. Query Processing
        SELECT name, notes FROM users WHERE userid='xyz';
      - Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
      - HBase determines the splits (one per table region)
      - HBaseSerDe produces lazy rows/maps for RowResults
      - Column selection is pushed down
      - Any SQL can be used (join, aggregation, union…)
      - Limitations
        - Currently no filter pushdown
        - How do we achieve locality?
  18. Metastore Integration
      - DDL can be used to create metadata in Hive and HBase simultaneously and consistently
      - CREATE EXTERNAL TABLE: registers an existing HBase table
      - DROP TABLE: will drop the HBase table too unless it was created as EXTERNAL
      - Limitations
        - No two-phase commit for DDL operations
        - ALTER TABLE is not yet implemented
        - Partitioning is not yet defined
        - No secondary indexing
  19. Bulk Load
      - Ideally…
          SET hive.hbase.bulk=true;
          INSERT OVERWRITE TABLE users SELECT … ;
      - But for now, you have to do some work and issue multiple Hive commands
        - Sample source data for range partitioning
        - Save sampling results to a file
        - Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
        - Import HFiles into HBase
        - HBase can merge files if necessary
  20. Range Partitioning During Sort (diagram labels: A-G; H-Q; R-Z; HBase; (H); (R); TotalOrderPartitioner; loadtable.rb)
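The idea behind the range partitioning slide, where split keys (H) and (R) divide the key space into A-G, H-Q and R-Z, can be sketched as a binary search over the sorted split keys. This is a simplified Python stand-in, not Hadoop's TotalOrderPartitioner, which works on serialized keys; the function name `partition_for` and the sample user names are hypothetical.

```python
from bisect import bisect_right


def partition_for(key, split_keys):
    """Return the index of the range (reducer) that should receive key.

    split_keys must be sorted. Each split key is treated as the first key
    of the next range, so N split keys define N + 1 ranges.
    """
    return bisect_right(split_keys, key)


split_keys = ["H", "R"]  # the two split points shown on the slide
for user in ["Alice", "Gus", "Hank", "Quinn", "Rosa", "Zed"]:
    print(user, "-> range", partition_for(user, split_keys))
```

Because every key in range 0 sorts before every key in range 1, and so on, the sorted output of each reducer can be used as a separate, non-overlapping file without a global merge.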
  21. Sampling Query For Range Partitioning
      - Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.
          select user_id from
            (select user_id
             from hive_user_table
             tablesample(bucket 1 out of 1000 on user_id) s
             order by user_id) sorted_user_5k_sample
          where (row_sequence() % 501)=0;
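The arithmetic behind the `% 501` filter in the sampling query can be checked with a short Python sketch: taking every 501st row of a sorted 5000-row sample yields rows 501, 1002, …, 4509, which is exactly 9 keys. The sketch assumes that `row_sequence()` numbers rows starting from 1; the generated user_ids and the helper `pick_split_keys` are hypothetical stand-ins for the sampled bucket and the query.

```python
def pick_split_keys(sorted_sample, stride):
    """Keep every stride-th key (1-based) of an already sorted sample."""
    return [
        key
        for position, key in enumerate(sorted_sample, start=1)
        if position % stride == 0
    ]


# Hypothetical stand-in for one bucket of 5000 sampled user_ids.
sample = sorted(f"user{n:07d}" for n in range(0, 5_000_000, 1000))
split_keys = pick_split_keys(sample, 501)
print(len(sample), len(split_keys))  # 5000 9
```

Nine split keys define ten ranges, nine holding about 500 sampled keys each and a slightly smaller last one, which matches the "nearly-equal-sized" wording on the slide.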
  22. Sorting Query For Bulk Load
        set mapred.reduce.tasks=12;
        set hive.mapred.partitioner=
          org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
        set total.order.partitioner.path=/tmp/hb_range_key_list;
        set hfile.compression=gz;
        create table hbsort(user_id string, user_type string, ...)
        stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
        outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
        tblproperties ('' = '/tmp/hbsort/cf');
        insert overwrite table hbsort
        select user_id, user_type, createtime, …
        from hive_user_table
        cluster by user_id;
  23. Deployment
      - Latest Hive trunk (will be in Hive 0.6.0)
      - Requires Hadoop 0.20+
      - Tested with HBase 0.20.3 and ZooKeeper 3.2.2
      - 20-node hbtest cluster at Facebook
      - No performance numbers yet
        - Currently setting up tests with about 6 TB (gz compressed)
  24. Questions?
      - Special thanks to Samuel Guo for the early versions of the integration code
  25. Hey, What About HBQL?
      - HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
      - HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs