Hadoop, Hbase and Hive- Bay area Hadoop User Group

Transcript of "Hadoop, Hbase and Hive- Bay area Hadoop User Group"

  1. Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi, Facebook
  2. Agenda <ul><li>Use Cases </li></ul><ul><li>Architecture </li></ul><ul><li>Storage Handler </li></ul><ul><li>Load via INSERT </li></ul><ul><li>Query Processing </li></ul><ul><li>Bulk Load </li></ul><ul><li>Q & A </li></ul>
  3. Motivations <ul><li>Data, data, and more data </li></ul><ul><ul><li>200 GB/day in March 2008 -> 12+ TB/day at the end of 2009 </li></ul></ul><ul><ul><li>About 8x increase per year </li></ul></ul><ul><li>Queries, queries, and more queries </li></ul><ul><ul><li>More than 200 unique users querying per day </li></ul></ul><ul><ul><li>7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries </li></ul></ul><ul><li>They want it all and they want it now </li></ul><ul><ul><li>Users expect faster response time on fresher data </li></ul></ul><ul><ul><li>Sampled subsets aren’t always good enough </li></ul></ul>
  4. How Can HBase Help? <ul><li>Replicate dimension tables from transactional databases with low latency and without sharding </li></ul><ul><ul><li>(Fact data can stay in Hive since it is append-only) </li></ul></ul><ul><li>Only move changed rows </li></ul><ul><ul><li>"Full scrape" is too slow and doesn’t scale as data keeps growing </li></ul></ul><ul><ul><li>Hive by itself is not good at row-level operations </li></ul></ul><ul><li>Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing </li></ul><ul><li>Multiversioning for snapshot consistency? </li></ul>
  5. Use Case 1: HBase As ETL Data Target (diagram: Source Files/Tables -> Hive INSERT … SELECT … -> HBase)
  6. Use Case 2: HBase As Data Source (diagram: HBase + Other Files/Tables -> Hive SELECT … JOIN … GROUP BY … -> Query Result)
  7. Use Case 3: Low Latency Warehouse (diagram: Continuous Update -> HBase; Periodic Load -> Other Files/Tables; both serve Hive Queries)
  8. HBase Architecture (diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
  9. Hive Architecture (diagram)
  10. All Together Now! (diagram)
  11. Hive CLI With HBase <ul><li>Minimum configuration needed: </li></ul><ul><li>hive </li></ul><ul><li>--auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar </li></ul><ul><li>-hiveconf hbase.zookeeper.quorum=zk1,zk2… </li></ul><ul><li>hive> create table … </li></ul>
  12. Storage Handler <ul><li>CREATE TABLE users( </li></ul><ul><li>userid int, name string, email string, notes string) </li></ul><ul><li>STORED BY </li></ul><ul><li>'org.apache.hadoop.hive.hbase.HBaseStorageHandler' </li></ul><ul><li>WITH SERDEPROPERTIES ( </li></ul><ul><li>"hbase.columns.mapping" = </li></ul><ul><li>"small:name,small:email,large:notes") </li></ul><ul><li>TBLPROPERTIES ( </li></ul><ul><li>"hbase.table.name" = "user_list" </li></ul><ul><li>); </li></ul>
  13. Column Mapping <ul><li>First column in table is always the row key </li></ul><ul><li>Other columns can be mapped to either: </li></ul><ul><ul><li>An HBase column (any Hive type) </li></ul></ul><ul><ul><li>An HBase column family (must be MAP type in Hive) </li></ul></ul><ul><li>Multiple Hive columns can map to the same HBase column or family </li></ul><ul><li>Limitations </li></ul><ul><ul><li>Currently no control over type mapping (always string in HBase) </li></ul></ul><ul><ul><li>Currently no way to map HBase timestamp attribute </li></ul></ul>
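The column-family mapping above can be sketched in DDL like this (the table and family names here are hypothetical, following the storage-handler syntax from the previous slide; note the trailing colon, which maps an entire family rather than a single column):

```sql
-- Hypothetical example: "small:name" maps to a typed Hive column,
-- while "large:" (trailing colon) maps the whole family to a Hive MAP.
-- The first column (userid) is always the HBase row key.
CREATE TABLE user_notes(
  userid int,
  name string,
  notes map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,large:");
```

With this mapping, every qualifier in the `large` family appears as a key in the `notes` map, which is how a table with an open-ended set of HBase columns can still be queried from Hive.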
  14. Load Via INSERT <ul><li>INSERT OVERWRITE TABLE users </li></ul><ul><li>SELECT * FROM …; </li></ul><ul><li>Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat </li></ul><ul><li>HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings) </li></ul><ul><li>Multiple rows with same key -> only one row written </li></ul><ul><li>Limitations </li></ul><ul><ul><li>No write atomicity yet </li></ul></ul><ul><ul><li>No way to delete rows </li></ul></ul><ul><ul><li>Write parallelism is query-dependent (map vs reduce) </li></ul></ul>
  15. Map-Reduce Job for INSERT (diagram from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)
  16. Map-Only Job for INSERT (diagram)
  17. Query Processing <ul><li>SELECT name, notes FROM users WHERE userid='xyz'; </li></ul><ul><li>Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase </li></ul><ul><li>HBase determines the splits (one per table region) </li></ul><ul><li>HBaseSerDe produces lazy rows/maps for RowResults </li></ul><ul><li>Column selection is pushed down </li></ul><ul><li>Any SQL can be used (join, aggregation, union…) </li></ul><ul><li>Limitations </li></ul><ul><ul><li>Currently no filter pushdown </li></ul></ul><ul><ul><li>How do we achieve locality? </li></ul></ul>
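Since an HBase-backed table behaves like any other Hive table at query time, it can participate in joins and aggregations with native tables. A minimal sketch, assuming a hypothetical native Hive fact table `user_clicks` alongside the HBase-backed `users` table from slide 12:

```sql
-- Join the HBase-backed dimension table (users) against a
-- hypothetical native Hive fact table (user_clicks), then aggregate.
SELECT u.name, COUNT(*) AS clicks
FROM users u
JOIN user_clicks c ON (u.userid = c.userid)
GROUP BY u.name;
```

The HBase side is scanned via TableInputFormatBase (one split per region) while the native side is read from HDFS as usual; the join itself runs in ordinary Hive map/reduce stages.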
  18. Metastore Integration <ul><li>DDL can be used to create metadata in Hive and HBase simultaneously and consistently </li></ul><ul><li>CREATE EXTERNAL TABLE: register existing HBase table </li></ul><ul><li>DROP TABLE: will drop HBase table too unless it was created as EXTERNAL </li></ul><ul><li>Limitations </li></ul><ul><ul><li>No two-phase-commit for DDL operations </li></ul></ul><ul><ul><li>ALTER TABLE is not yet implemented </li></ul></ul><ul><ul><li>Partitioning is not yet defined </li></ul></ul><ul><ul><li>No secondary indexing </li></ul></ul>
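Registering an existing HBase table might look like the following sketch (the Hive and HBase table names here are hypothetical). Because the table is EXTERNAL, DROP TABLE removes only the Hive metadata and leaves the underlying HBase table intact:

```sql
-- Register a pre-existing HBase table (hypothetical name
-- "existing_user_list") without creating or owning it from Hive.
CREATE EXTERNAL TABLE legacy_users(userid int, name string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name")
TBLPROPERTIES ("hbase.table.name" = "existing_user_list");
```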
  19. Bulk Load <ul><li>Ideally… </li></ul><ul><li>SET hive.hbase.bulk=true; </li></ul><ul><li>INSERT OVERWRITE TABLE users SELECT … ; </li></ul><ul><li>But for now, you have to do some work and issue multiple Hive commands </li></ul><ul><ul><li>Sample source data for range partitioning </li></ul></ul><ul><ul><li>Save sampling results to a file </li></ul></ul><ul><ul><li>Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files) </li></ul></ul><ul><ul><li>Import HFiles into HBase </li></ul></ul><ul><ul><li>HBase can merge files if necessary </li></ul></ul>
  20. Range Partitioning During Sort (diagram: TotalOrderPartitioner with boundary keys (H) and (R) sorts rows into ranges A-G, H-Q, R-Z; loadtable.rb imports the resulting files into HBase)
  21. Sampling Query For Range Partitioning <ul><li>Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges. </li></ul><ul><li>select user_id from </li></ul><ul><li>(select user_id </li></ul><ul><li>from hive_user_table </li></ul><ul><li>tablesample(bucket 1 out of 1000 on user_id) s </li></ul><ul><li>order by user_id) sorted_user_5k_sample </li></ul><ul><li>where (row_sequence() % 501)=0; </li></ul>
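One detail the slide leaves implicit: `row_sequence()` is not a built-in Hive function. Assuming the hive-contrib jar is available on the machine running the CLI (the jar path below is illustrative and will vary by installation), it can be registered before running the sampling query:

```sql
-- row_sequence() lives in hive-contrib; register it as a temporary
-- UDF for this session. Adjust the jar path for your installation.
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
CREATE TEMPORARY FUNCTION row_sequence AS
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
```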
  22. Sorting Query For Bulk Load <ul><li>set mapred.reduce.tasks=12; </li></ul><ul><li>set hive.mapred.partitioner= </li></ul><ul><li>org.apache.hadoop.mapred.lib.TotalOrderPartitioner; </li></ul><ul><li>set total.order.partitioner.path=/tmp/hb_range_key_list; </li></ul><ul><li>set hfile.compression=gz; </li></ul><ul><li>create table hbsort(user_id string, user_type string, ...) </li></ul><ul><li>stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat' </li></ul><ul><li>outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat' tblproperties ('hfile.family.path' = '/tmp/hbsort/cf'); </li></ul><ul><li>insert overwrite table hbsort </li></ul><ul><li>select user_id, user_type, createtime, … </li></ul><ul><li>from hive_user_table </li></ul><ul><li>cluster by user_id; </li></ul>
  23. Deployment <ul><li>Latest Hive trunk (will be in Hive 0.6.0) </li></ul><ul><li>Requires Hadoop 0.20+ </li></ul><ul><li>Tested with HBase 0.20.3 and ZooKeeper 3.2.2 </li></ul><ul><li>20-node hbtest cluster at Facebook </li></ul><ul><li>No performance numbers yet </li></ul><ul><ul><li>Currently setting up tests with about 6TB (gz compressed) </li></ul></ul>
  24. Questions? <ul><li>[email_address] </li></ul><ul><li>[email_address] </li></ul><ul><li>http://wiki.apache.org/hadoop/Hive/HBaseIntegration </li></ul><ul><li>http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad </li></ul><ul><li>Special thanks to Samuel Guo for the early versions of the integration code </li></ul>
  25. Hey, What About HBQL? <ul><li>HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations </li></ul><ul><li>HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs </li></ul>