• Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
No Downloads

Views

Total Views
19,914
On Slideshare
0
From Embeds
0
Number of Embeds
18

Actions

Shares
Downloads
1,359
Comments
3
Likes
59

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook +
  • 2. Agenda
    • Use Cases
    • Architecture
    • Storage Handler
    • Load via INSERT
    • Query Processing
    • Bulk Load
    • Q & A
    Facebook
  • 3. Motivations
    • Data, data, and more data
      • 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
      • About 8x increase per year
    • Queries, queries, and more queries
      • More than 200 unique users querying per day
      • 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
    • They want it all and they want it now
      • Users expect faster response time on fresher data
      • Sampled subsets aren’t always good enough
    Facebook
  • 4. How Can HBase Help?
    • Replicate dimension tables from transactional databases with low latency and without sharding
      • (Fact data can stay in Hive since it is append-only)
    • Only move changed rows
      • “ Full scrape” is too slow and doesn’t scale as data keeps growing
      • Hive by itself is not good at row-level operations
    • Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
    • Multiversioning for snapshot consistency?
    Facebook
  • 5. Use Case 1: HBase As ETL Data Target Facebook HBase Hive INSERT … SELECT … Source Files/Tables
  • 6. Use Case 2: HBase As Data Source Facebook HBase Other Files/Tables Hive SELECT … JOIN … GROUP BY … Query Result
  • 7. Use Case 3: Low Latency Warehouse Facebook HBase Other Files/Tables Periodic Load Continuous Update Hive Queries
  • 8. HBase Architecture Facebook From http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
  • 9. Hive Architecture Facebook
  • 10. All Together Now! Facebook
  • 11. Hive CLI With HBase
    • Minimum configuration needed:
    • hive
    • --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar
    • -hiveconf hbase.zookeeper.quorum=zk1,zk2…
    • hive> create table …
    Facebook
  • 12. Storage Handler
    • CREATE TABLE users(
    • userid int, name string, email string, notes string)
    • STORED BY
    • 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    • WITH SERDEPROPERTIES (
    • “ hbase.columns.mapping” =
    • “ small:name,small:email,large:notes”)
    • TBLPROPERTIES (
    • “ hbase.table.name” = “user_list”
    • );
    Facebook
  • 13. Column Mapping
    • First column in table is always the row key
    • Other columns can be mapped to either:
      • An HBase column (any Hive type)
      • An HBase column family (must be MAP type in Hive)
    • Multiple Hive columns can map to the same HBase column or family
    • Limitations
      • Currently no control over type mapping (always string in HBase)
      • Currently no way to map HBase timestamp attribute
    Facebook
  • 14. Load Via INSERT
    • INSERT OVERWRITE TABLE users
    • SELECT * FROM …;
    • Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
    • HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
    • Multiple rows with same key -> only one row written
    • Limitations
      • No write atomicity yet
      • No way to delete rows
      • Write parallelism is query-dependent (map vs reduce)
    Facebook
  • 15. Map-Reduce Job for INSERT Facebook HBase From http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png
  • 16. Map-Only Job for INSERT Facebook HBase
  • 17. Query Processing
    • SELECT name, notes FROM users WHERE userid=‘xyz’;
    • Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
    • HBase determines the splits (one per table region)
    • HBaseSerDe produces lazy rows/maps for RowResults
    • Column selection is pushed down
    • Any SQL can be used (join, aggregation, union…)
    • Limitations
      • Currently no filter pushdown
      • How do we achieve locality?
    Facebook
  • 18. Metastore Integration
    • DDL can be used to create metadata in Hive and HBase simultaneously and consistently
    • CREATE EXTERNAL TABLE: register existing Hbase table
    • DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
    • Limitations
      • No two-phase-commit for DDL operations
      • ALTER TABLE is not yet implemented
      • Partitioning is not yet defined
      • No secondary indexing
    Facebook
  • 19. Bulk Load
    • Ideally…
    • SET hive.hbase.bulk=true;
    • INSERT OVERWRITE TABLE users SELECT … ;
    • But for now, you have to do some work and issue multiple Hive commands
      • Sample source data for range partitioning
      • Save sampling results to a file
      • Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
      • Import HFiles into HBase
      • HBase can merge files if necessary
    Facebook
  • 20. Range Partitioning During Sort Facebook A-G H-Q R-Z HBase (H) (R) TotalOrderPartitioner loadtable.rb
  • 21. Sampling Query For Range Partitioning
    • Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.
    • select user_id from
    • (select user_id
    • from hive_user_table
    • tablesample(bucket 1 out of 1000 on user_id) s
    • order by user_id) sorted_user_5k_sample
    • where (row_sequence() % 501)=0;
    Facebook
  • 22. Sorting Query For Bulk Load
    • set mapred.reduce.tasks=12;
    • set hive.mapred.partitioner=
    • org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
    • set total.order.partitioner.path=/tmp/hb_range_key_list;
    • set hfile.compression=gz;
    • create table hbsort(user_id string, user_type string, ...)
    • stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat’
    • outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat’ tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');
    • insert overwrite table hbsort
    • select user_id, user_type, createtime, …
    • from hive_user_table
    • cluster by user_id;
    Facebook
  • 23. Deployment
    • Latest Hive trunk (will be in Hive 0.6.0)
    • Requires Hadoop 0.20+
    • Tested with HBase 0.20.3 and Zookeeper 3.2.2
    • 20-node hbtest cluster at Facebook
    • No performance numbers yet
      • Currently setting up tests with about 6TB (gz compressed)
    Facebook
  • 24. Questions?
    • [email_address]
    • [email_address]
    • http://wiki.apache.org/hadoop/Hive/HBaseIntegration
    • http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
    • Special thanks to Samuel Guo for the early versions of the integration code
    Facebook
  • 25. Hey, What About HBQL?
    • HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
    • HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs
    Facebook