Hadoop, HBase and Hive - Bay Area Hadoop User Group
Presentation Transcript

  • Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi, Facebook
  • Agenda
    • Use Cases
    • Architecture
    • Storage Handler
    • Load via INSERT
    • Query Processing
    • Bulk Load
    • Q & A
  • Motivations
    • Data, data, and more data
      • 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
      • About 8x increase per year
    • Queries, queries, and more queries
      • More than 200 unique users querying per day
      • 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
    • They want it all and they want it now
      • Users expect faster response time on fresher data
      • Sampled subsets aren’t always good enough
  • How Can HBase Help?
    • Replicate dimension tables from transactional databases with low latency and without sharding
      • (Fact data can stay in Hive since it is append-only)
    • Only move changed rows
      • “Full scrape” is too slow and doesn’t scale as data keeps growing
      • Hive by itself is not good at row-level operations
    • Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
    • Multiversioning for snapshot consistency?
  • Use Case 1: HBase As ETL Data Target (diagram: source files/tables feed Hive, which writes to HBase via INSERT … SELECT …)
  • Use Case 2: HBase As Data Source (diagram: HBase joined with other files/tables in Hive via SELECT … JOIN … GROUP BY …, producing a query result)
  • Use Case 3: Low Latency Warehouse (diagram: periodic load from other files/tables plus continuous update into HBase, with Hive queries over both)
  • HBase Architecture (diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
  • Hive Architecture (diagram)
  • All Together Now! (diagram: the combined Hive and HBase architecture)
  • Hive CLI With HBase
    • Minimum configuration needed:
      hive --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
           -hiveconf hbase.zookeeper.quorum=zk1,zk2…
      hive> create table …
  • Storage Handler
    • CREATE TABLE users(
        userid int, name string, email string, notes string)
      STORED BY
        'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" =
        "small:name,small:email,large:notes")
      TBLPROPERTIES (
        "hbase.table.name" = "user_list"
      );
  • Column Mapping
    • First column in table is always the row key
    • Other columns can be mapped to either:
      • An HBase column (any Hive type)
      • An HBase column family (must be MAP type in Hive)
    • Multiple Hive columns can map to the same HBase column or family
    • Limitations
      • Currently no control over type mapping (always string in HBase)
      • Currently no way to map HBase timestamp attribute
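    A minimal sketch of mapping a whole column family to a Hive MAP column, following the rules above. The users_wide table is hypothetical, and the trailing colon in "large:" (a whole-family mapping) is an assumption based on the Hive/HBase integration wiki:
      CREATE TABLE users_wide(
        userid int,                  -- first column is always the row key
        name string,                 -- maps to a single HBase column
        props map<string, string>)   -- maps to the entire 'large' family
      STORED BY
        'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = "small:name,large:");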
  • Load Via INSERT
    • INSERT OVERWRITE TABLE users
    • SELECT * FROM …;
    • Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
    • HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
    • Multiple rows with same key -> only one row written
    • Limitations
      • No write atomicity yet
      • No way to delete rows
      • Write parallelism is query-dependent (map vs reduce)
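    For illustration, a hedged end-to-end load into the users table from the Storage Handler slide; staging_users and its columns are hypothetical:
      INSERT OVERWRITE TABLE users
      SELECT userid, name, email, notes
      FROM staging_users;   -- hypothetical plain Hive source table
    Whether this runs as a map-only or map-reduce job (next two slides) depends on the shape of the SELECT.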
  • Map-Reduce Job for INSERT (diagram from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)
  • Map-Only Job for INSERT (diagram)
  • Query Processing
    • SELECT name, notes FROM users WHERE userid='xyz';
    • Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
    • HBase determines the splits (one per table region)
    • HBaseSerDe produces lazy rows/maps for RowResults
    • Column selection is pushed down
    • Any SQL can be used (join, aggregation, union…)
    • Limitations
      • Currently no filter pushdown
      • How do we achieve locality?
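    Since any SQL can be used, a sketch of a join plus aggregation over the HBase-backed users table; page_views is a hypothetical Hive fact table:
      SELECT u.name, COUNT(*) AS views
      FROM users u
      JOIN page_views pv ON (u.userid = pv.userid)
      GROUP BY u.name;
    Column selection pushdown means only the name column (plus the row key) is read from HBase here.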
  • Metastore Integration
    • DDL can be used to create metadata in Hive and HBase simultaneously and consistently
    • CREATE EXTERNAL TABLE: register existing HBase table
    • DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
    • Limitations
      • No two-phase-commit for DDL operations
      • ALTER TABLE is not yet implemented
      • Partitioning is not yet defined
      • No secondary indexing
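    A sketch of registering a pre-existing HBase table; the Hive-side name and columns are hypothetical, and the mapping follows the Storage Handler example:
      CREATE EXTERNAL TABLE users_existing(
        userid int, name string)
      STORED BY
        'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = "small:name")
      TBLPROPERTIES (
        "hbase.table.name" = "user_list");   -- table already exists in HBase
    Because the table is EXTERNAL, a later DROP TABLE removes only the Hive metadata and leaves user_list in HBase.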
  • Bulk Load
    • Ideally…
    • SET hive.hbase.bulk=true;
    • INSERT OVERWRITE TABLE users SELECT … ;
    • But for now, you have to do some work and issue multiple Hive commands
      • Sample source data for range partitioning
      • Save sampling results to a file
      • Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
      • Import HFiles into HBase (see the import sketch below)
      • HBase can merge files if necessary
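    A hedged sketch of the import step, using the loadtable.rb script bundled with HBase 0.20 (see the next slide); the invocation and paths are assumptions based on the Hive bulk-load wiki:
      hbase org.jruby.Main loadtable.rb user_list /tmp/hbsort
    Here /tmp/hbsort is assumed to be the parent of the hfile.family.path directory produced by the sorting query two slides ahead.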
  • Range Partitioning During Sort (diagram: TotalOrderPartitioner splits the key space into ranges A-G, H-Q, R-Z at boundary keys (H) and (R); loadtable.rb imports the resulting files into HBase)
  • Sampling Query For Range Partitioning
    • Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.
    • select user_id from
    • (select user_id
    • from hive_user_table
    • tablesample(bucket 1 out of 1000 on user_id) s
    • order by user_id) sorted_user_5k_sample
    • where (row_sequence() % 501)=0;
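    Note that row_sequence() is not a Hive built-in; it is registered from hive-contrib before running the query above (the jar path is an assumption):
      add jar /path/to/hive_contrib.jar;   -- path is an assumption
      create temporary function row_sequence as
        'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
    The nine boundary keys are then saved to the file named by total.order.partitioner.path in the sorting query on the next slide.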
  • Sorting Query For Bulk Load
    • set mapred.reduce.tasks=12;
      set hive.mapred.partitioner=
        org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
      set total.order.partitioner.path=/tmp/hb_range_key_list;
      set hfile.compression=gz;
    • create table hbsort(user_id string, user_type string, ...)
      stored as
        inputformat 'org.apache.hadoop.mapred.TextInputFormat'
        outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
      tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');
    • insert overwrite table hbsort
      select user_id, user_type, createtime, …
      from hive_user_table
      cluster by user_id;
  • Deployment
    • Latest Hive trunk (will be in Hive 0.6.0)
    • Requires Hadoop 0.20+
    • Tested with HBase 0.20.3 and ZooKeeper 3.2.2
    • 20-node hbtest cluster at Facebook
    • No performance numbers yet
      • Currently setting up tests with about 6TB (gz compressed)
  • Questions?
    • [email_address]
    • [email_address]
    • http://wiki.apache.org/hadoop/Hive/HBaseIntegration
    • http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
    • Special thanks to Samuel Guo for the early versions of the integration code
  • Hey, What About HBQL?
    • HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
    • HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs