Processing High Volume Hive Updates

Jun 20, 2012



Apache Hive provides a convenient SQL-based query language for data stored in HDFS. HDFS provides highly scalable bandwidth to the data, but does not support arbitrary writes. One of Hortonworks' customers needs to store a high volume of customer data (> 1 TB/day), and that data contains a high percentage (15%) of record updates distributed across years. In many high-update use cases, HBase would suffice, but the current lack of push-down filters from Hive into HBase and HBase's single-level keys make it too expensive. Our solution is to use a custom record reader that stores the edit records as separate HDFS files and synthesizes the current set of records dynamically as the table is read. This provides an economical solution to their need that works within the framework provided by Hive. We believe this use case applies to many Hive users and plan to develop and open-source a reusable solution.
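To make the merge-on-read idea concrete, here is a minimal sketch of how base records and separately stored edit records could be reconciled at read time. This is an assumption-laden illustration, not the customer's or Hortonworks' actual record reader: the Record shape, the txnId field used to decide which version wins, and the mergeOnRead helper are all hypothetical names, and a real implementation would live inside a Hadoop RecordReader that streams base and delta files from HDFS rather than holding them in memory.

```java
// Sketch of merge-on-read: base rows plus edit rows keyed by a record id,
// where the edit with the newest transaction id overrides the base row.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeOnReadSketch {

    // Hypothetical record shape: a key, a transaction id, and the row payload.
    static final class Record {
        final String key;
        final long txnId;
        final String value;
        Record(String key, long txnId, String value) {
            this.key = key;
            this.txnId = txnId;
            this.value = value;
        }
    }

    // Synthesize the current view of the table: start from the base file's
    // records, then apply edit records, keeping whichever version is newer.
    static Map<String, Record> mergeOnRead(List<Record> base, List<Record> edits) {
        Map<String, Record> current = new HashMap<>();
        for (Record r : base) {
            current.put(r.key, r);
        }
        for (Record e : edits) {
            current.merge(e.key, e,
                (oldRec, newRec) -> newRec.txnId >= oldRec.txnId ? newRec : oldRec);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Record> base = List.of(
            new Record("cust-1", 1L, "name=Alice,balance=100"),
            new Record("cust-2", 1L, "name=Bob,balance=250"));
        List<Record> edits = List.of(
            new Record("cust-2", 7L, "name=Bob,balance=300")); // later update wins

        mergeOnRead(base, edits).values()
            .forEach(r -> System.out.println(r.key + " -> " + r.value));
    }
}
```

Because the merge happens while the table is read, HDFS files are never rewritten in place; updates simply accumulate as new delta files, which is what makes the approach fit HDFS's append-only write model.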


Presentation Transcript