Hive: Data Warehousing for Hadoop
Ben Lever, NICTA

Meetup #2, 27 Mar 2012 - http://sydney.bigdataaustralia.com.au/events/53934632/

  • # of users = 943, # of movies = 1,682, # of ratings = 100,000
  • Shell, Driver, Compiler, Execution engine, Metastore

Transcript

  • 1. Hive: Data Warehousing for Hadoop
       Ben Lever, @bmlever
       Big Data Analytics Meetup, 27 March 2012
  • 2. Another Data Warehousing System?
       • Problem:
           - Lots of data
       • Partial solution:
           - Hadoop
       • Another problem:
           - MapReduce can be hard
           - Schema information is embedded in the program
           - A lot of data is still structured
  • 3. Solution: Hive
       • A system for querying and managing structured data within Hadoop
           - MapReduce for execution
           - HDFS for storage
       • Designed for end users who know more SQL than Java
       • Apache License v2
       • hive.apache.org
  • 4. Working example: MovieLens
       • Movie ratings
       • 3 "tables":

           Users        Movies         Ratings
           id           id             user id
           age          title          movie id
           gender       release date   rating (1-5)
           occupation   action         timestamp
           zip code     adventure
                        romance
                        ...

       www.grouplens.org
  • 5. Demo
  • 6. So far
       • Hive shell
       • Creating and loading tables
       • Data model:
           - INT, BIGINT, TINYINT, STRING, etc.
           - Also: FLOAT, DOUBLE, ARRAY, MAP, STRUCT
       • Simple queries with filtering
       • Table data is immutable
       • Schema on read vs schema on write
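The steps listed on this slide (creating a table, loading data, running a simple filtered query) might look like the following in the Hive shell. This is a minimal sketch, not the demo from the talk: the table name `ratings`, the column names, and the local path are assumptions, and the tab-delimited layout assumes the MovieLens 100k `u.data` file.

```sql
-- Hypothetical table for the MovieLens ratings file (all names are assumptions).
CREATE TABLE ratings (
  user_id  INT,
  movie_id INT,
  rating   TINYINT,   -- 1 to 5
  ts       BIGINT     -- rating timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOAD DATA only copies the file into the table's directory;
-- nothing is parsed or validated at this point.
LOAD DATA LOCAL INPATH '/tmp/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;

-- Schema on read: fields are parsed when the query runs.
-- A field that does not fit its declared type comes back as NULL.
SELECT user_id, movie_id, rating
FROM ratings
WHERE rating >= 4
LIMIT 10;
```

Because the schema is applied only at query time, loading is cheap, but malformed input surfaces later as NULL values rather than as a load error.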
  • 7. Hive components
       • CLI: Hive query (SQL-like)
             SELECT * FROM customers WHERE gender = 'M';
       • Metastore: schema info
             TABLE customer ( customer_id BIGINT, gender STRING, ...
       • Driver: launches MapReduce job
       • MapReduce job
       • HDFS: raw source data (compressed)
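One way to watch these components at work is Hive's `EXPLAIN` statement, which prints the plan the Driver would run without executing it. The sketch below reuses the slide's `customers` table and assumes it already exists in the Metastore.

```sql
-- Show the schema the Metastore holds for the table (assumed to exist).
DESCRIBE customers;

-- Print the execution plan (stages and operators) instead of running the query.
EXPLAIN
SELECT *
FROM customers
WHERE gender = 'M';
```

Depending on the Hive version and its settings, a simple filter like this may be planned as a map-only MapReduce job or as a direct fetch from HDFS.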
  • 8. Metastore (source: Hadoop: The Definitive Guide)
  • 9. Other SQL-like features
       • Aggregation
           - COUNT, AVG
       • JOIN
       • GROUP BY
       • SORT BY
       • Subqueries
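Several of these features can be combined in one query. The following is a sketch only, assuming hypothetical tables `ratings(user_id, movie_id, rating)` and `movies(id, title)` built from the MovieLens data; it lists the ten most-rated movies with their average rating.

```sql
-- Assumed tables: ratings(user_id, movie_id, rating) and movies(id, title).
SELECT m.title,
       s.num_ratings AS num_ratings,
       s.avg_rating  AS avg_rating
FROM (
  -- Subquery: aggregate the ratings per movie (GROUP BY with COUNT and AVG).
  SELECT movie_id,
         COUNT(*)    AS num_ratings,
         AVG(rating) AS avg_rating
  FROM ratings
  GROUP BY movie_id
) s
JOIN movies m ON (m.id = s.movie_id)
ORDER BY num_ratings DESC
LIMIT 10;
```

`ORDER BY` is used here because it produces one total ordering; `SORT BY`, the clause named on the slide, orders rows only within each reducer, which is cheaper but does not give a global top ten on its own.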
  • 10. Demo
  • 11. Built-in functions
       • Text mining:
           - ngrams()
           - context_ngrams()
           - sentences()
       • Statistics + mathematics:
           - stddev()
           - histogram_numeric()
           - log
           - radians
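As a rough illustration of how some of these built-ins are called, the sketch below again assumes hypothetical tables `movies(id, title)` and `ratings(user_id, movie_id, rating)`. Movie titles stand in for free text here purely for convenience.

```sql
-- Assumed tables: movies(id, title) and ratings(user_id, movie_id, rating).

-- Text mining: sentences() splits text into arrays of words, and
-- ngrams() estimates the 10 most frequent bigrams (n = 2) across all rows.
SELECT ngrams(sentences(lower(title)), 2, 10)
FROM movies;

-- Statistics: population standard deviation and a 5-bin histogram of ratings.
SELECT stddev_pop(rating),
       histogram_numeric(rating, 5)
FROM ratings;
```

`stddev_pop()` is the explicitly named population form of the `stddev()` function listed on the slide.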
  • 12. User Defined Functions
       • Written in Java
       • User Defined Functions (UDFs):
           - Single row → Single row
           - e.g. mathematical and string functions
       • User Defined Aggregate Functions (UDAFs):
           - Multiple rows → Single row
           - e.g. AVG
       • User Defined Table Functions (UDTFs):
           - Single row → Multiple rows
           - e.g. "explode"
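The three function kinds differ in how many rows go in and come out, which can be seen with built-in examples of each. This is a sketch under assumptions: the tables `movies(id, title)` and `ratings(user_id, movie_id, rating)` are hypothetical, and the JAR path, function name, and Java class in the last statements are invented for illustration.

```sql
-- UDF: one row in, one row out.
SELECT id, upper(title) FROM movies;

-- UDAF: many rows in, one row out per group.
SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id;

-- UDTF: one row in, many rows out (one output row per word of the title).
SELECT m.id, w.word
FROM movies m
LATERAL VIEW explode(split(m.title, ' ')) w AS word;

-- Registering a custom Java UDF for the current session.
-- The path, function name, and class below are hypothetical.
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLower';
SELECT my_lower(title) FROM movies;
```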
  • 13. Hive Clients (source: Hadoop: The Definitive Guide)
  • 14. Hive Server
       • JDBC
       • ODBC
  • 15. Sqoop
       • Move data between Hadoop and relational databases
       • Diagram: RDBMS, Sqoop, Hadoop, Hive Metastore (schema)
       http://incubator.apache.org/projects/sqoop.html
  • 16. Sqoop adapters
  • 17. Conclusion
       • Scales to handle much more data than traditional systems:
           - Leverages Hadoop HDFS and MapReduce
           - Relational/structured data
           - Schema on read vs schema on write
       • Supports rapid iteration of ad-hoc queries:
           - SQL-like query language
           - Complex queries (joins, etc.) with minimal code
       • Is not a database replacement:
           - Treats data as immutable
           - No indexing