Hive: Data Warehousing for Hadoop
Ben Lever, NICTA

Meetup #2, 27 Mar 2012 - http://sydney.bigdataaustralia.com.au/events/53934632/

  • # of users = 943, # of movies = 1682, # of ratings = 100,000
  • Shell, Driver, Compiler, Execution engine, Metastore


  • 1. Hive: Data Warehousing for Hadoop
      • Ben Lever, @bmlever
      • Big Data Analytics Meetup, 27 March 2012
  • 2. Another Data Warehousing System?
      • Problem:
          – Lots of data
      • Partial solution:
          – Hadoop
      • Another problem:
          – MapReduce can be hard
          – Schema information embedded in program
          – A lot of data is still structured
  • 3. Solution: Hive
      • A system for querying and managing structured data within Hadoop
          – MapReduce for execution
          – HDFS for storage
      • Designed for end users who know more SQL than Java
      • Open source (Apache License v2)
      • hive.apache.org
  • 4. Working example: MovieLens
      • Movie ratings
      • 3 “tables”:
          – Users: id, age, gender, occupation, zip code
          – Movies: id, title, release date, action, adventure, romance, ...
          – Ratings: user id, movie id, rating (1 – 5), timestamp
      • www.grouplens.org
  • 5. Demo
  • 6. So far
      • Hive shell
      • Creating and loading tables
      • Data model:
          – INT, BIGINT, TINYINT, STRING, etc
          – Also: FLOAT, DOUBLE, ARRAY, MAP, STRUCT
      • Simple queries with filtering
      • Table data is immutable
      • Schema on read vs schema on write
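The "schema on read" idea from slide 6 can be illustrated outside Hive. The following is a minimal Python sketch, not Hive code: the raw lines stay untouched and a schema is applied only when they are read. The sample rows, the pipe delimiter, the `USER_SCHEMA` layout, and the helper `read_with_schema` are all assumptions made for illustration.

```python
# Raw data stays exactly as it was loaded; nothing is validated at write time.
# These rows are invented sample data, not taken from MovieLens.
RAW_USERS = [
    "1|30|M|engineer|2000",
    "2|41|F|doctor|2010",
    "3|25|M|student|2050",
]

# Hypothetical schema: (column name, Python type standing in for a Hive type).
USER_SCHEMA = [
    ("id", int),          # INT
    ("age", int),         # INT
    ("gender", str),      # STRING
    ("occupation", str),  # STRING
    ("zip_code", str),    # STRING
]


def read_with_schema(lines, schema, delimiter="|"):
    """Parse each raw line into a dict at read time, using the given schema."""
    for line in lines:
        fields = line.split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# A simple query with filtering, roughly:
#   SELECT id, age FROM users WHERE gender = 'M'
males = [
    (user["id"], user["age"])
    for user in read_with_schema(RAW_USERS, USER_SCHEMA)
    if user["gender"] == "M"
]
print(males)  # [(1, 30), (3, 25)]
```

Because the schema lives apart from the data, the same raw lines could be read again with a different schema without rewriting them, which is the contrast with schema on write.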
  • 7. Hive components
      • CLI: submits a Hive query (SQL-like), e.g. SELECT * FROM customers WHERE gender = 'M';
      • Driver: launches a MapReduce job
      • Metastore: schema info, e.g. TABLE customer ( customer_id BIGINT, gender STRING, ... )
      • HDFS: raw source data (compressed)
  • 8. Metastore (figure from Hadoop – The Definitive Guide)
  • 9. Other SQL-like features
      • Aggregation
          – COUNT, AVG
      • JOIN
      • GROUP BY
      • SORT BY
      • Sub queries
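Since Hive uses MapReduce for execution, it can help to see how an aggregation such as AVG with GROUP BY decomposes into map, shuffle, and reduce steps. The Python sketch below is only a conceptual illustration, not the plan Hive actually generates; the rating rows and the function names are invented for the example.

```python
from collections import defaultdict

# Invented ratings rows: (user_id, movie_id, rating).
RATINGS = [(1, 10, 4), (2, 10, 5), (1, 20, 3), (3, 20, 1), (2, 30, 5)]


def map_phase(rows):
    """Emit (group key, value) pairs: here (movie_id, rating)."""
    for _user_id, movie_id, rating in rows:
        yield movie_id, rating


def shuffle(pairs):
    """Collect all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group: here the average rating per movie."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Roughly: SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id
avg_by_movie = reduce_phase(shuffle(map_phase(RATINGS)))
print(avg_by_movie)  # {10: 4.5, 20: 2.0, 30: 5.0}
```

The one-line query in the comment replaces all three hand-written functions, which is the point of the "MapReduce can be hard" slide.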
  • 10. Demo
  • 11. Built-in functions
      • Text mining:
          – ngrams()
          – context_ngrams()
          – sentences()
      • Statistics + mathematics:
          – stddev()
          – histogram_numeric()
          – log
          – radians
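To give a feel for what the text-mining functions compute, here is a small Python sketch of sentence tokenising followed by top-k n-gram counting. It mirrors the idea behind sentences() and ngrams() but is not Hive's implementation, and the function signatures, the tokenising rules, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def sentences(text):
    """Split text into sentences, each returned as a list of lowercase words."""
    parts = re.split(r"[.!?]+", text.lower())
    return [re.findall(r"\w+", part) for part in parts if part.strip()]


def top_ngrams(tokenized, n, k):
    """Count every run of n consecutive words and return the k most frequent."""
    counts = Counter()
    for words in tokenized:
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts.most_common(k)


# Invented review text.
REVIEW = "I liked the movie. I liked the cast. I liked it. The movie was long."
print(top_ngrams(sentences(REVIEW), n=2, k=1))  # [(('i', 'liked'), 3)]
```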
  • 12. User Defined Functions
      • Written in Java
      • User Defined Functions (UDFs):
          – Single row → Single row
          – e.g. mathematical and string functions
      • User Defined Aggregate Functions (UDAFs):
          – Multiple rows → Single row
          – e.g. AVG
      • User Defined Table Functions (UDTFs):
          – Single row → Multiple rows
          – e.g. “explode”
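The three kinds of user defined function differ only in how many rows go in and how many come out. The Python sketch below shows those shapes with plain functions; real Hive UDFs are written in Java against Hive's own classes, which are not shown here, and the function names are invented for the example.

```python
def udf_upper(value):
    """UDF shape: single row in, single row out (a string function)."""
    return value.upper()


def udaf_avg(values):
    """UDAF shape: multiple rows in, single row out (like AVG)."""
    values = list(values)
    return sum(values) / len(values)


def udtf_explode(array):
    """UDTF shape: single row in, multiple rows out (like "explode")."""
    for element in array:
        yield element


titles = ["toy story", "heat"]
print([udf_upper(title) for title in titles])      # one output per input row
print(udaf_avg([4, 5, 3]))                         # one output for the whole group
print(list(udtf_explode(["action", "romance"])))   # several outputs for one row
```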
  • 13. Hive Clients (figure from Hadoop – The Definitive Guide)
  • 14. Hive Server
      • JDBC
      • ODBC
  • 15. Sqoop
      • Move data between Hadoop and relational databases
      • Diagram labels: RDBMS, Sqoop, Hadoop, Hive Metastore (schema)
      • http://incubator.apache.org/projects/sqoop.html
  • 16. Sqoop adapters
  • 17. Conclusion
      • Scales to handle much more data than traditional systems:
          – Leverages Hadoop HDFS and MapReduce
          – Relational/structured data
          – Schema on read vs schema on write
      • Supports rapid iteration of ad-hoc queries
          – SQL-like querying language
          – Complex queries (joins, etc) with minimal code
      • Is not a database replacement:
          – Treats data as immutable
          – No indexing