Hive: Data Warehousing for Hadoop

Ben Lever, NICTA
Meetup #2, 27 Mar 2012 - http://sydney.bigdataaustralia.com.au/events/53934632/

Published in: Technology
  • # of users = 943, # of movies = 1682, # of ratings = 100,000
  • Shell, Driver, Compiler, Execution engine, Metastore

    1. Hive: Data Warehousing for Hadoop
       Ben Lever, @bmlever
       Big Data Analytics Meetup, 27 March 2012
    2. Another Data Warehousing System?
       • Problem:
         – Lots of data
       • Partial solution:
         – Hadoop
       • Another problem:
         – MapReduce can be hard
         – Schema information embedded in program
         – A lot of data is still structured
    3. Solution: Hive
       • A system for querying and managing structured data within Hadoop
         – MapReduce for execution
         – HDFS for storage
       • Designed for end users who know more SQL than Java
       • Apache License v2
       • hive.apache.org
    4. Working example: MovieLens
       • Movie ratings
       • 3 “tables”:

         Users         Movies          Ratings
         id            id              user id
         age           title           movie id
         gender        release date    rating (1 – 5)
         occupation    action          timestamp
         zip code      adventure
                       romance
                       ...

       www.grouplens.org
    5. Demo
    6. So far
       • Hive shell
       • Creating and loading tables
       • Data model:
         – INT, BIGINT, TINYINT, STRING, etc.
         – Also: FLOAT, DOUBLE, ARRAY, MAP, STRUCT
       • Simple queries with filtering
       • Table data is immutable
       • Schema on read vs schema on write
    7. Hive components
       [Diagram. Labels recovered from the slide:]
       • CLI: accepts the Hive query (SQL-like), e.g. SELECT * FROM customers WHERE gender = 'M';
       • Metastore: schema info, e.g. TABLE customer ( customer_id BIGINT, gender STRING, ...
       • Driver: launches the MapReduce job
       • MapReduce
       • HDFS: raw source data (compressed)
    8. Metastore
       [Figure; source: Hadoop: The Definitive Guide]
    9. Other SQL-like features
       • Aggregation – COUNT, AVG
       • JOIN
       • GROUP BY
       • SORT BY
       • Subqueries
    10. Demo
    11. Built-in functions
       • Text mining:
         – ngrams()
         – context_ngrams()
         – sentences()
       • Statistics + mathematics:
         – stddev()
         – histogram_numeric()
         – log()
         – radians()
    12. User Defined Functions
       • Written in Java
       • User Defined Functions (UDFs):
         – Single row → Single row
         – e.g. mathematical and string functions
       • User Defined Aggregate Functions (UDAFs):
         – Multiple rows → Single row
         – e.g. AVG
       • User Defined Table Functions (UDTFs):
         – Single row → Multiple rows
         – e.g. “explode”
    13. Hive Clients
       [Figure; source: Hadoop: The Definitive Guide]
    14. Hive Server
       • JDBC
       • ODBC
    15. Sqoop
       • Move data between Hadoop and relational databases
       [Diagram. Labels recovered from the slide: RDBMS, Sqoop, Hadoop, Hive Metastore, schema]
       http://incubator.apache.org/projects/sqoop.html
    16. Sqoop adapters
    17. Conclusion
       • Scales to handle much more data than traditional systems:
         – Leverages Hadoop HDFS and MapReduce
         – Relational/structured data
         – Schema on read vs schema on write
       • Supports rapid iteration of ad-hoc queries
         – SQL-like querying language
         – Complex queries (joins, etc.) with minimal code
       • Is not a database replacement:
         – Treats data as immutable
         – No indexing
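
Slides 6 and 17 contrast schema on read with schema on write. The idea can be sketched in plain Python, with no Hive involved: the raw text is stored untouched, and a schema is applied only when the data is read. The sample rows, the `RATINGS_SCHEMA` layout, and the `read_with_schema` helper are assumptions made up for this illustration; turning an unparseable field into `None` is a choice of the sketch, not a statement of how Hive is implemented.

```python
import csv
import io

# Raw tab-separated text as it might sit in storage. Nothing is validated
# when it is written. These three rows are invented for illustration.
RAW_RATINGS = "1\t10\t5\n2\t10\t3\n3\t11\toops\n"

# Hypothetical schema (column name, cast function), applied only at read time.
RATINGS_SCHEMA = [("user_id", int), ("movie_id", int), ("rating", int)]


def read_with_schema(raw, schema):
    """Parse tab-separated text, casting each field according to the schema.

    A field that cannot be cast becomes None, so a bad value shows up when
    the data is queried instead of blocking the load.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw), delimiter="\t"):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_RATINGS, RATINGS_SCHEMA)

# A simple filtering query over the typed rows: ratings of 4 or higher.
high = [r for r in rows if r["rating"] is not None and r["rating"] >= 4]
print(high)
```

With schema on write, the third row would be rejected when it was loaded. Here it is kept, and only its `rating` field comes back empty.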

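The SQL-like features listed on slide 9 (aggregation, JOIN, GROUP BY) can be tried out without a Hadoop cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, so it shows the shape of such a query and not Hive itself. The two tables loosely follow the MovieLens layout from slide 4; the rows and titles are invented. SQLite has no `SORT BY`, so the sketch uses `ORDER BY` instead.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies  (id INTEGER, title TEXT);
    CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER);
    -- Invented sample rows, not taken from the MovieLens data set.
    INSERT INTO movies  VALUES (1, 'Movie A'), (2, 'Movie B');
    INSERT INTO ratings VALUES (1, 1, 5), (2, 1, 4), (1, 2, 3);
""")

# JOIN ratings to movies, then aggregate per title with COUNT and AVG.
query = """
    SELECT m.title, COUNT(*) AS num_ratings, AVG(r.rating) AS avg_rating
    FROM ratings r
    JOIN movies m ON r.movie_id = m.id
    GROUP BY m.title
    ORDER BY avg_rating DESC
"""
results = conn.execute(query).fetchall()
for title, num_ratings, avg_rating in results:
    print(title, num_ratings, avg_rating)
```

The same query written by hand as MapReduce code would need separate join and aggregation logic, which is the "complex queries with minimal code" point made in the conclusion.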
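
Slide 12 separates the three kinds of user defined function by how many rows go in and how many come out. Real Hive UDFs are written in Java, as the slide says; the Python functions below are only a sketch of those row cardinalities and do not use any Hive API. The names `udf_upper`, `udaf_avg`, and `udtf_explode` and the sample rows are hypothetical.

```python
from typing import Iterable, Iterator, List


def udf_upper(value: str) -> str:
    """UDF shape: one input row produces one output row."""
    return value.upper()


def udaf_avg(values: Iterable[float]) -> float:
    """UDAF shape: many input rows produce a single output row."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("cannot average zero rows")
    return total / count


def udtf_explode(values: List[str]) -> Iterator[str]:
    """UDTF shape: one input row produces several output rows."""
    for v in values:
        yield v


# Invented sample rows for illustration.
rows = [
    {"title": "movie a", "genres": ["action", "romance"], "rating": 4},
    {"title": "movie b", "genres": ["adventure"], "rating": 2},
]

titles = [udf_upper(r["title"]) for r in rows]                  # 2 rows in, 2 rows out
average = udaf_avg(r["rating"] for r in rows)                   # 2 rows in, 1 value out
genres = [g for r in rows for g in udtf_explode(r["genres"])]   # 2 rows in, 3 rows out
print(titles, average, genres)
```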