Apache Hive


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. The language also lets traditional map/reduce programmers plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Published in: Technology

  1. 1. Motivation
     – Data is analyzed by both engineering and non-engineering people.
     – The data is growing fast: the volume was 15 TB in 2007 and grew to 200 TB in 2010.
     – Current RDBMSs cannot handle it.
     – Current solutions are not available, not scalable, expensive, and proprietary.
  2. 2. Map/Reduce - Apache Hadoop
     – MapReduce is a programming model and an associated implementation introduced by Google in 2004.
     – Apache Hadoop is a software framework inspired by Google's MapReduce.
  3. 3. Motivation (cont.)
     Hadoop supports data-intensive distributed applications. However...
     – MapReduce is hard to program (users know SQL/bash/Python).
     – No schema.
  4. 4. What is Hive?
     A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
     – ETL.
     – Structure.
     – Access to different storage.
     – Query execution via MapReduce.
     Key building principles:
     – SQL is a familiar language.
     – Extensibility: types, functions, formats, scripts.
     – Performance.
  5. 5. Hive Applications
     – Log processing
     – Text mining
     – Document indexing
     – Customer-facing business intelligence (e.g., Google Analytics)
     – Predictive modeling, hypothesis testing
  6. 6. Hive Architecture
  7. 7. Data Units
     – Databases.
     – Tables.
     – Partitions.
     – Buckets (or Clusters).
  8. 8. Type System
     Primitive types
     – Integers: TINYINT, SMALLINT, INT, BIGINT.
     – Boolean: BOOLEAN.
     – Floating point numbers: FLOAT, DOUBLE.
     – String: STRING.
     Complex types
     – Structs: {a INT; b INT}.
     – Maps: M['group'].
     – Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
  9. 9. Physical Layout
     – Warehouse directory in HDFS, e.g., /user/hive/warehouse
     – Table row data stored in subdirectories of the warehouse directory
     – Partitions form subdirectories of table directories
     – Actual data stored in flat files
       – Control char-delimited text, or SequenceFiles
       – With a custom SerDe, can use an arbitrary format
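The layout rules on this slide can be sketched as a small path builder. This is a minimal illustration, assuming a table in the default database and the example warehouse root from the slide; the function name `partition_path` is hypothetical, and the escaping Hive applies to special characters in partition values is not handled.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # example warehouse root from the slide


def partition_path(table, partition_spec, warehouse_dir=WAREHOUSE_DIR):
    """Return the HDFS directory for one partition of a default-database table.

    partition_spec is an ordered list of (column, value) pairs, one per
    PARTITIONED BY column, in declaration order. Hypothetical helper: it
    does not escape special characters in values.
    """
    # Each partition column becomes one "column=value" subdirectory level.
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return posixpath.join(warehouse_dir, table, *parts)


print(partition_path("sample", [("ds", "2012-02-24")]))
# prints /user/hive/warehouse/sample/ds=2012-02-24
```

A query that filters on ds only needs to read the files under the matching subdirectory, which is why partition columns are usually the ones queries filter on most.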
  10. 10. HiveQL
  11. 11. Examples – DDL Operations
      CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
      SHOW TABLES '.*s';
      DESCRIBE sample;
      ALTER TABLE sample ADD COLUMNS (new_col INT);
      DROP TABLE sample;
  12. 12. Examples – DML Operations
      LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
      LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
  14. 14. SELECTS and FILTERS
      SELECT foo FROM sample WHERE ds='2012-02-24';
      INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;
  15. 15. Aggregations and Groups
      SELECT MAX(foo) FROM sample;
      SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
      FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;
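Since Hive executes queries via MapReduce, it can help to see roughly how a GROUP BY decomposes into map, shuffle, and reduce steps. The sketch below mimics SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds on a few made-up in-memory rows. It is a conceptual illustration only, not Hive's actual execution plan; the rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of the `sample` table as (foo, bar, ds) tuples.
rows = [
    (1, "a", "2012-02-24"),
    (5, "b", "2012-02-24"),
    (3, "a", "2012-02-25"),
]


def map_phase(rows):
    """Emit (ds, foo) pairs: the grouping key and the value to aggregate."""
    for foo, _bar, ds in rows:
        yield ds, foo


def shuffle(pairs):
    """Collect all values that share a key, as happens between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Compute COUNT(*) and SUM(foo) for each key."""
    return {key: (len(values), sum(values)) for key, values in groups.items()}


result = reduce_phase(shuffle(map_phase(rows)))
for ds in sorted(result):
    count, total = result[ds]
    print(ds, count, total)
# prints:
# 2012-02-24 2 6
# 2012-02-25 1 3
```

The sample rows contain no NULLs, so the sketch sidesteps the SQL rule that SUM skips NULL values while COUNT(*) still counts the row.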
  16. 16. Extension mechanisms
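One extension mechanism is the custom script mentioned in the description at the top: Hive can stream rows through an external program. The sketch below is a minimal mapper-style script, assuming it receives the foo and bar columns of the sample table as tab-separated text on stdin and upper-cases bar. The file name upper_bar.py and the output column name bar_upper used in the note after the code are hypothetical.

```python
import sys


def transform_line(line):
    """Turn one tab-separated input row (foo, bar) into (foo, BAR)."""
    foo, bar = line.rstrip("\n").split("\t")
    # Hive passes NULL to scripts as the literal \N; leave it untouched.
    if bar != "\\N":
        bar = bar.upper()
    return "\t".join([foo, bar])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


# Small in-memory demonstration; a deployed script would call main() instead.
for sample_row in ["1\thello\n", "2\t\\N\n"]:
    print(transform_line(sample_row))
```

With main() called at the bottom and the file saved as upper_bar.py, the script could be registered with ADD FILE upper_bar.py; and invoked with SELECT TRANSFORM(foo, bar) USING 'python upper_bar.py' AS (foo, bar_upper) FROM sample;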
  17. 17. Built-in Functions
      – Mathematical: round, floor, ceil, rand, exp...
      – Collection: size, map_keys, map_values, array_contains.
      – Type conversion: cast.
      – Date: from_unixtime, to_date, year, datediff...
      – Conditional: if, case, coalesce.
      – String: length, reverse, upper, trim...
  18. 18. Install and Config Hive
  19. 19. Installing Hive
      From a release tarball:
      $ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
      $ tar xvzf hive-0.5.0-bin.tar.gz
      $ cd hive-0.5.0-bin
      $ export HIVE_HOME=$PWD
      $ export PATH=$HIVE_HOME/bin:$PATH
  20. 20. Hive Dependencies
      – Java 1.6
      – Hadoop 0.17 to 0.20
      – Hive *MUST* be able to find Hadoop:
        – $HADOOP_HOME=<hadoop-install-dir>
        – Add $HADOOP_HOME/bin to $PATH
      – Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:
        $ hadoop fs -mkdir /tmp
        $ hadoop fs -mkdir /user/hive/warehouse
        $ hadoop fs -chmod g+w /tmp
        $ hadoop fs -chmod g+w /user/hive/warehouse
  21. 21. Hive Configuration
      – Default configuration in $HIVE_HOME/conf/hive-default.xml
        – DO NOT TOUCH THIS FILE!
      – (Re)define properties in $HIVE_HOME/conf/hive-site.xml
      – Use $HIVE_CONF_DIR to specify an alternate conf dir location
      – You can override Hadoop configuration properties in Hive's configuration, e.g.:
        – mapred.reduce.tasks=1
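An override such as mapred.reduce.tasks=1 goes into hive-site.xml as a property element holding a name and a value, the same layout Hadoop configuration files use. The sketch below generates that XML from a dictionary; the function name `build_hive_site` is hypothetical, and writing the result to $HIVE_HOME/conf/hive-site.xml is left to the reader.

```python
import xml.etree.ElementTree as ET


def build_hive_site(overrides):
    """Return hive-site.xml content for a dict of property overrides."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        # Each override becomes <property><name>..</name><value>..</value></property>.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = build_hive_site({"mapred.reduce.tasks": 1})
print(xml_text)
```

Keeping overrides in hive-site.xml rather than editing hive-default.xml means the shipped defaults stay intact and local changes are easy to spot.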
  22. 22. Hive CLI
  23. 23. Hive CLI Commands
      – Start a terminal and run: $ hive
      – You should see a prompt like: hive>
      – Set a Hive or Hadoop conf prop:
        – hive> set propkey=value;
      – List all properties and values:
        – hive> set -v;
      – Add a resource to the distributed cache (DCache):
        – hive> add [ARCHIVE|FILE|JAR] filename;
  24. 24. Hive CLI Commands
      – List tables:
        – hive> show tables;
      – Describe a table:
        – hive> describe <tablename>;
      – More information:
        – hive> describe extended <tablename>;
      – List functions:
        – hive> show functions;
      – More information:
        – hive> describe function <functionname>;
  25. 25. Conclusion
      – An easy way to process large-scale data.
      – Supports SQL-based queries.
      – Provides more user-defined interfaces to extend programmability.
      – Files in HDFS are immutable.
      – Typical uses:
        – Log processing: daily reports, user activity measurement
        – Data/text mining: machine learning (training data)
        – Business intelligence: advertising delivery, spam detection