Apache Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Published in: Technology
Transcript of "Apache Hive"

  1. Motivation
     • Data is analyzed by both engineering and non-engineering people.
     • The data is growing fast: the volume was 15 TB in 2007 and grew to 200 TB by 2010.
     • Current RDBMSs cannot handle it.
     • Current solutions are not available, not scalable, expensive, and proprietary.
  2. Map/Reduce - Apache Hadoop
     • MapReduce is a programming model and an associated implementation introduced by Google in 2004.
     • Apache Hadoop is a software framework inspired by Google's MapReduce.
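
The programming model named on this slide can be sketched in a few lines of plain Python. This is an in-memory illustration only, not Hadoop's API: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) are made up for the example, and the word-count task is an assumption chosen because it is the usual teaching case.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Run map, shuffle, and reduce in a single process (hypothetical helper)."""
    # Shuffle: group every emitted value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key independently; sorting only makes the output deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]
```

A real framework runs the map and reduce steps on many machines and moves the grouped data between them; the sketch keeps only the shape of the computation.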
  3. Motivation (cont.)
     • Hadoop supports data-intensive distributed applications.
     • However...
       – MapReduce is hard to program (users know SQL/bash/Python).
       – No schema.
  4. What is Hive?
     • A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
       – ETL.
       – Structure.
       – Access to different storage.
       – Query execution via MapReduce.
     • Key building principles:
       – SQL is a familiar language.
       – Extensibility: types, functions, formats, scripts.
       – Performance.
  5. Hive Applications
     • Log processing
     • Text mining
     • Document indexing
     • Customer-facing business intelligence (e.g., Google Analytics)
     • Predictive modeling, hypothesis testing
  6. Hive Architecture
  7. Data Units
     • Databases.
     • Tables.
     • Partitions.
     • Buckets (or Clusters).
  8. Type System
     • Primitive types
       – Integers: TINYINT, SMALLINT, INT, BIGINT.
       – Boolean: BOOLEAN.
       – Floating point numbers: FLOAT, DOUBLE.
       – String: STRING.
     • Complex types
       – Structs: {a INT; b INT}.
       – Maps: M['group'].
       – Arrays: ['a', 'b', 'c'], A[1] returns 'b' (indexing is zero-based).
  9. Physical Layout
     • Warehouse directory in HDFS
       – e.g., /user/hive/warehouse
     • Table row data is stored in subdirectories of the warehouse directory.
     • Partitions form subdirectories of table directories.
     • Actual data is stored in flat files.
       – Control-character-delimited text, or SequenceFiles.
       – With a custom SerDe, an arbitrary format can be used.
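
The directory nesting described on this slide can be made concrete with a small path helper. This is a sketch under stated assumptions: `partition_path` is a hypothetical name, the table is assumed to live in the default database, and partition values are assumed to need no escaping.

```python
import posixpath
from typing import Iterable, Tuple

# Example warehouse location taken from the slide.
WAREHOUSE_DIR = "/user/hive/warehouse"


def partition_path(
    table: str,
    partition_spec: Iterable[Tuple[str, str]],
    warehouse_dir: str = WAREHOUSE_DIR,
) -> str:
    """Return the HDFS directory holding one partition of a table.

    Each partition column adds one `column=value` subdirectory, in the
    order the columns appear in PARTITIONED BY.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(warehouse_dir, table, *parts)


# Table `sample` partitioned by `ds`, as in the DDL example slide.
print(partition_path("sample", [("ds", "2012-02-24")]))
# prints: /user/hive/warehouse/sample/ds=2012-02-24
```

Because a filter such as `WHERE ds='2012-02-24'` maps to exactly one directory, a query on that partition only has to read the files under it.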
  10. HiveQL
  11. Examples – DDL Operations
      CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
      SHOW TABLES '.*s';
      DESCRIBE sample;
      ALTER TABLE sample ADD COLUMNS (new_col INT);
      DROP TABLE sample;
  12. Examples – DML Operations
      LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
      LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
  14. SELECTS and FILTERS
      SELECT foo FROM sample WHERE ds='2012-02-24';
      INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;
  15. Aggregations and Groups
      SELECT MAX(foo) FROM sample;
      SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
      FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;
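
Since Hive executes queries via MapReduce, it can help to see how a GROUP BY aggregation decomposes into a map, a shuffle, and a reduce. The sketch below mirrors `SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds` on in-memory rows. It illustrates the idea only and is not how Hive's planner is implemented; `group_count_sum` and the dict-based row format are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


def group_count_sum(rows: Iterable[Row]) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {ds: (COUNT(*), SUM(foo))} for the given rows."""
    # Map + shuffle: emit (ds, foo) per row and group the values by ds.
    shuffled: Dict[str, List[Optional[int]]] = defaultdict(list)
    for row in rows:
        shuffled[row["ds"]].append(row["foo"])

    # Reduce: COUNT(*) counts every row; SUM skips NULLs (None here)
    # and is itself NULL when a group has no non-NULL value.
    result = {}
    for ds, foos in shuffled.items():
        non_null = [foo for foo in foos if foo is not None]
        result[ds] = (len(foos), sum(non_null) if non_null else None)
    return result
```

The grouping key (`ds`) plays the role of the MapReduce key, which is why all rows for one group end up at the same reducer.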
  16. Extension mechanisms
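
One extension mechanism mentioned in the description is plugging custom mappers and reducers into a query. In HiveQL this is typically done with TRANSFORM, which streams rows to an external script as tab-separated lines and reads tab-separated lines back. The following is a minimal sketch of such a script; the file name `upper_bar.py`, the function name `transform_rows`, and the choice to upper-case the `bar` column are all assumptions for illustration.

```python
import io
import sys
from typing import TextIO

# Hypothetical usage from the Hive CLI, assuming this file is saved as upper_bar.py:
#   ADD FILE upper_bar.py;
#   SELECT TRANSFORM(foo, bar) USING 'python upper_bar.py' AS (foo, bar) FROM sample;


def transform_rows(source: TextIO, sink: TextIO) -> None:
    """Read tab-separated (foo, bar) rows and write them back with bar upper-cased."""
    for line in source:
        foo, bar = line.rstrip("\n").split("\t")
        sink.write(f"{foo}\t{bar.upper()}\n")


if __name__ == "__main__":
    # Demo on in-memory data. When Hive runs the script, call
    # transform_rows(sys.stdin, sys.stdout) here instead.
    demo_out = io.StringIO()
    transform_rows(io.StringIO("1\tabc\n2\tdef\n"), demo_out)
    sys.stdout.write(demo_out.getvalue())
```

Keeping the row logic in a function that takes file-like objects makes the script easy to test without a cluster.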
  17. Built-in Functions
      • Mathematical: round, floor, ceil, rand, exp...
      • Collection: size, map_keys, map_values, array_contains.
      • Type conversion: cast.
      • Date: from_unixtime, to_date, year, datediff...
      • Conditional: if, case, coalesce.
      • String: length, reverse, upper, trim...
  18. Install and Configure Hive
  19. Installing Hive
      From a release tarball:
      $ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
      $ tar xvzf hive-0.5.0-bin.tar.gz
      $ cd hive-0.5.0-bin
      $ export HIVE_HOME=$PWD
      $ export PATH=$HIVE_HOME/bin:$PATH
  20. Hive Dependencies
      • Java 1.6
      • Hadoop 0.17 through 0.20
      • Hive *MUST* be able to find Hadoop:
        – $HADOOP_HOME=<hadoop-install-dir>
        – Add $HADOOP_HOME/bin to $PATH
      • Hive needs read/write access to /tmp and /user/hive/warehouse on HDFS:
      $ hadoop fs -mkdir /tmp
      $ hadoop fs -mkdir /user/hive/warehouse
      $ hadoop fs -chmod g+w /tmp
      $ hadoop fs -chmod g+w /user/hive/warehouse
  21. Hive Configuration
      • Default configuration is in $HIVE_HOME/conf/hive-default.xml
        – DO NOT TOUCH THIS FILE!
      • Define or redefine properties in $HIVE_HOME/conf/hive-site.xml
      • Use $HIVE_CONF_DIR to specify an alternate conf dir location.
      • You can override Hadoop configuration properties in Hive's configuration, e.g.:
        – mapred.reduce.tasks=1
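
Overrides in hive-site.xml use the Hadoop configuration layout: a `<configuration>` root holding `<property>` elements, each with a `<name>` and a `<value>`. As a rough sketch, the snippet below builds that XML for the `mapred.reduce.tasks=1` example from the slide. `build_hive_site` is a hypothetical helper, and writing the result to `$HIVE_HOME/conf/hive-site.xml` is left to the reader.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def build_hive_site(overrides: Mapping[str, object]) -> str:
    """Return hive-site.xml content for the given property overrides."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # encoding="unicode" yields a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Override a Hadoop property from Hive's configuration, as on the slide.
print(build_hive_site({"mapred.reduce.tasks": 1}))
```

Properties set this way apply to every session; the CLI `set` command covers per-session changes.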
  22. Hive CLI
  23. Hive CLI Commands
      • Start a terminal and run:
        $ hive
      • You should see a prompt like:
        hive>
      • Set a Hive or Hadoop configuration property:
        – hive> set propkey=value;
      • List all properties and values:
        – hive> set -v;
      • Add a resource to the distributed cache (DCache):
        – hive> add [ARCHIVE|FILE|JAR] filename;
  24. Hive CLI Commands
      • List tables:
        – hive> show tables;
      • Describe a table:
        – hive> describe <tablename>;
      • More information:
        – hive> describe extended <tablename>;
      • List functions:
        – hive> show functions;
      • More information:
        – hive> describe function <functionname>;
  25. Conclusion
      • An easy way to process large-scale data.
      • Supports SQL-based queries.
      • Provides more user-defined interfaces to extend programmability.
      • Files in HDFS are immutable.
      • Typical uses:
        – Log processing: daily reports, user activity measurement
        – Data/text mining: machine learning (training data)
        – Business intelligence: advertising delivery, spam detection