Hive and HiveQL - Module6


Learning Objectives - This module helps you understand Apache Hive installation and how to load and query data in Hive.
Topics - Hive Architecture and Installation; Comparison with Traditional Database; HiveQL: Data Types, Operators and Functions; Hive Tables (Managed and External Tables, Partitions and Buckets, Storage Formats, Importing Data, Altering Tables, Dropping Tables); Querying Data (Sorting and Aggregating, MapReduce Scripts, Joins and Subqueries, Views, Map-side and Reduce-side Joins to optimize queries).

Published in: Technology

  1. 1. Hive and HiveQL
  2. 2. What is Hive?
      • Apache Hive is a data warehouse system for Hadoop.
      • Hive is not a relational database; it only maintains metadata about your Big Data stored on HDFS.
      • Hive lets you treat your Big Data as tables and perform SQL-like operations on it using a query language called HiveQL.
      • Hive uses a database (called the metastore) to store the definitions of the tables you create. By default, the metastore is an embedded Derby database.
      • A Hive table consists of a schema stored in the metastore and data stored on HDFS.
      • Hive converts HiveQL commands into MapReduce jobs.
  3. 3. Hive Architecture Contd..
      Step 1: Issuing Commands. A Hive query is submitted to the HiveServer from the Hive CLI, a Web interface, or a Hive JDBC/ODBC client.
      Step 2: Hive Query Plan. The Hive query is compiled, optimized and planned as a MapReduce job.
      Step 3: MapReduce Job Executes. The corresponding MapReduce job is executed on the Hadoop cluster.
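      The plan produced in Step 2 can be inspected without running the job by prefixing a query with HiveQL's EXPLAIN keyword. A minimal sketch, assuming a table named records2 with a year column exists (it is the table used in the later import examples):

      ```sql
      -- Print the execution plan (stage dependencies and the operators in each stage)
      -- instead of executing the query. records2 is assumed to exist.
      EXPLAIN
      SELECT year, COUNT(1)
      FROM records2
      GROUP BY year;
      ```

      The exact output differs between Hive versions and execution engines, so treat it as a diagnostic aid rather than a stable format.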
  4. 4. Comparison with Traditional Database
  5. 5. Hive data types
  6. 6. Arithmetic Operators
  7. 7. Mathematical functions
  8. 8. Aggregate functions
  9. 9. Other built-in functions
  10. 10. Managed Tables
      • When a table is created in Hive, Hive manages the data by default: when data is loaded into a managed table, it is moved into Hive’s warehouse directory.
          CREATE TABLE managed_table (dummy STRING);
          LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
      • The LOAD statement moves the HDFS file /user/tom/data.txt into Hive’s warehouse directory for the managed_table table, which is /user/hive/warehouse/managed_table.
      • If the table is later dropped, then the table, including its metadata and its data, is deleted.
  11. 11. External Tables
      • Creating an external table tells Hive to refer to data at an existing location outside the warehouse directory; Hive does not manage that data.
      • The location of the external data is specified at table creation time and names a directory rather than a single file:
          CREATE EXTERNAL TABLE external_table (dummy STRING)
          ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '\t'
            LINES TERMINATED BY '\n'
          LOCATION '/user/tom/external_table_location';
      • You control the creation and deletion of the data; Hive neither moves it into the warehouse directory nor deletes it.
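      The practical difference from a managed table shows up when the table is dropped. A short sketch, assuming external_table was created as on this slide and that /user/tom/data.txt is an illustrative file on HDFS:

      ```sql
      -- Assumes external_table exists and /user/tom/data.txt is present on HDFS
      -- (both names are illustrative).
      LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

      -- Removes only the metastore entry; the files under the table's LOCATION stay on HDFS.
      DROP TABLE external_table;
      ```

      Running the same DROP TABLE against a managed table would delete the data files as well as the metadata.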
  12. 12. Hive tables can be organized into buckets, which impose extra structure on the table and on how the underlying files are stored. Bucketing has two key benefits:
      • More efficient queries, especially when performing joins on the same bucketed columns.
      • More efficient sampling, because the data is already split up into smaller pieces.
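      A minimal sketch of declaring, populating and sampling a bucketed table. The tables bucketed_users and users, their columns, and the bucket count of 4 are hypothetical choices made for illustration:

      ```sql
      -- Hypothetical tables: users is an existing unbucketed table with the same columns.
      CREATE TABLE bucketed_users (id INT, name STRING)
      CLUSTERED BY (id) INTO 4 BUCKETS;

      -- On Hive releases before 2.0, first run:
      --   SET hive.enforce.bucketing = true;
      -- so that the insert writes one file per bucket. Later releases always enforce bucketing.
      INSERT OVERWRITE TABLE bucketed_users
      SELECT * FROM users;

      -- Sample roughly one quarter of the rows by reading a single bucket.
      SELECT * FROM bucketed_users
      TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
      ```

      Because the sample clause matches the bucketing column and bucket count, Hive can read just one bucket file instead of scanning the whole table.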
  13. 13. Storage Formats
      • Two dimensions govern table storage in Hive:
          • Row format: dictates how rows, and the fields in a particular row, are stored. The row format is defined by a SerDe.
          • File format: dictates the container format for fields in a row.
      The default storage format: delimited text
      • When a table is created with no ROW FORMAT or STORED AS clauses, the default format is delimited text, with one row per line.
      • The default field delimiter is not a tab character, but the Control-A character.
      • The default collection item delimiter is a Control-B character, used to delimit items in an ARRAY or STRUCT, or key-value pairs in a MAP.
      • The default map key delimiter is a Control-C character, used to delimit the key and value in a MAP.
      • Rows in a table are delimited by a newline character.
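      The defaults listed above can be written out explicitly. The sketch below should behave the same as a CREATE TABLE with no ROW FORMAT or STORED AS clauses, assuming the default file format has not been changed in the Hive configuration; example_table and its columns are hypothetical:

      ```sql
      -- Spells out the default delimiters using octal escapes:
      -- \001 = Control-A, \002 = Control-B, \003 = Control-C.
      -- example_table and its columns are hypothetical.
      CREATE TABLE example_table (
        id INT,
        tags ARRAY<STRING>,
        attrs MAP<STRING, STRING>
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
      ```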
  15. 15. Importing Data
      INSERT OVERWRITE TABLE
          INSERT OVERWRITE TABLE target
          SELECT col1, col2 FROM source;
      • For partitioned tables, you can specify the partition to insert into by supplying a PARTITION clause:
          INSERT OVERWRITE TABLE target PARTITION (dt='2010-01-01')
          SELECT col1, col2 FROM source;
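      For the PARTITION clause to work, the target table has to be declared as partitioned. A sketch of what that could look like; the column types are assumptions, since the slide does not give the table definition:

      ```sql
      -- Hypothetical definition of the partitioned target table. dt is the partition
      -- column and is not repeated in the regular column list.
      CREATE TABLE target (col1 STRING, col2 STRING)
      PARTITIONED BY (dt STRING);

      INSERT OVERWRITE TABLE target
      PARTITION (dt='2010-01-01')
      SELECT col1, col2 FROM source;

      -- List the partitions that now exist for the table.
      SHOW PARTITIONS target;
      ```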
  16. 16. Importing Data Contd..
      Multitable insert
          FROM records2
          INSERT OVERWRITE TABLE stations_by_year
            SELECT year, COUNT(DISTINCT station)
            GROUP BY year
          INSERT OVERWRITE TABLE records_by_year
            SELECT year, COUNT(1)
            GROUP BY year
          INSERT OVERWRITE TABLE good_records_by_year
            SELECT year, COUNT(1)
            WHERE temperature != 9999
              AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
            GROUP BY year;
  17. 17. Importing Data Contd..
      CREATE TABLE...AS SELECT
          CREATE TABLE target
          AS
          SELECT col1, col2
          FROM source;
      • A CTAS operation is atomic, so if the SELECT query fails for some reason, then the table is not created.
  18. 18. Altering Tables
      • ALTER TABLE source RENAME TO target;
      • ALTER TABLE target ADD COLUMNS (col3 STRING);
      Dropping Tables
          DROP TABLE table_name;
      • The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the metadata is deleted; the data is left untouched.
  19. 19. Querying data (Sorting and Aggregating)
      • Sorting data in Hive can be achieved with a standard ORDER BY clause. ORDER BY produces a totally sorted result, but to do so it sets the number of reducers to one, which is inefficient for large datasets.
      • SORT BY produces a sorted file per reducer.
      • The DISTRIBUTE BY clause controls which reducer a particular row goes to.
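      SORT BY and DISTRIBUTE BY are often combined when a globally sorted result is not needed. A short sketch, assuming the records2 table from the import examples has year and temperature columns:

      ```sql
      -- All rows for a given year go to the same reducer (DISTRIBUTE BY),
      -- and each reducer sorts its own rows by year, then by temperature descending (SORT BY).
      FROM records2
      SELECT year, temperature
      DISTRIBUTE BY year
      SORT BY year ASC, temperature DESC;
      ```

      Each reducer's output is sorted, but the outputs taken together are not, which is the price paid for using more than one reducer.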
  20. 20. Querying data (Joins)
      • Inner joins
  21. 21.
      • Left Outer Join
      • Right Outer Join
      • Full Outer Join
      • Left Semi Join
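      The join types listed on slides 20 and 21 can be sketched as follows. The tables sales and things, and their shared id column, are hypothetical and stand in for any two tables with a common key:

      ```sql
      -- Inner join: only rows with a match in both tables.
      SELECT sales.*, things.*
      FROM sales JOIN things ON (sales.id = things.id);

      -- Left outer join: every row of sales, with NULLs where things has no match.
      SELECT sales.*, things.*
      FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);

      -- Right outer join: every row of things, with NULLs where sales has no match.
      SELECT sales.*, things.*
      FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);

      -- Full outer join: every row of both tables, matched where possible.
      SELECT sales.*, things.*
      FROM sales FULL OUTER JOIN things ON (sales.id = things.id);

      -- Left semi join: rows of things that have at least one match in sales.
      -- The right-hand table (sales) may be referenced only in the ON clause.
      SELECT *
      FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
      ```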
  22. 22. Subqueries
      • A subquery is a SELECT statement that is embedded in another SQL statement. Hive has limited support for subqueries: older releases permit a subquery only in the FROM clause of a SELECT statement (Hive 0.13 and later also accept some subqueries in the WHERE clause). A subquery in the FROM clause must be given an alias, here mt.
          SELECT station, year, AVG(max_temperature)
          FROM (
            SELECT station, year, MAX(temperature) AS max_temperature
            FROM records2
            WHERE temperature != 9999
              AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
            GROUP BY station, year
          ) mt
          GROUP BY station, year;
  23. 23. Views
      • A view is a sort of “virtual table” that is defined by a SELECT statement.
          CREATE VIEW max_temperatures (station, year, max_temperature)
          AS
          SELECT station, year, MAX(temperature)
          FROM valid_records
          GROUP BY station, year;
      • With the views in place, we can now use them by running a query:
          SELECT station, year, AVG(max_temperature)
          FROM max_temperatures
          GROUP BY station, year;
  24. 24. Hive Join Strategies