Clogeny's Hadoop Training Series - Apache Hive

This Hive hands-on training is part of Clogeny's Hadoop Training Series. It gives you a complete overview of Apache Hive, including architecture, data models, installation, configuration, and important Hive commands and scripts.

Transcript

  • 1. Clogeny’s Hadoop Developer Training Series: An Introduction to Hive. Madhur Nawandar, madhur.nawandar@clogeny.com. Clogeny Technologies: Cloud Computing, Private & Public Clouds, Big Data Storage, DevOps.
  • 2. What is Hive?
    A data warehousing infrastructure based on Hadoop.
    Provides easy data summarization.
    Provides ad-hoc querying and analysis of large volumes of data.
    Comes with HiveQL, a query language based on SQL.
    Allows plugging in custom mappers and reducers.
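    To make the HiveQL point concrete, here is a minimal sketch of an ad-hoc query, assuming the partitioned logs table that is created later in this deck; Hive compiles such statements into MapReduce jobs:
    hive> SELECT country, count(*) FROM logs GROUP BY country;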
  • 3. What Hive is NOT
    Not suitable for small datasets, because of its high query latency.
    Cannot be compared to systems like Oracle: it does not offer real-time queries or row-level updates.
  • 4. Hive Architecture (architecture diagram; no transcript text)
  • 5. Data Model Types - Tables
    Tables are made up of the actual data and the associated metadata. The actual data is stored in a Hadoop filesystem; the metadata is stored in a relational database such as Derby or MySQL.
    Managed tables: Hive physically moves the data into its warehouse directory.
    hive> CREATE TABLE managed_table (dummy STRING);
    hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
    External tables: Hive refers to data at an existing location in HDFS.
    hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
    hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
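    As a hedged aside (not on the slide), DESCRIBE FORMATTED shows whether Hive considers a table managed or external, along with its storage location:
    hive> DESCRIBE FORMATTED managed_table;
    hive> DESCRIBE FORMATTED external_table;
    The output includes a Table Type field (MANAGED_TABLE or EXTERNAL_TABLE) and the table's Location.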
  • 6. Data Model Types - Partitions
    A way to divide a table into coarse-grained parts. Data is partitioned based on the value of a partition column, and multiple partition dimensions are supported.
    Partitions are defined at table-creation time using the PARTITIONED BY clause.
    At the filesystem level, partitions are simply nested subdirectories of the table directory.
  • 7. Data Model Types - Partitions
    CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
    LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
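    After such a load, each partition is a nested subdirectory under the table's warehouse directory. A hedged sketch of what a recursive listing might show, assuming the default warehouse location:
    $ hadoop fs -ls -R /user/hive/warehouse/logs
    # illustrative layout: /user/hive/warehouse/logs/dt=2001-01-01/country=GB/file1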
  • 8. Data Model Types - Buckets
    Buckets subdivide a table (or each partition) into a fixed number of parts based on the hash of a bucketing column.
    They enable more efficient queries by working with smaller buckets of data rather than an entire partition.
    They also make sampling more efficient.
    hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
  • 9. Column Data Types - Primitives
    TINYINT: 8-bit signed integer (example: 1)
    SMALLINT: 16-bit signed integer (example: 1)
    INT: 32-bit signed integer (example: 1)
    BIGINT: 64-bit signed integer (example: 1)
    FLOAT: 32-bit single-precision floating-point number (example: 1.0)
    DOUBLE: 64-bit double-precision floating-point number (example: 1.0)
    BOOLEAN: true/false value (example: TRUE)
    STRING: character string (example: 'a', "a")
    TIMESTAMP: timestamp with nanosecond precision (example: '2012-01-02 03:04:05.123456789')
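    A small sketch of a table definition using these primitives (the table and column names are illustrative, not from the slides):
    hive> CREATE TABLE weblog (ip STRING, bytes BIGINT, duration DOUBLE, is_error BOOLEAN, ts TIMESTAMP);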
  • 10. Column Data Types - Complex Data Types
    ARRAY: an ordered collection of fields, all of the same type (example: array(1, 2))
    MAP: an unordered collection of key-value pairs; keys must be primitives, values may be any type; within one map, all keys have the same type and all values have the same type (example: map('a', 1, 'b', 2))
    STRUCT: a collection of named fields, which may be of different types (example: struct('a', 1, 1.0))
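    A hedged sketch of declaring and accessing complex-typed columns (the table name complex is illustrative): array elements are indexed with [n], map values with [key], and struct fields with dot notation.
    hive> CREATE TABLE complex (c1 ARRAY<INT>, c2 MAP<STRING, INT>, c3 STRUCT<a:STRING, b:INT, c:DOUBLE>);
    hive> SELECT c1[0], c2['b'], c3.c FROM complex;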
  • 11. Metastore
    The central repository of Hive metadata. It consists of two parts: the metastore service and the backing store for the data.
  • 12. Metastore deployment modes 1: Embedded Mode
    This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main HiveServer process; both are started for you when you start the HiveServer process.
    This mode requires the least effort to configure, but it can support only one active user at a time and is not certified for production use.
  • 13. Metastore deployment modes 2: Local Mode
    In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host. The embedded metastore service communicates with the metastore database over JDBC.
  • 14. Metastore deployment modes 3: Remote Mode
    In this mode the Hive metastore service runs in its own JVM process; other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property).
  • 15. Metastore Properties
    hive.metastore.warehouse.dir (URI): the directory in HDFS where managed tables are stored
    hive.metastore.local (boolean): flag selecting an embedded versus a local metastore
    hive.metastore.uris (comma-separated URIs): list of remote metastore URIs
    javax.jdo.option.ConnectionURL (URI): the JDBC URL of the metastore database
    javax.jdo.option.ConnectionDriverName (string): the JDBC driver classname
    javax.jdo.option.ConnectionUserName (string): the JDBC user name
    javax.jdo.option.ConnectionPassword (string): the JDBC password
  • 16. Hive Packages
    The following packages are needed by Hive:
    hive: the base package that provides the complete language and runtime (required)
    hive-metastore: scripts for running the metastore as a standalone service (optional)
    hive-server: scripts for running the original HiveServer as a standalone service (optional)
    hive-server2: scripts for running the new HiveServer2 as a standalone service (optional)
  • 17. Comparison with Traditional Databases
    Schema on read versus schema on write: in a traditional database, a table's schema is enforced at data load time, and data that doesn't conform to the schema is rejected. Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query is issued.
    Updates, transactions, and indexes: these are mainstays of traditional databases; until recently, they have not been considered part of Hive's feature set.
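    A hedged illustration of schema on read (the file bad.txt and the column layout are assumptions for the example): the LOAD succeeds even if some rows don't match the schema, because Hive only copies the file; the mismatch surfaces only at query time, when non-conforming fields come back as NULL.
    hive> CREATE TABLE readcheck (name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    hive> LOAD DATA LOCAL INPATH 'bad.txt' INTO TABLE readcheck;
    hive> SELECT * FROM readcheck;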
  • 18. Installing Hive
    We will install Hive with the metastore as a standalone service. For this, install the hive and hive-metastore packages:
    $ yum -y install hive hive-metastore
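    As a quick sanity check (an aside, not on the slide), confirm that the packages are installed before continuing:
    $ rpm -q hive hive-metastore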
  • 19. Hive Configuration
    The default configuration lives in /etc/hive/conf/hive-default.xml.
    (Re)define properties in /etc/hive/conf/hive-site.xml.
    Use $HIVE_CONF_DIR to specify an alternate configuration directory location.
    You can also override Hadoop configuration properties in Hive's configuration, e.g. mapred.reduce.tasks=1.
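    Two common ways to apply such an override, sketched here with the property from the slide: set it for the current session, or pass it when launching the CLI.
    hive> SET mapred.reduce.tasks=1;
    $ hive --hiveconf mapred.reduce.tasks=1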
  • 20. Configure Metastore database
    Step 1: Install and start MySQL if you have not already done so: $ yum install mysql-server
    Step 2: Configure the MySQL service and connector:
    $ yum install mysql-connector-java
    $ ln -s /usr/share/java/mysql-connector-java-5.1.17.jar /usr/lib/hive/lib/mysql-connector-java-5.1.17.jar
    Step 3: Set the MySQL root password.
  • 21. Configure Metastore database (screenshot; no transcript text)
  • 22. Configure Metastore database cont.
    Step 4: Make sure the MySQL server starts at boot: $ /sbin/chkconfig mysqld on
    Step 5: Create the database and user:
    Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
    Create a user for hive with the hostname of the metastore.
    Grant the proper privileges to that user.
  • 23. Configure Metastore database cont. (screenshot; no transcript text)
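    A hedged sketch of what Step 5 might look like in the mysql client (the screenshot above is not transcribed; the host metastorehost and the password are placeholders):
    mysql> CREATE DATABASE metastore;
    mysql> USE metastore;
    mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
    mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
    mysql> GRANT SELECT, INSERT, UPDATE, DELETE, LOCK TABLES, EXECUTE ON metastore.* TO 'hive'@'metastorehost';
    mysql> FLUSH PRIVILEGES;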
  • 24. Configure Metastore database cont.
    Step 6: Configure the metastore service to communicate with the MySQL database.
    This step covers the configuration properties you need to set in hive-site.xml so that the metastore service can communicate with the MySQL database, and it provides sample settings.
    You can use the same hive-site.xml on all hosts (client, metastore, HiveServer); hive.metastore.uris is the only property that must be configured on all of them, while the others are used only on the metastore host.
  • 25. Configure Metastore database cont. (screenshot; no transcript text)
  • 26. Configure Metastore database cont. (screenshot; no transcript text)
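    The sample settings on those two slides are shown only as screenshots; a hedged reconstruction of the kind of hive-site.xml entries Step 6 refers to (the host name metastorehost, database name and password are placeholders):
    <property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://metastorehost/metastore</value></property>
    <property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value></property>
    <property><name>javax.jdo.option.ConnectionUserName</name><value>hive</value></property>
    <property><name>javax.jdo.option.ConnectionPassword</name><value>mypassword</value></property>
    <property><name>hive.metastore.uris</name><value>thrift://metastorehost:9083</value></property>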
  • 27. Configure Metastore database cont.
    Step 7: Create the hive user directory in HDFS:
    $ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
    $ sudo -u hdfs hadoop fs -chmod og+rw /user/hive/warehouse
    $ sudo -u hdfs hadoop fs -chown -R hive /user/hive
    Step 8: Set environment variables. Add the following to the ~/.bashrc file ($ vim ~/.bashrc):
    export HADOOP_HOME="/usr/lib/hadoop"
    PATH=$PATH:"/usr/lib/hadoop/bin"
    Then run the command "bash" at the prompt to reload the shell: $ bash
  • 28. Starting the Metastore
    You can run the metastore from the command line: $ hive --service metastore
    Ensure that this does not report any errors. Use Ctrl-C to stop a metastore process running in the foreground.
    To run the metastore as a daemon, the command is: $ service hive-metastore start
  • 29. Starting the Hive Console
    To start the Hive console: $ hive
    To confirm that Hive is working, issue the SHOW TABLES; command to list the Hive tables; be sure to end the command with a semicolon: hive> SHOW TABLES;
  • 30. Hive CLI Commands
    Set a Hive or Hadoop configuration property: hive> SET propkey=value;
    List all properties and their values: hive> SET -v;
  • 31. Hive CLI Commands
    Creating a managed table:
    $ cat input/hive/tables/data.txt
    $ hive
    hive> CREATE TABLE managed_table (dummy STRING);
    hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE managed_table;
    hive> select * from managed_table;
    $ hadoop fs -cat /user/hive/warehouse/managed_table/data.txt
  • 32. Hive CLI Commands (screenshot; no transcript text)
  • 33. Hive CLI Commands (screenshot; no transcript text)
  • 34. Hive CLI Commands
    Creating an external table:
    Select a location in HDFS for the table and ensure that other users have write access to it:
    $ sudo -u hdfs hadoop fs -mkdir /user/joe/table
    $ sudo -u hdfs hadoop fs -chmod a+w /user/joe/table
    Create the external table and load data into it:
    hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/joe/table';
    hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE external_table;
    hive> select * from external_table;
    Check that the data landed in the external directory:
    $ sudo -u hdfs hadoop fs -cat /user/joe/table/data.txt
  • 35. Hive CLI Commands (screenshot; no transcript text)
  • 36. Hive CLI Commands (screenshot; no transcript text)
  • 37. Hive CLI Commands
    Create a partitioned table:
    hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
    Load data into the table, specifying the partitions:
    hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
    hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file2' INTO TABLE logs PARTITION (dt='2001-01-01', country='US');
    hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file3' INTO TABLE logs PARTITION (dt='2001-01-02', country='US');
    See the table contents:
    hive> select * from logs;
    List all the partitions:
    hive> SHOW PARTITIONS logs;
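    Worth noting (a hedged aside, not on the slide): when a query filters on a partition column, Hive prunes the scan to the matching subdirectories, so only those files are read.
    hive> SELECT ts, dt, line FROM logs WHERE country='GB';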
  • 38. Hive CLI Commands (screenshot; no transcript text)
  • 39. Hive CLI Commands (screenshot; no transcript text)
  • 40. Hive CLI Commands (screenshot; no transcript text)
  • 41. Hive CLI Commands
    Create buckets: create a normal table users, then create a bucketed table bucketed_users from it.
    hive> set hive.enforce.bucketing=true;
    hive> CREATE TABLE users (id INT, name STRING);
    hive> LOAD DATA LOCAL INPATH 'input/hive/tables/users.txt' INTO TABLE users;
    hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
    hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
    Check the contents of the table, and of a single bucket:
    hive> select * from bucketed_users;
    hive> select * from bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
  • 42. Hive CLI Commands (screenshot; no transcript text)
  • 43. Hive CLI Commands (screenshot; no transcript text)
  • 44. Joins
    Prerequisites: create two tables, sales and things, and load data from files.
    hive> CREATE TABLE sales (user STRING, id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
    hive> LOAD DATA LOCAL INPATH 'input/hive/joins/sales.txt' INTO TABLE sales;
    hive> select * from sales;
    hive> CREATE TABLE things (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
    hive> LOAD DATA LOCAL INPATH 'input/hive/joins/things.txt' INTO TABLE things;
    hive> select * from things;
  • 45. Joins (screenshot; no transcript text)
  • 46. Joins - Inner Join: hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
  • 47. Joins - Left Outer Join: hive> SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
  • 48. Joins - Right Outer Join: hive> SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
  • 49. Joins - Full Outer Join: hive> SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
  • 50. Joins - Semi Joins
    Hive does not support IN subqueries, so the following does not work:
    hive> SELECT * FROM things WHERE things.id IN (SELECT id FROM sales);
    The solution is a left semi join:
    hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
  • 51. Joins - Map Joins
    Used when one of the tables is small enough to fit in memory; the join is done in the mappers and no reducers are used.
    hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
  • 52. Other Commands
    CREATE TABLE...AS SELECT: hive> CREATE TABLE target AS SELECT id FROM things;
    Altering tables:
    hive> ALTER TABLE target RENAME TO source;
    hive> ALTER TABLE source ADD COLUMNS (col2 STRING);
  • 53. Other Commands - Dropping Tables
    For managed tables, both the data and the metadata are deleted; for external tables, only the metadata is deleted.
    hive> DROP TABLE <table_name>;
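    A small sketch of that difference, reusing the tables created earlier in this deck (after the drops, the external table's data file should still be readable in HDFS, while the managed table's warehouse directory is gone):
    hive> DROP TABLE managed_table;
    hive> DROP TABLE external_table;
    $ hadoop fs -cat /user/joe/table/data.txt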
  • 54. References
    Hadoop: The Definitive Guide, 3rd Edition
    Hive community page: http://hive.apache.org/