Your SlideShare is downloading. ×
Hive Tutorials - Hive Data Types
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Hive Tutorials - Hive Data Types

1,009

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,009
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
31
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Apache Hive Tutorials Hive Data Types Prashant Kumar Pandey
  • 2. Before we get into details of creating tables, loading data and executing queries in hive to learn power of hive as a data warehouse, let’s learn some basics. We have already understood what is a database or schema in hive however I will cover some more details about it in this tutorial. Then we will understand hive data types. So let’s start. First concept related to hive is database. In hive, database is just a namespace of tables. They are mainly used to organize tables into logical groups. For hive, SCHEMA and DATABASE are synonymous, which means that you can use the keyword SCHEMA instead of DATABASE in all the database-related commands explained below. Most simple syntax for creating a database is CREATE DATABASE <database_name>; which we have already seen earlier in getting started with hive. When you are creating a script, it is valuable to avoid error which may come if database already exist. To accomplish this, hive supports following syntax. CREATE DATABASE IF NOT EXISTS <database_name>; To get list of databases that already exist you can use SHOW DATABASES; command. This command also support regular expression search using SHOW DATABASES LIKE 'pk*'; All databases are created under a top-level directory specified by the property hive.metastore.warehouse.dir. You can override this default using location clause in created database command as shown in below example. CREATE DATABASE pkdb LOCATION '/pkp/mydatabases'; You can use DESCRIBE DATABASE <database_name>; command to describe it. Describe command will show database directory location and a comment about database which you can create using COMMENT clause as shown in below example. CREATE DATABASE pkdb COMMENT 'My default database'; Comment is a nice way to provide in place documentation. Similarly you can also associate keyvalue properties with the database which will be displayed if you use DESCRIBE DATABASE EXTENDED <database_name>; command. Syntax for associating key-value pairs is shown in below example. CREATE DATABASE pkdb WITH DBPROPERTIES ('Created_By' = 'Prashant', 'Date' = '2013-0101'); USE <database_name>; command is used to make any database as current active database. Learn more with us at http://www.thriveschool.blogspot.in/
  • 3. To drop a database we can use DROP DATABASE IF EXISTS <database_name> CASCADE; command. Where IF EXISTS clause is optional which is used to suppress warning message if database does not exist. CASCADE is also optional. If CASCADE is present, it will drop all tables present in specified database. If CASCADE is not present, hive will not allow you to drop database unless it is empty. As of now, hive does not support altering database once created except DBPROPERTIES. You can change key-value pairs associated with you database using alter command as shown in below example. ALTER DATABASE pkdb SET DBPROPERTIES (' Date' = '2013-01-02'); Hive Data Types Unlike other popular relational databases, hive supports collection data types along with many primitive data types. Most of the popular relational databases take control on how storage is managed on disk. But hive leave this control on us which make it easier to process data generated using various tools. We will get into those details later and let’s focus on data types for now. Primitive Data Types Hive supports 4 integers, 2 floating points, Boolean and string data types as shown in below table. For readability, I used uppercase but its case insensitive. Since all of it is internally implemented in JAVA, you can expect similarities in their behavior. Data Type TINYINT SMALLINT INT BIGINT FLOAT DOUBLE BOOLEAN STRING TIMESTAMP BINARY Size 1 Byte 2 Byte 4 Byte 8 Byte Single precision float Double precision float Boolean true or false Sequence of characters Integer, float or string values. Example 10, -10 10, -10 10, -10 10, -10 10.8932 10.8932 TRUE ‘Sample string’ 129357385 8929245.879395 ‘2013-01-01 12:00:00.123456789’ Array of bytes Learn more with us at http://www.thriveschool.blogspot.in/
  • 4. TINYINT, SMALLINT, INT, BIGINT are four integer data types with only differences in their size. FLOAT and DOUBLE are two floating point data types. BOOLEAN is to store true or false. STRING is to store character strings. Note that, in hive, we do not specify length for STRING like in other databases. It’s more flexible and variable in length. TIMESTAMP can be an integer which is interpreted as seconds since UNIX epoch time. It may be a float where number after decimal is nanosecond. It may be string which is interpreted according to the JDBC date string format i.e. YYYY-MM-DD hh:mm:ss.fffffffff. Time component is interpreted as UTC time. Hive provides various functions to convert into specific time zones. BINARY is used to place raw bytes which will not be interpreted by hive. It is suitable for binary data. If you need to cast one type to another like a string to an integer, you can use cast (str as int) function in hive. Collection Data Types Hive supports columns that are STRUCT, MAP, and ARRAY unlike other relational databases. If you are familiar with C language, you might already know what STRUCT is. Hive STRUCT is analogous to STRUCT in C. MAP is a collection of key-value pairs. ARRAY is obviously an ordered sequence of same data type of variable which are accessed using zero based index. We understand that hive is very flexible in offering collection data types but a genuine concern is how data is physically stored in flat files. Hive gives that flexibility as well in our hands. Let’s take an example to understand it. Learn more with us at http://www.thriveschool.blogspot.in/
  • 5. I am assuming an employee table with below structure in hive. CREATE TABLE employees (name salary subordinates deductions address STRING, FLOAT, ARRAY<STRING>, MAP<STRING, FLOAT> STRUCT<stree:STRING, city:STRING, zip:INT> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘001’ COLLECTION ITEMS TERMINATED BY '002' MAP KEYS TERMINATED BY '003' LINES TERMINATED BY 'n' STORED AS TEXTFILE; Note that 001 is an octal code for ^A, 002 is ^B and 003 is ^C. for further explanation, we will use text instead of octal code. Let’s assume one record as shown in below table. So, we have an employee named as Prashant. Salary is 10000. He has two subordinates Purna and Sudarshan. From his monthly salary, there are two deductions i.e. 750 for PF and 1000 as income tax. His address is Steet= Marathalli, City is Bangalore and zip code is 560037. Let’s build this record in hive table format as described by hive table structure above. Fields are terminated by ^A. We have five fields Name, Salary, Subordinates, Deductions and Address. You might see it like below. Name^ASalary^ASubordinates^ADeductions^A Address Let’s replace it with actual values for first two fields because they are straight. Prashant^A10000^ASubordinates^ADeductions^A Address Learn more with us at http://www.thriveschool.blogspot.in/
  • 6. Now, we have two subordinates and we know that collection items will be delimited by ^B giving below record. Same will apply to address parts. Prashant^A10000^APurna Dash^BSudarshan Raju ^ADeductions^A Marathalli^BBangalore^B560037 Deduction is bit complex than others. You need to understand that deductions has two collection items (PF=750 and IT=1000) and each of those two collection items is a MAP. MAP will be delimited by ^C (PF^C750 and IT^C1000) then collection will be delimited by ^B giving below record. Prashant^A10000^APurna Dash^BSudarshan Raju ^A PF^C750^B IT^C1000^A Marathalli^BBangalore^B560037 There are other file formats but for now you can consider that’s how data in Hive table will be stored. It’s a series of records in the above format where each record is terminated by a new line as specified in LINES TERMINATED BY 'n' clause of create table. All delimiters used in above example are default delimiters for hive. If you do not specify them, default values will be assumed by hive. You are free to change these values for each table you have. That’s kind of flexibility and looseness makes hive powerful I utilizing data generated by various ETL tools. Learn more with us at http://www.thriveschool.blogspot.in/

×