Big Data Analytics Part2

BIG DATA ANALYTICS
Apache Hive: Introduction, Architecture

Unit IV
Understanding HIVE:
 Introducing Hive
 Hive services (Architecture)
 Builtin functions in Hive
 Hive DDL
 Data manipulation in Hive

Introduction to Apache HIVE
 Hive is an open-source data warehouse system built on top of
Hadoop, used for querying and analyzing large datasets
stored in Hadoop files.
 It was developed by Facebook.
 It runs SQL-like queries written in HQL (Hive Query Language), which are
internally converted to MapReduce jobs.
 It is used to analyze structured data.
 It is best suited for batch jobs.

Introduction to HIVE
 Hive: data warehousing application in Hadoop
Query language is HiveQL, a variant of SQL
Tables stored on HDFS as flat files
Developed by Facebook, now open source
 Pig: large-scale data processing system
Scripts are written in Pig Latin, a dataflow language
Developed by Yahoo!, now open source
 Common idea:
Provide a higher-level language to facilitate large-data processing
The higher-level language "compiles down" to Hadoop jobs
 Example Pig Latin script (sample_script.pig):
student = LOAD 'student_details.txt' USING PigStorage(',')
as (id:int, fname:chararray, lname:chararray, age:int, mob:chararray, city:chararray);
student_order = ORDER student BY age DESC;
student_limit = LIMIT student_order 4;
Dump student_limit;
 Running the script in MapReduce mode:
./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig

Applications of HIVE
 Data Mining
 Log Processing
 Document Indexing
 Customer Facing Business Intelligence
 Predictive Modelling
 Hypothesis Testing

HIVE Features
 Hive is fast and scalable.
 It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
 It is capable of analyzing large datasets stored in HDFS.
 It supports different storage types such as plain text, RCFile (Record Columnar
File), and HBase.
 It uses indexing to accelerate queries (index support was removed in Hive 3.0).
 It can operate on compressed data stored in the Hadoop ecosystem.
 It supports user-defined functions (UDFs), through which users can supply their own functionality.

HIVE Features
 A subset of SQL covering the most common statements
 Agile data types: Array, Map, Struct, and JSON objects
 Builtin functions and User Defined Functions and Aggregates
 Multiple users can query simultaneously
 MapReduce support; JDBC support; External table & ETL support
 Partitions and Buckets (for performance optimization)
 Views and Indexes.
 Hive supports Data Definition Language (DDL), Data
Manipulation Language (DML), and User Defined Functions (UDF).

HIVE
Architecture
Hive Web UI, Server and CLI – The Web UI and CLI provide a user
interface for an external user to interact with Hive. Hive Server allows
external clients to interact with Hive over a network, similar to the
JDBC or ODBC protocols.
Apache Thrift – A set of protocols which define how connections are
made between clients and servers.
JDBC/ODBC – API standard for Hive DBMS, enabling
JDBC/ODBC compliant applications to
interact with Hive through a standard interface.
Driver – Acts like a controller which
receives the HiveQL statements. The
driver starts the execution of a
statement by creating a session.
Metastore – Stores metadata for
each of the tables, such as their schema
and location.

HIVE Builtin functions
 Mathematical Functions
 Date functions
 Collection functions
 String functions

Mathematical Functions
round(DOUBLE a) Returns the rounded BIGINT value of a.
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
rand(), rand(INT seed) Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed makes the generated random number sequence deterministic.
exp(DOUBLE a) Returns e^a, where e is the base of the natural logarithm.
ln(DOUBLE a) Returns the natural logarithm of the argument a.
log10(DOUBLE a) Returns the base-10 logarithm of the argument a.
log2(DOUBLE a) Returns the base-2 logarithm of the argument a.
pow(DOUBLE a, DOUBLE p) Returns a^p.
sqrt(DOUBLE a) Returns the square root of a.

Collection Functions
size(Map<K,V>) Returns the number of elements in the map type.
size(Array<T>) Returns the number of elements in the array type.
map_keys(Map<K,V>) Returns an unordered array containing the keys of the input map.
map_values(Map<K,V>) Returns an unordered array containing the values of the input map.
sort_array(Array<T>) Sorts the input array in ascending order according to the natural ordering of the array elements.

Date Functions
unix_timestamp() Gets the current Unix timestamp in seconds.
unix_timestamp(string date) Converts a time string to a Unix timestamp (in seconds).
to_date(string timestamp) Returns the date part of a timestamp.
year(string date) Returns the year part of a date.
month(string date) Returns the month part of a date.
day(string date) Returns the day part of a date.
hour(string date) Returns the hour of the timestamp.
minute(string date) Returns the minute of the timestamp.
second(string date) Returns the second of the timestamp.
current_date Returns the current date at the start of query evaluation.
current_timestamp Returns the current timestamp at the start of query evaluation.
last_day(string date) Returns the last day of the month to which the date belongs.

String Functions
ascii(string str) Returns the numeric value of the first character of str.
character_length(string str) Returns the number of UTF-8 characters contained in str.
concat(string|binary A, string|binary B...) Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters, in order.
find_in_set(string str, string strList) Returns the position of the first occurrence of str in strList, where strList is a comma-delimited string.
length(string A) Returns the length of the string.
locate(string substr, string str[, int pos]) Returns the position of the first occurrence of substr in str after position pos.
lower(string A) Returns the string resulting from converting all characters to lower case.
ltrim(string A) Returns the string resulting from trimming spaces from the beginning (left-hand side) of A.

hive> select year('2020-12-23 10:20:30') from emp;
output: 2020
hive> select month('2020-12-23 10:20:30') from emp;
output: 12
UNIX_TIMESTAMP() // returns the current time as seconds since 1970-01-01 00:00:00 UTC
UNIX_TIMESTAMP('2000-01-01 00:00:00') // returns 946713600 when the default time zone is UTC-8 (the string is parsed in the default time zone)
TO_DATE('2020-12-23 10:20:30') returns '2020-12-23'
DAY('2020-12-23 10:30:30') returns 23
HOUR('2020-12-23 11:30:30') returns 11
MINUTE('2020-12-23 11:40:30') returns 40
SECOND('2020-12-23 11:20:50') returns 50
WEEKOFYEAR('2000-03-01 10:20:30') returns 9
DATEDIFF('2000-03-01', '2000-01-10') returns 51
DATE_ADD('2000-03-01', 5) returns '2000-03-06'
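
The date arithmetic quoted above can be cross-checked outside Hive. The following is a minimal Python sketch, not Hive code: it assumes that WEEKOFYEAR follows ISO-8601 week numbering and models the "default time zone" as a fixed UTC-8 offset, which is the offset that reproduces 946713600.

```python
from datetime import date, datetime, timedelta, timezone

# DATEDIFF('2000-03-01', '2000-01-10'): whole days between the two dates
datediff = (date(2000, 3, 1) - date(2000, 1, 10)).days

# DATE_ADD('2000-03-01', 5): move the date forward by five days
date_add = (date(2000, 3, 1) + timedelta(days=5)).isoformat()

# WEEKOFYEAR('2000-03-01 10:20:30'): ISO-8601 week number (assumption)
weekofyear = date(2000, 3, 1).isocalendar()[1]

# UNIX_TIMESTAMP('2000-01-01 00:00:00') under two possible default time zones
utc_minus_8 = timezone(timedelta(hours=-8))  # assumed fixed offset, no DST rules
ts_utc = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
ts_utc_minus_8 = int(datetime(2000, 1, 1, tzinfo=utc_minus_8).timestamp())

print(datediff, date_add, weekofyear, ts_utc, ts_utc_minus_8)
```

The same time string yields a different Unix timestamp in each time zone, so a UNIX_TIMESTAMP(string) result is only reproducible when the session time zone is known.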

 hive>select Id,Name, sqrt(Salary) from employee_data ;
 hive> select min(Salary) from employee_data;
 hive> select max(Salary) from employee_data;

Hive Builtin function examples
select concat("ABC","DEF"); // Returns ABCDEF
select concat_ws("|","1","2","3"); // Returns 1|2|3
select format_number(1234567,3); // Returns 1,234,567.000
select format_number(1234567,0); // Returns 1,234,567
select format_number(1234567.23456,3); // Returns 1,234,567.235
select locate("is","usa is a usa is a"); // Returns 5
select locate("is","usa is a usa is a",6); // Returns 14
select lower("UNITEDSTATES"); // Returns unitedstates
select ltrim(" UNITEDSTATES"); // Returns UNITEDSTATES

select reverse("ABCDEF"); // Returns FEDCBA
select rpad("UNITED",10,'0'); // Returns UNITED0000
select rpad("UNITED",10,' '); // Returns 'UNITED    ' (padded with four spaces)
select rpad("UNITEDSTATES",10,'0'); // Returns UNITEDSTAT
select rpad("UNITEDSTATES",10,null); // Returns NULL
select space(10); // Returns a string of 10 spaces
select split("USA IS A PLACE"," "); // Returns ["USA","IS","A","PLACE"]
select substr("USA IS A PLACE",5,2); // Returns IS
select substr("USA IS A PLACE",5,100); // Returns IS A PLACE
select upper("unitedstates"); // Returns UNITEDSTATES

select initcap("USA IS A PLACE"); // Returns Usa Is A Place
select concat('computer','science','engg'); // Returns computerscienceengg
select substr('This is hive demo',9,4); // Returns hive
select length('hadoop'); // Returns 6
select lpad('hadoop',8,'H'); // Returns HHhadoop
select rpad('hadoop',8,'p'); // Returns hadooppp
select trim(' Hadoop '); // Returns 'Hadoop'
select ltrim(' Hadoop '); // Returns 'Hadoop '
select rtrim(' Hadoop '); // Returns ' Hadoop'
select repeat('Hadoop',2); // Returns HadoopHadoop

select reverse('Hadoop'); // Returns poodaH
select split('hadoop~supports~split~function','~'); // Returns ["hadoop","supports","split","function"]
select max(Salary) from employee_data;
select min(Salary) from employee_data;
select Id, upper(Name) from employee_data;
select Id, lower(Name) from employee_data;
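
Hive string positions are 1-based, which is easy to get wrong when coming from 0-based languages. The sketch below emulates locate, substr, lpad and rpad in Python to make the indexing and truncation rules explicit. These helpers are illustrative stand-ins, not Hive's implementation, and they assume a positive start position and a non-empty pad string.

```python
def locate(substr_, s, pos=1):
    """1-based position of substr_ in s, searching from position pos; 0 if absent."""
    return s.find(substr_, pos - 1) + 1


def substr(s, start, length):
    """Substring of at most `length` characters beginning at 1-based `start`."""
    return s[start - 1:start - 1 + length]


def rpad(s, n, pad):
    """Pad s on the right to length n, or truncate it to n when it is longer."""
    return (s + pad * n)[:n]


def lpad(s, n, pad):
    """Pad s on the left to length n, or truncate it to n when it is longer."""
    if len(s) >= n:
        return s[:n]
    return (pad * n)[:n - len(s)] + s


print(locate("is", "usa is a usa is a"))     # first match
print(locate("is", "usa is a usa is a", 6))  # first match from position 6 on
print(substr("USA IS A PLACE", 5, 2))
print(rpad("UNITEDSTATES", 10, "0"))         # longer input is truncated
print(lpad("hadoop", 8, "H"))
```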

HIVE Builtin functions
 Hive provides various built-in functions to perform
mathematical and aggregate operations.
 Create a Hive table using the following command:
 create table employee_data (Id int, Name string, Salary
float) row format delimited fields terminated by ',';
 Load data into the table:
 load data local inpath '/home/code/hive/emp_details' into table employee_data;

 hive> select Id, Name, sqrt(Salary) from employee_data;

 hive> select max(Salary) from employee_data;

 select lcase("UNITEDSTATES"); // unitedstates

HIVE DDL Commands
 CREATE
 SHOW
 DESCRIBE
 USE
 DROP
 ALTER
 TRUNCATE

DDL Command   Use With
CREATE        Database, Table
SHOW          Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE      Database, Table, View
USE           Database
DROP          Database, Table
ALTER         Database, Table
TRUNCATE      Table (deletes all contents)

create table txnrecords (txnnno INT, txndate STRING, custno INT, amount DOUBLE,
category STRING, product STRING, city STRING, State STRING, spendby STRING)
row format delimited fields terminated by ',' stored as textfile;
drop table txnrecords;
ALTER TABLE employee RENAME TO employee2;

hive> create database if not exists financials;
hive> create table records (year string, temperature int, quantity int)
> row format delimited
> fields terminated by '\t';
hive> create table employees (
> name string,
> salary float,
> subordinates array<string>,
> deductions map<string, float>,
> address struct<street:string, city:string, state:string, zip:int>);
hive> create database financials2
> with dbproperties('creator' = 'Sreedhar', 'date' = '2020-12-19');
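
When no ROW FORMAT clause is given, as in the employees table, Hive text files use control characters as delimiters: Ctrl-A (\001) between fields, Ctrl-B (\002) between collection items and struct members, and Ctrl-C (\003) between a map key and its value. The Python sketch below shows how one such line maps onto the array, map and struct columns. The parser and the sample row are illustrative assumptions, not part of Hive.

```python
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"  # Ctrl-A, Ctrl-B, Ctrl-C


def parse_employee(line):
    """Split one text line into the five columns of the employees table."""
    name, salary, subs, deds, addr = line.split(FIELD)
    street, city, state, zip_code = addr.split(ITEM)
    return {
        "name": name,
        "salary": float(salary),
        # array<string>: items separated by Ctrl-B
        "subordinates": subs.split(ITEM) if subs else [],
        # map<string, float>: entries separated by Ctrl-B, key/value by Ctrl-C
        "deductions": {
            k: float(v)
            for k, v in (entry.split(KEY) for entry in deds.split(ITEM) if entry)
        },
        # struct<street, city, state, zip>: members separated by Ctrl-B
        "address": {"street": street, "city": city, "state": state, "zip": int(zip_code)},
    }


# Hypothetical sample row, made up for illustration
sample = FIELD.join([
    "John Doe",
    "100000.0",
    ITEM.join(["Mary Smith", "Todd Jones"]),
    ITEM.join(["Federal Taxes" + KEY + "0.2", "State Taxes" + KEY + "0.05"]),
    ITEM.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
])

employee = parse_employee(sample)
print(employee["subordinates"][0], employee["deductions"]["State Taxes"], employee["address"]["city"])
```

In HiveQL the same three values would be read as subordinates[0], deductions['State Taxes'] and address.city.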

HiveQL Data Manipulation
 Load
The LOAD statement in Hive is used to move
data files into the locations corresponding
to Hive tables.
LOAD DATA [LOCAL] INPATH 'hdfsfilepath/localfilepath'
[OVERWRITE] INTO TABLE existing_table_name;

Select
 The SELECT statement in Hive is similar to the SELECT
statement in SQL and is used for retrieving data from the
database.
 SELECT col1, col2 FROM tablename;

INSERT Command
 The INSERT command in Hive loads data into a Hive
table.
 INSERT INTO TABLE tablename1 [PARTITION
(partcol1=val1, partcol2=val2 ...)] select_statement1
FROM from_statement;

DELETE command
 The DELETE statement in Hive deletes table data. If
a WHERE clause is specified, it deletes only the
rows that satisfy the condition in the WHERE clause.
 DELETE is supported only on tables that support ACID transactions
(available from Hive 0.14).
 DELETE FROM tablename [WHERE expression];
 DELETE FROM student WHERE roll_no=104;

HiveQL Data Manipulation
 Load,
 Insert,
 Export Data and
 Create Table

CREATE TABLE
 hive> CREATE TABLE Employees AS SELECT
eno, ename, sal, address FROM emp WHERE
country='IN';

Load
 hive> LOAD DATA LOCAL INPATH
'/home/hduser/sampledata/users.txt'
INTO TABLE existing_table_name;
 'LOCAL' indicates that the source data is on the local file system.
 Local data will be copied into the final destination (HDFS file system)
by Hive.
 If 'LOCAL' is not specified, the file is assumed to be on HDFS.
 Hive does not do any data transformation while loading
the data.

INSERT
 hive> INSERT OVERWRITE TABLE Employee
PARTITION (country='IN', state='KA') SELECT * FROM
emp_stage ese WHERE ese.country='IN' AND
ese.state='KA';
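
A partitioned insert writes its rows under a directory named after the partition values, using one key=value path segment per partition column. The Python sketch below models that layout to show why a query filtered on country and state only needs to read one directory. The table root, the helper name and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def partition_dir(table, **partition_cols):
    """Build the key=value directory path for one partition of a table."""
    segments = "/".join(f"{col}={val}" for col, val in partition_cols.items())
    return f"{table}/{segments}"


# Hypothetical staged rows, made up for illustration
emp_stage = [
    {"ename": "Asha", "country": "IN", "state": "KA"},
    {"ename": "Ravi", "country": "IN", "state": "TN"},
    {"ename": "Meena", "country": "IN", "state": "KA"},
    {"ename": "John", "country": "US", "state": "CA"},
]

# Route every row to the directory of its partition
layout = defaultdict(list)
for row in emp_stage:
    path = partition_dir("employee", country=row["country"], state=row["state"])
    layout[path].append(row["ename"])

# A filter on both partition columns touches a single directory
target = partition_dir("employee", country="IN", state="KA")
print(target, layout[target])
```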

Exporting Data out of Hive
 hive> INSERT OVERWRITE LOCAL
DIRECTORY '/home/hadoop/data' SELECT name,
age FROM aliens WHERE date_sighted > '2014-09-15';

Unit V
NoSQL Data Management:
 Introduction to NoSQL
 Characteristics of NoSQL
 Types of NoSQL data models
 Schema-less databases

 NoSQL stands for "Not Only SQL" or
"Not SQL."
 A NoSQL database is a non-relational data
management system that does not require a fixed
schema.
 NoSQL is used for big data and real-time web
apps.

Features of NoSQL
 Non-relational
 NoSQL databases do not follow the relational model
 Do not provide tables with flat fixed-column records
 Work with self-contained aggregates or BLOBs
 Do not require object-relational mapping or data normalization
 Omit complex features such as query languages, query planners, referential integrity, joins, and ACID
 Schema-free
 NoSQL databases are either schema-free or have relaxed schemas
 Do not require any sort of definition of the schema of the data
 Offer heterogeneous structures of data in the same domain

Advantages of NoSQL
 Can be used as a primary or analytic data source
 Big data capability
 No single point of failure
 Easy replication
 No need for a separate caching layer
 Supports key developer languages and
platforms
 Simpler to implement than an RDBMS
 It can serve as the primary data source for
online applications.
 It provides fast performance and horizontal
scalability.
 Can handle structured, semi-structured, and
unstructured data equally well
 Supports object-oriented programming, which is easy to
use and flexible
 NoSQL databases don't need a dedicated
high-performance server
 Handles big data, managing data velocity,
variety, volume, and complexity
 Excels at distributed database and multi-data-center
operations
 Eliminates the need for a specific caching layer
to store data
 Offers a flexible schema design which can
easily be altered without downtime or service
disruption

Types of NoSQL Databases
 Key-value pair based
 Column-oriented
 Graph-based
 Document-oriented

Key Value Pair Based
 Data is stored in key/value pairs. It is designed
to handle large amounts of data and heavy load.
 Key-value pair storage databases store data as a
hash table where each key is unique, and the value
can be JSON, a BLOB (Binary Large Object), a string,
etc.
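
The key-value model can be illustrated with a few lines of Python, since a Python dict is itself a hash table with unique keys. This is a toy in-memory sketch, not the API of any particular NoSQL product; the keys and values are made up.

```python
import json

store = {}  # hash table: each key maps to exactly one opaque value


def put(key, value):
    """Insert or overwrite the value stored under key."""
    store[key] = value


def get(key, default=None):
    """Look a value up by its key; the store does not interpret the value."""
    return store.get(key, default)


put("user:101", json.dumps({"name": "Asha", "city": "Pune"}))    # JSON value
put("avatar:101", b"\x89PNG")                                    # binary value (BLOB)
put("user:101", json.dumps({"name": "Asha", "city": "Mumbai"}))  # same key: overwrite

print(len(store), json.loads(get("user:101"))["city"])
```

Lookups are possible only by key. Finding every user in a given city would require scanning and decoding all values, which is the trade-off this model makes for simplicity and speed.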

Column-based
 Column-oriented databases work on columns and
are based on the BigTable paper by Google. Every
column is treated separately. Values of a single
column are stored contiguously.
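
Storing the values of a column contiguously means an aggregate over one column never has to touch the others. The Python sketch below contrasts a row layout with a column layout for the same made-up records. It illustrates the storage idea only, not how any specific column store is implemented.

```python
# Row layout: each record keeps all of its fields together
rows = [
    (1, "Asha", 52000.0),
    (2, "Ravi", 48000.0),
    (3, "Meena", 61000.0),
]

# Column layout: each column's values are stored together
ids, names, salaries = (list(col) for col in zip(*rows))
columns = {"id": ids, "name": names, "salary": salaries}

# An average over salary reads only the salary column
avg_salary = sum(columns["salary"]) / len(columns["salary"])
print(columns["salary"], round(avg_salary, 2))
```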

Document-Oriented:
 A document-oriented NoSQL DB stores and retrieves
data as key-value pairs, but the value part is
stored as a document. The document is stored in
JSON or XML format. The value is understood by
the DB and can be queried.
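
The difference from a plain key-value store is that the database can look inside the value. A minimal Python sketch of that idea follows, with documents modelled as dicts and a hypothetical find helper. The data is made up, and real document databases offer much richer query languages.

```python
# Documents keyed by id; documents in the same collection may differ in shape
students = {
    "s1": {"name": "Asha", "city": "Pune", "courses": ["Hive", "Pig"]},
    "s2": {"name": "Ravi", "city": "Delhi"},
    "s3": {"name": "Meena", "city": "Pune", "mobile": "0000000000"},
}


def find(collection, field, value):
    """Return the ids of all documents whose field equals value."""
    return sorted(
        doc_id for doc_id, doc in collection.items() if doc.get(field) == value
    )


print(find(students, "city", "Pune"))
```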

Graph-Based
 A graph database stores entities as well as the
relations among those entities. An entity is stored
as a node, and a relationship is stored as an edge. An edge
gives a relationship between nodes. Every node and
edge has a unique identifier.
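
A small Python sketch of this model: nodes and edges each carry a unique identifier, and a query follows edges of a given relationship type. The node data, edge ids and the neighbours helper are assumptions for illustration and do not correspond to any particular graph database.

```python
# Nodes: unique id -> properties
nodes = {
    1: {"label": "Person", "name": "Asha"},
    2: {"label": "Person", "name": "Ravi"},
    3: {"label": "Company", "name": "Acme"},
}

# Edges: unique id -> (source node, relationship, target node)
edges = {
    10: (1, "FRIEND_OF", 2),
    11: (1, "WORKS_AT", 3),
    12: (2, "WORKS_AT", 3),
}


def neighbours(node_id, relationship):
    """Ids of the nodes reached from node_id over edges of the given type."""
    return sorted(
        dst for src, rel, dst in edges.values()
        if src == node_id and rel == relationship
    )


print([nodes[n]["name"] for n in neighbours(1, "WORKS_AT")])
```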

Tools for NoSQL
 Wide column: Accumulo, Cassandra, Scylla, HBase.
 Document: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase,
Cosmos DB, eXist-db, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx,
RethinkDB
 Key–value: Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase,
Dynamo, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL
Database, OrientDB, Redis, Riak, SciDB, SDBM/Flat File dbm, ZooKeeper
 Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic,
Neo4J, OrientDB, Virtuoso
