2. Index:
Sunday, 14 October 2018
Slide no. Content
1. Introduction
2 - 3 Index
4. What is Big Data?
5. The 5 Vs of Big Data
6. Data Structures: Characteristics of Big Data
7. Introduction to Hadoop
8. Hadoop Ecosystem
9. Hadoop Distributed File System (HDFS)
10. MapReduce
11. Apache Pig
12. Modes of Pig
13. What is Hive?
14. Example
15. What is Flume?
16. Advantages of Flume
3. Index (continued):
Slide no. Content
17. What is Sqoop?
18. What is HBase?
19. Project – Banking Finance Data Analysis
20 – 21 Using Pig
22 – 24 Using Hive
25. Conclusion
26. Thank You!
4. What is Big Data?
Big data is a term used to refer to the study and applications of data sets that
are so large and complex that traditional data-processing application software is
inadequate to deal with them.
[Figure: sources of big data – social networks, cloud, transactions, devices, government, transportation, health & medical, finance, sensor data]
5. The 5 Vs of Big Data:
Volume, Velocity, Variety, Veracity, and Value.
6. Data Structures: Characteristics of Big Data
• Structured – defined data types, format, structure
Transactional data, OLAP cubes, RDBMS, spreadsheets, etc.
• Semi-structured – no fixed schema; structure is implicit and irregular
Web pages, XML data, etc.
• Unstructured – data with no inherent schema
Text docs, PDFs, images, videos, etc.
7. Introduction to Hadoop:
History:
• Based on work done by Google in the early 2000s
• “The Google File System” paper, published in 2003
• The core idea was to distribute the data as it is initially stored
• Each node can then perform computation on the data it stores without moving the data for the initial processing
• Moving the computation closer to the data
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is maintained by the Apache Software Foundation, which released Hadoop 1.0 in 2011.
The backend of Hadoop is written in Java.
8. Hadoop Ecosystem:
[Diagram: the Hadoop ecosystem – HDFS (Hadoop Distributed File System) at the base; the MapReduce framework and HBase above it; Hive, Pig Latin, and Mahout on top; Oozie (workflow) and ZooKeeper alongside the stack; Sqoop imports/exports structured data (RDBMS, e.g. SQL), while Flume handles unstructured and semi-structured data.]
9. Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications. It employs a
NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to
data across highly scalable Hadoop clusters.
Various commands used:
• Copying data from local to HDFS (-put is a shorthand alias)
hadoop fs -copyFromLocal <local src> <directory name>
• Copying data from HDFS to local (-get is a shorthand alias)
hadoop fs -copyToLocal <URI source> <local destination>
• To open the HDFS web interface
open web browser > type localhost:50070
10. MapReduce:
[Figure: word count using MapReduce]
MapReduce is a programming model and an associated implementation for
processing and generating big data sets with a parallel, distributed algorithm on a
cluster. A MapReduce program is composed of a map procedure, which performs
filtering and sorting, and a reduce method, which performs a summary operation.
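The word-count flow on this slide can be sketched as a single-process Python simulation (illustrative only – real Hadoop MapReduce jobs are written against the Java API, and all names here are made up):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: for every word in every document, emit a (word, 1) pair
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: bring identical keys together, then
    # reduce: sum the counts for each word
    counts = {}
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        counts[word] = sum(n for _, n in group)
    return counts

docs = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
print(reduce_phase(map_phase(docs)))
# {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In a real cluster the map and reduce calls run in parallel on many nodes, and the framework performs the shuffle/sort between them.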
11. Apache Pig:
[Diagram: a simple Pig Latin script (LOAD, FILTER) is compiled by the high-level abstraction language into an MR job that runs on MapReduce over HDFS]
Pig is an open-source technology that offers a high-level mechanism for parallel
programming of MapReduce jobs.
Pig is a high-level platform for creating MapReduce programs.
Pig is made of two components:
• Pig Latin
• Runtime environment
12. Pig can be run in two modes:
• Pig on a Hadoop cluster (MapReduce or HDFS mode):
pig or pig -x mapreduce
• Pig on a local machine:
pig -x local
Example:
grunt> A = LOAD 'data' USING PigStorage() AS (name:chararray, id:int);
grunt> B = GROUP A BY name;
grunt> C = FOREACH B GENERATE group, COUNT(A);
grunt> DUMP C;
(Note: Pig Latin is case-sensitive here – the aliases of relations A, B, and C,
the field names name and id, and the functions PigStorage and COUNT.)
(Callouts on the slide: A is an alias, LOAD is a relational operator, name is an atom/field, and AS specifies the schema.)
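For intuition, the GROUP ... BY / COUNT pipeline above computes the same thing as this plain-Python sketch (the sample records are invented for illustration):

```python
from collections import defaultdict

# Records matching the schema (name:chararray, id:int) – hypothetical data
A = [("alice", 1), ("bob", 2), ("alice", 3), ("bob", 4), ("carol", 5)]

# GROUP A BY name; -> one bag of tuples per distinct name
B = defaultdict(list)
for name, _id in A:
    B[name].append((name, _id))

# FOREACH B GENERATE group, COUNT(A); -> count the tuples in each bag
C = {group: len(bag) for group, bag in B.items()}
print(C)  # {'alice': 2, 'bob': 2, 'carol': 1}
```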
13. What is Hive?
A system for managing and querying data stored in Hadoop as if it were structured
• Uses MapReduce for execution
• Uses HDFS for storage
Key building principles:
• SQL as a familiar data warehousing tool
• Interoperability (extensible framework to support different file and data formats)
• Performance
14. Example:
hive> CREATE DATABASE IF NOT EXISTS user;
hive> USE user;
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
    > salary String, destination String)
    > COMMENT 'Employee details'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE employee;
15. What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for
collecting, aggregating, and transporting large amounts of streaming
data, such as log files and events, from various sources to a centralized
data store.
Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various
web servers to HDFS.
16. Here are the advantages of using Flume:
• Using Apache Flume we can store the data into any of the centralized stores
(HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data
producers and the centralized stores and provides a steady flow of data
between them.
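That mediator behaviour – a channel buffering fast producers from a slower store – can be sketched with a bounded queue (a toy single-machine analogy, not Flume's actual implementation; all names are made up):

```python
import queue
import threading

channel = queue.Queue(maxsize=100)   # plays the role of a Flume channel
store = []                           # stands in for HDFS/HBase

def source(events):
    # Producer: blocks whenever the channel is full,
    # so a fast producer cannot overrun the destination
    for event in events:
        channel.put(event)
    channel.put(None)                # sentinel: end of stream

def sink():
    # Consumer: drains the channel at its own pace into the store
    while (event := channel.get()) is not None:
        store.append(event)

writer = threading.Thread(target=sink)
writer.start()
source(f"log line {i}" for i in range(500))
writer.join()
print(len(store))  # 500
```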
17. What is Sqoop?
Sqoop is used to import data from external data stores into the Hadoop Distributed File
System or related Hadoop ecosystems like Hive and HBase. Similarly, Sqoop can
also be used to extract data from Hadoop or its ecosystems and export it to
external data stores such as relational databases and enterprise data warehouses.
18. What is HBase?
• Column-oriented data store, known as the “Hadoop Database”
• Distributed – designed to serve large tables
Billions of rows and millions of columns
• Supports random real-time CRUD operations (unlike HDFS)
• Runs on a cluster of commodity hardware
Server hardware, not laptops/desktops
• Open-source, written in Java, part of the Apache Hadoop ecosystem
• A type of “NoSQL” database
Does not provide SQL-based access
Does not adhere to the relational model for storage
[Figure: example HBase table with an info column family and a content column family]
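The column-family layout in the table above can be pictured as nested maps – row key → column family → qualifier → value. A Python sketch (the info and content families come from the slide; row keys, qualifiers, and values are invented for illustration):

```python
# HBase-style table: row key -> {column family -> {qualifier -> value}}
table = {
    "row1": {
        "info": {"name": "alice", "city": "delhi"},
        "content": {"html": "<p>hello</p>"},
    },
    "row2": {
        "info": {"name": "bob"},
        "content": {},
    },
}

# Random real-time CRUD, HBase-style:
table["row2"]["content"]["html"] = "<p>hi</p>"   # put (create/update)
name = table["row1"]["info"]["name"]             # get (read)
del table["row1"]["content"]["html"]             # delete
print(name)  # alice
```

Rows can have different qualifiers within a family (row2 has no city), which is what makes the model column-oriented rather than relational.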
20. Using Pig:
grunt> REGISTER '/usr/local/pig/lib/piggybank.jar' ;
grunt> A = load '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv' using
org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER')
as (customer_id:int, customer_name:chararray, loan_account_no:chararray, sanctioned_loan_amount:int,
currency:chararray, disbused_loan_amount:int, loan_status:chararray, risk:int, location:chararray,
reason:chararray) ;
grunt> dump A;
##1 Calculate overall average risks
grunt> B = group A all ;
grunt> C = foreach B generate group , AVG(A.risk);
grunt> dump C;
21. ##2 Calculate average risk per location
grunt> B = group A by location;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##3 Calculate average risk per loan_status
grunt> B = group A by loan_status;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##4 Calculate average risk per location and loan_status
grunt> D = group A by (location,loan_status) ;
grunt> E = foreach D generate group, AVG(A.risk);
grunt> dump E;
22. Using Hive:
Convert the .xlsx file to .csv format.
root@harshita-VirtualBox:/home/harshita# start-all.sh
root@harshita-VirtualBox:/home/harshita# hive
hive> use project;
hive> create table banking(customer_id int, customer_name string, loan_account_no string,
sanctioned_loan_amount int, currency string, disbused_loan_amount int, loan_status string,
risk int, location string, reason string) row format delimited fields terminated by ','
tblproperties("skip.header.line.count"="1") ;
hive> load data local inpath '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv' overwrite into table banking ;
hive> select * from banking limit 10;
23. ##1 Calculate overall average risks
hive> select AVG(risk) from banking;
##2 Calculate average risk per location
hive> select location,AVG(risk) as avgrisk from
banking group by location;
##3 Calculate average risk per loan_status
hive> select loan_status,AVG(risk) as avgrisk from banking group by
loan_status;
24. ##4 Calculate average risk per location and loan_status
hive> select loan_status , location ,AVG(risk) as avgrisk from banking
group by loan_status , location;
25. Conclusion:
Through this summer training I’ve learnt what Big Data actually is,
what its sources are, and how we can handle this data using various
Hadoop tools like Hive, Pig, MapReduce, Flume, and Sqoop.