Summer Training Presentation on Big Data: Hadoop
Sunday, 14 October 2018
Index:
• Introduction
• Index
• What is Big Data?
• The 5 Vs of Big Data
• Data Structures: Characteristics of Big Data
• Introduction to Hadoop
• Hadoop Ecosystem
• Hadoop Distributed File System (HDFS)
• MapReduce
• Apache Pig
• Modes of Pig
• What is Hive?
• Example
• What is Flume?
• Advantages of Flume
• What is Sqoop?
• What is HBase?
• Project: Banking Finance Data Analysis
  - Using Pig
  - Using Hive
• Conclusion
• Thank You!
What is Big Data?
Big data is a term used to refer to the study and application of data sets that
are so big and complex that traditional data-processing application software is
inadequate to deal with them.
Sources include social networks, cloud transactions, devices, government,
transportation, health & medical, finance, and sensor data.
The 5 Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value.
Data Structures: Characteristics of Big Data
• Structured – defined data types, format, structure
  - Transactional data, OLAP cubes, RDBMS, spreadsheets, etc.
• Semi-structured – no fixed schema; structure is implicit and irregular
  - Web pages, XML data, etc.
• Unstructured – data with no inherent schema
  - Text docs, PDFs, images, videos, etc.
Introduction to Hadoop:
History:
• Based on work done by Google in the early 2000s, notably "The Google File System" paper (2003)
• The core idea was to distribute the data as it is initially stored
• Each node can then perform computation on the data it stores, without moving the data for the initial processing
• In other words: move the computation closer to the data
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
The Apache Software Foundation released Hadoop 1.0 in 2011.
Hadoop itself is written mostly in Java.
Hadoop Ecosystem:
[Diagram: the Hadoop ecosystem stack]
• HDFS (Hadoop Distributed File System) - the storage layer at the bottom of the stack
• MapReduce framework - the processing layer on top of HDFS
• HBase - column-oriented database on top of HDFS
• Hive, Pig Latin, Mahout - high-level tools on top of the MapReduce framework
• Oozie - workflow scheduler spanning the stack
• ZooKeeper - coordination service alongside the stack
• Sqoop - imports/exports structured data (RDBMS, e.g. SQL) into and out of Hadoop
• Flume - ingests unstructured and semi-structured data
Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications. It employs a
NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to
data across highly scalable Hadoop clusters.
Commonly used commands:
• Copying data from local to HDFS:
  hadoop fs -copyFromLocal <local src> <HDFS destination>   (or: hadoop fs -put)
• Copying data from HDFS to local:
  hadoop fs -copyToLocal <HDFS source> <local destination>   (or: hadoop fs -get)
• To browse the HDFS server: open a web browser and go to localhost:50070 (the NameNode web UI)
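For example, a quick round trip (a minimal sketch; sample.txt and the /data directory are hypothetical names):

  hadoop fs -mkdir /data
  hadoop fs -copyFromLocal sample.txt /data
  hadoop fs -ls /data
  hadoop fs -copyToLocal /data/sample.txt ./sample_copy.txt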
MapReduce:
MapReduce is a programming model and an associated implementation for
processing and generating big data sets with a parallel, distributed algorithm on a
cluster. A MapReduce program is composed of a map procedure, which performs
filtering and sorting, and a reduce method, which performs a summary operation.
Word Count using MapReduce:
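Below is a minimal word-count job in Java, essentially the standard Hadoop tutorial example (a sketch; the class names and the input/output paths passed on the command line are illustrative):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged as a jar, it could be run with something like: hadoop jar wordcount.jar WordCount /input /output (both paths being hypothetical HDFS directories).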
Apache Pig:
[Diagram: simple Pig Latin operations (e.g. LOAD, FILTER) written in the high-level abstraction language are compiled into MapReduce jobs that run over HDFS.]
• Pig is an open-source technology that offers a high-level mechanism for parallel
programming of MapReduce jobs.
• Pig is a high-level platform for creating MapReduce programs.
• Pig is made of 2 components:
  - Pig Latin
  - the runtime environment
Pig can be run in two modes:
• On a Hadoop cluster (MapReduce or HDFS mode):
  pig   or   pig -x mapreduce
• On a local machine (local mode):
  pig -x local
Example:
grunt> A = LOAD 'data' USING PigStorage() AS (name:chararray, id:int);
grunt> B = GROUP A BY name;
grunt> C = FOREACH B GENERATE group, COUNT(A);
grunt> DUMP C;
(Note: Pig Latin is case sensitive here: the names (aliases) of relations A, B, and C,
the field names name and id, and the functions PigStorage and COUNT.)
In the LOAD statement, A is the alias, LOAD is the relational operator, name and id
are atoms/fields, and AS specifies the schema.
What is Hive?
• A system for managing and querying unstructured data as if it were structured
  - Uses MapReduce for execution
  - Uses HDFS for storage
• Key building principles:
  - SQL as a familiar data warehousing tool
  - Interoperability (an extensible framework to support different file and data formats)
  - Performance
Example:
hive> CREATE DATABASE IF NOT EXISTS user;
hive> USE user;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
    >     salary String, destination String)
    > COMMENT 'Employee details'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE employee;
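To verify the load, a quick query against the table above (a minimal example):

  hive> SELECT eid, name, salary FROM employee LIMIT 5;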
What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log files
and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various
web servers to HDFS.
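As an illustration, here is a minimal Flume agent configuration (a sketch; the agent/source/channel/sink names, the log path, and the HDFS path are all hypothetical):

  # agent1.conf: tail a web server log and deliver events to HDFS
  agent1.sources = src1
  agent1.channels = ch1
  agent1.sinks = sink1

  # Exec source: a simple way to stream a log file (not fully reliable)
  agent1.sources.src1.type = exec
  agent1.sources.src1.command = tail -F /var/log/apache2/access.log
  agent1.sources.src1.channels = ch1

  # Memory channel buffers events between source and sink
  agent1.channels.ch1.type = memory
  agent1.channels.ch1.capacity = 1000

  # HDFS sink writes events as plain text files
  agent1.sinks.sink1.type = hdfs
  agent1.sinks.sink1.hdfs.path = /flume/weblogs
  agent1.sinks.sink1.hdfs.fileType = DataStream
  agent1.sinks.sink1.channel = ch1

The agent would then be started with:
  flume-ng agent --conf conf --conf-file agent1.conf --name agent1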
Here are the advantages of using Flume:
• Using Apache Flume we can store the data into any of the centralized stores
(HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between the data
producers and the centralized stores and provides a steady flow of data
between them.
What is Sqoop?
Sqoop is used to import data from external data stores into the Hadoop Distributed
File System or related Hadoop ecosystem components such as Hive and HBase.
Similarly, Sqoop can also be used to extract data from Hadoop or its ecosystem and
export it to external data stores such as relational databases and enterprise data
warehouses.
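For example (a sketch; the MySQL database bankdb and the table names are hypothetical):

  # Import an RDBMS table into HDFS
  sqoop import --connect jdbc:mysql://localhost/bankdb --username root -P \
    --table customers --target-dir /user/hadoop/customers -m 1

  # Export results from HDFS back to an RDBMS table
  sqoop export --connect jdbc:mysql://localhost/bankdb --username root -P \
    --table results --export-dir /user/hadoop/results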
What is HBase?
• A column-oriented data store, known as the "Hadoop database"
• Distributed: designed to serve large tables
  - Billions of rows and millions of columns
• Supports random, real-time CRUD operations (unlike HDFS)
• Runs on a cluster of commodity hardware
  - Server hardware, not laptops/desktops
• Open source, written in Java, part of the Apache Hadoop ecosystem
• A type of "NoSQL" database
  - Does not provide SQL-based access
  - Does not adhere to the relational model for storage
Example: an HBase table with two column families, info and content.
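Such a table could be created and inspected from the HBase shell (a sketch; the table name web_table, the row key, and the cell values are hypothetical):

  create 'web_table', 'info', 'content'
  put 'web_table', 'row1', 'info:title', 'Home Page'
  put 'web_table', 'row1', 'content:html', '<html>...</html>'
  get 'web_table', 'row1'
  scan 'web_table'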
PROJECT: Banking Finance Data Analysis
Using Pig:
grunt> REGISTER '/usr/local/pig/lib/piggybank.jar';
grunt> A = load '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv'
       using org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER')
       as (customer_id:int, customer_name:chararray, loan_account_no:chararray,
           sanctioned_loan_amount:int, currency:chararray, disbused_loan_amount:int,
           loan_status:chararray, risk:int, location:chararray, reason:chararray);
grunt> dump A;
##1 Calculate overall average risk
grunt> B = group A all;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##2 Calculate average risk per location
grunt> B = group A by location;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;

##3 Calculate average risk per loan_status
grunt> B = group A by loan_status;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;

##4 Calculate average risk per location and loan_status
grunt> D = group A by (location, loan_status);
grunt> E = foreach D generate group, AVG(A.risk);
grunt> dump E;
Using Hive:
First, convert the .xlsx file to .csv format.
root@harshita-VirtualBox:/home/harshita# start-all.sh
root@harshita-VirtualBox:/home/harshita# hive
hive> use project;
hive> create table banking(customer_id int, customer_name string, loan_account_no string,
    >     sanctioned_loan_amount int, currency string, disbused_loan_amount int,
    >     loan_status string, risk int, location string, reason string)
    > row format delimited fields terminated by ','
    > tblproperties("skip.header.line.count"="1");
hive> load data local inpath '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv'
    > overwrite into table banking;
hive> select * from banking limit 10;
##1 Calculate overall average risk
hive> select AVG(risk) from banking;
##2 Calculate average risk per location
hive> select location, AVG(risk) as avgrisk from banking group by location;
##3 Calculate average risk per loan_status
hive> select loan_status, AVG(risk) as avgrisk from banking group by loan_status;
##4 Calculate average risk per location and loan_status
hive> select loan_status, location, AVG(risk) as avgrisk from banking group by loan_status, location;
Conclusion:
Through this summer training I have learnt what Big Data actually is, what its
sources are, and how this data can be handled using various Hadoop tools such as
Hive, Pig, MapReduce, Flume, and Sqoop.
Thank You…!!