Summer Training Presentation on Big Data: Hadoop
Sunday, 14 October 2018
Index:
• Introduction
• Index
• What is Big Data?
• The 5 Vs of Big Data
• Data Structures: Characteristics of Big Data
• Introduction to Hadoop
• Hadoop Ecosystem
• Hadoop Distributed File System (HDFS)
• MapReduce
• Apache Pig
• Modes of Pig
• What is Hive?
• Example
• What is Flume?
• Advantages of Flume
• What is Sqoop?
• What is HBase?
• Project: Banking Finance Data Analysis
  - Using Pig
  - Using Hive
• Conclusion
• Thank You!
What is Big Data?
Big data is a term used to refer to the study and application of data sets that
are so big and complex that traditional data-processing application software is
inadequate to deal with them.
Sources include social networks, cloud transactions, devices, government,
transportation, health & medical, finance, and sensor data.
The 5 Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value.
Data Structures: Characteristics of Big Data
• Structured – defined data types, format, structure
  - Transactional data, OLAP cubes, RDBMS, spreadsheets, etc.
• Semi-structured – no fixed schema; structure is implicit and irregular
  - Web pages, XML data, etc.
• Unstructured – data with no inherent schema
  - Text docs, PDFs, images, videos, etc.
Introduction to Hadoop:
History:
• Based on work done by Google in the early 2000s, notably "The Google File System" paper (2003)
• The core idea was to distribute the data as it is initially stored
• Each node can then perform computation on the data it stores, without moving the data for the initial processing
• In other words: move the computation closer to the data
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
The Apache Software Foundation released Hadoop 1.0 in 2011.
Hadoop itself is written mostly in Java.
Hadoop Ecosystem:
[Diagram: the Hadoop ecosystem stack]
• HDFS (Hadoop Distributed File System) - the storage layer at the bottom of the stack
• MapReduce framework - the processing layer on top of HDFS
• HBase - column-oriented database on top of HDFS
• Hive, Pig Latin, Mahout - high-level tools on top of the MapReduce framework
• Oozie - workflow scheduler spanning the stack
• ZooKeeper - coordination service alongside the stack
• Sqoop - imports/exports structured data (RDBMS, e.g. SQL) into and out of Hadoop
• Flume - ingests unstructured and semi-structured data
Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications. It employs a
NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to
data across highly scalable Hadoop clusters.
Commonly used commands:
• Copying data from local to HDFS:
  hadoop fs -copyFromLocal <local src> <HDFS destination>   (or: hadoop fs -put)
• Copying data from HDFS to local:
  hadoop fs -copyToLocal <HDFS source> <local destination>   (or: hadoop fs -get)
• To browse the HDFS server: open a web browser and go to localhost:50070 (the NameNode web UI)
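For example, a quick round trip (a minimal sketch; sample.txt and the /data directory are hypothetical names):

  hadoop fs -mkdir /data
  hadoop fs -copyFromLocal sample.txt /data
  hadoop fs -ls /data
  hadoop fs -copyToLocal /data/sample.txt ./sample_copy.txt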
MapReduce:
MapReduce is a programming model and an associated implementation for
processing and generating big data sets with a parallel, distributed algorithm on a
cluster. A MapReduce program is composed of a map procedure, which performs
filtering and sorting, and a reduce method, which performs a summary operation.
Word Count using MapReduce:
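Below is a minimal word-count job in Java, essentially the standard Hadoop tutorial example (a sketch; the class names and the input/output paths passed on the command line are illustrative):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged as a jar, it could be run with something like: hadoop jar wordcount.jar WordCount /input /output (both paths being hypothetical HDFS directories).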
Apache Pig:
[Diagram: simple Pig Latin operations (e.g. LOAD, FILTER) written in the high-level abstraction language are compiled into MapReduce jobs that run over HDFS.]
• Pig is an open-source technology that offers a high-level mechanism for parallel
programming of MapReduce jobs.
• Pig is a high-level platform for creating MapReduce programs.
• Pig is made of 2 components:
  - Pig Latin
  - the runtime environment
Pig can be run in two modes:
• On a Hadoop cluster (MapReduce or HDFS mode):
  pig   or   pig -x mapreduce
• On a local machine (local mode):
  pig -x local
Example:
grunt> A = LOAD 'data' USING PigStorage() AS (name:chararray, id:int);
grunt> B = GROUP A BY name;
grunt> C = FOREACH B GENERATE group, COUNT(A);
grunt> DUMP C;
(Note: Pig Latin is case sensitive here: the names (aliases) of relations A, B, and C,
the field names name and id, and the functions PigStorage and COUNT.)
In the LOAD statement, A is the alias, LOAD is the relational operator, name and id
are atoms/fields, and AS specifies the schema.
What is Hive?
• A system for managing and querying unstructured data as if it were structured
  - Uses MapReduce for execution
  - Uses HDFS for storage
• Key building principles:
  - SQL as a familiar data warehousing tool
  - Interoperability (an extensible framework to support different file and data formats)
  - Performance
Example:
hive> CREATE DATABASE IF NOT EXISTS user;
hive> USE user;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
    >     salary String, destination String)
    > COMMENT 'Employee details'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE employee;
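To verify the load, a quick query against the table above (a minimal example):

  hive> SELECT eid, name, salary FROM employee LIMIT 5;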
What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log files
and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various
web servers to HDFS.
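As an illustration, here is a minimal Flume agent configuration (a sketch; the agent/source/channel/sink names, the log path, and the HDFS path are all hypothetical):

  # agent1.conf: tail a web server log and deliver events to HDFS
  agent1.sources = src1
  agent1.channels = ch1
  agent1.sinks = sink1

  # Exec source: a simple way to stream a log file (not fully reliable)
  agent1.sources.src1.type = exec
  agent1.sources.src1.command = tail -F /var/log/apache2/access.log
  agent1.sources.src1.channels = ch1

  # Memory channel buffers events between source and sink
  agent1.channels.ch1.type = memory
  agent1.channels.ch1.capacity = 1000

  # HDFS sink writes events as plain text files
  agent1.sinks.sink1.type = hdfs
  agent1.sinks.sink1.hdfs.path = /flume/weblogs
  agent1.sinks.sink1.hdfs.fileType = DataStream
  agent1.sinks.sink1.channel = ch1

The agent would then be started with:
  flume-ng agent --conf conf --conf-file agent1.conf --name agent1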
Here are the advantages of using Flume:
• Using Apache Flume we can store the data into any of the centralized stores
(HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between the data
producers and the centralized stores and provides a steady flow of data
between them.
What is Sqoop?
Sqoop is used to import data from external data stores into the Hadoop Distributed
File System or related Hadoop ecosystem components such as Hive and HBase.
Similarly, Sqoop can also be used to extract data from Hadoop or its ecosystem and
export it to external data stores such as relational databases and enterprise data
warehouses.
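For example (a sketch; the MySQL database bankdb and the table names are hypothetical):

  # Import an RDBMS table into HDFS
  sqoop import --connect jdbc:mysql://localhost/bankdb --username root -P \
    --table customers --target-dir /user/hadoop/customers -m 1

  # Export results from HDFS back to an RDBMS table
  sqoop export --connect jdbc:mysql://localhost/bankdb --username root -P \
    --table results --export-dir /user/hadoop/results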
What is HBase?
• A column-oriented data store, known as the "Hadoop database"
• Distributed: designed to serve large tables
  - Billions of rows and millions of columns
• Supports random, real-time CRUD operations (unlike HDFS)
• Runs on a cluster of commodity hardware
  - Server hardware, not laptops/desktops
• Open source, written in Java, part of the Apache Hadoop ecosystem
• A type of "NoSQL" database
  - Does not provide SQL-based access
  - Does not adhere to the relational model for storage
Example: an HBase table with two column families, info and content.
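Such a table could be created and inspected from the HBase shell (a sketch; the table name web_table, the row key, and the cell values are hypothetical):

  create 'web_table', 'info', 'content'
  put 'web_table', 'row1', 'info:title', 'Home Page'
  put 'web_table', 'row1', 'content:html', '<html>...</html>'
  get 'web_table', 'row1'
  scan 'web_table'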
PROJECT: Banking Finance Data Analysis
Using Pig:
grunt> REGISTER '/usr/local/pig/lib/piggybank.jar';
grunt> A = load '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv'
       using org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER')
       as (customer_id:int, customer_name:chararray, loan_account_no:chararray,
           sanctioned_loan_amount:int, currency:chararray, disbused_loan_amount:int,
           loan_status:chararray, risk:int, location:chararray, reason:chararray);
grunt> dump A;
##1 Calculate overall average risk
grunt> B = group A all;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##2 Calculate average risk per location
grunt> B = group A by location;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;

##3 Calculate average risk per loan_status
grunt> B = group A by loan_status;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;

##4 Calculate average risk per location and loan_status
grunt> D = group A by (location, loan_status);
grunt> E = foreach D generate group, AVG(A.risk);
grunt> dump E;
Using Hive:
First, convert the .xlsx file to .csv format.
root@harshita-VirtualBox:/home/harshita# start-all.sh
root@harshita-VirtualBox:/home/harshita# hive
hive> use project;
hive> create table banking(customer_id int, customer_name string, loan_account_no string,
    >     sanctioned_loan_amount int, currency string, disbused_loan_amount int,
    >     loan_status string, risk int, location string, reason string)
    > row format delimited fields terminated by ','
    > tblproperties("skip.header.line.count"="1");
hive> load data local inpath '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv'
    > overwrite into table banking;
hive> select * from banking limit 10;
##1 Calculate overall average risk
hive> select AVG(risk) from banking;
##2 Calculate average risk per location
hive> select location, AVG(risk) as avgrisk from banking group by location;
##3 Calculate average risk per loan_status
hive> select loan_status, AVG(risk) as avgrisk from banking group by loan_status;
##4 Calculate average risk per location and loan_status
hive> select loan_status, location, AVG(risk) as avgrisk from banking group by loan_status, location;
Conclusion:
Through this summer training I have learnt what Big Data actually is, what its
sources are, and how this data can be handled using various Hadoop tools such as
Hive, Pig, MapReduce, Flume, and Sqoop.
Thank You…!!