Summer Training Presentation on Big Data: Hadoop
Sunday, 14 October 2018
Index:
• Introduction
• What is Big Data?
• The 5 Vs of Big Data
• Data Structures: Characteristics of Big Data
• Introduction to Hadoop
• Hadoop Ecosystem
• Hadoop Distributed File System (HDFS)
• MapReduce
• Apache Pig
• Modes of Pig
• What is Hive?
• Hive Example
• What is Flume?
• Advantages of Flume
• What is Sqoop?
• What is HBase?
• Project: Banking Finance Data Analysis
• Using Pig
• Using Hive
• Conclusion
What is Big Data?
Big data refers to data sets so large and complex that traditional data-processing software cannot handle them adequately, and to the study and application of such data sets.
Sources of big data:
• Social networks
• Cloud
• Transactions
• Devices
• Government
• Transportation
• Health and medical
• Finance
• Sensor data
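The 5 Vs of Big Data:
[Figure not reproduced. The five Vs are commonly given as Volume, Velocity, Variety, Veracity, and Value.]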
Data Structures: Characteristics of Big Data
• Structured: defined data types, format, and structure
   - Transactional data, OLAP cubes, RDBMS, spreadsheets, etc.
• Semi-structured: no fixed schema; structure is implicit and irregular
   - Web pages, XML data, etc.
• Unstructured: data with no inherent schema
   - Text documents, PDFs, images, videos, etc.
Introduction to Hadoop:
History:
• Based on work done by Google in the early 2000s
• "The Google File System" paper, published in 2003
• The core idea is to distribute the data across nodes as it is initially stored
• Each node can then perform computation on the data it stores, without moving the data for the initial processing
• In short: move the computation closer to the data
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Hadoop is developed by the Apache Software Foundation. It was first released in 2006, and version 1.0 followed in December 2011.
Hadoop is written mainly in Java.
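Hadoop Ecosystem:
[Diagram not reproduced; its components, from top to bottom:]
• Oozie (workflow)
• Hive, Pig Latin, Mahout
• MapReduce framework
• HBase
• HDFS (Hadoop Distributed File System)
• ZooKeeper, shown alongside all of the layers above
• Sqoop: imports or exports structured data (RDBMS, e.g. SQL databases)
• Flume: brings in unstructured and semi-structured data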
Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
Common commands:
• Copying data from the local file system to HDFS (-put is a near-equivalent)
hadoop fs -copyFromLocal <local src> <HDFS destination directory>
• Copying data from HDFS to the local file system (-get is a near-equivalent)
hadoop fs -copyToLocal <URI source> <local destination>
• Opening the HDFS (NameNode) web interface
open a web browser and go to localhost:50070 (the default port in Hadoop 2.x; Hadoop 3.x uses 9870)
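To make the NameNode and DataNode split more concrete, here is a minimal Python sketch of the bookkeeping involved: a file is cut into fixed-size blocks, and a table records which DataNodes hold a replica of each block. The function names, the round-robin placement, and the node names are assumptions made for illustration; real HDFS uses its own rack-aware placement policy, and block size and replication factor are configurable.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size in bytes of each block a file is split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Toy NameNode metadata: block index -> DataNodes holding a replica.

    Assumption: simple round-robin placement, NOT the real HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    print(blocks)
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3))
```

In this toy model the dictionary plays the role of the NameNode's metadata, while the strings stand in for DataNodes that would store the actual block contents.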
MapReduce:
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce procedure, which performs a summary operation.
Word Count using MapReduce: [figure not reproduced]
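The word-count idea can be sketched in plain Python to show the map, shuffle, and reduce steps in order. This is a single-process illustration only, not Hadoop's Java API; the function names and the sample sentences are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: summarise the values of one key (here, by summing them)."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce one after another in a single process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle between nodes before the reducers run.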
Apache Pig:
[Diagram not reproduced: a simple script (load, filter, ...) written in a high-level abstraction language is turned into MapReduce jobs that run over HDFS.]
• Pig is an open-source, high-level platform for creating MapReduce programs; it offers a high-level mechanism for the parallel programming of MapReduce jobs.
• Pig is made of 2 components:
   - Pig Latin (the language)
   - The runtime environment
Modes of Pig:
Pig can be run in two modes:
• On a Hadoop cluster (MapReduce mode, reading from HDFS):
pig or pig -x mapreduce
• On a local machine (local mode):
pig -x local
Example:
grunt> A = LOAD 'data' USING PigStorage() AS (name:chararray, id:int);
grunt> B = GROUP A BY name;
grunt> C = FOREACH B GENERATE group, COUNT(A);
grunt> DUMP C;
Case sensitivity: the names (aliases) of relations A, B, and C, the field names name and id, and the function names PigStorage and COUNT are case sensitive. Keywords such as LOAD, GROUP, and DUMP are not.
In the example, A, B, and C are aliases; LOAD, GROUP, and FOREACH are relational operators; name and id are fields (atoms); and the AS clause specifies the schema.
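The Pig example loads (name, id) records, groups them by name, and counts the records in each group. As a rough illustration of that data flow, here is the same logic in Python rather than Pig; the helper names and the sample rows are assumptions, and the tab delimiter mirrors PigStorage's default.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def load(lines: Iterable[str], delimiter: str = "\t") -> List[Tuple[str, int]]:
    """Parse delimited text lines into (name, id) records, like LOAD ... AS."""
    records = []
    for line in lines:
        name, raw_id = line.rstrip("\n").split(delimiter)
        records.append((name, int(raw_id)))
    return records


def group_and_count(records: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Group records by name and count each group, like GROUP BY plus COUNT."""
    counts = Counter(name for name, _ in records)
    return sorted(counts.items())


if __name__ == "__main__":
    sample = ["alice\t1", "bob\t2", "alice\t3"]  # hypothetical input rows
    print(group_and_count(load(sample)))
```

Pig compiles the equivalent script into MapReduce jobs, so the grouping step corresponds to the shuffle and the count to the reduce.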
What is Hive?
• A system for managing and querying unstructured data as if it were structured
• Uses MapReduce for execution
• Uses HDFS for storage
• Key building principles:
   - SQL as a familiar data warehousing tool
   - Interoperability (an extensible framework to support different file and data formats)
   - Performance
Example:
hive> CREATE DATABASE IF NOT EXISTS user;
hive> USE user;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
> salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
hive> LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE employee;
(In the LOAD DATA statement, square brackets mark optional keywords and 'filepath' stands for the path of the input file.)
What is Flume?
Apache Flume is a data ingestion tool and service for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Advantages of using Flume:
• Using Apache Flume, we can store data in any of the centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores and provides a steady flow of data between them.
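The mediator role described above can be pictured as a bounded buffer sitting between a bursty producer and a slower destination. The Python sketch below simulates that idea tick by tick. It is a toy model under stated assumptions: the function name, the tick-based rates, and the choice to drop events when the buffer is full are all illustrative, and real Flume behaviour depends on the configured source, channel, and sink.

```python
from collections import deque
from typing import List, Tuple


def simulate_channel(arrivals: List[int], sink_rate: int,
                     capacity: int) -> Tuple[List[int], int]:
    """Simulate a bounded buffer between producers and a slower sink.

    arrivals  -- number of events produced in each tick
    sink_rate -- maximum events the sink can write per tick
    capacity  -- maximum events the buffer can hold
    Returns (events delivered per tick, events dropped because the buffer was full).
    """
    channel = deque()
    delivered = []
    dropped = 0
    next_id = 0

    def drain_once() -> int:
        """Let the sink take up to sink_rate events from the buffer."""
        taken = 0
        while channel and taken < sink_rate:
            channel.popleft()
            taken += 1
        return taken

    for produced in arrivals:
        for _ in range(produced):
            if len(channel) < capacity:
                channel.append(next_id)
            else:
                dropped += 1  # assumption: this toy model drops on overflow
            next_id += 1
        delivered.append(drain_once())

    # After production stops, the sink keeps draining what is still buffered.
    while channel:
        delivered.append(drain_once())
    return delivered, dropped


if __name__ == "__main__":
    print(simulate_channel([5, 0, 0], sink_rate=2, capacity=10))
```

A burst of five events is smoothed into a steady two-per-tick flow at the destination, which is the effect the slide attributes to Flume.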
What is Sqoop?
Sqoop is used to import data from external data stores into the Hadoop Distributed File System or related parts of the Hadoop ecosystem, such as Hive and HBase. Similarly, Sqoop can extract data from Hadoop or its ecosystem and export it to external data stores such as relational databases and enterprise data warehouses.
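To give a feel for what an import involves, the Python sketch below reads every row of a relational table and writes it out as delimited text, which is the general shape of moving RDBMS data into file storage. It is only an analogy: it uses an in-memory SQLite database and a string buffer instead of a real database server and HDFS, it does not use Sqoop, and the function name, table, and rows are hypothetical.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, table: str) -> str:
    """Read all rows of a table and return them as comma-delimited text."""
    if not table.isidentifier():
        # Guard the interpolated table name against SQL injection.
        raise ValueError("invalid table name: %r" % table)
    cursor = conn.execute("SELECT * FROM %s ORDER BY 1" % table)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cursor)
    return buffer.getvalue()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (eid INTEGER, name TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)",
                     [(1, "Asha"), (2, "Ravi")])  # hypothetical rows
    print(export_table_to_csv(conn, "employee"))
```

Sqoop performs this kind of transfer at scale by running the reads as parallel tasks on the cluster and writing the results into HDFS, Hive, or HBase.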
What is HBase?
• Column-oriented data store, known as the "Hadoop Database"
• Distributed: designed to serve large tables
   - Billions of rows and millions of columns
• Supports random, real-time CRUD operations (unlike HDFS)
• Runs on a cluster of commodity hardware
   - Server hardware, not laptops or desktops
• Open source, written in Java, part of the Apache Hadoop ecosystem
• A type of "NoSQL" database
   - Does not provide SQL-based access
   - Does not adhere to the relational model for storage
HBase table example: [figure not reproduced; the table has two column families, info and content]
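The column-family layout can be pictured as nested maps: row key, then column family, then column qualifier, then value. The Python sketch below models that structure with the info and content families from the table example and supports simple create, read, update, and delete operations. The class name, the qualifiers, and the stored values are assumptions for illustration; this is not the HBase client API, and real HBase also keeps timestamped versions of each cell.

```python
from typing import Dict, Iterable, Optional


class ToyTable:
    """In-memory model of a table: row key -> family -> qualifier -> value."""

    def __init__(self, families: Iterable[str]) -> None:
        self.families = set(families)
        self.rows: Dict[str, Dict[str, Dict[str, str]]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        """Create or update one cell; column is written as 'family:qualifier'."""
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError("unknown column family: %s" % family)
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key: str, column: Optional[str] = None):
        """Read a whole row, or a single cell when a column is given."""
        row = self.rows.get(row_key, {})
        if column is None:
            return row
        family, qualifier = column.split(":", 1)
        return row.get(family, {}).get(qualifier)

    def delete(self, row_key: str) -> None:
        """Remove a row if it exists."""
        self.rows.pop(row_key, None)


if __name__ == "__main__":
    table = ToyTable(["info", "content"])
    table.put("row1", "info:title", "Hadoop")         # hypothetical qualifier
    table.put("row1", "content:body", "Intro slides")  # hypothetical qualifier
    print(table.get("row1"))
```

Because each row stores only the qualifiers it actually uses, different rows can carry different columns within the same family, which is how such tables can reach millions of columns.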
Project: Banking Finance Data Analysis
Using Pig:
grunt> REGISTER '/usr/local/pig/lib/piggybank.jar';
grunt> A = load '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv'
    using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    as (customer_id:int, customer_name:chararray, loan_account_no:chararray, sanctioned_loan_amount:int,
    currency:chararray, disbused_loan_amount:int, loan_status:chararray, risk:int, location:chararray, reason:chararray);
grunt> dump A;
##1 Calculate the overall average risk
grunt> B = group A all;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##2 Calculate average risk per location
grunt> B = group A by location;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##3 Calculate average risk per loan_status
grunt> B = group A by loan_status;
grunt> C = foreach B generate group, AVG(A.risk);
grunt> dump C;
##4 Calculate average risk per location and loan_status
grunt> D = group A by (location, loan_status);
grunt> E = foreach D generate group, AVG(A.risk);
grunt> dump E;
Using Hive:
First convert the .xlsx file to .csv format, then start Hadoop and open the Hive shell:
root@harshita-VirtualBox:/home/harshita# start-all.sh
root@harshita-VirtualBox:/home/harshita# hive
hive> use project;
hive> create table banking (customer_id int, customer_name string, loan_account_no string, sanctioned_loan_amount int,
    > currency string, disbused_loan_amount int, loan_status string, risk int, location string, reason string)
    > row format delimited fields terminated by ','
    > tblproperties ("skip.header.line.count"="1");
hive> load data local inpath '/home/harshita/Desktop/projects/Banking-Finanance/banking_input_data.csv'
    > overwrite into table banking;
hive> select * from banking limit 10;
##1 Calculate the overall average risk
hive> select AVG(risk) from banking;
##2 Calculate average risk per location
hive> select location, AVG(risk) as avgrisk from banking group by location;
##3 Calculate average risk per loan_status
hive> select loan_status, AVG(risk) as avgrisk from banking group by loan_status;
##4 Calculate average risk per location and loan_status
hive> select loan_status, location, AVG(risk) as avgrisk from banking group by loan_status, location;
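The four calculations in this project share one pattern: group the rows by zero or more columns, then average the risk column within each group. The Python sketch below shows that pattern on a handful of made-up rows so the expected results can be checked by hand. The function name and the sample values are assumptions; only the column names come from the project's table.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def average_risk(rows: Iterable[dict], *keys: str) -> Dict[Tuple, float]:
    """Average the 'risk' column per group; with no keys, over all rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row["risk"])
    return {group: mean(values) for group, values in groups.items()}


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from banking_input_data.csv.
    rows = [
        {"location": "Delhi", "loan_status": "Approved", "risk": 2},
        {"location": "Delhi", "loan_status": "Rejected", "risk": 4},
        {"location": "Pune", "loan_status": "Approved", "risk": 6},
    ]
    print(average_risk(rows))                             # overall average
    print(average_risk(rows, "location"))                 # per location
    print(average_risk(rows, "loan_status"))              # per loan_status
    print(average_risk(rows, "location", "loan_status"))  # per pair
```

Passing no keys puts every row into a single group, which corresponds to the overall average, and passing two keys corresponds to grouping by the (location, loan_status) pair.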
Conclusion:
Through this summer training I learnt what Big Data actually is, what its sources are, and how this data can be handled using Hadoop tools such as Hive, Pig, MapReduce, Flume, and Sqoop.
Thank You…!!