Presentation on
Big Data/Hadoop
Submitted to:
Department of CSE
AITS, Udaipur
Submitted by:
1. Laxmi Rauth
2. Anand Mohan
B.Tech (4th year)
Big Data
"Big Data" is a collection of data sets so large and complex
that it becomes difficult to process them using on-hand database
management tools or traditional data processing
applications.
In simple terms, "Big Data" consists of very large volumes of
heterogeneous data that is often generated at high speed.
Big Data requires the use of a new set of tools, applications
and frameworks to process and manage the data.
Characteristics of Big Data:
The characteristics of Big Data are
popularly known as the Three V's of
Big Data.
Volume:
The sheer size of the data being generated and
stored is referred to as Volume in the Big Data world.
Velocity:
The speed at which the data is generated is
referred to as Velocity in the Big Data world.
Variety:
The range of formats the data arrives in is
referred to as Variety in the Big Data world.
Sources of Big Data:
Sources of Big Data can be broadly classified
into six different categories:
1. Enterprise Data
2. Transactional Data
3. Social Media
4. Activity Generated
5. Public Data
6. Archives
Hadoop is an Apache open source framework written in Java
that allows distributed processing of large datasets across
clusters of computers using simple programming
models. It manages data processing and storage for big
data applications running on clustered systems.
History of Hadoop
The history of Hadoop started in 2002 with
the Apache Nutch project. Hadoop was created
by Doug Cutting and Mike Cafarella. Cutting is
also the creator of Apache Lucene, the widely
used text search library.
According to Hadoop's creator
Doug Cutting, "The name
Hadoop was given by my kid to a
stuffed yellow elephant. Short,
relatively easy to spell and
pronounce, meaningless, and
not used elsewhere."
Characteristics of Hadoop
Hadoop provides reliable shared storage (HDFS) and an analysis system (MapReduce).
Hadoop is highly scalable: because it scales linearly, a Hadoop Cluster can contain tens, hundreds, or even thousands of servers.
Hadoop is highly flexible and can process both structured and unstructured data.
Hadoop has built-in fault tolerance.
Hadoop works on the principle of write once, read multiple times, and is optimized for large and very large data sets.
Hadoop is very cost effective, as it can work with commodity hardware and does not require expensive high-end hardware.
Hadoop works in a master-worker / master-slave
fashion.
Hadoop has two core components: HDFS and
MapReduce.
HDFS (Hadoop Distributed File System) offers a
highly reliable and distributed storage, and ensures
reliability, by storing the data across multiple nodes.
MapReduce offers an analysis system which can
perform complex computations on large datasets. This
component is responsible for performing all the
computations; it works by breaking a large,
complex computation into multiple tasks and assigning
those to individual worker/slave nodes.
The master contains the Namenode and Job Tracker
components.
Namenode holds the information about all the other
nodes in the Hadoop Cluster.
Job Tracker keeps track of the individual tasks/jobs
assigned to each of the nodes and coordinates the
exchange of information and results.
Each worker/slave contains the Task Tracker and
Datanode components.
Task Tracker is responsible for running the task /
computation assigned to it.
Datanode is responsible for holding the data.
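The master-worker split described above can be sketched in plain Python (a toy illustration, not Hadoop code): a "master" breaks a large computation into tasks, hands them to a pool of workers, and combines the partial results.

```python
# Toy master-worker sketch: the thread pool stands in for the cluster's
# worker/slave nodes, and the main program plays the master's role.
from concurrent.futures import ThreadPoolExecutor

def run_task(chunk):
    # Each worker computes a partial result on its own chunk of the data.
    return sum(chunk)

data = list(range(100))

# Master: break the large computation into smaller tasks.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Master: assign the tasks to workers and collect their results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(run_task, chunks))

print(sum(partials))  # master combines the partial results: 4950
```

In real Hadoop the Job Tracker plays the coordinating role and Task Trackers run the tasks on nodes that also hold the data; the sketch only shows the divide-assign-combine pattern.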
Hadoop Distributions
Cloudera was the first company
to be formed to build enterprise
solutions based on Hadoop.
Cloudera has a Hadoop
distribution known as
Cloudera's Distribution for
Hadoop (CDH).
MapR is another major
distribution available in the
market.
MapR is available in the cloud
through some of the leading
cloud providers, including Amazon
Web Services (AWS), Google
Compute Engine, CenturyLink
Technology Solutions, and
OpenStack.
Amazon Web Services (AWS)
Elastic MapReduce (EMR) was
among the first Hadoop
offerings available in the market.
Azure HDInsight is Microsoft's
distribution of Hadoop.
Hortonworks has a Hadoop
distribution known as
Hortonworks Data Platform
(HDP).
Hadoop Ecosystem
1. Apache Pig is a software framework which offers a run-time
environment for execution of MapReduce jobs on a Hadoop Cluster via
a high-level scripting language called Pig Latin.
2. Apache Hive Data Warehouse framework facilitates the querying and
management of large datasets residing in a distributed store/file system
like Hadoop Distributed File System (HDFS).
3. Apache Mahout is a scalable machine learning and data mining library.
4. Apache HBase is a distributed, versioned, column-oriented, and scalable
big data store built on top of Hadoop/HDFS.
5. Apache Sqoop is a tool designed for efficiently transferring data
between Hadoop and relational databases (RDBMS).
6. Apache Oozie is a job workflow scheduling and coordination manager
for managing the jobs executed on Hadoop.
7. Apache ZooKeeper is an open source coordination service for
distributed applications.
8. Apache Ambari is an open source software framework for provisioning,
managing, and monitoring Hadoop clusters.
HDFS and MapReduce are the two core components of the Hadoop Ecosystem and are at the heart of the Hadoop framework.
The other Apache projects listed above are built around the Hadoop framework and are likewise part of the Hadoop Ecosystem.
Hadoop Core Components
YARN
YARN (Yet Another Resource Negotiator) is a
resource manager that knows how to
allocate distributed compute
resources to the various applications
running on a cluster.
MapReduce
MapReduce is a framework that
enables running MapReduce jobs on
the Hadoop cluster powered by YARN.
It provides a high-level API for
implementing custom map and
reduce functions in various languages,
as well as the code infrastructure
needed to submit, run, and monitor
MapReduce jobs.
HDFS
HDFS (Hadoop Distributed File System) is
designed for storing large files of the
magnitude of hundreds of megabytes
or gigabytes, and provides
high-throughput streaming data
access to them.
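HDFS achieves this by splitting each large file into fixed-size blocks (128 MB by default in Hadoop 2.x) and storing each block on several Datanodes. The splitting arithmetic can be sketched in plain Python (a toy illustration, not HDFS code; the function name is made up for this example):

```python
# Toy sketch of how HDFS divides a file into fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of the given size."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be shorter
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB tail block.
for index, length in split_into_blocks(300 * 1024 * 1024):
    print(index, length // (1024 * 1024), "MB")
```

Each of those blocks would then be replicated across Datanodes (three copies by default; the sample hdfs-site.xml below sets dfs.replication to 1 for a single-machine install).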
Hadoop Versions
2.7.x: 2.7.7, released 31 May 2018
2.8.x: 2.8.5, released 15 September 2018
2.9.x: 2.9.2, released 9 November 2018
3.1.x: 3.1.2, released 6 February 2019
3.2.x: 3.2.0, released 16 January 2019
Hadoop 2.8.0 installation
1. Download Hadoop and Java
2. Install Java
3. Extract the Hadoop file
4. Set environment variables
5. Set the path
6. Edit the configuration files
7. Replace the bin folder
8. Format the name node
9. Testing
Download Hadoop 2.8.0 (Link: http://www-
eu.apache.org/dist/hadoop/common/hadoop-
2.8.0/hadoop-
2.8.0.tar.gz OR http://archive.apache.org/dist/hadoop/c
ore//hadoop-2.8.0/hadoop-2.8.0.tar.gz)
Install Java JDK 1.8.0 under "C:\Java"
(Link: http://www.oracle.com/technetwork/java/javase/
downloads/jdk8-downloads-2133151.html)
Use "javac -version" to check the version of Java installed
on your system.
Extract hadoop-2.8.0.tar.gz or hadoop-2.8.0.zip and place it under "C:\Hadoop-2.8.0".
Set the HADOOP_HOME environment variable:
Environment Variables -> New ->
Variable name: HADOOP_HOME
Variable value: C:\hadoop-2.8.0\bin
-> OK
Set the JAVA_HOME environment variable:
Environment Variables -> New ->
Variable name: JAVA_HOME
Variable value: C:\java\bin
-> OK
Set the Hadoop bin directory path and the Java bin
directory path:
Environment Variables -> System variables ->
path -> Edit -> New -> C:\hadoop-2.8.0\bin -> New
-> C:\java\bin -> OK
Edit the configuration
files: paste the XML
snippets below into each
file and save.
file C:/Hadoop-2.8.0/etc/hadoop/mapred-site.xml
file C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml
file C:/Hadoop-2.8.0/etc/hadoop/hadoop-env.cmd
file C:/Hadoop-2.8.0/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Create folder "data" under "C:\Hadoop-2.8.0"
Create folder "datanode" under "C:\Hadoop-2.8.0\data"
Create folder "namenode" under "C:\Hadoop-2.8.0\data"
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Edit file C:/Hadoop-
2.8.0/etc/hadoop/hadoop-
env.cmd: replace the line
set "JAVA_HOME=%JAVA_HOME%"
with
set "JAVA_HOME=C:\java"
(where C:\java is the path to
JDK 1.8.0)
file C:/Hadoop-
2.8.0/etc/hadoop/core-site.xml
Replace the bin folder
Download Hadoop
Configuration.zip
(Link: https://github.com/Muha
mmadBilalYar/HADOOP-
INSTALLATION-ON-
WINDOW-
10/blob/master/Hadoop%20C
onfiguration.zip)
Delete the bin folder at C:\Hadoop-
2.8.0\bin and replace it with the bin
folder from the downloaded Hadoop
Configuration.zip.
Format the namenode
Open cmd and type the command "hdfs namenode -format".
Testing
Open cmd, change directory to "C:\Hadoop-2.8.0\sbin", and
type "start-all.cmd" to start Hadoop.
Make sure these processes
are running:
1. Hadoop Namenode
2. Hadoop Datanode
3. YARN Resource Manager
4. YARN Node Manager
Open: http://localhost:50070
Open: http://localhost:8088
MAPREDUCE
MapReduce is a processing technique and a programming
model for distributed computing based on Java.
MapReduce is a framework with which we can write applications to process
huge amounts of data, in parallel, on large clusters of commodity hardware, in a
reliable manner.
The MapReduce algorithm contains two important
tasks, namely Map and Reduce.
The map task takes a set of data and converts it into another set
of data, where individual elements are broken down into
tuples (key/value pairs).
The reduce task takes the output from a map as input and combines
those data tuples into a smaller set of tuples. As the name
MapReduce implies, the reduce task is always performed after the map task.
Generally, the MapReduce
paradigm is based on sending
the computation to where the data
resides.
The major advantage of
MapReduce is that it is easy to
scale data processing over
multiple computing nodes.
A MapReduce program executes in
three stages, namely the map stage,
shuffle stage, and reduce stage.
Stages of a
MapReduce
program
Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
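The stages above can be sketched in plain Python (a toy illustration on one machine, not Hadoop itself) using word count, the classic MapReduce example: the mapper emits (word, 1) pairs, the shuffle groups pairs by key, and the reducer sums the counts for each word.

```python
# Toy word-count MapReduce: map -> shuffle (group by key) -> reduce.
from itertools import groupby
from operator import itemgetter

def map_stage(line):
    # Mapper: emit a (word, 1) key/value pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_stage(word, counts):
    # Reducer: combine all values for one key into a single total.
    return (word, sum(counts))

lines = ["big data big tools", "big data"]  # stands in for an HDFS input file

# Map stage: the input is processed line by line.
pairs = [pair for line in lines for pair in map_stage(line)]

# Shuffle stage: sort and group the intermediate pairs by key.
pairs.sort(key=itemgetter(0))
grouped = groupby(pairs, key=itemgetter(0))

# Reduce stage: one reducer call per distinct key.
result = dict(reduce_stage(word, (count for _, count in group))
              for word, group in grouped)
print(result)  # {'big': 3, 'data': 2, 'tools': 1}
```

In real Hadoop the map and reduce calls run in parallel on different nodes and the framework performs the shuffle over the network; the data flow, however, is exactly this.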
Thank you
More Related Content

What's hot

Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
KrishnenduKrishh
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
Xuan-Chao Huang
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
Varun Narang
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Bhushan Kulkarni
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
Ajit Koti
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
Edureka!
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
Shivanee garg
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
Tarak Tar
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
Thanh Nguyen
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Giovanna Roda
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Chanchal Tripathi
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
Edureka!
 

What's hot (19)

Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
hadoop
hadoophadoop
hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 

Similar to Hadoop basics

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
Ranjith Sekar
 
HDFS
HDFSHDFS
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
Thanh Nguyen
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
Manoj Jangalva
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
Mohanasundaram Ponnusamy
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
Mahmoud Yassin
 
Hadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An OverviewHadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An Overview
rahulmonikasharma
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data
Big dataBig data
Big data
revathireddyb
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
J S Jodha
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
Nalini Mehta
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
vinayiqbusiness
 
Hadoop .pdf
Hadoop .pdfHadoop .pdf
Hadoop .pdf
SudhanshiBakre1
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
DIVYA370851
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
Simplilearn
 
Hadoop
HadoopHadoop

Similar to Hadoop basics (20)

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
HDFS
HDFSHDFS
HDFS
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
paper
paperpaper
paper
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Hadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An OverviewHadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An Overview
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
hadoop
hadoophadoop
hadoop
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Hadoop .pdf
Hadoop .pdfHadoop .pdf
Hadoop .pdf
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
AzmatAli747758
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
Steve Thomason
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
rosedainty
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
Vivekanand Anglo Vedic Academy
 

Recently uploaded (20)

Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 

Hadoop basics

  • 1. Presentation on Big Data/Hadoop Submitted to : Department of CSE AITS, Udaipur Submitted by: 1.Laxmi Rauth 2.Anand Mohan B.Tech (4th year)
  • 2. Big Data "Big Data” is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. In simple terms, "Big Data" consists of very large volumes of heterogeneous data that is being generated, often, at high speeds. Big Data requires the use of a new set of tools, applications and frameworks to process and manage the data.
  • 3. Characteristics of Big Data: The characteristics of Big Data are popularly known as Three V's of Big Data. Volume: This size aspect of data is referred to as Volume in the Big Data world. Velocity: This speed aspect of data generation is referred to as Velocity in the Big Data world. Variety: This aspect of varied data formats is referred to as Variety in the Big Data world. Sources of Big Data can be broadly classified into six different categories: 1.Enterprise Data 2. Transactional Data 3. Social Media 4. Activity Generated 5. Public Data 6. Archives Sources of Big Data:
  • 4. Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across clusters of computers using simple programming models. That manages data processing and storage for big data applications running in clustered systems.
  • 5. History of Hadoop had started in the year 2002 with the project Apache Nutch. Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. According to Hadoop's creator Doug Cutting, "The name Hadoop given by my kid to a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere. Hadoop was created by Doug Cutting and Mike Cafarella. History of Hadoop
  • 6. Characteristics of Hadoop Hadoop provides a reliable shared storage (HDFS) and analysis system (Map- Reduce). Hadoop is highly scalable As Hadoop scales linearly, a Hadoop Cluster can contain tens, hundreds, or even thousands of servers. Hadoop is highly flexible and can process both structured as well as unstructured data. Hadoop has built-in fault tolerance. Hadoop works on the principle of write once and read multiple times. Hadoop is optimized for large and very large data sets. Hadoop is very cost effective as it can work with commodity hardware and does not require expensive high- end hardware.
  • 7. Hadoop works in a master-worker / master-slave fashion. Hadoop has two core components: HDFS and MapReduce. HDFS (Hadoop Distributed File System) offers a highly reliable and distributed storage, and ensures reliability, by storing the data across multiple nodes. MapReduce offers an analysis system which can perform complex computations on large datasets. This component is responsible for performing all the computations and works by breaking down a large complex computation into multiple tasks and assigns those to individual worker/slave nodes. The master contains the Namenode and Job Tracker components. Namenode holds the information about all the other nodes in the Hadoop Cluster. Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results. Each Worker / Slave contains the Task Tracker and a Datanode components. Task Tracker is responsible for running the task / computation assigned to it. Datanode is responsible for holding the data.
  • 8. Hadoop Distributions Cloudera was the first company formed to build enterprise solutions based on Hadoop. Cloudera has a Hadoop distribution known as Cloudera's Distribution for Hadoop (CDH). MapR is another major distribution available in the market; it is available in the cloud through some of the leading cloud providers: Amazon Web Services (AWS), Google Compute Engine, CenturyLink Technology Solutions, and OpenStack. Amazon Web Services (AWS) Elastic MapReduce (EMR) was among the first Hadoop offerings available in the market. Azure HDInsight is Microsoft's distribution of Hadoop. Hortonworks has a Hadoop distribution known as Hortonworks Data Platform (HDP).
  • 9. Hadoop Ecosystem HDFS and MapReduce are the two core components of the Hadoop Ecosystem and are at the heart of the Hadoop framework. Other Apache projects built around the Hadoop framework are also part of the Hadoop Ecosystem: 1. Apache Pig is a software framework which offers a run-time environment for execution of MapReduce jobs on a Hadoop cluster via a high-level scripting language called Pig Latin. 2. The Apache Hive data warehouse framework facilitates the querying and management of large datasets residing in a distributed store/file system like the Hadoop Distributed File System (HDFS). 3. Apache Mahout is a scalable machine learning and data mining library. 4. Apache HBase is a distributed, versioned, column-oriented, scalable big data store on top of Hadoop/HDFS. 5. Apache Sqoop is a tool designed for efficiently transferring data between Hadoop and relational databases (RDBMS). 6. Apache Oozie is a job workflow scheduling and coordination manager for managing the jobs executed on Hadoop. 7. Apache ZooKeeper is an open source coordination service for distributed applications. 8. Apache Ambari is an open source software framework for provisioning, managing, and monitoring Hadoop clusters.
  • 10. Hadoop Core Components HDFS (Hadoop Distributed File System) is designed for storing large files, of the magnitude of hundreds of megabytes or gigabytes, and provides high-throughput streaming data access to them. YARN (Yet Another Resource Negotiator) is a resource manager that knows how to allocate distributed compute resources to the various applications running on a cluster. MapReduce is a framework that enables running MapReduce jobs on the Hadoop cluster powered by YARN. It provides a high-level API for implementing custom Map and Reduce functions in various languages, as well as the code infrastructure needed to submit, run, and monitor MapReduce jobs.
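How HDFS stores a large file can be illustrated with a small simulation. This is a sketch only, not real HDFS code: the 128 MB block size matches the Hadoop 2.x default, but the round-robin placement and the replication factor of 3 are simplifying assumptions (the Windows tutorial later in this deck sets dfs.replication to 1).

```python
# Illustrative sketch (not real HDFS code): split a file into fixed-size
# blocks and place each block on `replication` distinct datanodes,
# mirroring how HDFS stores large files as replicated blocks.

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block for a file (all sizes in MB)."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

def place_blocks(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    placement = []
    for i in range(len(blocks)):
        nodes = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placement.append(nodes)
    return placement

blocks = split_into_blocks(300)  # a 300 MB file -> blocks of [128, 128, 44]
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

The Namenode keeps exactly this kind of block-to-datanode map in memory; real HDFS additionally uses rack awareness when choosing replica locations.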
  • 11. Hadoop Versions
  2.7.x: 2.7.7, released 31 May 2018
  2.8.x: 2.8.5, released 15 September 2018
  2.9.x: 2.9.2, released 9 November 2018
  3.1.x: 3.1.2, released 6 February 2019
  3.2.x: 3.2.0, released 16 January
  • 12. Hadoop 2.8.0 installation 1. Download Hadoop and Java 2. Install Java 3. Extract the Hadoop file 4. Set environment variables 5. Set the path 6. Edit the configuration files 7. Replace the bin folder 8. Format the namenode 9. Testing
  • 13. Download Hadoop 2.8.0 (Link: http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz OR http://archive.apache.org/dist/hadoop/core//hadoop-2.8.0/hadoop-2.8.0.tar.gz). Install Java JDK 1.8.0 on your system under "C:\Java" (Link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html). Use "javac -version" to check the version of Java installed on your system.
  • 14. Extract the file hadoop-2.8.0.tar.gz or hadoop-2.8.0.zip and place it under "C:\Hadoop-2.8.0". Set the HADOOP_HOME environment variable: Environment Variables -> New -> Variable name: HADOOP_HOME, Variable value: C:\hadoop-2.8.0\bin -> OK. Set the JAVA_HOME environment variable: Environment Variables -> New -> Variable name: JAVA_HOME, Variable value: C:\java\bin -> OK. Then add the Hadoop bin directory path and the Java bin directory path to the system path: Environment Variables -> System variables -> Path -> Edit -> New -> C:\hadoop-2.8.0\bin -> New -> C:\java\bin -> OK.
  • 15. Edit the configuration files: paste the XML below into each file and save.
  File C:/Hadoop-2.8.0/etc/hadoop/core-site.xml:
  <configuration>
  <property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
  </property>
  </configuration>
  File C:/Hadoop-2.8.0/etc/hadoop/mapred-site.xml:
  <configuration>
  <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  </property>
  </configuration>
  Create folder "data" under "C:\Hadoop-2.8.0", folder "datanode" under "C:\Hadoop-2.8.0\data", and folder "namenode" under "C:\Hadoop-2.8.0\data".
  File C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml:
  <configuration>
  <property>
  <name>dfs.replication</name>
  <value>1</value>
  </property>
  <property>
  <name>dfs.namenode.name.dir</name>
  <value>C:\hadoop-2.8.0\data\namenode</value>
  </property>
  <property>
  <name>dfs.datanode.data.dir</name>
  <value>C:\hadoop-2.8.0\data\datanode</value>
  </property>
  </configuration>
  File C:/Hadoop-2.8.0/etc/hadoop/yarn-site.xml:
  <configuration>
  <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  </property>
  <property>
  <name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  </configuration>
  Edit file C:/Hadoop-2.8.0/etc/hadoop/hadoop-env.cmd: replace the line "set JAVA_HOME=%JAVA_HOME%" with "set JAVA_HOME=C:\Java" (the path to JDK 1.8.0).
  • 16. Replace the bin folder Download Hadoop Configuration.zip (Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/Hadoop%20Configuration.zip). Delete the bin folder at C:\Hadoop-2.8.0\bin and replace it with the bin folder from the downloaded Hadoop Configuration.zip.
  • 17. Format the namenode Open cmd and type the command "hdfs namenode -format".
  • 18. Testing Open cmd, change directory to "C:\Hadoop-2.8.0\sbin", and type "start-all.cmd" to start the Hadoop daemons. Make sure these processes are running: 1. Hadoop Namenode 2. Hadoop Datanode 3. YARN Resource Manager 4. YARN Node Manager. Then open http://localhost:50070 (the HDFS web UI) and http://localhost:8088 (the YARN web UI).
  • 19. MAPREDUCE MapReduce is a processing technique and a program model for distributed computing based on Java. It is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job. Generally, the MapReduce paradigm is based on sending the computation to where the data resides. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
  • 20. Stages of a MapReduce program Map stage: the map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. Reduce stage: this stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
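The three stages above can be sketched with the classic word-count example. This is a plain-Python simulation of the MapReduce paradigm, not code that runs on a Hadoop cluster; on Hadoop the same mapper and reducer logic would be written against the Java MapReduce API, and the shuffle stage would be performed by the framework itself.

```python
# Illustrative word-count sketch (plain Python, no Hadoop): the map,
# shuffle, and reduce stages described above, applied to lines of text.
from collections import defaultdict

def map_stage(lines):
    """Mapper: emit a (word, 1) key/value pair for every word in every line."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def shuffle_stage(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_stage(groups):
    """Reducer: combine each group into a smaller result by summing counts."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big hadoop", "hadoop big"]
counts = reduce_stage(shuffle_stage(map_stage(lines)))
# counts == {"big": 3, "data": 1, "hadoop": 2}
```

Because each mapper works only on its own lines and each reducer only on one key's group, both stages can run in parallel across many nodes, which is exactly what makes the model scale.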