Presentation By
, M.Tech(CSE).,(PhD),
Big Data Education Group, Bangalore
Big Data Vs Hadoop
Big data is simply the large sets of data that businesses and other parties put together to
serve specific goals and operations. Big data can include many different kinds of data in
many different kinds of formats.
For example, businesses might put a lot of work into collecting thousands of pieces of data
on purchases in currency formats, on customer identifiers like name or Social Security
number, or on product information in the form of model numbers, sales numbers or
inventory numbers.
All of this, or any other large mass of information, can be called big data. As a rule, it’s raw
and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software
products work to interpret or parse the results of big data searches through specific
algorithms and methods.
Hadoop is an open-source program under the Apache license that is maintained by a global
community of users. Its main components include the MapReduce processing framework
and the Hadoop Distributed File System (HDFS).
The idea behind MapReduce is that Hadoop first maps a large data set, and then
performs a reduction on that content for specific results.
A reduce function can be thought of as a kind of filter for raw data. The HDFS layer then
distributes data across a network or migrates it as necessary.
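The map-then-reduce flow can be sketched in a few lines of Python. This is a toy word count, not Hadoop's actual Java API; the record data and function names are illustrative only:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit (key, value) pairs from each input record."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: group the emitted pairs by key and aggregate the values."""
    shuffled = sorted(pairs, key=itemgetter(0))  # the "shuffle/sort" step
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big ideas", "big cluster"]))
# counts == {"big": 3, "cluster": 1, "data": 1, "ideas": 1}
```

In real Hadoop, the map and reduce functions run on many machines at once, and the shuffle/sort step moves intermediate pairs across the network between them.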
Database administrators, developers and others can use the various features of Hadoop to
deal with big data in any number of ways.
For example, Hadoop can be used to pursue data strategies like clustering and targeting
with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well
to simple queries.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was
working at Yahoo! at the time, named it after his son's toy elephant. It was originally
developed to support distribution for the Nutch search engine project.
Hadoop (1.x) jobs run under five main daemons:
 Name node
 Data Node
 Secondary Name node
 Job Tracker
 Task Tracker
Starting Daemons
 Hadoop is a large-scale distributed batch processing infrastructure.
 Its true power lies in its ability to scale to hundreds or thousands of computers, each with
several processor cores.
 Hadoop is also designed to efficiently distribute large amounts of work across a set of
machines.
 Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to
terabytes or petabytes.
 At this scale, it is likely that the input data set will not even fit on a single computer's hard
drive, much less in memory.
 So Hadoop includes a distributed file system which breaks up input data and sends
fractions of the original data to several machines in your cluster to hold.
 This results in the problem being processed in parallel using all of the machines in the
cluster, so that output results are computed as efficiently as possible.
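The splitting idea can be sketched minimally in Python. The block size and node names below are made up for illustration, and the round-robin placement is a toy policy, not HDFS's actual placement algorithm:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Break a byte stream into fixed-size blocks, HDFS-style."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes):
    """Spread block indices round-robin over the cluster's nodes (toy policy)."""
    placement = {node: [] for node in nodes}
    for idx, _block in enumerate(blocks):
        placement[nodes[idx % len(nodes)]].append(idx)
    return placement

blocks = split_into_blocks(b"x" * 250, block_size=100)  # 3 blocks: 100, 100, 50 bytes
placement = assign_blocks(blocks, ["node1", "node2"])
# placement == {"node1": [0, 2], "node2": [1]}
```

Because each machine holds only a fraction of the file, map tasks can later run on each fraction in parallel, right where the data lives.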
Hadoop Advantages
Hadoop is an open-source, versatile tool that provides the power of distributed computing.
By using distributed storage and transferring code instead of data, Hadoop avoids the costly
transmission step when working with large data sets.
Redundancy of data allows Hadoop to recover from single-node failures.
It is easy to create programs with Hadoop because it uses the MapReduce framework.
You need not worry about partitioning the data, determining which nodes will perform
which tasks, or handling communication between nodes; Hadoop does all of this for you,
leaving you free to focus on what is most important to you: your data and what you want
to do with it.
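The recovery-through-redundancy point can be illustrated with a small sketch. The node names, block IDs, and placement logic are hypothetical; the only claim taken from the text is that each block is stored on several distinct nodes (three by default), so losing one node loses no data:

```python
import random

def place_replicas(block_ids, nodes, replication=3):
    """Store each block on `replication` distinct nodes, as HDFS does by default."""
    return {b: random.sample(nodes, replication) for b in block_ids}

def surviving_copies(placement, failed_node):
    """After one node fails, list the replicas of each block that are still alive."""
    return {b: [n for n in holders if n != failed_node]
            for b, holders in placement.items()}

nodes = [f"node{i}" for i in range(5)]
placement = place_replicas(range(4), nodes, replication=3)
after = surviving_copies(placement, "node0")
# Every block still has at least two live copies; nothing is lost.
```

In a real cluster, the Name Node would then re-replicate the under-replicated blocks onto healthy nodes to restore the replication factor.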
Challenges:
Performing large-scale computation is difficult.
Whenever multiple machines are used in cooperation with one another, the probability of
failures rises.
In a distributed environment, however, partial failures are an expected and common
occurrence.
Networks can experience partial or total failure if switches and routers break down. Data
may not arrive at a particular point in time due to unexpected network congestion.
Clocks may become desynchronized, lock files may not be released, parties involved in
distributed atomic transactions may lose their network connections part-way through, etc.
In each of these cases, the rest of the distributed system should be able to recover from the
component failure or transient error condition and continue to make progress.
Synchronization between multiple machines remains the biggest challenge in
distributed system design.
For example, if 100 nodes are present in a system and one of them crashes, the other
99 nodes should be able to continue the computation, ideally with only a small penalty
proportionate to the loss of 1% of the computing power.
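The 100-node example can be made concrete with a toy scheduler. The scheduling policy below (round-robin, with failed tasks reassigned to healthy nodes) is a simplification for illustration, not Hadoop's actual JobTracker logic:

```python
def run_job(tasks, nodes, failed_nodes):
    """Assign tasks round-robin; rerun any task whose node failed on a healthy node."""
    healthy = [n for n in nodes if n not in failed_nodes]
    results = {}
    for i, task in enumerate(tasks):
        node = nodes[i % len(nodes)]
        if node in failed_nodes:              # simulate the crash: reassign the task
            node = healthy[i % len(healthy)]
        results[task] = node
    return results

nodes = [f"node{i}" for i in range(100)]
results = run_job([f"task{i}" for i in range(200)], nodes, failed_nodes={"node7"})
# All 200 tasks complete; the 99 surviving nodes absorb node7's share of the work.
```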
Hadoop typically isn't a one-stop-shopping product and must be used in coordination
with MapReduce and a range of other complementary technologies from what is
referred to as the Hadoop ecosystem.
Although it's open source, it's by no means free. Companies implementing a Hadoop
cluster generally choose one of the commercial distributions of the framework, which
adds maintenance and support costs.
They also need to pay for hardware and hire experienced programmers, or train existing
employees to work with Hadoop, MapReduce and related technologies such as
Hive, HBase and Pig.
Challenges:
The following are the major weaknesses commonly found in the Hadoop framework:
As you know, Hadoop uses HDFS and MapReduce, and both of their master processes
are single points of failure, although there is active work going on toward High
Availability versions.
Until the Hadoop 2.x release, HDFS and MapReduce use single-master models, which
can result in single points of failure.
Hadoop does not offer storage- or network-level encryption, which is a very big concern
for government-sector application data.
HDFS is inefficient at handling small files, and it lacks transparent compression.
HDFS is not designed to work well with random reads over small files, due to its
optimization for sustained throughput.
MapReduce is a shared-nothing architecture, so tasks that require global
synchronization or the sharing of mutable data are not a good fit, which can pose
challenges for some algorithms.
• HDFS Introduction
• HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold
very large amounts of data (terabytes or even petabytes), and provide high-throughput
access to this information.
• Files are stored in a redundant fashion across multiple machines to ensure their durability
in the face of failures and their availability to highly parallel applications. This module
introduces the design of this distributed file system and instructions on how to operate it.
• A distributed file system is designed to hold a large amount of data and provide access to
this data to many clients distributed across a network. There are a number of distributed
file systems that solve this problem in different ways.
• HDFS should store data reliably. If individual machines in the cluster malfunction, data
should still be available.
• HDFS should provide fast, scalable access to this information. It should be possible to
serve a larger number of clients by simply adding more machines to the cluster.
• HDFS should integrate well with Hadoop MapReduce, allowing data to be read and
computed upon locally when possible.
• Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense
of random seek times to arbitrary positions in files.
• Due to the large size of files, and the sequential nature of reads, the system does not
provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
• Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as
a whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
• The design of HDFS is based on the design of GFS, the Google File System. Its design was
described in a paper published by Google.
• HDFS Architecture
• HDFS is a block-structured file system: individual files are broken into blocks of a fixed
size. These blocks are stored across a cluster of one or more machines with data storage
capacity.
• Individual machines in the cluster are referred to as Data Nodes. A file can be made of
several blocks, and they are not necessarily stored on the same machine; the target
machines which hold each block are chosen randomly on a block-by-block basis.
• Thus access to a file may require the cooperation of multiple machines, but supports file
sizes far larger than a single-machine DFS; individual files can require more space than a
single hard drive could hold.
• If several machines must be involved in the serving of a file, then a file could be rendered
unavailable by the loss of any one of those machines. HDFS combats this problem by
replicating each block across a number of machines (3, by default).
• Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast,
the default block size in HDFS is 64MB -- orders of magnitude larger. This allows HDFS
to decrease the amount of metadata storage required per file (the list of blocks per file will
be smaller as the size of individual blocks increases).
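The metadata saving from large blocks follows from quick arithmetic, sketched here with the block sizes named in the text:

```python
def block_count(file_size, block_size):
    """Number of blocks needed to store a file (ceiling division)."""
    return -(-file_size // block_size)

one_gib = 1 << 30  # a 1 GiB file

small = block_count(one_gib, 4 * 1024)          # 4 KiB blocks
large = block_count(one_gib, 64 * 1024 * 1024)  # 64 MiB blocks (HDFS default)
# small == 262144 block entries to track; large == 16 entries
```

Tracking 16 block entries per gigabyte instead of over a quarter of a million is what keeps the per-file metadata small enough to hold in memory.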
[Figure: Hadoop architecture, showing one master node and two slave nodes]
• HDFS expects to read a block start-to-finish for a program. This makes it particularly
useful to the Map Reduce style of programming.
• Because HDFS stores files as a set of large blocks across several machines, these files are
not part of the ordinary file system. Typing ls on a machine running a Data Node daemon
will display the contents of the ordinary Linux file system being used to host the Hadoop
services -- but it will not include any of the files stored inside the HDFS.
• This is because HDFS runs in a separate namespace, isolated from the contents of your
local files. The files inside HDFS (or more accurately, the blocks that make them up) are
stored in a particular directory managed by the Data Node service, but the files there are
named only with block IDs.
• It is important for this file system to store its metadata reliably. Furthermore, while the file
data is accessed in a write-once, read-many model, the metadata structures (e.g., the
names of files and directories) can be modified by a large number of clients concurrently.
• It is important that this information is never desynchronized. Therefore, it is all handled by
a single machine, called the Name Node.
• The Name Node stores all the metadata for the file system. Because of the relatively low
amount of metadata per file (it only tracks file names, permissions, and the locations of
each block of each file), all of this information can be stored in the main memory of the
Name Node machine, allowing fast access to the metadata.
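A rough sizing sketch shows why in-memory metadata is feasible. The ~150 bytes per namespace object (file or block) used below is a commonly cited rule of thumb for HDFS, not an exact figure, and the workload numbers are made up:

```python
BYTES_PER_OBJECT = 150  # assumed NameNode heap cost per file or block entry

def namenode_heap_estimate(num_files, blocks_per_file):
    """Replicas add no namespace objects; only files and their blocks count."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# Ten million files of roughly two blocks each:
estimate = namenode_heap_estimate(10_000_000, blocks_per_file=2)
# estimate == 4_500_000_000 bytes, i.e. about 4.5 GB of RAM
```

Under these assumptions, even a namespace of tens of millions of files fits comfortably in the main memory of a single well-provisioned Name Node machine.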
[Figure: HDFS layout]
• Centralized Name Node: maintains metadata info about files.
• Many Data Nodes (1000s): store the actual data; files are divided into blocks (64 MB),
and each block is replicated N times (default N = 3).
• To open a file, a client contacts the Name Node and retrieves a list of locations for the
blocks that comprise the file. These locations identify the Data Nodes which hold each
block.
• Clients then read file data directly from the Data Node servers, possibly in parallel. The
Name Node is not directly involved in this bulk data transfer, keeping its overhead to a
minimum.
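The two-step read path described above can be sketched as follows. The paths, node names, and dictionary-based "cluster state" are all hypothetical; the sketch shows only the protocol shape: metadata from the Name Node, bulk data straight from Data Nodes:

```python
def namenode_lookup(metadata, path):
    """Step 1: ask the Name Node for the ordered block locations of a file."""
    return metadata[path]  # list of (block_id, [data nodes holding that block])

def read_file(metadata, datanode_storage, path):
    """Step 2: pull each block directly from one of its Data Nodes."""
    data = b""
    for block_id, locations in namenode_lookup(metadata, path):
        # The Name Node is not involved in this bulk transfer.
        data += datanode_storage[locations[0]][block_id]
    return data

metadata = {"/logs/app.log": [(1, ["dn1", "dn2"]), (2, ["dn2", "dn3"])]}
storage = {"dn1": {1: b"hello "}, "dn2": {1: b"hello ", 2: b"world"},
           "dn3": {2: b"world"}}
content = read_file(metadata, storage, "/logs/app.log")
# content == b"hello world"
```

Because each block lists several candidate Data Nodes, a real client can also pick the nearest replica or fall back to another one if its first choice fails.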
• Name Node information must be preserved even if the Name Node machine fails; there are
multiple redundant systems that allow the Name Node to preserve the file system's
metadata even if the Name Node itself crashes irrecoverably.
• Name Node failure is more severe for the cluster than Data Node failure. While individual
Data Nodes may crash and the entire cluster will continue to operate, the loss of the Name
Node will render the cluster inaccessible until it is manually restored.
• Fortunately, as the Name Node's involvement is relatively minimal, the odds of it failing
are considerably lower than the odds of an arbitrary Data Node failing at any given point
in time.
[Figure: File reading]
[Figure: File writing]
Summary
Big data is simply the large sets of data that businesses and other parties put
together to serve specific goals; Hadoop is one of the tools designed to handle it.
The idea behind MapReduce is that Hadoop first maps a large data set and then
performs a reduction on that content for specific results.
Hadoop jobs run under five main daemons:
 Name node, Data Node, Secondary Name node
 Job Tracker, Task Tracker
HDFS, the Hadoop Distributed File System, is a distributed file system
designed to hold very large amounts of data (terabytes or even petabytes),
and provide high-throughput access to this information.
Thank You!
Any Queries?
Polish students' mobility in the Czech Republic
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 

Bigdata and Hadoop Introduction

  • 1. Presentation By , M.Tech(CSE).,(PhD), Big Data Education Group, Bangalore
  • 2. Big Data vs Hadoop: Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like names or Social Security numbers, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it is raw and unsorted until it is put through various kinds of tools and handlers. Hadoop is one of the tools designed to handle big data. Hadoop and other software products interpret or parse the results of big data searches through specific algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes several main components, including the MapReduce processing framework and the Hadoop Distributed File System (HDFS).
  • 3. The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. HDFS then acts to distribute data across a network or migrate it as necessary. Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, that is, data that doesn't fit neatly into a traditional table or respond well to simple queries.
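The map-then-reduce flow described above can be sketched in plain Python. This is only an illustration of the programming model (a word count, the classic MapReduce example), not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per key -- the 'filter' over raw mapped data."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big tools", "hadoop handles big data"]
result = reduce_phase(map_phase(lines))
print(result)  # {'big': 3, 'data': 2, 'tools': 1, 'hadoop': 1, 'handles': 1}
```

In real Hadoop, the map phase runs in parallel on the machines holding each input split, and the framework shuffles the intermediate pairs to the reducers by key.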
  • 4. Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop jobs run under five main daemons:  Name Node  Data Node  Secondary Name Node  Job Tracker  Task Tracker
  • 5.  Hadoop is a large-scale distributed batch processing infrastructure.  Its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores.  Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.  Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes.  At this scale, the input data set will likely not even fit on a single computer's hard drive, much less in memory.  So Hadoop includes a distributed file system that breaks up input data and sends fractions of the original data to several machines in your cluster to hold.  The problem is then processed in parallel using all of the machines in the cluster, and output results are computed as efficiently as possible.
  • 6. Hadoop Advantages: Hadoop is an open-source, versatile tool that provides the power of distributed computing. By using distributed storage and transferring code instead of data, Hadoop avoids the costly transmission step when working with large data sets. Redundancy of data allows Hadoop to recover from single-node failure. It is easy to create programs with Hadoop because it uses the MapReduce framework: you need not worry about partitioning the data, determining which nodes will perform which tasks, or handling communication between nodes, as Hadoop does all of this for you. That leaves you free to focus on what is most important: your data and what you want to do with it.
  • 7. Challenges: Performing large-scale computation is difficult. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress.
  • 8. Synchronization between multiple machines remains the biggest challenge in distributed system design. For example, if 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Hadoop typically isn't a one-stop-shopping product and must be used in coordination with MapReduce and a range of other complementary technologies from what is referred to as the Hadoop ecosystem. Although it is open source, it is by no means free: companies implementing a Hadoop cluster generally choose one of the commercial distributions of the framework, which imposes maintenance and support costs. They need to pay for hardware and to hire experienced programmers or train existing employees to work with Hadoop, MapReduce and related technologies such as Hive, HBase and Pig.
  • 9. Challenges: Following are the major common areas found as weaknesses of Hadoop framework or system: As you know Hadoop uses HDFS and Map Reduce, Both of their master processes are single points of failure, Although there is active work going on for High Availability versions. Until the Hadoop 2.x release, HDFS and Map Reduce will be using single-master models which can result in single points of failure. Hadoop does not offer storage or network level encryption which is very big concern for government sector application data. HDFS is inefficient for handling small files, and it lacks transparent compression. As HDFS is not designed to work well with random reads over small files due to its optimization for sustained throughput. Map Reduce is a shared-nothing architecture hence Tasks that require global synchronization or sharing of mutable data are not a good fit which can pose challenges for some algorithms
  • 11. HDFS Introduction: HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability against failure and their high availability to highly parallel applications. This module introduces the design of this distributed file system and gives instructions on how to operate it. A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network; there are a number of distributed file systems that solve this problem in different ways. HDFS should store data reliably: if individual machines in the cluster malfunction, data should still be available. HDFS should provide fast, scalable access to this information: it should be possible to serve a larger number of clients by simply adding more machines to the cluster. HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
  • 12. Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files. Due to the large size of files and the sequential nature of reads, the system does not provide a mechanism for local caching of data; the overhead of caching is great enough that data should simply be re-read from the HDFS source. Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails altogether). While performance may degrade in proportion to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost; data replication strategies combat this problem. The design of HDFS is based on the design of GFS, the Google File System, whose design was described in a paper published by Google.
  • 13. HDFS Architecture: HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as Data Nodes. A file can be made of several blocks, and they are not necessarily stored on the same machine; the target machines that hold each block are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of multiple machines, but this supports file sizes far larger than a single-machine DFS could handle: individual files can require more space than a single hard drive could hold. If several machines must be involved in serving a file, then the file could be rendered unavailable by the loss of any one of those machines; HDFS combats this problem by replicating each block across a number of machines (3, by default). Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, orders of magnitude larger. This allows HDFS to decrease the amount of metadata storage required per file: the list of blocks per file is smaller when individual blocks are larger.
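The metadata saving from large blocks is easy to quantify. A rough back-of-the-envelope calculation, using the block sizes stated on the slide (the 1 GB file size is just an example):

```python
def num_blocks(file_size, block_size):
    """Number of fixed-size blocks needed to store a file (last block may be partial)."""
    return -(-file_size // block_size)  # ceiling division

ONE_GB = 1024 ** 3
# Block entries the metadata must track for a 1 GB file:
blocks_4kb = num_blocks(ONE_GB, 4 * 1024)         # typical local FS block size
blocks_64mb = num_blocks(ONE_GB, 64 * 1024 ** 2)  # HDFS default block size
print(blocks_4kb, blocks_64mb)  # 262144 16
```

A single 1 GB file needs 262,144 entries at 4 KB blocks but only 16 at 64 MB blocks, which is why the Name Node can afford to keep all block metadata in memory.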
  • 14. Hadoop architecture diagram: one master node and two slave nodes.
  • 15. HDFS expects a program to read a block start-to-finish, which makes it particularly useful for the MapReduce style of programming. Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system: typing ls on a machine running a Data Node daemon will display the contents of the ordinary Linux file system being used to host the Hadoop services, but it will not include any of the files stored inside HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of your local files. The files inside HDFS (or more accurately, the blocks that make them up) are stored in a particular directory managed by the Data Node service, but the files there are named only with block IDs. It is important for this file system to store its metadata reliably. Furthermore, while the file data is accessed in a write-once, read-many model, the metadata structures (e.g., the names of files and directories) can be modified by a large number of clients concurrently. It is important that this information is never desynchronized; therefore, it is all handled by a single machine, called the Name Node. The Name Node stores all the metadata for the file system. Because of the relatively low amount of metadata per file (it only tracks file names, permissions, and the locations of each block of each file), all of this information can be stored in the main memory of the Name Node machine, allowing fast access to the metadata.
  • 16. HDFS overview diagram: a centralized Name Node maintains metadata about files; many Data Nodes (thousands) store the actual data. Files are divided into blocks (64 MB), and each block is replicated N times (default N = 3).
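A toy sketch of the block-and-replica bookkeeping just described. The node names and the random placement policy here are simplified illustrations, not HDFS's real rack-aware placement algorithm:

```python
import random

def place_blocks(file_size, block_size=64 * 1024 ** 2, replication=3, datanodes=None):
    """Split a file into blocks and assign each block to `replication` distinct Data Nodes."""
    datanodes = datanodes or [f"datanode-{i}" for i in range(10)]
    n_blocks = -(-file_size // block_size)  # ceiling division
    # Name Node-style metadata: block id -> list of Data Nodes holding a replica
    return {block_id: random.sample(datanodes, replication)
            for block_id in range(n_blocks)}

placement = place_blocks(5 * 64 * 1024 ** 2)  # a file of exactly 5 blocks
assert len(placement) == 5                    # 5 blocks tracked
assert all(len(set(nodes)) == 3 for nodes in placement.values())  # 3 distinct replicas each
```

The point of the structure: losing any single Data Node removes at most one replica of each affected block, so the data stays readable from the surviving copies.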
  • 17. To open a file, a client contacts the Name Node and retrieves a list of locations for the blocks that comprise the file. These locations identify the Data Nodes that hold each block. Clients then read file data directly from the Data Node servers, possibly in parallel; the Name Node is not directly involved in this bulk data transfer, keeping its overhead to a minimum. Name Node information must be preserved even if the Name Node machine fails; there are multiple redundant systems that allow the Name Node to preserve the file system's metadata even if the Name Node itself crashes irrecoverably. Name Node failure is more severe for the cluster than Data Node failure: while individual Data Nodes may crash and the entire cluster will continue to operate, the loss of the Name Node will render the cluster inaccessible until it is manually restored. Fortunately, as the Name Node's involvement is relatively minimal, the odds of it failing are considerably lower than the odds of an arbitrary Data Node failing at any given point in time.
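The read path above can be mocked up to show the division of labour: metadata comes from the Name Node, bulk data from the Data Nodes. All classes and the file path here are hypothetical stand-ins, not Hadoop's client API:

```python
class NameNode:
    """Holds only metadata: which Data Nodes store each block of each file."""
    def __init__(self, block_map):
        # filename -> ordered list of (block_id, [DataNode, ...]) entries
        self.block_map = block_map

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    """Holds the actual block bytes."""
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

# One file split across two blocks on two different Data Nodes:
dn1 = DataNode({0: b"hello "})
dn2 = DataNode({1: b"world"})
nn = NameNode({"/data/f.txt": [(0, [dn1]), (1, [dn2])]})

# Client: ask the Name Node once, then pull bytes straight from the Data Nodes.
data = b"".join(nodes[0].read_block(bid)
                for bid, nodes in nn.get_block_locations("/data/f.txt"))
assert data == b"hello world"
```

Note that the Name Node never touches the file contents; it only answers the location lookup, which is what keeps its load low.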
  • 20. Summary: Big data is simply the large sets of data that organizations collect, and Hadoop is one of the tools designed to handle it. With MapReduce, Hadoop can first map a large data set and then perform a reduction on that content for specific results. Hadoop jobs run under five main daemons:  Name Node, Data Node, Secondary Name Node  Job Tracker, Task Tracker. HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information.