SlideShare a Scribd company logo
Big Data
&
Hadoop
Krishna
Nayak Sujeer
What is Big Data?
Big Data is a collection of large datasets that cannot be processed
using traditional computing techniques. It is not a single technique or a
tool, rather it involves many areas of business and technology.
What comes under Big data?
Big data involves the data produced by different devices and
applications. Given below are some of the fields that come under the
umbrella of Big Data.
 Black Box Data: It is a component of helicopter, airplanes, and jets,
etc. It captures voices of the flight crew, recordings of microphones and
earphones, and the performance information of the aircraft.
 Social Media Data: Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the
globe.
 Stock Exchange Data: The stock exchange data holds information
about the ‘buy’ and ‘sell’ decisions made on a share of different
companies made by the customers.
 Power Grid Data: The power grid data holds information consumed by
a particular node with respect to a base station.
 Transport Data: Transport data includes model, capacity, distance
and availability of a vehicle.
Big data
Big Data includes huge volume, high velocity, and extensible variety of
data. The data in it will be of three types.
 Structured data: Relational data.
 Semi Structured data: XML data.
 Unstructured data: Word, PDF, Text, Media Logs.
Benefits of Big data
 Using the information kept in the social network like Facebook, the
marketing agencies are learning about the response for their
campaigns, promotions, and other advertising mediums.
 Using the information in the social media like preferences and
product perception of their consumers, product companies and retail
organizations are planning their production.
 Using the data regarding the previous medical history of patients,
hospitals are providing better and quick service.
Operational Big data
These include systems like MongoDB that provide operational
capabilities for real-time, interactive workloads where data is primarily
captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud
computing architectures that have emerged over the past decade to
allow massive computations to be run inexpensively and efficiently. This
makes operational big data workloads much easier to manage, cheaper,
and faster to implement.
Some NoSQL systems can provide insights into patterns and trends
based on real-time data with minimal coding and without the need for
data scientists and additional infrastructure.
Analytical Big data
These includes systems like Massively Parallel Processing (MPP)
database systems and MapReduce that provide analytical capabilities
for retrospective and complex analysis that may touch most or all of the
data.
MapReduce provides a new method of analyzing data that is
complementary to the capabilities provided by SQL, and a system based
on MapReduce that can be scaled up from single servers to thousands
of high and low end machines.
Big data challenges
The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
Traditional enterprise approach
Limitation of Traditional approach
This approach works fine with those applications that process less
voluminous data that can be accommodated by standard database
servers, or up to the limit of the processor that is processing the data.
But when it comes to dealing with huge amounts of scalable data, it is a
hectic task to process such data through a single database bottleneck.
Googles solution
Google solved this problem using an algorithm called MapReduce. This
algorithm divides the task into small parts and assigns them to many
computers, and collects the results from them which when integrated,
form the result dataset .
Hadoop
Hadoop
Using the solution provided by Google, Doug Cutting and his team
developed an Open Source Project called HADOOP.
Hadoop runs applications using the MapReduce algorithm, where the
data is processed in parallel with others. In short, Hadoop is used to
develop applications that could perform complete statistical analysis on
huge amounts of data.
Hadoop Framework
Hadoop architecture
At its core, Hadoop has two major layers namely:
(a) Processing/Computation layer (MapReduce), and
(b) Storage layer (Hadoop Distributed File System).
Hadoop architecture
MapReduce
MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large amounts
of data (multi-terabyte data-sets), on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop which is an Apache open-source
framework.
HDFS
The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS) and provides a distributed file system that is designed to
run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is
designed to be deployed on low-cost hardware. It provides high
throughput access to application data and is suitable for applications
having large datasets.
Apart from the above-mentioned two core components, Hadoop
framework also includes the following two modules:
 Hadoop Common: These are Java libraries and utilities required by
other Hadoop modules.
 Hadoop YARN: This is a framework for job scheduling and cluster
resource management.
How does Hadoop work?
It is quite expensive to build bigger servers with heavy configurations
that handle large scale processing, but as an alternative, you can tie
together many commodity computers with single-CPU, as a single
functional distributed system and practically, the clustered machines
can read the dataset in parallel and provide a much higher throughput.
Moreover, it is cheaper than one high-end server. So this is the first
motivational factor behind using Hadoop that it runs across clustered
and low-cost machines.
Hadoop runs code across a cluster of computers. This process includes
the following core tasks that Hadoop performs:
 Data is initially divided into directories and files. Files are divided
into uniform sized blocks of 128M and 64M (preferably 128M).
 These files are then distributed across various cluster nodes for
further processing.
Advantages of Hadoop
Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatic distributes the data and work
across the machines and in turn, utilizes the underlying parallelism of
the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA), rather Hadoop library itself has been designed
to detect and handle failures at the application layer.
 Servers can be added or removed from the cluster dynamically and
Hadoop continues to operate without interruption.
 Another big advantage of Hadoop is that apart from being open
source, it is compatible on all the platforms since it is Java based.
End.
Thank you.

More Related Content

What's hot

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Tyrone Systems
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
saisreealekhya
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
RojaT4
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
Mohammadhasan Farazmand
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
Mishika Bharadwaj
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
RojaT4
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
Amrit Chhetri
 
Bigdata
Bigdata Bigdata
Bigdata
NithiDazz
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
inside-BigData.com
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Vipin Batra
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
siliconsudipt
 
Big data landscape
Big data landscapeBig data landscape
Big data landscape
Natalino Busa
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
nandhiniarumugam619
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
RojaT4
 
big data overview ppt
big data overview pptbig data overview ppt
big data overview ppt
VIKAS KATARE
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
ITJobZone.biz
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
boorad
 

What's hot (20)

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Big data landscape
Big data landscapeBig data landscape
Big data landscape
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Big data
Big dataBig data
Big data
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
 
big data overview ppt
big data overview pptbig data overview ppt
big data overview ppt
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
 

Viewers also liked

Linguagensdeprogramao 100611235520-phpapp01
Linguagensdeprogramao 100611235520-phpapp01Linguagensdeprogramao 100611235520-phpapp01
Linguagensdeprogramao 100611235520-phpapp01
Romário santos
 
Ellie Goulding Digipak analysis
Ellie Goulding Digipak analysisEllie Goulding Digipak analysis
Ellie Goulding Digipak analysis
rakhatewany
 
Crescimento da atividade turística em faro
Crescimento da atividade turística em faroCrescimento da atividade turística em faro
Crescimento da atividade turística em faro
Nuno Bexiga
 
Water quality data
Water quality dataWater quality data
Water quality data
An Taisce
 
Resumen y ensayo
Resumen y ensayoResumen y ensayo
Resumen y ensayo
nasmutha1996
 
Terror alert usa_new 7-31-06
Terror alert usa_new 7-31-06Terror alert usa_new 7-31-06
Terror alert usa_new 7-31-06
macedonianinitiative
 
6 клас урок
6 клас урок6 клас урок
6 клас урок
orestznak
 
незалежне оцінювання, шляхи розв’язування
незалежне оцінювання, шляхи розв’язуваннянезалежне оцінювання, шляхи розв’язування
незалежне оцінювання, шляхи розв’язування
Тетяна Герман
 
Taking the Fear out of WAF
Taking the Fear out of WAFTaking the Fear out of WAF
Taking the Fear out of WAF
Brian A. McHenry
 
Caracteristicas del gobierno de calles
Caracteristicas del gobierno de callesCaracteristicas del gobierno de calles
Caracteristicas del gobierno de calles
fernandaximena3oC
 

Viewers also liked (13)

James S. Reinhold Resume
James S. Reinhold ResumeJames S. Reinhold Resume
James S. Reinhold Resume
 
Mafer
MaferMafer
Mafer
 
Linguagensdeprogramao 100611235520-phpapp01
Linguagensdeprogramao 100611235520-phpapp01Linguagensdeprogramao 100611235520-phpapp01
Linguagensdeprogramao 100611235520-phpapp01
 
JR_Berry_Resume_11-2015
JR_Berry_Resume_11-2015JR_Berry_Resume_11-2015
JR_Berry_Resume_11-2015
 
Ellie Goulding Digipak analysis
Ellie Goulding Digipak analysisEllie Goulding Digipak analysis
Ellie Goulding Digipak analysis
 
Crescimento da atividade turística em faro
Crescimento da atividade turística em faroCrescimento da atividade turística em faro
Crescimento da atividade turística em faro
 
Water quality data
Water quality dataWater quality data
Water quality data
 
Resumen y ensayo
Resumen y ensayoResumen y ensayo
Resumen y ensayo
 
Terror alert usa_new 7-31-06
Terror alert usa_new 7-31-06Terror alert usa_new 7-31-06
Terror alert usa_new 7-31-06
 
6 клас урок
6 клас урок6 клас урок
6 клас урок
 
незалежне оцінювання, шляхи розв’язування
незалежне оцінювання, шляхи розв’язуваннянезалежне оцінювання, шляхи розв’язування
незалежне оцінювання, шляхи розв’язування
 
Taking the Fear out of WAF
Taking the Fear out of WAFTaking the Fear out of WAF
Taking the Fear out of WAF
 
Caracteristicas del gobierno de calles
Caracteristicas del gobierno de callesCaracteristicas del gobierno de calles
Caracteristicas del gobierno de calles
 

Similar to Big Data & Hadoop

Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 
G017143640
G017143640G017143640
G017143640
IOSR Journals
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
Editor IJCATR
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
Cognizant
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
AM Publications,India
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Atul Kushwaha
 
Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy
snehal parikh
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data
Big dataBig data
Big data
revathireddyb
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
Gregg Barrett
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
David Tjahjono,MD,MBA(UK)
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
Sonal Tiwari
 
B04 06 0918
B04 06 0918B04 06 0918
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
faizrashid1995
 

Similar to Big Data & Hadoop (20)

Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
paper
paperpaper
paper
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
 

More from Krishna Sujeer

1-informatica-training
1-informatica-training1-informatica-training
1-informatica-trainingKrishna Sujeer
 
SAP REQRUITMENT NOTES03
SAP REQRUITMENT NOTES03SAP REQRUITMENT NOTES03
SAP REQRUITMENT NOTES03Krishna Sujeer
 
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,Krishna Sujeer
 
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,Krishna Sujeer
 
KEYTAKEAWAYAS Krishna Nayak v2.0 Notes
KEYTAKEAWAYAS Krishna Nayak v2.0 NotesKEYTAKEAWAYAS Krishna Nayak v2.0 Notes
KEYTAKEAWAYAS Krishna Nayak v2.0 NotesKrishna Sujeer
 
Recruitment_Process[1]
Recruitment_Process[1]Recruitment_Process[1]
Recruitment_Process[1]Krishna Sujeer
 
ETI_Krishna_Nayak_Sujeer
ETI_Krishna_Nayak_SujeerETI_Krishna_Nayak_Sujeer
ETI_Krishna_Nayak_SujeerKrishna Sujeer
 
itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)
itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)
itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)Krishna Sujeer
 
Software testing strategies
Software testing strategiesSoftware testing strategies
Software testing strategiesKrishna Sujeer
 
Software Quality Management
Software Quality ManagementSoftware Quality Management
Software Quality ManagementKrishna Sujeer
 
Basic adminstration and configuration techniques
Basic adminstration and configuration techniquesBasic adminstration and configuration techniques
Basic adminstration and configuration techniquesKrishna Sujeer
 
LSI - PMP - Training material
LSI - PMP - Training materialLSI - PMP - Training material
LSI - PMP - Training materialKrishna Sujeer
 
SAP HCM CORE MODULES V1.0
SAP HCM CORE MODULES V1.0SAP HCM CORE MODULES V1.0
SAP HCM CORE MODULES V1.0Krishna Sujeer
 
MASTER_Trainer Notes_V5.2
MASTER_Trainer Notes_V5.2MASTER_Trainer Notes_V5.2
MASTER_Trainer Notes_V5.2Krishna Sujeer
 

More from Krishna Sujeer (20)

1-informatica-training
1-informatica-training1-informatica-training
1-informatica-training
 
SAP REQRUITMENT NOTES03
SAP REQRUITMENT NOTES03SAP REQRUITMENT NOTES03
SAP REQRUITMENT NOTES03
 
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
 
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
KRISHNA NAYAKSujeer B.E(IT), MS(QM) , PGDCA,ITIL PMP,
 
KEYTAKEAWAYAS Krishna Nayak v2.0 Notes
KEYTAKEAWAYAS Krishna Nayak v2.0 NotesKEYTAKEAWAYAS Krishna Nayak v2.0 Notes
KEYTAKEAWAYAS Krishna Nayak v2.0 Notes
 
Selenium.PDF
Selenium.PDFSelenium.PDF
Selenium.PDF
 
Recruitment_Process[1]
Recruitment_Process[1]Recruitment_Process[1]
Recruitment_Process[1]
 
ETI_Krishna_Nayak_Sujeer
ETI_Krishna_Nayak_SujeerETI_Krishna_Nayak_Sujeer
ETI_Krishna_Nayak_Sujeer
 
KRISHNA_NAYAK_Sujeer
KRISHNA_NAYAK_SujeerKRISHNA_NAYAK_Sujeer
KRISHNA_NAYAK_Sujeer
 
itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)
itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)
itil2011foundation-allvolumes-signed-131020235516-phpapp01 (1)
 
Software testing strategies
Software testing strategiesSoftware testing strategies
Software testing strategies
 
Software Processes
Software ProcessesSoftware Processes
Software Processes
 
Software Quality Management
Software Quality ManagementSoftware Quality Management
Software Quality Management
 
Software Testing
Software TestingSoftware Testing
Software Testing
 
Basic adminstration and configuration techniques
Basic adminstration and configuration techniquesBasic adminstration and configuration techniques
Basic adminstration and configuration techniques
 
LSI - PMP - Training material
LSI - PMP - Training materialLSI - PMP - Training material
LSI - PMP - Training material
 
SAP HCM CORE MODULES V1.0
SAP HCM CORE MODULES V1.0SAP HCM CORE MODULES V1.0
SAP HCM CORE MODULES V1.0
 
20410B_01
20410B_0120410B_01
20410B_01
 
MASTER_Trainer Notes_V5.2
MASTER_Trainer Notes_V5.2MASTER_Trainer Notes_V5.2
MASTER_Trainer Notes_V5.2
 
1- java
1- java1- java
1- java
 

Big Data & Hadoop

  • 2. What is Big Data? Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool, rather it involves many areas of business and technology.
  • 3. What comes under Big data? Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.  Black Box Data: It is a component of helicopter, airplanes, and jets, etc. It captures voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.  Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.  Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made on a share of different companies made by the customers.  Power Grid Data: The power grid data holds information consumed by a particular node with respect to a base station.  Transport Data: Transport data includes model, capacity, distance and availability of a vehicle.
  • 4. Big data Big Data includes huge volume, high velocity, and extensible variety of data. The data in it will be of three types.  Structured data: Relational data.  Semi Structured data: XML data.  Unstructured data: Word, PDF, Text, Media Logs.
  • 5. Benefits of Big data  Using the information kept in the social network like Facebook, the marketing agencies are learning about the response for their campaigns, promotions, and other advertising mediums.  Using the information in the social media like preferences and product perception of their consumers, product companies and retail organizations are planning their production.  Using the data regarding the previous medical history of patients, hospitals are providing better and quick service.
  • 6. Operational Big data These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
  • 7. Analytical Big data These includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data. MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce that can be scaled up from single servers to thousands of high and low end machines.
  • 8. Big data challenges The major challenges associated with big data are as follows:  Capturing data  Curation  Storage  Searching  Sharing  Transfer  Analysis  Presentation
  • 10. Limitation of Traditional approach This approach works fine with those applications that process less voluminous data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of scalable data, it is a hectic task to process such data through a single database bottleneck.
  • 11. Googles solution Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns them to many computers, and collects the results from them which when integrated, form the result dataset .
  • 13. Hadoop Using the solution provided by Google, Doug Cutting and his team developed an Open Source Project called HADOOP. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with others. In short, Hadoop is used to develop applications that could perform complete statistical analysis on huge amounts of data.
  • 15. Hadoop architecture At its core, Hadoop has two major layers namely: (a) Processing/Computation layer (MapReduce), and (b) Storage layer (Hadoop Distributed File System).
  • 17. MapReduce MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an Apache open-source framework.
  • 18. HDFS The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications having large datasets. Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules:  Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.  Hadoop YARN: This is a framework for job scheduling and cluster resource management.
  • 19. How does Hadoop work? It is quite expensive to build bigger servers with heavy configurations that handle large scale processing, but as an alternative, you can tie together many commodity computers with single-CPU, as a single functional distributed system and practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop that it runs across clustered and low-cost machines. Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:  Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M (preferably 128M).  These files are then distributed across various cluster nodes for further processing.
  • 20. Advantages of Hadoop Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatic distributes the data and work across the machines and in turn, utilizes the underlying parallelism of the CPU cores.  Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.  Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.  Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since it is Java based.