INTRODUCTION TO BIG DATA &
HADOOP
OUTLINE
• Data Generation Sources
• Per minute data evaluation
• What is Big Data?
• Limitations of RDBMS
• What is Hadoop?
• History of Hadoop
• Hadoop Core Components
• Hadoop Architecture
• Hadoop Ecosystem
OUTLINE
• Hadoop V1 v/s Hadoop V2
• Hadoop Distributions
• Who uses Hadoop?
• Overview of Data Lake
DATA GENERATION IN LAST FEW DECADES
DATA GENERATING NOW
‘32 BILLION DEVICES PLUGGED IN &
GENERATING DATA BY 2020’
❑The EMC Digital Universe Study launched its seventh edition.
According to the study, by 2020, the amount of data in our
digital universe is expected to grow from 4.4 trillion GB to 44
trillion GB
-11th APRIL 2014
❑According to computer giant IBM, 2.5 exabytes - that's 2.5
billion gigabytes (GB) - of data was generated every day in
2012. That's big by anyone's standards. "About 75% of data is
unstructured, coming from sources such as text, voice and
video."
-Mr. Miles
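The figures quoted above are easier to compare once they are expressed in the same unit. The sketch below is only an illustration: it assumes decimal (SI) units, where 1 GB is 10^9 bytes, and the helper name `gb_to` is made up for this example.

```python
# Decimal (SI) byte units; this sketch assumes 1 GB = 10**9 bytes.
GB = 10**9
EB = 10**18   # exabyte
ZB = 10**21   # zettabyte


def gb_to(unit_bytes: int, gigabytes: float) -> float:
    """Convert a quantity given in GB into the unit whose size is unit_bytes."""
    return gigabytes * GB / unit_bytes


# EMC Digital Universe figures quoted on the slide.
start_zb = gb_to(ZB, 4.4e12)   # 4.4 trillion GB
end_zb = gb_to(ZB, 44e12)      # 44 trillion GB
growth_factor = end_zb / start_zb

# IBM figure quoted on the slide: 2.5 billion GB per day.
daily_eb = gb_to(EB, 2.5e9)

print(f"{start_zb:.1f} ZB -> {end_zb:.1f} ZB (x{growth_factor:.0f})")
print(f"{daily_eb:.1f} EB per day")
```

Under that assumption, the EMC projection is a tenfold increase, and the IBM figure of 2.5 billion GB per day is the same quantity as 2.5 exabytes per day.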
WHAT IS BIG DATA?
Gartner: “Big Data is high-volume, high-velocity and
high-variety information assets that demand cost-effective,
innovative forms of information
processing for enhanced insight and decision
making.”
BIG DATA Volume, Velocity and Variety
 Volume
• It refers to the vast amounts of data generated every
second.
 Velocity
• Refers to the speed at which new data is generated and
the speed at which data moves around.
 Variety
• Refers to the different types of data generated from
different sources.
BIG DATA HAS ALSO BEEN DEFINED
BY THE FIVE V’s
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
BIG DATA Veracity
 Veracity
• Veracity refers to the biases, noise and abnormality in
data.
• It asks whether the data being stored and mined is
meaningful to the problem being analyzed.
BIG DATA Value
 Value
• Data has intrinsic value, but it must be discovered.
• There is a range of quantitative and investigative
techniques to derive value from Big Data.
• Technological breakthroughs make much more
accurate and precise decisions possible.
• Exploring the value in big data requires experimentation
and exploration, whether creating new products or
looking for ways to gain competitive advantage.
BIG DATA
Analysing Big Data:
● Predictive analysis
● Text analytics
● Sentiment analysis
● Image processing
● Voice analytics
● Movement analytics
● Etc.
Data Sources
● ERP
● CRM
● Inventory
● Finance
● Conversations
● Voice
● Social Media
● Browser logs
● Photos
● Videos
● Log
● Sensors
● Etc.
Turning Big Data into Value
● Volume
● Veracity
● Variety
● Velocity
● Value
Comparison Between Traditional RDBMS And Hadoop
LIMITATIONS OF RDBMS TO SUPPORT
“BIG DATA”
• Designed and structured to accommodate structured
data.
• Data size has increased tremendously, and an RDBMS finds
it challenging to handle such huge data volumes.
• Copes poorly with high velocity because it is designed for
steady data retention rather than rapid growth.
• Not designed for distributed computing.
• Many issues while scaling up for massive datasets.
• Expensive specialized hardware.
• Even if an RDBMS is used to handle and store
“Big Data,” it will turn out to be very expensive.
WHAT IS HADOOP?
• Hadoop is an open-source software framework
• Allows the distributed storage and processing of large
data sets across clusters of commodity hardware
• Uses simple programming models for processing.
• It is designed to scale up from single servers to
thousands of machines, each offering local computation
and storage.
• It provides massive storage for any kind of data,
enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs.
• Stores files in the form of blocks.
WHAT IS HADOOP?(cont)
• Hadoop is an open-source implementation of the ideas in Google's
MapReduce and GFS (Google File System).
• Hadoop was created by Doug Cutting, the creator of
Apache Lucene, the widely used text search library.
HISTORY OF HADOOP
• 2002 - Doug Cutting and Mike Cafarella start Nutch, an open-source
web search engine meant to crawl and index web pages at large scale.
• Oct 2003 - Google publishes its paper on GFS (Google File System).
• Dec 2004 - Google publishes its paper on MapReduce.
• 2005 - Nutch uses its own implementations of the GFS and MapReduce
designs to perform operations.
• 2006 - Hadoop is split out of Nutch as a separate project based on
the GFS and MapReduce designs; Doug Cutting joins Yahoo!, which backs
its development.
• 2007 - Yahoo! starts using Hadoop on a 1000-node cluster.
• Jan 2008 - Hadoop becomes a top-level Apache project.
• Jul 2008 - A 4000-node cluster is tested with Hadoop successfully.
• 2009 - Hadoop successfully sorts a petabyte of data in less than 17
hours.
HADOOP CORE COMPONENTS
HDFS (storage) and MapReduce (processing) are the two core
components of Apache Hadoop.
HDFS
• HDFS is a distributed file system that provides high-
throughput access to data.
• It provides a limited interface for managing the file system to
allow it to scale and provide high throughput.
• HDFS creates multiple replicas of each data block and
distributes them on computers throughout a cluster to enable
reliable and rapid access.
HADOOP CORE COMPONENTS
 MapReduce
• MapReduce is a framework for performing distributed data
processing using the MapReduce programming paradigm.
• Each job has a user-defined map phase and user-defined
reduce phase where the output of the map phase is aggregated.
• HDFS is the storage system for both input and output of the
MapReduce jobs.
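The map and reduce phases described above can be sketched in a few lines of plain Python. This is not Hadoop's API: it is a single-process illustration of the programming model using the usual word-count example, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are invented for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """User-defined map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """User-defined reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

In a real cluster the input lines would be read from HDFS, many map and reduce tasks would run in parallel on different machines, and the result would be written back to HDFS; here everything happens in one process so the data flow stays visible.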
HDFS OVERVIEW
• Based on Google’s GFS (Google File System)
• Provides redundant storage of massive amounts of
data
– Using commodity hardware
• Data is distributed across all nodes at load time.
• Provides for efficient Map Reduce processing.
– Operates on top of an existing filesystem.
• Files are stored as ‘Blocks’
– Each Block is replicated across several Data Nodes
• NameNode stores metadata and manages access.
• No data caching due to large datasets
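As a rough illustration of "files are stored as blocks, each block replicated across several DataNodes", the sketch below splits a file size into blocks and assigns each block to several nodes. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for the example; actual HDFS placement is rack-aware and considerably more involved.

```python
from typing import Dict, List

# Assumed values for illustration; real clusters set these in configuration.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes per block
REPLICATION = 3                  # copies kept of each block


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block a file of file_size would occupy."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Toy round-robin placement: each block goes to `replication` distinct nodes.

    Real HDFS placement is rack-aware; this only shows copies being spread out.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + i) % len(nodes)]
                               for i in range(replication)]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

With these assumed settings a 300 MB file occupies three blocks (two full blocks and one partial block), and losing any single node still leaves two copies of every block.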
HADOOP ARCHITECTURE
Master-Slave
HADOOP ARCHITECTURE
NameNode
• Stores all metadata: filenames, locations of each block
on DataNodes, file attributes, etc…
• Keeps metadata in RAM for fast lookup.
• Filesystem metadata size is limited to the amount of
available RAM on NameNode
DataNode
• Stores file contents as blocks.
• Different blocks of the same file are stored on different
DataNodes.
• Periodically sends a report of all existing blocks to the
NameNode.
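The division of labour between NameNode and DataNodes can be pictured with a toy model: the NameNode keeps only metadata in memory, and learns where blocks live from the reports DataNodes send. The class below is a hypothetical sketch for illustration (`ToyNameNode` and its methods are not part of Hadoop), and it ignores persistence, heartbeats and failure handling.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyNameNode:
    """In-memory metadata only; file contents never pass through this class."""

    def __init__(self) -> None:
        # filename -> ordered list of block ids
        self.file_blocks: Dict[str, List[str]] = {}
        # block id -> names of DataNodes that reported holding it
        self.block_locations: Dict[str, Set[str]] = defaultdict(set)

    def add_file(self, filename: str, block_ids: List[str]) -> None:
        self.file_blocks[filename] = list(block_ids)

    def receive_block_report(self, datanode: str, block_ids: List[str]) -> None:
        """Replace what is known about one DataNode with its latest full report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, filename: str) -> List[List[str]]:
        """For each block of the file, list the DataNodes reported to hold it."""
        return [sorted(self.block_locations.get(block_id, set()))
                for block_id in self.file_blocks[filename]]


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
    nn.receive_block_report("dn1", ["blk_1"])
    nn.receive_block_report("dn2", ["blk_1", "blk_2"])
    print(nn.locate("/logs/a.txt"))
```

Because both dictionaries live in RAM, the number of files and blocks such a structure can track is bounded by available memory, which is the limitation noted in the NameNode bullets above.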
COMPONENTS(DAEMONS) OF HADOOP
• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker
HADOOP ARCHITECTURE
[Diagram: a Client, a master node running the Job Tracker and Name Node, and three worker nodes each running a Task Tracker and a Data Node]
APACHE HADOOP ECOSYSTEM
(CONT.)
❑Pig
❑A high-level data-flow language and execution
framework for parallel computation.
❑Hive
❑A data warehouse infrastructure that provides data
summarization and ad hoc querying.
❑Sqoop
❑A tool designed for efficiently transferring bulk data
between Hadoop and structured datastores such as
relational databases.
❑HBase
❑A scalable, distributed database that supports structured
data storage for large tables.
HADOOP 1.0 VS HADOOP 2.0
[Figure: MR v1 compared with MR v2]
HADOOP DISTRIBUTIONS
➢Let's say we go download Apache Hadoop and
MapReduce from http://hadoop.apache.org/
➢At first it works great but then we decide to start using
HBase
➢No problem, just download HBase from
http://hadoop.apache.org/ and point to your existing
HDFS installation
➢But we find that HBase can only work with a previous
version of HDFS, so we go downgrade HDFS and
everything still works great.
➢Later on we decide to add Pig
➢Unfortunately the version of Pig doesn't work with the
version of HDFS, it wants us to upgrade
➢But if we upgrade we will break HBase.
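The dead end in this scenario can be expressed as a small search over version combinations. Everything in the sketch below is hypothetical: the version numbers and the compatibility table are invented to mirror the story, not taken from any real release notes.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Hypothetical version numbers and requirements, invented for this sketch.
AVAILABLE: Dict[str, List[str]] = {
    "hdfs": ["1.0", "2.0"],
    "hbase": ["0.9"],
    "pig": ["0.12"],
}
# (component, version) -> HDFS versions that component works with.
REQUIRES_HDFS: Dict[Tuple[str, str], Set[str]] = {
    ("hbase", "0.9"): {"1.0"},
    ("pig", "0.12"): {"2.0"},
}


def compatible_stacks(available: Dict[str, List[str]],
                      requires_hdfs: Dict[Tuple[str, str], Set[str]]
                      ) -> List[Dict[str, str]]:
    """Return every version combination in which all components accept the chosen HDFS."""
    names = sorted(available)
    stacks = []
    for combo in product(*(available[name] for name in names)):
        stack = dict(zip(names, combo))
        hdfs = stack["hdfs"]
        # A component with no entry in the table is treated as unconstrained.
        ok = all(hdfs in requires_hdfs.get((name, version), {hdfs})
                 for name, version in stack.items() if name != "hdfs")
        if ok:
            stacks.append(stack)
    return stacks


if __name__ == "__main__":
    print(compatible_stacks(AVAILABLE, REQUIRES_HDFS))
```

With this made-up table the search finds no combination that satisfies both HBase and Pig, which is exactly the situation the slide describes.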
HADOOP DISTRIBUTIONS
 Hadoop Distributions aim to resolve version
incompatibilities
 Distribution vendors will
 Integration-test a set of Hadoop products
 Package Hadoop products in various installation
formats
 Linux packages, tarballs, etc.
 Distribution may provide additional scripts to execute
Hadoop
 Some vendors may choose to backport features and
bug fixes made by Apache
 Typically vendors will employ Hadoop committers so
that the bug fixes they produce make it into the Apache repository.
DISTRIBUTION VENDORS
• Cloudera Distribution for Hadoop(CDH)
• MapR Distribution
• Hortonworks Data Platform (HDP)
• Greenplum
• IBM BigInsights
CLOUDERA DISTRIBUTION FOR
HADOOP(CDH)
• Cloudera has taken the lead on providing Hadoop
Distribution
• Cloudera is popularizing Hadoop in the enterprise in much the
same way Red Hat popularized Linux in enterprise circles
Most Popular Distribution
http://cloudera.com/hadoop
100% open-source
• Cloudera employs a large percentage of core Hadoop
committers
• CDH is provided in various formats: Linux packages,
virtual machine images and tarballs
• Integrates the majority of popular Hadoop products:
HDFS, MapReduce, HBase, Hive, Oozie, Pig, Sqoop,
ZooKeeper, Flume, etc.
SUPPORTED OPERATING SYSTEM
• Each distribution supports its own list of operating
systems
• Commonly supported operating systems
 Red Hat Enterprise
 CentOS
 Oracle Linux
 Ubuntu
 SUSE Linux Enterprise Server
WHO USES HADOOP
➢Amazon/A9
➢Facebook
➢Google
➢IBM
➢Joost
➢LinkedIn
➢New York Times
➢PowerSet
➢Yahoo!
Now It’s Our
Turn
OVERVIEW OF DATA LAKE
“A Data Lake is a large storage repository and processing
engine. Data lakes provide ‘massive storage for any kind
of data, enormous processing power and the ability to
handle virtually limitless concurrent tasks or jobs.’”
DATA LAKE
REFERENCE
• https://hadoop.apache.org/
• http://www.cloudera.com/hadoop-and-big-data.html
• http://hortonworks.com/hadoop/
• Hadoop: The Definitive Guide, 4th Edition - O'Reilly Media
Any BIGGER Question?
Thank You
Presenter
Amir R. Shaikh
Hadoop Administrator
Thank You
For
Your attention and your time
have a good day ahead
Signing off

Introduction to Big Data and Hadoop

  • 1. INTRODUCTION TO BIG DATA & HADOOP
  • 2. OUTLINE • Data Generation Sources • Per minute data evaluation • What is Big Data? • Limitations of RDBMS • What is Hadoop? • History of Hadoop • Hadoop Core Components • Hadoop Architecture • Hadoop Ecosystem
  • 3. OUTLINE • Hadoop V1 v/s Hadoop V2 • Hadoop Distributions • Who uses Hadoop? • Overview of Data Lake
  • 4. DATA GENERATION IN LAST FEW DECADES
  • 6. "32 BILLION DEVICES PLUGGED IN & GENERATING DATA BY 2020" ❑ The EMC Digital Universe Study launched its seventh edition. According to the study, by 2020 the amount of data in our digital universe is expected to grow from 4.4 trillion GB to 44 trillion GB. - 11th April 2014 ❑ According to computer giant IBM, 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. That's big by anyone's standards. "About 75% of data is unstructured, coming from sources such as text, voice and video." - Mr. Miles
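The figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, which the slide does not state; the function names are illustrative.

```python
# Assumption: decimal (SI) units, i.e. 1 GB = 10**9 bytes and 1 EB = 10**18 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def exabytes_to_gigabytes(exabytes: float) -> float:
    """Convert exabytes to gigabytes under the decimal-unit assumption."""
    return exabytes * BYTES_PER_EB / BYTES_PER_GB


def growth_factor(start: float, end: float) -> float:
    """How many times larger `end` is than `start`."""
    return end / start


daily_gb = exabytes_to_gigabytes(2.5)   # IBM figure for 2012
factor = growth_factor(4.4, 44)         # EMC figures, in trillions of GB

print(f"2.5 EB per day = {daily_gb:,.0f} GB per day")
print(f"4.4 -> 44 trillion GB is roughly a {factor:.0f}x increase")
```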
  • 8. WHAT IS BIG DATA? Gartner: “Big Data is high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
  • 9. BIG DATA Volume, Velocity and Variety ❑ Volume • Refers to the vast amounts of data generated every second. ❑ Velocity • Refers to the speed at which new data is generated and the speed at which data moves around. ❑ Variety • Refers to the different types of data generated from different sources.
  • 10. BIG DATA HAS ALSO BEEN DEFINED BY THE FIVE V’s: 1. Volume 2. Velocity 3. Variety 4. Veracity 5. Value
  • 11. BIG DATA Veracity • Veracity refers to the biases, noise and abnormality in data. • Is the data that is being stored and mined meaningful to the problem being analyzed?
  • 12. BIG DATA Value • Data has intrinsic value, but it must be discovered. • There is a range of quantitative and investigative techniques to derive value from Big Data. • Technological breakthroughs make much more accurate and precise decisions possible. • Exploring the value in Big Data requires experimentation and exploration, whether creating new products or looking for ways to gain competitive advantage.
  • 14. Turning Big Data into Value (diagram). Data Sources: ERP, CRM, Inventory, Finance, Conversations, Voice, Social Media, Browser logs, Photos, Videos, Log, Sensors, etc. Analysing Big Data: Predictive analysis, Text analytics, Sentiment analysis, Image processing, Voice analytics, Movement analytics, etc. Labels: Volume, Veracity, Variety, Velocity, Value.
  • 15. Comparison Between Traditional RDBMS And Hadoop
  • 16. LIMITATIONS OF RDBMS TO SUPPORT “BIG DATA” • Designed to accommodate structured data. • Data sizes have increased tremendously, and an RDBMS finds it challenging to handle such huge data volumes. • Struggles with high-velocity data because it is designed for steady data retention rather than rapid growth. • Not designed for distributed computing. • Many issues arise while scaling up for massive datasets. • Requires expensive specialized hardware. • Even if an RDBMS is used to handle and store “Big Data,” it will turn out to be very expensive.
  • 18. WHAT IS HADOOP? • Hadoop is an open-source software framework. • Allows the distributed storage and processing of large data sets across clusters of commodity hardware. • Uses simple programming models for processing. • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. • It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. • Stores files in the form of blocks.
  • 19. WHAT IS HADOOP? (cont.) • Hadoop is an open-source implementation of Google's MapReduce and GFS (Google File System). • Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.
  • 20. HISTORY OF HADOOP • 2002 - Doug Cutting and Mike Cafarella start the open-source Nutch project, a web search engine meant to handle billions of searches and index millions of web pages. • Oct 2003 - Google publishes its paper on GFS (Google File System). • Dec 2004 - Google publishes its paper on MapReduce. • 2005 - Nutch uses its own implementations of the GFS and MapReduce designs to perform operations. • 2006 - Hadoop is created at Yahoo! based on GFS and MapReduce (with Doug Cutting and team). • 2007 - Yahoo! starts using Hadoop on a 1000-node cluster. • Jan 2008 - Hadoop becomes a top-level Apache project. • Jul 2008 - A 4000-node cluster is tested with Hadoop successfully. • 2009 - Hadoop successfully sorts a petabyte of data in less than 17 hours.
  • 21. HADOOP CORE COMPONENTS HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop. ❑ HDFS • HDFS is a distributed file system that provides high-throughput access to data. • It provides a limited interface for managing the file system to allow it to scale and provide high throughput. • HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.
  • 22. HADOOP CORE COMPONENTS ❑ MapReduce • MapReduce is a framework for performing distributed data processing using the MapReduce programming paradigm. • Each job has a user-defined map phase and a user-defined reduce phase in which the output of the map phase is aggregated. • HDFS is the storage system for both the input and the output of MapReduce jobs.
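The map and reduce phases described on this slide can be illustrated with a word count. The following is a single-process Python sketch of the paradigm only; it does not use Hadoop's actual API, and the names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """User-defined map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key; in a real cluster the framework does this."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """User-defined reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence, in one process."""
    intermediate = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = run_job(["big data needs hadoop", "hadoop stores big files"])
print(word_counts)
```

In a real job the input lines would be read from HDFS and the result written back to it; here both are ordinary Python objects.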
  • 23. HDFS OVERVIEW • Based on Google’s GFS (Google File System). • Provides redundant storage of massive amounts of data using commodity hardware. • Data is distributed across all nodes at load time. • Provides for efficient MapReduce processing. • Operates on top of an existing filesystem. • Files are stored as ‘blocks’; each block is replicated across several DataNodes. • NameNode stores metadata and manages access. • No data caching, due to large datasets.
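How a file becomes replicated blocks can be sketched as follows. The 128 MB block size, the replication factor of 3, the DataNode names and the round-robin placement are all assumptions for illustration; HDFS's real placement policy is more involved and is not reproduced here.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size of each block; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, data_nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative placement only, not the policy HDFS actually uses.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block_id: [data_nodes[(block_id + i) % len(data_nodes)]
                   for i in range(replication)]
        for block_id in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)        # assumed block size
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"],
                           replication=3)             # assumed replication factor

for block_id, nodes in placement.items():
    print(f"block {block_id} ({blocks[block_id] // MB} MB) -> {nodes}")
```

Because every block lives on several nodes, losing one DataNode in this sketch still leaves at least two copies of each block.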
  • 25. HADOOP ARCHITECTURE ❑ NameNode • Stores all metadata: filenames, locations of each block on DataNodes, file attributes, etc. • Keeps metadata in RAM for fast lookup. • Filesystem metadata size is limited to the amount of available RAM on the NameNode. ❑ DataNode • Stores file contents as blocks. • Different blocks of the same file are stored on different DataNodes. • Periodically sends a report of all existing blocks to the NameNode.
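The division of labour between NameNode and DataNodes can be modelled with a toy in-memory class. This is a sketch under stated assumptions: `ToyNameNode` and its methods are hypothetical and do not correspond to Hadoop classes; the point is only that metadata lives in memory and block locations are learned from DataNode reports.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ToyNameNode:
    """In-memory metadata only: file -> blocks, block -> DataNodes holding it."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[str]] = {}
        self.block_locations: Dict[str, Set[str]] = defaultdict(set)

    def add_file(self, path: str, block_ids: List[str]) -> None:
        """Record which blocks make up a file."""
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, data_node: str, block_ids: List[str]) -> None:
        """Replace everything known about one DataNode with its latest report."""
        for locations in self.block_locations.values():
            locations.discard(data_node)
        for block_id in block_ids:
            self.block_locations[block_id].add(data_node)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Return each block of a file with the DataNodes currently holding it."""
        return [(block_id, sorted(self.block_locations.get(block_id, set())))
                for block_id in self.file_blocks[path]]


name_node = ToyNameNode()
name_node.add_file("/logs/app.log", ["blk_1", "blk_2"])
name_node.receive_block_report("dn1", ["blk_1"])
name_node.receive_block_report("dn2", ["blk_1", "blk_2"])
locations = name_node.locate("/logs/app.log")
print(locations)
```

A client in this model asks the NameNode where the blocks are and would then read the block contents from the DataNodes themselves.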
  • 26. COMPONENTS(DAEMONS) OF HADOOP • NameNode • DataNode • Secondary NameNode • JobTracker • TaskTracker
  • 27. HADOOP ARCHITECTURE (diagram): Client; Job Tracker and Name Node; three worker nodes, each running a Task Tracker and a Data Node.
  • 29. (CONT.) ❑ Pig: A high-level data-flow language and execution framework for parallel computation. ❑ Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. ❑ Sqoop: A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. ❑ HBase: A scalable, distributed database that supports structured data storage for large tables.
  • 30. HADOOP 1.0 VS HADOOP 2.0
  • 31. MR V1 MR v2
  • 32. HADOOP DISTRIBUTIONS ➢Let's say we go download Apache Hadoop and MapReduce from http://hadoop.apache.org/ ➢At first it works great but then we decide to start using HBase ➢No problem, just download HBase from http://hadoop.apache.org/ and point to your existing HDFS installation ➢But we find that HBase can only work with a previous version of HDFS, so we go downgrade HDFS and everything still works great. ➢Later on we decide to add Pig ➢Unfortunately the version of Pig doesn't work with the version of HDFS, it wants us to upgrade ➢But if we upgrade we will break HBase.
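The conflict in this scenario amounts to an empty intersection of supported versions. The sketch below makes that concrete; the version labels "old" and "new" and the compatibility sets are invented for illustration and do not describe real HBase or Pig releases.

```python
from typing import Dict, Optional, Set


def compatible_hdfs_versions(requirements: Dict[str, Set[str]]) -> Set[str]:
    """Return the HDFS versions that every listed component can work with."""
    common: Optional[Set[str]] = None
    for supported in requirements.values():
        common = set(supported) if common is None else common & supported
    return common or set()


# Hypothetical compatibility data mirroring the story on the slide.
requirements = {
    "MapReduce": {"old", "new"},
    "HBase": {"old"},   # HBase only works with the previous HDFS version
    "Pig": {"new"},     # Pig wants the upgraded HDFS version
}

without_pig = {name: versions for name, versions in requirements.items()
               if name != "Pig"}

print("Before adding Pig:", compatible_hdfs_versions(without_pig))
print("After adding Pig: ", compatible_hdfs_versions(requirements))
```

An empty result means no single HDFS version satisfies every component, which is the situation the scenario ends in.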
  • 33. HADOOP DISTRIBUTIONS • Hadoop distributions aim to resolve version incompatibilities. • Distribution vendors will integration-test a set of Hadoop products and package them in various installation formats (Linux packages, tarballs, etc.). • Distributions may provide additional scripts to execute Hadoop. • Some vendors may choose to backport features and bug fixes made by Apache. • Typically vendors employ Hadoop committers, so the bugs they find and fix make it into the Apache repository.
  • 34. DISTRIBUTION VENDORS • Cloudera Distribution for Hadoop(CDH) • MapR Distribution • Hortonworks Data Platform (HDP) • Greenplum • IBM BigInsights
  • 35. CLOUDERA DISTRIBUTION FOR HADOOP (CDH) • Cloudera has taken the lead on providing a Hadoop distribution. • Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles. • Most popular distribution: http://cloudera.com/hadoop • 100% open-source. • Cloudera employs a large percentage of core Hadoop committers. • CDH is provided in various formats: Linux packages, virtual machine images and tarballs. • Integrates the majority of popular Hadoop products: HDFS, MapReduce, HBase, Hive, Oozie, Pig, Sqoop, ZooKeeper, Flume, etc.
  • 36. SUPPORTED OPERATING SYSTEMS • Each distribution supports its own list of operating systems. • Commonly supported OSes: Red Hat Enterprise Linux, CentOS, Oracle Linux, Ubuntu, SUSE Linux Enterprise Server.
  • 38. OVERVIEW OF DATA LAKE A Data Lake is a large storage repository and processing engine. Data lakes provide “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.”
  • 40. REFERENCE • https://hadoop.apache.org/ • http://www.cloudera.com/hadoop-and-big-data.html • http://hortonworks.com/hadoop/ • Hadoop: The Definitive Guide, 4th Edition - O'Reilly Media
  • 42. Thank You. Presenter: Amir R. Shaikh, Hadoop Administrator. Thank you for your attention and your time. Have a good day ahead.