Introduction to Apache Hadoop
Instructor
Dendej Sawarnkatat
dendej@gmail.com
Agenda
• Big Data Computation
• Introduction to Hadoop
• Hadoop Architecture
• MapReduce
• Hadoop Ecosystems
BIG DATA COMPUTATION
Traditional Approach
• Enterprise Computation
• Large data is processed by a powerful computer
Traditional Approach
• Enterprise Computation
• Big data runs into the processing limit of a powerful computer
• Only so much data could be processed
Breaking Down the Data
• Big Data is broken into pieces
Moving Computation to Data
• Concurrent Computation of Smaller Data
• Big Data → Computation → Combined result
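The idea on the last two slides (break the data into pieces, compute on the pieces concurrently, then combine the results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses threads on a single machine rather than a cluster, and the function names and the "sum" computation are hypothetical stand-ins chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(data, piece_size):
    """Break a list into consecutive pieces of at most piece_size items."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def compute_on_piece(piece):
    """Per-piece computation (hypothetical example: a partial sum)."""
    return sum(piece)


def parallel_sum(data, piece_size=4, workers=3):
    """Run the computation on every piece concurrently, then combine."""
    pieces = split_into_pieces(data, piece_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(compute_on_piece, pieces))
    # Combine the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 11))))
```

In a real cluster the pieces live on different machines and the computation is sent to them; the structure of split, compute, and combine stays the same.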
Parallel Computing ???
Fault Tolerance is a “MUST”
Parallel vs. Distributed
Distributed Computing
• The key issues involved in this solution:
• Hardware failure
• Combining the data after analysis
• Network-associated problems
CAP Theorem
• The CAP theorem (or Brewer’s theorem) describes three basic requirements of a distributed system
• Consistency: all the servers in the system have the same data
• Availability: all the servers in the system are available and return the data they have (even if it is not consistent across the system)
• Partition tolerance: the system continues to operate as a whole despite arbitrary message loss or failure of part of the system
CAP Theorem (1)
CAP Theorem (2)
• According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the “two out of three” concept).
Problems in Distributed Computing
1. Hardware failure:
• As soon as we start using many pieces of hardware, the chance that at least one will fail is fairly high.
2. Combining the data after analysis:
• Most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks.
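To make the first problem concrete, here is a small worked calculation. It assumes, for illustration only, that each disk fails independently with the same probability over some period; the 1% figure below is a hypothetical value, not a measurement. The 100 disks match the "one disk plus the other 99" in the slide.

```python
def chance_of_any_failure(per_disk_failure_prob, disk_count):
    """Probability that at least one of disk_count disks fails.

    Assumes independent failures with the same probability per disk:
    P(any failure) = 1 - (1 - p) ** n
    """
    if not 0.0 <= per_disk_failure_prob <= 1.0:
        raise ValueError("per_disk_failure_prob must be between 0 and 1")
    if disk_count < 0:
        raise ValueError("disk_count must not be negative")
    return 1.0 - (1.0 - per_disk_failure_prob) ** disk_count


if __name__ == "__main__":
    # Hypothetical numbers: 1% failure chance per disk, 1 disk vs. 100 disks.
    print(round(chance_of_any_failure(0.01, 1), 3))
    print(round(chance_of_any_failure(0.01, 100), 3))
```

Under these assumptions, a failure that is rare for one disk becomes more likely than not once 100 disks are involved, which is why fault tolerance has to be built into the system.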
WHAT IS HADOOP?
What’s Hadoop?
• “An open source software platform for
distributed storage and distributed
processing of very large data sets on
computer clusters built on commodity
hardware” - Hortonworks.
• The first problem (hardware failure) is solved by avoiding data loss through replication
• Redundant copies of the data are kept by the system so that, in the event of failure, another copy is available
What’s Hadoop? (cont’d)
• The second problem is solved by a simple programming model called MapReduce.
• Hadoop is also a highly popular open
source implementation of MapReduce
• a powerful tool designed for deep analysis
and transformation of very large data sets.
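The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in plain Python on one machine and does not use the Hadoop API; the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and chosen only for this example. The developer supplies the map and reduce functions, and the grouping step in between stands in for what the framework does.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: combine all the values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread across many machines.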
Hadoop…
• Where does the name come from?
• The “legend” says that the name comes from a toy elephant belonging to the son of Doug Cutting (one of the founders of the project).
• That is also why the logo is a smiling yellow elephant.
History
• [2002] Hadoop, created by Doug Cutting, starts out inside Apache Nutch, an open source search engine for the Web.
• Nutch is itself part of the Lucene project (full-text search engine).
• [2003] Google publishes a paper describing its own distributed file system, the Google File System (GFS).
History (1)
• [2004] The first version of NDFS (Nutch Distributed File System) implements the design from Google’s paper.
• [2004] Google publishes another paper, introducing the MapReduce algorithm.
• [2005] The first version of MapReduce is implemented in Nutch.
History (2)
• [2005 (end)] Nutch’s MapReduce is running on NDFS.
• [2006 (Feb)] Nutch’s MapReduce and NDFS become the core of a new Lucene subproject, Hadoop.
• [2008] Yahoo launches the world’s largest Hadoop production site.
Key Features
• Automatic parallelization and distribution
• Fault tolerance
• Data locality
• Developers write only the Map and Reduce functions
• Single-threaded programming model
Why Hadoop?
• Cheaper – scalable to petabytes, zettabytes, and more with commodity hardware
• Faster – parallel processing
• Better – suitable for particular types of ‘Big Data’ applications
Right Data
• LOB (Line of Business) data – not suitable
  • Transactional data
• Behavioral data – suitable
  • Web usage
  • Shopping behavior
  • etc.
Hadoop Applications
• Risk Modeling
• Customer Churn Analysis
• Recommendation Engine
• Ad Targeting
• Transaction Analysis
• Threat Analysis
• Search Quality
Who uses Hadoop?
• Facebook
• Yahoo
• Amazon
• eBay
• IBM
• New York Times
• etc.
HADOOP ARCHITECTURE
HDFS
• The Hadoop Distributed File System
• From a developer’s point of view, it looks like a standard file system
• Runs on top of the OS file system (ext3, …)
• Designed to store a very large amount of data (petabytes and beyond) and to solve some of the problems that come with DFS or NFS
• Provides fast and scalable access to the data
• Stores data reliably
HDFS under the hood
• All the files loaded in Hadoop are split into
chunks, called blocks.
• Each block has a fixed maximum size: 64 MB by default in Hadoop 1.x, 128 MB by default in Hadoop 2.x and later (the size is configurable).
• Example: MyData (~150 MB) is stored in HDFS as Blk_01 (64 MB), Blk_02 (64 MB), and Blk_03 (22 MB)
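The block arithmetic from the MyData example can be checked with a short sketch. The function name split_into_blocks is hypothetical; it only computes block sizes and does not touch a real HDFS. The 64 MB default matches the example on the slide, and the block size is passed as a parameter because it differs between Hadoop versions and configurations.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into.

    Every block is full except possibly the last one, which holds
    whatever is left over.
    """
    if file_size_mb < 0:
        raise ValueError("file_size_mb must not be negative")
    if block_size_mb <= 0:
        raise ValueError("block_size_mb must be positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


if __name__ == "__main__":
    print(split_into_blocks(150))        # [64, 64, 22]
    print(split_into_blocks(150, 128))   # [128, 22]
```

Note that the last block only occupies as much space as the data it holds, so the 150 MB file does not consume three full blocks.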
Hadoop cluster
• A Hadoop cluster consists mainly of two modules:
• A way to store distributed data: HDFS, the Hadoop Distributed File System (storage layer)
• A way to process data: MapReduce (compute layer).
Name Node (Master)
• A dedicated node where the metadata of all the files (and their blocks) in the system is stored.
• It’s the directory manager of HDFS
Data Node (Slave)
• A daemon (a service, in Windows terms) running on each cluster node.
• Responsible for storing the blocks
Accessing Data
• To access a file, a client contacts the Namenode to retrieve the list of locations for the blocks.
• With the locations, the client contacts the Datanodes to read the data (possibly in parallel).
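The two-step read described above can be modeled with a toy in-memory sketch. The classes and method names below (NameNode, DataNode, get_block_locations, read_block, read_file) are hypothetical and are not the real HDFS client API; the point is only that the Namenode hands out locations while the block contents come from the Datanodes.

```python
class DataNode:
    """Toy Datanode: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy Namenode: holds only metadata, never the file contents."""

    def __init__(self):
        # file path -> ordered list of (block id, [DataNode, ...])
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(namenode, path):
    """Step 1: ask the Namenode for locations. Step 2: read from Datanodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Read each block from the first Datanode that holds a replica.
        data += holders[0].read_block(block_id)
    return data


if __name__ == "__main__":
    dn1, dn2 = DataNode("dn1"), DataNode("dn2")
    dn1.blocks["blk_01"] = b"hello "
    dn2.blocks["blk_02"] = b"hadoop"
    nn = NameNode()
    nn.block_map["/MyData"] = [("blk_01", [dn1]), ("blk_02", [dn2])]
    print(read_file(nn, "/MyData"))
```

This sketch reads the blocks one after another for simplicity; since each block comes from its own Datanode, the reads could also be issued in parallel.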
Data Redundancy
• Hadoop keeps THREE replicas of each block by default as it is stored in HDFS.
• The location of every block is managed by the Namenode
• If a block is under-replicated (due to a failure on a node), the Namenode is smart enough to create another replica, until each block has three replicas inside the cluster
• E.g. if we have 100 TB of data to store in Hadoop, we will need 300 TB of storage space.
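Both points on this slide, the storage cost of replication and the detection of under-replicated blocks, can be sketched as follows. The function names and the node and block identifiers are hypothetical, and the check is a simplified model of the bookkeeping the Namenode performs, not its actual implementation.

```python
def raw_storage_needed(data_tb, replication=3):
    """Raw storage (TB) needed to hold data_tb of data at a replication factor."""
    return data_tb * replication


def under_replicated(block_replicas, live_nodes, replication=3):
    """Return {block id: number of missing replicas} for blocks below target.

    block_replicas maps a block id to the nodes that hold a replica;
    replicas on nodes that are not in live_nodes do not count.
    """
    missing = {}
    for block_id, nodes in block_replicas.items():
        live = len(set(nodes) & set(live_nodes))
        if live < replication:
            missing[block_id] = replication - live
    return missing


if __name__ == "__main__":
    print(raw_storage_needed(100))  # 300
    replicas = {"blk_1": ["a", "b", "c"], "blk_2": ["b", "c", "d"]}
    # Hypothetical scenario: node "a" has failed.
    print(under_replicated(replicas, live_nodes={"b", "c", "d"}))
```

In the scenario above, blk_1 has lost one replica with node "a", so one new replica would have to be created on another live node to get back to three.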
Replication Management
HDFS Architecture
HDFS Read Architecture
HDFS Write Pipeline
Hadoop 2.0: Next-gen platform
Hadoop V1 vs. Hadoop V2
Hadoop 2.0
• Store all data in one place
• Interact with data in multiple ways
Hadoop 2.x
• The new Hadoop now has FOUR modules (instead of two)
• Hadoop Common: common utilities supporting all the other modules
• HDFS: an evolution of the previous distributed FS
• Hadoop YARN: a framework for job scheduling and cluster resource management
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
Hadoop 2.x
• Hadoop v2, leveraging YARN, aims to become the new OS for data processing
Hadoop and real time
• Hadoop v2, using YARN together with Storm (a free and open source distributed real-time computation system), can process your data in real time
• Some Hadoop distributions (like Hortonworks) are working on an effortless integration
Hadoop Architecture
Hadoop Cluster Deployment
Hadoop Deployment
Namenode availability
• If the Namenode fails, the WHOLE cluster becomes inaccessible
• In the early versions the Namenode was a single point of failure
• A couple of solutions are now available:
• the Namenode stores its metadata on the network through NFS
• most production sites have two Namenodes: Active and Standby
Hadoop 3.x Features
• Support for Erasure Coding in HDFS
• YARN Timeline Service v.2
• Shell Script Rewrite
• Shaded Client Jars
• Support for Opportunistic Containers
• MapReduce Task-Level Native
Optimization
Hadoop 3.x Features
• Support for More than 2 NameNodes
• Default Ports of Multiple Services have
been Changed
• Support for Filesystem Connector
• Intra-DataNode Balancer
• Reworked Daemon and Task Heap
Management
Hadoop’s Ports
HADOOP ECOSYSTEM
Hadoop Distributions
• Open Source: Apache Hadoop
• Commercial: Cloudera, Hortonworks, MapR
• Cloud: AWS, Microsoft Azure HDInsight, DataProc
Hadoop related projects
• Pig: a high-level language for analyzing large data sets. It works as a compiler that produces M/R jobs
• Hive: data warehouse software that facilitates querying and managing large data sets with a SQL-like language
• HBase: a scalable, distributed database that supports structured data storage for large tables
• Cassandra: a scalable multi-master database
Big Data Ecosystem
Big Data Platforms
Big Data Landscape

002 Introduction to hadoop v3

  • 1. Introduction to Apache Hadoop Instructor Dendej Sawarnkatat dendej@gmail.com
  • 2. Agenda • Big Data Computation • Introduction to Hadoop • Hadoop Architecture • MapReduce • Hadoop Ecosystems 2
  • 4. Traditional Approach • Enterprise Computation 4 • Enterprise Computation Large Data Processed By Powerful computer
  • 5. Traditional Approach • Enterprise Computation 5 Big Data Processing limit Powerful computer Only so much data could be processed
  • 6. Breaking Down the Data Big Data Is broken into pieces 6
  • 7. Moving Computation to Data • Concurrent Computation of Smaller Data Big Data Combined result COMPUTATION 7
  • 9. Fault Tolerance is a “MUST” 9
  • 11. Distributed Computing • The key issues involved in this solution: • Hardware failure • Combining the data after analysis • Network-associated problems
  • 12. CAP Theorem • The CAP theorem (or Brewer’s theorem) describes three basic requirements of a distributed system • Consistency: all the servers in the system have the same data • Availability: all the servers in the system are available and return the data they have (even if it is not consistent across the system) • Partition tolerance: the system continues to operate as a whole despite arbitrary message loss or failure of a part of the system
  • 14. CAP Theorem (2) • According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the “two out of three” concept).
  • 15. Problems In Distributed Computing 1. Hardware failure: • As soon as we start using many pieces of hardware, the chance that one will fail is fairly high. 2. Combining the data after analysis: • Most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks.
  • 17. What’s Hadoop? • “An open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built on commodity hardware” - Hortonworks. • Hadoop solves the first problem by avoiding data loss through replication • Redundant copies of the data are kept by the system so that, in the event of failure, another copy is available
  • 18. What’s Hadoop? (cont’d) • The second problem is solved by a simple programming model called MapReduce. • Hadoop is also a highly popular open source implementation of MapReduce • It is a powerful tool designed for deep analysis and transformation of very large data sets.
  • 19. Hadoop… • Where does the name come from? • The “legend” says that the name comes from a toy elephant belonging to the son of Doug Cutting (one of the founders of the project). • That is also why the logo is a smiling yellow elephant.
  • 20. History • [2002] The project that leads to Hadoop is started by Doug Cutting as an open source search engine for the Web. • It has its origins in Apache Nutch, part of the Lucene project (a full-text search engine). • [2003] Google publishes a paper describing its own distributed file system, the Google File System (GFS).
  • 21. History (1) • [2004] The first version of NDFS, the Nutch Distributed FS, implements the design from Google’s paper. • [2004] Google publishes another paper, introducing MapReduce • [2005] The first version of MapReduce is implemented in Nutch
  • 22. History (2) • [2005 (end)] Nutch’s MapReduce is running on NDFS • [2006 (Feb)] Nutch’s MapReduce and NDFS become the core of a new Lucene subproject, Hadoop. • [2008] Yahoo launches the world’s largest Hadoop PRODUCTION site
  • 23. Key Features • Automatic parallelization and distribution • Fault-tolerance • Data Locality • Writing the Map and Reduce functions only • Single-threaded model
  • 24. Why Hadoop? • Cheaper – scalable to petabytes/zettabytes and more with commodity hardware • Faster – parallel processing • Better – suitable for particular types of ‘Big Data’ applications
  • 25. Right Data • LOB (Line of Business) data – not suitable: transactional data • Behavioral data – suitable: web usage, shopping behavior, etc.
  • 26. Hadoop Applications • Risk Modeling • Customer Churn Analysis • Recommendation Engine • Ad Targeting • Transaction Analysis • Threat Analysis • Search Quality
  • 27. Who uses Hadoop? • Facebook • Yahoo • Amazon • eBay • IBM • New York Times • Etc.
  • 29. HDFS • The Hadoop Distributed File System • From a developer’s point of view it looks like a standard file system • Runs on top of the OS file system (ext3, …) • Designed to store a very large amount of data (petabytes and more) and to solve some problems that come with DFS or NFS • Provides fast and scalable access to the data • Stores data reliably
  • 30. HDFS under the hood • All the files loaded in Hadoop are split into chunks, called blocks. • Each block has a fixed size, 64 MB by default here (newer versions default to 128 MB); only the last block of a file may be smaller. • Example: MyData (~150 MB) is stored in HDFS as Blk_01 (64 MB), Blk_02 (64 MB), and Blk_03 (22 MB)
  • 31. Hadoop cluster • A Hadoop cluster consists mainly of two modules: • A way to store distributed data, the HDFS or Hadoop Distributed File System (storage layer) • A way to process data, MapReduce (compute layer).
  • 32. Name Node (Master) • A dedicated node where the metadata of all the files (and their blocks) in the system is stored. • It is the directory manager of HDFS
  • 33. Data Node (Slave) • A daemon (a service, in Windows terms) running on each cluster node. • Responsible for storing the blocks
  • 34. Accessing Data • To access a file, a client contacts the Namenode to retrieve the list of locations for the blocks. • With the locations, the client contacts the Datanodes to read the data (possibly in parallel).
  • 35. Data Redundancy • By default, Hadoop keeps THREE replicas of each block stored in HDFS. • The location of every block is managed by the Namenode • If a block is under-replicated (due to a failure on a node), the Namenode is smart enough to create another replica, until each block again has three replicas inside the cluster • E.g. if we have 100 TB of data to store in Hadoop, we will need 300 TB of storage space.
  • 40. Hadoop 2.0: Next-gen platform
  • 41. Hadoop V1 vs. Hadoop V2
  • 42. Hadoop 2.0 • Store all data in one place • Interact with data in multiple ways
  • 43. Hadoop 2.x • The new Hadoop now has FOUR modules (instead of two) • Hadoop Common: common utilities supporting all the other modules • HDFS: an evolution of the previous distributed FS • Hadoop YARN: a framework for job scheduling and cluster resource management • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
  • 44. Hadoop 2.x • Hadoop v2, leveraging YARN, aims to become the new OS for data processing
  • 45. Hadoop and real time • Hadoop v2, using YARN, and Storm (a free and open source distributed real-time computation system) can process your data in real time • Some Hadoop distributions (like Hortonworks) are working on an effortless integration
  • 49. Namenode availability • If the Namenode fails, the WHOLE cluster becomes inaccessible • In the early versions the Namenode was a single point of failure • A couple of solutions are now available: • the Namenode stores its data on the network through NFS • most production sites have two Namenodes: Active and Standby
  • 50. Hadoop 3.x Features • Support for Erasure Coding in HDFS • YARN Timeline Service v.2 • Shell Script Rewrite • Shaded Client Jars • Support for Opportunistic Containers • MapReduce Task-Level Native Optimization
  • 51. Hadoop 3.x Features • Support for More than 2 NameNodes • Default Ports of Multiple Services have been Changed • Support for Filesystem Connector • Intra-DataNode Balancer • Reworked Daemon and Task Heap Management
  • 54. Hadoop Distributions • Open Source: Apache Hadoop • Commercial: Cloudera, Hortonworks, MapR • Cloud: AWS, Microsoft Azure HDInsight, DataProc
  • 55. Hadoop related projects • Pig: a high-level language for analyzing large data sets. It works as a compiler that produces M/R jobs • Hive: data warehouse software that facilitates querying and managing large data sets with a SQL-like language • HBase: a scalable, distributed database that supports structured data storage for large tables • Cassandra: a scalable multi-master database
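
The block-splitting and replication arithmetic from slides 30 and 35 can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function names split_into_blocks and raw_storage_needed are hypothetical, and the 64 MB block size and replication factor of 3 are simply the values quoted on the slides.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size would occupy.

    Hypothetical helper: every block is full-sized except possibly the last one.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left over
    return sizes


def raw_storage_needed(data_size, replication_factor=3):
    """Return the raw capacity consumed when every block is stored replication_factor times."""
    return data_size * replication_factor


# The ~150 MB MyData file from slide 30.
print(split_into_blocks(150))    # [64, 64, 22]

# The 100 TB example from slide 35 (units are whatever the input uses).
print(raw_storage_needed(100))   # 300
```

Changing block_size_mb to 128 shows how a larger block size reduces the number of blocks the Namenode has to track for the same file.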

Editor's Notes

  1. How to learn: emphasis on programming with Java. Apology for the document: it is not quite complete. Some parts are irrelevant, some were added just because they are interesting, some are missing, and some are not part of this document. Students must follow the lecture for undocumented details.
  2. Example: in the cloud, on an elastic first-level system, the service should be “stateless” or at least “soft-state” (cached) and must always respond to the query, even if the backend is down. So the system will be “A”, immediately responsive, and “P”: regardless of a failure in the backend, the system keeps responding to requests.
  3. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem. Old: Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.
  4. This is the core of Hadoop!
  5. The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads the HDFS state from an image file, fsimage, and then applies the edits from the edits log file. It then writes the new HDFS state to fsimage and starts normal operation with an empty edits file.
  6. http://hortonworks.com/blog/stream-processing-in-hadoop-yarn-storm-and-the-hortonworks-data-platform/
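
The startup sequence described in note 5 (load fsimage, replay edits, write a new fsimage, start with an empty edits log) can be sketched as a simple log replay. This is a toy model under stated assumptions: the namespace is a plain dict from path to block list, each edit is a tuple, and the names apply_edit and start_namenode are hypothetical. It does not reflect Hadoop's actual on-disk formats or APIs.

```python
def apply_edit(namespace, edit):
    """Apply one logged modification to the in-memory namespace (toy edit format)."""
    op = edit[0]
    if op == "create":
        _, path, blocks = edit
        namespace[path] = list(blocks)
    elif op == "delete":
        _, path = edit
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown edit operation: {op}")


def start_namenode(fsimage, edits):
    """Rebuild the namespace from a checkpoint plus the edits logged since then."""
    # 1. Read the last checkpointed state (copied so the input is not mutated).
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    # 2. Replay every logged modification in order.
    for edit in edits:
        apply_edit(namespace, edit)
    # 3. Write the merged state as the new checkpoint and start with an empty log.
    new_fsimage = {path: list(blocks) for path, blocks in namespace.items()}
    new_edits = []
    return namespace, new_fsimage, new_edits


fsimage = {"/logs/a.txt": ["blk_01", "blk_02"]}
edits = [("create", "/logs/b.txt", ["blk_03"]), ("delete", "/logs/a.txt")]

namespace, new_fsimage, new_edits = start_namenode(fsimage, edits)
print(namespace)   # {'/logs/b.txt': ['blk_03']}
print(new_edits)   # []
```

Appending to a log and merging it into the image only at startup keeps each individual modification cheap, at the cost of a longer replay when the edits file has grown large.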