Apache Hadoop
Big Data Engineering
Prepared by:
● Islam Elbanna
● Mahmoud Hanafy
Presented by:
● Ahmed Mahran
Outline
1. Introduction
2. History
3. Assumptions
4. Architecture
a. Case Study
b. MapReduce Design
c. Code Example
d. Main Modules
e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 vs MapReduce 2 (YARN)
7. Questions
Introduction
What is Hadoop?
"Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
It is designed to scale up from single servers to thousands
of machines, each providing computation and storage"
Open-source software + commodity hardware = IT cost reduction
Introduction - Cont.
Why Hadoop?
● Performance
● Storage
● Scalability
● Fault tolerance
● Cost efficiency (Commodity Machines)
Introduction - Cont.
What is Hadoop used for?
● Searching
● Log processing
● Recommendation system
● Analytics
● Video and Image analysis
Introduction - Cont.
Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …
Introduction - Cont.
Hadoop vs RDBMS
Hadoop | RDBMS
Unstructured/structured data | Structured data
Scale out | Scale up
Procedural/functional programming | Declarative queries
Offline batch processing | Online/batch transactions
Petabytes | Gigabytes
Key-value pairs | Predefined fields
Introduction - Cont.
Problem:
20+ billion web pages x 20KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk
~ Four months to read the web (Time).
~1,000 hard drives just to store the web (Storage).
Introduction - Cont.
Solution: the same problem spread over 1,000 machines takes roughly 3 to 4 hours
But we need:
● Communication and coordination
● Recovering from machine failure
● Status reporting
● Debugging
● Optimization
Distributed System
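The figures on the two slides above can be checked with a short back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes), a perfectly even split of the data across machines, and no coordination overhead. The constant and function names are illustrative. Using the upper figure of 35 MB/sec, a single machine needs about 132 days (a little over four months), and 1,000 machines need a little over three hours.

```python
# Back-of-the-envelope check of the slide figures (decimal units assumed).
PAGES = 20e9                 # 20 billion web pages
PAGE_BYTES = 20e3            # 20 KB per page
READ_BYTES_PER_SEC = 35e6    # upper end of the 30-35 MB/sec range


def total_bytes(pages=PAGES, page_bytes=PAGE_BYTES):
    """Size of the whole corpus in bytes."""
    return pages * page_bytes


def read_seconds(n_bytes, machines=1, rate=READ_BYTES_PER_SEC):
    """Time to scan n_bytes when the data is split evenly across machines."""
    return n_bytes / (machines * rate)


size = total_bytes()
print(f"corpus size:   {size / 1e12:.0f} TB")
print(f"1 machine:     {read_seconds(size) / 86400:.0f} days")
print(f"1000 machines: {read_seconds(size, machines=1000) / 3600:.1f} hours")
```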
Introduction - Cont.
Distributed systems
● Cluster of machines
● Distributed Storage
● Distributed Computing
History
● 2002-2004: Started as a sub-project of Apache Nutch.
● 2003-2004: Google published the Google File System (GFS) and MapReduce papers.
● 2004: Doug Cutting and Mike Cafarella implemented Google’s designs in Nutch.
● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team.
● 2008: Hadoop became an Apache top-level project.
Assumptions
● Hardware Failure
● Streaming Data Access
● Large Data Sets
● Simple Coherency Model
● Moving Computation is Cheaper than Moving Data
● Software Platform Portability
Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
HDFS: a reliable distributed file system that provides high-throughput access to data.
● Files are divided into blocks of 64 MB (default in Hadoop 1.x; 128 MB in Hadoop 2.x and later)
● Each block is replicated 3 times (default)
MapReduce: is a framework for performing high
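To make the HDFS defaults above concrete, the following sketch estimates how many blocks a file occupies and how much raw disk space its replicas consume. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API. It relies on the fact that HDFS does not pad the last block of a file to the full block size.

```python
def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    # Ceiling division: a partial last block still counts as one block.
    blocks = -(-file_bytes // block_bytes)
    # The last block is not padded, so raw usage is file size x replicas.
    return blocks, file_bytes * replication


blocks, raw = hdfs_footprint(1024**3)  # a 1 GiB file
print(f"{blocks} blocks, {raw / 1024**3:.0f} GiB of raw storage")
```

With the defaults, a 1 GiB file is split into 16 blocks and consumes 3 GiB of raw storage across the cluster.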
Case Study: Word Count
Problem: We need to calculate word frequencies in billions of web pages
● Input: Files with one document per record
● Output: List of words and their frequencies across all documents
Case Study: Solution
Architecture - Cont.
MapReduce Design
● Map
● Reduce
● Shuffle & Sort
Case Study: Map Phase
● Specify a map function that takes a key/value pair
key = document URL
value = document contents
● Output of map function is key/value pairs.
In our case, output(word, “1”) once per word in the document
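The map step described above can be sketched in plain Python, without Hadoop. The function name `map_word_count`, the tokenizing regular expression, and the example URL are assumptions made for illustration. The count is emitted as the integer 1 rather than the string "1" to keep the later sum simple.

```python
import re


def map_word_count(url, contents):
    """Emit one (word, 1) pair per word occurrence in a single document."""
    # The URL key is not needed for counting; only the contents are used.
    for word in re.findall(r"[a-z0-9']+", contents.lower()):
        yield word, 1


pairs = list(map_word_count("http://example.com/a", "The cat saw the dog"))
print(pairs)
```

Note that "the" is emitted twice; merging repeated words is left to the reduce phase.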
Case Study: Reduce Phase
● The MapReduce library gathers together all pairs with the same key (shuffle/sort)
● The reduce function combines the values for a key
In our case, compute the sum
● Output of reduce is one (word, total count) pair per distinct word
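The grouping and summing steps can be sketched as follows. This is an in-memory illustration, not Hadoop code; `shuffle_and_sort` and `reduce_word_count` are hypothetical names, and the input list stands in for the output of the map phase.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Group map output by key, yielding (key, [values]) in key order."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_word_count(word, counts):
    """Combine all counts for one word into a single total."""
    return word, sum(counts)


map_output = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
result = [reduce_word_count(k, v) for k, v in shuffle_and_sort(map_output)]
print(result)  # [('cat', 1), ('sat', 1), ('the', 2)]
```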
Architecture - Cont.
MapReduce Design
● Map: extract something you care about from each record.
Architecture - Cont.
MapReduce Design
● Reduce: aggregate, summarize, filter, or transform mapper output
Architecture - Cont.
MapReduce Design
Overall View:
Architecture - Cont.
MapReduce Design
● Shuffle & Sort: redirect the mapper output to the right reducer
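One common way to decide which reducer is "the right" one is to hash the key and take the result modulo the number of reducers; Hadoop's default HashPartitioner works along these lines using the key's hash code. The sketch below shows the idea in Python. CRC32 is used as a stand-in hash because Python's built-in string hash changes between processes, and the names `partition` and `route` are illustrative.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Distribute (key, value) pairs into one sorted bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Each reducer receives its share of the pairs sorted by key.
    return [sorted(bucket) for bucket in buckets]


pairs = [("the", 1), ("cat", 1), ("the", 1), ("dog", 1), ("cat", 1)]
for index, bucket in enumerate(route(pairs, 3)):
    print(index, bucket)
```

Because the reducer index depends only on the key, every occurrence of a word ends up at the same reducer, which is what allows each reducer to compute a complete total.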
Case Study: Overall View
Architecture - Cont.
MapReduce
Programmer specifies two primary methods:
map(k1, v1) → <k2, v2>
reduce(k2, list<v2>) → <k3, v3>
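The two signatures above are enough to drive a whole job. The following single-process sketch shows how a framework might call them; `run_mapreduce` is a hypothetical helper, not a Hadoop API, and it keeps everything in memory, which a real cluster would not do.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every (k1, v1), group by k2, then apply reducer."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):        # map(k1, v1) -> <k2, v2>
            groups[k2].append(v2)
    for k2 in sorted(groups):                # shuffle & sort by key
        yield from reducer(k2, groups[k2])   # reduce(k2, list<v2>) -> <k3, v3>


def word_mapper(doc_id, text):
    for word in text.split():
        yield word, 1


def sum_reducer(word, counts):
    yield word, sum(counts)


docs = [("doc1", "a b a"), ("doc2", "b c")]
print(list(run_mapreduce(docs, word_mapper, sum_reducer)))
# [('a', 2), ('b', 2), ('c', 1)]
```

Only `word_mapper` and `sum_reducer` are specific to word count; swapping them for other functions changes the job without touching the driver.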
Case Study : Code Example
Map Function
Case Study : Code Example
Reduce Function
Hadoop is not limited to Java (Hadoop Streaming)
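With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output; the reducer receives its input already sorted by key. A minimal Python sketch of word count in that style is shown below. The function names are illustrative, and the sort between the two steps is simulated locally with `sorted`.

```python
from itertools import groupby


def stream_mapper(lines):
    """Turn raw text lines into 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def stream_reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce.
mapped = sorted(stream_mapper(["the cat", "the dog"]))
for line in stream_reducer(mapped):
    print(line)
```

In an actual streaming job, each function would live in its own small script that iterates over `sys.stdin` and prints every result line.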
Architecture - Cont.
Main Modules
● File System (HDFS)
⚪ Name Node
⚪ Secondary Name Node
⚪ Data Node
● MapReduce Framework
⚪ Job Tracker
⚪ Task Tracker
Architecture - Cont.
Access Procedure
● Read From HDFS
● Write to HDFS
Architecture - Cont.
Task distribution procedure:
The JobTracker chooses the nodes that execute the tasks so as to achieve the data locality principle
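The data locality idea can be illustrated with a deliberately simplified placement rule: prefer a node that already stores a replica of the task's input block, and fall back to any node with a free slot. The function `pick_node` and its data structures are hypothetical. The real JobTracker is more involved; for example, it also takes rack locality into account.

```python
def pick_node(block_replicas, free_slots):
    """Choose a node for a map task, preferring one that holds the input block.

    block_replicas: list of nodes storing a replica of the task's input block
    free_slots:     dict mapping node name -> number of free task slots
    Returns (node, is_local), or (None, False) when no slot is free.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, True           # data-local: no network transfer
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False          # remote: the block must be fetched
    return None, False


print(pick_node(["n2", "n3"], {"n1": 1, "n2": 0, "n3": 2}))  # ('n3', True)
```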
Hadoop Modes
● Standalone
● Pseudo-Distributed
● Fully-Distributed
MapReduce 1 vs MapReduce 2 (YARN)
Questions
References
● Book “Hadoop in Action” by Chuck Lam
● Book “Hadoop: The Definitive Guide” by Tom White
● http://hadoop.apache.org/
● http://en.wikipedia.org/wiki/Apache_Hadoop
● https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
● http://www.slideshare.net/emcacademics/milind-hadoop-trainingbrazil
● http://www.slideshare.net/PhilippeJulio/hadoop-architecture
● http://www.slideshare.net/rantav/introduction-to-map-reduce
● http://www.slideshare.net/sudhakara_st/hadoop-intruduction?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=2
● http://www.slideshare.net/ZhijieShen/hadoop-summit-san-jose-2014?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=12
● http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
● http://www.slideshare.net/phobeo/introduction-to-data-processing-using-hadoop-and-pig
● http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=1
Thanks


  • 2. Apache Hadoop Big Data Engineering Prepared by: ● Islam Elbanna ● Mahmoud Hanafy Presented by: ● Ahmed Mahran
  • 3. Outlines 1. Introduction 2. History 3. Assumptions 4. Architecture a. Case Study b. MapReduce Design c. Code Example d. Main Modules e. Access Procedure 5. Hadoop Modes 6. MapReduce 1 VS MapReduce 2 (YARN) 7. Questions
  • 5. Introduction What is Hadoop? "Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage" Open Source software + Hardware commodity = IT Cost reduction
  • 6. Introduction - Cont. Why Hadoop ? ● Performance ● Storage ● Scalability ● Fault tolerance ● Cost efficiency (Commodity Machines)
  • 7. Introduction - Cont. What is Hadoop used for ? ● Searching ● Log processing ● Recommendation system ● Analytics ● Video and Image analysis
  • 8. Introduction - Cont. Who uses Hadoop ? ● Amazon ● Facebook ● Google ● IBM ● New York Times ● Yahoo ● Twitter ● LinkedIn ● …
  • 9. Introduction - Cont. Hadoop vs RDBMS (Hadoop | RDBMS): Non-structured/structured data | Structured data; Scale out | Scale up; Procedural/functional programming | Declarative queries; Offline batch processing | Online/batch transactions; Petabytes | Gigabytes; Key-value pairs | Predefined fields
  • 10. Introduction - Cont. Problem: 20+ billion web pages x 20KB = 400+ terabytes One computer can read 30-35 MB/sec from disk ~ Four months to read the web (Time). ~1,000 hard drives just to store the web (Storage).
  • 11. Introduction - Cont. Solution: same problem with 1000 machines < 3 hours But we need: ● Communication and coordination ● Recovering from machine failure ● Status reporting ● Debugging ● Optimization Distributed System
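The figures on the two slides above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes), a perfectly even split of the work across machines, and no coordination overhead. The constant and function names are illustrative, not taken from the slides.

```python
# Back-of-the-envelope check of the slide's figures (decimal units assumed).
PAGES = 20e9                 # 20+ billion web pages
PAGE_BYTES = 20e3            # 20 KB per page
READ_RATES = (30e6, 35e6)    # 30-35 MB/sec from one disk
MACHINES = 1000

total_bytes = PAGES * PAGE_BYTES  # 4e14 bytes = 400 TB


def read_seconds(total, rate, machines=1):
    """Time to scan `total` bytes when the work is split evenly across machines."""
    return total / (rate * machines)


for rate in READ_RATES:
    single_days = read_seconds(total_bytes, rate) / 86400
    cluster_hours = read_seconds(total_bytes, rate, MACHINES) / 3600
    print(f"{rate / 1e6:.0f} MB/s: {single_days:.0f} days on one machine, "
          f"{cluster_hours:.1f} hours on {MACHINES} machines")
```

Under these assumptions a single machine needs on the order of four to five months, and 1,000 machines need a few hours, which is the same order of magnitude as the slide's estimate.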
  • 12. Introduction - Cont. Distributed systems ● Cluster of machines ● Distributed Storage ● Distributed Computing
  • 17. History ● 2002-2004: Started as a sub-project of Apache Nutch. ● 2003-2004: Google published the Google File System (GFS) and MapReduce papers. ● 2004: Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch. ● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team. ● 2008: Hadoop became an Apache Top-Level Project.
  • 19. Assumptions ● Hardware Failure ● Streaming Data Access ● Large Data Sets ● Simple Coherency Model ● Moving Computation is Cheaper than Moving Data ● Software Platform Portability
  • 21. Architecture Hadoop is designed and built on two independent frameworks: Hadoop = HDFS + MapReduce. HDFS is a reliable distributed file system that provides high-throughput access to data. ● Each file is divided into blocks of 64 MB (default) ● Each block is replicated 3 times (default) MapReduce is a framework for performing high
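To make the block and replication defaults concrete, the sketch below estimates how many blocks and replicas a single file produces. It uses the slide's defaults (64 MB blocks, replication factor 3); the function name `hdfs_footprint` is hypothetical, and the calculation ignores metadata and assumes the last block only occupies its actual length.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size from the slide
REPLICATION = 3                  # default replication factor from the slide


def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = (file_bytes + block_size - 1) // block_size  # integer ceiling
    replicas = blocks * replication
    raw_bytes = file_bytes * replication  # assumes a partial last block is not padded
    return blocks, replicas, raw_bytes


one_gb = 1024 ** 3
blocks, replicas, raw_bytes = hdfs_footprint(one_gb)
print(f"1 GB file: {blocks} blocks, {replicas} block replicas, "
      f"{raw_bytes / one_gb:.0f} GB of raw storage")
```

With these defaults, a 1 GB file is split into 16 blocks and stored as 48 block replicas spread across the cluster.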
  • 23. Case Study: Word Count Problem: We need to calculate word frequencies in billions of web pages ● Input: Files with one document per record ● Output: List of words and their frequencies across all documents
  • 26. Architecture - Cont. MapReduce Design ● Map ● Reduce ● Shuffle & Sort
  • 27. Case Study: Map Phase ● Specify a map function that takes a key/value pair key = document URL value = document contents ● Output of map function is key/value pairs. In our case, output(word, “1”) once per word in the document
  • 28. Case Study: Reduce Phase ● MapReduce library gathers together all pairs with the same key (shuffle/sort) ● The reduce function combines the values for a key In our case, compute the sum ● Output of reduce will be like that
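The map and reduce phases of the word count case study can be sketched in plain Python as follows. This is an in-memory illustration, not Hadoop code: `word_count` is a hypothetical driver standing in for the MapReduce library, tokenization by whitespace is an assumption, the example URLs are made up, and the integer 1 is emitted where the slide writes “1”.

```python
from collections import defaultdict


def map_fn(url, contents):
    """Map: key = document URL, value = contents; emit (word, 1) per word."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values gathered for one key by summing them."""
    return word, sum(counts)


def word_count(documents):
    """Tiny in-memory driver standing in for the MapReduce library."""
    grouped = defaultdict(list)
    for url, contents in documents.items():
        for word, one in map_fn(url, contents):
            grouped[word].append(one)   # gather all pairs with the same key
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))


docs = {
    "http://example.com/a": "the quick brown fox",
    "http://example.com/b": "the lazy dog and the fox",
}
print(word_count(docs))
```

The user-supplied logic is confined to `map_fn` and `reduce_fn`; everything in `word_count` is work the framework would do on the programmer's behalf.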
  • 29. Architecture - Cont. MapReduce Design ● Map: extract something you care about from each record.
  • 30. Architecture - Cont. MapReduce Design ● Reduce : aggregate, summarize, filter, or transform mapper output
  • 31. Architecture - Cont. MapReduce Design Overall View:
  • 32. Architecture - Cont. MapReduce Design ● Shuffle & Sort : redirect the mapper output to the right reducer
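One way to picture shuffle and sort is as a routing step followed by a per-reducer sort and group. The sketch below is a simplified, single-process illustration: the function names are hypothetical, and `zlib.crc32` is used only as a stand-in for a stable hash (Hadoop's default partitioner follows the same hash-modulo idea, but on the key's Java hash code).

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick a reducer for a key; a stable hash sends equal keys to the same place."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group them by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:                 # one list of pairs per mapper
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))

    reducer_inputs = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))           # sort by key within each reducer
        grouped = [
            (key, [value for _, value in group])
            for key, group in groupby(bucket, key=itemgetter(0))
        ]
        reducer_inputs.append(grouped)
    return reducer_inputs


mapper_outputs = [
    [("fox", 1), ("the", 1)],
    [("the", 1), ("dog", 1)],
]
for reducer_id, grouped in enumerate(shuffle_and_sort(mapper_outputs, 2)):
    print(reducer_id, grouped)
```

Because the reducer is chosen from the key alone, every value for a given word ends up in exactly one reducer's input, which is what lets each reducer compute a complete total.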
  • 35. Architecture - Cont. MapReduce Programmer specifies two primary methods: map(k1, v1) → <k2, v2> reduce(k2, list<v2>) → <k3, v3>
  • 36. Case Study : Code Example Map Function
  • 37. Case Study : Code Example Reduce Function
  • 38. Hadoop not only JAVA (streaming)
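With Hadoop Streaming, the mapper and reducer can be any programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows the word count logic in that line-oriented style. To keep it self-contained, the functions take an iterable of lines instead of reading `sys.stdin`, and the sort between the two steps is simulated locally with `sorted`; the function names are illustrative.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on two input lines.
lines = ["the quick fox", "the lazy dog"]
for out in reducer(sorted(mapper(lines))):
    print(out)
```

In an actual streaming job, each function would be wrapped in a small script that iterates over `sys.stdin` and prints its results, and the framework would perform the sort between the two scripts.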
  • 40. Architecture - Cont. Main Modules ● File System (HDFS) ⚪ Name Node ⚪ Secondary Name Node ⚪ Data Node ● MapReduce Framework ⚪ Job Tracker ⚪ Task Tracker
  • 41. Architecture - Cont. Main Modules ● File System (HDFS) ⚪ Name Node ⚪ Secondary Name Node ⚪ Data Node
  • 42. Architecture - Cont. Main Modules ● MapReduce Framework ⚪ Job Tracker ⚪ Task Tracker
  • 44. Architecture - Cont. Access Procedure ● Read From HDFS ● Write to HDFS
  • 47. Architecture - Cont. Task distribution procedure: the JobTracker chooses the nodes that execute the tasks so as to achieve the data locality principle
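The data locality principle can be illustrated with a toy scheduler that prefers a node already holding a replica of a task's input block. This is a deliberately simplified sketch, not the JobTracker's actual algorithm: the function names, node names, and slot model are hypothetical, and it ignores refinements such as rack awareness.

```python
def pick_node(block_replicas, free_slots):
    """Prefer a node that already stores the block; otherwise use any free node.

    block_replicas: node names holding a replica of the task's input block.
    free_slots: dict mapping node name -> number of free task slots.
    Returns (node, is_data_local), or (None, False) if no slot is free.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, True           # data-local: no block transfer needed
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False          # remote: the block is read over the network
    return None, False


def assign_tasks(tasks, free_slots):
    """Greedily assign each task (task id -> replica nodes) to a node."""
    slots = dict(free_slots)            # copy so the caller's dict is untouched
    assignment = {}
    for task_id, replicas in tasks.items():
        node, local = pick_node(replicas, slots)
        if node is not None:
            slots[node] -= 1
        assignment[task_id] = (node, local)
    return assignment


tasks = {"t1": ["n1", "n2"], "t2": ["n1"], "t3": ["n1"]}
free = {"n1": 1, "n2": 1, "n3": 1}
print(assign_tasks(tasks, free))
```

Here only the first task runs data-local, because node n1 has a single free slot; the remaining tasks fall back to other nodes and would have to fetch their block over the network.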
  • 49. Hadoop Modes ● Standalone ● Pseudo-Distributed ● Fully-Distributed
  • 51. MapReduce 1 Vs MapReduce 2(YARN)
  • 54. References ● Book “Hadoop in Action” by Chuck Lam ● Book “Hadoop: The Definitive Guide” by Tom White ● http://hadoop.apache.org/ ● http://en.wikipedia.org/wiki/Apache_Hadoop ● https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/ ● http://www.slideshare.net/emcacademics/milind-hadoop-trainingbrazil ● http://www.slideshare.net/PhilippeJulio/hadoop-architecture ● http://www.slideshare.net/rantav/introduction-to-map-reduce ● http://www.slideshare.net/sudhakara_st/hadoop-intruduction?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=2 ● http://www.slideshare.net/ZhijieShen/hadoop-summit-san-jose-2014?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=12 ● http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig ● http://www.slideshare.net/phobeo/introduction-to-data-processing-using-hadoop-and-pig ● http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=1