Apache Hadoop - Big Data Engineering
2. Apache Hadoop
Big Data Engineering
Prepared by:
● Islam Elbanna
● Mahmoud Hanafy
Presented by:
● Ahmed Mahran
3. Outlines
1. Introduction
2. History
3. Assumptions
4. Architecture
a. Case Study
b. MapReduce Design
c. Code Example
d. Main Modules
e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 VS MapReduce 2 (YARN)
7. Questions
5. Introduction
What is Hadoop?
"Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
It is designed to scale up from single servers to thousands
of machines, each providing computation and storage"
Open source software + commodity hardware = IT cost reduction
7. Introduction - Cont.
What is Hadoop used for?
● Searching
● Log processing
● Recommendation systems
● Analytics
● Video and Image analysis
8. Introduction - Cont.
Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …
9. Introduction - Cont.
Hadoop vs. RDBMS
Hadoop                              | RDBMS
Unstructured/structured data        | Structured data
Scale out                           | Scale up
Procedural/functional programming   | Declarative queries
Offline batch processing            | Online/batch transactions
Petabytes                           | Gigabytes
Key-value pairs                     | Predefined fields
10. Introduction - Cont.
Problem:
20+ billion web pages x 20KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk
~ Four months to read the web (Time).
~1,000 hard drives just to store the web (Storage).
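The arithmetic behind these estimates can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes) and perfectly even splitting across disks; the function names and the 400 GB drive size used in the note below are illustrative assumptions, not figures from the slides.

```python
# Back-of-envelope estimate for reading "the web" from disk.
PAGES = 20_000_000_000          # 20 billion pages (from the slide)
PAGE_BYTES = 20_000             # 20 KB per page, decimal units (assumption)
TOTAL_BYTES = PAGES * PAGE_BYTES  # 4e14 bytes = 400 TB


def read_time_seconds(total_bytes, mb_per_sec, machines=1):
    """Seconds to read total_bytes when split evenly over `machines` disks."""
    return total_bytes / (mb_per_sec * 1_000_000 * machines)


def drives_needed(total_bytes, drive_bytes):
    """Number of drives of size drive_bytes needed to hold total_bytes."""
    return -(-total_bytes // drive_bytes)  # integer ceiling division


for rate in (30, 35):
    days = read_time_seconds(TOTAL_BYTES, rate) / 86_400
    print(f"{rate} MB/s on one machine: {days:.0f} days")
```

Calling `drives_needed(TOTAL_BYTES, 400 * 10**9)` shows what drive size the "~1,000 hard drives" figure implies, and passing `machines=1000` to `read_time_seconds` gives the idealized parallel read time, ignoring coordination overhead.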
11. Introduction - Cont.
Solution: the same problem on 1,000 machines takes roughly 3 hours
But we need:
● Communication and coordination
● Recovering from machine failure
● Status reporting
● Debugging
● Optimization
Distributed System
17. History
● 2002-2004: Started as a sub-project of Apache
Nutch.
● 2003-2004: Google published the Google File System
(GFS) and MapReduce papers.
● 2004: Doug Cutting and Mike Cafarella
implemented Google’s frameworks in Nutch.
● 2006: Yahoo hired Doug Cutting to work on
Hadoop with a dedicated team.
● 2008: Hadoop became an Apache top-level project.
19. Assumptions
● Hardware Failure
● Streaming Data Access
● Large Data Sets
● Simple Coherency Model
● Moving Computation is Cheaper than Moving Data
● Software Platform Portability
21. Architecture
Hadoop is designed and built on two independent
frameworks
Hadoop = HDFS + MapReduce
HDFS: a reliable distributed file system that provides
high-throughput access to data.
● Files are divided into 64 MB blocks (default)
● Each block is replicated 3 times (default)
MapReduce: a framework for performing high
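As a rough illustration of the two HDFS defaults above, the following sketch computes how many blocks a file occupies and how much raw disk space its replicas consume. The constants mirror the slide's defaults; the function names are hypothetical and this is not HDFS code.

```python
BLOCK_SIZE = 64 * 1024**2   # 64 MB block size (default quoted on the slide)
REPLICATION = 3             # replicas per block (default quoted on the slide)


def block_count(file_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file is split into (the last block may be partial)."""
    return -(-file_bytes // block_size)  # integer ceiling division


def raw_storage_bytes(file_bytes, replication=REPLICATION):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_bytes * replication


one_gib = 1024**3
print(block_count(one_gib), "blocks,",
      raw_storage_bytes(one_gib) // 1024**3, "GiB of raw storage")
```

A partial last block only takes up as much disk space as the data it holds, which is why the raw storage is computed from the file size rather than from the block count.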
23. Case Study: Word Count
Problem: We need to calculate word
frequencies across billions of web pages
● Input: Files with one document per
record
● Output: A list of words and their
frequencies across all documents
27. Case Study: Map Phase
● Specify a map function that takes a key/value pair
key = document URL
value = document contents
● The output of the map function is a list of key/value pairs.
In our case, emit (word, “1”) once for each word occurrence in the document
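The map step for word count might look like the following minimal sketch. It is plain Python rather than Hadoop API code; the function name, the example URL, and the lowercase/whitespace tokenization are assumptions made for illustration, and the count is emitted as the integer 1 rather than the string “1”.

```python
def map_word_count(url, contents):
    """Map function: key = document URL, value = document contents.

    Emits one (word, 1) pair per word occurrence. The URL key is not
    needed for counting, so it is ignored here.
    """
    for word in contents.lower().split():  # naive tokenization (assumption)
        yield (word, 1)


pairs = list(map_word_count("http://example.com/a", "the quick the lazy"))
print(pairs)
```

Note that the mapper does no aggregation: a word that appears twice is emitted twice, and summing is left to the reduce phase.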
28. Case Study: Reduce Phase
● The MapReduce library gathers together all pairs with the same key
(shuffle/sort)
● The reduce function combines the values for a key
In our case, compute the sum
● The reduce output is one (word, total count) pair per distinct word
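The shuffle/sort and reduce steps can be sketched in plain Python as follows. In Hadoop the framework performs the grouping across machines; here it is simulated in memory, and the function names and sample pairs are hypothetical.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and sort by key, as the framework's shuffle/sort does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_word_count(word, counts):
    """Reduce function: combine all values for one key by summing them."""
    return (word, sum(counts))


# Sample map output for a single small document (illustrative data).
mapped = [("the", 1), ("quick", 1), ("the", 1), ("lazy", 1)]
result = [reduce_word_count(word, counts) for word, counts in shuffle(mapped)]
print(result)
```

Because each key is reduced independently of the others, different keys can be handled by different reducers in parallel.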
40. Architecture - Cont.
Main Modules
● File System (HDFS)
⚪ Name Node
⚪ Secondary Name Node
⚪ Data Node
● MapReduce Framework
⚪ Job Tracker
⚪ Task Tracker
47. Architecture - Cont.
Task distribution procedure:
The JobTracker chooses the nodes that
execute the tasks so as to achieve the
data locality principle
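A simplified sketch of locality-aware task placement is shown below. It is not the JobTracker's actual scheduling code: the function name, the node names, and the slot dictionary are hypothetical, and rack-level locality is deliberately left out.

```python
def choose_node(block_replicas, free_slots):
    """Pick a node for a map task over one input block.

    block_replicas: names of the nodes that store a replica of the block.
    free_slots: dict mapping node name -> number of free task slots.
    Prefers a node that already holds the data; otherwise falls back to
    any node with a free slot, which means the block must be read remotely.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, "data-local"
    for node, slots in free_slots.items():
        if slots > 0:
            return node, "remote"
    return None, "no free slot"


free = {"node1": 0, "node2": 1, "node3": 2}
print(choose_node(["node1", "node2"], free))
print(choose_node(["node1"], free))
```

The preference order reflects the assumption listed earlier that moving computation is cheaper than moving data: a remote assignment still works, but the input block has to travel over the network.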