SlideShare a Scribd company logo
Cloud Computing
Hwajung Lee
Key Reference:
Prof. Jong-Moon Chung’s Lecture Notes at Yonsei University
Cloud Computing
• Cloud Introduction
• Cloud Service Model
• Big Data
• Hadoop
• MapReduce
• HDFS (Hadoop Distributed File System)
MapReduce
MapReduce
Hadoop
• Hadoop is a Reliable Shared Storage and Analysis System
• Hadoop = HDFS + MapReduce + α
- HDFS provides Data Storage
- HDFS: Hadoop Distributed FileSystem
- MapReduce provides Data Analysis
- MapReduce = Map Function + Reduce Function
MapReduce
Scaling Out
• Scaling out is done by the DFS (Distributed FileSystem),
where the data is divided and stored in distributed computers
& servers
• Hadoop uses HDFS to move the MapReduce computation to
several distributed computing machines that will process a
part of the divided data assigned
MapReduce
Jobs
• MapReduce job is a unit of work that needs to be executed
• Job types: Data input, MapReduce program, Configuration
Information, etc.
• Job is executed by dividing it into one of two types of tasks
• Map Task
• Reduce Task
MapReduce
Node types for Job execution
• Job execution is controlled by 2 types of nodes
• Jobtracker
• Tasktracker
• Jobtracker coordinates all jobs
• Jobtracker schedules all tasks and assigns the tasks to tasktrackers
MapReduce
• Tasktracker will execute its assigned task
• Tasktracker will send a progress reports to the Jobtracker
• Jobtracker will keep a record of the progress of all jobs executed
MapReduce
Data flow
• Hadoop divides the input into input splits (or splits) suitable
for the MapReduce job
• Split has a fixed-size
• Split size is commonly matched to the size of a HDFS block
(64 MB) for maximum processing efficiency
MapReduce
Data flow
• Map Task is created for each split
• Map Task executes the map function for all records within the split
• Hadoop commonly executes the Map Task on the node where the
input data resides
MapReduce
Data flow
• Data-Local Map Task
• Data locality optimization
does not need to use the cluster network
• Data-local flow process shows why the
Optimal Split Size = 64 MB HDFS Block Size
MapReduce
• Rack-Local Map Task
• A node hosting the
HDFS block replicas for
a map task’s input split
could be running other map tasks
• Job Scheduler will look for a free map slot on
a node in the same rack as one of the blocks
Map Task
HDFS Block
Node
Rack
Data Center
Data flow
MapReduce
• Off-Rack Map Task
• Needed when the
Job Scheduler
cannot perform data-local or rack-local map tasks
• Uses inter-rack network transfer
Data flow
MapReduce
Map
• Map task will write its output to the local disk
• Map task output is not the final output, it is only the
intermediate output
Reduce
• Map task output is processed by Reduce Tasks to produce the final
output
• Reduce Task output is stored in HDFS
• For a completed job, the Map Task output can be discarded
MapReduce
Single Reduce Task
• Node includes Split, Map, Sort, and Output unit
• Light blue arrows show data transfers in a node
• Black arrows show data transfers between nodes
MapReduce
Single Reduce Task
• Number of reduce tasks is specified
independently, and is not based on
the size of the input
MapReduce
Combiner Function
• User specified function to run on the Map output
 Forms the input to the Reduce function
• Specifically designed to minimize the data transferred between
Map Tasks and Reduce Tasks
• Solves the problem of limited network speed on the cluster
and helps to reduce the time in completing MapReduce jobs
MapReduce
Multiple Reducer
• Map tasks partition their output, each creating one partition for
each reduce task
• Each partition may use many keys and key associated
values
• All records for a key are kept in a single partition
MapReduce
Multiple Reducers
• Shuffle process is used in the data flow
between the Map tasks and Reduce tasks
Shuffle
MapReduce
Zero Reducer
• Zero reducer uses
no shuffle process
• Applied when all of the
processing can be carried
out in parallel Map tasks
HDFS
HDFS
Hadoop
• Hadoop is a Reliable Shared Storage and Analysis System
• Hadoop = HDFS + MapReduce + α
- HDFS provides Data Storage
- HDFS: Hadoop Distributed FileSystem
- MapReduce provides Data Analysis
- MapReduce = Map Function + Reduce
Function
HDFS
HDFS: Hadoop Distributed FileSystem
• DFS (Distributed FileSystem) is designed for storage
management of a network of computers
• HDFS is optimized to store large terabyte size files with
streaming data access patterns
HDFS
HDFS: Hadoop Distributed FileSystem
• HDFS was designed to be optimal in performance for a WORM
(Write Once, Read Many times) pattern
• HDFS is designed to run on clusters of general computers
& servers from multiple vendors
HDFS
HDFS Characteristics
• HDFS is optimized for large scale and high throughput data
processing
• HDFS does not perform well in supporting applications that
require minimum delay (e.g., tens of milliseconds range)
HDFS
Blocks
• Files in HDFS are divided into block size chunks
 64 Megabyte default block size
• Block is the minimum size of data that it can read or write
• Blocks simplifies the storage and replication process
 Provides fault tolerance & processing speed enhancement
for larger files
HDFS
HDFS
• HDFS clusters use 2 types of nodes
• Namenode (master node)
• Datanode (worker node)
HDFS
Namenode
• Manages the filesystem namespace
• Namenode keeps track of the datanodes that have blocks of
a distributed file assigned
• Maintains the filesystem tree and the metadata for all the files
and directories in the tree
• Stores on the local disk using 2 file forms
• Namespace Image
• Edit Log
HDFS
Namenode
• Namenode holds the filesystem metadata in its memory
• Namenode’s memory size determines the limit to the
number of files in a filesystem
• But then, what is Metadata?
HDFS
Metadata
• Traditional concept of the library card catalogs
• Categorizes and describes the contents and context of the data
files
• Maximizes the usefulness of the original data file by making it
easy to find and use
HDFS
Metadata Types
• Structural Metadata
• Focuses on the data structure’s design and specification
• Descriptive Metadata
• Focuses on the individual instances of application data or
the data content
HDFS
• Datanodes
• Workhorse of the filesystem
• Store and retrieve blocks when requested by the client or the
namenode
• Periodically reports back to the namenode with lists of blocks that
were stored
HDFS
Client Access
• Client can access the filesystem (on behalf of the user) by
communicating with the namenode and datanodes
• Client can use a filesystem interface (similar to a POSIX
(Portable Operating System Interface)) so the user code does
not need to know about the namenode and datanodes to
function properly
HDFS
Namenode Failure
• Namenode keeps track of the datanodes that have blocks of a
distributed file assigned
 Without the namenode, the filesystem cannot be used
• If the computer running the namenode malfunctions then
reconstruction of the files (from the blocks on the datanodes)
would not be possible
 Files on the filesystem would be lost
HDFS
Namenode Failure Resilience
• Namenode failure prevention schemes
1. Namenode File Backup
2. Secondary Namenode
HDFS
Namenode File Backup
• Back up the namenode files that form the persistent state of
the filesystem’s metadata
• Configure the namenode to write its persistent state to
multiple filesystems
 Synchronous and atomic backup
• Common backup configuration
 Copy to Local Disk and Remote FileSystem
HDFS
Secondary Namenode
• Secondary namenode does not act the same way as the
namenode
• Secondary namenode periodically merges the namespace image
with the edit log to prevent the edit log from becoming too large
• Secondary namenode usually runs on a separate computer to
perform the merge process because this requires significant
processing capability and memory
HDFS
Hadoop 2.x Release Series HDFS Reliability Enhancements
• HDFS Federation
• HDFS HA (High-Availability)
HDFS
HDFS Federation
• Allows a cluster to scale by adding namenodes
• Each namenode manages a
namespace volume and a block pool
• Namespace volume is made up of the metadata for the namespace
• Block pool contains all the blocks for the files in the namespace
HDFS
•HDFS Federation
• Namespace volumes are all independent
• Namenodes do not communicate with each other
• Failure of a namenode is also independent to other namenodes
• A namenode failure does not influence the availability of
another namenode’s namespace
HDFS
HDFS High-Availability
• Pair of namenodes (Primary & Standby) are set to be in Active-
Standby configuration
• Secondary namenode stores the latest edit log entries and an
up-to-date block mapping
• When the primary namenode fails, the standby namenode
takes over serving client requests
HDFS
HDFS High-Availability
• Although the active-standby namenode can takeover operation
quickly (e.g., few tens of seconds), to avoid unnecessary
namenode switching, standby namenode activation will be
executed after a sufficient observation period
(e.g., approximately a minute or a few minutes)
• V. Mayer-Schönberger, and K. Cukier, Big data: A revolution that will transform how we live,
work, and think. Houghton Mifflin Harcourt, 2013.
• T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2012.
• J. Venner, Pro Hadoop. Apress, 2009.
• S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics
and the Path From Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2,
Winter 2011.
• B. Randal, R. H. Katz, and E. D. Lazowska, "Big-data Computing: Creating
revolutionary breakthroughs in commerce, science and society," Computing
Community Consortium, pp. 1-15, Dec. 2008.
• G. Linden, B. Smith, and J. York. "Amazon.com Recommendations: Item-to-Item
Collaborative Filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb. 2003.
References
• J. R. GalbRaith, "Organizational Design Challenges Resulting From Big Data,"
Journal of Organization Design, vol. 3, no. 1, pp. 2-13, Apr. 2014.
• S. Sagiroglu and D. Sinanc, “Big data: A review,” Proc. IEEE International Conference on
Collaboration Technologies and Systems, pp. 42-47, May 2013.
• M. Chen, S. Mao, and Y
. Liu, “Big Data: A Survey,” Mobile Networks and
Applications, vol. 19, no. 2, pp. 171-209, Jan. 2014.
• X. Wu, X. Zhu, G. Q. Wu, and W. Ding, ‘‘Data Mining with Big Data,’’ IEEE Transactions on
Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, Jan. 2014.
• Z. Zheng, J. Zhu, and M. R. Lyu, ‘‘Service-Generated Big Data and Big Data-as-a- Service:
An Overview,’’ Proc. IEEE International Congress on Big Data, pp. 403– 410, Jun/Jul. 2013.
References
• I. Palit and C.K. Reddy, “Scalable and Parallel Boosting with MapReduce,” IEEE Transactions
on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904-1916, 2012.
• M.-Y Choi, E.-A. Cho, D.-H. Park, C.-J Moon, and D.-K. Baik, “A Database Synchronization
Algorithm for Mobile Devices,” IEEE Transactions on Consumer Electronics, vol. 56, no. 2,
pp. 392-398, May 2010.
• IBM, What is big data?, http://www.ibm.com/software/data/bigdata/what-is-big-data.html
[Accessed June 1, 2015]
• Hadoop Apache, http://hadoop.apache.org
• Wikipedia, http://www.wikipedia.org
Image sources
• Walmart Logo, By Walmart [Public domain], via Wikimedia Commons
• Amazon Logo, By Balajimuthazhagan (Own work) [CC BY-SA 3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
References

More Related Content

Similar to Lecture10_CloudServicesModel_MapReduceHDFS.pptx

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
Arvind Kumar
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Roushan Sinha
 
Hadoop
HadoopHadoop
Hadoop
avnishagr
 
Big Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxBig Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptx
ssuser8c3ea7
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
Jay Nagar
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
Data Con LA
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf
Manoel Ribeiro
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
HADOOP.pptx
HADOOP.pptxHADOOP.pptx
HADOOP.pptx
Bharathi567510
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
SakthiVinoth78
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Scheduling scheme for hadoop clusters
Scheduling scheme for hadoop clustersScheduling scheme for hadoop clusters
Scheduling scheme for hadoop clusters
Amjith Singh
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
DanishMahmood23
 

Similar to Lecture10_CloudServicesModel_MapReduceHDFS.pptx (20)

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxBig Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptx
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
HADOOP.pptx
HADOOP.pptxHADOOP.pptx
HADOOP.pptx
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Scheduling scheme for hadoop clusters
Scheduling scheme for hadoop clustersScheduling scheme for hadoop clusters
Scheduling scheme for hadoop clusters
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 

Recently uploaded

Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
dxobcob
 
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptxTOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
nikitacareer3
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
Divyam548318
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
aqil azizi
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
nooriasukmaningtyas
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
anoopmanoharan2
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
manasideore6
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 

Recently uploaded (20)

Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
 
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptxTOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 

Lecture10_CloudServicesModel_MapReduceHDFS.pptx

  • 1. Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung’s Lecture Notes at Yonsei University
  • 2. Cloud Computing • Cloud Introduction • Cloud Service Model • Big Data • Hadoop • MapReduce • HDFS (Hadoop Distributed File System)
  • 4. MapReduce Hadoop • Hadoop is a Reliable Shared Storage and Analysis System • Hadoop = HDFS + MapReduce + α - HDFS provides Data Storage - HDFS: Hadoop Distributed FileSystem - MapReduce provides Data Analysis - MapReduce = Map Function + Reduce Function
  • 5. MapReduce Scaling Out • Scaling out is done by the DFS (Distributed FileSystem), where the data is divided and stored in distributed computers & servers • Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines that will process a part of the divided data assigned
  • 6. MapReduce Jobs • MapReduce job is a unit of work that needs to be executed • Job types: Data input, MapReduce program, Configuration Information, etc. • Job is executed by dividing it into one of two types of tasks • Map Task • Reduce Task
  • 7. MapReduce Node types for Job execution • Job execution is controlled by 2 types of nodes • Jobtracker • Tasktracker • Jobtracker coordinates all jobs • Jobtracker schedules all tasks and assigns the tasks to tasktrackers
  • 8. MapReduce • Tasktracker will execute its assigned task • Tasktracker will send a progress reports to the Jobtracker • Jobtracker will keep a record of the progress of all jobs executed
  • 9. MapReduce Data flow • Hadoop divides the input into input splits (or splits) suitable for the MapReduce job • Split has a fixed-size • Split size is commonly matched to the size of a HDFS block (64 MB) for maximum processing efficiency
  • 10. MapReduce Data flow • Map Task is created for each split • Map Task executes the map function for all records within the split • Hadoop commonly executes the Map Task on the node where the input data resides
  • 11. MapReduce Data flow • Data-Local Map Task • Data locality optimization does not need to use the cluster network • Data-local flow process shows why the Optimal Split Size = 64 MB HDFS Block Size
  • 12. MapReduce • Rack-Local Map Task • A node hosting the HDFS block replicas for a map task’s input split could be running other map tasks • Job Scheduler will look for a free map slot on a node in the same rack as one of the blocks Map Task HDFS Block Node Rack Data Center Data flow
  • 13. MapReduce • Off-Rack Map Task • Needed when the Job Scheduler cannot perform data-local or rack-local map tasks • Uses inter-rack network transfer Data flow
  • 14. MapReduce Map • Map task will write its output to the local disk • Map task output is not the final output, it is only the intermediate output Reduce • Map task output is processed by Reduce Tasks to produce the final output • Reduce Task output is stored in HDFS • For a completed job, the Map Task output can be discarded
  • 15. MapReduce Single Reduce Task • Node includes Split, Map, Sort, and Output unit • Light blue arrows show data transfers in a node • Black arrows show data transfers between nodes
  • 16. MapReduce Single Reduce Task • Number of reduce tasks is specified independently, and is not based on the size of the input
  • 17. MapReduce Combiner Function • User specified function to run on the Map output  Forms the input to the Reduce function • Specifically designed to minimize the data transferred between Map Tasks and Reduce Tasks • Solves the problem of limited network speed on the cluster and helps to reduce the time in completing MapReduce jobs
  • 18. MapReduce Multiple Reducer • Map tasks partition their output, each creating one partition for each reduce task • Each partition may use many keys and key associated values • All records for a key are kept in a single partition
  • 19. MapReduce Multiple Reducers • Shuffle process is used in the data flow between the Map tasks and Reduce tasks Shuffle
  • 20. MapReduce Zero Reducer • Zero reducer uses no shuffle process • Applied when all of the processing can be carried out in parallel Map tasks
  • 21. HDFS
  • 22. HDFS Hadoop • Hadoop is a Reliable Shared Storage and Analysis System • Hadoop = HDFS + MapReduce + α - HDFS provides Data Storage - HDFS: Hadoop Distributed FileSystem - MapReduce provides Data Analysis - MapReduce = Map Function + Reduce Function
  • 23. HDFS HDFS: Hadoop Distributed FileSystem • DFS (Distributed FileSystem) is designed for storage management of a network of computers • HDFS is optimized to store large terabyte size files with streaming data access patterns
  • 24. HDFS HDFS: Hadoop Distributed FileSystem • HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern • HDFS is designed to run on clusters of general computers & servers from multiple vendors
  • 25. HDFS HDFS Characteristics • HDFS is optimized for large scale and high throughput data processing • HDFS does not perform well in supporting applications that require minimum delay (e.g., tens of milliseconds range)
  • 26. HDFS Blocks • Files in HDFS are divided into block size chunks  64 Megabyte default block size • Block is the minimum size of data that it can read or write • Blocks simplifies the storage and replication process  Provides fault tolerance & processing speed enhancement for larger files
  • 27. HDFS HDFS • HDFS clusters use 2 types of nodes • Namenode (master node) • Datanode (worker node)
  • 28. HDFS Namenode • Manages the filesystem namespace • Namenode keeps track of the datanodes that have blocks of a distributed file assigned • Maintains the filesystem tree and the metadata for all the files and directories in the tree • Stores on the local disk using 2 file forms • Namespace Image • Edit Log
  • 29. HDFS Namenode • Namenode holds the filesystem metadata in its memory • Namenode’s memory size determines the limit to the number of files in a filesystem • But then, what is Metadata?
  • 30. HDFS Metadata • Traditional concept of the library card catalogs • Categorizes and describes the contents and context of the data files • Maximizes the usefulness of the original data file by making it easy to find and use
  • 31. HDFS Metadata Types • Structural Metadata • Focuses on the data structure’s design and specification • Descriptive Metadata • Focuses on the individual instances of application data or the data content
  • 32. HDFS • Datanodes • Workhorse of the filesystem • Store and retrieve blocks when requested by the client or the namenode • Periodically reports back to the namenode with lists of blocks that were stored
  • 33. HDFS Client Access • Client can access the filesystem (on behalf of the user) by communicating with the namenode and datanodes • Client can use a filesystem interface (similar to a POSIX (Portable Operating System Interface)) so the user code does not need to know about the namenode and datanodes to function properly
  • 34. HDFS Namenode Failure • Namenode keeps track of the datanodes that have blocks of a distributed file assigned  Without the namenode, the filesystem cannot be used • If the computer running the namenode malfunctions then reconstruction of the files (from the blocks on the datanodes) would not be possible  Files on the filesystem would be lost
  • 35. HDFS Namenode Failure Resilience • Namenode failure prevention schemes 1. Namenode File Backup 2. Secondary Namenode
  • 36. HDFS Namenode File Backup • Back up the namenode files that form the persistent state of the filesystem’s metadata • Configure the namenode to write its persistent state to multiple filesystems  Synchronous and atomic backup • Common backup configuration  Copy to Local Disk and Remote FileSystem
  • 37. HDFS Secondary Namenode • Secondary namenode does not act the same way as the namenode • Secondary namenode periodically merges the namespace image with the edit log to prevent the edit log from becoming too large • Secondary namenode usually runs on a separate computer to perform the merge process because this requires significant processing capability and memory
  • 38. HDFS Hadoop 2.x Release Series HDFS Reliability Enhancements • HDFS Federation • HDFS HA (High-Availability)
  • 39. HDFS HDFS Federation • Allows a cluster to scale by adding namenodes • Each namenode manages a namespace volume and a block pool • Namespace volume is made up of the metadata for the namespace • Block pool contains all the blocks for the files in the namespace
  • 40. HDFS •HDFS Federation • Namespace volumes are all independent • Namenodes do not communicate with each other • Failure of a namenode is also independent to other namenodes • A namenode failure does not influence the availability of another namenode’s namespace
  • 41. HDFS HDFS High-Availability • Pair of namenodes (Primary & Standby) are set to be in Active- Standby configuration • Secondary namenode stores the latest edit log entries and an up-to-date block mapping • When the primary namenode fails, the standby namenode takes over serving client requests
  • 42. HDFS HDFS High-Availability • Although the active-standby namenode can takeover operation quickly (e.g., few tens of seconds), to avoid unnecessary namenode switching, standby namenode activation will be executed after a sufficient observation period (e.g., approximately a minute or a few minutes)
  • 43. • V. Mayer-Schönberger, and K. Cukier, Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, 2013. • T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2012. • J. Venner, Pro Hadoop. Apress, 2009. • S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path From Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, Winter 2011. • B. Randal, R. H. Katz, and E. D. Lazowska, "Big-data Computing: Creating revolutionary breakthroughs in commerce, science and society," Computing Community Consortium, pp. 1-15, Dec. 2008. • G. Linden, B. Smith, and J. York. "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb. 2003. References
  • 44. • J. R. GalbRaith, "Organizational Design Challenges Resulting From Big Data," Journal of Organization Design, vol. 3, no. 1, pp. 2-13, Apr. 2014. • S. Sagiroglu and D. Sinanc, “Big data: A review,” Proc. IEEE International Conference on Collaboration Technologies and Systems, pp. 42-47, May 2013. • M. Chen, S. Mao, and Y . Liu, “Big Data: A Survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, Jan. 2014. • X. Wu, X. Zhu, G. Q. Wu, and W. Ding, ‘‘Data Mining with Big Data,’’ IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, Jan. 2014. • Z. Zheng, J. Zhu, and M. R. Lyu, ‘‘Service-Generated Big Data and Big Data-as-a- Service: An Overview,’’ Proc. IEEE International Congress on Big Data, pp. 403– 410, Jun/Jul. 2013. References
  • 45. • I. Palit and C.K. Reddy, “Scalable and Parallel Boosting with MapReduce,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904-1916, 2012. • M.-Y Choi, E.-A. Cho, D.-H. Park, C.-J Moon, and D.-K. Baik, “A Database Synchronization Algorithm for Mobile Devices,” IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 392-398, May 2010. • IBM, What is big data?, http://www.ibm.com/software/data/bigdata/what-is-big-data.html [Accessed June 1, 2015] • Hadoop Apache, http://hadoop.apache.org • Wikipedia, http://www.wikipedia.org Image sources • Walmart Logo, By Walmart [Public domain], via Wikimedia Commons • Amazon Logo, By Balajimuthazhagan (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons References