Architecture, benefits and challenges of Hadoop
Kirti Jayadevan
Introduction to Big Data Concepts, Technologies and deployment
Alakh Verma
2-28-2016
Abstract: Hadoop is designed to scale from a single server up to thousands of machines. Unlike
a relational database management system, it stores data across a distributed system by
partitioning the data and executing computation in parallel. This paper gives an overview of
the architecture, benefits and challenges of Hadoop and compares it with the Lambda
architecture and Spark.
Hadoop is an open-source Apache framework used to process large datasets. It is built
around a distributed file system whose servers communicate over TCP-based protocols.
Servers can be added to a cluster dynamically without interrupting service (Shvachko et al.,
2010, p. 1). HDFS (Hadoop Distributed File System) is the file system component of
Hadoop.
The HDFS architecture includes a single name node and multiple data nodes. The name
node stores the file system metadata. An HDFS client first contacts the name node to learn
where the data is located and then contacts the nearest data node to access it. File content is
split into blocks, typically 128 MB each, and each block is replicated on three data nodes by
default (Shvachko et al., 2010, p. 1). On a data node, each block replica is represented by two
files: one contains the data itself and the second stores the block's metadata. At startup, each
data node performs a handshake with the name node, which verifies the data node's
namespace ID and software version. A data node that passes the handshake registers with the
name node. After registration, the data node sends a block report, an up-to-date view of the
block replicas it holds, and repeats it every hour. In addition, data nodes send heartbeats every
3 seconds, which tell the name node that the data node is operating and that its block replicas
are available (Shvachko et al., 2010, p. 2).
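The heartbeat mechanism described above can be illustrated with a small sketch. This is not HDFS code: the `HeartbeatMonitor` class, its method names and the timeout value are hypothetical and chosen only for illustration. The 3-second interval comes from the text; the timeout after which a silent node is treated as dead is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 3.0   # interval given in the text
DEFAULT_TIMEOUT_S = 600.0    # assumed timeout; not stated in the text


@dataclass
class HeartbeatMonitor:
    """Hypothetical name-node-side tracker of data node liveness."""
    timeout_s: float = DEFAULT_TIMEOUT_S
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node_id: str, now: float) -> None:
        # Remember when this data node last reported in.
        self.last_seen[node_id] = now

    def live_nodes(self, now: float) -> List[str]:
        # A node counts as live if its last heartbeat is within the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_s)

    def dead_nodes(self, now: float) -> List[str]:
        # A node counts as dead once it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


# Short timeout so the example is easy to follow.
monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.record_heartbeat("dn1", now=0.0)
monitor.record_heartbeat("dn2", now=0.0)
for i in range(1, 5):                      # dn1 keeps reporting, dn2 goes silent
    monitor.record_heartbeat("dn1", now=i * HEARTBEAT_INTERVAL_S)

print(monitor.live_nodes(now=12.0))
print(monitor.dead_nodes(now=12.0))
```

In this run only `dn1` is still considered live at time 12, because `dn2` has been silent for longer than the 10-second example timeout.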
The name node replies to heartbeats with instructions to the data node, such as replicating
blocks to other nodes, removing local block replicas, or shutting down the node (Shvachko et
al., 2010, p. 2). To protect the file system metadata, a name node can alternatively be run in
the role of a checkpoint node or a backup node. The checkpoint node periodically combines the
existing checkpoint and journal into a new checkpoint, the persistent record of the files and
directories in the namespace that is written to disk (Shvachko et al., 2010, p. 3). Together
with block replication, this means Hadoop does not depend on specialized hardware for fault
tolerance. To avoid data corruption during system upgrades, the name node creates a
snapshot that saves the current state of the file system and instructs the data nodes, during
the handshake, to create local snapshots (Shvachko et al., 2010, p. 4).
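As a rough illustration of how a name node could decide which instructions to attach to its heartbeat replies, the sketch below compares the replicas reported for each block against the default replication factor of three. The function name `plan_replication`, the command tuples and the block and node IDs are hypothetical; the real HDFS logic also considers rack placement and other factors not modelled here.

```python
from typing import Dict, Set, Tuple

REPLICATION_FACTOR = 3  # default replication factor mentioned in the text


def plan_replication(
    block_locations: Dict[str, Set[str]],
    live_nodes: Set[str],
    target: int = REPLICATION_FACTOR,
) -> Dict[str, Tuple[str, int]]:
    """Return a hypothetical command per block that is off its target.

    block_locations maps a block ID to the data nodes that reported a replica.
    Only replicas on live nodes are counted.
    """
    commands: Dict[str, Tuple[str, int]] = {}
    for block_id, holders in sorted(block_locations.items()):
        live_replicas = len(holders & live_nodes)
        if live_replicas < target:
            # Under-replicated: ask for the missing number of copies.
            commands[block_id] = ("replicate", target - live_replicas)
        elif live_replicas > target:
            # Over-replicated: ask for the surplus copies to be removed.
            commands[block_id] = ("remove", live_replicas - target)
    return commands


reported = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2"},
    "blk_3": {"dn1", "dn2", "dn3", "dn4"},
}
print(plan_replication(reported, live_nodes={"dn1", "dn2", "dn3", "dn4"}))
```

With all four nodes alive, `blk_2` needs one more replica and `blk_3` has one too many; if `dn3` stopped sending heartbeats, `blk_1` would become under-replicated as well.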
Applications communicate with HDFS through the HDFS client, which references files and
directories by their paths in the namespace. User programs use the MapReduce framework,
a programming model developed by Google, to handle distributed computation over large
datasets. When the user program calls the MapReduce function, the library splits the input
files into pieces and uses a master-worker arrangement to distribute the work on those pieces
(Dean & Ghemawat, 2004, p. 4). The master assigns tasks and tracks their progress, while the
workers execute the tasks they are given. The master assigns map tasks to workers, which
read their input split and store the intermediate key-value pairs on local disk. The locations
of these pairs are passed back to the master, which in turn forwards them to the workers
assigned reduce tasks; those workers read the intermediate data from the given locations
(Dean & Ghemawat, 2004, p. 4). The reduce workers then sort the intermediate data by key,
apply the reduce function, and append the results to the output file. Once all map and reduce
tasks are completed, the master wakes up the user program (Dean & Ghemawat, 2004, p. 4).
Running many map and reduce tasks in parallel lets a job work through large datasets
quickly. The Hadoop ecosystem also includes other tools and frameworks, such as HBase,
Pig, Hive and Avro, for data access and data serialization (Shvachko et al., 2010, p. 1).
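The map, shuffle and reduce steps described above can be sketched in a single process with a word count, the usual introductory MapReduce example. This is a minimal illustration, not the Hadoop API: the function names are hypothetical, each string in `chunks` stands in for one input split handled by a map task, and everything runs sequentially rather than on separate workers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(chunk: str) -> Iterator[Tuple[str, int]]:
    # Map step: emit an intermediate (word, 1) pair for every word.
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle step: group intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce step: combine all values for one key.
    return key, sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    intermediate: List[Tuple[str, int]] = []
    for chunk in chunks:                      # one "map task" per input split
        intermediate.extend(map_words(chunk))
    grouped = shuffle(intermediate)
    # Keys are processed in sorted order, as reduce workers sort by key.
    return dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))


result = run_job(["big data big", "data node"])
print(result)
```

In a real cluster the map and reduce calls would run on different workers, and the intermediate pairs would be written to local disk rather than kept in a list.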
Spark, developed at UC Berkeley, is faster than Hadoop MapReduce on iterative
workloads. It provides a programming interface for in-memory data mining on clusters.
Spark uses resilient distributed datasets (RDDs), which enable data reuse across
computations and allow in-memory computation with low latency (Zaharia et al., 2011). The
Lambda architecture, proposed by Nathan Marz, outlines a framework in which Hadoop fits.
It is designed to provide fault tolerance and scalability without interrupting service, and its
batch, speed and serving layers are used in big data technologies. A distributed file system
such as HDFS holds the dataset for the batch layer, and the views computed from it can be
queried with low latency. The speed layer deals with click-stream or other recent data.
Hadoop does not cover the speed layer, and it is not ACID compliant.
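The division of work between the batch, speed and serving layers can be sketched with a page-view counter. This is an illustrative toy under stated assumptions, not a reference implementation of the Lambda architecture: the function and class names, the event shape (`{"page": ...}`) and the idea of answering a query by adding the batch count to the recent count are all choices made for this example.

```python
from collections import Counter
from typing import Dict, List


def build_batch_view(master_dataset: List[Dict[str, str]]) -> Counter:
    # Batch layer: recompute the view from the complete stored dataset.
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Hypothetical speed layer keeping counts for events not yet in a batch view."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def ingest(self, event: Dict[str, str]) -> None:
        # Update the recent view incrementally as each event arrives.
        self.realtime_view[event["page"]] += 1


def query(page: str, batch_view: Counter, realtime_view: Counter) -> int:
    # Serving side: merge the precomputed batch view with the recent view.
    return batch_view[page] + realtime_view[page]


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
batch_view = build_batch_view(master)

speed = SpeedLayer()
speed.ingest({"page": "/home"})      # a click that arrived after the batch run

print(query("/home", batch_view, speed.realtime_view))
print(query("/docs", batch_view, speed.realtime_view))
```

In this sketch a Hadoop job would play the role of `build_batch_view`, while the speed layer would need a separate stream-processing component.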
References:
1. Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop Distributed File
System. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and
Technologies. Yahoo.
2. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large
Clusters. OSDI 2004, Operating Systems Design and Implementation Conference.
3. Lambda Architecture. MapR Technologies. Retrieved from:
https://www.mapr.com/developercentral/lambda-architecture
4. Hadoop Introduction. Retrieved from:
http://www.tutorialspoint.com/hadoop/hadoop_introduction.htm
5. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J.,
Shenker, S., & Stoica, I. (2011). Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing. Electrical Engineering and Computer
Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2011-82.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
