SlideShare a Scribd company logo
1 of 4
Download to read offline
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 553
Unraveling the Data Structures of Big data, the HDFS Architecture and
Importance of Data Replication in HDFS
Aseema Sultana1
1Asst.Professor, Dept. of Information Science and Engineering, HKBK CE, Bangalore, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – Big Data is one of the most challenging aspectsin
today's world filled with emerging information.Thefactthatit
leads us to be dealt with massive amounts of overwhelming
raw data creates new opportunities in the said field. The term
Big Data describes extremely large volumesofdatathatcomes
from varied sources in different formats with high velocity.
Here, the amount of data is not what is important, but what
can be done with the data is more important. In this process,
useful information is fetched from the massive amount ofdata
and is analyzed. The analyzing part itself is a very challenging
and demanding piece of work as it demands fault forbearing,
amenable and ascendible systems that work upon the said
data. The ultimate goal here is to make the process of
managing and maintaining data a convenient and feasible
task. This paper provides a review on Big Data, its data
structures, details of the HDFS architecture, and replicationof
data files within HDFS.
Key Words: Big Data, Structured, Semi-structured,
Quasi-structured, Un-structured Data, Hadoop, HDFS
Architecture, Data Replication, Space Reclamation.
1. INTRODUCTION
What is Big Data? Big Data is not just large volumes of data,
but also data with high variety and high velocity. In other
words, Big Data is huge data in different forms being
generated at high speed which is difficult to manage or
process using any existing tools and architectures. Let us
first see what the input data into big data systemsis, it could
be transactions from a bank, web server logs, chatter from
Twitter or Facebook or any other social network,MP3songs,
vlogs or any broadcasted videos, content of web pages,
personal documents, medical reports, and many more to
count, as mentioned in [1].
According to [2], in the 1990s the data volume was
measured in terabytes. In subsequent decade the data
volume was being measured in petabytes. And the decadeof
2010s is dealing with the exponential data volume drivenby
variety and many sources of digitized data. Almost
everybody and everything is leaving a digital footprint. Due
to the volume in data, it has been considered to start
measuring data in Exabyte.
Some applications that generate a lot of data are shown in
Fig -1. In the said figure we can see how the amount of
information is evolving in the graph w.r.t every ten years.
Data is increasing and exploding decade after decade and so
are the sources for Big Data.
Fig -1: The rise of Big Data sources and the Evolution of
data
2. DATA STRUCTURES OF BIG DATA
Big Data comes in many forms; it can be Structured, Semi-
Structured,Quasi-Structuredor Unstructured asin[3].Fig-2
shows the Data structure types and the data growth
accordingly.
Fig -2: The pyramid of data structures of big data
2.1 Structured Data
Structured data is data that comes with clarity,
description, presentation, explicate length or format. This
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 554
type of data can be easily re-organized or processed by any
data mining tools. One could envision this type of data as a
perfectly organized book library where books are orderly
arranged,labeled and are easily accessible. Hencethistypeof
data can be effectively used by organizations. Few examples
of structured data include numbers, dates, etc., it is believed
that this type of data comprises about 15-20% of data
present in the world. In Fig-3, we can see one such example
representing structured data. We can see data from an excel
sheet that displays clearly formatted information such as
First Name, Last Name and Gender of a few people.
2.2 Semi-Structured Data
This type of data is loosely defined or irregularly
structured which allowsa schema-lessdescription format in
which the data is less constrained than is usuallyin database
work. Semi-structured data model allows information from
several sources,with related butdifferentproperties,tobefit
together in one whole, thus suitable for integration of
databases and sharing information on the Web. Semi-
structured data is data that may be irregular or incomplete
and have a structure that may change rapidly or
unpredictably. It generally has some structure, but does not
conform to a fixed schema. It is Schema-less and self-
describing,i.e., data carriesinformationaboutitsownschema
(e.g., in terms of XML element tags). The main attraction for
semi-structured data is that structure can be imposed upon
the data as in [4]. Fig -3 shows an example for semi-
structureddata;here we can see the XMLdocument that was
fetched from the ‘view page source’ option while viewing a
web page.
2.3 Quasi-Structured Data
This data is not among the commonly mentionedtypesof
data. It is the one in between the semi-structured and
unstructured data. In other words, Quasi-structured data is
data which is unstructured but is in an erratic format which
can be handled with special tools. For example, consider
visiting the website www.google.com, the URL (Uniform
Resource Locator) is tracked into log files of the webpage's
server; the user’s computer activity is being monitored. The
URL defines a clickstream that can be parsed and mined by
data scientists to discover usage patterns and uncover
relationships between clicks and areas of interest on
websites, as mentioned in [2]. This is illustrated in Fig -3
where the user types “HKBK College of Engineering” in the
search option in Google. Upon a search, the clickstream
“https://www.google.co.in/search?q=hkbk+college+of+enginee
ring&rlz=1C1CHBD_enIN761IN761&oq=HKBK+col&aqs=chro
me.0.0j69i57j69i61l3j0.3550j0j8&sourceid=chrome&ie=UTF-8”
is generated which shall be parsed and mined.
2.4 Un-Structured Data
Un-Structured data sometimes referred to as messy or
unorganizedraw data is primarily in the formof text and can
also be found in forms of multimedia files. This type of data
doesnothave a formal structure,inotherwordsthestructure
is unidentifiable. Word processing documents, Books,
Journals, Health documents, e-mail messages, videos,
presentations, webpages, photos, audio filesandmany other
kinds of business documents are examples of unstructured
data. As we know that structured data is information with a
high degree of organization, unstructured data is essentially
the opposite. To make use of this data, the massive data that
was worthless is identified and stored in an organized
fashion. Through the use of specialized software, the items
can then be searched through. However, there are two
aspects to be considered, firstly, the process of converting
unstructureddata to organized datawouldbecostlyandtime
consuming. Secondly, not all types of unstructured data can
easily be converted into a structured model. Information
aging is another factor that needs to be considered in this
case where the data keeps accumulating day after day but is
left unused. One example of un-structured data is illustrated
in Fig -3 in which a video about Big Data from YouTube is
streamed.
2.5 Heterogeneous Data
This is a combination of all 4 types of data. For example,
say we have a system that stores call logs for a call-center. In
this case, data such as date and time stamp could represent
the structured type. On the other hand, the customer chat
historycouldrepresent the quasi-orsemi-structureddata.So
much information can be extracted from the available set of
the dataofdifferent structures.Figurebelowshowsexamples
of all four forms of data.
Fig -3: Illustration of data structures of big data
3. HADOOP AND HDFS ARCHITECTURE
Hadoop is an open source software framework used to
process big data. It was created by Doug Cuttingand Michael
J. Cafarella as in [5]. Hadoop was developed for parallel and
distributedprocessing systemsandwasinitiallymotivatedin
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 555
a paper by Google. As we know that managing big data is a
near to impossible task for the existing traditional systems,
this was made possible using Hadoop. With Hadoop, there is
no limit on the storage of data and the process of computing
huge data generated in high speeds is moving from
traditional databases to distributed systems.
Fig -4: The HDFS Architecture and Data Replication
HDFS (Hadoop Distributed File system) is a highly fault
tolerant parallel file systemwith amaster/slavearchitecture
whose objective includes storing and managing huge data
sets, managing failures of hardware, data access and
simplifying simple data coherency issues as in [6].
The HDFS architecture is asshown in Fig -4.Thefigureshows
the HDFS master/slave architecture. Each cluster consists of
exactlyone NameNode which is amasterserverthatmanages
accesses to files w.r.t clients and also manages file system
namespace. There can be n number of DataNodesdepending
upon the number of nodes in the cluster. The DataNodes
manage storage of files. HDFS allows user data to be stored
into files usually having the files itself split into more than
one blocks and in turn these blocks stored in one or more
DataNodes. The responsibility of NameNode is to execute
operations like Opening, Closing or renaming files. On the
other hand, the responsibility of DataNodes is to serve the
read and write requests from the clients. In addition, the
NameNode also instructs the DataNodes to perform block
creation, cloning and deletion as per the need.
The NameNode periodically receives a Heartbeat and a
Blockreport from each of the DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is
functioning properly. A Blockreport contains a list of all
blocks on a DataNode. The NameNode and DataNodes are
designed to run on commodity hardware that runs a
GNU/Linux operating system, as mentioned in [6]. HDFS is
built using the Java language; therefore, any machine that
supports Java can run the NameNode or the DataNode
software.
3.1 Data Cloning
Why is data replication required? The answer to this
question relies on the reliability and performance of HDFS.
Havingreplicasinthe DataNodesis what distinguishesHDFS
from other parallel or distributed file systems. The main
purpose of data replication is to better develop data
availability, data reliability and fault tolerance. HDFS is
designed to store files in a very reliable way using data
replication in the clusters. Each file is stored asa sequence of
blocks. The replication factor for each file and the block size
are configurable asper each file. The above figurealsoshows
how each file could be replicatedindifferentDataNodes.Here
we can see that the replication factor can be specified during
the time of file creation. The replication factor need not be
restricted since it can be changed later. All files in HDFShave
strictly one writer at any point of time and can be written
only once. Alldecisions regarding cloningof blocksare made
by the NameNode.
3.2 Data Replica Placement
Instances of HDFS run on a cluster of computers that
spread across a number of racks. Switches are used for
communication between two nodes in different racks. The
rack id to which each DataNode needs to belong to is
determined by the NameNode. Usuallythereplicasareplaced
in DataNodes on unique racks. This way data loss can be
prevented even if an entire rack fails since data can be
fetched from another rack that holds the replica. Hence the
replicas can be evenly distributed in the cluster. This policy
eases balancing component failure but increases cost of
writes as each write needs a transfer of replica blocks to
another rack.
For instance, when a file has a replication factor of 3,
HDFS’s placement policy is to put one replica on one node in
the local rack, another on a node in a different (remote) rack,
and the laston a different node in the same remote rack. This
policy cuts the inter-rack write traffic which generally
improveswrite performance.The chanceof rack failureisfar
lessthan thatof node failure; this policy doesnotimpactdata
reliability and availability guarantees. However, it does
reduce the aggregatenetwork bandwidth usedwhenreading
data since a block is placed in only two unique racks rather
than three. With this policy,thereplicasof a filedonotevenly
distribute across the racks. One third of replicas are on one
node, two thirds of replicas are on one rack, and the other
third are evenly distributed acrossthe remaining racks. This
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 556
policy improves write performance without compromising
data reliability or read performance as in [6].
3.3 Data Selection
When a read requestis made, HDFStriesto satisfy itfrom
the replica that is nearest to the reader. The replica, if exists,
on the same rack as the requestor, is preferred to satisfy the
request. Else, if the cluster spans multiple data centers, then
the replica residing at the local data center is chosen over a
remote replica.
3.4 Space Redeeming
Traditionally when we delete a file or a folder in our system
on windows, the file is moved to recycle bin. This means the
files space is not freed yet. File deletes in HDFS work
similarly. When a file is deleted by an applicationor a user, it
is notdeleted from HDFS immediately. Uponadelete,firstthe
file is renamed and moved to the /trash directory. The file
resides inthe trashdirectory for aconfiguredamountoftime.
After the time expires, the file is then deleted permanently
from the HDFS namespace by the NameNode.
The deletion of a file causes all block spaces to be freed that
are associated with the deleted file. However, the file can be
restored while it still resides in the trash directory. To
undelete a file, the deleted file must be restored from the
trash directory. As per [6], the current default policy is to
delete files permanently from trash that are more than 6
hours old.
4. CONCLUSION AND FUTURE WORK
HDFS should support use of snapshots that can be helpfulin
storing a copy of data at a particular instance of time. This
can be particularly helpful to rollback a corrupted HDFS
instance to a previously known point of time when the
system worked without issues. There is a potential for
making faster advancements in scientific discipline for
analyzing large amountsof data .The technicalchallengesare
most common across the large variety of application
domains, therefore new cost effective and faster methods
must be implemented to analyze the big data.
ACKNOWLEDGEMENT
I thank Dr. Syed Mustafa, Professor and Head, Departmentof
Information Science and Engineering, HKBK CE for
motivating me and providing their expertise that greatly
assisted in writing the paper.
I am also grateful to the authors of papers and documents I
have referenced. I would like to express my appreciation to
the authors for sharing their pearls of wisdom.
REFERENCES
[1] Big Data Now: 2012 Edition by O’Reilly Media, Inc.
[2] Master Thesis on Tools and Methods for Big Data
Analysis, by Miroslav Vozábal- 2016, University of West
Bohemia.
[3] Research Paper on Big Data and Hadoop, Iqbaldeep
Kaur, Navneet Kaur, Amandeep Ummat, Jaspreet Kaur,
Navjot Kaur. IJCST Vol. 7, ISSue 4, Oct - Dec 2016.
[4] Adding Structure to Unstructured Data, Database
Research Group (CIS), University of Pennsylvania
ScholarlyCommons. 1997.
[5] HADOOP ARCHITECTURE AND FAULT TOLERANCE
BASED HADOOP CLUSTERS IN GEOGRAPHICALLY
DISTRIBUTED DATA CENTER, T. Cowsalya and S.R.
Mugunthan, ARPN Journal of Engineering and Applied
Sciences.
[6] The HDFS Architecture guide,
https://hadoop.apache.org/.
[7] Big Data And Hadoop: A Review Pape, Rahul Beakta,
https://www.researchgate.net/publication/
BIOGRAPHIES
Prof. Aseema Sultana was born in
Bangalore, India. She received the B.E
degree in ISE from HKBK CE, in 2010,
and M.Tech in CSE from MVJ CE, in2012.
She worked as a Sr. Project Engineer in
Wipro (2012-2016) and currently
working as an Asst. Professor in HKBK
CE.1’st Author
Photo

More Related Content

What's hot

Information economics and big data
Information economics and big dataInformation economics and big data
Information economics and big dataMark Albala
 
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...DATAVERSITY
 
DOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEDOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEijsptm
 
Identifying and analyzing the transient and permanent barriers for big data
Identifying and analyzing the transient and permanent barriers for big dataIdentifying and analyzing the transient and permanent barriers for big data
Identifying and analyzing the transient and permanent barriers for big datasarfraznawaz
 
Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects IJMER
 
Implementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageImplementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageIOSR Journals
 
Data set module 1
Data set   module 1Data set   module 1
Data set module 1Data-Set
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...IJCSIS Research Publications
 
Implementation of Sentimental Analysis of Social Media for Stock Prediction ...
Implementation of Sentimental Analysis of Social Media for Stock  Prediction ...Implementation of Sentimental Analysis of Social Media for Stock  Prediction ...
Implementation of Sentimental Analysis of Social Media for Stock Prediction ...IRJET Journal
 
A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data IJECEIAES
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...ijdpsjournal
 
Data warehouse,data mining & Big Data
Data warehouse,data mining & Big DataData warehouse,data mining & Big Data
Data warehouse,data mining & Big DataRavinder Kamboj
 
No sql databases new millennium database for big data, big users, cloud compu...
No sql databases new millennium database for big data, big users, cloud compu...No sql databases new millennium database for big data, big users, cloud compu...
No sql databases new millennium database for big data, big users, cloud compu...eSAT Publishing House
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data scienceJohnson Ubah
 

What's hot (19)

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Data Management
Data ManagementData Management
Data Management
 
Information economics and big data
Information economics and big dataInformation economics and big data
Information economics and big data
 
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
 
DOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEDOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCE
 
Identifying and analyzing the transient and permanent barriers for big data
Identifying and analyzing the transient and permanent barriers for big dataIdentifying and analyzing the transient and permanent barriers for big data
Identifying and analyzing the transient and permanent barriers for big data
 
Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects Influence of Hadoop in Big Data Analysis and Its Aspects
Influence of Hadoop in Big Data Analysis and Its Aspects
 
1
11
1
 
Implementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageImplementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record Linkage
 
Data set module 1
Data set   module 1Data set   module 1
Data set module 1
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
 
Implementation of Sentimental Analysis of Social Media for Stock Prediction ...
Implementation of Sentimental Analysis of Social Media for Stock  Prediction ...Implementation of Sentimental Analysis of Social Media for Stock  Prediction ...
Implementation of Sentimental Analysis of Social Media for Stock Prediction ...
 
A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
 
Data warehouse,data mining & Big Data
Data warehouse,data mining & Big DataData warehouse,data mining & Big Data
Data warehouse,data mining & Big Data
 
No sql databases new millennium database for big data, big users, cloud compu...
No sql databases new millennium database for big data, big users, cloud compu...No sql databases new millennium database for big data, big users, cloud compu...
No sql databases new millennium database for big data, big users, cloud compu...
 
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 

Similar to IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and Importance of Data Replication in HDFS

BDA Mod1@AzDOCUMENTS.in.pdf
BDA Mod1@AzDOCUMENTS.in.pdfBDA Mod1@AzDOCUMENTS.in.pdf
BDA Mod1@AzDOCUMENTS.in.pdfJayanthSram
 
Comparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big dataComparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big dataIRJET Journal
 
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISCASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISIRJET Journal
 
IRJET - Big Data Analysis its Challenges
IRJET - Big Data Analysis its ChallengesIRJET - Big Data Analysis its Challenges
IRJET - Big Data Analysis its ChallengesIRJET Journal
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageIRJET Journal
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
An Overview of Data Lake
An Overview of Data LakeAn Overview of Data Lake
An Overview of Data LakeIRJET Journal
 
Big data service architecture: a survey
Big data service architecture: a surveyBig data service architecture: a survey
Big data service architecture: a surveyssuser0191d4
 
Data modeling techniques used for big data in enterprise networks
Data modeling techniques used for big data in enterprise networksData modeling techniques used for big data in enterprise networks
Data modeling techniques used for big data in enterprise networksDr. Richard Otieno
 
Week 4 Lecture 1 - Databases and Data WarehousesManagement of .docx
Week 4 Lecture 1 - Databases and Data WarehousesManagement of .docxWeek 4 Lecture 1 - Databases and Data WarehousesManagement of .docx
Week 4 Lecture 1 - Databases and Data WarehousesManagement of .docxjessiehampson
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformIRJET Journal
 
Data set Introduction to Big Data
Data set   Introduction to Big DataData set   Introduction to Big Data
Data set Introduction to Big DataData-Set
 
IRJET- Survey of Big Data with Hadoop
IRJET-  	  Survey of Big Data with HadoopIRJET-  	  Survey of Big Data with Hadoop
IRJET- Survey of Big Data with HadoopIRJET Journal
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxJethroDignadice2
 
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...IRJET Journal
 
Managing Data Strategically
Managing Data StrategicallyManaging Data Strategically
Managing Data StrategicallyMichael Findling
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesIJTET Journal
 
Data Science: A Revolution of Data
Data Science: A Revolution of DataData Science: A Revolution of Data
Data Science: A Revolution of DataIRJET Journal
 

Similar to IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and Importance of Data Replication in HDFS (20)

BDA Mod1@AzDOCUMENTS.in.pdf
BDA Mod1@AzDOCUMENTS.in.pdfBDA Mod1@AzDOCUMENTS.in.pdf
BDA Mod1@AzDOCUMENTS.in.pdf
 
Comparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big dataComparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big data
 
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSISCASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
 
IRJET - Big Data Analysis its Challenges
IRJET - Big Data Analysis its ChallengesIRJET - Big Data Analysis its Challenges
IRJET - Big Data Analysis its Challenges
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their Usage
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
Chapter 1 big data
Chapter 1 big dataChapter 1 big data
Chapter 1 big data
 
An Overview of Data Lake
An Overview of Data LakeAn Overview of Data Lake
An Overview of Data Lake
 
Big data service architecture: a survey
Big data service architecture: a surveyBig data service architecture: a survey
Big data service architecture: a survey
 
Data modeling techniques used for big data in enterprise networks
Data modeling techniques used for big data in enterprise networksData modeling techniques used for big data in enterprise networks
Data modeling techniques used for big data in enterprise networks
 
Sample
Sample Sample
Sample
 
Week 4 Lecture 1 - Databases and Data WarehousesManagement of .docx
Week 4 Lecture 1 - Databases and Data WarehousesManagement of .docxWeek 4 Lecture 1 - Databases and Data WarehousesManagement of .docx
Week 4 Lecture 1 - Databases and Data WarehousesManagement of .docx
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop Platform
 
Data set Introduction to Big Data
Data set   Introduction to Big DataData set   Introduction to Big Data
Data set Introduction to Big Data
 
IRJET- Survey of Big Data with Hadoop
IRJET-  	  Survey of Big Data with HadoopIRJET-  	  Survey of Big Data with Hadoop
IRJET- Survey of Big Data with Hadoop
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx
 
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
IRJET- A Scrutiny on Research Analysis of Big Data Analytical Method and Clou...
 
Managing Data Strategically
Managing Data StrategicallyManaging Data Strategically
Managing Data Strategically
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining Techniques
 
Data Science: A Revolution of Data
Data Science: A Revolution of DataData Science: A Revolution of Data
Data Science: A Revolution of Data
 

More from IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTUREIRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsIRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASIRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProIRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemIRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesIRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web applicationIRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignIRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoordharasingh5698
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfrs7054576148
 

Recently uploaded (20)

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 

IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and Importance of Data Replication in HDFS

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 553 Unraveling the Data Structures of Big data, the HDFS Architecture and Importance of Data Replication in HDFS Aseema Sultana1 1Asst.Professor, Dept. of Information Science and Engineering, HKBK CE, Bangalore, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract – Big Data is one of the most challenging aspectsin today's world filled with emerging information.Thefactthatit leads us to be dealt with massive amounts of overwhelming raw data creates new opportunities in the said field. The term Big Data describes extremely large volumesofdatathatcomes from varied sources in different formats with high velocity. Here, the amount of data is not what is important, but what can be done with the data is more important. In this process, useful information is fetched from the massive amount ofdata and is analyzed. The analyzing part itself is a very challenging and demanding piece of work as it demands fault forbearing, amenable and ascendible systems that work upon the said data. The ultimate goal here is to make the process of managing and maintaining data a convenient and feasible task. This paper provides a review on Big Data, its data structures, details of the HDFS architecture, and replicationof data files within HDFS. Key Words: Big Data, Structured, Semi-structured, Quasi-structured, Un-structured Data, Hadoop, HDFS Architecture, Data Replication, Space Reclamation. 1. INTRODUCTION What is Big Data? Big Data is not just large volumes of data, but also data with high variety and high velocity. In other words, Big Data is huge data in different forms being generated at high speed which is difficult to manage or process using any existing tools and architectures. Let us first see what the input data into big data systemsis, it could be transactions from a bank, web server logs, chatter from Twitter or Facebook or any other social network,MP3songs, vlogs or any broadcasted videos, content of web pages, personal documents, medical reports, and many more to count, as mentioned in [1]. According to [2], in the 1990s the data volume was measured in terabytes. In subsequent decade the data volume was being measured in petabytes. And the decadeof 2010s is dealing with the exponential data volume drivenby variety and many sources of digitized data. Almost everybody and everything is leaving a digital footprint. Due to the volume in data, it has been considered to start measuring data in Exabyte. Some applications that generate a lot of data are shown in Fig -1. In the said figure we can see how the amount of information is evolving in the graph w.r.t every ten years. Data is increasing and exploding decade after decade and so are the sources for Big Data. Fig -1: The rise of Big Data sources and the Evolution of data 2. DATA STRUCTURES OF BIG DATA Big Data comes in many forms; it can be Structured, Semi- Structured,Quasi-Structuredor Unstructured asin[3].Fig-2 shows the Data structure types and the data growth accordingly. Fig -2: The pyramid of data structures of big data 2.1 Structured Data Structured data is data that comes with clarity, description, presentation, explicate length or format. This
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 554 type of data can be easily re-organized or processed by any data mining tools. One could envision this type of data as a perfectly organized book library where books are orderly arranged,labeled and are easily accessible. Hencethistypeof data can be effectively used by organizations. Few examples of structured data include numbers, dates, etc., it is believed that this type of data comprises about 15-20% of data present in the world. In Fig-3, we can see one such example representing structured data. We can see data from an excel sheet that displays clearly formatted information such as First Name, Last Name and Gender of a few people. 2.2 Semi-Structured Data This type of data is loosely defined or irregularly structured which allowsa schema-lessdescription format in which the data is less constrained than is usuallyin database work. Semi-structured data model allows information from several sources,with related butdifferentproperties,tobefit together in one whole, thus suitable for integration of databases and sharing information on the Web. Semi- structured data is data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably. It generally has some structure, but does not conform to a fixed schema. It is Schema-less and self- describing,i.e., data carriesinformationaboutitsownschema (e.g., in terms of XML element tags). The main attraction for semi-structured data is that structure can be imposed upon the data as in [4]. Fig -3 shows an example for semi- structureddata;here we can see the XMLdocument that was fetched from the ‘view page source’ option while viewing a web page. 2.3 Quasi-Structured Data This data is not among the commonly mentionedtypesof data. It is the one in between the semi-structured and unstructured data. In other words, Quasi-structured data is data which is unstructured but is in an erratic format which can be handled with special tools. For example, consider visiting the website www.google.com, the URL (Uniform Resource Locator) is tracked into log files of the webpage's server; the user’s computer activity is being monitored. The URL defines a clickstream that can be parsed and mined by data scientists to discover usage patterns and uncover relationships between clicks and areas of interest on websites, as mentioned in [2]. This is illustrated in Fig -3 where the user types “HKBK College of Engineering” in the search option in Google. Upon a search, the clickstream “https://www.google.co.in/search?q=hkbk+college+of+enginee ring&rlz=1C1CHBD_enIN761IN761&oq=HKBK+col&aqs=chro me.0.0j69i57j69i61l3j0.3550j0j8&sourceid=chrome&ie=UTF-8” is generated which shall be parsed and mined. 2.4 Un-Structured Data Un-Structured data sometimes referred to as messy or unorganizedraw data is primarily in the formof text and can also be found in forms of multimedia files. This type of data doesnothave a formal structure,inotherwordsthestructure is unidentifiable. Word processing documents, Books, Journals, Health documents, e-mail messages, videos, presentations, webpages, photos, audio filesandmany other kinds of business documents are examples of unstructured data. As we know that structured data is information with a high degree of organization, unstructured data is essentially the opposite. To make use of this data, the massive data that was worthless is identified and stored in an organized fashion. Through the use of specialized software, the items can then be searched through. However, there are two aspects to be considered, firstly, the process of converting unstructureddata to organized datawouldbecostlyandtime consuming. Secondly, not all types of unstructured data can easily be converted into a structured model. Information aging is another factor that needs to be considered in this case where the data keeps accumulating day after day but is left unused. One example of un-structured data is illustrated in Fig -3 in which a video about Big Data from YouTube is streamed. 2.5 Heterogeneous Data This is a combination of all 4 types of data. For example, say we have a system that stores call logs for a call-center. In this case, data such as date and time stamp could represent the structured type. On the other hand, the customer chat historycouldrepresent the quasi-orsemi-structureddata.So much information can be extracted from the available set of the dataofdifferent structures.Figurebelowshowsexamples of all four forms of data. Fig -3: Illustration of data structures of big data 3. HADOOP AND HDFS ARCHITECTURE Hadoop is an open source software framework used to process big data. It was created by Doug Cuttingand Michael J. Cafarella as in [5]. Hadoop was developed for parallel and distributedprocessing systemsandwasinitiallymotivatedin
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 555 a paper by Google. As we know that managing big data is a near to impossible task for the existing traditional systems, this was made possible using Hadoop. With Hadoop, there is no limit on the storage of data and the process of computing huge data generated in high speeds is moving from traditional databases to distributed systems. Fig -4: The HDFS Architecture and Data Replication HDFS (Hadoop Distributed File system) is a highly fault tolerant parallel file systemwith amaster/slavearchitecture whose objective includes storing and managing huge data sets, managing failures of hardware, data access and simplifying simple data coherency issues as in [6]. The HDFS architecture is asshown in Fig -4.Thefigureshows the HDFS master/slave architecture. Each cluster consists of exactlyone NameNode which is amasterserverthatmanages accesses to files w.r.t clients and also manages file system namespace. There can be n number of DataNodesdepending upon the number of nodes in the cluster. The DataNodes manage storage of files. HDFS allows user data to be stored into files usually having the files itself split into more than one blocks and in turn these blocks stored in one or more DataNodes. The responsibility of NameNode is to execute operations like Opening, Closing or renaming files. On the other hand, the responsibility of DataNodes is to serve the read and write requests from the clients. In addition, the NameNode also instructs the DataNodes to perform block creation, cloning and deletion as per the need. The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. The NameNode and DataNodes are designed to run on commodity hardware that runs a GNU/Linux operating system, as mentioned in [6]. HDFS is built using the Java language; therefore, any machine that supports Java can run the NameNode or the DataNode software. 3.1 Data Cloning Why is data replication required? The answer to this question relies on the reliability and performance of HDFS. Havingreplicasinthe DataNodesis what distinguishesHDFS from other parallel or distributed file systems. The main purpose of data replication is to better develop data availability, data reliability and fault tolerance. HDFS is designed to store files in a very reliable way using data replication in the clusters. Each file is stored asa sequence of blocks. The replication factor for each file and the block size are configurable asper each file. The above figurealsoshows how each file could be replicatedindifferentDataNodes.Here we can see that the replication factor can be specified during the time of file creation. The replication factor need not be restricted since it can be changed later. All files in HDFShave strictly one writer at any point of time and can be written only once. Alldecisions regarding cloningof blocksare made by the NameNode. 3.2 Data Replica Placement Instances of HDFS run on a cluster of computers that spread across a number of racks. Switches are used for communication between two nodes in different racks. The rack id to which each DataNode needs to belong to is determined by the NameNode. Usuallythereplicasareplaced in DataNodes on unique racks. This way data loss can be prevented even if an entire rack fails since data can be fetched from another rack that holds the replica. Hence the replicas can be evenly distributed in the cluster. This policy eases balancing component failure but increases cost of writes as each write needs a transfer of replica blocks to another rack. For instance, when a file has a replication factor of 3, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the laston a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improveswrite performance.The chanceof rack failureisfar lessthan thatof node failure; this policy doesnotimpactdata reliability and availability guarantees. However, it does reduce the aggregatenetwork bandwidth usedwhenreading data since a block is placed in only two unique racks rather than three. With this policy,thereplicasof a filedonotevenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed acrossthe remaining racks. This
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 01 | Jan-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 556 policy improves write performance without compromising data reliability or read performance as in [6]. 3.3 Data Selection When a read requestis made, HDFStriesto satisfy itfrom the replica that is nearest to the reader. The replica, if exists, on the same rack as the requestor, is preferred to satisfy the request. Else, if the cluster spans multiple data centers, then the replica residing at the local data center is chosen over a remote replica. 3.4 Space Redeeming Traditionally when we delete a file or a folder in our system on windows, the file is moved to recycle bin. This means the files space is not freed yet. File deletes in HDFS work similarly. When a file is deleted by an applicationor a user, it is notdeleted from HDFS immediately. Uponadelete,firstthe file is renamed and moved to the /trash directory. The file resides inthe trashdirectory for aconfiguredamountoftime. After the time expires, the file is then deleted permanently from the HDFS namespace by the NameNode. The deletion of a file causes all block spaces to be freed that are associated with the deleted file. However, the file can be restored while it still resides in the trash directory. To undelete a file, the deleted file must be restored from the trash directory. As per [6], the current default policy is to delete files permanently from trash that are more than 6 hours old. 4. CONCLUSION AND FUTURE WORK HDFS should support use of snapshots that can be helpfulin storing a copy of data at a particular instance of time. This can be particularly helpful to rollback a corrupted HDFS instance to a previously known point of time when the system worked without issues. There is a potential for making faster advancements in scientific discipline for analyzing large amountsof data .The technicalchallengesare most common across the large variety of application domains, therefore new cost effective and faster methods must be implemented to analyze the big data. ACKNOWLEDGEMENT I thank Dr. Syed Mustafa, Professor and Head, Departmentof Information Science and Engineering, HKBK CE for motivating me and providing their expertise that greatly assisted in writing the paper. I am also grateful to the authors of papers and documents I have referenced. I would like to express my appreciation to the authors for sharing their pearls of wisdom. REFERENCES [1] Big Data Now: 2012 Edition by O’Reilly Media, Inc. [2] Master Thesis on Tools and Methods for Big Data Analysis, by Miroslav Vozábal- 2016, University of West Bohemia. [3] Research Paper on Big Data and Hadoop, Iqbaldeep Kaur, Navneet Kaur, Amandeep Ummat, Jaspreet Kaur, Navjot Kaur. IJCST Vol. 7, ISSue 4, Oct - Dec 2016. [4] Adding Structure to Unstructured Data, Database Research Group (CIS), University of Pennsylvania ScholarlyCommons. 1997. [5] HADOOP ARCHITECTURE AND FAULT TOLERANCE BASED HADOOP CLUSTERS IN GEOGRAPHICALLY DISTRIBUTED DATA CENTER, T. Cowsalya and S.R. Mugunthan, ARPN Journal of Engineering and Applied Sciences. [6] The HDFS Architecture guide, https://hadoop.apache.org/. [7] Big Data And Hadoop: A Review Pape, Rahul Beakta, https://www.researchgate.net/publication/ BIOGRAPHIES Prof. Aseema Sultana was born in Bangalore, India. She received the B.E degree in ISE from HKBK CE, in 2010, and M.Tech in CSE from MVJ CE, in2012. She worked as a Sr. Project Engineer in Wipro (2012-2016) and currently working as an Asst. Professor in HKBK CE.1’st Author Photo