IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. IV (Jan – Feb. 2016), PP 23-29
www.iosrjournals.org
DOI: 10.9790/0661-18142329
A Review: Hadoop Storage and Clustering Algorithms
Latika Kakkar1, Gaurav Mehta2
1(Department of Computer Science and Engineering, Chitkara University, India)
2(Department of Computer Science and Engineering, Chitkara University, India)
Abstract: In the last few years there has been a voluminous increase in the storage and processing of data, which demands both adequate speed and adequate storage space. Big data is defined as large, diverse and complex data sets that pose challenges of storage, analysis and visualization for further processing of results. The four characteristics of big data, Volume, Value, Variety and Velocity, make it difficult for traditional systems to process it. Apache Hadoop is a promising software framework for developing applications that process huge amounts of data in parallel on large clusters of commodity hardware in a fault-tolerant and reliable manner. Performance metrics such as reliability, fault tolerance, accuracy, confidentiality and security are improved with the use of Hadoop. Hadoop MapReduce is an effective computation model for processing large data on distributed data clusters such as clouds. We first introduce the general idea of big data and then review related technologies, such as cloud computing and Hadoop. Various clustering techniques are also analyzed based on parameters like number of clusters, size of clusters, type of dataset and noise.
Keywords: Big data, Cloud Computing, Clusters, Hadoop, HDFS, MapReduce
I. Introduction
Data Mining is the process of analyzing data from different aspects and summarizing it into valuable information; it examines correlations and patterns within large databases. Major data mining techniques include classification, regression and clustering. Clustering is the process of assigning data into groups, called clusters, such that objects in the same cluster are more similar to each other than to objects in other clusters. Clustering is a main task of Data Mining and a common technique for statistical data analysis used in many fields, including pattern recognition, image analysis and information retrieval. Nowadays big data is growing rapidly, especially for Internet companies; for example, Google and Facebook process petabytes of data within a month. Figure 1 shows the rapid increase in data volume. These rapidly growing datasets raise several challenges. Due to the advancement of information technology, huge amounts of data can be generated; on average, 72 hours of video are uploaded to YouTube every minute [1]. This creates the problem of collecting and integrating tremendous amounts of data from widespread sources. The accelerated advancement of cloud computing and the Internet of Things further contributes to the increase of data. This increase in volume and variety places a great strain on existing computing capacity. Such huge data creates the problem of storing and managing large, complex datasets while meeting the hardware and software infrastructure requirements. The mining of such complex, heterogeneous and voluminous data, at the levels of analysis, perception and anticipation, should be done carefully to unfold its natural properties and to enable good decision making. [2]
II. Challenges Of Big Data
2.1 Confidentiality of Data
Various big data service providers use different tools to process and analyze data, which introduces security risks. For example, a transactional dataset generally includes a complete set of operating data used to execute important business processes. Such data is fine-grained and contains sensitive information such as credit card numbers. Therefore, preventive measures must be taken to protect such sensitive data and ensure its safety.
2.2 Representation of Data
The datasets are heterogeneous in nature. Data representation aims to make data meaningful for computer analysis and user comprehension. Improper representation of data impairs data analysis. Efficient data representation reflects the data structure, classes and technologies, which enables useful operations on different datasets.
2.3 Management of Data Life Cycle
The storage systems normally used cannot support such huge quantities of data. Data freshness is the key component that determines the value hidden in big data. Therefore, to decide which data shall be stored and which data shall be discarded, a principle associated with the analytical value should be developed.
2.4 Analytical Mechanism
The analytical system must process large amounts of heterogeneous data within a limited time period. However, traditional RDBMSs lack scalability and expandability, which prevents them from meeting the performance requirements. Non-relational databases are advantageous for the analysis of unstructured data, but there are still some performance problems with non-relational databases. Some enterprises, e.g. Facebook and Taobao, have employed hybrid database architectures that combine the advantages of both types of databases.
2.5 Energy Management
A large amount of electrical energy is consumed for the analysis, storage and transmission of big data. Therefore, system-level power-consumption control and management mechanisms are required for big data, with the constraint that expandability and accessibility are ensured.
2.6 Scalability
The analytical system designed for big data should be compatible with all types of datasets. It must be able to process datasets that are growing drastically in size and complexity.
2.7 Reduction of Redundancy and Data Compression
Redundancy reduction and data compression are very useful for diminishing the cost of the entire system, provided the data itself is not affected. For example, most data generated by sensor networks is highly redundant and may be filtered and compressed. [3-5]
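To illustrate how well such redundant data compresses, the short Python sketch below feeds a deliberately repetitive, made-up stream of sensor readings through the standard-library gzip module; the sensor name and values are purely illustrative.

```python
import gzip
import json

# Hypothetical, highly redundant sensor stream: one temperature sensor
# reporting almost the same value once per second for an hour.
readings = [{"sensor": "temp-01", "value": 21.5} for _ in range(3600)]
raw = json.dumps(readings).encode("utf-8")

# Off-the-shelf gzip already removes most of the redundancy.
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes "
      f"({len(compressed) / len(raw):.1%} of the original size)")
```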
Fig. 1 Composition of Big Data
III. Related Technologies
This section reviews fundamental technologies that are closely related to big data.
3.1 Cloud Computing
Cloud computing refers to a parallel and distributed system consisting of computers that are virtualized and inter-connected. The system is presented as a computing resource based on service agreements negotiated between customers and service providers. Cloud computing enables organizations to consume resources as a utility rather than building and maintaining internal computing infrastructures.
Benefits of cloud computing:
• Pay per use: Users pay only for the resources and workload they use, which means that computing resources are metered at a granular level.
• Self-service provisioning: Computing resources can be provisioned by end users on demand for their workloads.
• Elasticity: As the computing needs of a company increase or decrease, it can scale resources up or down accordingly.
3.2 Cloud Computing and Big Data Affiliation
Cloud computing and big data are closely related. The key components of cloud computing are shown in Fig. 2. Cloud computing provides a solution for the storage and processing of big data; in turn, big data drives the development of cloud computing. The distributed storage technology of cloud computing efficiently manages big data, and its parallel computing capability improves big data analysis.
Fig. 2 Key components of cloud computing
3.3 Hadoop
The storage infrastructure has to meet two basic requirements: first, to provide a service for information storage, and second, to provide an interface for efficient query access and data analysis. Traditionally, structured RDBMSs were used for storing, managing and analysing data, but with the continuing growth of big data, an efficient and reliable storage mechanism is needed. Hadoop is a project of the Apache Software Foundation. Hadoop [6] is an open-source software framework implemented in Java and designed to run on large distributed systems of commodity hardware. The framework is designed so that it handles hardware failure automatically. Hadoop is a very popular software tool, partly because it is open-source; Yahoo, Facebook, Twitter, LinkedIn and others currently use Hadoop. [8]
3.3.1 Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework. HDFS is used to store huge amounts of data on commodity hardware [6]. HDFS is designed for large big-data files, whose sizes may range from gigabytes to terabytes [7]. HDFS can support millions of files and scales to hundreds of nodes. HDFS stores metadata and application data separately [8]: the metadata is stored on the NameNode server, and application data is stored on DataNodes, which hold blocks of data. Communication between the different nodes of HDFS uses a TCP-based protocol [8]. Fault tolerance is the ability of a system to function adequately and correctly even after some of its components fail. Hadoop and HDFS are highly fault tolerant, as HDFS can be spread over multiple nodes in a cost-effective way. It replicates the data so that if one copy is lost, backup copies are still available. [9]
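To make the split between NameNode metadata and DataNode block storage concrete, the toy Python sketch below models a NameNode that only tracks which blocks make up a file and where their replicas live, while the DataNodes hold the blocks themselves. It is a conceptual model only; the class names, block size constant and round-robin placement policy are simplifications for illustration, not the real HDFS client API.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size (128 MB)
REPLICATION = 3                  # default HDFS replication factor

class DataNode:
    """Stores the actual block data (represented here by block ids only)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = set()

    def store(self, block_id):
        self.blocks.add(block_id)

class NameNode:
    """Holds only metadata: which blocks make up a file and where replicas live."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}                       # filename -> [(block_id, [node ids])]

    def write_file(self, name, size_bytes):
        n_blocks = -(-size_bytes // BLOCK_SIZE)   # ceiling division
        rotate = itertools.cycle(range(len(self.datanodes)))
        placements = []
        for b in range(n_blocks):
            block_id = f"{name}#blk{b}"
            targets = [next(rotate) for _ in range(REPLICATION)]
            for t in targets:                     # replicate each block on 3 DataNodes
                self.datanodes[t].store(block_id)
            placements.append((block_id, targets))
        self.block_map[name] = placements

datanodes = [DataNode(i) for i in range(5)]
namenode = NameNode(datanodes)
namenode.write_file("weblogs.txt", 300 * 1024 * 1024)   # ~300 MB -> 3 blocks
print(namenode.block_map["weblogs.txt"])
```

If one DataNode fails, the other replicas of its blocks remain available, which is the behaviour the replication factor is meant to provide.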
3.3.2 MapReduce Processing Model
Hadoop MapReduce processes big data in parallel and produces output efficiently. MapReduce consists of a Map function and a Reduce function. The Map function performs filtering and sorting of large data sets; the Reduce function performs a summary operation that combines the intermediate results into the final output. Hadoop HDFS and MapReduce are modeled on the Google File System. The Google File System (GFS), developed by Google, is a distributed file system that provides organized and efficient access to data using large clusters of commodity servers. Map phase: the master node accepts the input, divides the large problem into smaller sub-problems, and distributes these sub-problems among worker nodes in a multi-level tree structure. The worker nodes process the sub-problems and send the results back to the master node. Reduce phase: the Reduce function combines the outputs of all sub-problems, collects them at the master node, and produces the final output. Each Map function is associated with a Reduce function.
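The classic word-count example illustrates the two phases. The Python sketch below only simulates the programming model in a single process: in real Hadoop the framework distributes the map tasks, performs the shuffle, and runs the reducers on worker nodes, so this is an illustration of the model rather than Hadoop code.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in one input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """Shuffle: group intermediate values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: summarize all values emitted for one key."""
    return key, sum(values)

documents = ["big data needs big storage", "hadoop stores big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)   # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}
```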
Fig. 3 A multi-node Hadoop cluster
Figure 3 shows a multi-node Hadoop cluster. It consists of a master node and slave nodes. A TaskTracker and a DataNode run on each slave node; the master node runs a TaskTracker, a DataNode, the JobTracker and the NameNode. The NameNode manages the file system, while job scheduling is handled by the JobTracker. The DataNodes store the data blocks, whose locations are known to the NameNode. When a user interacts with the Hadoop system, the NameNode becomes active and all information about the data on the DataNodes is processed through the NameNode.
Fig.4 HDFS file system
Figure 4 shows the HDFS file system. The NameNode is used to access and store user data, so the NameNode becomes a single point of failure. For this reason a secondary NameNode is kept active; it takes snapshots of the NameNode's state and stores that information itself. On failure of the NameNode, the data can be recovered from the secondary NameNode. Hadoop's HDFS can handle fault tolerance, but there are other aspects of Hadoop and HDFS that could require improvement. One is the scalability of the NameNode [8]. Because the NameNode is a single node, it is subject to physical hardware constraints. For example, because the namespace is kept in memory, HDFS may become unresponsive if the namespace grows to the limit of the node's memory. One solution to this issue is to use more than one NameNode, but this can create communication problems [8]. Another solution is to store only large data files in HDFS; using large files reduces the size of the namespace, and HDFS is in any case designed for large data files. Hadoop also lacks good high-level support, an issue that can occur with open-source software; enhancing Hadoop's functionality on a system can be difficult without proper support [11]. HDFS is also sensitive to scheduling delays, which prevents it from reaching its full potential, as a node may have to wait for its next task. This issue is particularly found in the HDFS client code [12].
IV. Comparisons of Various Clustering Algorithms
4.1 K-Means Algorithm
K-Means is a clustering algorithm that partitions objects into k clusters, where each cluster has a centre point called the centroid; the algorithm finds the desired number of distinct clusters and their centroids.
Algorithmic steps for K-Means clustering [21]
1) Choose the value of K, the number of desired clusters.
2) Initialization: choose starting points that serve as initial estimates of the cluster centroids.
3) Classification: examine each object in the dataset and assign it to the nearest centroid.
4) Centroid calculation: after the objects have been assigned to clusters, the k centroids are recalculated.
5) Convergence criterion: steps 3 and 4 are repeated until the objects stop changing clusters and the centroids of all clusters become fixed.
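A minimal NumPy sketch of these five steps is shown below; the random two-dimensional dataset and the choice K = 3 are only illustrative.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2 (initialization): pick k random objects as the starting centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3 (classification): assign every object to its nearest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4 (centroid calculation): each centroid becomes the mean of its objects.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5 (convergence): stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.random.default_rng(1).normal(size=(300, 2))   # illustrative dataset
labels, centroids = k_means(data, k=3)                   # Step 1: choose K = 3
```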
4.2 Fuzzy C-Means
Algorithmic steps for Fuzzy C-Means clustering [22]
Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, v3, ..., vc} be the set of cluster centers.
1) Randomly select 'c' cluster centers.
2) Calculate the fuzzy membership $\mu_{ij}$ using
$\mu_{ij} = 1 \Big/ \sum_{k=1}^{c} \left( \|x_i - v_j\| / \|x_i - v_k\| \right)^{2/(m-1)}$
3) Compute the fuzzy centers $v_j$ using
$v_j = \left( \sum_{i=1}^{n} \mu_{ij}^{m} x_i \right) \Big/ \sum_{i=1}^{n} \mu_{ij}^{m}, \quad j = 1, \dots, c$
4) Repeat steps 2 and 3 until the minimum value of the objective function J is achieved or $\|U^{(k+1)} - U^{(k)}\| < \beta$, where k is the iteration step, $\beta \in [0, 1]$ is the termination criterion, $U = (\mu_{ij})_{n \times c}$ is the fuzzy membership matrix, J is the objective function, and m > 1 is the fuzziness index.
FCM iteratively moves the cluster centers to their correct locations within a dataset. When fuzzy logic is introduced into the K-Means clustering algorithm, it becomes the Fuzzy C-Means algorithm; FCM clustering is based on fuzzy behavior and its structure is basically similar to K-Means. In [20], experimental results showed that K-Means takes less elapsed time than FCM. In the K-Means algorithm the number of final clusters has to be defined beforehand; such algorithms may be susceptible to local optima, can be outlier-sensitive, and require considerable memory. The time complexity of the K-Means algorithm is O(ncdi) and that of the FCM algorithm is O(ndc²i); [20] concluded that the K-Means algorithm is better than the FCM algorithm. Fuzzy C-Means requires more computation time than K-Means because FCM additionally has to compute the fuzzy memberships. FCM can handle issues related to pattern understanding and noise in data, and thus provides faster solutions to such problems. FCM can be used for discovering association rules, image retrieval and functional dependencies.
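A minimal NumPy sketch of the update rules above is given below, assuming the common fuzziness value m = 2 and a random, purely illustrative dataset.

```python
import numpy as np

def fuzzy_c_means(data, c, m=2.0, beta=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select c cluster centers (here: c random data points).
    centers = data[rng.choice(len(data), size=c, replace=False)]
    u = np.full((len(data), c), 1.0 / c)          # fuzzy membership matrix U
    for _ in range(max_iter):
        # Step 2: memberships u_ij from the distances to all centers.
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)               # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        # Step 3: centers v_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        um = new_u ** m
        centers = (um.T @ data) / um.sum(axis=0)[:, None]
        # Step 4: stop when ||U(k+1) - U(k)|| falls below the threshold beta.
        if np.linalg.norm(new_u - u) < beta:
            u = new_u
            break
        u = new_u
    return u, centers

data = np.random.default_rng(2).normal(size=(200, 2))   # illustrative dataset
memberships, centers = fuzzy_c_means(data, c=3)
```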
4.3 Hierarchical Clustering
Hierarchical clustering is a clustering algorithm used to build a hierarchy of clusters. There are two strategies for hierarchical clustering:
Agglomerative: each object starts in its own cluster, and pairs of clusters are merged as we move up the hierarchy. Agglomerative clustering is a bottom-up approach.
Divisive: all observations start in one cluster, and clusters are split iteratively as we move down the hierarchy. Divisive clustering is a top-down approach. A dendrogram is used to show the results of hierarchical clustering. The complexity of agglomerative clustering is O(n³), so for large datasets its performance is too slow. The complexity of divisive clustering is O(2^n), which is even worse. A limitation of hierarchical clustering is therefore that it does not run in linear time. These algorithms can also be used in a combined approach to produce better results: agglomerative hierarchical clustering is used for its quality, and K-Means for its run-time efficiency. A comparative analysis of various clustering algorithms (K-Means, hierarchical, self-organizing map, expectation maximization and Fuzzy C-Means) is shown in Table 1. [21, 22, 23, 24, 25]
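A short sketch of agglomerative clustering using SciPy's hierarchical clustering routines is given below; the random dataset, the average linkage method and the cut into three clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative dataset of 50 two-dimensional points.
data = np.random.default_rng(3).normal(size=(50, 2))

# Agglomerative (bottom-up) clustering: every point starts in its own cluster
# and the two closest clusters are merged at each step; the resulting merge
# tree can be drawn as a dendrogram with scipy.cluster.hierarchy.dendrogram.
merge_tree = linkage(data, method="average")

# Cut the hierarchy to obtain a flat partition into 3 clusters.
labels = fcluster(merge_tree, t=3, criterion="maxclust")
print(labels)
```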
4.4 Self-Organizing Map (SOM)
The self-organizing map (SOM) [26] is used to find a mapping from a high-dimensional input space onto a two-dimensional grid of nodes. In a SOM, objects represented by the same node are grouped into the same cluster.
SOM algorithm pseudo code [26]:
1. Choose the size and dimension of the map.
2. For every new sample vector:
2.1 Compute the distance between every codebook vector and the new vector.
2.2 Using a learning rate that decreases over time and a distance radius on the map, update the codebook vectors in the neighbourhood of the winning node towards the new vector.
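The sketch below implements this procedure for a small, illustrative map; the map size, linear decay schedules and Gaussian neighbourhood function are common choices for the example rather than a prescription from [26].

```python
import numpy as np

def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, radius0=2.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    codebook = rng.random((rows, cols, dim))     # step 1: map of codebook vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):          # step 2: present each sample vector
            # 2.1: distance from the sample to every codebook vector; pick the winner.
            bmu = np.unravel_index(
                np.linalg.norm(codebook - x, axis=2).argmin(), (rows, cols))
            # 2.2: learning rate and neighbourhood radius decay over time.
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            radius = max(radius0 * (1.0 - frac), 0.5)
            # Move the winner and its map neighbours towards the sample.
            grid_dist = np.linalg.norm(grid - np.array(bmu), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
            codebook += lr * influence * (x - codebook)
            step += 1
    return codebook

data = np.random.default_rng(4).normal(size=(200, 3))   # illustrative dataset
som_codebook = train_som(data)
```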
Table 1. Comparative analysis of various clustering algorithms
V. Conclusion
For efficient processing of big data, Hadoop proves to be a suitable software framework. HDFS is a very large distributed file system that uses commodity clusters and provides high throughput as well as fault tolerance. This is a very useful property of Hadoop, because data loss or system crashes can have adverse and irreversible consequences for a business. For efficient retrieval of data from Hadoop, a comparative analysis of various clustering algorithms is made based on several input parameters (Table 1).
References
[1] Mayer-Schönberger V, Cukier K, Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, 2013.
[2] Chen M, Mao S, Liu Y, Big data: A survey, Mobile Networks and Applications, 19(2), 2014 Apr 1, 171-209.
[3] Labrinidis A, Jagadish HV, Challenges and opportunities with big data, Proceedings of the VLDB Endowment, 5(12), 2012 Aug 1, 2032-3.
[4] Chaudhuri S, Dayal U, Narasayya V, An overview of business intelligence technology, Communications of the ACM, 54(8), 2011 Aug 1, 88-98.
[5] Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Challenges and Opportunities with big data, 2011, 1.
[6] Apache Hadoop, http://hadoop.apache.org
[7] Borthakur D, The hadoop distributed file system: Architecture and design, Hadoop Project Website, 2007 Aug 11, 1.
[8] Shvachko K, Kuang H, Radia S, Chansler R, The hadoop distributed file system, In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium, 2010 May 3, 1-10.
[9] Evans J, Fault Tolerance in Hadoop for Work Migration, CSCI B34 Survey Paper, 3(28), 2011, 11.
[10] Bessani AN, Cogo VV, Correia M, Costa P, Pasin M, Silva F, Arantes L, Marin O, Sens P, Sopena J, Making Hadoop MapReduce Byzantine Fault-Tolerant, DSN, Fast abstract, 2010.
[11] Wang F, Qiu J, Yang J, Dong B, Li X, Li Y, Hadoop high availability through metadata replication, In Proceedings of the first international workshop on Cloud data management, 2009 Nov 2, 37-44.
[12] Shafer J, Rixner S, Cox AL, The hadoop distributed filesystem: Balancing portability and performance, In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium, 2010 Mar 28, 122-133.
[13] Dean J, Ghemawat S, MapReduce: simplified data processing on large clusters, Communications of the ACM, 51(1), 2008 Jan 1, 107-13.
[14] Xie J, Yin S, Ruan X, Ding Z, Tian Y, Majors J, Manzanares A, Qin X, Improving mapreduce performance through data placement in heterogeneous hadoop clusters, In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium, 2010 Apr 19, 1-9.
[15] Cattell R, Scalable SQL and NoSQL data stores, ACM SIGMOD Record, 39(4), 2011 May 6, 12-27.
[16] Chaiken R, Jenkins B, Larson PÅ, Ramsey B, Shakib D, Weaver S, Zhou J, SCOPE: easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment, 1(2), 2008 Aug 1, 1265-76.
[17] Grossman RL, Gu Y, Sabala M, Zhang W, Compute and storage clouds using wide area high performance networks, Future Generation Computer Systems, 25(2), 2009 Feb 28, 179-83.
[18] Liu X, Han J, Zhong Y, Han C, He X, Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS, In Cluster Computing and Workshops, CLUSTER '09, IEEE International Conference, 2009 Aug 31, 1-8.
[19] Grace RK, Manimegalai R, Kumar SS, Medical image retrieval system in grid using Hadoop framework, In Computational Science and Computational Intelligence (CSCI), International Conference, Vol. 1, 2014 Mar 10, 144-148.
[20] Ghosh S, Dubey SK, Comparative analysis of k-means and fuzzy c-means algorithms, International Journal of Advanced Computer Science and Applications, 4(4), 2013, 34-9.
[21] Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY, An efficient k-means clustering algorithm: Analysis and implementation, Pattern Analysis and Machine Intelligence, IEEE, 24(7), 2002 Jul, 881-92.
[22] Rao VS, Vidyavathi DS, Comparative investigations and performance analysis of FCM and MFPCM algorithms on iris data, Indian Journal of Computer Science and Engineering, 1(2), 2010, 145-51.
[23] Joshi A, Kaur R, A review: Comparative study of various clustering techniques in data mining, International Journal of Advanced Research in Computer Science and Software Engineering, 3(3), 2013 Mar, 55-7.
[24] Mingoti SA, Lima JO, Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms, European Journal of Operational Research, 174(3), 2006 Nov 1, 1742-59.
[25] Steinbach M, Karypis G, Kumar V, A comparison of document clustering techniques, In KDD workshop on text mining, Vol. 400, No. 1, 2000 Aug 20, 525-526.
[26] Abbas OA, Comparisons between Data Clustering Algorithms, Int. Arab J. Inf. Technol., 5(3), 2008 Jul 1, 320-5.

More Related Content

What's hot

IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET Journal
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
Big data analysis model for transformation and effectiveness in the e governa...
Big data analysis model for transformation and effectiveness in the e governa...Big data analysis model for transformation and effectiveness in the e governa...
Big data analysis model for transformation and effectiveness in the e governa...IJARIIT
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
Big data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceBig data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceInformation Security Awareness Group
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageIRJET Journal
 
NOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data ManagementNOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data Managementijtsrd
 
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...IRJET Journal
 
Big data security and privacy issues in the
Big data security and privacy issues in theBig data security and privacy issues in the
Big data security and privacy issues in theIJNSA Journal
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Thingspateelhs
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCEAM Publications,India
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentationKlawal13
 

What's hot (19)

Bigdata
Bigdata Bigdata
Bigdata
 
IJARCCE_49
IJARCCE_49IJARCCE_49
IJARCCE_49
 
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
Big data analysis model for transformation and effectiveness in the e governa...
Big data analysis model for transformation and effectiveness in the e governa...Big data analysis model for transformation and effectiveness in the e governa...
Big data analysis model for transformation and effectiveness in the e governa...
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Big data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceBig data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security Alliance
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their Usage
 
big data
big databig data
big data
 
NOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data ManagementNOSQL Database Engines for Big Data Management
NOSQL Database Engines for Big Data Management
 
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...
IRJET-Unraveling the Data Structures of Big data, the HDFS Architecture and I...
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Big data security and privacy issues in the
Big data security and privacy issues in theBig data security and privacy issues in the
Big data security and privacy issues in the
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Things
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentation
 

Viewers also liked

Viewers also liked (20)

L017317681
L017317681L017317681
L017317681
 
D0731218
D0731218D0731218
D0731218
 
J017616976
J017616976J017616976
J017616976
 
F017643543
F017643543F017643543
F017643543
 
Nervous system problems
Nervous system problemsNervous system problems
Nervous system problems
 
trung tâm mua đồng hồ casio chất lượng cao
trung tâm mua đồng hồ casio chất lượng caotrung tâm mua đồng hồ casio chất lượng cao
trung tâm mua đồng hồ casio chất lượng cao
 
H017376369
H017376369H017376369
H017376369
 
LordJeshuaInheritanceJuly2016
LordJeshuaInheritanceJuly2016LordJeshuaInheritanceJuly2016
LordJeshuaInheritanceJuly2016
 
Eng Khaled Adel
Eng Khaled Adel Eng Khaled Adel
Eng Khaled Adel
 
Design, Electrostatic and Eigen Frequency Analysis of Fixed– Fixed Beam MEMS ...
Design, Electrostatic and Eigen Frequency Analysis of Fixed– Fixed Beam MEMS ...Design, Electrostatic and Eigen Frequency Analysis of Fixed– Fixed Beam MEMS ...
Design, Electrostatic and Eigen Frequency Analysis of Fixed– Fixed Beam MEMS ...
 
L017248388
L017248388L017248388
L017248388
 
CV
CVCV
CV
 
M017116571
M017116571M017116571
M017116571
 
G1803063741
G1803063741G1803063741
G1803063741
 
J017625966
J017625966J017625966
J017625966
 
Lace cocktail dresses gudeer.com
Lace cocktail dresses   gudeer.comLace cocktail dresses   gudeer.com
Lace cocktail dresses gudeer.com
 
I012435863
I012435863I012435863
I012435863
 
J010417781
J010417781J010417781
J010417781
 
J012626269
J012626269J012626269
J012626269
 
IMPOR [ API ] - BARANG PERBAIKAN
IMPOR [ API ] - BARANG PERBAIKANIMPOR [ API ] - BARANG PERBAIKAN
IMPOR [ API ] - BARANG PERBAIKAN
 

Similar to E018142329

HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONSHIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONScscpconf
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformIRJET Journal
 
A Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigDataA Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigDataIJMIT JOURNAL
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAA REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAIJMIT JOURNAL
 
Big data service architecture: a survey
Big data service architecture: a surveyBig data service architecture: a survey
Big data service architecture: a surveyssuser0191d4
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabatinabati
 
Big Data Paradigm - Analysis, Application and Challenges
Big Data Paradigm - Analysis, Application and ChallengesBig Data Paradigm - Analysis, Application and Challenges
Big Data Paradigm - Analysis, Application and ChallengesUyoyo Edosio
 
pole2016-A-Recent-Study-of-Emerging-Tools.pdf
pole2016-A-Recent-Study-of-Emerging-Tools.pdfpole2016-A-Recent-Study-of-Emerging-Tools.pdf
pole2016-A-Recent-Study-of-Emerging-Tools.pdfAkuhuruf
 
SECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTING
SECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTINGSECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTING
SECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTINGIJNSA Journal
 
Security issues associated with big data in cloud computing
Security issues associated with big data in cloud computingSecurity issues associated with big data in cloud computing
Security issues associated with big data in cloud computingIJNSA Journal
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A reviewShilpa Soi
 
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAREAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAijp2p
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.ijceronline
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...dbpublications
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewIJERA Editor
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 

Similar to E018142329 (20)

HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONSHIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop Platform
 
A Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigDataA Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigData
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAA REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
 
Big data service architecture: a survey
Big data service architecture: a surveyBig data service architecture: a survey
Big data service architecture: a survey
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
Big Data Paradigm - Analysis, Application and Challenges
Big Data Paradigm - Analysis, Application and ChallengesBig Data Paradigm - Analysis, Application and Challenges
Big Data Paradigm - Analysis, Application and Challenges
 
U0 vqmtq3m tc=
U0 vqmtq3m tc=U0 vqmtq3m tc=
U0 vqmtq3m tc=
 
pole2016-A-Recent-Study-of-Emerging-Tools.pdf
pole2016-A-Recent-Study-of-Emerging-Tools.pdfpole2016-A-Recent-Study-of-Emerging-Tools.pdf
pole2016-A-Recent-Study-of-Emerging-Tools.pdf
 
SECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTING
SECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTINGSECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTING
SECURITY ISSUES ASSOCIATED WITH BIG DATA IN CLOUD COMPUTING
 
Security issues associated with big data in cloud computing
Security issues associated with big data in cloud computingSecurity issues associated with big data in cloud computing
Security issues associated with big data in cloud computing
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A review
 
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAREAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –Review
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 

More from IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Recently uploaded

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 

Recently uploaded (20)

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 

E018142329

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. IV (Jan – Feb. 2016), PP 23-29 www.iosrjournals.org DOI: 10.9790/0661-18142329 www.iosrjournals.org 23 | Page A Review: Hadoop Storage and Clustering Algorithms Latika Kakkar1 , Gaurav Mehta2 1 (Department of Computer Science and Engineering, Chitkara University, India) 2 (Department of Computer Science and Engineering, Chitkara University, India) Abstract : In the last few years there has been voluminous increase in the storing and processing of data, which require convincing speed and also requirement of storage space. Big data is a defined as large, diverse and complex data sets which has issues of storage, analysis and visualization for processing results. Four characteristics of Big data which are–Volume, Value, Variety and Velocity makes it difficult for traditional systems to process the big data. Apache Hadoop is an auspicious software framework that develops applications that process huge amounts of data in parallel with large clusters of commodity hardware in a fault-tolerant and veracious manner. Various performance metrics such as reliability, fault tolerance, accuracy, confidentiality and security are improved as with the use of Hadoop. Hadoop MapReduce is an effective Computation Model for processing large data on distributed data clusters such as Clouds. We first introduce the general idea of big data and then review related technologies, such as could computing and Hadoop. Various clustering techniques are also analyzed based on parameters like numbers of clusters, size of clusters, type of dataset and noise. Keywords: Big data, Cloud Computing, Clusters, Hadoop , HDFS, MapReduce I. Introduction To analyze data from different aspects and to outline and sum up this data into valuable data i.e. important information is known as Data Mining. Data Mining is process of analyzing correlations and patterns among huge data in a database. Major data mining techniques involve classification, regression and clustering. Clustering is the process of assigning data into groups called clusters such that object in same cluster is more similar than the objects of other clusters. Clustering is a main task of Data Mining. It common technique for statistical data analysis used in many fields which includes pattern recognition, analysis of image, information retrieval. Now a day’s big data is growing rapidly specially those related to internet companies. For example, Google, Facebook processes petabytes of data within a month.Figure1 shows the increase of the data volume in a rapid manner. There are various challenges as the datasets are increasing in a drastic manner. Due to advancement of information technology huge amount of data can be generated. On an average, YouTube uploads 72 hours of videos in every minute [1]. This creates the problem of collection and integration of tremendous data from widespread sources of data. The accelerated advancement of cloud computing and the Internet of Things further contributes the fine increase of data. Through this increase in volume and variety there is a great accentuation on the existing computing capacity. This huge data causes problem of storing and managing such large complex datasets with fulfilling the hardware and software infrastructure. 
The mining of such complex, heterogeneous and voluminous data at different levels of analyzing, perception, anticipating should be done carefully to unfold its natural properties so as to filter good decision making. [2] II. Challenges Of Big Data 2.1 Confidentiality of Data Various big data service providers use different tools to process and analyze data, due to which there are security risks. For example, the transactional dataset generally includes a set of complete operating data to execute the important business processes. Such data consists of lower granularity and delicate information like credit card numbers. Therefore, preventive measures are taken to protect such sensitive data, to ensure its safety. 2.2 Representation of Data The datasets are heterogeneous in nature. Representation of Data aims to make meaningful data for analysis of computers and user comprehension. Improper representation of data affects the data analysis. Efficient data representation shows data structure, class and technologies which enables useful operations on different datasets. 2.3 Management of Data Life Cycle The storage system used normally cannot support huge quantity of data. Data freshness is the key component which describes the values private in big data. Therefore, to decide which data shall be stored and which data shall be ignored, a principle associated with the analytical value should be developed.
  • 2. A Review: Hadoop Storage And Clustering Algorithms DOI: 10.9790/0661-18142329 www.iosrjournals.org 24 | Page 2.4 Analytical Mechanism: The analytical system processes large amount of heterogeneous data with limited time period. However, the traditional RDBMSs lack scalability and expandability, by which performance requirement is affected. Non-relational databases are advantageous in the analysis of unstructured data but there are still some problems in performance of non-relational databases. Few enterprises have employed hybrid architecture of database that has advantages of both types of databases e.g. Facebook and Taobao. 2.5 Energy Management Large amount of electrical energy is consumed for the analysis, storage, and transmission of big data. Therefore, power-consumption control and management mechanism of system or computer were required for big data with the constraints that the expansion and accessibility are ensured. 2.6 Scalability The analytical system designed for big data should be compatible with all types of datasets. It must be able to process drastically increasing complex datasets. 2.7 Reduction of Redundancy and Data Compression Redundancy reduction and data compression is very useful to diminish the cost of the entire system on the basis that there is no affect on the data. For example, mostly the data developed by sensor networks have high degree of redundancy that may be compressed and filtered. [3-5] Fig. 1 Composition of Big Data III. Related Technologies Fundamental technologies that are closely related to big data: 3.1Cloud Computing Cloud computing refers to parallel and distributed system that consists of computers that is visualized and inter-connected. This system is represented as a computing resource based on service agreements established through intervention between the customers and the service providers. Cloud computing enables the organizations to consume the resources as a utility rather than developing and maintaining the internal computing infrastructures. Benefits of cloud computing:  Pay per use: Users can pay only for the resources and workload they use which means that the computing resources are measured at a granular level.  Self-service provisioning: The computing resources can be operated by end users for the workload on demand.  Elasticity: With the increase and the decrease of the computing needs of the companies, they can scale up and scale down the resources. 3.2 Cloud Computing and Big Data Affiliation Cloud computing and big data are closely related. The key components of cloud computing are shown in Fig.2. For the storage and processing of big data cloud computing is the solution. On the other hand, the big data also increases the development of cloud computing. The technology of cloud computing of distributed storage efficiently manages the big data and the parallel computing capability of cloud computing improves the big data analysis.
Fig. 2 Key components of cloud computing
3.3 Hadoop
The storage infrastructure has to satisfy two basic requirements: first, to provide a service for information storage, and second, to provide an interface for efficient query access and data analysis. Traditionally, structured RDBMSs were used for storing, managing and analyzing data, but the growth of big data calls for a more efficient and reliable storage mechanism. Hadoop is a project of the Apache Software Foundation. Hadoop [6] is an open-source software framework implemented in Java and designed to run on large distributed systems of commodity hardware. The framework is designed so that it handles hardware failure automatically. Being open source has made Hadoop a very popular software tool; Yahoo, Facebook, Twitter, LinkedIn and others currently use it. [8]
3.3.1 Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework and is used to store huge amounts of data on commodity hardware [6]. HDFS is designed for large big data files, whose sizes may range from gigabytes to terabytes [7]. It can support millions of files and scales to hundreds of nodes. HDFS keeps metadata and application data in separate places [8]: metadata is stored on the NameNode server, while application data is stored on DataNodes, which hold blocks of data. Communication between the different nodes of HDFS uses a TCP-based protocol [8]. Fault tolerance is the ability of a system to function adequately and correctly even after some of its components fail. Hadoop and HDFS are highly fault tolerant: HDFS can be spread over multiple nodes in a cost-effective way, and it replicates the data so that if one copy is lost, backup copies are still available. [9]
3.3.2 MapReduce Processing Model
Hadoop MapReduce processes big data in parallel and produces output with efficient performance. MapReduce consists of a Map function and a Reduce function. The Map function performs filtering and sorting of large data sets; the Reduce function performs a summary operation that combines the intermediate results and produces the final output. Hadoop HDFS and MapReduce were modeled on the Google File System (GFS), a distributed file system developed by Google that provides organized and efficient access to data using large clusters of commodity servers.
Map phase: the master node accepts the input and divides a large problem into smaller sub-problems. It then distributes these sub-problems among worker nodes in a multi-level tree structure. The worker nodes process the sub-problems and send the results back to the master node.
Reduce phase: the Reduce function combines the outputs of all sub-problems, collects them at the master node and produces the final output. Each map function is associated with a reduce function.
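To make the two phases concrete, the following is a minimal, self-contained sketch of the MapReduce flow for a word-count job. It is written in plain Python as an illustration only, not against Hadoop's actual Java API; the names map_phase, shuffle and reduce_phase are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs, analogous to Hadoop's Mapper."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key before reduction."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: summarize each key's values, analogous to Hadoop's Reducer."""
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    docs = ["big data needs big storage", "hadoop stores big data"]
    counts = reduce_phase(shuffle(map_phase(docs)))
    print(counts)  # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}
```

In a real Hadoop cluster the shuffle step is performed by the framework between the Map and Reduce phases; here it is made explicit only to show how the intermediate (key, value) pairs are grouped.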
Fig. 3 A Multi-node Hadoop Cluster
Figure 3 shows a multi-node Hadoop cluster consisting of a master node and slave nodes. A TaskTracker and a DataNode run on each slave node, while the master node runs a TaskTracker, a DataNode, the JobTracker and the NameNode. The NameNode manages the file system, and job scheduling is performed by the JobTracker. Each DataNode stores blocks of data, and the NameNode keeps track of where that data resides. When a user interacts with the Hadoop system, the NameNode becomes active and all information about the data held on the DataNodes passes through it.
Fig. 4 HDFS file system
Figure 4 shows the HDFS file system. The NameNode is used to access and store user data, so it becomes a single point of failure. For this reason a secondary NameNode is kept active; it takes snapshots of the NameNode's state and stores this information itself, so that if the NameNode fails, the data can be recovered from the secondary NameNode. HDFS handles fault tolerance well, but other aspects of Hadoop and HDFS could still be improved. One problem is the scalability of the NameNode [8]. Because the NameNode is a single node, it is subject to physical hardware constraints; for example, since the namespace is kept in memory, HDFS may become unresponsive if the namespace grows to the point where it exhausts that node's memory. One solution is to use more than one NameNode, but this can create communication problems [8]. Another is to store only large data files in HDFS, which reduces the size of the namespace; a file system like HDFS is designed for large files in any case. Hadoop also lacks good high-level support, an issue that can occur with open-source software: enhancing Hadoop's functionality on a system can be difficult without proper support [11]. HDFS is also sensitive to scheduling delays, which keep it from reaching its full potential because a node may have to wait for its next task; this issue is found particularly in the HDFS client code [12].
IV. Comparisons of Various Clustering Algorithms
4.1 K-Means Algorithm
K-Means is a clustering algorithm that partitions objects into k clusters, where each cluster has a centre point called its centroid. The algorithm finds the desired number of distinct clusters and their centroids. Algorithmic steps for K-Means clustering [21]:
1) Choose the value of K, the number of desired clusters.
2) Initialization: choose starting points that serve as initial estimates of the cluster centroids.
3) Classification: examine each object in the dataset and assign it to the centroid nearest to it.
4) Centroid calculation: after the objects have been assigned to clusters, recompute the k centroids.
5) Convergence criterion: repeat steps 3 and 4 until objects stop changing clusters and the centroids of all clusters are fixed.
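As a concrete illustration of these five steps, below is a minimal NumPy sketch of the standard Lloyd-style K-Means iteration. It is not the implementation evaluated in [20, 21]; the function name kmeans and the toy data are purely illustrative, and dense numeric data is assumed.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-Means on an (n, d) array X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k initial centroids from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on toy 2-D data:
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
```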
4.2 Fuzzy C-Means
Algorithmic steps for Fuzzy C-Means (FCM) clustering [22]. Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, v3, ..., vc} be the set of cluster centers.
1) Randomly select 'c' cluster centers.
2) Compute the fuzzy membership µ_ij using the standard FCM update rule
   µ_ij = 1 / \sum_{k=1}^{c} ( \|x_i - v_j\| / \|x_i - v_k\| )^{2/(m-1)},
   where m > 1 is the fuzziness index.
3) Compute the fuzzy centers v_j using
   v_j = ( \sum_{i=1}^{n} (µ_ij)^m x_i ) / ( \sum_{i=1}^{n} (µ_ij)^m ).
4) Repeat steps 2 and 3 until the minimum value of 'J' is achieved or ||U^(k+1) - U^(k)|| < β, where 'k' is the iteration step, 'β' is the termination criterion in [0, 1], U = (µ_ij)_{n×c} is the fuzzy membership matrix, and 'J' is the objective function.
FCM iteratively moves the cluster centers towards their correct locations in the dataset. When fuzzy logic is introduced into the K-Means clustering algorithm, it becomes the Fuzzy C-Means algorithm; the structure of FCM is basically similar to that of K-Means. Experimental results in [20] showed that K-Means takes less elapsed time than FCM. In the K-Means algorithm the number of final clusters has to be defined beforehand, and such algorithms may be susceptible to local optima, can be outlier-sensitive and can require considerable memory. The time complexity of the K-Means algorithm is O(ncdi) and that of the FCM algorithm is O(ndc^2 i). [20] concluded that K-Means is better than FCM in this respect: FCM requires more computation time than K-Means because the fuzzy memberships must also be calculated. On the other hand, FCM can handle issues related to pattern understanding and noise in the data, and it can be used for discovering association rules, image retrieval and functional dependencies.
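The update rules above translate directly into code. The following NumPy sketch is only an illustration of section 4.2 under the common assumption m = 2; the function name fuzzy_c_means and the tolerance handling are illustrative, and this is not the implementation benchmarked in [20].

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, beta=1e-5, max_iter=100, seed=0):
    """FCM on an (n, d) array X; returns (membership matrix U of shape (n, c), centers V)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Start from a random membership matrix whose rows sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 3: centers are the membership-weighted means of the data.
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 2: memberships from the ratios of point-to-center distances.
        dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        # Step 4: stop when the membership matrix changes by less than beta.
        if np.linalg.norm(U_new - U) < beta:
            U = U_new
            break
        U = U_new
    return U, V
```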
4.3 Hierarchical Clustering
Hierarchical clustering is a clustering approach that builds a hierarchy of clusters. There are two strategies:
Agglomerative: each object starts in its own cluster, and pairs of clusters are merged as we move up the hierarchy; this is a bottom-up approach.
Divisive: all observations start in one cluster, and clusters are split iteratively as we move down the hierarchy; this is a top-down approach.
A dendrogram is used to show the results of hierarchical clustering. The complexity of agglomerative clustering is O(n^3), so its performance on large datasets is too slow; the complexity of divisive clustering is O(2^n), which is even worse. Hierarchical clustering therefore does not scale linearly in time. These algorithms can also be used in a combined approach to produce better results: agglomerative hierarchical clustering is used for its quality, and K-Means for its run-time efficiency (see the SciPy-based sketch at the end of this section). A comparative analysis of various clustering algorithms, namely K-Means, hierarchical clustering, the Self-Organization Map algorithm, the Expectation Maximization algorithm and Fuzzy C-Means, is shown in Table 1. [21, 22, 23, 24, 25]
4.4 Self-Organization Map (SOM)
The Self-Organization Map (SOM) [26] is used to find a good mapping from a high-dimensional input space onto a two-dimensional grid of nodes; objects represented by the same node are grouped into the same cluster. SOM algorithm pseudo code [26]:
1. Choose the size and dimension of the map.
2. For every new sample vector:
2.1 Compute the distance between the new vector and every codebook vector.
2.2 Update the codebook vectors near the best-matching node towards the new vector, using a learning rate that decreases over time and a neighborhood radius on the map.
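The pseudo code above can be sketched in a few lines of NumPy. The following is a simplified, illustrative SOM with a Gaussian neighborhood on a rows × cols grid, not the exact procedure evaluated in [26]; the function name train_som and the decay schedules are assumptions made for this sketch.

```python
import numpy as np

def train_som(X, rows=5, cols=5, epochs=20, lr0=0.5, seed=0):
    """Minimal SOM: returns a codebook of shape (rows*cols, d) and the grid coordinates."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    codebook = rng.random((rows * cols, d))                 # step 1: map size and dimension
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)], dtype=float)
    radius0 = max(rows, cols) / 2.0
    t_max = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            t += 1
            lr = lr0 * (1.0 - t / t_max)                    # learning rate decreasing in time
            radius = max(radius0 * (1.0 - t / t_max), 1.0)  # neighborhood radius on the map
            # step 2.1: distance between the sample and every codebook vector
            bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
            # step 2.2: pull codebook vectors near the best-matching unit towards the sample
            grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
            influence = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
            codebook += lr * influence[:, None] * (x - codebook)
    return codebook, grid
```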
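Returning to the agglomerative strategy of section 4.3, the hedged sketch below relies on SciPy's hierarchical clustering routines; the choice of average linkage and the cut into two flat clusters are arbitrary illustrative parameters, not recommendations from the cited works.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy data: two obvious groups in 2-D.
X = np.array([[1.0, 1.0], [1.1, 0.9], [1.2, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Bottom-up merging: each point starts as its own cluster (agglomerative approach).
Z = linkage(X, method='average')      # merge history, one row per merge

# Cut the merge tree to obtain a flat clustering with at most 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) would plot the merge tree described in section 4.3.
```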
Table 1. Comparative analysis of various clustering algorithms
V. Conclusion
For efficient processing of big data, Hadoop proves to be a good software framework. HDFS is a very large distributed file system that runs on commodity clusters and provides high throughput as well as fault tolerance. Fault tolerance is a particularly useful property of Hadoop, because data loss or system crashes can harm a business with irreversible consequences. For efficient retrieval of data from Hadoop, a comparative analysis of various clustering algorithms has been made on the basis of several input parameters (Table 1).
References
[1] Mayer-Schönberger V, Cukier K, Big data: A revolution that will transform how we live, work, and think, Houghton Mifflin Harcourt, 2013.
[2] Chen M, Mao S, Liu Y, Big data: A survey, Mobile Networks and Applications, 19(2), 2014 Apr, 171-209.
[3] Labrinidis A, Jagadish HV, Challenges and opportunities with big data, Proceedings of the VLDB Endowment, 5(12), 2012 Aug, 2032-2033.
[4] Chaudhuri S, Dayal U, Narasayya V, An overview of business intelligence technology, Communications of the ACM, 54(8), 2011 Aug, 88-98.
[5] Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Challenges and opportunities with big data, 2011, 1.
[6] Apache Hadoop, http://hadoop.apache.org
[7] Borthakur D, The Hadoop distributed file system: Architecture and design, Hadoop Project Website, 2007 Aug, 1.
[8] Shvachko K, Kuang H, Radia S, Chansler R, The Hadoop distributed file system, In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium, 2010 May, 1-10.
[9] Evans J, Fault tolerance in Hadoop for work migration, CSCI B34 Survey Paper, 3(28), 2011, 11.
[10] Bessani AN, Cogo VV, Correia M, Costa P, Pasin M, Silva F, Arantes L, Marin O, Sens P, Sopena J, Making Hadoop MapReduce Byzantine fault-tolerant, DSN, Fast abstract, 2010.
[11] Wang F, Qiu J, Yang J, Dong B, Li X, Li Y, Hadoop high availability through metadata replication, In Proceedings of the First International Workshop on Cloud Data Management, 2009 Nov, 37-44.
[12] Shafer J, Rixner S, Cox AL, The Hadoop distributed filesystem: Balancing portability and performance, In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium, 2010 Mar, 122-133.
[13] Dean J, Ghemawat S, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51(1), 2008 Jan, 107-113.
[14] Xie J, Yin S, Ruan X, Ding Z, Tian Y, Majors J, Manzanares A, Qin X, Improving MapReduce performance through data placement in heterogeneous Hadoop clusters, In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium, 2010 Apr, 1-9.
[15] Cattell R, Scalable SQL and NoSQL data stores, ACM SIGMOD Record, 39(4), 2011 May, 12-27.
[16] Chaiken R, Jenkins B, Larson PÅ, Ramsey B, Shakib D, Weaver S, Zhou J, SCOPE: Easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment, 1(2), 2008 Aug, 1265-1276.
[17] Grossman RL, Gu Y, Sabala M, Zhang W, Compute and storage clouds using wide area high performance networks, Future Generation Computer Systems, 25(2), 2009 Feb, 179-183.
[18] Liu X, Han J, Zhong Y, Han C, He X, Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS, In Cluster Computing and Workshops (CLUSTER '09), IEEE International Conference, 2009 Aug, 1-8.
[19] Grace RK, Manimegalai R, Kumar SS, Medical image retrieval system in grid using Hadoop framework, In Computational Science and Computational Intelligence (CSCI), International Conference, Vol. 1, 2014 Mar, 144-148.
[20] Ghosh S, Dubey SK, Comparative analysis of k-means and fuzzy c-means algorithms, International Journal of Advanced Computer Science and Applications, 4(4), 2013, 34-39.
[21] Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY, An efficient k-means clustering algorithm: Analysis and implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 2002 Jul, 881-892.
[22] Rao VS, Vidyavathi DS, Comparative investigations and performance analysis of FCM and MFPCM algorithms on iris data, Indian Journal of Computer Science and Engineering, 1(2), 2010, 145-151.
[23] Joshi A, Kaur R, A review: Comparative study of various clustering techniques in data mining, International Journal of Advanced Research in Computer Science and Software Engineering, 3(3), 2013 Mar, 55-57.
[24] Mingoti SA, Lima JO, Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms, European Journal of Operational Research, 174(3), 2006 Nov, 1742-1759.
[25] Steinbach M, Karypis G, Kumar V, A comparison of document clustering techniques, In KDD Workshop on Text Mining, Vol. 400, No. 1, 2000 Aug, 525-526.
[26] Abbas OA, Comparisons between data clustering algorithms, International Arab Journal of Information Technology, 5(3), 2008 Jul, 320-325.