A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
BY : KUMARI SURABHI
INDEX TERMS
• OpenStack cloud
• Hadoop
• Distributed systems
• Virtualization
• Big data
INTRODUCTION
• Computers are revolutionizing the human era, especially the IT field.
• With new technologies and ideas, smarter, more efficient and faster computers
and frameworks are introduced to the market every day.
• Advances in intelligent machines make the computation, storage and transfer
of data faster and more accurate, which eventually helps institutions,
companies and individuals solve their problems with ease.
• Among these many computing improvements, cloud computing and
distributed systems are the main focus of this seminar.
CLOUD COMPUTING
• Cloud computing is an emerging technology being adopted in
every aspect of IT.
• Cloud computing is an abstract term describing the use of resources
that do not belong to the user to perform a required task, then
disconnecting from those resources when they are not in use.
• The most obvious examples are Gmail, Google Docs, Amazon EC2 and
storage, social networks such as Facebook and many more.
WHY CLOUD COMPUTING?
BIG DATA
• Big data describes the exponential growth, availability and use of information,
both structured and unstructured.
• The concept of big data has three basic dimensions: volume, variety and
velocity; veracity and complexity are sometimes added as further dimensions.
PROBLEM STATEMENTS AND PRELIMINARIES
The purpose of this seminar is to answer the following questions:
● Is the performance of OpenStack cloud better than a real system for a
Hadoop cluster?
● Is it feasible to run an image-processing MapReduce job in a Hadoop
cluster on OpenStack cloud?
● What are the technical difficulties of converting image files to PDF using
the MapReduce framework in a distributed system?
CLOUD COMPUTING & OPENSTACK CLOUD
• Cloud computing is the use of computing resources (hardware and
software) that are delivered as a service over a network (typically the
Internet).
• Cloud services include the delivery of software, infrastructure, and
storage over the Internet.
• Based on the deployment model, cloud computing can be subcategorized into:
public, private, community and hybrid clouds.
Cloud Computing continued...
Cloud computing can be broadly divided into:
● Software as a Service (SaaS),
● Platform as a Service (PaaS),
● Infrastructure as a Service (IaaS).
CLOUD COMPUTING continued...
Based on the deployment model, cloud computing can be subcategorized into:
• Public,
• Private,
• Community, and
• Hybrid clouds.
CLOUD COMPUTING continued...
There are many platforms available to set up a cloud, such as:
• CloudStack,
• DevStack,
• OpenStack,
• Eucalyptus,
• Nebula and many more.
Note: OpenStack is chosen as the framework for implementing the
cloud in this article.
OPENSTACK CLOUD
• It is open source: you are open to pick and mix the hardware you need.
• Open to design your own networks.
• Open to use any virtualization technology.
• Open to add other needed features, and so on.
OPENSTACK CLOUD
• OpenStack is an open source cloud framework, originally launched by
Rackspace and NASA, with the aim to promote cloud standards and
provide a solid foundation for cloud development.
• It is the most widely used tool for setting up private and public clouds.
• Big companies like Dell, AMD, Cisco, HP and Rackspace are using it.
• Linux heavyweights like Red Hat and Ubuntu are implementing it.
• OpenStack provides APIs compatible with Amazon EC2 and S3.
OPENSTACK CLOUD continued...
• OpenStack is an Infrastructure as a Service (IaaS) cloud computing project
that is free, open source software.
• It is revolutionizing the cloud computing world.
• It aims to create a system where storage and compute resources can be
scaled up quickly and efficiently.
OPENSTACK CLOUD continued...
The OpenStack cloud currently consists of six projects:
• Nova,
• Swift,
• Glance,
• Keystone,
• Quantum,
• Horizon.
OPENSTACK CLOUD continued...
● Nova:
Nova is the compute fabric controller for the OpenStack cloud.
● Swift:
Swift is the storage system for OpenStack, analogous to the Amazon
Web Services Simple Storage Service (S3).
● Glance:
Glance is the imaging service for OpenStack, responsible for the discovery,
registration and delivery of disk and server images.
OPENSTACK CLOUD continued...
● Keystone:
Keystone is the OpenStack identity service, which provides authentication
and authorization for all components of OpenStack.
● Horizon:
Horizon is a web-based dashboard that gives administrators and users a
graphical interface to access, provision and automate cloud-based
resources.
OPENSTACK CLOUD ARCHITECTURE
HADOOP
• There are many distributed systems available to address the big data
problems faced by big companies.
• Hadoop is one of the available frameworks.
• Hadoop makes data mining, analytics, and processing of big data cheap
and fast.
• Hadoop is an open source project and is made to deal with terabytes of
data in minutes.
• Hadoop stores and processes any kind of data.
• Hadoop is natively written in Java but can be accessed using other
languages, such as the SQL-inspired Hive, C/C++, Python and many more.
HADOOP continued...
• Hadoop originated from an open source web search engine project and
was based on Google's MapReduce.
• Hadoop works on commodity hardware.
HDFS(Hadoop Distributed File System)
● The Hadoop Distributed File System provides high-throughput access
to application data.
● A scalable, fault-tolerant, high-performance distributed file system.
● The namenode holds the filesystem metadata.
● Files are broken up and spread over the datanodes.
● Data is divided into 64 MB (default) or 128 MB blocks, and each block
is replicated 3 times (default).
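As a quick illustration of the block and replication scheme above, the following plain-Python sketch (the function name is ours; 64 MB blocks and 3x replication are just the defaults mentioned here) computes how a file is laid out in HDFS:

```python
import math

def hdfs_layout(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be partial
    raw_storage = file_size_mb * replication            # every block stored `replication` times
    return blocks, raw_storage

# The 169 MB input used in the first experiment needs 3 blocks and
# consumes roughly 507 MB of raw cluster storage with 3x replication.
print(hdfs_layout(169))  # (3, 507)
```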
ARCHITECTURE OF HDFS
MAPREDUCE
● MapReduce programs are executed in two main phases,
called mapping and reducing.
● Each phase is defined by a data processing function; these functions
are called the mapper and the reducer, respectively.
● In the mapping phase, MapReduce takes the input data and
feeds each data element to the mapper.
● In the reducing phase, the reducer processes all the outputs
from the mapper and arrives at the final result.
● In simple terms, the mapper is meant to filter and transform
the input into something that the reducer can aggregate over.
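The mapping and reducing phases above, together with the implicit shuffle step that groups mapper outputs by key before they reach the reducer, can be sketched in plain Python (a schematic of the programming model, not Hadoop's actual API):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Simulate MapReduce locally: map each input, group by key, then reduce."""
    grouped = defaultdict(list)
    # Mapping phase: each input element yields (key, value) pairs.
    for element in inputs:
        for key, value in mapper(element):
            grouped[key].append(value)          # shuffle: collect values per key
    # Reducing phase: aggregate each key's values into a final result.
    return {key: reducer(key, values) for key, values in grouped.items()}

# Example: summing values per key.
result = run_mapreduce(
    [("a", 1), ("b", 2), ("a", 3)],
    mapper=lambda pair: [pair],                 # identity mapper
    reducer=lambda key, values: sum(values),
)
print(result)  # {'a': 4, 'b': 2}
```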
MAPREDUCE continued...
PERFORMANCE ANALYSIS MODEL
● The performance analysis model discusses the use of
two basic applications:
1. WordCount application
2. Image-to-PDF conversion application
● WordCount is a common MapReduce program used
to count the total number of words found in a document.
● The image-to-PDF conversion program is used to convert images
into PDF.
PERFORMANCE ANALYSIS MODEL continued...
● These two programs are executed on a cluster of commodity computers
as well as on a cluster of OpenStack cloud virtual instances.
● The performance is analysed by changing the number of nodes
and the size of the data.
● The performance analysis has been done for both
applications.
WORD COUNT APPLICATION
● WordCount is a simple application that counts the number of
occurrences of each word in a given input set.
● The purpose of this program is to calculate the total number of repetitions
of words in a particular document.
● The pseudocode for the mapper and reducer of the WordCount program is
outlined in Algorithm 1 and Algorithm 2, respectively.
Mapper function for WordCount Program
● Input: String filename, String document
● Output: (String token, 1)

Map(String filename, String document)
{
    List<String> T = tokenize(document);
    for each token in T
    {
        emit((String) token, (Integer) 1);
    }
}
Reducer function for WordCount Program
● Input: (String token, List<Integer> values)
● Output: (String token, sum)

Reduce(String token, List<Integer> values)
{
    Integer sum = 0;
    for each value in values
    {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}
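Algorithms 1 and 2 can be mirrored as a small runnable sketch in plain Python; the local driver below stands in for the framework's shuffle step (the function names are ours, not Hadoop's):

```python
from collections import defaultdict

def wordcount_map(filename, document):
    """Mapper: emit (token, 1) for every token in the document."""
    for token in document.split():
        yield token, 1

def wordcount_reduce(token, values):
    """Reducer: sum the counts emitted for one token."""
    return token, sum(values)

def run_wordcount(documents):
    grouped = defaultdict(list)                   # shuffle: group counts by token
    for filename, text in documents.items():
        for token, count in wordcount_map(filename, text):
            grouped[token].append(count)
    return dict(wordcount_reduce(t, v) for t, v in grouped.items())

print(run_wordcount({"doc1": "to be or not to be"}))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```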
Image to Pdf Conversion Application
● Hadoop is popular for processing textual big data, so there is a lot of
material available for developing text-related applications.
● But little work has been done on image data processing in Hadoop.
● So there were many challenges while developing the application.
● Some of the difficulties faced were serialization issues with images,
Hadoop splitting images into its default blocks, image-to-PDF
conversion, text-to-PDF conversion and many more.
Work flow of the application
● Under the MapReduce model, data processing primitives are called mappers
and reducers.
Mapper function for ImagetoPdf Program
● Input: String key, KUPDF value
● Output: (filename, KUPDF value (PDF file))

Map(String key, KUPDF value)
{
    for each bufferList in value
    {
        write(filename, value);
    }
}
Reducer function for ImagetoPdf Program
● Input: String key, KUPDF values
● Output: (filename, KUPDF value (PDF file))

Reduce(String key, KUPDF values)
{
    for each value in values
    {
        concat value as a separate page of the PDF;
    }
    write(key, final PDF);
}
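KUPDF above is the authors' custom record type, so the following is only a schematic sketch: each value is modelled as one pre-rendered PDF page, and the reducer concatenates them into a single document (a real implementation would use a PDF library rather than Python lists):

```python
def imagetopdf_reduce(key, values):
    """Schematic reducer: concatenate each value as a separate page of one PDF.

    `values` stands in for the per-image PDF fragments (the KUPDF objects
    in the original pseudocode).
    """
    final_pdf = []
    for value in values:
        final_pdf.append(value)      # each image's fragment becomes one page
    return key, final_pdf            # write(key, final PDF)

key, pdf = imagetopdf_reduce("album.pdf", ["page-from-img1", "page-from-img2"])
print(key, len(pdf))  # album.pdf 2
```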
MAPPER AND REDUCER
● In the mapping phase, MapReduce takes the input data and feeds each
data element to the mapper.
● In the reducing phase, the reducer processes all the outputs from the
mapper and arrives at the final result.
● The mapper is meant to filter and transform the input into something
that the reducer can aggregate over.
● The PDFMapper and PDFReducer classes do these jobs in the application
developed for image and PDF files.
METHODS OF PERFORMANCE EVALUATIONS
Cloud Cluster Setup
• A quad-core Intel® Xeon® CPU (64-bit)
• 16 GB RAM
• 1 TB ATA disk
• A 500 GB ATA disk as storage
• 32-bit dual gigabit network interfaces were used.
• Ubuntu 14.04 Server was installed as the operating system.
Cloud Cluster Setup continued...
● Installation and configuration of the OpenStack Essex cloud was done by
following the OpenStack tutorial.
● Appropriate images were created with a virtualization tool supporting
KVM or Xen, such as QEMU, and with terminal commands.
● The virtual systems for the cloud cluster were created from these images,
using terminal commands and the OpenStack web interface, after successful
network configuration of fixed and floating IPs and other security
parameters.
Commodity Computer Cluster Setup
The four-node cluster of commodity computers is set up on:
● Intel i5 quad-core 64-bit CPUs with 2 GB RAM
● One with a 160 GB ATA disk
● The other three with 80 GB ATA disks
● 32-bit gigabit network interfaces
Commodity Computer Cluster Setup continued...
● Passwordless secure shell was configured,
● Java 7 was installed, and
● Hadoop 0.20.2 was configured on all four instances.
Cloud Cluster Setup continued...
● A four-node cluster, with one node acting as master/slave and the other
three as slaves, was created using the Ubuntu 14.04 image.
● Passwordless secure shell was configured, Java 7 was installed and
Hadoop 0.20.2 was configured on all four instances.
Configuration of Experiments

Commodity Computer / Cloud Server | Commodity Computer Details       | Cloud Server Details
Master vs cse-dcg                 | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
Slave1 vs user1                   | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
Slave2 vs user2                   | 2 GB RAM, 2 VCPU, 80 GB storage  | 2 GB RAM, 2 VCPU, 80 GB storage
Slave3 vs user3                   | 2 GB RAM, 2 VCPU, 80 GB storage  | 2 GB RAM, 2 VCPU, 80 GB storage
Experimental Results
After the successful configuration of the clusters, three jobs were run on
both systems:
● two jobs converting image files to PDF files, and
● one WordCount job.
● The first two jobs were based on image and PDF files being serialized in the
MapReduce framework.
● The last job was implemented based on the standard WordCount
program available in the Hadoop package.
● The algorithms were run first on a two-node cluster with a master and two
slaves, and then scaled up to a four-node cluster with a master and four slaves
(the master running a slave process as well).
1) Directory-wise Image to PDF:
The results of the first job are summarized in Table II and Table III:

TABLE II: SUMMARY OF FIRST JOB ON TWO NODES
                  | INPUT                              | OUTPUT                            | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 PDF files, 90.1 MB | 5 minutes 20 seconds
Cloud Cluster     | 23 folders, 94 image files, 169 MB | 1 folder, 94 PDF files, 90.1 MB   | 3 minutes 43 seconds
Directory-wise Image to PDF continued...
● The first job's algorithm is designed to search images directory-wise and
convert each image file to a PDF file, preserving the same directory tree as
the input image files.

TABLE III: SUMMARY OF FIRST JOB ON FOUR NODES
                  | INPUT                              | OUTPUT                            | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 PDF files, 90.1 MB | 3 minutes 8 seconds
Cloud Cluster     | 23 folders, 94 image files, 169 MB | 1 folder, 94 PDF files, 90.1 MB   | 1 minute 31 seconds
GRAPHICAL REPRESENTATION
Time taken for First Job
EXPLANATION
The input to the job contained:
● 23 folders,
● 94 files, and
● 169 MB of data in total.
The output was the conversion of each image file to PDF with the same
directory and file names, 90.1 MB in size for both runs.
● The processing was repeated three times to get the average.
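Using the times reported in Tables II and III, the scaling gain from two to four nodes can be checked directly (a quick sanity calculation, not part of the original analysis):

```python
def speedup(time_two_nodes_s, time_four_nodes_s):
    """Ratio of two-node runtime to four-node runtime for the same job."""
    return time_two_nodes_s / time_four_nodes_s

commodity = speedup(5 * 60 + 20, 3 * 60 + 8)   # 320 s on two nodes, 188 s on four
cloud = speedup(3 * 60 + 43, 1 * 60 + 31)      # 223 s on two nodes, 91 s on four

print(f"commodity speedup: {commodity:.2f}")   # commodity speedup: 1.70
print(f"cloud speedup: {cloud:.2f}")           # cloud speedup: 2.45
```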
2) Multiple Images to Single PDF:
● A modified version of the first job explained above.
● All images are converted into a single final PDF output file.
● This processing is also done first on a two-node cluster and later scaled up
to a four-node cluster, as with the first algorithm.
● The processing was repeated three times to get the average.
● The experiments are summarized in Table IV and Table V.
Multiple Images to Single PDF continued...
● The input contained 476 image files in one directory, 926 MB in size.
● The output was a single PDF file of 200.1 MB.

TABLE IV: SUMMARY OF SECOND JOB ON TWO NODES
                  | INPUT                             | OUTPUT               | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 PDF file, 200.1 MB | 11 minutes 29 seconds
Cloud Cluster     | 1 folder, 476 image files, 926 MB | 1 PDF file, 200.1 MB | 12 minutes 28 seconds
Multiple Images to Single PDF continued...
TABLE V: SUMMARY OF SECOND JOB ON FOUR NODES
                  | INPUT                             | OUTPUT               | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 PDF file, 200.1 MB | 7 minutes 51 seconds
Cloud Cluster     | 1 folder, 476 image files, 926 MB | 1 PDF file, 200.1 MB | 9 minutes 22 seconds
GRAPHICAL REPRESENTATION
This shows that the commodity computer cluster is more efficient than the
virtual node cluster in the OpenStack cloud.
Time taken for Second Job
EXPLANATION
● The first two jobs processed mappings of small image files, which are not
very effective in the Hadoop system, as Hadoop performs really well with
large data sets as input.
● So, in order to test the real performance of Hadoop on big data, the
default WordCount job of the Hadoop system was also run.
● Hadoop was designed for text processing rather than image processing,
so textual processing was also chosen to analyse the Hadoop clusters.
TABLE VI: SUMMARY OF THIRD JOB ON TWO NODES
The input for the job was a text file of 1.1 GB, and the output was a file
containing a list of words, 364.6 KB in size.
                  | INPUT               | OUTPUT                            | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 364.6 KB | 7 minutes 51 seconds
Cloud Cluster     | 1 text file, 1.1 GB | 1 text file with counts, 364.6 KB | 9 minutes 22 seconds
TABLE VII. SUMMARY OF THIRD JOB ON FOUR NODES
                  | INPUT               | OUTPUT                            | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 4 minutes 0 seconds
Cloud Cluster     | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 5 minutes 1 second
GRAPHICAL REPRESENTATION
This shows that the Hadoop cluster on commodity computers performed better
than the Hadoop cluster in the cloud.
Time taken for Third Job
PERFORMANCE ANALYSIS
● The Hadoop distributed system set up on physical computers is expected
to be more efficient and faster than the cloud system.
● The first reason is that Hadoop was developed with commodity machines
in mind.
● The second obvious reason is that the processing is done on physical
hardware without any resource sharing, in contrast to cloud systems.
CONTRADICTION
● The results of the first job contradict the points discussed above and the
other two jobs.
REASON:
● The job has to recursively read and write files, and thus has to cache all the
bytes read and to be written, which is faster in the cloud because the nodes
reside on one server and there is no wire communication between them.
CONCLUSION
● This seminar analysed running a Hadoop cluster in the cloud and on a real
system, identifying the best solution by running simple Hadoop jobs on the
configured clusters.
● It concludes that running a Hadoop cluster in the cloud for data storage and
analysis is more flexible and more easily scalable than a real-system cluster.
● The two-node to four-node experiments demonstrated the easy scalability:
the cloud cluster scaled up by creating an instance from an already
configured image.
● The case was not the same on the real system, where we needed to get the
machine, download the software, and adjust the configuration to join the
new machine to the cluster.
● Failed nodes in a cloud cluster can be terminated and replaced with a new
instance in seconds, which is not possible on a real system.
● The cluster on real-system computers is faster than the cloud cluster.
● But due to the advantageous features of the cloud computing system, such
as quick termination of servers (nodes) if problems arise, recreation of a
node from the same state in which the machine was terminated, automatic
networking, and instant creation of nodes and clusters, a cloud Hadoop
cluster would be more favourable.
● Despite the difficulties of writing image-related algorithms in the MapReduce
framework and the serialization errors of images, and despite the popularity
of text processing in Hadoop, it is still possible to perform image processing
in a distributed framework such as Hadoop.
FUTURE SCOPE
To perform the same algorithms using different cloud frameworks, comparing
commodity cluster performance against a new cloud virtual cluster, or
analysing and comparing the OpenStack cloud virtual cluster against a new
cloud virtual cluster.
THANK YOU

08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters

  • 1. A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters BY: KUMARI SURABHI
  • 2. INDEX TERMS • OpenStack cloud • Hadoop • Distributed system • Virtualization • Big data
  • 3. INTRODUCTION • Computers are revolutionizing the human era, especially the IT field. • With new technologies and ideas, smart, efficient and faster computers and frameworks are introduced in the market every day. • Advances in intelligent machines and ideas make computation, storage and transaction of data faster and more accurate, which eventually lets institutions, companies and individuals solve their problems with ease. • Among the many computational improvements, cloud computing and distributed systems are the main focus of this seminar.
  • 4. CLOUD COMPUTING • Cloud computing is an emerging technology that is being exploited in every aspect of IT. • Cloud computing is an abstract term describing the use of resources that do not belong to the user to perform a required task, then disconnecting from those resources when not in use. • The most obvious examples are Gmail, Google Docs, Amazon EC2 and storage, social networks such as Facebook, and many more.
  • 7. BIG DATA • Describes the exponential growth, availability and use of information, both structured and unstructured. • The concept of Big Data has three basic dimensions: volume, variety and velocity; other dimensions are veracity and complexity.
  • 8. PROBLEM STATEMENTS AND PRELIMINARIES The purpose of this seminar is to answer the following questions: ● Is the performance of the OpenStack cloud better than a real system on a Hadoop cluster? ● Is it feasible to run an image-processing MapReduce job in a Hadoop cluster on the OpenStack cloud? ● What are the technical difficulties of converting image files to PDF using the MapReduce framework in a distributed system?
  • 9. CLOUD COMPUTING & OPENSTACK CLOUD • Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet). • Cloud services include the delivery of software, infrastructure, and storage over the Internet. • Based on the deployment model used, cloud computing can be subcategorized into: Public, Private, Community and Hybrid clouds.
  • 10. Cloud Computing continued... Cloud computing can be broadly divided into: ● Software as a Service (SaaS), ● Platform as a Service (PaaS), ● Infrastructure as a Service (IaaS).
  • 11. CLOUD COMPUTING continued... Based on the deployment model used, cloud computing can be subcategorized into: • Public, • Private, • Community and • Hybrid clouds.
  • 12. CLOUD COMPUTING continued... There are many platforms available to set up a cloud, such as: • CloudStack (Apache Foundation), • DevStack, • OpenStack, • Eucalyptus, • Nebula and many more. Note: OpenStack is chosen as the framework for implementing the cloud in this article.
  • 13. OPENSTACK CLOUD • It is open source, which leaves users open to pick and mix any hardware needed. • Open to design their own networks. • Open to use any virtualization technology. • Open to other needed features, and so on.
  • 14. OPENSTACK CLOUD • OpenStack is an open source cloud framework, originally launched by Rackspace and NASA, with the aim to promote cloud standards and provide a solid foundation for cloud development. • It is the most widely used tool for setting up private and public clouds. • Big companies like Dell, AMD, Cisco, HP and Rackspace are using it. • Linux heavyweights like Red Hat and Ubuntu are implementing it. • OpenStack provides an API compatible with Amazon's.
  • 15. OPENSTACK CLOUD continued... • OpenStack is an Infrastructure as a Service (IaaS) cloud computing project that is free open source software. • It is revolutionizing the cloud computing world. • It aims to create a system where storage, resources and performance can scale up quickly and efficiently.
  • 16. OPENSTACK CLOUD continued... The OpenStack cloud currently consists of six projects: • Nova, • Swift, • Glance, • Keystone, • Quantum, • Horizon.
  • 17. OPENSTACK CLOUD continued... ● Nova: Nova is the computing fabric controller for the OpenStack cloud. ● Swift: Swift is the storage system for OpenStack, analogous to Amazon Web Services' Simple Storage Service (S3). ● Glance: An imaging service for OpenStack responsible for discovery, registration and delivery services for disk and server images.
  • 18. OPENSTACK CLOUD continued... ● Keystone: Keystone is the OpenStack identity service, which provides authentication and authorization for all components of OpenStack. ● Horizon: Horizon is a web-based dashboard that provides administrators and users a graphical interface to access, provision and automate cloud-based resources.
  • 20. HADOOP • There are many distributed systems available to address the big data problems faced by big companies. • Hadoop is one of the available frameworks. • Hadoop makes data mining, analytics, and processing of big data cheap and fast. • Hadoop is an open source project and is made to deal with terabytes of data in minutes. • Hadoop stores and processes any kind of data. • Hadoop is natively written in Java but can be accessed using other languages such as an SQL-inspired language (Hive), C/C++, Python and many more.
  • 21. HADOOP continued... • Hadoop originated from an open source web search engine project that was based on Google's MapReduce. • Hadoop works on commodity hardware.
  • 22. HDFS (Hadoop Distributed File System) ● The Hadoop Distributed File System provides unrestricted, high-speed access to application data. ● A scalable, fault-tolerant, high-performance distributed file system. ● The Namenode holds filesystem metadata. ● Files are broken up and spread over Datanodes. ● Data is divided into 64 MB (default) or 128 MB blocks, each block replicated 3 times (default).
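As a worked example of the block scheme above, here is a small Python sketch (not Hadoop code) of how many blocks and how much raw storage one file occupies under the default 64 MB block size and 3x replication; the 926 MB figure is the image input used in the experiments later in the deck.

```python
import math

BLOCK_SIZE_MB = 64   # HDFS default block size (128 MB is the other common choice)
REPLICATION = 3      # HDFS default replication factor

def hdfs_layout(file_size_mb, block_mb=BLOCK_SIZE_MB, replicas=REPLICATION):
    """Blocks and raw storage for one file; HDFS does not pad the final block."""
    blocks = math.ceil(file_size_mb / block_mb)
    return {"blocks": blocks,
            "replica_blocks": blocks * replicas,
            "raw_storage_mb": file_size_mb * replicas}

# The 926 MB image input from the second job would be stored as:
print(hdfs_layout(926))
# {'blocks': 15, 'replica_blocks': 45, 'raw_storage_mb': 2778}
```

Note that replication triples the storage cost but is what lets a Datanode fail without data loss.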
  • 24. MAPREDUCE ● MapReduce programs are executed in two main phases, called mapping and reducing. ● Each phase is defined by a data-processing function; these functions are called the mapper and reducer, respectively. ● In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. ● In the reducing phase, the reducer processes all the outputs from the mapper and arrives at the final result. ● In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
  • 26. PERFORMANCE ANALYSIS MODEL ● In the performance analysis model, we discuss the use of two basic applications: 1. the WordCount application and 2. the ImagetoPdf conversion application. ● WordCount is a common MapReduce program used to count the total number of words found in a document. ● The ImagetoPdf conversion program is used for converting images into a PDF.
  • 27. PERFORMANCE ANALYSIS MODEL continued... ● These two programs are executed on a commodity computer cluster as well as an OpenStack cloud virtual instance cluster. ● The performance is analysed by changing the number of nodes and the size of the data. ● The performance analysis has been done for both applications.
  • 28. WORD COUNT APPLICATION ● WordCount is a simple application that counts the number of occurrences of each word in a given input set. ● The purpose of this program is to calculate the total number of repetitions of each word in a particular document. ● The pseudocode for the Mapper and Reducer of the WordCount program is outlined in Algorithm 1 and Algorithm 2, respectively.
  • 29. Mapper function for WordCount Program
Input: String filename, String document
Output: String token, 1
Map(String filename, String document)
{
  List<String> T = tokenize(document);
  For each token in T
  {
    emit((String) token, (Integer) 1);
  }
}
  • 30. Reducer function for WordCount Program
Input: String token, List<Integer> values
Output: (String) token, sum
Reduce(String token, List<Integer> values)
{
  Integer sum = 0;
  For each value in values
  {
    sum = sum + value;
  }
  emit((String) token, (Integer) sum);
}
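Algorithms 1 and 2 can be exercised outside Hadoop with a small single-process Python simulation: the mapper emits (token, 1) pairs, a shuffle step (performed by the framework in a real cluster) groups the values by token, and the reducer sums each group. The whitespace tokenizer and the sample document are assumptions for illustration only.

```python
from collections import defaultdict

def map_word_count(filename, document):
    # Algorithm 1: emit ((String) token, (Integer) 1) for every token
    for token in document.split():
        yield (token, 1)

def shuffle(pairs):
    # grouping by key is done by the Hadoop framework between the phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(token, values):
    # Algorithm 2: emit ((String) token, (Integer) sum)
    return (token, sum(values))

doc = "the quick brown fox jumps over the lazy dog the end"
grouped = shuffle(map_word_count("doc.txt", doc))
counts = dict(reduce_word_count(t, vs) for t, vs in grouped.items())
print(counts["the"])  # 3
```

In a real cluster the only change is that mappers and reducers run in parallel on different nodes, with HDFS supplying the input splits.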
  • 31. Image to PDF Conversion Application ● Hadoop is popular for processing textual big data, so there is a lot of material available if an application related to text is to be developed. ● But little work has been done on image data processing in Hadoop. ● So there were a lot of challenges while developing the application. ● Some of the difficulties faced were serialization issues with images, splitting of images by Hadoop into its default blocks, image-to-pdf conversion, text-to-pdf conversion and many more.
  • 32. Workflow of the application ● Under the MapReduce model, data-processing primitives are called mappers and reducers.
  • 33. Mapper function for ImagetoPdf Program
Input: String key, KUPDF value
Output: filename, KUPDF value (pdf file)
Map(String key, KUPDF value)
{
  For each bufferList in value
  {
    write(filename, value);
  }
}
  • 34. Reducer function for ImagetoPdf Program
Input: String key, KUPDF values
Output: filename, KUPDF value (pdf file)
Reduce(String key, KUPDF values)
{
  For each value in values
  {
    concat value as a separate page of the pdf;
  }
  write(key, final pdf);
}
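The reduce step above can be imitated in plain Python by treating each mapper output as an already-converted PDF page and concatenating all pages for a key into one final document. This is only a stand-in: the real job builds actual PDF pages and relies on Hadoop serialization for the custom KUPDF type, and `concat_pdf` and `pages` are illustrative names, not taken from the original code.

```python
def concat_pdf(key, values):
    """Concatenate each value as a separate 'page' of the final pdf."""
    final_pdf = b"".join(values)  # real code would append PDF page objects
    return (key, final_pdf)

# three mapper outputs (converted pages) arriving at the reducer for one key
pages = [b"<page-1>", b"<page-2>", b"<page-3>"]
name, pdf = concat_pdf("album.pdf", pages)
print(name, len(pdf))
```

The essential point it illustrates is that the reducer sees all pages for one output file together, so page order and final assembly happen in a single place.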
  • 35. MAPPER AND REDUCER ● In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. ● In the reducing phase, the reducer processes all the outputs from the mapper and arrives at the final result. ● The mapper is meant to filter and transform the input into something that the reducer can aggregate over. ● The PDFMapper and PDFReducer classes do the above-mentioned jobs in the application developed with image files and pdf files.
  • 36. METHODS OF PERFORMANCE EVALUATION Cloud Cluster Setup • quad-core Intel® Xeon® 64-bit CPU • 16 GB RAM • 1 TB ATA disk • 500 GB ATA disk as storage • 32-bit dual gigabit network interfaces were used. • Ubuntu 14.04 server was installed as the operating system.
  • 37. Cloud Cluster Setup continued... ● Installation and configuration of the OpenStack Essex cloud was done following the OpenStack tutorial. ● Appropriate images were created using a virtualization tool supporting KVM or XEN, such as QEMU, and terminal commands. ● Virtual systems for the cloud cluster were created from these images, using terminal commands and the OpenStack web interface, after successful network configuration of fixed and floating IPs and other security parameters.
  • 38. Commodity Computer Cluster Setup The four-node cluster of commodity computers is set up on: ● Intel i5 quad-core 64-bit CPUs with 2 GB RAM ● One with a 160 GB ATA disk ● The other three with 80 GB ATA disks ● 32-bit gigabit network interfaces.
  • 39. Commodity Computer Cluster Setup continued... ● Passwordless secure shell was configured, ● Java 7 was installed and ● Hadoop 0.20.2 was configured on all four machines.
  • 40. Cloud Cluster Setup continued... ● A four-node cluster, one node acting as master/slave and the other three as slaves, was created using the Ubuntu 14.04 image. ● Passwordless secure shell was configured, Java 7 was installed and Hadoop 0.20.2 was configured on all four instances.
  • 41. Configuration of Experiments
Commodity Computer / Cloud Server | Commodity Computer Details | Server Computer Details
Master vs cse-dcg | 2 GB RAM, 2 VCPU, 160 GB Storage | 2 GB RAM, 2 VCPU, 80 GB Storage
Slave1 vs user1 | 2 GB RAM, 2 VCPU, 160 GB Storage | 2 GB RAM, 2 VCPU, 80 GB Storage
Slave2 vs user2 | 2 GB RAM, 2 VCPU, 80 GB Storage | 2 GB RAM, 2 VCPU, 80 GB Storage
Slave3 vs user3 | 2 GB RAM, 2 VCPU, 80 GB Storage | 2 GB RAM, 2 VCPU, 80 GB Storage
  • 42. Experimental Results After the successful configuration of the clusters, three jobs were run on both systems: ● two jobs to convert image files to pdf files and ● one word-count job. ● The first two jobs were based on image and pdf files being serialized in the MapReduce framework. ● The last job was implemented based on the standard WordCount program available in the Hadoop package. ● The algorithms are first run on a two-node cluster with a master and two slaves, then scaled up to a four-node cluster of a master and four slaves (the master running a slave machine as well).
  • 43. 1) Directory-wise Image to PDF: The results of the first job are summarized in Table II and Table III:
TABLE II: SUMMARY OF FIRST JOB ON TWO NODES
 | INPUT | OUTPUT | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 5 minutes 20 seconds
Cloud Cluster | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB | 3 minutes 43 seconds
  • 44. Directory-wise Image to PDF continued... ● The first job's algorithm is designed to search images directory-wise and convert each image file to a PDF file with the same directory tree as the input image files.
TABLE III: SUMMARY OF FIRST JOB ON FOUR NODES
 | INPUT | OUTPUT | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 3 minutes 8 seconds
Cloud Cluster | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB | 1 minute 31 seconds
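From Tables II and III, the 2-to-4-node scaling of the first job can be checked with a few lines of arithmetic (times converted to seconds):

```python
# Time taken for the first job, in seconds, from Tables II and III.
job1 = {
    "commodity": {2: 5 * 60 + 20, 4: 3 * 60 + 8},   # 320 s -> 188 s
    "cloud":     {2: 3 * 60 + 43, 4: 1 * 60 + 31},  # 223 s -> 91 s
}
for cluster, t in job1.items():
    print(f"{cluster}: {t[2] / t[4]:.2f}x speedup from 2 to 4 nodes")
# commodity: 1.70x speedup from 2 to 4 nodes
# cloud: 2.45x speedup from 2 to 4 nodes
```

On this job the cloud cluster is both faster in absolute terms and scales better with added nodes; this is the exception explained on the CONTRADICTION slide.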
  • 46. EXPLANATION The input to the job contained: ● 23 folders ● 94 files and ● 169 MB in total. The output was the conversion of each image file to pdf with the same directory and file names, generating 90.1 MB for both runs. ● The processing was repeated three times to get the average.
  • 47. 2) Multiple Images to Single PDF: ● A modified version of the first job explained above. ● All images are converted to a final single pdf output file. ● This processing is also done first on two nodes and later scaled up to a four-node cluster, as in the first algorithm. ● The processing is repeated three times to get the average. ● The experiments are summarized in Table IV and Table V.
  • 48. Multiple Images to Single PDF continued... ● The input contained 476 image files in one directory, 926 MB in size. ● The output was a single pdf file of 200.1 MB.
TABLE IV: SUMMARY OF SECOND JOB ON TWO NODES
 | INPUT | OUTPUT | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 11 minutes 29 seconds
Cloud Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 12 minutes 28 seconds
  • 49. Multiple Images to Single PDF continued...
TABLE V: SUMMARY OF SECOND JOB ON FOUR NODES
 | INPUT | OUTPUT | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 7 minutes 51 seconds
Cloud Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 9 minutes 22 seconds
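The same arithmetic applied to Tables IV and V quantifies the gap for the second job, where the commodity cluster wins at both scales:

```python
# Time taken for the second job, in seconds, from Tables IV and V.
job2 = {
    2: {"commodity": 11 * 60 + 29, "cloud": 12 * 60 + 28},  # 689 s vs 748 s
    4: {"commodity": 7 * 60 + 51, "cloud": 9 * 60 + 22},    # 471 s vs 562 s
}
for nodes, t in job2.items():
    gap = 100 * (t["cloud"] - t["commodity"]) / t["commodity"]
    print(f"{nodes} nodes: cloud cluster is {gap:.0f}% slower")
# 2 nodes: cloud cluster is 9% slower
# 4 nodes: cloud cluster is 19% slower
```

Note that the relative gap widens as nodes are added, consistent with the virtual instances sharing one physical server's resources.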
  • 50. GRAPHICAL REPRESENTATION This shows that the commodity computer cluster is more efficient than the virtual node cluster in the OpenStack cloud. Time taken for Second Job
  • 51. EXPLANATION ● The first two jobs were processed by mapping small image files, which is not so effective in Hadoop, as Hadoop performs really well with large data sets as input. ● So, in order to test the real performance of Hadoop on big data, the default word-count job of the Hadoop system was also run. ● Hadoop was designed for text processing rather than image processing, so textual processing was also chosen to analyse the Hadoop clusters.
  • 52. TABLE VI: SUMMARY OF THIRD JOB ON TWO NODES The input for the job was a text file of 1.1 GB and the output was a file containing a list of words, 364.6 KB in size.
 | INPUT | OUTPUT | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 7 minutes 51 seconds
Cloud Cluster | 1 pdf file, 200.1 MB | 1 pdf file, 200.1 MB | 9 minutes 22 seconds
  • 53. TABLE VII: SUMMARY OF THIRD JOB ON FOUR NODES
 | INPUT | OUTPUT | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 4 minutes 0 seconds
Cloud Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 5 minutes 1 second
  • 54. GRAPHICAL REPRESENTATION It shows that the Hadoop cluster on commodity computers performed better than the Hadoop cluster in the cloud. Time taken for Third Job
  • 55. PERFORMANCE ANALYSIS ● The Hadoop distributed system set up on personal computers is certain to be more efficient and faster than the cloud system. ● The first reason is that Hadoop was developed with commodity machines in mind. ● The second obvious reason is that the processing is done on physical hardware without any resource sharing, as compared to cloud systems.
  • 56. CONTRADICTION ● The first job contradicts the points discussed above and the other two jobs. REASON: ● The job has to recursively read and write files, and thus has to cache all the bytes read and to be written, which is faster in the cloud as the nodes are on one server and there is no wire communication between nodes.
  • 57. CONCLUSION ● An analysis of running a Hadoop cluster in the cloud and on a real system, identifying the best solution by running simple Hadoop jobs in the configured clusters. ● It concludes that running a Hadoop cluster in the cloud for data storage and analysis is more flexible and more easily scalable than a real-system cluster. ● The two-node to four-node experiments proved the easy scalability: the cloud cluster scaled up by creating an instance from an already configured image. ● The case was not the same on the real system, where we needed to get the machine, download the software, and adjust the configuration to join the new machine to the cluster.
  • 58. ● Failed nodes in the cloud cluster could be terminated and replaced with a new instance in seconds, but the same is not possible on a real system. ● The cluster on real-system computers is faster than the cloud cluster. ● But due to the advantageous features of the cloud computing system, such as quick termination of servers (nodes) if problems arise, re-creation of a node from the same state in which the machine was terminated, automatic networking, and instant creation of nodes and clusters, a cloud Hadoop cluster would be more favourable. ● Despite the difficulties in writing image-related algorithms in the MapReduce framework and the serialization errors of images, and despite the popularity of text processing in Hadoop, it is still possible to perform image processing in a distributed framework such as Hadoop.
  • 59. FUTURE SCOPE To perform the same algorithms using different cloud frameworks, comparing commodity cluster performance versus a new cloud virtual cluster, or an analysis and comparison of the OpenStack cloud virtual cluster versus a new cloud virtual cluster.
  • 60. REFERENCES ● [1] Jinesh Varia, Sajee Mathew, “Overview of Amazon Web Services,” Amazon Web Services, 2014. ● [2] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, Ivona Brandic, “Cloud Computing and emerging IT platforms: Vision, hype and reality for delivering computing as the 5th utility,” Future Generation Computer Systems, 2009, Elsevier. ● [3] Dai Yuefa, Wu Bo, Gu Yaqiang, Zhang Quan, Tang Chaojin, “Data Security Model for Cloud Computing,” Proceedings of the 2009 International Workshop on Information Security and Application (IWISA 2009), pp. 21-22, China.
  • 61. ● [4] Qiao Lian, Wei Chen, Zheng Zhang, “On the impact of replica placement to the reliability of distributed brick storage systems,” Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, pp. 187-196, 2005, IEEE. ● [5] Daniel Ford et al., “Availability in globally distributed storage systems,” Google Inc. ● [6] HDFS Architecture Guide, http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html.