project--2 nd review_2

Deploying and Researching
Hadoop in Virtual Machines

Hadoop:
• Hadoop is an open source software platform.
• It is derived from Google’s MapReduce and GFS (Google File System).
• Hadoop includes an open source implementation of MapReduce.
• The Apache Hadoop project develops open source software for reliable and
scalable distributed computing.
Definition:
• Basically, it is a way of storing enormous data sets across clusters of
computers.
• It is designed to be robust and efficient.
• The Apache Hadoop software library is a framework for the distributed
processing of large data sets.
• It is designed to scale up from single servers to thousands of
machines.

Abstract:
• The emergence of Hadoop and the maturity of virtualization make it
feasible to run Hadoop in virtual machines.
• This work introduces the technologies used: CloudStack,
MapReduce and Hadoop.
• It describes how to deploy Hadoop in virtual machines obtained
from CloudStack.
• We run some Hadoop programs on the virtual cluster.

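Introduction:
• Nowadays, the most frequently used programs are
Internet-based services.
• MapReduce has been reported to process about 20 PB of data per day at Google.
• Workloads at this scale need the ability to read and write large amounts of data.
• Hadoop provides a reliable shared storage and analysis system (HDFS and
MapReduce).
• It enables applications to work across clusters of up to thousands of machines.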
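
The MapReduce model named in this section can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real Hadoop job would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]
    print(word_count(sample))
```

Keeping the three steps as separate functions mirrors how the work could be split: map calls are independent per line, and reduce calls are independent per key.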

Literature survey:
• Ignoring the data locality issue in different types of
environments can easily reduce MapReduce
performance.
• Experimental results on two real data-intensive
applications show that the proposed data placement strategy
improves MapReduce performance.
• The first generation of Hadoop had two single points of
failure: the NameNode and JobTracker processes.
• Hadoop MapReduce has two main services: the
JobTracker and the TaskTracker.
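
To make the data locality point concrete, the sketch below prefers a worker that already stores a replica of a task's input block and falls back to any free worker otherwise. It is a simplified illustration, not the scheduler Hadoop actually uses: the function name assign_tasks, the block and node names, and the one-task-per-node rule are all assumptions.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign input blocks to nodes, preferring nodes that hold a replica.

    block_locations: dict mapping block id -> list of nodes storing a replica.
    free_nodes: list of nodes with a free task slot (one task per node here).
    Returns a dict mapping block id -> (node, is_local).
    """
    available = list(free_nodes)
    assignment = {}

    # Pass 1: place every block that can run next to its data.
    # This is a greedy choice and is not guaranteed to be optimal.
    for block, replicas in block_locations.items():
        local = next((node for node in available if node in replicas), None)
        if local is not None:
            assignment[block] = (local, True)
            available.remove(local)

    # Pass 2: remaining blocks take any free node; their input
    # would have to be read over the network.
    for block in block_locations:
        if block not in assignment and available:
            assignment[block] = (available.pop(0), False)

    return assignment


if __name__ == "__main__":
    # Hypothetical cluster state, for illustration only.
    blocks = {"b1": ["vm2"], "b2": ["vm1"], "b3": ["vm4"]}
    for block, (node, is_local) in assign_tasks(blocks, ["vm1", "vm2", "vm3"]).items():
        print(block, node, "local" if is_local else "remote")
```

Blocks that end up marked remote are the ones whose tasks pay a network cost, which is where ignoring locality hurts performance.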

Existing System:
• Terabytes of data need to be processed efficiently on a daily
basis.
• The existing system uses a single virtual machine.
• Its disadvantage is the potential for poor performance
under heavy load, which is the problem to be solved.

Proposed System:
• The proposed system uses CloudStack infrastructure.
• MapReduce is designed to run on a cluster, and managing thousands
of commodity PCs is a big job.
• Hadoop applications are deployed on virtual machines.
• Perhaps the biggest problem with physical clusters is power consumption.

Modules:
• Module 1: The user starts the namenode, datanode,
jobtracker and tasktracker nodes on the virtual
machines.
• Module 2: The user observes the virtual machines running on
the cluster infrastructure.
• Module 3: The user can connect to any virtual machine
running on the cluster by providing the required details.
• Module 4: The user can deploy files on the
connected virtual machine and do research on any virtual
machine.
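
The start-up step in Module 1 can be sketched as a small helper that builds the start commands in order. This is only a sketch under stated assumptions: it assumes a Hadoop 1.x installation in which each daemon is started with the bin/hadoop-daemon.sh script, and the install path /usr/local/hadoop and the function name start_commands are hypothetical. The commands are printed, not executed.

```python
import posixpath

# Daemons from Module 1: the HDFS layer first, then the MapReduce layer.
DAEMON_ORDER = ["namenode", "datanode", "jobtracker", "tasktracker"]


def start_commands(hadoop_home, daemons=DAEMON_ORDER):
    """Build one start command (as an argument list) per daemon, in order."""
    # Assumption: Hadoop 1.x layout, with the script under <home>/bin.
    script = posixpath.join(hadoop_home, "bin", "hadoop-daemon.sh")
    return [[script, "start", daemon] for daemon in daemons]


if __name__ == "__main__":
    # Hypothetical install path, for illustration only.
    for command in start_commands("/usr/local/hadoop"):
        print(" ".join(command))
```

On a real virtual machine each argument list could be passed to subprocess.run; starting the HDFS daemons before the MapReduce daemons keeps storage available before jobs are accepted.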

Hardware Requirements
• Pentium 4 Processor
• 8GB RAM
• 64-bit OS (Ubuntu)
• 200 GB HDD

Software Requirements
• Java 6
• Eclipse Indigo (With Hadoop Configuration)
• Hadoop Appliance
• Cygwin
• CloudStack

USECASE DIAGRAM
[Use case diagram: actor "user" with the use cases name node, data node, start job tracker, connect to VM, deploy files, research on files and logout.]

SEQUENCE DIAGRAM
[Sequence diagram between user and HDFS: the user sends, in order, start name node, data node, job tracker, deploy files, research on files and logout; HDFS returns a response after each message.]

COLLABORATION DIAGRAM
Participants: user, HDFS
1: start name node
2: response
3: data node
4: response
5: job tracker
6: response
7: deploy files
8: response
9: research on files
10: response
11: logout
12: response

TESTING
• Black Box Testing
• White Box Testing
• Grey Box Testing
• Regression Testing

Test Cases
Name                       | Input                     | Output
Activate Root Account      | Username and password     | Successfully enabled
Starting Management Server | Management server details | Successfully started
Adding Pod                 | Pod details               | Successfully added
Adding Zone                | Zone details              | Successfully added
Adding Cluster             | Cluster details           | Successfully added
Primary Storage            | Primary storage details   | Successfully added
Secondary Storage          | Secondary storage details | Successfully added

Conclusion:
• This project uses CloudStack, the MapReduce programming
model and Hadoop, which allow distributed parallel
execution, to show that it is feasible to deploy and
research Hadoop in virtual machines. The advantages are
easier management, fuller utilization of
computing resources, a more reliable Hadoop and
lower power consumption. Some methods to optimize
Hadoop in virtual machines are also discussed.

Future Enhancements
• Rights Management:
For example, a test administrator can be made
responsible for an experimental course; the
teachers of that course can then only view and tally information
related to it, and their permission does not extend to
other courses.
• Experimental Control and Report Submission:
The instructor can specify the experimental projects that may be
carried out, and the system can design the experiment record and save
information about the experimental projects that students have
taken in the pilot project, which facilitates faculty management.

