50120130405014 2-3



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 5, September - October (2013), pp. 109-114, © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

DYNAMIC DATA REPLICATION AND JOB SCHEDULING BASED ON POPULARITY AND CATEGORY

Priya Deshpande1, Brijesh Khundhawala2, Prasanna Joeg3
1 Assistant Professor, MITCOE, Pune
2 ME-Student, MITCOE, Pune
3 Professor, MITCOE, Pune

ABSTRACT

Dealing with huge amounts of data makes efficient data access a critical requirement in data grids. Improving data access time is one way of reducing job execution time, i.e. of improving performance. To speed up data access and reduce bandwidth consumption, data grids replicate data in multiple locations. This paper studies a new data replication strategy for data grids that takes into account two important issues concerning replication: the storage capability of different nodes and the bandwidth consumption between nodes. It also considers the popularity of a file when replacing replicas: less popular files get lower priority than more popular files. The limitation on storage must also be considered. Performance can be optimized by placing a file as close to the client as possible. Our algorithm optimizes replication by taking into consideration the popularity of the file, the limited storage and the category of the file.

Keywords: Data Replication, Job Scheduling, Replica Strategy

I. INTRODUCTION

Large-scale geographically distributed systems are becoming very popular in data-intensive applications, most importantly scientific applications.
Life sciences, astrophysics and bioinformatics research communities are deploying grid systems to process large datasets that are stored at geographically dispersed locations. Millions of files are generated regularly, amounting to more than terabytes of data. The volume of interesting data is measured in terabytes and will soon reach petabytes, because technology and research capabilities are growing fast [13]. There is a great need to ensure efficient access to such huge and widely dispersed data in a data grid. In a data grid, performance is strongly influenced by data locality [1].

Data replication is a widely known method for improving the performance of data access in distributed systems. By creating replicas we can reduce bandwidth consumption and access latency. In particular, increasing read performance from the perspective of clients is the goal of a data replication algorithm. Replication is a mechanism for creating and managing multiple copies of files. A replica management service can be viewed as composed of the following activities: creating new replicas, registering these new replicas in a Replica Catalog, and querying the catalog to find the location of the respective replicas. The replication mechanism addresses three main questions: which file should be replicated, when to replicate and where to replicate.

To improve job execution time we also consider job scheduling. First, we place the data in the grid category-wise, i.e. data of the same category are placed as close together as possible. Then, during replica replacement, we replace the least popular file, as determined by the access frequency of the files, with the required data. We also place each job as close as possible to the data it requires, so that overall performance can be improved.

II. RELATED WORKS

In a grid computing environment, data replication and scheduling are primary concerns for performance optimization. Replica selection, replica placement and replica replacement have always been crucial for performance. Replica placement should be done in such a way that file transfer time for job execution is minimal. Replica replacement uses strategies such as LRU and LFU. A great deal of research is under way in these areas.
EDGSim [2], a simulator implemented by the European Data Grid project, was designed to simulate the performance of the European Data Grid but focused on optimizing the scheduling algorithm. Data location was treated as important, but no replication was considered. GridNet [3], in contrast, addresses the replication of data. It proposed a dynamic replication algorithm and memory middleware that were evaluated to improve data access time. The importance of data locality was first described by K. Ranganathan [4], who suggested replication strategies to reduce network bandwidth and access delay. Our system architecture is similar to the one proposed there, with a few changes. H. Sato et al. [5] proposed a file replication algorithm that improved on simple replication methods by taking into consideration network capacity and file access patterns. Similarly, R.S. Chang et al. [6] proposed the Latest Access Largest Weight (LALW) method, which uses the data access history and applies a greater weight to more recent accesses. In [9], a decentralized architecture for adaptive media dissemination was proposed. The authors assumed that the popularity of datasets follows the Zipf distribution and defined the replica weight based on popularity. In [7], a Dynamic Optimal Replication Strategy is proposed, based on the file's access history, the network status and the file's size. Reported results show that it performs better than LRU and LFU. In [8], a dynamic strategy is proposed that tracks changes in the data access patterns and then applies whichever traditional replacement strategy, such as LRU or LFU, best suits the current access pattern.

In this paper we propose a strategy that takes into account the file access history (to determine popularity), the category of the data and the location of the job to be executed, which we expect to yield better performance.

The rest of the paper is structured as follows: Section III proposes the system architecture for the strategy, Section IV specifies the steps for dynamic replication and Section V defines the scheduling strategy. Section VI gives the conclusion and Section VII lists the references.
III. SYSTEM ARCHITECTURE

The system architecture for our grid model is shown in Fig. 1. The components of our architecture are as follows:

• LS: Local Scheduler of a grid site.
• DS: Dataset/Data Scheduler of a grid site.
• RC: Replica Catalogue, which stores the list of all replicas on the grid.
• CE: Computing Element. Each grid site contains zero or more CEs representing its computing capability.
• SE: Storage Element. Each site contains zero or more SEs representing its storage capacity.
• JB: Job Broker, which receives jobs from users and submits them to an appropriate grid site.
• Replication Manager (RM): A centralized server that stores the replication information of the system. It contains an active replicator that performs replication for the system. A decentralized RM would be preferable.

Figure 1: System Architecture [12]

The process works briefly as follows. After every predefined interval, the replication manager collects data usage information from the environment. The interval should not be too large, because the information collected should be fresh. Equally, it should not be too small, because collecting information too frequently increases both bandwidth usage and processing load. Based on the collected data, the replication manager then decides which files to replicate, using the strategy defined in Section IV. In doing so it considers both the distance between data and their relation to each other.

Jobs from various clients are submitted to the Job Broker. The Job Broker can be thought of as analogous to the NameNode of Hadoop [11]. The Job Broker decides to which machine a job should be assigned. In our model the LS and the DS work in parallel: while the LS executes a job, the DS locates the data required, on the local machine and on other machines, one step ahead. Time is thus saved and system utilization is increased. This is described in detail in Section V.
IV. DYNAMIC REPLICATION

Here it is assumed that the data in a data grid belong to a field of research, e.g. biology, chemistry, meteorology, medicine, etc. [12]. These fields form the first level of a hierarchical tree. Splitting them further, we can divide biology into cell biology, molecular biology, cell technology, proteomics, etc., and each of these categories can be split further in turn. The reason behind this assumption is that data in one category are rarely or never used in another category. By doing so, we can form a hierarchical tree of relationships between data of different categories. Each data entry belongs to one category and is more closely related to data entries in the same category than to those in other categories.

Because replication takes place before job execution, it is better to put a replica near the site that frequently uses it. If we gather together the data that have a high probability of being used, performance should increase. Our idea is to gather data that are highly related to each other into small regions, so that a job which uses those data will be scheduled to run in that region. As discussed previously, the main issues of replication are:

• Which data should be replicated?
• Where should the new replica be placed?

The following sections address these issues.

A. REPLICA DECISION

To decide which file needs to be copied, we determine the popularity of the files and choose the file on that basis. In actual usage, data access patterns change over time, so any dynamic replication strategy must keep track of file access histories to decide when, what and where to replicate. The "popularity" of a file is determined by its access rate across the various clients/users. Finding the popular files is thus the key first step of our strategy. Here, it is assumed that a recently popular file will be accessed more frequently in the near future. This popularity record is maintained by every replication server.

The data category is also important for the replica decision. The replica decision is made according to the category of the data, and related data are placed together. Each unique file is assigned a unique identifier (FID).
At regular intervals our algorithm is invoked to determine the popularity of the files. Access history logs are cleared at the beginning of each replication interval in order to capture the current access pattern dynamics. The interval is chosen based on the arrival rate of data requests: a short interval is chosen when the request rate is high, and vice versa. The interval is thus adapted dynamically. The accesses to each unique file are aggregated and summarized, and the number of accesses NOA(f) is stored on the server. The average number of accesses is then calculated, and any file that has more accesses than the average is a candidate for replication. The chosen file is replicated only if its number of replicas is less than the threshold value. The threshold value is given by the following equation:

R = q / w

where R is the relative capacity of the whole system, q is the sum of the capacities of all nodes and w is the total size of all the files in the data grid.

B. REPLICA PLACEMENT

As stated above, our strategy tries to put a replica as close as possible to the category it belongs to, so that a job belonging to that category is executed nearby, which in turn reduces the file transfer time at job execution. For example, if an organization keeps the data related to its HR department as close together as possible, a job can fetch all the data it requires faster. We can even place a job by taking its category into account. In the same manner we can place the data of different departments according to their category.

To put files close together we compute the distance between them. Distance is the time required to transfer a file from one node to another, so the distance should be as low as possible. For replicas of the same file, however, the distance should be as large as possible, so that two replicas of the same file do not end up in the same region. To choose a site for a newly created replica, we evaluate the distance to every site for the selected file.
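The replica decision of Section IV-A (threshold R = q / w, then files whose NOA exceeds the average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout and the choice to read "average amount of data accessed" as the mean NOA over all files seen in the interval are ours.

```python
def replication_threshold(node_capacities, file_sizes):
    """R = q / w: total capacity of all nodes over total size of all files."""
    q = sum(node_capacities)
    w = sum(file_sizes)
    return q / w


def files_to_replicate(access_counts, replica_counts, threshold):
    """Return the FIDs that qualify for replication in this interval.

    access_counts  -- dict FID -> NOA(f) for the current interval
    replica_counts -- dict FID -> current number of replicas
                      (assumption: a file missing here has one copy)
    threshold      -- R, as returned by replication_threshold
    """
    if not access_counts:
        return []
    # Assumption: "average" means the mean NOA over all accessed files.
    average = sum(access_counts.values()) / len(access_counts)
    chosen = [
        fid
        for fid, noa in access_counts.items()
        if noa > average and replica_counts.get(fid, 1) < threshold
    ]
    # Most popular first, so the most popular file is handled first.
    return sorted(chosen, key=lambda fid: access_counts[fid], reverse=True)
```

For instance, three nodes of capacity 100 holding three files of size 50 give R = 2.0; with access counts of 10, 2 and 6, only the file with 10 accesses exceeds the mean of 6 and is selected, provided it currently has fewer than two replicas.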
The site that offers the lowest distance is chosen to store the new replica. If the storage available there is less than required, we again use the popularity of the files on the target site to find the least popular files, which are deleted before the new replica is stored.

V. SCHEDULING STRATEGY

As discussed above, we try to place a job as close as possible to the category it belongs to. The Job Broker makes use of both the DS (Data Scheduler) and the LS (Local Scheduler) to optimize job execution. The Job Broker calculates the estimated time for all the sites and then chooses the site with the minimum estimated time [12]:

Estimated required time = DT(j) + QT(j) + ET

DT: time required to transfer the data from other nodes to the site where the job is executed.
QT: queuing time.
ET: time required to execute the job.

After this step, the job is assigned to the site with the minimum estimated time. In traditional systems, all the data needed are gathered first and only then does job execution start. In our strategy, data fetching and job execution are done in parallel. The Local Scheduler fetches the files required for job execution that are on the local site and puts them in a queue in the order in which they will be used. Files that are not available locally are brought to the current site by the Data Scheduler while the job is executing. For example, while the site is executing the first task, the DS tries to bring in the files needed for the second task of the job. If a required file has not yet arrived when it is needed, the CE waits and resumes execution as soon as the file arrives. This strategy therefore minimizes the time required to execute the job.

Job Execution:
    ReceiveJob(J);
    CreateThread(LS) {
        ReceiveData(d);
        ExecuteJob();
    }
    CreateThread(DS) {
        Data d = FetchData();
        SendData(d);
    }
    Return Result;
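The two steps of the scheduling strategy, choosing the site with the minimum DT + QT + ET and overlapping data fetching with execution, can be sketched in Python as below. This is an illustrative sketch under assumptions rather than the paper's implementation: SiteEstimate, choose_site and run_job are hypothetical names, the per-site time estimates are taken as given, a job is modelled as an ordered list of (task, file) pairs, and the fetch and execute callables stand in for the real DS transfer and CE execution.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class SiteEstimate:
    """Hypothetical per-site estimates for one job, all in seconds."""
    name: str
    transfer_time: float  # DT: time to move missing data to the site
    queue_time: float     # QT: time the job waits in the site's queue
    exec_time: float      # ET: time to execute the job

    def total(self) -> float:
        return self.transfer_time + self.queue_time + self.exec_time


def choose_site(estimates):
    """Job Broker step: pick the site with the minimum DT + QT + ET."""
    return min(estimates, key=lambda s: s.total())


def run_job(tasks, fetch, execute):
    """Execute tasks in order while a DS thread fetches their files.

    tasks   -- list of (task_id, file_name) pairs (assumed job layout)
    fetch   -- callable(file_name) -> data, stands in for the DS transfer
    execute -- callable(task_id, data) -> result, stands in for the CE
    """
    ready = queue.Queue()

    def data_scheduler():
        # DS: fetch files in task order, so the file for the next task
        # is being fetched while the current task executes.
        for task_id, file_name in tasks:
            ready.put((task_id, fetch(file_name)))

    ds = threading.Thread(target=data_scheduler)
    ds.start()

    results = []
    for _ in tasks:
        # CE: blocks here only if the needed file has not arrived yet.
        task_id, data = ready.get()
        results.append(execute(task_id, data))
    ds.join()
    return results
```

A single producer and a FIFO queue keep the files in task order, which matches the "queue as per their turn for usage" described above. The sketch does not handle a failed fetch; a real scheduler would need a timeout or an error signal so the CE does not wait indefinitely.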
VI. CONCLUSION

In this paper, we proposed a dynamic optimal strategy that first calculates the popularity of the files based on the data access history. The most popular file is then taken into consideration. If its number of replicas is less than the threshold value, a replica is placed on the most appropriate node based on the file's category. A job scheduling strategy is also suggested to improve performance. Traditional replication strategies do not react to the current status of the system, so they are not as effective as dynamic replication strategies. There are still many areas that need to be considered to improve performance in the data grid environment. More parameters need to be considered in the future, as grid sizes are increasing drastically and their complexity is growing.

VII. REFERENCES

[1] Foster, I., The grid: A new infrastructure for 21st century science. Physics Today, V55, 42-47, 2002, John Wiley & Sons.
[2] P. Crosby, EDGSim. http://www.hep.ucl.ac.uk/~pac/EDGSim/
[3] H. Lamehamedi, et al., Simulation of Dynamic Data Replication Strategies in Data Grids. In Proc. of 12th Heterogeneous Computing Workshop (HCW2003), Nice, France, Apr 2003. IEEE-CS Press.
[4] K. Ranganathan, I. Foster, "Design and Evaluation of Dynamic Replication Strategies for a High-Performance Data Grid", International Conference on Computing in High Energy and Nuclear Physics, 2001.
[5] H. Sato, et al., "Access-Pattern and Bandwidth Aware File Replication Algorithm in a Grid Environment", International Conference on Grid Computing, pp. 250-257, 2008.
[6] R.S. Chang, H.P. Chang, "A Dynamic Data Replication Strategy Using Access-Weights in Data Grids", The Journal of Supercomputing, Vol. 45, No. 3, pp. 277-295, 2008.
[7] Wuqing Zhao, Xianbin Xu, Zhuowei Wang, Yuping Zhang, Shuibing He, "A Dynamic Optimal Replication Strategy in Data Grid Environment", © 2010 IEEE.
[8] Myunghoon Jeon, Kwang-Ho Lim, Hyun Ahn, Byoung-Dai Lee, "Dynamic Data Replication Scheme in Cloud Computing Environment", © 2012 IEEE.
[9] Philippe Cudré-Mauroux and Karl Aberer, "A Decentralized Architecture for Adaptive Media Dissemination", ICME '02 Proceedings, 2002, pp. 533-536.
[10] Mohammad Shorfuzzaman, Peter Graham and Rasit Eskicioglu, "Popularity Driven Dynamic Replica Placement in Hierarchical Data Grids", 2008 Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies.
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo!, Sunnyvale, California, USA, "The Hadoop Distributed File System".
[12] Nhan Nguyen Dang, Soonwook Hwang, Sang Boem Lim, "Improvement of Data Grid's Performance by Combining Job Scheduling with Dynamic Replication Strategy", © 2007 The Sixth International Conference on Grid and Cooperative Computing.
[13] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke, "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets", Journal of Network and Computer Applications, vol. 23, pp. 187-200, 2000.
[14] M. Pushpalatha, T. Ramarao, Revathi Venkataraman and Sorna Lakshmi, "Mobility Aware Data Replication using Minimum Dominating Set in Mobile Ad Hoc Networks", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 645-658, ISSN Print: 0976-6367, ISSN Online: 0976-6375.