International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 6, November - December (2013), pp. 386-393, © IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

COMMUNICATION BETWEEN DISTRIBUTED SYSTEMS USING GOOGLE INFRASTRUCTURE

Isak Shabani (1), Amir Kovaçi (2)
(1) Asst. Professor, University of Prishtina “Hasan Prishtina”, Kosova
(2) Master student, University of Prishtina “Hasan Prishtina”, Kosova

ABSTRACT

Distributed systems are software systems in which components installed on networked computers communicate with each other by passing messages in order to perform interconnected operations. Programs that run on distributed systems are known as distributed programs and are designed using distributed programming. Computer networks are everywhere; mobile networks, enterprise networks and other kinds of networks share the same properties. Distributed communication at Google relies on the Google File System (GFS), which enables efficient and reliable access to data. The main purpose of distributed system design is the sharing of resources across computer networks.

Keywords: GFS, Google AppEngine, Datastore, SDK.

1. INTRODUCTION

Distributed systems are defined as systems in which hardware and software components communicate through message exchange. These systems can be physically located in different places, but to communicate efficiently they should meet conditions such as concurrent communication, synchronization of actions with the other computer nodes, and independent handling of failures, so that the rest of the system is unaffected by a failure.
Google File System (GFS) is at the core of data processing and storage for Google as a search engine. GFS files are divided into chunks, similar to the clusters or sectors of a traditional file system. Furthermore, the GFS infrastructure contains multiple nodes: a Master node, which is the main one, and multiple chunkservers. Search and other operations on data are performed similarly to those in relational database systems. Bigtable enables data manipulation on big data with high performance. Through specific search strategies and data operation features, Google manages to handle huge volumes of information, process many queries and produce good results for its clients. The critical requirements that Google is able to fulfill are those related to reliability and continuous availability of data [3].

Unlike centralized systems, where data processing resources reside in a main data processing center, in the distributed approach each component is in most cases self-sustaining and the system does not have a single point of failure. Recently, many distributed systems have been designed using different technologies and programming languages, enriched with many namespaces designed specifically for distributed systems. It is important to identify the programming languages and tools that are suitable for designing distributed systems using Google's offerings as a framework for developing such systems. This work examines the design of distributed systems using Google features together with libraries in the Java programming language.

Today’s web-based systems are very often based on complex distributed systems, which are themselves built from different software modules, developed by different teams, programmed in different languages and spread across different machines. The evolution of computer systems, the rapid development of LAN and WAN networks and the rapid growth of data exchange have enabled connection and information exchange between different machines (PCs, servers, mobile devices). This development has led to the advancement of computer networks known as distributed systems. There are different definitions of distributed systems; all of them have in common the conclusion that a distributed system is a group of computers that are independent of each other, while the end user has the impression that the system is a single one.
Like centralized systems, distributed systems have hardware and software components. While the computers are autonomous from each other, from the software point of view the applications deployed on the servers should function in such a way that users think they are using a single server.

2. DISTRIBUTED SYSTEMS COMMUNICATION

In distributed systems, three main communication models are usually used for the components to interact with each other:
• Interprocess communication
• Remote invocation
• Indirect communication

Interprocess communication is performed through send and receive operations on messages represented as sequences of bytes. Blocking send and receive operations are used to synchronize communication between the sender and the receiver. Sockets provide the communication endpoints of the two processes. The local port and the Internet address determine which process receives a given message. Indirect communication is carried out through an intermediary between the sender and the receiver; publish-subscribe systems, message queues and shared memory approaches are typical cases of this communication setup.

3. GOOGLE INFRASTRUCTURE

From the distributed perspective, Google is built from a number of distributed services which provide the basic functionality. In general these services can be grouped as follows [1]:
• Communication layer, including the services for remote procedure calls and indirect communication, with serialization of remote call requests.
• Coordination and data services, providing data access by means of: GFS, which handles the requests coming from different applications and services; Chubby, which provides coordination services and storage for small volumes of data; and Bigtable, a database that provides semi-structured access to data [8].
• Services for distributed processing, performing parallel operations on the physical infrastructure.

3.1. Google as Cloud Service

Google nowadays plays a significant role in cloud computing, which is defined as a set of Internet-based applications capable of storing and processing most user requirements. The main services provided by Google are Gmail, Google Docs, Google Talk and Google Calendar, which aim to replace traditional software. Through its service platform, Google offers APIs for distributed systems and Internet services for hosting web applications. With the inception of Google App Engine, Google has gone beyond software services and provides the infrastructure for distributed services as cloud services, where businesses and organizations can run their web applications on the Google framework [2].

3.2. Google File System (GFS)

Similar to general-purpose file systems, which provide file and directory operations to different applications, GFS is a distributed file system with a variety of abstractions that provides more advanced capabilities. Its main aim is to meet the growing requirements of Google as a search engine and the requests which come from other web applications. There is a set of requirements which GFS must fulfill:
• The first requirement is that GFS must run reliably on the hardware and software architecture.
  The designers of GFS started with the assumption that components will fail (not just hardware components but also software components) and that the design must be sufficiently tolerant of such failures to enable application-level services to continue their operation in the face of any likely combination of failure conditions.
• GFS is optimized for the patterns of usage within Google, both in terms of the types of files stored and the patterns of access to those files. The number of files stored in GFS is not huge in comparison with other systems, but the files tend to be massive. The patterns of access are also atypical of file systems in general. Accesses are dominated by sequential reads through large files and sequential writes that append data to files, and GFS is very much tailored towards this style of access. These file patterns are influenced, for example, by the storage of many web pages sequentially in single files that are scanned by a variety of data analysis programs. The level of concurrent access is also high in Google, with large numbers of concurrent appends being particularly prevalent, often accompanied by concurrent reads.
• GFS must meet all the requirements for the Google infrastructure as a whole; that is, it must scale (particularly in terms of volume of data and number of clients), it must be reliable in spite of the assumption about failures noted above, it must perform well and it must be open in that it should support the development of new web applications. In terms of performance and given the types of data file stored, the system is optimized for high and sustained throughput in reading data, and this is prioritized over latency. This is not to say that latency is unimportant; rather, this particular component (GFS) needs to be optimized for high-performance reading and appending of large volumes of data for the correct operation of the system as a whole.
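Because the system is tuned for large sequential reads and appends, GFS stores files in fixed-size chunks (64 MB each, as Section 3.3 describes), so locating data inside a file reduces to integer arithmetic. The following is a minimal Java sketch of that arithmetic only. The class and method names (ChunkMath, chunkIndex, offsetInChunk, chunksNeeded) are hypothetical and chosen for illustration; they are not part of any GFS client API.

```java
// Illustrative only: ChunkMath and its methods are hypothetical names,
// not part of any real GFS client library.
final class ChunkMath {
    // Fixed chunk size stated in the paper: 64 MB.
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    private ChunkMath() {}

    // Index of the chunk that contains the given byte offset of a file.
    static long chunkIndex(long byteOffset) {
        requireNonNegative(byteOffset);
        return byteOffset / CHUNK_SIZE;
    }

    // Position of the byte inside its chunk.
    static long offsetInChunk(long byteOffset) {
        requireNonNegative(byteOffset);
        return byteOffset % CHUNK_SIZE;
    }

    // Number of chunks needed to hold a file of the given length (rounded up).
    static long chunksNeeded(long fileLength) {
        requireNonNegative(fileLength);
        return (fileLength + CHUNK_SIZE - 1) / CHUNK_SIZE;
    }

    private static void requireNonNegative(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must not be negative: " + value);
        }
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024; // a read starting at byte offset 200 MB
        System.out.println("chunk index: " + chunkIndex(offset));
        System.out.println("offset in chunk (bytes): " + offsetInChunk(offset));
    }
}
```

For example, a read at byte offset 200 MB falls into chunk index 3 (the fourth chunk), 8 MB into that chunk. A client would pair such an index with the file name when asking the master for the chunk location.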
3.3. GFS Architecture

The most widely used storage design in GFS is based on fixed-size chunks, where each chunk is 64 MB. This is quite large compared to other file system designs. At one level this simply reflects the size of the files stored in GFS. At another level, this decision is crucial to providing highly efficient sequential reads and appends of large amounts of data [2]. The job of GFS is to provide a mapping from files to chunks and then to support standard operations on files, mapping down to operations on individual chunks. This is achieved with the architecture shown in Figure 1, which shows an instance of a GFS file system as it maps onto a given physical cluster. Each GFS cluster has a single master and multiple chunkservers. The role of the master is to manage metadata about the file system, defining the namespace for files, access control information and the mapping of each particular file to the associated set of chunks. When clients need to access data starting from a particular byte offset within a file, the GFS client library will first translate this to a file name and chunk index pair (easily computed given the fixed size of chunks) [5]. This is then sent to the master in the form of an RPC request (using protocol buffers). The master replies with the appropriate chunk identifier and location of the replicas, and this information is cached in the client and used subsequently to access the data by direct RPC invocation to one of the replicated chunkservers.

Figure 1: GFS Architecture (diagram showing a client with the GFS client library, the GFS master and the GFS chunkservers holding data, connected by control flow and data flow)

4. GOOGLE APP ENGINE

Google App Engine is a service for hosting web applications, which are usually accessed through a web browser and can be social networks, games, mobile applications, publications, etc. The engine can also serve traditional content such as documents, images and videos, but its main purpose is running dynamic applications in real time [4]. In particular, App Engine is designed to serve as a hosting and storage layer for applications which need simultaneous access by many users; the engine handles this by adapting to changes in the environment (scalability). As the number of users increases, the engine allocates the necessary resources.
4.1. Execution Environment

App Engine is composed of three main parts [5]: application instances, data storage and services. The engine accepts a request and identifies the application from the domain name of the address; it then selects one of the servers, which in most cases is the server expected to reply fastest. To use resources as fully as possible and to avoid re-initialization, the engine keeps the execution environment alive longer than the time needed to handle a single request. Each instance has local memory which serves to hold the imported code and data structures. The engine creates and destroys instances based on traffic needs. Unlike traditional hosting, the application code cannot access the server directly: the application can read its own files from the file system but cannot write to them, and it cannot read files that belong to other applications. Google App Engine supports the Java and Python languages and an environment based on the Go language.

4.2. Static File Server

Static files (HTML, CSS, images), which do not change during operation, do not need to be generated and served by the application code. These files are managed by dedicated servers for static content. These servers are optimized, in both internal architecture and network topology, for handling requests for static resources [7].

4.3. Datastore

In recent decades, the most used approach to data storage for web applications has been the relational database. Tables, rows, columns and queries are used for storing and displaying data. Other types are those based on hierarchical organization (file systems, XML databases) and object databases.
The database system of Google App Engine (GAE) most closely resembles an object database, and it does not support join operations in queries. Queries run against the datastore return zero or more entities of a given kind, and they can also return only the keys of the entities. Queries can filter the data on conditions based on the values of entity properties, and the results can be ordered. Unlike a relational database, where queries are planned and executed in real time against tables stored the way the developer designed them, in Google App Engine every query is served by an index which is itself maintained by the datastore. When the application runs a query, the datastore finds the corresponding index, scans to the first index row that matches the query, and returns the entity for each consecutive row until the first row that does not match the query. Google App Engine provides indexes for simple queries, while for complex queries the application must declare the additional indexes in its configuration. The engine itself offers the possibility of generating a configuration file that records the indexes used during the test phase.

With transactions, changes are either applied in full or, in case of any failure, rolled back to the previous state, so that the data remain in a coherent state during multiple simultaneous accesses. When operations are called through the datastore API, the result is returned to the caller only after the transaction has completed successfully.

4.4. Bigtable

Bigtable is a storage system for structured data that scales to petabytes (PB) and can be distributed across thousands of servers. Applications such as Google Earth, Google Finance and Orkut are typical users of Bigtable.
These applications have extreme requirements: millions of operations must be performed at high speed through asynchronous processes. Bigtable stores its data in a multidimensional sorted map indexed by row, column and timestamp. Columns that are accessed frequently are arranged into column families. The timestamp mechanism allows multiple versions of the content to be stored in the same cell while keeping the latest version accessible. Rows are sorted alphabetically and divided into groups; groups of similar rows are stored on the same machine for easy access. Bigtable helps Google keep the cost of new services and additional processing power low. Bigtable sits on top of GFS, which it uses for storing data and log files, and of a system for task management. Chubby [2] is the service tasked with ensuring that only one master server is active and with storing the location of the Bigtable data; it also stores schema information. To implement applications which employ Bigtable, a master server and other secondary servers are needed.

Through App Engine, Google offers the possibility to install and run applications on the Google infrastructure. Applications developed with App Engine are easy to develop and maintain, and they are scalable. To develop such applications, programming languages such as Java, JavaScript, Python, Ruby, etc. can be used [10].

The datastore stores entity objects with their properties and supports different data types; it also offers the means to execute multiple operations in a single transaction, which is very important for web applications that require simultaneous access by many users. Unlike relational database systems, the datastore uses a distributed architecture to manage large datasets, and it differs considerably in how it describes the relationships between data objects [6].
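The Bigtable data model described earlier in this section (cells addressed by row, column and timestamp, rows kept in sorted order, several versions per cell with the latest one readable) can be pictured as nested sorted maps. The sketch below is a single-process, in-memory illustration only, under the assumption that string values and numeric timestamps are sufficient. The names VersionedTable, put, getLatest, versionCount and rowKeys are hypothetical, and the sketch says nothing about how Bigtable is actually implemented or distributed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: an in-memory stand-in for the
// (row, column, timestamp) -> value model. All names are hypothetical.
final class VersionedTable {
    // row key -> column name -> timestamp -> value, every level kept sorted
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Stores a new version of a cell; older versions are kept.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Returns the most recent version of a cell, or null if the cell is empty.
    String getLatest(String row, String column) {
        NavigableMap<Long, String> versions = versionsOf(row, column);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        return versions.lastEntry().getValue(); // highest timestamp is the newest
    }

    // Number of versions currently stored for a cell.
    int versionCount(String row, String column) {
        NavigableMap<Long, String> versions = versionsOf(row, column);
        return versions == null ? 0 : versions.size();
    }

    // Row keys in sorted (lexicographic) order.
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    private NavigableMap<Long, String> versionsOf(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }
}
```

Keeping every level in a sorted map is what makes neighbouring row keys adjacent, which mirrors the idea of storing similar rows together; a real system would additionally split the row range across servers.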
Two entities of the same kind can have different properties, and different entities can have properties with the same name but different types. Although the datastore has some similarities with relational databases, it also differs from them considerably: joins cannot be performed, mainly because the datastore is designed to process huge amounts of data. Each data record represents an entity and is interpreted in code as an object; each entity has a key which uniquely identifies it among the other entities in the datastore. Each entity has one or more named properties, which are the attributes of the object. The code below shows the case where an entity of kind Student is created; some properties with values of different types are set on it, and it is saved as a new entity.

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    // Datastore initialization (getDatastoreService is a static factory method)
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    // req is the HttpServletRequest of the enclosing servlet method
    Entity stu = new Entity("Student");
    // "Emri" = first name, "Mbiemri" = last name
    stu.setProperty("Emri", req.getParameter("emri_input"));
    stu.setProperty("Mbiemri", req.getParameter("mbiemri_input"));
    stu.setProperty("Email", req.getParameter("email_input"));
    // Save the data in the datastore
    datastore.put(stu);

Saving, displaying and deleting entities is done using the corresponding methods of the datastore service. The datastore service is obtained with the getDatastoreService method of the DatastoreServiceFactory class. To create or update an entity in the datastore we call the put() method, passing the entity as a parameter. Retrieving and deleting a record is performed using the get() and delete() methods. The Query class is used to build queries, whereas PreparedQuery is used to retrieve the entities from the datastore. The code that follows was written in Java using Eclipse.
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.PreparedQuery;
    import com.google.appengine.api.datastore.Query;

    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    // Display the data in tabular form
    // Fetch the data through the Query class
    Query q = new Query("Student");
    // Get the results through the PreparedQuery interface
    PreparedQuery pq = datastore.prepare(q);
    // writer is the output writer of the enclosing servlet response
    for (Entity result : pq.asIterable()) {
        // Write the data for each row
        writer.write("<tr>");
        writer.write("<td>" + result.getProperty("Emri") + "</td>");
        writer.write("<td>" + result.getProperty("Mbiemri") + "</td>");
        writer.write("<td>" + result.getProperty("Email") + "</td>");
        writer.write("</tr>");
    }

Figure 2: Display of data of student entity from datastore viewer

5. CONCLUSION AND FUTURE WORK

In this paper we have presented some of the main features of Google as a distributed system and the possibilities it offers as a platform for building distributed systems. Designing distributed systems with Google capabilities such as cloud computing results in distributed systems which are stable, secure, reliable and easily maintainable. With Google App Engine and languages such as Java and Python, web-based applications can be built without investing too much time and money. A big advantage of distributed applications developed on Google infrastructure (NoSQL databases) is high scalability. NoSQL databases are designed specifically with the requirements of the Internet as the focal point. To provide reliable access to millions of visitors around the world within a few hundred milliseconds, you need functionality beyond that of relational databases. Google App Engine offers the datastore as NoSQL storage. It allows you to store entities, each with a set of key-value pairs.
Another benefit of App Engine is that you need not worry about system administration, and its APIs integrate easily with the rest of the platform.
REFERENCES

[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes and Robert E. Gruber, “Bigtable: A Distributed Storage System for Structured Data”, Google, Inc., ACM Transactions on Computer Systems (TOCS), Volume 26, Issue 2, Article No. 4, June 2008.
[2] Seth Gilbert and Nancy A. Lynch, “Perspectives on the CAP Theorem”, Volume 45, Issue 2, February 2012.
[3] Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Léon, Yawei Li, Alexander Lloyd and Vadim Yushprakh, “Megastore: Providing Scalable, Highly Available Storage for Interactive Services”, CIDR, 2011.
[4] Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Science Department, Stanford University, http://infolab.stanford.edu/pub/papers/google.pdf
[5] Andrew Fikes, “Storage Architecture and Challenges”, Google, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//university/relations/facultysummit2010/storage_architecture_and_challenges.pdf
[6] Fay Chang and Jeffrey Dean, Research paper based on Bigtable, “A Distributed Storage System For Structured Data”, http://resumegrace.appspot.com/pdfs/ResearchPaper_BigTable_Distributed.pdf
[7] Google App Engine: Using Static Files, https://developers.google.com/appengine/docs/python/gettingstartedpython27/staticfiles
[8] http://bigtable.appspot.com
[9] Preeti Gupta, Parveen Kumar and Anil Kumar Solanki, “A Comparative Analysis of Minimum-Process Coordinated Checkpointing Algorithms for Mobile Distributed Systems”, International Journal of Computer Engineering & Technology (IJCET), Volume 1, Issue 1, 2010, pp. 46-56, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[10] Nathwani Namrata, “Network Attached Storage Different from Traditional File Servers & Implementation of Windows Based NAS”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 539-549, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[11] Parveen Kumar and Poonam Gahlan, “A Minimum Process Synchronous Checkpointing Algorithm for Mobile Distributed System”, International Journal of Computer Engineering & Technology (IJCET), Volume 1, Issue 1, 2010, pp. 72-81, ISSN Print: 0976-6367, ISSN Online: 0976-6375.