"Volunteer Computing with BOINC Client-Server side" by Diamantino Cruz and Ricardo Madeira



Coursework for Sistemas Paralelos e Distribuídos (Parallel and Distributed Systems): "Volunteer Computing with BOINC Client-Server side" by Diamantino Cruz and Ricardo Madeira


Volunteer Computing with BOINC Client-Server side
Diamantino Cruz, Ricardo M. Madeira, and Rui Lopes

Abstract— Around 300 million personal computers are connected to the Internet. The majority are idle or under-used most of the time, and their processing and storage potential goes to waste. That wasted potential, however, is starting to be taken advantage of by projects using volunteer computing (around 1% of those machines at the moment [12, 13]). Volunteer computing uses computational resources that would otherwise go unused to solve computationally intensive problems [10]. This paper analyzes the BOINC (Berkeley Open Infrastructure for Network Computing) service from a client-server perspective.

Index Terms— Distributed Systems, Client/server, BOINC, Volunteer Computing

1 INTRODUCTION

1.1 Death, Taxes, and BOINC

It is said that life holds but two certainties, death and taxes. Nevertheless, with the technological expansion of the last few years, it is reasonably safe to assume that many of the paths leading to the future converge at some points, a good number of them in different parallel and distributed systems. According to Tanenbaum [1], "A distributed system is a collection of independent computers that appears to its users as a single coherent system". In this work we make a quick introduction to some of those systems, address client-server systems in general, and take a focused look at volunteer computing. And what is volunteer computing? And why can't we escape from it?

Volunteer computing uses Internet-connected computers, volunteered by their owners, as a source of computing power and storage. This paper discusses the client-server side of BOINC (Berkeley Open Infrastructure for Network Computing), a middleware system for volunteer computing. It was originally developed to support the SETI@home project before becoming useful as a platform for other distributed applications in areas as diverse as mathematics, medicine, molecular biology, climatology, and astrophysics. In essence, BOINC is software that can use the unused CPU and GPU cycles on a computer to do scientific computing: what one individual doesn't use of his or her computer, BOINC uses. It consists of a server system and client software that communicate with each other to distribute, process, and return work units. Just to glimpse the sheer potential of this project: using a single server computer costing about $4,000, a BOINC project can dispatch about 8.8 million tasks per day. If each client is issued one task per day and each task uses 12 CPU hours on a 1 GFLOPS computer, the project can support 8.8 million clients and obtain 4.4 PetaFLOPS of computing power. With two additional server computers, a project can dispatch about 23.6 million tasks per day [2]. Now think of this power redirected to projects ranging from LHC@home ("a volunteer computing program which enables you to contribute idle time on your computer to help physicists develop and exploit particle accelerators, such as CERN's Large Hadron Collider") to Spinhenge@home ("where you will actively support the research of nano-magnetic molecules. In the future these molecules will be used in localized tumor chemotherapy and to develop tiny memory-modules."), and it is easy to deduce why we can't escape BOINC as a platform of investigation for the near future.

1.2 Client-Server

Client-server describes the relationship between two computer programs in which one program, the client, makes a service request to another, the server. Standard networked functions such as email exchange, web access and database access are based on the client-server model. For example, a web browser is a client program at the user's computer that may access information at any web server in the world. To check your bank account from your computer, a web browser client program in your computer forwards your request to a web server program at the bank. That program may in turn forward the request to its own database client program, which sends a request to a database server at another bank computer to retrieve your account balance. The balance is returned to the bank database client, which in turn serves it back to the web browser client in your personal computer, which displays the information for you.

The client-server model has become one of the central ideas of network computing. Most business applications being written today use the client-server model, as do the Internet's main application protocols, such as HTTP, SMTP, Telnet, and DNS. In marketing, the term has been used to distinguish distributed computing by smaller dispersed computers from the "monolithic" centralized computing of mainframe computers. But this distinction has largely disappeared as mainframes and their applications have also turned to the client-server model and
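The capacity figures quoted above follow from simple arithmetic. The short Python sketch below reproduces the 4.4 PetaFLOPS estimate from the paper's stated assumptions (one task per client per day, 12 CPU hours per task on a 1 GFLOPS machine, 8.8 million tasks dispatched per day):

```python
# Back-of-envelope check of the server capacity figures quoted above.
GFLOPS = 1e9
tasks_per_day = 8.8e6        # tasks one server can dispatch per day
task_cpu_hours = 12.0        # CPU hours per task on a 1 GFLOPS host
seconds_per_day = 24 * 3600

# Each client contributes 12 GFLOPS-hours per 24-hour day,
# i.e. half a GFLOPS of sustained throughput.
per_client_flops = task_cpu_hours * 3600 * GFLOPS / seconds_per_day

total_flops = tasks_per_day * per_client_flops
print(total_flops / 1e15)  # sustained PetaFLOPS -> 4.4
```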
become part of network computing.

Each instance of the client software can send data requests to one or more connected servers. In turn, the servers can accept these requests, process them, and return the requested information to the client. Although this concept can be applied for a variety of reasons to many different kinds of applications, the architecture remains fundamentally the same.

The most basic type of client-server architecture employs only two types of hosts: clients and servers. This type of architecture is sometimes referred to as two-tier. It allows devices to share files and resources. The two-tier architecture means that the client acts as one tier and the application in combination with the server acts as another tier.

These days, clients are most often web browsers, although that has not always been the case. Servers typically include web servers, database servers and mail servers. Online gaming is usually client-server too. In the specific case of MMORPGs, the servers are typically operated by the company selling the game; for other games, one of the players will act as the host by setting his game in server mode.

The interaction between client and server is often described using sequence diagrams, which are standardized in the Unified Modeling Language.

When both the client software and the server software are running on the same computer, this is called a single-seat setup.

Specific types of clients include web browsers, email clients and online chat clients. Specific types of servers include web servers, ftp servers, application servers, database servers, name servers, mail servers, file servers, print servers, and terminal servers. Most web services are also types of servers. [3]

1.3 Grid

The last decade has seen a considerable increase in commodity computer and network performance, mainly as a result of faster hardware and more sophisticated software. Nevertheless, there are still problems, in the fields of science, engineering and business, which cannot be dealt with effectively by the current generation of supercomputers. In fact, due to their size and complexity, these problems are often numerically and/or data intensive and require a variety of heterogeneous resources that are not available from a single machine. A number of teams have conducted studies on the cooperative use of geographically distributed resources conceived as a single powerful virtual computer. This new approach is known by several names, such as metacomputing, seamless scalable computing, global computing, and more recently Grid computing.

The early efforts in Grid computing started as projects to link US supercomputing sites, but it has now grown far beyond its original intent. In fact, there are many applications that can benefit from the Grid infrastructure, including collaborative engineering, data exploration, high-throughput computing, and of course distributed supercomputing.

The term 'Grid' is chosen to suggest the idea of a 'power grid': namely, that application scientists can plug into the computing infrastructure like plugging into an electrical power grid. It is important to note, however, that the term 'Grid' is sometimes used synonymously with a networked, high-performance computing infrastructure. Obviously this aspect is an important enabling technology for future applications, but in reality it is only part of a much larger scenario that also includes information handling and support for knowledge within the scientific process. It is this broader view of the infrastructure that is now being referred to as the Semantic Grid. The Semantic Grid is characterized by an open system, with a high degree of automation, which supports flexible collaboration and computation on a global scale. [4]

1.4 Peer-to-Peer

In the literature, the term Peer-to-Peer (P2P) is used to describe a wide variety of software applications. The applications that have been classified as P2P come from a diverse range of domains such as file-sharing, distributed computing, instant messaging and content distribution, and there is a lack of agreement on the set of criteria that can be used to call an application P2P. Computers that are connected to the Internet but have variable connectivity and temporary network addresses are often called computers on the edge of the Internet. The existing definitions of P2P can be broadly divided into two groups, depending on whether they make the ability of P2P to utilize computers at the edge of the Internet the key defining characteristic. The definitions in the first group (e.g., [5]) do emphasize this ability as the key defining characteristic of P2P, whereas those in the second category (e.g., [6]) do not.

The potential of P2P applications lies in their ability to utilize the computers at the edge of the Internet. However, making this the defining characteristic of P2P excludes P2P applications used in intranets, where they are extremely useful for tasks such as collaboration; P2P applications deployed in an intranet do not require the capability to utilize the computers at the edge of the Internet. While that ability is a benefit of P2P applications, we do not consider it a defining characteristic in this work.

A useful and accurate definition of P2P was given by Schollmeier [6] in 2001, at the IEEE P2P conference.
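To make the two-tier idea concrete, here is a minimal sketch in Python: a toy 'tier 2' server that answers one request over TCP, and a 'tier 1' client that sends a request and reads the reply. The upper-casing 'service' and the one-request-per-connection protocol are invented for illustration, not part of any real system.

```python
# Minimal two-tier client-server sketch over TCP (toy protocol).
import socket
import threading

def serve_once(sock):
    """Tier 2: accept a single client, answer its request, and return."""
    conn, _addr = sock.accept()
    with conn:
        request_bytes = conn.recv(1024)
        conn.sendall(request_bytes.upper())  # the 'service' provided by the server

def request(port, payload):
    """Tier 1: connect to the server, send a request, return the reply."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(payload)
        conn.shutdown(socket.SHUT_WR)  # signal end of request
        return conn.recv(1024)

server = socket.socket()
server.bind(("127.0.0.1", 0))  # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve_once, args=(server,))
t.start()
print(request(port, b"hello server"))  # b'HELLO SERVER'
t.join()
server.close()
```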
The definition states that:

"A distributed network architecture may be called a Peer-to-Peer (P-to-P, P2P, ...) network if the participants share a part of their own hardware resources (processing power, storage capacity, network link capacity, printers, ...). These shared resources are necessary to provide the service and content offered by the network (e.g., file sharing or shared workspaces for collaboration). They are accessible by other peers directly, without passing through intermediary entities. The participants of such a network are thus resource (service and content) providers as well as resource (service and content) requesters (servent-concept)."

1.5 Cloud Computing

Cloud computing is a general term for anything that involves delivering hosted services over the Internet. These services are broadly divided into three categories: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). The name cloud computing was inspired by the cloud symbol that is often used to represent the Internet in flow charts and diagrams.

A cloud service has three distinct characteristics that differentiate it from traditional hosting. It is sold on demand, typically by the minute or the hour; it is elastic, in that a user can have as much or as little of a service as they want at any given time; and the service is fully managed by the provider (the consumer needs nothing but a personal computer and Internet access). Significant innovations in virtualization and distributed computing, as well as improved access to high-speed Internet and a weak economy, have accelerated interest in cloud computing.

A cloud can be private or public. A public cloud sells services to anyone on the Internet. (Currently, Amazon Web Services is the largest public cloud provider.) A private cloud is a proprietary network or a data center that supplies hosted services to a limited number of people. When a service provider uses public cloud resources to create their private cloud, the result is called a virtual private cloud. Private or public, the goal of cloud computing is to provide easy, scalable access to computing resources and IT services.

Infrastructure-as-a-Service providers like Amazon Web Services supply virtual server instances with unique IP addresses and blocks of storage on demand. Customers use the provider's application program interface (API) to start, stop, access and configure their virtual servers and storage. In the enterprise, cloud computing allows a company to pay for only as much capacity as is needed, and to bring more online as soon as it is required. Because this pay-for-what-you-use model resembles the way electricity, fuel and water are consumed, it is sometimes referred to as utility computing.

Platform-as-a-Service in the cloud is defined as a set of software and product development tools hosted on the provider's infrastructure. Developers create applications on the provider's platform over the Internet. PaaS providers may use APIs, website portals or gateway software installed on the customer's computer. Force.com (an outgrowth of Salesforce.com) and Google Apps are examples of PaaS. Developers need to know that currently there are no standards for interoperability or data portability in the cloud; some providers will not allow software created by their customers to be moved off the provider's platform.

In the Software-as-a-Service cloud model, the vendor supplies the hardware infrastructure and the software product, and interacts with the user through a front-end portal. SaaS is a very broad market. Services can be anything from Web-based email to inventory control and database processing. Because the service provider hosts both the application and the data, the end user is free to use the service from anywhere. [7]

Or, as Brian Hayes [8] simply puts it: "Data and programs are being swept up from desktop PCs and corporate server rooms and installed in 'the compute cloud.' Whether it's called cloud computing or on-demand computing, software as a service, or the Internet as platform, the common element is a shift in the geography of computation. When you create a spreadsheet with the Google Docs service, major components of the software reside on unseen computers, whereabouts unknown, possibly scattered across continents."

2 Volunteer Computing

2.1 Don't ask what the Internet can do for you; ask what you can do for the world

Volunteer computing is an arrangement in which people (volunteers) provide computing resources to projects, which use the resources to do distributed computing and/or storage. Volunteers are typically members of the general public who own Internet-connected PCs; organizations such as schools and businesses may also volunteer the use of their computers. Projects are typically academic (university-based) and do scientific research, but there are exceptions; for example, GIMPS and distributed.net (two major projects) are not academic. Several aspects of the project/volunteer relationship are worth noting:

Volunteers are effectively anonymous; although they may be required to register and supply an email address or other information, they are not linked to a real-world identity. Because of their anonymity, volunteers are not accountable to projects. If a volunteer misbehaves in some way (for example, by intentionally returning incorrect computational results), the project cannot prosecute or discipline the volunteer.

Volunteers must trust projects in several ways:
The volunteer trusts the project to provide applications that don't damage their computer or invade their privacy.

The volunteer trusts that the project is truthful about what work is being done by its applications, and about how the resulting intellectual property will be used.

The volunteer trusts the project to follow proper security practices, so that hackers cannot use the project as a vehicle for malicious activities.

The first volunteer computing project was GIMPS (Great Internet Mersenne Prime Search), which started in 1995. Other early projects include distributed.net, SETI@home, and Folding@home. Today there are over 50 active projects.

2.2 Why is it important?

Because of the huge number (more than 1 billion) of PCs in the world, volunteer computing supplies more computing power to science than does any other type of computing. This computing power enables scientific research that could not be done otherwise. The advantage will increase over time, because the laws of economics dictate that consumer products such as PCs and game consoles will advance faster than more specialized products, and that there will be more of them.

Volunteer computing power can't be bought; it must be earned. A research project that has limited funding but large public appeal can get huge computing power. In contrast, traditional supercomputers are extremely expensive, and are available only for applications that can afford them (for example, nuclear weapon design and espionage).

Volunteer computing encourages public interest in science, and provides the public with a voice in determining the directions of scientific research.

2.3 How does it compare to grid computing?

It depends on how you define 'Grid computing'. The term generally refers to the sharing of computing resources within and between organizations, with the following properties:

Each organization can act as either producer or consumer of resources (hence the analogy with the electrical power grid, in which electric companies can buy and sell power to and from other companies, according to fluctuating demand).

The organizations are mutually accountable. If one organization misbehaves, the others can respond by suing them or refusing to share resources with them.

This is different from volunteer computing. 'Desktop grid' computing, which uses desktop PCs within an organization, is superficially similar to volunteer computing, but because it has accountability and lacks anonymity, it is significantly different.

If your definition of 'Grid computing' encompasses all distributed computing (which is silly, as there is already a perfectly good term for that), then volunteer computing is a type of Grid computing.

2.4 Is it the same as "peer-to-peer" computing?

No. 'Peer-to-peer computing' describes systems such as Napster, Gnutella, and Freenet, in which files and other data are exchanged between 'peers' (i.e. PCs) without the involvement of a central server. This differs in several ways from volunteer computing:

Volunteer computing uses central servers; there is typically no peer-to-peer communication.

Peer-to-peer computing benefits the participants (i.e. the people sharing files); there is no notion of a 'project' to which resources are donated.

Peer-to-peer computing usually involves storage and retrieval, not computing.

3 HOW THE MAGIC IS DONE SERVER-SIDE

3.2 SERVER DESCRIPTION

BOINC-based projects are autonomous. Each project operates a server consisting of several components:

Web interfaces for account and team management, message boards, and other features.

A task server that creates tasks, dispatches them to clients, and processes returned tasks.

A data server from which clients download input files and executables, and to which they upload output files.

These components share various data stored on disk, including relational databases and upload/download files (see Figure 1).
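The client/task-server exchange can be illustrated with a few lines of Python. BOINC's real scheduler protocol is XML, but the element names below are invented for illustration and are not BOINC's actual schema:

```python
# Illustrative sketch of a scheduler request: host description, completed
# instances, and the amount of work wanted. Element names are hypothetical.
import xml.etree.ElementTree as ET

def make_request(host, completed, seconds_wanted):
    """Serialize a toy scheduler request as XML text."""
    req = ET.Element("scheduler_request")
    ET.SubElement(req, "host").text = host
    for name in completed:
        ET.SubElement(req, "completed_instance").text = name
    ET.SubElement(req, "work_req_seconds").text = str(seconds_wanted)
    return ET.tostring(req, encoding="unicode")

def parse_request(xml_text):
    """Server side: pull the fields back out of the request."""
    req = ET.fromstring(xml_text)
    return {
        "host": req.findtext("host"),
        "completed": [e.text for e in req.findall("completed_instance")],
        "seconds": int(req.findtext("work_req_seconds")),
    }

msg = make_request("host-42", ["job7_0"], 86400)
print(parse_request(msg))
```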
Figure 1: A BOINC server consists of several components, sharing several forms of storage.

Each client periodically communicates with the task server to report completed work and to get new work. In addition, the server performs a number of background functions, such as retrying and garbage-collecting tasks. The load on a task server depends on the number of volunteer hosts and their rates of communication. The number of volunteer hosts in current projects ranges from tens to hundreds of thousands, and in the future may reach tens or hundreds of millions. If servers become overloaded, requests fail and hosts become idle. Thus, server performance can limit the computing capacity available to a volunteer computing project.

3.3 BOINC TASK SERVER ARCHITECTURE

3.3.1 TASK SERVER COMPONENTS

BOINC implements a task server using a number of separate programs, which share a common MySQL database (see Figure 2).

Figure 2: The components of a BOINC task server

• The work generator creates new jobs and their input files. For example, the SETI@home work generator reads digital tapes containing data from a radio telescope, divides this data into files, and creates jobs in the BOINC database. The work generator sleeps if the number of unsent instances exceeds a threshold, limiting the amount of disk storage needed for input files.

• The scheduler handles requests from BOINC clients. Each request includes a description of the host, a list of completed instances, and a request for additional work, expressed in terms of the time the work should take to complete. The reply includes a list of instances and their corresponding jobs. Handling a request involves a number of database operations: reading and updating records for the user account and team, the host, and the various jobs and instances. The scheduler is implemented as a FastCGI program run from an Apache web server [3], and many instances of it can run concurrently.

• The feeder streamlines the scheduler's database access. It maintains a shared-memory segment containing 1) static database tables such as applications, platforms, and application versions, and 2) a fixed-size cache of unsent instance/job pairs. The scheduler finds instances that can be sent to a particular client by scanning this memory segment. A semaphore synchronizes access to the shared-memory segment. To minimize contention for this semaphore, the scheduler marks a cache entry as "busy" (and releases the semaphore) while it reads the instance from the database to verify that it is still unsent.

• The transitioner examines jobs for which a state change has occurred (e.g., a completed instance has been reported). Depending on the situation, it may generate new instances, flag the job as having a permanent error, or trigger validation or assimilation of the job.

• The validator compares the instances of a job and selects a canonical instance representing the correct output. It determines the credit granted to users and hosts that return the correct output, and updates those database records.

• The assimilator handles jobs that are "completed", i.e., that have a canonical instance or for which a permanent error has occurred. Handling a successfully completed job might involve writing outputs to an application database or archiving the output files.

• The file deleter deletes input and output files that are no longer needed.

• The database purger removes job and instance database entries that are no longer needed, first writing them to XML log files. This bounds the size of these tables, so that they act as a working set rather than an archive, and allows database management operations (such as backups and schema changes) to be done quickly.

The programs communicate through the BOINC database. For example, when the work generator creates a job, it sets a flag in the job's database record indicating that the transitioner should examine it. Most of the programs repeatedly scan the database, enumerating records that have the relevant flag set, handling these records, and clearing the flags in the database. Database indices on the
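The flag-driven pattern just described (enumerate flagged records, handle them, clear the flag, sleep when idle) can be sketched in Python. The in-memory job table and the 'transition' performed here are stand-ins for the MySQL database and the transitioner's real work; both are illustrative:

```python
# Sketch of a BOINC-style daemon pass over a flagged job table.
import time

# Toy stand-in for the jobs table in the shared database.
jobs = [
    {"id": 1, "needs_transition": True,  "state": "new"},
    {"id": 2, "needs_transition": False, "state": "new"},
    {"id": 3, "needs_transition": True,  "state": "new"},
]

def transition(job):
    """Stand-in for the transitioner's real work (creating instances, etc.)."""
    job["state"] = "in_progress"

def daemon_pass(table):
    """One enumeration pass over the table; returns how many records it handled."""
    handled = 0
    for record in table:
        if record["needs_transition"]:          # enumerate records with the flag set
            transition(record)                  # handle the record
            record["needs_transition"] = False  # clear the flag
            handled += 1
    return handled

# A real daemon loops forever; three passes are enough for the demo.
for _ in range(3):
    if daemon_pass(jobs) == 0:
        time.sleep(0.01)  # when an enumeration returns nothing, sleep briefly

print([job["state"] for job in jobs])  # ['in_progress', 'new', 'in_progress']
```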
flag fields make these enumerations efficient. When an enumeration returns nothing, the program sleeps for a short period.

Thus, a BOINC task server consists of many processes, mostly asynchronous with respect to client requests, that communicate through a database. This approach has the disadvantage of imposing a high load on the database server. One can imagine an alternative design in which almost all functions are done by the scheduler, synchronously with client requests. This would have lower database overhead. However, the current design has several important advantages:

• It is resilient with respect to failures. For example, only the assimilator uses the application database, and if that database is unavailable only the assimilator is blocked. The other components continue to execute, and the BOINC database (i.e., the job records tagged as ready to assimilate) acts as a queue for the assimilator when it runs again.

• It is resilient with respect to performance. If backend components (e.g. the validator or assimilator) perform poorly and fall behind, the client-visible components (the feeder and scheduler) are unaffected.

• The various components can easily be distributed and/or replicated (see below).

3.4 SCALABILITY

3.4.1 COMPONENT DISTRIBUTION

The programs making up a BOINC task server may run on different computers. In particular, the BOINC database may run on a separate computer (MySQL allows remote access). Many of the programs require access to shared files (configuration files, log files, upload/download data files), so generally the server computers are on the same LAN and use a network file system such as NFS.

The server programs may also be replicated, either on a multiprocessor host or on different hosts. Interference between replicas is avoided by having each replica work on a different subset of database items. The space of database identifiers is partitioned: if there are n replicas, replica i handles only items (e.g., jobs) for which (ID mod n) = i.

3.5 THE BOINC COMPUTING MODEL

Grid computing involves resource sharing between organizations that are mutually accountable. In contrast, participants in a volunteer computing project are not accountable to the project (indeed, their identity is unknown), and the volunteered hosts are unreliable and insecure.

Thus, when a task is sent to a host, several types of errors are possible. Incorrect output may result from a hardware malfunction (especially in hosts that are "overclocked"), an incorrect modification to the application, or an intentional malicious attack by the volunteer. The application may crash. There may be no response to the project, e.g. because the host dies or stops running BOINC. An unrecoverable error may occur while downloading or uploading files. The result may be correct but reported too late to be of use.

Persistent redundant computing

Because the above problems occur with non-negligible frequency, volunteer computing requires mechanisms for validation (to ensure that outputs are correct) and retry (to ensure that tasks eventually get done). BOINC provides a mechanism called persistent redundant computing that accomplishes both goals. This mechanism involves performing each task independently on two or more computers, comparing the outputs, looking for a "quorum" of equivalent outputs, and generating new instances as needed to reach a quorum.

In BOINC terminology, a job is a computational task, specified by a set of input files and an application program. Each job J has several scheduling-related parameters:

• DelayBound(J): a time interval that determines the deadline for instances of J.

• NInstances(J): the number of instances of J to be created initially.

• MinQuorum(J): the minimum size of a quorum.

• Estimates of the amount of computing, disk space, and memory required by J.

• Upper bounds on the number of erroneous, correct, and total instances. These are used to detect jobs that consistently crash the application, that return inconsistent results, or that cause their results not to be reported.

A job instance (or just "instance") refers to a job and specifies a set of output files. An instance is dispatched to at most one host. An instance is reported when it is listed in a scheduler request message. If enough instances of a job have been reported and are equivalent, they are marked as valid and one of them is selected as the job's canonical instance.

3.6 FAILURE PROTECTION

BOINC implements persistent redundant computing as follows:

1. When a job J is created, NInstances(J) instances of J are created and marked as unsent.
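The identifier-partitioning rule above is easy to state in code. A sketch, with a toy ID range standing in for the database:

```python
# Replica partitioning: with n replicas, replica i handles only items
# whose database ID satisfies (ID mod n) == i, so no two replicas ever
# touch the same record.

def my_items(ids, n_replicas, i):
    """IDs that replica i is responsible for."""
    return [item_id for item_id in ids if item_id % n_replicas == i]

# Ten jobs split across three replicas: every ID is owned by exactly one replica.
partitions = [my_items(range(10), 3, i) for i in range(3)]
print(partitions)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```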
2. When a client requests work, the task server selects one or more unsent instances and dispatches them to the host. Two instances of the same job are never sent to the same participant, making it unlikely that a maliciously incorrect result will be accepted as valid. The instance's deadline is set to the current time plus DelayBound(J).

3. If an instance's deadline passes before it is reported, the server marks it as "timed out" and creates a new instance of J. It also checks whether the limit on the number of error or total instances of J has been reached, and if so marks J as having a permanent error.

4. When an instance I is reported, and its job already has a canonical instance I*, the server invokes an application-specific function that compares I and I*, and marks I as valid if they are equivalent. If there is no canonical instance yet, and the number of successful instances is at least MinQuorum(J), the server invokes an application-specific function which, if it finds a quorum of equivalent instances, selects one of them as the canonical instance I* and marks the instances as valid if they are equivalent to I*. Volunteers are granted credit for valid instances. [9]

4 HOW THE MAGIC IS DONE CLIENT-SIDE

Figure 3: the BOINC client software includes a 'core client' that executes applications and interacts with them through a runtime system.

• Applications are typically long-running scientific programs. They may consist of a single process or a dynamic set of multiple processes.

• The BOINC core client program communicates with schedulers, uploads and downloads files, and executes and coordinates applications.

• The BOINC Manager provides a graphical interface allowing users to view and control computation status (see Figure 3). For each task, it shows the fraction done and the estimated time to completion, and lets the user open a window showing the application's graphics. It communicates with the core client using remote procedure calls over TCP.

• A BOINC screensaver (if enabled by the volunteer) runs when the computer is idle. It doesn't generate screensaver graphics itself, but rather communicates with the core client, requesting that one of the running applications display full-screen graphics.

4.1 OVERALL ARCHITECTURE

4.1.1 SHARED-MEMORY MESSAGE-PASSING

The runtime system requires bidirectional communication between the core client and applications. How should this work? Operating systems offer a variety of mechanisms for inter-process communication, process control, and synchronization. For example, POSIX-compliant systems have signals, semaphores, and pipes. Windows has mutexes, messages, and various system calls for process and thread control. We avoided platform-specific mechanisms because of the resulting code complexity.

Instead, the BOINC runtime system is based on shared-memory message passing. For each application it executes, the core client creates a shared memory segment containing a data structure with a number of unidirectional message channels. Each channel consists of a fixed-size buffer and a 'present' flag. Message queuing, if needed, is provided at a higher software level. All messages are XML, minimizing versioning problems.

The BOINC runtime system uses eight message channels, four in each direction. For example, one channel carries task control messages (telling the application to suspend, resume, quit or abort), while another conveys graphics-related messages (telling the application to create or destroy graphics windows).

Figure 4: The core client communicates with applications by shared-memory message passing.
4.1.2 SIMPLE AND COMPOUND APPLICATIONS

BOINC supports both simple and compound applications. Simple applications consist of a single program, and their scientific code, graphics code, and the BOINC runtime library reside and execute in a single address space. Compound applications consist of several programs – typically a coordinator that executes one or more worker programs. The coordinator, for example, might run pre-processing, main, and post-processing programs in sequence, or it might launch one or more programs (e.g. coupled climate models) that run concurrently and communicate via shared memory. It might run a graphics program concurrently with a scientific program.

The BOINC runtime library is linked with each program of a compound application. The BOINC API lets each program specify which message channels it will handle, and whether the message handling should be done by the runtime system or by the application. In the example shown in Figure 5, the coordinator handles process control messages, while the graphics program handles graphics messages.

Figure 5: A compound application consists of several processes, each of which handles particular message channels.

4.2. FAILURE PROTECTION

4.2.1. ORPHANED AND DUPLICATE PROCESSES

Sometimes the core client exits unexpectedly (for example, because it crashes). In these situations, a mechanism is needed that will cause applications to eventually exit. BOINC uses heartbeat messages, which are sent once per second from the core client to each application. If an application doesn't get a heartbeat message for 30 seconds, it exits.

Each application executes in a directory containing its input and output files. To prevent duplicate copies of an application from executing in the same directory, the runtime system uses a lock file. The API initialization routine tries to acquire the lock file; if it can't, it waits for 30 seconds (allowing the heartbeat mechanism to take effect) and tries again.

4.2.2. RELIABLE TERMINATION

The core client uses standard functions (such as waitpid() on Unix) to find out when applications have finished and whether they exited normally. On some versions of Windows, when a program is killed externally by the user, this is indistinguishable (from the core client's viewpoint) from a call to exit(0). To solve this problem, the BOINC API finalization routine writes a 'finished file'. If the core client detects that a program has exited unexpectedly but no 'finished file' is found, it restarts the application.

5 Checkpointing

BOINC expects applications to do checkpoint/restart, so that they can quit and restart repeatedly and still finish their intended computation. BOINC user preferences include a minimum interval between periods of disk activity. This is useful for laptops whose disks spin down to conserve power. The BOINC runtime system must allow applications to checkpoint frequently (to minimize wasted CPU time) but must respect the minimum disk interval.

BOINC applications typically have particular points in their execution where the state of the computation can be represented compactly (e.g. by the values of outer loop indices). These "checkpointable states" may be separated by milliseconds or by minutes. The BOINC API provides a function

bool boinc_time_to_checkpoint();

that should be called whenever the application is in a checkpointable state. It can be called frequently (hundreds or thousands of times a second). It returns true if the minimum disk interval has elapsed since the last checkpoint. If so, the application should write a checkpoint file and call

boinc_checkpoint_completed();

These functions automatically make checkpointing a critical section with respect to quit messages. They also inform the core client when the application has checkpointed, so that it can correctly account total CPU time, and so that it can avoid doing preempt-by-quit for applications that haven't checkpointed recently.

5.1 Output file integrity

Many BOINC applications write incrementally to output files. If an application is preempted by quitting at a time when it has extended an output file since the last checkpoint, the same output will be written again when the task runs next, producing an erroneous output file. There are several ways of dealing with this. The application can copy output files during checkpoint; this is potentially inefficient.
Alternatively, the application can store the size of its output files in the checkpoint file, and seek to these offsets on restart. Or it can use a set of printf()-replacement functions (supplied by BOINC) that buffer output in memory, and flush these buffers during checkpoint.

6. Remote diagnostics and debugging

Applications can fail by crashing or going into infinite loops. Some failures occur only in specific contexts – CPU type, OS version, library version, even CPU speed. Such failures may be common on volunteer hosts, yet never occur on the project's development machines. The BOINC runtime system has several features that collect failure information:
• An application's standard error output is directed to a file and returned to the project's server for all tasks, failed or not.
• If an application crashes, a stack trace is written to standard error. If the application includes a symbol table, the stack trace is symbolic.
• If an application is aborted (because the task exceeds time, disk, or memory limits, or is aborted by the user), a stack trace is written to standard error.

All information about a task (exit code, signal number, standard error output, volunteer host platform) is stored in a relational database on the server, making it easy to isolate the contexts in which failures occur. Many BOINC-based projects have small "alpha testing" projects, with enough volunteers to cover the main platforms, so that context-specific problems can be fixed before applications are released to the public.

7. Conclusion

Nowadays the average internet user's opinion of the internet itself isn't a generous one; Carl Sagan observed that the general public's attitude toward science is increasingly one of alienation and even hostility [13]. One can say this has been a trend ever since Prometheus stole the divine fire and gave it to humans; volunteer computing is a step in the right direction. Not only does it cast the everyday internet user as a vessel of knowledge, it also loosens the hold of governments and capitalist companies on scientific research. Because computer owners can contribute to whatever project they choose, control over resource allocation for science will shift away from government funding agencies (with the myriad factors that shape their policies) and towards the public. This has its risks: the public may be easier to deceive than a peer-review panel. But it offers a very direct and democratic mechanism for deciding research policy. If a scientist has an idea for a computation, but finds that it would take a million years of computer time, the normal reaction is to toss the idea in a wastebasket. Public computing makes such ideas feasible: SETI@home has used 1.5 million years of CPU time. Scientists can now resurrect and reconsider these discarded ideas. [14]

REFERENCES

[1] A. S. Tanenbaum, Computer Networks, p. 2, 2002.
[2] D. P. Anderson, E. Korpela, and R. Walton, "High-Performance Task Distribution for Volunteer Computing," Space Sciences Laboratory, University of California, Berkeley.
[3] http://Wikipedia.com
[4] http://dsonline.computer.org
[5] A. Oram (editor), Peer-to-Peer: Harnessing the Power of Disruptive Technologies, p. 22, O'Reilly, 2001.
[6] R. Schollmeier, "A Definition of Peer-to-Peer Networking for the Classification of Peer-to-Peer Architectures and Applications," IEEE International Conference on Peer-to-Peer Computing, 2001, pp. 101-102.
[7] searchcloudcomputing.techtarget.com
[8] B. Hayes, "Cloud Computing," Communications of the ACM, 51(7):9-11, 2008, ISSN 0001-0782, doi:10.1145/1364782.1364786.
[9] D. P. Anderson, C. Christensen, and B. Allen, "Designing a Runtime System for Volunteer Computing," UC Berkeley Space Sciences Laboratory; Dept. of Physics, University of Oxford; Physics Dept., University of Wisconsin – Milwaukee.
[10] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, "SETI@home: An Experiment in Public-Resource Computing," Communications of the ACM, Vol. 45, No. 11, 2002, pp. 56-61.
[11] D. Toth and D. Finkel, "A Comparison of Techniques for Distributing File-Based Tasks for Public-Resource Computing," Proc. 17th IASTED International Conference on Parallel and Distributed Computing and Systems, Phoenix, Arizona, USA, 2005, pp. 398-403.
[12] J. Bohannon, "Grassroots Supercomputing," Science 308, 2005, pp. 810-813.
[13] C. Sagan, The Demon-Haunted World: Science As a Candle in the Dark, Random House, 1996.
[14] D. P. Anderson, "Public Computing: Reconnecting People to Science," March 21, 2004.