Grid resource management for data mining applications



Valentin Kravtsov a, Thomas Niessen b, Assaf Schuster a, Werner Dubitzky c,1, Vlado Stankovski d,∗

a Technion Israel Institute of Technology, Haifa, Israel
b Fraunhofer Institute for Intelligent Analysis and Information Systems, Bonn, Germany
c University of Ulster, Biomedical Sciences Research Institute, Coleraine, UK
d University of Ljubljana, Ljubljana, Slovenia

Abstract

Emerging data mining applications in science, engineering and other sectors increasingly exploit large and distributed data sources as well as computationally intensive algorithms. Adapting such applications to grid computing environments has implications for grid resource brokering. A grid resource broker supporting such applications needs to provide effective and efficient job scheduling, execution and monitoring. Furthermore, to be usable by domain-oriented end users and to be able to evolve gracefully with emerging grid technology, it should hide the underlying complexity of the grid from such users and be compliant with important grid standards and technology. The DataMiningGrid Resource Broker was designed to meet these requirements. This paper presents the DataMiningGrid Resource Broker and the results from evaluating it in a European-wide test bed.

Key words: Grid, resource broker, data mining, GridBus

∗ Corresponding author. Email addresses: svali (Valentin Kravtsov), (Werner Dubitzky), (Vlado Stankovski).
1 We acknowledge the cooperation of all DataMiningGrid partners and collaborators in the DataMiningGrid project. This work was supported largely by the European Commission FP6 grant DataMiningGrid (Contract No. 004475).

Preprint submitted to Elsevier, 17 November 2006
1 Introduction

Due to the increased computerization of many industrial, scientific, and public sectors, the growth of available digital data is proceeding at an unprecedented rate. The effective and efficient management and use of stored data, and in particular the transformation of these data into information and knowledge, is considered a key requirement for success in such domains. Data mining [1] (a.k.a. knowledge discovery in databases) is the de-facto technology addressing this information need.

Until a few years ago, data mining was mainly concerned with small to moderately sized data sets within the context of largely homogeneous and localized computing environments. These assumptions are no longer met in modern scientific and industrial complex-problem-solving environments, which rely more and more on the sharing of geographically dispersed computing resources. This shift to large-scale distributed computing has profound implications for the way data are analyzed. Future data mining applications will need to operate on massive data sets and against the backdrop of complex domain knowledge. The domain knowledge (computer-based and human-based), the data sets themselves, the programs for processing, analyzing, evaluating, and visualizing the data, and other relevant resources will increasingly reside at geographically distributed sites on heterogeneous infrastructures and platforms. Grid computing promises to become an essential technology capable of addressing the changing computing requirements of future distributed data mining environments.

As a result of these developments, grid-enabled and distributed data mining has become an active area of research and development in recent years [2–4]. This new area attempts to meet the requirements of emerging and future data mining applications by exploiting and sharing computational resources available in grid computing environments.
Critical hardware and software resources to be shared in such applications include primary and secondary storage devices, processing units, data and data mining application software. Facilitating effective and efficient sharing of such resources across local and wide area computing networks in the context of modern data mining applications is not trivial. This is particularly the case in grid computing environments spanning multiple administrative domains and heterogeneous software and hardware platforms.

The EU-funded DataMiningGrid project [5] is a large-scale effort aimed at developing a generic system facilitating the development and deployment of grid-enabled data mining applications. Some key requirements of this effort include (1) end users should be able to use the system without needing to know details of the underlying grid technology, (2) developers should be able to grid-enable existing data mining applications with little or no intervention in existing application code, and (3) the system should adhere to existing and
emerging grid and grid-related standards. A critical element in such a system is a grid resource broker [6]. Essentially, a grid resource broker examines and keeps track of the resources and their capabilities in a grid computing environment, matches incoming application job requests and their requirements against those resources, assigns the jobs to available and suitable resources in the grid, and initiates and monitors the execution of the jobs. This work provides a detailed account of the DataMiningGrid Resource Broker that was developed for the DataMiningGrid system. The Resource Broker was designed with special attention to the requirements of data mining applications.

The remainder of the paper is organized as follows. Sections 2 and 3 provide the background of data mining and resource broker technology. Section 4 presents a detailed description of the requirements of the DataMiningGrid Resource Broker. This is followed by Section 5, which reviews related work. In Section 6 the DataMiningGrid system architecture is presented, and the Resource Broker's role and function within this architecture is explained together with details relating to the design and implementation of the Broker. The function of the Broker is further described and illustrated in Section 7. Section 8 presents some results obtained from evaluating the Resource Broker within the DataMiningGrid test bed. Finally, Section 9 concludes with a summary and some critical remarks.

2 Data mining

In its typical form, data mining can be viewed as the formulation, analysis, and implementation of an induction process (proceeding from the specific to the general) that facilitates the extraction of information (nontrivial, previously unknown, and potentially useful patterns) from data [1,7,9]. Data mining involves the use of software, sound methodology, and human creativity to achieve new insight through the exploration of data.
The goal of data mining is to allow the end user (e.g., scientist, engineer, marketing, retail or finance expert) to improve his or her decision-making. Compared with classical statistical approaches, data mining is perhaps best seen as a process that encompasses a wider range of integrated methodologies and tools including databases, modelling, statistics, machine learning, knowledge-based techniques, uncertainty handling, and visualization. Data mining is a continuous and iterative process. Typically, such a process involves the broad steps of (1) problem and data understanding; (2) data pre-processing; (3) data analysis (this refers to algorithms that induce patterns and rules from a collection of observations); (4) result post-processing and interpretation; and (5) actions and decisions in the context of the domain questions to be addressed [8,9].
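The five broad steps above can be pictured as a chain of stages, each consuming the output of the previous one. The following is a minimal sketch of that chaining only; the function bodies are trivial stand-ins, not real mining algorithms:

```python
# A toy pipeline mirroring the five-step data mining process.
def understand(raw):            # step 1: problem and data understanding
    return {"records": raw}

def preprocess(ctx):            # step 2: data pre-processing (drop missing values)
    ctx["records"] = [r for r in ctx["records"] if r is not None]
    return ctx

def analyze(ctx):               # step 3: induce a (toy) pattern from the observations
    ctx["pattern"] = max(set(ctx["records"]), key=ctx["records"].count)
    return ctx

def postprocess(ctx):           # step 4: result post-processing and interpretation
    return f"most frequent value: {ctx['pattern']}"

def decide(report):             # step 5: actions and decisions in the domain context
    return f"act on -> {report}"

result = decide(postprocess(analyze(preprocess(understand([1, 2, 2, None, 2])))))
print(result)  # -> act on -> most frequent value: 2
```

In a grid setting each stage may run on a different resource, which is precisely what makes brokering non-trivial.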
3 Grid resource brokering

In a grid, a resource broker is defined as an entity that provides the bridge between resource offers and resource requests (called jobs in grid terminology) [11]. Usually, resource brokers are developed to automate the following operations [11]:

• Matching resource offers to resource requests. Each job requires a set of resources, such as memory, disk space, etc., to be executed. In order to match these resource requirements of the job to available resources in a grid, resource offers and resource requests need to be expressed in a way that the resource broker can understand them and carry out the matching.

• Job scheduling. When several resource requests and offers are available, different scheduling policies may be applied to match requests to offers. The scheduling policy is designed to maximize some predefined utility function. Such functions may, for example, minimize the job execution latency or maximize benefits (e.g., monetary) of the resource providers.

• Staging-in of data and executables. Typically, prior to job execution, data and executables have to be staged in, i.e., made available on the execution machine(s) that have been selected by the resource broker. The resource broker orchestrates this process and normally needs to enforce an all-or-none stage-in policy. Once the stage-in is completed, the resource broker initiates job execution.

• Monitoring of the execution process. Normally, the status of a job changes during the course of its execution. A job might have reached a state in which it is pending, active, completed, failed, and so on. These changes need to be conveyed to the user so as to allow him or her to take appropriate actions (e.g., retry, abort).

• Results delivery (stage-out) to the user. In grid computing environments it is common that the execution machine is located outside of the administrative domain of the job submitter.
Therefore, it is important that after the completion of the execution, the results are transferred to some persistent storage and that a clean-up on the execution machines is performed (including the removal of all staged-in files).

Data mining applications differ from applications performing other information management and processing tasks. They typically process large amounts of data while the code of the actual data mining algorithms is relatively small. This has led to the bundling of many stand-alone data mining applications into application suites, which share a common underlying internal representation of the data and graphical user interface. Usually, the individual applications are executable in batch mode and can be parameterized at the start of execution [12]. However, the majority of specialized data mining algorithms developed by data mining researchers do not consist of integrated application suites. The
grid-enabling of such programs poses a much wider range of varying requirements to a resource broker.

4 Requirements for a resource broker facilitating grid-enabled data mining

A data mining application consists of the application of data mining technology to data analytical tasks required within various application domains. Central elements of any data mining application are the data to be mined, the data mining algorithm(s) to be used, and a user who specifies and controls the data mining process. Grid-enabling data mining applications is motivated by the sharing of computational resources via local and wide area networks. Existing data mining applications may benefit from grid-enabled resource sharing with regard to improved effectiveness or efficiency or other benefits (e.g., novel use, wider access or better use of existing resources). Furthermore, grid-enabled resource sharing may actually facilitate novel data mining applications. Given the nature of data mining applications, four key computational resources to be shared can be identified:

1. Data. The data to be mined in the form of electronic databases, data files, documents, and so on;
2. Application programs. Data mining application programs providing the implementation of data mining algorithms used to mine the data;
3. Processors. Computing processor units providing the raw compute power for processing of the data; and
4. Storage. Data storage devices to physically store the input and output data of data mining applications.

A system whose main function is to facilitate the sharing of such resources within a grid environment supporting data mining applications should take into account the unique constraints and requirements of data mining applications with respect to the data management and data mining software tools, and the users of these tools.
To determine and specify the detailed requirements of the DataMiningGrid Resource Broker, we have defined a representative set of use cases. The use cases are based on real-world data mining applications from industry and science in the following areas: medicine, biology, bioinformatics, customer relationship management and car repair log diagnostics in the automotive industry, monitoring of network-based computer systems, and ecological modelling. Following is a list of the identified requirements.
1. Fully automated resource aggregation. In order to execute data mining tasks efficiently, the resource broker needs to automatically discover all the available computational resources and filter out those that do not comply with the requirements associated with the data mining application.

2. Data-oriented scheduling. Data mining application programs process large volumes of data [13]. To facilitate efficient processing of these data, the DataMiningGrid Resource Broker needs to minimize the transfer of large amounts of data across the network. Additionally, the Resource Broker needs to support data mining applications where data cannot be moved for reasons other than large volume (e.g., security, privacy, legislative or regulatory reasons). This means that the Resource Broker should support a data mining process in which the data mining algorithm or program is shipped to the data.

3. Zero footprint on execution machines. The Resource Broker should be capable of handling the execution of any data mining algorithm that can be run from the command line. Therefore, the data mining program executables should not be required to be installed in advance on the execution machines. Instead, the Resource Broker should arrange for them to be transferred to the execution machine(s) together with the necessary libraries and data. Furthermore, at the end of the execution process the staged-in files need to be cleaned up, leaving the execution machine(s) with zero footprint after the execution.

4. Extensive monitoring. The Resource Broker should monitor and collect all job execution status information, from start of execution to completion (successful or unsuccessful), and allow the user to monitor this information while the execution is in progress. This information should include items like error logs, execution machine address, and so on.

5. Parameter sweep support.
Many data mining algorithms require repeated execution of the same process with different parameters (controlling the behaviour of the algorithm) or different input data sets. This is typically required in optimization or sensitivity analysis tasks. Therefore, the Resource Broker should provide mechanisms that support the automated instantiation of the variables given a set of instantiation rules.

6. Interoperability. Since a grid is defined as a collection of heterogeneous and distributed resources, it is mandatory that the DataMiningGrid Resource Broker provides a flexible framework facilitating execution on a wide range of execution machines. Furthermore, this functionality should be realized in such a way that the user or application developer is not required to provide any wrappers or other low-level constructs. If the executable's constraints allow it to run on a certain type of machine, the Resource Broker should be able to launch the executable on such machines.
7. Adherence to interoperability standards. Even in the fast-changing area of grid technology, adherence to interoperability standards and other standards used by large parts of the community is crucial for building future-proof systems. Relevant standards include the Open Grid Services Architecture (OGSA) [14] and the Web Services Resource Framework (WSRF) [15]. OGSA is a distributed interaction and computing architecture based on the concept of a grid computing service, assuring interoperability on heterogeneous systems so that different types of resources can communicate and share information. WSRF is aimed at defining a generic framework for modelling and accessing persistent resources using Web services so that the definition and implementation of a service and the integration and management of multiple services is made easier [38]. WSRF narrowed the meaning of grid services to those services that conform to the WSRF specification [39], although a broader, more natural meaning of a grid service is still in use.

8. Adherence to security standards. As the DataMiningGrid Resource Broker is designed to execute a wide variety of data mining applications, flexible security mechanisms need to be introduced to provide different levels of security for different types of users. Since data mining may be performed on data with severe security and privacy constraints, it is necessary that the Resource Broker facilitates the enforcement of secure and encrypted data transport, and includes authentication and authorization mechanisms. In particular, it should follow the standards of the public key infrastructure based on X.509-compliant certificates [16], the WS-I Basic Security Profile 1.0 [17], WS-Security [18], and WS-SecureConversation [19].

9. User friendliness. Data mining application end users and data mining experts, while being experts in their own field, cannot be assumed to be experts in grid technology.
Thus, as many details of the grid and of the Resource Broker as possible should be hidden from these users.

5 Related work

Paying special attention to de-facto standards and grid middleware, we could not identify any existing off-the-shelf resource broker that meets all (or most) of the requirements specified for the DataMiningGrid Resource Broker. Both Globus Toolkit 4 (GT4) [20] and Condor [21] were previously selected to form the underlying grid middleware infrastructure in the DataMiningGrid system architecture. Below we summarize the findings of our research into existing relevant grid technology, including resource brokers.

The GridLab Resource Management System (GRMS) [22] is a job meta-scheduling and resource management framework that allows users to build
and deploy job and resource management systems for grid environments. It is based on dynamic resource discovery and selection, mapping and scheduling methodologies, and it is able to deal with resource management challenges. The GRMS manages the entire process of remote job submission and control over distributed systems. However, its strengths are fully expressed only in combination with the complete GridLab middleware. At the time of the design of the DataMiningGrid Resource Broker, the GRMS was not WSRF-compliant and did not support any interaction with GT4. To the best of our knowledge, parameter sweep functionality as required by several demonstrator applications is also not supported by GRMS.

Enabling Grids for E-science (EGEE) [23], a large EC-funded project aiming to provide a worldwide seamless grid infrastructure for e-Science, uses a resource broker, which is installed on a central machine. The resource broker receives requests and then decides to dispatch jobs according to system parameters. At the time of the DataMiningGrid Resource Broker design, the latest version of the EGEE resource broker was the LCG-2 [24] resource broker. This version is not service-based (no compliance with WSRF) and does not support automated parameter sweep functionality. At that time the plan was to replace LCG-2 with the gLite resource broker [25].

Cactus [26] is a numerical problem-solving environment for scientists. It supports data grid features using MPICH-G and the Globus Toolkit. However, applications in the Cactus environment have to be written in MPI, which implies that a legacy application cannot be easily adapted to run on a grid. Cactus is not WSRF-compliant and does not provide the needed data handling functionality.

Nimrod-G specializes in parameter-sweep computation. However, the scheduling approach within Nimrod-G aims at optimizing user-supplied parameters such as deadline and budget for computational jobs only [27].
It does not support methods for accessing remote data repositories and for optimizing data transfer. Also, Nimrod-G does not have any mechanisms for automated resource aggregation. To facilitate interoperability, Nimrod-G requires that job wrapper agents be provided for new applications.

The GridBus resource broker [28] extends the Nimrod-G computational grid resource broker model to distributed data-oriented grids. The GridBus broker also extends Nimrod-G's parametric modelling language by supporting dynamic parameters, i.e., parameters whose values are determined at runtime [29]. However, the original GridBus implementation does not support automated resource discovery, interoperability (allowing execution on Unix-based machines only), and the WSRF standard. It also lacks data movement optimizations needed for data mining.
6 Design and implementation

Figure 1 depicts the DataMiningGrid system architecture, including the Resource Broker and the components related to it. The architecture consists of four distinct layers. The top three layers contain components designed or enhanced to facilitate data mining applications in highly dynamic and heterogeneous grid environments. The components developed by the DataMiningGrid consortium are highlighted in red in the diagram. Additional components providing and supporting typical data mining processes and functions are highlighted in grey. These may include support for managing provenance information, for access to heterogeneous database systems, and sophisticated tools permitting the visualization of results. These components do not directly interface with the Resource Broker, nor do they provide any relevant information to the job submission process. Since the focus of this study is on the Resource Broker, they are not discussed here. To better understand the role and function of the DataMiningGrid Resource Broker in the system, the DataMiningGrid system architecture and some of its components are now described in some detail.

6.1 Client layer

The highest level of the DataMiningGrid system represents the different clients, which serve as user interfaces to the grid by interfacing with the WSRF-compliant services located in the high-level service layer. Depending on the level of expertise of the different user groups (in terms of grid technology, data mining and domain-specific technologies), the architecture supports different types of clients, including general-purpose workflow editors, Web portals with a minimal set of options, or applications that integrate access to the grid system into their native user interface. Typically, each of these clients provides a graphical user interface to access and control one or more of the following functions:

1.
Searching in the grid for available applications according to user-defined criteria such as application name, vendor, version, and type of data mining function provided by the application (e.g., feature selection, classification, clustering, association mining, text mining, and so on). For users to be able to access and use a data mining application, the application needs to reside on servers that are permanently connected to the grid.

2. Specification of parameter values and input data for the selected data mining application. Once the user has provided this information, an XML-coded job description document is created and stored locally for resubmission before it is passed to the Resource Broker for interpretation.

3. Initiation of the execution of the data mining application in the grid. Upon this initiation, the Resource Broker reads and interprets the job description document and triggers the execution of the jobs.

While the first two functions may be omitted, for instance, when providing a client for executing invariant but re-occurring tasks, the third is compulsory for every type of client. For example, in the DataMiningGrid system, special units for the Triana workflow editor [30], implementing the full set of functions listed above, were developed, as well as a Web portal serving as a job submission interface accepting only pre-configured job descriptions. Also located in the client layer is a component for monitoring all jobs submitted to the Resource Broker by a client.

6.2 Grid middleware layer

The grid middleware layer contains all components which are included in GT4 (green). These provide basic functionality including security mechanisms, high-speed, file-based data transport (GridFTP/RFT), and the grid-wide registry Monitoring and Discovery System 4 (WS-MDS) [31]. The latter implements a distributed in-memory XML database that stores information about available resources contained in the fabric layer, jobs currently being processed by the system, and available data mining applications. This information can be retrieved using standard XPath queries [32].

The Grid Resource Allocation and Management Service (WS-GRAM) [33] (which is responsible for submitting, monitoring, and cancelling jobs) manages the execution of applications (i.e., jobs) on a particular computational resource through its adapter mechanism. These adapters may either execute jobs on the local machine where GT is installed (using a C-like fork command) or pass jobs to a local task scheduling system, which then schedules the jobs on the machines it controls. In its current version, the DataMiningGrid system uses Condor [21] as a local scheduler for managing clusters.
However, the original GT4-Condor adapter lacks the capabilities for transferring complete directory structures and executing Java applications. The current Condor implementation (version 6.7) and, as a result, the standard GT4 Condor adapter restrict data movement to copying files only. While this problem needs to be addressed by the Condor development team, we work around it by compressing the recursive directory structures into a single archive (i.e., a single file), moving the archive to the execution machine, and extracting its content before the actual execution of the data mining application. For executing Java applications, we extended the original GT4-Condor adapter to handle parameters regarding the Java virtual machine and the class path. The original GT4
Fork adapter also lacks the ability to execute Java applications. The modifications made to it are very similar to the changes we made to the standard GT4-Condor adapter.

As the grid system outlined here is based on GT4, it also offers the same security mechanisms, such as a public key infrastructure based on X.509-compliant certificates, SAML Authorization Decision support, and encryption of all network communication, including messages between Web services and data transfer.

6.3 Grid fabric layer

The lowest layer represents the grid resources such as data mining applications available in the grid, data, CPUs, storage, networks, and clusters (pink, blue). As discussed before, the latter are controlled by local schedulers such as Condor. These resources are accessed only by the grid middleware.

6.4 High-level services layer

The different clients and monitoring components interface with the Information Integrator service for application discovery and management, and with the Resource Broker for job execution and monitoring. The Resource Broker was not implemented completely from scratch, but is based on the GridBus Grid Service Broker and Scheduler version 2.4 [29,34]. The following considerations motivated this choice:

1. The GridBus resource broker is capable of submitting jobs to the execution subsystem (WS-GRAM) of GT4 (as well as to many other resources and grid systems, e.g., Alchemi, Unicore, XGrid, and others).
2. The GridBus broker's architecture is clearly structured and well designed from a software engineering point of view.
3. Unlike many other resource brokers, GridBus does not require any particular information or security system. It is designed as stand-alone software, ready to be integrated with various existing components and systems.
While offering a solid basis for the DataMiningGrid Resource Broker, the GridBus broker in its original version does not meet some critical requirements specified for the DataMiningGrid Resource Broker:

1. GridBus v2.4 is not service-oriented, but needs to be installed on every client machine. Thus, it does not adhere to recent grid standards such as
WSRF and OGSA.

2. It does not provide mechanisms for automated resource aggregation, but requires the user to provide this function. Such tasks require extensive knowledge about the grid's topology and its internal representations. Typically, this task cannot be performed by users who are not experts in grid technology.

3. It supports job execution on Linux/Unix-based machines only. This contradicts some of the basic requirements of grid systems, which are intended to support and interoperate with heterogeneous hardware and software (including operating systems) computing environments.

The GridBus v2.4 implementation was modified to fit into the service-oriented architecture of the DataMiningGrid system by wrapping it as a WSRF-compliant service. The resulting Resource Broker service exposes its main features through simple public interfaces. It was further enhanced to query the MDS4 automatically in order to obtain the set of available resources. To match the resource capabilities to job requirements, the Resource Broker requires a job description document passed to it by a client component from the upper layer (Figure 1) for each individual job. This XML-based document consists of two parts:

• A non-changeable application description containing all invariant attributes of the respective application (e.g., system architecture, location of the executable and libraries, programming language). These attributes cannot be altered by users of the system, but are typically specified by the application developer during the process of publishing the application in the grid.

• Modifiable values, which are provided by end users before or during runtime (e.g., application parameter values, data input, additional requirements) of the application. These are entered using one of the graphical user interface clients from the client layer (Figure 1).
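A two-part job description of this kind might look as follows. This is a hypothetical illustration only: the element and attribute names are invented and do not reflect the actual DataMiningGrid schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical job description: an invariant application part plus
# user-modifiable values. Element names are invented for illustration.
JOB_DESCRIPTION = """
<jobDescription>
  <application>
    <!-- invariant part, set by the application developer -->
    <name>ExampleMiner</name>
    <architecture>x86</architecture>
    <operatingSystem>Linux</operatingSystem>
    <executable>gsiftp://host/apps/miner/run.sh</executable>
  </application>
  <userValues>
    <!-- modifiable part, set by the end user -->
    <option flag="-c" value="0.25"/>
    <input>gsiftp://host/data/train.dat</input>
    <requirement memoryMB="512" diskMB="1024"/>
  </userValues>
</jobDescription>
"""

doc = ET.fromstring(JOB_DESCRIPTION)
print(doc.findtext("application/operatingSystem"))  # -> Linux
```

The broker reads the invariant part to decide where the job can run at all, and the modifiable part to decide where it should run.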
From the resulting job description the Resource Broker evaluates various types of information for resource aggregation.

Static resource requirements regarding system architecture and operating system. Applications implemented in a hardware-dependent language (e.g., C) typically run only on the system architecture and operating system they have been compiled for (e.g., PowerPC or Intel Itanium running Linux). For this reason, the Resource Broker has to select execution machines that offer the same system architecture and operating system as required by the application.

Modifiable resource requirements: memory and disk space. While data mining applications may require a minimal amount of memory and disk space at start-up time, memory and disk space demands typically rise with the amount of data being processed and with the solution space being explored.
Therefore, end users are allowed to specify these requirements in accordance with the data volume to be processed and their knowledge of the application's behaviour. The Resource Broker will take these user-defined requirements into account and match them to those machines and resources that meet them.

Modifiable requirements: identity of machines. In some cases end users may wish to limit the list of possible execution machines based on personal preferences, for instance, when processing sensitive data. To support this requirement, it is possible for the user to specify the IPs of such machines in the job description. Such a list causes the Resource Broker to match only those resources and machines listed and to ignore all other machines independent of their capabilities.

The total number of jobs. Instead of specifying single values for each option and data input that the selected application requires, it is also possible to declare a list of distinct values (e.g., true, false) or a loop (e.g., from 0.50 to 10.00 with step 0.25). These represent rules for variable instantiations, which are translated into a number of jobs with different parameters by the Resource Broker. This is referred to as a multi-job. As a result, the Broker will prefer computational resources that are capable of executing the whole list of jobs at once in order to minimize data transfer. Typically, such resources are either clusters or high-performance machines offering many distinct processors. As an example, if the user specifies two input files (a.txt, b.txt) for the same data input and two loops running from 1 to 10 with step 1 as parameters for two options, the Resource Broker will translate this into two hundred (2 x 10 x 10) distinct jobs. If no single resource capable of executing them all at once is available, the Broker will distribute these jobs over those resources that provide the highest capability.
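The multi-job expansion described above is essentially a Cartesian product over the instantiation rules. A minimal sketch, with an invented rule format (the actual DataMiningGrid rules are XML-based):

```python
from itertools import product

def expand_multi_job(rules: dict) -> list:
    """Translate variable-instantiation rules into one parameter set per job."""
    names = list(rules)
    return [dict(zip(names, combo)) for combo in product(*rules.values())]

# The example from the text: two input files and two 1..10 loops -> 200 jobs.
jobs = expand_multi_job({
    "input": ["a.txt", "b.txt"],
    "opt1": list(range(1, 11)),
    "opt2": list(range(1, 11)),
})
print(len(jobs))  # -> 200
```

Each resulting dictionary corresponds to one command-line invocation, which is why the Broker favours resources able to absorb the whole batch locally.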
In addition, the job description includes further information that becomes important at the job submission stage. This information is briefly described below: • Instructions on where the application executables are stored, including all required libraries, and how to start the selected applications. These are required for transferring applications to execution machines across the grid, which is part of the stage-in process discussed in more detail in the following section. By staging in applications together with the input data dynamically at run-time, the system is capable of executing these applications on any suitable machine in the grid without prior installation of the respective application. • All data inputs and data outputs that have to be transferred prior to the execution. • All option values (application parameters) that have to be passed to the application at start-up. As the Resource Broker is capable of scheduling
applications that are started in batch mode from a command line, it passes all option values as flag-value pairs. Here, each flag is fixed and represents a single option. The values, however, may change for each call if a multi-job is specified. Finally, we enabled the Resource Broker to use machines that are not operated under Linux/Unix. The original implementation from GridBus wraps all applications with a shell script before scheduling them for execution. Contradicting the basic philosophy of grid computing, this prevents, for example, execution of applications on Windows-based machines. In addition, this restriction also proved to be in conflict with the requirements of the project partners in the DataMiningGrid project [5], who use several pools of machines running under MS Windows. This issue was resolved by simply removing the creation of this wrapper script and modifying the Broker accordingly.

7 Executing data mining applications

When a DataMiningGrid-enabled application is being executed, the Resource Broker takes on a central role in orchestrating the relevant processes.

7.1 Matching

The DataMiningGrid system is designed to be a generic system, capable of executing any batch algorithm in a grid computing environment. In order to orchestrate the execution process, the DataMiningGrid Resource Broker relies on two sets of information to match available resources to job execution requests from users. The first set is represented by the user request for resources based on detailed information about the application to be executed. This includes the number of needed CPUs and their architecture, the type of operating system, free memory and disk space, and so on. This information is matched against the specification and status of the available resources in the grid environment.
The resource managers of the available grid resources automatically register their resource offers in the central information system, which is based on the Globus WS Monitoring and Discovery System (WS-MDS). Ultimately, the application requirements and resource specification data are encoded as XML-formatted documents. The matching process begins when the Resource Broker receives the request for resources (jobs) and their requirements. Upon reception of this information, the Resource Broker queries the information system for the available grid resources and filters out those that do not meet the requirements. Possible user
requirements specifying a restriction of the execution to one or more specific WS-GRAMs are also taken into account during this matching process. Such restrictions are useful for working with databases or data files whose content cannot be moved (or is too expensive to move). After matching is completed, the matching module transfers to the scheduler a detailed list of the resources that meet the requirements of the user's job execution request.

7.2 Scheduling

The scheduling policy component of the DataMiningGrid Resource Broker is implemented as a pluggable module, which can be changed on demand. The default policy is to prefer the WS-GRAM services with the largest number of computational resources. On average, this minimizes data transfers, as a WS-GRAM with sufficiently large resources to execute all the jobs of a single multi-job submission requires only one stage-in process. The Resource Broker was designed on the basis of a number of assumptions regarding the execution environment and its purpose. The assumptions can be summarized as follows: (a) the preferred policy is to minimize the execution latency for each user (all jobs must be completed as soon as possible); (b) failure of single jobs should not cancel the execution of the remaining jobs; (c) job execution time is, on average, longer than stage-in time; and (d) data movement is expensive. Based on these assumptions and rules, the scheduling algorithm implemented by the DataMiningGrid Resource Broker is shown in Algorithm 1. The rationale for the scheduling algorithm is to address the assumptions and requirements discussed above. The Resource Broker prepares the collection of available WS-GRAMs and sorts the collection in descending order by the number of free CPUs. The WS-GRAM with the highest capacity is selected first in order to reduce, on average, the number of stage-in procedures.
Stage-in, being an expensive procedure, is performed once per WS-GRAM and not once per job. If the selected WS-GRAM does not have sufficient capacity to execute all jobs, the WS-GRAM with the next highest capacity is selected, and so on. This behaviour is explained by assumption (c) above: it is preferable to start the execution on a new WS-GRAM instead of waiting for the previous WS-GRAM to finish executing the submitted jobs. The Resource Broker tries to submit all jobs as soon as it can in order to reduce the execution latency of the user's multi-job submission. During execution, all jobs are constantly monitored until the last job is completed.
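The capacity-first selection described above can be sketched as follows. This is an illustrative simplification with hypothetical names; it omits the failure handling and the timeout/refill step of Algorithm 1.

```python
def assign_jobs(gram_free_cpus, n_jobs):
    """Greedy capacity-first assignment: prefer the WS-GRAM with the most
    idle CPUs so that, on average, fewer stage-in operations are needed.
    Returns a mapping {gram_name: number_of_jobs_assigned}."""
    # Sort the available WS-GRAMs by the number of idle CPUs, largest first.
    order = sorted(gram_free_cpus, key=gram_free_cpus.get, reverse=True)
    assignment = {}
    remaining = n_jobs
    for gram in order:
        if remaining == 0:
            break
        # Submit as many jobs as the WS-GRAM has idle CPUs, or all
        # remaining jobs if fewer are left.
        batch = min(gram_free_cpus[gram], remaining)
        assignment[gram] = batch
        remaining -= batch
    return assignment

# A 100-job multi-job over three WS-GRAMs: the 60-CPU cluster is filled
# first, then the next largest, and so on.
print(assign_jobs({"gram-a": 60, "gram-b": 30, "gram-c": 20}, 100))
# {'gram-a': 60, 'gram-b': 30, 'gram-c': 10}
```

Filling the largest WS-GRAM first keeps the number of WS-GRAMs touched by one multi-job, and hence the number of stage-in operations, small.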
7.3 Stage-in

The optimization of the stage-in process is based on the assumption that data movement is an expensive and time-consuming process. The minimization of data movement is achieved by performing the stage-in process once per WS-GRAM (as opposed to once per job); each job is provided with the complete URI of the local data storage location. Usually, a WS-GRAM is responsible for at least one cluster of machines. The per-WS-GRAM approach has the advantage that the stage-in operation is performed far fewer times than would be necessary in a per-job approach. After the successful stage-in of all executables, data and libraries, the executable is launched and the execution monitored until completion.

7.4 Job monitoring

As already mentioned, a single job description document may result in the execution of thousands of jobs, which all have different job IDs but the same scheduler ID assigned to them. This scheduler ID allows tracking of a set of jobs even if their execution is managed by different instances of the execution manager (i.e., different GT4 installations). Figure 2 depicts the client-side monitoring component, which, upon user request, is capable of displaying up-to-date information about the status of each job by querying the Resource Broker with the scheduler ID. During job execution, the Resource Broker constantly monitors all changes in the status of each job and reports them to the user. Jobs that are detected as failed are automatically resubmitted to another WS-GRAM for several retries. In parallel, WS-GRAMs that are detected as failed are removed from the list of available WS-GRAMs and do not take part in the rescheduling of jobs.

7.5 Stage-out

In the execution request, the user also specifies the location of the storage server on which the results should be saved.
At the end of the execution process, all results are shipped to the storage server, and the URI of that location is returned to the user for further data handling. The Resource Broker also takes care of deleting all data transferred or generated in the course of executing an individual job description document, including executables, libraries, input data and temporary data, thus leaving the execution machine in the same state as it was prior to execution.
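The per-WS-GRAM stage-in policy of Section 7.3 amounts to grouping jobs by their target WS-GRAM and transferring executables, libraries and input data once per group. The sketch below (hypothetical names, counting transfers only) contrasts this with a per-job approach.

```python
from collections import defaultdict

def plan_stage_ins(job_assignments):
    """Group jobs by their target WS-GRAM so that executables, libraries
    and input data are transferred once per WS-GRAM rather than once per
    job. job_assignments maps job IDs to WS-GRAM names."""
    groups = defaultdict(list)
    for job_id, gram in job_assignments.items():
        groups[gram].append(job_id)
    # One transfer per WS-GRAM; every job in the group is given the URI
    # of the same local data storage location.
    return {gram: {"transfers": 1, "jobs": ids} for gram, ids in groups.items()}

plan = plan_stage_ins({1: "gram-a", 2: "gram-a", 3: "gram-b", 4: "gram-a"})
total_transfers = sum(g["transfers"] for g in plan.values())
print(total_transfers)  # 2 stage-ins instead of 4 (one per job)
```

For a multi-job of hundreds of jobs spread over a handful of WS-GRAMs, the saving over per-job staging grows with the job count.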
8 Evaluation

The Resource Broker was tested in the DataMiningGrid test bed [5] spanning three European countries: the United Kingdom, Germany and Slovenia. The test bed consists of four servers with GT4 installations and three local computational clusters based on Condor with varying numbers of computational machines. The central GT4 server was running the core GT4 services as well as the Resource Broker and Information Integrator services. Below we present the average results of several executions on the test bed, whose execution machines are listed in Table 1. In order to evaluate the Resource Broker, several data mining applications were tested. Performance measures of two of these applications are presented here. The first application uses J48 [12], an implementation of the C4.5 decision tree algorithm [40] developed by Ross Quinlan, which is part of the open source data mining toolkit "Weka" [41]. The second application is used for re-engineering gene-regulatory networks from dynamic microarray data. This algorithm, the Co-dependence Algorithm, was developed by the University of Ulster and the Weihenstephan University of Applied Sciences [35]. It is based on an evolutionary algorithm technique. The Co-dependence Algorithm was chosen to represent a "long job" and the J48 Algorithm to represent a "short job". The Co-dependence Algorithm's fastest total serial run-time on the fastest machine in the test bed was measured at 2200 seconds; the corresponding run-time of the J48 Algorithm was 250 seconds.

8.1 Speed-up

Speed-up quantifies the reduction of elapsed time obtained by processing a constant amount of work or load on a successively greater number of computers or processors. Speed-up is a typical reason why users want to grid-enable data mining applications.
For pure and homogeneous parallel computer scenarios, speed-up is typically defined as the ratio of the serial execution time of the fastest known serial algorithm (requiring a certain number of basic operations to solve the problem) to the parallel execution time of the chosen algorithm [36]. However, in a grid environment heterogeneity is the norm. Therefore, the assumptions made for a homogeneous parallel computer set-up normally do not hold in a grid. To estimate speed-up, we determined the fastest serial run-time, tS, of a single instance of the two algorithms on the fastest machine in the test bed and then measured the parallel run-times, tP(N), for an increasing number, N, of CPUs
in the test bed based on running a fixed number, K, of instances (jobs) of the algorithm. From this we calculate speed-up as follows:

speed-up = K · tS / tP(N)    (1)

For our speed-up experiments, we used a fixed number of 100 jobs (K = 100) and the following numbers of processors: N = 10, N = 50 and N = 100. The raw run-time measurements and the derived speed-up measures of both algorithms are depicted in Table 2 and Table 3 below. The data shows that the speed-up of both the Co-dependence Algorithm and the J48 Algorithm increases approximately linearly:

• Co-dependence Algorithm: speed-up ∼ (1/2)N
• J48 Algorithm: speed-up ∼ (1/3)N

The gap between the optimum speed-up and the achieved speed-up can be explained by analyzing the machines utilized in the experiment. For instance, let us examine the experiments with N = 100. Since the termination time of the parallel execution depends on the termination time of the job on the slowest machine, the speed-up results would be fairly close to optimum for the long jobs (460,000 / 4867 ≈ 95) and still acceptable for the short jobs (57,000 / 831 ≈ 69) in a test bed consisting of homogeneous hardware. The worse performance of the shorter jobs originates in the system overhead incurred for each submitted job, which includes submission of the job to the local scheduler, propagation of state changes, etc. As the system overhead is approximately constant for each job (it does not include the stage-in/out time, which is incurred once per WS-GRAM), the experiments with long jobs, for which such overhead is negligible compared with their execution time, show results very close to the optimum speed-up.

8.2 Scale-up

In general, the scalability of a distributed computing architecture is a measure of its capability to effectively utilize an increasing number of resources such as processing elements [36].
Another way of formulating this is that the scalability of a parallel (or distributed) system is a measure of its capacity to deliver linearly increasing speed-up with respect to the number of processors used [37]. Typically, good scalability of a distributed system is desired if the system needs to support more users, human and machine alike. To avoid adverse impact on the response times of current users, the capacity of the system must be grown (or scaled up) in proportion to the additional load caused by more users. Therefore, scale-up may be defined as the ability of a greater number of processors to accommodate a proportionally greater workload in a more-or-less constant amount of time. For grid-enabled data mining systems, scale-up becomes critical as the number of users of such a system increases. Hence, good scale-up behaviour is a critical requirement for the DataMiningGrid Resource Broker. To quantify the scale-up behaviour of the DataMiningGrid Resource Broker, we carried out a series of experiments with increasing system capacity (number, N, of CPUs) and system load (number, K, of short and long data mining jobs) with the algorithms and test bed described above. The data obtained from these experiments is depicted in Table 4 and Table 5 below. The data shows that the scale-up behaviour of the Resource Broker is excellent. In the long job (Co-dependence Algorithm) scenario the response time increases by only 3.2% as N and K are increased from 10 to 50 and then to 100. In the case of the short job experiments (J48 Algorithm) the response time increases by 19.5% as the load and capacity are stepped up. The data shows that when dealing with a relatively small number of jobs, the slowest machines have a critical negative impact on the achieved scale-up. When executing 100 or fewer jobs on 100 machines, the slowest machines are the ones that widen the gap between the actual scale-up and the maximum scale-up. However, when submitting a large set of jobs, the influence of the small number of slowest machines decreases, causing this gap to shrink steadily.

9 Conclusions and future work

This study presents the Resource Broker developed in the DataMiningGrid project [5]. The design of the Resource Broker was driven by requirements arising from data mining applications in different sectors. A review of existing technology showed that no available resource broker addressed all of these requirements.
The DataMiningGrid Resource Broker combines the following features (Section 4): (a) fully automated resource aggregation, (b) data-oriented scheduling, (c) zero footprint on execution machines, (d) extensive monitoring of resources, (e) parameter sweep support, (f) interoperability, (g) adherence to interoperability standards, (h) adherence to security standards, and (i) user friendliness. Extensive performance experiments we carried out indicate that the Resource Broker shows excellent speed-up and scale-up behaviour for jobs that run for more than half an hour. Both are critical to support modern data mining applications. The DataMiningGrid Resource Broker has been developed as part of a comprehensive architecture for grid-enabling data mining applications. This architecture is designed to be highly flexible (software and hardware platforms, type of data mining applications and data mining technology), extensible (adding of system features, applications and resources), efficient (throughput
and scalability), and user-friendly (graphical and Web user interfaces, support for user-definable workflows, hiding of underlying complexity from the user). The development of the DataMiningGrid system architecture and its Resource Broker involved an extensive study and evaluation of WSRF [15] and the Globus Toolkit [20]. Following service-oriented architecture principles, the DataMiningGrid Resource Broker supports a full, easy-to-use framework for data mining in grid computing. Currently, a large set of data mining applications is being evaluated on the DataMiningGrid system. This will help us to identify areas where additional development may be useful. In its present implementation, one Resource Broker is required for each virtual organization. Future work will consider a distributed implementation, which may help to further enhance availability and load balancing features.

References

[1] M.J. Berry, G. Linoff, Data Mining Techniques For Marketing, Sales and Customer Support, John Wiley & Sons, Inc., New York, 1997.

[2] A.K.T.P. Au, V. Curcin, M. Ghanem, et al., Why grid-based data mining matters? Fighting natural disasters on the grid: from SARS to landslides, in S.J. Cox (editor), UK e-Science All-Hands Meeting, AHM 2004, Nottingham, UK, September 2004, EPSRC, 2004, 121-126, ISBN: 1-9044-2521-6.

[3] M. Cannataro, D. Talia, P. Trunfio, Distributed data mining on the grid, Future Generation Computer Systems, Vol. 18, 1101-1112, 2002.

[4] W.K. Cheung, X-F. Zhang, Z-W. Luo, F.C.H. Tong, Service-Oriented Distributed Data Mining, IEEE Internet Computing, 44-54, July/August 2006.

[5] The DataMiningGrid Consortium and Project,

[6] K. Krauter, R. Buyya, M. Maheswaran, A taxonomy and survey of grid resource management systems for distributed computing, Software: Practice and Experience, Vol. 32, No. 2, 135-164, 2001.

[7] D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.

[8] C.
Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, Journal of Data Warehousing, 5(4), 13-22, 2000.

[9] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, Vol. 39, No. 11, 27-34, 1996.
[10] I. Foster and C. Kesselman, editors, The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 2004.

[11] J. Nabrzyski, J.M. Schopf, and J. Weglarz (editors), Grid Resource Management: State of the Art and Future Trends, Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.

[12] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco (US), 2000.

[13] T. Kosar, M. Livny, A framework for reliable and efficient data placement in distributed computing systems, Journal of Parallel and Distributed Computing, No. 65, 1146-1157, 2005.

[14] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Globus Project, 2002.

[15] I. Foster, K. Czajkowski, D.E. Ferguson, J. Frey, S. Graham, T. Maguire, D. Snelling, S. Tuecke, Modeling and managing state in distributed systems: the role of OGSI and WSRF, Proc. of the IEEE, Vol. 93, No. 3, 604-612, 2005.

[16] V. Welch, I. Foster, C. Kesselman, O. Mulmo, L. Pearlman, S. Tuecke, J. Gawor, S. Meder, F. Siebenlist, X.509 Proxy Certificates for Dynamic Delegation, Proc. of the 3rd Annual PKI R&D Workshop, 2004.

[17] A. Barbir, M. Gudgin, M. McIntosh, Basic Security Profile Version 1.0, Working Group Draft, 2004-05-12.

[18] J. Rosenberg and D. Remy, Securing Web Services with WS-Security, Sams Publishing, Indianapolis, 2004.

[19] G. Della-Libera, B. Dixon, P. Garg, S. Hada, Web Services Secure Conversation (WS-SecureConversation), C. Kaler and A. Nadalin (eds.), Microsoft, IBM, VeriSign, RSA Security, 2002.

[20] The Globus Alliance, A Globus Primer: Or, Everything You Wanted to Know about Globus, but Were Afraid To Ask. Describing Globus Toolkit Version 4, Primer 0.6.pdf

[21] M. Litzkow, M.
Livny, Experience with the Condor distributed batch system, Proc. of the IEEE Workshop on Experimental Distributed Systems, 97-100, 1990.

[22] G. Allen, T. Goodale, T. Radke, M. Russell, E. Seidel, K. Davis, K.N. Dolkas, N.D. Doulamis, T. Kielmann, A. Merzky, J. Nabrzyski, J. Pukacki, J. Shalf, I. Taylor, Enabling Applications on the Grid: A GridLab Overview, International Journal of High Performance Computing Applications, Vol. 17, No. 4, 449-466, 2003.
[23] F. Gagliardi, B. Jones, F. Grey, M-E. Bégin, M. Heikkurinen, Building an infrastructure for scientific Grid computing: status and goals of the EGEE project, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 363, No. 1833, 1729-1742, 2005.

[24] LCG-2 User Guide, Guide.html

[25] C. Munro, B. Koblitz, Performance comparison of the LCG2 and gLite file catalogues, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 559, No. 1, 48-52, 2006.

[26] G. Allen, W. Benger, T. Goodale, H. Hege, G. Lanfermann, A. Merzky, T. Radke, E. Seidel, J. Shalf, The Cactus Code: A Problem Solving Environment for the Grid, Proc. of the Ninth International Symposium on High Performance Distributed Computing (HPDC), Pittsburgh, USA, IEEE Computer Society Press, Los Alamitos, CA, USA, 2000.

[27] D. Abramson, J. Giddy, L. Kotler, High Performance Parametric Modeling with Nimrod/G: Killer Application for the Global Grid?, Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, IEEE Computer Society Press, Los Alamitos, CA, USA, 520-528, 2000.

[28] S. Venugopal, R. Buyya, L. Winton, A Grid Service Broker for Scheduling e-Science Applications on Global Data Grids, Journal of Concurrency and Computation: Practice and Experience, Wiley Press, USA, Vol. 18, No. 6, 685-699, 2005.

[29] K. Nadiminti, S. Venugopal, H. Gibbins, T. Ma, R. Buyya, The GridBus Grid Service Broker and Scheduler,

[30] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, I. Wang, Programming scientific and distributed workflow with Triana services, Concurrency and Computation: Practice and Experience, Vol. 18, No. 10, 1021-1037, 2005.
[31] Globus Toolkit 4.0: Information Services,

[32] XML Path Language (XPath) Version 1.0, W3C Recommendation, 16 November 1999.

[33] GT 4.0: Execution Management, online at

[34] V. Kravtsov, T. Niessen, V. Stankovski, A. Schuster, Service-based Resource Brokering for Grid-based Data Mining, Proceedings of the 2006 International Conference on Grid Computing and Applications, Las Vegas, USA, 163-169, 2006.
[35] J. Mandel, N. Palfreyman, W. Dubitzky, Modelling codependence in biological systems, IEE Proc. Systems Biology, 153(5), 2006. In press.

[36] V. Kumar, A. Gupta, Analyzing Scalability of Parallel Algorithms and Architectures, Journal of Parallel and Distributed Computing (special issue on scalability), Vol. 22, No. 3, 379-391, 1994.

[37] A. Grama, A. Gupta, V. Kumar, Isoefficiency Function: A Scalability Metric for Parallel Algorithms and Architectures, IEEE Parallel and Distributed Technology, Special Issue on Parallel and Distributed Systems: From Theory to Practice, Vol. 1, No. 3, 12-21, 1993.

[38] T. Banks, Web Services Resource Framework Primer, OASIS Committee Draft 01, December 2005.

[39] K. Czajkowski, D.F. Ferguson, I. Foster, J. Frey, S. Graham, I. Sedukhin, D. Snelling, S. Tuecke, W. Vambenepe, The WS-Resource Framework, March 5, 2004.

[40] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco (US), 1993.

[41] I.H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, S.J. Cunningham, Weka: Practical machine learning tools and techniques with Java implementations, in N. Kasabov and K. Ko, editors, Proceedings of the ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, Dunedin, New Zealand, 192-196, 1999.
Fig. 1. The DataMiningGrid Resource Broker in the DataMiningGrid system architecture
Algorithm 1. Scheduling algorithm of the DataMiningGrid Resource Broker.

Function Schedule(jobs)
  Obtain the set R of available grid resources (WS-GRAMs)
  M = {R1, R2, ..., Rm} ⊆ R            {Filter out incompatible resources}
  Sort M in descending order by |Ri|   {|Ri| equals the number of idle CPUs in Ri}
  unsubmitted_jobs ← jobs; T ← {}
  DO WHILE unsubmitted_jobs ≠ {}
    Select Rh ∈ M with ∀j ≠ h: |Rj| ≤ |Rh|   {Select highest-capacity WS-GRAM}
    Submit min(|Rh|, |unsubmitted_jobs|) jobs to Rh
    IF during the submission Rh was detected as failed THEN
      M ← M \ Rh                       {Remove failed WS-GRAM from M}
      IF M = {} ∧ T = {} ∧ unsubmitted_jobs ≠ {} THEN
        Cancel all jobs in jobs \ unsubmitted_jobs
        RETURN failure
      END
      Go to beginning of DO WHILE
    END   {if failed submission}
    M ← M \ Rh; T ← T ∪ {Rh}           {Move Rh from M to temporary list T}
    Remove submitted jobs from unsubmitted_jobs
    IF M = {} ∧ unsubmitted_jobs ≠ {} THEN
      Wait for pre-defined timeout
      M ← T; T ← {}
    END   {if submitted to all WS-GRAMs}
  END   {End of do while loop}
  RETURN success
END   {End of Function Schedule}
Fig. 2. The client-side job monitoring component

Table 1
Basic test bed configuration.

  Location                            No. CPUs   Typical CPU           Memory
  University of Ulster, UK            50-70      Itanium II 900 MHz    2 GB
  University of Ulster, UK            4-10       Pentium, 2 to 3 GHz   1 GB
  Fraunhofer Institute, Germany       4          Pentium 1.4 GHz       1 GB
  University of Ljubljana, Slovenia   20-40      Pentium 1.8 GHz       1 GB

Table 2
Left: Raw run-time measurements of the "long-job" Co-dependence Algorithm. Right: Speed-up according to Equation (1).
Table 3
Left: Raw run-time measurements of the "short-job" J48 Algorithm. Right: Speed-up according to Equation (1).

Table 4
Scale-up experiment results for the "long-job" Co-dependence Algorithm.

Table 5
Scale-up experiment results for the "short-job" J48 Algorithm.