High-throughput computing and opportunistic computing for matchmaking processes and indexing processes
University of Calabria
Bachelor thesis in Computer Engineering
Supervisor: Ing. Carlo Mastroianni
Candidate: Silvio Sangineto (Matriculation Number: 83879)
2007-2008
Contents
• Introduction to the Thesis
• Introduction to Distributed Systems
• Introduction to the Grid, High-throughput Computing and Opportunistic Computing
• Condor
• Why Condor?
• Introduction to Prototype Architecture
• Centralized prototype architecture
• Centralized Scorer
• Results achieved
• A possible solution: Distributed Scorer
• Distributed Scorer
• New Results achieved
• From “local” business case to the big business case…
Introduction to the Thesis
Creation of a distributed Web-Spider, with particular attention to efficiency, scalability, energy saving and costs.
Description: The goal of this project is to recover the URLs of Italian companies. At present there is no complete list of the Italian companies that have a Web-Site. The recovery is possible because we can use a customer database with general information such as VAT number, phone, emails, etc. This information can be matched against the Web-Site contents, so we can find the official Web-Site of each company.
Why: Knowing the official Web-Site is very important because it quickly gives you:
• contacts and emails of the company;
• updates and news previews;
• descriptions of the company's activities;
• other information (e.g. history).
Introduction to the Thesis
Boundary conditions for my thesis:
• It is difficult to estimate how many companies have a Web-Site (coverage level);
• Web-Site structures can have many non-standard parts (some Web-Sites may not contain information such as VAT number, email, etc.);
• The update of the database that contains the URLs must catch both the Web-Site of a new company and the new Web-Site of an existing company;
• Some privacy problems (e.g. email).
Relevant problems for my thesis:
• Load balancing and efficient resource utilization;
• Scalability;
• Costs;
• Energy saving.
General solution: in the Web-Spiders that exist on the Web today (e.g. Google), when more computational power is needed the company usually buys more servers to provide it.
Introduction to the Thesis
We want to find an answer to the relevant problems in the “local” business case, so that the same solutions can be used for the “big” business case.
Introduction to Distributed Systems
Definition: A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility.
In our case we use a distributed system to obtain more computational power.
Advantages of a distributed system:
• Reliability;
• Sharing of resources;
• Aggregate computing power;
• Scalability.
Grid Computing, High-throughput computing and opportunistic computing
Grid Computing: Grids are intrinsically distributed and heterogeneous but must be viewed by the user (whether an individual or another computer) as a virtual environment with uniform access to resources. Much of Grid software technology addresses the issues of resource scheduling, quality of service, fault tolerance, decentralized control, security and so on, which enable the Grid to be perceived as a single virtual platform by the user.
High-throughput computing: The goal of a high-throughput computing environment is to provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources available to the network.
Opportunistic computing: The goal of opportunistic computing is the ability to utilize resources whenever they are available, without requiring 100% availability.
The two goals are naturally coupled. High-throughput computing is most easily achieved through opportunistic means.
Condor
Modern processing environments that consist of large collections of workstations interconnected by a high capacity network raise the following challenging question: can we satisfy the needs of users who need extra capacity without lowering the quality of service experienced by the owners of underutilized workstations? . . . The Condor scheduling system is our answer to this question.
At the University of Wisconsin, Miron Livny combined his 1983 doctoral thesis on cooperative processing with the powerful Crystal Multicomputer designed by DeWitt, Finkel, and Solomon and the novel Remote UNIX software designed by Litzkow. The result was Condor, a new system for distributed computing.
The goal of the Condor Project is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing and opportunistic computing on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput.
Condor is a middleware that allows users to join and use distributed resources.
Condor
Condor is a specialized job and resource management system (RMS) for compute-intensive jobs. Like other full-featured systems, Condor provides a job management mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their jobs to Condor, and Condor subsequently chooses when and where to run them based upon a policy, monitors their progress, and ultimately informs the user upon completion.
Two very important mechanisms:
• ClassAds: The ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (e.g. jobs) with resource offers (e.g. machines).
• Remote System Calls: When running jobs on remote machines, Condor can often preserve the local execution environment via remote system calls. Remote system calls are one of Condor’s mobile sandbox mechanisms for redirecting all of a job's I/O-related system calls back to the machine that submitted the job. Therefore, users do not need to make data files available on remote workstations before Condor executes their programs there, even in the absence of a shared file system.
Condor
How does Condor work? This is an example [an agent (A) is shown executing a job on a resource (R) with the help of a matchmaker (M)]:
Step 1: The agent and the resource advertise themselves to the matchmaker.
Step 2: The matchmaker informs the two parties that they are potentially compatible.
Step 3: The agent contacts the resource and executes a job.
This figure shows the major processes in a Condor system.
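The three steps above can be sketched in a few lines of Python. This is only an illustration of the matchmaking idea, not Condor's implementation: the `Party` and `Matchmaker` classes, the `claim_and_run` helper and the attribute values are hypothetical, and real ClassAds are expressions rather than Python dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Party:
    """An agent (job side) or a resource (machine side) with a simplified ad."""
    name: str
    ad: Dict[str, object]                               # stand-in for a ClassAd
    requirements: Callable[[Dict[str, object]], bool]   # predicate over the other ad


class Matchmaker:
    def __init__(self) -> None:
        self.agents: List[Party] = []
        self.resources: List[Party] = []

    def advertise(self, party: Party, is_resource: bool) -> None:
        # Step 1: agents and resources advertise themselves.
        (self.resources if is_resource else self.agents).append(party)

    def match(self) -> List[Tuple[Party, Party]]:
        # Step 2: report pairs whose requirements are mutually satisfied.
        free = list(self.resources)
        pairs = []
        for agent in self.agents:
            for resource in free:
                if agent.requirements(resource.ad) and resource.requirements(agent.ad):
                    pairs.append((agent, resource))
                    free.remove(resource)   # one job per resource in this sketch
                    break
        return pairs


def claim_and_run(agent: Party, resource: Party, job: Callable[[], str]) -> str:
    # Step 3: the agent contacts the resource; both sides re-check before running.
    if not (agent.requirements(resource.ad) and resource.requirements(agent.ad)):
        raise RuntimeError("match is no longer valid")
    return job()


def run_example() -> List[str]:
    mm = Matchmaker()
    agent = Party("A", {"ImageSize": 128},
                  lambda m: bool(m["HasJava"]) and m["Memory"] >= 128)
    no_java = Party("R1", {"HasJava": False, "Memory": 2048}, lambda j: True)
    with_java = Party("R2", {"HasJava": True, "Memory": 2048},
                      lambda j: j["ImageSize"] <= 1024)
    mm.advertise(agent, is_resource=False)
    mm.advertise(no_java, is_resource=True)
    mm.advertise(with_java, is_resource=True)
    return [claim_and_run(a, r, lambda: f"{a.name} ran on {r.name}")
            for a, r in mm.match()]
```

In this sketch the matchmaker only introduces the two parties; the job itself runs in the direct contact between agent and resource, as in Step 3.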
Condor
What happens when there are several Condor pools? This is an example [an agent (A) is shown executing a job on a resource (R) via direct flocking]:
Step 1: The agent and the resource advertise themselves locally.
Step 2: The agent is unsatisfied, so it also advertises itself to Condor Pool B.
Step 3: The matchmaker (M) informs the two parties that they are potentially compatible.
Step 4: The agent contacts the resource and executes a job.
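The flocking steps can be illustrated with a small sketch: the agent tries its local pool first and turns to another pool only when it remains unsatisfied. The function names, the dictionary fields (`idle`, `memory_mb`) and the pool contents are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Optional, Tuple

Machine = Dict[str, object]


def match_in_pool(job: Dict[str, int], pool: List[Machine]) -> Optional[Machine]:
    """Return the first idle machine in the pool with enough memory, if any."""
    for machine in pool:
        if machine["idle"] and machine["memory_mb"] >= job["memory_mb"]:
            return machine
    return None


def flock(job: Dict[str, int],
          local_pool: List[Machine],
          remote_pools: Dict[str, List[Machine]]) -> Optional[Tuple[str, str]]:
    """Try the local pool first, then each remote pool in order."""
    machine = match_in_pool(job, local_pool)
    if machine is not None:
        return ("local", str(machine["name"]))
    # The agent is unsatisfied locally, so it advertises to the other pools.
    for pool_name, pool in remote_pools.items():
        machine = match_in_pool(job, pool)
        if machine is not None:
            return (pool_name, str(machine["name"]))
    return None
```

For example, with a busy local machine and an idle machine in a pool named "B", `flock` returns the pool name and the machine name from pool B.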
Condor
Condor Universe: Condor has several runtime environments (each called a universe) from which to choose. The Java Universe was the best for this first version of our project, because I could take advantage of portability (heterogeneous systems) and it was good for the “local” business case. A universe for Java programs was added to Condor in late 2001. This was due to a growing community of scientific users that wished to perform simulations and other work in Java. Although such programs might run slower than native code, such losses were offset by faster development times and access to larger numbers of machines.
Why Condor?
We used Condor for these reasons, among others:
1) Efficient resource management (opportunistic computing and high-throughput computing, ClassAds, etc.);
2) It is a middleware for heterogeneous distributed systems (e.g. we can use different operating systems);
3) It is an open source project and it is used as a batch system in many projects around the world;
4) Flexibility.
Introduction to Centralized Prototype Architecture
[Figure: centralized prototype architecture. Components: Web-Sites, Crawler, Index, Customer Data-Base, Scorer, Validator (manual), Updater, URLs Data-Base. Arrow labels: Identifying information, Make Query, Results, Candidates, New Companies, New Web-Sites.]
Introduction to Centralized Prototype Architecture
Crawler: The prototype Web-Spider must have a Crawler that builds an Index of the companies' Web-Sites (e.g. UbiCrawler). We can either use an existing Crawler or build a new one on the basis of products that are already available (Nutch, Heritrix, JSpider, etc.). In this business case we used the data extracted through UbiCrawler. For the indexing processes we used Managing Gigabytes for Java (MG4J).
Customer Data-Base: This database contains the identifying information about the companies: VAT number, phone, emails, company name, sign, etc.
Scorer: This step executes several queries and many matchmaking processes to find the right “match” between the identifying information and the companies' Web-Sites. Each match has a score.
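To make the idea of a scored match concrete, here is a minimal sketch that scores one candidate page against a company's identifying information. It is not the thesis scorer: the field names, the point values in `WEIGHTS` and the normalization rule are assumptions made only for illustration.

```python
import re
from typing import Dict

# Hypothetical points per matched field; these values are not from the thesis.
WEIGHTS = {"vat": 50, "phone": 30, "name": 15, "sign": 5}


def normalize(text: str) -> str:
    """Lower-case and keep only letters and digits, so '0984/123456' matches '0984 123456'."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def score_page(company: Dict[str, str], page_text: str) -> int:
    """Sum the points of every identifying field found in the page text."""
    haystack = normalize(page_text)
    score = 0
    for field_name, points in WEIGHTS.items():
        value = normalize(company.get(field_name, ""))
        if value and value in haystack:
            score += points
    return score
```

A page that contains the VAT number, the phone number and the company name would score 95 points with these weights, while a page with none of them scores 0; the candidate with the highest score would then go to validation.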
Centralized Scorer
Class Diagram: these are the most important classes, where we can see the principal processes of the Web-Spider (together with the indexing processes).
Centralized Scorer
In the Centralized Scorer we have the following activities:
• query over the phone and address;
• query over the VAT number;
• query over the other fields;
• check over the URL name;
• check over the page type;
• score.
All these activities are completed in about 5 seconds on average, so this is how long the analysis of one company takes. If you have to analyse 56,000 companies you have to wait about 280,000 seconds. There is a big problem: the number of companies can be very high!
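The estimate is a plain multiplication, shown here as a small sketch so that it can be repeated for other sample sizes. The function names are illustrative; the 5-second average is the figure reported on this slide.

```python
SECONDS_PER_COMPANY = 5  # average time per company reported for the centralized scorer


def sequential_seconds(companies: int, seconds_per_company: int = SECONDS_PER_COMPANY) -> int:
    """Total time when one machine analyses the companies one after another."""
    return companies * seconds_per_company


def to_hours(seconds: float) -> float:
    return seconds / 3600
```

For 56,000 companies this gives 280,000 seconds, a little under 78 hours on a single machine.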
Centralized Scorer
We can glance at the Java code that implements some functions.
AssociaDomini constructor: this class mainly implements the logic that allows the “match” between the identifying information and the companies’ Web-Sites.
Centralized Scorer 3/3
How the method called associa() records the results on a log file.
We preferred to use Hibernate because it is an open source Java persistence framework that performs powerful object-relational mapping.
Results Achieved
On a sample of 56,000 companies:

Query          Coverage (#)   Coverage (%)   Precision (%)
Sign           2747           4.43%          1%
Phone          25715          41.47%         25%
VAT Number     4369           7.05%          55%
Company Name   27487          44.33%         3%

• Phone and VAT Number: these types of query are very good for coverage and reliability.
• Sign: low coverage.
• Company Name: very good coverage but low precision.
How many companies can you cover with these queries? What precision can you achieve?
Results Achieved
For a sample of 56,000 companies, 1 personal computer works for about 77 hours (for this computation only).
[Chart: Trend (s), computation time in seconds (axis from 0 to 300,000) for 500, 1,000, 10,000 and 56,000 companies.]
Personal computer used: desktop computer, Intel Dual Core 2.4 GHz, 2 GB of RAM and 1 TB of hard disk.
Possible problems:
• the personal computer goes down;
• there are new companies (updating) or some Web-Sites have changed; in this case the computation must continue…
The matchmaking processes and indexing processes recur frequently over time.
For 1,000,000 companies to analyse, 1 personal computer works for about 1389 hours (roughly 58 days), and this is an ideal case. This isn’t a scalable solution!
A possible solution: Distributed Scorer
We want to make a scalable solution for our Web-Spider. There are some important constraints that we have to respect:
1) Energy saving;
2) Efficient resource management and efficient resource utilization;
3) Cost cutting;
4) Analysing more companies in a given period of time.
We can submit each set of queries to a different computer.
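One simple way to prepare sets of queries for different computers is to split the list of companies into batches of nearly equal size. The sketch below uses a round-robin split; the function name and the use of company identifiers as strings are assumptions, and the thesis does not state which partitioning was used.

```python
from typing import List, Sequence


def split_into_batches(company_ids: Sequence[str], n_machines: int) -> List[List[str]]:
    """Distribute the companies round-robin over at most n_machines batches."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    n_batches = min(n_machines, len(company_ids))
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    for i, company_id in enumerate(company_ids):
        batches[i % n_batches].append(company_id)
    return batches
```

Each batch can then be submitted as an independent job, and the batch sizes differ by at most one company.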
Distributed Scorer
We built a distributed scorer using the Condor middleware. This is a possible architecture on which to execute our Distributed Scorer.
[Figure: example of architecture used by the National Institute of Nuclear Physics.]
Distributed Scorer
We built a wrapper class to prepare the work environment on Condor. This class implements the logical connection between the application and Condor. It runs on the Server (Central Manager).
We used both vertical distribution and horizontal distribution.
Distributed Scorer
We can see some tests on Condor for our application.
Some examples of Submit Description Files: these files are used by Condor for the matchmaking processes between the resources and the jobs.
This is our Condor Pool during the tests.
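As an illustration of what a Submit Description File for the Java Universe can look like, the sketch below generates one from a list of batch files, with one `queue` statement per batch. The main class name `ScorerJob`, the batch file names and the output file names are hypothetical; the submit commands themselves (`universe`, `executable`, `arguments`, `transfer_input_files`, `output`, `error`, `log`, `queue`) are standard Condor commands, and in the Java Universe the first argument is the name of the main class.

```python
from typing import List


def java_submit_description(main_class: str, batch_files: List[str]) -> str:
    """Build the text of a submit description file with one job per batch file."""
    lines = [
        "universe = java",
        f"executable = {main_class}.class",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "log = scorer.log",
    ]
    for i, batch in enumerate(batch_files):
        lines += [
            "",
            # In the Java Universe the first argument names the main class.
            f"arguments = {main_class} {batch}",
            f"transfer_input_files = {batch}",
            f"output = scorer.{i}.out",
            f"error = scorer.{i}.err",
            "queue",
        ]
    return "\n".join(lines) + "\n"
```

The resulting text could be written to a file and passed to `condor_submit`; a wrapper class like the one described for the Central Manager would typically automate this step.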
Distributed Scorer
Our application submits the jobs… We can check the status of our jobs… If we have more jobs, we can check the status of our resources and, again, the status of our jobs…
Now we have to check which results we achieved with this Distributed Scorer. What is better?
New Results Achieved
(1) We have excellent load balancing and efficient resource utilization.
(2) We can see how it is possible to increase the number of computations in a period of time (using high-throughput computing). It works even better with a larger sample of companies (more scalability).
[Chart 1: series 1 PC, 50 PCs and 500 PCs over 10 hours, 50 hours and 500 hours; axis from 0 to 7,000,000.]
[Chart 2: seconds for 56,000, 250,000 and 500,000 companies on 1 PC, 50 PCs and 500 PCs; axis from 0 to 3,000,000; annotation: Marginal Gain.]
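An upper bound on the gain can be sketched by assuming perfect load balancing and no scheduling or transfer overhead, so that every machine analyses an equal share of the companies. The function below is such an idealized estimate, not the measured result; it reuses the 5-second average reported for the centralized scorer, and its name is illustrative.

```python
def ideal_parallel_seconds(companies: int, machines: int,
                           seconds_per_company: int = 5) -> int:
    """Wall-clock seconds if the companies are split evenly and overhead is ignored."""
    if machines < 1:
        raise ValueError("machines must be at least 1")
    per_machine = -(-companies // machines)  # ceiling division
    return per_machine * seconds_per_company
```

Under these assumptions 56,000 companies take 280,000 seconds on 1 PC and 5,600 seconds on 50 PCs. Real runs on an opportunistic pool will be slower, because machines are not always available.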
New Results Achieved
We can think of using Internet users’ machines when they are idle, or we can use the companies’ machines, since the companies can use our Web-Spider for direct marketing. Make profit with your idle CPU cycles!
[Chart: energy cost in euros (axis from 0 € to 1,200,000 €) for 1 machine, 50 machines and 500 machines. Series: "Energy cost for the Company (every year) (with owner machines)" and "How does the energy cost increase in a year? (using users' machines)".]
You can save a lot of money and a lot of energy!
From “local” business case to the big business case…
The Googleplex is the corporate headquarters complex of Google, Inc., located at 1600 Amphitheatre Parkway in Mountain View, Santa Clara County, California, near San Jose. Google purchased some of Silicon Graphics' properties, including the Googleplex, for $319 million. In late 2006 and early 2007 the company installed a series of solar panels, capable of producing 1.6 megawatts of electricity. At the time, it was believed to be the largest corporate installation in the United States. About 30 percent of the Googleplex's electricity needs will be fulfilled by this project, with the remainder being purchased.