User Inspired Management of Scientific Jobs in Grids and Clouds

This is my PhD defense presentation discussing my work on improving scientific job execution in grids and clouds. It describes how user patterns can be used to learn user behavior and improve meta-scheduler decisions. The resource abstraction layer proposed and implemented helps scientists interact with a wide variety of compute resources.



  1. User Inspired Management of Scientific Jobs in Grids and Clouds
     Eran Chinthaka Withana
     School of Informatics and Computing
     Indiana University, Bloomington, Indiana, USA
     Doctoral Committee:
     Professor Beth Plale, PhD
     Dr. Dennis Gannon, PhD
     Professor Geoffrey Fox, PhD
     Professor David Leake, PhD
  2. Outline
     Mid-Range Science
       Challenges and Opportunities
       Current Landscape
     Research
       Research Questions
       Contributions
     Mining Historical Information to Find Patterns and Experiences
       Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
       Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
     Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
     Applications
     Related Work
     Conclusion and Future Work
     Thesis Defense - Eran Chinthaka Withana
  3. Outline (repeat of slide 2)
  4. Mid-Range Science
     Challenges
       Resource requirements go beyond the lab and university, but are not suited for large-scale resources
       Difficulty finding sufficient compute resources
         E.g.: short-term forecasts in LEAD for energy and agriculture
       Lack of resources to have a strong CS support person on the team
       Need for less expensive, more available resources
     Opportunities
       Wide variety of computational resources
       Science gateways
  5. Current Landscape
     Grid Computing
       Batch orientation, long queues even under moderate loads, no access transparency
       Drawbacks in the quota system
       Level of computer science expertise required
     Cloud Computing
       High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
       Payment policy and research cost models
     Use of Workflow Systems
       Hybrid workflows enable utilization of heterogeneous compute resources
         E.g.: Vortex2 experiment
     Need for resource abstraction layers and optimal selection of resources
     Need to improve scientific job executions
       Better scheduler decisions, selection of compute resources
       Reliability issues in compute resources
       Importance of learning user patterns and experiences
     [1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
  6. Outline (repeat of slide 2)
  7. Research Questions
     "Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
     "Can a simple, reliable, and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
     "Can these be put to use to advance science?"
  8. Contributions
     Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
     Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
     Propose and demonstrate the effectiveness and applicability of a lightweight, reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
     Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
  9. Outline (repeat of slide 2)
  10. Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
      Objective
        Reduce the impact of startup overheads for time-critical applications
      Problem space
        Workflows can have multiple paths, and workflow descriptions are not always available
        Need predictions to identify the job execution sequence
        Learn from user behavioral patterns to predict future jobs
      Research outline
        Algorithm to predict future jobs by extracting user patterns from historical information
        Use of knowledge-based techniques
        Starts from zero knowledge, or from pre-populated job information capturing connections between jobs
        Similar retrieved cases are used to predict future jobs, reducing high startup overheads
      Algorithm assessment
        Two workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
  11. Demonstration of User Patterns with Workflows
      The suite of workflows can differ from domain to domain
        E.g., WRF (Weather Research and Forecasting) as an upstream node feeding weather predictions, crop predictions, wind farm location evaluations, and wildfire propagation simulations
      User patterns reveal the sequence of jobs, taking different users/domains into consideration
      Useful for a science gateway serving a wide range of mid-scale scientists
  12. Role of Successful Predictions to Reduce Startup Overheads
      Let r = probability of a successful prediction (prediction accuracy), s = setup time, t = execution time
      Percentage time reduction = r * s / (t + s)
      The largest gain is achieved when prediction accuracy is high and the setup time s is large with respect to the execution time t
      For simplicity, assuming equal job execution and startup times (t = s):
      Percentage time reduction = r / 2
  13. Relationship of Predictions to Execution Time
      Accuracy of predictions = total successful future-job predictions / total predictions
      Percentage time reduction = r * s / (t + s)
      Observations
        Percentage time reduction increases with the accuracy of predictions
        Time reduction decreases exponentially with an increased work-to-overhead ratio (t/s)
        Need to find the critical point for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy
      Cost of wrong predictions
        Depends on the compute resource
        Demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions
        Compromising cost to improve time
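The slides' time-reduction relationship can be sketched numerically. This is a minimal illustration (not code from the thesis), taking the percentage time reduction as r * s / (t + s):

```python
# Illustrative sketch: expected fraction of total turnaround time (t + s)
# saved by provisioning resources ahead of a predicted job.
# r = prediction accuracy, s = setup (startup) time, t = execution time;
# a correct prediction hides the setup time s.
def percentage_time_reduction(r: float, t: float, s: float) -> float:
    """Expected fraction of (t + s) saved, given prediction accuracy r."""
    return r * s / (t + s)

# With equal execution and startup times (t = s), the reduction is r / 2.
print(percentage_time_reduction(0.9, 60.0, 60.0))   # t = s: 0.45
# Savings shrink as the work-to-overhead ratio t/s grows.
print(percentage_time_reduction(0.9, 600.0, 60.0))
```

This also shows why the slides emphasize finding the critical point: for a large t/s ratio, even perfect predictions save only a small fraction of total time.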
  14. Prediction Engine: System Architecture
      (Architecture diagram; components include the prediction retriever)
  15. Use of Reasoning
      Store and retrieve cases
      Steps
        Retrieval of similar cases
          Similarity measurement
          Use of thresholds
        Reuse of old cases
          Case adaptation
        Storage
  16. Case Similarity Calculation
      Each case is represented by a set of attributes
      Attributes are selected by finding their effect on the goal variable (the next job)
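A weighted attribute-match measure of the kind the slide describes can be sketched as follows; the attribute names and weights here are hypothetical illustrations, not values from the thesis implementation:

```python
# Hypothetical sketch of a weighted case-similarity measure: the weighted
# fraction of attributes on which two cases agree.
def case_similarity(case_a: dict, case_b: dict, weights: dict) -> float:
    """Weighted fraction of matching attributes between two cases."""
    total = sum(weights.values())
    score = sum(w for attr, w in weights.items()
                if case_a.get(attr) == case_b.get(attr))
    return score / total

# Attributes weighted by their (assumed) effect on the goal variable,
# i.e., the next job to be submitted.
weights = {"user": 0.4, "previous_job": 0.35, "time_of_day": 0.25}
new_case = {"user": "alice", "previous_job": "WRF", "time_of_day": "morning"}
stored = {"user": "alice", "previous_job": "WRF", "time_of_day": "evening"}
print(case_similarity(new_case, stored, weights))
```

A retrieval threshold (slide 15) would then discard stored cases whose similarity to the new case falls below some cutoff.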
  17. Evaluation
      Use cases
        Individual-job workload[1]
          40k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab
        Workflow use case
          The system doesn't see or assume a workflow specification
      Experimental setup
        2.0 GHz dual-core processor, 4 GB memory, 64-bit Windows operating system
      [1] Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
  18. Evaluation: Average Accuracy of Predictions
      Individual-jobs workload
        ~75% accurate predictions with user patterns
        ~32% accurate predictions with service names
      Workflow workload
        ~95% accurate predictions with user patterns
        ~53% accurate predictions with service names
  19. Evaluation: Time Saved
      Amount of time that can be saved if resources are provisioned by the time a job is ready to run
      Startup time assumed to be 3 minutes (average for commercial providers)
      (Charts: individual-jobs workload and workflow workload)
  20. Evaluation: Prediction Accuracies for Use Cases
      User-pattern-based predictions perform 2x better than service-name-based predictions
  21. Outline (repeat of slide 2)
  22. User-Perceived Reliability
      Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]
      What matters from a user's point of view is whether these failures are visible to users
        E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
      This reliability is not a property of the resources themselves
        Not derived from halting failures, fail-stop failures, network partitioning failures [Birman05], or machine downtimes
        It is a more broadly encompassing system reliability that can only be seen at the user or workflow level
        Can depend on the user's configuration and job types as well
      We refer to this form of reliability as user-perceived reliability
      Importance of user-perceived reliability
        Selecting a resource to schedule an experiment when the user has access to multiple compute resources
        E.g., LEAD reliability: supercomputing resources vs. Windows Azure resources
  23. Why User-Perceived Reliability Is Useful
      User-perceived failure probabilities: cluster A, p(A) = 0.2; cluster B, p(B) = 0.3
      p(A fails, B succeeds) = p(A) * (1 - p(B)) = 0.2 * (1 - 0.3) = 0.14
      p(B fails, A succeeds) = p(B) * (1 - p(A)) = 0.3 * (1 - 0.2) = 0.24
      Since p(A fails, B succeeds) < p(B fails, A succeeds), try cluster A first and then cluster B.
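The slide's ordering rule can be sketched as a few lines of Python; this is an illustration of the selection logic only (the function name and assumed independence of failures are mine, not the thesis API):

```python
# Minimal sketch of the resource-ordering rule: given user-perceived
# failure probabilities for two clusters, try first the cluster for
# which "chosen fails but fallback succeeds" is least likely,
# assuming independent failures.
def first_choice(p_fail: dict) -> list:
    """Order two clusters by p(chosen fails, fallback succeeds)."""
    (a, pa), (b, pb) = p_fail.items()
    # p(a fails, b succeeds) vs. p(b fails, a succeeds)
    return [a, b] if pa * (1 - pb) <= pb * (1 - pa) else [b, a]

print(first_choice({"A": 0.2, "B": 0.3}))  # ['A', 'B']
```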
  24. Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
      Objective
        Reduce the impact of low reliability of compute resources
        Deduce user-perceived reliabilities by learning from user experiences and perceptions
      Research outline
        Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
        Use of machine learning techniques: trained classifiers represent compute resources and their reliabilities
        Prediction of job failures
      Algorithm assessment
        Workloads from the Parallel Workloads Archive representing jobs executed on two different supercomputing clusters
  25. System Architecture
      A machine learning classifier is trained to learn the user-perceived reliability of each cluster
      Classifier types
        Static classifier: trained initially from historical information
        Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
  26. System Architecture
      Classifier manager uses the Weka [Hall09] framework
        Classification methods: Naïve Bayes and KStar, in static and dynamic variants
        Dynamic pruning of features [Fadishei09] for increased efficiency
      Classifier manager
        Creates and maintains classifiers for each compute resource
        A new job is evaluated against these classifiers to deduce the predicted reliability of its execution
      Policy implementers
        Consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
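The thesis uses Weka's Naïve Bayes and KStar implementations; as a language-neutral illustration of the idea of an updateable per-cluster reliability classifier, here is a toy categorical Naïve Bayes failure predictor (the attributes and class are mine, not the thesis code):

```python
from collections import Counter, defaultdict

# Toy categorical Naive Bayes sketch of a per-cluster reliability
# classifier that learns incrementally from finished jobs.
class ReliabilityClassifier:
    def __init__(self):
        self.label_counts = Counter()
        self.attr_counts = defaultdict(Counter)  # (attr, value) -> per-label counts

    def update(self, job: dict, outcome: str) -> None:
        """Incrementally learn from one finished job ('ok' or 'failed')."""
        self.label_counts[outcome] += 1
        for attr, value in job.items():
            self.attr_counts[(attr, value)][outcome] += 1

    def p_failure(self, job: dict) -> float:
        """Posterior probability that the job fails on this cluster."""
        scores = {}
        total = sum(self.label_counts.values())
        for label, count in self.label_counts.items():
            p = count / total
            for attr, value in job.items():
                # Laplace smoothing over the two outcome classes
                p *= (self.attr_counts[(attr, value)][label] + 1) / (count + 2)
            scores[label] = p
        return scores.get("failed", 0.0) / sum(scores.values())

clf = ReliabilityClassifier()
clf.update({"queue": "batch", "nodes": "large"}, "failed")
clf.update({"queue": "batch", "nodes": "small"}, "ok")
clf.update({"queue": "debug", "nodes": "small"}, "ok")
print(clf.p_failure({"queue": "batch", "nodes": "large"}) > 0.5)  # True
```

A policy implementer would combine such a per-resource failure probability with time and cost estimates when ranking resources.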
  27. Evaluation
      Workloads from the Parallel Workloads Archive [Feitelson]
        LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
        LPC: ten months' (Aug 2004 to May 2005) worth of job records on a 70-Xeon-node cluster at the "Laboratoire de Physique Corpusculaire" of Université Blaise Pascal, France
      Minor cleanups to remove intermediate job states
      10,000 jobs were selected from each workload
        LANL had 20% failed jobs; LPC had 30% failed jobs
  28. Evaluation
      Workload classification and maintenance
        Classifiers: Naïve Bayes [John95] and KStar [Cleary95] implementations in Weka [Hall09]
      Classifier construction
        Static classifier: the first 1,000 jobs train the classifier
        Dynamic classifier: all 10,000 jobs used for classifier construction and evaluation
      Evaluation metrics
        Average reliability prediction accuracy: accuracy of predicting the success/failure of a job
        Time saved: cumulative execution time of failed jobs whose failure our system predicted successfully
          Baseline: ideal cumulative time that could be saved over time
        Time consumed for classification and updating the classifier
        Effect of pruning attributes
          Static subset of attributes (as proposed in Fadishei et al. [Fadishei09]) vs. a dynamic subset (checking the effect on the goal variable)
  29. Evaluation
      Effect of job reliability predictions on selecting compute resources
        An extended version of GridSim [Buyya02] models four compute resources
        NWS [Wolski99] for bandwidth estimation and QBETS [Nurmi07] for queue wait time estimation
        Total execution time = data movement time + queue wait time + job execution time (from the workload)
      Schedulers
        Total-execution-time priority scheduler
        Reliability-prediction-based time priority scheduler
      Metrics
        Average accuracy of selecting reliable resources to execute jobs
        Time wasted due to incorrect selection of compute resources to execute jobs
      All evaluations were run on a 3.0 GHz dual-core processor with 4 GB memory, on the Windows 7 Professional operating system.
  30. Evaluation Metrics Summary (summary table)
  31. Results: Average Reliability Prediction Accuracy
      Static and dynamic/updateable classifiers, LANL and LPC workloads
      LANL accuracy saturates at ~82%; LPC accuracy saturates at ~97%
      KStar performed slightly better than Naïve Bayes
  32. Results: Time Savings
      Static and dynamic/updateable classifiers, LANL and LPC workloads
      With the static classifier, KStar saved 90-100%
      Updateable classifier
        For LANL, both KStar and Naïve Bayes saved ~50%
        For LPC, ~90% saving
  33. Results: Time Consumed for Classification and Updating the Classifier
      Static and updateable classifiers
      Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
  34. Results: Effect of Pruning Attributes
      A static subset of attributes [Fadishei09] performs poorly on this data set and classifier
      Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
      Conclusion: our classifiers handle noise features well without compromising classification accuracy
      Identifying attributes to prune is a dynamic and expensive task, so the system can be used in practical cases even without pruning attributes
  35. Results: Effect of Job Reliability Predictions on Selecting Compute Resources
      Poor performance of the execution-time priority scheduler
      After 1,000 jobs (training), the time wasted with our approach stays fairly constant
  36. Evaluation Conclusions
      Even though the average prediction accuracy of the KStar classifier decreased in the static setting, it learned to predict failures better than any other method.
      Even though the amount of time saved increased slightly with the updateable Naïve Bayes classifier, the amount of time saved using the static KStar classifier is higher than with both other methods.
      Even though its total prediction accuracy does not match the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.
      Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions.
  37. Outline (repeat of slide 2)
  38. Scientific Computing Resource Abstraction Layer
      Variety of scientific computing platforms and opportunities
      Requirements
        Support existing job description languages and be extensible to support other languages
        Provide a uniform, interoperable interface for external entities
        Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
        Extensibility to support new and future resource managers with minimal changes
        Provide monitoring and fault recovery, especially when working with utility computing resources
        Provide a lightweight, robust, and scalable infrastructure
        Integrate with a variety of workflow environments
  39. Scientific Computing Resource Abstraction Layer
      Our contribution: a resource abstraction layer (Sigiri)
        Implemented as a web service
        Provides a uniform abstraction layer over heterogeneous compute resources, including grids, clouds, and local departmental clusters
        Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL) [Anjomshoaa04] and the Globus Resource Specification Language (RSL)
        Interacts directly with resource managers, so it requires no grid or meta-scheduling middleware
        Integrates with current resource managers, including LoadLeveler, PBS, LSF, and Windows HPC, and with the Amazon EC2 and Microsoft Azure platforms
      Features
        Does not require a high level of computer science knowledge to install and maintain
          Use of Globus was a challenge for most non-computational scientists
          Involvement of system administrators in installing and maintaining Sigiri is minimal
        Minimal memory footprint
          Other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on a small cluster is something scientists never wanted to do)
        Better fault tolerance and failure recovery
  40. Architecture
      Asynchronous messaging model of message publishers and consumers
      Daemons shadowing compute resources
      Distributed component deployment: daemon, front-end web service, and job queue
  41. Client Interaction Service
      Deployed as an Apache Axis2 web service to enable interoperability
      Accepts job requests and enables management and monitoring functions
      The job submission schema does not enforce a schema for the job description, enabling multiple job description languages
  42. Client Interaction Service
      (Examples: job submission request and job submission response messages)
  43. Daemons
      Each managed compute resource has a lightweight daemon that
        periodically checks the job request queue
        translates the job specification to a resource-manager-specific language
        submits pending jobs and persists the correlation between the resource manager's job ID and the internal ID
      Extensible daemon API
        Enables integration of a wide range of resource managers while keeping their complexities transparent to end users
      The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
      Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
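One polling cycle of such a daemon can be sketched as follows; all class and method names here are illustrative stand-ins (with in-memory fakes for the queue and resource manager), not Sigiri's actual API:

```python
# Hypothetical sketch of one daemon polling cycle: fetch pending jobs,
# translate each spec, submit, and persist the ID correlation.
class InMemoryQueue:
    def __init__(self, jobs):
        self.jobs = list(jobs)          # (internal_id, abstract_spec) pairs
    def fetch_pending(self):
        pending, self.jobs = self.jobs, []
        return pending

class FakeResourceManager:
    """Stands in for PBS/LSF/SLURM; assigns sequential native job IDs."""
    def __init__(self):
        self.next_id = 100
    def translate(self, spec):
        # A real daemon would emit a PBS/LSF/SLURM job script here.
        return f"#NATIVE {spec}"
    def submit(self, native_spec):
        self.next_id += 1
        return self.next_id

def poll_once(queue, rm, id_map):
    """One daemon cycle; the internal-ID -> resource-manager-ID mapping
    lets later status queries and cancels be routed by internal ID."""
    for internal_id, spec in queue.fetch_pending():
        rm_id = rm.submit(rm.translate(spec))
        id_map[internal_id] = rm_id
    return id_map

id_map = poll_once(InMemoryQueue([("job-1", "run WRF"), ("job-2", "run WPS")]),
                   FakeResourceManager(), {})
print(id_map)  # {'job-1': 101, 'job-2': 102}
```

A real daemon would run this cycle in a loop with a sleep interval, which is what lets it live on any platform the resource manager's client tools run on.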
  44. Integration of Cloud Computing Resources
      A unique set of dynamically loaded and configured extensions handles security, schedules jobs, and performs required data movements
      Enables scientists to interact with multiple cloud providers within the same system
      Features
        Extensions can be written as modules independent of other extensions, typically to carry out a single task
        Enforced failure handling prevents orphan VMs and resources
  45. Security
      Client security
        Between the client and the web service layer
        Supports both transport-level security (SSL) and application-layer security (WS-Security)
        Clients negotiate security credentials via the WS-Security policy support within Apache Axis2
      Compute resource security
        The system can store different types of security credentials: username/password combinations, X.509 credentials
  46. Performance Evaluation
      Test scenarios
        Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
          Each client waits for all jobs to finish before submitting the next set of jobs
          For example, in the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel
        Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
          The client does not block upon submission of a job
        Failure rate and server performance, from the clients' point of view, are measured, and the number of simultaneous clients is systematically increased
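The Case 1 load pattern can be sketched with a simple threaded driver; this is an illustration of the test shape only (the thesis harness ran real clients on a cluster, and `fake_submit` here is a stand-in for a real submission call):

```python
import random
import threading
import time

# Illustrative sketch of the Case 1 load pattern: N clients each submit
# one job concurrently, and the driver waits for the whole burst to
# finish (a barrier) before any next round would start.
def run_burst(num_clients, submit_job):
    latencies = [0.0] * num_clients
    def client(i):
        start = time.perf_counter()
        submit_job(i)                       # one job per client, in parallel
        latencies[i] = time.perf_counter() - start
    threads = [threading.Thread(target=client, args=(i,))
               for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # barrier before the next burst
    return latencies

# Stand-in for a real submission call to the abstraction service.
fake_submit = lambda i: time.sleep(random.uniform(0.001, 0.005))
lat = run_burst(100, fake_submit)
print(len(lat), all(l > 0 for l in lat))  # 100 True
```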
  47. Performance Evaluation: Baseline Measurements
  48. Performance Evaluation: Metrics
  49. Performance Evaluation: Scalability Metrics
  50. Performance Evaluation
      Experimental setup
        Daemon hosted on the gatekeeper node (quad-core 1.6 GHz IBM PowerPC with 8 GB of physical memory) of the Big Red cluster
        System web service and database co-hosted on a box with four 2.6 GHz dual-core processors and 32 GB of RAM
        Neither node was dedicated to our experiment while tests were running
      Client environment
        Set up within the 128-node Odin cluster (each node a dual 2.0 GHz AMD Opteron with 4 GB physical memory)
        All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate external overhead
      Data collection
        Each test was run (number of clients * 10) times and the results averaged
        Each parameter was tested for 100 to 1,000 concurrent clients
        A total of 110,000 tests were run
        GRAM4 results from the GRAM4 evaluation paper [Marru08] were used for system performance comparison
  51. Results: Baseline Measurements (Cases 1 and 2)
      All overheads scale proportionally to the number of clients; no failures
  52. Results: Metrics for Test Cases 1 and 2
      Both response time and total overhead scale proportionally to the number of clients; no failures
  53. Results: Scalability Metrics (Cases 1 and 2)
      No failures with Sigiri; failures starting from 300 clients for GRAM
  54. Outline (repeat of slide 2)
  55. Applications: LEAD
      Motivations
        Grid middleware reliability and scalability study [Marru08] and workflow failure rates
        Components of the LEAD infrastructure were considered for adaptation to other scientific environments
        Sigiri was initially prototyped to support LoadLeveler, PBS, and LSF
      Implications
        Improved workflow success rates
        Mitigated the need for Globus middleware
        Ability to work with non-standard job managers
  56. Applications: LEAD II
      Emergence of community-driven, production-quality workflow infrastructures
        E.g., the Trident Scientific Workflow Workbench, built on Windows Workflow Foundation
      Possibility of using alternate supercomputing resources
        E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
      Support for Windows-based scientific computing environments
  57. Background: LEAD II and the Vortex2 Experiment
      May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
      A workflow started on the hour, every hour, each morning
      It had to find and bind to the latest model data (i.e., RUC 13 km and ADAS data) to set initial and boundary conditions
      If model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
      Execution of the complete WRF stack within 1 hour
  58. Trident Vortex2 Workflow
      The bulk of the time (50 min) is spent in the LEAD Workflow Proxy Activity, which is the point of Sigiri integration
  59. Applications: Enabling Geo-Science Applications on Windows Azure
      Geo-science applications have high resource requirements
        Compute intensive, requiring dedicated HPC hardware
          E.g., the Weather Research and Forecasting (WRF) model
        Emergence of ensemble applications: large numbers of small jobs
          E.g., examining each air layer over a long period of time
          A single experiment = about 14,000 jobs, each taking a few minutes to complete
  60. Geo-Science Applications: Opportunities
      Cloud computing resources
        On-demand access to "unlimited" resources
        Flexibility: worker roles and VM roles
      Recent porting of geo-science applications
        WRF and the WRF Preprocessing System (WPS) ported to Windows
        Increased use of ensemble applications (large numbers of small runs)
      Production-quality, open-source scientific workflow systems
        Microsoft Trident
  61. Research Vision
      Enabling geo-science experiments
        Types of applications: compute intensive, ensembles
        Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
      Utilizing both cloud computing and grid computing resources
      Utilizing open-source, production-quality scientific workflow environments
      Improved data and metadata management
      (Diagram: geo-science applications, scientific workflows, compute resources)
  62. 62. Proposed Framework<br />Thesis Defense - Eran Chinthaka Withana<br />62<br />Azure Blob Store<br />Azure <br />Management<br />API<br />Sigiri<br />Job Mgmt.Daemons<br />Azure Fabric<br />Web Service<br />Trident<br />Activity<br />Job Queue<br />Azure Custom <br />VM Images<br />VM Instance<br />IIS<br />WRF<br />Sigiri Worker<br />Service<br />MSMPI<br />Windows 2008R2<br />
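The flow through the proposed framework, a Trident activity submitting jobs via the Web service into a job queue, with the Sigiri worker service inside the Azure VM instance draining that queue and running WRF, can be sketched as a producer/consumer loop. This is a minimal sketch under stated assumptions: the in-memory `deque` stands in for the Azure-hosted job queue, and `run_job` stands in for the Sigiri worker launching WRF under MS-MPI; none of these names come from the actual implementation.

```python
from collections import deque

def submit(queue, job):
    """Trident-activity side: enqueue a job description via the Web service."""
    queue.append(job)

def run_job(job):
    # Stand-in for the Sigiri worker service launching WRF under MS-MPI on the VM instance.
    return {"job": job["name"], "state": "DONE"}

def worker_loop(queue):
    """Sigiri worker side: drain the job queue, executing each job in submission order."""
    results = []
    while queue:
        results.append(run_job(queue.popleft()))
    return results

# Two WRF runs submitted by the workflow activity, then drained by a worker.
q = deque()
submit(q, {"name": "wrf-run-1"})
submit(q, {"name": "wrf-run-2"})
results = worker_loop(q)
```

Decoupling submission from execution through the queue is what lets the Azure worker instances scale independently of the Trident workflow host.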
  63. 63. Applications: PRAGMA Testbed Support<br />Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]<br />an open international organization founded in 2002 to focus on practical issues of building international scientific collaborations<br />In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed. <br />Sigiri was used within the IU PRAGMA testbed<br />The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort.<br />The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to its interfaces.<br />In 2011, the PRAGMA-Opal-Sigiri integration was demonstrated successfully<br />Thesis Defense - Eran Chinthaka Withana<br />63<br />
  64. 64. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />64<br />
  65. 65. Related Work<br />Scientific Job Management Systems<br />Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07]<br />provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems<br />CARMEN[Watson81] project <br />provided a cloud environment that has enabled collaboration between neuroscientists<br />requires all programs to be packaged as WS-I[Ballinger04]-compliant Web services<br />Condor[Frey02] pools can also be utilized to unify certain compute resource interactions<br />uses the Globus toolkit[Foster05] (and GRAM underneath) <br />Poor failure recovery <br />overlooks failure modes of a cloud platform<br />Thesis Defense - Eran Chinthaka Withana<br />65<br />
  66. 66. Related Work<br />Scientific Research and Cloud Computing<br />IaaS, PaaS and SaaS environment evaluations<br />Scientists have mainly evaluated the use of IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08][Yu05]<br />Ease of setting up custom environments and control<br />Growing interest in using PaaS services[Humphrey10][Lu10][Qiu09]<br />Optimization to balance cost and time of executions[Deelman08][Yu05]<br />Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07] <br />Job Prediction Algorithms<br />Prediction of<br />Execution times[Smith], job start times[Li04], queue-wait times[Nurmi07] and resource requirements[Julian04]<br />AI-based and statistical-modeling-based approaches<br />AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance<br />Reliability of Compute Resources<br />Birman[Birman05] discusses aspects of resources that cause system reliability issues<br />Statistical modeling to predict failures[Kandaswamy08]<br />Thesis Defense - Eran Chinthaka Withana<br />66<br />
  67. 67. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />67<br />
  68. 68. Conclusion<br />User inspired management of scientific jobs<br />Concentrates on identification of user patterns and perceptions<br />Harnesses historical information<br />Applies knowledge gained to improve scientific job executions<br />Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimations of resource requirements<br />Evaluations demonstrate the usability of predictions for a meta-scheduler, especially one integrated into community gateways, to improve its scheduling decisions.<br />Resource abstraction service<br />Helps mid-range scientists obtain access to resources that are cheap and available<br />Strives to do so with a tool that is easy to set up and administer<br />The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications<br />Applications demonstrate how our research contributed to advancing science in the respective domains.<br />Thesis Defense - Eran Chinthaka Withana<br />68<br />
  69. 69. Contributions<br />Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.<br />Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.<br />Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.<br />Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, with integration into four different application domains to prove its usability.<br />Thesis Defense - Eran Chinthaka Withana<br />69<br />
  70. 70. Future Work<br />Short-Term Research Directions<br />Integration of future job predictions and user-perceived reliability predictions<br />Evolving the resource abstraction service to support more compute resources<br />Management of ensemble runs<br />Fault tolerance with proactive replication<br />Long-Term Research Directions<br />Thesis Defense - Eran Chinthaka Withana<br />70<br />
  71. 71. Thank You !!<br />Thesis Defense - Eran Chinthaka Withana<br />71<br />
