
User Inspired Management of Scientific Jobs in Grids and Clouds


This is my PhD defense presentation, covering my work on improving scientific job execution in Grids and Clouds. It discusses how user patterns can be mined to learn user behavior and improve meta-scheduler decisions. The resource abstraction layer proposed and implemented helps scientists interact with a wide variety of compute resources.


  1. 1. User Inspired Management of Scientific Jobs in Grids and Clouds<br />Eran Chinthaka Withana<br />School of Informatics and Computing<br />Indiana University, Bloomington, Indiana, USA<br />Doctoral Committee<br />Professor Beth Plale, PhD<br />Dr. Dennis Gannon, PhD<br />Professor Geoffrey Fox, PhD<br />Professor David Leake, PhD<br />
  2. 2. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />2<br />
  3. 3. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />3<br />
  4. 4. Mid-Range Science<br />Challenges<br />Resource requirements going beyond lab and university, but not suited for large-scale resources<br />Difficulties finding sufficient compute resources<br />E.g.: short term forecast in LEAD for energy and agriculture<br />Lacking resources to have strong CS support person on team<br />Need for less-expensive and more-available resources<br />Opportunities <br />Wide variety of computational resources<br />Science gateways<br />Thesis Defense - EranChinthakaWithana<br />4<br />
  5. 5. Current Landscape<br />Grid Computing<br />Batch orientation, long queues even under moderate loads, no access transparency<br />Drawbacks in quota system<br />Levels of computer science expertise required<br />Cloud Computing<br />High availability, pay-as-you-go model, on-demand limitless1 resource allocation<br />Payment policy and research cost models<br />Use of Workflow Systems<br />Hybrid workflows<br />Enables utilization of heterogeneous compute resources<br />E.g.: Vortex2 Experiment<br />Need for resource abstraction layers and optimal selection of resources<br />Need for improvement of scientific job executions<br />Better scheduler decisions, selection of compute resources<br />Reliability issues in compute resources<br />Importance of learning user patterns and experiences <br />Thesis Defense - Eran Chinthaka Withana<br />5<br />1M. Armbrust et al. Above the clouds: A Berkeley view of cloud computing Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley., 2009.<br />
  6. 6. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />6<br />
  7. 7. Research Questions<br />“Can user patterns and experiences be used to improve scientific job executions in large scale systems?”<br />“Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?”<br />“Can these be put to use to advance science?”<br />Thesis Defense - Eran Chinthaka Withana<br />7<br />
  8. 8. Contributions<br />Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.<br />Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.<br />Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.<br />Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.<br />Thesis Defense - Eran Chinthaka Withana<br />8<br />
  9. 9. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />9<br />
  10. 10. Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds<br />Objective<br />Reducing the impact of startup overheads for time-critical applications<br />Problem space<br />Workflows can have multiple paths<br />Workflow descriptions not available<br />Need for predictions to identify job execution sequence<br />Learning from user behavioral patterns to predict future jobs<br />Research outline<br />Algorithm to predict future jobs by extracting user patterns from historical information<br />Use of knowledge-based techniques<br />Zero knowledge or pre-populated job information consisting of connection between jobs<br />Similar cases retrieved are used to predict future jobs, reducing high startup overheads<br />Algorithm assessment <br />Two different workloads representing individual scientific jobs executed in LANL and set of workflows executed by three users<br />10<br />Thesis Defense - Eran Chinthaka Withana<br />
  11. 11. Demonstration of User Patterns with Workflows<br />Suite of workflows can differ from domain to domain<br />E.g. WRF (Weather Research and Forecasting) as upstream node<br />User patterns reveal sequence of jobs taking different users/domains into consideration<br />Useful for a science gateway serving a wide range of mid-range scientists<br />11<br />Weather Predictions<br />Crop Predictions<br />WRF<br />Wind Farm Location Evaluations<br />Wild Fire Propagation Simulation<br />Thesis Defense - Eran Chinthaka Withana<br />
  12. 12. Role of Successful Predictions to Reduce Startup Overheads<br />Largest gain can be achieved when our prediction accuracy is high and setup time (s) is large with respect to execution time (t)<br />r = probability of successful prediction (prediction accuracy)<br />Percentage time reduction = 100 · r · s / (s + t)<br />For simplicity, assuming equal job exec and startup times (s = t)<br />Percentage time reduction = 50 · r<br />12<br />Thesis Defense - Eran Chinthaka Withana<br />
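The time-reduction relationship on this slide can be sketched as a small helper; the function name and the closed form (saving the startup time s with probability r out of a total s + t) are my own rendering of the slide's argument, not the thesis code.

```python
def percent_time_reduction(r, s, t):
    """Expected percentage of total job turnaround saved when a
    correct prediction (probability r) lets us hide the startup
    overhead s of a job whose execution time is t."""
    return 100.0 * r * s / (s + t)

# With equal startup and execution times (s == t) this collapses
# to 50 * r, so 90% prediction accuracy saves 45% of total time.
```

As t/s grows, the factor s / (s + t) shrinks, which is the slide's point that the gain falls off with a higher work-to-overhead ratio.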
  13. 13. Relationship of Predictions to Execution Time<br />Observations<br />Percentage time reduction increases with accuracy of predictions<br />Time reduction falls off sharply with increased work-to-overhead (t/s) ratio<br />Need to find critical point for a given situation<br />Fixing required percentage time reduction for a given t/s ratio and finding required accuracy of predictions<br />Cost of wrong predictions<br />Depends on compute resource<br />Demonstrated higher prediction accuracies (~90%) will reduce impact of wrong predictions<br />Compromising cost to improve time<br />13<br />Accuracy of Predictions = total successful future job predictions / total predictions<br />Thesis Defense - Eran Chinthaka Withana<br />
  14. 14. Prediction Engine: System Architecture<br />Prediction<br />Retriever<br />14<br />Thesis Defense - Eran Chinthaka Withana<br />
  15. 15. Use of Reasoning<br />Store and retrieve cases<br />Steps<br />Retrieval of similar cases<br />Similarity measurement<br />Use of thresholds<br />Reuse of old cases<br />Case adaptation<br />Storage<br />15<br />Thesis Defense - Eran Chinthaka Withana<br />
  16. 16. Case Similarity Calculation<br />Each case represented by set of attributes<br />Selected by finding effect on goal variable (next job)<br />16<br />Thesis Defense - Eran Chinthaka Withana<br />
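A minimal sketch of the case similarity calculation, assuming cases are dictionaries of categorical attributes with per-attribute weights chosen by their effect on the goal variable; the attribute names and the exact-match scheme are illustrative, not the thesis implementation.

```python
def case_similarity(case_a, case_b, weights):
    """Weighted fraction of matching attributes between two cases."""
    matched = sum(w for attr, w in weights.items()
                  if case_a.get(attr) == case_b.get(attr))
    return matched / sum(weights.values())

# A stored case and a new job sharing user and service but not
# submission hour:
stored = {"user": "alice", "service": "WRF", "hour": 6}
query  = {"user": "alice", "service": "WRF", "hour": 9}
weights = {"user": 0.5, "service": 0.3, "hour": 0.2}
# case_similarity(stored, query, weights) scores 0.8; a retrieval
# threshold then decides whether the case is similar enough to reuse.
```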
  17. 17. Evaluation<br />Use cases<br />Individual job workload1<br />40k jobs over two years from 1024-node CM-5 at Los Alamos National Lab<br />Workflow use case<br />System doesn’t see or assume workflow specification<br />Experimental setup<br />2.0GHz dual-core processor with 4GB memory, on a 64-bit Windows operating system<br />1: Parallel Workload Archive http://www.cs.huji.ac.il/labs/parallel/workload/ <br />17<br />Thesis Defense - Eran Chinthaka Withana<br />
  18. 18. Evaluation: Average Accuracy of Predictions<br />Individual Jobs Workload<br />~ 75% accurate predictions with user patterns <br />~ 32% accurate predictions with service names<br />18<br />Thesis Defense - Eran Chinthaka Withana<br />Workflow Workload<br />~ 95% accurate predictions with user patterns <br />~ 53% accurate predictions with service names<br />
  19. 19. Evaluation: Time Saved<br />Amount of time that can be saved, if resources are provisioned, when job is ready to run<br />Startup time<br />Assumed to be 3mins (average for commercial providers)<br />19<br />Individual Jobs Workload<br />Workflow Workload<br />Thesis Defense - Eran Chinthaka Withana<br />
  20. 20. Evaluation: Prediction Accuracies for Use Cases<br />User-pattern-based predictions perform 2x better than service-name-based predictions<br />Thesis Defense - Eran Chinthaka Withana<br />20<br />
  21. 21. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />21<br />
  22. 22. User Perceived Reliability<br />Failures tolerated through<br />fault tolerance, high availability, recoverability, etc.,[Birman05]. <br />What matters from a user’s point of view is whether these failures are visible to users or not<br />E.g. reliability of commodity hardware (in clouds) vs user-perceived reliability<br />Reliability is not of resources themselves <br />Not derived from halting failures, fail-stop failures, network partitioning failures[Birman05] or machine downtimes. <br />It is a more broadly encompassing system reliability that can only be seen at user or workflow level<br />Can depend on user’s configuration and job types as well<br />We refer to this form of reliability as user-perceived reliability.<br />Importance of user-perceived reliability <br />Selecting a resource to schedule an experiment when user has access to multiple compute resources<br />E.g. LEAD reliability<br />supercomputing resources vs<br />Windows Azure resources<br />Thesis Defense - Eran Chinthaka Withana<br />22<br />
  23. 23. Why User Perceived Reliability is Useful<br />User perceived failure probabilities<br />Cluster A, p(A) = 0.2 and Cluster B, p(B) = 0.3<br />p(A ∩ ¬B) = p(A) · (1 − p(B)) = 0.2 · (1 − 0.3) = 0.14 (A fails, B would succeed)<br />p(B ∩ ¬A) = p(B) · (1 − p(A)) = 0.3 · (1 − 0.2) = 0.24 (B fails, A would succeed)<br />Since p(A ∩ ¬B) < p(B ∩ ¬A), try cluster A first and then cluster B.<br />Thesis Defense - Eran Chinthaka Withana<br />23<br />
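The same ordering rule can be written down directly; this is a sketch (the function and variable names are mine) of choosing which cluster to try first from user-perceived failure probabilities.

```python
def try_order(failure_probs):
    """Given user-perceived failure probabilities per cluster,
    return the order in which to try them: lower failure
    probability first, which also minimizes the chance that the
    first attempt fails when the alternative would have succeeded."""
    return sorted(failure_probs, key=failure_probs.get)

# For p(A) = 0.2 and p(B) = 0.3:
#   p(A fails, B would succeed) = 0.2 * (1 - 0.3) = 0.14
#   p(B fails, A would succeed) = 0.3 * (1 - 0.2) = 0.24
# so the slide's rule yields A before B.
```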
  24. 24. Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions<br />Objective<br />Reduce impact of low reliability of compute resources<br />Deducing user-perceived reliabilities<br />learning from user experiences and perceptions<br />Research outline<br />Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information<br />Use of machine learning techniques<br />Trained classifiers to represent compute resources and their reliabilities<br />Prediction of job failures<br />Algorithm assessment <br />Workloads from parallel workload archive representing jobs executed in two different supercomputing clusters<br />24<br />Thesis Defense - Eran Chinthaka Withana<br />
  25. 25. System Architecture<br />Thesis Defense - Eran Chinthaka Withana<br />25<br />A machine learning classifier is trained to learn user-perceived reliabilities of each cluster.<br />Classifiers types<br />Static classifier: train classifier initially from historical information<br />Dynamic (updateable) classifier: starts from zero knowledge and build when system is in operation <br />
  26. 26. System Architecture<br />Thesis Defense - Eran Chinthaka Withana<br />26<br />Classifier manager uses Weka[Hall09] framework<br />Classification methods<br />Naïve Bayes and KStar<br />Static and Dynamic classifiers<br />Dynamic pruning of features[Fadishei09] for increased efficiency<br />Classifier manager<br />Creates and maintains classifiers for each compute resource<br />A new job is evaluated based on these classifiers to deduce predicted reliability of job execution<br />Policy Implementers<br />Considers resource reliability predictions together with other quality of service information (time, cost) to select a resource<br />
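The per-resource classifier idea can be sketched without Weka; below is a toy updateable Naive Bayes kept per compute resource. The feature names, add-one smoothing, and 'ok'/'fail' labels are assumptions for illustration, not the thesis code.

```python
from collections import defaultdict

class ResourceReliabilityClassifier:
    """Toy updateable Naive Bayes over categorical job attributes,
    one instance per compute resource."""
    def __init__(self):
        self.label_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, attr, value)

    def update(self, features, label):
        """Fold one finished job (label 'ok' or 'fail') into the model."""
        self.label_counts[label] += 1
        for attr, value in features.items():
            self.feature_counts[(label, attr, value)] += 1

    def _score(self, features, label):
        n = self.label_counts[label]
        p = n / sum(self.label_counts.values())
        for attr, value in features.items():
            # add-one smoothing so unseen values don't zero the score
            p *= (self.feature_counts[(label, attr, value)] + 1) / (n + 2)
        return p

    def predict(self, features):
        """Predicted outcome for a new job on this resource."""
        return max(self.label_counts,
                   key=lambda label: self._score(features, label))
```

A classifier manager would keep one such object per resource, and a policy implementer would combine its prediction with time and cost estimates when selecting a resource.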
  27. 27. Evaluation<br />Workloads from parallel workload archive[Feitelson]<br />LANL: Two years worth of jobs from 1994 to 1996 on 1024-node CM-5 at Los Alamos National Lab<br />LPC: Ten months (Aug, 2004 to May, 2005) worth of job records on 70 Xeon node cluster at “Laboratoire de Physique Corpusculaire” of Université Blaise Pascal, France<br />Minor cleanups to remove intermediate job states<br />10000 jobs were selected from each workload<br />LANL had 20% failed jobs<br />LPC had 30% failed jobs<br />Thesis Defense - Eran Chinthaka Withana<br />27<br />
  28. 28. Evaluation<br />Workload classification and maintenance<br />Classifiers: Naïve Bayes[John95] and KStar[Cleary95] classifier implementations in Weka[Hall09].<br />Classifier construction<br />Static classifier: first 1000 jobs train classifier.<br />Dynamic classifier: all 10000 jobs for classifier construction and evaluation. <br />Evaluation Metrics<br />Average reliability prediction accuracy: accuracy of predicting success/fail of job<br />Time saved: cumulative time saved by aggregating execution time of a job if it fails and if our system predicted failure successfully<br />baseline measure: ideal cumulative time that can be saved over time<br />Time Consumed For Classification and Updating Classifier<br />Effect of pruning attributes<br />Static subset of attributes (as proposed in Fadishei et al.[Fadishei09]) vs dynamic subset of attributes (checking effect on goal variable)<br />Thesis Defense - Eran Chinthaka Withana<br />28<br />
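The "time saved" metric and its ideal baseline can be made concrete; the record layout below is an assumption, not the workload archive's actual schema.

```python
def time_saved(job_records):
    """Sum execution time over jobs that actually failed AND whose
    failure the system predicted; the baseline sums over all failed
    jobs (the ideal cumulative time that could be saved)."""
    saved = sum(j["exec_time"] for j in job_records
                if j["failed"] and j["predicted_fail"])
    baseline = sum(j["exec_time"] for j in job_records if j["failed"])
    return saved, baseline
```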
  29. 29. Evaluation<br />Evaluation Metrics<br />Effect of Job Reliability Predictions on Selecting Compute Resources<br />Extended version of GridSim[Buyya02] models four compute resources<br />NWS[Wolski99] for bandwidth estimation and QBets[Nurmi07] for queue wait time estimation<br />Total execution time = data movement time + queue wait time + job execution time (found in workload)<br />Schedulers<br />Total Execution Time Priority Scheduler <br />Reliability Prediction Based Time Priority Scheduler<br />Metrics<br />Average Accuracy of Selecting Reliable Resources to Execute Jobs<br />Time Wasted Due to Incorrect Selection of Compute Resources to Execute Jobs<br />All evaluations were run on a 3.0GHz dual-core processor with 4GB memory, on the Windows 7 Professional operating system.<br />Thesis Defense - Eran Chinthaka Withana<br />29<br />
  30. 30. Evaluation Metrics Summary<br />Thesis Defense - Eran Chinthaka Withana<br />30<br />
  31. 31. Results: Average Reliability Prediction Accuracy<br />31<br />Static<br />Dynamic / Updateable<br />LANL<br />LANL Accuracy Saturation ~ 82%<br />LPC Accuracy Saturation ~ 97%<br />KStar has performed slightly better than Naïve Bayes<br />LPC<br />Thesis Defense - Eran Chinthaka Withana<br />
  32. 32. Results: Time Savings<br />32<br />Static<br />Dynamic / Updateable<br />LANL<br />With static classifier, KStar has saved 90-100%<br />Updateable classifier <br />For LANL, both KStar and NB ~ 50% saving<br />For LPC ~ 90% saving<br />LPC<br />Thesis Defense - Eran Chinthaka Withana<br />
  33. 33. Results: Time Consumed for Classification and Updating Classifier<br />Thesis Defense - Eran Chinthaka Withana<br />33<br />Static Classifier<br />Updateable Classifier<br />Both static and updateable Naïve Bayes classifiers take very little time (not included in graphs)<br />
  34. 34. Results: Effect of Pruning Attributes<br />Static subset of attributes[Fadishei09] performs poorly on this data set and classifier<br />Dynamic pruning has improved accuracy of predictions compared to the non-pruned case, but the improvement is marginal<br />Conclusion -> our classifiers handle noisy features well without compromising classification accuracy<br />Identification of attributes to prune is a dynamic and expensive task, so the system can be used in practical cases even without pruning attributes.<br />Thesis Defense - Eran Chinthaka Withana<br />34<br />
  35. 35. Results: Effect of Job Reliability Predictions on Selecting Compute Resources<br />Poor performance of execution time priority scheduler<br />After 1000 jobs (training), time wasted with our approach stays fairly constant<br />Thesis Defense - Eran Chinthaka Withana<br />35<br />
  36. 36. Evaluation Conclusion<br />Even though average prediction accuracy with the static KStar classifier has decreased, it has managed to learn and predict failures better than any other method.<br />Even though the amount of time saved has increased slightly with the Naïve Bayes updateable classifier, the amount of time saved using the static KStar classifier is higher than both methods.<br />Even though its total prediction accuracy lags other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.<br />Taking user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions<br />Thesis Defense - Eran Chinthaka Withana<br />36<br />
  37. 37. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />37<br />
  38. 38. Scientific Computing Resource Abstraction Layer<br />Variety of scientific computing platforms and opportunities<br />Requirements<br />Support existing job description languages and also should be extensible to support other languages.<br />Provide a uniform and interoperable interface for external entities to interact with it.<br />Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS, PaaS clouds, departmental clusters.<br />Extensibility to support new and future resource managers with minimal changes. <br />Provide monitoring and fault recovery, especially when working with utility computing resources.<br />Provide light-weight, robust and scalable infrastructure.<br />Integration to variety of workflow environments.<br />Thesis Defense - Eran Chinthaka Withana<br />38<br />
  39. 39. Scientific Computing Resource Abstraction Layer<br />Our contribution<br />Resource abstraction layer <br />Implemented as a web service<br />Provides a uniform abstraction layer over heterogeneous compute resources including grids, clouds and local departmental clusters.<br />Support for standard job specification languages including, but not limited to, Job Submission Description Language (JSDL)[Anjomshoaa04] and Globus Resource Specification Language (RSL)<br />Directly interacts with resource managers, so requires no grid or meta-scheduling middleware<br />Integration with current resource managers, including LoadLeveler, PBS, LSF and Windows HPC, Amazon EC2 and Microsoft Azure platforms<br />Features<br />Does not need a high level of computer science knowledge to install and maintain the system <br />Use of Globus was a challenge for most non-computer scientists<br />Involvement of system administrators to install and maintain Sigiri is minimal<br />Memory footprint is minimal<br />Other tools require installation of most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run. (Note that installing Globus on small clusters is something scientists never wanted to do.)<br />Better fault tolerance and failure recovery.<br />Thesis Defense - Eran Chinthaka Withana<br />39<br />
  40. 40. Architecture<br />Asynchronous messaging model of message publishers and consumers<br />Daemons shadowing compute resources<br />Distributed component deployment<br />Daemon, front end Web service and job queue <br />Thesis Defense - Eran Chinthaka Withana<br />40<br />
  41. 41. Client Interaction Service<br />Deployed as an Apache Axis2 Web service to enable interoperability<br />Accepts job requests and enables management and monitoring functions<br />Job submission schema does not enforce a schema for the job description<br />Enables multiple job description languages<br />Thesis Defense - Eran Chinthaka Withana<br />41<br />
  42. 42. Client Interaction Service<br />Thesis Defense - Eran Chinthaka Withana<br />42<br />Job Submission Response<br />Job Submission Request<br />
  43. 43. Daemons<br />Each managed compute resource has a light-weight daemon<br />periodically checks job request queue<br />translates job specification to a resource manager specific language<br />submits pending jobs and persists correlation between the resource manager's job id and the internal id<br />Extensible daemon API <br />enables integration of a wide range of resource managers while keeping the complexities of these resource managers transparent to end users<br />Queuing based approach enables daemons to be run on any compute platform, without any software or operating system requirements<br />Current Support<br />LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure<br />Thesis Defense - Eran Chinthaka Withana<br />43<br />
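One pass of the daemon's polling cycle might look like the sketch below; the queue, resource-manager, and id-map objects are illustrative stand-ins rather than Sigiri's actual API.

```python
def poll_once(job_queue, resource_manager, id_map):
    """Single polling pass: translate each pending job's
    specification into the resource manager's own language,
    submit it, and persist the correlation between the resource
    manager's job id and the internal id."""
    for job in job_queue.pending_jobs():
        script = resource_manager.translate(job["spec"])
        remote_id = resource_manager.submit(script)
        id_map[job["internal_id"]] = remote_id  # id correlation
        job_queue.mark_submitted(job)

# A daemon would run poll_once periodically (e.g. every few
# seconds) for the one compute resource it shadows.
```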
  44. 44. Integration of Cloud Computing Resources<br />Unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements.<br />Enables scientists to interact with multiple cloud providers within same system<br />Features<br />Extensions can be written as modules independent of other extensions, typically to carry out a single task<br />Enforced failure handling to prevent orphan VMs, resources<br />Thesis Defense - Eran Chinthaka Withana<br />44<br />
  45. 45. Security<br />Client Security<br />Between client and Web service layer<br />Support for both transport level security (using SSL) and application layer security (using WS-Security)<br />Client negotiation of security credentials with WS-Security policy support within Apache Axis2<br />Compute Resource Security<br />System has support to store different types of security credentials<br />Username/password combinations, X.509 credentials<br />Thesis Defense - Eran Chinthaka Withana<br />45<br />
  46. 46. Performance Evaluation<br />Test Scenarios<br />Case 1: Jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients.<br />Each client waits for all jobs to finish before submitting the next set of jobs.<br />For example, during the test with 100 clients, each client sends 1 job to the server, making 100 jobs arrive at the server in parallel.<br />Case 2: Each client submits 10 jobs having varying execution times in sequence with no delay between submissions<br />client does not block upon submission of a job<br />failure rate and server performance, from the client's point of view, are measured, and the number of simultaneous clients is systematically increased<br />Thesis Defense - Eran Chinthaka Withana<br />46<br />
  47. 47. Performance Evaluation:Baseline Measurements<br />Thesis Defense - Eran Chinthaka Withana<br />47<br />
  48. 48. Performance Evaluation:Metrics<br />Thesis Defense - Eran Chinthaka Withana<br />48<br />
  49. 49. Performance Evaluation:Scalability Metrics<br />Thesis Defense - Eran Chinthaka Withana<br />49<br />
  50. 50. Performance Evaluation<br />Experimental Setup<br />Daemon hosted within gatekeeper node (quad-core IBM PowerPC (1.6GHz) with 8GB of physical memory) of Big Red cluster <br />System Web service and database co-hosted on a box with four 2.6GHz dual-core processors and 32GB of RAM<br />Neither node was dedicated to our experiment while tests were running<br />Client Environment<br />Setup within 128-node Odin Cluster (each node is a dual AMD 2.0GHz Opteron processor with 4GB physical memory)<br />All client nodes were used in dedicated mode and each client runs on a separate Java virtual machine to eliminate any external overhead<br />Data Collection<br />Each test was run (number of clients × 10) times and results were averaged.<br />Each parameter is tested for 100 to 1000 concurrent clients<br />Total of 110,000 tests were run. <br />Gram4 experiment results produced in the Gram4 evaluation paper[Marru08] were used for system performance comparison. <br />Thesis Defense - Eran Chinthaka Withana<br />50<br />
  51. 51. Results<br />Thesis Defense - Eran Chinthaka Withana<br />51<br />Baseline Measurements<br />All overheads scaling proportional to number of clients<br />No failures<br />Case 1<br />Case 2<br />
  52. 52. Results<br />Thesis Defense - Eran Chinthaka Withana<br />52<br />Metrics for Test Case 1 and 2<br />Both response time and total overhead scaling proportional to number of clients<br />No failures<br />
  53. 53. Results<br />Thesis Defense - Eran Chinthaka Withana<br />53<br />Scalability Metrics<br />Failures<br />No failures with Sigiri<br />Failures starting from<br />300 clients for Gram<br />Case 1<br />Case 2<br />
  54. 54. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />54<br />
  55. 55. Applications: LEAD<br />Motivations<br />Grid middleware reliability and scalability study[Marru08] and workflow failure rates. <br />Components of the LEAD infrastructure were considered for adaptation to other scientific environments.<br />Sigiri initially prototyped to support LoadLeveler, PBS and LSF. <br />Implications<br />Improved workflow success rates <br />Mitigating the need for Globus middleware<br />Ability to work with non-standard job managers<br />Thesis Defense - Eran Chinthaka Withana<br />55<br />
  56. 56. Applications: LEAD II<br />Emergence of community-driven, production-quality workflow infrastructures<br />E.g. Trident Scientific Workflow Workbench with Workflow Foundation<br />Possibility of using alternate supercomputing resources<br />E.g. Recent port of the WRF (Weather Research & Forecasting) model to the Windows platform and Azure<br />Support for Windows based scientific computing environments.<br />56<br />
  57. 57. Background: LEAD II and Vortex2 Experiment<br />May 1, 2010 to June 15, 2010<br />~6 weeks, 7 days per week<br />Workflow started on the hour, every hour, each morning. <br />Had to find and bind to latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions. <br />If model data was not available at NCEP and University of Oklahoma, workflow could not begin.<br />Execution of complete WRF stack within 1 hour<br />57<br />
  58. 58. Trident Vortex2 Workflow<br />Bulk of time (50 min) spent in Lead Workflow Proxy Activity<br />58<br />Sigiri Integration<br />
  59. 59. Applications: Enabling Geo-Science Applications on Windows Azure<br />Geo-Science Applications<br />High Resource Requirements<br />Compute intensive, dedicated HPC hardware<br />e.g. Weather Research and Forecasting (WRF) Model<br />Emergence of ensemble applications<br />Large number of small jobs<br />e.g. Examining each air layer, over a long period of time. <br />Single experiment = about 14,000 jobs, each taking a few minutes to complete<br />59<br />
  60. 60. Geo-Science Applications: Opportunities<br />Cloud computing resources<br />On-demand access to “unlimited” resources<br />Flexibility<br />Worker roles and VM roles<br />Recent porting of geo-science applications<br />WRF, WRF Preprocessing System (WPS) port to Windows<br />Increased use of ensemble applications (large number of small runs)<br />Production quality, opensource scientific workflow systems<br />Microsoft Trident<br />60<br />
  61. 61. Research Vision<br />Enabling geo-science experiments <br />Type of applications<br />Compute intensive, ensembles<br />Type of scientists<br />Meteorologists, atmospheric scientists, emergency management personnel, geologists<br />Utilizing both Cloud computing and Grid computing resources<br />Utilizing opensource, production quality scientific workflow environments<br />Improved data and meta-data management<br />Geo-Science Applications<br />Scientific Workflows<br />Compute Resources<br />61<br />
  62. 62. Proposed Framework<br />Thesis Defense - Eran Chinthaka Withana<br />62<br />Architecture diagram components: Trident Activity, Web Service, Job Queue, Sigiri Job Mgmt. Daemons, Azure Management API, Azure Blob Store, Azure Custom VM Images, Azure Fabric<br />Per VM Instance: Sigiri Worker Service, IIS, WRF, MSMPI, Windows 2008 R2<br />
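A minimal sketch of the worker-role side of this framework — a Sigiri worker on each VM instance polling the shared job queue, launching the job (e.g. a WRF run), and reporting status. The queue transport, message shape, and status labels are assumptions for illustration, not the actual Sigiri code:

```python
import queue
import subprocess

def worker_loop(job_queue, report_status, poll_interval=1.0):
    """Poll the job queue, run each job, and report its status.

    `job_queue` stands in for the Azure job queue in the diagram;
    `None` serves as a shutdown sentinel for this sketch.
    """
    while True:
        try:
            job = job_queue.get(timeout=poll_interval)
        except queue.Empty:
            continue                      # nothing queued; keep polling
        if job is None:                   # shutdown sentinel
            job_queue.task_done()
            return
        report_status(job["id"], "RUNNING")
        result = subprocess.run(job["command"], shell=True,
                                capture_output=True, text=True)
        report_status(job["id"],
                      "FINISHED" if result.returncode == 0 else "FAILED")
        job_queue.task_done()
```

In the actual framework the queue and status reports would flow through the Web Service front end and Azure storage; a local `queue.Queue` suffices here to show the control flow.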
  63. 63. Applications: PRAGMA Testbed Support<br />Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]<br />an open international organization founded in 2002 to focus on practical issues of building international scientific collaborations<br />In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed. <br />Sigiri was used within the IU PRAGMA testbed<br />The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort.<br />The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no changes to interfaces.<br />In 2011, the PRAGMA - Opal - Sigiri integration was demonstrated successfully<br />Thesis Defense - Eran Chinthaka Withana<br />63<br />
  64. 64. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />64<br />
  65. 65. Related Work<br />Scientific Job Management Systems<br />Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07]<br />provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems.<br />The Carmen[Watson81] project <br />provided a cloud environment that has enabled collaboration between neuroscientists<br />requires all programs to be packaged as WS-I[Ballinger04] compliant Web services<br />Condor[Frey02] pools can also be utilized to unify certain compute resource interactions.<br />uses the Globus Toolkit[Foster05] (and GRAM underneath) <br />Poor failure recovery <br />overlooks failure modes of a cloud platform<br />Thesis Defense - Eran Chinthaka Withana<br />65<br />
  66. 66. Related Work<br />Scientific Research and Cloud Computing<br />IaaS, PaaS and SaaS environment evaluations<br />Scientists have mainly evaluated the use of IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08][Yu05]<br />Ease of setting up custom environments and control<br />Growing interest in using PaaS services[Humphrey10][Lu10][Qiu09]<br />Optimization to balance cost and time of executions[Deelman08][Yu05]<br />Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07] <br />Job Prediction Algorithms<br />Prediction of<br />execution times[Smith], job start times[Li04], queue-wait times[Nurmi07] and resource requirements[Julian04]<br />AI-based and statistical-modeling-based approaches<br />AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance<br />Reliability of Compute Resources<br />Birman[Birman05] and aspects of resources causing system reliability issues<br />Statistical modeling to predict failures[Kandaswamy08]<br />Thesis Defense - Eran Chinthaka Withana<br />66<br />
  67. 67. Outline<br />Mid-Range Science<br />Challenges and Opportunities<br />Current Landscape<br />Research<br />Research Questions<br />Contributions<br />Mining Historical Information to Find Patterns and Experiences<br />Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]<br />Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]<br />Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]<br />Applications<br />Related Work<br />Conclusion and Future Work<br />Thesis Defense - Eran Chinthaka Withana<br />67<br />
  68. 68. Conclusion<br />User inspired management of scientific jobs<br />Concentrates on identification of user patterns and perceptions<br />Harnesses historical information<br />Applies knowledge gained to improve scientific job executions<br />Argues that patterns, if identified based on individual users, can reveal important information to make sophisticated estimations of resource requirements<br />Evaluations demonstrate the usability of predictions for a meta-scheduler, especially ones integrated into community gateways, to improve their scheduling decisions.<br />Resource abstraction service<br />Helps mid-scale scientists obtain access to resources that are cheap and available<br />Strives to do so with a tool that is easy to set up and administer<br />Prototype implementations introduced and discussed are integrated and used in different domains and scientific applications<br />Applications demonstrate how our research contributed to advancing science in the respective domains.<br />Thesis Defense - Eran Chinthaka Withana<br />68<br />
  69. 69. Contributions<br />Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.<br />Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.<br />Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.<br />Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.<br />Thesis Defense - Eran Chinthaka Withana<br />69<br />
  70. 70. Future Work<br />Short term research directions<br />Integration of future job predictions and user-perceived reliability predictions<br />Evolving resource abstraction service to support more compute resources<br />Management of ensemble runs<br />Fault tolerance with proactive replication<br />Long Term Research Directions<br />Thesis Defense - Eran Chinthaka Withana<br />70<br />
  71. 71. Thank You !!<br />Thesis Defense - Eran Chinthaka Withana<br />71<br />
