User Inspired Management of Scientific Jobs in Grids and Clouds


This is my PhD defense presentation discussing the work I did on improving scientific job execution in Grids and Clouds. It talks about how user patterns can be used to learn user behavior and improve meta-scheduler decisions. The resource abstraction layer proposed and implemented helps scientists interact with a wide variety of compute resources.


Transcript

  • 1. User Inspired Management of Scientific Jobs in Grids and Clouds
    Eran Chinthaka Withana
    School of Informatics and Computing
    Indiana University, Bloomington, Indiana, USA
    Doctoral Committee
    Professor Beth Plale, PhD
    Dr. Dennis Gannon, PhD
    Professor Geoffrey Fox, PhD
    Professor David Leake, PhD
  • 2. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 3. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 4. Mid-Range Science
    Challenges
    Resource requirements grow beyond the lab and university, but are not suited for large-scale resources
    Difficulties finding sufficient compute resources
    E.g.: short-term forecast in LEAD for energy and agriculture
    Lack of resources to have a strong CS support person on the team
    Need for less-expensive and more-available resources
    Opportunities
    Wide variety of computational resources
    Science gateways
  • 5. Current Landscape
    Grid Computing
    Batch orientation, long queues even under moderate loads, no access transparency
    Drawbacks in quota system
    Levels of computer science expertise required
    Cloud Computing
    High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
    Payment policy and research cost models
    Use of Workflow Systems
    Hybrid workflows
    Enables utilization of heterogeneous compute resources
    E.g.: Vortex2 Experiment
    Need for resource abstraction layers and optimal selection of resources
    Need for improvement of scientific job executions
    Better scheduler decisions, selection of compute resources
    Reliability issues in compute resources
    Importance of learning user patterns and experiences
    [1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
  • 6. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 7. Research Questions
    “Can user patterns and experiences be used to improve scientific job executions in large scale systems?”
    “Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?”
    “Can these be put to use to advance science?”
  • 8. Contributions
    Propose and empirically demonstrate the use of user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
    Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
    Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
    A prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to demonstrate its usability.
  • 9. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 10. Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
    Objective
    Reducing the impact of startup overheads for time-critical applications
    Problem space
    Workflows can have multiple paths
    Workflow descriptions not available
    Need for predictions to identify job execution sequence
    Learning from user behavioral patterns to predict future jobs
    Research outline
    Algorithm to predict future jobs by extracting user patterns from historical information
    Use of knowledge-based techniques
    Zero knowledge or pre-populated job information consisting of connections between jobs
    Similar cases retrieved are used to predict future jobs, reducing high startup overheads (a sketch of this prediction loop follows this slide)
    Algorithm assessment
    Two different workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
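    The following Python sketch is illustrative only; the thesis uses a knowledge-based (case-based reasoning) engine, while this simplified stand-in uses per-user transition counts, and the provision_resources_for hook is hypothetical. It shows the loop described above: as each job arrives, similar past cases for the same user are consulted, the most likely next job is predicted, and resources are provisioned ahead of time.

        from collections import defaultdict, Counter

        class NextJobPredictor:
            """Illustrative sketch: predict a user's next job from historical job sequences."""

            def __init__(self):
                # Case base: per-user counts of observed (current job -> next job) transitions.
                self.cases = defaultdict(Counter)
                self.last_job = {}          # last job seen for each user

            def observe(self, user, job_name):
                """Record a submitted job and update the case base (zero-knowledge start)."""
                prev = self.last_job.get(user)
                if prev is not None:
                    self.cases[(user, prev)][job_name] += 1
                self.last_job[user] = job_name

            def predict_next(self, user, job_name):
                """Return the most likely next job for this user, or None if no similar case exists."""
                candidates = self.cases.get((user, job_name))
                if not candidates:
                    return None
                return candidates.most_common(1)[0][0]

        def provision_resources_for(job_name):
            # Hypothetical hook: start cloud instances so the predicted job avoids startup overhead.
            print(f"provisioning resources for predicted job: {job_name}")

        # Usage: as jobs stream in, predict and pre-provision for the likely successor.
        predictor = NextJobPredictor()
        for user, job in [("alice", "WRF"), ("alice", "CropPrediction"), ("alice", "WRF")]:
            predicted = predictor.predict_next(user, job)
            if predicted:
                provision_resources_for(predicted)
            predictor.observe(user, job)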
  • 11. Demonstration of User Patterns with Workflows
    Suite of workflows can differ from domain to domain
    E.g. WRF (Weather Research and Forecasting) as upstream node
    User patterns reveal sequence of jobs taking different users/domains into consideration
    Useful for a science gateway serving wide-range of mid-scale scientists
    Diagram: WRF as the upstream node feeding Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation
  • 12. Role of Successful Predictions to Reduce Startup Overheads
    Largest gain can be achieved when our prediction accuracy is high and setup time (s) is large with respect to execution time (t)
    r = probability of successful prediction (prediction accuracy)
    Percentage time reduction: formula shown on slide (a reconstruction under stated assumptions follows this slide)
    For simplicity, assuming equal job execution and startup times, a simplified form is also shown on the slide
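    The formula images on this slide did not survive extraction. A plausible reconstruction, assuming a correct prediction (probability r) hides the entire startup overhead s while a job of execution time t runs, is:

        \[
        \text{percentage time reduction} = \frac{r \cdot s}{t + s} \times 100\%
        \]
        % With the simplifying assumption t = s (equal execution and startup times):
        \[
        \text{percentage time reduction} = \frac{r}{2} \times 100\% = 50r\%
        \]

    Under this reading, the gain grows with the prediction accuracy r and shrinks as the work-to-overhead ratio t/s grows, which matches the observations on the next slide.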
  • 13. Relationship of Predictions to Execution Time
    Observations
    Percentage time reduction increases with accuracy of predictions
    Time reduction drops off steeply as the work-to-overhead ratio (t/s) increases
    Need to find the critical point for a given situation
    Fixing required percentage time reduction for a given t/s ratio and finding required accuracy of predictions
    Cost of wrong predictions
    Depends on compute resource
    Demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions
    Compromising cost to improve time
    Percentage time reduction: formula shown on slide
    Accuracy of Predictions = total successful future job predictions / total predictions
  • 14. Prediction Engine: System Architecture
    System architecture diagram (shown on slide) includes a Prediction Retriever component
  • 15. Use of Reasoning
    Store and retrieve cases
    Steps
    Retrieval of similar cases
    Similarity measurement
    Use of thresholds
    Reuse of old cases
    Case adaptation
    Storage
  • 16. Case Similarity Calculation
    Each case is represented by a set of attributes
    Attributes selected by finding their effect on the goal variable (next job); a similarity sketch follows this slide
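    A minimal Python sketch of attribute-weighted case similarity as described here. It is illustrative only: the attribute names, weights, and threshold are hypothetical, whereas in the thesis the weights are derived from each attribute's effect on the goal variable (the next job).

        def case_similarity(case_a, case_b, weights):
            """Weighted similarity between two cases represented as attribute dictionaries.
            Attributes with a stronger effect on the goal variable get larger weights."""
            total_weight = sum(weights.values())
            score = 0.0
            for attr, w in weights.items():
                if case_a.get(attr) == case_b.get(attr):   # simple exact-match similarity per attribute
                    score += w
            return score / total_weight

        # Hypothetical attributes for a job-submission case.
        weights = {"user": 0.4, "previous_job": 0.3, "time_of_day": 0.2, "queue": 0.1}
        new_case = {"user": "alice", "previous_job": "WRF", "time_of_day": "morning", "queue": "normal"}
        old_case = {"user": "alice", "previous_job": "WRF", "time_of_day": "evening", "queue": "normal"}

        similarity = case_similarity(new_case, old_case, weights)
        # Retrieve the old case only if similarity exceeds a threshold (thresholds mentioned on slide 15).
        if similarity >= 0.7:
            print("similar case retrieved:", similarity)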
  • 17. Evaluation
    Use cases
    Individual job workload[1]
    40k jobs over two years from 1024-node CM-5 at Los Alamos National Lab
    Workflow use case
    System doesn’t see or assume workflow specification
    Experimental setup
    A 2.0GHz dual-core processor with 4GB memory, on a 64-bit Windows operating system
    [1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
  • 18. Evaluation: Average Accuracy of Predictions
    Individual Jobs Workload
    ~ 75% accurate predictions with user patterns
    ~ 32% accurate predictions with service names
    Workflow Workload
    ~ 95% accurate predictions with user patterns
    ~ 53% accurate predictions with service names
  • 19. Evaluation: Time Saved
    Amount of time that can be saved if resources are already provisioned when a job is ready to run
    Startup time
    Assumed to be 3 minutes (average for commercial providers)
    Individual Jobs Workload
    Workflow Workload
  • 20. Evaluation: Prediction Accuracies for Use Cases
    User-pattern-based predictions perform about 2x better than service-name-based predictions
  • 21. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 22. User Perceived Reliability
    Failures tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]
    What matters from a user’s point of view is whether these failures are visible to users or not
    E.g. reliability of commodity hardware (in clouds) vs user-perceived reliability
    This reliability is not a property of the resources themselves
    Not derived from halting failures, fail-stop failures, network partitioning failures[Birman05] or machine downtimes.
    It is a more broadly encompassing system reliability that can only be seen at user or workflow level
    Can depend on user’s configuration and job types as well
    We refer to this form of reliability as user-perceived reliability.
    Importance of user-perceived reliability
    Selecting a resource to schedule an experiment when user has access to multiple compute resources
    E.g. LEAD reliability: supercomputing resources vs. Windows Azure resources
  • 23. Why User Perceived Reliability is Useful
    User-perceived failure probabilities (A = cluster A fails, B = cluster B fails)
    Cluster A, p(A) = 0.2 and Cluster B, p(B) = 0.3
    p(A ∩ ¬B) = p(A) × (1 − p(B)) = 0.2 × (1 − 0.3) = 0.14
    p(B ∩ ¬A) = p(B) × (1 − p(A)) = 0.3 × (1 − 0.2) = 0.24
    Since p(A ∩ ¬B) < p(B ∩ ¬A), try cluster A first and then cluster B (a generalized ordering sketch follows this slide)
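    A short sketch generalizing this ordering rule: given user-perceived failure probabilities for several resources, try them in order of increasing failure probability. The resource names and probabilities are hypothetical, and independent failures are assumed.

        def submission_order(failure_prob):
            """Order compute resources by user-perceived failure probability (lowest first)."""
            return sorted(failure_prob, key=failure_prob.get)

        failure_prob = {"clusterA": 0.2, "clusterB": 0.3, "azure": 0.1}   # hypothetical values
        order = submission_order(failure_prob)          # ['azure', 'clusterA', 'clusterB']

        # Probability that every resource in the trial order fails (independence assumed):
        p_all_fail = 1.0
        for r in order:
            p_all_fail *= failure_prob[r]
        print(order, p_all_fail)                        # 0.1 * 0.2 * 0.3 = 0.006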
  • 24. Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
    Objective
    Reduce impact of low reliability of compute resources
    Deducing user-perceived reliabilities
    learning from user experiences and perceptions
    Research outline
    Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
    Use of machine learning techniques
    Trained classifiers to represent compute resources and their reliabilities
    Prediction of job failures
    Algorithm assessment
    Workloads from parallel workload archive representing jobs executed in two different supercomputing clusters
  • 25. System Architecture
    A machine learning classifier is trained to learn user-perceived reliabilities of each cluster.
    Classifiers types
    Static classifier: trained initially from historical information
    Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation (a sketch follows this slide)
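    The actual implementation uses Weka classifiers (Naïve Bayes and KStar; see the next slide). The Python sketch below only illustrates, with a placeholder classifier interface (fit/update/predict), how a static classifier differs from a dynamic/updateable one; the training window of 1000 jobs mirrors the evaluation setup on slide 28.

        class ResourceReliabilityModel:
            """Sketch of a per-resource classifier predicting whether a job will fail (user-perceived)."""

            def __init__(self, classifier_factory, static=True, training_window=1000):
                self.make_classifier = classifier_factory   # returns a placeholder classifier object
                self.static = static
                self.training_window = training_window
                self.history = []                           # (job_features, failed) pairs
                self.classifier = None

            def record(self, job_features, failed):
                self.history.append((job_features, failed))
                if self.static:
                    # Static classifier: trained once, from the first `training_window` historical jobs.
                    if self.classifier is None and len(self.history) >= self.training_window:
                        self.classifier = self.make_classifier()
                        self.classifier.fit(self.history)
                else:
                    # Dynamic (updateable) classifier: starts from zero knowledge and is updated
                    # incrementally while the system is in operation.
                    if self.classifier is None:
                        self.classifier = self.make_classifier()
                    self.classifier.update(job_features, failed)

            def predict_failure(self, job_features):
                if self.classifier is None:
                    return None                             # not enough knowledge yet
                return self.classifier.predict(job_features)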
  • 26. System Architecture
    Classifier manager uses Weka[Hall09] framework
    Classification methods
    Naïve Bayes and KStar
    Static and Dynamic classifiers
    Dynamic pruning of features[Fadishei09] for increased efficiency
    Classifier manager
    Creates and maintains classifiers for each compute resource
    A new job is evaluated based on these classifiers to deduce predicted reliability of job execution
    Policy Implementers
    Considers resource reliability predictions together with other quality of service information (time, cost) to select a resource
  • 27. Evaluation
    Workloads from the parallel workload archive[Feitelson]
    LANL: two years' worth of jobs, from 1994 to 1996, on the 1024-node CM-5 at Los Alamos National Lab
    LPC: ten months' (Aug. 2004 to May 2005) worth of job records on a 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire of Université Blaise Pascal, France
    Minor cleanups to remove intermediate job states
    10000 jobs were selected from each workload
    LANL had 20% failed jobs
    LPC had 30% failed jobs
  • 28. Evaluation
    Workload classification and maintenance
    Classifiers: Naïve Bayes[John95] and KStar[Cleary95] classifier implementations in Weka[Hall09].
    Classifier construction
    Static classifier: the first 1000 jobs train the classifier.
    Dynamic classifier: all 10000 jobs for classifier construction and evaluation.
    Evaluation Metrics
    Average reliability prediction accuracy: accuracy of predicting success/fail of job
    Time saved: cumulative time saved, aggregating the execution time of each job that fails and whose failure our system predicted correctly (an illustrative computation follows this slide)
    Baseline measure: the ideal cumulative time that can be saved over time
    Time Consumed For Classification and Updating Classifier
    Effect of pruning attributes
    Static subset of attributes (as proposed in Fadishei et al.[Fadishei09]) vs. dynamic subset of attributes (checking the effect on the goal variable)
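    An illustrative computation of the "time saved" metric as described above: for each job that actually failed and whose failure the classifier predicted, the job's execution time counts as saved; the baseline is the execution time of all failed jobs. The numbers below are hypothetical.

        def time_saved(jobs):
            """jobs: iterable of (execution_time, actually_failed, predicted_failure) tuples."""
            saved = sum(t for t, failed, predicted in jobs if failed and predicted)
            baseline = sum(t for t, failed, _ in jobs if failed)   # ideal cumulative time that could be saved
            return saved, baseline

        jobs = [(120, True, True), (300, True, False), (60, False, False), (45, True, True)]
        saved, baseline = time_saved(jobs)
        print(saved, baseline, saved / baseline)   # 165, 465, ~0.35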
  • 29. Evaluation
    Evaluation Metrics
    Effect of Job Reliability Predictions on Selecting Compute Resources
    Extended version of GridSim[Buyya02] models four compute resources
    NWS[Wolski99] for bandwidth estimation and QBets[Nurmi07] for queue wait time estimation
    Total execution time = data movement time + queue wait time + job execution time (found in workload)
    Schedulers
    Total Execution Time Priority Scheduler
    Reliability Prediction Based Time Priority Scheduler (a sketch of both schedulers follows this slide)
    Metrics
    Average Accuracy of Selecting Reliable Resources to Execute Jobs
    Time Wasted Due to Incorrect Selection of Compute Resources to Execute Jobs
    All evaluations were run on a 3.0GHz dual-core processor with 4GB memory, on the Windows 7 Professional operating system.
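    A hedged Python sketch of the two schedulers compared here: one picks the resource with the smallest estimated total execution time, the other additionally skips resources whose classifier predicts failure for the job. All names, fields, and numbers are hypothetical placeholders for the GridSim/NWS/QBets machinery.

        def total_execution_time(resource, job):
            # Placeholder for: data movement time (NWS bandwidth estimate)
            #                + queue wait time (QBets estimate)
            #                + job execution time (from the workload)
            return resource["data_move"] + resource["queue_wait"] + job["exec_time"]

        def time_priority_scheduler(resources, job):
            """Baseline: minimize estimated total execution time only."""
            return min(resources, key=lambda r: total_execution_time(r, job))

        def reliability_aware_scheduler(resources, job, predict_failure):
            """Prefer resources whose classifier predicts success; fall back to all if none qualify."""
            reliable = [r for r in resources if not predict_failure(r, job)]
            candidates = reliable or resources
            return min(candidates, key=lambda r: total_execution_time(r, job))

        # Example with hypothetical numbers.
        resources = [
            {"name": "clusterA", "data_move": 20, "queue_wait": 300, "fails": False},
            {"name": "clusterB", "data_move": 10, "queue_wait": 60,  "fails": True},
        ]
        job = {"exec_time": 600}
        fastest  = time_priority_scheduler(resources, job)                # clusterB (but it fails)
        reliable = reliability_aware_scheduler(resources, job,
                                               lambda r, j: r["fails"])   # clusterA
        print(fastest["name"], reliable["name"])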
  • 30. Evaluation Metrics Summary
  • 31. Results: Average Reliability Prediction Accuracy
    Static and dynamic/updateable classifiers, LANL and LPC workloads (charts shown on slide)
    LANL accuracy saturates at ~82%; LPC accuracy saturates at ~97%
    KStar has performed slightly better than Naïve Bayes
  • 32. Results: Time Savings
    Static and dynamic/updateable classifiers, LANL and LPC workloads (charts shown on slide)
    With the static classifier, KStar has saved 90-100%
    With the updateable classifier: for LANL, both KStar and NB save ~50%; for LPC, ~90%
  • 33. Results: Time Consumed for Classification and Updating Classifier
    Static and updateable classifier timings (charts shown on slide)
    Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
  • 34. Results: Effect of Pruning Attributes
    Static sub-set of attributes (Fadishei09) performs poorly on this data set and classifier
    Dynamic pruning has improved the accuracy of predictions compared to the non-pruned case, but the improvement is marginal
    Conclusion: our classifiers handle noisy features well without compromising classification accuracy
    Identifying attributes to prune is a dynamic and expensive task
    The system can be used in practical cases even without pruning of attributes
  • 35. Results: Effect of Job Reliability Predictions on Selecting Compute Resources
    Poor performance of execution time priority scheduler
    After the first 1000 (training) jobs, the time wasted with our approach stays fairly constant
  • 36. Evaluation Conclusion
    Even though the average prediction accuracy of the KStar classifier decreased in the static setting, it learned to predict failures better than any other method.
    Even though the amount of time saved increased slightly with the updateable Naïve Bayes classifier, the amount of time saved using the static KStar classifier is higher than with both other methods.
    Even though its total prediction accuracy does not match the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.
    Taking user-perceived reliability of compute resources in to consideration can save a significant amount of time in scientific job executions
  • 37. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 38. Scientific Computing Resource Abstraction Layer
    Variety of scientific computing platforms and opportunities
    Requirements
    Support existing job description languages and be extensible to support other languages.
    Provide a uniform and interoperable interface for external entities to interact with it.
    Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS, PaaS clouds, departmental clusters.
    Extensibility to support new and future resource managers with minimal changes.
    Provide monitoring and fault recovery, especially when working with utility computing resources.
    Provide light-weight, robust and scalable infrastructure.
    Integration with a variety of workflow environments.
  • 39. Scientific Computing Resource Abstraction Layer
    Our contribution
    Resource abstraction layer
    Implemented as a web service
    Provides a uniform abstraction layer over heterogeneous compute resources including grids, clouds and local departmental clusters.
    Support for standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL)[Anjomshoaa04] and the Globus Resource Specification Language (RSL)
    Directly interacts with resource managers, so it requires no grid or meta-scheduling middleware
    Integration with current resource managers, including LoadLeveler, PBS, LSF, Windows HPC, Amazon EC2 and the Microsoft Azure platform
    Features
    Does not require a high level of computer science knowledge to install and maintain
    Use of Globus was a challenge for most non-computer scientists
    Involvement of system administrators to install and maintain Sigiri is minimal
    Memory footprint is minimal
    Other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run. (Installing Globus on small clusters is something scientists never wanted to do.)
    Better fault tolerance and failure recovery.
  • 40. Architecture
    Asynchronous messaging model of message publishers and consumers
    Daemons shadowing compute resources
    Distributed component deployment
    Daemon, front end Web service and job queue
  • 41. Client Interaction Service
    Deployed as an Apache Axis2 Web service to enable interoperability
    Accepts job requests and enables management and monitoring functions
    The job submission schema does not enforce a schema for the job description
    This enables multiple job description languages
  • 42. Client Interaction Service
    Job Submission Response
    Job Submission Request
  • 43. Daemons
    Each managed compute resource has a light-weight daemon
    Periodically checks the job request queue (see the sketch after this slide)
    Translates the job specification to a resource-manager-specific language
    Submits pending jobs and persists the correlation between the resource manager's job id and the internal id
    Extensible daemon API
    Enables integration of a wide range of resource managers while keeping the complexities of these resource managers transparent to end users
    The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
    Current Support
    LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
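    A simplified Python sketch of the daemon behavior described above. The queue, resource-manager, and id-store objects and their methods are hypothetical stand-ins for the resource-manager-specific implementations, not Sigiri's actual API.

        import time

        def daemon_loop(job_queue, resource_manager, id_store, poll_interval=30):
            """Light-weight daemon shadowing one compute resource."""
            while True:
                for job in job_queue.pending_jobs():                     # periodically check the job request queue
                    native_spec = resource_manager.translate(job.spec)   # e.g. JSDL -> PBS/LSF/EC2 request
                    native_id = resource_manager.submit(native_spec)     # submit to the resource manager
                    id_store.save(job.internal_id, native_id)            # persist internal-id <-> manager-id mapping
                    job_queue.mark_submitted(job.internal_id)
                for internal_id, native_id in id_store.active_jobs():
                    status = resource_manager.status(native_id)          # monitor running jobs
                    job_queue.update_status(internal_id, status)
                time.sleep(poll_interval)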
  • 44. Integration of Cloud Computing Resources
    Unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements.
    Enables scientists to interact with multiple cloud providers within same system
    Features
    Extensions can be written as modules independent of other extensions, typically to carry out a single task
    Enforced failure handling to prevent orphaned VMs and resources
  • 45. Security
    Client Security
    Between client and Web service layer
    Support for both transport level security (using SSL) and application layer security (using WS-Security)
    Client negotiation of security credentials with WS-Security policy support within Apache Axis2
    Compute Resource Security
    System has support to store different types of security credentials
    Username/password combinations, X.509 credentials
  • 46. Performance Evaluation
    Test Scenarios
    Case 1: Jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients.
    Each client waits for all jobs to finish before submitting next set of jobs.
    For example, during a test with 100 clients, each client sends 1 job, so 100 jobs arrive at the server in parallel.
    Case 2: Each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
    The client does not block upon submission of a job
    Failure rate and server performance, from the clients' point of view, are measured as the number of simultaneous clients is systematically increased (an illustrative client sketch follows this slide)
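    An illustrative sketch of the two client behaviors; submit_job and wait_for_all are hypothetical stand-ins for the actual Web service calls used in the tests.

        import threading

        def case1_client(submit_job, wait_for_all, jobs_per_burst=1):
            """Case 1: submit a burst of jobs and block until all of them finish before the next burst."""
            job_ids = [submit_job() for _ in range(jobs_per_burst)]
            wait_for_all(job_ids)

        def case2_client(submit_job, jobs=10):
            """Case 2: submit jobs of varying execution times in sequence, never blocking on completion."""
            for _ in range(jobs):
                submit_job()

        def run_concurrent_clients(client_fn, num_clients, *args):
            """Run num_clients clients in parallel; failures and response times are measured externally."""
            threads = [threading.Thread(target=client_fn, args=args) for _ in range(num_clients)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()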
  • 47. Performance Evaluation:Baseline Measurements
  • 48. Performance Evaluation:Metrics
  • 49. Performance Evaluation:Scalability Metrics
  • 50. Performance Evaluation
    Experimental Setup
    Daemon hosted on the gatekeeper node (quad-core 1.6GHz IBM PowerPC with 8GB of physical memory) of the Big Red cluster
    System Web service and database co-hosted on a machine with four 2.6GHz dual-core processors and 32GB of RAM
    Neither of these nodes was dedicated to our experiment while the tests were running
    Client Environment
    Setup within 128 node Odin Cluster (each node is a Dual AMD 2.0GHz Opteron processor with 4GB physical memory)
    All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead
    Data Collection
    Each test was run (number of clients × 10) times and the results were averaged.
    Each parameter is tested for 100 to 1000 concurrent clients
    Total of 110,000 tests were run.
    GRAM4 experiment results produced in the GRAM4 evaluation paper[Marru08] were used for system performance comparison.
  • 51. Results
    Baseline measurements (Case 1 and Case 2 charts shown on slide)
    All overheads scale proportionally with the number of clients
    No failures
  • 52. Results
    Metrics for Test Cases 1 and 2 (charts shown on slide)
    Both response time and total overhead scale proportionally with the number of clients
    No failures
  • 53. Results
    Scalability metrics (Case 1 and Case 2 charts shown on slide)
    Failures: no failures with Sigiri; failures starting from 300 clients for GRAM
  • 54. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 55. Applications: LEAD
    Motivations
    Grid middleware reliability and scalability study[Marru08] and workflow failure rates.
    Components of the LEAD infrastructure were considered for adaptation to other scientific environments.
    Sigiri initially prototyped to support Load Leveler, PBS and LSF.
    Implications
    Improved workflow success rates
    Mitigated the need for Globus middleware
    Ability to work with non-standard job managers
  • 56. Applications: LEAD II
    Emergence of community-driven, production-quality workflow infrastructures
    E.g. Trident Scientific Workflow Workbench with Workflow Foundation
    Possibility of using alternate supercomputing resources
    E.g. the recent port of the WRF (Weather Research & Forecasting) model to the Windows platform and Azure
    Support for Windows based scientific computing environments.
  • 57. Background: LEAD II and Vortex2 Experiment
    May 1, 2010 to June 15, 2010
    ~6 weeks, 7-days per week
    Workflows started on the hour, every hour, each morning.
    Had to find and bind to latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions.
    If model data was not available at NCEP and University of Oklahoma, workflow could not begin.
    Execution of complete WRF stack within 1 hour
  • 58. Trident Vortex2 Workflow
    Bulk of time (50 min) spent in Lead Workflow Proxy Activity
    Sigiri Integration
  • 59. Applications: Enabling Geo-Science Application on Windows Azure
    Geo-Science Applications
    High Resource Requirements
    Compute intensive, dedicated HPC hardware
    e.g. Weather Research and Forecasting (WRF) Model
    Emergence of ensemble applications
    Large number of small jobs
    e.g. Examining each air layer, over a long period of time.
    Single experiment = about 14,000 jobs, each taking a few minutes to complete
  • 60. Geo-Science Applications: Opportunities
    Cloud computing resources
    On-demand access to “unlimited” resources
    Flexibility
    Worker roles and VM roles
    Recent porting of geo-science applications
    WRF, WRF Preprocessing System (WPS) port to Windows
    Increased use of ensemble applications (large number of small runs)
    Production-quality, open-source scientific workflow systems
    Microsoft Trident
  • 61. Research Vision
    Enabling geo-science experiments
    Type of applications
    Compute intensive, ensembles
    Type of scientists
    Meteorologists, atmospheric scientists, emergency management personnel, geologists
    Utilizing both Cloud computing and Grid computing resources
    Utilizing open-source, production-quality scientific workflow environments
    Improved data and meta-data management
    (Diagram: Geo-Science Applications, Scientific Workflows, Compute Resources)
  • 62. Proposed Framework
    Diagram components: Trident Activity, Sigiri Web Service, Job Queue, Job Mgmt. Daemons, Azure Management API, Azure Fabric, Azure Blob Store, custom Azure VM images, and a VM instance (Windows 2008 R2) running IIS, WRF, MSMPI and the Sigiri Worker Service (a speculative worker sketch follows this slide)
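    A speculative sketch of how the Sigiri Worker Service on each Azure VM instance might process WRF work items. The object names, methods, paths, and command line are assumptions based only on the components listed in the diagram, not the actual implementation.

        import subprocess, time

        def worker_service_loop(job_queue, blob_store, poll_interval=60):
            """Hypothetical Sigiri worker loop on a Windows 2008 R2 VM instance."""
            while True:
                job = job_queue.next_job()                    # work item handed off by the Sigiri daemons
                if job is not None:
                    blob_store.download(job.input_container, "C:/wrf/input")   # stage inputs from Azure Blob Store
                    # Run the Windows port of WRF under MS-MPI (command line is illustrative only).
                    subprocess.run(["mpiexec", "-n", str(job.cores), "wrf.exe"], cwd="C:/wrf", check=True)
                    blob_store.upload("C:/wrf/output", job.output_container)   # stage results back
                    job_queue.report_done(job.id)
                time.sleep(poll_interval)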
  • 63. Applications: Pragma Testbed Support
    Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]
    an open international organization founded in 2002 to focus on practical issues of building international scientific collaborations
    In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster for testbed.
    Sigiri was used within the IU PRAGMA testbed
    The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort.
    The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces.
    In 2011, the PRAGMA-Opal-Sigiri integration was demonstrated successfully
  • 64. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 65. Related Work
    Scientific Job Management Systems
    Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05] and SAGA[Goodale06] and Falkon[Raicu07]
    provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems.
    Carmen[Watson81] project
    provided a cloud environment that has enabled collaboration between neuroscientists
    requires all programs to be packaged as WS-I[Ballinger04] compliant Web services
    Condor[Frey02] pools can also be utilized to unify certain compute resource interactions.
    uses Globus toolkit[Foster05] (and GRAM underneath)
    Poor failure recovery
    overlooks failure modes of a cloud platform
  • 66. Related Work
    Scientific Research and Cloud Computing
    IaaS, PaaS and SaaS environment evaluations
    Scientists have mainly evaluated use of IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08] [Yu05]
    Ease of setting up custom environments and control
    Growing interest in using PaaS services[Humphrey10][Lu10] [Qiu09]
    Optimization to balance cost and time of executions[Deelman08][Yu05]
    Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
    Job Prediction Algorithms
    Prediction of
    Execution times[Smith], job start times[Li04], queue-wait times[Nurmi07] and resource requirements[Julian04]
    AI based and statistical modeling based approaches
    AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance
    Reliability of Compute Resources
    Birman[Birman05] and aspects of resources causing system reliability issues
    Statistical modeling to predict failures[Kandaswamy08]
  • 67. Outline
    Mid-Range Science
    Challenges and Opportunities
    Current Landscape
    Research
    Research Questions
    Contributions
    Mining Historical Information to Find Patterns and Experiences
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
    Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
    Applications
    Related Work
    Conclusion and Future Work
  • 68. Conclusion
    User inspired management of scientific jobs
    Concentrates on identifying user patterns and perceptions
    Harnesses historical information
    Applies knowledge gained to improve scientific job executions
    Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimations of resource requirements
    Evaluations demonstrate the usability of the predictions for meta-schedulers, especially ones integrated into community gateways, to improve their scheduling decisions.
    Resource abstraction service
    Helps mid-scale scientists obtain access to resources that are cheap and available
    Strives to do so with a tool that is easy to set up and administer
    The prototype implementation introduced and discussed is integrated and used in different domains and scientific applications
    The applications demonstrate how our research contributed to advancing science in the respective domains.
  • 69. Contributions
    Propose and empirically demonstrate the use of user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
    Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
    Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
    A prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to demonstrate its usability.
  • 70. Future Work
    Short term research directions
    Integration of future job predictions and user-perceived reliability predictions
    Evolving resource abstraction service to support more compute resources
    Management of ensemble runs
    Fault tolerance with proactive replication
    Long Term Research Directions
  • 71. Thank You !!