User Inspired Management of Scientific Jobs in Grids and Clouds

This is my PhD defense presentation discussing the work I did on improving scientific job execution in Grids and Clouds. It talks about how user patterns can be used to learn user behavior and improve meta-scheduler decisions. The resource abstraction layer proposed and implemented helps scientists interact with a wide variety of compute resources.


    • User Inspired Management of Scientific Jobs in Grids and Clouds
      Eran Chinthaka Withana
      School of Informatics and Computing
      Indiana University, Bloomington, Indiana, USA
      Doctoral Committee
      Professor Beth Plale, PhD
      Dr. Dennis Gannon, PhD
      Professor Geoffrey Fox, PhD
      Professor David Leake, PhD
    • Outline
      Mid-Range Science
      Challenges and Opportunities
      Current Landscape
      Research
      Research Questions
      Contributions
      Mining Historical Information to Find Patterns and Experiences
      Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      Uniform Abstraction for Large-Scale Compute Resource Interactions [Contribution 3, 4]
      Applications
      Related Work
      Conclusion and Future Work
      Thesis Defense - Eran Chinthaka Withana
    • Mid-Range Science
      Challenges
      Resource requirements going beyond lab and university, but not suited for large-scale resources
      Difficulties finding sufficient compute resources
      E.g.: short term forecast in LEAD for energy and agriculture
      Lacking resources to have strong CS support person on team
      Need for less-expensive and more-available resources
      Opportunities
      Wide variety of computational resources
      Science gateways
    • Current Landscape
      Grid Computing
      Batch orientation, long queues even under moderate loads, no access transparency
      Drawbacks in quota system
      Levels of computer science expertise required
      Cloud Computing
      High availability, pay-as-you-go model, on-demand limitless [1] resource allocation
      Payment policy and research cost models
      Use of Workflow Systems
      Hybrid workflows
      Enables utilization of heterogeneous compute resources
      E.g.: Vortex2 Experiment
      Need for resource abstraction layers and optimal selection of resources
      Need for improvement of scientific job executions
      Better scheduler decisions, selection of compute resources
      Reliability issues in compute resources
      Importance of learning user patterns and experiences
      [1] M. Armbrust et al. Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
    • Research Questions
      “Can user patterns and experiences be used to improve scientific job executions in large scale systems?”
      “Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?”
      “Can these be put to use to advance science?”
    • Contributions
      Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision for compute resources reducing impact of startup overheads in cloud computing environments.
      Propose and empirically demonstrate user perceived reliability, learned by mining historical job execution information, as new dimension to consider during resource selections.
      Propose and demonstrate effectiveness and applicability of light-weight and reliable resource abstraction service to hide complexities of interacting with multiple resource managers in grids and clouds.
      Prototype implementation to evaluate feasibility and performance of resource abstraction service and integration with four different application domains to prove its usability.
    • Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
      Objective
      Reducing the impact of startup overheads for time-critical applications
      Problem space
      Workflows can have multiple paths
      Workflow descriptions not available
      Need for predictions to identify job execution sequence
      Learning from user behavioral patterns to predict future jobs
      Research outline
      Algorithm to predict future jobs by extracting user patterns from historical information
      Use of knowledge-based techniques
      Zero knowledge or pre-populated job information consisting of connection between jobs
      Similar cases retrieved are used to predict future jobs, reducing high startup overheads
      Algorithm assessment
      Two different workloads representing individual scientific jobs executed in LANL and set of workflows executed by three users
    • Demonstration of User Patterns with Workflows
      Suite of workflows can differ from domain to domain
      E.g. WRF (Weather Research and Forecasting) as upstream node
      User patterns reveal sequence of jobs taking different users/domains into consideration
      Useful for a science gateway serving a wide range of mid-scale scientists
      [Figure: WRF as upstream node for Weather Predictions, Crop Predictions, Wind Farm Location Evaluations and Wild Fire Propagation Simulation]
    • Role of Successful Predictions to Reduce Startup Overheads
      Largest gain can be achieved when our prediction accuracy is high and setup time (s) is large with respect to execution time (t)
      r = probability of successful prediction (prediction accuracy)
      Percentage time reduction = [equation in terms of r, s and t not preserved in transcript]
      For simplicity, assuming equal job exec and startup times
      Percentage time reduction = [equation for s = t not preserved in transcript]
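      The equations on this slide are not preserved in the transcript. The following is a minimal sketch of one plausible model, not the author's formula. It assumes (hypothetically) that a job costs s + t without prediction and that a successful prediction, with probability r, hides the startup time s completely. The function name and the model itself are assumptions.

```python
def percentage_time_reduction(r: float, s: float, t: float) -> float:
    """Expected percentage reduction in per-job time.

    Assumed model (not from the slides): without prediction a job
    costs s + t; with prediction the startup cost s is only paid
    when the prediction is wrong, i.e. with probability 1 - r.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability in [0, 1]")
    if s < 0 or t < 0 or s + t == 0:
        raise ValueError("s and t must be non-negative and not both zero")
    baseline = s + t
    expected_with_prediction = t + (1.0 - r) * s
    return 100.0 * (baseline - expected_with_prediction) / baseline


# Under this model the gain grows with r and with s relative to t,
# which matches the qualitative statement on the slide.
equal_times = percentage_time_reduction(r=0.9, s=180.0, t=180.0)
long_job = percentage_time_reduction(r=0.9, s=180.0, t=1800.0)
```

      Under this assumed model, equal startup and execution times cap the reduction at half of the prediction accuracy.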
    • Relationship of Predictions to Execution Time
      Observations
      Percentage time reduction increases with accuracy of predictions
      Time reduction is reduced exponentially with increased work-to-overhead ratio
      Need to find critical point for a given situation
      Fixing required percentage time reduction for a given t/s ratio and finding required accuracy of predictions
      Cost of wrong predictions
      Depends on compute resource
      Demonstrated higher prediction accuracies (~90%) will reduce impact of wrong predictions
      Compromising cost to improve time
      Accuracy of predictions = total successful future job predictions / total predictions
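      The accuracy definition above, and the idea of fixing a target time reduction and solving for the required accuracy, can be sketched as follows. The accuracy ratio comes from the slide. The inversion assumes a hypothetical model in which percentage reduction = 100 * r / (1 + t/s); that model and both function names are assumptions, not the author's method.

```python
def prediction_accuracy(successful: int, total: int) -> float:
    """Accuracy = successful future job predictions / total predictions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def required_accuracy(target_pct: float, t_over_s: float) -> float:
    """Accuracy needed to reach target_pct for a given t/s ratio.

    Inverts the assumed model reduction = 100 * r / (1 + t/s).
    A result above 1.0 means the target cannot be reached at that
    ratio, no matter how good the predictions are.
    """
    if t_over_s < 0:
        raise ValueError("t_over_s must be non-negative")
    return (target_pct / 100.0) * (1.0 + t_over_s)


accuracy = prediction_accuracy(successful=90, total=100)
needed = required_accuracy(target_pct=25.0, t_over_s=1.0)
reachable = needed <= accuracy
```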
    • Prediction Engine: System Architecture
      [Figure: prediction engine architecture; surviving component labels: Prediction, Retriever]
    • Use of Reasoning
      Store and retrieve cases
      Steps
      Retrieval of similar cases
      Similarity measurement
      Use of thresholds
      Reuse of old cases
      Case adaptation
      Storage
    • Case Similarity Calculation
      Each case represented by set of attributes
      Selected by finding effect on goal variable (next job)
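      The retrieval and reuse steps described on the last two slides can be sketched as below. This is a minimal illustration, assuming weighted exact-match similarity over a few attributes. The attribute names, weights and threshold are hypothetical; the slides do not list the attributes actually used.

```python
from collections import Counter

# Hypothetical attribute weights; the real system selects attributes
# by their effect on the goal variable (the next job).
WEIGHTS = {"user": 0.5, "previous_job": 0.4, "hour_of_day": 0.1}


def similarity(case: dict, query: dict, weights: dict = WEIGHTS) -> float:
    """Weighted fraction of attributes on which case and query agree."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items()
                  if case.get(attr) == query.get(attr))
    return matched / total


def predict_next_job(case_base: list, query: dict, threshold: float = 0.7):
    """Retrieve cases above the threshold and vote on the next job."""
    votes = Counter()
    for case in case_base:
        score = similarity(case, query)
        if score >= threshold:
            votes[case["next_job"]] += score
    if not votes:
        return None  # no sufficiently similar case: make no prediction
    return votes.most_common(1)[0][0]


case_base = [
    {"user": "alice", "previous_job": "wps", "hour_of_day": 6, "next_job": "wrf"},
    {"user": "alice", "previous_job": "wps", "hour_of_day": 7, "next_job": "wrf"},
    {"user": "bob", "previous_job": "wps", "hour_of_day": 6, "next_job": "arps"},
]
query = {"user": "alice", "previous_job": "wps", "hour_of_day": 6}
predicted = predict_next_job(case_base, query)
```

      After the predicted job actually runs (or does not), the observed outcome would be stored as a new case, which corresponds to the storage step.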
    • Evaluation
      Use cases
      Individual job workload [1]
      40k jobs over two years from 1024-node CM-5 at Los Alamos National Lab
      Workflow use case
      System doesn’t see or assume workflow specification
      Experimental setup
      2.0GHz dual-core processor and 4GB memory, running a 64-bit Windows operating system
      [1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
    • Evaluation: Average Accuracy of Predictions
      Individual Jobs Workload
      ~ 75% accurate predictions with user patterns
      ~ 32% accurate predictions with service names
      Workflow Workload
      ~ 95% accurate predictions with user patterns
      ~ 53% accurate predictions with service names
    • Evaluation: Time Saved
      Amount of time that can be saved if resources are already provisioned when a job is ready to run
      Startup time
      Assumed to be 3 minutes (average for commercial providers)
      [Figures: time saved for the Individual Jobs Workload and the Workflow Workload]
    • Evaluation: Prediction Accuracies for Use Cases
      User-pattern-based predictions perform about 2x better than service-name-based predictions
    • User Perceived Reliability
      Failures tolerated through fault tolerance, high availability, recoverability, etc. [Birman05].
      What matters from a user’s point of view is whether these failures are visible to users or not
      E.g. reliability of commodity hardware (in clouds) vs user-perceived reliability
      Reliability is not of resources themselves
      Not derived from halting failures, fail-stop failures, network partitioning failures[Birman05] or machine downtimes.
      It is a more broadly encompassing system reliability that can only be seen at user or workflow level
      Can depend on user’s configuration and job types as well
      We refer to this form of reliability as user-perceived reliability.
      Importance of user-perceived reliability
      Selecting a resource to schedule an experiment when user has access to multiple compute resources
      E.g. LEAD reliability: supercomputing resources vs Windows Azure resources
    • Why User Perceived Reliability is Useful
      User perceived failure probabilities
      Cluster A, p(A) = 0.2 and Cluster B, p(B) = 0.3
      $p(A \cap \bar{B}) = p(A)\,(1 - p(B)) = 0.2 \times (1 - 0.3) = 0.14$
      $p(B \cap \bar{A}) = p(B)\,(1 - p(A)) = 0.3 \times (1 - 0.2) = 0.24$
      Since $p(A \cap \bar{B}) < p(B \cap \bar{A})$, try cluster A first and then cluster B.
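      The arithmetic on this slide can be checked with a short sketch. It reads each product as the probability that the cluster tried first fails while the other succeeds, and assumes the two failures are independent, as the multiplication on the slide implies. The function name is hypothetical.

```python
def fail_then_succeed(p_fail_first: float, p_fail_second: float) -> float:
    """Probability that the first cluster fails and the second does not.

    Assumes the two clusters fail independently.
    """
    return p_fail_first * (1.0 - p_fail_second)


p_a, p_b = 0.2, 0.3  # user-perceived failure probabilities from the slide

a_fails_b_succeeds = fail_then_succeed(p_a, p_b)
b_fails_a_succeeds = fail_then_succeed(p_b, p_a)

# Prefer the ordering with the smaller chance of a wasted first attempt.
first_choice = "A" if a_fails_b_succeeds < b_fails_a_succeeds else "B"
```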
    • Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
      Objective
      Reduce impact of low reliability of compute resources
      Deducing user-perceived reliabilities
      learning from user experiences and perceptions
      Research outline
      Algorithm to predict user-perceived reliabilities by mining historical information about user experiences
      Use of machine learning techniques
      Trained classifiers to represent compute resources and their reliabilities
      Prediction of job failures
      Algorithm assessment
      Workloads from parallel workload archive representing jobs executed in two different supercomputing clusters
    • System Architecture
      A machine learning classifier is trained to learn user-perceived reliabilities of each cluster.
      Classifier types
      Static classifier: classifier is trained initially from historical information
      Dynamic (updateable) classifier: starts from zero knowledge and is built while the system is in operation
    • System Architecture
      Classifier manager uses Weka[Hall09] framework
      Classification methods
      Naïve Bayes and KStar
      Static and Dynamic classifiers
      Dynamic pruning of features[Fadishei09] for increased efficiency
      Classifier manager
      Creates and maintains classifiers for each compute resource
      A new job is evaluated based on these classifiers to deduce predicted reliability of job execution
      Policy Implementers
      Consider resource reliability predictions together with other quality of service information (time, cost) to select a resource
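      The difference between a static and a dynamic (updateable) classifier can be illustrated with a small categorical Naïve Bayes written in plain Python. This is a stand-in for illustration only, not the Weka classifiers used in the thesis. The class name, feature names and Laplace smoothing choice are assumptions.

```python
import math
from collections import defaultdict


class UpdateableNaiveBayes:
    """Tiny categorical Naive Bayes that can be updated one job at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, attr, value) -> count
        self.values = defaultdict(set)          # attr -> values seen so far

    def update(self, features: dict, label: str) -> None:
        self.class_counts[label] += 1
        for attr, value in features.items():
            self.feature_counts[(label, attr, value)] += 1
            self.values[attr].add(value)

    def predict(self, features: dict):
        total = sum(self.class_counts.values())
        if total == 0:
            return None  # zero knowledge: no prediction yet
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for attr, value in features.items():
                n_values = len(self.values[attr] | {value})
                seen = self.feature_counts.get((label, attr, value), 0)
                # Laplace smoothing so unseen values do not zero the score
                score += math.log((seen + 1) / (count + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_static(history: list, n_training: int) -> UpdateableNaiveBayes:
    """Static use: train once on the first n_training jobs, then freeze."""
    clf = UpdateableNaiveBayes()
    for features, label in history[:n_training]:
        clf.update(features, label)
    return clf


history = [({"user": "u1", "queue": "small"}, "success")] * 3 \
        + [({"user": "u2", "queue": "large"}, "fail")] * 3
static_clf = train_static(history, n_training=len(history))
```

      In dynamic use, the same object would start empty and `update` would be called after each job completes, so predictions improve while the system is in operation.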
    • Evaluation
      Workloads from parallel workload archive[Feitelson]
      LANL: Two years' worth of jobs from 1994 to 1996 on 1024-node CM-5 at Los Alamos National Lab
      LPC: Ten months (Aug 2004 to May 2005) worth of job records on 70 Xeon node cluster at "Laboratoire de Physique Corpusculaire" of Université Blaise-Pascal, France
      Minor cleanups to remove intermediate job states
      10000 jobs were selected from each workload
      LANL had 20% failed jobs
      LPC had 30% failed jobs
    • Evaluation
      Workload classification and maintenance
      Classifiers: Naïve Bayes[John95] and KStar[Cleary95] classifier implementations in Weka[Hall09].
      Classifier construction
      Static classifier: first 1000 jobs trains classifier.
      Dynamic classifier: all 10000 jobs for classifier construction and evaluation.
      Evaluation Metrics
      Average reliability prediction accuracy: accuracy of predicting success/fail of job
      Time saved: cumulative time saved by aggregating execution time of a job if it fails and if our system predicted failure successfully
      baseline measure: ideal cumulative time that can be saved over time
      Time Consumed For Classification and Updating Classifier
      Effect of pruning attributes
      Static subset of attributes (as proposed in Fadishei et al.[Fadishei09]) vs dynamic subset of attributes (checking effect on goal variable)
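      The time-saved metric and its baseline, as described above, can be sketched as follows. The tuple layout and function name are hypothetical; the definitions follow the slide: time is saved when a job fails and the failure was predicted, and the ideal baseline is the execution time of every failed job.

```python
def time_saved(jobs: list) -> tuple:
    """Return (time saved by the predictor, ideal saveable time).

    Each job is (execution_time, actually_failed, predicted_failure).
    """
    saved = sum(t for t, failed, predicted in jobs if failed and predicted)
    ideal = sum(t for t, failed, _ in jobs if failed)
    return saved, ideal


jobs = [
    (100.0, True, True),    # failure predicted: execution time saved
    (50.0, True, False),    # failure missed: nothing saved
    (30.0, False, True),    # false alarm: no time saved by this metric
    (20.0, False, False),
]
saved, ideal = time_saved(jobs)
fraction_of_ideal = saved / ideal if ideal else 0.0
```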
    • Evaluation
      Evaluation Metrics
      Effect of Job Reliability Predictions on Selecting Compute Resources
      Extended version of GridSim[Buyya02] models four compute resources
      NWS[Wolski99] for bandwidth estimation and QBets[Nurmi07] for queue wait time estimation
      Total execution time = data movement time + queue wait time + job execution time (found in workload)
      Schedulers
      Total Execution Time Priority Scheduler
      Reliability Prediction Based Time Priority Scheduler
      Metrics
      Average Accuracy of Selecting Reliable Resources to Execute Jobs
      Time Wasted Due to Incorrect Selection of Compute Resources to Execute Jobs
      All evaluations were run on a machine with a 3.0GHz dual-core processor and 4GB memory, running the Windows 7 Professional operating system.
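      The two schedulers compared above can be sketched as below. The total-time formula is taken from the slide. How the reliability-based scheduler combines predictions with time is not stated in the transcript, so this sketch assumes it simply excludes resources predicted to fail and falls back to all resources when none look reliable. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    resource: str
    data_movement: float
    queue_wait: float
    execution: float
    predicted_to_fail: bool  # hypothetical output of a reliability classifier

    @property
    def total_time(self) -> float:
        # Total execution time = data movement + queue wait + job execution
        return self.data_movement + self.queue_wait + self.execution


def time_priority(estimates: list) -> str:
    """Pick the resource with the smallest estimated total time."""
    return min(estimates, key=lambda e: e.total_time).resource


def reliability_time_priority(estimates: list) -> str:
    """Pick the fastest resource among those predicted to be reliable."""
    reliable = [e for e in estimates if not e.predicted_to_fail]
    candidates = reliable or list(estimates)
    return min(candidates, key=lambda e: e.total_time).resource


estimates = [
    Estimate("cluster-a", 10.0, 5.0, 100.0, predicted_to_fail=True),
    Estimate("cluster-b", 20.0, 30.0, 100.0, predicted_to_fail=False),
]
fastest = time_priority(estimates)
safest_fast = reliability_time_priority(estimates)
```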
    • Evaluation Metrics Summary
    • Results: Average Reliability Prediction Accuracy
      [Figures: prediction accuracy for LANL and LPC workloads, with static and dynamic/updateable classifiers]
      LANL accuracy saturation ~ 82%
      LPC accuracy saturation ~ 97%
      KStar has performed slightly better than Naïve Bayes
    • Results: Time Savings
      [Figures: time savings for LANL and LPC workloads, with static and dynamic/updateable classifiers]
      With static classifier, KStar has saved 90-100%
      Updateable classifier
      For LANL, both KStar and NB ~ 50% saving
      For LPC ~ 90% saving
    • Results: Time Consumed for Classification and Updating Classifier
      [Figures: time consumed with Static Classifier and with Updateable Classifier]
      Both static and updateable Naïve Bayes classifiers take very little time (not included in graphs)
    • Results: Effect of Pruning Attributes
      Static sub-set of attributes [Fadishei09] performs poorly on this data set and classifier
      Dynamic pruning has improved accuracy of predictions compared to non-pruned case, but improvement is marginal
      Conclusion -> our classifiers are handling noise features well without compromising accuracy of classifications
      Identification of attributes to prune is a dynamic and expensive task
      System can be used in practical cases even without pruning of attributes
    • Results: Effect of Job Reliability Predictions on Selecting Compute Resources
      Poor performance of execution time priority scheduler
      After 1000 jobs (training), time wasted with our approach stays fairly constant
    • Evaluation Conclusion
      Even though average accuracy of predictions with KStar classifier has decreased with static classifier, it has managed to learn and predict failures better than any other method.
      Even though amount of time saved has increased slightly with Naïve Bayes updateable classifier, comparatively, amount of time saved using static KStar classifier is higher than both methods.
      Even though its total prediction accuracy does not match that of the other methods, static KStar classifier is ideal for correctly predicting failure cases, with very low overhead.
      Taking user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions
    • Scientific Computing Resource Abstraction Layer
      Variety of scientific computing platforms and opportunities
      Requirements
      Support existing job description languages and be extensible to support other languages.
      Provide a uniform and interoperable interface for external entities to interact with it.
      Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS, PaaS clouds, departmental clusters.
      Extensibility to support new and future resource managers with minimal changes.
      Provide monitoring and fault recovery, especially when working with utility computing resources.
      Provide light-weight, robust and scalable infrastructure.
      Integration to variety of workflow environments.
    • Scientific Computing Resource Abstraction Layer
      Our contribution
      Resource abstraction layer
      Implemented as a web service
      Provides a uniform abstraction layer over heterogeneous compute resources including grids, clouds and local departmental clusters.
      Support for standard job specification languages including, but not limited to, Job Submission Description Language (JSDL)[Anjomshoaa04] and Globus Resource Specification Language (RSL)
      Directly interacts with resource managers, so requires no grid or meta-scheduling middleware
      Integration with current resource managers, including Load Leveler, PBS, LSF and Windows HPC, Amazon EC2 and Microsoft Azure platforms
      Features
      Does not need high level of computer science knowledge to install and maintain system
      Use of Globus was a challenge for most non-computer scientists
      Involvement of system administrators to install and maintain Sigiri is minimal
      Memory footprint is minimal
      Other tools require installation of most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run. (Note that installing Globus on small clusters is something scientists never wanted to do.)
      Better fault tolerance and failure recovery.
    • Architecture
      Asynchronous messaging model of message publishers and consumers
      Daemons shadowing compute resources
      Distributed component deployment
      Daemon, front end Web service and job queue
    • Client Interaction Service
      Deployed as an Apache Axis2 Web service to enable interoperability
      Accepts job requests and enables management and monitoring functions
      Job submission schema does not enforce schema for job description
      Enables multiple job description languages
    • Client Interaction Service
      [Figures: Job Submission Request and Job Submission Response]
    • Daemons
      Each managed compute resource has a light-weight daemon
      periodically checks job request queue
      translates job specification to a resource manager specific language
      submits pending jobs and persists correlation between resource manager's job id with internal id
      Extensible daemon API
      enables integration of wide range of resource managers while keeping complexities of these resource managers transparent to end users of these systems
      Queuing based approach enables daemons to be run on any compute platform, without any software or operating system requirements
      Current Support
      LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
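      The daemon behaviour listed above (check the job request queue, translate the job specification, submit, and persist the id correlation) can be sketched as follows. This is an illustration under stated assumptions, not Sigiri's actual daemon API: the `ResourceManager` protocol, `FakePbs` adapter, in-memory queue and dictionary id map are all hypothetical stand-ins.

```python
import queue
from typing import Protocol


class ResourceManager(Protocol):
    """Hypothetical adapter interface a daemon would implement per resource."""

    def translate(self, job_spec: dict) -> str: ...
    def submit(self, native_spec: str) -> str: ...


class FakePbs:
    """Toy stand-in that only imitates a PBS-style submission."""

    def __init__(self) -> None:
        self._next_id = 0

    def translate(self, job_spec: dict) -> str:
        # Translate the generic job description to a manager-specific script.
        return f"#PBS -N {job_spec['name']}\n{job_spec['command']}"

    def submit(self, native_spec: str) -> str:
        self._next_id += 1
        return f"pbs-{self._next_id}"


def poll_once(job_queue: queue.Queue, manager: ResourceManager, id_map: dict) -> int:
    """Submit every pending request and record internal id -> manager job id."""
    submitted = 0
    while True:
        try:
            internal_id, job_spec = job_queue.get_nowait()
        except queue.Empty:
            return submitted
        native_spec = manager.translate(job_spec)
        id_map[internal_id] = manager.submit(native_spec)
        submitted += 1


pending = queue.Queue()
pending.put(("job-1", {"name": "wrf", "command": "./wrf.exe"}))
pending.put(("job-2", {"name": "wps", "command": "./wps.exe"}))
id_map = {}
count = poll_once(pending, FakePbs(), id_map)
```

      A real daemon would call something like `poll_once` on a timer and keep the id correlation in persistent storage rather than in memory.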
    • Integration of Cloud Computing Resources
      Unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements.
      Enables scientists to interact with multiple cloud providers within same system
      Features
      Extensions can be written as modules independent of other extensions, typically to carry out a single task
      Enforced failure handling to prevent orphan VMs, resources
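      One way to read "independent extensions" with "enforced failure handling" is a chain of single-task modules whose cleanup always runs, so a failed step cannot leave a VM behind. The sketch below illustrates that idea only; the `Extension` base class, the example extensions and the context dictionary are hypothetical and are not taken from the system described here.

```python
class Extension:
    """Hypothetical base class: one extension carries out one task."""

    def run(self, context: dict) -> None:
        raise NotImplementedError

    def cleanup(self, context: dict) -> None:
        pass


class ProvisionVm(Extension):
    def run(self, context: dict) -> None:
        context["vm"] = "vm-1"  # pretend a VM was started

    def cleanup(self, context: dict) -> None:
        context.pop("vm", None)  # always release the VM


class FailingJob(Extension):
    def run(self, context: dict) -> None:
        raise RuntimeError("job failed")


def run_cloud_job(extensions: list, context: dict) -> bool:
    """Run extensions in order; clean up every started one, even on failure."""
    started = []
    try:
        for extension in extensions:
            started.append(extension)
            extension.run(context)
        return True
    except Exception as exc:
        context["error"] = str(exc)
        return False
    finally:
        for extension in reversed(started):
            extension.cleanup(context)


context = {}
succeeded = run_cloud_job([ProvisionVm(), FailingJob()], context)
```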
    • Security
      Client Security
      Between client and Web service layer
      Support for both transport level security (using SSL) and application layer security (using WS-Security)
      Client negotiation of security credentials with WS-Security policy support within Apache Axis2
      Compute Resource Security
      System has support to store different types of security credentials
      Username/password combinations, X.509 credentials
    • Performance Evaluation
      Test Scenarios
      Case 1: Jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients.
      Each client waits for all jobs to finish before submitting next set of jobs.
      For example, during test with 100 clients, each client sends 1 job to server making 100 jobs coming to server in parallel.
      Case 2: Each client submits 10 jobs having varying execution times in sequence with no delay between submissions
      client does not block upon submission of a job
      failure rate and server performance, from the client's point of view, are measured while the number of simultaneous clients is systematically increased
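      A burst test in the style of Case 1 can be sketched with a thread pool, where each simulated client submits one job at the same time and the harness records response time and failures. The `submit_job` function is a hypothetical stand-in for a call to the Web service; it only sleeps briefly, so the numbers it produces mean nothing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def submit_job(client_id: int) -> float:
    """Hypothetical stand-in for one job submission; returns response time."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for the remote call
    return time.perf_counter() - start


def run_burst(n_clients: int) -> tuple:
    """Submit one job per client concurrently; return (mean time, failures)."""
    response_times, failures = [], 0
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [pool.submit(submit_job, i) for i in range(n_clients)]
        for future in futures:
            try:
                response_times.append(future.result())
            except Exception:
                failures += 1
    mean = sum(response_times) / len(response_times) if response_times else 0.0
    return mean, failures


mean_response, failed = run_burst(n_clients=10)
```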
    • Performance Evaluation: Baseline Measurements
    • Performance Evaluation: Metrics
    • Performance Evaluation: Scalability Metrics
    • Performance Evaluation
      Experimental Setup
      Daemon hosted within gatekeeper node (quad-core IBM PowerPC (1.6GHz) with 8GB of physical memory) of Big Red cluster
      System Web service and database co-hosted on a machine with four 2.6GHz dual-core processors and 32GB of RAM
      Neither of these nodes was dedicated to our experiment while the tests were running
      Client Environment
      Setup within 128 node Odin Cluster (each node is a Dual AMD 2.0GHz Opteron processor with 4GB physical memory)
      All client nodes were used in dedicated mode and each client ran on a separate Java virtual machine to eliminate any external overhead
      Data Collection
      Each test was run (number of clients × 10) times and results were averaged.
      Each parameter is tested for 100 to 1000 concurrent clients
      Total of 110,000 tests were run.
      Gram4 experiment results produced in Gram4 evaluation paper[Marru08] were used for system performance comparison.
    • Results
      Baseline Measurements
      [Figures: baseline measurements for Case 1 and Case 2]
      All overheads scale proportionally to number of clients
      No failures
    • Results
      Metrics for Test Case 1 and 2
      Both response time and total overhead scale proportionally to number of clients
      No failures
    • Results
      Scalability Metrics
      [Figures: scalability metrics for Case 1 and Case 2]
      Failures
      No failures with Sigiri
      Failures starting from 300 clients for Gram
    • Applications: LEAD
      Motivations
      Grid middleware reliability and scalability study[Marru08] and workflow failure rates.
      Components of LEAD infrastructure were considered for adaptation to other scientific environments.
      Sigiri initially prototyped to support Load Leveler, PBS and LSF.
      Implications
      Improved workflow success rates
      Mitigated need for Globus middleware
      Ability to work with non-standard job managers
    • Applications: LEAD II
      Emergence of community-driven, production-quality workflow infrastructures
      E.g. Trident Scientific Workflow Workbench with Workflow Foundation
      Possibility of using alternate supercomputing resources
      E.g. Recent port of WRF (Weather Research & Forecast) model to Windows platform, Azure
      Support for Windows based scientific computing environments.
    • Background: LEAD II and Vortex2 Experiment
      May 1, 2010 to June 15, 2010
      ~6 weeks, 7-days per week
      Workflow started on the hour, every hour, each morning.
      Had to find and bind to latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions.
      If model data was not available at NCEP and University of Oklahoma, workflow could not begin.
      Execution of complete WRF stack within 1 hour
    • Trident Vortex2 Workflow
      Bulk of time (50 min) spent in Lead Workflow Proxy Activity
      [Figure: Trident Vortex2 workflow, with the Sigiri Integration point labeled]
    • Applications: Enabling Geo-Science Application on Windows Azure
      Geo-Science Applications
      High Resource Requirements
      Compute intensive, dedicated HPC hardware
      e.g. Weather Research and Forecasting (WRF) Model
      Emergence of ensemble applications
      Large number of small jobs
      e.g. Examining each air layer, over a long period of time.
      Single experiment = about 14,000 jobs, each taking a few minutes to complete
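      To make the scale of such an ensemble concrete, the arithmetic can be sketched as below. The job count comes from the slide; the per-job duration of 3 minutes and the worker counts are assumptions chosen for illustration only, and the model ignores queueing, data transfer and startup overheads.

```python
import math


def ensemble_estimate(num_jobs: int, minutes_per_job: float, workers: int):
    """Return (total compute hours, idealized wall-clock hours).

    Assumes every job takes the same time and that jobs run in
    full "waves" of `workers` jobs at a time, with no overheads.
    """
    total_compute_hours = num_jobs * minutes_per_job / 60
    waves = math.ceil(num_jobs / workers)
    wall_clock_hours = waves * minutes_per_job / 60
    return total_compute_hours, wall_clock_hours


if __name__ == "__main__":
    NUM_JOBS = 14_000        # from the slide
    MINUTES_PER_JOB = 3      # assumption: "a few minutes"
    for workers in (1, 100, 1000):   # assumption: illustrative pool sizes
        total, wall = ensemble_estimate(NUM_JOBS, MINUTES_PER_JOB, workers)
        print(f"{workers:>5} workers: {total:.0f} compute hours, "
              f"{wall:.1f} wall-clock hours")
```

      Under these assumptions the compute demand stays constant while the wall-clock time shrinks with the number of workers, which is why on-demand pools suit this workload.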
    • Geo-Science Applications: Opportunities
      Cloud computing resources
      On-demand access to “unlimited” resources
      Flexibility
      Worker roles and VM roles
      Recent porting of geo-science applications
      WRF, WRF Preprocessing System (WPS) port to Windows
      Increased use of ensemble applications (large number of small runs)
      Production-quality, open-source scientific workflow systems
      Microsoft Trident
    • Research Vision
      Enabling geo-science experiments
      Type of applications
      Compute intensive, ensembles
      Type of scientists
      Meteorologists, atmospheric scientists, emergency management personnel, geologists
      Utilizing both Cloud computing and Grid computing resources
      Utilizing open-source, production-quality scientific workflow environments
      Improved data and meta-data management
      Geo-Science Applications
      Scientific Workflows
      Compute Resources
    • Proposed Framework
      [Architecture diagram; labels: Azure Blob Store, Azure Management API, Sigiri, Job Mgmt. Daemons, Azure Fabric, Web Service, Trident Activity, Job Queue, Azure Custom VM Images, VM Instance, IIS, WRF, Sigiri Worker Service, MSMPI, Windows 2008R2]
    • Applications: PRAGMA Testbed Support
      Pacific Rim Application and Grid Middleware Assembly (PRAGMA)[Zheng06]
      An open international organization founded in 2002 to focus on practical issues of building international scientific collaborations
      In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster for testbed.
      Sigiri was used within IU PRAGMA testbed
      IU PRAGMA testbed system required a light-weight system that could be installed and maintained with minimal effort.
      IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no changes to interfaces.
      In 2011, PRAGMA - Opal - Sigiri integration was demonstrated successfully
    • Related Work
      Scientific Job Management Systems
      Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07]
      Provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems.
      Carmen[Watson81] project
      provided a cloud environment that has enabled collaboration between neuroscientists
      requires all programs to be packaged as WS-I[Ballinger04] compliant Web services
      Condor[Frey02] pools can also be utilized to unify certain compute resource interactions.
      uses Globus toolkit[Foster05] (and GRAM underneath)
      Poor failure recovery
      overlooks failure modes of a cloud platform
    • Related Work
      Scientific Research and Cloud Computing
      IaaS, PaaS and SaaS environment evaluations
      Scientists have mainly evaluated use of IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08] [Yu05]
      Ease of setting up custom environments and control
      Growing interest in using PaaS services[Humphrey10][Lu10] [Qiu09]
      Optimization to balance cost and time of executions[Deelman08][Yu05]
      Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
      Job Prediction Algorithms
      Prediction of
      Execution times[Smith], job start times[Li04], queue-wait times[Nurmi07] and resource requirements[Julian04]
      AI based and statistical modeling based approaches
      AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance
      Reliability of Compute Resources
      Birman[Birman05] and aspects of resources causing system reliability issues
      Statistical modeling to predict failures[Kandaswamy08]
    • Conclusion
      User inspired management of scientific jobs
      Concentrates on identification of user patterns and perceptions
      Harnesses historical information
      Applies knowledge gained to improve scientific job executions
      Argues that patterns, if identified based on individual users, can reveal important information to make sophisticated estimations on resource requirements
      Evaluations demonstrate usability of predictions for meta-schedulers, especially ones integrated into community gateways, to improve their scheduling decisions.
      Resource abstraction service
      Helps mid-scale scientists to obtain access to resources that are cheap and available
      Strives to do so with a tool that is easy to set up and administer
      Prototype implementations introduced and discussed are integrated and used in different domains and scientific applications
      Applications demonstrate how our research contributed to advancing science in the respective domains.
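      The use of per-user patterns to provision ahead of demand can be illustrated with the sketch below. It is a deliberately simple stand-in, not the knowledge-based approach of the thesis: it assumes a user's next submission follows the median gap between their past submissions, and it starts provisioning early by a given startup overhead. Function names, the use of seconds as the time unit and the sample values are all assumptions.

```python
from statistics import median
from typing import Optional, Sequence


def next_submission_estimate(submit_times: Sequence[float]) -> Optional[float]:
    """Estimate a user's next submission time (seconds).

    Uses the median gap between consecutive past submissions.
    Returns None when there is not enough history.
    """
    if len(submit_times) < 2:
        return None
    ordered = sorted(submit_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return ordered[-1] + median(gaps)


def provision_start_time(submit_times: Sequence[float],
                         startup_overhead: float) -> Optional[float]:
    """Time at which to start a resource so it is ready when the job arrives."""
    predicted = next_submission_estimate(submit_times)
    if predicted is None:
        return None
    return predicted - startup_overhead


if __name__ == "__main__":
    # Assumption: one user submitting hourly, with a 10-minute startup cost.
    history = [0, 3600, 7200, 10800]
    print(provision_start_time(history, startup_overhead=600))
```

      For a user with a regular hourly pattern, the resource would be requested ten minutes before the predicted submission, hiding the startup overhead from the user.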
    • Contributions
      Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
      Propose and empirically demonstrate user perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selections.
      Propose and demonstrate effectiveness and applicability of a light-weight and reliable resource abstraction service to hide complexities of interacting with multiple resource managers in grids and clouds.
      Prototype implementation to evaluate feasibility and performance of resource abstraction service and integration with four different application domains to prove its usability.
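      User-perceived reliability as a selection dimension can be sketched as follows. This is an illustrative simplification, not the method evaluated in the thesis: it assumes history records of the form (user, resource, succeeded) and scores each resource by a Laplace-smoothed success rate computed from that user's own jobs only. All names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

JobRecord = Tuple[str, str, bool]  # (user, resource, succeeded)


def perceived_reliability(history: Iterable[JobRecord], user: str) -> Dict[str, float]:
    """Per-resource success rate as experienced by one user.

    Laplace smoothing, (successes + 1) / (jobs + 2), keeps a resource
    with very few jobs from scoring exactly 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # resource -> [successes, jobs]
    for rec_user, resource, succeeded in history:
        if rec_user != user:
            continue
        counts[resource][1] += 1
        if succeeded:
            counts[resource][0] += 1
    return {res: (ok + 1) / (n + 2) for res, (ok, n) in counts.items()}


def pick_resource(history: Iterable[JobRecord], user: str) -> Optional[str]:
    """Choose the resource this user has found most reliable, if any."""
    scores = perceived_reliability(history, user)
    if not scores:
        return None
    # Sorting first makes ties resolve to the alphabetically first name.
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    sample = [
        ("alice", "clusterA", True),
        ("alice", "clusterA", False),
        ("alice", "clusterB", True),
        ("alice", "clusterB", True),
        ("bob", "clusterA", True),
    ]
    print(perceived_reliability(sample, "alice"))
    print(pick_resource(sample, "alice"))
```

      Because the score is computed per user, two users can rank the same resources differently, which is the sense in which reliability is "perceived" here.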
    • Future Work
      Short term research directions
      Integration of future job predictions and user-perceived reliability predictions
      Evolving resource abstraction service to support more compute resources
      Management of ensemble runs
      Fault tolerance with proactive replication
      Long term research directions
    • Thank You !!