User Inspired Management of Scientific Jobs in Grids and Clouds
Eran Chinthaka Withana
School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Doctoral Committee: Professor Beth Plale, PhD; Dr. Dennis Gannon, PhD; Professor Geoffrey Fox, PhD; Professor David Leake, PhD
Outline
- Mid-Range Science: Challenges and Opportunities; Current Landscape
- Research: Research Questions; Contributions
- Mining Historical Information to Find Patterns and Experiences
  - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
  - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
- Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
- Applications
- Related Work
- Conclusion and Future Work
Thesis Defense - Eran Chinthaka Withana
Mid-Range Science
- Challenges
  - Resource requirements go beyond the lab and university, but are not suited for large-scale resources
  - Difficulty finding sufficient compute resources (e.g., short-term forecasts in LEAD for energy and agriculture)
  - Lack of resources to have a strong CS support person on the team
  - Need for less expensive and more available resources
- Opportunities
  - Wide variety of computational resources
  - Science gateways
Current Landscape
- Grid Computing
  - Batch orientation, long queues even under moderate load, no access transparency
  - Drawbacks in quota systems
  - Level of computer science expertise required
- Cloud Computing
  - High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
  - Payment policies and research cost models
- Use of Workflow Systems
  - Hybrid workflows enable utilization of heterogeneous compute resources (e.g., the Vortex2 experiment)
- Need for resource abstraction layers and optimal selection of resources
- Need for improvement of scientific job executions
  - Better scheduler decisions, selection of compute resources
  - Reliability issues in compute resources
  - Importance of learning user patterns and experiences

[1] M. Armbrust et al. Above the Clouds: A Berkeley View of Cloud Computing. Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
Research Questions
- "Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
- "Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
- "Can these be put to use to advance science?"
Contributions
- Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
- Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
- Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
- Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
- Objective: reduce the impact of startup overheads for time-critical applications
- Problem space
  - Workflows can have multiple paths; workflow descriptions are not available
  - Need predictions to identify the job execution sequence
  - Learn from user behavioral patterns to predict future jobs
- Research outline
  - Algorithm to predict future jobs by extracting user patterns from historical information
  - Uses knowledge-based techniques
  - Starts from zero knowledge, or from pre-populated job information describing connections between jobs
  - Similar retrieved cases are used to predict future jobs, reducing high startup overheads
- Algorithm assessment
  - Two workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
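The thesis uses a knowledge-based (case-based reasoning) engine for these predictions; as a minimal illustrative stand-in, the core idea of learning per-user job sequences can be sketched with a simple frequency model (the class and method names here are hypothetical, not the thesis's actual API):

```python
from collections import Counter, defaultdict

class NextJobPredictor:
    """Simplified stand-in for the prediction engine: for each user,
    count which job historically follows which, and predict the most
    frequent successor so its resources can be provisioned early."""

    def __init__(self):
        # per-user map: job name -> Counter of observed next jobs
        self.history = defaultdict(lambda: defaultdict(Counter))

    def record(self, user, job, next_job):
        """Add one observed job transition from historical information."""
        self.history[user][job][next_job] += 1

    def predict(self, user, job):
        """Predict the next job, or None under zero knowledge."""
        successors = self.history[user][job]
        if not successors:
            return None
        return successors.most_common(1)[0][0]
```

For example, if a user's history shows WRF followed twice by a crop model and once by a fire simulation, the predictor would provision for the crop model next.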
Demonstration of User Patterns with Workflows
- The suite of workflows can differ from domain to domain (e.g., WRF (Weather Research and Forecasting) as an upstream node)
- User patterns reveal the sequence of jobs, taking different users/domains into consideration
- Useful for a science gateway serving a wide range of mid-scale scientists
- [Diagram: WRF feeding downstream workflows for weather predictions, crop predictions, wind farm location evaluations and wild fire propagation simulation]
Role of Successful Predictions in Reducing Startup Overheads
- The largest gain is achieved when prediction accuracy is high and setup time (s) is large relative to execution time (t)
- r = probability of a successful prediction (prediction accuracy)
- Percentage time reduction = (r * s) / (t + s)
- For simplicity, assuming equal job execution and startup times (t = s): percentage time reduction = r / 2
Relationship of Predictions to Execution Time
- Observations
  - Percentage time reduction increases with the accuracy of predictions
  - Time reduction falls off sharply as the work-to-overhead ratio (t/s) increases
- Need to find the critical point for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy
- Cost of wrong predictions
  - Depends on the compute resource
  - Demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions
  - Compromising cost to improve time
- Percentage time reduction = (r * s) / (t + s)
- Accuracy of predictions = total successful future job predictions / total predictions
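The relationship above can be checked numerically. This sketch assumes the time-reduction formula as reconstructed from the slide (saved time is the startup overhead s, hidden with probability r, out of total turnaround t + s):

```python
def time_reduction(r, t, s):
    """Fraction of total turnaround time (t + s) hidden by provisioning
    ahead of time: a correct prediction (probability r) masks the
    startup overhead s.  Formula: r * s / (t + s)."""
    return r * s / (t + s)

# With equal execution and startup times (t = s) the reduction is r / 2,
# so ~90% prediction accuracy hides about 45% of the turnaround time:
assert abs(time_reduction(0.9, t=180, s=180) - 0.45) < 1e-12

# The benefit shrinks as the work-to-overhead ratio t/s grows:
assert time_reduction(0.9, t=8, s=1) < time_reduction(0.9, t=1, s=1)
```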
Prediction Engine: System Architecture
[Architecture diagram; includes the Prediction Retriever component]
Use of Reasoning
- Store and retrieve cases
- Steps
  - Retrieval of similar cases: similarity measurement, use of thresholds
  - Reuse of old cases: case adaptation
  - Storage
Case Similarity Calculation
- Each case is represented by a set of attributes
- Attributes are selected by finding their effect on the goal variable (the next job)
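A similarity measure over attribute sets like the one described can be sketched as a weighted overlap; the attribute names and weighting scheme below are hypothetical illustrations, not the exact measure used in the thesis:

```python
def case_similarity(case_a, case_b, weights):
    """Weighted-overlap similarity between two cases, each a dict of
    attribute -> value (e.g. user, job name, queue).  Weights reflect
    each attribute's effect on the goal variable (the next job).
    Returns a score in [0, 1]."""
    total = sum(weights.values())
    if not total:
        return 0.0
    matched = sum(w for attr, w in weights.items()
                  if case_a.get(attr) == case_b.get(attr))
    return matched / total
```

A retrieved case whose score exceeds a configured threshold would then be reused (after adaptation) to predict the next job.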
Evaluation
- Use cases
  - Individual job workload: 140k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab[1]
  - Workflow use case: the system does not see or assume a workflow specification
- Experimental setup: 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system

[1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
Evaluation: Average Accuracy of Predictions
- Individual jobs workload: ~75% accurate predictions with user patterns; ~32% with service names
- Workflow workload: ~95% accurate predictions with user patterns; ~53% with service names
Evaluation: Time Saved
- Amount of time that can be saved if resources are provisioned by the time a job is ready to run
- Startup time assumed to be 3 minutes (average for commercial providers)
- [Graphs: time saved for the individual jobs workload and the workflow workload]
Evaluation: Prediction Accuracies for Use Cases
- User-pattern-based predictions perform about 2x better than service-name-based predictions
User Perceived Reliability
- Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]; what matters from a user's point of view is whether these failures are visible to users
  - E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
- This reliability is not a property of the resources themselves
  - Not derived from halting failures, fail-stop failures, network partitioning failures [Birman05] or machine downtimes
  - It is a more broadly encompassing system reliability that can only be seen at the user or workflow level
  - Can also depend on the user's configuration and job types
- We refer to this form of reliability as user-perceived reliability
- Importance of user-perceived reliability
  - Selecting a resource to schedule an experiment when the user has access to multiple compute resources
  - E.g., in LEAD: reliability of supercomputing resources vs. Windows Azure resources
Why User Perceived Reliability is Useful
- User-perceived failure probabilities: cluster A, p(A) = 0.2, and cluster B, p(B) = 0.3
- Probability that the first attempt is wasted (first cluster fails while the other would have succeeded):
  - Try A first: p(A) * (1 - p(B)) = 0.2 * (1 - 0.3) = 0.14
  - Try B first: p(B) * (1 - p(A)) = 0.3 * (1 - 0.2) = 0.24
- Since 0.14 < 0.24, try cluster A first and then cluster B
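The slide's ordering rule is easy to express in code; this is a minimal sketch of the arithmetic above (the function name is illustrative):

```python
import math

def first_attempt_waste(p_first, p_second):
    """Probability that the first-tried cluster fails while the second
    would have succeeded, i.e. the wasted-attempt probability the slide
    uses to order resource attempts (p_* are failure probabilities)."""
    return p_first * (1 - p_second)

p_a, p_b = 0.2, 0.3
try_a_first = first_attempt_waste(p_a, p_b)  # 0.2 * 0.7 = 0.14
try_b_first = first_attempt_waste(p_b, p_a)  # 0.3 * 0.8 = 0.24
# Lower wasted-attempt probability wins, so try A first:
order = ["A", "B"] if try_a_first < try_b_first else ["B", "A"]
```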
Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
- Objective
  - Reduce the impact of low reliability of compute resources
  - Deduce user-perceived reliabilities by learning from user experiences and perceptions
- Research outline
  - Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
  - Uses machine learning techniques: trained classifiers represent compute resources and their reliabilities
  - Prediction of job failures
- Algorithm assessment
  - Workloads from the Parallel Workload Archive representing jobs executed in two different supercomputing clusters
System Architecture
- A machine learning classifier is trained to learn the user-perceived reliability of each cluster
- Classifier types
  - Static classifier: trained initially from historical information
  - Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
System Architecture
- Classifier manager
  - Uses the Weka[Hall09] framework; classification methods: Naïve Bayes and KStar; static and dynamic classifiers
  - Dynamic pruning of features[Fadishei09] for increased efficiency
  - Creates and maintains classifiers for each compute resource
  - A new job is evaluated against these classifiers to deduce the predicted reliability of its execution
- Policy implementers
  - Consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
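The thesis uses Weka's Naïve Bayes and KStar implementations; as a language-neutral illustration of the idea of one updateable classifier per compute resource, here is a minimal hand-rolled Naïve Bayes over categorical job attributes (the class, attribute names and label strings are assumptions for the sketch):

```python
import math
from collections import Counter, defaultdict

class ReliabilityClassifier:
    """Illustrative updateable Naive Bayes predicting 'success'/'fail'
    for jobs on one compute resource, from categorical attributes."""

    def __init__(self):
        self.label_counts = Counter()
        # attribute -> label -> value -> count
        self.value_counts = defaultdict(lambda: defaultdict(Counter))

    def update(self, job, label):
        """Incrementally learn from one finished job (dict of attrs)."""
        self.label_counts[label] += 1
        for attr, value in job.items():
            self.value_counts[attr][label][value] += 1

    def predict(self, job):
        """Return the more likely label, with Laplace smoothing."""
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            for attr, value in job.items():
                seen = self.value_counts[attr][label]
                score += math.log((seen[value] + 1) / (count + len(seen) + 1))
            if score > best_score:
                best, best_score = label, score
        return best
```

A classifier manager would keep one such instance per resource, update it as jobs complete, and let policy implementers combine its predictions with time and cost information.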
Evaluation
- Workloads from the Parallel Workload Archive[Feitelson]
  - LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
  - LPC: ten months (Aug 2004 to May 2005) of job records on a 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire of Université Blaise Pascal, France
- Minor cleanups to remove intermediate job states
- 10,000 jobs were selected from each workload
  - LANL had 20% failed jobs; LPC had 30% failed jobs
Evaluation
- Workload classification and maintenance
  - Classifiers: Naïve Bayes[John95] and KStar[Cleary95] implementations in Weka[Hall09]
  - Static classifier: the first 1000 jobs train the classifier
  - Dynamic classifier: all 10,000 jobs used for classifier construction and evaluation
- Evaluation metrics
  - Average reliability prediction accuracy: accuracy of predicting the success/failure of a job
  - Time saved: cumulative execution time of failed jobs whose failure our system predicted correctly
    - Baseline: ideal cumulative time that could be saved over time
  - Time consumed for classification and updating the classifier
  - Effect of pruning attributes: static subset of attributes (as proposed by Fadishei et al.[Fadishei09]) vs. a dynamic subset (checking each attribute's effect on the goal variable)
Evaluation
- Effect of job reliability predictions on selecting compute resources
  - An extended version of GridSim[Buyya02] models four compute resources
  - NWS[Wolski99] for bandwidth estimation and QBETS[Nurmi07] for queue wait time estimation
  - Total execution time = data movement time + queue wait time + job execution time (from the workload)
- Schedulers
  - Total execution time priority scheduler
  - Reliability-prediction-based time priority scheduler
- Metrics
  - Average accuracy of selecting reliable resources to execute jobs
  - Time wasted due to incorrect selection of compute resources
- All evaluations were run on a 3.0GHz dual-core processor with 4GB memory on the Windows 7 Professional operating system
Evaluation Metrics Summary
Results: Average Reliability Prediction Accuracy
- LANL accuracy saturates at ~82%; LPC accuracy saturates at ~97%
- KStar performed slightly better than Naïve Bayes
- [Graphs: static and dynamic/updateable classifiers on the LANL and LPC workloads]
Results: Time Savings
- With the static classifier, KStar saved 90-100%
- Updateable classifier: for LANL, both KStar and Naïve Bayes saved ~50%; for LPC, ~90%
- [Graphs: static and dynamic/updateable classifiers on the LANL and LPC workloads]
Results: Time Consumed for Classification and Updating the Classifier
- Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
- [Graphs: static and updateable classifier timings]
Results: Effect of Pruning Attributes
- A static subset of attributes (Fadishei09) performs poorly on this data set and classifier
- Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
- Conclusion: our classifiers handle noise features well without compromising classification accuracy
- Identifying attributes to prune is a dynamic and expensive task, so the system can be used in practical cases even without attribute pruning
Results: Effect of Job Reliability Predictions on Selecting Compute Resources
- Poor performance of the execution time priority scheduler
- After 1000 jobs (training), the time wasted with our approach stays fairly constant
Evaluation Conclusion
- Even though the average prediction accuracy of the static KStar classifier decreased, it learned to predict failures better than any other method
- Even though the amount of time saved increased slightly with the updateable Naïve Bayes classifier, the time saved using the static KStar classifier is higher than with either updateable method
- Even though its total prediction accuracy does not match the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead
- Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions
Scientific Computing Resource Abstraction Layer
- Wide variety of scientific computing platforms and opportunities
- Requirements
  - Support existing job description languages and be extensible to support other languages
  - Provide a uniform and interoperable interface for external entities
  - Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
  - Extensibility to support new and future resource managers with minimal changes
  - Provide monitoring and fault recovery, especially when working with utility computing resources
  - Provide light-weight, robust and scalable infrastructure
  - Integrate with a variety of workflow environments
Scientific Computing Resource Abstraction Layer
- Our contribution: a resource abstraction layer (Sigiri), implemented as a Web service
  - Provides a uniform abstraction over heterogeneous compute resources including grids, clouds and local departmental clusters
  - Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL)[Anjomshoaa04] and the Globus Resource Specification Language (RSL)
  - Interacts directly with resource managers, so it requires no grid or meta-scheduling middleware
  - Integrates with current resource managers including LoadLeveler, PBS, LSF and Windows HPC, as well as the Amazon EC2 and Microsoft Azure platforms
- Features
  - Does not require a high level of computer science knowledge to install and maintain (using Globus was a challenge for most non-computer scientists)
  - Involvement of system administrators in installing and maintaining Sigiri is minimal; the memory footprint is minimal
  - Other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on a small cluster is something scientists never wanted to do)
  - Better fault tolerance and failure recovery
Architecture
- Asynchronous messaging model of message publishers and consumers
- Daemons shadowing compute resources
- Distributed component deployment: daemon, front-end Web service and job queue
Client Interaction Service
- Deployed as an Apache Axis2 Web service to enable interoperability
- Accepts job requests and enables management and monitoring functions
- The job submission schema does not enforce a schema for the job description itself, enabling multiple job description languages
Client Interaction Service
[Figure: job submission request and response message structures]
Daemons
- Each managed compute resource has a light-weight daemon that
  - periodically checks the job request queue
  - translates the job specification to a resource-manager-specific language
  - submits pending jobs and persists the correlation between the resource manager's job id and the internal id
- An extensible daemon API enables integration of a wide range of resource managers while keeping their complexities transparent to end users
- The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
- Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
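The daemon's poll-translate-submit-persist cycle described above can be sketched as follows. This is a minimal illustration, not Sigiri's actual (Java) implementation; the `JobQueue`/`ResourceManager`-style collaborators and their method names are hypothetical:

```python
import time

class Daemon:
    """Illustrative daemon loop for one managed compute resource:
    poll the job request queue, translate each pending job description
    to the resource manager's language, submit it, and persist the
    mapping from the manager's job id to Sigiri's internal id."""

    def __init__(self, queue, manager, id_store, poll_interval=5.0):
        self.queue = queue            # source of pending job requests
        self.manager = manager        # adapter for one resource manager
        self.id_store = id_store      # mapping: internal id -> manager job id
        self.poll_interval = poll_interval

    def run_once(self):
        for job in self.queue.pending_jobs():
            script = self.manager.translate(job.description)
            manager_id = self.manager.submit(script)
            self.id_store[job.internal_id] = manager_id

    def run_forever(self):
        while True:
            self.run_once()
            time.sleep(self.poll_interval)
```

Because the daemon only pulls from a queue, it needs no inbound connectivity, which is what lets it run on any compute platform.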
Integration of Cloud Computing Resources
- A unique set of dynamically loaded and configured extensions handles security, schedules jobs and performs required data movements
- Enables scientists to interact with multiple cloud providers within the same system
- Features
  - Extensions can be written as modules independent of other extensions, typically to carry out a single task
  - Enforced failure handling to prevent orphaned VMs and resources
Security
- Client security
  - Between the client and the Web service layer
  - Support for both transport-level security (using SSL) and application-level security (using WS-Security)
  - Client negotiation of security credentials with WS-Security policy support within Apache Axis2
- Compute resource security
  - The system can store different types of security credentials: username/password combinations, X.509 credentials
Performance Evaluation
- Test scenarios
  - Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
    - Each client waits for all jobs to finish before submitting its next set of jobs
    - For example, in the test with 100 clients, each client sends 1 job, so 100 jobs arrive at the server in parallel
  - Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
    - The client does not block upon submission of a job
- Failure rate and server performance, from the client's point of view, are measured while the number of simultaneous clients is systematically increased
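Test Case 1's burst of simultaneous submissions can be sketched with a thread barrier; the `submit` callable here is a hypothetical stand-in for the real Sigiri Web service call, not the actual client code used in the thesis:

```python
import threading

def burst_submit(submit, n_clients):
    """Sketch of test Case 1: n_clients concurrent clients each submit
    one job at the same moment.  A barrier releases all clients at once
    so the server sees a true burst; returns each client's result."""
    results = [None] * n_clients
    barrier = threading.Barrier(n_clients)

    def client(i):
        barrier.wait()                      # synchronize the burst
        results[i] = submit(f"job-from-client-{i}")

    threads = [threading.Thread(target=client, args=(i,))
               for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:                       # wait for all jobs to finish
        t.join()
    return results
```

Case 2 would differ by having each client loop over 10 non-blocking submissions instead of one synchronized call.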
Performance Evaluation: Baseline Measurements
Performance Evaluation: Metrics
Performance Evaluation: Scalability Metrics
Performance Evaluation
- Experimental setup
  - Daemon hosted on the gatekeeper node of the Big Red cluster (quad-core 1.6GHz IBM PowerPC with 8GB of physical memory)
  - The Web service and database co-hosted on a machine with four 2.6GHz dual-core processors and 32GB of RAM
  - Neither node was dedicated to our experiment while the tests were running
- Client environment
  - Set up within the 128-node Odin cluster (each node a dual 2.0GHz AMD Opteron with 4GB of physical memory)
  - All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate external overhead
- Data collection
  - Each test was run (number of clients * 10) times and the results averaged
  - Each parameter was tested for 100 to 1000 concurrent clients; a total of 110,000 tests were run
  - GRAM4 results from the GRAM4 evaluation paper[Marru08] were used for system performance comparison
Results: Baseline Measurements
- All overheads scale proportionally with the number of clients
- No failures
- [Graphs: Case 1 and Case 2]
Results: Metrics for Test Cases 1 and 2
- Both response time and total overhead scale proportionally with the number of clients
- No failures
Results: Scalability Metrics
- No failures with Sigiri; failures starting from 300 clients for GRAM
- [Graphs: Case 1 and Case 2]
Applications: LEAD
- Motivations
  - Grid middleware reliability and scalability study[Marru08] and workflow failure rates
  - Components of the LEAD infrastructure were considered for adaptation to other scientific environments
  - Sigiri was initially prototyped to support LoadLeveler, PBS and LSF
- Implications
  - Improved workflow success rates
  - Mitigated the need for Globus middleware
  - Ability to work with non-standard job managers
Applications: LEAD II
- Emergence of community-driven, production-quality workflow infrastructures
  - E.g., the Trident Scientific Workflow Workbench built on Windows Workflow Foundation
- Possibility of using alternate supercomputing resources
  - E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
- Support for Windows-based scientific computing environments
Background: LEAD II and the Vortex2 Experiment
- May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
- Workflow started on the hour, every hour, each morning
- Had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions
  - If model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
- Execution of the complete WRF stack within 1 hour
Trident Vortex2 Workflow
- Bulk of the time (50 min) is spent in the LEAD Workflow Proxy Activity
- [Workflow diagram; shows the Sigiri integration point]
Applications: Enabling Geo-Science Applications on Windows Azure
- Geo-science applications have high resource requirements
  - Compute intensive, dedicated HPC hardware (e.g., the Weather Research and Forecasting (WRF) model)
- Emergence of ensemble applications: large numbers of small jobs
  - E.g., examining each air layer over a long period of time; a single experiment is about 14,000 jobs, each taking a few minutes to complete
Geo-Science Applications: Opportunities
- Cloud computing resources: on-demand access to "unlimited" resources
- Flexibility: worker roles and VM roles
- Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows
- Increased use of ensemble applications (large numbers of small runs)
- Production-quality, open-source scientific workflow systems (e.g., Microsoft Trident)
Research Vision
- Enabling geo-science experiments
  - Types of applications: compute intensive, ensembles
  - Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
- Utilizing both cloud computing and grid computing resources
- Utilizing open-source, production-quality scientific workflow environments
- Improved data and metadata management
- [Diagram: geo-science applications, scientific workflows, compute resources]
Proposed Framework
[Architecture diagram components: Trident activity, Sigiri Web service, job queue, Sigiri job management daemons, Azure Management API, Azure Blob Store, Azure fabric, custom Azure VM images, and VM instances running Windows Server 2008 R2 with IIS, WRF, MS-MPI and the Sigiri worker service]
Applications: PRAGMA Testbed Support
- Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]: an open international organization founded in 2002 to focus on the practical issues of building international scientific collaborations
- In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed; Sigiri was used within the IU PRAGMA testbed
  - The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
  - The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces
- In 2011, the PRAGMA-Opal-Sigiri integration was demonstrated successfully
Related Work
- Scientific job management systems
  - Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07]
    - Provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
  - The Carmen[Watson81] project provided a cloud environment that enabled collaboration between neuroscientists
    - Requires all programs to be packaged as WS-I[Ballinger04] compliant Web services
  - Condor[Frey02] pools can also be utilized to unify certain compute resource interactions
    - Uses the Globus toolkit[Foster05] (and GRAM underneath); poor failure recovery; overlooks the failure modes of a cloud platform
Related Work
- Scientific research and cloud computing
  - IaaS, PaaS and SaaS environment evaluations
    - Scientists have mainly evaluated IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08][Yu05]: ease of setting up custom environments, and control
    - Growing interest in using PaaS services[Humphrey10][Lu10][Qiu09]
  - Optimizations to balance the cost and time of executions[Deelman08][Yu05]
  - Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
- Job prediction algorithms
  - Prediction of execution times[Smith], job start times[Li04], queue wait times[Nurmi07] and resource requirements[Julian04]
  - AI-based and statistical-modeling-based approaches
  - AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance
- Reliability of compute resources
  - Birman[Birman05] and the aspects of resources causing system reliability issues
  - Statistical modeling to predict failures[Kandaswamy08]
Conclusion
- User inspired management of scientific jobs
  - Concentrates on identifying user patterns and perceptions
  - Harnesses historical information and applies the knowledge gained to improve scientific job executions
  - Argues that patterns, if identified for individual users, can reveal important information for making sophisticated estimates of resource requirements
  - Evaluations demonstrate the usability of these predictions for meta-schedulers, especially ones integrated into community gateways, to improve their scheduling decisions
- Resource abstraction service
  - Helps mid-scale scientists obtain access to resources that are cheap and available
  - Strives to do so with a tool that is easy to set up and administer
  - The prototype implementation introduced and discussed is integrated and used in different domains and scientific applications
  - The applications demonstrate how our research contributed to advancing science in the respective domains
Contributions
- Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
- Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
- Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
- Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Future WorkShort term research directionsIntegration of future job predictions and user-perceived reliability predictionsEvolving resource abstraction service to support more compute resourcesManagement of ensemble runsFault tolerance with proactive replicationLong Term Research DirectionsThesis Defense - Eran Chinthaka Withana70
Thank You !!Thesis Defense - Eran Chinthaka Withana71

User Inspired Management of Scientific Jobs in Grids and Clouds

  • 1.
    User Inspired Management of Scientific Jobs in Grids and Clouds
    Eran Chinthaka Withana
    School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
    Doctoral Committee: Professor Beth Plale, PhD; Dr. Dennis Gannon, PhD; Professor Geoffrey Fox, PhD; Professor David Leake, PhD
  • 2.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
    Thesis Defense - Eran Chinthaka Withana
  • 3.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 4.
    Mid-Range Science
    - Challenges
      - Resource requirements go beyond the lab and university, but are not suited for large-scale resources
      - Difficulties finding sufficient compute resources (e.g., short-term forecasts in LEAD for energy and agriculture)
      - Lack of resources for a strong CS support person on the team
      - Need for less expensive and more available resources
    - Opportunities
      - Wide variety of computational resources
      - Science gateways
  • 5.
    Current Landscape
    - Grid Computing
      - Batch orientation, long queues even under moderate loads, no access transparency
      - Drawbacks in the quota system
      - Levels of computer science expertise required
    - Cloud Computing
      - High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
      - Payment policy and research cost models
    - Use of Workflow Systems
      - Hybrid workflows enable utilization of heterogeneous compute resources (e.g., the Vortex2 experiment)
    - Need for resource abstraction layers and optimal selection of resources
    - Need for improvement of scientific job executions
      - Better scheduler decisions, selection of compute resources
      - Reliability issues in compute resources
      - Importance of learning user patterns and experiences
    [1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
  • 6.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 7.
    Research Questions
    - "Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
    - "Can a simple, reliable, and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
    - "Can these be put to use to advance science?"
  • 8.
    Contributions
    - Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
    - Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
    - Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
    - Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to prove its usability.
  • 9.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 10.
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
    - Objective: reduce the impact of startup overheads for time-critical applications
    - Problem space
      - Workflows can have multiple paths, and workflow descriptions are not available
      - Need predictions to identify the job execution sequence
      - Learn from user behavioral patterns to predict future jobs
    - Research outline
      - Algorithm to predict future jobs by extracting user patterns from historical information
      - Use of knowledge-based techniques
      - Zero knowledge, or pre-populated job information consisting of connections between jobs
      - Similar cases retrieved are used to predict future jobs, reducing high startup overheads
    - Algorithm assessment: two workloads, representing individual scientific jobs executed at LANL and a set of workflows executed by three users
  • 11.
    Demonstration of User Patterns with Workflows
    - The suite of workflows can differ from domain to domain, e.g., WRF (Weather Research and Forecasting) as the upstream node
    - User patterns reveal the sequence of jobs, taking different users/domains into consideration
    - Useful for a science gateway serving a wide range of mid-scale scientists
    (Figure: WRF feeding Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation)
  • 12.
    Role of Successful Predictions to Reduce Startup Overheads
    - The largest gain is achieved when prediction accuracy is high and setup time (s) is large with respect to execution time (t)
    - r = probability of a successful prediction (prediction accuracy)
    - Percentage time reduction = 100 · r · s / (t + s)
    - For simplicity, assuming equal job execution and startup times (t = s): percentage time reduction = 100 · r / 2
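The reduction formula on this slide can be checked with a short Python sketch (the function name is ours; the formula is the slide's, reconstructed from r, t and s):

```python
def percentage_time_reduction(r: float, t: float, s: float) -> float:
    """Expected percentage of the total turnaround (t + s) saved when a
    correct prediction (probability r) lets the startup time s be paid
    in advance instead of serially before the job runs."""
    return 100.0 * r * s / (t + s)

# Equal execution and startup times (t = s): the reduction is 50 * r percent.
print(percentage_time_reduction(0.9, 1, 1))  # 45.0
```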
  • 13.
    Relationship of Predictions to Execution Time
    - Percentage time reduction = 100 · r · s / (t + s), where accuracy of predictions r = total successful future job predictions / total predictions
    - Observations
      - Percentage time reduction increases with the accuracy of predictions
      - Time reduction falls off quickly as the work-to-overhead ratio (t/s) increases
    - Need to find the critical point for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy
    - Cost of wrong predictions
      - Depends on the compute resource
      - Demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions
      - Compromising cost to improve time
  • 14.
    Prediction Engine: System Architecture
    (Figure: architecture diagram, including the Prediction Retriever component)
  • 15.
    Use of Reasoning
    - Store and retrieve cases
    - Steps
      - Retrieval of similar cases: similarity measurement, use of thresholds
      - Reuse of old cases: case adaptation
      - Storage
  • 16.
    Case Similarity Calculation
    - Each case is represented by a set of attributes
    - Attributes are selected by finding their effect on the goal variable (next job)
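A minimal Python sketch of this retrieve-and-reuse step; the attribute names, weights and threshold are illustrative, not the exact feature set used in the thesis:

```python
def case_similarity(case_a, case_b, weights):
    """Weighted fraction of attributes on which two cases agree."""
    matched = sum(w for attr, w in weights.items()
                  if case_a.get(attr) == case_b.get(attr))
    return matched / sum(weights.values())

def predict_next_job(history, current, weights, threshold=0.5):
    """Retrieve the most similar past case above the threshold and reuse
    its recorded next job as the prediction (None if nothing is similar)."""
    best_case, best_sim = None, threshold
    for past in history:
        sim = case_similarity(past, current, weights)
        if sim >= best_sim:
            best_case, best_sim = past, sim
    return best_case["next_job"] if best_case else None

# Illustrative attributes: who ran the job, what they just ran, and when.
weights = {"user": 0.5, "last_job": 0.3, "hour": 0.2}
history = [{"user": "u1", "last_job": "WRF", "hour": 9, "next_job": "ADAS"}]
current = {"user": "u1", "last_job": "WRF", "hour": 11}
print(predict_next_job(history, current, weights))  # ADAS
```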
  • 17.
    Evaluation
    - Use cases
      - Individual job workload: 140k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab[1]
      - Workflow use case: the system doesn't see or assume a workflow specification
    - Experimental setup: 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system
    [1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
  • 18.
    Evaluation: Average Accuracy of Predictions
    - Individual jobs workload: ~75% accurate predictions with user patterns; ~32% with service names
    - Workflow workload: ~95% accurate predictions with user patterns; ~53% with service names
  • 19.
    Evaluation: Time Saved
    - Amount of time that can be saved if resources are provisioned by the time a job is ready to run
    - Startup time assumed to be 3 minutes (average for commercial providers)
    (Figure: individual jobs workload and workflow workload)
  • 20.
    Evaluation: Prediction Accuracies for Use Cases
    - User-pattern-based predictions perform 2x better than service-name-based predictions
  • 21.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 22.
    User Perceived Reliability
    - Failures are tolerated through fault tolerance, high availability, recoverability, etc.[Birman05]; what matters from a user's point of view is whether these failures are visible to users
      - E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
    - This reliability is not a property of the resources themselves
      - Not derived from halting failures, fail-stop failures, network partitioning failures[Birman05], or machine downtimes; it is a more broadly encompassing system reliability that can only be seen at the user or workflow level
      - Can depend on the user's configuration and job types as well
      - We refer to this form of reliability as user-perceived reliability
    - Importance of user-perceived reliability
      - Selecting a resource to schedule an experiment when the user has access to multiple compute resources
      - E.g., LEAD reliability: supercomputing resources vs. Windows Azure resources
  • 23.
    Why User Perceived Reliability is Useful
    - User-perceived failure probabilities: cluster A, p(A) = 0.2, and cluster B, p(B) = 0.3
    - p(A fails ∧ B succeeds) = p(A) · (1 − p(B)) = 0.2 · (1 − 0.3) = 0.14
    - p(B fails ∧ A succeeds) = p(B) · (1 − p(A)) = 0.3 · (1 − 0.2) = 0.24
    - Since 0.14 < 0.24, try cluster A first and then cluster B
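The arithmetic on this slide can be sketched directly; the helper name `try_order` is ours:

```python
def try_order(p_fail):
    """Order clusters so the one least likely to fail is attempted first."""
    return sorted(p_fail, key=p_fail.get)

# User-perceived failure probabilities from the slide.
p_fail = {"A": 0.2, "B": 0.3}

# Probability the first attempt fails but the fallback would have succeeded:
a_then_b = p_fail["A"] * (1 - p_fail["B"])  # 0.2 * 0.7 = 0.14
b_then_a = p_fail["B"] * (1 - p_fail["A"])  # 0.3 * 0.8 = 0.24

print(try_order(p_fail))  # ['A', 'B']
```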
  • 24.
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
    - Objective
      - Reduce the impact of low reliability of compute resources
      - Deduce user-perceived reliabilities by learning from user experiences and perceptions
    - Research outline
      - Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
      - Use of machine learning techniques: trained classifiers represent compute resources and their reliabilities
      - Prediction of job failures
    - Algorithm assessment: workloads from the Parallel Workload Archive representing jobs executed on two different supercomputing clusters
  • 25.
    System Architecture
    - A machine learning classifier is trained to learn the user-perceived reliability of each cluster
    - Classifier types
      - Static classifier: trained initially from historical information
      - Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
  • 26.
    System Architecture (continued)
    - Classifier manager uses the Weka[Hall09] framework
      - Classification methods: Naïve Bayes and KStar; static and dynamic classifiers
      - Dynamic pruning of features[Fadishei09] for increased efficiency
      - Creates and maintains classifiers for each compute resource
      - A new job is evaluated against these classifiers to deduce the predicted reliability of its execution
    - Policy implementers consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
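A minimal sketch of a per-resource classifier manager, using a hand-rolled updateable categorical Naive Bayes in place of Weka's classifiers; the class names, feature values and outcomes are illustrative:

```python
from collections import defaultdict

class UpdateableNaiveBayes:
    """Tiny categorical Naive Bayes with Laplace smoothing; a stand-in
    for Weka's updateable classifiers, starting from zero knowledge."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def update(self, features, label):
        # Incremental training: one (job features, outcome) observation.
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][(i, value)] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / total
            for i, value in enumerate(features):
                # Laplace smoothing so unseen values never zero out a class.
                score *= (self.feature_counts[label][(i, value)] + 1) / (count + 2)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

class ClassifierManager:
    """Maintains one classifier per compute resource; a new job is
    evaluated against the resource's classifier to predict its outcome."""

    def __init__(self):
        self.per_resource = defaultdict(UpdateableNaiveBayes)

    def record(self, resource, features, outcome):
        self.per_resource[resource].update(features, outcome)

    def predict(self, resource, features):
        return self.per_resource[resource].predict(features)

manager = ClassifierManager()
for _ in range(3):
    manager.record("clusterA", ("u1", "mpi"), "success")
    manager.record("clusterA", ("u2", "serial"), "fail")
print(manager.predict("clusterA", ("u1", "mpi")))  # success
```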
  • 27.
    Evaluation
    - Workloads from the Parallel Workload Archive[Feitelson]
      - LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
      - LPC: ten months' worth of job records (Aug 2004 to May 2005) on the 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire, Université Blaise Pascal, France
    - Minor cleanups to remove intermediate job states
    - 10,000 jobs were selected from each workload: LANL had 20% failed jobs; LPC had 30% failed jobs
  • 28.
    Evaluation
    - Workload classification and maintenance
      - Classifiers: Naïve Bayes[John95] and KStar[Cleary95] implementations in Weka[Hall09]
      - Classifier construction: the static classifier is trained on the first 1,000 jobs; the dynamic classifier uses all 10,000 jobs for construction and evaluation
    - Evaluation metrics
      - Average reliability prediction accuracy: accuracy of predicting job success/failure
      - Time saved: cumulative execution time saved by skipping a job that fails when our system predicted the failure correctly; the baseline measure is the ideal cumulative time that can be saved over time
      - Time consumed for classification and updating the classifier
      - Effect of pruning attributes: a static subset of attributes (as proposed by Fadishei et al.[Fadishei09]) vs. a dynamic subset (checking the effect on the goal variable)
  • 29.
    Evaluation
    - Effect of job reliability predictions on selecting compute resources
      - An extended version of GridSim[Buyya02] models four compute resources
      - NWS[Wolski99] for bandwidth estimation and QBets[Nurmi07] for queue wait time estimation
      - Total execution time = data movement time + queue wait time + job execution time (found in the workload)
    - Schedulers: total execution time priority scheduler; reliability-prediction-based time priority scheduler
    - Metrics
      - Average accuracy of selecting reliable resources to execute jobs
      - Time wasted due to incorrect selection of compute resources to execute jobs
    - All evaluations were run on a 3.0GHz dual-core processor with 4GB memory, on the Windows 7 Professional operating system
  • 30.
    Evaluation Metrics Summary
    (Figure: summary table of evaluation metrics)
  • 31.
    Results: Average Reliability Prediction Accuracy
    - LANL accuracy saturates at ~82%; LPC accuracy saturates at ~97%
    - KStar performed slightly better than Naïve Bayes
    (Figure: static and dynamic/updateable classifier accuracies for LANL and LPC)
  • 32.
    Results: Time Savings
    - With the static classifier, KStar saved 90-100%
    - With the updateable classifier: for LANL, both KStar and Naïve Bayes saved ~50%; for LPC, ~90%
    (Figure: static and dynamic/updateable classifier time savings for LANL and LPC)
  • 33.
    Results: Time Consumed for Classification and Updating the Classifier
    - Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
    (Figure: static and updateable classifier timings)
  • 34.
    Results: Effect of Pruning Attributes
    - A static subset of attributes ([Fadishei09]) performs poorly on this data set and classifier
    - Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
    - Conclusion: our classifiers handle noisy features well without compromising classification accuracy
    - Since identifying attributes to prune is a dynamic and expensive task, the system can be used in practical cases even without pruning attributes
  • 35.
    Results: Effect of Job Reliability Predictions on Selecting Compute Resources
    - Poor performance of the execution time priority scheduler
    - After 1,000 jobs (training), the time wasted with our approach stays fairly constant
  • 36.
    Evaluation Conclusion
    - Even though the average prediction accuracy of the static KStar classifier decreased, it learned and predicted failures better than any other method
    - Even though the amount of time saved increased slightly with the updateable Naïve Bayes classifier, the amount of time saved using the static KStar classifier is comparatively higher than both methods
    - Even though its total prediction accuracy does not match the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead
    - Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions
  • 37.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 38.
    Scientific Computing Resource Abstraction Layer
    - Variety of scientific computing platforms and opportunities
    - Requirements
      - Support existing job description languages and be extensible to support other languages
      - Provide a uniform and interoperable interface for external entities to interact with it
      - Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
      - Extensibility to support new and future resource managers with minimal changes
      - Provide monitoring and fault recovery, especially when working with utility computing resources
      - Provide a light-weight, robust and scalable infrastructure
      - Integration with a variety of workflow environments
  • 39.
    Scientific Computing Resource Abstraction Layer
    - Our contribution: a resource abstraction layer implemented as a Web service
      - Provides a uniform abstraction layer over heterogeneous compute resources, including grids, clouds and local departmental clusters
      - Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL)[Anjomshoaa04] and the Globus Resource Specification Language (RSL); directly interacts with resource managers, so it requires no grid or meta-scheduling middleware
      - Integrates with current resource managers, including LoadLeveler, PBS, LSF and Windows HPC, and the Amazon EC2 and Microsoft Azure platforms
    - Features
      - Does not need a high level of computer science knowledge to install and maintain (use of Globus was a challenge for most non-computer-scientists)
      - Involvement of system administrators to install and maintain Sigiri is minimal
      - Memory footprint is minimal: other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on small clusters is something scientists never wanted to do)
      - Better fault tolerance and failure recovery
  • 40.
    Architecture
    - Asynchronous messaging model of message publishers and consumers
    - Daemons shadowing compute resources
    - Distributed component deployment: daemon, front-end Web service, and job queue
  • 41.
    Client Interaction Service
    - Deployed as an Apache Axis2 Web service to enable interoperability
    - Accepts job requests and enables management and monitoring functions
    - The job submission schema does not enforce a schema for the job description, enabling multiple job description languages
  • 42.
    Client Interaction Service
    (Figure: job submission request and response messages)
  • 43.
    Daemons
    - Each managed compute resource has a light-weight daemon that
      - periodically checks the job request queue
      - translates the job specification to a resource-manager-specific language
      - submits pending jobs and persists the correlation between the resource manager's job id and the internal id
    - An extensible daemon API enables integration of a wide range of resource managers while keeping the complexities of these resource managers transparent to end users
    - The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
    - Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
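One daemon polling cycle can be sketched as follows; the job description field names (`name`, `nodes`, `walltime`, `command`), the PBS translation and the `submit` callable are illustrative stand-ins, not Sigiri's actual API:

```python
def to_pbs_script(job):
    """Render a generic job description as a PBS batch script.
    Field names are illustrative, not Sigiri's actual job schema."""
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -N {job['name']}",
        f"#PBS -l nodes={job['nodes']}",
        f"#PBS -l walltime={job['walltime']}",
        job["command"],
    ])

def daemon_cycle(pending, submit, id_map):
    """One polling cycle: drain the job request queue, submit each job via
    the resource-manager-specific `submit` callable, and persist the
    mapping from the internal id to the resource manager's job id."""
    while pending:
        job = pending.pop(0)
        id_map[job["id"]] = submit(to_pbs_script(job))
    return id_map

ids = daemon_cycle(
    [{"id": "sigiri-1", "name": "wrf", "nodes": 4, "walltime": "01:00:00",
      "command": "mpirun ./wrf.exe"}],
    submit=lambda script: "pbs-1001",  # stand-in for calling `qsub`
    id_map={},
)
print(ids)  # {'sigiri-1': 'pbs-1001'}
```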
  • 44.
    Integration of Cloud Computing Resources
    - A unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements
    - Enables scientists to interact with multiple cloud providers within the same system
    - Features
      - Extensions can be written as modules independent of other extensions, typically to carry out a single task
      - Enforced failure handling to prevent orphaned VMs and resources
  • 45.
    Security
    - Client security (between the client and the Web service layer)
      - Support for both transport-level security (using SSL) and application-layer security (using WS-Security)
      - Client negotiation of security credentials with WS-Security policy support within Apache Axis2
    - Compute resource security
      - The system can store different types of security credentials: username/password combinations, X.509 credentials
  • 46.
    Performance Evaluation: Test Scenarios
    - Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
      - Each client waits for all jobs to finish before submitting the next set of jobs
      - For example, in the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel
    - Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
      - The client does not block upon submission of a job
      - Failure rate and server performance, from the clients' point of view, are measured while the number of simultaneous clients is systematically increased
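The Case 1 burst scenario can be sketched as a small test harness; `submit_job` is a stand-in for a call to the Sigiri Web service:

```python
import threading
import time

def run_burst(num_clients, submit_job):
    """Case 1 harness: num_clients clients each submit one job
    concurrently, and we record each client's observed response time."""
    results = [None] * num_clients

    def client(i):
        start = time.perf_counter()
        submit_job(i)  # stand-in for the Web service job submission call
        results[i] = time.perf_counter() - start

    threads = [threading.Thread(target=client, args=(i,))
               for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:  # wait for the whole burst before the next round
        t.join()
    return results

response_times = run_burst(10, lambda i: None)
print(len(response_times))  # 10
```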
  • 47.
    Performance Evaluation: Baseline Measurements
    (Figure: baseline measurement charts)
  • 48.
  • 49.
    Performance Evaluation: Scalability Metrics
    (Figure: scalability metric charts)
  • 50.
    Performance Evaluation: Experimental Setup
    - Daemon hosted on the gatekeeper node (quad-core IBM PowerPC, 1.6GHz, 8GB physical memory) of the Big Red cluster
    - Web service and database co-hosted on a machine with four 2.6GHz dual-core processors and 32GB of RAM
    - Neither node was dedicated to our experiment while the tests were running
    - Client environment
      - Set up within the 128-node Odin cluster (each node a dual AMD 2.0GHz Opteron processor with 4GB physical memory)
      - All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead
    - Data collection
      - Each test was run (number of clients × 10) times and the results were averaged
      - Each parameter was tested for 100 to 1000 concurrent clients; a total of 110,000 tests were run
      - GRAM4 experiment results from the GRAM4 evaluation paper[Marru08] were used for system performance comparison
  • 51.
    ResultsThesis Defense -Eran Chinthaka Withana51Baseline MeasurementsAll overheads scaling proportional to number of clientsNo failuresCase 1Case 2
  • 52.
    Results: Metrics for Test Cases 1 and 2
    - Both response time and total overhead scale proportionally to the number of clients
    - No failures
  • 53.
    Results: Scalability Metrics
    - No failures with Sigiri
    - Failures starting from 300 clients for GRAM
    (Figure: Case 1 and Case 2)
  • 54.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 55.
    Applications: LEAD
    - Motivations
      - Grid middleware reliability and scalability study[Marru08] and workflow failure rates
      - Components of the LEAD infrastructure were considered for adaptation to other scientific environments
      - Sigiri was initially prototyped to support LoadLeveler, PBS and LSF
    - Implications
      - Improved workflow success rates
      - Mitigated the need for Globus middleware
      - Ability to work with non-standard job managers
  • 56.
    Applications: LEAD II
    - Emergence of community-driven, production-quality workflow infrastructures
      - E.g., the Trident Scientific Workflow Workbench with Workflow Foundation
    - Possibility of using alternate supercomputing resources
      - E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
    - Support for Windows-based scientific computing environments
  • 57.
    Background: LEAD II and the Vortex2 Experiment
    - May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
    - Workflow started on the hour, every hour, each morning
    - Had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions; if the model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
    - Execution of the complete WRF stack within 1 hour
  • 58.
    Trident Vortex2 Workflow
    - The bulk of the time (50 min) was spent in the LEAD Workflow Proxy Activity
    (Figure: workflow with the Sigiri integration point)
  • 59.
    Applications: Enabling Geo-Science Applications on Windows Azure
    - Geo-science applications have high resource requirements
      - Compute intensive, dedicated HPC hardware; e.g., the Weather Research and Forecasting (WRF) model
    - Emergence of ensemble applications: large numbers of small jobs
      - E.g., examining each air layer over a long period of time; a single experiment = about 14,000 jobs, each taking a few minutes to complete
  • 60.
    Geo-Science Applications: Opportunities
    - Cloud computing resources
      - On-demand access to "unlimited" resources
      - Flexibility: worker roles and VM roles
    - Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows
    - Increased use of ensemble applications (large numbers of small runs)
    - Production-quality, open-source scientific workflow systems, e.g., Microsoft Trident
  • 61.
    Research Vision
    - Enabling geo-science experiments
      - Types of applications: compute intensive, ensembles
      - Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
    - Utilizing both cloud computing and grid computing resources
    - Utilizing open-source, production-quality scientific workflow environments
    - Improved data and metadata management
    (Figure: geo-science applications, scientific workflows, compute resources)
  • 62.
    Proposed Framework
    (Figure: a Trident activity calls the Sigiri Web service, which feeds a job queue; Sigiri job management daemons use the Azure Management API to drive the Azure fabric; custom Azure VM images run IIS, WRF, the Sigiri worker service and MS-MPI on Windows 2008 R2; data is staged through the Azure Blob Store)
  • 63.
    Applications: PRAGMA Testbed Support
    - Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]: an open international organization founded in 2002 to focus on the practical issues of building international scientific collaborations
    - In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed; Sigiri was used within the IU PRAGMA testbed
      - The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
      - The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no changes to interfaces
    - In 2011, the PRAGMA - Opal - Sigiri integration was demonstrated successfully
  • 64.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 65.
    Related Work
    - Scientific job management systems
      - Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07] provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
      - The Carmen[Watson81] project provided a cloud environment that enabled collaboration between neuroscientists, but requires all programs to be packaged as WS-I[Ballinger04] compliant Web services
      - Condor[Frey02] pools can also be utilized to unify certain compute resource interactions, but Condor uses the Globus toolkit[Foster05] (and GRAM underneath), has poor failure recovery, and overlooks the failure modes of a cloud platform
  • 66.
    Related Work
    - Scientific research and cloud computing: IaaS, PaaS and SaaS environment evaluations
      - Scientists have mainly evaluated the use of IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08][Yu05]: ease of setting up custom environments, and control
      - Growing interest in using PaaS services[Humphrey10][Lu10][Qiu09]
      - Optimizations to balance cost and time of executions[Deelman08][Yu05]
      - Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
    - Job prediction algorithms
      - Prediction of execution times[Smith], job start times[Li04], queue wait times[Nurmi07] and resource requirements[Julian04]; AI-based and statistical-modeling-based approaches
      - AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance
    - Reliability of compute resources
      - Birman[Birman05] and the aspects of resources causing system reliability issues
      - Statistical modeling to predict failures[Kandaswamy08]
  • 67.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 68.
    Conclusion
    - User inspired management of scientific jobs
      - Concentrates on the identification of user patterns and perceptions: harnesses historical information and applies the knowledge gained to improve scientific job executions
      - Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimations of resource requirements
      - Evaluations demonstrate the usability of the predictions for a meta-scheduler, especially one integrated into a community gateway, to improve its scheduling decisions
    - Resource abstraction service
      - Helps mid-scale scientists obtain access to resources that are cheap and available, with a tool that is easy to set up and administer
    - The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications; the applications demonstrate how our research contributed to advancing science in the respective domains
  • 69.
    Contributions
    - Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
    - Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
    - Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
    - Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to prove its usability.
  • 70.
    Future Work
    - Short-term research directions
      - Integration of future job predictions and user-perceived reliability predictions
      - Evolving the resource abstraction service to support more compute resources
      - Management of ensemble runs
      - Fault tolerance with proactive replication
    - Long-term research directions
  • 71.
    Thank You!!