User Inspired Management of Scientific Jobs in Grids and Clouds
Eran Chinthaka Withana
School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Doctoral Committee: Professor Beth Plale, PhD; Dr. Dennis Gannon, PhD; Professor Geoffrey Fox, PhD; Professor David Leake, PhD
Outline
- Mid-Range Science: Challenges and Opportunities; Current Landscape
- Research: Research Questions; Contributions
- Mining Historical Information to Find Patterns and Experiences
  - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
  - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
- Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
- Applications
- Related Work
- Conclusion and Future Work
Thesis Defense - Eran Chinthaka Withana
Mid-Range Science
- Challenges
  - Resource requirements go beyond the lab and university, but are not suited for large-scale resources
  - Difficulty finding sufficient compute resources (e.g., short-term forecasts in LEAD for energy and agriculture)
  - Lack of resources to have a strong CS support person on the team
  - Need for less expensive and more available resources
- Opportunities
  - Wide variety of computational resources
  - Science gateways
Current Landscape
- Grid Computing
  - Batch orientation, long queues even under moderate load, no access transparency
  - Drawbacks in quota systems
  - Level of computer science expertise required
- Cloud Computing
  - High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
  - Payment policies and research cost models
- Use of Workflow Systems
  - Hybrid workflows enable utilization of heterogeneous compute resources (e.g., the Vortex2 experiment)
- Need for resource abstraction layers and optimal selection of resources
- Need for improvement of scientific job executions
  - Better scheduler decisions, selection of compute resources
  - Reliability issues in compute resources
  - Importance of learning user patterns and experiences

[1] M. Armbrust et al. Above the Clouds: A Berkeley View of Cloud Computing. Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
Research Questions
- "Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
- "Can a simple, reliable and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
- "Can these be put to use to advance science?"
Contributions
- Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
- Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
- Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
- Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
- Objective: reduce the impact of startup overheads for time-critical applications
- Problem space
  - Workflows can have multiple paths; workflow descriptions are not available
  - Need predictions to identify the job execution sequence
  - Learn from user behavioral patterns to predict future jobs
- Research outline
  - Algorithm to predict future jobs by extracting user patterns from historical information
  - Uses knowledge-based techniques
  - Starts from zero knowledge, or from pre-populated job information describing connections between jobs
  - Similar retrieved cases are used to predict future jobs, reducing high startup overheads
- Algorithm assessment
  - Two workloads: individual scientific jobs executed at LANL, and a set of workflows executed by three users
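The thesis uses a knowledge-based (case-based reasoning) engine for these predictions; as a minimal illustrative stand-in, the core idea of learning per-user job sequences can be sketched with a simple frequency model (the class and method names here are hypothetical, not the thesis's actual API):

```python
from collections import Counter, defaultdict

class NextJobPredictor:
    """Simplified stand-in for the prediction engine: for each user,
    count which job historically follows which, and predict the most
    frequent successor so its resources can be provisioned early."""

    def __init__(self):
        # per-user map: job name -> Counter of observed next jobs
        self.history = defaultdict(lambda: defaultdict(Counter))

    def record(self, user, job, next_job):
        """Add one observed job transition from historical information."""
        self.history[user][job][next_job] += 1

    def predict(self, user, job):
        """Predict the next job, or None under zero knowledge."""
        successors = self.history[user][job]
        if not successors:
            return None
        return successors.most_common(1)[0][0]
```

For example, if a user's history shows WRF followed twice by a crop model and once by a fire simulation, the predictor would provision for the crop model next.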
Demonstration of User Patterns with Workflows
- The suite of workflows can differ from domain to domain (e.g., WRF (Weather Research and Forecasting) as an upstream node)
- User patterns reveal the sequence of jobs, taking different users/domains into consideration
- Useful for a science gateway serving a wide range of mid-scale scientists
- [Diagram: WRF feeding downstream workflows for weather predictions, crop predictions, wind farm location evaluations and wild fire propagation simulation]
Role of Successful Predictions in Reducing Startup Overheads
- The largest gain is achieved when prediction accuracy is high and setup time (s) is large relative to execution time (t)
- r = probability of a successful prediction (prediction accuracy)
- Percentage time reduction = (r * s) / (t + s)
- For simplicity, assuming equal job execution and startup times (t = s): percentage time reduction = r / 2
Relationship of Predictions to Execution Time
- Observations
  - Percentage time reduction increases with the accuracy of predictions
  - Time reduction falls off sharply as the work-to-overhead ratio (t/s) increases
- Need to find the critical point for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy
- Cost of wrong predictions
  - Depends on the compute resource
  - Demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions
  - Compromising cost to improve time
- Percentage time reduction = (r * s) / (t + s)
- Accuracy of predictions = total successful future job predictions / total predictions
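The relationship above can be checked numerically. This sketch assumes the time-reduction formula as reconstructed from the slide (saved time is the startup overhead s, hidden with probability r, out of total turnaround t + s):

```python
def time_reduction(r, t, s):
    """Fraction of total turnaround time (t + s) hidden by provisioning
    ahead of time: a correct prediction (probability r) masks the
    startup overhead s.  Formula: r * s / (t + s)."""
    return r * s / (t + s)

# With equal execution and startup times (t = s) the reduction is r / 2,
# so ~90% prediction accuracy hides about 45% of the turnaround time:
assert abs(time_reduction(0.9, t=180, s=180) - 0.45) < 1e-12

# The benefit shrinks as the work-to-overhead ratio t/s grows:
assert time_reduction(0.9, t=8, s=1) < time_reduction(0.9, t=1, s=1)
```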
Prediction Engine: System Architecture
[Architecture diagram; includes the Prediction Retriever component]
Use of Reasoning
- Store and retrieve cases
- Steps
  - Retrieval of similar cases: similarity measurement, use of thresholds
  - Reuse of old cases: case adaptation
  - Storage
Case Similarity Calculation
- Each case is represented by a set of attributes
- Attributes are selected by finding their effect on the goal variable (the next job)
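A similarity measure over attribute sets like the one described can be sketched as a weighted overlap; the attribute names and weighting scheme below are hypothetical illustrations, not the exact measure used in the thesis:

```python
def case_similarity(case_a, case_b, weights):
    """Weighted-overlap similarity between two cases, each a dict of
    attribute -> value (e.g. user, job name, queue).  Weights reflect
    each attribute's effect on the goal variable (the next job).
    Returns a score in [0, 1]."""
    total = sum(weights.values())
    if not total:
        return 0.0
    matched = sum(w for attr, w in weights.items()
                  if case_a.get(attr) == case_b.get(attr))
    return matched / total
```

A retrieved case whose score exceeds a configured threshold would then be reused (after adaptation) to predict the next job.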
Evaluation
- Use cases
  - Individual job workload: 140k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab[1]
  - Workflow use case: the system does not see or assume a workflow specification
- Experimental setup: 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system

[1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
Evaluation: Average Accuracy of Predictions
- Individual jobs workload: ~75% accurate predictions with user patterns; ~32% with service names
- Workflow workload: ~95% accurate predictions with user patterns; ~53% with service names
Evaluation: Time Saved
- Amount of time that can be saved if resources are provisioned by the time a job is ready to run
- Startup time assumed to be 3 minutes (average for commercial providers)
- [Graphs: time saved for the individual jobs workload and the workflow workload]
Evaluation: Prediction Accuracies for Use Cases
- User-pattern-based predictions perform about 2x better than service-name-based predictions
User Perceived Reliability
- Failures are tolerated through fault tolerance, high availability, recoverability, etc. [Birman05]; what matters from a user's point of view is whether these failures are visible to users
  - E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
- This reliability is not a property of the resources themselves
  - Not derived from halting failures, fail-stop failures, network partitioning failures [Birman05] or machine downtimes
  - It is a more broadly encompassing system reliability that can only be seen at the user or workflow level
  - Can also depend on the user's configuration and job types
- We refer to this form of reliability as user-perceived reliability
- Importance of user-perceived reliability
  - Selecting a resource to schedule an experiment when the user has access to multiple compute resources
  - E.g., in LEAD: reliability of supercomputing resources vs. Windows Azure resources
Why User Perceived Reliability is Useful
- User-perceived failure probabilities: cluster A, p(A) = 0.2, and cluster B, p(B) = 0.3
- Probability that the first attempt is wasted (first cluster fails while the other would have succeeded):
  - Try A first: p(A) * (1 - p(B)) = 0.2 * (1 - 0.3) = 0.14
  - Try B first: p(B) * (1 - p(A)) = 0.3 * (1 - 0.2) = 0.24
- Since 0.14 < 0.24, try cluster A first and then cluster B
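The slide's ordering rule is easy to express in code; this is a minimal sketch of the arithmetic above (the function name is illustrative):

```python
import math

def first_attempt_waste(p_first, p_second):
    """Probability that the first-tried cluster fails while the second
    would have succeeded, i.e. the wasted-attempt probability the slide
    uses to order resource attempts (p_* are failure probabilities)."""
    return p_first * (1 - p_second)

p_a, p_b = 0.2, 0.3
try_a_first = first_attempt_waste(p_a, p_b)  # 0.2 * 0.7 = 0.14
try_b_first = first_attempt_waste(p_b, p_a)  # 0.3 * 0.8 = 0.24
# Lower wasted-attempt probability wins, so try A first:
order = ["A", "B"] if try_a_first < try_b_first else ["B", "A"]
```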
Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
- Objective
  - Reduce the impact of low reliability of compute resources
  - Deduce user-perceived reliabilities by learning from user experiences and perceptions
- Research outline
  - Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
  - Uses machine learning techniques: trained classifiers represent compute resources and their reliabilities
  - Prediction of job failures
- Algorithm assessment
  - Workloads from the Parallel Workload Archive representing jobs executed in two different supercomputing clusters
System Architecture
- A machine learning classifier is trained to learn the user-perceived reliability of each cluster
- Classifier types
  - Static classifier: trained initially from historical information
  - Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
System Architecture
- Classifier manager
  - Uses the Weka[Hall09] framework; classification methods: Naïve Bayes and KStar; static and dynamic classifiers
  - Dynamic pruning of features[Fadishei09] for increased efficiency
  - Creates and maintains classifiers for each compute resource
  - A new job is evaluated against these classifiers to deduce the predicted reliability of its execution
- Policy implementers
  - Consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
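The thesis uses Weka's Naïve Bayes and KStar implementations; as a language-neutral illustration of the idea of one updateable classifier per compute resource, here is a minimal hand-rolled Naïve Bayes over categorical job attributes (the class, attribute names and label strings are assumptions for the sketch):

```python
import math
from collections import Counter, defaultdict

class ReliabilityClassifier:
    """Illustrative updateable Naive Bayes predicting 'success'/'fail'
    for jobs on one compute resource, from categorical attributes."""

    def __init__(self):
        self.label_counts = Counter()
        # attribute -> label -> value -> count
        self.value_counts = defaultdict(lambda: defaultdict(Counter))

    def update(self, job, label):
        """Incrementally learn from one finished job (dict of attrs)."""
        self.label_counts[label] += 1
        for attr, value in job.items():
            self.value_counts[attr][label][value] += 1

    def predict(self, job):
        """Return the more likely label, with Laplace smoothing."""
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            for attr, value in job.items():
                seen = self.value_counts[attr][label]
                score += math.log((seen[value] + 1) / (count + len(seen) + 1))
            if score > best_score:
                best, best_score = label, score
        return best
```

A classifier manager would keep one such instance per resource, update it as jobs complete, and let policy implementers combine its predictions with time and cost information.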
Evaluation
- Workloads from the Parallel Workload Archive[Feitelson]
  - LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
  - LPC: ten months (Aug 2004 to May 2005) of job records on a 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire of Université Blaise Pascal, France
- Minor cleanups to remove intermediate job states
- 10,000 jobs were selected from each workload
  - LANL had 20% failed jobs; LPC had 30% failed jobs
Evaluation
- Workload classification and maintenance
  - Classifiers: Naïve Bayes[John95] and KStar[Cleary95] implementations in Weka[Hall09]
  - Static classifier: the first 1000 jobs train the classifier
  - Dynamic classifier: all 10,000 jobs used for classifier construction and evaluation
- Evaluation metrics
  - Average reliability prediction accuracy: accuracy of predicting the success/failure of a job
  - Time saved: cumulative execution time of failed jobs whose failure our system predicted correctly
    - Baseline: ideal cumulative time that could be saved over time
  - Time consumed for classification and updating the classifier
  - Effect of pruning attributes: static subset of attributes (as proposed by Fadishei et al.[Fadishei09]) vs. a dynamic subset (checking each attribute's effect on the goal variable)
Evaluation
- Effect of job reliability predictions on selecting compute resources
  - An extended version of GridSim[Buyya02] models four compute resources
  - NWS[Wolski99] for bandwidth estimation and QBETS[Nurmi07] for queue wait time estimation
  - Total execution time = data movement time + queue wait time + job execution time (from the workload)
- Schedulers
  - Total execution time priority scheduler
  - Reliability-prediction-based time priority scheduler
- Metrics
  - Average accuracy of selecting reliable resources to execute jobs
  - Time wasted due to incorrect selection of compute resources
- All evaluations were run on a 3.0GHz dual-core processor with 4GB memory on the Windows 7 Professional operating system
Evaluation Metrics Summary
Results: Average Reliability Prediction Accuracy
- LANL accuracy saturates at ~82%; LPC accuracy saturates at ~97%
- KStar performed slightly better than Naïve Bayes
- [Graphs: static and dynamic/updateable classifiers on the LANL and LPC workloads]
Results: Time Savings
- With the static classifier, KStar saved 90-100%
- Updateable classifier: for LANL, both KStar and Naïve Bayes saved ~50%; for LPC, ~90%
- [Graphs: static and dynamic/updateable classifiers on the LANL and LPC workloads]
Results: Time Consumed for Classification and Updating the Classifier
- Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
- [Graphs: static and updateable classifier timings]
Results: Effect of Pruning Attributes
- A static subset of attributes (Fadishei09) performs poorly on this data set and classifier
- Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
- Conclusion: our classifiers handle noise features well without compromising classification accuracy
- Identifying attributes to prune is a dynamic and expensive task, so the system can be used in practical cases even without attribute pruning
Results: Effect of Job Reliability Predictions on Selecting Compute Resources
- Poor performance of the execution time priority scheduler
- After 1000 jobs (training), the time wasted with our approach stays fairly constant
Evaluation Conclusion
- Even though the average prediction accuracy of the static KStar classifier decreased, it learned to predict failures better than any other method
- Even though the amount of time saved increased slightly with the updateable Naïve Bayes classifier, the time saved using the static KStar classifier is higher than with either updateable method
- Even though its total prediction accuracy does not match the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead
- Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions
Scientific Computing Resource Abstraction Layer
- Wide variety of scientific computing platforms and opportunities
- Requirements
  - Support existing job description languages and be extensible to support other languages
  - Provide a uniform and interoperable interface for external entities
  - Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
  - Extensibility to support new and future resource managers with minimal changes
  - Provide monitoring and fault recovery, especially when working with utility computing resources
  - Provide light-weight, robust and scalable infrastructure
  - Integrate with a variety of workflow environments
Scientific Computing Resource Abstraction Layer
- Our contribution: a resource abstraction layer (Sigiri), implemented as a Web service
  - Provides a uniform abstraction over heterogeneous compute resources including grids, clouds and local departmental clusters
  - Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL)[Anjomshoaa04] and the Globus Resource Specification Language (RSL)
  - Interacts directly with resource managers, so it requires no grid or meta-scheduling middleware
  - Integrates with current resource managers including LoadLeveler, PBS, LSF and Windows HPC, as well as the Amazon EC2 and Microsoft Azure platforms
- Features
  - Does not require a high level of computer science knowledge to install and maintain (using Globus was a challenge for most non-computer scientists)
  - Involvement of system administrators in installing and maintaining Sigiri is minimal; the memory footprint is minimal
  - Other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on a small cluster is something scientists never wanted to do)
  - Better fault tolerance and failure recovery
Architecture
- Asynchronous messaging model of message publishers and consumers
- Daemons shadowing compute resources
- Distributed component deployment: daemon, front-end Web service and job queue
Client Interaction Service
- Deployed as an Apache Axis2 Web service to enable interoperability
- Accepts job requests and enables management and monitoring functions
- The job submission schema does not enforce a schema for the job description itself, enabling multiple job description languages
Client Interaction Service
[Figure: job submission request and response message structures]
Daemons
- Each managed compute resource has a light-weight daemon that
  - periodically checks the job request queue
  - translates the job specification to a resource-manager-specific language
  - submits pending jobs and persists the correlation between the resource manager's job id and the internal id
- An extensible daemon API enables integration of a wide range of resource managers while keeping their complexities transparent to end users
- The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
- Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
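The daemon's poll-translate-submit-persist cycle described above can be sketched as follows. This is a minimal illustration, not Sigiri's actual (Java) implementation; the `JobQueue`/`ResourceManager`-style collaborators and their method names are hypothetical:

```python
import time

class Daemon:
    """Illustrative daemon loop for one managed compute resource:
    poll the job request queue, translate each pending job description
    to the resource manager's language, submit it, and persist the
    mapping from the manager's job id to Sigiri's internal id."""

    def __init__(self, queue, manager, id_store, poll_interval=5.0):
        self.queue = queue            # source of pending job requests
        self.manager = manager        # adapter for one resource manager
        self.id_store = id_store      # mapping: internal id -> manager job id
        self.poll_interval = poll_interval

    def run_once(self):
        for job in self.queue.pending_jobs():
            script = self.manager.translate(job.description)
            manager_id = self.manager.submit(script)
            self.id_store[job.internal_id] = manager_id

    def run_forever(self):
        while True:
            self.run_once()
            time.sleep(self.poll_interval)
```

Because the daemon only pulls from a queue, it needs no inbound connectivity, which is what lets it run on any compute platform.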
Integration of Cloud Computing Resources
- A unique set of dynamically loaded and configured extensions handles security, schedules jobs and performs required data movements
- Enables scientists to interact with multiple cloud providers within the same system
- Features
  - Extensions can be written as modules independent of other extensions, typically to carry out a single task
  - Enforced failure handling to prevent orphaned VMs and resources
Security
- Client security
  - Between the client and the Web service layer
  - Support for both transport-level security (using SSL) and application-level security (using WS-Security)
  - Client negotiation of security credentials with WS-Security policy support within Apache Axis2
- Compute resource security
  - The system can store different types of security credentials: username/password combinations, X.509 credentials
Performance Evaluation
- Test scenarios
  - Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
    - Each client waits for all jobs to finish before submitting its next set of jobs
    - For example, in the test with 100 clients, each client sends 1 job, so 100 jobs arrive at the server in parallel
  - Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
    - The client does not block upon submission of a job
- Failure rate and server performance, from the client's point of view, are measured while the number of simultaneous clients is systematically increased
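Test Case 1's burst of simultaneous submissions can be sketched with a thread barrier; the `submit` callable here is a hypothetical stand-in for the real Sigiri Web service call, not the actual client code used in the thesis:

```python
import threading

def burst_submit(submit, n_clients):
    """Sketch of test Case 1: n_clients concurrent clients each submit
    one job at the same moment.  A barrier releases all clients at once
    so the server sees a true burst; returns each client's result."""
    results = [None] * n_clients
    barrier = threading.Barrier(n_clients)

    def client(i):
        barrier.wait()                      # synchronize the burst
        results[i] = submit(f"job-from-client-{i}")

    threads = [threading.Thread(target=client, args=(i,))
               for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:                       # wait for all jobs to finish
        t.join()
    return results
```

Case 2 would differ by having each client loop over 10 non-blocking submissions instead of one synchronized call.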
Performance Evaluation: Baseline Measurements
Performance Evaluation: Metrics
Performance Evaluation: Scalability Metrics
Performance Evaluation
- Experimental setup
  - Daemon hosted on the gatekeeper node of the Big Red cluster (quad-core 1.6GHz IBM PowerPC with 8GB of physical memory)
  - The Web service and database co-hosted on a machine with four 2.6GHz dual-core processors and 32GB of RAM
  - Neither node was dedicated to our experiment while the tests were running
- Client environment
  - Set up within the 128-node Odin cluster (each node a dual 2.0GHz AMD Opteron with 4GB of physical memory)
  - All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate external overhead
- Data collection
  - Each test was run (number of clients * 10) times and the results averaged
  - Each parameter was tested for 100 to 1000 concurrent clients; a total of 110,000 tests were run
  - GRAM4 results from the GRAM4 evaluation paper[Marru08] were used for system performance comparison
Results: Baseline Measurements
- All overheads scale proportionally with the number of clients
- No failures
- [Graphs: Case 1 and Case 2]
Results: Metrics for Test Cases 1 and 2
- Both response time and total overhead scale proportionally with the number of clients
- No failures
Results: Scalability Metrics
- No failures with Sigiri; failures starting from 300 clients for GRAM
- [Graphs: Case 1 and Case 2]
Applications: LEAD
- Motivations
  - Grid middleware reliability and scalability study[Marru08] and workflow failure rates
  - Components of the LEAD infrastructure were considered for adaptation to other scientific environments
  - Sigiri was initially prototyped to support LoadLeveler, PBS and LSF
- Implications
  - Improved workflow success rates
  - Mitigated the need for Globus middleware
  - Ability to work with non-standard job managers
Applications: LEAD II
- Emergence of community-driven, production-quality workflow infrastructures
  - E.g., the Trident Scientific Workflow Workbench built on Windows Workflow Foundation
- Possibility of using alternate supercomputing resources
  - E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
- Support for Windows-based scientific computing environments
Background: LEAD II and the Vortex2 Experiment
- May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
- Workflow started on the hour, every hour, each morning
- Had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions
  - If model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
- Execution of the complete WRF stack within 1 hour
Trident Vortex2 Workflow
- Bulk of the time (50 min) is spent in the LEAD Workflow Proxy Activity
- [Workflow diagram; shows the Sigiri integration point]
Applications: Enabling Geo-Science Applications on Windows Azure
- Geo-science applications have high resource requirements
  - Compute intensive, dedicated HPC hardware (e.g., the Weather Research and Forecasting (WRF) model)
- Emergence of ensemble applications: large numbers of small jobs
  - E.g., examining each air layer over a long period of time; a single experiment is about 14,000 jobs, each taking a few minutes to complete
Geo-Science Applications: Opportunities
- Cloud computing resources: on-demand access to "unlimited" resources
- Flexibility: worker roles and VM roles
- Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows
- Increased use of ensemble applications (large numbers of small runs)
- Production-quality, open-source scientific workflow systems (e.g., Microsoft Trident)
Research Vision
- Enabling geo-science experiments
  - Types of applications: compute intensive, ensembles
  - Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
- Utilizing both cloud computing and grid computing resources
- Utilizing open-source, production-quality scientific workflow environments
- Improved data and metadata management
- [Diagram: geo-science applications, scientific workflows, compute resources]
Proposed Framework
[Architecture diagram components: Trident activity, Sigiri Web service, job queue, Sigiri job management daemons, Azure Management API, Azure Blob Store, Azure fabric, custom Azure VM images, and VM instances running Windows Server 2008 R2 with IIS, WRF, MS-MPI and the Sigiri worker service]
Applications: PRAGMA Testbed Support
- Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]: an open international organization founded in 2002 to focus on the practical issues of building international scientific collaborations
- In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed; Sigiri was used within the IU PRAGMA testbed
  - The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
  - The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no change to interfaces
- In 2011, the PRAGMA-Opal-Sigiri integration was demonstrated successfully
Related Work
- Scientific job management systems
  - Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07]
    - Provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
  - The Carmen[Watson81] project provided a cloud environment that enabled collaboration between neuroscientists
    - Requires all programs to be packaged as WS-I[Ballinger04] compliant Web services
  - Condor[Frey02] pools can also be utilized to unify certain compute resource interactions
    - Uses the Globus toolkit[Foster05] (and GRAM underneath); poor failure recovery; overlooks the failure modes of a cloud platform
Related Work
- Scientific research and cloud computing
  - IaaS, PaaS and SaaS environment evaluations
    - Scientists have mainly evaluated IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08][Yu05]: ease of setting up custom environments, and control
    - Growing interest in using PaaS services[Humphrey10][Lu10][Qiu09]
  - Optimizations to balance the cost and time of executions[Deelman08][Yu05]
  - Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
- Job prediction algorithms
  - Prediction of execution times[Smith], job start times[Li04], queue wait times[Nurmi07] and resource requirements[Julian04]
  - AI-based and statistical-modeling-based approaches
  - AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance
- Reliability of compute resources
  - Birman[Birman05] and the aspects of resources causing system reliability issues
  - Statistical modeling to predict failures[Kandaswamy08]
Conclusion
- User inspired management of scientific jobs
  - Concentrates on identifying user patterns and perceptions
  - Harnesses historical information and applies the knowledge gained to improve scientific job executions
  - Argues that patterns, if identified for individual users, can reveal important information for making sophisticated estimates of resource requirements
  - Evaluations demonstrate the usability of these predictions for meta-schedulers, especially ones integrated into community gateways, to improve their scheduling decisions
- Resource abstraction service
  - Helps mid-scale scientists obtain access to resources that are cheap and available
  - Strives to do so with a tool that is easy to set up and administer
  - The prototype implementation introduced and discussed is integrated and used in different domains and scientific applications
  - The applications demonstrate how our research contributed to advancing science in the respective domains
Contributions
- Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
- Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
- Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service that hides the complexities of interacting with multiple resource managers in grids and clouds.
- Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, and integration with four different application domains to prove its usability.
Future WorkShort term research directionsIntegration of future job predictions and user-perceived reliability predictionsEvolving resource abstraction service to support more compute resourcesManagement of ensemble runsFault tolerance with proactive replicationLong Term Research DirectionsThesis Defense - Eran Chinthaka Withana70
Thank You !!Thesis Defense - Eran Chinthaka Withana71

User Inspired Management of Scientific Jobs in Grids and Clouds

  • 1.
    User Inspired Management of Scientific Jobs in Grids and Clouds
    Eran Chinthaka Withana
    School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
    Doctoral Committee: Professor Beth Plale, PhD; Dr. Dennis Gannon, PhD; Professor Geoffrey Fox, PhD; Professor David Leake, PhD
  • 2.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
    Thesis Defense - Eran Chinthaka Withana
  • 3.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 4.
    Mid-Range Science
    - Challenges
      - Resource requirements go beyond the lab and university, but are not suited for large-scale resources
      - Difficulties finding sufficient compute resources (e.g., short-term forecasts in LEAD for energy and agriculture)
      - Lack of resources for a strong CS support person on the team
      - Need for less expensive and more available resources
    - Opportunities
      - Wide variety of computational resources
      - Science gateways
  • 5.
    Current Landscape
    - Grid Computing
      - Batch orientation, long queues even under moderate loads, no access transparency
      - Drawbacks in the quota system
      - Levels of computer science expertise required
    - Cloud Computing
      - High availability, pay-as-you-go model, on-demand "limitless"[1] resource allocation
      - Payment policy and research cost models
    - Use of Workflow Systems
      - Hybrid workflows enable utilization of heterogeneous compute resources (e.g., the Vortex2 experiment)
    - Need for resource abstraction layers and optimal selection of resources
    - Need for improvement of scientific job executions
      - Better scheduler decisions, selection of compute resources
      - Reliability issues in compute resources
      - Importance of learning user patterns and experiences
    [1] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
  • 6.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 7.
    Research Questions
    - "Can user patterns and experiences be used to improve scientific job executions in large-scale systems?"
    - "Can a simple, reliable, and highly scalable uniform resource abstraction be achieved to interact with a variety of compute resource providers?"
    - "Can these be put to use to advance science?"
  • 8.
    Contributions
    - Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
    - Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
    - Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
    - Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to prove its usability.
  • 9.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 10.
    Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds
    - Objective: reduce the impact of startup overheads for time-critical applications
    - Problem space
      - Workflows can have multiple paths, and workflow descriptions are not available
      - Need predictions to identify the job execution sequence
      - Learn from user behavioral patterns to predict future jobs
    - Research outline
      - Algorithm to predict future jobs by extracting user patterns from historical information
      - Use of knowledge-based techniques
      - Zero knowledge, or pre-populated job information consisting of connections between jobs
      - Similar cases retrieved are used to predict future jobs, reducing high startup overheads
    - Algorithm assessment: two workloads, representing individual scientific jobs executed at LANL and a set of workflows executed by three users
  • 11.
    Demonstration of User Patterns with Workflows
    - The suite of workflows can differ from domain to domain, e.g., WRF (Weather Research and Forecasting) as the upstream node
    - User patterns reveal the sequence of jobs, taking different users/domains into consideration
    - Useful for a science gateway serving a wide range of mid-scale scientists
    (Figure: WRF feeding Weather Predictions, Crop Predictions, Wind Farm Location Evaluations, and Wild Fire Propagation Simulation)
  • 12.
    Role of Successful Predictions to Reduce Startup Overheads
    - The largest gain is achieved when prediction accuracy is high and setup time (s) is large with respect to execution time (t)
    - r = probability of a successful prediction (prediction accuracy)
    - Percentage time reduction = 100 · r · s / (t + s)
    - For simplicity, assuming equal job execution and startup times (t = s): percentage time reduction = 100 · r / 2
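The reduction formula on this slide can be checked with a short Python sketch (the function name is ours; the formula is the slide's, reconstructed from r, t and s):

```python
def percentage_time_reduction(r: float, t: float, s: float) -> float:
    """Expected percentage of the total turnaround (t + s) saved when a
    correct prediction (probability r) lets the startup time s be paid
    in advance instead of serially before the job runs."""
    return 100.0 * r * s / (t + s)

# Equal execution and startup times (t = s): the reduction is 50 * r percent.
print(percentage_time_reduction(0.9, 1, 1))  # 45.0
```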
  • 13.
    Relationship of Predictions to Execution Time
    - Percentage time reduction = 100 · r · s / (t + s), where accuracy of predictions r = total successful future job predictions / total predictions
    - Observations
      - Percentage time reduction increases with the accuracy of predictions
      - Time reduction falls off quickly as the work-to-overhead ratio (t/s) increases
    - Need to find the critical point for a given situation: fix the required percentage time reduction for a given t/s ratio and find the required prediction accuracy
    - Cost of wrong predictions
      - Depends on the compute resource
      - Demonstrated that higher prediction accuracies (~90%) reduce the impact of wrong predictions
      - Compromising cost to improve time
  • 14.
    Prediction Engine: System Architecture
    (Figure: architecture diagram, including the Prediction Retriever component)
  • 15.
    Use of Reasoning
    - Store and retrieve cases
    - Steps
      - Retrieval of similar cases: similarity measurement, use of thresholds
      - Reuse of old cases: case adaptation
      - Storage
  • 16.
    Case Similarity Calculation
    - Each case is represented by a set of attributes
    - Attributes are selected by finding their effect on the goal variable (next job)
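A minimal Python sketch of this retrieve-and-reuse step; the attribute names, weights and threshold are illustrative, not the exact feature set used in the thesis:

```python
def case_similarity(case_a, case_b, weights):
    """Weighted fraction of attributes on which two cases agree."""
    matched = sum(w for attr, w in weights.items()
                  if case_a.get(attr) == case_b.get(attr))
    return matched / sum(weights.values())

def predict_next_job(history, current, weights, threshold=0.5):
    """Retrieve the most similar past case above the threshold and reuse
    its recorded next job as the prediction (None if nothing is similar)."""
    best_case, best_sim = None, threshold
    for past in history:
        sim = case_similarity(past, current, weights)
        if sim >= best_sim:
            best_case, best_sim = past, sim
    return best_case["next_job"] if best_case else None

# Illustrative attributes: who ran the job, what they just ran, and when.
weights = {"user": 0.5, "last_job": 0.3, "hour": 0.2}
history = [{"user": "u1", "last_job": "WRF", "hour": 9, "next_job": "ADAS"}]
current = {"user": "u1", "last_job": "WRF", "hour": 11}
print(predict_next_job(history, current, weights))  # ADAS
```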
  • 17.
    Evaluation
    - Use cases
      - Individual job workload: 140k jobs over two years from the 1024-node CM-5 at Los Alamos National Lab[1]
      - Workflow use case: the system doesn't see or assume a workflow specification
    - Experimental setup: 2.0GHz dual-core processor, 4GB memory, 64-bit Windows operating system
    [1] Parallel Workload Archive, http://www.cs.huji.ac.il/labs/parallel/workload/
  • 18.
    Evaluation: Average Accuracy of Predictions
    - Individual jobs workload: ~75% accurate predictions with user patterns; ~32% with service names
    - Workflow workload: ~95% accurate predictions with user patterns; ~53% with service names
  • 19.
    Evaluation: Time Saved
    - Amount of time that can be saved if resources are provisioned by the time a job is ready to run
    - Startup time assumed to be 3 minutes (average for commercial providers)
    (Figure: individual jobs workload and workflow workload)
  • 20.
    Evaluation: Prediction Accuracies for Use Cases
    - User-pattern-based predictions perform 2x better than service-name-based predictions
  • 21.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 22.
    User Perceived Reliability
    - Failures are tolerated through fault tolerance, high availability, recoverability, etc.[Birman05]; what matters from a user's point of view is whether these failures are visible to users
      - E.g., reliability of commodity hardware (in clouds) vs. user-perceived reliability
    - This reliability is not a property of the resources themselves
      - Not derived from halting failures, fail-stop failures, network partitioning failures[Birman05], or machine downtimes; it is a more broadly encompassing system reliability that can only be seen at the user or workflow level
      - Can depend on the user's configuration and job types as well
      - We refer to this form of reliability as user-perceived reliability
    - Importance of user-perceived reliability
      - Selecting a resource to schedule an experiment when the user has access to multiple compute resources
      - E.g., LEAD reliability: supercomputing resources vs. Windows Azure resources
  • 23.
    Why User Perceived Reliability is Useful
    - User-perceived failure probabilities: cluster A, p(A) = 0.2, and cluster B, p(B) = 0.3
    - p(A fails ∧ B succeeds) = p(A) · (1 − p(B)) = 0.2 · (1 − 0.3) = 0.14
    - p(B fails ∧ A succeeds) = p(B) · (1 − p(A)) = 0.3 · (1 − 0.2) = 0.24
    - Since 0.14 < 0.24, try cluster A first and then cluster B
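The arithmetic on this slide can be sketched directly; the helper name `try_order` is ours:

```python
def try_order(p_fail):
    """Order clusters so the one least likely to fail is attempted first."""
    return sorted(p_fail, key=p_fail.get)

# User-perceived failure probabilities from the slide.
p_fail = {"A": 0.2, "B": 0.3}

# Probability the first attempt fails but the fallback would have succeeded:
a_then_b = p_fail["A"] * (1 - p_fail["B"])  # 0.2 * 0.7 = 0.14
b_then_a = p_fail["B"] * (1 - p_fail["A"])  # 0.3 * 0.8 = 0.24

print(try_order(p_fail))  # ['A', 'B']
```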
  • 24.
    Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions
    - Objective
      - Reduce the impact of low reliability of compute resources
      - Deduce user-perceived reliabilities by learning from user experiences and perceptions
    - Research outline
      - Algorithm to predict user-perceived reliabilities, learning from user experiences by mining historical information
      - Use of machine learning techniques: trained classifiers represent compute resources and their reliabilities
      - Prediction of job failures
    - Algorithm assessment: workloads from the Parallel Workload Archive representing jobs executed on two different supercomputing clusters
  • 25.
    System Architecture
    - A machine learning classifier is trained to learn the user-perceived reliability of each cluster
    - Classifier types
      - Static classifier: trained initially from historical information
      - Dynamic (updateable) classifier: starts from zero knowledge and builds while the system is in operation
  • 26.
    System Architecture (continued)
    - Classifier manager uses the Weka[Hall09] framework
      - Classification methods: Naïve Bayes and KStar; static and dynamic classifiers
      - Dynamic pruning of features[Fadishei09] for increased efficiency
      - Creates and maintains classifiers for each compute resource
      - A new job is evaluated against these classifiers to deduce the predicted reliability of its execution
    - Policy implementers consider resource reliability predictions together with other quality-of-service information (time, cost) to select a resource
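A minimal sketch of a per-resource classifier manager, using a hand-rolled updateable categorical Naive Bayes in place of Weka's classifiers; the class names, feature values and outcomes are illustrative:

```python
from collections import defaultdict

class UpdateableNaiveBayes:
    """Tiny categorical Naive Bayes with Laplace smoothing; a stand-in
    for Weka's updateable classifiers, starting from zero knowledge."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def update(self, features, label):
        # Incremental training: one (job features, outcome) observation.
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][(i, value)] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / total
            for i, value in enumerate(features):
                # Laplace smoothing so unseen values never zero out a class.
                score *= (self.feature_counts[label][(i, value)] + 1) / (count + 2)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

class ClassifierManager:
    """Maintains one classifier per compute resource; a new job is
    evaluated against the resource's classifier to predict its outcome."""

    def __init__(self):
        self.per_resource = defaultdict(UpdateableNaiveBayes)

    def record(self, resource, features, outcome):
        self.per_resource[resource].update(features, outcome)

    def predict(self, resource, features):
        return self.per_resource[resource].predict(features)

manager = ClassifierManager()
for _ in range(3):
    manager.record("clusterA", ("u1", "mpi"), "success")
    manager.record("clusterA", ("u2", "serial"), "fail")
print(manager.predict("clusterA", ("u1", "mpi")))  # success
```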
  • 27.
    Evaluation
    - Workloads from the Parallel Workload Archive[Feitelson]
      - LANL: two years' worth of jobs (1994 to 1996) on the 1024-node CM-5 at Los Alamos National Lab
      - LPC: ten months' worth of job records (Aug 2004 to May 2005) on the 70 Xeon-node cluster at the Laboratoire de Physique Corpusculaire, Université Blaise Pascal, France
    - Minor cleanups to remove intermediate job states
    - 10,000 jobs were selected from each workload: LANL had 20% failed jobs; LPC had 30% failed jobs
  • 28.
    Evaluation
    - Workload classification and maintenance
      - Classifiers: Naïve Bayes[John95] and KStar[Cleary95] implementations in Weka[Hall09]
      - Classifier construction: the static classifier is trained on the first 1,000 jobs; the dynamic classifier uses all 10,000 jobs for construction and evaluation
    - Evaluation metrics
      - Average reliability prediction accuracy: accuracy of predicting job success/failure
      - Time saved: cumulative execution time saved by skipping a job that fails when our system predicted the failure correctly; the baseline measure is the ideal cumulative time that can be saved over time
      - Time consumed for classification and updating the classifier
      - Effect of pruning attributes: a static subset of attributes (as proposed by Fadishei et al.[Fadishei09]) vs. a dynamic subset (checking the effect on the goal variable)
  • 29.
    Evaluation
    - Effect of job reliability predictions on selecting compute resources
      - An extended version of GridSim[Buyya02] models four compute resources
      - NWS[Wolski99] for bandwidth estimation and QBets[Nurmi07] for queue wait time estimation
      - Total execution time = data movement time + queue wait time + job execution time (found in the workload)
    - Schedulers: total execution time priority scheduler; reliability-prediction-based time priority scheduler
    - Metrics
      - Average accuracy of selecting reliable resources to execute jobs
      - Time wasted due to incorrect selection of compute resources to execute jobs
    - All evaluations were run on a 3.0GHz dual-core processor with 4GB memory, on the Windows 7 Professional operating system
  • 30.
    Evaluation Metrics Summary
    (Figure: summary table of evaluation metrics)
  • 31.
    Results: Average Reliability Prediction Accuracy
    - LANL accuracy saturates at ~82%; LPC accuracy saturates at ~97%
    - KStar performed slightly better than Naïve Bayes
    (Figure: static and dynamic/updateable classifier accuracies for LANL and LPC)
  • 32.
    Results: Time Savings
    - With the static classifier, KStar saved 90-100%
    - With the updateable classifier: for LANL, both KStar and Naïve Bayes saved ~50%; for LPC, ~90%
    (Figure: static and dynamic/updateable classifier time savings for LANL and LPC)
  • 33.
    Results: Time Consumed for Classification and Updating the Classifier
    - Both static and updateable Naïve Bayes classifiers take very little time (not included in the graphs)
    (Figure: static and updateable classifier timings)
  • 34.
    Results: Effect of Pruning Attributes
    - A static subset of attributes ([Fadishei09]) performs poorly on this data set and classifier
    - Dynamic pruning improved prediction accuracy compared to the non-pruned case, but the improvement is marginal
    - Conclusion: our classifiers handle noisy features well without compromising classification accuracy
    - Since identifying attributes to prune is a dynamic and expensive task, the system can be used in practical cases even without pruning attributes
  • 35.
    Results: Effect of Job Reliability Predictions on Selecting Compute Resources
    - Poor performance of the execution time priority scheduler
    - After 1,000 jobs (training), the time wasted with our approach stays fairly constant
  • 36.
    Evaluation Conclusion
    - Even though the average prediction accuracy of the static KStar classifier decreased, it learned and predicted failures better than any other method
    - Even though the amount of time saved increased slightly with the updateable Naïve Bayes classifier, the amount of time saved using the static KStar classifier is comparatively higher than both methods
    - Even though its total prediction accuracy does not match the other methods, the static KStar classifier is ideal for correctly predicting failure cases, with very low overhead
    - Taking the user-perceived reliability of compute resources into consideration can save a significant amount of time in scientific job executions
  • 37.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 38.
    Scientific Computing Resource Abstraction Layer
    - Variety of scientific computing platforms and opportunities
    - Requirements
      - Support existing job description languages and be extensible to support other languages
      - Provide a uniform and interoperable interface for external entities to interact with it
      - Support heterogeneous compute resource manager interfaces and operating platforms from grids, IaaS and PaaS clouds, and departmental clusters
      - Extensibility to support new and future resource managers with minimal changes
      - Provide monitoring and fault recovery, especially when working with utility computing resources
      - Provide a light-weight, robust and scalable infrastructure
      - Integration with a variety of workflow environments
  • 39.
    Scientific Computing Resource Abstraction Layer
    - Our contribution: a resource abstraction layer implemented as a Web service
      - Provides a uniform abstraction layer over heterogeneous compute resources, including grids, clouds and local departmental clusters
      - Supports standard job specification languages including, but not limited to, the Job Submission Description Language (JSDL)[Anjomshoaa04] and the Globus Resource Specification Language (RSL); directly interacts with resource managers, so it requires no grid or meta-scheduling middleware
      - Integrates with current resource managers, including LoadLeveler, PBS, LSF and Windows HPC, and the Amazon EC2 and Microsoft Azure platforms
    - Features
      - Does not need a high level of computer science knowledge to install and maintain (use of Globus was a challenge for most non-computer-scientists)
      - Involvement of system administrators to install and maintain Sigiri is minimal
      - Memory footprint is minimal: other tools require installing most of the heavy Globus stack, but Sigiri does not require a complete stack installation to run (installing Globus on small clusters is something scientists never wanted to do)
      - Better fault tolerance and failure recovery
  • 40.
    Architecture
    - Asynchronous messaging model of message publishers and consumers
    - Daemons shadowing compute resources
    - Distributed component deployment: daemon, front-end Web service, and job queue
  • 41.
    Client Interaction Service
    - Deployed as an Apache Axis2 Web service to enable interoperability
    - Accepts job requests and enables management and monitoring functions
    - The job submission schema does not enforce a schema for the job description, enabling multiple job description languages
  • 42.
    Client Interaction Service
    (Figure: job submission request and response messages)
  • 43.
    Daemons
    - Each managed compute resource has a light-weight daemon that
      - periodically checks the job request queue
      - translates the job specification to a resource-manager-specific language
      - submits pending jobs and persists the correlation between the resource manager's job id and the internal id
    - An extensible daemon API enables integration of a wide range of resource managers while keeping the complexities of these resource managers transparent to end users
    - The queuing-based approach enables daemons to run on any compute platform, without any software or operating system requirements
    - Current support: LSF, PBS, SLURM, LoadLeveler, Amazon EC2, Windows HPC, Windows Azure
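One daemon polling cycle can be sketched as follows; the job description field names (`name`, `nodes`, `walltime`, `command`), the PBS translation and the `submit` callable are illustrative stand-ins, not Sigiri's actual API:

```python
def to_pbs_script(job):
    """Render a generic job description as a PBS batch script.
    Field names are illustrative, not Sigiri's actual job schema."""
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -N {job['name']}",
        f"#PBS -l nodes={job['nodes']}",
        f"#PBS -l walltime={job['walltime']}",
        job["command"],
    ])

def daemon_cycle(pending, submit, id_map):
    """One polling cycle: drain the job request queue, submit each job via
    the resource-manager-specific `submit` callable, and persist the
    mapping from the internal id to the resource manager's job id."""
    while pending:
        job = pending.pop(0)
        id_map[job["id"]] = submit(to_pbs_script(job))
    return id_map

ids = daemon_cycle(
    [{"id": "sigiri-1", "name": "wrf", "nodes": 4, "walltime": "01:00:00",
      "command": "mpirun ./wrf.exe"}],
    submit=lambda script: "pbs-1001",  # stand-in for calling `qsub`
    id_map={},
)
print(ids)  # {'sigiri-1': 'pbs-1001'}
```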
  • 44.
    Integration of Cloud Computing Resources
    - A unique set of dynamically loaded and configured extensions to handle security, schedule jobs and perform required data movements
    - Enables scientists to interact with multiple cloud providers within the same system
    - Features
      - Extensions can be written as modules independent of other extensions, typically to carry out a single task
      - Enforced failure handling to prevent orphaned VMs and resources
  • 45.
    Security
    - Client security (between the client and the Web service layer)
      - Support for both transport-level security (using SSL) and application-layer security (using WS-Security)
      - Client negotiation of security credentials with WS-Security policy support within Apache Axis2
    - Compute resource security
      - The system can store different types of security credentials: username/password combinations, X.509 credentials
  • 46.
    Performance Evaluation: Test Scenarios
    - Case 1: jobs arrive at our system as a burst of concurrent submissions from a controlled number of clients
      - Each client waits for all jobs to finish before submitting the next set of jobs
      - For example, in the test with 100 clients, each client sends 1 job to the server, so 100 jobs arrive at the server in parallel
    - Case 2: each client submits 10 jobs with varying execution times in sequence, with no delay between submissions
      - The client does not block upon submission of a job
      - Failure rate and server performance, from the clients' point of view, are measured while the number of simultaneous clients is systematically increased
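The Case 1 burst scenario can be sketched as a small test harness; `submit_job` is a stand-in for a call to the Sigiri Web service:

```python
import threading
import time

def run_burst(num_clients, submit_job):
    """Case 1 harness: num_clients clients each submit one job
    concurrently, and we record each client's observed response time."""
    results = [None] * num_clients

    def client(i):
        start = time.perf_counter()
        submit_job(i)  # stand-in for the Web service job submission call
        results[i] = time.perf_counter() - start

    threads = [threading.Thread(target=client, args=(i,))
               for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:  # wait for the whole burst before the next round
        t.join()
    return results

response_times = run_burst(10, lambda i: None)
print(len(response_times))  # 10
```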
  • 47.
    Performance Evaluation: Baseline Measurements
    (Figure: baseline measurement charts)
  • 48.
  • 49.
    Performance Evaluation: Scalability Metrics
    (Figure: scalability metric charts)
  • 50.
    Performance Evaluation: Experimental Setup
    - Daemon hosted on the gatekeeper node (quad-core IBM PowerPC, 1.6GHz, 8GB physical memory) of the Big Red cluster
    - Web service and database co-hosted on a machine with four 2.6GHz dual-core processors and 32GB of RAM
    - Neither node was dedicated to our experiment while the tests were running
    - Client environment
      - Set up within the 128-node Odin cluster (each node a dual AMD 2.0GHz Opteron processor with 4GB physical memory)
      - All client nodes were used in dedicated mode, and each client ran in a separate Java virtual machine to eliminate any external overhead
    - Data collection
      - Each test was run (number of clients × 10) times and the results were averaged
      - Each parameter was tested for 100 to 1000 concurrent clients; a total of 110,000 tests were run
      - GRAM4 experiment results from the GRAM4 evaluation paper[Marru08] were used for system performance comparison
  • 51.
    ResultsThesis Defense -Eran Chinthaka Withana51Baseline MeasurementsAll overheads scaling proportional to number of clientsNo failuresCase 1Case 2
  • 52.
    Results: Metrics for Test Cases 1 and 2
    - Both response time and total overhead scale proportionally to the number of clients
    - No failures
  • 53.
    Results: Scalability Metrics
    - No failures with Sigiri
    - Failures starting from 300 clients for GRAM
    (Figure: Case 1 and Case 2)
  • 54.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 55.
    Applications: LEAD
    - Motivations
      - Grid middleware reliability and scalability study[Marru08] and workflow failure rates
      - Components of the LEAD infrastructure were considered for adaptation to other scientific environments
      - Sigiri was initially prototyped to support LoadLeveler, PBS and LSF
    - Implications
      - Improved workflow success rates
      - Mitigated the need for Globus middleware
      - Ability to work with non-standard job managers
  • 56.
    Applications: LEAD II
    - Emergence of community-driven, production-quality workflow infrastructures
      - E.g., the Trident Scientific Workflow Workbench with Workflow Foundation
    - Possibility of using alternate supercomputing resources
      - E.g., the recent port of the WRF (Weather Research and Forecasting) model to the Windows platform and Azure
    - Support for Windows-based scientific computing environments
  • 57.
    Background: LEAD II and the Vortex2 Experiment
    - May 1, 2010 to June 15, 2010: ~6 weeks, 7 days per week
    - Workflow started on the hour, every hour, each morning
    - Had to find and bind to the latest model data (i.e., RUC 13km and ADAS data) to set initial and boundary conditions; if the model data was not available at NCEP and the University of Oklahoma, the workflow could not begin
    - Execution of the complete WRF stack within 1 hour
  • 58.
    Trident Vortex2 Workflow
    - The bulk of the time (50 min) was spent in the LEAD Workflow Proxy Activity
    (Figure: workflow with the Sigiri integration point)
  • 59.
    Applications: Enabling Geo-Science Applications on Windows Azure
    - Geo-science applications have high resource requirements
      - Compute intensive, dedicated HPC hardware; e.g., the Weather Research and Forecasting (WRF) model
    - Emergence of ensemble applications: large numbers of small jobs
      - E.g., examining each air layer over a long period of time; a single experiment = about 14,000 jobs, each taking a few minutes to complete
  • 60.
    Geo-Science Applications: Opportunities
    - Cloud computing resources
      - On-demand access to "unlimited" resources
      - Flexibility: worker roles and VM roles
    - Recent porting of geo-science applications: WRF and the WRF Preprocessing System (WPS) ported to Windows
    - Increased use of ensemble applications (large numbers of small runs)
    - Production-quality, open-source scientific workflow systems, e.g., Microsoft Trident
  • 61.
    Research Vision
    - Enabling geo-science experiments
      - Types of applications: compute intensive, ensembles
      - Types of scientists: meteorologists, atmospheric scientists, emergency management personnel, geologists
    - Utilizing both cloud computing and grid computing resources
    - Utilizing open-source, production-quality scientific workflow environments
    - Improved data and metadata management
    (Figure: geo-science applications, scientific workflows, compute resources)
  • 62.
    Proposed Framework
    (Figure: a Trident activity calls the Sigiri Web service, which feeds a job queue; Sigiri job management daemons use the Azure Management API to drive the Azure fabric; custom Azure VM images run IIS, WRF, the Sigiri worker service and MS-MPI on Windows 2008 R2; data is staged through the Azure Blob Store)
  • 63.
    Applications: PRAGMA Testbed Support
    - Pacific Rim Applications and Grid Middleware (PRAGMA)[Zheng06]: an open international organization founded in 2002 to focus on the practical issues of building international scientific collaborations
    - In 2010, Indiana University (IU) joined PRAGMA and added a dedicated cluster to the testbed; Sigiri was used within the IU PRAGMA testbed
      - The IU PRAGMA testbed required a light-weight system that could be installed and maintained with minimal effort
      - The IU PRAGMA team wanted to evaluate adding cloud resources to the testbed with little or no changes to interfaces
    - In 2011, the PRAGMA - Opal - Sigiri integration was demonstrated successfully
  • 64.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 65.
    Related Work
    - Scientific job management systems
      - Grid Resource Allocation and Management (GRAM)[Foster05], Condor-G[Frey02], Nimrod/G[Buyya00], GridWay[Huedo05], SAGA[Goodale06] and Falkon[Raicu07] provide uniform job management APIs, but are tightly integrated with complex middleware to address a broad range of problems
      - The Carmen[Watson81] project provided a cloud environment that enabled collaboration between neuroscientists, but requires all programs to be packaged as WS-I[Ballinger04] compliant Web services
      - Condor[Frey02] pools can also be utilized to unify certain compute resource interactions, but Condor uses the Globus toolkit[Foster05] (and GRAM underneath), has poor failure recovery, and overlooks the failure modes of a cloud platform
  • 66.
    Related Work
    - Scientific research and cloud computing: IaaS, PaaS and SaaS environment evaluations
      - Scientists have mainly evaluated the use of IaaS services for scientific job executions[Abadi09][Hoffa08][Keahey08][Yu05]: ease of setting up custom environments, and control
      - Growing interest in using PaaS services[Humphrey10][Lu10][Qiu09]
      - Optimizations to balance cost and time of executions[Deelman08][Yu05]
      - Startup overheads[Chase03][Figueiredo03][Foster06][Sotomayor06][Keahey07]
    - Job prediction algorithms
      - Prediction of execution times[Smith], job start times[Li04], queue wait times[Nurmi07] and resource requirements[Julian04]; AI-based and statistical-modeling-based approaches
      - AppLeS[Berman03] argues that a good scheduler must involve some prediction of application and system performance
    - Reliability of compute resources
      - Birman[Birman05] and the aspects of resources causing system reliability issues
      - Statistical modeling to predict failures[Kandaswamy08]
  • 67.
    Outline
    - Mid-Range Science: Challenges and Opportunities; Current Landscape
    - Research: Research Questions; Contributions
    - Mining Historical Information to Find Patterns and Experiences
      - Usage Patterns to Provision for Time Critical Scientific Experimentation in Clouds [Contribution 1]
      - Using Reliability Aspects of Computational Resources to Improve Scientific Job Executions [Contribution 2]
      - Uniform Abstraction for Large-Scale Compute Resource Interactions [Contributions 3, 4]
    - Applications
    - Related Work
    - Conclusion and Future Work
  • 68.
    Conclusion
    - User inspired management of scientific jobs
      - Concentrates on the identification of user patterns and perceptions: harnesses historical information and applies the knowledge gained to improve scientific job executions
      - Argues that patterns, if identified based on individual users, can reveal important information for making sophisticated estimations of resource requirements
      - Evaluations demonstrate the usability of the predictions for a meta-scheduler, especially one integrated into a community gateway, to improve its scheduling decisions
    - Resource abstraction service
      - Helps mid-scale scientists obtain access to resources that are cheap and available, with a tool that is easy to set up and administer
    - The prototype implementations introduced and discussed are integrated and used in different domains and scientific applications; the applications demonstrate how our research contributed to advancing science in the respective domains
  • 69.
    Contributions
    - Propose and empirically demonstrate user patterns, deduced by knowledge-based approaches, to provision compute resources, reducing the impact of startup overheads in cloud computing environments.
    - Propose and empirically demonstrate user-perceived reliability, learned by mining historical job execution information, as a new dimension to consider during resource selection.
    - Propose and demonstrate the effectiveness and applicability of a light-weight and reliable resource abstraction service to hide the complexities of interacting with multiple resource managers in grids and clouds.
    - Prototype implementation to evaluate the feasibility and performance of the resource abstraction service, integrated with four different application domains to prove its usability.
  • 70.
    Future Work
    - Short-term research directions
      - Integration of future job predictions and user-perceived reliability predictions
      - Evolving the resource abstraction service to support more compute resources
      - Management of ensemble runs
      - Fault tolerance with proactive replication
    - Long-term research directions
  • 71.
    Thank You!!