Towards Autonomic Grids


  1. 1. Towards Autonomic Grids. Cécile Germain-Renaud, Laboratoire de Recherche en Informatique, Université Paris-Sud - CNRS - INRIA
  2. 2. e-science infrastructures. 2003 NSF Atkins Report: Revolutionizing Science and Engineering through Cyberinfrastructure. Grids of computational centers; comprehensive libraries of digital objects; well-curated collections of scientific data; online instruments and vast sensor arrays; convenient software toolkits.
  3. 3. e-science infrastructures. 2003 NSF Atkins Report: Revolutionizing Science and Engineering through Cyberinfrastructure. Grids of computational centers; comprehensive libraries of digital objects; well-curated collections of scientific data; online instruments and vast sensor arrays; convenient software toolkits. Callout: the largest (circ. 26 km), fastest (14 TeV), coldest (1.9 K), emptiest ($10^{-13}$ atm) machine.
  4. 4. e-science infrastructures. 2003 NSF Atkins Report: Revolutionizing Science and Engineering through Cyberinfrastructure. Grids of computational centers; comprehensive libraries of digital objects; well-curated collections of scientific data; online instruments and vast sensor arrays; convenient software toolkits. Callout: storage and analysis of 15 PB/year.
  5. 5. e-science infrastructures. 2003 NSF Atkins Report: Revolutionizing Science and Engineering through Cyberinfrastructure. Grids of computational centers; comprehensive libraries of digital objects; well-curated collections of scientific data; online instruments and vast sensor arrays; convenient software toolkits. Callout: the largest (40000 CPUs), most complex (200 VOs), most distributed (250 sites), most used (300K jobs/day) computing machine.
  6. 6. How we configure our grids. Courtesy of James Casey, talk at EGEE09.
  7. 7. Outline: (1) The grid ecosystem; (2) Grids and Autonomic Computing; (3) The Grid Observatory; (4) Learning grid models: on-line fault detection, model selection; (5) Model-free policies: policy evaluation, reinforcement learning for responsive grids.
  8. 8. [The grid ecosystem | Grids and Autonomic Computing | The Grid Observatory | Learning grid models | Model-free policies] e-science infrastructures. The classical definition of grids: "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities." (I. Foster, C. Kesselman, The Grid, 1998). An old dream: UCLA press release on the creation of Arpanet, 1969.
  9. 9. The niches in the ecosystem
  10. 10. Grids are not about technology, but about sharing. Ian Foster's definition (2000): Grids are defined by coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form a virtual organization. Consumers: large-scale international collaborations; different users with differentiated requirements across and within the collaborations.
  11. 11. Grids are not about technology, but about sharing. Ian Foster's definition (2000): Grids are defined by coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form a virtual organization. Providers: national and regional institutions, organized in National Grid Initiatives, coordinated by EGI.
  12. 12. Grids are not about technology, but about sharing. Ian Foster's definition (2000): Grids are defined by coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form a virtual organization. Operators: local sites, with temporary EU support (EGI-Inspire); configuration, prioritization, monitoring, accounting, ...
  13. 13. Do Datacenters and Cloud make Grid obsolete?
  14. 14. *-aaS. Courtesy of William Vambenepe, slides from the Cloud Connect keynote: Freeing SaaS from Cloud.
  15. 15. Grids and Clouds. IaaS: on-demand, elastic, virtualization-based provisioning. A single-objective optimization target: pay less by turning resources on and off at the scale of minutes rather than days or weeks. Convergence path: Grids over Clouds or Clouds of Grids? EU project StratusLab. SaaS: the core of the IT process lies in deploying and orchestrating heterogeneous software components, and having them "in the cloud" does not help much.
  16. 16. Autonomic Computing. Computing systems that manage themselves in accordance with high-level objectives from humans (Kephart and Chess, The Vision of Autonomic Computing, IEEE Computer 2003). Autonomic vision & manifesto: http://www.research.ibm.com/autonomic/manifesto/ Relation with Machine Learning: I. Rish tutorial at ECML 2006. A self-managing system has the ability of: self-healing (detect, diagnose and repair failures); self-configuring (automatically incorporate and configure components); self-optimizing (ensure the optimal functioning wrt. high-level requirements); self-protecting (anticipate and defend against security breaches). All of this on dynamical, non-steady-state systems.
  17. 17. Autonomic Computing
  18. 18. Autonomic Computing
  19. 19. Autonomic Grids. Emerging behaviour as the result of sites' and stakeholders' decisions. Coupled usage: Virtual Organizations, community software and activity. Feedback loops in the middleware. Incomplete and noisy information. We need: inference of models for middleware components and applications, users and usage profiles, user interactions, inconsistencies; self-configuration and self-optimization for management policies; self-healing across middleware and applications.
  20. 20. Goals. Grid digital assets curation: collecting verifiable digital assets; providing digital asset search and retrieval; certification of the trustworthiness and integrity of the collection content; semantic and ontological continuity and comparability of the collection. Building the domain knowledge: dimensionality and volume reduction (getting rid of the massive redundancy in operational logs); answering operational issues; descriptive/generative/predictive models; design and validation of model-free policies.
  21. 21. Support and collaborations
  22. 22. Methods. Focused on EGEE/EGI (www.grid-observatory.org): the best approximation of the current needs of e-science; extensive monitoring facilities. Traces were discarded after operational usage, and in any case were not available to the scientific community. They are now available without a grid certificate.
  23. 23. Methods. Focused on EGEE/EGI: the best approximation of the current needs of e-science; extensive monitoring facilities. Traces were discarded after operational usage, and in any case were not available to the scientific community. They are now available without a grid certificate.
  24. 24. Grids are complex systems. Users/Files/Clients worker nodes graph, displayed with AVIZ GraphDice.
  25. 25. Grids are complex systems. Users in green, file groups in purple. Rightmost is most "active". See also [Lovro Iliasic, PhD, Computational Grids as Complex Networks].
  26. 26. Issues. Large non-stationary system (courtesy of M. Lassnig et al., Austrian Grid Symp. 09). Trends; academic events; scientific events; software events.
  27. 27. On-line fault detection: Abrupt changepoint detection. Page-Hinkley statistics detect jumps in the mean of $p_t$, a variable with a changing distribution: $\bar{p}_t = \frac{1}{t}\sum_{\ell=1}^{t} p_\ell$; $m_t = \sum_{\ell=1}^{t} (p_\ell - \bar{p}_\ell + \delta)$; $M_t = \max\{m_\ell,\ \ell = 1 \ldots t\}$; $PH_t = M_t - m_t$. CUSUM test: if $PH_t > \lambda$, a change is detected. First application: blackhole detection. Validation requires expert interpretation.
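
The Page-Hinkley test on this slide can be sketched in a few lines. This is a minimal illustration, not the Grid Observatory implementation: the function name `page_hinkley` and the default values of `delta` and `lam` are assumptions chosen for the example.

```python
def page_hinkley(stream, delta=0.01, lam=5.0):
    """Return the 1-based time t of the first detected change, or None.

    delta (tolerance) and lam (threshold) are illustrative defaults,
    not values taken from the slides.
    """
    total = 0.0            # running sum of p_1..p_t
    m_t = 0.0              # cumulative deviation m_t
    max_m = float("-inf")  # M_t = max of m_1..m_t
    for t, p in enumerate(stream, start=1):
        total += p
        mean = total / t           # running mean p_bar_t
        m_t += p - mean + delta
        max_m = max(max_m, m_t)
        if max_m - m_t > lam:      # PH_t > lambda
            return t
    return None
```

With this sign convention the test reacts to a drop in the mean; a stream that stays at its mean never triggers it, because $m_t$ then only grows and $PH_t$ stays at 0.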
  28. 28. StrAP: On-line clustering, aka Streaming Affinity Propagation. AP [Frey2007]: a statistical physics algorithm for clustering (based on message passing); a cluster = an exemplar (akin to k-centers); the model = set of {exemplar, frequency}. Why AP? Traceability: real jobs as exemplars, because of categorical variables, e.g., userid, queue name, etc.; no prior knowledge of K, the number of clusters; quasi-optimality wrt. information loss, hence stability [Meila2006].
  29. 29. From AP to Large-scale Data Streaming. (1) Scalability: from $O(N^2 \log N)$ to $O(N^{\frac{h+2}{h+1}})$ with Hierarchical Affinity Propagation; negligible information loss (proof in the paper).
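
As a quick reading of the exponent in the Hierarchical AP bound (plain arithmetic on the formula above, taking $h$ as the hierarchy parameter that appears in it):

```latex
\[
h = 1:\ O\!\left(N^{3/2}\right), \qquad
h = 2:\ O\!\left(N^{4/3}\right), \qquad
\lim_{h \to \infty} \frac{h+2}{h+1} = 1 .
\]
```

So the cost approaches linear in $N$ as $h$ grows, compared with the $O(N^2 \log N)$ of plain AP.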
  30. 30. From AP to Large-scale Data Streaming. (2) Non-stationary distribution: various Virtual Organizations, number and expertise of users. Solution: Streaming AP (StrAP).
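
A rough sketch of the streaming step, assuming the model is the set of {exemplar, frequency} pairs described on the StrAP slide. The distance threshold, the function name `stream_update`, and the list-based data layout are hypothetical; the slides do not specify them, and the model rebuild triggered by the change test is left out.

```python
def stream_update(model, reservoir, item, distance, threshold):
    """Assign one incoming item to the model or to the reservoir.

    model: list of [exemplar, count] entries (hypothetical layout).
    distance: caller-supplied dissimilarity function.
    threshold: hypothetical cut-off for "fits an existing exemplar".
    """
    if model:
        # Nearest exemplar under the supplied distance.
        best = min(model, key=lambda entry: distance(item, entry[0]))
        if distance(item, best[0]) <= threshold:
            best[1] += 1          # update the exemplar's frequency
            return "model"
    reservoir.append(item)        # outlier: keep it for a later rebuild
    return "reservoir"
```

Items that fit no exemplar accumulate in the reservoir; in this sketch the caller decides when to rebuild the model from them.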
  31. 31. Adaptive change detection test. Self-adapting $\lambda$ is an optimization problem. BIC: $F_\lambda = \frac{1}{|C|}\sum_{i=1}^{|C|} \frac{1}{n_i} \sum_{e_j \in C_i} d(e_j, e_i^*) + \frac{1}{2}\,\varphi\,\rho \log N + \eta\, O_t$, i.e. $F_\lambda \propto$ loss + size of model + fraction of outliers. Optimization: ε-greedy search from a finite set of $\lambda$ values $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \ldots$ with estimates $E(F_{\lambda_1}), E(F_{\lambda_2}), E(F_{\lambda_3}), E(F_{\lambda_4}), \ldots$, taking $\lambda = \arg\min \{E(F_\lambda)\}$; or Gaussian Process Regression based on $\{\lambda_i, F_{\lambda_i}\}$, which generates a continuous value of $\lambda$.
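
The ε-greedy search over a finite set of thresholds can be illustrated as follows. This is a sketch under assumptions: the class name `EpsilonGreedyLambda`, the incremental-mean estimate of $E(F_\lambda)$, and the rule of trying every candidate once first are choices made for the example, not details given on the slide. The Gaussian Process variant is not covered.

```python
import random


class EpsilonGreedyLambda:
    """Pick a threshold from a finite candidate set by epsilon-greedy search."""

    def __init__(self, candidates, epsilon=0.1, seed=0):
        self.candidates = list(candidates)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {lam: 0 for lam in self.candidates}
        self.mean_cost = {lam: 0.0 for lam in self.candidates}

    def select(self):
        # Try each candidate once before exploiting (an assumption).
        for lam in self.candidates:
            if self.counts[lam] == 0:
                return lam
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.candidates)   # explore
        return self.best()                            # exploit

    def update(self, lam, cost):
        # Incremental mean of the observed objective F_lambda.
        self.counts[lam] += 1
        self.mean_cost[lam] += (cost - self.mean_cost[lam]) / self.counts[lam]

    def best(self):
        tried = [lam for lam in self.candidates if self.counts[lam] > 0]
        return min(tried, key=lambda lam: self.mean_cost[lam])
```

In use, each `select()` would be followed by running the change test with that threshold, measuring $F_\lambda$, and calling `update()` with the result.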
  32. 32. G-StrAP: A Grid Dashboard. [Figure, two panels: "Online Monitoring" and "Off-line Analysis"; y-axis: percentage of jobs assigned (%); x-axis: clusters, plus the reservoir. Annotations: "exemplar shown as a job vector"; "LogMonitor is getting clogged".]
  33. 33. Model Selection: The Piecewise Autoregressive model. AR process: $X_t = \gamma + \varphi_1 X_{t-1} + \ldots + \varphi_p X_{t-p} + \varepsilon_t$. Parameters of the piecewise AR model: number of segments $m$; breakpoint locations / segment sizes $(n_j)_{j=1 \ldots m}$; AR orders $(p_j)_{j=1 \ldots m}$; AR parameters $(\Psi_j)_{j=1 \ldots m}$. Very large model space. Example: segment 1, $0 < t \le 512$: $X_t = 0.9 X_{t-1} + \varepsilon_t$; segment 2, $512 < t \le 768$: $X_t = 1.69 X_{t-1} - 0.81 X_{t-2} + \varepsilon_t$; segment 3, $768 < t \le 1024$: $X_t = 1.32 X_{t-1} - 0.81 X_{t-2} + \varepsilon_t$.
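
The three-segment example on this slide can be simulated directly. A minimal sketch: the function name `simulate_piecewise_ar`, the zero intercept, and the unit-variance Gaussian noise are assumptions (the slide gives only the AR coefficients and breakpoints).

```python
import random


def simulate_piecewise_ar(segments, sigma=1.0, seed=0):
    """Simulate a piecewise AR series with zero intercept.

    segments: list of (end_index, [phi_1, ..., phi_p]); each segment runs
    from the previous end (or 0) up to end_index, exclusive, zero-based.
    sigma: noise standard deviation (assumed, not given on the slide).
    """
    rng = random.Random(seed)
    x = []
    start = 0
    for end, coeffs in segments:
        for _ in range(start, end):
            value = rng.gauss(0.0, sigma)          # innovation epsilon_t
            for lag, phi in enumerate(coeffs, start=1):
                if len(x) >= lag:                  # skip lags before t = 1
                    value += phi * x[-lag]
            x.append(value)
        start = end
    return x


# Coefficients and breakpoints as listed on the slide.
SLIDE_EXAMPLE = [(512, [0.9]), (768, [1.69, -0.81]), (1024, [1.32, -0.81])]
series = simulate_piecewise_ar(SLIDE_EXAMPLE)
```

Plotting `series` should make the breakpoints at 512 and 768 visible as changes in the oscillation pattern.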
  34. 34. Minimum Description Length model selection for PAR [Davis, Lee, Rodriguez-Yam, J. American Statist. Assoc. 2006]. The MDL principle: the best-fitting model is the one that produces the shortest code length that completely describes the observed data $y$. $CL_F(y) = CL_F(\hat{F}) + CL_F(\hat{e} \mid \hat{F})$, where $CL_F(\hat{F})$ is the description of the model and $CL_F(\hat{e} \mid \hat{F})$ is the description of the residuals, i.e. what is not explained by the model. $CL = \log m + (m+1) \log n + \sum_{j=1}^{m+1} \left[ \log p_j + \frac{p_j + 2}{2} \log n_j + \frac{n_j}{2} \log(2\pi \hat{\sigma}_j^2) \right]$.
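
The code-length formula can be evaluated for a candidate segmentation as below. A sketch under assumptions: natural logarithms are used, the sum runs over $m+1$ segments as written in the formula, and the function requires $m \ge 1$ and $p_j \ge 1$ so that every logarithm is defined. The name `mdl_code_length` is hypothetical.

```python
import math


def mdl_code_length(n, segments):
    """Code length CL for a piecewise AR fit.

    n: total series length.
    segments: list of (n_j, p_j, sigma2_j) for the m + 1 segments, where
    sigma2_j is the estimated innovation variance of segment j.
    """
    m = len(segments) - 1
    if m < 1:
        raise ValueError("this sketch needs at least two segments (m >= 1)")
    total = math.log(m) + (m + 1) * math.log(n)
    for n_j, p_j, sigma2_j in segments:
        if p_j < 1:
            raise ValueError("this sketch assumes AR orders p_j >= 1")
        total += math.log(p_j)
        total += (p_j + 2) / 2 * math.log(n_j)
        total += n_j / 2 * math.log(2 * math.pi * sigma2_j)
    return total
```

Comparing the value across candidate segmentations gives the selection criterion: the smaller code length wins.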
  35. 35. Results on the workload processes. Workload: the amount of unterminated work in the system. Smoothed workload difference. Typically low-order AR models; long segments. [T. Éltető et al., Discovering Piecewise Linear Models of Grid Workload, CCGrid 2010]
       CE     No. of segments   Segment start [days]   Segment end [days]   Smallest root abs. value   Ljung-Box test on residuals (p-value)
       CE-A   18                158.91                 196.53               1.5915                     0.05
       CE-B   19                109.61                 160.65               2.1563                     0.04
       CE-C   17                104.86                 149.31               5.5711                     0.21
       CE-D   27                151.39                 190.16               1.1062                     0.05
  36. 36. Model validation. PAR: Ljung-Box test (whiteness of the AR residuals). Stability: bootstrapping (stable breakpoints).
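
For reference, the Ljung-Box statistic used for the whiteness check can be computed as follows. This sketch returns only the statistic $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2 / (n-k)$; turning it into a p-value requires a chi-square distribution function, which is left out here. The function name is hypothetical.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [x - mean for x in residuals]
    denom = sum(x * x for x in centered)
    acc = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        num = sum(centered[t] * centered[t - k] for t in range(k, n))
        rho_k = num / denom
        acc += rho_k ** 2 / (n - k)
    return n * (n + 2) * acc
```

A large Q relative to the chi-square quantile for the chosen number of lags indicates that the residuals are not white, so the fitted AR segment leaves structure unexplained.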
  37. 37. Model reconciliation: bootstrap aggregation. Outcome: a simple and robust model describing the essential part of the workload process.
  38. 38. Policy evaluation: Evaluation of the matchmaking scheduling policy. ART: Actual Response Time = queuing delay at the CE. ERT: Expected Response Time (Copernican principle, gLite). Question: how good is the prediction? Question: what is your definition of a good predictor? Root Mean Squared Error? Close statistical distribution, at normal regime, in the tail? Correlation of time series? ROC (Receiver Operating Characteristic): cost-benefit relation. Heterogeneous data.
  39. 39. Evaluation of the matchmaking scheduling policy. Overall: the distributions are not consistent. RMSE: Atlas 7.94E4, Biomed 7.2E3. Correlation (subsampling at 900 s) is not convincing.
  40. 40. Evaluation of the matchmaking scheduling policy. À la BQP (Batch Queue Predictor): how often does the prediction lie within a reasonable distance of the actual? Modified because BQP considers only upper bounds. ERT is a classifier; the classes are intervals of the value range, of exponentially increasing size. ROC: True Positive Rate vs False Positive Rate.
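
Treating ERT as a classifier over exponentially growing intervals can be sketched as below. The interval bounds are assumptions: the slides do not give them, so the first bound (`first=1.0` second) and the growth factor (`base=2.0`) are hypothetical, as are the function names.

```python
def interval_index(value, first=1.0, base=2.0):
    """Class 0 is [0, first); class k >= 1 is [first*base**(k-1), first*base**k)."""
    k = 0
    upper = first
    while value >= upper:
        upper *= base
        k += 1
    return k


def tpr_fpr(ert, art, cls, first=1.0, base=2.0):
    """True/false positive rates of 'ERT falls in class cls' against ART."""
    tp = fp = fn = tn = 0
    for predicted, actual in zip(ert, art):
        pred_pos = interval_index(predicted, first, base) == cls
        act_pos = interval_index(actual, first, base) == cls
        if pred_pos and act_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif act_pos:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Each class yields one (FPR, TPR) point; collecting the points over all classes gives a per-interval view of how often the prediction lands in the right range.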
  41. 41. Reinforcement learning for responsive grids: Reinforcement learning for resource provisioning in grids. A multi-objective scheduling and dimensioning problem. Users: differentiated QoS. Stakeholders: fairness. Administrators: utilization. [Figure: probability vs. execution time [s], logarithmic time axis; series: all data, atlas, biomed.] Goals. Elastic resource provisioning: the context is Grids over Clouds, Infrastructure as a Service (IaaS). Realistic hypotheses: organized sharing and mutualization, no central control. Autonomics: model-free policies and configuration-free implementations.
  42. 42. Formalisation. The scheduling MDP: state = descriptive variables of a site (queue, cluster); action = descriptive variables of a job (VO, execution time). The dimensioning MDP: action = number of computing nodes to maintain in activity. Policy learning: SARSA algorithm. Continuous state-action space: non-linear regression of $Q : (s, a) \to r$, with a Neural Network and an Echo State Network.
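
For readers unfamiliar with SARSA, the update rule is shown here in its tabular form. This is a deliberate simplification: the slides approximate Q over a continuous state-action space with a neural network or an Echo State Network, whereas this sketch uses a dictionary. The values of `alpha` and `gamma` and the state and action labels in the test are hypothetical.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One on-policy SARSA step on a tabular Q function.

    q: mapping (state, action) -> value, e.g. defaultdict(float).
    alpha (learning rate) and gamma (discount) are illustrative defaults.
    """
    # Target uses the action actually chosen in the next state.
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)
```

Replacing the table lookup with a regression model fitted on (state, action, target) triples gives the function-approximation setting the slide describes.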
  43. 43. The Rewards. The Responsiveness utility for job $j$ is $W_j = \frac{\text{execution time}_j}{\text{execution time}_j + \text{waiting time}_j}$ (1). The Fairness utility for job $j$ is $F_j = 1 - \frac{\max_k (w_k - S_{kj})^+}{M}$ (2), where $x^+ = x$ if $x > 0$ and 0 otherwise, $w_k$ is the target share of VO $k$, and $S_{kj}$ is the share received by VO $k$ up to the election of job $j$. The Utilization reward $U_n$ at time $T_n$ is $U_n = \frac{f_n}{\sum_{k=0}^{n} P_k (T_{k+1} - T_k)}$ (3), where $(T_1, \ldots, T_N)$ are the instants of decision making, $P_k$ is the number of processors allocated in the interval $[T_k, T_{k+1}]$ for $1 \le n < N$, and $f_n$ is the sum of the execution times of jobs completed at time $T_n$.
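
The three rewards translate into short functions. A sketch under assumptions: the normalization constant $M$ is not defined on the slide, so it is passed in by the caller; the utilization sum runs over whatever intervals the caller supplies; the function names are hypothetical.

```python
def responsiveness(execution_time, waiting_time):
    """Equation (1): share of the response time spent executing."""
    return execution_time / (execution_time + waiting_time)


def fairness(target_shares, received_shares, m_norm):
    """Equation (2): 1 minus the largest positive share deficit over M.

    m_norm stands for the constant M, which the slide leaves undefined.
    """
    worst_deficit = max(
        max(0.0, w - s) for w, s in zip(target_shares, received_shares)
    )
    return 1.0 - worst_deficit / m_norm


def utilization(completed_work, processors, times):
    """Equation (3): completed work over allocated processor-time.

    processors[k] is P_k on the interval [times[k], times[k+1]].
    """
    allocated = sum(
        p * (times[k + 1] - times[k]) for k, p in enumerate(processors)
    )
    return completed_work / allocated
```

All three values are dimensionless ratios, which makes it straightforward to combine them into a single weighted reward if a scalar objective is needed.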
  44. 44. Experimental results on EGEE traces. [Figure: queuing delays, interactive jobs, rigid; CDF vs. queueing delay (sec), logarithmic axis.] [Figure: queuing delays, interactive jobs, elastic; CDF vs. queueing delay (sec), logarithmic axis.] [Figure: dynamics of the fairshare, all jobs, rigid; fairshare difference vs. arrival times (sec).] Legend entries: EGEE-INTER, ORA-INTER-0.5, ORA-INTER-1.0, EST-INTER-0.5, EST-INTER-1.0, ELA-ORA-0.5, ELA-ORA-1.0, ELA-EST-0.5, ELA-EST-1.0, RIG-ORA-0.5, RIG-ORA-1.0, ELA-ORA-0.5 − EGEE. [J. Perez et al., JoGC 8/3, Sep. 2010]
  45. 45. Conclusion
