The Role of Predictive Methods in Autonomic Computing April 27, 2005 Ric Telford Director of Architecture and Development,...
  • “IBM’s autonomic computing initiative will become its most important cross-product initiative (as the foundation of On Demand Business).” – Thomas Bittman, Gartner

    The Autonomic Vision. The term Autonomic Computing is derived from the autonomic nervous system. Our vision is to deliver intelligent, open systems that can manage themselves. The core benefits of autonomic computing are improved resiliency, the ability to deploy new capabilities more rapidly, and increased return from IT investments.

    Think about when you get hot. Your body automatically starts sweating to cool off. You don’t think “Gee I’m hot, I need to sweat.” This is all controlled by your autonomic nervous system. In fact, I’m sweating right now. And I assure you I didn’t plan it that way.

    What we’re trying to do with autonomic computing is implement systems which operate in an automatic way, and free the staff to focus on more strategic and higher-level issues. We’d like to achieve intelligent, open systems that can mask the complexity we talked about earlier, systems that have some knowledge of themselves. Today the different elements of a system are independent, and none of them seems to recognize being part of the whole environment that you are trying to deal with. And when things break or there are problems, it’s not at all clear how they fit into that mosaic.

    So you’d like the system to have some sense of itself. You’d like it to configure and reconfigure under varying and unpredictable conditions. You’d like it to continually tune itself as workloads change and take best advantage of the available resources. You’d like to be able to prevent failures, and recover quickly and easily if they do occur. And you’d also like to make sure your systems and data are protected against attacks, whether they are viruses or denial-of-service attacks, which is an increasing issue in our world: data shows that threats of malicious attacks are increasing each year.

    Addressing all of these things will deliver value for customers: first, reducing risk, cost and down-time and improving your productivity; second, improving the resiliency of your IT environment from the standpoint of availability and protection from attacks, giving a more stable, reliable environment; and finally, accelerating the implementation of new capabilities, with increased responsiveness and efficiency.

    Key points: So what does it mean for an IT environment to be autonomic? To net it out, it means that systems are self-configuring, self-healing, self-optimizing and self-protecting.

    Systems that are self-configuring have the ability to adapt dynamically to changing environments, to add and remove components to and from the systems, and to change the environment as necessary depending on workloads. Just imagine if you no longer needed to worry about the interactions between components when you add a new application or capability. That’s part of what self-configuring is all about. The bottom line is IT agility: the ability to respond rapidly to changing demands.

    Self-healing is all about business resiliency: the ability to discover, diagnose, prevent, and recover from disruptions, keeping the system going. With increased business dependence on IT, and internal and external users demanding 24x7 availability, this is more important than ever.

    Self-optimizing is about operational efficiency: tuning the resources, balancing workloads, making the maximum use of the IT resources that are available. In today’s world that’s something that has to be done and redone continuously, since the workloads are so variable. In the old days, when workloads were more predictable, it was a question of tuning the systems to support the workload and then letting them run. Today the workloads change dynamically and dramatically, and systems need to continuously monitor and self-tune, adapting and learning from the environment around them.

    And self-protecting: now that our IT systems are open to the public, so to speak, security is ever more important. Self-protecting is about the ability of the system to secure information and resources by anticipating, detecting, identifying, and protecting from attacks of any kind.
  • Multiple animations on this slide. Complex 4-tier application infrastructure, which is load-balanced. When a transaction is initiated over the web, it flows across the pools of servers in variable ways. A transaction may not follow the same path across the infrastructure each time it occurs. When a problem occurs that affects the end user, what route did it take? Where did it go wrong? Plus, multiple vendors are involved: whose fault is it? Who do I get help from?
  • Key points: Today: multiple log formats and divergent tools to decode the logs. With autonomic computing: a common log format and a common set of tools.
  • From Cris - I left Railinc's, IT-Austria's and NS Solution's logos there even if we do not have quantifiable results, but Alan can talk about the fact that these customers have "standardized" on CBEs as well.

    Guardian Insurance: PD/RAC project. Common Base Event for standardizing message format; Log/Trace Analyzer to help correlate problem symptoms; WebSphere symptom database in conjunction with a custom symptom database to capture recurring problems and solutions; remote access to log data via Agent Controller. Result: a 75% saving in problem resolution time.

    Rational Performance Analyst (RPA) is for developers and performance testers who need to find root causes for problems occurring in pre-production and production environments and/or to optimize application performance. RPA is a runtime analysis and performance monitoring and optimization solution that provides runtime analysis, transaction tracing and system log monitoring in a single package; code-level views into production-time data (live and data warehouse); and visual correlation of data collected from transaction-based and server-based log events with code-level detail.

    This offering integrates with Tivoli Monitoring for Transaction Performance (TMTP) and Tivoli Monitoring to capture live and data-warehouse information on production and pre-production systems. It uses Autonomic Computing technology to analyze system logs generated by Web/App/DB servers. It supports user-defined probes for customized and extendable runtime analysis. It provides advanced memory leak analysis that eliminates false positives and captures the chain of object references leading up to the error. It provides an Eclipse-based user interface for developers and performance testers that integrates with the Rational desktop solutions.

    Early Access program targeted for Oct 2004. The product can be used to determine problems in production systems where TMTP is already deployed, or in pre-production systems using a bundled version of TMTP. The early access version can be used through March 2005.
  • Heterogeneous managed resources exist. The managed resources may have their own embedded self-management function. The touchpoints generate CBEs. Adapters enable CBE generation from existing messages and notifications; over time, touchpoints will natively generate CBEs. Autonomic managers manage the resources using the fabric. A self-healing control loop for an AM receives CBEs, correlates them in ‘M’ to produce symptoms; symptoms are analyzed by decision trees (or other analysis engines) in ‘A’ to produce Change Types that describe “what” needs to change; Change types are processed in ‘P’ to produce Change Plans that describe “how” the change should be realized; and Change Plans are processed in ‘E’ to produce actions that are executed on the touchpoint to accomplish the change. Knowledge may be built up as the self-healing control loop iterates. Human-based MAs work with partial autonomic managers and tooling for problem determination activities. Tools for authoring symptoms and other knowledge types produce K-sources that are used for autonomic management. Orchestrating autonomic managers coordinate systemwide self-healing. Call Home technology supplements AM functionality with augmented analysis, planning and execution, as well as knowledge updates. General principles [as listed]
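
    The control loop described in the note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IBM's implementation: the `Event` class is a simplified stand-in for a CBE, the `Knowledge` tables and every rule, symptom, and action name are hypothetical, and a plain dictionary lookup stands in for the decision trees mentioned in the notes.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Simplified stand-in for a Common Base Event (not the real CBE schema)."""
    source: str
    message: str


@dataclass
class Knowledge:
    """The shared 'K' of the loop; every entry here is illustrative."""
    symptom_rules: dict   # message substring -> symptom name
    change_types: dict    # symptom name -> change type ("what" to change)
    change_plans: dict    # change type -> ordered actions ("how" to change)
    history: list = field(default_factory=list)


def monitor(events, knowledge):
    """M: turn incoming events into (source, symptom) pairs."""
    symptoms = []
    for event in events:
        for pattern, symptom in knowledge.symptom_rules.items():
            if pattern in event.message:
                symptoms.append((event.source, symptom))
    return symptoms


def analyze(symptoms, knowledge):
    """A: map each symptom to a change type (lookup in place of a decision tree)."""
    return [(source, knowledge.change_types[symptom])
            for source, symptom in symptoms
            if symptom in knowledge.change_types]


def plan(change_requests, knowledge):
    """P: expand each change type into a change plan (an ordered action list)."""
    return [(source, knowledge.change_plans[change_type])
            for source, change_type in change_requests
            if change_type in knowledge.change_plans]


def execute(change_plans, effector):
    """E: run each planned action on the touchpoint through the effector."""
    executed = []
    for source, actions in change_plans:
        for action in actions:
            effector(source, action)
            executed.append((source, action))
    return executed


def control_loop(events, knowledge, effector):
    """One M -> A -> P -> E pass; executed actions are recorded as knowledge."""
    symptoms = monitor(events, knowledge)
    requests = analyze(symptoms, knowledge)
    plans = plan(requests, knowledge)
    executed = execute(plans, effector)
    knowledge.history.extend(executed)
    return executed
```

    With a rule mapping the substring `OutOfMemoryError` to a hypothetical `heap_exhausted` symptom, one pass over the events would yield the planned actions for the affected server and append them to `knowledge.history`, which is one simple way knowledge could build up as the loop iterates.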

    1. The Role of Predictive Methods in Autonomic Computing. April 27, 2005. Ric Telford, Director of Architecture and Development, Autonomic Computing
    2. Agenda
         • Autonomic Computing overview
         • AC Problem Determination Technologies
         • Customer Results
         • The Self-Healing Vision
         • Summary
    3. Today’s Complex Infrastructure
         • Management of complex, heterogeneous environments is too difficult
         • IT asset utilisation is too low
         • Operational speed too slow; IT flexibility too limited
         • Privacy, security and business continuity
         • Inability to manage the infrastructure seamlessly
         • Swamped by the proliferation of technology and platforms to support
         • WWW
    4. Focus on business value, not infrastructure
       “IBM’s autonomic computing initiative will become its most important cross-product initiative (as the foundation of On Demand Business).” – Thomas Bittman, Gartner
       Autonomic Computing delivers intelligent open systems that:
         • Adapt to unpredictable conditions
         • Continuously tune themselves
         • Prevent and recover from failures
         • Provide a safe environment
       Sense and respond to ever-changing environments
       Providing customer value
         • Increased return on IT investment
         • Improved flexibility, resiliency and quality of service
         • Accelerated time to value
    5. IBM Autonomic Computing Structure
       [Diagram labels: Open Standards; Autonomic Computing Architecture; Products delivering autonomic features; Autonomic Computing Common Components; Problem Determination; Provisioning; Admin Console; Workload Mgt; Installation Management Engine]
         • Autonomic Computing Control Loop
         • Autonomic Computing Architecture Blueprint
         • Log/Trace Analyzer
         • Generic Log Adapter
         • Solution installation & dependency checking
         • Common Console
         • Autonomic Management Engine
         • 50 products with 415+ features
         • Partner solutions
         • Common log format
         • Solution installation schema
    6. Autonomic Computing: Problem Determination Technologies
    7. The Pain Point…
       [Diagram labels: You; Fire Walls; Load Balancers; Edge Servers; HTTP Servers; Application Servers; Data Servers; Backup Servers; Security Servers; Policy Servers; Managing Servers; LDAP Registries; Network Routers/Switches]
    8. Today’s Approach… Internal Swat Team – The Manual Process ("Blame Storming")
         • Requires:
             • Key resources across the IT staff to get the breadth of skills to understand the end-to-end problem
             • Deep understanding of log file formats
             • Deep understanding of system components
         • Result:
             • Multiple man-hours/days/weeks of effort
             • Political issues – passing the blame
             • Insufficient / inadequate data can cause this approach to fail
         • Customers are repeating this step today for every major IT outage
    9. Problem determination
       Log format today
         • Disparate pieces and parts
         • Tools focused on individual products
         • No common interfaces among tools
         • No synergies in building tools OR in creating log entries
       Log format tomorrow
         • Generic log adapter
         • Common format for log files
         • Common set of tools
         • Common interfaces among tools
       [Diagram labels: Common Base Event, an OASIS standard; Adapters; Database; Networks; Application Server; Servers; Storage devices; Applications]
    10. Common Base Event Format
    11. Supported Log Formats (Feb 2005)
         • AIX errpt log
         • AIX syslog
         • Apache HTTP Server access log
         • Apache HTTP Server error log
         • CICS Transaction Server for z/OS System message log
         • Common Base Event XML log
         • ESS (Shark) Problem log
         • IBM Communications Server log
         • IBM DB2 Express diagnostic log
         • IBM DB2 Universal Database CLI Trace log
         • IBM DB2 Universal Database JDBC trace log
         • IBM DB2 Universal Database SVC Dump on z/OS
         • IBM DB2 Universal Database Trace log
         • IBM DB2 Universal Database diagnostic log
         • IBM HTTP Server access log
         • IBM HTTP Server error log
         • IBM WebSphere Application Server activity log
         • IBM WebSphere Application Server for z/OS error log
         • IBM WebSphere Application Server plugin log
         • IBM WebSphere Application Server trace log
         • IBM WebSphere Commerce Server ecmsg log
         • IBM WebSphere Commerce Server ecmsg, stdout, stderr log
         • IBM WebSphere InterChange Server log
         • IBM WebSphere MQ FDC log
         • IBM WebSphere MQ error log
         • IBM WebSphere MQ for z/OS Joblog
         • IBM WebSphere Portal Server appserver_err log
         • IBM WebSphere Portal Server appserverout log
         • IBM WebSphere Portal Server run-time information log
         • IBM WebSphere Portal Server systemerr log
         • IBM WebSphere Portal Server systemout log
         • IBM WebSphere Edge Server log
         • Javacore log
         • Logging Utilities XML log
         • Microsoft Windows Application log
         • Microsoft Windows Security log
         • Microsoft Windows System log
         • Oracle JDBC trace log
         • Oracle alert log
         • Oracle listener log
         • Oracle server log
         • Rational TestManager log
         • RedHat syslog
         • SAN File System log
         • SAN Volume Controller error log
         • SAP system log
         • Squadrons-S Problem log
         • SunOS syslog
         • SunOS vold log
         • TXSeries CICS Console/CSMT log
         • z/OS Component trace
         • z/OS GTF trace
         • z/OS Joblog
         • z/OS Logrec
         • z/OS System log (SYSLOG)
         • z/OS System trace
         • z/OS master trace
    12. Log Correlation – Generating the End-to-End View
         • Transition from trying to understand log formats to identifying ways to analyze the overall data and the end-to-end view
         • Move the mindset from monitoring to analysis
         • With correlation IDs in place, or correlation methods identified:
             • Implement a correlation engine in the Log Analyzer
             • Generate a sequence diagram showing the log interactions and sequence of events
         • Help the IT staff home in on where the problem occurred:
             • Identify quickly where to concentrate efforts
    13. End Results…
       From:
         • Multiple IT-Skilled Resources
         • Multiple Man-Hours / Days / Weeks of analysis
         • Unstructured Swat Team Approach with success unknown
       To:
         • Single PD-Skilled Resource
         • Root Cause identification in hours / minutes
         • Repeatable Process with a reusable set of tools
    14. Self-Healing - Customer Results
       [Chart labels, customer names not recoverable: From several hours/days to less than one hour; 85% Improvement; 70% Improvement; 50% Improvement; 10 to 30% Savings in IT Support Costs; 50% Improvement – IBM’s SAP Deployment; 60% Improvement; 60% Improvement; 20 to 30% Improvement; 10 to 20% improvement in operational staff productivity – IBM Software Delivery and Fulfillment; From 3 people 2 hours to 1 person 15 min; 40% Improvement; 75% Improvement; New in 2005]
    15. Self-Healing Roadmap
       [Diagram labels, in extraction order: Event Representation Adapters IBM Deployers Knowledge Representation Event Correlation and Analysis Partner Deployers Action Representation Knowledge Accumulation Customer Pull Capture Remediation Business Policy Continuous Availability Knowledge Sharing Self Healing Analysis. Timeline labels: 2004 2004-2005 2007 2006]
         • Standard data model for common situation and event reporting
         • Tooling for easy adoption of standard
         • Commitments from IBM brands and IBM Partners to support the data model
         • Standardize data model for symptom analysis
         • Transport & correlate events from all components in IT infrastructure
         • Predictive Analysis Constructs
         • ARM Correlation
         • Standardize data model for change requests, change plans
         • Standardize grammar to describe change requests and constraints
         • Allow analysis and planning when uncertainty is present
         • Allow human to determine recovery action
         • High-profile customer deployments and references
         • Business policies guide self-healing system
         • Preemptive diagnostics automatically recognize and resolve problems
         • Call home facilities are integrated as part of self-healing solutions
         • Symptom data made available to customers, ISVs, partners
    16. Self-Healing Vision
       [Diagram labels, in extraction order: CBEs Win SS AIX DB2 MQ zOS DB2 MQ Call Home M A E P Increased Embedded Self-Management Function IT Professionals Tooling Symptom Policy Config M A E P M A E P M A E P Human-based MAs and associated tooling for correlation, analysis, viewing Adapter Analyze Plan Execute Monitor Knowledge Symptom Change Type CBE Action Change Plan Sensor Effector]
    17. Summary
         • IBM’s Autonomic Computing initiative has helped deliver the right “hygiene” to enable the industry for better Problem Determination
         • Predictive technologies can capitalize on this hygiene to help automate the “Problem Determination” process
         • We need continued research and cooperation across IBM and the industry at large to make the vision of Self-Healing systems a reality!
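
    The adapter idea from slide 9 and the correlation step from slide 12 can be illustrated together with a small Python sketch. It rests on several assumptions: the two log line layouts, the regular expressions, and the dictionary field names are all made up for the example and do not follow the actual Common Base Event schema or any product's real log format. The point is only to show heterogeneous lines being normalized to one shape and then grouped by a correlation ID into a time-ordered sequence.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical line layouts for two components; real product logs differ.
HTTP_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<severity>\w+) corr=(?P<corr>\S+) (?P<msg>.*)$"
)
DB_LINE = re.compile(
    r"^(?P<ts>\S+) \| (?P<corr>\S+) \| (?P<severity>\w+) \| (?P<msg>.*)$"
)


def to_common_event(line, pattern, component):
    """Adapter step: parse one native log line into a common event dict.

    Returns None when the line does not match the expected layout.
    """
    match = pattern.match(line)
    if match is None:
        return None
    return {
        "timestamp": datetime.fromisoformat(match["ts"]),
        "severity": match["severity"].upper(),
        "component": component,
        "correlation_id": match["corr"],
        "message": match["msg"],
    }


def correlate(events):
    """Correlation step: group events by correlation ID, ordered by time."""
    chains = defaultdict(list)
    for event in events:
        chains[event["correlation_id"]].append(event)
    for chain in chains.values():
        chain.sort(key=lambda e: e["timestamp"])
    return dict(chains)


def first_error(chain):
    """Return the earliest ERROR event in a chain, or None if there is none."""
    for event in chain:
        if event["severity"] == "ERROR":
            return event
    return None
```

    Each chain returned by `correlate` is effectively the data behind a sequence diagram for one transaction, and `first_error` points at the component where the problem first surfaced, which is one simple way to decide where to concentrate effort.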