I want to talk about SL’s relationship with TIBCO for a moment.
We go back a long way with TIBCO, and TIBCO is SL’s largest reseller.
We have a lot of experience working with more than 100 TIBCO customers to implement monitoring.
These challenges are probably not news to the people on this call, but:
BW environments tend to be complex and challenging to monitor
It is not uncommon for large enterprises to have hundreds, if not thousands, of BW engines, EMS servers, custom adapters and orchestration processes.
In large applications built around BusinessWorks and the rest of the TIBCO stack, the typical subsystems include: infrastructure (hardware/network/OS), messaging (EMS), orchestration (BW), and applications (business services).
It is a very complex system, and if you are responsible for monitoring a complex BW environment, how can you do it?
Working with a large number of BW customers, we’ve seen a number of questions come up repeatedly, and you likely struggle with them as well:
Prioritization: how can I prioritize which issues to address based on business impact?
Being proactive: how can I correct problems before users are affected?
Reducing complexity: how can I understand interdependencies when my critical applications depend on both TIBCO and non-TIBCO components?
Using BW in a heterogeneous environment: how can I understand how other parts of the infrastructure are impacting me?
Turn over to Glyn
Many of the tips we are going to review today will use RTView to illustrate the point. I want to take a moment to explain how the monitor we will show today is different from TIBCO RTView Standalone monitors.
TIBCO’s RTView monitors provide advanced out-of-the-box monitoring for specific TIBCO technologies. They are developed by SL and resold by TIBCO as TIBCO-branded products.
The RTView Enterprise Monitor product is sold by SL, not by TIBCO.
Enterprise Monitor is an end-to-end monitoring platform that monitors a number of different technologies including all of these TIBCO technologies here.
These standalone monitors are essentially the same in functionality as the solution packages in Enterprise Monitor; so the standalone BW Monitor is essentially the same as the BW Solution Package in Enterprise Monitor.
And anyone using TIBCO standalone monitors can extend those to work with RTView Enterprise Monitor.
This enables you to take your TIBCO monitoring to a new dimension.
So let’s dive in... Glyn?
Avoid tunnel vision.
We all need to be aware of the environments around us. Upstream and Downstream events…
It is natural to focus on our own area of responsibility: when there is an issue, our first instinct is to make sure that everything in our domain is functioning normally and is not the cause of the problem. This can mean that we spend many hours troubleshooting issues in our BW environment only to find out that the symptom is caused by an external problem. Modern, complex systems are highly interdependent. For example, 90% of the processes in BW are created by or for EMS, so trying to manage a BW environment in isolation, without visibility into EMS, makes little sense. In addition, 100% of these processes rely on a compute layer, which can be a hidden source of intermittent problems. You need to be able to understand the relationships between all these systems, and this requires visibility into all of them.
What we see on this screen is how this visibility can be provided by a holistic monitoring system. On this single screen we see the health state of all the middleware components responsible for the smooth running of the application: compute, BW and EMS. We can now detect problems as they develop and often take action before users are impacted.
Health checks are another best practice that encourages problem avoidance.
Over time products change, but habits die hard. It’s easy to get caught up in the day-to-day routine.
Take a step back, at least annually, to review the deployments and configurations of your assets – both hardware and software.
You may find, for example, that you are using a very labor-intensive process where the vendor now provides an automated one. Vendors listen to their customers and try to make their products more efficient and automated.
You may also find that the environment was optimally aligned to meet past business requirements that are no longer appropriate – requirements that may be very different today.
Finally, technology always changes. Are you up to date on the latest and greatest? Can you reduce complexity and avoid a hardware upgrade by using new techniques?
By reducing persistent failure events, you improve your SLAs and your customer satisfaction.
When working with a large, dynamic environment it is often difficult to detect recurrent points of failure – especially if the failures are intermittent.
Alert History can be an excellent way to look for patterns that, without this History, would be overlooked.
Example: filters on device, ID, alert text, etc. can reveal failure patterns in a sub-system. You may see that failures most often occur on Monday morning at about 9 am. Investigation would reveal that this is when the file system is being backed up – starving the processes of CPU at a peak time.
This screen shows that by resolving the problem on a single resource you can eliminate a series of errors impacting several systems with just one fix. The result: improved SLAs, improved customer satisfaction – and a reduced workload!
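As a rough illustration of this kind of pattern hunting, here is a minimal Java sketch that buckets historical alerts by device, day of week and hour to surface recurring hotspots like the Monday 9 am example. The Alert record, the "Pending Message High" filter and the loadHistory placeholder are all assumptions made for the example, not RTView's actual alert-history format.

    import java.time.ZonedDateTime;
    import java.util.*;

    // Hypothetical alert record; a real alert-history export has its own fields.
    record Alert(ZonedDateTime time, String device, String alertText) {}

    public class AlertPatternScan {
        public static void main(String[] args) {
            List<Alert> history = loadHistory();           // however you export alert history
            Map<String, Long> buckets = new TreeMap<>();
            for (Alert a : history) {
                if (!a.alertText().contains("Pending Message High")) continue;  // filter of interest
                String key = a.device() + " " + a.time().getDayOfWeek() + " " + a.time().getHour() + ":00";
                buckets.merge(key, 1L, Long::sum);         // count alerts per device/day/hour
            }
            buckets.entrySet().stream()
                   .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                   .limit(10)
                   .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
        }

        private static List<Alert> loadHistory() {
            return List.of();  // placeholder: load from your alert-history store
        }
    }

A concentration of counts in one device/day/hour bucket is the numerical equivalent of the pattern you would spot visually in the filtered alert history screen.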
While on the subject of workload... don’t forget the hidden machines.
Java runs in a JVM. Correlate JVM behavior with the operational engine to understand the impact of changing user demand; the JVM is often overlooked as a cause of intermittent slowdowns.
You need to watch the JVM and increase or decrease its size to optimize the use of resources and to avoid slowdowns at moments of high demand.
Make sure that the JVM has enough headroom to accommodate peak demand. A simple change in allocation can provide great performance benefits.
You also need to monitor garbage collection to make sure that enough memory is being released back for reuse.
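For reference, the JVM exposes exactly these numbers through the standard java.lang.management MBeans. The following is a minimal sketch that samples heap headroom and cumulative GC time for the local JVM; the 80% threshold is just an illustrative value, and a real monitor would read the same MBeans remotely over JMX rather than in-process.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    // Minimal sketch: sample heap headroom and GC time for the local JVM.
    // Assumes a maximum heap (-Xmx) has been set so getMax() is meaningful.
    public class JvmHeadroomCheck {
        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            double usedPct = 100.0 * heap.getUsed() / heap.getMax();
            System.out.printf("Heap used: %.1f%% of %d MB%n", usedPct, heap.getMax() / (1024 * 1024));
            if (usedPct > 80.0) {                          // illustrative threshold
                System.out.println("WARNING: little headroom left for peak demand");
            }
            long gcMillis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcMillis += gc.getCollectionTime();        // cumulative time spent in GC
            }
            System.out.println("Total time spent in GC so far: " + gcMillis + " ms");
        }
    }

Sampling these two figures over time is what lets you see whether slowdowns line up with shrinking headroom or with long GC pauses.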
Your Monitoring System should provide Application-centric navigation.
Support resources are limited.
When you are looking at multiple screens of alerts and your manager’s phone starts ringing, how can you automatically focus on those resources that support mission-critical applications and leave for later those that don’t?
A good monitoring system will automatically indicate which systems should receive immediate attention and which can wait by associating the resources with the business application. Remember that your customers depend on applications, not processes: a single failed transaction for one customer, caused by a single process, can represent a 100% failure rate for that customer.
This screen is an example of how a monitoring system can automatically help prioritize your support efforts so that critical systems get critical attention. Explain the screens.
One headache facing middleware administrators is managing the sheer number of processes and connections.
Plan ahead. Enabling ClientID at design time will greatly improve the information available about connections. It’s a simple check box, and thereafter, when BW interacts with EMS, the ClientID is stored and can be used to associate a process with queues, topics or bridges.
You will have a better understanding of who and what applications are making the connections, which connections are left open after the process terminates, and so on.
Your monitoring system should also be able to make use of these relationships for alert generation.
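For context, here is roughly what that ClientID looks like at the JMS API level. In BW it is just a check box at design time, so this sketch is only illustrative; the EMS server URL, credentials and ClientID value are made-up examples, and TibjmsConnectionFactory comes from the EMS client library.

    import javax.jms.Connection;
    import javax.jms.JMSException;
    import com.tibco.tibjms.TibjmsConnectionFactory;

    // Sketch only: setting a JMS ClientID explicitly so that EMS can associate
    // connections, consumers and destinations with the application that created them.
    public class ClientIdExample {
        public static void main(String[] args) throws JMSException {
            TibjmsConnectionFactory factory =
                    new TibjmsConnectionFactory("tcp://ems-host:7222");   // hypothetical EMS URL
            Connection connection = factory.createConnection("user", "password");
            connection.setClientID("OrderService-BW-Engine-01");          // illustrative ClientID value
            connection.start();
            // ... create sessions, producers and consumers as usual ...
            connection.close();
        }
    }

With a meaningful ClientID on every connection, the monitor can tell you which application owns each connection instead of showing an anonymous list.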
Organize your BW environment – divide and conquer. 600+ engines are impossible to manage en masse.
As deployed systems change to meet new business demands, the once-static environment becomes much more dynamic; the monitoring system, in turn, needs to be agile and extensible.
You can no longer keep in your head which resources exist where, which are the heavily used, politically sensitive systems, and which are just for time card management.
Structure helps you make sense of what’s happening
A good monitoring solution will allow visual segregation of assets. Assets can be displayed by profile without changing the physical resources.
You should be able to create screens that show different regions, business sectors and functions.
For example, your monitoring system should provide ‘profile-driven’ dashboards that separate your BW community into the groups below (a small sketch of the idea follows this list):
Different Geographic regions
Different Datacenters
Different Business sectors/divisions
Critical vs. non-critical systems, DR-tagged systems, and so on
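Purely to illustrate the idea of profile-driven views, the sketch below groups a flat inventory by metadata tags so that the same physical assets can be presented by region, criticality and so on. The BwEngine record and its tags are invented for the example; in practice the tags would come from your CMDB or monitoring configuration.

    import java.util.*;
    import java.util.stream.Collectors;

    // Hypothetical inventory record; a real CMDB or monitoring config supplies these tags.
    record BwEngine(String name, String region, String datacenter, String businessUnit, boolean critical) {}

    public class ProfileViews {
        public static void main(String[] args) {
            List<BwEngine> inventory = List.of(
                    new BwEngine("bw-orders-01", "EMEA", "LON1", "Retail", true),
                    new BwEngine("bw-timecards-01", "AMER", "NYC2", "HR", false));

            // One physical inventory, many logical views: group by whatever profile matters.
            Map<String, List<BwEngine>> byRegion =
                    inventory.stream().collect(Collectors.groupingBy(BwEngine::region));
            Map<Boolean, List<BwEngine>> byCriticality =
                    inventory.stream().collect(Collectors.partitioningBy(BwEngine::critical));

            System.out.println("Per-region view: " + byRegion.keySet());
            System.out.println("Critical engines: " + byCriticality.get(true));
        }
    }

The point is that the grouping is purely logical: the dashboards change without any change to the physical deployment.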
Wherever you can, avoid unnecessary upgrades. Why would they be unnecessary? Because they are often driven by incorrect load information.
Examples:
In your load-balanced clusters, is the balancing effective and really acting on the resources appropriately? Is traffic directed appropriately?
A monitoring system should be able to show the demand placed on multiple balanced resources, on the same screen.
You should be able to validate that the algorithms are effective; otherwise, increased overall traffic will have a different impact depending on which resources the process is assigned to.
The screen shows, visually, the load placed on each clustered resource – a well-balanced cluster would show an even coloration, with no square appearing darker or lighter than the others.
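One simple way to put a number on that "even coloration" is a spread statistic across the cluster members. This sketch, using invented per-engine message rates and an arbitrary 25% threshold, flags a cluster whose busiest node is doing far more work than its quietest:

    import java.util.Map;

    // Sketch: quantify how evenly load is spread across a balanced cluster.
    // The per-engine message rates below are invented sample values.
    public class BalanceCheck {
        public static void main(String[] args) {
            Map<String, Double> msgsPerSec = Map.of(
                    "bw-node-1", 410.0, "bw-node-2", 395.0, "bw-node-3", 120.0);

            double max = msgsPerSec.values().stream().mapToDouble(Double::doubleValue).max().orElse(0);
            double min = msgsPerSec.values().stream().mapToDouble(Double::doubleValue).min().orElse(0);
            double avg = msgsPerSec.values().stream().mapToDouble(Double::doubleValue).average().orElse(0);

            double spread = (max - min) / avg;   // 0 = perfectly even; large = one node starved or overloaded
            System.out.printf("avg=%.0f msgs/s, spread=%.2f%n", avg, spread);
            if (spread > 0.25) {                 // illustrative threshold
                System.out.println("Cluster looks unbalanced: check the load-balancing algorithm");
            }
        }
    }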
The same monitoring system should show you, historically, the demand placed on all your subsystems – you can shift resources from under-utilized to over-utilized systems and dramatically improve reliability and performance without upgrades.
However, when you do need to push for more resources, the monitoring system will provide factual evidence of the need – no more ‘I’m sure it would help’.
Choose monitoring options that have a light footprint
It can be embarrassing to have to explain that performance issues were caused by the very monitoring systems designed to avoid these problems in the first place.
Many monitoring systems allow for the alert rules to be centralized to one server as opposed to being replicated and executed on hundreds of servers.
Maintenance of these rules is also onerous and often leads to inefficient processing, with many obsolete and redundant rules being executed needlessly. In addition, executing these rules places a significant load on the host and will impact performance – another reason to execute them on a remote machine.
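To make the centralization point concrete, here is a toy sketch of a single rule engine evaluating thresholds against metrics shipped in from many hosts, rather than every host running its own rule scripts. The rule names, metric names and sample values are all invented for illustration:

    import java.util.List;
    import java.util.function.Predicate;

    // Toy central rule engine: hosts only ship metrics; thresholds live and run in one place.
    record Metric(String host, String name, double value) {}
    record Rule(String alertName, Predicate<Metric> matches) {}

    public class CentralRuleEngine {
        public static void main(String[] args) {
            List<Rule> rules = List.of(
                    new Rule("CpuHigh", m -> m.name().equals("cpu.pct") && m.value() > 90),
                    new Rule("PendingMessagesHigh", m -> m.name().equals("ems.pending") && m.value() > 10_000));

            // Metrics streamed in from remote collectors (sample values, made up).
            List<Metric> incoming = List.of(
                    new Metric("ems-prod-02", "ems.pending", 15_200),
                    new Metric("bw-host-17", "cpu.pct", 42));

            for (Metric m : incoming) {
                for (Rule r : rules) {
                    if (r.matches().test(m)) {
                        System.out.println("ALERT " + r.alertName() + " on " + m.host() + " value=" + m.value());
                    }
                }
            }
        }
    }

Because the rules live in one place, retiring an obsolete rule is a single change rather than an update rolled out to hundreds of hosts.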
Your monitoring system should scale.
Scalability is a big issue when dealing with thousands of components, distributed across several datacenters.
Data collection and processing need to happen close to the source to reduce latency and network traffic, with only the results and deltas being sent to the central repository for presentation purposes. This design keeps the overhead on the host and network low while providing high performance and low latency.
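As a sketch of the "collect locally, send only deltas" pattern, a local agent might cache the last value it reported and forward a sample only when it has moved meaningfully. The metric name and the 5% change threshold below are assumptions for illustration:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a local collector that forwards a metric to the central tier
    // only when it changes by more than a threshold, keeping network traffic low.
    public class DeltaReporter {
        private final Map<String, Double> lastSent = new HashMap<>();
        private static final double MIN_CHANGE = 0.05;   // 5% change before we bother the central server

        public void sample(String metric, double value) {
            Double previous = lastSent.get(metric);
            if (previous == null || Math.abs(value - previous) / Math.max(previous, 1e-9) > MIN_CHANGE) {
                lastSent.put(metric, value);
                send(metric, value);                      // in reality: push to the central repository
            }
        }

        private void send(String metric, double value) {
            System.out.println("sending " + metric + "=" + value);
        }

        public static void main(String[] args) {
            DeltaReporter r = new DeltaReporter();
            r.sample("bw.jobs.active", 100);   // sent (first sample)
            r.sample("bw.jobs.active", 102);   // suppressed: within 5%
            r.sample("bw.jobs.active", 140);   // sent: significant change
        }
    }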
Don’t forget that you will be adding new technology types to the mix – can your vendor handle the integration of these new requirements?
Finally – Tip #10.
It all comes together in a view that shows, at a glance, the different layers of technology that, together, keep the applications and services running smoothly. The performance and availability of each layer is represented on the same time scale.
A time-synced, holistic view of an infrastructure segment or application is essential to understanding cause and effect.
Showing all this information – representing different data stores owned and controlled by different groups – on the same screen greatly improves communication via a common interface and encourages a collaborative approach to problem solving.
What we see on this screen is how a series of seemingly unrelated events can be connected. We see that, at 8 am….. 12 am…… 2 pm……
Can you imagine how many people would be involved, and for how long, to track down this problem that is directly caused by the application inter-dependencies?
There is absolutely no way that this result could be achieved efficiently through use of separate, disparate monitoring solutions.
Monitoring systems should highlight cause and effect by presenting the end-to-end flow in a single screen. This enables administrators to quickly understand how a ‘Pending Message High’ alert can be caused by a CPU shortage on a seemingly unrelated resource.
(Tom’s blog) Sharing information across different workgroups is essential.