Oracle SOA Suite 11g Troubleshooting Methodology (whitepaper)
1 Session 185
ORACLE SOA SUITE 11G TROUBLESHOOTING METHODOLOGY
Harold A. Dost III, Raastech, Inc.
Most troubleshooting guides simply list out solutions to
common errors. This paper introduces a troubleshooting
methodology surrounding performance, composite
instances, deployment, and logging. The goal is to better
equip the reader with the ability to solve most problems as
they pertain to the SOA infrastructure and its executed
transactions, and to show where to look, what to look
for, and what to do afterward.
This paper is intended for every Oracle SOA Suite 11g
developer and administrator.
There is no guarantee that every error is an easy fix away.
However, this paper will provide the reader with a better
understanding of where to look for errors, how to
categorize them, and deal with them in an appropriate
manner. With the tools explained later on, quicker
resolutions may be achieved which will produce more
efficient development and better support.
According to Splunk, one of their clients, Macy’s, noted
that tracking down the exact cause of a problem could be
“exceedingly difficult.” It often required a team comprised
of members from various IT functional areas to fix these
problems. Even with these teams, resolutions still took
In the past, when an issue presented itself, the network
administrators were always the ones blamed. Over time
technology has improved in that area, so network issues at
most companies are now few and far between by comparison.
Everything is connected higher in the stack through
various integration servers and technologies. The blame is
often shifted to the integration team, but much like people
blaming a browser for a bad Internet connection, this can
often be misdirected.
Customers hold a company responsible for maintaining
near-continuous, reliable services and, by transitivity, hold
the integration team responsible as well. This puts a lot of
pressure on the
integration team to quickly determine if the error is
something within their realm or if it needs a different
One of the biggest pains with tracking down issues in
middleware is that it is composed of so many layers. For
example, a web application might make a call. The payload
first goes through Oracle Enterprise Gateway (OEG) because
it is going from the Internet to the company
intranet. Then the company uses Oracle Service Bus
(OSB) for all internal service calls to abstract naming and
versions of services. Finally, the payload makes it to
Oracle SOA Suite and it goes on from there to call other
systems. Since the focus is on Oracle SOA Suite, below
are a few issues.
A custom Ant script is used to iterate through a list of
composites and deploy them one at a time. After the 66th
composite an OutOfMemoryError: PermGen space error is
thrown; an odd but repeatable error. A much more common
error is: “Unable to access endpoint…” This error can have
many explanations, from a simple timeout to a security
issue such as an invalid certificate. Not knowing how
to diagnose the source of these symptoms will slow down
even the most senior developers and administrators.
TECHNICAL DISCUSSIONS AND EXAMPLES
Before learning how to solve these problems, it is first a
good idea to step back and acknowledge that
troubleshooting problems is an art. Like any other art it is
part skill and part knowledge. As for skill, each person
has a certain level of natural inclination towards solving
problems, some being better than others. Much of it comes
down to having a very methodical and scientific approach.
The other half, knowledge, refers to a person’s intimacy
with the product. Unless someone can deduce the topology
of a system without ever having used it, some time needs
to be spent working with and understanding the various
subsystems of a product. There are many resources for
understanding how SOA Suite works and how to fix errors.
Many people, not having an answer to an issue, will
immediately jump onto the Internet and perform a series
of queries on their favorite search engine. This can lead to
various blogs and even some Oracle specific resources,
such as the OTN discussion forums. This is often wasted
time leading to solutions that aren’t related to the problem
at hand. Finding no resolution, many will hop onto the
Oracle support site to search for the existence of a patch.
While none of these options is bad, poorly directed
searches can be very time consuming, frustrating, and
wasteful. The Internet should not be the only resource
used. In fact, one’s own reasoning and knowledge should
also be a resource for determining the issue at hand.
Ideally, once the source of the issue is tracked down, a
resolution is obvious or at least achievable. If
that is not the case then it’s time to resort to the
aforementioned resources. The company may also have an
error tracking system and knowledge base of its own. Also,
always remember that talking to coworkers is useful, since
issues have often been previously solved and forgotten.
The first step in tracking down the error should be to
classify the problem. For purposes of this paper, let’s start
by placing the issues into one of three major categories:
deployment, runtime, and performance. Distinguishing
between these categories may not be obvious at first, but
it becomes easier after encountering a few different types
of problems. Runtime errors are issues in the logic of the
integration; the cause can be actual code or configuration
in the server.
In certain cases the problem would be specific to a
particular composite. Signs that only a single composite is
affected are usually obvious, since the only errors showing
are related to that integration. However, there may also be
issues that affect the entire infrastructure. For now, the
focus is on singular composites and deployment.
The quickest, and usually easiest, issues to troubleshoot are
deployment related. Deployment of a composite is broken
into different phases: cleanup, validation, compilation, and
the deployment. The cleanup phase should never fail as it
searches for existing packaged integrations and deletes
them if they exist. Validation examines the code and
surfaces many errors related to bad references and XML.
The compilation phase will provide further errors should
they arise, but if successful it also packages the source
into a JAR file to prepare for deployment. Finally,
deployment occurs. The deployment process will reveal a
number of issues; however, they may not all be displayed
from the deployer’s point of view. Normally that is not a
problem, as most of the issues will be revealed at runtime.
These issues are usually with the server configuration:
data sources, queues, topics, etc. When a composite polls
a database or file folder and that resource is
misconfigured, the process will simply not start. The best
way to identify the root cause here is to tail the server’s
.out log while performing a deployment.
Commonly, the issue is a bad JNDI name or a directory
that doesn’t exist. Most of these require coordination with
an application administrator depending on the level of
permissions that the developer has in the particular
environment. Issues that can be determined by the
developers themselves will be discussed with runtime
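The advice above to watch the .out log during a deployment
can be sketched as a small filter. This is a minimal sketch
and not part of the original methodology: the function name
and the keyword patterns are hypothetical, chosen only to
flag the two common causes named above (a bad JNDI name or
a missing directory). The wording of real server messages
may differ, so treat the patterns as a starting point.

```python
import re
from typing import Iterable, Iterator

# Hypothetical patterns: illustrative guesses at wording that tends to
# accompany a bad JNDI name or a missing directory. They are not an
# official message catalog and should be adjusted to the actual logs.
SUSPECT_PATTERNS = [
    re.compile(r"jndi", re.IGNORECASE),
    re.compile(r"NameNotFoundException"),
    re.compile(r"(directory|folder).*(not exist|not found)", re.IGNORECASE),
]


def flag_deployment_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that match one of the suspect patterns."""
    for line in lines:
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
            yield line.rstrip("\n")
```

Passing an open file handle for the log as `lines` streams
the file line by line, so the filter can be run right after
a deployment attempt without loading the whole log.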
During runtime any number of errors can occur, but not
all of them will be caused by individual composites. Some
of them can be overarching issues that affect multiple
integrations. Similar to deployment issues, runtime issues
may be caused by problems in the code or in server
configurations. Most code related issues will appear in the
flow trace and will be obvious to solve. Most issues, even
non-code related, will manifest as an error in the console
but the root cause will be hidden in the logs.
In the case of Figure 1, the error is a missing organization.
This is a business fault and should be handled by the
integration code or passed back to the calling application.
Other issues can include errors like: “Cannot insert NULL
into…” These issues may or may not need to be handled
by the integration. Unfortunately, not all of the errors will
appear in the logs all the time, or the error that does show
is not descriptive enough to determine a resolution
immediately. One such error is the “Unable to access the
following endpoints…” error. Logging levels can be raised
to obtain further information. However, there are many
different loggers available, so knowing which logger to
modify can be difficult. The best way to decide which
logger should be modified is by looking in the header of a
log message. Next, finding the right level of logging can
be difficult, because trace logging can at times be overly
verbose, leading to more time sifting through the noise.
One of the best ways to find the right logging level is to
raise it a couple of levels at a time until the true
problem is revealed.
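The two steps just described, reading the logger name from
the message header and raising the level a couple of steps
at a time, can be sketched as follows. This is a sketch
under stated assumptions: it assumes the angle-bracket
header layout seen in Figure 1, where the third field is
the logger name, and it assumes the usual Oracle Diagnostic
Logging level ladder. Both function names are hypothetical,
and the level list should be verified against the
installation at hand.

```python
import re
from typing import Optional

# Matches one complete <...> field of a log header.
FIELD = re.compile(r"<([^<>]*)>")


def logger_from_header(line: str) -> Optional[str]:
    """Return the third angle-bracket field, assumed to be the logger name."""
    fields = FIELD.findall(line)
    return fields[2] if len(fields) >= 3 else None


# Assumed ladder of ODL levels, ordered from least to most verbose.
ODL_LEVELS = [
    "INCIDENT_ERROR:1", "ERROR:1", "WARNING:1", "NOTIFICATION:1",
    "NOTIFICATION:16", "TRACE:1", "TRACE:16", "TRACE:32",
]


def raise_level(current: str, step: int = 2) -> str:
    """Move `step` rungs toward the most verbose level, stopping at the top."""
    index = ODL_LEVELS.index(current)
    return ODL_LEVELS[min(index + step, len(ODL_LEVELS) - 1)]
```

Stepping two rungs at a time reflects the suggestion above:
it avoids jumping straight to the most verbose trace level
while still converging quickly.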
There are many signs that there is a problem with the
performance of a system. Some of those signs are:
The Oracle Enterprise Manager Fusion
Middleware Control is abnormally slow.
The completion time of composites is increased
consistently across the board.
The size of the dehydration store is growing
A large number of errors are appearing in the logs.
<Aug 6, 2011 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-
<Got an exception:
Message: Organization 129024 not found. Stack trace: at
on(Organization organization, Notification notification)
Figure 1: Business Fault
If the server is experiencing any of the issues listed
above, there is likely a performance issue. There are a
number of places to look to track down the root cause.
First, check whether there is enough available space on the
hard drives. A lack of space can result in drastic
performance reductions. Second, be sure to check the
processor, memory, and I/O statistics with a tool like
vmstat to help narrow down exactly which process is
hogging resources on the [virtual] machine. Other factors
in performance can be the number of open files and the
number of running processes. A runaway integration can
consume all file descriptors, thereby degrading
performance across the rest of the system. If issues like
this arise, it is often a good idea in development to clear
the logs and restart WebLogic while watching the logs for
any errors that may be a precursor to the “too many open
files” error. If nothing specific to SOA Suite is found,
check the other applications running, and be sure to check
the OS logs (/var/log/messages). While errors can be a
common reason for a slow environment, there could be other
issues playing a role.
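The disk space and file descriptor checks above can be
sketched with the Python standard library. This is only an
illustration: the function names and the 90 percent
threshold are assumptions, and counting descriptors through
/proc works on Linux only.

```python
import os
import shutil


def disk_used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def open_fd_count(pid: int) -> int:
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def is_near_limit(current: float, limit: float, threshold: float = 0.9) -> bool:
    """True when `current` has reached `threshold` (assumed 90%) of `limit`."""
    return limit > 0 and current >= threshold * limit
```

For example, `is_near_limit(open_fd_count(pid), max_open_files)`
would flag a server process approaching its descriptor
limit, where `max_open_files` is the limit the operating
system reports for that process (on Linux, the "Max open
files" row of /proc/<pid>/limits).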
Only a tuned JVM will give the kind of performance
demanded by a production-level environment; this is
especially true when high volumes of transactions pass
through the environment. If the application server is not
already running on the JRockit JVM, switching to it is
highly recommended. Speed increases can be realized with
little configuration. Once JRockit is running, tools that
come with the JVM, such as the JRockit Flight Recorder
(JFR), can be used to further tune the instance as
necessary. As of this writing, the HotSpot and JRockit
JVMs are expected to ship as one product with the release
of JDK 8, which means the benefits of JRockit would be
realized within that JVM. Tuning is not the only useful way
to work with the JVM’s configuration settings; the JVM can
also provide additional diagnostic information. Performing
a heap dump when a memory error occurs is one such way. The
JVM is not the only part that should be monitored.
Data sources are another critical component that should
be monitored in the case of performance issues. It is
possible that the available connection pool has been
saturated with connections and is causing a bottleneck. If
there is consistently an issue with a particular connection
pool, involve a DBA to help understand why the pool may
be getting full. There may be some SQL tuning that can be
done so that queries and procedures run more efficiently,
shortening the time each connection is held.
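One way to decide whether a pool is "consistently" a
bottleneck is to sample it over time and look at how often
it is full. The sketch below is illustrative only: the
class, its field names, and the 80 percent ratio are
hypothetical, and the sampled values would have to come
from whatever monitoring the environment already provides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolSample:
    """One observation of a connection pool (field names are hypothetical)."""
    active: int    # connections currently in use
    capacity: int  # maximum connections the pool allows
    waiting: int   # requests queued while waiting for a connection


def is_saturated(sample: PoolSample) -> bool:
    """A sample is saturated when the pool is full or callers are waiting."""
    return sample.waiting > 0 or sample.active >= sample.capacity


def consistently_saturated(samples: List[PoolSample], ratio: float = 0.8) -> bool:
    """True when at least `ratio` of the samples are saturated."""
    if not samples:
        return False
    hits = sum(1 for sample in samples if is_saturated(sample))
    return hits / len(samples) >= ratio
```

A pool that is saturated in most samples is a reasonable
trigger for involving a DBA, whereas an occasional full
pool during a burst may not need any action.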
In the end, even this paper can only gloss over the very
complex art that is troubleshooting. Many variables can
factor into determining the cause, such as security
considerations, operating system, hardware, etc. Most
issues that arise can be narrowed into runtime or
infrastructure errors, performance issues, and deployment
issues. Targeting the category allows focus on where the
true cause of the issue lies. For deployment issues, it is
good to have an understanding of the overall deployment
process. Also, knowing the purpose of the adf-config.xml
can provide insight as to how the MDS is referenced and
other important deployment related information.
When dealing with errors, determining whether there is a
code-specific issue or a system-wide issue can prevent
many long hours of looking in the wrong place. Modifying
logging levels can assist in this and allow for drilling into
the true cause of the issue.
JVM Performance Tuning Documentation
Location of out.err (Used for deployment errors)
Oracle ADF-config.xml Description
Splunk. Ensure the availability and performance of your critical
applications using the genius of splunk. Retrieved from