OBIEE 12c Advanced Analytic Functions

www.redpillanalytics.com
Abstract
Oracle Business Intelligence Enterprise Edition 12c has enhanced analytical capabilities
due to an (optional) integration with the statistical software R. These new functions include the
following: Trendline, Bin and Width Bucket, Forecast, Clustering, Outlier and Regression. This
document will provide a comprehensive review of these newly available functions, and provide
examples of them in action. For ease of understanding and reproducibility, the sample data set is
Oracle’s Sample Sales Lite [1].

[1] This data set is available with every install of OBIEE 12c. Alternatively, a similar set can be
found within the Oracle BI Sample App Virtual Machine.

The Trendline Function
Trendline is part of the Advanced Analytics Internal Logical SQL Functions, meaning it
belongs to the group of functions that are computed internally rather than in R. The function
fits a linear or exponential model and returns either the fitted values or the model itself. The
numeric_expr represents the Y value for the trend and the series (time columns) represents the X
value. A trendline is a model: it asserts that the data can be described by a fitted line or curve.
The TRENDLINE function measures data across time and shows a metric as a line over the ordered
records. It can model the data with either linear or exponential regression.
Figure 1: The Trendline function is found under the Aggregate folder by clicking on the
“Insert Function” button in the Formula section of the column editor.

Trendline Function Syntax
TRENDLINE( <numeric_expr>, ( [<series>] ) BY ( [<partitionBy>] ),
<model_type>, <result_type>, [number_of_degrees] )
Where:
o numeric_expr—represents the data to trend
▪ This is the Y-Axis and is a measure column.
o series—indicates the X-axis. This is a list of <valueExp> <orderByDirection>,
where <valueExp> is a dimension column and <orderByDirection> is ASC
(ascending) or DESC (descending).
▪ The default is ASC. Note that this cannot be an arbitrary combination of
numeric columns.
▪ It is possible to use more than one Trendline column in the same analysis,
but the Trendline columns must have the same X-Axis.
o partitionBy—A list of dimension attributes that are not on the X-Axis.
o model_type— A model type may be one of the following types:
▪ LINEAR—a function with a constant rate of change and a straight line
graph.
▪ EXPONENTIAL—a function in which a constant base is raised to the power of the
variable, so the value changes at a constant percentage rate.
o result_type— A results type may be one of the following types:
▪ VALUE - returns the fitted (regression) Y value for each X in the fit.
▪ MODEL - returns the model parameters as a JSON (JavaScript Object
Notation, a lightweight data-interchange format) string.
Figure 2: Example formula to display result_type of MODEL.

Figure 3: Results of using ‘MODEL’ as the result type; it returns the parameters in a JSON
(JavaScript Object Notation) format string.
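As a rough illustration of the two result types, the following Python sketch fits a least-squares trend and returns either the fitted values or the parameters as a JSON string. It is not OBIEE's implementation: the function names, the use of 1-based positions to stand in for the ordered series, the log-transform approach to the exponential fit, and the JSON key names are all assumptions made for this example.

```python
import json
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trendline(ys, model_type="LINEAR", result_type="VALUE"):
    """Sketch of a trend fit; ys must already be in series order."""
    xs = list(range(1, len(ys) + 1))  # assumption: position stands in for X
    if model_type == "LINEAR":
        slope, intercept = fit_line(xs, ys)
        fitted = [intercept + slope * x for x in xs]
        params = {"type": "LINEAR", "slope": slope, "intercept": intercept}
    elif model_type == "EXPONENTIAL":
        # Fit y = a * b**x by fitting a line to log(y); requires y > 0.
        slope, intercept = fit_line(xs, [math.log(y) for y in ys])
        a, b = math.exp(intercept), math.exp(slope)
        fitted = [a * b ** x for x in xs]
        params = {"type": "EXPONENTIAL", "a": a, "b": b}
    else:
        raise ValueError("model_type must be LINEAR or EXPONENTIAL")
    # Hypothetical JSON layout; the real MODEL output may differ.
    return fitted if result_type == "VALUE" else json.dumps(params)
```

Calling trendline(values, "LINEAR", "VALUE") returns one fitted number per input row, while "MODEL" returns a single string describing the fit.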
Example Syntax
TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR', 'VALUE')
Figure 4: Selected dimensions and fact columns for a sample trendline analysis.
Figure 5: Note the Trendline (in green); depicting these types of subtle changes is what this
function is best at.

Figure 6: If the graph is set to vary color by ‘Per Name Year’, the results are displayed for each
year. Note the differences between each year that otherwise would not be apparent.
Figure 7: Segmentation of the trends could continue to smaller subsets. Above, the 2009 data has
been split by semester.

The BIN and WIDTH_BUCKET Functions
Both BIN and WIDTH_BUCKET are included in the Advanced Analytics Internal
Logical SQL Functions, meaning they belong to the group of functions that are computed
internally rather than in R. The syntax of the two functions differs, however, and each is
covered below.
About BIN
In the BIN function, the user can select any numeric attribute (INT, FLOAT, DOUBLE,
NUMERIC) from a dimension or fact table/measure and place its values
into a discrete number of bins. The reason to bin a measure is to separate its results
into groups (see BIN Syntax). An example would be store sales, with the revenue
binned into less than $200, between $200 and $500, and so on. Each sale
is placed into the bin whose range contains its revenue. The
BIN function classifies a given numeric expression into a specified number of equal-width buckets.
The function can return either the bin number or one of the two end points of the bin interval.
The output of the BIN function is used as a GROUP BY expression for other measures included
in the query. The BIN function is treated like a new dimension attribute for purposes such as
aggregation, filtering, and drilling. All of these operations are supported on BIN expressions.
BIN Syntax
BIN(numeric_expr [BY grain_expr1, …, grain_exprN] [WHERE condition] INTO
number_of_bins BINS [BETWEEN min_value AND max_value] [RETURNING { NUMBER
| RANGE_LOW | RANGE_HIGH }])
Where:
o numeric_expr—indicates the measure or numeric attribute to bin
o BY grain_expr1, …, grain_exprN—indicates a list of expressions that define
the grain at which the numeric_expr is calculated before the numeric values are
assigned to bins.
▪ If the binned expression is a measure, then the measure is grouped
at the grain specified in the BY clause before being binned.
▪ The BY clause is mandatory for measure expressions and optional for
attribute (non-measure) expressions.
o WHERE condition—indicates a filter condition to apply to the numeric_expr before
the numeric values are assigned to bins
o INTO number_of_bins—indicates the number of bins to return. The default is 10.
o BETWEEN min_value AND max_value—indicates the minimum and maximum
values used for the end points of the outermost bins
o RETURNING—indicates which value the function returns for each row.
Note the following options:
▪ RETURNING NUMBER—indicates the return value should be the bin number
(for example: 1,2,3,4). This is the default.
▪ RETURNING RANGE_LOW—indicates the lower value of the bin interval
▪ RETURNING RANGE_HIGH—indicates the higher value of the bin interval
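The equal-width classification and the three RETURNING options can be sketched in a few lines of Python. This is only an illustration of the idea, not the BI Server's code: the function name is hypothetical, and the choice to place a value equal to max_value in the last bin (and to clamp values outside the range) is an assumption.

```python
def equal_width_bin(value, min_value, max_value, number_of_bins=10,
                    returning="NUMBER"):
    """Classify value into one of number_of_bins equal-width bins."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    width = (max_value - min_value) / number_of_bins
    index = int((value - min_value) // width)
    # Assumption: out-of-range values are clamped to the outermost bins.
    index = max(0, min(index, number_of_bins - 1))
    low = min_value + index * width
    if returning == "NUMBER":
        return index + 1          # bins are numbered from 1
    if returning == "RANGE_LOW":
        return low
    if returning == "RANGE_HIGH":
        return low + width
    raise ValueError("returning must be NUMBER, RANGE_LOW or RANGE_HIGH")
```

For instance, with four bins between 0 and 60,000, a revenue of 20,000 falls in bin 2, whose interval runs from 15,000 to 30,000.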
Figure 8: The Bin Function is found under the Aggregate folder in the column formula editor.

About Width Buckets
The WIDTH_BUCKET function is known as a “secret function,” meaning it is not
available in the function menu, but the user can type the formula to use it. The syntax of
WIDTH_BUCKET is also comma-based, which is not consistent with most Advanced Analytics
functions in OBIEE. Similar to BIN, WIDTH_BUCKET classifies a given numeric expression into a
specified number of equal-width buckets. It operates on top of a base query result set as a
display function. The function can return either the bucket number or one of the two end points
of the bucket interval. Unlike the BIN function, the WIDTH_BUCKET function is not treated as a
new dimension attribute for the purposes of aggregation. It is applied on top of the query
result, similar to other display functions such as RANK, TOPN, BOTTOMN, NTILE, PERCENTILE,
MAVG, and MEDIAN. Use the WIDTH_BUCKET function when you want to compute a discrete set of
buckets on top of an already aggregated query result set. The syntax for WIDTH_BUCKET is much
simpler than that of the BIN function.
WIDTH_BUCKET Syntax
WIDTH_BUCKET(numeric_expr, {NUMBER | RANGE_LOW | RANGE_HIGH },
number_of_bins, [min_value, max_value] [BY expr1, …, exprN])
Where:
o numeric_expr—indicates the measure or numeric attribute to bin
o NUMBER—indicates that the return value should be the bin number (ex: 1,2,3,4).
o RANGE_LOW—indicates the lower value of the bin interval
o RANGE_HIGH—indicates the higher value of the bin interval
o number_of_bins—indicates the number of bins to return. The default is 10.
o min_value, max_value—indicates the minimum and maximum values used for
the end points of the outermost bins. If the min_value and max_value conditions
are omitted, then the function determines the end points automatically.
o BY expr1, …, exprN—indicates an optional list of expressions that define the
groups in the query result set over which the WIDTH_BUCKET calculation is
applied. The bucket intervals within different groups are calculated
independently.
▪ The BY clause is always optional in the WIDTH_BUCKET function. If it is
omitted, the function operates over the entire result set.
BIN and WIDTH BUCKET: Defining Grouping

The goal of both functions is to define the bin/bucket that each data entry belongs
to. This is accomplished by specifying:
o The column on which the binning should be done (that is, the binned expression).
§ Remember, this is a numeric expression (and usually a measure).
o The attributes by which the data should be arranged.
§ Remember, the BY clause does not have the same meaning in both
functions!
o The number of bins/buckets and the type of data returned.
§ Remember, it is one of three options: the bin or bucket number, its
minimum point, or its maximum point.
o Optionally, the WHERE condition, which is available only in the BIN function.
BIN and WIDTH_BUCKET Function Example
The dimensions and measures being used for this example are:
• LOB
• Per Name Month
• Revenue
• BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."LOB", "Time"."Per Name Month" into 4 bins)
• WIDTH_BUCKET Formula: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4)
o (Define the number of bins for each to be the same or there will be an error)

Figure 9: Above are the results of binning and bucketing revenue. The “BIN” and
“WIDTH_BUCKET” columns show the monthly revenue of each LOB assigned to bins numbered 1-4;
that is, the revenue is sorted into specific numbered groups.
Figure 10: A linear graph where Bin #1 contains the month and year when the revenue was less
than $15,000.

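Figure 11: A linear graph where Bin #2 contains the month and year when the revenue was
between $15,000 and $30,000.
Figure 12: A linear graph where Bin #3 contains the month and year when the revenue was
between $30,000 and $45,000.

Figure 13: A linear graph where Bin #4 contains the month and year when the revenue was
greater than $45,000.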
Be careful when using the BY clause in BOTH functions: the clause has a different meaning in
each, so the two results will not match.
• BIN: BIN("Base Facts"."Revenue" BY "Time"."Per Name Month" into 4 bins)
The meaning of BY "Month" in BIN is: take the sum of "Revenue" by "Month" and
arrange the monthly sums into 4 bins. Rows of the same month therefore receive the same
BIN value.
• WIDTH_BUCKET: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4 by "Time"."Per Name Month")
The meaning of BY "Month" in WIDTH_BUCKET is: take the individual rows of data in
each month and arrange them into 4 buckets.
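The difference between the two BY clauses can be made concrete with a small Python sketch. The sample rows, the function names, and the way boundary values are assigned to buckets are hypothetical; the sketch only mirrors the description above: BIN aggregates to the BY grain first and then bins the aggregates, while WIDTH_BUCKET buckets the existing rows separately within each BY group.

```python
from collections import defaultdict


def bucket_number(value, low, high, buckets):
    """1-based equal-width bucket of value within [low, high]."""
    if high == low:
        return 1
    index = int((value - low) / (high - low) * buckets)
    return min(index, buckets - 1) + 1


def bin_by_month(rows, buckets):
    """BIN-style: sum revenue per month, then bin the monthly sums."""
    totals = defaultdict(float)
    for month, _lob, revenue in rows:
        totals[month] += revenue
    low, high = min(totals.values()), max(totals.values())
    return {m: bucket_number(t, low, high, buckets) for m, t in totals.items()}


def width_bucket_by_month(rows, buckets):
    """WIDTH_BUCKET-style: bucket each row within its own month."""
    groups = defaultdict(list)
    for month, _lob, revenue in rows:
        groups[month].append(revenue)
    return [
        (month, lob,
         bucket_number(revenue, min(groups[month]), max(groups[month]), buckets))
        for month, lob, revenue in rows
    ]


# Hypothetical (month, LOB, revenue) rows, for illustration only.
sample_rows = [
    ("Jan", "A", 10.0), ("Jan", "B", 30.0),
    ("Feb", "A", 50.0), ("Feb", "B", 90.0),
    ("Mar", "A", 20.0), ("Mar", "B", 40.0),
]
```

With these rows, bin_by_month gives every January row the same bin, whereas width_bucket_by_month puts the two January rows in different buckets, which is why the two columns disagree when both use BY.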

Figure 14: The Bin and Width Bucket do not match due to both functions using the BY clause.
Using the WHERE Option in the BIN Function
Figure 15: BIN Function Criteria edited to include the WHERE option.
BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."Product Type", "Time"."Per Name Month"
where "Time"."Per Name Year"='2010' into 4 bins)

The Forecast Function
FORECAST creates a time-series model of the specified measure over the series using either
Exponential Smoothing or ARIMA (Autoregressive Integrated Moving Average). The function
outputs a forecast for the number of periods specified by numPeriods. Forecasting is a useful
tool for predictive analytics: it shows potential trends for different dimensions and
measures.
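To give a feel for what exponential smoothing does, here is a minimal Python sketch of simple exponential smoothing, the most basic member of that family. It is a stand-in for illustration only: OBIEE delegates the real model fitting to its own engine, and the function name and the fixed smoothing factor alpha used here are assumptions.

```python
def simple_exponential_smoothing(values, alpha, num_periods):
    """Smooth the series, then forecast num_periods flat future values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]
    for value in values[1:]:
        # Each new level blends the latest observation with the old level.
        level = alpha * value + (1 - alpha) * level
    # Without trend or seasonal terms, every future period gets the last level.
    return [level] * num_periods
```

Richer models add trend and seasonal components to the same recursion, which is what allows a forecast to rise, fall, or repeat a yearly pattern rather than stay flat.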
Forecast Syntax
Figure 16: The Forecast function can be found under the “Time Series Calculations” folder
within the column formula editor.

FORECAST(numeric_expr, ([series]), output_column_name, options,
[runtime_binded_options])
Where:
o numeric_expr —indicates the measure to forecast.
o series —indicates the time grain at which the forecast model is built. This is a
list of one or more time dimension columns.
▪ If you omit series, then the time grain is determined from the query.
▪ The series must fit the date columns in the Analysis.
o output_column_name —indicates the output column. Valid values are ‘forecast’,
‘low’, ‘high’, and ‘predictionInterval’.
▪ forecast —This column is the forecasted value
▪ low —This column is the lower bound of the forecast
▪ high —This column is the upper bound of the forecast
Together, low and high give the limits of the prediction at the
given confidence level.
▪ predictionInterval —This column is the confidence level
for the prediction.
The predictionInterval ranges from 0 to 100, where higher
values specify a higher confidence.
o options —indicates a string list of name/value pairs separated by a semi-colon.
▪ The value can include %1…%N, which can be specified in
runtime_binded_options.
▪ View the table below for the available options
o runtime_binded_options—indicates a comma separated list of runtime-binded
columns and options

FORECAST also has a number of available options that can be used with the function. Below is a
list of the options, with the accepted values in parentheses:
o numPeriods —The number of periods to forecast. (integer)
o predictionInterval —The confidence for the prediction. (0 to 100, where higher
values specify higher confidence)
o modelType —The model to use for forecasting. (ARIMA: Autoregressive Integrated
Moving Average, fitted to time series data either to better understand the data or to
predict future points in the series; ETS: Error, Trend, Seasonal, an exponential
smoothing state space model that is applied to the ‘y’)
o useBoxCox —If TRUE, then use the Box-Cox transformation, a method used to
normalize a data set so that statistical tests can be performed to evaluate it properly.
Many real-world raw data sets do not conform to the normality assumptions used in
statistics, so transformation functions can sometimes be used to normalize the data.
(TRUE, FALSE)
o lambdaValue —The Box-Cox transformation parameter. Ignored if NULL or when
useBoxCox is FALSE. Otherwise the data is transformed before the model is estimated.
o trendDamp —A parameter for the ETS (Error, Trend, Seasonal) model. If TRUE, then
use a damped trend. If NULL, then try both damped and non-damped trends and choose
the one that is optimal.
o errorType —A parameter for the ETS model. (additive (“A”), multiplicative (“M”),
automatically selected (“Z”))
o trendType —A parameter for the ETS model. (none (“N”), additive (“A”),
multiplicative (“M”), automatically selected (“Z”))
o seasonType —A parameter for the ETS model. (none (“N”), additive (“A”),
multiplicative (“M”), automatically selected (“Z”))
o modelParamIC —The information criterion (IC) to be used in the model selection.
(“ic_auto”, “ic_aicc”, “ic_bic”; “ic_auto” is the default)
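The Box-Cox transformation controlled by useBoxCox and lambdaValue has a simple closed form, sketched below in Python. The formula is the standard one (the logarithm when lambda is 0, otherwise (y^lambda - 1) / lambda); the function names are hypothetical, and how OBIEE applies the transform internally is not shown in this paper.

```python
import math


def box_cox(values, lambda_value):
    """Apply the Box-Cox transformation; all values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lambda_value == 0:
        return [math.log(v) for v in values]
    return [(v ** lambda_value - 1) / lambda_value for v in values]


def inverse_box_cox(values, lambda_value):
    """Map transformed values back to the original scale."""
    if lambda_value == 0:
        return [math.exp(v) for v in values]
    return [(lambda_value * v + 1) ** (1 / lambda_value) for v in values]
```

A model would be estimated on the transformed series and its forecasts mapped back with the inverse, which is why the transform happens before the model is estimated.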

Figure 17: “Per Name Year” has been filtered to be “equal to/ is in” ‘2008’ to allow
forecasting for ‘2009’.
Forecast Example
The formula used in the FORECAST Column is as follows:
FORECAST("Base Facts"."Revenue", ("Time"."Per Name Year", "Time"."Per Name
Month"),'forecast','modelType=arima;numPeriods=%1;predictionInterval=70;',
12)
Figure 18: Forecast for 2009 based on 2008 data.

The Clustering Function
This function collects a set of records into groups based on one or more input expressions using
K-Means or Hierarchical Clustering, the two modes of clustering analysis available in the
advanced analytics clustering model provided in 12c.
K-Means:
Given a set of n observations (x1, x2, …, xn) and a number of clusters (k) specified by the user,
k-means clustering attempts to partition the observations into k clusters so as to minimize the
sum of the squared distances between each point and the center of its cluster. This allows for
an overview of similarities along the given dimensions.
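The partitioning idea can be sketched with a small Lloyd-style iteration in Python: assign every point to its nearest center, recompute each center as the mean of its points, and repeat until nothing changes. This is a teaching sketch, not the routine OBIEE calls; the function name and the deterministic choice of the first k points as starting centers are assumptions.

```python
import math


def kmeans(points, k, max_iter=10):
    """Cluster points (tuples of numbers) into k groups."""
    # Assumption: start from the first k points so the result is repeatable.
    centers = [tuple(float(c) for c in points[i]) for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return assignment, centers
```

The max_iter argument plays the same role as the maxIter option: it caps the number of assign-and-update rounds if the clusters have not settled.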
Hierarchical Clustering:
Generally, this form of clustering builds a hierarchy (a sort of pecking order) of nested
groups, in which the data filters down into distinct groups along the prompted dimensions. The
resulting hierarchy can be read “top-down,” giving a structured overview of the contextual
differences and similarities amongst user-defined dimensions.
Syntax for Clustering Analysis:
CLUSTER( (dimension_expr), (expr), output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr— represents a list of dimensions to be clustered (K).
• expr— represents a list of dimension attributes or measures to be used (x1, x2, …, xn) to
cluster the dimension_expr (K)
• output_column_name— is the output to be displayed in the column. This portion of
the syntax only selects what is shown in the platform and does not perform
any analytics. The valid values include:
o clusterID – This column is the cluster number or ID.
o clusterName – This column is synonymous with clusterID.
o clusterDescription – The description can be added by the end user after the
cluster dataset is persisted into DSS.
o clusterSize – This column is the number of elements in the current cluster.
o distanceFromCenter – This column indicates how far the current cluster
element is from the center of the current cluster.
o centers – This column indicates the center of the current cluster in a format
• options — is a string list of name=value pairs separated by ';'. The value can include %1
... %N, which can be specified using runtime_binded_options.
• runtime_binded_options — indicates a comma separated list of binded columns or
literal expressions that supply a specification to an unrepresented value in the options list.

This portion of the syntax is optional. It merely supplies values for options
that were left as placeholders. For example, in a clustering analysis, you might have the
options numClusters=%1 and maxIter=%2. Suppose that you want 5 clusters and a
maximum of 10 iterations for this particular analysis. Your runtime_binded_options would
then be 5, 10, which corresponds to 5 clusters and 10 iterations. Order matters: %1 in
options is replaced by the first runtime-binded value, %2 by the second, and %N by the Nth.
Here is the entire syntax for this example (the areas of focus are the %1 and %2 placeholders
and the trailing 5, 10):
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"),
("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName',
'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE',
5, 10)
Remember that runtime_binded_options is not required. Parameters can be
specified directly in the options string without it. This means that the following formula is
equivalent to the example given above:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"),
("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName',
'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
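The positional substitution of %1 ... %N can be sketched in Python as follows. The helper names are hypothetical and the code only mimics the behavior described in this section; it is not how the BI Server parses the options string.

```python
import re


def bind_options(options, runtime_values):
    """Replace %1 ... %N in options with the Nth runtime value."""
    def substitute(match):
        position = int(match.group(1))  # %1 refers to the first value
        return str(runtime_values[position - 1])
    return re.sub(r"%(\d+)", substitute, options)


def parse_options(options):
    """Split a 'name=value;name=value' string into a dictionary."""
    return dict(pair.split("=", 1) for pair in options.split(";") if pair)


template = ("algorithm=k-means;numClusters=%1;maxIter=%2;"
            "useRandomSeed=FALSE;enablePartitioning=TRUE")
bound = bind_options(template, [5, 10])
```

Here bound is identical to an options string with the values written in directly, which is the sense in which the two CLUSTER formulas are equivalent.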
Clustering Example Analysis
An example of a clustering analysis could examine how the dimensions of offices and
companies within the data set are clustered along the measures of revenue and discount
amount. One hypothesis for this analysis might be that offices under the same company
behave very similarly with regard to discount amount and revenue.
Formula Syntax [2]

CLUSTER(("Offices"."Office", "Offices"."Company"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;numClusters=@{numClusters};maxIter=@{numIter};useRandomSeed=FALSE;enablePartitioning=TRUE')
Methodology:
The example uses K-means clustering rather than hierarchical clustering. See the Syntax for
Clustering Analysis section above for details on how the numClusters and maxIter options can
take user inputs.

[2] The highlighted text (the @{numClusters} and @{numIter} references) refers to presentation
variables. See Appendix I for more information.

With a user input of 3 clusters and 20 iterations, the output is:
Figure 19: Cluster Visualization for 3 Clusters, with 20 Iterations
Here the clusters are depicted via color and shape, Discount Amount and Revenue are on
the axes, and each point represents one of the 20 offices in the data set. The next figure shows
how this graph changes after doubling the number of clusters.
Figure 20: Cluster Visualization of 6 Clusters with 20 Iterations.

Notice how some clusters are larger than others. This is because this clustering method
groups the objects of the data set so that objects in the same cluster are very similar to each
other and the clusters themselves are very different from one another. As a result,
some clusters contain many highly similar points along the measures of discount amount
and revenue, while others contain only a single, highly distinct data point, such as cluster
number 1 in this analysis. There is no ‘perfect number’ of clusters. The right number
depends on the data set in use, the amount of data, and user preference. The values 3 and 6
were used here merely as examples.
Viewing the data in a tabular format gives the exact amounts within the selected data set and
therefore a more precise view of the data within the clusters. It would be poor practice to
display all of this information on the scatterplot. The visualization makes patterns visible
that might otherwise not be apparent, while the tabular version complements the visual by
letting the user see the precise results of the underlying algorithm. Here is a snippet of the
tabular information, sorted in ascending order by cluster number:
Figure 21: Tabular View of Cluster Analysis.
The last important thing to note is that the clustering function in 12c offers a few
variant methods for clustering. These are subsets within the K-means and Hierarchical
algorithms. For the visual comparison, K-means will be used because it is the default
algorithm for clustering in OBIEE. New variables (compared to the previous analysis) will
also be used to get more data points and to show how the different methods differ.

Figure 22: New Columns for Methodology Comparison.
Notice below the added option, clusterNamePrefix, in the options portion of the syntax for all
3 of the following comparisons. Also notice that useRandomSeed is set to
FALSE because we are comparing methods. In the runtime-binded section of the function,
both %1 and %2 are set to the method name: %1 selects the method and %2 displays the
method name in the legend of the visualization. Also note
that 5 clusters are used in each analysis, which allows for a more telling comparison along our
input dimensions.
K-MEANS CLUSTERING METHODS:
1) Hartigan-Wong Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{Hartigan-Wong}', '@{P_Method}{Hartigan-Wong}')
Figure 23: Output from Hartigan-Wong Method.

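2) Lloyd Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{Lloyd}', '@{P_Method}{Lloyd}')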
Figure 24: Output from Lloyd Method.

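3) MacQueen Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{MacQueen}', '@{P_Method}{MacQueen}')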
Figure 25: Output from MacQueen Method.
Looking closely at these visualizations, it is apparent that the boundaries of each
cluster differ slightly across the 3 methods.
Also of note: hierarchical clustering (h-clustering) has its own methods. Besides its default,
there are also ward.D, ward.D2, Single, Average, Median, McQuitty, and centroid.

The Outlier Function

This function classifies a record as an outlier based on one or more input expressions using
K-Means, Hierarchical Clustering, or Multivariate Outlier Detection algorithms (the 3 methods
for outlier detection in the Advanced Analytics tools in OBIEE 12c). Each method serves
different purposes, and the user can choose the algorithm according to their specific
needs. In statistics, an outlier is a data point that diverges from the rest of the
data set to a statistically significant extent. Outliers can be thought of as data
anomalies: the black sheep within the data. Outlier detection can be thought of as
clustering data along a logical metric, where FALSE denotes normality (not an outlier) and
TRUE denotes abnormality (an outlier). Here is a brief description of the 3 methods
mentioned above:
K-Means:
Given a set of n observations (x1, x2, …, xn) and a number of clusters (k) specified by the user,
k-means clustering attempts to partition the observations into k clusters so as to minimize the
sum of the squared distances between each point and the center of its cluster. This allows for
an overview of similarities along the given dimensions. For outlier detection, there will be
two clusters in a logical format, one of TRUE and one of FALSE, with TRUE denoting an outlier
and FALSE denoting data normality.
Hierarchical Clustering:
Generally, this form of clustering builds a hierarchy (a sort of pecking order) of nested
‘groups’, in which the data filters down into distinct groups along the prompted dimensions.
The resulting hierarchy can be read “top-down,” giving a structured overview of the contextual
differences and similarities amongst user-defined dimensions.
Multivariate Outlier Detection (default outlier detection for 12c):
One way to check for multivariate outliers is with Mahalanobis’ distance [3]. Mahalanobis’
distance can be thought of as a metric for estimating how far each case is from the center of all
the variables’ distributions (i.e. the centroid in multivariate space). Mahalanobis’ distance
accounts for the different scale and variance of each of the variables in a set, and for the
correlations between them, in a probabilistic way.
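For two measures, such as Discount Amount and Revenue, the Mahalanobis distance can be computed by hand, as in the Python sketch below. It uses the standard definition (the square root of the difference vector multiplied by the inverse covariance matrix and by itself again); the function name, the restriction to two dimensions, and the use of the sample covariance are choices made for this illustration, not a description of the mvoutlier algorithm itself.

```python
import math


def mahalanobis_2d(point, data):
    """Mahalanobis distance of point from the centroid of 2-D data."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample variances and covariance (divide by n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = [dx dy] * inverse(S) * [dx dy]^T with the 2x2 inverse written out.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)
```

Points with a large distance lie far from the centroid relative to the spread of the data, which is the intuition behind the ‘distance’ output of an outlier analysis.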

[3] (Mahalanobis, 1927; 1936).

Syntax for Outlier Analysis:
OUTLIER( (dimension_expr1 , ... dimension_exprN), (expr1, .. exprN),
output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr— represents a list of dimensions to be clustered (K)
• expr— represents a list of dimension attributes or measures (x1, x2, …, xn) to be used in
order to find outliers.
• output_column_name— is the output column. The valid values are:
o 'isOutlier': returns a logical value, TRUE or FALSE, indicating whether
each data point is an outlier.
o 'distance': returns the “distance from normality” (the higher this number, the
‘more’ of an outlier the data point is).
• options — is a string list of name=value pairs separated by ';'. The value can include
%1 ... %N, which can be specified using runtime_binded_options.
• runtime_binded_options — is an optional comma separated list of run-time binded
columns or literal expressions that supply a value for a placeholder in the
options list. It merely supplies values for options that were left as placeholders.
For example, in an outlier analysis, the user
might have an option output_column_name=%1. If they wanted to
use the distance for this particular analysis, their runtime_binded_options would then be
equal to ‘distance’. Order matters: %1 in options is replaced by the first runtime-binded
value, %2 by the second, and %N by the Nth. Remember that runtime_binded_options is
optional. You can specify values for your options directly without it, which
implies that runtime_binded_options is more of an organizational tool than a
functional one. Using it versus not using it does not impact performance, but the option is
nice to have for organizational purposes.
Outlier Function Example Analysis:
For the analysis, observe how the dimensions of offices and companies within the data set are
clustered along the measures of revenue and discount amount. One hypothesis
for this analysis might be that offices under the same company behave very similarly with
regard to discount amount and revenue.
Figure 26: Columns used in example analysis.

New Columns for Methodology Comparison
For this example, the multivariate outlier algorithm (mvoutlier) will be used to start, rather
than K-means or hierarchical clustering, for no particular reason other than mvoutlier being the
default algorithm. It could be argued, however, that mvoutlier is the default
for a reason. Observe the variation between algorithms below.
Function Syntax:
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=_________')
Outlier Observations
Thus far, the syntax has not allowed a specific number of outliers to be requested. When
using the multivariate algorithm, entering numClusters into the options in order to change the
result produces an error in the results tab. After experimenting with the sample sales data, the
conclusion is that there is no way to set a specific number of outliers to be detected.
The number of outliers depends on each data set and how it interacts with the underlying
algorithm in R. Setting an “is not equal to” filter on the two outlying data points (the Eiffel
and Spring offices) does not eliminate the outliers. Rather, two new outliers appear (the next
two most northeasterly points on the graph). This is counterintuitive. If the function were
finding only truly, significantly variant data, then the result after this filter was applied
should show all green (FALSE) points on the scatter plot. On the other hand, sometimes a user
might have a data set with all very similar points but still want to find the point(s) that are
most variant. Seen this way, the outlier detection algorithm can be relied on to return
outliers in all situations. It is important to keep these contingencies in mind when analyzing
data.
Plotting these variables on a scatter plot produces the following graphs, with their
accompanying tables:

Multivariate Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
Figure 27: Multivariate Outlier Detection output.
Hierarchical or H-clustering Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
Figure 28: Hierarchical Clustering Outlier Detection output.

K-means Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=Kmeans')
Figure 29: K-Means Outlier Detection output.
Notice that the h-clustering and multivariate algorithms identify the same outliers
(the Eiffel and Spring offices of Tescare Ltd.), but the K-means algorithm
identifies very different ones: the Blue Bell and Teller offices of Stockpiles Inc. This
variation between the algorithms raises the question of their reliability. For this reason,
the variables of analysis were altered to produce a visualization with more data points, and
hence more outliers, to see whether the variation was an anomaly specific to these variables.
The syntax in use is:
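OUTLIER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
OUTLIER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=Mvoutlier')
OUTLIER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=K-means')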
Place each of the above in its own column (with the same variables in each) in order to view
these outliers on the same scatter plot. New variables (Product and Office) were used in this
analysis to obtain a larger number of data points. As is visible after these minor changes, some
outliers overlap and others do not.

Legend Translation:
Blue Squares: Only h-clustering viewed these as outliers
Green Circles: All algorithms viewed these as outliers
Yellow Rhombi: No algorithm viewed these as outliers
Red Plus Signs: The MV Outlier and K-means algorithms viewed these as outliers; h-clustering did not.
In the graph editor, the corresponding order of methodological reference is:
• Hierarchical Clustering
• MV Outlier
• K-Means
Figure 30: Visualization comparing all three outlier methods
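The legend above can be reproduced outside the graph editor by comparing the three isOutlier columns row by row. The following Python sketch is only an illustration: the function name overlap_label and the five rows of flags are hypothetical and are not taken from the Sample Sales Lite results.

```python
from collections import Counter


def overlap_label(h_clustering: bool, mvoutlier: bool, kmeans: bool) -> str:
    """Describe which of the three algorithms flagged a point as an outlier."""
    flagged = [
        name
        for name, hit in (
            ("h-clustering", h_clustering),
            ("mvoutlier", mvoutlier),
            ("k-means", kmeans),
        )
        if hit
    ]
    if not flagged:
        return "none"
    if len(flagged) == 3:
        return "all"
    return " + ".join(flagged)


# Hypothetical isOutlier flags in the order used by the graph editor:
# (Hierarchical Clustering, MV Outlier, K-Means)
rows = [
    (True, False, False),
    (False, True, True),
    (False, False, False),
    (True, True, True),
    (False, False, False),
]

labels = [overlap_label(*row) for row in rows]
counts = Counter(labels)
for label, count in counts.items():
    print(f"{label}: {count}")
```

Each distinct label corresponds to one marker shape and color in a combined scatter plot such as Figure 30.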
Analyzing Methodology Consistency
There are no points on which the h-clustering algorithm overlaps with the other two algorithms.
This is interesting. Other data sets may well contain points on which the methods agree, but in
this particular data set the lack of overlap probably has to do with how each algorithm goes
about 'defining' an outlier. Remember that h-clustering builds a hierarchy through which the data
filters down, whereas the other two methods are distance based, driven by the input criteria or
dimensionality. These differences could account for the variance in the visualization here.
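To make "distance based" concrete, the sketch below ranks two-dimensional points by their Mahalanobis distance from the sample mean, one common way to score multivariate outliers. It is an assumption-laden illustration, not the implementation behind the mvoutlier or K-means options of OUTLIER; the function name and the data are hypothetical.

```python
import math


def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular; points are collinear")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(S) * [dx dy]^T, written out for a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical (discount amount, revenue) pairs; the last point is off-trend.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 11.9), (3, 11.0)]
distances = mahalanobis_2d(points)
farthest = max(range(len(points)), key=lambda i: distances[i])
print("most outlying point:", points[farthest])
```

Because the distance is scaled by the covariance of the data, a point can be flagged for breaking the overall trend even when neither of its coordinates is extreme on its own.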
Also notice that the 'behind the scenes' R statistics appear more consistent in their outlier
detection this time. In the first analysis, K-means was somewhat off compared to the other two
algorithms. Documentation on K-means clustering suggests that the method becomes more reliable
as data sets grow larger. The first analysis had few data points; the second has many. The fact
that there were only 20 points in the first analysis might be the reason for the discrepancy
among the methods. Perhaps, as the data set grows, the algorithms will agree more consistently.
Keep this in mind when choosing algorithms.

The Regression Function
This function fits a linear model and returns either the fitted values or the model. It can be
used to fit a straight line to two measures. In statistics, regression analysis is a process that
estimates the relationship between variables within a data set. The focus is to measure the
relationship between one or more independent (fixed) variables and a dependent variable. More
specifically, regression allows for a deeper understanding of how the dependent value changes
when an independent variable is adjusted. It might help to think of regression in 'mathy' f(x),
or "f of x", notation, where x is the independent variable, or input value. The dependent
variable (the output) can be thought of as the value on the y-axis.
It might also help to think of these two variables linguistically. The y-axis measure is the
dependent variable: it literally depends on some other value changing before it does. The x-axis
measure(s) are literally independent of any other factor(s); they are fixed. This is important to
understand before getting into the syntax.
In layman's terms, regression is a measure of how good a predictor one measure is of another.
Linear regression is also widely used for forecasting trends and for predictive analytics, and it
has strong ties to machine learning. Also, understand that regression does not imply causation;
it only indicates the extent to which two measures are correlated.
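As a rough illustration of the f(x) idea, the Python sketch below fits a straight line y = ax + b to two measures by ordinary least squares. Whether REGR uses exactly this estimator internally is an assumption; the function name fit_line and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values are all identical; the slope is undefined")
    a = sxy / sxx            # slope: change in y per unit change in x
    b = mean_y - a * mean_x  # intercept: value of y where x is zero
    return a, b


# Hypothetical data: revenue is exactly 2 * billed quantity + 5.
billed_quantity = [10, 20, 30, 40, 50]
revenue = [25, 45, 65, 85, 105]

a, b = fit_line(billed_quantity, revenue)
fitted = [a * x + b for x in billed_quantity]
print(f"slope={a}, intercept={b}")
```

Here billed_quantity plays the role of the independent (x-axis) measure and revenue the dependent (y-axis) measure; the fitted list holds the points on the regression line.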
Dummy Variables in Categorical Regression
It is not possible to directly regress a categorical variable against a numerical variable, nor a
numerical variable against a categorical variable. There is a solution, though: the dummy
variable. Suppose an analysis needs a regression model involving a categorical variable that
contains the names of pets (Cats, Dogs, and Birds), in order to see how good a predictor these
pets are of (fill in the blank). It would not make sense to assign Cats, Dogs, and Birds the
values 1, 2, and 3, respectively, unless, for some reason, a Dog were twice as much of a pet as a
Cat and a Bird three times as much of a pet as a Cat. Since regression works on numerical
variables, interpretations are only valid when a stored value of 100 literally means having 100
times the characteristic of a stored value of 1. For the pet example, since assigning 1, 2, and 3
would be illogical, an alternative (with a regression model in mind) is to assign binary values,
such as 1 = Cat and 0 = not a Cat.
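The pet example can be sketched in Python as follows. The helper dummy_encode and the choice of "Bird" as the baseline category are hypothetical; the sketch follows the usual convention of creating one 0/1 column per category except the baseline, so that a row of all zeros means "the baseline".

```python
def dummy_encode(values, baseline):
    """Turn a categorical column into 0/1 dummy columns, omitting the baseline."""
    levels = sorted(set(values) - {baseline})
    return [{f"is_{level}": int(value == level) for level in levels} for value in values]


pets = ["Cat", "Dog", "Bird", "Dog"]
encoded = dummy_encode(pets, baseline="Bird")
for pet, row in zip(pets, encoded):
    print(pet, row)
```

Each dummy column can then be treated as a numerical variable in a regression, because 1 and 0 only state whether the characteristic is present and imply no ordering among the pets.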
Syntax for Regression Analysis
REGR(y_axis_measure_expr, (x_axis_expr), (category_expr1, ...,
category_exprN), output_column_name, options, [runtime_binded_options])
Where:
• y_axis_measure_expr represents the measure for which the regression model is to be
computed. This is your dependent variable.

• x_axis_expr represents the measure to be used to determine the regression model for the
y_axis_measure_expr. This is your independent variable.
• category_expr1, ..., category_exprN represents the dimension/dimension
attributes to be used to determine the category for which the regression model for the
y_axis_measure_expr is to be computed. One or more dimensions or dimension
attributes, up to five, may be provided as category columns.
• output_column_name indicates the output column. Valid values are:
o fitted - returns the points on the regression line (y=ax+b)
o intercept - returns the value of y where x is zero (b from y=ax+b)
o modelDescription - returns the model in JSON format.
• options is a string list of name=value pairs separated by ';'. The value can include %1 ...
%N, which can be specified using runtime_binded_options.
• runtime_binded_options is an optional comma-separated list of run-time bound
columns and options.
Regression Example Analysis

In this analysis, a comparison is made to reveal how good a predictor the independent variable,
billed quantity, is of the dependent variable, revenue. The question to be answered is: if the
quantity of billed items changes, how is revenue altered? Based on the column names alone, one
would predict that the data will cluster fairly closely around an upward-sloping regression line.
This would mean that billed quantity is a good predictor of revenue, which is fairly intuitive.
What can also be seen below, however, is that billed quantity is not a perfect predictor of
revenue; if it were, fewer points would lie away from the regression line. In a regression
scatter plot like the one below, the more tightly the 'green dots' hug the 'blue dots', the
higher the correlation between the two variables.
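How tightly the points hug the line can be expressed as a single number, the Pearson correlation coefficient. The Python sketch below is a minimal illustration using hypothetical values, not figures from the Sample Sales Lite data; a value near 1 means the points sit close to an upward-sloping line.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


billed_quantity = [1, 2, 3, 4, 5]
tight_revenue = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical: close to a straight line
loose_revenue = [4, 1, 9, 3, 10]            # hypothetical: widely scattered

r_tight = pearson_r(billed_quantity, tight_revenue)
r_loose = pearson_r(billed_quantity, loose_revenue)
print(f"tight: {r_tight:.3f}, loose: {r_loose:.3f}")
```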
Function Syntax Used
REGR("Base Facts"."Revenue", ("Base Facts"."Billed Quantity"), ("Time"."Per Name Month", "Time"."Per Name Year"), 'fitted', '')

Figure 31: Regression Analysis of Billed Quantity as a Predictor of Revenue
In the tabular view (Figure 32), the column headed "Regression" holds the regression function's
output, which can be compared against the chart in Figure 31.
Figure 32: Regression Output in Tabular View

It may be interesting to see which data points in this regression did not fit the trend. The
visualization below was created using this syntax: OUTLIER(("Time"."Per Name Year",
"Time"."Per Name Month"), ("Base Facts"."Billed Quantity","Base
Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier'). This displays the outlying values
for the same variables used in the regression above.
Figure 33: Visual of Data Points where Billed Quantity is not a Predictor of Revenue
Concentrate on the red plus signs rather than the yellow rhombi. The red plus signs are the
outliers for this regression analysis, whereas the yellow rhombi are merely the corresponding
points plotted on the regression line for these 4 outlying data points. By sorting the outlier
portion of this data set, one can create a table that shows the years and months in which billed
quantity was not a great predictor of revenue.
Figure 34: Tabular View of Outliers Within a Regression Analysis

What is noticeable is that, for the 6th and 7th months of 3 consecutive years, billed quantity
was not a great predictor of revenue. With this information, it is possible to drill down into
why this might be the case. Surfacing this sort of quantitative and visual 'hint' in an aesthetic
way is the essence of these advanced analytics tools. Statistics can tell a lot about why things
are the way they are and can, ultimately, provide insight that helps build a sustainable
organization.

Appendix I: Creating Presentation Variables and Prompts
Presentation Variable and Prompting the User for Function Options
Above, the function syntax varies slightly from the original syntax given: the options portion
of the function input contains @{numClusters};maxIter=@{numIter}. The @{} notation references a
presentation variable, which a dashboard prompt populates by asking the user for the number of
clusters and the number of iterations the algorithm should perform. In many cases it is a good
idea to prompt the user for these values because it makes the dashboard more interactive. It is
also useful because this simple change shows how the results for a large sample change as the
data set is segmented into different numbers of clusters.
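Conceptually, the prompt value replaces the @{} placeholder in the options string before the function runs. The Python sketch below models that substitution only as an illustration; it is not how OBIEE implements presentation variables. The resolve function, the full options string, and the {default} suffix shown after each placeholder are assumptions made for the example.

```python
import re

# Matches @{name} optionally followed by {default}.
_PLACEHOLDER = re.compile(r"@\{(\w+)\}(?:\{([^}]*)\})?")


def resolve(options: str, prompt_values: dict) -> str:
    """Replace each @{name} placeholder with the prompted value or its default."""

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in prompt_values:  # lookup is case sensitive
            return str(prompt_values[name])
        if default is not None:
            return default
        raise KeyError(f"no value or default for presentation variable {name!r}")

    return _PLACEHOLDER.sub(substitute, options)


# Hypothetical options string for the CLUSTER function.
template = "algorithm=k-means;numClusters=@{numClusters}{5};maxIter=@{numIter}{10}"

# The user answered only the numClusters prompt; numIter falls back to its default.
print(resolve(template, {"numClusters": 4}))
```

Because the lookup is case sensitive, a prompt that sets numclusters would not populate @{numClusters}, which is why the variable name in the column formula and in the dashboard prompt must match exactly.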
A developer who wants to do the same should highlight, in the syntax, the portion that would
typically contain (%1…%N) for whichever variable needs a prompt, and then perform the following
tasks:
Figure A1: Highlight the %N.
Figure A2: Click “Variable”, then “Presentation”.

Figure A3: Input a variable expression.
It is important to be careful prior to clicking OK here. This Variable Expression must be
matched in a case sensitive fashion to the corresponding dashboard prompt. Click OK.
Figure A4: Click “New”, then “Dashboard Prompt”.
Figure A5: Click the green arrow, then “Variable Prompt”.

• Prompt for = Presentation Variable
• *Label = numClusters (this must equal the presentation variable that was set in the column
function)
• Expand the options window
• Variable Data Type = Number
A Note on Defaults
The user can set a default value here. Note also that there appears to be an undocumented
default value of 5 clusters. For example, the syntax CLUSTER(("Products"."Product",
"Offices"."Office"), ("Base Facts"."Discount Amount","Base
Facts"."Revenue"),'clusterName', 'algorithm=k-means;') returns the following visualization:
Figure A6: Default Visualization of Discount Amount versus Revenue.

Figure A7: Complete the process again for the iteration variable.
Save these Prompts.
Now, in the Dashboard where the dashboard prompt and the analysis have been placed, this
presentation variable can be seen in action.

Document History
Created By: Brendan Doyle
Mike Perhats
Edited By: Phil Goerdt
Creation Date: 8/8/16
Last Edit Date: 8/8/16
