www.redpillanalytics.com
Abstract
Oracle Business Intelligence Enterprise Edition 12c has enhanced analytical capabilities
due to an (optional) integration with the statistical software R. These new functions include the
following: Trendline, Bin and Width Bucket, Forecast, Clustering, Outlier and Regression. This
document will provide a comprehensive review of these newly available functions, and provide
examples of them in action. For ease of understanding and reproducibility, the sample data set is
Oracle’s Sample Sales Lite.[1]
[1] This data set is available with every install of OBIEE 12c. Alternatively, a similar set can be
found within the Oracle BI Sample App Virtual Machine.
The Trendline Function
Trendline is part of the Advanced Analytics Internal Logical SQL Functions, meaning it
is in the group of functions that are done internally as opposed to being done in R. This function
fits a linear or exponential model, and returns the fitted values or model. The numeric_expr
represents the Y value for the trend and the series (time columns) represent the X value. A
Trendline is a model, and it asserts that the data is the result of that model. The TRENDLINE
function measures data across time and shows a line of a metric by ordered records. It can model
data as linear and as exponential regression.
Figure 1: The Trendline function is found under the Aggregate folder by clicking on the
“Insert Function” button in the Formula section of the column editor.
Trendline Function Syntax
TRENDLINE( <numeric_expr>, ( [<series>] ) BY ( [<partitionBy>] ),
<model_type>, <result_type>, [number_of_degrees] )
Where:
o numeric_expr—represents the data to trend
▪ This is the Y-Axis and is a measure column.
o series—indicates the X-axis. This is a list of <valueExp> <orderByDirection>,
where <valueExp> is a dimension column and <orderByDirection> is ASC
(ascending) or DES (descending).
▪ The default is ASC. Note that this cannot be an arbitrary combination of
numeric columns.
▪ It is possible to use more than one Trendline column in the same analysis,
but the Trendline columns must have the same X-Axis.
o partitionBy—A list of dimension attributes that are not on the X-Axis.
o model_type— A model type may be one of the following types:
▪ LINEAR—a function with a constant rate of change and a straight line
graph.
▪ EXPONENTIAL—a function in which a constant base is raised to the power of the
variable.
o result_type— A results type may be one of the following types:
▪ VALUE - will return all the regression Y values given that X in the fit.
▪ MODEL - will return all the parameters in a JSON (JavaScript Object
Notation, which is a lightweight data-interchange format) format string.
Figure 2: Example formula to display result_type of MODEL.
Figure 3: Results of using ‘MODEL’ as the result type; it returns the parameters in a JSON
(JavaScript Object Notation) format string.
Example Syntax
TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR', 'VALUE')
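To make the LINEAR model type with the VALUE result type more concrete, here is a minimal Python sketch of an ordinary least-squares line fit. It is an illustration only, not OBIEE's implementation: the function names `fit_linear_trend` and `trend_values` are hypothetical, and the X-axis is assumed to be the position of each ordered record (0, 1, 2, ...).

```python
from typing import List, Tuple


def fit_linear_trend(ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x, with x = 0, 1, 2, ...

    Assumption: the ordered series positions stand in for the X-axis.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def trend_values(ys: List[float]) -> List[float]:
    """Fitted Y value for every X, analogous in spirit to result type VALUE."""
    intercept, slope = fit_linear_trend(ys)
    return [intercept + slope * x for x in range(len(ys))]


# Hypothetical revenue figures, ordered by date.
revenue = [10.0, 12.0, 15.0, 13.0, 18.0]
intercept, slope = fit_linear_trend(revenue)
fitted = trend_values(revenue)
```

In this sketch the pair (intercept, slope) plays the role of the parameters that the MODEL result type reports, while `fitted` corresponds to the values plotted as the trend line.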
Figure 4: Selected dimensions and fact columns for a sample trendline analysis.
Figure 5: Note the Trendline (in green); depicting these types of subtle changes is what this
function is best at.
Figure 6: If the graph is set to vary color by ‘Per Name Year’, the results are displayed for each
year. Note the differences between each year that otherwise would not be apparent.
Figure 7: Segmentation of the trends could continue to smaller subsets. Above, the 2009 has
been split by semester.
The BIN and WIDTH_BUCKET Functions
Both BIN and WIDTH_BUCKET are included in the Advanced Analytics Internal
Logical SQL Functions, meaning they are in the group of functions that are done internally as
opposed to being done in R. With that being said, the syntax for the two functions is different
and will be covered later on.
About BIN
In the BIN function, the user can select any numeric attribute (INT, FLOAT, DOUBLE,
NUMERIC) from a dimension or fact table/measure containing the data values and place them
into a discrete number of bins. The reason to bin a measure is to separate the results of the
measure into groups (see BIN Syntax). An example would be taking sales from a store and
binning the revenue into less than $200, between $200 and $500, and so on. Each sale is placed
into the bin whose revenue range it falls in. The
BIN function classifies a given numeric expression into a specified number of equal-width buckets.
The function can return either the bin number or one of the two end points of the bin interval.
The output of the BIN function is used as a GROUP BY expression for other measures included
in the query. The BIN function is treated like a new dimension attribute for purposes such as
aggregation, filtering, and drilling. All of these operations are supported on BIN expressions.
BIN Syntax
BIN(numeric_expr [BY grain_expr1, …, grain_exprN] [WHERE condition] INTO
number_of_bins BINS [BETWEEN min_value AND max_value] [RETURNING { NUMBER
| RANGE_LOW | RANGE_HIGH }])
Where:
o numeric_expr—indicates the measure or numeric attribute to bin
o BY grain_expr1, …, grain_exprN—indicates a list of expressions that define
the grain at which the numeric_expr is calculated before the numeric values are
assigned to bins.
▪ This clause is required for measure expressions and is optional for
attribute expressions
▪ The BY clause of the BIN function defines the grain at which the binned
expression is evaluated prior to binning.
If the binned expression is a measure, then the measure is grouped
at the grain specified in the BY clause before being binned.
▪ The BY clause of the BIN function is mandatory if the binned expression is
a measure.
Otherwise, for non-measure expressions, the BY clause is optional.
o WHERE condition—indicates a filter condition to apply to the numeric_expr before
the numeric values are assigned to bins
o INTO number_of_bins—indicates the number of bins to return. The default is 10.
o BETWEEN min_value AND max_value—indicates the minimum and maximum
values used for the end points of the outermost bins
o RETURNING—indicates which value the function returns for each bin. Note the
following options:
▪ RETURNING NUMBER—indicates the return value should be the bin number
(for example: 1,2,3,4). This is the default condition
▪ RETURNING RANGE_LOW—indicates the lower value of the bin interval
▪ RETURNING RANGE_HIGH—indicates the higher value of the bin interval
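The equal-width classification described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, under stated assumptions: the function name `equal_width_bin` is hypothetical, the end points are supplied explicitly (as with BETWEEN min_value AND max_value), and values on or beyond the outer end points are assumed to fall into the outermost bins. How OBIEE treats those edge cases is not stated in this document.

```python
def equal_width_bin(value, number_of_bins, min_value, max_value, returning="NUMBER"):
    """Classify value into one of number_of_bins equal-width bins.

    returning mirrors the three options in the BIN syntax:
    NUMBER (1-based bin number), RANGE_LOW, or RANGE_HIGH.
    """
    if number_of_bins < 1 or max_value <= min_value:
        raise ValueError("need at least one bin and min_value < max_value")
    width = (max_value - min_value) / number_of_bins
    index = int((value - min_value) // width)
    # Assumption: clamp out-of-range values into the first or last bin.
    index = max(0, min(number_of_bins - 1, index))
    low = min_value + index * width
    if returning == "NUMBER":
        return index + 1
    if returning == "RANGE_LOW":
        return low
    if returning == "RANGE_HIGH":
        return low + width
    raise ValueError("returning must be NUMBER, RANGE_LOW or RANGE_HIGH")


# Four bins between 0 and 60,000 give a width of 15,000 per bin.
bin_number = equal_width_bin(20000, 4, 0, 60000)
bin_low = equal_width_bin(20000, 4, 0, 60000, returning="RANGE_LOW")
bin_high = equal_width_bin(20000, 4, 0, 60000, returning="RANGE_HIGH")
```

With these hypothetical end points, a revenue of 20,000 lands in bin 2, whose interval runs from 15,000 to 30,000.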
Figure 8: The Bin Function is found under the Aggregate folder in the column formula editor.
About Width Buckets
The WIDTH_BUCKET function is known as a “secret function,” meaning it is not
available in the function menu, but the user can type the formula to use it. The syntax of
WIDTH_BUCKET is also comma-based, which is not consistent with most Advanced Analytics
functions in OBIEE. Similar to binning, width bucket classifies a given numeric expression into a specified
number of equal width buckets. It operates on top of a base query result set as a display function.
The function can return either the bin number or one of the two end points of the bin interval.
Unlike the BIN function, the WIDTH_BUCKET function is not treated as a new dimensional
attribute for the purposes of aggregation. It is applied on top of the query result similar to the
other display functions such as RANK, TOPN, BOTTOMN, NTILE, PERCENTILE, MAVG,
and MEDIAN. Use the WIDTH_BUCKET function when you want to compute a discrete set of
buckets on top of an already aggregated query result set. The syntax for Width Bucket is much
simpler than that of the BIN function.
WIDTH_BUCKET Syntax
WIDTH_BUCKET(numeric_expr, {NUMBER | RANGE_LOW | RANGE_HIGH },
number_of_bins, [min_value, max_value] [BY expr1, …, exprN])
Where:
o numeric_expr—indicates the measure or numeric attribute to bin
o NUMBER—indicates that the return value should be the bin number (ex: 1,2,3,4).
o RANGE_LOW—indicates the lower value of the bin interval
o RANGE_HIGH—indicates the higher value of the bin interval
o number_of_bins—indicates the number of bins to return. The default is 10.
o min_value, max_value—indicates the minimum and maximum values used for
the end points of the outermost bins. If the min_value and max_value conditions
are omitted, then the function determines the end points automatically.
o BY expr1, …, exprN—indicates an optional list of expressions that define the
groups in the query result set over which the WIDTH_BUCKET calculation is
applied. The bucket intervals within different groups are calculated
independently.
▪ The BY clause of the WIDTH_BUCKET function defines the groups in the
query result over which the WIDTH_BUCKET calculation is applied.
The buckets within different groups are calculated independently.
▪ The BY clause is always optional in the WIDTH_BUCKET function.
If the BY clause is omitted from the WIDTH_BUCKET function,
then the function operates over the entire result set.
BIN and WIDTH BUCKET: Defining Grouping
The goal of both functions is to define the bin/bucket that each data entry belongs
to. This is accomplished by specifying:
o The column on which the binning should be done (that is, the binned expression).
▪ Remember, this is a numeric expression (and usually a measure).
o The attributes by which the data should be arranged.
▪ Remember, the BY clause does not have the same meaning in both
functions!
o The number of bins/buckets and the type of data returned.
▪ Remember, it is one of three options: the bin or bucket number, its
minimum point, or its maximum point.
o The WHERE condition, an option found in the BIN function.
BIN and WIDTH_BUCKET Function Example
The dimensions and measures being used for this example are:
• LOB
• Per Name Month
• Revenue
• BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."LOB", "Time"."Per Name Month" into 4 bins)
• WIDTH_BUCKET Formula: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER,
4)
o (Define the number of bins for each to be the same or there will be an error)
Figure 9: Above are the results of the binning and buckets of revenue. The table shows that it is
binning the monthly revenue of the LOB in columns “BIN” and “WIDTH_BUCKET” in bins of
1-4. It is sorting or binning the revenue into specific numbered groups.
Figure 10: A linear graph where Bin #1 contains the month and year when the revenue was less
than $15,000.
Figure 11: A linear graph where Bin #2 contains the month and year when the revenue was
between $15,000 and $30,000.
Figure 12: A linear graph where Bin #3 contains the month and year when the revenue was
between $30,000 and $45,000.
Figure 13: A linear graph where Bin #4 contains the month and year when the revenue was
greater than $45,000.
Be sure not to use the BY clause in BOTH functions, for it will result in an error: the BY clause does not mean the same thing in each function.
• BIN: BIN("Base Facts"."Revenue" BY "Time"."Per Name Month" into 4 bins)
The meaning of BY "Month" in BIN is: take the sum("Revenue" by "Month") and
arrange the monthly sums into 4 bins. Rows of the same month will therefore have the same BIN
result.
• WIDTH_BUCKET: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4 BY "Time"."Per Name Month")
The meaning of BY "Month" in WIDTH_BUCKET is: take the individual rows of data in
each month and arrange them into 4 buckets.
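The difference between the two BY clauses can be illustrated with a small Python sketch. Everything here is hypothetical: the sample rows, the helper names, and the choice to derive the end points from the data and to place the maximum value in the last bucket. It is meant only to show the two grouping behaviors described above, not to reproduce OBIEE's exact results.

```python
from collections import defaultdict


def bucket(value, low, high, n):
    """1-based equal-width bucket of value within [low, high]."""
    if high == low:
        return 1
    index = int((value - low) / (high - low) * n)
    return min(index, n - 1) + 1  # assumption: the maximum joins the last bucket


# Hypothetical detail rows: (month, revenue).
rows = [("Jan", 10.0), ("Jan", 30.0), ("Jan", 50.0),
        ("Feb", 100.0), ("Feb", 200.0), ("Feb", 300.0)]


def bin_by_month(rows, n):
    """BIN-style BY: aggregate revenue per month first, then bin the monthly sums."""
    totals = defaultdict(float)
    for month, revenue in rows:
        totals[month] += revenue
    low, high = min(totals.values()), max(totals.values())
    return [bucket(totals[month], low, high, n) for month, _ in rows]


def width_bucket_by_month(rows, n):
    """WIDTH_BUCKET-style BY: bucket individual rows separately within each month."""
    groups = defaultdict(list)
    for month, revenue in rows:
        groups[month].append(revenue)
    return [bucket(revenue, min(groups[month]), max(groups[month]), n)
            for month, revenue in rows]


bin_result = bin_by_month(rows, 4)
width_bucket_result = width_bucket_by_month(rows, 4)
```

In the first result every row of a month shares one bin number, because the month total was binned. In the second result the rows of each month are spread across buckets computed within that month, which is why the two columns stop matching when both use BY.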
Figure 14: The Bin and Width Bucket do not match due to both functions using the BY clause.
Using the WHERE Option in the BIN Function
Figure 15: BIN Function Criteria edited to include the WHERE option.
BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."Product Type", "Time"."Per Name Month"
where "Time"."Per Name Year"='2010' into 4 bins)
The Forecast Function
A Forecast creates a time-series model of the specified measure over the series using either
Exponential Smoothing or ARIMA (Autoregressive integrated moving average). This function
outputs a forecast for the set of periods as specified by numPeriods. Forecasting is very useful as
a tool for predictive analytics. You can see potential trends for different dimensions and
measures because of this function.
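As a rough intuition for the exponential smoothing family, the following Python sketch implements simple exponential smoothing, the most basic member of that family (no trend and no seasonal component). It is a simplification and an assumption on our part: the function name and the fixed smoothing weight `alpha` are hypothetical, and the ETS and ARIMA models used by the FORECAST function estimate their parameters from the data and can include trend and seasonality.

```python
def simple_exponential_smoothing(series, alpha, num_periods):
    """Forecast num_periods future values with simple exponential smoothing.

    Each new level is a weighted average of the latest observation and the
    previous level; the forecast is flat at the final level.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
    return [level] * num_periods


# Hypothetical monthly revenue, smoothed with alpha = 0.5.
forecast = simple_exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5, num_periods=3)
```

Here `num_periods` plays the same role as the numPeriods option: it sets how many future periods are returned.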
Forecast Syntax
Figure 16: The Forecast function can be found under the “Time Series Calculations” folder
within the column formula editor.
FORECAST(numeric_expr, ([series]), output_column_name, options,
[runtime_binded_options])
Where:
o numeric_expr —indicates the measure to forecast.
o series —indicates the time grain at which the forecast model is built. This is a
list of one or more time dimension columns.
▪ If you omit series, then the time grain is determined from the query.
▪ The series must fit the date columns in the Analysis.
o output_column_name —indicates the output column. Valid values are ‘forecast’,
‘low’, ‘high’, and ‘predictionInterval.’
▪ forecast —This column is the forecasted output
▪ low —This column is the forecasted lower bound number
▪ high —This column is the forecasted higher bound number
Upper and lower limits of the prediction at the given confidence
level might be important
▪ predictionInterval —This is an available option that is the confidence
for the prediction.
The predictionInterval ranges from 0 to 100, where the higher
values specify a higher confidence.
o options —indicates a string list of name/value pairs separated by a semi-colon.
▪ The value can include %1…%N, which can be specified in
runtime_binded_options.
▪ View the table below for the available options
o runtime_binded_options—indicates a comma separated list of runtime-binded
columns and options
Forecast also has many available options that can be used with the function. Below is a list of
the options (value type in parentheses):
numPeriods —The number of periods to forecast (integer)
predictionInterval —The confidence for the prediction (0 to 100, where higher
values specify higher confidence)
modelType —The model to use for forecasting. (ARIMA—Autoregressive Integrated
Moving Average, fitted to time series data either to better understand the data or to
predict future points in the series), (ETS—Error, Trend, Seasonal—exponential
smoothing state space model that is applied to the ‘y’.)
useBoxCox —If TRUE, then use Box-Cox transformation, which is a method used to
normalize a data set so that statistical tests can be performed to evaluate it properly.
Many real world raw data sets do not conform to the normality assumptions used for
statistics, so transformation functions can sometimes be used to normalize the data.
(TRUE, FALSE)
lambdaValue —The Box-Cox transformation parameter. Ignore if NULL or when
useBoxCox is FALSE. Otherwise the data is transformed before the model is estimated.
trendDamp —This is a parameter for ETS (Error, Trend, Seasonal) model. If TRUE, then
use damped trend. If NULL, then try both damped and non-damped trend and choose
the one that is optimal.
errorType —This is a parameter for ETS model. (additive (“A”), multiplicative (“M”),
automatically selected (“Z”))
trendType —This is a parameter for ETS model. (none(“N”), additive (“A”),
multiplicative (“M”), automatically selected (“Z”))
seasonType —This is a parameter for ETS model. (none(“N”), additive (“A”),
multiplicative (“M”), automatically selected (“Z”))
modelParamIC —The information criterion (IC) to be used in the model selection.
("ic_auto", "ic_aicc", "ic_bic"; "ic_auto" is the default)
Figure 17: “Per Name Year” has been filtered to be “equal to/ is in” ‘2008’ to allow
forecasting for ‘2009’.
Forecast Example
The formula used in the FORECAST Column is as follows:
FORECAST("Base Facts"."Revenue", ("Time"."Per Name Year", "Time"."Per Name
Month"),'forecast','modelType=arima;numPeriods=%1;predictionInterval=70;',
12)
Figure 18: Forecast for 2009 based on 2008 data.
The Clustering Function
This function groups a set of records into groups based on one or more input expressions using
K-Means or Hierarchical Clustering, which are the two modes of clustering analysis that can be
utilized in the advanced analytics clustering model provided in 12c.
K-MEANs:
Given a set of observations (x1, x2, …, xn), k-means clustering
attempts to partition them into a user-specified number of clusters (k) so as to minimize the sum of the
squared distances of each individual point from the center of its cluster. This allows for an overview of
similarities along the given dimensions.
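The partitioning idea can be sketched in Python as the classic two-step iteration (assign each point to its nearest center, then move each center to the mean of its points). This is a minimal illustration under stated assumptions: the function name `kmeans`, the explicit starting centers, and the sample points are hypothetical, and the implementation behind the CLUSTER function may differ in initialization and in the variant of the algorithm it uses.

```python
import math


def kmeans(points, centers, max_iter=10):
    """Basic k-means iteration on tuples of coordinates.

    points:  list of equal-length tuples
    centers: initial centers (assumption: supplied by the caller)
    Returns (assignment, centers), where assignment[i] is the cluster index of points[i].
    """
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each center to the mean of its member points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers


# Hypothetical (discount amount, revenue) pairs forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centers = kmeans(points, centers=[(0, 0), (10, 10)], max_iter=10)
```

The `max_iter` argument corresponds loosely to the maxIter option shown later, and the number of starting centers corresponds to numClusters.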
Hierarchical Clustering:
Generally, this form of clustering is an attempt to build a sort of pecking order in which the data
filters down into distinct groups along the prompted dimensions. Hierarchical clustering can be
thought of as a sort of “top-down” approach of structuring an overview for viewing contextual
differences/similarities amongst user-defined dimensions.
Syntax for Clustering Analysis:
CLUSTER( (dimension_expr), (expr), output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr— represents a list of dimensions to be clustered (K).
• expr— represents a list of dimension attributes or measures to be used (x1, x2, …, xn) to
cluster the dimension_expr (K)
• output_column_name— is the output to be printed in the column header; this portion of
the syntax is only part of the aesthetic interaction in the platform and does not perform
any analytics. The valid values include:
o clusterID – This column is the cluster number or ID.
o clusterName – This column is synonymous with clusterID.
o clusterDescription – The description can be added by the end user after the
cluster dataset is persisted into DSS.
o clusterSize – This column is the number of elements in the current cluster.
o distanceFromCenter – This column indicates how far the current cluster
element is from the center of the current cluster.
o centers – This column indicates the center of the current cluster in a format
• options — is a string list of name=value pairs separated by ';'. The value can include %1
... %N, which can be specified using runtime_binded_options.
• runtime_binded_options — indicates a comma separated list of binded columns or
literal expressions that supply a specification to an unrepresented value in the options list.
This portion of the syntax is optional. It merely supplies values for options
that have yet to be specified. For example, in the clustering analysis, you might have
options of numClusters=%1, maxIter=%2. Suppose that you want 5 clusters and a
maximum of 10 iterations for this particular analysis. Your runtime_binded_options would
then be 5,10, which corresponds to 5 clusters and 10 iterations. Order matters: %1 in
options is replaced by the first runtime-binded value, %2 by the second, and %N by the Nth. Here is
the entire syntax for this example:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"),
("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName',
'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE',
5, 10)
Remember that the runtime_binded_options option is not required. Parameters can be
specified in the function without the use of this option. This means that the following code is
equivalent to the example given above:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"),
("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName',
'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
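The equivalence of the two forms can be checked with a short Python sketch of the placeholder substitution. This only models the behavior described in the text (%1 takes the first runtime value, %2 the second, and so on); the function names `bind_options` and `parse_options` are hypothetical and are not part of OBIEE.

```python
import re


def bind_options(options, *runtime_values):
    """Replace %1 ... %N in an options string with the matching runtime values."""
    def substitute(match):
        position = int(match.group(1))
        if not 1 <= position <= len(runtime_values):
            raise ValueError(f"no runtime value supplied for %{position}")
        return str(runtime_values[position - 1])
    return re.sub(r"%(\d+)", substitute, options)


def parse_options(options):
    """Split a 'name=value;name=value' string into a dictionary."""
    pairs = (item.split("=", 1) for item in options.split(";") if item.strip())
    return {name.strip(): value.strip() for name, value in pairs}


template = ("algorithm=k-means;numClusters=%1;maxIter=%2;"
            "useRandomSeed=FALSE;enablePartitioning=TRUE")
bound = bind_options(template, 5, 10)
parsed = parse_options(bound)
```

After substitution, `bound` is the same options string as in the second CLUSTER example, which is why the two calls behave identically.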
Clustering Example Analysis
An example of a clustering analysis could check to see how the dimensions of offices and
companies within the data set were clustered along the measures of revenue and discount
amount. One hypothesis for this analysis might be that offices under their respective companies
are acting very similar in regards to discount amount and revenue.
Formula Syntax[2]
CLUSTER(("Offices"."Office", "Offices"."Company"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;numClusters=@{numClusters};maxIter=@{numIter};useRandomSeed=FALSE;enablePartitioning=TRUE')
Methodology:
The example will be using K-means clustering rather than hierarchical clustering. See above in
the Syntax for Clustering section for details on the syntax variation for the options of
numClusters and maxIter that allow for user inputs for these variables.
																																																								
[2] The highlighted text refers to presentation variables. See Appendix I for more information.
With a user input of 3 clusters and 20 iterations, one would receive an output of:
Figure 19: Cluster Visualization for 3 Clusters, with 20 Iterations
Here the clusters are depicted via color and shape, Discount Amount and Revenue are on
the axes, and each point represents one of the 20 offices in the data set. We can see how this
graph changes after doubling the number of clusters.
Figure 20: Cluster Visualization of 6 Clusters with 20 Iterations.
Notice how some clusters are larger than others. This is because in this clustering method, the
objects of the data set are grouped in such a way that the clusters are very different from each
other and the objects in the same group or cluster are very similar to each other. This being said,
some data clusters might contain highly similar points along the measures of discount amount
and revenue while others are highly varied and only contain one data point, such as cluster
number 1 in this analysis. There is no ‘perfect number’ of clusters. This number is
contingent upon the data set in use, the amount of data, and user preference. 3 and 6 were used
here merely as examples.
If the data is in a tabular format, one can get a fairly informational depiction of exact amounts
within the selected data set. This allows for a more precise or exact view of the data within the
clusters. It would be poor practice to display all of this information on the scatterplot. The
visualization is more of an aesthetic way of viewing data that allows for increased perception of
what might otherwise not be apparent. The tabular version is important in correspondence with
the visual so that the user can witness precision along the results of the executed underlying
algorithm. Here is a snippet of the tabular information, sorted in ascending order by cluster
number:
Figure 21: Tabular View of Cluster Analysis.
The last important thing to note is that within the clustering function in 12c there are a few
variant methods for clustering. These are sort of subsets within the K-means and Hierarchical
methods. For the visual comparison K-means will be used because K-means is the default
method for clustering in OBIEE. Also new variables (as compared to the previous analysis) will
be used to get more data points and to compare the different methods accordingly to see how
they differ.
Figure 22: New Columns for Methodology Comparison.
Notice below, the added option in the options portion of the syntax for all 3 of the following
comparisons, clusterNamePrefix, for this function. Also notice that useRandomSeed is set to
FALSE because we are comparing methods. In the ‘run time binded’ section of the function
analysis, both %1 and %2 are set to (“INSERT METHOD”) for the usage of methodology and
the display of the methodology name in the legend for the visualization respectively. Also note
that 5 clusters are used in each analysis which allows for a more telling comparison along our
input dimensions.
K-MEANS CLUSTERING METHODS:
1) Hartigan-Wong Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{Hartigan-Wong}', '@{P_Method}{Hartigan-Wong}')
Figure 23: Output from Hartigan-Wong Method.
2) Lloyd Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{Lloyd}', '@{P_Method}{Lloyd}')
Figure 24: Output from Lloyd Method.
3) MacQueen Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{MacQueen}', '@{P_Method}{MacQueen}')
Figure 25: Output from MacQueen Method.
Looking closely at these visualizations, it is apparent that the boundaries of the
clusters differ slightly among the 3 methods.
Also of note: H-Clustering has its own set of methods. Besides the default, there are also ward.D,
ward.D2, Single, Average, Median, McQuitty, and centroid.
The Outlier Function
This function classifies a record as an outlier based on one or more input expressions using K-Means,
Hierarchical Clustering, or Multivariate Outlier Detection algorithms (the 3 outlier
detection methods for the Advanced Analytics tools in OBIEE 12c). Each method is suited to different
purposes, and the user can choose the algorithm according to their specific
needs. In statistics, an outlier is a data point that diverges from the rest of the
data set to a statistically significant extent. Outliers can be thought of as data
anomalies; the black sheep within the data. Outlier detection can be thought of as
clustering data along a logical metric, where a normal point is labeled FALSE (not an outlier) and
an abnormal point is labeled TRUE (an outlier). Here is a brief description of the 3 methods that were
mentioned above:
K-MEANs:
Given a set of observations (x1, x2, …, xn), k-means clustering
attempts to partition them into a user-specified number of clusters (k) so as to minimize the sum of the
squared distances of each individual point from the center of its cluster. This allows for an overview of
similarities along the given dimensions. For outlier detection, there will be two clusters in a
logical format, one of TRUE and one of FALSE, with TRUE denoting an outlier and FALSE denoting
data normality.
Hierarchical Clustering:
Generally, this form of clustering is an attempt to build a sort of pecking order in which the data
filters down into distinct ‘groups’ along the prompted dimensions. Hierarchical clustering can be
thought of as a “top-down” approach of structuring an overview for viewing contextual
differences/similarities amongst user-defined dimensions.
Multivariate Outlier Detection (default outlier detection for 12c):
One way to check for multivariate outliers is with Mahalanobis’ distance.[3] Mahalanobis’
distance can be thought of as a metric for estimating how far each case is from the center of all
the variables’ distributions (i.e. the centroid in multivariate space). Mahalanobis’ distance
accounts for the different scale and variance of each of the variables in a set in a probabilistic
way.
[3] (Mahalanobis, 1927; 1936).
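For two variables, the Mahalanobis distance of each point from the centroid can be computed directly, as in the Python sketch below. It is an illustration of the distance itself, assuming the plain sample mean and sample covariance; the function name `mahalanobis_2d` and the sample points are hypothetical, and any refinements applied by the outlier algorithm in OBIEE (for example, robust estimates or a cut-off rule) are not shown.

```python
import math


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the centroid of all points.

    Uses the sample covariance matrix (divisor n - 1) of the two variables.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular (e.g. collinear points)")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form d^T * inverse(covariance) * d for a 2x2 matrix.
        squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(squared))
    return distances


# Hypothetical (discount amount, revenue) pairs: four corners and their centroid.
points = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
distances = mahalanobis_2d(points)
```

Because each deviation is scaled by the variance and covariance of the variables, a point is judged by how unusual it is relative to the spread of the data rather than by its raw distance; larger values correspond to the 'distance' output described in the syntax that follows.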
Syntax for Outlier Analysis:
OUTLIER( (dimension_expr1 , ... dimension_exprN), (expr1, .. exprN),
output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr— represents a list of dimensions to be clustered (K)
• expr— represents a list of dimension attributes or measures (x1, x2, …, xn) to be used in
order to find outliers.
• output_column_name— is the output column. The valid values are:
o 'isOutlier': returns a logical value, TRUE or FALSE, indicating whether
each data point is an outlier.
o 'distance': returns the “distance from normality” (the higher this number, the
‘more’ of an outlier the data point is).
• options — is a string list of name=value pairs separated by ';'. The value can include
%1 ... %N, which can be specified using runtime_binded_options.
• runtime_binded_options — is an optional comma separated list of run-time binded
columns or literal expressions that supply values for placeholders in the
options list. It merely supplies parameters for
options that have yet to be specified. For example, in an outlier analysis, the user
might have an option output_column_name=%1. If they wanted to
use the distance for this particular analysis, their runtime_binded_options would then be
equal to ‘distance’. Order matters: %1 in options is replaced by the first runtime-binded value, %2
by the second, and %N by the Nth. Remember that runtime_binded_options is
optional. You can specify parameters to your options without using this tool, which
implies that runtime_binded_options is more of an organizational tool than a
functional one. Using it versus not using it does not impact performance, but the option is
nice to have for organizational purposes.
Outlier Function Example Analysis:
For the analysis, observe how the dimensions of offices and companies within the data set were
clustered along the measures or attributes of both revenue and discount amount. One hypothesis
for this analysis might be that offices under their respective companies are acting very similar in
regards to discount amount and revenue.
Figure 26: Columns used in example analysis.
New Columns for Methodology Comparison
For this example, the multivariate outlier algorithm (mvoutlier) will be used, rather than K-means
or hierarchical clustering to start (no particular reason for this other than mvoutlier being the
default algorithm). However, perhaps it could be wagered that the mvoutlier algorithm is the most
favorable and is the default algorithm for a reason. Observe the variance in algorithms below.
Function Syntax:
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=_________')
Outlier Observations
Thus far, the syntax has not allowed a specific number of outliers to be entered. When
using the multivariate algorithm, entering numClusters into the syntax in order to change the
result prints an error in the results tab. After experimenting with the sample sales data, the
conclusion is that there is no way to set a specific number of outliers to be detected.
The number of outliers is contingent upon each data set and how it interacts with the underlying
algorithm in R. Setting an “is not equal to” filter on the two outlying data points (the Eiffel and Spring
offices) to see whether there would still be outliers does not eliminate the
outliers. Rather, there are two new outliers (the next two most northeasterly points on
the graph). This is counterintuitive. If the function were finding only
truly, significantly variant data, then the result after this filter was applied should be all
green (FALSE) points on the scatter plot. On the other hand, sometimes a user might have a data
set with all very similar points but still want to find the point(s) that are most variant. In that
sense the outlier detection algorithm is dependable: it will return outliers in all
situations. It is important to keep these contingencies in mind when analyzing data.
When a scatter plot of these variables is made, the following
graphs and accompanying tables are returned:
Multivariate Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount
Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
Figure 27: Multivariate Outlier Detection output.
Hierarchical or H-clustering Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount
Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
Figure 28: Hierarchical Clustering Outlier Detection output.
K-means Outlier Detection Method
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount
Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=Kmeans')
Figure 29: K-Means Outlier Detection output.
Notice that the h-clustering and multivariate algorithms return the same outliers (the Eiffel and
Spring offices of Tescare Ltd.), but the K-means algorithm returns very different ones: the Blue
Bell and Teller offices of Stockpiles Inc. These differences between the algorithms raise the
question of how reliable each one is. For this reason, the variables of analysis were changed to
produce a visualization with more data points, and hence more outliers, to see whether the
variation was an anomaly specific to these variables. The syntax in use is:
OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount
Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount
Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=Mvoutlier')
OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount
Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=K-means')
Place each of the above in its own column (with the same variables in each) in order to view the
outliers on the same scatter plot. New variables (Product and Office) were used in this analysis
to provide a larger number of data points. As the graph shows after these small changes, some
outliers overlap between algorithms and others do not.
Legend Translation:
Blue Squares: Only h-clustering viewed these as outliers
Green Circles: All algorithms viewed these as outliers
Yellow Rhombi: No algorithm viewed these as outliers
Red plus: MV Outlier and K-means algorithms viewed these as outliers, not h-clustering.
In the graph editor, the corresponding order of methodological reference is:
• Hierarchical Clustering
• MV Outlier
• K-Means
Figure 30: Visualization comparing all three outlier methods
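The legend groups points by which algorithms flagged them. The grouping itself is simple set logic and can be sketched as follows. The point names and the flags assigned to each method are hypothetical placeholders, not the values behind Figure 30, and the function name is an assumption.

```python
def group_by_agreement(points, flags_by_method):
    """Group points by the (sorted) tuple of methods that flagged them.

    flags_by_method maps a method name to the set of points that method
    marked as outliers. A point flagged by no method lands under ().
    """
    groups = {}
    for point in points:
        methods = tuple(
            sorted(name for name, flagged in flags_by_method.items() if point in flagged)
        )
        groups.setdefault(methods, []).append(point)
    return groups


# Hypothetical results for four points.
points = ["P1", "P2", "P3", "P4"]
flags_by_method = {
    "h-clustering": {"P1"},
    "mvoutlier": {"P2", "P3"},
    "k-means": {"P2", "P3"},
}

groups = group_by_agreement(points, flags_by_method)
for methods, members in groups.items():
    label = ", ".join(methods) if methods else "no algorithm"
    print(f"{label}: {members}")
```

Each key in the result corresponds to one legend entry, for example points flagged only by h-clustering or points flagged by both mvoutlier and K-means.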
Analyzing Methodology Consistency
There are no points on which the h-clustering algorithm overlaps with the other two algorithms,
which is interesting. Other data sets may well contain points where h-clustering agrees with the
other methods, but in this particular data set the difference probably stems from how each
algorithm goes about 'defining' an outlier. Remember that h-clustering builds a hierarchy in which
the data is progressively split into nested groups, whereas the other methods measure distance
based on the input criteria or dimensionality. These differences could account for the variation
in the visualization here.
Also notice that the 'behind the scenes' R statistics appear more consistent in their outlier
detection here. In the first analysis, K-means was somewhat out of line with the other two
algorithms. Documentation on K-means clustering suggests that K-means becomes more reliable as
data sets grow larger. The first analysis had few data points, while the second has many. The
fact that there were only 20 points in the first analysis might explain the discrepancy between
the algorithms. Perhaps, as the size of the data set increases, the algorithms will agree more
consistently. Keep this in mind when choosing an algorithm.
The Regression Function
This function fits a linear model and returns the fitted values or the model. It can be used to
fit a straight line to two measures. In statistics, regression analysis is a process that
estimates the relationship between variables within a data set. The focus is to measure the
relationship between one or more independent variables and a dependent variable. More
specifically, regression allows for a deeper understanding of how the dependent value changes
when an independent variable is adjusted. It might help to think of regression in f(x), or "f of
x", notation, where x is the independent variable, or input value. The dependent variable (or
output) can be thought of as the value on the y-axis.
It might also help to think of these two variables in a linguistic way. The y-axis measure is the
dependent variable, meaning that its value depends on some other value. The x-axis measure(s)
is/are treated as independent of any other factor(s); they are taken as given.
This is important to understand before getting into the syntax.
In layman's terms, regression is a measure of how good a predictor one measure is of another
measure. Linear regression is also widely used for forecasting trends, in predictive analytics,
and in machine learning. Also, understand that regression does not imply causation; rather, it
indicates the extent to which two measures are correlated.
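To make the f(x) idea concrete, the sketch below fits a straight line y = ax + b to a handful of points using ordinary least squares, then computes the fitted value for each x. The data are invented for illustration, the function name is an assumption, and this is the textbook calculation rather than a description of how OBIEE or R computes the result internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: x is the independent measure, y the dependent one.
billed_quantity = [1, 2, 3, 4, 5]
revenue = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(billed_quantity, revenue)

# The fitted values are the points on the regression line for each x.
fitted = [slope * x + intercept for x in billed_quantity]

print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The closer the actual y values sit to the fitted values, the better x predicts y. A good fit on its own still says nothing about causation.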
Dummy Variables in Categorical Regression
It is not possible to directly regress a categorical variable against a numerical variable, nor is
it possible to regress a numerical variable against a categorical variable. There is a solution
for this, though: a dummy variable. Suppose an analysis needs a regression model involving a
categorical variable that contains the names of pets (Cats, Dogs, and Birds), in order to see how
good a predictor these pets are of (fill in the blank). It would not make sense to assign Cats,
Dogs, and Birds the values 1, 2, and 3, respectively, unless, for some reason, a Dog were twice as
much of a pet as a Cat and a Bird three times as much of a pet as a Cat. Since regression works
with numerical variables, interpretations are only valid when a value of 100 stored for some
variable literally means having 100 times the characteristic of a variable that stores the value
1. For the pet example, since it would be illogical to assign 1, 2, and 3, an alternative (with a
regression model in mind) is to assign binary values, such as 1 = Cat and 0 = not a Cat.
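The pet example can be sketched in a few lines. Each category other than a chosen baseline gets its own 0/1 column, so no artificial ordering such as 1, 2, 3 is imposed. The function name, column names, and choice of baseline are assumptions made for illustration.

```python
def dummy_encode(values, baseline):
    """Turn a categorical column into 0/1 dummy columns.

    One column is created per category except the baseline, which is
    represented by all dummy columns being 0.
    """
    levels = sorted(set(values) - {baseline})
    return [{f"is_{level}": int(value == level) for level in levels} for value in values]


pets = ["Cat", "Dog", "Bird", "Cat"]
rows = dummy_encode(pets, baseline="Bird")

for pet, row in zip(pets, rows):
    print(pet, row)
```

With Bird as the baseline, a Cat row has is_Cat = 1 and is_Dog = 0, while a Bird row has 0 in both columns. The resulting numeric columns can then be used as independent variables in a regression.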
Syntax for Regression Analysis
REGR(y_axis_measure_expr, (x_axis_expr), (category_expr1, ...,
category_exprN), output_column_name, options, [runtime_binded_options])
Where:
• y_axis_measure_expr represents the measure for which the regression model is to be
computed. This is your dependent variable.
• x_axis_expr represents the measure to be used to determine the regression model for the
y_axis_measure_expr. This is your independent variable.
• category_expr1, ..., category_exprN represents the dimension/dimension
attributes to be used to determine the category for which the regression model for the
y_axis_measure_expr is to be computed. One or more dimensions or dimension
attributes, up to five, may be provided as category columns.
• output_column_name is the output column.
o fitted - returns the points on the regression line (y=ax+b)
o intercept - the value of y where the line crosses x = 0 (b from y=ax+b)
o modelDescription - the Model in JSON format.
• options is a string list of name=value pairs separated by ';'. The value can include %1 ...
%N, which can be specified using runtime_binded_options.
• runtime_binded_options is an optional comma separated list of run-time binded
columns and options.
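The options convention described above, name=value pairs separated by ';' with %1 ... %N placeholders filled in at run time, can be illustrated with a small sketch. This is only a model of the convention and not OBIEE's own parser. The helper names are hypothetical, and the option names in the example string (algorithm, numClusters, maxIter) are borrowed from the clustering syntax used elsewhere in this document rather than being confirmed REGR options.

```python
import re


def bind_options(options, runtime_values):
    """Replace %1 ... %N with the corresponding run-time value (1-based)."""
    def substitute(match):
        index = int(match.group(1))
        return str(runtime_values[index - 1])

    return re.sub(r"%(\d+)", substitute, options)


def parse_options(options):
    """Split 'name=value;name=value' into a dictionary of strings."""
    pairs = (item.split("=", 1) for item in options.split(";") if item.strip())
    return {name.strip(): value.strip() for name, value in pairs}


template = "algorithm=k-means;numClusters=%1;maxIter=%2"
bound = bind_options(template, [5, 10])
parsed = parse_options(bound)

print(bound)
print(parsed)
```

Keeping the placeholders in the options string is what allows the same column formula to be driven by different values at run time.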
Regression Example Analysis
	
In this analysis, a comparison is made to reveal how good a predictor the independent
variable, billed quantity, is of the dependent variable, revenue. The question to be answered
here is: if the quantity of billed items changes, how does revenue change? Based on the
column names alone, it could be predicted that the data will cluster fairly closely around an
upward-sloping regression line. This would mean that billed quantity is a good predictor of
revenue, which is fairly intuitive. What can also be seen below, however, is that billed
quantity is not a perfect predictor of revenue; if it were, fewer data points would lie away
from the regression line. In a regression scatter plot like the one below, the more tightly the
'green dots' hug the 'blue dots', the higher the correlation between the two variables.
Function Syntax Used
REGR("Base Facts"."Revenue", ("Base Facts"."Billed Quantity"), ("Time"."Per
Name Month", "Time"."Per Name Year"), 'fitted', '')
Figure 31: Regression Analysis of Billed Quantity as a Predictor of Revenue
In the table below (Figure 32), the column headed "Regression" shows the regression function's
output.
Figure 32: Regression Output in Tabular View
It may be interesting to see which data points in this regression did not fit the trend. The
visualization below was created using this syntax: OUTLIER(("Time"."Per Name Year",
"Time"."Per Name Month"), ("Base Facts"."Billed Quantity","Base
Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier'). This displays the outlying
values for the same variables used in the regression above.
Figure 33: Visual of Data Points where Billed Quantity is not a Predictor of Revenue
Concentrate on the red plus signs rather than the yellow rhombi. The red plus signs are the
outliers for this regression analysis, whereas the yellow rhombi are merely the corresponding
points on the regression line for these four outlying data points. By sorting the
outlier portion of this data set, one could create a table that shows the years and months in
which billed quantity was not necessarily a great predictor of revenue.
Figure 34: Tabular View of Outliers Within a Regression Analysis
What is noticeable is that, for the 6th and 7th months of 3 consecutive years, billed quantity was
not a great predictor of revenue. With this sort of information, it is possible to drill down
into why this might be the case. Revealing these quantitative and visual 'hints' within the data
in an accessible way is what these advanced analytics tools do best. Statistics can tell a lot
about why things are the way they are and can, ultimately, provide insight that helps an
organization move forward in a sustainable way.
Appendix I: Creating Presentation Variables and Prompts
Presentation Variable and Prompting the User for Function Options
The function code shown earlier varies slightly from the original syntax: the options portion of
the function input contains @{numClusters};maxIter=@{numIter}. The @{} notation references a
presentation variable, which is set by a dashboard prompt that asks the user for the number of
clusters and the number of iterations for the algorithm to perform. In many cases it is a good
idea to prompt the user for the number of clusters and iterations because it allows for a more
interactive dashboard. It is also useful because this simple change shows how the picture of a
large sample changes as the data set is segmented into varying numbers of clusters.
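To see why prompting for the number of clusters is useful, the sketch below runs a very small one-dimensional K-means on made-up values with two different cluster counts. It is a simplified stand-in for illustration only and is not the implementation that OBIEE calls in R. The function name, the deterministic choice of starting centers, and the data are all assumptions. The num_clusters and max_iter parameters play the role of the two prompted values.

```python
def kmeans_1d(values, num_clusters, max_iter=10):
    """Tiny 1-D K-means: returns (labels, centers).

    Starting centers are picked at evenly spaced positions in the sorted
    data so the result is deterministic (an illustrative simplification).
    """
    ordered = sorted(values)
    step = len(ordered) / num_clusters
    centers = [ordered[int(i * step)] for i in range(num_clusters)]
    labels = []
    for _ in range(max_iter):
        # Assign every value to its nearest center.
        labels = [
            min(range(num_clusters), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Move each center to the mean of its members.
        new_centers = []
        for c in range(num_clusters):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments have stabilised
        centers = new_centers
    return labels, centers


# Hypothetical measure values that form three visible groups.
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]

labels_two, centers_two = kmeans_1d(values, num_clusters=2)
labels_three, centers_three = kmeans_1d(values, num_clusters=3)

print("2 clusters:", labels_two)
print("3 clusters:", labels_three)
```

With two clusters the middle and upper groups are merged, while three clusters separate them. Changing one prompted value is enough to change how the same data is segmented.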
A developer who wants to do the same would highlight (in the syntax) the portion that would
typically contain the placeholder (%1…%N) for the variable they want to prompt for, and then
perform the following tasks:
Figure A1: Highlight the %N.
Figure A2: Click “Variable”, then “Presentation”.
Figure A3: Input a variable expression.
It is important to be careful prior to clicking OK here. This Variable Expression must be
matched in a case sensitive fashion to the corresponding dashboard prompt. Click OK.
Figure A4: Click “New”, then “Dashboard Prompt”.
Figure A5: Click the green arrow, then “Variable Prompt”.
• Prompt for = Presentation Variable
• *Label (this is what is equal to the presentation variable that was set in the column
function) = numClusters
• Expand the options window
• Variable Data Type = Number
A Note on Defaults
The user can set a default value here. Note also that there appears to be an undocumented
default value of 5 clusters. For example, the syntax CLUSTER(("Products"."Product",
"Offices"."Office"), ("Base Facts"."Discount Amount","Base
Facts"."Revenue"),'clusterName', 'algorithm=k-means;') returns the following visualization:
Figure A6: Default Visualization of Discount Amount versus Revenue.
Figure A7: Complete the process again for the iteration variable.
Save these Prompts.
Now, on the dashboard that contains both the dashboard prompt and the analysis, the
presentation variable can be seen in action.
Document History
Created	By:	 Brendan	Doyle	
	 	 Mike	Perhats	
Edited	By:	 Phil	Goerdt	
Creation	Date:	8/8/16	
Last	Edit	Date:	8/8/16

More Related Content

What's hot

Moving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics CloudMoving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics Cloud
Edelweiss Kammermann
 
Sap s4 hana (2)
Sap s4 hana (2)Sap s4 hana (2)
Sap s4 hana (2)
babloo6
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
BizTalk360
 
Cash Management in SAP
Cash Management in SAPCash Management in SAP
Cash Management in SAP
KamalGaur11
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19c
Maria Colgan
 
New Features in OBIEE 12c
New Features in OBIEE 12c New Features in OBIEE 12c
New Features in OBIEE 12c
Michelle Kolbe
 
Calculation commands in essbase
Calculation commands in essbaseCalculation commands in essbase
Calculation commands in essbase
Shoheb Mohammad
 
Sql Server Basics
Sql Server BasicsSql Server Basics
Sql Server Basics
rainynovember12
 
Strategic Choices in SAP S/4 HANA Deployment
Strategic Choices in SAP S/4 HANA DeploymentStrategic Choices in SAP S/4 HANA Deployment
Strategic Choices in SAP S/4 HANA Deployment
Dirk Oppenkowski
 
Oracle Analytics Cloud
Oracle Analytics CloudOracle Analytics Cloud
Oracle Analytics Cloud
Joseph Alaimo Jr
 
Oracle Assets
Oracle AssetsOracle Assets
Oracle Assets
Mohamed159686
 
Essbase aso a quick reference guide part i
Essbase aso a quick reference guide part iEssbase aso a quick reference guide part i
Essbase aso a quick reference guide part iAmit Sharma
 
SAP BW Introduction.
SAP BW Introduction.SAP BW Introduction.
1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx
BRIJESH KUMAR
 
Oracle Apps Technical – Short notes on RICE Components.
Oracle Apps Technical – Short notes on RICE Components.Oracle Apps Technical – Short notes on RICE Components.
Oracle Apps Technical – Short notes on RICE Components.
Boopathy CS
 
Introduction to extracting data from sap s 4 hana with abap cds views
Introduction to extracting data from sap s 4 hana with abap cds viewsIntroduction to extracting data from sap s 4 hana with abap cds views
Introduction to extracting data from sap s 4 hana with abap cds views
Luc Vanrobays
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Cathrine Wilhelmsen
 
Budgeting using hyperion planning vs essbase
Budgeting using hyperion planning vs essbaseBudgeting using hyperion planning vs essbase
Budgeting using hyperion planning vs essbaseSyntelli Solutions
 
Customizing Oracle EBS OA Framework
Customizing Oracle EBS OA FrameworkCustomizing Oracle EBS OA Framework
Customizing Oracle EBS OA Framework
iWare Logic Technologies Pvt. Ltd.
 

What's hot (20)

Moving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics CloudMoving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics Cloud
 
Sap s4 hana (2)
Sap s4 hana (2)Sap s4 hana (2)
Sap s4 hana (2)
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Cash Management in SAP
Cash Management in SAPCash Management in SAP
Cash Management in SAP
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19c
 
New Features in OBIEE 12c
New Features in OBIEE 12c New Features in OBIEE 12c
New Features in OBIEE 12c
 
Calculation commands in essbase
Calculation commands in essbaseCalculation commands in essbase
Calculation commands in essbase
 
Sql Server Basics
Sql Server BasicsSql Server Basics
Sql Server Basics
 
Strategic Choices in SAP S/4 HANA Deployment
Strategic Choices in SAP S/4 HANA DeploymentStrategic Choices in SAP S/4 HANA Deployment
Strategic Choices in SAP S/4 HANA Deployment
 
Oracle Analytics Cloud
Oracle Analytics CloudOracle Analytics Cloud
Oracle Analytics Cloud
 
Oracle Assets
Oracle AssetsOracle Assets
Oracle Assets
 
Essbase aso a quick reference guide part i
Essbase aso a quick reference guide part iEssbase aso a quick reference guide part i
Essbase aso a quick reference guide part i
 
SAP BW Introduction.
SAP BW Introduction.SAP BW Introduction.
SAP BW Introduction.
 
1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx
 
Oracle Apps Technical – Short notes on RICE Components.
Oracle Apps Technical – Short notes on RICE Components.Oracle Apps Technical – Short notes on RICE Components.
Oracle Apps Technical – Short notes on RICE Components.
 
Introduction to extracting data from sap s 4 hana with abap cds views
Introduction to extracting data from sap s 4 hana with abap cds viewsIntroduction to extracting data from sap s 4 hana with abap cds views
Introduction to extracting data from sap s 4 hana with abap cds views
 
Landed Cost at a glance
Landed Cost at a glanceLanded Cost at a glance
Landed Cost at a glance
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
 
Budgeting using hyperion planning vs essbase
Budgeting using hyperion planning vs essbaseBudgeting using hyperion planning vs essbase
Budgeting using hyperion planning vs essbase
 
Customizing Oracle EBS OA Framework
Customizing Oracle EBS OA FrameworkCustomizing Oracle EBS OA Framework
Customizing Oracle EBS OA Framework
 

Similar to OBIEE 12c Advanced Analytic Functions

Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
KalyankumarVenkat1
 
Oracle SQL Advanced
Oracle SQL AdvancedOracle SQL Advanced
Oracle SQL Advanced
Dhananjay Goel
 
Funções DAX.pdf
Funções DAX.pdfFunções DAX.pdf
Funções DAX.pdf
Joao Vaz
 
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2tovetrivel
 
BAPI - Criação de Ordem de Manutenção
BAPI - Criação de Ordem de ManutençãoBAPI - Criação de Ordem de Manutenção
BAPI - Criação de Ordem de Manutenção
Roberto Fernandes Ferreira
 
04 quiz 1 answer key
04 quiz 1 answer key04 quiz 1 answer key
04 quiz 1 answer key
Anne Lee
 
Dax best practices.pdf
Dax best practices.pdfDax best practices.pdf
Dax best practices.pdf
deepneuron
 
Analytic & Windowing functions in oracle
Analytic & Windowing functions in oracleAnalytic & Windowing functions in oracle
Analytic & Windowing functions in oracle
Logan Palanisamy
 
Excel Database Function
Excel Database FunctionExcel Database Function
Excel Database Function
Anita Shah
 
Set Analyse OK.pdf
Set Analyse OK.pdfSet Analyse OK.pdf
Set Analyse OK.pdf
qlik2learn2024
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic Functions
Zohar Elkayam
 
Obiee interview questions and answers faq
Obiee interview questions and answers faqObiee interview questions and answers faq
Obiee interview questions and answers faqmaheshboggula
 
Sydney Oracle Meetup - indexes
Sydney Oracle Meetup - indexesSydney Oracle Meetup - indexes
Sydney Oracle Meetup - indexespaulguerin
 
Part3 Explain the Explain Plan
Part3 Explain the Explain PlanPart3 Explain the Explain Plan
Part3 Explain the Explain Plan
Maria Colgan
 
Boosting conversion rates on ecommerce using deep learning algorithms
Boosting conversion rates on ecommerce using deep learning algorithmsBoosting conversion rates on ecommerce using deep learning algorithms
Boosting conversion rates on ecommerce using deep learning algorithms
Armando Vieira
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
sit20ad004
 
Complete list of all sap abap keywords
Complete list of all sap abap keywordsComplete list of all sap abap keywords
Complete list of all sap abap keywordsPrakash Thirumoorthy
 
Microsoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docx
Microsoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docxMicrosoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docx
Microsoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docx
ARIV4
 

Similar to OBIEE 12c Advanced Analytic Functions (20)

Visual binning
Visual binningVisual binning
Visual binning
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
 
Oracle SQL Advanced
Oracle SQL AdvancedOracle SQL Advanced
Oracle SQL Advanced
 
Funções DAX.pdf
Funções DAX.pdfFunções DAX.pdf
Funções DAX.pdf
 
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
 
BAPI - Criação de Ordem de Manutenção
BAPI - Criação de Ordem de ManutençãoBAPI - Criação de Ordem de Manutenção
BAPI - Criação de Ordem de Manutenção
 
04 quiz 1 answer key
04 quiz 1 answer key04 quiz 1 answer key
04 quiz 1 answer key
 
Dax best practices.pdf
Dax best practices.pdfDax best practices.pdf
Dax best practices.pdf
 
Analytic & Windowing functions in oracle
Analytic & Windowing functions in oracleAnalytic & Windowing functions in oracle
Analytic & Windowing functions in oracle
 
MA3696 Lecture 9
MA3696 Lecture 9MA3696 Lecture 9
MA3696 Lecture 9
 
Excel Database Function
Excel Database FunctionExcel Database Function
Excel Database Function
 
Set Analyse OK.pdf
Set Analyse OK.pdfSet Analyse OK.pdf
Set Analyse OK.pdf
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic Functions
 
Obiee interview questions and answers faq
Obiee interview questions and answers faqObiee interview questions and answers faq
Obiee interview questions and answers faq
 
Sydney Oracle Meetup - indexes
Sydney Oracle Meetup - indexesSydney Oracle Meetup - indexes
Sydney Oracle Meetup - indexes
 
Part3 Explain the Explain Plan
Part3 Explain the Explain PlanPart3 Explain the Explain Plan
Part3 Explain the Explain Plan
 
Boosting conversion rates on ecommerce using deep learning algorithms
Boosting conversion rates on ecommerce using deep learning algorithmsBoosting conversion rates on ecommerce using deep learning algorithms
Boosting conversion rates on ecommerce using deep learning algorithms
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
Complete list of all sap abap keywords
Complete list of all sap abap keywordsComplete list of all sap abap keywords
Complete list of all sap abap keywords
 
Microsoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docx
Microsoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docxMicrosoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docx
Microsoft® Excel® 2016ExploringSeries Editor Mary Anne Poats.docx
 

OBIEE 12c Advanced Analytic Functions

  • 1. www.redpillanalytics.com Abstract Oracle Business Intelligence Enterprise Edition 12c has enhanced analytical capabilities due to an (optional) integration with the statistical software R. These new functions include the following: Trendline, Bin and Width Bucket, Forecast, Clustering, Outlier and Regression. This document will provide a comprehensive review of these newly available functions, and provide examples of them in action. For ease of understanding and reproducibility, the sample data set is Oracle’s Sample Sales Lite1 . 1 This data set is available with every install of OBIEE 12c. Alternatively, a similar set can be found within the Oracle BI Sample App Virtual Machine.
  • 2. www.redpillanalytics.com The Trendline Function Trendline is part of the Advanced Analytics Internal Logical SQL Functions, meaning it is in the group of functions that are done internally as opposed to being done in R. This function fits a linear or exponential model, and returns the fitted values or model. The numeric_expr represents the Y value for the trend and the series (time columns) represent the X value. A Trendline is a model, and its assertion that the data is the result of a model. The TRENDLINE function measures data across time and shows a line of a metric by ordered records. It can model data as linear and as exponential regression. Figure 1: The Trendline function is found under the Aggregate folder by clicking on the “Insert Function” button in the Formula section of the column editor.
  • 3. www.redpillanalytics.com Trendline Function Syntax TRENDLINE( <numeric_expr>,( [<series>] ) BY ( [<partitionBy] ), <model_type>, <result_type>, [number_of_degrees] ) Where: o numeric_expr—represents the data to trend ▪ This is the Y-Axis and is a measure column. o series—indicates the X-axis. This is a list of <valueExp> <orderByDirection>, where <valueExp> is a dimension column and <orderByDirection> is ASC (ascending) or DES (descending). ▪ The default is ASC. Note that this cannot be an arbitrary combination of numeric columns. ▪ It is possible to use more than one Trendline column in the same analysis, but the Trendline columns must have the same X-Axis. o partitionBy—A list of dimension attributes that are not on the X-Axis. o model_type— A model type may be one of the following types: ▪ LINEAR—a function with a constant rate of change and a straight line graph. ▪ EXPONENTIAL—a function whose value is raised to the power of the variable. o result_type— A results type may be one of the following types: ▪ VALUE - will return all the regression Y values given that X in the fit. ▪ MODEL - will return all the parameters in a JSON (JavaScript Object Notation, which is a lightweight data-interchange format) format string. Figure 2: Example formula to display result_type of MODEL.
  • 4. www.redpillanalytics.com Figure 3: Results of using ‘MODEL’ as the result type; it returns the parameters in a JSON (JavaScript Object Notation) format string. Example Syntax TRENDLINE(“Base Facts”.”Revenue”, (“Time”.”Calendar Date”), ‘LINEAR’,’VALUE’) Figure 4: Selected dimensions and fact columns for a sample trendline analysis. Figure 5: Note the Trendline (in green); depicting these types of subtle changes is what this function is best at.
  • 5. www.redpillanalytics.com Figure 6: If the graph is set to vary color by ‘Per Name Year’, the results are displayed for each year. Note the differences between each year that otherwise would not be apparent. Figure 7: Segmentation of the trends could continue to smaller subsets. Above, the 2009 has been split by semester.
  • 6. www.redpillanalytics.com The BIN and WIDTH_BUCKET Functions Both BIN and WIDTH_BUCKET are included in the Advanced Analytics Internal Logical SQL Functions, meaning they are in the group of functions that are done internally as opposed to being done in R. With that being said, the syntax for the two functions is different and will be covered later on. About BIN In the BIN function, the user can select any numeric attribute (INT, FLOAT, DOUBLE, NUMERIC) from a dimension or fact table/measure containing the data values and place them into a discrete number of bins. The reason to bin a measure would be to separate results of the measure into group (see BIN syntax). An example of this would be sales from a store and binning the revenue from anything less than $200, between $200 and $500, and so on. This sales that had that amount of revenue will be binned into the groups that fit that specific criteria. The BIN function classifies a given number expression into a specific number of equal width buckets. The function can return either the bin number or one of the two end points of the bin interval. The output of the BIN function is used as a GROUP BY expression for other measures included in the query. The BIN function is treated like a new dimension attribute for purposes such as aggregation, filtering, and drilling. All of these operations are supported on BIN expressions. BIN Syntax BIN(numeric_expr [BY grain_expr1, …, grain_exprN] [WHERE condition] INTO number_of_bins BINS [BETWEEN min_value AND max_value] [RETURNING { NUMBER | RANGE_LOW | RANGE_HIGH }]) Where: o numeric_expr—indicates the measure or numeric attribute to bin o BY grain_expr1, …, grain_exprN—indicates a list of expressions that define the grain at which the numeric_expr is calculated before the numeric values are assigned to bins. 
▪ This clause is required for measure expressions and is optional for attribute expressions ▪ The BY clause of the BIN function defines the grain at which the binned expression is evaluated prior to binning. If the binned expression is a measure, then the measure is grouped at the grain specified in the BY clause before being binned. ▪ The BY clause of the BIN function is mandatory if the binned expression is a measure.
  • 7. www.redpillanalytics.com Otherwise, for non-measure expressions, the BY clause is optional. o WHERE condition—indicates a filter condition to apply to the numeric_expr before the numeric values are assigned to bins o INTO number_of_bins—indicates the number of bins to return. The default is 10. o BETWEEN min_value AND max_value—indicates the minimum and maximum values used for the end points of the outermost bins o RETURNING—indicates a filter condition to apply to the numeric_expr before the numeric values are assigned to bins. Note the following options: ▪ RETURNING NUMBER—indicates the return value should be the bin number (for example: 1,2,3,4). This is the default condition ▪ RETURNING RANGE_LOW—indicates the lower value of the bin interval ▪ RETURNING RANGE_HIGH—indicates the higher value of the bin interval Figure 8: The Bin Function is found under the Aggregate folder in the column formula editor.
  • 8. About Width Buckets
The WIDTH_BUCKET function is known as a “secret function”, meaning it is not available in the function menu, but the user can type the formula to use it. The syntax of WIDTH_BUCKET is also comma-based, which is not consistent with most Advanced Analytics functions in OBIEE. Similar to BIN, WIDTH_BUCKET classifies a given numeric expression into a specified number of equal-width buckets. It operates on top of a base query result set as a display function. The function can return either the bucket number or one of the two end points of the bucket interval. Unlike the BIN function, the WIDTH_BUCKET function is not treated as a new dimensional attribute for the purposes of aggregation. It is applied on top of the query result, similar to other display functions such as RANK, TOPN, BOTTOMN, NTILE, PERCENTILE, MAVG, and MEDIAN. Use the WIDTH_BUCKET function when you want to compute a discrete set of buckets on top of an already aggregated query result set. The syntax for WIDTH_BUCKET is much simpler than that of the BIN function.
WIDTH_BUCKET Syntax
WIDTH_BUCKET(numeric_expr, {NUMBER | RANGE_LOW | RANGE_HIGH }, number_of_bins, [min_value, max_value] [BY expr1, …, exprN])
Where:
o numeric_expr: indicates the measure or numeric attribute to bin
o NUMBER: indicates that the return value should be the bin number (for example: 1,2,3,4).
o RANGE_LOW: indicates the lower value of the bin interval
o RANGE_HIGH: indicates the higher value of the bin interval
o number_of_bins: indicates the number of bins to return. The default is 10.
o min_value, max_value: indicates the minimum and maximum values used for the end points of the outermost bins. If min_value and max_value are omitted, then the function determines the end points automatically.
o BY expr1, …, exprN: indicates an optional list of expressions that define the groups in the query result set over which the WIDTH_BUCKET calculation is applied.
▪ The bucket intervals within different groups are calculated independently.
▪ The BY clause is always optional in the WIDTH_BUCKET function.
  • 9. If the BY clause is omitted from the WIDTH_BUCKET function, then the function operates over the entire result set.
BIN and WIDTH BUCKET: Defining Grouping
The goal of both functions is to define the bin/bucket that each data entry belongs to. This is determined by:
o The column on which the binning is done (that is, the binned expression).
▪ Remember, this is a numeric expression (and usually a measure).
o The attributes by which the data is arranged.
▪ Remember, the BY clause does not have the same meaning in both functions!
o The number of bins/buckets and the type of data returned.
▪ Remember, it is one of three options: the bin or bucket number, its minimum point, or its maximum point.
o The WHERE condition option found in the BIN function.
BIN and WIDTH_BUCKET Function Example
The dimensions and measures being used for this example are:
• LOB
• Per Name Month
• Revenue
• BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."LOB", "Time"."Per Name Month" into 4 bins)
• WIDTH_BUCKET Formula: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4)
o (Define the number of bins for each to be the same or there will be an error)
  • 10. Figure 9: Above are the results of the binning and buckets of revenue. The table shows that it is binning the monthly revenue of the LOB in columns “BIN” and “WIDTH_BUCKET” in bins of 1-4. It is sorting or binning the revenue into specific numbered groups. Figure 10: A linear graph where Bin #1 contains the month and year when the revenue was less than $15,000.
  • 11. Figure 11: A linear graph where Bin #2 contains the month and year when the revenue was between $15,000 and $30,000. Figure 12: A linear graph where Bin #3 contains the month and year when the revenue was between $30,000 and $45,000.
  • 12. Figure 13: A linear graph where Bin #4 contains the month and year when the revenue was greater than $45,000.
Be sure not to use the BY clause in BOTH functions: the clause means something different in each, so the results will not match (Figure 14).
• BIN: BIN("Base Facts"."Revenue" BY "Time"."Per Name Month" into 4 bins)
The meaning of BY "Month" in BIN is: take the sum("Revenue" by "Month") and arrange the monthly sums in 4 bins. Rows of the same month therefore have the same BIN result.
• WIDTH_BUCKET: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4 by "Time"."Per Name Month")
The meaning of BY "Month" in WIDTH_BUCKET is: take the individual rows of data in each month and arrange them in 4 buckets.
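The two meanings of BY can be contrasted with a small Python sketch. It only mimics the semantics described above on made-up rows; the helper names (`bucket_number`, `bin_style`, `width_bucket_style`) and the sample revenues are hypothetical, and the real functions run inside the BI Server rather than in Python.

```python
from collections import defaultdict

def bucket_number(value, low, high, n):
    """1-based equal-width bucket of value within [low, high]."""
    if high == low:
        return 1
    index = int((value - low) / (high - low) * n)
    return max(1, min(n, index + 1))

def bin_style(rows, n):
    """BIN ... BY month: sum revenue per month first, then bin the monthly totals."""
    totals = defaultdict(float)
    for month, revenue in rows:
        totals[month] += revenue
    low, high = min(totals.values()), max(totals.values())
    return {month: bucket_number(total, low, high, n) for month, total in totals.items()}

def width_bucket_style(rows, n):
    """WIDTH_BUCKET ... BY month: bucket each row against the other rows of its own month."""
    groups = defaultdict(list)
    for month, revenue in rows:
        groups[month].append(revenue)
    result = []
    for month, revenue in rows:
        low, high = min(groups[month]), max(groups[month])
        result.append((month, revenue, bucket_number(revenue, low, high, n)))
    return result

# Hypothetical detail rows: (month, revenue)
rows = [("Jan", 10.0), ("Jan", 30.0), ("Feb", 50.0),
        ("Feb", 70.0), ("Mar", 100.0), ("Mar", 140.0)]
```

Here `bin_style(rows, 4)` assigns one bin per month (Jan 1, Feb 2, Mar 4), while `width_bucket_style(rows, 4)` puts the smaller row of every month in bucket 1 and the larger in bucket 4, which is why the two columns disagree when both use BY.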
  • 13. Figure 14: The Bin and Width Bucket do not match due to both functions using the BY clause. Using the WHERE Option in the BIN Function Figure 15: BIN Function Criteria edited to include the WHERE option. BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."Product Type","Time"."Per Name Month" where "Time"."Per Name Year"='2010' into 4 bins)
  • 14. The Forecast Function
The FORECAST function creates a time-series model of the specified measure over the series using either Exponential Smoothing (ETS) or ARIMA (Autoregressive Integrated Moving Average). The function outputs a forecast for the number of periods specified by numPeriods. Forecasting is a useful tool for predictive analytics because it exposes potential trends for different dimensions and measures.
Forecast Syntax
Figure 16: The Forecast function can be found under the “Time Series Calculations” folder within the column formula editor.
  • 15. FORECAST(numeric_expr, ([series]), output_column_name, options, [runtime_binded_options])
Where:
o numeric_expr: indicates the measure to forecast.
o series: indicates the time grain at which the forecast model is built. This is a list of one or more time dimension columns.
▪ If you omit series, then the time grain is determined from the query.
▪ The series must fit the date columns in the Analysis.
o output_column_name: indicates the output column. Valid values are ‘forecast’, ‘low’, ‘high’, and ‘predictionInterval’.
▪ forecast: This column is the forecasted output.
▪ low: This column is the lower bound of the forecast.
▪ high: This column is the upper bound of the forecast. The low and high values are the limits of the prediction at the given confidence level.
▪ predictionInterval: This column is the confidence for the prediction. The predictionInterval ranges from 0 to 100, where higher values specify a higher confidence.
o options: indicates a string list of name/value pairs separated by a semicolon.
▪ The value can include %1…%N, which can be specified in runtime_binded_options.
▪ See the list of available options below.
o runtime_binded_options: indicates a comma separated list of runtime-binded columns and options
  • 16. FORECAST also has many available options. Below is a list of the options (value type in parentheses):
o numPeriods: The number of periods to forecast (integer)
o predictionInterval: The confidence for the prediction (0 to 100, where higher values specify higher confidence)
o modelType: The model to use for forecasting (ARIMA or ETS)
▪ ARIMA: Autoregressive Integrated Moving Average, fitted to time series data either to better understand the data or to predict future points in the series
▪ ETS: Error, Trend, Seasonal; an exponential smoothing state space model applied to the measure being forecast (the ‘y’)
o useBoxCox: If TRUE, then use the Box-Cox transformation, a method used to normalize a data set so that statistical tests can be performed to evaluate it properly. Many real-world raw data sets do not conform to the normality assumptions used for statistics, so transformation functions can sometimes be used to normalize the data. (TRUE, FALSE)
o lambdaValue: The Box-Cox transformation parameter. Ignored if NULL or when useBoxCox is FALSE. Otherwise the data is transformed before the model is estimated.
o trendDamp: A parameter for the ETS model. If TRUE, then use a damped trend. If NULL, then try both damped and non-damped trends and choose the one that is optimal.
o errorType: A parameter for the ETS model. (additive (“A”), multiplicative (“M”), automatically selected (“Z”))
o trendType: A parameter for the ETS model. (none (“N”), additive (“A”), multiplicative (“M”), automatically selected (“Z”))
o seasonType: A parameter for the ETS model. (none (“N”), additive (“A”), multiplicative (“M”), automatically selected (“Z”))
o modelParamIC: The information criterion (IC) to be used in the model selection. (“ic_auto”, “ic_aicc”, “ic_bic”; “ic_auto” is the default)
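To make the useBoxCox and lambdaValue options more concrete, the following Python sketch applies the standard Box-Cox formula: the natural logarithm when lambda is 0, and (y^lambda - 1) / lambda otherwise. It is a textbook illustration only; the function names are hypothetical, and how OBIEE and R apply the transformation internally is not covered by this document.

```python
import math

def box_cox(values, lambda_value):
    """Box-Cox transform: log(y) when lambda is 0, otherwise (y**lambda - 1) / lambda."""
    if any(y <= 0 for y in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lambda_value == 0:
        return [math.log(y) for y in values]
    return [(y ** lambda_value - 1) / lambda_value for y in values]

def inverse_box_cox(values, lambda_value):
    """Undo box_cox, for example to bring forecasts back to the original scale."""
    if lambda_value == 0:
        return [math.exp(z) for z in values]
    return [(lambda_value * z + 1) ** (1 / lambda_value) for z in values]
```

A lambda of 1 leaves the shape of the data unchanged (it only shifts it by 1), while values closer to 0 compress large observations more strongly, which is why the transformation can help with skewed measures such as revenue.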
  • 17. Figure 17: “Per Name Year” has been filtered to be “equal to/is in” ‘2008’ to allow forecasting for ‘2009’.
Forecast Example
The formula used in the FORECAST column is as follows:
FORECAST("Base Facts"."Revenue", ("Time"."Per Name Year", "Time"."Per Name Month"), 'forecast', 'modelType=arima;numPeriods=%1;predictionInterval=70;', 12)
Figure 18: Forecast for 2009 based on 2008 data.
  • 18. The Clustering Function
This function collects a set of records into groups based on one or more input expressions using K-Means or Hierarchical Clustering, the two modes of clustering analysis available in the advanced analytics clustering model provided in 12c.
K-Means: Given a set of observations (x1, x2, …, xn), k-means clustering attempts to partition them into a specified number of clusters (k) so as to minimize the sum of the distances from each individual point to the center of its cluster. This allows for an overview of similarities along the given dimensions.
Hierarchical Clustering: Generally, this form of clustering is an attempt to build a sort of pecking order in which the data filters down into distinct groups along the prompted dimensions. Hierarchical clustering can be thought of as a sort of “top-down” approach of structuring an overview for viewing contextual differences/similarities amongst user-defined dimensions.
Syntax for Clustering Analysis:
CLUSTER((dimension_expr), (expr), output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr: represents a list of dimensions to be clustered (K).
• expr: represents a list of dimension attributes or measures (x1, x2, …, xn) to be used to cluster the dimension_expr (K)
• output_column_name: selects the output returned in the column. This portion of the syntax affects only what is displayed and does not perform any analytics. The valid values include:
o clusterID: This column is the cluster number or ID.
o clusterName: This column is synonymous with clusterID.
o clusterDescription: The description can be added by the end user after the cluster dataset is persisted into DSS.
o clusterSize: This column is the number of elements in the current cluster.
o distanceFromCenter: This column indicates how far the current cluster element is from the center of the current cluster.
o centers – This column indicates the center of the current cluster in a format • options — is a string list of name=value pairs separated by ';'. The value can include %1 ... %N, which can be specified using runtime_binded_options. • runtime_binded_options — indicates a comma separated list of binded columns or literal expressions that supply a specification to an unrepresented value in the options list.
  • 19. This portion of the syntax is optional. It merely supplies values for option parameters that have been left as placeholders. For example, in the clustering analysis, you might have options of numClusters=%1 and maxIter=%2. Suppose that you want 5 clusters and a maximum of 10 iterations for this particular analysis. Your runtime_binded_options would then be 5, 10, which corresponds to 5 clusters and 10 iterations. Order matters: %1 in options takes the first runtime value, %2 the second, and %N the Nth. Here is the entire syntax for this example (the %1 and %2 placeholders and the trailing 5, 10 are the areas of focus):
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"), 'clusterName', 'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)
Remember that runtime_binded_options is not required. Parameters can be specified in the function without it, so the following code is equivalent to the example given above:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"), 'clusterName', 'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
Clustering Example Analysis
An example of a clustering analysis could check how the dimensions of offices and companies within the data set cluster along the measures of revenue and discount amount. One hypothesis for this analysis might be that offices under their respective companies behave very similarly with regard to discount amount and revenue.
Formula Syntax²
CLUSTER(("Offices"."Office", "Offices"."Company"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;numClusters=@{numClusters};maxIter=@{numIter};useRandomSeed=FALSE;enablePartitioning=TRUE')
Methodology: The example uses K-means clustering rather than hierarchical clustering. The numClusters and maxIter options are set through presentation variables so that the user can supply these values; see the Syntax for Clustering Analysis section above for details on the options.
² The highlighted text (@{numClusters} and @{numIter}) refers to presentation variables. See Appendix I for more information.
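The positional relationship between the options string and runtime_binded_options can be illustrated with a short Python sketch. It is not how the BI Server parses the function; `bind_options` and `parse_options` are hypothetical helpers that only demonstrate that %1 takes the first runtime value, %2 the second, and so on.

```python
import re

def bind_options(options, *runtime_values):
    """Replace %1..%N in an options string with the Nth runtime value."""
    def substitute(match):
        position = int(match.group(1))
        if position < 1 or position > len(runtime_values):
            raise IndexError(f"no runtime value supplied for %{position}")
        return str(runtime_values[position - 1])
    return re.sub(r"%(\d+)", substitute, options)

def parse_options(options):
    """Split 'name=value;name=value' into a dict, ignoring empty segments."""
    pairs = (segment.split("=", 1) for segment in options.split(";") if segment.strip())
    return {name.strip(): value.strip() for name, value in pairs}

bound = bind_options("algorithm=k-means;numClusters=%1;maxIter=%2", 5, 10)
```

After binding, `bound` is the same string one would have typed with the literal values in place, which mirrors the point made above that the two forms of the CLUSTER call are equivalent.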
  • 20. With a user input of 3 clusters and 20 iterations, the output is:
Figure 19: Cluster Visualization for 3 Clusters, with 20 Iterations
Here the clusters are depicted via color and shape, Discount Amount and Revenue are on the axes, and each point represents one of the 20 offices in the data set. We can see how this graph changes after doubling the number of clusters.
Figure 20: Cluster Visualization of 6 Clusters with 20 Iterations.
  • 21. Notice how some clusters are larger than others. This is because the clustering method groups the objects of the data set so that objects in the same cluster are very similar to each other while the clusters themselves are very different from one another. As a result, some clusters contain many highly similar points along the measures of discount amount and revenue, while others contain only a single data point that differs from everything else, such as cluster number 1 in this analysis. There is no ‘perfect number’ of clusters. The number depends on the data set in use, the amount of data, and user preference; 3 and 6 were used here purely as examples.
Viewing the data in a tabular format gives the exact amounts within the selected data set, and so a more precise view of the data within the clusters. It would be poor practice to display all of this information on the scatterplot. The visualization is a way of viewing the data that makes patterns apparent that might otherwise go unnoticed, and the tabular version complements it so that the user can see the precise results of the underlying algorithm. Here is a snippet of the tabular information, sorted in ascending order by cluster number:
Figure 21: Tabular View of Cluster Analysis.
The last important thing to note is that the clustering function in 12c offers a few variant methods within both the K-means and Hierarchical algorithms. For the visual comparison K-means will be used because K-means is the default method for clustering in OBIEE. New variables (as compared to the previous analysis) will also be used to get more data points and to see how the different methods differ.
  • 22. Figure 22: New Columns for Methodology Comparison.
Notice below the added option, clusterNamePrefix, in the options portion of the syntax for all 3 of the following comparisons. Also notice that useRandomSeed is set to FALSE because we are comparing methods. In the runtime_binded_options portion of the function, %1 and %2 are both set to the method name: %1 selects the method and %2 displays the method name in the legend of the visualization. Also note that 5 clusters are used in each analysis, which allows for a more telling comparison along our input dimensions.
K-MEANS CLUSTERING METHODS:
1) Hartigan-Wong Method
CLUSTER(("Offices"."Office", "Products"."Product"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{Hartigan-Wong}', '@{P_Method}{Hartigan-Wong}')
Figure 23: Output from Hartigan-Wong Method.
  • 23. 2) Lloyd Method
CLUSTER(("Offices"."Office", "Products"."Product"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{Lloyd}', '@{P_Method}{Lloyd}')
Figure 24: Output from Lloyd Method.
  • 24. 3) MacQueen Method
CLUSTER(("Offices"."Office", "Products"."Product"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{MacQueen}', '@{P_Method}{MacQueen}')
Figure 25: Output from MacQueen Method.
Looking closely at these visualizations, it is apparent that the cluster boundaries differ slightly among the 3 methods. Also of note: hierarchical clustering has its own set of methods. Besides the default, these are ward.D, ward.D2, single, average, median, McQuitty, and centroid.
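As a rough picture of what a k-means method does, here is a minimal Python sketch of the textbook Lloyd iteration: assign every point to its nearest center, then move each center to the mean of its members, and repeat until the assignments stop changing or maxIter is reached. This is not the R code that OBIEE calls; the function names are hypothetical, and starting from the first k points is an assumption made only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, k, max_iter=10):
    """Textbook Lloyd iteration; returns (cluster index per point, cluster centers)."""
    # Assumption: start from the first k points so the result is repeatable.
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centers

# Hypothetical (discount amount, revenue) pairs forming two obvious groups
sample = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
labels, centers = lloyd_kmeans(sample, k=2, max_iter=10)
```

The methods named above differ mainly in when centers are updated and how points are moved between clusters, which is consistent with the slightly different boundaries seen in the three figures.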
  • 25. The Outlier Function
This function classifies a record as an outlier based on one or more input expressions using K-Means, Hierarchical Clustering, or Multivariate Outlier Detection algorithms (the 3 outlier detection methods in the Advanced Analytics tools in OBIEE 12c). Each method serves a different purpose, and the user can choose the algorithm according to their specific needs. In statistics, an outlier is a data point that diverges from the rest of the data set to a statistically significant extent. Outliers can be thought of as data anomalies: the black sheep within the data. Outlier detection can be thought of as clustering data along a logical metric, where FALSE denotes normality (not an outlier) and TRUE denotes abnormality (an outlier). Here is a brief description of the 3 methods mentioned above:
K-Means: Given a set of observations (x1, x2, …, xn), k-means clustering attempts to partition them into a specified number of clusters (k) so as to minimize the sum of the distances from each individual point to the center of its cluster. This allows for an overview of similarities along the given dimensions. For outlier detection, there are two clusters in a logical format, one of TRUE and one of FALSE, with TRUE denoting an outlier and FALSE denoting a normal data point.
Hierarchical Clustering: Generally, this form of clustering is an attempt to build a sort of pecking order in which the data filters down into distinct ‘groups’ along the prompted dimensions. Hierarchical clustering can be thought of as a “top-down” approach of structuring an overview for viewing contextual differences/similarities amongst user-defined dimensions.
Multivariate Outlier Detection (default outlier detection for 12c): One way to check for multivariate outliers is with Mahalanobis’ distance.3 Mahalanobis’ distance can be thought of as a metric for estimating how far each case is from the center of all the variables’ distributions (i.e. the centroid in multivariate space). Mahalanobis’ distance accounts for the different scale and variance of each of the variables in a set in a probabilistic way. 3 (Mahalanobis, 1927; 1936 ).
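The Mahalanobis distance mentioned above can be computed by hand for two measures. The sketch below is a plain Python illustration for exactly two variables using the sample covariance; the function name and the sample points are hypothetical, and the cutoff that the mvoutlier algorithm applies to decide TRUE or FALSE is not stated in this document, so none is assumed here.

```python
import math

def mahalanobis_distances(points):
    """Mahalanobis distance of each (x, y) point from the centroid of all points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances and covariance (divide by n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances

# Hypothetical (discount amount, revenue) pairs; the last point breaks the trend.
sample_points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8), (3, 9)]
sample_distances = mahalanobis_distances(sample_points)
```

Because the covariance rescales each axis and accounts for the correlation between the two measures, the last point receives the largest distance even though its x value sits exactly at the mean.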
  • 26. Syntax for Outlier Analysis:
OUTLIER((dimension_expr1, ..., dimension_exprN), (expr1, ..., exprN), output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr: represents a list of dimensions to be clustered (K)
• expr: represents a list of dimension attributes or measures (x1, x2, …, xn) to be used in order to find outliers.
• output_column_name: is the output column. The valid values are:
o 'isOutlier': returns a logical value, TRUE or FALSE, indicating whether each data point is an outlier.
o 'distance': returns the “distance from normality” (the higher this number, the ‘more’ of an outlier the data point is).
• options: is a string list of name=value pairs separated by ';'. The value can include %1 ... %N, which can be specified using runtime_binded_options.
• runtime_binded_options: is an optional comma separated list of run-time binded columns or literal expressions that supply values for placeholders in the options list.
This portion of the syntax is optional. It merely supplies values for option parameters that have been left as placeholders. For example, in an outlier analysis, the user might have an option output_column_name=%1. If they wanted to use the distance for this particular analysis, their runtime_binded_options would then be ‘distance’. Order matters: %1 in options takes the first runtime value, %2 the second, and %N the Nth. Here would be the entire syntax for this example (the areas of focus are highlighted).
Remember that runtime_binded_options is optional. You can specify option values directly without it, which implies that runtime_binded_options is more of an organizational tool than a functional one. Using it versus not using it does not impact performance.
Outlier Function Example Analysis: For the analysis, observe how the dimensions of offices and companies within the data set were clustered along the measures or attributes of both revenue and discount amount. One hypothesis for this analysis might be that offices under their respective companies are acting very similar in regards to discount amount and revenue. Figure 26: Columns used in example analysis.
  • 27. New Columns for Methodology Comparison
For this example, the multivariate outlier algorithm (mvoutlier) will be used first, rather than K-means or hierarchical clustering, simply because mvoutlier is the default algorithm. It could be argued that mvoutlier is the default because it is the most suitable algorithm in general. Observe the variance in algorithms below.
Function Syntax:
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=_________')
Outlier Observations
The syntax does not allow a specific number of outliers to be requested. When using the multivariate algorithm, adding numClusters to the options in order to change the result produces an error in the results tab. After experimenting with the sample sales data, the conclusion is that there is no way to set a specific number of outliers to be detected. The number of outliers depends on each data set and how it interacts with the underlying algorithm in R. Setting an “is not equal to” filter on the two outlying data points (the Eiffel and Spring offices) does not remove outliers from the result. Rather, two new outliers appear (the next two most northeasterly points on the graph). This is counterintuitive: if the function were finding only truly, significantly variant data, then the result after this filter was applied should be all green (FALSE) points on the scatter plot. On the other hand, sometimes a user might have a data set with very similar points but still want to find the point(s) that are most variant. In that sense the behavior is dependable: the outlier detection algorithm reports the most divergent points in every situation, whether or not they are extreme in absolute terms. It is important to keep this in mind when analyzing data.
When a scatter plot of these variables is made, the following graphs and accompanying tables are returned:
  • 28. Multivariate Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
Figure 27: Multivariate Outlier Detection output.
Hierarchical or H-clustering Outlier Detection Method:
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
Figure 28: Hierarchical Clustering Outlier Detection output.
  • 29. K-means Outlier Detection Method
OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=k-means')
Figure 29: K-Means Outlier Detection output.
Notice that the h-clustering and multivariate algorithms report the same outliers (the Eiffel and Spring offices of Tescare Ltd.), but the K-means algorithm reports very different ones (the Blue Bell and Teller offices of Stockpiles Inc). These variations raise the question of reliability among the algorithms. For this reason, the variables of analysis were altered to produce a visualization with more data points, and hence more outliers, to see whether the disagreement was an anomaly specific to these variables. The syntax in use is:
OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=k-means')
Place each of the above in its own column (with the same variables in each) in order to view these outliers on the same scatter plot. New variables (Product and Office) were used in this analysis to obtain a larger number of data points. After these small changes, some outliers overlap and others do not.
  • 30. Legend Translation:
Blue Squares: Only h-clustering viewed these as outliers
Green Circles: All algorithms viewed these as outliers
Yellow Rhombi: No algorithm viewed these as outliers
Red Plus: MV Outlier and K-means algorithms viewed these as outliers, not h-clustering.
In the graph editor, the corresponding order of methodological reference is:
• Hierarchical Clustering
• MV Outlier
• K-Means
Figure 30: Visualization comparing all three outlier methods
Analyzing Methodology Consistency
In this visualization, no point flagged by the h-clustering algorithm is also flagged by the other two algorithms. This is interesting. Other data sets may well have points where the other methods overlap with h-clustering; in this particular data set, the lack of overlap probably comes from how each algorithm goes about ‘defining’ an outlier. Remember, h-clustering builds a hierarchical pecking order through which the data filters down, whereas the other methods are distance based, depending on the input criteria or dimensionality. These differences could account for the variance in the visualization here. Also notice that the ‘behind the scenes’ R statistics seem to be more consistent in their outlier detection in this second analysis. In the first analysis, K-means was a little bit off compared to the other two algorithms. Documentation on K-means clustering suggests that K-means becomes more reliable as data sets grow larger.
  • 31. In the first analysis there were few data points; in the second there are many. The fact that there were only 20 points in the first analysis might be the reason for the discrepancy among the algorithms. Perhaps as the data set size increases, the algorithms will agree more consistently. Keep this in mind when choosing algorithms.
  • 32. The Regression Function
This function fits a linear model and returns the fitted values or the model. It can be used to fit a straight line on two measures. In statistics, regression analysis is a process that estimates the relationship between variables within a data set. The focus is to measure the relationship between one or more independent variables and a dependent variable. More specifically, regression allows for a deeper understanding of how the dependent value changes when an independent variable is adjusted. It might help to think of regression in f(x) notation, where x is the independent variable or input value and the dependent variable (the output) is the value on the y axis. It might also help to think of these two variables in a linguistic way. The y-axis measure is the dependent variable: it literally depends on some other value changing before it does. The x-axis measure(s) are independent of any other factor(s) in the model; they are treated as given. This is important to understand before getting into the syntax. In layman’s terms, regression measures how good a predictor one measure is of another. Linear regression is also widely used for forecasting trends, in predictive analytics, and in machine learning. Also, understand that regression does not imply causation; rather, it indicates the extent of correlation between two measures.
Dummy Variables in Categorical Regression
It is not possible to directly regress a categorical variable against a numerical variable, nor a numerical variable against a categorical variable. There is a solution for this, though: the dummy variable.
Suppose an analysis needs a regression model involving a categorical variable that contains the names of pets (Cats, Dogs, and Birds), to see how good a predictor these pets are of some measure of interest. It would not make sense to assign Cats, Dogs, and Birds the values 1, 2, and 3, respectively, unless, for some reason, a Dog was twice as much of a pet as a Cat and a Bird 3 times as much of a pet as a Cat. Since regression operates on numerical variables, interpretations are only valid when storing 100 for some variable literally means having 100 times the characteristic of a variable that stores 1. For the pet example, since it would be illogical to assign 1, 2, and 3, an alternative (with a regression model in mind) is to assign binary values, such as 1 = Cat and 0 = not a Cat.
Syntax for Regression Analysis
REGR(y_axis_measure_expr, (x_axis_expr), (category_expr1, ..., category_exprN), output_column_name, options, [runtime_binded_options])
Where:
• y_axis_measure_expr represents the measure for which the regression model is to be computed. This is your dependent variable.
  • 33.
• x_axis_expr represents the measure used to determine the regression model for the y_axis_measure_expr. This is the independent variable.
• category_expr1, ..., category_exprN represents the dimensions or dimension attributes used to determine the category for which the regression model for the y_axis_measure_expr is computed. One or more dimensions or dimension attributes, up to five, may be provided as category columns.
• output_column_name is the output column:
  o fitted - returns the points on the regression line (y=ax+b)
  o intercept - returns the value of the regression line where x is zero (b in y=ax+b)
  o modelDescription - returns the model in JSON format
• options is a string list of name=value pairs separated by ';'. The value can include %1 ... %N, which can be specified using runtime_binded_options.
• runtime_binded_options is an optional comma-separated list of run-time bound columns and options.
Regression Example Analysis
This analysis examines how good a predictor the independent variable, billed quantity, is of the dependent variable, revenue. The question to be answered is: if the quantity of billed items changes, how is revenue altered? Based on the column names alone, one would predict that the data will cluster fairly closely around an upward-sloping regression line, meaning that billed quantity is a good predictor of revenue. This is fairly intuitive. However, as can be seen below, billed quantity is not a perfect predictor of revenue; if it were, fewer data points would lie away from the regression line. In a regression scatterplot like the one below, the more tightly the 'green dots' hug the 'blue dots', the higher the correlation between the two variables.
Function Syntax Used
REGR("Base Facts"."Revenue", ("Base Facts"."Billed Quantity"), ("Time"."Per Name Month", "Time"."Per Name Year"), 'fitted', '')
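To make the 'fitted' and 'intercept' outputs more concrete, here is a minimal Python sketch of an ordinary least-squares fit of y=ax+b for one independent and one dependent measure. It is an illustration of the underlying arithmetic only, not OBIEE's internal implementation; the quantity and revenue figures are made up, and the function name fit_line is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 5 (billed quantity vs. revenue).
quantity = [10.0, 20.0, 30.0, 40.0]
revenue = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(quantity, revenue)
# 'fitted' corresponds to the points on the regression line.
fitted = [slope * x + intercept for x in quantity]

print(slope, intercept)  # 2.0 5.0
print(fitted)            # [25.0, 45.0, 65.0, 85.0]
```

With real data the points would not sit exactly on the line, and the gap between each actual value and its fitted value is what the scatterplot makes visible.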
  • 34.
Figure 31: Regression Analysis of Billed Quantity as a Predictor of Revenue
In the table below, the column headed “Regression” shows the regression function’s output and how it relates to Figure 31.
Figure 32: Regression Output in Tabular View
  • 35.
It may be interesting to see which data points in this regression do not fit the trend. The visualization below was created with this syntax:
OUTLIER(("Time"."Per Name Year", "Time"."Per Name Month"), ("Base Facts"."Billed Quantity", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
This displays outlying values for the same columns used in the regression above.
Figure 33: Visual of Data Points where Billed Quantity is not a Predictor of Revenue
Concentrate on the red plus signs rather than the yellow rhombi. The red plus signs are the outliers for this regression analysis, whereas the yellow rhombi are merely the corresponding points on the regression line for these 4 outlying data points. By sorting the data set on the outlier column, one can create a table showing the years and months in which billed quantity was not a particularly good predictor of revenue.
Figure 34: Tabular View of Outliers Within a Regression Analysis
  • 36.
What is noticeable is that, in the 6th and 7th months of 3 consecutive years, billed quantity was not a great predictor of revenue. With this sort of information, it is possible to drill down into why this might be the case. Unveiling these quantitative and visual 'hints' within the data in an aesthetic way is the essence of these advanced analytics tools. Statistics can tell a lot about why things are the way they are and can, ultimately, provide insight that helps in building a sustainable organization.
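The idea of a data point "not fitting the trend" can be sketched in Python. Note that this is not the mvoutlier algorithm named in the OUTLIER call above; it is a simpler stand-in that flags points whose distance from a least-squares line is large relative to the overall spread. The data, the threshold of 2 standard deviations, and the function names are all assumptions made for this illustration.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(xs, ys, threshold=2.0):
    """Flag points whose residual exceeds `threshold` standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        # Every point lies exactly on the line, so nothing is an outlier.
        return [False] * len(xs)
    return [abs(r) / spread > threshold for r in residuals]


# Made-up data: revenue follows 2 * quantity + 5, except at quantity 6.
quantity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
revenue = [7.0, 9.0, 11.0, 13.0, 15.0, 47.0, 19.0, 21.0, 23.0, 25.0]

flags = flag_outliers(quantity, revenue)
outlier_quantities = [q for q, is_out in zip(quantity, flags) if is_out]
print(outlier_quantities)  # [6.0]
```

A multivariate method such as the one selected with algorithm=mvoutlier considers the joint distribution of the measures rather than residuals from a single line, so its results can differ from this simple check.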
  • 37.
Appendix I: Creating Presentation Variables and Prompts
Presentation Variable and Prompting the User for Function Options
Above, the function code varies slightly from the original syntax given: the options portion of the function input contains @{numClusters};maxIter=@{numIter}. The @{} notation adds a presentation variable that is tied to a dashboard prompt, which asks the user for the number of clusters and the number of iterations for the algorithm to perform. In many cases it is a good idea to prompt the user for the number of clusters and iterations, because it makes the dashboard more interactive. It is also useful because this simple change shows how the results for a large sample change as the data set is segmented into varying numbers of clusters. To do the same, a developer would highlight (in the syntax) the portion that would typically contain (%1…%N) for whichever variable needs a prompt, and then perform the following tasks:
Figure A1: Highlight the %N.
Figure A2: Click “Variable”, then “Presentation”.
  • 38.
Figure A3: Input a variable expression.
Be careful before clicking OK here: the Variable Expression must match the corresponding dashboard prompt exactly, including case. Click OK.
Figure A4: Click “New”, then “Dashboard Prompt”.
Figure A5: Click the green arrow, then “Variable Prompt”.
  • 39.
Prompt for=Presentation Variable:
*Label (this must equal the presentation variable that was set in the column function)=numClusters:
Expand the options window:
Variable Data Type=Number:
A Note on Defaults
The user can set a default value here. Also, be aware that there appears to be an undocumented default value of 5 clusters. For example, the syntax
CLUSTER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;')
returns this visualization:
Figure A6: Default Visualization of Discount Amount versus Revenue.
  • 40.
Figure A7: Complete the process again for the iteration variable.
Save these prompts. Now, on the dashboard where the dashboard prompt and the analysis have been placed, the presentation variable can be seen in action.
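The effect of the prompts on the options text can be pictured with a small Python simulation. This only mimics the substitution of @{name} placeholders by prompted values and is not how OBIEE implements presentation variables; the fragment is the one quoted in this appendix, while the prompted values 5 and 10 and the function name are assumptions for illustration.

```python
import re


def substitute_presentation_variables(text, values):
    """Replace each @{name} placeholder in text with the value for name."""

    def replace(match):
        name = match.group(1)
        # Names are matched exactly, so a difference in case is an error.
        if name not in values:
            raise KeyError(f"no prompt value for presentation variable {name!r}")
        return str(values[name])

    return re.sub(r"@\{(\w+)\}", replace, text)


# Fragment of the options text as quoted in this appendix.
options_fragment = "@{numClusters};maxIter=@{numIter}"
# Hypothetical values a user might enter in the dashboard prompts.
prompted = {"numClusters": 5, "numIter": 10}

resolved = substitute_presentation_variables(options_fragment, prompted)
print(resolved)  # 5;maxIter=10
```

In this simulation a prompt labelled numclusters would not match @{numClusters}, which mirrors the case-sensitivity warning given with Figure A3.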
  • 41.
Document History
Created By: Brendan Doyle, Mike Perhats
Edited By: Phil Goerdt
Creation Date: 8/8/16
Last Edit Date: 8/8/16