Oracle Business Intelligence Enterprise Edition 12c has enhanced analytical capabilities due to an optional integration with R. This document reviews newly available functions in OBIEE 12c including Trendline, Binning, Width Bucket, Forecast, Clustering, Outlier detection, and Regression. Examples using the Sample Sales Lite data set demonstrate how to use the Trendline, Binning, Width Bucket, and Forecast functions to analyze time series data and predict future trends. Syntax and options for each function are provided.
OBIEE 12c Advanced Analytic Functions
1. www.redpillanalytics.com
Abstract
Oracle Business Intelligence Enterprise Edition 12c has enhanced analytical capabilities
due to an (optional) integration with the statistical software R. These new functions include the
following: Trendline, Bin and Width Bucket, Forecast, Clustering, Outlier and Regression. This
document will provide a comprehensive review of these newly available functions, and provide
examples of them in action. For ease of understanding and reproducibility, the sample data set is
Oracle’s Sample Sales Lite [1].
[1] This data set is available with every install of OBIEE 12c. Alternatively, a similar set can be
found within the Oracle BI Sample App Virtual Machine.
The Trendline Function
Trendline is part of the Advanced Analytics Internal Logical SQL Functions, meaning it
is in the group of functions that are computed internally rather than in R. This function
fits a linear or exponential model and returns either the fitted values or the model parameters. The numeric_expr
represents the Y value for the trend and the series (time columns) represents the X value. A
trendline is a model: it asserts that the data can be described by a fitted function. The TRENDLINE
function measures data across time and draws a line through a metric over the ordered records. It can model
the data with linear or exponential regression.
Figure 1: The Trendline function is found under the Aggregate folder by clicking on the
“Insert Function” button in the Formula section of the column editor.
Trendline Function Syntax
TRENDLINE(<numeric_expr>, ([<series>]) BY ([<partitionBy>]),
<model_type>, <result_type>, [number_of_degrees])
Where:
o numeric_expr—represents the data to trend
▪ This is the Y-Axis and is a measure column.
o series—indicates the X-axis. This is a list of <valueExp> <orderByDirection>,
where <valueExp> is a dimension column and <orderByDirection> is ASC
(ascending) or DESC (descending).
▪ The default is ASC. Note that this cannot be an arbitrary combination of
numeric columns.
▪ It is possible to use more than one Trendline column in the same analysis,
but the Trendline columns must have the same X-Axis.
o partitionBy—A list of dimension attributes that are not on the X-Axis.
o model_type— A model type may be one of the following types:
▪ LINEAR—a function with a constant rate of change and a straight line
graph.
▪ EXPONENTIAL—a function in which a constant base is raised to the power of the
variable.
o result_type— A results type may be one of the following types:
▪ VALUE - will return the fitted (regression) Y value for each X in the fit.
▪ MODEL - will return all the parameters in a JSON (JavaScript Object
Notation, which is a lightweight data-interchange format) format string.
Figure 2: Example formula to display result_type of MODEL.
Figure 3: Results of using ‘MODEL’ as the result type; it returns the parameters in a JSON
(JavaScript Object Notation) format string.
Example Syntax
TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR', 'VALUE')
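To make the difference between the VALUE and MODEL result types concrete, the following Python sketch fits a linear or exponential trend with ordinary least squares. It is only an illustration of the idea, not OBIEE's internal implementation: the use of the record position as X, the log-linear fit for the exponential case, and the JSON key names are all assumptions made for this example.

```python
import json
import math


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def trendline(ys, model_type="LINEAR", result_type="VALUE"):
    """Return fitted values (VALUE) or the model parameters as JSON (MODEL)."""
    # Assumption: X is the 1-based position of each record in the ordered series.
    xs = list(range(1, len(ys) + 1))
    if model_type == "LINEAR":
        intercept, slope = fit_linear(xs, ys)
        params = {"intercept": intercept, "slope": slope}
        fitted = [intercept + slope * x for x in xs]
    elif model_type == "EXPONENTIAL":
        # Assumption: fit log(y) = log(a) + b * x, which requires positive y values.
        log_a, b = fit_linear(xs, [math.log(y) for y in ys])
        params = {"a": math.exp(log_a), "b": b}
        fitted = [params["a"] * math.exp(b * x) for x in xs]
    else:
        raise ValueError("model_type must be LINEAR or EXPONENTIAL")
    # Hypothetical JSON layout; the real MODEL string may use other keys.
    return json.dumps(params) if result_type == "MODEL" else fitted


monthly_revenue = [120.0, 135.0, 128.0, 150.0, 161.0]
print(trendline(monthly_revenue))                        # one fitted Y per record
print(trendline(monthly_revenue, result_type="MODEL"))   # parameters as a JSON string
```

With VALUE the caller gets one number per input record, which is what a chart plots as the trend line; with MODEL the caller gets the fitted parameters and can evaluate the line elsewhere.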
Figure 4: Selected dimensions and fact columns for a sample trendline analysis.
Figure 5: Note the Trendline (in green); depicting these types of subtle changes is what this
function is best at.
Figure 6: If the graph is set to vary color by ‘Per Name Year’, the results are displayed for each
year. Note the differences between each year that otherwise would not be apparent.
Figure 7: Segmentation of the trends could continue to smaller subsets. Above, 2009 has
been split by semester.
The BIN and WIDTH_BUCKET Functions
Both BIN and WIDTH_BUCKET are included in the Advanced Analytics Internal
Logical SQL Functions, meaning they are in the group of functions that are done internally as
opposed to being done in R. With that being said, the syntax for the two functions is different
and will be covered later on.
About BIN
In the BIN function, the user can select any numeric attribute (INT, FLOAT, DOUBLE,
NUMERIC) from a dimension or fact table/measure containing the data values and place them
into a discrete number of bins. The reason to bin a measure is to separate its values into
groups (see BIN syntax). An example would be binning a store’s sales by
revenue: less than $200, between $200 and $500, and so on. Each sale
is placed in the bin whose range contains its revenue. The
BIN function classifies a given numeric expression into a specified number of equal-width buckets.
The function can return either the bin number or one of the two end points of the bin interval.
The output of the BIN function is used as a GROUP BY expression for other measures included
in the query. The BIN function is treated like a new dimension attribute for purposes such as
aggregation, filtering, and drilling. All of these operations are supported on BIN expressions.
BIN Syntax
BIN(numeric_expr [BY grain_expr1, …, grain_exprN] [WHERE condition] INTO
number_of_bins BINS [BETWEEN min_value AND max_value] [RETURNING { NUMBER
| RANGE_LOW | RANGE_HIGH }])
Where:
o numeric_expr—indicates the measure or numeric attribute to bin
o BY grain_expr1, …, grain_exprN—indicates a list of expressions that define
the grain at which the numeric_expr is calculated before the numeric values are
assigned to bins.
▪ This clause is required for measure expressions and is optional for
attribute expressions
▪ The BY clause of the BIN function defines the grain at which the binned
expression is evaluated prior to binning.
If the binned expression is a measure, then the measure is grouped
at the grain specified in the BY clause before being binned.
▪ The BY clause of the BIN function is mandatory if the binned expression is
a measure.
Otherwise, for non-measure expressions, the BY clause is optional.
o WHERE condition—indicates a filter condition to apply to the numeric_expr before
the numeric values are assigned to bins
o INTO number_of_bins—indicates the number of bins to return. The default is 10.
o BETWEEN min_value AND max_value—indicates the minimum and maximum
values used for the end points of the outermost bins
o RETURNING—indicates which value the function returns for each bin. Note the
following options:
▪ RETURNING NUMBER—indicates the return value should be the bin number
(for example: 1,2,3,4). This is the default condition
▪ RETURNING RANGE_LOW—indicates the lower value of the bin interval
▪ RETURNING RANGE_HIGH—indicates the higher value of the bin interval
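The equal-width classification and the three RETURNING options can be sketched in a few lines of Python. This is a hedged illustration rather than OBIEE's actual algorithm: the function name `bin_value`, the choice to clamp out-of-range values into the outermost bins, and the handling of values that fall exactly on a boundary are assumptions.

```python
def bin_value(value, min_value, max_value, number_of_bins=10, returning="NUMBER"):
    """Classify value into one of number_of_bins equal-width bins."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    width = (max_value - min_value) / number_of_bins
    index = int((value - min_value) // width)
    # Assumption: values outside [min_value, max_value] go to the outermost bins,
    # and max_value itself belongs to the last bin.
    index = max(0, min(number_of_bins - 1, index))
    low = min_value + index * width
    if returning == "NUMBER":
        return index + 1          # bins are numbered from 1
    if returning == "RANGE_LOW":
        return low                # lower end point of the bin interval
    if returning == "RANGE_HIGH":
        return low + width        # upper end point of the bin interval
    raise ValueError("returning must be NUMBER, RANGE_LOW or RANGE_HIGH")


# Revenue of 600 placed into 4 bins between 0 and 1000 (each bin is 250 wide).
print(bin_value(600, 0, 1000, 4))
print(bin_value(600, 0, 1000, 4, "RANGE_LOW"), bin_value(600, 0, 1000, 4, "RANGE_HIGH"))
```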
Figure 8: The Bin Function is found under the Aggregate folder in the column formula editor.
About Width Buckets
The WIDTH_BUCKET function is known as a “secret function” meaning it is not
available in the function menu, but the user can type the formula to use it. The syntax of
WIDTH_BUCKET is also comma-based, which is not consistent with most Advanced Analytics
in OBIEE. Similar to binning, width bucket classifies a given numeric expression into a specified
number of equal width buckets. It operates on top of a base query result set as a display function.
The function can return either the bin number or one of the two end points of the bin interval.
Unlike the BIN function, the WIDTH_BUCKET function is not treated as a new dimensional
attribute for the purposes of aggregation. It is applied on top of the query result similar to the
other display functions such as RANK, TOPN, BOTTOMN, NTILE, PERCENTILE, MAVG,
and MEDIAN. Use the WIDTH_BUCKET function when you want to compute a discrete set of
buckets on top of an already aggregated query result set. The syntax for Width Bucket is much
simpler than that of the BIN function.
WIDTH_BUCKET Syntax
WIDTH_BUCKET(numeric_expr, {NUMBER | RANGE_LOW | RANGE_HIGH },
number_of_bins, [min_value, max_value] [BY expr1, …, exprN])
Where:
o numeric_expr—indicates the measure or numeric attribute to bin
o NUMBER—indicates that the return value should be the bin number (ex: 1,2,3,4).
o RANGE_LOW—indicates the lower value of the bin interval
o RANGE_HIGH—indicates the higher value of the bin interval
o number_of_bins—indicates the number of bins to return. The default is 10.
o min_value, max_value—indicates the minimum and maximum values used for
the end points of the outermost bins. If the min_value and max_value conditions
are omitted, then the function determines the end points automatically.
o BY expr1, …, exprN—indicates an optional list of expressions that define the
groups in the query result set over which the WIDTH_BUCKET calculation is
applied. The bucket intervals within different groups are calculated
independently.
▪ The BY clause of the WIDTH_BUCKET function defines the groups in the
query result over which the WIDTH_BUCKET calculation is applied.
The buckets within different groups are calculated independently.
▪ The BY clause is always optional in the WIDTH_BUCKET function.
If the BY clause is omitted from the WIDTH_BUCKET function,
then the function operates over the entire result set.
BIN and WIDTH BUCKET: Defining Grouping
The goal of both functions is to define the bin/bucket that each data entry belongs
to. This is accomplished by specifying:
o The column on which the binning should be done (that is, the binned expression).
▪ Remember, this is a numeric expression (and usually a measure).
o The attributes by which the data should be arranged.
▪ Remember, the BY clause does not have the same meaning in both
functions!
o The number of bins/buckets and the type of data returned.
▪ Remember, it is one of three options: the bin or bucket number, its
minimum point, or its maximum point.
o The WHERE condition option, found only in the BIN function.
BIN and WIDTH_BUCKET Function Example
The dimensions and measures being used for this example are:
• LOB
• Per Name Month
• Revenue
• BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."LOB", "Time"."Per Name Month" into 4 bins)
• WIDTH_BUCKET Formula: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4)
o (Define the number of bins to be the same for both, or there will be an error)
Figure 9: Above are the results of the binning and buckets of revenue. The table shows that it is
binning the monthly revenue of the LOB in columns “BIN” and “WIDTH_BUCKET” in bins of
1-4. It is sorting or binning the revenue into specific numbered groups.
Figure 10: A linear graph where Bin #1 contains the month and year when the revenue was less
than $15,000.
Figure 11: A linear graph where Bin #2 contains the month and year when the revenue was
between $15,000 and $30,000.
Figure 12: A linear graph where Bin #3 contains the month and year when the revenue was
between $30,000 and $45,000.
Figure 13: A linear graph where Bin #4 contains the month and year when the revenue was
greater than $45,000.
Be sure not to aggregate BOTH functions using the BY clause, for it will result in an error: BY has a different meaning in each function.
• BIN: BIN("Base Facts"."Revenue" BY "Time"."Per Name Month" into 4 bins)
The meaning of BY "Month" in BIN is: take sum("Revenue" by "Month") and
arrange the monthly sums in 4 bins. So rows of the same month will have the same BIN
result.
• WIDTH_BUCKET: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4 by "Time"."Per Name Month")
The meaning of BY "Month" in WIDTH_BUCKET is: take the individual rows of data in
each month and arrange them in 4 buckets.
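The two meanings of BY can be contrasted with a small Python sketch. It is an illustration under assumptions, not OBIEE code: the sample rows are invented, the helper names are hypothetical, and the end points are taken from the data itself because no BETWEEN or min/max values are given.

```python
from collections import defaultdict


def equal_width_bucket(value, low, high, n):
    """1-based equal-width bucket of value within [low, high]."""
    if high == low:
        return 1
    index = int((value - low) / (high - low) * n)
    return min(index, n - 1) + 1


def bin_by_month(rows, n):
    """BIN(... BY month): aggregate revenue per month first, then bin the monthly sums."""
    totals = defaultdict(float)
    for month, revenue in rows:
        totals[month] += revenue
    low, high = min(totals.values()), max(totals.values())
    return {month: equal_width_bucket(total, low, high, n) for month, total in totals.items()}


def width_bucket_by_month(rows, n):
    """WIDTH_BUCKET(... BY month): bucket each row independently within its own month."""
    groups = defaultdict(list)
    for month, revenue in rows:
        groups[month].append(revenue)
    return [
        (month, revenue, equal_width_bucket(revenue, min(groups[month]), max(groups[month]), n))
        for month, revenue in rows
    ]


# Invented sample rows: (month, revenue).
rows = [("Jan", 10.0), ("Jan", 30.0), ("Feb", 100.0), ("Feb", 20.0), ("Mar", 50.0), ("Mar", 30.0)]
print(bin_by_month(rows, 4))           # one bin per month
print(width_bucket_by_month(rows, 4))  # one bucket per row, computed inside each month
```

In the first function every row of a month shares one bin, while in the second two rows of the same month can land in different buckets, which is why the two columns stop matching when both formulas use BY.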
Figure 14: The Bin and Width Bucket do not match due to both functions using the BY clause.
Using the WHERE Option in the BIN Function
Figure 15: BIN Function Criteria edited to include the WHERE option.
BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."Product Type", "Time"."Per Name Month" where "Time"."Per Name Year"='2010' into 4 bins)
The Forecast Function
A Forecast creates a time-series model of the specified measure over the series using either
Exponential Smoothing or ARIMA (Autoregressive integrated moving average). This function
outputs a forecast for the set of periods as specified by numPeriods. Forecasting is very useful as
a tool for predictive analytics. You can see potential trends for different dimensions and
measures because of this function.
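As a rough intuition for what an exponential smoothing forecast does, the sketch below implements simple exponential smoothing, the most basic member of that family. It is not the ETS or ARIMA model that OBIEE fits through R: there is no trend, seasonality, or prediction interval, and the smoothing weight `alpha` is a hypothetical parameter chosen by hand here.

```python
def simple_exponential_smoothing(series, alpha, num_periods):
    """Forecast num_periods future values from a smoothed level of the series."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        # Each new observation pulls the level toward itself by a factor alpha.
        level = alpha * value + (1 - alpha) * level
    # Without trend or seasonality, every future period gets the same forecast.
    return [level] * num_periods


revenue_2008 = [110.0, 115.0, 120.0, 118.0, 125.0, 130.0]
print(simple_exponential_smoothing(revenue_2008, alpha=0.5, num_periods=3))
```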
Forecast Syntax
Figure 16: The Forecast function can be found under the “Time Series Calculations” folder
within the column formula editor.
FORECAST(numeric_expr, ([series]), output_column_name, options,
[runtime_binded_options])
Where:
o numeric_expr —indicates the measure to forecast.
o series —indicates the time grain at which the forecast model is built. This is a
list of one or more time dimension columns.
▪ If you omit series, then the time grain is determined from the query.
▪ The series must fit the date columns in the Analysis.
o output_column_name —indicates the output column. Valid values are ‘forecast’,
‘low’, ‘high’, and ‘predictionInterval.’
▪ forecast —This column is the forecasted output
▪ low —This column is the forecasted lower bound number
▪ high —This column is the forecasted higher bound number
Upper and lower limits of the prediction at the given confidence
level might be important
▪ predictionInterval —This is an available option that is the confidence
for the prediction.
The predictionInterval ranges from 0 to 100, where the higher
values specify a higher confidence.
o options —indicates a string list of name/value pairs separated by a semi-colon.
▪ The value can include %1…%N, which can be specified in
runtime_binded_options.
▪ View the table below for the available options
o runtime_binded_options—indicates a comma separated list of runtime-binded
columns and options
Forecast also has many available options that can be used with the function. Below is a list of
the options (value type in parentheses):
numPeriods —The number of periods to forecast (integer)
predictionInterval —The confidence for the prediction (0 to 100, where higher
values specify higher confidence)
modelType —The model to use for forecasting. (ARIMA—Autoregressive Integrated
Moving Average, fitted to time series data either to better understand the data or to
predict future points in the series), (ETS—Error, Trend, Seasonal—exponential
smoothing state space model that is applied to the ‘y’.)
useBoxCox —If TRUE, then use Box-Cox transformation, which is a method used to
normalize a data set so that statistical tests can be performed to evaluate it properly.
Many real world raw data sets do not conform to the normality assumptions used for
statistics, so transformation functions can sometimes be used to normalize the data.
(TRUE, FALSE)
lambdaValue —The Box-Cox transformation parameter. Ignore if NULL or when
useBoxCox is FALSE. Otherwise the data is transformed before the model is estimated.
trendDamp —This is a parameter for ETS (Error, Trend, Seasonal) model. If TRUE, then
use damped trend. If NULL, then try both damped and non-damped trend and choose
the one that is optimal.
errorType —This is a parameter for ETS model. (additive (“A”), multiplicative (“M”),
automatically selected (“Z”))
trendType —This is a parameter for ETS model. (none(“N”), additive (“A”),
multiplicative (“M”), automatically selected (“Z”))
seasonType —This is a parameter for ETS model. (none(“N”), additive (“A”),
multiplicative (“M”), automatically selected (“Z”))
modelParamIC —The information criterion (IC) to be used in the model selection.
(“ic_auto”, “ic_aicc”,”ic_bic”,”ic_auto”—this is the default)
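The Box-Cox transformation controlled by useBoxCox and lambdaValue has a standard textbook definition, sketched below in Python. The sketch only shows the transformation and its inverse; how the forecasting engine chooses lambda when it is left NULL is not covered, and the function names are hypothetical.

```python
import math


def box_cox(y, lambda_value):
    """Standard Box-Cox transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lambda_value == 0:
        return math.log(y)
    return (y ** lambda_value - 1) / lambda_value


def inverse_box_cox(z, lambda_value):
    """Map a transformed value back to the original scale."""
    if lambda_value == 0:
        return math.exp(z)
    return (lambda_value * z + 1) ** (1 / lambda_value)


# The model is estimated on the transformed data; forecasts are then
# back-transformed so they are reported on the original revenue scale.
revenue = [100.0, 400.0, 900.0]
transformed = [box_cox(y, 0.5) for y in revenue]
print(transformed)
print([inverse_box_cox(z, 0.5) for z in transformed])
```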
Figure 17: “Per Name Year” has been filtered to be “equal to/ is in” ‘2008’ to allow
forecasting for ‘2009’.
Forecast Example
The formula used in the FORECAST Column is as follows:
FORECAST("Base Facts"."Revenue", ("Time"."Per Name Year", "Time"."Per Name Month"),
'forecast', 'modelType=arima;numPeriods=%1;predictionInterval=70;', 12)
Figure 18: Forecast for 2009 based on 2008 data.
The Clustering Function
This function groups a set of records into groups based on one or more input expressions using
K-Means or Hierarchical Clustering, which are the two modes of clustering analysis that can be
utilized in the advanced analytics clustering model provided in 12c.
K-Means:
Given a set of observations (x1, x2, …, xn), k-means clustering
attempts to partition them into a specified number of clusters (k) so as to minimize the sum of the
distances from each point to the center of its assigned cluster. This allows for an overview of
similarities along the given dimensions.
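The k-means idea can be sketched with the classic assign-then-recompute loop (often called Lloyd's algorithm). This is a minimal illustration and not the implementation OBIEE calls in R: the function name, the deterministic choice of the first k points as starting centers, and the sample points are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, num_clusters, max_iter=10):
    """Return (cluster index per point, cluster centers)."""
    # Assumption: start from the first k points so the result is reproducible.
    centers = [tuple(p) for p in points[:num_clusters]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest center.
        new_assignment = [
            min(range(num_clusters), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Step 2: move each center to the mean of its members.
        for c in range(num_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers


# Invented (discount amount, revenue) pairs for four offices.
offices = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (9.2, 8.8)]
print(k_means(offices, num_clusters=2, max_iter=20))
```

The result depends on the starting centers, which is one reason the number of iterations and the seed handling are exposed as options.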
Hierarchical Clustering:
Generally, this form of clustering is an attempt to build a sort of pecking order in which the data
filters down into distinct groups along the prompted dimensions. Hierarchical clustering can be
thought of as a sort of “top-down” approach of structuring an overview for viewing contextual
differences/similarities amongst user-defined dimensions.
Syntax for Clustering Analysis:
CLUSTER( (dimension_expr), (expr), output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr— represents a list of dimensions to be clustered (K).
• expr— represents a list of dimension attributes or measures to be used (x1, x2, …, xn) to
cluster the dimension_expr (K)
• output_column_name— is the output to be printed in the column header; this portion of
the syntax is only part of the aesthetic interaction in the platform and does not perform
any analytics. The valid values include:
o clusterID – This column is the cluster number or ID.
o clusterName – This column is synonymous with clusterID.
o clusterDescription – The description can be added by the end user after the
cluster dataset is persisted into DSS.
o clusterSize – This column is the number of elements in the current cluster.
o distanceFromCenter – This column indicates how far the current cluster
element is from the center of the current cluster.
o centers – This column indicates the center of the current cluster in a format
• options — is a string list of name=value pairs separated by ';'. The value can include %1
... %N, which can be specified using runtime_binded_options.
• runtime_binded_options — indicates a comma separated list of binded columns or
literal expressions that supply a specification to an unrepresented value in the options list.
This portion of the syntax is optional. It merely supplies values for placeholders in options
that have yet to be specified. For example, in the clustering analysis, you might have
options of numClusters=%1, maxIter=%2. Suppose that you want 5 clusters and a
maximum of 10 iterations for this particular analysis. Your runtime_binded_options would
then be 5, 10, which corresponds to 5 clusters and 10 iterations. Order matters: %1 in
options maps to the first runtime-binded value, %2 to the second, and %N to the Nth. Here is
the entire syntax for this example:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"),
("Sales"."Facts"."Billed Quantity", "Sales"."Facts"."Revenue"), 'clusterName',
'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE',
5, 10)
Remember that runtime_binded_options is not required. Parameters can be
specified in the function without it. This means that the following code is
equivalent to the example given above:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"),
("Sales"."Facts"."Billed Quantity", "Sales"."Facts"."Revenue"), 'clusterName',
'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
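The positional substitution of %1 ... %N can be mimicked in Python to show why the two CLUSTER calls are equivalent. The helpers `bind_options` and `parse_options` are hypothetical names written for this illustration; they model the behavior described in the text and are not part of OBIEE.

```python
import re


def bind_options(options, runtime_bound=()):
    """Replace %1 ... %N in an options string with the Nth runtime-bound value."""
    def substitute(match):
        position = int(match.group(1))          # %1 refers to the first value
        return str(runtime_bound[position - 1])
    return re.sub(r"%(\d+)", substitute, options)


def parse_options(options):
    """Split 'name=value;name=value' into a dictionary, ignoring empty items."""
    pairs = (item.split("=", 1) for item in options.split(";") if item.strip())
    return {name.strip(): value.strip() for name, value in pairs}


with_placeholders = "algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE"
literal = "algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE"

bound = bind_options(with_placeholders, (5, 10))
print(bound == literal)
print(parse_options(bound))
```

Swapping the order of the runtime values would swap the number of clusters and the iteration limit, which is why the order matters.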
Clustering Example Analysis
An example of a clustering analysis could check to see how the dimensions of offices and
companies within the data set were clustered along the measures of revenue and discount
amount. One hypothesis for this analysis might be that offices under their respective companies
behave very similarly with regard to discount amount and revenue.
Formula Syntax [2]
CLUSTER(("Offices"."Office", "Offices"."Company"), ("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'clusterName',
'algorithm=k-means;numClusters=@{numClusters};maxIter=@{numIter};useRandomSeed=FALSE;enablePartitioning=TRUE')
Methodology:
The example uses K-means clustering rather than hierarchical clustering. See the
Syntax for Clustering Analysis section above for details on how the options
numClusters and maxIter accept user inputs.
[2] The @{numClusters} and @{numIter} expressions (highlighted in the original) are presentation variables. See Appendix I for more information.
With a user input of 3 clusters and 20 iterations, one would receive the following output:
Figure 19: Cluster Visualization for 3 Clusters, with 20 Iterations
Here the clusters are depicted via color and shape, Discount Amount and Revenue are on
the axes, and each point represents one of the 20 offices in the data set. We can see how this
graph changes after doubling the number of clusters.
Figure 20: Cluster Visualization of 6 Clusters with 20 Iterations.
Notice how some clusters are larger than others. This is because in this clustering method, the
objects of the data set are grouped in such a way that the clusters are very different from each
other and the objects in the same group or cluster are very similar to each other. This being said,
some data clusters might contain highly similar points along the measures of discount amount
and revenue while others are highly varied and only contain one data point, such as cluster
number 1 in this analysis. There is no ‘perfect number’ for cluster amount. This number is
contingent upon the data set in use, the amount of data, and user preference. 3 and 6 were used
here purely as examples.
If the data is in a tabular format, one can get a fairly informational depiction of exact amounts
within the selected data set. This allows for a more precise or exact view of the data within the
clusters. It would be poor practice to display all of this information on the scatterplot. The
visualization is more of an aesthetic way of viewing data that allows for increased perception of
what might otherwise not be apparent. The tabular version is important in correspondence with
the visual so that the user can witness precision along the results of the executed underlying
algorithm. Here is a snippet of the tabular information, sorted in ascending order by cluster
number:
Figure 21: Tabular View of Cluster Analysis.
The last important thing to note is that the clustering function in 12c offers a few variant
methods within both the K-means and Hierarchical algorithms. For the visual comparison, K-means
will be used because it is the default clustering algorithm in OBIEE. New variables (compared to
the previous analysis) will also be used to get more data points and to see how the different
methods differ.
Figure 22: New Columns for Methodology Comparison.
Notice below the added option, clusterNamePrefix, in the options portion of the syntax for all 3
of the following comparisons. Also notice that useRandomSeed is set to FALSE because we are
comparing methods. In the 'run time binded' section of the function, both %1 and %2 are set to
("INSERT METHOD"): %1 selects the method and %2 displays the method name in the legend of the
visualization. Also note that 5 clusters are used in each analysis, which allows for a more
telling comparison along our input dimensions.
K-MEANS CLUSTERING METHODS:
1) Hartigan-Wong Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{Hartigan-Wong}', '@{P_Method}{Hartigan-Wong}')
Figure 23: Output from Hartigan-Wong Method.
3) MacQueen Method
CLUSTER(("Offices"."Office", "Products"."Product"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'clusterName',
'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2',
'@{P_Method}{MacQueen}', '@{P_Method}{MacQueen}')
Figure 25: Output from MacQueen Method.
Looking closely at these visualizations, it is apparent that the cluster boundaries differ
slightly among the 3 methods.
Also of note: hierarchical clustering (h-clustering) has its own set of methods. Besides the
default, there are also ward.D, ward.D2, single, average, median, mcquitty, and centroid.
The Outlier Function
This function classifies a record as an outlier based on one or more input expressions using
K-Means, Hierarchical Clustering, or Multivariate Outlier Detection algorithms (the 3 outlier
detection methods in the Advanced Analytics tools in OBIEE 12c). Each method serves a different
purpose, and the user can choose the algorithm according to their specific needs. In statistics,
an outlier is a data point that diverges from the rest of the data set to a statistically
significant extent. Outliers can be thought of as data anomalies; the black sheep within the
data. Outlier detection can be thought of as clustering data along a logical metric, where FALSE
denotes normality (not an outlier) and TRUE denotes abnormality (an outlier). Here is a brief
description of the 3 methods mentioned above:
K-Means:
Given a set of observations input by the user (x1, x2, …, xn), k-means clustering attempts to
partition them into a specified number of clusters (k) so as to minimize the sum of the squared
distances between each point and the center of its cluster. This allows for an overview of
similarities along the given dimensions. For outlier detection, there will be two clusters in a
logical format, one of TRUE and one of FALSE, with TRUE denoting an outlier and FALSE denoting
data normality.
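To make the partitioning idea concrete, here is a minimal sketch of the basic (Lloyd-style) k-means loop in plain Python. It is not the Hartigan-Wong or MacQueen variant, and it is not what OBIEE runs through R; the data values, the function names, and the choice of starting centers at the first k points are all assumptions made for illustration.

```python
from math import dist

def kmeans(points, k, max_iter=20):
    """Lloyd-style k-means on a list of numeric tuples.

    Centers start at the first k points so the run is repeatable
    (a simplification chosen for this sketch).
    """
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

def within_cluster_ss(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical (discount amount, revenue) pairs forming two obvious groups.
data = [(1.0, 10.0), (1.2, 11.0), (0.8, 9.0), (8.0, 95.0), (8.5, 100.0), (7.5, 90.0)]
labels, centers = kmeans(data, k=2)
```

The quantity returned by within_cluster_ss is the objective that k-means tries to make small; rerunning with a different k or max_iter changes it, which mirrors the effect of the numClusters and maxIter options.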
Hierarchical Clustering:
Generally, this form of clustering is an attempt to build a sort of pecking order in which the data
filters down into distinct ‘groups’ along the prompted dimensions. Hierarchical clustering can be
thought of as a “top-down” approach of structuring an overview for viewing contextual
differences/similarities amongst user-defined dimensions.
Multivariate Outlier Detection (default outlier detection for 12c):
One way to check for multivariate outliers is with Mahalanobis’ distance (Mahalanobis, 1927;
1936). Mahalanobis’ distance can be thought of as a metric for estimating how far each case is
from the center of all the variables’ distributions (i.e. the centroid in multivariate space).
Mahalanobis’ distance accounts for the different scale and variance of each of the variables in a
set in a probabilistic way.
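As an illustration of the distance described above, the following sketch computes it for two measures in plain Python. It is only a hand-written example of the formula, not the routine that OBIEE calls in R; the function name and the data values are hypothetical.

```python
from math import sqrt

def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each 2-D point from the centroid."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(max(d2, 0.0)))
    return distances

# Hypothetical (billed quantity, revenue) pairs; the last one is off-trend.
sample = [(1, 10), (2, 21), (3, 29), (4, 40), (5, 51), (3, 60)]
distances = mahalanobis_2d(sample)
```

Because the covariance matrix enters the calculation, a point can sit close to the centroid on each axis taken alone and still receive a large distance if it breaks the joint trend, as the last point does here.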
Syntax for Outlier Analysis:
OUTLIER( (dimension_expr1 , ... dimension_exprN), (expr1, .. exprN),
output_column_name, options, [runtime_binded_options])
Where:
• dimension_expr: represents a list of dimensions to be clustered (K)
• expr: represents a list of dimension attributes or measures (x1, x2, …, xn) to be used to find
outliers.
• output_column_name: is the output column. The valid values are:
o 'isOutlier': returns a logical value, TRUE or FALSE, indicating whether each data point is an
outlier.
o 'distance': returns the "distance from normality" (the higher this number, the 'more' of an
outlier the data point is).
• options: is a string list of name=value pairs separated by ';'. The value can include
%1 ... %N, which can be specified using runtime_binded_options.
• runtime_binded_options: is an optional comma-separated list of run-time binded columns or
literal expressions that supply values for the %1 ... %N placeholders left open in the options
list. For example, in an outlier analysis, the user might have an option
output_column_name=%1. If they wanted to use the distance for this particular analysis, their
runtime_binded_options would then be equal to 'distance'. Order matters: %1 in options
corresponds to the first value supplied, %2 to the second, and %N to the Nth. Here would be the
entire syntax for this example (highlighted are the areas of focus). Remember that
runtime_binded_options is optional. You can specify parameters to your options without
using it, which implies that runtime_binded_options is more of an organizational tool than
a functional one. Using it versus not using it does not impact performance, but the option is
nice to have for organizational purposes.
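The positional rule for %1 ... %N can be mimicked in a few lines of Python. This is only a sketch of the substitution idea, not how the BI Server implements it; bind_options is a hypothetical helper, and the options string is the one from the K-means method comparison earlier in this paper.

```python
import re

def bind_options(options, *runtime_values):
    """Replace %1..%N in an options string with positional values."""
    def replace(match):
        index = int(match.group(1))
        if not 1 <= index <= len(runtime_values):
            raise ValueError(f"no runtime value supplied for %{index}")
        return str(runtime_values[index - 1])
    return re.sub(r"%(\d+)", replace, options)

# %1 takes the first runtime value and %2 the second.
bound = bind_options(
    "algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2",
    "Hartigan-Wong",
    "Hartigan-Wong",
)
```

After binding, the string reads exactly as if the method and prefix had been typed into the options directly, which is why the runtime list is an organizational convenience rather than a functional necessity.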
Outlier Function Example Analysis:
For the analysis, observe how the dimensions of offices and companies within the data set were
clustered along the measures of both revenue and discount amount. One hypothesis for this
analysis might be that offices under their respective companies behave very similarly with
regard to discount amount and revenue.
Figure 26: Columns used in example analysis.
New Columns for Methodology Comparison
For this example, the multivariate outlier algorithm (mvoutlier) will be used to start, rather
than K-means or hierarchical clustering, for no particular reason other than mvoutlier being the
default algorithm. However, it could be wagered that mvoutlier is the default because it is the
most favorable algorithm. Observe the variance in algorithms below.
Function Syntax:
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=_________')
Outlier Observations
Thus far, the syntax has not allowed a specified number of outliers to be entered. When using the
multivariate algorithm and entering numClusters into the syntax in order to change the result, an
error is printed in the results tab. After experimenting with the sample sales data, the
conclusion is that there is no way to set a specific number of outliers to be detected. The
number of outliers depends on each data set and how it interacts with the underlying algorithm in
R. Setting an "is not equal to" filter on the two outlying data points (Eiffel and Spring
offices) does not eliminate the outliers. Rather, two new outliers appear (the next two most
northeasterly points on the graph). This is counterintuitive. If the function were finding only
truly, significantly variant data, then the result after this filter was applied should be all
green (FALSE) points on the scatter plot. On the other hand, sometimes a user might have a data
set with all very similar points but still want to find the point(s) that are most variant. In
that sense, the outlier detection algorithm can be relied upon to return outliers in all
situations. It is important to keep these contingencies in mind when analyzing data.
When a scatter plot of these variables is made, the following graphs and accompanying tables are
returned:
K-means Outlier Detection Method
OUTLIER(("Offices"."Company", "Offices"."Office"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=Kmeans')
Figure 29: K-Means Outlier Detection output.
Notice that when using the h-clustering and multivariate algorithms, the outliers are consistent
(Eiffel and Spring offices of Tescare Ltd.), but when using the K-means algorithm, very different
outliers are returned: the Blue Bell and Teller offices of Stockpiles Inc. These variations
between the algorithms raise the question of reliability. For this reason, the variables of
analysis were altered to get a visualization with more data points, and hence more outliers, to
see whether the variation was an anomaly specific to these variables. The syntax in use is:
OUTLIER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')
OUTLIER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=Mvoutlier')
OUTLIER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=K-means')
Place each of the above in its own column (with the same variables in each) in order to view
these outliers graphically on the same scatter plot. New variables (Product and Office) were also
used in this analysis for a larger number of data points. As is visible after these minor
changes, some outliers overlap and others do not.
Legend Translation:
Blue Squares: Only h-clustering viewed these as outliers
Green Circles: All algorithms viewed these as outliers
Yellow Rhombi: No algorithm viewed these as outliers
Red plus: MV Outlier and K-means algorithms viewed these as outliers, not h-clustering.
In the graph editor, the corresponding order of methodological reference is:
• Hierarchical Clustering
• MV Outlier
• K-Means
Figure 30: Visualization comparing all three outlier methods
Analyzing Methodology Consistency
There are no overlapping points between the h-clustering algorithm and the other two algorithms.
This is interesting. It could be inferred that other data sets would have points where the other
methods overlapped with the h-clustering algorithm, but in this particular data set, the reason
might have something to do with how each algorithm goes about 'defining' an outlier. Remember,
h-clustering builds a hierarchical pecking order through which data filters down, whereas the
other methods are distance based, depending on your input criteria or dimensionality. These
differences could account for the variance in our visualization here.
Also notice that the 'behind the scenes' R statistics seem more consistent in their outlier
detection here. In the first analysis, K-means was a little bit off compared to the other two
algorithms. Documentation on K-means clustering suggests that K-means becomes more reliable as
data sets grow larger. In the first analysis there were few data points; in the second there are
many. The fact that there were only 20 points in the first analysis might be the reason for the
discrepancy. Perhaps as the data set size increases, the algorithms will agree more
consistently. Keep this in mind when choosing algorithms.
The Regression Function
This function fits a linear model and returns the fitted values or the model. It can be used to
fit a line on two measures. In statistics, regression analysis is a process that estimates the
relationship between variables within a data set. The focus is to measure the relationship
between one or more independent (fixed) variables and a dependent variable. More specifically,
regression allows for a deeper understanding of how the dependent value changes when the
independent variable is adjusted. It might help to think of regression in 'mathy' f(x), or f of
x, notation, where x is the independent variable or input value. The dependent variable (or
output) could be thought of as the value on the y axis.
It might also help to think of these two variables in a linguistic way. The y-axis measure is the
dependent variable, which means that it literally depends on some other value changing before it
does. The x-axis measure(s) is/are literally independent of any other factor(s); they are fixed.
This is important to understand before getting into the syntax.
In layman's terms, regression is a measure of how good a predictor one measure is of another
measure. Linear regression is also widely used for forecasting trends and predictive analytics,
and has strong ties to machine learning. Also, understand that regression does not imply
causation, but rather suggests a specific extent of correlation between two measures.
Dummy Variables in Categorical Regression
It is not possible to directly regress a categorical variable against a numerical variable, nor
is it possible to regress a numerical variable against a categorical variable. There is a
solution for this though: the dummy variable. Suppose an analysis needs a regression model
involving a categorical variable that contains the names of pets (Cats, Dogs, and Birds), to see
how good a predictor these pets are of (fill in the blank). It would not make sense to assign
Cats, Dogs, and Birds a 1, 2, and 3, respectively, unless, for some reason, a Dog was twice as
much of a pet as a Cat and a Bird 3 times as much of a pet as a Cat. Since regression is used
with numerical variables, interpretations are only valid where a value of 100 stored for some
variable literally equates to having 100 times the characteristic of a value of 1. For the pet
example, since it would be illogical to assign a 1, 2, and 3, an alternative (with a regression
model in mind) is to assign binary values, such as 1=Cat and 0=not a Cat.
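The pet example can be sketched in Python as follows. The helper name and the sample list are hypothetical, and dropping one category as a baseline is a common statistical convention assumed here rather than something the REGR function is documented to do.

```python
def dummy_encode(values, baseline):
    """Turn category labels into 0/1 indicator columns, omitting the baseline."""
    levels = sorted(set(values) - {baseline})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

pets = ["Cat", "Dog", "Bird", "Cat", "Dog"]
encoded = dummy_encode(pets, baseline="Bird")
# encoded == {"Cat": [1, 0, 0, 1, 0], "Dog": [0, 1, 0, 0, 1]}
```

Each column answers a yes/no question (is this row a Cat? is it a Dog?), so no pet is treated as a multiple of another. A row with 0 in every column is the baseline, Bird.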
Syntax for Regression Analysis
REGR(y_axis_measure_expr, (x_axis_expr), (category_expr1, ...,
category_exprN), output_column_name, options, [runtime_binded_options])
Where:
• y_axis_measure_expr represents the measure for which the regression model is to be
computed. This is your dependent variable.
• x_axis_expr represents the measure to be used to determine the regression model for the
y_axis_measure_expr. This is your independent variable.
• category_expr1, ..., category_exprN represents the dimension/dimension
attributes to be used to determine the category for which the regression model for the
y_axis_measure_expr is to be computed. One or more dimensions or dimension
attributes, up to five, may be provided as category columns.
• output_column_name is the output column. The valid values are:
o fitted - returns the points on the regression line (y=ax+b)
o intercept - the value of y where x is zero (b from y=ax+b)
o modelDescription - the model in JSON format.
• options is a string list of name=value pairs separated by ';'. The value can include %1 ...
%N, which can be specified using runtime_binded_options.
• runtime_binded_options is an optional comma separated list of run-time binded
columns and options.
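To show what 'fitted' and 'intercept' correspond to in y=ax+b, here is a minimal ordinary least squares sketch in plain Python. It illustrates the general technique only; the data values and the fit_line helper are hypothetical and do not reproduce the R call that OBIEE makes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on y = 2x + 5.
billed_quantity = [10, 20, 30, 40]
revenue = [25, 45, 65, 85]

slope, intercept = fit_line(billed_quantity, revenue)
# The 'fitted' values are the points on the line for each x.
fitted = [slope * x + intercept for x in billed_quantity]
```

With real data the points do not all fall on the line, and the gap between each actual value and its fitted value is what the scatterplot makes visible.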
Regression Example Analysis
In this particular analysis, a comparison is made to reveal how good a predictor the independent
variable of billed quantity is for the dependent variable of revenue. The question to be answered
here is: if the quantity of billed items changes, how is revenue altered? Based on the column
names alone, it could be predicted that the data will cluster fairly closely around an
upward-sloping regression line created by the function. This would mean that billed quantity is a
good predictor of revenue, which is fairly intuitive. But what can also be seen below is that
billed quantity is not a perfect predictor of revenue; if it were, fewer data points would lie
off the regression line. In a regression scatterplot like the one below, the tighter our 'green
dots' hug our 'blue dots', the higher the correlation between the two variables.
Function Syntax Used
REGR("Base Facts"."Revenue", ("Base Facts"."Billed Quantity"),
("Time"."Per Name Month", "Time"."Per Name Year"), 'fitted', '')
Figure 31: Regression Analysis of Billed Quantity as a Predictor of Revenue
If the user were to check the table below and look under the column heading "Regression", he/she
would see the regression function's output, and how it relates to Figure 32.
Figure 32: Regression Output in Tabular View
It may be interesting to see which data in this regression did not fit this particular trend. The
visualization below was created by using this syntax:
OUTLIER(("Time"."Per Name Year", "Time"."Per Name Month"),
("Base Facts"."Billed Quantity", "Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
This will display outlying values using the same variables as the above regression.
Figure 33: Visual of Data Points where Billed Quantity is not a Predictor of Revenue
Concentrate on the red plus signs rather than the yellow rhombi. The red plus signs are the
outliers for this regression analysis, whereas the yellow rhombi are merely the corresponding
points plotted on the regression line for these 4 outlying data points. By sorting the outlier
portion of this data set, one could create a table that shows the year and month where billed
quantity was not necessarily a great predictor of revenue.
Figure 34: Tabular View of Outliers Within a Regression Analysis
What is noticeable is that, for the 6th and 7th months of 3 consecutive years, billed quantity
was not a great predictor of revenue. With this sort of information, it is possible to drill down
into why this might be the case. Unveiling these sorts of quantitative and visual 'hints' within
the data in an aesthetic way is the epitome of these advanced analytics tools. Statistics can
tell a lot about why things are the way they are and can, ultimately, provide insight for moving
forward in a way that allows the building of a sustainable organization.
Appendix I: Creating Presentation Variables and Prompts
Presentation Variable and Prompting the User for Function Options
Above, the function code varies slightly from the original syntax given:
@{numClusters};maxIter=@{numIter} appears in the options portion of the function input. The @{}
is the code for adding a presentation variable, which a dashboard prompt fills by asking the user
for the number of clusters and the number of iterations for the algorithm to perform. In many
cases it is a good idea to prompt the user for these values because it allows for a more
interactive dashboard. It is also important because this easy functional change can show how a
large sample changes as we segment the data set into varying numbers of clusters.
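The way a presentation variable is filled in can be imitated with a short Python sketch. It assumes the @{name}{default} form seen earlier in '@{P_Method}{Hartigan-Wong}', where the second pair of braces carries a default value. The resolver function is hypothetical, and raising an error for an unset variable without a default is a choice made for this sketch, not documented OBIEE behavior.

```python
import re

def resolve_presentation_variables(text, values):
    """Substitute @{name} or @{name}{default} using a dict of prompt values."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in values:
            return str(values[name])
        if default is not None:
            return default
        raise KeyError(f"no value or default for presentation variable {name}")
    return re.sub(r"@\{([^}]+)\}(?:\{([^}]*)\})?", replace, text)

# The prompt supplied numIter only, so numClusters falls back to its default.
resolved = resolve_presentation_variables(
    "numClusters=@{numClusters}{5};maxIter=@{numIter}",
    {"numIter": 20},
)
```

The variable name in the expression has to match the name set in the prompt exactly, including case, for the substitution to find it.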
A developer who wants to perform this same task would highlight (in the syntax) the portion that
would typically contain (%1…%N) for whichever variable they want to add a prompt for, and then
perform the following tasks:
Figure A1: Highlight the %N.
Figure A2: Click “Variable”, then “Presentation”.
Figure A3: Input a variable expression.
It is important to be careful prior to clicking OK here. This Variable Expression must be
matched in a case sensitive fashion to the corresponding dashboard prompt. Click OK.
Figure A4: Click “New”, then “Dashboard Prompt”.
Figure A5: Click the green arrow, then “Variable Prompt”.
• Prompt for = Presentation Variable
• Label = numClusters (this must equal the presentation variable that was set in the column
function)
• Expand the options window
• Variable Data Type = Number
A Note of Defaults
The user can set a default value here. Note also that there appears to be an undocumented default
value of 5 clusters. For example, the syntax
CLUSTER(("Products"."Product", "Offices"."Office"),
("Base Facts"."Discount Amount", "Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;')
returns a visualization of:
Figure A6: Default Visualization of Discount Amount versus Revenue.
Figure A7: Complete the process again for the iteration variable.
Save these Prompts.
Now when going into the Dashboard, where the dashboard prompt and the analysis have been
input, this presentation variable can be witnessed in action.