International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.6, No.2, March 2016
DOI : 10.5121/ijdkp.2016.6205
ANOMALY DETECTION AND ATTRIBUTION
USING AUTO FORECAST AND DIRECTED
GRAPHS
Vivek Sankar and Somendra Tripathi
Latentview Analytics, Chennai, India
ABSTRACT
In the business world, decision makers rely heavily on data to back their decisions. With the quantum of
data increasing rapidly, traditional methods used to generate insights from reports and dashboards will
soon become intractable. This creates a need for efficient systems which can substitute human intelligence
and reduce time latency in decision making. This paper describes an approach to efficiently process time series data with multiple dimensions, such as geographies, verticals and products, to detect anomalies in the data and, further, to explain potential reasons for the occurrence of those anomalies. The algorithm implements automatic selection of forecast models to make reliable forecasts and detect such anomalies. Depth First Search (DFS) is applied to analyse each of these anomalies and find their root causes. The algorithm filters out redundant causes and reports the insights to the stakeholders. Apart from being a hair-trigger KPI tracking mechanism, this algorithm can also be customized for problems like A/B testing, campaign tracking and product evaluations.
KEYWORDS
DFS, A/B Testing, Reporting, Forecasting, Anomaly Spotting.
1. INTRODUCTION
With the growing use of the Internet and mobile apps, the world is seeing a steep increase in the availability and accessibility of different kinds of data – demographic, transactional, social media and so on. Modern businesses are keen to leverage these data to arrive at smarter and timelier decisions. But the existing data setup in many organizations primarily provides post-mortem reports of performance. The shift to a more proactive or instantaneously reactive approach to data-based decision making requires investments in data infrastructure and skilled resources, and enabling quicker and more efficient dissemination of information to the right stakeholders. With more and more firms investing in their infrastructure to capture the necessary data, there is a growing need for automated systems that can step in to process the raw data and provide actionable readouts to the required stakeholders.
This paper proposes one such completely automated frequentist framework which, when provided with any casted data (a dataset which is a cross product of the dimensions involved and their corresponding metric values for each time frame), can run in the background to provide actionable readouts. It helps business decision makers stay updated with developments in their portfolio by providing timely updates: a list of anomalies (spikes and dips) with respect to the KPIs of their interest, along with reasons for the same.
The proposed method makes use of forecasting techniques to arrive at ballpark figures for every segment, which act as a logical substitute for the user's intuition. The estimates are pitted against the actual performance immediately after the actual data becomes available, to effectively spot anomalies. The framework consists of intermediate trigger systems that can alert the corresponding stakeholders without much delay.
The system further digs down into sub-segments to attribute the reasons for every spotted anomaly. This approach, which requires minimal human intervention, is an effective aid in scenarios where:
• the number of segments involved could not be handled manually
• there is a lack of statistical expertise on the user front
• there is a high time lag in decision making due to existing reporting structures
This approach finds application in a wide range of industries such as retail, e-commerce, airlines, insurance, manufacturing, logistics and supply chain, benefiting portfolio managers, analysts, marketers, and product and sales managers, to name a few.
The task of anomaly detection and reporting starts with processing the melted data coming from the database into casted data by producing all possible interactions between the various dimensions in the dataset. Auto-forecasting iterates through the entire casted data, converts it into time series and generates predictions to be consumed in the later modules. Further, the framework establishes networks to understand the interdependencies between the various segments in the data. Depth First Search is then applied to relate the spotted anomalies, which are then checked for redundancy and reported.
The paper is structured as follows. The next section presents the methodology and tools used in the framework; it is broken into three sub-sections that highlight the pivotal modules running the entire framework. Section 3 discusses the implementation. Concluding remarks and future work are given in Section 4.
2. METHODOLOGY
The proposed framework is a novel technique to spot anomalies in data with minimal human intervention.
The three prime components that are required for its functioning are:
1. The actual value of the KPI
2. A ballpark value for the KPI
3. A scientific flagging approach
With the availability of these three components, the method can be applied across a wide range of real-life scenarios and across multiple verticals and KPIs. The following sections explain the various steps involved in the framework for creating the above-mentioned components and utilizing them to provide actionable insights to business users.
The entire framework is broadly broken down into the following sections:
• Data Processing Module (DPM)
• Auto Forecasting Module (AFM)
• Pattern Analysis and Reporting Module (PARM)
The flow chart in Figure 1 is a concise representation of the internal structure of the modules involved and the flow of data between the modules.

Figure 1: Framework - Block Diagram
2.1 Module 1: Data Processing Module
2.1.1 Data Input:
This module accepts raw data in any form of casted data. The dataset would include all the necessary dimensions that best describe the KPI, along with the KPI itself broken down at the finest granularity of the time unit used for reporting. Some easily relatable examples are sales data of retail stores, web traffic of an e-commerce website, call volume data from call centers, risk decline volumes for a payment gateway and so on. Any data that exhibits trends and seasonal patterns is a perfect fit for this module and the framework as a whole. Table 1 is an illustration of a sample data source.
Table 1: Sample Casted Dataset
Customer Segment Product Category Region Date Sales
Consumer Furniture Central 1/1/2010 690.77
Consumer Furniture Central 2/1/2010 5303.17
Consumer Office Supplies East 1/1/2010 1112.11
Consumer Office Supplies East 2/1/2010 84.01
Home Office Furniture West 11/1/2013 10696.84
Home Office Furniture West 12/1/2013 4383.98
Here the first three columns are the dimensions, and the Date column indicates the frequency of reporting. The Date column here is at a month level, but in general the framework can be applied to daily, weekly, monthly, quarterly or yearly reports. The final column (Sales) is the actual KPI that the business wants to track using this methodology.
2.1.2 Segments Creation:
Each of the dimensions in the dataset could hold two or more values, and each of these could be of interest to different stakeholders. Referring back to the sample dataset above, the Region dimension holds the geographies where Sales were reported, and each of the individual regional heads would want to keep an eye on the Sales of their region. Hence each value in every dimension is potentially a segment. The system further generates segments by combining values of two dimensions; for example, combining Region and Product Category yields segments like (Central _ Furniture) and (West _ Furniture). Segment formation extends from treating every dimension individually to combining all the available n dimensions. After the system generates all possible combinations using the available dimensions in the data, each unique combination is given a Segment ID and the initial casted data is melted.
Table 2: Conversion of casted to melted data
Segment ID Date Sales
Seg 1 1/1/2010 690.77
Seg 1 2/1/2010 5303.17
Seg 2 1/1/2010 1112.11
Seg 2 2/1/2010 84.01
Seg 3 11/1/2013 10696.84
Seg 3 12/1/2013 4383.98
In short, if there are n dimensions (D_1, D_2, D_3, ..., D_n) and the number of values in each dimension is (X_1, X_2, X_3, ..., X_n) respectively, the total number of combinations generated is (X_1 + 1) * (X_2 + 1) * ... * (X_n + 1).
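As a concrete illustration, the short Python sketch below enumerates these combinations and melts one segment's KPI series (the paper's implementation used R and Python, but the pandas-based helpers here, and their names, are illustrative assumptions; the Date and Sales columns follow Table 1):

```python
from itertools import product

import pandas as pd

def generate_segments(df: pd.DataFrame, dims: list) -> dict:
    """Map each Segment ID to the dimension filter defining it.

    Every dimension contributes one of its X_i values or is absent
    (None, the '+1'), so len(result) == (X_1+1)*(X_2+1)*...*(X_n+1).
    """
    value_sets = [[None] + sorted(df[d].unique()) for d in dims]
    segments = {}
    for i, combo in enumerate(product(*value_sets), start=1):
        filters = {d: v for d, v in zip(dims, combo) if v is not None}
        segments[f"Seg {i}"] = filters
    return segments

def melt_segment(df: pd.DataFrame, filters: dict) -> pd.Series:
    """Aggregate the KPI over time for one segment (melted form)."""
    mask = pd.Series(True, index=df.index)
    for d, v in filters.items():
        mask &= df[d] == v
    return df[mask].groupby("Date")["Sales"].sum()
```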
2.1.2.1 Business Preferences/Inputs:
This is an optional step in the overall structure, where the intention is to bucket values in each dimension so as to club smaller segments together. Assuming there are 100 different products in the Product Category dimension, and 9 of these 100 products account for 95% of overall Sales, the remaining 91 products could be clubbed as 'Other Products'.
This removes certain segments from the picture based on business preferences. Inputs could also come in the form of flat files containing segments which businesses are not too concerned about. Both these measures reduce the number of segments, thereby improving the operating efficiency and memory consumption of the proposed approach.

The final melted output from this module is fed into the Auto Forecast Module after the removal of all data points corresponding to the latest unit of time, passing through the Multiprocessing stage to enable parallel processing in the Auto Forecast Module.
2.1.3 Parallel Processing Module:
The forecasting module can be duplicated as multiple processes, since each segment is treated independently of the others. The dataset is broken down into multiple parts with an equal number of segments and fed to multiple sessions of the AFM.
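A minimal sketch of this stage, assuming the melted series sit in a Python dictionary and using the standard library's process pool; forecast_segment is a hypothetical stand-in for one AFM session:

```python
from multiprocessing import Pool

def forecast_segment(item):
    """Hypothetical stand-in for one AFM session on one segment."""
    seg_id, series = item
    return seg_id, series.mean()  # a real AFM returns a one-step forecast

def run_afm_parallel(melted: dict, workers: int = 4) -> dict:
    # Segments are independent, so equal-sized chunks can be forecast
    # by separate worker processes without coordination.
    chunk = max(1, len(melted) // workers)
    with Pool(processes=workers) as pool:
        return dict(pool.map(forecast_segment, melted.items(), chunksize=chunk))
```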
2.2 Module 2: Auto Forecast Module
The auto forecast module is designed to generate forecasts on multiple time series, which are then fed to the PARM discussed in the next section.

The following steps explain how one-step forecasts are generated:
2.2.1 Data Preparation:
Aggregated data is converted to time series of the specified frequency. Each time series is then iteratively checked for missing values and outliers. The framework provides the flexibility of using linear and cubic spline interpolation for treating missing values and outliers [1]; the time series is decomposed using STL and the trend component is smoothed, which helps in approximating missing values and minimizing the effect of outliers. Each time series is also checked for 0-padding (both leading and trailing). Power transformations such as Box-Cox can be applied before moving on to the next phase.

Figure 2: Missing value treatment using STL + spline smoothing
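A minimal sketch of this preparation step, assuming a monthly pandas Series and substituting statsmodels' STL and pandas interpolation for the authors' R routines (the smoothing window and spline order are illustrative choices):

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

def prepare_series(y: pd.Series, period: int = 12) -> pd.Series:
    # 1. Treat missing values by cubic-spline interpolation (needs scipy);
    #    method="linear" is the simpler alternative the framework allows.
    y = y.interpolate(method="spline", order=3, limit_direction="both")
    # 2. Decompose with robust STL and smooth the trend component.
    fit = STL(y, period=period, robust=True).fit()
    trend = fit.trend.rolling(window=3, center=True, min_periods=1).mean()
    # 3. Rebuilding from smoothed trend + seasonal approximates missing
    #    points and damps the effect of outliers before transformation.
    return trend + fit.seasonal
```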
2.2.2 Model Selection:

This phase involves choosing the right set of models for a given time series. The current pool includes Exponential Smoothing (the ETS implementation with automatic selection), auto.arima [2], Nnet (feed-forward neural networks with a single hidden layer), TSLM (for fitting linear models to time series including trend and seasonality), TBATS (an exponential smoothing state space model with Box-Cox transformation, ARMA errors, and Trend and Seasonal components) [3,4] and STL.
It is important that the seasonality component is correctly identified before forecasting. Over-parameterization, or force-fitting seasonality when there is no real seasonal component, might produce unreliable forecasts. In cases where there are too few data points, only ARIMA and regression models are used; Croston's method is used where time series are intermittent; TBATS is used for data with weekly and annual seasonality. All relevant models are picked for forecasting in this phase.
Figure 3: Model selection based on holdout accuracy
2.2.3 Forecasting:
The level of the time series must be specified before this step starts. A small portion of the data is held out for validation [5]. This helps prevent overfitting and gives a truer estimate of the generalization error. The module then runs all the models picked for the given time series and provides the residuals, fit statistics and confidence intervals.
2.2.4 Evaluation:
The metric used for evaluating the fit can be specified at the start of execution of this module. The framework provides MAPE (mean absolute percentage error), MSE (mean squared error), MAE (mean absolute error) and MASE as possible measures of forecast accuracy. A summary of the fit statistics from all selected models applied to the data is generated, and the winner model is the one with the best accuracy. Poor predictions are flagged at this step for manual intervention.
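The hold-out selection of sections 2.2.2-2.2.5 could look like the sketch below, which scores a reduced candidate pool (ETS and a fixed-order ARIMA via statsmodels, rather than the paper's full R pool) by MAPE on a hold-out and refits the winner on the full series; monthly seasonality and non-zero actuals are assumed:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def mape(actual, forecast):
    """Mean absolute percentage error; assumes non-zero actuals."""
    return float(np.mean(np.abs((actual - forecast) / actual))) * 100

def select_and_forecast(y: pd.Series, holdout: int = 6):
    train, valid = y[:-holdout], y[-holdout:]
    candidates = {
        "ETS": lambda s: ExponentialSmoothing(
            s, trend="add", seasonal="add", seasonal_periods=12).fit(),
        "ARIMA": lambda s: ARIMA(s, order=(1, 1, 1)).fit(),
    }
    scores = {name: mape(np.asarray(valid),
                         np.asarray(build(train).forecast(holdout)))
              for name, build in candidates.items()}
    winner = min(scores, key=scores.get)
    # Refit the winner on the whole series so no recent data is lost
    # (section 2.2.5), then emit the one-step forecast.
    return winner, scores, candidates[winner](y).forecast(1)
```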
Table 3: Selection of winner model based on performance metric
Store # ETS TBATS Arima Nnet Tslm Croston’s*
Store 53 7.2% 4.3% 4.7% 2.0% 3.3% NA
Store 54 72.2% 44.8% 41.4% 104.3% 15.9% NA
Store 55 2.6% 5.4% 2.6% 5.4% 6.6% NA
Store 56 5.6% 2.5% 2.8% 12.6% 5.9% NA
Store 57 4.5% 1.8% 4.4% 4.1% 2.3% NA
Store 58 3.9% 2.9% 5.4% 4.8% 2.5% NA
Store 59 11.5% NA 14.9% 31.5% 20.8% NA
Store 60 NA NA 10.1% NA 8.9% NA
Store 61 1.8% 1.3% 3.5% 33.2% 1.8% NA
Store 62 33.8% 52.1% 71.3% 27.0% 14.0% NA
Store 63 12.7% 4.3% 7.8% 4.0% 8.7% NA
Store 64 4.1% 1.4% 3.0% 8.0% 2.5% NA
2.2.5 One-step forecast:

The final forecasts are generated by running the winner model on the entire data, as dropping the most recent data points could cause loss of valuable information. The residuals, confidence intervals and one-step forecasts from this model are then passed to the next module.
2.3 Module 3: Pattern Analysis and Reporting Module (PARM)
PARM receives data feed separately from the Data Processing Module and Auto Forecast
Module.
• DPM provides the KPI values of the latest unit of time for every segment. These are the values that were kept aside from the melted dataset before it was fed into the AFM.
• AFM provides reliable forecast values for every segment, along with the residuals and prediction intervals at the required confidence level.
2.3.1 Segment Level Flagging
Every available segment in the data has an actual value, a ballpark value (forecasted output) and a
prediction interval based on the point forecast.
2.3.1.1 Prediction Interval
As per Hyndman [7], a prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval. For example, one might give an 80% interval for a forecast of GDP in 2014; the actual GDP in 2014 should then lie within that interval with probability 0.8. Prediction intervals can arise in Bayesian or frequentist statistics.
A confidence interval, by contrast, is an interval associated with a parameter and is a frequentist concept. The parameter is assumed to be non-random but unknown, and the confidence interval is computed from data. Because the data are random, the interval is random. A 95% confidence interval will contain the true parameter with probability 0.95; that is, over a large number of repeated samples, 95% of the intervals would contain the true parameter.
The range of the prediction interval depends on two factors:
i. The accuracy of the forecast: the higher the accuracy, the narrower the band; poor accuracy pushes the limits towards -∞ and +∞.
ii. The desired confidence percentage: the higher the required confidence, the broader the band.
2.3.1.2 Reasons for independent forecast
Every segment generated by the DPM could be unique in its properties and thereby exhibit its own trend and seasonal attributes. Moreover, obtaining forecast values by rolling up from sub-granular levels would aggregate the errors of those levels and hence hurt the accuracy of the overall forecast.
2.3.1.2.1 Anomalies – Dips and spikes
The actual KPI value for the latest unit of time is pitted against the prediction interval for that segment:

If A_T > EUL_T, flag a spike;
if A_T < ELL_T, flag a dip,

where A_T is the actual KPI value for time period T, and EUL_T and ELL_T are the upper and lower limits of the prediction interval for time period T.
With every segment flagged independently for anomalies, instantaneous triggers can be sent out to the accountable stakeholders, alerting them to react without much time latency. In such scenarios, businesses would then drill down the KPIs along dimensions chosen by judgment to arrive at reasons for the anomaly.
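This flagging rule amounts to a simple comparison per segment; a minimal sketch, assuming the AFM returns the interval limits (ELL_T, EUL_T) per segment:

```python
def flag_segments(actuals: dict, intervals: dict) -> dict:
    """Return {segment_id: "spike" | "dip"} for the latest time unit."""
    flags = {}
    for seg, a_t in actuals.items():
        ell_t, eul_t = intervals[seg]  # (ELL_T, EUL_T) from the AFM
        if a_t > eul_t:
            flags[seg] = "spike"
        elif a_t < ell_t:
            flags[seg] = "dip"
    return flags
```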
2.3.2 Network Generation
The network here is a scientific substitute for the extensive drill-downs performed by business consumers. The approach to node formation and node connections is explained in the next section.
2.3.3 Node Formation
Every segment created by the DPM is a node by itself. The number of dimensions involved in the creation of the node indicates its level: dimensions taken one at a time give level 2, a combination of two dimensions gives level 3, and so on up to (n+1) levels.
2.3.4 Node Connections

Every node at level k is connected from one or more nodes at level k-1, wherever one or more of the values involved in forming the node at level k is shared with a node at level (k-1). For example, any node at level 3 formed by [region-central and product category-furniture] would be connected from the nodes [region-central] and [product category-furniture] at level 2. This network structure is enabled using Directed Acyclic Graphs (DAG).
2.3.4.1 Directed Acyclic Graphs (DAG)
In mathematics and computer science, a directed acyclic graph is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again.

DAGs may be used to model many different kinds of information. The reachability relation in a DAG forms a partial order, and any finite partial order may be represented by a DAG using reachability. A collection of tasks that must be ordered into a sequence, subject to constraints that certain tasks must be performed earlier than others, may be represented as a DAG with a vertex for each task and an edge for each constraint; algorithms for topological ordering may be used to generate a valid sequence. Additionally, DAGs may be used as a space-efficient representation of a collection of sequences with overlapping subsequences. DAGs are also used to represent systems of events or potential events and the causal relationships between them. DAGs may also be used to model processes in which data flows in a consistent direction through a network of processors, or states of a repository in a version-control system [6].
Figure 4: Directed Acyclic Graph (DAG)
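One way to realize this construction, assuming each segment is encoded as the set of (dimension, value) pairs that define it; a parent at level k-1 is any node whose filter drops exactly one pair. This is a hypothetical encoding, not the authors' implementation:

```python
from collections import defaultdict
from itertools import combinations

def build_dag(segment_filters: dict) -> dict:
    """segment_filters: {seg_id: {dimension: value}} as built by the DPM."""
    nodes = {seg: frozenset(f.items()) for seg, f in segment_filters.items()}
    by_filter = {v: k for k, v in nodes.items()}
    children = defaultdict(list)
    for seg, node in nodes.items():
        if not node:                      # the all-up root has no parent
            continue
        # Every parent drops exactly one (dimension, value) pair.
        for kept in combinations(node, len(node) - 1):
            parent = frozenset(kept)
            if parent in by_filter:
                children[by_filter[parent]].append(seg)
    return children
```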
2.3.5 Node Information

Every node is identified by its segment ID and includes the respective flags for anomalies obtained for the segments in the previous step.
Figure 5: Network formation - An illustration
2.3.6 DAG as a substitute for Drill Down
The DAG network is used as a substitute for the manual drill-down process, as every possible dimension is fitted above and below the other dimensions and combinations of dimensions, thereby enabling an exhaustive drill-down setup. But in order to understand the cause of every anomaly, it is necessary to understand the relation between the anomalies spotted and to set up the right order of fitting the dimensions. This is enabled by traversing the network generated from every spotted anomaly, as explained in detail in section 2.3.7.
2.3.7 Network traversal for spotting dependent anomalies
The network is primarily used to understand the relationships between the nodes flagged for anomalies. A regular Depth First Search (DFS) traversal is used to walk the network.
2.3.7.1 Steps for primary consolidation
1. The traversal starts from the topmost level (level 1).
2. Every node at level 1 is visited to see if it is flagged for an anomaly (spike or dip).
• If it is flagged, the segment is marked as a Main Segment and fixed as one of the starting points of the DFS.
• If the node is not flagged, the focus moves to the next node in the same level.
3. Step 2 is repeated until level n, so that all the Main Segments are marked. The marked Main Segment nodes fall into two categories – those flagged for spikes and those flagged for dips.
4. From the list of nodes marked as Main Segments, the algorithm picks the first node. With this node as the starting point, the network is traversed to the last level using DFS. If a visited node on the traversal route is flagged for a similar anomaly, that node is marked as an Impacted node corresponding to the Main Segment.
5. Step 4 is repeated for all the Main Segments marked in step 3.
The generated list of Main Segment and Impacted Segment combinations concludes the primary consolidation, providing the relationship between one segment's anomaly and its directly related segment anomalies.
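A compact sketch of steps 1-5 follows, reusing the children adjacency map and per-segment flags from the earlier sketches; every flagged node is treated as a Main Segment and an iterative DFS collects same-type anomalies below it:

```python
def consolidate(children: dict, flags: dict) -> dict:
    """Map each Main Segment to its same-type Impacted Segments."""
    impacted = {}
    for main, kind in flags.items():   # every flagged node is a Main Segment
        stack, seen, hits = [main], {main}, []
        while stack:                   # iterative depth-first search
            for child in children.get(stack.pop(), ()):
                if child in seen:
                    continue
                seen.add(child)
                if flags.get(child) == kind:   # same anomaly type only
                    hits.append(child)
                stack.append(child)
        impacted[main] = hits
    return impacted
```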
2.3.7.2 Root Causing through redundancy removal
The output from the above step contains several redundant factors, which are wiped out in this step to provide a readout of mutually exclusive anomalies for the business users' perusal. The redundancies present are of two types:
a. Inter Level Redundancies
b. Intra Level Redundancies
2.3.7.2.1 Inter Level Redundancies
Since the Main Segment list holds all the flagged nodes, an anomaly spotted at a top level could have its effect permeating down to the bottommost level. These sub-level Main Segments are part of the list of Impacted Segments for the top-level Main Segment. The framework recognizes that keeping all of these in the final readout would reiterate the same anomaly repeatedly in multiple forms.
2.3.7.2.2 Intra Level Redundancies
Intra-level redundancies are due more to inherent aspects of the data. Two or more nodes at the same level could be flagged for a similar anomaly, but more often than not this is the effect of one on the other, as each node carries an inherent volume of the other nodes. This case is more complex to eliminate from the primary consolidation than the inter-level redundancies.
2.3.7.3 Steps to remove redundancies
1. The Main Segments at level n are picked.
2. The lists of Impacted Segments are compared for each pair of Main Segments in this level.
• If there is an intersection, there is a common thread between the pair.
• If there is no intersection, move to the next pair of Main Segments.
3. If there is an intersection in the previous step, a proxy index is computed separately for each Main Segment in the pair. The proxy index is simply the product of (absolute drop in KPI from forecast) and (percentage drop in KPI from forecast). The index has no specific meaning or unit of measurement, but the higher its value, the greater the probability of that segment being a root cause. This index is used to break ties between the paired segments at each level (a minimal sketch of the computation follows this list).
• The winner of the tiebreaker remains in the primary consolidation list.
• The loser is removed from the primary consolidation list.
4. This process is repeated for every pair at each level and across all the levels, moving upward.
5. Now, with the truncated list, the process starts from the topmost level. A Main Segment at the topmost level is selected and its Impacted Segments are chosen.
6. The truncated list is looped over to check whether those Impacted Segments are present as a Main Segment.
• If one is present, the sub-level Main Segment corresponding to the matching Impacted Segment is removed.
• If none of the Impacted Segments in the list match a Main Segment in the sub-levels, skip to the next Main Segment in the current level of focus.
At the end of this iteration, what remains are mutually exclusive anomalies (Main Segments) and their corresponding sub-level variations (Impacted Segments).
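The proxy index of step 3 reduces to a two-line computation; a minimal sketch, assuming a non-zero forecast:

```python
def proxy_index(actual: float, forecast: float) -> float:
    """(absolute deviation from forecast) * (percentage deviation)."""
    absolute = abs(actual - forecast)
    return absolute * (absolute / forecast)
```

Of two Main Segments whose Impacted Segment lists intersect, the one with the lower index is dropped from the primary consolidation.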
Figure 6: Types of Redundancies
2.3.8 Attribution and Reporting
The final step of the framework is to communicate the anomalies in a readable format to the respective stakeholders. The absolute change in the Main Segment is used as the base to calculate the proportion of change at the corresponding sub-level segments tagged to the Main Segment as its Impacted Segments. The sub-level segments are ordered by decreasing value of the calculated proportion. The final consolidated list, after sorting, is then automatically converted into a presentation where every spotted anomaly is shown separately along with its causes / impact.
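A sketch of this attribution step, assuming the absolute changes (actual minus forecast) are available per segment and the Main Segment's change is non-zero:

```python
def attribute(main_delta: float, impacted_deltas: dict) -> list:
    """Share of the Main Segment's change carried by each Impacted
    Segment, sorted in decreasing order for the final readout."""
    shares = {seg: delta / main_delta for seg, delta in impacted_deltas.items()}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
```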
2.3.9 Business Preferences and Reporting Customizations
The framework also provides the flexibility to integrate static business filters or preferences so as to include or exclude certain kinds of anomalies from the final consolidated list. The final reporting can also be customized to the needs of the business users by configuring the reporting module accordingly.
3. RESULTS
The framework was successfully implemented using R and Python and tested on eight different datasets across multiple KPIs. The implementation has benefited businesses by reducing the time latency between report availability and actions taken from the report, by spotting variations in smaller segments which were previously neglected, and by avoiding redundant actions through intimating the right stakeholders based on assigned accountability.

On average, 6 out of 20 root-cause reports generated during the beta phase gave conclusive and actionable insights on the anomalies in the data. The time required to forecast and generate reports increases exponentially with the number of dimensions; however, with the use of parallel processing, the time taken to generate results for a dataset with close to 8 dimensions (200,000 columns) was reduced from 9 hours to 1 hour.
With the aid of the shiny package in R and PyQt, it was possible to create a user-friendly UI for taking inputs and displaying interactive dashboards.
4. CONCLUSIONS
The increasing reliance on data for decision making has added pressure on analysts to provide quick and accurate reports. The volume of data has increased manifold in the past decade; if such vast volumes of data can be managed and processed more efficiently, it could lead to multifold gains across organisations. The framework discussed in this paper finds application in a wide range of domains for KPI tracking, A/B testing, campaign tracking and product evaluations.

The methodology can be further improved by adding functionality such as allowing external regressors and multiplicative metrics, an interface for integration with business intelligence tools, and models that can handle high-frequency data. Distributed computing via big data platforms can further increase the scalability of this approach.
ACKNOWLEDGEMENTS
The authors would like to thank LatentView Analytics for providing the opportunity, inputs and resources to develop this framework. The authors are grateful to their colleagues for their support in this work, and extend their thanks to Prof. Subhash Subramanian and Prof. Mandeep Sandhu for their guidance and inputs in writing the paper and proofreading it.
REFERENCES
[1] R implementation by B. D. Ripley and Martin Maechler (spar/lambda, etc.). https://www.r-project.org/Licenses/GPL-2
[2] Hyndman, R.J., Akram, Md., and Archibald, B. (2008) "The admissible parameter space for exponential smoothing models", Annals of Statistical Mathematics, 60(2), 407-426.
[3] Hyndman, R.J., Koehler, A.B., Snyder, R.D., and Grose, S. (2002) "A state space framework for automatic forecasting using exponential smoothing methods", International Journal of Forecasting, 18(3), 439-454.
[4] Hyndman, R.J. and Khandakar, Y. (2008) "Automatic time series forecasting: The forecast package for R", Journal of Statistical Software, 26(3).
[5] Leonard, Michael (2005) "Large-Scale Automatic Forecasting Using Inputs and Calendar Events", White Paper, 1-27.
[6] Thulasiraman, K. and Swamy, M. N. S. (1992) "5.7 Acyclic Directed Graphs", Graphs: Theory and Algorithms, John Wiley and Sons, p. 118, ISBN 978-0-471-51356-8.
[7] http://robjhyndman.com/hyndsight/intervals/
AUTHORS
Vivek Sankar is a postgraduate in Business Administration from Sri Sathya Sai Institute of Higher Learning and has been working at LatentView Analytics, Chennai, since September 2012.

Somendra Tripathi received his B.Tech degree in computer science from Vellore Institute of Technology in 2013. He is currently working at LatentView Analytics, Chennai.
 
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...ijdpsjournal
 
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...ijdpsjournal
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET Journal
 
Data Visualization advances Business by promoting easy story-telling and info...
Data Visualization advances Business by promoting easy story-telling and info...Data Visualization advances Business by promoting easy story-telling and info...
Data Visualization advances Business by promoting easy story-telling and info...IRJET Journal
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisIRJET Journal
 
Analytics, business cycles and disruptions
Analytics, business cycles and disruptionsAnalytics, business cycles and disruptions
Analytics, business cycles and disruptionsMark Albala
 
EVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERN
EVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERNEVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERN
EVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERNIRJET Journal
 
IRJET- Vendor Management System using Machine Learning
IRJET-  	  Vendor Management System using Machine LearningIRJET-  	  Vendor Management System using Machine Learning
IRJET- Vendor Management System using Machine LearningIRJET Journal
 
What is the relationship between Accounting and an Accounting inform.pdf
What is the relationship between Accounting and an Accounting inform.pdfWhat is the relationship between Accounting and an Accounting inform.pdf
What is the relationship between Accounting and an Accounting inform.pdfannikasarees
 
Smart Traffic Monitoring System Report
Smart Traffic Monitoring System ReportSmart Traffic Monitoring System Report
Smart Traffic Monitoring System ReportALi Baker
 
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...AIRCC Publishing Corporation
 
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...ijcsit
 
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...Eswar Publications
 

Similar to ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS (20)

IRJET - Customer Churn Analysis in Telecom Industry
IRJET - Customer Churn Analysis in Telecom IndustryIRJET - Customer Churn Analysis in Telecom Industry
IRJET - Customer Churn Analysis in Telecom Industry
 
Pradeepnsingh praveenkyadav-131008015758-phpapp02
Pradeepnsingh praveenkyadav-131008015758-phpapp02Pradeepnsingh praveenkyadav-131008015758-phpapp02
Pradeepnsingh praveenkyadav-131008015758-phpapp02
 
Pradeep n singh_praveenkyadav
Pradeep n singh_praveenkyadavPradeep n singh_praveenkyadav
Pradeep n singh_praveenkyadav
 
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
Efficiently Detecting and Analyzing Spam Reviews Using Live Data FeedEfficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
 
Decision Making Framework in e-Business Cloud Environment Using Software Metr...
Decision Making Framework in e-Business Cloud Environment Using Software Metr...Decision Making Framework in e-Business Cloud Environment Using Software Metr...
Decision Making Framework in e-Business Cloud Environment Using Software Metr...
 
IMPLEMENTATION OF A DECISION SUPPORT SYSTEM AND BUSINESS INTELLIGENCE ALGORIT...
IMPLEMENTATION OF A DECISION SUPPORT SYSTEM AND BUSINESS INTELLIGENCE ALGORIT...IMPLEMENTATION OF A DECISION SUPPORT SYSTEM AND BUSINESS INTELLIGENCE ALGORIT...
IMPLEMENTATION OF A DECISION SUPPORT SYSTEM AND BUSINESS INTELLIGENCE ALGORIT...
 
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
 
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
 
Data Visualization advances Business by promoting easy story-telling and info...
Data Visualization advances Business by promoting easy story-telling and info...Data Visualization advances Business by promoting easy story-telling and info...
Data Visualization advances Business by promoting easy story-telling and info...
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend Analysis
 
Analytics, business cycles and disruptions
Analytics, business cycles and disruptionsAnalytics, business cycles and disruptions
Analytics, business cycles and disruptions
 
EVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERN
EVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERNEVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERN
EVALUTION OF CHURN PREDICTING PROCESS USING CUSTOMER BEHAVIOUR PATTERN
 
IRJET- Vendor Management System using Machine Learning
IRJET-  	  Vendor Management System using Machine LearningIRJET-  	  Vendor Management System using Machine Learning
IRJET- Vendor Management System using Machine Learning
 
What is the relationship between Accounting and an Accounting inform.pdf
What is the relationship between Accounting and an Accounting inform.pdfWhat is the relationship between Accounting and an Accounting inform.pdf
What is the relationship between Accounting and an Accounting inform.pdf
 
Smart Traffic Monitoring System Report
Smart Traffic Monitoring System ReportSmart Traffic Monitoring System Report
Smart Traffic Monitoring System Report
 
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
 
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...
 
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...
 

Recently uploaded

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Recently uploaded (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS

  • 1. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.6, No.2, March 2016 DOI : 10.5121/ijdkp.2016.6205 59 ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS Vivek Sankar and Somendra Tripathi Latentview Analytics, Chennai, India ABSTRACT In the business world, decision makers rely heavily on data to back their decisions. With the quantum of data increasing rapidly, traditional methods used to generate insights from reports and dashboards will soon become intractable. This creates a need for efficient systems which can substitute human intelligence and reduce time latency in decision making. This paper describes an approach to process time series data with multiple dimensions such as geographies, verticals, products, efficiently, and to detect anomalies in the data and further, to explain potential reasons for the occurrence of the anomalies. The algorithm implements auto selection of forecast models to make reliable forecasts and detect such anomalies. Depth First Search (DFS) is applied to analyse each of these anomalies and find its root causes. The algorithm filters the redundant causes and reports the insights to the stakeholders. Apart from being a hair-trigger KPI tracking mechanism, this algorithm can also be customized for problems lke A/B testing, campaign tracking and product evaluations. KEYWORDS DFS, A/B Testing, Reporting, Forecasting, Anomaly Spotting. 1. INTRODUCTION With the growing use of Internet and Mobile Apps, the world is seeing a steep increase in the availability and accessibility to different kinds of data – demographical, transactional, social media and so on. Modern businesses are keen to leverage on these data to arrive at smarter and timely decisions. But the existing data setup in a lot of organizations primarily provides post mortem reports of performance. The shift to a more pro-active or an instantaneous reactive approach to data based decision making requires investments in data infrastructure, skilled resources and enabling quicker and efficient dissemination of information to the right stakeholders. With more and more firms investing on their infrastructure to capture necessary data, there is a growing need for automated systems that can step in to process the raw data and provide actionable readouts to the required stakeholders. This paper proposes one such completely automated frequentist framework which when provided with any casted data (a dataset which is a cross product of dimensions involved and their corresponding metric values for each time frame) can run in the background to provide actionable readouts. It helps business decision makers stay updated with the development in their portfolio
anomaly. This approach, which requires minimal human intervention, is an effective aid in scenarios where:

• the number of segments involved cannot be handled manually
• there is a lack of statistical expertise on the user front
• there is a high time lag in decision making due to existing reporting structures

This approach finds application in a wide range of industries such as retail, e-commerce, airlines, insurance, manufacturing, logistics and supply chain, to benefit portfolio managers, analysts, marketers, and product and sales managers, to name a few.

The task of anomaly detection and reporting starts with processing the melted data coming from the database into cast data by producing all possible interactions between the various dimensions in the dataset. Auto-forecasting iterates through the entire cast data, converts it into time series, and generates predictions to be consumed in the later modules. Further, the framework establishes networks to understand the interdependencies between the various segments in the data. Depth First Search is then applied to spot anomalies, which are then checked for redundancy and reported.

The paper is structured as follows. The next section presents the methodology and tools used in the framework; it is broken into three sub-sections to highlight the pivotal modules running the entire framework. Section 3 discusses the implementation. Concluding remarks and future work are given in Section 4.

2. METHODOLOGY

The proposed framework is a novel technique to spot anomalies in data with minimal human intervention. The three prime components required for its functioning are:

1. The actual value of the KPI
2. A ballpark value for the KPI
3. A scientific flagging approach
With the availability of these three components, the method can be applied across a wide range of real-life scenarios and across multiple verticals and KPIs. The following sections explain the steps involved in the framework for creating the above-mentioned components and utilizing them to provide actionable insights to business users.

The entire framework is broadly broken down into the following sections:

• Data Processing Module (DPM)
• Auto Forecasting Module (AFM)
• Pattern Analysis and Reporting Module (PARM)

The flow chart in figure 1 is a concise representation of the internal structure of the modules involved and the flow of data between them.

Figure 1: Framework - Block Diagram

2.1 Module 1: Data Processing Module

2.1.1 Data Input:

This module accepts raw data in any form of casted data. The dataset would include all the necessary dimensions that best describe the KPI, along with the KPI itself, broken down at the finest granularity of the time unit used for reporting. Some easily relatable datasets are sales data of retail stores, web traffic of an e-commerce website, call volume data from call centers, risk decline volumes for a payment gateway, and so on. Any data that encompasses trends and seasonal patterns is a perfect fit for this module and the framework as a whole. Table 1 is an illustration of a sample data source.
Table 1: Sample Casted Dataset

Customer Segment   Product Category   Region    Date        Sales
Consumer           Furniture          Central   1/1/2010     690.77
Consumer           Furniture          Central   2/1/2010    5303.17
Consumer           Office Supplies    East      1/1/2010    1112.11
Consumer           Office Supplies    East      2/1/2010      84.01
Home Office        Furniture          West      11/1/2013  10696.84
Home Office        Furniture          West      12/1/2013   4383.98

Here the first three columns are the dimensions, and the Date column indicates the frequency of reporting. The date column here is at a month level, but in general the framework can be applied to daily, weekly, monthly, quarterly or yearly reports. The final column (Sales) is the actual KPI that the business wants to track using this methodology.

2.1.2 Segments Creation:

Each dimension in the dataset can hold two or more values, and each of these could be of interest to different stakeholders. Referring back to the sample dataset, the Region dimension holds the geographies where Sales were reported, and each of the individual regional heads would want to keep an eye on the Sales of their region. Hence every value in every dimension is potentially a segment. The system further generates segments by combining values of two dimensions; for example, combining Region and Product Category produces segments like (Central_Furniture) and (West_Furniture). Segment formation extends from treating every dimension individually to combining all the available n dimensions.

After the system generates all possible combinations using the available dimensions in the data, each unique combination is given a Segment ID and the initial casted data is melted.

Table 2: Conversion of casted to melted data

Segment ID   Date        Sales
Seg 1        1/1/2010     690.77
Seg 1        2/1/2010    5303.17
Seg 2        1/1/2010    1112.11
Seg 2        2/1/2010      84.01
Seg 3        11/1/2013  10696.84
Seg 3        12/1/2013   4383.98

In short, if there are n dimensions (D1, D2, D3, ..., Dn) and the number of values in each dimension is (X1, X2, X3, ..., Xn) respectively, the total number of combinations generated is (X1+1) * (X2+1) * ... * (Xn+1), where the extra value per dimension corresponds to aggregating over that dimension.
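To make the segment-creation step concrete, the snippet below is a minimal Python sketch of the cross-product and melting logic described above. It assumes a pandas DataFrame shaped like Table 1; the "All" placeholder marking an aggregated dimension and the function names are illustrative conventions, not part of the paper's implementation.

```python
# Minimal sketch of segment creation and melting (section 2.1.2).
from itertools import product

import pandas as pd

def melt_to_segments(df, dims, date_col="Date", kpi_col="Sales"):
    # Each dimension contributes its distinct values plus the aggregate "All",
    # giving (X1+1) * (X2+1) * ... * (Xn+1) candidate segments in total.
    choices = [list(df[d].unique()) + ["All"] for d in dims]
    frames = []
    for seg_id, combo in enumerate(product(*choices), start=1):
        mask = pd.Series(True, index=df.index)
        for dim, value in zip(dims, combo):
            if value != "All":
                mask &= df[dim] == value
        sub = df[mask].groupby(date_col, as_index=False)[kpi_col].sum()
        if not sub.empty:
            sub.insert(0, "SegmentID", f"Seg {seg_id}")
            frames.append(sub)
    return pd.concat(frames, ignore_index=True)

# melted = melt_to_segments(sales, ["Customer Segment", "Product Category", "Region"])
```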
2.1.2.1 Business Preferences/Inputs:

This is an optional step in the overall structure, where the intention is to bucket values in each dimension to club smaller segments. Assuming there are 100 different products in the Product Category and 9 of these 100 products account for 95% of the overall Sales, the remaining 91 products could be clubbed as 'Other Products'. This removes certain segments from the picture based on business preferences. The input could even be flat files containing segments which businesses are not too concerned about. Both these measures reduce the number of segments, thereby improving the operating efficiency and memory consumption of the proposed approach.

The final melted output from this module is fed into the Auto Forecast Module after the removal of all data points corresponding to the latest unit of time and after passing through the multiprocessing stage, which enables parallel processing in the Auto Forecast Module.

2.1.3 Parallel Processing Module:

The forecasting module can be duplicated as multiple processes, since each segment is treated independently of the others. The dataset is broken down into multiple parts with an equal number of segments and fed to multiple sessions of the AFM.

2.2 Module 2: Auto Forecast Module

The auto forecast module is designed to generate forecasts on multiple time series, which are fed to the PARM discussed in the next section. The following steps explain how one-step forecasts are generated.

2.2.1 Data Preparation:

Aggregated data is converted to time series data of the specified frequency. Each time series is then iteratively checked for missing values and outliers. The framework provides the flexibility of using linear and cubic spline interpolation for treating missing values and outliers [1]; the time series is decomposed using STL and the trend component is smoothened. This helps in approximating missing values and minimizing the effect of outliers. Each time series is also checked for 0-padding (both leading and trailing). Powerful transformations such as Box-Cox can be applied before moving on to the next phase.

Figure 2: Missing value treatment using STL + spline smoothing
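As one possible reading of this data-preparation step, the sketch below combines pandas interpolation with the STL implementation in statsmodels. The monthly frequency, cubic-spline choice and smoothing window are illustrative defaults, not values prescribed by the paper.

```python
# Minimal sketch of missing-value and outlier treatment (section 2.2.1).
# Assumes a monthly pandas Series with a DatetimeIndex.
import pandas as pd
from statsmodels.tsa.seasonal import STL

def prepare_series(y: pd.Series, period: int = 12) -> pd.Series:
    # Regularise the index and fill gaps; cubic-spline interpolation is one
    # of the two options the framework mentions (linear is the other).
    y = y.asfreq("MS").interpolate(method="spline", order=3,
                                   limit_direction="both")
    # Decompose with STL and smooth the trend to damp the effect of outliers.
    result = STL(y, period=period, robust=True).fit()
    trend = result.trend.rolling(window=3, center=True, min_periods=1).mean()
    return trend + result.seasonal  # de-noised series passed to forecasting
```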
2.2.2 Model selection:

This phase involves choosing the right set of models for a given time series. The current pool includes Exponential Smoothing (the ETS implementation for automatic selection), auto.arima [2], Nnet (feed-forward neural networks with a single hidden layer), TSLM (for fitting linear models to time series, including trend and seasonality), TBATS (exponential smoothing state space model with Box-Cox transformation, ARMA errors, trend and seasonal components) [3, 4] and STL.

It is important that the seasonality component is correctly identified before forecasting. Over-parameterization, or force-fitting seasonality when there is no real seasonal component, might produce unreliable forecasts. In cases where there are too few data points, only ARIMA and regression models are used. Croston's method is used for time series with intermittent data, and TBATS is used for data with weekly and annual seasonality. All relevant models are picked for forecasting in this phase.

Figure 3: Model selection based on holdout accuracy

2.2.3 Forecasting:

The level of the time series must be specified before this step starts. A small portion of the data is held out to be used for validation [5]. This helps prevent overfitting and gives a truer estimate of the generalization error. The module then runs all the models picked for the given time series and provides the residuals, fit statistics and confidence intervals.

2.2.4 Evaluation:

The metric used for evaluating the fit can be specified at the start of execution of this module. The framework provides MAPE (mean absolute percentage error), MSE (mean squared error), MAE (mean absolute error) and MASE as possible measures of forecast accuracy. A summary of all fit statistics from the selected models applied to the data is generated, and the winner model is chosen as the one achieving the best accuracy. Poor predictions are flagged at this step for manual intervention.
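The paper's model pool comes from R's forecast package; as a language-neutral illustration of the holdout-based selection loop in sections 2.2.2-2.2.4, the sketch below uses two stand-in candidates from statsmodels. The model orders and holdout length are illustrative assumptions.

```python
# Minimal sketch of holdout-based model selection and evaluation.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def mape(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def pick_winner(y: pd.Series, holdout: int = 6):
    train, test = y[:-holdout], y[-holdout:]
    candidates = {
        "ets": ExponentialSmoothing(train, trend="add", seasonal="add",
                                    seasonal_periods=12).fit(),
        "arima": ARIMA(train, order=(1, 1, 1)).fit(),
    }
    scores = {name: mape(test, model.forecast(holdout))
              for name, model in candidates.items()}
    winner = min(scores, key=scores.get)  # lowest holdout MAPE wins
    return winner, scores  # the winner is then refit on the full series
```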
Table 3: Selection of winner model based on performance metric

Store #    ETS     TBATS   Arima   Nnet     Tslm    Croston's*
Store 53   7.2%    4.3%    4.7%    2.0%     3.3%    NA
Store 54   72.2%   44.8%   41.4%   104.3%   15.9%   NA
Store 55   2.6%    5.4%    2.6%    5.4%     6.6%    NA
Store 56   5.6%    2.5%    2.8%    12.6%    5.9%    NA
Store 57   4.5%    1.8%    4.4%    4.1%     2.3%    NA
Store 58   3.9%    2.9%    5.4%    4.8%     2.5%    NA
Store 59   11.5%   NA      14.9%   31.5%    20.8%   NA
Store 60   NA      NA      10.1%   NA       8.9%    NA
Store 61   1.8%    1.3%    3.5%    33.2%    1.8%    NA
Store 62   33.8%   52.1%   71.3%   27.0%    14.0%   NA
Store 63   12.7%   4.3%    7.8%    4.0%     8.7%    NA
Store 64   4.1%    1.4%    3.0%    8.0%     2.5%    NA

2.2.5 One step forecast:

The final forecasts are generated by running the winner model on the entire data, as omitting the most recent data points could cause loss of valuable information. The residuals, confidence interval and one-step forecasts for this model are then passed to the next module.

2.3 Module 3: Pattern Analysis and Reporting Module (PARM)

PARM receives data feeds separately from the Data Processing Module and the Auto Forecast Module.

• DPM provides the KPI values of the latest unit of time for every segment. These are the values which were kept aside from the melted dataset before it was fed into the AFM.
• AFM provides reliable forecast values for every segment, along with the residuals and prediction intervals based on the required confidence level.

2.3.1 Segment Level Flagging

Every available segment in the data has an actual value, a ballpark value (the forecasted output) and a prediction interval based on the point forecast.

2.3.1.1 Prediction Interval

As per Hyndman [7], a prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval. For example, one might give an 80% interval for the forecast of GDP in 2014; the actual GDP in 2014 should then lie within the interval with probability 0.8. Prediction intervals can arise in both Bayesian and frequentist statistics.

A confidence interval, in contrast, is an interval associated with a parameter and is a frequentist concept. The parameter is assumed to be non-random but unknown, and the confidence interval is computed
from data. Because the data are random, the interval is random. A 95% confidence interval will contain the true parameter with probability 0.95. That is, over a large number of repeated samples, 95% of the intervals would contain the true parameter.

The range of the prediction interval depends on two factors:

i. The accuracy of the forecast: the higher the accuracy, the narrower the band; poor accuracy pushes the limits towards -∞ and +∞.
ii. The desired confidence percentage: the higher the required confidence, the broader the band.

2.3.1.2 Reasons for independent forecasts

Every segment generated by the DPM could be unique in its properties and could thereby exhibit its own trend and seasonal attributes. Also, obtaining forecast values from a sub-granular level would aggregate the errors of the sub-granular levels and hence affect the accuracy of the overall forecast.

2.3.1.2.1 Anomalies – Dips and spikes

The actual value of the KPI corresponding to the latest unit of time is pitted against the prediction interval for that segment:

A_T > EUL_T => flag for spike
A_T < ELL_T => flag for dip

where A_T is the actual KPI value for time period T, and EUL_T and ELL_T are the upper and lower limits of the prediction interval for time period T.

With every segment flagged independently for anomalies, instantaneous triggers can be sent to the accountable stakeholders, alerting them to react without much time latency. In such scenarios, businesses start to drill down the KPIs using dimensions, based on their judgment, to arrive at reasons for the anomaly.
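A minimal sketch of this flagging rule follows; the function and field names are illustrative, and the interval bounds are taken from the winner model's prediction interval.

```python
# Minimal sketch of segment-level flagging (section 2.3.1.2.1).
def flag_segment(actual: float, lower: float, upper: float) -> str:
    if actual > upper:
        return "spike"   # A_T > EUL_T
    if actual < lower:
        return "dip"     # A_T < ELL_T
    return "normal"
```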
2.3.2 Network Generation

The network here is a scientific substitute for the extensive drill-down operations performed by business consumers. The approaches to node formation and node connection are explained in the next sections.

2.3.3 Node Formation

Every segment created by the DPM is a node by itself. The number of dimensions involved in the creation of a node indicates the level of the node, i.e. if dimensions are taken one at a time the level is 2, a combination of two dimensions indicates level 3, and this extends up to (n+1) levels.

2.3.4 Node Connections

Every node at level k is connected from one or more nodes at level k-1, wherever one or more values involved in the formation of the node at level k are shared with a node at level (k-1). For example, any node at level 3 formed by [region-central and product category-furniture] would be connected from the nodes [region-central] and [product category-furniture] at level 2. This network structure is realised using Directed Acyclic Graphs (DAG).

2.3.4.1 Directed Acyclic Graphs (DAG)

In mathematics and computer science, a directed acyclic graph is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again.

DAGs may be used to model many different kinds of information. The reachability relation in a DAG forms a partial order, and any finite partial order may be represented by a DAG using reachability. A collection of tasks that must be ordered into a sequence, subject to constraints that certain tasks must be performed earlier than others, may be represented as a DAG with a vertex for each task and an edge for each constraint; algorithms for topological ordering may be used to generate a valid sequence. Additionally, DAGs may be used as a space-efficient representation of a collection of sequences with overlapping subsequences. DAGs are also used to represent systems of events or potential events and the causal relationships between them. DAGs may also be used to model processes in which data flows in a consistent direction through a network of processors, or states of a repository in a version-control system [6].

Figure 4: Directed Acyclic Graph (DAG)

2.3.5 Node Information

Every node is identified by its segment ID and includes the respective flags for anomalies obtained for the segments in the previous step.
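As a sketch of how such a network could be materialised, the snippet below builds the level structure with networkx, representing each node as a frozenset of (dimension, value) pairs. This representation is an illustrative assumption; the paper does not prescribe a data structure.

```python
# Minimal sketch of network generation (sections 2.3.3-2.3.4) using networkx.
from itertools import combinations, product

import networkx as nx

def build_dag(dims_values: dict) -> nx.DiGraph:
    # dims_values example: {"Region": ["Central", "East"], "Product": [...]}
    g = nx.DiGraph()
    dims = list(dims_values)
    for k in range(1, len(dims) + 1):
        for subset in combinations(dims, k):
            for combo in product(*(dims_values[d] for d in subset)):
                node = frozenset(zip(subset, combo))
                g.add_node(node, level=len(node) + 1)  # overall total = level 1
                # Connect from every level-(k-1) node sharing k-1 of its values.
                for item in node:
                    parent = node - {item}
                    if parent:
                        g.add_edge(parent, node)
    return g
```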
Figure 5: Network formation - An illustration

2.3.6 DAG as a substitute for Drill Down

The DAG network is used as a substitute for the manual drill-down process, since every possible dimension is fitted above and below the other dimensions and combinations of dimensions, thereby enabling an exhaustive drill-down setup. But in order to understand the cause of every anomaly, it is necessary to understand the relation between the anomalies spotted and to set up the right order of fitting the dimensions. This is enabled by traversing the network generated from every spotted anomaly, as explained in detail in section 2.3.7.

2.3.7 Network traversal for spotting dependent anomalies

The network is primarily used to understand the relationship between the nodes flagged for anomalies. A regular Depth First Search (DFS) traversal is used to traverse the network; a sketch follows the steps below.

2.3.7.1 Steps for primary consolidation

1. The traversal starts from the top-most level (level 1).
2. Every node at level 1 is visited to check whether it is flagged for an anomaly (spike or dip).
• If it is flagged, the segment is marked as a Main Segment and fixed as one of the starting points of the DFS.
• If the node is not flagged, the focus moves to the next node in the same level.
3. Step 2 is repeated until level n, and all the Main Segments are marked. The marked Main Segment nodes fall into two categories: those flagged for spikes and those flagged for dips.
4. From the list of nodes marked as Main Segments, the algorithm picks the first node. With this node as the starting point, the network is traversed down to the last level using DFS. If a visited node on the traversal route is flagged for the same type of anomaly, that node is marked as an Impacted Node of the corresponding Main Segment.
5. Step 4 is repeated for all the Main Segments marked in step 3.

The generated list of Main Segment and Impacted Segment combinations concludes the primary consolidation, thereby establishing the relationship between one segment's anomaly and its directly related segment anomalies.
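Under the assumption that each node of the DAG above carries a "flag" attribute set by the flagging step, a minimal sketch of the primary consolidation is:

```python
# Minimal sketch of primary consolidation (section 2.3.7.1). Assumes a
# "flag" attribute ("spike", "dip" or None) on every node of the DAG.
import networkx as nx

def primary_consolidation(g: nx.DiGraph) -> dict:
    # Every flagged node is a Main Segment and a DFS starting point.
    main_segments = [n for n, data in g.nodes(data=True) if data.get("flag")]
    impacted = {}
    for main in main_segments:
        anomaly = g.nodes[main]["flag"]
        # DFS towards the lowest level; descendants with the same anomaly
        # type become Impacted Segments of this Main Segment.
        impacted[main] = [n for n in nx.dfs_preorder_nodes(g, source=main)
                          if n != main and g.nodes[n].get("flag") == anomaly]
    return impacted
```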
2.3.7.2 Root causing through redundancy removal

The output from the above step contains several redundant factors, which are wiped out in this step to provide a readout of mutually exclusive anomalies for the business users' perusal. The redundancies present are of two types:

a. Inter Level Redundancies
b. Intra Level Redundancies

2.3.7.2.1 Inter Level Redundancies

Since the Main Segment list contains all the flagged nodes, an anomaly spotted at a top level could have its effect permeating down to the bottom-most level. These sub-level Main Segments are part of the list of Impacted Segments for the top-level Main Segment. The framework recognizes that keeping all of these in the final readout would reiterate the same anomaly repeatedly in multiple forms.

2.3.7.2.2 Intra Level Redundancies

Intra Level Redundancies are due more to inherent aspects of the data. Two or more nodes at the same level could be flagged for a similar anomaly, but more often than not this is the effect of one on the other, as each node contains an inherent volume of the other nodes. This is a more complex case to eliminate from the primary consolidation, unlike the Inter Level Redundancies.
2.3.7.3 Steps to remove redundancies

1. The Main Segments at level n are picked.
2. The lists of Impacted Segments for each pair of Main Segments in this level are compared.
• If there is an intersection, there is a common thread.
• If there is no intersection, move to the next pair of Main Segments.
3. If there is an intersection in the previous step, create a proxy index separately for each of the Main Segments in the pair. The proxy index is simply the product of (absolute drop in KPI from forecast) and (percentage drop in KPI from forecast). The index has no specific meaning or unit of measurement, but the higher its value, the greater the probability of the segment being a root cause. This index is used as a tie-breaker between the paired segments at each level (see the sketch after this list).
• The winner of the tie-break remains in the primary consolidation list.
• The loser is removed from the primary consolidation list.
4. This process is repeated for every pair at each level and across all the levels, moving upward.
5. Now, with the truncated list, the process starts from the top-most level. A Main Segment at the top-most level is selected and its Impacted Segments are chosen.
6. The truncated list is looped through to check whether the Impacted Segments are present as Main Segments.
• If one is present, the sub-level Main Segment corresponding to the matching Impacted Segment is removed.
• If none of the Impacted Segments in the list match a Main Segment in the sub-levels, skip to the next Main Segment in the current level of focus.

At the end of this iteration, what remains are mutually exclusive anomalies (Main Segments) and their corresponding sub-level variations (Impacted Segments).

Figure 6: Types of Redundancies
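A minimal sketch of the proxy index and the tie-break in step 3; the names and the stats structure are illustrative.

```python
# Minimal sketch of the proxy index tie-break (section 2.3.7.3, step 3).
# stats maps a segment to its (actual, forecast) pair.
def proxy_index(actual: float, forecast: float) -> float:
    absolute_change = abs(actual - forecast)
    percent_change = absolute_change / forecast if forecast else 0.0
    # Higher index => higher probability of being the root cause.
    return absolute_change * percent_change

def tie_break(seg_a, seg_b, stats: dict):
    # The winner stays in the primary consolidation; the loser is dropped.
    if proxy_index(*stats[seg_a]) >= proxy_index(*stats[seg_b]):
        return seg_a
    return seg_b
```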
2.3.8 Attribution and Reporting

The final step of the framework is to communicate the anomalies in a readable format to the respective stakeholders. The absolute change in the Main Segment is used as the base to calculate the proportion of change contributed by each sub-level segment tagged to the Main Segment as an Impacted Segment. The sub-level segments are ordered by decreasing value of the calculated proportion, as sketched below. The final consolidated list, after sorting, is then automatically converted into a presentation where every spotted anomaly is shown separately along with its causes and impact.
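A minimal sketch of this proportion calculation, with illustrative inputs:

```python
# Minimal sketch of the attribution step (section 2.3.8): each Impacted
# Segment's share of the Main Segment's absolute change, sorted descending.
def attribute(main_change: float, impacted_changes: dict) -> list:
    shares = {segment: abs(change) / abs(main_change)
              for segment, change in impacted_changes.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```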
2.3.9 Business Preferences and Reporting Customizations

The framework also provides the flexibility to integrate static business filters or preferences to include or exclude certain kinds of anomalies from the final consolidated list. The final reporting can also be customized to the needs of the business users by setting up the reporting module accordingly.

3. RESULTS

The framework was implemented using R and Python and tested on eight different datasets across multiple KPIs. The implementation has yielded positive results for businesses in terms of reducing the time latency between report availability and actions taken from the report, spotting variations in smaller segments which were previously neglected, and avoiding redundant actions by notifying the right stakeholders based on assigned accountability. On average, 6 out of 20 root cause reports generated during the beta phase gave conclusive and actionable insights on the anomalies in the data.

The time required to forecast and generate reports increases exponentially with the number of dimensions. However, with the use of parallel processing, the time taken to generate results for a dataset with close to 8 dimensions (200,000 columns) was reduced from 9 hours to 1 hour. With the aid of the shiny package in R and PyQt, it was possible to create a user-friendly UI for taking inputs and displaying interactive dashboards.

4. CONCLUSIONS

The increasing reliance on data for decision-making has put pressure on analysts to provide quick and accurate reports. The volume of data has increased manifold in the past decade; if such vast volumes of data can be managed and processed more efficiently, it could lead to multi-fold gains across organisations. The framework discussed in this paper finds application in a wide range of domains for KPI tracking, A/B testing, campaign tracking and product evaluations. The methodology can be further improved by adding functionalities such as support for external regressors and multiplicative metrics, an interface for integration with business intelligence tools, and models that can handle high-frequency data. Distributed computing via big data platforms can further increase the scalability of this approach.

ACKNOWLEDGEMENTS

The authors would like to thank LatentView Analytics for providing the opportunity, inputs and resources to develop this framework. The authors are grateful to their colleagues who supported this work, and extend their thanks to Prof. Subhash Subramanian and Prof. Mandeep Sandhu for their guidance and inputs in writing and proofreading the paper.

REFERENCES

[1] R implementation by B. D. Ripley and Martin Maechler (spar/lambda, etc.). https://www.r-project.org/Licenses/GPL-2
[2] Hyndman, R.J., Akram, Md., and Archibald, B. (2008) "The admissible parameter space for exponential smoothing models", Annals of Statistical Mathematics, 60(2), 407-426.
[3] Hyndman, R.J., Koehler, A.B., Snyder, R.D., and Grose, S. (2002) "A state space framework for automatic forecasting using exponential smoothing methods", International Journal of Forecasting, 18(3), 439-454.
[4] Hyndman, R.J. and Khandakar, Y. (2008) "Automatic time series forecasting: The forecast package for R", Journal of Statistical Software, 26(3).
[5] Leonard, Michael (2005) "Large-Scale Automatic Forecasting Using Inputs and Calendar Events", White Paper, 1-27.
[6] Thulasiraman, K. and Swamy, M. N. S. (1992) "5.7 Acyclic Directed Graphs", Graphs: Theory and Algorithms, John Wiley and Sons, p. 118, ISBN 978-0-471-51356-8.
[7] http://robjhyndman.com/hyndsight/intervals/

AUTHORS

Vivek Sankar is a postgraduate in Business Administration from Sri Sathya Sai Institute of Higher Learning and has been working at LatentView Analytics, Chennai, since September 2012.

Somendra Tripathi received his B.Tech degree in computer science from Vellore Institute of Technology in 2013. He is currently working at LatentView Analytics, Chennai.