In the business world, decision makers rely heavily on data to back their decisions. With the volume of
data increasing rapidly, traditional methods used to generate insights from reports and dashboards will
soon become intractable. This creates a need for efficient systems that can substitute for human intelligence
and reduce latency in decision making. This paper describes an approach to efficiently process time series
data with multiple dimensions such as geographies, verticals and products, to detect anomalies in
the data and, further, to explain potential reasons for the occurrence of those anomalies. The algorithm
automatically selects forecast models to make reliable forecasts and detect such anomalies. Depth
First Search (DFS) is applied to analyse each of these anomalies and find its root causes. The algorithm
filters out redundant causes and reports the insights to the stakeholders. Apart from being a hair-trigger
KPI tracking mechanism, this algorithm can also be customized for problems like A/B testing, campaign
tracking and product evaluations.
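The attribution step lends itself to a short illustration. Below is a minimal, hypothetical Python sketch of the DFS idea described above: walk down a dimension hierarchy (geography, vertical, product), follow only the children whose actuals deviate from forecast, and report the deepest anomalous nodes, so redundant parent-level causes are filtered out. The node structure, threshold and data are illustrative assumptions, not the paper's implementation.

# Minimal sketch of DFS-based anomaly attribution over a dimension hierarchy.
# Node layout and threshold are illustrative assumptions, not the paper's code.

def is_anomalous(node, threshold=0.2):
    """Flag a node whose actual deviates from its forecast by more than
    `threshold` (relative error)."""
    return abs(node["actual"] - node["forecast"]) > threshold * abs(node["forecast"])

def find_root_causes(node, causes=None):
    """Depth-first search: descend into anomalous children; a node with no
    anomalous children is reported as a root cause, which filters out
    redundant parent-level causes."""
    if causes is None:
        causes = []
    anomalous_children = [c for c in node.get("children", []) if is_anomalous(c)]
    if not anomalous_children:
        causes.append(node["name"])           # deepest node on the anomaly trail
    else:
        for child in anomalous_children:      # recurse only where the anomaly persists
            find_root_causes(child, causes)
    return causes

# Toy hierarchy: total KPI -> geography -> product.
tree = {
    "name": "total", "actual": 70, "forecast": 100,
    "children": [
        {"name": "EMEA", "actual": 20, "forecast": 50,
         "children": [{"name": "EMEA/product-A", "actual": 5, "forecast": 35}]},
        {"name": "APAC", "actual": 50, "forecast": 50},
    ],
}
if is_anomalous(tree):
    print(find_root_causes(tree))   # -> ['EMEA/product-A']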
An efficient data pre processing frame work for loan credibility prediction s...eSAT Journals
Abstract
In today's world, data mining has become very interesting and popular across applications, especially in the
banking industry. We have too much data and too much technology but not enough useful information; this is why we need the data
mining process. The importance of data mining is increasing, and studies have been done in many domains to solve a wide range of
problems using various data mining techniques. The art of preparing data for data mining is the most important and time-consuming
phase. In developing countries like India, bankers should be vigilant against fraudsters, who can create serious problems
for a banking organization. By applying data mining techniques, it is possible to build a successful predictive model that helps
bankers take proper decisions. This paper covers the set of techniques under the umbrella of data preprocessing, based
on a case study of bank loan transaction data. The proposed model helps distinguish borrowers who repay loans promptly
from those who do not. The framework helps organizations implement better CRM through better prediction ability.
Keywords: Data preprocessing, Customer behavior, Input columns, Outlier columns, Target column, Dataset, CRM
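As an illustration of the preprocessing steps this abstract alludes to (input columns, outlier columns, target column), here is a minimal, hedged pandas/scikit-learn sketch; the file name and column names are hypothetical, not taken from the paper.

# Illustrative preprocessing for loan data (hypothetical file and column names).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loans.csv")                      # hypothetical input file

# Separate input columns from the target column.
target = df["repaid_on_time"]                      # hypothetical 0/1 target column
X = df.drop(columns=["repaid_on_time"])

# Fill missing numeric values and clip outlier columns to the 1st-99th percentile.
num_cols = X.select_dtypes("number").columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
X[num_cols] = X[num_cols].clip(X[num_cols].quantile(0.01),
                               X[num_cols].quantile(0.99), axis=1)

# One-hot encode categoricals and scale numeric features.
X = pd.get_dummies(X)
X[num_cols] = StandardScaler().fit_transform(X[num_cols])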
This white paper, written for non-technical audiences, discusses the scope and applicability of the IEEE (Institute of Electrical and Electronics Engineers) 1366 standard methodology. It focuses on solutions available to improve utility indices and Network Operations Center (NOC) metrics. The purpose is to provide an educational and informational white paper, not to overload the audience with statistical formulas and complex details.
One goal of this white paper is to illustrate the statistical concepts and raise awareness. Another is to summarize the concepts and their applications so they can be shared with a wider audience such as Critical Infrastructure, Network Operations Centers (NOCs), and the Sustainability program.
A Clustering Method for Weak Signals to Support Anticipative IntelligenceCSCJournals
Organizations need appropriate anticipative information to support their decision making process. In contrast to some strategic information analyses that help managers establish patterns using past information, anticipative intelligence is intended to help managers act based on the analysis of pieces of information that indicate a trend that may become true in the future. One example of this kind of information is known as a weak signal, which is a short text related to a specific domain. In this work, pairs of weak signals, written in Portuguese, are compared to each other so that similarities can be identified and correlated weak signals can be clustered together. The idea is that the analysis of the resulting similar groups may lead to the formulation of a hypothesis that can support the decision making process. The proposed technique consists of two main steps: preprocessing the set of weak signals and clustering. The proposed method was evaluated on a database of bio-energy weak signals. The main innovations of this work are: (i) the application of a computational methodology from the literature for analyzing anticipative information; and (ii) the adaptation of data mining techniques to implement this methodology in a software product.
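The two-step technique (preprocess, then cluster) can be illustrated with a generic sketch. The following Python fragment uses TF-IDF plus agglomerative clustering as one plausible realization; it is not the paper's exact method, the toy texts are in English rather than Portuguese, and it assumes scikit-learn 1.2 or later.

# One plausible realization of preprocess-then-cluster for short weak-signal
# texts: TF-IDF features, then agglomerative clustering on cosine distance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

signals = [
    "new enzyme lowers cost of ethanol production",
    "cheaper ethanol via enzymatic breakthrough",
    "government subsidy announced for biodiesel",
]  # toy weak signals (the paper's corpus is in Portuguese)

vec = TfidfVectorizer()                  # preprocessing: tokenize and weight terms
X = vec.fit_transform(signals).toarray()

# `metric` requires scikit-learn >= 1.2 (older versions call it `affinity`).
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(X)
print(labels)                            # similar signals share a label, e.g. [0 0 1]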
Prediction of customer propensity to churn - Telecom IndustryPranov Mishra
The aim of this project is to help a telecom company with insights on customer behavior that would be useful for customer retention. The specific goals expected to be achieved are given below:
1. Identification of the top variables driving likelihood of churn
2. Build a predictive model to identify customers who have the highest probability of terminating services with the company.
3. Build a lift chart for optimizing effort by targeting most of the potential churners with the least contact effort. Here, with 30% of the total customer pool, the model accurately captures 33% of the total potential churn candidates.
The models tried in order to arrive at the best one are listed below; a sketch of the SMOTE-plus-random-forest option follows the list.
1. Simple Models like Logistic Regression & Discriminant Analysis with different thresholds for classification
2. Random Forest after balancing the dataset using Synthetic Minority Oversampling Technique (SMOTE)
3. Ensemble of five individual models and predicting the output by averaging the individual output probabilities
4. XGBoost algorithm
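As referenced above, here is a hedged sketch of the second option: oversample the minority class with SMOTE, then fit a random forest. It uses imbalanced-learn, with synthetic data standing in for the telecom dataset.

# Sketch of model 2: balance with SMOTE, then fit a random forest.
# Synthetic data stands in for the project's telecom dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # oversample churners
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)

# Rank customers by churn probability; this ordering drives a lift chart.
churn_prob = clf.predict_proba(X_te)[:, 1]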
Machine Learning for Business - Eight Best Practices for Getting StartedBhupesh Chaurasia
Though the term machine learning has become very visible in
the popular press over the past few years—making it appear to be the newest shiny object—the technology has actually been
in use for decades. In fact, machine learning algorithms such as decision trees are already in use by many organizations for predictive analytics.
Extract Business Process Performance using Data MiningIJERA Editor
This paper aimed to analyze the performance of a business process using process mining. Performance is very important, especially in large systems. The process of repairing devices was used as a case study, and the Fuzzy Miner algorithm was used to analyze the performance of the process model.
THE EFFECTIVENESS OF DATA MINING TECHNIQUES IN BANKINGcsijjournal
The aim of this study is to identify the extent of data mining activities practiced by banks. Data mining is the ability to link structured and unstructured information with the changing rules by which people apply it. It is not a technology, but a solution that applies information technologies. Currently,
several industries, including banking, finance, retail, insurance, publicity, database marketing and sales prediction, are using data mining tools. Leading banks are using data mining tools for customer segmentation and benefit, credit scoring and approval, predicting payment lapse, marketing, detecting illegal transactions, and more. The banking industry is realizing that it is possible to gain a competitive advantage by deploying data mining. This article examines the effectiveness of data mining techniques in organized banking. It also discusses standard tasks involved in data mining and evaluates various data mining applications in different
sectors.
International Refereed Journal of Engineering and Science (IRJES) irjes
Ad hoc & sensor networks, Adaptive applications, Aeronautical Engineering, Aerospace Engineering
Agricultural Engineering, AI and Image Recognition, Allied engineering materials, Applied mechanics,
Architecture & Planning, Artificial intelligence, Audio Engineering, Automation and Mobile Robots
Automotive Engineering….
AN ENHANCED FREQUENT PATTERN GROWTH BASED ON MAPREDUCE FOR MINING ASSOCIATION...IJDKP
In mining frequent itemsets, one of the most important algorithms is FP-growth. FP-growth compresses the
information needed for mining frequent itemsets into an FP-tree and recursively constructs conditional
FP-trees to find all frequent itemsets. In this paper, we propose the EFP-growth (enhanced FP-growth)
algorithm to achieve the quality of FP-growth. Our proposed method implements EFP-growth
on the MapReduce framework using Hadoop. The new method achieves high
performance compared with basic FP-growth. EFP-growth can work with large datasets to
discover frequent patterns in a transaction database. With our method, the execution time under
different minimum supports is decreased.
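For readers unfamiliar with the baseline, the following sketch runs plain FP-growth on a toy transaction database using the mlxtend library; the paper's contribution, distributing this computation over Hadoop MapReduce, is not reproduced here.

# Baseline FP-growth on a toy transaction database using mlxtend
# (the paper's EFP-growth distributes this work across Hadoop MapReduce).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "milk"], ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets at a minimum support of 50%.
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))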
RECOMMENDATION FOR WEB SERVICE COMPOSITION BY MINING USAGE LOGSIJDKP
Web service composition has been one of the most researched topics of the past decade. Novel methods of
web service composition proposed in the literature include semantics-based composition and WSDL-based
composition. Although these methods provide promising results for composition, search and
discovery of web services based on QoS parameters of the network and on the semantics or ontology associated with
WSDL, they do not address composition based on the usage of web services. Web service usage logs capture
time series data of web service invocations by business objects, which innately captures patterns and
workflows associated with business operations. Web service composition based on such patterns and
workflows can greatly streamline business operations. In this research work, we explore and
implement methods of mining web service usage logs. The main objectives include identifying usage
associations of services, linking one service invocation with another, and evaluating the causal relationship
between associations of services.
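One simple way to surface usage associations of the kind described is to count which invocation follows which within a session. The sketch below assumes a hypothetical sessionized log format; it is an illustration, not the authors' mining method.

# Counting which service invocation follows which within a session surfaces
# usage associations (session format and service names are hypothetical).
from collections import Counter
from itertools import pairwise   # Python 3.10+

sessions = [
    ["getQuote", "createOrder", "sendInvoice"],
    ["getQuote", "createOrder", "shipOrder"],
    ["getQuote", "sendInvoice"],
]

follows = Counter(pair for s in sessions for pair in pairwise(s))
for (a, b), n in follows.most_common(3):
    print(f"{a} -> {b}: {n} times")   # e.g. getQuote -> createOrder: 2 times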
A SURVEY OF LINK MINING AND ANOMALIES DETECTIONIJDKP
This survey introduces the emergence of link mining and its application to detecting anomalies, which
can include events that are unusual, out of the ordinary or rare, unexpected behaviour, or outliers.
A HYBRID CLASSIFICATION ALGORITHM TO CLASSIFY ENGINEERING STUDENTS’ PROBLEMS ...IJDKP
Social networking sites have brought a new horizon for expressing the views and opinions of individuals.
Moreover, they provide a medium for students to share their sentiments, including struggles and joy, during the
learning process. Such informal information provides a great venue for decision making. The large and growing
scale of this information requires automatic classification techniques. Sentiment analysis is one of the automated
techniques for classifying large data. Existing predictive sentiment analysis techniques are widely used to
classify reviews on e-commerce sites to provide business intelligence. However, they are not very useful
for drawing decisions in an education system, since they classify sentiments into merely three pre-set
categories: positive, negative and neutral. Moreover, classifying students' sentiments into a positive or
negative category does not provide deeper insight into their problems and perks. In this paper, we propose
a novel Hybrid Classification Algorithm to classify engineering students' sentiments. Unlike traditional
predictive sentiment analysis techniques, the proposed algorithm makes the sentiment analysis process
descriptive. Moreover, it classifies engineering students' perks, in addition to their problems, into several
categories to help future students and the education system in decision making.
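The paper's hybrid algorithm is not reproduced here, but the general task, classifying short student posts into several categories rather than just positive/negative/neutral, can be sketched with a standard text classifier; the categories and posts below are invented for illustration.

# Generic multi-category classifier for short student posts (not the paper's
# hybrid algorithm): TF-IDF features plus Naive Bayes over example categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = ["too many assignments this week", "lab equipment is outdated",
         "finally cracked the internship interview"]
labels = ["heavy workload", "infrastructure", "perk"]   # illustrative categories

model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(posts, labels)
print(model.predict(["no free slots in the lab again"]))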
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYIJDKP
Data integration is the process of collecting data from different data sources and providing the user with
a unified view of answers that meets his requirements. The quality of query answers can be improved by
assessing the quality of data sources according to some quality measures and retrieving data only from
the significant ones. Query answers returned from significant data sources can then be ranked according to
the quality requirements specified in the user query, and the proposed query types return only the top-k query
answers. In this paper, a data integration framework called Data Integration to Return Ranked Alternatives
(DIRA) is introduced. It depends on a data quality assessment module that uses data source quality
to choose the significant sources, and on a ranking algorithm that returns the top-k query answers according
to the different query types.
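The selection-and-ranking idea can be shown in miniature: keep only sources whose quality score passes a threshold, then order answers by source quality and take the top k. The scores, threshold and answers below are hypothetical, not DIRA's actual module.

# Miniature version of the DIRA idea: filter sources by a quality threshold,
# then return the top-k answers ranked by source quality (all values invented).
sources = {"S1": 0.92, "S2": 0.35, "S3": 0.78}        # quality assessments
answers = [("S1", "row-a"), ("S2", "row-b"), ("S3", "row-c"), ("S1", "row-d")]

significant = {s for s, q in sources.items() if q >= 0.5}   # drop weak sources
ranked = sorted((a for a in answers if a[0] in significant),
                key=lambda a: sources[a[0]], reverse=True)
top_k = ranked[:2]
print(top_k)   # answers from the highest-quality significant sources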
AN APPROACH TO IMPROVEMENT THE USABILITY IN SOFTWARE PRODUCTSijseajournal
One of the significant aspects of software quality is usability. It is one of the characteristics by which
the success or failure of software applications is judged. The most important risk facing software applications is
poor usability, which may lead to a gap between users and systems. This may lead to system
failure because of poor design, when the design is not based on the desires and requirements of the
customer. To overcome these problems, this paper proposes an approach to improve the usability of software
applications so that they meet the needs of the customer and interact with the user easily, in an efficient and
effective manner. The proposed approach is based on the prototyping technique, due to its simplicity and
because it does not require additional costs to elicit precise and complete requirements and designs.
REVIEW PAPER ON NEW TECHNOLOGY BASED NANOSCALE TRANSISTORmsejjournal
Owing to the fact that MOSFETs can be effortlessly assimilated into ICs, they have become the heart of the
growing semiconductor industry. The need for low power dissipation, high operating speed and
small size requires the scaling down of these devices, in keeping with Moore's Law. But scaling down
comes with its own drawbacks, which manifest as short channel effects (SCE): the working of the
device deteriorates owing to SCE. In this paper, the problems of device downsizing, as well as how the use
of SED-based devices proves to be a better solution to device downsizing, are presented. A
study of short channel effects, as well as the issues associated with a nanoMOSFET, is provided. The
properties of several quantum dot materials are studied, along with how to choose the best material based on the
observation of a clear Coulomb blockade. Specifically, a graphene single electron
transistor is reviewed. Also, a theoretical explanation of a model designed to tune the movement of
electrons with the help of a quantum wire is presented.
MODIFICATION OF DOPANT CONCENTRATION PROFILE IN A FIELD-EFFECT HETEROTRANSIST...msejjournal
In this paper we consider an approach for manufacturing more compact field-effect heterotransistors. The
approach is based on manufacturing a heterostructure, which consists of a substrate and an epitaxial layer
with a specific configuration. After that, several areas of the epitaxial layer are doped by diffusion or
ion implantation, with optimized annealing of the dopant and/or radiation defects. At the same time, we introduce
an approach for modifying the energy band diagram by additional doping of the transistor channel.
We also consider an analytical approach to model and optimize the technological process.
A HYBRID METHOD FOR AUTOMATIC COUNTING OF MICROORGANISMS IN MICROSCOPIC IMAGESacijjournal
Microscopic image analysis is an essential process for enabling the automatic enumeration and quantitative
analysis of microbial images. Several systems are available for enumerating microbial growth, but
some of the existing methods may be inefficient at accurately counting overlapped microorganisms.
Therefore, in this paper we propose an efficient method for the automatic segmentation and counting of
microorganisms in microscopic images. This method uses a hybrid approach based on morphological
operations, an active contour model and counting by a region labelling process. The colony count obtained
by the proposed method is compared with the manual count and the count obtained from an existing
method.
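Part of such a pipeline can be sketched with scikit-image: threshold the image, clean it up with morphological operations, then count connected components via region labelling. The active contour step is omitted, and a random array stands in for a real microscopic image.

# Part of the described pipeline: threshold, morphological cleanup, then
# region labelling to count colonies (the active-contour step is omitted).
import numpy as np
from skimage import filters, measure, morphology

image = np.random.rand(128, 128)          # stand-in for a microscopic image
binary = image > filters.threshold_otsu(image)
binary = morphology.remove_small_objects(binary, min_size=20)
binary = morphology.binary_opening(binary)

labels = measure.label(binary)            # connected-component labelling
count = labels.max()                      # number of detected regions
print(f"colony count: {count}")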
CROSS DATASET EVALUATION OF FEATURE EXTRACTION TECHNIQUES FOR LEAF CLASSIFICA...ijaia
In this work feature extraction techniques for leaf classification are evaluated in a cross dataset scenario.
First, a leaf identification system consisting of six feature classes is described and tested on five established
publicly available datasets by using standard evaluation procedures within the datasets. Afterwards, the
performance of the developed system is evaluated in the much more challenging scenario of cross dataset
evaluation. Finally, a new dataset is introduced, as well as a web service that allows users to identify leaves
both photographed on paper and when still attached to the tree. While the results obtained during
classification within a dataset come close to the state of the art, the classification accuracy in cross dataset
evaluation is significantly worse. However, by adjusting the system and taking the top five predictions into
consideration very good results of up to 98% are achieved. It is shown that this difference is down to the
ineffectiveness of certain feature classes as well as the increased severity of the task as leaves that grew
under different environmental influences can differ significantly not only in colour, but also in shape.
DESIGN OF DIFFERENT DIGITAL CIRCUITS USING SINGLE ELECTRON DEVICESmsejjournal
The Single Electron Transistor (SET) is foreseen as an excellent growing technology. The aim of this paper is
to briefly present the fundamentals of SETs, as well as to realize their application in the design of
novel single-electron-device-based digital logic circuits with the help of a Monte Carlo based simulator. A Single
Electron Transistor is characterized by two substantial determinants: one is very low power
dissipation, while the other is its small stature, which makes it a favorable suitor for the future generation of
very high level integration. With the utilization of SETs, technology is moving past the CMOS age, resulting in
power-efficient, high-integrity, handy and high-speed devices. Controlling the transport of single
electrons is one of the most stirring aspects of SET technologies. The Monte Carlo technique is currently in
vogue for simulating SED-based circuits. Hence, an MC-based tool called SIMON 2.0 is used
for the design and simulation of these digital logic circuits. Further, the efficient functioning of
logic circuits such as multiplexers, decoders, adders and converters is illustrated and established by
means of circuit simulation using the SIMON 2.0 simulator.
A REVIEW ON OPTIMIZATION OF LEAST SQUARES SUPPORT VECTOR MACHINE FOR TIME SER...ijaia
The Support Vector Machine has emerged as an active area of study in the machine learning community and is extensively
used in various fields, including prediction, pattern recognition and many more. The Least
Squares Support Vector Machine (LSSVM), a variant of the Support Vector Machine, offers a better solution
strategy. In order to utilize the LSSVM's capability in data mining tasks such as prediction, there is a need to
optimize its hyperparameters. This paper presents a review of techniques used to optimize these parameters,
based on two main classes: Evolutionary Computation and Cross-Validation.
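The cross-validation class of techniques can be illustrated with a grid search. scikit-learn has no LSSVM implementation, so kernel ridge regression (a close relative of LSSVM regression) stands in below; the data and parameter grid are arbitrary.

# Cross-validation-style hyperparameter search. scikit-learn has no LSSVM,
# so kernel ridge regression (a close relative of LSSVM regression) stands in.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

grid = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e-2, 1e-1, 1.0],      # regularization strength
                "gamma": [0.1, 1.0, 10.0]},      # RBF kernel width
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # hyperparameters chosen by 5-fold cross-validation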
AN ADAPTIVE REUSABLE LEARNING OBJECT FOR E-LEARNING USING COGNITIVE ARCHITECTUREacijjournal
Nowadays, a huge amount of ambiguous e-learning material is available on the World Wide Web,
serving various objectives. These digital educational resources can be reused and shared from a
centralized online repository, which avoids redundant learning material. The main goal is to design
consistent, adaptable e-learning course material for a web-based education system with emphasis on the
quality of learning. This can be done by organizing learning objects in a prescribed manner so that they can be
reused in the future. Such reusable learning objects are enhanced further to become adaptive reusable
learning objects that are virtually cognitive and responsive to the specific needs of the user/customer.
This paper proposes a cognitive architecture to offer adaptive reusable learning objects (RLOs) based on
the individual profile of the e-learner as well as their cognitive behaviour while learning.
DESIGN AND IMPLEMENTATION OF THE ADVANCED CLOUD PRIVACY THREAT MODELING IJNSA Journal
Privacy preservation for sensitive data has become a challenging issue in cloud computing. Threat
modeling, as a part of requirements engineering in secure software development, provides a structured
approach for identifying attacks and proposing countermeasures against the exploitation of vulnerabilities
in a system. This paper describes an extension of the Cloud Privacy Threat Modeling (CPTM) methodology for
privacy threat modeling in relation to processing sensitive data in cloud computing environments. It
describes the modeling methodology, which involved applying Method Engineering to specify the characteristics
of a cloud privacy threat modeling methodology, the different steps in the proposed methodology and the
corresponding products. In addition, a case study has been implemented as a proof of concept to
demonstrate the usability of the proposed methodology. We believe that the extended methodology
facilitates the application of a privacy-preserving cloud software development approach from requirements
engineering to design.
A NOVEL CHARGING AND ACCOUNTING SCHEME IN MOBILE AD-HOC NETWORKSIJNSA Journal
Because of the lack of infrastructure in mobile ad hoc networks (MANETs), their proper functioning must
rely on co-operation among mobile nodes. However, mobile nodes tend to conserve their own resources and
may be reluctant to forward packets for other nodes. One approach to encourage co-operation among
nodes is to reward nodes that forward data for others. Such an incentive-based scheme requires a charging
and accounting framework to control and manage rewards and fines (collected from users committing
infractions). In this paper, we propose a novel charging and accounting scheme for MANETs. We present a
detailed description of the proposed scheme and demonstrate its effectiveness via formal proofs and
simulation results [15]. We develop a theoretical game model that offers advice to network administrators
about the allocation of resources for monitoring mobile nodes. The solution provides the optimal
monitoring probability, which discourages nodes from cheating because the gain would be compensated by
the penalty.
A Cross Layer Based Scalable Channel Slot Re-Utilization Technique for Wirele...csandit
The tremendous growth of wireless-based application services is increasing the demand
for wireless communication techniques that use bandwidth more effectively. Channel slot re-utilization
in multi-radio wireless mesh networks is a very challenging problem. WMNs have
been adopted as backhaul to connect various networks such as Wi-Fi (802.11), WiMAX
(802.16e), etc., to the internet. The slot re-utilization techniques proposed so far suffer from high
collision caused by channel slot usage approximation error. To overcome this, the
authors propose a cross-layer optimization technique: a device-classification-based
channel slot re-utilization routing strategy that considers channel slot and node
information from various layers and uses some of these parameters to approximate the risk
involved in channel slot re-utilization, in order to improve the QoS of the network. The simulation
and analytical results show the effectiveness of the proposed approach in terms of channel slot
re-utilization efficiency, which helps in reducing latency for data transmission and reduces
channel slot collisions.
SEGMENTATION USING ‘NEW’ TEXTURE FEATUREacijjournal
Color, texture, shape and luminance are the prominent features for image segmentation. Texture is an
organized group of spatially repetitive arrangements in an image, and it is a vital attribute in many image
processing and computer vision applications. The objective of this work is to segment the texture sub-images
from a given arbitrary image. The main contribution of this work is to introduce the “NEW” texture
feature descriptor to the image segmentation field. The NEW texture descriptor labels the neighborhood
pixels of a pixel in an image as N, W, NW, NE, WW, NN and NNE (N: North, W: West). To find the prediction
value, the gradients of the intensity function are calculated. Eight-component binary vectors are formed
and compared to the prediction value, finally ending up with 256 possible vectors. Fuzzy c-means clustering is
used to segment similar regions in the texture image. Extensive experimentation shows that the proposed
methodology works better for segmenting texture images, and the segmentation performance is also
evaluated.
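The general flavour of such a descriptor, comparing a pixel's neighbours against a prediction value and packing the results into an 8-bit code with 256 possible values, can be sketched in NumPy. This is an LBP-like illustration under assumed offsets and a simple mean-based prediction value, not the exact NEW descriptor.

# LBP-like illustration of an 8-bit neighbourhood code (256 possible values);
# offsets and the mean-based prediction value are assumptions, not the paper's.
import numpy as np

def neighbourhood_codes(img):
    """Pack 8 neighbour-vs-prediction comparisons into one byte per pixel."""
    h, w = img.shape
    # N, W, NW, NE, WW, NN, NNE, plus NNW so the code has eight components
    # (the abstract names seven directions but uses eight-component vectors).
    offsets = [(-1, 0), (0, -1), (-1, -1), (-1, 1),
               (0, -2), (-2, 0), (-2, 1), (-2, -1)]
    codes = np.zeros((h, w), dtype=np.uint8)
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            pred = img[y - 1:y + 2, x - 1:x + 2].mean()   # simple prediction value
            bits = [int(img[y + dy, x + dx] > pred) for dy, dx in offsets]
            codes[y, x] = sum(b << i for i, b in enumerate(bits))
    return codes

img = np.random.rand(16, 16)
print(np.unique(neighbourhood_codes(img)).size, "distinct codes (max 256)")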
THE IMPACT OF THE AUDIT QUALITY ON THAT OF THE ACCOUNTING PROFITS: THE CASE O...ijmvsc
The purpose of this article is to examine the impact of the audit quality on that of the accounting profits. We chose accrual quality, accounting conservatism and profit relevance as measures of the quality of the accounting profits. The empirical study, carried out in this article on a sample of Tunisian firms listed on the TSE for the period 2005-2009, confirms the significant impact of the audit quality on that of the accounting profits. The study results also show that the variables size of the audit firm, sector-based specialization of the audit firm, the co-commissioner and the size of the audit committee improve the quality of the accounting profits.
Decision Making Framework in e-Business Cloud Environment Using Software Metr...ijitjournal
Cloud computing technology is one of the most important in the IT industry, enabling companies to offer access to their
systems and application services on a pay-per-use basis. As a result, more than a few enterprises, including Facebook,
Microsoft, Google, and Amazon, have started offering services to their clients. Software quality is critical in
market competition. This paper presents a hybrid framework, based on the goal/question/metric paradigm,
to evaluate the quality and effectiveness of previous software goods in projects, products and organizations
in a cloud computing environment. Our approach supports decision making at the project,
product and organization levels using neural networks and three angular metrics, i.e., project metrics,
product metrics, and organization metrics.
IMPLEMENTATION OF A DECISION SUPPORT SYSTEM AND BUSINESS INTELLIGENCE ALGORIT...ijaia
Data processing is crucial in the insurance industry, due to the important information that is contained in
the data. Business Intelligence (BI) allows companies working in the insurance sector to better manage their various
activities. Business Intelligence based on a Decision Support System (DSS) makes
it possible to improve the efficiency of decisions and processes by adapting them to the individual
characteristics of the agents. In this direction, Key Performance Indicators (KPIs) are valid tools that help
insurance companies understand the current market and anticipate future trends. The purpose of the
present paper is to discuss a case study, developed within the research project "DSS / BI
HUMAN RESOURCES", related to the implementation of an intelligent platform for the automated
management of agents' activities. The platform includes BI, DSS, and KPIs. Specifically, the platform
integrates Data Mining (DM) algorithms for agent scoring, K-means algorithms for customer clustering,
and a Long Short-Term Memory (LSTM) artificial neural network for the prediction of agents' KPIs. The
LSTM model is validated by the Artificial Records (AR) approach, which makes it possible to augment the training dataset
in data-poor situations, as in many practical cases using Artificial Intelligence (AI) algorithms. Using the
LSTM-AR method, an analysis of the performance of the artificial neural network is carried out by
changing the number of records in the dataset. More precisely, as the number of records increases, the
accuracy increases up to a value equal to 0.9987.
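To make the prediction component concrete, here is a minimal Keras LSTM that forecasts the next value of a KPI series from a sliding window. The synthetic series, window length and network size are assumptions for illustration; the platform's AR augmentation and real agent KPIs are not reproduced.

# Minimal LSTM predicting the next value of a KPI series from the previous
# 12 values. Synthetic data; not the project's platform or AR augmentation.
import numpy as np
from tensorflow import keras

series = np.sin(np.linspace(0, 20, 300)) + 0.05 * np.random.randn(300)
win = 12
X = np.array([series[i:i + win] for i in range(len(series) - win)])[..., None]
y = series[win:]

model = keras.Sequential([
    keras.layers.Input(shape=(win, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:], verbose=0))   # next-step KPI forecast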
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...ijdpsjournal
The paper aims at proposing a solution for designing and developing a seamless automation and
integration of machine learning capabilities for Big Data, with the following requirements: 1) the ability to
seamlessly handle and scale very large amounts of unstructured and structured data from diversified and
heterogeneous sources; 2) the ability to systematically determine the steps and procedures needed for
analyzing Big Data datasets based on data characteristics, domain expert inputs, and a data pre-processing
component; 3) the ability to automatically select the most appropriate libraries and tools to compute and
accelerate the machine learning computations; and 4) the ability to perform Big Data analytics with high
learning performance but minimal human intervention and supervision. The whole focus is to provide
a seamless, automated and integrated solution which can be effectively used to analyze Big Data with high-frequency
and high-dimensional features from different types of data characteristics and different
application problem domains, with high accuracy, robustness, and scalability. This paper highlights the
research methodologies and research activities that we propose to be conducted by Big Data
researchers and practitioners in order to develop and support seamless automation and integration of
machine learning capabilities for Big Data analytics.
Analytics, business cycles and disruptionsMark Albala
The digital economy is different. Depending on platforms and a much more malleable set of methods to interact with consumers, an accelerated rate of disruptions compromises the orderly business experience of most market participants. A well-honed analytics program facilitates understanding these accelerated disruptions. With a platform-based digital marketplace, obtaining the information necessary to decipher unexpected outcomes and prescribe suitable actions is difficult. Both of these facts are important to analytics. First, platforms: platform-based activity is hard to decipher, not because it is more complex but because the information needed to decipher the activity is not contained within your four walls.
Once deciphered, the next challenge facing organizations deciphering unexpected outcomes is determining whether the unexpected outcome is truly a disruptive event or simply a phase change in a regularly occurring business cycle. There are significant differences in the suitable reactions to disruptions and business cycle phase changes. Unfortunately, many organizations are ill-equipped to discern between these two classes of unexpected business outcomes and consistently find their business plans fall victim to the actions of others within the marketplace.
Luckily, many of the activities of governmental and regulatory bodies are focused on predicting phase changes to the business cycles likely to impact economic forces within the next fiscal year, and these bodies describe their economic policies and agendas in publicly available documents and analysis. Understanding where to find these documents, and how to use the published material to discern between likely business cycle phase changes and true disruptions, is one of the vehicles available within your arsenal of analytics, and it will lessen the occurrence of falling victim in the marketplace by misreading the clues available from unexpected outcomes. This document will address the sources most likely to assist and the actions to be taken to utilize the information obtained from these documents.
What is the relationship between Accounting and an Accounting inform.pdfannikasarees
What is the relationship between Accounting and an Accounting information system? (2.5
Marks)
Accounting-Methods, procedures, and standards followed in accumulating, classifying,
recording, and reporting business events and transactions. The accounting system includes the
formal records and original source data. Regulatory requirements may exist on how a particular
accounting system is to be maintained (e.g., insurance company).
Accounting Information System-Subsystem of a Management Information System (MIS) that
processes financial transactions to provide (1) internal reporting to managers for use in planning
and controlling current and future operations and for nonroutine decision making; (2) external
reporting to outside parties such as to stockholders, creditors, and government agencies.
• What has happened to the relationship over the years? (2.5 Marks)
Accounting and Information Technology are two terms used in every business,
because both are needed for the effective working of a corporation or company. It is a need of the times
that we understand the relationship between Accounting and Information Technology.
Accounting is related to the recording and utilisation of recorded data. Information Technology comprises the
scientific, technological and engineering disciplines and the management techniques used in
information handling and processing, their application, computers and their interaction with
people and machines, and the associated economic and cultural matters. In simple wording, IT is
the technique by which we get and utilize information in an effective and efficient way.
Now, we are ready to give the relationship between Accounting and Information
Technology.
Both are related to getting information and utilizing that information, so both are
interconnected with each other. If specialists in both areas merge the two systems in a scientific
and technical way, they can easily overcome the different problems caused by a lack of correct and
adequate information related to the business.
• What is accounting information? (1 Mark)
Accounting information can be classified into two categories: financial accounting or public
information and managerial accounting or private information. Financial accounting includes
information disseminated to parties that are not part of the enterprise proper—stockholders,
creditors, customers, suppliers, regulatory commissions, financial analysts, and trade
associations—although the information is also of interest to the company's officers and
managers. Such information relates to the financial position, liquidity (that is, ability to convert
to cash), and profitability of an enterprise.
Managerial accounting deals with cost-profit-volume relationships, efficiency and productivity,
planning and control, pricing decisions, capital budgeting, and similar matters. This information
is not generally disseminated outside the company. Whereas the general-purpose financial
statements of financial accounting are assumed.
DATA MINING MODEL PERFORMANCE OF SALES PREDICTIVE ALGORITHMS BASED ON RAPIDMI...ijcsit
By applying RapidMiner workflows, a dataset originating from different data files and containing information about three years of sales of a large chain of retail stores has been processed. Subsequently, a Deep Learning model implementing a predictive algorithm suitable for sales forecasting has been constructed. This model is based on an artificial neural network (ANN) algorithm able to learn the model from
historical sales data after pre-processing the data. The best model built uses a multilayer neural network together with an “optimized operator” able to find automatically the best parameter settings of the implemented algorithm. In order to identify the best-performing predictive model, other machine learning algorithms have been tested. The performance comparison covered Support Vector
Machine (SVM), k-Nearest Neighbor (k-NN), Gradient Boosted Trees, Decision Trees, and Deep Learning algorithms. Comparison of the degree of correlation between real and predicted values, the average
absolute error and the relative average error proved that the ANN exhibited the best performance, with the Gradient Boosted Trees approach as the second-best performing alternative. The case study has been developed within the framework of an industry project oriented toward the
integration of high-performance data mining models able to predict sales using enterprise resource planning (ERP) and customer relationship management (CRM) tools.
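The comparison the abstract describes can be reproduced in spirit with scikit-learn (the paper itself used RapidMiner workflows). The sketch below fits the five model families on synthetic data and reports the correlation between real and predicted values along with the mean absolute error; all settings are illustrative.

# A scikit-learn analogue of the described comparison (the paper used
# RapidMiner); synthetic regression data stands in for the sales dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=600, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "SVM": SVR(),
    "k-NN": KNeighborsRegressor(),
    "Gradient Boosted Trees": GradientBoostingRegressor(random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "ANN": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    corr = np.corrcoef(y_te, pred)[0, 1]           # correlation: real vs predicted
    print(f"{name}: corr={corr:.3f}, MAE={mean_absolute_error(y_te, pred):.2f}")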
Optimized Feature Extraction and Actionable Knowledge Discovery for Customer ...Eswar Publications
In today’s dynamic marketplace, telecommunication organizations, both private and public, are increasingly leaving antiquated marketing philosophies and strategies for more customer-driven initiatives that seek to understand, attract, retain and build intimate long-term relationships with profitable customers. This paradigm shift has undoubtedly led to the growing interest in Customer Relationship Management (CRM) initiatives that aim at ensuring customer identification and interactions. The urgent market requirement is to identify automated methods that can assist businesses in the complex task of predicting customer churning.
The immediate requirement of the market is to have systems that can perform accurate:
(i) identification of loyal customers (so that companies can offer more services to retain them); and
(ii) prediction of churners, to ensure that only the customers who are planning to switch their service providers are targeted for retention.
Similar to ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out: practical tips and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.6, No.2, March 2016
DOI: 10.5121/ijdkp.2016.6205
ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS
Vivek Sankar and Somendra Tripathi
Latentview Analytics, Chennai, India
ABSTRACT
In the business world, decision makers rely heavily on data to back their decisions. With the quantum of
data increasing rapidly, traditional methods used to generate insights from reports and dashboards will
soon become intractable. This creates a need for efficient systems which can substitute human intelligence
and reduce time latency in decision making. This paper describes an approach to process time series data
with multiple dimensions (such as geographies, verticals and products) efficiently, to detect anomalies in
the data and further, to explain potential reasons for the occurrence of the anomalies. The algorithm
implements auto selection of forecast models to make reliable forecasts and detect such anomalies. Depth
First Search (DFS) is applied to analyse each of these anomalies and find its root causes. The algorithm
filters the redundant causes and reports the insights to the stakeholders. Apart from being a hair-trigger
KPI tracking mechanism, this algorithm can also be customized for problems like A/B testing, campaign
tracking and product evaluations.
KEYWORDS
DFS, A/B Testing, Reporting, Forecasting, Anomaly Spotting.
1. INTRODUCTION
With the growing use of the Internet and mobile apps, the world is seeing a steep increase in the
availability and accessibility of different kinds of data – demographic, transactional, social
media and so on. Modern businesses are keen to leverage these data to arrive at smarter and
timelier decisions. But the existing data setup in a lot of organizations primarily provides post-mortem
reports of performance. The shift to a more pro-active, or instantaneously reactive,
approach to data-based decision making requires investments in data infrastructure and skilled
resources, and the quicker and more efficient dissemination of information to the right
stakeholders. With more and more firms investing in their infrastructure to capture the necessary
data, there is a growing need for automated systems that can step in to process the raw data and
provide actionable readouts to the required stakeholders.
This paper proposes one such completely automated frequentist framework which, when provided
with any casted data (a dataset which is a cross product of the dimensions involved and their
corresponding metric values for each time frame), can run in the background to provide actionable
readouts. It helps business decision makers stay updated with developments in their portfolio
by providing timely updates: a list of anomalies (spikes and dips) with respect to the
KPIs of their interest, together with reasons for the same.
The proposed method makes use of forecasting techniques to arrive at ballpark figures for every
segment, which act as a logical substitute for the user's intuition. The estimates are pitted against the
actual performance immediately after the actual data becomes available, to effectively spot anomalies.
The framework consists of intermediate trigger systems that can alert the corresponding stakeholders
without much delay.
The system further digs down to analyze sub-segments to attribute reasons for every spotted
anomaly. This approach, which requires minimal human intervention, is an effective aid in scenarios
where:
• the number of segments involved cannot be handled manually
• there is a lack of statistical expertise on the user front
• there is a high time lag in decision making due to existing reporting structures
This approach finds its application in a wide range of industries such as retail, e-commerce,
airlines, insurance, manufacturing, logistics and supply chain, etc., to benefit portfolio managers,
analysts, marketers, and product and sales managers, to name a few.
The task of anomaly detection and reporting starts with processing the melted data coming
from the database into casted data by producing all possible interactions between the various
dimensions in the dataset. Auto-forecasting iterates through the entire casted data, converts it into
time series and generates predictions to be consumed in the later modules. Further, the framework
establishes networks to understand the interdependencies between the various segments in the
data. Depth First Search is then applied to spot anomalies, which are then checked for
redundancy and reported.
The paper is structured as follows. The next section presents the methodology and tools used in
the framework. This section has been broken into three sub-sections to highlight the pivotal
modules running the entire framework. Section 3 discusses its implementation. Concluding
remarks and future works are mentioned in Section 4.
2. METHODOLOGY
The proposed framework is a novel technique to spot anomalies in data with minimal human
intervention.
The three prime components that are required for its functioning are:
1. The actual value of the KPI
2. A ballpark value for the KPI
3. A scientific flagging approach
With the availability of these three components, the method could be applied across a wide range
of real-life scenarios and across multiple verticals and KPIs. The following sections explain the
various steps involved in the framework for creating the above mentioned components and
utilizing them to provide actionable insights to business users.
The entire framework is broadly broken down into the following sections:
• Data Processing Module (DPM)
• Auto Forecasting Module (AFM)
• Pattern Analysis and Reporting Module (PARM)
The flow chart in figure 1 is a concise representation of the internal structure of the modules
involved and the flow of data between the modules.
Figure 1: Framework - Block Diagram
2.1 Module 1: Data Processing Module
2.1.1 Data Input:
This module accepts raw data, which could be any form of casted data. The dataset would include
all the necessary dimensions that best describe the KPI, along with the KPI itself broken down at
the least granularity of the time unit used for reporting. Some easily relatable datasets are sales
data of retail stores, web traffic of an e-commerce website, call volume data from call centers,
risk decline volumes for a payment gateway and so on. Any data that encompasses trends and
seasonal patterns could be a perfect fit for this module or the framework as a whole. Table 1 is an
illustration of a sample data source.
Table 1: Sample Casted Dataset
Customer Segment Product Category Region Date Sales
Consumer Furniture Central 1/1/2010 690.77
Consumer Furniture Central 2/1/2010 5303.17
Consumer Office Supplies East 1/1/2010 1112.11
Consumer Office Supplies East 2/1/2010 84.01
Home Office Furniture West 11/1/2013 10696.84
Home Office Furniture West 12/1/2013 4383.98
Here the first three columns are the dimensions and the date column indicates the frequency of
reporting. The date column here is at a month level, but in general the framework can be applied
to daily, weekly, monthly, quarterly or yearly reports. The final column (Sales) is the actual KPI
that the business wants to track using this methodology.
2.1.2 Segments Creation:
Each of the dimensions in the dataset could hold 2 or more values and each of these could be of
interest to different stakeholders. Referring back to the above sample dataset the Region
dimension holds values of the geographies where there was Sales reported and each of the
individual Regional Heads would want to keep an eye on the Sales of their region. Hence each of
the values in every dimension potentially is a segment. The system further goes to generate
segments by combining values of two dimensions. For example combining region and product
category, segments like (Central _ Furniture) and (West _ Furniture) could be generated. The
segment formation extends from treating every segment individually to combining all the
available n dimensions. After the system generates all the possible combinations using the
available dimensions in the data, each unique combination is given a Segment ID and the initial
casted data is melted.
Table 2: Conversion of casted to melted data
Segment ID Date Sales
Seg 1 1/1/2010 690.77
Seg 1 2/1/2010 5303.17
Seg 2 1/1/2010 1112.11
Seg 2 2/1/2010 84.01
Seg 3 11/1/2013 10696.84
Seg 3 12/1/2013 4383.98
In short, if there are n dimensions (D1, D2, D3, …, Dn) and the number of values in each dimension
is (X1, X2, X3, …, Xn) respectively, the total number of combinations generated would be
(X1+1) × (X2+1) × … × (Xn+1). The +1 for each dimension accounts for leaving that dimension
unrestricted, i.e. aggregating over all of its values.
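To make the segment-creation step concrete, the following is a minimal Python/pandas sketch that enumerates every combination of dimension values (each dimension may also be left unrestricted, which is where the (X+1) factors come from) and melts the casted data into the Segment ID form of Table 2. The column and helper names mirror Table 1 and are illustrative, not the authors' implementation.

from itertools import combinations

import pandas as pd

DIMENSIONS = ["Customer Segment", "Product Category", "Region"]

def melt_to_segments(casted: pd.DataFrame) -> pd.DataFrame:
    # Aggregate the KPI for every combination of dimensions; each dimension may
    # also be left unrestricted, which yields the (X+1) factor per dimension.
    frames = []
    for r in range(len(DIMENSIONS) + 1):            # r = number of fixed dimensions
        for dims in combinations(DIMENSIONS, r):
            if dims:
                grouped = casted.groupby([*dims, "Date"], as_index=False)["Sales"].sum()
                # Segment ID encodes the fixed values, e.g. "Central _ Furniture"
                grouped["Segment ID"] = grouped[list(dims)].agg(" _ ".join, axis=1)
            else:
                grouped = casted.groupby("Date", as_index=False)["Sales"].sum()
                grouped["Segment ID"] = "Overall"
            frames.append(grouped[["Segment ID", "Date", "Sales"]])
    return pd.concat(frames, ignore_index=True)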
2.1.2.1 Business Preferences/Inputs:
This is an optional step in the overall structure where the intention is to bucket values in each
dimension to club smaller segments. Assuming there are 100 different products in the Product
Category dimension, and 9 of these products account for 95% of the overall Sales, the remaining
91 products could be clubbed as 'Other Products'.
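A small pandas sketch of this bucketing, with the 95% coverage threshold and column names taken from the example above (illustrative, not the authors' code):

import pandas as pd

def club_tail(casted: pd.DataFrame, dim: str = "Product Category",
              kpi: str = "Sales", coverage: float = 0.95) -> pd.DataFrame:
    # Keep the values covering ~95% of the KPI; club the tail as 'Other Products'.
    totals = casted.groupby(dim)[kpi].sum().sort_values(ascending=False)
    keep = totals[totals.cumsum() / totals.sum() <= coverage].index
    out = casted.copy()
    out.loc[~out[dim].isin(keep), dim] = "Other Products"
    return out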
This removes certain segments from the picture based on business preferences. The input could
even be flat files containing segments which businesses are not too concerned about. Both these
measures reduce the number of segments, thereby improving the operating efficiency and
memory consumption of the proposed approach.
The final melted output from this module is fed into the Auto Forecast Module after the removal
of all data points corresponding to the latest unit of time, passing through the Multiprocessing
stage to enable parallel processing in the Auto Forecast Module.
2.1.3 Parallel Processing Module:
The forecasting module can be duplicated as multiple processes, as each segment is treated
independently of the others. The dataset is broken down into multiple parts with an equal number
of segments and fed to multiple sessions of the AFM.
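Because each segment is independent, the fan-out described above maps naturally onto a standard process pool. A minimal Python sketch, assuming a melted pandas DataFrame as produced by the DPM; forecast_segment is only a placeholder for the Auto Forecast Module described next:

from multiprocessing import Pool

import pandas as pd

def forecast_segment(job):
    # Placeholder for the Auto Forecast Module applied to one segment.
    seg_id, series = job
    return seg_id, series["Sales"].mean()        # stand-in for a real one-step forecast

def run_afm_parallel(melted: pd.DataFrame, workers: int = 4):
    jobs = list(melted.groupby("Segment ID"))    # one job per independent segment
    with Pool(processes=workers) as pool:
        # Pool.map splits the job list into roughly equal chunks across sessions
        return pool.map(forecast_segment, jobs)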
2.2 Module 2: Auto Forecast Module
The auto forecast module is designed to generate forecasts on multiple time series, which are
then fed to the PARM discussed in the next section. The following steps explain how one-step
forecasts are generated:
2.2.1 Data Preparation:
Aggregated data is converted to time series data of the specified frequency. Each time series is
then iteratively checked for missing values and outliers. The framework provides the flexibility
of using linear and cubic spline interpolation for treating missing values and outliers [1]; the time
series is decomposed using STL and the trend component is smoothened. This helps in
approximating missing values and minimizing the effect of outliers. Each time series is checked
for 0-padding (both leading and trailing). Powerful transformations such as Box-Cox can be
applied before moving on to the next phase.
Figure 1: Missing value treatment using STL + spline smoothing
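A minimal sketch of this preparation step, assuming statsmodels for the STL decomposition and scipy for the Box-Cox transform; the smoothing window and function names are our own illustrative choices, not the paper's code:

import pandas as pd
from scipy import stats
from statsmodels.tsa.seasonal import STL

def prepare_series(y: pd.Series, period: int = 12, box_cox: bool = False):
    # Interpolate missing values (linear here; cubic spline is the alternative)
    y = y.interpolate(method="linear", limit_direction="both")
    # Decompose with STL and smooth the trend component to damp outliers
    res = STL(y, period=period).fit()
    smooth_trend = res.trend.rolling(window=3, center=True, min_periods=1).mean()
    cleaned = smooth_trend + res.seasonal   # remainder dropped to reduce outlier effect
    # Powerful transformations such as Box-Cox can be applied before forecasting
    if box_cox:
        cleaned = pd.Series(stats.boxcox(cleaned.clip(lower=1e-6))[0],
                            index=cleaned.index)
    return cleaned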
2.2.2 Model selection:
This phase involves choosing the right set of models for a given time series. The current pool
includes Exponential Smoothing (the ETS implementation for automatic selection) and
auto.Arima [2], Nnet (feed-forward neural networks with a single hidden layer), TSLM (for
fitting linear models to time series, including trend and seasonality), TBATS (exponential
smoothing state space model with Box-Cox transformation, ARMA errors, and Trend and
Seasonal components) [3,4] and STL.
It is important that the seasonality component is correctly identified before forecasting. Over-parameterization,
or force-fitting seasonality when there is no real seasonal component, might
produce unreliable forecasts. In cases where there are too few data points, only ARIMA and
Regression models are used. Croston's method is used where time series have intermittent data.
TBATS is used with data exhibiting weekly and annual seasonality. All relevant models are
picked for forecasting in this phase.
Figure 2: Model selection based on Holdout accuracy
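The sketch below shows the holdout-based selection on a single series with two representative models from the pool (ETS via statsmodels' ExponentialSmoothing, and ARIMA); the model orders and holdout length are illustrative assumptions, and the real module iterates over the full pool:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def mape(actual, forecast):
    # Mean absolute percentage error; assumes no zero actuals.
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def select_winner(y: pd.Series, holdout: int = 6, period: int = 12):
    train, test = y.iloc[:-holdout], y.iloc[-holdout:]
    candidates = {
        "ETS": ExponentialSmoothing(train, trend="add", seasonal="add",
                                    seasonal_periods=period).fit(),
        "ARIMA": ARIMA(train, order=(1, 1, 1)).fit(),
    }
    scores = {name: mape(test, fit.forecast(holdout))
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores   # winner = lowest holdout MAPE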
2.2.3 Forecasting:
The level of the time series must be specified before this step starts. A small portion of the data is
held out to be used for validation [5]. This helps in preventing overfitting and gives a truer
estimate of the generalization error. The module then runs all the models picked for the given
time series and provides the residuals, fit statistics and confidence intervals.
2.2.4 Evaluation:
The metric to be used for evaluating the fit can be specified at the start of execution of this
module. The framework provides MAPE (mean absolute percentage error), MSE (mean squared
error), MAE (mean absolute error) and MASE (mean absolute scaled error) as possible measures
of forecast accuracy. A summary of all fit statistics from the selected models applied on the data
is generated, and the winner model is chosen as the one with the best accuracy. Poor predictions
are flagged at this step for manual intervention.
Table 3: Selection of winner model based on performance metric
Store # ETS TBATS Arima Nnet Tslm Croston’s*
Store 53 7.2% 4.3% 4.7% 2.0% 3.3% NA
Store 54 72.2% 44.8% 41.4% 104.3% 15.9% NA
Store 55 2.6% 5.4% 2.6% 5.4% 6.6% NA
Store 56 5.6% 2.5% 2.8% 12.6% 5.9% NA
Store 57 4.5% 1.8% 4.4% 4.1% 2.3% NA
Store 58 3.9% 2.9% 5.4% 4.8% 2.5% NA
Store 59 11.5% NA 14.9% 31.5% 20.8% NA
Store 60 NA NA 10.1% NA 8.9% NA
Store 61 1.8% 1.3% 3.5% 33.2% 1.8% NA
Store 62 33.8% 52.1% 71.3% 27.0% 14.0% NA
Store 63 12.7% 4.3% 7.8% 4.0% 8.7% NA
Store 64 4.1% 1.4% 3.0% 8.0% 2.5% NA
2.2.5 One-step forecast:
The final forecasts are generated by running the winner model on the entire data, as missing recent
data points could cause loss of valuable information. The residuals, confidence interval and one-step
forecasts for this model are then passed to the next module.
2.3 Module 3: Pattern Analysis and Reporting Module (PARM)
PARM receives data feed separately from the Data Processing Module and Auto Forecast
Module.
• DPM provides the KPI values of the latest unit of time for every segment. These values
were the ones which were kept aside from the melted dataset before being fed into AFM.
• AFM provides reliable forecast values along with the residuals and prediction intervals
based on the required confidence interval for every segment.
2.3.1 Segment Level Flagging
Every available segment in the data has an actual value, a ballpark value (forecasted output) and a
prediction interval based on the point forecast.
2.3.1.1 Prediction Interval
As per Hyndman [7], a prediction interval is an interval associated with a random variable yet to
be observed, with a specified probability of the random variable lying within the interval. For
example, I might give an 80% interval for the forecast of GDP in 2014. The actual GDP in 2014
should lie within the interval with probability 0.8. Prediction intervals can arise in Bayesian or
frequentist statistics.
A confidence interval is an interval associated with a parameter and is a frequentist concept. The
parameter is assumed to be non-random but unknown, and the confidence interval is computed
from data. Because the data are random, the interval is random. A 95% confidence interval will
contain the true parameter with probability 0.95. That is, with a large number of repeated
samples, 95% of the intervals would contain the true parameter.
The range of the prediction interval depends on two factors:
i. The accuracy of the forecast: the higher the accuracy, the narrower the band; poor accuracy
pushes the limits towards -∞ and +∞.
ii. The desired confidence percentage: the higher the required confidence, the broader the band.
2.3.1.2 Reasons for independent forecast
Every segment generated by the DPM could be unique in its properties and thereby exhibit its
own trend and seasonal attributes. Also, obtaining forecast values by aggregating from a more
granular level would accumulate the errors of the sub-granular levels and hence affect the
accuracy of the overall forecast.
2.3.1.2.1 Anomalies – Dips and spikes
The actual value for the KPI corresponding to the latest unit of time is pitted against the
prediction interval for that segment.
If A_T > EUL_T, the segment is flagged for a spike;
if A_T < ELL_T, the segment is flagged for a dip,
where
A_T is the actual KPI value for time period T, and
EUL_T and ELL_T are the upper and lower limits of the prediction interval for time period T.
With every segment flagged independently for anomalies, instantaneous triggers can be sent out
to the accountable stakeholders, alerting them to react without much time latency. In such
scenarios businesses typically start to drill down the KPIs along dimensions, based on their
judgment, to arrive at reasons for the anomaly.
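In code, the flagging rule is a direct comparison of the actual against the prediction-interval limits; a minimal sketch using the notation above:

def flag_segment(actual_t: float, lower_t: float, upper_t: float) -> str:
    # Returns 'spike' if A_T > EUL_T, 'dip' if A_T < ELL_T, else 'normal'.
    if actual_t > upper_t:
        return "spike"
    if actual_t < lower_t:
        return "dip"
    return "normal"

# e.g. flag_segment(10696.84, 3000.0, 9000.0) -> 'spike'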
2.3.2 Network Generation
The network here is a scientific substitute for the extensive drill-down operations performed by
business consumers. The approaches to node formation and node connections are explained in
the next sections.
2.3.3 Node Formation
Every segment created by the DPM is a node by itself. The number of dimensions involved in the
creation of the node determines its level: nodes formed from a single dimension are at level 2,
nodes formed from a combination of two dimensions are at level 3, and so on up to (n+1) levels.
2.3.4 Node Connections
Every node at level k would be connected from one or more nodes at level k-1, where one or
more of the values involved in the formation of the node at level k has a commonality of value
with a node at level (k-1). For example, any node at level 3 formed by [region-central and
product category-furniture] would be connected from the nodes [region-central] and [product
category-furniture] at level 2. This network structure is enabled using Directed Acyclic Graphs
(DAG).
2.3.4.1 Directed Acyclic Graphs (DAG)
In mathematics and computer science, a directed acyclic graph is a directed graph with no
directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge
connecting one vertex to another, such that there is no way to start at some vertex v and follow a
sequence of edges that eventually loops back to v again.
DAGs may be used to model many different kinds of information. The reachability relation in a
DAG forms a partial order, and any finite partial order may be represented by a DAG using
reachability. A collection of tasks that must be ordered into a sequence, subject to constraints that
certain tasks must be performed earlier than others, may be represented as a DAG with a vertex
for each task and an edge for each constraint; algorithms for topological ordering may be used to
generate a valid sequence. Additionally, DAGs may be used as a space-efficient representation of
a collection of sequences with overlapping subsequences. DAGs are also used to represent
systems of events or potential events and the causal relationships between them. DAGs may also
be used to model processes in which data flows in a consistent direction through a network of
processors, or states of a repository in a version-control system [6].
Figure 3: Directed Acyclic Graph (DAG)
2.3.5 Node Information
Every node is identified by its segment id and includes the respective flags for anomalies
obtained for the segments at the previous step.
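One possible encoding of this structure, sketched with the networkx library: each node is a frozenset of (dimension, value) pairs, its level is the number of fixed dimensions plus one, and a node at level k receives an edge from every level k-1 node whose defining values it contains. This encoding is ours, not the authors':

from itertools import combinations, product

import networkx as nx

def build_dag(dim_values: dict) -> nx.DiGraph:
    # dim_values maps each dimension name to its list of values.
    g = nx.DiGraph()
    dims = list(dim_values)
    nodes = [frozenset()]                           # level 1: the overall segment
    for r in range(1, len(dims) + 1):
        for subset in combinations(dims, r):
            for combo in product(*(dim_values[d] for d in subset)):
                nodes.append(frozenset(zip(subset, combo)))
    for node in nodes:
        g.add_node(node, level=len(node) + 1)
    for node in nodes:                              # edge: level k-1 -> level k
        for parent in nodes:
            if len(parent) == len(node) - 1 and parent < node:
                g.add_edge(parent, node)
    return g

# e.g. build_dag({"Region": ["Central", "West"], "Product Category": ["Furniture"]})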
Figure 4: Network formation - An illustration
2.3.6 DAG as a substitute for Drill Down
The DAG network is used as a substitute for the manual drill-down process, as every possible
dimension is fitted above and below the other dimensions and combinations of dimensions,
thereby enabling an exhaustive drill-down setup. But in order to understand the cause of every
anomaly, it is necessary to understand the relation between the spotted anomalies and set up the
right order of fitting the dimensions. This is enabled by traversing the network generated from
every spotted anomaly, which is explained in detail in section 2.3.7.
2.3.7 Network traversal for spotting dependent anomalies
The network is primarily used to understand the relationship between the nodes flagged for
anomalies. A regular Depth First Search (DFS) traversal technique is used to traverse the
network.
2.3.7.1 Steps for primary consolidation
1. The traversal starts from the top most level (level 1).
2. Every node at level 1 is visited to see if it is flagged for an anomaly (spike or dip).
• If it is flagged, the segment is marked as a Main Segment and is fixed as one of the
starting points of DFS.
• If the node is not flagged, the focus moves to the next node in the same level.
3. Step 2 is repeated until level n and all the Main Segments are marked. The Main Segment
nodes marked are of two categories – one flagged for spikes and the other for dips.
4. From the list of nodes marked as Main Segments, the algorithm picks the first node. With this
node as the starting point, the network is traversed till the last level using DFS. If a visited
node on the traversal route is flagged for a similar anomaly, that node is marked as an
Impacted Segment corresponding to the Main Segment (see the sketch after this list).
5. Step 4 is repeated for all the Main Segments marked in step 3.
The generated list of Main Segment and Impacted Segment combinations concludes the primary
consolidation, thereby providing the relationship between one segment's anomaly and its directly
related segment anomalies.
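Under the DAG encoding sketched in section 2.3.4, the primary consolidation reduces to a DFS from each flagged node; a hedged sketch, where the flags mapping (holding 'spike' / 'dip' / 'normal' per node) is an assumed input:

import networkx as nx

def primary_consolidation(g: nx.DiGraph, flags: dict) -> dict:
    # Main Segments are all nodes flagged for an anomaly, at any level.
    main_segments = [n for n in g if flags.get(n) in ("spike", "dip")]
    impacted = {}
    for main in main_segments:
        kind = flags[main]
        # DFS downward; keep visited nodes flagged for the same kind of anomaly.
        impacted[main] = [n for n in nx.dfs_preorder_nodes(g, source=main)
                          if n != main and flags.get(n) == kind]
    return impacted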
2.3.7.2 Root Causing through redundancy removal
The output from the above step contains several redundant factors, which are wiped out in this
step to provide a readout of mutually exclusive anomalies for the business users' perusal. The
redundancies present are of two types:
a. Inter Level Redundancies
b. Intra Level Redundancies
2.3.7.2.1 Inter Level Redundancies
Since the Main Segment list holds all the available flagged nodes, an anomaly spotted at a top
level could have its effect permeating down to the bottom-most level. These sub-level Main
Segments are part of the list of Impacted Segments for the top-level Main Segment. The
framework recognizes that having all of these in the final readout would reiterate the same
anomaly repeatedly in multiple forms.
2.3.7.2.2 Intra Level Redundancies
Intra Level Redundancies are due more to inherent aspects of the data. Two or more nodes at the
same level could be flagged for a similar anomaly. But more often than not it is the effect of one
on the other, as each of the nodes has an inherent volume of the other nodes. This is a more
complex case to eliminate from the primary consolidation, unlike the Inter Level Redundancies.
2.3.7.3 Steps to remove redundancies
1. The Main Segments at level n are picked.
2. Compare the list of Impacted Segments for each pair of Main Segments in this level.
• If there is an intersection, then there exists a common thread.
• If there are no intersections, move to the next pair of Main Segments.
3. If there is an intersection in the previous step, create a proxy index separately for each Main
Segment in the pair. The proxy index is simply the product of (absolute drop in KPI from
forecast) and (percentage drop in KPI from forecast). This index has no specific meaning or
unit of measurement, but the higher its value, the greater the probability of the segment being
a root cause. The index is used to break ties between the paired segments at each level (see
the sketch below).
• The winner of the tiebreaker remains in the list of primary consolidation.
• The loser is removed from the primary consolidated list.
4. This process is repeated for every pair at each level and across all the levels, moving upward.
5. Now, with the truncated list, the process starts from the top most level. A Main Segment at
the top most level is selected and its Impacted Segments are chosen.
6. The truncated list is looped over to check if the Impacted Segments are present as a Main
Segment.
• If present, the sub-level Main Segment corresponding to the matching Impacted
Segment is removed.
• If none of the Impacted Segments in the list match a Main Segment in the sub-levels,
skip to the next Main Segment in the current level of focus.
At the end of this iteration, what remains are mutually exclusive anomalies (Main Segments) and
their corresponding sub-level variations (Impacted Segments).
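A minimal sketch of the proxy index and the tie-break in step 3, assuming each segment carries its actual and forecast KPI values:

def proxy_index(actual: float, forecast: float) -> float:
    # (absolute drop in KPI from forecast) * (percentage drop in KPI from forecast)
    abs_drop = abs(forecast - actual)
    pct_drop = abs_drop / abs(forecast) if forecast else 0.0
    return abs_drop * pct_drop

def tiebreak(seg_a, seg_b, values: dict):
    # values maps segment -> (actual, forecast); the higher index wins.
    idx = {s: proxy_index(*values[s]) for s in (seg_a, seg_b)}
    winner = max(idx, key=idx.get)
    loser = seg_b if winner == seg_a else seg_a
    return winner, loser   # the loser is removed from the primary consolidation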
Figure 5: Types of Redundancies
2.3.8 Attribution and Reporting
The final step of the framework is to communicate the anomalies in a readable format to the
respective stakeholders. The change in the Main Segment in absolute terms is used as the base to
calculate the proportion of change at the corresponding sub-level segments tagged to the Main
Segment as its Impacted Segments. The sub-level segments are ordered by decreasing value of
the calculated proportion. The final consolidated list, after sorting, is then automatically
converted into a presentation where every spotted anomaly is shown separately along with its
causes / impact.
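A hedged sketch of the attribution computation: each Impacted Segment's absolute change is expressed as a proportion of the Main Segment's absolute change, and the list is sorted in decreasing order:

def attribute(main_change: float, impacted_changes: dict) -> list:
    # impacted_changes maps segment -> change in absolute terms.
    if main_change == 0:
        return []
    shares = {seg: change / main_change for seg, change in impacted_changes.items()}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)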
2.3.9 Business Preferences and Reporting Customizations
The framework also provides the flexibility to integrate static business filters or preferences to
include and exclude certain kinds of anomalies from the final consolidated list. Also the final
reporting could be customized according to the needs of the business users by setting up the
reporting module accordingly in the framework.
3. RESULTS
The framework was successfully implemented using R and Python and tested on eight different
datasets across multiple KPIs. The implementation has benefited businesses by reducing the time
latency between report availability and actual actions from the report, spotting variations in
smaller segments which were initially neglected, and avoiding redundant actions by intimating
the right stakeholders based on assigned accountability.
On average, 6 out of 20 root cause reports generated during the Beta phase gave conclusive and
actionable insights on the anomalies in the data. The time required to forecast and generate
reports increases exponentially with the number of dimensions. However, with the use of parallel
processing, the time taken to generate the results for a dataset with close to 8 dimensions
(200,000 columns) was reduced from 9 hours to 1 hour.
With the aid of the shiny package in R and PyQt, it was possible to create a user-friendly UI for
taking inputs and displaying interactive dashboards.
4. CONCLUSIONS
The increasing reliance on data for decision-making has added pressure on analysts to provide
quick and accurate reports. The volume of data has increased manifold in the past decade. If such
vast volumes of data can be managed and processed more efficiently, it could lead to multifold
gains for all organisations. The framework discussed in this paper finds application in a wide
range of domains for KPI tracking, A/B testing, campaign tracking and product evaluations.
The methodology can further be improved by adding functionalities like allowing external
regressors and multiplicative metrics, an interface for integration with Business Intelligence
tools, and models that can handle high-frequency data. Distributed computing via big data
platforms can further increase the scalability of this approach.
ACKNOWLEDGEMENTS
The authors would like to thank LatentView Analytics for providing the opportunity, inputs and
resources to develop this framework. The authors would also like to express their gratitude to
colleagues who supported this work, and to extend their thanks to Prof. Subhash Subramanian
and Prof. Mandeep Sandhu for their guidance and inputs in writing and proofreading the paper.
REFERENCES
[1] R implementation by B. D. Ripley and Martin Maechler (spar/lambda, etc). https://www.r-project.org/Licenses/GPL-2
[2] Hyndman, R.J., Akram, Md., and Archibald, B. (2008) "The admissible parameter space for exponential smoothing models", Annals of Statistical Mathematics, 60(2), 407-426.
[3] Hyndman, R.J., Koehler, A.B., Snyder, R.D., and Grose, S. (2002) "A state space framework for automatic forecasting using exponential smoothing methods", International Journal of Forecasting, 18(3), 439-454.
[4] Hyndman, R.J. and Khandakar, Y. (2008) "Automatic time series forecasting: The forecast package for R", Journal of Statistical Software, 26(3).
[5] Leonard, Michael. "Large-Scale Automatic Forecasting Using Inputs and Calendar Events", White Paper (2005): 1-27.
[6] Thulasiraman, K.; Swamy, M. N. S. (1992), "5.7 Acyclic Directed Graphs", Graphs: Theory and Algorithms, John Wiley and Son, p. 118, ISBN 978-0-471-51356-8.
[7] http://robjhyndman.com/hyndsight/intervals/
AUTHORS
Vivek Sankar is a postgraduate in Business Administration from Sri Sathya Sai Institute of
Higher Learning and has been working at LatentView Analytics, Chennai, since September 2012.
Somendra Tripathi received his B.Tech degree in Computer Science from Vellore Institute of
Technology in 2013. He is currently working at LatentView Analytics, Chennai.