www.artivatic.com contact@artivatic.com
Copyright © Artivatic Data Labs Private Limited
Artivatic AI Labs Capabilities for
Anomaly & Fraud Detection
Version 1.0
Artivatic Technology Team
Ownership
Artivatic Data Labs Private Limited is the owner of the document. Unless otherwise specified, no part of this document may
be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm,
without permission in writing from Artivatic. Similarly, distribution of this document to a third party is prohibited unless
specific approval is obtained from Artivatic.
Proposed Possible Solution by Artivatic
The Artivatic team studied the problem, considering the required outputs, needs and processes, to identify
the best solution for anomaly detection on time-series data and fraud detection across multiple
sectors.
Introduction
Anomaly detection is the identification of items, events or observations which do not
conform to an expected pattern or other items in a dataset. The goal of anomaly detection is to
identify unusual or suspicious cases based on deviation from the norm within data that is
seemingly homogeneous.
Categories of anomaly detection techniques
Three broad categories of anomaly detection techniques exist.
1. Unsupervised anomaly detection techniques detect anomalies in an unlabelled test
data set, under the assumption that the majority of instances in the data set are
normal, by looking for the instances that fit the remainder of the data set least well.
2. Supervised anomaly detection techniques require a data set whose instances have been
labelled as "normal" or "abnormal" and involve training a classifier (the key difference
from many other statistical classification problems is the inherently unbalanced nature
of outlier detection).
3. Semi-supervised anomaly detection techniques construct a model representing
normal behaviour from a given normal training data set, and then test the
likelihood that a test instance was generated by the learnt model.
A short code illustration of the three categories follows this list.
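The sketch below is an illustrative mapping of the three categories onto common scikit-learn estimators. It is not taken from the document; the library, the chosen estimators and all parameter values are assumptions picked for brevity rather than Artivatic's actual stack.

```python
# Illustrative only: how the three anomaly-detection categories can be expressed
# with scikit-learn. Estimator and parameter choices are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest          # unsupervised
from sklearn.linear_model import LogisticRegression   # supervised
from sklearn.svm import OneClassSVM                   # semi-supervised (one-class)

rng = np.random.default_rng(0)
normal = rng.normal(loc=100.0, scale=5.0, size=(500, 1))   # mostly "normal" values
outliers = rng.uniform(low=0.0, high=30.0, size=(10, 1))   # a few abnormal values
X = np.vstack([normal, outliers])

# 1. Unsupervised: no labels, assume most points are normal.
unsup_flags = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)  # -1 = anomaly

# 2. Supervised: labelled normal/abnormal data; class imbalance must be handled.
y = np.array([0] * len(normal) + [1] * len(outliers))
sup = LogisticRegression(class_weight="balanced").fit(X, y)

# 3. Semi-supervised: train only on normal data, then score how well new points fit.
semi_flags = OneClassSVM(nu=0.02).fit(normal).predict(X)   # -1 = does not fit the learnt model
```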
Techniques
A. Density-based spatial clustering of applications with noise (DBSCAN)
1. Introduction:
a. DBSCAN works on the density of the data points.
b. Each cluster has a considerably higher density of points than the region outside the cluster.
2. Algorithm:
a. Arbitrarily select a point p.
b. Retrieve all points density-reachable from p with respect to the neighbourhood radius (eps) and MinPts.
c. If p is a core point, a cluster is formed.
d. If p is a border point, no points are density-reachable from p and DBSCAN visits the
next point of the database.
e. Continue the process until all of the points have been processed.
3. How can we apply it?
a. The client should provide sales data per city/zone.
b. We apply the DBSCAN algorithm to the historical as well as the current data.
c. Sales figures tend to stay within a small range; e.g. the sales of noodles in Bangalore are
approx. 10k every month, so the data-point density near 10k is high.
d. Because DBSCAN works on density, a sales value outside the historical range of the data is not
density-reachable from the main cluster and is reported as an anomaly. Ex: if in the month of
June the sales of noodles drop to 8k, that month is flagged as an anomaly (a minimal code
sketch of this check follows this section).
4. Advantage: no training or model creation is required.
5. Disadvantage: if the same anomalous value recurs often enough, it forms its own dense region and is no longer detected.
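A minimal sketch of the DBSCAN-based check described above, assuming monthly sales figures for one product in one city. The use of scikit-learn, the eps/min_samples values and the sample data are illustrative assumptions, not part of the original proposal.

```python
# Sketch: flag sales values that fall outside the dense historical range via DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

# Historical + current monthly sales for one product in one city (~10k per month).
sales = np.array([10100, 9900, 10050, 10200, 9950, 10000, 9800, 10150,
                  8000]).reshape(-1, 1)   # last value is the month under test

# eps = neighbourhood radius, min_samples = MinPts from the algorithm above (assumed values).
labels = DBSCAN(eps=300, min_samples=3).fit_predict(sales)

# DBSCAN labels points that are not density-reachable from any cluster as -1 (noise).
for value, label in zip(sales.ravel(), labels):
    if label == -1:
        print(f"Anomaly: sales value {value} is outside the dense historical range")
```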
B. Regression Based Anomaly Detection
1. Introduction
a. Regression analysis is a predictive modelling technique that investigates the relationship
between a dependent (target) variable and one or more independent (predictor) variables.
b. The technique is used for forecasting, time-series modelling and estimating causal
relationships between variables.
2. Type of Regression
a. Linear Regression
b. Logistic Regression
c. Polynomial Regression
d. Stepwise Regression
3. How can we apply it for the client?
a. The client should provide data in time-series format.
b. From the time-series data we can predict the expected sales for future periods.
c. If the sales figure for a particular time falls outside the predicted value +/- a
threshold value, it is detected as an anomaly.
d. The threshold is calculated from the average of the previous prediction errors.
e. The threshold is re-calculated automatically after every training run (see the sketch after this list).
4. Advantage: repeated anomalies are also detected.
5. Disadvantage: a prediction model must be created, and the approach is slower than DBSCAN.
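A minimal sketch of the regression-based check, assuming a simple time-indexed sales series. The use of scikit-learn, a plain linear fit and the threshold multiplier are illustrative assumptions.

```python
# Sketch: predict the expected value from a time-series fit and flag observations
# that fall outside prediction +/- threshold, with the threshold derived from the
# average of previous prediction errors (re-computed after every training run).
import numpy as np
from sklearn.linear_model import LinearRegression

history = np.array([100, 105, 102, 108, 110, 112, 115, 118], dtype=float)
t = np.arange(len(history)).reshape(-1, 1)

model = LinearRegression().fit(t, history)

residuals = np.abs(history - model.predict(t))
threshold = residuals.mean() * 3          # multiplier of 3 is an assumption

# Check the next observed value against prediction +/- threshold.
predicted = model.predict(np.array([[len(history)]]))[0]
observed = 90.0
if abs(observed - predicted) > threshold:
    print(f"Anomaly: observed {observed}, expected ~{predicted:.1f} +/- {threshold:.1f}")
```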
Types of Anomalies
Anomalies can be broadly categorized as:
1. Point anomalies:
a. A single instance of data is anomalous if it's too far off from the rest.
b. Business use case: Detecting credit card fraud based on "amount spent."
Fig 1. Point anomalies
2. Contextual anomalies:
a. An individual data instance is anomalous within a context
b. Requires a notion of context
c. Also referred to as “conditional anomalies”.
d. Business use case: Spending $100 on food every day during the holiday season
is normal, but may be odd otherwise.
Fig 2. Contextual anomalies
3. Collective anomalies:
a. A collection of related data instances is anomalous
b. Requires a relationship among data instances
i. Sequential Data
ii. Spatial Data
iii. Graph Data
c. The individual instances within a collective anomaly are not anomalous by
themselves.
d. Business use case: Someone is trying to copy data from a remote machine to a
local host unexpectedly, an anomaly that would be flagged as a potential cyber
attack.
Fig 3. Collective anomalies
Solution Process by Anomaly Type
1. Point anomalies:
a. Input data points that do not conform to historical models are identified as
anomalous.
Ex: The sales value of a particular product is very high compared with the overall
average sales of that product.
b. A large difference between two similar products from the same brand should be
identified as an outlier.
Ex: If the sales of “AXE Dark Temptation” and “AXE Blast Deodorant” differ
hugely, that can be anomalous.
c. A large gap between the sales figures of competing brands should also be identified.
Ex: The sales of one brand of shampoo are decreasing compared with the sales of a
competing brand of shampoo.
d. Solution: Point anomalies can be found using the AV Hybrid approach for anomaly
detection; a minimal sketch of such checks follows this item.
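An illustrative sketch of the point-anomaly checks above: a deviation-from-history test plus a pairwise gap test between comparable products. The thresholds, function names and sample numbers are assumptions for illustration, not the AV Hybrid implementation itself.

```python
# Sketch: two simple point-anomaly checks matching the cases described above.
import numpy as np

def deviates_from_history(value, history, k=3.0):
    """Flag a value more than k standard deviations from the historical mean."""
    history = np.asarray(history, dtype=float)
    return abs(value - history.mean()) > k * history.std()

def large_gap(sales_a, sales_b, max_ratio=2.0):
    """Flag two comparable products (same brand or competing brands) whose
    sales differ by more than max_ratio."""
    hi, lo = max(sales_a, sales_b), max(min(sales_a, sales_b), 1e-9)
    return hi / lo > max_ratio

print(deviates_from_history(25000, [9800, 10050, 10200, 9950, 10100]))  # True
print(large_gap(12000, 4000))                                           # True
```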
2. Contextual anomalies:
a. A sudden spike or drop is considered a contextual anomaly.
Ex: A sudden 25% drop in the sales of a particular product is considered an
anomaly.
b. A random change, or a value that is not correct according to the context of
the attribute, is also considered an anomaly.
Ex: The sales value of a product is $0 on weekdays.
c. Solution: To find contextual anomalies we compare the input data with the
latest historical data and check whether there is a sudden drop (or spike) above the
threshold; we also derive the context of the attribute and apply it to find the anomaly
(see the sketch after this item).
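A sketch of the contextual checks above: a relative-change test against recent history plus a simple context rule (zero sales on a weekday). The threshold, function names and data are illustrative assumptions.

```python
# Sketch: contextual-anomaly checks (sudden change vs recent history, context rule).
import numpy as np

def sudden_change(latest, recent_history, max_change=0.25):
    """Flag a spike or drop of more than max_change (e.g. 25%) vs the recent average."""
    baseline = float(np.mean(recent_history))
    return baseline > 0 and abs(latest - baseline) / baseline > max_change

def violates_context(value, is_weekday):
    """Flag values that contradict the attribute's context, e.g. $0 sales on a weekday."""
    return is_weekday and value == 0

print(sudden_change(7000, [9800, 10050, 10200, 9950]))  # True: ~30% drop
print(violates_context(0, is_weekday=True))             # True
```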
3. Collective anomalies:
a. A continuous change in the figures can be considered an anomaly.
Ex: A continuous drop in sales for 4-6 months may be anomalous.
b. A constant value over a long period of time can be an anomaly.
Ex: A sales value of 0 for a whole week.
c. Solution: Compare the input data with the previous dataset; a constant value (or a
sustained trend) can indicate an anomaly, as sketched below.
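A sketch of the collective-anomaly checks above: a constant-value run and a continuous decline over several periods. The window lengths and sample data are assumptions for illustration.

```python
# Sketch: collective-anomaly checks over a window of related data points.
import numpy as np

def constant_run(series, window=7):
    """Flag a series whose last `window` values are all identical (e.g. 0 for a whole week)."""
    series = np.asarray(series)
    return len(series) >= window and np.all(series[-window:] == series[-1])

def continuous_decline(series, periods=4):
    """Flag a strictly decreasing run over the last `periods` values (e.g. 4-6 months)."""
    tail = np.asarray(series[-periods:], dtype=float)
    return len(tail) == periods and np.all(np.diff(tail) < 0)

print(constant_run([120, 90, 0, 0, 0, 0, 0, 0, 0]))          # True: zero for a week
print(continuous_decline([100, 96, 88, 81, 70], periods=4))  # True: steady drop
```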
Artivatic’s Approach to the Client’s Anomaly and Fraud Detection
1. A ‘hybrid approach for anomaly detection’ will be developed.
2. The hybrid approach combines the advantages of both DBSCAN-based and
regression-based anomaly detection.
3. Initially we apply PCA for dimensionality reduction on the given dataset, which helps
reduce the complexity and runtime of anomaly detection.
4. After PCA, we apply the regression-based anomaly detection algorithm on the
dataset to obtain an initial set of anomalies.
5. In regression-based anomaly detection, a data point that crosses the threshold limit is
considered an anomaly.
6. In some edge cases, however, correct behaviour is identified as anomalous. For
example, sales figures of 0 every Sunday would be detected as outliers by the
regression-based algorithm.
7. To verify the outliers detected by the regression model, the attributes flagged as
anomalies are re-checked using density-based anomaly detection.
8. The outliers that remain after the density-based check are labelled as anomalous.
9. To find contextual anomalies we compare the input data with the latest historical
data and check whether there is a sudden drop above the threshold; we also derive the
context of the attribute and apply it to find the anomaly.
10. To find collective anomalies, the input data is compared with the previous
dataset; a constant value can indicate an anomaly.
Details of steps 9 and 10 will be defined after further analysis of the given data sets. A code
sketch of steps 3-8 follows.
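An end-to-end sketch of steps 3-8 of the hybrid pipeline (PCA, then regression residual flags, then DBSCAN verification). The use of scikit-learn, the function name and all parameter values are illustrative assumptions; the production stack (Scala/Hadoop/Mahout, as listed later) may implement this differently.

```python
# Sketch: hybrid anomaly detection = PCA -> regression residuals -> DBSCAN verification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cluster import DBSCAN

def hybrid_anomalies(X, y, eps=0.5, min_samples=3, k=3.0):
    """X: feature matrix per time period, y: observed sales figures.
    Returns indices flagged as anomalous by both stages."""
    # Step 3: reduce dimensionality to cut complexity and runtime.
    X_red = PCA(n_components=min(2, X.shape[1])).fit_transform(X)

    # Steps 4-5: regression prediction; flag points beyond prediction +/- threshold.
    model = LinearRegression().fit(X_red, y)
    residuals = np.abs(y - model.predict(X_red))
    threshold = residuals.mean() * k
    candidates = np.where(residuals > threshold)[0]

    # Steps 7-8: verify candidates with a density check so that regular patterns
    # (e.g. zero sales every Sunday), which form their own dense cluster, are dropped.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(y.reshape(-1, 1))
    return [i for i in candidates if labels[i] == -1]
```

Verifying the regression candidates with a density check is what lets regular patterns such as zero Sunday sales, which form their own dense cluster, drop out of the final anomaly list.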
System Architecture
Technologies, Languages and Cloud Platform
Technologies we use for the architecture, backend and models/algorithms:
Scala, C++, Java, Hadoop, OpenCV, Angular2
Our APIs/SDKs are available in Scala, Java, JavaScript, PHP, Android and iOS as well. New
technology wrappers can be built quickly based on the Client’s technology.
Database: Cassandra [data can be stored in any other database based on the Client’s needs]
For ML processing: Hadoop, Mahout
For NLP tooling: Java, C++
Cloud Platform: AWS, Google, Azure, own servers (the solution can also be installed on the
Client’s local/private servers; there are no such restrictions)
There are no hard technology dependencies: some time may be needed to build the required
technology-focused software as per the Client’s needs, but Artivatic has no technology, cloud,
server or data dependency blocking integration.
Estimation of Completion Time for the PoC
Task (with details where relevant): Approx. hours needed (Hrs)
A. Data retrieval from the client: 60
B. Data cleaning and pre-processing: 40
C. DBSCAN approach: 40
D. Regression-based approach: 40
E. Integration and creation of the hybrid approach: 20
F. Contextual anomaly detection module: 80
G. Collective anomaly detection module: 80
H. Dashboard (depends on whether it is built separately or integrated directly with the
existing Client dashboard): 80
I. Internal testing: 50
J. Integration and testing with the client (on the Client’s servers): 80
Total approx. hours: 475-570 Hrs
Approximate time = 60-70 days if a single person works on it. The duration can be reduced
considerably by allocating more effort, and the PoC can be finished within 1 month if needed
by the Client; timings are flexible.
Disclaimer: This document has been released solely for educational and informational purposes. Artivatic does not make any
representations or warranties whatsoever regarding the quality, reliability, functionality or compatibility of the products,
solutions, services and technologies mentioned herein. Depending on specific situations, products and solutions may need
customization, and performance and results may vary.
