User behaviour analysis (UBA) is the set of methods, techniques, and mindset for collecting, combining, and analysing quantitative and qualitative user data to understand how users interact with a product, and why. Among those data points are some we call anomalies: points that stand out and at times carry a wealth of indications and signals, valuable for the product and the business in general.
In this talk I will go from general UBA to more specific anomalous cases, and in particular to some cases of fraud and anti-money-laundering (AML), covering some existing ML methods and the discussions around them.
2. Why are users doing what they are doing? How do we make sense of it?
Tracking, collecting, and assessing user data and activities using ML.
USER BEHAVIOUR ANALYSIS
Muhammad Ali Norozi
3. Outline
● Introduction: User Behaviour Analysis (UBA)
● Generic UBA
○ ML and UBA
● Anomaly and outlier detection
● Negative anomalies
● Positive anomalies
● Discussion
5. What?
● User behavior encompasses all the actions users take on a product: where and what they click on, how they move from one state to another (e.g., from active to inactive), where they stumble, and where they eventually drop off and leave.
● Tracking user behavior gives you an inside look at how people interact with your product (a credit card, for example) and what obstacles or hooks they experience in their journey as your customers.
● User behavior analysis (UBA) is a tailored method for collecting, combining, and analyzing quantitative and qualitative user data to understand how users interact with a product, and why.
6. When you want an answer to pressing business/research questions such as “Why are users coming to my product/services?” or “Why are they leaving or not coming?”, traditional analytics alone can tell you that quantitative activity is happening, but can’t give you any of the ‘whys’. That’s where user behavior analysis comes into play, with specific tools and techniques that help you get a full picture of user behavior.
7. ● Demographics: who are the users?
● Retention: how regularly do they use the product?
● Engagement: how much time do they spend in your product?
● Average revenue: how much do they spend?
8. It’s all about events!
● Drivers: what brings the users in?
● Hooks: what persuaded the users to act?
● Barriers: where and why do users stumble and abandon?
9. Benefits
● Get real, first-hand insight into what people are interested in, gravitating towards, or ignoring
● Verify and validate hypotheses
● See how specific user needs change over time
● Investigate how specific flows and sections are performing
● Understand what your customers want and care about, and subsequently align the product and seize the opportunities
In a nutshell: find the answer to the core question of user satisfaction.
11. Machine Learning (ML)
Arthur Samuel (1959):
“Machine Learning is the science of getting computers to learn without being explicitly programmed.”
Tom Mitchell (1998):
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
12. ML and UBA
● Self-adjusting, dynamic behaviour patterns
● Find hidden patterns in user behaviour
● Escape postmortem rules and signatures (not a viable solution on their own)
● Detect unknown patterns and make them visible and usable for unfolding users’ intent
● Behavioural profiles
Still needs expert knowledge and human intervention.
13. ML ... (rocket science?)
● ML tasks (given the right data, the right parameters, and time, they give results with good accuracy)
○ Clustering
○ Regression
○ Classification
○ Anomaly detection
○ …
● Learning patterns from data (learn from seen data to predict on unseen data)
○ Supervised learning
○ Unsupervised learning
○ Semi-supervised learning, with tips from both the data and humans
○ Reinforcement learning, with a performance feedback loop
○ ...
14. ML ...
● ML model (a snapshot of all trained algorithms, parameters, features, and environments; most time is spent tuning parameters, not developing)
○ Feature extraction and engineering (finding the best set of features requires domain expert knowledge)
○ Model parameters (learned)
○ Model hyperparameters (architecture)
● ML features
○ Categorical
○ Statistical
○ Empirical
○ Continuous
○ Binary
○ ...
15. It's all about networks
● User profile data: the interaction data
● Represent the complex tree and network of users and items as a matrix
● Reverse search:
○ User profile and item profile: user features, item features, and joint features via matrix factorization (collaborative filtering). Feed these to a learning algorithm and let it predict the probability of a user purchasing an item.
● Cluster users and cluster items, and see the correlation of user clusters with item clusters.
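The matrix-factorization idea above can be sketched in a few lines: factorize a toy user-item matrix with a truncated SVD and use the low-rank reconstruction as a purchase-affinity score. The matrix and all values are illustrative, not from the talk.

```python
import numpy as np

# Toy user-item interaction matrix (rows: users, cols: items);
# 1 = purchased, 0 = no interaction. Purely illustrative data.
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
], dtype=float)

# Truncated SVD as a simple matrix factorization: R ~ U_k @ S_k @ Vt_k.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat[u, i] is the predicted affinity of user u for item i;
# unpurchased items with high scores are candidate recommendations.
def recommend(user, n=1):
    scores = R_hat[user].copy()
    scores[R[user] > 0] = -np.inf   # mask items already purchased
    return np.argsort(scores)[::-1][:n]
```

The same low-rank factors can also feed the clustering step mentioned above: rows of `U[:, :k]` cluster users, columns of `Vt[:k, :]` cluster items.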
16. Data is at the centre
● Data Sources (all direct or indirect information available)
○ APIs
○ Logs
○ Databases
○ Log archives
○ Log management tools (e.g., Humio, Dynatrace, and others)
○ Monitoring tools (e.g., Prometheus)
○ …
● Data formats (JSON, CSV, TSV, ...)
○ Syslog
○ Key-value sources
○ Distributed file systems, e.g., the Hadoop Distributed File System (HDFS)
○ ...
17. Data normalization
● Understand the data sources and formats
● Bring all formats to the same convention
● Find duplicate and missing fields
○ One action generates several entries
○ A user does not fill in a field in the application
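The normalization steps above can be sketched as follows: map each source's field names onto one shared convention, then collapse duplicate entries produced by a single action. The source names and field names here are hypothetical, chosen only for illustration.

```python
# Hypothetical records from two sources with different field conventions.
api_event = {"userId": "u1", "ts": "2021-01-01T10:00:00", "action": "click"}
log_event = {"user": "u1", "timestamp": "2021-01-01T10:00:00", "event": "click"}

# Bring all formats to the same convention: one field map per known alias.
FIELD_MAP = {
    "userId": "user_id", "user": "user_id",
    "ts": "timestamp", "timestamp": "timestamp",
    "action": "event", "event": "event",
}

def normalize(record):
    # Unknown fields are dropped; missing ones simply stay absent.
    return {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}

def dedupe(records):
    # One user action can generate several identical entries; keep one.
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```

After normalization, the two records above become identical and deduplicate to a single entry.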
22. Is UBA grey? Yes, it is :)
● GDPR and other regulatory compliance.
● Processing lots and lots of user data without the users being aware of it.
● Honeypot products which squeeze out user data and then cash in on it.
● ...
24. Anomaly (statistical deviations?)
● A deviant, unusual data point; when the data-generating process behaves unusually, it results in outliers
● Outlier detection is well researched in both the statistics and data science worlds.
● Static anomalies (analyzed individually)
○ Unusual action
○ Unusual context
● Temporal anomalies
○ Unusual time
○ Unexpected event
○ Huge event volume
○ ...
25. Anomaly categorization
The abnormal data points can be categorized further by whether the data being examined is a
set, a sequence, or a structure of inherently higher dimensionality, which leads to:
1. Point Anomaly:
a. One or more data points in the collection are anomalous
2. Context Anomaly:
a. A data point is anomalous with respect to its neighbors or to other data points which share
a context or some features
3. Collective Anomaly:
a. A collection of possibly similar data points that behaves differently from the rest of the
collection.
26. Supervised and Unsupervised
● Supervised
○ The training data is pre-labelled or characterized by domain experts, and the anomaly detection task merely involves measuring the variation of new data points from such models.
● Unsupervised
○ Used, on the other hand, when the data is not labelled or characterized, because labelling is a laborious task and/or domain experts are lacking. There are no prior labels which conclusively distinguish abnormal data points from normal ones, so these algorithms focus primarily on identifying anomalies within a finite set of data points. The distance between two data points is measured using, for instance, Euclidean distance or another distance measure (e.g., Mahalanobis distance, the distance between a point p and a distribution d).
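The difference between the two distance measures just mentioned can be shown on a small synthetic point cloud: Euclidean distance ignores the shape of the distribution, while Mahalanobis distance scales by the covariance, so a point far along a low-variance axis scores as much more unusual. All data here is synthetic.

```python
import numpy as np

# Synthetic 2-D points standing in for user feature vectors:
# large spread along the first axis, small spread along the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [0.0, 0.5]])

p = np.array([0.0, 3.0])                     # candidate outlier
mu = X.mean(axis=0)

# Euclidean distance from the point to the distribution's centre.
euclid = float(np.linalg.norm(p - mu))

# Mahalanobis distance: sqrt((p - mu)^T  Cov^-1  (p - mu)).
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = p - mu
mahal = float(np.sqrt(diff @ cov_inv @ diff))
```

Here `p` sits only ~3 units from the centre in Euclidean terms, but roughly 6 standard deviations out along the low-variance axis in Mahalanobis terms, which is why the latter is preferred when features have very different scales or are correlated.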
27. Methods categorization
● Density-based
○ DBSCAN (non-parametric; outliers are points that lie alone in low-density regions)
○ LOF (local outlier factor: the local deviation of a given data point with respect to its neighbours)
● Distance-based (spatial proximity)
○ k-NN
○ k-means
○ Distance to the regression hyperplane (if regression is used)
● Parametric (assume some sort of “form” to the data)
○ GMM (Gaussian mixture model)
○ One-class SVM
○ Extreme value theory
● Others (non-ML)
○ Statistical tests: z-score
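Of the methods above, the z-score test is the simplest to sketch: standardize each observation and flag those beyond a fixed number of standard deviations. The transaction counts below are made up for illustration.

```python
import numpy as np

# Daily transaction counts for one user; the spike is an injected anomaly.
counts = np.array([12, 14, 11, 13, 15, 12, 14, 13, 95, 12, 13, 14],
                  dtype=float)

# Standardize: how many standard deviations is each day from the mean?
mu, sigma = counts.mean(), counts.std()
z = (counts - mu) / sigma

# Flag points more than 3 standard deviations from the mean.
outliers = np.where(np.abs(z) > 3)[0]
```

This flags only the injected spike (index 8). Note a known weakness: the outlier itself inflates the mean and standard deviation, which is one reason the more robust ML-based methods above are used in practice.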
28. Isolation Forest uses a different approach: instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset.
Multivariate anomaly detection techniques, e.g., PCA-based.
Methods based on (artificial) neural networks, e.g., autoencoders.
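The PCA-based multivariate technique mentioned above can be sketched with plain numpy: project the data onto its top principal component, reconstruct, and score each point by its reconstruction error, so points that break the dominant correlation stand out. All data is synthetic.

```python
import numpy as np

# Correlated 2-D data plus one point that breaks the correlation.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=300)])
X = np.vstack([X, [[0.0, 5.0]]])           # injected multivariate anomaly

# PCA via SVD on the centred data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rank-1 reconstruction from the first principal component.
proj = Xc @ Vt[0]
recon = np.outer(proj, Vt[0])

# Anomaly score: distance from each point to its reconstruction.
errors = np.linalg.norm(Xc - recon, axis=1)
top = int(np.argmax(errors))               # index of the injected point
```

The injected point is unremarkable in either coordinate alone; only the joint (multivariate) view exposes it, which is exactly the case this family of methods targets.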
30. Application to Banking
● Retail bank (consumer banking)
○ Credit card fraud
○ Money laundering (ML) through the retail bank
● Private bank (banking for HNWIs)
○ Market abuse
○ Money laundering through the private bank
○ Other fraud
● Investment bank (serves governments, corporations & institutions)
○ Market abuse
○ Money laundering through the investment bank
○ Other fraud
As money laundering differs across these three bank types, so do the detections, the approaches, and the red flags.
31. Supervised vs Unsupervised
Automated fraud detection is an inherently different problem than automated money laundering detection and market abuse detection.
● In credit card fraud, we know what a true positive (TP) looks like. How? The customer tells us. We can use a supervised approach here.
● In market abuse detection we also often know what TPs look like. How? Positive PnL and price moves can indicate this. Supervised!
● In automated money laundering detection we do not know whether a data record is a TP or not, so it is unsupervised in nature (labelling the data is impractical).
32. Unsupervised - DIFFICULT?
The main issues with an analytical approach to AML:
● SEVERE CLASS IMBALANCE - money laundering is estimated at less than 0.1% of transactions
● SEVERE CLASS OVERLAP - money laundering is mixed in with legal financial activity
● CONCEPT DRIFT - money laundering techniques keep changing, even within the same criminal organisation
● UNCERTAINTY AROUND THE DATA MODEL - the confusion matrix
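The class imbalance point deserves a concrete illustration of why the confusion matrix, not raw accuracy, has to drive evaluation: with fewer than 0.1% positives, a detector that flags nothing at all is still over 99.9% accurate. The simulation below is illustrative only.

```python
import numpy as np

# Simulated labels with severe imbalance: ~0.1% "suspicious".
rng = np.random.default_rng(2)
y = (rng.random(100_000) < 0.001).astype(int)

# A "detector" that flags nothing is almost perfectly accurate...
always_normal = np.zeros_like(y)
accuracy = float((always_normal == y).mean())

# ...yet catches zero true positives, so its recall is zero.
tp = int(((always_normal == 1) & (y == 1)).sum())
recall = tp / max(int(y.sum()), 1)
```

This is why the discussion above keeps returning to the confusion matrix: under severe imbalance, recall on the positive class and the false-positive burden are the numbers that matter.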
33. Automated AML
● It's not a simple anomaly detection problem; it's not really an outlier detection problem either.
○ Many patterns of transactions associated with money laundering differ little from legitimate transactions.
● Outliers are often hidden in the unusual local behaviour of low-dimensional subspaces
● The choice of “normal” depends on the subject matter
34. Learning from “negativity”
● Allow through a small percentage of transactions which were tagged/classified as “bad”. There is a double benefit to this approach:
○ First and foremost, the learner will continue learning from the new cases, instead of stopping them and eventually “forgetting” them
○ Secondly, and also importantly, it reduces the false positive (FP) cases
37. Good anomalies (a minority community of good anomalies)
Humans and machines have a tendency to classify anomalies as “bad” while failing to successfully recognise “good” anomalies.
● Withdrawal of huge amounts at once, or of small amounts regularly
● Deposit/receipt of huge amounts, or of very small amounts regularly:
○ The first step is to identify those transactions which are abnormal or outliers. The second is to predict the occurrence of such an event.
■ What causes this event to occur, the reasons:
■ Is it because of the time of year, e.g., Christmas?
■ Is it because of a weather change?
■ Or is it totally random?
● Changes in life situation:
○ New marital status
○ Having children
○ Needing a new loan
○ Kids are old enough to live on their own
○ ...
Any other cases of good anomalies?
38. Machine learning
● Find the right algorithm for the task at hand, i.e., anomaly detection (for temporal anomalies, e.g., an LSTM, which has a feedback loop)
● Implement the algorithm and its environment
● Optimize the model for its best accuracy
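The temporal-anomaly idea behind the LSTM suggestion is forecast-then-threshold: predict each point from history and flag large prediction errors. As a minimal sketch, the forecaster below is a trivial same-weekday-last-week predictor standing in for an LSTM; the seasonal series and the injected spike are made up.

```python
import numpy as np

# Weekly-seasonal transaction volumes over 8 weeks, with one injected spike.
base = np.tile([10.0, 12.0, 11.0, 13.0, 12.0, 30.0, 28.0], 8)
series = base.copy()
series[40] += 60.0                        # temporal anomaly

# Predict each point from the same weekday one week earlier, then flag
# large prediction errors -- the same error-thresholding scheme an LSTM
# forecaster would use, with a trivial predictor standing in.
pred = series[:-7]
errors = np.abs(series[7:] - pred)
threshold = errors.mean() + 3 * errors.std()
anomalies = np.where(errors > threshold)[0] + 7   # back to series indices
```

Note the second flagged index: the spike also corrupts the prediction for the same weekday one week later, so its echo is flagged too. A real forecaster would need its training data cleaned, or flagged points excluded, to avoid such echoes.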
39. Model parameters (an engineering task?)
● Architecture (global or hyperparameters that define the high-level behavior of the neural network; in other words, translating domain expertise into numbers and algorithms)
○ Number of layers, number of neurons, activation function, loss function, optimizer, ...
● Data (how the data is prepared)
○ Features, knowledge base, sequence length, normalization, ...
● Training (how we see the results and tune accuracy; evaluations)
○ Epochs, batch size, threshold, distance, smoothing, ...
40. Conclusions & Discussions
● UBA is grey, often
● UBA is good and gives the user’s perspective of the system.
○ Monitoring the system from the user’s point of view
○ Testing the product as the user sees it
● What about RPA and UBA in general, and anomaly detection specifically?
● Successful automatic anomaly detection starts with asking the right questions about what is truly unusual, and building a set of data models to mimic this: “exploring low-dimensional subspaces with flags (maybe red / green)”
● What could good anomalies be in some specific use cases?
○ How can they be turned into opportunities?
○ Can ML help?