This document provides an overview of advanced outlier detection and noise reduction techniques using Splunk and the Machine Learning Toolkit (MLTK). It discusses common ways to detect outliers including static thresholds, moving averages, density functions, and combining multiple methods. Ensemble learning and clustering algorithms are also introduced as ways to increase outlier detection accuracy.
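One of the methods mentioned above, moving-average-based outlier detection, can be sketched in a few lines of plain Python. This is an illustrative sketch of the general technique, not Splunk MLTK code; the window size and the 3-sigma threshold are assumptions chosen for the example.

```python
import statistics

def moving_average_outliers(series, window=5, k=3.0):
    """Flag points that deviate from the trailing moving average
    by more than k standard deviations of the trailing window."""
    outliers = []
    for i in range(window, len(series)):
        win = series[i - window:i]
        mean = statistics.mean(win)
        stdev = statistics.pstdev(win)
        if stdev and abs(series[i] - mean) > k * stdev:
            outliers.append(i)
    return outliers

data = [10, 11, 10, 12, 11, 10, 11, 95, 10, 12]
print(moving_average_outliers(data))  # [7] -- the spike at index 7
```

In Splunk this same idea is typically expressed with streaming statistics over a time window; combining it with a static threshold or a density-based method, as the document suggests, reduces false positives from any single technique.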
This document outlines a presentation on threat hunting with Splunk. The presenter is Ken Westin, a security strategist at Splunk with over 20 years of experience in technology and security. The agenda includes an overview of threat hunting basics and data sources, examining the cyber kill chain through a hands-on attack scenario using Splunk, and advanced threat hunting techniques including machine learning. Log-in credentials are provided for access to hands-on demo environments related to the presentation.
The document discusses threat hunting techniques using Splunk, including an overview of threat hunting basics, data sources for threat hunting, and Lockheed Martin's Cyber Kill Chain model. It provides examples of using endpoint data to hunt for threats across the kill chain by analyzing processes, communications, and file artifacts in a demo dataset. Advanced techniques discussed include hunting for SQL injection attacks and lateral movement.
The document is a presentation on threat hunting with Splunk. It discusses threat hunting basics and data sources, the cyber kill chain model, and conducting a hands-on attack scenario investigation using Splunk. It also covers advanced threat hunting techniques and tools, applying machine learning and data science to security, and increasing an organization's threat hunting maturity. The presentation includes examples of using Splunk to investigate a hypothetical attack spanning multiple stages of the cyber kill chain using various security data sources.
This document provides an overview of a presentation on security monitoring and analytics using Splunk. The presentation covers using Splunk Enterprise for security operations like alert management and incident response. It also covers using Splunk User Behavior Analytics to detect anomalies and threats using machine learning. The presentation highlights new features in Splunk Enterprise Security 4.1 like prioritizing investigations and expanded threat intelligence, and new features in Splunk UBA 2.2 like enhanced security analytics and custom threat modeling. It demonstrates integrating UBA results into the Splunk Enterprise Security workflow for faster investigation of advanced threats.
This document provides an overview of offensive open-source intelligence (OSINT) techniques. It defines OSINT and discusses the differences between offensive and defensive OSINT approaches. Offensive OSINT focuses on gathering as much public information as possible to facilitate an attack against a target. The document outlines the OSINT process and details specific techniques for harvesting data from public sources, including scraping websites, using APIs, searching social media, analyzing images and metadata, and researching infrastructure components like IP addresses, domains, and software versions. The goal of offensive OSINT is to discover valuable information like employee emails, usernames, relationships, locations and technical vulnerabilities to enable attacks like phishing, social engineering, and infiltration.
Your adversaries continue to attack and break into companies. You can no longer rely on alerts from point solutions alone to secure your network. To identify and mitigate these advanced threats, analysts must become proactive in identifying not just indicators, but attack patterns and behavior. In this workshop we walk through a hands-on exercise with a real-world attack scenario, illustrating how advanced correlations across multiple data sources and machine learning can enhance security analysts' ability to detect and quickly mitigate advanced attacks.
On your marks, get set, GO!
Take a more in-depth look at the automation and orchestration journey and the future of SOAR.
Watch the SOCtails video here: https://www.youtube.com/watch?v=YzsGQzqaDYw&t=2s
Find out how to threat hunt commonly found web shells in your infrastructure using the powerful Splunk querying language. Discover queries to hunt for various aspects of web shells and other malicious artifacts.
This document outlines an agenda for a presentation on threat hunting with Splunk. The presentation will cover threat hunting basics, data sources for threat hunting including Sysmon endpoint data, applying the cyber kill chain framework, and a hands-on demo of investigating an attack scenario across various Splunk data sources like endpoint, network, email, and threat intelligence. Credentials are provided for accessing the demo environment. An overview of Sysmon endpoint event data and using it to map processes and network connections is also given.
The document discusses different nmap scanning techniques including SYN scans, FIN scans, ACK scans, and window scans. It provides pros and cons of each technique. It then details a mission to penetrate SCO's firewall and discern open ports on a target system using different scan types. Another mission works to locate webservers on the Playboy network offering free images, optimizing the scan by getting timing information and scanning faster without DNS lookups. Several IP addresses with port 80 open are identified.
This document provides an overview of threat hunting using Splunk. It begins with an introduction to threat hunting and why it is important. The presentation then discusses key building blocks for driving threat hunting maturity, including search and visualization, data enrichment, ingesting data sources, and applying machine learning. It provides examples of internal data sources that can be used for hunting like IP addresses, network artifacts, DNS, and endpoint data. The presentation demonstrates hunting using the Microsoft Sysmon endpoint agent, walking through an example attack scenario matching the Cyber Kill Chain framework. It shows how to investigate a potential compromise by searching across web, DNS, proxy, firewall, and endpoint data in Splunk to trace suspicious activity back to a specific user.
The document outlines a presentation on threat hunting with Splunk. It provides an agenda that includes an overview of threat hunting basics and data sources, a demonstration of using Sysmon endpoint data to investigate an attack scenario according to the cyber kill chain framework, and a discussion of applying machine learning and data science to security. It also includes credentials for logging into the demo environment and notes that hands-on participation is part of the session.
The document provides an overview of the Metasploit framework. It describes Metasploit as an open-source penetration testing software that contains exploits, payloads, and other tools to help identify vulnerabilities. Key points covered include Metasploit's architecture and modules for scanning, exploitation, and post-exploitation. Examples of tasks that can be performed include port scanning, vulnerability assessment, exploiting known issues, and gaining access to systems using payloads and meterpreter sessions. The document warns that Metasploit should only be used for legitimate security testing and cautions about the potential risks if misused.
This document provides an overview of data models in Splunk:
- A data model maps raw machine data onto a hierarchical structure to encapsulate domain knowledge and enable non-technical users to interact with data via pivot reports.
- There are three root object types: events, searches, and transactions. Objects have constraints, attributes, and inherit properties from parent objects.
- Data models are built using the UI or REST API. Pivot reports leverage data models by generating optimized search strings from the model.
- Data model acceleration improves performance of pivot reports by pre-computing searches on disk. Only the first event object and descendants are accelerated by default.
Effective Threat Hunting with Tactical Threat Intelligence (Dhruv Majumdar)
How to set up a Threat Hunting Team for Active Defense utilizing Cyber Threat Intelligence and how CTI can help a company grow and improve its security posture.
The document discusses bug bounty hunting. It introduces Shubham Gupta and Yash Pandya who are security consultants and top bug hunters. It outlines the agenda which includes an introduction to bug bounty programs, reasons for bug hunting, how to find bugs, quick tips, proofs of concept, pros and cons, and a Q&A. It provides a brief history of bug bounty programs and notes that now anyone can participate from home. It discusses types of bugs and tools used for hunting. Quick tips include using Google dorks, testing for information disclosure vulnerabilities, and completing challenges to improve skills. Examples are provided of unique bugs found like SVG XSS and an IDOR issue found in Google.
XXE Exposed: SQLi, XSS, XXE and XEE against Web Services (Abraham Aranguren)
XXE Exposed Webinar Slides:
Brief coverage of SQLi and XSS against web services, leading into XXE and XEE attacks and their mitigation. Heavily inspired by the "Practical Web Defense" (PWD) style of pwnage + fixing (https://www.elearnsecurity.com/PWD)
Full recording here:
NOTE: (~20 minute) XXE + XEE Demo Recording starts at minute 25
https://www.elearnsecurity.com/collateral/webinar/xxe-exposed/
Exploring Frameworks of Splunk Enterprise Security (Splunk)
This document discusses Splunk Enterprise Security and its frameworks for addressing security operations challenges. It provides an overview of Splunk's security portfolio and how it can help with issues like slow investigations, limited data ingestion, and inflexible deployments faced by legacy SIEMs. Key frameworks covered include the Notable Events framework for streamlining incident management across the entire lifecycle from detection to remediation. It also discusses the Asset and Identity framework for automatically enriching incidents with relevant context to help with rapid qualification and situational awareness.
Cyber Threat Hunting: Identify and Hunt Down Intruders (Infosec)
View webinar: "Cyber Threat Hunting: Identify and Hunt Down Intruders": https://www2.infosecinstitute.com/l/12882/2018-11-29/b9gwfd
View companion webinar:
"Red Team Operations: Attack and Think Like a Criminal": https://www2.infosecinstitute.com/l/12882/2018-11-29/b9gw5q
Are you red team, blue team — or both? Get an inside look at the offensive and defensive sides of information security in our webinar series.
Senior Security Researcher and InfoSec Instructor Jeremy Martin discusses what it takes to be a modern-day threat hunter during our webinar, Cyber Threat Hunting: Identify and Hunt Down Intruders.
The webinar covers:
- The job duties of a Cyber Threat Hunting professional
- Frameworks and strategies for Cyber Threat Hunting
- How to get started and progress your defensive security career
- And questions from live viewers!
Learn about InfoSec Institute's Cyber Threat Hunting course here: https://www.infosecinstitute.com/courses/cyber-threat-hunting/
A military concept now applied to cybersecurity, the "cyber kill chain" was developed by Lockheed Martin in 2011. It describes the phases an adversary follows to target an organization. There are seven well-defined phases, and an attack is considered successful if/when all of them have been completed.
(DOCUMENT IN ENGLISH)
This document discusses Splunk Enterprise Security and its frameworks for analyzing security data. It provides an overview of Splunk's security portfolio and how it addresses challenges with legacy SIEM solutions. Key frameworks covered include Notable Events for streamlining incident management, Asset and Identity for enriching incidents with contextual data, Risk Analysis for prioritizing incidents based on quantitative risk scores, and Threat Intelligence for detecting indicators of compromise in machine data. Interactive dashboards and incident review interfaces are highlighted as ways to investigate threats and monitor the security posture.
The document is a presentation on threat hunting with Splunk. It discusses threat hunting basics, data sources for threat hunting, knowing your endpoint, and using the cyber kill chain framework. It outlines an agenda that includes a hands-on walkthrough of an attack scenario using Splunk's core capabilities. It also discusses advanced threat hunting techniques and tools, enterprise security walkthroughs, and applying machine learning and data science to security.
Threat hunting - Every day is hunting season (Ben Boyd)
Breakout Presentation by Ben Boyd during the 2018 Nebraska Cybersecurity Conference.
Introduction to Threat Hunting and helpful steps for building a Threat Hunting Program of any size, from small to massive.
The document provides an overview of network security threats and countermeasures. It discusses various types of threats like viruses, denial of service attacks, and spoofing. It recommends a defense-in-depth approach using multiple layers of security like firewalls, intrusion detection systems, antivirus software, and encryption. Specific security measures are examined, including network monitoring, access control, and securing servers and applications.
The top 10 windows logs event id's used v1.0 (Michael Gough)
How to catch malicious activity on Windows systems using properly configured audit logging, including the top 10 events (and more) that you must have enabled, configured, and alerting.
LOG-MD
MalwareArchaeology.com
This document discusses exploiting vulnerabilities related to HTTP host header tampering. It notes that tampering with the host header can lead to issues like password reset poisoning, cache poisoning, and cross-site scripting. It provides examples of how normal host header usage can be tampered with, including by spoofing the header to direct traffic to malicious sites. The document also lists some potential victims of host header attacks, like Drupal, Django and Joomla, and recommends developers check settings to restrict allowed hosts. It proposes methods for bruteforcing subdomains and host headers to find vulnerabilities.
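The recommended mitigation above, restricting allowed hosts, can be sketched as a small server-side check. This is an illustrative sketch (the allowlist and function name are assumptions, not from the document), and it mirrors the idea behind settings like Django's ALLOWED_HOSTS:

```python
# Minimal sketch of server-side Host header validation.
# Hostnames here are illustrative; does not handle IPv6 literals.
ALLOWED_HOSTS = {"example.com", "www.example.com"}

def is_host_allowed(host_header: str) -> bool:
    # Strip an optional port ("example.com:8080" -> "example.com")
    # and normalize case before comparing against the allowlist.
    host = host_header.split(":", 1)[0].strip().lower()
    return host in ALLOWED_HOSTS

print(is_host_allowed("www.example.com"))    # True
print(is_host_allowed("evil.attacker.net"))  # False: spoofed header rejected
print(is_host_allowed("EXAMPLE.COM:8080"))   # True: case and port normalized
```

Rejecting unrecognized Host values up front closes off the password-reset-poisoning and cache-poisoning vectors the document describes, since those rely on the application echoing an attacker-controlled host back into links or caches.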
The document discusses various C++ programming concepts including:
- The cin statement is used to read input from the keyboard and store values in variables. It is often used with cout to display prompts.
- Variables must be declared with valid names using letters, digits, and underscores. Keywords like int and float cannot be used as names.
- Different data types like int, float, and char are used to store different kinds of data. Variables of the specified types need to be declared before use.
- Arithmetic operators like +, -, *, /, and % are used to perform calculations in expressions and assignments. Parentheses can be used to alter operator precedence.
Object Oriented Programming Short Notes for Preparation of Exams (MuhammadTalha436)
The document appears to be lecture notes on object-oriented programming using C++. It covers key concepts like classes, objects, encapsulation, inheritance, and polymorphism. It also provides examples of input/output statements, arithmetic operators, assignment operators, and relational operators in C++ code. The document is divided into multiple chapters with topics like classes, inheritance, templates, and exceptions.
This document describes a student result system project created in the C programming language. It allows users to perform operations such as adding student records, viewing all records, searching records by roll number, calculating average marks, and sorting records by marks or roll number. The key algorithms used are merge sort for sorting and linear search for searching and insertion. The source code implements functions for the main menu, record insertion, display, sorting, searching, and average calculation. UML diagrams show the design of the student record class and the interaction between functions.
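The merge sort over student records described above can be sketched briefly. The original project is in C; this is a Python sketch of the same algorithm, with record fields ("roll", "marks") assumed for illustration:

```python
def merge_sort(records, key):
    """Stable merge sort over a list of record dicts, ordered by `key`."""
    if len(records) <= 1:
        return records
    mid = len(records) // 2
    left = merge_sort(records[:mid], key)
    right = merge_sort(records[mid:], key)
    # Merge the two sorted halves, taking the smaller head each time.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] <= right[j][key]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

students = [{"roll": 3, "marks": 72}, {"roll": 1, "marks": 88}, {"roll": 2, "marks": 55}]
by_marks = merge_sort(students, "marks")
print([s["roll"] for s in by_marks])  # [2, 3, 1] -- ascending by marks
```

Merge sort is a reasonable choice here because it is O(n log n) in the worst case and stable, so sorting by marks preserves the existing roll-number order among equal marks.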
This document discusses various methods for software cost estimation, including expert judgement techniques like the Delphi method, model-based techniques like COCOMO and Function Points, and dynamic models like Putnam and Parr that consider staffing levels and schedule over time. Static models estimate effort as a function of size factors alone while dynamic models also incorporate time-based elements. Both approaches rely at least partly on expert judgement and may not capture all project costs.
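The "static, size-based" style of estimation mentioned above can be illustrated with basic COCOMO, which computes effort from size alone. A minimal sketch, using Boehm's published organic-mode coefficients (the 32 KLOC input is an assumed example project):

```python
def cocomo_basic(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Basic COCOMO, organic mode: effort from size (KLOC) alone."""
    effort = a * kloc ** b       # effort in person-months
    duration = c * effort ** d   # development time in calendar months
    return effort, duration

effort, duration = cocomo_basic(32)  # a hypothetical 32 KLOC project
print(f"effort={effort:.1f} person-months, duration={duration:.1f} months")
```

Dynamic models such as Putnam's, by contrast, treat effort and schedule as linked through a time-based staffing curve rather than deriving both from size in one shot, which is the distinction the document draws between the two approaches.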
Optimization of workload prediction based on map reduce frame work in a cloud... (eSAT Journals)
Abstract: Nowadays cloud computing is an emerging technology, used to access resources anytime and anywhere through the internet. Hadoop is an open-source cloud computing environment that implements the Google MapReduce framework, designed for distributed processing of large datasets across large clusters of computers. This paper analyzes the workload of jobs run in cluster mode using Hadoop; MapReduce is the programming model in Hadoop used for handling that workload. Based on job analysis statistics, the future workload of the cluster is predicted for potential performance optimization using a genetic algorithm. Key Words: Cloud computing, Hadoop framework, MapReduce analysis, workload
Optimization of workload prediction based on map reduce frame work in a cloud...eSAT Publishing House
This document summarizes a research paper that proposes optimizing workload prediction in Hadoop clusters using MapReduce and genetic algorithms. It describes collecting job history data from Hadoop, analyzing workload patterns, and using genetic algorithms to predict future workloads and optimize performance. The implementation analyzes a sample Hadoop trace log to calculate error rates for workload predictions. The goal is to integrate workload prediction into multi-node Hadoop clusters for real-time optimization.
This document provides an overview of primitive data types, expressions, and definite loops (for loops) in Java. It discusses Java's primitive types like int, double, char, and boolean. It covers arithmetic operators, precedence rules, and mixing numeric types. It also introduces variables, declarations, assignments, and using variables in expressions. Finally, it explains the syntax of for loops using initialization, a test condition, and an update to repeat a block of code a specified number of times.
The document summarizes a MuleSoft meetup that took place in Warsaw, Poland on January 23rd, 2019. The meetup agenda included introductions, an introduction to DataWeave 2.0 focusing on map, filter and reduce functions, examples and use cases of these functions, and a Q&A session. The speaker was introduced and details were provided on organizing future meetups and providing feedback.
Here are some key usability principles that seem important for your project based on the information provided:
- Learnability: Since this is a class project, learnability principles like predictability, familiarity, and consistency are important to help users quickly understand how to use the system. The design should leverage existing concepts and have consistent behaviors.
- Flexibility: Allowing for multiple ways of completing tasks and customization supports different user needs and preferences. Incorporating options like alternative dialog flows and customization can improve flexibility.
- Robustness: Principles like recoverability, error prevention, and responsiveness are important to ensure the system is robust. The design should minimize potential for errors, support undo/redo, handle exceptions
Overview of the basic metrics for measuring the usability dimensions of effectiveness, efficiency, and satisfaction. Discussed metrics are task time, orientation, effort, errors, learnability, and usability. Some specific methods are presented and examples are provided.
The slides are from 19 Nov 2015, my talk at ISTA 2015 https://istacon.org/Home/Session/538e4223-a158-45a1-8d99-f6dfc018367b
This document discusses various techniques for estimating software project costs, schedules, and sizes. It covers function point analysis, lines of code estimation, productivity models like COCOMO, and probabilistic techniques like PERT estimation. Key approaches mentioned include analogies, decomposition, mathematical models, mean schedule dates, and probability distributions.
This document analyzes a cloud workload dataset from Google to characterize usage patterns. The key steps are:
1) The data is preprocessed and important attributes like CPU/memory usage are analyzed.
2) Clustering algorithms are used to classify users based on resource estimation ratios and tasks based on attributes.
3) Time series analysis via DTW is performed on tasks to identify patterns, and tasks are clustered.
4) For target high estimation ratio users, resource usage is predicted based on matching task patterns and allocated dynamically with a threshold to allow for spikes. This approach aims to reallocate unused resources to other users.
The document provides information about fully solved assignments for the winter 2013 semester in the BCA program. It lists the subject code and name as BCA2030 - Object Oriented Programming - C++. It provides 6 questions related to the subject and asks students to send their semester and specialization details to the provided email ID or call the given phone number to get the solved assignments. It provides answers to the 6 questions related to topics like objects and classes, friend functions, constructors vs destructors, operator overloading, virtual functions and polymorphism, and exception handling models.
The summary highlights that the document discusses getting fully solved winter 2013 semester assignments for the BCA program's subject on Object Oriented Programming - C
The document discusses C++ memory management and smart pointers. It provides an overview of common memory issues with pointers, the new and delete operators, overloading new and delete, and memory pools. It then discusses different types of smart pointers like scoped pointers and shared pointers, which implement reference counting to prevent memory leaks and dangling pointers while allowing multiple pointers to the same data.
Work measurement involves determining the time it should take to complete tasks through various techniques. Standard times are set based on how long a trained worker would take and are used for planning workloads, scheduling tasks, costing labor, and calculating productivity. These times are set by qualified observers using appropriate methods depending on factors like the task length, required precision, and cycle time. Common methods include predetermined motion time systems, timing tasks, estimating, and activity sampling to determine time percentages without continuous observation.
how to build a Length of Stay model for a ProofOfConcept projectZenodia Charpy
walk through end to end and in detail how a machine learning process on Healthcare related model works ( here i picked LengthOfStay probelm) as a touch point to start the discussion, the scope is set to POC
The document discusses database capacity planning and analysis. It covers collecting and analyzing resource utilization data, developing mathematical models to predict performance, using queueing theory and response time analysis. Steps outlined include determining goals, gathering workload data, characterizing and modeling data, validating forecasts, and conducting case studies on deletion performance and evaluating MySQL capacity. The overall aim is to accurately measure current capacity, predict future growth, maintain balanced performance, and identify risks.
Similar to Advanced Outlier Detection and Noise Reduction with Splunk & MLTK August 11, 2021 (20)
SFBA Splunk Usergroup meeting December 14, 2023Becky Burwell
The summary provides an overview of the key topics and announcements from the Splunk User Group meeting:
1. The meeting will start at 11:10 am PST with a welcome and announcements before speakers present.
2. Upcoming meeting dates and locations for 2023 are provided, including a virtual meeting in March 2023.
3. The presentation will cover writing documentation for Splunk, including administrator documentation, user documentation, and documenting known issues. Tips are provided about iterating on documentation.
The document discusses a Splunk User Group meeting where the CISO of Los Angeles discussed the importance of automation and intelligence to act on threats. It then provides an overview of threat intelligence and how Recorded Future collects and organizes data from various sources to understand the threat landscape. Finally, it describes how the Recorded Future integration with Splunk can help accelerate security workflows like investigation, automation, and strategic planning.
SFBA Splunk User Group Meeting February 2023Becky Burwell
This presentation provides an overview of Splunk apps and how to build Splunk addons. It discusses the different types of Splunk apps and addons, such as modular inputs, parsing configurations, and custom search commands. It also covers ways to build addons using the UCC framework or Addon Builder, as well as how to package and vet apps using CLI commands, APIs, and the packaging toolkit. Resources for learning app development are also provided.
SFBA Splunk Usergroup meeting December 2022Becky Burwell
This presentation discusses Splunk Ideas, a program that allows users to submit enhancement requests for Splunk products. It provides metrics on the number of ideas submitted, voted on, and implemented. The presentation outlines the lifecycle of an idea from submission to implementation. It also discusses upcoming improvements to Splunk Ideas including customer champions, newsletters, and better response rates.
SF Bay Area Splunk User Group Meeting October 5, 2022Becky Burwell
Andrew D'Auria, the Director of Sales Engineering at Anvilogic, gave a presentation on modernizing threat detection engineering. He discussed problems with the current detection engineering process, including that it is slow, results in noisy alerts, and lacks coordination across tools. D'Auria proposed using Anvilogic's platform to build detections based on MITRE ATT&CK techniques and scenarios, correlate events of interest without code, and measure detection program effectiveness to improve security operations. He provided examples of how Anvilogic helped a financial client improve detections and reduce alerts.
SFBA Splunk User Group Meeting August 10, 2022Becky Burwell
The document summarizes the agenda and presentations for the August SF Bay Area Splunk User Group meeting. Ryan O'Connor gave a presentation on Dashboard Studio and the Splunk UI. He discussed why to build with Dashboard Studio, how to quickly customize dashboards, reduce searches, and tips for building with Dashboard Studio. Rinita Datta then presented on driving customer success through self-service resources like the Adoption Boards, signing up for tech talks and newsletters, and finding guidance on Splunk Lantern.
Getting Started with Splunk Observability September 8, 2021Becky Burwell
This document provides an introduction to getting started with Splunk Observability, including setting up a Splunk Observability trial, installing integrations for Windows, Linux, and GCP, and collecting events and metrics from cloud and observability systems. It also references a workshop for further guidance and discusses plans to get the Gateway installation working and collecting more data.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Advanced Outlier Detection and Noise Reduction with Splunk & MLTK August 11, 2021
1. Advanced Outlier Detection and Noise Reduction with Splunk & MLTK
Presented by: Urwah Haq
August 10th, 2021
Presented by Urwah Haq @ San Francisco Splunk User Group
2. Slide 2
DI Confidential
18th Dec 2019
Agenda
1. Common Ways of finding outliers
• Review of some math terminology
• Review of the outlier detection blog and what it covers
• Re-introduce moving average & foreach function
2. Using the ‘density function’ in MLTK
• An example of ML algorithm to detect outliers
3. Combining Multiple methods 1+2
• Ensemble Learning (combining multiple ML methods)
4. T-Tests & Clustering – What are they and how to use them?
2
3. Slide 3
DI Confidential
18th Dec 2019
ML/Splunk Terminology Refresher
Statistics Terms:
• Mean/Average – the central value in a set of data
• Standard Deviation – a measure of the spread of the data (the higher the stdev, the larger the differences between the points)
• Time Series Data/Events – data that is collected/ingested in Splunk over intervals of time
ML Terms:
• Outliers – legitimate data points that deviate far from the norm
• Anomalies – an action that may seem out of order with the rest of the data
• Outliers vs Anomalies – for our purposes in Splunk, any deviation in data such as mb_out from firewall data or cpu/mem/network utilization can be considered an 'Outlier'. Anything involving user actions, such as Urwah installing 10+ Splunkbase applications on a Sunday, is considered an 'Anomaly'.
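The mean and standard deviation above combine into the simplest outlier test of all: flag values that sit more than N standard deviations from the mean. A minimal Python sketch (the sample numbers and the 2-sigma cutoff are illustrative assumptions, not from the slides):

```python
from statistics import mean, stdev

def zscore_outliers(values, n_sigma=2.0):
    """Flag values more than n_sigma standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > n_sigma * sigma]

# Hypothetical mb_out readings from firewall data, with one extreme spike.
# Note the spike itself inflates sigma, which is why a 2-sigma cutoff is
# used here rather than the more common 3.
mb_out = [10, 12, 11, 9, 13, 10, 11, 95]
print(zscore_outliers(mb_out))  # -> [95]
```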
[Diagram: anomaly types – Outliers, Relational anomalies, and others, shown as subsets of Anomalies]
4. 1 - What is an Outlier
• A point away from the body of data points
• A data point different from the rest of the points
• In Splunk, one of the most common ways to find outliers is to set boundaries
• If a data point deviates beyond these boundaries, tag it as an outlier
5. 1- Types of Outlier detection (NO ML)
Blog: https://discoveredintelligence.ca/quick-guide-to-outlier-detection-in-splunk/
1. Static Threshold
a) If value > X (fixed threshold), then the value is an outlier
2. Moving Threshold
a) If value > X (moving average or moving value), then the value is an outlier
b) Can use functions such as 'trendline sma/ema' or 'streamstats window=N'
c) We can get creative with this
Static Threshold
index=main user=* sourcetype=WinEventLog
| timechart count by user
| eval threshold=100

Moving Threshold
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=100

| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| eval limit=0
| rename OTHER as u_OTHER
| eval distinct_values=0
| foreach user_* [ eval distinct_values=if(<<FIELD>> >0,distinct_values+1,distinct_values)]
| eval average=round(total/distinct_values,2)
| eval average=if(distinct_values=1 AND average >50,round(average/5),average)
| table _time average user_*
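Outside of SPL, the two threshold rules above can be sketched in a few lines of Python (the series, the window of 3, and the 1.5x factor are invented for illustration):

```python
from statistics import mean

# Each entry is (time_bucket, value), standing in for a timechart row
series = [("9:00", 100), ("9:15", 110), ("9:30", 105), ("9:45", 400), ("10:00", 108)]

def static_outliers(series, threshold):
    """Rule 1: flag any value above a fixed threshold."""
    return [(t, v) for t, v in series if v > threshold]

def moving_outliers(series, window=3, factor=1.5):
    """Rule 2: flag values above factor * moving average of the previous `window` points."""
    flagged = []
    for i in range(window, len(series)):
        t, v = series[i]
        baseline = mean(v_prev for _, v_prev in series[i - window:i])
        if v > factor * baseline:
            flagged.append((t, v))
    return flagged

print(static_outliers(series, 300))  # -> [('9:45', 400)]
print(moving_outliers(series))       # -> [('9:45', 400)]
```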
6. 1 – How basic moving average works
Moving Thresholding
a) A moving threshold is not just the average of the past X data points; it can be much more
b) Basic search for a moving average of the past 5 data points:
| inputlookup user_usage.csv
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| table _time total
| trendline sma5(total) as 5_moving_average
Here is what a simple average looks like with window=2:
_time   User_a  User_b  User_C  User_D  User_E  Average              Moving Average
9:00    0       0       10      15      5       (0+0+10+15+5)/5 = 6  -
9:15    0       0       0       5       5       (0+0+0+5+5)/5 = 2    4
9:30    1       2       3       4       5       3                    2.5
9:45    0       1       5       4       5       3                    3
10:00   1       3       0       5       2       2.2                  2.1
10:15   1       0       4       6       3       2.8                  -
10:30   1       0       0       7       0       1.6                  -
(5 active users)
Using trendline
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| table _time total
| trendline sma5(total) as 5_moving_average

Using streamstats
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| table _time total
| trendline sma5(total) as 5_moving_average
| streamstats window=5 avg(total) as streamstats_moving_average

Using streamstats & autoregress
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| table _time total
| streamstats window=5 avg(total) as streamstats_moving_average
| autoregress streamstats_moving_average as previous_moving_average
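For readers outside Splunk, the same calculation can be sketched in Python. This analogue of `streamstats window=5 avg(total)` (sample totals invented) averages the values seen so far, capped at a window of 5:

```python
from collections import deque
from statistics import mean

def moving_average(values, window=5):
    """Average of the values seen so far, capped at `window` points --
    the same quantity as `streamstats window=5 avg(total)` (which, unlike
    `trendline sma5`, also emits the shorter partial windows at the start)."""
    buf = deque(maxlen=window)  # automatically drops the oldest value
    out = []
    for v in values:
        buf.append(v)
        out.append(round(mean(buf), 2))
    return out

totals = [30, 10, 15, 15, 11, 14, 8]  # invented per-interval totals
print(moving_average(totals))  # -> [30, 20, 18.33, 17.5, 16.2, 13, 12.6]
```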
8. 1 – Using Foreach Function to adjust moving average
• Use the 'foreach' function with conditions, e.g. use ONLY 'active' users with hits > 0 to calculate the average
_time   User_a  User_b  User_C  User_D  User_E  New Average        New Moving Average
9:00    0       0       10      15      5       (10+15+5)/3 = 10   -
9:15    0       0       0       5       5       (5+5)/2 = 5        10
9:30    1       2       3       4       5       3                  6.5
9:45    0       1       5       4       5       3                  3
10:00   1       3       0       5       2       2.2                2.65
10:15   1       0       4       6       3       2.8                -
10:30   1       0       0       7       0       1.6                -
(3 active users at 9:00)
Using Foreach function
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| rename OTHER as u_OTHER
| eval distinct_values=0
| foreach user_* [ eval distinct_values=if(<<FIELD>> >100,distinct_values+1,distinct_values)]
| eval new_average=round(total/distinct_values,2)
| eval old_average=round(total/11,2)
| table _time new_average old_average
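The foreach trick above boils down to dividing by the count of active users rather than by all user columns. A Python sketch using the 9:00 row from the slide:

```python
def active_user_average(row, min_hits=0):
    """Average activity over only the 'active' users (count > min_hits),
    mirroring the foreach/distinct_values trick in the search above."""
    active = [v for v in row.values() if v > min_hits]
    return round(sum(active) / len(active), 2) if active else 0.0

# The 9:00 row from the slide: only 3 of the 5 users are active
row = {"User_a": 0, "User_b": 0, "User_C": 10, "User_D": 15, "User_E": 5}
old_average = round(sum(row.values()) / len(row), 2)  # averages over all 5 users
print(old_average, active_user_average(row))  # -> 6.0 10.0
```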
9. 1 – Using Foreach vs Aggregate Moving Average
_time   User_a  User_b  User_C  User_D  User_E
9:00    0       0       10      15      5
9:15    0       0       0       5       5
9:30    1       2       3       4       5
9:45    0       1       5       4       5
10:00   1       3       0       5       2
10:15   1       0       4       6       3
10:30   1       0       0       7       0
Basic method
• Designed such that a user with 0 activity will still count as an 'active user'
• Simple to implement
• Better to use for total aggregates
• Results in more 'outliers' due to a static or moving bound

Using Foreach method
• Only users with activity will be counted as 'active users'
• More complicated to set up
• Better to use when you have a limited number of users/IPs or entities
• Gives a more accurate picture of a user/IP that is more active than normal
10. 2 - Introducing the 'Density Function'
• What is the ‘Density Function’ within MLTK?
• It is another tool, on top of the previous methods, for detecting anomalies
• It works best at an aggregate level (e.g. span=15/30/60min)
• It works by fitting your values against mathematical distributions to calculate the probability of each value occurring
• Similar to "| anomalydetection method=histogram [field_name]"
[Histogram of all user activity counts by activity bin (0-100 … 500-600, 600-700, … 1100-1200): activity between 500-700 is usually the most common in a day and has the highest probability of occurring; activity in the extreme bins (e.g. 1100-1200) has the lowest probability of occurring and is more likely to be an outlier.]
DensityFunction - https://docs.splunk.com/Documentation/MLApp/5.2.1/User/Algorithms#DensityFunction
AnomalyDetection - https://docs.splunk.com/Documentation/SplunkCloud/8.2.2104/SearchReference/Anomalydetection
DensityFunction
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| fields _time total
| bin total start=1 end=5
| stats count by total

DensityFunction Example
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| fields _time total
| fit DensityFunction total
11. Overlay – Overlay line using visual formatting options
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| fields _time total
| bin total start=1 end=5
| stats count by total
| eval overlay=count
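Conceptually, the density approach reduces to estimating how probable each value is and flagging the improbable ones. A toy Python sketch using a histogram (the bin width, 10% cutoff, and sample data are illustrative assumptions; MLTK's DensityFunction fits actual probability distributions):

```python
from collections import Counter

def histogram_outlier_values(values, bin_width=100, min_prob=0.1):
    """Bin the values, estimate each bin's probability as its share of all
    observations, and flag values falling into bins rarer than min_prob.
    (DensityFunction fits real distributions; this histogram version is
    closer in spirit to `anomalydetection method=histogram`.)"""
    bins = Counter(v // bin_width for v in values)
    n = len(values)
    rare = {b for b, count in bins.items() if count / n < min_prob}
    return [v for v in values if v // bin_width in rare]

# Invented activity counts: most fall in the 500-700 range, one in 1100-1200
activity = [550, 580, 610, 640, 590, 560, 620, 630, 600, 570,
            555, 615, 575, 605, 585, 595, 625, 565, 635, 1150]
print(histogram_outlier_values(activity))  # -> [1150]
```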
12. 2 – Using the Density Function
• Where it works well
• Data that is continuous, with little to no gaps
• For aggregate-level data, e.g. total activity
• For entity-level data (users/IPs) with few or no gaps (fit DensityFunction <Field> by "User" into Model_Name)
13. 3 – Combining Density Function with Moving Averages
• Using Density Function at aggregate level
• Use foreach moving average method
14. 3 – Combining Searches
• Using Density Function at aggregate level:
…..| fields _time Total| fit DensityFunction Total show_density=true into my_usergroup_model
• Use foreach moving average method:
…. | foreach user_* [ eval distinct_values=if(<<FIELD>> >0,distinct_values+1,distinct_values)]
| eval new_average=round(total/distinct_values,2)
| table _time * new_average
| foreach user_* [ eval isOutlier_<<FIELD>>=if(<<FIELD>> > 2*new_average,1,0)]
Output fields (aggregate): _time, isOutlier
Output fields (user-level): _time, isOutlier_user1, isOutlier_user2, isOutlier_user3, …

Reference the aggregate outlier in the user-level outlier search from 1 of 3 options:
1 – Lookup
2 – Summary Index
3 – Inline Search
| inputlookup user_usage.csv
| addtotals
| fields _time Total
| fit DensityFunction Total show_density=true into my_usergroup_model

| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time *
| eval threshold=1200
| addtotals fieldname=total
| rename OTHER as u_OTHER
| eval distinct_values=0
| foreach user_* [ eval distinct_values=if(<<FIELD>> >0,distinct_values+1,distinct_values)]
| eval new_average=round(total/distinct_values,2)
| table _time * new_average
| foreach user_* [ eval isOutlier_<<FIELD>>=if(<<FIELD>> > 2*new_average,1,0)]
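The noise-reduction payoff of combining the two searches can be sketched as a simple join on the time bucket: alert only when a user-level outlier coincides with an aggregate-level outlier (the time buckets and user names below are illustrative):

```python
def combined_alerts(aggregate_outlier_times, user_outliers):
    """Alert only when a user-level outlier and an aggregate-level outlier
    fall in the same time bucket -- joining the isOutlier and
    isOutlier_user* results on _time, as described above."""
    agg = set(aggregate_outlier_times)
    return [(t, u) for t, u in user_outliers if t in agg]

# Illustrative results from the two searches above
aggregate_outlier_times = ["9:15", "10:00"]
user_outliers = [("9:15", "user_HR1"), ("9:30", "user_ERP"), ("10:00", "user_ITOps")]
print(combined_alerts(aggregate_outlier_times, user_outliers))
# -> [('9:15', 'user_HR1'), ('10:00', 'user_ITOps')]
```

The "9:30" user-level hit is suppressed because nothing unusual happened at the aggregate level then, which is exactly the noise reduction described on the next slide.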
15. How do I make the most use of all the outlier methods?
Aggregate level
• Apply Density Function or any other technique to find a time frame that was an outlier
• Save the results in a lookup or summary index for reference

Entity (user/IP) level
• Use a user-level outlier technique to find a user who was an outlier at a certain time
• Reference that time against the aggregate level

(Optional) Regional level
• Reference regional outliers using _time or time buckets as the common field with the aggregate level & user level

Advantages of combining multiple styles of outlier detection at different data levels:
• Verification of true outliers vs a simple static value
• Less noisy alerting
• Alert only when all 2 or 3 levels of outliers are met
• Validate whether a rise/fall at the aggregate level was contributed by one or more users. If it was one user, that is a confirmed outlier.
16. More Advanced Ensemble Techniques
Aggregate Level
Available ML techniques:
• Density Function to find the rarest time buckets with the highest values as outliers
• Regression to find the loudest time buckets
• Classification to find times with the highest probability of being outliers
• Statespace algorithm & anomaly detection algorithm
Available non-ML techniques:
• Static thresholds
• Moving average thresholds

Entity Level
Available ML techniques:
• Density Function to find the rarest time buckets with the highest values as outliers
• Classification to find entities with the highest probability of going above thresholds
• Statespace algorithm & anomaly detection algorithm
Available non-ML techniques:
• Static thresholds
• Moving average thresholds
• Foreach and activity-based averages

Result: better outliers and less mundane alerting.
17. 4 – Increasing Outlier Function Accuracy
1. Find entities/users/IPs that form a large percentage of your overall activity and remove them
• This can be measured using the correlation or t-test functions from MLTK
2. Group similar sets of entities/users/IPs using the clustering commands in MLTK
• Analyze each cluster individually
18. Thank you
| inputlookup query.csv
| fit TFIDF query stop_words=english analyzer=word token_pattern="\w{3,20}" max_features=200
| fit KMeans query* k=3
| fields user query cluster cluster_distance
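Under the hood, `fit KMeans ... k=3` partitions points by repeatedly assigning each one to its nearest centroid and re-centering. A minimal 1-D sketch (the activity counts are invented, and the real search clusters the multi-dimensional TFIDF feature vectors, not raw counts):

```python
def kmeans_1d(points, k=3, iters=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid, move
    each centroid to the mean of its cluster, repeat. Initial centroids are
    spread deterministically across the sorted value range."""
    pts = sorted(points)
    centroids = [float(pts[i * (len(pts) - 1) // (k - 1)]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # re-center; keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Invented per-user activity counts forming three obvious bands
activity = [5, 7, 6, 8, 100, 110, 95, 105, 500, 520, 480]
print(kmeans_1d(activity, k=3))  # -> [6.5, 102.5, 500.0]
```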
19. Scoring Function to determine similarity
Scoring function
| score <test_name> <fields>…
https://docs.splunk.com/Documentation/MLApp/5.2.1/User/Scorecommand#T-test_.281_sample.29
Available tests:
• T-test(s):
1. Test if two IPs/users from different groups/domains have identical patterns (t-test, 2 independent samples)
2. Test if a single user/IP is equal to the average of a group (t-test, 1 sample)
3. Test if two IPs/users from the same group/domain have identical patterns (t-test, 2 related samples)
• Energy Distance: the closer this value is to 0, the more similar two fields are in terms of gain/loss over time (mathematically, they have similar cumulative distribution functions)
• Kolmogorov-Smirnov (KS): test if one field is statistically identical to another field
• Kwiatkowski-Phillips-Schmidt-Shin (KPSS): test if a field's trend is stationary – no or little gain/loss
T-test examples:

1. One sample:
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_ITOps
| score ttest_1samp user_ITOps popmean=100 alpha=0.1

2. Two independent samples:
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_HR1 user_HR2 user_ERP user_CRM user_ITOps
| score ttest_ind user_HR1 against user_HR2 user_ITOps

| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_HR1 user_HR2 user_ERP user_CRM user_ITOps
| score ttest_ind user_HR1 against user_HR1 user_ITOps

3. Two related samples:
| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_HR1 user_HR2 user_ERP user_CRM user_ITOps
| score ttest_rel user_HR1 against user_HR1 user_ITOps
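The one-sample t-test behind `score ttest_1samp` compares a field's mean to a hypothesized population mean. A Python sketch of the t statistic itself (the hourly counts are invented; the p-value that the score command also reports additionally requires the t-distribution CDF, e.g. from scipy):

```python
from math import sqrt
from statistics import mean, stdev

def ttest_1samp_statistic(sample, popmean):
    """t statistic for a one-sample t-test:
        t = (sample_mean - popmean) / (s / sqrt(n))
    This is the quantity behind `| score ttest_1samp <field> popmean=...`;
    a small |t| means the sample mean is consistent with popmean."""
    n = len(sample)
    return (mean(sample) - popmean) / (stdev(sample) / sqrt(n))

# Invented hourly counts for user_ITOps, tested against popmean=100
user_itops = [98, 103, 101, 97, 105, 99, 102, 100]
t = ttest_1samp_statistic(user_itops, popmean=100)
print(round(t, 3))  # -> 0.662
```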
Energy Distance examples:

| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_HR1 user_HR2 user_ERP user_CRM user_ITOps user_RemoteAccess user_Webmail
| score energy_distance user_Webmail against user_RemoteAccess

| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_HR1 user_HR2 user_ERP user_CRM user_ITOps user_RemoteAccess user_Webmail
| score energy_distance user_HR1 against user_HR1

Correlation example:

| inputlookup app_usage.csv
| rename * as user_*
| rename user__time as _time
| table _time user_RemoteAccess user_Webmail
| fit CorrelationMatrix method=kendall user_Webmail user_RemoteAccess