This document discusses how to analyze web traffic logs to distinguish real human users from non-human traffic such as bots, spiders, and crawlers. It recommends sending logs to New Relic for analysis and using queries to:
1) Identify bot and crawler traffic by searching for those terms in the user-agent field.
2) Isolate machine traffic by excluding user agents that start with "Mozilla", the token most real browsers send.
3) Combine both queries to compare machine traffic against real human traffic over time.
4) Watch for suspicious user agents with atypical traffic patterns that could indicate fraud.
Bots and spiders
2. Bots and Spiders vs. Real Users
• You want to know how well search engine bots are being served
• You want to know what else is looking at your pages (competitors?)
• You want to separate the good traffic from the bad for clean analytics
• You want to get alerted when suspicious traffic kicks in (fraud?)
• You want a clean and accurate basis for your marketing analytics
4. Get the Logs into New Relic
• Define New Relic as the endpoint for your logs in your CDN, or ship them directly from your servers (in case no CDN is used)
• Check that the logs arrive
(Screenshot: example configuration for Fastly) (Screenshot: logs arriving in New Relic)
5. Make sense of the data
• Identify the bots
Most spiders, bots, and crawlers identify themselves as such in the user-agent string, so we ask the data platform for a count of everything matching those names:
SELECT count(*) FROM Log WHERE request_user_agent LIKE '%Bot%' OR request_user_agent LIKE '%Spider%' OR request_user_agent LIKE '%crawler%' FACET request_user_agent SINCE 1 day ago
6. Make sense of the data
• Exclude the knowns (identify machines)
Most real user agents (browsers) identify as "Mozilla", so we ask the data platform for a count of everything that does not start with it:
SELECT count(*) FROM Log WHERE request_user_agent NOT LIKE 'Mozilla%' FACET request_user_agent SINCE 1 day ago
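The two substring filters can be mirrored outside New Relic, e.g. when spot-checking raw log lines locally. A minimal Python sketch (note it matches case-insensitively, which is slightly broader than the case-sensitive NRQL LIKE patterns):

```python
def names_itself_a_bot(user_agent: str) -> bool:
    """Mirror of the '%Bot%' / '%Spider%' / '%crawler%' filter."""
    ua = user_agent.lower()
    return any(token in ua for token in ("bot", "spider", "crawler"))

def lacks_browser_token(user_agent: str) -> bool:
    """Mirror of the NOT LIKE 'Mozilla%' filter: anything that does not
    start with the token real browsers send."""
    return not user_agent.startswith("Mozilla")

print(names_itself_a_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
print(lacks_browser_token("curl/8.4.0"))                              # True
```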
7. Make sense of the data (real vs. machine)
• Combine the learnings and check what's going on (the check is additionally narrowed to text/html content)
SELECT filter(count(*),
    WHERE request_user_agent NOT LIKE 'Mozilla%'
      OR request_user_agent LIKE '%Crawler%'
      OR request_user_agent LIKE '%bot%'
      OR request_user_agent LIKE '%spider%') AS 'Machine Traffic',
  filter(count(*),
    WHERE request_user_agent LIKE 'Mozilla%'
      AND request_user_agent NOT LIKE '%Crawler%'
      AND request_user_agent NOT LIKE '%bot%'
      AND request_user_agent NOT LIKE '%spider%') AS 'Real Traffic'
FROM Log
WHERE type LIKE 'text/html%'
SINCE 1 day ago TIMESERIES
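The combined split can be prototyped locally before building the dashboard. A sketch in Python, assuming log records carry the `request_user_agent` and `type` fields used above (the sample records are hypothetical):

```python
from collections import Counter

def classify(user_agent: str) -> str:
    """Machine if the agent does not start with 'Mozilla' OR names itself
    a bot/spider/crawler; real traffic otherwise."""
    named_bot = any(t in user_agent.lower() for t in ("bot", "spider", "crawler"))
    if not user_agent.startswith("Mozilla") or named_bot:
        return "Machine Traffic"
    return "Real Traffic"

# hypothetical log records
logs = [
    {"request_user_agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0", "type": "text/html"},
    {"request_user_agent": "Mozilla/5.0 (compatible; bingbot/2.0)",      "type": "text/html"},
    {"request_user_agent": "curl/8.4.0",                                 "type": "text/html"},
    {"request_user_agent": "Mozilla/5.0 (X11; Linux) Chrome/66.0",       "type": "image/png"},
]

counts = Counter(
    classify(rec["request_user_agent"])
    for rec in logs
    if rec["type"].startswith("text/html")  # mirror of: WHERE type LIKE 'text/html%'
)
print(counts)
```

With the sample records, the image/png request is dropped and the remaining three split into two machine hits (bingbot, curl) and one real browser hit.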
8. Not every real user is a real user
• Not all synthetic-monitoring engines identify themselves as such, and neither do all competitors checking your content, so it is worth having a look at suspicious user agents. A simple count helps, but also look for traffic spikes, or for a lot of suspiciously flat traffic.
(Screenshot annotations:) Linux with a Chrome browser is not really the most common combination. Uuuh… and the version looks rather old!
32k clicks by just one user agent? Hmmm… suspicious.
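Both signals called out above, a single agent dominating the volume and perfectly flat hour-by-hour traffic, are easy to flag programmatically. A sketch, where the input format (user-agent/hour pairs) and the 10,000-request threshold are illustrative assumptions, not values from the slides:

```python
from collections import Counter

def suspicious_agents(requests, volume_threshold=10_000):
    """Flag agents with an atypically high request volume ("32k clicks by
    one agent?") or perfectly flat per-hour counts, a typical signature of
    synthetic monitors. `requests` is an iterable of (user_agent, hour) pairs."""
    totals = Counter(ua for ua, _ in requests)
    per_hour: dict[str, Counter] = {}
    for ua, hour in requests:
        per_hour.setdefault(ua, Counter())[hour] += 1

    flagged = {}
    for ua, total in totals.items():
        hourly = per_hour[ua].values()
        flat = len(set(hourly)) == 1 and len(per_hour[ua]) > 1
        if total >= volume_threshold:
            flagged[ua] = "volume spike"
        elif flat:
            flagged[ua] = "flat traffic"
    return flagged

# 32k requests from one old Chrome-on-Linux agent, plus a health checker
# hitting the site exactly 5 times every hour of the day
reqs = [("Mozilla/5.0 (X11; Linux) Chrome/66.0", h % 24) for h in range(32_000)]
reqs += [("HealthCheck/1.0", h) for h in range(24)] * 5
flagged = suspicious_agents(reqs)
print(flagged)
```

In New Relic itself the equivalent first step is the simple faceted count from the earlier slides, sorted by volume; the flat-traffic check corresponds to eyeballing the TIMESERIES chart per agent.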