DIGGING INTO DATA COLLECTION
Jason Packer

jason@quantable.com 

@jhpacker
Feb 17, 2016

#cbuswaw
WHAT DRIVES OUR METRICS?
*Note: all metrics may be inaccurate by some amount.**
**But we’re not sure which ones and by how much.
DATA COLLECTION 1.0:
SERVER LOGS, HITS, IP ADDRESSES
• Server logs, valid in 1996 and 2016
• Basic, but still contains highly useful
data!
• Unanalyzed raw logs get big, fast.
128.135.189.9 - - [15/Feb/1996:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/1.0 (Win3.1)"
65.60.216.104 - - [15/Feb/2016:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/5.0 (Mac OS)"
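A line like the two above can be split into named fields with a single regular expression. This is a minimal sketch that assumes the simplified layout shown on this slide (no timezone or referrer field); `parseLogLine` and its field names are hypothetical and do not come from any logging library.

```javascript
// Sketch: parse the simplified access-log layout shown on the slide.
// parseLogLine and the field names are illustrative assumptions.
const LOG_PATTERN =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)"$/;

function parseLogLine(line) {
  const m = LOG_PATTERN.exec(line.trim());
  if (m === null) return null; // line does not match the assumed layout
  return {
    ip: m[1],
    timestamp: m[4],
    method: m[5],
    path: m[6],
    protocol: m[7],
    status: Number(m[8]),
    bytes: m[9] === "-" ? 0 : Number(m[9]),
    userAgent: m[10],
  };
}

const hit = parseLogLine(
  '65.60.216.104 - - [15/Feb/2016:15:16:27] "GET / HTTP/1.1" 200 5397 "Mozilla/5.0 (Mac OS)"'
);
console.log(hit.ip, hit.status, hit.userAgent);
```

Real server log formats vary by configuration, so the pattern would need adjusting to whatever format the server actually writes.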
WEB ANALYST, CIRCA 2000
flickr: boston_public_library

CC BY-NC-ND 2.0
DATA COLLECTION 2.0:
CLIENT-SIDE JAVASCRIPT, COOKIES
• Easier to implement (“just a few lines of JavaScript…”)
• Cookies match users more closely than IPs
• Much more info available client-side
HOW DOES CLIENT-SIDE JS WORK?
…SPECIFICALLY GOOGLE ANALYTICS
2 requests - 1st for code, 2nd with measurement
TRACKING CODE SNIPPETS
• Sets up command queue
• Loads analytics.js, which does the
real work.
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-34128028-1', 'auto');
ga('send', 'pageview');
</script>
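The command-queue idea in the minified snippet above can be sketched in de-minified form. This version is meant to run outside a browser: `win` stands in for `window`, and `drainQueue` stands in for what happens once analytics.js arrives. Both names are assumptions for illustration; the real library does far more than replay commands.

```javascript
// Sketch of the command-queue idea in the snippet, de-minified.
// `win` stands in for the browser's window object, and drainQueue stands in
// for analytics.js arriving later; both are illustrative assumptions.
const win = {};

// Step 1: define a stub ga() that only records its arguments.
win.ga = win.ga || function () {
  (win.ga.q = win.ga.q || []).push(Array.from(arguments));
};
win.ga.l = Date.now(); // timestamp of when the stub was set up

// Step 2: the page can call ga() immediately, before the library has loaded.
win.ga("create", "UA-XXXX-Y", "auto");
win.ga("send", "pageview");

// Step 3: once the real library loads, it replays the queued commands.
function drainQueue(w, handler) {
  const queued = w.ga.q || [];
  w.ga.q = [];
  queued.forEach((args) => handler(...args));
  return queued.length;
}

const seen = [];
const replayed = drainQueue(win, (command, ...rest) => seen.push(command));
console.log(replayed, seen);
```

Because the stub only stores arguments, the page never has to wait for the asynchronous script download before issuing commands.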
MEASUREMENT PROTOCOL
https://www.google-analytics.com/collect?v=1&_v=j41&a=702618035&t=pageview&_s=1&dl=https://www.quantable.com/&ul=en-us&de=UTF-8&dt=Quantable - Analytics & Optimization&sd=24-bit&sr=1680x1050&vp=1442x464&je=0&_u=SCCAAUAjK~&jid=&cid=157092037.1441829013&tid=UA-34128028-1&z=823826407
This hit, once made readable, is this data (from ObservePoint tag debugger).
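Making a hit readable mostly means splitting the query string and labelling each parameter. A minimal sketch follows; `decodeHit` is a hypothetical helper, `PARAM_LABELS` covers only a handful of Measurement Protocol parameters, and the sample URL is a shortened, percent-encoded version of the hit on this slide.

```javascript
// Sketch: turn a Measurement Protocol hit URL into labelled fields.
// PARAM_LABELS covers only a few parameters; decodeHit is a hypothetical helper.
const PARAM_LABELS = new Map([
  ["v", "protocol version"],
  ["t", "hit type"],
  ["dl", "document location"],
  ["dt", "document title"],
  ["ul", "user language"],
  ["de", "document encoding"],
  ["sr", "screen resolution"],
  ["vp", "viewport size"],
  ["cid", "client ID"],
  ["tid", "tracking ID"],
  ["z", "cache buster"],
]);

function decodeHit(hitUrl) {
  const params = new URL(hitUrl).searchParams;
  const readable = {};
  for (const [key, value] of params) {
    const label = PARAM_LABELS.get(key) || key; // unknown keys keep their raw name
    readable[label] = value;
  }
  return readable;
}

// A shortened, properly percent-encoded version of the hit on the slide.
const sampleHit =
  "https://www.google-analytics.com/collect?v=1&t=pageview" +
  "&dl=https%3A%2F%2Fwww.quantable.com%2F&ul=en-us&de=UTF-8" +
  "&dt=Quantable%20-%20Analytics%20%26%20Optimization" +
  "&sr=1680x1050&vp=1442x464&cid=157092037.1441829013&tid=UA-34128028-1";

console.log(decodeHit(sampleHit));
```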
SEEMS GREAT, WHAT COULD
POSSIBLY GO WRONG?
Some data still only on the server side…
• Bot traffic (mostly)
• HTTP errors
• Pages we forgot to tag
• Content blocking users
SERVER LOGS, AGAIN
• Distributed systems, distributed logs
• As before, but somewhat different
consumers
AS ANALYSTS, WHAT’S GIVING
US GRIEF
• Cookie Deleting Users
• Bots
• Analytics “Referrer” Spam
• Ad blocker Users
COOKIE DELETING USERS
IS IT STILL ~30%?
• Artificially increases user counts
• Visit after deletion is direct, no attribution
• Stats based on user accounts?
flickr: diskant

CC BY-NC 2.0
BROWSER FINGERPRINTS
• Survives Cookie deletion
• 2010 EFF Panopticlick: 84% of browsers unique
• Invasive?
• Browser fingerprint + IP in Piwik as cookie fallback
• Can be thought of as next gen User-Agent + IP
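The "next gen User-Agent + IP" idea can be sketched as hashing a fixed set of browser attributes together with the IP. The attribute list and the FNV-1a hash below are illustrative choices only; this is not the method used by Panopticlick, Piwik, or Fingerprintjs2, which draw on many more signals.

```javascript
// Sketch: combine a few browser attributes into one short identifier.
// The attribute list and the FNV-1a hash are illustrative choices, not the
// method used by Panopticlick, Piwik, or Fingerprintjs2.
function fnv1a(text) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i); // simplified: hashes UTF-16 code units, not bytes
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime
  }
  return hash.toString(16).padStart(8, "0");
}

function fingerprint(attrs, ip) {
  // Fixed key order so the same browser always yields the same string.
  const parts = [attrs.userAgent, attrs.language, attrs.screen, attrs.timezone, ip];
  return fnv1a(parts.join("|"));
}

const browser = {
  userAgent: "Mozilla/5.0 (Mac OS)",
  language: "en-us",
  screen: "1680x1050x24",
  timezone: "America/New_York",
};

const before = fingerprint(browser, "65.60.216.104");
const afterCookieWipe = fingerprint(browser, "65.60.216.104"); // nothing it depends on changed
console.log(before === afterCookieWipe);
```

Since none of the inputs live in a cookie, deleting cookies leaves the identifier unchanged, which is also why the technique raises the "invasive?" question.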
BOTS
• About 50% of all traffic may be bots (48.5%,
Incapsula 2015)
• Most of these don’t show in GA (yet?)
• Smaller the site, higher the bot % (85% for <1k visits/day)
flickr: skynoir

CC BY-NC 2.0
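A rough bot share can be estimated from server-log user agents. This is a minimal sketch: `BOT_HINTS` is a short illustrative list rather than a maintained bot database, treating an empty user agent as a bot is an assumption, and bots that spoof a browser user agent are not caught this way.

```javascript
// Sketch: flag self-identified bots by user-agent substring.
// BOT_HINTS is a short illustrative list, not a maintained bot database;
// bots that spoof a browser user agent will not be caught this way.
const BOT_HINTS = ["bot", "crawler", "spider", "slurp"];

function looksLikeBot(userAgent) {
  const ua = (userAgent || "").toLowerCase();
  // Assumption: a missing user agent is treated as a bot.
  return ua === "" || BOT_HINTS.some((hint) => ua.includes(hint));
}

function botShare(userAgents) {
  if (userAgents.length === 0) return 0;
  const bots = userAgents.filter(looksLikeBot).length;
  return bots / userAgents.length;
}

const sampleAgents = [
  "Mozilla/5.0 (Mac OS)",
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
  "Mozilla/5.0 (Windows NT 10.0)",
];
console.log(botShare(sampleAgents));
```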
ANALYTICS SPAM
• free-social-buttons.biz, top-seo-blah-blah-blah.com, number-one-analytics.fail
• Way to get traffic, SEO, and lulz since
before 2009
• Not GA specific, just the #1 target
• Two kinds: Crawler & Ghost
WHO’S SPAMMING US TODAY?
List of 2016 GA
Spammers from
Analytics Edge
Google is blocking
offenders, but often
not quickly.
WHY IS IT SO PREVALENT?
“Ghost” version via Measurement Protocol abuse
$ curl "https://www.google-analytics.com/collect?v=1&t=pageview&tid=UA-XXXX-X&cid=fa0c8140-eef8-47c5-a244-b4c60cf46f74&dr=http%3A%2F%2Fmyspamsite.pizza&dp=%2Fhome"
Just iterate through UA-XXXX-1 numbers.
HOW DO I FIX IT?
• Filters for new traffic, segments for
historical
• Tool available on my site: 

quantable.com/spamfilter
• Use a property tracking ID higher than UA-XXXX-1 for a new site
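The filtering logic behind those fixes can be sketched as two checks: is the hit's hostname one of ours, and is its referrer on a spam list. The field names (`hostname`, `referrerHost`) and both lists are assumptions made for illustration; this is not GA's filter syntax and not the tool mentioned above.

```javascript
// Sketch: keep only hits whose hostname is one of ours and whose referrer
// is not on a known-spam list. Field names and both lists are assumptions
// made for illustration; they are not GA's filter syntax.
const VALID_HOSTNAMES = new Set(["www.quantable.com", "quantable.com"]);
const SPAM_REFERRERS = [/free-social-buttons\.biz$/i, /myspamsite\.pizza$/i];

function isSpamHit(hit) {
  // Ghost spam never loads the site, so its hostname is often missing or wrong.
  if (!VALID_HOSTNAMES.has(hit.hostname)) return true;
  // Crawler spam does visit the site, so it has to be caught by referrer.
  return SPAM_REFERRERS.some((pattern) => pattern.test(hit.referrerHost || ""));
}

const hits = [
  { hostname: "www.quantable.com", referrerHost: "www.google.com" },
  { hostname: "(not set)", referrerHost: "myspamsite.pizza" },
  { hostname: "www.quantable.com", referrerHost: "free-social-buttons.biz" },
];
const clean = hits.filter((hit) => !isSpamHit(hit));
console.log(clean.length);
```

The hostname check needs no upkeep as new spam domains appear, while the referrer list has to be maintained by hand.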
AD BLOCKING IS MAKING SOME
OF OUR USERS DISAPPEAR
• Blockers such as AdBlock Plus, Ghostery, uBlock
Origin, and Purify can block analytics tools, not just ads
• ABP has largest install base (300M downloads)
• These users are still in your server logs, but may never
show up in your web analytics
HOW DOES THE BLOCKING
WORK?
• Long lists of URLs to block loading for, e.g.:

google-analytics.com/analytics.js

/piwik.php

?[AQB]&ndh=1&t=

com/0.gif?
• EasyPrivacy list (used by ABP and others) is over
10,000 lines long and very actively maintained
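The blocking decision can be sketched as checking each outgoing request URL against the list. This is a simplification: real filter lists such as EasyPrivacy use a richer rule syntax, while the sketch below treats the four example entries from this slide as plain substrings. `isBlocked` and the example.com URLs are hypothetical.

```javascript
// Sketch: check request URLs against a block list by plain substring match.
// Real filter lists such as EasyPrivacy use a richer rule syntax (anchors,
// wildcards, options); substring matching is a simplification.
const BLOCK_PATTERNS = [
  "google-analytics.com/analytics.js",
  "/piwik.php",
  "?[AQB]&ndh=1&t=",
  "com/0.gif?",
];

function isBlocked(requestUrl, patterns = BLOCK_PATTERNS) {
  return patterns.some((pattern) => requestUrl.includes(pattern));
}

const requests = [
  "https://www.google-analytics.com/analytics.js",
  "https://stats.example.com/piwik.php?idsite=1",
  "https://www.example.com/js/app.js",
];
requests.forEach((url) => console.log(isBlocked(url), url));
```

When the first request is blocked, analytics.js never loads, so no measurement hit is ever sent.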
HOW MANY USERS BLOCK GA?
My study showing 8.7% blocking GA (for one particular site)
HOW DO I COUNT BLOCKERS?
• Can’t really be “fixed” client-side
• Still show up server-side
• May be against GA terms (can’t
circumvent Opt-Out Add-on)
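One way to reason about counting blockers is to compare server-side page loads with client-side analytics hits. The sketch below is not the method used in the study cited earlier; the record shapes, the shared `pageLoadId`, and the sample data are all made up for illustration.

```javascript
// Sketch: estimate the share of page loads where analytics never fired,
// by comparing server-side page loads with client-side analytics hits.
// The record shapes and the shared pageLoadId are assumptions for illustration.
function estimateBlockRate(serverPageLoads, analyticsHits) {
  const tracked = new Set(analyticsHits.map((hit) => hit.pageLoadId));
  // Drop known bots first: they also load pages without running analytics.
  const humanLoads = serverPageLoads.filter((load) => !load.isBot);
  if (humanLoads.length === 0) return 0;
  const untracked = humanLoads.filter((load) => !tracked.has(load.pageLoadId));
  return untracked.length / humanLoads.length;
}

// Made-up sample data: five page loads, one from a bot, one human load blocked.
const serverPageLoads = [
  { pageLoadId: "a1", isBot: false },
  { pageLoadId: "a2", isBot: false },
  { pageLoadId: "a3", isBot: true },
  { pageLoadId: "a4", isBot: false },
  { pageLoadId: "a5", isBot: false },
];
const analyticsHits = [{ pageLoadId: "a1" }, { pageLoadId: "a2" }, { pageLoadId: "a5" }];

console.log(estimateBlockRate(serverPageLoads, analyticsHits));
```

The estimate is only as good as the bot filtering: any bot left in the server-side count looks like a blocker.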
SQUARING THAT CIRCLE
…because sometimes 22/7 is good enough.
THANKS!
slides & recap to be posted at cbuswaw.com
References & Further Reading
Quantable GA Blocking Analysis:

https://www.quantable.com/analytics/how-many-users-block-google-analytics/
GA Tracking Code walkthrough:

http://code.stephenmorley.org/javascript/understanding-the-google-analytics-tracking-code/
GA Measurement Protocol Hit Builder:

https://ga-dev-tools.appspot.com/hit-builder/
Fingerprintjs2:

http://valve.github.io/fingerprintjs2/
Incapsula 2015 Bot Report:

https://www.incapsula.com/blog/bot-traffic-report-2015.html
Analytics Edge’s Guide to GA Spam:

http://help.analyticsedge.com/spam-filter/definitive-guide-to-removing-google-analytics-spam/
