How to collect Google Analytics events to your own data warehouse and do it on budget

Alex Levashov
Web Analytics Wednesday presentation
06 Nov 2019

Brief Intro
WWW.OWNYOURBUSINESSDATA.NET
• eCommerce consultant; I run my own small consultancy, Magenable, specializing in Magento
• Deal with many eCommerce-related things, from strategy to implementation to support, so not only web analytics
• Started OwnYourBusinessData.net a couple of months ago

OwnYourBusinessData
• Own data warehouse over vendor lock-in
• Central data warehouse over silos
• Open, transferable data format over vendor proprietary
• De-coupled warehouse, ETL and business analysis tool over monolith
• Open-source over proprietary
The data generated by a business should be owned by that business, for its own benefit and that of its customers.

WHY BOTHER TO COLLECT A COPY OF GA DATA?
MOTIVATION
Why in general?
1. Being paranoid and a control freak
2. Centralization
3. Sampling
4. API Limits
Why this way?
1. Affordability
2. Low maintenance
3. Learn something new

INSPIRATION AND CREDITS
Existing Snowplow GA Plugin
Google Analytics plugin for Snowplow
Approach in general
Blog post at Bostata.com: “Client-side instrumentation for under $1 per month. No servers necessary.”

DISCLAIMERS, NOTES
• I am just starting to use Snowplow
• There are alternative ways, and they may work better in other cases
• A link to the blog post that describes the process in more detail and to the git repository will be provided, so there is no need to write everything down

TECHNOLOGIES USED
Approach
Snowplow architecture
Technologies we used
• AWS CloudFront
• AWS Lambda (Python)
• AWS S3
• AWS Athena

PROCESS
Approach
1. JS tracker
• Calls the tracking pixel
2. CloudFront
• Produces logs
3. Lambda function
• Processes logs
• Enriches data
• Puts the results to S3
4. Athena
• Takes S3 data
• Creates SQL tables
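The "processes logs" step can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes CloudFront standard logs (gzip-compressed, tab-separated, with "#Version" and "#Fields:" header lines), and the function name parse_cloudfront_log and the sample parameters t and dl are chosen here only for the example.

```python
import gzip
import io
from urllib.parse import parse_qs


def parse_cloudfront_log(raw_bytes):
    """Yield one dict per request from a gzip-compressed CloudFront standard log.

    Hypothetical helper for illustration; column names come from the
    "#Fields:" header line found in the log file itself.
    """
    fields = []
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#Fields:"):
                # Header lists the column names separated by spaces
                fields = line[len("#Fields:"):].split()
                continue
            if not line or line.startswith("#"):
                continue
            record = dict(zip(fields, line.split("\t")))
            # The tracking pixel request carries the event in its query string;
            # CloudFront writes "-" when the query string is empty
            query = record.get("cs-uri-query", "-")
            if query == "-":
                record["params"] = {}
            else:
                record["params"] = {k: v[0] for k, v in parse_qs(query).items()}
            yield record


if __name__ == "__main__":
    sample = (
        "#Version: 1.0\n"
        "#Fields: date time c-ip cs-uri-stem cs-uri-query\n"
        "2019-11-06\t10:00:00\t203.0.113.5\t/i\t"
        "t=pageview&dl=https%3A%2F%2Fexample.com%2F\n"
    )
    for row in parse_cloudfront_log(gzip.compress(sample.encode("utf-8"))):
        print(row["c-ip"], row["params"])
```

Each yielded dict keeps the raw log columns and adds a params entry with the decoded query string, which is the part that later becomes table columns.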

Why this way?
Benefits
• Easy to implement
• Serverless, low resource usage and costs (under $1/month)
• Reliable/low maintenance
• Easy access to data (SQL)

What do you need to start?
1. Google Analytics account
2. Google Tag Manager account
3. AWS account
4. Terraform (optional, but saves you time)

Step 1. Deploy AWS infrastructure
1. Manually
2. Or use Terraform script:
https://github.com/ownyourbusinessdata/snowplow-google-analytics-enrich-lambda
At the end of the process you'll have:
• CloudFront distribution
• 3 S3 buckets for logs, tracking pixel and Athena queries results
• Tracking pixel in one S3 bucket
• Python Lambda function that does data processing and enrichment
• Athena table (empty now)
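As a rough sketch of how the Lambda function could find the log files it has to process: the code below assumes the function is triggered by S3 object-created notifications on the log bucket, which the slides do not state explicitly. The helper name s3_objects_from_event is hypothetical; the event layout (Records, s3.bucket.name, s3.object.key) is the standard S3 notification structure.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs from an S3 notification event passed to Lambda."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            # Not an S3 notification record; skip it
            continue
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    # A real function would download each object, process and enrich it,
    # and write the result to another bucket; here we only list the inputs
    return {"objects": s3_objects_from_event(event)}
```

The handler only reports which objects arrived; the download, processing and upload steps are left out on purpose.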

Step 2. Deploy JS tracker
With Google Tag Manager
Create a User-Defined Variable (Custom JavaScript type) and insert your tracker code into it

Create another variable of type Variable Configuration and add your Custom JavaScript variable to it as a field

Use that configuration variable to modify the tag configuration

Wait
The data updates every 5-15 mins

A few words about enrichment
AWS Lambda (Python)
The part that we had to develop
• Processing turns logs into text files
• Enrichment adds geo data (uses MaxMind DB)
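The geo enrichment step could look something like the sketch below. It is not the repository's code: the function name enrich_with_geo, the geo_* column names and the in-memory FAKE_DB are assumptions made for illustration. The record layout (country.iso_code, city.names.en, location.latitude/longitude) follows the MaxMind GeoLite2 City format.

```python
def enrich_with_geo(event, lookup):
    """Return a copy of event with geo columns added.

    lookup is any callable that maps an IP address to a GeoLite2-style
    dict, or None when the address is unknown.
    """
    geo = lookup(event.get("c-ip", "")) or {}
    enriched = dict(event)
    enriched["geo_country"] = geo.get("country", {}).get("iso_code")
    enriched["geo_city"] = geo.get("city", {}).get("names", {}).get("en")
    location = geo.get("location", {})
    enriched["geo_latitude"] = location.get("latitude")
    enriched["geo_longitude"] = location.get("longitude")
    return enriched


# Stand-in for a real MaxMind database, used only for this example
FAKE_DB = {
    "203.0.113.5": {
        "country": {"iso_code": "AU"},
        "city": {"names": {"en": "Melbourne"}},
        "location": {"latitude": -37.81, "longitude": 144.96},
    }
}

if __name__ == "__main__":
    print(enrich_with_geo({"c-ip": "203.0.113.5"}, FAKE_DB.get))
```

With the maxminddb Python package, the lookup argument could be the get method of a reader opened with maxminddb.open_database on a GeoLite2 City file. Unknown addresses simply produce empty geo columns instead of failing.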

Let’s check what we get
Access from R demo
The AWR.Athena package comes in handy
# Sample R connector to an Athena DB with Snowplow events collected via the Google Analytics plugin
# Required package: AWR.Athena
# install.packages("AWR.Athena")
library(AWR.Athena)
library(DBI)
library(tidyverse)
library(lubridate)
# You need an AWS API user with proper access to S3 and Athena
# The AWS access key and secret should be set via the AWS CLI: run "aws configure" from the command line
# S3OutputLocation should be taken from your Athena settings
# Connect to Athena
con <- dbConnect(AWR.Athena::Athena(), region='us-west-2',
                 S3OutputLocation='s3://aws-athena-query-results-518190832416-us-west-2/',
                 Schema='default')
# Get the list of available tables
dbListTables(con)
# Query a specific table (all records; the SQL statement can be any supported by Athena)
df <- as_tibble(dbGetQuery(con, "SELECT * FROM eventsga"))

Let’s check what we get
AWS S3 and Athena live demo

References
• Collect Google Analytics events in your own cheap AWS warehouse with Snowplow (OwnYourBusinessData)
https://www.ownyourbusinessdata.net/collect-google-analytics-events-in-your-own-cheap-aws-warehouse-with-snowplow
• Snowplow data enrichment with Lambda (OwnYourBusinessData)
https://www.ownyourbusinessdata.net/enrich-snowplow-data-with-aws-lambda-function/
• Connect R to Athena (OwnYourBusinessData)
https://www.ownyourbusinessdata.net/connecting-r-to-athena-to-analyse-snowplow-events/
• Own Your Business Data Git
https://github.com/ownyourbusinessdata/
• Client-side instrumentation for under $1 per month. No servers necessary (Bostata)
https://bostata.com/client-side-instrumentation-for-under-one-dollar/

Q&A TIME

Contacts
OwnYourBusinessData
Web: OwnYourBusinessData.Net
Twitter: https://twitter.com/own_data
LinkedIn: https://www.linkedin.com/groups/12283165/
Alex Levashov
Web: https://levashov.biz/
Twitter: https://twitter.com/levashovbiz
LinkedIn: https://www.linkedin.com/in/alevashov/
Looking for people interested in joining the course
