The document provides guidance on leveling up a company's data infrastructure and analytics capabilities. It recommends starting by acquiring and storing data from various sources in a data warehouse. The data should then be transformed into a usable shape before performing analytics. When setting up the infrastructure, the document emphasizes collecting user requirements, designing the data warehouse around key data aspects, and choosing technology that supports iteration, extensibility and prevents data loss. It also provides tips for creating effective dashboards and exploratory analysis. Examples of implementing this approach for two sample companies, MESI and SalesGenomics, are discussed.
Having programmers do data science is terrible, if only everyone else were not even worse. The problem is of course tools. We seem to have settled on either: a bunch of disparate libraries thrown into a more or less agnostic IDE, or some point-and-click wonder which no matter how glossy, never seems to truly fit our domain once we get down to it. The dual lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope there is a third way.
This presentation is a meditation on how I approach data problems with Clojure, what I believe the process of doing data science should look like and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.
Clojure is fantastic for data manipulation and rapid prototyping, but falls short when it comes to communicating your insights. What is lacking are good visualization libraries and (shareable) notebook-like environments. I'll show my workflow in org-babel which weaves Clojure with R (for ggplot) and Python (for scikit-learn) and tell you why it's wrong, how IPythons of the world have trapped us in a local maximum and how we need a reconceptualization similar to what a REPL does to programming. All this interposed with my experience doing data science with Clojure (everything from ETL to on-the-spot analysis during a brainstorming).
Beyond Data Discovery: The Value Unlocked by Modern Data Modeling - Looker
In this webinar we will discuss Looker’s novel approach to data modeling and how it powers a data exploration environment with unprecedented depth and agility.
Some topics we will cover:
-A new architecture beyond direct connect
-Language-based, git-integrated data modeling
-Abstractions that make SQL more powerful and more efficient
The webinar also covers the current data landscape, key trends, and the future of business intelligence and data analytics.
Software Analytics for Pragmatists [DevOps Camp 2017] - Markus Harrer
Talk at DevOps Camp 2017, Nürnberg, 13.05.2017
Each step in the development or use of software leaves valuable, digital tracks. The analysis of this "software data" (such as runtime measures, log files or commits) refines our gut feeling into facts with sound evidence.
I'll show how questions that arise in software development can be answered in an automated, data-driven and reproducible way. I demonstrate the interaction of open source analysis tools (such as jQAssistant, Neo4j, Pandas, and Jupyter) for the analysis of data from different sources (such as JProfiler, Jenkins, and Git). Together, we have a look at how we can develop solutions to optimize performance, identify build breakers or make knowledge gaps in our source code visible.
Applied Data Science Course Part 1: Concepts & your first ML model - Dataiku
In this first course of our Applied Data Science online course series, you'll learn about the mindset shift of going from small to big data, basic definitions and concepts, and an overview of the data science workflow.
Dataiku productive application to production - pap is may 2015 Dataiku
Beyond Predictive Analytics: Deploying apps to production and keeping them improving
Some smart companies have been putting predictive applications in production for decades. Still, either because of lack of sharing or lack of generality, there is still no single and obvious way to put a predictive application in production today.
As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.
Behind the single word "production" lies a great number of questions: what exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve and check that it delivers the promised business value over time? What are the best practices for maintenance and updates, by the way? Will my data scientists keep working after first development or should I lay half of them off? Etc.
Let's make a small analogy with the development of websites in the 90s and early 00s:
Back then, the winners were not necessarily the websites with an amazing design, but a winner had clearly made the necessary efforts and had a robust way to put their website reliably in production.
Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap… and so we forget. How much time before we get the same comfort for the predictive world?
Frank Bien, CEO of Looker, along with Amazon, Google and other data disruptors, discusses how innovators are deeply integrating analytics into every aspect of their businesses, from mobile to warehouse to cloud.
Frank shares Looker’s vision for the future of business intelligence and data analytics and reveals pivotal product and partnership updates.
PASS Summit: Data Storytelling with R, Power BI and Azure ML - Jen Stirrup
How can we use technology to help the organization make data-driven decision-making part of its organizational DNA, while retaining the context of the business as a whole? How can we imprint data in the culture of the organization and make it easily accessible to everyone? Microsoft directly empowers businesses to derive insights and value from little and big data, through its release of user-friendly analytics through Azure Machine Learning (ML) combined with its acquisition of Revolution Analytics. Power BI can be used to create compelling visual stories around the analysis so that the work is not left to the data consumer. Together, these technologies can be used to make data and analytics part of the organization's DNA.
There are no prerequisites, but attendees are welcome to follow along with the demo if they have an Azure ML and Power BI account and R installed. Files will be released before the session.
What are actionable insights? (Introduction to Operational Analytics Software) - Newton Day Uploads
What Are Actionable Insights? In this presentation I outline what Actionable Insights are and the Operational Analytics Software that can produce them. And because Business Intelligence and the Business Intelligence Software market can be so confusing for buyers I've attempted to position where Actionable Insights and Operational Analytics fit in the Business Intelligence 'story'.
Producing direct value for businesses via quantitative models.
New analytical tools such as Looker allow data analysts to speed up the dirty work around building data models—making it less painful to clean data, explore predictive factors, and evaluate results.
In this educational webinar from Data Science Central (DSC), Justin Palmer of LendingHome, a mortgage banking and marketing platform, joins Colin Zima, Chief Analytics Officer at Looker. Using a public-domain FAA dataset and the LendingHome platform as examples, they dig into the data modeling process and offer ideas for improvements.
- See more at: http://try.looker.com/resources/improving-data-modeling-workflow#sthash.2rGxwhJ7.dpuf
Presented by BrainSell, a top Sage ERP partner, www.brainsell.net
Sage Intelligence works with Sage 100, Sage 300, Sage 500 and more. See how it can make your life better today!
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Learning - Formulatedby
Presented by Yashas Vaidya, Senior Data Scientist at Dataiku
Next DSS MIA Event - https://datascience.salon/miami/
The talk covers the steps to taking a machine learning model to production, modern architectures and technologies for building production machine learning, and an overview of the talent and processes for creating and maintaining it.
Synapse is a solution provider with an innovative alternative to commercial off-the-shelf IT applications, empowering business professionals to shape business processes without being chained to IT applications.
When it comes to creating an enterprise AI strategy: if your company isn’t good at analytics, it’s not ready for AI. Succeeding in AI requires being good at data engineering AND analytics. Unfortunately, management teams often assume they can leapfrog best practices for basic data analytics by directly adopting advanced technologies such as ML/AI – setting themselves up for failure from the get-go. This presentation explains how to get basic data engineering and the right technology in place to create and maintain data pipelines so that you can solve problems with AI successfully.
When and Where to Embed Business Intelligence - Looker
Watch the recorded webinar at http://bit.ly/1MeX7QK
Everywhere you look, companies are using external-facing analytics to maximize the value derived from their data assets, by moving customers up the value chain, increasing stickiness, and offering a more competitive product on the marketplace.
Listen to learn about how to bring an external-facing data product to market by embedding BI software, and what that can add to your offering.
Presentation covers:
-Top use cases for embedding business intelligence software
-Case studies from different companies currently embedding BI
-Build vs buy considerations
-Evaluating ROI
Stop refreshing vanity metrics & start focusing on the metrics that inform decisions - Looker
Stop Refreshing Vanity Metrics & Start Focusing on the Metrics that Inform Decisions
There is a propensity to focus on vanity metrics: metrics that show you the score, like how many new views, new daily active users, or how much revenue last week. You may slice these by different attributes (geography, platform, user demographics). While this can help you understand the high-level trends in your business, it does little to tell you how to get better.
This slide deck looks at how vanity metrics can distract you from focusing on the analysis that matters, which is identifying and measuring the metrics that drive decisions. There are several real examples of how companies (Venmo, Simply Business, and Looker) have used event data in highly customized ways to make better decisions about their products.
Many companies have invested time and money into building sophisticated data pipelines that can move massive amounts of data, often in real time. However, for the analyst or data scientist who builds offline models, integrating their analyses into these pipelines for operational purposes can pose a challenge.
In this slide deck, we will discuss some key technologies and workflows companies can leverage to build end-to-end solutions for automating statistical and machine learning solutions: from collection and storage to analysis and real-time predictions.
This presentation has been uploaded by Public Relations Cell, IIM Rohtak to help the B-school aspirants crack their interview by gaining basic knowledge on IT.
ADV Slides: Comparing the Enterprise Analytic Solutions - DATAVERSITY
Data is the foundation of any meaningful corporate initiative. Fully master the necessary data, and you’re more than halfway to success. That’s why leverageable (i.e., multiple use) artifacts of the enterprise data environment are so critical to enterprise success.
Build them once (keep them updated), and use again many, many times for many and diverse ends. The data warehouse remains focused strongly on this goal. And that may be why, nearly 40 years after the first database was labeled a “data warehouse,” analytic database products still target the data warehouse.
New usage model for real-time analytics by Dr. William L. Bain at Big Data Spain - Big Data Spain
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread.
No doubt Visualization of Data is a key component of our industry. The path data travels from the moment it is created until it takes shape in a chart is sometimes obscure and overlooked as it tends to live in the engineering side (when volume is relevant), an area Data Scientists tend to visit but not the usual Web/Marketing Data Analyst. Nowadays the options to tame all that journey and make the best of it are many and they don't require extensive engineering knowledge. Small or Big Data, let's see what "Store, Extract, Transform, Load, Visualize" is all about.
Analyzing Billions of Data Rows with Alteryx, Amazon Redshift, and Tableau - DATAVERSITY
Got lots of data? So does Amaysim, a leading Australian telecom provider, with its billions of rows of data. The organization successfully empowers its small team of data analysts with self-service data analytics platforms so they can easily access the data they need, perform advanced analytics, and visualize findings for all stakeholders. Register for this session and learn how Amaysim uses the Alteryx-Redshift-Tableau BI stack to easily and quickly:
Extract data from their data warehouse and blend and enrich it with other sources
Give data analytical context by running statistical, predictive, and deep geo-spatial analytics
Create visualizations from analytics and then update Tableau Workbooks directly from Alteryx, or publish the results in Amazon Redshift, for easy direct access for their stakeholders from Tableau
Hear from Adrian Loong, Alteryx Analytics Certified Expert (ACE), and product marketers from AWS and Alteryx on how organizations can use Alteryx, Amazon Redshift and Tableau to enable data analysts to spin up new self-service analytics instances to enable fast investigation for critical business decisions.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Types of database processing: OLTP vs. data warehouses (OLAP). Characteristics of a data warehouse: subject-oriented, integrated, time-variant, non-volatile. Functionalities of a data warehouse: roll-up (consolidation), drill-down, slicing, dicing, pivot. The KDD process and applications of data mining.
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.
Similar to Levelling up your data infrastructure (20)
Recommendation algorithms and their variations such as ranking are the most common way for machine learning to find its way into a product where it is not the main focus. In this talk we’ll dig into the subtleties of making recommendation algorithms a seamless and integral part of your UX (goal: it should completely fade into the background. The user should not be aware she’s interacting with any kind of machine learning, it should just feel right, perhaps smart or even a tad like cheating); how to solve the cold start problem (and having little training data in general); and how to effectively collect feedback data. I’ll be drawing from my experiences building Metabase, an open source analytics/BI tool, where we extensively use recommendations and ranking to keep users in a state of flow when exploring data; to help with discoverability; and as a way to gently teach analysis and visualization best practices; all on the way towards building an AI data scientist.
In this talk we will look at how to efficiently (in both space and time) summarize large, potentially unbounded, streams of data by approximating the underlying distribution using so-called sketch algorithms. The main approach we are going to be looking at is summarization via histograms. Histograms have a number of desirable properties: they work well in an on-line setting, are embarrassingly parallel, and are space-bound. Not to mention they capture the entire (empirical) distribution which is something that otherwise often gets lost when doing descriptive statistics. Building from that we will delve into related problems of sampling in a stream setting, and updating in a batch setting; and highlight some cool tricks such as capturing time-dynamics via data snapshotting. To finish off we will touch upon algorithms to summarize categorical data, most notably count-min sketch.
Transducers -- composable algorithmic transformations decoupled from input or output sources -- are Clojure’s take on data transformation. In this talk we will look at what makes a transducer; push their composability to the limit chasing the panacea of building complex single-pass transformations out of reusable components (e.g. calculating a bunch of descriptive statistics like sum, sum of squares, mean, variance, ... in a single pass without resorting to a spaghetti ball fold); explore how the fact that they are decoupled from input and output traversal opens up some interesting possibilities, as they can be made to work in both online and batch settings; all drawing from practical examples of using Clojure to analyze “awkward-size” data.
You have defined your metrics, set up dashboards, and started to incorporate data into your everyday. Great, but I have some bad news for you. Almost certainly some of your metrics are wrong. At best these mistakes mean that you are not getting all the insights you could have, at worst some of the conclusions you have drawn from them are wrong. In this talk we will go through the most common but pernicious mistakes and unravel the mechanisms behind them so that by the end of the talk you will be equipped with an analytical toolset to spot them on your own. The main classes of errors we will cover are: viewing data as a static process; not considering error margins and variance; picking the wrong reference point; assuming your population is homogeneous; and improperly accounting for costs.
Writing correct smart contract is hard (a recent study estimated that 3% of Ethereum contracts in the wild have some sort of security vulnerability; we all know of the DAO and Parity exploits, …). There are two main reasons for this. First and foremost developing for the blockchain is quite different than what most programmers are used to. The level of concurrency is far beyond our (von Neumann) intuition and mental models. And you can’t stop and inspect running code like you can in other systems. Taken together blockchain is closer to a physical/living system than conventional software — a fact not reflected in the tools available. Compared to other domains our tooling and programming languages are somewhere between rudimentary and bad; and a far cry from where they would need to be to augment developers and help make programming for the blockchain less alien and less error prone. In this talk we will first unpack what makes programming for the blockchain hard, and what are the most common types of vulnerabilities and their causes. Then we will look at the state of art programming language research in correctness proving and programming massively concurrent systems; and how these can be applied to programming smart contracts; revisit some technologies from the past that didn’t get traction at the time, but are nevertheless worth studying; and finishing off by trying to imagine how programming for the blockchain should, and perhaps one day will, look like.
Online statistical analysis using transducers and sketch algorithms - Simon Belak
Online statistical analysis using transducers and sketch algorithms. Don’t know what either is? You are going to learn something very cool (and perspective-changing) then. Know them, but want an experience report? Got you covered, fam.
OpenAI recently published a fun paper where they showed using evolution algorithms to train policy networks to perform on par with state of the art reinforcement deep learning. In this talk we’ll try to reimplement the main ideas in that paper using Neanderthal (blazing fast matrix and linear algebra computations) and Cortex (neural networks); make it massively distributed using Onyx; build a simulation environment using re-frame; and of course save our princess from no particular harm in our toy game example
How to systematically open a new market where every step is supported by data, how to set up learning loops, and where to look for optimization opportunities.
You can do cool and unexpected things if your entire type system is a first class citizen and accessible at runtime.
With the introduction of spec, Clojure got its own distinct spin on a type system. Just as macros add another -time (runtime and compile time) where the full power of the language can be used, spec does the same for describing data.
The result is an entire additional type system that is a first class citizen and accessible at runtime that facilitates validation, generative testing (a la QuickCheck), destructuring (pattern matching into deeply nested data), data macros (recursive transformations of data) and a pluginable error system. And then you can start building on top of it.
The talk will be half introduction to spec and the ideas packed within it, and half experience report instrumenting 15k loc production codebase (primarily ETL and analytics) with spec.
Clojure has always been good at manipulating data. With the release of spec and Onyx (“a masterless, cloud scale, fault tolerant, high performance distributed computation system”) good became best. In this talk you will learn about a streaming data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data; the inferences and automations that can be built on top of that; and how and why Clojure is a natural choice for tasks that involve a lot of data manipulation, touching both on functional programming and lisp-specifics such as code-is-data.
We will look at how such an approach can be used to manage a data warehouse by automatically inferring materialized views from raw incoming data or other views based on a combination of heuristics, statistical analysis (seasonality, outlier removal, ...) and predefined ontologies. Doing so is a practical way to maintain a large number of views, increasing their availability and abstracting the complexity into declarative rules, rather than having an ETL pipeline with dozens or even hundreds of hand crafted tasks.
The system described requires relatively little effort upfront but can easily grow with one's needs both in terms of scale as well as scope. With its good introspection capabilities and strong decoupling it is for instance an excellent substrate for putting machine learning algorithms in production, which is the final use-case we will dive into.
Segmentation is key to effectively addressing and converting potential customers. Simon Belak, head of analytics at GoOpti and transmedia editor at the critical newspaper Tribuna, revealed how to discover segments from data.
In his words, it is entirely unjustified that segmentation is mostly static and done blindly, without regard for the data. In the talk he presented an alternative: analytical, partially automated discovery of segments from data.
Using concrete examples he showed how to map data about customer interactions (page visits as indicators of interest, survey answers, on-site movement patterns, email opens…) into a customer model, and then continued by splitting it into segments. Simon concluded by pointing out the most common pitfalls and small tricks for cases where we have little data or the data is unclear.
@sbelak
Simon Belak
Using Onyx in anger
Clojure has always been good at manipulating data. With the release of spec and Onyx ("masterless, cloud scale, fault tolerant, high performance distributed computation system") good became best. In this talk I will walk you through a data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data; and the inferences and automations that can be built on top of that.
Clojure has always been good at manipulating data. With the release of spec and Onyx (“a masterless, cloud scale, fault tolerant, high performance distributed computation system”) good became best. In this talk you will learn about a data layer architecture built around Kafka and Onyx that is self-describing, declarative, scalable and convenient to work with for the end user. The focus will be on the power and elegance of describing data and computation with data; and the inferences and automations that can be built on top of that.
Whenever a programming language comes out with a new feature, us smug lisp weenies shrug and point out how lisp had that in the early seventies; and if you look at the list of influences of a given language, there is bound to be a lisp in there. In this talk I will try to unpack what makes lisp special, why it is called the programmable programming language, how it changes one’s thinking, and how that thinking can be applied elsewhere.
Successfully forecasting future demand is key in allowing GoOpti its low prices while isolating transport partners from risk. In this talk Simon Belak, Chief Data Scientist at GoOpti, will take you through how he approaches forecasting and the lessons he learned along the way. The focus is going to be on models that do not require excessive amounts of data, are legible and work well as part of a continuous process (rather than being a one-off problem).
In this talk, you will discover how a 15k LOC codebase was instrumented with spec so you don't have to (but probably should). Validation; testing; destructuring; composable “data macros” via conformers; we’ve tried spec in all its multifaceted glory. You will discover a distillation of lessons learned interspersed with musing on how spec alters development flow and one’s thinking.
We instrumented 15k LOC codebase with spec so you don't have to (but probably should). Validation; testing; destructuring; composable "data macros" via conformers; we've tried spec in all its multifaceted glory. This talk is a distillation of lessons learned interspersed with musing on how spec alters development flow and one's thinking.
Presented at EuroClojure 2016
Having programmers do data science is terrible, if only everyone else were not even worse! The problem is tools – either a bunch of libraries and an agnostic IDE, or some point-and-click wonder which no matter how glossy never quite fits our needs. The dual lisp tradition of grow-your-own-language and grow-your-own-editor gives me hope there is a third way. This talk is a meditation on how I do data science with Clojure, what the ideal process would look like, and the tools needed to get there. Some already exist (or can at least be bodged together); others can be made with relative ease (and we are already working on some of these); but a few will take a lot more hammock time.
Clojure is fantastic for data manipulation and rapid prototyping, but falls short when it comes to communicating your insights. What is lacking are good visualization libraries and (sharable) notebook-like environments. I'll show my workflow which weaves Clojure with R (for ggplot) and Python (for scikit-learn) and tell you why it's wrong; how IPythons of the world have trapped us in a local maximum and why we need a reconceptualization similar to what a REPL does to programming. All this interposed with my experience doing data science with Clojure (everything from ETL to on-the-spot analysis during brainstorming sessions) and how these are interwoven into the design of Huri, my library for the lazy data scientist.
3. The Problem
… but eventually
• Want granularity smaller than GA exposes
• Want analysis GA doesn’t support
• Want to combine and analyse data from different sources
4. Goal: answer 80% of questions stemming from data in 20 min or less
5. The analytics chasm
[slide diagram: a spectrum from "2 min" (ideal, almost real-time, can be done during brainstorming without disrupting the flow) through "20 min" (squeeze in somewhere in the day) to "project" (added to roadmap), with the gap between them marked "fail"]
6. Levelling up
1.Acquire data (directly, or from 3rd party APIs)
2.Store it in a data warehouse
3.Transform it to a usable and unified shape
4.Perform analytics on it
7. Intermezzo: My perspective
• Core developer at Metabase, an open source BI/analytics tool. 3rd largest BI tool in the world. 20k+ companies use us daily, including N26, Revolut, Swisscom
• Built analytics department at GoOpti from the ground up
• Helped 20+ companies become data-driven
8. Levelling up
1.Acquire data (directly, or from 3rd party APIs)
2.Store it in a data warehouse
3.Transform it to a usable and unified shape
4.Perform analytics on it
9. Collecting requirements
1.Make a list of all the data sources you currently have, how much data is in them (number of entities), and at what rate the data grows
2.Collect user stories from all potential users:
As a ______ I’d like to _________, because _________
3.Match each user story with needed data sources
4.Rank user stories using PIE (probability, impact, effort)
5.Rank data sources by summing the PIE scores of all user stories that require them (see the sketch after this list)
6.Build data infrastructure to enable the high-value cluster
7.Continue doing steps 1-6 as you iterate
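To make the ranking in steps 4-6 concrete, here is a minimal sketch of how it could be computed. The scoring formula (probability times impact, discounted by effort), the example stories and the data source names are illustrative assumptions, not part of the original deck.

```python
# Hypothetical sketch of the PIE ranking from steps 4-6.
# The scoring formula, stories and source names are made up for illustration.
from collections import defaultdict

# Each user story: probability (0-1), impact (1-10), effort (1-10; lower is better)
stories = [
    {"story": "As a PM I'd like to see activation by cohort",
     "p": 0.8, "i": 8, "e": 3, "sources": ["product_db", "event_stream"]},
    {"story": "As a marketer I'd like to tie ad spend to revenue",
     "p": 0.6, "i": 9, "e": 5, "sources": ["ad_apis", "product_db"]},
]

def pie(story):
    # One possible reading of PIE: probability * impact, discounted by effort
    return story["p"] * story["i"] / story["e"]

# Step 5: rank data sources by summing the PIE scores of the stories that need them
source_scores = defaultdict(float)
for s in stories:
    for src in s["sources"]:
        source_scores[src] += pie(s)

for src, score in sorted(source_scores.items(), key=lambda kv: -kv[1]):
    print(f"{src}: {score:.2f}")
```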
10. A minimal data-collection plan
• Event stream
• Goal: be able to reconstruct any given session from data
• Timestamp, session, action, payload, context/result (see the sketch below)
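A minimal sketch of what one such event record could look like, assuming only the five fields listed above; the Event class and the example values are invented for illustration and are not Snowplow's actual schema.

```python
# Illustrative event record with the slide's five fields; not an actual Snowplow schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Event:
    timestamp: datetime                 # when it happened
    session: str                        # lets you reconstruct a whole session later
    action: str                         # what the user or device did
    payload: dict[str, Any]             # action-specific data
    context: dict[str, Any] = field(default_factory=dict)  # surrounding state / result

e = Event(
    timestamp=datetime.now(timezone.utc),
    session="sess-42",
    action="measurement_taken",
    payload={"device_id": "dev-7", "value": 98.6},
    context={"firmware": "1.4.2", "result": "ok"},
)
```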
12. Extract-Load-Transform
• Dump data somewhere as soon as possible so you don't lose it.
• Databases are fast and powerful enough to do most transforms there. In return you get:
• Observability
• Analysts become more self-sufficient (if they know SQL)
• For small-medium data sizes (< 1M data points/day) this is more performant and requires much less infrastructure (see the sketch below)
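As a rough illustration of the EL-then-T idea, the sketch below lands raw JSON untouched in a staging table and does the reshaping inside the database with SQL. The connection string, table and view names are assumptions made for the example.

```python
# Hypothetical EL(T) sketch: land raw JSON first, transform inside the database.
# Connection string, table and view names are made up for the example.
import json
import psycopg2

raw_events = [{"session": "sess-42", "action": "page_view", "payload": {"url": "/pricing"}}]

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # Load: dump the payload as-is so nothing is lost, even if it doesn't fit a schema yet
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_events (
            loaded_at timestamptz DEFAULT now(),
            body      jsonb
        )
    """)
    for e in raw_events:
        cur.execute("INSERT INTO raw_events (body) VALUES (%s)", (json.dumps(e),))

    # Transform: reshape in SQL, where the logic is observable and
    # SQL-literate analysts can follow and extend it themselves
    cur.execute("""
        CREATE OR REPLACE VIEW page_views AS
        SELECT loaded_at,
               body->>'session'          AS session,
               body->'payload'->>'url'   AS url
        FROM raw_events
        WHERE body->>'action' = 'page_view'
    """)
```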
13. Good ELT is:
• Repeatable
• Observable
• Extensible
• Scalable
• Recoverable (don't lose data, ever!)
15. Identify the principal axes of your data
• User, account, transaction, instance, product, event (log)…
• There will (and should) be some overlap
• Different axes will have different granularity
• Some should be ordered in time
16. Data warehouse topology
• Big fat denormalised tables, one for each principal axis
• Use views to tailor the representation to your tools and analysis needs (see the sketch below)
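A sketch of what this could look like for the user axis: one wide, denormalised table plus a view that tailors it to a specific question. The column names and the 30-day window are invented for illustration.

```python
# Hypothetical DDL for one "big fat" table per principal axis (here: user),
# plus a view that adapts it to a particular analysis. Columns are invented.
WIDE_USERS = """
CREATE TABLE IF NOT EXISTS users_wide (
    user_id        text PRIMARY KEY,
    signed_up_at   timestamptz,
    plan           text,
    country        text,
    -- facts pre-joined from other sources, denormalised on purpose
    lifetime_value numeric,
    n_sessions_30d int,
    last_seen_at   timestamptz
);
"""

# Views tailor the representation to a tool or question without copying data
ACTIVE_USERS_VIEW = """
CREATE OR REPLACE VIEW active_users AS
SELECT user_id, plan, country, n_sessions_30d
FROM users_wide
WHERE last_seen_at > now() - interval '30 days';
"""
```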
17. Which DB?
• Optimize for ease of ad-hoc querying
• Should be decently performant (waiting kills productivity) but is unlikely to be the bottleneck
• Simple to deploy, connect to, and use
• Strong data validation/schemas, but should also handle non-structured data (validation on load = data loss)
• Sane handling of timezones, date-time arithmetic, & numbers
18. My go-to stack
• Snowplow for event-like data
• Apache Airflow to manage the workflow (see the DAG sketch after this list)
• (managed) Postgres for data warehouse (or Druid if only event data and a lot of it)
• dbt for data transforms
• Metabase for analytics
• Fully open-source
• Extensible, performant
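A minimal sketch of how the pieces above might be wired together in Airflow, with an extract/load task feeding a dbt run; the DAG id, schedule, callable and dbt command are illustrative assumptions rather than the original setup.

```python
# Illustrative Airflow DAG wiring extract/load into a dbt transform.
# DAG id, schedule and commands are assumptions, not the original setup.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def mirror_product_db():
    # placeholder: copy tables from the product database into the warehouse
    ...

with DAG(
    dag_id="warehouse_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_load = PythonOperator(task_id="mirror_product_db",
                                  python_callable=mirror_product_db)
    transform = BashOperator(task_id="dbt_run",
                             bash_command="dbt run --project-dir /opt/dbt")
    extract_load >> transform  # Metabase then queries the transformed tables directly
```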
36. You can often encode dynamic processes as binary outcomes
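One way to read this slide: collapse a stream of user events (the dynamic process) into a single yes/no question such as "did the user convert within 7 days of signup?". The sketch below uses pandas and made-up data to illustrate the idea.

```python
# Collapsing a dynamic process (an event stream) into a binary outcome:
# "did the user convert within 7 days of signing up?" Data is synthetic.
import pandas as pd

events = pd.DataFrame({
    "user_id":   ["a", "a", "b", "b", "b"],
    "action":    ["signup", "purchase", "signup", "page_view", "purchase"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02", "2024-01-05", "2024-01-20"]),
})

signup = events[events.action == "signup"].set_index("user_id").timestamp
first_purchase = events[events.action == "purchase"].groupby("user_id").timestamp.min()

converted_in_7d = (first_purchase - signup) <= pd.Timedelta(days=7)
print(converted_in_7d)  # a: True, b: False; a binary outcome you can count, segment and test
```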
37. Signal or noise?
• Trend & relative change often tell more than absolute values
• Percentiles (see the sketch after this list)
• Intra- vs. inter-segment variance
• Significance tests
• Sample representativeness (is not just for A/B tests)
• Distribution similarity
• Have a reference point (and reference it often)
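Two of the checks above in a few lines: look at percentiles rather than a single average, and run a significance test before trusting a difference between segments. The data is synthetic, and the Mann-Whitney U test is just one reasonable non-parametric choice, not the deck's prescribed method.

```python
# Percentiles instead of a lone mean, and a significance test between two segments.
# Synthetic data; Mann-Whitney U is one reasonable non-parametric choice here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
segment_a = rng.lognormal(mean=3.0, sigma=0.8, size=500)  # e.g. order values
segment_b = rng.lognormal(mean=3.1, sigma=0.8, size=60)

print(np.percentile(segment_a, [25, 50, 75, 95]))  # see the distribution, not just the mean
print(np.percentile(segment_b, [25, 50, 75, 95]))

stat, p_value = stats.mannwhitneyu(segment_a, segment_b)
print(f"Mann-Whitney U p-value: {p_value:.3f}")   # is the difference more than noise?
```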
39. MESI
• Medical devices
• North-star metric: number of measurements/device
• Current data sources: GA, product database, Countly, Sentry, HubSpot, Odoo
40. MESI data acquisition
• Collect event stream from devices capturing all the interactions [Snowplow]
• Mirror product database into data warehouse [Airflow]
• Collect event stream from the website [Snowplow]
• Integrate HubSpot and Odoo via API [Airflow]
• Integrate Sentry via API [Airflow]
• (Retire Countly)
• (Add support data — Jira, Zendesk, …)
• (Add accounting/billing)
41. MESI data warehouse
• (managed) Postgres
• Principal axes: account, user, device event, user journey event, device
42. MESI analytics
• Metabase
• User journey before conversion
• Device usage patterns
• UX friction points
• Onboarding
• Errors & support issues
• Segmentation
44. SalesGenomics
• eCommerce marketing agency focused on scale-up
• Typical customer marketing budget 10k-100k/month
• Current data sources: GA, FB, Shopify
• 2-sided reporting: for clients, internal
45. SalesGenomics data acquisition
• Custom event collector on websites (replacing GA snippet) [Snowplow]
• Integrate Shopify, AdWords, FB ads [Airflow]
— OR —
• Use Segment/Stitch Data
49. Starting from 0
• Set up GA (remember the minimal data-collection plan)
• Connect Metabase to your product DB
• Collect data user stories from day 1
• Focus analytics on user journey, segmentation, costs, & UX