video: http://www.youtube.com/watch?v=t3bISkp7zBw
A template to guide developers in setting up, configuring and running a clickstream pipeline using open source tools:
- Snowplow
- Apache Kafka
- Docker
and Cloud tools:
- Google BigQuery
- Kubernetes
# company
For a clickstream pipeline that tracks e-commerce visitors, see https://stacktome.com#customer-retention
Developer's guide for building a real-time clickstream pipeline with Snowplow, Apache Kafka and BigQuery
1. Developer's guide for building a real-time clickstream pipeline with Snowplow, Apache Kafka and BigQuery
May 23rd, 2017
Evaldas Miliauskas
@evaldasw
TeamLead @ FuzzyLabs Research
2. Objective
Provide a template for building a clickstream pipeline using
open source tools and BigQuery
Clickstream - recorded user activity events
originating from one or more websites
Clickstream pipeline - a group of software tools and
libraries configured to capture and store user-generated
events
3. Demo - 10 mins
Pipeline composition - 5 mins
Overview of each component - 15 mins
Data engineering - 5 mins
Questions - 10 mins
10. Snowplow
Everyone has heard of Google Analytics. Snowplow is an
open source alternative that addresses the same problem, but
also gives you full control over which data you collect and
how you collect it.
Tracker - sends events from the client-side app
Collector - receives events, validates their format and stores them raw
Enricher - validates events against a schema, extends them with extra
attributes and stores the enriched events
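The three stages above can be sketched in plain Python. This is a minimal illustration of the tracker, collector and enricher roles, not Snowplow's actual API: the function names, the `PAGE_VIEW_SCHEMA` layout and the added attributes are all hypothetical.

```python
import json
import time

# Hypothetical schema: required field name -> expected Python type.
PAGE_VIEW_SCHEMA = {"event": str, "url": str, "user_id": str}


def track(event, url, user_id):
    """Tracker: build the payload a client-side app would send."""
    return json.dumps({"event": event, "url": url, "user_id": user_id})


def collect(raw_payload, raw_store):
    """Collector: check the payload is a well-formed JSON object, then store it raw."""
    try:
        parsed = json.loads(raw_payload)
    except ValueError:
        return False
    if not isinstance(parsed, dict):
        return False
    raw_store.append(raw_payload)
    return True


def enrich(raw_payload, schema, clock=time.time):
    """Enricher: validate against the schema and add extra attributes."""
    event = json.loads(raw_payload)
    for field, expected_type in schema.items():
        if not isinstance(event.get(field), expected_type):
            return None  # schema violation: the event is rejected
    event["enriched_at"] = clock()  # hypothetical extra attribute
    event["is_product_page"] = "/product/" in event["url"]  # hypothetical extra attribute
    return event


raw_events = []
collect(track("page_view", "https://shop.example/product/42", "u1"), raw_events)
collect("not json", raw_events)  # malformed payload, dropped by the collector
enriched = [e for e in (enrich(r, PAGE_VIEW_SCHEMA) for r in raw_events) if e]
```

The collector only checks the format, while the enricher checks the content, which mirrors the split of responsibilities described above.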
12. Apache Kafka
Originally developed at LinkedIn, it is now one of the most
widely used event store solutions, adopted by many companies
where data is a first-class citizen.
Topic - a dedicated list that messages can be written to and read from
Cursor (offset) - the position of the last message read in a topic
Lifetime (retention) - how long a message lives inside a topic
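To make the three terms concrete, here is a small in-memory model of a topic. It is a sketch for illustration only and does not use any Kafka client library; the `Topic` class, its methods and the reader names are assumptions made up for this example.

```python
import time


class Topic:
    """In-memory stand-in for a topic: an append-only list of messages."""

    def __init__(self, lifetime_seconds):
        self.lifetime_seconds = lifetime_seconds
        self._messages = []   # list of (offset, written_at, value)
        self._next_offset = 0
        self._cursors = {}    # reader name -> next offset to read

    def write(self, value, now=None):
        now = time.time() if now is None else now
        self._messages.append((self._next_offset, now, value))
        self._next_offset += 1

    def read(self, reader, now=None):
        """Return messages the reader has not seen yet and advance its cursor."""
        now = time.time() if now is None else now
        self._expire(now)
        cursor = self._cursors.get(reader, 0)
        unread = [(off, value) for off, _, value in self._messages if off >= cursor]
        if unread:
            self._cursors[reader] = unread[-1][0] + 1
        return [value for _, value in unread]

    def _expire(self, now):
        """Drop messages older than the topic lifetime."""
        self._messages = [m for m in self._messages
                          if now - m[1] < self.lifetime_seconds]


topic = Topic(lifetime_seconds=60)
topic.write("page_view", now=0)
topic.write("add_to_cart", now=10)
first = topic.read("loader", now=20)    # both messages
topic.write("checkout", now=30)
second = topic.read("loader", now=40)   # only the message written since the last read
late = topic.read("auditor", now=65)    # "page_view" has outlived the lifetime
```

Each reader keeps its own cursor, so a new reader can start from the oldest message that has not yet expired while existing readers continue from where they stopped.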
13. Google BigQuery
An MPP (massively parallel processing) columnar data store
available on GCP (Google Cloud Platform).
Main competitor to AWS Redshift.
Its advantage is that it is fully managed by Google, so you
spend less time on devops activities just to keep it running
optimally.
Nested data - support for hierarchical data structures, like JSON
Slot - a unit of compute capacity used to execute a submitted query
Stream/Batch - supports both ways of loading data
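As an illustration of nested data, the sketch below describes a table with a nested record and a repeated record, using the JSON layout of BigQuery schema files, and serializes rows as newline-delimited JSON, a format BigQuery accepts for batch loads. The field names and the helpers `field_paths` and `to_ndjson` are hypothetical, and no BigQuery client is called.

```python
import json

# Schema in the JSON layout used for BigQuery schema files; field names are hypothetical.
SCHEMA = [
    {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "page", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "url", "type": "STRING", "mode": "NULLABLE"},
        {"name": "title", "type": "STRING", "mode": "NULLABLE"},
    ]},
    {"name": "items", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "sku", "type": "STRING", "mode": "NULLABLE"},
        {"name": "quantity", "type": "INTEGER", "mode": "NULLABLE"},
    ]},
]


def field_paths(schema, prefix=""):
    """List the dotted paths of all leaf columns, descending into RECORD fields."""
    paths = []
    for field in schema:
        path = prefix + field["name"]
        if field["type"] == "RECORD":
            paths.extend(field_paths(field["fields"], path + "."))
        else:
            paths.append(path)
    return paths


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one row per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


rows = [
    {"user_id": "u1", "page": {"url": "/product/42", "title": "Blue mug"},
     "items": [{"sku": "mug-42", "quantity": 2}]},
    {"user_id": "u2", "page": {"url": "/cart", "title": "Cart"}, "items": []},
]
ndjson = to_ndjson(rows)
```

Keeping `page` and `items` nested means one event stays one row, instead of being split across several joined tables.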
14. Docker
Allows you to run your applications in isolated, lightweight
containers without the need to virtualize a full machine.
No more “works on my machine”
Dockerfile - spec for how an image is built and run
Image - a self-contained package of the OS files and libraries
necessary to run the container
dockerd - the daemon process that manages Docker images and
containers and serves requests from the docker CLI
15. Kubernetes
An open source platform that reduces the friction of running,
deploying, monitoring and managing one or more dockerized
applications on any infrastructure (GCP, AWS, Azure,
on-premises)
Pod - the smallest deployable unit: one or more Docker containers
that are run, scaled and managed together
Service - provides the ability to connect and expose different
applications on the cluster network
Deployment - allows updating pods with zero downtime
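The zero-downtime idea behind a Deployment can be sketched as a rolling update: new pods are started before old ones are stopped, so the number of running pods never drops below the desired count. The simulation below is a simplified illustration, not the Kubernetes controller's actual algorithm; `rolling_update` and its `max_surge` parameter are names chosen for this example.

```python
def rolling_update(pods, new_version, max_surge=1):
    """Replace pods in small batches, starting new pods before stopping old ones.

    Returns the final pod list and the lowest number of running pods observed.
    """
    if max_surge < 1:
        raise ValueError("max_surge must be at least 1")
    pods = list(pods)
    min_running = len(pods)
    while any(version != new_version for version in pods):
        old = [version for version in pods if version != new_version]
        batch = min(max_surge, len(old))
        pods.extend([new_version] * batch)   # surge: start new pods first
        for _ in range(batch):               # then stop the same number of old pods
            pods.remove(next(v for v in pods if v != new_version))
        min_running = min(min_running, len(pods))
    return pods, min_running


final_pods, lowest = rolling_update(["v1", "v1", "v1"], "v2")
```

Because each batch adds pods before removing any, `lowest` stays equal to the original pod count throughout the update.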
17. Data engineering
“Data engineers build tools, infrastructure, frameworks, and
services.” - The Rise of the Data Engineer by Maxime Beauchemin
(creator of Apache Airflow)
As data becomes more and more central to every company, it is
becoming critical to treat data management and all related
infrastructure with the same care as code and the applications
built from it.