Developer's guide for building a real-time clickstream pipeline with Snowplow, Apache Kafka and BigQuery


Video: http://www.youtube.com/watch?v=t3bISkp7zBw
A template to guide developers in setting up, configuring and running a clickstream pipeline using open source tools (Snowplow, Apache Kafka, Docker) and cloud tools (Google BigQuery, Kubernetes).
For a company offering a clickstream pipeline to track ecommerce visitors, see https://stacktome.com#customer-retention


Developer's guide for building a real-time clickstream pipeline
with Snowplow, Apache Kafka and BigQuery
May 23rd, 2017
Evaldas Miliauskas
@evaldasw
Team Lead @ FuzzyLabs Research

Objective
Provide a template for building a clickstream pipeline using
open source tools and BigQuery
Clickstream - recorded user activity events
originating from one or more websites
Clickstream pipeline - a group of software tools and
libraries configured to capture and store user-generated
events

Demo - 10 mins
Pipeline composition - 5 mins
Overview of each component - 15 mins
Data engineering - 5 mins
Questions - 10 mins

devguide.herokuapp.com
Which OS is most popular?

Components
Snowplow
Apache Kafka
Google BigQuery
Docker
Kubernetes

Snowplow
Everyone has heard of Google Analytics. Snowplow is an
open source alternative that addresses the same problem, but
also gives you full control over which data you collect and
how you collect it.
Tracker - sends events from the client-side app
Collector - receives events, validates their format and stores them raw
Enricher - validates events against a schema, extends them with extra
attributes and stores the enriched events
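
The three Snowplow roles above can be sketched in a few lines of Python. This is only an illustration of the tracker, collector and enricher responsibilities, not Snowplow's actual code or API: the function names, the `PAGE_VIEW_SCHEMA` dictionary, and the `page_host` and `enriched_at` attributes are all assumptions made up for this example.

```python
import json
import time
from urllib.parse import urlparse

# Hypothetical schema: required field names mapped to the expected Python type.
PAGE_VIEW_SCHEMA = {"event": str, "url": str, "user_id": str}


def track(event, url, user_id):
    """Tracker: build the payload a client-side app would send."""
    return json.dumps({"event": event, "url": url, "user_id": user_id})


def collect(raw_payload, raw_store):
    """Collector: check that the payload is a JSON object, then store it raw."""
    try:
        parsed = json.loads(raw_payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    raw_store.append(raw_payload)
    return True


def enrich(raw_payload, schema, good_store, bad_store, now=time.time):
    """Enricher: validate against the schema, add attributes, store the result."""
    event = json.loads(raw_payload)
    valid = all(isinstance(event.get(key), kind) for key, kind in schema.items())
    if not valid:
        bad_store.append(event)  # keep failed events for later inspection
        return None
    enriched = dict(
        event,
        page_host=urlparse(event["url"]).netloc,  # derived attribute
        enriched_at=now(),                        # processing timestamp
    )
    good_store.append(enriched)
    return enriched
```

In this sketch the collector only checks the format, while schema validation happens in the enricher, which mirrors the split of responsibilities described above.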

Apache Kafka
Originally developed at LinkedIn, it is now among the most
widespread event store solutions, used by many companies
where data is a first-class citizen.
Topic - a named, append-only log that messages are written to and read from
Cursor (offset) - the position of the last message a consumer read in a topic
Lifetime (retention) - how long a message lives inside a topic
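
To make the three terms concrete, here is a minimal in-memory sketch in Python. It is not the Kafka client API and leaves out partitions, brokers and persistence entirely; the `Topic` class and its method names are assumptions invented for this illustration.

```python
class Topic:
    """In-memory stand-in for a topic: an append-only log of messages."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds  # message lifetime
        self.base_offset = 0   # offset of the oldest message still retained
        self.log = []          # list of (timestamp, message) pairs
        self.cursors = {}      # consumer name -> next offset to read

    def write(self, message, now):
        """Append a message and return the offset it was stored at."""
        self.log.append((now, message))
        return self.base_offset + len(self.log) - 1

    def read(self, consumer, max_messages=10):
        """Return the next messages for a consumer and advance its cursor."""
        start = max(self.cursors.get(consumer, self.base_offset), self.base_offset)
        index = start - self.base_offset
        batch = [message for _, message in self.log[index:index + max_messages]]
        self.cursors[consumer] = start + len(batch)
        return batch

    def expire(self, now):
        """Drop messages that are older than the retention period."""
        while self.log and now - self.log[0][0] > self.retention_seconds:
            self.log.pop(0)
            self.base_offset += 1
```

Each consumer keeps its own cursor, so two consumers can read the same topic independently, and reading a message does not remove it. Only the lifetime setting does.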

Google BigQuery
An MPP (massively parallel processing) columnar data store
available on GCP (Google Cloud Platform).
Main competitor to AWS Redshift.
Its advantage is that it is fully managed by Google, so you
spend less time on devops activities just to keep it running
optimally.
Nested data - support for hierarchical data structures, such as JSON
Slot - a unit of compute that BigQuery uses to execute a submitted query
Stream/Batch - supports both ways of loading data
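
As an illustration of nested data, the sketch below describes a table using the name/type/mode/fields layout of BigQuery JSON schema files, where a `RECORD` field holds child fields and `REPEATED` marks an array. The table itself (`EVENT_SCHEMA` and its columns) and the `leaf_columns` helper are hypothetical and only meant to show how a hierarchical event maps to columns.

```python
# Hypothetical clickstream table schema with one nested and one repeated record.
EVENT_SCHEMA = [
    {"name": "event_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "user", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "id", "type": "STRING", "mode": "NULLABLE"},
        {"name": "os", "type": "STRING", "mode": "NULLABLE"},
    ]},
    {"name": "items", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "sku", "type": "STRING", "mode": "NULLABLE"},
        {"name": "price", "type": "FLOAT", "mode": "NULLABLE"},
    ]},
]


def leaf_columns(schema, prefix=""):
    """Return the dotted path of every leaf column in a nested schema."""
    paths = []
    for field in schema:
        path = prefix + field["name"]
        if field["type"] == "RECORD":
            # Recurse into the child fields of a nested record.
            paths.extend(leaf_columns(field["fields"], path + "."))
        else:
            paths.append(path)
    return paths
```

Because the store is columnar, each leaf path such as `user.os` is kept as its own column, so a query that only touches `user.os` does not need to read the rest of the event.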

Docker
Lets you run your applications in isolated, lightweight
containers without the need to virtualize a full machine.
No more “works on my machine”.
Dockerfile - the spec for how an image is built and what it runs
Image - a self-contained package of the OS userland and libraries needed to run the
container
dockerd - the daemon process that manages Docker images and containers and
serves requests from the docker CLI

Kubernetes
An open source platform that reduces the friction of running,
deploying, monitoring and managing one or more dockerized
applications on any infrastructure (GCP, AWS, Azure,
on-premises)
Pod - the smallest deployable unit: one or more containers that are
scheduled, run and managed together
Service - provides the ability to connect and expose different
applications on the cluster network
Deployment - manages a set of replica pods and lets you update them with zero downtime
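
The zero-downtime update can be pictured as a rolling replacement: start one new pod, then stop one old pod, and repeat. The Python below is a simplified simulation of that idea, not how the Kubernetes Deployment controller is implemented. It roughly corresponds to a rolling update with `maxSurge` of 1 and `maxUnavailable` of 0, and the `rolling_update` function is an assumption made up for this example.

```python
def rolling_update(pods, new_version):
    """Replace pods one at a time; return the final pods and the lowest pod count seen.

    pods is a list of version strings, one per running pod.
    """
    running = list(pods)
    lowest = len(running)
    while any(version != new_version for version in running):
        running.append(new_version)   # start one new pod first (surge of 1)
        old = next(version for version in running if version != new_version)
        running.remove(old)           # then stop one old pod
        lowest = min(lowest, len(running))
    return running, lowest
```

For example, updating three `v1` pods to `v2` ends with three `v2` pods, and the number of running pods never falls below three along the way.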

Data engineering
“Data engineers build tools, infrastructure, frameworks, and
services.” - The Rise of the Data Engineer by Maxime Beauchemin
(creator of Airflow)
As data becomes more and more central to every company, it’s
becoming critical to treat data management and all related
infrastructure in the same fashion as code and the applications
that implement it.

Resources
GitHub: https://github.com/fuzzylabs/dev-guide-sp-kafka-bq
Presentation:
https://docs.google.com/a/fuzzylabsresearch.com/presentation/d/1UC_ci5A4zEQf4NgqWCS5pEzQq2eVCBeuqut6BM8xdTU

We’re Hiring
Contact HR@fuzzylabsresearch.com


4. DEMO


6. Pipeline Composition (Full)

7. Pipeline (Configuration)

8. Pipeline (Use cases)


11. Snowplow standard pipeline components


16. Kubernetes network architecture


18. Data engineering trends

19. Questions?
