Presented at the 25th Data Science Leuven meetup on 2020/03/11
Jonny Daenen explains the steps the team at Selligent took to create a multi-tenant real-time data pipeline. He discusses the challenges the team encountered as well as the tools they used, and demonstrates how Google Cloud Platform removes operational hurdles when moving data pipelines to production.
2. JONNY DAENEN - DATA SHEET
• Data Engineer/Scientist
• 3 years at Selligent
• PhD Computer Science
• Focus = Data!
3. SPOILER
• Data Engineering and the flip side of the Data Science Pyramid
• Getting to Production is a long road
• Using GCP to focus on the important stuff
9. Little Jonny* and Regular Jonny
* for legal reasons we are obliged to mention that "Little Jonny" is in no way related to "Little John" from the "Robin Hood" story.
19. CAPTURE - PUB/SUB
• Event enters system
• Event is sent to Pub/Sub
• No ops
• Globally available
• Pay as you go
• 7 day retention
• No ordering guarantee (ordering is an alpha feature)
• No server-side filtering
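A minimal sketch of the capture step, using only the Python standard library. A plain dict stands in for a Pub/Sub message and no real client is called. The field names `tenant_id`, `consumer_id` and `event_type` are assumptions for illustration, not taken from the talk. Because there is no server-side filtering, routing metadata is copied into message attributes so that a subscriber can filter on its own side without decoding every payload.

```python
import json
from typing import Dict, Iterable, Iterator


def to_message(event: dict) -> Dict[str, object]:
    """Wrap an event as a bytes payload plus string attributes."""
    return {
        "data": json.dumps(event, sort_keys=True).encode("utf-8"),
        "attributes": {
            "tenant_id": str(event["tenant_id"]),
            "event_type": str(event["event_type"]),
        },
    }


def filter_client_side(messages: Iterable[dict], event_type: str) -> Iterator[dict]:
    """Keep only messages of one type; the payload is decoded only on a match."""
    for msg in messages:
        if msg["attributes"]["event_type"] == event_type:
            yield json.loads(msg["data"].decode("utf-8"))


# Hypothetical events from two tenants.
events = [
    {"tenant_id": "tenant-a", "consumer_id": "c1", "event_type": "email_open"},
    {"tenant_id": "tenant-b", "consumer_id": "c7", "event_type": "email_click"},
    {"tenant_id": "tenant-a", "consumer_id": "c2", "event_type": "email_open"},
]
messages = [to_message(e) for e in events]
opens = list(filter_client_side(messages, "email_open"))
```

Every subscriber still receives every message in this setup; the attributes only make the discard step cheap.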
25. PROCESSING - DATAFLOW
• Aggregation of events per consumer per tenant
• Dataflow
• Managed (choose your machines)
• Auto-scaling
• In-flight pipeline updates
• Monitoring
• Exactly-once
• Batch and streaming (Apache Beam)
• SQL available
• Documentation unclear
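The aggregation per consumer per tenant can be sketched in plain Python. This is not Beam or Dataflow code: it only illustrates the idea of keying events by tenant and consumer and counting them within fixed time windows. The 60-second window size and the field names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size, not from the talk


def window_start(timestamp: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % size)


def aggregate(events: Iterable[dict]) -> Dict[Tuple[str, str, int], int]:
    """Count events per (tenant, consumer, window)."""
    counts: Dict[Tuple[str, str, int], int] = defaultdict(int)
    for e in events:
        key = (e["tenant_id"], e["consumer_id"], window_start(e["timestamp"]))
        counts[key] += 1
    return dict(counts)


events = [
    {"tenant_id": "t1", "consumer_id": "c1", "timestamp": 5},
    {"tenant_id": "t1", "consumer_id": "c1", "timestamp": 59},
    {"tenant_id": "t1", "consumer_id": "c1", "timestamp": 61},
    {"tenant_id": "t2", "consumer_id": "c1", "timestamp": 10},
]
result = aggregate(events)
```

In a Beam pipeline the same shape would typically be expressed as keying, windowing and a per-key combine, with the runner handling scaling and retries.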
28. PROBLEM!
• Aggregate per consumer per tenant
• Too much state data
• 1,000,000,000 users?
• How much memory do we need?
• What about inactive consumers?
• Offload to document store
• key-value access
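A back-of-the-envelope calculation shows why keeping all state inside the pipeline becomes a problem. The per-consumer state size of 1,000 bytes is a purely hypothetical figure; the talk does not give one.

```python
def state_size_bytes(consumers: int, bytes_per_consumer: int) -> int:
    """Total state size if every consumer's state is kept in memory."""
    return consumers * bytes_per_consumer


CONSUMERS = 1_000_000_000           # the "one billion users" question
ASSUMED_BYTES_PER_CONSUMER = 1_000  # hypothetical, for illustration only

total_bytes = state_size_bytes(CONSUMERS, ASSUMED_BYTES_PER_CONSUMER)
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"{total_tb:.1f} TB of state")
```

Under that assumption the state alone is about one terabyte, most of it belonging to consumers who are not active at any given moment, which is the argument for offloading it to a store with key-value access.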
29. STATE - DATASTORE
• Datastore
• No ops
• Namespaces
• Deal with failures
• Costs can go up
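The offloading pattern can be sketched with an in-memory stand-in for a document store that keeps one namespace per tenant. `NamespacedStore` and `add_events` are hypothetical names for illustration; this is not the Datastore client API.

```python
from typing import Dict, Optional, Tuple


class NamespacedStore:
    """In-memory stand-in for a key-value store with one namespace per tenant."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], dict] = {}

    def get(self, namespace: str, key: str) -> Optional[dict]:
        doc = self._data.get((namespace, key))
        return dict(doc) if doc is not None else None

    def put(self, namespace: str, key: str, doc: dict) -> None:
        self._data[(namespace, key)] = dict(doc)


def add_events(store: NamespacedStore, tenant_id: str, consumer_id: str, n: int) -> dict:
    """Load a consumer's state, update it, and write it back."""
    state = store.get(tenant_id, consumer_id) or {"event_count": 0}
    state["event_count"] += n
    store.put(tenant_id, consumer_id, state)
    return state


store = NamespacedStore()
add_events(store, "tenant-a", "consumer-1", 3)
add_events(store, "tenant-a", "consumer-1", 2)
add_events(store, "tenant-b", "consumer-1", 1)
```

The same consumer ID under two tenants maps to two separate documents, and an inactive consumer costs storage but no pipeline memory. A read-modify-write like this is not safe under retries on its own, which is where dealing with failures comes in.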
34. FAULT-TOLERANCE
• Idempotency!
• What if pipeline fails?
• Streaming means: Re-execute, Re-execute, Re-execute
• Bundle
• Out of order processing of successive windows
• Can you deal with it?
• Depends on use case
• Exactly-once?
• Use native Dataflow/Beam operators
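One way to make a state update idempotent, sketched under the assumption that each window produces a single count per consumer: store the result per window and overwrite it, instead of incrementing a running total. Re-executing a bundle, or processing windows out of order, then leaves the same state. The function names are hypothetical.

```python
from typing import Dict


def apply_window(state: Dict[int, int], window_start: int, count: int) -> Dict[int, int]:
    """Record a window's count by overwriting, so a replay changes nothing."""
    new_state = dict(state)
    new_state[window_start] = count  # overwrite, never increment
    return new_state


def total(state: Dict[int, int]) -> int:
    """Derive the running total from the per-window results."""
    return sum(state.values())


# Windows arrive out of order and each is delivered twice (a re-execution).
deliveries = [(60, 1), (0, 2), (60, 1), (0, 2)]

state: Dict[int, int] = {}
for start, count in deliveries:
    state = apply_window(state, start, count)
```

An increment-based update would have counted six events here instead of three. Whether per-window storage is affordable depends on the use case.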
The Researcher
42. TESTING
• Unit tests
• Dataflow test framework
• Integration test
• Between services
• External components
• Mocking?
• Performance test
• Does it scale
• Multi-tenancy
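A unit test is easiest when the aggregation logic is a pure function that can run without the pipeline. The sketch below uses the standard `unittest` module; `count_per_consumer` is a hypothetical function, and the second test checks a basic multi-tenancy property.

```python
import unittest
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_per_consumer(events: Iterable[dict]) -> Dict[Tuple[str, str], int]:
    """Count events per (tenant, consumer) pair."""
    return dict(Counter((e["tenant_id"], e["consumer_id"]) for e in events))


class CountPerConsumerTest(unittest.TestCase):
    def test_counts_events_per_consumer(self) -> None:
        events = [
            {"tenant_id": "t1", "consumer_id": "c1"},
            {"tenant_id": "t1", "consumer_id": "c1"},
            {"tenant_id": "t1", "consumer_id": "c2"},
        ]
        self.assertEqual(
            count_per_consumer(events), {("t1", "c1"): 2, ("t1", "c2"): 1}
        )

    def test_tenants_are_isolated(self) -> None:
        # The same consumer ID under two tenants must not be merged.
        events = [
            {"tenant_id": "t1", "consumer_id": "c1"},
            {"tenant_id": "t2", "consumer_id": "c1"},
        ]
        self.assertEqual(
            count_per_consumer(events), {("t1", "c1"): 1, ("t2", "c1"): 1}
        )


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountPerConsumerTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests of the actual pipeline transforms would go through the Beam testing utilities instead; integration and performance tests need the real services.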
"The Butcher"
53. BIGQUERY
• Dataflow to BigQuery
• Storage
• BigQuery
• No ops
• Pay for storage
• Pay per byte queried
• Data Market
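"Pay per byte queried" means query cost scales with the amount of data scanned, which can be sketched as simple arithmetic. The price below is a placeholder, not a real BigQuery price, and billing per TiB is an assumption; check the current pricing page for actual numbers.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def query_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / TIB * price_per_tib


PLACEHOLDER_PRICE_PER_TIB = 5.0  # hypothetical value for illustration only

# Scanning a whole 10 TiB table versus half a TiB for a single tenant.
full_scan = query_cost(10 * TIB, PLACEHOLDER_PRICE_PER_TIB)
one_tenant = query_cost(TIB // 2, PLACEHOLDER_PRICE_PER_TIB)
```

The ratio between the two results matters more than the absolute numbers: a query that touches only one tenant's data costs a fraction of a full scan.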
54. ONBOARDING, CLIENTS & LEGAL
• How to create business value?
• How to measure success?
• Who does activations?
• Do we need initial data loads?
• Who triggers it?
• What documents need to be signed?
• What do clients expect?
"Ms. Heard"
65. KEY FIGURES
• Currently:
• 30 of the 700 clients
• 300 events per second
• Upcoming:
• 100M emails per day
• Mobile push deliveries
• SMS deliveries
• Website views
• Different Aggregations
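To compare the current and upcoming figures, both can be converted to the same unit. The calculation below uses only the numbers on this slide and assumes load is spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

CURRENT_EVENTS_PER_SECOND = 300
UPCOMING_EMAILS_PER_DAY = 100_000_000

# Current load expressed per day.
current_per_day = CURRENT_EVENTS_PER_SECOND * SECONDS_PER_DAY

# Upcoming email load expressed as an average per second.
upcoming_avg_per_second = UPCOMING_EMAILS_PER_DAY / SECONDS_PER_DAY

# Growth factor from emails alone, before push, SMS and website views.
growth_factor = upcoming_avg_per_second / CURRENT_EVENTS_PER_SECOND

print(f"current:  {current_per_day:,} events per day")
print(f"upcoming: {upcoming_avg_per_second:,.0f} emails per second on average")
print(f"growth:   {growth_factor:.1f}x")
```

So 300 events per second is roughly 26 million events per day, and 100M emails per day averages about 1,157 per second, close to four times the current rate. Peak rates would be higher than these averages.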
70. STRATEGIES
Everything as code
• Traceable
• Reproducible
• Explicit
Cloud/Serverless
• Less management
• DevOps becomes easier
• Pay as you go
Automation
• Less ops work
• Reliable releases
• Continuous delivery
72. TEAM DeLorean
• Hanne Van Briel: Product Management
• Tom Artoos: Back-end
• Yohan Laudelout: Scrum Master
• Timo Naessens: Quality
• Kirill Ismagulov: Data Engineer / Scientist
• Dirk Dupont: Data Engineer / Scientist
• Jonny Daenen: Data Engineer / Scientist
• Laurens Vijnck: Data Engineer / Scientist
73. GCP TAKEAWAY
• High-level concepts -> Focus on Data
• Good console and CLI
• No ops global services
• Lower cost (depends!)
• Documentation not always up-to-date