Presentation Data Council Meetup: F. Mekkenholt, R. Vlijm

Building reliable big data applications for news brands across the Benelux
- Why
- What
- How, 5 challenges
Rogier Vlijm
Frank Mekkelholt

Big Data De Persgroep Nederland 2019
25 GB = 111,825 books per day
>300 million events a day
75 GB data a day

Monthly reach
DIGITAL and PRINT 2018

Data strategy
Building a data foundation to generate value with data for:
1 Subscriptions Marketing: increase conversion & personal offer
2 Digital advertising: personalized ads, better results
3 Newsroom: uplift of page views, unique visitors & time on site
4 Compliance & security: respectful & reducing risk

360° PROFILE AND DATALAKE IN THE CENTRE OF THE ORGANIZATION
Digital advertising, Newsroom, Subscription Marketing
- Increase automated conversion on paywall and newspaper.nl
- Automated CTR optimization
- Audiences
- Visitor behavior for news innovations
- B2C marketing: data foundation for Consumer Intelligence
- B2B sales: data foundation

Data: Online subscriptions marketing
Optimize media channels, creation and platform.
- Channels: Where can we find our consumers, and what is the best way to convince them?
- Creation: How do we best appeal to you as a consumer, and in what format?
- Platform: In what phase is the consumer, and do we convert them to sales or to more engagement?

Content topics (labels from the slide graphic):
television, sport, other, regional news, domestic politics, education, digital & technology, commemoration, nutrition, international, culinary, consuming & spending, literature, construction & real estate, defence, nature & environment, religion & society, health, domestic news, disasters & law enforcement, transport, music, film & performing arts, royalty, weather, BV Nederland, well-known people, higher education & emancipation, football, drama & emotion, art

KONING VOETBAL ("King Football")
Specific interest in football, but also reads about other sports. Regularly checks voetbalcenter for results.
Characteristic per user: value
# visits per month: 5.8
# pages per visit: 6.5
# article pages: 79
% watches videos: 19
% crossdomain, national / regional: 15/20
% logged in: 4.9

Reduce the distance to Google and Facebook
• Strong brands (Volkskrant, Parool, Trouw, AD, Tweakers, Qmusic etc.)
• Link with demographic characteristics through CRM data
• Create audiences based on behaviour
• Growing demand from larger advertisers to be less dependent on Google or Facebook while maintaining results
→ Closing the gap step by step by building data in 2 zones
1. Demographic and behavioral data
2. Intent data
Improve service for advertisers and close the gap with Google and Facebook

● How successful is my story?
● Via which channels do I need to publish?
● Can I improve the header?
● Should we create a follow-up?

RAW layer → Clean layer → Master (datamarts)
Storage: S3
Batch / micro batch
Data catalog raw
● Source
● Owner
● Location
● Frequency
● Description
● Consent
● Delta / full
Data catalog clean
● Consent
● PI data (hashed)
● Frequency
● Lookup tables
● Field description
🕐 Airflow ingestion, monitoring / alerting
👤 Access by owner and data processor
🕐 Airflow transformation, monitoring / alerting
🕐 Airflow / data transformation
👤 Access role based / PI data hashed
Logging user trails
Monitoring performance / costs
Consumers of data (CI/BI/CX/IT/DCC)
● Dataiku
● Databricks
● Redshift (Spectrum)
● Athena
● Looker / Qlik Sense
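A minimal sketch of how PI data could be hashed on the way from the raw to the clean layer. The slide only states that PI data is hashed. The keyed SHA-256 (HMAC), the field names and the key below are assumptions for illustration, not the setup used at De Persgroep.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would come from a secrets store.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical names of fields that hold personal information (PI).
PI_FIELDS = {"email", "user_id"}


def hash_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a single PI value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def to_clean_record(raw: dict) -> dict:
    """Copy a raw record, replacing PI fields with their hashed form."""
    return {
        field: hash_value(str(value)) if field in PI_FIELDS else value
        for field, value in raw.items()
    }


raw_record = {"email": "reader@example.com", "user_id": "u-123", "article": "a-42"}
clean_record = to_clean_record(raw_record)
print(clean_record)
```

Because the hash is deterministic for a given key, the same reader maps to the same hashed value in every dataset, so joins in the clean layer still work without exposing the original value.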

How:
Translating 5 challenges to technical solutions

1 - User-content interactions for analytics and data science
Analytics log level data:
- time1 - user1 - article1
- ....
[Figure: log-level events turned into users × articles matrices, with example rows "1 1" / "0 1" and "0.5 0.22 0.28" / ".01 .01 0.98"]
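As an illustration of the step from log-level data to a users × articles matrix, here is a small sketch. The event tuples and the binary "has read" encoding are assumptions. The slide does not say how the matrices are actually built.

```python
from collections import defaultdict

# Hypothetical log-level events: (timestamp, user, article).
events = [
    ("time1", "user1", "article1"),
    ("time2", "user1", "article2"),
    ("time3", "user2", "article2"),
]


def interaction_matrix(events):
    """Build a binary users x articles matrix from log-level events."""
    users = sorted({user for _, user, _ in events})
    articles = sorted({article for _, _, article in events})
    seen = defaultdict(set)
    for _, user, article in events:
        seen[user].add(article)
    # One row per user, one column per article; 1 means the user read it.
    matrix = [[1 if a in seen[u] else 0 for a in articles] for u in users]
    return users, articles, matrix


users, articles, matrix = interaction_matrix(events)
for user, row in zip(users, matrix):
    print(user, row)
```

A matrix like this is a common starting point for both reporting (who read what) and data science work such as recommendations.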

Problems with known analytics partners
- Throttling/sampling
- Non-realtime (event level)
- 3rd party tracking
- Non-transparent
- Privacy control
- Vendor lock-in

Open source tracking
Trackers:
- Android
- Go
- .NET
- iOS
- Java
- JavaScript
- NodeJS
- Python
- Scala
- [many more]
Characteristics:
- Infrastructure as a Service on AWS
- Open source
- Flexible/configurable
- Realtime
- 1st party

2 - Easy testing playground
Snowplow Collector → Raw topic → Snowplow Enricher → Enriched topic / Corrupt topic

3 - Data quality
Challenges:
- Variety of brands
- Variety of platforms
- Variety of development teams
Solutions:
- Enforcing schema verification => corrupt events topic
- Tag manager templating
- Monitoring of tags and anomalies
- Automated quality assurance for new releases
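The idea of enforcing schema verification and sending failures to a corrupt events topic can be sketched as follows. This is a simplified stand-in: the schema format, field names and sample events are hypothetical, and plain lists take the place of the topics.

```python
# Hypothetical schema: required field name -> expected Python type.
PAGEVIEW_SCHEMA = {"event": str, "user_id": str, "url": str, "timestamp": int}


def validate(event: dict, schema: dict) -> list:
    """Return a list of schema violations; an empty list means the event is valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors


def route(events, schema):
    """Split events into an enriched list and a corrupt list with error details."""
    enriched, corrupt = [], []
    for event in events:
        errors = validate(event, schema)
        if errors:
            corrupt.append({"event": event, "errors": errors})
        else:
            enriched.append(event)
    return enriched, corrupt


incoming = [
    {"event": "pageview", "user_id": "u1", "url": "/sport", "timestamp": 1550000000},
    {"event": "pageview", "user_id": "u2", "timestamp": "yesterday"},
]
enriched, corrupt = route(incoming, PAGEVIEW_SCHEMA)
print(len(enriched), corrupt[0]["errors"])
```

Keeping the rejected event together with its error list makes the corrupt topic useful for the development teams: they can see which tag or release produced the bad events and why.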

4 - High event volumes
Clicks, pageviews, player heartbeats
5 B/month
Processing
- Transform
- Filter
- Parse
- Window
- Aggregate
Integrate with
- Business Intelligence tools
- Data Science tools
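The processing steps listed above (parse, filter, window, aggregate) can be illustrated with a small plain-Python sketch. It is not the Spark job from the talk. The JSON event layout, the field names and the 60-second window are assumptions.

```python
import json
from collections import Counter

# Hypothetical raw log lines: one JSON-encoded event per line.
raw_lines = [
    '{"ts": 0, "type": "pageview", "brand": "AD"}',
    '{"ts": 30, "type": "heartbeat", "brand": "AD"}',
    '{"ts": 59, "type": "pageview", "brand": "Trouw"}',
    '{"ts": 61, "type": "pageview", "brand": "AD"}',
    'not json',
]

WINDOW_SECONDS = 60  # assumed window size


def parse(lines):
    """Parse: decode JSON lines, skipping lines that cannot be decoded."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def pageviews_per_window(lines):
    """Filter pageviews, assign each to a window and count per (window, brand)."""
    counts = Counter()
    for event in parse(lines):
        if event["type"] != "pageview":  # Filter
            continue
        window_start = event["ts"] // WINDOW_SECONDS * WINDOW_SECONDS  # Window
        counts[(window_start, event["brand"])] += 1  # Aggregate
    return dict(counts)


result = pageviews_per_window(raw_lines)
print(result)
```

The aggregated output is far smaller than the raw event stream, which is what makes it practical to hand over to business intelligence and data science tools.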

4 - High event volumes
Solutions:
- Snowplow => heavy lifting of collecting
- Start/terminate EMR clusters on AWS from Airflow when needed
- Spark for cleaning and aggregating
- Mirror S3 (partially) to Redshift for fast querying and BI tooling

5 - Realtime scalability
Night/day pattern
- almost no night time traffic
Breaking news/developing stories
- double / quadruple daily volume
Push notifications
- peaks up to 16K events per second
How to aggregate?

5 - Realtime scalability
Challenges
- Fluctuating traffic
- Stateful streaming
Considerations
- Latency - How fast is fast enough?
- Spark Streaming is still mini-batch
Solutions
- Dockerize applications
- Orchestrate with Kubernetes
- Container I/O to Kafka
- Redis
- ElasticSearch
- Flink
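To make "stateful streaming" concrete, here is a minimal sketch of a tumbling-window counter that keeps state between events and emits a window once it has closed. It is not Flink or Spark code. The in-memory dict stands in for an external state store such as Redis, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Count events per key in fixed windows, emitting a window once it has closed.

    The in-memory dict is a stand-in for external state (for example Redis).
    """

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self.state = defaultdict(int)  # (window_start, key) -> count

    def add(self, timestamp: int, key: str) -> list:
        """Register one event; return results of windows that are now closed."""
        window_start = timestamp // self.window_seconds * self.window_seconds
        self.state[(window_start, key)] += 1
        return self._flush(before=window_start)

    def _flush(self, before: int) -> list:
        """Remove and return all state entries whose window started before `before`."""
        closed = sorted(k for k in self.state if k[0] < before)
        return [(start, key, self.state.pop((start, key))) for start, key in closed]


counter = TumblingWindowCounter(window_seconds=10)
emitted = []
for ts, article in [(1, "a1"), (4, "a1"), (9, "a2"), (12, "a1"), (25, "a2")]:
    emitted.extend(counter.add(ts, article))
print(emitted)
```

The window size is the latency trade-off raised under "Considerations": a smaller window gives fresher numbers but more state updates per second during traffic peaks.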

‘Data isn’t magic,
it’s what you do with it that counts’
(Mary Hamilton, The Guardian)
