The proliferation of digital channels has made it essential for marketers to understand an individual across multiple touchpoints. To market effectively, marketers need a good sense of a consumer’s identity so that they can reach that consumer on a mobile device, a desktop, or a big TV screen in the living room. Identity is carried by tokens such as cookies and app IDs. A consumer can use multiple devices at the same time, so the same consumer should not be treated as different people in the advertising space. Identity resolution addresses this problem: its goal is an omnichannel view of the consumer.
Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake
1. “Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake”
Bikash Singh
Sneha Chokshi
Data + AI Summit Europe 2020
2. We’re MiQ.
A leading programmatic media partner.
For 10 years we have partnered with agencies and marketers to deliver business-changing results through better connected marketing.
3. The Agenda
Looking at:
1 What: Identity Graph in Programmatic Advertising
2 How: did we build it?
3 Business Impact
5. Programmatic Media Advertising
The use of software to buy digital ad space in real time, connecting an advertiser to a specific consumer.
1 A user visits a webpage and an ad slot (impression) is generated.
2 The website publisher communicates with the ad marketplace to put the impression up for auction.
3 A real-time auction is held among advertisers competing for that impression.
4 The advertiser with the highest bid wins.
5 The ad is delivered to the user on the webpage.
All of this happens in less than 0.1 sec.
40+ bn ad spaces traded daily
20 TB of data generated daily
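The auction step in this flow can be sketched in a few lines of Python. This is only an illustration of "the highest bid wins"; the `Bid` type, the advertiser names, and the prices are hypothetical, and the sketch does not model how a real exchange prices or times the auction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bid:
    advertiser: str   # hypothetical advertiser name
    price_cpm: float  # hypothetical bid price for the impression


def run_auction(bids: list[Bid]) -> Optional[Bid]:
    """Return the highest bid for one impression, or None if nobody bids."""
    if not bids:
        return None
    return max(bids, key=lambda b: b.price_cpm)


# One impression, three competing advertisers (made-up values).
bids = [Bid("brand_a", 2.10), Bid("brand_b", 3.45), Bid("brand_c", 1.80)]
winner = run_auction(bids)
print(winner.advertiser)
```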
6. The future of supply is programmatic.
The future of demand is tangible outcomes.
7. Data is
• Purchase
• Online behavior
• Media exposure data
• Offline data & moments
Attention is
• Different screens
• Multiple formats
• At home
• At work
• On the move
Functions are
• Planning
• Buying
• Measurement
Supply is
• Google
• Amazon
• AT&T, Verizon…
8. Better connected marketing means...
● Connecting diverse datasets to solve specific problems.
● Building continuity from one screen to the next.
● Making sure every team is working towards the same business goal.
● Using AI to access open and closed supply environments efficiently.
These are the challenges we exist to solve
12. How do we connect these digital identities in Programmatic Media Buying?
13. Identity Resolution Provider
[Diagram labels: Cookie Id (Websites), Device Id, IP Address, Individual ID, XANDER]
We receive data from third-party providers such as TAPAD that maps digital identities (such as mobile IDs and cookie IDs) to a unique Individual ID and to activation channels.
14. Connecting TV viewers with Mobile Data
[Diagram labels: Mobile ID, TV ID, IP Address, Individual ID, XANDER]
TV viewing data is matched on IP address to get an Individual ID.
That Individual ID is then matched against the Individual IDs in other datasets to find mobile IDs, if any.
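The two-hop match described on this slide (TV ID to IP address to Individual ID, then Individual ID to mobile IDs) can be sketched with plain Python lookups. All records, field names, and the `link_tv_to_mobile` helper are hypothetical; at the scale described in this talk the same logic would be expressed as joins rather than dictionaries.

```python
from collections import defaultdict

# Hypothetical sample records; every value here is made up.
tv_views = [("tv-1", "10.0.0.5"), ("tv-2", "10.0.0.9")]          # (tv_id, ip_address)
ip_to_individual = {"10.0.0.5": "ind-A", "10.0.0.9": "ind-B"}    # IP -> Individual ID
individual_to_mobile = [("ind-A", "mob-1"), ("ind-A", "mob-2")]  # (individual_id, mobile_id)


def link_tv_to_mobile(tv_views, ip_to_individual, individual_to_mobile):
    """Map each TV ID to the mobile IDs that share its Individual ID."""
    mobiles = defaultdict(list)
    for individual_id, mobile_id in individual_to_mobile:
        mobiles[individual_id].append(mobile_id)

    links = {}
    for tv_id, ip in tv_views:
        individual_id = ip_to_individual.get(ip)
        if individual_id is None:
            continue  # no Individual ID known for this IP address
        # "if any": an individual may have no mobile IDs at all
        links[tv_id] = mobiles.get(individual_id, [])
    return links


print(link_tv_to_mobile(tv_views, ip_to_individual, individual_to_mobile))
```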
15. Cross channel linkages
[Diagram labels: Mobile ID, TV ID, IP Address, Cookie Id (Websites), Individual ID, XANDER]
There can be multiple matches of Individual IDs:
● One might match with a Mobile ID
● Another might match with a Cookie/Website Id
19. Connecting Data Sets = Identity Graph
[Diagram: Location Data, TV Data, Cross Device Data, and Clickstream Data linked through Device Ids, Individual Ids, and Cookie/Device Ids]
❖ Joining datasets to derive the final Identity Graph
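One common way to turn pairwise links between identifiers into per-person clusters is a connected-components pass. The sketch below uses union-find for that purpose. It is an illustration only: the talk describes deriving the graph by joining datasets and does not say this algorithm was used, and the identifier strings are hypothetical.

```python
from collections import defaultdict


def build_identity_graph(links):
    """Group identifiers that are connected by pairwise links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in links:
        union(a, b)

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Hypothetical links, one per matched pair of identifiers.
links = [
    ("cookie:c1", "individual:A"),
    ("device:d1", "individual:A"),
    ("tv:t1", "individual:A"),
    ("cookie:c2", "individual:B"),
]
for cluster in build_identity_graph(links):
    print(sorted(cluster))
```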
20. Heterogeneous Data Sets (Batch + Streaming)
Dataset             Size     Identifier
Location Data       400 GB   Device Ids
TV Data             500 GB   TV ID
Cross Device Data   450 GB   Individual Ids
Clickstream Data    550 GB   Cookie/Device Ids
21. Challenges Ahead
❖ No uniform data format
➢ Processing raw data takes longer
❖ Data lake is not GDPR compliant
➢ All identities are PII, so the data lake needs to be GDPR compliant
❖ Too many small files generated from website clickstream data
❖ High processing infrastructure cost
➢ Weekly refresh rates
22. Enriching Data Lake
[Diagram: Location Data, TV Data, Cross Device Data, and Impression Data pass through the MD5 Hashing Algorithm for GDPR, turning Digital Identities into Hashed Digital Identities]
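As a minimal sketch of the hashing step, Python's standard `hashlib` can produce an MD5 digest of an identifier. The `hash_identifier` helper and the sample device ID are hypothetical, and the trim-and-lowercase normalisation is an assumption added here so that the same identifier always hashes to the same value; the slides do not say how identifiers were normalised.

```python
import hashlib


def hash_identifier(raw_id: str) -> str:
    """Return the MD5 hex digest of an identifier."""
    # Assumption: trim and lowercase first so equal IDs hash identically.
    normalized = raw_id.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical device identifier.
hashed = hash_identifier("AEBE52E7-03EE-455A-B3C4-E57283966239")
print(len(hashed))  # an MD5 hex digest is always 32 characters
```

Because the digest is deterministic, hashed identifiers from different datasets can still be joined on equality without exposing the raw values.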
24. Spark 3.0 + AQE
Adaptive Query Execution (AQE) is query re-optimization that occurs during query execution based on
runtime statistics. AQE in Spark 3.0 includes 3 main features:
● Dynamically coalescing shuffle partitions
● Dynamically switching join strategies
● Dynamically optimizing skew joins
set spark.sql.adaptive.enabled = true;
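The same setting can also be applied from application code. The sketch below collects AQE-related configuration keys in a dictionary and applies them through `spark.conf.set`. The two additional keys are Spark 3.0 settings for partition coalescing and skew-join handling; the `apply_aqe_settings` helper is hypothetical, and whether the authors set anything beyond `spark.sql.adaptive.enabled` is not stated in the slides.

```python
# AQE-related Spark SQL settings (string values, as Spark stores them).
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    # Assumption: set explicitly here for clarity.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe_settings(spark):
    """Apply the settings to an existing SparkSession (or any object
    exposing the same `conf.set(key, value)` interface)."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
```

With a live session this would be called as `apply_aqe_settings(spark)` before running the joins.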
25. Migrating to Delta
DataFrame count operations became more efficient after moving the streaming datasets to Delta, saving 3-4 min per operation.
26. Delta Z-Ordering
OPTIMIZE '/mnt/dwh-reports-data/bikash/delta/unacast_feed/' ZORDER BY (identifier);
OPTIMIZE '/mnt/dwh-reports-data/bikash/delta/gracenote_feed/' ZORDER BY (userid);
OPTIMIZE '/mnt/dwh-reports-data/bikash/delta/tapad_feed/' ZORDER BY (individual_id);
OPTIMIZE '/mnt/dwh-reports-data/bikash/delta/standard_feed/' ZORDER BY (user_id_64);
Join processing runtime dropped from 40 min to ~13 min, saving time and resources.
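A toy model can show why clustering a table by its join key helps. It assumes the engine keeps a minimum and maximum key value per file and skips any file whose range cannot contain the key being looked up. The sketch is not Delta's implementation; the file size, the identifiers, and both helper functions are hypothetical.

```python
def write_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max per file."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return files


def files_to_scan(files, key):
    """Count files whose [min, max] range could contain the key."""
    return sum(1 for f in files if f["min"] <= key <= f["max"])


# Hypothetical identifiers 0..99 arriving in a scattered order.
ids = [(i * 37) % 100 for i in range(100)]

unclustered = write_files(ids, rows_per_file=10)
clustered = write_files(sorted(ids), rows_per_file=10)

print("files scanned without clustering:", files_to_scan(unclustered, 42))
print("files scanned with clustering:", files_to_scan(clustered, 42))
```

When the data is clustered by key, each file covers a narrow range, so a lookup or join on that key touches far fewer files.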
27. Refreshing Data : Delta write concurrency
We needed to refresh the old graph with new data in the same location. With plain Parquet, reading from a location and writing back to it in overwrite mode is not possible.
Moving to Delta made this straightforward thanks to its ACID guarantees.
The data is refreshed weekly to the same location.
28. Key Impact
1 Reduced processing time to 30 mins from 3 hours.
2 Increased scale and opportunity for cross-channel & cross-platform activation
3 80% dip in processing cost
29. With Identity Graph we are able to...
● Connect TV data with digital data
● Cross Device Tracking and activation capabilities
● Activate users on their DOOH - in-store based on their TV viewing data
● Lifetime value of digital identities
● Cross channel activation capabilities - online to offline
● Measurement and uplift of a brand online as well as offline
● Brand uplift measurement studies
● Higher accuracy in conversion attribution measurement
● User Journey and Sales impact Measurement
30. Questions?
MIQ BLOG : https://www.wearemiq.com/blog/
Medium BLOG : https://medium.com/miq-tech-and-analytics