
Yael Elmatad, Senior Data Scientist, Tapad at MLconf NYC - 4/15/16

Beyond the Classifier, Inspiration from Engineering Algorithms: Many data scientists work within the realm of Machine Learning and their problems are often addressable with techniques such as classifiers and recommendation engines. At Tapad, we have often had to look outside that standard toolkit to find inspiration from more traditional engineering algorithms. This has included solving our Device Graph’s connected component problem at scale as well as maintaining our Device Graph’s time-consistency in our cluster identification week over week.

  1. Beyond the Classifier, Inspiration from Engineering Algorithms. Yael Elmatad, Data Scientist at Tapad, @y_s_e. ML Conf NYC, April 15, 2016.
  2. Introduction to Tapad: Tapad is a marketing technology company that seeks to bridge the gap between users’ various screens.
  3. Tapad’s Solution: The Device Graph™
  4. Modeling Identity is Hard: 1. Identifier persistence and accuracy 2. Conflicting data 3. Grouping keys / Transitive properties 4. User Privacy and Data Governance 5. Use case flexibility
  5. Modeling Identity is Hard: 1. Identifier persistence and accuracy 2. Conflicting data 3. Grouping keys / Transitive properties 4. User Privacy and Data Governance 5. Use case flexibility
  6. Focus: Identifier Persistence & Groupings. Grouping keys: How can we effectively, at scale, determine groups of identifiers? Identifier persistence: How can we make sure that these identifiers are persistent in time? Spoiler alert: No classifiers, recommender systems, or community detection in sight.
  7. Grouping: Connected Components. ● Over 1.4 billion devices in each weekly Device Graph ● There are 6.6 billion connections between these devices. Question: How do we determine connected components at scale? Previous attempts: Various graph-based databases and solutions (Giraph, GraphX, Cassovary); we were not able to identify clusters at scale. Current solution: Runs in logarithmic rounds.
  8. Connected Component Basics: Label Prop. Initializing: assign self as cluster label. Nodes: A B C D. Cluster label (temp): A B C D. Iterations: ask neighbors for their current labels, take the min of neighbors and self. Nodes: A B C D. Cluster label after one iteration: A A B C. Stop iterating when no labels change from the previous iteration.
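The label propagation described on slide 8 can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Tapad's implementation; the function name `label_propagation` and the edge-list input format are assumptions made for the example.

```python
def label_propagation(edges):
    """Min-label propagation on an undirected graph given as (u, v) pairs.

    Returns (labels, rounds), where rounds counts the iterations in which
    at least one label changed. Hypothetical helper for illustration only.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Initialization: every node is its own cluster label.
    labels = {v: v for v in neighbors}
    rounds = 0
    while True:
        # Each node takes the min of its own label and its neighbors' labels.
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in neighbors[v]])
            for v in neighbors
        }
        if new_labels == labels:  # stop when nothing changed
            return labels, rounds
        labels = new_labels
        rounds += 1


# The four-node chain A - B - C - D from the slide.
chain_labels, chain_rounds = label_propagation([("A", "B"), ("B", "C"), ("C", "D")])
```

On the chain, the label A has to travel one hop per iteration, which is why the number of rounds grows with the cluster diameter.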
  9. Need a More Efficient Solution: Hash-to-Min. Standard message passing is O(d), where d = cluster diameter. Reference: arXiv.org > cs > arXiv:1203.5387v2
  10. Hash-to-Min: Initialization. For node v, assign the minimum of v and its neighbors as cluster label, and a cluster C(v), which is the set of v plus v’s neighbors. Nodes: A B C D E. Cluster labels: A A B C C. C(v): A (A,B); B (A,B,C); C (B,C,D,E); D (C,D); E (C,E).
  11. Hash-to-Min: Round 1. For each C(v), vmin = minimal member of C(v). Broadcast C(v) to vmin and broadcast vmin to all other members of C(v). Each node v then merges all the C(v) + vmin it receives. State before the round: nodes A B C D E; cluster labels A A B C C; C(v): A (A,B); B (A,B,C); C (B,C,D,E); D (C,D); E (C,E).
  12. Hash-to-Min: Round 1. For each C(v), vmin = minimal member of C(v). Broadcast C(v) to vmin and broadcast vmin to all other members of C(v). Each node v then merges all the C(v) + vmin it receives. State after the round: nodes A B C D E; cluster labels A A A B B; C(v): A (A,B,C); B (A,B,C,D,E); C (A,C,D,E); D (B); E (B).
  13. Hash-to-Min: Round 2 + Completion. Nodes A B C D E; cluster labels A A A A A; C(v): A (A,B,C,D,E); B (A,B); C (A); D (A); E (A). Iterations cease when no updates are made to the C(v)’s. Completes in O(log(d)) rounds, where d = cluster diameter.
  14. Hash-to-Min: Round 2 + Completion. Iterations cease when no updates are made to the C(v)’s. Completes in O(log(d)) rounds, where d = cluster diameter. Nodes A B C D E; cluster labels A A A A A; C(v): A (A,B,C,D,E); B (A); C (A); D (A); E (A).
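The Hash-to-Min rounds on slides 10 to 14 can be reproduced with a small in-memory sketch. It follows the wording on the slides (a node sends C(v) to vmin and sends vmin to the other members of C(v), not to itself); the formulation in the arXiv paper may differ in detail, and a production version would run as distributed MapReduce rounds rather than a Python loop. The function name `hash_to_min` and the `history` return value are assumptions made for the example.

```python
def hash_to_min(edges):
    """Hash-to-Min connected components, as described on the slides.

    Returns (labels, history): the final label per node and the C(v) table
    after each round (history[0] is the initialization). Illustrative only.
    """
    # Initialization: C(v) = v plus v's neighbors.
    clusters = {}
    for u, v in edges:
        clusters.setdefault(u, {u}).add(v)
        clusters.setdefault(v, {v}).add(u)

    history = [clusters]
    while True:
        inbox = {v: set() for v in clusters}
        for v, c in clusters.items():
            v_min = min(c)
            inbox[v_min] |= c             # send C(v) to vmin
            for u in c:
                if u != v:
                    inbox[u].add(v_min)   # send vmin to the other members
        # Assumption: a node that received nothing keeps its old cluster.
        merged = {v: inbox[v] or clusters[v] for v in clusters}
        if merged == clusters:            # no updates: stop
            break
        clusters = merged
        history.append(clusters)

    labels = {v: min(c) for v, c in clusters.items()}
    return labels, history


# The five-node example from the slides: A - B - C, with D and E attached to C.
labels, history = hash_to_min([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
```

For this graph, `history[1]` and `history[2]` match the C(v) tables shown for Round 1 and Round 2, and every node ends with label A.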
  15. Once we have CC, how do we label them? First labeling scheme: labeled by lowest device ID participating in the cluster. Example: the cluster of A B C D E is labeled A. Only 78% of devices maintain their label after 1 week.
  16. Why 22% Change? ID Expiration & Creation. Two cases (diagrams): label device expires; new lowest ID created.
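The instability of lowest-ID labels on slides 15 and 16 is easy to see in a toy example. The sketch below assumes a cluster of devices B, C, D; the helper name `min_id_labels` and the specific clusters are made up for illustration.

```python
def min_id_labels(clusters):
    """Label every device with the lowest device ID in its cluster."""
    return {device: min(cluster) for cluster in clusters for device in cluster}


# Week 1: one cluster, labeled by its lowest ID, B.
week1 = min_id_labels([{"B", "C", "D"}])

# Case 1: the label device B expires, so the survivors are relabeled C.
after_expiry = min_id_labels([{"C", "D"}])

# Case 2: a new device with a lower ID, A, joins, so everyone is relabeled A.
after_creation = min_id_labels([{"A", "B", "C", "D"}])
```

In both cases the cluster's membership barely changes, yet every surviving device gets a new label.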
  17. Why 22% Change? Splits and Merges. Two cases (diagrams): cluster splits; clusters merge.
  18. Only a small fraction are of Merge/Split variety. Type of change (percent): Device Expiration & Creation, > 75%; Cluster Merges & Splits, < 25%.
  19. Solution? Map onto Stable-Marriage Problem. Definition of “Stable Marriage”: Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable. (Wikipedia definition)
  20. Stable-Marriage (By Negation). Want to pair triangles to circles. Unstable match: prefer each other. A stable solution is defined as the lack of these instabilities. The Gale-Shapley algorithm is a method for finding stable solutions.
  21. Gale-Shapley Algorithm. Diagram nodes: a, b, c, δ, ɣ, β. (Psst… it won the Nobel Prize in Economics in 2012)
  22. Gale-Shapley Pre-Iteration (GS0): Rankings. Rank: (β,ɣ,δ); Rank: (β,δ,ɣ); Rank: (δ,ɣ,β); Rank: (c,b,a); Rank: (b,a,c); Rank: (c,a,b). Diagram nodes: a, b, c, δ, ɣ, β.
  23. GS1: Circles “Propose” to Triangles. Rank: (β,ɣ,δ); Rank: (β,δ,ɣ); Rank: (δ,ɣ,β); Rank: (c,b,a); Rank: (b,a,c); Rank: (c,a,b). Diagram nodes: a, b, c, δ, ɣ, β.
  24. GS1: Triangles tentatively accept best proposal. Rank: (β,ɣ,δ); Rank: (β,δ,ɣ); Rank: (δ,ɣ,β); Rank: (c,b,a); Rank: (b,a,c); Rank: (c,a,b). Diagram nodes: a, b, c, δ, ɣ, β.
  25. GS2: Unengaged circles try again. Rank: (β,ɣ,δ); Rank: (β,δ,ɣ); Rank: (δ,ɣ,β); Rank: (c,b,a); Rank: (b,a,c); Rank: (c,a,b). Diagram nodes: a, b, c, δ, ɣ, β.
  26. GS2: Triangles again tentatively accept best offer. Rank: (β,ɣ,δ); Rank: (β,δ,ɣ); Rank: (δ,ɣ,β); Rank: (c,b,a); Rank: (b,a,c); Rank: (c,a,b). Diagram nodes: a, b, c, δ, ɣ, β.
  27. GS3: Iterations terminate when all triangles/circles are paired. Rank: (β,ɣ,δ); Rank: (β,δ,ɣ); Rank: (δ,ɣ,β); Rank: (c,b,a); Rank: (b,a,c); Rank: (c,a,b). Diagram nodes: a, b, c, δ, ɣ, β.
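The Gale-Shapley walkthrough on slides 22 to 27 can be sketched as follows. The transcript does not say which ranking belongs to which node, or which side is circles, so the assignment below is an assumption: a, b, c are treated as the proposing circles, and `beta`, `gamma`, `delta` stand in for the triangles β, ɣ, δ. The function name `gale_shapley` is also just a name chosen for this example.

```python
from collections import deque


def gale_shapley(proposer_prefs, acceptor_prefs):
    """Textbook Gale-Shapley for equal-sized sides with complete rankings.

    Returns a dict mapping each proposer to its matched acceptor.
    """
    # rank[a][p] = position of proposer p in acceptor a's list (lower is better).
    rank = {a: {p: i for i, p in enumerate(prefs)}
            for a, prefs in acceptor_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next index each proposer tries
    engaged = {}                                  # acceptor -> proposer
    free = deque(proposer_prefs)

    while free:
        p = free.popleft()
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(a)
        if current is None:
            engaged[a] = p                # acceptor was free: tentatively accept
        elif rank[a][p] < rank[a][current]:
            engaged[a] = p                # acceptor trades up
            free.append(current)
        else:
            free.append(p)                # rejected: p will try its next choice

    return {p: a for a, p in engaged.items()}


# Assumed assignment of the rankings shown on the slides.
circles = {
    "a": ["beta", "gamma", "delta"],
    "b": ["beta", "delta", "gamma"],
    "c": ["delta", "gamma", "beta"],
}
triangles = {
    "beta": ["c", "b", "a"],
    "gamma": ["b", "a", "c"],
    "delta": ["c", "a", "b"],
}
matching = gale_shapley(circles, triangles)
```

Under this assumed assignment, a and b both propose to `beta` first, `beta` keeps b, and a is accepted by `gamma` on its second try, which mirrors the propose, accept, retry sequence in the slide titles.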
  28. How do we use it at Tapad? Considerations: ● How do you rank best labels for your cluster? ● Need to be able to run at scale for 100 million label pairs. ● Needs to run in a distributed fashion (MapReduce). ● Needs to be able to handle ties. ● Need to handle label expiry and new label creation.
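The slides do not say how labels are ranked for a cluster, so the following is purely a hypothetical illustration of the first consideration on slide 28: rank candidate labels for each new cluster by how many of its devices carried that label last week, and rank clusters for each label the same way. The function name `overlap_rankings`, the input shapes, and the alphabetical tie-break are all assumptions.

```python
from collections import Counter


def overlap_rankings(old_labels, new_clusters):
    """Build preference lists from device overlap (an assumed ranking rule).

    old_labels:   device -> last week's cluster label
    new_clusters: this week's cluster id -> set of devices
    Returns (cluster_prefs, label_prefs), each sorted by overlap, largest first.
    """
    overlap = Counter()
    for cluster_id, devices in new_clusters.items():
        for device in devices:
            if device in old_labels:  # brand-new devices carry no old label
                overlap[(cluster_id, old_labels[device])] += 1

    cluster_prefs, label_prefs = {}, {}
    for (cluster_id, label), count in overlap.items():
        cluster_prefs.setdefault(cluster_id, []).append((count, label))
        label_prefs.setdefault(label, []).append((count, cluster_id))

    def ranked(pairs):
        # Largest overlap first; ties broken by name so the output is deterministic.
        return [name for count, name in sorted(pairs, key=lambda t: (-t[0], t[1]))]

    return ({c: ranked(p) for c, p in cluster_prefs.items()},
            {l: ranked(p) for l, p in label_prefs.items()})


old = {"d1": "L1", "d2": "L1", "d3": "L1", "d4": "L2"}
new = {"X": {"d1", "d2", "d4"}, "Y": {"d3", "d5"}}
cluster_prefs, label_prefs = overlap_rankings(old, new)
```

Preference lists built this way can be incomplete (cluster Y never saw label L2), so a matching step on top of them would need to handle unmatched labels and clusters, in line with the expiry and creation considerations listed on the slide.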
  29. Results & Cluster Stability. Metric: the % of devices that maintain their cluster label after x weeks. Min ID based: 78% after 1 week, 33% after 8 weeks. Gale-Shapley based: 98% after 1 week, 87% after 8 weeks.
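The stability metric on slide 29 can be computed with a short helper. The slide does not define the denominator, so this sketch assumes it is the set of devices present in both snapshots; the function name `label_retention` and the toy data are made up for illustration.

```python
def label_retention(before, after):
    """Fraction of devices whose cluster label is unchanged between snapshots.

    before, after: device -> cluster label. Only devices present in both
    snapshots are counted (an assumption; the slide gives no denominator).
    """
    shared = before.keys() & after.keys()
    if not shared:
        return 0.0
    kept = sum(1 for device in shared if before[device] == after[device])
    return kept / len(shared)


week0 = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
week1 = {"d1": "A", "d2": "A", "d3": "C", "d5": "C"}  # d4 expired, d5 is new
retention = label_retention(week0, week1)
```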
  30. Conclusion. Many challenges that get thrown at data scientists can potentially be solved by deterministic engineering algorithms. Being familiar with these algorithms prevents data scientists from reinventing the wheel. Once you start using these algorithms, you start seeing use cases for them everywhere (we use connected components in no fewer than 3 parts of our graph building process).
  31. Thank you! Thanks to the Data Science/Engineering teams at Tapad. Read our blog: http://engineering.tapad.com Careers: http://www.tapad.com/about-us/careers/openings (Data Science & Engineering!) Follow us on Twitter: @tapad, @tapadeng Contact me: yael@tapad.com, @y_s_e
