Published in:
Technology


- 1. Beyond the Classifier, Inspiration from Engineering Algorithms. Yael Elmatad, Data Scientist at Tapad (@y_s_e). ML Conf NYC, April 15, 2016.
- 2. Introduction to Tapad: Tapad is a marketing technology company that seeks to bridge the gap between users’ various screens.
- 3. Tapad’s Solution: The Device Graph™
- 4. Modeling Identity is Hard: (1) Identifier persistence and accuracy; (2) Conflicting data; (3) Grouping keys / Transitive properties; (4) User Privacy and Data Governance; (5) Use case flexibility.
- 6. Focus: Identifier Persistence & Groupings. Grouping keys: How can we effectively, at scale, determine groups of identifiers? Identifier persistence: How can we make sure that these identifiers are persistent in time? Spoiler alert: No classifiers, recommender systems, or community detection in sight.
- 7. Grouping: Connected Components
  - Over 1.4 billion devices in each weekly Device Graph.
  - There are 6.6 billion connections between these devices.
  - Question: How do we determine connected components at scale?
  - Previous attempts: Various graph-based databases and solutions (Giraph, GraphX, Cassovary). We were not able to identify clusters at scale with them.
  - Current solution: Runs in a logarithmic number of rounds.
- 8. Connected Component Basics: Label Prop. Initialization: each node assigns itself as its temporary cluster label, so nodes A, B, C, D start with labels A, B, C, D. Iterations: each node asks its neighbors for their current labels and takes the minimum of its neighbors' labels and its own; after the first iteration the labels are A, A, B, C. Stop iterating when no label changes relative to the previous iteration.
- 9. Need a More Efficient Solution: Hash-to-Min. Standard message passing takes O(d) rounds, where d = cluster diameter. Reference: arXiv:1203.5387v2 (cs).
- 10. Hash-to-Min: Initialization. For node v, assign the minimum of v and its neighbors as the cluster label, and a cluster C(v), which is the set of v plus v’s neighbors. For the example nodes A, B, C, D, E: C(A) = (A,B), C(B) = (A,B,C), C(C) = (B,C,D,E), C(D) = (C,D), C(E) = (C,E). Labels: A, A, B, C, C.
- 11. Hash-to-Min: Round 1. For each C(v), let vmin be the minimal member of C(v). Broadcast C(v) to vmin and broadcast vmin to all other members of C(v). Each node v then merges all the C(v) sets and vmin values it receives. State before the round: C(A) = (A,B), C(B) = (A,B,C), C(C) = (B,C,D,E), C(D) = (C,D), C(E) = (C,E). Labels: A, A, B, C, C.
- 12. Hash-to-Min: Round 1 (same rule as slide 11). State after the round: C(A) = (A,B,C), C(B) = (A,B,C,D,E), C(C) = (A,C,D,E), C(D) = (B), C(E) = (B). Labels: A, A, A, B, B.
- 13. Hash-to-Min: Round 2 + Completion. State after round 2: C(A) = (A,B,C,D,E), C(B) = (A,B), C(C) = (A), C(D) = (A), C(E) = (A). Labels: A, A, A, A, A. Iterations cease when no updates are made to any C(v). Completes in O(log(d)) rounds, where d = cluster diameter.
- 14. Hash-to-Min: Round 2 + Completion. Final state: C(A) = (A,B,C,D,E), C(B) = (A), C(C) = (A), C(D) = (A), C(E) = (A). Labels: A, A, A, A, A. Iterations cease when no updates are made to any C(v). Completes in O(log(d)) rounds, where d = cluster diameter.
- 15. Once we have connected components (CC), how do we label them? First labeling scheme: each cluster is labeled by the lowest device ID participating in it. Example: a cluster of devices A, B, C, D, E gets label A. Only 78% of devices maintain their label after 1 week.
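- 16. Why 22% Change? ID Expiration & Creation. (Diagram, two panels. “Label Device Expires”: a cluster B, C, D labeled B loses device B and is relabeled C. “New Lowest ID Created”: a new device A joins the cluster B, C, D and the label changes from B to A.)
- 17. Why 22% Change? Splits and Merges. (Diagram, two panels: “Cluster Splits”, where a cluster labeled A breaks apart and one part takes a new label, and “Clusters Merge”, where two clusters join under the lower label.)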
- 18. Only a small fraction are of the Merge/Split variety. Percent of label changes by type:
  - Device Expiration & Creation: > 75%
  - Cluster Merges & Splits: < 25%
- 19. Solution? Map onto the Stable-Marriage Problem. Definition of “Stable Marriage”: Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable. (Wikipedia definition)
- 20. Stable-Marriage (By Negation). Want to pair triangles to circles. Unstable match: a triangle and a circle that are not paired with each other both prefer each other to their current partners. A stable solution is defined as the lack of these instabilities. The Gale-Shapley algorithm is a method for finding stable solutions.
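- 21. Gale-Shapley Algorithm. Example nodes: a, b, c and δ, γ, β. (Psst… work on stable matching, including this algorithm, earned Lloyd Shapley and Alvin Roth the Nobel Prize in Economics in 2012.)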
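- 22. Gale-Shapley Pre-Iteration (GS0): Rankings. Nodes a, b, c rank the nodes δ, γ, β, and vice versa. The slide lists six rankings: (β,γ,δ), (β,δ,γ), (δ,γ,β) over the Greek-lettered nodes and (c,b,a), (b,a,c), (c,a,b) over the Latin-lettered nodes. Which node holds which ranking is shown only in the diagram.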
- 23. GS1: Circles “propose” to triangles. (Diagram; rankings as on slide 22.)
- 24. GS1: Triangles tentatively accept the best proposal. (Diagram; rankings as on slide 22.)
- 25. GS2: Unengaged circles try again. (Diagram; rankings as on slide 22.)
- 26. GS2: Triangles again tentatively accept the best offer. (Diagram; rankings as on slide 22.)
- 27. GS3: Iterations terminate when all triangles and circles are paired. (Diagram; rankings as on slide 22.)
- 28. How do we use it at Tapad? Considerations:
  - How do you rank the best labels for your cluster?
  - Needs to be able to run at scale for 100 million label pairs.
  - Needs to run in a distributed fashion (MapReduce).
  - Needs to be able to handle ties.
  - Needs to handle label expiry and new label creation.
- 29. Results & Cluster Stability. Metric: the % of devices that maintain their cluster label after x weeks.
  - 1 week: 78% (Min ID based) vs. 98% (Gale-Shapley based)
  - 8 weeks: 33% (Min ID based) vs. 87% (Gale-Shapley based)
- 30. Conclusion. Many challenges that get thrown at data scientists can potentially be solved by deterministic engineering algorithms. Being familiar with these algorithms prevents data scientists from reinventing the wheel. Once you start using these algorithms, you start seeing use cases for them everywhere (we use connected components in no fewer than 3 parts of our graph building process).
- 31. Thank you! Thanks to the Data Science/Engineering teams at Tapad. Read our blog: http://engineering.tapad.com. Careers: http://www.tapad.com/about-us/careers/openings (Data Science & Engineering!). Follow us on Twitter: @tapad, @tapadeng. Contact me: yael@tapad.com, @y_s_e
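
The label propagation of slide 8 and the Hash-to-Min rounds of slides 10 to 14 can be sketched in a single process as follows. This is a minimal illustration, not Tapad's MapReduce implementation. The function names `label_propagation` and `hash_to_min` are made up for this sketch. The message rule follows the formulation in the cited arXiv paper as best understood here, in which each node also sends vmin to itself, so the intermediate C(v) sets can differ slightly from the slide tables while the final labels agree.

```python
from collections import defaultdict


def label_propagation(edges):
    """Baseline from slide 8: every node repeatedly takes the min label nearby."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    labels = {v: v for v in neighbors}  # start with self as label
    rounds = 0
    while True:
        rounds += 1
        new_labels = {
            v: min([labels[v]] + [labels[u] for u in neighbors[v]])
            for v in neighbors
        }
        if new_labels == labels:  # no label changed: done
            return labels, rounds
        labels = new_labels


def hash_to_min(edges):
    """Hash-to-Min sketch: returns ({node: label}, number of rounds run)."""
    # Initialization (slide 10): C(v) = v plus its neighbors.
    clusters = defaultdict(set)
    for u, v in edges:
        clusters[u].update((u, v))
        clusters[v].update((u, v))
    clusters = dict(clusters)
    rounds = 0
    while True:
        rounds += 1
        inbox = defaultdict(set)
        for members in clusters.values():
            v_min = min(members)
            inbox[v_min] |= members   # send C(v) to v_min
            for u in members:
                inbox[u].add(v_min)   # send {v_min} to each member of C(v)
        new_clusters = {v: inbox[v] for v in clusters}
        if new_clusters == clusters:  # no C(v) changed: converged
            break
        clusters = new_clusters
    # The cluster label of v is the minimal member of C(v).
    return {v: min(members) for v, members in clusters.items()}, rounds


# The five-node example from slides 10-14: edges A-B, B-C, C-D, C-E.
example_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
example_labels, example_rounds = hash_to_min(example_edges)
print(example_labels, example_rounds)
```

On a long chain of nodes the difference in round counts becomes visible: label propagation needs roughly one round per hop, while Hash-to-Min needs far fewer rounds, at the cost of larger messages per round.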

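The propose and tentatively-accept rounds of slides 22 to 27 can be sketched with the textbook Gale-Shapley algorithm. This is a minimal illustration only: it does not cover ties, label expiry, or distributed execution, which slide 28 lists as requirements at Tapad. The assignment of rankings to nodes below is an assumption (the transcript lists the six rankings but not their owners), as is treating a, b, c as the proposing circles. The names `gale_shapley` and `is_stable` are made up for this sketch.

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    """Return {proposer: acceptor} for a stable matching; proposers propose."""
    # rank[t][c] = position of proposer c in acceptor t's list (lower is better)
    rank = {
        t: {c: i for i, c in enumerate(prefs)}
        for t, prefs in acceptor_prefs.items()
    }
    next_choice = {c: 0 for c in proposer_prefs}  # next list index to try
    engaged_to = {}                               # acceptor -> proposer
    free = list(proposer_prefs)
    while free:
        c = free.pop()
        t = proposer_prefs[c][next_choice[c]]
        next_choice[c] += 1
        current = engaged_to.get(t)
        if current is None:
            engaged_to[t] = c             # t was unengaged: tentatively accept
        elif rank[t][c] < rank[t][current]:
            engaged_to[t] = c             # t prefers c: swap partners
            free.append(current)
        else:
            free.append(c)                # t rejects c: c tries again later
    return {c: t for t, c in engaged_to.items()}


def is_stable(matching, proposer_prefs, acceptor_prefs):
    """True if no unmatched pair prefers each other to their partners."""
    partner_of = {t: c for c, t in matching.items()}
    for c, prefs in proposer_prefs.items():
        for t in prefs:
            if t == matching[c]:
                break  # everything after this is less preferred by c
            t_prefs = acceptor_prefs[t]
            if t_prefs.index(c) < t_prefs.index(partner_of[t]):
                return False  # c and t prefer each other: unstable
    return True


# ASSUMED assignment of the slide's rankings to nodes (not given in the text).
circle_prefs = {"a": ["β", "γ", "δ"], "b": ["β", "δ", "γ"], "c": ["δ", "γ", "β"]}
triangle_prefs = {"δ": ["c", "b", "a"], "γ": ["b", "a", "c"], "β": ["c", "a", "b"]}
matching = gale_shapley(circle_prefs, triangle_prefs)
print(matching, is_stable(matching, circle_prefs, triangle_prefs))
```

Under the assumed rankings, the result pairs a with β, b with γ, and c with δ. A different assignment of rankings to nodes would generally give a different matching.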