Learning to hash has been widely adopted as a solution to approximate nearest neighbor search for large-scale data retrieval in many applications. Applying deep architectures to learning to hash has recently gained increasing attention due to its computational efficiency and retrieval quality.
3. Agenda
Praveen Pratury (Director of Engineering, Samsung Research America)
▪ Overview of Samsung
▪ Samsung Audience platform
▪ Lookalike modeling introduction
Yingnan Zhu (Lead Data Scientist, Samsung Research America)
▪ Lookalike approaches
▪ Speed up with Pandas UDF
▪ Model performance
▪ Results
▪ Q & A
11. LookAlike Modeling – Samsung Context
Goals
Improve incremental reach and targeting for:
▪ TV Networks (identify new audiences to promote new shows)
▪ Samsung new TV purchases (8K, QLED, Terrace, etc.)
The goal is to improve reach and increase conversion for TV shows and new TV purchases.
Approach
By leveraging Samsung’s rich ACR viewership data from 50+ million TVs in the US and applying user behavior hashing techniques:
▪ Identify TV viewers similar to existing audiences based on user behavior
- Find audiences that will respond favorably to show-specific TV ads
▪ Identify existing premium TV owners and expand to future buyers
12. Look Alike Audience Expansion Example
[Figure: two scatter plots of all TV users’ hash code space. Legend: * = 8K TV users, + = non 8K TV users. Left: seed segment A drawn around the 8K TV owners. Right: expanded segment B, sized to the targeted expansion, drawn around A.]
LookAlike example for an 8K TV campaign. The goal is to identify users who look like 8K TV owners but do not yet own an 8K TV.
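The expansion pictured above can be sketched as a ranking problem in hash-code space: score each non-owner by Hamming distance to the nearest seed user and keep the top N. A minimal Python sketch; the function and segment names are illustrative, not the production service:

```python
# Seed-segment expansion in hash-code space (illustrative sketch).
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")

def expand_segment(seed_codes, candidate_codes, target_size):
    """seed_codes: {user: code} for segment A (e.g. 8K TV owners).
    candidate_codes: {user: code} for non-owners.
    Returns expanded segment B of size target_size."""
    scored = [
        (min(hamming(code, s) for s in seed_codes.values()), user)
        for user, code in candidate_codes.items()
    ]
    scored.sort()                        # closest to the seed first
    return [user for _, user in scored[:target_size]]
```

In practice the candidate scan is what the bucketized hash search is for; this sketch only shows the ranking criterion.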
13. Challenges and Solutions
▪ Challenges
▪ Large-scale data:
▪ The search space is huge: hundreds of millions of Smart TV and mobile users
▪ Efficiency:
▪ Each device can generate thousands of logs per day
▪ Lookalike user retrieval must run under a time constraint
▪ Possible solutions
▪ LSH, k-nearest neighbors, similar-user search in recommender systems, etc.
▪ These approaches sacrifice accuracy or efficiency, ignore contextual information, or are not optimized for time-sequence data
▪ Our solution
▪ Heterogeneous user behavior hash codes
▪ Provide LSH-like bucketized fast search while maintaining high accuracy of user similarity
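The "LSH-like bucketized fast search" can be sketched as follows: bucket users on a prefix of their binary hash code, then rank only the query's bucket by Hamming distance. Code length, prefix width, and all names here are assumptions for illustration:

```python
# LSH-like bucketized search over binary user hash codes (sketch).
from collections import defaultdict

CODE_BITS = 16      # assumed hash length for this sketch
PREFIX_BITS = 4     # bucket on the leading bits, LSH-style

def bucket_key(code: int) -> int:
    """Leading PREFIX_BITS of the code identify the bucket."""
    return code >> (CODE_BITS - PREFIX_BITS)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def build_index(user_codes: dict) -> dict:
    """Map bucket key -> list of user ids sharing that prefix."""
    index = defaultdict(list)
    for user, code in user_codes.items():
        index[bucket_key(code)].append(user)
    return index

def search(query_code: int, user_codes: dict, index: dict, k: int):
    """Scan only the query's bucket, then rank candidates by Hamming distance."""
    candidates = index.get(bucket_key(query_code), [])
    ranked = sorted(candidates, key=lambda u: hamming(user_codes[u], query_code))
    return ranked[:k]
```

Because similar users receive similar codes, most true neighbors share the prefix bucket, which is what keeps the search both fast and accurate.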
14. Look Alike Work Flow
Offline: Raw Data → Processed User Behavior Data → Deep Binary Hashing Models (various bit lengths) → User Hash Codes
Online/Offline: Seed Segments → Lookalike Service → Expanded Segments
15. Hashing Model Training Flow
Input (user pair: User 1, User 2) → Network Layers → Hash Layer → SGN → Similarity Label (1: similar, 0: dissimilar)
• Given a pair of users, the model first generates a user embedding (a continuous
vector) for each from the Network Layers. The Hash Layer then projects the
embedding to K dimensions, and SGN (the sign function: y = +1 for x ≥ 0,
y = -1 otherwise) produces the binary representation. The output label is
1 for similar pairs and 0 for dissimilar pairs.
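The forward pass above can be sketched in a few lines of numpy. The random projections stand in for the trained Network Layers and Hash Layer; sizes and the Hamming threshold are assumptions, not the paper's values:

```python
# Sketch of the hashing forward pass: embedding -> K-dim projection -> SGN.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB, K = 32, 16, 8   # assumed sizes for this sketch

# Stand-ins for the trained "Network Layers" and "Hash Layer".
W_net = rng.normal(size=(D_IN, D_EMB))
W_hash = rng.normal(size=(D_EMB, K))

def hash_code(x: np.ndarray) -> np.ndarray:
    """Continuous embedding from the network, then SGN to {-1, +1}."""
    emb = np.tanh(x @ W_net)      # network layers (continuous vector)
    logits = emb @ W_hash         # K-bit hash layer
    return np.where(logits >= 0, 1, -1)

def similarity_label(code_a, code_b, threshold=K // 2):
    """1 (similar) if Hamming distance is small, else 0 (dissimilar)."""
    hamming = int(np.sum(code_a != code_b))
    return 1 if hamming <= threshold else 0
```

During training, the similarity label on user pairs supervises the network so that similar users end up with nearby codes after SGN.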
17. Model Explained
▪ Input layer:
▪ The input layer is the data pre-processing layer. It maps the sequential behavior input into a 3D structure that can be processed by a CNN.
▪ The first pre-processing step embeds each item into a D-dimensional vector. The next step sessionizes the user’s history by a specific time unit (e.g., hour). For each session, we aggregate all items the user interacted with using the multi-hot encoding of the corresponding items; this summarizes the user’s behavior for that session. After sessionization, we map each user’s behavior input into the high-dimensional space.
▪ Embedding layer:
▪ Since the multi-hot encoding used during pre-processing is a sparse, hand-crafted scheme, it carries more conceptual information than similarity information. This would hurt the overall performance of TAACNN, particularly its ability to preserve similarity information at large scale. To overcome this limitation, we introduce an embedding layer into the model.
▪ Time-Aware attention layer:
▪ The time-aware attention layer abstracts time-aware attention features in our TAACNN model, separating them into short-term and long-term features.
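The sessionization and multi-hot encoding described under the input layer can be sketched as below. The event format and item vocabulary are illustrative assumptions:

```python
# Sessionize timestamped item events by hour and build one multi-hot
# vector per session (illustrative sketch of the input-layer pre-processing).
from collections import defaultdict
from datetime import datetime

def sessionize(events, vocab):
    """events: list of (iso_timestamp, item_id) pairs.
    Returns {hour_bucket: multi-hot list over vocab}."""
    item_index = {item: i for i, item in enumerate(vocab)}
    sessions = defaultdict(lambda: [0] * len(vocab))
    for ts, item in events:
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H")
        sessions[hour][item_index[item]] = 1   # presence, not count
    return dict(sessions)
```

Stacking each user's session vectors over time yields the 3D structure (users x sessions x items) that the CNN consumes.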
19. Distributed Inference
▪ Issues:
▪ Data is large-scale: hundreds of millions of user profiles need to be updated within limited time and compute resources
▪ A conventional Spark UDF processes one row at a time and cannot satisfy these requirements
▪ We need efficient distributed inference methods
▪ Solution: Pandas UDF
▪ Scalar
▪ Scalar iterator
▪ Grouped map
▪ Grouped aggregate
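The core idea of the scalar Pandas UDF is that the function receives whole pandas Series batches (via Arrow) instead of one row at a time, so inference can be vectorized. Below is a sketch of such a batch function with a random projection standing in for the trained model; column names and sizes are assumptions:

```python
# Vectorized batch inference, as the body of a scalar Pandas UDF (sketch).
import numpy as np
import pandas as pd

def predict_hash_batch(features: pd.Series) -> pd.Series:
    """Each element of `features` is a list of floats (a user's feature
    vector); return an 8-bit hash code string per user, computed on the
    whole batch at once."""
    batch = np.stack(features.to_numpy())           # (batch, dim) in one shot
    rng = np.random.default_rng(0)                  # stand-in for the model
    logits = batch @ rng.normal(size=(batch.shape[1], 8))
    bits = (logits >= 0).astype(int)
    return pd.Series(["".join(map(str, row)) for row in bits])

# With Spark, this would be registered as a scalar Pandas UDF, e.g.
# pandas_udf(predict_hash_batch, "string"); the scalar-iterator variant
# additionally lets you load the model once per partition instead of
# once per batch.
```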
21. Model Performance
▪ We used accuracy as the main performance metric for all binary hashing
algorithms because each user has an identical number of similar and
dissimilar user pairs, so the evaluation set is balanced.
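On such a balanced pair set, accuracy is simply the fraction of pairs where a Hamming-distance threshold on the codes reproduces the ground-truth label. A small sketch (the threshold is an assumption):

```python
# Pairwise accuracy for binary hash codes (illustrative sketch).
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def pair_accuracy(pairs, threshold):
    """pairs: list of (code_a, code_b, label), label 1=similar, 0=dissimilar.
    Predict similar when Hamming distance <= threshold."""
    correct = sum(
        (1 if hamming(a, b) <= threshold else 0) == label
        for a, b, label in pairs
    )
    return correct / len(pairs)
```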
22. Conclusion
▪ A novel deep binary hashing architecture derives similarity-preserving
binary hash codes for sequential behavior data.
▪ TAACNN models the evolution of users’ attention preferences at different
time-awareness levels separately. Experimental results show significant
improvement over other well-known hashing methods.
▪ Pandas UDFs improved efficiency significantly and have been adopted in
many of our projects.
23. Thank you !!
We are hiring:
www.sra.samsung.com/open-positions
Contact: Praveen.Pratury@Samsung.com
Yingnan.z@Samsung.com
https://www.linkedin.com/in/praveenpratury
https://www.linkedin.com/in/yingnan-zhu-66651113/