Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
×
1 of 24

Modern recommender system in large content website

2

Share

Download to read offline

Speech at PyCon TW 2020

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Modern recommender system in large content website

  1. 1. Modern recommender system in large content website Cyrus Chiu @ PyConTW 2020 1
  2. 2. Youtube - Landing Page 2
  3. 3. Cyrus Chiu About www.linkedin.com/in/iamcyruschiu 3 a ML engineer at a financial (technology) company
  4. 4. Use case in PIXNET User2Article (personal) Article2Tag (non-personal) Article2Article (both) Tag2Tag (non-personal) Tag2Article (personal) 4
  5. 5. How we building a recommender system Basic • Data infrastructure • Data collection • User tracking • Some rules or model • Position on the webpage Vehicle Say: segments 5
  6. 6. How we building a recommender system Advanced • Data lifecycle • Scalability • User profiling • Item profiling • Item retrieving • Item ranking Tech Vehicle User / Item Information 6
  7. 7. Retrieving Ranking Recommendation Data Warehouse What is 2–phase recommender system Vehicle Food Pet Down scaling millions of item to hundreds for each user Solving the diversity problem by using hybrid retrieval strategy. • Large scale • Less features • Simple model • Speed • Small scale • More features • Sophisticated model • Performance Millions Hundreds Tens 7
  8. 8. Personalized Item Retrieving … Phase 1: Retrieving Reading Sequence Recommender Ordered list of items 8
  9. 9. Define the similarity between item A and B According to the content According to the user behavior 9
  10. 10. Strategy Search Engine Word Embedding Item Embedding Frequently Occurring Together Hot Features • Easy to use • Multiple query type • Another cost to maintain your Elasticsearch cluster • Unexplainable • Vector space • High computational cost • Cold start problem • Data sparsity issue • Vector space • High computational cost • Cold start problem • Data sparsity issue • Explainable • Easy to maintain • Explainable • Easy to use • Suitable for any situations Recommended Some Retrieving Methods 10
  11. 11. Phase 2: Ranking RecommenderRecommender f ( user, item, context ) Candidate of items Ordered list of itemsUser Information Item Information Context Information 11
  12. 12. Model Logsitc Regression GBDT + LR Factorization Machine DNN Features • Fast • Highly explainable • Manually crafting features • Limited effective • Fast • Highly explainable • Capability to generate feature interaction • Slow • Explainable • Good at learning feature interactions • Exhausting resources • More slowly • Unexplainable • Learning sophisticated feature interactions automatically • Exhausting more resources • Wide & Deep, Deep & Cross, DeepFM, xDeepFM, … Evolution of the prediction model 12
  13. 13. RecSys on DMP 13
  14. 14. Example: Find the pair-wise distance ? from sklearn.metrics.pairwise import cosine_similarity cosine_similarity(X) Solution 1 Solution 2 from sklearn.metrics.pairwise import pairwise_distances pairwise_distances(X, metric="cosine") 14
  15. 15. 15
  16. 16. import pyspark (1) X_bc = sc.broadcast(X) n_part = 100 batch_size = 64 data_size = X.shape[0] batch_vectors_rdd = sc.parallelize( [X[i: i+batch_size] for i in range(0, data_size, batch_size) ], n_part) batch_vectors_rdd.mapPartitions( lambda x: cosine_similarity(x, X_bc.value)) Solution: RDD X_bc batch 16
  17. 17. import pyspark (2) @F.udf(returnType=DoubleType()) def get_distance(i, j): sim = cosine_similarity(i, j) return sim vector_sdf.alias("i").join( vector_sdf.alias(“j"),
 col("i.id") < col("j.id")).select( F.col("i.id").alias("i"), F.col("j.id").alias("j"), get_distance("i.vector", "j.vector").alias( "cosine_similarity") ) Solution: DataFrame I J = I CROSS JOIN CosineSimilarity ( Column I.vector, Column J.vector ) 17
  18. 18. 1K 10K 100K sklearn 0.1 1.5 OOM pyspark 2 5.1 70 Faiss 0.05 0.6 21 Comparison Calculating distances between coordinates from a 32*1k/10k/100k matrix
 (32 core, 32 GB memory) https://github.com/facebookresearch/faiss18
  19. 19. Workflows of the data project Data Management Programming Project Management • ETL pipeline design • Data infra design • Metalayer design • Data toolkit • SQL • Framework design • Library development • Parallel computing • Machine learning • Git flow • Review 19
  20. 20. Using Apache Airflow to schedule and monitor workflows 20
  21. 21. Online A/B Testing 21
  22. 22. Youtube - Landing Page 22
  23. 23. Youtube - Video Page 23
  24. 24. Thank You 24

×