
Challenges and research for a real-time recommendation at OLX




  1. Challenges and Research for Real-Time Recommendation in a Dynamic Marketplace Environment
  2. Some unlimited consumption items
  3. Limited but with stock under control
  4. Unique items and stock out of control
  5. @timotta
  6. Largest online classified in Brazil: +7M daily users, +500k new ads published daily, +15% contacts influenced by recommendations
  7. Heavy ad publishing flow
  8. Graph-based collaborative filtering
  9. Real-time graph updating: Adega (PostgreSQL), Sommelier (API), Lurker (Tracker), Stream processor
  10. Great result: ad views +6%, contacts +4%
  11. Collaborative filtering based on ad views (diagram labels: View, Contact)
  12. Concentration (diagram: Buyers, Sellers)
  13. The Idea: A content-based recommendation ranked by contact probability
  14. Random item-item balanced dataset: Ad Viewed Features, Ad Recommended Features, Target: Contacted yes or no?
  15. Features: Title and description. Title, description and category; Doc2Vec embedding
  16. Features: Image. Image embedding from ResNet's penultimate layer
  17. Features: Neighborhood. Neighborhood latent factors generated by logistic matrix factorization
  18. Features: Price
  19. Classification on a balanced dataset: accuracy 75%
  20. Studying how to compare both methods offline...
  21. Cannot predict online due to the long time needed to load candidate ads and calculate probabilities
  22. Real-time background prediction: Adega (PostgreSQL), Embedding calculations
  23. Real-time background prediction: Adega (PostgreSQL), Embedding calculations, Probability calculation
  24. Real-time background prediction: Adega (PostgreSQL), Embedding calculations, Probability calculation, Reversed probability calculation
  25. Real-time background prediction: Adega (PostgreSQL), Sommelier (API), Embedding calculations, Probability calculation, Reversed probability calculation
  26. Future research: News Session-Based Recommendations using Deep Neural Networks (CHAMELEON); Metadata Embeddings for User and Item Cold-start Recommendations (LightFM)
  27. Recommendation Squad at OLX: Filipe Casal, Marcelo Malta, Tiago Motta, Leonardo Wajnsztok, Thays Macedo

Editor's Notes

  • Good morning, everybody. Thank you for attending this talk, and special thanks to Rodrygo Santos, who invited us to talk about our challenges and research
    for a real-time recommendation in a dynamic marketplace environment.

    I’m Tiago Motta, and I work as a Machine Learning Engineer at OLX.

    https://www.flaticon.com
  • Mainstream recommendation systems often deal with items that can be consumed by many users.

    Examples of such unlimited items are movies, digital books, articles and news.


  • Even when item consumption is not unlimited, as with products sold online, hotel rooms and food, these applications at least have a way of knowing when the stock runs out, so that a heuristic can stop recommending the item.
  • The problem I bring today is completely different.
    Ignoring some exceptions, such as ads for jobs and services,
    we're dealing with unique items that can be consumed by only one buyer.
    And because we do not intermediate the purchase and sale,
    we cannot be sure when an item is no longer available, so we do not know when to stop recommending it.
  • OLX is the top-of-mind online classifieds platform in Brazil.
    Millions of Brazilians depend on us for a living,
    so we have a huge responsibility to improve this product.
  • We have more than seven million users accessing the platform daily
    and over five hundred thousand new ads per day.
    And 15% of contacts between buyers and sellers come through recommended ads,
    which shows that the recommendation system plays a key role in the product.
  • But this intense stream of new ads prevents us from using traditional collaborative filtering techniques such as Matrix Factorization.
    That is because the training time needed for each new batch of ads would leave us recommending old ads, slowing down negotiations and reducing freshness.
  • That is why our first and current recommendation system is a graph-based collaborative filtering.

    It is calculated through a SQL query, where ads and users are the nodes and implicit feedback from ad views forms the edges.
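To make the idea concrete, here is a minimal sketch of an item-to-item co-view query. It is not the production query: the `ad_views` table, its columns, and the "shared viewers" score are assumptions chosen for illustration, and SQLite stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one row per (user, ad) view edge of the bipartite graph.
def build_views_db(views):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ad_views (user_id TEXT, ad_id TEXT, "
        "PRIMARY KEY (user_id, ad_id))"
    )
    conn.executemany("INSERT OR IGNORE INTO ad_views VALUES (?, ?)", views)
    return conn

# Walk ad -> users who viewed it -> other ads those users viewed,
# and rank the other ads by how many viewers they share with the seed ad.
CO_VIEW_QUERY = """
SELECT other.ad_id, COUNT(DISTINCT other.user_id) AS shared_viewers
FROM ad_views AS seed
JOIN ad_views AS other
  ON other.user_id = seed.user_id AND other.ad_id <> seed.ad_id
WHERE seed.ad_id = ?
GROUP BY other.ad_id
ORDER BY shared_viewers DESC, other.ad_id
LIMIT ?
"""

def recommend(conn, ad_id, limit=5):
    return conn.execute(CO_VIEW_QUERY, (ad_id, limit)).fetchall()

views = [
    ("u1", "bike"), ("u1", "helmet"),
    ("u2", "bike"), ("u2", "helmet"), ("u2", "sofa"),
    ("u3", "bike"), ("u3", "helmet"), ("u3", "sofa"),
    ("u4", "sofa"), ("u4", "lamp"),
]
conn = build_views_db(views)
print(recommend(conn, "bike"))
```

Because the recommendation is only a query over the stored edges, a view inserted into the table is reflected in the very next call, with no retraining step.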

  • The main advantage of this technique is that it allows real-time recommendations without the need for constant and costly retraining.

    This is possible because the database is fed in real time by a streaming process and is then accessed directly by the API.
  • Despite its simplicity, this technique has given us a 6% increase in ad views and a 4% increase in contacts, as validated in A/B experiments we ran in the past.

    Even with these great results, our intuition is that it is possible to increase connectivity, and therefore the speed at which our sellers sell.
  • This is mainly because the current target of our collaborative filtering is the ad view and not the action of contacting the seller.

    However, if we changed this approach and adopted contacts as the edges of the collaborative filtering, we would focus users' attention on ads that have already received contacts, and that would reduce OLX's democracy.


  • When this concentration happens, it is bad for both sellers and buyers. Some sellers have a flood of messages to answer, while others are ignored. It is also bad for buyers, who are denied the item they want because it has already been sold.

    So we began to research an approach that would address both challenges: increasing connectivity and increasing product democracy.
  • The idea we are currently validating is to use a classifier to identify which factors in an ad have the greatest influence on stimulating buyer-seller contact given the user's intention when visiting another ad.

    This would give us a chance to recommend ads that are likely to receive contacts long before they have been viewed or received any contact.

    In addition, this would give us the flexibility to heuristically filter ads currently being negotiated, avoiding concentration and giving other ads a chance.
  • To train this classifier, we created a balanced dataset containing features of both the source ad and the recommended ad, with a target label indicating whether or not the seller was contacted about the recommended ad.

    This dataset was created using a pseudo-random recommendation displayed to a sample of users for one week.

    This randomness was designed to let us discover other patterns of consumption that the bias of the current recommendation would prevent us from seeing.
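The notes do not say how the dataset was balanced. One common option, shown here purely as an assumption, is to downsample the majority class so that contacted and not-contacted pairs appear in equal numbers. The record fields (`source_ad`, `recommended_ad`, `contacted`) are hypothetical names.

```python
import random

def balance_pairs(pairs, seed=0):
    """Downsample the larger class so both labels are equally frequent."""
    rng = random.Random(seed)
    positives = [p for p in pairs if p["contacted"]]
    negatives = [p for p in pairs if not p["contacted"]]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)  # avoid all positives coming first
    return balanced

# Toy log of randomly recommended pairs: 1 in 10 led to a contact.
logged = [
    {"source_ad": f"s{i}", "recommended_ad": f"r{i}", "contacted": i % 10 == 0}
    for i in range(100)
]
balanced = balance_pairs(logged)
print(len(balanced), sum(p["contacted"] for p in balanced))
```

With a balanced dataset, a classifier that guesses at random scores about 50% accuracy, which gives a simple baseline for reading the accuracy reported later.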
  • As features of the source and recommended ads, we use a vector representation of the ad text, generated from the title and description with doc2vec.

    The doc2vec model was trained using a sample of 15 million examples of OLX historical ads.
  • Another important group of features we included is a vector representation of the ad image,
    taken from the penultimate layer of a pre-trained ResNet18 (eighteen).

  • In addition, it was natural to expect the geographical distance between the ads to be a relevant factor.

    As the most granular location information we have for an ad is the neighborhood,

    we ran a matrix factorization between users and the neighborhoods of the ads they contacted

    to generate a vector representation of each neighborhood.

    Note: do not confuse the name “neighborhood” here with the term used in the KNN algorithm.

    https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
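As a rough illustration of how such neighborhood vectors can be learned, here is a simplified logistic matrix factorization in pure Python. It keeps only the core weighted log-likelihood from the linked paper and leaves out the bias terms and priors, and it uses plain full-batch gradient ascent. The contact counts, `alpha`, and the other hyperparameters are made up for the example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_likelihood(counts, users, hoods, alpha=1.0):
    """Sum over cells of a*c*z - (1 + a*c) * log(1 + e^z), with z = user . hood."""
    total = 0.0
    for u, row in enumerate(counts):
        for n, c in enumerate(row):
            z = dot(users[u], hoods[n])
            total += alpha * c * z - (1 + alpha * c) * math.log1p(math.exp(z))
    return total

def factorize(counts, dim=2, alpha=1.0, lr=0.02, steps=500, seed=0):
    rng = random.Random(seed)
    n_users, n_hoods = len(counts), len(counts[0])
    users = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    hoods = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_hoods)]
    for _ in range(steps):
        grad_u = [[0.0] * dim for _ in range(n_users)]
        grad_h = [[0.0] * dim for _ in range(n_hoods)]
        for u, row in enumerate(counts):
            for n, c in enumerate(row):
                z = dot(users[u], hoods[n])
                # Derivative of this cell's log-likelihood term with respect to z.
                err = alpha * c - (1 + alpha * c) / (1.0 + math.exp(-z))
                for k in range(dim):
                    grad_u[u][k] += err * hoods[n][k]
                    grad_h[n][k] += err * users[u][k]
        for u in range(n_users):
            for k in range(dim):
                users[u][k] += lr * grad_u[u][k]
        for n in range(n_hoods):
            for k in range(dim):
                hoods[n][k] += lr * grad_h[n][k]
    return users, hoods

# Rows are users, columns are neighborhoods, values are contact counts (toy data).
counts = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 2]]
users, hoods = factorize(counts)
print(hoods)  # one latent vector per neighborhood
```

Each row of `hoods` is the kind of vector that could then be attached to an ad as its neighborhood feature.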

  • Last but not least, the price difference between the ads was included.
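Putting the feature groups together, one training row for a pair of ads could be assembled roughly as follows. This is a sketch under assumptions: the dictionary keys, the tiny vector sizes, and the sign convention for the price difference (recommended minus source) are all illustrative choices, not details from the talk.

```python
def pair_features(source, recommended):
    """Concatenate both ads' vectors, then append the price difference."""
    vector_keys = ("text_vec", "image_vec", "neighborhood_vec")
    row = []
    for ad in (source, recommended):
        for key in vector_keys:
            row.extend(ad[key])
    # Assumed convention: positive when the recommended ad is more expensive.
    row.append(recommended["price"] - source["price"])
    return row

viewed = {"text_vec": [0.1, 0.2], "image_vec": [0.3],
          "neighborhood_vec": [0.4], "price": 500.0}
candidate = {"text_vec": [0.5, 0.6], "image_vec": [0.7],
             "neighborhood_vec": [0.8], "price": 650.0}
row = pair_features(viewed, candidate)
print(row)
```

The order of the two ads matters in this layout, so the row for (viewed, candidate) differs from the row for (candidate, viewed).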

  • With all those features in this pseudo-random balanced dataset,
    we trained a CatBoost classifier and have so far achieved an accuracy of 75%,
    which is a strong result against the 50% chance level of a balanced dataset, given that it models human behavior.
  • We have not yet been able to validate offline whether this technique will do better than our current recommendation.
    We are currently trying to find a way to validate this fairly and without bias.
    If anyone has an idea of how we can do this, please come and talk to us after the talk at the OLX stand.
    But we already have an idea of how to put this algorithm into production.
  • Once we have this classifier ready,
    because of our huge ad volume,
    it will be impractical to run the prediction in real time in the API.

    So we are thinking of strategies to make this prediction in a background streaming process by calculating the probabilities through the classifier and updating them in real time in a database.

  • For each new ad,
    the streaming process will need to:

    Calculate the text and image embeddings and save them to the database.

  • Go through a large sample of ads, calculate the probability of contact for each one using the classifier, and save these probabilities to the database. At this point we will have the recommendations for the newly published ad.

  • Run the reverse probability calculation, swapping the roles of source ad and recommended ad, to update the recommendations for older ads in the database.

  • That way the API would only need to execute an optimized SQL query while maintaining our excellent performance.
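The background steps above can be sketched as follows. Everything here is a stand-in: SQLite replaces PostgreSQL, the table and function names are hypothetical, and `contact_probability` is a toy price-based score in place of the trained classifier and its embedding features.

```python
import sqlite3

def contact_probability(source, candidate):
    """Toy stand-in for the classifier: closer prices give a higher score."""
    return 1.0 / (1.0 + abs(source["price"] - candidate["price"]) / 100.0)

def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ads (ad_id TEXT PRIMARY KEY, price REAL)")
    conn.execute(
        "CREATE TABLE recommendations (source_ad TEXT, recommended_ad TEXT, "
        "probability REAL, PRIMARY KEY (source_ad, recommended_ad))"
    )
    return conn

def on_new_ad(conn, new_ad, sample_size=1000):
    """Stream handler: score the new ad in both directions and store the results."""
    candidates = [
        {"ad_id": ad_id, "price": price}
        for ad_id, price in conn.execute(
            "SELECT ad_id, price FROM ads LIMIT ?", (sample_size,)
        )
    ]
    conn.execute("INSERT INTO ads VALUES (?, ?)", (new_ad["ad_id"], new_ad["price"]))
    rows = []
    for cand in candidates:
        # Forward: recommendations shown on the new ad's page.
        rows.append((new_ad["ad_id"], cand["ad_id"], contact_probability(new_ad, cand)))
        # Reverse: the new ad becomes a candidate on older ads' pages.
        rows.append((cand["ad_id"], new_ad["ad_id"], contact_probability(cand, new_ad)))
    conn.executemany("INSERT OR REPLACE INTO recommendations VALUES (?, ?, ?)", rows)
    conn.commit()

def api_recommend(conn, ad_id, limit=3):
    """What the API would run: a single indexed read, no model call."""
    query = (
        "SELECT recommended_ad FROM recommendations WHERE source_ad = ? "
        "ORDER BY probability DESC, recommended_ad LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, (ad_id, limit))]

conn = open_store()
for ad in [{"ad_id": "bike-a", "price": 500.0},
           {"ad_id": "bike-b", "price": 550.0},
           {"ad_id": "car", "price": 30000.0}]:
    on_new_ad(conn, ad)
print(api_recommend(conn, "bike-a"))
```

The toy score is symmetric, so both directions give the same value here. A real classifier that takes source and recommended features in a fixed order need not be symmetric, which is why the reverse pass is computed separately.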
  • As future research and improvements, we are going to experiment with the embedding technique adopted by Gabriel Moreira in his excellent paper on CHAMELEON.

    And as an alternative to the classification, we are thinking about doing matrix factorization using item features, as Kula described in his RecSys 2015 paper and implemented in LightFM.

    https://github.com/lyst/lightfm

    https://dl.acm.org/citation.cfm?doid=3270323.3270328
  • We hope this item-item content-based classification technique will give us the flexibility to offer a wide range of new recommendations with many different types of filtering.

    However, only offline and online validation of this technique will show us if it can improve OLX connectivity and democracy over the previous recommendation.

    We hope to have these results by RecSys 2020 in Rio de Janeiro. We will be waiting for you there.
