Good morning, everybody. Thank you for attending this talk, and a special thanks to Rodrygo Santos, who invited us to talk about our challenges and research on real-time recommendation in a dynamic marketplace environment.
I'm Tiago Motta, and I work as a Machine Learning Engineer at OLX.
Mainstream recommendation systems often deal with items that can be consumed by many users.
Examples of such unlimited item offerings are movies, digital books, articles, and news.
Even when item consumption is not unlimited, as with products sold online, hotel rooms, and food, these applications at least have a way of knowing when the stock runs out, so the system can heuristically stop recommending the item.
The problem I bring today is completely different. Ignoring some exceptions, like ads for jobs and services, we are dealing with unique items that can be consumed by only one buyer. And because we do not intermediate the purchase and sale, we cannot be sure whether an item is still available, so we do not know when to stop recommending it.
OLX is the top-of-mind online classifieds platform in Brazil, and millions of Brazilians depend on us for a living. So we have a huge responsibility to improve this product.
We have more than seven million users accessing the platform daily and over five hundred thousand new ads per day. And 15% of contacts between buyers and sellers are made through recommended ads, showing that the recommendation system plays a key role in the product.
But this intense stream of new ads prevents us from using traditional collaborative filtering techniques such as matrix factorization. That's because the training time for each new batch of ads would make us recommend old ads, slowing down negotiations and reducing freshness.
That is why our first and current recommendation system is graph-based collaborative filtering.
It is computed through a SQL query, where ads and users are the nodes and implicit ad-view feedback events are the edges.
The main advantage of this technique is that it allows real-time recommendations without the need for constant and costly retraining.
This happens because the database is fed in real time by a streaming process, which is then accessed directly by the API.
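To make this concrete, here is a minimal sketch of such a graph-style co-view query, using Python's built-in sqlite3. The `ad_views` table, its columns, and the ranking rule are illustrative assumptions, not our production schema.

```python
import sqlite3

# Toy schema, assumed for illustration: one row per implicit ad-view event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_views (user_id TEXT, ad_id TEXT)")
conn.executemany(
    "INSERT INTO ad_views VALUES (?, ?)",
    [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u2", "a2"), ("u2", "a3"),
     ("u3", "a1"), ("u3", "a3")],
)

# Graph-based co-view query: users who viewed the source ad connect it to the
# other ads they viewed; candidates are ranked by their shared-viewer count.
query = """
SELECT v2.ad_id, COUNT(DISTINCT v2.user_id) AS shared_viewers
FROM ad_views v1
JOIN ad_views v2 ON v1.user_id = v2.user_id AND v2.ad_id != v1.ad_id
WHERE v1.ad_id = ?
GROUP BY v2.ad_id
ORDER BY shared_viewers DESC
"""
recs = conn.execute(query, ("a1",)).fetchall()
print(recs)  # a2 and a3 each share two viewers with a1
```

Because the view table is fed by the streaming process, the same query re-run at request time naturally reflects ads published moments ago.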
Despite its simplicity, this technique has given us a 6% increase in ad views and a 4% increase in contacts, as validated in A/B experiments we ran in the past.
Even with these great results, we have the intuition that it is possible to increase connectivity, and therefore the speed of our sellers' sales.
Mainly because the current target metric of our collaborative filtering is the ad view, not the action of contacting the seller.
However, if we changed this approach and adopted contact as the edge of the collaborative filtering graph, we would concentrate users' attention on ads that have already received contact, and that would reduce OLX's democracy.
When this concentration happens, it is bad for both sellers and buyers. Some sellers have a flood of messages to answer, while others are ignored. And it is also bad for the buyer, who is denied the object of desire because it has already been sold.
Thus we began to research an approach that would address both challenges: increasing connectivity and increasing product democracy.
The idea we are currently validating is to use a classifier to identify which factors in an ad have the greatest influence on stimulating buyer-seller contact given the user's intention when visiting another ad.
This would give us a chance to recommend ads that are likely to receive contacts long before they have even been viewed or received any contact.
In addition, this would give us the flexibility to heuristically filter ads currently being negotiated, avoiding concentration and giving other ads a chance.
To train this classifier, we created a balanced dataset containing features of both the source and the recommended ad, and as a target label a value indicating whether the seller was contacted about the recommended ad or not.
This dataset was created using a pseudo-random recommendation displayed to a sample of users for one week.
This randomness was designed to let us discover consumption patterns that the current recommendation bias would otherwise hide from us.
As features of the source and recommended ads, we use a vector representation of each document, generated from the title and description through doc2vec.
The doc2vec model was trained on a sample of 15 million historical OLX ads.
Another important group of features we included was a vector representation of the ad image, taken from the penultimate layer of a pre-trained ResNet18.
In addition, it was natural to imagine that the geographical distance between the ads would be a representative factor.
Since the most granular location information we have for an ad is its neighborhood, we performed a matrix factorization between users and the neighborhoods of the ads they contacted, in order to generate a vector representation of each neighborhood.
And do not confuse the name “neighborhood” here with the same term used in the KNN algorithm.
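One way to sketch this neighborhood factorization is with a truncated SVD over the user-by-neighborhood contact matrix; this is a plain-NumPy stand-in under simplified assumptions, whereas a production system would more likely use an implicit-feedback method such as ALS.

```python
import numpy as np

# Toy implicit-feedback matrix: rows are users, columns are neighborhoods,
# entries count contacts the user made on ads in that neighborhood.
contacts = np.array([
    [3, 1, 0, 0],
    [2, 0, 1, 0],
    [0, 0, 4, 2],
    [0, 1, 3, 1],
], dtype=float)

# Truncated SVD as a simple stand-in for matrix factorization.
k = 2
U, s, Vt = np.linalg.svd(contacts, full_matrices=False)
neighborhood_vectors = (np.sqrt(s[:k])[:, None] * Vt[:k]).T

print(neighborhood_vectors.shape)  # one k-dim vector per neighborhood
```

Neighborhoods contacted by similar sets of users end up with nearby vectors, so the vector difference between two ads' neighborhoods can stand in for geographic proximity as a classifier feature.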
Last but not least, the price difference between the ads was included.
With all these features in this pseudo-random balanced dataset, we used the CatBoost classifier and have so far achieved an accuracy of 75%, which is an excellent result given that it deals with human behavior.
We have not yet been able to validate offline whether this technique will do better than our current recommendation. We are currently trying to find a way to validate this fairly and without bias. If anyone has an idea of how we can do this, please come talk to us at the OLX stand after the talk. But we already have an idea of how to put this algorithm into production.
Once we have this classifier ready, our huge ad volume will make it impractical to run predictions in real time in the API.
So we are thinking of strategies to make this prediction in a background streaming process by calculating the probabilities through the classifier and updating them in real time in a database.
For each new ad, the streaming process will need to:
Calculate text and image embeddings and save into the database.
Score a large sample of ads with the classifier to calculate their contact probabilities, and save these probabilities to the database. At this point we already have the recommendations for the newly published item.
Execute the reverse probability calculation, switching the recommended ad to the source-ad position, to update the recommendations for older items in the database.
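The steps above can be sketched as a background handler; `embed`, `contact_probability`, and the in-memory dictionaries are simplified placeholders for our embedding models, the CatBoost classifier, and the database.

```python
# Sketch of the background streaming handler for each new ad.
def embed(ad):
    # Placeholder for the doc2vec + ResNet18 embedding step.
    return [float(len(ad["title"])), ad["price"]]

def contact_probability(source_vec, candidate_vec):
    # Placeholder for the classifier over concatenated pair features.
    diff = sum(abs(a - b) for a, b in zip(source_vec, candidate_vec))
    return 1.0 / (1.0 + diff)

embeddings = {}       # ad_id -> feature vector (stands in for a DB table)
recommendations = {}  # source ad_id -> {candidate ad_id: probability}

def on_new_ad(ad):
    # 1. Compute and store the new ad's embeddings.
    embeddings[ad["id"]] = embed(ad)
    # 2. Score a sample of existing ads as candidates for the new ad.
    recommendations[ad["id"]] = {
        other: contact_probability(embeddings[ad["id"]], vec)
        for other, vec in embeddings.items() if other != ad["id"]
    }
    # 3. Reverse pass: add the new ad to older ads' recommendations.
    for other, recs in recommendations.items():
        if other != ad["id"]:
            recs[ad["id"]] = contact_probability(embeddings[other], embeddings[ad["id"]])

on_new_ad({"id": "a1", "title": "bike", "price": 300.0})
on_new_ad({"id": "a2", "title": "bicycle", "price": 310.0})
print(sorted(recommendations["a1"]))  # the reverse pass updated the older ad
```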
That way, the API would only need to execute an optimized SQL query, maintaining our excellent performance.
As future research and improvements, we are going to experiment with the embedding technique adopted by Gabriel Moreira in his excellent paper about CHAMELEON.
And as an alternative to the classification approach, we are thinking about doing matrix factorization using item features, as Kula described in his RecSys 2015 paper and implemented in LightFM.