2. Objective
Predict whether a user will buy something during a session.
If a buy is predicted, also predict which items will be bought
in that session.
3. Data Set
The YOOCHOOSE dataset contains a collection of sessions from a retailer, where each
session encapsulates the click events that the user performed in that session.
For some of the sessions there are also buy events, meaning that the session ended with
the user buying something from the web shop.
The data was collected over several months in 2014 and reflects the clicks
and purchases performed by the users of an online retailer in Europe.
Challenge Link: http://2015.recsyschallenge.com/challenge.html
4. Data Set Schema
Clicks Data Set
Session ID – the id of the session. In one session there are one or many clicks.
Timestamp – the time when the click occurred.
Item ID – the unique identifier of the clicked Item.
Category – the context of the click. The value "S" indicates a special offer, "0" indicates a
missing value, a number between 1 and 12 indicates a real category identifier, and any other
number indicates a brand.
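The Category rules above can be sketched as a small decoding function. This is only an illustration: the function name and the returned labels ("special_offer", "missing", "category", "brand", "unknown") are hypothetical and not part of the dataset.

```python
def classify_category(raw: str) -> str:
    """Map a raw Category value to a coarse label (labels are illustrative)."""
    value = raw.strip()
    if value == "S":
        return "special_offer"
    if value == "0":
        return "missing"
    if value.isdigit():
        # 1..12 is a real category identifier; any other number is a brand.
        return "category" if 1 <= int(value) <= 12 else "brand"
    # Anything not covered by the schema description.
    return "unknown"
```

For example, `classify_category("7")` falls in the 1 to 12 range and is treated as a real category, while a long numeric value is treated as a brand.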
5. Data Set Schema
Buys Data Set
Session ID – the id of the session. In one session there are one or many buy events.
Timestamp – the time when the buy occurred. Format: YYYY-MM-DDThh:mm:ss.SSSZ
Item ID – the unique identifier of the item that has been bought.
Price – the price of the item. Could be represented as an integer number.
Quantity – the quantity bought in this buy event.
6. Data Set Schema
Test Data Set
Session ID – the id of the session. In one session there are one or many clicks.
Timestamp – the time when the click occurred.
Item ID – the unique identifier of the clicked Item.
Category – the context of the click. The value "S" indicates a special offer, "0" indicates a
missing value, a number between 1 and 12 indicates a real category identifier, and any other
number indicates a brand.
11. Buys Per Click
Items which have a high proportion of buys with respect to
clicks in the training set have a higher chance of being bought
in the test set as well.
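A minimal sketch of how such a buys-per-click ratio could be computed, assuming the clicks and buys have already been reduced to plain lists of item ids. The function name is hypothetical, and counting buy events (rather than summing Quantity) is an assumption, since the slides do not say which was used.

```python
from collections import Counter


def buys_per_click(click_item_ids, buy_item_ids):
    """Return {item_id: buy events / click events} for every clicked item."""
    clicks = Counter(click_item_ids)
    buys = Counter(buy_item_ids)
    # Counter returns 0 for items that were clicked but never bought.
    return {item: buys[item] / n_clicks for item, n_clicks in clicks.items()}
```

Items that appear only in the buys list are left out here, because the ratio is undefined without any clicks.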
12. Item Popularity
Globally popular items in the training data have a higher chance of being bought in the
test data as well. Only items purchased more than 10 times are taken into consideration.
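The popularity filter can be sketched as follows, assuming the training buys are available as a list of item ids. The function and parameter names are hypothetical, and whether the original work counted buy events or quantities is not stated, so buy events are assumed here.

```python
from collections import Counter


def popular_items(buy_item_ids, min_purchases=10):
    """Return {item_id: purchase count} for items bought more than min_purchases times."""
    counts = Counter(buy_item_ids)
    # "More than 10 times" is read as a strict inequality.
    return {item: n for item, n in counts.items() if n > min_purchases}
```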
13. Hour Of Item Click
In the training data, hours 8-17 of the day were observed to have a higher buys/click
rate than the other hours. Hours are encoded as integers from 0 to 23.
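As a rough sketch, the hour feature could be extracted from a timestamp like this. It assumes click timestamps use the same YYYY-MM-DDThh:mm:ss.SSSZ format that is documented for the buys data set; the function name is hypothetical.

```python
from datetime import datetime


def hour_of_click(timestamp: str) -> int:
    """Return the hour of day (0..23) of a timestamp such as 2014-04-07T10:51:09.277Z."""
    # %f accepts the three-digit millisecond part; the trailing Z is matched literally.
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").hour
```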
14. Item Clicks
An item that receives more clicks in a session is more likely to be bought by the user.
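One way to obtain this feature is to count clicks per (session, item) pair. The sketch below assumes click events have been reduced to (session_id, item_id) tuples; the function name is hypothetical.

```python
from collections import Counter


def clicks_per_session_item(click_events):
    """Count how often each item was clicked within each session.

    click_events is an iterable of (session_id, item_id) pairs.
    """
    return Counter((session_id, item_id) for session_id, item_id in click_events)
```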
16. Random Forest Classifier
A group of decision trees, each built on randomly sampled input records.
Can build complex decision regions.
Combining the classifiers' outputs helps prevent overfitting.
Most importantly: capable of dealing with imbalanced datasets.
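To illustrate the first and third points, here is a much-simplified stand-in for a random forest: one-split decision trees (stumps), each trained on a bootstrap sample of the records and combined by majority vote. It is not the classifier used in this work; a real random forest grows deeper trees and also samples features at each split. All names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Find the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            # Each side predicts its majority label (0 if the side is empty).
            l_pred = Counter(left).most_common(1)[0][0] if left else 0
            r_pred = Counter(right).most_common(1)[0][0] if right else 0
            errors = sum(y != l_pred for y in left) + sum(y != r_pred for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_pred, r_pred)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_pred, r_pred = stump
    return l_pred if row[f] <= t else r_pred


def fit_bagged(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the records."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_bagged(forest, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]
```

With toy rows of the form (clicks in session, buys/click) and labels 1 for buy and 0 for no buy, `fit_bagged(rows, labels)` returns a list of stumps that `predict_bagged` then votes over.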
17. Evaluation Measure
The evaluation takes into account the ability to predict both aspects: whether a session
ends with a buy event, and which items were bought. Let’s define the following:
Sl – sessions in submitted solution file
S – all sessions in the test set
s – session in the test set
Sb – sessions in test set which end with buy
As – predicted bought items in session s
Bs – actual bought items in session s
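then the score of a solution, as defined by the challenge, is:

$$
\mathrm{Score}(S_l) = \sum_{s \in S_l}
\begin{cases}
\dfrac{|S_b|}{|S|} + \dfrac{|A_s \cap B_s|}{|A_s \cup B_s|} & \text{if } s \in S_b \\[2ex]
-\dfrac{|S_b|}{|S|} & \text{otherwise}
\end{cases}
$$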
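As a minimal sketch, the score can be computed as below, assuming the rule published for the challenge: every submitted session that really ends with a buy earns |Sb|/|S| plus the Jaccard similarity between the predicted items As and the bought items Bs, and every other submitted session costs |Sb|/|S|. The function name and the input layout (dictionaries keyed by session id) are hypothetical.

```python
def challenge_score(solution, actual_buys, n_test_sessions):
    """Score a submitted solution.

    solution:        {session_id: iterable of predicted item ids}  (Sl and As)
    actual_buys:     {session_id: set of bought item ids}, buy sessions only (Sb and Bs)
    n_test_sessions: total number of sessions in the test set (|S|)
    """
    buy_ratio = len(actual_buys) / n_test_sessions  # |Sb| / |S|
    score = 0.0
    for session_id, predicted_items in solution.items():
        bought = actual_buys.get(session_id)
        if bought:
            predicted = set(predicted_items)
            # Reward for a correct buy prediction plus Jaccard overlap of the items.
            score += buy_ratio + len(predicted & bought) / len(predicted | bought)
        else:
            # Penalty for predicting a buy in a session without one.
            score -= buy_ratio
    return score
```

For a test set of 4 sessions of which 2 end with a buy, submitting one correct session with half of its items right and one wrong session gives (0.5 + 0.5) - 0.5 = 0.5.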
18. Results
Best Score Obtained - 45821
The following features were used to train the model:
Item clicks
Buys/click
Popular items
Hour of Day
Month Of Year
19. Conclusion
The choice of features influences the prediction score. Not all features are useful in
determining buyer behaviour, and some may even prove detrimental.
Features that take too many distinct values may cause overfitting; it is better to bin the
values into fewer groups.
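As an illustration of binning, the 24 hour values could be collapsed into two groups using the 8-17 range noted for the hour-of-click feature. The slides do not say which bins were actually used, so the split and the labels below are assumptions.

```python
def bin_hour(hour: int) -> str:
    """Collapse an hour of day (0..23) into a coarse bin (bins are illustrative)."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    # Hours 8-17 were observed to have a higher buys/click rate.
    return "peak" if 8 <= hour <= 17 else "off_peak"
```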