Presentation on the Avito Demand Prediction competition on Kaggle, which asks participants to predict the demand for an ad. The models are trained with TensorFlow (DNNRegressor) and scikit-learn (Linear Regression). Project made by Codrega under Eckovation.
3. W E L C O M E
Avito’s challenge is to predict demand for
an online advertisement based on its full
description (title, description, images,
etc.), its context (geographically where it
was posted, similar ads already posted)
and historical demand for similar ads in
similar contexts.
Note: Because the dataset was too large, all
the work was done on Google Cloud.
4. D A T A S E T
Provided by Avito, Russia’s largest classified
advertisements website.
Size of dataset = 80 GB.
5. D A T A S E T
• item_id - Ad id.
• user_id - User id.
• region - Ad region.
• city - Ad city.
• parent_category_name - Top level ad category as classified by Avito's ad
model.
• category_name - Fine grain ad category as classified by Avito's ad model.
• param_1 - Optional parameter from Avito's ad model.
• param_2 - Optional parameter from Avito's ad model.
• param_3 - Optional parameter from Avito's ad model.
• title - Ad title.
• description - Ad description.
• price - Ad price.
• item_seq_number - Ad sequential number for user.
• activation_date - Date ad was placed.
• user_type - User type.
• image - Id code of image.
• image_top_1 - Avito's classification code for the image.
6. O B J E C T I V E S
1. DATA ANALYSIS: Features are analyzed and visualized for data refining.
2. DATA REFINING: Unimportant features are removed and the remaining ones are converted to integers.
3. MODEL CREATION: Different models were created to test accuracy.
4. ML ALGORITHMS: Algorithms were applied to increase accuracy.
7. D A T A   V I S U A L I S A T I O N
• There are a lot of cheap items.
• Deal Probability reduces as
• Low prices have higher deal_probability.
8. C A T E G O R I S A T I O N
• region = 28
• city = 1022
• parent_category_name = 9
• category_name = 47
• user_type = 3
• image_top_1 = 2774
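The figures above are the number of distinct values in each categorical column. A minimal sketch of how such counts could be obtained with pandas is shown here; the small `df` is made-up illustration data, not the Avito dataset, and only two of the columns are included.

```python
import pandas as pd

# Made-up rows standing in for the real training data (assumption).
df = pd.DataFrame({
    "region": ["A", "A", "B", "C"],
    "user_type": ["Private", "Shop", "Private", "Company"],
})

categorical_columns = ["region", "user_type"]

# nunique() counts the distinct non-null values in each column.
unique_counts = df[categorical_columns].nunique()
print(unique_counts)
```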
11. R E F I N I N G
• Null values in price were replaced by the mean price of the corresponding category.
• The image column contains the image ID of the ad and hence was dropped after
the images were joined to the final dataset file.
• Images were resized from their various sizes to 32x32 pixels.
• They were converted to black and white.
• Approximately 50 GB of images were reduced to 11 GB, with each image stored as
an array of length 1024 (32 x 32) in a pickle file.
• Rows that do not have images were given 0 for all their pixel
values.
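The refining steps above could be sketched roughly as follows. This is not the project's own code: the function names are hypothetical, grouping by `category_name` is an assumption (the slide only says "categorical means"), and Pillow's 8-bit grayscale mode `"L"` is assumed for the black-and-white conversion.

```python
import numpy as np
import pandas as pd
from PIL import Image


def fill_price_with_category_mean(df, category_col="category_name"):
    """Replace null prices with the mean price of the row's category."""
    # transform("mean") returns one value per row, aligned with df.
    category_mean = df.groupby(category_col)["price"].transform("mean")
    out = df.copy()
    out["price"] = out["price"].fillna(category_mean)
    return out


def image_to_pixels(image, size=(32, 32)):
    """Return a flat array of 32 * 32 = 1024 grayscale pixel values.

    `image` is a PIL image, or None for an ad without an image,
    in which case every pixel is 0.
    """
    if image is None:
        return np.zeros(size[0] * size[1], dtype=np.uint8)
    small = image.convert("L").resize(size)  # "L" = 8-bit grayscale
    return np.asarray(small, dtype=np.uint8).ravel()
```

For an image on disk, `image_to_pixels(Image.open(path))` would give the 1024-value row; the resulting arrays could then be written out with `pickle.dump`.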
12. R E F I N I N G
• description was not analysed due to time constraints and was dropped.
• Stop words would have been removed.
• The description would have been tokenized into words.
• Most common words would have been removed.
• Dummies would have been created for each word.
• user_type contains 3 unique values (Private, Shop and Company),
hence dummies were created.
• user_type was dropped.
• Shop was dropped.
• item_id was unique for every row and hence was dropped.
• Null values in param_1, param_2 and param_3 were filled with a unique
placeholder value (missing).
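The dummy encoding of user_type and the placeholder for null parameters could look roughly like this in pandas. The rows are made-up illustration data and only param_1 is shown; this is a sketch under those assumptions, not the project's code.

```python
import pandas as pd

# Made-up rows for illustration (assumption).
df = pd.DataFrame({
    "user_type": ["Private", "Shop", "Company"],
    "param_1": ["x", None, "y"],
})

# One 0/1 indicator column per user_type value.
dummies = pd.get_dummies(df["user_type"], dtype=int)

# Drop the original column and the Shop indicator: a row with
# Private == 0 and Company == 0 is then implicitly a Shop.
encoded = pd.concat(
    [df.drop(columns="user_type"), dummies.drop(columns="Shop")],
    axis=1,
)

# Give null optional parameters their own category.
encoded["param_1"] = encoded["param_1"].fillna("missing")
print(encoded)
```

Dropping one of the three indicators avoids a redundant column, since the third value is fully determined by the other two.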
13. R E F I N I N G
We tried to translate the data from Russian to English using
the Google Translate API. In the end the data was not
translated, because the API becomes paid after a number of
translations and because of time constraints.
14. P R E – P R O C E S S I N G
• Every string value was assigned a unique integer ID.
• These IDs were stored in a dictionary and later in a JSON file for future
mapping of the data.
• The columns changed were user_type (Private and Company), region,
city, category_name, image_top_1 and parent_category_name.
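A minimal sketch of such a string-to-integer mapping, using only the standard library. The helper name `build_id_mapping` and the region values are hypothetical, and the IDs here simply follow the order in which values are first seen.

```python
import json


def build_id_mapping(values):
    """Assign consecutive integer IDs to distinct strings, in order seen."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


regions = ["Region A", "Region B", "Region A"]  # made-up values
region_ids = build_id_mapping(regions)
encoded = [region_ids[r] for r in regions]

# Serialise the dictionary so the same mapping can be reused later.
# ensure_ascii=False keeps Cyrillic strings readable in the output.
serialised = json.dumps({"region": region_ids}, ensure_ascii=False)
restored = json.loads(serialised)["region"]
```

Writing to a file instead of a string would use `json.dump` with an open file handle.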
15. P R E – P R O C E S S I N G
The final data-frame was made with:
1. 1,503,424 rows x 1,040 columns
2. 8.5 GB CSV file
3. 11.2 GB Feather file
The data was too large to handle at once, so it was split into 15 CSV files of approx. 566 MB,
each containing about 100,000 rows.
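One way to split a large CSV without loading it all into memory is to read it in chunks. The sketch below assumes pandas; the function name `split_csv` and the `part_NN.csv` file naming are hypothetical.

```python
import os

import pandas as pd


def split_csv(path, out_dir, rows_per_file=100_000):
    """Write `path` out as numbered CSV parts of at most `rows_per_file` rows."""
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    # chunksize makes read_csv yield DataFrames lazily instead of
    # loading the whole file into memory.
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_file)):
        part_path = os.path.join(out_dir, f"part_{i:02d}.csv")
        chunk.to_csv(part_path, index=False)
        part_paths.append(part_path)
    return part_paths
```

Each part can then be loaded and processed on its own, for example one file per training step.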
MODEL INITIALIZATION, TRAIN, SCORE & ROOT MEAN SQUARE ERROR:
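The four steps named on this slide could be sketched with scikit-learn as follows. The data is synthetic and stands in for the real feature matrix and the deal_probability target; the train/test split and every number here are assumptions for illustration, not the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (assumption).
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([0.1, 0.2, 0.0, 0.3, 0.1]) + rng.normal(0, 0.01, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()        # model initialization
model.fit(X_train, y_train)       # train
r2 = model.score(X_test, y_test)  # score: R^2 on held-out data

# Root mean square error, computed from the mean squared error.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.4f}")
```

Note that `score` reports R^2 for regressors, so it is a different quantity from the RMSE on which the competition is evaluated.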
17. D N N - R E G R E S S O R
MODEL INITIALIZATION: