EE660 Project
Walmart Recruiting: Trip Type
Classification
Shanglin Yang, shangliy@usc.edu
Yi Zheng, zhen578@usc.edu
December 8th, 2015
Instructor: Professor B. Keith Jenkins
Table of Contents

Abstract
1. Project Homepage
2. Problem statement and goals
3. Literature Review
4. Prior and Related Work
5. Project Formulation and Setup
6. Methodology
7. Implementation
7.1 Feature Space
7.2 Pre-processing and Feature Extraction
7.3 Training Process
7.3.1 Naïve Bayes classifier
7.3.2 K-nearest neighbors (KNN) classifier
7.3.3 Scikit-based SVM, Random Forest and Adaboost
7.4 Testing, Validation and Model Selection
8. Final Results
9. Interpretation
10. Summary and conclusions
Abstract
The project aims to help Walmart classify customer trip types using only the dataset of the items customers have purchased. We apply machine learning methods to the purchase history data and the trip-type labels provided by Walmart. The main challenges are transforming the raw data into features that represent trip type well and learning a predictive model based on these features. We first look for the best representation of the sample data as features, then solve the trip-type classification problem with five machine learning methods: naïve Bayes, K-nearest neighbors (KNN), Support Vector Machine (SVM), Random Forest and adaptive boosting. Random Forest performs best, and we use this model as the final classification system to predict the trip types of unseen customer purchase data. We obtain a reasonable predictive score, though there remains substantial room for improvement. We find that features are essential in applied machine learning, and we will next explore feature selection and extraction from the raw data further to improve the accuracy of our classification system.
1. Project Homepage
https://github.com/ee660finalproject/EE660_Group_pro
2. Problem statement and goals
Walmart improves customers' shopping experiences by segmenting their store visits into different
trip types. Whether they're on a last minute run for new puppy supplies or leisurely making their
way through a weekly grocery list, classifying trip types enables Walmart to create the best
shopping experience for every customer. Currently, Walmart's trip types are created from a
combination of existing customer insights and purchase history data. In this problem, we will focus
on the purchase history data and classify customer trips using only a transactional dataset of the
items they've purchased. The goal is to help Walmart refine their segmentation process by
improving the data behind trip type classification. Walmart has categorized the trips contained in
this data into 38 distinct types with 647054 training samples and 653646 test samples.
Data Fields:
• TripType - a categorical id representing the type of shopping trip the customer made. TripType_999 is an "other" category.
• VisitNumber - an id corresponding to a single trip by a single customer
• Weekday - the weekday of the trip
• Upc - the UPC number of the product purchased
• ScanCount - the number of the given item that was purchased. A negative value indicates a product return.
• DepartmentDescription - a high-level description of the item's department
• FinelineNumber - a more refined category for each of the products, created by Walmart.
This is an interesting and challenging problem. Since we are not provided with more information
than what is given in the data (e.g. what the TripTypes represent or more product information), we
need to mine the useful information behind the purchase history data by ourselves to predict the
trip types. Using both art (customer insights) and science (purchase history data) will help Walmart
make progress on the core mission of better understanding and serving their customers. The
challenge is to recreate this categorization/clustering with a more limited set of features. This could
provide new and more robust ways to categorize trips. The problem requires a significant amount of preprocessing: there is some missing data, and each customer has exactly one trip type but may purchase more than one commodity. Grouping the samples by VisitNumber, we find that there are 94247 customers. The raw feature space has a very high dimensionality, 102984 dimensions in total, and the resulting matrix is sparse. We therefore need to perform feature selection and feature extraction to reduce the dimensionality. A good selection of features leads to excellent classification, but it is hard to select them among such a huge number of features. Finally, since the project involves massive data, the training procedure is time consuming.
3. Literature Review
Reference: Largeron, Christine, Christophe Moulin, and Mathias Géry. "Entropy based feature selection for text categorization." Proceedings of the 2011 ACM Symposium on Applied Computing. ACM, 2011.
This paper reviews several feature selection methods, including document frequency (DF), information gain (IG), mutual information (MI), χ², odds ratio and GSS, and proposes a feature selection criterion called Entropy based Category Coverage Difference (ECCD). From the paper, we get some basic ideas of information-theoretic feature selection and the implementation of information gain (IG). We then implement the algorithm in our feature selection.

The basic idea of feature selection is to build a function that tries to capture the intuition that the best terms for a class ci are the ones distributed most differently in the sets of positive and negative examples of ci. We choose to use IG for its easy implementation as well as its high performance.
Given a term tj and a category ck, IG(tj , ck) can be computed from a contingency table. Let A be
the number of documents in the category containing tj ; B, the number of documents in the other
categories containing tj ; C, the number of documents of ck which do not contain tj and D, the
number of documents in the other categories which do not contain tj (with N = A + B + C + D):
Fig.1 The ECCD Matrix
In our problem, we use a 97714×38 matrix $IGM$, where each row stands for a term (Upc number) and each column stands for a category (trip type). Each element of the matrix is the number of occurrences of that Upc number in that trip type. Then:

$A_{jk} = IGM_{jk}$

$B_{jk} = \sum_{i=1}^{38} IGM_{ji} - IGM_{jk}$

$C_{jk} = \sum_{i=1}^{97714} IGM_{ik} - IGM_{jk}$

$D_{jk} = \sum_{m=1}^{97714} \sum_{i=1}^{38} IGM_{mi} - A_{jk} - B_{jk} - C_{jk}$

Using the contingency table, the Information Gain can be estimated by:
$IG(t_j, c_k) \approx -\frac{A+C}{N}\log\left(\frac{A+C}{N}\right) + \frac{A}{N}\log\left(\frac{A}{A+B}\right) + \frac{C}{N}\log\left(\frac{C}{C+D}\right)$    (1)
Then we can use the IG value as a criterion for choosing features: for example, keep the features whose IG exceeds a threshold.
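For concreteness, a minimal sketch of this computation in Python (variable names are ours; we assume an occurrence-count matrix igm of shape (number of terms, number of classes) as defined above):

import numpy as np

def information_gain(igm):
    # igm[j, k]: occurrences of term j (Upc) in class k (trip type)
    igm = igm.astype(float)
    N = igm.sum()
    A = igm                                    # in class, contains term
    B = igm.sum(axis=1, keepdims=True) - igm   # other classes, contains term
    C = igm.sum(axis=0, keepdims=True) - igm   # in class, lacks term
    D = N - A - B - C                          # other classes, lacks term
    eps = 1e-12                                # guard against log(0)
    return (-(A + C) / N * np.log((A + C) / N + eps)
            + A / N * np.log(A / (A + B + eps) + eps)
            + C / N * np.log(C / (C + D + eps) + eps))

# e.g. keep the 5000 terms whose best per-class IG is highest:
# top5000 = np.argsort(information_gain(igm).max(axis=1))[::-1][:5000]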
4. Prior and Related Work
There is no prior or related work.
5. Project Formulation and Setup
After analyzing the problem description and goal, we decided to implement a standard machine learning training and testing process: use the known sample features and labels to train a multi-class classification model, which can then assign a class to a new sample with the same features. In our case, we implement five algorithms to do the classification:
5.1 Naïve Bayes classifier
Naïve Bayes is a simple generative classifier, a model of the form:

$p(y, x \mid \theta) = p(y \mid \pi) \prod_{j=1}^{D} p(x_j \mid y, \theta)$    (2)
It is fit by MAP estimation with a vague Dirichlet prior (add-one smoothing). Typically, the results are not too sensitive to the setting of this prior (unlike discriminative models). The model has two fields, theta(c, j) and classPrior(c), and serves as our baseline classifier.
Table 1 Parameter within Naïve Bayes classifier
model.theta(c, j) The probability the feature j turns on in trip type c
model.classPrior(c) The probability of trip type c
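As a sketch of what the fitted model contains (our actual implementation uses pmtk3's naiveBayesFit in MATLAB, Section 7.3.1; this is only an equivalent Bernoulli model with add-one smoothing, written in Python with hypothetical names):

import numpy as np

def naive_bayes_fit(X, y):
    # X: (n_samples, n_features) binary matrix; y: trip-type labels
    classes = np.unique(y)
    class_prior = np.array([(y == c).mean() for c in classes])
    # theta[c, j] = P(feature j on | class c), with add-one smoothing
    theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                      for c in classes])
    return classes, class_prior, theta

def naive_bayes_predict(model, X):
    classes, class_prior, theta = model
    # log posterior (up to a constant) for every sample/class pair
    log_post = (np.log(class_prior)
                + X @ np.log(theta).T
                + (1 - X) @ np.log(1.0 - theta).T)
    return classes[np.argmax(log_post, axis=1)]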
5.2 K-nearest neighbors (KNN) classifier
KNN is a generative classifier in which the class-conditional density is a non-parametric kernel density estimator. The function is only approximated locally from the samples, and all computation is deferred until classification. KNN can weight the contributions of the neighbors, so that nearer neighbors contribute more to the vote than more distant ones. In this problem, the KNN model finds the training samples closest to a test sample and assigns the majority training label to the test sample.
• Parameters:
— similarity function: $K: X \times X \to \mathbb{R}$
— number of nearest neighbors to consider: k = 38
• Prediction rule: for a test sample $x'$, let $\mathrm{knn}(x')$ be the k training samples nearest to $x'$ in Euclidean distance $d(x', x_i)$; then predict

$y(x') = \arg\max_{y \in Y} \sum_{i \in \mathrm{knn}(x')} 1[y_i = y]$
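A minimal equivalent of this rule in Python (our implementation uses pmtk3's knnFit/knnPredict, Section 7.3.2; scikit-learn's KNeighborsClassifier applies the same majority vote, and the data variables here are assumed to exist):

from sklearn.neighbors import KNeighborsClassifier

# Xtrain, ytrain, Xtest: preprocessed features and labels (Section 7)
knn = KNeighborsClassifier(n_neighbors=38, metric='euclidean')
knn.fit(Xtrain, ytrain)
ypred = knn.predict(Xtest)   # majority vote among the 38 nearest neighbors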
5.3 SVM (Multi-class)
Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of the categories, an SVM training algorithm builds a model that assigns new examples to a side of the decision boundary defined by the kernel function. The basic idea behind the SVM, shown in Fig.2(a), is to maximize the margin and minimize the training error simultaneously.
The SVM can fit high-dimensional datasets effectively, and it uses only a subset of the training points in the decision function (the support vectors), so it is also memory efficient. There are several modifications we apply to fit our problem:
Fig.2 The SVM principle (a) and non-linear model (b)
As shown in Fig.2(b), we need to fit a non-linear classifier. To do that, we use a kernel function to map the non-linear problem, choosing the Radial Basis Function (RBF) kernel because it requires fewer parameters to optimize and measures distance appropriately in high dimensions. The mathematical principle is described below:
Training: maximize

$D(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j)$    (3)

s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$, with the RBF kernel

$K(\vec{x}_i, \vec{x}_j) = \exp\left(-\|\vec{x}_i - \vec{x}_j\|^2 / \sigma^2\right)$

Classification: for a new example $\vec{x}$,

$h(\vec{x}) = \mathrm{sign}\left(\sum_{i \in SV} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b\right)$    (4)
Since the SVM is in general a binary classifier, we use the one-vs-one (ovo) or one-vs-rest (ovr) strategy to make the model fit the multi-class problem; we also use cross validation to choose the best parameters.

The parameters include the penalty parameter C, the kernel function ('linear', 'rbf', 'poly'), the degree of the polynomial kernel (int), gamma (float), and decision_function_shape ('ovo', 'ovr').
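A minimal sketch of this setup with scikit-learn (the parameter values shown are the ones later selected by cross validation in Section 7.4; the data variables are assumed):

from sklearn.svm import SVC

# RBF-kernel SVC; multi-class handling via decision_function_shape
svc = SVC(C=100, kernel='rbf', gamma=0.01, decision_function_shape='ovr')
svc.fit(Xtrain, ytrain)
ypred = svc.predict(Xtest)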
5.4 Random forest Classification
Random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting (Fig.3).

It generally produces good results given sufficient trees, since it also uses random subsets of the features to decorrelate the individual trees.

The parameters we modify include the number of trees/estimators (int) and max_depth (int); other parameters are left at their automatic defaults.
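A minimal sketch (again with the parameter values selected later by cross validation; n_jobs=-1 is our addition to parallelize across cores):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=800, max_depth=24, n_jobs=-1)
rf.fit(Xtrain, ytrain)   # other parameters keep their scikit-learn defaults
ypred = rf.predict(Xtest)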
5.5 Adaboost Classification
An AdaBoost classifier is another meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases. It is based on weak estimators, using subsets of features as well as samples, and boosts the results iteratively.

AdaBoost suits our problem because it handles the high dimensionality and continually penalizes misclassified data, which improves performance. It is also relatively resistant to overfitting.
Fig.3 The framework of Random Forest

The algorithm consists of a training process with weighting and boosting. The weak classifier we use in our implementation is a basic decision-tree classifier with small depth. After each base estimator produces its intermediate results, the misclassified samples are assigned larger weights so that the next round classifies them better.

The parameters we adjust include the number of trees/estimators (int) and the learning rate (float); we let the system choose the other parameters automatically.
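A minimal sketch of this configuration (the depth of the base decision tree is an assumption for illustration, since we only specify "small depth" above):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3),  # shallow weak learner; 'estimator' in newer scikit-learn
    n_estimators=800, learning_rate=1.0)
ada.fit(Xtrain, ytrain)
ypred = ada.predict(Xtest)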
6. Methodology
The framework is shown in Fig.4. The whole pipeline consists of preprocessing, training and evaluation.
6.1 Preprocessing
Too much information can reduce the effectiveness of classifier learning. It may actually
detract from the quality and accuracy of the model. Thus, the representation and quality of data is
first and foremost before running a classifier. In this section, we need to determine the actual
features for training our systems. Since we can hardly find information meaningful for classifying the trip types directly in the original purchase history data, we assemble data attributes and exploit the relationships between features to form a new training set. Moreover, the new
training data set with many attributes may contain groups of attributes that are correlated. These
attributes may actually be measuring the same underlying feature. The redundant attributes simply
add noise to the data and affect model accuracy. Noise increases the complexity of the model and
the time and system resources needed for model building and scoring. The higher the
dimensionality of the processing space, the higher the computation cost involved in algorithmic
processing. To minimize the effects of noise, correlation, and high dimensionality, some form of
dimension reduction is sometimes a desirable preprocessing step. Feature selection and extraction
are approaches to dimension reduction. The product of data preprocessing and feature extraction
is the final training data.
6.2 Training Process
Once we get the final training set after preprocessing and feature extract, we segment the final
training data into two parts: training set and test set. To avoid data snoop, we set the test
set aside and never look at it in the training process. We apply cross validation method to the
training process, i.e. the training set is divided into five equal-size sets, each time four of them are
used for training the classifier and the rest is used for testing the performance for the specific model
and parameters. In our project, we use total five machine learning methods: naïve Bayes, KNN,
SVM, Random Forest and Adaptive Boosting. By cross validation, we will find the best classifier
for each training method.
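A minimal sketch of this 5-fold procedure (the module path matches scikit-learn versions of that era; newer versions use sklearn.model_selection, and clf stands for any of the five classifiers):

import numpy as np
from sklearn.cross_validation import KFold

def cv_error(clf, X, y, n_folds=5):
    # average validation error over the folds; the held-out test set
    # from the train/test split is never touched here
    errors = []
    for tr, va in KFold(len(y), n_folds=n_folds, shuffle=True):
        clf.fit(X[tr], y[tr])
        errors.append(np.mean(clf.predict(X[va]) != y[va]))
    return np.mean(errors)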
In particular, we need to consider the hypothesis sets of the different algorithms, since they relate to the feasibility and performance of the learning.
Naïve Bayes: The naïve Bayes probability model is an independent-feature model. The naïve Bayes classifier combines this model with a decision rule. One common rule is to pick the most probable hypothesis; this is known as the maximum a posteriori (MAP) decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some k as follows:

$\hat{y} = \arg\max_{k \in \{1,\dots,K\}} p(C_k) \prod_{j=1}^{D} p(x_j \mid C_k)$    (5)
K-nearest Neighbors: Since KNN is an instance-based learning algorithm, in a k-NN model, a
hypothesis is built from the training data directly at the time a query is made to the system. The
prediction is based on the K training instances closest to the case being scored. Therefore, all
training cases have to be stored, which may be problematic when the amount of data is large.
SVM: We use the RBF kernel and the ovo/ovr strategies to solve the multi-class problem. Recalling formulations (3) and (4), our target parameter is the vector $\vec{\alpha} \in \mathbb{R}^n$, where n is the number of training samples, so our hypothesis set is the subset of the n-dimensional space $\mathbb{R}^n$ satisfying the constraints of (3).

Fig.4 The Framework of the Whole System
Random Forest and AdaBoost decision trees: In both cases the basic idea is to use decision trees as weak classifiers. Each tree i has a unit hypothesis set $h_i$ whose size is determined by its required depth and halting condition; in general, for depth d with n nodes, $|h_i| = D^n$, where D is the feature dimension. For the whole system, the hypothesis set is $H = \bigcup_{i=1}^{N} h_i$, where N is the number of decision trees.
6.3 Evaluation
After the training process, we get the optimal model for each classification method. In this
evaluation section, we use the test set to evaluate the performance of each classifier. We select the
classifier with the best performance as the final classification system. Then we complete the Walmart competition: predict the trip types from the customer purchases and submit them to Kaggle.com to get a score for our classification system.
(https://www.kaggle.com/c/walmart-recruiting-trip-type-classification/submissions/attach)
7. Implementation
7.1 Feature Space
In the original dataset, each sample represents one purchased commodity, and one customer usually purchases more than one commodity. Since the goal is to predict the trip type of the customer, we need to know when and what a customer buys in his/her trip. We therefore first assemble the samples belonging to each customer, so that each sample in the new dataset represents one customer. Each sample has features such as the purchasing weekday and the DepartmentDescription, Upc and FinelineNumber of all purchased items with their corresponding quantities.
7.2 Pre-processing and Feature Extraction
STEP1:
              Original data size   Missing data size   New data size
Training set  647054               4129                642925
Test set      653646               3986                649660
The original data sets contain some missing data, but the amount is relatively small, so we simply discard the affected rows, for example as sketched below.
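A minimal sketch of this step with pandas (file names follow the Kaggle competition's data files):

import pandas as pd

train = pd.read_csv('train.csv')
train = train.dropna()          # 647054 -> 642925 rows
test = pd.read_csv('test.csv')
test_clean = test.dropna()      # 653646 -> 649660 rows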
STEP2:
Depending on the VisitNumber, we can determine each customer id and then merge each customer's items into one sample to form the new data set. There are 95674 visits in the training set and 95674 visits in the test set. After discarding the missing data, the actual number of customers in the training set is 94247 and in the test set 94288. For training, it is fine to use the 94247 samples; for testing, we use our classification system to predict trip types for the 94288 customers and assign trip type 999 ("other") to the 1386 customers with missing data.
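A sketch of the merging step, continuing the pandas example above (column names are the competition's data fields):

# one group per visit; every row of a visit shares the same TripType
visits = train.groupby('VisitNumber')
labels = visits['TripType'].first()   # one label per visit
print(visits.ngroups)                 # 94247 visits after cleaning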
STEP3:
Weekday is a categorical feature representing the purchasing weekday for each customer. We use a 7-digit binary number, setting the corresponding bit to "1" and the others to "0" to denote the weekday. The format is shown below:

7-digit  'Friday'  'Monday'  'Saturday'  'Sunday'  'Thursday'  'Tuesday'  'Wednesday'
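A sketch of this encoding (the column order matches the format row above):

import numpy as np

WEEKDAYS = ['Friday', 'Monday', 'Saturday', 'Sunday',
            'Thursday', 'Tuesday', 'Wednesday']

def weekday_feature(day):
    bits = np.zeros(7, dtype=int)    # 7-digit binary encoding
    bits[WEEKDAYS.index(day)] = 1
    return bits

weekday_feature('Sunday')            # -> array([0, 0, 0, 1, 0, 0, 0])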
DepartmentDescription is a categorical feature describing the item's department. Since it shows the properties and functions of the items bought by the customers, we think it carries a large amount of information for classifying the trip type. In total there are 68 descriptions for all commodities, and we use all of them in a 68-digit number: for each customer, the entry corresponding to a purchased item's department is set to that item's ScanCount, summed over the purchased items, rather than just 0 or 1.
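A sketch of this encoding for one visit (dept_index, a dict mapping each of the 68 descriptions to a column, is a hypothetical helper):

import numpy as np

def department_feature(visit_rows, dept_index):
    # 68-dim vector: summed ScanCount per department for one visit;
    # returned items (negative ScanCount) decrease the corresponding entry
    v = np.zeros(len(dept_index))
    for _, row in visit_rows.iterrows():
        v[dept_index[row['DepartmentDescription']]] += row['ScanCount']
    return v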
FinelineNumber is a more refined product category created by Walmart, and this feature gives more information for classification. There are 5195 distinct FinelineNumbers in the training set, and we use a 5195-digit binary number to represent this feature. For each customer, the corresponding bit is set to 1 if the customer has purchased the item; if a customer in the test set buys none of the 5195 FinelineNumber items, all 5195 digits are set to 0.
Upc is the UPC number of each product. There are 97714 distinct UPCs in the training data, and it is not feasible to use a 97714-bit feature. Our idea is to choose the 5000 most representative UPC numbers out of the 97714, i.e. those most useful in separating the training samples into the given classes. To determine the 5000 UPC numbers, we apply the entropy-based Information Gain method of Section 3, which provides very good information-theoretic feature selection: we pick the 5000 UPC numbers with the highest Information Gain. For each customer, the corresponding bit is set to "1" if the customer has purchased an item with one of the 5000 UPC numbers; if a customer buys no item with one of the 5000 UPC numbers, all 5000 digits are set to "0".
STEP4:
We assemble all the new features obtained in STEP3 for each customer in this order:

Weekday  DepartmentDescription  FinelineNumber  Upc

After preprocessing and feature extraction, the new training set has size 94247×10270 and the test set 94288×10270. The rows stand for the samples and the columns for the features.
STEP5:
Feature reduction: The data set obtained by the preprocessing and feature selection above has 10270 dimensions, which is still a huge number of features. Many irrelevant features simply add noise to the data and hurt classifier accuracy, and some features are highly correlated, which reduces the effectiveness of the classifier. So we apply feature extraction to reduce the features, as sketched after this list:

1. 7-digit Weekday feature: we keep the 7 digits unchanged. (→ 7-digit)
2. 68-digit DepartmentDescription feature: we divide this feature into two parts:
(1) Apply the LDA method to the 68-dimensional feature space and transform it to a lower 37-dimensional space. (→ 37-digit)
(2) By mining the data, we find that among the 68 department descriptions there are 20 pairs of descriptions highly correlated with the trip type. So we use 20 binary bits to denote whether each pair occurs in the customer's trip. (→ 20-digit)
3. 5195-digit FinelineNumber feature: we use randomized PCA to reduce the dimension from 5195 to 68. (→ 68-digit)
4. 5000-digit Upc feature: we use randomized PCA to reduce the dimension from 5000 to 136. (→ 136-digit)

By feature extraction, we reduce the feature dimension from 10270 to 268. Feature reduction reduces the time and storage space required, the removal of collinearity improves the performance of the machine learning model, and it also reduces overfitting.
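A minimal sketch of steps 2-4 (the module paths match scikit-learn of that era, where RandomizedPCA and sklearn.lda.LDA were available; X_dept, X_fine, X_upc and y are the per-block feature matrices and labels, names ours):

from sklearn.decomposition import RandomizedPCA
from sklearn.lda import LDA   # sklearn.discriminant_analysis in newer versions

lda = LDA(n_components=37).fit(X_dept, y)       # supervised, 68 -> 37 dims
dept_lda = lda.transform(X_dept)
fine_pca = RandomizedPCA(n_components=68).fit_transform(X_fine)   # 5195 -> 68
upc_pca = RandomizedPCA(n_components=136).fit_transform(X_upc)    # 5000 -> 136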
7.3 Training Process
7.3.1 Naïve Bayes classifier: We use naiveBayesFit from pmtk3 to fit a naïve Bayes model by MAP estimation and predict with the model using naiveBayesPredict. Since the features are binary, $p(x_j \mid y = c, \theta) = \mathrm{Ber}(x_j \mid \theta_{jc})$, fit by MAP estimation with a vague Dirichlet prior (add-one smoothing). From the input training data, we compute the frequency of each trip type among the labels and use it as the prior for each class. By counting the total numbers of 1's and 0's for each bit, we get the likelihood that each bit turns on in each class. The likelihoods and priors are the model parameters.
7.3.2 K-nearest neighbors (KNN) classifier: We fit the model using knnFit from pmtk3. The KNN model is generated directly from the input training data, since it records each training sample and the corresponding class label. Once we fit the KNN model, we use knnPredict to find the predicted trip types: for each test sample, the classifier finds the K nearest training samples in the model and assigns the majority class label to the test sample. The input arguments are the training samples (X), the training labels (y) and the number of neighbors (K).
model = knnFit(Xtrain, ytrain, K);    % fit the pmtk3 KNN model
label = knnPredict(model, Xtest);     % predict labels for the test samples
7.3.3 Scikit-based SVM, Random Forest and Adaboost
We implement the multi-class SVM, Random Forest and AdaBoost using the scikit-learn library. The flowchart of the whole procedure is shown below.
Fig.5 The Training Process of the Scikit Based Training
The key concepts involved in the whole process are dimension reduction, parameter search and model building.

scikit-learn provides the functions needed for training. We use the functions below for training and testing and vary some of the basic parameters to observe the performance; details are shown in the next section.
Table 2 Functions within scikit-learn based training

                  SVC              Random Forest                             AdaBoost
Training (model)  sklearn.svm.SVC  sklearn.ensemble.RandomForestClassifier   sklearn.ensemble.AdaBoostClassifier
Predicting        model.predict    model.predict                             model.predict
7.4 Testing, Validation and Model Selection
Since our work involves numerous models and parameters, we need cross validation to choose the parameters.
7.4.1 Flowchart of cross validation
We randomly choose 60000 samples from the original 94247 samples as X_Train. The rest of the samples are taken as X_Test, which is set aside until the final test.

We then separate X_Train into 5 subsets of 12000 samples each. Each time, the model uses 4 subsets as the training set and the remaining one as the validation set. We record the results, change the parameters, and repeat the process.
7.4.2 Model Target and Parameter candidate sets.
We have 5 basic models/algorithms, but since naïve Bayes and KNN are baseline models, we care more about how the performance of the other three models changes with their parameters. We use the grid search algorithm for the parameter search and optimization, with the tool sklearn.grid_search.GridSearchCV.

We use this method because we have many samples as well as many classes to train on, so we need to shrink the candidate sets to keep the computation fast.

For each model we set different search parameters and their candidate sets, for example as sketched after the SVC table below:
SVC             C                   Kernel   Gamma                  decision_function
Candidate sets  [1, 10, 100, 1000]  ['rbf']  [0.01, 0.001, 0.0001]  ['ovo', 'ovr']
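A minimal sketch of the grid search for the SVC (the module path matches scikit-learn of that era; variable names are ours):

from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions
from sklearn.svm import SVC

param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.01, 0.001, 0.0001],
              'decision_function_shape': ['ovo', 'ovr']}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_train, y_train)       # the 60000-sample X_Train split
print(search.best_params_)         # best found: C=100, gamma=0.01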
The cross-validation results are shown below:

Fig.6 The SVC cross-validation results
[panels: error rate vs C and error rate vs -log(gamma), each plotted for ovo and ovr]

We can see clearly that decision_function_shape does not make much difference. Based on the results, we choose the best parameters: {'C': 100, 'Gamma': 0.01}.

Random Forest   N_estimator                           Max_depth
Candidate sets  [50, 100, 200, 400, 600, 800, 1000]   [2, 4, 8, 12, 16, 20, 24, 28, 30]

The cross-validation results are shown below:
Fig.7 The Random Forest cross-validation results
[panels: error rate vs N_estimator and error rate vs Max_depth]

Based on the results, we choose the best parameters for the random forest: {'N_estimator': 800, 'Max_depth': 24}.

Adaboost        n_estimator                      Learning rate
Candidate sets  [50, 100, 200, 400, 600, 800]    [0.6, 0.8, 1, 1.2]
Fig.8 The Adaboost cross-validation results
[panels: error rate vs N_estimator and error rate vs Learning Rate]

Based on the results, we choose the best parameters for Adaboost: {'N_estimator': 800, 'Learning Rate': 1}.
7.4.3 Final Test
We test our results in two ways. First we use X_Test, which was never looked at during the whole training process, to evaluate the different models with their best parameters and obtain the predictions as well as the error rates.

Then we decide our final model based on these error rates as well as the computation time. Using the test.csv downloaded from Kaggle and the whole train.csv for training, we produce the labels for the test file, upload them and obtain the result.
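A sketch of this final step (variable names are ours; the submission formatting itself follows the competition's sample file and is not reproduced):

from sklearn.ensemble import RandomForestClassifier

# retrain the selected model on all of train.csv, predict test.csv
final = RandomForestClassifier(n_estimators=800, max_depth=24, n_jobs=-1)
final.fit(X_full_train, y_full_train)
test_pred = final.predict(X_kaggle_test)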
8. Final Results
Table 3 Final Results

Algorithm                  E_in     E_out   Best parameters                             Computation time (20000 samples, s)
Naïve Bayes                0.345    0.415   -                                           600
K-nearest neighbors (KNN)  0.0799   0.441   -                                           900
SVM                        0.005    0.281   {'C': 100, 'Gamma': 0.01}                   800
Random Forest              0.007    0.228   {'N_estimator': 800, 'Max_depth': 24}       600
Adaboost                   0.080    0.255   {'N_estimator': 800, 'Learning Rate': 1}    1800
Based on the results above, the Random Forest algorithm performs best, with the lowest E_out. We use the Random Forest model as the final classifier in our classification system. We then apply the system to the test samples provided by Walmart and predict the trip types; the result is shown below:
Online competition submission performance:
Team name in Kaggle.com: Guoshiwushuang
Public score: 9.33204
According to the leaderboard, the best public score is 0.50519 and the worst is 34.53878 (lower is better). The score is calculated on approximately 30% of the test data. We submitted 6 times and improved our score from about 20 to 9, which shows that our method does work.
Meanwhile, from the competition forum topics, we find that some teams with better scores also apply the random forest method. We think the reason our system cannot reach their scores is that the features we select and extract are still not efficient and accurate enough for the classification. We need to find better relationships between the product features and the trip types and transform the original features more effectively.
9. Interpretation
Our final results are not very satisfying compared to other entries on the leaderboard. Meanwhile, the table shows a clear overfitting phenomenon for all algorithms, especially the random forest, even though it achieved the best performance. However, we still learned a lot from the process and the results.
9.1 Feature engineering and machine learning
The key problem in our work is not the implementation of the learning algorithms but the use of features. The challenge has two parts. First, the features are highly sparse, so an individual feature carries little valuable information for the classification. Second, both the FinelineNumber and UPC fields are of very high dimension (almost 100000), which makes them hard to handle.

We implemented several methods for feature selection and extraction. However, the results are not satisfactory across the different algorithms, most directly because our features are still not sufficiently powerful.
In our case, for example, one cannot decide the trip type from just one or two objects a customer bought, because the objects people buy vary widely. Not only do the original features fail to fully capture the trip type; some features may even be confusing and harmful to the classification.

Our method combines manual search with selection based on the IG value, which improved the performance compared with using only the original features. But the feature engineering is still far from solved, and the process is still hard to trace. One reason could be that we are still unable to describe the relations between different features, so our new features may still not be discriminative enough for classification. Another reason is the perennial trade-off between precision and recall: when we bring in new features that help us find more samples of one class, there is a larger chance of misclassifying samples of other classes at the same time.
9.2 Limitations of handling large-scale data
Besides feature selection, the computational cost of the large dataset also hurts our final performance. To speed up the computation, we had to use small sets of samples, which may have prevented us from extracting more information from the dataset. Spending so much time on the data also makes it hard to trace individual samples or features; especially in the parameter-search process, the long computation time kept us from getting sufficient estimates of the performance.

To address this, we need more advanced algorithms. Since our data is sparsely distributed, randomized PCA is helpful, and we would in fact need more algorithms of this kind to solve our problem.
10. Summary and conclusions
In our project, by comparing our score with others' on the leaderboard, we find that even when applying the same machine learning method, different features yield different performance. Transforming the raw data into features that better represent the underlying problem influences the accuracy of the model on unseen data. Features play an important role in the success of applied machine learning. Algorithms are very important, and we usually invest our main effort in them; however, good features are key to making the algorithms work well and to guaranteeing a predictive model. Better features mean flexibility, simpler models and better performance. Next, we will try our best to find better features in the purchase history data before training our model.

More Related Content

What's hot

Decision tree induction
Decision tree inductionDecision tree induction
Decision tree inductionthamizh arasi
 
Feature selection on boolean symbolic objects
Feature selection on boolean symbolic objectsFeature selection on boolean symbolic objects
Feature selection on boolean symbolic objectsijcsity
 
Comparative study of frequent item set in data mining
Comparative study of frequent item set in data miningComparative study of frequent item set in data mining
Comparative study of frequent item set in data miningijpla
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryIJERA Editor
 
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...inventionjournals
 
Weka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule GenerationWeka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule Generationrsathishwaran
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methodsrajshreemuthiah
 
An Efficient Approach for Asymmetric Data Classification
An Efficient Approach for Asymmetric Data ClassificationAn Efficient Approach for Asymmetric Data Classification
An Efficient Approach for Asymmetric Data ClassificationAM Publications
 
Cs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_miningCs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_mininghari91
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.docbutest
 
Application of data mining tools for
Application of data mining tools forApplication of data mining tools for
Application of data mining tools forIJDKP
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceIJDKP
 
Survey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data MiningSurvey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data Miningijsrd.com
 
Lecture 2 - Classes, Fields, Parameters, Methods and Constructors
Lecture 2 - Classes, Fields, Parameters, Methods and ConstructorsLecture 2 - Classes, Fields, Parameters, Methods and Constructors
Lecture 2 - Classes, Fields, Parameters, Methods and ConstructorsSyed Afaq Shah MACS CP
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data miningeSAT Publishing House
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETEditor IJMTER
 
Session ii g3 lab behavior science mmc
Session ii g3 lab behavior science mmcSession ii g3 lab behavior science mmc
Session ii g3 lab behavior science mmcUSD Bioinformatics
 
Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaIJDKP
 

What's hot (20)

Classification
ClassificationClassification
Classification
 
Decision tree induction
Decision tree inductionDecision tree induction
Decision tree induction
 
Feature selection on boolean symbolic objects
Feature selection on boolean symbolic objectsFeature selection on boolean symbolic objects
Feature selection on boolean symbolic objects
 
Comparative study of frequent item set in data mining
Comparative study of frequent item set in data miningComparative study of frequent item set in data mining
Comparative study of frequent item set in data mining
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
 
Dsa unit 1
Dsa unit 1Dsa unit 1
Dsa unit 1
 
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
 
Weka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule GenerationWeka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule Generation
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methods
 
An Efficient Approach for Asymmetric Data Classification
An Efficient Approach for Asymmetric Data ClassificationAn Efficient Approach for Asymmetric Data Classification
An Efficient Approach for Asymmetric Data Classification
 
Cs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_miningCs 1004 -_data_warehousing_and_data_mining
Cs 1004 -_data_warehousing_and_data_mining
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.doc
 
Application of data mining tools for
Application of data mining tools forApplication of data mining tools for
Application of data mining tools for
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
 
Survey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data MiningSurvey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data Mining
 
Lecture 2 - Classes, Fields, Parameters, Methods and Constructors
Lecture 2 - Classes, Fields, Parameters, Methods and ConstructorsLecture 2 - Classes, Fields, Parameters, Methods and Constructors
Lecture 2 - Classes, Fields, Parameters, Methods and Constructors
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
 
Session ii g3 lab behavior science mmc
Session ii g3 lab behavior science mmcSession ii g3 lab behavior science mmc
Session ii g3 lab behavior science mmc
 
Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging area
 

Similar to EE660 Project_sl_final

BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxLSURYAPRAKASHREDDY
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityGon-soo Moon
 
Open06
Open06Open06
Open06butest
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfDanilo Cardona
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxVenkateswaraBabuRavi
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_TrushitaTrushita Redij
 
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptxbigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptxHarshavardhan851231
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceJosé Ramón Ríos Viqueira
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1warishali570
 
Module Overview Careers in Analytics In this module, we .docx
Module Overview  Careers in Analytics In this module, we .docxModule Overview  Careers in Analytics In this module, we .docx
Module Overview Careers in Analytics In this module, we .docxaudeleypearl
 
Module Overview Careers in Analytics In this module, we .docx
Module Overview  Careers in Analytics In this module, we .docxModule Overview  Careers in Analytics In this module, we .docx
Module Overview Careers in Analytics In this module, we .docxroushhsiu
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...IJCSES Journal
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...ijcseit
 

Similar to EE660 Project_sl_final (20)

CAR EVALUATION DATABASE
CAR EVALUATION DATABASECAR EVALUATION DATABASE
CAR EVALUATION DATABASE
 
ML Basics
ML BasicsML Basics
ML Basics
 
BIG MART SALES.pptx
BIG MART SALES.pptxBIG MART SALES.pptx
BIG MART SALES.pptx
 
BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptx
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-Severity
 
Data Science Machine
Data Science Machine Data Science Machine
Data Science Machine
 
Open06
Open06Open06
Open06
 
Clustering
ClusteringClustering
Clustering
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdf
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptxbigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1V2 i9 ijertv2is90699-1
V2 i9 ijertv2is90699-1
 
Module Overview Careers in Analytics In this module, we .docx
Module Overview  Careers in Analytics In this module, we .docxModule Overview  Careers in Analytics In this module, we .docx
Module Overview Careers in Analytics In this module, we .docx
 
Module Overview Careers in Analytics In this module, we .docx
Module Overview  Careers in Analytics In this module, we .docxModule Overview  Careers in Analytics In this module, we .docx
Module Overview Careers in Analytics In this module, we .docx
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 

EE660 Project_sl_final

  • 1. EE660 Project Walmart Recruiting: Trip Type Classification Shanglin Yang, shangliy@usc.edu Yi Zheng, zhen578@usc.edu December 8th 2015 Instructor: Professor B. Keith Jenkins
  • 2. Table of Contents Abstract.......................................................................................................................................... 2 1. Project Homepage................................................................................................................. 3 2. Problem statement and goals............................................................................................... 3 3. Literature Review ................................................................................................................. 4 4. Prior and Related Work....................................................................................................... 5 5. Project Formulation and Setup ........................................................................................... 5 6 Methodology.......................................................................................................................... 8 7. Implementation ................................................................................................................... 11 7.1 Feature Space .................................................................................................................... 11 7.2 Pre-processing and Feature Extraction........................................................................................11 7.3 Training Process .............................................................................................................................13 7.3.1 Naïve Bayes classifier:..............................................................................................................13 7.3.2 K-nearest neighbors (KNN) classifier:......................................................................................13 7.3.3 Scikit-based using SVM, Random Forest and Adaboost ..........................................................14 7.4 Testing, Validation and Model Selection......................................................................................15 8. Final Results ........................................................................................................................ 19 9. Interpretation...................................................................................................................... 20 10. Summary and conclusions.................................................................................................. 21 Abstract The project aims to help Walmart classify customer trip type using the dataset of the items they’ve purchased. We apply machine learning methods on the purchase history data and the customer trip type provided by Walmart to solve the problem. The main challenges are transforming the raw data to features that represent trip type well and learning a predictive model based on these features. We first look for best representation of sample data as features and solve the trip type classification problem by five machine learning methods, i.e. naïve Bayes, K-nearest neighbor (KNN), Support Vector Machine (SVM), Random Forest and adaptive boosting. The random forest performs best and we use this model as the final classification system to predict the trip types of unseen customer purchase data. We get a good predictive score and there still remains a big improvement in our score. We find features are essential in applied machine learning and will next explore more in process of features selection and extraction based on raw data to improve accuracy of our classification system.
  • 3. 1. Project Homepage https://github.com/ee660finalproject/EE660_Group_pro 2. Problem statement and goals Walmart improves customers' shopping experiences by segmenting their store visits into different trip types. Whether they're on a last minute run for new puppy supplies or leisurely making their way through a weekly grocery list, classifying trip types enables Walmart to create the best shopping experience for every customer. Currently, Walmart's trip types are created from a combination of existing customer insights and purchase history data. In this problem, we will focus on the purchase history data and classify customer trips using only a transactional dataset of the items they've purchased. The goal is to help Walmart refine their segmentation process by improving the data behind trip type classification. Walmart has categorized the trips contained in this data into 38 distinct types with 647054 training samples and 653646 test samples. Data Fields:  TripType - a categorical id representing the type of shopping trip the customer made. TripType_999 is an "other" category.  VisitNumber - an id corresponding to a single trip by a single customer  Weekday - the weekday of the trip  Upc - the UPC number of the product purchased  ScanCount - the number of the given item that was purchased. A negative value indicates a product return.  DepartmentDescription - a high-level description of the item's department  FinelineNumber - a more refined category for each of the products, created by Walmart. This is an interesting and challenging problem. Since we are not provided with more information than what is given in the data (e.g. what the TripTypes represent or more product information), we need to mine the useful information behind the purchase history data by ourselves to predict the trip types. Using both art (customer insights) and science (purchase history data) will help Walmart make progress on the core mission of better understanding and serving their customers. The challenge is to recreate this categorization/clustering with a more limited set of features. This could provide new and more robust ways to categorize trips. It requires significant amounts of preprocessing. There exists some missing data. Each customer only has one trip type but may purchase more than one commodity. Depending on the VisitNumber, we ensemble the samples
  • 4. and find that there are 94247 customers. The data has a high dimensionality of feature space with totally 102984 dimensions which would be a sparse matrix. We need to perform feature selection and feature extraction to reduce the dimensions. A good selection of features leads to an excellent classification but It is hard to select the them among a huge number of features. Since the project involves massive data, the training procedure is time consuming. 3. Literature Review Title: <Largeron, Christine, Christophe Moulin, and Mathias Géry. "Entropy based feature selection for text categorization." Proceedings of the 2011 ACM Symposium on Applied Computing. ACM, 2011. > This paper made a review of several feature selection methods including document frequency (DF), information gain (IG), mutual information(IM), χ2, odd ratio and GSS and proposed a feature selection criterion, called Entropy based Category Coverage Difference (ECCD). From the paper, we get some basic ideas of the feature information theory and the implementation of information gain (IG). We then implement the algorithm in our feature selection. The basic idea of feature selection is to build a functions which tries to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. We choose to use IG for its easy implementation as well as high performance. Given a term tj and a category ck, IG(tj , ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj ; B, the number of documents in the other categories containing tj ; C, the number of documents of ck which do not contain tj and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D): Fig.1 The ECCD Matrix In our problem, we use a 97714*38 matrix as IGM97714∗38. The row stands for each term (Upc number) and the column stands for the category (the trip type). Each element in the matrix is the occurrence of the upc number in the trip type, i.e. A𝑗𝑘= IGM𝑗𝑘. B𝑗𝑘 = ∑ IGM𝑗𝑖 38 𝑖=1 − IGM𝑗𝑘. C𝑗𝑘 = ∑ IGM𝑖𝑘 97714 𝑖=1 − IGM𝑗𝑘. D𝑗𝑘 = ∑ ∑ IGM 𝑘𝑖 97714 𝑘=1 38 𝑖=1 − A𝑗𝑘 − B𝑗𝑘 − C𝑗𝑘. Then we use a a 97714*38 matrix as IGM97714∗38. Using the contingency table, Information Gain can be estimated by:
4. Prior and Related Work

There is no prior or related work of our own on this problem.

5. Project Formulation and Setup

After analyzing the problem description and goal, we decided to apply the standard machine learning training and testing process to this problem: use the known sample features and labels to train a multi-class classification model, which can then assign a class to a new sample with the same features. In our case, we decided to implement five algorithms for the classification:

5.1 Naïve Bayes classifier

Naïve Bayes is a simple kind of generative classifier, a model of the form:

$p(y, x \mid \theta) = p(y \mid \pi) \prod_{j=1}^{D} p(x_j \mid y, \theta_j)$   (2)

It is fit by MAP estimation with a vague Dirichlet prior (add-one smoothing). Typically, the results are not too sensitive to the setting of this prior (unlike discriminative models). In this problem, we use MAP estimation. The model has two fields, theta(c, j) and classPrior(c). This serves as our baseline classifier.

Table 1 Parameters within the Naïve Bayes classifier

model.theta(c, j)      The probability that feature j turns on in trip type c
model.classPrior(c)    The prior probability of trip type c

5.2 K-nearest neighbors (KNN) classifier

KNN is a generative classifier where the class-conditional density is a non-parametric kernel density estimator. Based on the samples, the function is only approximated locally, and all computation is deferred until classification. KNN can weight the contributions of the neighbors, so that nearer neighbors contribute more to the decision than more distant ones. In this problem, the KNN model finds the closest training samples to each test sample and assigns the majority label among them to the test sample.

- Parameters:
  - similarity function: $K: X \times X \to \mathbb{R}$
  - number of nearest neighbors to consider: $k = 38$
- Prediction rule for a test sample $x'$:
  - $\mathrm{knn}(x')$: the $k$ training samples with the smallest Euclidean distance $d(x', x_i)$
  - $y(x') = \arg\max_{y \in Y} \sum_{i \in \mathrm{knn}(x')} \mathbb{1}[y_i = y]$
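To make the baseline concrete, here is a minimal Bernoulli naïve Bayes with add-one smoothing over binary features; it is our own NumPy sketch of the model in equation (2), not the pmtk3 code used later, and the usage data at the bottom is a random stand-in.

```python
import numpy as np

def nb_fit(X, y, n_classes):
    """MAP estimate with add-one smoothing; X is a binary (n, D) matrix."""
    theta = np.zeros((n_classes, X.shape[1]))
    prior = np.zeros(n_classes)
    for c in range(n_classes):
        Xc = X[y == c]
        prior[c] = (len(Xc) + 1.0) / (len(X) + n_classes)     # smoothed class prior
        theta[c] = (Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0)   # P(x_j = 1 | y = c)
    return theta, prior

def nb_predict(X, theta, prior):
    """MAP decision rule: argmax_c log p(c) + sum_j log p(x_j | c)."""
    log_on, log_off = np.log(theta), np.log(1.0 - theta)
    return np.argmax(X @ log_on.T + (1 - X) @ log_off.T + np.log(prior), axis=1)

# Tiny usage example with random binary data standing in for the real features.
X = np.random.randint(0, 2, size=(500, 100))
y = np.random.randint(0, 38, size=500)
theta, prior = nb_fit(X, y, 38)
print(nb_predict(X, theta, prior)[:10])
```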
5.3 SVM (Multi-class)

Support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one category, an SVM training algorithm builds a model that assigns new examples to one side or the other of a decision boundary based on the kernel function. The basic idea behind the SVM, shown in Fig.2(a), is to maximize the margin and minimize the training error simultaneously. The SVM can fit high-dimensional datasets effectively, and it uses only a subset of the training points in the decision function (the support vectors), so it is also memory efficient. There are several modifications we can apply to fit our problem.

Fig.2 The SVM principle (a) and non-linear model (b)

As shown in Fig.2(b), we need to fit a non-linear classification. To do that, we use a kernel function to map the non-linear problem. We choose Radial Basis Functions as the kernel because it requires few parameters to optimize and measures distances appropriately in high dimensions. The mathematical principle is described below.

Training:

$\text{maximize } D(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j)$   (3)

$\text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0 \text{ and } 0 \le \alpha_i \le C$
with the RBF kernel

$K(\vec{x}_i, \vec{x}_j) = \exp\left(-\left\|\vec{x}_i - \vec{x}_j\right\|^2 / \sigma^2\right)$

Classification: for a new example $\vec{x}$,

$h(\vec{x}) = \mathrm{sign}\left(\sum_{x_i \in SV} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b\right)$   (4)

Since the SVM is in general a binary classifier, we can use the one-vs-one (ovo) or one-vs-rest (ovr) method to make the model fit the multi-class problem; we also use cross validation to choose the best parameters. The parameters include the penalty parameter C (int), the kernel function ('linear', 'rbf', 'poly'), the degree of the polynomial kernel function (int), gamma (float), and decision_function_shape ('ovo', 'ovr').

5.4 Random Forest Classification

Random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting (Fig.3). It generally produces good results given sufficiently many trees, and it also uses a subset of the features for each tree to improve the final result. The parameters that can be modified include the number of trees/estimators (int) and max_depth (int). There are other parameters we could tune, but in general we set them automatically.

Fig.3 The framework of Random Forest

5.5 Adaboost Classification

An AdaBoost classifier is another meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases. It is based on weak estimators using subsets of the features as well as the samples, and it boosts the results iteratively. Adaboost suits our problem because, on the one hand, it handles high-dimensional data, and on the other hand, it continually penalizes the misclassified data, which improves performance; it can also significantly reduce overfitting. The algorithm includes a training process together with weighting and boosting. The weak classifier we use in our implementation is a basic decision-tree classifier with small depth. Each time the basic estimator gives temporary results, the misclassified samples are assigned larger weights to obtain a better classification in the next round. The parameters we can adjust include the number of trees/estimators (int) and the learning rate (float); we let the system choose the other parameters automatically.
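The three scikit-learn estimators corresponding to Sections 5.3-5.5 can be instantiated as below; the parameter values shown are the ones ultimately selected by cross validation in Section 7.4, the weak-learner depth of 2 is an illustrative choice of ours, and base_estimator is the parameter name in the scikit-learn of that era (newer releases call it estimator).

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Multi-class RBF SVM; decision_function_shape switches between 'ovo' and 'ovr'.
svm = SVC(C=100, kernel='rbf', gamma=0.01, decision_function_shape='ovr')

# Random forest: each tree fits a bootstrap sample with feature subsampling.
rf = RandomForestClassifier(n_estimators=800, max_depth=24)

# AdaBoost over shallow decision trees (the small-depth weak classifier).
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
                         n_estimators=800, learning_rate=1.0)
```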
6 Methodology

The framework is shown in Fig.4. The whole pipeline includes preprocessing, training and evaluation.

6.1 Preprocessing

Too much information can reduce the effectiveness of classifier learning; it may actually detract from the quality and accuracy of the model. Thus, the representation and quality of the data come first and foremost, before running a classifier. In this section, we determine the actual features for training our systems. Since we can hardly find information meaningful to the classification of trip types in the original purchase history data, we assemble data attributes and exploit the relationships between features to form a new training set.
Moreover, the new training data set with many attributes may contain groups of attributes that are correlated. These attributes may actually be measuring the same underlying feature, and such redundant attributes simply add noise to the data and affect model accuracy. Noise increases the complexity of the model and the time and system resources needed for model building and scoring. The higher the dimensionality of the processing space, the higher the computational cost involved in algorithmic processing. To minimize the effects of noise, correlation, and high dimensionality, some form of dimension reduction is a desirable preprocessing step. Feature selection and extraction are approaches to dimension reduction. The product of data preprocessing and feature extraction is the final training data.

6.2 Training Process

Once we get the final training set after preprocessing and feature extraction, we segment the final training data into two parts: a training set and a test set. To avoid data snooping, we set the test set aside and never look at it during the training process. We apply cross validation to the training process, i.e. the training set is divided into five equal-size sets; each time, four of them are used for training the classifier and the remaining one is used for testing the performance of the specific model and parameters. In our project, we use five machine learning methods in total: naïve Bayes, KNN, SVM, Random Forest and Adaptive Boosting. By cross validation, we find the best classifier for each training method. In particular, we need to consider the hypothesis sets of the different algorithms, since they are related to the feasibility and performance of machine learning.

Naïve Bayes: The naïve Bayes probability model is an independent-feature model. The naïve Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some k as follows:

$\hat{y} = \underset{k \in \{1,\dots,K\}}{\arg\max}\; p(C_k) \prod_{j=1}^{D} p(x_j \mid C_k)$   (5)

K-nearest Neighbors: Since KNN is an instance-based learning algorithm, in a k-NN model a hypothesis is built from the training data directly at the time a query is made to the system. The prediction is based on the K training instances closest to the case being scored. Therefore, all training cases have to be stored, which may be problematic when the amount of data is large.

SVM: We use the RBF kernel and ovo/ovr to solve the multi-classification problem. Recall formulations (3) and (4): our target parameter is the vector $\vec{\alpha} \in \mathbb{R}^n$, where $n$ is the number of training samples. So our hypothesis set is the subset of the $\mathbb{R}^n$ space satisfying the constraints $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.

Fig.4 The Framework of the whole System

Random Forest and Adaboost decision trees: In both cases, the basic idea is to use a decision tree as the weak classifier. Each tree has a unit hypothesis set $h_{\{i\}}$ whose size is decided by its required depth and halting condition; in general, for depth $d$ and $n$ nodes, $|h_{\{i\}}| = D^n$, where $D$ is the feature dimension. Therefore, for the whole system, the hypothesis set is $H = \bigcup_{i=1}^{N} h_{\{i\}}$, where $N$ is the number of decision trees.
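Before moving to evaluation, here is a minimal sketch of the five-fold scheme described at the start of this section, with random placeholder data standing in for the preprocessed features (module paths follow current scikit-learn; the 2015 release exposed KFold under sklearn.cross_validation):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(1000, 268)            # placeholder feature matrix
y_train = np.random.randint(0, 38, size=1000)  # placeholder trip-type labels

errors = []
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True).split(X_train):
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train[tr_idx], y_train[tr_idx])                       # four folds for training
    errors.append(1 - clf.score(X_train[va_idx], y_train[va_idx]))  # one fold for testing
print(np.mean(errors))  # cross-validation error for this model/parameter setting
```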
6.3 Evaluation
After the training process, we have the optimal model for each classification method. In this evaluation section, we use the held-out test set to evaluate the performance of each classifier, and we select the classifier with the best performance as the final classification system. Then we complete the Walmart competition: predict the trip type from the customer purchase data and submit the trip types to Kaggle.com to get a score for our classification system (https://www.kaggle.com/c/walmart-recruiting-trip-type-classification/submissions/attach).

7. Implementation

7.1 Feature Space

In the original dataset, each sample represents one commodity, and one customer usually purchases more than one commodity. Since the goal is to predict the trip type of the customer, we need to know when and what each customer buys on his/her trip. We therefore first assemble the samples belonging to each customer into a new dataset, so that each new sample represents one customer. Each sample has features such as the purchasing weekday and the DepartmentDescription, Upc and FinelineNumber of all purchased items with the corresponding quantities.

7.2 Pre-processing and Feature Extraction

STEP 1:

              Original data size   Missing data size   New data size
Training set  647054               4129                642925
Test set      653646               3986                649660

The original data sets have some missing data, but the amount is relatively small, so we simply discard the missing data.

STEP 2: Using the VisitNumber, we can identify each customer and then merge each customer's items into one sample to form the new data set. There are 95674 customers in the training set and 95674 customers in the test set. After discarding the missing data, the actual number of customers is 94247 in the training set and 94288 in the test set. For training, it is fine to use the 94247 samples; for testing, we use our classification system to predict the trip types of the 94288 customers and assign trip type 999 ("other") to the 1386 customers with missing data.
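Steps 1 and 2 can be expressed in pandas roughly as follows; the column names come from the data fields in Section 2, and the aggregation shown is a simplification of ours (every row of a visit shares the same TripType and Weekday, so 'first' recovers them).

```python
import pandas as pd

train = pd.read_csv('train.csv')

# STEP 1: discard rows with missing values (e.g. missing Upc or FinelineNumber).
train = train.dropna()

# STEP 2: merge all items of one visit into a single customer sample.
visits = train.groupby('VisitNumber').agg({
    'TripType': 'first',
    'Weekday': 'first',
    'ScanCount': 'sum',
})
print(len(visits))  # number of customers after discarding missing data
```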
STEP 3:

Weekday is a categorical feature representing the purchasing weekday for each customer. We use a 7-digit binary number, setting the corresponding bit to "1" and the others to "0" to denote the corresponding weekday. The format is shown below:

7-digit: 'Friday' | 'Monday' | 'Saturday' | 'Sunday' | 'Thursday' | 'Tuesday' | 'Wednesday'

DepartmentDescription is a categorical feature describing the item's department. Since it reflects the properties and functions of the items bought by the customers, we expect it to carry a large amount of information for classifying the trip type. By counting, there are 68 descriptions in total for all commodities, and we use all of them in a 68-digit number. For each customer, the corresponding bit is set to 1 multiplied by the ScanCount, depending on the descriptions of his/her purchased items.

FinelineNumber is a more refined category of each product created by Walmart, and this feature gives more information for classification. By counting, there are 5195 FinelineNumbers in the training set, and we use a 5195-digit binary number to represent this feature. For each customer, the corresponding bit is set to 1 if the customer has purchased the item. If a customer in the test set doesn't buy any of the 5195 FinelineNumber items, all 5195 digits are set to 0.

Upc is the UPC number of each product. There are 97714 Upc values in the training data, and it is not feasible to use 97714 bits for the Upc feature. Our idea is to choose the 5000 most representative of the 97714 Upc numbers, i.e. the ones most useful in separating the training samples into the given classes. To determine the 5000 Upc numbers, we apply the Information Gain method presented in Section 3. The IG matrix provides very good information-theoretic feature selection: we pick the 5000 Upc numbers with the highest Information Gain. For each customer, the corresponding bit is set to "1" if the customer has purchased an item with one of the 5000 Upc numbers. If a customer doesn't buy any item with one of the 5000 Upc numbers, all 5000 digits are set to "0".

STEP 4: We assemble all the new features obtained in step 3 for each customer in this order:

Weekday | DepartmentDescription | FinelineNumber | Upc

After the preprocessing and feature extraction, the new training set has size 94247*10270 and the test set has size 94288*10270. The rows stand for the samples and the columns stand for the features.
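Looking back at step 3, the Weekday and DepartmentDescription blocks can be built as below. This is our own sketch: the department bits are weighted by ScanCount as described, and the column orders are fixed so that train and test line up; items is assumed to be the cleaned row-level frame from the previous sketch.

```python
import pandas as pd

def encode_weekday(weekday_series):
    """7-digit indicator per customer, columns in the fixed order of step 3."""
    order = ['Friday', 'Monday', 'Saturday', 'Sunday',
             'Thursday', 'Tuesday', 'Wednesday']
    return pd.get_dummies(weekday_series).reindex(columns=order,
                                                  fill_value=0).astype(int)

def encode_departments(items):
    """68-digit block: one column per DepartmentDescription, weighted by ScanCount."""
    return items.pivot_table(index='VisitNumber',
                             columns='DepartmentDescription',
                             values='ScanCount',
                             aggfunc='sum', fill_value=0)

# e.g. weekday_block = encode_weekday(visits['Weekday'])
#      dept_block = encode_departments(train)
```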
STEP 5: Feature reduction. The new data set obtained by preprocessing and feature selection has 10270 dimensions, which is still a huge feature space. Many irrelevant features simply add noise to the data and hurt classifier accuracy, and some features are highly correlated, which reduces the effectiveness of the classifier. So we apply feature extraction to reduce the features:

1. 7-digit Weekday feature: we keep the 7 digits unchanged. (→ 7 digits)
2. 68-digit DepartmentDescription feature: we divide this feature into two parts:
   (1) Apply the LDA method to the 68-dimension feature space and transform it to a lower 37-dimension space. (→ 37 digits)
   (2) By mining the data, we find that among the 68 department descriptions there are 20 pairs of descriptions with high correlation to the trip type. So we use 20 binary bits to denote whether each pair occurs in the customer's trip. (→ 20 digits)
3. 5195-digit FinelineNumber feature: we use randomized PCA to reduce the feature dimension from 5195 to 68. (→ 68 digits)
4. 5000-digit Upc feature: we use randomized PCA to reduce the feature dimension from 5000 to 136. (→ 136 digits)

By feature extraction, we reduce the feature dimension from 10270 to 268. Feature reduction reduces the time and storage space required, the removal of collinearity improves the performance of the machine learning model, and it also reduces the overfitting effect.
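The reductions in step 5 map onto scikit-learn roughly as follows; at the time, randomized PCA was the separate class sklearn.decomposition.RandomizedPCA, which current releases expose as PCA(svd_solver='randomized'). The placeholder arrays stand in for the real feature blocks.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n = 1000                                                    # placeholder sample count
X_dept = np.random.rand(n, 68)                              # department block
X_fine = np.random.randint(0, 2, (n, 5195)).astype(float)   # fineline block
X_upc = np.random.randint(0, 2, (n, 5000)).astype(float)    # upc block
y = np.random.randint(0, 38, size=n)                        # trip-type labels

# LDA: with 38 classes, at most 38 - 1 = 37 discriminant components.
dept_lda = LinearDiscriminantAnalysis(n_components=37).fit_transform(X_dept, y)

# Randomized PCA on the two sparse high-dimensional blocks.
fine_pca = PCA(n_components=68, svd_solver='randomized').fit_transform(X_fine)
upc_pca = PCA(n_components=136, svd_solver='randomized').fit_transform(X_upc)

# Concatenate; with the 7 weekday digits and 20 pair digits this gives 268 dims.
X_reduced = np.hstack([dept_lda, fine_pca, upc_pca])
```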
7.3 Training Process

7.3.1 Naïve Bayes classifier:

We use naiveBayesFit from pmtk3 to generate a naïve Bayes model by MAP estimation and predict with the model using naiveBayesPredict. Since the features are binary, $p(x_j \mid y = c, \theta) = \mathrm{Ber}(x_j \mid \theta_{jc})$. The model is fit by MAP estimation with a vague Dirichlet prior (add-one smoothing). From the input training data, we compute the frequency of each trip type in the labels and use it as the prior for each class. By counting the total numbers of 1's and 0's for each bit, we get the likelihood that each bit turns on in each class. The likelihoods and priors are the model parameters.

7.3.2 K-nearest neighbors (KNN) classifier:

We fit the model using knnFit from pmtk3. The KNN model is generated from the input training data, since it records each training sample and the corresponding class label. Once we fit the KNN model, we use knnPredict to find the predicted trip types. For each test sample, the classifier finds the nearest training samples in the model and assigns the majority class label to the test sample. The input arguments include the training samples (X), the training labels (y) and the number of neighbors (K).

% model = knnFit (Xtrain, ytrain, k)
% label = knnPredict (model, Xtest)

7.3.3 Scikit-based using SVM, Random Forest and Adaboost

We implement multiclass SVM, Random Forest and Adaboost using the scikit-learn library. The flowchart of the whole process is shown below.

Fig.5 The Training Process of the Scikit-Based Training
The key concepts involved in the whole process include dimension reduction, parameter search and model building. Scikit-learn provides the functions related to the training. We use the functions below to do the training and testing, and we vary some of the basic parameters to observe the performance. Details are shown in the next section.

Table 2 Functions within the Scikit-Based Training

                    SVC                Random Forest                               Adaboost
Training (Model)    sklearn.svm.SVC    sklearn.ensemble.RandomForestClassifier     sklearn.ensemble.AdaBoostClassifier
Predicting          Model.predict      Model.predict                               Model.predict

7.4 Testing, Validation and Model Selection

Since our work involves numerous models and parameters, we need cross validation to choose the parameters.

7.4.1 Flowchart of cross validation

We randomly choose 60000 samples from the original 94247 samples and let these 60000 samples form X_Train. The rest of the samples are taken as X_Test, which is set aside until the final test. We then separate X_Train into 5 subsets of 12000 samples each. Each time, the model uses four of the subsets as the training set and the remaining one as the test set. Each time we record the results, change the parameters, and repeat the process.

7.4.2 Model targets and parameter candidate sets

We have 5 basic models/algorithms, but since naïve Bayes and KNN are baseline models, we care more about how the performance changes with the parameters of the other three models. We choose the grid search algorithm to do the parameter search and optimization, using the tool sklearn.grid_search.GridSearchCV. We use this method because we have many samples as well as many classes to train on, so we need to shrink the candidate sets to keep the computation fast. For each model we set different search parameters and their candidate sets:

SVC              C                 Kernel     Gamma                   decision_function
Candidate Sets   [1,10,100,1000]   ['rbf']    [0.01,0.001,0.0001]     ['ovo','ovr']
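In code, the grid search over the SVC candidate sets looks roughly like this (sklearn.grid_search.GridSearchCV in the 2015 release, now sklearn.model_selection.GridSearchCV; placeholder data again stands in for the real training set):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train = np.random.rand(1000, 268)            # placeholder features
y_train = np.random.randint(0, 38, size=1000)  # placeholder labels

param_grid = {
    'C': [1, 10, 100, 1000],
    'kernel': ['rbf'],
    'gamma': [0.01, 0.001, 0.0001],
    'decision_function_shape': ['ovo', 'ovr'],
}
search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold CV per combination
search.fit(X_train, y_train)
print(search.best_params_)                      # e.g. {'C': 100, 'gamma': 0.01, ...}
```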
The cross validation results are shown below:

Fig.6 The SVC cross validation results (error rate vs C and error rate vs Gamma, for both ovo and ovr)

We can see clearly that the decision function does not make much difference. We choose the best parameters based on the results: {'C': 100, 'Gamma': 0.01}.

Randomforest     N_estimator                       Max_depth
Candidate Sets   [50,100,200,400,600,800,1000]     [2,4,8,12,16,20,24,28,30]

The cross validation results are shown below:
Fig.7 The Random Forest cross validation results (error rate vs N_estimator and error rate vs Max_depth)

We choose the best parameters for the random forest: {'N_estimator': 800, 'Max_depth': 24}.

Adaboost         n_estimator                  Learning Rate
Candidate Sets   [50,100,200,400,600,800]     [0.6,0.8,1,1.2]

The cross validation results are shown below:
Fig.8 The Adaboost cross validation results (error rate vs N_estimator and error rate vs Learning Rate)

We choose the best parameters for Adaboost: {'N_estimator': 800, 'Learning Rate': 1}.

7.4.3 Final Test

We test our results in two ways. First we use X_Test, which has not been looked at during the whole training process, to evaluate the different models with their best parameters and obtain the predictions as well as the error rates. We then decide our final model based on these error rates as well as the computation time. Finally, we use the test.csv downloaded from Kaggle and the whole train.csv data to train the final model and obtain the labels for the test file, upload the predictions, and get the result.
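Sketched in code, the final step is simply: fit the chosen model on all of train.csv, predict test.csv, and upload the predictions. Here build_features is a hypothetical helper standing for the whole Section 7.2 pipeline, and the single-column submission layout is an assumption about the upload format, not the competition's specification.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# build_features is a hypothetical placeholder for the Section 7.2 pipeline.
X_train, y_train = build_features('train.csv')
X_test, visit_numbers = build_features('test.csv')

clf = RandomForestClassifier(n_estimators=800, max_depth=24)  # best CV parameters
clf.fit(X_train, y_train)

submission = pd.DataFrame({'VisitNumber': visit_numbers,
                           'TripType': clf.predict(X_test)})
submission.to_csv('submission.csv', index=False)
```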
8. Final Results

Table 3 Final Results

Algorithm                    E_in     E_out    Best Parameter                              Computation Time (20000 samples) (/s)
Naïve Bayes                  0.345    0.415    -                                           600
K-nearest neighbors (KNN)    0.0799   0.441    -                                           900
SVM                          0.005    0.281    {'C': 100, 'Gamma': 0.01}                   800
Random Forest                0.007    0.228    {'N_estimator': 800, 'Max_depth': 24}       600
Adaboost                     0.080    0.255    {'N_estimator': 800, 'Learning Rate': 1}    1800

Based on the results above, the Random Forest algorithm performs best, with the lowest $E_{out}$. We use the Random Forest model as the final classifier for our classification system. We then apply our system to the test samples provided by Walmart and predict the trip types; the result is shown below.

Online competition submission performance:
Team name on Kaggle.com: Guoshiwushuang
Public score: 9.33204
According to the leaderboard, the best (lowest) public score is 0.50519 and the worst is 34.53878. The score is calculated on approximately 30% of the test data. We have submitted 6 times and improved our score from 20 to 9, which shows that our method does work. Meanwhile, from the competition forum topics, we find that some teams with better scores also apply the random forest method. We think the reason our system cannot reach the same score is that the features we select and extract are still not efficient and accurate enough for the classification. We need to find a better relationship between the features of the products and the trip types and transform the original features more effectively.

9. Interpretation

Our final results are not very satisfying compared to other performances on the leaderboard. Meanwhile, from the table we can clearly see an overfitting phenomenon for all algorithms, especially random forest, even though it achieves the best performance. However, we still learned a lot from the process and the results.

9.1 Feature engineering and machine learning

The key problem in our work is not the implementation of the learning algorithms but the usage of the features. There are two parts to this challenge. First, the features are highly sparse, so an individual feature generates little valuable information for the classification. Second, both the FinelineNumber and the Upc carry very high-dimensional information (almost 100000 dimensions), which is also hard to handle. We implemented several methods for feature selection and extraction, but the results are not satisfactory under the different algorithms. The direct reason should be that our features are still not sufficiently powerful. In our case, for example, one cannot decide the trip type based on only one or two objects a customer bought, because the objects people buy embrace a high variety. From this we see that not every original feature enhances performance; some features may even be confusing and harmful to the classification. Our method combines manual search with selection based on the IG value, which improves the performance compared to using only the original features. But the feature engineering problem is still far from solved, and the process is still hard to trace. The reason could be that we are still unable to describe the relations between different features, and our new features may still not be distinguishable enough for classification. Another reason is that there is always a dilemma between precision and recall: when we bring in new features that help us find more samples of one class, there is a larger chance of simultaneously misclassifying samples of other classes.

9.2 Limitation of handling large data
Besides feature selection, the computational limitations of handling large data also hurt our final performance. To speed up the computation, we have to use a small set of samples, which may prevent us from exploring more information in the dataset, while spending a lot of time on the data also makes it hard to track individual samples or features. Especially in the parameter-search process, the long computation time makes it difficult to get a sufficient prediction of the performance. To solve this problem, we need more advanced algorithms. Since our data is sparsely distributed, randomized PCA is helpful; in fact, we need more algorithms of this kind for our problem.

10. Summary and conclusions

In our project, by comparing our score with others' on the leaderboard, we find that even when applying the same machine learning method, different features result in different performances. Transforming the raw data into features that better represent the underlying problem improves the accuracy of the model on unseen data. Features play an important role in success in applied machine learning. Algorithms are very important, and we usually invest our main effort in them; however, good features are the key to making the algorithms work well and guaranteeing a predictive model. Better features mean flexibility, simpler models and better performance. Next, we will try our best to find better features in the purchase history data before training our model.