9–11. Preprocessing
● Tokenize all text columns with OpenNLP, lowercase the tokens,
and filter out the stop words ["‐", "-", "‒", "–", "—", "―", "+", "/", "*",
".", ",", "'", "(", ")", '"', "&", ":", "to", "of", "and", "or", "for", "the", "a"]
● Perform softmax normalization for all float columns
● Replace all NaNs in float columns with 0
or
● Keep the floats intact and replace NaNs with the specified value
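A rough sketch of these preprocessing steps (a simple regex tokenizer stands in for OpenNLP, and "softmax normalization" is read here as the usual logistic-of-z-score scaling — both are assumptions, not the talk's exact code):

```python
import re
import numpy as np

# Abridged stop-word list from the slide (the full list also contains
# several Unicode dash variants).
STOP_WORDS = {"-", "+", "/", "*", ".", ",", "'", "(", ")", '"', "&", ":",
              "to", "of", "and", "or", "for", "the", "a"}

def preprocess_text(text):
    """Lowercase, tokenize into words and punctuation, drop stop words."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def softmax_normalize(col):
    """Softmax normalization of a float column: squash the z-score
    through a sigmoid so values land in (0, 1), then replace NaNs
    with 0 as on the slide."""
    z = (col - np.nanmean(col)) / np.nanstd(col)
    return np.nan_to_num(1.0 / (1.0 + np.exp(-z)), nan=0.0)
```

For example, `preprocess_text("Salaries of the Teachers")` yields `["salaries", "teachers"]`.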
18–20. Bag-of-words
● The concatenated BOW features and floats comprise the final
feature vector
● Replace FTE field NaNs with -1 and Total field NaNs with -20000
● Train sklearn.ensemble.RandomForestClassifier
● Score: 0.8671
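A minimal sketch of this pipeline on made-up toy data (the texts, labels, and float values are illustrative only; the real competition data has many text columns):

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in rows with an FTE and a Total float column.
texts = ["teacher salary", "bus fuel", "teacher benefits", "bus maintenance"]
fte = np.array([1.0, np.nan, 0.5, np.nan])
total = np.array([50000.0, np.nan, 20000.0, 3000.0])
labels = [0, 1, 0, 1]

bow = CountVectorizer().fit_transform(texts)       # bag-of-words features
floats = np.column_stack([np.nan_to_num(fte, nan=-1.0),       # FTE NaN -> -1
                          np.nan_to_num(total, nan=-20000.0)])  # Total NaN -> -20000
X = hstack([bow, csr_matrix(floats)])              # final feature vector

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X))
```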
36–38. Continuous bag-of-words
● The concatenated CBOW features and floats comprise the final
feature vector
● Replace FTE field NaNs with -1 and Total field NaNs with -20000
● Train sklearn.ensemble.RandomForestClassifier
● Score: 0.6616
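The CBOW featurization can be sketched like this, with hypothetical random word vectors standing in for trained embeddings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
dim = 8  # embedding size chosen arbitrarily for the sketch
vocab = {w: rng.normal(size=dim) for w in
         ["teacher", "salary", "bus", "fuel", "benefits", "maintenance"]}

def cbow_features(tokens):
    """Continuous bag-of-words: average the word vectors of a document."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

docs = [["teacher", "salary"], ["bus", "fuel"],
        ["teacher", "benefits"], ["bus", "maintenance"]]
floats = np.array([[1.0, 50000.0], [-1.0, -20000.0],
                   [0.5, 20000.0], [-1.0, 3000.0]])  # NaNs already replaced
X = np.column_stack([np.vstack([cbow_features(d) for d in docs]), floats])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, [0, 1, 0, 1])
```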
42. Continuous bag-of-words
(+) Pros
● Simplicity
● Possible to generalize to unseen words
(-) Cons
● All words are equal, but some words are more equal than others
61–64. Weighted CBOW
● The concatenated WCBOW features and floats comprise the final
feature vector
● Use softmax normalization for the float columns
● Replace all float NaNs with 0
● Train the softmax jointly with the θc parameters
● Score: 0.9159
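The weighting scheme might be sketched as follows, where `theta_c` is a hypothetical relevance direction for one text column (on the slides θc is trained jointly with the softmax classifier, not fixed as here):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
vectors = rng.normal(size=(5, dim))   # word vectors of one document
theta_c = rng.normal(size=dim)        # relevance direction for this column

def wcbow(vectors, theta_c):
    """Weighted CBOW: words aligned with theta_c get larger weights,
    via a softmax over the document's words."""
    scores = vectors @ theta_c
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vectors          # weighted average instead of a plain mean

feat = wcbow(vectors, theta_c)
```

With `theta_c = 0` every word gets equal weight and this reduces to plain CBOW, which is exactly the "all words are equal" limitation the weighting is meant to fix.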
67. Weighted CBOW
Why did it perform so poorly?
H1: Softmax is not as powerful as Random Forest
H2: The model assumes that one direction per column in the word
space is enough to describe the relevant words
119–122. Convolutional NN
● The concatenated mean and max values of the feature maps, together with
the floats, form the final feature vector
● Use softmax normalization for the float columns
● Replace all float NaNs with 0
● Train the softmax jointly with the Wf parameters
● Score: 0.6932
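The feature extraction described above might be sketched as a plain NumPy "valid" convolution with made-up filter shapes (the real model presumably trains Wf by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_filters, width = 8, 4, 3
emb = rng.normal(size=(10, dim))                # a document: 10 word vectors
W_f = rng.normal(size=(n_filters, width, dim))  # conv filters (the Wf on the slide)

def conv_features(emb, W_f):
    """Slide each filter over the word sequence, then pool each feature
    map with both mean and max and concatenate the pooled values."""
    n_words = emb.shape[0]
    maps = np.array([[np.sum(emb[i:i + width] * f)   # valid convolution
                      for i in range(n_words - width + 1)]
                     for f in W_f])
    return np.concatenate([maps.mean(axis=1), maps.max(axis=1)])

feat = conv_features(emb, W_f)  # length 2 * n_filters; the floats get appended to this
```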
125–126. Convolutional NN
Why is it not as good as CBOW + RF?
● It has fewer parameters
● Its performance is still comparable to CBOW + RF, so using a CNN
is a sensible idea
● We could probably gain more from this type of feature learner by
going deeper
128–129. Final model
● Train an RF on the concatenated CBOW features and NN logits
● Train 2 CBOW classifiers, 2 NN classifiers, and 2 meta-classifiers,
then blend them
● Score: 0.5228
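The stacking idea can be sketched on toy data as follows (a logistic regression stands in for the NN, and everything is fit on the same data for brevity; proper stacking would use out-of-fold predictions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-ins: cbow_X plays the role of the CBOW features.
cbow_X = rng.normal(size=(40, 6))
y = (cbow_X[:, 0] > 0).astype(int)

nn = LogisticRegression().fit(cbow_X, y)
nn_logits = nn.decision_function(cbow_X).reshape(-1, 1)  # the "NN logits"

# Meta-classifier: RF on the concatenated CBOW features and NN logits.
meta_X = np.hstack([cbow_X, nn_logits])
meta = RandomForestClassifier(n_estimators=50, random_state=0).fit(meta_X, y)

# Blend: average the predicted probabilities of the two models.
blend = (nn.predict_proba(cbow_X) + meta.predict_proba(meta_X)) / 2
```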
131. Conclusion
● Explore your data before doing any analysis
● Keep trying
● Ensembles are powerful
● Participating in competitions provides a great
learning opportunity
Editor's Notes
DrivenData is a for-profit social enterprise that hosts online competitions, with the goal of engaging a global community of data scientists in solving social problems.
ERS helps school districts use their resources more strategically by providing them with a way to compare their spending to other school districts. Before partnering with DrivenData, that process involved assigning every line item to certain categories in a comprehensive financial spending framework — a task that required an average of 400 man-hours per project and limited the nonprofit's ability to give school districts the analysis they need to improve.