Identifying Buzz in Twitter Data
Jane Zanzig
University of Washington
Department of Statistics
Introduction
• Binary Classification: 100,000 observations of 78 variables
• Imbalanced Classes: 856 positive examples, 99,144 negative examples
• Objective: maximize AUC
• Features: time-series summaries, including:
• Discussions: number created, average length
• Authors: increase, interaction, number
• Burstiness level
• Attention level
• Number of atomic containers
• Averages and minima/maxima
Figure 1: Proportion of mislabeled positive examples versus weight on positive class.
Training and Validation Sets
• Imbalanced classes mean a good classification error (< 0.01) can be achieved by a trivial all-negative classifier
• Tried weights from 1 to 200 on the positive class, but the false negative rate remained over 50% (see Figure 1)
• Training data: 750 positive examples, 750 negative examples
• Validation set: 106 positive examples, 12,274 negative examples
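The balanced down-sampling described above can be sketched as follows. This is a minimal illustration with synthetic stand-in data (the real features are the time-series summaries from the poster); `balanced_split` is a hypothetical helper, not code from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Twitter "buzz" data: 856 positives, 99,144
# negatives, 78 features (random here; the real features are time-series summaries).
n_pos, n_neg = 856, 99144
y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
X = rng.normal(size=(n_pos + n_neg, 78))

def balanced_split(X, y, n_train_per_class=750, seed=0):
    """Down-sample to a balanced 750/750 training set; leftovers form validation."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    train = np.concatenate([pos[:n_train_per_class], neg[:n_train_per_class]])
    valid = np.concatenate([pos[n_train_per_class:], neg[n_train_per_class:]])
    return X[train], y[train], X[valid], y[valid]

X_tr, y_tr, X_va, y_va = balanced_split(X, y)
# The 106 leftover positives match the poster's validation set; the poster
# additionally subsamples the leftover negatives down to 12,274.
print(y_tr.sum(), len(y_tr) - y_tr.sum())   # 750 750
print(int(y_va.sum()))                      # 106
```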
Random Forests
• 500 trees, each grown on an independent bootstrap sample
• At each node, randomly select m of the M total features and choose the best split among those m
• Predicted class decided by majority vote over the trees
• Averaging over bootstrap samples decreases variance; the held-out (“out-of-bag”) data give an honest error estimate
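A minimal sketch of this setup using scikit-learn's `RandomForestClassifier` on synthetic stand-in data (the 500-tree / bootstrap / vote configuration matches the poster; the data and the `sqrt` choice of m are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the balanced 750/750 training set, 78 features.
X, y = make_classification(n_samples=1500, n_features=78, n_informative=10,
                           weights=[0.5, 0.5], random_state=0)

# 500 trees, each grown on a bootstrap sample; sqrt(M) features tried per
# split; class predicted by majority vote. oob_score_ is the error estimate
# from the held-out ("out-of-bag") portion of each bootstrap sample.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=0)
rf.fit(X, y)
print(round(rf.oob_score_, 3))
```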
Figure 2: Proximity to positive class by feature.
• Variable Importance: permute the values of attribute j on the OOB data and compare the error rate on permuted j to that on the true j; the mean decrease in accuracy measures importance
• Proximity: increases by one each time two cases fall in the same terminal node; a measure of similarity
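The permutation-importance idea can be sketched directly. Note one simplification: scikit-learn computes this on each tree's own out-of-bag sample internally, whereas this sketch approximates the OOB data with a single held-out set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1500, n_features=10, n_informative=3,
                           random_state=0)
# A held-out set stands in for the out-of-bag samples of each tree.
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base_acc = rf.score(X_te, y_te)

# Permute attribute j and compare accuracy to the unpermuted baseline;
# the mean decrease in accuracy is the importance of feature j.
importance = []
for j in range(X_te.shape[1]):
    Xp = X_te.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(base_acc - rf.score(Xp, y_te))
```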
Support Vector Machine
\[
\min_{\alpha}\ \tfrac{1}{2}\,\alpha^{T} Q \alpha \qquad (1)
\]
\[
\text{s.t. } 0 \le \alpha_i \le 1/l,\ i = 1, \dots, l, \qquad e^{T}\alpha \ge \nu, \qquad y^{T}\alpha = 0.
\]
\[
K(x, x') = e^{-\gamma \|x - x'\|^{2}} \qquad (2)
\]
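In scikit-learn this ν-parameterized SVM with the RBF kernel (2) is `NuSVC`; a minimal sketch on synthetic standardized data (the data and the specific γ, ν values shown here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=1500, n_features=40, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)   # standardize features before training

# nu-SVM with RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2):
# nu is a lower bound on the fraction of support vectors.
svm = NuSVC(nu=0.23, kernel="rbf", gamma=1e-4).fit(X, y)
frac_sv = svm.n_support_.sum() / len(X)
print(round(frac_sv, 2))
```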
Figure 3: 10-fold CV error for tuning the SVM (contours of CV error over log10(γ) and ν).
• Preprocessing: features standardized before training; only features with greater-than-median variable importance from the RF were used
• Tuning: tried γ ∈ (10⁻⁶, 10⁻³), ν ∈ (0.05, 0.25)
• Parameters: ν = 0.23 is a lower bound on the proportion of support vectors; γ = 10⁻⁴ is the kernel smoothing parameter
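The RF-guided feature selection and the grid tuning can be sketched together. This is a reduced illustration: synthetic data, a small grid over the poster's ranges, and 3-fold CV instead of 10-fold to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)

# Keep only features above median RF importance, then standardize.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
keep = rf.feature_importances_ > np.median(rf.feature_importances_)
Xs = StandardScaler().fit_transform(X[:, keep])

# Grid over gamma and nu in the poster's tuning ranges, scored by CV accuracy
# (the poster reports 10-fold CV error; 3 folds here for speed).
grid = GridSearchCV(NuSVC(kernel="rbf"),
                    {"gamma": [1e-6, 1e-5, 1e-4, 1e-3],
                     "nu": [0.05, 0.15, 0.25]},
                    cv=3)
grid.fit(Xs, y)
print(grid.best_params_)
```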
Final ROC Curves
Figure 4: ROC curves on the validation set. Random forest: AUC = 0.992; SVM: AUC = 0.984.
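The validation-set ROC and AUC above are computed from classifier scores; a minimal sketch with hypothetical scores (the true class sizes match the poster's validation set, but the score distributions here are invented):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Validation-set class sizes from the poster: 106 positives, 12,274 negatives.
y_true = np.concatenate([np.ones(106, dtype=int), np.zeros(12274, dtype=int)])
# Hypothetical classifier scores: positives score higher on average.
scores = np.concatenate([rng.normal(2.0, 1.0, 106),
                         rng.normal(0.0, 1.0, 12274)])

auc = roc_auc_score(y_true, scores)              # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)  # points on the curve
print(round(auc, 3))
```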
Conclusions
• Contribution sparseness, burstiness level, and average discussion length (ADL) were not informative with respect to class
• Averages and minima/maxima were informative
• Using information gleaned from the RF increased the SVM AUC from 0.973 to 0.984 and reduced training time
• Random forests were much faster to train and gave more intuitive insight into the structure of the data than SVMs
Possible Future Work
• Investigate interactions between features
• Experiment with different selection criteria for split variables in RF (the features are highly correlated)
• Fine-tune feature selection for the SVM
Acknowledgements
Thank you to Marina Meila and Yali Wan for the course, and
AAUW for financial support.