Identifying Buzz in Twitter Data
Jane Zanzig
University of Washington
Department of Statistics
Introduction
• Binary Classification: 100,000 observations of 78 variables
• Imbalanced Classes: 856 positive examples, 99,144 negative examples
• Objective: maximize AUC
• Features: time-series data, including:
  • Discussions: number created, average length
  • Authors: increase, interaction, number
  • Burstiness Level
  • Attention Level
  • Number of Atomic Containers
  • Averages and Minima/Maxima
Figure 1: Proportion of mislabeled positive examples versus weight on the positive class.
Training and Validation Sets
• Imbalanced classes mean a good classification error (< 0.01) can be achieved by the trivial all-negative classifier
• Tried weights from 1 to 200 on the positive class, but the false-negative rate remained over 50% (see Figure 1)
• Training data: 750 positive examples, 750 negative examples (balanced by downsampling; see the sketch after this list)
• Validation set: 106 positive examples, 12,274 negative examples
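The balanced training split above can be reproduced with a simple downsampling step. A minimal sketch, assuming the data sit in a pandas DataFrame `df` with a binary label column `buzz` (both names are illustrative, not from the poster):

```python
import pandas as pd

# Assumed layout: DataFrame `df` with the 78 variables, label column "buzz" in {0, 1}.
pos = df[df["buzz"] == 1]          # 856 positive examples
neg = df[df["buzz"] == 0]          # 99,144 negative examples

# Balanced training set: 750 of each class, drawn without replacement.
train_pos = pos.sample(n=750, random_state=0)
train_neg = neg.sample(n=750, random_state=0)
train = pd.concat([train_pos, train_neg]).sample(frac=1.0, random_state=0)

# Validation set: the remaining 106 positives plus a subsample of negatives,
# matching the 106 / 12,274 split reported above.
valid_pos = pos.drop(index=train_pos.index)
valid_neg = neg.drop(index=train_neg.index).sample(n=12274, random_state=0)
valid = pd.concat([valid_pos, valid_neg])

X_train, y_train = train.drop(columns="buzz"), train["buzz"]
X_valid, y_valid = valid.drop(columns="buzz"), valid["buzz"]
```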
Random Forests
• 500 trees, each grown on an independent bootstrap sample (see the sketch after this list)
• At each node, randomly select m of the M total features and choose the best split among those m
• Predicted class decided by majority vote over the trees
• Decreases variance through bootstrapping ("out-of-bag" data) and averaging
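A sketch of this random-forest configuration using scikit-learn; the poster does not name its software, so `RandomForestClassifier` is a stand-in, and `X_train`, `y_train` come from the downsampling sketch above:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # 500 trees, each grown on a bootstrap sample
    max_features="sqrt",   # m randomly chosen candidate features per split (m << M)
    oob_score=True,        # track error on the out-of-bag data
    random_state=0,
)
rf.fit(X_train, y_train)   # predicted class = majority vote over the trees
print("OOB accuracy:", rf.oob_score_)
```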
Figure 2: Proximity to positive class by feature.
• Variable Importance: permute the values of attribute j on the OOB data and compare the error rate on the permuted j to that on the true j; the difference is the mean decrease in accuracy (a permutation-importance sketch follows these bullets)
• Proximity: increases by one each time two cases fall in the same terminal node; a measure of similarity
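The "mean decrease in accuracy" idea can be approximated outside the forest as well. A sketch using scikit-learn's permutation importance on the validation split (the classical randomForest implementation permutes on the OOB data instead):

```python
from sklearn.inspection import permutation_importance

# Shuffle one feature at a time, re-score, and compare to the baseline accuracy.
imp = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)

ranked = sorted(zip(X_valid.columns, imp.importances_mean), key=lambda t: -t[1])
for name, drop in ranked[:10]:
    print(f"{name:30s} mean decrease in accuracy: {drop:.4f}")
```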
Support Vector Machine
$$\min_{\alpha}\ \tfrac{1}{2}\,\alpha^{T} Q \alpha \tag{1}$$
$$\text{s.t. } 0 \le \alpha_i \le 1/l,\ i = 1, \ldots, l, \qquad e^{T}\alpha \ge \nu, \qquad y^{T}\alpha = 0.$$
$$K(x, x') = e^{-\gamma \lVert x - x' \rVert^{2}} \tag{2}$$
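Equation (2) written out as code, just to fix notation; a small illustrative helper, not taken from the poster:

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF kernel from equation (2): K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))
```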
Figure 3: 10-fold CV error for tuning the SVM, plotted over log10(γ) and ν.
• Preprocessing: features are standardized before training; only features with greater-than-median variable importance from the RF are used
• Tuning: 10-fold CV over γ ∈ [10⁻⁶, 10⁻³] and ν ∈ (0.05, 0.25) (see Figure 3; a tuning sketch follows this list)
• Parameters: ν = 0.23 controls the proportion of support vectors; γ = 10⁻⁴ is the kernel smoothing parameter
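A sketch of this preprocessing and tuning pipeline, assuming scikit-learn: standardize, keep only features above the median RF importance, and grid-search (γ, ν) by 10-fold cross-validation. Variable names carry over from the earlier sketches, and the exact grid is an assumption chosen to match Figure 3.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

# Keep features whose RF importance exceeds the median importance.
keep = rf.feature_importances_ > np.median(rf.feature_importances_)
X_train_sel = X_train.loc[:, keep]
X_valid_sel = X_valid.loc[:, keep]

pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", NuSVC(kernel="rbf"))])

grid = GridSearchCV(
    pipe,
    param_grid={"svm__gamma": np.logspace(-6, -3, 7),
                "svm__nu": np.linspace(0.05, 0.25, 21)},
    scoring="accuracy",      # Figure 3 plots CV error = 1 - accuracy
    cv=10,
)
grid.fit(X_train_sel, y_train)
print(grid.best_params_)     # poster reports nu = 0.23, gamma = 1e-4
```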
Final ROC Curves
Figure 4: ROC curves on the validation set (false positive rate vs. true positive rate). Random forest: AUC = 0.992; SVM: AUC = 0.984.
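The validation-set AUCs in Figure 4 can be recomputed along these lines, continuing the earlier sketches: the forest is scored with its class-1 probability, the SVM with its decision function.

```python
from sklearn.metrics import roc_auc_score, roc_curve

rf_scores = rf.predict_proba(X_valid)[:, 1]        # forest's estimate of P(buzz = 1)
svm_scores = grid.decision_function(X_valid_sel)   # signed margin from the tuned SVM

print("RF  AUC:", round(roc_auc_score(y_valid, rf_scores), 3))   # poster: 0.992
print("SVM AUC:", round(roc_auc_score(y_valid, svm_scores), 3))  # poster: 0.984

# ROC curves themselves, if plotting is wanted:
fpr_rf, tpr_rf, _ = roc_curve(y_valid, rf_scores)
fpr_svm, tpr_svm, _ = roc_curve(y_valid, svm_scores)
```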
Conclusions
• Contribution sparseness, burstiness level, and average discussion length (ADL) were not informative with respect to class
• Averages and minima/maxima were informative
• Using information gleaned from the RF increased SVM AUC from 0.973 to 0.984 and reduced training time.
• Random forests were much faster to train and
gave more intuitive insight into the structure
of the data than SVMs.
Possible Future Work
• Investigate interactions between features
• Experiment with different selection criteria for split variables in the RF (the features are highly correlated)
• Fine-tune feature selection for the SVM
References
[1] F. Kawala, A. Douzal-Chouakria, E. Gaussier, and E. Dimert. Prédictions d'activité dans les réseaux sociaux en ligne [Activity predictions in online social networks]. In Actes de la conférence sur les modèles et l'analyse des réseaux : approches mathématiques et informatique (MARAMI), p. 16, 2013.
Acknowledgements
Thank you to Marina Meila and Yali Wan for the course, and
AAUW for financial support.