Identifying Buzz in Twitter Data
Jane Zanzig
University of Washington
Department of Statistics
Introduction
• Binary Classification: 100,000 observations of 78 variables
• Imbalanced Classes: 856 positive examples, 99,144 negative examples
• Objective: maximize AUC
• Features: time-series data, including:
  • Discussions: number created, average length
  • Authors: increase, interaction, number
  • Burstiness Level
  • Attention Level
  • Number of Atomic Containers
  • Averages and Minima/Maxima
Figure 1: Proportion of mislabeled positive examples versus weight on the positive class.
Training and Validation Sets
• Imbalanced classes mean a good classification error (< 0.01) can be achieved by the trivial all-negative classifier
• Tried weights from 1 to 200 on the positive class, but the false-negative rate remained over 50% (see Figure 1)
• Training data: 750 positive examples, 750 negative examples (balanced by downsampling; see the sketch after this list)
• Validation set: 106 positive examples, 12,274 negative examples
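The balanced training split above can be reproduced with a simple downsampling step. A minimal sketch, assuming the data sit in a pandas DataFrame `df` with a binary label column `buzz` (both names are illustrative, not from the poster):

```python
import pandas as pd

# Assumed layout: DataFrame `df` with the 78 variables, label column "buzz" in {0, 1}.
pos = df[df["buzz"] == 1]          # 856 positive examples
neg = df[df["buzz"] == 0]          # 99,144 negative examples

# Balanced training set: 750 of each class, drawn without replacement.
train_pos = pos.sample(n=750, random_state=0)
train_neg = neg.sample(n=750, random_state=0)
train = pd.concat([train_pos, train_neg]).sample(frac=1.0, random_state=0)

# Validation set: the remaining 106 positives plus a subsample of negatives,
# matching the 106 / 12,274 split reported above.
valid_pos = pos.drop(index=train_pos.index)
valid_neg = neg.drop(index=train_neg.index).sample(n=12274, random_state=0)
valid = pd.concat([valid_pos, valid_neg])

X_train, y_train = train.drop(columns="buzz"), train["buzz"]
X_valid, y_valid = valid.drop(columns="buzz"), valid["buzz"]
```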
Random Forests
• 500 trees, each grown on an independent bootstrap sample (see the sketch after this list)
• At each node, randomly select m of the M total features and choose the best split among those m
• Predicted class decided by majority vote over the trees
• Decreases variance through bootstrapping ("out-of-bag" data) and averaging
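A sketch of this random-forest configuration using scikit-learn; the poster does not name its software, so `RandomForestClassifier` is a stand-in, and `X_train`, `y_train` come from the downsampling sketch above:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # 500 trees, each grown on a bootstrap sample
    max_features="sqrt",   # m randomly chosen candidate features per split (m << M)
    oob_score=True,        # track error on the out-of-bag data
    random_state=0,
)
rf.fit(X_train, y_train)   # predicted class = majority vote over the trees
print("OOB accuracy:", rf.oob_score_)
```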
Figure 2: Proximity to positive class by feature.
• Variable Importance: permute the values of attribute j on the OOB data and compare the error rate on the permuted j to that on the true j; the difference is the mean decrease in accuracy (a permutation-importance sketch follows these bullets)
• Proximity: increases by one each time two cases fall in the same terminal node; a measure of similarity
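The "mean decrease in accuracy" idea can be approximated outside the forest as well. A sketch using scikit-learn's permutation importance on the validation split (the classical randomForest implementation permutes on the OOB data instead):

```python
from sklearn.inspection import permutation_importance

# Shuffle one feature at a time, re-score, and compare to the baseline accuracy.
imp = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)

ranked = sorted(zip(X_valid.columns, imp.importances_mean), key=lambda t: -t[1])
for name, drop in ranked[:10]:
    print(f"{name:30s} mean decrease in accuracy: {drop:.4f}")
```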
Support Vector Machine
$$\min_{\alpha}\ \tfrac{1}{2}\,\alpha^{T} Q \alpha \tag{1}$$
$$\text{s.t. } 0 \le \alpha_i \le 1/l,\ i = 1, \ldots, l, \qquad e^{T}\alpha \ge \nu, \qquad y^{T}\alpha = 0.$$
$$K(x, x') = e^{-\gamma \lVert x - x' \rVert^{2}} \tag{2}$$
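Equation (2) written out as code, just to fix notation; a small illustrative helper, not taken from the poster:

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF kernel from equation (2): K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))
```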
Figure 3: 10-fold CV error for tuning the SVM, plotted over log10(γ) and ν.
• Preprocessing: features are standardized before training; only features with greater-than-median variable importance from the RF are used
• Tuning: 10-fold CV over γ ∈ [10⁻⁶, 10⁻³] and ν ∈ (0.05, 0.25) (see Figure 3; a tuning sketch follows this list)
• Parameters: ν = 0.23 controls the proportion of support vectors; γ = 10⁻⁴ is the kernel smoothing parameter
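A sketch of this preprocessing and tuning pipeline, assuming scikit-learn: standardize, keep only features above the median RF importance, and grid-search (γ, ν) by 10-fold cross-validation. Variable names carry over from the earlier sketches, and the exact grid is an assumption chosen to match Figure 3.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

# Keep features whose RF importance exceeds the median importance.
keep = rf.feature_importances_ > np.median(rf.feature_importances_)
X_train_sel = X_train.loc[:, keep]
X_valid_sel = X_valid.loc[:, keep]

pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", NuSVC(kernel="rbf"))])

grid = GridSearchCV(
    pipe,
    param_grid={"svm__gamma": np.logspace(-6, -3, 7),
                "svm__nu": np.linspace(0.05, 0.25, 21)},
    scoring="accuracy",      # Figure 3 plots CV error = 1 - accuracy
    cv=10,
)
grid.fit(X_train_sel, y_train)
print(grid.best_params_)     # poster reports nu = 0.23, gamma = 1e-4
```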
Final ROC Curves
Figure 4: ROC curves on the validation set (false positive rate vs. true positive rate). Random forest: AUC = 0.992; SVM: AUC = 0.984.
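The validation-set AUCs in Figure 4 can be recomputed along these lines, continuing the earlier sketches: the forest is scored with its class-1 probability, the SVM with its decision function.

```python
from sklearn.metrics import roc_auc_score, roc_curve

rf_scores = rf.predict_proba(X_valid)[:, 1]        # forest's estimate of P(buzz = 1)
svm_scores = grid.decision_function(X_valid_sel)   # signed margin from the tuned SVM

print("RF  AUC:", round(roc_auc_score(y_valid, rf_scores), 3))   # poster: 0.992
print("SVM AUC:", round(roc_auc_score(y_valid, svm_scores), 3))  # poster: 0.984

# ROC curves themselves, if plotting is wanted:
fpr_rf, tpr_rf, _ = roc_curve(y_valid, rf_scores)
fpr_svm, tpr_svm, _ = roc_curve(y_valid, svm_scores)
```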
Conclusions
• Contribution sparseness, burstiness level, and average discussion length (ADL) were not informative with respect to class
• Averages and minima/maxima were informative
• Using information gleaned from the RF increased SVM AUC from 0.973 to 0.984 and reduced training time.
• Random forests were much faster to train and
gave more intuitive insight into the structure
of the data than SVMs.
Possible Future Work
• Investigate interactions between features
• Experiment with different selection criteria for split variables in the RF (the features are highly correlated)
• Fine-tune feature selection for the SVM
References
[1] F. Kawala, A. Douzal-Chouakria, E. Gaussier, and E. Dimert. Prédictions d'activité dans les réseaux sociaux en ligne [Activity predictions in online social networks]. In Actes de la conférence sur les modèles et l'analyse des réseaux : approches mathématiques et informatique (MARAMI), p. 16, 2013.
Acknowledgements
Thank you to Marina Meila and Yali Wan for the course, and
AAUW for financial support.