Empower Public Health
through Social Media
Zhen Wang, Ph.D.
Insight Health Data Science
Text
Cleaning, Tokenizing
Convert to Feature Vectors
“I like food!”
“Food is good!”
“I had some good food.”
i, like, food
food, is, good
i, had, some, good, food
e.g., TF-IDF
I’m really good
with numbers!
i    like  food  is    good  had   some
1    1     1     0     0     0     0
0    0     1     1     1     0     0
1    0     1     0     1     1     1
Downweight, Normalize
Machine
Learning
Numbers
Natural Language Processing
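The step from text to numbers on this slide can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the project's code: the helper names `tokenize` and `bag_of_words` are hypothetical, and the vocabulary is ordered by first appearance so that the columns match the matrix on the slide.

```python
import re

docs = ["I like food!", "Food is good!", "I had some good food."]

def tokenize(text):
    # Lowercase the text and keep alphabetic runs only, dropping punctuation.
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, count matrix) with terms in first-seen order."""
    tokenized = [tokenize(d) for d in documents]
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    # One row per document, one column per vocabulary term.
    matrix = [[tokens.count(term) for term in vocab] for tokens in tokenized]
    return vocab, matrix

vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

These raw counts are the input that a weighting scheme such as TF-IDF then downweights and normalizes.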
Text Classification
[Figure: Distribution of Tweets; axes: Normalized Retweet Counts, Number of Tweets]
● Sample Imbalance
● Classification (0/1: Not / Retweeted)
● Logistic Regression
Threshold: 0.005
Misclassification Error: 22%
[Diagram: classes 0 and 1 in the Train and Test sets, with downsampling]
[Figure: Normalized Confusion Matrix; cell values as extracted: 0.81, 0.74, 0.26, 0.19]
Code: github.com/zweinstein/SpreadHealth_dev
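The labeling, downsampling, and evaluation steps named on this slide might look roughly like the sketch below. It rests on assumptions that the slide does not confirm: that the 0.005 threshold is applied to the normalized retweet count to produce the 0/1 label, that downsampling means randomly discarding majority-class samples, and that the confusion matrix is normalized per true class (each row sums to 1). All function names are hypothetical.

```python
import random

def label(normalized_retweets, threshold=0.005):
    # Assumed reading of the slide: 1 = retweeted, 0 = not retweeted.
    return [1 if r > threshold else 0 for r in normalized_retweets]

def downsample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    n = min(len(by_class[0]), len(by_class[1]))
    balanced = [(s, y) for y in (0, 1) for s in rng.sample(by_class[y], n)]
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [y for _, y in balanced]

def normalized_confusion_matrix(y_true, y_pred):
    """Rows are true classes, columns are predicted classes; each row sums to 1."""
    counts = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]

labels = label([0.0, 0.001, 0.01, 0.2])
balanced_x, balanced_y = downsample(list("abcdef"), [0, 0, 0, 0, 1, 1])
cm = normalized_confusion_matrix([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```

Balancing only the training set while evaluating on the untouched, imbalanced test set keeps the reported error representative of real tweet streams.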
Zhen (Jen) Wang
Beta Tester since 2015
Editor since 2015
Traditional Medicine
Science Fiction
Public Speaking
Online Education
Ph.D. in Physical Chemistry
Thank you!
See the App in Action:
Text Preprocessing Pipeline
Text Cleaning:
● Convert to lower case
● Replace URLs, hashtags (#), and mentions (@)
● Remove special characters other than emoticons
● Remove stopwords
Tokenizing:
● Split each document into individual elements
● Bag-of-Words or N-grams
● Stemming
○ The Porter stemmer was used
○ The Snowball and Lancaster stemmers are faster but more aggressive
○ Lemmatization is computationally more expensive but has little impact on the performance of text classification
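A minimal sketch of the cleaning and tokenizing steps listed above, using only the standard library. Several details are assumptions made for illustration: the placeholder tokens (`url`, `user`, `hashtag`), the tiny stopword set, and the emoticon pattern are not taken from the project's code. Stemming is left out here; NLTK's `PorterStemmer` could be applied to each token afterwards.

```python
import re

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Matches simple emoticons such as :) ;-( =D :P
EMOTICON_RE = re.compile(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)")

def clean(text):
    """Return a list of cleaned tokens, with emoticons appended at the end."""
    text = URL_RE.sub(" url ", text)            # replace URLs first
    emoticons = EMOTICON_RE.findall(text)       # capture before lowercasing
    text = text.lower()
    text = re.sub(r"@\w+", " user ", text)      # replace mentions
    text = re.sub(r"#(\w+)", r" hashtag \1 ", text)  # mark hashtags, keep the word
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop remaining special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return tokens + emoticons

def ngrams(tokens, n):
    """Return contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean("Check THIS out: http://t.co/abc #Diabetes @WHO is great :)")
bigrams = ngrams(["i", "like", "food"], 2)
```

Emoticons are collected before lowercasing so that forms like `:D` and `:P` survive the cleaning step.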
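Term Frequency-Inverse Document Frequency (tf-idf):
Used to downweight frequently occurring words in the feature vectors
Term Frequency, tf(t,d): the number of times a term t occurs in a document d
Document Frequency, df(d,t): the number of documents d that contain a term t
The implementation in scikit-learn (default settings): idf(t) = ln[(1 + n) / (1 + df(d,t))] + 1, where n is the total number of documents; tf-idf(t,d) = tf(t,d) × idf(t), followed by L2 normalization of each document vector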
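To make the weighting concrete, here is a small pure-Python sketch that mirrors scikit-learn's default behavior (smoothed idf followed by L2 normalization) on the three example documents from the earlier slide. It is a stand-in for illustration, not a call into scikit-learn itself, and the function name `tfidf` is hypothetical.

```python
import math

def tfidf(docs):
    """docs: list of token lists. Returns (vocab, rows) of L2-normalized tf-idf."""
    vocab = []
    for doc in docs:
        for tok in doc:
            if tok not in vocab:
                vocab.append(tok)
    n = len(docs)
    # df: number of documents containing each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        raw = [doc.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

docs = [["i", "like", "food"],
        ["food", "is", "good"],
        ["i", "had", "some", "good", "food"]]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "food" appears in every document and so gets the lowest weight, while "like" appears only once in the corpus and gets the highest.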
Prediction Algorithms
● Train Dataset: 10,000 tweets on diabetes (4,782 retweeted)
● Test Set Accuracy (Random Chance 0.49 on positive class):
○ KNN: 60%
○ Naive Bayes: 67%
○ Logistic Regression: 75% (chosen and tested on imbalanced test data)
● Potential Improvements:
○ Decision Trees with Bagging/Boosting (e.g., Random Forest, XGBoost)
○ Other Features:
■ Polarity & Sentiment
■ Length
● Out-of-Core Incremental Learning with Stochastic Gradient Descent (Advantages of Logistic Regression…)
● Automatic Update to SQLite Database and to the Classifier
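Out-of-core incremental learning means the classifier is updated one mini-batch at a time, so the full dataset never has to sit in memory. The sketch below shows the idea with a hand-written logistic regression trained by stochastic gradient descent. The class and helper names are hypothetical and the toy data is invented for illustration; in scikit-learn, `SGDClassifier` exposes a `partial_fit` method that plays the same role.

```python
import math

class SGDLogisticRegression:
    """Toy logistic regression updated one sample at a time (illustrative only)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One SGD step per sample: gradient of the log loss is (p - y) * x.
        for x, target in zip(X, y):
            err = self.predict_proba(x) - target
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err
        return self

def stream_batches(samples, batch_size):
    """Yield successive mini-batches; stands in for reading chunks from disk."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        yield [x for x, _ in chunk], [y for _, y in chunk]

# Invented toy data: feature 0 signals class 1, feature 1 signals class 0.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50
model = SGDLogisticRegression(n_features=2)
for X_batch, y_batch in stream_batches(data, batch_size=10):
    model.partial_fit(X_batch, y_batch)
```

Because each update touches only the current batch, the same loop can also absorb newly collected tweets as they arrive, which fits the automatic classifier update mentioned on the slide.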


Editor's Notes

  1. http://54.191.168.240