2. Natural Language Processing: Text → Numbers → Machine Learning
Pipeline: Text → Cleaning, Tokenizing → Feature Vectors (e.g., TF-IDF; downweight, normalize) → Machine Learning

Example documents and their tokens:
“I like food!”          → i, like, food
“Food is good!”         → food, is, good
“I had some good food.” → i, had, some, good, food

Bag-of-words feature vectors:
Document                  i  like  food  is  good  had  some
“I like food!”            1  1     1     0   0     0    0
“Food is good!”           0  0     1     1   1     0    0
“I had some good food.”   1  0     1     0   1     1    1
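To make the bag-of-words step concrete, here is a minimal sketch using scikit-learn's CountVectorizer; the settings (binary indicators, a token pattern that keeps one-letter words such as "i") are illustrative assumptions, not necessarily the project's exact configuration, and the columns come out in alphabetical rather than slide order.

# A minimal sketch (assumed settings) of the bag-of-words matrix above,
# using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like food!", "Food is good!", "I had some good food."]

# binary=True yields 0/1 indicators as in the table above; the token pattern
# keeps one-letter tokens such as "i" (the default pattern drops them).
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b", binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary, in alphabetical order
print(X.toarray())                         # one row per document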
3. Text Classification
[Figure: Distribution of Tweets, number of tweets vs. normalized retweet counts]
● Sample Imbalance
● Classification (0/1: Not Retweeted / Retweeted)
● Logistic Regression
● Threshold: 0.005
● Misclassification Error: 22%
Train/test split; downsampling was used to balance the training set.
Normalized Confusion Matrix (0 = not retweeted, 1 = retweeted):
Actual \ Predicted    0      1
0                     0.81   0.19
1                     0.26   0.74
(A code sketch of this classification step follows below.)
Code: github.com/zweinstein/SpreadHealth_dev
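As an illustration of the classification step, here is a minimal sketch that fits a logistic regression model and prints a row-normalized confusion matrix; the synthetic data, split, and settings are placeholders standing in for the project's tf-idf features and retweet labels.

# A minimal sketch (synthetic placeholder data) of logistic regression
# classification with a row-normalized confusion matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in for the tf-idf feature matrix and 0/1 retweet labels.
X, y = make_classification(n_samples=10000, n_features=50, weights=[0.55, 0.45], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Downsampling of the majority class (as on the slide) would be applied to
# X_train / y_train here, before fitting, leaving the test set imbalanced.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# normalize="true" divides each row by the number of samples in that true
# class, matching the normalized confusion matrix shown above.
cm = confusion_matrix(y_test, clf.predict(X_test), normalize="true")
print(cm)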
4. Zhen (Jen) Wang
Beta Tester since 2015; Editor since 2015
Traditional Medicine, Science Fiction, Public Speaking, Online Education
Ph.D. in Physical Chemistry
7. Text Preprocessing Pipeline
Text Cleaning:
● Convert to lower case
● Replace URLs, #hashtags, and @mentions
● Remove special characters other than emoticons
● Remove stopwords
Tokenizing:
● Split each document into individual tokens
● Bag-of-Words or N-grams
● Stemming
○ The Porter stemmer was used
○ The Snowball and Lancaster stemmers are faster but more aggressive
○ Lemmatization is computationally more expensive but has little impact on the performance of text classification
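A minimal sketch of the cleaning and tokenizing steps above, assuming NLTK's English stopword list and Porter stemmer; the regexes and placeholder tokens are illustrative assumptions, not the project's exact rules.

# A minimal sketch of the cleaning + tokenizing pipeline described above.
import re

from nltk.corpus import stopwords           # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer

stop = set(stopwords.words("english"))
porter = PorterStemmer()

def tokenize(text):
    text = text.lower()                                # convert to lower case
    text = re.sub(r"https?://\S+", " url ", text)      # replace URLs with a placeholder
    text = re.sub(r"@\w+", " user ", text)             # replace @mentions with a placeholder
    text = re.sub(r"#(\w+)", r" \1 ", text)            # keep the word part of #hashtags
    emoticons = re.findall(r"[:;=]-?[()dp]", text)     # save simple emoticons
    text = re.sub(r"[^a-z\s]", " ", text)              # remove other special characters
    tokens = [porter.stem(w) for w in text.split() if w not in stop]
    return tokens + emoticons                          # stemmed tokens plus emoticons

print(tokenize("I had some really good food! :) #health http://example.com"))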
Term Frequency-Inverse Document Frequency (tf-idf):
● Term frequency tf(t,d): the number of times a term t occurs in a document d
● Document frequency df(d,t): the number of documents d that contain the term t
● The inverse document frequency, idf(t,d) = log[n_d / (1 + df(d,t))] where n_d is the total number of documents, is used to downweight frequently occurring words in the feature vectors: tf-idf(t,d) = tf(t,d) × idf(t,d)
● The implementation in scikit-learn uses the smoothed form idf(t,d) = log[(1 + n_d) / (1 + df(d,t))] + 1 and L2-normalizes the resulting feature vectors
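A minimal sketch of the tf-idf step using scikit-learn's TfidfTransformer on raw term counts; the settings shown (smooth_idf, L2 norm) are scikit-learn's defaults, and the toy documents are the examples from slide 2.

# A minimal sketch of tf-idf weighting with scikit-learn (default settings).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["I like food!", "Food is good!", "I had some good food."]

counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

# smooth_idf=True gives idf(t) = ln((1 + n_d) / (1 + df(t))) + 1, so a term that
# occurs in every document (e.g. "food") is downweighted but not zeroed out;
# norm="l2" normalizes each document's feature vector to unit length.
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm="l2")
X = tfidf.fit_transform(counts)

print(np.round(tfidf.idf_, 3))      # one idf weight per vocabulary term
print(np.round(X.toarray(), 3))     # tf-idf feature vectors, one row per document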
8. ● Training Dataset: 10,000 tweets on diabetes (4,782 retweeted)
● Test Set Accuracy (random chance on the positive class: 0.49):
○ KNN: 60%
○ Naive Bayes: 67%
○ Logistic regression: 75% (chosen and tested on imbalanced test data)
● Potential Improvements:
○ Decision Trees with Bagging/Boosting (e.g., Random Forest, XGBoost)
○ Other Features:
■ Polarity & Sentiment
■ Length
● Out-of-Core Incremental Learning with Stochastic Gradient Descent (Advantages of Logistic Regression…); see the sketch after this list
● Automatic Update to the SQLite Database and to the Classifier Prediction Algorithms
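A minimal sketch of the out-of-core incremental learning idea mentioned above, using scikit-learn's SGDClassifier with a logistic-regression loss and partial_fit; the HashingVectorizer, batch generator, and SQLite source are assumptions standing in for the project's actual streaming pipeline.

# A minimal sketch of out-of-core learning with stochastic gradient descent.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer needs no in-memory vocabulary, so new tweets can be
# vectorized on the fly as they stream in (e.g. from the SQLite database).
vect = HashingVectorizer(n_features=2**18, alternate_sign=False)

# loss="log_loss" makes SGDClassifier a logistic regression fitted by SGD,
# which is what allows incremental (partial_fit) updates.
clf = SGDClassifier(loss="log_loss", random_state=0)

def stream_batches():
    """Placeholder generator yielding (texts, labels) mini-batches; in the real
    pipeline these would be read incrementally from the tweet database."""
    yield ["I had some good food.", "Food is good!"], [1, 0]
    yield ["I like food!"], [1]

for texts, labels in stream_batches():
    X_batch = vect.transform(texts)
    clf.partial_fit(X_batch, labels, classes=[0, 1])  # update the model one batch at a time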