Text minings

•Download as PPTX, PDF•

0 likes•107 views

University of Illinois,Chicago

Web scraping using R Selenium. Latent Dirichlet Allocation, Sentiment Analysis using Bag of Words approach

Data & Analytics

Text Mining
Independent Course Study
Under the guidance of
Prof. Ranganathan Chandraskaren
1Aadish Chopra

Text Mining
Problem Statement :Classify tweets into categories such as Health ,
General Information
Data Set: Collected through web-scraping
• Twitter API is used widely but limitation of Twitter API is that you can
only collect upto a few thousand tweets.
• Another limitation of twitter API is that you can’t extract all the
information that is on the website
Tweets were only collected for Health institutions. There are 790 health
institutions or handles.
2Aadish Chopra

How the twitter website looks like ?
3Aadish Chopra
Twitter
Name
Twitter
Handle
Tweet

Web scraper
Web scraper was built using R. Package used is Rselenium*
https://drive.google.com/drive/folders/0Bzq4aAyBiuFQdHd0Y2JobmhjY
1k
Rselenium is based on Selenium server and automates the web browser
and takes control over it .You can then specify how you want to interact
with the web browser by specifying the action e.g. scrolling, pressing
the end key or pressing some specific button on the website
4Aadish Chopra
*I did web scraping using R. Earlier teams did it using Java and Python

How the collected data looks like ?
5Aadish Chopra
https://drive.google.com/drive/folders/0B-x4pbFFhIICWmdvWnBwX1VwZzA

Text Cleaning
Example Tweet :
Don't forget to register for the American Heart Associationâ€™s HeartsaverÂ® First Aid
Training offered by the...http://fb.me/6PjnRWaMAÂ
Steps taken to clean tweet
• Tweets can be cleaned using the “tm” package in R . The functions are as follows
1. removeNumbers
2. removePunctuation
3. Stopwords
4. removeWords
5. stripWhitespace
6. stemDocument
6Aadish Chopra

Text Cleaning
However if one wants more control over the cleaning of the data. For this purpose regex
expressions can be used
1. Punctuation replacement: LaSalletwe<-gsub(pattern = "W",replacement = "
",LaSalletwe)
2. Digits replacement: LaSalletwe<-gsub(pattern="d",replacement = "",LaSalletwe)
3. To remove a single letter: LaSalletwe<-gsub(pattern="b[A-
z]b{1}",replace="",LaSalletwe)
4. The list of stopwords* can be modified in the English.dat file which can be opened. The
stopwords can be created for any language in R. It can be helpful for instance in cities
like Chicago where people speak languages like Spanish
*Changed the English.dat file in retrospective manner.
7Aadish Chopra

Text Cleaning
8Aadish Chopra
English.dat file can
be changed for
stopwords

Topic Modelling
Topic Modelling:
• Classify documents using topic models.
• Get themes from documents
• Get topic distributions from corpus
• Get word distributions within topics
A technique which is popular is known as Latent Dirichlet allocation(LDA) was used to find
the topic distributions within documents.
Various versions of LDA like sLDA(supervised), Gibbs sampling are also present and may be
used interchangeably depending on the goals.
9Aadish Chopra

Topic Modelling
10Aadish Chopra
α is the parameter of the Dirichlet prior on the per-document topic distributions,
β is the parameter of the Dirichlet prior on the per-topic word distribution,
Ɵ is the topic distribution for document m,
ψ is the word distribution for topic k,
z is the topic for the n-th word in document m, and
w is the specific word.
N is the number of
words in a
document
M is the number of
documents

Topic Modelling
LDA gives the themes of the document or the corpus. LDA can be used
in various ways.
1. On a single tweet
2. On a single handle
3. On the entire corpus
11Aadish Chopra

Word Frequency -On a single handle
We are illustrating by taking example of LaSalleCoHealth
Total no of Observations: 691
Total no of words after data cleaning: 1126 observations
Top 16 words which occurs most in their tweet are
12Aadish Chopra

Word Frequency-On the corpus
Total no of Observations: 16449
Total no of words after data cleaning: 13665
observations
13Aadish Chopra

What's hot

Semantic Web Austin YahooPeter Mika

Otsuka Talk in Dec 2017Fariz Darari

Knowledge Technologies: Opportunities and ChallengesFariz Darari

PIR advanced information skills 2018Royal Holloway University of London

An introduction to Semantic Web and Linked DataFabien Gandon

Hello dataInes Sombra

2 Hka Researchingaptwano

Terrorism 15 May 2009karsalan

Search strategyrammiyanandu

Introduction to RDFNarni Rajesh

2014 CrossRef Workshops: Support Update and Multiple Resolution OverviewCrossref

Search strategies – subject searchingdoverlibrary

E-LEARN: BrainstormingEvans Library at Florida Institute of Technology

An introduction to Semantic Web and Linked DataGabriela Agustini

Publishing data on the Semantic WebPeter Mika

Year of the Monkey: Lessons from the first year of SearchMonkeyPeter Mika

Productive Searchingjskotnicki

What google scholar can do for youpvhead123

What's hot (18)

Semantic Web Austin Yahoo

Otsuka Talk in Dec 2017

Knowledge Technologies: Opportunities and Challenges

PIR advanced information skills 2018

An introduction to Semantic Web and Linked Data

Hello data

2 Hka Researching

Terrorism 15 May 2009

Search strategy

Introduction to RDF

2014 CrossRef Workshops: Support Update and Multiple Resolution Overview

Search strategies – subject searching

E-LEARN: Brainstorming

An introduction to Semantic Web and Linked Data

Publishing data on the Semantic Web

Year of the Monkey: Lessons from the first year of SearchMonkey

Productive Searching

What google scholar can do for you

Similar to Text minings

Utilizing the natural langauage toolkit for keyword researchErudite

Aws rSherri Gunder

Building NLP solutions for Davidson ML Groupbotsplash.com

How to get_community_supportipsitamishra

Microformats I: What & WhyRachael L Moore

TLA Webinar: Introduction to Drupal -- part 3 of 3cherryhillco

Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Ted Drake

Copyright © 2017, 2018 Sinclair Community College. All Right.docx

Big Data Analytics course: Named Entities and Deep Learning for NLPChristian Morbidoni

Hadoop presentationPriyaj Kumar

Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn

Harnessing Web Page Directories for Large-Scale Classification of TweetsGabriela Agustini

Use Amazon Comprehend and Amazon SageMaker to Gain Insight from TextAmazon Web Services

Resource Mining for Effective ResearchAndrew Sorensen

8 programming concepts_you_should_know_in_2017Rajesh Shirsagar

ppt 2.pptxandxikcicncmk0wufjepfc09eufcdcashimavashisth001

Twitter analysis by Kaify RaisAjay Ohri

Twitter System DesignAkshatMishra72438

How Oracle Uses CrowdFlower For Sentiment AnalysisCrowdFlower

Social Networks at ScaleEoin Hurrell, PhD

Similar to Text minings (20)

Utilizing the natural langauage toolkit for keyword research

Aws r

Building NLP solutions for Davidson ML Group

How to get_community_support

Microformats I: What & Why

TLA Webinar: Introduction to Drupal -- part 3 of 3

Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...

Big Data Analytics course: Named Entities and Deep Learning for NLP

Hadoop presentation

Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...

Harnessing Web Page Directories for Large-Scale Classification of Tweets

Use Amazon Comprehend and Amazon SageMaker to Gain Insight from Text

Resource Mining for Effective Research

8 programming concepts_you_should_know_in_2017

ppt 2.pptxandxikcicncmk0wufjepfc09eufcdc

Twitter analysis by Kaify Rais

Twitter System Design

How Oracle Uses CrowdFlower For Sentiment Analysis

Social Networks at Scale

Recently uploaded

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823

Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg

Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823

CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR

Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823

(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7Call Girls in Nagpur High Profile Call Girls

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823

➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823

Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...gajnagarg

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli

Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Riyadh +966572737505 get cytotec

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823

Predicting Loan Approval: A Data Science ProjectBoston Institute of Analytics

Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY

Anomaly detection and data imputation within time seriesParis Women in Machine Learning and Data Science

Recently uploaded (20)

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...

Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...

Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore

CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE

Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand

(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...

➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...

Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...

Abortion pills in Jeddah | +966572737505 | Get Cytotec

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...

Predicting Loan Approval: A Data Science Project

Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...

Anomaly detection and data imputation within time series

Text minings

1. Text Mining Independent Course Study Under the guidance of Prof. Ranganathan Chandraskaren 1Aadish Chopra

2. Text Mining Problem Statement :Classify tweets into categories such as Health , General Information Data Set: Collected through web-scraping • Twitter API is used widely but limitation of Twitter API is that you can only collect upto a few thousand tweets. • Another limitation of twitter API is that you can’t extract all the information that is on the website Tweets were only collected for Health institutions. There are 790 health institutions or handles. 2Aadish Chopra

3. How the twitter website looks like ? 3Aadish Chopra Twitter Name Twitter Handle Tweet

4. Web scraper Web scraper was built using R. Package used is Rselenium* https://drive.google.com/drive/folders/0Bzq4aAyBiuFQdHd0Y2JobmhjY 1k Rselenium is based on Selenium server and automates the web browser and takes control over it .You can then specify how you want to interact with the web browser by specifying the action e.g. scrolling, pressing the end key or pressing some specific button on the website 4Aadish Chopra *I did web scraping using R. Earlier teams did it using Java and Python

5. How the collected data looks like ? 5Aadish Chopra https://drive.google.com/drive/folders/0B-x4pbFFhIICWmdvWnBwX1VwZzA

6. Text Cleaning Example Tweet : Don't forget to register for the American Heart Associationâ€™s HeartsaverÂ® First Aid Training offered by the...http://fb.me/6PjnRWaMAÂ Steps taken to clean tweet • Tweets can be cleaned using the “tm” package in R . The functions are as follows 1. removeNumbers 2. removePunctuation 3. Stopwords 4. removeWords 5. stripWhitespace 6. stemDocument 6Aadish Chopra

7. Text Cleaning However if one wants more control over the cleaning of the data. For this purpose regex expressions can be used 1. Punctuation replacement: LaSalletwe<-gsub(pattern = "W",replacement = " ",LaSalletwe) 2. Digits replacement: LaSalletwe<-gsub(pattern="d",replacement = "",LaSalletwe) 3. To remove a single letter: LaSalletwe<-gsub(pattern="b[A- z]b{1}",replace="",LaSalletwe) 4. The list of stopwords* can be modified in the English.dat file which can be opened. The stopwords can be created for any language in R. It can be helpful for instance in cities like Chicago where people speak languages like Spanish *Changed the English.dat file in retrospective manner. 7Aadish Chopra

8. Text Cleaning 8Aadish Chopra English.dat file can be changed for stopwords

9. Topic Modelling Topic Modelling: • Classify documents using topic models. • Get themes from documents • Get topic distributions from corpus • Get word distributions within topics A technique which is popular is known as Latent Dirichlet allocation(LDA) was used to find the topic distributions within documents. Various versions of LDA like sLDA(supervised), Gibbs sampling are also present and may be used interchangeably depending on the goals. 9Aadish Chopra

10. Topic Modelling 10Aadish Chopra α is the parameter of the Dirichlet prior on the per-document topic distributions, β is the parameter of the Dirichlet prior on the per-topic word distribution, Ɵ is the topic distribution for document m, ψ is the word distribution for topic k, z is the topic for the n-th word in document m, and w is the specific word. N is the number of words in a document M is the number of documents

11. Topic Modelling LDA gives the themes of the document or the corpus. LDA can be used in various ways. 1. On a single tweet 2. On a single handle 3. On the entire corpus 11Aadish Chopra

12. Word Frequency -On a single handle We are illustrating by taking example of LaSalleCoHealth Total no of Observations: 691 Total no of words after data cleaning: 1126 observations Top 16 words which occurs most in their tweet are 12Aadish Chopra

13. Word Frequency-On the corpus Total no of Observations: 16449 Total no of words after data cleaning: 13665 observations 13Aadish Chopra

Text minings

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Text minings

Similar to Text minings (20)

More from University of Illinois,Chicago

More from University of Illinois,Chicago (15)

Recently uploaded

Recently uploaded (20)

Text minings