SlideShare a Scribd company logo
1 of 33
Handling Text Data
INAFU6513 Lecture 7b
Lab 7: your 5-7 things
Get familiar with text processing
Get familiar with text data
Read text data
Classify text data
Analyse text data
Text processing
● Information retrieval
○ Search
○ Named entity recognition
● Learning
○ Classification
○ Clustering
○ Topic identification/ topic following
○ Sentiment analysis
○ Network analysis (words, people etc)
Reading Text Data
Text Data Sources
● Messages (tweets, emails, sms messages...)
● Document text (reports, blogposts, website text…)
● Audio (via speech-to-text processing)
● Images (via OCR)
Get your raw text data
fsipa = open('sipatext.txt', 'r')
sipatext = fsipa.read()
fsipa.close()
print(sipatext)
Counting: Bags of Words
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform([sipatext])
print('{}'.format(word_counts))
print('{}'.format(count_vect.vocabulary_))
Counting sets of words: N-Grams
● Pairs (or triples, 4s etc) of words
● Also: pairs etc of characters, e.g. [‘mor’, ‘ore’, ‘re ‘,
‘e t’, ‘ th’, ‘tha’, ‘han’]
● Know your Ns:
○ ‘Unigram’ == 1-gram
○ ‘Bigram’ == 2-gram
○ ‘Trigram’ == 3-gram
count_vectn = CountVectorizer(ngram_range =(2, 2))
Stopwords
count_vect2 =
CountVectorizer(stop_words='english')
word_counts2 =
count_vect2.fit_transform([sipatext])
Term Frequencies
● TF: Term Frequency:
○ word count / (number of words in this document)
○ “How important (0 to 1) is this word to this document”?
● IDF: Inverse Document Frequency
○ 1 / (number of documents this word appears in)
○ “How common is this word in this corpus”?
● TFIDF:
○ TF * IDF
Machine Learning with Text Data
Classifying Text
Words are a valid input to machine learning algorithms
In this example, we’re using:
● Newsgroup emails as samples (‘rows’ in our input)
● Words in each email as features (‘columns’)
● Newsgroup ids as targets
The 20newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups( subset='train', categories=cats)
twenty_test = fetch_20newsgroups(subset='test', categories=cats)
Example email
Convert words to TFIDF scores
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
Fit your model to the data
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
Test your model
docs_test = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_test)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = nb_classifier.predict(X_new_tfidf)
for doc, category in zip(docs_test, predicted):
print('{} => {}'.format(doc, twenty_train.target_names[category]))
Text Clustering
We can also ‘cluster’ documents
● The ‘distance’ function is based on the words they have in common
Common machine learning algorithms for text clustering include:
● Latent Semantic Analysis
● Latent Dirichlet Allocation
Text Analysis
Word colocation
● Create a graph (network visualisation) of words that appear together in
documents
● Use network analysis (later session) to show which pairs of words are
important in your documents
Sentiment analysis
● Mark documents (e.g. tweets) as having positive or negative sentiment
● Using machine learning
○ Training set: sentences, with ‘positive’/’negative’ for each sentence
● Using a sentiment dictionary
○ Positive or negative ‘score’ for each emotive word
○ Sentiment dictionaries can be used as machine learning algorithms
‘seeds’
Named Entity Recognition
● Find the names of people, organisations, locations etc in text
● Can use these to create social graphs (networks showing how people etc
connect to each other) and find ‘hubs’, ‘connectors’ etc
Natural Language Processing
Natural Language Processing
● Understanding the grammar and meaning of text
● Useful for, e.g. translation between languages
● Python library: NLTK
Getting started with NLTK
import nltk
nltk.download()
Get text ready for NLTK processing
from nltk import word_tokenize
from nltk.text import Text
fsipa = open('example_data/sipatext.txt', 'r')
sipatext = fsipa.read()
fsipa.close()
sipawords = word_tokenize(sipatext)
textlist = Text(sipawords)
NLTK: concordance
textlist.concordance(‘school’)
textlist.similar('school')
textlist.common_contexts(['school', 'university'])
NLTK: word dispersion plots
from nltk.book import *
text2.dispersion_plot(['Elinor', 'Willoughby', 'Sophia'])
NLTK: Word Meanings
from nltk.corpus import wordnet as wn
word = 'class'
synset = wn.synsets(word)
print('Synset: {}n'.format(synset))
for i in range(len(synset)):
print('Meaning {}: {} {}'.format(i, synset[i].lemma_names(), synset[i].definition()))
NLTK: Synsets
NLTK: converting words into logic
from nltk import load_parser
parser = load_parser('grammars/book_grammars/simple-sem.fcfg', trace=0)
sentence = 'Angus gives a bone to every dog'
tokens = sentence.split()
for tree in parser.parse(tokens):
print(tree.label()['SEM'])
Exercises
Exercises
Try the code in the 7.x series notebooks

More Related Content

What's hot

An Introduction to the C++ Standard Library
An Introduction to the C++ Standard LibraryAn Introduction to the C++ Standard Library
An Introduction to the C++ Standard LibraryJoyjit Choudhury
 
Structures in c language
Structures in c languageStructures in c language
Structures in c languagetanmaymodi4
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structureMahmoud Alfarra
 
Introduction to data structure by anil dutt
Introduction to data structure by anil duttIntroduction to data structure by anil dutt
Introduction to data structure by anil duttAnil Dutt
 
Arrays accessing using for loops
Arrays accessing using for loopsArrays accessing using for loops
Arrays accessing using for loopssangrampatil81
 
Data Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn HubData Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn HubLearn Hub
 
Programming Fundamentals lecture 6
Programming Fundamentals lecture 6Programming Fundamentals lecture 6
Programming Fundamentals lecture 6REHAN IJAZ
 
20130222 Data structures and manipulation in R
20130222 Data structures and manipulation in R20130222 Data structures and manipulation in R
20130222 Data structures and manipulation in RKazuki Yoshida
 
Data types in python lecture (2)
Data types in python lecture (2)Data types in python lecture (2)
Data types in python lecture (2)Ali ٍSattar
 
Is this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning TalkIs this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning TalkSteffen Wenz
 
Towards a theory of data entangelement
Towards a theory of data entangelementTowards a theory of data entangelement
Towards a theory of data entangelementAleksandr Yampolskiy
 

What's hot (18)

An Introduction to the C++ Standard Library
An Introduction to the C++ Standard LibraryAn Introduction to the C++ Standard Library
An Introduction to the C++ Standard Library
 
Structures in c language
Structures in c languageStructures in c language
Structures in c language
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structure
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
Introduction to data structure by anil dutt
Introduction to data structure by anil duttIntroduction to data structure by anil dutt
Introduction to data structure by anil dutt
 
cs8251 unit 1 ppt
cs8251 unit 1 pptcs8251 unit 1 ppt
cs8251 unit 1 ppt
 
Arrays accessing using for loops
Arrays accessing using for loopsArrays accessing using for loops
Arrays accessing using for loops
 
Data Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn HubData Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn Hub
 
C++ arrays part1
C++ arrays part1C++ arrays part1
C++ arrays part1
 
Programming Fundamentals lecture 6
Programming Fundamentals lecture 6Programming Fundamentals lecture 6
Programming Fundamentals lecture 6
 
20130222 Data structures and manipulation in R
20130222 Data structures and manipulation in R20130222 Data structures and manipulation in R
20130222 Data structures and manipulation in R
 
2CPP16 - STL
2CPP16 - STL2CPP16 - STL
2CPP16 - STL
 
Python - variable types
Python - variable typesPython - variable types
Python - variable types
 
Core & advanced java classes in mumbai
Core & advanced java classes in mumbaiCore & advanced java classes in mumbai
Core & advanced java classes in mumbai
 
Computer programming 2 Lesson 5
Computer programming 2  Lesson 5Computer programming 2  Lesson 5
Computer programming 2 Lesson 5
 
Data types in python lecture (2)
Data types in python lecture (2)Data types in python lecture (2)
Data types in python lecture (2)
 
Is this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning TalkIs this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning Talk
 
Towards a theory of data entangelement
Towards a theory of data entangelementTowards a theory of data entangelement
Towards a theory of data entangelement
 

Similar to Session 07 text data.pptx

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata londonkperi
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
 
Lesson 2 data preprocessing
Lesson 2   data preprocessingLesson 2   data preprocessing
Lesson 2 data preprocessingAbdurRazzaqe1
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with pythonKumud Arora
 
프알못의 Keras 사용기
프알못의 Keras 사용기프알못의 Keras 사용기
프알못의 Keras 사용기Mijeong Jeon
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsGabriel Moreira
 
Typescript - why it's awesome
Typescript - why it's awesomeTypescript - why it's awesome
Typescript - why it's awesomePiotr Miazga
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowS N
 
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...IRJET Journal
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
These questions will be a bit advanced level 2
These questions will be a bit advanced level 2These questions will be a bit advanced level 2
These questions will be a bit advanced level 2sadhana312471
 
Introduction to Machine Learning by MARK
Introduction to Machine Learning by MARKIntroduction to Machine Learning by MARK
Introduction to Machine Learning by MARKMRKUsafzai0607
 

Similar to Session 07 text data.pptx (20)

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
Python basic
Python basicPython basic
Python basic
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Python for Dummies
Python for DummiesPython for Dummies
Python for Dummies
 
Python dictionaries
Python dictionariesPython dictionaries
Python dictionaries
 
Python-Cheat-Sheet.pdf
Python-Cheat-Sheet.pdfPython-Cheat-Sheet.pdf
Python-Cheat-Sheet.pdf
 
Lesson 2 data preprocessing
Lesson 2   data preprocessingLesson 2   data preprocessing
Lesson 2 data preprocessing
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with python
 
프알못의 Keras 사용기
프알못의 Keras 사용기프알못의 Keras 사용기
프알못의 Keras 사용기
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
 
Typescript - why it's awesome
Typescript - why it's awesomeTypescript - why it's awesome
Typescript - why it's awesome
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlow
 
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
IRJET- Unabridged Review of Supervised Machine Learning Regression and Classi...
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
These questions will be a bit advanced level 2
These questions will be a bit advanced level 2These questions will be a bit advanced level 2
These questions will be a bit advanced level 2
 
Introduction to Machine Learning by MARK
Introduction to Machine Learning by MARKIntroduction to Machine Learning by MARK
Introduction to Machine Learning by MARK
 
2. Python Cheat Sheet.pdf
2. Python Cheat Sheet.pdf2. Python Cheat Sheet.pdf
2. Python Cheat Sheet.pdf
 

More from Sara-Jayne Terp

Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Sara-Jayne Terp
 
Risk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageRisk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageSara-Jayne Terp
 
disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...Sara-Jayne Terp
 
Cognitive security: all the other things
Cognitive security: all the other thingsCognitive security: all the other things
Cognitive security: all the other thingsSara-Jayne Terp
 
The Business(es) of Disinformation
The Business(es) of DisinformationThe Business(es) of Disinformation
The Business(es) of DisinformationSara-Jayne Terp
 
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umarylandSara-Jayne Terp
 
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...Sara-Jayne Terp
 
2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeleySara-Jayne Terp
 
Using AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksUsing AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksSara-Jayne Terp
 
2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_secSara-Jayne Terp
 
2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copySara-Jayne Terp
 
BSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideBSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideSara-Jayne Terp
 
Social engineering at scale
Social engineering at scaleSocial engineering at scale
Social engineering at scaleSara-Jayne Terp
 
engineering misinformation
engineering misinformationengineering misinformation
engineering misinformationSara-Jayne Terp
 
Online misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowOnline misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowSara-Jayne Terp
 
Sj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSara-Jayne Terp
 
Belief: learning about new problems from old things
Belief: learning about new problems from old thingsBelief: learning about new problems from old things
Belief: learning about new problems from old thingsSara-Jayne Terp
 
risks and mitigations of releasing data
risks and mitigations of releasing datarisks and mitigations of releasing data
risks and mitigations of releasing dataSara-Jayne Terp
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger dataSara-Jayne Terp
 

More from Sara-Jayne Terp (20)

Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...
 
Risk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageRisk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of age
 
disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...
 
Cognitive security: all the other things
Cognitive security: all the other thingsCognitive security: all the other things
Cognitive security: all the other things
 
The Business(es) of Disinformation
The Business(es) of DisinformationThe Business(es) of Disinformation
The Business(es) of Disinformation
 
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
 
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
 
2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley
 
Using AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksUsing AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworks
 
2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec
 
2020 09-01 disclosure
2020 09-01 disclosure2020 09-01 disclosure
2020 09-01 disclosure
 
2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy
 
BSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideBSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guide
 
Social engineering at scale
Social engineering at scaleSocial engineering at scale
Social engineering at scale
 
engineering misinformation
engineering misinformationengineering misinformation
engineering misinformation
 
Online misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowOnline misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz now
 
Sj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_belief
 
Belief: learning about new problems from old things
Belief: learning about new problems from old thingsBelief: learning about new problems from old things
Belief: learning about new problems from old things
 
risks and mitigations of releasing data
risks and mitigations of releasing datarisks and mitigations of releasing data
risks and mitigations of releasing data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 

Recently uploaded

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 

Recently uploaded (20)

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 

Session 07 text data.pptx

  • 2. Lab 7: your 5-7 things Get familiar with text processing Get familiar with text data Read text data Classify text data Analyse text data
  • 3. Text processing ● Information retrieval ○ Search ○ Named entity recognition ● Learning ○ Classification ○ Clustering ○ Topic identification/ topic following ○ Sentiment analysis ○ Network analysis (words, people etc)
  • 5. Text Data Sources ● Messages (tweets, emails, sms messages...) ● Document text (reports, blogposts, website text…) ● Audio (via speech-to-text processing) ● Images (via OCR)
  • 6. Get your raw text data fsipa = open('sipatext.txt', 'r') sipatext = fsipa.read() fsipa.close() print(sipatext)
  • 7. Counting: Bags of Words from sklearn.feature_extraction.text import CountVectorizer count_vect = CountVectorizer() word_counts = count_vect.fit_transform([sipatext]) print('{}'.format(word_counts)) print('{}'.format(count_vect.vocabulary_))
  • 8. Counting sets of words: N-Grams ● Pairs (or triples, 4s etc) of words ● Also: pairs etc of characters, e.g. [‘mor’, ‘ore’, ‘re ‘, ‘e t’, ‘ th’, ‘tha’, ‘han’] ● Know your Ns: ○ ‘Unigram’ == 1-gram ○ ‘Bigram’ == 2-gram ○ ‘Trigram’ == 3-gram count_vectn = CountVectorizer(ngram_range =(2, 2))
  • 10. Term Frequencies ● TF: Term Frequency: ○ word count / (number of words in this document) ○ “How important (0 to 1) is this word to this document”? ● IDF: Inverse Document Frequency ○ 1 / (number of documents this word appears in) ○ “How common is this word in this corpus”? ● TFIDF: ○ TF * IDF
  • 12. Classifying Text Words are a valid input to machine learning algorithms In this example, we’re using: ● Newsgroup emails as samples (‘rows’ in our input) ● Words in each email as features (‘columns’) ● Newsgroup ids as targets
  • 13. The 20newsgroups dataset from sklearn.datasets import fetch_20newsgroups cats = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'] twenty_train = fetch_20newsgroups( subset='train', categories=cats) twenty_test = fetch_20newsgroups(subset='test', categories=cats)
  • 15. Convert words to TFIDF scores from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer count_vect = CountVectorizer() X_train_counts = count_vect.fit_transform(twenty_train.data) tfidf_transformer = TfidfTransformer(use_idf=True) X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
  • 16. Fit your model to the data from sklearn.naive_bayes import MultinomialNB nb_classifier = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
  • 17. Test your model docs_test = ['God is love', 'OpenGL on the GPU is fast'] X_new_counts = count_vect.transform(docs_test) X_new_tfidf = tfidf_transformer.transform(X_new_counts) predicted = nb_classifier.predict(X_new_tfidf) for doc, category in zip(docs_test, predicted): print('{} => {}'.format(doc, twenty_train.target_names[category]))
  • 18. Text Clustering We can also ‘cluster’ documents ● The ‘distance’ function is based on the words they have in common Common machine learning algorithms for text clustering include: ● Latent Semantic Analysis ● Latent Dirichlet Allocation
  • 20. Word colocation ● Create a graph (network visualisation) of words that appear together in documents ● Use network analysis (later session) to show which pairs of words are important in your documents
  • 21. Sentiment analysis ● Mark documents (e.g. tweets) as having positive or negative sentiment ● Using machine learning ○ Training set: sentences, with ‘positive’/’negative’ for each sentence ● Using a sentiment dictionary ○ Positive or negative ‘score’ for each emotive word ○ Sentiment dictionaries can be used as machine learning algorithms ‘seeds’
  • 22. Named Entity Recognition ● Find the names of people, organisations, locations etc in text ● Can use these to create social graphs (networks showing how people etc connect to each other) and find ‘hubs’, ‘connectors’ etc
  • 24. Natural Language Processing ● Understanding the grammar and meaning of text ● Useful for, e.g. translation between languages ● Python library: NLTK
  • 25. Getting started with NLTK import nltk nltk.download()
  • 26. Get text ready for NLTK processing from nltk import word_tokenize from nltk.text import Text fsipa = open('example_data/sipatext.txt', 'r') sipatext = fsipa.read() fsipa.close() sipawords = word_tokenize(sipatext) textlist = Text(sipawords)
  • 28. NLTK: word dispersion plots from nltk.book import * text2.dispersion_plot(['Elinor', 'Willoughby', 'Sophia'])
  • 29. NLTK: Word Meanings from nltk.corpus import wordnet as wn word = 'class' synset = wn.synsets(word) print('Synset: {}n'.format(synset)) for i in range(len(synset)): print('Meaning {}: {} {}'.format(i, synset[i].lemma_names(), synset[i].definition()))
  • 31. NLTK: converting words into logic from nltk import load_parser parser = load_parser('grammars/book_grammars/simple-sem.fcfg', trace=0) sentence = 'Angus gives a bone to every dog' tokens = sentence.split() for tree in parser.parse(tokens): print(tree.label()['SEM'])
  • 33. Exercises Try the code in the 7.x series notebooks

Editor's Notes

  1. Topic following: includes tracking things like hate speech (iHub Nairobi has done a lot of work on this topic) Verification: the Pheme project (http://www.pheme.eu/) is working on automatically tracking the veracity of stories.
  2. For speech recognition in python, try https://pypi.python.org/pypi/SpeechRecognition/ or speech http://code.activestate.com/recipes/579115-recognizing-speech-speech-to-text-with-the-python-/ We’re looking at two pieces of data today: the Wikipedia entry for SIPA, and a set of tweets about the #migrantcrisis, grabbed from the Twitter API by using notebook 3.1.
  3. Scikit-learn has some powerful text processing functions, including this one to separate text into words
  4. word n-grams; character n-grams
  5. Stopwords are common words (“the”, “a”, “and”) that don’t add to meaning, and might confuse outputs
  6. From http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/: If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
  7. Aka computational linguistics
  8. More than you ever wanted to know about parsing sentences: http://www.nltk.org/howto/featgram.html Simple_sem is a simple grammar, just for teaching: its whole specification is at https://github.com/nltk/nltk_teach/blob/master/examples/grammars/book_grammars/simple-sem.fcfg