Detecting egotism in text - Mahyar Rahmatian 2020

Final Project
Detecting Egotism in Text using
Deep Learning
Rahmatian, Mahyar
CSCI E-89 Deep Learning, Spring 2020
Harvard University Extension School
Prof. Zoran B. Djordjević

“The Ego is a veil between humans and God.” Rumi
 What is this ego that we need to identify and transcend?
 Egotism is an inflated opinion of one's own qualities and importance, marked by
an amplified sense of self. It is a destructive force that we can learn to
recognize in our text using Deep Learning.
 We will mainly use spaCy's prebuilt statistical neural network models in Python
to perform tasks on English text. We'll also train spaCy's CNN model with our
own data (egoistic and non-egoistic sentences) to introduce new named entities
(NER, Named Entity Recognition). Other Python NLP libraries used in this project
are NLTK and Gensim.
 We'll define 8 different methods to detect egotism in text.
 What is or is not egotistic may be subjective, but it should be fairly easy to
reflect a different definition in our detection methods. See the project report
for more detail on our definitions.

Preprocessing
 Cleanup
import re

import preprocess_kgptalkie as ps

def get_clean(x):
    # Lowercase, drop backslashes, and turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    x = ps.make_base(x)
    # Collapse any character repeated 3 or more times into a single occurrence
    x = re.sub(r"(.)\1{2,}", r"\1", x)
    return x

DOCUMENT_cleaned = get_clean(DOCUMENT)
 Summarize
# Requires gensim 3.x; the gensim.summarization module was removed in gensim 4.0
from gensim.summarization import summarize
print(summarize(DOCUMENT, word_count=75, split=False))

5 Documents to Examine
 DOCUMENT: A CNN news item, used as a reference point. We expect this to be a
neutral document.
 DOCUMENT_ego_a: A statement from President Trump about President-Elect
Biden. We expect this to be egoistic.
 DOCUMENT_ego_b: A text segment from Donald Trump's book, The Art of the Deal.
We expect this to be egoistic.
 DOCUMENT_no_ego_a: A short article by Eckhart Tolle, the most popular
spiritual author in the United States and best-selling author of The Power of Now.
We expect this to be non-egoistic.
 DOCUMENT_no_ego_b: Another short article by Eckhart Tolle.
We expect this to be non-egoistic.

Method 1 – entities frequency
 The more entities in a document, the more egoistic it is. Use spaCy to find all entities.
 Frequency of top 5 entities
 Average of DOCUMENT_ego 17
 Average of DOCUMENT_no_ego 2.5
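
A minimal sketch of how the top-5 entity frequency might be computed once spaCy has produced the entities. The function name `top_entity_frequency`, the case-insensitive grouping, and the reading of "frequency of top 5 entities" as the summed counts of the five most common entity strings are assumptions, not taken from the project notebook.

```python
from collections import Counter

# With spaCy, the input could come from something like (model name assumed):
#   nlp = spacy.load("en_core_web_sm")
#   entity_texts = [ent.text for ent in nlp(DOCUMENT).ents]

def top_entity_frequency(entity_texts, top_n=5):
    """Sum the occurrence counts of the top_n most frequent entity strings."""
    # Group entities case-insensitively (an assumption of this sketch).
    counts = Counter(text.strip().lower() for text in entity_texts)
    return sum(count for _, count in counts.most_common(top_n))
```

Under this reading, a document that keeps returning to the same few named people or organizations scores higher than one that mentions entities only in passing.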

Method 2 - tense
 Ego likes past and future, and dissolves in the present. Use NLTK word_tokenize
to find the tense of a document (via word inflections). The less present tense,
the more egotistic.
 present %
 Average of DOCUMENT_ego 69
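
One possible way to turn tagged tokens into a present-tense percentage is sketched below. It assumes the tokens have already been tagged with Penn Treebank tags (for example by `nltk.pos_tag(nltk.word_tokenize(text))`); the tag-to-tense mapping and the name `present_percent` are assumptions of this sketch, not the notebook's code.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tags to coarse tenses.
TAG_TO_TENSE = {
    "VBP": "present", "VBZ": "present", "VBG": "present",
    "VBD": "past", "VBN": "past",
    "MD": "future",  # modals such as "will" stand in for the future tense
}

def present_percent(tagged_tokens):
    """Return the share (0-100) of tensed tokens that are present tense."""
    tenses = Counter(
        TAG_TO_TENSE[tag] for _, tag in tagged_tokens if tag in TAG_TO_TENSE
    )
    total = sum(tenses.values())
    return 100.0 * tenses["present"] / total if total else 0.0
```

Tokens whose tags are not in the mapping (nouns, adverbs, base-form verbs) are ignored, so the percentage is relative to tensed words only.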

Method 3 - plural
 The lower the percentage of plural verb/noun forms in use, the more egoistic.
 Plural percent
 Average of DOCUMENT_ego 6.5

Method 4 - pronoun
 Use spaCy pronoun detection to find separationist (I, mine, yours) vs inclusive (we,
ours) pronouns. Ego documents use fewer inclusive pronouns.
 Inclusive pronoun percent
 Average of DOCUMENT_ego 3.5
 Average of DOCUMENT_no_ego 17
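
The pronoun split can be illustrated without spaCy by using fixed word lists. This is a simplified stand-in for spaCy's pronoun detection: the two word sets extend the examples given above and are assumptions, as is measuring the percentage against matched pronouns only.

```python
import re

# Assumed word lists, extending the examples (I, mine, yours) and (we, ours).
INCLUSIVE = {"we", "us", "our", "ours", "ourselves"}
SEPARATIONIST = {"i", "me", "my", "mine", "myself",
                 "you", "your", "yours", "yourself"}

def inclusive_pronoun_percent(text):
    """Return the share (0-100) of matched pronouns that are inclusive."""
    tokens = re.findall(r"[a-z]+", text.lower())
    inclusive = sum(token in INCLUSIVE for token in tokens)
    separationist = sum(token in SEPARATIONIST for token in tokens)
    total = inclusive + separationist
    return 100.0 * inclusive / total if total else 0.0
```

For example, `inclusive_pronoun_percent("I love my car and we love ours")` counts two separationist and two inclusive pronouns.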

Method 5 - readability
 Ego likes text that is complex to read. Use the spacy_readability library to score a
document with 2 different readability measures, then simplify the average of those
scores to Easy, Hard, or Very Hard readability.
 Average of DOCUMENT_ego Hard readability
 Average of DOCUMENT_no_ego Hard readability

Method 6 - sentiment
 Ego likes negativity. Use NLTK SentimentIntensityAnalyzer to find the sentiment
 Average of DOCUMENT_ego neutral
 Average of DOCUMENT_no_ego neutral
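
A small sketch of how a sentiment score could be reduced to a label. It assumes the input is the `compound` value from `SentimentIntensityAnalyzer().polarity_scores(text)` and uses the conventional VADER cut-offs of ±0.05; whether the notebook used these exact thresholds is not stated, so treat them as an assumption.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def average_sentiment_label(compounds):
    """Label the mean compound score of several documents."""
    return sentiment_label(sum(compounds) / len(compounds))
```

Averaging can hide differences: one mildly positive and one mildly negative document average out to neutral, which may be one reason both groups end up with the same label.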

Method 7 - emotion
 Ego likes Angry, Surprise, Sad, and Fear, but not Joy (Happy). Use text2emotion to
detect emotions, then calculate score = Happy - (Angry + Surprise + Sad + Fear),
which ranges from +1 (max happy) to -1 (min happy).
 Emotion score
 Average of DOCUMENT_ego -0.85
 Average of DOCUMENT_no_ego -0.65
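
The score formula above can be sketched as a small function. It assumes the input is the dictionary returned by `text2emotion.get_emotion(text)`, with keys `Happy`, `Angry`, `Surprise`, `Sad`, and `Fear`; the function name `emotion_score` is hypothetical.

```python
NEGATIVE_EMOTIONS = ("Angry", "Surprise", "Sad", "Fear")

def emotion_score(emotions):
    """Compute Happy - (Angry + Surprise + Sad + Fear) from an emotion dict."""
    negative = sum(emotions.get(name, 0.0) for name in NEGATIVE_EMOTIONS)
    return emotions.get("Happy", 0.0) - negative
```

Because the five emotion values are proportions of a whole, a document that is entirely Happy scores +1 and a document with no Happy component scores -1.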

Method 8 – training NER
 Train spaCy with sentences so it learns two new entities (NER): Egoistic and
non-Egoistic.
 For training egoistic entities, we need egoistic words. Each word must be used in
two sentences: one with an egoistic context and the other with a non-egoistic
or neutral context.
 For example:
“complain” is an egoistic word
 Egoistic sentence is “She had done nothing but cry, complain and faint since
this ordeal had begun”
 Non-Egoistic sentence is “I have nothing to complain about”

Method 8 - training NER
 We start with seed words for both Egoistic and Non-Egoistic entities. We then find
synonyms and antonyms for both sets. Finally, we combine them into our
Egoistic and non-Egoistic word lists.
 For example:
 complain → criticize (synonym), applaud (antonym)
 gratitude → grateful (synonym), resentment (antonym)
 Combined Egoistic list = complain, criticize, resentment
 Combined Non-Egoistic list = gratitude, grateful, applaud
 We can find thousands of words, but here we select just about 20 words from each
category to write sentences with.
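
The cross-combination of seeds, synonyms, and antonyms can be sketched as follows. The lookup tables here are plain dictionaries supplied by the caller; how the project actually looked up synonyms and antonyms is not shown in the slides, so the function name and its inputs are assumptions.

```python
def expand_word_lists(ego_seeds, non_ego_seeds, synonyms, antonyms):
    """Grow both word lists: synonyms stay on their seed's side,
    antonyms cross over to the opposite side."""
    ego, non_ego = set(ego_seeds), set(non_ego_seeds)
    for word in ego_seeds:
        ego.update(synonyms.get(word, ()))
        non_ego.update(antonyms.get(word, ()))
    for word in non_ego_seeds:
        non_ego.update(synonyms.get(word, ()))
        ego.update(antonyms.get(word, ()))
    return sorted(ego), sorted(non_ego)

# Toy lookup tables mirroring the example above.
SYNONYMS = {"complain": ["criticize"], "gratitude": ["grateful"]}
ANTONYMS = {"complain": ["applaud"], "gratitude": ["resentment"]}

ego_words, non_ego_words = expand_word_lists(
    ["complain"], ["gratitude"], SYNONYMS, ANTONYMS
)
```

With the toy tables, `ego_words` contains complain, criticize, and resentment, and `non_ego_words` contains applaud, grateful, and gratitude, matching the combined lists above.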

 Training sentences for the Egoistic entity: on one line a word is used in an
egoistic context, and on the next line the same word is used in a non-egoistic context.

 Use spaCy's Matcher to help with labeling, e.g. {'entities': [(25, 35, 'EGOISTIC')]},
then apply a little manual formatting to get the final training text.
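
To illustrate what the labeled training entries look like, the sketch below computes the character offsets for a target word and wraps them in the `{'entities': [(start, end, label)]}` structure shown above. It uses a plain regular expression rather than spaCy's Matcher, and the helper name `label_sentence` is hypothetical.

```python
import re

def label_sentence(sentence, word, label):
    """Return (sentence, {'entities': [(start, end, label)]}) for the first
    whole-word, case-insensitive occurrence of word; empty list if absent."""
    pattern = r"\b" + re.escape(word) + r"\b"
    match = re.search(pattern, sentence, flags=re.IGNORECASE)
    if match is None:
        return (sentence, {"entities": []})
    return (sentence, {"entities": [(match.start(), match.end(), label)]})

example = label_sentence(
    "She had done nothing but cry, complain and faint since this ordeal had begun",
    "complain",
    "EGOISTIC",
)
```

The offsets are character positions with an exclusive end, so slicing the sentence with them returns the labeled word. Sentences where the word appears in a non-egoistic context would still need their labels reviewed by hand, as the slide notes.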

 Training

Method 8 – EGOISTIC entity
 Finding our new EGOISTIC entity in our documents
 Average number of EGOISTIC entities for DOCUMENT_ego 8.5
 Average number of EGOISTIC entities for DOCUMENT_no_ego 5

Method 8 – training NER (non-EGOISTIC)
 A different set of words is used to write paired sentences for the non-EGOISTIC entity.

Method 8 – non-EGOISTIC entity
 Finding our new non-EGOISTIC entity in our documents
 Average number of non_EGOISTIC entities for DOCUMENT_ego 0.5
 Average number of non_EGOISTIC entities for DOCUMENT_no_ego 2.5

Final Tally
 Scores from all methods (< means a lower number is better, i.e. less egotism).
 6 (in bold) out of 9 indicators correctly differentiated between the two
groups of documents.
 There is room to improve each of the indicators for greater differentiation.
 It is also possible to run more documents through the 9 indicators to gather
more rows, then feed those rows to a secondary neural network.

The End
 The associated notebook is a very good training ground for deep learning in NLP.
 Generating a good set of labeled sentences to feed the model is very time
consuming. With more effort on labeled sentences, egotism in text could be
detected more accurately.
 It is possible to feed the results of all indicators to yet another Deep Learning
model and expect higher accuracy.
 Future Enhancements:
 Voice to text
 Web based
 Individual method Scoring improvements
 Train with more labeled sentences
 Upgrade to spaCy 3.0 and use spacy-transformers with pretrained transformers like
BERT
 Resume Enhancer

“The sage battles his own ego, the fool battles
everyone else’s” - Rumi

YouTube URLs, Last Page
 Two minute (short): https://youtu.be/9DYvJWaepc8
 15 minutes (long): https://youtu.be/KZqg6KqUyMg
