Detecting egotism in text - Mahyar Rahmatian 2020
1. Final Project
Detecting Egotism in Text using Deep Learning
Rahmatian, Mahyar
@Rahmatian, Mahyar
CSCI E-89 Deep Learning, Spring 2020
Harvard University Extension School
Prof. Zoran B. Djordjević
2. “The Ego is a veil between humans and God.” Rumi
What is this ego that we need to identify and transcend?
Egotism is an inflated opinion of one’s own qualities and importance: an amplified
view of the self. It is a destructive force that we can recognize in text using
Deep Learning.
We will mainly use spaCy’s prebuilt statistical neural network models for Python
to perform tasks on English text. We’ll also train spaCy’s CNN model with our
own data (egoistic and non-egoistic sentences) to introduce new entities for NER
(Named Entity Recognition). Other Python NLP libraries used in this project are
NLTK and Gensim.
We’ll define 8 different methods to detect egotism in text.
What is or is not egotistic may be subjective, but it should be fairly easy to
reflect a different definition in our detection methods. See the project report
for more detail on our definitions.
3. Pre-processing
Cleanup
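import re
import preprocess_kgptalkie as ps

def get_clean(x):
    # Lowercase, drop backslashes, and turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    x = ps.make_base(x)  # reduce words to their base form
    # Collapse any character repeated three or more times into one
    x = re.sub(r"(.)\1{2,}", r"\1", x)
    return x

DOCUMENT_cleaned = get_clean(DOCUMENT)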
Summarize
# gensim.summarization was removed in Gensim 4.0, so this needs gensim < 4.0
from gensim.summarization import summarize
print(summarize(DOCUMENT, word_count=75, split=False))
4. Five Documents to Examine
DOCUMENT: A CNN news item, used as a reference point. We expect this to be a
neutral document.
DOCUMENT_ego_a: A statement from President Trump about President-Elect
Biden. We expect this to be Egoistic!
DOCUMENT_ego_b: A text segment from Donald Trump’s book, The Art of the Deal.
We expect this to be Egoistic!
DOCUMENT_no_ego_a: A short article from Eckhart Tolle, the most popular
spiritual author in the United States and best-selling author of The Power of Now.
We expect this to be non-Egoistic!
DOCUMENT_no_ego_b: Another short article from Eckhart Tolle.
We expect this to be non-Egoistic!
5. Method 1 – entities frequency
The more entities a document contains, the more egoistic it is. Use spaCy to find all entities.
Frequency of top 5 entities:
Average of DOCUMENT_ego: 17
Average of DOCUMENT_no_ego: 2.5
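A minimal sketch of the counting step, assuming the score is the summed frequency of the k most common entity strings (the slide does not say exactly how the top-5 figure is aggregated). The function name and the sample list are hypothetical; in practice the list would come from spaCy, for example [ent.text for ent in nlp(text).ents].

```python
from collections import Counter

def top_entity_frequency(entity_texts, k=5):
    """Sum the counts of the k most frequent entity strings (case-insensitive)."""
    counts = Counter(text.strip().lower() for text in entity_texts)
    return sum(count for _, count in counts.most_common(k))

# Hypothetical entity mentions standing in for spaCy output
sample_entities = ["Trump", "Biden", "Trump", "America", "trump", "Biden", "2020"]
print(top_entity_frequency(sample_entities, k=2))  # prints 5
```

Normalizing case before counting is a design choice here, so that "Trump" and "trump" are treated as one entity.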
6. Method 2 - tense
Ego likes the past and the future, and dissolves in the present. Use NLTK word_tokenize to find
the tense of a document (from word inflections). The less present tense, the more egotistic.
Present %:
Average of DOCUMENT_ego: 69
Average of DOCUMENT_no_ego: 71.5
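One way to turn inflections into a present-tense percentage is to count Penn Treebank verb tags. This is a sketch under assumptions: the tag groupings, the list of future modals, and the function name are illustrative, not taken from the project. The input is a list of (word, tag) pairs such as nltk.pos_tag(nltk.word_tokenize(text)) returns.

```python
PRESENT_TAGS = {"VBP", "VBZ", "VBG"}      # present-tense and present-participle verbs
PAST_TAGS = {"VBD", "VBN"}                # past-tense and past-participle verbs
FUTURE_MODALS = {"will", "shall", "'ll"}  # assumed markers of future tense

def present_percent(tagged_tokens):
    """Percentage of tensed verb tokens that are in the present tense."""
    present = past = future = 0
    for word, tag in tagged_tokens:
        if tag in PRESENT_TAGS:
            present += 1
        elif tag in PAST_TAGS:
            past += 1
        elif tag == "MD" and word.lower() in FUTURE_MODALS:
            future += 1
    total = present + past + future
    return 100.0 * present / total if total else 0.0

# Hypothetical pre-tagged sentence: "I am walking and he left but will return"
tagged = [("I", "PRP"), ("am", "VBP"), ("walking", "VBG"), ("and", "CC"),
          ("he", "PRP"), ("left", "VBD"), ("but", "CC"), ("will", "MD"),
          ("return", "VB")]
print(present_percent(tagged))  # prints 50.0
```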
7. Method 3 - plural
The lower the percentage of plural verbs and nouns in use, the more egoistic the text.
Plural percent:
Average of DOCUMENT_ego: 6.5
Average of DOCUMENT_no_ego: 3.5
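A rough sketch of a plural percentage, assuming Penn Treebank tags and covering nouns only (plural verbs are not marked cleanly by these tags, so they are left out here). Using the noun count as the denominator is also an assumption; the function name is hypothetical.

```python
NOUN_TAGS = {"NN", "NNP", "NNS", "NNPS"}
PLURAL_NOUN_TAGS = {"NNS", "NNPS"}

def plural_percent(tagged_tokens):
    """Percentage of noun tokens that carry a plural tag."""
    noun_tags = [tag for _, tag in tagged_tokens if tag in NOUN_TAGS]
    if not noun_tags:
        return 0.0
    plural = sum(1 for tag in noun_tags if tag in PLURAL_NOUN_TAGS)
    return 100.0 * plural / len(noun_tags)

# Hypothetical pre-tagged sentence: "The teams share one goal"
tagged_nouns = [("The", "DT"), ("teams", "NNS"), ("share", "VBP"),
                ("one", "CD"), ("goal", "NN")]
print(plural_percent(tagged_nouns))  # prints 50.0
```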
8. Method 4 - pronoun
Use spaCy pronoun detection to find separationist (I, mine, yours) vs inclusive (we,
ours) pronouns. Egoistic documents use fewer inclusive pronouns.
Inclusive pronoun percent:
Average of DOCUMENT_ego: 3.5
Average of DOCUMENT_no_ego: 17
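The pronoun split can be sketched without spaCy by matching words against two small sets. The slide names only I, mine, yours, we, and ours; the extra pronouns, the regex tokenizer, and the choice of denominator (inclusive plus separationist pronouns) are assumptions made for illustration.

```python
import re

# Pronoun sets extended beyond the slide's examples (assumption)
SEPARATIONIST = {"i", "me", "my", "mine", "myself",
                 "you", "your", "yours", "yourself"}
INCLUSIVE = {"we", "us", "our", "ours", "ourselves"}

def inclusive_pronoun_percent(text):
    """Share of inclusive pronouns among all inclusive and separationist pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    inclusive = sum(1 for w in words if w in INCLUSIVE)
    separationist = sum(1 for w in words if w in SEPARATIONIST)
    total = inclusive + separationist
    return 100.0 * inclusive / total if total else 0.0

print(inclusive_pronoun_percent("I built my company and we share our success."))  # prints 50.0
```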
9. Method 5 - readability
Ego likes high complexity in readability. Use the spacy_readability library to score a
document with 2 different methods, then simplify the average of those scores to
Easy, Hard, or Very Hard readability.
Average of DOCUMENT_ego: Hard readability
Average of DOCUMENT_no_ego: Hard readability
10. Method 6 - sentiment
Ego likes negativity. Use NLTK SentimentIntensityAnalyzer to find the sentiment.
Average of DOCUMENT_ego: neutral
Average of DOCUMENT_no_ego: neutral
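A small sketch of how a sentiment label could be derived from the analyzer's compound score, which is available as SentimentIntensityAnalyzer().polarity_scores(text)["compound"]. The 0.05 cut-off is the conventional VADER threshold and is assumed here; the slides do not state which threshold the project used.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(sentiment_label(0.01))   # prints neutral
print(sentiment_label(-0.60))  # prints negative
```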
11. Method 7 - emotion
Ego likes Angry, Surprise, Sad, and Fear, but not Joy. Use text2emotion to detect
emotions, then calculate score = Happy - (Angry + Surprise + Sad + Fear),
which ranges from +1 (max happy) to -1 (min happy).
Emotion score:
Average of DOCUMENT_ego: -0.85
Average of DOCUMENT_no_ego: -0.65
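The score formula can be written as a short helper. This sketch assumes the input is a dictionary keyed by Happy, Angry, Surprise, Sad, and Fear, which is the shape text2emotion's get_emotion returns; the function name and sample values are hypothetical.

```python
NEGATIVE_EMOTIONS = ("Angry", "Surprise", "Sad", "Fear")

def emotion_score(emotions):
    """score = Happy - (Angry + Surprise + Sad + Fear)."""
    negative = sum(emotions.get(name, 0.0) for name in NEGATIVE_EMOTIONS)
    return emotions.get("Happy", 0.0) - negative

# Hypothetical emotion proportions that sum to 1
sample = {"Happy": 0.25, "Angry": 0.25, "Surprise": 0.0, "Sad": 0.25, "Fear": 0.25}
print(emotion_score(sample))  # prints -0.5
```

Because the emotion proportions sum to at most 1, the score stays between -1 and +1.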
12. Method 8 – training NER
Train spaCy with sentences so it learns two new entities, Egoistic and non-Egoistic
(NER).
For training the egoistic entity, we need egoistic words. Each word must be used in
two sentences: one with an egoistic context and the other with a non-egoistic
or neutral context.
For example:
“complain” is an egoistic word
Egoistic sentence is “She had done nothing but cry, complain and faint since
this ordeal had begun”
Non-Egoistic sentence is “I have nothing to complain about”
13. Method 8 - training NER
We start with seed words for both Egoistic and Non-Egoistic entities. We then find
synonyms and antonyms for both sets, and later combine them into our
lists of Egoistic and non-Egoistic words.
For example:
complain: criticize (synonym), applaud (antonym)
gratitude: grateful (synonym), resentment (antonym)
Combined Egoistic list = complain, criticize, resentment
Combined Non-Egoistic list = gratitude, grateful, applaud
We can find thousands of words, but here we just select about 20 words from each
category to write sentences.
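The combination rule in the example can be sketched as follows: each list takes its own seeds, their synonyms, and the antonyms of the opposite seeds. The lookup dictionaries are hand-written from the example on this slide; how the project actually looked up synonyms and antonyms is not stated (WordNet via NLTK would be one option), and the function name is hypothetical.

```python
def expand(seeds, opposite_seeds, synonyms, antonyms):
    """Seeds + their synonyms + antonyms of the opposite seeds, without duplicates."""
    words = list(seeds)
    for word in seeds:
        words += synonyms.get(word, [])
    for word in opposite_seeds:
        words += antonyms.get(word, [])
    return list(dict.fromkeys(words))  # drop duplicates, keep order

# Lookup tables taken from the example on the slide
synonyms = {"complain": ["criticize"], "gratitude": ["grateful"]}
antonyms = {"complain": ["applaud"], "gratitude": ["resentment"]}
egoistic_seeds, non_egoistic_seeds = ["complain"], ["gratitude"]

print(expand(egoistic_seeds, non_egoistic_seeds, synonyms, antonyms))
# prints ['complain', 'criticize', 'resentment']
print(expand(non_egoistic_seeds, egoistic_seeds, synonyms, antonyms))
# prints ['gratitude', 'grateful', 'applaud']
```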
14. Method 8 - training NER
Training sentences for the Egoistic entity: on one line a word is used in an egoistic
context, and on the next line the same word is used in a non-egoistic context.
15. Method 8 - training NER
Use the spaCy Matcher to help with labeling, producing annotations such as
{'entities': [(25, 35, 'EGOISTIC')]}, then apply a little manual formatting to get
the final training text.
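The annotation format pairs a sentence with character offsets for each labeled span. The sketch below builds that structure with a plain regular expression instead of the spaCy Matcher used in the project, so it is a stand-in for the labeling step rather than the project's code; the function name is hypothetical.

```python
import re

def label_sentence(sentence, word, label):
    """Return (sentence, {'entities': [(start, end, label), ...]}) for whole-word matches."""
    pattern = r"\b" + re.escape(word) + r"\b"
    entities = [(m.start(), m.end(), label)
                for m in re.finditer(pattern, sentence, flags=re.IGNORECASE)]
    return (sentence, {"entities": entities})

sentence = ("She had done nothing but cry, complain and faint since "
            "this ordeal had begun")
example = label_sentence(sentence, "complain", "EGOISTIC")
print(example[1])  # prints {'entities': [(30, 38, 'EGOISTIC')]}
```

The end offset is exclusive, so sentence[30:38] is exactly "complain".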
16. Method 8 - training NER
Training
17. Method 8 – EGOISTIC entity
Finding our new EGOISTIC entity in our documents
Average number of EGOISTIC entities for DOCUMENT_ego 8.5
Average number of EGOISTIC entities for DOCUMENT_no_ego 5
18. Method 8 – training NER (non-EGOISTIC)
A different set of words is used to write paired sentences for the non-EGOISTIC entity.
19. Method 8 – non-EGOISTIC entity
Finding our new non-EGOISTIC entity in our documents
Average number of non-EGOISTIC entities for DOCUMENT_ego 0.5
Average number of non-EGOISTIC entities for DOCUMENT_no_ego 2.5
20. Final Tally
Scores from all methods. (< means a lower number is better, i.e. less egotism.)
We see that 6 (in bold) out of 9 indicators correctly differentiated between the two
kinds of documents. Methods 1 to 7 give one indicator each, and Method 8 gives two
(EGOISTIC and non-EGOISTIC entity counts).
There is room to improve each of the indicators for greater differentiation.
It is also possible to run more documents through the 9 indicators to gather
more rows, then feed those rows to a secondary NN.
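The tally can be checked from the averages reported on the method slides. In this sketch an indicator counts as correct when the no-ego average is strictly on the less-egotistic side of the ego average. The ordinal codes for readability (Hard = 1) and sentiment (neutral = 0) are assumptions, and since the original table is not reproduced here, this confirms only the count, not which cells were bold.

```python
# (indicator, DOCUMENT_ego average, DOCUMENT_no_ego average, lower_is_less_egotistic)
INDICATORS = [
    ("entity frequency", 17, 2.5, True),
    ("present tense %", 69, 71.5, False),
    ("plural %", 6.5, 3.5, False),
    ("inclusive pronoun %", 3.5, 17, False),
    ("readability", 1, 1, True),    # Hard vs Hard, assumed ordinal code
    ("sentiment", 0, 0, False),     # neutral vs neutral, assumed ordinal code
    ("emotion score", -0.85, -0.65, False),
    ("EGOISTIC entities", 8.5, 5, True),
    ("non-EGOISTIC entities", 0.5, 2.5, False),
]

def differentiates(ego, no_ego, lower_is_less_egotistic):
    """True when the no-ego documents score strictly on the less-egotistic side."""
    return no_ego < ego if lower_is_less_egotistic else no_ego > ego

correct = [name for name, ego, no_ego, lower in INDICATORS
           if differentiates(ego, no_ego, lower)]
print(len(correct), "of", len(INDICATORS))  # prints 6 of 9
```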
21. The End
The associated notebook is a very good training ground for deep learning in NLP.
It is very time consuming to generate a good set of labeled sentences to feed the
model. With more effort on labeled sentences, it should become easier to detect
egotism in text more accurately.
It is possible to feed the results of all indicators to yet another Deep Learning
model and expect higher accuracy.
Future Enhancements:
Voice to text
Web based
Individual method scoring improvements
Train with more labeled sentences
Upgrade to spaCy 3.0 and use spacy-transformers with pretrained transformers like
BERT
Resume Enhancer
22. “The sage battles his own ego, the fool battles
everyone else’s” - Rumi
23. YouTube URLs, Last Page
Two minute (short): https://youtu.be/9DYvJWaepc8
15 minutes (long): https://youtu.be/KZqg6KqUyMg