The document surveys several studies that infer personality types from social media data such as tweets.
One study analyzed over 1.7 million tweets to predict Myers-Briggs Type Indicator (MBTI) personality types, finding that a combination of word vectors, part-of-speech tags, and n-grams achieved over 65% accuracy on average. Another study used over 960,000 tweets to classify MBTI types, reaching over 99% training accuracy with support vector machines and logistic regression but seeing a sharp drop on test data, likely due to noise in tweets. A third study compiled a dataset of 1.2 million tweets from 1,500 users who self-reported their MBTI types.
5. WHY PERSONALITY PREDICTION?
Areas directly affected by a user's personality:
1. Marketing.
2. Recommendation Systems.
3. Customized web pages, advertisements and products.
4. Customized search engines and user experience.
5. Understanding criminal and psychopathic behaviors.
6. Sentiment analysis and clustering of text.
By Joud Khattab 5
6. LITERATURE SURVEY
1) Understanding Personality through Social Media:
Y.Wang et al. (2016), Department of Computer Science, Stanford University.
2) Detection of MBTI via Text-Based Computer-Mediated Communication:
D. Brinks et al. (2012), Department of Electrical Engineering, Stanford University.
3) Personality Traits on Twitter:
B. Plank et al. (2015), Center for Language Technology, University of Copenhagen.
4) Identifying Personality Types Using Document Classification Methods:
M. Komisin et al. (2012), Department of Computer Science, University of North Carolina Wilmington.
8. DATA SET
(Y. WANG, 2016)
Twitter dataset:
GNIP APIs.
around 90,000 users.
Extracting and filtering all personality-related tweets from 2006 to 2015.
The most recent tweets for all the 90,000 users.
1.7 million tweets that contain the personality codes.
9. DATA CLEANING
(Y. WANG, 2016)
1. Positive Tweets:
@ProfCarol Just wondering, what’s your type? I’m an ENFJ
@whitneyhess that’s an interesting test.. I got ENTP and it seems pretty accurate IMO
@megfowler I’m INTP according to this http://similarminds.com/jung.html
2. Negative Tweets:
I’ll bet that Jeremiah @jowyang is an ESTJ
@mark ENTJ You should have known... http://typelogic.com/entj.html
I love my wife. Even though she’s INFP
Out of the 1.7M tweets containing personality codes, 120K tweets were retained.
10. SOCIAL MEDIA DATA DISADVANTAGE
(Y. WANG, 2016)
Language on social media has richer content, which makes linguistic analysis tools perform poorly.
Each tweet is limited to 140 characters and contains hashtags, at-mentions, URLs and emoticons.
People tend to use shortened versions of phrases: “iono” means “I don’t know”.
Lack of conventional orthography.
Collecting personality data is costly.
12. FEATURE SELECTION
(Y. WANG, 2016)
1) Bag of N-Grams.
2) Part-Of-Speech Tags.
3) Word Vectors.
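As a minimal sketch of the first feature family, assuming scikit-learn is available (the paper's exact pipeline is not reproduced here; the sample texts are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# One document per user: all of that user's tweets joined together.
user_docs = [
    "just took the test again, always the same result",
    "love planning trips with friends! so excited",
]

# Bag of n-grams (unigrams through trigrams), as in feature family 1.
vectorizer = CountVectorizer(ngram_range=(1, 3), lowercase=True)
X = vectorizer.fit_transform(user_docs)

print(X.shape[0])  # one row per user
```

The same user-level document matrix then feeds the POS-tag and word-vector features described on the following slides.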
13. N-GRAM
(Y. WANG, 2016)
[Figure: top correlated unigrams for Thinking vs. Feeling, and top correlated bigrams for Introversion vs. Extroversion.]
14. POS TAGGING
(Y. WANG, 2016)
A Twitter POS tagger with 25 distinctive tags was used.
Common nouns are a good indicator of personality: people who use common nouns more often tend to be of the Extroversion, Intuition, Thinking, or Judging type.
Introverted people use more pronouns but fewer common nouns.
Interjections (e.g., “lol”, “haha”, “FTW”, “yea”) are more likely to be used by the Sensing and Perceiving types.
Emoticons are more likely to be used by the Sensing and Feeling types.
Numbers are more likely to be used by the Sensing and Thinking types.
Extroverted people are more likely to use hashtags.
15. WORD COUNT
(Y. WANG, 2016)
1) Average word vectors:
Average the vectors of all the words that appear in a user's tweets to represent that user.
2) Weighted average word vectors:
Take a weighted average of the vectors of the words that appear in a user's tweets, weighted by their TF-IDF values; this weighted vector then represents that user.
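A sketch of the two user-representation schemes above, assuming pretrained word vectors are available as a dict and TF-IDF weights have already been computed (all names and values here are illustrative toys):

```python
import numpy as np

# Illustrative pretrained word vectors; in the paper these come from a
# trained embedding model, here they are toy 3-d vectors.
vectors = {"coffee": np.array([1.0, 0.0, 0.0]),
           "travel": np.array([0.0, 1.0, 0.0]),
           "lol":    np.array([0.0, 0.0, 1.0])}

def average_vector(tokens):
    """Scheme 1: plain average of the vectors of a user's words."""
    vs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vs, axis=0)

def weighted_average_vector(tokens, tfidf):
    """Scheme 2: average weighted by each word's TF-IDF value."""
    vs = [tfidf[t] * vectors[t] for t in tokens if t in vectors]
    return np.sum(vs, axis=0) / sum(tfidf[t] for t in tokens if t in vectors)

user_tokens = ["coffee", "travel", "lol"]
print(average_vector(user_tokens))  # each component 1/3
```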
16. MODEL SELECTION
(Y. WANG, 2016)
1. Logistic Regression model with 10-fold cross-validation.
2. Random Forest and SVM.
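The evaluation setup above can be sketched with scikit-learn's cross-validation utilities (the feature matrix here is a random stand-in for the user features from the previous slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the user feature matrix and one binary MBTI label
# (e.g., E vs. I); real features come from n-grams, POS tags, vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Logistic regression evaluated with 10-fold cross-validation.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10)
print(len(scores))  # 10 folds
```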
17. MODEL RESULTS
(Y. WANG, 2016)
Classifier                     E vs I   N vs S   T vs F   P vs J   Average
Word Vector                    67.9%    64.3%    67.3%    60.8%    65.1%
Bag of n-grams                 63.1%    58.8%    62.1%    58.8%    60.7%
Unigram                        61.7%    58.1%    60.9%    58.2%    59.7%
Bigram                         60.9%    56.9%    60.7%    57.3%    59.0%
Trigram                        61.3%    56.7%    59.3%    57.0%    58.6%
POS Tag                        59.3%    57.5%    60.3%    56.9%    58.5%
POS + n-grams                  62.8%    60.7%    63.3%    59.6%    61.6%
POS + n-grams + Word Vector    69.1%    65.3%    68.0%    61.9%    66.1%
18. DETECTION OF MBTI VIA TEXT BASED
COMPUTER-MEDIATED COMMUNICATION
D. Brinks et al. (2012)
Department of Electrical Engineering
Stanford University
19. DATA SET
(D. BRINKS, 2012)
The Twitter API was used to collect tweets that include an MBTI abbreviation.
6,358 users with a total of 960,715 tweets.
Multiple levels of data elimination were applied to remove improper data.
20. DATA CLEANING
(D. BRINKS, 2012)
Many users labeled “INTP” weren’t referencing their MBTI; instead, they had simply misspelled “into”.
Any user whose tweets contained two or more different MBTI types was rejected.
Numbers, links, @<user> mentions, and MBTI codes were replaced with “NUMBER”, “URL”, “AT_USER”, and “MBT” respectively.
Contractions were replaced by their expanded form.
Words were converted to lowercase.
Finally, all of a user’s tweets were aggregated into a single text block.
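The cleaning steps above map naturally onto a few regex substitutions; a sketch with Python's `re` module (the contraction table is a tiny stand-in for the full one):

```python
import re

# Illustrative token substitutions following the cleaning steps above;
# the contraction table here is a tiny stand-in for the full one.
CONTRACTIONS = {"i'm": "i am", "don't": "do not"}
MBTI = r"\b[EI][NS][TF][JP]\b"   # note: does not match the word "into"

def clean_tweet(text):
    text = text.lower()
    for c, full in CONTRACTIONS.items():          # expand contractions
        text = text.replace(c, full)
    text = re.sub(r"https?://\S+", "URL", text)   # links -> URL
    text = re.sub(r"@\w+", "AT_USER", text)       # @<user> -> AT_USER
    text = re.sub(MBTI, "MBT", text, flags=re.I)  # MBTI codes -> MBT
    text = re.sub(r"\d+", "NUMBER", text)         # numbers -> NUMBER
    return text

print(clean_tweet("I'm INTP, see http://example.com @bob 42"))
# -> i am MBT, see URL AT_USER NUMBER
```

Because the MBTI pattern requires a final J or P, "into" survives cleaning untouched, which matches the misspelling pitfall noted above.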
22. PROCESSING PARAMETERIZATION
(D. BRINKS, 2012)
1) Porter Stemming.
2) Emoticon Substitution.
3) Minimum Token Frequency.
4) Minimum User Frequency.
5) Term Frequency Transform.
6) Inverse Document Frequency Transform.
23. TRAINING ACCURACY BY CLASSIFIER
(D. BRINKS, 2012)
Classifier                                         E vs I   N vs S   T vs F   P vs J   Average
Multinomial Event Model Naive Bayes                96.0%    83.4%    84.6%    75.9%    85.0%
L2-regularized logistic regression (primal)        99.8%    99.8%    100.0%   99.8%    99.9%
L2-regularized L2-loss SV classification (dual)    99.8%    99.9%    99.9%    99.9%    99.9%
L2-regularized L2-loss SV classification (primal)  99.8%    99.9%    99.9%    99.9%    99.9%
L2-regularized L1-loss SV classification (dual)    99.9%    99.9%    99.9%    99.9%    99.9%
SV classification by Crammer and Singer            100.0%   100.0%   100.0%   100.0%   100.0%
L1-regularized L2-loss SV classification           100.0%   100.0%   100.0%   100.0%   100.0%
L1-regularized logistic regression                 99.9%    99.9%    99.8%    99.9%    99.9%
L2-regularized logistic regression (dual)          100.0%   100.0%   100.0%   100.0%   100.0%
24. HIGH VARIANCE SOLUTIONS
(D. BRINKS, 2012)
1. Get more data:
Unfortunately, Twitter places a cap on data retrieval requests.
Even after tripling the number of collected tweets, performance remained constant.
2. Decreasing the feature set size:
Modifying the preprocessing steps.
The number of features fed to the classifier was parameterized to find the optimal feature count.
Several of the transforms detailed above were added to the classifier.
The algorithm was modified to use confidence metrics and to decide only for users about which it had a strong degree of certainty.
However, none of these options improved testing behavior to any significant degree.
25. PERFORMANCE BY CLASSIFIER
(D. BRINKS, 2012)
Classifier                                         E vs I   N vs S   T vs F   P vs J   Average
Multinomial Event Model Naive Bayes                63.9%    74.6%    60.8%    58.5%    64.5%
L2-regularized logistic regression (primal)        60.3%    70.7%    59.4%    56.1%    61.6%
L2-regularized L2-loss SV classification (dual)    56.9%    67.5%    59.3%    54.1%    59.5%
L2-regularized L2-loss SV classification (primal)  58.8%    69.5%    59.0%    55.9%    61.0%
L2-regularized L1-loss SV classification (dual)    56.8%    67.6%    59.6%    54.5%    59.7%
SV classification by Crammer and Singer            56.8%    67.7%    59.4%    54.5%    59.6%
L1-regularized L2-loss SV classification           59.4%    68.3%    56.8%    56.1%    60.2%
L1-regularized logistic regression                 60.9%    70.5%    58.5%    56.3%    61.6%
L2-regularized logistic regression (dual)          59.2%    69.6%    59.0%    55.0%    60.7%
26. DATA PROBLEM
(D. BRINKS, 2012)
The main reason the classifier did not achieve better performance is that a large portion of tweets are noise with respect to MBTI:
Twitter imposes a 140-character limit on each tweet, so users are forced to express themselves succinctly.
A large percentage of tokens in tweets are not English words but Twitter handles being retweeted, or URLs. Thus, while a user’s tweet set may contain a thousand tokens, a significant subset is unique to that individual user and cannot be used for correlation.
Due to retweeting, a user’s tweet may not express his or her own thoughts.
27. COMPARISON WITH HUMAN EXPERTS
(D. BRINKS, 2012)
Spectrum   Human 1   Human 2   MNEMNB
E vs I     50.0%     40.0%     55.0%
N vs S     50.0%     90.0%     90.0%
T vs F     80.0%     65.0%     55.0%
P vs J     60.0%     50.0%     65.0%
Average    60.0%     61.3%     66.3%
28. PERSONALITY TRAITS ON TWITTER
B. Plank et al. (2015)
Center for Language Technology
University of Copenhagen
29. DATA SET
(B. PLANK, 2015)
Corpus of 1.2M tweets.
1,500 users that self-identify with an MBTI type.
Open-source code and data set.
32. MBTI DISTRIBUTION IN TWITTER CORPUS VS. GENERAL US POPULATION
[Figure: bar chart comparing MBTI type frequencies (0-18%) in the Twitter corpora of Papers 1-3 against the general US population, across all 16 types from ISTP to INTJ.]
33. CLASSIFIER
(B. PLANK, 2015)
Task                                       Classifier   E vs I   N vs S   T vs F   P vs J   Average
Accuracy for four discrimination tasks     Majority     64.1%    77.5%    58.4%    58.8%    64.7%
                                           System       72.5%    77.4%    61.2%    55.4%    66.6%
Prediction performance for four tasks,     Majority     64.9%    79.6%    51.8%    59.4%    63.9%
controlled for gender                      System       72.1%    79.5%    54.0%    58.2%    66.0%
34. PREDICTIVE FEATURES
(B. PLANK, 2015)
INTROVERT
• someone (91%)
• probably (89%)
• favorite (83%)
• stars (81%)
• b (81%)
• writing (78%)
• , the (77%)
• status count < 5000 (77%)
• lol (74%)
• but i (74%)
EXTROVERT
• pull (96%)
• mom (81%)
• travel (78%)
• don’t get (78%)
• when you’re (77%)
• posted (77%)
• #HASHTAG is (76%)
• comes to (72%)
• tonight ! (71%)
• join (69%)
THINKING
• must be (95%)
• drink (95%)
• red (91%)
• from the (89%)
• all the (88%)
• business (85%)
• to get a (81%)
• hope (81%)
• june (78%)
• their (77%)
FEELING
• out to (88%)
• difficult (87%)
• the most (85%)
• couldn’t (85%)
• me and (80%)
• in @USER (80%)
• wonderful (79%)
• what it (79%)
• trying to (79%)
• ! so (78%)
35. IDENTIFYING PERSONALITY TYPES USING
DOCUMENT CLASSIFICATION METHODS
M. Komisin et al. (2012)
Department of Computer Science
University of North Carolina Wilmington
36. DATA SET
(M. KOMISIN, 2012)
Data collected as part of a graduate course:
Students took the MBTI Step II.
Completed a Best Possible Future Self (BPFS) exercise.
Over 3 semesters, data was collected from 40 subjects.
Best Possible Future Self (BPFS) Writing Exercise:
This essay contains elements of self-description, present and future, as well as various contexts.
“Think about your life in the future. Imagine everything has gone as well as it possibly could. You have succeeded in accomplishing all your life goals. Think of this as the realization of all your dreams. Now, write about it.”
Many existing data sets are composed of written essays, which usually contain highly canonical
language, often on a specific topic.
Such controlled settings inhibit the expression of individual traits much more than spontaneous
language does.
37. PREPROCESSING
(M. KOMISIN, 2012)
1. Word stemming.
2. Stop-word removal.
3. Multiple data-smoothing techniques:
Lidstone smoothing.
Good-Turing smoothing.
Witten-Bell smoothing.
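Lidstone smoothing, for example, adds a small constant λ to every count so unseen words keep a non-zero probability. A minimal sketch (the toy counts and the vocabulary size are invented for illustration):

```python
from collections import Counter

def lidstone_prob(word, counts, vocab_size, lam=0.5):
    # Lidstone-smoothed unigram probability: (c(w) + lam) / (N + lam * V)
    n = sum(counts.values())
    return (counts[word] + lam) / (n + lam * vocab_size)

# Toy counts standing in for a stemmed, stop-word-filtered essay;
# vocab_size is the assumed number of word types, including unseen ones.
counts = Counter("dream goal succeed dream life".split())
p_seen = lidstone_prob("dream", counts, vocab_size=5)
p_unseen = lidstone_prob("career", counts, vocab_size=5)
```

Good-Turing and Witten-Bell instead reallocate mass based on how many types were seen once; the principle of reserving probability for unseen events is the same.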
38. MODEL SELECTION
(M. KOMISIN, 2012)
1. Naïve Bayes.
2. SVM.
3. Linguistic Inquiry and Word Count (LIWC).
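To show how the first model fits together with add-λ smoothing, here is a minimal multinomial Naïve Bayes for one MBTI dichotomy. This is a sketch under invented training texts, not the paper's implementation:

```python
import math
from collections import Counter

def train(docs, labels, lam=1.0):
    # Estimate class priors and per-class word counts from labeled docs.
    classes = set(labels)
    vocab = {w for d in docs for w in d.split()}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    return prior, counts, vocab, lam

def predict(doc, model):
    prior, counts, vocab, lam = model
    def score(c):
        # Log posterior up to a constant, with Lidstone-smoothed likelihoods.
        n = sum(counts[c].values())
        s = math.log(prior[c])
        for w in doc.split():
            s += math.log((counts[c][w] + lam) / (n + lam * len(vocab)))
        return s
    return max(prior, key=score)

# Tiny invented corpus for one axis (T vs F).
model = train(["logic system analysis", "feel love care"], ["T", "F"])
pred = predict("logic analysis", model)
```

With real data, each of the four MBTI letter pairs would get its own binary classifier of this shape.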
39. LIWC FEATURES
(PENNEBAKER, 2001)
STANDARD COUNTS:
Word count, words per sentence, type/token ratio, words captured, words longer than 6
letters, negations, assents, articles, prepositions, numbers.
Pronouns: 1st person singular, 1st person plural, total 1st person, total 2nd person, total
3rd person
PSYCHOLOGICAL PROCESSES:
Affective or emotional processes: positive emotions, positive feelings, optimism and
energy, negative emotions, anxiety or fear, anger, sadness.
Cognitive Processes: causation, insight, discrepancy, inhibition, tentative, certainty.
Sensory and perceptual processes: seeing, hearing, feeling.
Social processes: communication, other references to people, friends, family, humans.
40. LIWC FEATURES
(PENNEBAKER, 2001)
RELATIVITY:
Time, past tense verb, present tense verb, future tense verb.
Space: up, down, inclusive, exclusive.
Motion.
PERSONAL CONCERNS:
Occupation: school, work and job, achievement.
Leisure activity: home, sports, television and movies, music.
Money and financial issues.
Metaphysical issues: religion, death, physical states and functions, body states and
symptoms, sexuality, eating and drinking, sleeping, grooming.
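Mechanically, LIWC-style features are per-category token counts against hand-built word lists. A toy sketch of that extraction step (the three categories and their word lists are invented stand-ins, not the actual LIWC dictionary):

```python
# Illustrative stand-in categories, not the proprietary LIWC lexicon.
CATEGORIES = {
    "positive_emotion": {"happy", "love", "wonderful"},
    "work": {"job", "work", "business"},
    "future": {"will", "goal", "dream"},
}

def liwc_counts(text):
    # Count how many tokens of the text fall into each category.
    tokens = text.lower().split()
    return {cat: sum(t in words for t in tokens)
            for cat, words in CATEGORIES.items()}

feats = liwc_counts("I will love my dream job")
```

The real tool also normalizes by word count and handles word stems and multi-word entries, which this sketch omits.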
42. TEXT FEATURES OF BPFS ESSAYS
(M. KOMISIN, 2012)
Myers-Briggs Preference  Word Tokens  Unique Words  Word Tokens per Document  Unique Word Types per Document
Extraversion   10,428  1,859  401  72
Introversion    5,275  1,140  377  81
Sensing         7,913  1,455  377  69
Intuition       7,790  1,594  410  84
Thinking        6,879  1,348  362  71
Feeling         8,824  1,685  420  80
Judging         6,210  1,389  388  87
Perceiving      9,493  1,649  396  69
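The table's columns can be reproduced mechanically from a set of essays; a minimal sketch with two invented mini-documents in place of the 40 BPFS essays:

```python
def text_features(docs):
    tokens = [w for d in docs for w in d.lower().split()]
    return {
        "word_tokens": len(tokens),                  # total tokens
        "unique_words": len(set(tokens)),            # word types
        "tokens_per_doc": len(tokens) // len(docs),  # per-document average
    }

stats = text_features(["my future self is happy", "my goals are met"])
```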
43. TEXT FEATURES OF BPFS ESSAYS AFTER
PORTER AND STOP-WORD FILTERING
(M. KOMISIN, 2012)
Myers-Briggs Preference  Word Tokens  Unique Words  Word Tokens per Document  Unique Word Types per Document
Extraversion   5,631  1,376  217  53
Introversion   2,834    846  202  60
Sensing        4,335  1,067  206  51
Intuition      4,130  1,178  217  62
Thinking       3,718  1,015  196  53
Feeling        4,747  1,224  226  58
Judging        3,312  1,030  207  64
Perceiving     5,153  1,207  215  50
44. CLASSIFICATION RESULTS
(M. KOMISIN, 2012)
Summary of results with leave-one-out cross-validation on the full sample (n = 40).
Summary of results with leave-one-out cross-validation on a reduced sample (n = 30),
with the subjects having the lowest clarity scores removed.
45.
Comparison of the four research papers (data set kind and size, features and pre-processing, prediction models, evaluation metrics):
Y. Wang, 2016 — Twitter data set: 1.7M tweets from 90,000 users (120K tweets after preprocessing). Features: n-grams, POS tags, word vectors (average and weighted-average word vectors). Models: logistic regression (10-fold cross-validation), random forest, SVM. Evaluation: highest average is 66.1% for the combined features.
D. Brinks, 2012 — Twitter data set: 960K tweets from 6,000 users. Pre-processing: Porter stemming, emoticon substitution, minimum token frequency, minimum user frequency, term-frequency transform, inverse-document-frequency transform. Models: Naïve Bayes (multi-variate event model) with confidence metrics, SVM, logistic regression. Evaluation: highest average is 64.5%.
B. Plank, 2015 — Twitter data set: 1.2M tweets from 1,500 users. Features: gender, n-grams, count statistics (tweet count, followers, statuses, favorites). Model: logistic regression. Evaluation: highest average is 66.6% (T vs F is predicted with high reliability, while the other dimensions are very hard to model).
M. Komisin, 2012 — MBTI test and BPFS exercise data set: 4,800 texts. Features: specific word choices, semantic-category words; Porter stemming, stop-word removal, smoothing techniques. Models: Naïve Bayes, SVM, LIWC. Evaluation: highest average is 65%.
46. RESEARCH GAP
Twitter vs. documents:
Language on social media has richer content, which makes linguistic analysis tools
perform poorly.
Each tweet is limited to 140 characters and contains hashtags, at-mentions, URLs, and
emoticons.
Due to retweeting, a user’s tweet may not express his or her own thoughts.
The stop-word removal problem.
Collecting personality data is costly.
The MBTI distribution on Twitter discussed in the fourth paper.
47. PROPOSED WORK
Research →
Data Collection: Twitter corpus, letter corpus, text corpus →
Data Cleaning →
Data Preprocessing: Snowball stemmer, Porter stemmer, lemmatizer, stop words, emoji →
Model Selection: n-gram, POS tagger, Naïve Bayes →
Validation
48. MODEL SELECTION (TEXT CORPUS)
NAÏVE BAYES
Naïve Bayes gain for each letter pair, on three versions of the corpus:
Cleaned version:
Data Set  E/I     T/F      S/N
50/20     0.6     0.95     0.525
70/30     0.5 ↓   0.96 ↑   0.616 ↑
Cleaned version + stop words:
Data Set  E/I     T/F      S/N
50/20     0.6     0.975    0.525
70/30     0.5 ↓   0.983 ↑  0.57 ↑
Cleaned version + Snowball stemmer:
Data Set  E/I     T/F      S/N
50/20     0.6     0.975    0.525
70/30     0.5 ↓   0.967 ↑  0.583 ↑
49. MODEL SELECTION (LETTER CORPUS)
N-GRAM
1. cleaned version 1-gram first 20%
2. cleaned version 2-gram first 20%
3. cleaned version 3-gram first 20%
4. cleaned version snow stemmer 1-gram first 20%
5. cleaned version snow stemmer 2-gram first 20%
6. cleaned version snow stemmer 3-gram first 20%
7. cleaned version stop words 1-gram first 20%
8. cleaned version stop words 2-gram first 20%
9. cleaned version stop words 3-gram first 20%
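All nine n-gram variants above share one extraction step. A minimal sketch over an invented letter fragment:

```python
def ngrams(tokens, n):
    # All contiguous n-token sequences, joined as feature strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "dear friend i am well".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

The "first 20%" slices and the stemmed / stop-word-filtered variants would simply change which tokens are fed into this function.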