This document discusses using semantic analysis of social media posts to automatically compute personality traits based on the Five Factor Model. It presents background on using language to predict personality traits and describes word embeddings, which represent words as vectors. An experiment is described that uses a dataset of social media posts with known personality scores to train models such as SVM and LASSO to predict the Big Five personality traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism. The models are tested on datasets from MyPersonality and Twitter, achieving mean squared errors between 0.3 and 0.7. Future work proposes expanding the approach to larger datasets and additional features.
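The pipeline summarized above (post-level features, a regression model per trait, evaluation by mean squared error) can be sketched in a few lines of Python. This is a minimal illustrative stand-in, not the experiment's actual code: the feature vectors and openness scores below are synthetic, and an ordinary least-squares model trained by gradient descent substitutes for the SVM and LASSO models mentioned in the summary.

```python
# Minimal sketch of the experiment loop: fit a regression model on
# post-level feature vectors and score it with mean squared error (MSE).
# The features and trait scores below are synthetic; plain least squares
# trained by gradient descent stands in for SVM/LASSO.

def mse(y_true, y_pred):
    """Mean squared error between target and predicted trait scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_linear(X, y, lr=0.1, epochs=500):
    """Least-squares fit by per-sample gradient descent; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - target
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical 2-dimensional features (e.g., two embedding dimensions per
# post) and an openness score in [0, 1] for each training post.
X = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5], [0.9, 0.1]]
y = [0.80, 0.30, 0.55, 0.25]

w, b = fit_linear(X, y)
preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
print(f"training MSE: {mse(y, preds):.4f}")
```

In the reported experiment, one such model would be trained per trait and evaluated on held-out MyPersonality and Twitter data rather than on the training posts themselves.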
Semantic Analysis to Compute Personality Traits from Social Media Posts
1. Semantic Analysis to Compute
Personality Traits from Social
Media Posts
Master Degree in Computer Engineering
DAUIN
Supervisor
Prof. Maurizio Morisio
Internship Tutor
Dott. Ing. Giuseppe Rizzo
Candidate
Giulio Carducci s225395
2. Personality
Is it possible to automatically compute the personality of an
individual from the language he/she uses in social networks?
3. Background – Lexical Hypothesis
Lexical Hypothesis
• Personality characteristics that are important to
a group of people will eventually become a part
of that group's language.
• Main personality characteristics of an individual
are more likely to be encoded into language as a
single word.
Sir Francis Galton
Galton, F. Measurement of Character. Fortnightly Review, 1884, 36:179-185.
4. Five Factor Model (FFM)
• Openness
inventive/curious vs. consistent/cautious
• Conscientiousness
efficient/organized vs. easy-going/careless
• Extraversion
outgoing/energetic vs. solitary/reserved
• Agreeableness
friendly/compassionate vs. challenging/detached
• Neuroticism
sensitive/nervous vs. secure/confident
Background – Personality
5. Social networks are rich sources of
information
Personality prediction from social
network data
• Page likes
• Number of followers/following
• Choice of profile picture
• Personal profile information
• ...
Background – Personality and Social Networks
6. myPersonality
• Up to 95% prediction accuracy
• Average accuracy of 77%
Background – Personality and Social Networks
7. Word Embedding denotes a set of NLP techniques where words are mapped
to vectors of real numbers.
‘cat’ → (x_1, x_2, x_3, …, x_{n−1}, x_n),   n = 300
Word embeddings can boost the performances of many NLP applications,
and have two main advantages over traditional word vectorization
techniques:
• Dimensionality reduction
Vector space of dimension 𝑛 instead of the number of distinct words
• Contextual similarity
Similar words are mapped to vectors that are close in the vector space
Background – Semantic Analysis
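The contextual-similarity property can be illustrated with cosine similarity. A minimal sketch with invented toy 4-dimensional vectors (real fastText-style embeddings use n = 300; the words and values here are made up purely for illustration):

```python
import math

# Toy embeddings; in the thesis, vectors come from a pre-trained
# 300-dimensional word-embedding model. These values are invented.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: closer to 1 means closer in space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Contextually similar words map to nearby vectors:
assert cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]) > \
       cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
```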
10. • Big: 16,000,000 status updates of 115,000 users
• Small: 10,000 status updates of 250 users

Statistic                                       MIN   AVG   MAX
Status updates per user                           1    39   223
Words per status update                           1    14   113
Words per status update (after preprocessing)     0     7    57

Total words: 146,128 (72,896 after preprocessing)
Distinct words: 15,470 (15,185 after preprocessing)
Experimental Setup – Gold Standard
MyPersonality Dataset
11. • 1 million word vectors
• 𝑛 = 300
• Trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news
dataset
• Trained with the continuous-bag-of-words (cbow) model from word2vec
• Ordered by descending frequency
• 95.08% word coverage on myPersonality
Experimental Setup – Word Embeddings
13. • Conversion to lowercase
“Today is a #sunny day!” → “today is a #sunny day!”.
• Stop-words removal
“today is a #sunny day!” → “today #sunny day!”.
• Punctuation removal
“today #sunny day!” → “today sunny day”.
• Tokenization
“today sunny day” → [today] [sunny] [day].
• Short posts removal
All posts with fewer than 3 tokens are removed.
Removes noise and less-informative data
Experimental Setup – Text Preprocessing
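The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the stop-word list is a tiny assumed sample (a full list, e.g. NLTK's, would be used in practice).

```python
import string

# Illustrative stop-word sample; the real pipeline presumably uses a full list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}

def preprocess(post, min_tokens=3):
    """Lowercase, remove stop words and punctuation, tokenize,
    and drop posts with fewer than `min_tokens` tokens (returns None)."""
    tokens = post.lower().split()                          # lowercasing + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]          # punctuation removal
    tokens = [t for t in tokens if t]                      # drop emptied tokens
    if len(tokens) < min_tokens:                           # short-post removal
        return None
    return tokens

print(preprocess("Today is a #sunny day!"))  # → ['today', 'sunny', 'day']
```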
15. • Feed training data to the algorithm to compute a predictive model
• Training samples: (vec(status update), BIG5 score)
• Supervised Learning: for each training sample, we specify the ground truth label
Linear Regression
y = Xβ + ε
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_900 x_i900 + ε_i,   i = 1, 2, …, N

Least Absolute Shrinkage and Selection Operator (LASSO)
min_β  1/(2 · n_samples) · ‖Xβ − y‖₂² + α · ‖β‖₁

Support Vector Machines (SVM)
y = Xβ + ε
J(β) = (1/2) ‖β‖² + C Σ_{i=1..N} (ξ_i + ξ_i*),   minimize J(β)
Experimental Setup – Model Training
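The per-post transformation described in the accompanying notes (per-component min, max, and avg over all word vectors of a post, concatenated into a vector of size 3n, i.e. 900 for n = 300) can be sketched with toy n = 3 vectors:

```python
def transform(word_vectors):
    """Concatenate per-component min, max, and avg of a post's word
    vectors into one feature vector of size 3n."""
    n = len(word_vectors[0])
    mins = [min(v[i] for v in word_vectors) for i in range(n)]
    maxs = [max(v[i] for v in word_vectors) for i in range(n)]
    avgs = [sum(v[i] for v in word_vectors) / len(word_vectors)
            for i in range(n)]
    return mins + maxs + avgs   # concatenation → size 3n

# Two toy 3-dimensional "word vectors" standing in for real embeddings:
vectors = [[0.0, 1.0, 2.0],
           [1.0, 3.0, 0.0]]
print(transform(vectors))  # → [0.0, 1.0, 0.0, 1.0, 3.0, 2.0, 0.5, 2.0, 1.0]
```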
16. • Also called tuning of the hyperparameters
• Loss function: Mean Squared Error (MSE), the mean squared difference
between actual and predicted values, averaged over 10-fold cross-validation

MSE(P, A) = (1/n) Σ_{i=1..n} (p_i − a_i)²
P = (p_1, p_2, …, p_n)  predicted values
A = (a_1, a_2, …, a_n)  actual values

Algorithm  Parameter  Values
SVM        Kernel     linear, rbf, poly
           C          1, 10, 100
           Gamma      0.01, 0.1, 1, 10
           Degree     2, 3
LASSO      Alpha      1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10   (α ∈ ℝ⁺ = [0, +∞))
Experimental Setup – Parameters Optimization
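The MSE loss above is straightforward to implement; a minimal sketch (the trait scores in the example are hypothetical values on the [0, 5] scale):

```python
def mse(predicted, actual):
    """Mean Squared Error between predicted scores P and actual scores A."""
    assert len(predicted) == len(actual)
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# Hypothetical predicted vs. questionnaire scores for one trait:
print(mse([3.5, 2.0, 4.0], [3.0, 2.5, 4.5]))  # → 0.25
```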
17. • Further cleaning steps applied before preprocessing:
∙ Pure retweets removal (retweets with no added comment)
∙ URLs removal
∙ Mentions removal
• Preprocessing and transformation performed the same way as for status updates

Pipeline: Clean → Preprocess → Transform
Each tweet yields a 900-dimensional feature vector [x_1, x_2, x_3, …, x_899, x_900];
predicted trait scores fall in the range [0, 5].
Experimental Setup – Personality Prediction
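The extra tweet-cleaning steps can be sketched with regular expressions. The pure-retweet heuristic (an `RT @` prefix) and the exact URL/mention patterns are assumptions for illustration, not the thesis implementation:

```python
import re

def clean_tweet(tweet):
    """Apply the slide's cleaning steps before preprocessing.
    Returns None for pure retweets (no added comment)."""
    if tweet.startswith("RT @"):                   # pure-retweet removal (assumed heuristic)
        return None
    tweet = re.sub(r"https?://\S+", "", tweet)     # URL removal
    tweet = re.sub(r"@\w+", "", tweet)             # mention removal
    return " ".join(tweet.split())                 # normalize whitespace

print(clean_tweet("Great talk by @giulio! https://example.com slides up"))
# → 'Great talk by ! slides up'
```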
18. Best SVM configuration per trait:

Trait              Kernel  C   Gamma  MSE
Openness           rbf     1   1      0.3316
Conscientiousness  rbf     10  1      0.5300
Extraversion       rbf     10  1      0.7084
Agreeableness      rbf     10  1      0.4477
Neuroticism        rbf     10  10     0.5572

• Margin over Linear Regression: 8%
• Margin over LASSO: 1%

Transformation methods:

Method         MSE mean  MSE std
Sum            0.6942    0.4862
Maximum        0.5350    0.0228
Minimum        0.5342    0.0230
Average        0.5366    0.0246
Concatenation  0.5364    0.0188

• Low mean MSE
• Lowest MSE std
Concatenation is more stable than the other methods.
Experimental Results – Algorithm and Transformation
19. MyPersonality big: 16,000,000 status updates of 116,000 users
• Same approach as myPersonality small
• Training samples: (vec(status update), BIG5 score)
• Issues: training time, overfitting → downsampled to 5,000 / 10,000 / 15,000 / 20,000 samples

Mean Squared Error:

Dataset       OPE     CON     EXT     AGR     NEU
MP small      0.3316  0.5300  0.7084  0.4477  0.5572
MP big (10k)  0.4184  0.5101  0.6971  0.4799  0.6459
MP big (20k)  0.4181  0.5066  0.6816  0.4773  0.6444
Experimental Results – MyPersonality Big
20. Statistic     Value
Total users       24
Total tweets      18,473

Statistic                     MIN  AVG    MAX
Tweets per user               9    769.7  2,252
Avg words per tweet per user  5    6.8    8.8

• 26 participants
• 2 removed – not enough tweets
• Big Five Inventory (BFI, 44 items)
Experimental Results – Twitter Sample
21. Mean Squared Error:

Dataset                    OPE     CON     EXT     AGR     NEU
Twitter Sample (MP small)  0.3812  0.3129  0.3002  0.1319  0.2673
Twitter Sample (MP big)    0.3178  0.3236  0.4110  0.1362  0.2803
Literature*                0.4761  0.5776  0.7744  0.6241  0.7225

statuses/user_timeline
GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=username&count=200
• Returns: up to 200 tweets of username
• Format: JSON
• Rate limit: 1,500 requests / 15 minutes
*Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J. Our Twitter Profiles, Our Selves: Predicting Personality with Twitter. 180-185. doi:10.1109/PASSAT/SocialCom.2011.26.
Experimental Results – Twitter Sample
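As the notes explain, a user's trait score is the average of the model's predictions over all their cleaned, preprocessed tweets. A minimal sketch, where `predict` is a hypothetical placeholder standing in for the trained SVM model:

```python
def user_trait_score(tweet_vectors, predict):
    """Average the per-tweet predictions into one user-level trait score."""
    scores = [predict(v) for v in tweet_vectors]
    return sum(scores) / len(scores)

# Dummy model that "predicts" the first feature, for illustration only:
dummy_predict = lambda vec: vec[0]
print(user_trait_score([[3.0], [4.0], [2.0]], dummy_predict))  # → 3.0
```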
22. • Train word embeddings on textual data from social media
• Use a CNN for text transformation and prediction
• Expand the feature vector with additional semantic features
• Train multilingual word embeddings
• Test the approach on a bigger dataset
• Expand the Twitter sample
Future Work
In this thesis we analyze whether it is possible to automatically compute the personality of an individual by relying only on the language he uses in social networks.
It is the most widely accepted model of personality, and it defines five traits.
Low and high scores for the same trait indicate opposite tendencies
Users share a great amount of digital content, and personality has been successfully predicted from many different kinds of input data, as the recent Cambridge Analytica case showed.
The size of the vectors is usually set to 300, and there is a spatial relationship between them.
Word embeddings also present geometrical properties.
This is possible thanks to the similarity of status updates and tweets, which is further increased by additional tweet-processing steps.
Each status update is labeled with the personality scores of the user who wrote it.
We first test our approach on myPersonality small and then extend the analysis to myPersonality big.
For this reason we expect to lose some predictive power.
Before transforming status updates we preprocess them to remove noise and less-informative data.
Preprocessing segments a status update into a list of words.
We then compute, for each of the 300 vector components, the max, min, and avg among all the word vectors of the status update, and concatenate them into a feature vector of size 900 that we use to train the models.
This is a supervised learning approach because for each input vector we also specify the output value, that is, the personality trait score.
We carry out the optimization phase by training different models on the training set of myPersonality and estimating their performance with MSE.
We implement 10-fold cross-validation on the training set to test the models on the whole dataset.
We test 19 different combinations of the values reported in the table and observe that... So we use SVM to train the five predictive models.
To test the model on Twitter, we crawl the Twitter API to download all the tweets of a given user and clean them.
We compute the personality score of a user by averaging all the scores of his tweets.
We report the SVM configurations that performed the best in the optimization phase.
Mean error and mean standard deviation over the five traits.
We then extend the analysis to the whole myPersonality dataset of 16 million status updates by using the same algorithms and configurations used for myPersonality small.
We compare the results of the two datasets on the same task.
To test the models on Twitter, we devise an experiment involving 26 participants who answered a personality questionnaire and agreed to take part in it.
They share the same social and working environment.
We use Twitter data to compute the personality of the participants and compare it with the questionnaire results.
We compare our results with those obtained by a study in the literature.
This is probably because the Twitter user sample has very similar personality characteristics and is not diverse.