AI4Trust
ARTIFICIAL INTELLIGENCE
FOR BUILDING TRUST
MASTER’S DEGREE IN ARTIFICIAL INTELLIGENCE, PATTERN RECOGNITION AND DIGITAL IMAGING
UNIVERSITAT POLITÈCNICA DE VALÈNCIA
March 2, 2017
Francisco Manuel Rangel Pardo
AI4Trust-MIARFIDALC2017autoritas
8 changes that will transform humanity forever
1. Rewrite genetics
2. Discover new materials
3. Govern a disenchanted world
4. Teach machines to learn
5. Find a universal energy
6. Defeat infections
7. Grow crops with little water
8. Life on Mars
To build trust
TRUST
STRATEGY
INTELLIGENCE
DASHBOARD
ACTION
METHODS/TOOLS
TRAINING
Intelligence
We need to answer questions...
… to know what questions to ask
Easy!
Big data + artificial intelligence
In real time!!
Only about 2% of content is geotagged!!
Language variety identification to improve geotagging ----> Later on
Is the following sentence positive, negative, neutral or none?
"I really want Cataluña to be #independent!!"
It depends on the subjectivity of the receiver, e.g.
- Positive for Puigdemont
- Negative for Rajoy
But what about the stance of the author with respect to the main topic?
https://www.youtube.com/watch?v=YrqMEn-5Pi8
- Retrieve and store
- Evolution
- Words and topics
- Labelling
- Hashtags
- People
- Locations
- Brands
- Polarity, stance
- Users, relationships
- Gender, age
- Author profile
- ...
BIG DATA IS A BIG PROBLEM… AND A BIG OPPORTUNITY
tweets/second tweets/minute tweets/hour tweets/day
What is the profile of your organisation's community?
What is the affinity between political parties and the media?
Some applied technologies
- Age & gender identification ----> PAN@CLEF; EmoGraph
- Language variety identification ----> LDR + Word2Vec
- Language variety & gender identification ----> PAN 2017
- Stance & gender detection ----> IberEval 2017
Language Variety Identification
Language variety identification aims to detect linguistic variations in
order to classify different varieties of the same language.
Language variety identification may be considered an author profiling
task, besides a classification one, because the cultural idiosyncrasies
may influence the way users use the language (e.g. different expressions,
vocabulary…).
Differences with the state of the art
The goal is to discriminate between different varieties of the same language, as in the state of the art, but with the following differences:
- We focus on different varieties of Spanish, although we also tested our approach on a different set of languages.
- Instead of n-gram based representations, we propose a low dimensionality representation, which is helpful when dealing with big data in social media.
- We evaluate the proposed method on an independent test set generated from different authors in order to reduce possible overfitting.
- We make our dataset available to the research community (https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs).
A Low Dimensionality Representation (LDR)
Step 1. Term frequency - inverse document frequency (tf-idf) matrix:
- Each column is a vocabulary term t
- Each row is a document d
- w_ij is the tf-idf weight of term j in document i
- Each document also carries its assigned class c
Step 2. Class-dependent term weighting:
Step 3. Class-dependent document representation:
LDR features
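The formulas and the feature table for these slides did not survive extraction, so the following is only a minimal sketch of how the three LDR steps could fit together. It makes several assumptions that are not stated in the slides: the smoothed idf formula, the class weight W(t, c) defined as the tf-idf mass of a term inside a class divided by its mass over all classes, and the six per-class statistics (mean, standard deviation, minimum, maximum, a "prob" value and a "prop" value), chosen so that each class contributes 6 features. All function names and the toy sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def tfidf_rows(docs):
    """Step 1: one {term: tf-idf weight} dict per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            # Smoothed idf (an assumption) keeps every weight strictly positive.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return rows


def class_term_weights(rows, labels):
    """Step 2: W(t, c) = tf-idf mass of t in class c / tf-idf mass of t overall."""
    per_class = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for row, label in zip(rows, labels):
        for term, weight in row.items():
            per_class[label][term] += weight
            total[term] += weight
    return {
        label: {term: per_class[label][term] / total[term] for term in total}
        for label in sorted(set(labels))
    }


def ldr_vector(doc, weights):
    """Step 3: six statistics per class over the class weights of the doc's terms."""
    vector = []
    for label in sorted(weights):
        ws = [weights[label][t] for t in doc if t in weights[label]]
        if not ws:  # no known term: fall back to an all-zero block
            vector += [0.0] * 6
            continue
        vector += [
            mean(ws), pstdev(ws), min(ws), max(ws),
            sum(ws) / len(doc),  # "prob": total class weight over document length
            len(ws) / len(doc),  # "prop": share of the document's terms in the vocabulary
        ]
    return vector


# Toy training data, invented for this sketch (two varieties, two documents each).
train_docs = [
    "vos tenés que ir al laburo".split(),
    "che vos sabés dónde queda".split(),
    "vosotros tenéis que ir al curro".split(),
    "vale tío vosotros sabéis dónde queda".split(),
]
train_labels = ["AR", "AR", "ES", "ES"]

weights = class_term_weights(tfidf_rows(train_docs), train_labels)
vector = ldr_vector("vos tenés laburo".split(), weights)
print(len(vector))  # 6 statistics x 2 classes
```

Under these assumptions a document is described by 6 numbers per variety regardless of vocabulary size, so a 5-variety problem would yield 30 features. The resulting vectors can then be passed to any standard classifier.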
Alternative representations
- We use the common state-of-the-art representations based on n-grams. We iterated n from 1 to 10, and selected the 1,000, 5,000 and 10,000 most frequent n-grams. The best results were obtained with:
  - character 4-grams; the 10,000 most frequent
  - word 1-grams (bag-of-words); the 10,000 most frequent
  - word 2-grams; the 10,000 with the highest tf-idf
- Two variations of the continuous Skip-gram model (Mikolov et al.):
  - Skip-grams
  - Sentence Vectors
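The n-gram baselines in the list above can be sketched roughly as follows: extract character or word n-grams, keep the k most frequent ones as the vocabulary, and represent each document by its counts over that vocabulary. This is an illustrative sketch only; the function names and the toy texts are invented here, and the slides do not say how the original features were actually computed.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """All overlapping word n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_vocabulary(docs_ngrams, k):
    """The k most frequent n-grams over the whole collection."""
    counts = Counter(gram for grams in docs_ngrams for gram in grams)
    return [gram for gram, _ in counts.most_common(k)]


def count_vector(grams, vocabulary):
    """Represent one document as n-gram counts over a fixed vocabulary."""
    counts = Counter(grams)
    return [counts[gram] for gram in vocabulary]


# Toy texts, invented for this sketch.
texts = ["hola hola che", "hola vale"]
grams_per_doc = [char_ngrams(text, 4) for text in texts]
vocabulary = top_k_vocabulary(grams_per_doc, k=3)
vectors = [count_vector(grams, vocabulary) for grams in grams_per_doc]
print(vocabulary[0], vectors)
```

With k = 10,000 every document becomes a 10,000-dimensional vector, which is the cost that a low dimensionality representation tries to avoid.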
Maximizing the average of the log probability:
Using the negative sampling estimator:
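The formulas for this slide were lost in extraction. For reference, the standard Skip-gram objective and its negative sampling estimate, as usually written following Mikolov et al., are given below. The notation is the conventional one and is an assumption with respect to these slides: T is the number of words in the training sequence, c the context window size, k the number of negative samples, P_n(w) the noise distribution, and v and v' the input and output vectors of a word.

```latex
% Skip-gram: maximize the average log probability of the context words
\[
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
\]

% Negative sampling estimate of \log p(w_O \mid w_I)
\[
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
\left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
\]
```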
Hispablogs
https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
- Completely independent authors between training and test sets
- Manually collected by social media experts of Autoritas
Accuracy results with different machine learning algorithms
Significance of the results with respect to the two systems with the highest performance
The effect of the pre-processing
Accuracy obtained after removing words with a frequency equal to or lower than n
(a) Continuous scale (b) Non-continuous scale
The effect of the pre-processing
Number of words remaining after removing those with a frequency equal to or lower than n, and some examples of very infrequent words.
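The pre-processing step described on these two slides, removing every word whose frequency is equal to or lower than a threshold n, could look roughly like this. It is a sketch under the assumption that frequency means the raw count over the whole collection; the function name and toy documents are invented.

```python
from collections import Counter


def prune_rare_words(docs, n):
    """Remove every word whose corpus frequency is equal to or lower than n."""
    freq = Counter(word for doc in docs for word in doc)
    kept = {word for word, count in freq.items() if count > n}
    pruned = [[word for word in doc if word in kept] for doc in docs]
    return pruned, kept


# Toy documents, invented for this sketch.
docs = [["hola", "che", "hola"], ["hola", "vale"]]
pruned, vocabulary = prune_rare_words(docs, n=1)
print(len(vocabulary), pruned)
```

Sweeping n over a range of values and re-training at each step would give the kind of accuracy-versus-threshold curve the slides refer to.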
Evaluation results
Representation     Accuracy (%)
Skip-gram          72.2
LDR                71.1
SenVec             70.8
BOW                52.7
Char 4-grams       51.5
EmoGraph           39.3
tf-idf 2-grams     32.2
Random baseline    20.0
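To put the table in perspective, the relative gain of LDR over the n-gram representations can be computed directly from the accuracies listed above. The short sketch below only does that arithmetic; the helper name is invented, and the accuracy values are copied from the table.

```python
# Accuracies (%) copied from the evaluation results table.
accuracy = {
    "Skip-gram": 72.2, "LDR": 71.1, "SenVec": 70.8, "BOW": 52.7,
    "Char 4-grams": 51.5, "EmoGraph": 39.3, "tf-idf 2-grams": 32.2,
    "Random baseline": 20.0,
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for name in ("BOW", "Char 4-grams", "tf-idf 2-grams"):
    gain = relative_gain(accuracy["LDR"], accuracy[name])
    print(f"LDR vs {name}: {gain:+.1f}% relative")
```

Against BOW, the strongest n-gram representation in the table, the relative gain rounds to 35%.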
Error analysis
Confusion matrix of the 5-class classification
F1 values for identification as the corresponding language variety vs. others
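The per-variety F1 values mentioned above can be derived from a confusion matrix by treating each class in turn as "this variety vs. others". The sketch below shows that computation on an invented 3-class matrix (rows are true classes, columns are predictions); the labels and counts are placeholders, not the results from the slides.

```python
def one_vs_rest_f1(confusion, labels):
    """F1 per class, treating each class as 'this class vs. all others'."""
    scores = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # predicted i, true class differs
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy_from_confusion(confusion):
    """Overall accuracy: diagonal mass over total mass."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Invented counts for three placeholder classes.
labels = ["A", "B", "C"]
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
scores = one_vs_rest_f1(confusion, labels)
print(scores, accuracy_from_confusion(confusion))
```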
Features contribution
Accuracy obtained with different combinations of features
Cost analysis
Complexity of obtaining the features:
l: number of varieties
n: number of terms of the document
m: number of terms in the document that coincide with some term in the vocabulary
n ≥ m and l << n
Number of features:
Representation     # Features
LDR                30
Skip-gram          300
SenVec             300
EmoGraph           1,100
BOW                10,000
Char 4-grams       10,000
tf-idf 2-grams     10,000
Robustness
Results obtained with the development set of the DSLCC corpus from the Discriminating between Similar Languages task (2015)
NOTE: Significant results in bold
LDR for age and gender identification
Dataset      Genre         Lang  Age (EmoGraph)  Age (LDR)  Age Pos.  Gender (EmoGraph)  Gender (LDR)  Gender Pos.  No. of participants
PAN-AP-2013  Social Media  ES    66.24*          62.70      3         63.65*             60.75         6            21
PAN-AP-2014  Social Media  ES    45.9            38.16      6         68.6*              56.89         9            9
PAN-AP-2014  Social Media  EN    34.2*           31.63      6         53.4               51.42         9            10
PAN-AP-2014  Blogs         ES    46.4            46.43      3         64.3               50.00         5            9
PAN-AP-2014  Blogs         EN    46.2            38.46      3         71.3               67.95         1            10
PAN-AP-2014  Twitter       ES    58.9            56.67      2         73.3               63.33         2            8
PAN-AP-2014  Twitter       EN    45.5            52.60      1         72.1               67.53         3            9
Conclusions
LDR outperforms common state-of-the-art n-gram representations, with a relative increase in accuracy of about 35% (71.1% vs. 52.7% for BOW, the best n-gram representation).
LDR obtains competitive results compared with the two distributed representation-based approaches that employ the popular continuous Skip-gram model.
LDR remains competitive with different languages and media (DSLCC).
The dimensionality is reduced from thousands of features to only 6 per language variety, which makes it feasible to deal with big data in social media.
We have applied LDR to age and gender identification, with competitive results with respect to the well-performing EmoGraph.
HispaTweets
https://github.com/autoritas/RD-Lab/tree/master/doc/projects/Identificacion-de-la-Variedad-del-Lenguaje-para-la-Mejora-del-Geoposicionamiento-en-Social-Media
- 650 authors per language variety
- 865 tweets per author (avg)
- 7 Spanish varieties:
  - Argentina
  - Chile
  - Colombia
  - Mexico
  - Peru
  - Spain
  - Venezuela
PAN-AP 2017
GENDER AND LANGUAGE VARIETY IDENTIFICATION
ENGLISH
● Australia
● Canada
● Great Britain
● Ireland
● New Zealand
● United States
SPANISH
● Argentina
● Chile
● Colombia
● Mexico
● Peru
● Spain
● Venezuela
PORTUGUESE
● Brazil
● Portugal
ARABIC
● Egypt
● Gulf
● Levantine
● Maghrebi
http://pan.webis.de/clef17/pan17-web/author-profiling.html
- 500 authors per gender and language variety
- 100 tweets per author
IberEval 2017
http://stel.ub.edu/Stance-IberEval2017/index.html
GENDER AND STANCE DETECTION WITH RESPECT TO THE INDEPENDENCE OF CATALONIA
● Tweets in Spanish and Catalan
● Annotated with
○ Gender (male / female)
○ Stance (favor / against / none)
Language: Catalan
Stance: FAVOR
Gender: FEMALE
Tweet: "15 diplomàtics internacional observen les plebiscitàries, serà que interessen a tothom menys a Espanya #27S"
('15 international diplomats observe the plebiscitary elections; perhaps they are of interest to everybody except Spain #27S')
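An annotated tweet like the example above could be held in a small record type that rejects labels outside the annotation scheme. This is a hypothetical sketch: the class name, field names and label spellings are chosen here for illustration and are not the task's official data format.

```python
from dataclasses import dataclass

# Label sets taken from the annotation scheme described above;
# the exact spellings are an assumption of this sketch.
LANGUAGES = {"Catalan", "Spanish"}
STANCES = {"FAVOR", "AGAINST", "NONE"}
GENDERS = {"FEMALE", "MALE"}


@dataclass(frozen=True)
class AnnotatedTweet:
    """One tweet with its language, stance and gender annotations."""
    language: str
    stance: str
    gender: str
    text: str

    def __post_init__(self):
        # Validate each label against its allowed set.
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language}")
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance}")
        if self.gender not in GENDERS:
            raise ValueError(f"unknown gender: {self.gender}")


example = AnnotatedTweet(
    language="Catalan",
    stance="FAVOR",
    gender="FEMALE",
    text="15 diplomàtics internacional observen les plebiscitàries, "
         "serà que interessen a tothom menys a Espanya #27S",
)
print(example.stance, example.gender)
```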