Talk on Smart Listening given at the Master Universitario en Ingeniería Informática (MUIinf). As a case study, it explains geopositioning based on language variety identification.
21. consulting, s.a.autoritas
1 N
A real example: Spanish public television
5 brands
10 presenters
10 programs
10 campaigns
6,856 queries
8 channels
Concept-query mapping
Intelligence
Concepts
Queries
27. consulting, s.a.autoritas
If the URL includes the date, it’s easy to know it.
That’s relative. Is this URL from July or January?
http://xxx/07/01/2010/crawler-403-forbidden.html
The importance (and difficulty) of obtaining the right date
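The ambiguity in the URL above can be reproduced with a minimal sketch: the same path fragment parses cleanly under both day-first and month-first conventions. The function name `candidate_dates` and the two candidate formats are assumptions for illustration, not part of the original talk.

```python
from datetime import datetime

def candidate_dates(fragment):
    """Return every date the fragment could denote, keyed by convention."""
    # Assumption: only these two conventions are considered.
    formats = {"day/month/year": "%d/%m/%Y", "month/day/year": "%m/%d/%Y"}
    candidates = {}
    for label, fmt in formats.items():
        try:
            candidates[label] = datetime.strptime(fragment, fmt).date()
        except ValueError:
            # The fragment is not a valid date under this convention.
            pass
    return candidates

dates = candidate_dates("07/01/2010")
for label, value in dates.items():
    print(label, "->", value.isoformat())
```

Both conventions succeed for `07/01/2010`, so the URL alone cannot settle the question; a fragment such as `25/01/2010` is only valid day-first, which is the kind of evidence a crawler could use when other pages from the same site are available.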
28. consulting, s.a.autoritas
English
estoy sin internet ¬¬ fuuuuck!!!
Finnish
... euskocaja, como euskolabel, euskotren, euskomueble... XDDD
Portuguese
Flowah Powah!
German
Vierrrrrrrrrrrrnes, egunon!!
Language Models vs. n-Grams vs. Machine Learning
Language filtering
32. consulting, s.a.autoritas
Information Retrieval Evaluation...
...in Science
- Do you want all retrieved content to be good? -> Precision
- Do you want to leave nothing unretrieved? -> Recall
33. consulting, s.a.autoritas
7,000 retrieved
54 wrong
99.23% precision
3,000 retrieved
50 not retrieved
98.36% recall
Information Retrieval Evaluation...
...in the Industry
We have a credibility problem!!
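The percentages on this slide follow from the usual definitions of precision and recall. The sketch below reproduces them; it assumes that, in the recall example, all 3,000 retrieved items are relevant, which the slide does not state explicitly. The function names are illustrative.

```python
def precision(retrieved, wrong):
    """Share of retrieved items that are actually relevant."""
    return (retrieved - wrong) / retrieved

def recall(retrieved_relevant, missed):
    """Share of all relevant items that were retrieved."""
    return retrieved_relevant / (retrieved_relevant + missed)

p = precision(retrieved=7000, wrong=54)
# Assumption: the 3,000 retrieved items are all relevant.
r = recall(retrieved_relevant=3000, missed=50)

print(f"precision = {p:.2%}")
print(f"recall    = {r:.2%}")
```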
34. consulting, s.a.autoritas
Example
QUERIES: Mallorca, Menorca, Ibiza, Formentera, Baleares, ...
METHOD: 1,200 random tweets / 3 annotators
[Pie chart: Labeled vs. Not labeled, 30% / 70%]
[Bar chart: Precision and Recall (axis from 70 to 100) for Profile, Detector and Combination; values shown: 94%, 82%, 85%, 79%, 87%, 85%]
Tourism project: 4,575,096 tweets
1,372,528
3,202,567
672,539 wrongly retrieved
204,419 good but not retrieved
1,372,528 that we don’t know...
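The slide does not say how these figures were derived. One reading that is consistent with three of them is sketched below: the total is split 30% / 70%, and a 21% error rate (100% minus the 79% shown in the chart) is applied to the 70% share. That interpretation, and the variable names, are assumptions; the "good but not retrieved" figure is not reconstructed here.

```python
TOTAL_TWEETS = 4_575_096

# Split the total into the 30% and 70% shares shown in the pie chart
# (integer division truncates, matching the figures on the slide).
share_30 = TOTAL_TWEETS * 30 // 100
share_70 = TOTAL_TWEETS * 70 // 100

# Assumption: 21% of the 70% share is wrong, i.e. 100% - 79% precision.
wrongly_retrieved = share_70 * 21 // 100

print(share_30, share_70, wrongly_retrieved)
```

Even a precision that looks acceptable on a 1,200-tweet sample translates into hundreds of thousands of wrong items at this scale.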
35. consulting, s.a.autoritas
Bring order to the information...
...to analyse it and get value
DECISION MAKING
STRATEGY
INFORMATION RETRIEVAL
FILTERING, CLEANING AND NOISE DELETION
KNOWLEDGE EXTRACTION
Smart Listening
37. consulting, s.a.autoritas
WHAT are people talking about: the consonant, the programming language, or the Galician telecommunications company?
Semantic ambiguity in millions of documents!!
38. consulting, s.a.autoritas
WHEN? -> Crisis management
When are things happening?
In real time!!
And let's remember the issue of date identification...
40. consulting, s.a.autoritas
HOW? -> Not only sentiment analysis
Polarity is only one dimension:
- Emotions
- Motivations
- Values
- SWOT
All of them answer the question “how?”
42. consulting, s.a.autoritas
An example: “The risk premium in Spain is 235”
Positive, negative, neutral or none?
My question: For whom?
- For the president of the country?
- For the leader of the opposition?
- For the director of the Bank of Spain?
- For the foreign investor?
- For the national capitalist?
- For someone who has a mortgage?
Subjectivity in the sender, and in the receiver!!
43. consulting, s.a.autoritas
WHO? -> Social Network Analysis
If I want to transmit a successful message, who can help me?
If there is a crisis, who do I have to keep an eye on?
44. consulting, s.a.autoritas
WHY? -> Author Profiling
EMOTIONAL PROFILE
GENDER
AGE GROUP
NATIVE LANGUAGE
... political ideology, religious beliefs, and much more!
45. consulting, s.a.autoritas
Big Data is both the solution and the problem...
...but mainly the opportunity
80~120 tw/sec
= 4,800~7,200 tw/min
= 288,000~432,000 tw/hour
= 6,912,000~10,368,000 tw/day
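The volume figures above are a straightforward scaling of the per-second rate. A minimal sketch of the arithmetic, with the helper name `scale` chosen for illustration:

```python
def scale(rate_range, factor):
    """Multiply both ends of a (low, high) rate range by a factor."""
    low, high = rate_range
    return (low * factor, high * factor)

per_second = (80, 120)             # tweets per second, from the slide
per_minute = scale(per_second, 60)
per_hour = scale(per_minute, 60)
per_day = scale(per_hour, 24)

for label, (low, high) in [("tw/min", per_minute),
                           ("tw/hour", per_hour),
                           ("tw/day", per_day)]:
    print(f"{low:,}~{high:,} {label}")
```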
‣Retrieval & Storing
‣Evolution
‣Words & Topics
‣Tagging
‣Hashtags
‣People
‣Locations
‣Brands
‣Sentiment & Emotion
‣Users & Relationships
‣Influence
‣Gender & Age
‣Language Variety
‣...
49. consulting, s.a.autoritas
Show me how you talk...
...and I'll tell you where you are
HispaBlogs
TRAINING: 450 x 5 varieties
TEST: 200 x 5 varieties
https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
50. consulting, s.a.autoritas
State-of-the-art
Automatic Identification of Language Varieties: The Case of Portuguese.
Zampieri, M., Gebrekidan-Gebre, B.
In Proceedings of the Conference on Natural Language Processing 2012
Corpus (1000 documents from newsletters): 2 regional variations
Features: word and character n-grams
ML Algorithm: Language probability distributions with log-likelihood function for probability estimation
Evaluation method: 50/50 split
Accuracy:
Word uni-grams: 99.6%
Word bi-grams: 91.2%
Character 4-grams: 99.8%
Automatic Identification of Arabic Language Varieties and Dialects in Social Media.
Sadat, F., Kazemi, F., Farzindar, A.
In Proceedings of the 1st International Workshop on Social Media Retrieval and Analysis (SoMeRa 2014)
Corpus (blogs and forum documents): 6 regional variations
Features: character n-grams
ML Algorithm: Markov language model vs. Naïve Bayes
Evaluation method: 50/50 split
Accuracy: 98% (78% F-measure)
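Both works above rely on character n-grams scored with a probabilistic model. The sketch below illustrates that general idea with a generic add-one-smoothed character 4-gram model that picks the variety with the highest log-likelihood. It is not the cited authors' implementation; the class name, the variety labels and the toy sentences are all hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=4):
    """Split lower-cased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramVarietyClassifier:
    """One character n-gram count table per variety, add-one smoothed."""

    def __init__(self, n=4):
        self.n = n
        self.counts = {}          # label -> Counter of n-grams
        self.vocabulary = set()   # every n-gram seen in training

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocabulary.update(grams)
        return self

    def log_likelihood(self, text, label):
        counts = self.counts[label]
        total = sum(counts.values())
        vocab_size = len(self.vocabulary) + 1  # one extra slot for unseen n-grams
        return sum(math.log((counts[g] + 1) / (total + vocab_size))
                   for g in char_ngrams(text, self.n))

    def predict(self, text):
        return max(self.counts, key=lambda label: self.log_likelihood(text, label))

# Hypothetical toy data; a real experiment would use a corpus such as HispaBlogs.
train_docs = [
    "vosotros cogéis el coche y vais al ordenador",
    "vale, el móvil está en el piso",
    "ustedes agarran el carro y van a la computadora",
    "órale, el celular está en el departamento",
]
train_labels = ["es-ES", "es-ES", "es-MX", "es-MX"]

model = NgramVarietyClassifier(n=4).fit(train_docs, train_labels)
print(model.predict("vosotros vais en coche"))
print(model.predict("ustedes van en carro"))
```

With so little training text the model only recognises vocabulary it has literally seen; the accuracies reported above come from corpora with hundreds or thousands of documents per variety.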