1. AMAI - 2009 - September 8th, 2009 (Mexico D.F.)
Text Mining and Open-ended Questions
in Sample Surveys
Ludovic Lebart
Centre National de la Recherche Scientifique
Telecom-ParisTech, Paris, France
2. Text Mining and Open-ended Questions
in Sample Surveys
Summary / Outline
1) Principles of Data Mining and Text mining: A reminder
2) Open-ended Questions: Why? How?
3) From texts to numerical data
4) Basic statistical tools: Visualization, Characteristic words.
5) Applications: Open questions, sample surveys, texts
6) About textual data in general
7) Conclusions
4. 1- Principles of Data Mining and Text mining: A reminder
The « data Mining » approach...
✔ Older techniques are easier to use
✔ Older techniques are improved
✔ New techniques are conceived
✔ New fields of application
✔ New products: software
✔ Need for a selection of methods and for a simple, clear strategy for
data processing
5. 1- Principles of Data Mining and Text mining: A reminder
Survey data processing and data mining
Reminder (Fayyad et al.):
Data Mining (KDD) is the non-trivial process of identifying patterns
in huge data sets, these patterns being supposed to be
valid,
novel,
potentially useful,
and ultimately… understandable
These huge data sets could be unstructured, non representative.
The main goal being to automatically extract from the ore (raw data)
the genuine diamond of truth…. (Benzécri 1973)
6. 1- Principles of Data Mining and Text mining: A reminder
Survey data : Homogeneity of content, of coding…
… different from the usual inputs of Data Mining programs.
Despite the fact that we may deal with several observational levels
(households, individuals, trajectories or biographical data,
areas or regions…), there is a consistency and a unity of content
in a survey data set - together with general hypotheses formulated
beforehand - that are not present in the usual data mining input data.
A survey (whatever its complexity) is a costly set of measurements
that follows a specific decision.
In this context, a lot of meta-information is generally available
(Demographic, economic, sociologic, epidemiologic, etc)
that provides a framework for the interpretation phase.
7. 1- Principles of Data Mining and Text mining: A reminder
“Text Mining” and Multivariate exploratory statistical
analysis of texts
✔ Initial paradigm:
- Extracting statistical units from texts
- Complementing lexicometry with a multivariate approach
- Applying visualization tools to lexical tables
✔ Evolution and diversification of techniques and approaches
8. 1- Principles of Data Mining and Text mining: A reminder
The fields of Text Mining
- Web
- Press
- Scientific papers, abstracts
- Information retrieval
- Open-ended questions, free responses
- Qualitative interviews, discourses, reports
- Complaints
10. 2- Open-ended Questions: Why? How?
Open questions: Why?
- To shorten interview time:
Open-ended questions are less costly in terms of interview time,
and generate less fatigue and tension than voluminous lists of items.
- To gather spontaneous information:
Marketing surveys contain many questions of this type, e.g.
"What do you recall (or: what do you like) about this ad?"
- To probe the response to a closed-ended question:
This is the additional follow-up question "Why?".
Explanations of a response already given have to be
provided spontaneously.
- To get information relating to non-comparable variables:
Example: environmental activism, dietary habits…
12. 2- Open-ended Questions: Why? How?
Comparison between open and closed questions
A classical experiment, quoted by Schuman and Presser (1981),
stresses the difficulty of comparing the two types of questioning.
When asked
"what is the most important problem facing this country [USA] at present",
16% of Americans mention crime and violence (grouped free responses),
whereas the same item asked in a closed question
produces 35% of the same response.
The explanation given by the authors is the following:
lack of security is often considered as a local, not a national problem,
so that the item crime and violence is not mentioned
spontaneously very often.
Closing the question indicates that this response is a relevant or
possible response, resulting in a higher response percentage.
13. 2- Open-ended Questions: Why? How?
Heuristic value of open-ended questions
In some particular contexts, the absence of a response item can
play a positive role.
It can establish a climate of confidence and communication,
and lead to better results when certain subjects are brought up.
This is what is indicated by the work of Sudman and Bradburn (1974)
concerning questions having to do with "threats", and of Bradburn et al.
(1979) concerning questions about alcohol and sexuality.
In international studies, it is important to know whether people interviewed
in different countries understand the closed questions in the same way
(this is the role of the follow-up question "Why?").
As a matter of fact, it is also legitimate to raise this same issue with respect
to regional and generational differences.
14. 2- Open-ended Questions: Why? How?
Heuristic value of open-ended questions (continuation)
In some other particular contexts, the cultural gap between those who
have conceived the questionnaire and the interviewees is hidden by the
purely numerical coding of the closed questions.
In a national survey about the attitudes of economically disadvantaged people
towards the minimum wage system in France, a classical open question
was asked at the end of the interview:
"Would you like to add something about some topics that could be
missing in this questionnaire, about the minimum wage system?"
One answer (among many others in the same vein) was:
"We eat potatoes and eggs, despite my diabetes and my cholesterol,
because they are cheap."
Another: "Thank you for coming. It proves that you are thinking of me."
Some respondents are far removed from the intended issue, "attitude towards an institution".
15. 2- Open-ended Questions: Why? How?
Empirical Post-Coding of free responses
(Drawbacks of this type of processing)
Coder bias: Coder bias is added to interviewer bias,
since the coder makes decisions and formulates
interpretations, introducing a «personal touch ».
Alteration of form: Information is destroyed in its form
and often weakened in its content: quality of expression,
level of vocabulary, and general interview tonality are lost.
Weakening of content: (case of responses that are composed,
complex, vague and diversified).
Infrequent responses are eliminated a priori.
16. 2- Open-ended Questions: Why? How?
Example 1: Open Question « Life » (international sample surveys)
The following open-ended question was asked :
"What is the single most important thing in life for you?"
It was followed by the probe:
"What other things are very important to you?".
This question was included in a multinational survey conducted
in seven countries (Japan, France, Germany, Italy, the Netherlands,
United Kingdom, USA) in the late nineteen eighties
(Hayashi et al., 1992).
Our illustrative example is limited to the British sample
(Sample size: 1043).
17. 2- Open-ended Questions: Why? How?
Example 1: Open Question « Life »: Examples of responses
Gender  Educ.  Age  Response
1       1      4    happiness in people around me, contented family, would make me happy
1       2      2    my own time, not dictated by other people
1       2      2    freedom of choice as to what I do in my leisure time
1       3      2    I suppose work
1       2      1    firm, my work, which is my dad's firm
2       1      6    just the memory of my last husband
2       2      6    well-being of my handicapped son
1       1      5    my wife, she gave me courage to carry on even in the bad times
2       2      3    my sons, my kids are very important to me, being on my own, I am responsible for their education
1       3      3    job, being a teacher I love my job, for the well-being of the children
18. 2- Open-ended Questions: Why? How?
Example 2: Open Questions / Copy-Test
Following a viewing of a television commercial on breakfast cereals (copy-test),
several open questions were asked.
One of them, which we shall use as our example, is :
What was the main idea of this commercial?
In addition a number of closed questions were also asked (socio-demographic
characteristics of respondents, purchase intent toward product seen).
Purchase intent being an important issue, this question plays a major role in
the discussions that follow.
Two examples of responses to that open question.
1 - That it has complex carbohydrates in it, it has energy releaser
and it tastes good... It showed people eating grape nuts.
2 - It gives you energy in the morning, nothing else.
19. 2- Open-ended Questions: Why? How?
Example 3: An international survey (Tokyo Gas Company)
A survey in three cities (Tokyo, New York, Paris) about dietary habits.
The common open-ended questions were:
"What dishes do you like and eat often?"
(With a probe: "Any other dishes you like and eat often?")
"What would be an ideal meal?"
Akuto H.(Ed.) (1992). International Comparison of Dietary Cultures,
Nihon Keizai Shimbun, Tokyo.
Akuto H., Lebart L. (1992). Le Repas Idéal. Analyse de Réponses
Libres en Anglais, Français, Japonais. Les Cahiers de l'Analyse des
Données, vol XVII, n°3, Dunod, Paris
20. 2- Open-ended Questions: Why? How?
Example 3: An international survey (continuation)
Four responses (New York)
"What dishes do you like and eat often?"
"What would be an ideal meal?"
---- 1
SPAGHETTI,CHINESE
++++
CAESAR SALAD,LOBSTER TAILS,BAKED POTATO, CHOCOLATE MOUSSE
---- 2
SEAFOOD,GREEN SALAD,CHINESE FOOD
++++
CHAMPAGNE,CAVIAR,GREEN SALAD,GRILLED SEAFOOD
---- 3
CHINESE FOOD
++++
CHINESE FOOD,FRENCH FOOD,VEAL,BREAD
---- 4
PASTA
++++
BEARNAISE BEEF,CHINESE FOOD,ITALIAN FOOD,PASTA
21. 2- Open-ended Questions: Why? How?
Example 4: Evaluating wines through scores and comments
Guia de Catas de Castilla y León (2005)
522 wines from Castilla y León, produced by 207 wineries
5 appellations: Bierzo, Cigales, Ribera del Duero, Rueda, Toro
22. 2- Open-ended Questions: Why? How?
Example 4: Evaluating wines through scores and comments
Example of two texts
---- Nota= 80 Valdelosfriales-2003
Joven típico, con notas de tempranillo y balsámicos; en boca amable y frutoso.
---- Nota=91 Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo
caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado,
tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones
de tierra húmeda y pólvora en el largo final.
23. Text Mining and Open-ended Questions
in Sample Surveys
1) Principles of Data Mining and Text mining: A reminder
2) Open-ended Questions: Why? How?
3) From texts to numerical data
4) Basic statistical tools: Visualization, Characteristic words.
5) Applications: Open questions, sample surveys, texts
6) About textual data in general
7) Conclusions
24. 3- From texts to numerical data
Statistical units derived from texts
CORPUS
Texts
Sentences or responses
Segments or quasi-segments
Words, lemmas, n-grams
Characters
25. 3- From texts to numerical data
Ambiguity of frequencies:
statistical frequency versus « linguistic frequency »
Sample surveys (statistical frequency)
Closed questions
Open questions
Texts (linguistic frequency)
26. 3- From texts to numerical data
Example 1: Question « Life » - continuation
The counts for the first phase of numeric coding are as follows:
Out of 1043 responses, there are 13 669 occurrences,
with 1 413 distinct words.
When the words appearing at least 16 times are selected,
there remain 10 357 occurrences of these words,
with 135 distinct words (types).
The same questionnaire also had a number of closed-end
questions (among them, the socio-demographic characteristics
of the respondents, which play a major role).
In this example we focus on a partitioning of the sample into
nine categories, obtained by cross-tabulating age
(three categories) with educational level (three categories).
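
The counting and thresholding step described on this slide can be sketched as follows. This is a minimal illustration only: the tokenizer (lowercasing, keeping letters and apostrophes) is an assumption, the names `tokenize` and `frequent_words` are hypothetical, and the three sample responses are invented for the example rather than taken from the survey.

```python
import re
from collections import Counter

def tokenize(response):
    """Split a response into lowercase word tokens (naive rule, an assumption)."""
    return re.findall(r"[a-z']+", response.lower())

def frequent_words(responses, min_count):
    """Count every distinct word, then keep the words used at least min_count times."""
    counts = Counter()
    for response in responses:
        counts.update(tokenize(response))
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept

# Invented sample responses, for illustration only.
responses = [
    "good health, happiness",
    "my family and good health",
    "health and peace of mind",
]
counts, kept = frequent_words(responses, min_count=2)
total_occurrences = sum(counts.values())   # all occurrences in the corpus
kept_occurrences = sum(kept.values())      # occurrences of the retained words
print(len(counts), total_occurrences, kept, kept_occurrences)
```

With the survey figures quoted above, the same ratio of retained to total occurrences would be 10 357 / 13 669, i.e. about three quarters of the corpus kept with only 135 of the 1 413 distinct words.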
27. 3- From texts to numerical data
Example 1: Selected statistical units
Words Appearing at Least Sixteen Times (Alphabetic Order)
in the 1043 responses to the open question
Word       Frequency   Word            Frequency   Word     Frequency
I          248         go              19          of       312
I'm        22          going           26          on       59
a          298         good            303         other    33
able       55          grandchildren   30          others   17
about      31          happiness       227         our      29
after      26          happy           137         out      34
all        86          have            99          own      16
and        504         having          70          peace    77
anything   19          health          609         people   61
are        65          healthy         45          really   28
29. 3- From texts to numerical data
Example of a lexical contingency table
- First partition: three age categories
- less than 30 years [noted -30],
- between 30 years and 55 years [-55]
- over 55 years [+ 55] .
- Second partition: three educational levels
- No degree or Low [noted L],
- Medium [M],
- High level [H]
30. 3- From texts to numerical data
Example 1: A lexical contingency table
Partial listing of lexical table cross-tabulating 135 words of frequency
greater than or equal to 16 with 9 age-education categories
Word       L-30  L-55  L+55  M-30  M-55  M+55  H-30  H-55  H+55
I             2    46    92    30    25    19    11    21     2
I'm           2     5     9     3     2     1     0     0     0
a            10    56    66    54    44    19    20    22     7
able          1     9    16     9     7     4     4     5     0
about         0     3    13     7     1     2     4     1     0
after         1     8    11     3     1     2     0     0     0
all           1    24    19     8    18     6     3     5     2
and           8    89   148    86    73    30    25    32    13
anything      0     4     9     1     3     0     1     1     0
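
A lexical contingency table of this kind can be built by counting, for each retained word, its occurrences within each category of respondents. The sketch below assumes whitespace tokenization and uses a hypothetical function name `lexical_table`; the four responses and their category labels are invented for illustration.

```python
from collections import Counter, defaultdict

def lexical_table(responses, categories, vocabulary):
    """Cross-tabulate word counts (rows) by respondent category (columns)."""
    table = defaultdict(Counter)
    for text, category in zip(responses, categories):
        for word in text.lower().split():
            if word in vocabulary:
                table[word][category] += 1
    return table

# Invented responses and age-education labels, for illustration only.
responses = [
    "health and family",
    "good health",
    "family and friends",
    "friends and a good job",
]
categories = ["L+55", "L+55", "M-30", "H-30"]
vocabulary = {"health", "family", "friends", "and", "good"}

table = lexical_table(responses, categories, vocabulary)
columns = ["L+55", "M-30", "H-30"]
rows = {word: [table[word][c] for c in columns] for word in sorted(vocabulary)}
for word, counts in rows.items():
    print(f"{word:10s}", counts)
```

Each row sums to the global frequency of the word, which is how the rows of the table above can be checked against the frequency list (for example, the nine counts for "I" add up to 248).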
31. 3- From texts to numerical data
Example 2: "What is the main idea in this commercial"
Words appearing at least 9 times (100 responses)
Number  Word           Frequency    Number  Word         Frequency
1       I              14           25      in           27
2       a              59           26      is           37
3       about          15           27      it           133
4       all            21           28      it's         28
5       and            42           29      long         14
6       are            25           30      morning      9
7       been           12           31      nothing      25
8       carbohydrate   14           32      nutritional  9
9       carbohydrates  33           33      nutritious   12
10      cereal         34           34      nuts         25
11      complex        25           35      of           25
12      crunchy        9            36      people       28
13      eaten          10           37      showed       11
14      eating         19           38      taste        11
15      energy         33           39      that         80
16      for            57           40      that's       13
17      give           9            41      the          82
18      gives          11           42      they         50
19      good           52           43      to           32
20      grape          25           44      was          19
21      has            30           45      with         11
22      have           27           46      years        11
23      healthy        23           47      you          81
24      how            9
32. 3- From texts to numerical data
Example 2: "What is the main idea in this commercial" Examples of repeated segments
SEGM FREQ LENGTH "TEXT of SEGMENT"
-------------------------------------
-----------------------------------------a
1 8 3 a long time
-----------------------------------------are
2 6 4 are good for you
-----------------------------------------carbohydrates
3 5 3 carbohydrates in it
-----------------------------------------complex
4 15 2 complex carbohydrates
-----------------------------------------for
5 37 2 for you
-----------------------------------------give
6 7 3 give you energy
-----------------------------------------gives
7 11 2 gives you
8 9 3 gives you energy
-----------------------------------------good
9 24 2 good for
10 22 3 good for you
-----------------------------------------grape
11 25 2 grape nuts
-----------------------------------------have
12 6 3 have been eating
-----------------------------------------healthy
13 6 3 healthy for you
-----------------------------------------is
14 9 4 is good for you
-----------------------------------------it
15 26 2 it has
16 19 2 it is
17 14 2 it was
18 8 3 it gives you
19 8 3 it has a
20 6 3 it has complex
21 5 3 it is good
22 6 4 it gives you energy
-----------------------------------------people
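
Repeated segments such as those listed above are sequences of consecutive words that occur several times in the corpus. A minimal sketch of how they can be counted is given below; it assumes whitespace tokenization, segments that never cross a response boundary, and a hypothetical function name `repeated_segments`. The three responses are invented and the thresholds are arbitrary.

```python
from collections import Counter

def repeated_segments(responses, min_len=2, max_len=4, min_count=2):
    """Count every run of min_len..max_len consecutive words, keep the repeated ones."""
    counts = Counter()
    for text in responses:
        words = text.lower().split()
        for length in range(min_len, max_len + 1):
            for start in range(len(words) - length + 1):
                counts[tuple(words[start:start + length])] += 1
    return {" ".join(seg): n for seg, n in counts.items() if n >= min_count}

# Invented responses, for illustration only.
responses = [
    "it gives you energy",
    "it gives you energy in the morning",
    "it is good for you",
]
segments = repeated_segments(responses)
for text, freq in sorted(segments.items()):
    print(freq, len(text.split()), text)
```

As in the listing above, a long repeated segment ("it gives you energy") automatically brings along its shorter sub-segments ("it gives you", "gives you"), which is why the frequencies decrease as the length increases.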
33. 3- From texts to numerical data
Example 3: An international survey (Tokyo Gas Company)
The common open-ended question: "What dishes do you like and eat often?"
(With a probe: "Any other dishes you like and eat often?")
- Sub-sample 1 (city of Tokyo) : 1008 individuals.
The global corpus of open responses contains 6219 occurrences of
832 distinct words. 139 words appear at least 7 times, leading to 4975
occurrences.
- Sub-sample 2 (city of New York) contains 634 individuals.
(6511 occurrences of 638 distinct words).
The processing takes into account the 83 words appearing at least 12 times.
- Sub-sample 3 (city of Paris) contains 1000 individuals.
The global corpus contains 11108 occurrences of 1229 distinct words.
The processing takes into account the 112 words appearing at least 18 times,
leading to 7806 occurrences.
- The three sets of respondents are broken down into six categories
(three categories of age, combined with the gender).
34. 3- From texts to numerical data
Example 3: An international survey (Tokyo Gas Company)
City of New York
The common open-ended question: "What dishes do you like and eat often?"
(With a probe: "Any other dishes you like and eat often?")
!------------------------------------!
!       words (frequency order)      !
!-------!---------------------!------!
! num.  ! used words          ! freq.!
!-------!---------------------!------!
!   12  ! CHICKEN             !  254 !
!   73  ! STEAK               !  101 !
!   49  ! PASTA               !   95 !
!   22  ! FISH                !   87 !
!   60  ! SALAD               !   85 !
!    1  ! AND                 !   85 !
!   23  ! FOOD                !   82 !
!   52  ! PIZZA               !   62 !
!   79  ! VEGETABLES          !   57 !
!    4  ! BEEF                !   56 !
!   71  ! SPAGHETTI           !   55 !
!   13  ! CHINESE             !   54 !
!   80  ! WITH                !   48 !
!   59  ! ROAST               !   47 !
!   58  ! RICE                !   45 !
!   67  ! SHRIMP              !   45 !
!   43  ! MACARONI            !   42 !
!   56  ! POTATOES            !   39 !
!   35  ! HAMBURGERS          !   36 !
!   75  ! TUNA                !   35 !
!   26  ! FRIED               !   33 !
!   77  ! VEAL                !   33 !
!   38  ! ITALIAN             !   31 !
!    2  ! BAKED               !   29 !
!   48  ! PARMESAN            !   29 !
!   55  ! POTATO              !   27 !
!   46  ! MEATBALLS           !   25 !
!    3  ! BEANS               !   24 !
!   45  ! MEAT                !   24 !
!   76  ! TURKEY              !   24 !
!   14  ! CHOPS               !   23 !
!   34  ! HAMBURGER           !   22 !
!------------------------------------!
35. 3- From texts to numerical data
Example 4: Evaluating wines through scores and comments
Pos.  Word     Freq.
1     de       891
2     y        806
3     en       694
4     boca     433
5     con      356
6     fruta    334
7     un       308
8     nariz    246
9     muy      237
10    la       211
10    notas    211
12    que      168
13    taninos  167
14    el       159
15    una      152
16    madera   140
17    bien     116
18    toque    108
Lemmatization:
- Singulars and plurals
- Masculine and feminine
- Verb forms reduced to the infinitive
- …
Articles, prepositions, … are removed.
Words used at least 8 times are kept.
This leaves:
- 250 words
- 443 wines
            P1   P2   ...   P250
Wine 1       0    1   ...      2
Wine 2       1    0   ...      1
Wine 3       0    0   ...      1
. . .
Wine 443     1    2   ...      0
36. 3- From texts to numerical data
Example 4: Evaluating wines through scores and comments (continuation)
boca (mouth)
fruta (fruit)
muy (very)
nariz (nose)
nota (note)
taninos (tannins)
madera (wood)
buen (good)
rojo (red)
negro (black)
toque (hint)
bien (well)
final (end)
maduro (ripened)
balsámico (balsamic)
vino (wine)
todavía (still)
elegante (elegant)
agradable (pleasant)
jugoso (juicy)
medio (medium)
fino (fine)
algo (some/something)
cereza (cherry)
ser (to be)
ligero (light)
suave (mild)
potente (powerful)
acidez (acidity)
[Bar chart: frequency of each of the lemmas listed above; horizontal axis "Frequency", from 0 to 400.]
38. 4) Basic statistical tools: Visualization, Characteristic words.
✔ Applying visualization tools to lexical tables
● Principal axes analyses of lexical tables
● Classification (clustering) of words and texts
✔ Selecting characteristic units and responses (or: sentences)
● Characteristic units (words, segments, lemmas)
● Selecting « Modal responses »
39. 4) Basic statistical tools: Visualization, Characteristic words.
Briefly, the principles of the methods for performing these data reductions
can be summarized as follows.
Principal axes methods, largely based upon linear algebra, produce
graphical representations in which the geometric proximities among row-points
and among column-points translate statistical associations among rows
and among columns. Correspondence analysis belongs to this family of
methods.
Clustering or classification methods create groupings of rows or of
columns into clusters (or into families of hierarchical clusters); they include the
SOM (Self-Organizing Maps, or Kohonen maps).
These two families of methods can be used on the same data matrix and
they complement one another very effectively.
Selection of characteristic units and responses (or: sentences):
- Characteristic units (words, segments, lemmas)
- Selecting « Modal responses »
40. 4) Basic statistical tools: Visualization, Characteristic words.
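\[ x_{ij} = a_j f_i + \varepsilon_{ij} \]
where x_{ij} is the value of variable j for individual i, a_j the coefficient of
variable j, f_i the general factor for individual i, and \varepsilon_{ij} the residual
(hopefully small). The left-hand side is known; the right-hand side is unknown.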
Visualization through principal coordinates, « a breakthrough in 1904 ».
Charles Spearman (1904) – “General intelligence, objectively determined and
measured”. Amer. Journal of Psychology, 15, p 201-293.
41. 4) Basic statistical tools: Visualization, Characteristic words.
Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych.,
9, p 345-366.
Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press,
Chicago.
\[ x_{ij} = a_j f_i + b_j g_i + \dots + \varepsilon_{ij} \]
42. 4) Basic statistical tools: Visualization, Characteristic words.
The Singular Value Decomposition is a theorem, not a model
Eckart C., Young G. (1936) - The approximation of one matrix by another of lower rank.
Psychometrika, 1, p 211-218.
Eckart C., Young G. (1939) - A principal axis transformation for non-Hermitian matrices.
Bull. Amer. Math. Soc., 45, p 118-121.
\[ X = \lambda_1 v_1 u_1' + \dots + \lambda_\alpha v_\alpha u_\alpha' + \dots + \lambda_p v_p u_p' \]
A precursor: Pearson K. (1901) - On lines and planes of closest fit to
systems of points in space. Phil. Mag. 2, n°11, p 559-572.
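
The practical consequence of this decomposition, as stated by Eckart and Young (1936), is that truncating the sum gives the best approximation of the table by a table of lower rank. The formulation below is a standard one, not copied from the slide: it writes \sigma_\alpha for the \alpha-th singular value, ordered so that \sigma_1 \ge \dots \ge \sigma_p \ge 0 (this is the coefficient the slide denotes \lambda_\alpha; if \lambda_\alpha is instead read as an eigenvalue of X'X, then \sigma_\alpha = \sqrt{\lambda_\alpha}), and X_q for the truncated table, a symbol introduced here for illustration.

```latex
X = \sum_{\alpha=1}^{p} \sigma_\alpha \, v_\alpha u_\alpha' ,
\qquad
X_q = \sum_{\alpha=1}^{q} \sigma_\alpha \, v_\alpha u_\alpha' \quad (q < p),
\qquad
\lVert X - X_q \rVert_F^2 = \sum_{\alpha=q+1}^{p} \sigma_\alpha^2 .
```

Among all matrices of rank at most q, X_q minimizes the Frobenius distance to X. Keeping the first few principal axes therefore discards only the part of the table carried by the smallest singular values.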
45. 4) Basic statistical tools: Visualization, Characteristic words.
Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes
46. 4) Basic statistical tools: Visualization, Characteristic words.
A pedagogical example: Description of « Textual Graphs »
47. 4) Basic statistical tools: Visualization, Characteristic words.
Each area "answers" the fictitious open question:
"Which are your neighbouring areas?"
**** Ain
Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie
**** Aisne
Aisne Ardennes Marne Nord Oise Seine_Marne Somme
**** Allier
Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone
**** Alpes_Prov
Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse
**** Alpes_Hautes
Alpes_Hautes Alpes_Prov Drome Isere Savoie
**** Alpes_Marit
Alpes_Marit Alpes_Prov Var
**** Ardeche
Ardeche Drome Gard Loire Hte_Loire Lozere
**** Ardennes
Ardennes Aisne Marne Meuse
……………………….
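
To analyse these "responses" like any other text, they first have to be turned into a table crossing areas (rows) with the words they use (columns). The sketch below assumes the layout shown above (a `****` header line followed by one line of space-separated names); the function names `parse_texts` and `binary_table` are hypothetical, and only three of the areas listed above are used.

```python
def parse_texts(lines):
    """Read '**** name' headers, each followed by lines of space-separated words."""
    texts = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("****"):
            current = line[4:].strip()
            texts[current] = []
        elif line and current is not None:
            texts[current].extend(line.split())
    return texts

def binary_table(texts):
    """Build a 0/1 table: one row per text, one column per distinct word."""
    vocabulary = sorted({word for words in texts.values() for word in words})
    rows = {
        name: [1 if word in words else 0 for word in vocabulary]
        for name, words in texts.items()
    }
    return vocabulary, rows

# Three of the "responses" shown above.
lines = [
    "**** Alpes_Hautes",
    "Alpes_Hautes Alpes_Prov Drome Isere Savoie",
    "**** Alpes_Marit",
    "Alpes_Marit Alpes_Prov Var",
    "**** Ardennes",
    "Ardennes Aisne Marne Meuse",
]
vocabulary, rows = binary_table(parse_texts(lines))
for name, row in rows.items():
    print(f"{name:14s}", row)
```

Two areas that share neighbours get similar rows, which is the only information a principal axes method needs in order to place them close to each other.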
48. 4) Basic statistical tools: Visualization, Characteristic words.
The idea: When a pattern exists within a text, some techniques may
detect it and exhibit it.
This map is blindly produced from the previous texts.
49. 4) Basic statistical tools: Visualization, Characteristic words.
Example: CORRESPONDENCE ANALYSIS of a simple
lexical table
Correspondence analysis can be presented in several different ways.
• It is difficult to trace the method's history accurately
(see, e.g., Hill, 1974; Benzécri, 1976; Nishisato, 1980; Gifi, 1990).
• The underlying theory probably dates back to
Fisher (1936), Guttman (1941), and Hayashi (1956).
50. 4) Basic statistical tools: Visualization, Characteristic words.
• Correspondence analysis and principal components
analysis are used under different circumstances:
• Principal components analysis is used for tables
consisting of continuous measurements.
• Correspondence analysis is best applied to
contingency tables (cross-tabulations) frequently
encountered when analyzing textual data.
• By extension, it also provides a satisfactory
description of data tables with binary coding.
51. 4) Basic statistical tools: Visualization, Characteristic words.
• Cross-tabulations or contingency tables are among the
most common data structures used for analyzing
qualitative data.
• By looking at two partitions of a population or sample simultaneously,
a cross-tabulation lets us examine how the data vary across response
categories, a necessary step for the interpretation of results.
52. 4) Basic statistical tools: Visualization, Characteristic words.
• We will use a small contingency table as a running example.
• However, this kind of exploratory method is chiefly
useful when we are dealing with very large data
tables (a pedagogical paradox).
• In the following table, the 14 rows are words used
in responses to an open-ended question put to
2000 respondents.
• The 5 columns are the educational levels of the
respondents.
56. 4) Basic statistical tools: Visualization, Characteristic words.
[Figure: symmetry of the two spaces (rows and columns). The contingency table F, of size (n, p) and general term f_ij, gives n row-profiles, represented as n points in R^p, and p column-profiles, represented as p points in R^n.]
57. 4) Basic statistical tools: Visualization, Characteristic words.
[Figure: first principal plane of the correspondence analysis. Axis 1 (57 %), Axis 2 (21 %). Word points: Finances, Future, War, Fear, Unemployment, Selfishness, Health, Money, Difficult, Housing, Work, Decision, Occupation, Economic. Category points: NO DEGREE, ELEM, TRADE, HIGH, COLLEGE.]
58. 4) Basic statistical tools: Visualization, Characteristic words.
Characteristic elements (words, lemmas, segments)
The corpus contains several parts (categories of respondents).
Notations:
kij -sub-frequency of word i in the part j of the corpus;
ki. -frequency of word i in the whole corpus;
k.j -frequency (size) of part j;
k.. -size of the corpus (or, simply, k).
We are interested in the statistical significance of the sub-frequency kij.
Is word i abnormally frequent in part j? Is it abnormally rare?
Comparing the relative frequency of word i in part j with the relative
frequency of word i in the entire corpus leads to a classical
statistical test using either the hypergeometric distribution or its normal
approximation.
59. 4) Basic statistical tools: Visualization, Characteristic words.
The 4 parameters for computing characteristic elements
[Diagram: table crossing words (rows) with text parts (columns).]
kij : frequency of word in text part
ki. : frequency of word in corpus
k.j : size of text part
k.. : size of corpus
61. 5) Applications: Open questions, sample surveys, texts
Example 1: Open Question « Life » (International sample surveys)
The following open-ended question was asked :
"What is the single most important thing in life for you?"
It was followed by the probe:
"What other things are very important to you?".
The next two slides show the principal plane
produced by a correspondence analysis of the previous
lexical contingency table (section 3).
Proximity between 2 category-points (columns) means
similarity of lexical profiles of the 2 categories.
Proximity between 2 word-points (rows) means similarity
of lexical profiles of these words.
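
The "similarity of lexical profiles" that underlies these proximities is measured in correspondence analysis by the chi-square distance between profiles. The sketch below computes it for two columns of a small table; the function name `chi2_distance` is hypothetical and the 2 x 3 table is invented for illustration.

```python
def chi2_distance(table, j1, j2):
    """Chi-square distance between the profiles of columns j1 and j2.

    table is a list of rows (words); each row lists the counts per category.
    """
    total = sum(sum(row) for row in table)
    size1 = sum(row[j1] for row in table)
    size2 = sum(row[j2] for row in table)
    squared = 0.0
    for row in table:
        row_mass = sum(row) / total   # relative frequency of the word
        if row_mass > 0:
            diff = row[j1] / size1 - row[j2] / size2
            squared += diff * diff / row_mass
    return squared ** 0.5

# Invented table: 2 words (rows) by 3 categories (columns).
table = [
    [10, 20, 5],
    [10, 20, 15],
]
d01 = chi2_distance(table, 0, 1)   # proportional columns: identical profiles
d02 = chi2_distance(table, 0, 2)
print(round(d01, 4), round(d02, 4))
```

Categories 0 and 1 use the two words in the same proportions, so their distance is zero even though one category is twice as large as the other: only the profiles matter, not the sizes.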
62. Example 1 (« Life » question)
Correspondence analysis: Words - Categories
[Figure: principal plane showing the nine age-education category points (E1-AGE1 to E3-AGE3) together with word points such as church, welfare, mind, peace, security, contentment, children, family, freedom, happiness, leisure, money, work, education, job, love, friends, future.]
63. Example 1 (« Life » question)
Location of Segments
[Figure: the same principal plane with the nine category points (E1-AGE1 to E3-AGE3) and the segments: peace of mind; peace in the world; welfare of my family; happiness, good health; law and order; a nice home; a good standard of living; I don't know; having enough money to live; a good job; can't think of anything else; friends and family.]
64. 5) Applications: Open questions, sample surveys, texts
Example 1 (« Life » question) Characteristic words
     words      %W     %glob   Fr.W   Fr.glob   TestValue
H-30 = -30 * high (Young, High education)
1    friends    2.87   1.11    17     116        3.44
2    do         1.35    .45     8      47        2.60
3    want       1.01    .30     6      31        2.44
4    being      2.19   1.11    13     116        2.18
5    job        2.53   1.36    15     142        2.16
6    having     1.52    .67     9      70        2.11
7    things      .84    .27     5      28        2.06
---------------------
2    wife        .00    .65     0      68       -2.10
1    health     2.70   5.85    16     609       -3.59
H+55 = +55 * high (Older, High education)
1    mind       2.55    .45     5      47        2.91
2    welfare    1.53    .21     3      22        2.42
3    peace      2.55    .74     5      77        2.17
65. 5) Applications: Open questions, sample surveys, texts
Example 1 (« Life » question) : Modal Responses
Category 7: Less than 30 years, high level of education
1.33 - 1 friends, friends, my home life
1.12 - 2 being content having enough money to do what you want to do,
within reason, having good friends, having a fulfilling job to do,
having some idea of what you want to do and the freedom to choose,
protection of the environment
1.05 - 3 to have good friends around having a good job, living in a good area,
having lots of freedom to do the things you want to do
.93 - 4 good living education, good job, money
Category 9: Over 55 years, high level of education
.97 - 1 togetherness, peace of mind, good health, religion, no
.64 - 2 not to die, peace of mind, don't like people living envious of each
other
.63 - 3 peace of mind good health, happiness, enough money to keep a
standard of living
.38 - 4 welfare of my family work, satisfaction, good health, travel
66. 5) Applications: Open questions, sample surveys, texts
Example 1: Similar survey in Japan
(Same open question, same categories of
respondents)
67. 5) Applications: Open questions, sample surveys, texts
Example 1: Similar survey in Japan: visualization
of the characteristic words for 2 categories
68. 5) Applications: Open questions, sample surveys, texts
Example 2: Open Questions / Copy-Test
69. 5) Applications: Open questions, sample surveys, texts
Example 2: Open Questions / Copy-Test
Examples of 2 characteristic responses for 4 categories
TEXT 1 PWNB = Probably would not buy
-- 1 to tell you about how long people have eaten them.
-- 1 the complex carbohydrate that are in this cereal.
-- 1 the people who eat this cereal and the product. that's all.
-- 2 it's supposed to be healthy, it has good carbohydrates in it.
TEXT 2 Hesi = Hesitates
-- 1 it gives you energy in the morning. nothing else.
-- 2 grape nuts cereal gives you energy
-- 2 it has complex carbohydrates. they showed the man eating it with
-- 2 strawberries and bananas.
TEXT 3 PWB = Probably would buy
-- 1 it's nutritious for you. nothing else.
-- 2 that is good for you, that's all it said to me
TEXT 4 DWB = Definitely would buy
-- 1 they are bigger nuggets. low in carbohydrates, that's all.
-- 2 it has nutty flavor, it is nutritious
70. 5) Applications: Open questions, sample surveys, texts
Example 3: International survey (Tokyo Gas Company). A survey in three cities (Tokyo,
New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often?"
New York: First principal plane. Table crossing words and age x gender categories
71. 5) Applications: Open questions, sample surveys, texts
Example 3: International survey (continued). Question: "What dishes do you like and eat often?"
New York: First principal plane. Example of confidence areas for categories (Bootstrap)
72. 5) Applications: Open questions, sample surveys, texts
Example 3: International survey (continued). Question: "What dishes do you like and eat often?"
New York: First principal plane. Example of confidence areas for words (Bootstrap)
73. 5) Applications: Open questions, sample surveys, texts
Example 3: International survey (continued). Question: "What dishes do you like and eat often?"
New York: First principal plane. Example of Kohonen Map (Self Organizing map).
74. 5) Applications: Open questions, sample surveys, texts
Example 4: Comments about 522 Spanish wines.
[Figure: Wines & marks, first principal plane (Axis 1: 3.52%, Axis 2: 1.75%); mark and comments are active. The plot shows the marks (78 to 97) together with a quality axis ("Eje de calidad"), the wine types (Tinto joven, Tinto roble, Tinto crianza, Tinto reserva, Gran Reserva) and individual wines such as Vega Sicilia 'Único' (94), Termanthia (02) and Mesoneros de Castilla (03).]
75. 5) Applications: Open questions, sample surveys, texts
Example 4: Comments about 522 Spanish wines (continued)
[Figure: words of the comments placed along the mark axis, from the lowest marks to the highest marks (scale -1.9 to 1.3, marks 81 to 90; average mark: 85.16). Words shown include agradable, frutal, ligero, fácil, suave, corto, potente, denso, concentrado, mineral, complejo, impresionante and magnífico.]
76. 5) Applications: Open questions, sample surveys, texts
Example 4: Comments about 522 Spanish wines (continued)
[Figure: same principal plane (Axis 1, Axis 2) with supplementary variables ("Variables suplementarias"): nine price ranges from 0-4.9€ to 100-300€, shown alongside the marks (78 to 97), the wine types (Tinto joven, Tinto roble, Tinto crianza, Tinto reserva, Gran Reserva) and the wine names.]
77. 5) Applications: Open questions, sample surveys, texts
Example 4: Comments about 522 Spanish wines (continued)
---- Wine 212 (mark=85) Legaris-2001
Toasted notes, gumdrops and good balsamic notes mark the medium fruit intensity of this
crianza. On the palate it appears very linear, with medium consistency; the fruity aftertaste
is still masked by a somewhat rustic wood.
---- Wine 30 (mark=91) Tares P3-2001 premium
A great deal of terroir can be detected in the bouquet of this great red; gunpowder, flint,
slate, warm gravel, contrasting with damp earth and plenty of ripe stone fruit. Concentrated,
with a fatty feel on the palate; impressive viscosity on the tongue, and again impressions
of damp earth and gunpowder in the long finish.
---- Wine 314 (mark=97) Vega Sicilia 'Único'-1994
Describing this great wine calls for a first-rate exercise in tasting discipline. The bouquet
is fresh, well built on red fruit enhanced by hints of chocolate, tobacco, undergrowth notes
and a wood that makes itself felt but is hard to locate and harder still to pin down. This is
the rare case of a red that comes through the passage of time unscathed without showing off
its armor, which is the barrel. On the palate it is young, although its vigorous, energetic
body is already fairly well integrated, with the exception of a few jumpy ("grasshopper")
tannins that remain to be tamed. A long, vibrant finish that blends maturity with a notable
fresh finesse.
79. 6) About textual data in general
Processing Strategy
A priori Grouping (Lexical contingency table)
Juxtaposition of Lexical contingency tables
Instrumental Partition
Direct Analysis of the sparse Lexical table
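The first strategy, a priori grouping, can be sketched as follows: responses are pooled by respondent category and the words are counted, giving a lexical contingency table (categories x words). This is a minimal illustration that assumes naive whitespace tokenization. The function name `lexical_table` and the category labels are hypothetical; the responses are borrowed from Example 1.

```python
from collections import Counter, defaultdict


def lexical_table(responses):
    """Build word counts per category from (category, text) pairs."""
    table = defaultdict(Counter)
    for category, text in responses:
        tokens = text.lower().replace(",", " ").split()
        table[category].update(tokens)
    vocabulary = sorted({w for counts in table.values() for w in counts})
    return vocabulary, dict(table)


sample = [
    ("under 30, high education", "friends, friends, my home life"),
    ("under 30, high education", "good living education, good job, money"),
    ("over 55, high education", "peace of mind good health, happiness"),
    ("over 55, high education", "not to die, peace of mind"),
]

vocabulary, table = lexical_table(sample)

# One row per category, one column per word of the vocabulary.
for category, counts in table.items():
    print(category, [counts[w] for w in vocabulary])
```

Juxtaposing several such tables, one per partition of the sample, gives the second strategy in the list.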
80. 6) About textual data in general
Importance of Meta-data
Textual data can be complemented by several kinds of meta-data:
- Linguistics: grammar / syntax
- Semantic networks
- External corpora
- Other a priori structures: sociolinguistics, chronology, etc.
81. 6) About textual data in general
The four phases of a linguistic analysis
Forms in parentheses are rejected at the corresponding level.
Morphology: (A bxg flower) A big flower, A bug flower, A bag flower, A bog flower
Syntax: The spoon speaks (The speaks)
Semantics: A man thinks (A stone thinks)
Pragmatics: A challenge to A.I.
82. 6) About textual data in general
Homography, Polysemy, Synonymy
Homographs: BORE (past tense of "to bear"; a tedious person; "to bore")
Polysemous words: DUTY (task; tax), DRUG (medicine; addictive product)
83. 6) About textual data in general
Semantic content of a lexical profile
Distributional linguistics (Z. Harris)
X is sometimes purring
X mews
X has whiskers
X likes milk
X likes chasing mice
In the end, the point « X » will coincide with the point « CAT ».
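A minimal sketch of why the two points coincide, assuming the chi-square distance between lexical profiles that correspondence analysis uses: two words whose context counts are proportional have identical profiles, hence a distance of zero. All counts below, and the extra word DOG with its context "barks", are hypothetical values for illustration.

```python
from math import sqrt

contexts = ["is sometimes purring", "mews", "has whiskers",
            "likes milk", "likes chasing mice", "barks"]

# Hypothetical context counts: X occurs in the same contexts as CAT.
counts = {
    "X":   [2, 1, 1, 1, 1, 0],
    "CAT": [4, 2, 2, 2, 2, 0],
    "DOG": [0, 0, 1, 1, 0, 5],
}


def profile(row):
    """Relative frequencies of the contexts for one word."""
    total = sum(row)
    return [v / total for v in row]


def chi2_distance(a, b, col_mass):
    """Chi-square distance between two profiles."""
    return sqrt(sum((x - y) ** 2 / m
                    for x, y, m in zip(a, b, col_mass) if m > 0))


grand_total = sum(sum(row) for row in counts.values())
col_mass = [sum(row[j] for row in counts.values()) / grand_total
            for j in range(len(contexts))]
profiles = {word: profile(row) for word, row in counts.items()}

d_x_cat = chi2_distance(profiles["X"], profiles["CAT"], col_mass)
d_x_dog = chi2_distance(profiles["X"], profiles["DOG"], col_mass)
print(d_x_cat, d_x_dog)
```

The distance depends only on the profiles, not on how often each word occurs, so a rare unknown word can still land exactly on a frequent one.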
84. 6) About textual data in general
Semantic similarity is not a transitive relationship
Example of semantic chains:
(1) calm–wisdom–discretion–wariness–fear–panic,
(2) fact–feature–aspect–appearance–illusion.
85. 6) About textual data in general
New additional variables
Nouns (proper, common)
Verbs (auxiliary, modal, usual…)
Adjective
Pronoun
Determiner
Adverb
Preposition
Conjunction
88. 7) Conclusions
In conclusion...
For each open-ended question,
and for each partition of the sample of respondents,
we obtain, without any preliminary coding or other intervention:
• A visualization of proximities between words and categories.
• Characteristic elements or words for each category.
• Modal responses for each category (a kind of automatic summary).
[Remember also that the open question "Why?" following a closed
question provides an indispensable check on how the question
was really understood.]
89. 7) Conclusions
In conclusion... (continued)
All this processing is carried out under the supervision of robust
assessment procedures:
- Non-parametric statistical tests,
- Bootstrap validation.
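A minimal sketch of the bootstrap idea, assuming respondents are resampled with replacement and a percentile interval is read off the replicates. The slides apply bootstrap to confidence areas on principal planes; here the same resampling is applied to a much simpler statistic, the share of responses containing a word. The function name `bootstrap_share` is hypothetical and the responses are shortened from Example 2.

```python
import random


def bootstrap_share(responses, word, n_replicates=1000, seed=0):
    """Percentile interval (2.5%, 97.5%) for the share of responses with `word`."""
    rng = random.Random(seed)
    n = len(responses)
    shares = []
    for _ in range(n_replicates):
        # Draw n respondents with replacement.
        replicate = [responses[rng.randrange(n)] for _ in range(n)]
        shares.append(sum(word in r.split() for r in replicate) / n)
    shares.sort()
    lo = shares[int(0.025 * n_replicates)]
    hi = shares[int(0.975 * n_replicates) - 1]
    return lo, hi


sample = [
    "it gives you energy in the morning",
    "grape nuts cereal gives you energy",
    "it has complex carbohydrates",
    "it's nutritious for you",
    "it has nutty flavor, it is nutritious",
    "they are bigger nuggets",
]

observed = sum("energy" in r.split() for r in sample) / len(sample)
lo, hi = bootstrap_share(sample, "energy")
print(observed, lo, hi)
```

With so few responses the interval is very wide, which is exactly the kind of warning such a validation step is meant to give.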
We are not dealing here with novel, sophisticated modelling.
It is rather a painstaking effort to stick to the real concerns of the
respondent, i.e. the customer, the user, the client.
With the rapid development of online surveys and the spread of e-mail
and blogs, the set of tools presented here is expected to be a noteworthy
component in a new methodology of customer knowledge.
90. 7) Conclusions – Short Bibliography
- Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo.
- Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys,
Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press).
- Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006) Détermination d’une note globale, synthèse
d’une évaluation numérique et d’appréciations libres. Application aux études de marché. (in French) Actes des JADT-2006.
http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm
- Bécue-Bertaut M., Álvarez-Esteban R., Pagès J. (2008). Rating of products through scores and free-text assertions.
Comparing and combining both. Food Quality and Preference, 19/1, 122-134.
- Belson W.A., Duncan J.A. (1962): A Comparison of the check-list and the open response questioning system, Applied
Statistics, 2, 120-132.
- Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
- Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge.
- Bradburn N., Sudman S., and associates (1979): Improving Interview Method and Questionnaire Design, Jossey Bass,
San Francisco.
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis,
J. of the Amer. Soc. for Information Science, 41 (6), 391-407.
- Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand Colin, Paris.
- Hayashi C., Suzuki T., Sasaki M. (1992): Data Analysis for Social Comparative research: International Perspective,
North-Holland, Amsterdam.
- Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data, COMPSTAT,
Physica Verlag, 67-76.
- Lebart L., Salem A., Bécue M., (2000), Análisis estadístico de textos, Editorial Milenio, Lleida.
- Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht.
- Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley. N.Y.
- Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern. 61, 241-254.
- Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes,
Cahiers de l'Analyse des Données, 489-500.
- Sasaki M., Suzuki T. (1989): New directions in the study of general social attitudes : trends and cross-national perspectives,
Behaviormetrika, 26, 9-30.
- Schuman H., Presser S. (1981): Questions and Answers in Attitude Surveys, Academic Press, New York.
- Sudman S., Bradburn N. (1974): Response Effects in Surveys, Aldine, Chicago.
91. Survey data and software (DtmVic)
can be downloaded from
www.dtm-vic.com