4 text mining and open ended questions in sample surveys ludovic lebart cnrs

AMAI - 2009 - September 8th, 2009 (Mexico D.F.)

Text Mining and Open-ended Questions
in Sample Surveys

Ludovic Lebart
Centre National de la Recherche Scientifique
Telecom-ParisTech, Paris, France

1

in Sample Surveys

Summary / Outline

1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions

in Sample Surveys






7) Conclusions

1- Principles of Data Mining and Text mining: A reminder

The « data Mining » approach...

✔ Ancient techniques are easier to use,

✔ Ancient techniques are improved

✔ New techniques are conceived

✔ New fields of application

✔ New products: Softwares

✔ Need for a selection of methods, of simple and clear strategy for
data processing


Survey data processing and data mining

Reminder : (Fayadd et al.)

Data Mining (KDD) is the non-trivial process of identifying patterns
in huge data sets, these patterns being supposed to be

valid,
novel,
potentially useful,
and ultimately… understandable

These huge data sets could be unstructured, non representative.

The main goal being to automatically extract from the ore (raw data)
the genuine diamond of truth…. (Benzécri 1973)


Survey data : Homogeneity of content, of coding…
… different from the usual inputs of Data Mining programs.

Despite the fact that we may deal with several observational levels
(households, individuals, trajectories or biographical data,
areas or regions…), there is a consistency and a unity of content
in a survey data set - together with general hypotheses formulated
beforehand - that are not present in the usual data mining input data.

A survey (whatever its complexity) is a costly set of measurements
that follows a specific decision.

In this context, a lot of meta-information is generally available
(Demographic, economic, sociologic, epidemiologic, etc)
that provides a framework for the interpretation phase.


“Text Mining” and Multivariate exploratory statistical
analysis of texts

✔ Initial paradigm:
✔

✔ - Extracting statistical units from texts
✔

✔ - Complementing lexicometry with a multivariate approach
✔

✔ - Applying visualization tools to lexical tables
✔

✔ Evolution and diversification of techniques and approaches

7


The fields of Text Mining

WEB Press

Scientific papers, abstracts Information Retrieval

Open-ended questions, free responses

Qualitative interviews, Discourses, Reports

Complaints

8

in Sample Surveys






7) Conclusions

2- Open-ended Questions: Why? How?

Open questions : Why?
x To shorten interview time:
Open ended questions are less costly in terms of interview time,
and generate less fatigue and tension (voluminous lists of items)

x To gather spontaneous information:
Marketing survey questions contain many questions of this type.
" What do you recall (or: what do you like) about this ad?

x To probe the response to a closed-end question.:
This is the follow up additional question "Why?".
Explanations concerning a response already given have to be
provided in a spontaneous fashion.

x To get information relating to non-comparable variables:
Example : Environmental activism, dietary habits….
10


Open questions : Why ?

Cost
DRAWBACKS
Complexity
Specificity

ADVANTAGES Speed
Freedom
Specificity

11


Comparison between open and closed questions

A classical experiment, quoted by Schuman and Presser (1981),
stresses the difficulty of comparing the two types of questionning.

When asked
"what is the most important problem facing this country [USA] at present",
16% of Americans mention crime and violence (grouped free responses),
whereas the same item asked in a closed question
produces 35% of the same response.

The explanation given by authors is the following:
lack of security is often considered as a local, not a national problem,
so that the item crime and violence is not mentioned
spontaneously very often.

Closing the question indicates that this response is a relevant or
possible response, resulting in a higher response percentage.


Heuristic value of open-ended questions

In some particular contexts, the absence of a response item can
play a positive role.
It can establish a climate of confidence and communication,
and lead to better results when certain subjects are brought up.

This is what is indicated by the work of Sudman and Bradburn (1974)
concerning questions having to do with "threats", and of Bradburn et al.
(1979) concerning questions about alcohol and sexuality.

In international studies, it is important to know whether people interviewed
in different countries understand the closed questions in the same way.
(case of the follow up :”Why” ).

As a matter of fact, it is also legitimate to raise this same issue with respect
to regional and generational differences.


Heuristic value of open-ended questions (continuation)

In some other particular contexts, the cultural gap between those who
have conceived the questionnaire and the interviewees is hidden by the
purely numerical coding of the closed questions.

In a national survey about the attitudes of economically impaired people
towards the minimum wage system in France, a classical open question
was asked at the end of the interview:

“Would you like to add something about some topics that could be
missing in this questionnaire, about the minimum wage system ?”

One answer (among many others of the same vein) was
“ We eat potatoes and eggs, despite my diabetes and my cholesterol,
because there are cheap.”

Another: “Thank you for coming. It proves that you are thinking of me”.

Some respondents are far from the problematic “Attitude towards an institution”


Empirical Post-Coding of free responses

(Drawbacks of this type of processing)

Coder bias: Coder bias is added to interviewer bias,
since the coder makes decisions and formulates
interpretations, introducing a «personal touch ».

Alteration of form: Information is destroyed in its form
and often weakened in its content: quality of expression,
level of vocabulary, and general interview tonality are lost.

Weakening of content: (case of responses that are composed,
complex, vague and diversified).

Infrequent responses are eliminated a priori.

15


Example 1: Open Question « Life » (international sample surveys)

The following open-ended question was asked :

"What is the single most important thing in life for you?"
It was followed by the probe:
"What other things are very important to you?".

This question was included in a multinational survey conducted
in seven countries (Japan, France, Germany, Italy, Nederland,
United Kingdom, USA) in the late nineteen eighties
(Hayashi et al., 1992).

Our illustrative example is limited to the British sample
(Sample size: 1043).

16


Example 1: Open Question « Life »: Examples of responses

GenderEduc. Age Responses

1 1 4 happiness in people around me, contented family,
would make me happy
1 2 2 my own time, not dictated by other people
1 2 2 freedom of choice as to what I do in my leisure time
1 3 2 I suppose work
1 2 1 firm, my work, which is my dad's firm
2 1 6 just the memory of my last husband
2 2 6 well-being of my handicapped son
1 1 5 my wife, she gave me courage to carry on even in the
bad times
2 2 3 my sons, my kids are very important to me, being on
my own, I am responsible for their education
1 3 3 job, being a teacher I love my job, for the well-being
of the children 17


Example 2: Open Questions / Copy-Test

Following a viewing of a television commercial on breakfast cereals (copy-test),
several open questions were asked.
One of them, which we shall use as our example, is :

What was the main idea of this commercial?

In addition a number of closed questions were also asked (socio-demographic
characteristics of respondents, purchase intent toward product seen).
Purchase intent being an important issue, this question plays a major role in
the discussions that follow.

Two examples of responses to that open question.

1 - That it has complex carbohydrates in it, it has energy releaser
and it tastes good... It showed people eating grape nuts.

2 - It gives you energy in the morning, nothing else.


Example 3: An international survey (Tokyo Gas Company)

A survey in three cities (Tokyo, New York, Paris) about dietary habits.

The common open-ended questions were:

"What dishes do you like and eat often?
(With a probe: "Any other dishes you like and eat often?").
“ What would be an ideal meal?”

Akuto H.(Ed.) (1992). International Comparison of Dietary Cultures,
Nihon Keizai Shimbun, Tokyo.

Akuto H., Lebart L. (1992). Le Repas Idéal. Analyse de Réponses
Libres en Anglais, Français, Japonais. Les Cahiers de l'Analyse des
Données, vol XVII, n°3, Dunod, Paris


Example 3: An international survey (continuation)

Four responses (New York)

"What dishes do you like and eat often?
“What would be an ideal meal?”

---- 1
SPAGHETTI,CHINESE
++++
CAESAR SALAD,LOBSTER TAILS,BAKED POTATO, CHOCOLATE MOUSSE

---- 2
SEAFOOD,GREEN SALAD,CHINESE FOOD
++++
CHAMPAGNE,CAVIAR,GREEN SALAD,GRILLED SEAFOOD

---- 3
CHINESE FOOD
++++
CHINESE FOOD,FRENCH FOOD,VEAL,BREAD
---- 4
PASTA
++++
BEARNAISE BEEF,CHINESE FOOD,ITALIAN FOOD,PASTA


Example 4: Evaluación de vinos mediante notas y comentarios

Guia de Catas de Castilla y León (2005)
522 vinos de Castilla y León pertenecientes a 207 bodegas

5 denominaciones: Bierzo, Cigales, Ribera del Duero, Rueda, Toro


Example of two texts

---- Nota= 80 Valdelosfriales-2003
Joven típico, con notas de tempranillo y balsámicos; en boca amable y frutoso.

---- Nota=91 Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo
caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado,
tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones
de tierra húmeda y pólvora en el largo final.

in Sample Surveys






7) Conclusions

3- From texts to numerical data

Statistical units derived from texts

CORPUS

Texts

Sentences or responses

Segments or quasi-segments

Words, lemmas, n-grams

Characters
s

24


Ambiguity of frequencies:
statistical frequency versus « linguistic frequency »

Sample surveys
(statistical frequency)

Closed questions

Open questions
ouvertes

Texts
( linguistic frequency)

25


Example 1: Question « Life » - continuation

The counts for the first phase of numeric coding are as follows:
Out of 1043 responses, there are 13 669 occurrences,
with 1 413 distinct words.
When the words appearing at least 16 times are selected,
there remain 10 357 occurrences of these words,
with 135 distinct words (types).

The same questionnaire also had a number of closed-end
questions (among them, the socio-demographic characteristics
of the respondents, which play a major role).

In this example we focus on a partitioning of the sample into
nine categories, obtained by cross-tabulating age
(three categories) with educational level (three categories).
26


Example 1: Selected statistical units

Words Appearing at Least Sixteen Times (Alphabetic Order)
in the 1043 responses to the open question

Word Frequency Word Frequency Word Frequency

I 248 go 19 of 312
I'm 22 going 26 on 59
a 298 good 303 other 33
able 55 grandchildren 30 others 17
about 31 happiness 227 our 29
after 26 happy 137 out 34
all 86 have 99 own 16
and 504 having 70 peace 77
anything 19 health 609 people 61
are 65 healthy 45 really 28

27


Example of morpho-syntactic information

Gender Educ. Age Tagged responses

1 1 4 happiness/NN in/IN people/NNS around/IN me/PRP
,/, contented/VBN family/NN ,/, would/MD make/VB
me/PRP happy/JJ
1 2 2 my/PRP$ own/JJ time/NN ,/, not/RB dictated/VBN
by/IN other/JJ people/NNS
1 2 2 freedom/NN of/IN choice/NN as/IN to/TO what/WP
I/PRP do/VB in/IN my/PRP$ leisure/NN time/NN
1 3 2 I/PRP suppose/VBP work/NN
1 2 1 firm/NN ,/, my/PRP$ work/NN ,/, which/WDT is/VBZ
my/PRP$ dad's/NNS firm/NN
2 1 6 just/RB the/DT memory/NN of/IN my/PRP$ last/JJ
husband/NN
2 2 6 wellbeing/NN of/IN my/PRP$ handicapped/JJ son/NN
1 1 5 my/PRP$ wife/NN ,/, she/PRP gave/VBD me/PRP
courage/NN to/TO carry/VB on/IN even/RB in/IN
the/DT bad/JJ times/NNS

28


Example of a lexical contingency table

- First partition: three age categories

- less than 30 years [noted -30],
- between 30 years and 55 years [-55]
- over 55 years [+ 55] .

- Second partition: three educational levels

- No degree or Low [noted L],
- Medium [M],
- High level [H]

29


Example 1: A lexical contingency table

Partial listing of lexical table cross-tabulating 135 words of frequency
greater than or equal to 16 with 9 age-education categories

L-30 L-55 L+55 M-30 M-55 M+55 H-30 H-55 H+55

I 2 46 92 30 25 19 11 21 2
I'm 2 5 9 3 2 1 0 0 0
a 10 56 66 54 44 19 20 22 7
able 1 9 16 9 7 4 4 5 0
about 0 3 13 7 1 2 4 1 0
after 1 8 11 3 1 2 0 0 0
all 1 24 19 8 18 6 3 5 2
and 8 89 148 86 73 30 25 32 13
anything 0 4 9 1 3 0 1 1 0

30


Example 2: "What is the main idea in this commercial"
Words appearing more than 9 times (100 responses)

Number Word Frequency Number Word Frequency

1 I 14 25 in 27
2 a 59 26 is 37
3 about 15 27 it 133
4 all 21 28 it's 28
5 and 42 29 long 14
6 are 25 30 morning 9
7 been 12 37 nothing 25
8 carbohydrate 14 32 nutritional 9
9 carbohydrates33 33 nutritious 12
10 cereal 34 34 nuts 25
11 complex 25 35 of 25
12 crunchy 9 36 people 28
13 eaten 10 37 showed 11
14 eating 19 38 taste 11
15 energy 33 39 that 80
16 for 57 40 that's 13
17 give 9 41 he 82
18 gives 11 42 they 50
19 good 52 43 to 32
20 grape 25 44 was 19
21 has 30 45 with 11
22 have 27 46 years 11
23 healthy 23 47 you 81
24 how 9

Example 2: "What is the main idea in this commercial" Examples of repeated segments

SEGM FREQ LENGTH "TEXT of SEGMENT"
-------------------------------------
-----------------------------------------a
1 8 3 a long time
-----------------------------------------are
2 6 4 are good for you
-----------------------------------------carbohydrates
3 5 3 carbohydrates in it
-----------------------------------------complex
4 15 2 complex carbohydrates
-----------------------------------------for
5 37 2 for you
-----------------------------------------give
6 7 3 give you energy
-----------------------------------------gives
7 11 2 gives you
8 9 3 gives you energy
-----------------------------------------good
9 24 2 good for
10 22 3 good for you
-----------------------------------------grape
11 25 2 grape nuts
-----------------------------------------have
12 6 3 have been eating
-----------------------------------------healthy
13 6 3 healthy for you
-----------------------------------------is
14 9 4 is good for you
-----------------------------------------it
15 26 2 it has
16 19 2 it is
17 14 2 it was
18 8 3 it gives you
19 8 3 it has a
20 6 3 it has complex
21 5 3 it is good
22 6 4 it gives you energy
-----------------------------------------people



The common open-ended question : "What dishes do you like and eat often?
(With a probe: "Any other dishes you like and eat often?").

- Sub-sample 1 (city of Tokyo) : 1008 individuals.
The global corpus of open responses contains 6219 occurrences of
832 distinct words. 139 words appear at least 7 times, leading to 4975
occurrences.

- Sub-sample 2 (city of New-York) contains 634 individuals.
(6511 occurrences of 638 distinct words).
The processing takes into account the 83 words appearing at least 12 times.

- Sub-sample 3 (city of Paris) contains 1000 individuals.
The global corpus contains 11108 occurrences of 1229 distinct words.
The processing takes into account the 112 words appearing at least 18 times,
leading to 7806 occurrences.

- The three sets of respondents are broken down into into six categories
(three categories of age, combined with the gender).


!------------------------------------!
! words (frequency order) !
!-------!---------------------!------!
! num. ! used words ! freq.!
!-------!---------------------!------!
! 12 ! CHICKEN ! 254 !
! 73 ! STEAK ! 101 !
!
!
49 ! PASTA
22 ! FISH
!
!
95 !
87 ! City of New York
! 60 ! SALAD ! 85 !
! 1 ! AND ! 85 !
!
!
23 ! FOOD
52 ! PIZZA
!
!
82 !
62 !
The common open-ended question :
! 79 ! VEGETABLES ! 57 ! "What dishes do you like and eat
! 4 ! BEEF ! 56 !
! 71 ! SPAGHETTI ! 55 ! often?
!
!
13 ! CHINESE
80 ! WITH
!
!
54 !
48 !
(With a probe: "Any other dishes you
! 59 ! ROAST ! 47 ! like and eat often?").
! 58 ! RICE ! 45 !
! 67 ! SHRIMP ! 45 !
! 43 ! MACARONI ! 42 !
! 56 ! POTATOES ! 39 !
! 35 ! HAMBURGERS ! 36 !
! 75 ! TUNA ! 35 !
! 26 ! FRIED ! 33 !
! 77 ! VEAL ! 33 !
! 38 ! ITALIAN ! 31 !
! 2 ! BAKED ! 29 !
! 48 ! PARMESAN ! 29 !
! 55 ! POTATO ! 27 !
! 46 ! MEATBALLS ! 25 !
! 3 ! BEANS ! 24 !
! 45 ! MEAT ! 24 !
! 76 ! TURKEY ! 24 !
! 14 ! CHOPS ! 23 !
! 34 ! HAMBURGER ! 22 !
!------------------------------------!



Pos. Palabra Frec

1 de 891 Lematización:
2 y 806 -Singulares y plurales
3 en 694
- Masculino y femenino
- Formas verbales a infinitivo
4 boca 433 -…
5 con 356
Eliminados artículos,
6 fruta 334
preposiciones …
7 un 308

8 nariz 246 Conservadas palabras
utilizadas al menos 8 veces.
9 muy 237
10 la 211 Quedan
10 notas 211 -250 palabras
12 que 168
-443 vinos

13 taninos 167
P1 P2 ... P250
14 el 159
Vino 1 0 1 ... 2
15 una 152
Vino 2 1 0 ... 1
16 madera 140
Vino 3 0 0 ... 1
17 bien 116
. . . . . . . . . . .
18 toque 108
Vino 443 1 2 ... 0


Example 4: Evaluación de vinos mediante notas y comentarios (Continuation)

boca (mouth)
fruta (fruit)
muy (very)
nariz (nose)
nota (note)
taninos (tannins)
madera (wood)
buen (good)
rojo (red)
negro (black)
toque (hint)
bien (well)
final (end)
maduro (ripened)
balsámico (balsamic)
vino (wine)
todavía (still)
elegante (elegant)
agradable (pleasant)
jugoso (juicy)
medio (medium)
fino (fine)
algo (some/something)
cereza (cherry)
ser (to be)
ligero (light)
suave (mild)
potente (powerful)
acidez (acidity)
0 50 100 150 200 250 300 350 400
Frequency

in Sample Surveys






7) Conclusions


✔ Applying visualization tools to lexical tables
✔

● Principal axes analyses of lexical tables
● Classification (clustering) of words and texts
✔

✔ Selecting characteristic units and responses
✔ (or: sentences)
✔

● Characteristic units (words, segments, lemmas)
● Selecting « Modal responses »

38


Briefly, one can summarize the principles of methods for performing these
data reductions:

Principal axes methods, largely based upon linear algebra, produce
graphical representations on which the geometric proximities among row-
points and among column-points translate statistical associations among rows
and among columns. Correspondence analysis belongs to this family of
methods.

Clustering or classification methods that create groupings of rows or of
columns into clusters (or into families of hierarchical clusters) including the
SOM (Self Organizing Maps, or Kohonen maps).

These two families of methods can be used on the same data matrix and
they complement one another very effectively.

Selection of characteristic units and responses (or: sentences)
Characteristic units (words, segments, lemmas)Selecting « Modal
responses »


General factor for individual i

x = aj f + ε
i
j i j
i
Value of variable j
for individual i Residual (hopefully small)

Coefficient of variable j

Known = Unknown

Visualization through principal coordinates, « a breakthrough in 1904 ».

Charles Spearman (1904) – “General intelligence, objectively determined and
measured”. Amer. Journal of Psychology, 15, p 201-293.
40


Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych.,
9, p 345-366.
Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press,
Chicago.

x = a j f + b j g + ... + ε
i
j i i j
i

41


Singular Values Decomposition is a theorem, not a model
Eckart C., Young G. (1936) - The approximation of one matrix by another of lower rank.
Psychometrika, l, p 211-218.

Eckart C., Young G. (1939) - A principal axis transformation for non- hermitian matrices.
Bull. Amer. Math. Assoc., 45, p 118-121.

= λ1 × + ... + λ α × + ... + λ p ×

X v1 u'1 vα u'α vp u'p

A precursor: Pearson K. (1901) - On lines and planes of closest fit to
systems of points in space. Phil. Mag. 2, n°ll, p 559-572.

42


Image “Cheetah” (Data Compression, Mark Nelson)
and table (200 x 320) containing levels of grey.

95 88 88 87 95 88 95 95 95 106 95 78 65 71 78 77 77 etc.
143 144 151 151 153 170 183 181 162 140 116 128 133 144 159 166 170
153 151 162 166 162 151 126 117 128 143 147 175 181 170 166 132 116
143 144 133 130 143 153 159 175 192 201 188 162 135 116 101 106 118
123 112 116 130 143 147 162 183 166 135 123 120 116 116 129 140 159
133 151 162 166 170 188 166 128 116 132 140 126 143 151 144 155 176
160 168 166 159 135 101 93 98 120 128 126 147 154 158 176 181 181
154 155 153 144 126 106 118 133 136 153 159 153 162 162 154 143 128
159 153 147 159 150 154 155 153 158 170 159 147 130 136 140 150 150
151 144 147 176 188 170 166 183 170 166 153 130 132 154 162 120 135
155 181 183 162 144 147 147 144 126 120 123 129 130 112 101 135 150
166 147 129 123 133 144 133 117 109 118 132 112 109 120 136 120 136
136 130 136 147 147 140 136 144 140 132 129 151 153 140 128 153 147
130 133 140 124 136 152 166 147 144 151 159 140 123 130 123 109 112
126 120 143 145 162 153 155 175 154 144 136 130 120 112 123 123 144
144 159 155 155 162 166 158 147 140 147 126 123 132 135 136 144 147
136 143 162 175 136 110 112 135 120 118 126 151 150 130 129 133 147
133 151 143 106 85 93 128 136 140 140 144 143 126 117 116 129 124
……………………………..etc.
43


Eigenvalues of the Correspondence Analysis of the previous table

Trace before diagonalization: 0.15930 Trace after diagonalization: 0.15930
eigenvalues
1: 0.045 28.549 28.549 **************************************************
2: 0.028 17.695 46.243 ******************************
3: 0.019 12.205 58.448 *********************
4: 0.012 7.306 65.754 ************
5: 0.007 4.674 70.428 ********
6: 0.006 3.516 73.944 ******
7: 0.005 2.944 76.888 *****
8: 0.003 2.179 79.067 ***
9: 0.003 1.869 80.936 ***
10: 0.002 1.531 82.467 **
11: 0.002 1.371 83.838 **
12: 0.002 1.106 84.944 *
13: 0.002 1.066 86.010 *
14: 0.002 0.956 86.965 *
15: 0.001 0.791 87.756 *
16: 0.001 0.758 88.514 *
17: 0.001 0.690 89.204 *
18: 0.001 0.567 89.771
19: 0.001 0.554 90.325
20: 0.001 0.477 90.801
21: 0.001 0.422 91.223
22: 0.001 0.406 91.629
23: 0.001 0.384 92.013
24: 0.001 0.339 92.352 44


Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes

45


A pedagogical example: Description of « Textual Graphs »

46


Each area “answers” to the fictitious “open-question” :
Which are your neighbouring areas?

**** Ain
Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie
**** Aisne
Aisne Ardennes Marne Nord Oise Seine_Marne Somme
**** Allier
Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone
**** Alpes_Prov
Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse
**** Alpes_Hautes
Alpes_Hautes Alpes_Prov Drome Isere Savoie
**** Alpes_Marit
Alpes_Marit Alpes_Prov Var
**** Ardeche
Ardeche Drome Gard Loire Hte_Loire Lozere
**** Ardennes
Ardennes Aisne Marne Meuse
……………………….
47


The idea: When a pattern exists
within a text, some techniques may This map is blindly
detect it and exhibit it. produced from the
previous texts.

48


Example: CORRESPONDENCE ANALYSIS of a simple
lexical table

Correspondence analysis can be presented in several different ways.

• It is difficult to trace the method's history accurately
(see, e.g., Hill, 1974 ; Benzecri, 1976 ; Nishisato, 1980; Gifi, 1990).

•The underlying theory probably dates back to...

•Fisher (1936) , Guttman (1941), and Hayashi (1956).

49


• Correspondence analysis and principal components
analysis are used under different circumstances:

• Principal components analysis is used for tables
consisting of continuous measurements.

• Correspondence analysis is best applied to
contingency tables (cross-tabulations) frequently
encountered when analyzing textual data.

• By extension, it also provides a satisfactory
description of data tables with binary coding.

50


• Cross-tabulations or contingency tables are among the
most common data structures used for analyzing
qualitative data.

• By looking simultaneously at two partitions at a time of a
population or sample, a cross-tabulation enables us to
work with variations in the data by response categories, a
necessary step for the interpretation of results.

51


• We will use as a leading example a small
contingency table.
• However, this kind of exploratory method is chiefly
useful when we are dealing with very large data
tables
• (pedagogical paradox)

• In the following table, the 14 rows are words used
in responses to an open-ended question given by
2000 respondents.
•The 5 columns are the educational levels of the
respondents.

52


A contingency table crossing words and education level

No Elem.Trade High Coll- Total
Words degr. Sch. Sch. Sch. ege

Money 51 64 32 29 17 193
Future 53 90 78 75 22 318
Unemployment 71 111 50 40 11 283
Decision 1 7 5 5 4 22
Difficult 7 11 4 3 2 27
Economic 7 13 12 11 11 54
Selfishness 21 37 14 26 9 107
Occupation 12 35 19 6 7 79
Finances 10 7 7 3 1 28
War 4 7 7 6 2 26
Housing 8 22 7 10 5 52
Fear 25 45 38 38 13 159
Health 18 27 20 19 9 93
Work 35 61 29 14 12 151

Total 323 537 322 285 125 1592

53


Row-profiles of the same table

No Elem. Trade High Coll Total
Words degree Sch. Sch. Sch. ege

Money 26.4 33.2 16.6 15.0 8.8 100.0
Future 16.7 28.3 24.5 23.6 6.9 100.0
Unemployment 25.1 39.2 17.7 14.1 3.9 100.0
Decision 4.5 31.8 22.7 22.7 18.2 100.0
Difficult 25.9 40.7 14.8 11.1 7.4 100.0
Economic 13.0 24.1 22.2 20.4 20.4 100.0
Selfishness 19.6 34.6 13.1 24.3 8.4 100.0
Occupation 15.2 44.3 24.1 7.6 8.9 100.0
Finances 35.7 25.0 25.0 10.7 3.6 100.0
War 15.4 26.9 26.9 23.1 7.7 100.0
Housing 15.4 42.3 13.5 19.2 9.6 100.0
Fear 15.7 28.3 23.9 23.9 8.2 100.0
Health 19.4 29.0 21.5 20.4 9.7 100.0
Work 23.2 40.4 19.2 9.3 7.9 100.0

Total 20.3 33.7 20.2 17.9 7.9 100.0

54


Column-profiles of the same table

No Elem. Trade High Coll- Total
Words degree Sch. Sch. Sch. ege

Money 15.8 11.9 9.9 10.2 13.6 12.1
Future 16.4 16.8 24.2 26.3 17.6 20.0
Unemployment 22.0 20.7 15.5 14.0 8.8 17.8
Decision .3 1.3 1.6 1.8 3.2 1.4
Difficult 2.2 2.0 1.2 1.1 1.6 1.7
Economic 2.2 2.4 3.7 3.9 8.8 3.4
Selfishness 6.5 6.9 4.3 9.1 7.2 6.7
Occupation 3.7 6.5 5.9 2.1 5.6 5.0
Finances 3.1 1.3 2.2 1.1 .8 1.8
War 1.2 1.3 2.2 2.1 1.6 1.6
Housing 2.5 4.1 2.2 3.5 4.0 3.3
Fear 7.7 8.4 11.8 13.3 10.4 10.0
Health 5.6 5.0 6.2 6.7 7.2 5.8
Work 10.8 11.4 9.0 4.9 9.6 9.5

Total 100.0 100.0 100.0 100.0 100.0 100.0

55


1 j p
1 general term of the
contingency table
Symmetry F=
i fij
(n,p)
of the two
spaces: n
row-profiles column-profiles
rows and 1 j p j j'
columns 1
i

i' i

n

p n
n points in R p points in R
• •• • • • • • • • • •• • •
• • •••
• • • • • •• • •• • • • •• • ••
• • •• • • • • • • •
• • • • • • • • • • •
• • • •• • •
• •
•
p
Rn
• • •
R
56


A x is 2 (21 % ) .
F in a n c e s
.1 5
H IG H
. F u tu r e
.
W ar N o DE GRE E
. .
F ea r .
. U n em p l o y m e n t
T R A D E S e lfi s h n e s s
. . A x is 1
. H e a l th
- .2 - .1 0 .1 . .2
M on ey (57 % )
.
.
EL EM D iff ic u lt
H o u s in g
. W o rk .
- .1 5

.D e c i s i o n
O c c u p a tio n
.
E c o n o m ic C O LL E G E
. .
57


Characteristic elements (words, lemmas, segments)

The corpus contains several parts (categories of respondents).

Notations:
kij -sub-frequency of word i in the part j of the corpus;
ki. -frequency of word i in the whole corpus;
k.j -frequency (size) of part j;
k.. -size of the corpus (or, simply, k).

We are interested in the statistical significance of sub-frequency kij .

Is the word i abnormally frequent in part j ? Is it abnormally rare?

The comparison between the relative frequency of word i in part j
and the relative frequency of word i in the entire corpus leads to a classical
statistical test using either the hypergeometric distribution or its normal
58
approximation.


The 4 parameters for computing characteristic elements

T E X T P A R T S

W O R D S

ki j ki .

k. j k. .

k. . size of corpus
ki . frequency of word in corpus

ki j frequency of word in text part

k. j size of text part

59

in Sample Surveys

Summary / Outline






7) Conclusions
60


Example 1: Open Question « Life » (International sample surveys)

The following open-ended question was asked :

"What is the single most important thing in life for you?"
It was followed by the probe:
"What other things are very important to you?".

The two forthcoming diapositives show the principal plane
produced by a correspondence analysis of the previous
lexical contingency table (section 3).

Proximity between 2 category-points (columns) means
similarity of lexical profiles of the 2 categories.

Proximity between 2 word-points (rows) means similarity
of lexical profiles of these words.

c h u r c h w e lf a r e Example 1 (« Life » question)
m in d son
E 3-A G E 3 if
Correspondence th em
Words - Categories k id s fo r
p ea ce
E 2 -A G E 3
s e c u rity co n ten tm en t
c h il d r e n
E 2-A G E 2

E 1 -A G E 2 a re
E 3-A G E 2 f a m i ly f ro m sh o u ld
g e n e ra l v er y w if e m e
le is u re fr e e d o m h a p p in ess
sta n d a rd e m plo y m e nt d og h elp
tim e
house w o rld d a u g h ter

m u sic
a nd w o u ld
e d u catio n w o rk I
m one y be w ell
to E 1 -A G E 3
not a n y thin g
E 2 -A G E 1 you
s a t is f .
lo v e da y
E 1 -A G E 1
jo b ha ve g o in g
ke e p
c o m f o r ta b l y
t h in g s c o m f o r t a b le m o re
m u ch
E 3 -A G E 1
f rie nd s th in k c an
f u tu re ca r out go
62

p ea c e o f m in d Example 1 (« Life » question)
E3-A G E3
Location of
p e a c e in t h e w o r ld E 2 -A G E 3 Segments
w e lfa r e o f m y fa m ily
E2-A G E2 h a p p in e ss , g o o d h e a lth
E 1 -A G E 2
E3-A G E2 la w a n d o r d e r

a n ic e h o m e
a g o o d st a n d a r d o f liv in g I d o n 't kn o w
h a v in g e n o u g h m o n e y to l ive E1-A G E3
E 2 -A G E 1
E1-A G E1
a g o o d jo b c a n 't t h i n k o f a n y t h i n g e l s e

fr ie n d s a n d fa m i ly

E 3 -A G E 1
63


Example 1 (« Life » question) Characteristic words
words %W %glob Fr.W Fr.glob TestValue
H-30 = -30 * high (Young, High education)

1 f rie n d s 2 .8 7 1 .1 1 17 11 6 3 .4 4
2 do 1 .3 5 .4 5 8 4 7 2 .6 0
3 w ant 1 .0 1 .3 0 6 3 1 2 .4 4
4 b e in g 2 .1 9 1 .1 1 13 11 6 2 .1 8
5 jo b 2 .5 3 1 .3 6 15 14 2 2 .1 6
6 h a v in g 1 .5 2 .6 7 9 7 0 2 .1 1
7 t h in g s .8 4 .2 7 5 2 8 2 .0 6
- -- - -- - -- - -- - -- -
2 w ife .0 0 .6 5 0 68 -2 .1 0
1 h e a lt h 2 .7 0 5 .8 5 16 609 -3 .5 9

H+55 = +55 * high (Older, High education)

1 m in d 2 .5 5 .4 5 5 47 2 .9 1
2 w e lfa r e 1 .5 3 .2 1 3 22 2 .4 2
3 peace 2 .5 5 .7 4 5 77 2 .1 7
64


Example 1 (« Life » question) : Modal Responses
Category 7 Less than 30 years, high level of education

1 .3 3 - 1 f r i e n d s , f r i e n d s , m y h o m e l i f e
1 .1 2 - 2 b e i n g c o n t e n t h a v i n g e n o u g h m o n e y t o d o w h a t y o u w a n t t o d o ,
w ith in re a s o n , h a v in g g o o d frie n d s , h a v in g a fu lfillin g jo b to d o ,
h a v in g s o m e id e a o f w h a t y o u w a n t to d o a n d th e fre e d o m to c h o o s e ,
p ro te c tio n o f th e e n v iro n m e n t
1 .0 5 - 3 to h a v e g o o d frie n d s a ro u n d h a v in g a g o o d jo b , liv in g in a g o o d a re a ,
h a v in g lo ts o f fre e d o m to d o th e th in g s y o u w a n t to d o
.9 3 - 4 g o o d l i v i n g e d u c a t i o n , g o o d j o b , m o n e y

Category 9 Over 55 years, high level of education

.9 7 - 1 to g e th e rn e s s , p e a c e o f m in d , g o o d h e a lth , re lig io n , n o
.6 4 - 2 n o t t o d i e , p e a c e o f m i n d , d o n 't l i k e p e o p l e l i v i n g e n v i o u s o f e a c h
o th e r
.6 3 - 3 p e a c e o f m in d g o o d h e a lth , h a p p in e s s , e n o u g h m o n e y to k e e p a
s ta n d a rd o f liv in g
.3 8 - 4 w e lfa re o f m y fa m ily w o rk , s a tis fa c tio n , g o o d h e a lth , tra v e l 65


Example 1: Similar survey in Japan
(Same open question, same categories of
respondents)

66


Example 1: Similar survey in Japan: visualization
of the characteristic words for 2 categories

67


Example 2: Open Questions / Copy-Test


E x a m p le s o f 2 C h a ra c te ris tic re s p o n s e s fo r 4 c a te g o rie s
TEXT 1 PWNB = Prob.w.n.buy
Example 2: Open
-- 1 to tell you about how long people have eaten them. Questions / Copy-
-- 1 the complex carbohydrate that are in this cereal.
-- 1 the people who eat this cereal and the product. that's all. Test
-- 2 it's supposed to be healthy, it has good carbohydrates in it.

TEXT 2 Hesi = Hesitates

. -- 1 it gives you energy in the morning. nothing else.

. -- 2 grape nuts cereal gives you energy
-- 2 it has complex carbohydrates. they showed the man eating it with
-- 2 strawberries and bananas.

TEXT 3 PWB = Probably would buy

-- 1 it's nutritious for you. nothing else.

-- 2 that,is good for you, that,s all it said to me

TEXT 4 DW B = Definitely would buy

-- 1 they are bigger nuggets. low in carbohydrates, that's all.

-- 2 it has nutty flavor, it is nutritious


Example 3: International survey (Tokyo Gas Company). A survey in three cities (Tokyo,
New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often?

New York: First principal plane. Table crossing words and age x gender categories


Example 3: International survey (continuation). Question: "What dishes do you like and eat often?

New York: First principal plane. Example of confidence areas for categories (Bootstrap)



New York: First principal plane. Example of confidence areas for words (Bootstrap)



New York: First principal plane. Example of Kohonen Map (Self Organizing map).


Axis 2 : 1.75%
First Principal Plane
Mesoneros de Castilla (03)
6.0
WINES & MARKS

Example 4: Comments about 522 Spanish wines.
Nota y commentarios activos
4.5
Valdelosfrailes (03)

Gran Reserva
Fuentenarro (02)

Tinto joven 3.0

Torondos (02) Tinto crianza

Gayubar (02) Vega Sicilia 'Único' (94)
Viña Sastre Pesus(01)
1.5
Valdetán (02)
94
Eje de calidad
78

79

93

80
82 88
97 Jaros Chafandín (01)
81 83 90 91 92
-3.0 -1.5 84 85 86 1.5 Axis 1: 3.52%
89
87 San Román (01)
Numanthia (02)
Bienvenida Sitio de El Palo (01)
95 Termanthia (02)
Tares P3 (01)
Carramimbre (03) Gran Elías Mora (00)
-1.5
Viña Eremos (03)

Marqués de Peñamonte (01)
Tinto reserva
Tinto roble


Example 4: Comments about 522 Spanish wines (continuation)
lowest marks highest marks
agradable reducido potente estilo denso
vez
frutal sobremadurez discreto puro concentrado salado
graso
crianza sequedad frutosidad dejar necesitar impresionante
tuestes torrefacto
algo medio ensamblado mineral potencial
cierto granuloso
limpio tempranillo seco primer sabroso
abierto rojo
gran enérgico
ligero ligeramente clásico moderno sorprende
algún típico salino tiempo
americano
beber demasiado dominar expresión fino carnoso tacto
evolucionar capa franco donde amargo complejo todo
compotado
fácil mucho largo noble
suave
ser cascajo
tradicional Ribera
cesta bouquet
rústico coco
sílex
joven toque pólvora
intenso
roble voluptuoso
firme
lineal magnífico
vino
corto amable chocolate
herbáceo
consistencia

-1,9 -1,5 -1,1 -0,7 -0,3 0,1 0,5 0,9 1,3

Mark81 82 83 84 85 86 87 88 89 90
Average mark: 85.16

Example 4: Comments about 522
Axis2
Variables suplementarias Spanish wines (continuation)
Mesoneros de Castilla (03)
Jaros Chafandín (01)
Vega Sicilia 'Único' (94)

4.5
Viña Sastre Pesus(01)
Valdelosfrailes (03)
Gran Reserva Punta Esencia (01)
Fuentenarro (02)
Astrales (02)

3.0 20-24,9€
0-4,9€ 5-9,9€
Torondos (02) Valdecuadrón (02) 15-19,9€ 25-29,9€
Tinto joven 10-14,9€ 50-99,9€
Gayubar (02)
Tinto crianza
1.5
Viñatorondos (03) 94
Valdetán (02) 79 100-300€
78
93
81 91 97
80
Viña Valdable (03) 82 83 88 30-49,9€ 90 92
- 3.0 - 1.5 84 89
1.5 Axis1
Marqués de Olivara (98) 85 86 87
Rauda (01)
El Marqués (02) 95

Carramimbre (03) Valsotillo (01) - 1.5
Viña Eremos (03)
San Román (01)
Tinto reserva Numanthia (02)
Marqués de Peñamonte (01)
Bienvenida Sitio de El Palo (01) Termanthia (02)
Tinto roble
Gran Elías Mora (00) Tares P3 (01)


Example 4: Comments about 522 Spanish wines (continuation)
---- Wine 212 (mark= 85) Legaris-2001
Tuestes, gominolas y buenos balsámicos marcan la intensidad media frutal de este
crianza. En boca aparece muy lineal, con consistencia media; el retrogusto frutal todavía
tapado por una madera algo rústica.

---- Wine 30 (mark=91) Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo
caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado,
tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones
de tierra húmeda y pólvora en el largo final.

---- Wine 314 (mark=97) Vega Sicilia 'Único-1994
Hay que realizar un ejercicio de disciplina gustativa de primer rango para describir este
gran vino. el bouquet es fresco, bien armado de fruta roja que se ve potenciada por tintes
de chocolates, tabacos, notas de sotobosque y una madera que se manifiesta pero que
resulta difícil de localizar y menos de concretar. Tenemos el caso raro de un tinto que
sale ileso del paso del tiempo sin lucir su armadura, que es la barrica. En boca joven,
aunque ya tiene su cuerpo vigoroso y enérgico bastante ensamblado, con la excepción de
algunos taninos saltamontes que quedan para domesticar. Largo y vibrante final que
mezcla madurez con una notable finura fresca.

in Sample Surveys






7) Conclusions


Processing Strategy

A priori Grouping (Lexical contingency table)

Juxtaposition of Lexical contingency tables

Instrumental Partition

Direct Analysis of the sparse Lexical table

79


Importance of Meta-data

Meta-data
linguistics

Grammar / Syntax
Textual data
Semantics networks

External Corpora
externes
Other a priori structures
sociolinguistics,
chronology, etc.

80


The four phases of a linguistic analysis
(A bxg flower)
A big flower A bug flower
Morphology
A bag flower A bog flower

Syntax The spoon speaks (The speaks)

Semantics
A man thinks (A stone thinks)

Pragmatics A challenge to I.A.

81


Homography, Polysemy, Synonymy

To bear

Homographs: BORE A tedious person

To bore

Task
Polysemous words: DUTY
Tax
DRUG medicine

Addicting product
82


Semantic content of a lexical profile

Distributional linguistics (Z. Harris)

X is sometimes purring
X mews
X has whiskers
X likes milk
X likes chasing mice

At the end, the point « X » will be superimposed with the point « CAT»

83


Semantic similarity is not a transitive relationship

Example of semantic chains:

(1) calm–wisdom–discretion–wariness–fear–panic,

(2) fact–feature –aspect–appearance–illusion .

84


New additional variables
Nouns (proper, common)
Verbs (auxiliary, modal, usual…)
Adjective
Pronoun
Determiner
Adverb
Preposition
Conjunction

85


new variables, new metrics

86

7) Conclusions

As a conclusion...

For each open-ended question,

and for each partition of the sample of respondents,

we obtain, without any preliminary coding or other intervention:

• A visualization of proximities between words and categories.

• Characteristic elements or words for each category .

• Modal responses for each category (a kind of automatic summary).

[Remember also that the open question “Why” following a closed
question provides an indispensable assessment of the real
understanding of the question].

7) Conclusions

As a conclusion... (continuation)

All these processing are carried out under the supervision of robust
assessment procedures:

- Non-parametric statistical tests,
- Bootstrap validation.

We are not dealing here with a novel sophisticated modelling.

It is rather a painstaking effort to stick to the real concerns of the
respondent, i.e.: the customer, the user, the client.

With the rapid development of online surveys, the spreading of e-mails
and blogs, the presented set of tools is expected to be a noteworthy
component in a new methodology of customer knowledge.

7) Conclusions – Short Bibliography

- Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo.
- Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys,
Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press).
- Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006) Détermination d’une note globale, synthèse
d’une évaluation numérique et d’appréciations libres. Application aux études de marché. (in French) Actes des JADT-2006.
- Bécue-Bertaut, M., Álvarez Esteban R., Pagès (2008,)
http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm
Rating of products through scores and free-text assertions. Comparing and combining both. Food Quality and Preference,
19/1, 122-134.
- Belson W.A., Duncan J.A. (1962): A Comparison of the check-list and the open response questioning system, Applied
Statistics, 2, 120-132.
- Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
- Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge.
- Bradburn N., Sudman S., and associates (1979): Improving Interview Method and Questionnaire Design, Jossey Bass,
San Francisco.
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis,
J. of the Amer. Soc. for Information Science, 41 (6), 391-407.
- Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand colin, Paris.
- Hayashi C., Suzuki T., Sasaki M. (1992): Data Analysis for Social Comparative research: International Perspective,
North-Holland, Amsterdam.
- Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data, COMPSTAT,
Physica Verlag, 67-76.
- Lebart L., Salem A., Bécue M., (2000), Análisis estadístico de textos, Editorial Milenio, Lleida.
- Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht.
- Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley. N.Y.
- Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern. 61, 241-254.
- Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes,
Cahiers de l'Analyse des Données, 489-500.
- Sasaki M., Suzuki T. (1989): New directions in the study of general social attitudes : trends and cross-national perspectives,
Behaviormetrika, 26, 9-30.
- Schuman H., Presser F. (1981): Question and Answers in Attitude Surveys, Academic Press, New York.
- Sudman S., Bradburn N. (1974): Response Effects in Survey, Aldine, Chicago.

Surveys data and software (DtmVic)
can be downloaded from
www.dtm-vic.com

Thank You
Gracias
Grazie
Obrigado
Merci

92

4 text mining and open ended questions in sample surveys ludovic lebart cnrs

Recommended

Recommended

More Related Content

Similar to 4 text mining and open ended questions in sample surveys ludovic lebart cnrs

Similar to 4 text mining and open ended questions in sample surveys ludovic lebart cnrs (20)

More from Evelyn Femat

More from Evelyn Femat (20)

Recently uploaded

Recently uploaded (20)

4 text mining and open ended questions in sample surveys ludovic lebart cnrs