Experimenting the
TextTiling Algorithm
Summary of the work done by master
students at Université Toulouse Le Mirail
Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L.,
Delpech E., El Maarouf I., Fontan L., Gotlik W.
Experimenting the TextTiling algorithm
Part I: What is the TextTiling algorithm?
Part II: Experiments with the TextTiling algorithm
Part III: Demo
Part I:
What is the TextTiling algorithm?
 « an algorithm for partitioning expository texts into
coherent multi-paragraph discourse units which reflect
the subtopic structure of the texts »

 developed by Marti Hearst (1997):
«TextTiling: Segmenting Text into Multi-Paragraph
Subtopic Passages », In Computational Linguistics, March
1997.
http://www.ischool.berkeley.edu/~hearst/tiling-about.html
Why segment a text into multi-paragraph units?
Computational tasks that use arbitrary windows might
benefit from windows with motivated boundaries
Easier reading of long online texts (reading
assistance tools)
IR: retrieving relevant passages instead of whole
documents
Summarization: extracting sentences according to their
position in the subtopic structure
What is the hypothesis behind TextTiling?

 « TextTiling assumes that a set of lexical items is in use
during the course of a given subtopic discussion, and
when that subtopic changes, a significant proportion of the
vocabulary changes as well »
TextTiling does not detect subtopics per se, only shifts in
topic, signalled by changes in vocabulary
It produces a linear segmentation (no hierarchy)
Detection of topic shift
 Raw text
 Tokenisation
 Segmentation into pseudo-sentences (20 tokens)
 At every pseudo-sentence gap, a similarity score S is computed between
the 2 adjacent blocks (block A vs block B) of 6 pseudo-sentences each
 The more vocabulary the two blocks have in common, the higher the score
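As a minimal sketch of the first steps of this pipeline (not the students' Perl implementation), the Python below cuts a text into pseudo-sentences and builds the pair of blocks compared at each gap. The function names `tokenize`, `pseudo_sentences` and `block_pairs` are hypothetical, and the regular-expression tokenizer is a simplifying assumption; the defaults w=20 and k=6 are the values given on the slide.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of word characters (an assumption)."""
    return re.findall(r"\w+", text.lower())


def pseudo_sentences(tokens: List[str], w: int = 20) -> List[List[str]]:
    """Cut the token stream into fixed-size pseudo-sentences of w tokens."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def block_pairs(sentences: List[List[str]], k: int = 6) -> List[Tuple[List[str], List[str]]]:
    """For each gap between pseudo-sentences, return (left block, right block).

    Each block holds up to k pseudo-sentences; blocks near the edges of the
    text are simply shorter.
    """
    pairs = []
    for gap in range(1, len(sentences)):
        left = [t for s in sentences[max(0, gap - k):gap] for t in s]
        right = [t for s in sentences[gap:gap + k] for t in s]
        pairs.append((left, right))
    return pairs
```

Each pair returned by `block_pairs` is what a similarity function would then score, giving one value per gap.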
I. Detection of topic shift
 a gap (a valley in the score curve) means there is a drop in vocabulary similarity
 topic shifts occur at the deepest gaps (after smoothing)
 tile boundaries are then adjusted to the nearest paragraph break
[Plot: similarity score (0 to 1) against pseudo-sentence number (1 to 39)]
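One way to make "deepest gaps" concrete is a depth score: for each point of the score curve, climb to the nearest peak on the left and on the right and add the two drops. The sketch below follows that idea; `depth_scores` and `deepest_gaps` are hypothetical names, and picking the n deepest points (instead of applying a threshold) is an assumption made to keep the example short.

```python
from typing import List


def depth_scores(scores: List[float]) -> List[float]:
    """Depth of each point: (left peak - score) + (right peak - score)."""
    depths = []
    for i, s in enumerate(scores):
        # Climb leftwards while the curve keeps rising.
        left = s
        for j in range(i - 1, -1, -1):
            if scores[j] < left:
                break
            left = scores[j]
        # Climb rightwards while the curve keeps rising.
        right = s
        for j in range(i + 1, len(scores)):
            if scores[j] < right:
                break
            right = scores[j]
        depths.append((left - s) + (right - s))
    return depths


def deepest_gaps(scores: List[float], n: int) -> List[int]:
    """Indices of the n deepest valleys, returned in text order."""
    depths = depth_scores(scores)
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:n])
```

For example, `deepest_gaps([0.8, 0.6, 0.9, 0.7, 0.3, 0.75], 2)` selects positions 1 and 4, the two valleys of that curve.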
Evaluation by Hearst (1997)
 Evaluation on 12 magazine articles annotated by 7
judges

 Judges are asked « to mark the paragraph boundary at
which the topic changed »

 In case of disagreement among judges, a boundary is
kept if at least 3 judges agree on it

 Agreement among judges (kappa measure):

kappa = 0.647
Evaluation by Hearst (1997)
                     Precision   Recall
Baseline (random)    0.43        0.42
TextTiler            0.66        0.61
Judges               0.81        0.71

Works well on long (1,800+ words) expository texts with
little structural demarcation
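Precision and recall of this kind can be computed by comparing the set of predicted boundaries with the set of reference boundaries. The sketch below assumes boundaries are identified by paragraph index and that only exact matches count; the function name `boundary_scores` and the sample indices are hypothetical.

```python
from typing import Set, Tuple


def boundary_scores(predicted: Set[int], reference: Set[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for exact boundary matches."""
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical boundaries, given as paragraph indices.
p, r, f = boundary_scores({3, 7, 12}, {3, 8, 12, 15})
print(round(p, 2), round(r, 2), round(f, 2))
```

Exact matching is strict: a boundary placed one paragraph away from the reference counts as both a false positive and a miss.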
Part II: Experiments with
the TextTiling algorithm
 Work done by master's students, Université Toulouse Le
Mirail

 Implementation in Perl
 Experiments:
 cross-annotation of 3 texts
 variation of:
 linguistic parameters
 computation parameters
Annotation of topic boundaries
 No clear-cut topic shift, rather ‘regions’ of shift
Annotators felt a smaller unit (the sentence) would have
been more convenient

 Our kappa: 0.56
 Hearst’s judges: 0.65

 kappa should be at least 0.67; above 0.8 is best

 A difficult (unnatural ?) task for humans
Variation of linguistic parameters
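[Bar chart: precision, recall and F-measure for three configurations: basic, trigrams, lemmatization (TreeTagger*)]
Data labels in the chart, in extraction order (their assignment to bars is not recoverable): 0.61, 0.58, 0.53, 0.35, 0.34, 0.26, 0.23, 0.25, 0.17
* http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/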
Variation of computation parameters
 Computation window:
 pseudo-sentence length
 block length
 Smoothing
[Plots: score curves (y-axis from 0 to 0.7) for different parameter settings]
Size of computation window
Rows: pseudo-sentence length; columns: block length

       2     4     6     8     10    12    14    16    18    20
5      ++    +++   ++    ++    ++    ++    ++    ++    ++    ++
10     ++    ++    ++    +     +     ++    +     +     +     +
15     ++    +     +     +     +     +     +     -     -     -
20     +     +     +     -     -     -     -     -     -     --
25     +     +     -     -     -     -     -     --    --    --
30     +     -     -     -     -     --    --    --    --    --
35     +     -     -     -     -     --    --    --    --    --
40     --    --    --    --    --    --    --    --    --    --
Correlation
window size / smoothing
window size (number of tokens)   10   20   30   40   50
smoothing iterations             3    3    1    1    1
smoothing width                  2    1    2    2    1

 Correlation between window size and smoothing:
the smaller the window, the more smoothing is needed
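The two smoothing parameters can be illustrated with a repeated moving average. In this sketch, `width` is read as the number of neighbours taken on each side and `iterations` as the number of passes; both readings are assumptions, and `smooth` is a hypothetical helper rather than the students' code.

```python
from typing import List


def smooth(scores: List[float], width: int = 1, iterations: int = 1) -> List[float]:
    """Apply a moving average `iterations` times.

    Each point is replaced by the mean of itself and up to `width`
    neighbours on each side (fewer at the edges of the sequence).
    """
    out = list(scores)
    for _ in range(iterations):
        prev = out
        out = []
        for i in range(len(prev)):
            lo = max(0, i - width)
            hi = min(len(prev), i + width + 1)
            window = prev[lo:hi]
            out.append(sum(window) / len(window))
    return out


print(smooth([0.0, 3.0, 0.0], width=1, iterations=1))  # [1.5, 1.0, 1.5]
```

A small window gives a noisier score curve, which is why it tends to need more passes or a wider average before the valleys become reliable.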
Optimal parameter sets
          Nb parag.   Nb words   Words/parag.   Sentences/block   Tokens/sentence   Smooth. iteration   Smooth. width
Text 1    12          2000       167            6                 5                 3                   2
Text 2    22          2400       109            6                 10                1                   1
Text 3    37          1750       20             8                 10                1                   1

 One optimal parameter set per text
 Does the optimal set vary according to text/paragraph
length?
Final thoughts
 Linguistic processing:
lemmatization doesn’t significantly improve TextTiling
 what about stemming?

 Computation parameters:
 parameters are highly interdependent
 the optimal parameter set varies from text to text

 Proposal: an adaptive TextTiler?
 window size could be adapted to the text's intrinsic qualities
 smoothing could then be adapted to window size
Part III :

Demo
Similarity score – Hearst (1997)

\[
\mathrm{sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1} \, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \; \sum_t w_{t,b_2}^2}}
\]

b1: block 1
b2: block 2
t: token
w: weight (frequency) of the token in the block
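This score is the cosine of the angle between the two blocks' term-frequency vectors. A minimal Python sketch, assuming each block is given as a flat list of tokens and that raw frequency is the weight (`block_similarity` is a hypothetical name):

```python
import math
from collections import Counter
from typing import List


def block_similarity(block1: List[str], block2: List[str]) -> float:
    """Cosine similarity between the token-frequency vectors of two blocks."""
    w1, w2 = Counter(block1), Counter(block2)
    # Only tokens present in both blocks contribute to the numerator.
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm = math.sqrt(sum(v * v for v in w1.values()) * sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0


print(block_similarity(["a", "a", "b"], ["a", "b", "b"]))  # 0.8
```

Two blocks with no token in common score 0, and two blocks with identical frequencies score 1.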
Kappa measure
http://www.musc.edu/dc/icrebm/kappa.html
                Annot 1 yes   Annot 1 no   TOTAL
Annot 2 yes     40            35           Y2 = 75
Annot 2 no      5             20           N2 = 25
TOTAL           Y1 = 45       N1 = 55      T = 100

Agreement: P(A) = (40 + 20) / T = 0.6
Expected agreement: P(E) = (Y1.Y2 + N1.N2) / T² = 0.475
Kappa = (P(A) - P(E)) / (1 - P(E)) = 0.24
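The worked example can be checked with a few lines of Python. The function name `kappa` and its argument names are hypothetical; the counts are the four cells of the 2x2 table for the two annotators.

```python
def kappa(both_yes: int, only_a2_yes: int, only_a1_yes: int, both_no: int) -> float:
    """Kappa for two annotators making yes/no decisions."""
    total = both_yes + only_a2_yes + only_a1_yes + both_no
    # Observed agreement: both say yes or both say no.
    p_a = (both_yes + both_no) / total
    # Marginal totals for each annotator.
    y1 = both_yes + only_a1_yes   # annotator 1 says yes
    y2 = both_yes + only_a2_yes   # annotator 2 says yes
    n1 = total - y1
    n2 = total - y2
    # Agreement expected by chance from the marginals.
    p_e = (y1 * y2 + n1 * n2) / total ** 2
    return (p_a - p_e) / (1 - p_e)


print(round(kappa(40, 35, 5, 20), 2))  # 0.24
```

With these counts, P(A) = 0.6 and P(E) = 0.475, so the result matches the 0.24 obtained by hand.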

More Related Content

What's hot

Introduction to TinyML - Solomon Muhunyo Githu
Introduction to TinyML - Solomon Muhunyo GithuIntroduction to TinyML - Solomon Muhunyo Githu
Introduction to TinyML - Solomon Muhunyo GithuSolomon Githu
 
Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework Deep Learning Italia
 
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning akira-ai
 
Design Issues and Challenges in Wireless Sensor Networks
Design Issues and Challenges in Wireless Sensor NetworksDesign Issues and Challenges in Wireless Sensor Networks
Design Issues and Challenges in Wireless Sensor NetworksKhushbooGupta145
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Dataconomy Media
 
Distributed & parallel system
Distributed & parallel systemDistributed & parallel system
Distributed & parallel systemManish Singh
 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)Shweta Ghate
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the HaystackMachine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the HaystackAlistair Gillespie
 
01 - Introduction to Distributed Systems
01 - Introduction to Distributed Systems01 - Introduction to Distributed Systems
01 - Introduction to Distributed SystemsDilum Bandara
 
Explainable AI (XAI) - A Perspective
Explainable AI (XAI) - A Perspective Explainable AI (XAI) - A Perspective
Explainable AI (XAI) - A Perspective Saurabh Kaushik
 
Federated learning in brief
Federated learning in briefFederated learning in brief
Federated learning in briefShashi Perera
 
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Hayim Makabee
 

What's hot (20)

Generative models
Generative modelsGenerative models
Generative models
 
Machine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdfMachine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdf
 
Introduction to TinyML - Solomon Muhunyo Githu
Introduction to TinyML - Solomon Muhunyo GithuIntroduction to TinyML - Solomon Muhunyo Githu
Introduction to TinyML - Solomon Muhunyo Githu
 
Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework
 
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
 
Design Issues and Challenges in Wireless Sensor Networks
Design Issues and Challenges in Wireless Sensor NetworksDesign Issues and Challenges in Wireless Sensor Networks
Design Issues and Challenges in Wireless Sensor Networks
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
 
Distributed & parallel system
Distributed & parallel systemDistributed & parallel system
Distributed & parallel system
 
Random number generator
Random number generatorRandom number generator
Random number generator
 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
Draw and explain the architecture of general purpose microprocessor
Draw and explain the architecture of general purpose microprocessor Draw and explain the architecture of general purpose microprocessor
Draw and explain the architecture of general purpose microprocessor
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
 
Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the HaystackMachine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack
 
01 - Introduction to Distributed Systems
01 - Introduction to Distributed Systems01 - Introduction to Distributed Systems
01 - Introduction to Distributed Systems
 
Web-Socket
Web-SocketWeb-Socket
Web-Socket
 
Explainable AI (XAI) - A Perspective
Explainable AI (XAI) - A Perspective Explainable AI (XAI) - A Perspective
Explainable AI (XAI) - A Perspective
 
Federated learning in brief
Federated learning in briefFederated learning in brief
Federated learning in brief
 
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
 

Similar to Experimenting the TextTiling Algorithm

Transition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional BranchingTransition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional BranchingJinho Choi
 
Interactive Latency in Big Data Visualization
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualizationbigdataviz_bay
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitJ Singh
 
Self-Assembling Hyper-heuristics: a proof of concept
Self-Assembling Hyper-heuristics: a proof of conceptSelf-Assembling Hyper-heuristics: a proof of concept
Self-Assembling Hyper-heuristics: a proof of conceptGerman Terrazas
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques  Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques ijsc
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxdickonsondorris
 
Use of Neuronal Networks and Fuzzy Logic to Modelling the Foot Sizes
Use of Neuronal Networks and Fuzzy Logic to Modelling the Foot SizesUse of Neuronal Networks and Fuzzy Logic to Modelling the Foot Sizes
Use of Neuronal Networks and Fuzzy Logic to Modelling the Foot SizesAIRCC Publishing Corporation
 
A minimization approach for two level logic synthesis using constrained depth...
A minimization approach for two level logic synthesis using constrained depth...A minimization approach for two level logic synthesis using constrained depth...
A minimization approach for two level logic synthesis using constrained depth...IAEME Publication
 
cis97003
cis97003cis97003
cis97003perfj
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSijscmcj
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOpsPooyan Jamshidi
 
Mathematics Research Paper - Mathematics of Computer Networking - Final Draft
Mathematics Research Paper - Mathematics of Computer Networking - Final DraftMathematics Research Paper - Mathematics of Computer Networking - Final Draft
Mathematics Research Paper - Mathematics of Computer Networking - Final DraftAlexanderCominsky
 
A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...
A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...
A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...ijcseit
 
NEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHM
NEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHMNEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHM
NEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHMijcsit
 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesijsc
 

Similar to Experimenting the TextTiling Algorithm (20)

Transition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional BranchingTransition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional Branching
 
Interactive Latency in Big Data Visualization
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualization
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
Self-Assembling Hyper-heuristics: a proof of concept
Self-Assembling Hyper-heuristics: a proof of conceptSelf-Assembling Hyper-heuristics: a proof of concept
Self-Assembling Hyper-heuristics: a proof of concept
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Modelling and Analysis Laboratory Manual
Modelling and Analysis Laboratory ManualModelling and Analysis Laboratory Manual
Modelling and Analysis Laboratory Manual
 
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques  Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
 
Use of Neuronal Networks and Fuzzy Logic to Modelling the Foot Sizes
Use of Neuronal Networks and Fuzzy Logic to Modelling the Foot SizesUse of Neuronal Networks and Fuzzy Logic to Modelling the Foot Sizes
Use of Neuronal Networks and Fuzzy Logic to Modelling the Foot Sizes
 
50120140503004
5012014050300450120140503004
50120140503004
 
A minimization approach for two level logic synthesis using constrained depth...
A minimization approach for two level logic synthesis using constrained depth...A minimization approach for two level logic synthesis using constrained depth...
A minimization approach for two level logic synthesis using constrained depth...
 
cis97003
cis97003cis97003
cis97003
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOps
 
Mathematics Research Paper - Mathematics of Computer Networking - Final Draft
Mathematics Research Paper - Mathematics of Computer Networking - Final DraftMathematics Research Paper - Mathematics of Computer Networking - Final Draft
Mathematics Research Paper - Mathematics of Computer Networking - Final Draft
 
A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...
A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...
A General Session Based Bit Level Block Encoding Technique Using Symmetric Ke...
 
NEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHM
NEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHMNEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHM
NEW SYMMETRIC ENCRYPTION SYSTEM BASED ON EVOLUTIONARY ALGORITHM
 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniques
 
50120130406023
5012013040602350120130406023
50120130406023
 

More from Estelle Delpech

Génération automatique de texte
Génération automatique de texteGénération automatique de texte
Génération automatique de texteEstelle Delpech
 
Identification de compatibilités entre tages descriptifs de lieux
Identification de compatibilités entre tages descriptifs de lieuxIdentification de compatibilités entre tages descriptifs de lieux
Identification de compatibilités entre tages descriptifs de lieuxEstelle Delpech
 
Découverte du Traitement Automatique des Langues
Découverte du Traitement Automatique des LanguesDécouverte du Traitement Automatique des Langues
Découverte du Traitement Automatique des LanguesEstelle Delpech
 
Invited speaker, ATALA 2014 Ph. D. Thesis award
Invited speaker, ATALA 2014 Ph. D. Thesis awardInvited speaker, ATALA 2014 Ph. D. Thesis award
Invited speaker, ATALA 2014 Ph. D. Thesis awardEstelle Delpech
 
Corpus comparables et traduction assistée par ordinateur, contributions à la ...
Corpus comparables et traduction assistée par ordinateur, contributions à la ...Corpus comparables et traduction assistée par ordinateur, contributions à la ...
Corpus comparables et traduction assistée par ordinateur, contributions à la ...Estelle Delpech
 
Identification de compatibilites sémantiques entre descripteurs de lieux
Identification de compatibilites sémantiques entre descripteurs de lieuxIdentification de compatibilites sémantiques entre descripteurs de lieux
Identification de compatibilites sémantiques entre descripteurs de lieuxEstelle Delpech
 
Usage du TAL dans des applications industrielles : gestion des contenus multi...
Usage du TAL dans des applications industrielles : gestion des contenus multi...Usage du TAL dans des applications industrielles : gestion des contenus multi...
Usage du TAL dans des applications industrielles : gestion des contenus multi...Estelle Delpech
 
Nomao: data analysis for personalized local search
Nomao: data analysis for personalized local searchNomao: data analysis for personalized local search
Nomao: data analysis for personalized local searchEstelle Delpech
 
Nomao: carnet de bonnes adresses (entre amis)
Nomao: carnet de bonnes adresses (entre amis)Nomao: carnet de bonnes adresses (entre amis)
Nomao: carnet de bonnes adresses (entre amis)Estelle Delpech
 
Nomao: local search and recommendation engine
Nomao: local search and recommendation engineNomao: local search and recommendation engine
Nomao: local search and recommendation engineEstelle Delpech
 
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...Estelle Delpech
 
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...Estelle Delpech
 
Applicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologiesApplicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologiesEstelle Delpech
 
Évaluation applicative des terminologies destinées à la traduction spécialisée
Évaluation applicative des terminologies destinées à la traduction spécialiséeÉvaluation applicative des terminologies destinées à la traduction spécialisée
Évaluation applicative des terminologies destinées à la traduction spécialiséeEstelle Delpech
 
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchangeDealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchangeEstelle Delpech
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology miningEstelle Delpech
 
Robust rule-based parsing
Robust rule-based parsingRobust rule-based parsing
Robust rule-based parsingEstelle Delpech
 
Text Processing for Procedural Question Answering
Text Processing for Procedural Question AnsweringText Processing for Procedural Question Answering
Text Processing for Procedural Question AnsweringEstelle Delpech
 

More from Estelle Delpech (19)

Génération automatique de texte
Génération automatique de texteGénération automatique de texte
Génération automatique de texte
 
Identification de compatibilités entre tages descriptifs de lieux
Identification de compatibilités entre tages descriptifs de lieuxIdentification de compatibilités entre tages descriptifs de lieux
Identification de compatibilités entre tages descriptifs de lieux
 
Découverte du Traitement Automatique des Langues
Découverte du Traitement Automatique des LanguesDécouverte du Traitement Automatique des Langues
Découverte du Traitement Automatique des Langues
 
Invited speaker, ATALA 2014 Ph. D. Thesis award
Invited speaker, ATALA 2014 Ph. D. Thesis awardInvited speaker, ATALA 2014 Ph. D. Thesis award
Invited speaker, ATALA 2014 Ph. D. Thesis award
 
Corpus comparables et traduction assistée par ordinateur, contributions à la ...
Corpus comparables et traduction assistée par ordinateur, contributions à la ...Corpus comparables et traduction assistée par ordinateur, contributions à la ...
Corpus comparables et traduction assistée par ordinateur, contributions à la ...
 
Identification de compatibilites sémantiques entre descripteurs de lieux
Identification de compatibilites sémantiques entre descripteurs de lieuxIdentification de compatibilites sémantiques entre descripteurs de lieux
Identification de compatibilites sémantiques entre descripteurs de lieux
 
Usage du TAL dans des applications industrielles : gestion des contenus multi...
Usage du TAL dans des applications industrielles : gestion des contenus multi...Usage du TAL dans des applications industrielles : gestion des contenus multi...
Usage du TAL dans des applications industrielles : gestion des contenus multi...
 
Nomao: data analysis for personalized local search
Nomao: data analysis for personalized local searchNomao: data analysis for personalized local search
Nomao: data analysis for personalized local search
 
Nomao: carnet de bonnes adresses (entre amis)
Nomao: carnet de bonnes adresses (entre amis)Nomao: carnet de bonnes adresses (entre amis)
Nomao: carnet de bonnes adresses (entre amis)
 
Nomao: local search and recommendation engine
Nomao: local search and recommendation engineNomao: local search and recommendation engine
Nomao: local search and recommendation engine
 
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
Extraction of domain-specific bilingual lexicon from comparable corpora: comp...
 
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos...
 
Applicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologiesApplicative evaluation of bilingual terminologies
Applicative evaluation of bilingual terminologies
 
Évaluation applicative des terminologies destinées à la traduction spécialisée
Évaluation applicative des terminologies destinées à la traduction spécialiséeÉvaluation applicative des terminologies destinées à la traduction spécialisée
Évaluation applicative des terminologies destinées à la traduction spécialisée
 
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchangeDealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
 
R&D Lingua et Machina
R&D Lingua et MachinaR&D Lingua et Machina
R&D Lingua et Machina
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology mining
 
Robust rule-based parsing
Robust rule-based parsingRobust rule-based parsing
Robust rule-based parsing
 
Text Processing for Procedural Question Answering
Text Processing for Procedural Question AnsweringText Processing for Procedural Question Answering
Text Processing for Procedural Question Answering
 


Experimenting the TextTiling Algorithm

  • 1. Experimenting the TextTiling Algorithm. Summary of the work done by master's students at Université Toulouse Le Mirail: Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L., Delpech E., El Maarouf I., Fontan L., Gotlik W.
  • 2. Experimenting the TextTiling algorithm
      - Part I: What is the TextTiling algorithm?
      - Part II: Experimentations with the TextTiling algorithm
      - Part III: Demo
  • 3. Part I: What is the TextTiling algorithm?
      - « an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts »
      - developed by Marti Hearst (1997): « TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages », in Computational Linguistics, March 1997. http://www.ischool.berkeley.edu/~hearst/tiling-about.html
  • 4. Why segment a text into multi-paragraph units?
      - Computational tasks that use arbitrary windows might benefit from using windows with motivated boundaries
      - Ease of reading for long online texts (Reading Assistant Tools)
      - IR: retrieving relevant passages instead of the whole document
      - Summarization: extracting sentences according to their position in the subtopic structure
  • 5. What is the hypothesis behind TextTiling?
      - « TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well »
      - TextTiling doesn't detect subtopics per se but shifts in topic, by means of changes in vocabulary
      - It performs a linear segmentation (no hierarchy)
  • 6. Detection of topic shift
      - Raw text
      - Tokenisation
      - Segmentation into pseudo-sentences (20 tokens)
      - Similarity score (block A vs block B): a similarity score is computed at every pseudo-sentence, between 2 blocks of 6 pseudo-sentences
      - The more vocabulary the blocks have in common, the higher the score
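The first steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the Perl implementation mentioned in the slides: the function names, the `\w+` tokeniser and the handling of shorter blocks at the text edges are all assumptions made here. The defaults `w=20` and `k=6` are the values given on the slide.

```python
import re
from collections import Counter


def pseudo_sentences(text, w=20):
    """Lowercase the text, extract word tokens, and cut them into chunks of w tokens."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokeniser, for illustration
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def block_pairs(sentences, k=6):
    """Yield (gap, left_block, right_block) for every gap between pseudo-sentences.

    Each block is a Counter of the tokens in up to k pseudo-sentences on one
    side of the gap. Near the edges of the text the blocks are simply shorter
    (an assumption of this sketch).
    """
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - k):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + k] for t in s)
        yield gap, left, right
```

Each pair of blocks is then compared with a similarity score, which gives one value per gap.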
  • 7. I. Detection of topic shift
      - A gap means there is a drop in vocabulary similarity
      - Topic shifts occur at the deepest gaps (after smoothing)
      - Tile boundaries are adjusted to the nearest paragraph break
      [Figure: similarity score (y-axis) plotted against pseudo-sentence number (x-axis)]
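One way to make "deepest gaps" concrete is a depth score: for each gap, climb to the nearest peak on either side and add up the two drops. The sketch below assumes the scores have already been smoothed. The function names and the parameter `n` (how many boundaries to keep) are hypothetical choices for illustration; the slides do not say how the cut-off was chosen.

```python
def depth_scores(scores):
    """For each gap, sum how far its score lies below the nearest peak on each side."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):  # walk left while the curve does not fall
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:        # walk right while the curve does not fall
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths


def deepest_gaps(scores, n):
    """Return the indices of the n deepest gaps, in text order (n is hypothetical)."""
    depths = depth_scores(scores)
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:n])
```

For example, `deepest_gaps([0.8, 0.6, 0.3, 0.7, 0.9, 0.5, 0.85], 2)` picks the two valleys at positions 2 and 5. Adjusting those positions to the nearest paragraph break is a separate step not shown here.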
  • 8. Evaluation by Hearst (1997)
      - Evaluation on 12 magazine articles annotated by 7 judges
      - Judges are asked « to mark the paragraph boundary at which the topic changed »
      - In case of disagreement among judges, a boundary is kept if at least 3 judges agree on it
      - Agreement among judges (kappa measure): kappa = 0.647
  • 9. Evaluation by Hearst (1997)
                             Precision   Recall
        Baseline (random)    0.43        0.42
        TextTiler            0.66        0.61
        Judges               0.81        0.71
      Works well on long (1800+ words) expository texts with little structural demarcation
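The evaluation protocol on these two slides (keep a boundary when at least 3 judges marked it, then score a segmenter by precision and recall against that reference) can be sketched as follows. Boundaries are represented as paragraph indices, and both function names are assumptions made for this illustration.

```python
from collections import Counter


def consensus_boundaries(judgements, min_votes=3):
    """Keep the boundaries marked by at least min_votes judges.

    judgements is one list of paragraph indices per judge.
    """
    votes = Counter(b for marks in judgements for b in set(marks))
    return {b for b, n in votes.items() if n >= min_votes}


def precision_recall(predicted, reference):
    """Precision and recall of predicted boundaries against reference boundaries."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

This sketch counts exact matches only; a predicted boundary one paragraph away from a reference boundary counts as a miss.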
  • 10. Part II: Experimentations with the TextTiling algorithm
      - Work done by master's students, Université Toulouse Le Mirail
      - Implementation in Perl
      - Experimentations:
          - cross-annotation of 3 texts
          - variation of linguistic parameters and computation parameters
  • 11. Annotation of topic boundaries
      - No clear-cut topic shifts, rather 'regions' of shift
      - Annotators felt a smaller unit (the sentence) would have been more convenient
      - Our kappa: 0.56
      - Hearst's judges: 0.65
      - Kappa should be at least 0.67; above 0.8 is best
      - A difficult (unnatural?) task for humans
  • 12. Variation of linguistic parameters
      [Figure: bar chart of precision, recall and F-measure for three configurations: basic, trigrams, lemmatization (TreeTagger*)]
      * http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
  • 13. Variation of computation parameters
      - Computation window: pseudo-sentence length, block length
      - Smoothing
      [Figure: three score plots; values not recoverable from the transcript]
  • 14. Size of computation window
      Rows: pseudo-sentence length. Columns: block length.
               2     4     6     8     10    12    14    16    18    20
         5     ++    +++   ++    ++    ++    ++    ++    ++    ++    ++
        10     ++    ++    ++    +     +     ++    +     +     +     +
        15     ++    +     +     +     +     +     +     -     -     -
        20     +     +     +     -     -     -     -     -     -     --
        25     +     +     -     -     -     -     -     --    --    --
        30     +     -     -     -     -     --    --    --    --    --
        35     +     -     -     -     -     --    --    --    --    --
        40     --    --    --    --    --    --    --    --    --    --
  • 15. Correlation between window size and smoothing
        Window size (number of tokens)   10   20   30   40   50
        Smoothing iterations             3    3    1    1    1
        Smoothing width                  2    1    2    2    1
      - Window size and smoothing are correlated: the smaller the window, the more smoothing is needed
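The two smoothing parameters in this table can be read as a moving average applied several times. The sketch below assumes that `width` is the number of neighbours taken on each side of a score and that `iterations` is the number of passes; the slides do not define either term precisely, so this is one plausible reading rather than the students' implementation.

```python
def smooth(scores, width=2, iterations=1):
    """Replace each score by the mean of its neighbours within `width` positions.

    The averaging pass is repeated `iterations` times. The window shrinks at
    the two ends of the sequence.
    """
    for _ in range(iterations):
        smoothed = []
        for i in range(len(scores)):
            window = scores[max(0, i - width):i + width + 1]
            smoothed.append(sum(window) / len(window))
        scores = smoothed
    return scores
```

With a small computation window the raw scores are noisier, so more passes are needed before the deep gaps stand out, which matches the trend in the table.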
  • 16. Optimal parameter sets
                  Nb parag.   Nb words   Words/parag.   Sentences/block   Tokens/sentence   Smooth. iteration   Smooth. width
        Text 1    12          2000       167            6                 5                 3                   2
        Text 2    22          2400       109            6                 10                1                   1
        Text 3    37          1750       20             8                 10                1                   1
      - One optimal parameter set per text
      - Does the optimal set vary according to text/paragraph length?
  • 17. Final thoughts
      - Linguistic processing: lemmatization doesn't significantly improve TextTiling. What about stemming?
      - Computation parameters: the parameters are highly interdependent, and the optimal parameter set varies from text to text
      - Proposal: an adaptive TextTiler?
          - window size could be adapted to the intrinsic qualities of the text
          - smoothing could then be adapted to window size
  • 19. Similarity score, Hearst (1997)
      $$\mathrm{Sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \, \sum_t w_{t,b_2}^2}}$$
      - b1: block 1
      - b2: block 2
      - t: token
      - w: weight (frequency) of the token in the block
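The formula is a cosine similarity over token frequencies. A minimal Python sketch, assuming each block is given as a mapping from token to frequency (the function name and the zero-norm guard are choices made here):

```python
import math
from collections import Counter


def block_similarity(b1, b2):
    """Cosine similarity between two blocks given as token -> frequency mappings."""
    dot = sum(w * b2.get(t, 0) for t, w in b1.items())
    norm = math.sqrt(sum(w * w for w in b1.values()) *
                     sum(w * w for w in b2.values()))
    return dot / norm if norm else 0.0  # empty blocks score 0 by convention here


left = Counter("the cat sat on the mat".split())
right = Counter("the dog sat".split())
score = block_similarity(left, right)  # 3 / sqrt(8 * 3)
```

Blocks with no token in common score 0, and blocks with the same frequency profile score 1.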
  • 20. Kappa measure http://www.musc.edu/dc/icrebm/kappa.html
                        Annot 1: yes   Annot 1: no   TOTAL
        Annot 2: yes    40             35            Y2 = 75
        Annot 2: no     5              20            N2 = 25
        TOTAL           Y1 = 45        N1 = 55       T = 100
      - Agreement: P(A) = 0.6
      - Expected agreement: P(E) = (Y1.Y2 + N1.N2) / T² = 0.475
      - Kappa: $\kappa = \frac{P(A) - P(E)}{1 - P(E)} = 0.24$
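The worked example on this slide can be checked with a short Python function. The argument names are chosen here for illustration; the four counts are the cells of the table (both annotators say yes, only annotator 2 says yes, only annotator 1 says yes, both say no).

```python
def cohen_kappa(both_yes, only_a2_yes, only_a1_yes, both_no):
    """Kappa for two annotators making a yes/no decision on the same items."""
    total = both_yes + only_a2_yes + only_a1_yes + both_no
    observed = (both_yes + both_no) / total             # P(A)
    y1 = both_yes + only_a1_yes                         # annotator 1 "yes" total
    y2 = both_yes + only_a2_yes                         # annotator 2 "yes" total
    n1, n2 = total - y1, total - y2                     # "no" totals
    expected = (y1 * y2 + n1 * n2) / total ** 2         # P(E)
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(both_yes=40, only_a2_yes=35, only_a1_yes=5, both_no=20)
```

With the slide's counts, P(A) = 0.6 and P(E) = 0.475, so kappa = 0.125 / 0.525, which rounds to 0.24.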