SlideShare a Scribd company logo
Identifying similar text documents

            Andr´ Santos
                e
         andrefs@cpan.org




            November 2011
What we get




        Andr´ Santos andrefs@cpan.org
            e                           Identifying similar text documents
Duplicated versions




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Duplicated versions




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Candidate pairs




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Candidate pairs




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Candidate pairs




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
What this is really about




                    similarity



         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.




           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)




           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)
      Proper names (e.g. “Sherlock Holmes”)




           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
Measuring similarity




                                           |ALIEs ∩ BLIEs |
       similarity (A, B) =
                                           |ALIEs ∪ BLIEs |




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Measuring similarity




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
pairbooks

  Similarity values
     < 0.2 Documents                are    not related
     > 0.4 Documents                are    candidate pairs
     > 0.9 Documents                are    near duplicates
         1.0 Documents              are    duplicates

  Languages
  High similarity, same language: (Near) duplicates
  High similarity, different language: Candidate pairs

           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
Behold, pairbooks!

   ~ $ pairbooks                PT_list.txt ES_list.txt
  PTBR__Umberto_EcoO_nome_da_rosa.txt
    (0.227) [6954,7382]   ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)
    (0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.018) [6954,5604]   ES__Umberto_EcoDiario_Minimo__2.txt(...)

  PTBR__Umberto_EcoO_Pendulo_de_Focault.txt
    (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...)
    (0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt
  (...)




            Andr´ Santos andrefs@cpan.org
                e                           Identifying similar text documents
Perfect LIEs do not exist
  Year references
                         Can be confused with page numbers
                         Headers/footers can contain them
                         (publishing year, copyright, . . . )
  Proper names
                         Sometimes are translated (e.g. “S˜o
                                                          a
                         Tom´” “Judas Tom´” etc)
                              e,           e,
                         Some languages use different scripts
                         (e.g. Russian)
                         Some languages have declensions
        ...
              Andr´ Santos andrefs@cpan.org
                  e                           Identifying similar text documents
How to improve LIEs (future work)



     accept a list of equivalent words
     accept a list of stop words
     ...




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Give me one of those!


  CPAN
   http://search.cpan.org/perldoc?pairbooks

     Developer version
     requires Linux, Perl
     Incomplete documentation




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Identifying similar text documents

            Andr´ Santos
                e
         andrefs@cpan.org




            November 2011

More Related Content

Viewers also liked

Steps to change the formatting of the text
Steps to change the formatting of the textSteps to change the formatting of the text
Steps to change the formatting of the text
Jeremy Dawes
 
How we can help accountants clients save money
How we can help accountants clients save moneyHow we can help accountants clients save money
How we can help accountants clients save money
BNI Team Momentum - Buderim
 
Final mh 101 for owls 2015(1)
Final mh 101 for owls 2015(1)Final mh 101 for owls 2015(1)
Final mh 101 for owls 2015(1)
TEDx Adventure Catalyst
 
Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pinto
andrefsantos
 
Mobil Cihaz Uygulamalarında Sql Server Ce Kullanımı
Mobil Cihaz Uygulamalarında Sql Server Ce KullanımıMobil Cihaz Uygulamalarında Sql Server Ce Kullanımı
Mobil Cihaz Uygulamalarında Sql Server Ce Kullanımıekinozcicekciler
 
Professional Certifications
Professional CertificationsProfessional Certifications
Professional Certificationseeakin79
 
Our Own Success Summit
Our Own Success SummitOur Own Success Summit
Our Own Success Summit
Pollen Marketing
 
Kms 6 7 Newfeatures En
Kms 6 7 Newfeatures EnKms 6 7 Newfeatures En
Kms 6 7 Newfeatures En
srrm7
 
La Excepción
La ExcepciónLa Excepción
La Excepción
Josematute
 
Pps delz@-budapest - i - left bank-the historic part and more
Pps delz@-budapest - i - left bank-the historic part and morePps delz@-budapest - i - left bank-the historic part and more
Pps delz@-budapest - i - left bank-the historic part and more
filipj2000
 
Alf Lizzio 2015
Alf Lizzio 2015Alf Lizzio 2015
Alf Lizzio 2015
TEDx Adventure Catalyst
 

Viewers also liked (11)

Steps to change the formatting of the text
Steps to change the formatting of the textSteps to change the formatting of the text
Steps to change the formatting of the text
 
How we can help accountants clients save money
How we can help accountants clients save moneyHow we can help accountants clients save money
How we can help accountants clients save money
 
Final mh 101 for owls 2015(1)
Final mh 101 for owls 2015(1)Final mh 101 for owls 2015(1)
Final mh 101 for owls 2015(1)
 
Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pinto
 
Mobil Cihaz Uygulamalarında Sql Server Ce Kullanımı
Mobil Cihaz Uygulamalarında Sql Server Ce KullanımıMobil Cihaz Uygulamalarında Sql Server Ce Kullanımı
Mobil Cihaz Uygulamalarında Sql Server Ce Kullanımı
 
Professional Certifications
Professional CertificationsProfessional Certifications
Professional Certifications
 
Our Own Success Summit
Our Own Success SummitOur Own Success Summit
Our Own Success Summit
 
Kms 6 7 Newfeatures En
Kms 6 7 Newfeatures EnKms 6 7 Newfeatures En
Kms 6 7 Newfeatures En
 
La Excepción
La ExcepciónLa Excepción
La Excepción
 
Pps delz@-budapest - i - left bank-the historic part and more
Pps delz@-budapest - i - left bank-the historic part and morePps delz@-budapest - i - left bank-the historic part and more
Pps delz@-budapest - i - left bank-the historic part and more
 
Alf Lizzio 2015
Alf Lizzio 2015Alf Lizzio 2015
Alf Lizzio 2015
 

Similar to Identifying similar text documents

Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
Irum Malik
 
Cleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerCleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleaner
andrefsantos
 
Web and text
Web and textWeb and text
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
Houw Liong The
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
Uma Se
 
Corpus linguistics intro
Corpus linguistics introCorpus linguistics intro
Corpus linguistics intro
Alex Curtis
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
Bhaskar Mitra
 

Similar to Identifying similar text documents (7)

Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Cleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerCleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleaner
 
Web and text
Web and textWeb and text
Web and text
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
 
Corpus linguistics intro
Corpus linguistics introCorpus linguistics intro
Corpus linguistics intro
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 

More from andrefsantos

Elasto Mania
Elasto ManiaElasto Mania
Elasto Mania
andrefsantos
 
Slides
SlidesSlides
Slides
andrefsantos
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challenges
andrefsantos
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
andrefsantos
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
andrefsantos
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
andrefsantos
 
Bigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesBigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challenges
andrefsantos
 
Bigorna
BigornaBigorna
Bigorna
andrefsantos
 

More from andrefsantos (8)

Elasto Mania
Elasto ManiaElasto Mania
Elasto Mania
 
Slides
SlidesSlides
Slides
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challenges
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
 
Bigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesBigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challenges
 
Bigorna
BigornaBigorna
Bigorna
 

Recently uploaded

“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Identifying similar text documents

  • 1. Identifying similar text documents Andr´ Santos e andrefs@cpan.org November 2011
  • 2. What we get Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 3. Duplicated versions Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 4. Duplicated versions Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 5. Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 6. Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 7. Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 8. What this is really about similarity Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 9. It’s all LIEs! Language Independent Element (LIE) Terms which are usually kept untouched during translation. Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 10. It’s all LIEs! Language Independent Element (LIE) Terms which are usually kept untouched during translation. Year references (e.g. “1977”) Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 11. It’s all LIEs! Language Independent Element (LIE) Terms which are usually kept untouched during translation. Year references (e.g. “1977”) Proper names (e.g. “Sherlock Holmes”) Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 12. Measuring similarity |ALIEs ∩ BLIEs | similarity (A, B) = |ALIEs ∪ BLIEs | Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 13. Measuring similarity Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 14. pairbooks Similarity values < 0.2 Documents are not related > 0.4 Documents are candidate pairs > 0.9 Documents are near duplicates 1.0 Documents are duplicates Languages High similarity, same language: (Near) duplicates High similarity, different language: Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 15. Behold, pairbooks! ~ $ pairbooks PT_list.txt ES_list.txt PTBR__Umberto_EcoO_nome_da_rosa.txt (0.227) [6954,7382] ES__Umberto_EcoEl_Nombre_de_la_Rosa(...) (0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...) (0.018) [6954,5604] ES__Umberto_EcoDiario_Minimo__2.txt(...) PTBR__Umberto_EcoO_Pendulo_de_Focault.txt (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...) (0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...) (0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt (...) Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 16. Perfect LIEs do not exist Year references Can be confused with page numbers Headers/footers can contain them (publishing year, copyright, . . . ) Proper names Sometimes are translated (e.g. “S˜o a Tom´” “Judas Tom´” etc) e, e, Some languages use different scripts (e.g. Russian) Some languages have declensions ... Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 17. How to improve LIEs (future work) accept a list of equivalent words accept a list of stop words ... Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 18. Give me one of those! CPAN http://search.cpan.org/perldoc?pairbooks Developer version requires Linux, Perl Incomplete documentation Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 19. Identifying similar text documents Andr´ Santos e andrefs@cpan.org November 2011