Identifying similar text documents

Slides from a lightning talk about the Perl module Text::Perfide::BookPairs, presented at the I International Per-fide Workshop, University of Minho, 2011.


  1. Identifying similar text documents. André Santos, andrefs@cpan.org, November 2011
  2. What we get
  3. Duplicated versions
  4. Duplicated versions
  5. Candidate pairs
  6. Candidate pairs
  7. Candidate pairs
  8. What this is really about: similarity
  9. It’s all LIEs! Language Independent Element (LIE): terms which are usually kept untouched during translation.
  10. It’s all LIEs! Language Independent Element (LIE): terms which are usually kept untouched during translation.
     • Year references (e.g. “1977”)
  11. It’s all LIEs! Language Independent Element (LIE): terms which are usually kept untouched during translation.
     • Year references (e.g. “1977”)
     • Proper names (e.g. “Sherlock Holmes”)
  12. Measuring similarity: $\mathrm{similarity}(A, B) = \dfrac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}$
  13. Measuring similarity
  14. pairbooks
     Similarity values:
     • < 0.2: documents are not related
     • > 0.4: documents are candidate pairs
     • > 0.9: documents are near duplicates
     • 1.0: documents are duplicates
     Languages:
     • High similarity, same language: (near) duplicates
     • High similarity, different language: candidate pairs
  15. Behold, pairbooks!
     ~ $ pairbooks PT_list.txt ES_list.txt
     PTBR__Umberto_EcoO_nome_da_rosa.txt
         (0.227) [6954,7382] ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)
         (0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
         (0.018) [6954,5604] ES__Umberto_EcoDiario_Minimo__2.txt(...)
     PTBR__Umberto_EcoO_Pendulo_de_Focault.txt
         (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
         (0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...)
         (0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt (...)
  16. Perfect LIEs do not exist
     Year references:
     • Can be confused with page numbers
     • Headers/footers can contain them (publishing year, copyright, ...)
     Proper names:
     • Are sometimes translated (e.g. “São Tomé”, “Judas Tomé”, etc.)
     • Some languages use different scripts (e.g. Russian)
     • Some languages have declensions
     • ...
  17. How to improve LIEs (future work)
     • Accept a list of equivalent words
     • Accept a list of stop words
     • ...
  18. Give me one of those!
     CPAN: http://search.cpan.org/perldoc?pairbooks
     • Developer version
     • Requires Linux, Perl
     • Incomplete documentation
  19. Identifying similar text documents
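
The approach in slides 9 to 14 can be illustrated with a minimal sketch. This is not the code of pairbooks or Text::Perfide::BookPairs, which is written in Perl. It is a hypothetical Python illustration in which the regular expressions, the function names (`extract_lies`, `similarity`, `classify`) and the way the score thresholds are combined with the language rule are all assumptions. The slides give no label for scores between 0.2 and 0.4, so the sketch reports those as undetermined.

```python
import re

# Assumption: a "year reference" is a four-digit number from 1000 to 2099.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
# Assumption: a "proper name" is two or more consecutive capitalised ASCII
# words, e.g. "Sherlock Holmes". Deliberately naive: it ignores accented
# letters and will also pick up capitalised words at the start of a sentence.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_lies(text):
    """Return the set of Language Independent Elements found in text."""
    years = set(YEAR_RE.findall(text))
    names = set(NAME_RE.findall(text))
    return years | names


def similarity(lies_a, lies_b):
    """Size of the intersection of the two LIE sets divided by the size of their union."""
    union = lies_a | lies_b
    if not union:
        return 0.0  # no LIEs in either document: treat as unrelated
    return len(lies_a & lies_b) / len(union)


def classify(score, same_language):
    """Map a score to the labels on slide 14 (the combination is an assumption)."""
    if score < 0.2:
        return "not related"
    if score <= 0.4:
        return "undetermined"  # the slides give no label for this range
    if same_language and score == 1.0:
        return "duplicate"
    if same_language and score > 0.9:
        return "near duplicate"
    return "candidate pair"


english = "In 1977 Sherlock Holmes met John Watson. It was not 1984."
portuguese = "Em 1977 Sherlock Holmes encontrou John Watson em Londres."

score = similarity(extract_lies(english), extract_lies(portuguese))
print(score, classify(score, same_language=False))
```

In this toy example the English text yields four LIEs and the Portuguese text three, all shared, so the score is 3/4. The weaknesses listed on slide 16 apply directly: a page number such as 1984 would be counted as a year, and a translated name would not match.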