Identifying similar text documents

            Andr´ Santos
                e
         andrefs@cpan.org




            November 2011
What we get




        Andr´ Santos andrefs@cpan.org
            e                           Identifying similar text documents
Duplicated versions




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Duplicated versions




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Candidate pairs




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Candidate pairs




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Candidate pairs




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
What this is really about




                    similarity



         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.




           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)




           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)
      Proper names (e.g. “Sherlock Holmes”)




           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
Measuring similarity




                                           |ALIEs ∩ BLIEs |
       similarity (A, B) =
                                           |ALIEs ∪ BLIEs |




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Measuring similarity




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
pairbooks

  Similarity values
     < 0.2 Documents                are    not related
     > 0.4 Documents                are    candidate pairs
     > 0.9 Documents                are    near duplicates
         1.0 Documents              are    duplicates

  Languages
  High similarity, same language: (Near) duplicates
  High similarity, different language: Candidate pairs

           Andr´ Santos andrefs@cpan.org
               e                           Identifying similar text documents
Behold, pairbooks!

   ~ $ pairbooks                PT_list.txt ES_list.txt
  PTBR__Umberto_EcoO_nome_da_rosa.txt
    (0.227) [6954,7382]   ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)
    (0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.018) [6954,5604]   ES__Umberto_EcoDiario_Minimo__2.txt(...)

  PTBR__Umberto_EcoO_Pendulo_de_Focault.txt
    (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...)
    (0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt
  (...)




            Andr´ Santos andrefs@cpan.org
                e                           Identifying similar text documents
Perfect LIEs do not exist
  Year references
                         Can be confused with page numbers
                         Headers/footers can contain them
                         (publishing year, copyright, . . . )
  Proper names
                         Sometimes are translated (e.g. “S˜o
                                                          a
                         Tom´” “Judas Tom´” etc)
                              e,           e,
                         Some languages use different scripts
                         (e.g. Russian)
                         Some languages have declensions
        ...
              Andr´ Santos andrefs@cpan.org
                  e                           Identifying similar text documents
How to improve LIEs (future work)



     accept a list of equivalent words
     accept a list of stop words
     ...




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Give me one of those!


  CPAN
   http://search.cpan.org/perldoc?pairbooks

     Developer version
     requires Linux, Perl
     Incomplete documentation




         Andr´ Santos andrefs@cpan.org
             e                           Identifying similar text documents
Identifying similar text documents

            Andr´ Santos
                e
         andrefs@cpan.org




            November 2011

Identifying similar text documents

  • 1.
    Identifying similar textdocuments Andr´ Santos e andrefs@cpan.org November 2011
  • 2.
    What we get Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 3.
    Duplicated versions Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 4.
    Duplicated versions Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 5.
    Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 6.
    Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 7.
    Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 8.
    What this isreally about similarity Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 9.
    It’s all LIEs! Language Independent Element (LIE) Terms which are usually kept untouched during translation. Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 10.
    It’s all LIEs! Language Independent Element (LIE) Terms which are usually kept untouched during translation. Year references (e.g. “1977”) Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 11.
    It’s all LIEs! Language Independent Element (LIE) Terms which are usually kept untouched during translation. Year references (e.g. “1977”) Proper names (e.g. “Sherlock Holmes”) Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 12.
    Measuring similarity |ALIEs ∩ BLIEs | similarity (A, B) = |ALIEs ∪ BLIEs | Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 13.
    Measuring similarity Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 14.
    pairbooks Similarityvalues < 0.2 Documents are not related > 0.4 Documents are candidate pairs > 0.9 Documents are near duplicates 1.0 Documents are duplicates Languages High similarity, same language: (Near) duplicates High similarity, different language: Candidate pairs Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 15.
    Behold, pairbooks! ~ $ pairbooks PT_list.txt ES_list.txt PTBR__Umberto_EcoO_nome_da_rosa.txt (0.227) [6954,7382] ES__Umberto_EcoEl_Nombre_de_la_Rosa(...) (0.018) [6954,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...) (0.018) [6954,5604] ES__Umberto_EcoDiario_Minimo__2.txt(...) PTBR__Umberto_EcoO_Pendulo_de_Focault.txt (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...) (0.042) [11276,6024] ES__Umberto_EcoLa_busqueda_de_la_Le(...) (0.035) [11276,5604] ES__Umberto_EcoDiario_Minimo__2.txt (...) Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 16.
    Perfect LIEs donot exist Year references Can be confused with page numbers Headers/footers can contain them (publishing year, copyright, . . . ) Proper names Sometimes are translated (e.g. “S˜o a Tom´” “Judas Tom´” etc) e, e, Some languages use different scripts (e.g. Russian) Some languages have declensions ... Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 17.
    How to improveLIEs (future work) accept a list of equivalent words accept a list of stop words ... Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 18.
    Give me oneof those! CPAN http://search.cpan.org/perldoc?pairbooks Developer version requires Linux, Perl Incomplete documentation Andr´ Santos andrefs@cpan.org e Identifying similar text documents
  • 19.
    Identifying similar textdocuments Andr´ Santos e andrefs@cpan.org November 2011