Full text search adventures

Talk on Full Text Search, RailsConf 2010

Published in: Technology

  • Photo source: http://www.flickr.com/photos/9mohamed0/4268238013/sizes/o/
  • Photo source: http://www.flickr.com/photos/zehfernando/3457455680/
  • Photo source: http://www.flickr.com/photos/bevvell/4649795989/in/pool-97958286@N00
  • Photo source: http://www.flickr.com/photos/caveman_92223/2763166886/
  • Photo source: http://www.flickr.com/photos/lochaven/2588186224/
  • Postgres: in-database “tsvector”, partial indexes, acts_as_tsearch
    MySQL FULLTEXT indices are fully indexed fields which support stopwords, boolean searches, and relevancy ratings: http://onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html
    Note: MySQL FULLTEXT requires the MyISAM storage engine (InnoDB gained FULLTEXT support only later, in MySQL 5.6)
    Comparison of MySQL vs. PostgreSQL: http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL
    Solr/Lucene: separate index, language features; faceted search, similar documents (“you may also like…”)
    Sphinx: typically installed on the same machine, directly accesses your database
  • Word boundaries are understood by context in Chinese, Japanese, Korean, and Thai
    CJK word boundaries are not handled in MySQL 5: http://blogs.sun.com/soapbox/entry/fulltext_and_asian_languages_with
  • Rethinking Full-Text Search for Multilingual Databases, Jeffrey Sorensen and Salim Roukos, IBM T. J. Watson Research Center, Yorktown Heights, New York, <sorenj|roukos>@us.ibm.com
  • Stop words can cause problems when using a search engine to search for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'
    http://en.wikipedia.org/wiki/Stop_words
  • Photo source: http://www.flickr.com/photos/thatguyfromcchs08/2300190277/
  • Photo source: http://www.flickr.com/photos/stuckincustoms/4443168109/sizes/l/
  • Think of a blank canvas. Don’t think about Solr or Sphinx; first think about what people are trying to find and what will help them most.
    Maybe browse is more im
  • Full text search adventures

    1. 1. Adventures in Full Text Search
        Sarah Allen
        @ultrasaurus
    2. 2.
        class Article < ActiveRecord::Base
          acts_as_solr
        end
    3. 3.
    4. 4. Tokyo Dystopia
    5. 5. Language, Relevance, Accuracy, Speed
    6. 6. Text as Language
    7. 7. stemming, synonyms, stop words, word boundaries
    8. 8. SELECT text FROM phrases WHERE text LIKE '%run%';
        Can you run this to the post office for me?
        I'm going for a run, want to come along?
        Cross country running
        I'm too drunk to drive.
        I am running out of battery power.
        Work is not like wolf - it won't run away.
    9. 9. SELECT text FROM phrases WHERE vectors @@ 'run'::tsquery;
        Can you run this to the post office for me?
        Sorry I am running really late.
        I'm going for a run, want to come along?
        Cross country running
        I am running out of battery power.
        Work is not like wolf - it won't run away.
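The difference between the two result lists above (the substring query returns "I'm too drunk to drive.", the tsquery one does not) can be sketched in plain Ruby. This is only an illustration under stated assumptions: `naive_stem`, `tokenize`, `substring_search` and `stemmed_search` are hypothetical helpers, and the toy stemmer is not what Postgres or Snowball actually does.

```ruby
# A few of the phrases from the slides above.
PHRASES = [
  "Can you run this to the post office for me?",
  "I'm going for a run, want to come along?",
  "Cross country running",
  "I'm too drunk to drive.",
  "I am running out of battery power."
].freeze

# Deliberately tiny stemmer (an assumption for illustration only):
# strips a trailing "ing" and undoubles the final consonant.
def naive_stem(word)
  return word unless word.length > 5 && word.end_with?("ing")
  stem = word[0...-3]
  stem =~ /([^aeiou])\1\z/ ? stem.chop : stem
end

# Split text into lowercase words and stem each one.
def tokenize(text)
  text.downcase.scan(/[a-z']+/).map { |word| naive_stem(word) }
end

# Roughly what LIKE '%term%' does: match the term anywhere, even inside a word.
def substring_search(term)
  PHRASES.select { |phrase| phrase.downcase.include?(term.downcase) }
end

# Roughly what a stemmed full-text match does: compare whole, stemmed tokens.
def stemmed_search(term)
  stem = naive_stem(term.downcase)
  PHRASES.select { |phrase| tokenize(phrase).include?(stem) }
end
```

Here `substring_search("run")` returns all five phrases because "drunk" contains "run", while `stemmed_search("run")` skips the "drunk" phrase and still finds both "running" phrases.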
    10. 10. Tokenization and Stemming
        Google App Engine / JRuby / Lucene
        http://full-text-search.appspot.com
        http://github.com/ultrasaurus/full-text-search-appengine
    11. 11. http://full-text-search.appspot.com/
    12. 12. http://full-text-search.appspot.com/
    13. 13. http://full-text-search.appspot.com/
    14. 14. http://localhost:8080/_ah/admin/datastore?kind=Notes
    15. 15.
        ./script/generate scaffold note content:string index:List -f --skip-migration
        ./script/generate dd_model note content:string index:List -f
    16. 16.
        class Note
          include DataMapper::Resource
          property :id, Serial
          property :content, String, :required => true, :length => 500
          property :index, List, :required => true
          timestamps :at
        end
    17. 17.
        java_import org.apache.lucene.analysis.snowball.SnowballAnalyzer
        java_import java.io.StringReader
    18. 18.
        # Rebuild the stemmed term list before each validation.
        before :valid?, :update_index

        def update_index
          analyzer = SnowballAnalyzer.new("English")
          s = StringReader.new(content)
          token_stream = analyzer.tokenStream(nil, s)
          terms = []
          # The loop ends when token_stream.next returns nil.
          while (token = token_stream.next) do
            terms << token.term
          end
          self.index = terms
        end
    19. 19.
        before :valid?, :update_index

        def update_index
          analyzer = SnowballAnalyzer.new("English")
          s = StringReader.new(content)
          token_stream = analyzer.tokenStream(nil, s)
          terms = []
          while (token = token_stream.next) do
            terms << token.term
          end
          self.index = terms
        end
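The slides store each note's terms in a list property; how the demo app queries that list is not shown. As a rough illustration of why precomputed terms make lookup cheap, here is a minimal in-memory sketch. `TinyIndex` is a hypothetical class, not part of the talk's code, and it skips stemming entirely (in the slides the terms would come from `SnowballAnalyzer`).

```ruby
require "set"

# Hypothetical in-memory stand-in for the datastore: maps each term to the
# ids of the notes that contain it (an inverted index).
class TinyIndex
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = Set.new }
    @notes = {}
  end

  # Store the note and record which terms it contains.
  def add(id, content)
    @notes[id] = content
    terms(content).each { |term| @postings[term] << id }
  end

  # Return the notes that contain every term of the query.
  def search(query)
    id_sets = terms(query).map { |term| @postings.fetch(term, Set.new) }
    return [] if id_sets.empty?
    id_sets.reduce(:&).sort.map { |id| @notes[id] }
  end

  private

  # Lowercase word split only; a real index would also stem here.
  def terms(text)
    text.downcase.scan(/[a-z0-9']+/).uniq
  end
end
```

For example, after `index = TinyIndex.new`, `index.add(1, "I love horses")` and `index.add(2, "Horses are beautiful")`, a call to `index.search("horses")` finds both notes and `index.search("love horses")` finds only the first.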
    20. 20. http://full-text-search.appspot.com/
    21. 21. a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves
        http://www.ranks.nl/resources/stopwords.html
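A stop-word list like this one is usually applied to the query as well as to the indexed text, which is where names made entirely of stop words run into trouble. A minimal sketch, assuming a hypothetical `searchable_terms` helper and only a handful of the words from the list:

```ruby
# A small subset of the stop-word list, enough for the example.
STOP_WORDS = %w[a an and are in of that the this to who].freeze

# Lowercase the query, split it into words, and drop the stop words.
def searchable_terms(query)
  query.downcase.scan(/[a-z']+/).reject { |word| STOP_WORDS.include?(word) }
end
```

With this filter `searchable_terms("deer in the forest")` keeps `["deer", "forest"]`, but `searchable_terms("The Who")` comes back empty, so a search for the band has nothing left to match on.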
    22. 22. Word Boundaries
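When text is written without spaces, as in Chinese, Japanese, Korean or Thai, the same character string can often be split into words in more than one way. The sketch below illustrates that ambiguity with an unspaced English string; `segmentations` and the tiny `WORDS` dictionary are hypothetical, and real segmenters rely on much larger dictionaries plus context.

```ruby
# Return every way to split text into words from the dictionary.
def segmentations(text, dictionary)
  return [[]] if text.empty?

  results = []
  (1..text.length).each do |len|
    head = text[0, len]
    next unless dictionary.include?(head)

    # Recurse on the remainder and prepend the word we just matched.
    segmentations(text[len..-1], dictionary).each do |rest|
      results << [head] + rest
    end
  end
  results
end

WORDS = %w[no now here where nowhere].freeze
```

Calling `segmentations("nowhere", WORDS)` yields three readings: "no where", "now here" and "nowhere". A search engine has to pick one of them before it can index or match the text.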
    23. 23.
    24. 24.
    25. 25.
    26. 26.
    27. 27. I love horses
    28. 28. I love horses
    29. 29. I love horses
        Horses are beautiful
    30. 30. I love horses
        Horses are beautiful
    31. 31. I love horses
        Horses are beautiful
        deer in the forest
    32. 32. I love horses
        Horses are beautiful
        deer in the forest
    33. 33. I love horses
        Horses are beautiful
        deer in the forest
        deer live in the woods
    34. 34. I love horses
        Horses are beautiful
        deer in the forest
        deer live in the woods
    35. 35. I love horses
        Horses are beautiful
        deer in the forest
        deer live in the woods
    36. 36. I love horses
        Horses are beautiful
        deer in the forest
        deer live in the woods
        You are an idiot.
    37. 37. Relevance
    38. 38. Accuracy
    39. 39. Speed
    40. 40. Write Hosted Database Search Rails
    41. 41. Read Hosted Database Search Rails
    42. 42.
        Target Text                              | Target Language | Source Language
        We’re running out of daylight            | en              | ja
        Could you run this?                      | en              | ja
        Cross-country running                    | en              | ja
        I’m going for a run, want to come along? | en              | ja
    43. 43. I’m going for a run, want to come along? | en | ja
    44. 44. I’m going for a run, want to come along? | en | ja
    45. 45. I’m going for a run, want to come along? | en | ja
        ha shi ri ni iku ke do ittsho ni ki ma su ka?
    46. 46. I’m going for a run, want to come along? | en | ja
        ha shi ri ni iku ke do ittsho ni ki ma su ka?
        Ikuko Kobayashi
    47. 47. I’m going for a run, want to come along? | en | ja
        ha shi ri ni iku ke do ittsho ni ki ma su ka?
        Ikuko Kobayashi
        2009-11-29 20:36:47 UTC
    48. 48. I’m going for a run, want to come along? | en | ja
        ha shi ri ni iku ke do ittsho ni ki ma su ka?
        Ikuko Kobayashi
        2009-11-29 20:36:47 UTC
        http://….16ec695a-8fce-4277-bdd4.flv
    49. 49. I’m going for a run, want to come along? | en | ja
        ha shi ri ni iku ke do ittsho ni ki ma su ka?
        Ikuko Kobayashi
        2009-11-29 20:36:47 UTC
        http://….16ec695a-8fce-4277-bdd4.flv
        http://….Japanese_ikuko_kobayashi.jpg
    50. 50.
    51. 51.
        class Page < ActiveRecord::Base
          acts_as_tsearch :fields => [ ... ]
        end
    52. 52.
        Page.send :acts_as_tsearch, :fields => [:title]
        PagePart.send :acts_as_tsearch, :fields => [:content]
        ProgramPropertyList.send :acts_as_tsearch,
          :fields => [:instructor, :program_desc, :program_detail, :resource]
    53. 53. @pages = Page.find_by_tsearch(@query)
    54. 54.
    55. 55.
    56. 56.
    57. 57.
    58. 58.
        class Phrase < ActiveRecord::Base
          acts_as_tsearch :fields => [:text]
        end
    59. 59.
        Phrase.find_by_tsearch(term,
          :conditions => {:language_id => target_language.id})
    60. 60. When you think about search...
    61. 61. Questions?
