Full text search adventures

Talk on Full Text Search, RailsConf 2010


  • Photo source: http://www.flickr.com/photos/9mohamed0/4268238013/sizes/o/
  • Photo source: http://www.flickr.com/photos/zehfernando/3457455680/
  • Photo source: http://www.flickr.com/photos/bevvell/4649795989/in/pool-97958286@N00
  • Photo source: http://www.flickr.com/photos/caveman_92223/2763166886/
  • Photo source: http://www.flickr.com/photos/lochaven/2588186224/
  • Postgres: in-database “tsvector”, partial indexes, acts_as_tsearch
    MySQL FULLTEXT indices are fully indexed fields which support stopwords, boolean searches, and relevancy ratings: http://onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html
    Note: MySQL FULLTEXT requires the MyISAM storage engine (true at the time of this talk; InnoDB gained FULLTEXT support in MySQL 5.6)
    Comparison of MySQL vs. PostgreSQL: http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL
    Solr/Lucene: separate index, language features, faceted search, similar documents ("you may also like…")
    Sphinx: typically installed on the same machine; accesses your database directly
  • Word boundaries are understood by context in Chinese, Japanese, Korean, and Thai
    CJK word boundaries are not handled in MySQL 5: http://blogs.sun.com/soapbox/entry/fulltext_and_asian_languages_with
  • Rethinking Full-Text Search for Multilingual Databases. Jeffrey Sorensen and Salim Roukos, IBM T. J. Watson Research Center, Yorktown Heights, New York, <sorenj|roukos>@us.ibm.com
  • Stop words can cause problems when using a search engine to search for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'
    http://en.wikipedia.org/wiki/Stop_words
  • Photo source: http://www.flickr.com/photos/thatguyfromcchs08/2300190277/
  • Photo source: http://www.flickr.com/photos/stuckincustoms/4443168109/sizes/l/
  • Think of a blank canvas... don’t think about Solr or Sphinx; first think about what people are trying to find and what will help them most.
    Maybe browse is more im
  • Transcript

    • 1. Adventures in Full Text Search. Sarah Allen, @ultrasaurus
    • 2. class Article < ActiveRecord::Base
          acts_as_solr
        end
    • 4. Tokyo Dystopia
    • 5. Language, Relevance, Accuracy, Speed
    • 6. Text as Language
    • 7. stemming, synonyms, stop words, word boundaries
    • 8. SELECT text FROM phrases WHERE text LIKE '%run%';
        Can you run this to the post office for me?
        I'm going for a run, want to come along?
        Cross country running
        I'm too drunk to drive.
        I am running out of battery power.
        Work is not like wolf - it won't run away.
    • 9. SELECT text FROM phrases WHERE vectors @@ 'run'::tsquery;
        Can you run this to the post office for me?
        Sorry I am running really late.
        I'm going for a run, want to come along?
        Cross country running
        I am running out of battery power.
        Work is not like wolf - it won't run away.
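
The contrast between slides 8 and 9 can be sketched in plain Ruby. This is only an illustration, not how PostgreSQL implements `tsvector`: `toy_stem` is a hypothetical, deliberately naive suffix stripper that stands in for a real stemmer, and the phrase list is the union of the phrases on the two slides.

```ruby
# Phrases taken from slides 8 and 9.
PHRASES = [
  "Can you run this to the post office for me?",
  "Sorry I am running really late.",
  "I'm going for a run, want to come along?",
  "Cross country running",
  "I'm too drunk to drive.",
  "I am running out of battery power.",
  "Work is not like wolf - it won't run away."
].freeze

# Hypothetical toy stemmer (an assumption, not a real algorithm such as Porter):
# strips a trailing "ing" from longer words, then collapses a doubled final
# consonant, so "running" becomes "run".
def toy_stem(word)
  w = word.downcase
  return w unless w.length > 5 && w.end_with?("ing")
  w = w[0...-3]
  w = w[0...-1] if w.length > 2 && w[-1] == w[-2]
  w
end

# Split text into lowercase word tokens and stem each one.
def stems(text)
  text.downcase.scan(/[a-z']+/).map { |token| toy_stem(token) }
end

# Substring search, in the spirit of LIKE '%run%': also matches "drunk".
substring_hits = PHRASES.select { |p| p.downcase.include?("run") }

# Token search on stems, in the spirit of a tsquery match: "drunk" is not "run".
stem_hits = PHRASES.select { |p| stems(p).include?(toy_stem("run")) }

puts substring_hits.length
puts stem_hits.length
```

With this toy stemmer, the substring search returns the "drunk" phrase while the stem-based search does not, yet both still find the "running" phrases.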
    • 10. Tokenization and Stemming
        Google App Engine / JRuby / Lucene
        http://full-text-search.appspot.com
        http://github.com/ultrasaurus/full-text-search-appengine
    • 11-13. http://full-text-search.appspot.com/
    • 14. http://localhost:8080/_ah/admin/datastore?kind=Notes
    • 15. ./script/generate scaffold note content:string index:List -f --skip-migration
        ./script/generate dd_model note content:string index:List -f
    • 16. class Note
          include DataMapper::Resource
          property :id, Serial
          property :content, String, :required => true, :length => 500
          property :index, List, :required => true
          timestamps :at
        end
    • 17. java_import org.apache.lucene.analysis.snowball.SnowballAnalyzer
        java_import java.io.StringReader
    • 18-19. before :valid?, :update_index
        def update_index
          analyzer = SnowballAnalyzer.new("English")
          s = StringReader.new(content)
          token_stream = analyzer.tokenStream(nil, s)
          terms = []
          while (token = token_stream.next) do
            terms << token.term
          end
          self.index = terms
        end
    • 20. http://full-text-search.appspot.com/
    • 21. a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves
        http://www.ranks.nl/resources/stopwords.html
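
A minimal sketch of what stop-word removal does at index time, and why the speaker notes warn about names like 'The Who'. `STOP_WORDS` here is a small hand-picked subset of the list on slide 21, and `index_terms` is a hypothetical helper, not part of any library mentioned in the talk.

```ruby
require "set"

# Small subset of the stop-word list on slide 21 (assumption: a real index
# would load the full list).
STOP_WORDS = Set.new(%w[a an and are in is of that the to who]).freeze

# Hypothetical helper: lowercase the text, split it into word tokens,
# and drop every token that is a stop word.
def index_terms(text)
  text.downcase.scan(/[a-z']+/).reject { |token| STOP_WORDS.include?(token) }
end

p index_terms("deer live in the woods") # => ["deer", "live", "woods"]
p index_terms("Take That")              # => ["take"]
p index_terms("The Who")                # => []
```

Under these assumptions, a query for "The Who" has no searchable terms left once stop words are removed, which is the failure mode described in the notes.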
    • 22. Word Boundaries
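
A small sketch of why word boundaries are hard for the languages named in the speaker notes. Splitting on whitespace works for English but returns a Japanese sentence as a single token. The character-bigram fallback shown here is a commonly used workaround that is not taken from the talk, and the Japanese strings are illustrative examples chosen for this sketch, not slide content.

```ruby
# Whitespace tokenization: adequate for English, not for unsegmented scripts.
def whitespace_tokens(text)
  text.split(/\s+/)
end

# Character bigrams: every pair of adjacent characters becomes a term.
# This is a swapped-in illustration technique, not the approach from the slides.
def bigrams(text)
  text.chars.each_cons(2).map(&:join)
end

p whitespace_tokens("deer live in the woods") # => ["deer", "live", "in", "the", "woods"]
p whitespace_tokens("東京都に住む")              # => ["東京都に住む"]
p bigrams("東京都")                             # => ["東京", "京都"]
```

The last line also hints at the cost of bigrams: "東京都" (Tokyo Metropolis) yields the term "京都" (Kyoto), so a search for Kyoto would match a document that only mentions Tokyo.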
    • 23-36. I love horses
        Horses are beautiful
        deer in the forest
        deer live in the woods
        You are an idiot.


    • 37. Relevance
    • 38. Accuracy
    • 39. Speed
    • 40. Write: Hosted, Database, Search, Rails
    • 41. Read: Hosted, Database, Search, Rails
    • 42. Text | Source Language | Target Language
        We’re running out of daylight | en | ja
        Could you run this? | en | ja
        Cross-country running | en | ja
        I’m going for a run, want to come along? | en | ja
    • 43-49. I’m going for a run, want to come along? | en | ja
        ha shi ri ni iku ke do issho ni ki ma su ka?
        Ikuko Kobayashi
        2009-11-29 20:36:47 UTC
        http://….16ec695a-8fce-4277-bdd4.flv
        http://….Japanese_ikuko_kobayashi.jpg
    • 51. class Page < ActiveRecord::Base
          acts_as_tsearch :fields => [ ... ]
        end
    • 52. Page.send :acts_as_tsearch, :fields => [:title]
        PagePart.send :acts_as_tsearch, :fields => [:content]
        ProgramPropertyList.send :acts_as_tsearch, :fields => [:instructor, :program_desc, :program_detail, :resource]
    • 53. @pages = Page.find_by_tsearch(@query)
    • 58. class Phrase < ActiveRecord::Base
          acts_as_tsearch :fields => [:text]
        end
    • 59. Phrase.find_by_tsearch(term, :conditions => {:language_id => target_language.id})
    • 60. When you think about search...
    • 61. Questions?