Adventures in Full Text Search
Sarah Allen
@ultrasaurus
class Article < ActiveRecord::Base
  acts_as_solr
end
Tokyo Dystopia
Language
Relevance
Accuracy
 Speed
Text as Language
stemming
synonyms
stop words
word boundaries
SELECT text FROM phrases WHERE text like '%run%';

 Can you run this to the post office for me?
 I'm going for a run, want to come along?
 Cross country running
 I'm too drunk to drive.
 I am running out of battery power.
 Work is not like wolf - it won't run away.
SELECT text FROM phrases WHERE
            vectors @@ 'run'::tsquery;

 Can you run this to the post office for me?
 Sorry I am running really late.
 I'm going for a run, want to come along?
 Cross country running
 I am running out of battery power.
 Work is not like wolf - it won't run away.
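
The difference between the two queries can be sketched in plain Ruby. This is a toy illustration, not how PostgreSQL works internally: `toy_stem` is a hypothetical suffix stripper that only covers the endings in these phrases, whereas Postgres relies on a real stemming dictionary.

```ruby
PHRASES = [
  "Can you run this to the post office for me?",
  "Sorry I am running really late.",
  "I'm going for a run, want to come along?",
  "Cross country running",
  "I'm too drunk to drive.",
  "I am running out of battery power.",
  "Work is not like wolf - it won't run away."
]

# Substring match, like SQL's LIKE '%run%': it also hits "drunk".
def substring_match(phrases, needle)
  phrases.select { |p| p.downcase.include?(needle) }
end

# Hypothetical toy stemmer: undoes a doubled consonant + "ing", a plain
# "ing", and a trailing "s". A real stemmer handles far more cases.
def toy_stem(word)
  word.sub(/(.)\1ing\z/, '\1').sub(/ing\z/, "").sub(/s\z/, "")
end

# Token match, in the spirit of `vectors @@ 'run'::tsquery`:
# compare the stems of whole words against the search term.
def stemmed_match(phrases, term)
  phrases.select do |p|
    p.downcase.scan(/[a-z']+/).any? { |w| toy_stem(w) == term }
  end
end

like_hits = substring_match(PHRASES, "run")
stem_hits = stemmed_match(PHRASES, "run")

# The phrases that only the substring search returns (false positives).
puts like_hits - stem_hits
```

The stemmed search works on word tokens, so a word that merely contains the letters "run" no longer matches.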
Tokenization and Stemming
Google App Engine / JRuby / Lucene

http://full-text-search.appspot.com

http://github.com/ultrasaurus/full-text-search-appengine
http://localhost:8080/_ah/admin/datastore?kind=Notes
./script/generate scaffold note
   content:string index:List -f --skip-migration

./script/generate dd_model note content:string index:List -f
class Note
  include DataMapper::Resource

  property :id,      Serial
  property :content, String, :required => true, :length => 500
  property :index,   List,   :required => true
  timestamps :at

end
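# Lucene's Snowball analyzer tokenizes and stems text (requires JRuby)
require 'java'
java_import org.apache.lucene.analysis.snowball.SnowballAnalyzer
java_import java.io.StringReader

# Rebuild the term index before each validation
before :valid?, :update_index

def update_index
  analyzer = SnowballAnalyzer.new("English")
  s = StringReader.new(content)
  token_stream = analyzer.tokenStream(nil, s)

  # Collect the stemmed term of every token in the content
  terms = []
  while (token = token_stream.next) do
    terms << token.term
  end
  self.index = terms
end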
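
Outside JRuby, the same idea can be sketched in plain Ruby: compute a list of normalized terms when a record is saved, then answer a query by list membership. `ToyNote`, `normalize`, and `search` are hypothetical stand-ins; `normalize` only lowercases and strips a trailing "s", where the code above delegates to Lucene's SnowballAnalyzer.

```ruby
# Hypothetical stand-in for the DataMapper model: content plus a term list.
ToyNote = Struct.new(:content, :index) do
  # Crude normalization (assumption): lowercase and drop a plural "s".
  def self.normalize(word)
    word.downcase.sub(/s\z/, "")
  end

  # Mirrors update_index: rebuild the term list from the content.
  def update_index
    words = content.scan(/[[:alnum:]']+/)
    self.index = words.map { |w| self.class.normalize(w) }
  end
end

texts = ["I love horses", "Horses are beautiful", "deer live in the woods"]
notes = texts.map { |text| ToyNote.new(text).tap(&:update_index) }

# Searching is list membership on the precomputed terms.
def search(notes, query)
  term = ToyNote.normalize(query)
  notes.select { |n| n.index.include?(term) }
end

puts search(notes, "horse").map(&:content)
```

Because the query goes through the same normalization as the indexed text, "horse", "Horse", and "horses" all find the same notes.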
a about above after again against all am an and any are
    aren't as at be because been before being below between
   both but by can't cannot could couldn't did didn't do does
doesn't doing don't down during each few for from further had
   hadn't has hasn't have haven't having he he'd he'll he's her
 here here's hers herself him himself his how how's i i'd i'll i'm
i've if in into is isn't it it's its itself let's me more most mustn't
  my myself no nor not of off on once only or other ought our
    ours ourselves out over own same shan't she she'd she'll
 she's should shouldn't so some such than that that's the their
    theirs them themselves then there there's these they they'd
  they'll they're they've this those through to too under until up
 very was wasn't we we'd we'll we're we've were weren't what
   what's when when's where where's which while who who's
  whom why why's with won't would wouldn't you you'd you'll
             you're you've your yours yourself yourselves

           http://www.ranks.nl/resources/stopwords.html
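
A short sketch of how stop-word filtering changes what gets indexed. `STOP_WORDS` here is an abbreviated subset of the list above and `index_terms` is a hypothetical helper; a real index would load the full list.

```ruby
require "set"

# Abbreviated subset of the stop-word list (assumption: illustration only).
STOP_WORDS = Set.new(%w[a am an and are for i in is of out that the to who you])

# Hypothetical helper: lowercase, split into words, drop stop words.
def index_terms(text)
  text.downcase.scan(/[a-z']+/).reject { |w| STOP_WORDS.include?(w) }
end

p index_terms("I am running out of battery power")

# Every word in this name is a stop word, so nothing is left to index.
p index_terms("The Who")
```

The second call shows the cost of the technique: a phrase made up entirely of stop words produces an empty term list and cannot be found.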
Word Boundaries

I love horses
Horses are beautiful
deer in the forest
deer live in the woods
You are an idiot.
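
Why word boundaries matter can be shown with a few lines of Ruby. The Japanese sentences are illustrative examples chosen for this sketch, and character bigrams are one common fallback that the slides do not prescribe; `char_bigrams` is a hypothetical helper.

```ruby
# English: whitespace gives usable word boundaries.
english = "deer live in the woods"
p english.split

# Japanese is written without spaces between words, so splitting on
# whitespace returns the whole sentence as a single token.
japanese = "鹿は森に住んでいます"
p japanese.split.length

# One common fallback (an assumption, not from the slides): index
# overlapping two-character sequences instead of words.
def char_bigrams(text)
  text.chars.each_cons(2).map(&:join)
end

p char_bigrams("森に住む")
```

Bigrams need no dictionary, at the price of a larger index and matches that ignore real word boundaries.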


Relevance
Accuracy
Speed
Write (diagram: Rails, Database, Hosted Search)
Read (diagram: Rails, Database, Hosted Search)
Target Text                                 Target Language   Source Language
We’re running out of daylight               en                ja
Could you run this?                         en                ja
Cross-country running                       en                ja
I’m going for a run, want to come along?    en                ja
I’m going for a run, want to come along?    en    ja
ha shi ri ni iku ke do iAtsho ni ki ma su ka?
Ikuko Kobayashi
2009-11-29 20:36:47 UTC
http://….16ec695a-8fce-4277-bdd4.flv
http://….Japanese_ikuko_kobayashi.jpg
class Page < ActiveRecord::Base
  acts_as_tsearch :fields => [ ... ]
end
Page.send :acts_as_tsearch, :fields => [:title]
PagePart.send :acts_as_tsearch, :fields => [:content]
ProgramPropertyList.send :acts_as_tsearch,
  :fields => [:instructor, :program_desc,
              :program_detail, :resource]
@pages = Page.find_by_tsearch(@query)
class Phrase < ActiveRecord::Base
  acts_as_tsearch :fields => [:text]
end
Phrase.find_by_tsearch(term,
  :conditions => {:language_id =>
                   target_language.id})
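
When a search box accepts several words, the input has to become a tsquery expression before it reaches Postgres. A minimal sketch, assuming every word should be required; `to_tsquery_string` is a hypothetical helper, not part of acts_as_tsearch, which may build its queries differently.

```ruby
# Hypothetical helper: turn free-form user input into a tsquery string
# in which every word must match, joined with the & operator.
def to_tsquery_string(input)
  # Keep only runs of letters and digits, so punctuation and stray
  # tsquery operators typed by the user are dropped.
  words = input.downcase.scan(/[[:alnum:]]+/)
  words.join(" & ")
end

puts to_tsquery_string("Run away!")
puts to_tsquery_string("  cross-country   running ")
```

Stripping everything except word characters also keeps user-typed characters such as `|` or `!` from being interpreted as tsquery operators.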
When you think about search...
Questions?

Editor's Notes

  1. Photo source: http://www.flickr.com/photos/9mohamed0/4268238013/sizes/o/
  4. Photo source: http://www.flickr.com/photos/zehfernando/3457455680/
  5. Photo source: http://www.flickr.com/photos/bevvell/4649795989/in/pool-97958286@N00
  6. Photo source: http://www.flickr.com/photos/caveman_92223/2763166886/
  7. Photo source: http://www.flickr.com/photos/lochaven/2588186224/
  8. Postgres: in-database "tsvector", partial indexes, acts_as_tsearch
     MySQL: FULLTEXT indices are fully indexed fields which support stop words, boolean searches, and relevancy ratings: http://onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html
     Note: MySQL FULLTEXT requires the MyISAM storage engine
     Comparison of MySQL vs. PostgreSQL: http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL
     Solr/Lucene: separate index; language features; faceted search; similar documents ("you may also like…")
     Sphinx: typically installed on the same machine; directly accesses your database
  11. Word boundaries are understood by context in Chinese, Japanese, Korean, and Thai.
      CJK word boundaries are not handled in MySQL 5: http://blogs.sun.com/soapbox/entry/fulltext_and_asian_languages_with
  14. Rethinking Full-Text Search for Multilingual Databases. Jeffrey Sorensen and Salim Roukos, IBM T. J. Watson Research Center, Yorktown Heights, New York, <sorenj|roukos>@us.ibm.com
  26. Stop words can cause problems when using a search engine to search for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'.
      http://en.wikipedia.org/wiki/Stop_words
  48. Photo source: http://www.flickr.com/photos/thatguyfromcchs08/2300190277/
  49. Photo source: http://www.flickr.com/photos/stuckincustoms/4443168109/sizes/l/
  76. Think of a blank canvas... don't think about Solr or Sphinx; first think about what people are trying to find and what will help them most.
      Maybe browse is more im