Needle in an enterprise haystack
Published in: Technology


Transcript

  • 1. Needle in an enterprise haystack: search engine integrations
  • 2. Who am I? Andrew Mleczko, Plone Integrator, RedTurtle Technology (Ferrara, Italy), andrew.mleczko@redturtle.net
  • 3. So why do you need an external search engine?
  • 4. Why do you need an external search engine?
      • Plone's portal_catalog is slow on big sites (large number of indexed objects)
      • You want to reduce Plone's memory consumption (by removing heavy indexes like SearchableText)
      • You want to query Plone's content from external applications
      • You want to use advanced search features
  • 5. There are several solutions that you can use
  • 6. Plone external indexing and searching
      • Out-of-the-box:
          • collective.gsa (Google Search Appliance)
          • collective.solr (Apache Solr)
      • Custom integrations:
          • Solr
          • Tsearch2
      Photo: http://www.flickr.com/photos/jenny-pics/3527749814
  • 7. Solr? (photo: http://www.flickr.com/photos/st3f4n/2767217547)
  • 8. Solr: a search engine based on Lucene
  • 9. Lucene?
  • 10. Lucene: a full-text search library, 100% in Java
  • 11. Solr: Open Source, with XML/HTTP and JSON interfaces
  • 12. collective.solr: Python API and Plone integration
  • 13. Document format: Solr and collective.solr
  • 14. Document format (Solr):
        <add><doc>
          <field name="id">123</field>
          <field name="title">The Trap</field>
          <field name="author">Agatha Christie</field>
          <field name="genre">thriller</field>
        </doc></add>
  • 15. Document format (collective.solr):
        >>> from collective.solr.solr import SolrConnection
        >>> conn = SolrConnection(host='127.0.0.1', ...)
        >>> book = {'title': 'The Trap',
        ...         'author': 'Agatha Christie',
        ...         'genre': 'thriller'}
        >>> conn.add(**book)
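The mapping between a Python dict and the Solr `<add><doc>` message on this slide can be sketched with the standard library alone. This is only an illustration of the document format, not how collective.solr builds its requests; the helper name `build_add_message` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_add_message(document):
    """Serialize a dict of field values into a Solr <add><doc> XML message.

    Hypothetical helper for illustration; collective.solr has its own code.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in document.items():
        # One <field name="..."> element per key, value rendered as text.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


book = {
    "id": 123,
    "title": "The Trap",
    "author": "Agatha Christie",
    "genre": "thriller",
}
message = build_add_message(book)
print(message)
```

Each dict key becomes one `<field>` element, which is why the collective.solr call on the slide can take the book as plain keyword arguments.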
  • 16. Response format
  • 17. Response format (Solr):
        <response><result numFound="2" start="0">
          <doc><str name="title">Coma</str>
               <str name="author">Robin Cook</str></doc>
          <doc><str name="title">The Trap</str>
               <str name="author">Agatha Christie</str></doc>
        </result></response>
  • 18. Response format (collective.solr):
        >>> from collective.solr.parser import SolrResponse
        >>> query = {'genre': 'thriller'}
        >>> response = conn.search(q=query)
        >>> results = SolrResponse(response).response
        >>> results.numFound
        2
        >>> results[0].title
        'Coma'
        >>> results[0].author
        'Robin Cook'
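To make the response format concrete, here is a minimal sketch that parses the XML from slide 17 with the standard library. It is not the collective.solr `SolrResponse` parser; `parse_response` is a hypothetical helper and it only handles `<str>` fields, which is all the slide shows.

```python
import xml.etree.ElementTree as ET

# The response body from slide 17, with straight quotes.
RESPONSE_XML = """<response><result numFound="2" start="0">
<doc><str name="title">Coma</str><str name="author">Robin Cook</str></doc>
<doc><str name="title">The Trap</str><str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_response(xml_text):
    """Return (numFound, list of dicts) for a Solr XML response.

    Hypothetical helper: only <str> fields are read.
    """
    result = ET.fromstring(xml_text).find("result")
    docs = [
        {field.get("name"): field.text for field in doc.findall("str")}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


num_found, docs = parse_response(RESPONSE_XML)
print(num_found, docs[0]["title"], docs[0]["author"])
```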
  • 19. Who uses Solr/Lucene?
  • 20. Who uses Solr/Lucene?
  • 21. "Biblioteca Virtuale Italiana di Testi in Formato Alternativo"
  • 22. Architecture [diagram]: sources (CSV, Z39.50, web site, ...) -> retrievers -> Books -> populators -> Solr, ... -> search
  • 23. Retrievers
      • they normalize sources into a single common format
      • a source can be anything from a CSV file to a public site
  • 24. Public sites: the retriever
      • makes a query
      • grabs the HTML results
      • transforms the HTML results into Python data using a configurable XPath parser
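The "configurable XPath parser" idea can be sketched as a small mapping from field names to XPath expressions. Everything below is an assumption for illustration: the sample markup, the `PARSER_CONFIG` layout and the `parse_results` helper are hypothetical, and the standard-library parser used here only accepts well-formed markup and a limited XPath subset, so real-world HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page standing in for a site's HTML.
SAMPLE_RESULTS = """<html><body>
<div class="result"><h2>The Trap</h2><span class="author">Agatha Christie</span></div>
<div class="result"><h2>Coma</h2><span class="author">Robin Cook</span></div>
</body></html>"""

# Per-site configuration: where the rows are, and where each field sits in a row.
PARSER_CONFIG = {
    "row": ".//div[@class='result']",
    "fields": {"title": "h2", "author": "span[@class='author']"},
}


def parse_results(html_text, config):
    """Turn a result page into a list of dicts, driven by the XPath config."""
    root = ET.fromstring(html_text)
    records = []
    for row in root.findall(config["row"]):
        record = {}
        for name, path in config["fields"].items():
            record[name] = row.findtext(path, default="").strip()
        records.append(record)
    return records


records = parse_results(SAMPLE_RESULTS, PARSER_CONFIG)
print(records)
```

Keeping the XPath expressions in configuration means a new public site needs a new config entry rather than new parsing code.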
  • 25. Normalize it! Every Book needs to have minimal metadata:
      • Title
      • Format
      • Description
      • ISBN
      • Authors
      • ISSN
      • Publisher
      • Date
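A normalization step along these lines could look like the sketch below. The field names are lowercase versions of the slide's list, with the last item read as a date; filling missing values with empty strings and splitting authors on ";" are assumptions, not something the slides specify, and `normalize` is a hypothetical helper.

```python
# Lowercase versions of the metadata list on the slide (last item read as a date).
REQUIRED_FIELDS = (
    "title", "format", "description", "isbn",
    "authors", "issn", "publisher", "date",
)


def normalize(raw, source):
    """Return a record that always carries every required key.

    Assumptions: missing values become empty strings, and a string of
    authors is split on ';' into a list.
    """
    record = {field: raw.get(field, "") for field in REQUIRED_FIELDS}
    authors = record["authors"]
    if isinstance(authors, str):
        record["authors"] = [a.strip() for a in authors.split(";") if a.strip()]
    record["source"] = source  # remember which retriever produced the record
    return record


book = normalize({"title": "The Trap", "authors": "Agatha Christie"}, source="csv")
print(book)
```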
  • 26. Populators
      • Today: only one Solr populator
      • In the future: populate other sites, populate RDBMS, ...
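The retriever/populator split can be sketched as two small interfaces joined by a loop. The class and function names here are hypothetical and the in-memory populator merely stands in for Solr; the project's real classes are not shown in the slides.

```python
class ListRetriever:
    """Hypothetical retriever: yields records that are already normalized."""

    def __init__(self, records):
        self.records = list(records)

    def retrieve(self):
        return iter(self.records)


class InMemoryPopulator:
    """Hypothetical populator: stores records in a dict keyed by ISBN."""

    def __init__(self):
        self.store = {}

    def populate(self, record):
        self.store[record["isbn"]] = record


def run_pipeline(retrievers, populators):
    """Send every record from every retriever to every populator."""
    count = 0
    for retriever in retrievers:
        for record in retriever.retrieve():
            for populator in populators:
                populator.populate(record)
            count += 1
    return count


# Sample data only; the ISBN values are placeholders.
csv_source = ListRetriever([{"isbn": "111", "title": "The Trap"}])
site_source = ListRetriever([{"isbn": "222", "title": "Coma"}])
index = InMemoryPopulator()
processed = run_pipeline([csv_source, site_source], [index])
print(processed, sorted(index.store))
```

Because retrievers and populators only meet in `run_pipeline`, adding a new source or a new target (another site, an RDBMS) does not touch the existing ones.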
  • 27. Conclusions
      • multiple retrievers, multiple populators
      • we have used only the collective.solr SolrConnection API
      • 120,000 books indexed so far in Solr; querying and indexing are extremely fast
  • 28. tsearch2?
  • 29. tsearch2: a search engine fully integrated in PostgreSQL 8.3.x
  • 30. tsearch2 main features
      • Flexible and rich linguistic support (dictionaries, stop words), thesaurus
      • Full UTF-8 support
      • Sophisticated ranking functions with support for proximity and structure information (rank, rank_cd)
      • Rich query language with query rewriting support
      • Headline support (text fragments with highlighted search terms)
      • It is mature (5 years of development)
  • 31. First steps with tsearch2
        1. PostgreSQL >= 8.4 (but 8.3 will work as well)
        2. COLUMN
           ALTER TABLE content ADD COLUMN search_vector tsvector;
        3. INDEX
           CREATE INDEX search_index ON content USING gin(search_vector);
  • 32. First steps with tsearch2
        4. TRIGGER
           CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
           begin
             new.search_vector :=
               setweight(to_tsvector('pg_catalog.english', coalesce(new.subject,'')), 'A') ||
               setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'B') ||
               setweight(to_tsvector('pg_catalog.english', coalesce(new.description,'')), 'C');
             return new;
           end
           $$ LANGUAGE plpgsql;

           CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON content
             FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
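Once the trigger keeps `search_vector` up to date, a search is a match with the `@@` operator, ordered by `ts_rank`. The sketch below only builds the SQL text and its parameters; it assumes a DB-API driver that accepts `%(name)s` placeholders (psycopg2 does) and the `content` columns used in the trigger. The `build_search` helper is hypothetical and nothing here is executed against a database.

```python
# Query against the table and column created in the steps above.
# plainto_tsquery turns free text into a tsquery, so user input needs no
# special operator syntax; ts_headline returns a highlighted fragment.
SEARCH_SQL = """
SELECT title,
       ts_rank(search_vector, query) AS rank,
       ts_headline('pg_catalog.english', description, query) AS snippet
FROM content, plainto_tsquery('pg_catalog.english', %(term)s) AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT %(limit)s
"""


def build_search(term, limit=10):
    """Return the SQL text and the parameters to bind (hypothetical helper)."""
    return SEARCH_SQL, {"term": term, "limit": limit}


sql, params = build_search("Ferrara")
print(params)
# With an open connection this would run as: cursor.execute(sql, params)
```

Passing the search term as a bound parameter, instead of formatting it into the SQL string, keeps user input out of the query text.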
  • 33. How to serialize Plone content to SQL?
  • 34. ore.contentmirror: "it focuses and supports out of the box, content deployment to a relational database"
  • 35. How to add tsearch2 to the ore.contentmirror DDL?
  • 36. How to add tsearch2 to the ore.contentmirror DDL?
        >>> from ore.contentmirror.schema import content
        >>> def setup_search(event, schema_item, bind):
        ...     bind.execute("alter table content add "
        ...                  "column search_vector tsvector")
        >>> content.append_ddl_listener('after-create',
        ...                             setup_search)
  • 37. Geco: community portal for Italian youth
  • 38. Geco
      • Started in 2009 for Emilia-Romagna
      • Multiple content types, including video, polls, articles and more
  • 39. Geco
      • 95 editors (Emilia-Romagna)
      • 100,000 documents (Emilia-Romagna)
      • This year: 2 other regions join
      • Future: all 20 regions join the project
      • Every region has its own server deployment
  • 40. Objectives
      ✓ fast and efficient search engine that can integrate multiple different Plone sites
      ✓ search results should be ordered by rank
      ✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)
  • 41. rt.tsearch2
      • integrates tsearch2 in PostgreSQL
      • extends SQLAlchemy queries with rank sorting
  • 42. rt.tsearch2: rank sorting
        >>> from sqlalchemy import desc
        >>> rank = '{0,0.05,0.05,0.9}'
        >>> term = 'Ferrara'
        >>> query = query.order_by(desc(
        ...     "ts_rank('%s', Content.search_vector, to_tsquery('%s'))"
        ...     % (rank, term)))
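The rank string on this slide is a PostgreSQL weights array, which `ts_rank` reads in the order {D, C, B, A}. With the labels assigned by the trigger on slide 32 (A = subject, B = title, C = description), `'{0,0.05,0.05,0.9}'` gives subject matches a weight of 0.9 and title and description matches 0.05 each. The sketch below builds that literal from a label mapping and passes both values as bound parameters instead of formatting them into the SQL text. The helper names and the `%(name)s` placeholder style are assumptions and not part of rt.tsearch2.

```python
# PostgreSQL reads the ts_rank weights array in the order {D, C, B, A}.
WEIGHT_ORDER = ("D", "C", "B", "A")


def weights_literal(weights):
    """Build a '{D,C,B,A}' array literal from a label -> weight mapping."""
    return "{" + ",".join(str(weights.get(label, 0)) for label in WEIGHT_ORDER) + "}"


def ranked_order_by(weights, term):
    """Return an ORDER BY fragment and the parameters to bind to it.

    Hypothetical helper; assumes a driver with %(name)s placeholders.
    """
    clause = ("ts_rank(%(weights)s::float4[], search_vector, "
              "to_tsquery(%(term)s)) DESC")
    return clause, {"weights": weights_literal(weights), "term": term}


# Labels as assigned by the trigger: A = subject, B = title, C = description.
clause, params = ranked_order_by({"A": 0.9, "B": 0.05, "C": 0.05}, "Ferrara")
print(params["weights"])
```

Note that `to_tsquery` expects tsquery syntax, so a multi-word phrase typed by a user would need `plainto_tsquery` or explicit operators.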
  • 43. Conclusions (photo: http://www.flickr.com/photos/vramak/3499502280)
  • 44. Conclusions
      ✓ Integrating an external search engine in Plone is easy!
      ✓ You can find a solution that suits your needs!
  • 45. Questions: Andrew Mleczko, RedTurtle Technology, andrew.mleczko@redturtle.net
  • 46. Thank you.