Needle in an enterprise haystack

Published in: Technology

  1. 1. Needle in an enterprise haystack: search engine integrations
  2. 2. Who am I? Andrew Mleczko, Plone integrator, RedTurtle Technology (Ferrara, Italy), andrew.mleczko@redturtle.net
  3. 3. So why do you need an external search engine?
  4. 4. Why do you need an external search engine?
      • Plone's portal_catalog is slow on big sites (a large number of indexed objects)
      • You want to reduce Plone's memory consumption (by removing heavy indexes like SearchableText)
      • You want to query Plone's content from external applications
      • You want to use advanced search features
  5. 5. There are several solutions that you can use
  6. 6. Plone external indexing and searching (photo: http://www.flickr.com/photos/jenny-pics/3527749814)
      • Out-of-the-box:
        • collective.gsa (Google Search Appliance)
        • collective.solr (Apache Solr)
      • Custom integrations:
        • Solr
        • Tsearch2
  7. 7. Solr? (photo: http://www.flickr.com/photos/st3f4n/2767217547)
  8. 8. Solr: a search engine based on Lucene
  9. 9. Lucene?
  10. 10. Lucene: a full-text search library written 100% in Java
  11. 11. Solr: XML/HTTP, JSON interface, open source
  12. 12. collective.solr: Python API and Plone integration
  13. 13. Document format: Solr and collective.solr
  14. 14. Document format (Solr)
          <add><doc>
            <field name="id">123</field>
            <field name="title">The Trap</field>
            <field name="author">Agatha Christie</field>
            <field name="genre">thriller</field>
          </doc></add>
  15. 15. Document format (Solr and collective.solr)
          <add><doc>
            <field name="id">123</field>
            <field name="title">The Trap</field>
            <field name="author">Agatha Christie</field>
            <field name="genre">thriller</field>
          </doc></add>
      The same document with collective.solr:
          >>> conn = SolrConnection(host='127.0.0.1', ...)
          >>> book = {'title': 'The Trap',
          ...         'author': 'Agatha Christie',
          ...         'genre': 'thriller'}
          >>> conn.add(**book)
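The `<add><doc>` message on slides 14 and 15 can be produced from a plain dictionary. The following is a minimal sketch using only the Python standard library; it is not how collective.solr builds the message internally, and the helper name `build_add_message` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_add_message(document):
    """Serialize a flat dict into a Solr <add><doc>...</doc></add> message.

    Hypothetical helper for illustration; it only handles single-valued fields.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in document.items():
        # Each dict entry becomes one <field name="...">value</field> element.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


book = {
    "id": 123,
    "title": "The Trap",
    "author": "Agatha Christie",
    "genre": "thriller",
}
message = build_add_message(book)
print(message)
```

ElementTree escapes special characters in field values, which is the main reason to prefer it over string concatenation here.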
  16. 16. Response format
  17. 17. Response format (Solr)
          <response><result numFound="2" start="0">
            <doc><str name="title">Coma</str>
              <str name="author">Robin Cook</str></doc>
            <doc><str name="title">The Trap</str>
              <str name="author">Agatha Christie</str></doc>
          </result></response>
  18. 18. Response format (collective.solr)
          >>> query = {'genre': 'thriller'}
          >>> response = conn.search(q=query)
          >>> results = SolrResponse(response).response
          >>> results.numFound
          2
          >>> results[0].title
          'Coma'
          >>> results[0].author
          'Robin Cook'
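To make the mapping between the XML response on slide 17 and the Python objects on slide 18 concrete, here is a minimal standard-library sketch that parses the same response into a hit count and a list of dictionaries. It only illustrates the idea; collective.solr's `SolrResponse` is a separate implementation, and `parse_response` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The response from slide 17, reproduced as a string.
RESPONSE = """<response><result numFound="2" start="0">
<doc><str name="title">Coma</str>
<str name="author">Robin Cook</str></doc>
<doc><str name="title">The Trap</str>
<str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_response(xml_text):
    """Return (numFound, documents) from a Solr XML response.

    Hypothetical helper: only flat <str> fields are handled.
    """
    root = ET.fromstring(xml_text)
    result = root.find("result")
    documents = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), documents


num_found, documents = parse_response(RESPONSE)
print(num_found, documents[0]["title"], documents[0]["author"])
```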
  19. 19. Who uses Solr/Lucene?
  20. 20. Who uses Solr/Lucene?
  21. 21. "Biblioteca Virtuale Italiana di Testi in Formato Alternativo"
  22. 22. Architecture [diagram: sources (CSV, Z39.50, web site, ...), one retriever per source, Books, populators, Solr, search]
  23. 23. Retrievers
      • they normalize sources into a single common format
      • a source can be anything from a CSV file to a public website
  24. 24. Public sites: the retriever
      • makes a query
      • grabs the HTML results
      • transforms the HTML results into Python data structures using a configurable XPath parser
  25. 25. Normalize it! Every Book needs to have minimal metadata:
      • Title
      • Format
      • Description
      • ISBN
      • Authors
      • ISSN
      • Publisher
      • Date
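The normalization step on slides 23 to 25 can be sketched as follows: a retriever reads a source and maps its fields onto the minimal Book metadata. This is an illustration under stated assumptions, not the project's code. The CSV column names, the `;` author separator, and the function names are all hypothetical; `date` stands for the last item in the metadata list.

```python
import csv
import io

# Minimal metadata every normalized Book carries (from slide 25).
REQUIRED_FIELDS = ("title", "format", "description", "isbn",
                   "authors", "issn", "publisher", "date")

# Hypothetical column names of one imaginary CSV source.
CSV_COLUMN_MAP = {
    "Titolo": "title",
    "Autori": "authors",
    "Editore": "publisher",
    "ISBN": "isbn",
}


def normalize(record, column_map):
    """Map one source record onto the common Book format."""
    book = dict.fromkeys(REQUIRED_FIELDS, "")
    for source_key, value in record.items():
        target = column_map.get(source_key)
        if target:
            book[target] = (value or "").strip()
    # Assumption: multiple authors are separated by ";" in the source.
    book["authors"] = [a.strip() for a in book["authors"].split(";") if a.strip()]
    return book


def retrieve_csv(text, column_map=CSV_COLUMN_MAP):
    """A CSV retriever: yields normalized Books from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        yield normalize(row, column_map)


SAMPLE = "Titolo,Autori,Editore,ISBN\nThe Trap,Agatha Christie,,\n"
books = list(retrieve_csv(SAMPLE))
print(books[0]["title"], books[0]["authors"])
```

Because every retriever emits the same dictionary shape, whatever consumes the Books does not need to know which source they came from.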
  26. 26. Populators
      Today:
      • only one Solr populator
      In the future:
      • populate other sites
      • populate an RDBMS
      • ...
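One way to picture the relationship between retrievers and populators is a small pipeline in which every Book from every retriever is handed to every populator. The sketch below assumes a retriever is any callable returning an iterable of Books and a populator is any object with a `populate` method; these interfaces and names are hypothetical and are not taken from the project.

```python
class ListPopulator:
    """Stand-in populator that collects Books in memory.

    A real populator would send each Book to Solr, another site, or an RDBMS.
    """

    def __init__(self):
        self.stored = []

    def populate(self, book):
        self.stored.append(book)


def run_pipeline(retrievers, populators):
    """Feed every Book from every retriever to every populator."""
    count = 0
    for retriever in retrievers:
        for book in retriever():
            for populator in populators:
                populator.populate(book)
            count += 1
    return count


sink = ListPopulator()
count = run_pipeline(
    retrievers=[
        lambda: [{"title": "Coma"}],      # e.g. a CSV retriever
        lambda: [{"title": "The Trap"}],  # e.g. a public-site retriever
    ],
    populators=[sink],
)
print(count, [book["title"] for book in sink.stored])
```

Adding a new target then means writing one more populator, without touching the retrievers.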
  27. 27. Conclusions
      • multiple retrievers, multiple populators
      • we used only the SolrConnection API from collective.solr
      • 120,000 books indexed in Solr so far; querying and indexing are extremely fast
  28. 28. tsearch2?
  29. 29. tsearch2? A search engine fully integrated in PostgreSQL 8.3.x
  30. 30. tsearch2 main features
      • Flexible and rich linguistic support (dictionaries, stop words), thesaurus
      • Full UTF-8 support
      • Sophisticated ranking functions with support for proximity and structure information (rank, rank_cd)
      • Rich query language with query rewriting support
      • Headline support (text fragments with highlighted search terms)
      • It is mature (5 years of development)
  31. 31. First steps with tsearch2
      1. PostgreSQL >= 8.4 (but 8.3 will work as well)
      2. COLUMN
          ALTER TABLE content ADD COLUMN search_vector tsvector;
      3. INDEX
          CREATE INDEX search_index ON content USING gin(search_vector);
  32. 32. First steps with tsearch2
      4. TRIGGER
          CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
          begin
            new.search_vector :=
              setweight(to_tsvector('pg_catalog.english', coalesce(new.subject,'')), 'A') ||
              setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'B') ||
              setweight(to_tsvector('pg_catalog.english', coalesce(new.description,'')), 'C');
            return new;
          end
          $$ LANGUAGE plpgsql;
          CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON content
            FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
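Once the column, index and trigger from slides 31 and 32 exist, a ranked search is a single SQL statement. The sketch below only builds the statement and its parameters so that it runs without a database. It assumes a DB-API driver with the `pyformat` parameter style (psycopg2 is one such driver), and the function name `search_statement` is hypothetical.

```python
# Ranked full-text query against the search_vector column from slide 31.
# The search term is passed as a bound parameter, never interpolated.
SEARCH_SQL = (
    "SELECT title, description, ts_rank(search_vector, query) AS rank "
    "FROM content, plainto_tsquery('pg_catalog.english', %(term)s) AS query "
    "WHERE search_vector @@ query "
    "ORDER BY rank DESC "
    "LIMIT %(limit)s"
)


def search_statement(term, limit=10):
    """Return (sql, params) ready for cursor.execute(sql, params)."""
    return SEARCH_SQL, {"term": term, "limit": limit}


sql, params = search_statement("Ferrara")
print(sql)
print(params)
```

With a live connection this would be used as `cursor.execute(*search_statement("Ferrara"))`. `plainto_tsquery` is chosen here because it accepts free-form user input, whereas `to_tsquery` expects tsquery operator syntax.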
  33. 33. tsearch2: how to serialize Plone content to SQL?
  34. 34. ore.contentmirror: "it focuses and supports out of the box, content deployment to a relational database"
  35. 35. How to add tsearch2 to the ore.contentmirror DDL?
  36. 36. How to add tsearch2 to the ore.contentmirror DDL?
          >>> from ore.contentmirror.schema import content
          >>> def setup_search(event, schema_item, bind):
          ...     bind.execute("alter table content add "
          ...                  "column search_vector tsvector")
          ...
          >>> content.append_ddl_listener('after-create',
          ...                             setup_search)
  37. 37. Geco: community portal for Italian youth
  38. 38. Geco
      • Started in 2009 for Emilia-Romagna
      • Multiple content types, including video, polls, articles and more
  39. 39. Geco
      • 95 editors (Emilia-Romagna)
      • 100,000 documents (Emilia-Romagna)
      • This year: 2 other regions join
      • Future: all 20 regions join the project
      • Every region has its own server deployment
  40. 40. Objectives
      ✓ a fast and efficient search engine that can integrate multiple different Plone sites
      ✓ search results should be ordered by rank
      ✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)
  41. 41. rt.tsearch2
      • integrates tsearch2 in PostgreSQL
      • extends SQLAlchemy queries with rank sorting
  42. 42. rt.tsearch2
      • integrates tsearch2 in PostgreSQL
      • extends SQLAlchemy queries with rank sorting
          >>> rank = '{0,0.05,0.05,0.9}'
          >>> term = 'Ferrara'
          >>> query = query.order_by(desc(
          ...     "ts_rank('%s', Content.search_vector, to_tsquery('%s'))"
          ...     % (rank, term)))
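The `rank` literal on slide 42 is the optional weights array of PostgreSQL's `ts_rank`, which the PostgreSQL documentation defines in the order {D-weight, C-weight, B-weight, A-weight}. The small sketch below decodes the literal so the effect is easier to read; the helper name `parse_rank_weights` is hypothetical and is not part of rt.tsearch2.

```python
# Order in which ts_rank reads its weights array.
WEIGHT_LABELS = ("D", "C", "B", "A")


def parse_rank_weights(literal):
    """Turn a PostgreSQL array literal like '{0,0.05,0.05,0.9}' into a dict."""
    values = [float(v) for v in literal.strip("{}").split(",")]
    if len(values) != len(WEIGHT_LABELS):
        raise ValueError("ts_rank expects exactly four weights")
    return dict(zip(WEIGHT_LABELS, values))


weights = parse_rank_weights("{0,0.05,0.05,0.9}")
print(weights)
```

Read together with the trigger on slide 32 (subject labelled A, title B, description C), these weights make a match in the subject count far more than a match in the title or description, while unlabelled (D) text is ignored. Note also that the slide interpolates `term` directly into the SQL string; for user-supplied input, a bound parameter would be the safer choice.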
  43. 43. Conclusions (photo: http://www.flickr.com/photos/vramak/3499502280)
  44. 44. Conclusions
      ✓ Integrating an external search engine into Plone is easy!
      ✓ You can find a solution that suits your needs!
  45. 45. Questions? Andrew Mleczko, RedTurtle Technology, andrew.mleczko@redturtle.net
  46. 46. Thank you.
