Needle in an enterprise haystack
 


Usage Rights

© All Rights Reserved

    Needle in an enterprise haystack: Presentation Transcript

    • Needle in an enterprise haystack: search engine integrations 1
    • Who am I? Andrew Mleczko, Plone Integrator, RedTurtle Technology (Ferrara/Italy) andrew.mleczko@redturtle.net 2
    • so why do you need an external search engine? 3
    • why do you need an external search engine... • Plone's portal_catalog is slow with big sites (large number of indexed objects) • You want to reduce Plone memory consumption (by removing heavy indexes like SearchableText) • You want to query Plone's content from external applications • You want to use advanced search features 4
    • there are several solutions that you can use 5
    • Plone external indexing and searching • Out-of-the-box: • collective.gsa (Google Search Appliance) • collective.solr (Apache Solr) • Custom integrations: • Solr • Tsearch2 (photo: http://www.flickr.com/photos/jenny-pics/3527749814) 6
    • Solr? (photo: http://www.flickr.com/photos/st3f4n/2767217547) 7
    • a search engine based on Lucene 8
    • Lucene? 9
    • Full-text search library, 100% in Java 10
    • Solr: XML/HTTP, JSON interface, Open Source 11
    • collective.solr: Python API and Plone integration 12
    • Document format 13
      solr:
        <add><doc>
          <field name="id">123</field>
          <field name="title">The Trap</field>
          <field name="author">Agatha Christie</field>
          <field name="genre">thriller</field>
        </doc></add>
      collective.solr:
        >>> conn = SolrConnection(host='127.0.0.1', ...)
        >>> book = {'title': 'The Trap',
        ...         'author': 'Agatha Christie',
        ...         'genre': 'thriller'}
        >>> conn.add(**book)
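The mapping between the Python dict and the Solr update message on slide 13 can be sketched with the standard library alone. This is only an illustration of the wire format, not how collective.solr builds the message internally; `build_add_message` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Serialize a flat dict into a Solr <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


book = {"id": 123, "title": "The Trap",
        "author": "Agatha Christie", "genre": "thriller"}
message = build_add_message(book)
print(message)
```

Each key of the dict becomes one `<field name="...">` element, which is the same shape that `conn.add(**book)` sends on the slide.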
    • Response format 14
      solr:
        <response><result numFound="2" start="0">
          <doc><str name="title">Coma</str>
               <str name="author">Robin Cook</str></doc>
          <doc><str name="title">The Trap</str>
               <str name="author">Agatha Christie</str></doc>
        </result></response>
      collective.solr:
        >>> query = {'genre': 'thriller'}
        >>> response = conn.search(q=query)
        >>> results = SolrResponse(response).response
        >>> results.numFound
        2
        >>> results[0].title
        'Coma'
        >>> results[0].author
        'Robin Cook'
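To make the response format of slide 14 concrete, here is a minimal sketch that parses the same XML by hand. It assumes every field is a simple `<str>` element as on the slide; real Solr responses also contain other typed elements, which this hypothetical `parse_response` helper does not handle.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<response><result numFound="2" start="0">
<doc><str name="title">Coma</str><str name="author">Robin Cook</str></doc>
<doc><str name="title">The Trap</str><str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_response(xml_text):
    """Return (numFound, list of dicts) from a simple Solr XML response."""
    result = ET.fromstring(xml_text).find("result")
    docs = [{field.get("name"): field.text for field in doc}
            for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


num_found, docs = parse_response(RESPONSE)
print(num_found, docs[0]["title"], docs[0]["author"])
```

This is roughly the work that `SolrResponse` saves you from doing yourself.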
    • Who uses Solr/Lucene? 15
    • "Biblioteca Virtuale Italiana di Testi in Formato Alternativo" 16
    • Architecture (diagram): sources (CSV, Z39.50, web site, ...) → retrievers → Books → populators → solr → search 17
    • Retrievers • they normalize sources into a single common format • a source can be anything from a CSV file to a public site 18
    • Public sites • make a query • grab the HTML results • transform the HTML results into Python format using a configurable XPath parser 19
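A configurable XPath parser of the kind described on slide 19 might look like the sketch below. The page markup, the `CONFIG` layout and the `parse_results` name are all assumptions made for illustration. The example uses the limited XPath support of the standard library, which requires well-formed markup; real HTML pages usually need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page returned by a public site.
PAGE = """<html><body>
<div class="book"><span class="title">Coma</span><span class="author">Robin Cook</span></div>
<div class="book"><span class="title">The Trap</span><span class="author">Agatha Christie</span></div>
</body></html>"""

# Per-site configuration: where the result rows are, and where each
# field sits inside a row.
CONFIG = {
    "row": ".//div[@class='book']",
    "fields": {
        "title": "./span[@class='title']",
        "author": "./span[@class='author']",
    },
}


def parse_results(page, config):
    """Turn one result page into a list of plain Python dicts."""
    root = ET.fromstring(page)
    books = []
    for row in root.findall(config["row"]):
        books.append({name: row.findtext(path, default="").strip()
                      for name, path in config["fields"].items()})
    return books


books = parse_results(PAGE, CONFIG)
print(books)
```

Supporting another site then only means writing another configuration dict, not another parser.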
    • Normalize it! Every Book needs to have minimal metadata: • Title • Format • Description • ISBN • Authors • ISSN • Publisher • Date 20
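The normalization step can be sketched as a function that renames source-specific keys to the eight metadata fields of slide 20 and reports what is still missing. The slide's last field is read here as the publication date; the source column names, the key map and the choice to report missing fields rather than reject the book are assumptions.

```python
BOOK_FIELDS = ("title", "description", "authors", "publisher",
               "format", "isbn", "issn", "date")


def normalize(raw, key_map):
    """Rename source keys to book fields and list the fields left empty."""
    book = {field: None for field in BOOK_FIELDS}
    for source_key, value in raw.items():
        field = key_map.get(source_key)
        if field in book and value not in (None, ""):
            book[field] = value.strip() if isinstance(value, str) else value
    missing = [field for field in BOOK_FIELDS if book[field] is None]
    return book, missing


# Hypothetical CSV row and column mapping.
row = {"Titolo": " The Trap ", "Autore": "Agatha Christie", "Codice ISBN": ""}
key_map = {"Titolo": "title", "Autore": "authors", "Codice ISBN": "isbn"}
book, missing = normalize(row, key_map)
print(book["title"], missing)
```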
    • Populators Today: • only one Solr populator In the future: • populate other sites • populate RDBMS • ... 21
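Put together, the retriever and populator roles from slides 17 to 21 could be wired up as in this sketch. The class names, method names and the in-memory populator are hypothetical; they only show the "many retrievers, many populators" shape.

```python
import csv
import io


class CSVRetriever:
    """Reads books from CSV text and yields them in the common dict format."""

    def __init__(self, text, key_map):
        self.text = text
        self.key_map = key_map

    def retrieve(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {self.key_map[key]: value
                   for key, value in row.items() if key in self.key_map}


class MemoryPopulator:
    """Stand-in for a Solr populator: it only collects the books it is given."""

    def __init__(self):
        self.books = []

    def populate(self, book):
        self.books.append(book)


def run(retrievers, populators):
    """Feed every book from every retriever to every populator."""
    count = 0
    for retriever in retrievers:
        for book in retriever.retrieve():
            for populator in populators:
                populator.populate(book)
            count += 1
    return count


source = "Titolo,Autore\nComa,Robin Cook\nThe Trap,Agatha Christie\n"
store = MemoryPopulator()
retriever = CSVRetriever(source, {"Titolo": "title", "Autore": "authors"})
total = run([retriever], [store])
print(total, store.books)
```

A Solr populator would replace the `append` call with something like the `conn.add(**book)` call shown on slide 13.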
    • Conclusions • multiple retrievers – multiple populators • we have used only the collective.solr SolrConnection API • 120,000 books indexed so far in Solr; querying and indexing are extremely fast 22
    • tsearch2? 23
    • tsearch2? search engine fully integrated in PostgreSQL 8.3.x 24
    • tsearch2 main features • Flexible and rich linguistic support (dictionaries, stop words), thesaurus • Full UTF-8 support • Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd) • Rich query language with query rewriting support • Headline support (text fragments with highlighted search terms) • It is mature (5 years of development) 25
    • first steps with tsearch2 26
      1. PostgreSQL >= 8.4 (but 8.3 will work as well)
      2. COLUMN
         ALTER TABLE content ADD COLUMN search_vector tsvector;
      3. INDEX
         CREATE INDEX search_index ON content USING gin(search_vector);
    • first steps with tsearch2 27
      4. TRIGGER
         CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
         begin
           new.search_vector :=
             setweight(to_tsvector('pg_catalog.english', coalesce(new.subject,'')), 'A') ||
             setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'B') ||
             setweight(to_tsvector('pg_catalog.english', coalesce(new.description,'')), 'C');
           return new;
         end
         $$ LANGUAGE plpgsql;
         CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON content
           FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
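Once the column, index and trigger above exist, a search against `search_vector` could be issued from Python roughly as follows. This sketch only builds the SQL text. It assumes a DB-API driver with `%s` placeholders (psycopg2, for example) and the `content` table with `title` and `description` columns from the trigger; `build_search_query` is a hypothetical helper.

```python
def build_search_query(limit=10):
    """Return SQL that matches, highlights and ranks rows for one search term.

    The single %s placeholder is the user's search text, passed separately
    to cursor.execute() so the driver can quote it.
    """
    return (
        "SELECT title, "
        "ts_headline('pg_catalog.english', description, q) AS snippet "
        "FROM content, plainto_tsquery('pg_catalog.english', %s) AS q "
        "WHERE search_vector @@ q "
        "ORDER BY ts_rank(search_vector, q) DESC "
        "LIMIT {:d}".format(int(limit))
    )


sql = build_search_query(limit=5)
print(sql)
# With an open cursor (assumption): cursor.execute(sql, ("Ferrara",))
```

`@@` is the match operator that can use the GIN index, `ts_headline` produces the highlighted fragments mentioned on the features slide, and `ts_rank` orders the hits.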
    • tsearch2: how to serialize Plone content to SQL? 28
    • ore.contentmirror: "it focuses and supports out of the box, content deployment to a relational database" 29
    • how to add tsearch2 to ore.contentmirror ddl? 30
    • How to add tsearch2 to ore.contentmirror ddl? 31
        >>> from ore.contentmirror.schema import content
        >>> def setup_search(event, schema_item, bind):
        ...     bind.execute("alter table content add "
        ...                  "column search_vector tsvector")
        ...
        >>> content.append_ddl_listener('after-create', setup_search)
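The same listener could also create the index from slide 26, so the whole search setup happens when the table is created. The sketch below keeps the `(event, schema_item, bind)` signature from the slide and exercises it with a hypothetical `RecordingBind` stand-in, so it runs without a database or ore.contentmirror installed.

```python
SEARCH_DDL = (
    "ALTER TABLE content ADD COLUMN search_vector tsvector",
    "CREATE INDEX search_index ON content USING gin(search_vector)",
)


def setup_search(event, schema_item, bind):
    """DDL listener: run every search-related statement after table creation."""
    for statement in SEARCH_DDL:
        bind.execute(statement)


class RecordingBind:
    """Hypothetical stand-in for a connection; it only records statements."""

    def __init__(self):
        self.statements = []

    def execute(self, statement):
        self.statements.append(statement)


bind = RecordingBind()
setup_search("after-create", None, bind)
print(bind.statements)
# Registration stays as on the slide:
# content.append_ddl_listener('after-create', setup_search)
```

The trigger function and trigger from slide 27 could be appended to `SEARCH_DDL` in the same way.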
    • Geco - community portal for Italian youth 32
    • Geco • Started in 2009 for Emilia-Romagna • Multiple content types, including video, polls, articles and more 33
    • Geco • 95 editors (Emilia-Romagna) • 100,000 documents (Emilia-Romagna) • This year: 2 other regions join • Future: all 20 regions join the project • Every region has its own server deployment 34
    • Objectives ✓ fast and efficient search engine that can integrate multiple different Plone sites ✓ search results should be ordered by rank ✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments) 35
    • rt.tsearch2 • integrates tsearch2 in PostgreSQL • extends SQLAlchemy queries with rank sorting 36
        >>> rank = '{0,0.05,0.05,0.9}'
        >>> term = 'Ferrara'
        >>> query = query.order_by(desc(
        ...     "ts_rank('%s', Content.search_vector, to_tsquery('%s'))"
        ...     % (rank, term)))
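The `rank` string on slide 36 is a PostgreSQL array of four weights, which `ts_rank` reads in the order D, C, B, A. With the trigger from slide 27 (subject = A, title = B, description = C), the values above make subject matches count far more than title or description matches. A small sketch of building that literal from named weights follows; `weights_literal` is a hypothetical helper, and the placeholder form of the ORDER BY clause assumes a driver with `%s` parameters.

```python
def weights_literal(A, B, C, D):
    """Format per-label weights as a PostgreSQL float array in {D,C,B,A} order."""
    return "{" + ",".join(format(weight, "g") for weight in (D, C, B, A)) + "}"


rank = weights_literal(A=0.9, B=0.05, C=0.05, D=0)
print(rank)

# Placeholders instead of % interpolation, so the weights and the search
# term are passed to the driver as parameters rather than pasted into SQL.
order_by = "ts_rank(%s::float4[], search_vector, to_tsquery(%s)) DESC"
params = (rank, "Ferrara")
```

Naming the weights makes it harder to put a value in the wrong slot, since the array order is the reverse of the A to D labels.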
    • Conclusions (photo: http://www.flickr.com/photos/vramak/3499502280) 37
    • Conclusions ✓ Integrating an external search engine in Plone is easy! ✓ You can find a solution that suits your needs! 38
    • Questions Andrew Mleczko RedTurtle Technology andrew.mleczko@redturtle.net 39
    • Thank you. 40