MongoDB Full Text Search with Sphinx

  1. MongoDB Full Text Search with Sphinx
     Pierre Far, PhD
     Twitter: @ocwsearch
     Web: www.ocwsearch.com
     Email: pierre@ocwsearch.com
  2. About
     A search engine for the full text of OpenCourseWare course materials.
     2600+ courses, 10 universities, 11 OCW collections
     Courses in English, Japanese, Spanish, Dutch
  3. Why MongoDB?
     Very helpful community
     Document DB
     Schemaless
  4. Technology Stack
     Diagram labels: Website (HTML), API (JSON); Query; Index; mongos3; xmlpipe2; Amazon S3; Adaptor Scripts
  5. xmlpipe2
     An XML document input format for Sphinx
     Any XML source will do, so...
     Read courses from MongoDB and stream them as XML
     sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
  6. Pitfall 1: Document ID
     “ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS”
     Generate a unique 10-digit numeric ID for each course.
     The ID must be deterministic.
     Put a unique index on the ID field.
  7. Pitfall 2: UTF-8
     “Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string”
     Declared encoding: it’s a lie.
     mb_detect_encoding() is unreliable.
     2-part solution:
     1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8');
     2. $Text = FixEncoding($Text);
  8. FixEncoding();
     A set of real encoding detection functions: http://lachy.id.au/dev/2005/11/encoding-functions-source
     FixEncoding() is a wrapper for these functions.
  9. UTF-8 in Sphinx
     In sphinx.conf:
     charset_type = utf-8
     ngram_chars
     charset_table
     sphinxsearch.com/wiki/doku.php?id=charset_tables
  10. mongos3
     MongoDB document = S3 object
     Backup tool for MongoDB
     $Contents = gzencode(json_encode($Course), 9);
  11. Thanks!
     Any questions?
     Twitter: @ocwsearch
     Web: www.ocwsearch.com
     Email: pierre@ocwsearch.com
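
The xmlpipe2 and Document ID slides can be illustrated together with a minimal sketch of an adaptor script. It is written in Python rather than the PHP used in the slides, and it is not the author's code. The slides do not say how the 10-digit ID is derived, so hashing a stable course key is an assumption here, as are the names `course_doc_id` and `xmlpipe2_lines`, the `url`, `title` and `content` keys, and the choice to stay within the 32-bit unsigned range. A hash can collide, which is one reason a unique index on the ID field is still useful.

```python
import hashlib
from xml.sax.saxutils import escape

MIN_ID = 1_000_000_000   # smallest 10-digit number
MAX_ID = 4_294_967_295   # largest 32-bit unsigned integer (assumed ID width)


def course_doc_id(course_key: str) -> int:
    """Map a stable course key to a deterministic, non-zero 10-digit ID."""
    digest = hashlib.sha1(course_key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    # Fold the hash into [MIN_ID, MAX_ID]; every value there has 10 digits.
    return MIN_ID + value % (MAX_ID - MIN_ID + 1)


def xmlpipe2_lines(courses):
    """Yield an xmlpipe2 stream, one line at a time, for an iterable of courses.

    Each course is a dict with hypothetical keys: url, title, content.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    yield "<sphinx:schema>"
    yield '<sphinx:field name="title"/>'
    yield '<sphinx:field name="content"/>'
    yield "</sphinx:schema>"
    for course in courses:
        doc_id = course_doc_id(course["url"])
        yield f'<sphinx:document id="{doc_id}">'
        # escape() replaces &, < and > so the text cannot break the XML.
        yield f"<title>{escape(course['title'])}</title>"
        yield f"<content>{escape(course['content'])}</content>"
        yield "</sphinx:document>"
    yield "</sphinx:docset>"
```

A script like this would print each yielded line to standard output, where Sphinx can read it through an `xmlpipe_command`. Because the function is a generator, courses can be read from a database cursor and streamed without holding the whole collection in memory.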

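The UTF-8 pitfall comes down to not trusting the encoding a page declares. The sketch below shows that idea in Python (the slides use PHP). It is a stand-in, not a port of FixEncoding(): it accepts bytes that decode strictly as UTF-8 and otherwise tries a short list of fallback encodings. The function name `to_utf8_text` and the fallback list are assumptions chosen for illustration.

```python
def to_utf8_text(raw: bytes, fallbacks=("cp1252", "latin-1")) -> str:
    """Decode bytes of unknown encoding into text that is safe to store as UTF-8."""
    try:
        # Strict decoding doubles as validation: invalid UTF-8 raises an error.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort if no fallback matched: keep what decodes, mark the rest.
    return raw.decode("utf-8", errors="replace")
```

The returned string can always be encoded as UTF-8 before it is stored, which avoids the "non-utf8 string" class of error. The fallbacks are guesses, so text in other legacy encodings may still come out garbled.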
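
The mongos3 slide stores each MongoDB document as one gzip-compressed JSON object. A Python equivalent of the `gzencode(json_encode($Course), 9)` line might look like the sketch below. The function names are hypothetical, the S3 upload itself is left out, and `default=str` is an assumption for turning values JSON cannot represent (such as object IDs or dates) into strings.

```python
import gzip
import json


def course_to_s3_body(course: dict) -> bytes:
    """Serialize one course document to gzip-compressed JSON (level 9)."""
    text = json.dumps(course, default=str)  # default=str: assumed fallback
    return gzip.compress(text.encode("utf-8"), compresslevel=9)


def s3_body_to_course(body: bytes) -> dict:
    """Reverse of course_to_s3_body, for restoring a backup."""
    return json.loads(gzip.decompress(body).decode("utf-8"))
```

Values converted by `default=str` come back as plain strings on restore, so a real backup tool would need its own rule for rebuilding such types.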