MongoDB Full Text Search with Sphinx
  • 10gen and users. A course is a (really long) document. Allowed OCW Search to get new features seamlessly.
  • PHP script. Only the fields to be indexed.
  • Use course metadata in one algorithm that always produces the same output given the same inputs.
  • Need a way to work with all kinds of input.
  • Uses regexes. Ugly, but works. PHP crashes with regexes matching really long strings. Split up the string into an array and loop, detecting the encoding and reacting accordingly. It's probably wrong for cases I've yet to see.
  • Uses the CloudFusion library. Object name = unique ID.
  • Transcript

    • 1. MongoDB Full Text Search with Sphinx
      Pierre Far, PhD
      Twitter: @ocwsearch
      Web: www.ocwsearch.com
      Email: pierre@ocwsearch.com
    • 2. About
      A search engine of the full text of OpenCourseWare course materials.
      2600+ courses, 10 universities, 11 OCW collections
      Courses in English, Japanese, Spanish, Dutch
    • 3. Why MongoDB?
      Very helpful community
      Document DB
      Schemaless
    • 4. Technology Stack
      Diagram labels: Website (HTML), API (JSON); Query; Index; mongos3; xmlpipe2; Amazon S3; Adaptor Scripts
    • 5. xmlpipe2
      An XML document input format for Sphinx
      Any XML source, so...
      Read courses from MongoDB and stream as XML
      sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
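
The xmlpipe2 step on slide 5 could look roughly like the sketch below. It is not the author's adaptor script: the helper name `courses_to_xmlpipe2()` and the field names `sphinx_id`, `title`, and `text` are assumptions made for illustration, and the courses are passed in as plain arrays rather than read from a MongoDB cursor.

```php
<?php
// Hypothetical helper: turns course records into an xmlpipe2 document.
// The field names (sphinx_id, title, text) are illustrative assumptions.
function courses_to_xmlpipe2(iterable $courses): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('sphinx:docset');

    // Declare the full-text fields once, before any documents.
    $xml->startElement('sphinx:schema');
    foreach (['title', 'text'] as $field) {
        $xml->startElement('sphinx:field');
        $xml->writeAttribute('name', $field);
        $xml->endElement(); // sphinx:field
    }
    $xml->endElement(); // sphinx:schema

    // One sphinx:document per course, keyed by its numeric ID.
    foreach ($courses as $course) {
        $xml->startElement('sphinx:document');
        $xml->writeAttribute('id', (string) $course['sphinx_id']);
        $xml->writeElement('title', $course['title']); // text is XML-escaped
        $xml->writeElement('text', $course['text']);
        $xml->endElement(); // sphinx:document
    }

    $xml->endElement(); // sphinx:docset
    $xml->endDocument();
    return $xml->outputMemory();
}

// In a real script the iterable would be a MongoDB cursor limited to
// the fields to be indexed; here it is a literal array.
$courses = [
    ['sphinx_id' => 1000000001, 'title' => 'Linear Algebra', 'text' => 'Vectors & matrices'],
];
$out = courses_to_xmlpipe2($courses);
echo $out;
```

This version builds the whole document in memory for simplicity. For a large collection, flushing the writer to standard output every few documents would keep memory use flat.
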
    • 6. Pitfall 1: Document ID
      “ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS”
      Generate a unique 10-digit numeric ID for each course.
      Must be deterministic
      Unique index on field.
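
One way to get a deterministic 10-digit ID, as slide 6 asks, is to hash stable course metadata and fold the hash into a fixed numeric range. The sketch below is an assumption about how this might be done, not the algorithm OCW Search uses: the function name `course_sphinx_id()` and the choice of collection name plus course URL as inputs are hypothetical. It also assumes 64-bit PHP.

```php
<?php
// Hypothetical helper: derives a 10-digit, non-zero ID from course metadata.
// The inputs (collection name and course URL) are illustrative assumptions.
function course_sphinx_id(string $collection, string $courseUrl): int
{
    // Normalise the inputs so the same course always yields the same key.
    $key = strtolower(trim($collection)) . '|' . strtolower(trim($courseUrl));

    // First 8 hex digits of the MD5 digest: an integer in 0 .. 4294967295.
    $hash = hexdec(substr(md5($key), 0, 8));

    // Fold into 1000000000 .. 3999999999: always 10 digits, never zero,
    // and small enough for an unsigned 32-bit document ID.
    return 1000000000 + ($hash % 3000000000);
}

$id = course_sphinx_id('MIT OCW', 'http://example.edu/courses/18-06');
echo $id, "\n";
```

Hashing into a fixed range can collide, which is where the unique index on the ID field earns its keep: a duplicate is rejected at insert time instead of silently overwriting a document in the Sphinx index.
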
    • 7. Pitfall 2: UTF-8
      “Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string”
      Encoding: it’s a lie.
      mb_detect_encoding() unreliable.
      2-part solution
        1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8');
        2. $Text = FixEncoding($Text);
    • 8. FixEncoding();
      A set of real encoding detection functions: http://lachy.id.au/dev/2005/11/encoding-functions-source
      FixEncoding() is a wrapper for these functions
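
The speaker notes describe splitting the text into an array and handling each piece according to its encoding. The sketch below illustrates that idea only. It is not FixEncoding() and does not use the lachy.id.au functions: it simply validates each line as UTF-8 with mbstring and converts the invalid ones from an assumed fallback encoding. The function name `to_utf8_by_chunk()` and the ISO-8859-1 fallback are assumptions.

```php
<?php
// Hypothetical stand-in for the chunked approach: validate each line as
// UTF-8 and convert only the lines that fail, assuming a fallback encoding.
function to_utf8_by_chunk(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Work line by line so no single operation sees a very long string.
    $chunks = explode("\n", $text);
    foreach ($chunks as $i => $chunk) {
        if (!mb_check_encoding($chunk, 'UTF-8')) {
            // Not valid UTF-8: treat the bytes as the fallback encoding.
            $chunks[$i] = mb_convert_encoding($chunk, 'UTF-8', $fallback);
        }
    }
    return implode("\n", $chunks);
}

// First line is already UTF-8, second line is Latin-1 ("caf" + 0xE9).
$mixed = "caf\xC3\xA9\ncaf\xE9";
$fixed = to_utf8_by_chunk($mixed);
var_dump(mb_check_encoding($fixed, 'UTF-8'));
```

A single fallback encoding is a strong assumption. Input that mixes several legacy encodings would need real detection of the kind the linked functions aim to provide.
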
    • 9. UTF-8 in Sphinx
      In sphinx.conf:
        charset_type = utf-8
        ngram_chars
        charset_table
      sphinxsearch.com/wiki/doku.php?id=charset_tables
    • 10. mongos3
      MongoDB document = S3 object
      Backup tool for MongoDB
      $Contents = gzencode(json_encode($Course), 9);
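
The gzencode(json_encode(...)) line on slide 10 can be paired with its inverse to check that a backed-up document restores cleanly. This is a minimal sketch under assumptions: the helper names `pack_course()` and `unpack_course()` and the sample fields are hypothetical, and the S3 upload itself (done with CloudFusion in the talk) is left out.

```php
<?php
// Hypothetical helpers around the slide's one-liner. The S3 upload is omitted.
function pack_course(array $course): string
{
    // JSON-encode, then gzip at maximum compression (level 9), as on the slide.
    return gzencode(json_encode($course), 9);
}

function unpack_course(string $contents): array
{
    // Reverse the two steps: gunzip, then decode JSON into an associative array.
    return json_decode(gzdecode($contents), true);
}

$course = [
    '_id'       => 'example-course',   // illustrative values only
    'sphinx_id' => 1000000001,
    'title'     => 'Linear Algebra',
];

$contents = pack_course($course);      // would become the S3 object body
$restored = unpack_course($contents);
var_dump($restored === $course);
```

Plain JSON is only a faithful round trip for scalar and array values. MongoDB-specific types such as ObjectIds and dates would need an explicit conversion before encoding and after decoding.
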
    • 11. Thanks!
      Any questions?
      Twitter: @ocwsearch
      Web: www.ocwsearch.com
      Email: pierre@ocwsearch.com
