Dedicated search for a private phpBB forum using Sphinx

This is the presentation I gave at barcampNortheast3. It describes crawling a password-protected forum, extracting the content from the HTML, and then making that content searchable. The slide deck is relatively thin, but I intend to add additional notes at http://jonathanstreet.com/blog/bcne3-search-phpbb-with-sphinx

Published in: Technology

  1. A mini-Google for a private phpBB forum
  2. Background
     • Recently joined a non-technical forum
     • Set up by a friend of the founder – hasn’t been seen for 1-2 years
     • Running on shared hosting
     • Search limited to the past year’s content to keep things fast
  3. 3 steps
     • Mirror site locally
     • Insert content in a database
     • Release Sphinx
  4. The Problems
     • It’s private
  5. The Problems
     • It’s private
     • It’s not well formed
  6. The Problems
     • It’s private
     • It’s not well formed
     • Page info contained in query string: viewtopic.php?f=30&t=10170
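
To make the query-string point concrete, here is a minimal Python sketch (not from the talk) that pulls the identifiers out of the example URL. It assumes the usual phpBB meaning of the parameters, `f` for the forum id and `t` for the topic id; the function name `topic_ids` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def topic_ids(url):
    """Return (forum_id, topic_id) parsed from a viewtopic.php URL."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    return int(query["f"][0]), int(query["t"][0])


print(topic_ids("viewtopic.php?f=30&t=10170"))
```

Because nothing in the path distinguishes one page from another, a mirroring tool has to treat the query string as part of the page's identity.
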
  7. Wget
     • Old faithful – first tool I reach for when I want to download anything from a website
     • Simple for simple tasks but with the flexibility to handle more complex tasks
  8. Wget – Logging in
     • Ability to import a Netscape-style cookies file
       • Didn’t work for me
         • Wrong format?
         • Wrongly configured wget?
         • phpBB checking user agent / IP address?
     • Log in directly in wget
       • Multi-step process
  9. Wget – are we there yet?
     • There is massive redundancy in the link structure
     • Every post has an individual link which pulls in the entire topic
     • Can’t exclude based on query string
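
A custom crawler can sidestep this redundancy by reducing each link to a key before fetching it. The sketch below is one possible approach, not the code used for the talk. It assumes phpBB-style parameters (`t` topic id, `start` pagination offset, `p` post id, `sid` session id) and simply skips per-post links, on the assumption that every post is also reachable through the paginated topic pages. The names `page_key` and `unique_pages` are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def page_key(url):
    """Reduce a viewtopic.php URL to a (topic, start) key, or None to skip it."""
    query = parse_qs(urlsplit(url).query)
    # Assumption: links carrying "p" are per-post views of a topic that is
    # also reachable through its plain topic/start links, so they are skipped.
    if "t" not in query or "p" in query:
        return None
    # "sid" and the #fragment are ignored: they do not identify a page.
    return int(query["t"][0]), int(query.get("start", ["0"])[0])


def unique_pages(urls):
    """Yield the first URL seen for each distinct (topic, start) key."""
    seen = set()
    for url in urls:
        key = page_key(url)
        if key is not None and key not in seen:
            seen.add(key)
            yield url


links = [
    "viewtopic.php?f=30&t=10170",
    "viewtopic.php?f=30&t=10170&p=1#p1",
    "viewtopic.php?f=30&t=10170&sid=abc",
    "viewtopic.php?f=30&t=10170&start=15",
]
print(list(unique_pages(links)))
```
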
  10. Zend_HTTP
     • Most of my time is spent with PHP and, more recently, Zend Framework
     • There is a lot to be said for using a tool you are familiar with
  11. Zend_HTTP
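
The transcript of this slide carries no text, so the PHP used in the talk is not reproduced here. As a rough stand-in, the following Python sketch shows the general shape of a cookie-aware login with the standard library. The login path `ucp.php?mode=login`, the form field names, the user agent string, and the host `forum.example` are all assumptions about a stock phpBB 3 install, not details taken from the presentation; some installs also require hidden form tokens that this sketch does not handle.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Build an opener that stores and resends cookies, as a browser would."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "forum-mirror/0.1")]  # hypothetical UA
    return opener, jar


def login_request(base_url, username, password):
    """Prepare the POST that submits the login form (field names are assumed)."""
    form = {"username": username, "password": password, "login": "Login"}
    data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(base_url + "ucp.php?mode=login", data=data)


# Usage (performs network requests, so it is left commented out):
# opener, jar = make_session()
# opener.open(login_request("http://forum.example/", "user", "secret"))
# html = opener.open("http://forum.example/viewtopic.php?f=30&t=10170").read()
```

Keeping the session cookie in one opener means every later page request is made as the logged-in user, which is the part the imported cookies file failed to achieve with wget.
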
  12. Scraping HTML
     • First needed to correct errors
       • Tidy extension
     • SimpleXML
       • Need to change xmlns to ns
       • Still doesn’t work in all cases
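
The "change xmlns to ns" step can be illustrated in Python, where the same issue appears: a default namespace on the root element puts every tag in that namespace, so plain tag names stop matching. This is a sketch of the idea rather than the SimpleXML code from the talk. It assumes the markup has already been repaired into well-formed XHTML (the Tidy step), and the `postbody` class name is a hypothetical example.

```python
import re
import xml.etree.ElementTree as ET


def parse_without_default_ns(xhtml):
    """Parse tidied XHTML after neutralising its default namespace declaration."""
    # Renaming the attribute (xmlns -> ns) keeps the markup well formed but
    # stops the parser from putting every element in the XHTML namespace.
    return ET.fromstring(re.sub(r'\sxmlns="', ' ns="', xhtml, count=1))


page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="postbody">Hello forum</div>'
    "</body></html>"
)
root = parse_without_default_ns(page)
posts = [div.text for div in root.iter("div") if div.get("class") == "postbody"]
print(posts)
```

Parsing the same string without the rename yields elements tagged `{http://www.w3.org/1999/xhtml}div`, which a lookup for plain `div` does not find.
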
  13. Releasing Sphinx
     • Ridiculously simple
     • Simple config file adapted from example
     • Runs for ~30s for ~90k posts
     • Added a simple database query to beef up web interface from the example
     • Only downside – memory footprint
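
The extra database query is presumably needed because a Sphinx search returns matching document ids rather than the post text itself. A minimal sketch of that lookup follows, using Python's built-in `sqlite3` as a stand-in for whatever database the talk used. The `posts` table, its columns, and the function name `fetch_posts` are hypothetical.

```python
import sqlite3


def fetch_posts(conn, post_ids):
    """Return (id, subject, body) rows for post_ids, in the order given."""
    if not post_ids:
        return []
    placeholders = ",".join("?" * len(post_ids))
    rows = conn.execute(
        f"SELECT id, subject, body FROM posts WHERE id IN ({placeholders})",
        list(post_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not preserve order, so restore the search engine's ranking.
    return [by_id[i] for i in post_ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, subject TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "Hello", "first post"), (2, "Re: Hello", "a reply"), (3, "Other", "text")],
)
print(fetch_posts(conn, [3, 1]))
```
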
  14. Future tasks
     • Keep index updated
       • Implemented but could be more efficient
     • Exercise to learn Python
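
The slides do not say how the index is kept updated. One common approach, sketched here purely as an assumption, is to remember the highest post id already indexed and fetch only newer rows on each run. It relies on post ids increasing over time, and the `posts` table is again a hypothetical stand-in backed by `sqlite3`.

```python
import sqlite3


def new_posts(conn, last_indexed_id):
    """Return rows added since the previous indexing run, oldest first."""
    return conn.execute(
        "SELECT id, body FROM posts WHERE id > ? ORDER BY id", (last_indexed_id,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

batch = new_posts(conn, last_indexed_id=1)
print(batch)
# Remember the new high-water mark for the next run.
last_indexed_id = batch[-1][0] if batch else 1
```

Only the new rows then need to be crawled and indexed, instead of rebuilding everything from scratch.
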
