Dedicated search for a private phpBB forum using sphinx


This is the presentation I gave at barcampNortheast3. It describes crawling a password-protected forum, extracting the content from the HTML, and then making that content searchable. The slide deck is relatively thin but I intend to add additional notes at

Published in: Technology

  1. A mini-Google for a private phpBB forum
  2. Background
     • Recently joined a non-technical forum
     • Set up by a friend of the founder – hasn’t been seen for 1-2 years
     • Running on shared hosting
     • Search limited to the past year’s content to keep things fast
  3. 3 steps
     • Mirror site locally
     • Insert content in a database
     • Release Sphinx
  4. The Problems
     • It’s private
  5. The Problems
     • It’s private
     • It’s not well formed
  6. The Problems
     • It’s private
     • It’s not well formed
     • Page info contained in query string
     • viewtopic.php?f=30&t=10170
  7. Wget
     • Old faithful – first tool I reach for when I want to download anything from a website
     • Simple for simple tasks but with the flexibility to handle more complex tasks
  8. Wget – Logging in
     • Ability to import a Netscape-style cookies file
     • Didn’t work for me
     • Wrong format?
     • Wrongly configured wget?
     • phpBB checking user agent / IP address?
     • Log in directly in wget
     • Multi-step process
  9. Wget – are we there yet?
     • There is massive redundancy in the link structure
     • Every post has an individual link which pulls in the entire topic
     • Can’t exclude based on query string
  10. Zend_HTTP
     • Most of my time is spent with PHP and, more recently, Zend Framework
     • There is a lot to be said for using a tool you are familiar with
  11. Zend_HTTP
  12. Scraping HTML
     • First needed to correct errors
     • Tidy extension
     • SimpleXML
     • Need to change xmlns to ns
     • Still doesn’t work in all cases
  13. Releasing Sphinx
     • Ridiculously simple
     • Simple config file adapted from example
     • Runs for ~30s for ~90k posts
     • Added a simple database query to beef up web interface from the example
     • Only downside – memory footprint
  14. Future tasks
     • Keep index updated
     • Implemented but could be more efficient
     • Exercise to learn Python
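
A note on slides 6 and 9: because the page identity lives in the query string, the redundant per-post links can be collapsed by normalising each URL before it is queued for download. The following is a minimal Python sketch of that idea, not the code used for the talk. The choice of `f`, `t` and `start` as the parameters worth keeping, the helper names, and the `forum.example` host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# Assumption: these parameters alone decide which page of which topic is shown.
KEEP = ("f", "t", "start")


def canonical_topic_url(url):
    """Reduce a viewtopic.php link to a canonical form, or return None."""
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        return None
    query = parse_qs(parts.query)
    if "t" not in query:
        # A link that only names a post carries no topic id, so skip it here.
        return None
    # Keep the chosen parameters in a fixed order; drop everything else,
    # including session ids and the #fragment.
    kept = [(key, query[key][0]) for key in KEEP if key in query]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def unique_topic_urls(urls):
    """Return canonical topic URLs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for url in urls:
        canonical = canonical_topic_url(url)
        if canonical is not None and canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```

Feeding every link found on a page through `unique_topic_urls` means a topic is fetched once per page of posts rather than once per post.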
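
A note on slides 8 and 10: logging in directly comes down to posting the login form and keeping the session cookie for later requests. The sketch below shows the shape of that in Python's standard library rather than wget or Zend. The login path `ucp.php?mode=login` and the field names `username`, `password` and `login` are assumptions based on a typical phpBB 3 login form; the real forum may need extra hidden fields, which would fit the multi-step process mentioned on the slide.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode


def make_opener():
    """Build an opener that stores cookies in memory and resends them."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(base_url, username, password):
    """Prepare (but do not send) a POST of the login form.

    Assumption: the path and field names match the target forum's form.
    """
    fields = {"username": username, "password": password, "login": "Login"}
    data = urlencode(fields).encode("ascii")
    url = base_url.rstrip("/") + "/ucp.php?mode=login"
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=data)
```

In use, the same opener would send the login request and then fetch the topic pages, for example `opener = make_opener()` followed by `opener.open(build_login_request(...))`, so the session cookie set at login is reused.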
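
A note on slide 12: the talk repaired the markup with Tidy and then queried it with SimpleXML. A different route, sketched here in Python, is to use a parser that tolerates malformed HTML and pull the text out directly. This swaps in `html.parser` from the standard library and is not the method from the slides. The class name `postbody` is an assumption about the forum's template and would need checking against the real pages.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text inside every <div> carrying a given CSS class."""

    def __init__(self, css_class="postbody"):  # assumed class name
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching <div>
        self.posts = []
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested <div> inside a post
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the text pieces and collapse runs of whitespace.
                self.posts.append(" ".join(" ".join(self._chunks).split()))

    def handle_data(self, data):
        if self.depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each post found in one page of HTML."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Unclosed inline tags do not stop the parser, which sidesteps the repair step, but an unclosed `<div>` would still confuse the depth counter, so this does not work in all cases either.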
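
A note on slide 3: the second step, inserting content in a database, might look roughly like the Python sketch below. The slides do not say which database or schema was used, so the table layout and the use of SQLite are assumptions chosen to keep the example self-contained. Sphinx's built-in SQL sources target servers such as MySQL and PostgreSQL, so a real setup would point the same statements at one of those.

```python
import sqlite3

# Hypothetical schema: one row per post, keyed by the forum's own post id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_id  INTEGER PRIMARY KEY,
    topic_id INTEGER NOT NULL,
    forum_id INTEGER NOT NULL,
    body     TEXT NOT NULL
)
"""


def store_posts(conn, posts):
    """Insert (post_id, topic_id, forum_id, body) rows; return the row count.

    Re-inserting an existing post_id overwrites the row, so a repeated
    crawl updates posts instead of duplicating them.
    """
    conn.execute(SCHEMA)
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO posts (post_id, topic_id, forum_id, body) "
            "VALUES (?, ?, ?, ?)",
            posts,
        )
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Keying on the post id also helps with the future task of keeping the index updated, since a later crawl can be replayed without creating duplicate rows.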