This is the presentation I gave at barcampNortheast3. It describes crawling a password-protected forum, extracting the content from the HTML, and then making that content searchable. The slide deck is relatively thin, but I intend to add additional notes at http://jonathanstreet.com/blog/bcne3-search-phpbb-with-sphinx
2. Background: Recently joined a non-technical forum. Set up by a friend of the founder; hasn’t been seen for 1-2 years. Running on shared hosting. Search is limited to the past year’s content to keep things fast.
3. 3 steps: Mirror the site locally. Insert the content into a database. Release Sphinx.
6. The Problems: It’s private. It’s not well formed. Page info is contained in the query string, e.g. viewtopic.php?f=30&t=10170
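Because the page identity lives in the query string rather than the path, a mirror has to read those parameters to know which topic a page is. A minimal sketch of that idea, using Python's standard library; the file naming scheme is a hypothetical illustration, not what the talk used:

```python
from urllib.parse import urlsplit, parse_qs


def topic_ids(url):
    """Return (forum_id, topic_id) parsed from a viewtopic.php URL."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    return int(query["f"][0]), int(query["t"][0])


def local_filename(url):
    """Build a flat file name for the mirrored page (hypothetical naming scheme)."""
    forum_id, topic_id = topic_ids(url)
    return f"forum{forum_id}_topic{topic_id}.html"


print(local_filename("viewtopic.php?f=30&t=10170"))
```

Storing pages under a name derived from f and t keeps one file per topic page, whatever other parameters the link carried.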
7. Wget: Old faithful, the first tool I reach for when I want to download anything from a website. Simple for simple tasks, but with the flexibility to handle more complex ones.
8. Wget – Logging in: Ability to import a Netscape-style cookies file. Didn’t work for me. Wrong format? Wrongly configured wget? phpBB checking user agent / IP address? Alternative: log in directly in wget, which is a multi-step process.
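One way to rule out the "wrong format?" question is to run the cookies file through an independent parser. The sketch below uses Python's `http.cookiejar.MozillaCookieJar`, which reads the Netscape cookies.txt layout. The cookie name, value and domain are made up for illustration, and passing this check does not guarantee wget will accept the file, only that the basic layout is sound:

```python
import http.cookiejar

# Hypothetical cookie file contents. Fields are tab-separated:
# domain, include-subdomains flag, path, secure flag, expiry (Unix time), name, value.
SAMPLE_COOKIES = (
    "# Netscape HTTP Cookie File\n"
    ".forum.example.com\tTRUE\t/\tFALSE\t2147483647\tphpbb3_example_sid\tabc123\n"
)


def load_cookie_names(path):
    """Parse a Netscape-style cookies file and return the cookie names found."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Raises http.cookiejar.LoadError if the header line or a row is malformed.
    jar.load(ignore_discard=True, ignore_expires=True)
    return sorted(cookie.name for cookie in jar)
```

Writing `SAMPLE_COOKIES` to a file and calling `load_cookie_names` on it should list the single session cookie. A file exported from a browser that fails here is a likely candidate for the format problem.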
9. Wget – are we there yet? There is massive redundancy in the link structure. Every post has an individual link which pulls in the entire topic. Can’t exclude URLs based on the query string.
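With a crawler you write yourself, the redundancy can be removed by filtering on the query string before fetching. The sketch below assumes phpBB-style parameters: `t` for the topic, `start` for pagination, `p` for per-post links and `sid` for the session. Only `f` and `t` appear in the slides; the rest are assumptions about the forum's URLs.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Parameters that identify a distinct page; everything else (e.g. sid) is dropped.
KEEP = ("f", "t", "start")


def canonical_topic_url(url):
    """Return a normalised topic URL, or None if the link should be skipped."""
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        return None
    query = parse_qs(parts.query)
    # Per-post links (p=...) render the whole topic again, so skip them.
    if "p" in query or "t" not in query:
        return None
    kept = [(key, query[key][0]) for key in KEEP if key in query]
    return "viewtopic.php?" + urlencode(kept)


def unique_topic_urls(urls):
    """Deduplicate links by canonical form, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_topic_url(url)
        if key is not None and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Skipping per-post links only loses nothing if every topic is also reachable through a plain `t=` link, which is worth checking against the forum's index pages.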
10. Zend_HTTP: Most of my time is spent with PHP and, more recently, Zend Framework. There is a lot to be said for using a tool you are familiar with.
12. Scraping HTML: First needed to correct errors (Tidy extension). SimpleXML: need to change xmlns to ns. Still doesn’t work in all cases.
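The xmlns workaround makes more sense with an example. When a document declares a default namespace, every element belongs to it, so lookups by plain tag name find nothing. Renaming the attribute turns it into an ordinary attribute and the plain names match again. The sketch below shows the same effect with Python's `xml.etree.ElementTree` rather than SimpleXML; the `postbody` class name and the sample markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already tidied page fragment with a default XHTML namespace.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="postbody">Hello forum</div>'
    '</body></html>'
)


def find_post_bodies(markup):
    """Return the text of every <div class="postbody">, matching plain tag names."""
    root = ET.fromstring(markup)
    return [div.text for div in root.iter("div") if div.get("class") == "postbody"]


# With the namespace declared, the tag is really "{http://www.w3.org/1999/xhtml}div".
with_namespace = find_post_bodies(XHTML)
# Renaming xmlns to ns leaves a harmless attribute and un-namespaced tags.
without_namespace = find_post_bodies(XHTML.replace("xmlns=", "ns=", 1))

print(with_namespace, without_namespace)
```

The string replacement is crude: it only handles the declaration on the root element, which may be one reason the approach does not work in all cases.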
13. Releasing Sphinx: Ridiculously simple. Simple config file adapted from the example. Runs for ~30s for ~90k posts. Added a simple database query to beef up the web interface from the example. Only downside: memory footprint.
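The extra database query is needed because Sphinx results identify matching documents by ID rather than returning the original text, so the web interface has to look the posts up afterwards. The slides do not show the actual query. The sketch below is an assumed version of that lookup using SQLite, with a hypothetical `posts` table and a hard-coded list standing in for the IDs a search would return.

```python
import sqlite3


def fetch_posts(conn, matched_ids):
    """Return (id, topic_title, body) rows in the order given by matched_ids."""
    if not matched_ids:
        return []
    placeholders = ",".join("?" for _ in matched_ids)
    rows = conn.execute(
        f"SELECT id, topic_title, body FROM posts WHERE id IN ({placeholders})",
        matched_ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in arbitrary order; restore the search ranking.
    return [by_id[post_id] for post_id in matched_ids if post_id in by_id]


# Hypothetical schema and data, standing in for the scraped forum content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, topic_title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "Welcome", "First post"), (2, "Welcome", "A reply"), (3, "Meetup", "See you there")],
)

# Pretend the search engine ranked post 3 above post 1.
results = fetch_posts(conn, [3, 1])
```

Reordering in application code keeps the relevance ranking from the search engine, which an `IN (...)` clause alone would not preserve.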
14. Future tasks: Keep index updated (implemented, but could be more efficient). Exercise to learn Python.