Dedicated search for a private phpBB forum using sphinx

This is the presentation I gave at barcampNortheast3. It describes crawling a password-protected forum, extracting the content from the HTML and then making that content searchable. The slide deck is relatively thin but I intend to add additional notes at http://jonathanstreet.com/blog/bcne3-search-phpbb-with-sphinx

Presentation Transcript

  • A mini-google for a private phpBB forum
  • Background
    Recently joined a non-technical forum
    Set up by a friend of the founder – hasn’t been seen for 1-2 years
    Running on shared hosting
    Search limited to the past year's content to keep things fast
  • 3 steps
    Mirror site locally
    Insert content in a database
    Release Sphinx
  • The Problems
    It’s private
    It’s not well formed
    Page info contained in query string
    viewtopic.php?f=30&t=10170
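Because the page identity lives in the query string rather than the path, a crawler has to parse it to know what it is looking at. A minimal Python sketch (the talk itself used wget and PHP; `page_params` is a hypothetical helper name). In phpBB URLs `f` is the forum id and `t` the topic id.

```python
from urllib.parse import urlsplit, parse_qs


def page_params(url):
    """Return the query-string parameters of a forum URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; keep only the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = page_params("viewtopic.php?f=30&t=10170")
print(params)  # {'f': '30', 't': '10170'}
```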
  • Wget
    Old faithful – first tool I reach for when I want to download anything from a website
    Simple for simple tasks but with the flexibility to handle more complex tasks
  • Wget – Logging in
    Ability to import a Netscape style cookies file
    Didn’t work for me
    Wrong format?
    Wrongly configured wget?
    phpBB checking user agent / ip address?
    Log in directly in wget
    Multi-step process
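For comparison, the same "log in first, then crawl with the session cookie" idea can be sketched in Python with only the standard library. This is not the wget procedure from the talk. The login path `ucp.php?mode=login` and the form fields `username`, `password` and `login` are assumptions based on a stock phpBB 3 install, and the forum URL and user agent string are placeholders; a real forum may need extra hidden fields.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Create an opener that stores cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(base_url, username, password):
    """Build (but do not send) the POST request for the login form.

    Path and field names are assumptions for a default phpBB 3 forum.
    """
    form = urllib.parse.urlencode(
        {"username": username, "password": password, "login": "Login"}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ucp.php?mode=login",
        data=form,  # supplying data makes this a POST
        headers={"User-Agent": "forum-mirror-sketch/0.1"},
    )


# Usage (placeholder URL, not sent here):
#   opener, jar = make_session()
#   opener.open(build_login_request("http://forum.example.com", "me", "secret"))
#   page = opener.open("http://forum.example.com/viewtopic.php?f=30&t=10170").read()
```

Every later request made through the same opener carries the session cookie, which is the step the imported cookies file was meant to provide.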
  • Wget – are we there yet?
    There is massive redundancy in the link structure
    Every post has an individual link which pulls in the entire topic
    Can’t exclude based on query string
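Once the crawl is under your own control, the redundancy can be removed by reducing each link to the topic page it actually loads. A sketch under stated assumptions: per-post links carry the topic id `t` plus a post id `p`, and pagination uses a `start` parameter, as in stock phpBB. Links without a `t` parameter are simply skipped here, on the assumption that the topic pages reach the same posts. Both function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def topic_page_key(url):
    """Map a viewtopic.php URL to a (topic id, start offset) key, else None."""
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        return None
    query = parse_qs(parts.query)
    if "t" not in query:
        return None
    # Post ids, fragments and other parameters are ignored on purpose:
    # they all load the same topic page.
    return (query["t"][0], query.get("start", ["0"])[0])


def unique_pages(urls):
    """Keep the first URL seen for each distinct topic page."""
    seen = set()
    kept = []
    for url in urls:
        key = topic_page_key(url)
        if key is None or key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```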
  • Zend_HTTP
    Most of my time spent with PHP and more recently Zend framework
    There is a lot to be said for using a tool you are familiar with
  • Scraping HTML
    First needed to correct errors
    Tidy extension
    SimpleXML
    Need to change xmlns to ns
    Still doesn’t work in all cases
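The xmlns workaround exists because a default namespace on the root element puts every element into that namespace, so plain path queries such as `//div` stop matching. Renaming the declaration turns it into an ordinary attribute. The talk did this in PHP with Tidy and SimpleXML; the sketch below shows the same trick in Python on already-tidied XHTML. The sample markup and the `content` class name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for a page that has already been repaired into well-formed XHTML.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="post"><div class="content">First post</div></div>
<div class="post"><div class="content">Second <b>post</b></div></div>
</body>
</html>"""


def extract_posts(xhtml):
    """Return the text of every <div class="content"> in the document."""
    # Rename the default namespace declaration so it becomes a plain
    # attribute and un-prefixed paths match again.
    plain = xhtml.replace(' xmlns="', ' ns="', 1)
    root = ET.fromstring(plain)
    return [
        "".join(div.itertext()).strip()
        for div in root.iterfind(".//div[@class='content']")
    ]


print(extract_posts(SAMPLE))  # ['First post', 'Second post']
```

The class match is exact, so an element with several classes would be missed; that is one way such a scraper can fail on some pages.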
  • Releasing Sphinx
    Ridiculously simple
    Simple config file adapted from example
    Runs for ~30s for ~90k posts
    Added a simple database query to beef up web interface from the example
    Only downside – memory footprint
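Sphinx returns the ids of matching documents rather than their text, so a results page needs one more lookup, which is presumably what the database query above is for. A sketch of that lookup in Python, with `sqlite3` standing in for whatever database is actually used; the `posts` table and its columns are hypothetical.

```python
import sqlite3


def fetch_posts(conn, post_ids):
    """Fetch rows for the given ids, preserving the search ranking order."""
    if not post_ids:
        return []
    placeholders = ",".join("?" * len(post_ids))
    rows = conn.execute(
        f"SELECT id, title, url FROM posts WHERE id IN ({placeholders})",
        list(post_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not guarantee order, so re-sort by the ranked id list.
    return [by_id[i] for i in post_ids if i in by_id]


# Usage with an in-memory database and made-up rows:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "Hello", "viewtopic.php?t=1"), (2, "Sphinx", "viewtopic.php?t=2")],
)
print(fetch_posts(conn, [2, 1]))
```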
  • Future tasks
    Keep index updated
    Implemented but could be more efficient
    Exercise to learn Python