Mining Wikipedia For Awesome Data

2 comments

  • guest16f34, 2 years ago
    Wow, this is very impressive!!! I uploaded it onto my webserver and it works wonderfully!

    Just one question: how do I get the example.php page to not show the header and footer? For example, have it not display

    Array ( [title] => Fall Out Boy [url] => http://en.wikipedia.org/wiki/Fall Out Boy [html] =>

    on top, and a ) at the bottom, so that it only shows the content it pulls from the wiki.
  • Enrico Foschi (enrico.foschi), 2 years ago
    The presentation is amazing, really well done.

    What I did to get the HTML content was to load the wiki page directly and then scrape the body content with XPath. That worked really well, and I could get the HTML content in just one call.

    Issue: if the body id changes, the scraping logic will have to be modified.

    Thanks for sharing,

    Enrico Foschi

Mining Wikipedia For Awesome Data - Presentation Transcript

  1. Mining Wikipedia for Awesome Data Neil Crosby
  2. What’s this about then? • There’s loads of groovy content on Wikipedia[citation needed]. • You are lazy. • You want groovy content on your site.
  3. Wikipedia has an API • Who knew? • http://en.wikipedia.org/w/api.php
  4. API has lots of options
       Param      Values             What does it do?
       format     php, json, TODO    Output format.
       redirects  0, 1               Redirect to good pages.
       rvsection  0, 1, 2, 3, etc    Page section to get data for.
       action     query, parse       API method.
  5. Getting WikiText? Easy • http://en.wikipedia.org/w/api.php?format=php&action=query&titles=one+flew+over+the+cuckoo’s+nest&rvprop=content&prop=revisions&redirects=1
  6. Searching? Harder • Wikipedia doesn’t have a good search engine.
  7. Use Yahoo! BOSS • http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+one+flew+over+the+cuckoo’s+nest?appid=yourBOSSiD • First result: http://en.wikipedia.org/wiki/One_Flew_Over_the_Cuckoo's_Nest_(film)
  8. Then get WikiText • http://en.wikipedia.org/w/api.php?format=php&action=query&titles=One_Flew_Over_the_Cuckoo's_Nest_(film)&rvprop=content&prop=revisions&redirects=1
  9. The WikiText '''''One Flew Over the Cuckoo's Nest''''' is a [[1975 in film|1975]] [[comedy-drama]] film [[film director|directed]] by [[Miloš Forman]]. The film is an adaptation of the 1962 novel ''[[One Flew Over the Cuckoo's Nest (novel)|One Flew Over the Cuckoo's Nest]]'' by [[Ken Kesey]]. The movie was the first to [[List of Big Five Academy Award winners and nominees|win all five]]...
  10. But I wanted HTML! • WikiText is no good for dumping into a website.
  11. Another API call • http://en.wikipedia.org/w/api.php?action=parse&format=php&text=returned+wiki+text • Text will be big - do as a POST.
  12. Wiki HTML! <p><i><b>One Flew Over the Cuckoo's Nest</b></i> is a <a href="/wiki/1975_in_film" title="1975 in film">1975</a> <a href="/wiki/Comedy-drama" title="Comedy-drama">comedy-drama</a> film <a href="/wiki/Film_director" title="Film director">directed</a> by <a href="/wiki/Milo%C5%A1_Forman" title="Miloš Forman">Miloš Forman</a>. The film is an...
  13. Reducing the HTML • DOMDocument->loadHTML() • DOMXPath->query() to get wanted nodes. • DOMDocument->saveHTML() • str_replace() away HTML boilerplate.
  14. The Cuckoo Problem • “One Flew Over the Cuckoo’s Nest” • A book? • A film? • Depends on context.
  15. The Cuckoo Solution • Give context: • “one flew over the cuckoo’s nest book” • “one flew over the cuckoo’s nest movie” • Yahoo! BOSS gives relevant result. Yay.
  16. There’s still a problem... • Sometimes you can give too much context. • “wii fit” gets expected result. • “wii fit electronics” returns “WiiMote”. • Oh dear.
  17. When is too much? • Who knows? • Just because an article exists for the basic term doesn’t mean it’s the right article. • I’ve not solved this yet.
  18. It’s all too complicated • So don’t do it all. • Use a library. • http://thecodetrain.co.uk/code/wikislurp
  19. Runs as a web service • http://yoursite.com/wikislurp/?params=blah
  20. What are the params?
        Param    Meaning
        secret   Your self-chosen appid.
        query    What you’d like wiki info about.
        context  A little bit of context.
        section  Article section to retrieve. Zero indexed.
        xpath    Specify the elements to return.
        output   Serialised php or json.
  21. What does it return? • An array. • Keys for “url”, “title” and “article”.
  22. Why a webservice? • You can’t abandon a function call in PHP. • You can abandon a CURL call. • If wikislurp takes too long, move on.
  23. Kitten Break There’s some code coming up, soz. http://www.flickr.com/photos/gsx-r750/1475603952/
  24. How to call WikiSlurp • http://yoursite.com/wikislurp/?secret=YOUR+SECRET&query=one+flew+over+the+cuckoo’s+nest&context=book&xpath=/html/body/p[position()<=3]&section=0&output=json
  25. And from PHP?
        $s = curl_init();
        curl_setopt($s, CURLOPT_URL, $url);
        curl_setopt($s, CURLOPT_HEADER, false);
        curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);
        // wait 1 second, then abort
        curl_setopt($s, CURLOPT_TIMEOUT, 1);
        $result = curl_exec($s);
        curl_close($s);
  26. XPath?
        Query                         Gives You
        //p                           All <p>
        /html/body/p                  All <p> directly under <body>
        /html/body/p[2]               2nd <p> directly...
        /html/body/p[position()<=3]   First three <p> directly...
  27. Oh noes, more XPath
        Query                                                       Gives You
        /html/body/p[@class='fish']                                 All <p> with single class “fish”
        /html/body/p[contains(concat(" ",@class," "), " fish ")]    All <p> with any class including “fish”
  28. Phew. Have another kitten. http://www.flickr.com/photos/evapro/305689596/
  29. Future Features • Do something intelligent with context. • Convert to HTML without an extra API call. • Return proper error codes if things go wrong.
  30. Where is this used? • TheTenWordReview.com • IsNeilAnnoyedBy.com
  31. Questions? • I will blog about this talk at The Code Train. • No, really - I will. • Download the slurpy source code from http://thecodetrain.co.uk/code/wikislurp • Slides? http://icanhaz.com/wikislurpslides • I was and am http://NeilCrosby.com/vcard
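
The two Wikipedia API calls on slides 5, 8 and 11 can be sketched in PHP roughly as follows. This is only an illustration of how the query strings might be assembled, not the code WikiSlurp itself uses; the function names `wikitextQueryUrl` and `parseRequestBody` are made up for this example. The parameters are the ones shown on the slides, and `http_build_query()` takes care of the URL encoding.

```php
<?php
// Hypothetical helper (not part of WikiSlurp): builds the "get WikiText"
// URL from slides 5 and 8 for a given article title.
function wikitextQueryUrl(string $title): string
{
    $params = [
        'format'    => 'php',       // serialised PHP output
        'action'    => 'query',
        'titles'    => $title,
        'prop'      => 'revisions',
        'rvprop'    => 'content',
        'redirects' => 1,           // follow redirects to the real page
    ];
    return 'http://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Hypothetical helper: builds the body for the action=parse call from
// slide 11. The WikiText can be large, so the slide suggests sending it
// as a POST; this string would be the POST body.
function parseRequestBody(string $wikitext): string
{
    return http_build_query([
        'action' => 'parse',
        'format' => 'php',
        'text'   => $wikitext,
    ]);
}

echo wikitextQueryUrl("One Flew Over the Cuckoo's Nest (film)"), "\n";
```

The POST body could then be handed to cURL with `CURLOPT_POSTFIELDS`, using the same handle setup as on slide 25.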

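The "Reducing the HTML" step on slide 13, combined with the XPath queries from slides 26 and 27, might look something like the sketch below. It is an assumption about how the pieces fit together rather than WikiSlurp's actual code: `reduceHtml` is a hypothetical name, and instead of the slide's `str_replace()` cleanup it serialises each matched node on its own with `DOMDocument::saveHTML($node)`, which avoids the html/body boilerplate in the first place.

```php
<?php
// Hypothetical helper: keeps only the nodes matched by an XPath query.
// Input is assumed to be ASCII or to carry its own charset declaration;
// loadHTML() does not assume UTF-8 by default.
function reduceHtml(string $html, string $xpathQuery): string
{
    libxml_use_internal_errors(true);   // keep markup warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);              // wraps fragments in <html><body>
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false) {             // malformed XPath expression
        return '';
    }

    $out = '';
    foreach ($nodes as $node) {
        $out .= $doc->saveHTML($node);  // serialise just this node
    }
    return $out;
}

$html = '<p class="intro fish">One</p><p>Two</p><p>Three</p><p>Four</p>';

// First three paragraphs directly under <body> (slide 26).
echo reduceHtml($html, '/html/body/p[position()<=3]'), "\n";

// Paragraphs whose class list includes "fish" (slide 27).
echo reduceHtml($html, "/html/body/p[contains(concat(' ', @class, ' '), ' fish ')]"), "\n";
```
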
Neil Crosby, 2 years ago

© All Rights Reserved
