• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Freebase: Wikipedia Mining 20080416
 

Freebase: Wikipedia Mining 20080416

on

  • 5,282 views

Slides from the "Wikipedia Mining" talk at the Spring Freebase User Group meeting.

Slides from the "Wikipedia Mining" talk at the Spring Freebase User Group meeting.

Statistics

Views

Total Views
5,282
Views on SlideShare
5,074
Embed Views
208

Actions

Likes
7
Downloads
196
Comments
1

4 Embeds 208

http://blog.freebase.com 202
http://www.slideshare.net 4
file:// 1
http://66.102.9.104 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

11 of 1 previous next

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • Hello All!

    There is a video of the talk that goes with this presentation. Check it out at:

    http://blog.freebase.com/

    Or watch the video below ...
    <br /><object type="application/x-shockwave-flash" data="http://blip.tv/scripts/flash/showplayer.swf?enablejs=true&file=http%3A%2F%2Ffreebase%2Eblip%2Etv%2Frss%2Fflash%2F%3Freferrer%3Dfreebase%2Eblip%2Etv&showplayerpath=http%3A%2F%2Fblip%2Etv%2Fscripts%2Fflash%2Fshowplayer%2Eswf" width="350" height="288"><param name="movie" value="http://blip.tv/scripts/flash/showplayer.swf?enablejs=true&file=http%3A%2F%2Ffreebase%2Eblip%2Etv%2Frss%2Fflash%2F%3Freferrer%3Dfreebase%2Eblip%2Etv&showplayerpath=http%3A%2F%2Fblip%2Etv%2Fscripts%2Fflash%2Fshowplayer%2Eswf"></param><embed src="http://blip.tv/scripts/flash/showplayer.swf?enablejs=true&file=http%3A%2F%2Ffreebase%2Eblip%2Etv%2Frss%2Fflash%2F%3Freferrer%3Dfreebase%2Eblip%2Etv&showplayerpath=http%3A%2F%2Fblip%2Etv%2Fscripts%2Fflash%2Fshowplayer%2Eswf" width="350" height="288" type="application/x-shockwave-flash"></embed></object>
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Freebase: Wikipedia Mining 20080416 Freebase: Wikipedia Mining 20080416 Presentation Transcript

    • Wikipedia Mining Spring Freebase User Group meeting 2008-04-16 / zenkat
    • Why Mine Wikipedia? • How can we automatically extract the unstructured content from Wikipedia … • … to create a structured database of information … • … that can be leveraged by users in applications and data loads 2
    • A Remarkable Source of Information 2.15 M articles as of April 2008 Doubling every 12 - 18 months 3
    • Problem is … • Wikipedia is written by humans, for humans. - Great if you need to look up a fact, or learn about something • But you can’t … - Ask questions: “What movies by George Lucas has Harrison Ford starred in?” - Search effectively: “Find me all companies that build personal computers.” - Build applications: “Let’s make a social app that ranks consumer goods listed in wikipedia.” 4
    • From unstructured … 5
    • … to structured 6
    • Searching for Structure: Topics Articles define a topic 7
    • Searching for Structure: Types Categories & Lists provide type 8
    • Searching for Structure: Types Categories & Lists provide type 9
    • Searching for Structure: Properties Templates & Infoboxes give properties 10
    • Searching for Structure: Properties { quot;queryquot; : [ { quot;typequot; : quot;/architecture/structurequot; quot;namequot; : null, quot;height_metersquot; : null, quot;sortquot; : quot;-height_metersquot;, quot;limitquot; : 10, } ] } What are the highest buildings in the world? 11
    • Searching for Structure: Properties { quot;queryquot; : [ { quot;typequot; : quot;/location/countryquot; quot;namequot; : null, ”official_languagequot; : “English”, quot;limitquot; : 100 } ] } What are all the countries that speak English? 12
    • A Treasure Trove Waiting To Be Opened • 2,150,000 articles (ie, topics) • 7,100,000 category refs (ie, typings) - Found within 280,000 categories • 42,000,000 template values (ie, properties) - Found within 10,000 templates and 56,000 template keys • All growing at ~2% every two weeks • Available information doubles every year! 13
    • Topic Population From Wikipedia Topic Name Blurb Wikipedia Attribution Image Wikipedia Link 14
    • Fresh Topic 15
    • Similar, but different … • Many pages in wikipedia are not topics - Disambiguation pages, lists, categories, images, docs, talk … • Only store a 1200-character blurb - We’re not wikipedia, after all • Don’t need to add “(suffix)” to names - “Python (genus)” vs “Python (programming language)” - Freebase types disambiguate without names • Cities should be specified without state suffix - “San Francisco” vs “San Francisco, California” - Cleanup in progress, some exceptions remain • “Exclusionist” vs “Inclusionist” - Exclusionists appear to be winning in Wikilandia - Freebase is inherently more inclusionist 16
    • You Can’t Read The Same Wikipedia Twice Every 2 weeks … - 65,000 new pages - 8,000 deletes - 30,000 new topics - 5,000 name changes - 80,000 new aliases - 1,000 page ID changes - 10,000 merges - 1,000 splits … change in Wikipedia 17
    • Keeping track of changes … • Store reference information within freebase - Page_ids, article titles and redirects - Page_id (WPID) is stored in /wikipedia/en_id - Article titles and redirects are stored in /wikipedia/en - “mwcl_wikipedia_en”, “mw_infobot” user • None of these IDs are stable in wiki-land … 18
    • Determining actions by comparing keys case action new topic create a new topic name change add new name as en key; if quot;untouchedquot;, rename the topic id change change the en_id to the new value merge move the en key to the new topic; if quot;untouchedquot;, merge the topics split create new topic, move en key from old topic to new topic delete keep topic, but delete en_id and en keys from topic • Because we are more inclusionist than wikipedia, we usually do not delete topics. • Topic renames only occur on “untouched” topics. • Merges occur automatically on “untouched” topics - Otherwise, flagged for review in “pipeline” 19
    • Map Template Fields To Properties 20
    • Map Template Fields To Properties {{infobox Aircraft |subtemplate={{Infobox Boeing Aircraft}} |name =Boeing 777 |manufacturer =[[Boeing Commercial Airplanes]] |first flight =[[June 12]] [[1994]] |introduction =[[June 7]] [[1995]] with [[United]] |primary user = [[Singapore Airlines]] MediaWiki |more users = [[Air France-KLM]] |produced = 1993 - Present Template |number built = 723 as of March 2008 |unit cost = US$187.5-253 million Rendering }} 21
    • Map Template Fields To Properties {{infobox Aircraft |subtemplate={{Infobox Boeing Aircraft}} |name =Boeing 777 |manufacturer =[[Boeing Commercial Airplanes]] |first flight =[[June 12]] [[1994]] |introduction =[[June 7]] [[1995]] with [[United]] |primary user = [[Singapore Airlines]] MediaWiki |more users = [[Air France-KLM]] |produced = 1993 - Present Template |number built = 723 as of March 2008 |unit cost = US$187.5-253 million Rendering }} “manufacturer” --> /aviation/aircraft_model/manufacturer 22
    • Just the Starting Point … • Extracted to date from Wikipedia: - 2,365,000 topics - 2,895,000 typings - 5,638,000 properties • A complement to user-entered data - User data always takes precedence, won’t be overwritten • Processes are being automated to keep in sync 23
    • Thanks! Tristan Buckner Topic updater /user/tristan Image loader Colin Evans WEX /user/colin Al Marks Category mapper /user/al Template mapper WEX 24