Wikipedia Mining
Spring Freebase User Group meeting
2008-04-16 / zenkat

Freebase: Wikipedia Mining 20080416


Slides from the "Wikipedia Mining" talk at the Spring Freebase User Group meeting.

Published in: Technology, Business
A video of the talk accompanies this presentation. See: http://blog.freebase.com/

  1. Wikipedia Mining Spring Freebase User Group meeting 2008-04-16 / zenkat
  2. Why Mine Wikipedia?
     • How can we automatically extract the unstructured content from Wikipedia …
     • … to create a structured database of information …
     • … that can be leveraged by users in applications and data loads
  3. A Remarkable Source of Information
     2.15 M articles as of April 2008
     Doubling every 12-18 months
  4. Problem is …
     • Wikipedia is written by humans, for humans.
       - Great if you need to look up a fact, or learn about something
     • But you can’t …
       - Ask questions: “What movies by George Lucas has Harrison Ford starred in?”
       - Search effectively: “Find me all companies that build personal computers.”
       - Build applications: “Let’s make a social app that ranks consumer goods listed in Wikipedia.”
  5. From unstructured …
  6. … to structured
  7. Searching for Structure: Topics
     Articles define a topic
  8. Searching for Structure: Types
     Categories & Lists provide type
  9. Searching for Structure: Types
     Categories & Lists provide type
  10. Searching for Structure: Properties
      Templates & Infoboxes give properties
  11. Searching for Structure: Properties

      {
        "query" : [
          {
            "type" : "/architecture/structure",
            "name" : null,
            "height_meters" : null,
            "sort" : "-height_meters",
            "limit" : 10
          }
        ]
      }

      What are the highest buildings in the world?
  12. Searching for Structure: Properties

      {
        "query" : [
          {
            "type" : "/location/country",
            "name" : null,
            "official_language" : "English",
            "limit" : 100
          }
        ]
      }

      What are all the countries that speak English?
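The two queries on slides 11 and 12 share one shape: a "query" envelope around a single clause, where null marks a value to be returned and a concrete value acts as a filter. A minimal Python sketch of assembling that shape follows. The helper name `build_mql_query` is hypothetical and not part of any Freebase library, the sketch only builds and serializes the JSON, and the type and property ids are the ones shown on the slides.

```python
import json


def build_mql_query(type_id, properties, sort=None, limit=None):
    """Build a query envelope shaped like the ones on slides 11 and 12.

    `properties` maps a property name to None ("return this value")
    or to a concrete value ("match this value").
    Hypothetical helper for illustration only.
    """
    clause = {"type": type_id}
    clause.update(properties)
    if sort is not None:
        clause["sort"] = sort
    if limit is not None:
        clause["limit"] = limit
    return {"query": [clause]}


# Slide 11: the ten tallest structures, sorted by descending height.
tallest = build_mql_query(
    "/architecture/structure",
    {"name": None, "height_meters": None},
    sort="-height_meters",
    limit=10,
)

# Slide 12: countries whose official language is English.
english = build_mql_query(
    "/location/country",
    {"name": None, "official_language": "English"},
    limit=100,
)

# json.dumps turns Python None into JSON null and never emits trailing commas.
print(json.dumps(tallest, indent=2))
print(json.dumps(english, indent=2))
```

Generating the query from a dict, rather than typing the JSON by hand, avoids the missing-comma and trailing-comma mistakes that are easy to make in hand-written queries.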
  13. A Treasure Trove Waiting To Be Opened
      • 2,150,000 articles (i.e., topics)
      • 7,100,000 category refs (i.e., typings)
        - Found within 280,000 categories
      • 42,000,000 template values (i.e., properties)
        - Found within 10,000 templates and 56,000 template keys
      • All growing at ~2% every two weeks
      • Available information doubles every year!
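As a rough check of the growth figures, the doubling time implied by a fixed rate per two-week period can be worked out with compound growth. This sketch assumes the rate is exactly 2% and compounds every period, which the slide's "~2%" only approximates. Under that assumption the doubling time comes out near 35 periods, or about 16 months, which sits inside the "12-18 months" range on slide 3, while doubling within a single year would need a rate closer to 2.7% per period.

```python
import math


def periods_to_double(rate_per_period):
    """Number of compounding periods needed to double at a fixed rate."""
    return math.log(2) / math.log(1 + rate_per_period)


growth = 0.02            # assumed exact; the slide says "~2%"
periods_per_year = 26    # two-week periods in a 52-week year

periods = periods_to_double(growth)
weeks = periods * 2
months = weeks / (52 / 12)

# Growth factor after one year of compounding at the assumed rate.
annual_factor = (1 + growth) ** periods_per_year

# Per-period rate that would double the total in exactly one year.
rate_for_yearly_doubling = 2 ** (1 / periods_per_year) - 1

print(f"periods to double: {periods:.1f}")
print(f"months to double:  {months:.1f}")
print(f"annual factor:     {annual_factor:.2f}")
print(f"rate to double in a year: {rate_for_yearly_doubling:.2%}")
```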
  14. Topic Population From Wikipedia
      Labeled elements: Topic Name, Blurb, Wikipedia Attribution, Image, Wikipedia Link
  15. Fresh Topic
  16. Similar, but different …
      • Many pages in Wikipedia are not topics
        - Disambiguation pages, lists, categories, images, docs, talk …
      • Only store a 1200-character blurb
        - We’re not Wikipedia, after all
      • Don’t need to add “(suffix)” to names
        - “Python (genus)” vs “Python (programming language)”
        - Freebase types disambiguate without names
      • Cities should be specified without state suffix
        - “San Francisco” vs “San Francisco, California”
        - Cleanup in progress, some exceptions remain
      • “Exclusionist” vs “Inclusionist”
        - Exclusionists appear to be winning in Wikilandia
        - Freebase is inherently more inclusionist
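The naming and blurb rules on slide 16 can be sketched as small string helpers. This is an illustration under stated assumptions, not the loader's actual code: the function names are hypothetical, the state-suffix rule is applied naively to any title containing ", " (a real loader would apply it only to topics already typed as cities), and the blurb is cut at a hard 1200 characters with no attention to word or sentence boundaries.

```python
import re

BLURB_LIMIT = 1200  # character limit stated on the slide

# Matches one trailing parenthesized disambiguator, e.g. " (genus)".
_PAREN_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")


def strip_paren_suffix(title):
    """'Python (genus)' -> 'Python'; titles without a suffix are unchanged."""
    return _PAREN_SUFFIX.sub("", title)


def strip_state_suffix(title):
    """'San Francisco, California' -> 'San Francisco'.

    Naive assumption: everything after the first ', ' is a state suffix.
    """
    head, sep, _tail = title.partition(", ")
    return head if sep else title


def make_blurb(text, limit=BLURB_LIMIT):
    """Keep at most `limit` characters of the article text."""
    return text[:limit]


print(strip_paren_suffix("Python (genus)"))
print(strip_paren_suffix("Python (programming language)"))
print(strip_state_suffix("San Francisco, California"))
print(len(make_blurb("x" * 5000)))
```

Both "Python" titles collapse to the same name here, which is the point of the slide: the topics stay distinct because their types differ, not because their names do.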
  17. You Can’t Read The Same Wikipedia Twice
      Every 2 weeks …
        - 65,000 new pages
        - 30,000 new topics
        - 80,000 new aliases
        - 10,000 merges
        - 8,000 deletes
        - 5,000 name changes
        - 1,000 page ID changes
        - 1,000 splits
      … change in Wikipedia
  18. Keeping track of changes …
      • Store reference information within Freebase
        - Page_ids, article titles and redirects
        - Page_id (WPID) is stored in /wikipedia/en_id
        - Article titles and redirects are stored in /wikipedia/en
        - “mwcl_wikipedia_en”, “mw_infobot” user
      • None of these IDs are stable in wiki-land …
  19. Determining actions by comparing keys

      case          action
      new topic     create a new topic
      name change   add new name as en key; if "untouched", rename the topic
      id change     change the en_id to the new value
      merge         move the en key to the new topic; if "untouched", merge the topics
      split         create new topic, move en key from old topic to new topic
      delete        keep topic, but delete en_id and en keys from topic

      • Because we are more inclusionist than Wikipedia, we usually do not delete topics.
      • Topic renames only occur on “untouched” topics.
      • Merges occur automatically on “untouched” topics
        - Otherwise, flagged for review in “pipeline”
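The case/action table on slide 19 can be read as a two-step procedure: classify what changed between two Wikipedia snapshots, then look up the action. The sketch below is one possible reading, not the actual updater. It assumes each snapshot is a plain dict from page id to article title, it detects only the four cases that such a comparison can reveal (merges and splits need redirect information that this sketch does not model), and all function names and action strings are hypothetical.

```python
def classify_pages(old, new):
    """Compare two snapshots (dict: page_id -> title) and label the changes."""
    old_by_title = {title: pid for pid, title in old.items()}
    cases = []
    replaced_ids = set()  # old page ids that reappear under a new id
    for pid, title in new.items():
        if pid in old:
            if old[pid] != title:
                cases.append(("name change", pid))
        elif title in old_by_title and old_by_title[title] not in new:
            # Same title, different page id: treat as an id change.
            cases.append(("id change", pid))
            replaced_ids.add(old_by_title[title])
        else:
            cases.append(("new topic", pid))
    for pid in old:
        if pid not in new and pid not in replaced_ids:
            cases.append(("delete", pid))
    return cases


def plan_actions(case, untouched):
    """Return the actions from the slide's table for one case."""
    if case == "new topic":
        return ["create topic"]
    if case == "name change":
        actions = ["add new name as en key"]
        if untouched:
            actions.append("rename topic")
        return actions
    if case == "id change":
        return ["set en_id to new value"]
    if case == "merge":
        follow_up = "merge topics" if untouched else "flag for review"
        return ["move en key to new topic", follow_up]
    if case == "split":
        return ["create topic", "move en key from old topic to new topic"]
    if case == "delete":
        return ["keep topic", "delete en_id and en keys"]
    raise ValueError(f"unknown case: {case}")


old_snapshot = {1: "A", 2: "B", 3: "C", 4: "D"}
new_snapshot = {1: "A", 2: "B2", 5: "C", 6: "E"}
for case, page_id in classify_pages(old_snapshot, new_snapshot):
    print(page_id, case, plan_actions(case, untouched=True))
```

The `untouched` flag carries the slide's safety rule: renames and automatic merges apply only to topics that users have not edited, and anything else is left for review.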
  20. Map Template Fields To Properties
  21. Map Template Fields To Properties

      {{infobox Aircraft
       |subtemplate={{Infobox Boeing Aircraft}}
       |name =Boeing 777
       |manufacturer =[[Boeing Commercial Airplanes]]
       |first flight =[[June 12]] [[1994]]
       |introduction =[[June 7]] [[1995]] with [[United]]
       |primary user = [[Singapore Airlines]]
       |more users = [[Air France-KLM]]
       |produced = 1993 - Present
       |number built = 723 as of March 2008
       |unit cost = US$187.5-253 million
      }}

      (Figure label: MediaWiki Template Rendering)
  22. Map Template Fields To Properties

      {{infobox Aircraft
       |subtemplate={{Infobox Boeing Aircraft}}
       |name =Boeing 777
       |manufacturer =[[Boeing Commercial Airplanes]]
       |first flight =[[June 12]] [[1994]]
       |introduction =[[June 7]] [[1995]] with [[United]]
       |primary user = [[Singapore Airlines]]
       |more users = [[Air France-KLM]]
       |produced = 1993 - Present
       |number built = 723 as of March 2008
       |unit cost = US$187.5-253 million
      }}

      (Figure label: MediaWiki Template Rendering)

      “manufacturer” --> /aviation/aircraft_model/manufacturer
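The mapping step on slide 22 can be sketched as a lookup from infobox field names to property paths. Only the "manufacturer" mapping comes from the slide. The parser is a deliberately naive, line-based assumption (it does not handle nested templates or values spanning several lines), and the function names are hypothetical.

```python
import re

# Field name -> property path. Only this entry appears on the slide.
FIELD_TO_PROPERTY = {
    "manufacturer": "/aviation/aircraft_model/manufacturer",
}

# A shortened copy of the infobox from the slide.
INFOBOX = """{{infobox Aircraft
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |number built = 723 as of March 2008
}}"""

# Captures the target of a wiki link: [[Target]] or [[Target|label]].
_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_fields(wikitext):
    """Collect '|key = value' lines into a dict (naive, one field per line)."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, _, value = line[1:].partition("=")
            fields[key.strip()] = value.strip()
    return fields


def map_fields(fields, mapping=FIELD_TO_PROPERTY):
    """Keep only mapped fields; prefer link targets over raw text."""
    mapped = {}
    for key, value in fields.items():
        if key in mapping:
            targets = _LINK.findall(value)
            mapped[mapping[key]] = targets if targets else value
    return mapped


print(map_fields(parse_fields(INFOBOX)))
```

Unmapped fields such as "number built" are dropped rather than guessed at, so a property only appears once someone has supplied a mapping for its template key.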
  23. Just the Starting Point …
      • Extracted to date from Wikipedia:
        - 2,365,000 topics
        - 2,895,000 typings
        - 5,638,000 properties
      • A complement to user-entered data
        - User data always takes precedence, won’t be overwritten
      • Processes are being automated to keep in sync
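The precedence rule on slide 23 ("User data always takes precedence, won’t be overwritten") can be illustrated with a small merge function. The data layout is an assumption made for the example: each stored value carries a source tag, either "user" or "wikipedia", and the function name is hypothetical.

```python
def merge_extracted(existing, extracted):
    """Merge freshly extracted values into stored ones.

    existing:  dict prop -> (value, source), source is "user" or "wikipedia"
    extracted: dict prop -> value, all from the Wikipedia extraction
    """
    merged = dict(existing)
    for prop, value in extracted.items():
        current = merged.get(prop)
        if current is not None and current[1] == "user":
            continue  # user-entered data is never overwritten
        merged[prop] = (value, "wikipedia")
    return merged


stored = {
    "name": ("Boeing 777", "user"),
    "number_built": ("700", "wikipedia"),
}
fresh = {"name": "Boeing 777-200", "number_built": "723", "unit_cost": "US$187.5-253 million"}

print(merge_extracted(stored, fresh))
```

Values that came from an earlier extraction are refreshed, which is what keeps the data in sync, while anything a user entered is left alone.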
  24. Thanks!
      Tristan Buckner (/user/tristan): Topic updater, Image loader
      Colin Evans (/user/colin): WEX
      Al Marks (/user/al): Category mapper, Template mapper, WEX
