Wikipedia Mining
Spring Freebase User Group meeting
2008-04-16 / zenkat

Freebase: Wikipedia Mining 20080416

Slides from the "Wikipedia Mining" talk at the Spring Freebase User Group meeting.

Published in: Technology, Business
  • Hello All!

    There is a video of the talk that goes with this presentation. Check it out at:

    http://blog.freebase.com/


Transcript of "Freebase: Wikipedia Mining 20080416"

  1. Wikipedia Mining
     Spring Freebase User Group meeting
     2008-04-16 / zenkat
  2. Why Mine Wikipedia?
     • How can we automatically extract the unstructured content from Wikipedia …
     • … to create a structured database of information …
     • … that can be leveraged by users in applications and data loads
  3. A Remarkable Source of Information
     2.15 M articles as of April 2008
     Doubling every 12 - 18 months
  4. Problem is …
     • Wikipedia is written by humans, for humans.
       - Great if you need to look up a fact, or learn about something
     • But you can’t …
       - Ask questions: “What movies by George Lucas has Harrison Ford starred in?”
       - Search effectively: “Find me all companies that build personal computers.”
       - Build applications: “Let’s make a social app that ranks consumer goods listed in Wikipedia.”
  5. From unstructured …
  6. … to structured
  7. Searching for Structure: Topics
     Articles define a topic
  8. Searching for Structure: Types
     Categories & Lists provide type
  9. Searching for Structure: Types
     Categories & Lists provide type
  10. Searching for Structure: Properties
      Templates & Infoboxes give properties
  11. Searching for Structure: Properties
      {
        "query" : [
          {
            "type" : "/architecture/structure",
            "name" : null,
            "height_meters" : null,
            "sort" : "-height_meters",
            "limit" : 10
          }
        ]
      }
      What are the highest buildings in the world?
  12. Searching for Structure: Properties
      {
        "query" : [
          {
            "type" : "/location/country",
            "name" : null,
            "official_language" : "English",
            "limit" : 100
          }
        ]
      }
      What are all the countries that speak English?
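The two queries on slides 11 and 12 share the same envelope, so they can be built as plain data and serialized to JSON. The following is a minimal sketch, not Freebase client code: the helper name `mql_envelope` is hypothetical, and nothing is sent over the network.

```python
import json


def mql_envelope(**clauses):
    """Wrap one query clause in the {"query": [ ... ]} envelope shown on the slides.

    Hypothetical helper for illustration; None values serialize to JSON null,
    which the slides use to mark the fields to be filled in.
    """
    return {"query": [dict(clauses)]}


# Slide 11: the ten tallest structures, sorted by descending height.
tallest = mql_envelope(
    type="/architecture/structure",
    name=None,
    height_meters=None,
    sort="-height_meters",
    limit=10,
)

# Slide 12: up to 100 countries whose official language is English.
english_speaking = mql_envelope(
    type="/location/country",
    name=None,
    official_language="English",
    limit=100,
)

tallest_json = json.dumps(tallest, indent=2)
english_json = json.dumps(english_speaking, indent=2)
print(tallest_json)
```

Building the query as a dictionary and letting `json.dumps` do the serialization avoids the comma and quoting slips that are easy to make when writing the JSON by hand.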
  13. A Treasure Trove Waiting To Be Opened
      • 2,150,000 articles (i.e., topics)
      • 7,100,000 category refs (i.e., typings)
        - Found within 280,000 categories
      • 42,000,000 template values (i.e., properties)
        - Found within 10,000 templates and 56,000 template keys
      • All growing at ~2% every two weeks
      • Available information doubles every year!
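As a rough check on the growth figures, the sketch below compounds the ~2% per two weeks from slide 13 and compares it with the doubling time on slide 3. It assumes a constant rate and 26 two-week periods per year; both are simplifications, and the variable names are illustrative.

```python
import math

RATE_PER_PERIOD = 0.02   # ~2% growth every two weeks (slide 13)
PERIODS_PER_YEAR = 26    # assumption: 52 weeks / 2

# Growth factor after one year of compounding.
annual_growth = (1 + RATE_PER_PERIOD) ** PERIODS_PER_YEAR

# Number of two-week periods needed to double, converted to months.
doubling_periods = math.log(2) / math.log(1 + RATE_PER_PERIOD)
doubling_months = doubling_periods / PERIODS_PER_YEAR * 12

# Per-period rate that would be needed to double in exactly one year.
rate_for_yearly_doubling = 2 ** (1 / PERIODS_PER_YEAR) - 1

print(f"growth per year: x{annual_growth:.2f}")
print(f"doubling time: {doubling_months:.1f} months")
print(f"rate needed to double yearly: {rate_for_yearly_doubling:.2%}")
```

Under these assumptions, 2% per two weeks works out to roughly a two-thirds increase per year and a doubling time of about 16 months, which sits inside the 12 - 18 month range quoted on slide 3. Doubling in a single year would need a rate closer to 2.7% per two weeks, so "doubles every year" is best read as a round figure.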
  14. Topic Population From Wikipedia
      Topic Name, Blurb, Wikipedia Attribution, Image, Wikipedia Link
  15. Fresh Topic
  16. Similar, but different …
      • Many pages in Wikipedia are not topics
        - Disambiguation pages, lists, categories, images, docs, talk …
      • Only store a 1200-character blurb
        - We’re not Wikipedia, after all
      • Don’t need to add “(suffix)” to names
        - “Python (genus)” vs “Python (programming language)”
        - Freebase types disambiguate without names
      • Cities should be specified without state suffix
        - “San Francisco” vs “San Francisco, California”
        - Cleanup in progress, some exceptions remain
      • “Exclusionist” vs “Inclusionist”
        - Exclusionists appear to be winning in Wikilandia
        - Freebase is inherently more inclusionist
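Two of the rules on slide 16 are mechanical enough to sketch: dropping a trailing “(suffix)” from an article title and keeping only a 1200-character blurb. This is an illustrative sketch, not the loader's actual code; the function names are hypothetical, and the plain slice for the blurb is an assumption (the real cut may respect word or sentence boundaries). The city/state rule is left out because the slide notes that exceptions remain.

```python
import re

BLURB_LIMIT = 1200  # characters, per slide 16

# A single trailing parenthetical such as " (genus)" at the end of a title.
_PAREN_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")


def topic_name(article_title: str) -> str:
    """Strip a trailing "(suffix)" disambiguator from a Wikipedia title."""
    return _PAREN_SUFFIX.sub("", article_title)


def blurb(article_text: str, limit: int = BLURB_LIMIT) -> str:
    """Keep at most `limit` characters of the article text (naive cut)."""
    return article_text[:limit]


print(topic_name("Python (genus)"))
print(topic_name("Python (programming language)"))
```

Both titles reduce to the same name, which is the point of the slide: the topic's types, not its name, tell the snake apart from the programming language.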
  17. You Can’t Read The Same Wikipedia Twice
      Every 2 weeks …
        - 65,000 new pages
        - 8,000 deletes
        - 30,000 new topics
        - 5,000 name changes
        - 80,000 new aliases
        - 1,000 page ID changes
        - 10,000 merges
        - 1,000 splits
      … change in Wikipedia
  18. Keeping track of changes …
      • Store reference information within Freebase
        - Page_ids, article titles and redirects
        - Page_id (WPID) is stored in /wikipedia/en_id
        - Article titles and redirects are stored in /wikipedia/en
        - “mwcl_wikipedia_en”, “mw_infobot” user
      • None of these IDs are stable in wiki-land …
  19. Determining actions by comparing keys
      case          action
      new topic     create a new topic
      name change   add new name as en key; if "untouched", rename the topic
      id change     change the en_id to the new value
      merge         move the en key to the new topic; if "untouched", merge the topics
      split         create new topic, move en key from old topic to new topic
      delete        keep topic, but delete en_id and en keys from topic
      • Because we are more inclusionist than Wikipedia, we usually do not delete topics.
      • Topic renames only occur on “untouched” topics.
      • Merges occur automatically on “untouched” topics
        - Otherwise, flagged for review in “pipeline”
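The case/action table on slide 19 can be read as a small dispatch function. The sketch below is only an illustration of that table: the function name `plan_actions`, the `untouched` flag, and the action strings are hypothetical, and how each case is detected from the stored en_id and en keys is not covered by the slide.

```python
from typing import List


def plan_actions(case: str, untouched: bool) -> List[str]:
    """Return the actions slide 19 lists for one detected change.

    `untouched` stands for the slide's "untouched" topics; its exact
    definition is not given in the talk.
    """
    if case == "new topic":
        return ["create a new topic"]
    if case == "name change":
        actions = ["add new name as en key"]
        if untouched:
            actions.append("rename the topic")
        return actions
    if case == "id change":
        return ["change the en_id to the new value"]
    if case == "merge":
        actions = ["move the en key to the new topic"]
        # Merges are automatic only on untouched topics; others go to review.
        actions.append("merge the topics" if untouched else "flag for review")
        return actions
    if case == "split":
        return ["create a new topic", "move en key from old topic to new topic"]
    if case == "delete":
        # Topics are kept; only the Wikipedia keys are removed.
        return ["keep the topic", "delete en_id and en keys from the topic"]
    raise ValueError(f"unknown case: {case!r}")


print(plan_actions("merge", untouched=False))
```

Note that no branch deletes a topic, which matches the slide's point that Freebase is more inclusionist than Wikipedia.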
  20. Map Template Fields To Properties
  21. Map Template Fields To Properties
      {{infobox Aircraft
       |subtemplate={{Infobox Boeing Aircraft}}
       |name =Boeing 777
       |manufacturer =[[Boeing Commercial Airplanes]]
       |first flight =[[June 12]] [[1994]]
       |introduction =[[June 7]] [[1995]] with [[United]]
       |primary user = [[Singapore Airlines]]
       |more users = [[Air France-KLM]]
       |produced = 1993 - Present
       |number built = 723 as of March 2008
       |unit cost = US$187.5-253 million
      }}
      MediaWiki Template Rendering
  22. Map Template Fields To Properties
      {{infobox Aircraft
       |subtemplate={{Infobox Boeing Aircraft}}
       |name =Boeing 777
       |manufacturer =[[Boeing Commercial Airplanes]]
       |first flight =[[June 12]] [[1994]]
       |introduction =[[June 7]] [[1995]] with [[United]]
       |primary user = [[Singapore Airlines]]
       |more users = [[Air France-KLM]]
       |produced = 1993 - Present
       |number built = 723 as of March 2008
       |unit cost = US$187.5-253 million
      }}
      MediaWiki Template Rendering
      “manufacturer” --> /aviation/aircraft_model/manufacturer
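The mapping on slide 22 can be sketched as two steps: split the infobox into key/value pairs, then look each key up in a field-to-property table. This is a simplified illustration, not the template mapper from the talk. The names `parse_fields`, `map_properties`, and `FIELD_TO_PROPERTY` are hypothetical, only the one mapping shown on the slide is included, and the line-based parsing assumes one field per line, which real wikitext does not guarantee.

```python
import re

INFOBOX = """{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |number built = 723 as of March 2008
}}"""

# Only the mapping given on slide 22; any other entry would be a guess.
FIELD_TO_PROPERTY = {"manufacturer": "/aviation/aircraft_model/manufacturer"}

# [[Target]] or [[Target|label]]; captures the link target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_fields(wikitext):
    """Collect "|key = value" lines into a dict (one field per line assumed)."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def map_properties(fields):
    """Map known template keys to property ids, preferring link targets as values."""
    mapped = {}
    for key, value in fields.items():
        prop = FIELD_TO_PROPERTY.get(key)
        if prop is None:
            continue  # unmapped template key: skipped in this sketch
        links = WIKILINK.findall(value)
        mapped[prop] = links if links else [value]
    return mapped


fields = parse_fields(INFOBOX)
properties = map_properties(fields)
print(properties)
```

Using the link target rather than the raw text is a design choice for the sketch: a wikilink names another article, which is the natural candidate for the topic that the property should point to.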
  23. Just the Starting Point …
      • Extracted to date from Wikipedia:
        - 2,365,000 topics
        - 2,895,000 typings
        - 5,638,000 properties
      • A complement to user-entered data
        - User data always takes precedence, won’t be overwritten
      • Processes are being automated to keep in sync
  24. Thanks!
      Tristan Buckner (/user/tristan): Topic updater, Image loader
      Colin Evans (/user/colin): WEX
      Al Marks (/user/al): Category mapper, Template mapper, WEX