Wikipedia Mining
Spring Freebase User Group meeting
2008-04-16 / zenkat
Why Mine Wikipedia?


• How can we automatically extract the
  unstructured content from Wikipedia …


• … to create a structured database of
  information …


• … that can be leveraged by users in
  applications and data loads

                                         2
A Remarkable Source of Information




      2.15 M articles as of April 2008
      Doubling every 12 - 18 months      3
Problem is …

• Wikipedia is written by humans, for humans.
 -   Great if you need to look up a fact, or learn about something



• But you can’t …
 -   Ask questions:
     “What movies by George Lucas has Harrison Ford starred in?”

 -   Search effectively:
     “Find me all companies that build personal computers.”

 -   Build applications:
     “Let’s make a social app that ranks consumer goods listed in
     Wikipedia.”
                                                                     4
From unstructured …




                      5
… to structured




                  6
Searching for Structure: Topics




          Articles define a topic

                                    7
Searching for Structure: Types




     Categories & Lists provide type
                                       8
Searching for Structure: Properties




  Templates & Infoboxes give properties
                                          10
Searching for Structure: Properties




                       {
                           "query" : [
                             {
                               "type" : "/architecture/structure",
                               "name" : null,
                               "height_meters" : null,
                               "sort" : "-height_meters",
                               "limit" : 10
                             }
                           ]
                       }




  What are the tallest buildings in the world?
                                                                    11
Searching for Structure: Properties




{
    "query" : [
      {
        "type" : "/location/country",
        "name" : null,
        "official_language" : "English",
        "limit" : 100
      }
    ]
}




    Which countries have English as an official language?
                                                     12
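The shape of the two queries above can be illustrated with a toy matcher. This is only a sketch and not the Freebase query engine: it assumes that a null value means "return this property", that any other value is a constraint, and that "sort" and "limit" behave as their names suggest. The function `run_toy_query` and the sample topics are hypothetical, and the names and heights are made up.

```python
# Toy illustration only: this is NOT the Freebase MQL engine, just a
# small matcher that mimics the query shape shown on the slides.

def run_toy_query(query, topics):
    """Evaluate a single MQL-style clause against a list of dicts."""
    clause = dict(query["query"][0])
    sort_key = clause.pop("sort", None)
    limit = clause.pop("limit", None)

    results = []
    for topic in topics:
        # A non-null value is a constraint the topic must satisfy.
        if all(topic.get(k) == v for k, v in clause.items() if v is not None):
            # Every key named in the clause is returned, null or not.
            results.append({k: topic.get(k) for k in clause})

    if sort_key:
        descending = sort_key.startswith("-")
        field = sort_key.lstrip("-")
        results.sort(key=lambda r: r[field], reverse=descending)
    return results[:limit] if limit is not None else results


# Hypothetical sample data; names and heights are made up.
TOPICS = [
    {"type": "/architecture/structure", "name": "Tower A", "height_meters": 300},
    {"type": "/architecture/structure", "name": "Tower B", "height_meters": 450},
    {"type": "/location/country", "name": "Country C", "official_language": "English"},
]

tallest = {
    "query": [{
        "type": "/architecture/structure",
        "name": None,
        "height_meters": None,
        "sort": "-height_meters",
        "limit": 10,
    }]
}

names = [r["name"] for r in run_toy_query(tallest, TOPICS)]
print(names)  # ['Tower B', 'Tower A']
```

The point of the sketch is that the query is a template of the answer: the caller fills in what is known and leaves null where a value is wanted.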
A Treasure Trove Waiting To Be Opened

•       2,150,000 articles (i.e., topics)


•       7,100,000 category refs (i.e., typings)
    -   Found within 280,000 categories



•       42,000,000 template values (i.e., properties)
    -   Found within 10,000 templates and 56,000 template keys




•       All growing at ~2% every two weeks


•       Available information doubles roughly every 12 - 18 months!
                                                                 13
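The growth figures on this slide can be cross-checked with compound-growth arithmetic. The sketch below takes the ~2% per two weeks figure at face value and assumes it compounds steadily; the variable names are illustrative only.

```python
import math

growth_per_period = 0.02   # ~2% growth every two weeks, from the slide
weeks_per_period = 2
periods_per_year = 52 / weeks_per_period

# Number of two-week periods needed for the total to double.
periods_to_double = math.log(2) / math.log(1 + growth_per_period)
months_to_double = periods_to_double * weeks_per_period / 52 * 12

# Growth factor accumulated over one year at the same rate.
yearly_factor = (1 + growth_per_period) ** periods_per_year

print(f"{periods_to_double:.1f} periods, about {months_to_double:.0f} months")
# 35.0 periods, about 16 months
```

Under that assumption the content doubles in a bit over a year, which sits within the 12 - 18 month range quoted earlier in the deck.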
Topic Population From Wikipedia

       [Annotated screenshot with callouts: Topic Name, Blurb, Wikipedia Attribution, Image, Wikipedia Link]




                                                        14
Fresh Topic




              15
Similar, but different …

•       Many pages in Wikipedia are not topics
    -   Disambiguation pages, lists, categories, images, docs, talk …


•       Only store a 1200-character blurb
    -   We’re not Wikipedia, after all


•       Don’t need to add “(suffix)” to names
    -   “Python (genus)” vs “Python (programming language)”
    -   Freebase types disambiguate, so names need no suffix


•       Cities should be specified without state suffix
    -   “San Francisco” vs “San Francisco, California”
    -   Cleanup in progress, some exceptions remain


•       “Exclusionist” vs “Inclusionist”
    -   Exclusionists appear to be winning in Wikilandia
    -   Freebase is inherently more inclusionist                        16
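
The naming and blurb rules above can be sketched as a few small helpers. This is a minimal illustration, not the loader's actual code: the function names are hypothetical, and the comma rule is deliberately naive and would only be safe to apply to pages already known to be cities.

```python
import re

BLURB_LIMIT = 1200  # character limit quoted on the slide


def strip_disambiguator(title):
    """Drop a trailing parenthetical such as ' (genus)' from an article title."""
    return re.sub(r"\s*\([^()]*\)$", "", title)


def strip_state_suffix(title):
    """Drop a trailing ', State' part, e.g. 'San Francisco, California'.

    Naive: assumes the caller has already decided the page is a city.
    """
    return title.split(",", 1)[0].strip()


def make_blurb(article_text):
    """Keep at most the first BLURB_LIMIT characters of the article text."""
    return article_text[:BLURB_LIMIT]


print(strip_disambiguator("Python (programming language)"))  # Python
print(strip_state_suffix("San Francisco, California"))       # San Francisco
```

After stripping, two topics can share the name "Python"; the slide's point is that their types, not their names, tell them apart.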
You Can’t Read The Same Wikipedia Twice


Every 2 weeks …


 -   65,000   new pages     -   8,000   deletes
 -   30,000   new topics    -   5,000   name changes
 -   80,000   new aliases   -   1,000   page ID changes
 -   10,000   merges        -   1,000   splits



                            … change in Wikipedia

                                                          17
Keeping track of changes …

•       Store reference information within Freebase
    -   Page_ids, article titles and redirects




    -   Page_id (WPID) is stored in /wikipedia/en_id
    -   Article titles and redirects are stored in /wikipedia/en
    -   “mwcl_wikipedia_en”, “mw_infobot” user


•       None of these IDs are stable in wiki-land …
                                                                   18
Determining actions by comparing keys

          case          action

          new topic     create a new topic
          name change   add new name as en key; if "untouched", rename the topic
          id change     change the en_id to the new value
          merge         move the en key to the new topic; if "untouched", merge the topics
          split         create new topic, move en key from old topic to new topic
          delete        keep topic, but delete en_id and en keys from topic




•       Because we are more inclusionist than Wikipedia,
        we usually do not delete topics.
•       Topic renames only occur on “untouched” topics.
•       Merges occur automatically on “untouched” topics
    -   Otherwise, flagged for review in “pipeline”
                                                                                             19
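The case/action table above can be sketched in code. This is a rough illustration under stated assumptions, not the actual updater: it assumes two Wikipedia snapshots are available as `page_id -> title` dicts, it only detects the four cases that can be seen from those two keys (merges and splits would need redirect data), and it takes "untouched" to mean a topic that no Freebase user has edited. The names `classify` and `actions_for` are hypothetical.

```python
def classify(old, new):
    """Compare two snapshots (page_id -> title) and label what changed."""
    old_titles = set(old.values())
    new_titles = set(new.values())
    changes = []
    for page_id, title in new.items():
        if page_id in old:
            if old[page_id] != title:
                changes.append(("name change", page_id, title))
        elif title in old_titles:
            # Same title, different page_id: the en_id key must be updated.
            changes.append(("id change", page_id, title))
        else:
            changes.append(("new topic", page_id, title))
    for page_id, title in old.items():
        if page_id not in new and title not in new_titles:
            changes.append(("delete", page_id, title))
    return changes


def actions_for(case, untouched=False):
    """Return the steps from the slide's table for one detected case."""
    if case == "new topic":
        return ["create topic"]
    if case == "name change":
        return ["add new name as en key"] + (["rename topic"] if untouched else [])
    if case == "id change":
        return ["change en_id to new value"]
    if case == "merge":
        follow_up = "merge topics" if untouched else "flag for review"
        return ["move en key to new topic", follow_up]
    if case == "split":
        return ["create new topic", "move en key from old topic to new topic"]
    if case == "delete":
        return ["keep topic", "delete en_id and en keys"]
    raise ValueError(f"unknown case: {case}")


old = {1: "A", 2: "B", 3: "C"}
new = {1: "A2", 9: "B", 4: "D"}
for case, page_id, title in classify(old, new):
    print(case, page_id, title, actions_for(case))
```

Note that the delete case never removes the topic itself, only the keys that tie it to Wikipedia.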
Map Template Fields To Properties




                                    20
Map Template Fields To Properties
{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |first flight =[[June 12]] [[1994]]
 |introduction =[[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million
}}

[Shown beside the wikitext: the MediaWiki template rendering]




            “manufacturer” -->
  /aviation/aircraft_model/manufacturer




                                                                  22
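The mapping step can be sketched as a lookup from (template, field) to a property path. This is a minimal illustration and not the actual template mapper: only the "manufacturer" entry comes from the slide, the table and function names are hypothetical, and the parser handles a flat infobox only (no nested templates such as the subtemplate line).

```python
import re

# Hypothetical mapping table; only the "manufacturer" entry comes from the slide.
FIELD_TO_PROPERTY = {
    ("infobox aircraft", "manufacturer"): "/aviation/aircraft_model/manufacturer",
}

# Shortened copy of the slide's infobox, without the nested subtemplate.
INFOBOX = """{{infobox Aircraft
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |primary user = [[Singapore Airlines]]
}}"""


def parse_infobox(wikitext):
    """Return (template_name, {field: raw_value}) for a flat infobox."""
    lines = wikitext.strip().splitlines()
    template = lines[0].lstrip("{").strip().lower()
    fields = {}
    for line in lines[1:]:
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return template, fields


def link_targets(value):
    """Extract [[wiki link]] targets from a raw field value."""
    return [m.split("|")[0] for m in re.findall(r"\[\[([^\]]+)\]\]", value)]


def map_fields(wikitext):
    """Map known template fields to property paths; unknown fields are skipped."""
    template, fields = parse_infobox(wikitext)
    mapped = {}
    for key, value in fields.items():
        prop = FIELD_TO_PROPERTY.get((template, key))
        if prop:
            mapped[prop] = link_targets(value) or [value]
    return mapped


print(map_fields(INFOBOX))
# {'/aviation/aircraft_model/manufacturer': ['Boeing Commercial Airplanes']}
```

Using the link target rather than the raw text means the value can point at another topic instead of being stored as a plain string.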
Just the Starting Point …

• Extracted to date from Wikipedia:

 - 2,365,000 topics
 - 2,895,000 typings
 - 5,638,000 properties


• A complement to user-entered data
 -   User data always takes precedence, won’t be overwritten



• Processes are being automated to keep Freebase in sync with Wikipedia

                                                               23
Thanks!



          Tristan Buckner   /user/tristan   Topic updater, Image loader
          Colin Evans       /user/colin     WEX
          Al Marks          /user/al        Category mapper, Template mapper, WEX




                                              24
