Wikipedia Mining
Spring Freebase User Group meeting
2008-04-16 / zenkat
Why Mine Wikipedia?


• How can we automatically extract the
  unstructured content from Wikipedia …


• … to create a structured database of
  information …


• … that can be leveraged by users in
  applications and data loads?

                                         2
A Remarkable Source of Information




      2.15 M articles as of April 2008
      Doubling every 12 - 18 months      3
Problem is …

• Wikipedia is written by humans, for humans.
 -   Great if you need to look up a fact, or learn about something



• But you can’t …
 -   Ask questions:
     “What movies by George Lucas has Harrison Ford starred in?”

 -   Search effectively:
     “Find me all companies that build personal computers.”

 -   Build applications:
     “Let’s make a social app that ranks consumer goods listed in
     Wikipedia.”
                                                                     4
From unstructured …




                      5
… to structured




                  6
Searching for Structure: Topics




          Articles define a topic

                                    7
Searching for Structure: Types




     Categories & Lists provide type
                                       8
Searching for Structure: Types




     Categories & Lists provide type
                                       9
Searching for Structure: Properties




  Templates & Infoboxes give properties
                                          10
Searching for Structure: Properties




                       {
                           "query" : [
                             {
                               "type" : "/architecture/structure",
                               "name" : null,
                               "height_meters" : null,
                               "sort" : "-height_meters",
                               "limit" : 10
                             }
                           ]
                       }




  What are the tallest buildings in the world?
                                                                    11
Searching for Structure: Properties




{
    "query" : [
      {
        "type" : "/location/country",
        "name" : null,
        "official_language" : "English",
        "limit" : 100
      }
    ]
}




    Which countries have English as an official language?
                                                     12
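A minimal sketch of how queries of this shape could be assembled in Python and serialized to valid JSON. The helper name `mql_envelope` is hypothetical, and nothing here is sent to any service; the block only builds and prints the query text.

```python
import json


def mql_envelope(clause):
    """Wrap one query clause in the {"query": [...]} envelope used on the slides."""
    return {"query": [clause]}


# None serializes to JSON null, which marks a value the query should fill in.
tallest_structures = mql_envelope({
    "type": "/architecture/structure",
    "name": None,
    "height_meters": None,
    "sort": "-height_meters",
    "limit": 10,
})

english_speaking_countries = mql_envelope({
    "type": "/location/country",
    "name": None,
    "official_language": "English",
    "limit": 100,
})

print(json.dumps(tallest_structures, indent=2))
print(json.dumps(english_speaking_countries, indent=2))
```

Building the clause as a dict and letting `json.dumps` do the quoting avoids hand-written mistakes such as a missing or trailing comma.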
A Treasure Trove Waiting To Be Opened

•       2,150,000 articles (i.e., topics)


•       7,100,000 category refs (i.e., typings)
    -   Found within 280,000 categories



•       42,000,000 template values (i.e., properties)
    -   Found within 10,000 templates and 56,000 template keys




•       All growing at ~2% every two weeks


•       Available information doubles every year!
                                                                 13
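A rough check of the growth figures, assuming the ~2% rate compounds steadily every two weeks (an assumption; the slide gives only the rate). Under that assumption the doubling time comes out at about 70 weeks, roughly 16 months, which sits inside the "12 - 18 months" range quoted earlier and is somewhat slower than a strict one-year doubling.

```python
import math


def doubling_time_weeks(growth_per_period, period_weeks):
    """Weeks needed to double at a fixed compound growth rate per period."""
    periods = math.log(2) / math.log(1 + growth_per_period)
    return periods * period_weeks


weeks = doubling_time_weeks(0.02, 2)   # 2% every two weeks
months = weeks / 52 * 12
print(f"{weeks:.0f} weeks, about {months:.0f} months")
```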
Topic Population From Wikipedia

       Topic Name
                                       Blurb




                    Wikipedia Attribution


                                             Image




                                            Wikipedia
                                              Link




                                                        14
Fresh Topic




              15
Similar, but different …

•       Many pages in Wikipedia are not topics
    -   Disambiguation pages, lists, categories, images, docs, talk …


•       Only store a 1200-character blurb
    -   We’re not Wikipedia, after all


•       Don’t need to add “(suffix)” to names
    -   “Python (genus)” vs “Python (programming language)”
    -   Freebase types disambiguate without names


•       Cities should be specified without state suffix
    -   “San Francisco” vs “San Francisco, California”
    -   Cleanup in progress, some exceptions remain


•       “Exclusionist” vs “Inclusionist”
    -   Exclusionists appear to be winning in Wikilandia
    -   Freebase is inherently more inclusionist                        16
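The naming rules above can be sketched as two small helpers. The function names are hypothetical, and the comma rule is an assumption that only makes sense for titles already known to be cities; this is an illustration, not the actual cleanup code.

```python
import re

# A trailing parenthetical such as " (genus)" or " (programming language)".
_PAREN_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")


def strip_disambiguator(title):
    """Remove a trailing "(suffix)" from a Wikipedia article title."""
    return _PAREN_SUFFIX.sub("", title)


def strip_region_suffix(city_title):
    """Drop everything after the first comma; intended for city titles only."""
    return city_title.split(",", 1)[0].strip()


print(strip_disambiguator("Python (genus)"))
print(strip_disambiguator("Python (programming language)"))
print(strip_region_suffix("San Francisco, California"))
```

Both "Python" titles reduce to the same name, which is acceptable only because the topic's types, not its name, carry the disambiguation.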
You Can’t Read The Same Wikipedia Twice


Every 2 weeks …


 -   65,000   new pages     -   8,000   deletes
 -   30,000   new topics    -   5,000   name changes
 -   80,000   new aliases   -   1,000   page ID changes
 -   10,000   merges        -   1,000   splits



                            … change in Wikipedia

                                                          17
Keeping track of changes …

•       Store reference information within Freebase
    -   Page_ids, article titles and redirects




    -   Page_id (WPID) is stored in /wikipedia/en_id
    -   Article titles and redirects are stored in /wikipedia/en
    -   “mwcl_wikipedia_en”, “mw_infobot” user


•       None of these IDs are stable in wiki-land …
                                                                   18
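One way to picture the bookkeeping: keep a snapshot of page ID to article title, and compare it with the next snapshot. The sketch below is a simplified assumption of that comparison. It covers only new pages, renames, ID changes and deletes; merges and splits would also need redirect data, which is left out. The function name and snapshot shape are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two {page_id: title} snapshots and list the detected changes."""
    old_id_by_title = {title: page_id for page_id, title in old.items()}
    new_titles = set(new.values())
    changes = []

    for page_id, title in new.items():
        if page_id in old:
            if old[page_id] != title:
                changes.append(("name change", page_id, title))
        elif title in old_id_by_title and old_id_by_title[title] not in new:
            # Same title, but the old page ID has disappeared.
            changes.append(("id change", page_id, title))
        else:
            changes.append(("new topic", page_id, title))

    for page_id, title in old.items():
        if page_id not in new and title not in new_titles:
            changes.append(("delete", page_id, title))

    return changes


old = {1: "A", 2: "B", 3: "C", 4: "D"}
new = {1: "A", 2: "B2", 5: "C", 6: "E"}
for change in diff_snapshots(old, new):
    print(change)
```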
Determining actions by comparing keys

          case          action

          new topic     create a new topic
          name change   add new name as en key; if "untouched", rename the topic
          id change     change the en_id to the new value
          merge         move the en key to the new topic; if "untouched", merge the topics
          split         create new topic, move en key from old topic to new topic
          delete        keep topic, but delete en_id and en keys from topic




•       Because we are more inclusionist than Wikipedia,
        we usually do not delete topics.
•       Topic renames only occur on “untouched” topics.
•       Merges occur automatically on “untouched” topics
    -   Otherwise, flagged for review in “pipeline”
                                                                                             19
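The case/action table can be sketched as a lookup plus one rule for "untouched" topics. The names below are hypothetical, and the `untouched` flag is assumed to be supplied by the caller; how a topic is judged untouched is not stated on the slide.

```python
# Actions that apply regardless of whether the topic is untouched.
BASE_ACTIONS = {
    "new topic": ["create topic"],
    "name change": ["add new name as en key"],
    "id change": ["set en_id to new value"],
    "merge": ["move en key to surviving topic"],
    "split": ["create topic", "move en key from old topic to new topic"],
    "delete": ["remove en_id and en keys, keep topic"],
}

# Extra step taken automatically only on untouched topics.
UNTOUCHED_ONLY = {
    "name change": "rename topic",
    "merge": "merge topics",
}


def plan_actions(case, untouched):
    """Return the list of actions for one detected change."""
    actions = list(BASE_ACTIONS[case])
    if case in UNTOUCHED_ONLY:
        if untouched:
            actions.append(UNTOUCHED_ONLY[case])
        elif case == "merge":
            # Merges on touched topics are not automatic.
            actions.append("flag for review in pipeline")
    return actions


print(plan_actions("merge", untouched=True))
print(plan_actions("merge", untouched=False))
print(plan_actions("delete", untouched=True))
```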
Map Template Fields To Properties




                                    20
Map Template Fields To Properties
{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |first flight =[[June 12]] [[1994]]
 |introduction =[[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million
}}

                                                      MediaWiki Template Rendering




                                                                  21
Map Template Fields To Properties
{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |first flight =[[June 12]] [[1994]]
 |introduction =[[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million
}}

                                                      MediaWiki Template Rendering




            “manufacturer” -->
  /aviation/aircraft_model/manufacturer




                                                                  22
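A minimal sketch of the field-to-property mapping, assuming a simple line-based reading of the infobox. The mapping table holds only the one pair shown on the slide; the function names are hypothetical, and the parser deliberately ignores nested templates and multi-line values.

```python
import re

# Template key -> property path. Only the pair from the slide is included.
FIELD_TO_PROPERTY = {
    "manufacturer": "/aviation/aircraft_model/manufacturer",
}

# [[Target]] or [[Target|label]] -> Target
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_infobox_fields(wikitext):
    """Read "|key = value" lines from infobox wikitext into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = WIKILINK.sub(r"\1", value).strip()
    return fields


def map_to_properties(fields):
    """Keep only the fields that have a known property mapping."""
    return {
        FIELD_TO_PROPERTY[key]: value
        for key, value in fields.items()
        if key in FIELD_TO_PROPERTY
    }


sample = """{{infobox Aircraft
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |first flight =[[June 12]] [[1994]]
}}"""

print(map_to_properties(parse_infobox_fields(sample)))
```

Unmapped keys such as "first flight" are parsed but dropped, so new mappings can be added to the table without touching the parser.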
Just the Starting Point …

• Extracted to date from Wikipedia:

 - 2,365,000 topics
 - 2,895,000 typings
 - 5,638,000 properties


• A complement to user-entered data
 -   User data always takes precedence, won’t be overwritten



• Processes are being automated to keep in sync

                                                               23
Thanks!



          Tristan Buckner   Topic updater
           /user/tristan    Image loader


            Colin Evans
                                WEX
            /user/colin


             Al Marks       Category mapper
             /user/al       Template mapper
                                 WEX




                                              24
