Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008

Wikipedia contains a wealth of collective knowledge, but due to its semi-structured design and idiosyncratic markup, mining this resource is a formidable challenge. This session will examine techniques for mining semantically weak data sources for explicit facts.

The session will use WEX, a preprocessed normalization of Wikipedia designed to make this corpus easily accessible to developers interested in machine learning, natural language processing, or knowledge extraction. The process through which WEX is prepared will be discussed as a guide to creating mineable structures from semi-structured data, followed by approaches to machine extraction on structures of mixed data quality.

The session is targeted at intermediate developers with an interest in machine learning or knowledge extraction (though no experience is assumed with either).

The demonstrations leverage the power of Postgres 8.3’s XPath capability to simplify the programming model and present examples in Python, but the data and principles are compatible with any modern data infrastructure.

    Presentation Transcript

    • We all love Wikipedia!
    • Wikipedia has lots of data...
    • Lots of semi-structured data!
    • At Freebase, we use Wikipedia as a source for extracting facts and relationships
    • Some Interesting Data in Wikipedia
        • Infoboxes
        • Categories
        • Text
    • Problems With Wikipedia Data
        • The data is dirty
        • Wiki markup is hard to parse...
        • ...and often dirty
        • ...and not well defined
        • Properties and relations are in Wiki markup
        • Page redirects have to be resolved
    • {{Infobox Company
      | company_name = Sony Corporation <br>
      | company_logo = [[Image:Sony logo.svg|220px|]]
      | slogan = like.no.other
      | company_type = [[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}})
      | foundation = [[May 7]] [[1946]] (adopted current name in 1958)<ref name=sonycorpinfo>{{citeweb|url=http://www.sony.net/SonyInfo/CorporateInfo/|title=Sony Global - Corporate Information| accessdate=2007-07-24}}</ref>
      | founder = [[Masaru Ibuka]]<br>[[Akio Morita]]
      | location = {{flagicon|Japan}} [[Minato, Tokyo]], [[Japan]]<ref name="sonycorpinfo"/>
      | area_served = [[Worldwide]]
      | key_people = [[Sir Howard Stringer]]<br><small>([[Chairman]]) & ([[CEO]])</small><ref name="sonycorpinfo"/><br />[[Ryoji Chubachi]]<br><small>([[President]]) & ([[CEO|Electronics CEO]])<ref name="sonycorpinfo"/>
      | industry = [[Consumer electronics]]<br>[[Entertainment]]
      | products = [[Audio]]<br>[[Video]]<br>[[Televisions]]<br>[[Information Technology| Communications and Information Technology]]<br>[[Semiconductors]]<br>[[Electronic components]]<br>[[Motion Picture]]<br>[[Music]]<br>[[Online|Online Business]]<br>[[Sony Playstation]]
      | services = [[Financial services]]
      | market cap = [[United States Dollar|US$]] 40.56 Billion (''2008'')
      | revenue = {{profit}} [[United States Dollar|US$]] 88.714 Billion (''2008'')<ref name="2007 Q4">{{citeweb|url=http://www.sony.net/SonyInfo/IR/financial/fr/07q4_sony.pdf|title=Sony Corporation Earnings release for the fiscal year ended March 31, 2008|format=PDF}}</ref>
      | operating_income = {{profit}} [[United States Dollar|US$]] 3.745 Billion (''2008'')
      | net_income = {{profit}} [[United States Dollar|US$]] 3.694 Billion (''2008'')
      | assets = {{increase}} [[United States Dollar|US$]] 117.603 Billion (''2008'')
      | equity = {{increase}} [[United States Dollar|US$]] 32.465 Billion (''2008'')
      | num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/>
      | parent =
      | subsid = [[Sony Corporation shareholders and subsidiaries|List of the subsidiaries]]
    • Problems With Wikipedia Data
        • Wikipedia is huge!
            • 2,150,000 articles
            • 7,100,000 category references, found in 280,000 categories
            • 54,029 non-trivial templates (>= 5 uses)
            • 50,671,533 template name-value properties
        • Wikipedia is growing!
            • Grows 2% a week
            • 25,170 new articles
            • 39,571 new redirects
            • 8,000 deletes
            • 5,000 name changes
            • 1,000 article splits
            • 1,000 id changes
        • Before this talk is over, there will be 150 NEW articles!
    • The Freebase Wikipedia Extraction (WEX) [pipeline diagram; recoverable labels: Current Wikipedia Dump, Wiki Markup, Wiki2XML Parser (Magnus Manske), XML, Big Postgres Database!, Articles, Sections, Templates, Text, Redirects, Categories, Freebase Mappings]
    • WEX Article XML
      <template name="Infobox_President">
        <param name="name">Abraham Lincoln</param>
        <param name="nationality">American</param>
        <param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
        <param name="order">16th<space/><link><target>President of the United States</target></link></param>
        <param name="term_start"><link><target>March 4</target></link>, <space/><link><target>1861</target></link></param>
        <param name="term_end"><link><target>April 15</target></link>, <space/><link><target>1865</target></link></param>
        <param name="predecessor"><link><target>James Buchanan</target></link></param>
        <param name="successor"><link><target>Andrew Johnson</target></link></param>
        ....
      </template>
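The template XML on this slide can be turned into plain name/value pairs with very little code. The following is a minimal sketch using only Python's standard library. The snippet is a shortened copy of the slide's example, and reading `<space/>` as a single space character is an assumption, as are the helper names `flatten` and `template_params`.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example template.
WEX_SNIPPET = """
<template name="Infobox_President">
  <param name="name">Abraham Lincoln</param>
  <param name="order">16th<space/><link><target>President of the United States</target></link></param>
  <param name="term_start"><link><target>March 4</target></link>, <space/><link><target>1861</target></link></param>
  <param name="predecessor"><link><target>James Buchanan</target></link></param>
</template>
"""


def flatten(element):
    """Concatenate all text under element, treating <space/> as one space (assumption)."""
    parts = [element.text or ""]
    for child in element:
        parts.append(" " if child.tag == "space" else flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def template_params(xml_text):
    """Map each <param> name to its whitespace-normalized plain-text value."""
    root = ET.fromstring(xml_text.strip())
    return {p.get("name"): " ".join(flatten(p).split()) for p in root.findall("param")}


params = template_params(WEX_SNIPPET)
print(params["order"])       # 16th President of the United States
print(params["term_start"])  # March 4, 1861
```

Flattening discards the link structure, so a real extraction would probably keep the `<target>` values separately in order to resolve them to articles.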
    • WEX Schema [schema diagram; tables shown: category_members, articles, sections, template_calls, redirects, template_values, freebase_wpid, freebase_types, freebase_names]
    • SELECT xpath('/param/text()', template_values.xml)
      FROM template_values
        INNER JOIN template_calls ON call_id = template_calls.id
        INNER JOIN articles ON articles.wpid = article_wpid
      WHERE template_article_name = 'Template:Infobox Bridge'
        AND template_values.name = 'mainspan'
        AND articles.name = 'Fremont Bridge (Portland)'

      Result: "{"1,255 ft (382.5 m)","longest in Oregon"}"
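To see why the query returns two strings, it helps to reproduce what `/param/text()` selects: only the text nodes that are direct children of `<param>`. The sketch below does this in plain Python. The stored XML for the `mainspan` value is not shown on the slides, so `sample` is a hypothetical value shaped to match the result above, and stripping whitespace is a choice made here for readability.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(xml_fragment):
    """Return the non-empty text nodes that are direct children of the root element."""
    root = ET.fromstring(xml_fragment)
    # ElementTree keeps text before the first child in root.text and the
    # text following each child element in that child's .tail attribute.
    pieces = [root.text] + [child.tail for child in root]
    return [p.strip() for p in pieces if p and p.strip()]


# Hypothetical stored value, shaped to match the result on the slide.
sample = '<param name="mainspan">1,255 ft (382.5 m)<br/>longest in Oregon</param>'
print(direct_text_nodes(sample))  # ['1,255 ft (382.5 m)', 'longest in Oregon']
```

Text nested inside child elements such as `<link>` is not selected by this expression, which is one reason a value can come back as several fragments or as none at all.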
    • words become “features”
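The later slides call a `getwords` helper whose body is not shown. A minimal sketch of such a tokenizer follows; the length bounds and the choice to return a set of distinct lowercase words are assumptions made here, not the talk's actual implementation.

```python
import re


def getwords(text, min_len=3, max_len=19):
    """Return the distinct lowercase words in text whose length is within the bounds."""
    # Split on anything that is not a letter; this drops digits and punctuation.
    tokens = re.split(r"[^a-z]+", text.lower())
    return {t for t in tokens if min_len <= len(t) <= max_len}


print(sorted(getwords("The ninja served during the Sengoku period.")))
# ['during', 'ninja', 'period', 'sengoku', 'served', 'the']
```

Each distinct word is then treated as one binary feature of the article: present or absent.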
    • category_list = ['Category:Ninjas', 'Category:Pirates', 'Category:Assassins',
                       'Category:Apple Inc. employees', 'Category:Microsoft employees',
                       'Category:Google employees', 'Category:Free software programmers',
                       'Category:Computer programmers']

      print("Training classifiers...")

      # Get the members of every category
      name_classes = {}
      for category in category_list:
          members = get_category_members(cur, category)
          print(str(len(members)) + " examples for " + category)
          for name in members:
              name_classes.setdefault(name, set()).add(category)
    • def get_category_members(cur, category):
          # Breadth-first walk of the category tree, starting at `category`.
          records = []
          queue = [category]
          recordsSeen = set()
          # Stop expanding once about 500 article names have been collected.
          while len(queue) > 0 and len(records) < 500:
              currentCategory = queue.pop(0)
              cur.execute("select articles.wpid, articles.name " +
                          "from wikipedia.category_members, wikipedia.articles " +
                          "where category_members.category_name like %s and " +
                          "articles.wpid=category_members.article_wpid",
                          (currentCategory,))
              result = cur.fetchall()
              for wpid, name in result:
                  if wpid not in recordsSeen:
                      recordsSeen.add(wpid)
                      if name.startswith("Category:"):
                          queue.append(name)    # subcategory: expand it later
                      else:
                          records.append(name)  # ordinary article: keep it
          return records
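The traversal in `get_category_members` can be tried without a database. The sketch below runs the same breadth-first walk over an in-memory mapping; the function name `collect_members`, the `children` dictionary, and the toy category data are all hypothetical stand-ins for the `category_members` table.

```python
from collections import deque


def collect_members(children, root_category, limit=500):
    """Breadth-first walk of a category graph.

    Stops expanding further categories once `limit` article names are collected.
    """
    articles = []
    seen = {root_category}
    queue = deque([root_category])
    while queue and len(articles) < limit:
        current = queue.popleft()
        for name in children.get(current, []):
            if name in seen:
                continue
            seen.add(name)
            if name.startswith("Category:"):
                queue.append(name)     # subcategory: expand it later
            else:
                articles.append(name)  # ordinary article: keep it
    return articles


# Toy data with a cycle, to show that the `seen` set prevents infinite loops.
toy_graph = {
    "Category:Pirates": ["Blackbeard", "Category:Fictional pirates"],
    "Category:Fictional pirates": ["Jack Sparrow", "Blackbeard", "Category:Pirates"],
}
print(collect_members(toy_graph, "Category:Pirates"))  # ['Blackbeard', 'Jack Sparrow']
```

A `deque` is used here because `popleft()` is constant time, whereas `list.pop(0)` shifts every remaining element.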
    • # classifiers: (category, classifier) pairs; their construction is not shown on the slides
      for name in name_classes:
          name, text = get_article_text(cur, name)
          words = getwords(text[0:1024])
          for cat, cl in classifiers:
              if cat in name_classes[name]:
                  cl.train(words, 1)
              else:
                  cl.train(words, 0)
    • def get_article_text(cur, name):
          cur.execute("select name, text from " +
                      "wikipedia.articles where name=%s", (name,))
          return cur.fetchone()
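The slides use classifier objects with `train`, `prob`, and `classify` methods but do not show their code. The sketch below is one hypothetical classifier with that interface: a simplified two-class naive Bayes that multiplies add-one-smoothed document frequencies for the words present in the input. The class name and the smoothing scheme are assumptions and are not taken from the talk's code.

```python
import math
from collections import Counter


class BinaryNaiveBayes:
    """Two-class naive Bayes over word sets (1 = in the category, 0 = not)."""

    def __init__(self):
        self.doc_counts = Counter()                      # label -> training documents
        self.word_counts = {0: Counter(), 1: Counter()}  # label -> word -> documents

    def train(self, words, label):
        self.doc_counts[label] += 1
        for word in set(words):
            self.word_counts[label][word] += 1

    def log_prob(self, words, label):
        """log P(label) plus the sum of log P(word | label), add-one smoothed."""
        total_docs = sum(self.doc_counts.values())
        n = self.doc_counts[label]
        score = math.log((n + 1) / (total_docs + 2))
        for word in set(words):
            score += math.log((self.word_counts[label][word] + 1) / (n + 2))
        return score

    def prob(self, words, label):
        # Unnormalized score; it can underflow to 0.0 for long inputs.
        return math.exp(self.log_prob(words, label))

    def classify(self, words):
        return 1 if self.log_prob(words, 1) > self.log_prob(words, 0) else 0


cl = BinaryNaiveBayes()
cl.train(["pirate", "ship", "sea"], 1)
cl.train(["ninja", "japan", "sengoku"], 0)
print(cl.classify(["ship", "sea"]))  # 1
print(cl.classify(["japan"]))        # 0
```

Working in log space keeps `classify` stable even when the raw products underflow, which is the situation a guard on a zero denominator would otherwise have to handle.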
    • # Test set:
      test_set = ["Henri Caesar", "Long John Silver", "Jack Sparrow",
                  "Storm Shadow (G.I. Joe)", "Leonardo (TMNT)", "Bill Gates",
                  "Steve Jobs", "Richard Stallman", "Larry Page",
                  "Guido van Rossum", "Larry Wall", "Jerry Yang"]

      # Run tests:
      for testName in test_set:
          name, text = get_article_text(cur, testName)
          print(name)
          words = getwords(text[0:1024])
          for cat, cl in classifiers:
              # Ratio of the "in category" score to the "not in category" score
              py, pn = cl.prob(words, 1), cl.prob(words, 0)
              print('%s\t%s\t%f' % (cat, cl.classify(words), py / pn if pn > 0 else 100))
    • Category:Ninjas                      0   0.000000
      Category:Pirates                     1   155082805066493952.000000
      Category:Assassins                   0   0.000001
      Category:Apple Inc. employees        0   0.000000
      Category:Microsoft employees         0   0.000000
      Category:Google employees            0   0.000000
      Category:Free software programmers   0   0.000000
      Category:Computer programmers        0   0.000000
    • Category:Ninjas                      1   166627867883323968.000000
      Category:Pirates                     1   19.751727
      Category:Assassins                   1   413388475722811.625000
      Category:Apple Inc. employees        0   0.000000
      Category:Microsoft employees         0   0.000000
      Category:Google employees            0   0.000000
      Category:Free software programmers   0   0.000000
      Category:Computer programmers        0   0.000000
    • ninja japan characters period sengoku service they use means term based different japanese era heroes appearance kanji
    • pirate coast ship pirates crew off sea captured island century ships piracy north merchant captain according
    • can we construct a sentence that fits into both categories? (without using the word “pirate” or “ninja”)
    • “Toby Segaran lived during the sengoku period in Japan. He spent many years at sea capturing Japanese ships.”
    • http://code.google.com/p/wexbayes/
    • Category:Ninjas                      0.008158
      Category:Pirates                     21125425312885.750000
      Category:Assassins                   237.408562
      Category:Apple Inc. employees        0.000000
      Category:Microsoft employees         0.000000
      Category:Google employees            0.000000
      Category:Free software programmers   0.000000
      Category:Computer programmers        0.000000
    • Category:Ninjas                      2924533519139.380859
      Category:Pirates                     0.003120
      Category:Assassins                   67800277337186.242188
    • Category:Ninjas                      300.861827
      Category:Pirates                     8392781192.375138
      Category:Assassins                   3817.111709
    • Category:Ninjas                      0.000000
      Category:Pirates                     0.000000
      Category:Assassins                   0.000000
      Category:Apple Inc. employees        63.870863
      Category:Microsoft employees         186751.882154
      Category:Google employees            0.012458
      Category:Free software programmers   0.000197
      Category:Computer programmers        1222.542202
    • Category:Apple Inc. employees        66414979293154.234375
      Category:Microsoft employees         694373.180082
      Category:Google employees            2381.361809
      Category:Free software programmers   0.014712
      Category:Computer programmers        871493.654163
    • Category:Apple Inc. employees        11.530269
      Category:Microsoft employees         1.616829
      Category:Google employees            2703.744581
      Category:Free software programmers   12521439594.583622
      Category:Computer programmers        141466542964.903381
    • Category:Apple Inc. employees        23162.014385
      Category:Microsoft employees         99.417180
      Category:Google employees            291981001482.833679
      Category:Free software programmers   0.026512
      Category:Computer programmers        258.589845
    • Category:Apple Inc. employees        2.018667
      Category:Microsoft employees         0.310716
      Category:Google employees            84.447472
      Category:Free software programmers   21693.027739
      Category:Computer programmers        7538656551.050776
    • Category:Apple Inc. employees        518.061964
      Category:Microsoft employees         16855.582495
      Category:Google employees            940060750.923012
      Category:Free software programmers   259957538.360797
      Category:Computer programmers        462842056873.530640
    • Category:Apple Inc. employees        0.063467
      Category:Microsoft employees         0.004360
      Category:Google employees            79.474061
      Category:Free software programmers   0.000001
      Category:Computer programmers        0.000108