Web Scale Named Entity Mining

presentation given at the Industry Day of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence

http://wi-iat-2011.org/

Transcript

  • 1. Web scale Named Entity Mining. "There's simply too much information out there". WI-IAT 2011
  • 2. In memoriam of Herbert A. Simon …
  • 3. Stuck, April 2011
  • 4. Herbert Simon's Brookings Institute Lecture, "Designing Organizations for an Information-Rich World", Johns Hopkins University, September 1, 1969
  • 5. 1. Tales & legends
  • 6. Find & procure a crystal plastic replacement of a polycarbonate LEXAN 943. Main constraints:
      • more resistant to detergent agents than LEXAN 943 (problem of cracking under combined effect of mechanical stress and exposure to detergent agents)
      • compatible with existing tools - withdrawal must be close to LEXAN 943
      • optical characteristic close to LEXAN 943
      • weldable by ultrasonic welding
      • compliant with resistance to fire & smoke requirements 2 according to NFF16-101/102 and V0 according to standard UL 94
    Delay: one week. Organization centric search
  • 7. Where is the SA-24 Grinch 9K338 Igla-S portable air defense missile system sold/operated? Location centric search
  • 8. Recent information (past month) about call for proposal "outils Web innovants en entreprise"? Time centric search
  • 9. "Pro" searches focus on named entities: Location, Orgs, People, Time
  • 10. 2. Introducing WebNEM
  • 11. Product specifications; find compliant products; get manufacturer or distributor. Relevant query? Query again? Where? + browsing/ranking results: attention-greedy & burdensome
  • 12. "SA-24 Grinch 9K338 Igla-S". Goal: attention-saver process
  • 13. Exploratory data analysis of high dimensional data
  • 14. "In exploratory data analysis of high dimensional data one of the main tasks is the formation of a simplified, usually visual, overview of data sets. ... Clustering and projection are among the examples of useful methods to achieve this task." Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
  • 15. WebNEM: collection of relevant data, anywhere in the web + projection on Named Entities space. Building blocks: topical web crawler; named entity recognition; visualization/exploratory analysis tools
  • 16. "Web scale" collection: brute force. Never-ending crawl; a priori "whole" Web indexing; general index, "everywhere"; huge resources required (data size based); user query; fast answer, "any" topic
  • 17. "Web scale" collection: our approach. User query; on-demand topical crawl; Web slices; tailored index, built on order, anywhere relevant; "close to optimal" resources (usage based); delayed answer, but less garbage
  • 18. Projection: when to extract entities? Named Entity Recognition is resource intensive.
      process step | data       | size  | required response time
      crawl time   | whole web  | 10^10 | asynchronous
      query time   | collection | 10^2  | real-time
      crawl time   | web slice  | 10^4  | asynchronous
  • 19. www.squido.fr: our SaaS Web mining system. Large scale Named Entity extraction (EN/FR). Beta released to customers June 2011
  • 20. WebNEM with Squido: search topic; focused crawl; page cleaning; shallow entity extraction; index; visualization; user queries; user collections; deep entity extraction; visualization
  • 21. Page cleaning: instead of this, work on this. Fast heuristic DOM processing
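The slide only names "fast heuristic DOM processing", so the following is a minimal sketch of one possible heuristic, not the Squido implementation. The set of skipped tags (`SKIP_TAGS`) and the helper names `PageCleaner` and `clean_page` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical heuristic: subtrees under these tags are treated as boilerplate.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}

class PageCleaner(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def clean_page(html: str) -> str:
    """Return the text of a page with navigation-like subtrees removed."""
    parser = PageCleaner()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<p>Hydrogen storage for fuel cells is an active research topic.</p>
<footer>Copyright 2011</footer>
</body></html>"""
print(clean_page(sample))
```

A single pass over the markup keeps the extra processing step cheap, and later linguistic steps see less text, which is the trade-off the talk raises about page cleaning.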
  • 22. Shallow extraction: Web docs; format parse; detect language; tokenize; sentence split; gazetteers; grammar; index
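As a rough illustration of the sentence split, tokenize and gazetteer steps listed on this slide, here is a toy sketch. The gazetteer entries, the regular expressions and the function names are all assumptions; the language detection and grammar stages of the real pipeline are not shown.

```python
import re

# Toy gazetteer (hypothetical entries): lower-cased token tuples -> entity type.
GAZETTEER = {
    ("austin",): "Location",
    ("ann", "arbor"): "Location",
    ("university", "of", "michigan"): "Organization",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Words (with inner hyphens/apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

def shallow_extract(text):
    """Return (surface form, type) pairs found by longest-match lookup."""
    entities = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            # Try the longest candidate first so multi-word entries win.
            for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                key = tuple(t.lower() for t in tokens[i:i + n])
                if key in GAZETTEER:
                    entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                    i += n
                    break
            else:
                i += 1
    return entities

print(shallow_extract("She works at the University of Michigan. It is in Ann Arbor."))
```

Lookup against lower-cased tokens sidesteps some of the case problems described later in the talk, at the cost of more false positives on common words.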
  • 23. Deep extraction ≅ shallow extraction + elaborate linguistics: POS tagger; morpho analyzer; NP/VP chunker; grammar; ortho matcher; index
  • 24. 3. Annoyances
  • 25. Linguistic processing throughput: deep extraction too expensive when crawling; shallow extraction OK, with a penalty on quality. Workaround: asynch deep extraction on smaller collections; query time sanitization
  • 26. Page cleaning: need evaluation. Goal: ↗ accuracy? Cost: ↘ recall? Performance impact? ↘ +1 processing step; ↗ less text in later steps
  • 27. "Multiple dates" usage? Retrieve by date, sort by date?
      <DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
      <DATE TYPE="DateDay" D="11" M="2" Y="2008">February 9-13, 2008</DATE>
      <DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>
      <DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>
      <DATE TYPE="DateDay" D="16" M="2" Y="2009">February 15-18, 2009</DATE>
      <DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>
      <DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
      <DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>
      <DATE TYPE="DateDay" D="27" M="10" Y="2009">October 25-29, 2009</DATE>
      <DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>
      <DATE TYPE="DateDay" D="8" M="2" Y="2009">February 7-10, 2009</DATE>
      <DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>
      <DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>
      <DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>
      <DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>
      <DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
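In the DateDay examples on this slide, the `D` attribute matches the floored midpoint of each day range (for instance 10-13 gives 11, 24-29 gives 26). That is an inference from the listed values, not something the slide states. Under that assumption, a sort key for such mentions could be sketched as follows; the regular expression and the name `sort_key` are hypothetical, and DateMonth mentions are deliberately not handled.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches "February 10-13, 2008", "October 5-9 2008" and "May 2, 2008".
DAY_RANGE = re.compile(r"^(\w+) (\d{1,2})(?:-(\d{1,2}))?,? (\d{4})$")

def sort_key(mention: str) -> date:
    """Map a same-month day range to a single day: the floored midpoint."""
    m = DAY_RANGE.match(mention.strip())
    if m is None:
        raise ValueError(f"unsupported date mention: {mention!r}")
    month, first, last, year = m.groups()
    first = int(first)
    last = int(last) if last else first   # single day: midpoint is the day itself
    return date(int(year), MONTHS[month], (first + last) // 2)

mentions = ["October 5-9 2008", "May 2, 2008", "February 10-13, 2008"]
print(sorted(mentions, key=sort_key))
```

Collapsing a range to one day makes "sort by date" straightforward, but "retrieve by date" would still need the original range to match a query that falls on the first or last day.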
  • 28. Publishing date? Critical for time centric searches. Example: published 05/2011, tagged as 7 jul 2011
  • 29. & many more…
      • wrong spelling: Tapei → Taipei
      • location is also a first name: "University of Michigan, Ann Arbor, MI" → Ann Arbor (person)
      • compound first names: "Jean-Claude Marin" → Claude Marin
      • wrong character case (very frequent on titles) breaks all case-based rules: barrack obama → not extracted; How To Buy Electric Trucks → Buy Electric (organization); In Virginia Life Is Sweet → Virginia Life (person)
      • polymorphism: "Nagy Bocsa", "Nagy-Bocsa", "Nagy"
      • sanitize parser output for tokenization: transliteration, case, punctuation, …
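The sanitization items above (transliteration, case, punctuation) can be illustrated with a small normalization sketch. The function names `canonical` and `group_variants` are hypothetical, and this only covers surface variants: it does not correct misspellings such as "Tapei", nor does it merge a shortened form like "Nagy" with the full name.

```python
import re
import unicodedata

def canonical(mention: str) -> str:
    """Fold accents, case and punctuation so surface variants compare equal."""
    # Transliteration: decompose accented characters, drop combining marks.
    text = unicodedata.normalize("NFKD", mention)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Punctuation: treat hyphens and other non-word characters as spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Case and whitespace.
    return " ".join(text.casefold().split())

def group_variants(mentions):
    """Group mentions that share the same canonical form."""
    groups = {}
    for mention in mentions:
        groups.setdefault(canonical(mention), []).append(mention)
    return groups

print(group_variants(["Nagy Bocsa", "Nagy-Bocsa", "Nagy", "nagy  bocsa"]))
```

Because the canonical form is only used as a grouping key, the original surface forms stay available for display.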
  • 30. 4. Results
  • 31. Reminder: the next results are obtained automatically from unstructured content picked on the web by an autonomous system, without previous knowledge of the topic or the visited Web sites
  • 32. Let's try it with a use case: "hydrogen storage for fuel cells". Run a few cycles (shallow extraction only): some 10^4 pages. What's inside a collection of 66 highly ranked documents? Entity weight function (tf-idf, …) over People, Orgs, Location, Time
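The slide mentions an entity weight function "(tf-idf, …)" without giving its exact form. The sketch below uses the textbook tf-idf definition summed over the collection, which is an assumption; the function name `entity_weights` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def entity_weights(docs):
    """docs: list of entity-mention lists, one list per document.
    Returns {entity: tf-idf weight summed over the collection}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each entity appear?
    doc_freq = Counter()
    for mentions in docs:
        doc_freq.update(set(mentions))
    weights = Counter()
    for mentions in docs:
        if not mentions:
            continue
        counts = Counter(mentions)
        for entity, count in counts.items():
            tf = count / len(mentions)                  # relative frequency in this doc
            idf = math.log(n_docs / doc_freq[entity])   # 0 if present in every doc
            weights[entity] += tf * idf
    return dict(weights)

docs = [["Honda", "Austin", "Honda"], ["Austin", "Argonne"], ["Austin"]]
for entity, weight in sorted(entity_weights(docs).items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {weight:.3f}")
```

With this definition an entity present in every document gets weight zero, which pushes ubiquitous names down and lets rarer, potentially more informative entities surface.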
  • 33. Special attention paid to so-called outliers
  • 34. Organizations > 900: overload… Page cleaning + entity sanitization => better details & accuracy
  • 35. ↗ attention ↘ information: top 50. Academic team? H2 military usage? New questions are instantly popping up
  • 36. People: authors lead to relevant content (classic IR method, even in libraries!)
  • 37. Countries: political threats on Lithium battery supplies; argument in favor of H2 technology
  • 38. Cities: "Austin is in a unique position to offer its electric grid as a real world proving ground". "Direct Methanol Fuel Cells" ⇒ alternative to H2!!!
  • 39. Multiple-dates timeline: outlook, history; domains, time. "changeover from nickel to lithium will be complete by 2016 and 2018". "Honda President Takanobu Ito says around 10 percent of Honda’s global sales will be hybrids by 2015"
  • 40. In a few clicks... Hydrogen storage for fuel cells? DMFC alternative to H2; Austin, TX; changeover from nickel to lithium by 2016/2018
  • 41. 5. Perspectives
  • 42. To clean or not to clean? Performance impact; "attention" impact. Run pipeline with/without cleaning: corpus; label examples +/-; clean set; full set; time full pipeline
  • 43. Publishing date extraction: heuristic DOM processing; prototype ready; need large scale evaluation; build gold standard from RSS feeds
  • 44. A zest of Linked Data? Too slow & fat for crawling... Use it "offline": disambiguation, gazetteers, infoboxes, ...
  • 45. Play with graphs: entity co-occurrence, page similarity, ...
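As one possible reading of "entity co-occurrence" and "page similarity", the sketch below counts how many pages two entities share and compares pages by the Jaccard index of their entity sets. Both choices, and the names `cooccurrence_edges` and `jaccard`, are assumptions; the talk does not say which measures would be used.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(pages):
    """pages: iterable of entity collections, one per page.
    Returns a Counter mapping sorted entity pairs to the number of shared pages."""
    edges = Counter()
    for entities in pages:
        # Deduplicate and sort so each unordered pair is counted once per page.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

def jaccard(page_a, page_b):
    """Page similarity as the Jaccard index of the two entity sets."""
    a, b = set(page_a), set(page_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

pages = [["Austin", "DMFC", "Honda"], ["Austin", "DMFC"], ["Honda"]]
print(cooccurrence_edges(pages).most_common())
print(jaccard(pages[0], pages[1]))
```

The edge counts can be fed directly to a graph layout or clustering tool as edge weights.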
  • 46. UI/user experience: search facets, word clouds, maps, dashboards, infoboxes, highlighting, graphs
  • 47. Another kind of projection: Lexical Taxonomies Induction. 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22nd, 2011
  • 48. 6. Digest: a. A real need of Attention-saving… b. WebNEM results are encouraging c. Work in progress, lots of paths to explore
  • 49. "There's simply too much information out there." "Leaders feel misled. Stupid. Trapped."
  • 50. Final word by Herbert Simon: "Filtering by intelligent programs is the main part of the answer" [to information overload]
  • 51. MANY THANKS! Joint work of: www.ixxo.fr; www.slideshare.net/fpouilloux; www.linkedin.com/pub/st%C3%A9phanie-jacquemont/20/271/767; www.linkedin.com/in/fpouilloux
  • 52. CREDITS
    Photos
      2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web Intelligence
      4. Designing Organizations for an Information-Rich World, The Herbert A. Simon Collection
      5. Vlad the Impaler, Wikimedia Commons
      7. Missile 9M342 of the portable anti-aircraft missile system Igla-S, ©vitalykuzmin.net
      10. Internet Map 2005, ©www.opte.org
      33. The Inspector, ©DePatie-Freleng Enterprises
      36. Nanomaterials for Solid State Hydrogen Storage, book cover, ©springer.com
      40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory
      40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole
      41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy
      44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja Jentzsch, lod-cloud.net
      44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter
      48. Views of the solar corona by the Transition Region and Coronal Explorer, Stanford-Lockheed Institute for Space Research, NASA Small Explorer program
      49. Hyperformance book cover, www.tjwaters.com
      50. Dr Simon solving puzzles, The Herbert A. Simon Collection
    Websites
      wi-iat-2011.org
      The Herbert A. Simon Collection, Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/index.html
      www.google.com
      online.barrons.com
      www.me.utexas.edu/~dmfc-muri
      www.alsace-industrie.fr
      www.hybridcars.com
      www.me.utexas.edu/blogs/meyersresearchgroup
    Bibliography
      Simon, H. A. (1971), "Designing Organizations for an Information-Rich World", Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl0002/doc0001
      Waters, T. J. (2011), "Hyperformance", www.tjwaters.com/hyperformance-excerpt.html
      Navigli, R., Velardi, P., Faralli, S. (2011), "A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch", Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22nd, 2011, pp. 1872-1877.