Web Scale Named Entity Mining


Presentation given at the Industry Day of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence


Published in: Technology, Education


  1. 1. Web scale Named Entity Mining. "There's simply too much information out there." WI-IAT 2011
  2. 2. In memoriam of Herbert A. Simon …
  3. 3. Stuck. April 2011
  4. 4. Herbert Simon's Brookings Institute Lecture, "Designing Organizations for an Information-Rich World", Johns Hopkins University, September 1, 1969
  5. 5. 1. Tales & legends
  6. 6. Find & procure a crystal plastic replacement for polycarbonate LEXAN 943
      Main constraints:
      • more resistant to detergent agents than LEXAN 943 (problem of cracking under the combined effect of mechanical stress and exposure to detergent agents)
      • compatible with existing tools: withdrawal (shrinkage) must be close to LEXAN 943
      • optical characteristics close to LEXAN 943
      • weldable by ultrasonic welding
      • compliant with fire & smoke resistance requirements 2 according to NFF16-101/102 and V0 according to standard UL 94
      Delay: one week
      (organization-centric search)
  7. 7. Where is the SA-24 Grinch 9K338 Igla-S portable air defense missile system sold/operated? (location-centric search)
  8. 8. Recent information (past month) about the call for proposals "outils Web innovants en entreprise" ("innovative Web tools in the enterprise")? (time-centric search)
  9. 9. "Pro" searches focus on named entities: Location, Orgs, People, Time
  10. 10. 2. Introducing WebNEM
  11. 11. Attention-greedy & burdensome: product specifications → find compliant products → get manufacturer or distributor. Relevant query? Query again? Where? + browsing/ranking results
  12. 12. "SA-24 Grinch 9K338 Igla-S". Goal: attention-saver process
  13. 13. Exploratory data analysis of high dimensional data
  14. 14. "In exploratory data analysis of high dimensional data one of the main tasks is the formation of a simplified, usually visual, overview of data sets..... Clustering and projection are among the examples of useful methods to achieve this task." Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
  15. 15. WebNEM: collection of relevant data, anywhere in the web + projection on Named Entities space. Building blocks: topical web crawler, named entity recognition, visualization/exploratory analysis tools
  16. 16. "Web scale" collection: brute force. Never-ending crawl; a priori "whole" Web indexing; general index, "everywhere"; user query gets a fast answer on "any" topic; huge resources required (data size based)
  17. 17. "Web scale" collection: our approach. User query triggers an on-demand topical crawl of Web slices; tailored index, built on order, anywhere relevant; delayed answer, but less garbage; "close to optimal" resources (usage based)
  18. 18. Projection: when to extract entities? Named Entity Recognition is resource intensive
      process step | data       | size  | required response time
      crawl time   | whole web  | 10^10 | asynchronous
      query time   | collection | 10^2  | real-time
      crawl time   | web slice  | 10^4  | asynchronous
  19. 19. www.squido.fr: our SaaS Web mining system. Large scale Named Entity extraction (EN/FR). Beta released to customers June 2011
  20. 20. WebNEM with Squido: search topic → focused crawl → page cleaning → shallow entity extraction → index → visualization; user queries → user collections → deep entity extraction → visualization
  21. 21. Page cleaning: instead of this, work on this. Fast heuristic DOM processing
  22. 22. Shallow extraction: Web docs → format parse → detect language → tokenize → sentence split → gazetteers → grammar → index
  23. 23. Deep extraction ≅ shallow extraction + elaborate linguistics: POS tagger, morpho analyzer, NP/VP chunker, grammar, ortho matcher, index
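The shallow extraction steps named on slide 22 (sentence split, tokenize, gazetteer lookup) can be illustrated with a minimal sketch. This is not the Squido implementation: the gazetteer entries, the function names, and the naive regex-based splitter and tokenizer are all assumptions made for the example, and language detection and the grammar step are left out.

```python
import re

# Hypothetical gazetteer (surface form -> entity type), invented for this sketch.
GAZETTEER = {
    "austin": "Location",
    "lithuania": "Location",
    "honda": "Organization",
    "takanobu ito": "Person",
}
# Longest gazetteer entry, in tokens, bounds the lookup window.
MAX_NGRAM = max(len(key.split()) for key in GAZETTEER)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, hyphens kept inside words."""
    return re.findall(r"\w+(?:-\w+)*", sentence)


def shallow_extract(text):
    """Return (surface form, type) pairs using longest-match gazetteer lookup."""
    entities = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            # Try the longest candidate first, then shorter ones.
            for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                etype = GAZETTEER.get(candidate.lower())
                if etype:
                    entities.append((candidate, etype))
                    i += n
                    break
            else:
                i += 1  # no entry starts at this token
    return entities
```

The lookup is case-insensitive on purpose, so lower-cased input still matches; a case-based grammar layered on top would not have that property.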
  24. 24. 3. Annoyances
  25. 25. Linguistic processing throughput: deep extraction is too expensive when crawling; shallow extraction is OK, with a penalty on quality. Workaround: asynch deep extraction on smaller collections, query time sanitization
  26. 26. Page cleaning: need evaluation. Goal: ↗ accuracy? Cost: ↘ recall? Performance impact? ↘ +1 processing step, ↗ less text in later steps
  27. 27. "Multiple dates" usage? Retrieve by date? Sort by date?
      <DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
      <DATE TYPE="DateDay" D="11" M="2" Y="2008">February 9-13, 2008</DATE>
      <DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>
      <DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>
      <DATE TYPE="DateDay" D="16" M="2" Y="2009">February 15-18, 2009</DATE>
      <DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>
      <DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
      <DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>
      <DATE TYPE="DateDay" D="27" M="10" Y="2009">October 25-29, 2009</DATE>
      <DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>
      <DATE TYPE="DateDay" D="8" M="2" Y="2009">February 7-10, 2009</DATE>
      <DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>
      <DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>
      <DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>
      <DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>
      <DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
  28. 28. Publishing date? Critical for time-centric searches. Example: published 05/2011, tagged as 7 jul 2011
  29. 29. & many more…
      • wrong spelling: Tapei → Taipei
      • location is also a first name: "University of Michigan, Ann Arbor, MI" → Ann Arbor (person)
      • compound first names: "Jean-Claude Marin" → Claude Marin
      • wrong character case (very frequent in titles) breaks all case-based rules: barrack obama → not extracted; How To Buy Electric Trucks → Buy Electric (organization); In Virginia Life Is Sweet → Virginia Life (person)
      • polymorphism: "Nagy Bocsa", "Nagy-Bocsa", "Nagy"
      • sanitize parser output for tokenization: transliteration, case, punctuation, …
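One way to explore the "multiple dates" question from slide 27 (retrieve by date? sort by date?) is to read the normalized D/M/Y attributes back out of the DATE annotations and sort on them. The sketch below assumes the annotation format shown on that slide; the function names are hypothetical, and treating a month-only annotation as the first day of the month is an assumption made here purely to obtain a sort key.

```python
import re
from datetime import date

# Matches one annotation as shown on slide 27, e.g.
# <DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
DATE_TAG = re.compile(
    r'<DATE TYPE="(?P<type>\w+)"(?P<attrs>[^>]*)>(?P<text>.*?)</DATE>'
)
ATTR = re.compile(r'(\w+)="(\d+)"')


def parse_dates(annotated):
    """Return a list of (sort key, annotation type, surface text) tuples."""
    results = []
    for match in DATE_TAG.finditer(annotated):
        attrs = {name: int(value) for name, value in ATTR.findall(match.group("attrs"))}
        # Assumption: a missing day (DateMonth) sorts as day 1 of that month.
        key = date(attrs["Y"], attrs.get("M", 1), attrs.get("D", 1))
        results.append((key, match.group("type"), match.group("text")))
    return results


def sort_by_date(annotated):
    """Return the annotations in chronological order of their normalized date."""
    return sorted(parse_dates(annotated), key=lambda item: item[0])
```

Sorting on the normalized attributes rather than on the surface text is what makes "October 5-9 2008" and "August 2008" comparable at all; how a date range should be retrieved (by any day in the range, or only by its normalized day) remains the open question the slide raises.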
  30. 30. 4. Results
  31. 31. Reminder: the following results are obtained automatically from unstructured content picked on the web by an autonomous system, without previous knowledge of the topic or the visited Web sites
  32. 32. Let's try it with a use case: "hydrogen storage for fuel cells". Run a few cycles (shallow extraction only): some 10^4 pages. Entity weight function (tf-idf, …). What's inside a collection of 66 highly ranked documents? People, Orgs, Location, Time
  33. 33. Special attention paid to so-called outliers
  34. 34. Organizations > 900: overload… Page cleaning + entity sanitization => better details & accuracy
  35. 35. ↗ attention, ↘ information: top 50. Academic team? H2 military usage? New questions are instantly popping up
  36. 36. People: authors lead to relevant content (classic IR method, even in libraries!)
  37. 37. Countries: political threats on Lithium battery supplies, an argument in favor of H2 technology
  38. 38. Cities: "Austin is in a unique position to offer its electric grid as a real world proving ground". "Direct Methanol Fuel Cells" ⇒ alternative to H2!!!
  39. 39. Multiple-dates timeline (outlook, history, domains, time): "changeover from nickel to lithium will be complete by 2016 and 2018"; "Honda President Takanobu Ito says around 10 percent of Honda’s global sales will be hybrids by 2015"
  40. 40. In a few clicks... hydrogen storage for fuel cells? DMFC alternative to H2; Austin, TX; changeover from nickel to lithium by 2016/2018
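Slide 32 mentions an entity weight function "(tf-idf, …)" without giving the formula. As a rough illustration only, the sketch below uses one common tf-idf variant (collection-wide frequency times the log of inverse document frequency); the function name and the input layout are assumptions, not the weighting WebNEM actually applies.

```python
import math
from collections import Counter


def entity_tfidf(docs):
    """Weight entities across a collection.

    docs: list of lists of entity strings, one inner list per document.
    Returns {entity: total frequency * log(N / document frequency)}.
    """
    n_docs = len(docs)
    term_freq = Counter()  # occurrences over the whole collection
    doc_freq = Counter()   # number of documents mentioning the entity
    for entities in docs:
        term_freq.update(entities)
        doc_freq.update(set(entities))
    return {e: term_freq[e] * math.log(n_docs / doc_freq[e]) for e in term_freq}
```

With this variant an entity present in every document gets weight 0, which pushes ubiquitous names down and lets rarer ones surface, in the spirit of the attention paid to outliers on slide 33.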
  41. 41. 5. Perspectives
  42. 42. To clean or not to clean? Performance impact: run the pipeline with/without cleaning on a corpus (clean set vs. full set), time the full pipeline. "Attention" impact: label examples +/-
  43. 43. Publishing date extraction: heuristic DOM processing, prototype ready. Need large scale evaluation: build gold standard from RSS feeds
  44. 44. A zest of Linked Data? Too slow & fat for crawling... use it "offline": disambiguation, gazetteers, infoboxes, ...
  45. 45. Play with graphs: entity co-occurrence, page similarity, ...
  46. 46. UI/user experience: search facets, word clouds, maps, dashboards, infoboxes, highlighting, graphs
  47. 47. Lexical Taxonomies Induction: another kind of projection. 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22nd, 2011
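The entity co-occurrence graph listed under "Play with graphs" (slide 45) can be sketched as a weighted edge list. The function name and the input layout below are hypothetical, and counting pages that mention both entities is only one possible edge weight.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages):
    """Build co-occurrence edges between entities.

    pages: iterable of entity collections, one per page.
    Returns a Counter mapping a sorted entity pair to the number of pages
    in which both entities appear.
    """
    edges = Counter()
    for entities in pages:
        # Deduplicate per page and sort so that (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges
```

The resulting Counter can be handed to any graph or visualization library as a weighted edge list.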
  48. 48. 6. Digest: a. A real need of Attention-saving… b. WebNEM results are encouraging. c. Work in progress, lots of paths to explore
  49. 49. "There's simply too much information out there." "Leaders feel misled. Stupid. Trapped."
  50. 50. Final word by Herbert Simon: "Filtering by intelligent programs is the main part of the answer" [to information overload]
  51. 51. MANY THANKS! Joint work of www.linkedin.com/pub/st%C3%A9phanie-jacquemont/20/271/767 and www.linkedin.com/in/fpouilloux. www.ixxo.fr, www.slideshare.net/fpouilloux
  52. 52. CREDITS
      Photos
      2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web Intelligence
      4. Designing Organizations for an Information-Rich World, The Herbert A. Simon Collection
      5. Vlad the Impaler, Wikimedia Commons
      7. Missile 9M342 of the portable anti-aircraft missile system Igla-S, ©vitalykuzmin.net
      10. Internet Map 2005, ©www.opte.org
      33. The Inspector, ©DePatie-Freleng Enterprises
      36. Nanomaterials for Solid State Hydrogen Storage, book cover, ©springer.com
      40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory
      40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole
      41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy
      44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja Jentzsch, lod-cloud.net
      44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter
      48. Views of the solar corona by the Transition Region and Coronal Explorer, Stanford-Lockheed Institute for Space Research, NASA Small Explorer program
      49. Hyperformance book cover, www.tjwaters.com
      50. Dr Simon solving puzzles, The Herbert A. Simon Collection
      Websites
      wi-iat-2011.org
      The Herbert A. Simon Collection, Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/index.html
      www.google.com
      online.barrons.com
      www.me.utexas.edu/~dmfc-muri
      www.alsace-industrie.fr
      www.hybridcars.com
      www.me.utexas.edu/blogs/meyersresearchgroup
      Bibliography
      Simon, H. A. (1971), "Designing Organizations for an Information-Rich World", Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl0002/doc0001
      Waters, T. J. (2011), "Hyperformance", www.tjwaters.com/hyperformance-excerpt.html
      R. Navigli, P. Velardi, S. Faralli. A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch. Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22nd, 2011, pp. 1872-1877.