1. Web scale Named Entity Mining. "There's simply too much information out there". WI-IAT 2011
2. In memoriam of Herbert A. Simon …
3. Stuck, April 2011
4. Herbert Simon's Brookings Institution lecture, "Designing Organizations for an Information-Rich World", Johns Hopkins University, September 1, 1969
5. 1. Tales & legends
6. Find & procure a crystal plastic replacement for polycarbonate LEXAN 943
Main constraints:
• more resistant to detergent agents than LEXAN 943 (problem of cracking under the combined effect of mechanical stress and exposure to detergent agents)
• compatible with existing tools: shrinkage must be close to that of LEXAN 943
• optical characteristics close to LEXAN 943
• weldable by ultrasonic welding
• compliant with fire & smoke resistance requirements: 2 according to NFF16-101/102 and V0 according to standard UL 94
Deadline: one week
Organization-centric search
7. Where is the SA-24 Grinch 9K338 Igla-S portable air defense missile system sold or operated? Location-centric search
8. Recent information (past month) about the call for proposals "outils Web innovants en entreprise" (innovative Web tools for business)? Time-centric search
12. "SA-24 Grinch 9K338 Igla-S". Goal: attention-saver process
13. Exploratory data analysis of high dimensional data
14. "In exploratory data analysis of high dimensional data one of the main tasks is the formation of a simplified, usually visual, overview of data sets... Clustering and projection are among the examples of useful methods to achieve this task." Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their application in self-organizing maps. JOCLAD 2004, XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
15. WebNEM: collection of relevant data, anywhere in the web + projection on Named Entities space. Topical web crawler; named entity recognition; visualization/exploratory analysis tools
17. "Web scale" collection: our approach. User query; on-demand topical crawl; tailored index. Web slices: anywhere, relevant, built on order. Delayed answer, but less garbage. "Close to optimal" resources (usage based)
18. Projection: when to extract entities? Named Entity Recognition is resource intensive
process step              data size   required response time
crawl time, whole web     10^10       asynchronous
query time, collection    10^2        real-time
crawl time, web slice     10^4        asynchronous
19. www.squido.fr: our SaaS Web mining system. Large scale Named Entity extraction (EN/FR). Beta released to customers June 2011
20. WebNEM with Squido. Diagram labels: search topic; focused crawl; page cleaning; shallow entity extraction; index; user queries; user collections; deep entity extraction; visualization
28. Publishing date? Critical for time-centric searches. Published 05/2011, tagged as 7 Jul 2011
29. & many more…
• wrong spelling: Tapei → Taipei
• location is also a first name: "University of Michigan, Ann Arbor, MI" → Ann Arbor (person)
• compound first names: "Jean-Claude Marin" → Claude Marin
• wrong character case (very frequent in titles) breaks all case-based rules: barrack obama → not extracted; How To Buy Electric Trucks → Buy Electric (organization); In Virginia Life Is Sweet → Virginia Life (person)
• polymorphism: "Nagy Bocsa", "Nagy-Bocsa", "Nagy"
• sanitize parser output for tokenization: transliteration, case, punctuation, …
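The sanitizing step (transliteration, case, punctuation) can be illustrated with a minimal Python sketch. It is not the Squido implementation: the function name `sanitize` and the exact normalization rules are assumptions chosen for illustration, and the result is only a matching key for comparing spelling variants, not a full fix for the case-based extraction errors listed on this slide.

```python
import re
import unicodedata


def sanitize(text: str) -> str:
    """Return a normalized matching key for an entity mention (hypothetical helper)."""
    # Transliteration: decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Punctuation: treat hyphens and other punctuation as token separators.
    no_punct = re.sub(r"[^\w\s]", " ", unaccented)
    # Case: fold case so that case-damaged titles and names compare equal,
    # and collapse runs of whitespace.
    return " ".join(no_punct.casefold().split())


if __name__ == "__main__":
    # The two polymorphic spellings from the slide map to the same key.
    print(sanitize("Nagy-Bocsa") == sanitize("Nagy Bocsa"))
```

With such a key, "Nagy Bocsa" and "Nagy-Bocsa" can be grouped before counting; the short form "Nagy" would still need a separate alias rule.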
30. 4. Results
31. Reminder: the next results are obtained automatically from unstructured content picked on the web by an autonomous system, without previous knowledge of the topic or the visited Web sites
32. Let's try it with a use case: "hydrogen storage for fuel cells". What's inside a collection of 66 highly ranked documents? Run a few cycles (shallow extraction only): some 10^4 pages. Entity weight function (tf-idf, …). People, Orgs, Location, Time
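As a rough illustration of an entity weight function of the tf-idf family, here is a minimal Python sketch. The slide does not say which variant is used; the smoothed idf formula, the function name `entity_weights`, and the input shape (one list of entity mentions per document) are all assumptions.

```python
import math
from collections import Counter


def entity_weights(docs):
    """Sum tf-idf over a collection.

    docs: list of lists of entity strings, one list per document (assumed shape).
    Returns a dict mapping each entity to its collection-level weight.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each entity appears.
    df = Counter()
    for entities in docs:
        df.update(set(entities))
    weights = Counter()
    for entities in docs:
        tf = Counter(entities)  # term frequency inside this document
        for entity, count in tf.items():
            # Smoothed idf (one common variant, assumed here).
            idf = math.log((1 + n_docs) / (1 + df[entity])) + 1
            weights[entity] += count * idf
    return dict(weights)


if __name__ == "__main__":
    sample = [["Honda", "Austin"], ["Honda"], ["Honda", "Toyota", "Toyota"]]
    for entity, weight in sorted(entity_weights(sample).items(), key=lambda kv: -kv[1]):
        print(f"{entity}: {weight:.3f}")
```

Sorting the result by weight gives a ranked list such as a "top 50" per entity type; other weight functions could be swapped in without changing the surrounding code.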
35. ↗ attention ↘ information: top 50. Academic team? H2 military usage? New questions are instantly popping up
36. People: authors lead to relevant content (classic IR method, even in libraries!)
37. Countries: political threats on lithium battery supplies, an argument in favor of H2 technology
38. Cities: "Austin is in a unique position to offer its electric grid as a real world proving ground". "Direct Methanol Fuel Cells" ⇒ alternative to H2!
39. Multiple-dates timeline (outlook, history; domains, time). Changeover from nickel to lithium will be complete by 2016 and 2018. Honda President Takanobu Ito says around 10 percent of Honda's global sales will be hybrids by 2015
40. In a few clicks... Hydrogen storage for fuel cells? DMFC alternative to H2; Austin, TX; changeover from nickel to lithium by 2016/2018
41. 5. Perspectives
42. To clean or not to clean? Performance impact; "attention" impact. Run pipeline with/without cleaning. Corpus: label examples +/-; clean set vs. full set; time the full pipeline
43. Publishing date extraction: heuristic DOM processing. Prototype ready; needs large scale evaluation. Build gold standard from RSS feeds
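The idea of a DOM-based date heuristic checked against RSS feed dates can be sketched as follows. This is not the prototype mentioned on the slide: the metadata keys in `DATE_KEYS`, the class and function names, and the "first parseable candidate wins" rule are assumptions for illustration only.

```python
from datetime import date
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Hypothetical list of metadata keys to inspect; a real heuristic would need many more.
DATE_KEYS = {"article:published_time", "date", "dc.date", "pubdate"}


class DateCandidates(HTMLParser):
    """Collect raw date strings from <meta> and <time> elements."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = (a.get("property") or a.get("name") or "").lower()
            if key in DATE_KEYS and a.get("content"):
                self.candidates.append(a["content"])
        elif tag == "time" and a.get("datetime"):
            self.candidates.append(a["datetime"])


def extract_publishing_date(html):
    """Return the first candidate that parses as an ISO date, or None."""
    parser = DateCandidates()
    parser.feed(html)
    for raw in parser.candidates:
        try:
            return date.fromisoformat(raw[:10])  # keep only YYYY-MM-DD
        except ValueError:
            continue
    return None


def agrees_with_feed(html, rss_pub_date):
    """Compare the extracted date with an RSS pubDate string (the gold standard)."""
    gold = parsedate_to_datetime(rss_pub_date).date()
    return extract_publishing_date(html) == gold
```

Counting how often `agrees_with_feed` returns true over pages that also appear in RSS feeds would give a first accuracy estimate for such a heuristic.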
44. A zest of Linked Data? Too slow & fat for crawling... use it "offline": disambiguation, gazetteers, infoboxes, ...
45. Play with graphs: entity co-occurrence, page similarity, ...
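One simple way to build an entity co-occurrence graph is to count, for each pair of entities, the number of pages on which both appear. The sketch below assumes page-level co-occurrence and an input of one entity list per page; the function name `cooccurrence_edges` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(pages):
    """Count undirected co-occurrence edges.

    pages: iterable of entity collections, one per page (assumed shape).
    Returns a Counter mapping (entity_a, entity_b), with entity_a < entity_b,
    to the number of pages on which both entities appear.
    """
    edges = Counter()
    for entities in pages:
        # Deduplicate within a page, then sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    sample = [["Honda", "Austin", "Honda"], ["Austin", "Honda"], ["Austin", "Toyota"]]
    for (a, b), n in cooccurrence_edges(sample).most_common():
        print(f"{a} -- {b}: {n}")
```

The edge counts can be fed to any graph library or visualization tool; page similarity could be derived in a similar way by comparing the entity sets of two pages.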