Named Entities

An introduction to named entities, named entity recognition (NER), and named entity disambiguation (entity linking). There is also information about how this is useful for Companybook.

Published in: Data & Analytics

  1. 1. NAMED ENTITIES March 2016 1
  2. 2. NAMED ENTITIES 2 PLAN • What? • More what? • How? • How well? • Where to? • Why?
  3. 3. NAMED ENTITIES 3
  4. 4. NAMED ENTITIES 4 Kabosu
  5. 5. NAMED ENTITIES 5 DEFINITIONS Names of, e.g.: • persons • organisations • locations • expressions of time • quantities • monetary values • percentages
  6. 6. NAMED ENTITY RECOGNITION 6 INTERACTIVE A Maine bill would allow residents of the state’s island communities to ship medical samples by ferry rather than in person. Democrats in the State Senate say the proposal was prompted by changes in the Maine State Ferry Service that make it difficult to ship samples, such as bloodwork, from islands. New policies say Maine ferries won’t transport lab work for patients anymore. That means island residents must travel to mainland hospitals to deliver the samples, which can take hours. Sen. Dave Miramant, a Camden Democrat, says a “lock box” for samples should be available on all boats. The bill was the subject of a Jan. 28 public hearing where some North Haven island residents spoke in favor of it. The state Legislature’s transportation committee will review the bill soon.
  7. 7. NAMED ENTITY RECOGNITION 7 STANFORD ONLINE NER A Maine bill would allow residents of the state’s island communities to ship medical samples by ferry rather than in person. Democrats in the State Senate say the proposal was prompted by changes in the Maine State Ferry Service that make it difficult to ship samples, such as bloodwork, from islands. New policies say Maine ferries won’t transport lab work for patients anymore. That means island residents must travel to mainland hospitals to deliver the samples, which can take hours. Sen. Dave Miramant, a Camden Democrat, says a “lock box” for samples should be available on all boats. The bill was the subject of a Jan. 28 public hearing where some North Haven island residents spoke in favor of it. The state Legislature’s transportation committee will review the bill soon.
  8. 8. NAMED ENTITY RECOGNITION 8 STANFORD ONLINE NER
  9. 9. NAMED ENTITY RECOGNITION 9 DEFINITIONS • Find the names of entities in text • State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.
  10. 10. NAMED ENTITY RECOGNITION 10 APPROACHES • Detection • Classification into entity type • Approaches • Grammar (rule) based • Statistical • Machine Learned
  11. 11. NAMED ENTITY RECOGNITION 11 APPROACHES • Current state-of-the-art: • Conditional Random Field
  12. 12. NAMED ENTITY RECOGNITION 12 FUTURE • Deep Learning • Based on Word2Vec • Enhance with sequence information (memory)
  13. 13. NAMED ENTITY RECOGNITION 13 OUR MOTIVATION
  14. 14. 14 FACTSHEET • Purpose: help companies grow their business faster by finding the most relevant prospects; enable companies to get strategic sales information within seconds • Approach: a big data company using intelligent algorithms to turn business data into insights and services; combine company data, key people profiles and relevant news in a unique, real-time platform • Companies: 163,000,000 companies worldwide • People: 160,000,000 key executives • News: over 2,000,000 news articles per day • Worldwide: 212 countries covered • Triggers: 1,400,000 news triggers per month • Structured • Structured • Feed • Generated
  15. 15. NAMED ENTITY RECOGNITION 15 OUR MOTIVATION Unstructured data • Crawl company data • Currently only crawling English companies • ~1B web pages
  16. 16. NAMED ENTITY RECOGNITION 16 OUR MOTIVATION
  17. 17. NAMED ENTITY RECOGNITION 17 OUR MOTIVATION Extraction: • Find people in crawled data • Find relevant business information in crawled data • Same for news
  18. 18. NAMED ENTITY RECOGNITION 18 OUR MOTIVATION Matching: • Unstructured data supporting structured data • News articles are matched to the right company • Triggers in news
  19. 19. NAMED ENTITY RECOGNITION 19 OUR MOTIVATION
  20. 20. NAMED ENTITY RECOGNITION 20 OUR MOTIVATION
  21. 21. NAMED ENTITY RECOGNITION 21 OUR MOTIVATION
  22. 22. NAMED ENTITIES 22 RECOGNITION EXPERIMENTS
  23. 23. NAMED ENTITY RECOGNITION 23 STANFORD NER • http://nlp.stanford.edu/software/CRF-NER.shtml • Dual license including GPL v2 • Conditional Random Field sequence model
  24. 24. NAMED ENTITY RECOGNITION 24 STANFORD NER • Detects many entities • Detects companies, person names, titles, locations • Detects many things that are not entities • Easily fooled by Titlecase • Easily fooled by abbreviations
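To illustrate why Titlecase is a hazard for entity detection, here is a minimal sketch of a purely rule-based spotter. It is not Stanford NER (which uses a CRF model); the function name `spot_candidates` and the regular expression are assumptions made up for this illustration.

```python
import re

# One or more consecutive capitalised words, e.g. "Maine State Ferry Service".
# Hypothetical rule for illustration only; real NER systems use far richer features.
CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def spot_candidates(text: str) -> list[str]:
    """Return every run of capitalised words as an entity candidate."""
    return CAPITALISED_RUN.findall(text)


sentence = "New policies say Maine ferries won't transport lab work for patients anymore."
print(spot_candidates(sentence))  # ['New', 'Maine']

headline = "Maine Bill Would Allow Ferry Shipping"
print(spot_candidates(headline))  # ['Maine Bill Would Allow Ferry Shipping']
```

The sentence-initial "New" is reported as a candidate, and a Titlecase headline collapses into one long false entity, which is the kind of failure the slide describes.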
  25. 25. NAMED ENTITY RECOGNITION 25 INTERACTIVE A Maine bill would allow residents of the state’s island communities to ship medical samples by ferry rather than in person. Democrats in the State Senate say the proposal was prompted by changes in the Maine State Ferry Service that make it difficult to ship samples, such as bloodwork, from islands. New policies say Maine ferries won’t transport lab work for patients anymore. That means island residents must travel to mainland hospitals to deliver the samples, which can take hours. Sen. Dave Miramant, a Camden Democrat, says a “lock box” for samples should be available on all boats. The bill was the subject of a Jan. 28 public hearing where some North Haven island residents spoke in favor of it. The state Legislature’s transportation committee will review the bill soon.
  26. 26. NAMED ENTITY RECOGNITION 26 STANFORD NER Democrats in the State Senate say the proposal was prompted by changes in the Maine State Ferry Service that make it difficult to ship samples, …, from islands. … The state Legislature’s transportation committee will review the bill soon. Identified as organisations: • State Senate • Maine State Ferry Service • Legislature
  27. 27. NAMED ENTITY RECOGNITION 27 STANFORD NER Democrats in the State Senate say the proposal was prompted by changes in the Maine State Ferry Service that make it difficult to ship samples, …, from islands. … The state Legislature’s transportation committee will review the bill soon. Not really named - should be: • State Senate - The Maine State Senate • Legislature - The Maine State Legislature
  28. 28. NAMED ENTITY RECOGNITION 28 STANFORD NER Sen. Dave Miramant, a Camden Democrat, says Identified as Person: • Dave Miramant
  29. 29. NAMED ENTITY RECOGNITION 29 STANFORD NER A Maine bill would allow residents of the state’s island communities to ship medical samples by ferry rather than in person. Identified as Location: • Maine
  30. 30. NAMED ENTITY RECOGNITION 30 STANFORD NER
  31. 31. NAMED ENTITY RECOGNITION 31 STANFORD NER
  32. 32. NAMED ENTITY RECOGNITION 32 OUR MOTIVATION Matching: • The correct company • The correct person • The correct location DISAMBIGUATION!
  33. 33. NAMED ENTITY DISAMBIGUATION 33 THE PROBLEM
  34. 34. NAMED ENTITY DISAMBIGUATION 34 THE PROBLEM • Which Apple: Apple Inc. or Apple Corps?
  35. 35. NAMED ENTITY DISAMBIGUATION 35 THE PROBLEM • Agfa Apogee or Apogee Electronics?
  36. 36. NAMED ENTITIES 36 DISAMBIGUATION • ERD Challenge at SIGIR 2014 • Most (all) solutions based around Web Search (Bing) and Wikipedia
  37. 37. NAMED ENTITIES 37 DISAMBIGUATION SMAPH
  38. 38. NAMED ENTITY DISAMBIGUATION 38 SMAPH • Annotator - source 1 • Normal Search - source 2 • Wikisearch - source 3
  39. 39. NAMED ENTITY DISAMBIGUATION 39 SMAPH 1. Fetching – from a search engine 2. Spotting – parse results to identify candidate mentions for the entities to be annotated. 3. Candidate generation • from the Wikipedia pages occurring in the search results • from an existing annotator, using the mentions identified in the spotting step 4. Pruning – binary SVM classifier
  40. 40. NAMED ENTITY DISAMBIGUATION 40 SMAPH • Annotator - source 1 • Normal Search - source 2 • Wikisearch - source 3
  41. 41. NAMED ENTITY DISAMBIGUATION 41 SMAPH F1: 62.9%
  42. 42. NAMED ENTITY DISAMBIGUATION 42 SMAPH • 60+% F1 score for disambiguation is good • The 90+% F1 score was for recognition
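The four SMAPH steps listed on slide 39 (fetching, spotting, candidate generation, pruning) can be sketched as a small pipeline. This is only a rough outline under stated assumptions, not the SMAPH implementation: the `search`, `annotate` and `keep` callables are hypothetical stand-ins for the search engine, the existing annotator and the binary SVM classifier, and the `highlights` field of a search result is an invented shape for the spotted mentions.

```python
def smaph_like(query, search, annotate, keep):
    """Toy outline of the four steps; all callables are hypothetical stand-ins."""
    results = search(query)  # 1. Fetching from a search engine

    # 2. Spotting: collect candidate mentions from the results
    #    (assumption: each result carries a "highlights" list).
    mentions = sorted({m for r in results for m in r["highlights"]})

    candidates = set()
    marker = "wikipedia.org/wiki/"
    for r in results:  # 3a. Wikipedia pages occurring in the search results
        if marker in r["url"]:
            candidates.add(r["url"].split(marker, 1)[1])
    for m in mentions:  # 3b. An existing annotator applied to the mentions
        candidates.update(annotate(m))

    # 4. Pruning: keep() stands in for the binary SVM classifier.
    return sorted(c for c in candidates if keep(query, c))


def fake_search(query):
    return [
        {"url": "https://en.wikipedia.org/wiki/Apple_Inc.", "highlights": ["Apple"]},
        {"url": "https://example.com/news", "highlights": ["Apple", "iPhone"]},
    ]


def fake_annotate(mention):
    return {"Apple": ["Apple_Inc.", "Apple_Corps"], "iPhone": ["IPhone"]}.get(mention, [])


def fake_keep(query, candidate):
    return candidate != "Apple_Corps"


print(smaph_like("apple iphone sales", fake_search, fake_annotate, fake_keep))
# ['Apple_Inc.', 'IPhone']
```

Passing the three sources in as functions keeps the outline runnable without a web search API, which the later slides note was not available for these experiments.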
  43. 43. NAMED ENTITIES 43 DISAMBIGUATION EXPERIMENTS
  44. 44. WIKIPEDIA 44 Numbers • Pacific Standard - none • Ohio State - Ohio_State_Buckeyes • Panama Golf Club - none • Poors Reafirms Ecopetrol S.A. - none • Ecopetrol S.A. BOGOTA - Ecopetrol • Standard and Poors - Standard_%26_Poor%27s • Ecopetrol - Ecopetrol • Poors - none • Company - none • WVEC - WVEC • Sentara Norfolk General Hospital - Sentara_Norfolk_General_Hospital • Google - Google
  45. 45. WIKIPEDIA 45 GOLD SET NUMBERS • Total number of detected entities: 328 • Actual positive: 197 • Actual negative: 131 • Should have been a job for Mechanical Turk • Not all detected entities are actually entities
  46. 46. WIKIPEDIA 46 SYSTEM • Docker • Elasticsearch v2.2 • English Wikipedia index • wikiparse: https://github.com/andrewvc/wikiparse
  47. 47. WIKIPEDIA 47 SCORING • Actual negative - none • Actual positive - something other than none • Predicted positive - something, not necessarily the correct one • Predicted negative - none • A predicted positive that is incorrect is treated as an actual negative
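The scoring rules on this slide can be written down as a small function that assigns each (gold, predicted) pair to a confusion-matrix cell. This is a sketch of one reading of the rules; the function names are made up here, `None` stands for "none", and the predictions in the example are invented purely to exercise each cell.

```python
from collections import Counter


def confusion_cell(gold, predicted):
    """Map one (gold, predicted) pair to TP, FP, FN or TN; None means 'none'."""
    if predicted is not None:
        # Any prediction that is not exactly the gold page counts as a false
        # positive, i.e. an incorrect prediction is treated as an actual negative.
        return "TP" if predicted == gold else "FP"
    return "FN" if gold is not None else "TN"


def confusion_counts(pairs):
    return Counter(confusion_cell(gold, predicted) for gold, predicted in pairs)


# Gold values follow the gold-set slide; the predictions are hypothetical.
pairs = [
    ("Google", "Google"),                            # correct page: TP
    ("Ohio_State_Buckeyes", "Ohio_State_University"),  # wrong page: FP
    (None, "Company"),                               # gold is none, system answered: FP
    ("WVEC", None),                                  # system found nothing: FN
    (None, None),                                    # correctly nothing: TN
]
print(confusion_counts(pairs))
```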
  48. 48. WIKIPEDIA 48 QUERY POST { "query": { "bool": { "disable_coord": "True", "should": [ { "term": { "title": "nyse" } }, { "term": { "body": "nyse" } }, { "term": { "body": "zacks investment research" } }, { "term": { "body": "pg&e corporation" } }, { "term": { "body": "pg&e co." } } ], "must": { "term": { "title": "nyse" } }, "minimum_should_match": 1 } } }
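As a rough sketch, a query body of this shape could be assembled from a detected entity and the other entities found in the same text. The helper name `build_wikipedia_query` and its `require_title` flag are assumptions for illustration, not the code used for the experiments; lowercasing the terms assumes the indexed text was lowercased at index time, as the lowercase terms on the slide suggest. No request is sent here.

```python
import json


def build_wikipedia_query(entity, context_terms, require_title=True):
    """Build a bool query shaped like the one on the slide (hypothetical helper)."""
    term = entity.lower()  # assumption: term queries must match lowercased tokens
    should = [{"term": {"title": term}}, {"term": {"body": term}}]
    should += [{"term": {"body": c.lower()}} for c in context_terms]
    bool_query = {"disable_coord": "True", "should": should}  # value copied from the slide
    if require_title:
        # Mandatory title match; pass require_title=False to drop it.
        bool_query["must"] = {"term": {"title": term}}
    bool_query["minimum_should_match"] = 1
    return {"query": {"bool": bool_query}}


query = build_wikipedia_query(
    "NYSE", ["Zacks Investment Research", "PG&E Corporation", "PG&E Co."]
)
print(json.dumps(query, indent=2))
```

Keeping the mandatory title clause behind a flag makes it easy to compare the variant with and without a required title match.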
  49. 49. WIKIPEDIA 49 INITIAL RESULTS • Predicted Positive: 23 actual positive, 52 actual negative • Predicted Negative: 138 actual positive, 115 actual negative • Precision: 31% • Recall: 14%
  50. 50. NAMED ENTITY DISAMBIGUATION 50 SCORE F1: 0.19
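The F1 score follows from the confusion-matrix counts on the initial-results slide through the standard definitions of precision and recall. A minimal sketch (the function name is chosen here for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the initial results: 23 true positives,
# 52 false positives, 138 false negatives.
p, r, f1 = precision_recall_f1(tp=23, fp=52, fn=138)
print(f"precision={p:.0%} recall={r:.0%} F1={f1:.2f}")
# precision=31% recall=14% F1=0.19
```

The two percentages printed on the slide, 31% and 14%, match precision and recall computed this way.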
  51. 51. WIKIPEDIA 51 COMMON FALSE POSITIVE SITUATION • LLC - Wikipedia, the free encyclopedia • https://en.wikipedia.org/wiki/LLC • LLC may refer to: Air transport[edit]. LLC, LHD Landing Craft, Australian variant of the LCM-1E landing craft; LLC, ICAO airline designator of FlyLAL Charters ...
  52. 52. WIKIPEDIA 52 SCORE THRESHOLD • Predicted Positive: 22 actual positive, 26 actual negative • Predicted Negative: 153 actual positive, 127 actual negative • Precision: 46% • Recall: 13%
  53. 53. NAMED ENTITY DISAMBIGUATION 53 SCORE F1: 0.20
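A score threshold of this kind can be applied to a search response by rejecting the top hit when its relevance score is too low. The sketch below assumes the usual Elasticsearch response layout (`hits.hits[*]._score` and `_source`) and a `title` field in the indexed documents; the function name, the threshold value and the scores in the example are invented, since the slides do not state them.

```python
def best_title_above(response, threshold):
    """Return the title of the top hit, or None if it scores below the threshold."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    top = max(hits, key=lambda h: h["_score"])
    if top["_score"] < threshold:
        return None  # treated as a predicted negative ("none")
    return top["_source"]["title"]


# Hypothetical response: a weak match on a disambiguation-style page.
weak = {"hits": {"hits": [{"_score": 0.4, "_source": {"title": "LLC"}}]}}
strong = {"hits": {"hits": [{"_score": 3.2, "_source": {"title": "Google"}}]}}

print(best_title_above(weak, threshold=1.0))    # None
print(best_title_above(strong, threshold=1.0))  # Google
```

Raising the threshold trades false positives for false negatives, which is consistent with the drop in false positives from 52 to 26 between the two result slides.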
  54. 54. WIKIPEDIA 54 NO MANDATORY TITLE MATCH { "query": { "bool": { "disable_coord": "True", "should": [ { "term": { "title": "nyse" } }, { "term": { "body": "nyse" } }, { "term": { "body": "zacks investment research" } }, { "term": { "body": "pg&e corporation" } }, { "term": { "body": "pg&e co." } } ], "minimum_should_match": 1 } } }
  55. 55. WIKIPEDIA 55 NO MANDATORY TITLE MATCH • Predicted Positive: 23 actual positive, 290 actual negative • Predicted Negative: 8 actual positive, 7 actual negative • Precision: 7% • Recall: 74%
  56. 56. NAMED ENTITY DISAMBIGUATION 56 SCORE F1: 0.13
  57. 57. WIKIPEDIA 57 COMMON FALSE POSITIVE SITUATION { "query": { "bool": { "disable_coord": "True", "should": [ { "term": { "title": "nyse" } }, { "term": { "body": "nyse" } }, { "term": { "body": "zacks investment research" } }, { "term": { "body": "pg&e corporation" } }, { "term": { "body": "pg&e co." } } ], "must": { "term": { "title": "nyse" } }, "minimum_should_match": 1 } } }
  58. 58. WIKIPEDIA 58 NO OPTIONAL TITLE MATCH • Predicted Positive: 1 actual positive, 312 actual negative • Predicted Negative: 8 actual positive, 7 actual negative
  59. 59. NAMED ENTITY DISAMBIGUATION 59 MY BEST RESULT F1: 19%
  60. 60. LOW SCORE 60 EXPLANATION • No web search used • No annotator • Only a Wikipedia search
  61. 61. NAMED ENTITY DISAMBIGUATION 61 SMAPH • Annotator - source 1 • Normal Search - source 2 • Wikisearch - source 3
  62. 62. OTHER SOURCES 62 WEB SEARCH
  63. 63. WEB SEARCH 63 WHY NOT? • Google prohibits bot requests • Bing allows paid API queries - expensive • Building our own web index is too expensive
  64. 64. ALTERNATIVES 64 DATABASE OF COMPANIES • If we had a database of companies • Find the company name aliases • Find key people • Products?
  65. 65. BUSINESS SEARCH ENGINE 65 COMPANY SEARCH
  66. 66. BUSINESS SEARCH ENGINE 66 COMPANY SEARCH
  67. 67. BUSINESS SEARCH ENGINE 67 COMPANY SEARCH
  68. 68. BUSINESS SEARCH ENGINE 68 COMPANY SEARCH
  69. 69. BUSINESS SEARCH ENGINE 69 ENTITY SEARCH
  70. 70. BUSINESS SEARCH ENGINE 70 ENTITY SEARCH
  71. 71. NAMED ENTITY RECOGNITION 71 STANFORD NER
  72. 72. NAMED ENTITY 72 Knut O. Hellan • CTO Companybook • Twitter: @KnutHellan
