Mining the History of Medicine
Sophia Ananiadou
National Centre for Text Mining
School of Computer Science
The University of Manchester
  
Outline
•  Text Mining tools for enriching content
•  Mining the History of Medicine
•  Semantic annotations:
–  entities, events
–  OCR correction
–  Time-sensitive term inventory and search
•  A prototype search system for the history of medicine
  
Text Mining Pipeline
[Figure: text processing pipeline, supported by lexica and ontologies: raw (unstructured) text → part-of-speech tagging → named entity recognition → event extraction → annotated (structured) text. Example text: "... I believe the pain was seated in the pillars of the diaphragm, as it was chiefly occasioned by hiccough ..."]
  
Impact of text mining
•  Discovery of concepts and named entities (anatomical entities, treatments, symptoms...)
•  Going a step further: extracting relationships and events from text (cause, affect)
•  Extracting opinions, attitudes, negated statements, hypotheses, contradictions...
•  Improves classification of documents
•  Improves information access
•  Enables semantic search
•  Enables question-answering, sentiment analysis, trend discovery
  
 
Mining the History of Medicine: an interdisciplinary project
•  National Centre for Text Mining, School of Computer Science (www.nactem.ac.uk)
•  Centre for the History of Science, Technology and Medicine
AHRC Digital Transformations, Big Data
  
Mining the History of Medicine
Aim: semantic search system
•  British Medical Journal (BMJ) (1840 – present) (350K articles)
•  London Medical Officer of Health reports (MoH) (1848 – 1972)
Provide different perspectives (e.g. medical, public health) on the treatment and prevention of diseases
✓  over time
✓  in different areas
Text Mining: tools, resources, infrastructure
  
Mining the History of Medicine
Search historical archives: OCR, keyword search?
BMJ archive was digitised several years ago
✓  Word error rate can be over 30%
✓  Re-OCRing is time-consuming and costly
MOH was OCRed more recently, with better results
Keyword search over huge textual archives is inefficient
  
Mining the History of Medicine
Solutions
✓  Apply OCR improvement techniques
✓  Automatically detect semantic information in text
✓  Entities: conditions, signs or symptoms, therapeutic measures
✓  Events: symptoms caused by lung disease
✓  Automatically identify how the naming of concepts changes over time
  
Mining the History of Medicine
Semantic search system
✓  Automatic query building, term suggestions
•  historical variants unknown to the user
✓  Faceted search based on article metadata
•  Date, authors, title
✓  Faceted search based on semantic content
•  Combinations of entities and/or relationships
  
Extracting semantic types, events
✓  Our text mining tools learn from a gold standard
•  subset of documents from the archives manually annotated with entities and events
✓  Corpus composition
•  25 articles from BMJ
•  4 extracts from MOH
•  70,000 words
✓  Articles from 4 key decades in medical history
•  1850s, 1890s, 1920s and 1960s
•  Capture language/terminology changes over time
  
OCR Post Correction
•  Re-OCRing BMJ is too time-consuming for the current project
•  Our solution: post-correct the original OCR output based on spell-checker output
Spell checker | % errors resolved
ASpell | 45%
Hunspell | 60%
Word | 63%
Mac OS | 58%
Hunspell: best open source spell checker
  
OCR Post Correction
✓  Hunspell produces a ranked list of spelling corrections
✓  However, the ranking is tailored to:
•  Human errors
•  General language
Our approach
Augment the Hunspell dictionary with the OpenMedSpel dictionary
Override the default Hunspell correction ranking
✓  Rank suggestions according to corpus frequency
✓  Use decade-specific word frequency lists
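The frequency-driven re-ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the toy text and the hand-typed candidate list are all hypothetical. In practice the candidates would come from the spell checker (Hunspell augmented with OpenMedSpel), and the frequency lists would be built from the archive itself.

```python
from collections import Counter


def build_decade_frequencies(documents):
    """Build one word-frequency list per decade.

    documents: iterable of (year, text) pairs (assumed input format).
    Returns {decade: Counter mapping lowercased word -> count}.
    """
    freqs = {}
    for year, text in documents:
        decade = (year // 10) * 10
        freqs.setdefault(decade, Counter()).update(text.lower().split())
    return freqs


def rerank(suggestions, year, decade_freqs):
    """Re-order spell-checker suggestions by frequency in the document's decade.

    sorted() is stable, so suggestions with equal counts keep the
    spell checker's original order.
    """
    counts = decade_freqs.get((year // 10) * 10, Counter())
    return sorted(suggestions, key=lambda word: -counts[word.lower()])


# Toy data (hypothetical): two short 1840s snippets and a candidate list
# in the order a general-purpose spell checker might return it.
toy_corpus = [
    (1840, "his friends would not allow the operation"),
    (1845, "he would not take the medicine"),
]
decade_freqs = build_decade_frequencies(toy_corpus)
candidates = ["world", "would"]
reranked = rerank(candidates, 1840, decade_freqs)
print(reranked)  # ['would', 'world']
```

Keeping the spell checker's order for ties means the frequency list only overrides the default ranking when the corpus actually provides evidence for a different choice.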
  
JOHN MILLER, aged 26, glass-blower of Gateshead, was
adnmitted August 27th, 1840, unider the care of Sir John
Fife, with the usual syptnpoms of stone in the bladder,
wlich he lhas laboured under from childhood. At the age of
12 years he was admitted into thtis infirmaryI, buit his
frienids wotld niot allow the operation to be performed.
Ile took a considerable (luantity of medicine, which
greatly relieved him. Since then the symptomt of stone
have been constantly presenit, but have not been very
urgent. A.- About a month ago, after batlhing, the
symptoms became much arggravated-so much so, as to comlpel
him to apply again to the hospital for relief. On
admission he was sallow and emaciated; laboured iunder
symptoms of stone of a high degree, with great
irritability of bladder, giviing rise to a frequient
desire to inicturate, and occasionally to the in-voluntary
discharge of faces from straining. Ile passed a larg,e
quantity of mucus with his urine;-pulse small, tongue
white, sliglht tlhirst, and appetite good.
	
  
BMJ OCR Errors – Example from 1840
  
Basic Hunspell Output
  
JOHN MILLER, aged 26, glass-blower of Gateshead, was
admitted August 27th, 1840, under the care of Sir John
Fife, with the usual symptoms of stone in the bladder,
which he has laboured under from childhood. At the age of
12 years he was admitted into this infirmary, but his
friends world into allow the operation to be performed.
Ike took a considerable (quantity of medicine, which
greatly relieved him. Since then the symptom of stone have
been constantly present, but have not been very urgent.
A.- About a month ago, after bathing, the symptoms became
much aggravated-so much so, as to compel him to apply
again to the hospital for relief. On admission he was
sallow and emaciated; laboured under symptoms of stone of
a high degree, with great irritability of bladder, giving
rise to a frequent desire to inaccurate, and occasionally
to the involuntary discharge of faces from straining. Ike
passed a lag,e quantity of mucus with his urine;-pulse
small, tongue white, slight thirst, and appetite good.
Key to highlighting:
•  Correct changes made by Hunspell (e.g. "adnmitted" → "admitted")
•  Incorrect changes made by Hunspell (e.g. "wotld niot" → "world into")
•  OCR errors not detected by Hunspell
  
New Frequency-driven Hunspell Output
  
JOHN MILLER, aged 26, glass-blower of Gateshead, was
admitted August 27th, 1840, under the care of Sir John
Fife, with the usual symptoms of stone in the bladder,
which he has laboured under from childhood. At the age of
12 years he was admitted into this infirmary, but his
friends would not allow the operation to be performed. Lie
took a considerable (quantity of medicine, which greatly
relieved him. Since then the symptoms of stone have been
constantly present, but have not been very urgent. A.-
About a month ago, after bathing, the symptoms became much
aggravated-so much so, as to compel him to apply again to
the hospital for relief. On admission he was sallow and
emaciated; laboured under symptoms of stone of a high
degree, with great irritability of bladder, giving rise to
a frequent desire to inaccurate, and occasionally to the
involuntary discharge of faces from straining. Lie passed
a large,e quantity of mucus with his urine;-pulse small,
tongue white, slight thirst, and appetite good.
Key to highlighting:
•  Corrections only obtained using the frequency-driven method
•  Corrections obtained using both the original and frequency-driven methods
•  Remaining errors with the frequency-driven method
  
Avoiding Over-Correction
Original Text | Basic Hunspell | Frequency-driven Hunspell | Frequency-driven Hunspell augmented with OpenMedSpel
antithrombotic | anthropometric | anthropometric | antithrombotic
pathophysiological | physiological | physiological | pathophysiological
thromboembelism | thrombosis | thrombosis | thromboembolism
warfarin | wayfaring | warfarin | warfarin
transoesophageal | oesophageal | oesophageal | transoesphageal
echocardiography | electroocardiography | oesophageal | echocardiography
intracardiac | intra cardiac | intracellular | intracardiac
tomogram | mammogram | tomogram | tomogram
•  We want to avoid erroneous correction of words that are already correct
  
Hymera corpus: Entity Types
Type | Description | Examples
Anatomical | An entity forming part of the human body | lung, lobe, sputum, fibroid
Biological_Entity | A living entity not part of the human body | tubercle bacilli, mould, guinea-pig, flea
Condition | Medical condition or ailment | phthisis, bronchitis, typhoid
Environmental | Environmental factor relevant to incidence/prevention/control/treatment of condition | humidity, high mountain climates, milk, linen, drains
Sign_or_Symptom | Altered physical appearance or behaviour as a probable result of injury or condition | cough, pain, rise in temperature, swollen
Subject | Individual or group of cases under discussion | asthma patients, those with negative reactions to tuberculin
Therapeutic_or_Investigational | Treatment, substance, medium or procedure, prescribed or used in investigation | atrophine sulphate, generous diet, change of air, lobectomy
  
Hymera Corpus: Events
Type | Description | Participants
Affect | An entity or event is affected, infected, changed or transformed, possibly by another entity or event | Cause: the cause of the affection; Target: the entity or event affected; Subject: the medical subject affected
  
Hymera Corpus: Events
Type | Description | Participants
Cause | An entity or event results in the manifestation of another entity or event | Cause: the cause of the event; Result: the resulting entity or event; Subject: the medical subject affected
  
Extracting entities automatically
•  Trained classifiers on Hymera corpus
•  Used dictionaries to obtain domain-specific information
✓  UMLS Metathesaurus
✓  MetaMap (NE system for UMLS concepts in text)
•  Used semantic categories that could be mapped to Hymera
  
UMLS → Corpus Mappings
MetaMap Categories | Hymera category
Anatomical Abnormality; Body Substance; Body Part, Organ, or Organ Component; Body Location or Region; Body Space or Junction; Tissue | Anatomical
Animal; Mammal; Cell; Bacterium; Organism | Biological_Entity
Food; Chemical Viewed Structurally; Element, Ion, or Isotope; Hazardous or Poisonous Substance; Substance; Natural Phenomenon or Process | Environmental
Disease or Syndrome; Pathologic Function | Condition
Clinical Drug; Amino Acid, Peptide, or Protein; Immunologic Factor; Organic Chemical; Pharmacologic Substance; Biologically Active Substance; Lipid | Therapeutic_or_Investigational
Sign or Symptom; Finding | Sign_or_Symptom
Group; Patient or Disabled Group | Subject
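The mapping above is essentially a lookup table from UMLS semantic types (as reported by MetaMap) to corpus categories. A minimal sketch of how it could be encoded is given below. The variable and function names are assumptions made for illustration, and the way MetaMap output is parsed is deliberately left out.

```python
# Corpus category -> UMLS semantic types, transcribed from the mapping slide.
CATEGORY_TO_SEMANTIC_TYPES = {
    "Anatomical": [
        "Anatomical Abnormality", "Body Substance",
        "Body Part, Organ, or Organ Component", "Body Location or Region",
        "Body Space or Junction", "Tissue",
    ],
    "Biological_Entity": ["Animal", "Mammal", "Cell", "Bacterium", "Organism"],
    "Environmental": [
        "Food", "Chemical Viewed Structurally", "Element, Ion, or Isotope",
        "Hazardous or Poisonous Substance", "Substance",
        "Natural Phenomenon or Process",
    ],
    "Condition": ["Disease or Syndrome", "Pathologic Function"],
    "Therapeutic_or_Investigational": [
        "Clinical Drug", "Amino Acid, Peptide, or Protein",
        "Immunologic Factor", "Organic Chemical", "Pharmacologic Substance",
        "Biologically Active Substance", "Lipid",
    ],
    "Sign_or_Symptom": ["Sign or Symptom", "Finding"],
    "Subject": ["Group", "Patient or Disabled Group"],
}

# Inverted index: semantic type -> corpus category.
SEMANTIC_TYPE_TO_CATEGORY = {
    semantic_type: category
    for category, semantic_types in CATEGORY_TO_SEMANTIC_TYPES.items()
    for semantic_type in semantic_types
}


def to_corpus_category(semantic_type):
    """Return the corpus category for a semantic type, or None if unmapped."""
    return SEMANTIC_TYPE_TO_CATEGORY.get(semantic_type)


print(to_corpus_category("Disease or Syndrome"))  # Condition
```

Returning None for unmapped types keeps the lookup usable as a filter: anything outside the seven corpus categories is simply ignored.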
  
Semantic Entities: Summary
✓  Even baseline system produces good performance on most categories 0.8%
✓  Environmental and Therapeutic_or_Investigational
✓  Recall generally rather lower than precision
✓  Recall can be improved using semantic information
✓  UMLS look-up preferable
✓  MetaMap slow and unreliable
✓  UMLS lookup takes advantage of our NER built-in dictionary lookup to boost performance
  
Event Recognition
•  Detects “trigger” phrases around which events are arranged: origin
•  Finds individual event participants: Subj, Result and Cause
•  Combines trigger-argument pairs into semantic structures
http://www.nactem.ac.uk/EventMine/
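To make the three steps above concrete, the sketch below shows one possible in-memory representation of a recognised event: a trigger span plus role-labelled argument spans. It is an illustrative data structure only. The class and function names are assumptions and do not describe EventMine's actual model or output format, and the trigger and arguments are located by hand here rather than by a classifier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """A stretch of text with character offsets into the sentence."""
    text: str
    start: int
    end: int


@dataclass
class Event:
    """An event: a typed trigger plus its role-labelled participants."""
    event_type: str
    trigger: Span
    arguments: dict = field(default_factory=dict)  # role -> list of Span


def find_span(sentence, phrase):
    """Locate the first occurrence of a phrase (stand-in for a recogniser)."""
    start = sentence.index(phrase)
    return Span(phrase, start, start + len(phrase))


def combine(event_type, trigger, trigger_argument_pairs):
    """Combine (role, span) pairs that share one trigger into a single event."""
    event = Event(event_type, trigger)
    for role, span in trigger_argument_pairs:
        event.arguments.setdefault(role, []).append(span)
    return event


sentence = "the pain was chiefly occasioned by hiccough"
trigger = find_span(sentence, "occasioned")
event = combine(
    "Cause",
    trigger,
    [("Result", find_span(sentence, "pain")),
     ("Cause", find_span(sentence, "hiccough"))],
)
print(event.event_type, event.trigger.text,
      {role: [s.text for s in spans] for role, spans in event.arguments.items()})
```

Storing arguments per role as lists allows an event to carry several participants of the same kind, for example two distinct causes.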
  
Event Recognition Summary
✓  Complex task
•  600 Affect and 200 Causality events
✓  Diverse triggers can make recognition a challenge
✓  Ongoing work to triple the size of manually annotated corpus
•  Different decades to provide greater evidence of language usage
✓  Extended corpus should further boost results and give wider coverage of semantics
  
Interpreting events
✓  Discourse context (meta-knowledge) affects interpretation
•  p21ras proteins are not activated by IL-2
•  Our results suggest that p21ras proteins are strongly activated by IL-2 in normal human T lymphocytes
•  Our results show that p21ras proteins are weakly activated by IL-2 in normal human T lymphocytes
✓  Contradictions, negations, speculations
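The three example sentences differ only in their discourse cues ("not", "suggest", "strongly", "weakly"). The toy tagger below illustrates the idea with hand-written cue lists. It is a deliberately simplistic sketch: the cue lists, attribute names and function name are assumptions, and a real meta-knowledge system would be trained on annotated data and would take the scope of each cue into account.

```python
import re

# Hypothetical cue lists, chosen only to cover the three example sentences.
NEGATION_CUES = {"not", "no", "never"}
SPECULATION_CUES = {"suggest", "suggests", "may", "might", "possibly"}
MANNER_CUES = {"strongly": "high", "weakly": "low"}


def meta_knowledge(sentence):
    """Assign coarse meta-knowledge attributes based on cue words."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    manner = next((MANNER_CUES[t] for t in tokens if t in MANNER_CUES), "neutral")
    return {
        "negated": any(t in NEGATION_CUES for t in tokens),
        "speculated": any(t in SPECULATION_CUES for t in tokens),
        "manner": manner,
    }


examples = [
    "p21ras proteins are not activated by IL-2",
    "Our results suggest that p21ras proteins are strongly activated by IL-2 "
    "in normal human T lymphocytes",
    "Our results show that p21ras proteins are weakly activated by IL-2 "
    "in normal human T lymphocytes",
]
results = [meta_knowledge(s) for s in examples]
for sentence, attributes in zip(examples, results):
    print(attributes, "<-", sentence)
```

All three sentences describe the same activation event, yet they receive different attribute values, which is what allows contradictions, negations and speculations to be told apart at search time.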
  
Terminological Issues
✓  Concepts can be expressed in many ways
✓  We take such variants into account to improve search
Lexical Variations
oedema | edema
whooping cough | whooping-cough
pulmonary tuberculosis | tuberculosis
respiratory diseases | diseases of the respiratory system
Semantic Variations
tuberculosis | phthisis
smallpox | variola
  
Building a Time Sensitive Terminological Resource
✓  Why not use existing resources?
•  MeSH, Disease Ontology, UMLS Metathesaurus
✓  Used in search systems to expand query terms
✓  Textual variants of concepts change over time
•  Existing resources tend to focus on contemporary terminology
•  Historically-relevant variants not covered in a comprehensive and consistent manner
  
Term Variation in BMJ: Phthisis vs. Tuberculosis
[Figure: chart comparing "Phthisis" and "Tuberculosis" in BMJ; x-axis: years 1840 to 1968 in 4-year steps; y-axis: scale 0 to 60]
  
Term Variation in BMJ: Scarlet Fever vs. Scarlatina
[Figure: chart comparing "Scarlatina" and "Scarlet Fever" in BMJ; x-axis: years 1840 to 1968 in 4-year steps; y-axis: scale 0 to 60]
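Charts like the two above rest on counting how often each variant occurs in each time period. The sketch below shows one way such counts could be gathered. It assumes documents arrive as (year, text) pairs and uses 4-year bins starting in 1840 to mirror the chart axis; the function name and the toy documents are hypothetical, and the slides do not say how the plotted values were normalised.

```python
import re
from collections import Counter, defaultdict


def variant_counts_by_period(documents, variants, start=1840, step=4):
    """Count occurrences of each term variant per time period.

    documents: iterable of (year, text) pairs (assumed input format).
    Returns {period_start_year: Counter mapping variant -> raw count}.
    """
    patterns = {
        variant: re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        for variant in variants
    }
    counts = defaultdict(Counter)
    for year, text in documents:
        period = start + ((year - start) // step) * step
        for variant, pattern in patterns.items():
            counts[period][variant] += len(pattern.findall(text))
    return counts


# Toy documents (invented for illustration, not taken from the archive).
toy_documents = [
    (1841, "A case of scarlatina. Scarlatina again prevailed."),
    (1843, "An outbreak of scarlet fever."),
    (1901, "The scarlet fever ward was full."),
]
counts = variant_counts_by_period(toy_documents, ["scarlatina", "scarlet fever"])
for period in sorted(counts):
    print(period, dict(counts[period]))
```

Raw counts favour periods with more published text, so a real comparison would probably divide by the number of articles or words in each period.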
  
Time Sensitive Medical Terminological Inventory
✓  We have created a time sensitive terminological inventory from BMJ for medical conditions
•  Complements existing terminological resources by accounting for term variant usage over long periods of time
•  Extracted term variants from 19th century medical lexica
•  Identified more variants by applying distributional semantics techniques to BMJ archives
  
Processing Historical Lexica
19th century medical lexica (NLM Archive)
Two lists of diseases considered
Name | Year | # top-level entries
Nomenclature of diseases | 1872 | 1347
Index of diseases, their symptoms, and treatment | 1882 | 522
  
Processing Historical Lexica - example
[Figure: annotated lexicon entry. Labels: English main term; Latin main term; synonym list; sub-type English term; sub-type Latin term; synonym set]
  
Processing Historical Lexica
✓  Link synonym set to UMLS using approximate string matching
✓  1,588 synonym sets matched
✓  2,422 variants not contained in UMLS
✓  Confirms lack of comprehensive historical coverage in UMLS
However:
✓  Manually-curated lexica unlikely to account for all “real usage” term variation in medical texts
✓  Automatic extraction of term variants complements manually curated variants
✓  Accounts for real usage
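The slides do not say which approximate string matching technique was used to link synonym sets to UMLS. As a hedged illustration, the sketch below uses Python's standard `difflib.SequenceMatcher` with an arbitrary similarity threshold. The function name, the 0.85 threshold and the three-entry concept list (standing in for UMLS concept names) are all assumptions.

```python
from difflib import SequenceMatcher


def match_variants(variants, concept_names, threshold=0.85):
    """Map each variant to its most similar concept name, or None.

    A variant is linked only if the best similarity ratio reaches the
    threshold; unlinked variants are candidates for "not in the resource".
    """
    matches = {}
    for variant in variants:
        scored = [
            (SequenceMatcher(None, variant.lower(), name.lower()).ratio(), name)
            for name in concept_names
        ]
        score, name = max(scored, default=(0.0, None))
        matches[variant] = name if score >= threshold else None
    return matches


# Hypothetical stand-in for a list of concept names from a modern resource.
concept_names = ["Scarlet fever", "Whooping cough", "Smallpox"]
matches = match_variants(["scarlet-fever", "scarlatina"], concept_names)
print(matches)  # {'scarlet-fever': 'Scarlet fever', 'scarlatina': None}
```

The example also shows the limit of string matching: a lexical variant such as "scarlet-fever" is linked, while a semantic variant such as "scarlatina" is not, which is why it can only reach the concept through its synonym set.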
  
TM Methods for Historical Variant Identification
✓  Pre-processing
✓  Distributional similarity measures
✓  Detection of historically relevant term variants
  
Pre-Processing BMJ
We combined two recognisers
✓  Dictionary matching from 19th century disease lexica
•  Index of Diseases
•  Nomenclature of Diseases
✓  Machine learning trained on the NCBI Disease Corpus
  
 
	
  
ü cosine	
  similarity	
  
	
  
scarlet	
  fever	
   47366.48	
   46344.15	
   30555.26	
   31896.95	
  
	
  
15241.49	
  
	
  
11759.56	
  
	
  
7407.69	
  
	
  
7388.65	
   …	
   4608.19	
  
measles	
   diptheria	
   fever	
   cough	
   death	
   pox	
   fatal	
   small	
   …	
   typhoid	
  
scarla#na	
   11474.87	
   5436.11	
   5434.60	
   2892.64	
   2413.52	
   4326.59	
  
	
  
875.16	
   2340.05	
   …	
   2071.56	
  
measles	
   diptheria	
   fever	
   cough	
   death	
   pox	
   fatal	
   small	
   …	
   typhoid	
  
Distribu-onal	
  Similarity	
  Measures	
  
38	
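Cosine similarity compares two context vectors by the angle between them: the dot product divided by the product of the vector lengths. The sketch below applies it to the nine components of the two vectors that are visible above. Because the slide elides the remaining dimensions ("…"), the value computed here is only illustrative and is not expected to reproduce the similarity score reported for the full vectors. The function name is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Visible components only, in the column order shown on the slide:
# measles, diphtheria, fever, cough, death, pox, fatal, small, typhoid.
scarlet_fever = [47366.48, 46344.15, 30555.26, 31896.95, 15241.49,
                 11759.56, 7407.69, 7388.65, 4608.19]
scarlatina = [11474.87, 5436.11, 5434.60, 2892.64, 2413.52,
              4326.59, 875.16, 2340.05, 2071.56]

similarity = cosine_similarity(scarlet_fever, scarlatina)
print(round(similarity, 4))
```

Because the measure normalises by vector length, a rare variant and a frequent variant of the same concept can still score highly as long as they occur in similar contexts.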
  
Distributional Similarity Measures
Terms most similar to "scarlet fever":
scarlatina 0.8434
whooping cough 0.7415
measles 0.7094
diphtheria 0.6689
acute pneumonia 0.6501
chicken pox 0.6407
german measles 0.6336
diarrhoeal disease 0.6301
infantile diarrhoea 0.6234
diarrhea 0.6208
croup 0.6093
mump 0.6042
typhus fever 0.5535
typhoid fever 0.5508
…
  
Access “query builder” to start search refinement

User can refine search using multiple criteria. Here we start by searching for a specific term

User enters term of interest and then clicks on “Refine Search”
  
Retrieved documents
Related terms
User selects “pulmonary tuberculosis”
Current search criteria
Frequency of “pulmonary consumption” over time

“pulmonary tuberculosis” is added to query builder
Composite time graph (automatically updated)
Terms related to “pulmonary consumption” and “pulmonary tuberculosis”
  
Further refine retrieved documents by date published

Only retain documents published up until 1950
  
Refine by automatically recognised semantic entities

Semantic refinement can be made according to different entity categories

Here, “Environmental” entities present in the current document set are shown, with their counts. The user may type into the box to narrow the list of entities displayed.

New semantic restriction added to query builder display
Number of documents further reduced
  
Results can also be refined by searching for documents containing events involving entities

User can choose between the 2 events recognised by our system

User (partially) completes template according to the types of events they are searching for. Here, the search is for documents mentioning causes of tuberculosis
  
Document display
•  Bibliographic information
•  Document content
•  Semantic information

Pie chart shows the distribution of entity types in a document. Hovering the mouse over a section reveals the entity type in question

Number of “Environmental” entities in the document
“Environmental” entity instances in the document
“Environmental” entities are highlighted

Cases where specific causes of tuberculosis are mentioned in the document
Phrases identifying Causality events are highlighted

Affect events allow us to explore, e.g., whether therapeutic measures have a positive or negative effect
  
Future developments
✓  More entity types:
•  Tools and materials used in a medical or laboratory context
•  Geographical Locations
•  Organisations (e.g. schools, workhouses, hospitals, etc.)
•  Temporal expressions
✓  More event types:
•  Associations between entities
•  Actions carried out as part of a medical examination, treatment or to resolve an issue
•  Rates - Descriptions of a reported rate relevant to public health (birth, marriage, death, disease, etc.) and comparisons between these rates
  
Future Developments - Attributes
Recognition of attributes at both the entity and event levels:
✓  Entity-based attributes
•  Age, sex, ethnicity, occupation or general health status of medical subject
•  Frequency of treatments
•  Quantities or measures relating to entities
•  Dosages of treatments
•  Intensity of a symptom
✓  Event-based attributes with discourse interpretation
•  Is an event specified as being definite or speculated?
•  If speculated, which level of certainty is ascribed to the event?
•  What is the source of the information specified in the event, e.g., does it come from the author or another source?
  
Conclusions
✓  Explore further archives using topic analysis
✓  Automatic clustering of search results according to similarities of semantic content
✓  Expand types of entities and events and further enrich BMJ
✓  Link with other archives
✓  Create a rich annotation database
  
Acknowledgments
•  BMJ and Wellcome Library for making the archives available to text mine
•  AHRC for funding
•  Team members at NaCTeM and CHSTM
•  Paul Thompson, Elizabeth Toon, Nick Duvall
•  Jacob Carter, George Kontonatsios, Riza Batista-Navarro
Investigators: S. Ananiadou, J. McNaught (NaCTeM); C. Timmermann, M. Worboys (CHSTM)
  

More Related Content

Similar to Text Mining the History of Medicine

Case history in maxillofacial surgery
Case history in maxillofacial surgeryCase history in maxillofacial surgery
Case history in maxillofacial surgerychaitanyeah
 
Case history for dental students
Case history for dental students Case history for dental students
Case history for dental students Muskan Hinduja
 
Manoj Saxena General Manager IBM Watson Solutions
Manoj Saxena General Manager IBM Watson SolutionsManoj Saxena General Manager IBM Watson Solutions
Manoj Saxena General Manager IBM Watson SolutionsIBM India Smarter Computing
 
Overactive_Bladder management plane.ppt
Overactive_Bladder  management plane.pptOveractive_Bladder  management plane.ppt
Overactive_Bladder management plane.pptKarimElattar4
 
medicalhistory (1).pdf
medicalhistory (1).pdfmedicalhistory (1).pdf
medicalhistory (1).pdfXavier875943
 
dental history taking
dental history takingdental history taking
dental history takingshabeel pn
 
various methods of orthodontic diagnosis
various methods of orthodontic diagnosisvarious methods of orthodontic diagnosis
various methods of orthodontic diagnosisKrishnaPriya914815
 
USF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptx
USF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptxUSF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptx
USF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptxphysio11
 
Case_Study1 CVA.doc
Case_Study1 CVA.docCase_Study1 CVA.doc
Case_Study1 CVA.docssuser44c2e8
 
Assessment of Mouth &Pharynx
Assessment of Mouth &PharynxAssessment of Mouth &Pharynx
Assessment of Mouth &PharynxGulshanUmbreen2
 
History Taking in Neurology.pptx
History Taking in Neurology.pptxHistory Taking in Neurology.pptx
History Taking in Neurology.pptxROSHINIBG
 
Introduction, Case history - class 1 - 14-06-23.pptx
Introduction, Case history - class 1 - 14-06-23.pptxIntroduction, Case history - class 1 - 14-06-23.pptx
Introduction, Case history - class 1 - 14-06-23.pptxAkhileshWodeyar1
 

Similar to Text Mining the History of Medicine (20)

Case history in maxillofacial surgery
Case history in maxillofacial surgeryCase history in maxillofacial surgery
Case history in maxillofacial surgery
 
Case history for dental students
Case history for dental students Case history for dental students
Case history for dental students
 
Renal calculi
Renal calculiRenal calculi
Renal calculi
 
Manoj Saxena General Manager IBM Watson Solutions
Manoj Saxena General Manager IBM Watson SolutionsManoj Saxena General Manager IBM Watson Solutions
Manoj Saxena General Manager IBM Watson Solutions
 
Overactive_Bladder management plane.ppt
Overactive_Bladder  management plane.pptOveractive_Bladder  management plane.ppt
Overactive_Bladder management plane.ppt
 
medicalhistory (1).pdf
medicalhistory (1).pdfmedicalhistory (1).pdf
medicalhistory (1).pdf
 
dental history taking
dental history takingdental history taking
dental history taking
 
History taking
History takingHistory taking
History taking
 
various methods of orthodontic diagnosis
various methods of orthodontic diagnosisvarious methods of orthodontic diagnosis
various methods of orthodontic diagnosis
 
Art of Diagnosis part 1.pptx
Art of Diagnosis part 1.pptxArt of Diagnosis part 1.pptx
Art of Diagnosis part 1.pptx
 
History Taking
History TakingHistory Taking
History Taking
 
Interns ppt.pptx
Interns ppt.pptxInterns ppt.pptx
Interns ppt.pptx
 
USF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptx
USF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptxUSF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptx
USF-Branded-HCA-Fac-Dev-Clinical-Reasoning.pptx
 
History taking
History takingHistory taking
History taking
 
Case_Study1 CVA.doc
Case_Study1 CVA.docCase_Study1 CVA.doc
Case_Study1 CVA.doc
 
Assessment of Mouth &Pharynx
Assessment of Mouth &PharynxAssessment of Mouth &Pharynx
Assessment of Mouth &Pharynx
 
History Taking in Neurology.pptx
History Taking in Neurology.pptxHistory Taking in Neurology.pptx
History Taking in Neurology.pptx
 
Gsur 302
Gsur 302Gsur 302
Gsur 302
 
Gsur 302
Gsur 302Gsur 302
Gsur 302
 
Introduction, Case history - class 1 - 14-06-23.pptx
Introduction, Case history - class 1 - 14-06-23.pptxIntroduction, Case history - class 1 - 14-06-23.pptx
Introduction, Case history - class 1 - 14-06-23.pptx
 

Text Mining the History of Medicine

  • 1. Mining the History of Medicine. Sophia Ananiadou, National Centre for Text Mining, School of Computer Science, The University of Manchester
  • 2. Outline
      – Text mining tools for enriching content
      – Mining the History of Medicine
      – Semantic annotations:
          – entities, events
          – OCR correction
          – Time-sensitive term inventory and search
      – A prototype search system for the history of medicine
  • 3. Text Mining Pipeline
      – Text processing pipeline: raw (unstructured) text → part-of-speech tagging → named entity recognition → event extraction → annotated (structured) text
      – Resources used by the pipeline: lexica, ontologies
      – Example input: "… I believe the pain was seated in the pillars of the diaphragm, as it was chiefly occasioned by hiccough …"
  • 4. Impact of text mining
      – Discovery of concepts and named entities (anatomical entities, treatments, symptoms, ...)
      – Going a step further: extracting relationships and events from text (cause, affect)
      – Extracting opinions, attitudes, negated statements, hypotheses, contradictions, ...
      – Improves classification of documents
      – Improves information access
      – Enables semantic search
      – Enables question-answering, sentiment analysis, trend discovery
  • 5. Mining the History of Medicine: an interdisciplinary project
      – National Centre for Text Mining, School of Computer Science (www.nactem.ac.uk)
      – Centre for the History of Science, Technology and Medicine
      – AHRC Digital Transformations, Big Data
  • 6. Mining the History of Medicine: Aim
      – Semantic search system over:
          – British Medical Journal (BMJ) (1840 – present) (350K articles)
          – London Medical Officer of Health reports (MoH) (1848 – 1972)
      – Provide different perspectives (e.g. medical, public health) on the treatment and prevention of diseases:
          – over time
          – in different areas
      – Text mining: tools, resources, infrastructure
  • 7. Mining the History of Medicine: Search historical archives
      – OCR: the BMJ archive was digitised several years ago
          – Word error rate can be over 30%
          – Re-OCRing is time-consuming and costly
      – MOH was OCRed more recently, with better results
      – Keyword search? Keyword search over huge textual archives is inefficient
  • 8. Mining the History of Medicine: Solutions
      – Apply OCR improvement techniques
      – Automatically detect semantic information in text
          – Entities: conditions, signs or symptoms, therapeutic measures
          – Events: symptoms caused by lung disease
      – Automatically identify how the naming of concepts changes over time
  • 9. Mining the History of Medicine: Semantic search system
      – Automatic query building, term suggestions
          – historical variants unknown to the user
      – Faceted search based on article metadata
          – Date, authors, title
      – Faceted search based on semantic content
          – Combinations of entities and/or relationships
  • 10. Extracting semantic types, events
      – Our text mining tools learn from a gold standard
          – subset of documents from the archives, manually annotated with entities and events
      – Corpus composition
          – 25 articles from BMJ
          – 4 extracts from MOH
          – 70,000 words
      – Articles from 4 key decades in medical history
          – 1850s, 1890s, 1920s and 1960s
          – Capture language/terminology changes over time
  • 11. OCR Post Correction
      – Re-OCR of BMJ too time-consuming for the current project
      – Our solution: post-correct the original OCR output based on spell-checker output
      Spell checker | % errors resolved
      ASpell | 45%
      Hunspell | 60% (best open source spell checker)
      Word | 63%
      MAC OS | 58%
  • 12. OCR Post Correction
      – Hunspell produces a ranked list of spelling corrections
      – However, it is tailored to:
          – Human errors
          – General language
      – Our approach:
          – Augment the Hunspell dictionary with the OpenMedSpel dictionary
          – Override the default Hunspell correction ranking
          – Rank suggestions according to corpus frequency
          – Use decade-specific word frequency lists
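The re-ranking step described on this slide can be illustrated with a minimal sketch. It assumes the spell checker's suggestions are already available as a plain list, and that a decade-specific frequency list has been built from the archive; the function name `rerank_suggestions` and the counts in `freq_1840s` are hypothetical and not taken from the project.

```python
from collections import Counter


def rerank_suggestions(suggestions, decade_freq):
    """Order spell-checker suggestions by corpus frequency, highest first.

    `suggestions` is the candidate list in the checker's original order;
    `decade_freq` maps word -> count for the decade of the document.
    Ties (including words unseen in the corpus) keep the original order.
    """
    indexed = list(enumerate(suggestions))
    indexed.sort(key=lambda pair: (-decade_freq.get(pair[1].lower(), 0), pair[0]))
    return [word for _, word in indexed]


# Hypothetical counts for the 1840s; real lists would come from the archive.
freq_1840s = Counter({"would": 5200, "world": 310, "not": 9100, "into": 2400})

reranked = rerank_suggestions(["world", "would", "wold"], freq_1840s)
print(reranked)
```

With these illustrative counts, a candidate such as "would" moves ahead of "world", which mirrors the kind of change between the basic and frequency-driven outputs shown in the deck's 1840 example.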
  • 13. BMJ OCR Errors – Example from 1840: JOHN MILLER, aged 26, glass-blower of Gateshead, was adnmitted August 27th, 1840, unider the care of Sir John Fife, with the usual syptnpoms of stone in the bladder, wlich he lhas laboured under from childhood. At the age of 12 years he was admitted into thtis infirmaryI, buit his frienids wotld niot allow the operation to be performed. Ile took a considerable (luantity of medicine, which greatly relieved him. Since then the symptomt of stone have been constantly presenit, but have not been very urgent. A.- About a month ago, after batlhing, the symptoms became much arggravated-so much so, as to comlpel him to apply again to the hospital for relief. On admission he was sallow and emaciated; laboured iunder symptoms of stone of a high degree, with great irritability of bladder, giviing rise to a frequient desire to inicturate, and occasionally to the in-voluntary discharge of faces from straining. Ile passed a larg,e quantity of mucus with his urine;-pulse small, tongue white, sliglht tlhirst, and appetite good.
  • 14. Basic Hunspell Output: JOHN MILLER, aged 26, glass-blower of Gateshead, was admitted August 27th, 1840, under the care of Sir John Fife, with the usual symptoms of stone in the bladder, which he has laboured under from childhood. At the age of 12 years he was admitted into this infirmary, but his friends world into allow the operation to be performed. Ike took a considerable (quantity of medicine, which greatly relieved him. Since then the symptom of stone have been constantly present, but have not been very urgent. A.- About a month ago, after bathing, the symptoms became much aggravated-so much so, as to compel him to apply again to the hospital for relief. On admission he was sallow and emaciated; laboured under symptoms of stone of a high degree, with great irritability of bladder, giving rise to a frequent desire to inaccurate, and occasionally to the involuntary discharge of faces from straining. Ike passed a lag,e quantity of mucus with his urine;-pulse small, tongue white, slight thirst, and appetite good.
      Legend: correct changes made by Hunspell; incorrect changes made by Hunspell; OCR errors not detected by Hunspell
  • 15. New Frequency-driven Hunspell Output: JOHN MILLER, aged 26, glass-blower of Gateshead, was admitted August 27th, 1840, under the care of Sir John Fife, with the usual symptoms of stone in the bladder, which he has laboured under from childhood. At the age of 12 years he was admitted into this infirmary, but his friends would not allow the operation to be performed. Lie took a considerable (quantity of medicine, which greatly relieved him. Since then the symptoms of stone have been constantly present, but have not been very urgent. A.- About a month ago, after bathing, the symptoms became much aggravated-so much so, as to compel him to apply again to the hospital for relief. On admission he was sallow and emaciated; laboured under symptoms of stone of a high degree, with great irritability of bladder, giving rise to a frequent desire to inaccurate, and occasionally to the involuntary discharge of faces from straining. Lie passed a large,e quantity of mucus with his urine;-pulse small, tongue white, slight thirst, and appetite good.
      Legend: corrections only obtained using the frequency-driven method; corrections obtained using both the original and frequency-driven methods; remaining errors with the frequency-driven method
  • 16. Avoiding Over-Correction
      – We want to avoid erroneous correction of words that are already correct
      Original Text | Basic Hunspell | Frequency-driven Hunspell | Frequency-driven Hunspell augmented with OpenMedSpel
      antithrombotic | anthropometric | anthropometric | antithrombotic
      pathophysiological | physiological | physiological | pathophysiological
      thromboembelism | thrombosis | thrombosis | thromboembolism
      warfarin | wayfaring | warfarin | warfarin
      transoesophageal | oesophageal | oesophageal | transoesphageal
      echocardiography | electroocardiography | oesophageal | echocardiography
      intracardiac | intra cardiac | intracellular | intracardiac
      tomogram | mammogram | tomogram | tomogram
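One way to read the table above is that a token should be left alone whenever it appears in the augmented dictionary, and only otherwise handed to the spell checker. The sketch below illustrates that guard under stated assumptions: `correct_token`, `toy_suggest`, and the two small word sets are hypothetical stand-ins, not the Hunspell or OpenMedSpel interfaces.

```python
def correct_token(token, lexicon, suggest):
    """Return the token unchanged if it is a known word, else the top suggestion.

    `lexicon` is a set of lower-cased known words; `suggest` is any callable
    returning a ranked list of candidate corrections for a token.
    """
    if token.lower() in lexicon:
        return token
    candidates = suggest(token)
    return candidates[0] if candidates else token


# Hypothetical stand-ins: a general word list and a few medical entries.
general_words = {"the", "patient", "was", "given", "which"}
medical_words = {"warfarin", "tomogram", "intracardiac"}


def toy_suggest(token):
    # Hypothetical ranked suggestions, keyed by the misrecognised token.
    return {"warfarin": ["wayfaring"], "wlich": ["which"]}.get(token, [])


# Without the medical entries a correct term is over-corrected.
print(correct_token("warfarin", general_words, toy_suggest))
# With the augmented lexicon the term is left untouched.
print(correct_token("warfarin", general_words | medical_words, toy_suggest))
```

The design choice is simply to check membership before asking for suggestions, so domain terms that a general-language dictionary lacks are never rewritten.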
  • 17. Extracting semantic types, events
      – Our text mining tools learn from a gold standard
          – subset of documents from the archives, manually annotated with entities and events
      – Corpus composition
          – 25 articles from BMJ
          – 4 extracts from MOH
          – 70,000 words
      – Articles from 4 key decades in medical history
          – 1850s, 1890s, 1920s and 1960s
          – Capture language/terminology changes over time
  • 18. Hymera corpus: Entity Types
      Type | Description | Examples
      Anatomical | An entity forming part of the human body | lung, lobe, sputum, fibroid
      Biological_Entity | A living entity not part of the human body | tubercle bacilli, mould, guinea-pig, flea
      Condition | Medical condition or ailment | phthisis, bronchitis, typhoid
      Environmental | Environmental factor relevant to incidence/prevention/control/treatment of condition | humidity, high mountain climates, milk, linen, drains
      Sign_or_Symptom | Altered physical appearance or behaviour as a probable result of injury or condition | cough, pain, rise in temperature, swollen
      Subject | Individual or group of cases under discussion | asthma patients, those with negative reactions to tuberculin
      Therapeutic_or_Investigational | Treatment, substance, medium or procedure, prescribed or used in investigation | atrophine sulphate, generous diet, change of air, lobectomy
  • 19. Hymera Corpus: Events
      Type | Description | Participants
      Affect | An entity or event is affected, infected, changed or transformed, possibly by another entity or event | Cause: the cause of the affection; Target: the entity or event affected; Subject: the medical subject affected
  • 20. Hymera Corpus: Events
      Type | Description | Participants
      Cause | An entity or event results in the manifestation of another entity or event | Cause: the cause of the event; Result: the resulting entity or event; Subject: the medical subject affected
  • 21. Extracting entities automatically
      – Trained classifiers on the Hymera corpus
      – Used dictionaries to obtain domain-specific information
          – UMLS Metathesaurus
          – MetaMap (NE system for UMLS concepts in text)
      – Used the semantic categories that could be mapped to Hymera
  • 22. UMLS → Corpus Mappings
      MetaMap Categories | Hymera category
      Anatomical Abnormality; Body Substance; Body Part, Organ, or Organ Component; Body Location or Region; Body Space or Junction; Tissue | Anatomical
      Animal; Mammal; Cell; Bacterium; Organism | Biological_Entity
      Food; Chemical Viewed Structurally; Element, Ion, or Isotope; Hazardous or Poisonous Substance; Substance; Natural Phenomenon or Process | Environmental
      Disease or Syndrome; Pathologic Function | Condition
      Clinical Drug; Amino Acid, Peptide, or Protein; Immunologic Factor; Organic Chemical; Pharmacologic Substance; Biologically Active Substance; Lipid | Therapeutic_or_Investigational
      Sign or Symptom; Finding | Sign_or_Symptom
      Group; Patient or Disabled Group | Subject
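A mapping of this kind is naturally expressed as a lookup table. The sketch below encodes only a subset of the rows listed on this slide; the names `SEMTYPE_TO_HYMERA` and `to_hymera` are hypothetical, and how the project itself stores or applies the mapping is not stated in the deck.

```python
# Subset of the slide's mapping from UMLS semantic types to Hymera categories.
SEMTYPE_TO_HYMERA = {
    "Body Part, Organ, or Organ Component": "Anatomical",
    "Tissue": "Anatomical",
    "Bacterium": "Biological_Entity",
    "Food": "Environmental",
    "Disease or Syndrome": "Condition",
    "Pathologic Function": "Condition",
    "Pharmacologic Substance": "Therapeutic_or_Investigational",
    "Sign or Symptom": "Sign_or_Symptom",
    "Finding": "Sign_or_Symptom",
    "Patient or Disabled Group": "Subject",
}


def to_hymera(semantic_types):
    """Map a concept's UMLS semantic types to the set of Hymera categories.

    Semantic types without a mapping are ignored, so a concept with no
    mapped type yields an empty set.
    """
    return {SEMTYPE_TO_HYMERA[t] for t in semantic_types if t in SEMTYPE_TO_HYMERA}


print(to_hymera(["Disease or Syndrome", "Finding"]))
```

Returning a set rather than a single label keeps concepts that carry several semantic types from being forced into one category prematurely.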
  • 23. Semantic Entities: Summary
      – Even baseline system produces good performance on most categories 0.8%
      – Environmental and Therapeutic_or_Investigational
      – Recall generally rather lower than precision
      – Recall can be improved using semantic information
      – UMLS look-up preferable
      – MetaMap slow and unreliable
      – UMLS lookup takes advantage of our NER built-in dictionary lookup to boost performance
  • 24. Event Recognition
      – Detects "trigger" phrases around which events are arranged: origin
      – Finds individual event participants: Subj, Result and Cause
      – Combines trigger-argument pairs into semantic structures
      – http://www.nactem.ac.uk/EventMine/
  • 25. Event Recognition Summary
      – Complex task
          – 600 Affect and 200 Causality events
      – Diverse triggers can make recognition a challenge
      – Ongoing work to triple the size of the manually annotated corpus
          – Different decades to provide greater evidence of language usage
      – Extended corpus to further boost results and widen coverage of semantics
  • 26. Interpreting events
      – Discourse context (meta-knowledge) affects interpretation
          – p21ras proteins are not activated by IL-2
          – Our results suggest that p21ras proteins are strongly activated by IL-2 in normal human T lymphocytes
          – Our results show that p21ras proteins are weakly activated by IL-2 in normal human T lymphocytes
      – Contradictions, negations, speculations
  • 27. [Slide with no extractable text]
  • 28. Terminological Issues
      – Concepts can be expressed in many ways
      – We take such variants into account to improve search
      – Lexical variations:
          – oedema / edema
          – whooping cough / whooping-cough
          – pulmonary tuberculosis / tuberculosis
          – respiratory diseases / diseases of the respiratory system
      – Semantic variations:
          – tuberculosis / phthisis
          – smallpox / variola
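Taking variants into account at search time usually amounts to query expansion. The following is a minimal sketch under stated assumptions: the variant pairs are those listed on this slide, while the `VARIANTS` table, the `expand_query` function, and the quoted `OR` query syntax are hypothetical and not the search system's actual interface.

```python
# Variant pairs taken from the slide; the table structure is an assumption.
VARIANTS = {
    "tuberculosis": ["phthisis", "pulmonary tuberculosis"],
    "smallpox": ["variola"],
    "oedema": ["edema"],
    "whooping cough": ["whooping-cough"],
}


def expand_query(term):
    """Return an OR query over the term and its known variants."""
    forms = [term] + VARIANTS.get(term.lower(), [])
    return " OR ".join(f'"{form}"' for form in forms)


print(expand_query("smallpox"))
print(expand_query("measles"))
```

A term with no recorded variants passes through as a single quoted phrase, so the expansion never narrows what the user asked for.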
  • 29. Building a Time Sensitive Terminological Resource
      – Why not use existing resources?
          – MeSH, Disease Ontology, UMLS Metathesaurus
      – Used in search systems to expand query terms
      – Textual variants of concepts change over time
          – Existing resources tend to focus on contemporary terminology
          – Historically relevant variants are not covered in a comprehensive and consistent manner
  • 30. Term Variation in BMJ: Phthisis vs. Tuberculosis
      [Chart: series Phthisis and Tuberculosis; x-axis: years 1840 to 1968; y-axis: 0 to 60]
  • 31. Term Variation in BMJ: Scarlet Fever vs. Scarlatina
      [Chart: series Scarlatina and Scarlet Fever; x-axis: years 1840 to 1968; y-axis: 0 to 60]
  • 32. Time Sensitive Medical Terminological Inventory
      – We have created a time-sensitive terminological inventory from BMJ for medical conditions
          – Complements existing terminological resources by accounting for term variant usage over long periods of time
          – Extracted term variants from 19th century medical lexica
          – Identified more variants by applying distributional semantics techniques to the BMJ archives
  • 33. Processing  Historical  Lexica   19th  century  medical  lexica  NLM  Archive     Two  lists  of  diseases  considered       Name   Year   #  top-­‐level  entries   Nomenclature  of  diseases   1872   1347   Index  of  diseases,  their   symptoms,  and  treatment     1882   522   33  
  • 34. •  English  main   term   •  La-n  main  term   •  Synonym  list   •  Sub-­‐type   English  term   •  Sub-­‐type  La-n   term   Synonym  set   Processing  Historical  Lexica  -­‐  example   34  
  • 35. Processing  Historical  Lexica   ü Link  synonym  set  to  UMLS  using  approximate  string   matching   ü 1,588  synonym  sets  matched     ü 2,422  variants  not  contained  in  UMLS   ü Confirms  lack  of  comprehensive  historical  coverage  in   UMLS   However:   ü Manually-­‐curated  lexica  unlikely  to  account  for  all   “real  usage”  term  varia-on  in  medical  texts   ü Automa-c  extrac-on  of  terms  variants  complements   manually  curated  variants     ü Accounts  for  real  usage     35  
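Linking a historical variant to a vocabulary by approximate string matching might look like the sketch below. The slides do not say which matching algorithm or threshold was used, so Python's standard `difflib` similarity ratio stands in here; `link_to_vocabulary`, the cutoff of 0.85 and the three-entry toy vocabulary (not UMLS) are all assumptions.

```python
import difflib
from typing import List, Optional


def link_to_vocabulary(variant: str, vocabulary: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the closest vocabulary entry, or None if nothing is similar enough."""
    # Compare in lower case, but return the entry in its original spelling.
    normalised = {name.lower(): name for name in vocabulary}
    hits = difflib.get_close_matches(variant.lower(), list(normalised), n=1, cutoff=cutoff)
    return normalised[hits[0]] if hits else None


# Toy vocabulary for illustration only.
TOY_VOCABULARY = ["Whooping Cough", "Scarlet Fever", "Tuberculosis"]
linked = link_to_vocabulary("whooping-cough", TOY_VOCABULARY)
unlinked = link_to_vocabulary("variola", TOY_VOCABULARY)
```

Variants that return `None` correspond to the unmatched cases counted on the slide: spelling-level matching cannot link a semantically related term such as "variola" to an entry with a different surface form.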
  • 36. TM Methods for Historical Variant Identification: pre-processing; distributional similarity measures; detection of historically relevant term variants.
  • 37. Pre-Processing BMJ. We combined two recognisers: (1) dictionary matching against 19th-century disease lexica (Index of Diseases, Nomenclature of Diseases); (2) machine learning trained on the NCBI Disease Corpus.
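One way to combine a dictionary matcher with a second recogniser is to pool their spans and resolve overlaps. The slides do not describe how the two outputs were merged, so the longest-span-wins rule below is an assumption, and `dictionary_spans`, `merge_spans` and the hand-written "ml" span are hypothetical stand-ins (no trained model is involved).

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, source recogniser)


def dictionary_spans(text: str, lexicon: List[str]) -> List[Span]:
    """Find whole-word, case-insensitive occurrences of each lexicon term."""
    spans: List[Span] = []
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), "dictionary"))
    return spans


def merge_spans(spans: List[Span]) -> List[Span]:
    """Keep the longest spans first and drop any span that overlaps a kept one."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= other[0] or span[0] >= other[1] for other in kept):
            kept.append(span)
    return sorted(kept)


TEXT = "Cases of scarlet fever and fever of unknown origin."
# Stand-in for the output of a machine-learning recogniser.
ml_start = TEXT.index("fever of unknown origin")
ML_SPANS: List[Span] = [(ml_start, ml_start + len("fever of unknown origin"), "ml")]

MERGED = merge_spans(dictionary_spans(TEXT, ["scarlet fever", "fever"]) + ML_SPANS)
MENTIONS = [TEXT[start:end] for start, end, _ in MERGED]
```

Here the short dictionary hits for "fever" are absorbed by the longer spans, so each stretch of text is annotated once.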
  • 38. Distributional Similarity Measures: cosine similarity. Term vectors over context words (weights in slide order). scarlet fever: measles 47366.48, diphtheria 46344.15, fever 30555.26, cough 31896.95, death 15241.49, pox 11759.56, fatal 7407.69, small 7388.65, …, typhoid 4608.19. scarlatina: measles 11474.87, diphtheria 5436.11, fever 5434.60, cough 2892.64, death 2413.52, pox 4326.59, fatal 875.16, small 2340.05, …, typhoid 2071.56.
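Cosine similarity between two context vectors u and v is their dot product divided by the product of their lengths. The sketch below applies it to the weights shown on the slide, under two assumptions: each number is paired with the context word in the same position, and the dimensions elided by "…" are simply left out, so the result will not reproduce any score computed from the full vectors. The function name `cosine_similarity` is illustrative.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors stored as word-to-weight dicts."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention for empty or all-zero vectors.
    return dot / (norm_u * norm_v)


# Truncated vectors from the slide; word/weight pairing is assumed to be positional.
SCARLET_FEVER = {
    "measles": 47366.48, "diphtheria": 46344.15, "fever": 30555.26,
    "cough": 31896.95, "death": 15241.49, "pox": 11759.56,
    "fatal": 7407.69, "small": 7388.65, "typhoid": 4608.19,
}
SCARLATINA = {
    "measles": 11474.87, "diphtheria": 5436.11, "fever": 5434.60,
    "cough": 2892.64, "death": 2413.52, "pox": 4326.59,
    "fatal": 875.16, "small": 2340.05, "typhoid": 2071.56,
}

score = cosine_similarity(SCARLET_FEVER, SCARLATINA)
```

Because cosine similarity ignores vector length, the much larger weights for "scarlet fever" do not by themselves lower the score; only the relative distribution over context words matters.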
  • 39. Distributional Similarity Measures. Terms most similar to scarlet fever: scarlatina 0.8434; whooping cough 0.7415; measles 0.7094; diphtheria 0.6689; acute pneumonia 0.6501; chicken pox 0.6407; german measles 0.6336; diarrhoeal disease 0.6301; infantile diarrhoea 0.6234; diarrhea 0.6208; croup 0.6093; mump 0.6042; typhus fever 0.5535; typhoid fever 0.5508; …
  • 40. Access “query builder” to start search refinement.
  • 41. The user can refine the search using multiple criteria. Here we start by searching for a specific term.
  • 42. The user enters a term of interest and then clicks on “Refine Search”.
  • 43. Screen regions: retrieved documents; related terms; current search criteria; frequency of “pulmonary consumption” over time. The user selects “pulmonary tuberculosis”.
  • 44. “pulmonary tuberculosis” is added to the query builder. Composite time graph (automatically updated). Terms related to “pulmonary consumption” and “pulmonary tuberculosis”.
  • 45. Further refine retrieved documents by date published.
  • 46. Only retain documents published up until 1950.
  • 47. Further refine retrieved documents by date published.
  • 48. Refine by automatically recognised semantic entities.
  • 49. Semantic refinement can be made according to different entity categories.
  • 50. Here, the “Environmental” entities present in the current document set are shown with their counts. The user may type into the box to narrow the list of entities displayed.
  • 51. New semantic restriction added to the query builder display. Number of documents further reduced.
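The stepwise narrowing shown in the walkthrough (terms, then date, then a semantic entity) can be modelled as a query object that accumulates restrictions. This is a sketch only: the `Document` and `Query` classes, their fields, and the three toy documents are invented for illustration and do not reflect the prototype's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Document:
    title: str
    year: int
    text: str
    # Recognised entities as (category, mention) pairs.
    entities: Set[Tuple[str, str]] = field(default_factory=set)


@dataclass
class Query:
    terms: List[str] = field(default_factory=list)  # any of these may match
    max_year: Optional[int] = None  # inclusive upper bound on publication year
    required_entities: List[Tuple[str, str]] = field(default_factory=list)

    def matches(self, doc: Document) -> bool:
        text = doc.text.lower()
        if self.terms and not any(term.lower() in text for term in self.terms):
            return False
        if self.max_year is not None and doc.year > self.max_year:
            return False
        return all(entity in doc.entities for entity in self.required_entities)


def search(docs: List[Document], query: Query) -> List[str]:
    """Return the titles of all documents satisfying every restriction."""
    return [doc.title for doc in docs if query.matches(doc)]


# Invented toy documents.
DOCS = [
    Document("A", 1902, "Pulmonary consumption in damp dwellings",
             {("Environmental", "damp dwellings")}),
    Document("B", 1961, "Pulmonary tuberculosis and streptomycin"),
    Document("C", 1935, "Pulmonary tuberculosis in mill towns"),
]

query = Query(terms=["pulmonary consumption", "pulmonary tuberculosis"])
step_terms = search(DOCS, query)
query.max_year = 1950
step_date = search(DOCS, query)
query.required_entities.append(("Environmental", "damp dwellings"))
step_entity = search(DOCS, query)
```

Each added restriction can only shrink the result list, which mirrors the falling document counts in the walkthrough.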
  • 52. Results can also be refined by searching for documents containing events involving entities.
  • 53. The user can choose between the two event types recognised by our system.
  • 54. The user (partially) completes a template according to the types of events they are searching for. Here, the search is for documents mentioning causes of tuberculosis.
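A partially completed event template can be treated as a pattern in which unfilled slots act as wildcards. The sketch below assumes a flat dictionary representation; the slot names `type`, `cause` and `effect`, the function `matches_template` and the three example events are hypothetical and are not drawn from the system's output.

```python
from typing import Dict, List, Optional

Event = Dict[str, str]


def matches_template(event: Event, template: Dict[str, Optional[str]]) -> bool:
    """A slot set to None is a wildcard; filled slots must match case-insensitively."""
    for slot, wanted in template.items():
        if wanted is None:
            continue
        if event.get(slot, "").lower() != wanted.lower():
            return False
    return True


# Invented events for illustration.
EVENTS: List[Event] = [
    {"type": "Causality", "cause": "overcrowding", "effect": "tuberculosis"},
    {"type": "Causality", "cause": "infected milk", "effect": "scarlet fever"},
    {"type": "Affect", "cause": "open-air treatment", "effect": "tuberculosis"},
]

# "Causes of tuberculosis": the cause slot is left open.
TEMPLATE: Dict[str, Optional[str]] = {"type": "Causality", "cause": None, "effect": "tuberculosis"}
HITS = [event for event in EVENTS if matches_template(event, TEMPLATE)]
```

Leaving the cause slot empty returns every Causality event whose effect is tuberculosis, whatever the stated cause.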
  • 55. Document display: bibliographic information; document content; semantic information.
  • 56. A pie chart shows the distribution of entity types in a document. Hovering the mouse over a section reveals the entity type in question.
  • 57. Number of “Environmental” entities in the document; “Environmental” entity instances in the document; “Environmental” entities are highlighted.
  • 58. Cases where specific causes of tuberculosis are mentioned in the document. Phrases identifying Causality events are highlighted.
  • 59. Affect events allow us to explore, e.g., whether therapeutic measures have a positive or negative effect.
  • 60. Future Developments. More entity types: tools and materials used in a medical or laboratory context; geographical locations; organisations (e.g. schools, workhouses, hospitals); temporal expressions. More event types: associations between entities; actions carried out as part of a medical examination or treatment, or to resolve an issue; rates, i.e. descriptions of a reported rate relevant to public health (birth, marriage, death, disease, etc.) and comparisons between these rates.
  • 61. Future Developments: Attributes. Recognition of attributes at both the entity and event levels. Entity-based attributes: age, sex, ethnicity, occupation or general health status of the medical subject; frequency of treatments; quantities or measures relating to entities; dosages of treatments; intensity of a symptom. Event-based attributes with discourse interpretation: Is an event specified as being definite or speculated? If speculated, which level of certainty is ascribed to the event? What is the source of the information specified in the event, e.g., does it come from the author or another source?
  • 62. Conclusions: explore further archives using topic analysis; automatic clustering of search results according to similarities of semantic content; expand the types of entities and events and further enrich BMJ; link with other archives; create a rich annotation database.
  • 63. Acknowledgments. BMJ and Wellcome Library for making the archives available to text mine; AHRC for funding; team members at NaCTeM and CHSTM: Paul Thompson, Elizabeth Toon, Nick Duvall, Jacob Carter, George Kontonatsios, Riza Batista-Navarro. Investigators: S. Ananiadou, J. McNaught (NaCTeM); C. Timmermann, M. Worboys (CHSTM).