Text mining exercise ~5 m
Lars Juhl Jensen
the task
named entity recognition
human proteins
link proteins to diseases
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
62,755 abstracts
65,588 abstracts
one directory with each set
one file with each abstract
dictionary
tab-delimited file
human proteins
22,523 entities
synonyms
from many databases
orthographic variation
prefixes and postfixes
automatically generated
2,726,495 names
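A minimal sketch of how such a tab-delimited dictionary might be read. The column layout is an assumption (entity identifier in the first column, one name in the second); the slides do not specify it. `load_dictionary` and the identifier `P1` are hypothetical and used only for illustration.

```python
import io
from collections import defaultdict

def load_dictionary(handle):
    """Map each entity identifier to the set of names listed for it.

    Assumes two tab-separated columns: entity identifier, name.
    """
    names_by_entity = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        entity_id, name = fields[0], fields[1]
        names_by_entity[entity_id].add(name)
    return names_by_entity

# Tiny in-memory stand-in for the real file; "P1" is a made-up identifier.
sample = io.StringIO("P1\tFOLH1\nP1\tGlutamate carboxypeptidase II\n")
dictionary = load_dictionary(sample)
```

With the real file, the same function could be called on an opened file handle instead of the in-memory sample.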
tagdir program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
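One way to picture flexible matching is to normalize names before comparing them. The sketch below is not the tagdir implementation, whose exact rules are not given here; `normalize` and `same_name` are hypothetical helpers that ignore case, spaces, and hyphens.

```python
import re

def normalize(name):
    """Lower-case a name and drop whitespace and hyphens."""
    return re.sub(r"[\s-]+", "", name.lower())

def same_name(a, b):
    """True if two spellings agree once case, spaces, and hyphens are ignored."""
    return normalize(a) == normalize(b)
```

Under this assumption, spellings such as `FOLH-1` and `folh 1` would be treated as the same name.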
what you will do
named entity recognition
find unfortunate names
create “black list”
information extraction
co-mentioning
within documents
link proteins to diseases
link between the diseases
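The co-mentioning step can be sketched as counting, for each protein, how many abstracts in a disease set mention it, and then intersecting the two sets to find proteins that link the diseases. The input format is an assumption: a list of `(document_id, entity_id)` pairs per disease set, with made-up identifiers. `documents_per_protein` is a hypothetical helper.

```python
from collections import defaultdict

def documents_per_protein(matches):
    """Count the distinct documents in which each protein is mentioned."""
    docs = defaultdict(set)
    for doc_id, entity_id in matches:
        docs[entity_id].add(doc_id)
    return {entity: len(doc_ids) for entity, doc_ids in docs.items()}

# Made-up matches: (document identifier, protein identifier).
prostate = [("d1", "P1"), ("d1", "P1"), ("d2", "P1"), ("d2", "P2")]
schizophrenia = [("s1", "P1"), ("s2", "P3")]

prostate_counts = documents_per_protein(prostate)
schizophrenia_counts = documents_per_protein(schizophrenia)

# Proteins mentioned in both document sets suggest a link between the diseases.
shared = sorted(set(prostate_counts) & set(schizophrenia_counts))
```

Counting distinct documents rather than raw matches keeps a protein that is repeated many times in one abstract from dominating the ranking.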
a helping hand
“black list”
100+ matches
10+ matches
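A possible way to produce such helper lists is to count how often each name is matched and keep those above a threshold, for manual review as "black list" candidates. This is a sketch under assumptions: the input is a plain list of matched name strings, case is ignored when counting, and `frequent_names` is a hypothetical helper.

```python
from collections import Counter

def frequent_names(matched_names, min_matches):
    """Return (name, count) pairs with at least min_matches, most frequent first."""
    counts = Counter(name.lower() for name in matched_names)
    return [(name, n) for name, n in counts.most_common() if n >= min_matches]

# Made-up matches; with real output the thresholds would be 100 and 10.
sample_matches = ["x", "X", "x", "y"]
candidates = frequent_names(sample_matches, 3)
```

Frequent names are only candidates: a name that matches often may be a genuinely well-studied protein, so the list still needs to be inspected by hand.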
wrap up
prostate cancer
FOLH1
schizophrenia
Glutamate carboxypeptidase II
same protein
synonyms matter
“black list” is crucial
text mining is quite simple
diseases.jensenlab.org