Using NLP to Explore Entity Relationships in COVID-19 Literature

In this talk, we will cover how to extract entities from text using both rule-based and deep learning techniques, and how to use rule-based entity extraction to bootstrap a named entity recognition model. We will also cover how to infer relationships between entities and combine them with explicit relationships found in the source data sets. Although this talk focuses on the CORD-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for those who want to learn how to use NLP to explore relationships in text.

1. Using Spark NLP to Build a Biomedical Knowledge Graph, or how to build a space telescope so you don't get lost in the darkness
2. About Us. Alex Thomas is a principal data scientist at Wisecube. He has used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical data. Alex is also the author of Natural Language Processing with Spark NLP. Vishnu is the CTO and founder of Wisecube AI and has over two decades of experience building data science teams and platforms. Vishnu is a big believer in graph-based systems and has extensive experience with various graph databases, including Neo4j, the original TitanDB release (now JanusGraph), and more recently OrientDB and AWS Neptune.
3. About Wisecube AI. Wisecube AI helps accelerate biomedical R&D by combining the power of knowledge graphs, intelligent applications, and a low-code platform.
4. Biomedical data: the final frontier. The biomedical big (data) bang: too much dark data is being created for scientists to comprehend, and the insights are hidden. The curse of unstructured data: there is not enough labeled data, and gathering labels is expensive in biomedical domains.
5. Hubble Space Telescope: NLP and Knowledge Graphs. 1. Visualize: allow users to gather high-level insights for a biomedical research area. 2. Explore: discover connections and concepts hidden in the text and structured data. 3. Learn: create representations of concepts used to train deep learning models for prediction and experimentation.
6. Building Hubble: Orpheus
7. Orpheus Pipeline
8. Overview ● Datasets ○ Text Processing ○ Topic Modeling ○ Entity Extraction ○ Graph Building
9. Datasets ❏ CORD-19 ❏ Text dataset of biomedical articles related to COVID-19 ❏ From the Semantic Scholar team at the Allen Institute for AI ❏ Contains articles and a metadata file ❏ More than ~200k articles mentioned in the metadata ❏ ~90k with JSON files converted from PDFs ❏ ~70k with JSON files converted from PubMed Central ❏ Mostly overlapping ❏ ~90k articles with JSON files in the dataset. https://www.semanticscholar.org/cord19
10. Datasets ❏ Infectious disease names ❏ List of ~500 disease names and synonyms ❏ Manually curated: added preferred names and Wikipedia links, removed overly common or ambiguous terms ❏ ~10 hours of work ❏ 1,351 disease names. https://www.atsu.edu/faculty/chamberlain/website/diseases.htm
11. Datasets ❏ ChEMBL subset ❏ Synonyms ❏ SMILES ❏ 13,309 compounds. https://www.ebi.ac.uk/chembl/
12. Datasets ❏ UniProt subset ❏ Names ❏ Associated UniProt IDs ❏ 1,473,423 protein names ❏ ~40k typically found in CORD-19. https://www.uniprot.org/
13. CORD-19 Dataset ❏ 94,483 documents ❏ 26,147.6 average character length ❏ 20,685 journals
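For reference, a minimal PySpark sketch of how corpus statistics like these can be computed; it assumes the parsed CORD-19 articles are already in a DataFrame with text and journal columns, which is an assumed schema and path rather than the exact Orpheus code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cord19-stats").getOrCreate()

    # Hypothetical location/format for the parsed CORD-19 articles.
    docs = spark.read.parquet("cord19_articles.parquet")

    docs.agg(
        F.count("*").alias("num_documents"),
        F.avg(F.length("text")).alias("avg_char_length"),
        F.countDistinct("journal").alias("num_journals"),
    ).show()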
14. Overview ✓ CORD-19 ● Text Processing ○ Topic Modeling ○ Entity Extraction ○ Graph Building
15. Text Processing ❏ The dataset is too large to process all in memory ❏ Spark NLP ❏ Website: nlp.johnsnowlabs.com ❏ Book: Natural Language Processing with Spark NLP (by me, Alex) ❏ Amazon link: amzn.com/1492047767 ❏ NLP library built on top of Spark MLlib ❏ Building our own pipeline
16. Text Processing ❏ Pipeline ❏ Sentence tokenizing: splitting text into sentences ❏ Tokenizing: splitting text into tokens ❏ Normalizing: lowercasing, removing non-alphabetics ❏ Stop word cleaning: removing common words ❏ Lemmatizing: reducing words to their dictionary entry, e.g. symptoms → symptom, diagnoses → diagnosis ❏ Outputs ❏ Normalized tokens: used for entity extraction ❏ Lemmas: used for topic modeling
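For concreteness, a minimal sketch of such a pipeline with standard Spark NLP annotators follows; the exact stages, pretrained models, and column names used in Orpheus are assumptions rather than the talk's actual code.

    import sparknlp
    from sparknlp.base import DocumentAssembler, Finisher
    from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, StopWordsCleaner, LemmatizerModel
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    # Toy input; in practice this is the parsed CORD-19 article text.
    articles_df = spark.createDataFrame(
        [("The influenza virus is divided into different types and subtypes.",)], ["text"]
    )

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentences")
    tokens = Tokenizer().setInputCols(["sentences"]).setOutputCol("tokens")
    # Lowercase and strip non-alphabetic characters.
    normalized = Normalizer().setInputCols(["tokens"]).setOutputCol("normalized").setLowercase(True)
    # Remove common stop words.
    cleaned = StopWordsCleaner().setInputCols(["normalized"]).setOutputCol("cleaned")
    # Reduce words to their dictionary form (symptoms -> symptom).
    lemmas = LemmatizerModel.pretrained().setInputCols(["cleaned"]).setOutputCol("lemmas")
    # Convert annotations back to plain string arrays for downstream steps.
    finisher = Finisher().setInputCols(["normalized", "lemmas"])

    pipeline = Pipeline(stages=[document, sentences, tokens, normalized, cleaned, lemmas, finisher])
    processed = pipeline.fit(articles_df).transform(articles_df)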
17. Text Processing ❏ Identifying phrases (manual process) ❏ Run the pipeline ❏ Analyze n-gram frequencies ❏ n-grams are sequences of tokens of length n; here n = 2, 3, 4 ❏ Identify “stop phrases”: formulaic statements related to copyright, links, etc. ❏ Identify phrases ❏ Add them to the tokenizer ❏ Repeat
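A rough illustration of the n-gram frequency analysis using Spark's built-in NGram transformer; it assumes the processed DataFrame and the Finisher's default finished_normalized column from the sketch above.

    from pyspark.ml.feature import NGram
    from pyspark.sql import functions as F

    # Bigram counts over the normalized tokens produced by the pipeline sketch above.
    bigrams = NGram(n=2, inputCol="finished_normalized", outputCol="bigrams")
    bigram_counts = (bigrams.transform(processed)
                     .select(F.explode("bigrams").alias("bigram"))
                     .groupBy("bigram")
                     .count()
                     .orderBy(F.desc("count")))
    bigram_counts.show(50, truncate=False)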
18. Text Processing: plots of log-mean TF.IDF for unigrams, bigrams, trigrams, and 4-grams
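The plots on slide 18 summarize TF.IDF statistics per n-gram length. A small numpy sketch of one plausible reading of "log-mean TF.IDF", using toy counts in place of the real docs-by-n-grams matrix:

    import numpy as np

    # Toy stand-in for the real docs x n-grams count matrix from the pipeline above.
    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(1000, 500))

    N = X.shape[0]
    TF = np.log2(1 + X)              # dampened term frequency
    DF = (X > 0).sum(axis=0)         # document frequency per n-gram
    IDF = np.log2(N / DF)
    TFIDF = TF * IDF
    # One score per n-gram, histogrammed separately for each n-gram length.
    log_mean_tfidf = np.log2(1 + TFIDF.mean(axis=0))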
19. Overview ✓ CORD-19 ✓ Text Processing ● Topic Modeling ○ Entity Extraction ○ Graph Building
20. Topic Modeling ❏ Topic modeling ❏ “You shall know a word by the company it keeps” (J. R. Firth, 1957) ❏ Clustering text data into topics ❏ Visualize diversity in the corpus ❏ Analyze vocabulary ❏ Latent Dirichlet Allocation ❏ Documents are mixtures of topics ❏ Topics are mixtures of words
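A minimal sketch of LDA over the lemmas with Spark MLlib; the vocabulary size, number of topics, and column names are assumptions:

    from pyspark.ml.feature import CountVectorizer
    from pyspark.ml.clustering import LDA

    # Term-count vectors per document from the lemmas column of the pipeline sketch above.
    cv = CountVectorizer(inputCol="finished_lemmas", outputCol="features", vocabSize=20000, minDF=5.0)
    cv_model = cv.fit(processed)
    vectors = cv_model.transform(processed)

    # LDA: each document is a mixture of topics, each topic a mixture of words.
    lda = LDA(k=20, maxIter=50, featuresCol="features")
    lda_model = lda.fit(vectors)
    lda_model.describeTopics(10).show(truncate=False)  # top 10 terms per topic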
21. Topic Modeling ❏ pyLDAvis used for visualization
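A hedged sketch of wiring the Spark LDA output from the sketch above into pyLDAvis.prepare; sampling and handling of zero-count terms would need care in practice, so treat this as an outline rather than the exact Orpheus code:

    import numpy as np
    import pyLDAvis

    # topicsMatrix is terms x topics, so transpose to topics x terms distributions.
    topic_term = lda_model.topicsMatrix().toArray().T

    # Collect a manageable sample of documents to the driver for the visualization.
    sample = (lda_model.transform(vectors)
              .select("features", "topicDistribution")
              .limit(2000)
              .collect())
    term_counts = np.array([row["features"].toArray() for row in sample])
    doc_topic = np.array([row["topicDistribution"].toArray() for row in sample])

    vis = pyLDAvis.prepare(
        topic_term_dists=topic_term,
        doc_topic_dists=doc_topic,
        doc_lengths=term_counts.sum(axis=1),
        vocab=cv_model.vocabulary,
        term_frequency=term_counts.sum(axis=0),
    )
    pyLDAvis.save_html(vis, "lda_topics.html")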
22. Overview ✓ CORD-19 ✓ Text Processing ✓ Topic Modeling ● Entity Extraction ○ Graph Building
23. Entity Extraction ❏ Dictionary-based extraction ❏ Aho-Corasick algorithm ❏ Requires a dictionary / wordlist ❏ Model-based extraction ❏ Deep learning is common in recent models ❏ Conditional random fields were the common choice until about five years ago
24. Entity Extraction ❏ Aho-Corasick algorithm ❏ Efficiently searches for a large number of patterns ❏ Pro: no training data needed, only a list of names and aliases ❏ Con: can only find entities in the alias list ❏ Con: does not use context, e.g. APRIL is a protein name (UniProt)
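A small sketch of dictionary matching with the pyahocorasick package; the alias dictionary below is a toy stand-in for the disease, ChEMBL, and UniProt lists described earlier:

    import ahocorasick  # the pyahocorasick package

    # Toy alias -> canonical ID dictionary.
    aliases = {"influenza": "disease/influenza", "remdesivir": "compound/remdesivir"}

    automaton = ahocorasick.Automaton()
    for alias, canonical_id in aliases.items():
        automaton.add_word(alias.lower(), (canonical_id, alias))
    automaton.make_automaton()

    text = "Remdesivir has been evaluated against influenza and other viral diseases."
    for end_ix, (canonical_id, alias) in automaton.iter(text.lower()):
        start_ix = end_ix - len(alias) + 1
        # In practice you would also check token boundaries to avoid partial-word matches.
        print(canonical_id, start_ix, end_ix)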
25. Entity Extraction ❏ Model-based approach ❏ Predict which tokens are part of a reference to an entity ❏ Pro: identifies phrases based on context ❏ Pro: can be tuned to different data sets ❏ Con: requires training data; deep learning models require a lot of data ❏ Example (token → BIO label): The → O, influenza → B-DISEASE, virus → I-DISEASE, is → O, divided → O, into → O, different → O, types → O, and → O, subtypes → O
26. Entity Extraction ❏ Bootstrapping ❏ Output the tokens with the entity labels ❏ BIO format ❏ Tokenization must be consistent ❏ CRF ❏ Fewer open-source implementations ❏ Deep learning ❏ State of the art ❏ Requires large amounts of data ❏ Slow to run without a GPU
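A simplified sketch of the bootstrapping idea: generate BIO-tagged training examples by greedy longest-match of dictionary phrases against the pipeline's tokens (the function and label names are illustrative, not the talk's code):

    def bio_tags(tokens, dictionary_phrases, label="DISEASE"):
        """Greedy longest-match BIO tagging of tokens against a phrase list."""
        phrases = {tuple(p.lower().split()) for p in dictionary_phrases}
        max_len = max((len(p) for p in phrases), default=1)
        tags = ["O"] * len(tokens)
        i = 0
        while i < len(tokens):
            match_len = 0
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                if tuple(t.lower() for t in tokens[i:i + n]) in phrases:
                    match_len = n
                    break
            if match_len:
                tags[i] = "B-" + label
                for j in range(i + 1, i + match_len):
                    tags[j] = "I-" + label
                i += match_len
            else:
                i += 1
        return tags

    tokens = "The influenza virus is divided into different types and subtypes".split()
    print(list(zip(tokens, bio_tags(tokens, ["influenza virus"]))))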
27. Overview ✓ CORD-19 ✓ Text Processing ✓ Topic Modeling ✓ Entity Extraction ● Graph Building
28. Graph Building ❏ Heuristic vs model ❏ Relationship extraction data sets are rare compared to NER data sets ❏ Creating labels requires experts ❏ Heuristics with labels ❏ Stated relationships may span multiple sentences ❏ Certain styles of language are excessively verbose, especially academic language
29. Graph Building ❏ Entity co-occurrence ❏ Context: document or sentence ❏ Weight: binary, co-occurrence count, or TF.IDF
30. Graph Building
    import numpy as np

    N = ...  # number of docs
    M = ...  # number of unique entities

    # N x M matrix, X[i, j] = number of times entity j occurs in doc i
    X = ...
    B = X > 0

    # co-occurrence matrix:
    # C[i, j] = number of docs with entities i and j
    # C[i, i] = number of docs with entity i
    C = np.dot(B.T, B)
    edge_ixs = np.argwhere(C > 0)
    # keep each undirected edge once and drop self-loops
    edge_ixs = [(i, j) for i, j in edge_ixs if i < j]

    # TF.IDF scores per (doc, entity)
    DF = B.sum(axis=0)
    TF = np.log2(1 + X)
    IDF = np.log2(N / DF)
    TFIDF = np.multiply(TF, IDF)
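As a follow-up to the slide's code, one possible way (an assumption, not necessarily the weighting Orpheus uses) to attach the weights from slide 29 to each edge:

    # Build a weighted edge list from the matrices above.
    edges = []
    for i, j in edge_ixs:
        edges.append({
            "src": int(i),
            "dst": int(j),
            "binary": 1,
            "cooccurrence": int(C[i, j]),
            # shared TF.IDF relevance across documents; one of several plausible TF.IDF-based weights
            "tfidf_weight": float(np.minimum(TFIDF[:, i], TFIDF[:, j]).sum()),
        })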
31. Overview ✓ CORD-19 ✓ Text Processing ✓ Topic Modeling ✓ Entity Extraction ✓ Graph Building
32. Full pipeline ❏ Pre-calculate data for the LDA visualization ❏ Load articles + entities + topics into a search index (Elasticsearch) ❏ Load entities and edges into a graph database (OrientDB, Amazon Neptune)
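A rough sketch of the search-index loading step with the official Python Elasticsearch client; the index name, document schema, and cluster address are assumptions:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Example documents; in the real pipeline these come from the Spark job's output.
    articles = [
        {"id": "doc-1", "title": "Example article", "text": "Full text here.",
         "entities": ["influenza"], "topics": [0.1, 0.9]},
    ]

    actions = (
        {
            "_index": "cord19_articles",  # assumed index name
            "_id": art["id"],
            "_source": {k: art[k] for k in ("title", "text", "entities", "topics")},
        }
        for art in articles
    )
    helpers.bulk(es, actions)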
33. Demo
34. Key Takeaways. Biomedical scientists are drowning in dark data (unlabeled and/or unstructured). NLP is the lens that helps bring this data into focus. Knowledge graphs are the Hubble Space Telescope of the biomedical domain.
35. Feedback. Your feedback is important to us. Don’t forget to rate and review the sessions.
