Deep Learning for Domain-Specific Entity Extraction from Unstructured Text with Zoran Dzunic and Mohamed AbdelHady

Databricks
Mohamed AbdelHady, Microsoft AI Platform
Zoran Dzunic, Microsoft AI Platform
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text
#DL1SAIS
Goals
• What is entity extraction?
• When to train a custom entity extraction model?
• What are word embeddings?
• How to train a custom word embedding model on a Spark cluster?
• How to train a custom Deep Neural Network for entity extraction?
Entity Extraction
• Subtask of information extraction
• Also known as named-entity recognition (NER), entity chunking, and entity identification
• Find phrases in text that refer to real-world entities of specific types
Zoran and Mohamed are at Spark+AI Summit in San Francisco.
Zoran : PERSON
Mohamed : PERSON
Spark+AI Summit : ORG
San Francisco : LOC
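As a rough illustration of the example above, entity annotations are often encoded as one tag per token using the BIO scheme (B- opens an entity, I- continues it, O is outside any entity). The whitespace tokenization and the token offsets below are assumptions made for this sketch, not something the slides specify.

```python
# Illustrative only: encode the slide's example sentence as token-level BIO tags.
sentence = "Zoran and Mohamed are at Spark+AI Summit in San Francisco ."
tokens = sentence.split()  # assumption: simple whitespace tokenization

# (start_token, end_token_exclusive, type), following the slide's annotations
entities = [(0, 1, "PERSON"), (2, 3, "PERSON"), (5, 7, "ORG"), (8, 10, "LOC")]


def to_bio(n_tokens, spans):
    """Return one BIO tag per token for the given entity spans."""
    tags = ["O"] * n_tokens
    for start, end, entity_type in spans:
        tags[start] = "B-" + entity_type
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type
    return tags


bio_tags = to_bio(len(tokens), entities)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

An entity extractor then becomes a sequence labeling model that predicts one such tag for each token.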
Biomedical Entity Extraction
• Entity types: drug/chemical, disease, protein, DNA, etc.
• Critical step for complex biomedical NLP tasks:
– Extraction of diseases and symptoms from electronic medical or health records
– Understanding the interactions between different entity types, such as drug-drug interaction, drug-disease relationship, and gene-protein relationship, e.g.,
• Drug A cures Disease B.
• Drug A causes Disease B.
• Similar for other domains (e.g., legal, finance)
Demo
https://medicalentitydemo.azurewebsites.net
Approach
1. Feature Extraction Phase – Domain-Specific Features
Use a large corpus of unlabeled domain-specific text, such as Medline PubMed abstracts, to train a neural word embedding model.
2. Model Training Phase – Domain-Specific Model
Use the output embeddings as automatically generated features to train a neural entity extractor on a small or reasonable amount of labeled data.
Word Embedding
A continuous vector representation of words that captures their semantics
Input: Words
Words: Naloxone reverses the
Labels: B-Chemical O O
Embedding: [0.3, 0.2, 0.9 …] [0.8, 0.8, 0.1 …] [0.5, 0.1, 0.5 …]
Features: Word Embeddings
Embedding dimension is small (e.g., 50, 200)
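A minimal sketch of the word-to-feature step, assuming a toy three-dimensional lookup table: each token is replaced by its embedding vector, and the label sequence stays aligned with the tokens. The vector values echo the numbers shown on the slide; the zero vector for unknown words is an assumption made for this example.

```python
# Toy embedding table. The three components echo the numbers on the slide;
# a real model would use a larger dimension (e.g., 50 or 200).
embeddings = {
    "naloxone": [0.3, 0.2, 0.9],
    "reverses": [0.8, 0.8, 0.1],
    "the": [0.5, 0.1, 0.5],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: out-of-vocabulary words map to zeros


def featurize(tokens):
    """Map each token to its embedding vector (case-insensitive lookup)."""
    return [embeddings.get(token.lower(), UNKNOWN) for token in tokens]


tokens = ["Naloxone", "reverses", "the"]
labels = ["B-Chemical", "O", "O"]
features = featurize(tokens)
for token, vector, label in zip(tokens, features, labels):
    print(token, vector, label)
```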
Custom Word Embeddings
• Publicly available pre-trained models, such as Google News
• Can we do better on a specific domain?
• We trained a word embedding model for the biomedical domain on 27 million PubMed abstracts (22 GB)
• Azure HDInsight Spark cluster, 11 worker nodes
• Spark MLlib Word2Vec
• Trained in ~30 min
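The training step above could look roughly like the following sketch. It assumes the DataFrame-based `pyspark.ml.feature.Word2Vec` estimator (the talk only says "Spark MLlib Word2Vec"), an input text file with one abstract per line, and placeholder hyperparameters; none of these details come from the slides.

```python
# Hypothetical settings; the talk does not list its Word2Vec hyperparameters.
W2V_PARAMS = {"vectorSize": 50, "minCount": 5, "windowSize": 5}


def train_word2vec(spark, abstracts_path):
    """Fit Spark ML Word2Vec on a text file with one abstract per line.

    `spark` is an existing SparkSession; `abstracts_path` is a hypothetical
    location of the unlabeled corpus.
    """
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.ml.feature import Word2Vec
    from pyspark.sql.functions import col, lower, split

    docs = (
        spark.read.text(abstracts_path)  # one row per line, in column "value"
        .select(split(lower(col("value")), r"\s+").alias("tokens"))
    )
    word2vec = Word2Vec(inputCol="tokens", outputCol="doc_vector", **W2V_PARAMS)
    model = word2vec.fit(docs)
    return model.getVectors()  # DataFrame with columns "word" and "vector"
```

On a cluster, the returned DataFrame can be written out and later loaded as the lookup table that feeds the entity extraction model.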
DNNs for Entity Extraction
Why Deep Learning?
DNN Architecture
• Keras with TensorFlow
• GPU-enabled Azure Data Science VM (DSVM), NC6 Standard (56 GB, NVIDIA Tesla K80), or Deep Learning VM (DLVM)
• Parameters
– Number of recurrent units = 150
– Dropout rate = 0.2
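The following is a sketch of one way such a tagger might be assembled in Keras, not the architecture from the slide (which was shown as a figure). Only the 150 recurrent units and the 0.2 dropout rate come from the talk; the single unidirectional LSTM layer, the placement of dropout, the frozen embedding layer, the padding index, the optimizer, and the loss are all assumptions.

```python
RECURRENT_UNITS = 150  # from the slide
DROPOUT_RATE = 0.2     # from the slide


def build_tagger(embedding_matrix, n_tags):
    """Build an LSTM sequence tagger on top of frozen pre-trained embeddings.

    `embedding_matrix` is a NumPy array of shape (vocab_size, embedding_dim),
    for example exported from a Word2Vec model; `n_tags` is the number of
    distinct token labels.
    """
    # Imported lazily so the constants above can be inspected without TensorFlow.
    import tensorflow as tf
    from tensorflow.keras import layers

    vocab_size, embedding_dim = embedding_matrix.shape
    model = tf.keras.Sequential([
        layers.Embedding(
            vocab_size,
            embedding_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False,
            mask_zero=True,  # assumption: index 0 is reserved for padding
        ),
        layers.LSTM(RECURRENT_UNITS, return_sequences=True),
        layers.Dropout(DROPOUT_RATE),
        # One softmax over the tag set at every time step.
        layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```

Training would then call `model.fit` on padded sequences of word indices with one integer tag per token.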
Results
Datasets
• Proteins, Cell Line, Cell Type, DNA and RNA Detection
Bio-Entity Recognition Task at BioNLP/NLPBA 2004
- http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html
• Chemicals and Diseases Detection
BioCreative V CDR task corpus
- http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
• Drugs Detection
SemEval 2013 - Task 9.1 (Drug Recognition)
- https://www.cs.york.ac.uk/semeval-2013/task9/
Dataset Description
http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
Experimental Setup
• Azure ML Python Package for Text Analytics
https://docs.microsoft.com/en-us/python/api/overview/azure-machine-learning/textanalytics
https://aka.ms/aml-packages/text/download
Conditional Random Fields (CRF)
CRFSuite:
• Extract traditional features
• Train CRF model
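The slides do not list which traditional features were used, so the sketch below only illustrates the general idea: each token is described by a dictionary of hand-crafted features (casing, suffix, neighboring words), and a CRF is trained on sequences of such dictionaries. Every feature name here is a hypothetical choice made for this example.

```python
def token_features(tokens, i):
    """Hand-crafted features for the token at position i."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
    }
    # Context features from the neighboring tokens, with sentence-boundary flags.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


tokens = ["Naloxone", "reverses", "the"]
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
print(sentence_features[0])
```

Unlike word embeddings, each of these features has to be designed and tuned by hand for the domain.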
Results (exact match)
Algorithm + Features               Recall   Precision   F-score
Dictionary Lookup                  64%      74%         68%
CRF: Traditional Features          61%      81%         70%
CRF: Pubmed Embedding              40%      61%         48%
CRF: Traditional + Pubmed Embed.   65%      80%         71%
LSTM: Pubmed Embedding             76%      77%         76%
LSTM: Generic Embeddings           74%      63%         67%
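For readers unfamiliar with the metrics in the table, here is a small sketch of how exact-match scores are commonly computed. It assumes that a predicted entity counts as correct only when its boundaries and type both match a gold entity, and that the F-score is the harmonic mean of precision and recall; the spans are made-up toy data, not taken from the evaluation.

```python
def exact_match_scores(gold, predicted):
    """Return (precision, recall, f_score) for sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: the second prediction has the right type but the wrong boundary,
# so it does not count under exact matching.
gold = [(0, 1, "Chemical"), (5, 7, "Disease")]
predicted = [(0, 1, "Chemical"), (5, 6, "Disease"), (9, 10, "Chemical")]
precision, recall, f_score = exact_match_scores(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```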
Embedding Comparison
Takeaways
• Recipe for building a custom entity extraction pipeline:
– Get a large amount of in-domain unlabeled data
– Train a word2vec model on unlabeled data on Spark
– Get as much labeled data as possible
– Train an LSTM-based neural network on a GPU-enabled machine
• Word embeddings are powerful features
– Convey word semantics
– Perform better than traditional features
– No feature engineering
• An LSTM neural network is a more powerful model than a traditional CRF
Questions