Creating Knowledge Bases from Text in the Absence of Training Data
Sanghamitra Deb
Accenture Technology Laboratory
Typical Business Process
[Figure labels: Executive Summary; Business Decisions; hours of knowledge curation by experts]
The Generalized Approach of Extracting Text: Parsing
Pipeline: Tokenization, Normalization, Parsing, Lemmatization
Tokenization: separating sentences and words, removing special characters, detecting phrases.
Normalization: lowercasing words, word-sense disambiguation.
Parsing: detecting parts of speech (nouns, verbs, etc.).
Lemmatization: reducing plurals and other word forms to a single dictionary form.
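The preprocessing steps above can be sketched with the Python standard library alone. This is a toy illustration, not the pipeline used in the talk: the regular expressions, the function names and the tiny LEMMAS dictionary are all assumptions made for the example, and a real system would rely on a library such as nltk (listed in the requirements later in the deck). The parsing step (POS tagging) is left out because it needs a trained tagger.

```python
import re

# Tiny hand-made lemma dictionary (an assumption for illustration only;
# a real pipeline would use a full lexicon).
LEMMAS = {
    "disorders": "disorder",
    "symptoms": "symptom",
    "caused": "cause",
    "treating": "treat",
}


def split_sentences(text):
    """Tokenization, step 1: split text into sentences after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Tokenization, step 2: keep words (hyphens allowed), drop special characters."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)


def normalize(tokens):
    """Normalization: lowercase every token."""
    return [t.lower() for t in tokens]


def lemmatize(tokens):
    """Lemmatization: map known word forms to their dictionary form."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Run the steps in order and return one token list per sentence."""
    return [lemmatize(normalize(tokenize(s))) for s in split_sentences(text)]


if __name__ == "__main__":
    sample = (
        "It is indicated for treating respiratory disorder caused due to allergy. "
        "For the relief of symptoms of depression."
    )
    for tokens in preprocess(sample):
        print(tokens)
```

The sample text is taken from the FDA drug label sentences shown later in the deck.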
The Generalized Approach of Extracting Text: ML
Extract sentences that contain the specific attribute.
POS-tag and extract unigrams, bigrams and trigrams centered on nouns.
Extract features: words around nouns (bag of words / word vectors), position of the noun, and length of the sentence.
Train a machine learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
Map training data to create a balanced positive and negative training set.
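The n-gram and feature-extraction steps might look roughly like the sketch below. It assumes the sentence has already been POS-tagged upstream (the tags here are written by hand), it approximates "centered on nouns" as "contains at least one noun", and the function names and the window size of 2 are illustrative choices, not details from the talk. Training the classifier itself is omitted.

```python
def noun_ngrams(tagged, max_n=3):
    """Return (start, words) for every 1..max_n-gram containing a noun.

    `tagged` is a list of (word, pos_tag) pairs; tags starting with "NN"
    are treated as nouns (Penn Treebank convention, assumed here).
    """
    spans = []
    for n in range(1, max_n + 1):
        for start in range(len(tagged) - n + 1):
            window = tagged[start:start + n]
            if any(tag.startswith("NN") for _, tag in window):
                spans.append((start, tuple(word for word, _ in window)))
    return spans


def candidate_features(tokens, start, n, window=2):
    """Bag-of-words context plus position and sentence length for one n-gram."""
    left = tokens[max(0, start - window):start]
    right = tokens[start + n:start + n + window]
    features = {"position": start, "sentence_length": len(tokens)}
    for word in left + right:
        key = "context=" + word
        features[key] = features.get(key, 0) + 1
    return features


if __name__ == "__main__":
    # Hand-tagged version of "For the relief of symptoms of depression."
    tagged = [
        ("for", "IN"), ("the", "DT"), ("relief", "NN"), ("of", "IN"),
        ("symptoms", "NNS"), ("of", "IN"), ("depression", "NN"),
    ]
    tokens = [word for word, _ in tagged]
    for start, words in noun_ngrams(tagged):
        print(words, candidate_features(tokens, start, len(words)))
```

Each feature dictionary could then be vectorized and passed to any standard classifier, which is where the labeled training set becomes necessary.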
How do we generate this training data?
A Different Approach
Snorkel (Stanford): replaces training data by encoding domain knowledge.
The Snorkel Approach to Entity Extraction
Extract sentences that contain the specific attribute.
POS-tag and extract unigrams, bigrams and trigrams centered on nouns.
Write rules: encode your domain knowledge into rules.
Validate rules: coverage, conflicts, accuracy.
Run learning: logistic regression, LSTM, …
Examine a random set of candidates and create new rules.
Observe the rules with the lowest accuracy (highest conflict) and edit them.
Iterate.
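To make the "write rules" step concrete, here is a minimal sketch of rule functions that vote 1 (relation holds), -1 (relation does not hold) or 0 (abstain) on a drug-disease candidate. It does not use the Snorkel API: the candidate dictionaries, the cue phrases, the rule names and the simple majority vote are all assumptions for illustration, and Snorkel itself combines the votes with a learned model rather than a plain majority. The lithium sentence is made up for the example; the ritalin sentence comes from the clinical trials data in this deck.

```python
import re

# Generic words that are unlikely to be diseases (assumed list).
GENERIC_WORDS = {"individual", "maintenance", "patient"}


def rule_indication_cue(candidate):
    """Vote 1 when the sentence contains a typical indication phrase."""
    pattern = r"\b(indicated for|treatment of|relief of)\b"
    return 1 if re.search(pattern, candidate["sentence"].lower()) else 0


def rule_side_effect_cue(candidate):
    """Vote -1 when the sentence reads like an adverse-event report."""
    pattern = r"\b(suffering from|complication)\b"
    return -1 if re.search(pattern, candidate["sentence"].lower()) else 0


def rule_generic_word(candidate):
    """Vote -1 when the 'disease' is a generic word rather than a condition."""
    return -1 if candidate["disease"].lower() in GENERIC_WORDS else 0


RULES = [rule_indication_cue, rule_side_effect_cue, rule_generic_word]


def apply_rules(candidate):
    """Collect one vote per rule for a single candidate."""
    return [rule(candidate) for rule in RULES]


def majority_vote(votes):
    """Combine votes: sign of the sum, 0 when rules abstain or cancel out."""
    total = sum(votes)
    return (total > 0) - (total < 0)


if __name__ == "__main__":
    made_up = "Lithium carbonate is indicated for the treatment of bipolar disorder."
    from_deck = ("We believe that the number of persons suffering from pancreatitis "
                 "due to the use of ritalin is more than this published case.")
    candidates = [
        {"drug": "lithium carbonate", "disease": "bipolar disorder", "sentence": made_up},
        {"drug": "ritalin", "disease": "pancreatitis", "sentence": from_deck},
        {"drug": "lithium carbonate", "disease": "individual", "sentence": made_up},
    ]
    for c in candidates:
        votes = apply_rules(c)
        print(c["drug"], "/", c["disease"], votes, majority_vote(votes))
```

The third candidate shows a conflict: one rule votes for the relation and another against it, so the combined label abstains. Conflicts like this are what the validate-and-edit loop is meant to surface.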
Data Dive: FDA Drug Labels
It is indicated for treating respiratory disorder caused due to allergy.
For the relief of symptoms of depression.
Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:
When oral therapy is not feasible and the strength , dosage form , and route of administration of the drug reasonably lend the preparation to the treatment of the condition
Data Dive: Clinical Trials Data
We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd).
The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case.
Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.
Final Goal: Entity and Relationship Extraction
Data                  Dosage  Drug  Treats Disease  Side Effects  Age  Gender  Ethnicity  Duration
10-year-old           0       0     0               0             1    0       0          0
pancreatitis-ritalin  0       0     0               1             0    0       0          0
adhd-ritalin          0       0     1               0             0    0       0          0
ritalin               0       1     0               0             0    0       0          0
30 mg                 1       0     0               0             0    0       0          0
past three weeks      0       0     0               0             0    0       0          1
boy                   0       0     0               0             0    1       0          0
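The target table is a one-hot encoding: each extracted span gets a 1 in the column of its entity or relationship type and 0 elsewhere. A small sketch of that representation follows; the LABELS order mirrors the table columns, while the function name and the list-of-pairs input format are assumptions made for the example.

```python
# Column order taken from the table above.
LABELS = ["Dosage", "Drug", "Treats Disease", "Side Effects",
          "Age", "Gender", "Ethnicity", "Duration"]


def one_hot(label):
    """Return a 0/1 row with a single 1 in the column named `label`."""
    if label not in LABELS:
        raise ValueError("unknown label: " + label)
    return [1 if name == label else 0 for name in LABELS]


if __name__ == "__main__":
    # (extracted span, label) pairs from the clinical case report above.
    extractions = [
        ("10-year-old", "Age"),
        ("pancreatitis-ritalin", "Side Effects"),
        ("adhd-ritalin", "Treats Disease"),
        ("ritalin", "Drug"),
        ("30 mg", "Dosage"),
        ("past three weeks", "Duration"),
        ("boy", "Gender"),
    ]
    for span, label in extractions:
        print(span.ljust(22), one_hot(label))
```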
Candidate Extraction
Using domain knowledge and language structure, collect a candidate set with high recall and low precision. Typically this set should have 80% recall and 20% precision.
60% accuracy: too specific, needs to be made more general.
30% accuracy: this looks fine.
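A deliberately loose candidate generator, scored against a hand-labeled gold set, illustrates the high-recall, low-precision target. Everything here is an assumption for the sketch: the drug dictionary, the stopword list, the pairing of each drug mention with every remaining unigram and bigram, and the one-pair gold set. The sentence fragment is taken from the clinical trials data above.

```python
def extract_candidates(tokens, drug_names, stopwords, max_n=2):
    """Pair every known drug mention with every stopword-free n-gram."""
    drugs = [t for t in tokens if t in drug_names]
    phrases = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip n-grams that contain a stopword or the drug itself.
            if any(t in stopwords or t in drug_names for t in gram):
                continue
            phrases.add(" ".join(gram))
    return {(drug, phrase) for drug in drugs for phrase in phrases}


def precision_recall(candidates, gold):
    """Precision and recall of a candidate set against gold pairs."""
    true_positives = len(candidates & gold)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    tokens = "persons suffering from pancreatitis due to the use of ritalin".split()
    candidates = extract_candidates(
        tokens,
        drug_names={"ritalin"},
        stopwords={"from", "due", "to", "the", "of"},
    )
    gold = {("ritalin", "pancreatitis")}  # hand-labeled, for illustration
    print(sorted(candidates))
    print(precision_recall(candidates, gold))
```

On this fragment the generator produces five candidates, one of which is correct, so recall is 1.0 and precision is 0.2. The rule functions are then responsible for separating the true pair from the other four.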
Automated Features:
pos-tags
context
dep-tree
char-offsets
Rule Functions
Testing Rule Functions:
Rule Functions Output
[Figure: expected output vs. real output of the rule functions over the values -1, 0 and 1; axis scale 0 to 100]
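Testing rule functions comes down to the three diagnostics named earlier: coverage, conflicts and accuracy. The sketch below computes them from a matrix of votes (rows are candidates, columns are rules, values are -1, 0 or 1) and a small gold-labeled development set. The matrix, the gold labels and the function name are made up for illustration; this is not Snorkel's own reporting code.

```python
def rule_stats(label_matrix, gold):
    """Per-rule coverage, conflict rate and accuracy on labeled candidates.

    coverage: fraction of candidates on which the rule does not abstain
    conflict: fraction of candidates on which the rule votes and some
              other rule casts the opposite vote
    accuracy: fraction of the rule's non-abstaining votes that match gold
              (None if the rule never votes)
    """
    n_candidates = len(label_matrix)
    n_rules = len(label_matrix[0])
    stats = []
    for j in range(n_rules):
        voted = [i for i in range(n_candidates) if label_matrix[i][j] != 0]
        conflicts = [
            i for i in voted
            if any(v != 0 and v != label_matrix[i][j] for v in label_matrix[i])
        ]
        correct = [i for i in voted if label_matrix[i][j] == gold[i]]
        stats.append({
            "coverage": len(voted) / n_candidates,
            "conflict": len(conflicts) / n_candidates,
            "accuracy": len(correct) / len(voted) if voted else None,
        })
    return stats


if __name__ == "__main__":
    # Four candidates, three rules (illustrative values).
    label_matrix = [
        [1, 0, 0],
        [0, -1, 0],
        [1, 0, -1],
        [0, 0, 0],
    ]
    gold = [1, -1, -1, -1]
    for j, s in enumerate(rule_stats(label_matrix, gold)):
        print("rule", j, s)
```

A rule with low accuracy or a high conflict rate, like the first rule in this toy matrix, is the natural candidate to inspect and edit in the next iteration.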
Results and Performance
drug-name          disease candidate  Candidates  snorkel
Lithium Carbonate  bipolar disorder   1           1
Lithium Carbonate  individual         1           0
Lithium Carbonate  maintenance        1           0
Lithium Carbonate  manic episode      1           1
Precision and recall ~70%
Why Docker?
• Portability: develop here, run there (internal clusters, AWS, Google Cloud, etc.); reusable by team and clients
• Isolation: OS and Docker isolated from bugs.
• Fast
• Easy virtualization: hardware emulation, virtualized OS.
• Lightweight
Python stack on Docker
Dockerfile:
FROM ubuntu:latest
# MAINTAINER Sanghamitra Deb <sangha123@gmail.com>
# Build-time message; RUN is used because CMD would only set the container's default command
RUN echo Installing Accenture Tech Labs Scientific Python Enviro
# Refresh the package index before installing anything
RUN apt-get update && apt-get upgrade -y
RUN apt-get install python -y
RUN apt-get install curl -y
RUN apt-get install emacs -y
# Install pip from the bootstrap script
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN rm get-pip.py
RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
# Build tools and native libraries needed by the scientific Python packages
RUN apt-get install python-setuptools build-essential python-dev -y
RUN apt-get install gfortran swig -y
RUN apt-get install libatlas-dev liblapack-dev -y
RUN apt-get install libfreetype6 libfreetype6-dev -y
RUN apt-get install libxft-dev -y
RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev -y
RUN apt-get install python-numpy -y
# Install the Python packages listed in requirements.txt
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -q
requirements.txt:
scipy
matplotlib
ipython
jupyter
pandas
Bottleneck
patsy
pymc
statsmodels
scikit-learn
BeautifulSoup
seaborn
gensim
fuzzywuzzy
xmltodict
untangle
nltk
flask
enum34
Building the Dockerfile:
docker build -t sangha/python .
docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
docker exec -it pharmaExtraction0.1 bash
docker exec -d pharmaExtraction0.1 python /root/pycodes/rest_api.py
Typical ML pipeline vs Snorkel
(1) Candidate Extraction.
(2) Rule Function
(3) Hyperparameter tuning
Snorkel:
Pros:
• Very little training data necessary
• Do not have to think about feature generation
• Do not need deep knowledge in Machine Learning
• Convenient UI for data annotation
• Creates structured databases from unstructured text
Cons:
• Code is getting refactored very rapidly and frequently.
• Not much transparency in the internal workings.
Applicability across a variety of industries and use cases
Banks: Loan Approval
Paleontology
Design of Clinical Trials
Legal Investigation
Market Research Reports
Human Trafficking
Inventory Management
Content Marketing
Product descriptions and reviews
Pharmaceutical Industry
Where to get it?
https://github.com/HazyResearch/snorkel
http://arxiv.org/pdf/1512.06474v2.pdf
