Application of Manual Document Classification Algorithm
Ran Zhou with Mentor: Peter Merrill
Data
The data we are using for training the document classification
algorithm have the following features:
• Five types of document instances
• Binary labels – Yes/No
• Two data formats
    • PDF files
        o Get the text part by optical character recognition (OCR)
        o Get the labels from the curated database through a link variable
    • Curated database
• Relatively small data size
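The PDF route described above (OCR text joined to curated labels through a link variable) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `attach_labels`, the `doc-00x` keys, and the toy records are hypothetical and do not come from the poster's data or code.

```python
from typing import Dict, List, Tuple


def attach_labels(
    ocr_texts: Dict[str, str],
    curated_labels: Dict[str, str],
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Join OCR text to curated labels on a shared link key.

    Returns (matched, unmatched). Each matched row is (link_id, text, label);
    unmatched lists link ids that have OCR text but no usable Yes/No label.
    """
    matched: List[Tuple[str, str, str]] = []
    unmatched: List[str] = []
    for link_id, text in sorted(ocr_texts.items()):
        label = curated_labels.get(link_id)
        if label in ("Yes", "No"):
            matched.append((link_id, text.strip(), label))
        else:
            unmatched.append(link_id)
    return matched, unmatched


# Hypothetical toy records standing in for OCR output and the curated database.
ocr_texts = {
    "doc-001": " first document text ",
    "doc-002": "second document text",
    "doc-003": "text with no curated label",
}
curated_labels = {"doc-001": "Yes", "doc-002": "No"}

matched, unmatched = attach_labels(ocr_texts, curated_labels)
```

Keeping the unmatched ids separate makes it easy to see how many PDF instances lack a label before any training starts.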
2018 DUKE RESEARCH COMPUTING SYMPOSIUM POSTER COMPETITION
Introduction
Manual document classification is the tedious process of classifying
document instances by hand. It involves two steps:
• Instances that need to be labeled/classified are identified
• Those document instances are labeled/classified according to
rigorous, predefined criteria
Acknowledgements
Leadership Team:
Renato Lopes, MD, MHS, PhD
Schuyler Jones, MD
Michael Pencina, PhD
Larry Carin, PhD
Lisa Wruck, PhD
Shelley Rusincovitch, MM
References
Dinghan Shen et al., "On the use of word embeddings alone to
represent natural language sequences," 2017 (submitted to ICLR)
Contact Information
Ran Zhou
Master's student in the Department of Statistical Science at Duke
E-mail: rz69@duke.edu
Goal
The goal is to find an algorithm that simulates the manual classification
process. The algorithm will be able to:
• Take text data as input and automatically output a classification
label.
• Reduce the mistakes made during manual classification and
improve accuracy.
• Improve the efficiency of the classification process by replacing
manual classification with an automated one.
Figure 4. Data Components
Figure 3. Curated Database Summary
• Total instance count: 11000
• Average text length: 44 (min: 1, max: 1117)
Figure 2. PDF File Summary
• Total instance count: 11033
• Average text length: 1274 (min: 3, max: 24984)
Model
Figure 1. Classification Algorithm
[Diagram: an example text input (“At the DCRI, our values honor our history and guide our actions and daily decisions. …”) goes into the Neural Network, which outputs Yes / No.]
Figure 5. Simple Word Embedding Based Model (SWEM)
[Diagram: the input words w_1, w_2, …, w_84 are mapped by the Word Embedding step to vectors e_1, e_2, …, e_84, which together form E; a Simple Neural Network then outputs the label (Yes in the example).]
Classification Results
Figure 6. All Instance Type
Figure 7. Individual Instance Type
Conclusion
• The model substantially improves classification accuracy and computational efficiency compared with
traditional neural network models, recurrent neural networks (RNNs), and convolutional neural networks (CNNs).
• The classification results, both across all instance types and for individual instance types, are much better
than random guessing and than existing algorithms for similar problems.
• Next, we will test the model on the data from the PDF files to further improve the accuracy.
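Instance counts by type
                                     Type 1  Type 2  Type 3  Type 4  Type 5
PDF files (11033 instances)            2082    3950    3013     689    1299
Curated database (11000 instances)     2045    3951    3003     689    1312

Label counts by type (curated database)
Type 1: Yes 833, No 1212
Type 2: Yes 2993, No 958
Type 3: Yes 291, No 2712
Type 4: Sub 1 Yes 369, Sub 2 Yes 107, No 213
Type 5: /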
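As a rough reference point for the "better than random guess" comparison, the label counts above can be turned into a majority-class baseline, i.e. the accuracy of always predicting the more frequent label. This is an added illustration, not a baseline reported on the poster. It reads the Yes/No counts as pairs for Types 1 to 3 (an interpretation that matches the per-type totals) and leaves out Type 4, which has sub-labels, and Type 5, which has no counts.

```python
from typing import Dict

# Yes/No counts for Types 1 to 3, as read from the label counts above.
label_counts: Dict[str, Dict[str, int]] = {
    "Type 1": {"Yes": 833, "No": 1212},
    "Type 2": {"Yes": 2993, "No": 958},
    "Type 3": {"Yes": 291, "No": 2712},
}


def majority_baseline(counts: Dict[str, int]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(counts.values())
    return max(counts.values()) / total


baselines = {doc_type: majority_baseline(c) for doc_type, c in label_counts.items()}

for doc_type, accuracy in baselines.items():
    print(f"{doc_type}: majority-class accuracy = {accuracy:.3f}")
```

Because the classes are imbalanced (Type 3 especially), a classifier has to beat these figures, not just 50%, to show it has learned something from the text.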
Matt Wilson, RN
Team Members:
Ricardo Henao, PhD
Peter Merrill, PhD
Ran Zhou, BS
Lynn Perkins, MBA
SWEM is a neural network algorithm that captures the semantic meaning of both individual words and their
compositions simply through a special embedding method. Unlike traditional neural network models for capturing
the semantic meaning of sentences, SWEM does not require large amounts of memory to train many extra layers.
SWEM has also been shown to give more robust results on small data sets.
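A minimal sketch of the SWEM idea, assuming the pooling variants described in the cited Shen et al. paper (average pooling, max pooling, and their concatenation) followed by a small classifier. The poster does not say which variant or classifier was used, so the toy embedding table, the weights, and the single logistic layer standing in for the "simple neural network" are all hypothetical.

```python
import math
from typing import Dict, List


def embed(tokens: List[str], table: Dict[str, List[float]]) -> List[List[float]]:
    """Look up one embedding vector per known token; unknown tokens are skipped."""
    return [table[t] for t in tokens if t in table]


def swem_aver(vectors: List[List[float]]) -> List[float]:
    """Average pooling: element-wise mean over all word vectors."""
    if not vectors:
        raise ValueError("no known tokens to pool over")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def swem_max(vectors: List[List[float]]) -> List[float]:
    """Max pooling: element-wise maximum over all word vectors."""
    if not vectors:
        raise ValueError("no known tokens to pool over")
    return [max(column) for column in zip(*vectors)]


def swem_concat(vectors: List[List[float]]) -> List[float]:
    """Concatenate the average-pooled and max-pooled features."""
    return swem_aver(vectors) + swem_max(vectors)


def predict_yes_probability(
    features: List[float], weights: List[float], bias: float
) -> float:
    """Single logistic layer, a stand-in for the poster's simple neural network."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 2-dimensional embeddings and weights, for illustration only.
table = {
    "values": [0.5, -1.0],
    "guide": [1.5, 0.0],
    "decisions": [-0.5, 1.0],
}
tokens = "our values guide our decisions".split()

vectors = embed(tokens, table)
features = swem_concat(vectors)
weights = [1.0, 0.0, 0.0, 1.0]
p_yes = predict_yes_probability(features, weights, bias=0.0)
label = "Yes" if p_yes >= 0.5 else "No"
```

The pooling steps have no trainable parameters of their own, so the only things to learn are the embeddings and the small classifier on top. That is consistent with the low memory cost described above.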
