2017 HPCC Systems® Community Day
Needle in a Haystack
Zhe Yu, Tim Menzies
NC State University
Raleigh, NC, US
Advanced text mining with HPCC Systems®
Data Miners Optimizers
Better Decisions
2
Use Cases
Attorneys: which documents are relevant to my case?
60-80% of total cost
Researchers: which papers are relevant to my research?
weeks to months of work
CR
1% ~ 5%
3
Have Done
Issues:
Core algorithm
arXiv:1612.03224
how to start
when to stop
arXiv:1705.05420
Tools:
https://github.com/fastread/src
https://github.com/ai-se/FASTREAD_ECL
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
4
Current Framework
Search API
Download
(software OR applicati* OR systems) AND (fault* OR defect*
OR quality OR error-prone) AND (predict* OR prone* OR
probability OR assess* OR detect* OR estimat* OR classificat*)
KC
5
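To make the search string on the slide above concrete, here is a minimal sketch of how such a query could be checked against a title or abstract locally. It assumes that `*` means "any word ending" and that matching is case-insensitive; the real search API may interpret the query differently. The names `QUERY`, `term_to_regex`, and `matches` are hypothetical.

```python
import re

# The three AND-ed groups of OR-ed terms, copied from the slide.
QUERY = [
    ["software", "applicati*", "systems"],
    ["fault*", "defect*", "quality", "error-prone"],
    ["predict*", "prone*", "probability", "assess*",
     "detect*", "estimat*", "classificat*"],
]

def term_to_regex(term):
    """Compile one query term; a trailing * is treated as a wildcard suffix (assumption)."""
    stem = re.escape(term.rstrip("*"))
    tail = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + stem + tail + r"\b", re.IGNORECASE)

def matches(text):
    """True if every group has at least one term that occurs in the text."""
    return all(
        any(term_to_regex(term).search(text) for term in group)
        for group in QUERY
    )

print(matches("Predicting defects in software systems"))
print(matches("A study of user interfaces"))
```

A query this broad returns many candidates, which is why the rest of the talk focuses on reducing how many of them a human has to read.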
New Framework
Pros:
• simpler search
• no data extraction
• more potential results
• more user involvement
• same techniques
Cons:
• scalability?
• cost to host the service
[Diagram labels: Search API (keywords: defect, prediction), Learn API, Human Review, K]
6
How to start
When to stop
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
Core Algorithm, Human Errors, Scalability
7
Assumptions:
● Random sampling
● Stop review when |RK| ≥ 0.95|R|
● Human makes no error
● Corpus not too large
[Diagram: the Learner selects a document x from U, the reviewer answers "x ∈ R?", and label(x) is added to K to update the Learner.]
8
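The select / label / update loop on the slide above can be sketched in a few lines. This is only an illustration under the slide's own assumptions (random sampling to start, a known |R|, an error-free reviewer, a small corpus). The word-count learner is a toy stand-in chosen to keep the example dependency-free, not the learner used in the cited papers, and the names `active_review` and `oracle` are hypothetical.

```python
import random
from collections import Counter

def active_review(docs, oracle, total_relevant, target_recall=0.95, seed=0):
    """Review docs until |RK| >= target_recall * |R|; returns K as {index: label}."""
    rng = random.Random(seed)
    unlabeled = set(range(len(docs)))  # U
    labeled = {}                       # K
    weights = Counter()                # toy learner: per-word evidence of relevance
    found = 0                          # |RK|
    while unlabeled and found < target_recall * total_relevant:
        candidates = sorted(unlabeled)
        if found == 0:
            # Random sampling until the first relevant document is found.
            x = rng.choice(candidates)
        else:
            # Select the document the toy learner scores as most likely relevant.
            x = max(candidates,
                    key=lambda i: sum(weights[w] for w in docs[i].split()))
        unlabeled.remove(x)
        labeled[x] = oracle(x)         # the reviewer answers "x in R?"
        found += labeled[x]
        for w in docs[x].split():      # update the learner with label(x)
            weights[w] += 1 if labeled[x] else -1
    return labeled

docs = [
    "defect prediction with code metrics",
    "user interface design survey",
    "predicting software defect density",
    "agile team communication study",
    "cross project defect prediction",
    "requirements elicitation interviews",
]
relevant = {0, 2, 4}
result = active_review(docs, lambda i: i in relevant, total_relevant=len(relevant))
print(len(result), "documents read to find", sum(result.values()), "relevant ones")
```

Each of the four assumptions is relaxed in turn later in the talk: how to start, when to stop, scalability, and human errors.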
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
Core Algorithm
9
Core Algorithm
Cormack’14 [1], Wallace’10 [3], Miwa’14 [2]
● When to start? {H, P}
● Query strategy? {U, C}
● Stop training? {S, T}
● Data Balancing? {N, A, W, M}
[1] Cormack, G.V. and Grossman, M.R., 2014, July. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (pp. 153-162). ACM.
[2] Miwa, M., Thomas, J., O’Mara-Eves, A. and Ananiadou, S., 2014. Reducing systematic review workload through certainty-based screening. Journal of biomedical informatics, 51, pp.242-253.
[3] Wallace, B.C., Trikalinos, T.A., Lau, J., Brodley, C. and Schmid, C.H., 2010. Semi-automated screening of biomedical citations for systematic reviews. BMC bioinformatics, 11(1), p.55.
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
Among the 2*2*2*4=32 treatments:
● Wallace’10 [3]: PUSA
● Miwa’14 [2]: PCSW
● Cormack’14 [1]: HCTN
● FAST1 [4]: HUTM
10
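The count of 32 treatments on the slide above is the product of the four design choices. A short sketch that enumerates them, using the letter codes exactly as they appear on the slide (their expansions are not given here, so none are assumed); the names `CHOICES` and `all_treatments` are hypothetical.

```python
from itertools import product

# Letter codes copied from the slide, in the order the questions are listed.
CHOICES = [
    ("H", "P"),            # When to start?
    ("U", "C"),            # Query strategy?
    ("S", "T"),            # Stop training?
    ("N", "A", "W", "M"),  # Data balancing?
]

def all_treatments():
    """Return every four-letter treatment code, one letter per design choice."""
    return ["".join(combo) for combo in product(*CHOICES)]

treatments = all_treatments()
print(len(treatments))  # 2 * 2 * 2 * 4 = 32

# The four named methods from the slide are all points in this space.
for name, code in [("Wallace'10", "PUSA"), ("Miwa'14", "PCSW"),
                   ("Cormack'14", "HCTN"), ("FAST1", "HUTM")]:
    print(name, code, code in treatments)
```

Framing prior methods as points in one grid is what allows them to be compared and recombined without inventing new components.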
Assumptions:
● Random sampling
● Stop review when |RK| ≥ 0.95|R|
● Human makes no error
● Corpus not too large
[Diagram: the Learner selects a document x from U, the reviewer answers "x ∈ R?", and label(x) is added to K to update the Learner.]
11
How to start
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
12
How to start
Cormack’15 [5]
● RANDOM
● Auto-BM25 [5]
● Auto-Syn [5]
● UPDATE [6]
● REUSE [6]
Keywords or previous review data
[5] Cormack, G.V. and Grossman, M.R., 2015. Autonomy and reliability of continuous active learning for technology-assisted review. arXiv preprint arXiv:1504.06868.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
13
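As a rough illustration of the keyword-based start listed above, the sketch below ranks documents by Okapi BM25 score against a short keyword query, so that the reviewer reads the top-ranked documents first instead of sampling at random. It is not the implementation from [5] or [6]: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the name `bm25_rank` are all assumptions made for this example.

```python
import math
from collections import Counter

def bm25_rank(docs, query, k1=1.5, b=0.75):
    """Return document indices ordered from highest to lowest BM25 score."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    terms = query.lower().split()
    # Document frequency of each query term.
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in terms}
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

docs = [
    "defect prediction with static code metrics",
    "a survey of user interface design",
    "software defect models",
]
print(bm25_rank(docs, "defect prediction"))
```

With two or three keywords such as "defect prediction", the first relevant documents tend to surface early, which is the point of replacing the random-sampling start.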
FAST2[6] = FAST1[4] + Auto-BM25 + SEMI
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
14
Assumptions:
● Random sampling
● Stop review when |RK| ≥ 0.95|R|
● Human makes no error
● Corpus not too large
[Diagram: the Learner selects a document x from U, the reviewer answers "x ∈ R?", and label(x) is added to K to update the Learner.]
15
When to stop
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
16
When to stop
Wallace’13 [7]
● Uniform random sampling
● Wallace’13 [7]
● SEMI [6]
● Estimate |R| with
○ labeled data K
○ unlabeled data U
● Stop when |RK| ≥ 0.95|RE|
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
[7] Wallace, B.C., Dahabreh, I.J., Moran, K.H., Brodley, C.E. and Trikalinos, T.A., 2013. Active literature discovery for scoping evidence reviews: How many needles are there. In KDD workshop on data mining for healthcare (KDD-DMH).
17
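The stopping rule above replaces the unknown |R| with an estimate |RE|. A minimal sketch of the check itself, assuming the estimate is produced elsewhere (the estimators from [6] and [7] are not reproduced here); the name `should_stop` and the numbers in the usage lines are made up for illustration.

```python
def should_stop(found_relevant, estimated_relevant, target_recall=0.95):
    """Return True once |RK| >= target_recall * |RE|.

    found_relevant     -- |RK|, relevant documents labeled so far
    estimated_relevant -- |RE|, current estimate of the total number of relevant documents
    """
    return found_relevant >= target_recall * estimated_relevant

# Hypothetical review state: the estimate is refreshed as more labels arrive.
print(should_stop(found_relevant=40, estimated_relevant=60))  # keep reviewing
print(should_stop(found_relevant=58, estimated_relevant=60))  # estimated recall is above 95%
```

Because |RE| is recomputed from K and U as the review proceeds, the rule can be evaluated after every batch of labels without knowing the true |R|.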
FAST2[6] = FAST1[4] + Auto-BM25 + SEMI
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
18
Assumptions:
● Random sampling
● Stop review when |RK| ≥ 0.95|R|
● Human makes no error
● Corpus not too large
[Diagram: the Learner selects a document x from U, the reviewer answers "x ∈ R?", and label(x) is added to K to update the Learner.]
19
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
Scalability
20
Solution?
HPCC Systems®
Experiments?
Preparing data
21
Assumptions:
● Random sampling
● Stop review when |RK| ≥ 0.95|R|
● Human makes no error
● Corpus not too large
[Diagram: the Learner selects a document x from U, the reviewer answers "x ∈ R?", and label(x) is added to K to update the Learner.]
22
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
Human Errors
23
Have Done
Issues:
Core algorithm
arXiv:1612.03224
how to start
when to stop
arXiv:1705.05420
Tools:
https://github.com/fastread/src
https://github.com/ai-se/FASTREAD_ECL
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
24
Thank you! Questions?
25


Editor's Notes

  1. Know the state of the art before trying to improve on it. What are we trying to solve? Business score. How I use ECL. A massive online reading tool. HPCC is our platform.
  2. Before going into technical details, show the final goal.
  3. This work demonstrates how important it is to conduct literature reviews: without inventing anything new, just by refactoring existing methods, we achieve much better performance than the state-of-the-art methods. Explain what these results mean and what this reduction means.
  4. With little domain knowledge applied: a two- or three-keyword search, or data from reviews of the same topic. Prevents runaways; more reduction.
  5. It helps management look ahead and manage the cost and gain.
  6. With little domain knowledge applied: a two- or three-keyword search, or data from reviews of the same topic. Prevents runaways; more reduction.