From the 2017 HPCC Systems Community Day:
How can we keep up with the explosion of data in the 21st century? In many technical, legal, and medical domains, practitioners and researchers struggle to stay on top of a mountain of new reports and results. This is a pressing technical problem (how to stay relevant) as well as a current industrial problem (in the legal domain, LexisNexis is often asked to identify the few relevant documents within a very large space of irrelevancies).
To address this problem, we present and evaluate FASTREAD. Our experience with this tool teaches us three things. Firstly, it shows that we can quickly find (say) 50 relevant documents within 10,000 using some incremental data mining methods. Secondly, it shows that updating our existing knowledge can be remarkably easy if we leverage what was learned from the past. Thirdly, it demonstrates the advantages of ECL's open architecture. FASTREAD is a Python/ECL hybrid that exploits all the tools available in the Python ecosystem while, at the same time, leveraging ECL's big data capabilities.
Zhe Yu
PhD Student, North Carolina State University
Zhe Yu is a second-year Ph.D. student in the Department of Computer Science at North Carolina State University. He is interested in techniques that combine machine learning with human effort to match the quality of human work while reducing the human workload. He is currently applying machine learning and data mining algorithms (especially active learning) to help identify and retrieve relevant information within a very large space of irrelevancies. Zhe Yu has been a research assistant for Dr. Tim Menzies since Aug. 2015 and has collaborated closely with, and interned at, LexisNexis since then. To learn more about him, please visit his webpage at http://azhe825.github.io/.
Needle in a Haystack (Advanced text mining with ECL)
1. 2017 HPCC Systems® Community Day
Needle in a Haystack
Zhe Yu, Tim Menzies
NC State University
Raleigh, NC, US
Advanced text mining with HPCC Systems®
3. Use Cases
Attorneys: which documents are relevant to my case? (60-80% of total cost)
Researchers: which papers are relevant to my research? (weeks to months of work)
CR 1% ~ 5%
4. Have Done
Issues:
Core algorithm: arXiv:1612.03224
How to start, when to stop: arXiv:1705.05420
Tools:
https://github.com/fastread/src
https://github.com/ai-se/FASTREAD_ECL
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
5. Current Framework
Search API
Download
(software OR applicati* OR systems) AND (fault* OR defect* OR quality OR error-prone) AND (predict* OR prone* OR probability OR assess* OR detect* OR estimat* OR classificat*)
KC
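To make the search string on this slide concrete, here is a minimal sketch of how such a boolean query could be evaluated against a title or abstract. It assumes that a trailing `*` means "any word starting with this prefix" and that the three parenthesized groups are ANDed together; the names `QUERY`, `term_to_regex`, and `matches` are hypothetical and are not part of any search API mentioned in the talk.

```python
import re

# The three OR-groups from the slide; all three must match (AND).
QUERY = [
    ["software", "applicati*", "systems"],
    ["fault*", "defect*", "quality", "error-prone"],
    ["predict*", "prone*", "probability", "assess*",
     "detect*", "estimat*", "classificat*"],
]

def term_to_regex(term):
    """Turn one query term into a regex; a trailing * is a prefix wildcard."""
    if term.endswith("*"):
        return r"\b" + re.escape(term[:-1]) + r"\w*"
    return r"\b" + re.escape(term) + r"\b"

def matches(text, query=QUERY):
    """True if every group has at least one term occurring in the text."""
    text = text.lower()
    return all(
        any(re.search(term_to_regex(term), text) for term in group)
        for group in query
    )

print(matches("Predicting defects in software modules"))
print(matches("A survey of software quality"))
```

The first call matches all three groups; the second fails because nothing in the text matches the third group.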
6. New Framework
Search API
Learn API
Human Review
K
defect, prediction
Pros:
• simpler search
• no data extraction
• more potential results
• more user involvement
• same techniques
Cons:
• scalability?
• cost to host the service
7. Issues: Core Algorithm, How to start, When to stop, Human Errors, Scalability
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
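8. [Diagram: active learning loop. The learner selects a document x from the unlabeled pool U, a human answers "x ∈ R?", and label(x) is added to the labeled set K, which updates the learner.]
Assumptions:
● Random sampling
● Stop review when |RK| ≥ 0.95|R|
● Human makes no error
● Corpus not too large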
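The loop on this slide can be sketched in a few lines of Python. This is only an illustration under the slide's own assumptions (random sampling to start, an error-free human, a known |R|, and a corpus small enough to rescore on every step). The learner here is a small naive Bayes scorer swapped in for simplicity, not necessarily the learner used in FASTREAD, and the names `review`, `oracle`, and `total_relevant` are hypothetical.

```python
import math
import random
from collections import Counter

def train(labeled_docs):
    """Count tokens per class from (tokens, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    for tokens, label in labeled_docs:
        counts[label].update(tokens)
    return counts

def score(tokens, counts, vocab_size):
    """Log-odds of relevance under Laplace-smoothed naive Bayes."""
    total_pos = sum(counts[True].values())
    total_neg = sum(counts[False].values())
    s = 0.0
    for t in tokens:
        p = (counts[True][t] + 1) / (total_pos + vocab_size)
        n = (counts[False][t] + 1) / (total_neg + vocab_size)
        s += math.log(p / n)
    return s

def review(corpus, oracle, total_relevant, target=0.95, seed=0):
    """Label documents until |R_K| >= target * |R|; returns K as {index: label}."""
    rng = random.Random(seed)
    docs = [text.lower().split() for text in corpus]
    vocab_size = len({t for d in docs for t in d})
    unlabeled = set(range(len(docs)))   # U
    labeled = {}                        # K
    found = 0                           # |R_K|
    while unlabeled and found < target * total_relevant:
        if found == 0:
            # Random sampling until the first relevant document is found.
            x = rng.choice(sorted(unlabeled))
        else:
            # Update the learner on K, then select the highest-scoring x in U.
            counts = train([(docs[i], y) for i, y in labeled.items()])
            x = max(sorted(unlabeled),
                    key=lambda i: score(docs[i], counts, vocab_size))
        y = bool(oracle(x))             # human answers "x in R?"
        labeled[x] = y
        unlabeled.remove(x)
        found += y
    return labeled
```

Here `oracle` stands in for the human reviewer: any callable that takes a document index and returns whether that document is relevant.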
10. Core Algorithm
Cormack’14 [1], Wallace’10 [3], Miwa’14 [2]
● When to start? {H, P}
● Query strategy? {U, C}
● Stop training? {S, T}
● Data Balancing? {N, A, W, M}
[1] Cormack, G.V. and Grossman, M.R., 2014, July. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (pp. 153-162). ACM.
[2] Miwa, M., Thomas, J., O’Mara-Eves, A. and Ananiadou, S., 2014. Reducing systematic review workload through certainty-based screening. Journal of biomedical informatics, 51, pp.242-253.
[3] Wallace, B.C., Trikalinos, T.A., Lau, J., Brodley, C. and Schmid, C.H., 2010. Semi-automated screening of biomedical citations for systematic reviews. BMC bioinformatics, 11(1), p.55.
[4] Yu, Z., Kraft, N.A. and Menzies, T., 2017. How to Read Less: On the Benefit of Active Learning for Primary Study Selection in Systematic Literature Reviews. arXiv preprint arXiv:1612.03224.
Among the 2*2*2*4=32 treatments:
● Wallace’10 [3]: PUSA
● Miwa’14 [2]: PCSW
● Cormack’14 [1]: HCTN
● FAST1 [4]: HUTM
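The count of 32 follows from picking one letter per design question. A short sketch that enumerates the treatment codes and checks that the four named methods are among them (the variable names are illustrative only):

```python
from itertools import product

# One option set per design question, in the order used on the slide:
# when to start, query strategy, stop training, data balancing.
CHOICES = ["HP", "UC", "ST", "NAWM"]

# Every treatment is a 4-letter code with one letter from each set.
treatments = ["".join(code) for code in product(*CHOICES)]

NAMED = {
    "Wallace'10": "PUSA",
    "Miwa'14": "PCSW",
    "Cormack'14": "HCTN",
    "FAST1": "HUTM",
}

print(len(treatments))
for name, code in NAMED.items():
    print(name, code, code in treatments)
```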
12. How to start
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
13. How to start
Cormack’15 [5]
● RANDOM
● Auto-BM25 [5]
● Auto-Syn [5]
● UPDATE [6]
● REUSE [6]
[5] Cormack, G. and Grossman, M., 2015. Autonomy and reliability of continuous active learning for technology-assisted review. arXiv preprint arXiv:1504.06868.
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
Keywords or previous review data
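One plausible reading of the keyword-based start is to rank the unlabeled documents by BM25 score against a few user-supplied keywords, so that the reviewer sees likely matches first instead of sampling at random. The sketch below implements the standard Okapi BM25 formula with common default parameters (`k1=1.5`, `b=0.75`); these values and the function name `bm25_rank` are assumptions, not details taken from the talk or from [5].

```python
import math
from collections import Counter

def bm25_rank(docs, keywords, k1=1.5, b=0.75):
    """Return document indices sorted by descending BM25 score for the keywords."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each token appears.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for q in keywords:
            if tf[q] == 0:
                continue
            # Non-negative idf variant.
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[q] * (k1 + 1) / norm
        scores.append(s)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "defect prediction for software",
    "pasta recipe",
    "software defect prediction with defect data",
    "travel guide",
]
ranking = bm25_rank(docs, ["defect", "prediction"])
print(ranking)
```

Documents that contain none of the keywords score zero and fall to the bottom of the ranking.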
14. FAST2[6] = FAST1[4] + Auto-BM25 + SEMI
16. When to stop
E-discovery: Cormack’14, Cormack’15, Cormack’16
Medicine: Wallace’10, Wallace’11, Wallace’13, Miwa’14, Wallace’15
17. When to stop
Wallace’13 [7]
● Uniform random sampling
● Wallace’13 [7]
● SEMI [6]
[6] Yu, Z. and Menzies, T., 2017. FAST2: a Better Text Miner for Faster Understanding of the SE Literature. arXiv preprint arXiv:1705.05420.
[7] Wallace, B.C., Dahabreh, I.J., Moran, K.H., Brodley, C.E. and Trikalinos, T.A., 2013. Active literature discovery for scoping evidence reviews: How many needles are there. In KDD workshop on data mining for healthcare (KDD-DMH).
● Estimate |R| with
○ labeled data K
○ unlabeled data U
● Stop when |RK| ≥ 0.95|RE|
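The stopping rule itself is simple once an estimate |RE| of the number of relevant documents is available. The sketch below pairs the rule with the simplest estimator named on this slide, uniform random sampling, where the prevalence in a labeled random sample is scaled up to the corpus size. It does not implement SEMI or the Wallace’13 estimator, and the function names are hypothetical.

```python
def estimate_total_relevant(sample_labels, corpus_size):
    """Estimate |R| by scaling the relevant fraction of a uniform random sample."""
    if not sample_labels:
        raise ValueError("need at least one labeled sample")
    prevalence = sum(sample_labels) / len(sample_labels)
    return prevalence * corpus_size

def should_stop(found_relevant, estimated_total, target=0.95):
    """Stop when |R_K| >= target * |R_E|."""
    return found_relevant >= target * estimated_total

# Example: 1 relevant document in a random sample of 200, corpus of 10,000.
sample = [True] + [False] * 199
estimate = estimate_total_relevant(sample, 10_000)
print(estimate)
print(should_stop(48, estimate))
print(should_stop(47, estimate))
```

With roughly 50 estimated relevant documents, the review would stop after 48 have been found but not after 47.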
18. FAST2[6] = FAST1[4] + Auto-BM25 + SEMI
24. Have Done
Issues:
Core algorithm: arXiv:1612.03224
How to start, when to stop: arXiv:1705.05420
Tools:
https://github.com/fastread/src
https://github.com/ai-se/FASTREAD_ECL
Know the state of the art before trying to improve on it
What are we trying to solve
Business score
How I use ECL
A massive online reading tool
HPCC is our platform
Before going into technical details
Show the final goal
This work demonstrates how important literature reviews are: without inventing anything new, just by refactoring existing methods, we achieve much better performance than the state-of-the-art methods.
What these results mean, what this reduction means
With little domain knowledge applied, two or three keyword search, data from reviews of same topic..
Prevent runaways, more reduction
It helps management look ahead and manage costs and gains.