FreeDiscovery - information retrieval and e-Discovery in Python, Roman Yurchak

PyParis 2017
http://pyparis.org

  1. (no text extracted)
  2. Understanding of the subject (human); Extract features (human); Search query; Predict search scores; Review a few results (human); Search results
  3. Understanding of the subject (human); Parse and vectorize documents; Predict scores; Review a few results (human); Document labels; Train supervised machine learning model
  4. (no text extracted)
  5. github.com/FreeDiscovery/FreeDiscovery
  6. Text vectorization (BoW, n-grams); Latent Semantic Indexing (LSI/LSA); Raw documents; Logistic Regression, SVM, xgboost, ...; K-Nearest Neighbors; Birch + cluster labeling; DBSCAN, I-Match, simhash; JWZ algorithm; Sparse matrix (10-100k dim); Dense matrix (100-300 dim)
  7. FreeDiscovery core; Model / data persistence; Document ID mapping; REST API server; Nginx proxy (optional)
  8. flask; marshmallow, webargs; Werkzeug, gunicorn; flask-apispec; bootprint-openapi, Sphinx
  9. (no text extracted)
  10. MiniBatchKMeans, BIRCH, DBSCAN, HDBSCAN ¹ hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
  11. simhash-py
  12. Recall vs Documents Retrieved (Logistic Regression CV)
  13. The average performance variation from baseline run with Logistic Regression CV (BOW, log TF-IDF weight) for the ERDM dataset (1000 train size, 700000 test size).
  14. Reviewer; Document labels; TAR system; Scores
  15. predict
  16. joblib; pandas; HashingVectorizer
  17. Search query: time grid search. Better search query? scikit-learn project: 9100 issues / PR; 850 open issues; 540 open PR; 90k comments
  18. (no text extracted)
  19. David Grossman, Eugene Yang, Ophir Frieder
  20. (no text extracted)
  21. github.com/FreeDiscovery/FreeDiscovery; freediscovery.io/doc/stable; @RomanYurchak
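
Slides 12 and 13 report recall as a function of the number of documents retrieved. The slides do not show how that curve is computed, so the following is only a minimal sketch in plain Python, not FreeDiscovery's implementation. The function names (recall_vs_retrieved, documents_needed), the toy scores, and the toy labels are all hypothetical and chosen for illustration. It assumes that a higher score means "more likely relevant" and that ground-truth labels are 1 (relevant) or 0 (not relevant).

```python
from typing import List, Sequence


def recall_vs_retrieved(scores: Sequence[float], labels: Sequence[int]) -> List[float]:
    """Return recall after retrieving the top 1, 2, ..., n documents.

    scores: model scores, higher assumed to mean more likely relevant.
    labels: 1 for relevant, 0 for non-relevant (hypothetical ground truth).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("recall is undefined when no document is relevant")

    # Rank document indices by decreasing score (ties keep their input order).
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    curve = []
    found = 0
    for i in ranking:
        found += labels[i]  # count relevant documents seen so far
        curve.append(found / total_relevant)
    return curve


def documents_needed(curve: Sequence[float], target_recall: float) -> int:
    """Return how many top-ranked documents must be retrieved to reach target_recall."""
    for n_retrieved, recall in enumerate(curve, start=1):
        if recall >= target_recall:
            return n_retrieved
    return len(curve)


# Toy data, invented for this example only.
example_scores = [0.9, 0.2, 0.75, 0.4, 0.1, 0.6]
example_labels = [1, 0, 0, 1, 0, 1]

example_curve = recall_vs_retrieved(example_scores, example_labels)
print(example_curve)
print(documents_needed(example_curve, 0.6))
```

Plotting the returned list against 1..n gives a curve of the kind named on slide 12. In a review workflow such as the one on slide 14, documents_needed gives a rough estimate of how many documents a reviewer would have to read to reach a chosen recall level, under the assumptions stated above.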
