FreeDiscovery - information retrieval and e-Discovery in Python, Roman Yurchak

PyParis 2017
http://pyparis.org

  2. Understanding of the subject (human); Extract features (human); Search query; Predict search scores; Review a few results (human); Search results
  3. Understanding of the subject (human); Parse and vectorize documents; Predict scores; Review a few results (human); Document labels; Train supervised machine learning model
  5. github.com/FreeDiscovery/FreeDiscovery
  6. Raw documents; Text vectorization (BoW, n-grams); Sparse matrix (10-100k dim); Latent Semantic Indexing (LSI/LSA); Dense matrix (100-300 dim); Logistic Regression, SVM, xgboost, ..; K-Nearest Neighbors; Birch + cluster labeling; DBSCAN, I-Match, simhash; JWZ algorithm
  7. FreeDiscovery core; Model / data persistence; Document ID mapping; REST API server; Nginx proxy (optional)
  8. flask; marshmallow, webargs; Werkzeug, gunicorn; flask-apispec; bootprint-openapi, Sphinx
  10. MiniBatchKMeans, BIRCH, DBSCAN, HDBSCAN. ¹ hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
  11. simhash-py
  12. Recall vs Documents Retrieved (Logistic Regression CV)
  13. The average performance variation from baseline run with Logistic Regression CV (BOW, log TF-IDF weight) for the ERDM dataset (1000 train size, 700000 test size).
  14. Reviewer; Document labels; TAR system; Scores
  15. predict
  16. joblib; pandas; HashingVectorizer
  17. Search query: time grid search. Better search query? scikit-learn project: 9100 issues / PR; 850 open issues; 540 open PR; 90k comments
  19. David Grossman, Eugene Yang, Ophir Frieder
  21. github.com/FreeDiscovery/FreeDiscovery; freediscovery.io/doc/stable; @RomanYurchak
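
Slide 12 plots recall against the number of documents retrieved. As a minimal sketch of how such a curve can be computed from predicted scores and relevance labels, the plain-Python example below may help. It is not FreeDiscovery's implementation: the function name `recall_curve` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
def recall_curve(scores, labels):
    """Return recall after reviewing the top-k documents, for k = 1..n.

    scores: predicted relevance score per document (higher = more relevant)
    labels: 1 if the document is truly relevant, 0 otherwise
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("at least one relevant document is required")

    # Rank document indices by descending score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    found = 0
    curve = []
    for i in ranked:
        found += labels[i]  # count relevant documents seen so far
        curve.append(found / total_relevant)
    return curve


# Toy data (hypothetical): five documents, two of them relevant.
scores = [0.9, 0.2, 0.75, 0.4, 0.1]
labels = [1, 0, 0, 1, 0]
curve = recall_curve(scores, labels)
print(curve)
```

Each entry of `curve` is the fraction of all relevant documents found after reviewing that many top-ranked documents, so a better model reaches high recall with fewer documents retrieved.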
