Integrate ManifoldCF with Solr

Aurélien Mazoyer from France Labs gives users a step-by-step breakdown of everything ManifoldCF is capable of. He also works through a scenario that shows what can happen when using Apache ManifoldCF with Apache Solr.

  1. October 13-16, 2015 • Austin, TX
  2. Properly integrate ManifoldCF with Solr. Aurélien Mazoyer, Search Expert, Co-founder, France Labs
  3. Apache ManifoldCF. Agenda: Overview of ManifoldCF • Our scenario: find files on a file share • In real life
  4. Apache ManifoldCF. Overview: Connector framework • Incremental crawling • Handles authorization • Configuration via REST API and UI
  5. Apache ManifoldCF. History: Based on the "Connector Framework" developed by Karl Wright for the MetaCarta appliance • Donated to the Apache Software Foundation in 2009 • May 2012: out of incubation • Current version: 2.2 (August 2015)
  6. Connectors gone wild. Different connectors for: Content repositories (Web, Wiki, DB, Email, RSS, CMIS, Alfresco… but also Windows Share, SharePoint, Dropbox…) • Authorities (LDAP, AD, CMIS…) • Outputs (Solr, Elasticsearch, OSS…)
  7. Big picture (architecture diagram). ManifoldCF is made of a daemon agent, an authority service, the ManifoldCF UI and the ManifoldCF API. Connectors (Conn. 1 … Conn. N) link it to the repositories (Wiki, DB, … Repository N), the outputs (Solr, Elasticsearch, …) and the authorities (OpenLDAP, … Authority N).
  8. Roles of components. Daemon agent: Java process • Runs repository and output connectors • Runs data crawling jobs
  9. Roles of components. Authority service: Web application • Runs authority connectors • Gets security tokens for a specific user
  10. Components (diagram), configured in the ManifoldCF UI: Repo Connection, Output Connection, Crawl Job. Relationship cardinalities shown: 1…1, 1…*, 1…*, 1…*. That’s it.
  11. Configuration via the API
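
As a rough illustration of configuring ManifoldCF through the API rather than the UI, the sketch below builds resource URLs for the JSON flavour of the API and lists the defined jobs. The base URL (the address served by the single-process example) and the helper names `api_url` and `list_jobs` are assumptions made for this example; adjust them to the actual deployment.

```python
import json
import urllib.request

# Assumption: the API web application is reachable at this address.
MCF_API_BASE = "http://localhost:8345/mcf-api-service"

def api_url(resource, base=MCF_API_BASE):
    """Build a JSON API URL for a resource such as 'jobs' or 'outputconnections'."""
    return f"{base.rstrip('/')}/json/{resource}"

def list_jobs(base=MCF_API_BASE):
    """Fetch the job definitions as parsed JSON (needs a running ManifoldCF)."""
    with urllib.request.urlopen(api_url("jobs", base)) as response:
        return json.loads(response.read().decode("utf-8"))

# URL building works offline; list_jobs() is only useful against a live instance.
jobs_url = api_url("jobs")
outputs_url = api_url("outputconnections", "http://mcf.example.org/mcf-api-service/")
```

The same pattern applies to the other resource collections, for instance `repositoryconnections` and `authorityconnections`.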
  12. Test it! For testing purposes: java -jar start.jar • All-in-one process • Embedded database (HSQLDB)
  13. Taking MCF to production: multi-process deployment. 3 web applications in a servlet container: mcf-crawler-ui • mcf-authority-service • mcf-api-service. Daemon agent. Database: PostgreSQL. Synchronization through the filesystem (local) or distributed (ZooKeeper).
  14. Search files with security: Solr + MCF. Our scenario: File share using Active Directory • Search with Solr • With security constraints
  15. Security model: Solr + MCF. Authorization: Early binding • Index documents with ACLs • Compute authorization at runtime. Authentication: Not handled by Solr/ManifoldCF • The front-end application should authenticate the user
  16. Search files with security: Solr + MCF. Phase 1: Indexing (diagram). The ManifoldCF daemon agent crawls documents with their ACLs from the Windows Share repository through the JCIFS connector, then sends docs and ACLs through the Solr output connector to Solr (Extracting Handler, MCF plugin). Also shown: the ManifoldCF authority service with its AD connector to the AD authority.
  17. Search files with security: Solr + MCF. Phase 2: Searching (diagram). The front end sends an authenticated search to Solr. The Solr MCF plugin gets the user access token from the ManifoldCF authority service (AD connector to AD), filters docs based on ACLs and user info, and returns the authorized results.
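
The filtering step of Phase 2 can be pictured with a small sketch. It assumes each indexed document carries a set of allow tokens and a set of deny tokens, and that the authority service returns a set of tokens for the authenticated user; a document is returned only if the user holds at least one allow token and no deny token. The function names, field names and token values are hypothetical, and the actual MCF plugin applies its filter inside Solr rather than in application code.

```python
def is_visible(allow_tokens, deny_tokens, user_tokens):
    """True if the user holds an allow token and no deny token for the document."""
    user = set(user_tokens)
    if user & set(deny_tokens):
        return False  # an explicit deny always wins
    return bool(user & set(allow_tokens))

def filter_results(docs, user_tokens):
    """Return the ids of the documents the user is authorized to see."""
    return [d["id"] for d in docs if is_visible(d["allow"], d["deny"], user_tokens)]

# Hypothetical index content and user tokens.
docs = [
    {"id": "report.docx", "allow": {"GROUP-FINANCE"}, "deny": set()},
    {"id": "salaries.xlsx", "allow": {"GROUP-FINANCE"}, "deny": {"USER-BOB"}},
    {"id": "handbook.pdf", "allow": {"GROUP-EVERYONE"}, "deny": set()},
]
alice = {"USER-ALICE", "GROUP-FINANCE", "GROUP-EVERYONE"}
bob = {"USER-BOB", "GROUP-FINANCE", "GROUP-EVERYONE"}
carol = {"USER-CAROL", "GROUP-EVERYONE"}
```

With these sample sets, Alice sees all three documents, Bob loses `salaries.xlsx` because of the deny token, and Carol only sees `handbook.pdf`.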
  18. Configure Solr + MCF. ManifoldCF side, 4 connections and 1 job: Create a Windows Share connection • Create a Solr connection • Create an Active Directory connection • Create an Authority Group connection • Create a crawling job
  19. Components (diagram): Authority Group, Authority Connection, Repo Connection, Output Connection, Crawl Job. Relationship cardinalities shown: 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*.
  20. Components (diagram): AD Group, AD Connection, Windows Share Connection, Solr Connection, Crawl Job
  21. Configure Solr + MCF. Front-end side, authentication. For Tomcat: JNDI Tomcat Realm • Tomcat SPNEGO
  22. Configure Solr + MCF. Solr side: Modify schema.xml (add fields for security tokens) • Modify solrconfig.xml (add the MCF Solr plugin, a query parser) • And don’t forget to protect the Solr instance :-P
  23. Configure Solr + MCF. Leverage the Solr Extracting Handler: Based on Apache Tika • MIME type detection • Embeds parsing libraries • Supported formats: MS Office (OLE2 and OOXML), OpenDocument, PDF, audio/video/image files • Now does OCR thanks to Tika 1.7 (and Tesseract). Now this can be done directly in MCF!
  24. Components (diagram): Authority Group, Authority Connection, Repo Connection, Output Connection, Transformation Connection, Crawl Job. Relationship cardinalities shown: 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*, 0…*, 1…*.
  25. Crawling principle. Crawling model: Incremental model • Continuous model. Diagram with Phase 1 and Phase 2, from ManifoldCF in Action, Chapter 1 (Karl Wright).
  26. Incremental crawling of file share. Incremental crawling is not so easy with some repositories (cartoon). Windows Share connector (JCIFS): "Uhuuu, file share, what's new since last time we met?" Windows Share: "Errkkk…"
  27. Incremental crawling of file share: Solr + MCF. Phase 1: Discovery/Indexing (flowchart, depth first). For each start path entry on the Windows Share, fetch the SMB file attributes. If the file is a directory and matches the inclusion regex, list the files in the SMB directory and repeat for each file. If the file is a regular file and matches the inclusion regex, check its ingeststatus entry in the crawler DB; if there is no entry or the version attribute is different, fetch the file content, update the ingeststatus entry in the DB, and push the file to Solr.
  28. Incremental crawling of file share. What is an ingeststatus database entry? Simplified version:
        DOCURI                            LAST_INGEST            LAST_VERSION
        protocol://REPO_HOST/Doc1.docx    10.09.2015 18:21:04    Doc1_Version1
        protocol://REPO_HOST/Doc2.docx    10.09.2015 19:21:04    Doc2_Version1
      LastVersion? Here, computed from lastModified and the ACLs on the file. Example value:
        +S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
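
To make the idea of a version computed from lastModified and ACLs concrete, here is a minimal sketch. The layout of the string is a simplification chosen for illustration, not the connector's actual format, and `version_string` and its parameters are hypothetical names.

```python
def version_string(acl_sids, last_modified_ms):
    """Combine the file's ACL entries and last-modified time into one comparable string."""
    # Sorting makes the result independent of the order in which ACLs are listed.
    acl_part = "".join("+" + sid for sid in sorted(acl_sids))
    return f"{acl_part}+{last_modified_ms}"

v1 = version_string(["S-1-5-32-544", "S-1-5-18"], 1444462827664)
v2 = version_string(["S-1-5-18", "S-1-5-32-544"], 1444462827664)  # same ACLs, other order
v3 = version_string(["S-1-5-18"], 1444462827664)                  # one ACL entry removed
v4 = version_string(["S-1-5-18", "S-1-5-32-544"], 1444462999999)  # file modified later
```

Because the ACLs are part of the version, a permission change alone yields a different string, so the document is re-indexed even when its content is untouched.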
  29. Incremental crawling of file share: Solr + MCF. Phase 2: Deleting unreachable documents. For each crawler DB entry whose document is no longer reachable: send a delete command to Solr and update the crawler database.
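
The two phases can be sketched end to end with in-memory stand-ins: one dict for the file share, one for the crawler's ingeststatus table, and one for the Solr index. Everything here (`crawl`, the dict layouts, the inclusion regex) is a hypothetical simplification; the real crawl goes through the JCIFS connector, the crawler database and the Solr output connector.

```python
import re

def crawl(share, ingeststatus, solr, include=r".*"):
    """Run one incremental pass; share maps path -> (version, content).

    Returns the lists of paths that were (re)indexed and deleted.
    """
    pattern = re.compile(include)
    seen, indexed, deleted = set(), [], []

    # Phase 1: discovery and indexing of new or changed files.
    for path, (version, content) in share.items():
        if not pattern.fullmatch(path):
            continue  # filtered out by the inclusion regex
        seen.add(path)
        if ingeststatus.get(path) != version:  # no entry, or version differs
            ingeststatus[path] = version       # same step order as on the slide
            solr[path] = content
            indexed.append(path)

    # Phase 2: delete documents that were not reached during this pass.
    for path in list(ingeststatus):
        if path not in seen:
            solr.pop(path, None)
            del ingeststatus[path]
            deleted.append(path)
    return indexed, deleted

share = {"/a.docx": ("v1", "A"), "/b.pdf": ("v1", "B"), "/c.tmp": ("v1", "C")}
status, solr = {}, {}
first = crawl(share, status, solr, include=r".*\.(docx|pdf)")

share["/a.docx"] = ("v2", "A2")  # file modified
del share["/b.pdf"]              # file removed from the share
second = crawl(share, status, solr, include=r".*\.(docx|pdf)")
third = crawl(share, status, solr, include=r".*\.(docx|pdf)")
```

The first pass indexes both matching files, the second re-indexes only the modified one and deletes the removed one, and the third finds nothing to do.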
  30. How to see what happened. Search History. Monitoring: Job Status • Notification Connections
  31. How to see what happened. Search History. History: Simple History • Maximum Activity • Maximum Bandwidth • Result Histogram. Status: Document Status • Queue Status
  32. Performance issues. Find the bottleneck: Crawled repository • Network • Solr • MCF database • MCF configuration
  33. Handle performance issues. Connector-specific configuration: Throttling • Max JVM connections. Can improve speed or limit the impact on the crawled repository. Very specific to the repository.
  34. Handle performance issues. Job settings: Size limit of ingested documents • Use a regex to remove some extensions from the crawl
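
As an illustration of these two job settings, the sketch below decides from a file's name and size whether it should be ingested. The 10 MB limit, the excluded-extension regex and the name `should_ingest` are assumptions picked for the example, not ManifoldCF defaults.

```python
import re

MAX_SIZE_BYTES = 10 * 1024 * 1024  # assumed size limit: 10 MB
EXCLUDED = re.compile(r".*\.(iso|zip|exe|tmp)$", re.IGNORECASE)  # assumed exclusions

def should_ingest(path, size_bytes, max_size=MAX_SIZE_BYTES, excluded=EXCLUDED):
    """Skip files that are too large or whose extension is excluded."""
    if size_bytes > max_size:
        return False
    return excluded.match(path) is None

small_doc = should_ingest("/share/report.docx", 2048)
archive = should_ingest("/share/backup.ZIP", 2048)
huge_doc = should_ingest("/share/video-notes.docx", 50 * 1024 * 1024)
```

Filtering before the content is fetched keeps very large or irrelevant files from loading the repository, the network and Solr.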
  35. Investigate errors: Increase the connector’s log level • Read the MCF Simple History • Take a thread dump
  36. Common errors in file crawling: Crawler account rights • Exotic files • Very biiiiiiig files • JCIFS errors • Solr connector timeout
  37. When to use ManifoldCF? q = crawled_environment:heterogeneous OR scenario:intranet OR security:mandatory
  38. References • ManifoldCF documentation: https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html • ManifoldCF in Action (K. Wright): https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs • Securing Solr documents with MCF (K. Wright): http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011 • France Labs blog posts: http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/ and http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
  39. Datafari Search Admin. Intranet "ready to play" search solution • Apache License. Embeds: Solr • ManifoldCF. And other cool stuff: Admin and responsive search UI • User management • Banana for user behavior analysis • Tesseract OCR • A funny zebra • Etc… www.datafari.com
  40. aurelien.mazoyer@francelabs.com • @francelabs • www.francelabs.com
