
II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment



Patrick Beaucamp (BPM-Conseil, France)

A journey in the Dark Web, for companies looking to take control of their search strategy. The objective of this presentation is to show that, at a reasonable cost, any organisation can set up its own search strategy, outside of or in parallel with its document management strategy.

The challenge at the French Ministry is to aggregate internal content, external content from social networks (Pinterest, YouTube, Facebook) and content from external legacy WebSites (other WebSites from agencies related to the Ministry), and to provide a brand new WebSite with a "best of breed" interface: search engine, auto completion and word correction, and easy, customizable and secured navigation.

The result is awesome: with the budget kept under control, we delivered a new Drupal module to monitor and configure Solr 6 indexing and search, together with a custom API to index external WebSites.

This session will include a presentation of the project architecture (multi-tier servers) and a live demo of the search interface.

Published in: Internet


  1. Patrick Beaucamp, founder of the Vanilla, AklaBox & Data4Citizen projects. Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment. II-PIC, Bangalore, 2nd November 2017
  3. Presentation Agenda: Personal Experience of Search – Search Ideas; Open Source Search Engine & Search Platform; Features expected for Search Platforms (Interface); Open Source Platform at French Ministry (Project Context, Platform Architecture, WebSite Powered by a Search Engine)
  4. Do you know Solr?
  5. Part 1 – Search Concepts and Ideas: « Sharing and awakening your mind »
  6. Searching … and finding ! How many times per day do you Google (search, maps, translate …)? Search is the first step: collecting information. Tribute to Open Source at II-PIC … thanks Christoph!
  7. Searching ??? Search is a subject in itself: using a search engine (and being influenced by SEO); registering for news feeds and alerts (« push mode »); « artificial intelligence » facts: an algorithm is working for you (Facebook proposals, Gmail reminders …). « Minority Report » is here!
  8. Browsing behavior – Before: WebSite as a showcase. User behavior analysis for the Sales & Marketing team and the Web Design team: which menus and sub-menus are visited? Where are the dead branches? No real « search approach ».
  9. Browsing behavior – Now: WebSite as a search interface. User behavior analysis for the Sales & Marketing team and the Web Design team: what are people looking for? How are they searching? Review your SEO.
  10. Searching … and finding !
  11. Searching … and finding ! We all became private investigators one day or another.
  12. Searching … and finding !
  13. Searching … and finding ! Different search engines lead to different results.
  14. Searching … and finding ! Different search engines by country.
  15. Searching … and finding ! Funny word: SEO … it is more « how to be found on the Internet » … and you need to pay for it!
  16. Searching … and finding ! My personal experience: I tried to find a person for 23 years, roughly from 1993 to 2016. From 1993 to 1998: no search engine available … only private investigators? From 1999 to 2015: regular searches, no results. Some years: no search; other years: multiple searches. From a browser: « f + tab », « g + tab », « y + tab » … I found this person on Facebook, not on Google.
  17. Searching … and finding ! The person I was looking for published on Facebook using his/her real name; it is his/her decision to be visible or not. Where do we stand with the « Right to be Forgotten »?
  18. Searching … and finding ! Companies like Facebook have tons of data: they need to provide a search infrastructure (indexing + search interface). I was lucky to give the Facebook search interface a try.
  19. Searching … and finding ! Tracing the source of a cholera outbreak – 1854 (John Snow)
  20. Searching … and finding ! Bicycle accidents in the street: who is taking care of traffic management? Example in Boston: Open Data.
  21. Searching … and finding ! LION – 2016 (Garth Davis). Mistake 1: Ganesh Tanei – Mistake 2: Saroo
  22. « Internal » Searching Strategy: it’s easy to add a « search » feature to a WebSite (Drupal hosting). Companies don’t want to go through this again! You need a strategy for your internal data: these are your digital assets.
  23. Part 2 – Search Components: The « Recipe »
  24. OpenSource Landscape: Crawling, Indexing, Storing, WebSite Reference, WebSite Accessibility, Update Management, Search Interface, Result Visualization, Auto Completion, Natural Language, Voice Recognition, Maps, Ads, Unstructured Data, Access Management
  25. Search Platform Objectives (Crawling, Indexing, Storing). Constraints: being able to reach WebSites and content (internal WebSites (intranet), external WebSites, internal document repositories); being able to index WebSite content (and page updates); being able to store unstructured data.
  26. Search Platform Objectives (Search Interface, Result Visualization). Provide usable search results (auto classification, visualization). Provide usable search interfaces (semantic search, multi-language search …). Don’t forget why and what you search: • You search in existing documents • You need visualization tools • It’s not a crystal ball: search reflects the past
  27. Before indexing your document base, you need to access it! Apache Nutch is a highly extensible and scalable open source web crawler software project. Reference: Nutch
  28. Solr • What is Solr: indexing and search engine; promoted by the Apache Foundation; built on top of Apache Lucene (Java search library) • Major engine characteristics: scalable, fault tolerant, distributed indexing, dynamic load balancing, centralized configuration • Technical environment: Java; embedded Jetty server for platform administration
  29. Solr main characteristics: admin interface; flexible and scalable; modular configuration; multiple index management with a single instance
  30. Solr main characteristics: standard communication interfaces (HTML, XML, JSON); configuration can be done with or without a schema; real-time indexing
  31. Solr main characteristics: customizable full-text analysis; rich document indexing (using Tika)
  32. Solr main characteristics: search by facets and filters; term suggestion and spelling correction; geospatial search
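
To make "search by facets and filters" concrete, here is a minimal sketch of how a client could build such a request over Solr's HTTP interface. The parameters `q`, `fq`, `wt`, `facet` and `facet.field` are standard Solr query parameters; the host, the core name `ministry` and the field names `doc_type` and `language` are assumptions made for illustration, not details of the Ministry project.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_field, filters=()):
    """Return a Solr /select URL that asks for facet counts on one field."""
    params = [
        ("q", text),                   # main full-text query
        ("wt", "json"),                # ask for a JSON response
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # field whose values are counted
    ]
    # Each filter query narrows the result set without affecting scoring.
    params += [("fq", f) for f in filters]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/ministry",
    "water quality",
    "doc_type",
    filters=["language:fr"],
)
print(url)
```

The response to such a request would carry, next to the matching documents, a count per value of the faceted field, which is what a search page typically uses to draw its "filter by type" sidebar.
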
  33. Solr behavior
  34. Search Engine Basics (1/2) - Synonyms: it is possible to extend the search to synonyms if they are listed in a glossary, for example to also find articles containing synonyms of “TV” when you search for the word “TV”. - Metadata: dictionary listing the searchable keywords.
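
The synonym idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how Solr implements it (Solr applies synonyms inside its analysis chain, usually from a synonyms file); the glossary content and the function name `expand_query` are made up for the example.

```python
# Hypothetical glossary: each key lists the synonyms to search for as well.
SYNONYMS = {"tv": {"television", "telly"}}


def expand_query(terms, synonyms):
    """Return the query terms, lowercased, each followed by its synonyms."""
    expanded = []
    for term in terms:
        key = term.lower()
        expanded.append(key)
        # Sorted so that the expansion order is deterministic.
        expanded.extend(sorted(synonyms.get(key, ())))
    return expanded


print(expand_query(["TV", "repair"], SYNONYMS))
# prints ['tv', 'television', 'telly', 'repair']
```
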
  35. Search Engine Basics (2/2) - Reserved words, protected words: indexing usually uses stemming, which reduces words to their root, for example reducing “development” to “develop” so that a search for “develop” also finds items containing “development”. However, stemming sometimes goes wrong and indexes two unrelated words under the same root. It is possible to prevent the stemming of given words by listing them in the file protwords.txt. - Stopwords: stopwords are words that carry little meaning. A word considered insignificant is ignored. Note that some words are insignificant in some contexts only, and others have meaningful homonyms. For example, the French word “été” can refer to the summer season (rather meaningful) or to the past participle of the verb “to be” (relatively insignificant). Stopwords are listed in the file stopwords.txt.
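
The interplay between stopwords, stemming and protected words can be shown with a toy analyzer. This is a sketch under strong assumptions: the word lists are invented, and `naive_stem` is a deliberately crude stand-in for a real stemmer, not the algorithm Solr uses.

```python
STOPWORDS = {"the", "a", "of", "is"}   # would come from stopwords.txt
PROTECTED = {"news"}                   # would come from protwords.txt


def naive_stem(word):
    """Strip a few English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ment", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stopwords, and stem every word that is not protected."""
    tokens = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        tokens.append(word if word in PROTECTED else naive_stem(word))
    return tokens


print(analyze("The development of news is developing"))
# prints ['develop', 'news', 'develop']
```

Without the protected list, this stemmer would reduce “news” to “new” and index two unrelated words under the same root, which is exactly the adverse case the slide describes.
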
  36. Search Engine Current Limits - Multi-language support (this is where commercial search engines still have more to offer customers), even though there is now support for Asian languages (Hindi, Thai, Chinese, …). - Elision: elisions are a feature of French, which consist of contracting short words such as “le” when they are followed by a vowel. Example: “le” + “avion” gives “l'avion” (“the aircraft”). It is possible to remove these elisions using a lexicon. - Limits solved over the past 3 years: • Full-text search interface (language with search engine) • Subquery support: now OK starting with Solr 4.7 (we are at v6) • Scalability (this is where Solr is taking a technical advantage)
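
Elision removal with a lexicon can be sketched as follows. The list of elided forms is a small illustrative lexicon chosen for this example, and the regular expression is one possible approach, not the one used by Solr's own elision filter.

```python
import re

# Illustrative lexicon of French words that contract before a vowel.
ELIDED_FORMS = ("l", "d", "j", "m", "t", "s", "n", "c", "qu")

# Match an elided form at the start of a word, followed by an apostrophe
# (straight or typographic) and at least one more word character.
_ELISION = re.compile(
    r"\b(?:%s)['’](?=\w)" % "|".join(ELIDED_FORMS), re.IGNORECASE
)


def strip_elisions(text):
    """Remove leading elided articles and pronouns such as l' or d'."""
    return _ELISION.sub("", text)


print(strip_elisions("l'avion et l'eau d'hier"))
# prints avion et eau hier
```

The word boundary in the pattern matters: in “aujourd'hui” the letter before the apostrophe sits in the middle of a word, so it is left untouched.
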
  37. Search Interface expectation (1/3) - Advanced indexing and querying tools. - Distributed searching capabilities, to prevent a particular server from becoming a bottleneck. - Document excerpt (snippet) generation, providing a summary of each search result. - Relevance ranking; display of extracts from the documents based on the query.
  38. Search Interface expectation (2/3) - Duplicate document detection, including fuzzy near-duplicates. - Rich document parsing and indexing without using database indexing. - Ranking control, to carry out a targeted ranking of individual documents. - Search grouping by type / tag / category (general pages, documents, images).
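
One common way to detect fuzzy near-duplicates is to compare the sets of word sequences (shingles) of two documents with the Jaccard similarity. The sketch below illustrates that general technique; the shingle size, the 0.5 threshold and the function names are assumptions for the example, and the slides do not say which method the project used.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_near_duplicate(doc_a, doc_b, threshold=0.5):
    """True when the two documents share enough shingles."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


report_v1 = "the ministry publishes a new report on water quality"
report_v2 = "the ministry publishes a new report on air quality"
print(is_near_duplicate(report_v1, report_v2))
# prints True
```

Here the two texts differ by one word, so 5 of their 9 distinct shingles are shared (similarity of about 0.56), which is above the chosen threshold.
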
  39. Search Interface expectation (3/3) - Multi-criteria support - Ranking - Natural language support - Apps support (Android, iPad)
  40. Part 3 – A Real Project
  41. Project at Ministry. Initial decisions and guidelines from the Ministry: the new WebSite will be built with Drupal CMS 8.2; the WebSite should be powered by a « Google-like search toolbar »; the WebSite infrastructure should connect with multiple other WebSites; all infrastructure software must be Open Source components.
  42. Project at Ministry
  43. Project at Ministry
  44. Project at Ministry - Architecture
  45. Project at Ministry - Architecture
  46. Project at Ministry - Technical. Project steps: Nutch crawler for various WebSites • Facebook, LinkedIn, Twitter, YouTube … • Internal WebSite, previous WebSite. Drupal forms for metadata & indexing • Specific forms for different kinds of documents • Drupal CMS process to add new content. Drupal 8 module for Solr: custom search, monitoring, reporting • The existing Drupal Solr module is limited to a single instance of Drupal • Not possible to use the Solr Admin interface
  47. Project at Ministry - Technical. Additional PHP libraries: Curl: Drupal-Solr communication (HTTP GET, HTTP POST & attached files); Ssh2: server administration commands; Zookeeper: Drupal-Zookeeper communication; MemCached: Drupal-Memcached communication; Solarium: Drupal-Solr communication (abstraction layer); GoogleApi: YouTube content indexing; Paragraph: news and content editing; Piwik: statistics (like Google Analytics)
  48. Project at Ministry – Admin Interface: Drupal 8 add-on to set up the global infrastructure (Zookeeper, Solr)
  49. Project at Ministry – Admin Interface: Drupal 8 add-on to monitor the global infrastructure – statistics
  50. Project at Ministry - Validation. Project validation & deployment: no problems with Zookeeper, Solr, Nutch. Stress tests for the global platform: initial slowdown with 10 000 simultaneous connections. Sub-project: addressing the single point of failure. Problems with Drupal & MySQL; solution: MemCached.
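
Putting a cache in front of the database usually follows the cache-aside pattern: look in the cache first, and only query the database on a miss. The sketch below shows that pattern with a small in-memory class standing in for Memcached so that it runs on its own; `TTLCache`, `cached_lookup` and `load_page` are hypothetical names, and the slides do not detail how the project actually wired Drupal, MySQL and MemCached together.

```python
import time


class TTLCache:
    """In-memory stand-in for a Memcached client (get/set with expiry)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)


def cached_lookup(cache, key, load_from_db):
    """Cache-aside: serve from the cache, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value


db_calls = []


def load_page(key):
    """Hypothetical slow database read; records each call for the demo."""
    db_calls.append(key)
    return f"page:{key}"


cache = TTLCache(ttl_seconds=60)
first = cached_lookup(cache, "home", load_page)
second = cached_lookup(cache, "home", load_page)
print(first == second, len(db_calls))
# prints True 1
```

Under heavy concurrent load, most requests for the same page are then answered from memory and never reach the database.
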
  51. Project at Ministry - Next. Next steps: review of WebSite content … new Ministry. New content to be indexed: • Other WebSites and social content • New set of documents to be added to the repository