
II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

A journey in the Dark Web, for companies looking to take control of their search strategy. The objective of this presentation is to show that, at a reasonable cost, any organisation can set up its own search strategy, outside of or in parallel with its document management strategy.

The challenge at the French Ministry is to aggregate internal content, external content from social networks (Pinterest, YouTube, Facebook) and content from external legacy websites (websites of agencies related to the Ministry), and to provide a brand new website with a "best of breed" interface: search engine, auto-completion and spelling correction, and easy, customisable and secured navigation.

The result is awesome: for a budget kept under control, we provided a new Drupal module to monitor and configure Solr 6 indexing and the search engine, together with a custom API to index external websites.

This session will come with a presentation of the project architecture (multi-tier servers) and a live demo of the search interface.

Published in: Internet

  1. 1. Patrick Beaucamp Founder of the Vanilla Project Mail : Patrick.beaucamp@bpm-conseil.com Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment II-SDV, Nice 24th April 2017 1II-SDV, Nice
  2. 2. Presentation Agenda. Open Source Search Engine & Search Platform: some interesting platforms; features expected for Search Platforms (Interface). Open Source Platform at French Ministry: project context; platform architecture; WebSite powered by a search engine. Echo: Tuesday am, presentation from Deep Search 9, and Tuesday pm, presentation from FranceLabs. Personal Experience of Search.
  3. 3. Searching … and finding ! II-SDV : SEARCH, DATA MINING and VISUALISATION How many times per day do you Google ? (search, maps, translate …) Tribute to Open Source at II-SDV Search is the first Step : collecting information
  4. 4. Searching … and finding !
  5. 5. Searching … and finding ! An example: my personal experience. I tried to find a person for 23 years, roughly from 1993 to 2016. From 1993 to 1998: no search engine available … only a private investigator? From 1999 to 2015: regular searches, no results. I found this person on Facebook, not on Google. From a browser: « f + tab » … « g + tab », « y + tab » … Some years: no search; other years: multiple searches.
  6. 6. Searching … and finding ! 1) We all became private investigators one day or another
  7. 7. Searching … and finding !
  8. 8. Searching … and finding ! 2) Different search engines lead to different results
  9. 9. Searching … and finding ! 2) Different search engines by country
  10. 10. Searching … and finding ! Funny word : SEO … it's more « how to be found on the Internet » … and you need to pay for it !
  11. 11. Searching … and finding ! 3) The person I was looking for published on Facebook using his/her real name: it is his/her decision to be visible or not 4) Where do we stand with the « Right to be Forgotten » ?
  12. 12. Searching … and finding ! Companies like Facebook have tons of data : they need to provide a search infrastructure (indexing + search interface). I was lucky to try the Facebook search interface.
  13. 13. Searching … and finding ! Discovery of Cholera – 1854 (John Snow) http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
  14. 14. Searching … and finding ! Bicycle accidents in the street : who is taking care of traffic management ? Example in Boston (Open Data) : http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html
  15. 15. Searching … and finding ! LION – 2016 (Garth Davis) Mistake 1 : Ganesh Tanei – Mistake 2 : Saroo
  16. 16. OpenSource LandScape : Crawling, Indexing, Storing, WebSite Reference, WebSite Accessibility, Update Management, Search Interface, Result Visualization, Auto Completion, Natural Language, Voice Recognition, Maps, Ads, Unstructured data, Access Management
  17. 17. Search Platform Objectives. Constraints : Crawling : being able to reach WebSites and content (internal WebSites (Intranet) & external WebSites, internal document repositories). Indexing : being able to index WebSite content (and page updates). Storing : being able to store unstructured data.
  18. 18. Search Platform Objectives. Result Visualization : provide usable search results (auto classification, visualization). Don't forget why and what you search : • You search in existing documents • You need visualization tools • It's not a crystal ball : search reflects the past. Search Interface : provide usable search interfaces (semantic search, multi-language search …)
  19. 19. OpenSource LandScape. Lucene is a Java-based indexing and search API. Solr is the leading server built on top of Lucene. Two companies, LucidWorks (Fusion) and Elastic (Elasticsearch), provide packaging and extensions on top of Solr and Lucene respectively. -Nutch is the crawling component -Tika is a document metadata and content analysis toolkit -ZooKeeper is a coordination service for distributed processes (used here to coordinate the Solr servers)
  20. 20. OpenSource LandScape -Search Landscape -Lucene : http://lucene.apache.org -Solr/Lucene : http://lucene.apache.org/solr/ -Platform OpenSearch : http://www.open-search-server.com -Platform Katta : http://katta.sourceforge.net -Platform LucidWorks : http://www.lucidworks.com -Platform ElasticSearch : http://www.elasticsearch.com -Sphinx : http://sphinxsearch.com/ -Cloudera : https://www.cloudera.com/documentation/enterprise/5-5-x/topics/search_architecture.html -FranceLabs : http://www.francelabs.com/ (Datafari) -AklaBox : www.aklabox.com (AklaSearch)
  21. 21. Search Engine Focus. Lucene : retrieval software library. Use an existing search infrastructure like Solr/Lucene (Vanilla certified) : http://www.lucidworks.com/ or http://www.elasticsearch.org/
  22. 22. Hadoop Search Platform - Big Data -Cloudera with Solr/Cloud (Solr/Lucene) -MapR with ElasticSearch (Lucene code) -HortonWorks with LucidWorks (Solr/Lucene)
  23. 23. Nutch. Before indexing your document base, you need to access it ! Apache Nutch is a highly extensible and scalable open source web crawler software project. Reference : http://nutch.apache.org/
  24. 24. Solr • What is Solr – Indexing and Search Engine • Promoted by the Apache Foundation • Built on top of Apache Lucene (Java search library) – Major engine characteristics • Scalable, fault tolerant, distributed indexing process, dynamic workload balancer, centralized configuration – Technical environment • Java • Embedded Jetty server for platform administration
  25. 25. Solr main characteristics : Admin Interface. Flexible and scalable configuration. Modular. Multiple index management with a single instance.
  26. 26. Solr main characteristics : Standard communication interfaces (HTTP, XML, JSON). Configuration can be done with or without a schema. Real-time indexing.
  27. 27. Solr main characteristics : Customizable full-text analysis. Rich document indexing (using Tika).
  28. 28. Solr main characteristics : Search by facets and filters. Term suggestion and spelling correction. Geospatial search.
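
As an illustration of search by facets and filters, here is a minimal Python sketch that only builds a Solr `/select` URL. `q`, `fq`, `facet`, `facet.field` and `wt` are standard Solr request parameters; the host, the core name `ministry` and the field names `type`, `source` and `language` are assumptions made up for this example, not taken from the project.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=()):
    """Return a Solr /select URL asking for facet counts and applying filters."""
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # One fq (filter query) parameter per restriction.
    params += [("fq", flt) for flt in filters]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and fields, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/ministry",
    "climat",
    facet_fields=["type", "source"],
    filters=["language:fr"],
)
print(url)
```

The URL is only printed, not requested, so the sketch runs without a Solr server.
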
  29. 29. Solr behavior
  30. 30. Search Engine Basic (1/2) -Synonyms - It is possible to extend the search to synonyms if they are listed in a glossary. For example, a search for the word "TV" also finds articles containing synonyms of "TV". -Metadata - Dictionary for the list of searchable keywords
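
The synonym glossary idea can be sketched in a few lines of Python. This is not how Solr implements it internally; the glossary content below is invented for the example, and the function only shows the principle of expanding query terms before matching.

```python
# Hypothetical glossary: each key maps to the synonyms listed for it.
SYNONYMS = {"tv": {"television", "telly"}}


def expand_query(terms, synonyms):
    """Return the lowercased terms, each followed by its listed synonyms."""
    expanded = []
    for term in terms:
        key = term.lower()
        expanded.append(key)
        # sorted() keeps the output order stable from run to run.
        expanded.extend(sorted(synonyms.get(key, ())))
    return expanded


print(expand_query(["TV", "news"], SYNONYMS))
```

Terms without an entry in the glossary pass through unchanged.
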
  31. 31. Search Engine Basic (2/2) -Reserved Words, Protected Words - Indexing usually uses stemming, which reduces words to their root, for example "develop", so that a search for "develop" also finds items containing the word "development". However, stemming sometimes goes wrong and indexes two unrelated words under the same root. It is possible to prevent the stemming of words by listing them in a protwords.txt file. -StopWords - Stopwords are words that carry no meaning. A word considered insignificant will be ignored. Note that some words are insignificant in some contexts only, and others have significant homonyms. For example, in French « été » can refer to the summer season (rather significant) or to the past participle of the verb "to be" (relatively insignificant). Stopwords are listed in the stopwords.txt file.
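
A minimal Python sketch of the stopword and protected-word logic follows. The suffix stripper is a deliberately naive stand-in for a real stemmer, and the word lists are invented for the example; in Solr these lists would live in stopwords.txt and protwords.txt.

```python
STOPWORDS = {"the", "a", "of", "is"}      # illustrative stopword list
PROTECTED = {"news"}                      # words that must not be stemmed
SUFFIXES = ("ment", "ing", "ed", "s")     # naive suffix list, not a real stemmer


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stopwords, and stem everything except protected words."""
    tokens = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        tokens.append(word if word in PROTECTED else naive_stem(word))
    return tokens


print(analyze("The development of news"))
```

Without the protected list, the naive stemmer would reduce "news" to "new", which is the kind of unwanted merge the slide warns about.
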
  32. 32. Search Engine Current Limits -Multi-language support (this is where commercial search engines still have more to bring to customers), even though there is now support for Asian languages (Hindi, Thai, Chinese, …) -Elision : elisions are a feature of French, which consist of a contraction of words like « le » or « de » when they are followed by a vowel. Example : le + avion gives l'avion. It is possible to remove these elisions using a lexicon. -Limits solved over the past 3 years • Full text search interface (language with search engine) • SubQuery support : now OK starting with Solr 4.7 (we are on v6) • Scalability (this is where Solr is taking a technical advantage)
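
Elision removal can be sketched with a regular expression. The list of contracted forms below is an assumption chosen for the example and is not the lexicon used in the project.

```python
import re

# Illustrative lexicon of French contracted forms (l', d', qu', ...).
ELIDED_FORMS = ("qu", "l", "d", "j", "m", "n", "s", "t", "c")

# Match a contracted form at a word boundary, followed by an apostrophe
# (straight or typographic) and then a word character.
_ELISION_RE = re.compile(
    r"\b(?:" + "|".join(ELIDED_FORMS) + r")['’](?=\w)",
    re.IGNORECASE,
)


def strip_elisions(text):
    """Remove leading elided articles and pronouns, e.g. l'avion -> avion."""
    return _ELISION_RE.sub("", text)


print(strip_elisions("l'avion"))
```

The word-boundary check leaves words such as "aujourd'hui" untouched, because the "d" there is not at the start of a word.
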
  33. 33. Search Interface expectation (1/3) -Advanced indexing and querying tools. -Provides distributed searching capabilities to prevent a bottleneck on a particular server. -Provides document excerpt (snippet) generation that gives a summary of the search. -Relevance ranking displays extracts from the documents based on the query.
  34. 34. Search Interface expectation (2/3) -Duplicate document detection, including fuzzy near duplicates -Rich document parsing and indexing without using database indexing. -Ranking control to carry out a targeted ranking of individual documents. -Search grouping by Type / Tag / Categories (general pages, documents, images)
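
Fuzzy near-duplicate detection can be illustrated with word shingles and Jaccard similarity. This is one common textbook technique, swapped in here for illustration; the slides do not say which method the platform uses, and the shingle size and threshold are arbitrary choices.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams of the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.5):
    """Treat two texts as near duplicates above the similarity threshold."""
    return jaccard(a, b) >= threshold


doc_a = "the ministry publishes a new report on energy"
doc_b = "the ministry publishes a new report on climate"
print(is_near_duplicate(doc_a, doc_b))
```

At larger scale, comparing every pair of documents this way becomes expensive, which is why production systems usually hash or sample the shingles first.
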
  35. 35. Search Interface expectation (3/3) -Multi-criteria support -Ranking -Natural language support -Apps support (Android, iPad)
  36. 36. Project at Ministry. Initial decision and guidelines from the Ministry : The new WebSite will be built using Drupal CMS 8.2. The WebSite should be powered by a « Google-like search toolbar ». The WebSite infrastructure should connect with multiple other WebSites. All infrastructure (software) must be Open Source components.
  37. 37. Project at Ministry http://www.developpement-durable.gouv.fr/
  38. 38. Project at Ministry http://www.developpement-durable.gouv.fr/
  39. 39. Project at Ministry - Architecture
  40. 40. Project at Ministry - Architecture
  41. 41. Project at Ministry - Technical. Project steps : Nutch crawler for various WebSites • Facebook, LinkedIn, Twitter, YouTube … • Internal WebSite, previous WebSite. Drupal Forms for metadata & indexing • Specific Forms for different kinds of documents • Drupal CMS process to add new content. Drupal 8 Module for Solr : custom search, monitoring, reporting • The existing Drupal Solr module is limited to a single instance of Drupal • Not possible to use the Solr Admin interface
  42. 42. Project at Ministry - Technical. Additional PHP libraries : Curl : communication Drupal-Solr (HTTP GET, HTTP POST & attached files). Ssh2 : server administration commands. Zookeeper : communication Drupal-Zookeeper. MemCached : communication Drupal-Memcached. Solarium : communication Drupal-Solr (abstraction layer). GoogleApi : YouTube content indexing.
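
The project talks to Solr from PHP (Curl, Solarium). As a language-neutral illustration of the HTTP POST side, here is a minimal Python sketch that prepares, without sending, a JSON update request for Solr's `/update` handler. The host, the core name `ministry` and the document fields are assumptions invented for the example.

```python
import json
from urllib.request import Request


def build_update_request(core_url, docs):
    """Prepare an HTTP POST that would add `docs` to a Solr core and commit."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{core_url}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core and document, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr/ministry",
    [{"id": "page-1", "title": "Example page", "source": "legacy-site"}],
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it, which requires a running Solr server.
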
  43. 43. Project at Ministry – Admin Interface. Drupal 8 add-on to set up the global infrastructure (Zookeeper, Solr)
  44. 44. Project at Ministry – Admin Interface. Drupal 8 add-on to monitor the global infrastructure - Statistics
  45. 45. Project at Ministry - Validation. Project validation & deployment : No problems with Zookeeper, Solr, Nutch. Stress tests for the global platform : initial slowdown with 10 000 simultaneous connections. Sub-project : addressing the Single Point of Failure. Solution : problems with Drupal & MySQL -> Memcached
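
The slide does not detail how Memcached was put in front of Drupal and MySQL. A common pattern for this kind of slowdown is a cache-aside lookup, sketched below in Python with an in-process dictionary standing in for Memcached; the class name, the TTL and the loader function are all hypothetical.

```python
import time


class TTLCache:
    """In-process stand-in for Memcached: values expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def get_or_load(self, key, loader):
        """Return the cached value, calling loader(key) only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)  # e.g. the slow database query
        self._entries[key] = (now + self._ttl, value)
        return value


def load_page(key):
    """Hypothetical slow lookup that the cache is meant to protect."""
    return f"content of {key}"


cache = TTLCache(ttl_seconds=60)
print(cache.get_or_load("home", load_page))
```

Under heavy simultaneous load, repeated requests for the same page then hit the cache instead of the database until the entry expires.
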
  46. 46. Project at Ministry - Next. Next steps : Review of WebSite content … new Ministry. New content to be indexed : • Other WebSites and social content • New set of documents to be added to the repository