A journey in the Dark Web, for companies looking to take control of their search strategy. The objective of this presentation is to show that, at a reasonable cost, any organisation can set up its own search strategy, outside or in parallel with its document management strategy.
The challenge at the French Ministry is to aggregate internal content, external content on social networks (Pinterest, YouTube, Facebook) and external legacy website content (other websites from agencies related to the Ministry), and to provide a brand new website with a "best of breed" interface: search engine, auto-completion and word correction, easy, custom and secured navigation.
The result is awesome: for a budget kept under control, we provided a new Drupal module to monitor and configure Solr 6 indexing and the search engine, together with a custom API to index external websites.
This session comes with a presentation of the project architecture (multi-tier servers) and a live demo of the search interface.
II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment
1. Patrick Beaucamp
Founder of the Vanilla Project
Mail : Patrick.beaucamp@bpm-conseil.com
Custom Open Source Search Engine with Drupal 8
and Solr at French Ministry of Environment
II-SDV, Nice 24th April 2017
1II-SDV, Nice
2. Presentation Agenda
Open Source Search Engine & Search Platform
Some interesting Platforms
Features expected for Search Platforms (Interface)
Open Source Platform at French Ministry
Project Context
Platform Architecture
WebSite Powered by a Search engine
Echo : Tuesday am, presentation from Deep Search 9, and Tuesday pm, presentation from FranceLabs
Personal Experience of Search
3. Searching … and finding !
II-SDV : SEARCH, DATA MINING and VISUALISATION
How many times per day do you Google ? (search, maps, translate …)
Tribute to Open Source at II-SDV
Search is the first Step : collecting information
5. Searching … and finding !
An example – my personal experience
I tried to find a person for 23 years, roughly from 1993 to 2016
From 1993 to 1998 : no search engine available … only private investigators ?
From 1999 to 2015 : regular Search – no results
I found this person on Facebook, not on Google
From a browser : « f + tab » … « g + tab », « y + tab » …
Some years : no search, other years : multiple searches
6. Searching … and finding !
1) We all became private investigators one day or another
8. Searching … and finding !
2) Different search engines lead to different results
9. Searching … and finding !
2) Different search engines by country
10. Searching … and finding !
Funny word : SEO … it's more « how to be found on the Internet » … and you need to pay for it !
11. Searching … and finding !
3) The person I was looking for published on Facebook using his/her real name – it's his/her decision to be visible or not
4) Where do we stand with the « Right to be Forgotten » ?
12. Searching … and finding !
Companies like Facebook have tons of data : they need to provide a search infrastructure (indexing + search interface)
I was lucky to try the Facebook search interface
13. Searching … and finding !
Discovery of Cholera – 1854 (John Snow)
http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
14. Searching … and finding !
Bicycle accidents in the street : who is taking care of traffic management ?
Example in Boston :
http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html
Open Data
17. Search Platform Objectives
Constraints : being able to reach WebSite and content :
Internal WebSites (Intranet) & External WebSites
Internal Document Repositories
Being able to index WebSite content (and page updates)
Being able to store unstructured data
Crawling
Storing
Indexing
18. Search Platform Objectives
Provide usable Search results (auto classification, visualization)
Don’t Forget why and what you search :
• You search in existing documents
• You need visualization tools
• It's not a crystal ball : search reflects the past
Provide usable Search interfaces (semantic search, multi language search …)
Search Interface
Result Visualization
19. Open Source Landscape
Lucene is a Java-based indexing and search API.
Solr is the leading search server built on top of Lucene. Two companies provide packaging and extensions on top of them: LucidWorks (Fusion), on top of Solr, and Elasticsearch, on top of Lucene.
-Nutch is the crawling component
-Tika is a document metadata manager and content analysis toolkit
-ZooKeeper is a coordination service for distributed processes (Solr uses it to share cluster configuration)
21. Search Engine Focus
Lucene : retrieval software library
Use an existing search infrastructure like Solr/Lucene (Vanilla certified)
http://www.lucidworks.com/ or http://www.elasticsearch.org/
22. Hadoop Search Platform - Big Data
-Cloudera with SolrCloud (Solr/Lucene)
-MapR with Elasticsearch (Lucene code)
-Hortonworks with LucidWorks (Solr/Lucene)
23. Nutch
Before indexing your document base, you need to access it !
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Reference : http://nutch.apache.org/
24. Solr
• What is Solr
– Indexing and search engine
• Promoted by the Apache Software Foundation
• Built on top of Apache Lucene (Java search library)
– Major engine characteristics
• Scalable, fault tolerant, distributed indexing, dynamic workload balancing, centralized configuration
– Technical environment
• Java
• Embedded Jetty server for platform administration
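As a minimal sketch of how a client talks to Solr over HTTP, the snippet below builds a query URL for Solr's `select` request handler without sending anything. The host, port and the core name `ministry` are assumptions made for illustration; `q`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build a URL for Solr's /select handler (no request is sent)."""
    params = {
        "q": query,      # the user query
        "rows": rows,    # maximum number of results returned
        "wt": "json",    # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
url = build_select_url("http://localhost:8983", "ministry", "biodiversité")
print(url)
```

`urlencode` takes care of percent-encoding accented characters, which matters for French content.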
30. Search Engine Basics (1/2)
-Synonyms
- It is possible to extend the search to synonyms if they are listed in a glossary. For example, a search for the word “TV” also finds articles containing synonyms of “TV”.
-Metadata
- Dictionary for the list of searchable keywords
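As an illustration of the synonym expansion described on this slide, here is a minimal sketch in Python. It is not Solr's own synonym filter: it only mimics the idea of a glossary in which each line lists equivalent terms separated by commas. The glossary content and the function names are hypothetical.

```python
def load_synonyms(text: str) -> dict[str, set[str]]:
    """Parse lines of comma-separated equivalent terms into a lookup table."""
    table: dict[str, set[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        group = {term.strip().lower() for term in line.split(",") if term.strip()}
        for term in group:
            table.setdefault(term, set()).update(group)
    return table


def expand_query(terms: list[str], table: dict[str, set[str]]) -> list[str]:
    """Return the query terms plus all their listed synonyms, sorted."""
    expanded: set[str] = set()
    for term in terms:
        key = term.lower()
        expanded |= table.get(key, {key})
    return sorted(expanded)


# Hypothetical glossary content, for illustration only.
GLOSSARY = """
# equivalent terms, one group per line
TV, television, télévision
"""

table = load_synonyms(GLOSSARY)
print(expand_query(["TV"], table))
```

A term that is not in the glossary is passed through unchanged, so the expansion never narrows a query.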
31. Search Engine Basics (2/2)
-Reserved Words, Protected Words
- Indexing usually uses stemming, which reduces words to their root, for example to the root "develop", so that a search for the word "development" also finds items containing the word "develop". However, stemming sometimes goes wrong and indexes two unrelated words under the same root. It is possible to prevent the stemming of given words by listing them in the file protwords.txt.
-StopWords
- Stopwords are words that carry no meaning for the search. A word considered insignificant is ignored. Note that some words are insignificant only in some contexts, and others have meaningful homonyms. For example, the French word « été » can refer to the summer season (rather meaningful) or to the past participle of the verb « être » (relatively insignificant). Stopwords are listed in the file stopwords.txt.
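The interplay between stopwords, stemming and protected words can be sketched as follows. This is a toy illustration, not the analysis chain Solr actually runs: the suffix list is a deliberately naive stand-in for a real stemmer, and the word lists are hypothetical samples of what `stopwords.txt` and `protwords.txt` could contain.

```python
SUFFIXES = ("ment", "ing", "s")           # toy suffix list, NOT a real stemmer
STOPWORDS = {"the", "a", "of", "to"}      # hypothetical sample of stopwords.txt
PROTECTED = {"news"}                      # hypothetical sample of protwords.txt


def stem(word: str) -> str:
    """Strip the first matching suffix, unless the word is protected."""
    if word in PROTECTED:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> list[str]:
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]


print(analyze("The development of the news"))
```

Without the protected list, this naive stemmer would reduce "news" to "new" and index two unrelated words under the same root, which is the kind of adverse effect the protected words file is meant to prevent.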
32. Search Engine Current Limits
-Multi-language support (this is where commercial search engines still have more to bring to customers), even if there is now support for Asian languages (Hindi, Thai, Chinese, …)
-Elision :
- Elisions are a feature of French, which consists of contracting words like « le » or « de » when they are followed by a vowel. Example: « le » + « avion » gives « l'avion ». It is possible to remove these elisions using a lexicon.
-Limits solved over the past 3 years
• Full text search interface (language with search engine)
• SubQuery support : now it's ok starting with Solr 4.7 (we are at v6)
• Scalability (this is where Solr is taking technical advantage)
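Elision removal with a lexicon can be sketched in a few lines. The sketch below is an assumption about how such a filter might work, not the filter shipped with Solr: the lexicon `ELISIONS` is a small hand-written list, and the function name is hypothetical.

```python
import re

# Elided articles and particles; a small hand-written lexicon for illustration.
ELISIONS = ("l", "d", "j", "m", "t", "s", "n", "c", "qu")

# An elided form at the start of a word, followed by a straight or
# typographic apostrophe and then by a word character.
_ELISION_RE = re.compile(r"\b(?:%s)['’](?=\w)" % "|".join(ELISIONS), re.IGNORECASE)


def strip_elisions(text: str) -> str:
    """Remove leading elided forms such as l' or d' from each word."""
    return _ELISION_RE.sub("", text)


print(strip_elisions("l'avion d'Orly"))
```

The word-boundary check keeps words such as "aujourd'hui" intact, because the "d" there is not at the start of a word.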
33. Search Interface Expectations (1/3)
-Advanced indexing and querying tools.
-Distributed searching capabilities, to prevent a particular server from becoming a bottleneck.
-Document excerpt (snippet) generation, which provides a summary of the search results.
-Relevance ranking, with display of extracts from the documents based on the query.
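The excerpt (snippet) generation mentioned on this slide can be illustrated with a very small sketch. It is not Solr's highlighter: it simply cuts a window of text around the first occurrence of the query term. The function name and the `radius` parameter are hypothetical.

```python
def make_snippet(text: str, term: str, radius: int = 30) -> str:
    """Return an excerpt of `text` centred on the first occurrence of `term`."""
    pos = text.lower().find(term.lower())
    if pos < 0:                      # term absent: fall back to the beginning
        return text[: 2 * radius].rstrip()
    start = max(0, pos - radius)
    end = min(len(text), pos + len(term) + radius)
    prefix = "…" if start > 0 else ""
    suffix = "…" if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix


page = "The new website of the Ministry is powered by Solr and indexes legacy websites."
print(make_snippet(page, "solr", radius=15))
```

A real search engine also marks the matched terms and picks the best passage rather than the first one; the sketch only shows the principle of a query-dependent extract.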
34. Search Interface Expectations (2/3)
-Duplicate document detection, including fuzzy near-duplicates
-Rich document parsing and indexing without using database indexing.
-Ranking control, to carry out a targeted ranking of individual documents.
-Search grouping by type / tag / category (general pages, documents, images)
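Fuzzy near-duplicate detection can be approached in several ways. The sketch below uses one common technique, Jaccard similarity over word shingles; it is an illustrative choice, not necessarily what Solr or this project uses, and the threshold value is an arbitrary assumption.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.5) -> bool:
    """Flag two documents whose shingle sets overlap at least `threshold`."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


a = "the ministry publishes a new report on biodiversity today"
b = "the ministry publishes a new report on biodiversity"
print(is_near_duplicate(a, b))
```

Comparing every pair of documents this way does not scale to a large index; production systems usually compute a compact signature per document at indexing time instead.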
35. Search Interface Expectations (3/3)
-Multi-criteria support
-Ranking
-Natural language support
-Apps support (Android, iPad)
36. Project at Ministry
Initial decision and guidelines from Ministry
New WebSite will be done using Drupal CMS 8.2
WebSite should be powered by a « Google-like Search Toolbar »
WebSite – Infrastructure – should connect with multiple other WebSites
All Infra (Software) must be Open Source components
41. Project at Ministry - Technical
Project Steps
Nutch crawler for various WebSites
• Facebook, LinkedIn, Twitter, Youtube …
• Internal WebSite, Previous WebSite
Drupal Forms for Metadata & indexation
• Specific Forms for different kind of documents
• Drupal CMS process to add new content
Drupal 8 Module for Solr : custom search, monitoring, reporting
• The existing Drupal Solr module is limited to a single instance of Drupal
• Not possible to use Solr Admin interface
42. Project at Ministry - Technical
Additional PHP libraries
Curl : Communication Drupal-Solr (http-get http-post & attached file)
Ssh2 : server administration command
Zookeeper : Communication Drupal-Zookeeper
MemCached : Communication Drupal-Memcached
Solarium : Communication Drupal-Solr (abstraction layer)
GoogleApi : youtube content indexation
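The project uses PHP libraries (Curl, Solarium) for the Drupal-Solr communication. To illustrate what the HTTP POST side of that exchange looks like, here is a sketch in Python rather than PHP. It prepares, without sending, a JSON request to Solr's `update` handler. The host, the core name `ministry` and the document fields are assumptions made for illustration.

```python
import json
from urllib.request import Request


def build_update_request(base_url: str, core: str, docs: list[dict]) -> Request:
    """Prepare (without sending) a POST that adds documents to a Solr core."""
    url = f"{base_url.rstrip('/')}/solr/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, core and field names, for illustration only.
request = build_update_request(
    "http://localhost:8983",
    "ministry",
    [{"id": "page-1", "title": "Biodiversité", "source": "legacy-website"}],
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

An abstraction layer such as Solarium hides this request building behind an object API, which is why both a low-level HTTP library and an abstraction layer appear in the list above.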
43. Project at Ministry – Admin Interface
Drupal 8 add-on to set up the global infrastructure (Zookeeper, Solr)
44. Project at Ministry – Admin Interface
Drupal 8 add-on to monitor the global infrastructure - Statistics
45. Project at Ministry - Validation
Project Validation & Deployment
No problems with Zookeeper, Solr, Nutch
Stress tests for the global platform : initial slowdown with 10 000 simultaneous connections
Sub-Project : Addressing the Single Point of Failure
Solution : Problems with Drupal & MySQL -> MemCached
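The slide does not detail how Memcached was wired in front of Drupal and MySQL. A common way to relieve a database under load is the cache-aside pattern, sketched below under that assumption. The `TTLCache` class is a hypothetical in-memory stand-in for a Memcached client, and `load` stands for any slow lookup such as a database query.

```python
import time


class TTLCache:
    """In-memory stand-in for Memcached: keys expire after `ttl` seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)   # drop missing or expired entry
            return None
        return entry[1]

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)


def cached_fetch(cache: TTLCache, key: str, load):
    """Cache-aside: return the cached value, else call `load` and cache it."""
    value = cache.get(key)
    if value is None:
        value = load(key)          # slow path, e.g. a database query
        cache.set(key, value)
    return value
```

With this pattern, repeated requests for the same page hit the cache and only the first one reaches the database, which is what matters when thousands of connections arrive at the same time.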
46. Project at Ministry - Next
Next Steps
Review of WebSite content … new Ministry
New Content to be indexed :
• Other WebSites and Social Content
• New set of documents to be added in the repository