Using ElasticSearch as a fast, flexible, and scalable solution to search occurrence records and checklists

Christian Gendreau, Canadensys
Marie-Elise Lecoq, GBIF France
Introduction
ElasticSearch is an open source, document oriented, distributed
search engine, built on top of Apache Lucene.

From the ElasticSearch GitHub page
Setup
•  Java 6 or higher
•  Download : # wget …elasticsearch-0.90.5.zip
•  Unzip
Configuration
•  Name your cluster
•  Replication and sharding are enabled by default
•  Start : # bin/elasticsearch
Add data
Using the REST API
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elastic Search"
}'
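The same indexing call can be issued from any HTTP client. The following is a minimal Python sketch, assuming a node listening on localhost:9200 as in the curl example. The helper name build_index_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = "{}/{}/{}/{}".format(base_url.rstrip("/"), index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


# Same document as in the curl example.
tweet = {
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elastic Search",
}
request = build_index_request("http://localhost:9200", "twitter", "tweet", 1, tweet)

# Assumption: a node is reachable at this address. If so, the document can be
# indexed by passing `request` to urllib.request.urlopen().
```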
Import data
Rivers
•  Document-based database (MongoDB)
•  JDBC (relational database)
•  Other data sources (Wikipedia, Twitter)
Mapping
•  Schema-less
•  Customize indexing
•  Customize querying
ElasticSearch at Canadensys
Database of Vascular Plants of Canada (VASCAN)

data.canadensys.net/vascan
Our ElasticSearch index
Index structure for scientific names
•  autocompletion : edge_ngram filter
o  “carex” -> “ca”, “car”, “care”, “carex”
•  genus first letter : pattern_replace filter
o  “carex feta” -> “c. feta”
•  epithet : path_hierarchy tokenizer
o  “carex feta” -> “feta”
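The three transformations can be imitated in plain Python to see what each one emits. This is only a sketch of the behaviour, not the index configuration used by VASCAN: the gram sizes (2 to 10), the regular expression, and the choice of a space delimiter in reverse mode are all assumptions, since the slides show only inputs and outputs.

```python
import re


def edge_ngrams(term, min_gram=2, max_gram=10):
    """Prefixes of `term`, similar to what an edge n-gram filter emits."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


def abbreviate_genus(name):
    """Replace the first word (the genus) with its initial and a period."""
    return re.sub(r"^(\w)\w*\s+", r"\1. ", name)


def reverse_path_tokens(name, delimiter=" "):
    """Suffix tokens, similar to a path hierarchy tokenizer in reverse mode."""
    parts = name.split(delimiter)
    return [delimiter.join(parts[i:]) for i in range(len(parts))]


print(edge_ngrams("carex"))
print(abbreviate_genus("carex feta"))
print(reverse_path_tokens("carex feta"))
```

Under these assumptions, the epithet “feta” is one of the tokens produced for “carex feta”, which is what lets a search on the epithet alone find the name.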
ElasticSearch at GBIF France
Data stored in ElasticSearch are updated whenever MongoDB changes.
The search engine queries ElasticSearch using filters such as taxon,
date, place, dataset, and geolocation.
Statistics are calculated using facets.
ElasticSearch - Solr
•  Solr and ElasticSearch both try to solve the same problem,
with few differences

•  Development setup and production deployment (replication /
sharding) are easier with ElasticSearch

•  By default, ElasticSearch is well configured for Lucene, and
customization remains easy.
Facets
•  Comparable to “GROUP BY” in SQL
•  Mostly used to calculate statistics
•  Example :
curl -XGET [...]
"facets" : {
  "dataset" : {
    "terms" : {
      "field" : "dataset",
      "order" : "term"
…
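To make the “GROUP BY” analogy concrete, the sketch below counts documents per distinct value of a field and orders the result by term, as the facet example requests. It is a plain-Python imitation, not a call to ElasticSearch: the records and dataset names are made up, and the field is treated as a single unanalyzed value.

```python
from collections import Counter


def terms_facet(documents, field):
    """Count documents per distinct value of `field`, ordered by term."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return [{"term": term, "count": counts[term]} for term in sorted(counts)]


# Hypothetical occurrence records; the dataset names are illustrative only.
records = [
    {"dataset": "herbarium-b", "taxon": "Carex feta"},
    {"dataset": "herbarium-a", "taxon": "Carex feta"},
    {"dataset": "herbarium-b", "taxon": "Acer rubrum"},
]
print(terms_facet(records, "dataset"))
```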
API and libraries
REST API
o  interoperability between different programming languages
o  HTTP request

Java API
o  more efficient than the REST API because it uses the binary API
o  built-in marshalling (data formatting on the network)
Query - RESTful API
Example:
$ curl 'localhost:9200/vascan/_search?pretty=1' -d '{
  "query" : {
    "match" : {
      "name" : {
        "query" : "carex"
      }
    }
  }
}'
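A client typically builds this request body programmatically and then reads the matching documents out of the response. The sketch below assumes the usual hits/_source response layout. The helper names are hypothetical, and the sample response is hand-written for illustration rather than returned by a real index.

```python
import json


def match_query(field, text):
    """Request body for a match query on a single field."""
    return {"query": {"match": {field: {"query": text}}}}


def hit_sources(response):
    """Return the stored documents from a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Body equivalent to the one passed to curl with -d.
body = json.dumps(match_query("name", "carex"))

# Hand-written response; the values are illustrative, not real results.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {"_index": "vascan", "_id": "1", "_source": {"name": "carex feta"}}
        ],
    }
}
print(hit_sources(sample_response))
```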
Query - Java API
Code example:
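import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.index.query.QueryBuilders;
...
// Search the vernacular type for names matching the user's text
SearchRequestBuilder srb = client.prepareSearch(INDEX_NAME)
    .setQuery(QueryBuilders
        .boolQuery()
        .should(QueryBuilders.matchQuery("vernacular_name", text)))
    .setTypes(VERNACULAR_TYPE);
...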
Pitfalls
•  Error reporting (index creation, river creation)
•  Results may be hard to predict when using complex queries
•  Documentation
•  Every mapping modification comes with a “free” reindex from the
source data
Future
•  Scientific Name analyzer
•  Geospatial component
Thank you!
