3. About me
Alberto Paro, @aparo77
My motto: “always learning”
CTO at Big Data Technologies
Freelance Consulting
International companies (Italy, Switzerland, Austria, USA)
Web Development on Big Data Solutions
NLP/Spark/Lucene/SOLR/ElasticSearch implementations &
training
Reactive and Functional Programming (Scala, Akka,
Spray.io, Play)
4. About me
Packt Publishing Book Author and reviewer
ElasticSearch Cookbook (Author, Dec 2013)
ElasticSearch Server (Review, Apr 2014)
ElasticSearch Cookbook – Second Edition (Author, Dec 2014)
Using ElasticSearch from 2010 ~ version 1.10
PyES – ElasticSearch python driver used by Cern, IBM, …
ElasticSearch MongoDB river
Django ElasticSearch Engine
For companies I developed up to 4 ORMs for ElasticSearch
(.Net, Python, Scala) and several plugins
5. ElasticSearch
Apache Lucene
Started in 2010 by Shay Banon
Open Source – Apache License
A company was formed in 2012: ElasticSearch
Training, support and development
6. ElasticSearch
Scalable
Distributed, Node Discovery
Automatic sharding
Query distribution
RESTful, HTTP API
With API wrappers for .Net, Ruby, Java, Scala, …
JSON in, JSON out -> JSON Coast-to-Coast
Document Model
Maps JSON to objects
“schemaless” -> field type recognition
Keeps source, keeps ‘version’ number, keeps timestamp, …
7. ElasticSearch
Field types and analyzers
String, numeric, geo, …
Custom types: attachments, IP, IBAN, …
Arrays, subdocuments, nested documents
Integrated Aggregations
Your big data insights
Terms
Min/Max/Avg/Sum
Top hit
Geo Distance
And more
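A minimal sketch of how the aggregations listed above might be combined in one search request body. The field names (`tag`, `pages`) are borrowed from the book document shown in these slides, and the bucket names are made up for illustration; nothing here is sent to a server.

```python
import json

def build_aggs_body(tag_field="tag", numeric_field="pages"):
    """Build a search body: a terms aggregation with metric sub-aggregations.

    The field names are hypothetical defaults, not a real mapping.
    """
    return {
        "size": 0,  # only aggregation results are wanted, no hits
        "aggs": {
            "by_tag": {
                "terms": {"field": tag_field},
                "aggs": {
                    "avg_pages": {"avg": {"field": numeric_field}},
                    "max_pages": {"max": {"field": numeric_field}},
                },
            }
        },
    }

body = build_aggs_body()
print(json.dumps(body, indent=2))
```

The printed JSON is what would be posted to the `_search` endpoint of an index.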
8. DBMS -> ElasticSearch
DBMS      ElasticSearch  MongoDB
Database  Index          Database
Table     Type           Collection
Field     Field          Field
Record    Document       Document
Users must rethink their models.
9. DBMS -> ElasticSearch
Data modelling is the same as Entity-Relationship, plus:
Multi values
Embedding
Mutable/Immutable data
Three alternatives to foreign keys:
Term query
Parent/Child
Nested
{
  "book" : {
    "isbn" : "9781782166627",
    "name" : "ElasticSearch Cookbook",
    "author" : {
      "first_name" : "Alberto",
      "last_name" : "Paro"
    },
    "pages" : 430,
    "tag" : ["elasticsearch", "java", "python", "Rest"]
  }
}
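A rough sketch of two of the foreign key alternatives applied to the book document: a term query on an exact value, and a nested query on the author. The mapping is an assumption written in the 1.x style (`string` fields, `not_analyzed` for exact matching); the helper names are hypothetical.

```python
import json

# Assumed mapping: "author" is declared as nested so that first_name and
# last_name are matched inside the same sub-document.
book_mapping = {
    "book": {
        "properties": {
            "isbn": {"type": "string", "index": "not_analyzed"},
            "name": {"type": "string"},
            "author": {
                "type": "nested",
                "properties": {
                    "first_name": {"type": "string"},
                    "last_name": {"type": "string"},
                },
            },
            "pages": {"type": "integer"},
            "tag": {"type": "string", "index": "not_analyzed"},
        }
    }
}

def term_query(field, value):
    """Exact-value lookup, the closest thing to a foreign key join."""
    return {"query": {"term": {field: value}}}

def nested_query(path, must_clauses):
    """Match all clauses within a single nested sub-document."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"bool": {"must": must_clauses}},
            }
        }
    }

by_isbn = term_query("isbn", "9781782166627")
by_author = nested_query("author", [
    {"match": {"author.first_name": "Alberto"}},
    {"match": {"author.last_name": "Paro"}},
])
print(json.dumps(by_author, indent=2))
```

Parent/child is the third option; it keeps the related documents separate and is not shown here.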
10. Common Pitfalls
Schema(less)?
Automatic field type recognition
Can miss types
Strict about types: only some types can be upgraded
Check the datetime:
UNIX (epoch from …) (the standard world)
ISO 8601 -> “yyyy-MM-ddTHH:mm:ssZ”
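One way to sidestep the detection problems above is to declare date fields explicitly and always send the same date representation. The sketch below formats a datetime both ways; the `published` field name and the mapping fragment are hypothetical.

```python
from datetime import datetime, timezone

def to_iso8601(dt):
    """Format an aware datetime as yyyy-MM-ddTHH:mm:ssZ, normalized to UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def to_epoch_millis(dt):
    """Milliseconds since the UNIX epoch for an aware datetime."""
    return int(dt.timestamp() * 1000)

# Assumed explicit mapping: declaring the type up front means the field
# never depends on what the first indexed value looked like.
date_mapping = {"properties": {"published": {"type": "date"}}}

published = datetime(2013, 12, 1, 12, 0, 0, tzinfo=timezone.utc)
print(to_iso8601(published))
print(to_epoch_millis(published))
```

Picking one representation per field and sticking to it matters more than which of the two is chosen.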
11. Common Pitfalls
What’s the best transport protocol?
On the JVM, prefer the native protocol
Faster
Extra bonus
HTTP is best behind a load balancer
Thrift is best for performance
Faster than HTTP
Charset “safe”
12. Common Pitfalls
Never, never expose your ElasticSearch server outside the
DMZ
Security problems with scripting
Simple HTTP can destroy your server
Or simply drain your money on Amazon Cloud
ElasticSearch has a lot of problems with URL security
Vulnerabilities
13. Common Pitfalls
Very fast indexing
Bulk indexing:
Set up without replicas (replicas = 0, not 1)
Play with bulk size (300-500-1000-5000-10000)
Performance depends on data complexity
Before indexing (disable refresh):
curl -XPUT localhost:9200/test/_settings -d '{
"index" : {
"refresh_interval" : "-1"
} }'
After indexing (restore refresh):
curl -XPUT localhost:9200/test/_settings -d '{
"index" : {
"refresh_interval" : "1s"
} }'
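To experiment with bulk sizes, the request bodies can be prepared in batches. This is a minimal sketch that only builds the newline-delimited payloads for the `_bulk` endpoint; the index name `test` comes from the curl commands above, while the `book` type and the sample documents are made up.

```python
import json

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def bulk_body(docs, index="test", doc_type="book"):
    """Serialize docs as action line + source line pairs for _bulk."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

# Hypothetical sample data; try different sizes (300, 500, 1000, ...)
docs = [{"name": "doc %d" % i, "pages": i} for i in range(1200)]
batches = [bulk_body(chunk) for chunk in chunked(docs, 500)]
print(len(batches), "bulk requests")
```

Each string in `batches` would be posted as one request; timing the requests for several sizes shows where throughput stops improving for a given kind of document.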
14. Common Pitfalls
ElasticSearch uses a lot of memory and file-descriptors!
Optimize them in /etc/security/limits.conf
elasticsearch soft nofile 32000
elasticsearch hard nofile 32000
elasticsearch - memlock unlimited
Set the ES_HEAP_SIZE
ElasticSearch config file config/elasticsearch.yml
bootstrap.mlockall: true
15. Common Pitfalls
Wait for the yellow status
Are you using ElasticSearch as Primary datastore?
It can replace either a DBMS or MongoDB
but it depends on your data
Cron Snapshots
Don’t abuse flush
(Be reactive)
Prefer “update” to re-posting the same object
Use the “version” Luke!
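A small sketch of three of the points above as plain request builders: waiting for yellow status, a partial update instead of a full re-post, and a versioned write. The base URL and the index, type and id values are assumptions; no request is actually sent.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9200"  # assumed local node

def health_url(status="yellow", timeout="30s"):
    """URL that blocks until the cluster reaches `status` or times out."""
    query = urlencode([("wait_for_status", status), ("timeout", timeout)])
    return "%s/_cluster/health?%s" % (BASE, query)

def partial_update(index, doc_type, doc_id, changes):
    """URL and body for the update API: send only the changed fields."""
    url = "%s/%s/%s/%s/_update" % (BASE, index, doc_type, doc_id)
    return url, {"doc": changes}

def versioned_index_url(index, doc_type, doc_id, version):
    """URL for a write that is rejected if the stored version differs."""
    query = urlencode({"version": version})
    return "%s/%s/%s/%s?%s" % (BASE, index, doc_type, doc_id, query)

print(health_url())
print(partial_update("test", "book", "1", {"pages": 431}))
print(versioned_index_url("test", "book", "1", 3))
```

With the versioned write, a concurrent change shows up as a version conflict response instead of a silent overwrite, so the client can re-read and retry.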
16. Common Pitfalls
If possible don’t use rivers
Hard to debug
Reduce your server's responsiveness
Can crash your server
They will be removed (2.0?)
(Prefer Spark SchemaRDD)
Use scripts
The easy way to extend ElasticSearch with simple
functionality
Prefer Groovy (or native Java for performance)
Don’t use inline scripts, if possible
Prefer indexed or file with parameters
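A hedged sketch of the "indexed script with parameters" advice. The `/_scripts/{lang}/{id}` path and the `script_id` key are assumptions based on the 1.x-era API and should be checked against the documentation for the version in use; the script name `add_pages` and its Groovy source are invented for illustration.

```python
import json

def store_script_request(script_id, source, lang="groovy"):
    """Path and body for storing a script once, under a name (assumed API)."""
    return "/_scripts/%s/%s" % (lang, script_id), {"script": source}

def update_with_indexed_script(script_id, params, lang="groovy"):
    """Update body that refers to the stored script and passes values as params."""
    return {"script_id": script_id, "lang": lang, "params": params}

# The script text stays constant; only the parameters vary per request.
path, script_body = store_script_request(
    "add_pages", "ctx._source.pages += extra_pages"
)
update_body = update_with_indexed_script("add_pages", {"extra_pages": 10})

print(path)
print(json.dumps(update_body))
```

Keeping values in `params` rather than splicing them into the script text means the same script can be reused across requests, and user input never becomes part of the script source.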
17. Common Pitfalls
Use plugins
If it’s not available, write a new one
Always backup before upgrading
Snapshots can save your life!
Bug in 1.3.x
Check your plugins for compatibility
Read the ElasticSearch changelog
Sometimes you MUST upgrade your cluster
Use at least 3 nodes (if possible)
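For the backup advice, a minimal sketch of a date-stamped snapshot request that a nightly cron job could issue before an upgrade. The repository name `my_backup` and the naming scheme are assumptions, and the repository is presumed to be registered already; the function only builds the URL.

```python
from datetime import date

BASE = "http://localhost:9200"  # assumed local node

def snapshot_url(repository, day, wait=True):
    """URL for creating a snapshot named after the given day."""
    name = "snapshot-%s" % day.strftime("%Y.%m.%d")  # lowercase name
    url = "%s/_snapshot/%s/%s" % (BASE, repository, name)
    if wait:
        # Block until the snapshot finishes, handy in scripts
        url += "?wait_for_completion=true"
    return url

print(snapshot_url("my_backup", date(2014, 12, 1)))
```

A PUT to this URL would start the snapshot; one snapshot per day keeps restores simple to reason about.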
18. Conclusions
ElasticSearch benefits
Easy to set up
Very clever architecture
Drawbacks
Changing the sharding of a populated index is non-trivial
Pay attention when upgrading
ElasticSearch
Clever architecture, fast, stable, extendable
Does exactly what you need