ELASTICSEARCH, LOGSTASH, KIBANA
COOL SEARCH, ANALYTICS, DATA MINING AND MORE…
OLEKSIY PANCHENKO / LOHIKA / 2015
MY NAME IS…
Oleksiy Panchenko
Software engineer, Lohika
E-mail: oleksij@gmail.com
Twitter: oleskiyp
LinkedIn:
https://ua.linkedin.com/in/opanchenko
AGENDA
• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is
Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Elasticsearch ecosystem. ELK Stack + Demo
• Q & A
INTRODUCTION
WHAT IS IT ALL ABOUT?
HOW TO MAKE YOUR SITE
SEARCHABLE?
http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
• Google search
• Why not use plain vanilla SQL? RDBMS rocks!
select *
from books
join authors
on …
where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C);
Xapian
• Lucene family: Apache Lucene, Elasticsearch,
Apache Solr, Amazon CloudSearch, …
WHO HAS EVER USED
ELASTICSEARCH?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
LUCENE AS A CORE
• Lucene = Low-level Java library (JAR) which
implements search functionality
• Can be used in both web and standalone
applications (desktop, mobile)
• Lucene stores its index as a local binary file
• Implemented in Java, ports to other languages
available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.2.1 (15 June 2015)
LUCENE AS A CORE
• Lucene was originally written in
1999 by Doug Cutting (creator
of Hadoop and Nutch)
http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
MORE ABOUT SEARCH ENGINES
Riak Search
TIME TO TALK ABOUT
ELASTICSEARCH
https://www.elastic.co/products/elasticsearch
Near Real-Time Data (NRT)
Full-Text Search
Multilingual search, geolocation,
fuzzy search, did-you-mean
suggestions, autocomplete
High Availability
Multitenancy
Distributed, Horizontally Scalable
Document-Oriented
Schema-Free
Conflict Management
Optimistic Concurrency Control
Apache 2 Open Source License
Awesome documentation
Large community
Developer-Friendly, RESTful API
Client libraries available for
many programming languages
and frameworks.
ELASTICSEARCH USERS
https://www.elastic.co/use-cases
https://en.wikipedia.org/wiki/Elasticsearch#Users
ELASTICSEARCH – PAST &
PRESENT
• 2004. Shay Banon (aka
Kimchy) started working on
Compass – Java Search
Engine on top of Lucene
• 2010. Initial release of
Elasticsearch
• Latest stable release: 1.7.1
(July 29, 2015)
• 500K downloads per month
• https://github.com/elastic/elasticsearch
http://opensource.hk/sites/default/files/u1/shay-banon.jpg
ELASTICSEARCH
AS A COMPANY
• 2012. Elasticsearch BV; Funding: $104M in 3
rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– found
JUMP START
ELASTIC
DEMO TIME
INSTALLATION &
CONFIGURATION
• Prerequisites:
– JDK 6 or above (recommended: JDK 8)
– RAM: min. 2 GB (recommended: 16–64 GB for
production)
– CPU: more cores matter more than a higher clock rate
– Disks: SSD recommended
• Homebrew, apt, yum: apt-get install elasticsearch
• Download (ZIP, TAR, DEB, RPM):
https://www.elastic.co/downloads/elasticsearch
• Installation is absolutely straightforward and easy:
LET’S TALK ABOUT
TERMINOLOGY
Index ~ DB Schema
Type ~ DB Table
Document ~ Record, JSON object
Mapping ~ Schema definition in RDBMS
DEMO #1
http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg
http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
ARCHITECTURE
AND
DEPLOYMENT
WHY IS ELASTICSEARCH ELASTIC?
Cluster: One or more nodes which share the same cluster name
Node: Running instance of Elasticsearch which belongs to a cluster
Shard: A portion of data; a single Lucene instance. Default: 5 shards in an index
Primary Shard: Master copy of data
Replica Shard: Exact copy of a primary shard. Default: 1 replica
SINGLE-NODE CLUSTER
Shards: 0 1 2 3 4
Hash function*
{ "id": "123", "name": "john", … }
{ "id": "124", "name": "patricia", … }
{ "id": "125", "name": "scott", … }
* Also consider custom routing
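A minimal sketch of how a document could be assigned to one of the five primary shards. It assumes the documented rule shard = hash(routing) % number_of_primary_shards, with the document ID as the default routing value. `zlib.crc32` is only a stand-in for Elasticsearch's internal hash function, so the shard numbers printed here will not match a real cluster.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # Elasticsearch 1.x default


def shard_for(routing_value: str, shards: int = NUMBER_OF_PRIMARY_SHARDS) -> int:
    """Pick a shard as hash(routing) % number_of_primary_shards.

    crc32 is an illustrative stand-in, not the hash Elasticsearch uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % shards


docs = [
    {"id": "123", "name": "john"},
    {"id": "124", "name": "patricia"},
    {"id": "125", "name": "scott"},
]

for doc in docs:
    # Without custom routing, the document ID is the routing value.
    print(doc["id"], "-> shard", shard_for(doc["id"]))
```

Because the result depends on the modulus, changing the number of primary shards would send existing IDs to different shards.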
TWO-NODE CLUSTER
Node 1: 0 1 R2 3 R4
Node 2: R0 R1 2 R3 4
* Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)
BENEFITS OF SHARDING
• Take advantage of multi-core CPUs (each shard is
a single Lucene instance; all shards on a node run inside
that node's JVM and can be searched in parallel)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed
once the index is created; changing it requires full
reindexing
• Max number of documents per shard:
2,147,483,519 – imposed by Lucene
CUSTOM ROUTING
• Social network. Users, events
• event_id: 17567654, 17567655, 17567656, …
user_id: 10300, 10301, …
• No Elasticsearch ID provided: ID will be auto-generated
→ Events will be evenly distributed across the shards
• Obvious approach: Elasticsearch ID = event_id
→ Events will be evenly distributed across the shards
• Custom routing: routing value = user_id
→ Events which belong to the same user will be stored on the same shard
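A small sketch comparing the two placements for the social-network example: routing by event_id spreads events around, while routing by user_id keeps each user's events together. The event data is hypothetical, and `zlib.crc32` again stands in for Elasticsearch's real hash function.

```python
import zlib
from collections import defaultdict

SHARDS = 5


def shard_for(routing_value, shards=SHARDS):
    """hash(routing) % number_of_primary_shards, with crc32 as a stand-in hash."""
    return zlib.crc32(str(routing_value).encode("utf-8")) % shards


# Hypothetical events as (event_id, user_id): two users, three events each.
events = [(17567654 + i, 10300 + i % 2) for i in range(6)]


def placement(events, routing_key):
    """Group event IDs by shard; routing_key chooses the routing value."""
    shards = defaultdict(list)
    for event_id, user_id in events:
        shards[shard_for(routing_key(event_id, user_id))].append(event_id)
    return dict(shards)


by_event = placement(events, lambda event_id, user_id: event_id)
by_user = placement(events, lambda event_id, user_id: user_id)

print("routing = event_id:", by_event)
print("routing = user_id: ", by_user)
```

With user-based routing, a search for one user's events only needs to touch a single shard, at the cost of a less even data distribution.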
ELASTICSEARCH NODE TYPES
• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• TCP ports: 9200 (ext), 9300 (int)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default):
discovery.zen.ping.multicast.enabled
DEPLOYMENT DIAGRAM
INDEXING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
RETRIEVING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html
• In terms of retrieving documents, primary and
replica shards are equivalent: data can be read
from either primary or replica shard
DISTRIBUTED SEARCH
• Given a search query, retrieve the 10 most relevant
results
https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
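A simplified sketch of the query phase: every shard returns its own best hits, and the coordinating node merges them into the global top N. The scores and document IDs are made up for illustration, and the real implementation also handles paging (from + size) and a separate fetch phase.

```python
import heapq


def shard_top_hits(hits, size):
    """Query phase on one shard: its `size` best (score, doc_id) pairs."""
    return heapq.nlargest(size, hits)


def coordinate(shards, size=10):
    """Coordinating node: merge per-shard top hits into the global top `size`."""
    candidates = []
    for hits in shards:
        candidates.extend(shard_top_hits(hits, size))
    return heapq.nlargest(size, candidates)


# Hypothetical per-shard results as (score, doc_id) pairs.
shards = [
    [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    [(0.8, "b1"), (0.7, "b2")],
    [(0.1, "c1"), (0.95, "c2")],
]

print(coordinate(shards, size=3))
```

Each shard must return `size` candidates because the globally best hits may all live on one shard.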
CASE STUDIES
4 REAL-LIFE PROJECTS
http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
GENERAL INFO
• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data
storage
• Both self-managed Elasticsearch installations (AWS,
MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: Gigabytes; millions of documents
• Back-end: Java, Ruby
#1. SOCIAL INFLUENCER
MARKETING PLATFORM
http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg
• Document types: Blog Posts, Bloggers
(Influencers)
• Elasticsearch usage:
– search and rank Influencers by category,
keywords, tags, location, audience,
influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, Dynamo
#2. JOB SITE
http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg
• Document types: Job Postings, Jobseekers
• Find relevant jobs
– Simple one-click search
– Advanced search (title, keywords, industry,
location/distance, salary, requirements)
• Elasticsearch as a Recommendation Engine
Recommend jobs based on: previously
applied/viewed jobs, location, distance,
schedule etc.
• 2 types of recommendations:
– Side banner (You also might be interested
in…)
– E-mail subscriptions every 2 weeks
• No fixed document structure (jobs from
different providers)
• Full-text search
• Fuzzy search
• Geolocation (distance)
• Weighted search: Boosted search
clauses
• Dynamic scripting (MVEL until v1.4.0, then
Groovy)
SEARCH QUERIES
SOME MORE FACTS
• Amount of data:
– Job postings: ~1M
– Applicants: ~20K
• Cluster size: 2 ‘medium’ EC2 instances
• Technology stack:
– Ruby on Rails
– Elasticsearch, PostgreSQL, Redis
– Heroku + add-ons, AWS (S3, EC2)
– Lots of 3rd party APIs and integrations
IMPLEMENTATION (RUBY)
• A Model is ActiveRecord (Ruby on Rails ORM)
• ActiveRecord can persist itself to the database
• ActiveRecord::Callbacks:
– after_commit :index_document, on: [:create, :update]
– after_commit :delete_document, on: [:destroy]
– after_create …
– after_save …
– after_destroy …
• Rake tasks to drop/recreate index, reindex
documents
• Zero-downtime reindexing using aliases
• Ruby/Rails client:
https://github.com/elastic/elasticsearch-rails
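A hedged sketch of the alias swap behind zero-downtime reindexing: the application always queries an alias, a new index is built in the background, and the alias is then repointed in a single request to the `_aliases` endpoint. The alias and index names (`jobs`, `jobs_v1`, `jobs_v2`) are hypothetical, and the snippet only builds the request body rather than calling a cluster.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints `alias` in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application only ever queries the "jobs" alias.
body = alias_swap_actions("jobs", "jobs_v1", "jobs_v2")
print(json.dumps(body, indent=2))
```

Once the swap succeeds, the old index can be dropped, for example from a Rake task.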
LESSONS LEARNED
• Self-managed deployment (EC2) vs. SaaS
(Bonsai @ Heroku)
• Dynamic scripting
• PostgreSQL as a backup search engine
sucks
#3. CAR TRADING
http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png
PARSING ADS
Price
$3900
1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
WAT???
• Fuzzy search (Levenshtein distance algorithm) used to
parse ads and classify cars
• Elasticsearch index contains dictionary (Year, Make,
Model, Trim)
• Used in conjunction with other approaches: regular
expressions, dictionaries of synonyms (VW → Volkswagen,
Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon →
Volkswagen)
– Search most relevant Model (Passat) for Make =
Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
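To illustrate the idea outside Elasticsearch, here is a plain-Python sketch of matching a misspelled make against a dictionary using Levenshtein distance plus a synonym table. The dictionary, the synonyms and the `max_distance` threshold are hypothetical; in the project this lookup was done with fuzzy queries against an Elasticsearch index, not with this code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


MAKES = ["Volkswagen", "Volvo", "Chevrolet"]             # hypothetical dictionary
SYNONYMS = {"vw": "Volkswagen", "chevy": "Chevrolet"}    # hypothetical synonyms


def closest_make(token, max_distance=2):
    """Return the best-matching make, or None if nothing is close enough."""
    key = token.lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    best = min(MAKES, key=lambda make: levenshtein(key, make.lower()))
    return best if levenshtein(key, best.lower()) <= max_distance else None


print(closest_make("volkswagon"))  # one substitution away from "volkswagen"
print(closest_make("VW"))          # resolved through the synonym table
```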
#4. [NDA]
http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg
SOME UNCOVERED INFO
• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO
experts)
– I have a dream that one day this nation will rise up and
live…
– Normalization
I have a dream that one day this nation will rise up and
live…
– Splitting a text into shingles (n-grams), n = 3..10
have dream that
dream that this
that this nation
this nation will
…
– Replacement: Latin ‘c’ → Cyrillic ‘c’
https://en.wikipedia.org/wiki/W-shingling
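A minimal sketch of shingling in Python. The normalized word list is inferred from the shingles listed above, and the resemblance measure (Jaccard similarity of the two shingle sets) is the one described in the linked W-shingling article; the thresholds and normalization rules used in the actual project are not shown here.

```python
def shingles(words, n):
    """All contiguous word n-grams of length n."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def resemblance(text_a, text_b, n=3):
    """Jaccard similarity of the two texts' shingle sets (1.0 = identical)."""
    a = set(shingles(text_a.lower().split(), n))
    b = set(shingles(text_b.lower().split(), n))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Assumed result of normalization for the start of the quote.
normalized = "have dream that this nation will".split()
print(shingles(normalized, 3))

print(resemblance("have dream that this nation will",
                  "have dream that this country will"))
```

A score close to 1.0 suggests duplicate content; changing a single word lowers the score because every shingle containing that word differs.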
QUERY API IN
DEPTH
+ DEMO
FILTERS VS. QUERIES
As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values
Filters are much faster than queries
Filters are usually great candidates for caching
27 Filters available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
QUERIES VS. FILTERS
As a general rule, queries should be used instead
of filters:
• for full text search
• where the result depends on a relevance score
Common approach: Filter as many records as
possible, then query them.
38 Queries available (Elasticsearch v 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
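A sketch of the "filter first, then query" approach as a request body, using the `filtered` query of the Elasticsearch 1.x query DSL (the version covered in this talk). The field names `category` and `title` and the search text are hypothetical.

```python
import json


def filtered_search(text, category, size=10):
    """Exact-value filter narrows the set; the match query scores what is left."""
    return {
        "size": size,
        "query": {
            "filtered": {
                # Binary yes/no condition: cacheable, no scoring.
                "filter": {"term": {"category": category}},
                # Full-text condition: contributes the relevance score.
                "query": {"match": {"title": text}},
            }
        },
    }


body = filtered_search("java developer", "it")
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of an index.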
DEMO #2
http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg
SOME THEORY BEHIND
RELEVANCE SCORING
full AND text AND search AND (elasticsearch OR
lucene)
• Term Frequency: How often does the term
appear in the document?
• Inverse Document Frequency: How often does
the term appear in all documents in the
collection?
• Field-length norm: How long is the field?
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
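The three factors can be sketched with the formulas given in the linked scoring-theory guide for Lucene's classic TF/IDF similarity. This is a simplified single-term weight; the full Lucene scoring function adds further factors (query normalization, coordination, boosts). The corpus numbers below are made up.

```python
import math


def term_frequency(freq):
    """tf = sqrt(number of times the term appears in the field)."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """idf = 1 + ln(numDocs / (docFreq + 1)); rarer terms weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """norm = 1 / sqrt(number of terms in the field); shorter fields weigh more."""
    return 1 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field."""
    return (term_frequency(freq)
            * inverse_document_frequency(num_docs, doc_freq)
            * field_length_norm(num_terms))


# Hypothetical corpus of 1000 documents; the field holds 16 terms.
common = term_weight(freq=4, num_docs=1000, doc_freq=900, num_terms=16)
rare = term_weight(freq=1, num_docs=1000, doc_freq=9, num_terms=16)
print("common term:", round(common, 3))
print("rare term:  ", round(rare, 3))
```

Even though the common term occurs four times, the rare term receives the higher weight because of its inverse document frequency.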
MORE COOL FEATURES
• Indexing attachments: MS Office, ePub, PDF
(Apache Tika)
• Autocomplete suggestion:
• Did-you-mean suggestion:
• Highlight results:
SEARCH IMAGES
https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/
https://github.com/kzwang/elasticsearch-image
http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
ELASTICSEARCH
ECOSYSTEM.
ELK STACK
+ DEMO
CLIENTS
http://blog.euranova.eu/wp-content/uploads/2014/04/programming-languages.png
• Java: 1 native client + 1 community
supported
• Python: 1 official + 7 community supported
• Ruby: 1 official + 7 community supported
• JavaScript: 1 official + 4
• PHP: 1 official + 4
• C#/.NET: 1 official + 2
• Scala: 4
• Groovy (1), Haskell (1), Perl (1), Clojure (1),
Go (3),
R (2), Erlang (3), OCaml (2), Smalltalk (1),
ColdFusion (1), C++ (1)
• Command Line (2)
https://www.elastic.co/guide/en/elasticsearch/client/community/current/clients.html
INTEGRATIONS
• Django
• Ruby on Rails
• Spring, Spring Data
• Node.js
• Symfony, Drupal, Wordpress
• Grails
• Play! Framework
https://www.elastic.co/guide/en/elasticsearch/client/community/current/integrations.html
FRONT ENDS
http://php.archive.razorflow.com/assets/img/header_v1.png
ELASTICSEARCH-HEAD
http://mobz.github.io/elasticsearch-head/
ESCLIENT
https://github.com/rdpatil4/ESClient
AVAILABLE FRONT ENDS
https://www.elastic.co/guide/en/elasticsearch/client/community/current/front-ends.html
• elasticsearch-head: A web front end for an Elasticsearch
cluster.
• browser: Web front-end over elasticsearch data.
• Inquisitor: Front-end to help debug/diagnose queries and
analyzers
• Hammer: Web front-end for elasticsearch
• Calaca: Simple search client for Elasticsearch
• ESClient: Simple search, update, delete client for
Elasticsearch
HEALTH AND PERFORMANCE
http://www.transcend-marketing.co.uk/wp-content/uploads/2014/09/health-check2.png
ELASTICSEARCH-HEAD
https://github.com/mobz/elasticsearch-head
BIGDESK
https://github.com/lukas-vlcek/bigdesk
WHATSON
https://github.com/xyu/elasticsearch-whatson
ELASTICOCEAN
https://itunes.apple.com/us/app/elasticocean/id955278030
HEALTH AND PERFORMANCE
https://www.elastic.co/guide/en/elasticsearch/client/community/current/health.html
• bigdesk: Live charts and statistics for elasticsearch cluster.
• Kopf: Live cluster health and shard allocation monitoring with
administration toolset.
• paramedic: Live charts with cluster stats and indices/shards
information.
• ElasticsearchHQ: Free cluster health monitoring tool
• SPM for Elasticsearch: Performance monitoring with live charts
showing cluster and node stats, integrated alerts, email reports, etc.
• check-es: Nagios/Shinken plugins for checking on elasticsearch
• check_elasticsearch: An Elasticsearch availability and performance
monitoring plugin for Nagios.
• opsview-elasticsearch: Opsview plugin written in Perl for monitoring
Elasticsearch
• SegmentSpy: Plugin to watch Lucene segment merges across your
cluster
• es2graphite: Send cluster and indices stats and status to Graphite for
monitoring and graphing.
• Scout: Provides plugins for monitoring Elasticsearch nodes, clusters,
and indices.
• ElasticOcean: Elasticsearch & DigitalOcean iOS Real-Time Monitoring
10 ES METRICS TO WATCH
http://radar.oreilly.com/2015/04/10-elasticsearch-metrics-to-watch.html
1. Cluster health — nodes and shards
2. Node performance — CPU
3. Node performance — memory usage
4. Node performance — disk I/O
5. Java — heap usage and garbage collection
6. Java — JVM pool size
7. Search performance — request latency and
request rate
8. Search performance — filter cache
9. Search performance — field data cache
10. Indexing performance — refresh times and
merge times
RIVERS (DEPRECATED IN 1.5.0)
http://acuate.typepad.com/.a/6a0120a5e84a91970c01539381efff970b-pi
• JDBC River Plugin, CSV River Plugin
• MongoDB, CouchDB, Solr, Redis, Neo4j,
DynamoDB, RethinkDB, Hazelcast, …
• JMS, RabbitMQ, ActiveMQ, Amazon SQS, Kafka,
…
• Twitter, Wikipedia, Git, GitHub, Subversion, RSS, …
• FileSystem, Dropbox, Google Drive, Amazon S3,
…
• IMAP/POP3, Web, LDAP
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#river
OTHER PLUGINS
https://d2wucpkmh57zie.cloudfront.net/wp-content/uploads/2015/04/plugins-together.jpg
• Internationalization, normalization, analysis,
language support (Chinese, Japanese, Khmer,
Thai etc.), transliteration etc.
• Discovery plugins: Amazon AWS, MS Azure,
Google GCE, ZooKeeper
• Transport plugins: allow using the Elasticsearch REST
API over Servlet, ZeroMQ, Jetty, Redis,
Memcached
• Scripting in Elasticsearch queries: Groovy,
JavaScript, Python, Clojure, SQL (!)
• Front-ends (CRUD operations) & data
visualization
• Snapshot/Restore Repository: HDFS, AWS S3,
GridFS
• Misc: Attachments handling (uses Apache Tika),
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html
ELASTICSEARCH
PRODUCT PORTFOLIO
http://blog.archisnapper.com/wp-content/uploads/architecture-portfolio.jpg
FOUND ($)
• Elasticsearch as a service
• Starts from $45/mo (1GB RAM, 8GB SSD, 1 data
center)
• No deployment and maintenance overhead
https://www.elastic.co/products/found
SHIELD ($)
• Authentication
• Authorization: RBAC
• Encrypted communication, IP filtering
• Audit logging
• Other approaches:
• Jetty instead of
embedded server
• Nginx as a front-end
https://www.elastic.co/products/shield
MARVEL ($)
• Elasticsearch cluster health check, monitoring,
performance
• Real-time and historical analysis
• Customizable dashboards
https://www.elastic.co/products/marvel
WATCHER
• Alerts about anomalies in data
• Proactive monitoring of ES cluster (in
conjunction with Marvel)
• Many notification channels: e-mail, SMS,
webhooks
• Retrospective analysis
• High availability
https://www.elastic.co/products/watcher
ELK
https://pbs.twimg.com/media/CCAkRqVXIAA9cDE.png
LOGSTASH + ELASTIC + KIBANA
LOGSTASH ADVANCED
LOGSTASH
• Variety of inputs and outputs (165 plugins)
• 120 predefined patterns + custom log formats
• Flexible DSL to parse/normalize/enrich logs
• Implemented in Ruby, running on JRuby
https://www.elastic.co/products/logstash
SOME LOGSTASH INPUTS
https://www.elastic.co/guide/en/logstash/current/input-plugins.html
• file
• stdin
• syslog
• eventlog
• jdbc
• varnishlog
• websocket
• log4j
• jmx
• s3
• sqs
• rss
• redis
• rabbitmq
• zeromq
• kafka
• twitter
• elasticsearch
• github
• lumberjack
SOME LOGSTASH OUTPUTS
https://www.elastic.co/guide/en/logstash/current/output-plugins.html
• file
• stdout
• csv
• exec
• elasticsearch
• email
• nagios
• syslog
• redis
• loggly
• jira
• hipchat
• irc
• graphite
• http
• s3
• sqs
• sns
• rabbitmq
• zeromq
KIBANA
• Variety of charts: bar charts, line and scatter
plots, histograms, pie charts, maps
• Flexible and customizable UI, responsive design
• Slice and dice data to get necessary details
• Seamless integration with Elasticsearch
• Simple data export
https://www.elastic.co/products/kibana
DEMO #3
http://25.media.tumblr.com/tumblr_mbduvkuspZ1qe6vsbo1_400.jpg
ELASTICSEARCH DRAWBACKS
• No transaction support. Elasticsearch is not a
database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
PERFORMANCE?
http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
http://solr-vs-elasticsearch.com/
• Apache Solr can be faster than ES in search-only
scenarios while Elasticsearch usually outperforms
Solr when doing writes and reads concurrently
• Sphinx is faster at indexing (up to 15MB/s per
core)
• Performance issues can usually be fixed by
horizontal scaling
SUMMARY
• ES is not a silver bullet, but it is a really powerful
tool
• Elasticsearch is not an RDBMS and is not supposed
to act as a database. Choose your tools
properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple at the start but
can get complicated as you go
• Kick off easily, then hire a good DevOps
engineer for best results
• Ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your
product and your CV ;)
http://www.aperfectworld.org/clipart/gestures/rockhard11.png
QUESTIONS?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
THANK YOU!
http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg
USEFUL LINKS
• Elasticsearch:
https://www.elastic.co/products/elasticsearch
• Logstash: https://www.elastic.co/products/logstash
• Kibana: https://www.elastic.co/products/kibana
• Scripts for the demos:
https://github.com/opanchenko/morning-at-lohika-ELK
