Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...

ELASTICSEARCH, LOGSTASH, KIBANA
COOL SEARCH, ANALYTICS, DATA MINING AND MORE…
OLEKSIY PANCHENKO / LOHIKA / 2015

MY NAME IS…
Oleksiy Panchenko
Software engineer, Lohika
E-mail: oleksij@gmail.com
Twitter: oleskiyp
LinkedIn: https://ua.linkedin.com/in/opanchenko

AGENDA
• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Elasticsearch ecosystem. ELK Stack + Demo
• Q & A

INTRODUCTION
WHAT IS IT ALL ABOUT?

HOW TO MAKE YOUR SITE
SEARCHABLE?
http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png

• Google search
• Why not use plain vanilla SQL? RDBMS rocks!
select *
from books
join authors
on …
where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C);
Xapian
• Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …

WHO HAS EVER USED
ELASTICSEARCH?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png

LUCENE AS A CORE
• Lucene = Low-level Java library (JAR) which
implements search functionality
• Can be used in both web and standalone
applications (desktop, mobile)
• Lucene stores its index as a local binary file
• Implemented in Java, ports to other languages
available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.2.1 (15 June 2015)

LUCENE AS A CORE
• Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch, …)
http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg

MORE ABOUT SEARCH ENGINES
Riak Search

TIME TO TALK ABOUT
ELASTICSEARCH
https://www.elastic.co/products/elasticsearch
Near Real-Time Data (NRT)
Full-Text Search
Multilingual search, geolocation,
fuzzy search, did-you-mean
suggestions, autocomplete

High Availability
Multitenancy
Distributed, Horizontally Scalable

Document-Oriented
Schema-Free
Conflict Management
Optimistic Concurrency Control

Apache 2 Open Source License
Awesome documentation
Large community
Developer-Friendly, RESTful API
Client libraries available for
many programming languages
and frameworks.

ELASTICSEARCH USERS
https://www.elastic.co/use-cases
https://en.wikipedia.org/wiki/Elasticsearch#Users

ELASTICSEARCH – PAST &
PRESENT
• 2004. Shay Banon (aka
Kimchy) started working on
Compass – Java Search
Engine on top of Lucene
• 2010. Initial release of
Elasticsearch
• Latest stable release: 1.7.1
(July 29, 2015)
• 500K downloads per month
• https://github.com/elastic/elasticsearch
http://opensource.hk/sites/default/files/u1/shay-banon.jpg

ELASTICSEARCH
AS A COMPANY
• 2012. Elasticsearch BV; Funding: $104M in 3
rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– found

JUMP START
ELASTIC
DEMO TIME

INSTALLATION &
CONFIGURATION
• Prerequisites:
– JDK 6 or above (recommended: JDK 8)
– RAM: min. 2 GB (recommended: 16–64 GB for production)
– CPU: more cores matter more than a higher clock rate
– Disks: SSD recommended
• Homebrew, apt, yum, e.g.: apt-get install elasticsearch
• Download (ZIP, TAR, DEB, RPM):
https://www.elastic.co/downloads/elasticsearch
• Installation is absolutely straightforward and easy:

LET’S TALK ABOUT
TERMINOLOGY
Index ~ DB Schema
Type ~ DB Table
Document ~ Record, JSON object
Mapping ~ Schema definition in RDBMS
DEMO #1
http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg

http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg

ARCHITECTURE
AND
DEPLOYMENT
WHY IS ELASTICSEARCH ELASTIC?

Cluster: One or more nodes which share the same cluster name
Node: Running instance of Elasticsearch which belongs to a cluster
Shard: A portion of data – single Lucene instance. Default: 5 shards in an index
Primary Shard: Master copy of data
Replica Shard: Exact copy of a primary shard. Default: 1 replica
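
The two defaults can also be set explicitly when an index is created. A small sketch, assuming the standard `number_of_shards` and `number_of_replicas` index settings; the helper names are made up for illustration.

```python
import json


def index_settings(primary_shards=5, replicas=1):
    """Request body for creating an index with explicit shard and replica counts."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(primary_shards, replicas):
    """Every primary has `replicas` copies, so the cluster holds P * (1 + R) shards."""
    return primary_shards * (1 + replicas)


body = index_settings()
print(json.dumps(body))
print(total_shard_copies(5, 1))
```

With the defaults, one index therefore occupies ten shards in total: five primaries and five replicas.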

SINGLE-NODE CLUSTER
Shards: 0 1 2 3 4
Hash function*
{ "id": "123", "name": "john", … }
{ "id": "124", "name": "patricia", … }
{ "id": "125", "name": "scott", … }
* Also consider custom routing
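
The picture boils down to one formula: shard = hash(routing value) % number of primary shards, where the routing value is the document ID unless custom routing is used. A sketch of that formula follows; `zlib.crc32` is only a deterministic stand-in and is not the hash function Elasticsearch uses internally.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # default number of shards per index


def shard_for(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash (an assumption of this sketch),
    not the function Elasticsearch itself applies.
    """
    digest = zlib.crc32(str(routing_value).encode("utf-8")) & 0xFFFFFFFF
    return digest % num_shards


for doc_id in ("123", "124", "125"):
    print(doc_id, "-> shard", shard_for(doc_id))
```

Because the result depends on the shard count, the same ID would map to a different shard if the number of primary shards changed.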

TWO-NODE CLUSTER
Node 1: 0 1 R2 3 R4
Node 2: R0 R1 2 R3 4
* Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)

BENEFITS OF SHARDING
• Take advantage of multi-core CPUs (each shard is a separate Lucene index, so shards can be searched in parallel)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed on the fly; changing it requires full reindexing
• Max number of documents per shard:
2,147,483,519 – imposed by Lucene

CUSTOM ROUTING
• Social network. Users, events
• event_id: 17567654, 17567655, 17567656, …
user_id: 10300, 10301, …
• No Elasticsearch ID provided: ID will be auto-generated
→ Events will be evenly distributed across the shards
• Obvious approach: Elasticsearch ID = event_id
→ Events will be evenly distributed across the shards
• Routing value = user_id
→ Events which belong to the same user will be stored on the same shard
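
One way to keep event_id as the document ID and still co-locate a user's events is the `routing` request parameter. The sketch below only builds request paths; the index name `events`, the type `event` and the pairing of users with events are made up for illustration.

```python
from urllib.parse import urlencode


def index_url(index, doc_type, doc_id, routing=None):
    """Path for indexing one document, optionally with a custom routing value."""
    path = "/{}/{}/{}".format(index, doc_type, doc_id)
    if routing is not None:
        path += "?" + urlencode({"routing": routing})
    return path


events = [
    {"event_id": 17567654, "user_id": 10300},
    {"event_id": 17567655, "user_id": 10300},
    {"event_id": 17567656, "user_id": 10301},
]
for event in events:
    print(index_url("events", "event", event["event_id"], routing=event["user_id"]))
```

A search that passes the same routing value only has to visit the one shard that holds that user's events.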

ELASTICSEARCH NODE TYPES
• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• TCP ports: 9200 (external, HTTP), 9300 (internal, node-to-node transport)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default):
discovery.zen.ping.multicast.enabled

INDEXING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html

RETRIEVING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html
• In terms of retrieving documents, primary and
replica shards are equivalent: data can be read
from either primary or replica shard

DISTRIBUTED SEARCH
• Given a search query, retrieve the 10 most relevant results
https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
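
The query phase described at the link can be pictured as scatter-gather: every shard returns its own top 10, and the coordinating node merges those candidates into the global top 10. A toy sketch with hard-coded scores (the data and function names are illustrative only):

```python
import heapq


def shard_top_hits(shard_docs, size):
    """Query phase on one shard: its local top `size` (score, doc_id) pairs."""
    return heapq.nlargest(size, shard_docs)


def coordinate(shards, size=10):
    """Coordinating node: merge per-shard candidates into the global top `size`."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(shard_top_hits(shard_docs, size))
    return heapq.nlargest(size, candidates)


shards = [
    [(0.9, "a"), (0.2, "b")],
    [(0.7, "c"), (0.8, "d")],
    [(0.1, "e")],
]
print(coordinate(shards, size=3))
```

Each shard has to return `size` candidates, because in the worst case all of the best hits live on a single shard.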

CASE STUDIES
4 REAL-LIFE PROJECTS
http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru

GENERAL INFO
• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data
storage
• Both self-managed Elasticsearch installations (on AWS, MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: Gigabytes; millions of documents
• Back-end: Java, Ruby

#1. SOCIAL INFLUENCER
MARKETING PLATFORM
http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg

• Document types: Blog Posts, Bloggers
(Influencers)
• Elasticsearch usage:
– search and rank Influencers by category,
keywords, tags, location, audience,
influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, Dynamo

#2. JOB SITE
http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg

• Document types: Job Postings, Jobseekers
• Find relevant jobs
– Simple one-click search
– Advanced search (title, keywords, industry,
location/distance, salary, requirements)
• Elasticsearch as a Recommendation Engine
Recommend jobs based on: previously
applied/viewed jobs, location, distance,
schedule etc.
• 2 types of recommendations:
– Side banner (You also might be interested
in…)
– E-mail subscriptions every 2 weeks

SEARCH QUERIES
• No fixed document structure (jobs from different providers)
• Full-text search
• Fuzzy search
• Geolocation (distance)
• Weighted search: boosted search clauses
• Dynamic scripting (MVEL until v1.4.0, then Groovy)
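
A hedged sketch of how several of these features can meet in one Elasticsearch 1.x request body: fuzzy full-text matching, a boosted clause, and a geo-distance filter. The field names `title`, `description` and `location`, the boost value and the default distance are hypothetical, not taken from the project.

```python
import json


def job_search_body(keywords, lat, lon, distance="25km"):
    """Build a 1.x-style search body; all field names are hypothetical."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "bool": {
                        "should": [
                            # Title matches are weighted higher and tolerate typos.
                            {"match": {"title": {"query": keywords,
                                                 "fuzziness": "AUTO",
                                                 "boost": 3}}},
                            {"match": {"description": {"query": keywords}}},
                        ]
                    }
                },
                # Geolocation is a yes/no condition, so it is expressed as a filter.
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


body = job_search_body("java developer", 50.45, 30.52)
print(json.dumps(body, indent=2))
```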

SOME MORE FACTS
• Amount of data:
– Job postings: ~1M
– Applicants: ~20K
• Cluster size: 2 ‘medium’ EC2 instances
• Technology stack:
– Ruby on Rails
– Elasticsearch, PostgreSQL, Redis
– Heroku + add-ons, AWS (S3, EC2)
– Lots of 3rd party APIs and integrations

IMPLEMENTATION (RUBY)
• A Model is ActiveRecord (Ruby on Rails ORM)
• ActiveRecord can persist itself to the database
• ActiveRecord::Callbacks:
– after_commit :index_document, on: [:create, :update]
– after_commit :delete_document, on: :destroy
– after_create…
– after_save …
– after_destroy…
• Rake tasks to drop/recreate index, reindex
documents
• Zero-downtime reindexing using aliases
• Ruby/Rails client:
https://github.com/elastic/elasticsearch-rails
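
Zero-downtime reindexing with aliases usually means the application talks to an alias, a new index is built in the background, and the alias is then repointed in a single request. A sketch of the body for `POST /_aliases`; the alias and index names are hypothetical, and Python is used here only for illustration (the project itself was Ruby).

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases that repoints `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_body("jobs", "jobs_v1", "jobs_v2")
print(json.dumps(swap, indent=2))
```

Both actions travel in one request, so readers of the `jobs` alias never see a moment without an index behind it.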

LESSONS LEARNED
• Self-managed deployment (EC2) vs. SaaS (Bonsai @ Heroku)
• Dynamic scripting
• PostgreSQL as a backup search engine
sucks

#3. CAR TRADING
http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png

1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
WAT???
• Fuzzy search (Levenshtein distance algorithm) used to parse ads and classify cars
• Elasticsearch index contains a dictionary (Year, Make, Model, Trim)
• Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon → Volkswagen)
– Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
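
The idea behind the Make step can be sketched outside Elasticsearch: compute the Levenshtein distance between the token from the ad and each dictionary entry, and accept the closest entry if it is near enough. The dictionary contents and the distance threshold below are assumptions; in the project the lookup was done by Elasticsearch fuzzy queries, not by code like this.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


MAKES = ["Volkswagen", "Volvo", "Chevrolet"]           # hypothetical dictionary
SYNONYMS = {"vw": "Volkswagen", "chevy": "Chevrolet"}  # synonyms named on the slide


def closest_make(token, max_distance=2):
    """Resolve a token to a known make via synonyms first, then fuzzy matching."""
    token = token.lower()
    if token in SYNONYMS:
        return SYNONYMS[token]
    best = min(MAKES, key=lambda make: levenshtein(token, make.lower()))
    return best if levenshtein(token, best.lower()) <= max_distance else None


print(closest_make("volkswagon"))
print(closest_make("VW"))
```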

#4. [NDA]
http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg

SOME UNCOVERED INFO
• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO
experts)
– I have a dream that one day this nation will rise up and
live…
– Normalization (stop words removed)
have dream that this nation will…
– Splitting a text into shingles (n-grams), n = 3..10
have dream that
dream that this
that this nation
this nation will
…
– Replacement: Latin ‘c’ → Cyrillic ‘с’
https://en.wikipedia.org/wiki/W-shingling
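
A minimal sketch of the shingling steps for n = 3. The stop-word list is an assumption inferred from the example shingles above, and the resemblance measure (Jaccard similarity of the shingle sets) follows the W-shingling article linked above; it is not necessarily what the project used.

```python
import re

STOP_WORDS = {"i", "a", "one", "day"}  # assumption inferred from the slide example


def normalize(text):
    """Lower-case, split into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def shingles(words, n=3):
    """All runs of n consecutive words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def resemblance(a, b, n=3):
    """Jaccard similarity of the two shingle sets (1.0 = identical content)."""
    sa = set(shingles(normalize(a), n))
    sb = set(shingles(normalize(b), n))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


text = "I have a dream that one day this nation will"
print(shingles(normalize(text)))
```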

FILTERS VS. QUERIES
As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values
Filters are much faster than queries
Filters are usually great candidates for caching
27 Filters available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html

QUERIES VS. FILTERS
As a general rule, queries should be used instead
of filters:
• for full text search
• where the result depends on a relevance score
Common approach: Filter as many records as
possible, then query them.
38 Queries available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
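
"Filter as many records as possible, then query them" can be illustrated with a toy in-memory search: the filter step is a plain yes/no test on exact values, and only the survivors are scored. The scoring here (word counts) is deliberately naive and is not Lucene's; the documents are made up.

```python
def search(docs, filters, text):
    """Filter on exact values first, then score only the surviving documents."""
    # Filter step: binary yes/no, no scoring involved.
    survivors = [d for d in docs
                 if all(d.get(field) == value for field, value in filters.items())]
    # Query step: toy relevance score = number of matching words in the title.
    terms = text.lower().split()
    scored = []
    for d in survivors:
        words = d["title"].lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scored.append((score, d["id"]))
    return sorted(scored, reverse=True)


docs = [
    {"id": 1, "status": "published", "title": "Elasticsearch search basics"},
    {"id": 2, "status": "draft", "title": "search search search"},
    {"id": 3, "status": "published", "title": "Kibana dashboards"},
    {"id": 4, "status": "published", "title": "Full text search and search relevance"},
]
print(search(docs, {"status": "published"}, "search"))
```

The draft document would have scored highest, but it never reaches the scoring step; that is the saving the slide refers to.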

DEMO #2
http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg

SOME THEORY BEHIND
RELEVANCE SCORING
full AND text AND search AND (elasticsearch OR
lucene)
• Term Frequency: How often does the term
appear in the document?
• Inverse Document Frequency: How often does
the term appear in all documents in the
collection?
• Field-length norm: How long is the field?
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
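
The three factors can be written down as small functions. The formulas below follow the scoring-theory guide linked above (Lucene's classic TF/IDF similarity); the full practical scoring function combines them with further factors such as boosts and query normalization, which are omitted in this sketch.

```python
import math


def tf(term_frequency):
    """Term frequency: more occurrences in the field raise the score, with damping."""
    return math.sqrt(term_frequency)


def idf(num_docs, doc_freq):
    """Inverse document frequency: terms that are rare in the collection weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: a match in a short field weighs more than in a long one."""
    return 1.0 / math.sqrt(num_terms)


print(tf(4), field_norm(4))
print(idf(1000, 9), idf(1000, 99))
```

A rare term such as "lucene" therefore contributes more to the score than a common one such as "search", and a hit in a short title counts for more than a hit in a long body.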

MORE COOL FEATURES
• Indexing attachments: MS Office, ePub, PDF
(Apache Tika)
• Autocomplete suggestion:
• Did-you-mean suggestion:
• Highlight results:

SEARCH IMAGES
https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/
https://github.com/kzwang/elasticsearch-image

ELASTICSEARCH
ECOSYSTEM.
ELK STACK
+ DEMO

CLIENTS
http://blog.euranova.eu/wp-content/uploads/2014/04/programming-languages.png

• Java: 1 native client + 1 community
supported
• Python: 1 official + 7 community supported
• Ruby: 1 official + 7 community supported
• JavaScript: 1 official + 4
• PHP: 1 official + 4
• C#/.NET: 1 official + 2
• Scala: 4
• Groovy (1), Haskell (1), Perl (1), Clojure (1),
Go (3),
R (2), Erlang (3), OCaml (2), Smalltalk (1),
ColdFusion (1), C++ (1)
• Command Line (2)
https://www.elastic.co/guide/en/elasticsearch/client/community/current/clients.html

INTEGRATIONS
• Django
• Ruby on Rails
• Spring, Spring Data
• Node.js
• Symfony, Drupal, Wordpress
• Grails
• Play! Framework
https://www.elastic.co/guide/en/elasticsearch/client/community/current/integrations.html

FRONT ENDS
http://php.archive.razorflow.com/assets/img/header_v1.png

ELASTICSEARCH-HEAD
http://mobz.github.io/elasticsearch-head/

ESCLIENT
https://github.com/rdpatil4/ESClient

AVAILABLE FRONT ENDS
https://www.elastic.co/guide/en/elasticsearch/client/community/current/front-ends.html
• elasticsearch-head: A web front end for an Elasticsearch
cluster.
• browser: Web front-end over elasticsearch data.
• Inquisitor: Front-end to help debug/diagnose queries and
analyzers
• Hammer: Web front-end for elasticsearch
• Calaca: Simple search client for Elasticsearch
• ESClient: Simple search, update, delete client for
Elasticsearch

HEALTH AND PERFORMANCE
http://www.transcend-marketing.co.uk/wp-content/uploads/2014/09/health-check2.png

ELASTICSEARCH-HEAD
https://github.com/mobz/elasticsearch-head

BIGDESK
https://github.com/lukas-vlcek/bigdesk

WHATSON
https://github.com/xyu/elasticsearch-whatson

ELASTICOCEAN
https://itunes.apple.com/us/app/elasticocean/id955278030

HEALTH AND PERFORMANCE
https://www.elastic.co/guide/en/elasticsearch/client/community/current/health.html
• bigdesk: Live charts and statistics for elasticsearch cluster.
• Kopf: Live cluster health and shard allocation monitoring with
administration toolset.
• paramedic: Live charts with cluster stats and indices/shards
information.
• ElasticsearchHQ: Free cluster health monitoring tool
• SPM for Elasticsearch: Performance monitoring with live charts
showing cluster and node stats, integrated alerts, email reports, etc.
• check-es: Nagios/Shinken plugins for checking on elasticsearch
• check_elasticsearch: An Elasticsearch availability and performance
monitoring plugin for Nagios.
• opsview-elasticsearch: Opsview plugin written in Perl for monitoring
Elasticsearch
• SegmentSpy: Plugin to watch Lucene segment merges across your
cluster
• es2graphite: Send cluster and indices stats and status to Graphite for
monitoring and graphing.
• Scout: Provides plugins for monitoring Elasticsearch nodes, clusters,
and indices.
• ElasticOcean: Elasticsearch & DigitalOcean iOS Real-Time Monitoring

10 ES METRICS TO WATCH
http://radar.oreilly.com/2015/04/10-elasticsearch-metrics-to-watch.html
1. Cluster health — nodes and shards
2. Node performance — CPU
3. Node performance — memory usage
4. Node performance — disk I/O
5. Java — heap usage and garbage collection
6. Java — JVM pool size
7. Search performance — request latency and
request rate
8. Search performance — filter cache
9. Search performance — field data cache
10.Indexing performance — refresh times and
merge times

RIVERS (DEPRECATED IN 1.5.0)
http://acuate.typepad.com/.a/6a0120a5e84a91970c01539381efff970b-pi

• JDBC River Plugin, CSV River Plugin
• MongoDB, CouchDB, Solr, Redis, Neo4j,
DynamoDB, RethinkDB, Hazelcast, …
• JMS, RabbitMQ, ActiveMQ, Amazon SQS, Kafka,
…
• Twitter, Wikipedia, Git, GitHub, Subversion, RSS, …
• FileSystem, Dropbox, Google Drive, Amazon S3,
…
• IMAP/POP3, Web, LDAP
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#river

OTHER PLUGINS
https://d2wucpkmh57zie.cloudfront.net/wp-content/uploads/2015/04/plugins-together.jpg

• Internationalization, normalization, analysis, language support (Chinese, Japanese, Khmer, Thai etc.), transliteration etc.
• Discovery plugins: Amazon AWS, MS Azure,
Google GCE, ZooKeeper
• Transport plugins: allow using the Elasticsearch REST API over Servlet, ZeroMQ, Jetty, Redis, Memcached
• Scripting in Elasticsearch queries: Groovy,
JavaScript, Python, Clojure, SQL (!)
• Front-ends (CRUD operations) & data
visualization
• Snapshot/Restore Repository: HDFS, AWS S3,
GridFS
• Misc: Attachments handling (uses Apache Tika),
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html

ELASTICSEARCH
PRODUCT PORTFOLIO
http://blog.archisnapper.com/wp-content/uploads/architecture-portfolio.jpg

FOUND ($)
• Elasticsearch as a service
• Starts from $45/mo (1GB RAM, 8GB SSD, 1 data
center)
• No deployment and maintenance overhead
https://www.elastic.co/products/found

SHIELD ($)
• Authentication
• Authorization: RBAC
• Encrypted communication, IP filtering
• Audit logging
• Other approaches:
• Jetty instead of
embedded server
• Nginx as a front-end
https://www.elastic.co/products/shield

MARVEL ($)
• Elasticsearch cluster health check, monitoring,
performance
• Real-time and historical analysis
• Customizable dashboards
https://www.elastic.co/products/marvel

WATCHER
• Alerts about anomalies in data
• Proactive monitoring of ES cluster (in
conjunction with Marvel)
• Many notification channels: e-mail, SMS, webhooks
• Retrospective analysis
• High availability
https://www.elastic.co/products/watcher

ELK
https://pbs.twimg.com/media/CCAkRqVXIAA9cDE.png

LOGSTASH
• Variety of inputs and outputs (165 plugins)
• 120 predefined patterns + custom log formats
• Flexible DSL to parse/normalize/enrich logs
• Implemented in Ruby, running on JRuby
https://www.elastic.co/products/logstash

SOME LOGSTASH INPUTS
https://www.elastic.co/guide/en/logstash/current/input-plugins.html
• file
• stdin
• syslog
• eventlog
• jdbc
• varnishlog
• websocket
• log4j
• jmx
• s3
• sqs
• rss
• redis
• rabbitmq
• zeromq
• kafka
• twitter
• elasticsearch
• github
• lumberjack

SOME LOGSTASH OUTPUTS
https://www.elastic.co/guide/en/logstash/current/output-plugins.html
• file
• stdout
• csv
• exec
• elasticsearch
• email
• nagios
• syslog
• redis
• loggly
• jira
• hipchat
• irc
• graphite
• http
• s3
• sqs
• sns
• rabbitmq
• zeromq

KIBANA
• Variety of charts: bar charts, line and scatter
plots, histograms, pie charts, maps
• Flexible and customizable UI, responsive design
• Slice and dice data to get necessary details
• Seamless integration with Elasticsearch
• Simple data export
https://www.elastic.co/products/kibana

DEMO #3
http://25.media.tumblr.com/tumblr_mbduvkuspZ1qe6vsbo1_400.jpg

ELASTICSEARCH DRAWBACKS
• No transaction support. Elasticsearch is not a
database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

PERFORMANCE?
http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
http://solr-vs-elasticsearch.com/
• Apache Solr can be faster than ES in search-only
scenarios while Elasticsearch usually outperforms
Solr when doing writes and reads concurrently
• Sphinx is faster at indexing (up to 15MB/s per
core)
• Performance issues can usually be fixed by horizontal scaling

SUMMARY
• ES is not a silver bullet, but it is a really powerful tool
• Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple to start with but can get complicated as you go
• Kick off easily, then hire a good DevOps
engineer for best results
• Ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your
product and your CV ;)
http://www.aperfectworld.org/clipart/gestures/rockhard11.png

QUESTIONS?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png

THANK YOU!
http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg

USEFUL LINKS
• Elasticsearch:
• Logstash: https://www.elastic.co/products/logstash
• Kibana: https://www.elastic.co/products/kibana
• Scripts for the demos:
https://github.com/opanchenko/morning-at-lohika-ELK
