Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...

In the age of information and big data, the ability to quickly and easily find a needle in a haystack is extremely important. Elasticsearch is a distributed and scalable search engine that provides rich and flexible search capabilities. Social networks (Facebook, LinkedIn), media services (Netflix, SoundCloud), Q&A sites (StackOverflow, Quora, StackExchange) and even GitHub all find data for you using Elasticsearch. In conjunction with Logstash and Kibana, Elasticsearch becomes a powerful log engine that lets you process, store, analyze, search through and visualize your logs.
Video: https://www.youtube.com/watch?v=GL7xC5kpb-c
Scripts for the Demo: https://github.com/opanchenko/morning-at-lohika-ELK

  1. ELASTICSEARCH, LOGSTASH, KIBANA. COOL SEARCH, ANALYTICS, DATA MINING AND MORE… OLEKSIY PANCHENKO / LOHIKA / 2015
  2. MY NAME IS…
     Oleksiy Panchenko
     Software engineer, Lohika
     E-mail: oleksij@gmail.com
     Twitter: oleskiyp
     LinkedIn: https://ua.linkedin.com/in/opanchenko
  3. AGENDA
     • Introduction. What is it all about?
     • Jump start Elastic. Demo time
     • Architecture and deployment. Why is Elasticsearch elastic?
     • Case studies. 4 real-life projects
     • Query API in depth + Demo
     • Elasticsearch ecosystem. ELK Stack + Demo
     • Q & A
  4. INTRODUCTION. WHAT IS IT ALL ABOUT?
  5. HOW TO MAKE YOUR SITE SEARCHABLE?
     http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
  6. • Google search
     • Why not use plain vanilla SQL? RDBMS rocks!
       select * from books join authors on … where …
     • Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian
     • Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …
  7. WHO HAS EVER USED ELASTICSEARCH?
     http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
  8. LUCENE AS A CORE
     • Lucene = low-level Java library (JAR) which implements search functionality
     • Can be used in both web and standalone applications (desktop, mobile)
     • Lucene stores its index as a local binary file
     • Implemented in Java, ports to other languages available
     • Initial version: 1999
     • Apache project since 2001
     • Latest stable release: 5.2.1 (15 June 2015)
  9. LUCENE AS A CORE
     • Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch, …
     http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
  10. MORE ABOUT SEARCH ENGINES
      Riak Search
  11. TIME TO TALK ABOUT ELASTICSEARCH
      https://www.elastic.co/products/elasticsearch
      Near Real-Time Data (NRT)
      Full-Text Search
      Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete
  12. https://www.elastic.co/products/elasticsearch
      High Availability
      Multitenancy
      Distributed, Horizontally Scalable
  13. https://www.elastic.co/products/elasticsearch
      Document-Oriented
      Schema-Free
      Conflict Management
      Optimistic Concurrency Control
  14. https://www.elastic.co/products/elasticsearch
      Apache 2 Open Source License
      Awesome documentation
      Large community
      Developer-Friendly, RESTful API
      Client libraries available for many programming languages and frameworks.
  15. ELASTICSEARCH USERS
      https://www.elastic.co/use-cases
      https://en.wikipedia.org/wiki/Elasticsearch#Users
  16. ELASTICSEARCH – PAST & PRESENT
      • 2004. Shay Banon (aka Kimchy) started working on Compass, a Java search engine on top of Lucene
      • 2010. Initial release of Elasticsearch
      • Latest stable release: 1.7.1 (July 29, 2015)
      • 500K downloads per month
      • https://github.com/elastic/elasticsearch
      http://opensource.hk/sites/default/files/u1/shay-banon.jpg
  17. ELASTICSEARCH AS A COMPANY
      • 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees
      • https://www.elastic.co/
      • Product portfolio:
        – Elasticsearch, Logstash, Kibana (ELK stack)
        – Watcher
        – Shield
        – Marvel
        – es-hadoop
        – Found
  18. JUMP START ELASTIC. DEMO TIME
  19. INSTALLATION & CONFIGURATION
      • Prerequisites:
        – JDK 6 or above (recommended: JDK 8)
        – RAM: min. 2 GB (recommended: 16–64 GB for production)
        – CPU: prefer more cores over a higher clock rate
        – Disks: SSD recommended
      • Homebrew, apt, yum: apt-get install elasticsearch
      • Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch
      • Installation is absolutely straightforward and easy:
  20. LET’S TALK ABOUT TERMINOLOGY
      Index    ~ DB schema
      Type     ~ DB table
      Document ~ Record, JSON object
      Mapping  ~ Schema definition in RDBMS
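The analogy above maps onto the REST paths used by Elasticsearch 1.x, where a single document is addressed as /index/type/id. A minimal sketch, assuming a hypothetical `library` index with a `book` type; these names and the sample document are illustrative and do not come from the talk's demo scripts.

```python
import json


def document_path(index: str, doc_type: str, doc_id: str) -> str:
    """Build the Elasticsearch 1.x REST path for one document."""
    return f"/{index}/{doc_type}/{doc_id}"


# Hypothetical names: index ~ DB schema, type ~ DB table, document ~ record.
path = document_path("library", "book", "1")

# The document itself is a plain JSON object (hypothetical fields).
body = json.dumps({"title": "Sample book", "year": 2015})
```

Sending `body` with an HTTP PUT to `path` on a node's HTTP port would index the document; the mapping for `book` is then derived from the field types unless one was defined up front.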
  21. DEMO #1
      http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg
  22. http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
  23. ARCHITECTURE AND DEPLOYMENT. WHY IS ELASTICSEARCH ELASTIC?
  24. Cluster: One or more nodes which share the same cluster name
      Node: Running instance of Elasticsearch which belongs to a cluster
      Shard: A portion of data, a single Lucene instance. Default: 5 shards in an index
      Primary Shard: Master copy of data
      Replica Shard: Exact copy of a primary shard. Default: 1 replica
  25. SINGLE-NODE CLUSTER
      Shards: 0 1 2 3 4
      Hash Function*
      { "id": "123", "name": "john", … }
      { "id": "124", "name": "patricia", … }
      { "id": "125", "name": "scott", … }
      * Also consider custom routing
  26. TWO-NODE CLUSTER
      Node 1: 0 1 R2 3 R4
      Node 2: R0 R1 2 R3 4
      * Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)
  27. BENEFITS OF SHARDING
      • Take advantage of multi-core CPUs (each shard is a separate Lucene instance, so shards can be searched in parallel)
      • Horizontal scalability. Dynamic rebalancing
      • Fault tolerance and cluster resilience
      • NB! The number of shards cannot be changed on the fly; a full reindex is required
      • Max number of documents per shard: 2,147,483,519 (imposed by Lucene)
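The per-shard limit quoted above is Lucene's `IndexWriter.MAX_DOCS`, which is Java's `Integer.MAX_VALUE` minus 128. The sketch below reproduces that arithmetic and derives a rough lower bound on the number of primary shards for a given document count. The helper `min_primary_shards` is hypothetical, and real sizing depends on far more than this hard limit (heap, disk, query load).

```python
import math

JAVA_INT_MAX = 2**31 - 1                 # 2,147,483,647
MAX_DOCS_PER_SHARD = JAVA_INT_MAX - 128  # 2,147,483,519


def min_primary_shards(total_docs: int) -> int:
    """Smallest shard count that keeps every shard under the Lucene limit,
    assuming documents are spread evenly (an idealisation)."""
    return max(1, math.ceil(total_docs / MAX_DOCS_PER_SHARD))
```

Because the shard count is fixed at index creation, a bound like this has to be estimated before indexing starts.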
  28. CUSTOM ROUTING
      • Social network. Users, events
      • event_id: 17567654, 17567655, 17567656, …
        user_id: 10300, 10301, …
      • No Elasticsearch ID provided: ID will be auto-generated → events will be evenly distributed across the shards
      • Obvious approach: Elasticsearch ID = event_id → events will be evenly distributed across the shards
      • Elasticsearch ID = user_id → events which belong to the same user will be …
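Elasticsearch picks the target shard as `hash(routing) % number_of_primary_shards`, where the routing value defaults to the document ID. The sketch below contrasts routing by event ID with routing by user ID. It assumes `zlib.crc32` as a stand-in hash (Elasticsearch uses its own hash function) and a made-up list of events, so the concrete shard numbers are illustrative only.

```python
import zlib
from collections import defaultdict

NUM_PRIMARY_SHARDS = 5  # the default shard count mentioned earlier in the deck


def shard_for(routing: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash, not the one Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


# Hypothetical events as (event_id, user_id) pairs.
events = [(17567654, 10300), (17567655, 10301),
          (17567656, 10300), (17567657, 10300)]

by_event = defaultdict(list)  # routing value = event_id
by_user = defaultdict(list)   # routing value = user_id
for event_id, user_id in events:
    by_event[shard_for(str(event_id))].append(event_id)
    by_user[shard_for(str(user_id))].append(event_id)
```

With `user_id` as the routing value, every event of user 10300 hashes to the same shard, so a per-user query only needs to touch that one shard. The trade-off is that very active users can make shards uneven.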
  29. ELASTICSEARCH NODE TYPES
      • Data node: node.data = true
      • Master node: node.master = true
      • Communication client: http.enabled = true
      • TCP ports 9200 (ext), 9300 (int)
      • A node can play 2 or 3 roles at the same time
      • Multicast discovery (true by default): discovery.zen.ping.multicast.enabled
  30. DEPLOYMENT DIAGRAM
  31. INDEXING A DOCUMENT
      https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
  32. RETRIEVING A DOCUMENT
      https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html
      • In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either a primary or a replica shard
  33. DISTRIBUTED SEARCH
      • Given a search query, retrieve the 10 most relevant results
      https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
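The query phase referenced above is a scatter/gather: each shard produces its own top-k list, and the coordinating node merges those lists into the global top-k. A minimal sketch of that merge, with hypothetical pre-scored hits standing in for real shard results; function names are illustrative.

```python
import heapq


def shard_top_k(hits, k):
    """Query phase on one shard: keep only the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, hits)


def distributed_search(shards, k=10):
    """Coordinating node: merge per-shard top-k lists into a global top-k."""
    candidates = []
    for hits in shards:
        candidates.extend(shard_top_k(hits, k))
    return heapq.nlargest(k, candidates)


# Hypothetical scored hits, one list per shard.
shards = [
    [(0.9, "a"), (0.2, "b")],
    [(0.7, "c"), (0.8, "d")],
    [(0.1, "e")],
]
top = distributed_search(shards, k=3)
```

Each shard has to return k candidates because, in the worst case, all of the globally best documents live on a single shard.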
  34. CASE STUDIES. 4 REAL-LIFE PROJECTS
      http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
  35. GENERAL INFO
      • 4 projects, ~2 years
      • RDBMS (MySQL, PostgreSQL) as the primary data storage
      • Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku)
      • 1 or 2 instances in a cluster
      • Data volume: gigabytes; millions of documents
      • Back-end: Java, Ruby
  36. #1. SOCIAL INFLUENCER MARKETING PLATFORM
      http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg
  37. • Document types: Blog Posts, Bloggers (Influencers)
      • Elasticsearch usage:
        – search and rank Influencers by category, keywords, tags, location, audience, influence
        – search blog posts by keywords etc.
      • Amount of data:
        – Influencers: hundreds of thousands
        – Blog Posts: millions
      • ES cluster size: 2 instances
      • Technology stack: Java, MySQL, Dynamo
  38. #2. JOB SITE
      http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg
  39. • Document types: Job Postings, Jobseekers
      • Find relevant jobs
        – Simple one-click search
        – Advanced search (title, keywords, industry, location/distance, salary, requirements)
      • Elasticsearch as a Recommendation Engine
        Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc.
      • 2 types of recommendations:
        – Side banner (You also might be interested in…)
        – E-mail subscriptions every 2 weeks
  40. SEARCH QUERIES
      • No fixed document structure (jobs from different providers)
      • Full-text search
      • Fuzzy search
      • Geolocation (distance)
      • Weighted search: boosted search clauses
      • Dynamic scripting (MVEL until v1.4.0, then Groovy)
  41. SOME MORE FACTS
      • Amount of data:
        – Job postings: ~1M
        – Applicants: ~20K
      • Cluster size: 2 ‘medium’ EC2 instances
      • Technology stack:
        – Ruby on Rails
        – Elasticsearch, PostgreSQL, Redis
        – Heroku + add-ons, AWS (S3, EC2)
        – Lots of 3rd party APIs and integrations
  42. IMPLEMENTATION (RUBY)
      • A Model is an ActiveRecord (Ruby on Rails ORM)
      • ActiveRecord can persist itself to the database
      • ActiveRecord::Callbacks:
        – after_commit :index_document, on: [:create, :update]
        – after_commit :delete_document, on: :destroy
        – after_create …
        – after_save …
        – after_destroy …
      • Rake tasks to drop/recreate index, reindex documents
      • Zero-downtime reindexing using aliases
      • Ruby/Rails client: https://github.com/elastic/elasticsearch-rails
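Zero-downtime reindexing with aliases works because the application only ever talks to an alias, and one `_aliases` request can move that alias from the old index to the freshly built one atomically. The sketch below builds that request body. It is written in Python rather than Ruby purely for illustration, and the index and alias names (`jobs`, `jobs_v1`, `jobs_v2`) are hypothetical, not taken from the project.

```python
import json


def alias_swap_body(alias: str, old_index: str, new_index: str) -> str:
    """Request body for POST /_aliases that repoints `alias` in one step."""
    return json.dumps({
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })


# Hypothetical names: the app queries "jobs"; jobs_v2 was reindexed offline.
body = alias_swap_body("jobs", "jobs_v1", "jobs_v2")
```

Both actions are applied together, so searches against the alias never see a moment where it points at no index.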
  43. LESSONS LEARNED
      • On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku)
      • Dynamic scripting
      • PostgreSQL as a backup search engine sucks
  44. #3. CAR TRADING
      http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png
  45. PARSING ADS
      Price: $3900
  46. 1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
      WAT???
      • Fuzzy search (Levenshtein distance algorithm) used to parse ads and classify cars
      • Elasticsearch index contains a dictionary (Year, Make, Model, Trim)
      • Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
      • Algorithm approach:
        – Parse Year (1996)
        – Search most relevant Make (VW, volkswagon → Volkswagen)
        – Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996
        – Search most relevant Trim (TDi 4dr Sedan)
      • Parsing quality: 90%
      https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
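The "search most relevant Make" step can be sketched outside Elasticsearch with a plain Levenshtein distance. This is not the project's code: the dictionary `MAKES`, the distance threshold and the function names are assumptions, and only the two synonym pairs come from the slide. Elasticsearch's fuzzy query does the equivalent matching against the index.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


MAKES = ["Volkswagen", "Volvo", "Chevrolet"]           # hypothetical dictionary
SYNONYMS = {"vw": "Volkswagen", "chevy": "Chevrolet"}  # pairs from the slide


def closest_make(token: str, max_distance: int = 2):
    """Resolve a raw ad token to a known make, or None if nothing is close."""
    token = token.lower()
    if token in SYNONYMS:
        return SYNONYMS[token]
    best = min(MAKES, key=lambda make: levenshtein(token, make.lower()))
    return best if levenshtein(token, best.lower()) <= max_distance else None
```

Here `closest_make("volkswagon")` resolves to `Volkswagen` (one substitution away), while a model name such as `passat` is rejected as a make because it is too far from every dictionary entry.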
  47. #4. [NDA]
      http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg
  48. SOME UNCOVERED INFO
      • Check documents against duplicate content
      • Shingle analysis (commonly used by copywriters and SEO experts)
        – Source text: I have a dream that one day this nation will rise up and live…
        – Normalization: I have a dream that one day this nation will rise up and live…
        – Splitting a text into shingles (n-grams), n = 3..10: have dream that | dream that this | that this nation | this nation will | …
        – Replacement: Latin ‘c’ → Cyrillic ‘с’
      https://en.wikipedia.org/wiki/W-shingling
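The shingling steps above can be sketched as follows. The stop-word list is an assumption chosen so that the output matches the shingles shown on the slide (the slide's own normalization rules are not recoverable from the transcript), and `resemblance` is the Jaccard-style ratio from the W-shingling article, not necessarily what the project used.

```python
import re

# Assumed stop words; picked to reproduce the slide's example shingles.
STOPWORDS = {"i", "a", "one", "day", "up", "and"}


def normalize(text: str) -> list:
    """Lower-case, keep alphabetic words only, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def shingles(tokens: list, n: int = 3) -> set:
    """All runs of n consecutive tokens, as a set of strings."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(a: str, b: str, n: int = 3) -> float:
    """Share of shingles two texts have in common (1.0 = identical sets)."""
    sa, sb = shingles(normalize(a), n), shingles(normalize(b), n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

A high `resemblance` between a new document and an indexed one flags likely duplicate content even when individual words were reordered or replaced.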
  49. QUERY API IN DEPTH + DEMO
  50. FILTERS VS. QUERIES
      As a general rule, filters should be used:
      • for binary yes/no searches
      • for queries on exact values
      Filters are much faster than queries
      Filters are usually great candidates for caching
      27 filters available (Elasticsearch 1.7.1)
      https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
  51. QUERIES VS. FILTERS
      As a general rule, queries should be used instead of filters:
      • for full-text search
      • where the result depends on a relevance score
      Common approach: filter as many records as possible, then query them.
      38 queries available (Elasticsearch 1.7.1)
      https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
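In Elasticsearch 1.x the "filter first, then query" approach is expressed with a `filtered` query, which wraps a filter and a scoring query together. A minimal sketch of such a request body: the field names (`description`, `industry`) and the helper function are hypothetical. The `filtered` query was deprecated in Elasticsearch 2.0, where a `bool` query with a `filter` clause plays the same role.

```python
def filtered_query(full_text: str, exact_filters: dict) -> dict:
    """Build an ES 1.x `filtered` query: filters narrow the candidate set
    without scoring, and the `match` query scores what is left."""
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [{"term": {field: value}}
                                 for field, value in exact_filters.items()]
                    }
                },
                "query": {"match": {"description": full_text}},
            }
        }
    }


# Hypothetical job search: exact industry, free-text description.
request_body = filtered_query("java developer", {"industry": "it"})
```

The exact-value `term` filters are yes/no checks that can be cached, while relevance scoring is only paid for on the documents that pass them.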
  52. DEMO #2
      http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg
  53. SOME THEORY BEHIND RELEVANCE SCORING
      full AND text AND search AND (elasticsearch OR lucene)
      • Term Frequency: How often does the term appear in the document?
      • Inverse Document Frequency: How often does the term appear in all documents in the collection?
      • Field-length norm: How long is the field?
      https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
      http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
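The three factors have simple closed forms in Lucene's classic TF/IDF similarity, as described in the scoring-theory guide linked above. The sketch below writes them out. `term_weight` is a deliberate simplification for illustration (a plain product of the three factors); Lucene's full scoring function adds further terms such as query normalization and boosts.

```python
import math


def tf(term_freq: int) -> float:
    """Term frequency: more occurrences in the field means a higher weight."""
    return math.sqrt(term_freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: terms found in many documents weigh less."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: a match in a short field weighs more."""
    return 1 / math.sqrt(num_terms)


def term_weight(term_freq: int, num_docs: int,
                doc_freq: int, num_terms: int) -> float:
    """Simplified weight of one term in one document (illustration only)."""
    return tf(term_freq) * idf(num_docs, doc_freq) * field_norm(num_terms)
```

For the query on the slide, a rare term such as `lucene` would get a higher `idf` than a common one such as `text`, so documents matching the rare term tend to rank higher.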
  54. MORE COOL FEATURES
      • Indexing attachments: MS Office, ePub, PDF (Apache Tika)
      • Autocomplete suggestion:
      • Did-you-mean suggestion:
      • Highlight results:
  55. SEARCH IMAGES
      https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/
      https://github.com/kzwang/elasticsearch-image
  56. http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
  57. ELASTICSEARCH ECOSYSTEM. ELK STACK + DEMO
  58. CLIENTS
      http://blog.euranova.eu/wp-content/uploads/2014/04/programming-languages.png
  59. • Java: 1 native client + 1 community supported
      • Python: 1 official + 7 community supported
      • Ruby: 1 official + 7 community supported
      • JavaScript: 1 official + 4
      • PHP: 1 official + 4
      • C#/.NET: 1 official + 2
      • Scala: 4
      • Groovy (1), Haskell (1), Perl (1), Clojure (1), Go (3), R (2), Erlang (3), OCaml (2), Smalltalk (1), ColdFusion (1), C++ (1)
      • Command Line (2)
      https://www.elastic.co/guide/en/elasticsearch/client/community/current/clients.html
  60. INTEGRATIONS
      • Django
      • Ruby on Rails
      • Spring, Spring Data
      • Node.js
      • Symfony, Drupal, WordPress
      • Grails
      • Play! Framework
      https://www.elastic.co/guide/en/elasticsearch/client/community/current/integrations.html
  61. FRONT ENDS
      http://php.archive.razorflow.com/assets/img/header_v1.png
  62. ELASTICSEARCH-HEAD
      http://mobz.github.io/elasticsearch-head/
  63. ESCLIENT
      https://github.com/rdpatil4/ESClient
  64. AVAILABLE FRONT ENDS
      https://www.elastic.co/guide/en/elasticsearch/client/community/current/front-ends.html
      • elasticsearch-head: A web front end for an Elasticsearch cluster.
      • browser: Web front-end over Elasticsearch data.
      • Inquisitor: Front-end to help debug/diagnose queries and analyzers
      • Hammer: Web front-end for Elasticsearch
      • Calaca: Simple search client for Elasticsearch
      • ESClient: Simple search, update, delete client for Elasticsearch
  65. HEALTH AND PERFORMANCE
      http://www.transcend-marketing.co.uk/wp-content/uploads/2014/09/health-check2.png
  66. ELASTICSEARCH-HEAD
      https://github.com/mobz/elasticsearch-head
  67. BIGDESK
      https://github.com/lukas-vlcek/bigdesk
  68. WHATSON
      https://github.com/xyu/elasticsearch-whatson
  69. ELASTICOCEAN
      https://itunes.apple.com/us/app/elasticocean/id955278030
  70. HEALTH AND PERFORMANCE
      https://www.elastic.co/guide/en/elasticsearch/client/community/current/health.html
      • bigdesk: Live charts and statistics for an Elasticsearch cluster.
      • Kopf: Live cluster health and shard allocation monitoring with administration toolset.
      • paramedic: Live charts with cluster stats and indices/shards information.
      • ElasticsearchHQ: Free cluster health monitoring tool
      • SPM for Elasticsearch: Performance monitoring with live charts showing cluster and node stats, integrated alerts, email reports, etc.
      • check-es: Nagios/Shinken plugins for checking on Elasticsearch
      • check_elasticsearch: An Elasticsearch availability and performance monitoring plugin for Nagios.
      • opsview-elasticsearch: Opsview plugin written in Perl for monitoring Elasticsearch
      • SegmentSpy: Plugin to watch Lucene segment merges across your cluster
      • es2graphite: Send cluster and indices stats and status to Graphite for monitoring and graphing.
      • Scout: Provides plugins for monitoring Elasticsearch nodes, clusters, and indices.
      • ElasticOcean: Elasticsearch & DigitalOcean iOS Real-Time Monitoring
  71. 10 ES METRICS TO WATCH
      http://radar.oreilly.com/2015/04/10-elasticsearch-metrics-to-watch.html
      1. Cluster health: nodes and shards
      2. Node performance: CPU
      3. Node performance: memory usage
      4. Node performance: disk I/O
      5. Java: heap usage and garbage collection
      6. Java: JVM pool size
      7. Search performance: request latency and request rate
      8. Search performance: filter cache
      9. Search performance: field data cache
      10. Indexing performance: refresh times and merge times
  72. RIVERS (DEPRECATED IN 1.5.0)
      http://acuate.typepad.com/.a/6a0120a5e84a91970c01539381efff970b-pi
  73. • JDBC River Plugin, CSV River Plugin
      • MongoDB, CouchDB, Solr, Redis, Neo4j, DynamoDB, RethinkDB, Hazelcast, …
      • JMS, RabbitMQ, ActiveMQ, Amazon SQS, Kafka, …
      • Twitter, Wikipedia, Git, GitHub, Subversion, RSS, …
      • FileSystem, Dropbox, Google Drive, Amazon S3, …
      • IMAP/POP3, Web, LDAP
      https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#river
  74. OTHER PLUGINS
      https://d2wucpkmh57zie.cloudfront.net/wp-content/uploads/2015/04/plugins-together.jpg
  75. • Internationalization, normalization, analysis, language support (Chinese, Japanese, Khmer, Thai etc.), transliteration etc.
      • Discovery plugins: Amazon AWS, MS Azure, Google GCE, ZooKeeper
      • Transport plugins: allow using the Elasticsearch REST API over Servlet, ZeroMQ, Jetty, Redis, Memcached
      • Scripting in Elasticsearch queries: Groovy, JavaScript, Python, Clojure, SQL (!)
      • Front-ends (CRUD operations) & data visualization
      • Snapshot/Restore Repository: HDFS, AWS S3, GridFS
      • Misc: Attachments handling (uses Apache Tika), …
      https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html
  76. ELASTICSEARCH PRODUCT PORTFOLIO
      http://blog.archisnapper.com/wp-content/uploads/architecture-portfolio.jpg
  77. FOUND ($)
      • Elasticsearch as a service
      • Starts from $45/mo (1GB RAM, 8GB SSD, 1 data center)
      • No deployment and maintenance overhead
      https://www.elastic.co/products/found
  78. SHIELD ($)
      • Authentication
      • Authorization: RBAC
      • Encrypted communication, IP filtering
      • Audit logging
      • Other approaches:
        – Jetty instead of embedded server
        – Nginx as a front-end
      https://www.elastic.co/products/shield
  79. MARVEL ($)
      • Elasticsearch cluster health check, monitoring, performance
      • Real-time and historical analysis
      • Customizable dashboards
      https://www.elastic.co/products/marvel
  80. WATCHER
      • Alerts about anomalies in data
      • Proactive monitoring of ES cluster (in conjunction with Marvel)
      • Many notification channels: e-mail, SMS, webhooks
      • Retrospective analysis
      • High availability
      https://www.elastic.co/products/watcher
  81. ELK
      https://pbs.twimg.com/media/CCAkRqVXIAA9cDE.png
  82. LOGSTASH + ELASTIC + KIBANA
  83. LOGSTASH ADVANCED
  84. LOGSTASH
      • Variety of inputs and outputs (165 plugins)
      • 120 predefined patterns + custom log formats
      • Flexible DSL to parse/normalize/enrich logs
      • Implemented in Ruby, running on JRuby
      https://www.elastic.co/products/logstash
  85. SOME LOGSTASH INPUTS
      https://www.elastic.co/guide/en/logstash/current/input-plugins.html
      • file • stdin • syslog • eventlog • jdbc • varnishlog • websocket • log4j • jmx • s3 • sqs • rss • redis • rabbitmq • zeromq • kafka • twitter • elasticsearch • github • lumberjack
  86. SOME LOGSTASH OUTPUTS
      https://www.elastic.co/guide/en/logstash/current/output-plugins.html
      • file • stdout • csv • exec • elasticsearch • email • nagios • syslog • redis • loggly • jira • hipchat • irc • graphite • http • s3 • sqs • sns • rabbitmq • zeromq
  87. KIBANA
      • Variety of charts: bar charts, line and scatter plots, histograms, pie charts, maps
      • Flexible and customizable UI, responsive design
      • Slice and dice data to get necessary details
      • Seamless integration with Elasticsearch
      • Simple data export
      https://www.elastic.co/products/kibana
  88. DEMO #3
      http://25.media.tumblr.com/tumblr_mbduvkuspZ1qe6vsbo1_400.jpg
  89. ELASTICSEARCH DRAWBACKS
      • No transaction support. Elasticsearch is not a database.
      • No joins, constraints and other RDBMS features
      • Durability and consistency issues, data loss:
        – https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
        – https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
  90. PERFORMANCE?
      http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
      http://solr-vs-elasticsearch.com/
      • Apache Solr can be faster than ES in search-only scenarios, while Elasticsearch usually outperforms Solr when doing writes and reads concurrently
      • Sphinx is faster at indexing (up to 15MB/s per core)
      • Performance issues can usually be fixed by horizontal scaling
  91. SUMMARY
      • ES is not a silver bullet, but it is a really, really powerful tool
      • Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES
      • Elasticsearch is dead simple at the start but can get sophisticated as you go
      • Kick off easily, then hire a good DevOps engineer for best results
      • The ecosystem around Elasticsearch is just amazing
      • Give it a try: it can bring a lot of value to your product and your CV ;)
      http://www.aperfectworld.org/clipart/gestures/rockhard11.png
  92. QUESTIONS?
      http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
  93. THANK YOU!
      http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg
  94. USEFUL LINKS
      • Elasticsearch: https://www.elastic.co/products/elasticsearch
      • Logstash: https://www.elastic.co/products/logstash
      • Kibana: https://www.elastic.co/products/kibana
      • Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK