Elastic pivorak


  1. 1. ELASTICSEARCH. MAKE YOUR SOFTWARE SMARTER! OLEKSIY PANCHENKO / #PIVORAK / 2015
  2. 2. MY NAME IS… Oleksiy Panchenko Software engineer, Lohika E-mail: oleksij@gmail.com Twitter: oleskiyp LinkedIn: https://ua.linkedin.com/in/opanchenko
  3. 3. AGENDA • Introduction. What is it all about? • Jump start Elastic. Demo time • Architecture and deployment. Why is Elasticsearch elastic? • Case studies. 4 real-life projects • Query API in depth + Demo • Using Elastic in Rails applications. Approaches and tools • Kinda summary • Q & A
  4. 4. [ ELASTIC MORNING @ LOHIKA ]
  5. 5. INTRODUCTION. WHAT IS IT ALL ABOUT?
  6. 6. HOW TO MAKE YOUR SITE SEARCHABLE? http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
  7. 7. • Google search • Why not use plain vanilla SQL? RDBMS rocks! select * from books join authors on … where … • Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian • Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …
  8. 8. WHO HAS EVER USED ELASTICSEARCH/SOLR/SPHINX? http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
  9. 9. LUCENE AS A CORE • Lucene = Low-level Java library (JAR) which implements search functionality • Lucene stores its index as a local binary file • Can be used in both web and standalone applications (desktop, mobile) • Implemented in Java, ports to other languages available • Initial version: 1999 • Apache project since 2001 • Latest stable release: 5.3.1 (September 24, 2015)
  10. 10. LUCENE AS A CORE • Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch) http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
  11. 11. MORE ABOUT SEARCH ENGINES Riak Search
  12. 12. TIME TO TALK ABOUT ELASTICSEARCH https://www.elastic.co/products/elasticsearch Near Real-Time Data (NRT) Full-Text Search Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete
  13. 13. https://www.elastic.co/products/elasticsearch High Availability Multitenancy Distributed, Horizontally Scalable
  14. 14. https://www.elastic.co/products/elasticsearch Document-Oriented Schema-Free Conflict Management Optimistic Concurrency Control
  15. 15. https://www.elastic.co/products/elasticsearch Apache 2 Open Source License Awesome documentation Large community Developer-Friendly, RESTful API Client libraries available for many programming languages and frameworks.
  16. 16. ELASTICSEARCH USERS https://www.elastic.co/use-cases https://en.wikipedia.org/wiki/Elasticsearch#Users
  17. 17. ELASTICSEARCH – PAST & PRESENT • 2004. Shay Banon (aka Kimchy) started working on Compass – distributed and scalable Java Search Engine on top of Lucene • 2010. Initial release of ES • Latest stable release: 1.7.2 (September 14, 2015) • 2.0 to be released in November • 500K downloads per • https://github.com/elastic/elasticsearch http://opensource.hk/sites/default/files/u1/shay-banon.jpg
  18. 18. ELASTICSEARCH AS A COMPANY • 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees • https://www.elastic.co/ • Product portfolio: – Elasticsearch, Logstash, Kibana (ELK stack) – Watcher – Shield – Marvel – es-hadoop – found
  19. 19. JUMP START ELASTIC. DEMO TIME
  20. 20. INSTALLATION & CONFIGURATION • Prerequisites: – JDK 6 or above (recommended: JDK 8) – RAM: min. 2 GB (recommended: 16–64 GB for production) – CPU: prefer more cores over a higher clock rate – Disks: SSD recommended • Homebrew, apt, yum: apt-get install elasticsearch • Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch • Installation is absolutely straightforward and easy:
  21. 21. LET’S TALK ABOUT TERMINOLOGY • Index ~ DB Schema • Type ~ DB Table • Document ~ Record, JSON object • Mapping ~ Schema definition in RDBMS
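A rough illustration of how these terms fit together in the 1.x REST API: a document is addressed by index, type and id. The sketch below only builds the request path and JSON body, it does not talk to a cluster. The index name `library`, the type `book` and the fields are made-up examples, not taken from the talk.

```python
import json

def document_request(index, doc_type, doc_id, source):
    """Return (method, path, body) for indexing one document in Elasticsearch 1.x."""
    path = "/{}/{}/{}".format(index, doc_type, doc_id)
    return "PUT", path, json.dumps(source, sort_keys=True)

# Hypothetical index, type and fields, for illustration only.
method, path, body = document_request(
    "library", "book", 1, {"title": "Search basics", "year": 2015}
)
print(method, path)
print(body)
```

The mapping for `book` would then describe the field types (`title` as a string, `year` as a number), much like a table definition in an RDBMS.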
  22. 22. DEMO #1 http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg
  23. 23. ARCHITECTURE AND DEPLOYMENT. WHY IS ELASTICSEARCH ELASTIC?
  24. 24. • Cluster: one or more nodes which share the same cluster name • Node: running instance of Elasticsearch which belongs to a cluster • Shard: a portion of data – single Lucene instance. Default: 5 shards in an index • Primary Shard: master copy of data • Replica Shard: exact copy of a primary shard. Default: 1 replica
  25. 25. SINGLE-NODE CLUSTER • Documents: { "id": "123", "name": "John", … } { "id": "124", "name": "Patricia", … } { "id": "125", "name": "Scott", … } • Hash Function: assigns each document to a shard • Node: holds shards 0, 1, 2, 3, 4
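The diagram boils down to one formula: shard = hash(routing value) mod number of primary shards, where the routing value defaults to the document id. A minimal sketch of that idea, assuming `zlib.crc32` as a stand-in for the hash function Elasticsearch really uses:

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # default number of primary shards per index in 1.x

def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard for a document id: hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in; Elasticsearch uses its own hash function.
    """
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards

for doc_id in ("123", "124", "125"):
    print(doc_id, "-> shard", shard_for(doc_id))
```

Because the result depends on the number of primary shards, changing that number would send existing ids to different shards. This is why the shard count is fixed when the index is created.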
  26. 26. TWO-NODE CLUSTER • Node 1: 0, 1, R2, 3, R4 • Node 2: R0, R1, 2, R3, 4 (Rn = replica of shard n) * Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)
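The rule behind this layout is that a replica never lives on the same node as its primary. The sketch below shows that rule with a toy round-robin placement. It is not Elasticsearch's actual allocator, so the exact layout it prints differs from the slide; the function name and node labels are illustrative.

```python
def allocate(num_shards, nodes, replicas=1):
    """Toy placement: every copy of a shard goes to a different node.

    Simplified round-robin, not Elasticsearch's real allocation logic.
    Copies that cannot get a node of their own are left unassigned.
    """
    layout = {node: [] for node in nodes}
    copies = min(replicas, len(nodes) - 1) + 1  # primary + placeable replicas
    for shard in range(num_shards):
        for copy in range(copies):
            node = nodes[(shard + copy) % len(nodes)]
            # Copy 0 is the primary, the rest are replicas ("R" prefix).
            layout[node].append(str(shard) if copy == 0 else "R{}".format(shard))
    return layout

print(allocate(5, ["Node 1", "Node 2"]))
print(allocate(5, ["Node 1"]))  # single node: primaries only
```

With the defaults (5 shards, 1 replica) two nodes end up holding 10 shard copies in total, and either node alone still has a full copy of the data.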
  27. 27. BENEFITS OF SHARDING • Take advantage of multi-core CPUs (one shard is a single Lucene instance; a node serves all of its shards from one JVM, on multiple threads) • Horizontal scalability. Dynamic rebalancing • Fault tolerance and cluster resilience • NB! The number of primary shards cannot be changed on the fly: changing it requires full reindexing • Max number of documents per shard: 2,147,483,519 – imposed by Lucene
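Where the per-shard limit comes from: it is Java's largest `int` minus 128, which is Lucene's cap on documents in one index. A short worked calculation, with `max_docs_per_index` as a helper name made up for this example:

```python
JAVA_INT_MAX = 2**31 - 1
MAX_DOCS_PER_SHARD = JAVA_INT_MAX - 128  # Lucene's per-index document limit

def max_docs_per_index(primary_shards):
    """Upper bound for one index; replicas hold copies, not extra documents."""
    return primary_shards * MAX_DOCS_PER_SHARD

print(MAX_DOCS_PER_SHARD)      # 2147483519
print(max_docs_per_index(5))   # 10737417595
```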
  28. 28. ELASTICSEARCH NODE TYPES • Data node: node.data = true • Master node: node.master = true • Communication client: http.enabled = true • TCP ports: 9200 (ext), 9300 (int) • A node can play 2 or 3 roles at the same time • Multicast discovery (true by default): discovery.zen.ping.multicast.enabled
  29. 29. DEPLOYMENT DIAGRAM
  30. 30. INDEXING A DOCUMENT https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
  31. 31. RETRIEVING A DOCUMENT https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html • In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either primary or replica shard
  32. 32. DISTRIBUTED SEARCH • Given search query, retrieve 10 most relevant results https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
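The query phase can be pictured like this: every shard reports its own top 10, and the coordinating node merges those short lists into the global top 10. A minimal sketch with made-up scores and document ids (the function name `query_phase` is chosen for this example):

```python
import heapq

def query_phase(shard_hits, size=10):
    """Merge per-shard top hits into the global top `size`.

    shard_hits: one list per shard of (score, doc_id) tuples.
    Each shard only needs to report its own top `size` hits.
    """
    per_shard_top = [heapq.nlargest(size, hits) for hits in shard_hits]
    merged = [hit for top in per_shard_top for hit in top]
    return heapq.nlargest(size, merged)

# Made-up scores for three shards.
shards = [
    [(0.9, "a1"), (0.2, "a2")],
    [(0.7, "b1"), (0.6, "b2")],
    [(0.8, "c1")],
]
print(query_phase(shards, size=3))
```

Only ids and scores travel in this phase. The documents themselves are fetched afterwards, and only for the winners.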
  33. 33. http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
  34. 34. CASE STUDIES. 4 REAL-LIFE PROJECTS http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
  35. 35. GENERAL INFO • 4 projects, ~2 years • RDBMS (MySQL, PostgreSQL) as a primary data storage • Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku) • 1 or 2 instances in a cluster • Data volume: Gigabytes; millions of documents • Back-end: Java, Ruby
  36. 36. #1. SOCIAL INFLUENCER MARKETING PLATFORM http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg
  37. 37. • Document types: Blog Posts, Bloggers (Influencers) • Elasticsearch usage: – search and rank Influencers by category, keywords, tags, location, audience, influence – search blog posts by keywords etc. • Amount of data: – Influencers: hundreds of thousands – Blog Posts: millions • ES cluster size: 2 instances • Technology stack: Java, MySQL, Dynamo
  38. 38. #2. JOB SITE http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg
  39. 39. • Document types: Job Postings, Jobseekers • Find relevant jobs – Simple one-click search – Advanced search (title, keywords, industry, location/distance, salary, requirements) • Elasticsearch as a Recommendation Engine Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc. • 2 types of recommendations: – Side banner (You also might be interested in…) – E-mail subscriptions every 2 weeks
  40. 40. SEARCH QUERIES • No fixed document structure (jobs from different providers) • Full-text search • Fuzzy search • Geolocation (distance) • Weighted search: Boosted search clauses • Dynamic scripting (MVEL until v1.4.0, then Groovy)
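A hedged sketch of what such a request body might look like in the 1.x query DSL, combining a fuzzy full-text match, a boosted clause and a distance filter. The field names (`title`, `description`, `location`), the boost value and the default radius are assumptions for illustration, not the project's real mapping.

```python
import json

def job_search_body(keywords, lat, lon, distance="25km"):
    """Build an Elasticsearch 1.x search body for a job search (illustrative fields)."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "bool": {
                        "should": [
                            # Title matches count three times as much and tolerate typos.
                            {"match": {"title": {"query": keywords,
                                                 "fuzziness": "AUTO",
                                                 "boost": 3}}},
                            {"match": {"description": {"query": keywords}}},
                        ]
                    }
                },
                # Geolocation: keep only jobs within `distance` of the given point.
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }

print(json.dumps(job_search_body("ruby developer", 49.84, 24.03), indent=2))
```

The `filtered` query is the 1.x form. From 2.0 on the same idea is written as a `bool` query with a `filter` clause.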
  41. 41. SOME MORE FACTS • Amount of data: – Job postings: ~1M – Applicants: ~20K • Cluster size: 2 ‘medium’ EC2 instances • Technology stack: – Ruby on Rails – Elasticsearch, PostgreSQL, Redis – Heroku + add-ons, AWS (S3, EC2) – Lots of 3rd party APIs and integrations
  42. 42. LESSONS LEARNED • On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku) • Dynamic scripting • PostgreSQL as a backup search engine sucks
  43. 43. #3. CAR TRADING http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png
  44. 44. PARSING ADS Price $3900
  45. 45. 1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG WAT??? • Fuzzy search (Levenshtein distance algorithm) used to parse ads and classify cars • Elasticsearch index contains a dictionary (Year, Make, Model, Trim) • Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370) • Algorithm approach: – Parse Year (1996) – Search most relevant Make (VW, volkswagon → Volkswagen) – Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996 – Search most relevant Trim (TDi 4dr Sedan) • Parsing quality: 90% https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
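To show the idea behind the Make step, here is the Levenshtein distance in plain Python together with a synonym lookup. In the project the fuzzy matching was done by Elasticsearch against its dictionary index; the tiny `MAKES` and `SYNONYMS` tables and the `max_distance` threshold below are assumptions made for this sketch.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep on match
            ))
        previous = current
    return previous[-1]

SYNONYMS = {"vw": "volkswagen", "chevy": "chevrolet"}  # illustrative
MAKES = ["Volkswagen", "Chevrolet", "Volvo"]           # illustrative

def closest_make(token, max_distance=2):
    """Return the dictionary make nearest to token, or None if nothing is close enough."""
    token = SYNONYMS.get(token.lower(), token.lower())
    best = min(MAKES, key=lambda make: levenshtein(token, make.lower()))
    return best if levenshtein(token, best.lower()) <= max_distance else None

print(closest_make("volkswagon"))  # Volkswagen
print(closest_make("VW"))          # Volkswagen
print(closest_make("Passat"))      # None
```

The same lookup would then be repeated for Model and Trim, each time narrowed by what has already been recognised.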
  46. 46. #4. [NDA] http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg
  47. 47. SOME UNCOVERED INFO • Check documents against duplicate content • Shingle analysis (commonly used by copywriters and SEO experts) – Source text: I have a dream that one day this nation will rise up and live… – Normalization: I have a dream that one day this nation will rise up and live… – Splitting a text into shingles (n-grams), n = 3..10: have dream that | dream that this | that this nation | this nation will | … – Replacement: Latin ‘c’ → Cyrillic ‘c’ https://en.wikipedia.org/wiki/W-shingling
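Splitting into shingles is easy to sketch. The snippet below produces word 3-grams from an already normalized text and compares two texts by the overlap of their shingle sets. The Jaccard-style `overlap` measure is an assumption taken from the general w-shingling technique; the talk does not say how the project scored duplicates.

```python
def shingles(text, n=3):
    """Split a normalized text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def overlap(text_a, text_b, n=3):
    """Share of shingles two texts have in common (Jaccard similarity, 0.0 to 1.0)."""
    a, b = set(shingles(text_a, n)), set(shingles(text_b, n))
    return len(a & b) / len(a | b) if a | b else 0.0

print(shingles("have dream that this nation will"))
print(overlap("have dream that this nation will",
              "have dream that this country will"))
```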
  48. 48. QUERY API IN DEPTH + DEMO
  49. 49. FILTERS VS. QUERIES As a general rule, filters should be used: • for binary yes/no searches • for queries on exact values Filters are much faster than queries Filters are usually great candidates for caching 27 Filters available (Elasticsearch 1.7.1) https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
  50. 50. QUERIES VS. FILTERS As a general rule, queries should be used instead of filters: • for full text search • where the result depends on a relevance score Common approach: Filter as many records as possible, then query them. 38 Queries available (Elasticsearch v 1.7.1) https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
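"Filter first, then query" might look like this in the 1.x DSL: exact-value and range conditions go into the filter part, the full-text condition into the query part. Field names and values are invented for the example.

```python
import json

def filtered_search(text, industry, min_salary):
    """Build a 1.x `filtered` query: yes/no filters plus a scored full-text query."""
    return {
        "query": {
            "filtered": {
                # Filters: exact values and ranges, no scoring, good cache candidates.
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"industry": industry}},
                            {"range": {"salary": {"gte": min_salary}}},
                        ]
                    }
                },
                # Query: full-text match, the only part that affects relevance.
                "query": {"match": {"description": text}},
            }
        }
    }

print(json.dumps(filtered_search("ruby developer", "it", 1000), indent=2))
```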
  51. 51. DEMO #2 http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg
  52. 52. SOME THEORY BEHIND RELEVANCE SCORING full AND text AND search AND (elasticsearch OR lucene) • Term Frequency: How often does the term appear in the document? • Inverse Document Frequency: How often does the term appear in all documents in the collection? • Field-length norm: How long is the field? https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
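The three factors can be written down directly, using the formulas from the scoring-theory page linked above. Multiplying them into a single `term_weight` is a simplification for illustration; Lucene's practical scoring function adds more factors on top.

```python
import math

def tf(term_freq):
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer across the index, higher weight."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm: shorter field, higher weight."""
    return 1 / math.sqrt(num_terms)

def term_weight(term_freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field (illustration only)."""
    return tf(term_freq) * idf(num_docs, doc_freq) * field_norm(num_terms)

# A rare term in a short title versus a common term in a long body.
print(term_weight(term_freq=1, num_docs=1000, doc_freq=9, num_terms=4))
print(term_weight(term_freq=1, num_docs=1000, doc_freq=499, num_terms=400))
```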
  53. 53. MORE COOL FEATURES • Indexing attachments: MS Office, ePub, PDF (Apache Tika) • Autocomplete suggestion: • Did-you-mean suggestion: • Highlight results:
  54. 54. SEARCH IMAGES https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/ https://github.com/kzwang/elasticsearch-image
  55. 55. USING ELASTIC IN RAILS APPLICATIONS. APPROACHES AND TOOLS
  56. 56. ELASTICSEARCH-RUBY • https://github.com/elastic/elasticsearch-ruby • Includes two packages: elasticsearch-transport + elasticsearch-api • Client for connecting to an Elasticsearch cluster • Ruby API for Elasticsearch's REST API • Various extensions and utilities
  57. 57. ELASTICSEARCH-RAILS • https://github.com/elastic/elasticsearch-rails • Includes three packages: elasticsearch-model + elasticsearch-persistence + elasticsearch-rails • ActiveModel integration with adapters for ActiveRecord and Mongoid • Enumerable-based wrapper for search results; ActiveRecord::Relation-based wrapper for returning search results as records • Support for Kaminari and WillPaginate pagination • Convenience methods for (re)creating the index, setting up mappings, indexing documents, …
  58. 58. MY WAY (RAILS 4 APP) Gemfile config/environments/production.rb
  59. 59. MY WAY (RAILS 4 APP) job.rb
  60. 60. MY WAY (RAILS 4 APP) job.rb
  61. 61. MY WAY (RAILS 4 APP) job.rb
  62. 62. ELASTICSEARCH SEARCH QUERY
  63. 63. MY WAY (RAILS 4 APP) job_helper.rb
  64. 64. MY WAY (RAILS 4 APP) job_helper.rb
  65. 65. MY WAY (RAILS 4 APP) elasticsearch.rake
  66. 66. KINDA SUMMARY
  67. 67. ELASTICSEARCH DRAWBACKS • No transaction support. Elasticsearch is not a database. • No joins, constraints and other RDBMS features • Durability and consistency issues, data loss: – https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0 – https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
  68. 68. PERFORMANCE? http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/ http://solr-vs-elasticsearch.com/ • Apache Solr can be faster than ES in search-only scenarios, while Elasticsearch usually outperforms Solr when doing writes and reads concurrently • Sphinx is faster at indexing (up to 15MB/s per core) • Performance issues can usually be fixed by horizontal scaling
  69. 69. SUMMARY • ES is not a silver bullet, but it is a really powerful tool • Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES • Elasticsearch is dead simple at the start but might get sophisticated later as you go • Kick off easily, then hire a good DevOps engineer for best results • Ecosystem around Elasticsearch is just amazing • Give it a try – it can bring a lot of value to your product and your CV ;) http://www.aperfectworld.org/clipart/gestures/rockhard11.png
  70. 70. QUESTIONS? http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
  71. 71. THANK YOU! http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg
  72. 72. USEFUL LINKS • Elasticsearch: https://www.elastic.co/products/elasticsearch • Extended presentation about Elasticsearch and its ecosystem: https://www.youtube.com/watch?v=GL7xC5kpb-c • Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK
