Elasticsearch quick Intro (English)

Elasticsearch: what is it? How can I use it in my stack? I will explain how to set up a working environment with Elasticsearch. The slides are in English.

  1. 1. Elasticsearch. Federico Panini, CTO @ fazland.com. Email: federico.panini@fazland.com LinkedIn: https://uk.linkedin.com/in/federicopanini Slides: http://www.slideshare.net/FedericoPanini
  2. 2. What is Elasticsearch? A full-text search engine. “A search engine is an automated system which, upon request, analyses a set of data and returns an index of its content, ranked by relevance to a search key using mathematical and statistical algorithms.”
  3. 3. What’s Elasticsearch? A full-text search engine.
  4. 4. What’s Elasticsearch? A full-text search engine. “It’s distributed, scalable, and highly available real-time search and analytics software.”
  5. 5. What’s Elasticsearch? Features: real-time data; real-time data analysis; distributed system; high availability; full-text searches; document-oriented DB; schemaless DB; RESTful API; per-operation persistence; open source; based on Apache Lucene; optimistic version control.
  6. 6. Apache Lucene #1: It’s the heart of Elasticsearch. Lucene is the search engine inside Elasticsearch.
  7. 7. Apache Lucene #1: It’s written in Java. It’s an Apache Software Foundation project, so it is open source!
  8. 8. What Elasticsearch adds on top of Lucene: full-text searches, horizontal scaling, high availability, ease of use, near real time.
  9. 9. Architecture: requirements - CPU. Elasticsearch doesn’t need a lot of CPU. The advice is to use the latest CPU model available. In general it is good practice to use machines with 2 to 8 cores.
  10. 10. Architecture: requirements - Disk. Disk I/O is really important for every cluster. Please use SSDs.
  11. 11. Architecture: requirements - HD - bonus slide. Pay attention to where data is stored and, above all, how. The word to remember is scheduler. On *nix systems the I/O scheduler decides when data is written to disk and with which priority. Most Unix-like systems set up cfq by default, a scheduler optimised for rotating disks. The advice is to use SSDs and to configure the OS to use “noop” or “deadline”, schedulers better suited to SSDs. With the right scheduler, write throughput can improve by as much as 500x!
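A minimal Python sketch for checking which I/O scheduler is active. It assumes a Linux host, where the kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets. The device name sda and both helper names are hypothetical.

```python
from pathlib import Path


def active_scheduler(scheduler_line: str) -> str:
    """Return the scheduler the kernel marks as active with [brackets]."""
    for name in scheduler_line.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    raise ValueError(f"no active scheduler in {scheduler_line!r}")


def read_scheduler(device: str = "sda") -> str:
    """Read the active scheduler of a block device (Linux only, assumed path)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Example content of the scheduler file on a system that still uses cfq.
print(active_scheduler("noop deadline [cfq]"))
```

Calling read_scheduler() on each data disk of a node is a quick way to spot a machine that was left on the default scheduler.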
  12. 12. Architecture: Operating Systems. Elasticsearch is written in Java, so it’s a multiplatform solution. Use the latest JDK available.
  13. 13. Architecture: requirements - RAM. Elasticsearch is hungry for RAM! https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
  14. 14. Architecture: memory. A machine with 64GB of RAM is fine; more is not needed. Give the Java heap no more than 32GB of RAM. Use more than one machine for Elasticsearch so that the cluster can be set up correctly.
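The memory advice can be turned into a small rule of thumb. This sketch assumes the common guideline of giving half of the machine's RAM to the heap (from the heap-sizing guide linked above, not from the slide) and applies the 32GB ceiling from the slide. The function name is hypothetical.

```python
def suggested_heap_gb(ram_gb: float, heap_cap_gb: float = 32.0) -> float:
    """Half of the machine's RAM, never more than heap_cap_gb."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, heap_cap_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", suggested_heap_gb(ram), "GB heap")
```

The 128GB case shows why bigger machines bring little benefit to a single node: the heap stays at the cap, and the rest is left to the operating system's file cache.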
  15. 15. Architecture: Installation. curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip && unzip elasticsearch-$VERSION.zip && cd elasticsearch-$VERSION Packages are also available for many distributions (Debian, RPM), along with Puppet and Chef modules.
  16. 16. Java based: Elasticsearch has been developed in Java. Robust, scalable, multiplatform.
  17. 17. Talking to Elasticsearch: clients Java #1. There are 2 clients available in Java. Node client: the client joins the cluster as a non-data node, which means it knows exactly where the data lives and on which node of the cluster.
  18. 18. Talking to Elasticsearch: clients Java #2. There are 2 clients available in Java. Transport client: a lightweight client used to communicate with a remote cluster.
  19. 19. Talking to Elasticsearch: clients Java #2. Both Java clients talk to the cluster on port 9300, the same port the cluster nodes use to talk to each other.
  20. 20. Talking to Elasticsearch: client API RESTful. All programming languages other than Java can talk to the Elasticsearch cluster through its REST API, available on port 9200. There are official clients for many programming languages: Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby.
  21. 21. Elastic: Document oriented NoSQL. Elasticsearch is a document-oriented database, and it is schema-less. Documents are indexed as soon as they are inserted into Elasticsearch.
  22. 22. Elastic: Document oriented JSON. Elasticsearch uses JSON as the interchange format between the server and the API clients.
  23. 23. Elastic: glossary. Cluster, nodes, indexes, shards, replicas, segments, in-memory buffers, translog.
  24. 24. Elastic: cluster. A cluster is a set of one or more nodes that share the same cluster.name property. The cluster balances the load across its nodes. Nodes can be added to or removed from the cluster, and the cluster reorganises itself accordingly.
  25. 25. Elastic: cluster. Inside a cluster one node is elected as Master. The Master node is responsible for cluster-wide operations such as creating or removing indexes and adding or removing nodes. Any node can be elected as Master.
  26. 26. Elastic: nodes. A node is a single running instance of Elasticsearch, the smallest element of a cluster.
  27. 27. Elastic: Index. RDBMS vs Elasticsearch: a DATABASE corresponds to an INDEX.
  28. 28. Elastic: Type. RDBMS vs Elasticsearch: a TABLE corresponds to a TYPE.
  29. 29. Elastic: Document. RDBMS vs Elasticsearch: a ROW corresponds to a DOCUMENT.
  30. 30. Elastic: Fields. RDBMS vs Elasticsearch: COLUMNS correspond to FIELDS.
  31. 31. Elastic: shards. To start indexing data in Elasticsearch we need to create an index. An index is only a logical definition: a pointer to one or more elements called SHARDS.
  32. 32. Elastic: shards. The shard is the low-level element of Elasticsearch and contains a subset of all the data in an index. A shard is in fact a single instance of Apache Lucene.
  33. 33. Elastic: Replica shards. Replica shards are copies of primary shards that protect our data from hardware failures. They are used in exactly the same way as the shards they copy.
  34. 34. Elastic: shards immutability. The number of primary shards of an index is defined at index creation time and is IMMUTABLE.
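One way to see why the shard count cannot change: a document is routed to a shard with a formula of the form shard = hash(routing) % number_of_primary_shards, so a different shard count would send lookups to the wrong shard. The sketch below illustrates this. zlib.crc32 is only a stand-in for the hash Elasticsearch really uses, and the function name is hypothetical.

```python
import zlib


def shard_for(doc_id: str, number_of_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash, used for illustration only.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]
moved = sum(1 for d in doc_ids if shard_for(d, 3) != shard_for(d, 4))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Going from 3 to 4 shards changes the target shard for most documents, which is why growing an index means creating a new one and reindexing into it.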
  35. 35. Elastic: shards immutability. curl -XPUT http://localhost:9200/blogs -d '{ "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 } }'
  36. 36. Elastic: shards immutability. curl http://localhost:9200/_cluster/health Response: { "cluster_name": "elasticsearch", "status": "yellow", "timed_out": false, "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 3, "active_shards": 3, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 3 }
  37. 37. Elastic: shards immutability. On a single-node cluster replica shards are useless: a copy stored on the same node as its primary gives no redundancy, so the replicas stay unassigned (hence the yellow status in the health output). To make replica shards useful we need at least 2 nodes.
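A simplified model of the health output above, assuming only one rule: two copies of the same shard never live on the same node. The function name is hypothetical, and the model ignores the red status, relocation and initialisation.

```python
def cluster_health(nodes: int, primaries: int, replicas: int) -> dict:
    """Simplified model: copies of one shard never share a node."""
    if nodes < 1 or primaries < 1 or replicas < 0:
        raise ValueError("invalid cluster description")
    copies_wanted = replicas + 1               # the primary plus its replicas
    copies_placed = min(copies_wanted, nodes)  # at most one copy per node
    unassigned = primaries * (copies_wanted - copies_placed)
    return {
        "status": "green" if unassigned == 0 else "yellow",
        "active_primary_shards": primaries,
        "active_shards": primaries * copies_placed,
        "unassigned_shards": unassigned,
    }


print(cluster_health(nodes=1, primaries=3, replicas=1))
print(cluster_health(nodes=2, primaries=3, replicas=1))
```

With one node the three replicas have nowhere to go and the status is yellow. Adding a second node lets all six shard copies be placed, and the status turns green.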
  38. 38. Elastic: BONUS: manage data conflicts #1
  39. 39. Elastic: BONUS: manage conflicts #2: Pessimistic Concurrency Control. Used in standard RDBMSs. This approach assumes that conflicts happen frequently, so the RDBMS locks the resource to avoid them. A process locks the row before reading it, so the RDBMS can be sure that only the process holding the lock can read and then modify that row. At the end of its operation (update/delete) the process releases the LOCK.
  40. 40. Elastic: BONUS: manage conflicts #3: Optimistic Concurrency Control. Elasticsearch uses OCC. This approach assumes that conflicts are infrequent. The database does not lock the resource when it is accessed. The responsibility is given to the application: if the data changes between a read and a write, the update fails. In that case you need to fetch the fresh data again and retry the update.
  41. 41. Elastic: BONUS: manage conflicts #4: Optimistic Concurrency Control. Elasticsearch is a distributed, concurrent and asynchronous system. When a document is created, updated or deleted, the change must be replicated across the whole cluster. The commands are sent to the nodes in parallel, so some of them may reach their destination node already out of date.
  42. 42. Elastic: BONUS: manage conflicts #5: Optimistic Concurrency Control. We need a way to detect that the entry we’re trying to update has already been updated by another process.
  43. 43. Elastic: BONUS: manage conflicts #6: Optimistic Concurrency Control. VERSIONING
  44. 44. Elastic: BONUS: manage conflicts #7: Optimistic Concurrency Control. In Elasticsearch every document has a field named _version. This system field is incremented every time an operation (update/delete) is applied to the document. This way an update based on _version 3 will never be applied to a document whose _version is already 4.
  45. 45. Elastic: BONUS: manage conflicts #8: Optimistic Concurrency Control. This approach moves all the responsibility from the database to the application! WE are responsible for not creating conflicts on a document or an index. If we want to be sure not to lose data, we need to implement writes using versioning!
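The read, write and retry cycle of optimistic concurrency control can be sketched with a toy in-memory store. This is not the Elasticsearch client API. VersionedStore and VersionConflict are hypothetical names, and the version counter only mimics the role of _version.

```python
class VersionConflict(Exception):
    """Raised when the caller's version is stale."""


class VersionedStore:
    """Toy in-memory store: each document carries a version counter."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, body)
        return current_version + 1


store = VersionedStore()
store.put("1", {"nome": "Federico"})  # creates version 1
version, body = store.get("1")

# First writer updates on top of version 1, producing version 2.
store.put("1", {"nome": "Fede"}, expected_version=version)

try:
    # A second writer still holds version 1, so its write is rejected.
    store.put("1", {"nome": "Rico"}, expected_version=version)
except VersionConflict:
    # Re-read the fresh document, then retry on top of it.
    version, body = store.get("1")
    store.put("1", {"nome": "Rico"}, expected_version=version)

print(store.get("1"))
```

The store never locks anything. The application carries the version it read and handles the rejection, which is exactly the responsibility shift described above.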
  46. 46. Elastic: BONUS: manage conflicts #9: Optimistic Concurrency Control. http://www.jillesvangurp.com/2014/12/03/optimistic-locking-for-updates-in-elasticsearch/ https://aphyr.com/posts/317-call-me-maybe-elasticsearch https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
  47. 47. Elastic: Simple searches #1. Create index, REST API, GET, DELETE, POST, SEARCH.
  48. 48. Elastic: Simple searches - CREATE AN INDEX. curl -XPUT http://fazlab.fazland.com:9200/fazlab -d '{ "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 } }'
  49. 49. Elastic: Simple searches - INDEX A DOCUMENT. curl -XPUT 'http://fazlab.fazland.com:9200/fazlab/categories/1?pretty' -d '{ "nome": "Federico" }'
  50. 50. Elastic: Simple searches - GET A DOCUMENT. curl 'http://fazlab.fazland.com:9200/fazlab/categories/1?pretty'
  51. 51. Elastic: Simple searches - DELETE A DOCUMENT. curl -XDELETE 'http://fazlab.fazland.com:9200/fazlab/categories/2?pretty'
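The same four calls can be prepared from Python with the standard library alone. This sketch only builds the requests and does not send them, so it runs without a cluster. BASE_URL points at an assumed local node, and build_request is a hypothetical helper, not part of any official client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the REST port


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(f"{BASE_URL}{path}", data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


create_index = build_request(
    "PUT", "/fazlab",
    {"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
index_doc = build_request("PUT", "/fazlab/categories/1", {"nome": "Federico"})
get_doc = build_request("GET", "/fazlab/categories/1")
delete_doc = build_request("DELETE", "/fazlab/categories/2")

for req in (create_index, index_doc, get_doc, delete_doc):
    print(req.get_method(), req.full_url)

# To actually send one of them against a running cluster:
# with urllib.request.urlopen(get_doc) as response:
#     print(response.read().decode("utf-8"))
```

Serialising the body with json.dumps guarantees valid JSON, which is easy to get wrong when quoting a payload by hand on the command line.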
  52. 52. Elastic: Simple searches #1. DEMO SEARCHES!
  53. 53. Elastic: mapping and analysis. EXACT MATCH vs FULL TEXT
  54. 54. Elastic: mapping and analysis. EXACT MATCH vs FULL TEXT. Exact match: where name = ‘Federico’ and user_id = 2 and date > “2014-09-15”. Full text: “Frank has been to South beach” should match Frank / FRANK / frank.
  55. 55. Elastic: mapping and analysis. EXACT MATCH vs FULL TEXT. Exact match is binary: does the document contain these values? Full text asks: how relevant is the document to the terms used in the query?
  56. 56. Elastic: mapping and analysis. To support full-text search, Elasticsearch analyses the text and uses the result to build an inverted index. Key concepts: inverted index, analyzer.
  57. 57. Elastic: Inverted Index. Document 1: The quick brown fox jumped over the lazy dog. Document 2: Quick brown foxes leap over lazy dogs in summer.
  58. 58. Elastic: Inverted Index. If we search for the words “quick” and “brown”, we pick only the documents that contain both words. Document 1: The quick brown fox jumped over the lazy dog. Document 2: Quick brown foxes leap over lazy dogs in summer.
  59. 59. Elastic: Inverted Index. Document 1: The quick brown fox jumped over the lazy dog. Document 2: Quick brown foxes leap over lazy dogs in summer.
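A minimal sketch of an inverted index built from the two example sentences. It assumes the simplest possible analysis (lowercase, then split on whitespace), which is far cruder than what Lucene does. The function names are hypothetical.

```python
from collections import defaultdict

documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, *terms):
    """Return ids of documents that contain every term (an AND query)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(documents)
print(sorted(search_all(index, "quick", "brown")))
print(sorted(search_all(index, "fox")))
```

Both documents match “quick” and “brown”, but only the first matches “fox”, because “foxes” is stored as a different term. Closing that gap is the job of the analyzers.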
  60. 60. Elastic: ANALYZERS. An analyzer has 3 functions: character filters, tokenizer, token filters.
  61. 61. Elastic: ANALYZERS - Character Filters. The first step of an analyzer passes every string through the character filters, which clean up or reorganise the string before tokenization. In this phase, for example, HTML markup is removed or & is converted to “and”.
  62. 62. Elastic: ANALYZERS - Tokenizer. The second step of an analyzer is tokenization, which splits a sentence into individual terms.
  63. 63. Elastic: ANALYZERS - Token Filters. After the strings have been tokenized into single terms, the selected filters are applied in sequence. For example: lowercase the whole text, remove stop words, add synonyms.
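The three stages can be sketched as a small pipeline. All function names, the regular expressions and the tiny stop-word list are assumptions made for illustration. Real analyzers are configured in Elasticsearch rather than hand-written like this.

```python
import re

STOP_WORDS = {"a", "an", "the", "to", "by"}  # tiny illustrative list


def char_filter(text):
    """Strip HTML tags and spell out '&' before tokenization."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the text into terms on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, token_filters=(lowercase_filter, stop_filter)):
    """Run the character filter, the tokenizer, then each token filter in order."""
    tokens = tokenizer(char_filter(text))
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>Tom & Jerry: the QUICK brown fox</p>"))
```

The order of the token filters matters: the stop-word filter only recognises “the” because the lowercase filter runs before it.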
  64. 64. Elastic: Standard Analyzer. “Set the shape to semi-transparent by calling set_trans(5)” The standard analyzer is the default analyzer of Elasticsearch. It splits the text into single words, removes most punctuation and lowercases the terms. Result: “set, the, shape, to, semi, transparent, by, calling, set_trans, 5”
  65. 65. Elastic: Simple Analyzer. “Set the shape to semi-transparent by calling set_trans(5)” The simple analyzer splits the text on every character that is not a letter and lowercases the terms. Result: “set, the, shape, to, semi, transparent, by, calling, set, trans”
  66. 66. Elastic: Whitespace Analyzer. “Set the shape to semi-transparent by calling set_trans(5)” The whitespace analyzer splits the text on whitespace only; it does not lowercase the terms. Result: “Set, the, shape, to, semi-transparent, by, calling, set_trans(5)”
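The simple and whitespace analyzers are easy to approximate in a few lines, which makes the difference between them visible. This is a sketch that handles ASCII letters only. The function names are hypothetical and this is not how Lucene implements them.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_analyzer(text):
    """Split on any non-letter character and lowercase every term."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


print(simple_analyzer(TEXT))
print(whitespace_analyzer(TEXT))
```

Splitting on whitespace alone keeps “semi-transparent” and “set_trans(5)” as single terms with their original case, while the simple analyzer breaks them apart and drops the digit.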
  67. 67. Elastic: Language Analyzer. “Set the shape to semi-transparent by calling set_trans(5)” This analyzer uses language-specific rules to remove stop words and to apply stemming. Result: “set, shape, semi, transpar, call, set_tran, 5”
  68. 68. Elastic: Language Analyzer. Available languages: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
  69. 69. Elastic: Pre-built Analyzers. Standard Analyzer, Simple Analyzer, Whitespace Analyzer, Stop Analyzer, Keyword Analyzer, Pattern Analyzer, Language Analyzers, Snowball Analyzer, Custom Analyzer.
  70. 70. Elastic: Tokenizer. Standard Tokenizer, Edge NGram Tokenizer, Keyword Tokenizer, Letter Tokenizer, Lowercase Tokenizer, NGram Tokenizer, Whitespace Tokenizer, Pattern Tokenizer, UAX Email URL Tokenizer, Path Hierarchy Tokenizer.
  71. 71. Elastic: Token Filters. Standard Token Filter, ASCII Folding Token Filter, Length Token Filter, Lowercase Token Filter, NGram Token Filter, Edge NGram Token Filter, Porter Stem Token Filter, Shingle Token Filter, Stop Token Filter … more than 32 filters.
  72. 72. THE END.
  73. 73. References • Elasticsearch: The Definitive Guide • https://en.wikipedia.org/wiki/Full_text_search • https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html • https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html • https://mtalavera.wordpress.com/2015/02/16/monitoring-with-collectd-and-kibana/ • Fuzzy search: https://www.found.no/foundation/fuzzy-search/ • Phonetic plugin: https://github.com/elastic/elasticsearch-analysis-phonetic
