Introduction to Elasticsearch with basics of Lucene

Published in: Technology

  1. 1. Introduction to Elasticsearch with basics of Lucene • May 2014 Meetup • Rahul Jain, @rahuldausa • http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
  2. 2. Who am I • Software Engineer • 7 years of software development experience • Built a platform to search logs in near real time at a volume of 1 TB/day [#] • Worked on a Solr-based SEO/SEM search application handling 40 billion records/month (topic of the next talk?) • Areas of expertise/interest – High-traffic web applications – Java/J2EE – Big data, NoSQL – Information retrieval, machine learning • [#] http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
  3. 3. Agenda • IR Overview • Basic Concepts • Lucene • Elasticsearch • Logstash & Kibana (short introduction) • Q&A
  4. 4. Information Retrieval (IR) "Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing." (Wikipedia)
  5. 5. Basic Concepts • Term t: a noun or compound word used in a specific context • tf (t in d): term frequency, the number of times term t appears in the currently scored document d • idf (t): inverse document frequency, a measure of whether the term is common or rare across all documents in the index – obtained by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of that quotient • boost (index): boost of the field at index time • boost (query): boost of the field at query time
  6. 6. Basic Concepts: TF-IDF • TF-IDF = Term Frequency × Inverse Document Frequency • Credit: http://whatisgraphsearch.com/
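
The tf and idf definitions above can be sketched in a few lines of Python. This is only an illustration of the textbook formula as the slides state it; the function names and the three-document corpus are made up for the example, and Lucene's own similarity implementation applies additional normalization, so real scores will differ.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: how many times `term` occurs in one document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus scores zero
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = term frequency x inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is already split into tokens.
corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog", "jumps"],
]

common = tf_idf("the", corpus[0], corpus)  # appears in every document
rare = tf_idf("fox", corpus[0], corpus)    # appears in one document only
```

A term that occurs in every document ("the") gets an idf of log(1) = 0, while a term found in only one document ("fox") gets the highest weight. That is the behaviour the idf bullet describes.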
  7. 7. Apache Lucene
  8. 8. Apache Lucene • Fast, high-performance, scalable search/IR library • Open source • Initially developed by Doug Cutting (also the author of Hadoop) • Indexing and searching • Inverted index of documents • Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based search • http://lucene.apache.org/
  9. 9. Lucene Internals - Inverted Index • Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
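
Since the inverted-index slide is an image in the original deck, here is a minimal sketch of the idea: a map from each term to the set of documents that contain it. The function names and sample documents are hypothetical, and the tokenization (lowercase, split on whitespace) is a deliberate simplification of what Lucene analyzers do.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents keyed by id.
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "The quick dog",
}
index = build_inverted_index(docs)
hits = search(index, "quick dog")
```

A lookup reads only the posting sets for the query terms, so it never scans the documents themselves. This is why an inverted index suits full-text search.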
  10. 10. Lucene Internals (Contd.) • Defines a document model • An index contains documents. • Each document consists of fields. • Each field has attributes. – What is the data type (FieldType) – How to handle the content (Analyzers, Filters) – Is it a stored field (stored="true") or an indexed field (indexed="true")
  11. 11. Indexing Pipeline • Analyzer: creates tokens using a Tokenizer and, optionally, Token Filters • Each field can define an Analyzer for index time, for query time, or for both. • Credit: http://www.slideshare.net/otisg/lucene-introduction
  12. 12. Analysis Process - Tokenizer • WhitespaceAnalyzer: the simplest built-in analyzer; splits at whitespace only • Input: The quick brown fox jumps over the lazy dog. • Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  13. 13. Analysis Process - Tokenizer • SimpleAnalyzer: lowercases and splits at non-letter boundaries • Input: The quick brown fox jumps over the lazy dog. • Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
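
The two analyzers above can be imitated in Python to show how their output differs. These functions only approximate the behaviour described on the slides (split at whitespace versus lowercase and split at non-letters); they are not Lucene's implementation, and the function names are made up.

```python
import re


def whitespace_analyze(text):
    """Approximates WhitespaceAnalyzer: split at whitespace, change nothing."""
    return text.split()


def simple_analyze(text):
    """Approximates SimpleAnalyzer: keep runs of letters, then lowercase."""
    # [^\W\d_]+ matches one or more letters (word chars minus digits and "_").
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


sentence = "The quick brown fox jumps over the lazy dog."
ws_tokens = whitespace_analyze(sentence)
simple_tokens = simple_analyze(sentence)
```

The whitespace version keeps "The" capitalized and leaves the trailing period attached to "dog.". The simple version yields "the" and "dog", so a query for "dog" matches only under the second analyzer.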
  14. 14. Elasticsearch
  15. 15. Introduction • Enterprise search platform built on Apache Lucene • Open source • Highly reliable, scalable, fault tolerant • Supports distributed indexing, replication, and load-balanced querying • http://www.elasticsearch.org/
  16. 16. Elasticsearch - Features • Distributed RESTful search server • Document oriented • Domain driven • Schemaless • Easy to scale horizontally
  17. 17. Elasticsearch - Features • Highlighting • Spelling Suggestions • Facets (Group by) • Query DSL – based on JSON to define queries • Automatic shard replication, routing • Zen discovery – Unicast – Multicast • Master Election – Re-election if Master Node fails
  18. 18. APIs • HTTP RESTful API • Java API • Clients – Perl, Python, PHP, Ruby, .NET, etc. • All APIs perform automatic node operation rerouting.
  19. 19. How to start It’s this Easy.
  20. 20. Operations
  21. 21. INDEX CREATION curl -XPUT "http://localhost:9200/movies/movie/1" -d' { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }' URL format: http://localhost:9200/<index>/<type>/[<id>] Credit: http://joelabrahamsson.com/elasticsearch-101/
  22. 22. INDEX CREATION RESPONSE Credit: http://joelabrahamsson.com/elasticsearch-101/
  23. 23. UPDATE (re-index the document with a new field, "genres"; the response shows an updated version) curl -XPUT "http://localhost:9200/movies/movie/1" -d' { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972, "genres": ["Crime", "Drama"] }' Credit: http://joelabrahamsson.com/elasticsearch-101/
  24. 24. GET curl -XGET "http://localhost:9200/movies/movie/1" -d'' Credit: http://joelabrahamsson.com/elasticsearch-101/
  25. 25. DELETE curl -XDELETE "http://localhost:9200/movies/movie/1" -d'' Credit: http://joelabrahamsson.com/elasticsearch-101/
  26. 26. SEARCH • Search across all indexes and all types: http://localhost:9200/_search • Search across all types in the movies index: http://localhost:9200/movies/_search • Search explicitly for documents of type movie within the movies index: http://localhost:9200/movies/movie/_search • Example: curl -XPOST "http://localhost:9200/_search" -d' { "query": { "query_string": { "query": "kill" } } }' Credit: http://joelabrahamsson.com/elasticsearch-101/
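
The three search URLs and the query_string body can also be assembled programmatically. The sketch below uses only Python's standard library and builds the request without sending it, so it runs without a cluster. The function name `build_search_request` and its parameters are hypothetical; the `Content-Type` header is an assumption that newer Elasticsearch releases require, and the `<type>` path segment reflects the Elasticsearch version used in this talk (later releases dropped mapping types).

```python
import json
import urllib.request


def build_search_request(query_text, index=None, doc_type=None,
                         base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for a query_string search."""
    if doc_type and not index:
        raise ValueError("a type can only be searched within an index")
    # Path is /_search, /<index>/_search or /<index>/<type>/_search.
    parts = [p for p in (index, doc_type) if p] + ["_search"]
    url = base_url.rstrip("/") + "/" + "/".join(parts)
    body = {"query": {"query_string": {"query": query_text}}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )


all_docs = build_search_request("kill")
movies_only = build_search_request("kill", index="movies", doc_type="movie")
# To actually run the search against a live node:
#   with urllib.request.urlopen(movies_only) as resp:
#       result = json.load(resp)
```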
  27. 27. SEARCH RESPONSE Credit: http://joelabrahamsson.com/elasticsearch-101/
  28. 28. Updating existing Mapping curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d' { "movie": { "properties": { "director": { "type": "multi_field", "fields": { "director": {"type": "string"}, "original": {"type" : "string", "index" : "not_analyzed"} } } } } }' Credit: http://joelabrahamsson.com/elasticsearch-101/
  29. 29. Cluster Architecture Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
  30. 30. Index Request Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
  31. 31. Search Request Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
  32. 32. Who is using it • GitHub • StumbleUpon • SoundCloud • Datadog • Stack Overflow • Many more… – http://www.elasticsearch.com/case-studies/
  33. 33. Logstash
  34. 34. Logstash • Open source, Apache licensed • Written in JRuby • Part of the Elasticsearch family • http://logstash.net/ • Current version: 1.4.0 • This talk uses 1.3.3
  35. 35. Logstash • Multiple Input/ Multiple Output • Centralize logs • Collect • Parse • Forward/Store
  36. 36. Architecture Source: http://www.infoq.com/articles/review-the-logstash-book
  37. 37. Logstash – life of an event • Input → Filters → Output • Filters are processed in the order they appear in the config file • Outputs are processed in the order they appear in the config file • Input: input stream – File input (tail) – Log4j – Redis – Syslog – and many more… • http://logstash.net/docs/1.3.3/
  38. 38. Logstash – life of an event • Codecs : decoding log messages • Json • Multiline • Netflow • and many more… • Filters : processing messages • Date – Date format • Grok – Regular expression based extraction • Mutate – Change data type • and many more… • Output : storing the structured message • Elasticsearch • Mongodb • Email • Nagios • and many more… http://logstash.net/docs/1.3.3/
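
To make the Grok and Mutate filters concrete, the sketch below shows the underlying idea in plain Python: a regular expression with named groups turns a raw log line into a structured event, and one field is then converted to a different data type. It is an analogy only. It uses Python regex syntax rather than Grok pattern syntax, and the log format, field names, and `parse_line` function are invented for the example.

```python
import re

# Hypothetical log format: "<client> <METHOD> <path> <status>"
LOG_PATTERN = re.compile(
    r"(?P<client>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)


def parse_line(line):
    """Extract named fields from a log line; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()               # Grok-like step: regex extraction
    event["status"] = int(event["status"])  # Mutate-like step: change data type
    return event


event = parse_line("127.0.0.1 GET /index.html 200")
```

The resulting dictionary is the kind of structured message an output such as Elasticsearch would then store.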
  39. 39. Quick Start • Up to version 1.3.3: java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web • Version 1.4: bin/logstash agent -f agent.conf bin/logstash –web • basic-agent.conf: input { tcp { type => "apache" port => 3333 } } output { stdout { debug => true } elasticsearch { embedded => true } }
  40. 40. Kibana
  41. 41. Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern
  42. 42. Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern
  43. 43. Analytics • Analytics source: Kibana.org, based on Elasticsearch and Logstash • Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
  44. 44. Thanks! • @rahuldausa on Twitter and SlideShare • http://www.linkedin.com/in/rahuldausa • Found this interesting? Join us at http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
