Introduction to Elasticsearch with basics of Lucene
Published in: Technology

Transcript

  • 1. Introduction to Elasticsearch with basics of Lucene • May 2014 Meetup • Rahul Jain (@rahuldausa) • http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
  • 2. Who am I • Software Engineer • 7 years of software development experience • Built a platform to search logs in near real time at a volume of 1TB/day (#) • Worked on Solr-based SEO/SEM search software handling 40 billion records/month (topic of the next talk?) • Areas of expertise/interest: high-traffic web applications; Java/J2EE; big data, NoSQL; information retrieval, machine learning • (#) http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
  • 3. Agenda • IR Overview • Basic Concepts • Lucene • Elasticsearch • Logstash & Kibana - Short Introduction • Q&A
  • 4. Information Retrieval (IR) "Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing." - Wikipedia
  • 5. Basic Concepts • Term t : a noun or compound word used in a specific context • tf (t in d) : term frequency in a document; a measure of how often a term appears in the document, i.e. the number of times term t appears in the currently scored document d • idf (t) : inverse document frequency; a measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient • boost (index) : boost of the field at index time • boost (query) : boost of the field at query time
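
The tf and idf definitions above can be sketched in a few lines of Python. This is a minimal illustration that follows the slide's wording literally (raw counts and a plain logarithm); the three sample documents and the function names are made up for the example, and Lucene's own scoring applies further damping and normalization that are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, tokenized naively on whitespace.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "the quick dog jumps",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}


def tf(term, doc_id):
    """Number of times `term` appears in document `doc_id`."""
    return Counter(tokenized[doc_id])[term]


def idf(term):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    if containing == 0:
        return 0.0  # assumption: unseen terms contribute nothing
    return math.log(len(tokenized) / containing)


def tf_idf(term, doc_id):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_id) * idf(term)


if __name__ == "__main__":
    for term in ("the", "quick", "fox"):
        print(term, round(tf_idf(term, "d1"), 3))
```

Note how "the", which occurs in every document, gets an idf of zero and therefore does not help distinguish d1 from the others, while the rarer "fox" scores highest.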
  • 6. Basic Concepts: TF-IDF • TF-IDF = Term Frequency × Inverse Document Frequency • Credit: http://whatisgraphsearch.com/
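
Written out as a formula, using the definitions from the Basic Concepts slide, the product looks as follows. The symbols N (total number of documents) and n_t (number of documents containing term t) and the choice of the natural logarithm are assumptions introduced here for illustration; the numbers in the worked line are invented.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}
\]
\[
\mathrm{tf} = 3,\; N = 1000,\; n_t = 10
\;\Rightarrow\;
\mathrm{idf} = \ln 100 \approx 4.605,
\quad
\mathrm{tfidf} \approx 3 \times 4.605 = 13.815
\]
```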
  • 7. Apache Lucene
  • 8. Apache Lucene • Fast, high-performance, scalable search/IR library • Open source • Initially developed by Doug Cutting (also the author of Hadoop) • Indexing and searching • Inverted index of documents • Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based matching • http://lucene.apache.org/
  • 9. Lucene Internals - Inverted Index Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
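
Since the slide itself is only a diagram, here is a rough sketch of what an inverted index does: it maps each term to the set of documents containing it, so a query becomes a lookup plus a set intersection. The sample documents, function names, and the naive lowercase/whitespace tokenization are assumptions for illustration; Lucene's real index also stores positions, frequencies, and more.

```python
from collections import defaultdict

# Hypothetical documents keyed by id.
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick dog",
}


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(docs)

if __name__ == "__main__":
    print(search(index, "quick"))      # {1, 3}
    print(search(index, "quick dog"))  # {3}
```

The point of the structure is that search cost depends on the number of matching postings rather than on scanning every document.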
  • 10. Lucene Internals (Contd.) • Defines a document model • An index contains documents. • Each document consists of fields. • Each field has attributes: – What is the data type (FieldType) – How to handle the content (Analyzers, Filters) – Is it a stored field (stored="true") or an indexed field (indexed="true")
  • 11. Indexing Pipeline • Analyzer : creates tokens using a Tokenizer and/or by applying Filters (Token Filters) • Each field can define an Analyzer at index time, at query time, or both. Credit : http://www.slideshare.net/otisg/lucene-introduction
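
The tokenizer-plus-filters idea can be sketched as a small pipeline. This is a Python approximation, not Lucene's Java API: the `Analyzer` class, the filter functions, and the tiny stopword set are all hypothetical names chosen for the example.

```python
import re

# Hypothetical stopword list, kept tiny on purpose.
STOPWORDS = {"the", "a", "an"}


def letter_tokenizer(text):
    """Tokenizer: split the text into runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Token filter: drop stopwords."""
    return [t for t in tokens if t not in STOPWORDS]


class Analyzer:
    """A tokenizer followed by zero or more token filters, applied in order."""

    def __init__(self, tokenizer, filters=()):
        self.tokenizer = tokenizer
        self.filters = list(filters)

    def analyze(self, text):
        tokens = self.tokenizer(text)
        for token_filter in self.filters:
            tokens = token_filter(tokens)
        return tokens


analyzer = Analyzer(letter_tokenizer, [lowercase_filter, stop_filter])

if __name__ == "__main__":
    print(analyzer.analyze("The quick brown fox jumps over the lazy dog."))
```

Filter order matters here: the stop filter only matches lowercase words, so it has to run after the lowercase filter. Using the same analyzer at index time and query time keeps the terms comparable.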
  • 12. Analysis Process - Tokenizer • WhitespaceAnalyzer: the simplest built-in analyzer • Input: The quick brown fox jumps over the lazy dog. • Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  • 13. Analysis Process - Tokenizer • SimpleAnalyzer: lowercases and splits at non-letter boundaries • Input: The quick brown fox jumps over the lazy dog. • Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
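
The difference between the two analyzers on the last two slides can be reproduced with a couple of plain Python functions. These are approximations written for this example (the function names are made up); they mimic the behavior described on the slides rather than calling Lucene.

```python
import re

SENTENCE = "The quick brown fox jumps over the lazy dog."


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyzer(text):
    """Split at non-letter boundaries and lowercase each token."""
    # [^\W\d_]+ matches runs of letters (word characters minus digits and "_").
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


if __name__ == "__main__":
    print(whitespace_analyzer(SENTENCE))  # last token is "dog." with the period
    print(simple_analyzer(SENTENCE))      # last token is "dog", all lowercase
```

With the whitespace variant a query for "dog" would not match the indexed token "dog.", which is why the choice of analyzer directly affects what can be found.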
  • 14. Elasticsearch
  • 15. Introduction • Enterprise search platform built on Apache Lucene • Open source • Highly reliable, scalable, fault tolerant • Supports distributed indexing, replication, and load-balanced querying • http://www.elasticsearch.org/
  • 16. Elasticsearch - Features • Distributed RESTful search server • Document oriented • Domain driven • Schema-less • Easy to scale horizontally
  • 17. Elasticsearch - Features • Highlighting • Spelling Suggestions • Facets (Group by) • Query DSL – based on JSON to define queries • Automatic shard replication, routing • Zen discovery – Unicast – Multicast • Master Election – Re-election if Master Node fails
  • 18. APIs • HTTP RESTful API • Java API • Clients – Perl, Python, PHP, Ruby, .NET, etc. • All APIs perform automatic node operation rerouting.
  • 19. How to start • It's this easy.
  • 20. Operations
  • 21. INDEX CREATION curl -XPUT "http://localhost:9200/movies/movie/1" -d' { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }' • URL pattern: http://localhost:9200/<index>/<type>/[<id>] Credit: http://joelabrahamsson.com/elasticsearch-101/
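
For readers who prefer to script the same call, the curl command can be mirrored with Python's standard library. This is only a sketch under assumptions: the helper name `build_index_request` is made up, the URL layout copies the slide's `<index>/<type>/<id>` pattern, the explicit JSON `Content-Type` header is an addition not shown on the slide, and the request is only built here, not sent, so no running node is needed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # node address taken from the slide


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request for /<index>/<type>/<id>."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


movie = {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
}
request = build_index_request("movies", "movie", 1, movie)

# To actually index the document against a running node, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Sending a second PUT to the same URL with a changed body replaces the stored document, which is what the UPDATE slide relies on.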
  • 22. INDEX CREATION RESPONSE Credit: http://joelabrahamsson.com/elasticsearch-101/
  • 23. UPDATE curl -XPUT "http://localhost:9200/movies/movie/1" -d' { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972, "genres": ["Crime", "Drama"] }' • Updated version of the document; "genres" is the new field. Credit: http://joelabrahamsson.com/elasticsearch-101/
  • 24. GET curl -XGET "http://localhost:9200/movies/movie/1" -d'' Credit: http://joelabrahamsson.com/elasticsearch-101/
  • 25. DELETE curl -XDELETE "http://localhost:9200/movies/movie/1" -d'' Credit: http://joelabrahamsson.com/elasticsearch-101/
  • 26. SEARCH • Search across all indexes and all types: http://localhost:9200/_search • Search across all types in the movies index: http://localhost:9200/movies/_search • Search explicitly for documents of type movie within the movies index: http://localhost:9200/movies/movie/_search • Example: curl -XPOST "http://localhost:9200/_search" -d' { "query": { "query_string": { "query": "kill" } } }' Credit: http://joelabrahamsson.com/elasticsearch-101/
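
The three search scopes and the query body from this slide can be assembled programmatically. The sketch below only builds the URL and the JSON body; the helper names `search_url` and `query_string_body` are invented for the example, and the URL layout is the one shown on the slide.

```python
import json

BASE_URL = "http://localhost:9200"  # node address taken from the slide


def search_url(index=None, doc_type=None):
    """Build the _search URL for all indexes, one index, or one type in an index."""
    if doc_type and not index:
        raise ValueError("a type can only be searched within a named index")
    parts = [part for part in (index, doc_type) if part]
    return "/".join([BASE_URL, *parts, "_search"])


def query_string_body(text):
    """Build the query DSL body for a query_string search."""
    return {"query": {"query_string": {"query": text}}}


if __name__ == "__main__":
    print(search_url())                   # http://localhost:9200/_search
    print(search_url("movies"))           # http://localhost:9200/movies/_search
    print(search_url("movies", "movie"))  # http://localhost:9200/movies/movie/_search
    print(json.dumps(query_string_body("kill")))
```

The body returned by `query_string_body` is what the curl example passes with `-d`; it would be POSTed to whichever of the three URLs matches the desired scope.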
  • 27. SEARCH RESPONSE Credit: http://joelabrahamsson.com/elasticsearch-101/
  • 28. Updating existing Mapping curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d' { "movie": { "properties": { "director": { "type": "multi_field", "fields": { "director": {"type": "string"}, "original": {"type" : "string", "index" : "not_analyzed"} } } } } }' Credit: http://joelabrahamsson.com/elasticsearch-101/
  • 29. Cluster Architecture Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
  • 30. Index Request Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
  • 31. Search Request Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
  • 32. Who is using it • GitHub • StumbleUpon • SoundCloud • Datadog • Stack Overflow • Many more… – http://www.elasticsearch.com/case-studies/
  • 33. Logstash
  • 34. Logstash • Open source, Apache licensed • Written in JRuby • Part of the Elasticsearch family • http://logstash.net/ • Current version: 1.4.0 • This talk uses 1.3.3
  • 35. Logstash • Multiple Input/ Multiple Output • Centralize logs • Collect • Parse • Forward/Store
  • 36. Architecture Source: http://www.infoq.com/articles/review-the-logstash-book
  • 37. Logstash – life of an event • Input → Filters → Output • Filters are processed in the order they appear in the config file • Outputs are processed in the order they appear in the config file • Input: input stream – File input (tail) – Log4j – Redis – Syslog – and many more… • http://logstash.net/docs/1.3.3/
  • 38. Logstash – life of an event • Codecs : decoding log messages – JSON – Multiline – Netflow – and many more… • Filters : processing messages – Date: date format – Grok: regular-expression-based extraction – Mutate: change data type – and many more… • Output : storing the structured message – Elasticsearch – MongoDB – Email – Nagios – and many more… • http://logstash.net/docs/1.3.3/
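
To make the filter stage more concrete, the following Python sketch imitates what the Grok, Date, and Mutate filters do to a single event: extract fields with a regular expression, parse the timestamp, and convert types, with filters applied in order. It is an analogy only, not Logstash configuration or its API; the pattern, field names, and sample log line are assumptions modeled on an Apache access log entry.

```python
import re
from datetime import datetime

# Hypothetical pattern for an Apache common-log style line (grok-like extraction).
LINE_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def grok(line):
    """Extract named fields from a raw line; return None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


def date_filter(event):
    """Parse the textual timestamp into a timezone-aware datetime."""
    event["@timestamp"] = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return event


def mutate_filter(event):
    """Convert numeric fields from strings to integers."""
    event["response"] = int(event["response"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


def process(line):
    """Input -> filters (applied in order) -> structured event ready for an output."""
    event = grok(line)
    if event is None:
        return None
    for event_filter in (date_filter, mutate_filter):
        event = event_filter(event)
    return event


sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = process(sample)

if __name__ == "__main__":
    print(event)
```

An output stage would then take the resulting dictionary and store it, for example by indexing it into Elasticsearch as a JSON document.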
  • 39. Quick Start • Version 1.3.3 and earlier: java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web • Version 1.4: bin/logstash agent -f agent.conf ; bin/logstash –web • basic-agent.conf : input { tcp { type => "apache" port => 3333 } } output { stdout { debug => true } elasticsearch { embedded => true } }
  • 40. Kibana
  • 41. Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern
  • 42. Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern
  • 43. Analytics • Analytics source: Kibana.org, based on Elasticsearch and Logstash • Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
  • 44. Thanks! @rahuldausa on Twitter and SlideShare • http://www.linkedin.com/in/rahuldausa • Found this interesting? Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/