Intro to Elasticsearch and beyond

Elasticsearch is a tool for querying written words. It can perform some other nifty tasks, but at its core it’s made for wading through text, returning text similar to a given query and/or statistical analyses of a corpus of text.

More specifically, Elasticsearch is a standalone database server, written in Java, that takes data in and stores it in a sophisticated format optimized for language-based searches. Working with it is convenient because its main protocol is implemented with HTTP/JSON. Elasticsearch is also easily scalable, supporting clustering and leader election out of the box.

Whether it’s searching a database of retail products by description, finding similar text in a body of crawled web pages, or searching through posts on a blog, Elasticsearch is a fantastic choice. When facing the task of cutting through the semi-structured muck that is natural language, Elasticsearch is an excellent tool.

Usage Rights

© All Rights Reserved

    Intro to Elasticsearch and beyond: Presentation Transcript

    • Introduction to Elasticsearch and Beyond
    • Foursquare: Searching 50 million venues in real time? Foursquare does it using Elasticsearch.
    • GitHub: Searches 20TB of data using Elasticsearch, including 1.3 billion files and 130 billion lines of code.
    • SoundCloud: Uses Elasticsearch to provide immediate and relevant results for its online audio distribution platform, which reaches 180 million people.
    • StumbleUpon: Elasticsearch is mission critical to StumbleUpon, delivering millions of recommendations to its community every day.
    • Fog Creek: Elasticsearch powers Fog Creek Software’s instant search for 30 million requests per month across 40 billion lines of code.
    • Agenda
      ● Introduction
      ● Setting up Elasticsearch
      ● Creating & querying an ES index
      ● Analyzer
      ● Facets
      ● Percolator
    • Introduction Elastic What ???
    • Elastic Whattt .... ??
      ● Elasticsearch is a distributed, RESTful, free/open source search server based on Apache Lucene.
      ● Developed by Shay Banon, written in Java.
      ● Latest and Greatest - 0.90.1
    • Features and Advantages
    • Document Oriented & Schema Free
      ● Complex entities are stored as JSON documents.
      ● No need for a schema upfront; the schema is detected on the fly.
      ● The schema can also be customized at the field level.
    • Distributed & Highly Available
      ● Built to scale horizontally out of the box.
      ● Each index is fully sharded with a configurable number of shards, and each shard can have zero or more replicas.
      ● Each node hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically.
      ● Clusters detect and remove failed nodes, and reorganize themselves to ensure that your data is safe and accessible.
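The idea of spreading an index over a fixed number of shards can be sketched with a toy routing function. This is a simplified illustration, not Elasticsearch's actual implementation: the CRC32 hash, the helper names `pick_shard` and `distribute`, and the shard count of 5 are all assumptions made for the example.

```python
import zlib

def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a shard number (hypothetical helper)."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # CRC32 is only a stand-in; the real server uses its own hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards

# The same ID always lands on the same shard, so reads can find what writes stored.
layout = distribute([str(i) for i in range(100)], 5)
```

Because the target shard depends on the shard count, changing that count would change where existing documents are expected to live, which is why the number of primary shards is chosen when an index is created.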
    • Distributed Search [diagram; labels: "Master Node", "Search for Employee b"]
    • Redundancy [diagram; labels: "Master Node", "Primary Shards", "Redundant Shards"]
    • Node Auto Discovery [diagram; labels: "Relegated to a data node role", "Takes over as Master"]
    • Fail Safe: a new master node automatically assumes the master role if the current master node goes down.
    • Multi Tenant with Multi Types
      ● A cluster can host multiple indices, which can be queried independently or as a group.
        curl -XGET 'http://localhost:9200/demo,demo2/employee/_search?q=quick'
      ● Support for more than one type per index.
    • RESTful API & Clients
      ● Any action can be performed with a simple RESTful API using JSON over HTTP.
      ● Native Java and Groovy clients.
      ● Community-supported clients for Perl, Python, Ruby, PHP, JavaScript, Scala, Clojure and so on ...
    • What can it do?
      ● Search: Find all companies in the ‘search’ domain.
      ● Facets: Return the average annual revenue for all companies.
      ● Combine: Return the average annual revenue for all companies founded between 2010 and 2012 in the ‘search’ domain.
      ● Indexing: Index thousands of documents per second.
      All in near real time.
    • Use Cases: any application that needs search, data analysis, or aggregation capabilities.
    • Setting Up
      ● Installing Elasticsearch
      ● Adding the HEAD plugin
    • How to install and run it?
      ● Download & extract from http://www.elasticsearch.org/download/
      ● $ bin/elasticsearch -f
      ● There is no step 3 :D
    • Add On ...
      ● There is a nice web front-end tool (elasticsearch-head) for browsing and interacting with an Elasticsearch cluster.
        git clone git://github.com/mobz/elasticsearch-head.git
        firefox <cloned_dir_location>/index.html
    • Basic Operations Creating and Querying an Index
    • Quick API Intro
        curl -<method-type> '<host>:<port>/<indices>/<type>'

        Elasticsearch        Relational DB
        Index (or Indices)   Database
        Type                 Table
        Field                Column
    • Quick API Intro
      ● Index a document
        curl -XPOST 'localhost:9200/demo/employee' -d '{ "name" : "Mohit Garg", "age" : 28 }'
      ● Delete a document
        curl -XDELETE 'localhost:9200/demo/employee/1'
    • Quick API Intro
      ● Find a document with a free-text search
        curl -XGET 'localhost:9200/demo/_search?q=Mohit'
      ● Get/retrieve a document by ID
        curl -XGET 'localhost:9200/crunchbase/person/1'
      ● Use the Query DSL
        curl -XPOST 'localhost:9200/_search?pretty' -d '{ "query" : { .... } }'
    • Quick API Intro
      ● Update (update if found, else fail)
        curl -XPOST 'localhost:9200/demo/employee/3/_update' -d '{ "doc" : { "name" : "Mohit Garg" } }'
      ● Upsert (update if found, else insert)
        curl -XPOST 'localhost:9200/demo1/employee/abc/_update' -d '{ "doc" : { "name" : "Mohit Garg" }, "upsert" : { "name" : "Mohit Garg", "age" : "24" } }'
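The difference between a plain update and an upsert can be modelled on an in-memory dictionary. This is a minimal sketch of the semantics only: the `update` function, the `DocumentMissingError` class and the dict-based store are hypothetical names invented for illustration, and the merge here is shallow.

```python
class DocumentMissingError(KeyError):
    """Raised when a partial update targets an ID that does not exist."""

def update(store, doc_id, doc, upsert=None):
    """Apply a partial document; fall back to `upsert` if the ID is missing."""
    if doc_id in store:
        # Update if found: merge the partial fields into the stored document.
        store[doc_id] = {**store[doc_id], **doc}
    elif upsert is not None:
        # Upsert: the document is missing, so insert the upsert body instead.
        store[doc_id] = dict(upsert)
    else:
        # Plain update: a missing document is an error.
        raise DocumentMissingError(doc_id)
    return store[doc_id]

store = {"3": {"name": "M. Garg", "age": 28}}
updated = update(store, "3", {"name": "Mohit Garg"})
inserted = update(store, "abc", {"name": "Mohit Garg"},
                  upsert={"name": "Mohit Garg", "age": "24"})
```

Note that when the document is missing, the `doc` part is ignored and only the `upsert` body is stored.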
    • Searching Criteria
    • Match All Query
      ● A query that matches all documents.
        { "query" : { "match_all" : { } } }
    • Query String Query
      ● A query that uses a query parser to parse its content.
        { "query" : { "query_string" : { "query" : "Female OR Stephania" } } }
    • Range Query
      ● Matches documents with fields that have terms within a certain range. The following example returns all documents where age is between 10 and 20.
        { "range" : { "age" : { "from" : 10, "to" : 20 } } }
    • Sorting And Pagination
      ● Pagination
        { "from" : 5, "size" : 10, "query" : {...} }
      ● Sorting
        { "sort" : [ { "salary" : { "order" : "desc" } } ], "query" : {...} }
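How `from`, `size` and `sort` interact can be shown with a small in-memory sketch: the hits are ordered first, and the page is then a slice of that ordering. The `search_page` helper and the sample `employees` data are assumptions made for this example, not part of any client library.

```python
def search_page(docs, sort_field, order="desc", from_=0, size=10):
    """Sort all hits, then return the slice [from_, from_ + size)."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=(order == "desc"))
    return ordered[from_:from_ + size]

# Twenty hypothetical employees with salaries 1000, 1100, ..., 2900.
employees = [{"name": f"emp{i}", "salary": 1000 + 100 * i} for i in range(20)]

# Equivalent in spirit to { "from": 5, "size": 10, "sort": [{"salary": {"order": "desc"}}] }.
page = search_page(employees, "salary", order="desc", from_=5, size=10)
```

Here `from` skips the five highest salaries and `size` caps the page at ten hits.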
    • Analyzer
    • Indexing it Right !!!
        1) The quick brown fox jumps over the lazy dog
        2) Fast jumping spiders
    • Indexing it Right !!! Hmmm... Let's index everything ...
        Term       Documents
        The        1
        quick      1
        brown      1
        fox        1
        jumps      1
        over       1
        the        1
        lazy       1
        dog        1
        Fast       2
        jumping    2
        spiders    2
    • Indexing it Right !!! Wait.... Do we really need words like “the”, “a”, “or”?
        Term       Documents
        quick      1
        brown      1
        fox        1
        jumps      1
        lazy       1
        dog        1
        Fast       2
        jumping    2
        spiders    2
    • Indexing it Right !!! Wait .... Some words “stem” from others.
        Term       Documents
        quick      1
        brown      1
        fox        1
        jump       1,2
        lazy       1
        dog        1
        Fast       2
        spiders    2
    • Indexing it Right !!! Wait .... Some words are simply synonyms of others.
        Term       Documents
        quick      1,2
        brown      1
        fox        1
        jump       1,2
        lazy       1
        dog        1
        Fast       1,2
        spiders    2
    • Indexing it Right !!! Wait .... Case insensitive ??
        Term       Documents
        quick      1,2
        brown      1
        fox        1
        jump       1,2
        lazy       1
        dog        1
        fast       1,2
        spiders    2
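The naive "index everything" step can be sketched as a plain inverted index, mapping each term to the documents that contain it. This is an illustrative sketch only; `build_inverted_index` is a hypothetical helper, and it splits on whitespace with no lowercasing, stopword removal or stemming.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every whitespace-separated term to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Fast jumping spiders",
}
index = build_inverted_index(docs)
```

With no preprocessing, "The" and "the" become separate terms, and "jumps" and "jumping" never meet under a shared "jump" entry, which is the problem the later slides work through.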
    • Pre Process before Indexing
      ● Drop stopwords
      ● Lowercase everything
      ● Reduce words to stems
      ● Consider synonyms
      ● Let's test it out on ES
        curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,stop' -d 'The quick brown Fox jumped over the lazy Dog'
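The preprocessing steps listed above can be chained into a toy analyzer. This is a rough sketch under stated assumptions: the stopword list is a tiny hand-picked set, the `stem` function is a crude suffix stripper standing in for a real stemmer such as Snowball, and synonym handling is left out entirely. None of these names come from Elasticsearch.

```python
import re

STOPWORDS = {"a", "an", "and", "or", "the"}  # assumed, deliberately tiny list
SUFFIXES = ("ing", "ed", "s")                # crude stand-in for a real stemmer

def tokenize(text):
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def stem(token):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stopwords, then stem."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stem(t) for t in tokens]

first = analyze("The quick brown Fox jumped over the lazy Dog")
second = analyze("Fast jumping spiders")
```

After analysis, "jumped" and "jumping" both reduce to "jump", so the two sentences now share a term that an index lookup can match on.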
    • Text analysis in Elasticsearch
      Text is analyzed at two points:
      ● In document fields, when preparing them for indexing.
      ● In the query, when preparing it for searching.
    • Configuration
      ● Analyzers can be configured per index.
        curl -XPUT 'localhost:9200/demo' -d '{ "settings" : { "analysis" : { "analyzer" : { "my_analyzer" : { "tokenizer" : "whitespace", "filter" : [ "lowercase", "stop", "snowball" ] } } } } }'
    • What is Mapping?
      ● The indexed data is based on documents and fields.
      ● Mappings define how these documents should be handled:
        ➢ How should the documents be indexed?
        ➢ What are the data types of the document fields?
        ➢ How should object-typed fields be treated?
        ➢ What are the relations between different types of documents?
        ➢ How should document metadata be handled?
        ➢ What boosts apply per field / document type?
    • Elasticsearch Data Types
      ● Core types include, but are not limited to, String, Boolean, Number (Double, Float, Integer, etc.), and Binary.
      ● Array
      ● Object
      ● Nested
      ● Multi-Field
    • Mapping API
        curl -XPOST 'localhost:9200/demo123' -d '{ "mappings" : { "employee" : { "properties" : { "name" : { "type" : "string", "analyzer" : "snowball" }, "department" : { "type" : "object", "properties" : { "since" : { "type" : "date" }, "name" : { "type" : "string" } } } } } } }'
      ● Get the current mappings of a specific type:
        curl -XGET 'localhost:9200/demo/employee/_mapping'
    • Advanced Queries
    • Match Query
      ● A query that accepts text/numerics/dates, analyzes the input, and constructs a query out of it.
      ● The default match query is of type boolean: the text provided is analyzed, and the analysis process constructs a boolean query from it.
        { "query" : { "match" : { "name" : { "query" : "Mohit Garg", "operator" : "and" } } } }
    • Term Or Terms Query
      ● Term Query: Matches documents that have fields that contain a term (not analyzed).
        { "query" : { "term" : { "age" : 24 } } }
      ● Terms Query: Same as Term Query, but you can specify multiple values.
        { "query" : { "terms" : { "age" : [ 30, 40, 50 ], "minimum_match" : 2 } } }
    • Bool Query
      ● A query that matches documents matching boolean combinations of other queries.
        ➢ must - The clause (query) must appear in matching documents.
        ➢ should - The clause (query) should appear in matching documents; use the minimum_number_should_match parameter to specify how many should clauses must match.
        ➢ must_not - The clause (query) must not appear in the matching documents.
    • Bool Query Example
        { "query" : { "bool" : {
            "must" : { "term" : { "gender" : "Female" } },
            "must_not" : { "range" : { "salary" : { "from" : 200, "to" : 300 } } },
            "should" : [ { "match" : { "department" : "HR" } }, { "term" : { "salary" : 400 } } ],
            "minimum_number_should_match" : 1
        } } }
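The matching logic of the bool query example can be traced with a small in-memory evaluator. This is a sketch of the boolean semantics only: scoring is ignored, `match` is simplified to exact equality, and the helpers `term`, `in_range` and `matches_bool` are hypothetical names, not Elasticsearch APIs.

```python
def term(field, value):
    """Clause that holds when the field equals the value exactly."""
    return lambda doc: doc.get(field) == value

def in_range(field, low, high):
    """Clause that holds when the field lies in [low, high]."""
    return lambda doc: field in doc and low <= doc[field] <= high

def matches_bool(doc, must=(), must_not=(), should=(), minimum_should_match=0):
    """Every must clause holds, no must_not clause holds, enough should clauses hold."""
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    hits = sum(1 for clause in should if clause(doc))
    return hits >= minimum_should_match

query = dict(
    must=[term("gender", "Female")],
    must_not=[in_range("salary", 200, 300)],
    should=[term("department", "HR"), term("salary", 400)],
    minimum_should_match=1,
)

docs = [
    {"gender": "Female", "salary": 400, "department": "IT"},  # matches via salary
    {"gender": "Female", "salary": 250, "department": "HR"},  # excluded by must_not
    {"gender": "Male", "salary": 400, "department": "HR"},    # fails must
    {"gender": "Female", "salary": 500, "department": "IT"},  # no should clause hits
]
results = [matches_bool(doc, **query) for doc in docs]
```

Only the first sample document satisfies all three parts of the query.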
    • Filters
    • Logical Filter (And, Or, Not)
      ● And / Or Filter: A filter that matches documents using the “and” or “or” boolean operator on other filters.
        { "filter" : { "and" : [ { "range" : { "age" : { "from" : 20, "to" : 30 } } }, { "term" : { "salary" : 2600 } } ] } }
      ● Not Filter: A filter that excludes documents matching the wrapped filter.
        { "filter" : { "not" : { "term" : { "salary" : 2600 } } } }
    • Exists Or Missing Filter
      ● Exists Filter: Filters documents where a specific field has a value.
        { "query" : { "match_all" : {} }, "filter" : { "exists" : { "field" : "age" } } }
      ● Missing Filter: Filters documents where a specific field has no value.
    • Faceting
    • Faceted Search
      ● The usual purpose of a full-text search engine is to return a small number of documents matching your query.
      ● Facets provide aggregated data based on a search query.
      ● In the simplest case, a terms facet returns counts for the various values of a specific field.
      ● Elasticsearch supports more facet implementations, such as statistical or date histogram facets.
    • Facets .. What they look like
    • How to use Facet?
      ● Suppose we want to get the facet counts of the age field.
        { "query" : { "match_all" : { } }, "facets" : { "age_facets" : { "terms" : { "field" : "age" } } } }
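What a terms facet computes can be sketched as a simple count per distinct field value. The `terms_facet` helper and the sample documents are assumptions for illustration; the output shape loosely mirrors term/count pairs but is not the server's exact response format.

```python
from collections import Counter

def terms_facet(docs, field):
    """Count how many documents carry each distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"term": term, "count": count} for term, count in counts.most_common()]

employees = [
    {"name": "a", "age": 24},
    {"name": "b", "age": 30},
    {"name": "c", "age": 24},
    {"name": "d", "age": 28},
    {"name": "e", "age": 24},
    {"name": "f", "age": 30},
    {"name": "g"},  # no age field, so it is not counted
]
age_facets = terms_facet(employees, "age")
```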
    • Facets Filters
        { "query" : { "match_all" : {} }, "facets" : { "age_facets" : { "terms" : { "field" : "age" }, "facet_filter" : { "range" : { "salary" : { "from" : 100, "to" : 150 } } } } } }
    • Range Facet
      ● Enables splitting numeric & date field values into range “buckets” and returns aggregated document counts per “bucket” (stats per range).
        { "query" : { "match_all" : {} }, "facets" : { "age_groups" : { "range" : { "field" : "age", "ranges" : [ { "to" : 25 }, { "from" : 26, "to" : 40 }, { "from" : 45 } ] } } } }
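The bucketing performed by a range facet can be sketched as a per-range count. This is an illustration under an explicit assumption: each range is treated as including its `from` bound and excluding its `to` bound. The `range_facet` helper and the sample ages are hypothetical.

```python
def range_facet(values, ranges):
    """Count values per range; assumes `from` is inclusive and `to` is exclusive."""
    buckets = []
    for spec in ranges:
        low = spec.get("from")
        high = spec.get("to")
        count = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        buckets.append({**spec, "count": count})
    return buckets

ages = [20, 24, 25, 26, 39, 40, 45, 60]
age_groups = range_facet(ages, [{"to": 25}, {"from": 26, "to": 40}, {"from": 45}])
```

With the ranges from the slide, the ages 25 and 40 fall into no bucket at all, because the ranges leave gaps between them.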
    • Statistical Facet
      ● Computes statistics from numeric values.
        "facets" : { "statistics" : { "statistical" : { "field" : "age" } } }
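The kind of summary a statistical facet returns can be sketched in a few lines. This covers only a subset of typical statistics (count, total, min, max, mean); the `statistical_facet` helper and its output keys are assumptions for illustration rather than the server's exact response.

```python
def statistical_facet(values):
    """Summarize numeric values; returns only a count of 0 for empty input."""
    values = list(values)
    if not values:
        return {"count": 0}
    total = sum(values)
    return {
        "count": len(values),
        "total": total,
        "min": min(values),
        "max": max(values),
        "mean": total / len(values),
    }

stats = statistical_facet([20, 30, 40])
```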
    • Histogram Facet
      ● Computes a distribution of numeric values into fixed-interval buckets. Each value is “rounded” into its bucket, so a value of 135 with a bucket interval of 10 will “fall” into bucket 130.
      ● Get the age distribution in intervals of 5 (the facet name age_histogram is illustrative):
        { "facets" : { "age_histogram" : { "histogram" : { "field" : "age", "interval" : 5 } } } }
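The "rounding" of values into fixed-interval buckets can be written as one line of integer arithmetic. The helpers `bucket_key` and `histogram_facet` and the sample ages are assumptions made for this sketch.

```python
from collections import Counter

def bucket_key(value, interval):
    """Round a value down to the start of its fixed-width bucket."""
    return (value // interval) * interval

def histogram_facet(values, interval):
    """Count values per bucket, ordered by bucket key."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return [{"key": key, "count": counts[key]} for key in sorted(counts)]

ages = [21, 24, 25, 29, 33]
age_distribution = histogram_facet(ages, 5)
```

Floor division reproduces the example from the slide: 135 with an interval of 10 lands in bucket 130.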
    • Percolator Search ... Reversed
    • What is Percolation?
      ● “Reversed search”
      ● Instead of storing documents and then searching them with queries ...
      ● ... store queries and “percolate” documents through them.
    • Percolator API
      ● Register a query
        curl -XPUT 'http://localhost:9200/_percolator/demo/testpercolator' -d '{ "query" : { "term" : { "gender" : "Female" } } }'
    • Percolator API
      ● Percolate a document against the registered queries (the index must be the one the query was registered under, here demo)
        curl -XGET 'http://localhost:9200/demo/employee/_percolate' -d '{ "doc" : { "age" : 38, "name" : "Alexis Wuckert", "gender" : "Female", "salary" : 3800 } }'
      ● Response
        { "ok" : true, "matches" : ["testpercolator"] }
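The "reversed search" idea behind percolation can be sketched in memory: queries are stored under a name, and each incoming document is checked against all of them. The `Percolator` class below is a hypothetical illustration that supports only exact field/value queries; it is not the Elasticsearch API.

```python
class Percolator:
    """Stores named queries and reports which ones an incoming document matches."""

    def __init__(self):
        self._queries = {}

    def register(self, name, field, value):
        """Store a query that matches when `field` equals `value` exactly."""
        self._queries[name] = (field, value)

    def percolate(self, doc):
        """Return the sorted names of all registered queries the document matches."""
        return sorted(
            name
            for name, (field, value) in self._queries.items()
            if doc.get(field) == value
        )

percolator = Percolator()
percolator.register("testpercolator", "gender", "Female")
percolator.register("high_salary", "salary", 9999)

matches = percolator.percolate(
    {"age": 38, "name": "Alexis Wuckert", "gender": "Female", "salary": 3800}
)
```

The document itself is never stored; only the names of the matching queries come back.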
    • References
      ● www.elasticsearch.com
      ● http://www.elasticsearch.org/guide/
      ● fogcreek.com (Video)
      ● http://www.slideshare.net/dnoble00/scaling-analytics-withelasticsearch
      ● And lots of Googling ...
    • Let’s Connect
      IntelliGrape Software (P) Ltd
      SDF L-6, NSEZ, Noida Phase 2, India
      Phone: +91-120-6493668
      Fax: +91-120-4207689
      Email: query@intelligrape.com
      /company/intelligrape, /intelligrape.software, /IntelliGrape