Example from the Docs
• Given the text “quick brown fox” the term “fox” scores…
• Term Frequency: 1.0
• Inverse Doc Freque...
Basic Relevancy
{
"size": 100,
"query": {
"filtered": {
"query": {
"match": {
"contents": "miley cyrus"
}
},
"filter": {
"...
Non-Preferenced Result Recency
Recency-Adjusted Query
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"published": {
"origin": "now",
"scale...
Preferenced Result Recency
Aggregations & Analytics
Importing Energy Data
curl -X PUT "http://localhost:9200/energy_use" --data-binary "@queries/mapping_energy.json"
curl -X...
Average Energy Use
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"average_laundry_use...
Multiple Aggregations
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"average_laundry_...
Nesting Aggregations
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"by_date": {
"term...
Stats/Extended Stats
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"by_date": {
"term...
Bucket Aggregations
• Date Histogram
• Term/Terms
• Geo*
• Significant Terms
Questions?
Use Cases?
Exploration Ideas?
https://joind.in/talk/e2e4b

Managing Your Content with Elasticsearch


Slides from my intro to search and analytics applications with Elasticsearch


  1. 1. Managing Your Content With Elasticsearch Samantha Quiñones / @ieatkillerbees
  2. 2. About Me • Software Engineer & Data Nerd since 1997 • Doing “media stuff” since 2012 • Principal @ AOL since 2014 • @ieatkillerbees • http://samanthaquinones.com
  3. 3. What We’ll Cover • Intro to Elasticsearch • CRUD • Creating Mappings • Analyzers • Basic Querying & Searching • Scoring & Relevance • Aggregations Basics
  4. 4. But First… • Download - https://www.elastic.co/downloads/elasticsearch • Clone - https://github.com/squinones/elasticsearch-tutorial.git
  5. 5. What is Elasticsearch? • Near real-time (documents are available for search quickly after being indexed) search engine powered by Lucene • Clustered for H/A and performance via federation with shards and replicas
  6. 6. What’s it Used For? • Logging (we use Elasticsearch to centralize traffic logs, exception logs, and audit logs) • Content management and search • Statistical analysis
  7. 7. Installing Elasticsearch
     $ curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.1.1/elasticsearch-2.1.1.tar.gz
     $ tar -zxvf elasticsearch*
     $ cd elasticsearch-2.1.1/bin
     $ ./elasticsearch
  8. 8. Connecting to Elasticsearch • Via Java, there are two native clients which connect to an ES cluster on port 9300 • Most commonly, we access Elasticsearch via HTTP API
  9. 9. HTTP API curl -X GET "http://localhost:9200/?pretty"
  10. 10. Data Format • Elasticsearch is a document-oriented database • All operations are performed against documents (object graphs expressed as JSON)
  11. 11. Analogues
      Elasticsearch  MySQL     MongoDB
      Index          Database  Database
      Type           Table     Collection
      Document       Row       Document
      Field          Column    Field
  12. 12. Index Madness • Index is an overloaded term. • As a verb, to index a document is to store it in an index. This is analogous to an SQL INSERT operation. • As a noun, an index is a collection of documents. • Fields within a document have inverted indexes, similar to how a column in an SQL table may have an index.
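The field-level inverted index mentioned on this slide can be pictured with a toy sketch. It is not how Lucene stores data: `build_inverted_index` and the sample documents are hypothetical, and the "analysis" is only a lowercase whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample data, loosely modelled on post titles.
docs = {1: "PHP for loop", 2: "Python for loop", 3: "PHP arrays"}
index = build_inverted_index(docs)
print(index["php"])  # IDs of the documents that contain the term "php"
```

Looking a term up is then a single dictionary access, which is why searching a field is fast once the document has been indexed.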
  13. 13. Indexing Our First Document curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'
  14. 14. Retrieving Our First Document curl -X GET "http://localhost:9200/test_document/test/1"
  15. 15. Let’s Look at Some Stackoverflow Posts! $ vi queries/bulk_insert_so_data.json
  16. 16. Bulk Insert curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
  17. 17. First Search curl -X GET "http://localhost:9200/stack_overflow/_search"
  18. 18. Query String Searches curl -X GET "http://localhost:9200/stack_overflow/_search?q=title:php"
  19. 19. Query DSL curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php" } } }'
  20. 20. Compound Queries curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "filtered": { "query" : { "match" : { "title" : "(php OR python) AND (flask OR laravel)" } }, "filter": { "range": { "score": { "gt": 3 } } } } } }'
  21. 21. Full-Text Searching curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php loop" } } }'
  22. 22. Relevancy • When searching (in query context), results are scored by a relevancy algorithm • Results are presented in order from highest to lowest score
  23. 23. Phrase Searching curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } } }'
  24. 24. Highlighting Searches curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } }, "highlight": { "fields" : { "title" : {} } } }'
  25. 25. Aggregations • Run statistical operations over your data • Also near real-time! • Complex aggregations are abstracted away behind simple interfaces— you don’t need to be a statistician
  26. 26. Analyzing Tags curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 } } } }'
  27. 27. Nesting Aggregations curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 }, "aggs": { "avg_score": { "avg": { "field": "score"} } } } } }'
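To make the nested aggregation above concrete, here is a small in-memory sketch of what it computes: bucket the posts by tag, then average the score inside each bucket. The function name and the sample posts are hypothetical; Elasticsearch does this work across shards, not in a Python loop.

```python
from collections import defaultdict

def avg_score_by_tag(posts):
    """Bucket posts by tag (like a terms aggregation), then average each bucket's score."""
    buckets = defaultdict(list)
    for post in posts:
        for tag in post["tags"]:
            buckets[tag].append(post["score"])
    return {
        tag: {"doc_count": len(scores), "avg_score": sum(scores) / len(scores)}
        for tag, scores in buckets.items()
    }

# Hypothetical sample posts.
posts = [
    {"tags": ["php", "loops"], "score": 4},
    {"tags": ["php"], "score": 2},
    {"tags": ["python"], "score": 5},
]
result = avg_score_by_tag(posts)
print(result["php"])
```

A post with several tags lands in several buckets, which mirrors how a terms aggregation treats a multi-valued field.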
  28. 28. Break Time!
  29. 29. Under the Hood • Elasticsearch is designed from the ground up to run in a distributed fashion. • Indices (collections of documents) are partitioned into shards. • Shards can be stored on a single node or on multiple nodes. • Shards are balanced across the cluster to improve performance • Shards are replicated for redundancy and high availability
  30. 30. What is a Cluster? • One or more nodes (servers) that work together to… • serve a dataset that exceeds the capacity of a single server… • provide federated indexing (writes) and searching (reads)… • provide H/A through sharding and replication of data
  31. 31. What are Nodes? • Individual servers within a cluster • Can provide indexing and searching capabilities
  32. 32. What is an Index? • An index is logically a collection of documents, roughly analogous to a database in MySQL • An index is in reality a namespace that points to one or more physical shards which contain data • When indexing a document, if the specified index does not exist, it will be created automatically
  33. 33. What are Shards? • Low-level units that hold a slice of available data • A shard represents a single instance of Lucene and is a fully functional, self-contained search engine • Shards are either primaries or replicas and are assigned to nodes
  34. 34. What is Replication? • Shards can have replicas • Replicas primarily provide redundancy for when shards/nodes fail • A replica should not be allocated on the same node as the shard it replicates
  35. 35. Default Topology • 5 primary shards per index • 1 replica per shard
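The arithmetic behind the default topology can be sketched as follows. The function names are hypothetical; the only inputs are the two defaults listed above and the rule that a replica does not share a node with the shard it copies.

```python
def total_shards(primaries=5, replicas_per_primary=1):
    """Total shard copies for one index: every primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)

def min_nodes_for_full_allocation(replicas_per_primary=1):
    """Copies of the same shard need distinct nodes, so one node per copy."""
    return 1 + replicas_per_primary

print(total_shards())                   # shard copies for one default index
print(min_nodes_for_full_allocation())  # nodes needed before every replica can be assigned
```

On a single-node install, such as the one used in this tutorial, the replicas therefore stay unassigned.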
  36. 36. Clustering & Replication [diagram: two nodes holding primary shards P1-P5 and replica shards R1-R5]
  37. 37. Cluster Health curl -X GET "http://localhost:9200/_cluster/health" curl -X GET "http://localhost:9200/_cat/health?v"
  38. 38. _cat API • Display human-readable information about parts of the ES system • Provides some limited documentation of functions
  39. 39. aliases
      > $ http GET ':9200/_cat/aliases?v'
      alias        index               filter routing.index routing.search
      posts        posts_561729df8ce4e *      -             -
      posts.public posts_561729df8ce4e *      -             -
      posts.write  posts_561729df8ce4e -      -             -
      Display all configured aliases
  40. 40. allocation
      > $ http GET ':9200/_cat/allocation?v'
      shards disk.used disk.avail disk.total disk.percent host
      33     2.6gb     21.8gb     24.4gb     10           host1
      33     3gb       21.4gb     24.4gb     12           host2
      34     2.6gb     21.8gb     24.4gb     10           host3
      Show how many shards are allocated per node, with disk utilization info
  41. 41. count
      > $ http GET ':9200/_cat/count?v'
      epoch      timestamp count
      1453790185 06:36:25  182763
      > $ http GET ':9200/_cat/count/posts?v'
      epoch      timestamp count
      1453790467 06:41:07  164169
      > $ http GET ':9200/_cat/count/posts.public?v'
      epoch      timestamp count
      1453790472 06:41:12  164169
      Display a count of documents in the cluster, or a specific index
  42. 42. fielddata
      > $ http -b GET ':9200/_cat/fielddata?v'
      id                     host  ip            node  total site_id published
      7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1 1.1mb 170.1kb 996.5kb
      __xrpsKAQW6yyCY8luLQdQ host2 10.97.180.138 node2 1.6mb 329.3kb 1.3mb
      bdoNNXHXRryj22YqjnqECw host3 10.97.181.190 node3 1.1mb 154.7kb 991.7kb
      Shows how much memory is allocated to fielddata (metadata used for sorts)
  43. 43. health
      > $ http -b GET ':9200/_cat/health?v'
      epoch      timestamp cluster              status node.total node.data shards pri relo init unassign pending_tasks
      1453829723 17:35:23  ampehes_prod_cluster green  3          3         100    50  0    0    0        0
  44. 44. indices
      > $ http -b GET 'eventhandler-prod.elasticsearch.amppublish.aws.aol.com:9200/_cat/indices?v'
      health status index               pri rep docs.count docs.deleted store.size pri.store.size
      green  open   posts_561729df8ce4e 5   1   468629     20905        4gb        2gb
      green  open   slideshows          5   1   3893       6            86mb       43mb
  45. 45. master
      > $ http -b GET ':9200/_cat/master?v'
      id                     host  ip            node
      7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1
  46. 46. nodes
      > $ http -b GET ':9200/_cat/nodes?v'
      host      ip        heap.percent ram.percent load node.role master name
      127.0.0.1 127.0.0.1 50           100         2.47 d         *      Mentus
  47. 47. pending tasks
      % curl 'localhost:9200/_cat/pending_tasks?v'
      insertOrder timeInQueue priority source
      1685        855ms       HIGH     update-mapping [foo][t]
      1686        843ms       HIGH     update-mapping [foo][t]
      1693        753ms       HIGH     refresh-mapping [foo][[t]]
      1688        816ms       HIGH     update-mapping [foo][t]
      1689        802ms       HIGH     update-mapping [foo][t]
      1690        787ms       HIGH     update-mapping [foo][t]
      1691        773ms       HIGH     update-mapping [foo][t]
  48. 48. shards
      > $ http -b GET ':9200/_cat/shards?v'
      index               shard prirep state   docs  store   ip            node
      posts_561729df8ce4e 2     r      STARTED 94019 410.5mb 10.97.180.138 host1
      posts_561729df8ce4e 2     p      STARTED 94019 412.7mb 10.97.181.190 host2
      posts_561729df8ce4e 0     p      STARTED 93307 413.6mb 10.97.183.146 host3
      posts_561729df8ce4e 0     r      STARTED 93307 415mb   10.97.180.138 host1
      posts_561729df8ce4e 3     p      STARTED 94182 407.1mb 10.97.183.146 host2
      posts_561729df8ce4e 3     r      STARTED 94182 403.4mb 10.97.180.138 host1
      posts_561729df8ce4e 1     r      STARTED 94130 447.1mb 10.97.180.138 host1
      posts_561729df8ce4e 1     p      STARTED 94130 447mb   10.97.181.190 host2
      posts_561729df8ce4e 4     r      STARTED 93299 421.5mb 10.97.183.146 host3
      posts_561729df8ce4e 4     p      STARTED 93299 398.8mb 10.97.181.190 host2
  49. 49. segments
      > $ http -b GET ':9200/_cat/segments?v'
      index               shard prirep ip            segment generation docs.count docs.deleted size    size.memory committed searchable version compound
      posts_561726fecd9c6 0     p      10.97.183.146 _a      10         24         0            227.7kb 69554       true      true       4.10.4  true
      posts_561726fecd9c6 0     p      10.97.183.146 _b      11         108        0            659.1kb 103242      true      true       4.10.4  false
      posts_561726fecd9c6 0     p      10.97.183.146 _c      12         7          0            90.7kb  54706       true      true       4.10.4  true
      posts_561726fecd9c6 0     p      10.97.183.146 _d      13         6          0            82.2kb  49706       true      true       4.10.4  true
      posts_561726fecd9c6 0     p      10.97.183.146 _e      14         8          0            119kb   67162       true      true       4.10.4  true
      posts_561726fecd9c6 0     p      10.97.183.146 _f      15         1          0            35.9kb  32122       true      true       4.10.4  true
      posts_561726fecd9c6 0     r      10.97.180.138 _a      10         24         0            227.7kb 69554       true      true       4.10.4  true
      posts_561726fecd9c6 0     r      10.97.180.138 _b      11         108        0            659.1kb 103242      true      true       4.10.4  false
  50. 50. CRUD Operations
  51. 51. Document Model • Documents represent objects • By default, all fields in all documents are analyzed, and indexed
  52. 52. Metadata • _index - The index in which a document resides • _type - The class of object that a document represents • _id - The document’s unique identifier. Auto-generated when not provided
  53. 53. Retrieving Documents curl -X GET "http://localhost:9200/test_document/test/1" curl -X HEAD "http://localhost:9200/test_document/test/1" curl -X HEAD "http://localhost:9200/test_document/test/2"
  54. 54. Updating Documents curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name", "conference": "php benelux" }' curl -X GET "http://localhost:9200/test_document/test/1"
  55. 55. Explicit Creates curl -X PUT "http://localhost:9200/test_document/test/1/_create" -d '{ "name": "test_name", "conference": "php benelux" }'
  56. 56. Auto-Generated IDs curl -X POST "http://localhost:9200/test_document/test" -d '{ "name": "test_name", "conference": "php benelux" }'
  57. 57. Deleting Documents curl -X DELETE "http://localhost:9200/test_document/test/1"
  58. 58. Bulk API • Perform many operations in a single request • Efficient batching of actions • Bulk queries take the form of a stream of single-line JSON objects that define actions and document bodies
  59. 59. Bulk Actions • create - Index a document only if it doesn’t already exist • index - Index a document, replacing it if it exists • update - Apply a partial update to a document • delete - Delete a document
  60. 60. Bulk API Format { action: { metadata }}\n { request body }\n { action: { metadata }}\n { request body }\n
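A minimal sketch of assembling a request body in this format, assuming every document is sent with the `index` action. The helper name `build_bulk_body` and the second sample document are hypothetical; the index, type and field names are the ones used in the earlier slides.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Serialize (id, document) pairs as newline-delimited JSON for the `index` bulk action."""
    lines = []
    for doc_id, doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # action and metadata line
        lines.append(json.dumps(doc))     # request body line
    # Every line, including the last one, is terminated by a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body("test_document", "test", [
    (1, {"name": "test_name"}),
    (2, {"name": "other_name"}),  # hypothetical second document
])
print(body, end="")
```

The result can be written to a file and sent with `--data-binary`, as in the Bulk Insert slide; `--data-binary` matters because it preserves the newlines.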
  61. 61. Sizing Bulk Requests • Balance quantity of documents with size of documents • The docs put the sweet spot at 5-15 MB per request • AOL Analytics Cluster indexes 5000 documents per batch (approx 7MB)
  62. 62. Searching Documents • Structured queries - queries against concrete fields like “title” or “score” which return specific documents. • Full-text queries - queries that find documents which match a search query and return them sorted by relevance
  63. 63. Search Elements • Mappings - Define how data in fields is interpreted • Analysis - How text is parsed and processed to make it searchable • Query DSL - Elasticsearch’s query language
  64. 64. About Queries • Leaf Queries - Searches for a value in a given field. These queries are standalone. Examples: match, range, term • Compound Queries - Combinations of leaf queries and other compound queries which combine operations together either logically (e.g. bool queries) or alter their behavior (e.g. score queries)
  65. 65. Empty Search curl -X GET "http://localhost:9200/stack_overflow/_search" curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query": { "match_all": {} } }'
  66. 66. Timing Out Searches curl -X GET "http://localhost:9200/stack_overflow/_search?timeout=1s" curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "timeout": "1s", "query": { "match_all": {} } }'
  67. 67. Multi-Index/Type Searches curl -X GET "http://localhost:9200/test_document,stack_overflow/_search"
  68. 68. Multi-Index Use Cases • Dated indices for logging • Roll-off indices for content-aging • Analytic roll-ups
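As an illustration of the dated-indices use case, the sketch below builds a multi-index search URL covering the last few days of logs. The `logs-YYYY.MM.DD` naming scheme and both function names are assumptions made for the example, not something the slides prescribe.

```python
from datetime import date, timedelta

def dated_indices(prefix, end, days):
    """Names of the daily indices covering the `days` days up to and including `end`."""
    return [
        f"{prefix}-{(end - timedelta(days=offset)).strftime('%Y.%m.%d')}"
        for offset in range(days - 1, -1, -1)
    ]

def search_url(indices, host="http://localhost:9200"):
    """Multi-index search URL: index names are joined with commas."""
    return f"{host}/{','.join(indices)}/_search"

names = dated_indices("logs", date(2016, 1, 26), 3)
url = search_url(names)
print(url)
```

Rolling old content off then amounts to deleting the oldest dated index instead of deleting individual documents.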
  69. 69. Pagination curl -X GET "http://localhost:9200/stack_overflow/_search?size=5&from=5" curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 5, "from": 5, "query": { "match_all": {} } }'
  70. 70. Pagination Concerns • Since searches are distributed across multiple shards, paged queries must be sorted at each shard, combined, and resorted • The cost of paging grows with page depth: each shard must produce from + size results, and the coordinating node then sorts all of them (number of shards × (from + size)) • It is wise to limit how many pages of results can be returned
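A back-of-the-envelope sketch of why deep pages are expensive, assuming each shard has to produce its top `from + size` hits and the coordinating node merges the results of every shard. The function name is hypothetical, and 5 shards is the default topology from the earlier slide.

```python
def paging_cost(page, size, shards=5):
    """Return (hits each shard must produce, hits the coordinating node must sort) for a 1-based page."""
    from_ = (page - 1) * size
    per_shard = from_ + size
    return per_shard, per_shard * shards

for page in (1, 10, 1000):
    print(page, paging_cost(page, size=10))
```

Page 1000 of a 10-per-page listing makes the coordinating node sort 50,000 hits in order to return 10, which is the reason for capping how deep users can page.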
  71. 71. Full Text Queries • match - Basic term matching query • multi_match - Match which spans multiple fields • common_terms - Match query which gives preference to uncommon words • query_string - Match documents using a search “mini-dsl” • simple_query_string - A simpler version of query_string that never throws exceptions, suitable for exposing to users
  72. 72. Term Queries • term - Search for an exact value • terms - Search for any of several exact values in a field • range - Find documents where a value is in a certain range • exists - Find documents that have any non-null value in a field • missing - Inversion of `exists` • prefix - Match terms that begin with a string • wildcard - Match terms with a wildcard • regexp - Match terms against a regular expression • fuzzy - Match terms with configurable fuzziness
  73. 73. Compound Queries • constant_score - Wraps a query in filter context, giving all results a constant score • bool - Combines multiple leaf queries with `must`, `should`, `must_not` and `filter` clauses • dis_max - Similar to bool, but creates a union of subquery results scoring each document with the maximum score of the query that produced it • function_score - Modifies the scores of documents returned by a query. Useful for altering the distribution of results based on recency, popularity, etc. • boosting - Takes a `positive` and `negative` query, returning the results of `positive` while reducing the scores of documents that also match `negative` • filtered - Combines a query clause in query context with one in filter context • limit - Perform the query over a limited number of documents in each shard
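A `bool` query with a `filter` clause can express the same query-plus-filter combination as `filtered`, and as of Elasticsearch 2.0 `filtered` is deprecated in favor of that form. The sketch below only builds the request body; the helper name is hypothetical, while the `title` and `score` fields come from the Stack Overflow examples earlier in the deck.

```python
import json

def scored_and_filtered(field, text, range_field, gt):
    """bool query: the `must` clause is scored, the `filter` clause only includes or excludes."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: text}},
                "filter": {"range": {range_field: {"gt": gt}}},
            }
        }
    }

body = scored_and_filtered("title", "php loop", "score", 3)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to `curl -d` against the `_search` endpoint in the same way as the other Query DSL examples.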
  74. 74. What are Mappings? • Similar to schemas, they define the types of data found in fields • Determines how individual fields are analyzed & stored • Sets the format of date fields • Sets rules for mapping dynamic fields
  75. 75. Mapping Types • Indices have one or more mapping types which group documents logically. • Types contain meta fields, which can be used to customize metadata like _index, _id, _type, and _source • Types can also list fields that have consistent structure across types.
  76. 76. Data Types • Scalar Values - string, long, double, boolean • Special Scalars - date, ip • Structural Types - object, nested • Special Types - geo_shape, geo_point, completion • Compound Types - string arrays, nested objects
  77. 77. Dynamic vs Explicit Mapping • Dynamic fields are not defined prior to indexing • Elasticsearch selects the most likely type for dynamic fields, based on configurable rules • Explicit fields are defined exactly prior to indexing • Types cannot accept data that is the wrong type for an explicit mapping
  78. 78. Shared Fields • Fields that are defined in multiple mapping types must be identical if: • They have the same name • Live in the same index • Map to the same field internally
  79. 79. Examining Mappings curl -X GET "http://localhost:9200/stack_overflow/post/_mapping"
  80. 80. Dynamic Mappings • Mappings are generated when a type is created, if no mapping was previously specified. • Elasticsearch is good at identifying fields much of the time, but it’s far from perfect! • Fields can contain basic data-types, but importantly, mappings optimize a field for either structured (exact) or full-text searching
  81. 81. Structured Data vs Full Text • Exact values contain exact strings which are not subject to natural language interpretation. • Full-text values must be interpreted in the context of natural language
  82. 82. Exact Value • “samantha@tembies.com” is an email address in all contexts
  83. 83. Natural Language • “us” can be interpreted differently in natural language • Abbreviation for “United States” • The English dative personal pronoun • An alternative symbol for µs • The French word us
  84. 84. Analyzing Text • Elasticsearch is optimized for full text search • Text is analyzed in a two-step process • First, text is tokenized into individual terms • Second, terms are normalized through a filter
  85. 85. Analyzers • Analyzers perform the analysis process • Character filters clean up text, removing or modifying the text • Tokenizers break the text down into terms • Token filters modify, remove, or add terms
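The three stages can be pictured as a small pipeline. This is a toy stand-in written with regular expressions, not Elasticsearch's implementation; all four function names are hypothetical and the sample text is adapted from the `_analyze` examples.

```python
import re

def html_strip(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text):
    """Tokenizer: split the text into runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: normalize the case of every token."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filter."""
    return lowercase(word_tokenizer(html_strip(text)))

tokens = analyze("<b>Reverse</b> text with strrev($text)!")
print(tokens)
```

Swapping any one stage, for example a tokenizer that splits only on whitespace, changes which terms end up in the index, which is the difference between the analyzers covered next.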
  86. 86. Standard Analyzer • General purpose analyzer that works for most natural language. • Splits text on word boundaries, removes punctuation, and lowercases all tokens.
  87. 87. Standard Analyzer curl -X GET 'http://localhost:9200/_analyze?analyzer=standard&text=Reverse+text+with+strrev($text)!'
  88. 88. Whitespace Analyzer • Analyzer that splits on whitespace only; tokens keep their original case and punctuation
  89. 89. Whitespace Analyzer curl -X GET 'http://localhost:9200/_analyze?analyzer=whitespace&text=Reverse+text+with+strrev($text)!'
  90. 90. Keyword Analyzer • Tokenizes the entire text as a single string. • Used for things that should be kept whole, like ID numbers, postal codes, etc
  91. 91. Keyword Analyzer curl -X GET 'http://localhost:9200/_analyze?analyzer=keyword&text=Reverse+text+with+strrev($text)!'
  92. 92. Language Analyzers • Analyzers optimized for specific natural languages. • Reduce tokens to stems (jumper, jumped → jump)
  93. 93. Language Analyzers curl -X GET 'http://localhost:9200/_analyze?analyzer=english&text=Reverse+text+with+strrev($text)!'
  94. 94. Analyzers • Analyzers are applied when documents are indexed • Analyzers are applied when a full-text search is performed against a field, in order to produce the correct set of terms to search for
  95. 95. Character Filters • html_strip - Removes HTML from text • mapping - Filter based on a map of original → new ( { “ph”: “f” }) • pattern_replace - Similar to mapping, using regular expressions
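To make the difference between the two substitution-style filters concrete, here is a rough Python sketch. The helper names are hypothetical and the behavior is simplified (the mapping is applied as plain sequential string replacement), so treat it as an illustration of the idea rather than of Elasticsearch's actual filters.

```python
import re

def mapping_filter(text, mappings):
    # "mapping": replace each literal key with its value.
    for original, new in mappings.items():
        text = text.replace(original, new)
    return text

def pattern_replace_filter(text, pattern, replacement):
    # "pattern_replace": the same idea, driven by a regular expression.
    return re.sub(pattern, replacement, text)

mapped = mapping_filter("phone photo", {"ph": "f"})
replaced = pattern_replace_filter("123-456-789", r"(\d+)-(?=\d)", r"\1_")

print(mapped)    # fone foto
print(replaced)  # 123_456_789
```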
  96. 96. Index Templates • Template mappings that are applied to newly created indices • Templates also contain index configuration information • Powerful when combined with dated indices
  97. 97. Scoring • Scoring is based on a boolean model and a scoring function • Boolean model applies AND/OR logic to an inverted index to produce a list of matching documents
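The boolean model can be illustrated with a tiny inverted index. The documents and helper names below are made up for the example, and the whitespace split stands in for a real analyzer.

```python
from collections import defaultdict

docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

# Inverted index: each term maps to the set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def match_all(terms):
    # AND: a document must contain every term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def match_any(terms):
    # OR: a document must contain at least one term.
    return set().union(*(index.get(t, set()) for t in terms))

both = match_all(["quick", "fox"])
either = match_any(["brown", "fox"])
print(sorted(both))    # [1, 3]
print(sorted(either))  # [1, 2, 3]
```

The boolean step only decides which documents match; ranking them is left to the scoring function.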
  98. 98. Term Frequency • Terms that appear frequently in a document increase the document’s relevancy score. • term_frequency(term in document) = √number_of_appearances
  99. 99. Inverse Document Frequency • Terms that appear in many documents reduce a document’s relevancy score • inverse_doc_frequency(term) = 1 + log(number_of_docs / (frequency + 1))
  100. 100. Field Length Normalization • Terms that appear in shorter fields increase the relevancy of a document. • norm(document) = 1 / √number_of_terms
  101. 101. Example from the Docs • Given the text “quick brown fox” the term “fox” scores… • Term Frequency: 1.0 • Inverse Doc Frequency: 0.30685282 • Field Norm: 0.5 • Score: 0.15342641
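The numbers on this slide can be reproduced from the three formulas on the preceding slides. The sketch below assumes an index holding a single document that contains "fox" once (number_of_docs = 1, frequency = 1) and a natural logarithm; both are assumptions chosen because they reproduce the slide's figures. The field norm is taken from the slide as 0.5 rather than computed, since 1 / √3 is roughly 0.577; the difference is presumably due to the norm being stored at reduced precision.

```python
import math

def term_frequency(number_of_appearances):
    return math.sqrt(number_of_appearances)

def inverse_doc_frequency(number_of_docs, frequency):
    return 1 + math.log(number_of_docs / (frequency + 1))

def field_norm(number_of_terms):
    return 1 / math.sqrt(number_of_terms)

tf = term_frequency(1)             # "fox" appears once
idf = inverse_doc_frequency(1, 1)  # assumed: one document, one match
norm = 0.5                         # value reported on the slide
score = tf * idf * norm

print(round(idf, 8))            # 0.30685282
print(round(score, 8))          # 0.15342641
print(round(field_norm(3), 3))  # 0.577 (unrounded norm for three terms)
```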
  102. 102. Basic Relevancy { "size": 100, "query": { "filtered": { "query": { "match": { "contents": "miley cyrus" } }, "filter": { "and": [ { "terms": { "site_id": [ 698 ] } } ] } } } }
  103. 103. Non-Preferenced Result Recency
  104. 104. Recency-Adjusted Query { "query": { "function_score": { "functions": [ { "gauss": { "published": { "origin": "now", "scale": "10d", "offset": "1d", "decay": 0.3 } } } ], "query": { "filtered": { "query": { "match": { "contents": "miley cyrus" } }, "filter": { "and": [ { "terms": { "site_id": [ 698 ] } } ] } } } } } }
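To see what the gauss parameters in this query do, the decay curve can be computed by hand. The sketch below follows the Gaussian decay formula as I understand it from the function_score documentation, with distances measured in days; the function name and the day-based units are assumptions for the example.

```python
import math

def gauss_decay(age_days, scale=10.0, offset=1.0, decay=0.3):
    # Documents within `offset` of the origin are not penalized at all.
    effective = max(0.0, abs(age_days) - offset)
    # sigma^2 is chosen so the curve equals `decay` at offset + scale.
    sigma_sq = -(scale ** 2) / (2 * math.log(decay))
    return math.exp(-(effective ** 2) / (2 * sigma_sq))

for age in (0.5, 6, 11, 21):
    print(age, round(gauss_decay(age), 4))
```

With these settings a post published within the last day keeps its full score, a post 11 days old is multiplied by about 0.3, and a post 21 days old by about 0.0081, assuming the default behavior of multiplying the query score by the function value.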
  105. 105. Preferenced Result Recency
  106. 106. Aggregations & Analytics
  107. 107. Importing Energy Data curl -X PUT "http://localhost:9200/energy_use" --data-binary "@queries/mapping_energy.json" curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_energy_data.json" curl -X GET "http://localhost:9200/energy_use/_search"
  108. 108. Average Energy Use curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "average_laundry_use": { "avg": { "field": "laundry" } }, "average_kitchen_use": { "avg": { "field": "kitchen" } }, "average_heater_use": { "avg": { "field": "heater" } }, "average_other_use": { "avg": { "field": "other" } } } }'
  109. 109. Multiple Aggregations curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "average_laundry_use": { "avg": { "field": "laundry" } }, "min_laundry_use": { "min": { "field": "laundry"} }, "max_laundry_use": { "max": { "field": "laundry"} } } }'
  110. 110. Nesting Aggregations curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "by_date": { "terms": { "field": "date" }, "aggs": { "average_laundry_use": { "avg": { "field": "laundry" } }, "min_laundry_use": { "min": { "field": "laundry"} }, "max_laundry_use": { "max": { "field": "laundry"} } } } } }'
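Conceptually, the nested aggregation first groups documents into one bucket per date and then computes the metrics inside each bucket. A minimal Python sketch of that idea follows; the sample readings are invented for illustration and are not taken from the energy data set.

```python
from collections import defaultdict

# Hypothetical readings standing in for documents in the energy_use index.
readings = [
    {"date": "2016-01-01", "laundry": 2.0},
    {"date": "2016-01-01", "laundry": 4.0},
    {"date": "2016-01-02", "laundry": 1.0},
    {"date": "2016-01-02", "laundry": 3.0},
    {"date": "2016-01-02", "laundry": 5.0},
]

# Outer "terms" aggregation: one bucket per distinct date.
buckets = defaultdict(list)
for reading in readings:
    buckets[reading["date"]].append(reading["laundry"])

# Inner metric aggregations: computed separately within each bucket.
by_date = {
    date: {
        "doc_count": len(values),
        "average_laundry_use": sum(values) / len(values),
        "min_laundry_use": min(values),
        "max_laundry_use": max(values),
    }
    for date, values in buckets.items()
}

for date, metrics in sorted(by_date.items()):
    print(date, metrics)
```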
  111. 111. Stats/Extended Stats curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "by_date": { "terms": { "field": "date" }, "aggs": { "laundry_stats": { "extended_stats": { "field": "laundry" } } } } } }'
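As a rough guide to what extended_stats reports for a bucket, the sketch below computes the same kinds of values by hand. The key names mirror those I would expect in the response, and the variance is assumed to be the population variance; check both against the output of your Elasticsearch version.

```python
import math

def extended_stats(values):
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Assumed: population variance (divide by count, not count - 1).
    variance = sum_of_squares / count - avg ** 2
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }

stats = extended_stats([1.0, 3.0, 5.0])
print(stats)
```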
  112. 112. Bucket Aggregations • Date Histogram • Term/Terms • Geo* • Significant Terms
  113. 113. Questions? Use Cases? Exploration Ideas? https://joind.in/talk/e2e4b
