Data Exploration with Elasticsearch

A brief introduction to Elasticsearch and the many possibilities Elasticsearch offers in terms of search, data exploration and data aggregation. The presentation includes a brief introduction to search engine fundamentals and core features of Elasticsearch. The talk focuses on how we can navigate structured and unstructured data for search as well as aggregating and visualizing data for analytical purposes.

The talk aims to demonstrate case studies beyond traditional full-text search and, hopefully, to show that Elasticsearch can help us build much more than just a search engine.

Published in: Technology

  1. Data exploration with Elasticsearch. Aleksander M. Stensby, Monokkel A/S
  2. • Aleksander M. Stensby • CEO in Monokkel AS • Previously COO in Integrasco AS • Working with search and data analysis since 2004 • www.monokkel.io
  3. • Managing director (CEO) of Monokkel AS • Previously COO of Integrasco AS • Persistence, processing and presentation of data
  4. Agenda • Search fundamentals primer • Intro to Elasticsearch • Search, filter and aggregate!
  5. Agenda • Search fundamentals primer • Intro to Elasticsearch • Search, filter and aggregate! … and some bonus visualisation!
  6. What we will not cover today… • All the different searches, filters and aggregations available in Elasticsearch :) • Details on tokenization, analyzers… • Elasticsearch in production and performance tuning… • Data integration
  7. Search fundamentals 101
  8. Document
  9. Fields (key/value): Title, Content, Signature
  10. “We know what we are, but know not what we may be.”
  11. Term vector for “We know what we are, but know not what we may be.” Term (frequency): we (3), know (2), what (2), are (1), but (1), not (1), may (1), be (1)
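The term vector on slide 11 can be reproduced with a few lines of Python. This is a minimal sketch, assuming a toy tokenizer that lowercases the text and keeps runs of letters; the function name `term_vector` is hypothetical and not part of the talk.

```python
import re
from collections import Counter


def term_vector(text):
    """Return a Counter mapping each lowercased term to its frequency."""
    # Toy tokenizer (an assumption): lowercase, then keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


quote = "We know what we are, but know not what we may be."
vector = term_vector(quote)

# Print terms from most to least frequent.
for term, frequency in vector.most_common():
    print(term, frequency)
```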
  12. Index
  13. “We were born to run” “No one told you when to run” “Some were born to sing the blues”
  14. The Inverted Index
      Documents: 1. “We were born to run” 2. “No one told you when to run” 3. “Some were born to sing the blues”
      Term (dictionary)   Frequency   Documents (postings)
      blues               1           3
      born                2           1,3
      no                  1           2
      one                 1           2
      run                 2           1,2
      sing                1           3
      some                1           3
      the                 1           3
      to                  3           1,2,3
      told                1           2
      we                  1           1
      were                2           1,3
      when                1           2
      you                 1           2
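The dictionary and postings on slide 14 can be built from the three example documents as follows. This is a minimal sketch, assuming whitespace tokenization and lowercasing; `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

documents = {
    1: "We were born to run",
    2: "No one told you when to run",
    3: "Some were born to sing the blues",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the dictionary alphabetically, as on the slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(documents)
for term, doc_ids in index.items():
    # The frequency column is the number of documents containing the term.
    print(f"{term:6} {len(doc_ids)}  {doc_ids}")
```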
  15. Searching: born. 1. “We were born to run” 2. “No one told you when to run” 3. “Some were born to sing the blues”
  16. The Boolean Model (dictionary and postings as on slide 14). Query: born
  17. Same dictionary and postings. Query: born blues
  18. Same dictionary and postings. Query: born OR blues
  19. Same dictionary and postings. Query: born AND blues
  20. Same dictionary and postings. Query: born NOT blues
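The Boolean queries on slides 18 to 20 map directly onto set operations over the postings lists. The sketch below assumes an excerpt of the postings from slide 14 held in a plain dictionary; the `lookup` helper is hypothetical.

```python
# Excerpt of the postings from slide 14: term -> set of document ids.
postings = {
    "born": {1, 3},
    "blues": {3},
    "run": {1, 2},
}


def lookup(term):
    """Return the posting set for a term, or an empty set if it is unknown."""
    return postings.get(term, set())


born_or_blues = lookup("born") | lookup("blues")    # union
born_and_blues = lookup("born") & lookup("blues")   # intersection
born_not_blues = lookup("born") - lookup("blues")   # difference

print(sorted(born_or_blues), sorted(born_and_blues), sorted(born_not_blues))
```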
  21. Relevancy and Ranking • Term frequency • Inverse document frequency • Field-length norm
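The three factors on slide 21 can be illustrated with a simplified score. This sketch assumes the formulas of Lucene's classic TF/IDF similarity (square root of the term count, a logarithmic inverse document frequency, and an inverse square root of the field length); the slide itself gives no formulas, the real scoring function has more parts, and the default similarity differs between Elasticsearch versions. All function names are hypothetical.

```python
import math

docs = {
    1: "We were born to run",
    2: "No one told you when to run",
    3: "Some were born to sing the blues",
}


def term_frequency(freq):
    # More occurrences in the field raise the score, with diminishing returns.
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    # Terms that occur in fewer documents weigh more.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    # A match in a shorter field weighs more than one in a longer field.
    return 1.0 / math.sqrt(num_terms)


def score(term, doc_id):
    """Simplified per-term score: tf * idf * norm (not the full Lucene formula)."""
    tokens = docs[doc_id].lower().split()
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for text in docs.values() if term in text.lower().split())
    return (term_frequency(freq)
            * inverse_document_frequency(len(docs), doc_freq)
            * field_length_norm(len(tokens)))


for doc_id in docs:
    print(doc_id, round(score("born", doc_id), 3))
```

With these assumptions, "born" scores higher in document 1 than in document 3 only because document 1 is shorter.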
  22. Similarity. 1. “We were born to run” 2. “No one told you when to run” 3. “Some were born to sing the blues” [Figure: vector space plot with axes “born” and “blues”.] query: [2,5]; doc 1: [2,0]; doc 2: [0,0]; doc 3: [2,5]
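The vectors on slide 22 can be ranked against the query vector. This sketch assumes cosine similarity as the measure, which the slide does not name explicitly; the vectors are taken from the slide.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [2, 5]
doc_vectors = {"doc 1": [2, 0], "doc 2": [0, 0], "doc 3": [2, 5]}

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```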
  23. Search fundamentals 101! • Tokenization • Normalization (case, stop words etc.) • Stemming, synonyms
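The analysis steps on slide 23 can be chained into a toy analyzer. This is a sketch under stated assumptions: the stop-word list and the suffix-stripping "stemmer" are made up for illustration and are far cruder than the analyzers that ship with Lucene; synonyms are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "to", "a", "an", "and"}


def stem(token):
    """Naive stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = re.findall(r"[A-Za-z]+", text)               # tokenization
    tokens = [t.lower() for t in tokens]                  # case normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stem(t) for t in tokens]                      # stemming


print(analyze("Some were born to sing the blues"))
```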
  24. Brief history of Elasticsearch: Shay Banon -> Abstraction layer on top of Lucene -> Compass -> Rewritten: high performance, real-time, distributed -> Elasticsearch -> February 2010
  25. Elasticsearch • Open source search engine, written in Java • Built on top of Lucene • Simple, coherent, RESTful API • Distributed, scalable search engine with real-time analytics
  26. “more useable and concise API, scalability, and operational tools on top of Lucene’s search implementation”
  27. Elasticsearch nodes and cluster [diagram labels: node, node, node, cluster]
  28. Elasticsearch shards, nodes [diagram labels: index, shard, node]
  29. Lucene index and segments [diagram labels: segments, Lucene index]
  30. Much more than just search! • Real-time analytics • Log analysis • Prediction modelling • Recommendations
  31. DEMO in 5 minutes
  32. DEMO • Install Elasticsearch • Load in some data • Run a very basic search
  33. DEMO in 15 minutes
  34. Easy peasy… • http://www.elasticsearch.org/download • bin/elasticsearch or bin/elasticsearch.bat on Windows • http://localhost:9200/ or curl -X GET http://localhost:9200/
  35. Easy peasy lemon squeezy!
  36. http://localhost:9200/<index>/<type>/[<id>]
  37. Indexing data: curl -XPUT 'http://localhost:9200/monokkel/user/aleks' -d '{ "name" : "Aleksander Stensby" }'
  38. Indexing data • shakespeare.json: http://www.elasticsearch.org/guide/en/kibana/current/snippets/shakespeare.json • curl -XPUT localhost:9200/_bulk --data-binary @shakespeare.json
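The `_bulk` endpoint expects newline-delimited JSON: an action line followed by the document source, with a trailing newline at the end of the body. The sketch below builds such a body in memory. The helper name `to_bulk_body`, the type name `line` and the sample documents are assumptions for illustration (they are not taken from shakespeare.json), and the `_type` metadata field reflects the Elasticsearch versions of the talk's era.

```python
import json


def to_bulk_body(index, doc_type, docs):
    """Build a bulk request body from (id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line: tells Elasticsearch where to index the next line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents using field names from the mapping slide.
sample_docs = [
    (1, {"line_id": 1, "speaker": "SPEAKER A", "text_entry": "example line one"}),
    (2, {"line_id": 2, "speaker": "SPEAKER B", "text_entry": "example line two"}),
]

body = to_bulk_body("shakespeare", "line", sample_docs)
print(body)
```

The resulting string could be written to a file and sent with the curl command from slide 38.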
  39. _search can be appended to http://localhost:9200/<index>/<type>/, http://localhost:9200/<index>/ or http://localhost:9200/
  40. Mapping • Is it a number? String? Date? • Combining multiple fields? • Default values? • Stored? • Analyzed? • How should we tokenize/analyse/normalize the field?
  41. Mapping: curl -XPUT http://localhost:9200/shakespeare -d ' { "mappings" : { "_default_" : { "properties" : { "speaker" : { "type" : "string", "index" : "not_analyzed" }, "play_name" : { "type" : "string", "index" : "not_analyzed" }, "line_id" : { "type" : "integer" }, "speech_number" : { "type" : "integer" } } } } } ';
  42. The Query DSL: { "query": {YOUR_QUERY_HERE} }
  43. Match Query: { "query": { "match": { "text_entry": "romeo" } } }
  44. Multi Match Query: { "query": { "multi_match": { "query": "romeo", "fields": [ "text_entry", "speaker" ] } } }
  45. Bool Query: { "query": { "bool": { "must": [ ], "must_not": [ ], "should": [ ] } } }
  46. Bool Query: { "query": { "bool": { "must": { "match": { "text_entry": "romeo" } }, "must_not": { "match": { "speaker": "ROMEO" } }, "should": [ { "match": { "speaker": "JULIET" } }, { "match": { "speaker": "FRIAR LAURENCE" } } ] } } }
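Because the Query DSL is plain JSON, request bodies like the one on slide 46 can be assembled from small helper functions. This is a minimal sketch; `match` and `bool_query` are hypothetical helpers, not part of any Elasticsearch client library.

```python
import json


def match(field, text):
    """Build a match clause for one field."""
    return {"match": {field: text}}


def bool_query(must=None, must_not=None, should=None):
    """Build a bool query body, leaving out clause lists that are empty."""
    clauses = {}
    if must:
        clauses["must"] = must
    if must_not:
        clauses["must_not"] = must_not
    if should:
        clauses["should"] = should
    return {"query": {"bool": clauses}}


body = bool_query(
    must=[match("text_entry", "romeo")],
    must_not=[match("speaker", "ROMEO")],
    should=[match("speaker", "JULIET"), match("speaker", "FRIAR LAURENCE")],
)
print(json.dumps(body, indent=2))
```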
  47. And lots more… filtered query, prefix query, simple query string query, range query, regexp query, term query, terms query, wildcard query, dis max query, geoshape query, nested query, more like this query, more like this field query, boosting query, common terms query, constant score query, fuzzy like this query, fuzzy like this field query, function score query, fuzzy query, has child query, has parent query, ids query, indices query, span first query, span multi term query, span near query, span not query, span or query, span term query, top children query, minimum should match, multi term query rewrite, template query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html
  48. Filtering • Filters do not compute a relevance score, so they are faster to execute than queries • Filters can be cached in memory, which makes them significantly faster than queries. If relevance is not important, use filters; otherwise, use queries!
  49. The Filtered Query: { "query": { "filtered": { "query": {YOUR_QUERY_HERE}, "filter": {YOUR_FILTER_HERE} } } }
  50. The Filtered Query: { "query": { "filtered": { "query": { "match": { "content": "monokkel" } }, "filter": { "term": { "tag": "awesome" } } } } }
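The `filtered` query shown here belongs to the Elasticsearch versions current at the time of the talk. Later releases deprecated it (2.0) and then removed it (5.0) in favour of a `bool` query with a `filter` clause. The sketch below rewrites a request body from one form to the other; `filtered_to_bool` is a hypothetical helper and only handles the simple shape used on the slides.

```python
import json


def filtered_to_bool(body):
    """Rewrite {"query": {"filtered": ...}} as an equivalent bool query."""
    filtered = body["query"]["filtered"]
    bool_clauses = {}
    if "query" in filtered:
        # The scoring part becomes a must clause.
        bool_clauses["must"] = filtered["query"]
    if "filter" in filtered:
        # The non-scoring part becomes a filter clause.
        bool_clauses["filter"] = filtered["filter"]
    return {"query": {"bool": bool_clauses}}


old_body = {
    "query": {
        "filtered": {
            "query": {"match": {"content": "monokkel"}},
            "filter": {"term": {"tag": "awesome"}},
        }
    }
}

new_body = filtered_to_bool(old_body)
print(json.dumps(new_body, indent=2))
```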
  51. Term Filter: { "query": { "filtered": { "filter": { "term": { "speaker": "ROMEO" } } } } }
  52. Terms Filter: { "query": { "filtered": { "filter": { "terms": { "speaker": ["ROMEO", "JULIET"] } } } } }
  53. Bool Filter: { "query": { "filtered": { "filter": { "bool": { "must": [], "should": [], "must_not": [] } } } } }
  54. Range Filter: { "query": { "filtered": { "filter": { "range": { "price": { "gt": 20, "lt": 40 } } } } } }
  55. And lots more… match all filter, and filter, not filter, or filter, prefix filter, query filter, regexp filter, type filter, geo bounding box filter, geo distance filter, geo distance range filter, geo polygon filter, geoshape filter, geohash cell filter, has child filter, has parent filter, ids filter, indices filter, limit filter, nested filter, script filter. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html
  56. Kibana • http://www.elasticsearch.org/overview/kibana/installation/ • bin/kibana or bin/kibana.bat on Windows • http://localhost:5601/
  57. Aggregations • Buckets and metrics: buckets partition documents based on a criterion. In SELECT COUNT(color) FROM table GROUP BY color, COUNT(color) is the metric and GROUP BY color is the bucket. An aggregation is a combination of buckets and metrics.
  58. Aggregations: { "aggs": { "speakers": { "terms": { "field": "speaker" } } } } Here “speakers” is your aggregation name and “terms” is the bucket type.
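What a `terms` bucket does can be mimicked on in-memory data: group documents by a field value and count each group, much like the GROUP BY example on slide 57. In this sketch the sample documents are hypothetical and `terms_aggregation` is an illustrative helper; the `key` and `doc_count` names follow the shape of the buckets Elasticsearch returns.

```python
from collections import Counter

# Hypothetical sample documents.
docs = [
    {"speaker": "ROMEO"},
    {"speaker": "JULIET"},
    {"speaker": "ROMEO"},
]


def terms_aggregation(documents, field):
    """Count documents per distinct value of `field`, largest bucket first."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


buckets = terms_aggregation(docs, "speaker")
print(buckets)
```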
  59. Aggregations
  60. Aggregations: { "aggs": { "beertypes": { "terms": { "field": "beertype" } } } }
  61. Aggregations: { "aggs": { "beertypes": { "terms": { "field": "beertype" }, "aggs": { "avg_ibu": { "avg": { "field": "ibu" } } } } } } Here “avg_ibu” is your aggregation name and “avg” is the metric type.
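Nesting a metric inside a bucket, as on slide 61, amounts to computing the metric once per bucket. The sketch below does this in memory; the beer records are made-up sample data and `avg_by_bucket` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical sample data; the talk's actual beer dataset is not shown.
beers = [
    {"beertype": "IPA", "ibu": 60},
    {"beertype": "IPA", "ibu": 70},
    {"beertype": "Stout", "ibu": 40},
]


def avg_by_bucket(documents, bucket_field, metric_field):
    """Group documents by bucket_field and average metric_field per group."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {key: sum(values) / len(values) for key, values in groups.items()}


avg_ibu = avg_by_bucket(beers, "beertype", "ibu")
print(avg_ibu)
```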
  62. Aggregations: min, max, sum, avg, stats, extended stats, value count, percentiles, percentile ranks, cardinality, top hits, scripted metric, global, filter, filters, missing, nested, reverse nested, children, terms, significant terms, range, date range, ipv4 range, histogram, date histogram, geo bounds, geo distance, geohash grid. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html
  63. And a whole lot more! • Geosearch, distance and bounds • “More Like This” • Suggesters / Autocomplete • Percolation • Language drivers • Scripting
  64. Further reading and some great resources! • http://www.elasticsearch.org/guide/ • http://blog.monokkel.io/ • https://found.no/foundation/
  65. Shameful self-promotion / Tarjei Romtveit
