Fazendo mágica com ElasticSearch

3,218 views
3,023 views

Published on

Quando uma aplicação começa a ficar grande e complexa, fazer buscas nos seus models torna-se uma tarefa complicada. Efetuar as buscas diretamente no banco de dados é um processo lento, ineficiente e que permite pouca ou nenhuma maleabilidade sobre a forma com que a busca é feita. Surge então o ElasticSearch, uma engine de busca utilizada por empresas como Github, Twitter e 4square para indexar e buscar literalmente milhões de documentos em tempo real. Nessa palestra, explicarei quando, como e porque utilizar o ElasticSearch para facilmente indexar e efetuar buscas complexas nos seus models.

Published in: Technology

Fazendo mágica com ElasticSearch

  1. 1. Fazendo mágica com ElasticSearch PEDROFRANCESCHI @pedroh96 pedro@pagar.me github.com/pedrofranceschi
  2. 2. Outubro/2010
  3. 3. Filters Full text search Sort Highlight Facets Pagination
  4. 4. Você vai precisar buscar dados.
  5. 5. Você vai precisar entender dados.
  6. 6. (My)SQL não é a solução. (… nem NoSQL)
  7. 7. O que é o ElasticSearch?
  8. 8. ElasticSearch • “Open Source Distributed Real Time Search & Analytics” • API RESTful para indexar/buscar JSONs (“NoSQL”) • NÃO é um banco de dados • Apache Lucene • Just works (and scales) • Full text search, aggregations, scripting, etc, etc, etc.
  9. 9. Nomes? MySQL ElasticSearch Database Index Table Type Row Document Column Field Schema Mapping Partition Shard
  10. 10. Como usar o ElasticSearch?
  11. 11. $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{! "user" : “pedroh96",! "post_date" : "2009-11-15T14:12:12",! "message" : "trying out Elasticsearch"! }' Endpoint Index Type Document ID Document {! "_index" : "twitter",! "_type" : "tweet",! "_id" : "1",! "_version" : 1,! "created" : true! } PUT data
  12. 12. Endpoint Index Type $ curl -XGET 'http://localhost:9200/twitter/tweet/1' Document ID {! "_id": "1",! "_index": "twitter",! "_source": {! "message": "trying out Elasticsearch",! "post_date": "2009-11-15T14:12:12",! "user": "pedroh96"! },! "_type": "tweet",! "_version": 1,! "found": true! } Document GET data
  13. 13. GET data Endpoint Index $ curl -XGET 'http://localhost:9200/twitter/_search'! -d ‘{ query: . . . }! ! ! Query de busca ! ! ! ! ! ! ! Operador de busca
  14. 14. ActiveRecords class Tweet < ActiveRecord::Base! end
  15. 15. ActiveRecords require 'elasticsearch/model'! ! class Tweet < ActiveRecord::Base! include Elasticsearch::Model! include Elasticsearch::Model::Callbacks! end! !
  16. 16. Tweet.import
  17. 17. Tweet.search(“pedroh96”)
  18. 18. Por que usar o ElasticSearch?
  19. 19. DISCLAIMER
  20. 20. Post.where(:all, :author => "pedroh96") vs Post.search(query: { match: { author: "pedroh96" }}) Just Another Query Language?
  21. 21. 1) Full text search
  22. 22. ActiveRecords $ rails g scaffold Post title:string! source:string
  23. 23. GET /posts/5 Post.find(5) :-) ActiveRecords
  24. 24. ActiveRecords “Amazon to Buy Video Site Twitch for More Than $1B” Post.where(:all, :title => "Amazon to Buy Video Site Twitch for More Than $1B") :-)
  25. 25. “amazon” Post.where(["title LIKE ?", "%Amazon%"]) ??? ActiveRecords
  26. 26. “amazon source:online.wsj.com” Post.where(["title LIKE ? AND source = ?", "%Amazon%", "online.wsj.com"]) ?????? ActiveRecords
  27. 27. “amazon” Post.search("amazon") :-) ElasticSearch
  28. 28. ElasticSearch “amazon source:online.wsj.com” search = Post.search("amazon source:online.wsj.com") :-)
  29. 29. ElasticSearch “amazon source:online.wsj.com” search = Post.search( query:{ match: { _all: "amazon source:online.wsj.com", } } ) Full-text search
  30. 30. ElasticSearch “amazon source:online.wsj.com” search = Post.search( query:{ multi_match: { query: "amazon source:online.wsj.com", fields: ['title^10', 'source'] } } ) Full-text search Title boost
  31. 31. ElasticSearch “amazon source:online.wsj.com” search = Post.search( query:{ multi_match: { query: "amazon source:online.wsj.com", fields: ['title^10', 'source'] } }, highlight: { fields: { title: {} } } ) Title highlight Full-text search Title boost
  32. 32. ElasticSearch Title highlight > search.results[0].highlight.title => ["Twitch officially acquired by <em>Amazon</em>"]
  33. 33. 2) Aggregations (faceting)
  34. 34. Geo distance aggregation
  35. 35. ActiveRecords $ rails g scaffold Coordinate latitude:decimal longitude:decimal
  36. 36. ActiveRecords class Coordinate < ActiveRecord::Base! end
  37. 37. ActiveRecords class Coordinate < ActiveRecord::Base! def distance_to(coordinate)! # From http://en.wikipedia.org/wiki/Haversine_formula! rad_per_deg = Math::PI/180 # PI / 180! rkm = 6371 # Earth radius in kilometers! rm = rkm * 1000 # Radius in meters! ! dlon_rad = (coordinate.longitude.to_f - self.longitude.to_f) * rad_per_deg # Delta, converted to rad! dlat_rad = (coordinate.latitude.to_f - self.latitude.to_f) * rad_per_deg! ! lat1_rad = coordinate.latitude.to_f * rad_per_deg! lat2_rad = self.latitude.to_f * rad_per_deg! lon1_rad = coordinate.longitude.to_f * rad_per_deg! lon2_rad = self.longitude.to_f * rad_per_deg! ! a = Math.sin(dlat_rad/2)**2 + Math.cos(lat1_rad) * Math.cos(lat2_rad) * Math.sin(dlon_rad/2)**2! c = 2 * Math::atan2(Math::sqrt(a), Math::sqrt(1-a))! ! rm * c # Delta in meters! end! end > c1 = Coordinate.new(:latitude => -23.5532636, :longitude => -46.6528908) > c2 = Coordinate.new(:latitude => -23.5538488, :longitude => -46.6530035) > c1.distance_to(c2) => 66.07749735875552
  38. 38. ActiveRecords origin = Coordinate.new(:latitude => -23.5532636, :longitude => -46.6528908) buckets = [! {! :to => 100,! :coordinates => []! },! {! :from => 100,! :to => 300,! :coordinates => []! },! {! :from => 300,! :coordinates => []! }! ]! Coordinate.all.each do |coordinate|! distance = origin.distance_to(coordinate)! ! buckets.each do |bucket|! if distance < bucket[:to] and distance > (bucket[:from] || 0)! bucket[:coordinates] << coordinate! end! end! end ??????
  39. 39. ElasticSearch query = {! aggregations: {! Nome da aggregation rings_around_rubyconf: {! geo_distance: {! Field com localização Coordenadas da origem field: "location",! origin: "-23.5532636, -46.6528908",! ranges: [! { to: 100 },! { from: 100, to: 300 },! { from: 300 }! ]! }! Tipo da aggregation }! }! } Buckets para agregar search = Coordinate.search(query) :-)
  40. 40. (Extended) stats aggregation
  41. 41. ActiveRecords $ rails g scaffold Grade subject:string grade:decimal
  42. 42. ElasticSearch query = {! aggregations: {! Nome da aggregation grades_stats: {! Tipo da aggregation extended_stats: {! field: "grade",! }! }! }! }! ! search = Grade.search(query) Nome do field
  43. 43. ElasticSearch > search.response.aggregations.grades_stats! ! => #<Hashie::Mash avg=8.03 count=3 max=10.0 min=4.6 std_deviation=2.43 sum=24.1 sum_of_squares=211.41 variance=5.93>>
  44. 44. (Extended) stats aggregation + Scripting
  45. 45. ElasticSearch query = {! aggregations: {! grades_stats: {! extended_stats: {! field: "grade",! }! }! }! }
  46. 46. ElasticSearch query = {! aggregations: {! Nome da aggregation grades_stats: {! extended_stats: {! field: "grade",! script: "_value < 7.0 ? _value * correction : _value",! params: {! correction: 1.2! }! }! }! }! }! ! search = Grade.search(query) Nome do field JavaScript para calcular novo grade Tipo da aggregation
  47. 47. ElasticSearch > search.response.aggregations.grades_stats! ! => #<Hashie::Mash avg=8.34 count=3 max=10.0 min=5.52 std_deviation=2.00 sum=25.02 sum_of_squares=220.72 variance=4.01>>
  48. 48. Term aggregation
  49. 49. ElasticSearch query = {! aggregations: {! subjects: {! terms: {! Nome da aggregation field: "subject"! }! }! }! }! ! search = Grade.search(query) Nome do field Tipo da aggregation
  50. 50. ElasticSearch > search.response.aggregations.subjects! ! => #<Hashie::Mash buckets=[! #<Hashie::Mash doc_count=2 key=“math">,! #<Hashie::Mash doc_count=1 key="grammar">, #<Hashie::Mash doc_count=1 key=“physics">! ]>
  51. 51. Combined aggregations (term + stats)
  52. 52. ElasticSearch query = {! aggregations: {! subjects: {! terms: {! field: "subject"! }! }! }! }! ! search = Grade.search(query)
  53. 53. ElasticSearch query = {! aggregations: {! subjects: {! terms: {! Nome da parent aggregation field: "subject"! },! aggregations: {! grade_stats: {! stats: {! Nome da child aggregation field: "grade"! }! }! }! }! }! }! ! search = Grade.search(query) Field para parent aggregation Field para child aggregation
  54. 54. ElasticSearch > search.response.aggregations.subjects! ! #<Hashie::Mash buckets=[! #<Hashie::Mash doc_count=2 grade_stats=#<Hashie::Mash avg=9.0 count=2 max=10.0 min=8.0 sum=18.0> key="math">, #<Hashie::Mash doc_count=1 grade_stats=#<Hashie::Mash avg=4.6 count=1 max=4.6 min=4.6 sum=4.6> key="grammar">, #<Hashie::Mash doc_count=1 grade_stats=#<Hashie::Mash avg=9.5 count=1 max=9.5 min=9.5 sum=9.5> key=“physics">! ]>
  55. 55. Top Hits More like this Histogram Scripted metrics Geo bounds Stemmer (sinônimos) IPv4 ranges . . .
  56. 56. 3) Scoring
  57. 57. ActiveRecords $ rails g scaffold Post title:string! source:string likes:integer
  58. 58. “amazon” ElasticSearch search = Post.search( query: { match: { _all: "amazon", } } ) Full-text search search.results.results[0]._score => 0.8174651
  59. 59. “amazon” ElasticSearch search = Post.search( query: { custom_score: { query:{ match: { _all: "amazon", } }, script: "_score * doc['likes'].value" } } ) Full-text search Likes influenciam no score search.results.results[0]._score => 31.8811388
  60. 60. GET http://localhost:9200/post/_search?explain "_explanation": {! "description": "weight(tweet:honeymoon in 0)! [PerFieldSimilarity], result of:",! "value": 0.076713204,! "details": [! {! "description": "fieldWeight in 0, product of:",! "value": 0.076713204,! "details": [! {! "description": "tf(freq=1.0), with freq of:",! "value": 1,! "details": [! {! "description": "termFreq=1.0",! "value": 1! }! ]! },! {! "description": "idf(docFreq=1, maxDocs=1)",! "value": 0.30685282! },! {! "description": "fieldNorm(doc=0)",! "value": 0.25,! }! ]! }! ]! } Score explicado
  61. 61. 4) Indexando responses
  62. 62. $ rails g scaffold Post title:string! source:string likes:integer
  63. 63. class PostsController < ApplicationController! ! # ...! ! def show! @post = Post.find(params[:id])! ! render json: @post! end! ! # ...! ! end SELECT * FROM Posts WHERE id = params[:id]
  64. 64. class PostsController < ApplicationController! ! # ...! ! def show! @post = Post.search(query: { match: { id: params[:id] }})! ! render json: @post! end! ! # ...! ! end GET http://localhost:9200/posts/posts/params[:id]
  65. 65. ActiveRecords require 'elasticsearch/model'! ! class Post < ActiveRecord::Base! include Elasticsearch::Model! include Elasticsearch::Model::Callbacks! ! belongs_to :author! ! def as_indexed_json(options={})! self.as_json(! include: { author: { only: [:name, :bio] },! })! end! end Inclui um parent no JSON indexado
  66. 66. Expondo o ElasticSearch
  67. 67. http://localhost:9200/pagarme/_search https://api.pagar.me/1/search
  68. 68. Infraestrutura do Pagar.me ElasticSearch ElasticSearch Router api.pagar.me Servidor da API (Node.js) MySQL (transações e dados relacionais) MySQL (transações e dados relacionais) MongoDB (dados de clientes e não relacionais) Ambiente de testes (sandbox dos clientes) Servidor da API (Node.js) Ambiente de produção
  69. 69. Expondo o ElasticSearch • Endpoint do ElasticSearch -> Endpoint acessado pelo cliente… • … mas cuidado: dados precisam ser delimitados a conta do cliente (claro) • Vantagem: acesso às mesmas features do ElasticSearch (aggregations, statistics, scores, etc) • Segurança: desabilitar scripts do ElasticSearch
  70. 70. GET /search • Um único endpoint para todos os GETs • Todos os dados indexados e prontos para serem usados (no joins) • Queries complexas construídas no front-side (Angular.js) • Desenvolvimento front-end não dependente do back-end
  71. 71. Overall…
  72. 72. 1)Há uma ferramenta para cada tarefa. 2)Um martelo é sempre a ferramenta certa. 3)Toda ferramenta também é um martelo.
  73. 73. MySQL != NoSQL != ElasticSearch
  74. 74. Obrigado! :) PEDROFRANCESCHI @pedroh96 pedro@pagar.me github.com/pedrofranceschi
  75. 75. Perguntas? PEDROFRANCESCHI @pedroh96 pedro@pagar.me github.com/pedrofranceschi
  76. 76. Fazendo mágica com ElasticSearch PEDROFRANCESCHI @pedroh96 pedro@pagar.me github.com/pedrofranceschi

×