Your Data, Your Search, Elasticsearch

Speaker: Costin Leau
Finding relevant information fast has always been a challenge, even more so in today's growing "oceans" of data. This talk explores real-time full-text search using Elasticsearch, an open-source, distributed search engine built on top of Apache Lucene. The session showcases how to perform real-time searches on structured and unstructured data alike, how to cope with types and suggestions, and how to use social graph filters and aggregations for efficient analytics, all from a Spring perspective. Last but not least, the presentation focuses on the Hadoop platform and how Map/Reduce, Hive, Pig or Cascading jobs can leverage a search engine to significantly speed up execution and enhance their capabilities.
The presentation covers architectural topics such as index scalability, data locality and partitioning, using off- and on-premise storage (HDFS, S3, local file systems), and multi-tenancy.

Published in: Technology, News & Politics

  1. 1. Your data, your search, Elasticsearch Costin Leau @costinl © 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission.
  2. 2. Agenda  Elasticsearch  Big Data  Analytics
  5. 5. What is Elasticsearch? Open-source search & analytics engine (structured & unstructured data); Real Time Analytics capabilities (facets); REST based; distributed (designed for the cloud, designed for Big Data); lightweight; popular: ~200K downloads/month
  6. 6. Users
  8. 8. Platform adoption http://www.thoughtworks.com/radar#platforms 2013
  10. 10. Use Case – Text search 1.3 billion files, 130 billion lines of code https://github.com/blog/1381-a-whole-new-code-search
  11. 11. Use Case - Geolocation 50 million venues / day
  12. 12. Use Case - Recommendations millions of recommendations
  13. 13. Use Case – Support/Reporting
  14. 14. Use Case – Centralized Logging
  15. 15. Use Case – Pure Analytics
  16. 16. Plug & Play
  17. 17. Installation
      $ wget https://download.elasticsearch.org/...
      $ tar -xf elasticsearch-0.90.3.tar.gz
      $ ./elasticsearch-0.90.3/bin/elasticsearch
      ... [INFO ][node][Ghost Maker] {0.90.2}[5645]: initializing ...
  18. 18. Index a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "title" : "Welcome!"}'
  19. 19. Update a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "title" : "Welcome to SpringOne2GX 2013!"}'
  20. 20. Search for documents... $ curl -X GET localhost:9200/products/_search?q=welcome
  21. 21. Scaling out $ ./elasticsearch-0.90.2/bin/elasticsearch -Des.node.name=Node2 ...[cluster.service] [Node2] detected_master [Node1] ...
  22. 22. Primaries and Replicas [Diagram: primary shards A1, A2, A3, each with one replica]
      curl -XPUT 'http://localhost:9200/a/' -d '{ "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 1 } } }'
  23. 23. Scaling out $ ./elasticsearch-0.90.2/bin/elasticsearch -Des.node.name=Node3 ...[cluster.service] [Node3] detected_master [Node1] ...
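
The primaries-and-replicas slide fixes "number_of_shards" at 3 when the index is created. A toy routing function can illustrate why that number matters: a document id is hashed and reduced modulo the number of primary shards. This is only a sketch under assumptions: the class name ShardRouting is hypothetical, and Java's String.hashCode stands in for whatever hash Elasticsearch uses internally.

```java
public class ShardRouting {

    // Maps a document id to one of the primary shards: hash, then modulo.
    // Math.floorMod keeps the result in [0, numberOfShards) even when
    // hashCode() is negative. String.hashCode is a stand-in (assumption),
    // not the hash function Elasticsearch itself uses.
    static int shardFor(String docId, int numberOfShards) {
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 3; // same value as "number_of_shards" on the slide
        for (String id : new String[] {"1", "2", "abc123"}) {
            System.out.println("doc " + id + " -> primary shard index "
                    + shardFor(id, numberOfShards));
        }
    }
}
```

Because the target shard depends on the shard count, changing the number of primaries would send existing ids to different shards, whereas replicas are copies and can be added without affecting routing.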
  24. 24. JSON & HTTP { "id" : "abc123", "title" : "A JSON Document", "body" : "A JSON document is a ...", "published_on" : "2013/06/27 10:00:00", "featured" : true, "tags" : ["search", "json"], "author" : { "first_name" : "Clara", "last_name" : "Rice", "email" : "clara@rice.org" } }
  25. 25. http:// Lingua Franca of APIs Also supported: Native Java protocol, Thrift, Memcached
  26. 26. Search & Find $ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"
      Terms: apple; apple iphone
      Phrases: "apple iphone"
      Proximity: "apple safari"~5
      Fuzzy: apple~0.8
      Wildcards: app*; *pp*
      Boosting: apple^10 safari
      Range: [2011/05/01 TO 2011/05/31]; [java TO json]
      Boolean: apple AND NOT iphone; +apple -iphone; (apple OR iphone) AND NOT review
      Fields: title:iphone^15 OR body:iphone; published_on:[2011/05/01 TO "2011/05/27 10:00:00"]
  27. 27. Query DSL
      curl -X GET localhost:9200/articles/_search -d '{
        "query" : { "filtered" : {
          "query" : { "bool" : { "must" : [
            { "match" : { "author.first_name" : { "query" : "claire", "fuzziness" : 0.1 } } },
            { "multi_match" : { "query" : "elasticsearch", "fields" : ["title^10", "body"] } }
          ] } },
          "filter" : { "and" : [
            { "terms" : { "tags" : ["search"] } },
            { "range" : { "published_on" : { "from" : "2013" } } },
            { "term" : { "featured" : true } }
          ] }
        } }
      }'
  31. 31. Search types  Full-text Search “Find all articles with ‘search’ in their title or body, give matches in titles higher score”  Structured Search “Find all articles from year 2013 tagged ‘search’”  Custom Scoring See custom_score and custom_filters_score queries
  32. 32. Search perspectives User Search Engine Fetch document field ➝ Pick configured analyzer ➝ Parse text into tokens ➝ Apply token filters ➝ Store into index
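
The indexing pipeline on the slide above (pick an analyzer, parse text into tokens, apply token filters, store into the index) can be sketched in a few lines of plain Java. This is a deliberately simplified illustration, not Lucene's analysis chain: the class name TinyAnalyzer, the tiny stop-word list and the regex tokenizer are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TinyAnalyzer {

    // Hypothetical stop-word list; real analyzers ship much larger ones.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "is", "the", "to"));

    // Tokenize on anything that is not a letter or digit, then apply two
    // token filters: lowercasing and stop-word removal.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Store into an inverted index: each term maps to the ids of the
    // documents that contain it.
    static Map<String, Set<Integer>> index(Map<Integer, String> docs) {
        Map<String, Set<Integer>> inverted = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String token : analyze(doc.getValue())) {
                inverted.computeIfAbsent(token, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "Welcome!");
        docs.put(2, "Welcome to SpringOne2GX 2013!");
        System.out.println(index(docs));
    }
}
```

With such an index, a query for "welcome" only has to look up one term to find both documents, which is why the same analysis is normally applied to the query text as well.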
  33. 33. Slice & Dice Query Facets
  34. 34. OLAP Cube Dimensions, measures, aggregations
  35. 35. Slice, Dice, Drill Down / Roll Up
      "Show me sales numbers for all products across all locations in year 2013"
      "Show me product A sales numbers across all locations over all years"
      "Show me product sales numbers in location X over all years"
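
The first question above ("sales numbers for all products across all locations in year 2013") combines two cube operations: slicing on one dimension (the year) and rolling up a measure (the amount) over another (the location). A minimal in-memory sketch, assuming a hypothetical Sale type and made-up figures, might look like this; it illustrates the idea only and does not show how Elasticsearch computes facets.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SalesCube {

    // Hypothetical fact row: three dimensions and one measure.
    static final class Sale {
        final String product;
        final String location;
        final int year;
        final double amount;

        Sale(String product, String location, int year, double amount) {
            this.product = product;
            this.location = location;
            this.year = year;
            this.amount = amount;
        }
    }

    // Slice: keep only rows for the given year.
    // Roll up: sum the amount per product across all locations.
    static Map<String, Double> salesByProduct(List<Sale> sales, int year) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sale s : sales) {
            if (s.year == year) {
                totals.merge(s.product, s.amount, Double::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<Sale> sales = Arrays.asList(
                new Sale("A", "Berlin", 2013, 10.0),
                new Sale("A", "Paris", 2013, 5.0),
                new Sale("B", "Berlin", 2013, 7.0),
                new Sale("A", "Berlin", 2012, 100.0));
        System.out.println(salesByProduct(sales, 2013));
    }
}
```

In search-engine terms, the year restriction plays the role of a filter and the per-product sum plays the role of a facet over the matching documents.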
  36. 36. Clients
  37. 37. Pick your language Java, Perl*, Python*, Ruby*, PHP*, JavaScript, .NET, Scala, Clojure, Go, Erlang, EventMachine, CLI, Smalltalk, OCaml
  38. 38. Spring Data
  39. 39. Spring Data Elasticsearch Easy to use Elasticsearch in a Spring-powered app  Configuring Elasticsearch client  Dedicated template for one-liners  Repository support
  40. 40. Configuration
      <beans xmlns:es="http://www.sf.org/schema/data/elasticsearch">
        <es:repositories base-package="com.acme" />
        <es:transport-client id="client" cluster-nodes="localhost:9300,someip:9300" />
      </beans>
      @Configuration
      @EnableElasticsearchRepositories(basePackages = "com/acme")
      static class Config {
        @Bean
        public ElasticsearchOperations elasticsearchTemplate() {
          return new ElasticsearchTemplate(nodeBuilder().local(true).node().client());
        }
      }
  41. 41. Dedicated Template  Create/delete index/mappings  Query options – Criteria – String – Search  Bulk operations  Scrolling/streaming
  42. 42. Repositories
      public interface BookRepository extends Repository<Book, String> {
        List<Book> findByNameAndPrice(String name, Integer price);
        List<Book> findByNameOrPrice(String name, Integer price);
        Page<Book> findByName(String name, Pageable page);
        Page<Book> findByNameNot(String name, Pageable page);
        // Between needs both bounds of the range
        Page<Book> findByPriceBetween(int from, int to, Pageable page);
        Page<Book> findByNameLike(String name, Pageable page);
        // ?0 is replaced by the first method argument
        @Query("{\"bool\" : {\"must\" : {\"field\" : {\"message\" : \"?0\"}}}}")
        Page<Book> findByMessage(String message, Pageable pageable);
      }
  43. 43. Sophisticated query creation
      Keyword                  Example
      And/Or                   findByNameAndPrice
      Is                       findByName
      Not                      findByNameNot
      Less/GreaterThanEqual    findByPriceLessThan
      Before/After             findByPriceAfter
      Starting/EndingWith      findByNameEndingWith
      Contains/Containing      findByNameContaining
      OrderBy                  findByCountryOrderByName
      True/False               findByRetiredFalse
      Near                     soon
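
The keyword table above works because a repository method name can be taken apart mechanically. The following toy parser illustrates the principle for the And/Or keywords only; it is not Spring Data's actual parsing code, the class name DerivedQueryNames is hypothetical, and other keywords (Not, Between, OrderBy and so on) are simply left attached to the property name.

```java
import java.util.ArrayList;
import java.util.List;

public class DerivedQueryNames {

    // Extracts the property names from a "findBy..." method name by
    // splitting on And/Or. The lookbehind/lookahead require a lowercase
    // letter before and an uppercase letter after the keyword, so the
    // "Or" inside a word such as "Order" is not treated as a keyword.
    static List<String> properties(String methodName) {
        String prefix = "findBy";
        if (!methodName.startsWith(prefix) || methodName.length() == prefix.length()) {
            throw new IllegalArgumentException("expected findBy<Property>...: " + methodName);
        }
        List<String> properties = new ArrayList<>();
        String predicate = methodName.substring(prefix.length());
        for (String part : predicate.split("(?<=[a-z])(?:And|Or)(?=[A-Z])")) {
            // Turn "Name" into the bean property "name".
            properties.add(Character.toLowerCase(part.charAt(0)) + part.substring(1));
        }
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(properties("findByNameAndPrice"));
        System.out.println(properties("findByNameOrPrice"));
    }
}
```

Each extracted property would then be matched, in order, against the method's arguments to build the criteria of the query.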
  44. 44. Big Data
  45. 45. A Holistic View of a Big Data System Real Time Streams Analytics ETL Real-Time Processing (S4, Storm) RT Semi-structured Database (HBase, Cassandra, Mongo) Big SQL (Greenplum, AsterData, etc.) Unstructured Data (HDFS) Batch Processing
  47. 47. Hadoop eco-system Map Reduce Framework (MapRed) Hadoop Distributed File System (HDFS)
  48. 48. Elasticsearch - Hadoop Read/write data to Hadoop transparently • Hadoop Input/OutputFormat • Cascading Tap • Pig Storage • Hive SerDe Native Map/Reduce model
  49. 49. Elasticsearch + Hadoop [Two bar charts, "Writing" and "Reading / Querying", each comparing Raw, M/R, Pig and Hive on a 0 to 60 scale]
  50. 50. Data Ingestion: DIY, Logstash, Flume, Graylog2, HDFS
  51. 51. Logstash Tool for managing events and logs  Collect, parse and store  Tons of inputs (~40), codecs (~11), filters (~40), outputs (~50)
  52. 52. Kibana Make sense of logging data  Runs inside your browser  Highly customizable  Leverages Elasticsearch aggregations/facets
  53. 53. Thank you! @costinl
