Your Data, Your Search, Elasticsearch

Speaker: Costin Leau
Finding relevant information fast has always been a challenge, even more so in today's growing "oceans" of data. This talk explores the area of real-time full-text search using Elasticsearch, an open-source, distributed search engine built on top of Apache Lucene. The session will showcase how to perform real-time searches on structured and unstructured data alike, how to cope with types and suggestions, and how to use social graph filters and aggregations for efficient analytics, all from a Spring perspective. Last but not least, the presentation focuses on the Hadoop platform and how Map/Reduce, Hive, Pig or Cascading jobs can leverage a search engine to significantly speed up execution and enhance their capabilities.
The presentation covers architectural topics such as index scalability, data locality and partitioning, using off- and on-premise storage (HDFS, S3, local file systems) and multi-tenancy.

Published in: Technology, News & Politics

Transcript

  • 1. Your data, your search, Elasticsearch Costin Leau @costinl © 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission.
  • 2. Agenda: Elasticsearch, Big Data, Analytics
  • 3. What is Elasticsearch? Open-Source Search & Analytics engine; Structured & Unstructured Data; Real Time Analytics capabilities (facets); REST based; Distributed; Designed for the Cloud; Designed for Big Data
  • 4. What is Elasticsearch? Open-Source Search & Analytics engine; Structured & Unstructured Data; Real Time Analytics capabilities (facets); REST based; Distributed; Designed for the Cloud; Designed for Big Data; Lightweight
  • 5. What is Elasticsearch? Open-Source Search & Analytics engine; Structured & Unstructured Data; Real Time Analytics capabilities (facets); REST based; Distributed; Designed for the Cloud; Designed for Big Data; Lightweight; Popular: ~200K dl/month
  • 6. Users
  • 8. Platform adoption http://www.thoughtworks.com/radar#platforms 2013
  • 10. Use Case – Text search 1.3 billion files, 130 billion lines of code https://github.com/blog/1381-a-whole-new-code-search
  • 11. Use Case - Geolocation 50 million venues / day
  • 12. Use Case - Recommendations: millions of recommendations
  • 13. Use Case – Support/Reporting
  • 14. Use Case – Centralized Logging
  • 15. Use Case – Pure Analytics
  • 16. Plug & Play
  • 17. Installation $ wget https://download.elasticsearch.org/... $ tar -xf elasticsearch-0.90.3.tar.gz $ ./elasticsearch-0.90.3/bin/elasticsearch ... [INFO ][node][Ghost Maker] {0.90.2}[5645]: initializing ...
  • 18. Index a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "title" : "Welcome!"}'
  • 19. Update a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "title" : "Welcome to SpringOne2GX 2013!"}'
  • 20. Search for documents... $ curl -X GET localhost:9200/products/_search?q=welcome
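
The three curl calls on slides 18 to 20 are plain HTTP requests, so they can be reproduced from any language. A minimal Python sketch follows; it only builds the requests and does not send them. The helper names `index_request` and `search_request` are made up for illustration, and `BASE_URL` assumes the local node used on the slides.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local single node, as in the curl examples on the slides.
BASE_URL = "http://localhost:9200"


def index_request(index, doc_type, doc_id, document):
    """Build (not send) the PUT that indexes a document.

    Sending the same PUT again with a new body replaces the document,
    which is what the "Update a document" slide does.
    """
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def search_request(index, query_string):
    """Build (not send) the GET for a URI search with ?q=..."""
    url = f"{BASE_URL}/{index}/_search?{urlencode({'q': query_string})}"
    return Request(url, method="GET")
```

With a node running, passing either request to `urllib.request.urlopen` would perform the call.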
  • 21. Scaling out $ ./elasticsearch-0.90.2/bin/elasticsearch -Des.node.name=Node2 ...[cluster.service] [Node2] detected_master [Node1] ...
  • 22. Primaries and Replicas [Diagram: shards A1, A2, A3, each shown as a primary and a replica] curl -XPUT 'http://localhost:9200/a/' -d '{ "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 1 } } }'
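
The settings on slide 22 (3 shards, 1 replica) can be worked through with a little arithmetic. The sketch below counts the shard copies the cluster has to place and shows the general idea of routing a document to a primary by hashing its routing value modulo the number of primaries. `zlib.crc32` is a stand-in hash picked for this example, not the function Elasticsearch uses, and both function names are hypothetical.

```python
import zlib


def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard has `number_of_replicas` extra copies."""
    return number_of_shards * (1 + number_of_replicas)


def pick_shard(routing_value, number_of_shards):
    """Map a routing value (by default the document id) to a primary shard.

    Assumption: crc32 stands in for the real hash function; only the
    "hash modulo number of primaries" shape is the point here.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards
```

Because the shard is derived from the number of primaries, that number is fixed when the index is created, while the replica count can be changed later.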
  • 23. Scaling out $ ./elasticsearch-0.90.2/bin/elasticsearch -Des.node.name=Node3 ...[cluster.service] [Node3] detected_master [Node1] ...
  • 24. JSON & HTTP { "id" : "abc123", "title" : "A JSON Document", "body" : "A JSON document is a ...", "published_on" : "2013/06/27 10:00:00", "featured" : true, "tags" : ["search", "json"], "author" : { "first_name" : "Clara", "last_name" : "Rice", "email" : "clara@rice.org" } }
  • 25. http:// Lingua Franca of APIs Also supported: Native Java protocol, Thrift, Memcached
  • 26. Search & Find $ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>" Terms: apple | apple iphone; Phrases: "apple iphone"; Proximity: "apple safari"~5; Fuzzy: apple~0.8; Wildcards: app* | *pp*; Boosting: apple^10 safari; Range: [2011/05/01 TO 2011/05/31] | [java TO json]; Boolean: apple AND NOT iphone | +apple -iphone | (apple OR iphone) AND NOT review; Fields: title:iphone^15 OR body:iphone | published_on:[2011/05/01 TO "2011/05/27 10:00:00"]
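
Several of the query-string operators on slide 26 (`+`, quotes, spaces, brackets) have a meaning of their own inside a URL, so the query should be percent-encoded before it goes into `?q=`. A small sketch using only the standard library; `SEARCH_URL` and the two helper names are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the local endpoint used on the slide.
SEARCH_URL = "http://localhost:9200/_search"


def uri_search_url(query, base=SEARCH_URL):
    """Return a URI-search URL with the query safely percent-encoded."""
    return f"{base}?{urlencode({'q': query})}"


def decoded_query(url):
    """Recover the original query string from such a URL."""
    return parse_qs(urlsplit(url).query)["q"][0]
```

Without encoding, a literal `+` in `+apple -iphone` would be read as a space by the server, silently changing the query.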
  • 27. Query DSL curl -X GET localhost:9200/articles/_search -d '{ "query" : { "filtered" : { "query" : { "bool" : { "must" : [ { "match" : { "author.first_name" : { "query" : "claire", "fuzziness" : 0.1 } } }, { "multi_match" : { "query" : "elasticsearch", "fields" : ["title^10", "body"] } } ] } }, "filter": { "and" : [ { "terms" : { "tags" : ["search"] } }, { "range" : { "published_on": {"from": "2013"} } }, { "term" : { "featured" : true } } ] } } } }'
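
The Query DSL request above is easier to get right when it is built as a data structure and serialized, rather than typed as one long string. The sketch below mirrors the slide's 0.90-era `filtered` query (later Elasticsearch releases dropped that query type). The function name `filtered_query` and its parameters are hypothetical. The two `must` clauses are held in a list, since a JSON object cannot usefully repeat the same key.

```python
import json


def filtered_query(author, text, tag, since):
    """Build the request body: scored full-text part plus unscored filters."""
    return {
        "query": {
            "filtered": {
                # Scored part: fuzzy author match and boosted title/body match.
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"author.first_name": {
                                "query": author, "fuzziness": 0.1}}},
                            {"multi_match": {
                                "query": text,
                                "fields": ["title^10", "body"]}},
                        ]
                    }
                },
                # Unscored part: yes/no conditions that only narrow the set.
                "filter": {
                    "and": [
                        {"terms": {"tags": [tag]}},
                        {"range": {"published_on": {"from": since}}},
                        {"term": {"featured": True}},
                    ]
                },
            }
        }
    }


body = json.dumps(filtered_query("claire", "elasticsearch", "search", "2013"))
```

The resulting `body` string is what would be passed to `curl -d` or sent as the HTTP request body.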
  • 31. Search types: Full-text Search ("Find all articles with 'search' in their title or body, give matches in titles higher score"); Structured Search ("Find all articles from year 2013 tagged 'search'"); Custom Scoring (see custom_score and custom_filters_score queries)
  • 32. Search perspectives: User, Search Engine. Fetch document field ➝ Pick configured analyzer ➝ Parse text into tokens ➝ Apply token filters ➝ Store into index
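
The indexing pipeline on slide 32 (analyzer, tokens, token filters, index) can be illustrated with a toy inverted index. This is a simplified sketch, not how Lucene is implemented: the regex tokenizer, the tiny `STOPWORDS` set and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop list, not a real analyzer's.
STOPWORDS = {"a", "an", "the", "is", "in"}


def analyze(text):
    """Parse text into tokens, then apply lowercase and stop-word filters."""
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Store each token with the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The query goes through the same `analyze` step as the documents, which is why a search for `WELCOME` still matches a document containing `Welcome!`.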
  • 33. Slice & Dice Query Facets
  • 34. OLAP Cube Dimensions, measures, aggregations
  • 35. Slice, Dice, Drill Down / Roll Up: "Show me sales numbers for all products across all locations in year 2013"; "Show me product A sales numbers across all locations over all years"; "Show me products sales numbers in location X over all years"
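
The three questions on slide 35 are all the same operation: fix some dimensions, group by the rest, and sum a measure. A toy in-memory sketch is below. The `SALES` records are invented sample data and `aggregate` is a hypothetical helper; in Elasticsearch the equivalent work would be done by filters plus facets.

```python
from collections import defaultdict

# Invented sample data: dimensions product/location/year, measure amount.
SALES = [
    {"product": "A", "location": "X", "year": 2012, "amount": 10},
    {"product": "A", "location": "Y", "year": 2013, "amount": 20},
    {"product": "B", "location": "X", "year": 2013, "amount": 5},
    {"product": "B", "location": "Y", "year": 2013, "amount": 7},
    {"product": "A", "location": "X", "year": 2013, "amount": 3},
]


def aggregate(records, group_by, **fixed):
    """Sum `amount` per combination of `group_by` dimensions.

    Keyword arguments pin dimensions to a value (the "slice").
    """
    totals = defaultdict(int)
    for record in records:
        if all(record[dim] == value for dim, value in fixed.items()):
            key = tuple(record[dim] for dim in group_by)
            totals[key] += record["amount"]
    return dict(totals)


# All products across all locations in year 2013.
by_product_and_location = aggregate(SALES, ("product", "location"), year=2013)
# Roll up: total per product over everything.
by_product = aggregate(SALES, ("product",))
```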
  • 36. Clients
  • 37. Pick your language: Java, Perl*, Python*, Ruby*, PHP*, JavaScript, .NET, Scala, Clojure, Go, Erlang, EventMachine, CLI, Smalltalk, OCaml
  • 38. Spring Data
  • 39. Spring Data Elasticsearch: easy to use Elasticsearch in a Spring-powered app. Configuring Elasticsearch client; dedicated template for one-liners; repository support
  • 40. Configuration <beans xmlns:es="http://www.sf.org/schema/data/elasticsearch"> <es:repositories base-package="com.acme" /> <es:transport-client id="client" cluster-nodes="localhost:9300,someip:9300" /> </beans> @Configuration @EnableElasticsearchRepositories(basePackages = "com/acme") static class Config { @Bean public ElasticsearchOperations elasticsearchTemplate() { return new ElasticsearchTemplate(nodeBuilder().local(true).node().client()); } }
  • 41. Dedicated Template: create/delete index/mappings; query options (Criteria, String, Search); bulk operations; scrolling/streaming
  • 42. Repositories public interface BookRepository extends Repository<Book, String> { List<Book> findByNameAndPrice(String name, Integer price); List<Book> findByNameOrPrice(String name, Integer price); Page<Book> findByName(String name, Pageable page); Page<Book> findByNameNot(String name, Pageable page); Page<Book> findByPriceBetween(int from, int to, Pageable page); Page<Book> findByNameLike(String name, Pageable page); @Query("{\"bool\" : {\"must\" : {\"field\" : {\"message\" : \"?0\"}}}}") Page<Book> findByMessage(String message, Pageable pageable); }
  • 43. Sophisticated query creation (Keyword: Example). And/Or: findByNameAndPrice; Is: findByName; Not: findByNameNot; Less/GreaterThanEqual: findByPriceLessThan; Before/After: findByPriceAfter; Starting/EndingWith: findByNameEndingWith; Contains/Containing: findByNameContaining; OrderBy: findByCountryOrderByName; True/False: findByRetiredFalse; Near: soon
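
To give a feel for how a method name such as `findByNameAndPrice` can be turned into a query, here is a deliberately tiny sketch. It is not Spring Data's implementation: it only understands `And`/`Or`, it maps every property to a `term` clause (an assumption), and `derive_query` is a made-up name.

```python
import re


def derive_query(method_name, *args):
    """Toy translation of a findBy... method name into a bool query dict."""
    if not method_name.startswith("findBy"):
        raise ValueError("expected a findBy... method name")
    # Split on And/Or only when followed by a capitalised property name,
    # so a property like "Order" is not cut at its leading "Or".
    parts = re.split(r"(And|Or)(?=[A-Z])", method_name[len("findBy"):])
    properties, connectives = parts[0::2], set(parts[1::2])
    if len(connectives) > 1:
        raise ValueError("mixing And/Or is not handled in this sketch")
    if len(properties) != len(args):
        raise ValueError("one argument per property is required")
    clauses = [{"term": {prop[0].lower() + prop[1:]: value}}
               for prop, value in zip(properties, args)]
    # And -> every clause must match; Or -> any clause may match.
    occur = "should" if connectives == {"Or"} else "must"
    return {"bool": {occur: clauses}}
```

For example, `derive_query("findByNameOrPrice", "Dune", 10)` yields a `bool` query with two `should` clauses.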
  • 44. Big Data
  • 45. A Holistic View of a Big Data System [Diagram labels: Real Time Streams; Analytics; ETL; Real-Time Processing (S4, Storm); RT Semi-structured Database (HBase, Cassandra, Mongo); Big SQL (Greenplum, AsterData, etc.); Unstructured Data (HDFS); Batch Processing]
  • 47. Hadoop eco-system: Map Reduce Framework (MapRed); Hadoop Distributed File System (HDFS)
  • 48. Elasticsearch - Hadoop: read/write data to Hadoop transparently (Hadoop Input/OutputFormat; Cascading Tap; Pig Storage; Hive SerDe). Native Map/Reduce model
  • 49. Elasticsearch + Hadoop [Two bar charts, "Writing" and "Reading / Querying", each comparing Raw, M/R, Pig and Hive on a 0 to 60 scale]
  • 50. Data Ingestion: DIY, Logstash, Flume, Graylog2, HDFS
  • 51. Logstash: tool for managing events and logs. Collect, parse and store. Tons of inputs (~40), codecs (~11), filters (~40), outputs (~50)
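
The "collect, parse and store" flow on slide 51 can be sketched in a few lines. This is not Logstash itself: the log line layout (`timestamp LEVEL message`), the `unparsed` tag and the `logs`/`event` index and type names are all assumptions for the example. The output uses the newline-delimited layout of the Elasticsearch bulk API, with one action line followed by one source line per event.

```python
import json
import re

# Assumption: log lines look like "2013-09-10T12:00:00 INFO node started".
LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def parse_line(line):
    """Parse one raw log line into a structured event."""
    match = LINE.match(line.strip())
    if match is None:
        # Keep unparseable lines instead of dropping them (hypothetical tag).
        return {"message": line.strip(), "tags": ["unparsed"]}
    return match.groupdict()


def to_bulk(events, index="logs", doc_type="event"):
    """Serialize events as a bulk request body ready to be stored."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"
```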
  • 52. Kibana: make sense of logging data. Runs inside your browser; highly customizable; leverages Elasticsearch aggregations/facets
  • 53. Thank you! @costinl
