Your Data, Your Search, Elasticsearch
 


Speaker: Costin Leau
Finding relevant information fast has always been a challenge, even more so in today's growing "oceans" of data. This talk explores real-time full-text search using Elasticsearch, an open-source, distributed search engine built on top of Apache Lucene. The session showcases how to perform real-time searches on structured and unstructured data alike, how to cope with types and suggestions, and how to use social graph filters and aggregations for efficient analytics, all from a Spring perspective. Last but not least, the presentation focuses on the Hadoop platform and how Map/Reduce, Hive, Pig or Cascading jobs can leverage a search engine to significantly speed up execution and enhance their capabilities.
The presentation covers architectural topics such as index scalability, data locality and partitioning, using off- and on-premise storage (HDFS, S3, local file systems), and multi-tenancy.

    Presentation Transcript

    • Your data, your search, Elasticsearch Costin Leau @costinl © 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission.
    • Agenda: Elasticsearch, Big Data, Analytics
    • What is Elasticsearch?
      Open-Source Search & Analytics engine
      - Structured & Unstructured Data
      Real Time
      Analytics capabilities (facets)
      REST based
      Distributed
      - Designed for the Cloud
      - Designed for Big Data
      Lightweight
      Popular: ~200K dl/month
    • Users
    • Platform adoption http://www.thoughtworks.com/radar#platforms 2013
    • Use Case – Text search 1.3 billion files, 130 billion lines of code https://github.com/blog/1381-a-whole-new-code-search
    • Use Case - Geolocation 50 million venues / day
    • Use Case - Recommendations millions of recommendations
    • Use Case – Support/Reporting
    • Use Case – Centralized Logging
    • Use Case – Pure Analytics
    • Plug & Play
    • Installation
      $ wget https://download.elasticsearch.org/...
      $ tar -xf elasticsearch-0.90.3.tar.gz
      $ ./elasticsearch-0.90.3/bin/elasticsearch
      ...
      [INFO ][node][Ghost Maker] {0.90.2}[5645]: initializing ...
    • Index a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "title" : "Welcome!"}'
    • Update a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "title" : "Welcome to SpringOne2GX 2013!"}'
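      The two PUT calls above target the same URL, so the second one re-indexes the whole document under the same id. A minimal Python sketch of the same requests, assuming a node listening on localhost:9200; the helper name put_document is hypothetical, and the requests are only built here, not sent:

      ```python
      import json
      from urllib.request import Request

      BASE_URL = "http://localhost:9200"  # assumption: default local node


      def put_document(index, doc_type, doc_id, document):
          """Build (but do not send) a PUT request that indexes `document`."""
          url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
          body = json.dumps(document).encode("utf-8")
          return Request(url, data=body, method="PUT",
                         headers={"Content-Type": "application/json"})


      # Same index, type and id: the second request replaces the first document.
      create = put_document("products", "product", 1, {"title": "Welcome!"})
      update = put_document("products", "product", 1,
                            {"title": "Welcome to SpringOne2GX 2013!"})
      ```

      To actually send a request, pass it to urllib.request.urlopen while a node is running.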
    • Search for documents... $ curl -X GET localhost:9200/products/_search?q=welcome
    • Scaling out
      $ ./elasticsearch-0.90.2/bin/elasticsearch -Des.node.name=Node2
      ...[cluster.service] [Node2] detected_master [Node1] ...
    • Primaries and Replicas
      Primaries: A1 A2 A3
      Replicas: A1 A2 A3
      $ curl -XPUT 'http://localhost:9200/a/' -d '{ "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 1 } } }'
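      The settings on this slide imply six shard copies in total: three primaries plus one replica of each. A short Python sketch of that arithmetic; the function name total_shard_copies is hypothetical:

      ```python
      import json


      def total_shard_copies(number_of_shards, number_of_replicas):
          """Each primary shard exists once, plus once per configured replica."""
          return number_of_shards * (1 + number_of_replicas)


      # Same settings as the curl call on the slide.
      settings = {"settings": {"index": {"number_of_shards": 3,
                                         "number_of_replicas": 1}}}
      index_settings = settings["settings"]["index"]
      copies = total_shard_copies(index_settings["number_of_shards"],
                                  index_settings["number_of_replicas"])
      request_body = json.dumps(settings)  # body for PUT http://localhost:9200/a/
      ```

      With these values, copies is 6, matching the six shard boxes (A1 to A3, twice) in the slide's diagram.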
    • Scaling out
      $ ./elasticsearch-0.90.2/bin/elasticsearch -Des.node.name=Node3
      ...[cluster.service] [Node3] detected_master [Node1] ...
    • JSON & HTTP
      {
        "id" : "abc123",
        "title" : "A JSON Document",
        "body" : "A JSON document is a ...",
        "published_on" : "2013/06/27 10:00:00",
        "featured" : true,
        "tags" : ["search", "json"],
        "author" : {
          "first_name" : "Clara",
          "last_name" : "Rice",
          "email" : "clara@rice.org"
        }
      }
    • http:// Lingua Franca of APIs Also supported: Native Java protocol, Thrift, Memcached
    • Search & Find
      $ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"
      Terms: apple / apple iphone
      Phrases: "apple iphone"
      Proximity: "apple safari"~5
      Fuzzy: apple~0.8
      Wildcards: app* / *pp*
      Boosting: apple^10 safari
      Range: [2011/05/01 TO 2011/05/31] / [java TO json]
      Boolean: apple AND NOT iphone / +apple -iphone / (apple OR iphone) AND NOT review
      Fields: title:iphone^15 OR body:iphone / published_on:[2011/05/01 TO "2011/05/27 10:00:00"]
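      Queries like these contain spaces, colons, quotes and carets, so they need URL-encoding before going into the q= parameter. A minimal Python sketch, assuming the default localhost:9200 endpoint; the helper name search_url is hypothetical:

      ```python
      from urllib.parse import urlencode, urlsplit, parse_qs


      def search_url(query, base="http://localhost:9200/_search"):
          """Return a search URL with `query` safely encoded into the q parameter."""
          params = urlencode({"q": query})
          return base + "?" + params


      field_query = "title:iphone^15 OR body:iphone"
      url = search_url(field_query)

      # Decoding the URL again gives back the original query text.
      decoded = parse_qs(urlsplit(url).query)["q"][0]
      ```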
    • Query DSL
      $ curl -X GET localhost:9200/articles/_search -d '{
        "query" : {
          "filtered" : {
            "query" : {
              "bool" : {
                "must" : [
                  { "match" : { "author.first_name" : { "query" : "claire", "fuzziness" : 0.1 } } },
                  { "multi_match" : { "query" : "elasticsearch", "fields" : ["title^10", "body"] } }
                ]
              }
            },
            "filter" : {
              "and" : [
                { "terms" : { "tags" : ["search"] } },
                { "range" : { "published_on" : { "from" : "2013" } } },
                { "term" : { "featured" : true } }
              ]
            }
          }
        }
      }'
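      The Query DSL body can also be assembled programmatically instead of being written as a raw string. A minimal Python sketch, assuming the 0.90-era filtered query shown on the slide; the two must clauses are placed in a list, because a JSON object cannot reliably carry the same key twice. The function name build_article_query is hypothetical:

      ```python
      import json


      def build_article_query(author, text, tag, since):
          """Combine scored full-text clauses with non-scoring filters."""
          return {
              "query": {
                  "filtered": {
                      # Scored part: both clauses must match.
                      "query": {"bool": {"must": [
                          {"match": {"author.first_name": {
                              "query": author, "fuzziness": 0.1}}},
                          {"multi_match": {"query": text,
                                           "fields": ["title^10", "body"]}},
                      ]}},
                      # Filter part: restricts the result set without scoring.
                      "filter": {"and": [
                          {"terms": {"tags": [tag]}},
                          {"range": {"published_on": {"from": since}}},
                          {"term": {"featured": True}},
                      ]},
                  }
              }
          }


      body = json.dumps(build_article_query("claire", "elasticsearch",
                                            "search", "2013"))
      ```

      The resulting string can be sent as the request body to localhost:9200/articles/_search.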
    • Search types
      - Full-text Search: “Find all articles with ‘search’ in their title or body, give matches in titles higher score”
      - Structured Search: “Find all articles from year 2013 tagged ‘search’”
      - Custom Scoring: see custom_score and custom_filters_score queries
    • Search perspectives User Search Engine Fetch document field ➝ Pick configured analyzer ➝ Parse text into tokens ➝ Apply token filters ➝ Store into index
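      The indexing steps on this slide (fetch field, analyze into tokens, apply token filters, store into the index) can be illustrated with a toy inverted index. This is a simplified Python sketch, not how Lucene implements analysis; the stopword list and function names are hypothetical:

      ```python
      import re
      from collections import defaultdict

      STOPWORDS = {"a", "an", "the", "is"}  # hypothetical token filter data


      def analyze(text):
          """Parse text into tokens, then apply lowercase and stopword filters."""
          tokens = re.findall(r"\w+", text)
          tokens = [token.lower() for token in tokens]
          return [token for token in tokens if token not in STOPWORDS]


      def index_documents(docs, field):
          """Store each token of `field` in an inverted index: token -> doc ids."""
          inverted = defaultdict(set)
          for doc_id, doc in docs.items():
              for token in analyze(doc[field]):
                  inverted[token].add(doc_id)
          return inverted


      docs = {1: {"title": "A JSON Document"},
              2: {"title": "The Search Document"}}
      index = index_documents(docs, "title")
      ```

      Looking up index["document"] then yields both document ids, while index["json"] yields only the first.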
    • Slice & Dice Query Facets
    • OLAP Cube Dimensions, measures, aggregations
    • Slice, Dice, Drill Down / Roll Up
      - Show me sales numbers for all products across all locations in year 2013
      - Show me product A sales numbers across all locations over all years
      - Show me products sales numbers in location X over all years
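      The three questions on this slide all follow one pattern: fix one dimension, then aggregate a measure along another. A toy Python sketch with made-up sales rows; the data and the helper name total_by are hypothetical and only illustrate the idea behind facets and aggregations:

      ```python
      from collections import defaultdict

      # Hypothetical fact rows: dimensions (product, location, year), measure (amount).
      SALES = [
          {"product": "A", "location": "X", "year": 2012, "amount": 10},
          {"product": "A", "location": "Y", "year": 2013, "amount": 20},
          {"product": "B", "location": "X", "year": 2013, "amount": 5},
          {"product": "B", "location": "Y", "year": 2013, "amount": 7},
      ]


      def total_by(rows, dimension, **fixed):
          """Sum `amount` per value of `dimension`, keeping rows that match `fixed`."""
          totals = defaultdict(int)
          for row in rows:
              if all(row[key] == value for key, value in fixed.items()):
                  totals[row[dimension]] += row["amount"]
          return dict(totals)


      by_product_2013 = total_by(SALES, "product", year=2013)
      product_a_by_year = total_by(SALES, "year", product="A")
      location_x_by_product = total_by(SALES, "product", location="X")
      ```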
    • Clients
    • Pick your language Java Perl* Python* Ruby* Php* Javascript .Net scala clojure go Erlang Eventmachine Cli Smalltalk Ocaml
    • Spring Data
    • Spring Data Elasticsearch
      Easy to use Elasticsearch in a Spring-powered app
      - Configuring Elasticsearch client
      - Dedicated template for one-liners
      - Repository support
    • Configuration
      <beans xmlns:es="http://www.sf.org/schema/data/elasticsearch">
        <es:repositories base-package="com.acme" />
        <es:transport-client id="client" cluster-nodes="localhost:9300,someip:9300" />
      </beans>

      @Configuration
      @EnableElasticsearchRepositories(basePackages = "com/acme")
      static class Config {
        @Bean
        public ElasticsearchOperations elasticsearchTemplate() {
          return new ElasticsearchTemplate(nodeBuilder().local(true).node().client());
        }
      }
    • Dedicated Template
      - Create/delete index/mappings
      - Query options: Criteria, String, Search
      - Bulk operations
      - Scrolling/streaming
    • Repositories
      public interface BookRepository extends Repository<Book, String> {
        List<Book> findByNameAndPrice(String name, Integer price);
        List<Book> findByNameOrPrice(String name, Integer price);
        Page<Book> findByName(String name, Pageable page);
        Page<Book> findByNameNot(String name, Pageable page);
        Page<Book> findByPriceBetween(int price, Pageable page);
        Page<Book> findByNameLike(String name, Pageable page);
        @Query("{'bool' : {'must' : {'field' : {'message' : '?0'}}}}")
        Page<Book> findByMessage(String message, Pageable pageable);
      }
    • Sophisticated query creation
      Keyword: Example
      And/Or: findByNameAndPrice
      Is: findByName
      Not: findByNameNot
      Less/GreaterThanEqual: findByPriceLessThan
      Before/After: findByPriceAfter
      Starting/EndingWith: findByNameEndingWith
      Contains/Containing: findByNameContaining
      OrderBy: findByCountryOrderByName
      True/False: findByRetiredFalse
      Near: soon
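      The idea behind these derived finder methods is that the method name itself encodes the query. A toy Python sketch that splits a finder name into property names and And/Or operators; it is not Spring Data's actual parser, handles only the And/Or keywords, and the function name parse_finder is hypothetical:

      ```python
      import re


      def parse_finder(method_name):
          """Split e.g. 'findByNameAndPrice' into (['name', 'price'], ['And'])."""
          prefix = "findBy"
          if not method_name.startswith(prefix):
              raise ValueError("not a finder method: " + method_name)
          subject = method_name[len(prefix):]
          # Split on And/Or only when followed by an upper-case letter, so that
          # property names such as 'Order' are not cut at their leading 'Or'.
          parts = re.split(r"(And|Or)(?=[A-Z])", subject)
          fields = [part[0].lower() + part[1:] for part in parts[0::2]]
          operators = parts[1::2]
          return fields, operators


      and_query = parse_finder("findByNameAndPrice")
      or_query = parse_finder("findByNameOrPrice")
      ```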
    • Big Data
    • A Holistic View of a Big Data System Real Time Streams Analytics ETL Real-Time Processing (s4, storm) RT Semi structured Database (hBase, Cassandra, Mongo) Big SQL (Greenplum, AsterData, Etc…) Unstructured Data (HDFS) Batch Processing
    • Hadoop eco-system Map Reduce Framework (MapRed) Hadoop Distributed File System (HDFS)
    • Elasticsearch - Hadoop
      Read/write data to Hadoop transparently
      - Hadoop Input/OutputFormat
      - Cascading Tap
      - Pig Storage
      - Hive SerDe
      Native Map/Reduce model
    • Elasticsearch + Hadoop [Two bar charts, "Writing" and "Reading / Querying", each comparing Raw, M/R, Pig and Hive on a 0 to 60 axis]
    • Data Ingestion: DIY, Logstash, Flume, Graylog2, HDFS
    • Logstash
      Tool for managing events and logs
      - Collect, parse and store
      - Tons of inputs (~40), codecs (~11), filters (~40), outputs (~50)
    • Kibana
      Make sense of logging data
      - Runs inside your browser
      - Highly customizable
      - Leverages Elasticsearch aggregations/facets
    • Thank you! @costinl