ElasticSearch - Search in the Age of Clouds



Fast search with relevant results over large data sets is something we all now take for granted, anytime and anywhere. Search is no longer used only in classic scenarios such as enterprise search and web search; it organizes access to data and information in a wide variety of applications (keyword: search-based applications). Most of the commonly used search technologies are based on the Apache Lucene project. Among Lucene-based search servers there is now, alongside Apache Solr, a new star in the open source scene: ElasticSearch. This talk introduces ElasticSearch and its usage scenarios in depth and contrasts its capabilities with those of Lucene and Solr, particularly for large data volumes.

Published in: Technology

  1. ElasticSearch - Search in the Age of Clouds
      Christian Meder, Bernhard Pflugfelder, inovex GmbH
  2. Speaker: Christian Meder
      ‣ CTO @ inovex
      ‣ Background: open source (free software), Linux, Web, Java, Android
  3. Speaker: Bernhard Pflugfelder
      ‣ Big Data Engineer @ inovex
      ‣ bpflugfelder@inovex.de
      ‣ Background: Lucene, Solr, text mining technologies, information retrieval, Hadoop, Java
  4. Agenda
      ‣ Search is everywhere
      ‣ Elasticsearch
      ‣ Examples
      ‣ Overview
      ‣ Features
  5. Search, what?
  6. Search applications: Enterprise search
  7. Search applications: Online shops
  8. Search applications: Semantic search
  9. Search applications: Navigation & information access
  10. Search applications: Data analysis (source: http://datarpm.com/product)
  11. Search applications: Log-file analysis (source: http://kibana.org/)
  12. Search applications: Document store
  13. Search applications
      ‣ Can you think of other scenarios where search applications would also do a good job?
      ‣ Recall the key capabilities of search technologies:
          ‣ Persistence
          ‣ Flexible data model
          ‣ Unstructured data, but not only
          ‣ Extremely quick access to data
          ‣ Horizontal scalability
      There are plenty of application scenarios out there where search technologies should be considered!
  14. Open source search technologies
      ‣ Lucene: http://lucene.apache.org
      ‣ Solr: http://lucene.apache.org/solr/
      ‣ Elasticsearch: http://www.elasticsearch.org
  15. Overview: Lucene (http://lucene.apache.org/)
      Lucene is an open source, pure Java API for enabling information retrieval
      ‣ Originally developed by Doug Cutting in 1999; joined Apache (as part of Jakarta) in 2001 and became a TLP in 2005
      ‣ Licensed under the Apache License 2.0
      ‣ Pure Java library with ports and bindings for other platforms:
          ‣ Lucene.NET (http://lucenenet.apache.org)
          ‣ PyLucene (http://lucene.apache.org/pylucene/)
          ‣ and more: http://wiki.apache.org/lucene-java/LuceneImplementations
      ‣ Large and very active developer community, well documented and supported (38 active committers!)
      ‣ Current stable release: 4.2.1
      ‣ Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy
  16. Overview: Solr (http://lucene.apache.org/solr/)
      Solr is a standalone enterprise search server & document store based on Lucene
      ‣ Created by Yonik Seeley at CNET Networks in 2004
      ‣ Entered the Apache Incubator in 2006, became a TLP in 2007
      ‣ Licensed under the Apache License 2.0
      ‣ Seeley and others founded Lucid Imagination -> LucidWorks
      ‣ Large and very active developer community, well documented and supported (also a strong relationship to the Lucene community)
      ‣ Current stable release: 4.2.1
      ‣ Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/solr/PublicServers
  17. Search technologies
      “You know, for search” (Shay Banon)
  18. Motivation
      Elasticsearch is a “distributed-from-scratch” search server based on Lucene
      Created by Shay Banon, with a first version made public in 02/2010:
      “Elasticsearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…). I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )”
      [http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
  19. Overview: Elasticsearch
      ‣ Current stable version 0.20.6, working with Lucene 3.6
      ‣ Available version 0.90 RC2 includes Lucene 4.2.1 integration
      ‣ Licensed under the Apache License 2.0
      ‣ Small but growing group of core developers
      ‣ Strong support from valued Lucene committers
      ‣ Company elasticsearch.com founded in 2012
          ‣ By the people behind elasticsearch.org
          ‣ www.elasticsearch.com
  20. Customers
  21. GitHub
      ‣ Code search runs on a cluster
          ‣ 26 storage nodes holding the searchable data
          ‣ 8 client nodes coordinating query requests
      ‣ Each storage node has 2 TB of SSD-based storage
      ‣ 17 TB of indexed data is stored in the cluster
          ‣ sharded across the cluster with a replication factor of 1
          ‣ makes 34 TB of indexed data overall
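The storage figures on the GitHub slide follow from simple arithmetic: each primary shard is stored once plus once per replica. The sketch below reproduces the 34 TB total from the slide's numbers; the per-node average is an illustration only and assumes shards are spread evenly over the storage nodes, which the slide does not state.

```python
def total_index_size_tb(primary_tb: float, replicas: int) -> float:
    """Every primary shard is stored once, plus once per replica."""
    return primary_tb * (1 + replicas)


primary_tb = 17      # indexed data, from the slide
replicas = 1         # replication factor, from the slide
storage_nodes = 26   # storage nodes, from the slide

total_tb = total_index_size_tb(primary_tb, replicas)

# Assumption: shards are balanced evenly across all storage nodes.
per_node_tb = total_tb / storage_nodes

print(f"total indexed data: {total_tb} TB")
print(f"average per storage node: {per_node_tb:.2f} TB")
```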
  22. Quora
      ‣ Question-and-answer website
          ‣ aggregates questions and answers by topic
      ‣ Sources are the web in general and social media
      ‣ Goals for search:
          ‣ low latency for queries
          ‣ increased relevancy of results
      ‣ Evaluated Elasticsearch against Solr and Sphinx
      ‣ “After much benchmarking with our data set, we discovered that ElasticSearch was clearly the fastest of the possible search platforms we were considering.”
  23. Quora
      Source: http://www.quora.com/Full-Text-Search-on-Quora/What-technology-does-Quora-use-for-its-full-text-search-infrastructure/answer/Adrien-Lucas-Ecoffet?srid=pilt&share=1
  24. Soundcloud
      Source: http://bed-con.org/2013/wp-content/uploads/2013/04/Wie_SoundCloud_skaliert.pdf
  25. Moloch
      Source: https://github.com/aol/moloch
  26. Huffington Post
      Source: http://blogs.vmware.com/vfabric/2013/03/scaling-real-time-comments-huffpost-live-with-rabbitmq.html
  27. Search pipeline
  28. Highlights: Lucene
      ‣ Scalable, high-performance indexing
          ‣ over 95GB/hour on modern hardware
          ‣ small RAM requirements
          ‣ incremental indexing as fast as batch indexing
          ‣ index size roughly 20-30% the size of text indexed
      ‣ Powerful, accurate and efficient search algorithms
          ‣ ranked searching -- best results returned first
          ‣ many powerful query types
          ‣ fielded searching (e.g., title, author, contents)
          ‣ date-range searching
          ‣ sorting by any field
          ‣ multiple-index searching with merged results
          ‣ allows simultaneous update and searching
      [From http://lucene.apache.org/core/features.html]
  29. Overview: Elasticsearch
      ‣ Pure Java application
      ‣ Powered by Lucene
      ‣ Document-oriented
      ‣ Schema-less
      ‣ HTTP API with JSON in & out
          ‣ Indexing / updating
          ‣ Searching
          ‣ Administration / monitoring
      ‣ Extendable by plugins
      ‣ Distribution is a fundamental paradigm of Elasticsearch
  30. Architecture
      [Figure: three nodes (a master node and two further nodes) holding numbered primary shards and replica shards]
  31. Architecture
      ‣ Index distribution by auto sharding
      ‣ Automatic replication and balancing
      ‣ Fault tolerant + high availability
      ‣ Cluster building & management
          ‣ node detection through zen discovery
          ‣ nodes communicate via unicast / multicast
          ‣ automatic master election
          ‣ assignment of master / data node roles can be influenced
      ‣ Master is responsible for
          ‣ routing the search request
          ‣ including new nodes into the cluster
      ‣ Index / query routing (automatic / individual)
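Auto sharding and index/query routing boil down to mapping a routing value (by default the document ID) to one primary shard as `hash(routing) % number_of_primary_shards`. The following is a minimal sketch of that idea, not Elasticsearch's implementation: `zlib.crc32` is a stand-in hash chosen only because it is deterministic, and the shard count and document IDs are hypothetical.

```python
import zlib


def shard_for(routing_key: str, number_of_primary_shards: int) -> int:
    """Map a routing value to a primary shard: hash(routing) mod shard count.

    zlib.crc32 is a stand-in; it is NOT the hash Elasticsearch uses internally.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


NUM_SHARDS = 3  # hypothetical number of primary shards

# Automatic routing: the document ID is the routing value.
doc_ids = ["book-1", "book-2", "book-3", "book-4", "book-5"]  # hypothetical IDs
placement = {doc_id: shard_for(doc_id, NUM_SHARDS) for doc_id in doc_ids}
for doc_id, shard in placement.items():
    print(f"{doc_id} -> shard {shard}")

# Individual routing: documents indexed with the same routing value
# (here a hypothetical author key) always land on the same shard,
# so a query carrying that routing value only has to visit one shard.
author_shard = shard_for("author:melville", NUM_SHARDS)
print(f"all documents routed by 'author:melville' -> shard {author_shard}")
```

Because the shard count is part of the formula, changing the number of primary shards would send existing routing values to different shards; this is why the count is fixed when an index is created.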
  32. Elasticsearch-head (https://github.com/mobz/elasticsearch-head)
  33. Elasticsearch-head (https://github.com/mobz/elasticsearch-head)
  34. Schema-less, but
  35. Schema-less, but
      ‣ Define a mapping for type books
          # write the mapping for type "books" to a file
          echo '{"books" : {"properties" : {
            "id" : { "type" : "string" },
            "title" : { "type" : "string" },
            "author" : { "type" : "string" },
            "subject" : { "type" : "string" },
            "view_count" : { "type" : "integer" },
            "created" : { "type" : "date", "format" : "dateOptionalTime" }
          }}}' > book.json
          # register the mapping for type "books" on index "gutenberg"
          curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json
      ‣ Retrieve the current mapping for type books
          curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'
  36. Search highlights
      ‣ Search on terms, numeric values, dates, numeric ranges, date/time ranges
      ‣ Lots of query types
          ‣ terms, phrases, fuzzy, wildcard, ranges
          ‣ faceting, filtering
      ‣ Geospatial search (GeoShape query)
      ‣ Configurable caching for
          ‣ Filter queries
          ‣ Field values
      ‣ Near-real-time (NRT) search with a separate API
      ‣ Sorting, highlighting
      ‣ MoreLikeThis
      ‣ Multi tenancy
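Several of these highlights (full-text query, range filtering, faceting, highlighting, sorting) can be combined in a single JSON search body. The sketch below only builds and prints such a body; it sends nothing. It assumes the facet and `filtered` query syntax of the 0.20/0.90-era releases the talk covers (later releases replaced facets with aggregations), and it borrows the `gutenberg` index and the field names from the book mapping slide. The search values are hypothetical.

```python
import json


def build_search_body(text: str, min_views: int, created_since: str) -> dict:
    """Build a search body: match query, range filters, terms facet,
    highlighting and sorting (0.20/0.90-era query DSL)."""
    return {
        "query": {
            "filtered": {
                # full-text part: analyzed match on the title field
                "query": {"match": {"title": text}},
                # non-scoring part: numeric and date range filters
                "filter": {
                    "and": [
                        {"range": {"view_count": {"gte": min_views}}},
                        {"range": {"created": {"gte": created_since}}},
                    ]
                },
            }
        },
        # faceted search: count hits per subject
        "facets": {"subjects": {"terms": {"field": "subject", "size": 10}}},
        # return matching title fragments with the hit terms marked up
        "highlight": {"fields": {"title": {}}},
        # most viewed books first
        "sort": [{"view_count": {"order": "desc"}}],
    }


body = build_search_body("whale", min_views=100, created_since="2013-01-01")

# The body would be sent as the payload of a search request such as:
print("POST localhost:9200/gutenberg/books/_search")
print(json.dumps(body, indent=2))
```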
  37. Faceted search
  38. Suggestion
  39. Highlighting
  40. Local search
  41. Multi tenancy
  42. Some more features
      ‣ Gateway module stores cluster metadata to:
          ‣ Local FS, shared FS, Hadoop, Amazon S3
      ‣ River:
          ‣ Pluggable service that constantly pulls data
          ‣ Managed via a specific REST endpoint
          ‣ Implementations for CouchDB, MongoDB, JDBC, Solr, …
      ‣ Bulk indexing
          ‣ Default: single-document indexing
          ‣ Bulk indexing via specific REST endpoints
      ‣ Lucene analyzers are specified in elasticsearch.yml or via the API
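Bulk indexing uses the `_bulk` endpoint, whose payload is newline-delimited JSON: one action line followed by one document line per document, with a trailing newline at the end. The sketch below only assembles such a payload. The sample documents are hypothetical, and the `_type` entry in the action line reflects the typed indices of the releases discussed in the talk.

```python
import json


def build_bulk_body(index: str, doc_type: str, docs: list) -> str:
    """Return a newline-delimited _bulk payload: one action line and one
    source line per document, terminated by a final newline."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical sample documents using fields from the book mapping.
books = [
    {"id": "1", "title": "Moby Dick", "author": "Herman Melville", "view_count": 120},
    {"id": "2", "title": "Faust", "author": "Johann Wolfgang von Goethe", "view_count": 80},
]

bulk_body = build_bulk_body("gutenberg", "books", books)

# The payload would be sent as the body of: POST localhost:9200/_bulk
print(bulk_body, end="")
```

Sending many documents in one request avoids one HTTP round trip per document, which is the main reason to prefer it over single-document indexing for large imports.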
  43. Search API
      ‣ Query types such as term, terms, match, wildcard, fuzzy, range, …
      ‣ Multi Search
      ‣ Get
      ‣ Multi Get
      ‣ Filter
      ‣ Facets
      ‣ Highlighting
      ‣ Suggest
      ‣ MoreLikeThis
      ‣ Index boosting
      ‣ Explain
      ‣ Percolate
  44. Indices API
      ‣ Create, Delete, Exists, Open, Close, Optimize, Refresh, Flush, Settings
      ‣ Index templates (mappings + settings)
      ‣ Get, Put, Delete Mapping
      ‣ Get, Update Settings
      ‣ Snapshot
      ‣ Aliases
      ‣ Warmers
      ‣ Statistics, Status
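As an illustration of the aliases entry, an alias can be moved from one index to another in a single request to the `_aliases` endpoint, so clients that search the alias never see a gap. This is a minimal sketch that only builds the request body; the index names `gutenberg_v1` and `gutenberg_v2` are hypothetical and stand for an old index and a freshly rebuilt one.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that removes the alias from the old
    index and adds it to the new one within the same request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical index names for illustration.
alias_body = build_alias_swap("gutenberg", "gutenberg_v1", "gutenberg_v2")

# The body would be sent as: POST localhost:9200/_aliases
print(json.dumps(alias_body, indent=2))
```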
  45. Cluster API
      ‣ Live configuration of cluster settings
          ‣ minimum master nodes
          ‣ cache sizes
          ‣ routing
          ‣ allocation
      ‣ Moving shards
      ‣ Moving replicas
      ‣ Cluster health & status
      ‣ Nodes info & stats, shutdown of all / specific nodes
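For the minimum master nodes setting, the usual rule of thumb is a quorum of the master-eligible nodes, i.e. `N / 2 + 1` with integer division, which prevents two halves of a partitioned cluster from each electing their own master. The sketch below computes that value and builds a body for the cluster settings update endpoint. It assumes the zen discovery setting name `discovery.zen.minimum_master_nodes` used by the releases the talk covers; the node counts are hypothetical.

```python
import json


def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def build_settings_update(master_eligible_nodes: int, persistent: bool = False) -> dict:
    """Build a cluster settings update body. Transient settings are lost on
    a full cluster restart, persistent ones survive it."""
    scope = "persistent" if persistent else "transient"
    return {scope: {"discovery.zen.minimum_master_nodes": quorum(master_eligible_nodes)}}


# Hypothetical cluster with three master-eligible nodes.
settings_body = build_settings_update(3)

# The body would be sent as: PUT localhost:9200/_cluster/settings
print(json.dumps(settings_body, indent=2))
```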
  46. Pros & Cons
      + Elasticsearch feels lightweight
      + Simple but effective architecture
      + Ease of use, even when using distributed search
      + High maturity, even though ES is young
      + High-performance search (at least based on the benchmarks seen so far)
      + Modern technologies used (HTTP, JSON, NoXML, Java, Guava)
      - Still a small community and a small group of core developers
      - Missing data connectors (e.g. dataimporthandler)
      - Missing search features: grouping & search result clustering
      - Fewer query types
      - Fewer options for boosting (e.g. function queries)
      - Fewer analyzers
  47. Wrap up
      ‣ The world is becoming data-driven and user-driven
          ‣ large data volumes
          ‣ multiple sources
          ‣ many users need to be able to access the data
      ‣ Therefore search technologies such as Elasticsearch are becoming important:
          ‣ Easy aggregation of data from multiple sources
          ‣ Provide a unified access layer through search
          ‣ Scalable regarding data volume and users
          ‣ Highly configurable
      ‣ ElasticSearch is easy to use, distributed and scalable, and search is fast
  48. Thank you!