
Data Con LA 2022 - Pre-Recorded - OpenSearch: Everything You Need to Know About Its Architecture


Seth Muthukaruppan, Consultant at Instaclustr
Data Engineering
OpenSearch is a powerful, fully open source search engine and analytics suite for ingesting, searching, visualizing, and analyzing your data. This Apache 2.0-licensed, community-driven collection of technologies harnesses an architecture that combines the powers of Elasticsearch 7.10.2, Kibana 7.10.2, and Apache Lucene. With OpenSearch, users gain a distributed framework featuring powerful scalability, high availability, and database-like capabilities. Attendees at this Data Con LA presentation will come away understanding OpenSearch's architecture and its building-block technology components, including:
-- Apache Lucene utilization. Learn how OpenSearch uses this high-performance Java-based search library and its inverted search index to deliver fast search results (while supporting natural language, wildcard, fuzzy, and proximity searches).
-- OpenSearch cluster architecture. An OpenSearch cluster is a distributed and horizontally scalable collection of nodes, which are differentiated based on the operations they perform. Attendees will learn the specific functions of master, master-eligible, data, client, and ingest nodes.
-- Data organization. Understand how OpenSearch organizes data into indices (which contain documents, which contain fields).
-- Internal data structures. Get an in-depth look at how OpenSearch achieves scalability and reliability by breaking up indices into shards and segments and by using translogs.
-- Aggregations. See how OpenSearch enables its advanced built-in analytics capabilities through the power of aggregations.


  1. OpenSearch: (Just About) Everything You Need to Know About Its Architecture
     Seth Muthukaruppan, Consultant, Search Technologies, Instaclustr by NetApp
     Data Con LA 2022
     © Instaclustr Pty Limited, 2021
  2. Agenda
     ● OpenSearch Overview
     ● Use Cases
     ● Apache Lucene
     ● OpenSearch Clustering
     ● Data Organization
     ● Document Indexing
     ● Document Searching
     ● Aggregations
  3. ● OpenSearch is a search and analytics engine built with the Apache Lucene search library
     ● It extends Lucene to provide a distributed, horizontally scalable, and highly available search and analytics platform
     ● OpenSearch is derived from Elasticsearch 7.10.2 and Kibana 7.10.2 from Elastic
     ● OpenSearch is 100% open source and Apache 2.0 licensed: free to view, use, change, and distribute the code
     ● It is community driven and maintained by the open source community, with backing from industry leaders such as Amazon and Red Hat
  4. ● Enterprise-Grade: same core features with advanced add-ons
     ● 100% Open Source: Apache 2.0; free to view, use, change, and distribute the code
     ● Community-Driven: developed and maintained by the open source community
  5. OpenSearch: Core Features
     ● Search: extremely fast, powerful text search, natural language, built-in analyzers, fuzzy match, auto completion
     ● Scalable: distributed architecture, horizontally scalable, thousands of nodes, petabytes of data
     ● Analytics: faceting, aggregations, built-in reporting, anomaly detection
     ● Highly Available: data replication, zone awareness, snapshots, cross-cluster replication
     ● Ecosystem: Dashboards, Logstash, Beats, REST clients
  6. OpenSearch: Building Blocks
     ● OpenSearch: based on Elasticsearch 7.10.2, built with Apache Lucene, Elasticsearch wire compatible (7.10.2)
     ● OpenSearch Dashboards: based on Kibana 7.10.2
     ● Clients: compatible with Logstash, Beats, and REST clients for 7.10
     ● ES Upgrade Path: rolling upgrade from ES 7.x (7.10), restart upgrade from ES 6.x
  7. OpenSearch: Use Cases
  8. OpenSearch: Use Cases
     ● Log Analysis: search for patterns, normalize logs, correlate logs, filter logs
     ● Document Store: natural language search, search with auto-correction, search as you type, synonym search
     ● Analytics: historical data, trend analysis, forecasting
     ● e-Commerce: product search, product recommendations, auto completion, stock on hand, sales by category
  9. OpenSearch: More Use Cases
     ● Monitoring: network, hosts, sensors
     ● SIEM: threat analysis, integrity monitoring, anomaly detection, compliance
     ● APM: real-time performance, latency, load, failures
     ● Time Series: machine learning, anomaly detection
  10. OpenSearch: Apache Lucene
  11. Apache Lucene: Overview
      ● Lucene is an open source, high-performance search library built with Java
      ● It is used by popular search engines and crawlers such as Apache Solr, Apache Nutch, OpenSearch, and Elasticsearch
      ● Lucene uses an inverted search index to achieve very fast search results
      ● The inverted search index maps each term to the documents that contain it
      ● Lucene supports storing several types of information such as numbers, strings, and text fields
      ● Lucene has a rich search interface with support for natural language, wildcard, fuzzy, and proximity searches
  12. Apache Lucene: Inverted Index
      Documents:
        1  OpenSearch is a search and analytics suite
        2  OpenSearch is ALv2 Licensed
        3  OpenSearch includes OpenSearch and OpenSearch Dashboards
      Term         Frequency   Document
      opensearch   1,1,3       1,2,3
      search       1           1
      analytics    1           1
      suite        1           1
      alv          1           2
      licensed     1           2
      includes     1           3
      dashboards   1           3
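
The table on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the inverted-index idea, not Lucene's implementation: the regex tokenizer and the stop-word list (`is`, `a`, `and`) are assumptions chosen so that the output matches the slide's terms.

```python
import re
from collections import Counter, defaultdict

# The three example documents from the slide, keyed by document id.
DOCS = {
    1: "OpenSearch is a search and analytics suite",
    2: "OpenSearch is ALv2 Licensed",
    3: "OpenSearch includes OpenSearch and OpenSearch Dashboards",
}

# Assumed stop-word list: the slide's table omits these terms.
STOP_WORDS = {"is", "a", "and"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to {document id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


INDEX = build_inverted_index(DOCS)


def search(term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(INDEX.get(term.lower(), {}))
```

A lookup such as `search("dashboards")` only touches the entry for that term, which is why term lookups stay fast as the number of documents grows.
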
  13. OpenSearch: Cluster
  14. OpenSearch Cluster: Basics
      ● Lucene is a search library but not a scalable search engine
      ● OpenSearch uses Lucene at the core for search but has additional capabilities that make it a full-featured search and analytics engine
      ● An OpenSearch cluster is a distributed collection of nodes that each perform one or more cluster operations
      ● The cluster is horizontally scalable: adding nodes lets cluster capacity grow linearly while maintaining similar performance
      ● By replicating data across nodes in the cluster, OpenSearch can handle node failures without data loss or downtime
      ● Nodes in the cluster are differentiated by the specific functions they perform, although a node can perform any or all cluster operations
  15. OpenSearch Cluster: Composition
      [Diagram: a cluster made up of one master node, two master-eligible nodes, two client nodes, and three data nodes]
  16. OpenSearch Cluster: Node Types
      ● Master
        ○ Responsible for maintaining the health and state of the cluster
        ○ Coordinator for creating, deleting, and managing indices and shards
      ● Master-eligible nodes
        ○ Candidates for the master role; only one node is master at any given time
        ○ An odd number of nodes is required for tie-breaking
      ● Data nodes
        ○ Hold the actual data and handle ingestion, search, and aggregation
        ○ Run CPU- and memory-intensive operations
      ● Client nodes
        ○ Act as a gateway and help load balance incoming requests
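
The point about an odd number of master-eligible nodes follows from majority voting. The sketch below is generic quorum arithmetic, assuming a simple majority rule; it is not OpenSearch's election code, and the function names are hypothetical.

```python
def quorum(master_eligible_nodes):
    """Smallest number of votes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("at least one master-eligible node is needed")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)
```

Under this rule, three and four master-eligible nodes both tolerate a single failure, so the fourth node adds cost without adding resilience; moving to five raises the tolerance to two.
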
  17. OpenSearch: Data Organization
  18. Data Organization: Overview
  19. Data Organization: Indices
      ● An index is the basic unit by which end users manage their data
        ○ Similar to a collection in a NoSQL database
      ● Indices contain one or more documents, each of which can be
        ○ a paragraph from a book
        ○ a log line
        ○ a tweet
        ○ weather data for a city
      ● Typically, similar documents are grouped into the same index
      ● Indices are internally broken down into multiple sub-indices called shards
      ● Shards are then directly mapped to Lucene indices
  20. Data Organization: Documents
      ● Documents are JSON structures that hold a collection of fields and values
        ○ Fields are key-value pairs that make up a document
        ○ Fields can be of several different types such as numbers, text, keywords, geo points, etc.
      ● Typically, documents in an index have similar content
  21. Data Organization: Shards
      ● An index is broken up into one or more smaller units called shards
      ● Each shard maps to an underlying Lucene unit called an index
        ○ In other words, each OpenSearch index is mapped to one or more Lucene indices, aka shards
      ● The number of shards per index is a configurable parameter and has major implications for the performance of the cluster
      ● Search operations are performed at the shard level, and having multiple shards helps increase search speed
      ● Increasing the number of shards increases the cluster state information, which means more resources are needed to manage it
        ○ General practice is to keep the shard size between 30 and 50 GB
  22. Data Organization: Primary/Replica Shards
      ● To guard against data loss, OpenSearch allows configuring replicas
      ● As the index is stored in shards, configuring replicas causes replica shards to be created and stored
      ● OpenSearch tries to allocate replica shards to nodes other than the ones where the primary shard resides
      ● The number of replicas is an index-level setting and can be changed at any time
      ● With replicas, a node failure doesn't lead to data loss or data unavailability
        ○ Data can still be served from the replica copies
      ● Replica shards come with a price
        ○ Storing replicas requires additional storage space
        ○ Replicas can slow down indexing, as data needs to be indexed into both primary and replica shards
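
The sizing guidance and the cost of replicas can be turned into back-of-the-envelope arithmetic. This is a rough planning sketch, not an OpenSearch API: `plan_shards` is a hypothetical helper, and the 40 GB default target is an assumption taken from the middle of the 30 to 50 GB range mentioned above.

```python
import math


def plan_shards(index_size_gb, replicas=1, target_shard_gb=40):
    """Estimate shard counts and storage for an index of the given primary size.

    index_size_gb is the size of the primary data only; each replica adds
    one more full copy of every primary shard.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0 or replicas < 0:
        raise ValueError("sizes must be positive and replicas non-negative")
    primaries = math.ceil(index_size_gb / target_shard_gb)
    return {
        "primary_shards": primaries,
        "total_shards": primaries * (1 + replicas),
        "shard_size_gb": index_size_gb / primaries,
        "total_storage_gb": index_size_gb * (1 + replicas),
    }
```

For example, `plan_shards(400, replicas=2)` suggests 10 primary shards of 40 GB each, 30 shard copies in total, and 1200 GB of storage, which makes the storage price of replication explicit.
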
  23. OpenSearch: Document Indexing
  24. Document Indexing: Basics
      ● Input data to Elasticsearch/OpenSearch is analyzed and tokenized before it gets stored
        ○ OpenSearch also stores the original document in a special field called _source
      ● Analyzers and normalizers convert the input fields into a sequence of terms, which then get stored in the Lucene inverted index
      ● Pre-built analyzers and normalizers are available for common use cases
        ○ The standard analyzer breaks text into grammar-based tokens
        ○ The whitespace analyzer breaks text into terms based on whitespace
  25. Document Indexing: Analyzers
      ● An analyzer is a combination of character filters, tokenizers, and token filters
      ● Custom analyzers can be built using the appropriate set of filters and tokenizers
  26. Document Indexing: Character Filters
      ● Character filters pre-process the input text before forwarding it to the tokenizer
      ● They work by adding, removing, or changing characters in the input text
      ● For example, the built-in HTML strip character filter strips HTML elements and decodes HTML entities
      ● Multiple character filters can be specified, and they are applied in order
  27. Document Indexing: Tokenizers and Token Filters
      ● Tokenizers convert the input stream of characters into tokens based on certain criteria
        ○ For instance, the standard tokenizer breaks text into tokens based on word boundaries and also removes punctuation
        ○ The whitespace tokenizer breaks text into tokens at whitespace
      ● Token filters post-process the tokens from the tokenizer
        ○ Tokens can be added, removed, or modified
        ○ For example, the ASCII folding filter converts Unicode characters to the closest ASCII equivalent
        ○ The stemming token filter applies stemming rules to convert words to their root form
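
The three analyzer stages can be sketched in plain Python. This is an illustration of the character filter, tokenizer, and token filter pipeline, assuming simplified stand-ins for the built-in components; the function names are hypothetical and the behavior is much cruder than the real analyzers.

```python
import html
import re
import unicodedata


def html_strip_char_filter(text):
    """Character filter: remove HTML tags, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def word_tokenizer(text):
    """Tokenizer: split into runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def accent_folding_filter(tokens):
    """Token filter: strip combining accents so 'é' becomes 'e'."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def analyze(text, char_filters, tokenizer, token_filters):
    """Apply character filters in order, tokenize, then apply token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens
```

Calling `analyze("<p>Caf&eacute; r&eacute;sum&eacute;s, FAST!</p>", [html_strip_char_filter], word_tokenizer, [lowercase_filter, accent_folding_filter])` yields `["cafe", "resumes", "fast"]`; these are the terms that would go into the inverted index.
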
  28. Document Indexing: Field Data Types
      ● Text Type
        ○ Primarily used to index human-generated text such as tweets, social media posts, book contents, and product descriptions
        ○ Text fields are particularly useful for performing phrase queries, fuzzy queries, etc.
      ● Keyword Type
        ○ Typically used for indexing structured content such as names, IDs, ISBNs, categories, etc.
        ○ Keyword fields are particularly useful for sorting, aggregations, and running scripts, as they are stored in a columnar format
  29. Document Indexing: More Field Data Types
      ● Numeric Type
        ○ Used for numeric data such as integers, unsigned integers, and floats
        ○ When choosing a numeric type, pick the smallest type that fits the input range to conserve storage space
        ○ Numeric fields are stored as BKD trees
      ● Geo Point Type
        ○ Geo point is used to represent latitude and longitude data
        ○ With geo points, queries that rely on location or distance can be performed
        ○ BKD trees are used to store geo points
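
As an illustration of how these field types are declared, the sketch below builds an index mapping as a Python dictionary in the shape of the JSON body used when creating an index. The field names and the sample document are hypothetical; only the type names (`text`, `keyword`, `integer`, `float`, `geo_point`) come from the field types discussed on these slides.

```python
import json

# Hypothetical mapping for a product index: one field per type discussed above.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},     # analyzed, for full-text queries
            "isbn": {"type": "keyword"},         # exact match, sorting, aggregations
            "stock": {"type": "integer"},        # smallest numeric type that fits
            "price": {"type": "float"},
            "warehouse": {"type": "geo_point"},  # latitude/longitude pair
        }
    }
}

# Hypothetical document that conforms to the mapping.
SAMPLE_DOC = {
    "description": "An introduction to search engines",
    "isbn": "978-0-00-000000-0",
    "stock": 12,
    "price": 29.5,
    "warehouse": {"lat": 34.05, "lon": -118.24},
}


def field_types(index_body):
    """Return {field name: declared type} for a mapping body."""
    properties = index_body["mappings"]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


def to_json(body):
    """Serialize a request body the way it would be sent over REST."""
    return json.dumps(body, indent=2)
```

`to_json(INDEX_BODY)` produces the JSON text of the mapping; how it is sent (client library, curl, and so on) is left out on purpose.
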
  30. OpenSearch: Document Searching
  31. Document Searching: Basics
      ● OpenSearch uses a distributed search algorithm to match documents against the input
      ● Search can be exact-match based, such as keyword searches, or relevancy based, such as text searches
      ● OpenSearch focuses more on search speed than accuracy; the level of required accuracy is typically configurable
      ● OpenSearch provides near real-time search: all documents are available for search except the most recently indexed ones that have not yet been refreshed. By default, documents are refreshed every second
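
Near real-time search can be pictured as a write buffer that becomes searchable only on refresh. The class below is a toy model of that behavior under simplifying assumptions (documents are plain strings, and refresh is triggered manually rather than on a timer); it does not reflect OpenSearch internals.

```python
class NearRealTimeIndex:
    """Toy index: documents become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc):
        """Accept a document; it is not searchable until the next refresh."""
        self._buffer.append(doc)

    def refresh(self):
        """Make every buffered document visible to search."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        """Return the searchable documents that contain the word."""
        return [doc for doc in self._searchable if word in doc]
```

A document indexed just before a search is missed until `refresh()` runs, which is the one-second window the slide describes.
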
  32. Document Searching: Search Phases
      ● OpenSearch's distributed search algorithm uses a query phase and a fetch phase
      ● Query phase
        ○ The query is sent to all the shards associated with the index; shards can be primary or replica
        ○ Each shard runs the search locally and returns the results
        ○ Results contain only the document ids, scores, and other relevant metadata, not the actual documents
      ● Fetch phase
        ○ Query results from all the shards are ordered to form the final set of results
        ○ A fetch is performed to retrieve the actual documents from the nodes
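
The two phases can be sketched as a scatter-gather over in-memory shards. This is a simplified model: the shard contents and scores are hypothetical and hard-coded, whereas a real shard would compute scores per query.

```python
import heapq

# Hypothetical shards: each maps a document id to (score, document body).
SHARDS = [
    {"a": (2.1, {"title": "OpenSearch basics"}), "b": (0.4, {"title": "Logging"})},
    {"c": (3.0, {"title": "OpenSearch clusters"}), "d": (1.2, {"title": "Shards"})},
]


def query_phase(shard_no, shard, size):
    """Return only (score, doc id, shard number) for the shard's local top hits."""
    hits = [(score, doc_id, shard_no) for doc_id, (score, _body) in shard.items()]
    return heapq.nlargest(size, hits)


def fetch_phase(shards, hits):
    """Retrieve the full documents for the globally ranked hits."""
    return [shards[shard_no][doc_id][1] for _score, doc_id, shard_no in hits]


def search(shards, size):
    """Scatter the query to every shard, merge the hits, then fetch documents."""
    local_hits = []
    for shard_no, shard in enumerate(shards):
        local_hits.extend(query_phase(shard_no, shard, size))
    top_hits = heapq.nlargest(size, local_hits)  # global ordering across shards
    return fetch_phase(shards, top_hits)
```

Because the query phase moves only ids and scores, the full documents are transferred for the final `size` hits alone rather than for every local candidate.
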
  33. Document Searching: Scoring Algorithm
      ● Relevancy searching, such as searching for words in text fields, involves scoring to determine which documents are the closest match
      ● Document scoring is performed by the similarity module; the default similarity is BM25
      Score = TF * IDF * Norm
      ● TF: how frequently the given term appears in the field. The more often the term appears in the document, the more likely the document is to be relevant
      ● IDF: how frequently the term appears across all documents in the index. The more common the term, the less relevant it is
      ● Norm: length normalization. For the same term frequency, a shorter field is more relevant than a longer field
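
To make the three factors concrete, here is the textbook form of the BM25 score for a single term. It is a sketch of the published formula with the commonly used defaults `k1 = 1.2` and `b = 0.75`, and it folds TF and length normalization into one saturating factor; the exact constants and details inside Lucene may differ, so treat the numbers as illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field.

    tf          -- occurrences of the term in the field (TF)
    doc_len     -- length of the field in terms
    avg_doc_len -- average field length across the index
    n_docs      -- number of documents in the index
    doc_freq    -- number of documents that contain the term
    """
    # IDF: rarer terms (small doc_freq) get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Norm: fields longer than average are penalized, shorter ones rewarded.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # TF saturates: each extra occurrence adds less than the previous one.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

With every other input held fixed, the score rises with `tf`, falls as `doc_freq` grows, and falls as `doc_len` grows, matching the TF, IDF, and Norm descriptions above.
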
  34. OpenSearch: Aggregations
  35. Aggregations: Basics
      ● OpenSearch is not only a search engine but also has built-in, advanced analytics capabilities
      ● Aggregations allow filtering and categorizing documents, calculating metrics, and building aggregation pipelines by combining multiple aggregations
      ● OpenSearch supports many aggregation types, which can be classified as
        ○ Metrics aggregations
        ○ Bucket aggregations
        ○ Pipeline aggregations
  36. Aggregations: Different Types
  37. Aggregations: Bucket Aggregations
      ● Bucket aggregations categorize the matching set of documents into buckets based on bucketing criteria
      ● Bucketing criteria can be based on the unique values of a field (terms aggregation), date ranges (date histogram aggregation), etc.
      ● Bucket aggregations can be used to
        ○ paginate all buckets (composite aggregation)
        ○ provide faceting
        ○ act as inputs to metric aggregations
  38. Aggregations: Metric Aggregations
      ● Metrics aggregations calculate metrics on values generated from the documents. The values can be specific fields in the documents being aggregated or can be generated dynamically through scripts
      ● They can be included as sub-aggregations of bucket aggregations, in which case they produce metrics per bucket
      ● Numeric metric aggregations produce numeric metrics such as max, min, sum, and average values
      ● Some metric aggregations produce non-numeric output. A good example is the top hits aggregation: used as a sub-aggregation, it produces the top matching documents per bucket
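
A bucket aggregation with a metric sub-aggregation can be emulated in plain Python. The sketch below mimics a terms aggregation with an average computed per bucket; the documents and field names are hypothetical, and the functions are stand-ins rather than OpenSearch calls.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical product documents.
DOCS = [
    {"category": "books", "price": 10.0},
    {"category": "books", "price": 20.0},
    {"category": "games", "price": 60.0},
]


def terms_aggregation(docs, field):
    """Bucket aggregation: group documents by the unique values of a field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def avg_metric(docs, field):
    """Metric aggregation: average of a numeric field over a set of documents."""
    return mean(doc[field] for doc in docs)


def terms_with_avg(docs, bucket_field, metric_field):
    """Run the metric as a sub-aggregation, producing one value per bucket."""
    return {
        key: {"doc_count": len(members), "avg": avg_metric(members, metric_field)}
        for key, members in terms_aggregation(docs, bucket_field).items()
    }
```

`terms_with_avg(DOCS, "category", "price")` returns a document count and an average price for each category, which is the kind of result behind a "sales by category" facet.
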
  39. Aggregations: Pipeline Aggregations
      ● Pipeline aggregations can be used to compute metrics; they act on the output of other aggregations, making it possible to build a chain of aggregations
      ● Pipeline aggregations can be further categorized as parent and sibling pipeline aggregations
        ○ Parent pipeline aggregations compute new aggregations based on the output of the parent aggregation
        ○ Sibling pipeline aggregations compute new aggregations based on output from one or more sibling aggregations
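
The difference between parent and sibling pipelines shows up in where the result lands. In the sketch below, the functions are named after the `cumulative_sum` (parent) and `max_bucket` (sibling) pipeline aggregations, but they are plain-Python emulations over hypothetical bucket output, not the real implementations.

```python
from itertools import accumulate

# Hypothetical output of a date histogram with a sum sub-aggregation.
MONTHLY_SALES = [
    {"key": "2022-01", "sales": 100.0},
    {"key": "2022-02", "sales": 250.0},
    {"key": "2022-03", "sales": 150.0},
]


def cumulative_sum(buckets, metric):
    """Parent pipeline: add a running total inside each bucket of the parent."""
    totals = accumulate(bucket[metric] for bucket in buckets)
    return [{**bucket, "cumulative": total} for bucket, total in zip(buckets, totals)]


def max_bucket(buckets, metric):
    """Sibling pipeline: produce one new value alongside the sibling's buckets."""
    best = max(buckets, key=lambda bucket: bucket[metric])
    return {"keys": [best["key"]], "value": best[metric]}
```

The parent-style function returns the same buckets with an extra value in each, while the sibling-style function returns a single summary next to them.
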
  40. OpenSearch: Ecosystem
  41. Questions & Comments
      Seth Muthukaruppan
      Seth.Muthukaruppan@netapp.com
      linkedin.com/Seth.Muthukaruppan
