Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business
Besides the quality of results, the time that it takes from the submission of a query to the display of results is of utmost importance to user satisfaction. Within search engine implementations such as Apache Lucene, significant development efforts are hence directed towards reducing query latency. In this session, I will explain reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them.To make the presented material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use-case.
In 40 minutes the audience will learn a variety of ways to make postgresql database suddenly go out of memory on a box with half a terabyte of RAM.
Developer's and DBA's best practices for preventing this will also be discussed, as well as a bit of Postgres and Linux memory management internals.
P99 Pursuit: 8 Years of Battling P99 LatencyScyllaDB
Performance engineering is a Sisyphean hill climb for perfection. Those who climb the hill are hardly ever satisfied with the results. You should always ask yourself where the bottleneck is today and what’s holding you back. Great performance improves your software. It enables you to run fewer layers, manage 10x less machines, simplifies your stack, and more.
In this keynote session, ScyllaDB CEO Dor Laor will cover the principles for successful creation of projects like ScyllaDB, KVM, the Linux kernel and explain why they spurred his vision for the P99 CONF.
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
Zeus is an efficient, highly scalable and distributed shuffle as a service which is powering all Data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in industry which leads to many issues such as hardware failures (Burn out Disks), reliability and scalability challenges.
In 40 minutes the audience will learn a variety of ways to make postgresql database suddenly go out of memory on a box with half a terabyte of RAM.
Developer's and DBA's best practices for preventing this will also be discussed, as well as a bit of Postgres and Linux memory management internals.
P99 Pursuit: 8 Years of Battling P99 LatencyScyllaDB
Performance engineering is a Sisyphean hill climb for perfection. Those who climb the hill are hardly ever satisfied with the results. You should always ask yourself where the bottleneck is today and what’s holding you back. Great performance improves your software. It enables you to run fewer layers, manage 10x less machines, simplifies your stack, and more.
In this keynote session, ScyllaDB CEO Dor Laor will cover the principles for successful creation of projects like ScyllaDB, KVM, the Linux kernel and explain why they spurred his vision for the P99 CONF.
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
Zeus is an efficient, highly scalable and distributed shuffle as a service which is powering all Data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in industry which leads to many issues such as hardware failures (Burn out Disks), reliability and scalability challenges.
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Common issues with Apache Kafka® Producerconfluent
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session will be about a common issue in the Kafka Producer: producer batch expiry. We will be discussing the Kafka Producer internals, its common causes, such as a slow network or small batching, and how to overcome them. We will also be sharing some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
How Netflix run Apache Flink at very large scale in these two scenarios. (1) Thousands of stateless routing jobs in the context of Keystone data pipeline (2) single large state job with many TBs of state and parallelism at a couple thousands
Video: https://data-artisans.com/flink-forward-berlin/resources/monitoring-flink-with-prometheus
Live Demo Code: https://github.com/mbode/flink-prometheus-example
Prometheus is a cloud-native monitoring system prioritizing reliability and simplicity – and Flink works really well with it! This session will show you how to leverage the Flink metrics system together with Pronetheus to improve the observability of your jobs. There will be a live demo establishing how everything ties in together. The talk is aimed at people already building and running Flink jobs who would like to gain more insight into them. It is fine if you are not familiar with Prometheus yet as the basic concepts will be introduced. If you have ever wondered how you could use modern monitoring tools to be alerted in the middle of the night in case your Flink job‘s 99th percentile end-to-end latency degraded for some reason, this might just be the talk you are looking for.
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flinkn Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
PostgreSQL and ZFS were made for each other. This talk dives downstack into the internals and way that PostgreSQL consumes disk resources and tricks that are available if you run PostgreSQL on ZFS (ZFS on Linux, ZFS on FreeBSD, or ZFS on Illumos). Topics covered will include:
* Performance and sizing considerations
* Workload estimation heuristics
* Standard administrative practices that leverage ZFS
* Recovery using ZFS
* Performing database migrations using ZFS
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Scylla Summit 2022: Scylla 5.0 New Features, Part 1ScyllaDB
Discover the new features and capabilities of Scylla Open Source 5.0 directly from the engineers who developed it. This second block of lightning talks will cover the following topics:
- New IO Scheduler and Disk Parallelism
- Per-Service-Level Timeouts
- Better Workload Estimation for Backpressure and Out-of-Memory Conditions
- Large Partition Handling Improvements
- Optimizing Reverse Queries
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Kafka is becoming an ever more popular choice for users to help enable fast data and Streaming. Kafka provides a wide landscape of configuration to allow you to tweak its performance profile. Understanding the internals of Kafka is critical for picking your ideal configuration. Depending on your use case and data needs, different settings will perform very differently. Lets walk through performance essentials of Kafka. Let's talk about how your Consumer configuration, can speed up or slow down the flow of messages to Brokers. Lets talk about message keys, their implications and their impact on partition performance. Lets talk about how to figure out how many partitions and how many Brokers you should have. Let's discuss consumers and what effects their performance. How do you combine all of these choices and develop the best strategy moving forward? How do you test performance of Kafka? I will attempt a live demo with the help of Zeppelin to show in real time how to tune for performance.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
Agreement in a distributed system is complicated but required. Scylla gained lightweight transactions through Paxos but the latter has a cost of 3X roundtrips. Raft can allow consistent transactions without the performance penalty. Beyond LWT, we plan to integrate Raft with most aspects of Scylla making a leap forward in manageability and consistency
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
****UPDATE: Project is now open sourced at https://www.github.com/industrydive/fileflow****
From Pydata DC 2016
Description
Data warehousing and analytics projects can, like ours, start out small - and fragile. With an organically growing mess of scripts glued together and triggered by cron jobs hiding on different servers, we needed better plumbing. After perusing the data pipelining landscape, we landed on Airflow, an Apache incubating batch processing pipelining and scheduler tool from Airbnb.
Abstract
The power of any reporting tool breaks based on the data behind it, so when our data warehousing process got too big for its humble origins, we searched for something better. After testing out several options such as Drake, Pydoit, Luigi, AWS Data Pipeline, and Pinball, we landed on Airflow, an Apache incubating batch processing pipelining and scheduler tool originating from Airbnb, that provides the benefits of pipeline construction as directed acyclic graphs (DAGs), along with a scheduler that can handle alerting, retries, callbacks and more to make your pipeline robust. This talk will discuss the value of DAG based pipelines for data processing workflows, highlight useful features in all of the pipelining projects we tested, and dive into some of the specific challenges (like time travel) and successes (like time travel!) we’ve experienced using Airflow to productionize our data engineering tasks. By the end of this talk, you will learn
- pros and cons of several Python-based/Python-supporting data pipelining libraries
- the design paradigm behind Airflow, an Apache incubating data pipelining and scheduling service, and what it is good for
- some epic fails to avoid and some epic wins to emulate from our experience porting our data engineering tasks to a more robust system
- some quick-start tips for implementing Airflow at your organization.
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Common issues with Apache Kafka® Producerconfluent
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session will be about a common issue in the Kafka Producer: producer batch expiry. We will be discussing the Kafka Producer internals, its common causes, such as a slow network or small batching, and how to overcome them. We will also be sharing some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
How Netflix run Apache Flink at very large scale in these two scenarios. (1) Thousands of stateless routing jobs in the context of Keystone data pipeline (2) single large state job with many TBs of state and parallelism at a couple thousands
Video: https://data-artisans.com/flink-forward-berlin/resources/monitoring-flink-with-prometheus
Live Demo Code: https://github.com/mbode/flink-prometheus-example
Prometheus is a cloud-native monitoring system prioritizing reliability and simplicity – and Flink works really well with it! This session will show you how to leverage the Flink metrics system together with Pronetheus to improve the observability of your jobs. There will be a live demo establishing how everything ties in together. The talk is aimed at people already building and running Flink jobs who would like to gain more insight into them. It is fine if you are not familiar with Prometheus yet as the basic concepts will be introduced. If you have ever wondered how you could use modern monitoring tools to be alerted in the middle of the night in case your Flink job‘s 99th percentile end-to-end latency degraded for some reason, this might just be the talk you are looking for.
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flinkn Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
PostgreSQL and ZFS were made for each other. This talk dives downstack into the internals and way that PostgreSQL consumes disk resources and tricks that are available if you run PostgreSQL on ZFS (ZFS on Linux, ZFS on FreeBSD, or ZFS on Illumos). Topics covered will include:
* Performance and sizing considerations
* Workload estimation heuristics
* Standard administrative practices that leverage ZFS
* Recovery using ZFS
* Performing database migrations using ZFS
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Scylla Summit 2022: Scylla 5.0 New Features, Part 1ScyllaDB
Discover the new features and capabilities of Scylla Open Source 5.0 directly from the engineers who developed it. This second block of lightning talks will cover the following topics:
- New IO Scheduler and Disk Parallelism
- Per-Service-Level Timeouts
- Better Workload Estimation for Backpressure and Out-of-Memory Conditions
- Large Partition Handling Improvements
- Optimizing Reverse Queries
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Kafka is becoming an ever more popular choice for users to help enable fast data and Streaming. Kafka provides a wide landscape of configuration to allow you to tweak its performance profile. Understanding the internals of Kafka is critical for picking your ideal configuration. Depending on your use case and data needs, different settings will perform very differently. Lets walk through performance essentials of Kafka. Let's talk about how your Consumer configuration, can speed up or slow down the flow of messages to Brokers. Lets talk about message keys, their implications and their impact on partition performance. Lets talk about how to figure out how many partitions and how many Brokers you should have. Let's discuss consumers and what effects their performance. How do you combine all of these choices and develop the best strategy moving forward? How do you test performance of Kafka? I will attempt a live demo with the help of Zeppelin to show in real time how to tune for performance.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
Agreement in a distributed system is complicated but required. Scylla gained lightweight transactions through Paxos but the latter has a cost of 3X roundtrips. Raft can allow consistent transactions without the performance penalty. Beyond LWT, we plan to integrate Raft with most aspects of Scylla making a leap forward in manageability and consistency
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
****UPDATE: Project is now open sourced at https://www.github.com/industrydive/fileflow****
From Pydata DC 2016
Description
Data warehousing and analytics projects can, like ours, start out small - and fragile. With an organically growing mess of scripts glued together and triggered by cron jobs hiding on different servers, we needed better plumbing. After perusing the data pipelining landscape, we landed on Airflow, an Apache incubating batch processing pipelining and scheduler tool from Airbnb.
Abstract
The power of any reporting tool breaks based on the data behind it, so when our data warehousing process got too big for its humble origins, we searched for something better. After testing out several options such as Drake, Pydoit, Luigi, AWS Data Pipeline, and Pinball, we landed on Airflow, an Apache incubating batch processing pipelining and scheduler tool originating from Airbnb, that provides the benefits of pipeline construction as directed acyclic graphs (DAGs), along with a scheduler that can handle alerting, retries, callbacks and more to make your pipeline robust. This talk will discuss the value of DAG based pipelines for data processing workflows, highlight useful features in all of the pipelining projects we tested, and dive into some of the specific challenges (like time travel) and successes (like time travel!) we’ve experienced using Airflow to productionize our data engineering tasks. By the end of this talk, you will learn
- pros and cons of several Python-based/Python-supporting data pipelining libraries
- the design paradigm behind Airflow, an Apache incubating data pipelining and scheduling service, and what it is good for
- some epic fails to avoid and some epic wins to emulate from our experience porting our data engineering tasks to a more robust system
- some quick-start tips for implementing Airflow at your organization.
PAGOdA (Pay-as-you-go OWL Query Answering Using a Triple Store) presentation by Bernardo Cuenca Grau
Abstract: We present an enhanced hybrid approach to OWL query answering that combines an RDF triple-store with an OWL reasoner in order to provide scalable pay-as-you-go performance. The enhancements presented here include an extension to deal with arbitrary OWL ontologies, and optimisations that significantly improve scalability. We have implemented these techniques in a prototype system, a preliminary evaluation of which has produced very encouraging results.
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
Two presentation from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19 by Cengage Learning search platform development team.
Scaling Performance Tuning With Lucene by John Nader discusses primary performance hot spots related to scaling to a multi-million document collection. This includes the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. Discusses both the tools used to identify issues and the techniques used to address them.
Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. Goes over IBM paper and describe the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, Term Frequency Average Normalization. Talks about Pseudo Relevance Feedback.
Using SigOpt to Tune Deep Learning Models with Nervana CloudSigOpt
In this talk I'll show how the Bayesian Optimization methods used by SigOpt, coupled with the incredibly scalable deep learning architecture provided with ncloud and neon, allow anyone it easily tune their models to quickly achieve higher accuracy. I'll walk through the techniques and show an explicit example with results.
1 1/2 years ago we have rolled out a new integrated full-text search engine for our Intranet based on Apache Solr. The search engine integrates various data sources such as file systems, wikis, internal websites and web applications, shared calendars, our corporate database, CRM system, email archive, task management and defect tracking etc. This talk is an experience report about some of the good things, the bad things and the surprising things we have encountered over two years of developing with, operating and using a Intranet search engine based on Apache Solr.
After setting the scene, we will discuss some interesting requirements that we have for our search engine and how we solved them with Apache Solr (or at least tried to solve). Using these concrete examples, we will discuss some interesting features and limitations of Apache Solr.
In the second part of the talk, we will tell a couple of "war stories" and walk through some interesting, annoying and surprising problems that we faced, how we analyzed the issues, identified the cause of the problems and eventually solved them.
The talk is aimed at software developers and architects with some basic knowledge about Apache Solr, the Apache Lucene project familiy or similar full-text search engines. It is not an introduction into Apache Solr and we will dive right into the interesting and juicy bits.
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents to feature vectors. Though this step is highly domain specific Apache Mahout provides you with a lot of easy to use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use facetting to quickly get an understanding of the fields in your document. It will walk you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use including a few anecdotes on drafting domain specific features.
Configure
Presented by Markus Klose, Search + Big Data Consultant SHI Elektronische Medien GmbH at Lucene/Solr Revolution 2013 Dublin
Kibana4Solr is search-driven, scalable, browser based and extremely user friendly (also for non-technical users). Logs are everywhere. Any device, system or human can potentially produce a huge amount of information saved in logs. The amount of available logs and their semi-structured nature make a meaningful processing in real-time quite a difficult task. Thus, valuable business insights stored in logs might be not found. Kibana4Solr is a search-driven approach to handle that challenge. It offers user-friendly and browser-based dashboard which can be easily customized to particular needs. In the session the Kibana4Solr will be introduced. Some light will be shed on the architectural features of Kibana4Solr. Some ideas will be given in terms of possible business uses cases. And finally a live demo of Kibana4Solr will be shown.
Configure
Building Client-side Search Applications with Solrlucenerevolution
Presented by Daniel Beach, Search Application Developer, OpenSource Connections
Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.
Integrate Solr with real-time stream processing applicationslucenerevolution
Presented by Timothy Potter, Founder, Text Centrix
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces on an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and make your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with Solr administration panel, JMX, and third party tools. Finally, learn how to make changes to already deployed collections —split their shards and alter their schema by using Solr API.
Presented by Rafal Kuć, Consultant and Software engineer, , Sematext Group, Inc.
Even though Solr can run without causing any troubles for long periods of time it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also on Lucene, JVM, and operating system level. You'll see how to react to what you see and how to make changes to configuration, index structure and shards layout using Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you'll learn what to do when things go awry - we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
In a recent project with the United States Patent and Trademark Office, Opensource Connections was asked to prototype the next generation of patent search - using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parser Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.
Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and are cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important, what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystems everyone should be aware of - from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we'll look at how to scale the Solr cluster as your data volume grows. Finally, we'll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
Solr's Admin UI - Where does the data come from?lucenerevolution
Like many Web-Applications in the past, the Solr Admin UI up until 4.0 was entirely server based. It used separate code on the server to generate their Dashboards, Overviews and Statistics. All that code had to be maintained and still ... you weren't really able to use that kind of data for the things you needed it for. It was wrapped into HTML, most of the time difficult to extract and changed the structure from time to time w/o announcement. After a short look back, we're going to look into the current state of the Solr Admin UI - a client-side application, running completely in your browser. We'll see how it works, where it gets its data from and how you can get the very same data and wire that into your own custom applications, dashboards and/oder monitoring systems.
Steve will show how and why to use Solr’s new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically add previously unseen fields to the schema.
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
Presented by Renaud Delbru, Co-Founder, SindiceTech
In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless document, e.g. JSON or XML, and can be then made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested schemaless data intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-away lessons from this session will be awareness about using Lucene/Solr and Hadoop for relational and graph data search, as well as the awareness that it is now possible to have relational faceted browsers with sub-second response time on commodity hardware.
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
In this session we will show how to build a text classifier using the Apache Lucene/Solr with libSVM libraries. We classify our corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include naïve bayes model, logistic regression, neural network, support vector machine (SVM), etc. We use Lucene/Solr to construct the features vector. Then we use the libsvm library known as the reference implementation of the SVM model to classify the document. We construct as many one-vs-all svm classifiers as there are classes in our setting, then using the Hadoop MapReduce Framework we reconcile the result of our classifiers. The end result is a scalable multi-class classifier. Finally we outline how the classifier is used to enrich basic solr keyword search.
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the documents space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
Presented by Shai Erera, Researcher, IBM
Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted based on some criteria (e.g. modification date). This allows for efficient search early-termination as well as achieve better index compression. Index replication lets you replicate a search index to achieve high-availability, fault tolerance as well as take hot index backups. In this talk we will introduce these modules, discuss implementation and design details as well as best practices.
As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
Presented by Xavier Sanchez Loro, Ph.D, Trovit Search SL
This session aims to explain the implementation and use case for spellchecking in Trovit search engine. Trovit is a classified ads search engine supporting several different sites, one for each on country and vertical. Our search engine supports multiple indexes in multiple languages, each with several millions of indexed ads. Those indexes are segmented in several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using solr and lucene in order to help our users to better find the desired ads and avoid the dreaded 0 results as much as possible. As such our goal is not pure orthographic correction, but also suggestion of correct searches for a certain site.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
2. Who Am I
●
Search user, developer, researcher
●
Many years in industry & academia
●
Ph.D. in Information Retrieval
●
Interests: Search, Big Data, Machine Learning
●
Currently working on the Geocoding offer of HERE,
Nokia's Location Platform
●
Spare time: Lucene contributor
7 Nov 2013
Query Latency Optimization with Lucene
2
4. Motivation: Query Latency
● Human Reaction Time: 200 ms *
→ Backend latency: << 200 ms
●
Faster queries means higher manageable load
●
Costs
* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in
Software, Addison-Wesley Professional, 2008.
7 Nov 2013
Query Latency Optimization with Lucene
4
7. First: Do Your Homework
● Keep enough RAM for OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
7 Nov 2013
Query Latency Optimization with Lucene
7
8. Mining Hypothesis
●
Check if query latencies are reproducible
●
If not, try to find correlations with system events:
–
–
–
–
●
Many new incoming docs to index?
Other daemons spike in disk or CPU activity?
Garbage Collections?
Other sar statistics (e.g. paging)
If yes, profile
–
–
First, your code
Don't instrument Lucene internal low-level classes
7 Nov 2013
Query Latency Optimization with Lucene
8
9. Hypothesis Testing
●
You really think you understand the problem
and have a potential solution?
●
Try it out (if it's cheap)!
●
Otherwise, think of (cheap) experiments that
–
–
7 Nov 2013
Give confidence
Tell you (and others) what the gains are (ROI)
Query Latency Optimization with Lucene
9
10. Example: In-memory
●
Buy more memory / bigger machine !?
●
Simulate1
–
–
–
●
1
Consecutively execute the same query multiple times
Much lower memory requirement (i.e. the size of the involved postings)
Repeat for sample of queries of interest
Gives lower bound on query latency
S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European
Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
7 Nov 2013
Query Latency Optimization with Lucene
10
12. Conjunctions (i.e. AND / Occur.MUST)
●
Sort Boolean clauses by increasing DocFreq ft
7 Nov 2013
Query Latency Optimization with Lucene
12
13. Conjunctions (i.e. AND / Occur.MUST)
●
Next() on sparsest posting list (“lead”)
7 Nov 2013
Query Latency Optimization with Lucene
13
14. Conjunctions (i.e. AND / Occur.MUST)
●
Advance(18) on next sparsest posting list → fail
7 Nov 2013
Query Latency Optimization with Lucene
14
15. Conjunctions (i.e. AND / Occur.MUST)
●
Start all over again with “lead”, but advance(22)
7 Nov 2013
Query Latency Optimization with Lucene
15
16. Conjunctions (i.e. AND / Occur.MUST)
●
Try to advance(31) on all other posting lists
7 Nov 2013
Query Latency Optimization with Lucene
16
17. Conjunctions (i.e. AND / Occur.MUST)
●
Try to advance(31) on all other posting lists
7 Nov 2013
Query Latency Optimization with Lucene
17
18. Conjunctions (i.e. AND / Occur.MUST)
●
Try to advance(31) on all other posting lists
7 Nov 2013
Query Latency Optimization with Lucene
18
19. Conjunctions (i.e. AND / Occur.MUST)
●
Match found → R = {31
7 Nov 2013
Query Latency Optimization with Lucene
19
20. Conjunctions (i.e. AND / Occur.MUST)
●
Next() on “lead” → R = {31}
7 Nov 2013
Query Latency Optimization with Lucene
20
21. Disjunctions (i.e. OR / Occur.SHOULD)
7 Nov 2013
Query Latency Optimization with Lucene
21
22. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() on all clauses
7 Nov 2013
Query Latency Optimization with Lucene
22
23. Disjunctions (i.e. OR / Occur.SHOULD)
●
Track clauses in min-heap → R = {2
7 Nov 2013
Query Latency Optimization with Lucene
23
24. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() on all previously matched clauses → R = {2,4
7 Nov 2013
Query Latency Optimization with Lucene
24
25. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() on all previously matched clauses → R = {2,4,5
7 Nov 2013
Query Latency Optimization with Lucene
25
26. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7
7 Nov 2013
Query Latency Optimization with Lucene
26
27. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9
7 Nov 2013
Query Latency Optimization with Lucene
27
28. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11
7 Nov 2013
Query Latency Optimization with Lucene
28
29. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12
7 Nov 2013
Query Latency Optimization with Lucene
29
30. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16
7 Nov 2013
Query Latency Optimization with Lucene
30
31. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18
7 Nov 2013
Query Latency Optimization with Lucene
31
32. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20
7 Nov 2013
Query Latency Optimization with Lucene
32
33. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22
7 Nov 2013
Query Latency Optimization with Lucene
33
34. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26
7 Nov 2013
Query Latency Optimization with Lucene
34
35. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27
7 Nov 2013
Query Latency Optimization with Lucene
35
36. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29
7 Nov 2013
Query Latency Optimization with Lucene
36
37. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31
7 Nov 2013
Query Latency Optimization with Lucene
37
38. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32
7 Nov 2013
Query Latency Optimization with Lucene
38
39. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37
7 Nov 2013
Query Latency Optimization with Lucene
39
40. Disjunctions (i.e. OR / Occur.SHOULD)
●
Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
7 Nov 2013
Query Latency Optimization with Lucene
40
41. Why Query Processing Can Be Slow?
●
Disjunctive Processing: O(n log |C|)
–
–
–
●
High DF terms (large n)
Many terms (large |C|), e.g. query expansion
No / too little use of advance()
Filter (over-use)
7 Nov 2013
Query Latency Optimization with Lucene
41
42. Filter
●
Aims:
–
–
–
●
(Pre-)computation of common sub-queries
Cache result
Don't influence scoring
Limitation
–
–
Additional cost for 1st query
Currently, no skip information generated
→ Adding filter as a conjunct to queries can sometimes be faster
e.g. http://java.dzone.com/news/fast-lucene-search-filters
7 Nov 2013
Query Latency Optimization with Lucene
42
43. Stopword Removal
●
Removal of High-DocFreq terms from
–
–
●
Limitation:
–
●
Index : 10-30% space saving
Query: no very expensive terms
“To be or not to be”
In general, don't do it
7 Nov 2013
Query Latency Optimization with Lucene
43
44. Minor, But Easy Improvements
●
Reduce information, increase locality:
–
Don't store TF, if it's almost always 1 (and you don't
need positions),
fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
–
●
Use BlockPostingsFormat (default in Lucene ≥ 4.1)
Tune Space/Time/Quality tradeoffs:
–
–
7 Nov 2013
DirectDocValues
Less complex scoring function
Query Latency Optimization with Lucene
44
46. MinShouldMatch
●
●
●
(Lucene-4571)
Don't want matches on only one (stop-)word?
Enforce at least mm>1 terms to be present !
Synthetic example query used during dev:
Terms:
ref
restored
struck
wings
dublin
DocFreq:
3.8M
32k
32k
32k
32k
E.g. mm=2:
Conjunctive Processing:
advance()
Disjunctive Processing:
next()
7 Nov 2013
Query Latency Optimization with Lucene
46
51. MinShouldMatch – Open Questions
●
●
●
(Lucene-4571)
How bad is it to exclude docs that only match one,
but an important term?
Why is it enough to match any mm terms?
Why not providing a list of stop-words to a
'StopwordExcludingScorer'?
(But be careful: “To Be Or Not To Be”)
7 Nov 2013
Query Latency Optimization with Lucene
51
52. ReqOptSumScorer
●
Benefit:
–
–
●
Conjunctive processing on required clauses
Calls advance() on optional clauses
How do you determine which clauses are required?
– Lookup term statistics (i.e. DocFreq)
– 2nd lookup unnecessary, if you hand over stats to query
7 Nov 2013
Query Latency Optimization with Lucene
52
53. CommonTermsQuery (≥ 4.1)
●
Looks up term infos (docfreq, posting list offset)
●
(Lucene-4628)
Categorizes query terms as
–
–
●
Low-freq: At least one low-freq term MUST occur in result doc
High-freq: SHOULD occur in doc → their presence add to score
Executes query, but hands over term statistics
→ no 2nd round of term lookups necessary !
●
Also supports MinShouldMatch
7 Nov 2013
Query Latency Optimization with Lucene
53
54. Cost-Model (≥ 4.3)
●
What about structured queries? E.g. +(a b) +c
●
(Lucene-4607)
Currently: worst-case estimate of returned #docs (docfreq)
–
–
●
Disjunctions: sumcC(dfc)
Conjunctions: mincC(dfc)
Limitations:
–
–
●
Effort to generate returned docs?
Only one cost (next() vs. advance())
Open Question:
–
Can we do better with more detailed cost models?
7 Nov 2013
Query Latency Optimization with Lucene
54
55. Maxscore Top-k Scoring Algorithm1
●
●
Experimental prototype code attached to Lucene-4100
Limitation:
–
1
(Lucene-4100)
Requires final run over whole index (i.e. only for static indexes)
H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
7 Nov 2013
Query Latency Optimization with Lucene
55
56. Index Sorting (≥ 4.3)
●
Advantages (if appropriate sort order chosen)
–
–
●
(Lucene-4752)
Better compression → more locality → faster processing
Early termination
Use together with EarlyTerminatingSortingCollector
–
–
Can terminate scoring within sorted segments
Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details
7 Nov 2013
Query Latency Optimization with Lucene
56
57. Parallelization
●
In general, sharding is better:
–
–
●
Shared-nothing
Better use cores for handling load
Multi-threaded query execution:
–
Static indexes:
For slow queries, almost perfect speedups
(if docs are uniformly distributed over shards)
–
Dynamic indexes:
●
Lucene-2840, Lucene-5299
7 Nov 2013
Query Latency Optimization with Lucene
57
58. Summary
●
Understand your problem
●
Scoring can become an issue with many million docs
●
Many recent efficiency improvements
●
More to come... patches welcome
7 Nov 2013
Query Latency Optimization with Lucene
58
59. We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.
Come work with us.
Get in touch!
7 Nov 2013
developer.here.com/geocoder
Query Latency Optimization with Lucene
59
60. Thank You!
Contact
Email : stefan.pohl@here.com
Web : http://linkedin.com/in/stefanpohl
Twitter : @pohlstefan
7 Nov 2013
developer.here.com/geocoder
Query Latency Optimization with Lucene
60