Presented by Steve Kearns, Basis Technology - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Search is everywhere, and it is a crucially important capability in any enterprise, application, or website. However, an increasingly sophisticated user base expects their search engine to bring them more than just document hits - they want the facts, answers, and context that connect the results with their workflow. In this talk, Steve Kearns will discuss and demonstrate how the combination of structured data, text analytics on unstructured data, and Solr can be used to power advanced analytics applications at scale. This includes integrating text analytics components into Solr, adjustments to the Solr Schema, as well as UI-level changes that support the integration of structured and unstructured data from several sources.
Text Classification Powered by Apache Mahout and Lucene
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents to feature vectors. Though this step is highly domain-specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation, and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It will walk you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.
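As a rough illustration of that Lucene-based analysis step, here is a minimal sketch that tokenizes raw text into the terms from which a feature vector would be built; StandardAnalyzer and the "text" field name are assumptions for this example, not Mahout's fixed choices:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Tokenize {
    // Turn raw text into the token list a feature-vector builder would consume.
    static List<String> tokens(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                       // mandatory before incrementToken()
            while (ts.incrementToken()) {
                result.add(term.toString());
            }
            ts.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("Apache Mahout classifies text using Lucene analysis."));
    }
}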
Presented by Markus Klose, Search + Big Data Consultant SHI Elektronische Medien GmbH at Lucene/Solr Revolution 2013 Dublin
Kibana4Solr is search-driven, scalable, browser-based, and extremely user-friendly (also for non-technical users). Logs are everywhere. Any device, system, or human can potentially produce a huge amount of information saved in logs. The amount of available logs and their semi-structured nature make meaningful processing in real time quite a difficult task. Thus, valuable business insights stored in logs might never be found. Kibana4Solr is a search-driven approach to that challenge. It offers a user-friendly, browser-based dashboard which can be easily customized to particular needs. In this session Kibana4Solr will be introduced, some light will be shed on its architectural features, some ideas will be given on possible business use cases, and finally a live demo will be shown.
Building Client-side Search Applications with Solr
Presented by Daniel Beach, Search Application Developer, OpenSource Connections
Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast-paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the-box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that remain performant under large user loads, with minimal server resources.
Integrate Solr with real-time stream processing applications
Presented by Timothy Potter, Founder, Text Centrix
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr's real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He'll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, a pattern commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
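By way of illustration, here is a minimal sketch of the indexing-bolt pattern described above, assuming Storm's BaseRichBolt and SolrJ's HttpSolrClient; the Solr URL and the tweet_id/text tuple fields are invented for this example:

import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Bolt that indexes each incoming tuple as a Solr document.
public class SolrIndexBolt extends BaseRichBolt {
    private transient SolrClient solr;   // not serializable, so created in prepare()
    private OutputCollector collector;

    public void prepare(Map topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // URL is an assumption; point it at your own Solr core/collection.
        this.solr = new HttpSolrClient.Builder("http://localhost:8983/solr/tweets").build();
    }

    public void execute(Tuple input) {
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", input.getStringByField("tweet_id"));   // hypothetical fields
            doc.addField("text", input.getStringByField("text"));
            solr.add(doc);              // rely on autoCommit/soft commits for NRT visibility
            collector.ack(input);
        } catch (Exception e) {
            collector.fail(input);      // let Storm replay the tuple
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* terminal bolt */ }
}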
Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, and use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with the Solr administration panel, JMX, and third-party tools. Finally, learn how to make changes to already deployed collections: split their shards and alter their schema by using the Solr API.
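For example, shard splitting is exposed through the SolrCloud Collections API; a minimal sketch of invoking it (the host, collection, and shard names are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SplitShard {
    public static void main(String[] args) throws Exception {
        // SPLITSHARD divides shard1 of the "logs" collection into two sub-shards.
        String url = "http://localhost:8983/solr/admin/collections"
                   + "?action=SPLITSHARD&collection=logs&shard=shard1&wt=json";
        HttpResponse<String> rsp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(rsp.body()); // inspect status; the split runs on the overseer
    }
}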
Presented by Rafal Kuć, Consultant and Software Engineer, Sematext Group, Inc.
Even though Solr can run without causing any trouble for long periods of time, it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also at the Lucene, JVM, and operating-system level. You'll see how to react to what you see and how to make changes to configuration, index structure, and shard layout using the Solr API. We will also discuss the performance metrics that deserve extra attention. Finally, you'll learn what to do when things go awry: we will share a few examples of troubleshooting, then dissect what was wrong and what had to be done to make things work again.
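As one concrete avenue, Solr can expose its metrics over JMX; a minimal sketch of connecting and listing the registered MBeans (the port and the solr* domain pattern are assumptions that depend on how JMX is enabled in your deployment):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SolrJmxPeek {
    public static void main(String[] args) throws Exception {
        // Port and path assume Solr was started with remote JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // List every MBean registered under a "solr"-prefixed domain
            // (the exact domain name varies by Solr version and configuration).
            Set<ObjectName> names = conn.queryNames(new ObjectName("solr*:*"), null);
            names.forEach(System.out::println);
        }
    }
}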
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
In a recent project with the United States Patent and Trademark Office, OpenSource Connections was asked to prototype the next generation of patent search using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast-paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parsing Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview of our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.
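To give a flavor of the Parboiled side, here is a toy sketch of a PEG for a BRS-like proximity expression; the grammar is invented for illustration and is not the USPTO syntax. A real implementation would walk the resulting parse tree and emit Lucene SpanQuery objects:

import org.parboiled.BaseParser;
import org.parboiled.Parboiled;
import org.parboiled.Rule;
import org.parboiled.annotations.BuildParseTree;
import org.parboiled.parserunners.ReportingParseRunner;
import org.parboiled.support.ParsingResult;

@BuildParseTree
public class BrsToyParser extends BaseParser<Object> {

    // query := term (ws "ADJ" ws term)*   -- a toy slice of a BRS-like syntax
    public Rule Query() {
        return Sequence(Term(), ZeroOrMore(Sequence(Ws(), String("ADJ"), Ws(), Term())));
    }

    Rule Term() { return OneOrMore(CharRange('a', 'z')); }

    Rule Ws() { return OneOrMore(AnyOf(" \t")); }

    public static void main(String[] args) {
        BrsToyParser parser = Parboiled.createParser(BrsToyParser.class);
        ParsingResult<Object> result =
                new ReportingParseRunner<Object>(parser.Query()).run("fuel ADJ cell");
        System.out.println(result.matched ? "parsed" : "syntax error");
        // A real implementation would translate the tree into SpanNearQuery objects.
    }
}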
Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and are cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important, what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystems everyone should be aware of - from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we'll look at how to scale the Solr cluster as your data volume grows. Finally, we'll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time notification systems are often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed to support tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
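A minimal sketch of the core matching step, using Lucene's MemoryIndex class (which supports exactly this one-document-against-many-queries pattern); the field name and stored queries are invented for illustration:

import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class InvertedSearchDemo {
    public static void main(String[] args) {
        // Stored subscriptions; a real system would parse these from Solr query syntax.
        List<Query> subscriptions = List.of(
                new TermQuery(new Term("body", "outage")),
                new TermQuery(new Term("body", "upgrade")));

        // Index the single incoming document in memory...
        MemoryIndex doc = new MemoryIndex();
        doc.addField("body", "Scheduled network outage this weekend", new StandardAnalyzer());

        // ...then run every stored query against it; score > 0 means the subscription fires.
        for (Query q : subscriptions) {
            if (doc.search(q) > 0.0f) {
                System.out.println("notify subscribers of: " + q);
            }
        }
    }
}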
Solr's Admin UI - Where does the data come from?
Like many web applications in the past, the Solr Admin UI up until 4.0 was entirely server-based. It used separate code on the server to generate its dashboards, overviews, and statistics. All that code had to be maintained, and still you weren't really able to use that data for the things you needed it for: it was wrapped in HTML, most of the time difficult to extract, and its structure changed from time to time without announcement. After a short look back, we're going to look into the current state of the Solr Admin UI - a client-side application, running completely in your browser. We'll see how it works, where it gets its data from, and how you can get the very same data and wire it into your own custom applications, dashboards, and/or monitoring systems.
Steve will show how and why to use Solr's new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically adds previously unseen fields to the schema.
High Performance JSON Search and Relational Faceted Browsing with Lucene
Presented by Renaud Delbru, Co-Founder, SindiceTech
In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene's extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested, schemaless, data-intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-away lessons from this session will be awareness about using Lucene/Solr and Hadoop for relational and graph data search, as well as the awareness that it is now possible to have relational faceted browsers with sub-second response time on commodity hardware.
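For context on the BlockJoin API mentioned above, here is a minimal sketch of Lucene's parent/child join query; the field names and parent-marker convention are assumptions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

public class BlockJoinSketch {
    public static Query childrenToParents() {
        // Parents are flagged with docType=parent when each block of
        // child documents plus its parent is indexed together.
        QueryBitSetProducer parents =
                new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));
        // Match child documents, then join up to their enclosing parent document.
        Query childQuery = new TermQuery(new Term("skill", "java"));
        return new ToParentBlockJoinQuery(childQuery, parents, ScoreMode.Avg);
    }
}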
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
In this session we will show how to build a text classifier using Apache Lucene/Solr with the libSVM library. We classify our corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one, or more categories. Known machine learning techniques for text classification include naïve Bayes models, logistic regression, neural networks, support vector machines (SVM), etc. We use Lucene/Solr to construct the feature vectors. Then we use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting; then, using the Hadoop MapReduce framework, we reconcile the results of our classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
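As a rough sketch of the one-vs-all prediction step, assuming pre-trained libSVM model files and a dense feature vector produced from Lucene term statistics (the file names and feature layout are assumptions):

import java.io.IOException;

import libsvm.svm;
import libsvm.svm_model;
import libsvm.svm_node;

public class OneVsAllPredict {
    // Pick the category whose one-vs-all model scores the document highest.
    public static int classify(double[] features, String[] modelFiles) throws IOException {
        // libSVM represents a document as a sparse array of (index, value) nodes.
        svm_node[] x = new svm_node[features.length];
        for (int i = 0; i < features.length; i++) {
            x[i] = new svm_node();
            x[i].index = i + 1;        // libSVM feature indices are 1-based
            x[i].value = features[i];
        }
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < modelFiles.length; c++) {
            svm_model model = svm.svm_load_model(modelFiles[c]);
            double[] decision = new double[1];
            // Decision value > 0 means "belongs to class c" for a binary one-vs-all model.
            svm.svm_predict_values(model, x, decision);
            if (decision[0] > bestScore) {
                bestScore = decision[0];
                best = c;
            }
        }
        return best;
    }
}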
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the document space. In this session I will introduce the Facets module, how to use it, and under-the-hood details, as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
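A minimal counting-side sketch of the Facets module, assuming an index and taxonomy have already been built with a FacetsConfig and that a "category" dimension exists (the dimension name is an assumption):

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public class FacetCounts {
    // Count the top categories across all documents; searcher/taxoReader setup is omitted.
    public static FacetResult topCategories(IndexSearcher searcher,
                                            TaxonomyReader taxoReader,
                                            FacetsConfig config) throws Exception {
        FacetsCollector fc = new FacetsCollector();
        // Run the query, collecting facet ordinals alongside the hits.
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
        Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
        return facets.getTopChildren(10, "category");
    }
}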
Presented by Shai Erera, Researcher, IBM
Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted based on some criterion (e.g. modification date). This allows for efficient search early-termination as well as better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, as well as take hot index backups. In this talk we will introduce these modules and discuss implementation and design details as well as best practices.
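For illustration, a minimal sketch of configuring a sorted index with the setIndexSort API found in recent Lucene versions (the module described in this talk exposed similar functionality through a sorting merge policy); the field name is an assumption:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortedIndexConfig {
    // Keep segments sorted by a numeric "modified" doc-values field (newest first),
    // enabling early termination for queries sorted the same way.
    public static IndexWriterConfig sortedByModificationDate() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setIndexSort(new Sort(new SortField("modified", SortField.Type.LONG, true)));
        return iwc;
    }
}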
As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker
Presented by Xavier Sanchez Loro, Ph.D., Trovit Search SL
This session aims to explain the implementation and use case for spellchecking in the Trovit search engine. Trovit is a classified-ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several million indexed ads. Those indexes are segmented into several different sites depending on the type of ads (homes, cars, rentals, products, jobs, and deals). We have developed a multi-language spellchecking system using Solr and Lucene in order to help our users better find the desired ads and avoid the dreaded 0 results as much as possible. As such, our goal is not purely orthographic correction, but also suggestion of correct searches for a certain site.
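As a generic illustration of index-based suggestion (not Trovit's actual system), Lucene's DirectSpellChecker can propose corrections drawn from a given site's own index, so suggestions point at searches that actually return results; the field name is an assumption:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.search.spell.SuggestWord;

public class SpellSuggest {
    // Suggest corrections for a user term straight from the index's own vocabulary.
    public static String[] suggest(IndexReader reader, String field, String userTerm)
            throws IOException {
        DirectSpellChecker checker = new DirectSpellChecker();
        SuggestWord[] words = checker.suggestSimilar(
                new Term(field, userTerm), 5, reader,
                SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX);
        String[] out = new String[words.length];
        for (int i = 0; i < words.length; i++) out[i] = words[i].string;
        return out;
    }
}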
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full text search at "Big Data" scale. This talk will give an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to what project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with Map/Reduce, load Solr via Flume in 'Near Realtime' and much more.
A Novel Methodology for Handling Document Level Security in Search-Based Applications
Presented by Rajini Maski, Senior Software Engineer, Happiest Minds Technologies
An important problem with document-search in any content management system (CMS) is the handling of permission-based search requests for each user. In this session, we present an algorithm and framework that allows the Search Engine to plainly index both public and privileged documents without any early binding overhead—thus enforcing document-level security policies only at the time of search. With our late-binding approach for ACL (access control lists) and some custom components, we have achieved reduction in search-time overhead. We will also discuss the order of complexity and execution time for the search overhead.
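A late-binding ACL check of this kind is commonly expressed as a filter query constructed from the user's access tokens at search time; a minimal SolrJ sketch (the acl field and token scheme are assumptions, not the presenter's exact design):

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;

public class AclFilter {
    // Restrict results to documents whose acl field contains at least one
    // of the caller's access tokens (user id, group ids, or "public").
    // Tokens must be query-syntax-safe or escaped before use.
    public static SolrQuery secureQuery(String userQuery, List<String> aclTokens) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("acl:(" + String.join(" OR ", aclTokens) + ")");
        return q;
    }
}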
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
Presented by Hien Luu, Technical Lead, LinkedIn
Rajasekaran Rangaswamy, LinkedIn
For internet companies, marketing campaigns play an important role in acquiring new customers, retaining and engaging existing customers, and promoting new products. The LinkedIn segmentation and targeting platform helps marketing teams easily and quickly create member segments based on member attributes, using nested predicate expressions ranging from simple to complex. Once segments are created, the qualified members are targeted with marketing campaigns.
Lucene is a key piece of technology in this platform. This session will cover how we leverage Hadoop to efficiently build Lucene indexes for a large and growing member attribute data set of 225 million members, and how Lucene is used to create segments based on complex nested predicate expressions. This presentation will also share some of the lessons we learned and challenges we encountered from using Lucene to search over large data sets.
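As a rough illustration of how a nested predicate expression maps onto Lucene, a minimal sketch using BooleanQuery (the member attribute fields are invented for this example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SegmentQuery {
    // Members in the US who are either engineers or designers:
    // country = "us" AND (title = "engineer" OR title = "designer")
    public static Query exampleSegment() {
        BooleanQuery.Builder title = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("title", "engineer")), Occur.SHOULD)
                .add(new TermQuery(new Term("title", "designer")), Occur.SHOULD);
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("country", "us")), Occur.MUST)
                .add(title.build(), Occur.MUST)
                .build();
    }
}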
Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business
Besides the quality of results, the time that it takes from the submission of a query to the display of results is of utmost importance to user satisfaction. Within search engine implementations such as Apache Lucene, significant development efforts are hence directed towards reducing query latency. In this session, I will explain reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them. To make the presented material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use case.
Large Scale Crawling with Apache Nutch and Friends
Presented by Julien Nioche, Director, DigitalPebble
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, Solr, Tika, or HBase. The second part of the presentation will focus on the latest developments in Nutch, the differences between the 1.x and 2.x branches, and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point for crawling on a large scale with Apache Nutch and Solr.
[DL Seminar] XFeat: Accelerated Features for Lightweight Image Matching
Paper URL: https://arxiv.org/pdf/2404.19174
Source: Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, Erickson R. Nascimento: XFeat: Accelerated Features for Lightweight Image Matching, Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Abstract: This work proposes XFeat (Accelerated Features), a lightweight architecture for resource-efficient keypoint matching. The method revisits the basic design of convolutional neural networks for detecting, extracting, and matching local features. Because resource-constrained devices require fast and robust algorithms, the network limits its number of channels while keeping the resolution as high as possible. The design additionally allows selecting sparse matching, which suits applications such as navigation and AR. XFeat is fast, achieves comparable or better accuracy, and runs in real time on an ordinary laptop CPU.
Using robots in cell production raises various problems, one of which is the assembly of three or more objects. In general, when assembling multiple objects simultaneously, each target part is held independently by a robot arm or a jig. However, this approach requires as many robot arms or jigs as there are parts, and the more parts there are, the more waste arises in cost and installation space. To address this, 音𣷓 et al. analyzed the contact forces acting on the assembled objects and derived conditions under which an object not fixed by a jig is unlikely to move during assembly; that is, they plan assembly conditions by considering the robustness of ungrasped objects in the environment. Based on this strategy, this study aims to execute multi-object assembly tasks with a single-arm manipulator. By taking the objects' robustness into account, we propose a method for handling multiple temporarily assembled objects at once. Taking pipe-joint assembly as the target task, we show that a single-arm manipulator can grasp multiple objects simultaneously by using a simple tool. Furthermore, to improve the task success rate, we implement robot control and motion planning based on object pose detection with an RGB-D camera.
This paper discusses assembly operations using a single manipulator and a parallel gripper to simultaneously grasp multiple objects and hold the group of temporarily assembled objects. Assembly tasks are generally performed by multiple robots and jigs that constrain the target objects mechanically or geometrically to prevent them from moving. To achieve such constraints with a single gripper, it is necessary to analyze the physical interaction between the objects. In this paper, we focus on assembling pipe joints as an example and discuss constraining the motion of the objects. Our demonstration shows that a simple tool can facilitate holding multiple objects with a single gripper.
Integrating Advanced Text Analytics into Solr - by Steve Kearns
1. Integrating Advanced Text Analytics into Solr
Lucene Revolution
Steve Kearns
Product Manager
www.basistech.com
2. Agenda
• About Basis Technology
• Why Text Analytics and Solr?
• Overview and Uses of Text Analytics
• Integration Strategies
3. About Basis Technology
• HQ in Cambridge, MA, Offices in:
Tokyo, San Francisco, Washington DC
• Specialists in multilingual text analytics for
Web/enterprise search
Document/OSINT/media exploitation
• Rosette Linguistics Platform is widely used by commercial enterprises and government agencies
4. Why Text Analytics and Solr?
• More than Keyword Search and Result Lists
• More Metadata
New ways to visualize, navigate and explore
New knobs to tune relevance
New info to connect disparate data sources
• Solr can be the consumer, host, or broker
5. Overview of Text Analytics
• Document-Level
Language Identification, Categorization
• Sub-Document Level
Entity Extraction, Fact Extraction, Sentiment, Linguistics
• Cross-Document
Cross-Document Entity Resolution, Near Duplicate Detection,
Unsupervised Clustering
6. Document Level Analysis: Language Identification
• Sub-document Lang ID is possible
[Slide graphic: collage of intermixed French, Russian, Japanese, German, and Arabic text snippets, each region tagged with its detected language, alongside per-language share percentages (29%, 33%, 21%, 17%)]
7. Document Level Analysis: Categorization
• Group Documents into Pre-defined categories
http://news.google.com/
http://www.bbc.co.uk/
8. Sub-Document Analysis: Linguistics
• Segmentation of Asian languages
• Lemmatization
Stemming
N-Gram
Morphological
[Diagram: segmentation and lemmatization examples]
17. Integration Point: UpdateRequestProcessor
• Runs Before Analyzers
• Full Access to Document
• Two options:
Run the analysis directly in Solr
Call out to external analysis services
• Limitations:
Think through your indexing strategy
18. Integration Point: UpdateRequestProcessor
• Run the analysis directly in Solr
Good for lightweight analytics (see the sketch after this slide)
Not good for cross-document analytics
• Call out to external analysis services
Web Services, UIMA, OpenPipeline, GATE, custom code
Note that these external calls are synchronous
Additional complexity / points of failure
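To make the in-Solr option concrete, here is a minimal sketch of a custom UpdateRequestProcessor that stamps a language field on each document before indexing; the text/language field names and the detect() placeholder stand in for whatever analytics component you plug in:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// The factory is what solrconfig.xml references inside an updateRequestProcessorChain.
public class LangIdProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new LangIdProcessor(next);
    }

    static class LangIdProcessor extends UpdateRequestProcessor {
        LangIdProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object body = doc.getFieldValue("text");         // field name is an assumption
            if (body != null) {
                doc.setField("language", detect(body.toString()));
            }
            super.processAdd(cmd);                           // continue down the chain
        }

        private String detect(String text) {
            // Placeholder for a real language-identification call
            // (an in-process library or an external service).
            return "en";
        }
    }
}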
20. Integration Point: Pre-Processor
• Index in Solr as Last Step of Analysis
• Good For:
Finer-grained control
Managing dependencies between components
Scalability
• Limitations:
Complexity / New points of failure
Cannot use Solr’s content acquisition features
21. Integration Summary
• There are Many Options!
• Document-Level Analysis:
Generally, safe to run in UpdateRequestProcessor
• Sub-Document Analysis:
Sometimes run in UpdateRequestProcessor, sometimes external
• Cross-Document Analysis:
Run external
• Multiple-Analysis Components:
Run external document processing pipeline