The document summarizes new features in Solr 3.1 and 4.0, including improved relevancy, spatial/geo search, search result grouping, faceting, and scalability features like SolrCloud. It provides an overview and examples of extended dismax parsing, spatial search, field collapsing, pivot faceting, range faceting, per-segment faceting, auto-suggest, indexing with JSON, and querying results in CSV format.
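Range faceting, one of the features listed above, is driven entirely by request parameters. A minimal sketch of building such a request (the "price" field and bucket sizes here are hypothetical, not from the slides) might look like:

```python
from urllib.parse import urlencode

# Hypothetical range-facet request against a numeric "price" field,
# using the facet.range parameters introduced in Solr 3.1.
params = {
    "q": "*:*",
    "facet": "true",
    "facet.range": "price",
    "facet.range.start": 0,
    "facet.range.end": 500,
    "facet.range.gap": 50,
    "wt": "json",
}
query_string = urlencode(params)
print(query_string)
```

Appending this string to a core's /select? handler on Solr 3.1 or later would return counts for each 50-unit price bucket alongside the normal results.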
Xebia Knowledge Exchange (March 2010) - Lucene: From theory to real world - Michaël Figuière
This document provides an overview of Lucene, an open-source information retrieval library. It discusses Lucene's history and key concepts like indexing, inverted indexes, analyzers, queries, and performance tuning. The document also describes how Lucene can be used in applications through frameworks like Solr and Hibernate Search, and with distribution through projects like Katta. It concludes by looking at future directions for Lucene, including more customization and integration with machine learning through Apache Mahout.
Implementing pseudo-keywords through Functional Programming - Vincent Pradeilles
The document discusses implementing asynchronous code in a more functional way using pseudo-keywords. It introduces a weakify function that takes a function and returns a function that wraps it to avoid retain cycles when referencing self. This approach is expanded to other common patterns like debouncing. The document then shows how property wrappers can be used to implement these pseudo-keywords more declaratively. This allows patterns like asynchronous code and debouncing to be written in a cleaner way without extra boilerplate.
Geo distance search with MySQL presentation - GSMboy
The document discusses various techniques for performing geo-spatial searches with MySQL to find points of interest near a given location. It covers calculating distance between points using the Haversine formula, optimizing queries by limiting the search area, and using spatial extensions, full-text search, or external search engines like Sphinx to enable both geo and text searching. Demo examples show finding nearby POIs matching a keyword within a radius of the user's GPS point.
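The Haversine computation mentioned above is short enough to sketch directly; a pure-Python version (the SQL variant is the same formula spelled out inside a SELECT) could be:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Paris to London, roughly 344 km
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278), 1))
```

The optimization the talk describes amounts to first restricting rows with a cheap bounding-box WHERE clause on indexed lat/lon columns, then applying this trigonometry only to the surviving candidates.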
Architecture Patterns in Practice with Kotlin. UA Mobile 2017 - UA Mobile
The document discusses architecture patterns in Kotlin mobile app development. It promotes using patterns like MVVM, dependency injection with Kodein, and coroutines to achieve clean, testable and maintainable code. Specific techniques covered include using abstract base classes to reduce boilerplate, weak references to avoid memory leaks, the Either monad for error handling, extension functions for domain models, and dependency injection with Kodein to decouple classes. The document argues that adopting these patterns makes the code more readable, testable and maintainable, and helps prevent common problems like concurrency issues.
This document summarizes the key features and changes in the PostgreSQL 9.0 beta release. It highlights major new features like replication, permissions, and anonymous code blocks. It also briefly outlines many other enhancements, including performance improvements, monitoring tools, JSON/XML output for EXPLAIN, and a mobile app contest. The presentation aims to excite developers about trying the new beta version.
This document discusses the QueryExecute() function in CFML as an alternative to CFQUERY for executing database queries in CFScript. It provides examples of using QueryExecute() for different types of queries like select, insert, update and delete. It also shows how to pass parameters to the query as a structure or array. QueryExecute() simplifies database queries in CFScript by allowing unnamed parameters and optional query attributes to be passed as arguments.
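The QueryExecute() pattern of passing positional or named parameters alongside the SQL string exists in most database APIs; here is an analogous sketch using Python's sqlite3 module (the table and data are made up for illustration):

```python
import sqlite3

# In-memory database standing in for the CFML datasource.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Positional parameters, like QueryExecute's array form:
by_pos = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchall()

# Named parameters, like QueryExecute's structure form:
by_name = conn.execute("SELECT name FROM users WHERE id = :id", {"id": 2}).fetchall()
print(by_pos, by_name)
```

In both styles the driver, not string concatenation, binds the values, which is the same safety benefit QueryExecute() brings to CFScript.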
Flink Forward Berlin 2017: Max Kiessling, Martin Junghanns - Cypher-based Gra... - Flink Forward
Graph pattern matching is one of the most interesting and challenging operations on graph data. However, it is primarily supported by graph database systems such as Neo4j and not generally available for distributed processing frameworks like Apache Flink or Apache Spark. In our talk, we want to give an overview of our current implementation of Cypher on Apache Flink. Cypher is the Neo4j graph query language and enables the intuitive definition of graph patterns, including structural and semantic predicates. As the Neo4j graph data model is not supported out of the box by Apache Flink, we leverage Gradoop, a Flink-based graph analytics framework that already provides an abstraction of schema-free property graphs. We will give a brief overview of the technologies used to implement Cypher, explain our query engine, and give a demonstration of the available language features. In addition, we will present benchmark results from running Cypher queries on billion-edge graphs.
The Clash was a highly influential punk rock band formed in 1976 in London, England. They originally consisted of Joe Strummer, Mick Jones, Paul Simonon, and Nicky "Topper" Headon. Their third album, London Calling, released in 1979, was critically acclaimed and is considered one of the best albums of the 1980s. It brought them greater popularity in the United States. The Spanish Civil War was a major conflict that devastated Spain from 1936 to 1939, fought between Spanish army generals and the government of the Second Spanish Republic. The Nationalist forces defeated the Republicans, establishing a dictatorship under General Francisco Franco.
The document discusses the search capabilities and infrastructure at TheLadders.com. It describes how they standardized their search using Solr, setting up a search team in 2010 and platform team in 2011. It also discusses challenges like complex boolean queries and implementing a recommendation service using Solr as the backend.
Building a global listening platform with Solr presents technical and global challenges. The speaker will demonstrate a platform they built in 3 months using Solr and Basis Technology products for content acquisition, analysis including language identification and entity extraction, and search visualization. Key aspects include distributed processing pipelines for analysis, language-specific indexing, and dashboard interfaces beyond basic search results.
LinkedIn is a professional networking platform that allows users to connect with colleagues and find new business opportunities. The document discusses why professionals should use LinkedIn by highlighting examples of how it can help users manage their online presence, find qualified candidates, and get introduced to new connections. It also shares two success stories of how individuals used LinkedIn to help raise funds for a startup and land a new job after being laid off. The document concludes by providing instructions for creating a new LinkedIn account and filling out a profile.
Kitenga's ZettaVox and ZettaSearch products support SOLR and Lucene ecosystems at both the ingestion point and for the search user. In this talk, I will show how ZettaVox, our professional content mining platform on Hadoop, can be used to index content and rich metadata into a LucidWorks Enterprise installation. Being built on Hadoop, ZettaVox scales up by scaling out. I will then create an end-user search and analytics experience using our ZettaSearch solution that leverages the faceted metadata to enhance information discovery and analysis. All in about 20 minutes.
This document discusses integrating search capabilities with Hadoop's big data analytics. It explains that Hadoop is well-suited for distributed storage and processing of large datasets, while search excels at free-text retrieval and indexing large amounts of text. The document outlines how the speaker's company integrated Hadoop and search using HBase replication to a search index, allowing results from Hadoop jobs to be searchable in near real-time. It provides an example use case of monitoring tweets for keywords and extracting mentioned URLs to visualize popular links.
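The tweet-monitoring use case boils down to a keyword match plus URL extraction; a toy sketch of that extraction step (the real pipeline described above fed an HBase-replicated search index) could be:

```python
import re
from collections import Counter

# Naive URL matcher: anything from http(s):// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")

def popular_links(tweets):
    """Tally every URL mentioned across a stream of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(URL_RE.findall(text))
    return counts.most_common()

sample = [
    "great read: https://example.com/a",
    "agreed, https://example.com/a is great",
    "unrelated link https://example.org/b",
]
print(popular_links(sample))
```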
This document outlines a plan to start a sustainable business called We Beat The Mountain that would manufacture and sell products made from recycled materials, such as tires. The business would target environmentally and socially conscious consumers between ages 20-40 globally. Products would be of high quality, durable design and sold through an online store as well as partnerships with related industries. Initial product plans focus on selling recycled tire suitcases to tap into the $18 billion annual luggage market. Sales projections estimate selling between 3,000-10,000 suitcases in the first year, scaling up to 7,500-20,000 units by the third year.
Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval library in open source, suitable for nearly every application that requires full-text search features.
If you've tried Apache Solr 1.4, you've probably had a chance to take it for a spin indexing and searching your data, and getting acquainted with its powerful, versatile new features and functions. Now, it's time to roll up your sleeves and really master what Solr 1.4 has to offer.
Solr & Lucene at Etsy provides concise summaries of Gregg Donovan's experience using Solr and Lucene at Etsy and TheLadders, including optimizing Solr out-of-the-box, customizing at a low level, and knowing when each approach is best. The document also shares various techniques for improving relevance, performance, and customization including external file fields, boosting queries, impression tracking, and more.
The Scene - I love you like a love song (Selena Gomez) - tanica
Selena Gomez is an American singer and actress born in 1992 in Grand Prairie, Texas. She began her career starring in the television series Wizards of Waverly Place. Her career expanded into music, contributing songs to soundtracks and releasing her own albums as part of the band The Scene. In 2011, she wrote and recorded the song "Love You like a Love Song" which was rumored to have been dedicated to her then-boyfriend Justin Bieber. The song expresses feelings of being completely in love.
1. Innovation ecosystems involve a network of actors working together to enable entrepreneurship, including idea generators, entrepreneurs, experienced managers, mentors, funding sources, customers, suppliers, and partners.
2. Successful ecosystems provide access to talent, technologies, advice, capital, networks, and other resources needed at each stage of a venture's development.
3. Incubators, accelerators, and co-working spaces play distinct but complementary roles in supporting entrepreneurs and startups at different stages by providing resources, mentoring, and connections.
LucidWorks Enterprise is a well-packaged, integrated search solution development platform that makes it easier for you to take on the art and science of search, applying the power and flexibility of open source to unlock the search technology for your most interesting and valuable business and technical challenges. http://www.lucidimagination.com/developers/whitepapers/getting-started-with-lucidworks-enterprise
This document summarizes new features and performance improvements in Solr 3.1. Key highlights include: improved faceting capabilities like numeric range facets; new features like spatial search, a faster highlighter, and the extended dismax query parser; distributed search support; and under-the-hood performance optimizations in Lucene like DocumentsWriterPerThread. It also previews upcoming SolrCloud functionality and discusses features not yet included in the 3.1 release.
These slides belong to a presentation I gave to my colleagues in Göttingen as an introduction to the Apache Solr open source search engine. In structure I followed Trey Grainger and Timothy Potter's excellent Solr in Action book (Manning, 2014), and I took some of the examples from there. Others come from the examples bundled with Solr, and from projects I had the opportunity to work with in the past (eXtensible Catalog and Europeana).
These slides don't go too deep; if you want to know more about the topic, just drop me an email, or consult the references on the last slide.
Happy searching!
Got data? Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, provide some tips in adjusting Solr's schema to match your needs better, and finally showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
This document provides an introduction to SOLR, including why search engines are needed, what Lucene and SOLR are, the advantages of SOLR, SOLR architecture, query syntax, working with SOLR to feed and query data, and SOLR installation and configuration. Key topics covered include SOLR's ability to index and search structured and unstructured data in real-time, its sharding and replication capabilities for large datasets, and how SOLR configuration involves defining fields, field types, and dynamic fields in schema.xml.
Search Engine Building with Lucene and Solr (SoCal Code Camp San Diego 2014) - Kai Chan
Slides for my presentation at SoCal Code Camp, June 29, 2014
(http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)
Interactive Questions and Answers - London Information Retrieval Meetup - Sease
Answers to some questions about Natural Language Search, Language Modelling (Google BERT, OpenAI GPT-3), Neural Search, and Learning to Rank asked during our London Information Retrieval Meetup (December).
Solr as a Spark SQL Datasource allows users to read and write data from Solr as DataFrames in Spark SQL. It utilizes the Solr Schema API to access field-level metadata and push SQL predicates down into Solr query constructs like the fq clause. It also supports shard partitioning, intra-shard splitting, and streaming query results for fast reads of large result sets. The document provides examples of connecting to Solr, registering a DataFrame as a temp table, and performing SQL queries on Solr data in Spark.
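The predicate-pushdown idea can be sketched without Spark at all: translate each SQL-style predicate into a Solr fq clause before the query is sent. The operator coverage below is illustrative, not the spark-solr library's actual code:

```python
def predicates_to_fq(predicates):
    """Translate (field, op, value) triples into Solr filter-query strings."""
    fqs = []
    for field, op, value in predicates:
        if op == "=":
            fqs.append(f"{field}:{value}")
        elif op == ">=":
            fqs.append(f"{field}:[{value} TO *]")
        elif op == "<=":
            fqs.append(f"{field}:[* TO {value}]")
        else:
            raise ValueError(f"unsupported operator: {op}")
    return fqs

print(predicates_to_fq([("author", "=", "smith"), ("year", ">=", 2010)]))
```

Each resulting string would be attached as an fq parameter, letting Solr filter (and cache the filter) on its side instead of shipping every row to Spark.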
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc... - Lucidworks
This document provides an overview of using Apache Spark with Apache Solr. It discusses using Solr as a data source for Spark SQL, reading data from Solr into Spark RDDs, querying Solr from the Spark shell, indexing data from Spark Streaming into Solr, and an example of using Solr as a sink for a Spark Streaming application that processes tweets in real-time.
Solr™ is the popular, blazing-fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr powers the search and navigation features of many of the world's largest internet sites, such as AOL, Yahoo, Buy.com, CNET, CitySearch, Netflix, Zappos, StubHub, Digg, E*Trade, Disney, Apple, NASA, and MTV.
The document provides an overview of Apache Solr, an open source enterprise search platform. It discusses how to install and configure Solr, load sample data, and perform various search queries. It also offers tips for advanced search functionality, indexing, and scaling Solr for large datasets.
This document provides information about integrating Apache Solr and Apache Spark. It discusses using Solr as a data source and sink for Spark applications, including indexing data from Spark jobs into Solr in real-time and exposing Solr query results as Spark RDDs. The document also summarizes the Spark Streaming and RDD APIs and provides code examples for indexing tweets from Spark Streaming into Solr and reading from Solr into a DataFrame.
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Faceted searching is a must-have feature for enhancing findability and user engagement in an enterprise search UI. The faceted searching features of Apache Solr have been a major factor in its popularity, but many Solr users don't fully appreciate all of the capabilities that are available. In this session we will deep-dive into the different types of data facets that Solr supports, discussing in detail the various options that can be used to explore them. We will also review some specific techniques for dealing with several complex use cases, and discuss some performance "gotchas" and how to avoid them.
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
The document provides a deep dive into the lifecycle of a Solr search request, from the initial HTTP request to the generation of the response. It describes each stage of processing, including how the request is routed through the Solr core, how the query and filters are parsed and executed against the index, how various caches and plugins can be leveraged, and how the final response is generated. It uses examples of simple and more complex queries to demonstrate how each component interacts throughout the processing pipeline.
"Solr Update" at code4lib '13 - Chicago - Erik Hatcher
Solr is continually improving. Solr 4 was recently released, bringing dramatic changes in the underlying Lucene library and Solr-level features. It's tough for us all to keep up with the various versions and capabilities.
This talk will blaze through the highlights of new features and improvements in Solr 4 (and up). Topics will include: SolrCloud, direct spell checking, surround query parser, and many other features. We will focus on the features library coders really need to know about.
Some design patterns and concepts for industrial grade deployment of Drupal on Solaris, plus a specific example of an interesting Drupal site deployed on Solaris
The document introduces Yann Yu from Lucidworks and provides background on Lucidworks, Solr, and Hadoop. It discusses how Solr can be used to provide search capabilities for large amounts of both structured and unstructured data stored in Hadoop. Integrating Solr and Hadoop allows for fast search across big data stored in Hadoop, along with real-time indexing and querying capabilities. Examples discussed include enabling enterprise-wide search of documents stored in Hadoop and using Flume to index log data from Hadoop into Solr for real-time analytics and search.
Couchbase Connect 2014: Lucidworks CEO Will Hayes takes you on a fantastic voyage through the hope and the hype of big data and why the future is search-centric.
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
The document discusses integrating Hadoop and Solr to enable fast, ad-hoc search across structured and unstructured big data stored in Hadoop. It provides examples of how Hadoop can be used for large-scale storage and processing while Solr is used for real-time querying and search. Specifically, it describes how the Lucidworks HDFS connector can process documents from HDFS and index them into SolrCloud for search, and how log data can be ingested from Flume into HDFS for archiving and extracted fields can be indexed into Solr in real-time for search and analytics dashboards.
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessLucidworks (Archived)
Box uses the Solr search platform to power content search across its 25 million+ users. Some key aspects of Box's search implementation with Solr include:
1) The Solr index is sharded or split across multiple shards for high availability and scalability, with each file identifier mapped to a specific shard.
2) Search queries are handled by a front-end load balancer that distributes queries across multiple search head nodes for high availability.
3) Solr documents contain metadata like file owner, parent folders, and extracted text to support search by content, ownership, and folder structure.
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
The document discusses benchmarking the performance of SolrCloud clusters. It describes Timothy Potter's experience operating a large SolrCloud cluster at Dachis Group. It outlines a methodology for benchmarking indexing performance by varying the number of servers, shards, and replicas. Results show near-linear scalability as nodes are added. The document also introduces the Solr Scale Toolkit for deploying and managing SolrCloud clusters using Python and AWS. It demonstrates integrating Solr with tools like Logstash and Kibana for log aggregation and dashboards.
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
The document discusses how search has evolved beyond traditional keyword search to include more complex tasks like recommendations, classifications, and analytics using distributed technologies like Hadoop. It provides an overview of new capabilities in Lucene/Solr like reduced memory usage, pluggable codecs, and spatial search upgrades. LucidWorks offers products like Solr and SiLK that integrate with Hadoop and provide search and analytics capabilities across distributed data.
Solr 4.7 and 4.8 include new features such as asynchronous execution of long-running actions, cursors for deep paging, document expiration, dynamic synonyms and stopwords, SSL support in SolrCloud, and improved collections API. Future versions will focus on ZooKeeper as the single source of truth, incremental field updates, multi-valued DocValues sorting, and removing legacy field types. The speaker also discussed related open source projects from LucidWorks for deploying Solr on AWS, log processing, and data quality.
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrLucidworks (Archived)
This document discusses how Apache Solr can power ecommerce search and provides examples of companies using it. It outlines basic features for ecommerce like facets, highlighting, and boosting as well as advanced features like spatial search and analytics. The document also provides tips for ecommerce search like understanding user needs, debugging issues, and leveraging signals from user behavior to improve relevance.
Target transitioned from their previous search platform to using Solr. Some benefits they found included the speed of importing data into Solr and the ease of adding additional data signals to improve relevancy. However, they had to start from scratch on their relevancy strategy in Solr and found facets worked differently between the platforms. Target also discussed how they were able to improve relevancy by incorporating guest activity data on their website to surface more viewed and ordered items.
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
The document discusses the development of a new search system for PubChem to allow for exploration of multidimensional biomedical data. The new system was needed to address the challenges of handling large and heterogeneous datasets with many relationships between data types in a way that allows for fast querying. The system leverages Apache Solr to provide features like full-text search, faceting, molecule structure searching, and joining of related data. It includes backend components like Solr, SQL, and specialized search engines as well as web APIs and frontend interfaces like reusable widgets and a new search interface.
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
This document discusses Solr, an open source search platform from the Apache Lucene project. It provides full-text search, faceted search, auto-suggest capabilities, and supports multiple file formats for document indexing. The document outlines Solr's architecture and components, provides usage examples from large government sites, and recommends related open source tools.
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
This document discusses building a lightweight discovery interface for Chinese patents. It describes using parsers and the cloud to ingest various patent file formats and metadata in order to build a search interface. It emphasizes spending adequate time on user experience design and sharing data with users and other applications.
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
ISS is a software solutions company that provides big data management tools to Department of Defense and intelligence community customers. They have over 800 employees across several US offices. Their solutions are reusable, license-free for the US government, and scalable from single users to large networks with thousands of users. Customers have thousands of heterogeneous data sources that create data at an increasing rate, making effective search and analytics tools necessary to help analysts extract useful information and actionable intelligence from large amounts of unstructured data in tactical environments. ISS argues that search must be the cornerstone of an effective big data strategy, allowing normalization, indexing, and semantic search of content to help analysts focus their efforts and gain insights from large data sets.
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
Lucene and Solr 4.8 include improvements to speed, flexibility, and scalability. Key updates include native near real-time support in Lucene, faster indexing with document writer per thread, and improved fuzzy and wildcard query processing. Solr 4 offers new faceting, geospatial, and distributed capabilities. Both projects provide easier configuration and more pluggable scoring and indexing options to improve search relevance and performance.
This document summarizes Sean Timm's presentation on Solr and Lucene at AOL. It discusses AOL's history with search technologies including using Open Directory Project (ODP) and building search into AOL Server using their own retrieval model (CPL). It describes AOL's contributions to Solr/Lucene including the Data Import Handler. It provides recommendations for contributing to the Solr/Lucene community such as answering questions, improving documentation, and submitting patches. It highlights some of AOL's applications of Solr like search for MapQuest, AIM, Mail, and analyzing Sarah Palin's emails.
This document provides an introduction to SolrCloud, which enables horizontal scaling of a Solr search index using sharding and replication. Key terminology is defined, including ZooKeeper, nodes, collections, shards, replicas, and leaders. The document outlines the high-level SolrCloud architecture and discusses features like sharding, document routing, replication, distributed indexing and querying. Challenges around consistency and availability are also covered.
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
Doug discusses challenges with collaboration between search developers and content experts when optimizing search relevancy. The current process of developers making changes and experts having to wait a week for results is inefficient. Doug proposes applying test-driven development principles to search by having experts continuously test search results and provide feedback on changes in real-time. This allows developers to get immediate feedback and ensures changes are improving search quality. Doug's company built a tool called Quepid that implements this approach to enable better collaboration between experts and developers when optimizing search.
This document discusses building a data-driven log analysis application using LucidWorks SILK. It begins with an introduction to LucidWorks and discusses the continuum of search capabilities from enterprise search to big data search. It then describes how SILK can enable big data search across structured and unstructured data at massive scale. The solution components involve collecting log data from various sources using connectors, ingesting it into Solr, and building visualizations for analysis. It concludes with a demo and contact information.
LucidWorks App for Splunk Enterprise is the first of its kind, specifically designed to allow companies to analyze and manage the health and availability of their Solr deployments in Splunk software. The solution integrates multi-structured data indexed by Solr directly into Splunk® Enterprise, giving system administrators the ability to look at the intersection of documents, customer records or other unstructured data sources as they relate to machine data. This enables companies to optimize their Solr applications, glean insights from search and usage patterns and spot security concerns to improve end user experiences and derive more business value from data-driven applications.
This webinar will explore the features of the App, and provide attendees with valuable information on the following key components:
Solr Monitor: Monitor the health, availability, and utilization of LucidWorks and/or Solr deployments with pre-defined data inputs, dashboards and reports
Search Analytics: Perform user behavior and click-stream analysis with pre-built search analytics reports and fields
NoSQL Lookups: Using Splunk’s lookup facility, enrich your Splunk reports with data of any structure from Solr’s fully indexed and searchable NoSQL datastore
Search Time Joins: Join Splunk data with human generated and other unstructured data sources stored in Solr at search time for developing data-driven applications
Introducing LucidWorks App for Splunk Enterprise webinar
Solr 3.1 and beyond
1. Solr 3.1 and Beyond
Yonik Seeley
yonik@lucidimagination.com
Lucid Imagination
October 8, 2010
2. Agenda
Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0
Relevancy (Extended Dismax Parser)
Spatial/Geo Search
Search Result Grouping / Field Collapsing
Faceting (Pivot, Range, Per-segment)
Scalability (Solr Cloud)
Odds & Ends
Q&A
10/12/10 3
3. Solr 3.1? What happened to 1.5?
Lucene/Solr merged (March 2010)
Single set of committers
Single dev mailing list (dev@lucene.apache.org)
Single shared subversion trunk
Keep separate downloads, user mailing lists
Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
Development
trunk is now always next major release (currently 4.0)
branch_3x will be base for all 3.x releases
Branch together, Release together, Share version numbers
5. Extended Dismax Parser
Superset of dismax
&defType=edismax&q=foo&qf=body
Fixes edge cases where dismax could still throw exceptions: OR, AND, NOT, -, "
Full lucene syntax support
Tries lucene syntax first
Smart escaping is done if there are syntax errors
Optionally supports treating "and"/"or" as AND/OR in lucene syntax
Fielded queries (e.g. myfield:foo), even in degraded mode
uf parameter controls what field names may be directly specified in "q"
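The parameters above can be combined into a single request URL. A minimal sketch in Python, assuming a local Solr at the default port; the core path and the field names (title, body) are placeholders, not from the deck:

```python
from urllib.parse import urlencode

# Sketch only: host, core, and field names are assumptions.
params = {
    "defType": "edismax",   # select the extended dismax parser
    "q": "solr is awesome",
    "qf": "title^2 body",   # search title (boosted) and body
    "uf": "title body",     # fields allowed as field:value inside q
    "wt": "json",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Fetching this URL would of course require a running Solr instance; the sketch only shows how the parameters fit together.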
6. Extended Dismax Parser (continued)
boost parameter for multiplicative boost-by-function
Pure negative query clauses
Example: solr OR (-solr)
Enhanced term proximity boosting
pf2=myfield – results in term bigrams in sloppy phrase queries
myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
Enhanced stopword handling
stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield
-> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
Currently controlled by the absence of StopWordFilter in the index analyzer, and its presence in the query analyzer
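The pf2 bigram expansion described above can be sketched in a few lines — illustrative only, not Solr's actual implementation:

```python
def bigram_phrases(terms):
    """Turn a term list into the sliding bigram phrases that pf2 boosts on."""
    return ['"%s %s"' % (a, b) for a, b in zip(terms, terms[1:])]

# myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
print(bigram_phrases(["aa", "bb", "cc"]))  # ['"aa bb"', '"bb cc"']
```

Each adjacent pair of query terms becomes a two-word phrase, which is why stopword-containing bigrams can be added back in the boosting part even when the stopwords are dropped from the main query.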
10. Field Collapsing Definition
Field collapsing
Limit the number of results per category
“category” normally defined by unique values in a field
Uses
Web Search – collapse by web site
Email threads – collapse by thread id
Ecommerce/retail
Show the top 5 items for each store category (music, movies, etc)
13. Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
"manu_exact":{
"matches":3,
"groups":[{
"groupValue":"Belkin",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"}]
}},
{
"groupValue":"Apple Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"MA147LL/A",
"name":"Apple 60 GB iPod with Video Playback Black"}]
}}]}}
14. Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
"price:[0 TO 99.99]":{
"matches":3,
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"},
{
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod"}]
}},
"price:[100 TO *]":{
"matches":3,
"doclist":{"numFound":1,"start":0,"docs":[
{
15. Grouping Params
parameter                        meaning                                                           default
group.field=<field>              Like facet.field – group by unique field values
group.query=<query>              Like facet.query – top docs that also match
group.function=<function query>  Group by unique values produced by the function query
group.limit=<n>                  How many docs per group                                           1
group.sort=<sort spec>           How to sort documents within a group                              Same as "sort" param
rows=<n>                         How many groups to return                                         10
sort=<sort spec>                 How to sort the groups relative to each other (based on top doc)
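Several of the parameters in the table can be combined into one request. A hedged sketch (host, core, and the price-based group.sort are placeholders):

```python
from urllib.parse import urlencode

# Sketch only: host/core are assumptions; field names follow the deck's examples.
params = [
    ("q", "ipod"),
    ("group", "true"),
    ("group.field", "manu_exact"),
    ("group.limit", "5"),         # 5 docs per group instead of the default 1
    ("group.sort", "price asc"),  # order docs within each group by price
    ("rows", "10"),               # number of groups returned
]
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```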
22. Per-segment faceting
Enable with facet.method=fcs
Controllable multi-threading
facet.field={!threads=4}myfield
Disadvantages
Larger memory use (FieldCaches + accumulators)
Slower (extra FieldCache merge step needed)
Advantages
Rebuilds FieldCache entries only for new segments (NRT friendly)
Multi-threaded
23. Per-segment faceting performance comparison

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

Test index: 10M documents, 18 segments, single-valued field
*complete request time, measured externally
24. Faceting Performance Improvements
For facet.method=enum, speed up initial population of the filterCache (i.e. first-time facet): from a 30% to a 32x improvement
Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
Optimized deep facet paging – up to 10x faster with really large facet.offset values
Less memory consumed by field cache entries
26. SolrCloud
First steps toward simplifying cluster management
Integrates Zookeeper
Central configuration (schema.xml, solrconfig.xml, etc)
Tracks live nodes + shards of collections
Removes need for external load balancers
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
Can specify logical shard ids
shards=NY_shard,NJ_shard
Clients don’t need to know shards at all:
http://localhost:8983/solr/collection1/select?distrib=true
27. SolrCloud : The Future
Eliminate all single points of failure
Remove Master/Searcher distinction
Enables near real-time search in a highly scalable environment
High Availability for Writes
Eventual consistency model (like Amazon Dynamo, Cassandra)
Elastic
Simply add/subtract servers, cluster will rebalance automatically
By default, Solr will handle document partitioning
29. Auto-Suggest
Many people currently use terms component
Can be slow for a large corpus
New auto-suggest builds off SpellCheck component
Compact memory based trie for really fast completions
Based on a field in the main index, or on a dictionary file
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
"suggestions":[
"ult",{
"numFound":1,
"startOffset":0,
"endOffset":3,
"suggestion":["ultrasharp"]},
"collation","ultrasharp"]}}
30. Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id"         : "978-0641723445",
      "cat"        : ["book","hardcover"],
      "title"      : "The Lightning Thief",
      "author"     : "Rick Riordan",
      "series_t"   : "Percy Jackson and the Olympians",
      "sequence_i" : 1,
      "genre_s"    : "fantasy",
      "inStock"    : true,
      "price"      : 12.50,
      "pages_i"    : 384
    }
  }
}'
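The same "add" command can of course be built programmatically instead of hand-written. A sketch that only constructs the JSON payload — actually POSTing it would require a running Solr at the update/json URL:

```python
import json

# Same example document as the curl command above.
doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}
payload = json.dumps({"add": {"doc": doc}})  # the body passed to curl -d
print(payload)
```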
31. Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
Can handle multi-valued fields (see “cat” field in example)
Completely compatible with the CSV update handler (can round-trip)
Results are streamed – good for dumping entire parts of the index
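The CSV output above parses cleanly with any standard CSV reader; a small sketch using the sample rows from this slide, splitting the quoted multi-valued "cat" field client-side:

```python
import csv
import io

# The sample response body shown above, verbatim.
response = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""

rows = list(csv.DictReader(io.StringIO(response)))
# Multi-valued fields arrive as one quoted, comma-separated cell.
cats = [row["cat"].split(",") for row in rows]
print(cats)
```

This is the round-trip property mentioned above: the quoted multi-value cell is exactly what the CSV update handler accepts back on indexing.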