Solr 4.7 and 4.8 add asynchronous execution of long-running actions, cursors for deep paging, document expiration, dynamic synonyms and stopwords, SSL support in SolrCloud, and an improved Collections API. Future versions will focus on ZooKeeper as the single source of truth, incremental field updates, sorting on multi-valued DocValues fields, and removing legacy field types. The speaker also discusses related open source projects from LucidWorks for deploying Solr on AWS, log processing, and data quality.
What's New in Solr, June 2014
1. What’s New in Solr
Solr 4.7 & 4.8
June 12, 2014
Search | Discover | Analyze
2. Speaker
Steve Rowe
• Software Engineer at LucidWorks
• Lucene/Solr committer and PMC member
• Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
• Twitter: @steven_a_rowe
3. Agenda
• A short history of Solr 4
• Solr 4.7 and 4.8: new features
• Solr 4.9 and beyond
5. A short history of Solr 4
• SolrCloud
– Distributed indexing and searching, NRT and NoSQL
features, e.g. realtime-get, optimistic concurrency and
durable updates
– Sharding, replication, ZooKeeper ensemble
– High availability with no single points of failure
• Real-time Get: Access latest document version, no
commit or new searcher open required
• Atomic updates: incremental field
add/update/increment via stored fields
• NRT: “soft” commits
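To make the atomic-update bullet concrete, here is a minimal Python sketch that builds the JSON body for an atomic update using the `set`, `add`, and `inc` modifiers. The helper name `atomic_update`, the uniqueKey name `id`, and the field names are assumptions made for illustration; nothing is sent to a server.

```python
import json


def atomic_update(doc_id, set_fields=None, add_fields=None, inc_fields=None):
    """Build one atomic-update document for Solr's JSON update format.

    Assumes the uniqueKey field is called "id" (hypothetical schema).
    """
    doc = {"id": doc_id}
    modifiers = (("set", set_fields), ("add", add_fields), ("inc", inc_fields))
    for modifier, fields in modifiers:
        for name, value in (fields or {}).items():
            # Each updated field becomes {"<modifier>": <value>}.
            doc[name] = {modifier: value}
    return doc


# Hypothetical fields: replace the price and bump a counter on document "book1".
payload = json.dumps(
    [atomic_update("book1", set_fields={"price": 9.99}, inc_fields={"copies_sold": 1})]
)
```

A body like `payload` would be posted to the collection's update handler; whether the change is visible to searches right away depends on the commit strategy, which is where soft commits and real-time get come in.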
6. A short history of Solr 4
• Solr Reference Guide now released with each
feature release:
– Live (targeting next Solr release):
http://s.apache.org/SolrReferenceGuide
– Most recent released PDF:
http://s.apache.org/Solr-Ref-Guide-PDF
– Previous release PDFs:
http://s.apache.org/Older-Solr-Ref-Guide-PDFs
7. A short history of Solr 4
• Flexible indexing
– Solr core = Lucene index
• Lucene index = 1 or more segments
– Codec: per-segment suite of formats
• Flexible scoring
– You can specify similarity implementation per fieldType in
your schema.xml if you use SchemaSimilarityFactory
– Built-in Similarities (other than the default TF-IDF):
• Okapi BM25
• Divergence from Randomness
• Information-Based
• Language Models (with two smoothing implementations)
• SweetSpot
8. A short history of Solr 4
• DocValues: typed column stride fields
– Document-to-value mapping built at index time
– Reduced memory usage compared to field cache
– Good for faceting and sorting
– Missing values now supported as of Solr 4.5
• Pseudo-fields
– Field aliasing, e.g. &fl=result:indexed
– Function queries, aliasable too, e.g. &fl=price:sum(a,b)
– Document transformers
• Standard: [explain], [value], [shard], [docid]
• Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
• Pivot faceting: automatic drill-down (no distributed support)
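The pseudo-field and pseudo-join examples on this slide can be combined into one request. The sketch below only builds the query string; the helper name `select_params` is hypothetical, and the field names (`indexed`, `a`, `b`, `manu`) are the ones used in the slide's own examples. It passes one `fl` parameter per entry so that the comma inside `sum(a,b)` cannot be confused with a list separator.

```python
from urllib.parse import parse_qs, urlencode


def select_params(query, fields):
    """Encode a query plus one fl parameter per requested (pseudo-)field."""
    pairs = [("q", query)] + [("fl", field) for field in fields]
    return urlencode(pairs)


qs = select_params(
    "{!join from=manu to=id}ipod",          # pseudo-join from the slide
    [
        "id",
        "result:indexed",                   # field alias
        "price:sum(a,b)",                   # aliased function query
        "[explain]",                        # document transformer
    ],
)

# Round-trip check: decoding gives back exactly what was requested.
decoded = parse_qs(qs)
```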
9. A short history of Solr 4
• Schema API
• GET /collection/schema/fields/fieldname
• PUT /collection/schema/fields/name
• JSON body: { "type":"text_general",
"stored":true,
"indexed":true }
• Schemaless mode
• a.k.a. data-driven schema or field guessing
• Class guessed based on field values, then class(es)
mapped to a fieldType; first gets added to the schema
• Supported value classes: Boolean, Integer, Long, Float,
Double, and Date
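As a rough illustration of the Schema API call shown above, this Python sketch prepares (but does not send) the PUT request with the JSON body from the slide. The base URL, collection name, and field name are assumptions, and the call only succeeds against a schema that is configured to be mutable.

```python
import json
from urllib.request import Request


def add_field_request(base_url, collection, name, field_type, stored=True, indexed=True):
    """Prepare PUT /<collection>/schema/fields/<name> with a JSON field definition."""
    body = json.dumps({"type": field_type, "stored": stored, "indexed": indexed})
    return Request(
        f"{base_url}/{collection}/schema/fields/{name}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, collection, and field name.
req = add_field_request("http://localhost:8983/solr", "collection1", "summary", "text_general")
```

Passing `req` to `urllib.request.urlopen` would issue the request; the sketch stops short of that so it can run without a Solr instance.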
10. A short history of Solr 4
• Document routing
– CompositeId router, e.g. id=tenant!docid
• Used by default when numShards specified when
creating a collection.
• Restrict queries to shard(s): &_route_=tenant!
– Implicit router
• Online shard splitting
– Allows collections to scale, rather than having to
decide on how much to overshard up front.
– Split in two; with custom hash ranges; or using
split.key param to split to a dedicated shard
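A small Python sketch of the compositeId convention described above: the document ID carries a `tenant!` prefix, and the same prefix is passed as `_route_` to restrict a query to that tenant's shard(s). The helper names and the tenant value are hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def composite_id(tenant, doc_id):
    """Build an ID of the form tenant!docid for the compositeId router."""
    if "!" in tenant:
        raise ValueError("tenant prefix must not contain '!'")
    return f"{tenant}!{doc_id}"


def tenant_query(tenant, query):
    """Encode a query restricted to the shard(s) holding the tenant's documents."""
    return urlencode({"q": query, "_route_": f"{tenant}!"})


doc_id = composite_id("acme", "42")      # "acme!42"
qs = tenant_query("acme", "*:*")
```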
11. A short history of Solr 4
• Nested documents, a.k.a. Block Join
– Nested doc to be added:
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
</add>
– Queries:
• Child query parser, e.g.
q={!child of="content_type:parentDocument"}title:Solr
• Parent query parser, e.g.
q={!parent which="content_type:parentDocument"}comments:SolrCloud
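As a rough illustration, the nested `<add>` message above can be generated rather than hand-written. This sketch uses Python's `xml.etree.ElementTree`; the helper names are hypothetical, and the field names are the ones from the slide.

```python
import xml.etree.ElementTree as ET


def add_fields(doc_el, fields):
    """Append one <field name="..."> element per key/value pair."""
    for name, value in fields.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)


def nested_add_message(parent, children):
    """Return an <add> message with the children nested inside the parent doc."""
    add = ET.Element("add")
    parent_el = ET.SubElement(add, "doc")
    add_fields(parent_el, parent)
    for child in children:
        add_fields(ET.SubElement(parent_el, "doc"), child)
    return ET.tostring(add, encoding="unicode")


xml_message = nested_add_message(
    {"id": 1, "title": "Solr adds block join support",
     "content_type": "parentDocument"},
    [{"id": 2, "comments": "SolrCloud supports it too!"}],
)
print(xml_message)
```

Because parent and children are sent in one block, they are indexed together, which is what the child and parent query parsers rely on.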
12. A short history of Solr 4
• solr.xml legacy & discovery modes
– Legacy mode (cores listed in solr.xml) is
deprecated; support will be removed in Solr 5.
– Discovery mode (new as of Solr 4.3):
• No cores are listed in solr.xml
• Cores are discovered by a recursive walk of the solr
home directory, marked by core.properties files
• Nested core directories are not allowed
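The discovery rules above can be illustrated with a short directory walk. This is a sketch of the described behaviour, not Solr's actual implementation: it collects every directory containing a core.properties file and does not descend into a core directory once found. The directory names in the demo are made up.

```python
import os
import tempfile


def discover_cores(solr_home):
    """Return directories under solr_home that contain a core.properties file."""
    cores = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            cores.append(dirpath)
            # Stop descending: a core directory is not searched for more cores.
            dirnames[:] = []
    return sorted(cores)


# Demo on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as home:
    for rel in ("core_a", os.path.join("group", "core_b"),
                os.path.join("core_a", "nested")):
        os.makedirs(os.path.join(home, rel), exist_ok=True)
        open(os.path.join(home, rel, "core.properties"), "w").close()
    found = [os.path.relpath(p, home) for p in discover_cores(home)]

print(found)
```

The `core_a/nested` directory is skipped because the walk stops at `core_a`.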
13. A short history of Solr 4
• New web admin UI with SolrCloud support
14. Solr 4.7 and 4.8: new features
• As of Solr 4.8, Java 7 is the minimum supported
JVM version. Recommended: Oracle 1.7.0_60
• <fields> and <types> tags are no longer necessary in
schema.xml
• Collections API improvements
– Working toward “ZooKeeper = Truth” mode
• legacyCloud=false cluster property
– New actions:
• CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE,
ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS,
MIGRATE, CLUSTERPROP
– Core properties can be specified with CREATE and
SPLITSHARD actions
15. Solr 4.7 and 4.8: new features
• Asynchronous execution of long-running
actions
– SolrCloud Collections API:
• CREATE, SPLITSHARD, MIGRATE
– CoreAdminHandler:
• CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES,
SPLIT
– Tracking request ID supplied via async param
– Track status via the new REQUESTSTATUS action,
using the tracking request ID
• Possible states: running, complete, failed, notfound
– Clear stored statuses with special request ID -1
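A minimal sketch of the submit-then-poll pattern described above. The host, collection, and shard names are hypothetical, and `fetch_state` is a caller-supplied function (an assumption of this sketch) that performs the HTTP call and extracts the state string from the response. The demo uses a fake fetcher and the state names listed on the slide; check the exact strings against the Reference Guide.

```python
import time
from urllib.parse import urlencode

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"  # hypothetical host


def async_url(action, request_id, **params):
    """URL that submits a Collections API action with a tracking request ID."""
    query = {"action": action, "async": request_id, **params}
    return COLLECTIONS_API + "?" + urlencode(query)


def status_url(request_id):
    """URL that asks REQUESTSTATUS about a previously submitted request ID."""
    query = {"action": "REQUESTSTATUS", "requestid": request_id}
    return COLLECTIONS_API + "?" + urlencode(query)


def wait_for(request_id, fetch_state, poll_seconds=1.0, max_polls=60):
    """Poll until the request is no longer 'running', then return its state."""
    for _ in range(max_polls):
        state = fetch_state(status_url(request_id))
        if state != "running":
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"request {request_id} still running after {max_polls} polls")


split_url = async_url("SPLITSHARD", "1000", collection="mycoll", shard="shard1")
print(split_url)

# Fake fetcher standing in for a real HTTP call.
fake_states = iter(["running", "running", "complete"])
final_state = wait_for("1000", lambda url: next(fake_states), poll_seconds=0)
print(final_state)
```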
16. Solr 4.7 and 4.8: new features
• Cursors: Efficient Deep Paging
– Request must include a sort, which must include
the uniqueKey, which must be defined
– First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
• Response contains "nextCursorMark":"<base64encoded>"
– Following pages:
?q=…&sort=id+asc&rows=N&cursorMark=<from response>
– Repeat; when nextCursorMark=cursorMark from the
request, there are no more results
– No server-side state
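The paging loop above can be sketched as a generator. `fetch_page` is a caller-supplied function (hypothetical) that runs the search and returns the documents plus the nextCursorMark from the response. The demo replaces it with an in-memory fake whose cursor marks are plain offsets; real cursor marks are opaque strings and should be passed back unchanged.

```python
def iter_cursor_pages(fetch_page, query, sort="id asc", rows=100):
    """Yield pages of documents until the cursor mark stops advancing."""
    cursor = "*"  # cursorMark for the first page
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        docs, next_cursor = fetch_page(params)
        if docs:
            yield docs
        if next_cursor == cursor:
            return  # same mark came back: no more results
        cursor = next_cursor


# In-memory stand-in for a Solr index, for demonstration only.
DATA = ["a", "b", "c", "d", "e"]


def fake_fetch(params):
    mark = params["cursorMark"]
    start = 0 if mark == "*" else int(mark)
    docs = DATA[start:start + params["rows"]]
    next_cursor = str(start + len(docs)) if docs else mark
    return docs, next_cursor


pages = list(iter_cursor_pages(fake_fetch, "*:*", rows=2))
print(pages)
```

Because the loop carries the cursor mark itself, the server keeps no paging state between requests.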
18. Solr 4.7 and 4.8: new features
• Document expiration and Time To Live (TTL)
– Auto-delete expired documents
• DocExpirationUpdateProcessorFactory can periodically
wake up and delete expired documents
– Compute expiration date from TTL
• Update request _ttl_ param, or
• Document _ttl_ field
• Both names are configurable, defaulting to _ttl_.
• _ttl_ values are interpreted as Date Math Expressions
relative to NOW, e.g. “+1YEAR”.
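A small sketch of the two ways to supply a TTL, assuming the processor is configured with the default `_ttl_` names. The document id and TTL values are made up. One practical detail: a leading `+` in a Date Math Expression must be percent-encoded in a URL, otherwise it is decoded as a space.

```python
import json
from urllib.parse import urlencode


def doc_with_ttl(doc, ttl, ttl_field="_ttl_"):
    """Return a copy of doc carrying a per-document TTL field."""
    return {**doc, ttl_field: ttl}


def update_params_with_ttl(ttl, ttl_param="_ttl_"):
    """Query string that applies one TTL to every document in the update request."""
    return urlencode({ttl_param: ttl})


payload = json.dumps([doc_with_ttl({"id": "session-42"}, "+1YEAR")])
query_string = update_params_with_ttl("+30MINUTES")
print(payload)
print(query_string)
```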
19. Solr 4.7 and 4.8: new features
• Dynamic synonyms and stopwords
– “Managed” resources: configuration and content for
synonyms and stopwords, persistence managed by Solr
– Specified as ManagedSynonymFilterFactory and
ManagedStopFilterFactory on analyzers in schema.xml
– CRUD operations are enabled via a REST endpoint per
managed resource.
– The “managed” attribute names the REST endpoint, e.g.
<filter class="solr.ManagedStopFilterFactory"
managed="french" />
– E.g. to delete stopword “le” from the “french” managed
stoplist:
curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"
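The same REST calls can be built from Python. This sketch only constructs the requests; the host is a hypothetical placeholder standing in for the elided part of the curl URL. The PUT form with a JSON list of words is not shown on the slide; it follows the Managed Resources page as I understand it, so treat it as an assumption to verify.

```python
import json
import urllib.request
from urllib.parse import quote

SOLR_BASE = "http://localhost:8983/solr"  # hypothetical host


def stopwords_endpoint(collection, managed_name):
    """REST endpoint for one managed stopword list."""
    return (f"{SOLR_BASE}/{quote(collection)}"
            f"/schema/analysis/stopwords/{quote(managed_name)}")


def delete_stopword_request(collection, managed_name, word):
    """Build a DELETE request that removes a single stopword."""
    url = stopwords_endpoint(collection, managed_name) + "/" + quote(word, safe="")
    return urllib.request.Request(url, method="DELETE")


def add_stopwords_request(collection, managed_name, words):
    """Build a PUT request that adds stopwords (assumed JSON-list body)."""
    return urllib.request.Request(
        stopwords_endpoint(collection, managed_name),
        data=json.dumps(list(words)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


delete_req = delete_stopword_request("colln", "french", "le")
add_req = add_stopwords_request("colln", "french", ["la", "les"])
print(delete_req.get_method(), delete_req.full_url)
print(add_req.get_method(), add_req.data.decode("utf-8"))
```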
20. Solr 4.7 and 4.8: new features
• SSL support in SolrCloud
– URL scheme stored in ZooKeeper
– SSL certificates are specifiable via system properties, to
enable authentication
• Nested documents may be specified in JSON format
• Tri-level compositeId routing
– E.g. “tenant!group!docid”, 8/8/16 hash bits per component
• Build Solr indexes with Hadoop’s MapReduce
– Mark Miller’s blog: http://bit.ly/1oh0fWq
• GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
• Named config sets in non-SolrCloud mode
– Default base directory is SOLR_HOME/configsets/
21. Solr 4.7 and 4.8: new features
• Suggester v2
– Added BlendedInfixSuggester
– Added FreeTextSuggester
– Queries can use multiple suggesters
• New query parsing features
– SimpleQParserPlugin: parser for human entered
queries with selectable operators.
– ComplexPhraseQParserPlugin: wildcards, ORs, etc.
inside Phrase Queries
• E.g. {!complexphrase inOrder=true}name:"Jo* Smith"
22. Solr 4.7 and 4.8: new features
• CollapsingQParserPlugin
– Performant alternative grouping/field collapsing
implementation, for high distinct group cardinality.
• ExpandComponent
– Expands collapsed groups
– Can also expand nested documents
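A hedged sketch of how the two pieces combine in one request: a `{!collapse}` filter query keeps one document per group, and `expand=true` asks the ExpandComponent to return the other members of each group. The query and the field name `group_id` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def collapse_expand_params(query, field, expand_rows=5):
    """Query string that collapses on a field and expands each collapsed group."""
    return urlencode({
        "q": query,
        "fq": "{!collapse field=%s}" % field,
        "expand": "true",
        "expand.rows": expand_rows,
    })


query_string = collapse_expand_params("ipod", "group_id")
decoded = parse_qs(query_string)
print(query_string)
print(decoded["fq"][0])
```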
23. Solr 4.9 and beyond
• ZooKeeper = Truth / legacyCloud=false
• MODIFYCOLLECTION collections API
– Modify maxShardsPerNode, replicationFactor for the
entire collection
• Incremental Field Updates on numeric
DocValues
– Binary DocValues IFUs also coming
• Multi-valued DocValues sort fields
• Legacy numeric/date field types deprecated,
removed in Solr 5 in favor of Trie field types
24. Solr 4.9 and beyond
• In Solr 5, the .war will no longer be shipped
• Index integrity: checksums
• Integrity check on merge off by default
• solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
• New update query param min_rf will allow clients
to set the minimum successful replicas for the
request
• Return Block Join child documents when parents
match, via a new DocTransformer
[child parentFilter="field:value"]
25. Solr 4.9 and beyond
• AnalyticsQuery: support pluggable, pipeline-able
analytics, orderable via the “cost” parameter, like
PostFilters.
• ReRankingQParserPlugin
• Re-rank the top n results
26. Platform
LucidWorks Open Source
• Effortless AWS deployment and monitoring:
http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr:
https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way
support w/ Lucene and Solr, different file formats, pipelines,
Logstash
27. Links
Solr website: http://lucene.apache.org/solr
Solr Reference Guide:
• Live (targeting next Solr release):
http://s.apache.org/SolrReferenceGuide
• Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
• Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
Lucene/Solr Revolution: http://www.LuceneRevolution.org
Q & A
Editor's Notes
Asynchronous collection API calls in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-AsynchronousCalls
REQUESTSTATUS action in the Solr Reference Guide: http://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-RequestStatus
See Pagination of Results in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
Chris Hostetter’s scripts to produce the graph: https://github.com/LucidWorks/blog-deep-paging-perf
Date Math Expressions in Solr Javadocs: https://lucene.apache.org/solr/4_8_1/solr-core/org/apache/solr/util/DateMathParser.html
See Chris Hostetter’s blog post “New in Solr 4.8: Document Expiration”: http://searchhub.org/2014/05/07/document-expiration/
See the “Managed Resources” page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Managed+Resources
See also Tim Potter’s blog “Using Solr’s REST APIs to manage stop words and synonyms”: http://searchhub.org/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/
For info on Tri-level compositeId routing, see Anshum Gupta’s blog “Multi level composite-id routing in SolrCloud”: http://searchhub.org/2014/01/06/10590/
See the Config Sets page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Config+Sets
Suggester v2 JIRA issue: https://issues.apache.org/jira/browse/SOLR-5378
Simple Query Parser in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-SimpleQueryParser
Complex Phrase Query Parser in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
See the Collapse & Expand page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Collapse+%26+Expand
See also Joel Bernstein’s blog post “The CollapsingQParserPlugin: Solr’s New High Performance Field Collapsing PostFilter”: http://heliosearch.org/the-collapsingqparserplugin-solrs-new-high-performance-field-collapsing-postfilter/
See also Joel Bernstein’s blog post “Solr’s New Expand Component”: http://heliosearch.org/solrs-new-expand-component/
See also Joel Bernstein’s blog post “Using the ExpandComponent to expand a Solr Block Join”: http://heliosearch.org/expand-block-join/
See Joel Bernstein’s blog post “Solr’s New AnalyticsQuery API”: http://heliosearch.org/solrs-new-analyticsquery-api/
See Joel Bernstein’s blog post “New in Solr 4.9: Query Re-Ranking”: http://heliosearch.org/solrs-new-re-ranking-feature/