Indexing Text and HTML Files with Solr
1. Indexing Text and HTML Files with Solr, the Lucene Search Server
A Lucid Imagination Technical Tutorial
By Avi Rappoport, Search Tools Consulting
2. Abstract
Apache Solr is the popular, blazing fast open source enterprise search platform; it uses
Lucene as its core search engine. Solr’s major features include powerful full-text search, hit
highlighting, faceted search, dynamic clustering, database integration, and complex queries.
Solr is highly scalable, providing distributed search and index replication, and it powers the
search and navigation features of many of the world's largest internet sites. Lucid
Imagination’s LucidWorks Certified Distribution for Solr provides a fully open distribution
of Apache Solr, with key complements including a full Reference Guide, an installer, and
additional functions and utilities. All the core code and many new features are available, for
free, at the Lucid Imagination web site (www.lucidimagination.com/downloads).
In the past, examples available for learning Solr were for strictly-formatted XML and
database records. This new tutorial provides clear, step-by-step instructions for a more
common use case: how to index local text files, local HTML files, and remote HTML files. It
is intended for those who have already worked through the Solr Tutorial or equivalent.
Familiarity with HTML and a terminal command line is all that is required; no formal
experience with Java or other programming languages is needed. System requirements for
this tutorial are those of the Startup Tutorial: UNIX, Cygwin (UNIX on Windows), or Mac OS X;
Java 1.5; disk space; permission to run applications; and access to content.
Indexing Text and HTML Files with Solr
A Lucid Imagination Tutorial • February 2010 Page ii
3. Contents
Introduction
Part 1: Installing This Tutorial
Part 2: Solr Indexing with cURL
    Using the cURL command to index Solr XML
    Troubleshooting errors with cURL Solr updates
    Viewing the first text file in Solr
Part 3: Using Solr to Index Plain Text Files
    Invoking Solr Cell
    Parameters for more fields
Part 4: Indexing All Text Files in a Directory
    Shell script for indexing all text files
    More robust methods of indexing files
Part 5: Indexing HTML Files
    Simplest HTML indexing
    Storing more metadata from HTML
    Storing body text in a viewable field
Part 6: Using Solr Indexing for Remote HTML Files
    Using cURL to download and index remote files
    File streaming for indexing remote documents
    Spidering tools
Conclusion and Additional Resources
    About Lucid Imagination
    About the Author
Indexing Text and HTML Files with Solr
A Lucid Imagination Tutorial • February 2010 Page iii
Introduction
Apache Solr is the popular, blazing fast open source enterprise search platform; it uses
Lucene as its core search engine. Solr’s major features include powerful full-text search, hit
highlighting, faceted search, dynamic clustering, database integration, and complex queries.
Solr is highly scalable, providing distributed search and index replication, and it powers the
search and navigation features of many of the world's largest internet sites.¹
Today, the newly released version of Solr 1.4 includes a new module called Solr Cell that
can access many file formats including plain text, HTML, zip, OpenDocument, and Microsoft
Office formats (both old and new). Solr Cell invokes the Apache Tika extraction toolkit
(another part of the Apache Lucene family, integrated into Solr). This tutorial provides a
simple introduction to this powerful file access functionality.
In this tutorial, we’ll walk you through the steps required for indexing readily accessible
sources with simple command-line tools for Solr, using content you are likely to have
access to: your own files, local discs, intranets, file servers, and web sites.
Part 1: Installing This Tutorial
As it turns out, the existing examples in the default installation of the Solr Tutorial are
for indexing specific formats of XML and JDBC-interface databases. While those formats can
be easier for search engines to parse, many people learning Solr do not have access to such
content. This new tutorial provides clear, step-by-step instructions for a more common use
case: how to index local text files, local HTML files, and remote HTML files. It is intended for
those who have already worked through the Solr Tutorial or equivalent.
This tutorial will add more example entries, using Abraham Lincoln's Gettysburg Address
and the United Nations’ Universal Declaration of Human Rights as text files, and as HTML
files, and walk you through getting these document types indexed and searchable.
First, follow the instructions in the Solr Tutorial (from
http://lucidimagination.com/Downloads/LucidWorks-for-Solr or
http://lucene.apache.org/solr/tutorial.html) from installation to Querying Data (or
beyond). When you are done, the Solr index will have about 22 example entries, most of
them about technology gadgets.
¹ Lucene, the Apache search library at the core of Solr, presents its interfaces through a collection of directly
callable Java libraries, offering fine-grained control of machine functions and independence from higher-level
protocols, and requiring development of a full Java application. Most users building Lucene-based search
applications will find they can do so more quickly if they work with Solr, as it adds many of the capabilities needed
to turn a core search function into a full-fledged search application.
Next, use a browser or ftp program to access the tutorial directory on the Lucid
Imagination web site (http://www.lucidimagination.com/download/thtutorial/). You
should find the following files:
lu-example-1.txt lu-example-4.txt lu-example-8.html thtutorial.zip
lu-example-1.xml lu-example-5.txt post-txt.sh
lu-example-2.txt lu-example-6.html remote
lu-example-3.txt lu-example-7.html schema.xml
For your convenience, all of the files above are included in thtutorial.zip
(http://www.lucidimagination.com/download/thtutorial/thtutorial.zip). Move the zip file
to the Solr example directory (which is probably in /usr/apache-solr-1.4.0 or /LucidWorks),
and unzip it: this will create an example-text-html directory.
Working Directory: example-text-html
This tutorial assumes that the working directory is
[Solr home]/example/example-text-html: you can check your location by using the
Unix command-line utility pwd.
Setting the schema
Before starting, it's important to update the example schema file to work properly with text
and HTML files. The schema needs one extra field defined, so that all words in the plain text
files, and all HTML body words, go into the default field for searching.
Make a backup by renaming the schema.xml file in the conf directory to schema-bak.xml:
% mv ../../lucidworks/solr/conf/schema.xml ../../lucidworks/solr/conf/schema-bak.xml
Then either copy the text-html version of the schema or edit the version that's there to
include the body text field.
• Either: copy the new one from the example-text-html directory into the conf
directory:
% cp schema.xml ../../lucidworks/solr/conf/schema.xml
or (for apache installs)
% cp schema.xml ../solr/conf/schema.xml
• Or: edit the schema to add this field:
• Open the original schema.xml in your favorite text editor
• Go to line 469 (LucidWorks) or 450 (Apache). This should be the Solr Cell section
with other HTML tags.
...
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
• and add the code to create the body field:
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
• Go to line 558 (LucidWorks) or 540 (Apache), and look for the copyField section.
...
<copyField source="includes" dest="text"/>
<copyField source="manu" dest="manu_exact"/>
• Go to the end of the section, after the manu field, and add this line to copy the body
field content into the text field (the default search field):
<copyField source="body" dest="text"/>
• Save and close the schema.xml file.
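Taken together, the two additions to schema.xml are just these two lines, each placed in the section noted above:

```xml
<!-- New field: holds extracted body text; indexed for search, stored for display -->
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- New copyField: also copy body content into the default search field -->
<copyField source="body" dest="text"/>
```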
Restarting Solr
Solr will not use the new schema until you restart the search engine. If you haven't done
this before, follow these steps:
• Switch to the terminal window in which the Solr engine has been started
• Press ^c (control-c) to end this session: it should show you that Shutdown hook is
executing.
• (Apache) Type the command java -jar start.jar to start it again. This only works
from the example directory, not from the example-text-html directory.
• (LucidWorks) Start Solr by running the start script, or clicking on the system tray icon
Part 2: Solr Indexing with cURL
Plain text seems as though it should be the simplest format to index, but there are a few
steps to go through. This tutorial walks through those steps, using the Unix shell cURL command.
Using the cURL command to index Solr XML
The first step is communicating with Solr. The Solr Startup Tutorial shows how to use the
Java tool to index all the .xml files. This tutorial uses the cURL utility available in Unix,
within the command-line (terminal) shell.
• Telling Solr to index is like sending a POST request from an HTML form, with an
appropriate path name (by default /update) and parameters. cURL performs this process on
the command line. This example uses the test file lu-example-1.xml.
To start, be in the solr/example/example-text-html directory.
Then, instruct Solr to update (index) an XML file using cURL, and then finish the index
update with a commit command:
curl 'http://localhost:8983/solr/update/' -H 'Content-type: text/xml' --data-binary "@lu-example-1.xml"
curl 'http://localhost:8983/solr/update/' -H 'Content-Type: text/xml' --data-binary '<commit/>'
Successful calls have a response status of 0.
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int><int name="QTime">000</int></lst>
</response>
Troubleshooting errors with cURL Solr updates
If you get a cURL error, it's usually caused by mismatched double quotes or single quotes. If
you see one of the following, go back and try again.
curl: (26) failed creating formpost data
curl: (3) <url> malformed
Warning: Illegally formatted input field!
curl: option -F: is badly used here
Common error numbers from the Solr server itself include 400 and 500. These mean that
the POST was properly formatted but included parameters or content that Solr could not identify.
When that happens, go back to a previous command that did work, and start building the
new one up from there. These errors should not damage your Solr search engine or index.
<title>Error 400 </title>
</head>
<body><h2>HTTP ERROR: 400</h2><pre>Unexpected character 's' (code 115) in
prolog; expected '<'
at [row,col {unknown-source}]: [1,1]</pre>
or
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2>
<pre>org.apache.lucene.store.NoSuchDirectoryException: directory
'/Applications/LucidWorks/example/solr/data/index' does not exist
If you can't make this work, you may want to follow the instructions with the Solr Startup
Tutorial to create a new Solr directory and confirm using the Java indexing instructions for
the exampledocs XML files before continuing.
Viewing the first text file in Solr
Once you have successfully sent the XML file to Solr's update processor, go to your browser,
as in the Getting Started tutorial, and search your Solr index for "gettysburg":
http://localhost:8983/solr/select?q=gettysburg
The result should be an XML document, which will report one item matching the new test
file (rather than the earlier example electronic devices files). The number of matches is on
about the eighth line, and looks like this:
<result name="response" numFound="1" start="0">
After that, the Solr raw interface will show the contents of the indexed file:
Notes
You must use a browser that can render XML, such as Firefox, Internet Explorer,
or Opera (but not Safari).
The field label arr indicates a multiValued field.
Part 3: Using Solr to Index Plain Text Files
Integrated with Solr version 1.4, Solr Cell (also known as the ExtractingRequestHandler)
provides access to a wide range of file formats using the integrated Apache Tika toolkit,
including untagged plain text files. The test file for this tutorial is lu-example-2.txt. It has
no tags or metadata within it, just words and line breaks.
Note
The Apache Tika project reports that extracting the words from plain text files is
surprisingly complex, because there is so little information on the language and
alphabet used. The text could be in Roman (Western European), Indic, Chinese, or
any other character set. Knowing this is important for indexing, in particular for
defining the rules of word breaks, a process known as tokenization.
Invoking Solr Cell
To trigger the Solr Cell text file processing (as opposed to the Solr XML processing), add
extract in the URL path in the POST command: /solr/update/extract.
This example includes three new things: the extract path term, a document ID (because
this file doesn't have an ID tag), and an inline commit parameter, to send the update to the
index.
curl 'http://localhost:8983/solr/update/extract?literal.id=exid2&commit=true' -F "myfile=@lu-example-2.txt"
The response status of 0 signals success. Your cURL command has added the contents of
lu-example-2.txt to the index.
Now when you run the query http://localhost:8983/solr/select?q=gettysburg, both
documents are matched:
<result name="response" numFound="2" start="0">
Unlike the indexed XML document, this text document shows only two fields
(content-type and id) in the search result. The text content, even the word
"Gettysburg," seems to be missing.
How can Solr match words in a file using text that doesn’t seem to be there? It's because
Solr’s default schema.xml is set to index content for searching, but not to store it for
viewing. In other words, Solr isn’t preset to store, for your viewing, the parts of the
documents that have no HTML tags or other labels. For plain text files, that's everything, so
the next section is about changing that behavior.
Parameters for more fields
Solr Cell provides ways to control the indexing without having to change source code.
Parameters in the POST message set the option to save information about the documents in
appropriate fields, and then to grab the text itself and save it in a field. The metadata can be
extracted without the file contents or with the contents.
Solr Cell external metadata
When Solr Cell reads a document for indexing, it has some information about the file, such
as the name and size. This is metadata (information about information), and can be very
valuable for search and results pages. Although these fields are not in the schema.xml file,
Solr is very flexible, and can put them in dynamic fields that can be searched and displayed.
The operative parameter is uprefix=attr_; when added to the POST command, it will save
the file name, file size (in bytes), content type (usually text/plain), and sometimes the
content encoding and language.
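These attr_ fields work because the example schema.xml ships with a catch-all dynamic field pattern. In the Solr 1.4 example schema it looks roughly like the line below; check your own schema for the exact type name:

```xml
<!-- Any field name starting with attr_ is accepted, indexed, and stored -->
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="true"/>
```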
curl 'http://localhost:8983/solr/update/extract?literal.id=exid2&uprefix=attr_&commit=true' -F "myfile=@lu-example-2.txt"
Here is an example of the same file, indexed with the uprefix=attr_ parameter:
Mapping document content
Once the metadata is extracted, Solr Cell can be configured to grab the text as well. The
fmap.content=body parameter stores the file content in the body field, where it can be
searched and displayed.
Note
Using the fmap parameter without uprefix will not work. To see the body text, the
schema.xml must have a body field, as described in the Install section above.
Here's an example of an index command with both attribute and content mapping:
curl 'http://localhost:8983/solr/update/extract?literal.id=exid3&uprefix=attr_&fmap.content=body&commit=true' -F "txtfile=@lu-example-3.txt"
Searching the Solr index <http://localhost:8983/solr/select?q=gettysburg> will now
display all three example files. For lu-example-3.txt, it shows the body text in the
body field and metadata in various fields.
Part 4: Indexing All Text Files in a Directory
The Solr Startup Tutorial exampledocs directory contains a post.sh file, which is a shell script
that uses cURL to send files to the default Solr installation for indexing. This version uses
the cURL commands above to send .txt (as opposed to .xml) files to Solr for indexing. The
file post-text.sh should be in the ../example/example-text-html/ directory with the
test files.
Shell script for indexing all text files
• Set the permissions: chmod +x post-text.sh
• Invoke the script: ./post-text.sh
You should see the <response> with status 0 and the other lines after each item: if you do
not, check each line for exact punctuation and try again.
When you go back to search on Solr, http://localhost:8983/solr/select?q=gettysburg, you
will find five text documents and one XML document.
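If you're curious what such a script contains, here is a minimal sketch of the approach; the actual post-text.sh shipped with the tutorial may differ in its details, and the make_id helper is hypothetical:

```shell
#!/bin/sh
# Sketch of a post-text.sh-style loop: derive an ID like "exid3" from a
# file name like "lu-example-3.txt", then post each file to Solr Cell.
make_id() {
  # Keep only the digits of the file name, prefixed with "exid"
  echo "exid$(echo "$1" | tr -cd '0-9')"
}

for f in lu-example-*.txt; do
  [ -e "$f" ] || continue   # skip if no matching .txt files are present
  curl "http://localhost:8983/solr/update/extract?literal.id=$(make_id "$f")&uprefix=attr_&fmap.content=body&commit=true" \
       -F "myfile=@$f"
done
```

Note that this sketch derives each document's ID from the digits in its file name, which is consistent with the generated exid numbering described below.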
Different doc IDs: adds an additional document
Note that the results include two different copies of the first example, both containing “Four
score and seven years ago”, because the script loop sent all text files with the generated exid
number, while the XML example contains an id starting with exidx.
Identical doc IDs - replaces a document
The second example text file had some text that was indexed but not stored as a text block
when we first indexed it. Now it has content in the body field, because the script loop sent it
with the same ID (and the new parameters), so Solr updated the copy that was already in
the index, using the Doc ID as the definitive identifier.
For more information on IDs, see the LucidWorks Certified Distribution Reference Guide on
Unique Key.
More robust methods of indexing files
Sending indexing and other commands to Solr via cURL is an easy way to try new things
and share ideas, but cURL is not built to be a production-grade facility. And because Solr's
HTTP API is so straightforward, there are many ways to call Solr programmatically. There
are libraries for Solr in nearly every language, including Java, Ruby, PHP, JSON, C#, and Perl,
among others. Many content management systems (CMS), publishing and social media
systems have Solr modules, such as Ruby on Rails, Django, Plone, TYPO3, and Drupal; it is
also used in cloud computing environments such as Amazon Web Services and Google
Code. For more information, check the Solr wiki and the LucidWorks Solr client API Lineup
in the LucidWorks Certified Distribution Reference Guide.
Part 5: Indexing HTML Files
This tutorial uses the same cURL commands and shell scripts for HTML as for text. Solr Cell
and Tika already extract many HTML tags such as title and date modified.
Note
All the work described above on text files also applies to HTML files, so if you've
skipped to here, please go back and read the first sections.
Simplest HTML indexing with cURL
The first example will index an HTML file with a quote from the Universal Declaration of
Human Rights:
curl 'http://localhost:8983/solr/update/extract?literal.id=exid6&commit=true' -F "myfile=@lu-example-6.html"
Doing a query for "universal" (http://localhost:8983/solr/select?q=universal) shows us
that Solr Cell created the metadata fields title and links, because they are standard
HTML constructs.
Again, by default, the body text is indexed but not stored; changing that is just as easy
as it was with the text files.
Storing more metadata from HTML
As in the text file section of this tutorial, this example uses the uprefix=attr_ parameter
to mark those fields that Solr Cell automatically creates but which are not in the
schema.xml. This is not a standard, but it's a convention that's widely used.
curl 'http://localhost:8983/solr/update/extract?literal.id=exid7&uprefix=attr_&commit=true' -F "myfile=@lu-example-7.html"
Searching for "universal" now finds both HTML documents. While exid6 has very little
stored data, exid7 has the internal metadata of the document, including the title, author,
and comments.
Note
Apache Tika uses several methods to identify file formats. These include
extensions, like .txt or .html, MIME types such as text/plain or application/pdf,
and known file format header patterns. It's always best to have your source
files use these labels, rather than relying on Tika to guess.
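If you can't control the file names, cURL itself can label an upload with an explicit MIME type using the ;type= suffix on its -F argument, so Tika doesn't have to guess. Here is a hedged sketch (the echo prints the command rather than running it; remove the echo to actually post):

```shell
# Attach an explicit MIME type to the uploaded part with curl's ";type=" syntax.
part="myfile=@lu-example-7.html;type=text/html"
echo curl 'http://localhost:8983/solr/update/extract?literal.id=exid7&commit=true' -F "$part"
```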
Storing body text in a viewable field
As with the text files, indexing Example 8 uses the fmap parameter to map the text within
the <body> element of the HTML document to the body field in this example schema, so it
will be both searchable and stored.
curl -f 'http://localhost:8983/solr/update/extract?uprefix=attr_&fmap.content=body&commit=true&literal.id=exid8' -F "myfile=@lu-example-8.html"
Part 6: Using Solr indexing for Remote HTML Files
Using cURL to download and index remote files
The cURL utility is a fine way to download a file served by a Web server, which in this
tutorial we’ll call a remote file. With the -O flag (capital letter O, not the digit zero), cURL
will save a copy of the file with the same name into the current working directory. If a file
with that name already exists, it will be overwritten, so be careful.
curl -O http://www.lucidimagination.com/download/thtutorial/lu-example-9.html
Note
If web access is unavailable, there's a copy of the lu-example-9.html file in
the remote subdirectory in the zip file.
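Since -O overwrites silently, a cautious wrapper can check for an existing local copy first. This is a dry-run sketch (it prints the command it would run; remove the echo to actually download), and the safe_fetch helper is hypothetical:

```shell
# Refuse to clobber an existing local copy before fetching a remote file.
safe_fetch() {
  if [ -e "$1" ]; then
    echo "skipping: $1 already exists" >&2
  else
    echo curl -O "http://www.lucidimagination.com/download/thtutorial/$1"
  fi
}

safe_fetch lu-example-9.html
```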
If you view the files in the local example-text-html directory, there will be a
lu-example-9.html file. The next step is to send it to Solr, which will use Solr Cell to
index it.
curl "http://localhost:8983/solr/update/extract?literal.id=exid9&uprefix=attr_&fmap.content=body&commit=true" -F "exid9=@lu-example-9.html"
This will index and store all the text in the file, including the body, comments, and
description.
File streaming for indexing remote documents
Solr also supports a file streaming protocol, sending the remote document URL to be
indexed. For more information, see the ExtractingRequestHandler and ContentStream
pages in the LucidWorks Certified Distribution Reference Guide for Solr, or the Solr wiki. Note
that enabling remote streaming may create an access control security issue: for more
information, see the Security page on the wiki.
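One way to use this, sketched below, is the stream.url parameter on the extract handler. This assumes remote streaming has first been enabled in solrconfig.xml (enableRemoteStreaming="true"), and the echo prints the command instead of running it:

```shell
# Build an extract request that tells Solr to fetch and index the remote URL itself,
# instead of downloading the file locally first.
SOLR='http://localhost:8983/solr/update/extract'
DOC='http://www.lucidimagination.com/download/thtutorial/lu-example-9.html'
REQ="$SOLR?literal.id=exid9&fmap.content=body&commit=true&stream.url=$DOC"
echo curl "\"$REQ\""
```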
Spidering tools
This tutorial doesn't cover the step of adding a spider (also known as a crawler or robot) to
the indexing process. Spiders are programs that open web pages and follow links on the
pages, to index a web site or an intranet server. This is how horizontal consumer web
search providers such as Google, Ask, and Bing find so many pages.
Solr doesn't have an integrated spider, but works well with another Apache Lucene open
source project, the Nutch crawler. There's a very helpful post on Lucid Imagination's site,
Using Nutch with Solr, which explains further how this works.
Alternatives include Heritrix from the Internet Archive, JSpider, WebLech, Spider on Rails,
and OpenWebSpider.
Conclusion and Additional Resources
Now that you’ve had the opportunity to try Solr on HTML content, the opportunities to
build a search application with it are as diverse and broad as the content you need to
search! Here are some resources you will find useful in building your own search
applications.
Configuring the ExtractingRequestHandler in Chapter 6 of the LucidWorks for Solr
Certified Distribution Reference Guide
http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide
Solr Wiki: Extracting Request Handler (Solr Cell)
http://wiki.apache.org/solr/ExtractingRequestHandler
Tika http://lucene.apache.org/tika/
Content Extraction with Tika, by Sami Siren:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
Optimizing Findability in Lucene and Solr, by Grant Ingersoll:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr
About Lucid Imagination
Mission critical enterprise search applications in e-commerce, government, research,
media, telecommunications, Web 2.0, and many more use Apache Lucene/Solr to ensure
end users can find valuable, accurate information quickly and efficiently across the
enterprise. Lucid Imagination complements the strengths of this technology with a
foundation of commercial-grade software and services with unmatched expertise. Our
software and services solutions help organizations optimize performance and achieve high-
quality search results with their Lucene/Solr applications. Lucid Imagination customers
include AT&T, Nike, Sears, Ford, Verizon, The Guardian, Elsevier, The Motley Fool, Cisco,
Macy's and Zappos.
Lucid Imagination is here to help you meet the most demanding search application
requirements. Our free LucidWorks Certified Distributions are based on these most
popular open source search products, including free documentation. And with our
industry-leading services, you can get the support, training, value added software, and
high-level consulting and search expertise you need to create your enterprise-class search
application with Lucene and Solr.
For more information on how Lucid Imagination can help search application developers,
employees, customers, and partners find the information they need, please visit
http://www.lucidimagination.com to access blog posts, articles, and reviews of dozens of
successful implementations. Please e-mail specific questions to:
• Support and Service: support@lucidimagination.com
• Sales and Commercial: sales@lucidimagination.com
• Consulting: consulting@lucidimagination.com
• Or call: 1.650.353.4057
About the Author
Avi Rappoport really likes good search, and Solr is really good. She is the founder of Search
Tools Consulting, which has given her the opportunity to work with site, portal, intranet
and Enterprise search engines since 1998. She also speaks at search conferences, writes on
search-related topics for InfoToday and other publishers, co-manages the LinkedIn
Enterprise Search Engine Professionals group, and is the editor of SearchTools.com.
Contact her at consult [at] searchtools . com or via the searchtools.com web site.