Data in the wild isn’t always in the format we need for search, or even for basic usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format to make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
3. Produces the world’s largest open source user conference dedicated to Lucene/Solr
Lucidworks is the primary sponsor of the Apache Solr project
Employs over 40% of the active committers on the Solr project
Contributes over 70% of Solr's open source codebase
Based in San Francisco
Offices in Bangalore, Bangkok, New York City, Raleigh, London
Over 300 customers across the Fortune 1000
Fusion, a Solr-powered platform for search-driven apps
5. An optimized search experience for every user using relevance boosting and machine learning.
Create custom search and discovery applications in minutes.
Highly scalable search engine and NoSQL datastore that gives you instant access to all your data.
Lucidworks Fusion product suite
6. • 50+ connectors
• Full SQL compatibility
• End-to-end security
• Multi-dimensional real-time ingestion
• Administration and analytics
20. Not…
o 20 discrete operations I have to do to convert one field…
o Conditional operations (if this then this, otherwise do this other thing)
o Canned functionality you have elsewhere.
o I don’t want to do anything that feels like programming in form fields…
25. With Context
function (doc, ctx) {
// do really important things.
return doc;
}
https://doc.lucidworks.com/fusion-pipeline-javadocs/3.1/com/lucidworks/apollo/pipeline/Context.html
27. With solrServer
function (doc, ctx, collection, solrServer) {
// do really important things.
// solrServer can index/query things
return doc;
}
https://doc.lucidworks.com/fusion-pipeline-javadocs/3.1/com/lucidworks/apollo/component/BufferingSolrServer.html
28. With solrServerFactory aka SolrClientFactory
function (doc, ctx, collection, solrServer, solrServerFactory) {
  // do really important things.
  // solrServerFactory lets you look up other collections
  return doc;
}
https://doc.lucidworks.com/fusion-pipeline-javadocs/3.1/com/lucidworks/apollo/component/SolrClientFactory.html
30. Add a Field
function (doc) {
  // setField replaces any values currently in the field with the new one
  doc.setField('some-new-field', 'some field value');
  // for multi-valued fields, addField combines the new value with any old
  // values if there are any; otherwise it adds a new field
  doc.addField('some-new-field', 'some field value');
  return doc;
}
31. Glue Two Fields
function (doc) {
  var value = "";
  if (doc.hasField("Actor1Geo_Lat") && doc.hasField("Actor1Geo_Long")) {
    value = doc.getFirstFieldValue("Actor1Geo_Lat") + "," + doc.getFirstFieldValue("Actor1Geo_Long");
    doc.addField("Actor1Geo_p", value);
  }
  return doc;
}
32. Iterate through the fields
function (doc) {
  // list of doc fields to iterate over
  var fields = doc.getFieldNames().toArray();
  for (var i = 0; i < fields.length; i++) {
    var fieldName = fields[i];
    var fieldValue = doc.getFirstFieldValue(fieldName);
    logger.info("field name: " + fieldName + ", field value: " + fieldValue);
  }
  return doc;
}
34. Preview a field
function (doc) {
  if (doc.getId() != null) {
    var fromField = "body_t";
    var toField = "preview_t";
    var value = doc.getFirstFieldValue(fromField);
    value = value ? value : "";
    // replace newlines and tabs with spaces
    var pattern = /\n|\t/g;
    value = value.replace(pattern, " ");
    // trim to at most 500 characters
    var length = value.length < 500 ? value.length : 500;
    value = value.substr(0, length);
    doc.addField(toField, value);
  }
  return doc;
}
35. Bust up a document
function (doc) {
  var field = doc.getFieldValues('price');
  var id = doc.getId();
  var newDocs = [];
  for (var i = 0; i < field.size(); i++) {
    newDocs.push({ 'id' : id + '-' + i,
      'fields' : [ { 'name' : 'subject', 'value' : field.get(i) } ] });
  }
  return newDocs;
}
36-37. Look up in another collection
function doWork(doc, ctx, collection, solrServer, solrServerFactory) {
  var imports = new JavaImporter(
    org.apache.solr.client.solrj.SolrQuery,
    org.apache.solr.client.solrj.util.ClientUtils);
  with (imports) {
    var sku = doc.getFirstFieldValue("sku");
    if (!doc.hasField("mentions")) {
      var mentions = "";
      var productsSolr = solrServerFactory.getSolrServer("products");
      if (productsSolr != null) {
        var q = "sku:" + sku;
        var query = new SolrQuery();
        query.setRows(100);
        query.setQuery(q);
        var res = productsSolr.query(query);
        mentions = res.getResults().size();
        doc.addField("mentions", mentions);
      }
    }
  }
  return doc;
}
38. Reject a document
function (doc) {
if (doc.hasValue('foo')) {
return null; // stop this document from being indexed.
}
return doc;
}
40. Next Steps
o Grab Fusion https://lucidworks.com/download/
o Ingest some data
o Create a JavaScript pipeline stage and manipulate the data
o https://doc.lucidworks.com/fusion/latest/Indexing_Data/Custom-JavaScript-Indexing-Stages.html
o Attend a training
o Get support
Hi, I’m Andrew Oliver. My title is Technical Enablement Manager. I’m a Fusion and Solr junkie. I’ve ingested so much data that my laptop is totally full and now I need to start moving it all to the cloud. Today we’re going to talk about how to use the Fusion JavaScript index pipeline stage to manipulate data. We’ll go over some common cases and look at some code. This presentation is mainly for the data engineers and people who have to make this stuff work.
Before we get into the topic I’d like to quickly review that Lucidworks is a San Francisco-based company with offices around the world. We are the primary sponsor of the Apache Solr project, which powers search for some of the Internet’s largest sites and many of the world’s largest companies. Solr is the core of our product, Lucidworks Fusion.
Let’s review Lucidworks Fusion.
Lucidworks Fusion is a platform that includes a highly scalable search engine coupled with AI and machine learning functionality to give you the most relevant, personalized results. In addition we have Fusion App Studio, which automates and accelerates the tasks necessary to develop search applications. Meaning the world does not need someone to write another search box with type-ahead and suggestion functionality; just use App Studio, include it, and skin it.
Connect to your data wherever it lives with over 50 connectors including databases, intranets, network drives, SharePoint, CRM systems, support tickets, the public web, and the cloud.
Access your data your way with the tools you already know with REST APIs and endpoint, text search, analytics, and full SQL queries using familiar commands.
Your security model enforced end-to-end from ingest to search including role-based access controls for encryption, masking, and redaction at every level.
Multi-dimensional real-time ingestion including documents and data, key-value stores (NoSQL), relational databases (MySQL, Hadoop, JDBC) with graph capabilities to show relationships and detect anomalies.
Administration from one unified view for managing and monitoring performance and uptime with load balancing, failover and recovery, and multi-tenancy compatibility.
Personalized recommendations that aggregate user history and actions, and highlights items for exploration and discovery.
Machine learning models that are pre-tuned and ready for production add intelligence to your apps.
Powerful recommenders and classifiers for collaborative filtering and understanding intent.
Predictive search that suggests items and documents before a user even enters a query.
Full control over relevancy with simulated preview before going live - and of course rules for boosts and blocks.
Prototypes in hours, not weeks, with a modular library of UI components.
Fine-grained security fortified for industries across the Fortune 500 and government agencies.
Stateless architecture so apps are robust, easy to deploy, and highly scalable
Supports over 25 data platforms including Solr, SharePoint, Elasticsearch, Cloudera, Attivio, FAST, MongoDB, and many more - and of course Fusion Server
Full library of visualization components for charts, pivots, graphs and more
Pre-tested reusable modules include pagination, faceting, geospatial mapping, rich snippets, heatmaps, topic pages, and more.
Let’s get into the meat of the topic at hand. Ingestion and querying in Fusion is governed by pipelines.
Fusion’s ingestion process involves data going into a connector or rest endpoint, through a series of parsers for specific data shapes (like zip files or html or word docs). After data is parsed it is sent through an index pipeline which consists of stages. The last stage sends it to Solr. For developers that remember design patterns, this is the chain of responsibility pattern.
Likewise on the query side, we have a query pipeline that consists of a set of stages, the last of which sends the query to solr and retrieves the data.
Today we’re mostly going to talk about the index pipeline. You see here that I’ve ingested a series of articles from Wikipedia. I have a connector, a set of parsers, and I’ve expanded the index pipeline. It consists of three stages so far. On the right you see a simulated set of results. Fusion has an extensive library of pipeline stages that cover everything from renaming fields to mapping date types to entity extraction using Natural Language Processing techniques. Today we’re really going to talk about the JavaScript pipeline stage.
We’re not going to go over the query side of things much today but this is the query workbench and a series of query pipeline stages. Fusion comes with a library of pipeline stages from basic faceting to security trimming to boosting results based on what other users clicked on and advanced machine learning based search recommendations. There is also a Javascript stage on the query pipeline side of things, but we’re going to focus on the index pipeline side today.
Without further ado let’s look at the Javascript index pipeline stage
This is my pipeline for querying wikipedia cat pictures. I’ve used this in other webinars such as the site search in 1h that I did early this year. Like most pipeline stages you can have a condition which governs whether the stage executes at all. I find that a bit less important for the Javascript stage since I can basically include that in the script body anyhow. You can paste a script into the body or click “open editor” and edit it in a larger window.
For my cat pictures app, if you recall, I used a script to create a preview field from the content body. That’s the script I’m showing part of here.
So we mentioned that Fusion has a lot of pre-built pipeline stages that you can just configure and use to manipulate data. Why would you want to use the Javascript stage?
And this is a debate we have internally at Lucidworks too. Here is where I stand on this.
Prebuilt Pipeline stages are great for complex functionality like NLP entity extraction or machine learning classification or anything where configuration just makes a whole lot more sense than code.
And pipeline stages are great for common types of field transformations like date parsing. Or even where you’re just going to run a regex on one or a series of fields.
But if you’ve got a bunch of things you need to do in order to convert one field, then having a bunch of stages seems less optimal. Additionally, if you have a condition that governs a lot of different functionality or whether a series of things should be done, then I think using a JavaScript stage is a better solution. Moreover, a lot of companies have functionality they’re using elsewhere that is already in JavaScript or is more easily translated. In general, I don’t want to do anything that feels like programming via form fields. Simple configurable data transformations, yes...coding via form fields...not so much.
The core of what you’ll be working on in the JavaScript index stage is a PipelineDocument. Let’s look at its basic interface.
You can find the Javadoc for the pipeline document at the Lucidworks documentation site. It has a lot of different functions, but the basic ones you’ll use are for adding, removing, getting or setting fields. Some of the common ones are listed here.
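To make those calls concrete, here is a minimal sketch that strings a few of them together in one stage. The field names are made up for illustration, and removeFields is taken from the PipelineDocument javadoc, so double-check the method names against the Fusion version you’re running.
function (doc) {
  // replace whatever is in the field with a single value
  doc.setField('category_s', 'webinar');
  // append a value; on a multi-valued field this keeps any existing values
  doc.addField('tags_ss', 'javascript');
  // read the first value of a field
  var title = doc.getFirstFieldValue('title_t');
  // drop a field entirely if it exists (removeFields assumed from the javadoc)
  if (doc.hasField('draft_b')) {
    doc.removeFields('draft_b');
  }
  return doc;
}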
In the “body” you’re going to put an anonymous function. Let’s look at the basic forms of it.
This is the most common version, where you just want to manipulate a PipelineDocument and return the manipulated one.
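The code for that simplest form isn’t reproduced in this deck, so here is a rough sketch of it; the added field is purely illustrative.
function (doc) {
  // manipulate the PipelineDocument, then hand it to the next stage
  doc.addField('processed_b', 'true');
  return doc;
}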
Sometimes you need context, like whether the document is a “signal” (basically an event like a click or query, as opposed to normal data). Or you may want to pass key/value data to another pipeline stage. If so, you can inject the context.
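As a hedged sketch of the key/value idea: the put and get calls below are assumed from the Context javadoc linked on slide 25, so verify the exact method names for your Fusion version; the key and field names are made up.
function (doc, ctx) {
  // stash a value for a later stage in the same pipeline to pick up
  ctx.put('ingest_started', new Date().toISOString());
  // read it back (a later stage would normally do this)
  var started = ctx.get('ingest_started');
  doc.addField('ingest_started_dt', started);
  return doc;
}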
If you want to know the name of the collection you’re operating on, you can use this form of the function.
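That form’s code isn’t shown in this export either, so here is a minimal sketch; recording the collection name on the document is just an example.
function (doc, ctx, collection) {
  // 'collection' is the name of the collection this pipeline is feeding
  doc.addField('collection_s', collection);
  return doc;
}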
If you’re going to perform index or query operations from inside your pipeline stage you can have the solrServer component injected.
If you’re going to look up things in other collections or manipulate other collections from inside your pipeline, you can have the Solr client factory injected. This was renamed “solrClientFactory” from “solrServerFactory”, but in most of the documentation and examples it’s still shown as solrServerFactory. All of the function arguments are injected by position rather than by name, so you can call it solrClientFactory instead if you like. Heck, you can call it bob if you want to.
So let’s look at common sorts of things you can do with JavaScript. The idea is to give you some code recipes.
If you want to replace a field you can call doc.setField with the field name and the field value. If you want to add a value you can use addField. If the field is multi-valued, addField will add another value; if the field doesn’t exist yet, it will add a new one.
You may want to combine two fields or include conditionals. This shows a latitude being combined with a longitude into a new point field.
Sometimes you want to look through a set of fields. Here we get the field names, then iterate through them and get the values. This is sort of useless in itself; presumably we’d do more than log, but you get the idea.
Speaking of logging, we can do info, error, debug… What shows up in the log in terms of level is configurable. You’re wondering where these will show up by default...here’s where the log messages are emitted: var/log/connectors/connectors.log.
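As a quick sketch of the logging levels (logger is the variable Fusion injects into the stage; the messages are illustrative):
function (doc) {
  // which of these reach var/log/connectors/connectors.log depends on the configured level
  logger.debug('processing ' + doc.getId());
  logger.info('number of fields: ' + doc.getFieldNames().size());
  logger.error('something went wrong for ' + doc.getId());
  return doc;
}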
In case you missed my webinar about the cat pictures: above you see that I’ve taken a body_t field that is parsed from a Wikipedia page. I then create a field called preview_t. I grab the value of body_t and operate on it with a regex which ditches the newlines and tabs. Next I trim the field to 500 characters and store it in the preview_t field. Frankly this is a very simple “preview”; I could also parse the HTML and make sure I don’t pull in any header information, or grab specific parts of the article, but this is good enough for a demo!
Fusion’s parsers generally do a good job of taking a file and turning it into multiple documents. However, sometimes you need to grab bits and pieces and create new documents. This “busts up” a document and creates a new set of documents. Note that in this case we’re returning a collection of documents instead of just one.
This is what you’d need to build and maintain if you want an Intelligent Search Application