While a fully nuanced search implementation takes time, getting basic data ingestion, schema design, and critical-path insights does not have to be a painful experience.
This talk uses several real-life datasets (from the Data is Plural mailing list) and shows a "Rapid Application Development"-style workflow to get the data into Solr and shape it for initial searchability and relevancy analysis.
Different stages of content ingestion, pre-processing, analysis, and querying are explained, together with the trade-offs of different built-in approaches. Relevant documentation links are also included for more in-depth research.
This talk is for everybody who wants Search in their development stack but is not sure where to start.
Searching for AI – Leveraging Solr for classic Artificial Intelligence tasks (Alexandre Rafalovitch)
Apache Solr has always been built on a strong Information Retrieval/Natural Language Processing foundation, and recent versions have added even more Artificial Intelligence features, techniques, and integrations to Solr.
This presentation covers some classic (and hidden-gem) AI elements that Solr has supported for a long time, as well as the most recent features that are not yet fully documented.
The presentation was made with references to Solr 7.4.
Apache Solr is a search engine that can scale from a personal project to a multi-terabyte cloud-hosted cluster. At the same time, this ability to scale, tune, and adjust to clients' needs can make it hard to identify which aspects of Solr to bring to a particular problem.
In this session, Alexandre Rafalovitch (an Apache Solr committer) will do a speed run demonstrating how to create and tune a Solr 7.3 instance for a hypothetical Corporate Phone Directory application. It will cover:
*) The smallest learning schema/configuration required
*) Rapid schema evolution workflow
*) Dealing with multiple languages
*) Dealing with misspellings in search
*) Searching phone numbers
Presented at the Solr meetup in Montreal in May 2018.
Backing GitHub repository: https://github.com/arafalov/solr-presentation-2018-may
Overview of Solr 6.2 examples, including the features they have and the challenges they present. A contrasting demonstration of a minimal viable example. A step-by-step deconstruction of the "films" example to show which parts of the shipped examples are not actually needed.
Mastering Solr gives an introduction to the more advanced use of Solr. It offers a starting point to go beyond the possibilities offered by the Solr integration modules.
The following subjects will be covered:
- Getting Solr running locally
- Integration modules: Search API vs ApacheSolr
- Configuring an index
- Search relevancy: basic boosting of search results.
- The anatomy of a request to Solr
- Hooks for altering/adding parameters to the request to Solr
- Search relevancy: advanced date boosting by boost functions
- Search relevancy: advanced boosting by boost queries
This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Introduction to Solr. A brief introduction to Solr for anyone who wants to get trained on Solr.
1. Introduction to Solr
2. Solr Terminologies
3. Installation and Configuration
4. Configuration files schema.xml and solrconfig.xml
5. Features of Solr
a. Hit Highlighting
b. Auto Complete / Suggester
c. Stop Words
d. Synonyms
e. SpellCheck
f. Geospatial Search
g. Result Grouping
h. Query Syntax
i. Query Boosting
j. Content Spotlighting / Merchandising / Banner / Elevate
k. Block Record / Remove URL Feature
6. Indexing the Data
7. Search Queries
8. DataImportHandler - DIH
9. Plugins to index various types of Data (XML, CSV, DB, Filesystem)
10. Solr Client APIs
11. Overview of SOLRJ API
12. Running Solr on Tomcat
13. Enabling SSL on Solr
14. Zookeeper Configuration
15. Solr Cloud Deployment
16. Production Indexing Architecture
17. Production Serving Architecture
18. Upgrading Solr
19. References
Steve will show how and why to use Solr’s new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically add previously unseen fields to the schema.
Solr is a highly scalable and fast open source enterprise search platform from the Apache Lucene project. Let's explore why some of the largest Internet sites in the world choose it for its many exciting features.
After a thorough overview of the main features and benefits of Apache Solr (an open source search server), the architecture of Solr and strategies to adopt it for your PHP application and data model will be presented. The main lessons learned around dealing with a mix of structured and non-structured content, multilingual aspects, tuning, and the various state-of-the-art features of Solr will be shared as well.
An Introduction to Basics of Search and Relevancy with Apache Solr (Lucidworks, archived)
The open source Apache Solr search engine provides powerful, versatile search application development technology so you can take full control of your search needs. Solr's rich interfaces, convenient server packaging of the underlying Apache Lucene search libraries into web service interfaces, and near-limitless customizability let you take control of your search. From e-commerce to content management and endless variations in between, Solr is the right tool at the right time to turn an ever-growing volume and variety of data and documents to the advantage of your business.
http://www.lucidimagination.com/blog/2009/12/01/webinar-an-introduction-to-basics-of-search-and-relevancy-with-apache-solr/
From content to search: speed-dating Apache Solr (ApacheCON 2018)
1. APACHECON North America Sept. 24-27, 2018
From content to search:
Speed-dating Apache Solr
Alexandre Rafalovitch
Apache Solr Committer
Apache Solr Popularizer
@arafalov
2. Apache Solr search engine features
➢ Advanced full-text search – very, very advanced!
➢ Facets and aggregations
➢ SolrCloud for scale
➢ Extremely configurable
  – Config files
  – Extension points
  – Plugins
➢ Near Real-Time indexing
➢ Multiple input formats: XML, JSON, CSV, GeoJSON
➢ Multiple output formats: XML, JSON, CSV, GeoJSON, PHP, Ruby, Python, XLSX…
➢ Multiple client options: SolrJ, JavaScript, Python, PHP, R, Ruby…
➢ Integrates Apache Lucene, Tika, Log4J, Xerces, ZooKeeper, Velocity, OpenNLP
➢ Integrates IBM ICU4J (Unicode), Carrot2 (document clustering)
➢ Has a web-based Admin UI built on public APIs
➢ Can also do Graphs, Machine Learning, SQL, Linear Algebra…
➢ The little search engine that could!
3. Learning Solr
➢ Solr Reference Guide
  – 1250(!) pages of goodness
  – Important! The Solr wiki is no longer the primary source
➢ Ships with many examples (with explanations in comments):
  – techproducts
  – cloud
  – schemaless
  – dih (DataImportHandler – 5 examples)
  – films
  – files
➢ Solr users mailing list, IRC channel, StackOverflow
➢ Videos on YouTube from Lucene/Solr Revolution (Activate, as of 2018)
➢ http://www.solr-start.com/ – my reference website
4. Can be a bit too much
➢ Solr Reference Guide is feature-organized
  – Some powerful features are hard to locate
➢ Examples are a bit of a kitchen sink
➢ Printed books cannot keep up
➢ Many examples use artificial, tuned datasets
➢ Several ways to do similar things
5. Today – we start to solve that
➢ Use the smallest learning configuration
➢ Latest cool features
➢ End-to-end
➢ Evolve from the queries backwards
➢ Real-life example(s)
  – Use “Data is Plural” – a newsletter of useful/curious real-life datasets
  – https://tinyletter.com/data-is-plural
  – Sent by Jeremy Singer-Vine (https://www.jsvine.com/)
➢ Just a taste – but a good one
  – Enough for you to understand how to keep learning by yourself
  – Repo: https://github.com/arafalov/solr-apachecon2018-presentation
6. Learning setup
➢ Minimal setup
  – 1 server
  – 1 collection/core (per example)
  – Minimal schema (no extra types)
  – Minimal configuration (all defaults, no caches, etc.)
➢ Larger production setup
  – Multiple servers and ZooKeepers (SolrCloud)
  – Multiple collections with shards, replicas
  – Full configuration (see Reference Guide)
7. Rapid (Search) Application Development
10 Get the Solr server running
20 Create new core with custom config
30 Index some data
40 Run some queries/facets/aggregations/graph analysis/...
50 IF NOT HAPPY
60   Identify improvement
70   Evolve schema/configuration
80   Re/Index data
90   GOTO 40
100 ENDIF
8. Setting up our own server
➢ Shipped examples are preconfigured
  – Magic location
  – Magic config files (in server/solr/configsets/<templatedir>/conf)
  – Hard to do multiple examples
  – Confusing to grow to production later
➢ For clarity, let’s set up our own server
  – Create a server root directory (e.g. sroot)
  – Copy solr.xml to sroot (from <solr install>/server/solr)
  – bin/solr -s <path to sroot>
    ● All bin commands are in <solr install>/bin/
    ● Default Solr port is 8983; other ports are offset from it
9. Minimal learning config
➢ Absolute minimum – 2 files
  – managed-schema (an XML file)
    ● Definition of fields, field types, unique keys
    ● Can be managed by API directly (removes comments on use)
  – solrconfig.xml (also an XML file)
    ● Endpoints, default parameters, caches, pre-processing pipelines, ...
    ● A lot of system-admin and production config
    ● Cannot be modified by API directly, but there are overlays and request-parameters APIs
➢ We will also use:
  – params.json
    ● Supplements solrconfig.xml
    ● Helps to manage request parameter defaults with the API
➢ Get them: https://github.com/arafalov/solr-apachecon2018-presentation/tree/master/configsets/minimal
10. Create Collection/Core
➢ bin/solr create -c dip -d .../minimal
  – Collection/core name: dip
  – Our 3 config files are in .../minimal
  – Uses the API to talk to the server
➢ Creates full structure under sroot/dip
  – Exception: logs go (by default) to <solr install>/server/logs
➢ For SolrCloud, configs are in ZooKeeper!
11. Dataset – “Data is Plural” itself
➢ Google sheet linked from https://tinyletter.com/data-is-plural
➢ Save as .tsv (tab-separated values)
  – Mine goes until 12 Sept 2018
➢ Solr can index .csv, and also .tsv with help
  – https://lucene.apache.org/solr/guide/7_4/post-tool.html#indexing-csv
  – https://lucene.apache.org/solr/guide/7_4/uploading-data-with-index-handlers.html#csv-formatted-index-updates
12. Index the content
➢ bin/post -c dip -params "separator=%09" -type "text/csv" .../dip-data.tsv
➢ Fails, as the mandatory uniqueKey field id is missing
➢ Let’s use rowid as the unique id
➢ bin/post -c dip -params "separator=%09&rowid=id" -type "text/csv" .../dip-data.tsv
➢ AND BOOM – we are ready for SEARCH
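The rowid trick above asks the CSV handler to synthesize the uniqueKey from the row number. As a rough illustration of the mapping that produces, here is a small Python sketch over a hypothetical two-row slice of the TSV (the column names are assumptions, not the real “Data is Plural” header):

```python
import csv
import io

# Hypothetical miniature of dip-data.tsv; the real file has more columns.
tsv = (
    "edition\theadline\n"
    "2018.09.12\tExample dataset one\n"
    "2018.09.05\tExample dataset two\n"
)

# Mimic what rowid=id asks Solr's CSV handler to do: synthesize a
# uniqueKey "id" from the row number for rows that lack one.
docs = []
for rownum, row in enumerate(csv.DictReader(io.StringIO(tsv), delimiter="\t")):
    row["id"] = str(rownum)  # rowid-style synthetic unique key
    docs.append(row)

print(docs[0])
```

Each resulting dict corresponds to one Solr document; with a real id column in the data, the rowid parameter would be unnecessary.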
16. Output formats (wt)
➢ JSON by default: http://localhost:8983/solr/dip/select?q=*:*
➢ Ruby: http://localhost:8983/solr/dip/select?q=*:*&wt=ruby
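Because the response writer is selected by an ordinary request parameter, such URLs can be assembled programmatically. A stdlib-only Python sketch (the dip core name comes from the earlier slides; nothing here actually contacts a server):

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/dip/select"  # 'dip' core from the earlier slides

def select_url(query, **params):
    # wt switches Solr's response writer: json (the 7.x default), xml, csv, ruby, ...
    return BASE + "?" + urlencode({"q": query, **params})

print(select_url("*:*"))             # default JSON response
print(select_url("*:*", wt="ruby"))  # Ruby-formatted response
```

Using urlencode also takes care of escaping characters like `*` and `:` in the query string.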
17. Basic searches
➢ Case-insensitive search (why?)
  – Search for ‘data’ in the q input box
  – Search for ‘DATA’ instead
  – Same 461 results
  – What defines case sensitivity?
➢ Inclusion and exclusion
  – ‘data news’ – 469 results
  – ‘+data +news’ – 46 results
  – ‘+data -news’ – 415 results
➢ What defines search syntax?
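The ‘+’/‘-’ behaviour above comes from Lucene query syntax: bare terms are optional, ‘+’ makes a term required, ‘-’ prohibits it. A toy Python model of those semantics (an illustration only, not Solr’s actual parser; real analysis, scoring, and phrase handling are far richer):

```python
def matches(doc_terms, query):
    """Toy model of Lucene-style +term / -term semantics over a set of terms.
    A doc matches if no prohibited term is present, every required term is
    present, and (when there are no required terms) at least one optional
    term is present."""
    tokens = query.split()
    required = {t[1:] for t in tokens if t.startswith("+")}
    prohibited = {t[1:] for t in tokens if t.startswith("-")}
    optional = {t for t in tokens if t[0] not in "+-"}
    if prohibited & doc_terms:
        return False
    if not required <= doc_terms:
        return False
    if not required and optional:
        return bool(optional & doc_terms)
    return True

doc = {"data", "journalism"}
print(matches(doc, "data news"))    # optional terms: one hit is enough -> True
print(matches(doc, "+data +news"))  # 'news' required but missing -> False
print(matches(doc, "+data -news"))  # 'data' present, 'news' absent -> True
```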
18. Beyond search – Facets
http://localhost:8983/solr/dip/select?facet.field=edition&facet=on&q=*:*&rows=1
Facets – Group by year - GET
➢ What about grouping by year?
➢ http://localhost:8983/solr/dip/select
?facet=on&q=*:*&rows=0
&facet.query={!prefix f=edition}2014
&facet.query={!prefix f=edition}2015
&facet.query={!prefix f=edition}2016
&facet.query={!prefix f=edition}2017
&facet.query={!prefix f=edition}2018
➢ A bit beyond what the Admin UI lets you enter right now
➢ A bit painful as a GET
➢ Can also be done as a POST
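Building that repeated-parameter GET programmatically takes some of the pain away; a sketch using only Python's standard library (collection name and years as in the slide):

```python
from urllib.parse import urlencode

# Repeated facet.query keys are passed as a list of (key, value) pairs
params = [("q", "*:*"), ("rows", "0"), ("facet", "on")]
params += [("facet.query", "{!prefix f=edition}%d" % year)
           for year in range(2014, 2019)]

url = "http://localhost:8983/solr/dip/select?" + urlencode(params)
print(url)
```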
Facet – Group by year - POST
➢ Easier with Postman (https://www.getpostman.com/)
➢ Can save requests, create collections, publish
Quick review
➢ We did a basic index of a real dataset
➢ We did some simple searches
– Case-insensitive
– Basic facets
– Less-basic facets
➢ Already useful – and just the tip of Solr
➢ Next: let’s understand WHY this works
Minimal learning config
➢ Absolute minimum – 2 files
– managed-schema (an XML file)
● Definition of fields, field types, unique keys
● Can be managed by API directly (removes comments on use)
– solrconfig.xml (also an XML file)
● Endpoints, default parameters, caches, pre-processing pipelines, ...
● A lot of system-admin and production config
● Cannot be modified by API directly, but there are overlays and request parameters APIs
➢ We will also use:
– params.json
● supplements solrconfig.xml
● helps to manage request parameter defaults with an API
➢ Get them:
https://github.com/arafalov/solr-apachecon2018-presentation/tree/master/configsets/minimal
Minimal config and params
➢ Minimal solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>7.4.0</luceneMatchVersion>
  <requestHandler name="/select"
                  class="solr.SearchHandler" useParams="SELECT" />
</config>
➢ Minimal matching params.json
{"params":{
  "SELECT":{
    "df":"_text_",
    "echoParams":"all"
}}}
➢ Relies on a lot of defaults; check them with the Config API:
http://localhost:8983/solr/<corename>/config?expandParams=true
➢ https://lucene.apache.org/solr/guide/7_4/config-api.html
Schema definition hierarchy
➢ Field type CLASS
– Underlying representation
➢ Field Type
– No built-in types
– Configuration
– Defaults
– Analyzers (if supported)
➢ Field definition
– Concrete configuration
➢ Popular field type classes
– StrField
– TextField
– SortableTextField
– IntPointField
– DatePointField
➢ Other classes
– BinaryField
– BoolField
– CollationField
– CurrencyFieldType
– DateRangeField
– DoublePointField
– ExternalFileField
– EnumFieldType
– FloatPointField
– ICUCollationField
– LongPointField
– PointType
– PreAnalyzedField
– RandomSortField
– SpatialRecursivePrefixTreeFieldType
– UUIDField
managed-schema – field types
<fieldType name="string" class="solr.StrField"
           sortMissingLast="true" docValues="true"/>
➢ Name: string – convention, not compulsory
➢ Class: solr.StrField – keep the string as is, no processing
➢ sortMissingLast – what to do with missing values
➢ docValues – enable the column store, useful for faceting, sorting, grouping
➢ Additional configuration will be done on fields using this type; values set here (except name/class) can also be overridden there
➢ No analyzer magic for this (solr.StrField) type
managed-schema – field types
<fieldType name="text_basic" class="solr.SortableTextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
➢ Name: text_basic – whatever you want
➢ Class: solr.SortableTextField – new type, useful for both search and faceting. Implicitly uses docValues
➢ positionIncrementGap – just keep this, a bit of an internal thing
➢ Basic combined index/query analyzer chain
➢ Tokenizer: StandardTokenizerFactory – generic, splits on whitespace and punctuation
➢ (Token) Filter: LowerCaseFilterFactory – lowercases every token during index and search => case-insensitive match
➢ Many more interesting analyzer chains are in the examples
➢ You can (and should) craft your own for specialty use-cases
Understanding analyzer chains
...
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
...
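A rough approximation of this chain in Python (the real StandardTokenizer follows Unicode word-break rules; the regex below is a simplification for illustration only):

```python
import re

def analyze(text):
    # Very rough stand-in for StandardTokenizer + LowerCaseFilter:
    # split on anything that is not a letter or digit, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

print(analyze("Data is Plural: 12 Sept 2018!"))
# → ['data', 'is', 'plural', '12', 'sept', '2018']
```

Because the same chain runs at both index and query time, ‘data’ and ‘DATA’ produce the same tokens, which is exactly why the earlier searches returned the same 461 results.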
managed-schema - fields
➢ <field name="id" type="string" required="true"
       indexed="true" stored="true" />
– The field name is id – we refer to this later, in uniqueKey
– Its type is string – as we defined
– It is required – that’s quite rare actually
– It is indexed – means we can search on it
– It is stored
● Means we will return it to the user
● Important for ID, a decision point for the rest
➢ <uniqueKey>id</uniqueKey>
managed-schema - fields
<field name="_text_" type="text_basic"
       multiValued="true" indexed="true"
       stored="false" docValues="false"/>
➢ Name: _text_ – underscores to show special usage in our config
➢ Type: text_basic – defined later
➢ multiValued – can take multiple values
➢ indexed – we can search against it, using text_basic rules
➢ (Not) stored – will not show up in the results
➢ (Disabled) docValues – overrides the type definition
Dynamic fields
➢ <dynamicField name="*" type="text_basic"
       indexed="true" stored="true"/>
– Dynamic field – matches any incoming field name not matched explicitly by other definitions – a magic fallback
– Name: "*" – a pattern, means any left-over field name
● Could also be *_str, random_*
– stored – means the user will see the fields
– indexed – means we can search against them
– Type: text_basic – means we do the tokenization and analysis
copyField
<copyField source="*" dest="_text_"/>
➢ Duplicates fields for different treatment
➢ Here:
– copy ALL (*) fields (all dynamic fields AND the id field)
– to the field _text_
➢ Remember: _text_
– is searchable (we even disabled docValues)
– is not stored (so the user does not see this duplicated content)
– is multiValued (as it gets each field’s content as a separate value)
➢ Important! Copied content is searched by the rules of the destination field… (e.g. dates copied to a text field)
➢ Why do we need this duplication?
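The copyField semantics can be sketched in Python (field names are illustrative; this only shows how each source field contributes one value to the multiValued destination):

```python
def build_text_field(doc):
    # Sketch of <copyField source="*" dest="_text_"/>: every field's
    # content becomes a separate value of the multiValued _text_ field.
    return [str(v) for v in doc.values()]

doc = {"id": "42", "edition": "2018.09.12", "headline": "Example"}
print(build_text_field(doc))  # → ['42', '2018.09.12', 'Example']
```

The payoff: a single default search field that covers everything, while the original fields keep their own (string, date, …) handling.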
Query path
➢ What happens when we search “+data -news”?
➢ Enable the debugQuery flag to find out
– http://localhost:8983/solr/dip/select?debugQuery=on&q=+data -news
– "parsedquery":"+_text_:data -_text_:news"
➢ Solr has a query parser, which has parameters
– In solrconfig.xml, we defined the /select endpoint
– It declared useParams=SELECT
– In params.json, we set the df parameter to _text_
– For the default (lucene) query parser, df is the Default Field parameter
➢ Other query parsers have very different expectations
– eDisMax is popular and has much greater searched-fields control
– Also: https://lucene.apache.org/solr/guide/7_4/other-parsers.html
Evolving the schema
➢ Let’s make links multi-valued strings, split on whitespace while ingesting
➢ In managed-schema, we still have just id and _text_ as explicit fields
➢ Can do it in the Admin UI
– Delete links (not strictly needed for a dynamic field)
– Create an explicit definition: stored, indexed, multiValued, using the string field type
– This will rewrite managed-schema
➢ Need to reindex, sometimes delete/reindex
– docValues changes usually require delete/reindex
– Remember! Have a primary source
Deleting content
➢ bin/post -c dip -d $"<delete><query>*:*</query></delete>"
➢ Can also do it in Admin UI/Documents
➢ Can use (Solr-specific) XML or JSON, delete by ID or by a specific query
➢ https://lucene.apache.org/solr/guide/7_4/uploading-data-with-index-handlers.html
Reindexing
➢ bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20" -type "text/csv" .../dip-data.tsv
➢ New parameters
– f.links.split=true
– f.links.separator=%20
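What f.links.split does to each cell of the links column can be sketched as (illustration only; Solr's CSV loader performs the actual split):

```python
def split_field(value, separator=" "):
    # Sketch of f.links.split=true with f.links.separator=%20 (a space):
    # one CSV cell becomes multiple values for the multiValued field.
    return [v for v in value.split(separator) if v]

print(split_field("http://a.example http://b.example"))
# → ['http://a.example', 'http://b.example']
```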
Next challenge – links count
➢ Let’s find records with 3+ links
➢ Quite hard, because we search from an index (dictionary) of tokens
➢ Correct Solr reasoning:
– Shape the data for search
– Solr is not a (storage-oriented) database
– Duplicate and index in multiple ways
– Pre-calculate what you will search
– De-normalize if needed
Update Request Processors
➢ Update Request Processor (URP)
– Runs after the format (XML, JSON, etc.) parser
– Runs before the schema is involved
– Multiple can run in a chain
➢ Many (legacy and new) ways to configure
– Documentation (rather hard to discover): https://lucene.apache.org/solr/guide/7_4/update-request-processors.html
– Also: http://www.solr-start.com/info/update-request-processors/
➢ Example of a complex chain: Solr’s “schemaless” type-guessing mode
Counting links
➢ To count links:
– Copy to another field (not yet involving the schema) – CloneFieldUpdateProcessorFactory
– Replace the links themselves with their count – CountFieldValuesUpdateProcessorFactory
➢ Defining URPs
– Some are implicit, but not ours
– Can define in solrconfig.xml (edit, reload core)
– Can use the Config API to define an overlay (in configoverlay.json)
● https://lucene.apache.org/solr/guide/7_4/config-api.html
➢ Can use curl, Postman, or hack the Admin UI/Documents screen
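The effect of the two processors can be sketched in plain Python (cloneLinksToCount and countLinks are the chain names used in this talk; the functions below only illustrate what the two factories do to a document, they are not Solr code):

```python
def clone_field(doc, source, dest):
    # Sketch of CloneFieldUpdateProcessorFactory: copy the values
    doc[dest] = list(doc[source])
    return doc

def count_field_values(doc, field):
    # Sketch of CountFieldValuesUpdateProcessorFactory:
    # replace the field's values with how many there were
    doc[field] = len(doc[field])
    return doc

doc = {"links": ["http://a", "http://b", "http://c"]}
doc = clone_field(doc, "links", "linksCount")
doc = count_field_values(doc, "linksCount")
print(doc)  # → {'links': ['http://a', 'http://b', 'http://c'], 'linksCount': 3}
```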
Hacking the Admin UI for the Config API
➢ Define two URPs
➢ Cannot define a chain via the Config API
➢ Notice: we changed the request handler
➢ Notice: the JSON has duplicate key names (Ugh!)
➢ Once submitted, we have a new configoverlay.json file in the core’s config directory
➢ Can check on the Admin UI Files screen
Evolving the schema for linksCount
➢ The URP chain will generate a linksCount field
➢ By default, it would be text_basic
➢ Let’s make it an integer – better search
➢ Need to add a field type; there is no UI for that
➢ Can edit the file manually and reload the core
➢ Or can use the Schema API: https://lucene.apache.org/solr/guide/7_4/schema-api.html
➢ Again, we can use an Admin UI hack (/schema)
➢ Once the field type (pint) is defined, use the Admin UI to define linksCount as a pint with a default of 0
➢ As it is a new field, no need to delete the index – can just reindex
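The Schema API request could look something like this (a sketch, POSTed to http://localhost:8983/solr/dip/schema; field and type names match the talk, but consult the Schema API documentation for the exact options):

```json
{
  "add-field-type": {
    "name": "pint",
    "class": "solr.IntPointField",
    "docValues": true
  },
  "add-field": {
    "name": "linksCount",
    "type": "pint",
    "default": "0",
    "indexed": true,
    "stored": true
  }
}
```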
Reindex with URPs
➢ bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20&processor=cloneLinksToCount,countLinks" -type "text/csv" .../dip-data.tsv
➢ New parameter
– processor=cloneLinksToCount,countLinks
➢ But that’s getting too long
➢ Can we do something about it?
– Hint: params.json
Dealing with solrconfig.xml
➢ Can modify solrconfig.xml by hand and reload the core (legacy method)
➢ Can use the Config API to create new entries – useful, but overkill for changing parameters
➢ Can use the Request Parameters API with predefined or explicit useParams
– https://lucene.apache.org/solr/guide/7_4/request-parameters-api.html
– the /config/params endpoint
● Takes JSON with set/unset/delete keys
● Notice the different escaping for tab and space
➢ Once done, we can refer to it:
– bin/post -c dip -params "useParams=DIP_INDEX" -type "text/csv" .../dip-data.tsv
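The DIP_INDEX paramset could be created with JSON along these lines, POSTed to /solr/dip/config/params (a sketch; note the escaping point from above – here the tab and space are JSON-escaped characters, not the %09/%20 URL escapes used on the command line):

```json
{
  "set": {
    "DIP_INDEX": {
      "separator": "\t",
      "rowid": "id",
      "f.links.split": true,
      "f.links.separator": " ",
      "processor": "cloneLinksToCount,countLinks"
    }
  }
}
```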
Search for many links
http://localhost:8983/solr/dip/select?q=linksCount:[10 TO *]&sort=linksCount desc
Everybody likes dogs and cats
➢ Now that we have the basics
➢ Let’s choose another dataset
➢ How about: +dogs +australia
➢ Export the Sunshine Coast dataset as XML
Reviewing the dataset format
➢ The original XML is hard to read
➢ Best to reformat it (I use XMLStarlet)
➢ Definitely not Solr’s own format
➢ Can preprocess with XSLT (outside or inside Solr)
➢ Can import with the Data Import Handler (DIH)
– Custom field mapping
– Can do data merges, nested requests
– Has an Admin UI screen
– Good for quick analysis
– Not production quality
Analyzing the data needs
➢ Records are at response/row/row
➢ _id is an attribute, the rest are elements
➢ animaltype can be “d” and “Cat” – normalize to dog/cat
➢ Some elements have extra space – trim
➢ age should be a number
➢ primarycolour can have mixed colours – make it possible to search on a sub-colour?
➢ Lowercase everything
Defining the pipeline
➢ Data Import Handler (DIH)
– Parse XML and map to field names
– Rename _id to id and de_sexed to desexed
➢ Update Request Processors
– Trim extra space on everything
– Normalize D to Dog
➢ Schema definition
– Default dynamic field – as before (lowercase!)
– Explicit int field for age
– Copy primarycolour into a secondary field and split
➢ Can evolve one step at a time, starting from DIH
– We are totally out of time for that today!
– Homework (hint – enable DIH in solrconfig.xml first, then iterate)
➢ Copy the minimal directory to pets-final to start the new configset
Pets managed-schema (age)
➢ Modify managed-schema
➢ Keep all the definitions there from the minimal configset
➢ Add the new parts anywhere in the file
➢ Field type pint and a field definition for age:
<fieldType name="pint" class="solr.IntPointField"
           docValues="true"/>
<field name="age" type="pint" default="0"
       indexed="true" stored="true"/>
Pets managed-schema (splitcolour type)
Define a new field type to extract partial colour names (e.g. BlackWhite => Black, White)
<fieldType name="splitcolour_type" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
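The index-time analysis can be approximated in Python (a rough sketch of splitOnCaseChange plus preserveOriginal on a keyword token, not the real WordDelimiterFilter, which handles many more cases):

```python
import re

def index_analyze(colour):
    # Rough sketch of KeywordTokenizer + WordDelimiterFilter
    # (splitOnCaseChange=1, preserveOriginal=1) + LowerCaseFilter
    parts = re.findall(r"[A-Z][a-z]*|[a-z]+", colour)
    tokens = [colour] + parts if len(parts) > 1 else [colour]
    return [t.lower() for t in tokens]

print(index_analyze("BlackWhite"))  # → ['blackwhite', 'black', 'white']
```

The query-side chain deliberately does not split, so a query for ‘black’ matches only because the index side emitted the sub-colour tokens.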
Pets managed-schema (splitcolour field)
➢ Define a new field splitcolour and copy from primarycolour to it
<field name="splitcolour" type="splitcolour_type"
       indexed="true" stored="false"/>
<copyField source="primarycolour"
           dest="splitcolour"/>
➢ Notice:
– we have to search against splitcolour explicitly
– it is not returned to the user (stored=false)
Reference DIH example
➢ Use the DIH/atom example for reference (one of 5)
➢ bin/solr start -e dih -p 9999 (avoid port conflict)
– http://localhost:9999/solr/#/atom/dataimport//dataimport
➢ bin/solr stop -p 9999 (when done)
➢ <solr_install>/example/example-DIH/solr/atom/conf/solrconfig.xml
– The order of elements is important in solrconfig.xml
– Notice the library requirement
– Request handler definition for /dataimport
– config => atom-data-config.xml (the DIH configuration)
– Both params and the Trim URP are right in solrconfig.xml (instead of params.json and configoverlay.json) – less flexible, though can still be overridden
➢ atom-data-config.xml
– https://lucene.apache.org/solr/guide/7_4/uploading-structured-data-store-data-with-the-data-import-handler.html
– Loads XML from a URL feed, not a file (both are valid for URLDataSource)
– Shows Transformers (we will stick to URPs)
Pets – create core
➢ bin/solr create -c pets -d …/pets-final
– Creates the core in the running server with our custom configset
– Active config files are now in <solr root>/pets/conf
– Hack: you can edit these files and reload the core
– For more options: bin/solr create_core -h
➢ bin/solr delete -c pets
– If you want to update the configset and recreate the core
– Will delete modifications made while running
Basic search
http://localhost:8983/solr/pets/select?facet.field=primarycolour&facet.mincount=1&facet=on&q=poodle&rows=1
➢ Case-insensitive search (poodle)
➢ Return only the first record of the 1720 found (of 56676 total)
➢ Show facets on the primarycolour field
➢ Do not return facets with 0 matched documents
Merle puppy
➢ Merle is a colour
➢ There are several types of Merle
➢ Can we figure out the types and their popularity using the splitcolour field?
➢ q=splitcolour:merle
➢ rows=0
➢ facet=on
➢ facet.field=primarycolour
➢ facet.mincount=1
Photo by Ted Van Pelt (Flickr, CC BY 2.0): https://www.flickr.com/photos/bantam10/5580029980/
Advanced analytics
➢ Basic queries and facets are good to start
➢ Recent Solr supports:
– JSON Request API: https://lucene.apache.org/solr/guide/7_4/json-request-api.html
– JSON Query DSL: https://lucene.apache.org/solr/guide/7_4/json-query-dsl.html
– JSON Facet API: https://lucene.apache.org/solr/guide/7_4/json-facet-api.html
– Allows multi-level facets, analytics, (the start of) semantic knowledge graphs
➢ Can do very complex queries
Complex JSON query
{
  query: "splitcolour:gray",
  filter: "age:[0 TO 20]",
  limit: 2,
  facet: {
    type: {
      type: terms,
      field: animaltype,
      facet: {
        avg_age: "avg(age)",
        breed: {
          type: terms,
          field: specificbreed,
          limit: 3,
          facet: {
            avg_age: "avg(age)",
            ages: {
              type: range,
              field: age,
              start: 0,
              end: 20,
              gap: 5
            }}}}}}}
➢ For all animals with a variation of the gray colour
➢ Limited to those of age between 0 and 20 (to avoid dirty-data docs)
➢ Show the first two records and the facets
➢ Facet them by animal type (Cat/Dog)
– Then by breed (top 3 only)
● Then show counts for 5-year brackets
➢ On all levels, show bucket counts
➢ On the bottom 2 levels, show the average age
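The same request can be built as a Python dict and serialized to strict JSON before POSTing it to the JSON Request API (a sketch; no server call is made here, and the slide's relaxed bare-key style is replaced with strict JSON):

```python
import json

# The nested facet request from the slide, as a plain dict
request = {
    "query": "splitcolour:gray",
    "filter": "age:[0 TO 20]",
    "limit": 2,
    "facet": {
        "type": {
            "type": "terms",
            "field": "animaltype",
            "facet": {
                "avg_age": "avg(age)",
                "breed": {
                    "type": "terms",
                    "field": "specificbreed",
                    "limit": 3,
                    "facet": {
                        "avg_age": "avg(age)",
                        "ages": {"type": "range", "field": "age",
                                 "start": 0, "end": 20, "gap": 5},
                    },
                },
            },
        }
    },
}

body = json.dumps(request, indent=2)
print(body[:60])
```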