This document provides an introduction and overview of Solr and its integration with AEM. It discusses search statistics to motivate the need for search. It then defines Solr and describes its key features and architecture. It covers topics like indexing, analysis, searching, cores, configuration files and queries. It also discusses setting up Solr on Linux and Windows. Finally, it discusses integrating Solr with AEM, including configuring an embedded Solr server and external Solr integration using a custom replication agent. Exercises are provided to allow hands-on experience with Solr functionality.
AEM (CQ) Dispatcher Security and CDN+Browser Caching (Andrew Khoury)
This presentation covers Adobe AEM Dispatcher security and CDN and browser caching.
This presentation is the second part of a webinar on AEM Dispatcher:
http://dev.day.com/content/ddc/en/gems/dispatcher-caching---new-features-and-optimizations.html
Visit the URL above to view the whole presentation. Dominique Pfister, the primary engineer developing AEM Dispatcher, covers the first part on new features.
Hardening Your CI/CD Pipelines with GitOps and Continuous Security (Weaveworks)
Join us for a webinar on how to secure your CI/CD pipeline for Kubernetes with GitOps best practices and continuous runtime protection. As modern developers and DevOps teams are embarking on a quest for speed and reliability through automated CI/CD pipelines for Kubernetes, enterprises still need to ensure security and regulatory compliance.
Together with Deepfence, the Weaveworks team will explain and demonstrate how GitOps continuous delivery pipelines, combined with continuous security observability, improves the overall security of your development workflow - from Git to production.
In this webinar we will demonstrate:
Deepfence container scanning
Git-to-Kubernetes using FluxCD
Deepfence continuous runtime security
Helm version 3 was recently released with new features and a new architecture to support those features. The changes to Helm and charts were based on feedback, changes to Kubernetes, and lessons learned in the past couple years.
Multi Site Manager (MSM) enables you to easily manage multiple web sites that share common content. MSM lets you define relations between the sites so that content changes in one site are automatically replicated in other sites.
Using Docker multi-stage builds to create advanced build pipelines.
With a single Dockerfile, create lean Docker images for Go, Node, Java and other platforms.
A 2-hour session where I cover what Apache Camel is and the latest news on the upcoming Camel v3; the main topic of the talk is the new Camel K sub-project for running integrations natively on the cloud with Kubernetes. The last part of the talk is about running Camel with GraalVM / Quarkus to achieve natively compiled binaries with impressive startup time and footprint.
Behavior Driven Development is one of the most commonly misunderstood techniques in DevOps, but it is also one of the key enablers of both an Agile culture and true continuous deployment. This talk will attempt to fill in the missing pieces on exactly what BDD is and how your teams can use it to increase communication, drive quality, and reduce waste. We will also connect the dots on why you need a test-first strategy to enable trunk-based development, continuous integration, and continuous deployment. If your business still struggles with monthly or quarterly big-batch releases, this talk will show you what your teams must do to evolve to the next stage of continuous delivery.
Faster Container Image Distribution on a Variety of Tools with Lazy Pulling (Kohei Tokunaga)
A talk at KubeCon + CloudNativeCon North America 2021 Virtual about lazy pulling of container images with eStargz and nydus (October 14, 2021).
https://kccncna2021.sched.com/event/lV2a
In the era of Microservices, Cloud Computing and Serverless architecture, it’s useful to understand Kubernetes and learn how to use it. However, the official Kubernetes documentation can be hard to decipher, especially for newcomers. In this book, I will present a simplified view of Kubernetes and give examples of how to use it for deploying microservices using different cloud providers, including Azure, Amazon, Google Cloud and even IBM.
An Operator is an application that encodes the domain knowledge of another application and extends the Kubernetes API through custom resources. Operators enable users to create, configure, and manage their applications. Operators have been around for a while now, which has allowed patterns and best practices to develop.
In this talk, Lili will explain what operators are in the context of Kubernetes and present the different tools out there to create and maintain operators over time. She will end by demoing the building of an operator from scratch, and also using the helper tools available out there.
Since release 17.05, Docker has supported multi-stage builds for anyone who has struggled to optimize Dockerfiles while keeping them easy to read and maintain. This builder pattern helps anyone who wants an image containing just the runtime, configuration and application, without compilers, debuggers, source code, build artifacts, test logs, etc.
Best Practices for Middleware and Integration Architecture Modernization with... (Claus Ibsen)
What are important considerations when modernizing middleware and moving towards serverless and/or cloud native integration architectures? How can we make the most of flexible technologies such as Camel K, Kafka, Quarkus and OpenShift? Claus is working as project lead on Apache Camel and has extensive experience with open source product development.
The talk was recorded, runs for 30 minutes, and is published on YouTube at: https://www.youtube.com/watch?v=d1Hr78a7Lww
Kubernetes is a solid leader among different cloud orchestration engines and its adoption rate is growing on a daily basis. Naturally people want to run both their applications and databases on the same infrastructure.
There are a lot of ways to deploy and run PostgreSQL on Kubernetes, but most of them are not cloud-native. Around one year ago, Zalando started to run an HA setup of PostgreSQL on Kubernetes managed by Patroni. Those experiments were quite successful and produced a Helm chart for Patroni. That chart was useful, albeit with a single problem: Patroni depended on Etcd, ZooKeeper or Consul.
Few people look forward to deploying two applications instead of one and supporting them later on. In this talk I would like to introduce Kubernetes-native Patroni. I will explain how Patroni uses the Kubernetes API to run leader election and store the cluster state. I am going to live-demo a deployment of an HA PostgreSQL cluster on Minikube and share our own experience of running more than 130 clusters on Kubernetes.
Patroni is a Python open-source project developed by Zalando in cooperation with other contributors on GitHub: https://github.com/zalando/patroni
How to do a live demo with minikube:
1. git clone https://github.com/zalando/patroni
2. cd patroni
3. git checkout feature/demo
4. cd kubernetes
5. open demo.sh and edit line #4 (specify the minikube context)
6. docker build -t patroni .
7. optionally, docker push patroni
8. optionally, edit patroni_k8s.yaml line #22 and put in the name of the patroni image you built
9. install tmux
10. run tmux in one terminal
11. run bash demo.sh in another terminal and press Enter from time to time
Site search is one of the core functionalities of any website. This talk provides an overview of the internal workings of CQ5 search and its limitations for implementing site search functionality, and discusses design patterns and challenges for integrating various 3rd-party search providers with CQ5/AEM.
How can we harness AEM6 and Sling to integrate backend layers with the CMS and expose them as a unified framework? Creating these integrations is vital for coherent, personalizable and trackable sites.
Building Client-side Search Applications with Solr (lucenerevolution)
Presented by Daniel Beach, Search Application Developer, OpenSource Connections
Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.
Do you need an external search platform for Adobe Experience Manager? (therealgaston)
Experience Manager provides some basic search capabilities out of the box. In this talk, we'll explore an external search platform for implementing an Experience Manager powered, search-driven site. As an example, we will use Apache Solr as a reference implementation and describe best practices for indexing content, exposing non-Experience Manager content via search, delivering search-driven experiences, and deploying the solution in a production setting.
Over the past year, data at GumGum has quadrupled. Nowadays we process 20 TB of new data every day. New data/reporting requirements pour in every week, and usage of real-time data is growing along with the daily data. In this talk we try to answer the following questions: How do we serve real-time data together with daily batched data to our consumers? What is the Lambda Architecture and how does it help? What role does Cassandra play in the Lambda Architecture at GumGum? How did we solve a few bottlenecks in the architecture using Cassandra? How can Cassandra help you avoid micro-batching and give you truly real-time data?
About the Speaker
Vaibhav Puranik, VP of Engineering, Big Data & Platform, GumGum
Vaibhav has 15 years of experience in software. He began his career at Johnson Space Center in Houston and has a master's degree in computer science. For the past 5 years, Vaibhav has been responsible for architecting multiple big data systems at GumGum. He manages the Data Science, Data Engineering and DevOps teams at GumGum.
Join Abhishek Dwevedi, Tech Training Instructor and Developer, Adobe Worldwide Field Enablement, for a discussion about using AEM Assets. By joining this session, you will gain a deeper understanding of best practices for using assets in Experience Manager.
To view the on-demand session go to: http://bit.ly/ATACE92016
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.
Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr's free-text, geospatial, and other search capabilities with a prominent query language already known by most developers (and which many external systems can use to query Solr directly).
Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.
We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.
Building Search & Recommendation Engines (Trey Grainger)
In this talk, you'll learn how to build your own search and recommendation engine based on the open source Apache Lucene/Solr project. We'll dive into some of the data science behind how search engines work, covering multi-lingual text analysis, natural language processing, relevancy ranking algorithms, knowledge graphs, reflected intelligence, collaborative filtering, and other machine learning techniques used to drive relevant results for free-text queries. We'll also demonstrate how to build a recommendation engine leveraging the same platform and techniques that power search for most of the world's top companies. You'll walk away from this presentation with the toolbox you need to go and implement your very own search-based product using your own data.
All you need to start with Apache Solr (an alternative to Elasticsearch). This presentation includes the essential information about Solr, i.e. what it is, installation, and indexing & searching, for beginners.
Apache Solr serves search requests at the enterprises and the largest companies around the world. Built on top of the top-notch Apache Lucene library, Solr makes indexing and searching integration into your applications straightforward.
Solr provides faceted navigation, spell checking, highlighting, clustering, grouping, and other search features. Solr also scales query volume with replication and collection size with distributed capabilities. Solr can index rich documents such as PDF, Word, HTML, and other file types.
Solr Recipes provides quick and easy steps for common use cases with Apache Solr. Bite-sized recipes will be presented for data ingestion, textual analysis, client integration, and each of Solr’s features including faceting, more-like-this, spell checking/suggest, and others.
Introduction to Solr. A brief introduction to Solr for anyone who wants to get trained on Solr.
1. Introduction to Solr
2. Solr Terminologies
3. Installation and Configuration
4. Configuration files: schema.xml and solrconfig.xml
5. Features of Solr
a. Hit Highlighting
b. Auto Complete / Suggester
c. Stop Words
d. Synonyms
e. Spell Check
f. Geospatial Search
g. Result Grouping
h. Query Syntax
i. Query Boosting
j. Content Spotlighting / Merchandising / Banner / Elevate
k. Block Record / Remove URL Feature
6. Indexing the Data
7. Search Queries
8. DataImportHandler - DIH
9. Plugins to index various types of data (XML, CSV, DB, filesystem)
10. Solr Client APIs
11. Overview of the SolrJ API
12. Running Solr on Tomcat
13. Enabling SSL on Solr
14. ZooKeeper Configuration
15. SolrCloud Deployment
16. Production Indexing Architecture
17. Production Serving Architecture
18. Solr Upgrades
19. References
Self-learned Relevancy with Apache Solr (Trey Grainger)
Search engines are known for "relevancy", but the relevancy models that ship out of the box (BM25, classic tf-idf, etc.) are just scratching the surface of what's needed for a truly insightful application.
What if your search engine could automatically tune its own domain-specific relevancy model based on user interactions? What if it could learn the important phrases and topics within your domain, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain? What if you could further use SQL queries to explore these relationships within your own BI tools and return results in ranked order to deliver relevance-driven analytics visualizations?
In this presentation, we'll walk through how you can leverage the myriad of capabilities in the Apache Solr ecosystem (such as the Solr Text Tagger, Semantic Knowledge Graph, Spark-Solr, Solr SQL, learning to rank, probabilistic query parsing, and Lucidworks Fusion) to build self-learning, relevance-first search, recommendations, and data analytics applications.
Introduction to the basics of Information Retrieval (IR) with an emphasis on Apache Solr/Lucene. A lecture I gave during the JOSA Data Science Bootcamp.
After a thorough overview of the main features and benefits of Apache Solr (an open source search server), the architecture of Solr and strategies to adopt it for your PHP application and data model will be presented. The main lessons learned around dealing with a mix of structured and non-structured content, multilingual aspects, tuning, and the various state-of-the-art features of Solr will be shared as well.
2. About me
❖ AEM6 Certified Expert, Lead Consultant
❖ LinkedIn: https://www.linkedin.com/pub/deepak-khetawat/96/a44/99a
❖ Twitter: @dk452
❖ For any queries, mail deepakkhe.90@gmail.com
3. Agenda
❖ Search
❖ What is Solr?
❖ Solr Architecture
❖ How Solr Works
❖ Solr Core
❖ Solr Queries
❖ Solr with AEM
❖ External Solr Integration with AEM
❖ Exercises
4. ❖ Search Stats:
➢ Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
➢ 61% of global Internet users research products online.
➢ 44% of online shoppers begin by using a search engine.
➢ A study by Outbrain shows that search is the #1 driver of traffic to content sites, beating social media.
5. ❖ Why Searching?
➢ Want to look up similar terms in the database to power autocomplete, and get facet/category results in one go?
➢ Instead of each application (e.g. e-commerce portals) storing all structured data in a database and building search itself, it is better to adopt the right product that is already built and can be adapted.
➢ This is where search engines/servers come into the picture.
7. ❖ What is Solr?
Price, high to low: Endeca, FredHopper, Mercado, Google Mini, Microsoft Search Server, Autonomy, Microsoft Search Server Express, Lucene (Solr/Elasticsearch)
Speed, fast to slow: Google Mini/Endeca, FredHopper, Autonomy, Lucene/MSS/MSSE
Features, high to low: Endeca, FredHopper, Mercado, Solr, Autonomy, Lucene, MSS/MSSE, Google Mini
Extensibility, high to low: Lucene, Endeca, FredHopper, Mercado, Autonomy, MSS/MSSE, Google Mini
8. ❖ What is Solr?
➢ Apache Solr is a popular open source enterprise search server / web application.
➢ Solr uses the Lucene search library and extends it, exposing Lucene's Java APIs as RESTful services.
➢ Apache Solr is easy to use from virtually any programming language, and it can improve performance since it can search all of your web content.
➢ We put documents into it (called indexing) via XML, JSON, CSV or binary formats, query it via GET requests, and receive search results in XML, JSON, CSV, Python, Ruby, PHP, binary and other formats.
9. ❖ Key features of Solr:
➢ Advanced Full-Text Search Capabilities.
➢ Optimized for High Volume Web Traffic.
➢ Standards-Based Open Interfaces: XML, JSON and HTTP.
➢ Comprehensive HTML Administration Interfaces.
➢ Server Statistics exposed over JMX for Monitoring.
➢ Near Real-Time Indexing.
➢ Extensible Plugin Architecture.
14. ❖ Indexing:
➢ Indexing is the processing of original data into a highly efficient cross-reference lookup in order to facilitate rapid searching.
➢ This type of index is called an inverted index, because it inverts a page-centric data structure (page -> words) into a keyword-centric data structure (word -> pages).
➢ Without an index, the search engine would scan every document, which requires considerable time and computing power.
➢ For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
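The inverted index described above can be sketched in a few lines (a toy illustration, not Solr's actual implementation; the sample documents are made up):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Invert doc -> words into word -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "AEM integrates with Solr",
    2: "Solr is built on Lucene",
    3: "Lucene powers full text search",
}
index = build_inverted_index(docs)
print(index["solr"])    # doc ids containing "solr": [1, 2]
print(index["lucene"])  # doc ids containing "lucene": [2, 3]
```

The word-centric structure is what makes lookups fast: answering "which documents contain solr?" is a single dictionary access, not a scan of every document.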
15. ❖ Analysis:
➢ Analyze: search engines do not index text directly. The text is broken into a series of individual atomic elements called tokens.
➢ An Analyzer builds TokenStreams, which analyze text and represent a policy for extracting index terms from text.
➢ Analyzers are used both when a document is indexed and at query time.
➢ A Tokenizer breaks field data into lexical units, or tokens.
➢ Filters examine a stream of tokens and keep them, transform or discard them, or create new ones.
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
https://cwiki.apache.org/confluence/display/solr/Tokenizers
https://cwiki.apache.org/confluence/display/solr/About+Filters
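The tokenizer-plus-filters pipeline can be sketched like this (the token rule and stop-word list are simplified assumptions, far cruder than Solr's real analyzers):

```python
import re

def tokenize(text):
    """Tokenizer: break field data into lexical units (tokens)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase_filter(tokens):
    """Filter: transform each token to lower case."""
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stopwords=frozenset({"a", "an", "the", "is", "of"})):
    """Filter: discard tokens that carry little meaning."""
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Analyzer: a policy built from a tokenizer plus a chain of filters."""
    return stopword_filter(lowercase_filter(tokenize(text)))

print(analyze("The Analyzer is a policy for extracting Index Terms"))
```

Because the same chain runs at index time and at query time, "Index Terms" in a document and "index terms" in a query normalize to the same tokens and therefore match.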
17. ❖ Searching:
➢ Searching is the process of consulting the search index and retrieving the documents matching the query, sorted in the requested sort order.
➢ The search process: the query is parsed and analyzed, the index is consulted for matching documents, and the results are scored and sorted.
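That flow can be sketched end to end: analyze the query, look each term up in an inverted index, intersect the postings (an AND query), and rank the matches. The documents and term-count scoring below are illustrative assumptions, far simpler than Lucene's scoring:

```python
# Toy search over an inverted index: analyze the query, intersect
# postings for an AND query, then sort matches by how often the
# query terms occur in each document.
docs = {
    1: "solr search server built on lucene",
    2: "lucene is a search library",
    3: "aem content management",
}

# word -> set of doc ids (built the same way the documents were indexed)
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search(query):
    terms = query.lower().split()                 # query-time analysis
    postings = [index.get(t, set()) for t in terms]
    matches = set.intersection(*postings) if postings else set()
    # score: total occurrences of the query terms in the document
    def score(doc_id):
        return sum(docs[doc_id].split().count(t) for t in terms)
    return sorted(matches, key=score, reverse=True)

print(search("search lucene"))
```

A term missing from the index yields an empty postings list, so the intersection (and the result) is empty, mirroring how an AND query requires every term to match.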
18. ❖ What is a Core?
➢ A Solr server can handle multiple cores. A core is an index handled by the Solr server. To run, a Solr core needs a few configuration files, discussed later.
➢ Mostly, cores run as isolated databases, not meant to interact with each other.
➢ A Solr index can accept data from many different sources, including XML, CSV, data extracted from tables in a DB, and common formats like Word or PDF.
19. ❖ Ways of loading data into the Solr index
➢ Upload files like XML or JSON by sending HTTP requests to Solr.
➢ Index handlers to upload from a database.
➢ Writing a custom Java app to ingest data through Solr's Java client.
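For the first option, here is a minimal sketch of posting JSON documents to Solr's update handler over HTTP (assuming a local Solr at port 8983 with a core named test, as used elsewhere in this deck; only the standard library is used):

```python
import json
from urllib import request

def build_update_request(docs, core="test", host="http://localhost:8983"):
    """Build an HTTP POST to Solr's JSON update handler, committing immediately."""
    url = "%s/solr/%s/update?commit=true" % (host, core)
    body = json.dumps(docs).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

docs = [
    {"id": "1", "title": "AEM with Solr"},
    {"id": "2", "title": "Solr indexing basics"},
]
req = build_update_request(docs)
print(req.get_full_url())
# request.urlopen(req)  # send it, once a Solr instance is running
```

The same endpoint accepts XML and CSV bodies with the corresponding Content-Type; committing on every request is fine for a demo but too expensive for bulk loading.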
21. ❖ Configuring Solr on Linux
Installing and starting Solr:
# apt-get install openjdk-7-jre-headless -y
# java -version
# cd /opt/ && wget http://mirror.nexcess.net/apache/lucene/solr/5.4.1/solr-5.4.1.tgz
#
# cd solr-5.4.1
# bin/solr start
Access it using http://<ip>:8983/solr/#/
Solr stores the index in a directory called index inside the data directory (/opt/solr-5.4.1/server/solr/corename1/data/index/).
Create a core
To be able to index and search, you need a core:
# bin/solr create -c corename -d basic_configs
22. ❖ Configuring Solr on Windows
Use http://mirror.nexcess.net/apache/lucene/solr/5.4.1/ to download Solr.
Using cmd, go to the bin directory and use solr start to start Solr.
Access it using http://<ip>:8983/solr/#/
Solr stores the index in a directory called index inside the data directory (/solr-5.4.1/server/solr/corename1/data/index/).
Create a core
To be able to index and search, you need a core. Navigate to the solr-5.4.1\bin folder in the command window:
solr create -c corename -d basic_configs
23. ❖ Solr Configuration Files
● solrconfig.xml:
➢ Manages requests and data manipulation for users.
➢ Here we configure update handlers, update event listeners and different search components.
➢ Can enable the schemaless Solr feature by adding <schemaFactory class="ManagedIndexSchemaFactory"/> in solrconfig.xml.
● core.properties: for defining properties at the core level.
● solr.xml: at the whole-Solr level, describing the cores: how many cores there are, and their details.
● schema.xml:
➢ At the core data-structure level, holding information such as field types and fields.
➢ Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file.
24. ❖ Field Types
In Solr, every field has a type. Examples of basic field types available in Solr include: float, long, double, date, text.
New field types can be created/defined by combining filters and tokenizers.
Field Definition
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
name: name of the field
type: field type
indexed: should this field be added to the inverted index?
stored: should the original value of this field be stored?
multiValued: can this field have multiple values?
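A sketch of defining such a custom field type in schema.xml by combining a tokenizer and filters (the type name text_en_custom and the stopwords.txt/synonyms.txt file names are assumptions for illustration; the factory classes are standard Solr ones):

```xml
<fieldType name="text_en_custom" class="solr.TextField" positionIncrementGap="100">
  <!-- analysis chain applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
  <!-- analysis chain applied to queries; synonyms are usually expanded here -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

A field declared with type="text_en_custom" would then be tokenized, lower-cased and filtered as configured, both at index time and at query time.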
25. ❖ Exercises
Create 2 cores: test and solrpoc.
Configure your schema.xml.
https://lucidworks.com/blog/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/ (for the POC)
26. ❖ Solr Queries
● Keyword searching:
➢ Word searching: title:aem doc
➢ Phrase searching: title:"aem document"
➢ AND, OR, - operators: ((title:"foo bar" AND body:"quick fox") OR title:fox)
● Wildcard searching:
➢ title:ae*
➢ title:a*m
27. ❖Solr Queries (note: the query snippets use the test core)
● Range Searching:
➢ modified_date:[20020101 TO 20030101]
Query Examples:
➢ http://localhost:8983/solr/test/select?q=title:Design&wt=json&indent=true
➢ http://localhost:8983/solr/test/select?q=*:*&fl=author,title&facet=true&facet.field=author&wt=json&indent=true
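Query URLs like the ones above can be assembled programmatically; a minimal sketch using only the Python standard library (the host, port, and core name are assumptions matching the examples):

```python
from urllib.parse import urlencode

def solr_select_url(core, **params):
    """Build a Solr /select URL for the given core, URL-encoding all parameters."""
    base = "http://localhost:8983/solr/%s/select" % core
    return base + "?" + urlencode(params)

# Equivalent to the first query example above
url = solr_select_url("test", q="title:Design", wt="json", indent="true")
print(url)
```

Encoding the parameters this way avoids hand-escaping characters such as ":" in field queries.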
28. ❖Solr Queries
● Fuzzy Search:
➢ Lucene supports fuzzy searches based on the Levenshtein (edit) distance
algorithm. To do a fuzzy search, use the tilde symbol, "~", at the end of a single-word
term.
➢ For example, to search for a term similar in spelling to "roam", use the fuzzy search roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9, an additional (optional) parameter can specify the required
similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher
similarity will be matched. For example:
roam~0.8
The default used if the parameter is not given is 0.5.
Query Example:
➢ http://localhost:8983/solr/test/select?q=DesignPatterns~.5&wt=json&indent=true&defType=edismax&qf=title
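The edit distance behind fuzzy search can be sketched as a standard dynamic-programming computation (an illustration, not Lucene's actual implementation, which uses optimized automata):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

print(levenshtein("roam", "foam"))   # 1 (one substitution)
print(levenshtein("roam", "roams"))  # 1 (one insertion)
```

This is why roam~ matches both foam and roams: each is one edit away.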
29. ❖Common Query Parameters
● sort: title desc
(http://localhost:8983/solr/test/select?q=title:Desi*&sort=title+desc&wt=json&indent=true)
● fq: specifies a filter query that restricts the superset of documents that can be
returned, without influencing the score. It can be very useful for speeding up
complex queries, since the queries specified with fq are cached independently
of the main query. Caching means the same filter is used again for a later query
(i.e. there is a cache hit).
fq=popularity:[10 TO *] & fq=section:0
● http://www.exadium.com/tools/online-url-encode-and-url-decode-tool/ (to
encode and decode your Solr queries)
30. ❖More Search Queries
● Highlighting:
http://localhost:8983/solr/test/select?q=title:Desi*&hl=true&hl.snippets=1&hl.fl=*&hl.fragsize=0&wt=json&indent=true
● hl – set to true to enable highlighted snippets in the query response.
● hl.fl – specifies a list of fields to highlight. The wildcard * highlights all fields.
● hl.fragsize – the size, in characters, of the snippets (aka fragments) created by the highlighter. In the
original Highlighter, "0" indicates that the whole field value should be used with no fragmenting. By
default a fragment is 100 characters.
● hl.snippets – the maximum number of highlighted snippets to generate per field. Note: anywhere from
zero to this many snippets may be generated. The default value is "1".
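With wt=json, the highlighted snippets come back under a top-level "highlighting" key, keyed by document id and then field name. A sketch of pulling them out (the sample response below is fabricated for illustration):

```python
import json

# A trimmed example of the shape a Solr JSON response with hl=true can have
response = json.loads("""
{
  "response": {"numFound": 1, "docs": [{"id": "doc1"}]},
  "highlighting": {
    "doc1": {"title": ["<em>Design</em> Patterns"]}
  }
}
""")

# Walk the highlighting section: doc id -> field -> list of snippets
for doc_id, fields in response["highlighting"].items():
    for field, snippets in fields.items():
        print(doc_id, field, snippets[0])
```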
31. ❖Exercises
1. Under the test core, achieve all of the following with one query:
a) Get only the fields author and title in the query results.
b) Fuzziness index of .5 on the title field.
c) Title should start with "design".
d) Change the title field to text_general instead of string.
e) Faceting on the title field.
f) Highlighting on the title field.
g) Author name should range between Arvind and Jasvinder.
32. ❖Exercises
1. Under the test core, with your schema.xml created, achieve all of the following:
Full Text Search
Fuzzy Search
Faceting
Spatial Search
Highlighting
Wild Card Search
Range Search
33. ❖Solr with AEM 6
● The integration in AEM 6.0 happens at the repository level, so that Solr is
one of the possible indexes that can be used in Oak, the new repository
implementation shipped with AEM 6.0.
It can be configured to work as an embedded server within the AEM
instance, or as a remote server.
Configuring AEM with an embedded Solr server
The embedded Solr server is recommended for developing and testing the
Solr index for an Oak repository. Make sure you use a remote Solr server
for production instances.
34. ❖Configure the embedded Solr server
by:
● Search for "Oak Solr server provider".
● Press the edit button and, in the window that follows, set the server type to
Embedded Solr in the drop-down list.
● Next, edit "Oak Solr embedded server configuration" and create a
configuration.
Add a node called solrIndex of type oak:QueryIndexDefinition under
oak:index with the following properties:
type: solr (of type String)
async: async (of type String)
reindex: true (of type Boolean)
Testing the Solr index (recommended configuration):
https://docs.adobe.com/docs/en/aem/6-0/deploy/upgrade/queries-and-indexing.html
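The index definition node described above can be sketched as a property listing (node name and property types as given; created e.g. via CRXDE):

```
/oak:index/solrIndex
  - jcr:primaryType = oak:QueryIndexDefinition
  - type = "solr"        (String)
  - async = "async"      (String)
  - reindex = true       (Boolean)
```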
35. ❖External Solr with AEM
1. Create a custom replication agent.
http://localhost:7502/miscadmin
Select "Agents on Author" > New > Page.
Select Replication Agent > add a Title and Name (e.g. Solr-Replication-Agent).
Now open the agent (double-click), then Properties > Edit Settings:
Serialization Type = POC Solr XML Content Serializer
Retry Delay = 60000, and set the Log Level to Debug.
In the Transport tab, set the URI to http://localhost:8983/solr/solrpoc?update=true
36. ❖2. Get the page to be indexed (pushed to
Solr)
Suppose you want to index:
http://localhost:7502/editor.html/content/geometrixx-outdoors/en/activities.html
Now open the CRXDE page (http://localhost:7502/crx/de/index.jsp) for the
respective page:
Check jcr:content of the above page and get the value of
"sling:resourceType"
(value = geometrixx-outdoors/components/page_sidebar)
37. ❖2. Get the page to be indexed (pushed to
Solr)
In Felix, go to the "Content extractor service for Solr indexing" (codebase at
https://github.com/deepakkhetawat/SolrPOCWithAEM).
Edit the "Component Indexed values" for the page you want to index in
Apache Solr.
Then add the value from the previous slide to the "Component Indexed values":
geometrixx-outdoors/components/page_sidebar@articletext=jcr:title
Note: @articletext -> the field mentioned/added in schema.xml
earlier.
jcr:title -> the "Name" field mentioned in the jcr:content properties of the above page.
38. ❖2. Get the page to be indexed (pushed to
Solr)
You need to decide which page property maps to which Solr field, and that
field must be present in schema.xml.
Once you publish the content, Apache Solr indexes the data via the separate
Solr replication agent we configured in "Agents on Author".
39. ❖Exploratory Exercises
Create your own field type in Solr, like text_general.
Explore AEM 6.1's Solr indexing by default.
Implement highlighting for the title field.
Try geospatial search with AEM.
40. ❖References
Code for "Tag Based Faceting using External Apache Solr with AEM" is available at
https://github.com/deepakkhetawat/SolrPOCWithAEM.git
Further references are mentioned throughout the slides.
Google