After a thorough overview of the main features and benefits of Apache Solr (an open source search server), this talk presents the architecture of Solr and strategies for adopting it in your PHP application and data model. The main lessons learned around dealing with a mix of structured and unstructured content, multilingual aspects, tuning, and the various state-of-the-art features of Solr are shared as well.
An overview of the Solr 6.2 examples, including the features they showcase and the challenges they present; a contrasting demonstration of a minimal viable example; and a step-by-step deconstruction of the "films" example to show which parts of the shipped examples are not actually needed.
The talk presents the sfSolrPlugin which transparently integrates the Solr search engine into symfony.
The talk explains:
* the features of the Solr search engine
* how to integrate the search engine into symfony
* complex search: faceted and geolocalized search
* usage examples: http://www.menugourmet.com and http://resolutionfinder.org
Introduction to Solr. A brief introduction to Solr for those who want to get trained on it.
1. Introduction to Solr
2. Solr Terminologies
3. Installation and Configuration
4. Configuration files schema.xml and solrconfig.xml
5. Features of Solr
a. Hit Highlighting
b. Auto Complete / Suggester
c. Stop Words
d. Synonyms
e. SpellCheck
f. Geospatial Search
g. Result Grouping
h. Query Syntax
i. Query Boosting
j. Content Spotlighting / Merchandising / Banner / Elevate
k. Block Record / Remove URL Feature
6. Indexing the Data
7. Search Queries
8. DataImportHandler - DIH
9. Plugins to index various types of Data (XML, CSV, DB, Filesystem)
10. Solr Client APIs
11. Overview of the SolrJ API
12. Running Solr on Tomcat
13. Enabling SSL on Solr
14. Zookeeper Configuration
15. Solr Cloud Deployment
16. Production Indexing Architecture
17. Production Serving Architecture
18. Upgrading Solr
19. References
Solr is a highly scalable and fast open source enterprise search platform from the Apache Lucene project. Let's explore why some of the largest Internet sites in the world prefer it for its many exciting features.
Tutorial on developing a Solr search component plugin (searchbox-com)
In this set of slides we give a step-by-step tutorial on how to develop a fully functional Solr search component plugin. Additionally, we provide links to the full source code, which can be used as a template to rapidly start creating your own search components.
Ingesting and Manipulating Data with JavaScript (Lucidworks)
Data in the wild isn’t always in the format we need for search, or even mere usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format to make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
In this on-demand webinar, Erik Hatcher, co-founder of Lucid Imagination, co-author of Lucene in Action, and Lucene/Solr PMC member and committer, presents and discusses key features and innovations of Apache Solr 1.4.
This presentation lays out the concept of the traditional web, the improvements web 2.0 have brought about, etc.
I have attempted to explain RIA as well.
The main part of this presentation is centered around Ajax: its uses, advantages and disadvantages, framework considerations when using Ajax, JavaScript hijacking, etc.
Hopefully it should be a good read as an intro doc to RIA and Ajax.
An Introduction to Basics of Search and Relevancy with Apache Solr (Lucidworks, archived)
The open source Apache Solr search engine provides powerful, versatile search application development technology so you can take full control of your search needs. Solr’s rich interfaces, convenient server packaging of the underlying Apache Lucene search libraries into web service interfaces, and near limitless customizability let you take control of your search. From e-commerce to content management and endless variations in between, Solr is the right tool at the right time to turn an ever growing volume and variety of data and documents to the advantage of your business. http://www.lucidimagination.com/blog/2009/12/01/webinar-an-introduction-to-basics-of-search-and-relevancy-with-apache-solr/
The presentation describes what Apache Solr is and how it can be used, with an Apache Solr overview, performance tuning tips, and a description of advanced features.
Search Engine Building with Lucene and Solr (SoCal Code Camp San Diego 2014) (Kai Chan)
Slides for my presentation at SoCal Code Camp, June 29, 2014
(http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)
OpenCms 8.5 integrates Apache Solr. And not only for full text search, but as a powerful query engine as well.
Imagine you want to show a list of "all resources of type news, that have changed since yesterday, where property X has the value Y" on your web page. Sure, there are API methods in OpenCms to load resources based on the type, on the date of change, or on the value of a specific property. But for many common use case combinations, there is no single API call. This means if you create a collector, you often end up sorting out the results of the initial API query in code.
In this session, Rüdiger will show how Apache Solr has been integrated in OpenCms 8.5. He will explain how to create improved front-end full text search functions with advanced options like faceting and spell check suggestions. And he will explain how to use Solr to directly read resources from the OpenCms VFS, allowing query combinations that combine resource attributes, properties and content in a powerful new way.
Introduction to the basics of Information Retrieval (IR) with an emphasis on Apache Solr/Lucene. A lecture I gave during the JOSA Data Science Bootcamp.
Apache Solr serves search requests at enterprises and some of the largest companies around the world. Built on top of the top-notch Apache Lucene library, Solr makes it straightforward to integrate indexing and searching into your applications.
Solr provides faceted navigation, spell checking, highlighting, clustering, grouping, and other search features. Solr also scales query volume with replication and collection size with distributed capabilities. Solr can index rich documents such as PDF, Word, HTML, and other file types.
Building a near real time search engine & analytics for logs using Solr (Lucene Revolution)
Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd
Consolidating and indexing logs so they can be searched in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since log events are mostly small, from around 200 bytes to a few KBs, they are harder to handle: the smaller a log event, the greater the number of documents to index. In this session, we will discuss the challenges we faced and the solutions developed to overcome them. The items covered in the talk are as follows.
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques used to manage distributed index generation and search across multiple shards
How choosing a layer-based partitioning strategy helped us bring down search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation (BIOVIA)
Accelrys Catalog is a powerful new technology for creating an index of the protocols and components within your organization. You will learn about strategies for indexing and how search capabilities can be deployed to professional client and Web Port end users. You will also learn how to use this technology to find out about system usage to aid with system upgrades, server consolidations, and general system maintenance. The protocol validation capability in the admin portal allows administrators to create standard reports on server usage characteristics. You will learn how to report on violations of IT policies (e.g. around security), bad protocol authoring practices, or missing or incomplete protocol documentation. Developers will also learn how to extend and customize the rules used to create these reports.
Solr search engine with multiple table relations (Jay Bharat)
Here you can learn how to use the Solr search engine and implement it in your application, for example with PHP/MySQL.
I introduce how to handle data from multiple tables in Solr.
DataArt: Custom Software Engineering with a Human Approach (DataArt)
DataArt is a global software engineering firm that takes a uniquely human approach to solving problems. With over 20 years of experience, teams of highly-trained engineers around the world, deep industry sector knowledge and ongoing technology research, we help clients create custom software that improves their operations and opens new markets. Powered by our People First principle, we work with clients at any scale and on any platform, and adapt alongside them as they evolve.
DataArt Financial Services and Capital Markets (DataArt)
We integrate our engineering excellence with deeply human values that drive our business and our approach to relationships: curiosity, empathy, trust, honesty, and intuition. These qualities help us deliver high-value, high-quality solutions that our clients depend on, and lifetime partnerships they believe in.
DataArt has earned the trust of some of the world’s leading brands and most discerning clients, including Nasdaq, Travelport, Ocado, Centrica/Hive, Paddy Power Betfair, IWG, Univision, Meetup and Apple Leisure Group among others. DataArt brings together expertise of over 3000 professionals in 20 locations in the US, Europe, and Latin America.
We visit dozens or hundreds of sites every day and periodically see ads, often without even wondering where they come from. Why is this particular ad shown to you right here? And what is the role of JS in all of this?
We will cover:
• the life cycle of an ad banner, tracing its path from the advertiser to the browser;
• who is constantly tracking us on the Internet, and how much information about us is available to them;
• ways to detect low-quality traffic;
• why view quality needs to be controlled;
• why you can't simply go and view all advertising statistics in one place (or can you?).
Alexey Umansky, JS Developer, AnyMind Group. Four years of experience in IT. Has worked on travel and gamedev projects: developed a large airline ticket purchasing service and built a slot machine system for an online casino. For the last year he has been working in Thailand on Digital Marketing products: an online influencer marketplace and a service for managing on-site advertising and collecting statistics on it.
DevOps Workshop: What happens when DevOps comes to a project (DataArt)
Alexander Snegovoy, DevOps Software Engineer at DataArt.
More than six years in IT. Certified AWS Solutions Architect Associate. Speaker at international scientific conferences. Devoted Docker fan.
Oksana Kharchuk, Senior QA Engineer.
Presentation:
Communication in the life of a QA engineer. How a tester can build effective communication with business analysts, developers, managers, and clients.
You can't simply come to an agreement, or how we worked with difficult people (DataArt)
Ellina Azadova, QA Lead at DataArt Kherson.
Presentation:
Real examples from her own practice of working with difficult people: introverts, extroverts, the overly emotional, and the perpetually pessimistic.
Dmitry Klipinin, DevOps Engineer at GlobalLogic, with more than 10 years of experience in IT, a Microsoft-certified specialist in Active Directory and SQL Server technologies.
Presentation:
1. The evolution of the system administrator.
2. DevOps practices.
3. Core DevOps tools.
Alexander Snegovoy, DevOps Software Engineer at DataArt Kherson. More than six years in IT. Certified AWS Solutions Architect Associate. Speaker at international scientific conferences. Devoted Docker fan.
Presentation:
1. Dockerizing an application.
2. Setting up CI/CD.
3. Deploying infrastructure in AWS with Terraform.
IT talk SPb "Full text search for lazy guys"
1.
2. FULL TEXT SEARCH
FOR LAZY GUYS
Starring Apache solr
Alexander Tokarev
Senior Developer
DataArt
atokarev@dataart.com
Alexander Polushkin
Configuration Manager
DataArt
atokarev@dataart.com
4. FTS solutions attributes
1. Search by the content of documents rather than by attributes
2. Read-oriented
3. Flexible data structure
4. One dedicated, tailored index used for all subsequent searches
5. The index contains unique terms and their positions across all documents
6. The indexer takes into account language-specific nuances like stop words, stemming, and synonyms
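As an illustration of points 5 and 6, a toy inverted index can be sketched in a few lines. This is only a conceptual sketch of the idea, not how Lucene actually stores its index:

```python
# Toy inverted index: map each term to the positions where it occurs
# in each document, skipping stop words (points 5 and 6 above).
STOP_WORDS = {"the", "a", "of", "and"}

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

docs = {1: "the art of search", 2: "search engines and search terms"}
index = build_index(docs)
# "search" occurs in both documents; "the"/"of"/"and" are dropped
print(index["search"])  # {1: [3], 2: [0, 3]}
```

A real indexer would also apply stemming and synonym expansion at this point, so that "engines" and "engine" land in the same posting list.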
11. Solr
• True open source (under Apache) full text search engine
• Built over Lucene
• Multi-language support
• Rich document parsing (rtf, pdf, …)
• Various client APIs
• Versatile query language
• Scalable
• Full of additional features
14. Client access
1. Main REST API
– Common operations
– Schema API
– Rebalance/collection API
– Search API
– Faceted API
2. Native JAVA client SolrJ
3. Client bindings like Ruby, .Net, Python, PHP, Scala – see
https://wiki.apache.org/solr/IntegratingSolr +
https://wiki.apache.org/solr/SolPython
4. Parallel SQL (via REST and JDBC)
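A sketch of what a call to the search and faceted REST APIs looks like; the collection name `artists` and the field names here are made-up examples, not taken from the slides:

```python
from urllib.parse import urlencode

def solr_select_url(base, collection, query, rows=10, facet_field=None):
    """Build a /select URL; covers a plain search and a faceted one."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if facet_field:  # Faceted API: ask Solr to count values of a field
        params["facet"] = "true"
        params["facet.field"] = facet_field
    return f"{base}/{collection}/select?{urlencode(params)}"

print(solr_select_url("http://localhost:8983/solr", "artists",
                      "department_name:painting",
                      facet_field="department_code"))
```

Any HTTP client (or one of the bindings above, such as SolrJ) then issues a GET against the resulting URL.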
16. Index modeling
Choose Solr mode:
1. Schema
2. Schema-less
Define field attributes:
1. Indexed (query, sort, facet, group by, provide query suggestions for, execute function)
2. Stored – all fields which are intended to be shown in a response
3. Mandatory
4. Data type
5. Multivalued
6. Copy field (calculated)
Choose a field for the unique key
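A hypothetical schema.xml fragment showing these attributes in practice (all field names here are made up for illustration):

```xml
<!-- indexed/stored/required/multiValued correspond to the attributes above -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="department_name" type="text_general" indexed="true" stored="true"/>
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- copyField builds a catch-all field out of other fields -->
<field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="department_name" dest="all_text"/>
<!-- the unique identifier field -->
<uniqueKey>id</uniqueKey>
```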
17. Field data types
1. Dates
2. Strings
3. Numeric
4. Guid
5. Spatial
6. Boolean
7. Currency, etc.
22. Transaction management
1. Solr doesn’t immediately expose new data, nor does it immediately remove deleted data
2. A commit/rollback should be issued
Commit types:
1. Soft. Data is indexed in memory
2. Hard. Moves data to the hard drive
Risks:
1. Commits are slow
2. Many simultaneous commits can lead to Solr exceptions (too many commits)
<h2>HTTP ERROR: 503</h2>
<pre>Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.</pre>
3. The commit command works at the instance level, not per user
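A minimal sketch of how the two commit types map onto parameters of an update request; the endpoint layout and collection name are assumptions, not taken from the slides:

```python
from urllib.parse import urlencode

def commit_params(soft=False):
    """softCommit=true makes new data searchable from memory (fast);
    commit=true is a hard commit that also flushes data to disk."""
    return {"softCommit": "true"} if soft else {"commit": "true"}

def update_url(base, collection, soft=False):
    # The commit parameter rides along with the POST to /update
    return f"{base}/{collection}/update?{urlencode(commit_params(soft))}"

print(update_url("http://localhost:8983/solr", "artists"))             # hard commit
print(update_url("http://localhost:8983/solr", "artists", soft=True))  # soft commit
```

In practice, sending `commitWithin` on the update request and letting Solr batch commits is a common way to avoid the maxWarmingSearchers error shown above.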
23. Transaction log
Intention:
1. Recovery/durability
2. Near-Real-Time (NRT) updates
3. Replication for Solr cloud
4. Atomic document update, in-place update (syntax is different)
5. Optimistic concurrency
Transaction log could be enabled in solrconfig.xml
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Atomic update example:
{"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}
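The atomic-update document above can also be built programmatically; a small sketch using the same field names as the slide:

```python
import json

def atomic_update(doc_id, **ops):
    """Each keyword argument maps a field to an
    {operation: value} pair such as set, inc, add, remove."""
    doc = {"id": doc_id}
    doc.update(ops)
    return doc

payload = [atomic_update("mydoc",
                         price={"set": 99},
                         popularity={"inc": 20},
                         tags={"remove": ["free_to_try", "on_sale"]})]
print(json.dumps(payload))  # ready to POST to the /update endpoint
```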
24. Data modification Rest API
The REST API accepts:
1. JSON objects
2. XML update
3. CSV
Solr UPDATE = UPSERT if schema.xml defines a <uniqueKey>
26. Post utility
1. Java-written utility
2. Intended to load files
3. Works extremely fast
4. Loads csv, json
5. Loads files by mask or file-by-file
bin/post -c cloud tags*.json
ISSUE: doesn’t work with Solr Cloud
27. Data import handler
1. Solr loads the data itself
2. DIH can access JDBC, Atom/RSS, HTTP, XML, and SMTP data sources
3. A delta approach can be implemented (statements for new, updated, and deleted data)
4. Loading progress can be tracked
5. Various transformations can be done inside (regexp, conversion, JavaScript)
6. Custom data source loaders can be implemented in Java
7. Web console to run/monitor/modify
28. Data import handler.
How to implement
1. Create data config
<dataConfig>
<dataSource name="jdbc" driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost/db"
user="admin" readOnly="true" autoCommit="false" />
<document>
<entity name="artist" dataSource="jdbc" pk="id"
query="select * from artist a"
transformer="DateFormatTransformer"
>
<field column="id" name="id"/>
<field column="department_code" name="department_code"/>
<field column="department_name" name="department_name"/>
<field column="begin_date" dateTimeFormat="yyyy-MM-dd" />
</entity>
</document>
</dataConfig>
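The delta approach mentioned on slide 27 would add delta queries to the same entity; a hypothetical fragment (the `updated_at` column is an assumption):

```xml
<entity name="artist" dataSource="jdbc" pk="id"
        query="select * from artist"
        deltaQuery="select id from artist
                    where updated_at &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from artist
                          where id = '${dataimporter.delta.id}'">
  <field column="id" name="id"/>
</entity>
```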
29. Data import handler.
How to implement
2. Publish it in solrconfig.xml:
<requestHandler name="/jdbc"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">jdbc.xml</str>
  </lst>
</requestHandler>
DIH can be started via a REST call:
curl http://localhost:8983/solr/cloud/jdbc -F command=full-import
30. Data import handler
While an import is running:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">jdbc.xml</str>
</lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
<str name="Time Elapsed">0:1:15.460</str>
<str name="Total Requests made to DataSource">39547</str>
<str name="Total Rows Fetched">59319</str>
<str name="Total Documents Processed">19772</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2010-10-03 14:28:00</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
31. Data import handler
After Import:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">jdbc.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">2118645</str>
<str name="Total Rows Fetched">3177966</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2010-10-03 14:28:00</str>
<str name="">Indexing completed. Added/Updated: 1059322 documents. Deleted 0 documents.</str>
<str name="Committed">2010-10-03 14:55:20</str>
<str name="Optimized">2010-10-03 14:55:20</str>
<str name="Total Documents Processed">1059322</str>
<str name="Time taken ">0:27:20.325</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
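One way to use these status responses programmatically is to poll the handler until the status flips from busy to idle. A minimal parsing sketch (the XML snippet below is abbreviated from the response above):

```python
import xml.etree.ElementTree as ET

def dih_status(xml_text):
    """Extract the DIH status string ('busy' or 'idle') and the
    processed-document counter from a DIH status response."""
    root = ET.fromstring(xml_text)
    status = root.findtext("str[@name='status']")
    msgs = root.find("lst[@name='statusMessages']")
    processed = None
    if msgs is not None:
        processed = msgs.findtext("str[@name='Total Documents Processed']")
    return status, processed

sample = """<response>
  <str name="status">busy</str>
  <lst name="statusMessages">
    <str name="Total Documents Processed">19772</str>
  </lst>
</response>"""
print(dih_status(sample))  # ('busy', '19772')
```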
34. Search types
• Fuzzy
Developer~ Developer~1 Developer~2
Matches developer, developers, development, etc. (the number is the maximum edit distance; Lucene allows 0–2)
• Proximity
“solr search developer”~ “solr search developer”~1
Matches: solr search developer, solr senior developer
• Wildcard
Deal* Com*n C??t
Need leading wildcards like *xed? Add ReversedWildcardFilterFactory.
• Range
[1 TO 25] {23 TO 50} {23 TO 90]
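The query shapes above can be composed mechanically; a small sketch of helpers that emit the same syntax (illustration only, not a Solr client library):

```python
def fuzzy(term, distance=None):
    """term~N - N is the maximum edit distance (0-2 in Lucene)."""
    return f"{term}~{distance}" if distance is not None else f"{term}~"

def proximity(phrase, slop):
    """"a b c"~N - words may be up to N positions apart."""
    return f'"{phrase}"~{slop}'

def range_query(field, lower, upper, inc_lower=True, inc_upper=True):
    """field:[a TO b] - square bracket = inclusive, curly = exclusive."""
    lo = "[" if inc_lower else "{"
    hi = "]" if inc_upper else "}"
    return f"{field}:{lo}{lower} TO {upper}{hi}"

print(fuzzy("Developer", 1))                  # Developer~1
print(proximity("solr search developer", 1))  # "solr search developer"~1
print(range_query("price", 1, 25))            # price:[1 TO 25]
```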
35. Search characteristics
1. Similarity
2. Term frequency
Similarity can be adjusted via boosting:
q=title:(solr for developers)^2.5 AND description:(professional)
q=title:(java)^0.5 AND description:(professional)^3
36. Search result customization
Field list
/query?q=…&fl=id,genre /query?q=…&fl=*,score
Sort
/query?q=…&fl=id,name&sort=date asc, score desc
Paging
/select?q=*:*&sort=id asc&fl=id&rows=5&start=5
Transformers
[docid] [shard]
Debugging
/query?q=…&fl=id&debug=true
Format
/query?q=…&fl=id&wt=json /query?q=…&fl=id&wt=xml
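The parameters above can be assembled into a request URL like this sketch (the base URL and collection name `cloud` are assumptions from earlier slides):

```python
from urllib.parse import urlencode

def select_url(base, collection, q, **params):
    """Assemble a /select URL; fl, sort, rows, start and wt are the
    customization parameters shown above."""
    return f"{base}/solr/{collection}/select?{urlencode({'q': q, **params})}"

url = select_url("http://localhost:8983", "cloud", "*:*",
                 fl="id,name", sort="date asc, score desc",
                 rows=5, start=5, wt="json")
```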
47. Zookeeper
Apache ZooKeeper is a mature and fast open source server widely used in distributed systems for coordinating, synchronizing, and maintaining shared information.
18 April 2017
53. SolrCloud Shard Splitting
• curl 'http://192.168.70.60:8983/solr/admin/collections?action=SPLITSHARD&collection=cloud&shard=shard1'
• /admin/collections?action=SPLITSHARD&collection=name&shard=shardID
54. SolrCloud Adding a Replica
• curl 'http://192.168.70.60:8983/solr/admin/collections?action=ADDREPLICA&collection=cloud&shard=shard1&node=192.168.70.100:8983_solr'
• /admin/collections?action=ADDREPLICA&collection=collection&shard=shard&node=nodeName
55. SolrCloud Collections API
• https://cwiki.apache.org/confluence/display/solr/Collections+API
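The two calls above follow one pattern; a hedged helper for composing Collections API URLs (hosts and parameters mirror the examples, nothing here is sent over the network):

```python
from urllib.parse import urlencode

def collections_api(base, action, **params):
    """Build a Collections API URL such as the SPLITSHARD and
    ADDREPLICA calls shown on the previous slides."""
    return f"{base}/solr/admin/collections?{urlencode({'action': action, **params})}"

split = collections_api("http://192.168.70.60:8983", "SPLITSHARD",
                        collection="cloud", shard="shard1")
add = collections_api("http://192.168.70.60:8983", "ADDREPLICA",
                      collection="cloud", shard="shard1",
                      node="192.168.70.100:8983_solr")
```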
56. Q&A
18 April 2017 – Full Text Search for Lazy Guys
57. Advanced Solr
1. Streaming language
A special language tailored mostly for SolrCloud: parallel processing in a map-reduce style. The idea is to process and return big datasets. Commands include: search, jdbc, intersect, parallel, or, and
2. Parallel query
A JDBC/REST interface to process data in SQL style. Works across many Solr nodes in MPP style.
curl --data-urlencode 'stmt=SELECT to, count(*) FROM collection4 GROUP BY to ORDER BY count(*) desc LIMIT 10' http://localhost:8983/solr/collection4/sql
3. Graph functions
Graph traversal, aggregations, cycle detection, export to GraphML format
4. Spatial queries
There is a Location field type. It permits spatial conditions like filtering by distance (circle, square, sphere), etc.
&q=*:*&fq=(state:"FL" AND city:"Jacksonville")&sort=geodist()+asc
5. Spellchecking
It can be based on the current index, another index, a file, or word breaks. There are many options for what to return: most similar, more popular, etc.
http://localhost:8983/solr/cloud/spell?df=text&spellcheck.q=delll+ultra+sharp&spellcheck=true
6. Suggestions
http://localhost:8983/solr/cloud/a_term_suggest?q=sma&wt=json
7. Highlighter
Marks matching fragments in the found documents
http://localhost:8983/solr/cloud/select?hl=on&q=apple
8. Facets
Arrangement of search results into categories based on indexed terms, with statistics. Can be done by values, ranges, dates, intervals, heatmaps
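As an illustration of a field facet plus a range facet in one request (the field names `genre` and `price` are made up for the example):

```python
from urllib.parse import urlencode

# Field faceting on genre plus range faceting on price in 10-unit buckets;
# Solr returns counts per bucket alongside the normal result list.
facet_params = urlencode({
    "q": "*:*",
    "facet": "true",
    "facet.field": "genre",
    "facet.range": "price",
    "facet.range.start": 0,
    "facet.range.end": 100,
    "facet.range.gap": 10,
})
# append to a /select URL, e.g. http://localhost:8983/solr/cloud/select?<facet_params>
```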
58. Performance tuning Cache
Be aware of Solr cache types:
1. Filter cache
Holds unordered document identifiers for filter queries that have been executed (populated only when the fq query parameter is used)
2. Query result cache
Holds ordered document identifiers resulting from queries that have been executed
3. Document cache
Holds Lucene document instances for access to fields marked as stored
Identify most suitable cache class
1. LRUCache – least recently used entries are evicted first; tracks access time
2. FastLRUCache – the same, but eviction runs in a separate thread
3. LFUCache – least frequently used entries are evicted first; tracks usage count
Play with auto-warm
<filterCache class="solr.FastLRUCache" size="512" initialSize="100" autowarmCount="10"/>
Be aware of how auto-warm works internally – it doesn't delete data; the cache is repopulated completely
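To see why the eviction policy matters, here is a toy least-recently-used cache mirroring what solr.LRUCache does (an illustration only, not Solr code):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: the least recently touched entry is evicted first."""
    def __init__(self, size):
        self.size, self.data = size, OrderedDict()

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.size:
            self.data.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # a read refreshes recency
            return self.data[key]

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")           # touch "a" so "b" becomes the eviction victim
cache.put("c", 3)        # evicts "b"
print(list(cache.data))  # ['a', 'c']
```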
59. Performance tuning Memory.
• Leave OS memory free for disk caching
• Estimate the Java heap size for Solr properly – use
https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_2_0/dev-tools/size-estimator-lucene-solr.xls
60. Performance tuning Schema design
1. Try to decrease the number of stored fields; mark them as indexed only
2. If fields are used only to be returned in search results, make them stored only
61. Performance tuning Ingestion
1. Send documents in bulk rather than one per request
2. If you use SolrJ, use the ConcurrentUpdateSolrServer class
3. Disable ID uniqueness checking
4. Identify the proper mergeFactor + maxSegments for Lucene segment merging
5. Issue OPTIMIZE after huge bulk loads
6. If you use DIH, try not to use transformers – push that work down to the database in SQL
7. Configure autocommit properly
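Point 1 above can be sketched as a simple chunking helper: gather documents into fixed-size batches and send each batch as one /update request (batch size 1000 is an arbitrary example):

```python
def batches(docs, size=1000):
    """Yield fixed-size chunks so documents go to /update in bulk
    rather than one HTTP request per document."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

sizes = [len(chunk) for chunk in batches(list(range(2500)), size=1000)]
print(sizes)  # [1000, 1000, 500]
```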
62. Performance tuning Search
1. Choose the appropriate query parser for the use case
2. Use Solr pagination to return data without long waits
3. If you return a huge data set, use Solr cursors rather than pagination
4. Use the fq clause to speed up queries with an equality condition – no time is spent on scoring, and results are cached
5. If you have a lot of stored fields but queries don't return all of them, use lazy field loading
<enableLazyFieldLoading>true</enableLazyFieldLoading>
6. Use shingling to make phrase search faster
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt" ignoreCase="true"/>
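What the shingle filter produces can be sketched in a few lines – word n-grams over the token stream, as solr.ShingleFilterFactory does with maxShingleSize=2 (unigrams are left out here for brevity):

```python
def shingles(text, size=2):
    """Emit word n-grams (shingles) from a whitespace-tokenized,
    lowercased text - a toy model of ShingleFilterFactory."""
    words = text.lower().split()
    return [" ".join(words[i:i + size])
            for i in range(len(words) - size + 1)]

print(shingles("The quick brown fox"))
# ['the quick', 'quick brown', 'brown fox']
```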
63. Q&A
64. THANK YOU.
WE ARE HIRING!
Alexander Tokarev
Senior Developer
DataArt
Alexander.Tokarev@dataart.com
Alexander Polushkin
Configuration Manager
DataArt
Alexander.Polushkin@dataart.com
Editor's Notes
Good afternoon. My name is Alexander. At DataArt we have started building a search practice, and we chose Apache Solr as the first solution to study. We want to tell you what we have learned about it over a fairly short period of time. My colleague Sasha from the DevOps practice will help me with this.
I plan to have intermediary breaks for small q&a sessions
What distinguishes FTS solutions from other databases? Do you know what stemming is? It is word normalization, i.e. drive, drove and driven are all indexed as drive.
Consider the text "The quick brown fox jumped over the lazy dog". The use of shingling in a typical configuration would yield the indexed terms (shingles) "the quick", "quick brown", "brown fox", "fox jumped", "jumped over", "over the", "the lazy", and "lazy dog" in addition to all of the original nine terms.
Common-grams is a more selective variation of shingling that only shingles when one of the consecutive words is in a configured list. Given the preceding sentence using an English stop word list, the indexed terms would be "the quick", "over the", "the lazy", and the original nine terms.
There are 2 common approaches: an FTS index created inside the main database, or a dedicated FTS server. Which solution is better? It depends on your tasks, performance and scalability requirements. Obviously, FTS servers offer a rich feature set but come with hardware, administration and development overhead. We will concentrate on the dedicated FTS server.
Although FTS solutions look like they are intended for content search only, the spectrum of their usage patterns is rather broad.
Pay attention that figures are calculated by faceted search engine
Suggestions could be made tailored for a particular user
All these patterns are exposed via the FTS API, which permits reusing them without wasting time
Please note that Lucene and Xapian are libraries; Elasticsearch and Solr, for instance, are built on Lucene
Full text search is rather sophisticated stuff across the enterprise because it affects all aspects.
We will have a look into some of these aspects during last part of our presentation.
Any questions before we move to Apache Solr world?
It is worth mentioning that initially it was a full text search engine – now I would rather call it a search engine
Solr is a J2EE application which, as I mentioned, uses the Lucene library.
Storage keeps metadata and the inverted index in a file store. Solr can be configured to store on HDFS.
Container
Lucene
DIH – export data from external sources
Velocity template – UI of Solr admin tool
RH – request handlers process user requests: search, schema management, etc.
Solr has a REST API for the main operations like search and indexing. The Solr developers state there are several groups of APIs.
The main idea was that the Solr API should be transparent enough to work without any additional payload – via the URI only (in contrast to Elasticsearch) – but queries become more complicated and the URIs unreadable
SolrJ is included in the Solr distribution
It is the main structure. Please note that stemming and stopwords aren't applied here.
As you can see, it stores positions as well. This is done for phrase queries like "New Car"
Data types
show real schema
Let’s have a look into ideal reverse index content
The first one removes repeated letters like coffeeeeee
Why isn't the synonym linked – it is actually applied at query time rather than at indexing time
Rollback + NRT + soft/hard commit + indexes – what's new in the index handler
p. 3 – it means that if one user issues a commit, the changes of other users will be committed as well
There are autocommit and commitWithin – they specify a timeframe
p. 4 – update only a small part of the document rather than reindexing it entirely. Without it, the whole document has to be loaded for an update.
In-place – only for docValues
p. 5 is based on the mandatory _version_ field.
Ordinal, JSON, XML, CSV, RTF
My lovely feature
Data import handler
Query parsers
Pay attention to the searcher – it reads a read-only snapshot of the Lucene index. Once we commit, the searcher is reopened, which leads to cache invalidation.
The searcher uses a query parser. There are 3 of them, but we will concentrate on the most widely used, the Lucene query parser.
~ – the number of edits allowed, the so-called edit distance
Proximity is the same as fuzzy, but the edit distance is measured in words
Please note that we don't consider function usage, cross-index and cross-document joins, or faceting
About boosting, relevancy, similarity
Fields are returned only for stored fields
To load huge datasets, so-called cursors are used – out of scope here
Pay attention to score – it is the search relevancy measure. You can manage it via boosting
We will have a look more examples in demo + with debug
Stand-Alone
Master-Slave
Master-Slave and sharding
SolrCloud
Note that a core is not a single-process application.
This approach is the most sophisticated among all the traditional approaches, but it has a couple of limitations. This model of scaling Solr is complex and difficult to manage, monitor, and maintain.
To address the limitations of the traditional architecture, SolrCloud was introduced. If you are planning to implement a hybrid approach, I recommend you consider SolrCloud, covered in the next section.
ZooKeeper is a mature, open source project widely used in distributed systems for coordination, synchronization, and storing shared data.
When you start SolrCloud without specifying a ZooKeeper address, an embedded ZooKeeper instance is started.
For production – a ZooKeeper ensemble
The picture shows that the primary server is called the leader, and has a set of followers for redundancy.
Note that ZooKeeper here consists of three servers.
-ZooKeeper maintains information about the cluster, its collections, live nodes, and replica states, which are watched by nodes for coordination.
-Each shard should have a leader and can have multiple replicas. If the leader goes down, one of the replicas has to be elected as leader. ZooKeeper plays an important role in this process of leader election.
-Any change in the configuration should be uploaded to ZooKeeper.
Let's imagine we have deployed SolrCloud.
Splitting a shard will take an existing shard and break it into two pieces which are written to disk as two (new) shards. The original shard will continue to contain the same data as-is but it will start re-routing requests to the new shards. The new shards will have as many replicas as the original shard. A soft commit is automatically issued after splitting a shard so that documents are made visible on sub-shards. An explicit commit (hard or soft) is not necessary after a split operation because the index is automatically persisted to disk during the split operation.
The Solr Collections API allows you to split a shard into two partitions. The documents of the existing shard are divided into two pieces, and each piece is copied to a new shard. The existing shard can be deleted later at your convenience.
Here is an example to split shard1 in the hellocloud collection into two shards:
These features are listed according to my own interests
Solr has some advanced features which are out of the scope of the presentation but should be mentioned
Streams use a tailored, lightweight JSON format for decent volumes of data (sources, decorators, evaluators)
p. 2 and 3 are based on p. 1
p. 3 is used for recommendation engines
p. 8 is the most complicated: 2 APIs, a lot of performance tricks
the Administration Console reports (Plugin/Stats | Cache)
There are additional caches which are out of your control – the field cache and the field value cache. There is also an interface to implement your own caching strategy as well as warming up.
The document cache should be sized larger than max results × max concurrent queries executed by Solr, to prevent documents from being re-fetched during a query.
ConcurrentUpdateSolrServer uses many threads to connect to Solr, as well as compression, to deliver documents faster
remove the QueryElevationComponent from solrconfig.xml
the more static your content is (that is, the less frequently you need to commit data), the lower the merge factor you want
number of segments on the Overview screen's Statistics section.
No term vectors, docValues, etc.