Apache Solr is a state-of-the-art, high-performance, scalable search server that you can use in your (PHP) application to provide a feature-rich search experience. Besides full-text search, it provides spell checking, highlighting, facets and powerful functions that put it in the realm of a general information retrieval engine, replacing complex database queries you would otherwise need.
Use cases range from e-commerce, real-estate database search, intranets/extranets, content management systems and document management systems to anything that offers exploration of structured and/or unstructured information. The recent addition of geo-aware features makes location searches possible as well.
2. About me
● Currently employed by eZ Systems http://ez.no
● Active in open source community for a while
– Squid http proxy server (about 15 y ago)
– PHP based CMS solutions (mostly eZ Publish)
– executive committee
● Currently fancying :
– PHP as the master glue language for almost everything
– Apache Lucene family of projects (mainly Solr)
– NoSQL (Not only SQL) and scalable architectures
– CMS systems & information management
3. Outline
● Overview of Apache Solr
● Concepts & internals
● How to use it with PHP
● Use cases & tips
● Resources
5. Apache Solr Curriculum Vitae
● Open source Apache Lucene project,
started by Yonik Seeley
● Standalone, enterprise grade search
server built on top of Lucene
● Lives in a Java servlet container
● Access through a REST-ful API
– HTTP
– Primary payload in requests: XML
– Other response formats: PHP, JSON, …
7. Solr in a nutshell
● State of the art, advanced full text search and
information retrieval
● Fast, scalable with native replication features
● Flexible configuration
● Document oriented storage
● Geospatial search
● Native cloud features
8. Full text search main features
● Tunable relevancy ranking on top of internal
similarity algorithms
● Highlighting
● Sorting
● Filtering
● “Drill-down” navigation (facets)
● Automatic related content
● Spell checking
● Multilingual text analysis
10. Tunable relevancy ranking
● “Boosting” at index and query time
– certain types of content
– certain parts of content (“fields”)
– page-rank like if the content has relations
● Elevate request component
– predefined “pages/documents” to the top when certain
keywords are entered
● With customised functions
– more recent articles
– proximity (geolocations)
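Query-time boosting can be sketched as a set of request parameters. This is a minimal illustration in Python (the same parameters can be sent from PHP), assuming the dismax query parser and hypothetical field names `title`, `body` and `published`; whether `defType` and the `recip`/`ms` functions are available depends on your Solr version and field types.

```python
from urllib.parse import urlencode


def boosted_query_params(text):
    """Return request parameters for a dismax query with field and recency boosts."""
    return {
        "defType": "dismax",
        "q": text,
        # Hypothetical fields: a match in "title" weighs three times a match in "body".
        "qf": "title^3 body",
        # Boost function: a more recent "published" date yields a higher score.
        # 3.16e-11 is roughly 1 / (number of milliseconds in a year).
        "bf": "recip(ms(NOW,published),3.16e-11,1,1)",
    }


params = boosted_query_params("green bottle")
query_string = urlencode(params)
```

The resulting query string would be appended to the select URL of your Solr instance.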
11. Filtering
● Does not influence the relevancy
● Narrows down the scope
● Very powerful: full boolean, wildcards,
fuzzy, and unlimited combinations
● Ranges (dates, numbers,
alphanumeric, ...)
Also for implementing security!
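Filters are passed with the `fq` parameter, which may be repeated. A minimal sketch in Python, assuming hypothetical field names (`type`, `year`, `allowed_group`); the last filter illustrates how a security restriction could be expressed as just another filter.

```python
from urllib.parse import urlencode


def filtered_query(text, filters):
    """Build a query string with one "fq" parameter per filter."""
    # A list of pairs (instead of a dict) keeps the repeated "fq" keys.
    pairs = [("q", text)] + [("fq", f) for f in filters]
    return urlencode(pairs)


qs = filtered_query(
    "solr",
    [
        "type:article",                        # exact match on a field
        "year:[2008 TO 2010]",                 # range filter
        "allowed_group:(editors OR public)",   # boolean combination
    ],
)
```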
12. Facets
● Alongside the main query, “facet fields” may be defined,
usually operating on meta-data:
– Type of content
– Publication year
– Keywords
– Author ....
● The result set is returned with the number of hits
within each “facet”
● You can use the selected facet as a subsequent filter
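The facet workflow (request counts, show them, turn a selected value into a filter) can be sketched as follows. This assumes the JSON response writer with its default flat layout, where each facet field is a list alternating value and count; the `author` field and the sample response are made up for illustration.

```python
def facet_pairs(flat):
    """Turn ["a", 3, "b", 1] into [("a", 3), ("b", 1)]."""
    return list(zip(flat[0::2], flat[1::2]))


def facet_filter(field, value):
    """Build a filter query that restricts results to one facet value."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Hypothetical fragment of a decoded Solr response.
response = {
    "facet_counts": {"facet_fields": {"author": ["borgermans", 12, "seeley", 7]}}
}

counts = facet_pairs(response["facet_counts"]["facet_fields"]["author"])
fq = facet_filter("author", counts[0][0])
```

The value of `fq` would be sent as a filter with the next request when the user clicks that facet.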
14. Automatic related content
(“More Like This”)
● Search engine determines itself which are the
important terms of a page and performs a query
● All other normal features can be used
– Filtering
– Sorting
– Facets
16. Spell checking
● Two possible strategies
– Dictionary look-up
– Using the indexed words themselves (recommended)
● Possible “Google” approach using the “best guess”
– Search for “Grein botle”
=> suggests “Green bottle”
● Let Solr return individual keyword suggestions
=> more client side processing required
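The idea of suggesting corrections from the indexed words themselves can be illustrated with a small sketch. It uses Python's `difflib` similarity matching, which is not the algorithm Solr's spell checker uses; the list of indexed terms is made up.

```python
import difflib


def suggest(query, indexed_terms):
    """Replace each query word by its closest indexed term, if there is one."""
    corrected = []
    for word in query.lower().split():
        matches = difflib.get_close_matches(word, indexed_terms, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


suggestion = suggest("Grein botle", ["green", "bottle", "glass", "blue"])
```

A "best guess" like this is what a collated suggestion gives you; with individual keyword suggestions, the client has to do the recombination itself.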
17. Multilingual features
● Adapted tokenizers
● Stemming (reducing words to common form)
– Reduces some spelling errors too!
– May decrease accuracy
● Different algorithms per language
● Normalisation (“latin 1 characters”)
– élève = eleve, Spaß = spass, ...
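Character normalisation can be sketched outside Solr as well. This is a minimal Python illustration of the folding idea, not the implementation of Solr's own filters: it case-folds the text (which expands "ß" to "ss") and strips accents.

```python
import unicodedata


def fold(text):
    """Lowercase the text and remove accents so that variants compare equal."""
    # casefold() lowercases and expands "ß" to "ss".
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Drop the combining marks (accents) left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applying the same folding at index time and at query time is what makes “élève” and “eleve” match.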
19. Performance
● Solr employs intelligent caches
– filters
– queries
– internal indexes
● Optimized for search/retrieval
● Possible autowarming on start up
● When updates are done, caches are
reconstructed on the fly in the background
20. Performance (2)
● Replication
– master-slave for now
– works across platforms with same configuration
– no native OS features needed (or rsync)
– more cloud features under development
● Sharding (client driven)
22. The Solr/Lucene index
● Inverted index
● Holds a collection of “documents” (hello NoSQL)
● Document
– Collection of fields
– Flexible schema!
– Unique ID (user defined)
● Solr uses an XML-based config file:
schema.xml
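The inverted index concept can be shown with a toy example. This sketch only maps terms to document IDs; a real Lucene index also stores positions, frequencies and more. The sample documents are made up.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {"doc1": "green bottle", "doc2": "green glass", "doc3": "blue bottle"}
index = build_inverted_index(docs)

# A query for "green AND bottle" becomes a set intersection.
hits = index["green"] & index["bottle"]
```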
23. Fields
● Various field types, derived from base classes
● Indexed
– contains the inverted index
– usually analyzed & tokenized
– makes it searchable and sortable
● Stored
– also contains the original content
– content can be part of the request response
● Can be multi-valued!
– opens possibilities beyond full text search
24. Field definitions: schema.xml
● Field types
– text
– numerical
– dates
– location
– … (about 25 in total)
● Actual fields (name, definition, properties)
● Dynamic fields
● Copy fields (as aggregators)
25. schema.xml: simple field type examples
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!-- A Trie based date field for faster date range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<!-- A text field that only splits on whitespace for exact matching of words -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
26. schema.xml: more complex field type
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
28. Analysis
● Solr does not really search your text, but rather
the terms that result from the analysis of text
● Typically a chain of
– Character filter(s)
– Tokenisation
– Filter A
– Filter B
– …
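The analysis chain can be mimicked in a few lines to show how the stages feed into each other. This is a simplified sketch, not Solr code: the regular expression, the stopword list and the function names are illustrative only.

```python
import re

STOPWORDS = {"the", "a", "of"}  # illustrative list


def strip_html(text):
    """Character filter: replace tags by spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: split the cleaned text on whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter A: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter B: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Run the chain: character filter, tokenizer, then the token filters in order."""
    tokens = whitespace_tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


terms = analyze("<p>The Green Bottle</p>")
```

The terms that come out of the chain, not the original text, are what ends up in the index and what queries are matched against.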
29. Solr comes with many tokenizers and
filters
● Some are language specific
● Others are very specialised
● It is very important to get this right
otherwise, you may not get what you expect!
30. Text analysis examples (field type “text”)
String       Term position 1   Term position 2   Also indexed
iPad         i                 pad               ipad
élève.       elev
PowerShot    power             shot              powershot
31. Character filters
● Used to clean up text before tokenizing
– HTMLStripCharFilter (strips html, xml, js, css)
– MappingCharFilter (normalisation of characters,
removing accents)
– Regular expression filter
32. Tokenizers
● Convert text to tokens (terms)
● You can define only one per field/analyzer
● Examples
– WhitespaceTokenizer (splits on white space)
– StandardTokenizer
– CJK variants
33. Additional filters
● Many possible per field/analyzer
● Many delivered with Solr out of the box
● If not enough, write a tiny bit of Java or look for
contributions
● Examples ...
35. Reversing Filter
● Reverses the order of characters
● Use: allow “leading wildcards”
● *thing => gniht*
● A lot faster (prefixes)
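Why reversing helps can be shown with a sorted term list: a leading wildcard on the original terms becomes a prefix lookup on the reversed terms. A minimal sketch with made-up terms; Solr's actual implementation differs.

```python
import bisect


def build_reversed_terms(terms):
    """Index time: store every term with its characters reversed, sorted."""
    return sorted(term[::-1] for term in terms)


def leading_wildcard_search(reversed_terms, pattern):
    """Query time: rewrite "*thing" to the prefix "gniht" and scan from there."""
    if not pattern.startswith("*"):
        raise ValueError("expected a pattern with a leading wildcard")
    prefix = pattern[1:][::-1]
    start = bisect.bisect_left(reversed_terms, prefix)
    matches = []
    for term in reversed_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted list: no further matches possible
        matches.append(term[::-1])
    return matches


reversed_terms = build_reversed_terms(["something", "nothing", "thinker", "anything"])
found = leading_wildcard_search(reversed_terms, "*thing")
```

Only the terms sharing the prefix are visited, instead of every term in the index.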
36. Synonyms
● Inject synonyms for certain terms
● Language specific
● Best used for query time analysis
– may inflate the search index too much
– decreases relevancy
37. Stemming
● Reduce terms to their root form
– Plural forms
– Conjugations
● Language specific (or not relevant, e.g. CJK)
● Many specialised stemmers available
– Most European languages
– Dutch (!)
38. Copy fields
● Analysis is done differently for
– searching/filtering
– faceting/sorting
● Stemming and not stemming in different fields
can increase relevance of results
● Use copy fields in schema.xml or do it client
side
39. Geospatial search
● Solr dedicated fields
– Latitude Longitude type
● Special geospatial functions in filtering &
boosting
– Haversine distance (geosphere)
– Simple ranges (squares in 2-D)
– Special query constructs (upcoming)
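For reference, the haversine (great-circle) distance mentioned above can be written out as follows. This is the textbook formula as a standalone sketch, assuming a spherical Earth with a mean radius of 6371 km; it is not taken from Solr's source.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A simple range filter (a "square" on latitude and longitude) is cheaper to evaluate and is often combined with the exact distance for boosting or sorting.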
41. Get the data and feed it
● Most *AMP applications have databases
● Map your data to a “document model”
– denormalization, flattening
– most DB fields can be fed unaltered, Solr takes
care of the rest
● Send it through HTTP as XML
● One constraint: it must be UTF-8!
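Mapping a flattened record to Solr's XML update message can be sketched like this. The `<add><doc><field name="...">` structure is Solr's XML update format; the field names and the sample record are hypothetical, and the example is written in Python for brevity (the same message can be built with any XML library in PHP).

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Serialize a list of dicts to a UTF-8 encoded Solr <add> message."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            # Multi-valued fields are sent as repeated <field> elements.
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="utf-8")


payload = build_add_message(
    [{"id": "42", "title": "élève", "keyword": ["solr", "php"]}]
)
```

The payload would be sent with an HTTP POST to the update handler of your Solr instance, followed by a commit to make the documents searchable.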
42. Searching
● Construct a GET/POST query
● Base parameters
– “q” for query text
– “start” for offset
– “rows” for max number of results to return
Example:
http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
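Building such a URL programmatically, with paging translated to `start` and `rows`, could look like the sketch below. The base URL is the one from the example above; the function name and the JSON response format are choices made for this illustration.

```python
from urllib.parse import urlencode


def select_url(base, text, page=1, rows=10, wt="json"):
    """Build a select URL; page numbers start at 1."""
    params = {
        "q": text,
        "start": (page - 1) * rows,  # offset of the first result
        "rows": rows,                # maximum number of results
        "wt": wt,                    # response format
    }
    return base + "/select?" + urlencode(params)


url = select_url("http://localhost:8983/solr", "content")
third_page = select_url("http://localhost:8983/solr", "content", page=3)
```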
44. PHP client side
● Roll your own classes & functions
– Not difficult, it's REST after all
– Some cURL, XML, JSON or native PHP array parsing
● Use existing libraries
– PECL: http://pecl.php.net/package/solr
– http://code.google.com/p/solr-php-client/
(follows ZF coding standards)
– eZ Components: ezcSearch
● PHP CMS's usually come with their own
– eZ Publish, Drupal, Symfony ...
46. Indexing binary files
● Solr includes the Apache Tika libraries
– convert about any format to plain text
– you can activate a dedicated requesthandler for it
OR
● Use it standalone (command line) for integration into
existing code
See: http://lucene.apache.org/tika/
47. Integrate legacy data
● Use the Solr Data Import Handler
● Able to index DB's directly
– define the schema to use (including possible
joins)
– fire simple requests to Solr to actually
index/update
● Also XML feeds, files (csv), ...
48. e-Commerce
● If you want to sell, make sure users find the products
they want
– Use facets (categories, drill-down, …)
– Push high margin / hot / new products with elevation
– Pay a lot of attention to index and query time analysis
● Feed additional meta-data and use it to tune
– Ratings
– Analytics (Google, Omniture, ...)
49. Have multilingual content?
● Multi-core configuration
– Setup a dedicated Solr core per language
– Each has its own schema definitions, while you
can still use common field names
● If using one index
– Use dynamic fields and create language specific
analyzers for dedicated language
suffixes/prefixes
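With a single index, the client has to route each translated value to the field with the right language suffix. A minimal sketch, assuming dynamic fields such as `*_en` and `*_fr` are defined in schema.xml with language specific analyzers; the suffix convention and field names are hypothetical.

```python
def localized_document(doc_id, language, fields):
    """Attach a language suffix to every translatable field name."""
    document = {"id": doc_id, "language": language}
    for name, value in fields.items():
        # "title" becomes "title_fr", which would match a dynamic field "*_fr".
        document[f"{name}_{language}"] = value
    return document


doc = localized_document("42-fr", "fr", {"title": "Bouteille verte", "body": "..."})
```

At query time the same suffix selects the fields to search, so each language is analyzed with its own tokenizer and stemmer.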
51. Thank you!
Questions?
email: paul dot borgermans at gmail dot com
http://twitter.com/paulborgermans
Please rate this talk/slides:
http://joind.in/talk/view/1504