Apache Solr - search for everyone!
 


Talk presented at the Baksia meetup in Oslo on November 23rd 2011.

  • Welcome... The purpose of this talk is to show you how easy it is to get started and give you an idea of some of the cool things you can do with Solr without too much effort. I love quick and easy-to-understand introductions to tools that help me get started quickly. Hopefully that’s what I’ll be able to provide you with today. Solr is a big project and it would be silly to attempt to cover everything in one evening, so I am going to focus on some of the features that I believe are the easiest to get started with, which I am also familiar with and have had good experience with. So let’s get cracking...
  • Most of you already know me, but I see there are some new faces...
  • Since 2004, Integrasco has been providing social media methodologies and technologies to corporations, agencies, government regulators and other institutions. Dedicated team of vertical experts. Technology platform for analysis of social media, where Apache Solr is a vital component.
  • But enough about me and Integrasco, I bet that’s not why most of you came to this meetup. Let’s talk search. What is search?
  • I bet many of you get this image in your head when you think about search. For most people today, the term search is equal to the name Google. Because of this, «to google» has even entered our dictionaries as an officially accepted verb. Search equals input box and search button! In many cases this is true. But as Google has become an expert at, and as we will hopefully discover Solr can help us with as well, it’s not about searching...
  • It’s all about finding – helping your users find the information they are seeking – not having them search for it! You may think I am just playing with words here, but in my opinion there is a big difference between «searching» and «finding». Let’s not fall into a discussion about semantics here, but let’s just say that our job as engineers is to help people find the information they are looking for, and to spend our time efficiently doing that instead of designing search boxes and buttons!
  • You don’t want your users to have to spend hours searching on your site (or perhaps you do, if your revenue is driven by advertising, but let’s put that discussion on the shelf for later). We want our users to find what they’re looking for with little effort and give them a good experience. The great thing is that Solr can help you with just that: finding stuff. It has a lot of features that can improve your user experience and bring value to your data without too much effort. And as we will see, it does not have to be linked to search engines at all! That’s exactly what I hope to show you today.
  • Apache Solr is an open source...... But before we get into the juicy details...
  • ... let’s learn a little history.
  • Common codebase since March 2011. Which means...
  • ... They are now sharing features and fixes at a much higher rate than before. Many people wonder what the difference between Solr and Lucene is. Up until this merge, Solr was often considered to be an additional layer on top of Lucene, providing functionality that was not available in Lucene itself.
  • What I often find frustrating when looking at new frameworks and technologies is the amount of time and resources you have to invest in order to try them out. I am not talking about reading documentation to get a deeper understanding, which you eventually have to do, but the time you have to invest in order to just get started... I love those 3-point, very quick «getting started» guides that just work. That’s what I hope to leave behind here today. Getting started with Solr is actually very, very simple. Even though my examples here have been taken from a Unix environment, I’ve tried to make them as platform and language independent as possible. I work in a Java environment myself, but Solr can be used just as well from many other environments.
  • I’ve tried to shave the process of getting Solr up and running down as much as possible, and I actually came down to these four steps. This is actually all you need to be up and running with a working instance of Solr. There is obviously a lot of configuration and customization you can do to tailor Solr to your specific needs, but to get started playing this is actually all you need to do.
  • Solr is served by Jetty on port 8983 by default, and opening the Solr admin application in a browser yields this view. It’s by no means eye candy, but then again, that is YOUR part of the JOB: to create a good-looking application that helps users find the valuable information Solr can serve you. As you see in the middle here, there’s a search box and a search button – who would have thought that? Let’s click it!
  • Voila – there’s not much here yet. Obviously, because we haven’t actually indexed anything yet. So let’s add some data.
  • Solr comes with a good set of example data that you can easily import and index to play around with and see some of Solr’s capabilities. These example documents come with the downloaded package and can be imported using the post.sh shell script available in the exampledocs folder. Let’s import a couple of iPod-related documents and see what happens!
  • We refresh our search from before and... Behold! We have search results. Don’t be scared by the XML output here, there are several tools available for working with the response!... So, now that we have some data, it takes us to the obvious part of Solr.
  • Full text searching! This is what Solr was made for, and whenever you have a set of documents that you want to query against, you should consider adding a Solr instance to your system and querying the documents there rather than in a database. You will quickly see the value it adds! Now, as with most other features in Solr, the querying is done via a parameter in the URL...
  • Very simple! Now let’s go back to our admin user interface for another example, just to make sure we don’t scare anyone off with URL parameters in addition to the XML!
  • If you remember the query from before, it was «asterisk, colon, asterisk». Solr INDEXES are composed of FIELDS, and you can specify which field you want to search by querying «field, colon, value». So when we searched for «asterisk, colon, asterisk» we were searching for all values in all fields, hence giving us all documents in the index. Let’s see what happens when we search for documents where the PRICE field contains the value 19.95.
  • ... Unsurprisingly we get documents matching the price 19.95. OK, but what if you want to add your own test data to play around with? Let’s look quickly at the input format that was used.
  • It looks like this. A very simple XML structure that you can easily modify to fit your needs.
  • Here is the full iPod example we just imported and tested with.
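If you would rather generate your own test data than edit the XML by hand, something like the following could work. This is only a sketch in Python: the document values are made up, and the field names (id, name, price) are assumed to exist in your schema, as they do in the example schema that ships with Solr.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes several <field> elements (multi-valued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical test document; the field names must exist in your schema.
xml_doc = build_add_xml([{"id": "TEST-001", "name": "My test cable", "price": 19.95}])
print(xml_doc)
```

Saved to a file, the output should be postable with the same post.sh script used for the example documents.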
  • As I said earlier, do not be scared by the XML if you feel it’s overwhelming. There are several different options both for consuming the response and for indexing data into Solr. And you do not need to specify all your input data in files; there are various means for importing from databases and other commonly used data sources. The example we have looked at so far has been centered around products from a shop. Obviously this is not likely to fit all your needs. As with any other technology for handling data, you need to define a DOMAIN, or a SCHEMA....
  • In Solr, the Schema is one of the core configurations you need to master. It’s at the core of your Solr solution and it’s important to spend time designing this schema. Let’s see what it looks like...
  • Of the key elements, we find....
  • We define the different types we want to have available in our schema...
  • ... And finally we define all the fields that we want in our schema. One neat thing with Solr is the elements defined towards the bottom here: the DYNAMIC FIELDS. These allow us to specify any field we want on a per-document basis after the index is up and running, as long as the field name matches one of the patterns declared in the schema. This gives us good flexibility in cases where only some documents contain some special data. An EXAMPLE from one of our projects at INTEGRASCO was when we received a batch of data we needed to make available for analysis together with our social data. This data contained a lot of META data that did not make sense for social data, but we needed to be able to perform queries across everything. In this situation, our dynamic fields became very handy, as we could simply define them when indexing the new Frankenstein data, and it fit in nicely with the rest of our social data.
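To make the dynamic field idea a bit more concrete, here is a small Python sketch of how extra metadata could be renamed at indexing time so that it matches dynamic field patterns. The suffixes are an assumption modelled on patterns such as *_s and *_i in Solr's example schema.xml, and the metadata keys are invented; this is not how Integrasco's pipeline is implemented.

```python
# Assumed dynamic field suffixes, modelled on patterns such as *_s and *_i
# in Solr's example schema.xml; check your own schema before relying on them.
SUFFIX_BY_TYPE = {bool: "_b", int: "_i", float: "_f", str: "_s"}


def to_dynamic_fields(metadata):
    """Rename ad-hoc metadata keys so they match dynamic field patterns."""
    fields = {}
    for key, value in metadata.items():
        # Exact type lookup, so True/False map to _b and not to _i.
        suffix = SUFFIX_BY_TYPE.get(type(value))
        if suffix is None:
            raise TypeError("no dynamic field suffix for %r" % type(value))
        fields[key + suffix] = value
    return fields


# A regular document plus some extra metadata that only this batch carries.
doc = {"id": "doc-1", "text": "regular social media post"}
doc.update(to_dynamic_fields({"region": "Oslo", "store_count": 12}))
print(doc)
```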
  • The second important part of Solr.
  • solrconfig.xml is a complex XML document, but it is just as important to spend time with as schema.xml. However, we will not go into detail on it today as we don’t have time, but here are some of the key elements that you configure in this file: Settings... How Solr should handle your index files etc. Update chain... You can define your own components to plug into the indexing process.
  • Now that we’ve got Solr configured and up and running, let’s look closer at what it can DO beyond simple searching!
  • The first thing I want to show is FACETS. Facets are category counts for search results, meaning you can provide details about which categories your search results fall within. And as we will see, these categories can be a lot of things. Faceting is in many situations also referred to as FACETED NAVIGATION. Facets are a very powerful feature of Solr when it comes to navigating your search results, and as we said in the beginning, it’s not just about searching, it’s about finding. And facets are a good way to provide your users with neat ways to navigate your data.
  • Here is another good example of how facets CAN be used to enable navigation. This is taken from Finn.no. Now, I do not know whether Finn.no is USING Solr and facets for this information, but it’s a good example of how facets can be used to enable powerful and user friendly navigation.
  • Yet another example of search facets, or faceted navigation. This time from the webshop Komplett.no, where – after you have done a search – you will get details about CATEGORIES, PRODUCERS and PRICE RANGES.
  • And this is how you do this with Solr. You simply add a few parameters to your QUERY URL, specifying which field or fields you wish to receive facet information for, and you’re good to go! To illustrate the output I’ve done a search in our system using the example parameters here, and this would yield....
  • ... This output. As you see here, we get the top 10 languages for my search. You’re not limited to only getting top 10, you choose how many facet counts you want.
  • Another cool thing is that you are not limited to just getting facets for predefined fields, you can also use FACET QUERIES to provide unique value for your data. Here’s one example from our system, where we use facet queries on date fields to generate statistics for provided queries and display trends over time. The way we do this is to add a facet query for each of the timespans we want in our chart. Then we simply render those facet query counts in a chart. Another very common way to use facet queries is to generate price range information in online stores, like we saw in the screenshot from Komplett.no (we have an example query string for this on the next slide).
  • So, once we have our FACETs, we may want to drill further down into our data. This is done via FILTER QUERIES. These are constraints you apply to your query to filter the results you get back from Solr.
  • Here we are going to filter on our facets as we do it in our solution. We have performed a search, and we wish to drill down into this by only looking at Facebook data....
  • So we choose the facebook Facet...
  • ... and this adds a filter to our existing query, limiting our search to the MEDIA type FACEBOOK. Very easy. And the way we actually do this in Solr....
  • … is by adding “fq” parameters to our query. In this case, we added a filter for Facebook. And you can add as many of these as you want, so you’re not limited to filtering on only one thing at a time! Another important thing about filter queries is that they are a big help for optimizing your queries. The reason for this is that each filter is cached separately from the main query, so a filter you reuse across searches does not have to be recomputed, and it restricts the set of documents the main query has to consider.
  • The next thing I want to talk about, which in some cases is a bit more tricky to get working but is definitely worth it for your users, is WORD CLOUDS (or tag clouds, term clouds – whatever you like to call them). We like to call them BUZZ CLOUDS because they tell us what’s buzzing around in the social media space.
  • The way you do this is via SPECIAL SEARCH COMPONENTS in Solr. These are components that you enable in your Solr config and which provide access to additional information about the terms and term vectors Solr is using for your index and individual documents. The TERMS COMPONENT is a way to get information from Solr about all the terms available in your index, and how many documents they appear in. So you get the DOCUMENT FREQUENCY for all your terms. This can be used for creating an auto-suggest feature for your search box, although newer versions of Solr contain a special component for doing just that, so it’s recommended to use that instead. However, the reason I mention it here is that it is absolutely possible to use this to create such a word cloud for your index. What we decided was a better solution, although a bit more processing heavy, was to use the TermVectorComponent. This component gives you term vector information for individual documents, rather than the entire index. This means we can get information about the terms which are included in our result set and provide a word cloud for that, rather than for the entire index.
  • The way we are doing this is by performing a query, then aggregating the term vector information for the documents in our search result. This means a bit more processing, since we have to traverse the documents we’re interested in and aggregate the term frequencies for each document; we are counting how often the terms appear in our result set. We then use this information to render the terms based on how often they appear....
  • ... Which in the end enables us to display clouds like this
  • Now that we’ve looked at some of the powerful, easy-to-use features of Solr, how do we scale it? The question is, do we continue to build towards the sky, adding more and more memory and processing power, or do we spread it out across smaller instances and distribute across them? How do we ensure that our indexes keep a decent size and that we can distribute our search? There are two key features which are useful for this, and which are easily available in Solr...
  • ... That’s SHARDING and REPLICATION.
  • From an INDEXING perspective, sharding is about determining where to index documents. You do this using a SHARDING STRATEGY. This can be an ID BASED strategy, where you place documents in different Solr instances based on a unique ID. Other strategies can be based on GEOGRAPHY or USERNAMES, or, as we do at Integrasco, on DATES. The drawback here is that Solr does not support sharded indexing out of the box, so you need to develop the framework for this yourself. The way we have done this is by creating an index writer for each shard and connecting these index writers to our sharding strategy, so it selects the right one to dispatch documents to.
  • From the SEARCH perspective sharding is very easy. You simply provide your COORDINATING INSTANCE (which can be any of your available instances) with a list of the URLs to the shards you wish to distribute the search across. The coordinating instance will then handle DISTRIBUTION of queries to all the specified shards, and CONSOLIDATE the results before they are sent back to you.
  • And you do not have to query all shards, it’s a fairly easy job to get your client to only specify relevant shards when performing the query. For instance in the case of Integrasco, where we have sharded based on date – we really don’t have to query the entire cluster when we know we only want to search data for 2011. Then we can select the shards for 2011 and only specify those as the shards we wish the query sent off to.
  • And how do we do this? Again, by adding some parameters to our query URL containing the addresses of the shards we want to distribute the query across.
  • When it comes to replication, we build a common MASTER SLAVE relationship, where we have a set of masters where we do all our WRITING and a set of slaves where we do all our SEARCHING. This way we can keep the masters fairly cheap on resources: since they do not need to handle any queries, they do not have to keep caches up to date in memory. Multiple slaves. Repeaters.
  • And the configuration for this is just a couple of lines of XML in your SOLRCONFIG XML file, as you see examples of here. The only bad thing we have seen with the replication feature in Solr is that it is very resource hungry when it is replicating. It uses as much bandwidth as it can and writes to disk as fast as it can, and in cases where there are large amounts of data in each replication batch this can quickly lead to unwanted starvation. So make sure you properly test your replication setup before going into production.
  • When creating slide decks, I love to go on CreativeCommons.org and search for the main topic of the section and use one of the first pictures as the background. That sometimes leads to interesting slides like this one... Anyway, this section is about integration of Solr: how do you integrate it in your project? And not surprisingly, Solr comes with a lot of client libraries for different languages...
  • As you can see, Solr has good support for many different languages, and most of these are easy-to-use client libraries offering METHODS FOR ACCESSING all of the features we’ve covered in this talk. At Integrasco we’re mostly using Java, so the SolrJ library is what we’re using, and it provides a very easy-to-use interface for accessing most of the Solr features we need, more or less right out of the box. I think the only two things we have had to do MORE EXTENSIVE DEVELOPMENT of on top of Solr and SolrJ are SHARDED INDEXING and the WORD CLOUDS using the TermVectorComponent.
  • A very quick example of the use of SolrJ in Java.
  • Spend sufficient time analyzing and figuring out what data your clients need and are interested in. This should be at the core of your planning and help shape the design of your schema and configuration.
  • Similarly, figure out what kind of searches you will be doing. And make sure your schema allows for these queries to happen. Also, make sure you configure the appropriate warm-up queries to ensure your cache is performing optimally for the type of queries you are doing.
  • Once you have the previous two covered, make sure you spend significant time designing your schema.xml. As I said this is at the core of your Solr solution and can be difficult to change once you have a lot of data indexed.
  • This is a very good practice to follow. You never know when you might need a new field for a document, so make sure you’re prepared with dynamic fields.
  • Solr is not for storing data! The documentation even says you should not do it because you should expect the index to become corrupted at one time or another.
  • As discovered by Twitter (and perhaps others), 20 million documents is the max size you should aim for when it comes to your shard size.
  • There are a lot of levers and switches in Solr to use for optimization and shaping your search solution to exactly fit your needs. However, don’t worry about this in the beginning. Mix and match from the default schema to get familiar with how it works. Start out very simple and build on it. You’ll learn it very quickly and see that search functionality is not just for the large players!
  • Looking a little further ahead into the future, there are a lot of exciting things going on with Solr. One of the most exciting developments from my point of view is SolrCloud, which is built on ZooKeeper and will offer a much better setup for large search clusters. Also, there’s work going on with building a Solr distribution based on Hadoop for large scale distributed indexing and search. So it will be very interesting to see where Solr takes us over the next couple of years!
  • This has been a little taste of what Solr has to offer, and hopefully it has shown you that it’s not just for large enterprises or people with huge amounts of data. Hopefully you have seen that Solr can fit very well into situations where you normally would not think of placing a search server, and that you start thinking of new ways to help your users FIND what they’re looking for – or even better, help them find what they did not know they needed to look for.

Apache Solr - search for everyone! Presentation Transcript

  • Apache Solr - search for everyone! http://www.flickr.com/photos/malikdhadha/
  • Jaran Nilsen (twitter.com/jarannilsen)
    • Co-founder and R&D Director at Integrasco
    • Founder and developer of Notpod
    • Leader of javaBin Sørlandet
    • Programmer and Open Source enthusiast
  • A global leader in social intelligence
  • What is search? http://www.flickr.com/photos/denverjeffrey/5133538450/
  • http://www.flickr.com/photos/somegeekintn/3709203268/
  • This is Apache Solr
    • Open Source enterprise search server from Apache
    • Built on Apache Lucene
    • Offers additional features to those of Lucene
  • First, a little history...
  • Solr history
    • Started out as an in-house CNET project for adding search functionality to the CNET website in 2004
    • Donated to Apache Software Foundation in 2006
    • Graduated from incubation status in 2007
  • Solr + Lucene
    • Since version 3.1 (March 2011), Solr and Lucene are now sharing the same codebase.
    • Meaning sharing of features and fixes between the projects at a much higher rate
  • 4 small steps...
    wget http://apache.uib.no/lucene/solr/3.6.1/apache-solr-3.6.1.tgz
    tar xvf apache-solr-3.6.1.tgz
    cd apache-solr-3.6.1/example/
    java -jar start.jar
  • ...and we’re up!
  • cd exampledocs/
    ./post.sh ipod_other.xml
  • The obvious part – full text searching http://www.flickr.com/photos/49889874@N05/6877840735/
  • Query parameters
    • q=yourquery
    • Example: q=android AND ios&rows=100
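A minimal sketch of building such a query URL from code, in Python with only the standard library. The host, port 8983 and the /solr/select path assume the default example setup from the download, so adjust them to your installation. Nothing is sent over the network here; the point is only how the q and rows parameters end up URL-encoded.

```python
from urllib.parse import urlencode

# Assumed base URL of the example Solr instance started with start.jar.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query_url(query, rows=10, base=SOLR_SELECT):
    """Return a Solr select URL for the given query string."""
    params = {"q": query, "rows": rows}
    # urlencode escapes spaces, colons and other reserved characters.
    return base + "?" + urlencode(params)


url = build_query_url("android AND ios", rows=100)
print(url)

# A field query, like the price example from the admin interface.
print(build_query_url("price:19.95"))
```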
  • Don’t worry - it’s not just XML!
  • The Schema http://www.flickr.com/photos/14804582@N08/2111269218/
  • Key elements of schema.xml
    • Unique identifier
    • Default search field
    • Types
    • Fields and dynamic fields
    • Copy fields
  • Solr configuration http://www.flickr.com/photos/esetianto/4099842490/
  • Key elements of solrconfig.xml
    • Settings for your search index
    • Warm-up routines
    • Cache settings
    • Replication
    • Update chain
  • Features http://xkcd.com/619/
  • Facets
  • Just add this to your URL:
    • facet=true&facet.field=field
    • Example: facet=true&facet.field=language
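As a rough illustration, the facet parameters can be assembled and the counts read back like this in Python. The base URL is the assumed default example instance, facet.limit and wt=json are standard Solr parameters that are not on the slide, and the sample response is hand-written with made-up numbers, shaped like Solr's default JSON output where facet values and counts alternate in one flat list.

```python
from urllib.parse import urlencode


def build_facet_url(query, field, limit=10,
                    base="http://localhost:8983/solr/select"):
    """Return a select URL that asks for facet counts on one field."""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),  # how many facet values to return
        ("wt", "json"),          # JSON instead of the default XML
    ]
    return base + "?" + urlencode(params)


def facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hand-written sample shaped like a JSON response; the numbers are made up.
sample = {"facet_counts": {"facet_fields": {"language": ["en", 120, "no", 45]}}}
print(build_facet_url("android", "language"))
print(facet_counts(sample, "language"))
```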
  • Facet queries
  • Facet queries
    &facet=true
    &facet.query=price:[* TO 100]
    &facet.query=price:[100 TO 200]
    &facet.query=price:[200 TO 300]
    &facet.query=price:[300 TO 400]
    &facet.query=price:[400 TO 500]
    &facet.query=price:[500 TO *]
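Writing one facet.query per range by hand gets tedious, so here is a small Python sketch that generates the same parameters from a list of boundaries. The function names are made up for illustration. Note that square brackets are inclusive at both ends, so a price of exactly 100 would be counted in both of the first two ranges.

```python
from urllib.parse import urlencode


def price_range_queries(field, bounds):
    """Build one facet.query per range, open-ended at both extremes."""
    edges = ["*"] + [str(b) for b in bounds] + ["*"]
    return ["%s:[%s TO %s]" % (field, lo, hi)
            for lo, hi in zip(edges, edges[1:])]


def build_facet_query_string(field, bounds):
    """URL-encode facet=true plus one facet.query parameter per range."""
    params = [("facet", "true")]
    params += [("facet.query", q) for q in price_range_queries(field, bounds)]
    return urlencode(params)


queries = price_range_queries("price", [100, 200, 300, 400, 500])
print(queries)
print(build_facet_query_string("price", [100, 200, 300, 400, 500]))
```

The same approach should carry over to the date-based trend charts mentioned in the notes: generate one facet query per timespan and plot the counts.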
  • Now you want to drill down! http://www.flickr.com/photos/kk/4712925031/
  • Filter queries
  • Just add this to your URL:
    • fq=field:value
    • Example: fq=source:facebook.com
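Since you can add several fq parameters to one query, a small helper that repeats the parameter might look like this in Python. The helper name is made up, and the second filter (language:en) is only an illustrative addition, not something taken from the talk.

```python
from urllib.parse import urlencode


def build_filtered_query(query, filters):
    """Return a query string with one fq parameter per (field, value) filter."""
    params = [("q", query)]
    for field, value in filters:
        # Each filter gets its own fq parameter rather than one combined value.
        params.append(("fq", "%s:%s" % (field, value)))
    return urlencode(params)


# "language:en" is an illustrative second filter, not taken from the talk.
qs = build_filtered_query("android", [("source", "facebook.com"), ("language", "en")])
print(qs)
```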
  • Produce «word clouds»
  • Search components
    • TermsComponent
    • TermVectorComponent
  • [Diagram: TermVectorComponent feeding a term vector information aggregator]
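The aggregation step described in the notes can be sketched roughly as follows in Python. The input is a simplified stand-in (document id mapped to term frequencies), not the actual response format of the TermVectorComponent, and the stopword filter and the linear font-size scaling are assumptions added for illustration.

```python
from collections import Counter


def aggregate_term_frequencies(term_vectors, stopwords=frozenset()):
    """Sum per-document term frequencies over all documents in a result set.

    term_vectors maps a document id to {term: frequency}. This is a
    simplified stand-in for the TermVectorComponent response, not its
    actual wire format.
    """
    totals = Counter()
    for terms in term_vectors.values():
        for term, tf in terms.items():
            if term not in stopwords:
                totals[term] += tf
    return totals


def cloud_weights(totals, top_n=50, min_size=10, max_size=40):
    """Scale the most frequent terms linearly to font sizes."""
    top = totals.most_common(top_n)
    if not top:
        return {}
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero
    return {term: min_size + (count - lowest) * (max_size - min_size) // span
            for term, count in top}


# Made-up term vectors for two documents in a search result.
vectors = {
    "doc1": {"iphone": 3, "battery": 1},
    "doc2": {"iphone": 2, "android": 4, "the": 9},
}
totals = aggregate_term_frequencies(vectors, stopwords={"the"})
print(cloud_weights(totals))
```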
  • Scalability http://www.flickr.com/photos/dickyfeng/3249837481/
  • Scaling features
    • Sharding
    • Replication
  • [Diagram: index sharding strategy distributing documents across Solr instance 1, instance 2, instance 3, instance 4 ... instance N]
  • [Diagram: a search for «ipod OR iphone» distributed across the same Solr instances]
  • Just add this to your URL:
    • shards=shard1,shard2
    • Example: q=android&shards=solr1.node.com/solr,solr2.node.com/solr,solr3.node.com/solr
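Here is a small Python sketch of a date-based sharding strategy covering both sides: choosing a shard when indexing, and listing only the relevant shards when searching. The one-shard-per-year layout and the mapping of years to hosts are hypothetical, and this is not Integrasco's actual implementation.

```python
from datetime import date
from urllib.parse import urlencode

# Hypothetical shard layout: one Solr instance per year.
SHARDS_BY_YEAR = {
    2010: "solr1.node.com/solr",
    2011: "solr2.node.com/solr",
}


def shard_for(doc_date):
    """Indexing side: pick the shard a document should be written to."""
    try:
        return SHARDS_BY_YEAR[doc_date.year]
    except KeyError:
        raise ValueError("no shard configured for year %d" % doc_date.year)


def build_sharded_query(query, years):
    """Search side: list only the shards that can contain matching data."""
    shards = [SHARDS_BY_YEAR[y] for y in sorted(years)]
    # Keep "/" and "," readable in the shards value instead of escaping them.
    return urlencode({"q": query, "shards": ",".join(shards)}, safe="/,")


print(shard_for(date(2011, 11, 23)))
print(build_sharded_query("android", [2011]))
```

Keeping the mapping in one place means the indexer and the search client agree on where documents for a given date live.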
  • [Diagram: Replication, showing Indexer, Master and Slave, with a search for «android»]
  • Replication configuration
  • Integration of Solr http://www.flickr.com/photos/certified_su/229016531/
  • Solr has support for many different languages
    • Ruby
    • PHP
    • Java
    • Scala
    • Python
    • .NET
    • Perl
    • JavaScript
  • Tips & Gotchas. Or: how to avoid the sinkholes! http://www.flickr.com/photos/67165210@N00/4661419386/
  • «What data do your clients need?»
  • «Figure out what kind of searches you will be doing»
  • «Spend a significant amount of time designing schema.xml»
  • «Add dynamic fields for ALL your field types»
  • «Do not use Solr as your primary data store!»
  • «The 20 million mark»
  • But most importantly... Don’t panic!
  • http://www.flickr.com/photos/11304375@N07/2046228644
  • http://www.flickr.com/photos/davidw/2201099990/
  • Thank you! http://www.jeremiahblatz.com/personal/pics/Australia_Travel_Pictures_2009/day12/164_Sunrise_Great_Barrier_Reef.html