Good afternoon everyone! Welcome to my lightning talk: The Solr Power. My name is Tareque and I work for a small health industry startup named wisertogether. As you may have noticed from the corny title, my talk is about solr.
This could be turned into a most interesting man joke.
As you might have already guessed, I’m talking about using solr as a NoSQL backend. This approach is not novel in any way, but I wanted to discuss the use case that brought it about. First of all… NoSQL.
We got to a point where retrieving data from a SQL layer just wasn’t an option. The arrow came in the form of a performance hit from querying a complex relational model.
Well why not? Now on to more specific reasons for using solr as a NoSQL backend.
I want to emphasize the word infrequently.
So there are a lot of answer options.
What you were diagnosed with previously and what you were diagnosed with recently.
When you start combining all the survey responses, you start getting some really useful information because it exposes common trends, idiosyncrasies, etc. We use these numbers to generate pretty graphs.
Solr stores everything in the form of a document.
We used sunburnt to interface with solr. If you only need the facets, there is no reason to retrieve the documents, and skipping them saves a lot of memory.
The Solr Power
Tareque Hossain, Sr. Software Engineer
What about it?
• We always associate solr with searching
• solr can also serve as your non-relational data layer
Why solr?
• Hey solr is already part of my stack
• I love solr
• It’s fast, scalable and there are some great python interfaces out there
When would you consider it?
• You have a DB that’s frequently read and infrequently written
• You want robust search & filtering on your data
• You want to leverage the faceting feature
• You want a decently scalable data layer
What’s not so cool?
• Doesn’t support transactions
• Not all SQL queries can be translated into solr queries
• Generating indices can take a long time
• Searching and indexing at the same time brings down performance
But..
• You don’t have to give up your relational data layer
• Create a non-relational layer on top of your relational data layer
• Get the best of both worlds
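One way to picture the layer on top of the relational data is a small job that flattens rows into one flat document per respondent. This is only a sketch: the two tables, their columns, the answer codes and the `_b` field suffix are all made up for illustration, and an in-memory SQLite database stands in for the real relational layer.

```python
import sqlite3

# Hypothetical relational schema, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE respondent (id INTEGER PRIMARY KEY, age INTEGER, profession TEXT);
    CREATE TABLE chosen_answer (respondent_id INTEGER, answer_code TEXT);
    INSERT INTO respondent VALUES (1, 34, 'nurse'), (2, 58, 'teacher');
    INSERT INTO chosen_answer VALUES (1, 'q1_a'), (1, 'q2_c'), (2, 'q1_b');
""")

def build_documents(conn):
    """Flatten each respondent and their chosen answers into one flat dict."""
    docs = {}
    for rid, age, profession in conn.execute(
        "SELECT id, age, profession FROM respondent ORDER BY id"
    ):
        docs[rid] = {"id": str(rid), "age": age, "profession": profession}
    for rid, code in conn.execute(
        "SELECT respondent_id, answer_code FROM chosen_answer ORDER BY rowid"
    ):
        # One boolean field per chosen answer (field naming is an assumption).
        docs[rid][code + "_b"] = True
    return list(docs.values())

documents = build_documents(conn)
```

The relational tables stay the source of truth; the flat documents are a derived copy that can be rebuilt whenever the data changes.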
So what’s the use case?
• We deal with medical survey data
• Say:
  – About 300 multiple choice questions
  – Responses can be multi-dimensional
  – 7000+ different answer choices per question
  – 2000+ respondents per survey
  – 15+ surveys and growing
What a survey question looks like
When were you diagnosed with the following types of Arthritis?

                       Rheumatoid  Traumatic  Psoriatic  Osteoarthritis  Other
                       Arthritis   Arthritis  Arthritis
Less than a year ago   ☑           ☐          ☐          ☐               ☐
More than a year ago   ☐           ☐          ☑          ☐               ☐
Storing a single response
When were you diagnosed with the following types of Arthritis?

                       Rheumatoid  Traumatic  Psoriatic  Osteoarthritis  Other
                       Arthritis   Arthritis  Arthritis
Less than a year ago   1           0          0          0               0
More than a year ago   0           0          1          0               0
Aggregating over 2000 responses
When were you diagnosed with the following types of Arthritis?

                       Rheumatoid  Traumatic  Psoriatic  Osteoarthritis  Other
                       Arthritis   Arthritis  Arthritis
Less than a year ago   63          155        19         27              268
More than a year ago   190         46         8          213             325
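The step from a single stored response to the aggregated table is just an element-wise sum over 0/1 grids. A minimal sketch in plain Python, assuming each response is kept as a 2 x 5 grid like the one on the slide; the first grid is the slide's example and the other two are made up.

```python
def aggregate(responses):
    """Element-wise sum of equally shaped 0/1 grids."""
    rows, cols = len(responses[0]), len(responses[0][0])
    totals = [[0] * cols for _ in range(rows)]
    for grid in responses:
        for r in range(rows):
            for c in range(cols):
                totals[r][c] += grid[r][c]
    return totals

# Rows: less than a year ago, more than a year ago.
# Columns: rheumatoid, traumatic, psoriatic, osteoarthritis, other.
responses = [
    [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]],  # the response shown on the slide
    [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0]],  # made-up response
    [[1, 0, 0, 0, 0], [0, 0, 0, 0, 1]],  # made-up response
]
totals = aggregate(responses)
```

This loop is what faceting replaces: instead of summing in application code, solr counts how many documents have each boolean field set.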
The Document Structure
• Each survey response = solr document
• Up to 3000 boolean variables per document indicating chosen answers
• Added meta information: age, profession, interests
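A rough sketch of what one such document could look like before it is sent for indexing. The id format, the field names and the `_b` suffix for boolean fields are assumptions for illustration, not our actual schema; only three of the up to 3000 boolean fields are shown.

```python
import json

def make_document(response_id, chosen_fields, all_fields, meta):
    """Build one flat document: metadata plus a boolean per answer field."""
    doc = {"id": response_id}
    doc.update(meta)
    for field in all_fields:
        doc[field] = field in chosen_fields
    return doc

# Hypothetical field names, one per answer cell.
ALL_FIELDS = ["q1_lt_1y_rheumatoid_b", "q1_gt_1y_psoriatic_b", "q1_gt_1y_other_b"]

doc = make_document(
    response_id="survey7-resp42",
    chosen_fields={"q1_lt_1y_rheumatoid_b", "q1_gt_1y_psoriatic_b"},
    all_fields=ALL_FIELDS,
    meta={"age": 34, "profession": "nurse", "interests": ["running", "yoga"]},
)

# Serialized as a JSON array of documents, ready to post to an update endpoint.
payload = json.dumps([doc])
```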
Querying
• Filter by age, interest, profession
• Facet across boolean field
• Result: what group of people chose what group of answers
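To make the filter-then-facet idea concrete, here is a sketch that builds the raw request parameters with the standard library instead of going through a client. The parameter names (q, fq, facet, facet.field, rows, wt) are standard solr request parameters; the field names and filter values are hypothetical.

```python
from urllib.parse import urlencode

def facet_query_params(facet_fields, min_age, max_age, profession):
    """Build select-handler parameters: filter on metadata, facet on booleans."""
    params = [
        ("q", "*:*"),                               # match every document
        ("fq", f"age:[{min_age} TO {max_age}]"),    # filter by age range
        ("fq", f"profession:{profession}"),         # filter by profession
        ("facet", "true"),
        ("rows", "0"),                              # facet counts only, no documents
        ("wt", "json"),
    ]
    params += [("facet.field", field) for field in facet_fields]
    return params

params = facet_query_params(
    ["q1_lt_1y_rheumatoid_b", "q1_gt_1y_rheumatoid_b"],  # hypothetical fields
    min_age=30,
    max_age=40,
    profession="nurse",
)
query_string = urlencode(params)
```

The resulting query string would be appended to the select handler URL of whatever core holds the survey documents.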
Why solr is awesome..
• Faceting across boolean field uses very little memory
• Combining 3000 fields for 2000 documents takes 1 ~ 2 ms
• Allowed us to reduce API response time from a variable of 2 ~ 15 seconds (sucked!) to an almost constant ~50 ms
Good to know..
• sunburnt: Awesome python solr interface github.com/tow/sunburnt
• Programmatic querying as well as raw queries
• Supports most advanced solr options
• If you only require facets, specify rows=0
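With rows=0 the response carries no documents, only the facet counts. A sketch of pulling the "true" counts out of a JSON response, assuming the default flat layout where each facet field is a list alternating value and count. The sample response, its field names and its numbers are made up for illustration.

```python
import json

def true_counts(response):
    """Map each faceted boolean field to how many matching docs have it set."""
    counts = {}
    facet_fields = response["facet_counts"]["facet_fields"]
    for field, flat in facet_fields.items():
        # Flat layout: [value, count, value, count, ...]
        pairs = dict(zip(flat[0::2], flat[1::2]))
        counts[field] = pairs.get("true", 0)
    return counts

# Made-up sample response for a rows=0 facet query.
sample = json.loads("""
{
  "response": {"numFound": 2000, "start": 0, "docs": []},
  "facet_counts": {
    "facet_fields": {
      "q1_lt_1y_rheumatoid_b": ["false", 1937, "true", 63],
      "q1_gt_1y_rheumatoid_b": ["false", 1810, "true", 190]
    }
  }
}
""")

counts = true_counts(sample)
```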