Other routes to full-text search
Berkeley DB: unstructured
MySQL: structured
MySQL, Postgres, ...: Foreign Key (RDBMS), related content through JOINs, structured data
Solr (Lucene), Xapian, (Whoosh): Denormalized, Inverted Index over unstructured/
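As a rough illustration of what "inverted index" means here, the sketch below maps each term to the set of documents that contain it. It is a toy model, not how Lucene or Xapian are implemented; the function names `build_inverted_index` and `search` and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample corpus.
docs = {
    1: "Solr is an HTTP interface to Lucene",
    2: "Lucene builds an inverted index",
    3: "Xapian also builds an inverted index",
}
index = build_inverted_index(docs)
print(sorted(search(index, "inverted index")))  # [2, 3]
print(sorted(search(index, "lucene")))          # [1, 2]
```

Lookup cost depends on the number of query terms rather than on scanning every document, which is the point of indexing by term instead of by record.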
Solr: HTTP interface to Lucene
Lucene written by Doug Cutting (HADOOP),
first release 2001.
Solr in-house CNET project, open-sourced in 2006
Solr 1.4, Lucene 3.0 released November 2009
Solr + Lucene merged in March 2010
Next version - 1.5/3.1/4.0 - not for production use yet.
Index composed of Documents, composed of Fields
ALL DOCUMENTS HAVE
THE SAME STRUCTURE
•Optional columns
•Denormalized data

Book: Title, ISBN, (FK Person)
Magazine: Title, ISSN, Frequency, (FK Person)
Person: First name, Last name

Document: Entity type, Title, Associated
Field options: required, name, default, copyField
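One way to picture the denormalization step is to flatten each relational record, with its foreign keys resolved, into a single flat document. The sketch below is only an assumption about how that might look: the field names (`id`, `entity_type`, `associated`, `person_ids`), the helper `to_document`, and the placeholder ISBN/ISSN values are all hypothetical.

```python
def to_document(entity_type, key, record, people):
    """Flatten one relational record into a single denormalized document.

    Foreign keys to Person (the hypothetical "person_ids" field) are
    replaced by the people's full names, so no JOIN is needed at query time.
    """
    doc = {"id": "%s-%s" % (entity_type, key), "entity_type": entity_type}
    for field, value in record.items():
        if field == "person_ids":
            doc["associated"] = [
                "%s %s" % (people[pid]["first_name"], people[pid]["last_name"])
                for pid in value
            ]
        else:
            doc[field] = value
    return doc


# Hypothetical sample rows; the ISBN and ISSN values are placeholders.
people = {
    1: {"first_name": "David", "last_name": "Smiley"},
    2: {"first_name": "Eric", "last_name": "Pugh"},
}
book = {"title": "Solr 1.4 Search Server", "isbn": "PLACEHOLDER-ISBN",
        "person_ids": [1, 2]}
magazine = {"title": "Example Monthly", "issn": "PLACEHOLDER-ISSN",
            "frequency": "monthly", "person_ids": [2]}

book_doc = to_document("book", 1, book, people)
magazine_doc = to_document("magazine", 1, magazine, people)
print(book_doc)
print(magazine_doc)
```

Both documents share one structure: fields that do not apply to an entity (such as `frequency` for a book) are simply absent, which corresponds to the "optional columns" bullet.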
There is no update, only overwrite!!!
Solar Search Server
Pub. Freq.
David Smiley, Eric Pugh

Solr 1.4 Search Server
Pub. Freq.
David Smiley, Eric Pugh
Solr can't overwrite without a uniqueKey
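The overwrite behaviour can be modelled with a toy in-memory index. This is a sketch of the semantics only, not Solr's implementation; the class `TinyIndex` and the field names are invented for illustration.

```python
class TinyIndex:
    """Toy model: adding a document replaces any earlier one with the same key."""

    def __init__(self, unique_key=None):
        self.unique_key = unique_key
        self.docs = []

    def add(self, doc):
        if self.unique_key is not None:
            key = doc[self.unique_key]
            # Drop the whole previous document; there is no per-field update.
            self.docs = [d for d in self.docs if d[self.unique_key] != key]
        self.docs.append(dict(doc))


typo = {"id": "book-1", "title": "Solar Search Server"}
fixed = {"id": "book-1", "title": "Solr 1.4 Search Server"}

keyed = TinyIndex(unique_key="id")
keyed.add(typo)
keyed.add(fixed)
print(len(keyed.docs), keyed.docs[0]["title"])  # 1 Solr 1.4 Search Server

unkeyed = TinyIndex()
unkeyed.add(typo)
unkeyed.add(fixed)
print(len(unkeyed.docs))  # 2
```

With a key, the corrected document replaces the typo; without one, both copies stay in the index and both will match searches.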
What do you want to search on?
What do you want to do with results?
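Those two questions map fairly directly onto Solr's standard query parameters: `q` says what to search on, while `fl`, `rows` and `wt` shape what comes back. The sketch below only builds the request URL and sends nothing; the helper name `select_url`, the base URL and the field names are assumptions for illustration.

```python
from urllib.parse import urlencode


def select_url(base_url, query, fields, rows=10):
    """Build a Solr select URL: q = what to search, fl/rows/wt = what to return."""
    params = {
        "q": query,               # what do you want to search on?
        "fl": ",".join(fields),   # which stored fields to return
        "rows": rows,             # how many results per page
        "wt": "json",             # response format
    }
    return "%s/select?%s" % (base_url.rstrip("/"), urlencode(params))


# Assumed local Solr instance and hypothetical field names.
url = select_url("http://localhost:8983/solr", "title:solr", ["id", "title"])
print(url)
# http://localhost:8983/solr/select?q=title%3Asolr&fl=id%2Ctitle&rows=10&wt=json
```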
Scaling to a million pages ...
- talk to the Guardian (Content API)
Separate processes - many readers, single write pipeline.
Beware multiple writers!
Remember standard DB practice -
write to master, read from slave.
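A minimal sketch of that routing rule, assuming one master and several slaves: all updates go to the master, and reads rotate across the slaves. The class `SolrRouter` and the host names are hypothetical, and the replication that keeps the slaves current is assumed to be configured separately.

```python
import itertools


class SolrRouter:
    """Send every write to the single master; spread reads over the slaves."""

    def __init__(self, master, slaves):
        self.master = master.rstrip("/")
        self._slaves = itertools.cycle([s.rstrip("/") for s in slaves])

    def update_url(self):
        # Single write pipeline: there is only ever one update target.
        return self.master + "/update"

    def select_url(self):
        # Many readers: round-robin over the read-only slaves.
        return next(self._slaves) + "/select"


# Hypothetical host names.
router = SolrRouter(
    "http://solr-master:8983/solr",
    ["http://solr-slave1:8983/solr", "http://solr-slave2:8983/solr"],
)
reads = [router.select_url() for _ in range(3)]
print(router.update_url())
print(reads)
```

Keeping the write target in one place makes it harder to end up with multiple writers by accident.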
"UK crime: Betting, gaming and lotteries (year ending 5th April)"
Belgium, Unemployment rate by gender, Total (BE,T)
In the small
Understand Solr schemas - build one for your data.
how do you want to query?
how do you want to show results?
In the large
Understand Solr architecture - build around your data-flow.
how/when do you want to read/write?
what shape/characteristics does your corpus have
Thanks for listening!
questions welcome ...