4. Solr is Lucene-based
Lucene = text search engine library written in Java
All kinds of crazy goodies:
Ranked search
Multiple indexing
Simultaneous read & write
Date-range search
...the list goes on
Platform-independent (thanks, Java!)
Fast & efficient
Index size ~= 20-30% of the size of the indexed data
Very high throughput indexing (95GB/hour)
5. Solr is NoSQL
NoSQL == Non-relational database
RDBMS metaphor:
One database
One table
Denormalized data
Query parameters instead of SQL
“Documents” instead of rows
Bottom line: it's a persistent datastore, and we use it to store data persistently.
7. Master
There can be only one
Read & write operations
Must be secure
Younger, stronger brother of production DB
Home base for Solr slaves
8. Slave
There are many copies
They have a plan: replication
Read-only
Gets a copy of the index from the Solr master every k minutes
Responds to queries
9. Replication
Slaves --HTTP GET--> Master
Replication is differential
Configuration is set in solrconfig.xml
http://tinyurl.com/DESolrRepl
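A rough sketch of the polling request a slave makes, to poke at replication by hand. The master hostname and the handler path are assumptions (the path depends on how your cores are named); `indexversion` is one of the commands Solr's replication handler accepts over HTTP.

```shell
# Hypothetical master address; substitute your own host, port, and core path.
MASTER_URL="http://solr-master.example.com:8983/solr"

# Build the URL for a replication handler command (e.g. indexversion, details).
replication_url() {
  printf '%s/replication?command=%s\n' "$MASTER_URL" "$1"
}

# Print the URL a slave would GET to ask the master for its index version.
replication_url indexversion

# Against a live master, the request itself would be:
# curl -s "$(replication_url indexversion)"
```

Comparing the version reported by the master with the one on a slave is a quick way to see whether a slave has fallen behind.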
10. Document
RDBMS = row; Solr = document
Denormalized relational data
Flatten a bunch of related RDBMS rows into a single Solr document
11. API
Application programming interface
Primary means of communicating with Solr is an HTTP API
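To make "query parameters instead of SQL" concrete, here is a minimal sketch of a search request. The host, port, and the `title` field are made up for illustration; `q`, `rows`, and `wt` are standard Solr query parameters. The query string is not URL-encoded here, so this only works for simple terms.

```shell
# Hypothetical Solr address; adjust to your deployment.
SOLR_URL="http://localhost:8983/solr"

# Build a /select URL from a query string and a row limit.
select_url() {
  printf '%s/select?q=%s&rows=%s&wt=json\n' "$SOLR_URL" "$1" "$2"
}

# Roughly "SELECT * FROM docs WHERE title = 'engineer' LIMIT 10" in RDBMS terms.
select_url 'title:engineer' 10

# Against a live instance, the request itself would be:
# curl -s "$(select_url 'title:engineer' 10)"
```

Because it is plain HTTP, the same request works from curl, a browser, or any client library.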
12. The Good Stuff:
Unix & Diagnostics
“This is the Unix philosophy: Write programs that
do one thing and do it well. Write programs to
work together. Write programs to handle text
streams, because that is a universal interface.”
Doug McIlroy
Examples of things beyond the scope of this talk:
cat
awk
grep
sed
cut
wc
sort
tail
head
Great read: http://matt.might.net/articles/sql-in-the-shell/
13. The Good Stuff:
Unix & Diagnostics
You cannot effectively troubleshoot without parsing logs
You cannot effectively parse logs without good text-parsing tools:
cat
awk
grep
sed
cut
wc
sort
tail
head
No *nix OS? PowerShell!
14. The Good Stuff:
Unix & Diagnostics
Example commands:
tail -f /var/log/celery/project.log
Output the Celery log to stdout, in real time
cat /ebs2/log/celery/project.log | grep -oE 'BUID:([0-9]{0,5})' | grep -oE '[0-9]{0,5}' | sort --unique
Parse the Celery log, printing a list of unique BUIDs
cat /ebs2/log/celery/project.log | grep -B 15 "DocumentInvalid" | grep -E 'Download complete for BUID ([0-9]{1,5})' | awk '{sub(/\[/, "");print $1 " " $2 " " $7 ":" $8}'
Parse the Celery log, listing each BUID whose feed file failed for some reason
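To try the unique-BUID pipeline without a production log, here is a small self-contained sketch. The sample log lines are invented (the real Celery log format isn't shown here), and the quantifier is tightened to `{1,5}` so a bare "BUID:" with no digits is skipped.

```shell
# Hypothetical sample log written to a temp file; the format is made up.
log=$(mktemp)
cat > "$log" <<'EOF'
[2013-05-01 10:00:01] INFO Download complete for BUID:1234
[2013-05-01 10:00:02] INFO Download complete for BUID:77
[2013-05-01 10:00:05] ERROR DocumentInvalid
[2013-05-01 10:01:00] INFO Download complete for BUID:1234
EOF

# Extract every BUID:<digits> token, keep only the digits, and de-duplicate.
unique_buids() {
  grep -oE 'BUID:[0-9]{1,5}' "$1" | grep -oE '[0-9]{1,5}' | sort --unique
}

unique_buids "$log"
# prints:
# 1234
# 77
```

Note that `sort` compares as text by default, which is why 1234 comes before 77; add `-n` for numeric order.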
15. Conclusion
RTFreakingM
http://wiki.apache.org/solr/SolrQuerySyntax
http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SchemaXml
http://django-haystack.readthedocs.org/en/latest/
Experiment & tinker & reinvent the wheel
Get comfortable with the command line – you can't effectively administer Solr
(or any sufficiently complex system) with a web GUI
Read the logs
Connect Solr behavior to application operations