Search in the Biblical Domain - BibleTech: 2011

Search in the Biblical Domain
Brian Seagraves (Bible.org)

What is “Search”?
• Information/Document Retrieval

• Basic Definition:

• Finding previously seen documents that are related to some user-supplied terms.

• Advanced Definition:

• Finding relevant content for some query by understanding the contextual meaning of terms in the search index and query.
• Semantic Search

Types and Sources of Content

• The Bible and its verses
• Articles, journals, and other extra-biblical content
• The web

Information Retrieval Engines

• Sphinx - http://sphinxsearch.com
• Lucene - http://lucene.apache.org/
• Solr - http://lucene.apache.org/solr/
• MySQL Fulltext Search - kinda

Solr
• Open Source
• Full-text search
• Hit Highlighting
• Facets
• Java
• REST-like HTTP/XML and JSON APIs

Solr Documents

• A document represents a distinct piece of content that can be stored/retrieved
• Bible Verse
• Journal Article
• Commentary Chapter/Section
• Web Page

Solr Documents
• Documents have one or more Fields
• Fields have types
• Integer
• Float
• String
• Text
• Date
• and More!

Solr Fields

• Field Types can have:
• Filters
  • Remove parts of the content
• Tokenizers
  • Split content into chunks/tokens

Solr Fields
• The “String” Field Type
• <fieldType name="string" class="solr.StrField" />
• No Filter; No Tokenizer
• Field content won’t be split or changed
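The difference between an analyzed text field and a string field can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's own analysis chain: the stop-word list and the function names `analyze_text` and `analyze_string` are assumptions made for this example.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"in", "the", "and"}


def analyze_text(value):
    """Mimic a text field: tokenize first, then run filters over the tokens."""
    tokens = re.findall(r"\w+", value)          # tokenizer: split into word chunks
    tokens = [t.lower() for t in tokens]        # filter: lowercase each token
    return [t for t in tokens if t not in STOP_WORDS]  # filter: drop stop words


def analyze_string(value):
    """Mimic a string field: no tokenizer, no filter, one unchanged term."""
    return [value]


verse = "In the beginning God created the heavens and the earth."
print(analyze_text(verse))    # ['beginning', 'god', 'created', 'heavens', 'earth']
print(analyze_string(verse))  # the whole verse as a single term
```

A string field therefore only matches when the query equals the entire stored value, which suits ids and abbreviations but not verse text.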

Sample Schema (cont.)
<fieldtype name="sint" class="solr.SortableIntField" omitNorms="true" />
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />

Sample Schema (cont.)
<fields>
  <field name="id" type="sint" indexed="true" stored="true" multiValued="false" />
  <field name="abbr" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="name" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="book" type="sint" indexed="true" stored="true" multiValued="false" />
  <field name="chapter" type="sint" indexed="true" stored="true" multiValued="false" />
  <field name="verse" type="sint" indexed="true" stored="true" multiValued="false" />
  <field name="ot_nt" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="net" type="text" indexed="false" stored="true" multiValued="false" />
  <field name="all_index" type="html_text" indexed="true" stored="false" />
</fields>

<copyField source="net" dest="all_index" />
<uniqueKey>id</uniqueKey>
<defaultSearchField>all_index</defaultSearchField>
<solrQueryParser defaultOperator="OR" />

Put Data in Solr
• Remember, Solr communicates using XML over HTTP
• No concept of updating a document - delete, then add
• To add, POST XML to update handler
• http://localhost:8080/solr/bible/update

Add XML
<add>
  <doc>
    <field name="id">1</field>
    <field name="net">In the beginning God created the heavens and the earth.</field>
  </doc>
</add>
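As a rough sketch of the "POST XML to the update handler" step, the Python below builds an add message with the standard library and sends it over HTTP. The helper names `build_add_xml` and `post_update` are made up for this example, and the default URL is simply the one from the slide; a running Solr instance is assumed for the POST itself.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_update(xml_body, url="http://localhost:8080/solr/bible/update"):
    """POST the XML to the update handler (assumes Solr is running at url)."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


xml_body = build_add_xml({
    "id": 1,
    "net": "In the beginning God created the heavens and the earth.",
})
print(xml_body)
# post_update(xml_body)  # uncomment when a Solr server is available
```

Added documents normally become searchable only after a commit, which can be sent to the same handler as a `<commit/>` message.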

PHP API
• No XML!
• $client = new SolrClient($options); // $options: connection settings (hostname, port, path)
  $doc = new SolrInputDocument();
  $doc->addField('id', 1); // Must be an integer
  $doc->addField('net', 'In the beginning God created the heavens and the earth.');
  $client->addDocument($doc);

Querying Solr

• HTTP GET Request
• http://localhost:8080/solr/bible3/select?q=god
  • Path to Solr: http://localhost:8080/solr
  • Core: bible3
  • Handler: select
  • Query: q=god
• Returns XML by default
• Can return JSON and more
• Queries the defaultSearchField by default
• <defaultSearchField>all_index</defaultSearchField>
• Can query other fields by using the syntax field:value
• http://localhost:8080/solr/bible3/select?q=id:27974
• Multiple queries / Booleans
• http://localhost:8080/solr/bible3/select?q=god%20AND%20book:40
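Building these query URLs by hand makes it easy to forget the escaping, so here is a small sketch that assembles a select URL with Python's standard library. The function name `select_url` is an invention for this example; the base URL and core name are the ones used in the slides, and `wt=json` is the Solr parameter that asks for a JSON response instead of XML.

```python
from urllib.parse import quote, urlencode


def select_url(query, core="bible3", base="http://localhost:8080/solr", **params):
    """Return the URL for a query against the select handler of one core."""
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    query_string = urlencode({"q": query, **params}, quote_via=quote)
    return f"{base}/{core}/select?{query_string}"


print(select_url("god"))
print(select_url("id:27974"))
print(select_url("god AND book:40", wt="json"))
```

The colon in `book:40` comes out percent-encoded as `%3A`, which the server decodes back to the same field:value query.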

Search Multiple Translations (Fields)

• Let’s add some fields: kjv and kjv_index
• Add some copy field directives:
<copyField source="kjv" dest="all_index" />
<copyField source="kjv" dest="kjv_index" />
• Query: “Shew Thyself”
• 0 Results in the NET
http://localhost:8080/solr/bible3/select?q=shew%20thyself
• 360 Results in the Combined index/field

Search Multiple Translations
• + Quasi synonym term/phrase injection
• + Less variation across translations leads to stronger possible matches
• + Matches verses when the source translation isn’t known
• - No control over which translation gets more weight
• - No control over scoring of matches
Search Multiple Translations
• Another way: Dismax
• Can score a document (verse) match based on scores/matches from multiple fields.
• net_index^1 kjv_index^1
• Not exponents - weights
• We’re searching the net_index and kjv_index fields, each with a boost/weight of 1.
• http://localhost:8080/solr/bible4/select?q=respect%20for%20god&defType=dismax&tie=.1&qf=net_index^1%20kjv_index^1&fl=score
• net_index^6 kjv_index^.5
• With these weights the query string ends in: tie=.1&qf=net_index^6%20kjv_index^.5&fl=score
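How the weights and the tie parameter interact can be sketched as follows. This is a simplified model of a disjunction-max combination (best field score, plus tie times the scores of the other fields), not Solr's actual scoring code; the per-field raw scores are made-up numbers and `dismax_score` is a name chosen for this example.

```python
def dismax_score(field_scores, weights, tie=0.0):
    """Combine per-field scores: best boosted score + tie * (sum of the rest)."""
    boosted = sorted(
        (score * weights.get(field, 1.0) for field, score in field_scores.items()),
        reverse=True,
    )
    if not boosted:
        return 0.0
    return boosted[0] + tie * sum(boosted[1:])


# Hypothetical raw scores for one verse: weak NET match, strong KJV match.
raw = {"net_index": 0.2, "kjv_index": 1.0}

equal = dismax_score(raw, {"net_index": 1, "kjv_index": 1}, tie=0.1)
net_heavy = dismax_score(raw, {"net_index": 6, "kjv_index": 0.5}, tie=0.1)
print(equal, net_heavy)
```

With equal weights the KJV match dominates; with net_index^6 kjv_index^.5 the same verse is ranked mostly by its NET match.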

Scoring
• $$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{norm}(t,d) \right)$$

• Basic Factors
• Term frequency in a document (↑ is better)
• Term frequency in corpus (↓ is better)
• Length of matching document (↓ is better)
• “Jesus Wept” - John 11:35
• http://localhost:8080/solr/bible3/select?q=wept
• http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

Topic Tagging
• Use a topically tagged Bible/concordance to mark up each verse, or just key verses
• Helpful for “theme”-based queries.
• “Social Justice” - no good matches
• “Satan” - Many Names
• Name tagging in general can be very helpful

Searching Strong’s

• Add a field for Strong’s: strongs_index
• 1473 1510 2316 11 2316 2464 2532 2316 2384 1510 3756 2316 3498 235 2198
• Most of the benefits of text searching
• “Word” frequency
• Document vs. corpus frequency of search terms
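Because a verse's Strong's numbers are just whitespace-separated tokens, the usual term statistics apply to them directly. A minimal sketch, using the number sequence from the slide and a helper name (`term_frequencies`) made up for this example:

```python
from collections import Counter

# The strongs_index value shown on the slide for a single verse.
strongs_index = "1473 1510 2316 11 2316 2464 2532 2316 2384 1510 3756 2316 3498 235 2198"


def term_frequencies(field_value):
    """Count how often each Strong's number occurs in one verse."""
    return Counter(field_value.split())


frequencies = term_frequencies(strongs_index)
print(frequencies.most_common(2))  # [('2316', 4), ('1510', 2)]
```

A search for 2316 would therefore score this verse higher than a verse where the number appears only once, just as with repeated words in a text field.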

Searching Articles
• Similar approach to text-based queries
• Stem words
• Use Synonyms
• Remove Stop Words
• Without manual tagging, there’s no automatic way to index/search by Bible Reference

Searching Articles

• Article contains reference: “John 3”
• User searches for “John 3:16” or “John 2-4”
• Results: no meaningful matches at best (unless the documents match the query “John”)
Searching Articles
• Solr-based Solutions:
• Identify and index references and their constituent verses using a grammar.
• John 1:1-3 -> John 1:1; John 1:2; John 1:3
• Store in a multivalued field - each reference is a “term”
• Must also parse and expand references in queries in order to match
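The expansion step can be sketched with a small regular-expression "grammar". This is a deliberately narrow illustration: it only handles a single book, one chapter, and a verse or verse range within that chapter. Chapter-only references such as "John 3" or chapter ranges such as "John 2-4" would also need a table of verse counts per chapter, which is not shown. The names `REFERENCE` and `expand_reference` are hypothetical.

```python
import re

# Book name (optionally prefixed with 1-3), chapter, verse, optional end verse.
REFERENCE = re.compile(
    r"^\s*(?P<book>[1-3]?\s?[A-Za-z]+)\s+(?P<chapter>\d+):(?P<start>\d+)"
    r"(?:-(?P<end>\d+))?\s*$"
)


def expand_reference(reference):
    """Expand 'John 1:1-3' into one term per verse."""
    match = REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"unsupported reference: {reference!r}")
    book = match.group("book")
    chapter = int(match.group("chapter"))
    start = int(match.group("start"))
    end = int(match.group("end") or start)
    if end < start:
        raise ValueError(f"verse range runs backwards: {reference!r}")
    return [f"{book} {chapter}:{verse}" for verse in range(start, end + 1)]


print(expand_reference("John 1:1-3"))  # ['John 1:1', 'John 1:2', 'John 1:3']
print(expand_reference("1 John 4:8"))  # ['1 John 4:8']
```

Running the same function over both the article text at index time and the user's query at search time is what lets the multivalued field match term for term.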

Searching Articles
• Relational database-based solution:
• Assign an id to every verse
• Store: id, articleId, verseId
• Parse user query to ids.
• SELECT articleId, COUNT(id)
  FROM article_verse -- table name not given in the slides; assumed here
  WHERE verseId IN (ID_LIST)
  GROUP BY articleId
• Higher count -> the article is likely to be more about that reference than other articles with a lower count
• Large number of rows.
• 15,000 journal articles have > 9,000,000 rows (verse occurrences)
• Can store id, articleId, verseId, count
• Then SUM() the counts for each articleId.
• Negligibly faster.
• Only approx. 3,000,000 rows
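The pre-aggregated variant can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the table name `article_verse`, the column name `occurrences` (standing in for the slide's count column), and the sample rows are all invented for illustration.

```python
import sqlite3


def rank_articles(rows, verse_ids):
    """Rank articles by how often they cite any of the given verse ids.

    rows: iterable of (articleId, verseId, occurrences) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article_verse ("
        "id INTEGER PRIMARY KEY, articleId INTEGER, verseId INTEGER, "
        "occurrences INTEGER)"
    )
    conn.executemany(
        "INSERT INTO article_verse (articleId, verseId, occurrences) VALUES (?, ?, ?)",
        rows,
    )
    placeholders = ",".join("?" for _ in verse_ids)
    ranked = conn.execute(
        "SELECT articleId, SUM(occurrences) AS hits FROM article_verse "
        f"WHERE verseId IN ({placeholders}) "
        "GROUP BY articleId ORDER BY hits DESC, articleId",
        list(verse_ids),
    ).fetchall()
    conn.close()
    return ranked


# Hypothetical data: article 1 cites verse 100 three times and verse 101 once.
sample_rows = [(1, 100, 3), (1, 101, 1), (2, 100, 1), (3, 200, 5)]
print(rank_articles(sample_rows, [100, 101]))  # [(1, 4), (2, 1)]
```

Storing one row per article/verse pair with a count is what shrinks the table, at the cost of summing instead of counting at query time.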

Heterogeneous Indexes
• Not all content is created equal.
• Content quality and its effect on the quality of your results becomes a factor when you move from one resource to more than one
• One Bible, One website, One Journal
• Apply a field or document boost to help normalize results
• Some content gets bumped up and some down
