Making Bible Search Results Relevant (BibleTech 2011)
1. Making Bible Search Results Relevant
Stephen Smith
Bible Gateway
March 25, 2011
2. Agenda
1. How people search Bible Gateway
   1. Four types of searches
   2. Seven ways people find verses
2. How people browse Bible Gateway
   1. Verse popularity
   2. Parsing queries
   3. Different translations
3. Source Data
• Bible Gateway logs, May-November 2010
• 160 GB
• Processed mostly with Perl
• Amazon EC2
9. Phil 4:13 Search Quotes
I can do all things
through Christ who
strengthens me.
10. Seven Verse Search Strategies
Type | Example | Percent
Exact word / phrase | I can do all things | 80
Similar verse | With God, all things are possible | 9
Partial word / misspelling | Christ who strengthen me | 6
Concept | Not alone | 3
Keywords | Gives strength | 2
Minor switch | You can do all things | 0.6
Paraphrase | All things through God | 0.5
12. BG Path Analysis for [strength]
Reference | Text | %
Exod 15:2 | The LORD is my strength and my defense… | 2.17
Isa 40:31 | …will renew their strength. They will soar on wings like eagles… | 1.27
Eph 3:16 | I pray that out of his glorious riches he may strengthen you… | 1.26
Ps 28:7 | The LORD is my strength and my shield… | 1.22
Phil 4:13 | I can do all this through him who gives me strength. | 1.04

Reference | Text | %
Phil 4:13 | I can do all this through him who gives me strength. | 17.5
Deut 31:6 | Be strong and courageous… | 11.7
Deut 20:4 | The LORD your God is the one who goes with you to fight for you… | 10.4
1 Cor 10:13 | No temptation has overtaken you except what is common… | 5.38
Eccl 4:12 | A cord of three strands is not quickly broken. | 3.15
Topical Bible Entries for [strength]
16. Types of Story Queries
[Chart: Synecdoche vs. Section queries, as % of query type, moving toward longer-tail queries]
17. Improving Story-Level Results
For sections, use existing headings to improve
recall from 48% to 63%
Path analysis works well for synecdoche
18. Search Summary
1. Quote: exact word / phrase, similar verse,
partial word / misspelling, concept, keyword,
paraphrase
2. Concept: most-popular queries; use human
intelligence
3. Named Entity: most commonly people
4. Story: use metadata
24. Verse Popularity Distribution
[Chart axis: More popular (log)]
John 3:16
Num 7:51: “one young bull, one ram and
one male lamb a year old for a burnt offering”
Acts 27:6: “There the centurion found an
Alexandrian ship sailing for Italy and put us
on board.”
25. Passage Types
71% passage
29% search
Single verse 21%
Sub-chapter range 21%
Chapter 53%
Super-chapter range 2%
Multiple passages 3%
26. Query Components
Type | Example
Book | Genesis
Book Chapter Verse | Genesis 1:2
Book Chapter | Genesis 1
Book Verse | Philemon 7
Chapter Verse | 1:2
Chapter | Ch 1
Verse | Verse 2
Integer | 3
Translation | KJV
And following… | Genesis 1:2ff.
Range | Genesis 1:2-3
Sequence | Genesis 1, 4
30. Corner Cases
Query | Problem | Best Practice
Genesis 1:5-3 | Reversed range | Gen.1.5,Gen.1.3 + error
Genesis 49-51 | End of range too high | Gen.49-Gen.50
Psalm 121-22 | Omitting digit | Ps.121-Ps.122
1 John 1:2-3 John 2 | Book name conflict | 1John.1.2-3John.1.2
Philemon 1 | Chapter-verse conflict | Phlm.1
Philemon 1-2 | Chapter-verse range conflict | Phlm.1.1-Phlm.1.2
Matt 5,6 | Localization | Matt.5,Matt.6
Gen 315 | Ambiguous reference | Gen.3.15
Ma 2 | Ambiguous book | Matt.2
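The "Omitting digit" and "Reversed range" rows can be illustrated with a small sketch. This is not Bible Gateway's parser; the function name `expand_range_end` is hypothetical, and the rule (borrow the leading digits of the start when the end is shorter and smaller) is one plausible reading of the Psalm 121-22 example.

```python
def expand_range_end(start, end):
    """Return a usable range end, or None if the range is truly reversed.

    Hypothetical helper: if the end is smaller than the start and has fewer
    digits, assume the leading digits were omitted (121-22 -> 121-122).
    """
    if end >= start:
        return end
    s, e = str(start), str(end)
    if len(e) < len(s):
        # Borrow the missing leading digits from the start of the range.
        candidate = int(s[: len(s) - len(e)] + e)
        if candidate >= start:
            return candidate
    # Still reversed (for example 5-3): the caller would show both
    # references separately and report an error, as in the table above.
    return None


print(expand_range_end(121, 22))
print(expand_range_end(5, 3))
```

A real parser would also need per-book chapter counts to clamp a range end that is too high (the Genesis 49-51 row); that check is left out here.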
36. Summary
1. Four types of searches.
2. Concepts dominate most-popular queries;
quotes dominate the long tail.
3. Seven search strategies to find verses.
4. Optimize for the New Testament.
5. Optimize for passage queries, especially
contiguous passages.
6. Accept a variety of book names.
BG is the largest Bible site on the web, with 10 million unique visitors per month.
Goal: give you data-based best practices for designing your own search engine, based on our experience. Search = a query that’s not a Bible reference. Browse = a Bible reference or use of site nav. “Searches vs. verses.”
6.5 months. English-only. Logs arranged by day are easy to map-reduce. EC2 spot instance with a number of cores to do this processing cheaply—it still took a couple of days of processing time to produce the reduced reports.
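A minimal sketch of why day-by-day logs map-reduce easily: each day can be counted independently and the per-day counts merged afterward. The talk's processing was done mostly in Perl; this Python version is only an illustration, and the log format (one query string per line) and the function names are assumptions.

```python
from collections import Counter


def map_day(lines):
    """Map step: count normalized queries in one day's log lines."""
    counts = Counter()
    for line in lines:
        query = line.strip().lower()
        if query:
            counts[query] += 1
    return counts


def reduce_days(daily_counts):
    """Reduce step: merge the per-day counters into one total."""
    total = Counter()
    for counts in daily_counts:
        total.update(counts)
    return total


# Two tiny made-up "days" of queries.
day1 = ["love", "John 3:16", "love "]
day2 = ["Love", "strength"]
totals = reduce_days(map_day(day) for day in (day1, day2))
print(totals.most_common(3))
```

Because the map step touches only one day's file, each day could be handed to a separate core or machine.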
Log-log plot. Almost a perfect logarithmic distribution, meaning that the head of this long tail is very short and the tail is very long. Top 1% of unique queries gets you 75% of query volume. Top 13% gets you 90%. 54% (more than half) of the unique queries are truly unique—never seen before. The problem posed is that the numbers are decently large—5,000 queries to reach 50% of query volume, over 100,000 to reach 80%. These numbers are too big for manual tuning.
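Figures such as "5,000 queries to reach 50% of query volume" come from a cumulative-coverage calculation. A sketch of that calculation, assuming query counts are already aggregated into a `Counter` (the function name and the sample data are hypothetical):

```python
from collections import Counter


def queries_needed(counts, share):
    """Return how many of the most frequent unique queries are needed
    to cover at least `share` (0..1) of total query volume."""
    total = sum(counts.values())
    running = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= share * total:
            return n
    return len(counts)


# Made-up counts, only to show the shape of the calculation.
sample = Counter({"love": 6, "joy": 2, "peace": 1, "hope": 1})
print(queries_needed(sample, 0.75))
print(queries_needed(sample, 0.95))
```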
As you go down the tail, two things happen. First, the percentage of queries with misspellings increases. Based on a sample of 5% of queries. These are words that are misspelled, are in a different language (removed here), or are unknown. Excluding Bible book misspellings, which we’ll get to later, misspelled queries comprise about 2.2% of query volume. On the right are examples from among our top misspellings. We used open-source spell-check software called After the Deadline to collect this data. I like this software because it takes into account contextual clues. (afterthedeadline.com)
Three schemas; this is the first. Concept = “love,” where there’s not a clear right answer. The line between a concept and a quote is blurry. Story = pericope, like “David and Goliath.” Based on five samples of 1,000 total queries classified by hand. Concepts dominate the most popular queries—the various fruits of the spirit (love, joy, peace) are especially popular.
Within a verse, people use different hooks. “Strengthens” (or “strength”) is the most common hook for this verse, followed by “all things.” The two most common variations in the Bible text for this verse are “through him” and “who gives me strength.” Of course, “all things” is a bit misleading.
Second schema of the three. Seven hook types. Looks at 273 queries that led to Phil 4:13. “With God” is Matt 19:26. Handle partial words through stemming or prefix matching. We know from earlier that misspellings are 2.2% of queries. “Not alone” is a bit of a misreading of the verse. Other minor switches: “gives us strength,” “all things in Christ.” Stopwords are one way to deal with minor switches. Paraphrase involves changing or adding a word. The two most common paraphrases change “Christ” to “God” or to “Christ Jesus.”
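A rough sketch of two of the techniques mentioned here: prefix matching for partial words and a stopword list for minor switches. The stopword list and function names are hypothetical, and a production engine would more likely use a real stemmer and index.

```python
# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"i", "you", "can", "do", "who", "me", "us", "the", "through", "in"}


def content_words(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return [w for w in words if w and w not in STOPWORDS]


def prefix_match(query, verse):
    """True if every content word in the query is a prefix of some
    content word in the verse."""
    query_words = content_words(query)
    verse_words = content_words(verse)
    return bool(query_words) and all(
        any(v.startswith(q) for v in verse_words) for q in query_words
    )


VERSE = "I can do all things through Christ who strengthens me."
print(prefix_match("Christ who strengthen me", VERSE))  # partial word
print(prefix_match("You can do all things", VERSE))     # minor switch
print(prefix_match("All things through God", VERSE))    # paraphrase
```

The paraphrase query fails to match because "God" does not appear in the verse, so neither technique is enough for that case.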
A path analysis is where you look at what people search for, and then you see where they click. For concepts like “love,” “joy,” “strength,” doing a path analysis only gets you so far. Don’t look in Nave’s for “strength”—you’ll be disappointed. Collective intelligence doesn’t work as well as individual intelligence when the query is ambiguous.
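In code, a path analysis reduces to counting (query, clicked reference) pairs and turning the counts into shares. A minimal sketch, assuming the click events have already been extracted from the logs as pairs (the event format and function name are assumptions):

```python
from collections import Counter, defaultdict


def path_analysis(events):
    """events: iterable of (query, clicked_reference) pairs.

    Returns {query: [(reference, percent_of_clicks), ...]} with the most
    clicked references first.
    """
    clicks = defaultdict(Counter)
    for query, reference in events:
        clicks[query.strip().lower()][reference] += 1
    return {
        query: [
            (ref, 100.0 * n / sum(counter.values()))
            for ref, n in counter.most_common()
        ]
        for query, counter in clicks.items()
    }


# Made-up events, only to show the calculation.
events = [
    ("strength", "Phil 4:13"),
    ("Strength", "Phil 4:13"),
    ("strength", "Isa 40:31"),
    ("strength", "Exod 15:2"),
]
print(path_analysis(events)["strength"])
```

For an ambiguous concept query, the shares spread thinly across many verses, which is the limitation described above.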
Person = human being. Group = Israelites or anonymous individuals, “Israelite.” Other types of queries are things like non-humans (angels, foreign gods)—anything else that has a proper name. A bit imprecise. Tops are Holy Spirit, Jesus, Moses, David, Satan.
AKA, pericope.
Talking here about queries that refer to sections: multiple verses (“Ten Commandments”, “Sermon on the Mount”). But there’s another kind of query where someone uses a hook from one verse when they want to look at multiple verses: “Fruit of the Spirit” doesn’t just want Gal 5:22 (7/9 of the list, omitting gentleness and self-control); they want the full list in 5:22-23. As you go down the long tail, as with queries in general, quotes start to dominate. (Part standing in for the whole = synecdoche.) Path analysis works well here.
If we focus just on section queries, as opposed to synecdoche queries, then including headings in your search engine improves the results. In a sample of 102 popular story queries, adding Bible headings improved result relevance (recall rose from 48% to 63%). There’s still a ways to go, but it’s a good quick win.
7 types of quotes
As a reminder, when I say “browse,” I mean people who enter a passage reference as a query or who use navigation (including after doing a keyword search).
The OT is 3x as long but gets fewer than half of the overall pageviews. Certain sections are overrepresented, especially Paul’s epistles.
Two orders of magnitude: Psalms is 250x as popular as Obadiah. Stairstepping in NT
Two orders of magnitude. Most popular chapter (Romans 8) is 150x as popular as least popular chapter (Ezek 42, plans for future Temple). Least popular NT chapters: Acts 23-25 (Paul before Felix and his appeal to Caesar)
Three orders of magnitude. The most-popular verse (John 3:16) is seen 1000x as often as the least-popular verse (Num 7:51). We’ve previously released a list of the top 100 verses at BG; it’s available at the BG blog.
The median verse is 10x as popular as the least-popular verse. The least-popular NT verse is 6x as popular as the least-popular verse. The top 10% of verses comprise 46% of views; the top 20% comprise 64% of views.
Keyword searching we talked about in the first part of the talk accounts for 29% of the traffic on BG. BG nav steers people into single verse, sub-chapter range, and chapter. Doesn’t separate input from navigation. The trend in Bible UIs is infinite scrolling, which works well 97% of the time. As a best practice, make sure you can handle the remaining 3%. Of that 3%, 1.9% are in different books, 0.7% in different chapters, and 0.4% in the same chapter.
12 components. Setting aside some internationalization concerns. CV, C, V, and integer only appear in sequences, so use context. If the previous reference mentioned a verse, treat the integer as a verse. Otherwise, treat it as a chapter (except with single-chapter books). With ranges, use the shortest range that works. If given a book, either show a special presentation or show the first chapter. CV is 3x as likely to appear in a sequence as an integer, while an integer is 2x as likely to appear at the end of a range as a CV.
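The context rule for bare integers in a sequence can be sketched as follows. The function name and input shape (a book abbreviation plus already-split parts) are assumptions, and the output uses the dotted reference style from the corner-cases slide. The sketch always reads an integer after a single-chapter book as a verse, so it does not reproduce the "Philemon 1" corner case.

```python
# Books with only one chapter, where a bare integer is read as a verse.
SINGLE_CHAPTER_BOOKS = {"Obad", "Phlm", "2John", "3John", "Jude"}


def resolve_sequence(book, parts):
    """Resolve parts such as ["1:2", "4"] against a book abbreviation.

    A bare integer is a verse if an earlier part named a verse (or the
    book has one chapter); otherwise it is a chapter.
    """
    single = book in SINGLE_CHAPTER_BOOKS
    chapter = 1 if single else None
    verse_context = single
    results = []
    for part in parts:
        part = part.strip()
        if ":" in part:
            chapter_text, verse_text = part.split(":")
            chapter, verse_context = int(chapter_text), True
            results.append(f"{book}.{chapter}.{int(verse_text)}")
        elif verse_context:
            results.append(f"{book}.{chapter}.{int(part)}")
        else:
            chapter = int(part)
            results.append(f"{book}.{chapter}")
    return results


print(resolve_sequence("Gen", ["1", "4"]))    # chapters
print(resolve_sequence("Gen", ["1:2", "4"]))  # verse context carries over
print(resolve_sequence("Phlm", ["7"]))        # single-chapter book
```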
The red ones are misspellings. The top two get you 88% of queries. The top six get you 99%. Only six in this list aren’t misspellings. BG uses several thousand book abbreviations.
My favorite misspelling of “Genesis”
Is, as I’m sure you know, a planet in one of the recent Star Wars movies
1267 ambiguous references like this. Take the lower possibility (wherever you can put the colon farthest to the left) and be right 73% of the time. The exceptions are largely confined to particular chapters–Deut 31 is more popular than Deut 3. Median is 64%.
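A sketch of the "colon farthest to the left" heuristic for a run of digits like 315. It needs verse counts per chapter to discard impossible readings; the table here covers only Genesis 3 (24 verses) and Genesis 31 (55 verses), and the function names are hypothetical.

```python
def split_candidates(digits, verse_counts):
    """Return all valid (chapter, verse) readings of a digit string,
    ordered with the colon farthest to the left first."""
    candidates = []
    for i in range(1, len(digits)):
        chapter_text, verse_text = digits[:i], digits[i:]
        if verse_text.startswith("0"):
            continue  # "3:05" is not a plausible way to write a verse
        chapter, verse = int(chapter_text), int(verse_text)
        if 1 <= verse <= verse_counts.get(chapter, 0):
            candidates.append((chapter, verse))
    return candidates


def resolve_ambiguous(digits, verse_counts):
    """Pick the lower-chapter reading, or None if nothing is valid."""
    candidates = split_candidates(digits, verse_counts)
    return candidates[0] if candidates else None


# Partial verse-count table for Genesis, enough for the example.
GENESIS = {3: 24, 31: 55}
print(split_candidates("315", GENESIS))
print(resolve_ambiguous("315", GENESIS))
```

Exceptions such as Deut 31 outranking Deut 3 would need per-chapter popularity data layered on top of this default.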
1176 ambiguous verses in books. Take the more popular book and be right 87% of the time (one exception: Jude over Judges). Median is 69%.
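A sketch of resolving an ambiguous book abbreviation by popularity. The ranks below are placeholders, not Bible Gateway data; only the outcome for "Ma" (Matthew) is taken from the corner-cases slide. Real ranks would come from the logs, with hand-set overrides for exceptions such as Jude over Judges.

```python
# Hypothetical popularity ranks (lower = more popular); illustrative only.
BOOK_RANK = {"Matthew": 1, "Mark": 2, "Malachi": 3}


def resolve_book(abbreviation, ranks=BOOK_RANK):
    """Return the most popular book whose name starts with the
    abbreviation, or None if no book matches."""
    key = abbreviation.strip().rstrip(".").lower()
    matches = [book for book in ranks if book.lower().startswith(key)]
    return min(matches, key=ranks.get) if matches else None


print(resolve_book("Ma"))
print(resolve_book("Mal."))
```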
This chart uses Google ngram data to quantify how traditional vs. contemporary a Bible translation is. Includes Bibles not on BG.
Traffic patterns are largely consistent across translation types, with a few exceptions. People reading more contemporary / dynamic translations spend more time in Poetry (Psalms and especially Proverbs) than others while spending less time in the Gospels. People reading formal translations spend more time in Paul’s epistles and less time, overall, in the OT.