Creating an Open Source Genealogical Search Engine with Apache Solr


Set Your Records Free!

LeafSeek is a new tool that helps you turn your genealogical or historical record collections into searchable online databases. Combine multiple datasets of different types — such as birth, marriage, and military records — into one unified searchable website. Find inter-connections in your data that you never noticed before.

With great features like built-in geo-spatial searches, pop-up Google Maps, Beider-Morse Phonetic Matching, name synonyms, and language localization, LeafSeek can help you turn your spreadsheets of names and dates into a full-featured genealogy search engine. It’s designed for researchers and genealogy societies alike.

Oh, and one more thing: LeafSeek is free and open source. No strings attached.

Published in: Technology

Creating an Open Source Genealogical Search Engine with Apache Solr

  1. Creating an Open Source Genealogical Search Engine With Apache Solr
      Brooke Schreier Ganz
      Twitter: @LeafSeek
  2. Hi, I'm Brooke
      • I make web stuff for fun, and (sometimes) for profit
      • Web Developer at and Disney Consumer Products
      • Lead Programmer at (yikes, sorry about that)
      • Senior Web Producer at Bravo cable TV network and its spin-off websites
      • Big dork
      • Big genealogy dork
      • #BigData dork
  3. Meet Gesher Galicia
      • Non-profit 501(c)(3) genealogy society
      • Founded in 1993
      • Hundreds of members, worldwide
      • E-mail discussion group
      • New website development in progress (existing website is fugly)
      • Needs a search engine… for data
  4. The Old Problem
  5. The Old Problem
  6. The New Problem
  7. The New Problem
      • Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)
      • Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)
  8. Diverse Data Shapes
  9. Diverse Data Shapes
  10. Diverse Data Shapes
  11. Existing solutions
      • They're okay... for small numbers of databases, with small amounts of data
        – Steve Morse's One-Step Tool Creator
        – Roll-your-own solution with PHP and MySQL
      • Both get more difficult to manage as data sets increase in number and complexity
  12. In space, no one can hear your data scream
  13. To Sum Up
      • There are lots of ways to publish your tree
      • …but not so many ways to publish your data
      • Surely there must be a way to deal with this?
  14. So I Made A Thing
      But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek
  15. This is the part where I show you all the shiny new All Galicia Database
  16. Meet Apache Solr
      • Highly functional open source search platform
      • Based on Apache Lucene (Java)…
      • …plus a web wrapper/API
      • Not the prettiest or simplest tool
      • FREE and open source
  17. Saves Time, and Heartache
  18. Saves Time, and Stomachache
  19. File Structure: Back-End
  20. Welcome to /conf
  21. The Important Stuff
  22. solrconfig.xml
  23. solrconfig.xml
      Make sure this part is configured, so you can import data:
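The slide's screenshot did not survive in this transcript. As a minimal sketch, assuming the usual DataImportHandler registration (a `requestHandler` whose `config` default points at `db-data-config.xml`), the Python below checks whether a solrconfig.xml has that part in place. The sample XML and the `find_dataimport_config` helper are illustrative, not LeafSeek's actual file.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the relevant solrconfig.xml section (illustrative only).
SAMPLE_SOLRCONFIG = """
<config>
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>
</config>
"""

DIH_CLASS = "org.apache.solr.handler.dataimport.DataImportHandler"


def find_dataimport_config(solrconfig_xml):
    """Return the data-config file the import handler points at, or None."""
    root = ET.fromstring(solrconfig_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("class") != DIH_CLASS:
            continue
        for entry in handler.iter("str"):
            if entry.get("name") == "config":
                return (entry.text or "").strip()
    return None


print(find_dataimport_config(SAMPLE_SOLRCONFIG))
```

If the helper returns `None`, the import handler is probably not registered and the import step later in the deck would have nothing to call.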
  24. How to get your data into Solr
      • Step 1: Make a properly-formatted spreadsheet
      • Step 2: Save spreadsheet as a .CSV file
      • Step 3: Create a MySQL database + table
      • Step 4: Import CSV into that new table
      • Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)
      • Step 6: Add this table's information to db-data-config.xml
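Steps 2 through 5 can be sketched as follows. This is a rough illustration rather than LeafSeek's own import path: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere, and the CSV contents, the table name `births`, and the `load_csv` helper are all made up. In MySQL the equivalent key column would be declared as `id INT AUTO_INCREMENT PRIMARY KEY`.

```python
import csv
import io
import sqlite3

# Made-up spreadsheet export; column names and rows are illustrative.
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kowalski,Jan,1872,Lemberg
Nowak,Anna,1880,Tarnopol
"""


def load_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing id and import the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" '
        f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})"
    )
    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
imported = load_csv(conn, "births", SAMPLE_CSV)
print(imported, conn.execute('SELECT id, "Surname" FROM births').fetchall())
```

Every column is loaded as text here, which matches the deck's note that years are treated as text for now.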
  25. db-data-config.xml
      • Basic XML file that tells Solr how to grab data from your MySQL database(s)
      • Add new <dataSource> for new databases
      • Add new <entity> for new tables within the databases
      • You need to make sure your MySQL connector .jar is installed for this to work
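To make the `<dataSource>` and `<entity>` relationship concrete, here is a small Python sketch that generates a data-config file from a mapping of databases to tables. It assumes the standard DataImportHandler layout with a JDBC data source. The database name, table names, credentials, and the `build_data_config` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_data_config(databases):
    """Build a data-config document from {database_name: [table, ...]}."""
    root = ET.Element("dataConfig")
    # One <dataSource> per MySQL database.
    for db_name in databases:
        ET.SubElement(root, "dataSource", {
            "name": db_name,
            "type": "JdbcDataSource",
            "driver": "com.mysql.jdbc.Driver",  # needs the MySQL connector .jar
            "url": f"jdbc:mysql://localhost/{db_name}",
            "user": "solr_reader",       # hypothetical credentials
            "password": "change-me",
        })
    # One <entity> per table, tied back to its database.
    document = ET.SubElement(root, "document")
    for db_name, tables in databases.items():
        for table in tables:
            ET.SubElement(document, "entity", {
                "name": table,
                "dataSource": db_name,
                "query": f"SELECT * FROM {table}",
            })
    return ET.tostring(root, encoding="unicode")


config_xml = build_data_config({"galicia": ["births", "tax_lists"]})
print(config_xml)
```

Adding a new dataset then means adding one more table name to the mapping and regenerating the file.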
  26. Import!
  27. schema.xml
      • FieldTypes, Fields, and CopyFields
      • FieldTypes give indexing and querying instructions to “buckets”
      • Fields say what's what and whether to make something facetable or not
      • CopyFields collect Fields together into extra FieldTypes
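The three building blocks fit together roughly like this. The sketch below uses Python to generate a schema fragment so the pieces can be inspected. The analyzer chain, the `string` type (assumed to be defined elsewhere in the schema), the `FathersSurname` column, and the catch-all `all_surnames` field are assumptions for illustration, not LeafSeek's actual schema.

```python
import xml.etree.ElementTree as ET

schema = ET.Element("schema", {"name": "example"})

# FieldType: a "bucket" carrying indexing and querying instructions.
types = ET.SubElement(schema, "types")
surname_type = ET.SubElement(
    types, "fieldType", {"name": "surname", "class": "solr.TextField"}
)
analyzer = ET.SubElement(surname_type, "analyzer")
ET.SubElement(analyzer, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
ET.SubElement(analyzer, "filter", {"class": "solr.LowerCaseFilterFactory"})

# Fields: one per source column, plus a catch-all that uses the custom type.
fields = ET.SubElement(schema, "fields")
source_columns = ("Surname", "FathersSurname")  # second name is hypothetical
for column in source_columns:
    ET.SubElement(fields, "field", {
        "name": column, "type": "string", "indexed": "true", "stored": "true",
    })
ET.SubElement(fields, "field", {
    "name": "all_surnames", "type": "surname",
    "indexed": "true", "stored": "false", "multiValued": "true",
})

# CopyFields: funnel every surname column into the catch-all field.
for column in source_columns:
    ET.SubElement(schema, "copyField", {"source": column, "dest": "all_surnames"})

print(ET.tostring(schema, encoding="unicode"))
```

Searching `all_surnames` would then match a name wherever it appears in a record, while the original columns stay available for display and faceting.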
  28. schema.xml - FieldTypes
      • 5 Custom FieldTypes (so far):
        – givenname
        – surname
        – surname_bmpm (phonetic)
        – place (note: not merely town)
        – year (which we're treating as text right now)
  29. schema.xml - FieldTypes
  30. schema.xml - FieldTypes
  31. schema.xml - Fields
  32. schema.xml - Fields
      • Uppercase fields take their names from the MySQL column names
      • Examples:
        – Year
        – SchoolYear
        – Surname
        – FathersTown
        – MothersFathersGivenName
        – MaternalGrandfathersGivenName
  33. schema.xml - Fields
      • Lowercase fields are added as the data is imported into Solr, and start with the prefix record_
      • Examples:
        – record_type (birth, death, tax, whatever)
        – record_source (name of repository)
        – record_latlong (latitude,longitude)
        – record_id (required!)
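A field like `record_latlong` is what makes radius searches possible. As a hedged sketch, assuming the field is indexed with one of Solr's lat/long spatial types, the Python below builds the request parameters for Solr's `geofilt` filter. The coordinates, radius, and the `nearby_records_params` helper are made up.

```python
from urllib.parse import urlencode


def nearby_records_params(lat, lon, radius_km, query="*:*"):
    """Build Solr parameters for records within radius_km of a point."""
    return {
        "q": query,
        "fq": "{!geofilt}",          # spatial filter
        "sfield": "record_latlong",  # field holding "latitude,longitude"
        "pt": f"{lat},{lon}",        # centre point of the search
        "d": str(radius_km),         # distance in kilometres
    }


params = nearby_records_params(49.84, 24.03, 50)
print(urlencode(params))
```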
  34. schema.xml - Fields
      • You do not have to explicitly define every Field.
      • If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.
      • Which is fine.
      • But IMHO it's better to define everything anyway so you can remember what's what and what you are doing to it.
  35. schema.xml - CopyFields
  36. Add-ons and nice-to-haves (for the back-end)
      • Wildcards, and lots of 'em
      • Non-name words handled through stopwords.txt
      • Nicknames and name synonyms handled through synonyms.txt
      • Two files included:
        – synonyms_-_american-anglo-saxon.txt
        – synonyms_-_polish-ukrainian-jewish.txt
      • Should be based on your data and your historical/ethnic community standards
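Solr's synonyms.txt format lists equivalent terms on one comma-separated line. The Python sketch below mimics what that expansion does to a searched name, so you can sanity-check a synonyms file before deploying it. The sample entries and both helper functions are illustrative and do not come from the two bundled files.

```python
# Made-up entries in synonyms.txt style: one group of equivalents per line.
SAMPLE_SYNONYMS = """# nicknames
William, Bill, Will
Elizabeth, Betty, Liz
"""


def parse_synonyms(text):
    """Map each lowercased name to the full set of names on its line."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" in line:
            continue  # this sketch skips comments and explicit mappings
        names = {part.strip().lower() for part in line.split(",") if part.strip()}
        for name in names:
            groups.setdefault(name, set()).update(names)
    return groups


def expand(name, groups):
    """Return every name that should match when `name` is searched."""
    key = name.lower()
    return sorted(groups.get(key, {key}))


groups = parse_synonyms(SAMPLE_SYNONYMS)
print(expand("Bill", groups))
```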
  37. More add-ons and nice-to-haves (for the back-end)
      • Translate your site into different languages
        – Multi-lingual content deserves a real multi-lingual website
        – Pass user preferences through GET value or through Accept-Language header or read from a cookie or whatever you want
      • Built-in performance monitoring hooks for New Relic
      • Soundalike searches for surname variants
        – Levenshtein distance
        – “Regular” Soundex, Metaphone, Caverphone, etc.
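For readers unfamiliar with Levenshtein distance, it counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a plain Python implementation of that textbook algorithm, not the code Solr uses internally. The `close_surnames` helper and the sample surnames are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def close_surnames(query, candidates, max_distance=2):
    """Keep candidates within max_distance edits of the query, ignoring case."""
    return [
        name for name in candidates
        if levenshtein(query.lower(), name.lower()) <= max_distance
    ]


print(close_surnames("Horowitz", ["Horowitz", "Horovitz", "Hurwitz", "Gurevich"]))
```

Edit distance only catches spelling-level variation, which is part of why the deck goes on to phonetic matching.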
  38. This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)
  39. Relevancy
      • Right now, we're using exact matches
      • (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)
      • Like “Old Search” on
      • DisMax! Boosting fields! Scoring!
      • (…but not yet)
      • Problems with records with multiple people's names in the record
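Since DisMax boosting is listed as a future step, the following is only a sketch of what such a request might look like, written in Python for illustration. It builds the `defType` and `qf` parameters that Solr's DisMax parser reads. The boost values, the `all_surnames` catch-all field, and the `dismax_params` helper are assumptions, not anything LeafSeek ships.

```python
from urllib.parse import urlencode


def dismax_params(user_query, boosts):
    """Build DisMax parameters, weighting each field by its boost."""
    query_fields = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {"q": user_query, "defType": "dismax", "qf": query_fields}


# Hypothetical weighting: the record's main surname counts more than a
# surname that merely appears somewhere else in the record.
params = dismax_params("horowitz", {"Surname": 3.0, "all_surnames": 1.0})
print(urlencode(params))
```

Weighting the primary name field more heavily is one possible way to approach the multiple-names-per-record problem noted on the slide.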
  40. Lots of Front-End Options
      • Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr
      • Django/Python: Haystack, Sunburnt, solrpy, pysolr
      • Older PHP options: PECL, solr-php-client
      • Plugins for blog/CMS systems: Drupal, WordPress
  41. Meet Solarium
      • New, open source PHP wrapper for Solr
      • Very active development
      • Version 2.4 coming soon
  42. File Structure: Front-End
  43. Meet Solarium: The Config
  44. Meet Solarium: The Guts
  45. Meet Solarium: The Guts
      • You choose the parts of your data to facet
      • Data is submitted to the front-end by POST, not by GET, so the URL never changes
      • You can (and should) paginate results listings
      • You can't actually see the Solr server's URL from the front-end, not even in view-source
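Solarium handles this in PHP, but the underlying Solr parameters are the same whichever client builds them. As a language-neutral illustration, here is a Python sketch of how a page number and a list of facet fields turn into a Solr request. The page size and the `search_params` helper are made up.

```python
from urllib.parse import urlencode


def search_params(query, page, per_page=25, facet_fields=("record_type",)):
    """Build Solr parameters for one page of results plus facet counts."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = [
        ("q", query),
        ("start", str((page - 1) * per_page)),  # offset of the first result
        ("rows", str(per_page)),                # results per page
        ("facet", "true"),
        ("facet.mincount", "1"),                # hide empty facet values
    ]
    params += [("facet.field", field) for field in facet_fields]
    return params


params = search_params("Surname:kowalski", page=3,
                       facet_fields=("record_type", "record_source"))
print(urlencode(params))
```

Because the front-end builds this request server-side, the browser never needs to see these parameters or the Solr server's address.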
  46. Add-ons and nice-to-haves (for the front-end)
      • A welcome screen with information about the database's contents
      • Instructions (maybe twice)
      • How many records in the database?
      • How many datasets?
      • What features are coming next?
      • What datasets are coming next?
  47. Add-ons and nice-to-haves (for the front-end)
      • Make good UI choices
      • Pop-Up Google Maps
      • Tooltips to reduce UI clutter
      • Cross-browser compatibility
      • Still stuck with IE 7 and 8
      • CSS and code that degrades gracefully
      • No small text
  48. Bird's Eye View of Your Data
      • What (surnames, towns, etc.) do I have in my data?
      • What are the TOP (surnames, towns, etc.) in my data?
      • Finding incorrect data
        – Outlying years and dates
        – Figure out that hard-to-read surname
      • Make charts and graphs from your data
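In Solr these questions are typically answered with facet counts. The same idea can be tried offline on a handful of records, as in this Python sketch. The sample records, the plausible year range, and both helper functions are invented for illustration.

```python
from collections import Counter

# Made-up records; years are strings, as in the text-typed year field.
records = [
    {"Surname": "Kowalski", "Year": "1872"},
    {"Surname": "Kowalski", "Year": "1087"},  # likely a typo for 1887
    {"Surname": "Nowak", "Year": "1880"},
]


def top_values(records, field, n=3):
    """Most common values of a field, like a facet sorted by count."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Records whose Year is non-numeric or outside the expected range."""
    flagged = []
    for record in records:
        year = record.get("Year", "")
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged


print(top_values(records, "Surname", 2))
print(outlying_years(records, 1780, 1920))
```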
  49. The (Back-End) Future! (Maybe.)
      • Date ranges, instead of just years
      • Auto-complete as you type
      • “Did you mean...?” (based on data frequency)
      • “More Like This” (would have to do scoring)
      • Record bookmarking system (hashes?)
  50. The (Front-End) Future! (Maybe.)
      • Hierarchical facets for locations
      • Disambiguating locations
      • Social sharing of individual records
      • New genealogy data schema
      • Membership login system
  51. Please Do Not Build That Wall
      • Password protect some of the databases
      • Password protect some of the data
      • Open data, but pay for record or surname bookmarking system
      • Open data, but pay for API access
      • Open data, but sell online ads
      • Open data, but give people guilt trips
  52. Presenting LeafSeek!
      • Free and Open Source
      • Code is all on GitHub
      • Please add, edit, fix, change, tinker
      • …and use it!
  53. Why is this FREE?
      And why is this important?
  54. Thank you! :-)