Creating an Open Source Genealogical Search Engine with Apache Solr

Set Your Records Free!

LeafSeek is a new tool that helps you turn your genealogical or historical record collections into searchable online databases. Combine multiple datasets of different types — such as birth, marriage, and military records — into one unified searchable website. Find inter-connections in your data that you never noticed before.

With great features like built-in geo-spatial searches, pop-up Google Maps, Beider-Morse Phonetic Matching, name synonyms, and language localization, LeafSeek can help you turn your spreadsheets of names and dates into a full-featured genealogy search engine. It’s designed for researchers and genealogy societies alike.

Oh, and one more thing: LeafSeek is free and open source. No strings attached.

Published in: Technology

  1. Creating an Open Source Genealogical Search Engine With Apache Solr
     Brooke Schreier Ganz
     info@leafseek.com
     Twitter: @LeafSeek
     www.LeafSeek.com
  2. Hi, I’m Brooke
     • I make web stuff for fun, and (sometimes) for profit
     • Web Developer at IBM.com and Disney Consumer Products
     • Lead Programmer at TMZ.com (yikes, sorry about that)
     • Senior Web Producer at Bravo cable TV network and its spin-off websites
     • Big dork
     • Big genealogy dork
     • #BigData dork
  3. Meet Gesher Galicia
     • Non-profit 501(c)(3) genealogy society
     • Founded in 1993
     • Hundreds of members, worldwide
     • E-mail discussion group
     • New website development in progress (existing website is fugly)
     • Needs a search engine…for data
  4. The Old Problem
  5. The Old Problem
  6. The New Problem
  7. The New Problem
     • Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)
     • Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)
  8. Diverse Data Shapes
  9. Diverse Data Shapes
  10. Diverse Data Shapes
  11. Existing solutions
      • They’re okay...for small numbers of databases, with small amounts of data
        – Steve Morse’s One-Step Tool Creator
        – Roll-your-own solution with PHP and MySQL
      • Both get more difficult to manage as data sets increase in number and complexity
  12. In space, no one can hear your data scream
  13. To Sum Up
      • There are lots of ways to publish your tree
      • …but not so many ways to publish your data
      • Surely there must be a way to deal with this?
  14. So I Made A Thing
      But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek
  15. This is the part where I show you all the shiny new All Galicia Database
      http://search.geshergalicia.org/
  16. Meet Apache Solr
      • Highly functional open source search platform
      • Based on Apache Lucene (Java)…
      • …plus a web wrapper/API
      • Not the prettiest or simplest tool
      • FREE and open source
  17. Saves Time, and Heartache
  18. Saves Time, and Stomachache
  19. File Structure: Back-End
  20. Welcome to /conf
  21. The Important Stuff
  22. solrconfig.xml
  23. solrconfig.xml
      Make sure this part is configured, so you can import data:
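The slide's screenshot did not survive in this transcript, so the exact snippet is unknown. As a rough sketch, assuming the part in question is a DataImportHandler request handler (the usual way Solr pulls rows from a database), the check below scans a solrconfig.xml string for such a handler. The sample XML is a trimmed-down, hypothetical config written for illustration only, not the one from the talk.

```python
import xml.etree.ElementTree as ET

# Fully qualified class name of Solr's DataImportHandler.
DIH_CLASS = "org.apache.solr.handler.dataimport.DataImportHandler"


def find_dataimport_handlers(solrconfig_xml):
    """Return the names of request handlers that use the DataImportHandler class."""
    root = ET.fromstring(solrconfig_xml.strip())
    return [
        handler.get("name", "")
        for handler in root.iter("requestHandler")
        if handler.get("class") == DIH_CLASS
    ]


# Hypothetical, minimal solrconfig.xml (assumption: handler is mounted at /dataimport).
SAMPLE_SOLRCONFIG = """
<config>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>
</config>
"""

handlers = find_dataimport_handlers(SAMPLE_SOLRCONFIG)
```

An empty result would suggest the import handler is not configured yet.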
  24. How to get your data into Solr
      • Step 1: Make a properly-formatted spreadsheet
      • Step 2: Save spreadsheet as a .CSV file
      • Step 3: Create a MySQL database + table
      • Step 4: Import CSV into that new table
      • Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)
      • Step 6: Add this table’s information to db-data-config.xml
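Steps 2 through 5 can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere, the function name `load_csv_into_table` is hypothetical, and the sample rows are invented. In MySQL the key column would be declared along the lines of `id INT AUTO_INCREMENT PRIMARY KEY`.

```python
import csv
import io
import sqlite3


def load_csv_into_table(conn, table, csv_text):
    """Create `table` with an auto-incrementing `id` key and load the CSV rows into it."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # column names come from the spreadsheet's header row
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # SQLite spelling of an auto-incrementing integer primary key (stand-in for MySQL).
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


# Invented sample data, for illustration only.
SAMPLE_CSV = "Surname,GivenName,Year\nKowalski,Jan,1875\nRoth,Sara,1880\n"

conn = sqlite3.connect(":memory:")
row_count = load_csv_into_table(conn, "births", SAMPLE_CSV)
ids = [row[0] for row in conn.execute('SELECT id FROM "births" ORDER BY id')]
```

Each imported row gets its own `id`, which is what Solr later needs to tell records apart.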
  25. db-data-config.xml
      • Basic XML file that tells Solr how to grab data from your MySQL database(s)
      • Add new <dataSource> for new databases
      • Add new <entity> for new tables within the databases
      • You need to make sure your MySQL connector .jar is installed for this to work
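To make the `<dataSource>` and `<entity>` relationship concrete, here is a small sketch that generates a DataImportHandler config from a mapping of databases to tables. The database names, table names, credentials, and the `SELECT *` queries are all placeholders, and LeafSeek's real file may be organized differently.

```python
import xml.etree.ElementTree as ET


def build_data_config(databases):
    """Build a db-data-config.xml string from {database_name: [table, ...]}."""
    root = ET.Element("dataConfig")
    # One <dataSource> per database.
    for db_name in databases:
        ET.SubElement(root, "dataSource", {
            "name": db_name,
            "type": "JdbcDataSource",
            "driver": "com.mysql.jdbc.Driver",  # provided by the MySQL connector .jar
            "url": f"jdbc:mysql://localhost/{db_name}",  # placeholder host
            "user": "solr_reader",    # placeholder credentials
            "password": "change-me",
        })
    # One <entity> per table, pointing back at its database's dataSource.
    document = ET.SubElement(root, "document")
    for db_name, tables in databases.items():
        for table in tables:
            ET.SubElement(document, "entity", {
                "name": table,
                "dataSource": db_name,
                "query": f"SELECT * FROM {table}",
            })
    return ET.tostring(root, encoding="unicode")


# Hypothetical databases and tables.
config_xml = build_data_config({
    "vital_records": ["births", "marriages"],
    "tax_lists": ["tax_1850"],
})
```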
  26. Import!
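The import itself is usually kicked off with an HTTP request to the import handler. The sketch below only builds that URL. It assumes the handler is mounted at `/dataimport` and that Solr runs at the placeholder address shown, neither of which is confirmed by the slides.

```python
from urllib.parse import urlencode


def full_import_url(solr_base_url, clean=True, commit=True):
    """Return the URL that asks the DataImportHandler to run a full import."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),    # true = wipe the existing index first
        "commit": str(commit).lower(),  # true = make imported documents searchable
    }
    return f"{solr_base_url.rstrip('/')}/dataimport?{urlencode(params)}"


# Placeholder base URL; substitute your own Solr address.
import_url = full_import_url("http://localhost:8983/solr")
```

Opening that URL in a browser, or fetching it from a script, starts the import.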
  27. schema.xml
      • FieldTypes, Fields, and CopyFields
      • FieldTypes give indexing and querying instructions to “buckets”
      • Fields say what’s what and whether to make something facetable or not
      • CopyFields copy the contents of Fields into extra Fields, which can use a different FieldType
  28. schema.xml - FieldTypes
      • 5 Custom FieldTypes (so far):
        – givenname
        – surname
        – surname_bmpm (phonetic)
        – place (note: not merely town)
        – year (which we’re treating as text right now)
  29. schema.xml - FieldTypes
  30. schema.xml - FieldTypes
  31. schema.xml - Fields
  32. schema.xml - Fields
      • Uppercase fields take their names from the MySQL column names
      • Examples:
        – Year
        – SchoolYear
        – Surname
        – FathersTown
        – MothersFathersGivenName
        – MaternalGrandfathersGivenName
  33. schema.xml - Fields
      • Lowercase fields are added as the data is imported into Solr, and start with the prefix record_
      • Examples:
        – record_type (birth, death, tax, whatever)
        – record_source (name of repository)
        – record_latlong (latitude,longitude)
        – record_id (required!)
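One way to picture the two kinds of fields: the uppercase ones pass through from the database row untouched, and the `record_` ones are attached at import time. The sketch below is purely illustrative. In particular, building `record_id` from the table name plus the row's `id` is an assumption, not necessarily how LeafSeek does it, and the sample values are invented.

```python
def add_record_fields(row, record_type, record_source, table, latlong=None):
    """Return a copy of a database row with the lowercase record_ fields added."""
    doc = dict(row)  # uppercase fields (Surname, Year, ...) are kept as-is
    doc["record_type"] = record_type
    doc["record_source"] = record_source
    # Assumption: table name + row id keeps ids unique across combined datasets.
    doc["record_id"] = f"{table}-{row['id']}"
    if latlong is not None:
        latitude, longitude = latlong
        doc["record_latlong"] = f"{latitude},{longitude}"
    return doc


# Invented sample row and metadata.
document = add_record_fields(
    {"id": 7, "Surname": "Kowalski", "Year": "1875"},
    record_type="birth",
    record_source="Example Archive",
    table="births",
    latlong=(49.84, 24.03),
)
```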
  34. schema.xml - Fields
      • You do not have to explicitly define every Field.
      • If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.
      • Which is fine.
      • But IMHO it’s better to define everything anyway so you can remember what’s what and what you are doing to it.
  35. schema.xml - CopyFields
  36. Add-ons and nice-to-haves (for the back-end)
      • Wildcards, and lots of ’em
      • Non-name words handled through stopwords.txt
      • Nicknames and name synonyms handled through synonyms.txt
      • Two files included:
        – synonyms_-_american-anglo-saxon.txt
        – synonyms_-_polish-ukrainian-jewish.txt
      • Should be based on your data and your historical/ethnic community standards
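To show what a synonyms file does for name searches, here is a simplified sketch of Solr's synonyms.txt format: comma-separated names are treated as equivalent, `a => b` maps one spelling onto another, and lines starting with `#` are comments. Solr applies these rules itself at index or query time. The parser and the sample entries below are illustrative only and are not taken from the bundled LeafSeek files.

```python
def parse_synonyms(text):
    """Map each lowercased name to the set of names a search for it should match."""
    table = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=>" in line:
            # Explicit mapping: names on the left are replaced by names on the right.
            left, right = line.split("=>", 1)
            targets = {name.strip().lower() for name in right.split(",") if name.strip()}
            for source in left.split(","):
                if source.strip():
                    table.setdefault(source.strip().lower(), set()).update(targets)
        else:
            # Equivalence group: every name matches every other name in the group.
            group = {name.strip().lower() for name in line.split(",") if name.strip()}
            for name in group:
                table.setdefault(name, set()).update(group)
    return table


def expand_name(table, name):
    """Return the sorted list of names to search for when the user types `name`."""
    key = name.strip().lower()
    return sorted(table.get(key, {key}))


# Invented sample entries.
SAMPLE_SYNONYMS = """
# nicknames
william, bill, will
elizabeth, liz, beth
wm => william
"""

synonym_table = parse_synonyms(SAMPLE_SYNONYMS)
bill_matches = expand_name(synonym_table, "Bill")
```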
  37. More add-ons and nice-to-haves (for the back-end)
      • Translate your site into different languages: multi-lingual content deserves a real multi-lingual website
        – Pass user preferences through GET value or through accept-language header or read from a cookie or whatever you want
      • Built-in performance monitoring hooks for New Relic
      • Soundalike searches for surname variants
        – Levenshtein distance
        – “Regular” Soundex, Metaphone, Caverphone, etc.
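Of the soundalike options listed, Levenshtein distance is the easiest to show in a few lines: it counts the single-character insertions, deletions, and substitutions needed to turn one spelling into another. The sketch below is a plain-Python version for illustration. Solr has its own fuzzy matching, and the helper `close_surnames` and its threshold of 1 are hypothetical.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def close_surnames(query, surnames, max_distance=1):
    """Return the surnames within max_distance edits of the query, ignoring case."""
    needle = query.lower()
    return [s for s in surnames if levenshtein(needle, s.lower()) <= max_distance]


variants = close_surnames("Ganz", ["Gantz", "Gans", "Schreier"])
```

Edit distance catches spelling drift like a dropped or swapped letter, but it knows nothing about pronunciation, which is where the phonetic algorithms come in.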
  38. This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)
  39. Relevancy
      • Right now, we’re using exact matches
      • (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)
      • Like “Old Search” on Ancestry.com
      • DisMax! Boosting fields! Scoring!
      • (…but not yet)
      • Problems with records with multiple people’s names in the record
  40. Lots of Front-End Options
      • Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr
      • Django/Python: Haystack, Sunburnt, solrpy, pysolr
      • Older PHP options: PECL, solr-php-client
      • Plugins for blog/CMS systems: Drupal, WordPress
  41. Meet Solarium
      • http://www.solarium-project.org/
      • New, open source PHP wrapper for Solr
      • Very active development
      • Version 2.4 coming soon
  42. File Structure: Front-End
  43. Meet Solarium: The Config
  44. Meet Solarium: The Guts
  45. Meet Solarium: The Guts
      • You choose the parts of your data to facet
      • Data is submitted to the front-end by POST, not by GET, so the URL never changes
      • You can (and should) paginate results listings
      • You can’t actually see the Solr server’s URL from the front-end, not even in view-source
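Whatever front-end wrapper is used, faceting and pagination come down to a handful of Solr query parameters. The sketch below builds them directly. It is not Solarium's API (Solarium is PHP and wraps these details), and the field names and page size are placeholders.

```python
from urllib.parse import urlencode


def build_select_params(query, facet_fields, page=1, rows_per_page=20):
    """Return Solr /select parameters for one page of results plus facet counts."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = [
        ("q", query),
        ("start", str((page - 1) * rows_per_page)),  # offset of the first result
        ("rows", str(rows_per_page)),                # results per page
        ("wt", "json"),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field is repeated once per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    return params


# Placeholder query and facet fields.
select_params = build_select_params("Surname:Kowalski", ["record_type", "Year"], page=3)
query_string = urlencode(select_params)
```

Because the front-end sends this request server-side, the visitor never sees the Solr URL or these parameters.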
  46. Add-ons and nice-to-haves (for the front-end)
      • A welcome screen with information about the database’s contents
      • Instructions (maybe twice)
      • How many records in the database?
      • How many datasets?
      • What features are coming next?
      • What datasets are coming next?
  47. Add-ons and nice-to-haves (for the front-end)
      • Make good UI choices
      • Pop-Up Google Maps
      • Tooltips to reduce UI clutter
      • Cross-browser compatibility
      • Still stuck with IE 7 and 8
      • CSS and code that degrades gracefully
      • No small text
  48. Bird’s Eye View of Your Data
      • What (surnames, towns, etc.) do I have in my data?
      • What are the TOP (surnames, towns, etc.) in my data?
      • Finding incorrect data
        – Outlying years and dates
        – Figure out that hard-to-read surname
      • Make charts and graphs from your data
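The same overview questions can be tried on a small batch of records before anything reaches Solr. In a running index, facet counts would answer them. The sketch below does it in plain Python. The helper names, the expected year range, and the sample records are all invented for illustration.

```python
from collections import Counter


def top_values(records, field, limit=3):
    """Return the most common non-empty values of `field` with their counts."""
    counts = Counter(record[field] for record in records if record.get(field))
    return counts.most_common(limit)


def outlying_years(records, earliest, latest, field="Year"):
    """Return records whose year is not a number or falls outside the expected range."""
    flagged = []
    for record in records:
        value = str(record.get(field, "")).strip()
        if not value.isdigit() or not earliest <= int(value) <= latest:
            flagged.append(record)
    return flagged


# Invented sample records.
SAMPLE_RECORDS = [
    {"Surname": "Kowalski", "Year": "1875"},
    {"Surname": "Roth", "Year": "1880"},
    {"Surname": "Kowalski", "Year": "1785"},  # likely a transcription slip for 1875
    {"Surname": "Adler", "Year": "18S0"},     # letter typed instead of a digit
]

top_surnames = top_values(SAMPLE_RECORDS, "Surname")
suspect_records = outlying_years(SAMPLE_RECORDS, earliest=1850, latest=1900)
```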
  49. The (Back-End) Future! (Maybe.)
      • Date ranges, instead of just years
      • Auto-complete as you type
      • “Did you mean...?” (based on data frequency)
      • “More Like This” (would have to do scoring)
      • Record bookmarking system (hashes?)
  50. The (Front-End) Future! (Maybe.)
      • Hierarchical facets for locations
      • Disambiguating locations
      • Social sharing of individual records
      • New genealogy data schema http://historical-data.org/
      • Membership login system
  51. Please Do Not Build That Wall
      • Password protect some of the databases
      • Password protect some of the data
      • Open data, but pay for record or surname bookmarking system
      • Open data, but pay for API access
      • Open data, but sell online ads
      • Open data, but give people guilt trips
  52. Presenting LeafSeek!
      • Free and Open Source
      • Code is all on GitHub
      • Please add, edit, fix, change, tinker
      • …and use it!
  53. Why is this FREE? And why is this important?
  54. Thank you! :-)
