Creating an Open Source
Genealogical Search Engine
    With Apache Solr


              Brooke Schreier Ganz
                 info@leafseek.com
                Twitter: @LeafSeek
                www.LeafSeek.com
Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for
  profit
• Web Developer at IBM.com and Disney
  Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV
  network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork
Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
  (existing website is fugly)
• Needs a search engine…for data
The Old Problem
The New Problem
• Diverse Data Languages
  (German, Polish, Ukrainian, Russian, Yiddish,
  Hebrew, English…)
• Diverse Data Types
  (births, marriages, deaths, divorces, tax
  lists, landsmanschaften lists, industrial
  permit lists, school
  yearbooks, governmental yearbooks…)
Diverse Data Shapes
Existing solutions
• They're okay...for small numbers of
  databases, with small amounts of data

  – Steve Morse's One-Step Tool Creator
  – Roll-your-own solution with PHP and MySQL


• Both get more difficult to manage as data
  sets increase in number and complexity
In space, no one can hear your data scream
To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
  data
• Surely there must be a way to deal with
  this?
So I Made A Thing
But “That Thing I Made With The Database And Stuff”
     was kind of an awkward name, so I called it



             LeafSeek
This is the part where I show you all
the shiny new All Galicia Database

  http://search.geshergalicia.org/
Meet Apache Solr
• Highly functional open source search
  platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source
Saves Time, and Heartache
Saves Time, and Stomachache
File Structure: Back-End
Welcome to /conf
The Important Stuff
solrconfig.xml

Make sure this part is configured, so you can
import data:
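The screenshot from this slide did not survive. As a rough sketch, and assuming the part in question is Solr's DataImportHandler request handler (an assumption; the slide text does not say), the check below parses a solrconfig.xml fragment and confirms that a handler of that class points at db-data-config.xml. The handler name `/dataimport` is the conventional one, not something confirmed by the talk.

```python
import xml.etree.ElementTree as ET

# Assumed fragment of solrconfig.xml; the original slide image is not available.
SOLRCONFIG_FRAGMENT = """
<config>
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>
</config>
"""

DIH_CLASS = "org.apache.solr.handler.dataimport.DataImportHandler"


def find_import_config(solrconfig_xml):
    """Return the config file named by the first DataImportHandler, or None."""
    root = ET.fromstring(solrconfig_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("class") == DIH_CLASS:
            for entry in handler.iter("str"):
                if entry.get("name") == "config":
                    return entry.text.strip()
    return None


print(find_import_config(SOLRCONFIG_FRAGMENT))
```

If the function returns None for your own solrconfig.xml, the import handler is probably not registered yet.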
How to get your data into Solr
• Step 1: Make a properly-formatted
  spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing
  Primary Key called “id” (INT)
• Step 6: Add this table's information to
  db-data-config.xml
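Steps 2 through 5 can be sketched in a few lines. This is only an illustration: it uses Python's built-in SQLite as a stand-in for MySQL so that it runs anywhere, and the sample rows, the table name `births`, and the helper `load_csv` are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the exported spreadsheet (Step 2).
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kohn,Chaim,1868,Lemberg
Adler,Rivka,1872,Tarnopol
"""


def load_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing id and load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    columns = next(reader)  # the header row names the columns
    # Names come from a trusted header here; real code should validate
    # them before building SQL from them.
    column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(
        f'CREATE TABLE "{table}" '
        f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {column_defs})"
    )
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, "births", SAMPLE_CSV)
print(loaded, conn.execute("SELECT id, Surname FROM births").fetchall())
```

In MySQL the equivalent of the `id` column would be an `INT AUTO_INCREMENT PRIMARY KEY`, as Step 5 describes.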
db-data-config.xml
• Basic XML file that tells Solr how to grab
  data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
  databases
• You need to make sure your MySQL
  connector .jar is installed for this to work
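For a sense of what ends up in db-data-config.xml, here is a small sketch that generates one. The element and attribute names (`dataConfig`, `dataSource`, `document`, `entity`, `query`) are the DataImportHandler's own; the database name, credentials, and table are hypothetical placeholders, and the helper `build_data_config` exists only for this example.

```python
import xml.etree.ElementTree as ET


def build_data_config(jdbc_url, user, password, entities):
    """Build a DataImportHandler config; `entities` maps entity name to SQL."""
    root = ET.Element("dataConfig")
    # One <dataSource> per database.
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",
        "url": jdbc_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    # One <entity> per table.
    for name, query in entities.items():
        ET.SubElement(document, "entity", {"name": name, "query": query})
    return ET.tostring(root, encoding="unicode")


config_xml = build_data_config(
    "jdbc:mysql://localhost/genealogy",  # hypothetical database
    "solr_reader",                       # hypothetical user
    "change-me",                         # hypothetical password
    {"births": "SELECT * FROM births"},  # hypothetical table
)
print(config_xml)
```

With more than one database, each `<dataSource>` would also need its own `name` so that entities can refer to it.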
Import!
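The slide here was a screenshot. One common way to kick off an import is an HTTP request to the DataImportHandler, so the sketch below only builds that URL. The base URL and the `/dataimport` path are assumptions about a default local setup, not details from the talk.

```python
from urllib.parse import urlencode


def dataimport_url(solr_base, command="full-import", clean=True, commit=True):
    """Build the URL that asks the DataImportHandler to run a command."""
    params = {
        "command": command,
        "clean": "true" if clean else "false",    # wipe the index first?
        "commit": "true" if commit else "false",  # commit when finished?
    }
    return solr_base.rstrip("/") + "/dataimport?" + urlencode(params)


# Assumed local Solr address; adjust to your own server and core.
print(dataimport_url("http://localhost:8983/solr"))
```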
schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes give indexing and querying
  instructions to “buckets”
• Fields say what's what and whether to
  make something facetable or not
• CopyFields copy the contents of Fields into
  extra Fields, which can use a different FieldType
schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
  – givenname
  – surname
  – surname_bmpm (phonetic)
  – place (note: not merely town)
  – year (which we're treating as text right now)
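As an illustration of what a custom FieldType looks like, the sketch below generates a possible definition for `surname_bmpm`. It assumes Solr's `BeiderMorseFilterFactory` with commonly used settings; the talk's actual analyzer chain is not shown in the slides, so treat the tokenizer and attribute values as guesses.

```python
import xml.etree.ElementTree as ET


def bmpm_field_type(name="surname_bmpm"):
    """Return a <fieldType> that runs Beider-Morse phonetic encoding."""
    field_type = ET.Element("fieldType", {"name": name, "class": "solr.TextField"})
    analyzer = ET.SubElement(field_type, "analyzer")
    ET.SubElement(analyzer, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
    # Assumed settings; tune nameType and ruleType for your own data.
    ET.SubElement(analyzer, "filter", {
        "class": "solr.BeiderMorseFilterFactory",
        "nameType": "GENERIC",
        "ruleType": "APPROX",
        "concat": "true",
        "languageSet": "auto",
    })
    return ET.tostring(field_type, encoding="unicode")


field_type_xml = bmpm_field_type()
print(field_type_xml)
```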
schema.xml - Fields
• Uppercase fields take their names from
  the MySQL column names
• Examples:
  – Year
  – SchoolYear
  – Surname
  – FathersTown
  – MothersFathersGivenName
  – MaternalGrandfathersGivenName
schema.xml - Fields
• Lowercase fields are added as the data
  is imported into Solr, and start
  with the prefix record_
• Examples:
  – record_type (birth, death, tax, whatever)
  – record_source (name of repository)
  – record_latlong (latitude,longitude)
  – record_id (required!)
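To make the split concrete, here is a sketch of a database row picking up its record_ fields on the way into Solr. The helper `add_record_fields`, the sample row, the source name, and the `record_id` scheme (record type plus the row's id) are all hypothetical; LeafSeek may well add these fields differently.

```python
def add_record_fields(row, record_type, record_source, latlong=None):
    """Return a copy of `row` with the lowercase record_ fields added."""
    doc = dict(row)  # the uppercase fields come straight from MySQL
    doc["record_type"] = record_type
    doc["record_source"] = record_source
    if latlong is not None:
        lat, lon = latlong
        doc["record_latlong"] = f"{lat},{lon}"
    # Hypothetical id scheme: unique across datasets, not just within one table.
    doc["record_id"] = f"{record_type}-{row['id']}"
    return doc


row = {"id": 17, "Surname": "Kohn", "Year": "1868"}  # made-up sample row
doc = add_record_fields(row, "birth", "Example Archive", latlong=(49.84, 24.03))
print(doc)
```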
schema.xml - Fields
• You do not have to explicitly define every
  Field.
• If something is imported that is not named
  and defined in schema.xml it will just be
  indexed as a straight-up text string, with
  nothing done to it.
• Which is fine.
• But IMHO it's better to define everything
  anyway so you can remember what's what
  and what you are doing to it.
schema.xml - CopyFields
Add-ons and nice-to-haves
         (for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through
  stopwords.txt
• Nicknames and name synonyms handled
  through synonyms.txt
• Two files included:
  – synonyms_-_american-anglo-saxon.txt
  – synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your
  historical/ethnic community standards
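A synonyms file is just lines of comma-separated names, optionally with `=>` for one-way mappings. The sketch below is a simplified model of how such a file expands a name, not Solr's actual implementation, and the sample entries are invented rather than taken from the two LeafSeek files.

```python
def parse_synonyms(text):
    """Map each lowercased name to the set of names it should also match."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=>" in line:
            # One-way mapping: names on the left are replaced by the right.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
        else:
            # Equivalent names: every name matches every other.
            sources = [t.strip().lower() for t in line.split(",") if t.strip()]
            targets = set(sources)
        for source in sources:
            table.setdefault(source, set()).update(targets)
    return table


SAMPLE = """# hypothetical entries, not taken from the LeafSeek files
william, bill, will
chaim, hyman
yitzchak, isaac => isaac
"""

table = parse_synonyms(SAMPLE)
print(sorted(table["bill"]))
```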
More add-ons and nice-to-haves
        (for the back-end)
• Translate your site into different languages – multilingual
  content deserves a real multilingual website
   – Pass user preferences through a GET value, the
     Accept-Language header, a cookie, or
     whatever you want
• Built-in performance monitoring hooks for New
  Relic
• Soundalike searches for surname variants
   – Levenshtein distance
   – “Regular” Soundex, Metaphone, Caverphone, etc.
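For anyone who has not met Levenshtein distance: it counts the single-letter insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a plain textbook implementation plus a hypothetical helper, `surname_variants`, that keeps surnames within a chosen distance. It is not how Solr does fuzzy matching internally, and the surname list is made up.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def surname_variants(query, surnames, max_distance=1):
    """Return the surnames within `max_distance` edits of the query."""
    needle = query.lower()
    return [s for s in surnames if levenshtein(needle, s.lower()) <= max_distance]


print(surname_variants("Kohn", ["Cohn", "Kohn", "Kahn", "Cohen", "Adler"]))
```

Edit distance treats all letters alike, which is why phonetic encodings such as Soundex are listed alongside it: they group names by sound rather than spelling.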
This is the part where I tell
          the story about


     THE SAGA
of Beider-Morse Phonetic Matching
             (BMPM)
Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
  wildcards, alternate names /
  synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records that contain several
  people's names
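Since boosting is still on the to-do list, the following is only a sketch of what a DisMax request might look like. `defType`, `qf`, and `fl` are real Solr query parameters; the choice of fields and the boost weights are invented for the example.

```python
from urllib.parse import urlencode


def dismax_params(user_query, boosts):
    """Build Solr query parameters for a DisMax search with field boosts."""
    # qf lists the fields to search, each with an optional ^weight.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": qf,
        "fl": "*,score",  # return the relevancy score with each record
    }


# Hypothetical weights: a surname match counts three times as much.
params = dismax_params("Kohn", {"Surname": 3, "FathersTown": 1})
print(urlencode(params))
```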
Lots of Front-End Options
• Ruby:
  Sunspot, RSolr, Tanning Bed, acts-as-solr
• Django/Python:
  Haystack, Sunburnt, solrpy, pysolr
• Older PHP options:
  PECL, solr-php-client
• Plugins for blog/CMS systems:
  Drupal, WordPress
Meet Solarium
•   http://www.solarium-project.org/
•   New, open source PHP wrapper for Solr
•   Very active development
•   Version 2.4 coming soon
File Structure: Front-End
Meet Solarium: The Config
Meet Solarium: The Guts
Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by
  POST, not by GET, so the URL never
  changes
• You can (and should) paginate results
  listings
• You can't actually see the Solr server's
  URL from the front-end, not even in
  view-source
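Behind any front-end, faceting and pagination come down to a handful of Solr request parameters. The sketch below builds them in Python purely for illustration; it is not Solarium's PHP API, and the helper `search_params` and the page size are assumptions.

```python
def search_params(query, page, per_page=20, facet_fields=()):
    """Build Solr request parameters for one page of faceted results."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = [
        ("q", query),
        ("start", str((page - 1) * per_page)),  # offset of the first result
        ("rows", str(per_page)),                # results per page
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return params


print(search_params("Surname:Kohn", page=3, facet_fields=["record_type", "Year"]))
```

Because the front-end sends these to Solr on the server side, visitors only ever see the front-end's own URL.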
Add-ons and nice-to-haves
        (for the front-end)
• A welcome screen with information about
  the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?
Add-ons and nice-to-haves
           (for the front-end)
•   Make good UI choices
•   Pop-Up Google Maps
•   Tooltips to reduce UI clutter
•   Cross-browser compatibility
•   Still stuck with IE 7 and 8
•   CSS and code that degrades gracefully
•   No small text
Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
  my data?
• What are the TOP (surnames, towns, etc.)
  in my data?
• Finding incorrect data
  – Outlying years and dates
  – Figure out that hard-to-read surname
• Make charts and graphs from your data
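The same questions can be asked of any list of records. This sketch counts the most common values of a field and flags years outside an expected range; the records, the field names in the sample, and the year range are made up, and in a live setup Solr's facet counts would do the counting for you.

```python
from collections import Counter


def top_values(records, field, n=3):
    """Return the n most common non-empty values of `field` with counts."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Return records whose Year is missing, non-numeric, or out of range."""
    flagged = []
    for record in records:
        year = str(record.get("Year", "")).strip()
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged


# Hypothetical records; Year is text, as in the schema described earlier.
records = [
    {"Surname": "Kohn", "Year": "1868"},
    {"Surname": "Kohn", "Year": "1872"},
    {"Surname": "Adler", "Year": "1186"},  # probably a typo for 1886
]
print(top_values(records, "Surname", n=1))
print(outlying_years(records, 1780, 1945))
```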
The (Back-End) Future! (Maybe.)

• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
  (based on data frequency)
• “More Like This”
  (would have to do scoring)
• Record bookmarking system (hashes?)
The (Front-End) Future! (Maybe.)

• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
  http://historical-data.org/
• Membership login system
Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
  bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips
Presenting LeafSeek!
•   Free and Open Source
•   Code is all on GitHub
•   Please add, edit, fix, change, tinker
•   …and use it!
Why is this FREE?

And why is this important?
Thank you! :-)

Creating an Open Source Genealogical Search Engine with Apache Solr

  • 1. Creating an Open Source Genealogical Search Engine With Apache Solr
      Brooke Schreier Ganz
      info@leafseek.com
      Twitter: @LeafSeek
      www.LeafSeek.com
  • 2. Hi, I'm Brooke
      • I make web stuff for fun, and (sometimes) for profit
      • Web Developer at IBM.com and Disney Consumer Products
      • Lead Programmer at TMZ.com (yikes, sorry about that)
      • Senior Web Producer at Bravo cable TV network and its spin-off websites
      • Big dork
      • Big genealogy dork
      • #BigData dork
  • 3. Meet Gesher Galicia
      • Non-profit 501(c)3 genealogy society
      • Founded in 1993
      • Hundreds of members, worldwide
      • E-mail discussion group
      • New website development in progress (existing website is fugly)
      • Needs a search engine…for data
  • 10. The New Problem
      • Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)
      • Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)
  • 14. Existing solutions
      • They're okay...for small numbers of databases, with small amounts of data
          – Steve Morse's One-Step Tool Creator
          – Roll-your-own solution with PHP and MySQL
      • Both get more difficult to manage as data sets increase in number and complexity
  • 15. In space, no one can hear your data scream
  • 16. To Sum Up
      • There are lots of ways to publish your tree
      • …but not so many ways to publish your data
      • Surely there must be a way to deal with this?
  • 19. So I Made A Thing
      But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek
  • 20. This is the part where I show you all the shiny new All Galicia Database
      http://search.geshergalicia.org/
  • 21. Meet Apache Solr
      • Highly functional open source search platform
      • Based on Apache Lucene (Java)…
      • …plus a web wrapper/API
      • Not the prettiest or simplest tool
      • FREE and open source
  • 24. Saves Time, and Heartache
  • 26. Saves Time, and Stomachache
  • 33. solrconfig.xml
      Make sure this part is configured, so you can import data:
  • 34. How to get your data into Solr
      • Step 1: Make a properly-formatted spreadsheet
      • Step 2: Save spreadsheet as a .CSV file
      • Step 3: Create a MySQL database + table
      • Step 4: Import CSV into that new table
      • Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)
      • Step 6: Add this table's information to db-data-config.xml
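Steps 2 through 5 above can be sketched in a few lines of Python. This is only an illustration, not the LeafSeek import process: it uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs anywhere, and the table name `births`, the column names, and the sample rows are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a spreadsheet saved as .CSV.
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kowalski,Jan,1872,Lemberg
Adler,Rivka,1880,Tarnopol
"""


def load_csv_into_table(conn, csv_text, table="births"):
    """Create a table with an auto-incrementing integer `id` and load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first CSV row supplies the column names
    column_defs = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" '
        f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {column_defs})"
    )
    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', reader
    )
    conn.commit()
    return header


# SQLite in memory is an assumption for the sketch; the slides use MySQL.
conn = sqlite3.connect(":memory:")
load_csv_into_table(conn, SAMPLE_CSV)
rows = conn.execute('SELECT id, "Surname", "Year" FROM births ORDER BY id').fetchall()
print(rows)
```

In MySQL the equivalent of the `id` column would be an `INT` column declared `AUTO_INCREMENT PRIMARY KEY`.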
  • 37. db-data-config.xml
      • Basic XML file that tells Solr how to grab data from your MySQL database(s)
      • Add new <dataSource> for new databases
      • Add new <entity> for new tables within the databases
      • You need to make sure your MySQL connector .jar is installed for this to work
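To make the shape of that file concrete, here is a rough Python sketch that generates a minimal db-data-config.xml with one <dataSource> and one <entity>. It is not the LeafSeek configuration: the database name, table name, user, password, and the mapping of the MySQL `id` column to `record_id` are all assumptions for illustration, and the JDBC driver class shown is the classic MySQL Connector/J name, which may differ in your connector version.

```python
import xml.etree.ElementTree as ET


def build_data_config(db_name, table, columns):
    """Build a minimal Data Import Handler config for one database and one table."""
    root = ET.Element("dataConfig")
    ET.SubElement(root, "dataSource", {
        "name": db_name,
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",   # depends on your connector .jar
        "url": f"jdbc:mysql://localhost/{db_name}",
        "user": "solr_reader",               # hypothetical credentials
        "password": "change-me",
    })
    document = ET.SubElement(root, "document")
    entity = ET.SubElement(document, "entity", {
        "name": table,
        "dataSource": db_name,
        "query": f"SELECT * FROM {table}",
    })
    # Assumption: the auto-incrementing MySQL id feeds Solr's record_id field.
    ET.SubElement(entity, "field", {"column": "id", "name": "record_id"})
    for column in columns:
        ET.SubElement(entity, "field", {"column": column, "name": column})
    return root


config = build_data_config("genealogy", "births", ["Surname", "Year"])
print(ET.tostring(config, encoding="unicode"))
```

Adding another database means another <dataSource> element, and adding another table means another <entity> under <document>.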
  • 41. schema.xml
      • FieldTypes, Fields, and CopyFields
      • FieldTypes give indexing and querying instructions to “buckets”
      • Fields say what's what and whether to make something facetable or not
      • CopyFields copy the contents of Fields into additional Fields, which can use a different FieldType
  • 42. schema.xml - FieldTypes
      • 5 Custom FieldTypes (so far):
          – givenname
          – surname
          – surname_bmpm (phonetic)
          – place (note: not merely town)
          – year (which we're treating as text right now)
  • 46. schema.xml - Fields
      • Uppercase field names come from the MySQL column names
      • Examples:
          – Year
          – SchoolYear
          – Surname
          – FathersTown
          – MothersFathersGivenName
          – MaternalGrandfathersGivenName
  • 47. schema.xml - Fields
      • Lowercase fields are added as the data is imported into Solr, and start with the prefix record_
      • Examples:
          – record_type (birth, death, tax, whatever)
          – record_source (name of repository)
          – record_latlong (latitude,longitude)
          – record_id (required!)
  • 48. schema.xml - Fields
      • You do not have to explicitly define every Field.
      • If something is imported that is not named and defined in schema.xml, it will just be indexed as a straight-up text string, with nothing done to it.
      • Which is fine.
      • But IMHO it's better to define everything anyway, so you can remember what's what and what you are doing to it.
  • 51. Add-ons and nice-to-haves (for the back-end)
      • Wildcards, and lots of 'em
      • Non-name words handled through stopwords.txt
      • Nicknames and name synonyms handled through synonyms.txt
      • Two files included:
          – synonyms_-_american-anglo-saxon.txt
          – synonyms_-_polish-ukrainian-jewish.txt
      • Should be based on your data and your historical/ethnic community standards
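As a rough illustration of what a synonyms file does for name searches, the Python sketch below parses lines in the comma-separated style that Solr's synonyms.txt uses and expands a searched name into its whole group. Solr does this inside its analysis chain, so this is only a model of the idea. The sample entries are hypothetical and are not taken from the two bundled files, and explicit `=>` mappings are deliberately skipped rather than handled.

```python
def parse_synonym_groups(text):
    """Parse comma-separated synonym lines into sets of lowercase names."""
    groups = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=>" in line:
            continue  # blank line, comment, or explicit mapping (not handled here)
        names = {name.strip().lower() for name in line.split(",") if name.strip()}
        groups.append(names)
    return groups


def expand_name(name, groups):
    """Return the name plus every synonym that shares a group with it."""
    wanted = name.lower()
    expanded = {wanted}
    for group in groups:
        if wanted in group:
            expanded |= group
    return expanded


# Hypothetical entries, for illustration only.
SAMPLE_SYNONYMS = """# nicknames and name variants
Bill, Will, William
Chaim, Hyman, Herman
"""

groups = parse_synonym_groups(SAMPLE_SYNONYMS)
print(sorted(expand_name("William", groups)))
```

A name that appears in no group simply comes back on its own, which matches the point above that the lists should reflect your own data and community.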
  • 54. More add-ons and nice-to-haves (for the back-end)
      • Translate your site into different languages: multilingual content deserves a real multilingual website
          – Pass user preferences through a GET value, through the Accept-Language header, from a cookie, or whatever you want
      • Built-in performance monitoring hooks for New Relic
      • Soundalike searches for surname variants
          – Levenshtein distance
          – “Regular” Soundex, Metaphone, Caverphone, etc.
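Two of the soundalike techniques named above are small enough to sketch in plain Python. The first function computes Levenshtein (edit) distance and the second computes the standard American Soundex code. These are textbook versions written for illustration, not the implementations Solr or LeafSeek use, and the surnames in the usage lines are just examples.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    last_code = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != last_code:
            digits.append(code)
        last_code = code  # a vowel resets this, so a repeated code can be added
    return (first.upper() + "".join(digits) + "000")[:4]


print(levenshtein("Schreier", "Schreyer"))
print(soundex("Robert"), soundex("Rupert"))
```

Edit distance works on spelling, so it catches transcription slips, while Soundex groups names by rough sound, which is why the two are often offered side by side.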
  • 55. This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)
  • 56. Relevancy
      • Right now, we're using exact matches
      • (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)
      • Like “Old Search” on Ancestry.com
      • DisMax! Boosting fields! Scoring!
      • (…but not yet)
      • Problem: records that contain multiple people's names
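For a sense of what DisMax field boosting would look like once it is switched on, here is a small Python sketch that builds the request parameters. `defType=dismax` and `qf` are standard Solr query parameters, but the field names and boost weights below are invented for the example and do not describe any current LeafSeek setting.

```python
from urllib.parse import parse_qs, urlencode


def build_dismax_params(user_query, boosts):
    """Build a DisMax query string that weights some fields above others."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return urlencode({"q": user_query, "defType": "dismax", "qf": qf})


# Hypothetical boosts: a surname match counts for more than a given-name match.
query_string = build_dismax_params("Adler", {"Surname": 3.0, "GivenName": 1.5})
print(query_string)
```

With boosts like these, results would be ranked by score instead of simply matching or not matching.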
  • 59. Lots of Front-End Options
      • Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr
      • Django/Python: Haystack, Sunburnt, solrpy, pysolr
      • Older PHP options: PECL, solr-php-client
      • Plugins for blog/CMS systems: Drupal, WordPress
  • 60. Meet Solarium
      • http://www.solarium-project.org/
      • New, open source PHP wrapper for Solr
      • Very active development
      • Version 2.4 coming soon
  • 64. Meet Solarium: The Guts
      • You choose the parts of your data to facet
      • Data is submitted to the front-end by POST, not by GET, so the URL never changes
      • You can (and should) paginate results listings
      • You can't actually see the Solr server's URL from the front-end, not even in view-source
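The faceting and pagination points above come down to a handful of standard Solr request parameters that a wrapper like Solarium assembles for you on the server side. The Python sketch below builds such a parameter string by hand without contacting any server. The query and facet field names are illustrative assumptions, and this is not Solarium's API.

```python
from urllib.parse import parse_qs, urlencode


def build_solr_params(query, facet_fields, page=1, per_page=20):
    """Build query-string parameters for one page of faceted Solr results."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("start", str((page - 1) * per_page)),  # offset of the first result
        ("rows", str(per_page)),                # page size
        ("facet", "true"),
    ]
    # facet.field may repeat, once per field you choose to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


# Hypothetical search: page 3 of Adler records, faceted by two fields.
query_string = build_solr_params(
    "Surname:Adler", ["record_type", "record_source"], page=3
)
print(query_string)
```

Because the front-end code makes this request itself, the visitor's browser only ever talks to the front-end, which is how the Solr URL stays out of view.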
  • 65. Add-ons and nice-to-haves (for the front-end)
      • A welcome screen with information about the database's contents
      • Instructions (maybe twice)
      • How many records in the database?
      • How many datasets?
      • What features are coming next?
      • What datasets are coming next?
  • 66. Add-ons and nice-to-haves (for the front-end)
      • Make good UI choices
      • Pop-Up Google Maps
      • Tooltips to reduce UI clutter
      • Cross-browser compatibility
      • Still stuck with IE 7 and 8
      • CSS and code that degrades gracefully
      • No small text
  • 67. Bird's Eye View of Your Data
      • What (surnames, towns, etc.) do I have in my data?
      • What are the TOP (surnames, towns, etc.) in my data?
      • Finding incorrect data
          – Outlying years and dates
          – Figure out that hard-to-read surname
      • Make charts and graphs from your data
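Solr's facet counts answer these questions directly, and the same idea can be shown in plain Python with a frequency count. The sketch below finds the top values of a field and flags years outside an expected range. The records, field names, and year bounds are invented for the example, and years are kept as text because that is how the schema treats them.

```python
from collections import Counter


def top_values(records, field, n=3):
    """Return the n most common non-empty values of a field, with counts."""
    counts = Counter(record[field] for record in records if record.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Return records whose Year is non-numeric or outside the expected range."""
    suspects = []
    for record in records:
        year = record.get("Year", "")
        if not year.isdigit() or not earliest <= int(year) <= latest:
            suspects.append(record)
    return suspects


# Hypothetical records, for illustration only.
RECORDS = [
    {"Surname": "Adler", "Year": "1872"},
    {"Surname": "Adler", "Year": "1880"},
    {"Surname": "Kowalski", "Year": "1187"},  # probably a typo for 1871
    {"Surname": "Adler", "Year": "18?5"},     # hard-to-read entry
]

print(top_values(RECORDS, "Surname"))
print(outlying_years(RECORDS, earliest=1780, latest=1945))
```

The counts from `top_values` are also exactly the numbers you would feed into a chart or graph.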
  • 69. The (Back-End) Future! (Maybe.)
      • Date ranges, instead of just years
      • Auto-complete as you type
      • “Did you mean...?” (based on data frequency)
      • “More Like This” (would have to do scoring)
      • Record bookmarking system (hashes?)
  • 70. The (Front-End) Future! (Maybe.)
      • Hierarchical facets for locations
      • Disambiguating locations
      • Social sharing of individual records
      • New genealogy data schema http://historical-data.org/
      • Membership login system
  • 72. Please Do Not Build That Wall
      • Password protect some of the databases
      • Password protect some of the data
      • Open data, but pay for record or surname bookmarking system
      • Open data, but pay for API access
      • Open data, but sell online ads
      • Open data, but give people guilt trips
  • 73. Presenting LeafSeek!
      • Free and Open Source
      • Code is all on GitHub
      • Please add, edit, fix, change, tinker
      • …and use it!
  • 76. Why is this FREE? And why is this important?