A Robust Open-source GEDCOM Parser

A Robust Open-source GEDCOM Parser presented by Dallan Quass and Ryan Knight at RootsTech 2012

Parses GEDCOM files into a "de facto" object model; includes round-tripping for the vast majority of GEDCOM files.


Transcript

  • 1. A Robust Open-source GEDCOM Parser. Dallan Quass [email_address], Ryan Knight [email_address]
  • 2. What's a GEDCOM?
      0 HEAD
      1 SOUR PAF
      2 NAME Personal Ancestral File
      2 VERS 5.2.18.0
      2 CORP The Church of Jesus Christ of Latter-day Saints
      3 ADDR 50 East North Temple Street
      4 CONT Salt Lake City, UT 84150
      4 CONT USA
      1 DEST Other
      1 DATE 9 Aug 2006
      2 TIME 19:57:47
      1 FILE temp-paf.ged
      1 GEDC
      2 VERS 5.5
      2 FORM LINEAGE-LINKED
      1 CHAR UTF-8
      1 LANG English
      1 SUBM @SUB1@
      0 @SUB1@ SUBM
      1 NAME Dallan Quass
      0 @I1@ INDI
      1 NAME Dallan /Quass/
      2 SURN Quass
      2 GIVN Dallan
    If this looks unfamiliar to you, you may not get a lot out of this talk. On the other hand, the purpose of this project is to handle this for you, so you can develop cool projects in genealogy and let this be unfamiliar to you!
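Each line in the sample on slide 2 has the same shape: a level number, an optional `@ID@` cross-reference, a tag, and an optional value. A minimal Java sketch of splitting one line into those parts follows. The class name `GedcomLine` and its fields are hypothetical and are not taken from the project; real files contain many irregularities this pattern does not try to handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One GEDCOM line split into its parts (hypothetical helper, not the project's API). */
final class GedcomLine {
    // level, optional @ID@, tag, optional value after a single space
    private static final Pattern LINE =
        Pattern.compile("^\\s*(\\d+)\\s+(?:(@[^@]+@)\\s+)?(\\S+)(?: (.*))?$");

    final int level;
    final String id;    // null when the line has no @ID@
    final String tag;
    final String value; // null when the line has no value

    private GedcomLine(int level, String id, String tag, String value) {
        this.level = level;
        this.id = id;
        this.tag = tag;
        this.value = value;
    }

    /** Parses a single line, or throws if it does not look like a GEDCOM line. */
    static GedcomLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a GEDCOM line: " + line);
        }
        return new GedcomLine(Integer.parseInt(m.group(1)), m.group(2), m.group(3), m.group(4));
    }
}
```

For example, `GedcomLine.parse("0 @I1@ INDI")` yields level 0 with an id and the tag `INDI`, while `GedcomLine.parse("1 SUBM @SUB1@")` yields the tag `SUBM` with the pointer as its value.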
  • 3. Why is parsing GEDCOMs so hard?
  • 4. Challenge #1 – Character set detection
      0 HEAD
      1 SOUR PAF
      2 NAME Personal Ancestral File
      2 VERS 5.2.18.0
      2 CORP The Church of Jesus Christ of Latter-day Saints
      3 ADDR 50 East North Temple Street
      4 CONT Salt Lake City, UT 84150
      4 CONT USA
      1 DEST Other
      1 DATE 9 Aug 2006
      2 TIME 19:57:47
      1 FILE temp-paf.ged
      1 GEDC
      2 VERS 5.5
      2 FORM LINEAGE-LINKED
      1 CHAR UTF-8
      1 LANG English
      1 SUBM @SUB1@
      0 @SUB1@ SUBM
      1 NAME Dallan Quass
      0 @I1@ INDI
      1 NAME Dallan /Quass/
      2 SURN Quass
      2 GIVN Dallan
    Should be easy, except...
  • 5. Challenge #1 – Character set detection
      • GeneWeb ASCII -> ANSI
      • Geni.com ANSEL -> UTF8
      • Geni.com UNICODE -> UTF8
      • GENJ UNICODE -> UTF8
      • All others UNICODE -> UTF16
      • ASCII/MacOS Roman -> x-MacRoman
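One way to read the table on slide 5 is "generating program, declared CHAR value -> encoding to actually use". Under that reading, the overrides could be sketched in Java as below. This is an assumption-laden illustration: the class and method names are hypothetical, the exact `SOUR` strings each program writes are guesses, the last row is treated as a declared value of `ASCII/MacOS Roman`, and the encoding names are returned as plain labels rather than Java `Charset` objects (ANSEL, for one, is not a built-in Java charset).

```java
import java.util.Locale;

/** Hypothetical lookup of the encoding to use, given what the GEDCOM header claims. */
final class CharsetGuess {

    /**
     * @param generator value of the header's SOUR line (assumed spelling)
     * @param declared  value of the header's CHAR line
     * @return label of the encoding to read the file with
     */
    static String actualEncoding(String generator, String declared) {
        String gen = generator == null ? "" : generator.toUpperCase(Locale.ROOT);
        String chr = declared == null ? "" : declared.toUpperCase(Locale.ROOT);

        // Program-specific overrides come first.
        if (gen.equals("GENEWEB") && chr.equals("ASCII")) return "ANSI";
        if (gen.equals("GENI.COM") && (chr.equals("ANSEL") || chr.equals("UNICODE"))) return "UTF-8";
        if (gen.equals("GENJ") && chr.equals("UNICODE")) return "UTF-8";

        // Rules that apply regardless of the generating program.
        if (chr.equals("UNICODE")) return "UTF-16";
        if (chr.equals("ASCII/MACOS ROMAN")) return "x-MacRoman";

        // Otherwise trust the header.
        return declared;
    }
}
```

Keeping the program-specific rows ahead of the general `UNICODE` rule matters, since the slide lists `UNICODE -> UTF16` only for "all others".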
  • 6. Challenge #1 – Character set detection ANSEL
  • 7. Challenge #2 – Custom tags. The GEDCOM specification hasn't been updated in a LONG time
  • 8. Challenge #3 – Misused tags
  • 9. Shout out: Tim Forsythe, VGed - GEDCOM validator, http://ancestorsnow.blogspot.com/2011/07/vged.html
  • 10. ALIA
      1 SEX M
      1 ALIA /Ted/
      1 BIRT
  • 11. SOUR
      0 @N6@ NOTE
      1 CONT adopted surname Termaat
      2 SOUR @S9@
  • 12. DATA
      2 SOUR @S2149874917@
      3 DATA
      4 DATE 11 Sep 1924
      3 NOTE ...
      3 DATA
      4 TEXT ...
      2 SOUR @S99@
      3 DATA
      4 TEXT William Donald ...
      4 DATE 1 Sep 1997
      2 SOUR @S28@
      3 PAGE Indian Prarie...
      3 QUAY 3
      3 DATE 28 Feb 2005
  • 13. Challenge #4 – Unused tags: Event Phone, Event Agency, Source Citation Event Type
  • 14. Challenge #5 – Names
  • 15. GEDCOM Standard? The code is more what you'd call "guidelines" than actual rules.
  • 16. Two goals
  • 17. Goal #1 – Parse GEDCOMs into a de facto object model. De Facto: In fact or in practice; in actual use or existence, regardless of official or legal status. – Wiktionary. Model should be straightforward, easy to use and understand
  • 18. Goal #2 – Round-trip: from GEDCOM to object model and back to GEDCOM without information loss
  • 19. Nirvana
  • 20. There is no Nirvana
  • 21. But we can get pretty close: 94%
  • 22. How is it done? ???
  • 23. Object model
  • 24. People
  • 25. Extensions
  • 26. GedML
      • Originally by Michael Kay
      • http://users.breathe.com/mhkay/gedml/
      • Enhanced by Lynn Monson
      • http://lmonson.com/blog/?page_id=64
      • Further enhanced by Nathan Powell & Dallan Quass
      • part of this project
    GEDCOM -> SAX events; ANSEL reader & writer
  • 27. Parser
      • Written in Java
          • ~1500 LoC for parser + ~4000 LoC for POJOs
      • Handles SAX events emitted by GedML
      • Separate functions called to handle each tag
      • Maintains a stack of model objects
      • Attach unexpected tags to model objects as extensions
      • Fast
      • Easily extendible
      • Tree parser also available
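Three of the ideas on slide 27 (a separate handler per tag, a stack of model objects, and unexpected tags kept as extensions) can be illustrated with a rough Java sketch. This is not the project's code: it reads raw lines directly instead of handling SAX events from GedML, and the names `ModelObject`, `StackParser`, `known` and `extensions` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Generic stand-in for a model POJO (hypothetical). */
final class ModelObject {
    final String tag, id, value;
    final Map<String, ModelObject> known = new HashMap<>();   // children with a handler
    final List<ModelObject> extensions = new ArrayList<>();   // unexpected tags, kept as-is
    ModelObject(String tag, String id, String value) {
        this.tag = tag; this.id = id; this.value = value;
    }
}

final class StackParser {
    // One handler per expected tag; each decides where the child goes on its parent.
    private final Map<String, BiConsumer<ModelObject, ModelObject>> handlers = new HashMap<>();

    StackParser() {
        handlers.put("NAME", (parent, child) -> parent.known.put("name", child));
        handlers.put("SURN", (parent, child) -> parent.known.put("surname", child));
        handlers.put("SEX", (parent, child) -> parent.known.put("sex", child));
    }

    List<ModelObject> parse(List<String> lines) {
        List<ModelObject> records = new ArrayList<>();
        Deque<ModelObject> stack = new ArrayDeque<>();
        for (String line : lines) {
            String[] p = line.trim().split(" ", 2);
            int level = Integer.parseInt(p[0]);
            String rest = p.length > 1 ? p[1] : "";
            String id = null;
            int space = rest.indexOf(' ');
            if (rest.startsWith("@") && space > 0) {   // "0 @I1@ INDI" style record line
                id = rest.substring(0, space);
                rest = rest.substring(space + 1);
            }
            String[] tv = rest.split(" ", 2);
            ModelObject obj = new ModelObject(tv[0], id, tv.length > 1 ? tv[1] : null);

            // Pop back to this line's parent: the stack depth equals the current level.
            while (stack.size() > level) stack.pop();
            if (stack.isEmpty()) {
                records.add(obj);
            } else {
                ModelObject parent = stack.peek();
                BiConsumer<ModelObject, ModelObject> h = handlers.get(obj.tag);
                if (h != null) h.accept(parent, obj);
                else parent.extensions.add(obj);       // unexpected tag: keep, don't drop
            }
            stack.push(obj);
        }
        return records;
    }
}
```

Keeping unknown tags on the object they appeared under is what leaves a round-trip possible, since nothing is discarded at parse time.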
  • 28. GEDCOM Export: Visitor pattern, 600 LoC
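Slide 28 names the Visitor pattern for export. A small Java sketch of that pattern applied to a toy model is shown here, purely as an illustration. The types `ModelVisitor`, `Person`, `Name` and `ExportVisitor` are hypothetical and far smaller than the project's POJOs, and the output levels are hard-coded for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Visitor over a toy model (hypothetical; the real model has many more types). */
interface ModelVisitor {
    void visitPerson(Person person);
    void visitName(Name name);
}

final class Name {
    final String value;
    final String surname; // may be null
    Name(String value, String surname) { this.value = value; this.surname = surname; }
    void accept(ModelVisitor v) { v.visitName(this); }
}

final class Person {
    final String id;
    final List<Name> names = new ArrayList<>();
    Person(String id) { this.id = id; }
    // The model object drives the traversal; the visitor only decides what to emit.
    void accept(ModelVisitor v) {
        v.visitPerson(this);
        for (Name n : names) n.accept(v);
    }
}

/** Writes GEDCOM text as it visits each model object. */
final class ExportVisitor implements ModelVisitor {
    final StringBuilder out = new StringBuilder();

    @Override public void visitPerson(Person person) {
        out.append("0 ").append(person.id).append(" INDI\n");
    }

    @Override public void visitName(Name name) {
        out.append("1 NAME ").append(name.value).append('\n');
        if (name.surname != null) {
            out.append("2 SURN ").append(name.surname).append('\n');
        }
    }
}
```

The appeal of the pattern in this setting is that the model classes stay free of output logic, so another visitor (for a different format, say) can reuse the same traversal.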
  • 29. JSON: GEDCOM -> POJO -> JSON -> POJO -> GEDCOM. Simple model persistence using Google GSON
  • 30. Further thoughts
  • 31. Do we need a radically-different data-exchange model for genealogy?
  • 32. I don't know. A new proposed object model could use this project to migrate existing GEDCOMs to the de facto model, then translate the de facto model objects to the new model
  • 33. Do we need GEDCOM validation tools?
  • 34. Definitely! A list of “standard” custom tags would also be pretty helpful
  • 35. We live in the real world
  • 36. Purpose of this project
  • 37. Demonstration of Gedcom Server
      • Demonstrates GEDCOM -> model -> JSON -> model -> GEDCOM
      • Built with Play 1.2.4 - A Java Web framework
          • Allows for rapid development of web applications with a fully integrated stack
      • Deployed to Heroku – Cloud Application Platform
          • Heroku allows one step deployment with git
  • 38. Demonstration of Gedcom Server
  • 39. Demonstration of Gedcom Server
  • 40. Conclusion
      • Parsing GEDCOMs is hard
          • it's like parsing HTML in the 1990s
      • But getting it right is pretty important
          • especially if you want to retain existing information
      • Open source algorithm is now freely available
          • http://github.com/DallanQ/Gedcom
          • simple object model with extensions, 94% round-trip
      • Hopefully others will benefit from this effort
    Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license