Slinging Data: Data Loading and Cleanup in Evergreen


Presentation for the 2010 Evergreen Conference on migrating data to the Evergreen open source ILS.


  1. Slinging Data: Data Loading and Cleanup in Evergreen
     Growing Evergreen Conference, 22 April 2010
  2. To migrate data …
     • Extract from the old, map and load into the new, clean up along the way, and keep the auditor happy.
  3. Whence
     • Extract data in a convenient form:
     • Sometimes that means whatever you can get
     • But better is
         • MARC
         • Flat text
         • XML
  4. All over the map
     • Map entities
     • Map fields
     • Map values
     • Map policies
  5. All over the map
     • Entities
         • What is an item?
         • What is a patron?
     • Fields
         • Where does the patron PIN come from?
  6. All over the map
     • Values
         • Legacy item types
             • 0
             • 1
             • 45
             • 123
             • 234
     • Quick: which is the one for journal loan?
  7. All over the map

     Legacy Item Type   Circ Modifier
     0                  Regular
     1                  Media
     45                 AV
     123                Reference
     234                Reference
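
The value map on this slide can be applied mechanically once it is written down. The following is a minimal sketch, not the presenter's tooling: the dictionary simply restates the slide's table, and the function name `map_item_types` is hypothetical. It reports codes that have no mapping instead of guessing, so they can be counted and resolved before load.

```python
# Lookup table restating the slide; a real migration would build this from
# the legacy system's own item-type list.
CIRC_MODIFIER_BY_LEGACY_TYPE = {
    "0": "Regular",
    "1": "Media",
    "45": "AV",
    "123": "Reference",
    "234": "Reference",
}


def map_item_types(legacy_types):
    """Return (mapped, unmapped).

    mapped   -- circ modifiers in input order, None where no mapping exists
    unmapped -- sorted list of legacy codes missing from the table
    """
    mapped = []
    unmapped = set()
    for code in legacy_types:
        key = str(code).strip()  # tolerate ints and stray whitespace
        modifier = CIRC_MODIFIER_BY_LEGACY_TYPE.get(key)
        if modifier is None:
            unmapped.add(key)
        mapped.append(modifier)
    return mapped, sorted(unmapped)
```

Calling `map_item_types(["0", 45, "999"])` would leave `"999"` in the unmapped list, which is one more number worth counting.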
  8. Cleaning up
     • What?
         • Bad data
         • Ancient data
         • Data that is too expensive to deal with later
     • When?
         • Extract
         • Load
         • Post-load
  9. Don’t box me in!
     • The case of the dreaded double-encoding
     • The even more dreadful case of the duplicitous and multiplicitous character encoding
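
A common form of double-encoding is UTF-8 bytes that were read as Latin-1 and then encoded as UTF-8 again. The sketch below undoes one such round under that assumption; the function name `fix_double_encoding` is hypothetical, and legacy data that passed through Windows-1252 or some other code page would need a different codec. Because it leaves unaffected strings alone, it can be applied field by field when a file mixes encodings.

```python
def fix_double_encoding(text):
    """Undo one round of UTF-8-read-as-Latin-1 double-encoding.

    Returns the input unchanged when it does not look double-encoded,
    i.e. when the round trip below fails.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8: in both cases assume the text is already correct.
        return text
```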
  10. Yes, those fixed fields really matter
      • The purpose of every modern ILS and discovery layer …
  11. Yes, those fixed fields really matter
      • … is to point out every fixed field coding error in a form convenient for catalogers to identify and fix.
  12. Fixed fields
  13. Oops!

      -- Set one character of the MARC leader in a MARCXML record.
      create or replace function m_foo.set_leader (TEXT, INT, TEXT) RETURNS TEXT AS $$
        my ($marcxml, $pos, $value) = @_;
        use MARC::Record;
        use MARC::File::XML;
        my $xml = $marcxml;   # fall back to the input if parsing fails
        eval {
          my $marc = MARC::Record->new_from_xml($marcxml, 'UTF-8');
          my $leader = $marc->leader();
          substr($leader, $pos, 1) = $value;
          $marc->leader($leader);
          $xml = $marc->as_xml_record;
          $xml =~ s/^<\?.+?\?>$//mo;   # strip the XML declaration
          $xml =~ s/\n//sgo;           # remove newlines
          $xml =~ s/>\s+</></sgo;      # drop whitespace between tags
        };
        return $xml;
      $$ LANGUAGE PLPERLU STABLE;
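
The same leader fix can be tried outside the database before it is run against staged records. This is a rough Python analogue using only the standard library, not the presenter's code: it assumes each input is a single MARCXML `record` element, and like the function on the slide it returns the input untouched when the record cannot be parsed.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize the MARC namespace as the default namespace (no ns0: prefixes).
ET.register_namespace("", MARC_NS)


def set_leader(marcxml, pos, value):
    """Return marcxml with leader position `pos` replaced by `value`.

    The input is returned unchanged if it is not well-formed XML, has no
    leader, or `pos` falls outside the leader.
    """
    try:
        record = ET.fromstring(marcxml)
    except ET.ParseError:
        return marcxml
    leader = record.find(f"{{{MARC_NS}}}leader")
    if leader is None:
        leader = record.find("leader")  # tolerate un-namespaced records
    if leader is None or leader.text is None:
        return marcxml
    if not 0 <= pos < len(leader.text):
        return marcxml
    leader.text = leader.text[:pos] + value + leader.text[pos + 1:]
    return ET.tostring(record, encoding="unicode")
```

For example, `set_leader(xml, 6, "j")` would recode leader position 6 (type of record) as a musical sound recording.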
  14. On stage
      • Postgres lets us create an elegant mechanism for staging data to be loaded into an Evergreen database:
          • Table inheritance
          • Sequences
  15. On stage
      • We want to be able to
      • Load and manipulate the data
      • … using every tool on our belt
      • … while ensuring that it doesn’t show up in production until it’s ready (and we’re ready)
  16. On stage
      • Make a separate schema
          psql> create schema m_foo;
      • Mirror a real table
          create table m_foo.asset_copy …
  17. On stage
      • Use the sequence
          … id bigint not null default nextval('asset.copy_id_seq'::regclass) …
  18. On stage
      • Make space for the legacy
          create table m_foo.asset_copy_legacy (
            l_call_number TEXT
          ) inherits (m_foo.asset_copy);
  19. On stage
      • Munge
      • Munge
      • Munge some more, then …
      • Insert into production:
          insert into asset.copy
          select * from m_foo.asset_copy;
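
The stage, munge, then insert flow can be rehearsed on a small scale. The sketch below only imitates its shape with Python's built-in `sqlite3` module: SQLite has no schemas, table inheritance, or shared sequences, so a single staging table carries both the production columns and the `l_`-prefixed legacy column. The table layout and the function name `stage_and_load` are hypothetical, not Evergreen's real `asset.copy` definition.

```python
import sqlite3


def stage_and_load(legacy_rows):
    """Stage (barcode, call_number) rows, clean them, and copy only the
    production columns into the production table.

    Returns the production rows as a list of (id, barcode) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE asset_copy (            -- stand-in for production
            id INTEGER PRIMARY KEY,
            barcode TEXT NOT NULL UNIQUE
        );
        CREATE TABLE staged_copy (           -- stand-in for the staging table
            id INTEGER PRIMARY KEY,
            barcode TEXT,
            l_call_number TEXT               -- legacy-only column
        );
    """)
    conn.executemany(
        "INSERT INTO staged_copy (barcode, l_call_number) VALUES (?, ?)",
        legacy_rows,
    )
    # Munge: trim stray whitespace, then drop rows that still lack a barcode.
    conn.execute("UPDATE staged_copy SET barcode = TRIM(barcode)")
    conn.execute("DELETE FROM staged_copy WHERE barcode IS NULL OR barcode = ''")
    # Production only ever receives the production columns.
    conn.execute(
        "INSERT INTO asset_copy (id, barcode) SELECT id, barcode FROM staged_copy"
    )
    rows = conn.execute("SELECT id, barcode FROM asset_copy ORDER BY id").fetchall()
    conn.close()
    return rows
```

In Postgres the explicit column list is unnecessary: selecting from the parent staging table returns the rows of its child tables with only the parent's columns, which is why the slide can use `select *`.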
  20. Counting
      • Who is the auditor?
      • It is you … and your patrons … and maybe even an actual auditor.
  21. Counting
      • Count what matters
          • Number of records
          • Number of dollars
          • Number of things you’ll have to fix manually
      • Don’t count what doesn’t matter
          • Header rows
          • Junk
  22. Counting
      • Count early and often
      • Conservation of library data is Newton’s 42nd law!
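
Conservation of data can be checked with very little code: every record extracted should end up either loaded or deliberately set aside. A minimal sketch, assuming a comma-delimited extract with one header row; the function names `count_records` and `unaccounted` are hypothetical.

```python
import csv
import io


def count_records(text, has_header=True):
    """Count data rows in delimited text, ignoring the header row and
    rows that are empty or contain only whitespace."""
    rows = [
        row
        for row in csv.reader(io.StringIO(text))
        if any(cell.strip() for cell in row)
    ]
    if has_header and rows:
        rows = rows[1:]
    return len(rows)


def unaccounted(extracted, loaded, rejected):
    """Return how many extracted records are neither loaded nor rejected.

    Zero means the counts balance; anything else needs explaining.
    """
    return extracted - (loaded + rejected)
```

Running the check after extract, after load, and again after post-load cleanup makes it clear at which stage records went missing.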
  23. Tools
      • The usual suspects
          • MARC::Record (or pymarc, or ruby-marc, or …)
          • MarcEdit
          • yaz-marcdump
          • Spreadsheets
  24. And now something new
  25. Equinox Migration Tools
      • What?
          • MARC processing
          • Non-MARC processing
          • And more …
      • Where?
          • git://
  26. Thanks!
      • Galen Charlton
      • VP for Data Services, Equinox Software Inc.
      • [email_address]