Slinging Data: Data Loading and Cleanup in Evergreen
Growing Evergreen Conference, 22 April 2010
Presentation for the 2010 Evergreen Conference on migrating data to the Evergreen open source ILS.

  1. Slinging Data: Data Loading and Cleanup in Evergreen
       Growing Evergreen Conference, 22 April 2010
  2. To migrate data …
       • Extract from the old, map and load into the new, clean up along the way, and keep the auditor happy.
  3. Whence
       • Extract data in a convenient form:
       • Sometimes that means whatever you can get
       • But better is
           • MARC
           • Flat text
           • XML
  4. All over the map
       • Map entities
       • Map fields
       • Map values
       • Map policies
  5. All over the map
       • Entities
           • What is an item?
           • What is a patron?
       • Fields
           • Where does the patron PIN come from?
  6. All over the map
       • Values
           • Legacy item types
               • 0
               • 1
               • 45
               • 123
               • 234
       • Quick: which is the one for journal loan?
  7. All over the map

       Legacy Item Type   Circ Modifier
       0                  Regular
       1                  Media
       45                 AV
       123                Reference
       234                Reference
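The value mapping on slide 7 can be sketched as a lookup table. This is a minimal illustration, not the presenter's tooling: the dictionary simply restates the slide's table, and the function name `map_item_types` and the choice to tally unmapped values are assumptions made for the example.

```python
from collections import Counter

# Restates the slide's table: legacy item type -> Evergreen circ modifier.
CIRC_MODIFIER_BY_ITEM_TYPE = {
    "0": "Regular",
    "1": "Media",
    "45": "AV",
    "123": "Reference",
    "234": "Reference",
}


def map_item_types(legacy_types):
    """Map legacy item types to circ modifiers (hypothetical helper).

    Returns (mapped, unmapped): mapped holds one circ modifier per input,
    or None where no mapping exists; unmapped counts each legacy value
    that had no mapping, so it can be reviewed before loading.
    """
    mapped = []
    unmapped = Counter()
    for legacy in legacy_types:
        key = str(legacy).strip()  # extracts often carry stray whitespace
        modifier = CIRC_MODIFIER_BY_ITEM_TYPE.get(key)
        if modifier is None:
            unmapped[key] += 1
        mapped.append(modifier)
    return mapped, unmapped


if __name__ == "__main__":
    mapped, unmapped = map_item_types(["0", " 45", "999", "234", "999"])
    print(mapped)
    print(dict(unmapped))
```

Tallying the unmapped values, rather than silently defaulting them, gives a list of questions to take back to the library before the load.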
  8. Cleaning up
       • What?
       • Bad data
       • Ancient data
       • Data it is too expensive to deal with later
       • When?
       • Extract
       • Load
       • Post-load
  9. Don’t box me in!
       • The case of the dreaded double-encoding
       • The even more dreadful case of the duplicitous and multiplicitous character encoding
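One common form of double-encoding is UTF-8 bytes that were misread as Latin-1 and then encoded as UTF-8 again. The sketch below reverses that single case. The slide does not say which encodings were involved, so the Latin-1 default is an assumption (Windows-1252 is another frequent culprit), and `undo_double_encoding` is a hypothetical name.

```python
def undo_double_encoding(text, wrong_codec="latin-1"):
    """Reverse one round of UTF-8-read-as-`wrong_codec` double-encoding.

    If the text does not look double-encoded (the round trip fails),
    it is returned unchanged rather than damaged further.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    good = "Bront\u00eb"  # "Brontë"
    # Simulate the damage: UTF-8 bytes misread as Latin-1.
    garbled = good.encode("utf-8").decode("latin-1")
    print(garbled, "->", undo_double_encoding(garbled))
```

This does not handle the second case on the slide, where a single file mixes several encodings. That generally needs record-by-record or field-by-field detection, and the results should be counted and spot-checked.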
  10. Yes, those fixed fields really matter
       • The purpose of every modern ILS and discovery layer …
  11. Yes, those fixed fields really matter
       • … is to point out every fixed field coding error in a form convenient for catalogers to identify and fix.
  12. Fixed fields
  13. Oops!

       create or replace function m_foo.set_leader (TEXT, INT, TEXT) RETURNS TEXT AS $$
           # Set one character of the MARC leader; on any error, return the input unchanged.
           my ($marcxml, $pos, $value) = @_;
           use MARC::Record;
           use MARC::File::XML;
           my $xml = $marcxml;
           eval {
               my $marc = MARC::Record->new_from_xml($marcxml, 'UTF-8');
               my $leader = $marc->leader();
               substr($leader, $pos, 1) = $value;
               $marc->leader($leader);
               $xml = $marc->as_xml_record;
               $xml =~ s/^<\?.+?\?>$//mo;   # drop the XML declaration
               $xml =~ s/\n//sgo;           # remove newlines
               $xml =~ s/>\s+</></sgo;      # remove whitespace between tags
           };
           return $xml;
       $$ LANGUAGE PLPERLU STABLE;
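For checking a leader fix outside the database, the same idea can be sketched with Python's standard library. This is an illustrative stand-in rather than part of the presentation: it assumes the record is a MARCXML `record` element in the standard `http://www.loc.gov/MARC21/slim` namespace, and the function name merely echoes the slide's.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)  # serialize without an ns0: prefix


def set_leader(marcxml, pos, value):
    """Return marcxml with leader position `pos` (0-based) set to `value`.

    On a parse or validation problem the input is returned unchanged,
    like the eval-and-fall-back behaviour of the slide's function.
    """
    try:
        if len(value) != 1:
            raise ValueError("value must be a single character")
        record = ET.fromstring(marcxml)
        leader = record.find(f"{{{MARC_NS}}}leader")
        if leader is None or leader.text is None:
            raise ValueError("record has no leader")
        if not 0 <= pos < len(leader.text):
            raise ValueError("position outside the leader")
        leader.text = leader.text[:pos] + value + leader.text[pos + 1:]
        return ET.tostring(record, encoding="unicode")
    except (ET.ParseError, ValueError):
        return marcxml


if __name__ == "__main__":
    sample = (
        '<record xmlns="http://www.loc.gov/MARC21/slim">'
        "<leader>00000nam a2200000 a 4500</leader></record>"
    )
    # Leader/06 is the type of record; change it from "a" to "g".
    print(set_leader(sample, 6, "g"))
```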
  14. On stage
       • Postgres lets us create an elegant mechanism for staging data to be loaded into an Evergreen database:
       • Table inheritance
       • Sequences
  15. On stage
       • We want to be able to
       • Load and manipulate the data
       • … using every tool on our belt
       • … while ensuring that it doesn’t show up in production until it’s ready (and we’re ready)
  16. On stage
       • Make a separate schema
             psql> create schema m_foo;
       • Mirror a real table
             create table m_foo.asset_copy …
  17. On stage
       • Use the sequence
             … id bigint not null default nextval('asset.copy_id_seq'::regclass) …
  18. On stage
       • Make space for the legacy
             create table m_foo.asset_copy_legacy (
                 l_call_number TEXT
             ) inherits (m_foo.asset_copy);
  19. On stage
       • Munge
       • Munge
       • Munge some more, then …
       • Insert into production:
             insert into asset.copy
             select * from m_foo.asset_copy;
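The stage, munge, then load flow of slides 16 to 19 can be tried out without an Evergreen database. The sketch below swaps in SQLite from Python's standard library, so it is only an analogy: an attached in-memory database stands in for the `m_foo` schema, and SQLite has neither table inheritance nor shared sequences, so the staging table repeats the production columns by hand and production assigns ids at insert time. The table columns are a hypothetical, heavily simplified subset, and the `l_item_type` column and mapping are illustrative.

```python
import sqlite3

# Illustrative subset of a legacy-type -> circ modifier mapping.
ITEM_TYPE_TO_CIRC_MODIFIER = {"0": "Regular", "1": "Media", "45": "AV"}


def stage_munge_load(legacy_rows):
    """Stage (barcode, legacy item type) rows, munge them, then load.

    Returns (production rows before load, rows after load, loaded data).
    """
    conn = sqlite3.connect(":memory:")
    # SQLite stand-in for `create schema m_foo`: a second, attached database.
    conn.execute("ATTACH DATABASE ':memory:' AS m_foo")
    # Simplified stand-in for the production table.
    conn.execute(
        "CREATE TABLE main.asset_copy ("
        " id INTEGER PRIMARY KEY, barcode TEXT NOT NULL, circ_modifier TEXT)"
    )
    # Staging table: production columns plus an l_ column for legacy data.
    # (In Postgres the production columns would come from inheritance.)
    conn.execute(
        "CREATE TABLE m_foo.asset_copy_legacy ("
        " id INTEGER PRIMARY KEY, barcode TEXT, circ_modifier TEXT,"
        " l_item_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO m_foo.asset_copy_legacy (barcode, l_item_type)"
        " VALUES (?, ?)",
        legacy_rows,
    )
    # Munge in staging: translate legacy item types into circ modifiers.
    for legacy, modifier in ITEM_TYPE_TO_CIRC_MODIFIER.items():
        conn.execute(
            "UPDATE m_foo.asset_copy_legacy SET circ_modifier = ?"
            " WHERE l_item_type = ?",
            (modifier, legacy),
        )
    before = conn.execute("SELECT count(*) FROM main.asset_copy").fetchone()[0]
    # Only now does anything reach production, and only production columns.
    conn.execute(
        "INSERT INTO main.asset_copy (barcode, circ_modifier)"
        " SELECT barcode, circ_modifier FROM m_foo.asset_copy_legacy"
    )
    conn.commit()
    after = conn.execute("SELECT count(*) FROM main.asset_copy").fetchone()[0]
    loaded = conn.execute(
        "SELECT barcode, circ_modifier FROM main.asset_copy ORDER BY barcode"
    ).fetchall()
    conn.close()
    return before, after, loaded


if __name__ == "__main__":
    print(stage_munge_load([("1001", "0"), ("1002", "45")]))
```

In the Postgres version on the slides, selecting from the parent table `m_foo.asset_copy` also returns the rows stored in the inheriting `m_foo.asset_copy_legacy`, restricted to the parent's columns, which is why the final `insert … select *` needs no column list.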
  20. Counting
       • Who is the auditor?
       • It is you … and your patrons … and maybe even an actual auditor.
  21. Counting
       • Count what matters
           • Number of records
           • Number of dollars
           • Number of things you’ll have to fix manually
       • Don’t count what doesn’t matter
           • Header rows
           • Junk
  22. Counting
       • Count early and often
       • Conservation of library data is Newton’s 42nd law!
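The counting advice can be sketched for a flat-text extract as follows. This is an illustration under assumptions, not a prescribed procedure: it assumes a CSV extract with one header row, treats rows with no non-blank cells as junk, and uses a hypothetical `price` column for the dollar total. A malformed amount raises an error here, which is one way to surface a row that needs manual attention.

```python
import csv
import io
from decimal import Decimal


def tally_extract(text, amount_column):
    """Count real records, junk rows, and total dollars in a CSV extract."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the header row is not a record; do not count it
    amount_idx = header.index(amount_column)
    tally = {"records": 0, "junk": 0, "dollars": Decimal("0")}
    for row in reader:
        if not any(cell.strip() for cell in row):
            tally["junk"] += 1  # blank lines and rows of empty cells
            continue
        tally["records"] += 1
        tally["dollars"] += Decimal(row[amount_idx])
    return tally


def unaccounted(extracted, loaded, set_aside):
    """Records that were neither loaded nor deliberately set aside.

    Zero means the counts reconcile; anything else needs explaining.
    """
    return extracted - loaded - set_aside


if __name__ == "__main__":
    extract = "barcode,price\n1001,10.00\n\n,,\n1002,5.50\n"
    tally = tally_extract(extract, "price")
    print(tally)
    print(unaccounted(tally["records"], loaded=2, set_aside=0))
```

Running the same tally at extract, after staging, and after the production load is one way to "count early and often".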
  23. Tools
       • The usual suspects
           • MARC::Record (or pymarc, or ruby-marc, or …)
           • MARCEdit
           • yaz-marcdump
           • Spreadsheets
  24. And now something new
  25. Equinox Migration Tools
       • What?
       • MARC processing
       • Non-MARC processing
       • And more …
       • Where?
       • git://git.esilibrary.com/git/migration-tools.git
  26. Thanks!
       • Galen Charlton
       • VP for Data Services, Equinox Software Inc.
       • [email_address]
