I Can Convert

  1. I Can Convert!
     by Sven Aas and Jason Proctor
  2. I Can Convert!
     • Sven Aas: @svenaas / saas@mtholyoke.edu
     • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu
     • #TPR2
     ©2012 Sven Aas and Jason Proctor, Mount Holyoke College
  3. We’re going to talk about
     • Stories
     • Patterns
     • Tools
  4. Use Your Tools!
  5. Use Your Tools
     • Spreadsheet
     • Programmer’s Editor
     • Programming Language
  6. Spreadsheet
  7. Spreadsheet
  8. Programmer’s Editor
  9. Programmer’s Editor
  10. Programming Language
  11. Programming Language
  12. Use Your Tools! You’ve GOT this stuff.
  13. Getting Deported
  14. Portal News
  15. Unusual Data Representation
      Columns of the Data table: node, name, type, mode, owner, group, url, desc, parent, linkto, ctime, mtime, mod_by, visible, userdata, datasize, datafilename.
      Sample row:
      | 4692909 | G1158673129-8322 | 16 | rwlrwlr-l | 21139 | 71 1000009 1000010 1000011 1000012 1000013 1000014 1000015 1000016 1000017 1000018 1000019 1000020 | | | 2100709 | NULL | 1158673129 | 1170344089 | 21139 | 1 | 01|Second*Saturday: MHC Students Hit the Road|As part of new student orientation, members of the class of 2010 worked on community service projects across the Pioneer Valley on September 16. View the photo gallery.||http://www.mtholyoke.edu/offices/comm/news/sec_sat_06/page1.html|1158638400|1170305999|||||11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second Saturday:^:^:^:^0:^ | 2813 | V1158673129-9689 |
  16. Ruby to the Rescue
      [Diagram: Portal System (LegacyUser, LegacyItem), Importer, News System (User, Item, Story, Link, Channel)]
  17. ActiveRecord
      • A Ruby library which implements the ActiveRecord software architecture pattern.
      • The original Model and ORM component of Ruby on Rails.
      • We used it to provide a convenient object layer on top of two underlying relational databases.
  18. Conversion Patterns
  19. Object Extraction
      Context: Ingesting source data.
      Problem: Source data objects contain multiple target objects.
      Solution: Process or parse target data just enough to extract objects.
      Tools: String methods, RegEx, DOM/XML selection.
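As a minimal sketch of Object Extraction, the snippet below splits one pipe-delimited legacy record into two target objects using only String methods. The `Story` and `Link` structs, the `extract_objects` helper, and the field positions are assumptions made for illustration; the slides do not document the real record layout.

```ruby
# Hypothetical target objects; the attribute names are illustrative assumptions.
Story = Struct.new(:title, :body)
Link  = Struct.new(:url)

# Split one pipe-delimited legacy record into a Story and an optional Link.
# Field positions (1 = title, 2 = body, 4 = URL) are assumed for this sketch.
def extract_objects(record)
  fields = record.split("|", -1)   # -1 keeps trailing empty fields
  story  = Story.new(fields[1], fields[2])
  url    = fields[4].to_s
  link   = url.empty? ? nil : Link.new(url)
  [story, link]
end

record = "01|Students Hit the Road|As part of orientation|" \
         "|http://www.example.edu/news/page1.html|1158638400"
story, link = extract_objects(record)
puts story.title
puts link.url
```

The record is parsed only far enough to pull out the two objects; the remaining fields are ignored rather than modeled.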
  20. Encoding Change
      Context: Mapping source data to target.
      Problem: Source text encoding differs from target.
      Solution: Perform intermediate translation.
      Tools: String methods, RegEx, programming libraries.
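One way an Encoding Change can look in Ruby is shown below: label the raw bytes with their real source encoding, then transcode to the target encoding. Windows-1252 as the source and UTF-8 as the target are assumptions for this sketch, as is the `to_utf8` helper name; the slides do not say which encodings were involved.

```ruby
# Re-label raw bytes with their actual source encoding, then transcode to UTF-8.
# Windows-1252 as the default source encoding is an assumption for this sketch.
def to_utf8(bytes, source_encoding = "Windows-1252")
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

legacy = "\x93Caf\xE9\x94".b   # curly quotes and e-acute as Windows-1252 bytes
converted = to_utf8(legacy)
puts converted
puts converted.encoding
```

`force_encoding` only changes the label on the bytes, while `encode` rewrites the bytes themselves, so the order of the two calls matters.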
  21. URL/Path Translation
      Context: Preparing target environment and data.
      Problem: Assets in target system will be available at different paths or URLs from their locations in source system.
      Solution: Map source locations to target locations. Replace references in data before saving to target.
      Tools: String methods, RegEx.
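A small sketch of URL/Path Translation follows: keep the source-to-target mapping in one table and rewrite references before saving. The prefixes in `PATH_MAP` and the `translate_paths` helper are hypothetical; real mappings would come from the source and target systems.

```ruby
# Hypothetical mapping from source locations to target locations.
PATH_MAP = {
  "http://old.example.edu/offices/comm/news/" => "/news/archive/",
  "http://old.example.edu/images/"            => "/sites/default/files/images/"
}.freeze

# Replace every known source prefix in a block of text with its target prefix.
# Regexp.union escapes the keys, so dots and slashes are matched literally.
def translate_paths(text, map = PATH_MAP)
  pattern = Regexp.union(map.keys)
  text.gsub(pattern) { |source_prefix| map.fetch(source_prefix) }
end

puts translate_paths('<img src="http://old.example.edu/images/team.jpg">')
```

If one source prefix is itself a prefix of another, list the longer one first so it wins the match.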
  22. Getting the News Out
  23. Easy Come, Easy Go
      1. Export Athletics news items to hosted service.
      2. Export all news items to digital archives.
  24. Exporting Athletics Items
      • 10 years of Athletics news in 14 channels.
      • Export each item in a minimal, predictable HTML wrapper.
      • Include metadata for each item in <meta> tags in the <head>.
      • Group items by sport and by academic year.
      • Generally accommodate the target system.
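The export described above can be sketched in plain Ruby. This is not the authors' HAML template: it uses string building as a stand-in, and the helper names (`h`, `academic_year`, `export_item`), the item keys, the meta tag names, and the July 1 academic-year rollover are all assumptions for illustration.

```ruby
require "date"

HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape text for use in HTML content or attribute values.
def h(text)
  text.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Assumes the academic year rolls over on July 1; adjust to the real calendar.
def academic_year(date)
  start = date.month >= 7 ? date.year : date.year - 1
  "#{start}-#{start + 1}"
end

# Wrap one news item in a minimal HTML document with metadata in <meta> tags.
def export_item(item)
  meta = { "sport" => item[:sport], "academic-year" => academic_year(item[:date]) }
  lines = ["<html>", "<head>", "<title>#{h(item[:title])}</title>"]
  meta.each { |name, value| lines << %(<meta name="#{name}" content="#{h(value)}">) }
  lines += ["</head>", "<body>", item[:body_html], "</body>", "</html>"]
  lines.join("\n")
end

item = { title: "Lyons Win Opener", sport: "Soccer",
         date: Date.new(2006, 9, 16), body_html: "<p>Story text.</p>" }
puts export_item(item)
```

Because the sport and academic year travel with each file as metadata, the receiving system can group items without parsing the story body.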
  25. HAML
      • A lightweight markup language used to generate HTML.
      • A meta-markup language.
      • We used it to succinctly express the HTML we wanted from within our Ruby code.
  26. Archiving Web News
      • 14 years of news: 6,000 items, 5,000 images, 34 channels.
      • Export each news item in an archival form preserving the original markup and character entities (but not the design):
        • PDF generated from HTML generated from HAML
      • Export Dublin Core metadata for each news item:
        • XML generated via Builder
  27. Builder
      • A Ruby library for generating XML.
      • We used it to dynamically generate simple XML from within a Ruby application.
  28. wkhtmltopdf
      • A shell utility for generating PDF files by rendering HTML documents using the WebKit rendering engine.
      • A Ruby library providing programmatic access to the wkhtmltopdf shell utility.
      • We used it so that we could use familiar web development techniques to generate PDFs without having to implement our own rendering and layout routines.
  29. Familiar Patterns
      • Object Extraction
      • Encoding Change
      • URL/Path Translation
  30. Direct Translation
      Context: Simple conversion.
      Problem: Data conversion.
      Solution: Read source objects and write targets in single pass.
      Tools: Varies.
  31. Markup Change
      Context: Mapping source data to target.
      Problem: Source text markup differs from target.
      Solution: Perform intermediate translation.
      Tools: String methods, RegEx, DOM/XML selection, programming libraries.
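For small, regular differences, a Markup Change can be handled with a regular expression, as in the sketch below. The `TAG_MAP` table and `change_markup` helper are hypothetical; messier markup is usually better served by a DOM parser than by pattern matching.

```ruby
# Hypothetical mapping of presentational source tags to target tags.
TAG_MAP = { "b" => "strong", "i" => "em" }.freeze

# Rename opening and closing tags, keeping any attributes untouched.
def change_markup(html, map = TAG_MAP)
  names = map.keys.map { |name| Regexp.escape(name) }.join("|")
  html.gsub(%r{<(/?)(#{names})\b([^>]*)>}i) do
    "<#{$1}#{map.fetch($2.downcase)}#{$3}>"
  end
end

puts change_markup("<b>Bold</b> and <I>italic</I><br>")
```

The `\b` word boundary keeps `<b>` from matching longer tag names such as `<br>` or `<body>`.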
  32. Data Cleanup
      Context: Ingesting source data.
      Problem: Source data is ... imperfect.
      Solution: Fix what you can confidently fix.
      Tools: Varies.
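"Fix what you can confidently fix" can be sketched as a cleanup step that applies only safe repairs and flags everything else for a person to look at. The `Cleaned` struct, the `clean_title` helper, and the specific rules are assumptions for illustration.

```ruby
# A cleaned value plus a flag saying whether a human should review it.
Cleaned = Struct.new(:value, :needs_review)

# Safe fix: collapse runs of whitespace and trim the ends.
# Not safely fixable: an empty title, which is flagged rather than guessed at.
def clean_title(raw)
  value = raw.to_s.gsub(/\s+/, " ").strip
  Cleaned.new(value, value.empty?)
end

["  Second \n Saturday ", "   ", nil].each do |raw|
  result = clean_title(raw)
  puts "#{result.value.inspect} review=#{result.needs_review}"
end
```

Keeping the review flag separate from the value lets the conversion run to completion while still producing a worklist of items that need manual attention.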
  33. Convert All the Things!
  34. Finally Done with News?
      • HTML files scraped via Nokogiri scripts.
      • Quite a bit of cleanup: garbage in, garbage out.
      • Unscrapable news items.
      • “September 12, 2001”.
  35. Nokogiri
      • A Ruby library for parsing XML and HTML.
      • Supports DOM or SAX parsing.
      • Implements both XPath and CSS3 selectors.
      • We used it to parse and extract content from the set of HTML files containing existing news stories.
  36. Familiar Patterns
      • Direct Translation
      • Encoding Change
      • Markup Change
      • URL/Path Translation
      • Data Cleanup
  37. The Big One
  38. CMS Conversion
      • Old CMS pages all published with several different presentational styles, but all with the same DOM. That means we can scrape ’em!
      • We agreed not to change anything else during the import. That means we can treat it as a clean switchover.
  39. Three-Pronged Conversion
  40. Three-Pronged Conversion
      • Build the necessary structures and themes to accommodate and represent our old content.
      • Build a library of code for scraping the pages generated by the old site, cataloging data and metadata, and storing them in an intermediate representation.
      • Build a library of code for importing this intermediate representation into the new CMS structures.
  41. Migrate
      • A Drupal module providing a framework for data import into the Drupal content management system.
      • Supports a variety of sources and targets out of the box.
      • Extensible to support additional migration sources and targets.
      • We used it to import the XML representation of our site into our Drupal system.
  42. Familiar Patterns
      • Object Extraction
      • Encoding Change
      • Markup Change
      • URL/Path Translation
      • Data Cleanup
  43. Intermediate Representation
      Context: Complex conversion.
      Problem: Data conversion.
      Solution: Convert source data to intermediate representation in one pass. Then convert intermediate representation to target.
      Tools: Representation: Database, XML, CSV. Conversion: Varies.
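The two-pass shape of this pattern can be sketched as follows. JSON is used here as a stand-in intermediate representation only because it ships with Ruby; the slides name databases, XML, and CSV. The row layout and the `to_intermediate` and `to_target` helpers are hypothetical.

```ruby
require "json"

# Pass 1: source rows -> intermediate representation (JSON text).
# The "id|title|body" row layout is an assumption for this sketch.
def to_intermediate(source_rows)
  records = source_rows.map do |row|
    id, title, body = row.split("|", 3)
    { "id" => id, "title" => title, "body" => body }
  end
  JSON.generate(records)
end

# Pass 2: intermediate representation -> target markup.
def to_target(intermediate_json)
  JSON.parse(intermediate_json).map do |record|
    %(<article id="story-#{record["id"]}"><h1>#{record["title"]}</h1>#{record["body"]}</article>)
  end
end

intermediate = to_intermediate(["1|Hello|<p>Hi</p>", "2|Again|<p>Bye</p>"])
puts intermediate
puts to_target(intermediate)
```

Because the intermediate form can be saved and inspected, the second pass can be rerun and debugged without scraping the source again.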
  44. Object Identity
      Context: Ingesting source data.
      Problem: Data objects are repeated in source data.
      Solution: Uniquely identify source objects.
      Tools: String methods, RegEx, DOM/XML selection.
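One possible way to give repeated source objects a stable identity is to hash a normalized form of their distinguishing fields, as sketched below. Choosing title and URL as the identifying fields, and the `identity_for` and `unique_items` helpers, are assumptions for illustration; a natural key from the source system would be preferable where one exists.

```ruby
require "digest"

# Derive a stable identity from normalized fields.
# Using title and URL as the identifying fields is an assumption.
def identity_for(item)
  normalized = [item[:title], item[:url]].map { |v| v.to_s.strip.downcase }.join("\n")
  Digest::SHA1.hexdigest(normalized)
end

# Keep the first occurrence of each distinct object.
def unique_items(items)
  items.uniq { |item| identity_for(item) }
end

items = [
  { title: "Second Saturday", url: "http://www.example.edu/a" },
  { title: " second saturday ", url: "http://www.example.edu/a" },
  { title: "Convocation", url: "http://www.example.edu/b" }
]
puts unique_items(items).size
```

Normalizing before hashing means cosmetic differences such as case and stray whitespace do not produce separate identities.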
  45. Object Aggregation
      Context: Ingesting source data.
      Problem: Target data objects contain multiple source objects.
      Solution: Aggregate objects at intermediate or output stage.
      Tools: Varies.
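Object Aggregation is the mirror image of Object Extraction: several source objects are folded into one target object. A minimal sketch, assuming hypothetical item hashes with `:channel` and `:title` keys and an illustrative `aggregate_by_channel` helper:

```ruby
# Fold many source items into one target object per channel.
def aggregate_by_channel(items)
  items.group_by { |item| item[:channel] }.map do |channel, members|
    { channel: channel, titles: members.map { |member| member[:title] } }
  end
end

items = [
  { channel: "Soccer", title: "Season Opener" },
  { channel: "Tennis", title: "Invitational" },
  { channel: "Soccer", title: "Home Win" }
]
aggregate_by_channel(items).each do |target|
  puts "#{target[:channel]}: #{target[:titles].join(', ')}"
end
```

Doing the grouping at the intermediate or output stage keeps the ingest step simple, since each source object can still be read on its own.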
  46. Lessons
      • You already have a good toolbox. Keep your tools sharp.
      • Understand your source and target models.
      • Watch for familiar patterns.
      • Conversion is an opportunity for cleanup and improvement.
      • Human labor can sometimes be cheaper than automation.
  47. YOU Can Convert
  48. Questions?
  49. Thank you, & keep in touch!
      • Sven Aas: @svenaas / saas@mtholyoke.edu
      • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu
      • #TPR2
  50. Colophon
      • This presentation is set in Exo Extra Bold from Natanael Gama’s ndiscovered, with headings in ChunkFive from The League of Movable Type.
      • Background images were adapted from FreeSeamlessTextures.com’s Red Watercolor and The Grid, by Willem Pirquin.
  51. Colophon (continued)
      • Card-size survival tool photo via acreativeedge.info
      • Leatherman photo via SonnyandSandy
      • Studley Tool Chest photo via FineWoodworking.com
  52. Colophon (continued)
      • Audio from Wikipedia:Sound/List:
        • Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by the Skidmore College Orchestra.
        • W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik Eide.
        • Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman
        • J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel
        • Mississippi John Hurt - “Nobody’s Dirty Business”
  53. Colophon (continued)
      • Other Audio
        • Jack Beaver - “Workaday World”
        • Danny Elfman - “Breakfast Machine”