I Can Convert

Published in: Technology

Transcript

  • 1. I Can Convert! by Sven Aas and Jason Proctor
  • 2. I Can Convert! • Sven Aas: @svenaas / saas@mtholyoke.edu • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu • #TPR2 ©2012 Sven Aas and Jason Proctor, Mount Holyoke College
  • 3. We’re going to talk about • Stories • Patterns • Tools
  • 4. Use Your Tools!
  • 5. Use Your Tools • Spreadsheet • Programmer’s Editor • Programming Language
  • 6. Spreadsheet
  • 7. Spreadsheet
  • 8. Programmer’s Editor
  • 9. Programmer’s Editor
  • 10. Programming Language
  • 11. Programming Language
  • 12. Use Your Tools! You’ve GOT this stuff.
  • 13. Getting Deported
  • 14. Portal News
  • 15. Unusual Data Representation [Figure: one legacy record, shown as a boxed “Data” field list beside its raw values. Fields: node, name, type, mode, owner, group, url, desc, parent, linkto, ctime, mtime, mod_by, visible, userdata, datasize, datafilename. Raw values, in extraction order: 4692909 | G1158673129-8322 | 16 | rwlrwlr-l | 21139 | 71 1000009 1000010 1000011 1000012 1000013 1000014 1000015 1000016 1000017 1000018 1000019 1000020 | | | 2100709 | NULL | 1158673129 | 1170344089 | 21139 | 1 | 01|Second Saturday: MHC Students Hit the Road|As part of new student orientation, members of the class of 2010 worked on community service projects across the Pioneer Valley on September 16. View the photo gallery.||http://www.mtholyoke.edu/offices/comm/news/sec_sat_06/page1.html|1158638400|1170305999||||| 11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second Saturday:^:^:^:^0:^ | 2813 | V1158673129-9689]
  • 16. Ruby to the Rescue [Diagram; labels in extraction order: LegacyUser, User, Item, Portal News, Importer, System, System, LegacyItem, Story, Link, Channel]
  • 17. ActiveRecord • A Ruby library which implements the ActiveRecord software architecture pattern. • The original Model and ORM component of Ruby on Rails. • We used it to provide a convenient object layer on top of two underlying relational databases.
  • 18. Conversion Patterns
  • 19. Object Extraction. Context: Ingesting source data. Problem: Source data objects contain multiple target objects. Solution: Process or parse target data just enough to extract objects. Tools: String methods, RegEx, DOM/XML selection.
  • 20. Encoding Change. Context: Mapping source data to target. Problem: Source text encoding differs from target. Solution: Perform intermediate translation. Tools: String methods, RegEx, programming libraries.
  • 21. URL/Path Translation. Context: Preparing target environment and data. Problem: Assets in target system will be available at different paths or URLs from their locations in source system. Solution: Map source locations to target locations. Replace references in data before saving to target. Tools: String methods, RegEx, DOM/XML selection.
  • 22. Getting the News Out
  • 23. Easy Come, Easy Go 1. Export Athletics news items to hosted service. 2. Export all news items to digital archives.
  • 24. Exporting Athletics Items • 10 years of Athletics news in 14 channels. • Export each item in a minimal, predictable HTML wrapper. • Include metadata for each item in <meta> tags in the <head>. • Group items by sport and by academic year. • Generally accommodate the target system.
  • 25. HAML • A lightweight markup language used to generate HTML. • A meta-markup language. • We used it to succinctly express the HTML we wanted from within our Ruby code.
  • 26. Archiving Web News • 14 years of news: 6,000 items, 5,000 images, 34 channels. • Export each news item in an archival form preserving the original markup and character entities (but not the design): • PDF generated from HTML generated from HAML • Export Dublin Core metadata for each news item: • XML generated via Builder
  • 27. Builder • A Ruby library for generating XML. • We used it to dynamically generate simple XML from within a Ruby application.
  • 28. wkhtmltopdf • A shell utility for generating PDF files by rendering HTML documents using the WebKit rendering engine. • A Ruby library providing programmatic access to the wkhtmltopdf shell utility. • We used it so that we could use familiar web development techniques to generate PDFs without having to implement our own rendering and layout routines.
  • 29. Familiar Patterns • Object Extraction • Encoding Change • URL/Path Translation
  • 30. Direct Translation. Context: Simple conversion. Problem: Data conversion. Solution: Read source objects and write targets in single pass. Tools: Varies.
  • 31. Markup Change. Context: Mapping source data to target. Problem: Source text markup differs from target. Solution: Perform intermediate translation. Tools: String methods, RegEx, DOM/XML selection, programming libraries.
  • 32. Data Cleanup. Context: Ingesting source data. Problem: Source data is ... imperfect. Solution: Fix what you can confidently fix. Tools: Varies.
  • 33. Convert All the Things!
  • 34. Finally Done with News? • HTML files scraped via Nokogiri scripts. • Quite a bit of cleanup: garbage in, garbage out. • Unscrapable news items. • “September 12, 2001”.
  • 35. Nokogiri • A Ruby library for parsing XML and HTML. • Supports DOM or SAX parsing. • Implements both XPath and CSS3 selectors. • We used it to parse and extract content from the set of HTML files containing existing news stories.
  • 36. Familiar Patterns • Direct Translation • Encoding Change • Markup Change • URL/Path Translation • Data Cleanup
  • 37. The Big One
  • 38. CMS Conversion • Old CMS pages were published in several different presentational styles, but all with the same DOM. That means we can scrape ’em! • We agreed not to change anything else during the import. That means we can treat it as a clean switchover.
  • 39. Three-Pronged Conversion
  • 40. Three-Pronged Conversion • Build the necessary structures and themes to accommodate and represent our old content. • Build a library of code for scraping the pages generated by the old site, cataloging data and metadata, and storing them in an intermediate representation. • Build a library of code for importing this intermediate representation into the new CMS structures.
  • 41. Migrate • A Drupal module providing a framework for data import into the Drupal content management system. • Supports a variety of sources and targets out of the box. • Extensible to support additional migration sources and targets. • We used it to import the XML representation of our site into our Drupal system.
  • 42. Familiar Patterns • Object Extraction • Encoding Change • Markup Change • URL/Path Translation • Data Cleanup
  • 43. Intermediate Representation. Context: Complex conversion. Problem: Data conversion. Solution: Convert source data to intermediate representation in one pass. Then convert intermediate representation to target. Tools: Representation: Database, XML, CSV. Conversion: Varies.
  • 44. Object Identity. Context: Ingesting source data. Problem: Data objects are repeated in source data. Solution: Uniquely identify source objects. Tools: String methods, RegEx, DOM/XML selection.
  • 45. Object Aggregation. Context: Ingesting source data. Problem: Target data objects contain multiple source objects. Solution: Aggregate objects at intermediate or output stage. Tools: Varies.
  • 46. Lessons • You already have a good toolbox. Keep your tools sharp. • Understand your source and target models. • Watch for familiar patterns. • Conversion is an opportunity for cleanup and improvement. • Human labor can sometimes be cheaper than automation.
  • 47. YOU Can Convert
  • 48. Questions?
  • 49. Thank you, & keep in touch! • Sven Aas: @svenaas / saas@mtholyoke.edu • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu • #TPR2
  • 50. Colophon • This presentation is set in Exo Extra Bold from Natanael Gama’s ndiscovered, with headings in ChunkFive from The League of Movable Type. • Background images were adapted from FreeSeamlessTextures.com’s Red Watercolor and The Grid, by Willem Pirquin.
  • 51. Colophon (continued) • Card-size survival tool photo via acreativeedge.info • Leatherman photo via SonnyandSandy • Studley Tool Chest photo via FineWoodworking.com
  • 52. Colophon (continued) • Audio from Wikipedia:Sound/List: • Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by the Skidmore College Orchestra. • W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik Eide. • Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman • J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel • Mississippi John Hurt - “Nobody’s Dirty Business”
  • 53. Colophon (continued) • Other Audio • Jack Beaver - “Workaday World” • Danny Elfman - “Breakfast Machine”
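
The URL/Path Translation pattern from slide 21 (map source locations to target locations, then replace references in the data before saving) could be sketched roughly as follows. This is a minimal illustration using only Ruby string methods and a regular expression, two of the tools the slide names. It is not the authors' code. The target paths in `PATH_MAP`, and the helper names `translate_path` and `translate_references`, are hypothetical.

```ruby
# Hypothetical mapping from old-site path prefixes to new-site path prefixes.
# The target paths are invented for illustration; they are not from the talk.
PATH_MAP = {
  "/offices/comm/news/" => "/news/archive/",
  "/images/old/"        => "/media/"
}.freeze

# Matches href/src attribute values, so only references are rewritten
# and ordinary body text that happens to contain a path is left alone.
REFERENCE = /(href|src)=(["'])(.*?)\2/

# Translate one path: swap the first matching source prefix for its target.
# Paths with no known prefix are returned unchanged.
def translate_path(path)
  prefix = PATH_MAP.keys.find { |old| path.start_with?(old) }
  prefix ? path.sub(prefix, PATH_MAP[prefix]) : path
end

# Rewrite every href/src reference in a chunk of markup.
def translate_references(html)
  html.gsub(REFERENCE) do
    attr, quote, target = $1, $2, $3
    "#{attr}=#{quote}#{translate_path(target)}#{quote}"
  end
end
```

A regular expression is enough for a sketch like this. For real markup, a DOM selection tool such as the Nokogiri library described on slide 35 is the more robust way to find the references.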

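
The Intermediate Representation and Object Identity patterns (slides 43 and 44) combine naturally: a first pass reads the source and writes uniquely identified objects to an intermediate form, and a second pass turns that form into target objects. The sketch below assumes JSON as the intermediate representation purely for brevity (the slides list database, XML, and CSV, and the CMS conversion used XML), and assumes that a story's URL identifies it. The record fields and the function names `to_intermediate` and `to_target` are hypothetical.

```ruby
require "json"
require "digest/sha1"

# Pass 1: source records -> intermediate representation (JSON here, by assumption).
# Object Identity: a record's URL is hashed into a stable id, so a story that
# appears more than once in the source is stored only once.
def to_intermediate(source_records)
  items = {}
  source_records.each do |record|
    id = Digest::SHA1.hexdigest(record[:url])
    # Keep the first occurrence; light cleanup (strip) happens on the way in.
    items[id] ||= { "id" => id, "title" => record[:title].strip, "url" => record[:url] }
  end
  JSON.generate(items.values)
end

# Pass 2: intermediate representation -> target objects.
# The target field names are hypothetical stand-ins for a CMS import format.
def to_target(intermediate_json)
  JSON.parse(intermediate_json).map do |item|
    { node_title: item["title"], legacy_id: item["id"], path: item["url"] }
  end
end
```

Because the two passes only share the intermediate form, either side can be rerun or rewritten without touching the other.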