I Can Convert!
by Sven Aas and Jason Proctor
I Can Convert!
•   Sven Aas: @svenaas / saas@mtholyoke.edu

•   Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu

•   #TPR2




                                              ©2012 Sven Aas and Jason Proctor, Mount Holyoke College
We’re going to talk about
•   Stories

•   Patterns

•   Tools




Use Your Tools!



Use Your Tools
•   Spreadsheet

•   Programmer’s Editor

•   Programming Language




Spreadsheet




Programmer’s Editor




Programming Language




Use Your Tools!
  You’ve GOT this stuff.




Getting Deported



Portal News




Unusual Data Representation
 +--------------+
 | Data         |
 +--------------+
 | node         |
 | name         |
 | type         |
 | mode         |
 | owner        |
 | group        |
 | url          |
 | desc         |
 | parent       |
 | linkto       |
 | ctime        |
 | mtime        |
 | mod_by       |
 | visible      |
 | userdata     |
 | datasize     |
 | datafilename |
 +--------------+

 Sample row (wrapped):
 | 4692909 | G1158673129-8322 |   16 | rwlrwlr-l | 21139 |
 71 1000009 1000010 1000011 1000012 1000013 1000014 1000015 1000016
 1000017 1000018 1000019 1000020 |      |      | 2100709 |   NULL |
 1158673129 | 1170344089 | 21139  |       1 |
 01|Second*Saturday: MHC Students Hit the Road|As part of new student
 orientation, members of the class of 2010 worked on community service
 projects across the Pioneer Valley on September 16. View the photo
 gallery.||http://www.mtholyoke.edu/offices/comm/news/sec_sat_06/page1.html|
 1158638400|1170305999|||||11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second
 Saturday:^:^:^:^0:^ |     2813 | V1158673129-9689 |


Ruby to the Rescue
          LegacyUser                  User

                                      Item
 Portal                                                              News
                       Importer
System                                                              System
          LegacyItem              Story      Link

                                    Channel




ActiveRecord
•   A Ruby library which implements the ActiveRecord software
    architecture pattern.

•   The original Model and ORM component of Ruby on Rails.

•   We used it to provide a convenient object layer on top of two
    underlying relational databases.




Conversion Patterns



Object Extraction
Context: Ingesting source data.

Problem: Source data objects contain multiple target objects.

Solution: Process or parse source data just enough to extract
the target objects.

Tools: String methods, RegEx, DOM/XML selection.
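
A minimal Ruby sketch of this pattern, using only String methods. It assumes a hypothetical source field that packs several "title|url" links into one string separated by ";;". The delimiters, the Link struct, and extract_links are illustrative and do not come from the talk.

```ruby
# Hypothetical: one source field packs several "title|url" links, separated by ";;".
Link = Struct.new(:title, :url)

# Parse just enough of the source string to pull out the target objects.
def extract_links(packed)
  packed.split(";;").map(&:strip).reject(&:empty?).map do |chunk|
    title, url = chunk.split("|", 2)
    Link.new(title.strip, url.to_s.strip)
  end
end

source = "Photo gallery|http://example.edu/gallery ;; Full story|http://example.edu/story"
extract_links(source).each { |link| puts "#{link.title} -> #{link.url}" }
```

Each extracted Link can then be saved as its own object in the target system.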



Encoding Change
Context: Mapping source data to target.

Problem: Source text encoding differs from target.

Solution: Perform intermediate translation.

Tools: String methods, RegEx, programming libraries.
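
A small Ruby sketch of an intermediate translation, assuming the source bytes are ISO-8859-1 and the target expects UTF-8. The talk does not say which encodings were involved, so both are assumptions, and latin1_to_utf8 is a hypothetical helper name.

```ruby
# Assumption: source bytes are ISO-8859-1 (Latin-1) and the target expects UTF-8.
def latin1_to_utf8(bytes)
  # Label the raw bytes with their real encoding, then transcode them.
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "Caf\xE9 r\xE9sum\xE9".b   # raw bytes as they might arrive from a legacy system
text = latin1_to_utf8(raw)
puts text            # Café résumé
puts text.encoding   # UTF-8
```

Labeling the bytes first matters: calling encode on a string whose encoding label is wrong either raises or produces mojibake.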




URL/Path Translation
Context: Preparing target environment and data.

Problem: Assets in target system will be available at different
paths or URLs from their locations in source system.

Solution: Map source locations to target locations. Replace
references in data before saving to target.

Tools: String methods, RegEx, DOM/XML selection.
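
One way this mapping might look in Ruby, using a regular expression over href and src attributes. The source prefix is borrowed from the sample URL shown earlier. The target prefixes, PATH_MAP, and translate_paths are hypothetical, and a regex like this is only adequate for regular, double-quoted attributes.

```ruby
# Hypothetical mapping from old path prefixes to new ones (targets are invented).
PATH_MAP = {
  "/offices/comm/news/" => "/news/archive/",
  "/images/portal/"     => "/sites/default/files/news/"
}.freeze

# Rewrite every href/src value whose path starts with a mapped source prefix.
def translate_paths(html, map = PATH_MAP)
  html.gsub(/\b(href|src)="([^"]*)"/) do
    attr = Regexp.last_match(1)
    url  = Regexp.last_match(2)
    old_prefix, new_prefix = map.find { |old, _new| url.start_with?(old) }
    url = url.sub(old_prefix, new_prefix) if old_prefix
    %(#{attr}="#{url}")
  end
end

puts translate_paths('<a href="/offices/comm/news/sec_sat_06/page1.html">Gallery</a>')
```

References that match no prefix pass through unchanged, so they can be collected and reviewed separately.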


Getting the News Out



Easy Come, Easy Go
1. Export Athletics news items to hosted service.

2. Export all news items to digital archives.




Exporting Athletics Items
•   10 years of Athletics news in 14 channels.

•   Export each item in a minimal, predictable HTML wrapper.

•   Include metadata for each item in <meta> tags in the <head>.

•   Group items by sport and by academic year.

•   Generally accommodate the target system.
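
A rough Ruby sketch of the wrapper idea. The talk generated this HTML with HAML. This sketch swaps in a plain heredoc plus ERB::Util.html_escape from the standard library so it runs without extra gems. The field names (sport, academic_year) and render_item are assumptions, not the actual export format.

```ruby
require "erb"

# Escape text for safe use in HTML text and attribute values.
def h(text)
  ERB::Util.html_escape(text.to_s)
end

# Build a minimal, predictable HTML page with item metadata in <meta> tags.
def render_item(title:, body_html:, meta:)
  meta_tags = meta.map { |name, value| %(  <meta name="#{h(name)}" content="#{h(value)}">) }
  <<~HTML
    <!DOCTYPE html>
    <html>
    <head>
      <title>#{h(title)}</title>
    #{meta_tags.join("\n")}
    </head>
    <body>
    #{body_html}
    </body>
    </html>
  HTML
end

puts render_item(title: "Lyons Win Opener", body_html: "<p>Story text</p>",
                 meta: { "sport" => "Soccer", "academic_year" => "2011-12" })
```

The body is inserted without escaping because it is assumed to be HTML already.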


HAML
•   A lightweight markup language used to generate HTML.

•   A meta-markup language.

•   We used it to succinctly express the HTML we wanted from
    within our Ruby code.




Archiving Web News
•   14 years of news: 6,000 items, 5,000 images, 34 channels.

•   Export each news item in an archival form preserving the
    original markup and character entities (but not the design):

    •   PDF generated from HTML generated from HAML

•   Export Dublin Core metadata for each news item:

    •   XML generated via Builder

Builder
•   A Ruby library for generating XML.

•   We used it to dynamically generate simple XML from within a
    Ruby application.




wkhtmltopdf
•   A shell utility for generating PDF files by rendering HTML
    documents using the WebKit rendering engine.

•   A Ruby library providing programmatic access to the
    wkhtmltopdf shell utility.

•   We used it so that we could use familiar web development
    techniques to generate PDFs without having to implement our
    own rendering and layout routines.

Familiar Patterns
•   Object Extraction

•   Encoding Change

•   URL/Path Translation




Direct Translation
Context: Simple conversion.

Problem: Data conversion.

Solution: Read source objects and write targets in single pass.

Tools: Varies.
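
A single-pass conversion might be sketched in Ruby as below. The tab-separated source format, the JSON target, the sample records, and the translate method are all hypothetical. The point is only that each record is read, converted, and written before the next one is read.

```ruby
require "json"
require "stringio"

# Hypothetical source: one tab-separated "date<TAB>title" record per line.
# Returns the number of records written to the target.
def translate(source_io, target_io)
  count = 0
  source_io.each_line do |line|
    date, title = line.chomp.split("\t", 2)
    next if title.nil?   # skip lines that do not fit the expected shape
    target_io.puts JSON.generate({ "title" => title, "published" => date })
    count += 1
  end
  count
end

source = StringIO.new("2006-09-16\tSecond Saturday\n2006-09-20\tConvocation\n")
translate(source, $stdout)
```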




Markup Change
Context: Mapping source data to target.

Problem: Source text markup differs from target.

Solution: Perform intermediate translation.

Tools: String methods, RegEx, DOM/XML selection,
programming libraries.
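
A small Ruby sketch of a markup translation using regular expressions. The rules here (b to strong, i to em, drop font wrappers) are invented for illustration. TAG_MAP and convert_markup are hypothetical names, and for irregular markup a DOM parser is usually the safer tool.

```ruby
# Hypothetical rules: rewrite presentational tags as semantic ones, drop <font>.
TAG_MAP = { "b" => "strong", "i" => "em" }.freeze

def convert_markup(html)
  # Matches only bare <b>, </b>, <i>, </i>, so tags like <br> are left alone.
  converted = html.gsub(%r{<(/?)(b|i)>}i) { "<#{$1}#{TAG_MAP[$2.downcase]}>" }
  # Remove opening and closing <font> tags but keep their contents.
  converted.gsub(%r{</?font\b[^>]*>}i, "")
end

puts convert_markup('<font color="red"><B>Hi</B> <i>there</i></font><br>')
```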



Data Cleanup
Context: Ingesting source data.

Problem: Source data is ... imperfect.

Solution: Fix what you can confidently fix.

Tools: Varies.
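
"Fix what you can confidently fix" could look like this in Ruby. The sketch assumes dates arrive as M/D/YYYY, which is an assumption about the source. clean_date is a hypothetical helper. Anything it cannot parse with confidence is returned unchanged and flagged for review.

```ruby
require "date"

# Normalize a date only when it matches the one format we trust (M/D/YYYY).
# Returns [value, status], where status is :fixed or :needs_review.
def clean_date(raw)
  text = raw.to_s.strip.squeeze(" ")
  if text =~ %r{\A(\d{1,2})/(\d{1,2})/(\d{4})\z}
    month, day, year = $1.to_i, $2.to_i, $3.to_i
    if Date.valid_date?(year, month, day)
      return [Date.new(year, month, day).iso8601, :fixed]
    end
  end
  [text, :needs_review]
end

p clean_date(" 9/16/2006 ")   # ["2006-09-16", :fixed]
p clean_date("Sept. 16th")    # ["Sept. 16th", :needs_review]
```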




Convert All the Things!



Finally Done with News?
•   HTML files scraped via Nokogiri scripts.

•   Quite a bit of cleanup: garbage in, garbage out.

•   Unscrapable news items.

•   “September 12, 2001”.




Nokogiri
•   A Ruby library for parsing XML and HTML.

•   Supports DOM or SAX parsing.

•   Implements both XPath and CSS3 selectors.

•   We used it to parse and extract content from the set of HTML
    files containing existing news stories.



Familiar Patterns
•   Direct Translation

•   Encoding Change

•   Markup Change

•   URL/Path Translation

•   Data Cleanup


The Big One



CMS Conversion
•   Old CMS pages were published in several different
    presentational styles, but all shared the same DOM. That means
    we can scrape ’em!

•   We agreed not to change anything else during the import. That
    means we can treat it as a clean switchover.




Three-Pronged Conversion
•   Build the necessary structures and themes to accommodate
    and represent our old content.

•   Build a library of code for scraping the pages generated by the
    old site, cataloging data and metadata, and storing them in an
    intermediate representation.

•   Build a library of code for importing this intermediate
    representation into the new CMS structures.

Migrate
•   A Drupal module providing a framework for data import into
    the Drupal content management system.

•   Supports a variety of sources and targets out of the box.

•   Extensible to support additional migration sources and targets.

•   We used it to import the XML representation of our site into
    our Drupal system.


Familiar Patterns
•   Object Extraction

•   Encoding Change

•   Markup Change

•   URL/Path Translation

•   Data Cleanup


Intermediate Representation
Context: Complex conversion.

Problem: Data conversion.

Solution: Convert source data to intermediate representation in
one pass. Then convert intermediate representation to target.

Tools: Representation: Database, XML, CSV. Conversion: Varies.
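
A two-stage Ruby sketch of this pattern. The talk used XML as the intermediate form for the CMS conversion. This sketch uses JSON only because it ships with Ruby. The field names and both conversion methods are hypothetical.

```ruby
require "json"

# Stage 1 (hypothetical): convert source rows to a neutral list of hashes.
def to_intermediate(source_rows)
  source_rows.map do |row|
    { "id" => row[0], "title" => row[1].strip, "body" => row[2] }
  end
end

# Stage 2 (hypothetical): convert the intermediate form to the target's shape.
def to_target(intermediate)
  intermediate.map do |item|
    { node_title: item["title"], node_body: item["body"], legacy_id: item["id"] }
  end
end

rows = [[101, " Second Saturday ", "<p>Story text</p>"]]
intermediate_json = JSON.pretty_generate(to_intermediate(rows))  # could be written to disk
targets = to_target(JSON.parse(intermediate_json))
p targets
```

Because the intermediate form is plain data, it can be inspected, corrected, and re-imported without scraping the source again.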



Object Identity
Context: Ingesting source data.

Problem: Data objects are repeated in source data.

Solution: Uniquely identify source objects.

Tools: String methods, RegEx, DOM/XML selection.
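
One possible Ruby sketch of unique identification, assuming the repeated objects are assets referenced by slightly different URLs. The normalization rules (ignore case, query string, and trailing slashes) are assumptions that would need checking against real data. identity_key and unique_objects are hypothetical names.

```ruby
require "digest/sha1"

# Normalize a reference and hash it to get a stable identity key.
# Assumption: case, query strings, and trailing slashes do not distinguish objects.
def identity_key(url)
  normalized = url.strip.downcase.sub(/\?.*\z/, "").sub(%r{/+\z}, "")
  Digest::SHA1.hexdigest(normalized)
end

# Keep the first occurrence of each distinct object.
def unique_objects(urls)
  urls.uniq { |url| identity_key(url) }
end

refs = ["http://Example.edu/a.jpg", "http://example.edu/a.jpg?v=2", "http://example.edu/b.jpg"]
puts unique_objects(refs)
```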




Object Aggregation
Context: Ingesting source data.

Problem: Target data objects contain multiple source objects.

Solution: Aggregate objects at intermediate or output stage.

Tools: Varies.
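
Aggregation at the intermediate stage might be sketched in Ruby like this. It assumes a hypothetical case in which each source row is one photo and the target wants one gallery per story. The keys (story_id, position, file) and aggregate_galleries are illustrative.

```ruby
# Hypothetical: group individual photo rows into one gallery object per story.
def aggregate_galleries(photos)
  photos.group_by { |photo| photo[:story_id] }.map do |story_id, group|
    files = group.sort_by { |photo| photo[:position] }.map { |photo| photo[:file] }
    { story_id: story_id, images: files }
  end
end

photos = [
  { story_id: 1, position: 2, file: "b.jpg" },
  { story_id: 2, position: 1, file: "c.jpg" },
  { story_id: 1, position: 1, file: "a.jpg" }
]
p aggregate_galleries(photos)
```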




Lessons
•   You already have a good toolbox. Keep your tools sharp.

•   Understand your source and target models.

•   Watch for familiar patterns.

•   Conversion is an opportunity for cleanup and improvement.

•   Human labor can sometimes be cheaper than automation.


YOU Can Convert



Questions?



Thank you, & keep in touch!
•   Sven Aas: @svenaas / saas@mtholyoke.edu

•   Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu

•   #TPR2




Colophon
•   This presentation is set in Exo Extra Bold from Natanael
    Gama’s ndiscovered, with headings in ChunkFive from The
    League of Movable Type.

•   Background images were adapted from
    FreeSeamlessTextures.com’s Red Watercolor and The Grid, by
    Willem Pirquin.



Colophon (continued)
•   Card-size survival tool photo via acreativeedge.info

•   Leatherman photo via SonnyandSandy

•   Studley Tool Chest photo via FineWoodworking.com




Colophon (continued)
•       Audio from Wikipedia:Sound/List:
    •    Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by
         the Skidmore College Orchestra.

    •    W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik
         Eide.

    •    Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman

    •    J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel

    •    Mississippi John Hurt - “Nobody’s Dirty Business”



Colophon (continued)
•       Other Audio

    •    Jack Beaver - “Workaday World”

    •    Danny Elfman - “Breakfast Machine”





 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 

I Can Convert

  • 1. I Can Convert! by Sven Aas and Jason Proctor
  • 2. I Can Convert! • Sven Aas: @svenaas / saas@mtholyoke.edu • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu • #TPR2 ©2012 Sven Aas and Jason Proctor, Mount Holyoke College
  • 3. We’re going to talk about
      • Stories
      • Patterns
      • Tools
  • 4. Use Your Tools!
  • 5. Use Your Tools
      • Spreadsheet
      • Programmer’s Editor
      • Programming Language
  • 6. Spreadsheet
  • 7. Spreadsheet
  • 8. Programmer’s Editor
  • 9. Programmer’s Editor
  • 10. Programming Language
  • 11. Programming Language
  • 12. Use Your Tools! You’ve GOT this stuff.
  • 13. Getting Deported
  • 14. Portal News
  • 15. Unusual Data Representation
      Data fields: node, name, type, mode, owner, group, url, desc, parent, linkto, ctime, mtime, mod_by, visible, userdata, datasize, datafilename.
      Sample record:
      | 4692909 | G1158673129-8322 | 16 | rwlrwlr-l | 21139 | 71 1000009 1000010 1000011 1000012 1000013 1000014 1000015 1000016 1000017 1000018 1000019 1000020 | | | 2100709 | NULL | 1158673129 | 1170344089 | 21139 | 1 |
      01|Second*Saturday: MHC Students Hit the Road|As part of new student orientation, members of the class of 2010 worked on community service projects across the Pioneer Valley on September 16. View the photo gallery.||http://www.mtholyoke.edu/offices/comm/news/sec_sat_06/page1.html|1158638400|1170305999|||||11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second Saturday:^:^:^:^0:^
      | 2813 | V1158673129-9689 |
  • 16. Ruby to the Rescue
      [Diagram labels: LegacyUser, LegacyItem, Portal System, Importer, News System, User, Item, Story, Link, Channel]
  • 17. ActiveRecord
      • A Ruby library which implements the ActiveRecord software architecture pattern.
      • The original Model and ORM component of Ruby on Rails.
      • We used it to provide a convenient object layer on top of two underlying relational databases.
  • 18. Conversion Patterns
  • 19. Object Extraction
      Context: Ingesting source data.
      Problem: Source data objects contain multiple target objects.
      Solution: Process or parse target data just enough to extract objects.
      Tools: String methods, RegEx, DOM/XML selection.
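A minimal sketch of Object Extraction using only string methods and a regular expression. The record layout (stories separated by `;;`, fields by `|`) is a made-up format for illustration, not the portal's actual one, and `Story` and `extract_stories` are hypothetical names.

```ruby
# Hypothetical target object; the field names are assumptions for illustration.
Story = Struct.new(:title, :summary, :url)

# Parse one legacy blob just far enough to pull out the stories it contains.
# Assumed layout: stories separated by ";;", fields separated by "|".
def extract_stories(blob)
  blob.split(";;").map(&:strip).reject(&:empty?).map do |chunk|
    title, summary, url = chunk.split("|", 3).map(&:strip)
    # Keep only values that look like absolute URLs; anything else becomes nil.
    url = nil unless url.to_s.match?(%r{\Ahttps?://\S+\z})
    Story.new(title, summary, url)
  end
end

blob = "Road Trip|Students hit the road.|http://www.example.edu/a.html;;" \
       "Concert|Spring concert announced.|n/a"
extract_stories(blob).each { |s| puts "#{s.title}: #{s.url || '(no link)'}" }
```

The parser deliberately stops at field boundaries; deeper interpretation of each field can wait for a later stage.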
  • 20. Encoding Change
      Context: Mapping source data to target.
      Problem: Source text encoding differs from target.
      Solution: Perform intermediate translation.
      Tools: String methods, RegEx, programming libraries.
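One way to sketch Encoding Change with Ruby's built-in transcoding. The slides do not say which encodings were involved, so Windows-1252 as the source and UTF-8 as the target are assumptions, and `to_utf8` is a hypothetical helper.

```ruby
# Re-label the bytes with their real source encoding, then transcode to UTF-8.
# Bytes with no mapping are replaced with "?" instead of raising an error.
def to_utf8(bytes, source_encoding = "Windows-1252")
  bytes.dup.force_encoding(source_encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Curly quotes (0x93, 0x94) and e-acute (0xE9) as raw Windows-1252 bytes.
legacy = "\x93Caf\xE9\x94".b
puts to_utf8(legacy)
```

Replacing unmappable bytes keeps a long batch running; logging them for later review would be a reasonable alternative.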
  • 21. URL/Path Translation
      Context: Preparing target environment and data.
      Problem: Assets in target system will be available at different paths or URLs from their locations in source system.
      Solution: Map source locations to target locations. Replace references in data before saving to target.
      Tools: String methods, RegEx, DOM/XML selection.
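A regex-based sketch of URL/Path Translation. The prefix map, the host name, and the helper names `translate_path` and `translate_references` are all hypothetical; a real conversion would build the map from the actual source and target layouts.

```ruby
# Hypothetical mapping from source path prefixes to target path prefixes.
PATH_MAP = {
  "/offices/comm/news/"   => "/news/archive/",
  "/offices/comm/images/" => "/media/news/"
}.freeze

# Map one source path to its target location; unmapped paths pass through.
def translate_path(path)
  prefix = PATH_MAP.keys.sort_by { |k| -k.length }.find { |k| path.start_with?(k) }
  prefix ? PATH_MAP[prefix] + path[prefix.length..] : path
end

# Rewrite href/src values that are site-relative or point at the source host.
# Links to other hosts do not match the pattern and are left untouched.
def translate_references(html, source_host)
  pattern = %r{(href|src)="(?:https?://#{Regexp.escape(source_host)})?(/[^"]*)"}
  html.gsub(pattern) do
    attr, path = $1, $2
    %(#{attr}="#{translate_path(path)}")
  end
end

html = '<a href="http://www.example.edu/offices/comm/news/sec_sat_06/page1.html">Gallery</a>'
puts translate_references(html, "www.example.edu")
```

Note that this sketch also turns absolute links to the source host into site-relative links, which may or may not suit a given target system.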
  • 22. Getting the News Out
  • 23. Easy Come, Easy Go
      1. Export Athletics news items to hosted service.
      2. Export all news items to digital archives.
  • 24. Exporting Athletics Items
      • 10 years of Athletics news in 14 channels.
      • Export each item in a minimal, predictable HTML wrapper.
      • Include metadata for each item in <meta> tags in the <head>.
      • Group items by sport and by academic year.
      • Generally accommodate the target system.
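Grouping by sport and academic year can be sketched as below. The slides do not define the academic year, so the July 1 boundary is an assumption, as are the item fields and the helper names.

```ruby
require "date"

# Assumption: an academic year runs from July 1 through June 30.
def academic_year(date)
  start = date.month >= 7 ? date.year : date.year - 1
  "#{start}-#{start + 1}"
end

# Group items (hashes with :sport and :date keys) by sport and academic year.
def group_by_sport_and_year(items)
  items.group_by { |item| [item[:sport], academic_year(item[:date])] }
end

# Hypothetical items; real ones would come from the source system.
items = [
  { sport: "Soccer", title: "Season opener", date: Date.new(2006, 9, 16) },
  { sport: "Soccer", title: "Spring awards", date: Date.new(2007, 4, 2) },
  { sport: "Rowing", title: "Fall regatta",  date: Date.new(2007, 10, 20) }
]
group_by_sport_and_year(items).each do |(sport, year), group|
  puts "#{sport} #{year}: #{group.length} item(s)"
end
```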
  • 25. HAML
      • A lightweight markup language used to generate HTML.
      • A meta-markup language.
      • We used it to succinctly express the HTML we wanted from within our Ruby code.
  • 26. Archiving Web News
      • 14 years of news: 6,000 items, 5,000 images, 34 channels.
      • Export each news item in an archival form preserving the original markup and character entities (but not the design):
          • PDF generated from HTML generated from HAML.
      • Export Dublin Core metadata for each news item:
          • XML generated via Builder.
  • 27. Builder
      • A Ruby library for generating XML.
      • We used it to dynamically generate simple XML from within a Ruby application.
  • 28. wkhtmltopdf
      • A shell utility for generating PDF files by rendering HTML documents using the WebKit rendering engine.
      • A Ruby library providing programmatic access to the wkhtmltopdf shell utility.
      • We used it so that we could use familiar web development techniques to generate PDFs without having to implement our own rendering and layout routines.
  • 29. Familiar Patterns
      • Object Extraction
      • Encoding Change
      • URL/Path Translation
  • 30. Direct Translation
      Context: Simple conversion.
      Problem: Data conversion.
      Solution: Read source objects and write targets in single pass.
      Tools: Varies.
  • 31. Markup Change
      Context: Mapping source data to target.
      Problem: Source text markup differs from target.
      Solution: Perform intermediate translation.
      Tools: String methods, RegEx, DOM/XML selection, programming libraries.
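A small regex-based sketch of Markup Change. The specific rules (presentational `<b>`, `<i>`, and `<font>` tags replaced or dropped) are assumed examples, not the rules used in the talk, and `modernize_markup` is a hypothetical name.

```ruby
# Translate a few presentational tags into their semantic equivalents and
# drop <font> wrappers entirely. Matching is case-insensitive.
def modernize_markup(html)
  html
    .gsub(%r{<(/?)b>}i, '<\1strong>')   # <b>...</b> becomes <strong>...</strong>
    .gsub(%r{<(/?)i>}i, '<\1em>')       # <i>...</i> becomes <em>...</em>
    .gsub(%r{</?font\b[^>]*>}i, "")     # remove opening and closing font tags
end

puts modernize_markup('<font color="red"><B>Hi</B> <i>there</i></font><br>')
```

Regular expressions are adequate for flat, predictable substitutions like these; nested or malformed markup is usually better handled with a DOM parser.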
  • 32. Data Cleanup
      Context: Ingesting source data.
      Problem: Source data is ... imperfect.
      Solution: Fix what you can confidently fix.
      Tools: Varies.
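"Fix what you can confidently fix" might look like the sketch below: normalize what is unambiguous and flag the rest for a person. The item fields, the "Month day, year" date format, and the `needs_review` flag are assumptions for illustration.

```ruby
require "date"

# Clean one scraped item, fixing only what can be fixed with confidence.
def clean_item(raw)
  # Collapsing runs of whitespace is a safe, mechanical fix.
  title = raw[:title].to_s.gsub(/\s+/, " ").strip
  date = begin
    Date.strptime(raw[:date].to_s.strip, "%B %d, %Y")
  rescue ArgumentError
    nil # Unparseable: leave it for a person rather than guessing.
  end
  { title: title, date: date, needs_review: title.empty? || date.nil? }
end

p clean_item({ title: "  Second \n Saturday ", date: "September 12, 2001" })
p clean_item({ title: "Untitled", date: "TBD" })
```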
  • 33. Convert All the Things!
  • 34. Finally Done with News?
      • HTML files scraped via Nokogiri scripts.
      • Quite a bit of cleanup: garbage in, garbage out.
      • Unscrapable news items.
      • “September 12, 2001”.
  • 35. Nokogiri
      • A Ruby library for parsing XML and HTML.
      • Supports DOM or SAX parsing.
      • Implements both XPath and CSS3 selectors.
      • We used it to parse and extract content from the set of HTML files containing existing news stories.
  • 36. Familiar Patterns
      • Direct Translation
      • Encoding Change
      • Markup Change
      • URL/Path Translation
      • Data Cleanup
  • 37. The Big One
  • 38. CMS Conversion
      • Old CMS pages all published with several different presentational styles, but all with the same DOM. That means we can scrape ’em!
      • We agreed not to change anything else during the import. That means we can treat it as a clean switchover.
  • 39. Three-Pronged Conversion
  • 40. Three-Pronged Conversion
      • Build the necessary structures and themes to accommodate and represent our old content.
      • Build a library of code for scraping the pages generated by the old site, cataloging data and metadata, and storing them in an intermediate representation.
      • Build a library of code for importing this intermediate representation into the new CMS structures.
  • 41. Migrate
      • A Drupal module providing a framework for data import into the Drupal content management system.
      • Supports a variety of sources and targets out of the box.
      • Extensible to support additional migration sources and targets.
      • We used it to import the XML representation of our site into our Drupal system.
  • 42. Familiar Patterns
      • Object Extraction
      • Encoding Change
      • Markup Change
      • URL/Path Translation
      • Data Cleanup
  • 43. Intermediate Representation
      Context: Complex conversion.
      Problem: Data conversion.
      Solution: Convert source data to intermediate representation in one pass. Then convert intermediate representation to target.
      Tools: Representation: Database, XML, CSV. Conversion: Varies.
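The two-pass shape of Intermediate Representation can be sketched as follows. The slide names database, XML, and CSV as representations; JSON is swapped in here only to keep the example within Ruby's standard library. The source layout, `TargetPage`, and both function names are hypothetical.

```ruby
require "json"

# Pass 1: read source records and write a neutral intermediate form.
# The pipe-delimited source layout is a made-up example.
def to_intermediate(source_lines)
  records = source_lines.map do |line|
    id, title, body = line.chomp.split("|", 3)
    { "source_id" => id, "title" => title, "body" => body }
  end
  JSON.generate(records)
end

# Hypothetical stand-in for whatever the target system expects.
TargetPage = Struct.new(:legacy_id, :heading, :content)

# Pass 2: read the intermediate form and build target objects.
def from_intermediate(json)
  JSON.parse(json).map do |rec|
    TargetPage.new(rec["source_id"], rec["title"], rec["body"])
  end
end

intermediate = to_intermediate(["42|About Us|Welcome to the department.\n"])
puts from_intermediate(intermediate).first.heading
```

Because the intermediate form can be saved and inspected, the import pass can be rerun and debugged without scraping the source again.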
  • 44. Object Identity
      Context: Ingesting source data.
      Problem: Data objects are repeated in source data.
      Solution: Uniquely identify source objects.
      Tools: String methods, RegEx, DOM/XML selection.
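A sketch of Object Identity that derives a key from a normalized URL and keeps the first record seen for each key. The normalization rules (lowercase, no query string or fragment, no trailing slash) are assumptions and would merge paths that differ only by case, so they need checking against the real source. `identity_key` and `unique_objects` are hypothetical names.

```ruby
require "digest/sha1"

# Derive a stable key for a source object from its normalized URL.
def identity_key(url)
  normalized = url.strip.downcase.sub(/[?#].*\z/, "").chomp("/")
  Digest::SHA1.hexdigest(normalized)
end

# Keep the first occurrence of each object, keyed by identity.
def unique_objects(records)
  records.each_with_object({}) do |rec, seen|
    seen[identity_key(rec[:url])] ||= rec
  end.values
end

records = [
  { url: "http://www.example.edu/news/a.html", title: "A" },
  { url: "HTTP://www.example.edu/news/a.html?print=1", title: "A (print view)" },
  { url: "http://www.example.edu/news/b.html", title: "B" }
]
unique_objects(records).each { |rec| puts rec[:title] }
```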
  • 45. Object Aggregation
      Context: Ingesting source data.
      Problem: Target data objects contain multiple source objects.
      Solution: Aggregate objects at intermediate or output stage.
      Tools: Varies.
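Object Aggregation could be sketched as grouping source records by a shared key and merging each group into one target object. The multi-page story scenario, the field names, and `aggregate_stories` are assumptions chosen for illustration.

```ruby
# Aggregate per-page source records into one target object per story,
# with page bodies joined in page order.
def aggregate_stories(pages)
  pages.group_by { |pg| pg[:story] }.map do |story, parts|
    body = parts.sort_by { |pg| pg[:page] }.map { |pg| pg[:body] }.join("\n\n")
    { story: story, page_count: parts.length, body: body }
  end
end

# Hypothetical source objects: one record per page of a multi-page story.
pages = [
  { story: "sec_sat_06",  page: 2, body: "Second page." },
  { story: "sec_sat_06",  page: 1, body: "First page." },
  { story: "convocation", page: 1, body: "Only page." }
]
aggregate_stories(pages).each { |s| puts "#{s[:story]}: #{s[:page_count]} page(s)" }
```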
  • 46. Lessons
      • You already have a good toolbox. Keep your tools sharp.
      • Understand your source and target models.
      • Watch for familiar patterns.
      • Conversion is an opportunity for cleanup and improvement.
      • Human labor can sometimes be cheaper than automation.
  • 47. YOU Can Convert
  • 48. Questions?
  • 49. Thank you, & keep in touch!
      • Sven Aas: @svenaas / saas@mtholyoke.edu
      • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu
      • #TPR2
  • 50. Colophon
      • This presentation is set in Exo Extra Bold from Natanael Gama’s ndiscovered, with headings in ChunkFive from The League of Movable Type.
      • Background images were adapted from FreeSeamlessTextures.com’s Red Watercolor and The Grid, by Willem Pirquin.
  • 51. Colophon (continued)
      • Card-size survival tool photo via acreativeedge.info
      • Leatherman photo via SonnyandSandy
      • Studley Tool Chest photo via FineWoodworking.com
  • 52. Colophon (continued)
      • Audio from Wikipedia:Sound/List:
          • Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by the Skidmore College Orchestra.
          • W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik Eide.
          • Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman.
          • J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel.
          • Mississippi John Hurt - “Nobody’s Dirty Business”
  • 53. Colophon (continued)
      • Other Audio:
          • Jack Beaver - “Workaday World”
          • Danny Elfman - “Breakfast Machine”