CHOPPING UP HTML
Ruby and Nokogiri for Documentarians
Wendy Shaffer @wshaffer74
VMware, Inc.
I’M SURROUNDED BY HTML AND XML
WHAT IS RUBY?
• Scripting language
• Many of you have probably heard of “Ruby on Rails”, used to build
websites
• https://www.ruby-lang.org/en/
• You can probably do most of what I’m describing in any language – pick
one you like
WHAT IS NOKOGIRI?
• Nokogiri is a library (“gem”) for Ruby that parses HTML and XML
• http://www.nokogiri.org/
• This means it “understands” how these languages are put together and
can identify information based on the HTML or XML structure
• <p class="important">This is some very important text.</p>
EXAMPLE 1: SCRAPING WIKI TABLES
Spreadsheet
EXAMPLE 1: CODE SNIPPETS
# Find the tables in the document
tables = page.xpath("//table")

# Process table rows
def process_row(row)
  elements = row.children
  z = Array.new
  elements.each { |element|
    # Substitute all whitespace with a plain space to get rid of
    # non-breaking spaces
    z << element.text.gsub(/[[:space:]]/, ' ').strip
  }
  return z
end
EXAMPLE 2: SCRAPING A WEB REPORT
EXAMPLE 2: CODE SNIPPET
when "Clarity"
  score_list[score] = score_page.css('table#tableSummary tr.scorecardRow td')[2].text
OTHER NEAT STUFF YOU CAN DO
• Automate downloads of reports from web apps
• Figure out the number of links, images, etc. in a document
• Modify HTML output after publishing
MORE RESOURCES
• The Bastards Book of Ruby by Dan Nguyen:
http://ruby.bastardsbook.com/
• Why’s Poignant Guide to Ruby: https://poignant.guide/
• Everyday Scripting with Ruby: For Teams, Testers, and You by Brian
Marick

Editor's Notes

  • #2 This is a very quick tour of how someone who isn’t a developer (me) learned a bit of code to make their lives easier.
  • #3 I am surrounded by HTML and XML. I write in DITA (XML), publish to HTML, and use a ton of web apps for everything from analytics to project tracking.
  • #6 We had 400 help IDs in a wiki table. We wanted to import them somewhere else.
  • #7 The //table expression finds all the tables in the document. row.children finds all the children of the table row (that is, the td elements). The loop goes through those children one at a time, cleans up any junk such as non-breaking spaces, and writes the text of each one to an array.
  • #8 We have an Acrolinx quality report that produces the HTML on the left. By clicking a link for each topic, you get the report on the right that shows a number of metrics. I wanted to do some statistical analysis on a lot of metrics (circled in green) for a bunch of topics. The solution again was Nokogiri.
  • #9 This is an example of the code to extract just one of the scores. The code above basically translates into "Find the table with the id 'tableSummary', then find the table rows with the class 'scorecardRow' inside that, then find the table cells inside those, give me the third one (index 2, because Ruby counts from zero), and then give me the text inside that."