Conversion of non-semantic HTML to semantic based on visual cues
About Rune Kaagaard
Musician / Programmer
Works for Prescriba, a Danish Healthcare / Health Information
Copyright 2009 1
What is he talking about?
You have a lot of HTML that is messy and not semantic. Bad for SEO. Bad for
Convert the HTML into clean, pretty semantic HTML.
The idea: Try to understand the page [more] like a person would.
Where is it positioned?
How big is the text?
What font is used?
What distance does it have to other elements?
PHP with htmlpurifier, Tidy and phpQuery
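As a rough illustration of the idea (written in Python rather than the PHP stack named above), the sketch below guesses a semantic tag for each block of text from its rendered font size and vertical position. The `Box` record, the `guess_tags` function and the ratio thresholds are all hypothetical; they are not taken from the talk and would need tuning against real pages.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Box:
    """One rendered block of text (hypothetical record, not from the talk)."""
    text: str
    font_size: float   # rendered size in px
    top: float         # distance from the top of the page in px
    bold: bool = False


def guess_tags(boxes):
    """Return (tag, text) pairs in reading order, using made-up thresholds."""
    # Treat the median font size as "body text" and compare everything to it.
    body_size = median(b.font_size for b in boxes)
    result = []
    for box in sorted(boxes, key=lambda b: b.top):
        ratio = box.font_size / body_size
        if ratio >= 1.8:
            tag = "h1"
        elif ratio >= 1.3 or (box.bold and ratio >= 1.1):
            tag = "h2"
        else:
            tag = "p"
        result.append((tag, box.text))
    return result


page = [
    Box("Intro text", 14, 60),
    Box("Title", 32, 10),
    Box("Section", 20, 120, bold=True),
    Box("More text", 14, 160),
    Box("Even more text", 14, 200),
]
for tag, text in guess_tags(page):
    print("<{0}>{1}</{0}>".format(tag, text))
```

Comparing against the median size rather than fixed pixel values is one way to stay independent of a site's base font size; font family and distance to neighbouring elements could be added as further signals in the same way.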
How to render a webpage headless on the server?
WebKit from PyQt running inside Xvfb!
--server-args='-screen 0, 640x480x24'
Xvfb is Linux only.
Returns output from .js code.
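A minimal sketch of how such a render might be launched from Python, assuming the `xvfb-run` wrapper script is installed. The `xvfb_command` helper and the script name `render.py` are hypothetical; the screen argument is written in the `-screen 0 WxHxD` form from the Xvfb manual. The sketch only builds and prints the command line; it does not start Xvfb.

```python
import shlex


def xvfb_command(script, width=640, height=480, depth=24):
    """Build an xvfb-run command line for a (hypothetical) render script."""
    screen = "-screen 0 {}x{}x{}".format(width, height, depth)
    return [
        "xvfb-run",
        "--auto-servernum",             # pick a free X display number
        "--server-args={}".format(screen),
        "python",
        script,                         # e.g. a PyQt/WebKit script that prints JS results
    ]


command = xvfb_command("render.py")
print(shlex.join(command))
# To actually run it (requires Xvfb and the script to exist):
#   import subprocess
#   output = subprocess.run(command, capture_output=True, text=True).stdout
```

Building the command as a list keeps the quoted `--server-args` value intact when it is handed to `subprocess.run`, and the render script's standard output then carries whatever the page's JavaScript returned.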
What to do with this info?
Strip out everything that does not have semantic meaning.
Use information about position to transform into semantic HTML.
Clean up everything again.
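The first step, stripping everything without semantic meaning, could look roughly like the sketch below. It uses Python's standard `html.parser` instead of the PHP tools used in the talk, and the tag and attribute whitelists (`KEEP_TAGS`, `KEEP_ATTRS`) are assumptions made for illustration only.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists; a real project would tune these.
KEEP_TAGS = {"h1", "h2", "h3", "p", "ul", "ol", "li", "a", "img", "br",
             "em", "strong", "table", "tr", "td", "th"}
KEEP_ATTRS = {"href", "src", "alt"}
VOID_TAGS = {"img", "br"}            # tags that never get a closing tag
SKIP_CONTENT = {"script", "style"}   # drop these tags and their contents


class SemanticStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if tag not in KEEP_TAGS:
            return  # drop presentational tags such as <font> or <center>
        kept = ""
        for name, value in attrs:
            if name in KEEP_ATTRS:
                kept += ' {}="{}"'.format(name, escape(value or ""))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_presentational(html):
    """Keep whitelisted tags and attributes, keep all text, drop the rest."""
    parser = SemanticStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


messy = '<font size="5"><p style="color:red">Hi <a href="/x" class="y">link</a></p></font>'
print(strip_presentational(messy))
```

Text inside dropped tags is kept on purpose, so that a later pass can still assign it a semantic tag from its rendered position and size.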
Code examples not in presentation being shown... :)
More from same author