Sanitizing HTML 5 with Perl 5


  1. HTML5::Sanitizer: Sanitizing HTML 5 with Perl 5. Uwe Voelker, XING AG, August 16th 2011.
  2. Outline: 1 Introduction; 2 HTML parser choice; 3 HTML5::Sanitizer internals; 4 HTML5::Sanitizer usage; 5 Conclusion.
  3. Section 1, Introduction: Task: WYSIWYG editor; Team; Live example.
  6. Task: WYSIWYG editor
       - integrate a WYSIWYG editor into XING
       - the frontend architect researched open source solutions
       - none was suitable, mostly for security reasons
       - the decision was made to build it in-house
       - goals: secure; share profiles (allowed tags) between frontend and backend
  7. Team: Christopher Blum (JavaScript), Ingo Chao (QA, HTML5/CSS), Uwe Voelker (Perl).
  8. Live example
  9. Section 2, HTML parser choice: CPAN modules; Evaluation; Final decision.
  10. HTML parsers on CPAN: HTML::Parser, HTML::TreeBuilder, HTML::TreeBuilder::LibXML, XML::LibXML, HTML::HTML5::Parser, Marpa::HTML, ...
  15. HTML parser choice
       - started with HTML::HTML5::Parser (HH5P) because it understands the semantics of HTML 5 tags
       - but it also turned the first URL into the second, decoding "&copy" as the copyright sign:
           http://example.com/?section=2&copy=3&lang=en
           http://example.com/?section=2©=3&lang=en
       - final choice: XML::LibXML
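
The URL mangling on this slide can be reproduced in isolation. The sketch below is not HTML::HTML5::Parser's code; it assumes the cause was HTML 5's legacy named character references (names such as `copy` that are recognised even without a trailing semicolon) and simulates that with a two-entry table. Both helper names are hypothetical.

```perl
use strict;
use warnings;

# Tiny subset of the legacy named character references, i.e. names
# that may be recognised even without a trailing semicolon.
my %LEGACY = (copy => "\x{A9}", reg => "\x{AE}");

# Hypothetical helper: decode legacy references the way a lenient
# tokenizer might when it meets them in running text.
sub decode_legacy_refs {
    my ($text) = @_;
    $text =~ s/&(copy|reg);?/$LEGACY{$1}/g;
    return $text;
}

# Hypothetical helper: escape every ampersand before the URL is
# embedded in HTML, so that no reference can be recognised at all.
sub escape_ampersands {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $url     = 'http://example.com/?section=2&copy=3&lang=en';
my $mangled = decode_legacy_refs($url);   # "&copy" becomes U+00A9
my $escaped = escape_ampersands($url);    # "&" becomes "&amp;"

binmode STDOUT, ':encoding(UTF-8)';
print "$mangled\n$escaped\n";
```

Escaping the ampersands on the producing side avoids the ambiguity, but a sanitizer cannot rely on user-supplied markup being escaped, which may be why the parser was swapped instead.
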
  16. Section 3, HTML5::Sanitizer internals: Processing phases; Parsing; Converting; Writing.
  20. Processing phases
       - preprocessing (e.g. migration)
       - parsing (HTML → DOM tree)
       - converting (rebuild the tree according to the profile)
       - writing (DOM tree → HTML)
  21. Parsing HTML with XML::LibXML

        use XML::LibXML;

        my $parser = XML::LibXML->new(
            encoding          => 'UTF-8',
            recover           => 2,
            keep_blanks       => 1,
            no_cdata          => 1,
            expand_entities   => 1,
            no_network        => 1,
            suppress_errors   => 1,
            suppress_warnings => 1,
        );
  22. Parsing HTML with XML::LibXML (continued)

        my $doc = $parser->parse_html_string(
            $html,
            {
                no_cdata          => 1,
                suppress_errors   => 1,
                suppress_warnings => 1,
            },
        );
  26. Converting - rebuilding the DOM tree
       - loop through every node (only ELEMENT and TEXT nodes)
       - drop unwanted elements completely (e.g. <script>)
       - change unknown elements to <span>
       - possibly change the tag name (as defined in the profile)
       - transform (or copy) attributes
       - proceed recursively with the child nodes
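
The converting steps above can be sketched with XML::LibXML as follows. This is a minimal illustration, not the HTML5::Sanitizer source: the `%RULES` table, `convert_children`, and the `<div>` wrapper are hypothetical, and attribute handling is reduced to setting a class.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical in-line profile: tag name => rule. Tags without an
# entry count as unknown and are turned into <span>.
my %RULES = (
    script => { remove => 1 },
    p      => {},
    b      => {},
    big    => { rename_tag => 'span', set_class => 'big' },
);

# Copy the children of $source into $target, applying the rules.
sub convert_children {
    my ($source, $target, $out_doc) = @_;
    for my $node ($source->childNodes) {
        my $type = $node->nodeType;
        if ($type == XML_TEXT_NODE) {
            $target->appendChild($out_doc->createTextNode($node->nodeValue));
        }
        elsif ($type == XML_ELEMENT_NODE) {
            my $tag  = lc($node->nodeName);
            my $rule = $RULES{$tag};
            next if $rule && $rule->{remove};    # drop the whole subtree
            my $name = $rule ? ($rule->{rename_tag} // $tag) : 'span';
            my $copy = $out_doc->createElement($name);
            $copy->setAttribute(class => $rule->{set_class})
                if $rule && defined $rule->{set_class};
            $target->appendChild($copy);
            convert_children($node, $copy, $out_doc);    # recurse
        }
        # every other node type (comments etc.) is skipped
    }
    return;
}

my $parser = XML::LibXML->new(
    recover => 2, no_network => 1,
    suppress_errors => 1, suppress_warnings => 1,
);
my $in = $parser->parse_html_string(
    '<p>Hi <big>there</big><blink>!</blink><script>alert(1)</script></p>',
    { recover => 2, suppress_errors => 1, suppress_warnings => 1 },
);
my ($body) = $in->findnodes('//body');

# Build the result in a fresh document below a throwaway <div>.
my $out_doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root    = $out_doc->createElement('div');
$out_doc->setDocumentElement($root);
convert_children($body, $root, $out_doc);

my $result = join '', map { $_->toString } $root->childNodes;
print "$result\n";
```

Rebuilding into a new tree, rather than deleting nodes from the parsed one, means that only content explicitly copied can reach the output.
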
  28. Writing HTML
       - mainly for additional escapes
       - could not find a nice way to integrate this in XML::LibXML

        $text =~ s/&/&amp;/g;
        $text =~ s/'/&#39;/g;    # '
        $text =~ s/"/&quot;/g;   # "
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;
        $text =~ s/`/&#96;/g;
        $text =~ s/\{/&#123;/g;
        $text =~ s/\}/&#125;/g;
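
The same escapes can be applied in a single pass with a lookup table, which removes the need to handle `&` before everything else. This is a sketch, not the module's writer; `escape_text` is a hypothetical name, and `&#39;` for the apostrophe is an assumption, because the transcript shows that entity already decoded.

```perl
use strict;
use warnings;

# Character => replacement, mirroring the substitutions on the slide.
# '&#39;' for the apostrophe is an assumption (see the note above).
my %ESCAPE = (
    '&' => '&amp;',
    "'" => '&#39;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    '`' => '&#96;',
    '{' => '&#123;',
    '}' => '&#125;',
);

# Hypothetical helper: escape all listed characters in one pass, so
# that an ampersand produced by one replacement is never re-escaped.
sub escape_text {
    my ($text) = @_;
    $text =~ s/([&'"<>`{}])/$ESCAPE{$1}/g;
    return $text;
}

my $escaped = escape_text(q{<a href="x">{{ `id` }} & 'more'</a>});
print "$escaped\n";
```
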
  29. Section 4, HTML5::Sanitizer usage: Usage; Profile; Examples; Debugging.
  30. Usage

        # construct object
        my $sanitizer = HTML5::Sanitizer->new(
            profile => 'My::Profile',
        );

        # call process()
        my $clean = $sanitizer->process($html);
  33. Profile
       - you have to build your own
       - a class with just one method: element($tag)
       - it returns undef or a hashref with these keys:
           remove            remove the complete subtree (boolean)
           rename_tag        rename the tag (string)
           set_attributes    set these attributes (hashref)
           check_attributes  check/transform these attributes (hashref)
           set_class         set the class (string)
           add_class         add a class derived from other attributes (hashref)
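
A profile along these lines might look like the sketch below. The rule values are taken from the example slides that follow; the `%ELEMENTS` table, the empty rule for `<p>` (assumed here to mean "keep as is"), and calling `element` as a class method are assumptions, since the slides do not show how HTML5::Sanitizer invokes the profile.

```perl
use strict;
use warnings;

package My::Profile;

# Hypothetical rule table: tag name => hashref as described above.
my %ELEMENTS = (
    script => { remove => 1 },
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow', target => '_blank' } },
    p      => {},    # assumption: an empty rule keeps the tag unchanged
);

# The single method a profile has to provide. Returns the rule for
# $tag, or undef when the tag is not listed.
sub element {
    my ($self, $tag) = @_;
    return $ELEMENTS{ lc $tag };
}

package main;

# Print the rule keys for a few tags, including an unlisted one.
for my $tag (qw(script big a marquee)) {
    my $rule = My::Profile->element($tag);
    print "$tag: ", (defined $rule ? join(', ', sort keys %$rule) : 'undef'), "\n";
}
```
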
  36. Examples - script
       - completely remove <script> (including all children):

        { remove => 1 }

       - otherwise it would be converted to <span> and all its children processed recursively
  38. Examples - big
       - <big> → <span class="big">

        {
            rename_tag => 'span',
            set_class  => 'big',
        }
  40. Examples - a
       - add rel="nofollow" and target="_blank" to every link

        {
            set_attributes => {
                rel    => 'nofollow',
                target => '_blank',
            },
        }
  42. Examples - font

        rename_tag => 'span',
        add_class  => { size => 'size_font' },

        # maps the value of the size attribute to a class name
        sub class_size_font {
            my ($self, $val) = @_;
            return unless $val;
            return 'size-xx-large' if $val eq '7';
            # ...
            return 'size-xx-small' if $val eq '1';
            return 'size-larger'   if $val =~ /^\+/;
            return 'size-smaller'  if $val =~ /^-/;
            return;
        }
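
The slide suggests that an `add_class` entry such as `size => 'size_font'` passes the value of the `size` attribute to a profile method named `class_size_font`. How HTML5::Sanitizer performs that lookup is not shown, so the dispatcher below (`classes_for`) is purely a hypothetical sketch; the mapping method contains only the branches visible on the slide.

```perl
use strict;
use warnings;

package My::Profile;

# Only the branches visible on the slide; the elided sizes stay elided.
sub class_size_font {
    my ($self, $val) = @_;
    return unless $val;
    return 'size-xx-large' if $val eq '7';
    return 'size-xx-small' if $val eq '1';
    return 'size-larger'   if $val =~ /^\+/;
    return 'size-smaller'  if $val =~ /^-/;
    return;
}

package main;

# Hypothetical dispatcher: for every attribute named in $add_class,
# call the profile method "class_<name>" with the attribute's value
# and collect the class names it returns.
sub classes_for {
    my ($profile, $add_class, $attrs) = @_;
    my @classes;
    for my $attr (sort keys %$add_class) {
        my $method = $profile->can("class_$add_class->{$attr}") or next;
        my $class  = $profile->$method($attrs->{$attr});
        push @classes, $class if defined $class;
    }
    return @classes;
}

my $rule = { size => 'size_font' };
my @big     = classes_for('My::Profile', $rule, { size => '7' });
my @larger  = classes_for('My::Profile', $rule, { size => '+2' });
my @smaller = classes_for('My::Profile', $rule, { size => '-1' });
my @none    = classes_for('My::Profile', $rule, {});

print join(' ', @big, @larger, @smaller), "\n";
```
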
  43. Debugging
       - if the result is not as expected, you can access intermediate results:

        # the argument list is cut off on the slide after "return_result"
        my $res = $sanitizer->process($html, { return_result ...

        # see HTML5::Sanitizer::Result
        say $res->input;
        say $res->preprocessed;
        say $res->parsed_doc->toString;
        say $res->converted_doc->toString;
        say $res->output;
        print $res->debug_output;
  46. Repositories
       - HTML5::Sanitizer (backend): http://github.com/xing/html5-sanitizer
       - wysihtml5 (JavaScript frontend): http://github.com/xing/wysihtml5
       - Feedback? uwe@uwevoelker.de
  47. Questions?
