Introduction
   HTML parser choice
HTML5::Sanitizer interna
 HTML5::Sanitizer usage
             Conclusion




         HTML5::Sanitizer
  Sanitizing HTML 5 with Perl 5


                 Uwe Voelker

                     XING AG


            August 16th 2011




            Uwe Voelker    HTML5::Sanitizer
1   Introduction

2   HTML parser choice

3   HTML5::Sanitizer interna

4   HTML5::Sanitizer usage

5   Conclusion




1   Introduction
       Task: WYSIWYG editor
       Team
       Live example

Task: WYSIWYG editor



     integrate a WYSIWYG editor into XING
     frontend architect researched open source solutions
     none was suitable, mostly for security reasons
     decision was made to build it in-house
     goals: secure, share profiles (allowed tags) between frontend and backend




Team




     Christopher Blum: JavaScript
     Ingo Chao: QA (HTML5/CSS)
     Uwe Voelker: Perl


Live example





2   HTML parser choice
     CPAN modules
     Evaluation
     Final decision

HTML parsers on CPAN



     HTML::Parser
     HTML::TreeBuilder
     HTML::TreeBuilder::LibXML
     XML::LibXML
     HTML::HTML5::Parser
     Marpa::HTML
     ...




started with HTML::HTML5::Parser (HH5P)
because it understands the semantics of HTML 5 tags
but it also decoded "&copy" inside a URL as the entity ©:
    input:  http://example.com/?section=2&copy=3&lang=en
    output: http://example.com/?section=2©=3&lang=en
final choice: XML::LibXML
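
HTML defines a set of legacy named character references, "copy" among them, that may be recognised without a trailing semicolon, which is how a query parameter named "copy" can turn into ©. The following is a minimal sketch in plain Perl, not taken from HH5P: %LEGACY, decode_lenient and escape_amp are hypothetical names, and the table holds only three entries for illustration. It also shows that a URL whose ampersands are written as &amp; survives the same decoder unchanged.

```perl
use strict;
use warnings;

# Hypothetical three-entry table; HTML defines many more legacy names.
my %LEGACY = (
    amp  => '&',
    copy => "\x{A9}",    # copyright sign
    reg  => "\x{AE}",    # registered sign
);

# Hypothetical lenient decoder: accepts &name with or without a trailing ';'.
sub decode_lenient {
    my ($text) = @_;
    $text =~ s/&(amp|copy|reg);?/$LEGACY{$1}/g;
    return $text;
}

# Writing every '&' as '&amp;' keeps parameter names away from the decoder.
sub escape_amp {
    my ($url) = @_;
    $url =~ s/&/&amp;/g;
    return $url;
}

my $url = 'http://example.com/?section=2&copy=3&lang=en';

binmode STDOUT, ':encoding(UTF-8)';
print decode_lenient($url), "\n";                # "&copy" is turned into the copyright sign
print decode_lenient(escape_amp($url)), "\n";    # round-trips to the original URL
```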





3   HTML5::Sanitizer interna
     Processing Phases
     Parsing
     Converting
     Writing

Processing phases




      preprocessing (e.g. migration)
      parsing (HTML → DOM tree)
      converting (rebuild tree according to profile)
      writing (DOM tree → HTML)
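
The four phases can be read as a simple pipeline in which each phase consumes the result of the previous one. The sketch below is only an illustration of that ordering, not the module's code: the phase bodies are hypothetical stand-ins (the "parser" just wraps the string in a hash), and run_phases is an invented name. Every intermediate value is kept so that it can be inspected afterwards.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the four phases, in processing order.
my @phases = (
    [ preprocess => sub { my ($html) = @_; $html =~ s/\r\n?/\n/g; return $html } ],
    [ parse      => sub { my ($html) = @_; return { type => 'document', source => $html } } ],
    [ convert    => sub { my ($doc)  = @_; return { %$doc, converted => 1 } } ],
    [ write      => sub { my ($doc)  = @_; return $doc->{source} } ],
);

# Run all phases in order and remember every intermediate result by name.
sub run_phases {
    my ($input) = @_;
    my %result = (input => $input);
    my $value  = $input;
    for my $phase (@phases) {
        my ($name, $code) = @$phase;
        $value = $code->($value);
        $result{$name} = $value;
    }
    return \%result;
}

my $res = run_phases("<p>Hi</p>\r\n");
print $res->{write};
```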




Parsing HTML with XML::LibXML

  use XML::LibXML;

  my $parser = XML::LibXML->new(
      encoding          => 'UTF-8',
      recover           => 2,    # recover from broken markup without warnings
      keep_blanks       => 1,
      no_cdata          => 1,
      expand_entities   => 1,
      no_network        => 1,    # never fetch external resources
      suppress_errors   => 1,
      suppress_warnings => 1,
  );

  my $doc = $parser->parse_html_string(
      $html,
      {
          no_cdata          => 1,
          suppress_errors   => 1,
          suppress_warnings => 1,
      },
  );




Converting - rebuilding DOM tree



     loop through every node (only ELEMENT and TEXT nodes)
     drop unwanted elements completely (e.g. <script>)
     change unknown elements to <span>
     change the tag name if the profile requests it
     transform (or copy) attributes
     proceed recursively with child nodes
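
The steps above can be sketched on a toy tree of plain Perl hashes instead of XML::LibXML nodes. Everything here is an assumption made for illustration: the node layout (tag, attrs, children, text), the %PROFILE table and the name convert_node are invented, and attribute handling is reduced to setting fixed attributes and a class.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical profile: tag name => rule; tags without an entry are unknown.
my %PROFILE = (
    p      => {},
    a      => { set_attributes => { rel => 'nofollow' } },
    big    => { rename_tag => 'span', set_class => 'big' },
    script => { remove => 1 },
);

# Rebuild one node into a new tree; dropped nodes return an empty list.
sub convert_node {
    my ($node) = @_;

    # TEXT nodes are copied as they are.
    return { text => $node->{text} } if exists $node->{text};

    # Anything that is neither TEXT nor ELEMENT is ignored.
    return unless defined $node->{tag};

    my $rule = $PROFILE{ $node->{tag} };

    # Unwanted element: drop it together with its whole subtree.
    return if $rule && $rule->{remove};

    # Unknown element: becomes <span>; known ones may be renamed by the rule.
    my $tag = $rule ? ($rule->{rename_tag} // $node->{tag}) : 'span';

    # Attributes: this sketch only carries over what the rule sets.
    my %attrs = $rule ? %{ $rule->{set_attributes} // {} } : ();
    $attrs{class} = $rule->{set_class} if $rule && defined $rule->{set_class};

    # Proceed recursively with the child nodes.
    my @children = map { convert_node($_) } @{ $node->{children} // [] };

    return { tag => $tag, attrs => \%attrs, children => \@children };
}

my $tree = {
    tag      => 'p',
    children => [
        { text => 'Hello ' },
        { tag => 'blink',  children => [ { text => 'world' } ] },
        { tag => 'script', children => [ { text => 'alert(1)' } ] },
    ],
};

my $clean = convert_node($tree);
say join ', ', map { $_->{tag} // 'TEXT' } @{ $clean->{children} };
```

Building a new tree rather than editing the parsed one in place means that anything the profile does not explicitly produce never reaches the output.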




Writing HTML

     mainly for additional escapes
     could not find a nice way to integrate this in XML::LibXML

  $text =~ s/&/&amp;/g;     # must come first, the replacements below contain '&'
  $text =~ s/'/&#39;/g;
  $text =~ s/"/&quot;/g;
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  $text =~ s/`/&#96;/g;
  $text =~ s/\{/&#123;/g;
  $text =~ s/\}/&#125;/g;
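
A possible variant of these substitutions, shown only as a sketch and not as the module's code, collects the same eight replacements in a lookup table and applies them in a single pass. The names %ESCAPE and escape_text are hypothetical. Because each character is visited once, the order of the replacements no longer matters and an already inserted "&" cannot be escaped a second time.

```perl
use strict;
use warnings;

# The same eight replacements as above, as a lookup table.
my %ESCAPE = (
    '&' => '&amp;',
    "'" => '&#39;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    '`' => '&#96;',
    '{' => '&#123;',
    '}' => '&#125;',
);

# Escape every listed character in one pass over the text.
sub escape_text {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/([&'"<>`{}])/$ESCAPE{$1}/g;
    return $text;
}

print escape_text(q{<b title="x">Tom & Jerry's {tpl}</b>}), "\n";
```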



4   HTML5::Sanitizer usage
     Usage
     Profile
     Examples
     Debugging

Usage



  # construct object
  my $sanitizer = HTML5::Sanitizer->new(
      profile => 'My::Profile',
  );

  # call process()
  my $clean = $sanitizer->process($html);




Profile


     you have to build your own
     class with just one method: element($tag)
     return undef or a hashref with:
          remove            remove complete subtree (boolean)
          rename_tag        rename tag (string)
          set_attributes    set these attributes (hashref)
          check_attributes  check/transform these attributes (hashref)
          set_class         set class (string)
          add_class         add class from other attributes (hashref)
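
A profile along these lines could look like the following minimal sketch. It assumes that element() is called as a class method (the constructor is given the class name as a string), that the keys are spelled with underscores, and that an empty hashref means "keep the element as it is"; none of this is confirmed by the slides, and the %RULES table is invented for the example.

```perl
package My::Profile;
use strict;
use warnings;

# Hypothetical rule table; each rule uses only keys from the list above.
my %RULES = (
    p      => {},
    script => { remove => 1 },
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow', target => '_blank' } },
);

# The single method of a profile: undef for unknown tags, else a hashref.
sub element {
    my ($class, $tag) = @_;
    return $RULES{ lc $tag };
}

package main;

for my $tag (qw(script big blink)) {
    my $rule = My::Profile->element($tag);
    printf "%-6s %s\n", $tag, $rule ? join(',', sort keys %$rule) : 'undef';
}
```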



Examples - script



      completely remove <script> (including all children)

  {
      remove => 1,
  }

      otherwise it would be converted to <span>
      and all children processed recursively




Examples - big



      <big> → <span class="big">

  {
      rename_tag => 'span',
      set_class  => 'big',
  }




Examples - a



      add rel="nofollow" and target="_blank" to every link

  {
      set_attributes => {
          rel    => 'nofollow',
          target => '_blank',
      },
  }




Examples - font
  rename_tag => 'span',
  add_class  => { size => 'size_font' },

  sub class_size_font {
      my ($self, $val) = @_;
      return unless $val;
      return 'size-xx-large' if $val eq '7';
      # ...
      return 'size-xx-small' if $val eq '1';

      # relative sizes such as "+1" or "-2"
      return 'size-larger'  if $val =~ /^\+/;
      return 'size-smaller' if $val =~ /^-/;
      return;
  }
Debugging

        if the result is not as expected, you can access intermediate
        results:

  my $res = $sanitizer->process($html, { return_result ...

  # see HTML5::Sanitizer::Result
  say $res->input;
  say $res->preprocessed;
  say $res->parsed_doc->toString;
  say $res->converted_doc->toString;
  say $res->output;

  print $res->debug_output;

Repositories



      HTML5::Sanitizer (backend)
      http://github.com/xing/html5-sanitizer
      wysihtml5 (JavaScript frontend)
      http://github.com/xing/wysihtml5
      Feedback? uwe@uwevoelker.de




Questions?




                         Uwe Voelker    HTML5::Sanitizer

Sanitizing HTML 5 with Perl 5

  • 1.
    Introduction HTML parser choice HTML5::Sanitizer interna HTML5::Sanitizer usage Conclusion HTML5::Sanitizer Sanitizing HTML 5 with Perl 5 Uwe Voelker XING AG August 16th 2011 Uwe Voelker HTML5::Sanitizer
  • 2.
    Introduction HTML parser choice HTML5::Sanitizer interna HTML5::Sanitizer usage Conclusion 1 Introduction 2 HTML parser choice 3 HTML5::Sanitizer interna 4 HTML5::Sanitizer usage 5 Conclusion Uwe Voelker HTML5::Sanitizer
  • 3.
    Introduction HTML parser choice Task: WYSIWYG editor HTML5::Sanitizer interna Team HTML5::Sanitizer usage Live example Conclusion 1 Introduction Task: WYSIWYG editor Team Live example 2 HTML parser choice 3 HTML5::Sanitizer interna 4 HTML5::Sanitizer usage 5 Conclusion Uwe Voelker HTML5::Sanitizer
  • 4.
    Introduction HTML parser choice Task: WYSIWYG editor HTML5::Sanitizer interna Team HTML5::Sanitizer usage Live example Conclusion Task: WYSIWYG editor integrate WYSIWYG editor in XING frontend architect researched open source solutions Uwe Voelker HTML5::Sanitizer
  • 5.
    Introduction HTML parser choice Task: WYSIWYG editor HTML5::Sanitizer interna Team HTML5::Sanitizer usage Live example Conclusion Task: WYSIWYG editor integrate WYSIWYG editor in XING frontend architect researched open source solutions none was suited, mostly for security reasons decision was made, to build it inhouse Uwe Voelker HTML5::Sanitizer
  • 6.
    Introduction HTML parser choice Task: WYSIWYG editor HTML5::Sanitizer interna Team HTML5::Sanitizer usage Live example Conclusion Task: WYSIWYG editor integrate WYSIWYG editor in XING frontend architect researched open source solutions none was suited, mostly for security reasons decision was made, to build it inhouse goals: secure, share profiles (allowed tags) between frontend and backend Uwe Voelker HTML5::Sanitizer
  • 7.
    Introduction HTML parser choice Task: WYSIWYG editor HTML5::Sanitizer interna Team HTML5::Sanitizer usage Live example Conclusion Team Christopher Blum Ingo Chao Uwe Voelker Javascript QA (HTML5/CSS) Perl Uwe Voelker HTML5::Sanitizer
  • 8.
    Introduction HTML parser choice Task: WYSIWYG editor HTML5::Sanitizer interna Team HTML5::Sanitizer usage Live example Conclusion Live example Uwe Voelker HTML5::Sanitizer
  • 9.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion 1 Introduction 2 HTML parser choice CPAN modules Evaluation Final decision 3 HTML5::Sanitizer interna 4 HTML5::Sanitizer usage 5 Conclusion Uwe Voelker HTML5::Sanitizer
  • 10.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion HTML parser on CPAN HTML::Parser HTML::TreeBuilder HTML::TreeBuilder::LibXML XML::LibXML HTML::HTML5::Parser Marpa::HTML ... Uwe Voelker HTML5::Sanitizer
  • 11.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion Uwe Voelker HTML5::Sanitizer
  • 12.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion started with HTML::HTML5::Parser (HH5P) because it understands semantic of HTML 5 tags Uwe Voelker HTML5::Sanitizer
  • 13.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion started with HTML::HTML5::Parser (HH5P) because it understands semantic of HTML 5 tags but it also did this: http://example.com/?section=2&copy=3&lang=en Uwe Voelker HTML5::Sanitizer
  • 14.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion started with HTML::HTML5::Parser (HH5P) because it understands semantic of HTML 5 tags but it also did this: http://example.com/?section=2&copy=3&lang=en http://example.com/?section=2&copy;=3&lang=en Uwe Voelker HTML5::Sanitizer
  • 15.
    Introduction HTML parser choice CPAN modules HTML5::Sanitizer interna Evaluation HTML5::Sanitizer usage Final decision Conclusion started with HTML::HTML5::Parser (HH5P) because it understands semantic of HTML 5 tags but it also did this: http://example.com/?section=2&copy=3&lang=en http://example.com/?section=2&copy;=3&lang=en final choice: XML::LibXML Uwe Voelker HTML5::Sanitizer
  • 16.
    Introduction Processing Phases HTML parser choice Parsing HTML5::Sanitizer interna Converting HTML5::Sanitizer usage Writing Conclusion 1 Introduction 2 HTML parser choice 3 HTML5::Sanitizer interna Processing Phases Parsing Converting Writing 4 HTML5::Sanitizer usage 5 Conclusion Uwe Voelker HTML5::Sanitizer
  • 17.
    Introduction Processing Phases HTML parser choice Parsing HTML5::Sanitizer interna Converting HTML5::Sanitizer usage Writing Conclusion Processing phases preprocessing (e. g. migration) Uwe Voelker HTML5::Sanitizer
  • 18.
    Introduction Processing Phases HTML parser choice Parsing HTML5::Sanitizer interna Converting HTML5::Sanitizer usage Writing Conclusion Processing phases preprocessing (e. g. migration) parsing (HTML → DOM tree) Uwe Voelker HTML5::Sanitizer
  • 19.
    Introduction Processing Phases HTML parser choice Parsing HTML5::Sanitizer interna Converting HTML5::Sanitizer usage Writing Conclusion Processing phases preprocessing (e. g. migration) parsing (HTML → DOM tree) converting (rebuild tree according to profile) Uwe Voelker HTML5::Sanitizer
  • 20.
    Introduction Processing Phases HTML parser choice Parsing HTML5::Sanitizer interna Converting HTML5::Sanitizer usage Writing Conclusion Processing phases preprocessing (e. g. migration) parsing (HTML → DOM tree) converting (rebuild tree according to profile) writing (DOM tree → HTML) Uwe Voelker HTML5::Sanitizer
  • 21.
    Parsing HTML with XML::LibXML
        use XML::LibXML;

        my $parser = XML::LibXML->new(
            encoding          => 'UTF-8',
            recover           => 2,
            keep_blanks       => 1,
            no_cdata          => 1,
            expand_entities   => 1,
            no_network        => 1,
            suppress_errors   => 1,
            suppress_warnings => 1,
        );
  • 22.
    Parsing HTML with XML::LibXML
        my $doc = $parser->parse_html_string(
            $html,
            {
                no_cdata          => 1,
                suppress_errors   => 1,
                suppress_warnings => 1,
            },
        );
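To get a feel for what this parser configuration returns, the following minimal sketch feeds it a deliberately broken fragment and prints the recovered markup. The sample input and the XPath used to locate the body are assumptions for illustration; they are not taken from HTML5::Sanitizer.

```perl
use strict;
use warnings;
use XML::LibXML;

# A subset of the options from the slide: recover silently from broken markup.
my $parser = XML::LibXML->new(
    recover           => 2,
    keep_blanks       => 1,
    no_network        => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# Illustrative input with unclosed tags (not from the talk).
my $html = '<p>unclosed <b>bold';

my $doc = $parser->parse_html_string(
    $html,
    { suppress_errors => 1, suppress_warnings => 1 },
);

# The HTML parser wraps fragments in <html><body>, so collect the body's children.
my ($body)   = $doc->findnodes('/html/body');
my $fragment = join '', map { $_->toString } $body->childNodes;

print "$fragment\n";
```

The parser closes the open tags on its own, which is why the later phases can work on a well-formed tree even when the editor delivers sloppy markup.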
  • 23.
    Converting - rebuilding DOM tree
        loop through every node (only ELEMENT and TEXT)
        drop unwanted elements completely (e.g. <script>)
        change unknown elements to <span>
        possibly change the tag name (according to the profile)
        transform (or copy) attributes
        proceed recursively with child nodes
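The steps above can be sketched with plain XML::LibXML calls. This is not the module's actual code: the %PROFILE hash, the convert_children name and the sample input are hypothetical, and for brevity the sketch drops all attributes instead of transforming or copying them.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);

# Hypothetical profile: tag name => rule. Tags missing here count as unknown.
my %PROFILE = (
    script => { remove => 1 },
    big    => { rename_tag => 'span', set_class => 'big' },
    p      => {},
    b      => {},
);

# Rebuild the children of $src below $dst, creating new nodes in $out_doc.
sub convert_children {
    my ($src, $dst, $out_doc) = @_;
    for my $child ($src->childNodes) {
        my $type = $child->nodeType;
        if ($type == XML_TEXT_NODE) {
            $dst->appendChild($out_doc->createTextNode($child->data));
        }
        elsif ($type == XML_ELEMENT_NODE) {
            my $tag  = lc $child->nodeName;
            my $rule = $PROFILE{$tag};
            next if $rule && $rule->{remove};          # drop the whole subtree
            my $name = $rule ? ($rule->{rename_tag} // $tag) : 'span';
            my $new  = $out_doc->createElement($name);
            $new->setAttribute(class => $rule->{set_class})
                if $rule && defined $rule->{set_class};
            $dst->appendChild($new);
            convert_children($child, $new, $out_doc);  # recurse into children
        }
        # every other node type (comments etc.) is skipped
    }
    return $dst;
}

my $in = XML::LibXML->new(
    recover => 2, suppress_errors => 1, suppress_warnings => 1,
)->parse_html_string(
    '<p>Hi <big>there</big><script>alert(1)</script><blink>x</blink></p>'
);
my ($body) = $in->findnodes('//body');

my $out_doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root    = $out_doc->createElement('div');
$out_doc->setDocumentElement($root);
convert_children($body, $root, $out_doc);

print $root->toString, "\n";
```

Building a fresh tree rather than editing the parsed one means nothing reaches the output unless the loop created it explicitly.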
  • 27.
    Writing HTML
        mainly for additional escapes
        could not find a nice way to integrate this in XML::LibXML

        $text =~ s/&/&amp;/g;    # must run first
        $text =~ s/'/&#39;/g;
        $text =~ s/"/&quot;/g;
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;
        $text =~ s/`/&#96;/g;
        $text =~ s/\{/&#123;/g;
        $text =~ s/\}/&#125;/g;
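The same escapes can also be applied in a single pass with a lookup table, which removes the need to replace "&" before everything else. This is an alternative sketch, not the module's code; the escape_text name and the sample string are hypothetical.

```perl
use strict;
use warnings;

# The characters from the slide, mapped to their replacements.
my %ESCAPE = (
    '&' => '&amp;',
    "'" => '&#39;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    '`' => '&#96;',
    '{' => '&#123;',
    '}' => '&#125;',
);

# Hypothetical helper: escapes every listed character in one pass,
# so already inserted entities are never escaped a second time.
sub escape_text {
    my ($text) = @_;
    $text =~ s/([&'"<>`{}])/$ESCAPE{$1}/g;
    return $text;
}

print escape_text(q{<a title="x">Tom & 'Jerry' {1}</a>}), "\n";
```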
  • 29.
    1 Introduction
    2 HTML parser choice
    3 HTML5::Sanitizer interna
    4 HTML5::Sanitizer usage
        Usage
        Profile
        Examples
        Debugging
    5 Conclusion
  • 30.
    Usage
        # construct object
        my $sanitizer = HTML5::Sanitizer->new(
            profile => 'My::Profile',
        );

        # call process()
        my $clean = $sanitizer->process($html);
  • 31.
    Profile
        you have to build your own
        class with just one method: element($tag)
        return undef or a hashref with:
            remove              remove complete sub tree (boolean)
            rename_tag          rename tag (string)
            set_attributes      set these attributes (hashref)
            check_attributes    check/transform these attributes (hashref)
            set_class           set class (string)
            add_class           add class from other attributes (hashref)
  • 34.
    Examples - script
        completely remove <script> (including all children)

        {
            remove => 1,
        }

        otherwise it would be converted to <span> and all children processed recursively
  • 37.
    Examples - big
        <big> → <span class="big">

        {
            rename_tag => 'span',
            set_class  => 'big',
        }
  • 39.
    Examples - a
        add rel="nofollow" and target="_blank" to every link

        {
            set_attributes => {
                rel    => 'nofollow',
                target => '_blank',
            },
        }
  • 41.
    Examples - font
        rename_tag => 'span',
        add_class  => { size => 'size_font' },

        sub class_size_font {
            my ($self, $val) = @_;
            return unless $val;
            return 'size-xx-large' if $val eq '7';
            # ...
            return 'size-xx-small' if $val eq '1';
            return 'size-larger'   if $val =~ /^\+/;   # relative size, e.g. "+1"
            return 'size-smaller'  if $val =~ /^-/;    # relative size, e.g. "-1"
            return;
        }
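Put together, the example rules could live in one profile class along the following lines. This is a sketch under assumptions: the slides only say that the class needs an element($tag) method returning undef or a hashref, so the %RULES table, calling element() as a class method, and the empty rule for "p" are illustrative choices, not the module's documented behaviour.

```perl
use strict;
use warnings;

package My::Profile;

# Rules collected from the examples; 'p' is an illustrative entry with no changes.
my %RULES = (
    script => { remove => 1 },
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow', target => '_blank' } },
    font   => { rename_tag => 'span', add_class => { size => 'size_font' } },
    p      => {},
);

# The one required method: a hashref for listed tags, undef for everything else.
sub element {
    my ($self, $tag) = @_;
    return $RULES{ lc $tag };
}

package main;

# Show which rule keys each tag yields.
for my $tag (qw(script big a font p blink)) {
    my $rule = My::Profile->element($tag);
    printf "%-6s => %s\n", $tag,
        defined $rule ? join(', ', sort keys %$rule) || '(no changes)' : 'undef';
}
```

Keeping the rules in a data table leaves element() trivial, and a profile that needs per-request decisions can still compute its hashref on the fly.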
  • 43.
    Debugging
        if the result is not as expected, you can access intermediate results:

        my $res = $sanitizer->process($html, { return_result => 1 });
        # see HTML5::Sanitizer::Result

        say $res->input;
        say $res->preprocessed;
        say $res->parsed_doc->toString;
        say $res->converted_doc->toString;
        say $res->output;

        print $res->debug_output;
  • 44.
    Repositories
        HTML5::Sanitizer (backend)
            http://github.com/xing/html5-sanitizer
        wysihtml5 (JavaScript frontend)
            http://github.com/xing/wysihtml5

        Feedback?
            uwe@uwevoelker.de
  • 47.
    Questions?