XML processing with Perl

Published in: Technology

  1. XML processing with Perl
     For the 2nd YPPUG session, by Joe Jiang [email_address]
  2. XML is a data format, not a programming language
     • We use it in finance and search.
     • DMP can also support it, but not as well as text/HTML.
     • Many people use it for configuration files.
     • I have used it for translating a Perl book.
     • For example:
         ...
         $book/> count $book//sect1
         117
         $book/> count $book//sect2
         149
         $book/> count $book//para
         4691
         # Wow, it's a big book :)
  3. The tool to work with XML
     • It's named XML::XSH2, by Petr Pajas.
     • It comes with a useful utility named xsh,
     • which is based on XML::LibXSLT and XML::SAX::Writer, and ...
     • which are based on XML::LibXML and a lot of ...
     • So you should not expect a simple installation :)
     • But it can still be built with the cpanm utility,
     • so I suggest installing cpanm first:
         $ curl -kL http://cpanmin.us | perl - --sudo App::cpanminus
         $ cpanm -S XML::XSH2
         # already installed on dev, so you can just run: xsh
         # ! Finding XML::XSH2 on cpanmetadb failed.
         # Messages like this are common
  4. How is it used? XPath plus verbs
         $scratch/> $book := open english-tidyup.xml
         parsing english-tidyup.xml
         done.

         $book/> cd //book/chapter[1]
         $book/book/chapter[1]> ls title
         <title>Introduction</title>
         $book/book/chapter[1]> cd /
         $book/> ls //chapter/title
         <title>Introduction</title>
         <title>Filesystems</title>
         <title>User Accounts</title>
         ...
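The same "XPath plus verbs" idea can be sketched outside the xsh shell with XML::LibXML, the module xsh is built on. This is a minimal sketch, not the author's method: the inline document is a hypothetical stand-in for english-tidyup.xml (which the slides do not show), and the helper names chapter_titles and count_nodes are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for english-tidyup.xml; the real file is not shown.
my $xml = <<'XML';
<book>
  <chapter><title>Introduction</title><sect1><para>First.</para></sect1></chapter>
  <chapter><title>Filesystems</title><sect1><para>Second.</para><para>Third.</para></sect1></chapter>
</book>
XML

my $book = XML::LibXML->load_xml(string => $xml);

# Rough equivalent of the xsh command: ls //chapter/title
sub chapter_titles {
    my ($doc) = @_;
    return map { $_->textContent } $doc->findnodes('//chapter/title');
}

# Rough equivalent of the xsh command: count $book//para
sub count_nodes {
    my ($doc, $xpath) = @_;
    my @nodes = $doc->findnodes($xpath);
    return scalar @nodes;
}

print "$_\n" for chapter_titles($book);
print "para count: ", count_nodes($book, '//para'), "\n";
```

The XPath expressions are the same ones typed at the xsh prompt; only the "verb" (ls, count) becomes ordinary Perl code.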
  5. Good at pipeline processing
         $book/> ls //sect1//para/text() | wc -w
         Found 12398 node(s).
         150879

     Use "wc -m" for a Chinese character count.
     Or have fun with frequency statistics, such as the top 100 most used words:

         $book/> ls //sect1//para/text() | perl -MList::MoreUtils=natatime -lane 'END{ $it = natatime 100, sort {$cnt{$b} <=> $cnt{$a}} keys %cnt; print for map {join qq(\t), $_, $cnt{$_}} $it->() } $cnt{$_}++ for @F'
         ...
         data    483
         ...
         Perl    437
         ...
         file    426
         ...
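The frequency one-liner above is dense, so here is a minimal sketch of the same counting step written out as a small script. It uses only core Perl: a plain sort and splice take the place of natatime from List::MoreUtils, which is a substitution, not what the slide does. The function name top_words and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Return up to $n [word, count] pairs, most frequent first.
sub top_words {
    my ($text, $n) = @_;
    my %count;
    # Split on whitespace, like the -a switch does for each input line.
    $count{$_}++ for split ' ', $text;
    # Sort by descending count, then by word so ties come out in a fixed order.
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    # Keep only the first $n words.
    splice @sorted, $n if @sorted > $n;
    return map { [ $_, $count{$_} ] } @sorted;
}

# Hypothetical sample text; in the slide the text comes from xsh's ls output.
my $sample = "data file Perl data Perl data";
print join("\t", @$_), "\n" for top_words($sample, 2);
```

In a pipeline, the script could read the text from STDIN instead of the sample string and be called with 100 as the limit.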
  6. It can be used for conversion #1
         $scratch/> $x:=open ArticleInfo_9.xml;
         parsing ArticleInfo_9.xml
         done.
         $x/> ls $x
         <?xml version="1.0" encoding="utf-16"?>
         <小样>
           <标题><![CDATA[第一推荐]]></标题>
           <作者><![CDATA[]]></作者>
           <内容><![CDATA[  华为美国拓展求解
           华为对美国市场的执着显示出中国企业走出去的急切需要,但这样高调注定要经受更多挫折。]]></内容>
           <附图>
             <简图>
               <文件名>../cnmlfiles/A01/A01Ab25C005_b.jpg</文件名>
               <高>260</高>
               <宽>245</宽>
             </简图>
           </附图>
         </小样>
  7. Now building an empty xHTML #2
         $x/> $y:=new html;
         $y/> ls $y
         <?xml version="1.0" encoding="utf-8"?>
         <html/>
         $y/> xadd element "<head/>" into $y/html; # xadd is just an alias of xinsert
         $y/> ls $y
         <?xml version="1.0" encoding="utf-8"?>
         <html>
           <head/>
         </html>

         $y/> xadd element "<title/>" into $y/html/head;
         $y/> xadd element "<body/>" into $y/html;
         $y/> ls $y
         <?xml version="1.0" encoding="utf-8"?>
         <html>
           <head>
             <title/>
           </head>
           <body/>
         </html>
  8. Copy contents into xHTML #3
         $y/> xadd text $x//小样/标题/text() into $y/html/head/title;
         $y/> ls $y
         <?xml version="1.0" encoding="utf-8"?>
         <html>
           <head>
             <title>第一推荐</title>
           </head>
           <body/>
         </html>
         $y/> xadd text $x//小样/内容/text() into $y/html/body;
         $y/> save --file x.html $y;
         Document saved into file 'x.html'.
         $y/>Good bye!
         $ cat x.html
         <?xml version="1.0" encoding="utf-8"?>
         <html>
           <head>
             <title>第一推荐</title>
           </head>
           <body>  华为美国拓展求解
           华为对美国市场的执着显示出中国企业走出去的急切需要,但这样高调注定要经受更多挫折。</body>
         </html>
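The three conversion steps above (open the source, build an empty xHTML skeleton, copy text across) can also be sketched as a script with XML::LibXML. This is an illustration under assumptions, not the session's code: the input is a hypothetical, shortened stand-in for ArticleInfo_9.xml that uses English element names (article, title, content) in place of the Chinese ones, and to_xhtml is a made-up helper name.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for ArticleInfo_9.xml, with English element names
# in place of the Chinese ones shown in the slides.
my $source = XML::LibXML->load_xml(string => <<'XML');
<article>
  <title><![CDATA[Top pick]]></title>
  <content><![CDATA[Body text of the article.]]></content>
</article>
XML

# Build <html><head><title/></head><body/></html>, then copy text into it.
sub to_xhtml {
    my ($src) = @_;
    my $doc  = XML::LibXML::Document->new('1.0', 'utf-8');
    my $html = $doc->createElement('html');
    $doc->setDocumentElement($html);

    # appendChild returns the node it appended, so the new elements can be kept.
    my $head  = $html->appendChild($doc->createElement('head'));
    my $title = $head->appendChild($doc->createElement('title'));
    my $body  = $html->appendChild($doc->createElement('body'));

    # findvalue returns the text of the matched element as a plain string.
    $title->appendText($src->findvalue('/article/title'));
    $body->appendText($src->findvalue('/article/content'));
    return $doc;
}

# toString(1) asks for indented output.
print to_xhtml($source)->toString(1);
```

Each appendChild call plays the role of one "xadd element ... into ..." command, and each appendText call the role of "xadd text ... into ...".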
  9. XSLT is a focused XML conversion language, based on XPath
         <?xml version="1.0" encoding="ISO-8859-1"?>
         <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
         <xsl:template match="/perldata/hashref">
           <table border="1">
             <tr>
               <th>Key</th>
               <th>Value</th>
             </tr>
             <xsl:for-each select="item">
             <tr>
               <td><xsl:value-of select="@key"/></td>
               <td><xsl:value-of select="."/></td>
             </tr>
             </xsl:for-each>
           </table>
         </xsl:template>
         </xsl:stylesheet>
  10. This works well with XML::Dumper
         $ perl -MXML::Dumper -e 'print pl2xml(\%INC)' | xsltproc hashref.xsl - | w3m -T text/html
     • We can use xsltproc to convert the DocBook book to HTML,
     • and to PDF, with another utility named fop.
     • Or generate an MS Word doc file from OpenOffice,
     • with the help of the OpenOffice DocBook XSLT filter.
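The xsltproc pipeline can also be run from inside Perl with XML::LibXSLT, one of the modules xsh is built on. The sketch below rests on assumptions: the input document is hand-written in the general shape the stylesheet expects (/perldata/hashref/item with a key attribute), and real XML::Dumper output may carry extra attributes; the stylesheet is a trimmed copy of the one on the previous slide; transform is a hypothetical helper name.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical input shaped like /perldata/hashref/item[@key].
my $data = XML::LibXML->load_xml(string => <<'XML');
<perldata>
  <hashref>
    <item key="strict.pm">/usr/lib/perl5/strict.pm</item>
    <item key="warnings.pm">/usr/lib/perl5/warnings.pm</item>
  </hashref>
</perldata>
XML

# Trimmed version of the slide's stylesheet (header row omitted).
my $style = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/perldata/hashref">
    <table border="1">
      <xsl:for-each select="item">
        <tr>
          <td><xsl:value-of select="@key"/></td>
          <td><xsl:value-of select="."/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
XSL

# Compile the stylesheet, apply it, and serialize the result to a string.
sub transform {
    my ($style_doc, $source_doc) = @_;
    my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);
    my $result     = $stylesheet->transform($source_doc);
    return $stylesheet->output_string($result);
}

print transform($style, $data);
```

Running the transform in-process avoids the shell pipeline, at the cost of needing XML::LibXSLT installed; xsltproc remains the simpler choice for one-off conversions.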
  11. Now you have been equipped with another tool named XML
     Thanks all for the magic!

         Module Name    Author       Version
         XML::Dumper    MIKEWONG     0.81
         XML::Simple    GRANTM       2.18
         XML::LibXML    PAJAS        1.87
         XML::XPath     MSERGEANT    1.13
         XML::XSH2      PAJAS        2.1.3
         XML::Twig      MIROD        3.38
