XML processing with perl
Published in: Technology


Transcript

  • 1. XML processing with Perl, for the 2nd YPPUG session, by Joe Jiang
  • 2. XML is a data format, not a language
      • We use it in finance and in search.
      • DMP also supports it, though not as well as it supports text/HTML.
      • Many people use it for configuration files.
      • I used it while translating a Perl book.
      • For example:
    • ...
    • $book/> count $book//sect1
    • 117
    • $book/> count $book//sect2
    • 149
    • $book/> count $book//para
    • 4691
    • # Wah, it's a big book :)
  • 3. The tool to work with XML
      • It is named XML::XSH2, by Petr Pajas
      • It comes with a useful utility named xsh
      • Which is based on XML::LibXSLT and XML::SAX::Writer, and ...
      • Which are based on XML::LibXML and a lot of ...
      • So do not expect a simple installation :)
      • But it can still be built with the cpanm utility
      • So I suggest installing cpanm first
    • $ curl -kL http://cpanmin.us | perl - --sudo App::cpanminus
    • $ cpanm -S XML::XSH2
    • # already installed on dev, so there you can just run: xsh
    • # ! Finding XML::XSH2 on cpanmetadb failed.
    • # Messages like this one are common during the install
  • 4. How is it used? XPath plus verbs
    • $scratch/> $book := open english-tidyup.xml
    • parsing english-tidyup.xml
    • done.
    • $book/> cd //book/chapter[1]
    • $book/book/chapter[1]> ls title
    • <title>Introduction</title>
    • $book/book/chapter[1]> cd /
    • $book/> ls //chapter/title
    • <title>Introduction</title>
    • <title>Filesystems</title>
    • <title>User Accounts</title>
    • ...
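The same XPath queries can also be run from a plain Perl script. The following is a minimal sketch using XML::LibXML (one of the modules xsh builds on). The inline document is a made-up miniature stand-in for english-tidyup.xml, and the helper names count_nodes and node_texts are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up miniature stand-in for english-tidyup.xml (not the real book).
my $xml = <<'XML';
<book>
  <chapter><title>Introduction</title>
    <sect1><para>First paragraph.</para><para>Second paragraph.</para></sect1>
  </chapter>
  <chapter><title>Filesystems</title>
    <sect1><para>Third paragraph.</para></sect1>
  </chapter>
</book>
XML

# Count the nodes selected by an XPath expression, like "count" in xsh.
sub count_nodes {
    my ($doc, $xpath) = @_;
    my @nodes = $doc->findnodes($xpath);
    return scalar @nodes;
}

# Return the text content of every node selected by an XPath expression.
sub node_texts {
    my ($doc, $xpath) = @_;
    return map { $_->textContent } $doc->findnodes($xpath);
}

my $book = XML::LibXML->load_xml(string => $xml);
print "sect1: ", count_nodes($book, '//sect1'), "\n";
print "para:  ", count_nodes($book, '//para'), "\n";
print "$_\n" for node_texts($book, '//chapter/title');
```

For a real file, `XML::LibXML->load_xml(location => $path)` would replace the inline string.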
  • 5. Good at pipeline processing
    • $book/> ls //sect1//para/text() | wc -w
    • Found 12398 node(s).
    • 150879
    • Use "wc -m" instead to count characters, e.g. for Chinese text.
    • Or have some fun with frequency statistics, such as the 100 most used words:
    • $book/> ls //sect1//para/text() | perl -MList::MoreUtils=natatime -lane 'END{ $it = natatime 100, sort {$cnt{$b} <=> $cnt{$a}} keys %cnt; print for map {join qq(\t), $_, $cnt{$_}} $it->() } $cnt{$_}++ for @F'
    • ...
    • data    483
    • ...
    • Perl    437
    • ...
    • file    426
    • ...
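The one-liner above is dense, so here is a longer sketch of the same counting idea using only core Perl. The sub name top_words and the sample lines are made up for illustration, and ties are broken alphabetically, which the one-liner does not do.

```perl
use strict;
use warnings;

# Count whitespace-separated words and return the $n most frequent ones
# as [word, count] pairs, most frequent first (ties broken by string order).
sub top_words {
    my ($n, @lines) = @_;
    my %count;
    $count{$_}++ for map { split ' ', $_ } @lines;
    my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    splice @sorted, $n if @sorted > $n;    # keep only the first $n words
    return map { [ $_, $count{$_} ] } @sorted;
}

# Made-up sample input; in the pipeline this text would arrive on STDIN.
my @sample = ('data file Perl data', 'Perl data file data');
print join("\t", @$_), "\n" for top_words(2, @sample);
```

Saved as a script that reads `<STDIN>` instead of `@sample`, it could sit at the end of the same xsh pipe in place of the one-liner.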
  • 6. It can be used for conversion #1
    • $scratch/> $x:=open ArticleInfo_9.xml;
    • parsing ArticleInfo_9.xml
    • done.
    • $x/> ls $x
    • <?xml version="1.0" encoding="utf-16"?>
    • <小样>
    •          <标题><![CDATA[ 第一推荐 ]]></标题>
    •          <作者><![CDATA[]]></作者>
    •          <内容><![CDATA[   华为美国拓展求解
    •   华为对美国市场的执着显示出中国企业走出去的急切需要,但这样高调注定要经受更多挫折。 ]]></内容>
    •          <附图>
    •                  <简图>
    •                          <文件名>../cnmlfiles/A01/A01Ab25C005_b.jpg</文件名>
    •                          <高>260</高>
    •                          <宽>245</宽>
    •                  </简图>
    •          </附图>
    • </小样>
  • 7. Now building an empty xHTML #2
    • $x/> $y:=new html;
    • $y/> ls $y
    • <?xml version="1.0" encoding="utf-8"?>
    • <html/>
    • $y/> xadd element "<head/>" into $y/html; # xadd is an alias of xinsert, which works like insert on every matching node
    • $y/> ls $y
    • <?xml version="1.0" encoding="utf-8"?>
    • <html>
    •    <head/>
    • </html>
    • $y/> xadd element "<title/>" into $y/html/head;
    • $y/> xadd element "<body/>" into $y/html;
    • $y/> ls $y
    • <?xml version="1.0" encoding="utf-8"?>
    • <html>
    •    <head>
    •      <title/>
    •    </head>
    •    <body/>
    • </html>
  • 8. Copy contents into xHTML #3
    • $y/> xadd text $x//小样/标题/text() into $y/html/head/title;
    • $y/> ls $y
    • <?xml version="1.0" encoding="utf-8"?>
    • <html>
    •    <head>
    •      <title> 第一推荐 </title>
    •    </head>
    •    <body/>
    • </html>
    • $y/> xadd text $x//小样/内容/text() into $y/html/body;
    • $y/> save --file x.html $y;
    • Document saved into file 'x.html'.
    • $y/>Good bye!
    • $ cat x.html
    • <?xml version="1.0" encoding="utf-8"?>
    • <html>
    •    <head>
    •      <title> 第一推荐 </title>
    •    </head>
    •    <body>   华为美国拓展求解
    •   华为对美国市场的执着显示出中国企业走出去的急切需要,但这样高调注定要经受更多挫折。 </body>
    • </html>
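The conversion in slides 6 to 8 can also be scripted outside xsh. Below is a minimal sketch with XML::LibXML that builds the same html/head/title/body skeleton and copies two text values into it. It is not the author's method. The source document is hypothetical, with English element names (sample, title, content) standing in for 小样, 标题 and 内容, and the sub name to_xhtml is made up.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical source document; the English element names stand in for
# the Chinese ones used on the slides.
my $source_xml = <<'XML';
<sample>
  <title><![CDATA[Top pick]]></title>
  <content><![CDATA[Body text of the article.]]></content>
</sample>
XML

# Build <html><head><title/></head><body/></html> and fill it from $src.
sub to_xhtml {
    my ($src) = @_;
    my $out  = XML::LibXML::Document->new('1.0', 'utf-8');
    my $html = $out->createElement('html');
    $out->setDocumentElement($html);

    # appendChild returns the node it appended, so calls can be chained.
    my $head  = $html->appendChild($out->createElement('head'));
    my $title = $head->appendChild($out->createElement('title'));
    my $body  = $html->appendChild($out->createElement('body'));

    # findvalue returns the string value of the selected node.
    $title->appendText($src->findvalue('/sample/title'));
    $body->appendText($src->findvalue('/sample/content'));
    return $out;
}

my $src = XML::LibXML->load_xml(string => $source_xml);
print to_xhtml($src)->toString(1);    # 1 = indent the output
```

Calling `toFile('x.html', 1)` on the returned document would write it to disk, much as `save --file` does in xsh.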
  • 9. XSLT is a focused XML conversion language, based on XPath
    • <?xml version="1.0" encoding="ISO-8859-1"?>
    • <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    • <xsl:template match="/perldata/hashref">
    •   <table border="1">
    •    <tr>
    •     <th>Key</th>
    •     <th>Value</th>
    •    </tr>
    •    <xsl:for-each select="item">
    •     <tr>
    •      <td><xsl:value-of select="@key"/></td>
    •      <td><xsl:value-of select="."/></td>
    •     </tr>
    •    </xsl:for-each>
    •   </table>
    • </xsl:template>
    • </xsl:stylesheet>
  • 10. This works well with XML::Dumper
    • $ perl -MXML::Dumper -e 'print pl2xml(\%INC)' | xsltproc hashref.xsl - | w3m -T text/html
      • We can use xsltproc to convert the DocBook book to HTML
      • And to PDF, with another utility named fop
      • Or generate an MS Word doc file with OpenOffice
      • With the help of the OpenOffice DocBook XSLT filter
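The xsltproc step can also be done from Perl itself. This is a minimal sketch with XML::LibXSLT, using a trimmed version of the slide 9 stylesheet. The input is hand-written in the shape the stylesheet matches (perldata/hashref/item with a key attribute), and both the path inside it and the sub name transform are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Trimmed version of the stylesheet from slide 9 (header row left out).
my $xsl = <<'XSL';
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/perldata/hashref">
    <table border="1">
      <xsl:for-each select="item">
        <tr>
          <td><xsl:value-of select="@key"/></td>
          <td><xsl:value-of select="."/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
XSL

# Hand-written input in the shape the stylesheet matches; the path is made up.
my $data = <<'XML';
<perldata><hashref><item key="strict.pm">/usr/lib/perl5/strict.pm</item></hashref></perldata>
XML

# Parse the stylesheet, apply it to the input, and return the result as text.
sub transform {
    my ($xsl_text, $xml_text) = @_;
    my $style = XML::LibXSLT->new->parse_stylesheet(
        XML::LibXML->load_xml(string => $xsl_text));
    my $result = $style->transform(XML::LibXML->load_xml(string => $xml_text));
    return $style->output_as_chars($result);
}

print transform($xsl, $data);
```

Keeping the transform in Perl avoids the shell pipe, at the cost of installing XML::LibXSLT.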
  • 11. Now you are equipped with another tool: XML
    Thanks all for the magic!
    Module Name    Author       Version
    XML::Dumper    MIKEWONG     0.81
    XML::Simple    GRANTM       2.18
    XML::LibXML    PAJAS        1.87
    XML::XPath     MSERGEANT    1.13
    XML::XSH2      PAJAS        2.1.3
    XML::Twig      MIROD        3.38
