19compression

  1. Managing XML and Semistructured Data
     Lecture 19: Compressing XML Data
     Prof. Dan Suciu, Spring 2001
  2. In this lecture
     • XML Compression
         • Motivation
         • XMill approach and results
     • Resources
     • XMill: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000
  3. Compression: The Problem
     • XML for exchange (space or time)
     • but XML is verbose
     • users prefer application-specific formats:
         • Web Server Logs
         • EMBL
         • G2
     • Is XML doomed to fail?
  4. An Example: Web Server Logs
     ASCII file, 15.9 MB (gzipped 1.6 MB):
       202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
     XML-ized, inflates to 24.2 MB (gzipped 2.1 MB):
       <apache:entry>
         <apache:host>202.239.238.16</apache:host>
         <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
         <apache:contentType>text/html</apache:contentType>
         <apache:statusCode>200</apache:statusCode>
         <apache:date>1997/10/01-00:00:02</apache:date>
         <apache:byteCount>4478</apache:byteCount>
         <apache:referer>http://www.net.jp/</apache:referer>
         <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
       </apache:entry>
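The inflation on the slide above can be reproduced on a single record. The following is a minimal sketch, not the tool used to produce the lecture's data set: the mapping from field position to element name (`FIELD_TAGS`) and the function name `log_line_to_xml` are assumptions inferred from the example, and fields holding "-" are treated as absent, as they are in the slide's XML.

```python
from xml.sax.saxutils import escape

# Assumed mapping from field position to element name, inferred from the slide.
# Positions 5, 7 and 8 hold "-" in the example; their meaning is not given.
FIELD_TAGS = {
    0: "host", 1: "requestLine", 2: "contentType", 3: "statusCode",
    4: "date", 6: "byteCount", 9: "referer", 10: "userAgent",
}

def log_line_to_xml(line):
    """Wrap each non-empty field of a pipe-delimited log line in an element."""
    fields = line.split("|")
    parts = ["<apache:entry>"]
    for position, tag in FIELD_TAGS.items():
        value = fields[position]
        if value != "-":  # "-" marks a missing value in the log
            parts.append(f"<apache:{tag}>{escape(value)}</apache:{tag}>")
    parts.append("</apache:entry>")
    return "".join(parts)

LINE = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
        "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")
xml = log_line_to_xml(LINE)

print(len(LINE), "bytes as ASCII,", len(xml), "bytes as XML")
```

Every value is now surrounded by an opening and a closing tag, which is where the extra bytes come from.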
  5. XMill
     • specialized compressor for XML data
     • makes XML look “small”
     • Download:
         • Now: www.research.att.com/sw/tools/xmill
         • Soon: www.cs.washington.edu/homes/suciu/XMILL
  6. How XMill Works: Three Ideas
     Compress the structure separately from the data:
       Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>  →  gzip Structure
       Data: 202.239.238.16  GET / HTTP/1.0  text/html  200 …  →  gzip Data
       gzip Structure + gzip Data = 1.75MB
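The first idea can be sketched in a few lines. This is an illustration under stated assumptions, not XMill's actual encoding: the token format (`<tag`, `#`, `>`) is invented for the sketch, attributes and text after child elements are ignored, `zlib` stands in for gzip, and the sample uses unprefixed tags so that it parses without a namespace declaration.

```python
import zlib
import xml.etree.ElementTree as ET

def split_structure_and_data(xml_text):
    """Return (structure tokens, data values) for a small XML document."""
    structure, data = [], []

    def walk(elem):
        structure.append("<" + elem.tag)   # element start
        text = (elem.text or "").strip()
        if text:
            structure.append("#")          # placeholder: a data value goes here
            data.append(text)
        for child in elem:
            walk(child)
        structure.append(">")              # element end

    walk(ET.fromstring(xml_text))
    return structure, data

def compress_separately(structure, data):
    """Compress the two streams independently (zlib as a stand-in for gzip)."""
    packed_structure = zlib.compress(" ".join(structure).encode("utf-8"))
    packed_data = zlib.compress("\n".join(data).encode("utf-8"))
    return packed_structure, packed_data

SAMPLE = (
    "<log>"
    "<entry><host>202.239.238.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>"
    "<entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry>"
    "</log>"
)
structure, data = split_structure_and_data(SAMPLE)
packed_structure, packed_data = compress_separately(structure, data)
```

Because every entry has the same shape, the structure stream is highly repetitive and compresses well on its own.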
  7. How XMill Works: Three Ideas
     Group the data values according to their types:
       Structure: <apache:entry> . . . </apache:entry>  →  gzip Structure
       Data1: 202.23.23.16  224.42.24.55 …  →  gzip Data1
       Data2: GET / HTTP/1.0  GET / HTTP/1.1 …  →  gzip Data2
       gzip Structure + gzip Data1 + gzip Data2 + ... = 1.33MB
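A minimal sketch of the grouping step follows. It uses the enclosing element name as a stand-in for the "type" of a value, which is an assumption of this sketch; the function name `group_values_by_tag` is hypothetical, `zlib` again stands in for gzip, and the sample uses unprefixed tags.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_values_by_tag(xml_text):
    """Collect text values into one container per enclosing element name."""
    containers = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return dict(containers)

SAMPLE = (
    "<log>"
    "<entry><host>202.23.23.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>"
    "<entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry>"
    "</log>"
)
containers = group_values_by_tag(SAMPLE)

# Each container is compressed on its own, so similar values sit next to each other.
compressed = {
    tag: zlib.compress("\n".join(values).encode("utf-8"))
    for tag, values in containers.items()
}
```

Keeping all hosts together and all request lines together gives the general-purpose compressor longer runs of similar text than an interleaved stream would.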
  8. How XMill Works: Three Ideas
     Apply semantic (specialized) compressors:
     • Examples:
     • 8-, 16-, 32-bit integer encoding (signed/unsigned)
     • differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
     • compress lists, records (e.g. 104.32.23.1 → 4 bytes)
     • Need user input to select the semantic compressor
     gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82MB
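Two of the examples above are easy to sketch: packing a dotted IPv4 address into 4 bytes, and differential encoding of a sequence of years. The function names are hypothetical and the encodings are plain illustrations of the idea, not XMill's own formats.

```python
from itertools import accumulate

def encode_ipv4(address):
    """Pack a dotted-quad string such as '104.32.23.1' into 4 bytes."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4:
        raise ValueError(f"not a dotted quad: {address!r}")
    return bytes(parts)  # bytes() raises ValueError for values outside 0..255

def decode_ipv4(raw):
    """Inverse of encode_ipv4."""
    return ".".join(str(b) for b in raw)

def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the deltas."""
    return list(accumulate(deltas))

packed = encode_ipv4("104.32.23.1")       # 11 characters become 4 bytes
years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)              # [1999, -4, 6, -1, -5]
```

After the first entry the deltas are small numbers, which fit in fewer bits and compress better than the original values. The output of such an encoder (the `c1(Data1)` on the slide) is then passed to gzip as before.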
  9. XML Compression
  10. Compression Tradeoff
  11. Summary of XML Data Management
     • XML =
         • old data type (trees)
         • with new interpretation (data)
     • We discussed traditional management techniques for XML:
         • Data model
         • Query language
         • Optimizations
         • ...
     • Many traditional problems still unsolved (storage, processing, optimization, ...)
  12. Summary of XML Data Management
     • More interesting question:
         • What are the novel applications enabled by XML?
     • Some ideas:
     • Approximate queries over unfamiliar data instances
         • “Search the database for a pattern similar to this one”
         • Rank results based on their similarity to the pattern
         • What is an appropriate query language for that?
     • Linking independent databases
         • We have XLink; how do we use it?