Analyzing Extended and Scientific Metadata for Scalable Index Designs


How do we effectively index scientific file systems? We analyze scientific data to examine characteristics which impact index choices.

Published in: Technology, Education

  1. Examining Extended and Scientific Metadata for Scalable Index Designs
     Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*, Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
     *University of California Santa Cruz
     ^Conservatoire National des Arts et Métiers
  2. What we call metadata
     • Data for the system
     • External to the file
     • Small
     • Dense
     Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition"
  3. What everyone else calls metadata
     • Data for the user
     • Embedded in:
       • the file
       • the inode
       • a separate file
       • a notebook somewhere on their desk
     • Wildly varying size
     • Sparse
     [Figure labels: embedded metadata, metadata files, inode metadata, metadata outside the system]
  4. A scientist at work
     • “Show me the data set about bears in Alaska from last fall”
     • “Show me simulation results from last week for Vesuvius which used this code library, and where the pressure is higher than 500 kilopascals”
     • A mix of system and scientific metadata
  5. Our options
     • Relational databases
     • Column stores
     • Spatial trees (e.g., Spyglass, Smartstore)
     • Inverted indexes
     • Bitmap indexes (e.g., FastBit)
     • The choice of index depends on the data, but what does the data look like?
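To make one of the options above concrete, here is a minimal sketch of a bitmap index. It is not FastBit's API or implementation; it uses plain Python integers as uncompressed bitsets, and the records, field names, and function names are made up for illustration.

```python
from collections import defaultdict


def build_bitmap_index(records, field):
    """Map each distinct value of `field` to an int used as a bitset of row ids."""
    bitmaps = defaultdict(int)
    for row_id, record in enumerate(records):
        value = record.get(field)
        if value is not None:  # a null field sets no bit in any bitmap
            bitmaps[value] |= 1 << row_id
    return dict(bitmaps)


def rows_from_bitmap(bitmap):
    """Decode a bitset back into a sorted list of row ids."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical records; the last one has no "region" field at all.
records = [
    {"species": "bear", "region": "Alaska"},
    {"species": "bear", "region": "Yukon"},
    {"species": "salmon", "region": "Alaska"},
    {"species": "bear"},
]
species_index = build_bitmap_index(records, "species")
region_index = build_bitmap_index(records, "region")

# A conjunctive point query is a bitwise AND of two bitmaps.
matches = rows_from_bitmap(species_index["bear"] & region_index["Alaska"])
```

In this sketch each distinct value costs one bit per row, so a real bitmap index would compress the bitmaps; that detail is omitted here.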
  6. Outline
     • The data in brief
     • Dimensionality
     • Sparsity
     • Atomicity
     • Entropy
  7. The metadata in brief

     Data set | Discipline   | Native format | Record count | Subsampled? | Sample count | Total size
     Dryad    | Biology      | XML           | 31K          | No          | 31K          | 400 MB
     WISE     | Astronomy    | CSV           | 564M         | Yes         | 10K          | 1 TB
     ARGO     | Oceanography | NetCDF        | 2B           | Yes         | 635K         | 330 GB
     ORNL     | Climatology  | CSV           | 1478         | No          | 1478         | 154 KB
  8. Dimensionality

     Data set   | Dryad | WISE | Argo | ORNL | Total
     Dimensions | 44    | 285  | 108  | 14   | 451

     • Much higher dimensional than POSIX data
     • Curse of dimensionality concerns
  9. Sparsity
     • Sparse even within a discipline (extremely sparse across all disciplines)
     • CDF of sparsity
     • For a randomly chosen element from X% of columns, there is a Y% chance it will be null
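The sparsity CDF described above could be computed roughly as follows. This is a sketch under one reading of the slide (an assumption): each column gets a null fraction, the columns are sorted by that fraction, and each point pairs a share of columns with the largest null fraction in that share. The records and column names are invented.

```python
def null_fractions(records, columns):
    """Fraction of records in which each column is missing or None."""
    total = len(records)
    return {
        col: sum(1 for r in records if r.get(col) is None) / total
        for col in columns
    }


def sparsity_cdf(fractions):
    """Return (share_of_columns, null_fraction) points, sorted by null fraction."""
    ordered = sorted(fractions.values())
    n = len(ordered)
    return [((i + 1) / n, frac) for i, frac in enumerate(ordered)]


# Hypothetical metadata records; absent keys count as nulls.
records = [
    {"title": "a", "depth_m": 10.0, "species": None},
    {"title": "b", "depth_m": None},
    {"title": "c"},
    {"title": "d", "depth_m": 4.5},
]
columns = ["title", "depth_m", "species"]

fractions = null_fractions(records, columns)
cdf = sparsity_cdf(fractions)
```

Here `title` is never null, `depth_m` is null in half the records, and `species` is always null, so the curve climbs from 0 to 1 across the three columns.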
  10. Atomicity (Dryad)
      • How many times can a field be present for a single item?
      • E.g., a single paper can have multiple authors
      • Truncated to show detail. One study had 800 species!
      Some disciplines have many field values per item. Others have range values (e.g., May-June 2010).
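The atomicity question above (how many times a field appears per item) can be answered with a simple histogram. The sketch below assumes each item stores a multi-valued field as a list; the items and names are illustrative and are not taken from Dryad.

```python
from collections import Counter


def multiplicity_histogram(items, field):
    """Count how many items carry 0, 1, 2, ... values for `field`."""
    counts = Counter()
    for item in items:
        values = item.get(field, [])  # a missing field counts as zero values
        counts[len(values)] += 1
    return dict(counts)


# Hypothetical items with one-to-many fields.
items = [
    {"author": ["A. Example"], "species": ["Ursus arctos"]},
    {"author": ["B. Example", "C. Example"], "species": []},
    {"author": ["D. Example", "E. Example"]},
]

author_hist = multiplicity_histogram(items, "author")
species_hist = multiplicity_histogram(items, "species")
```

A field whose histogram has mass above 1 is non-atomic, so an index that assumes a single value per field per item would need extra handling for it.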
  11. Entropy
      • Row organization versus column
      • How compressible is the data?
      • How selective are queries?
      • Plenty of compression available
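One common way to quantify the entropy of a column is Shannon entropy over its value distribution. The slides do not say which measure was used, so treat this as an assumed stand-in; the sample columns are made up.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy, in bits, of the non-null values in one column."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    counts = Counter(present)
    total = len(present)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical columns with four values each.
constant_flags = ["ok", "ok", "ok", "ok"]   # one repeated value
two_values = ["a", "a", "b", "b"]           # two equally likely values
unique_ids = ["w", "x", "y", "z"]           # every value distinct

low = column_entropy(constant_flags)
medium = column_entropy(two_values)
high = column_entropy(unique_ids)
```

Under this measure a low-entropy column compresses well but a point query on it selects many items, while a high-entropy column compresses poorly but is highly selective.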
  12. Bringing it all together
      • Scientific data is:
        • Sparse
        • High-dimensional
        • Compressible
        • Non-atomic (one to many)
        • A mix of cardinal, ordinal, spatial, and binary data
      • Query models:
        • Spatial
        • Range and point
        • Keyword
  13. Comparing indexes

      Feature             | Column stores | Row stores   | Spatial trees | Inverted indexes | HDF5 | FastBit
      High dimensional    | Yes           | Yes          | No            | Yes              | Yes  | Yes
      Sparse              | Yes           | Stores nulls | No            | Yes              | Yes  | Stores nulls
      Multiple values     | Yes           | Yes          | No            | List, not range  | Yes  | Yes
      Non-numeric data    | Yes           | Yes          | No            | Yes              | Yes  | No
      Range queries       | Yes           | Yes          | Yes           | No               | Yes  | Yes
      Specialized indexes | Yes           | Yes          | No            | No               | No   | No
      High compression    | Yes           | No           | No            | Yes              | No   | Yes
  14. Conclusions
      • Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
      • Current approaches to scientific indexing are not a complete solution
      • Column stores are a natural fit for scientific metadata and queries
      • Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
  15. Questions?
  16. Data types (raw and semantic)

      Type     | Dryad | WISE | Argo | ORNL | Total
      String   | 100%  | 4%   | 62%  | 29%  | 28%
      Numeric  | 0%    | 96%  | 38%  | 71%  | 72%
      Str/Num  | 96%   | 68%  | 77%  | 72%  | 73%
      Date     | 2%    | 4%   | 7%   | 7%   | 5%
      Spatial  | 2%    | 9%   | 2%   | 21%  | 7%
      Flagsets | 0%    | 19%  | 14%  | 0%   | 15%

      • Support for spatial search is useful
      • Application hinting is needed for good search (is this a string, a location, or a flag set?)
  17. How can we support this?
      • Search functionality which:
        • Supports these kinds of queries
        • Does not double the size of storage
        • Does not require a linear scan over petabytes of data
      • The answers to queries are documents
      • We rarely need an entire row
      • Complex transactions and joins are less important
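As a rough illustration of the requirements above, the sketch below stores a single column on its own, keeps only non-null values, and answers a range query with document ids rather than whole rows. It is a toy under stated assumptions, not the authors' design: `SparseColumn`, the `pressure_kpa` data, and the document ids are hypothetical.

```python
import bisect
import math


class SparseColumn:
    """One column stored alone as sorted (value, doc_id) pairs; nulls are not stored."""

    def __init__(self):
        self._pairs = []

    def add(self, doc_id, value):
        if value is not None:  # skipping nulls keeps a sparse column small
            bisect.insort(self._pairs, (value, doc_id))

    def range_query(self, low, high):
        """Return doc ids whose value v satisfies low <= v <= high."""
        # (low,) sorts before any (low, doc_id), so this finds the first v >= low.
        start = bisect.bisect_left(self._pairs, (low,))
        result = []
        for value, doc_id in self._pairs[start:]:
            if value > high:
                break
            result.append(doc_id)
        return result


# Hypothetical simulation runs; run2 has no pressure recorded.
pressure_kpa = SparseColumn()
for doc_id, value in [("run1", 450.0), ("run2", None), ("run3", 620.0), ("run4", 510.0)]:
    pressure_kpa.add(doc_id, value)

# Pressure of at least 500 kPa; the result is a list of document ids.
hits = pressure_kpa.range_query(500.0, math.inf)
```

The binary search locates the start of the range without scanning the whole column, and no other column is touched, which is the property that makes column-oriented layouts attractive when queries rarely need an entire row.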