
Analyzing Extended and Scientific Metadata for Scalable Index Designs

How do we effectively index scientific file systems? We analyze scientific data to examine characteristics which impact index choices.

Published in: Technology, Education

Transcript of "Analyzing Extended and Scientific Metadata for Scalable Index Designs"

  1. Examining Extended and Scientific Metadata for Scalable Index Designs
     Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*, Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
     *University of California, Santa Cruz
     ^Conservatoire National des Arts et Métiers
  2. What we call metadata
     • Data for the system
     • External to the file
     • Small
     • Dense
     Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition"
  3. What everyone else calls metadata
     • Data for the user
     • Embedded in:
       • the file
       • the inode
       • a separate file
       • a notebook somewhere on their desk
     • Wildly varying size
     • Sparse
     [Figure labels: embedded metadata, metadata files, inode metadata, metadata outside the system]
  4. A scientist at work
     • “Show me the data set about bears in Alaska from last fall”
     • “Show me simulation results from last week for Vesuvius which used this code library, and where the pressure is higher than 500 kilopascals”
     • A mix of system and scientific metadata
  5. Our options
     • Relational databases
     • Column stores
     • Spatial trees (e.g., Spyglass, SmartStore)
     • Inverted indexes
     • Bitmap indexes (e.g., FastBit)
     • The choice of index depends on the data, but what does the data look like?
  6. Outline
     • The data in brief
     • Dimensionality
     • Sparsity
     • Atomicity
     • Entropy
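  7. The metadata in brief

     | Data set | Discipline   | Native format | Record count | Subsampled? | Sample count | Total size |
     |----------|--------------|---------------|--------------|-------------|--------------|------------|
     | Dryad    | Biology      | XML           | 31K          | No          | 31K          | 400 MB     |
     | WISE     | Astronomy    | CSV           | 564M         | Yes         | 10K          | 1 TB       |
     | ARGO     | Oceanography | NetCDF        | 2B           | Yes         | 635K         | 330 GB     |
     | ORNL     | Climatology  | CSV           | 1478         | No          | 1478         | 154 KB     |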
  8. Dimensionality

     |            | Dryad | WISE | Argo | ORNL | Total |
     |------------|-------|------|------|------|-------|
     | Dimensions | 44    | 285  | 108  | 14   | 451   |

     • Much higher dimensional than POSIX data
     • Curse of dimensionality concerns
  9. Sparsity
     Sparse even within a discipline (extremely sparse across all disciplines)
     • CDF of sparsity
     • For a randomly chosen element from X% of columns, there is a Y% chance it will be null
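One way to read the sparsity CDF on this slide is as a per-column null fraction, followed by the share of columns at or below each fraction. The sketch below assumes records are plain dictionaries and that a missing key counts as null; the field names and sample records are hypothetical, not taken from the data sets in the talk.

```python
def null_fractions(records, fields):
    """Fraction of records in which each field is absent or None."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}


def sparsity_cdf(fractions):
    """Sorted (null_fraction, share_of_columns_at_or_below) pairs."""
    ordered = sorted(fractions.values())
    total = len(ordered)
    return [(frac, (i + 1) / total) for i, frac in enumerate(ordered)]


# Hypothetical records: most fields are missing from most items.
records = [
    {"title": "a", "species": "Ursus arctos", "depth_m": None},
    {"title": "b"},
    {"title": "c", "species": None, "depth_m": 12.5},
    {"title": "d"},
]
fields = ["title", "species", "depth_m"]

fractions = null_fractions(records, fields)
cdf = sparsity_cdf(fractions)
```

Each pair in `cdf` gives a null fraction and the share of columns that are no sparser than it, which is the kind of curve the slide describes.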
  10. Atomicity (Dryad)
      • How many times can a field be present for a single item?
      • E.g., a single paper can have multiple authors
      • Truncated to show detail. One study had 800 species!
      Some disciplines have many field values per item. Others have range values (e.g., May-June 2010).
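The atomicity question (how many times a field is present for a single item) can be answered with a simple histogram. This is a minimal sketch, assuming each item is a dictionary whose values are either a scalar or a list; the items below are hypothetical and are not Dryad records.

```python
from collections import Counter


def field_multiplicity(items, field):
    """Histogram mapping number of values for `field` to number of items."""
    counts = Counter()
    for item in items:
        value = item.get(field, [])
        if not isinstance(value, list):
            value = [value]  # treat a scalar as a single value
        counts[len(value)] += 1
    return counts


# Hypothetical items with one-to-many fields.
items = [
    {"author": ["A", "B", "C"], "species": ["x"]},
    {"author": "D"},
    {"author": ["E", "F"], "species": ["y", "z"]},
]

author_hist = field_multiplicity(items, "author")
species_hist = field_multiplicity(items, "species")
```

A long tail in such a histogram (for example the study with 800 species) is what makes a strictly one-value-per-field index a poor fit.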
  11. Entropy
      • Row organization versus column
      • How compressible is the data?
      • How selective are queries?
      • Plenty of compression available
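Compressibility of a column can be estimated with Shannon entropy over its values: low entropy means few distinct, repetitive values and therefore good compression. The sketch below is an illustration of that standard measure, not necessarily the exact analysis used in the talk, and the sample columns are hypothetical.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy, in bits per value, of the values in one column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical columns: a constant one and a fully distinct one.
constant_column = ["CSV", "CSV", "CSV", "CSV"]
distinct_column = ["a", "b", "c", "d"]

low = column_entropy(constant_column)
high = column_entropy(distinct_column)
```

A column-oriented layout groups values of one field together, so low per-column entropy translates directly into compression, whereas a row layout interleaves fields with very different distributions.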
  12. Bringing it all together
      • Scientific data is:
        • Sparse
        • High-dimensional
        • Compressible
        • Non-atomic (one to many)
        • A mix of cardinal, ordinal, spatial, and binary data
      • Query models:
        • Spatial
        • Range and point
        • Keyword
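  13. Comparing indexes

      |                     | Column stores | Row stores   | Spatial trees | Inverted indexes | HDF5 | FastBit      |
      |---------------------|---------------|--------------|---------------|------------------|------|--------------|
      | High dimensional    | Yes           | Yes          | No            | Yes              | Yes  | Yes          |
      | Sparse              | Yes           | Stores nulls | No            | Yes              | Yes  | Stores nulls |
      | Multiple values     | Yes           | Yes          | No            | List, not range  | Yes  | Yes          |
      | Non-numeric data    | Yes           | Yes          | No            | Yes              | Yes  | No           |
      | Range queries       | Yes           | Yes          | Yes           | No               | Yes  | Yes          |
      | Specialized indexes | Yes           | Yes          | No            | No               | No   | No           |
      | High compression    | Yes           | No           | No            | Yes              | No   | Yes          |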
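The inverted-index column of the comparison can be illustrated with a toy implementation: absent fields take no space (sparse), a field may hold a list of values (multiple values), values may be strings (non-numeric), but lookups are exact matches rather than ranges. This is a minimal sketch under those assumptions; the function names, documents, and field values are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map (field, value) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[(field, v)].add(doc_id)
    return index


def lookup(index, **terms):
    """Ids of documents matching every field=value term (point queries only)."""
    postings = [index.get(term, set()) for term in terms.items()]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents with sparse, multi-valued fields.
docs = {
    "d1": {"species": ["Ursus arctos", "Ursus americanus"], "region": "Alaska"},
    "d2": {"species": ["Ursus maritimus"], "region": "Alaska"},
    "d3": {"region": "Vesuvius", "library": "hypothetical-sim"},
}

index = build_inverted_index(docs)
alaska = lookup(index, region="Alaska")
brown_bears = lookup(index, region="Alaska", species="Ursus arctos")
```

Answering "pressure higher than 500 kilopascals" with this structure would require enumerating every distinct pressure value, which matches the "No" in the range-queries row.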
  14. Conclusions
      • Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
      • Current approaches to scientific indexing are not a complete solution
      • Column stores are a natural fit for scientific metadata and queries
      • Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
  15. Questions?
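  16. Data types (raw and semantic)

      |                    | Dryad | WISE | Argo | ORNL | Total |
      |--------------------|-------|------|------|------|-------|
      | String (raw)       | 100%  | 4%   | 62%  | 29%  | 28%   |
      | Numeric (raw)      | 0%    | 96%  | 38%  | 71%  | 72%   |
      | Str/Num (semantic) | 96%   | 68%  | 77%  | 72%  | 73%   |
      | Date               | 2%    | 4%   | 7%   | 7%   | 5%    |
      | Spatial            | 2%    | 9%   | 2%   | 21%  | 7%    |
      | Flagsets           | 0%    | 19%  | 14%  | 0%   | 15%   |

      • Support for spatial search is useful
      • Application hinting is needed for good search (is this a string, a location, or a flag set?)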
  17. How can we support this?
      • Search functionality which:
        • Supports these kinds of queries
        • Does not double the size of storage
        • Does not require a linear scan over petabytes of data
      • The answers to queries are documents
      • We rarely need an entire row
      • Complex transactions and joins are less important
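The requirements above (no linear scan, answers are document ids, whole rows are rarely needed) can be sketched with one sorted structure per field. This is an illustrative toy, not how any particular column store is implemented; the class name `SparseColumn`, the run ids, and the values are all hypothetical.

```python
import bisect


class SparseColumn:
    """One field stored as sorted (value, doc_id) pairs; nulls are not stored."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)
        self._values = [value for value, _ in self._pairs]

    def greater_than(self, threshold):
        """Document ids whose value is strictly greater than `threshold`."""
        start = bisect.bisect_right(self._values, threshold)
        return {doc_id for _, doc_id in self._pairs[start:]}

    def equal_to(self, value):
        """Document ids whose value equals `value`."""
        lo = bisect.bisect_left(self._values, value)
        hi = bisect.bisect_right(self._values, value)
        return {doc_id for _, doc_id in self._pairs[lo:hi]}


# Hypothetical simulation runs; pressure is in kilopascals.
pressure = SparseColumn([(480.0, "run1"), (510.0, "run2"), (650.0, "run3")])
volcano = SparseColumn([("Vesuvius", "run2"), ("Vesuvius", "run3"), ("Etna", "run1")])

hits = pressure.greater_than(500.0) & volcano.equal_to("Vesuvius")
```

Only the two columns named in the query are touched, and the result is a set of document ids rather than full rows.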
