2. 2
● To try to improve the status quo
○ Matrix Market, FROSTT, and other text-based formats
○ COO: coordinate triples of rows, columns, and values
■ Columnar storage technology is really good!
○ Bespoke, library-specific binary formats
○ TileDB
○ Domain-specific solutions (genomics, bioimaging, neuroscience, etc.)
● To discuss and create a vision for new binary storage formats for sparse
○ “Open standard” formats that are broadly useful and widely accessible
● To determine what is most important to the people here
○ Which languages, libraries, formats, features, technologies, etc.
○ And what do you wish for? Dream big!
● To find partners and team up for the next steps
○ Prototype, analyze, communicate results, and iterate
○ Goals for 3-6 months and beyond
Why are we here?
3. 3
● Library developers and users
● Anybody with large sparse data or graphs
● Hardware creators and vendors
○ For example, to better optimize running workloads on GPUs or other accelerators
● Researchers
● Companies with graph workloads
● Benchmark organizers
● Maintainers of repositories of sparse data
● Virtually anybody who needs to work with sparse data or graphs
Who is this for?
4. 4
● Data written to long term storage
○ Any technically competent person who discovers the file 10 years later should be
able to figure out how to load and use it (even if they don’t recognize the format)
● The logical model of a file format
○ Metadata about metadata
■ File extension (such as .zip or .pdf)
■ “Magic number” at beginning of file (NumPy’s .npy is x93NUMPY)
■ Version number of format
■ Any additional extensions needed to understand the metadata or data
○ Metadata
■ Data type, endianness of data, shape of sparse array, etc.
■ Info about compression
○ Data
■ Contiguous arrays of raw bytes
■ May be chunked or compressed
What is a file format?
5. 5
An “open” file format is also about community
● More likely to succeed with more partners
● Right now, a small group (or just me) can
make a lot of progress with prototyping
and documenting to bootstrap the effort
○ Along with feedback from LAGraph group
and others
● We’ll want more partners eventually
● Please think about how/when you can help
○ Student projects
○ Volunteer to be beta users, give feedback
○ Participate in this session
● We’ll need a governance model eventually
○ Hopefully! This is a good “problem” to have
○ We’re happy to have this discussion
“If you want to go quickly,
go alone. If you want to
go far, go together.”
African Proverb
6. 6
● Fast enough
● Small enough
● Flexible enough
● Portable enough
● Future-proof enough
● Standardized enough
● “Cloud native” enough
● Simple enough to implement
● Easy enough to use
● Scalable enough
● Able to evolve
● And, of course, able to support
whatever optimized format
Tim Davis can dream up!
What are our high level goals?
“Pinky, are you pondering
what I’m pondering?”
Brain
8. 8
Essential formats for rank N: COO and CSF
http://shaden.io/pub-files/smith2017knl.pdf
CSF is like DCSR/HyperCSR generalized to higher dimensions
9. 9
● Metadata
○ Let’s use JSON
○ Versatile
○ Human-readable
○ No need to think about endianness
of metadata values
Logical model of a file
● Data
○ Like a key-value store with binary data
○ This is just a logical model
○ We are free to pack the data into bytes
however we wish
10. 10
Unified Sparse Storage
Rank 1 Rank 2
key CSR CSC DCSR DCSC COO
double- fiber_ind0
compressed fiber_ptr0
compressed indptr
sparse coord0
coord1
metadata fiber_axes [0] [0]
compressed_axes [0] [0]
coord_axes [0] [1] [1] [1] [1] [0, 1]
axis_order [0] [0, 1] [1, 0] [0, 1] [1, 0] [0,1], [1,0]
optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0, 1, or 2
optional has_duplicates
Minimal, cohesive set of metadata and storage keys
13. 13
● No values?
○ Index-only
● Single values array
○ Typical usage
● Multiple value arrays!
○ Different data types
○ Give them different names?
○ Allows bitmaps
○ Want ability to read only one values array
● Dense array (may be rank N) for each element
○ For example, from embeddings
● Can a fill value be given for the missing elements?
Unified Sparse Storage: Values
14. 14
● Straw man proposal: use asar!
○ “asar is a simple extensive archive format, it works
like tar that concatenates all files together without
compression, while having random access support”
● Support random access
● Use JSON to store files' information
● Very easy to write a parser
How should we pack bytes into a file?
Other options
● ZIP file
● LMDB
● HDF4, HDF5, CDF5, NetCDF
● Directory of files on file system
● Directory of files in cloud
○ S3, GCS, Azure, Ceph, etc.
● Zarr
● n5
● ASDF
| UInt32: header_size | String: header | Bytes: file1 | ... | Bytes: file42 |
15. 15
● Arbitrary metadata via JSON
○ Able to evolve and support extensions
● Metadata scheme to specify CSR, DCSR, COO, CSF, etc.
○ Indicates what data is available and how to read it
● Single file archive that supports random access
○ Logically behaves like a key-value store
● Great, but what about extensions?
○ Support other sparse formats
○ Specify compression
○ Split single array into chunks
○ Additional data types (such as datetime)
○ Indicate variant of or option for existing format
Quick recap: what do we have so far?
Design goal: easily show via feature matrix which formats
and extensions are supported by each language and library.
16. 16
● CSR
○ “Padded” CSR: add indptr_end so column indices and values can have gaps
■ Cheaper to add or remove values
○ Column indices in a given row can be:
■ unsorted
■ sorted
■ binary tree as a binary heap
○ CSR5
● COO
○ One array per coordinate
■ May have different data types (uint8, uint16, etc.)
○ One array for all coordinates (N x D)
○ Has duplicates or not
○ Lexicographic sort depth
File format variations
18. 18
Graph properties
● directed / undirected
● weighted / unweighted
● is adjacency matrix
● is incidence matrix
● is bipartite graph
● is multigraph
● is hypergraph
● has self-edges
● has node attributes
Matrix properties
● is upper triangular
● is lower triangular
● is diagonal
● is symmetric
● is symmetric (only upper triangle)
● is symmetric (only lower triangle)
● is iso-valued
● ndiag
What about other properties?
A graph is more than just a sparse matrix
20. About Anaconda
With more than 20 million users, Anaconda is the world’s most
popular data science platform and the foundation of modern
machine learning. We pioneered the use of Python for data
science, champion its vibrant community, and continue to
steward open-source projects that make tomorrow’s
innovations possible. Our enterprise-grade solutions enable
corporate, research, and academic institutions around the world
to harness the power of open-source for competitive advantage,
groundbreaking research, and a better world.
Visit https://www.anaconda.com to learn more.