1
Binary Storage Formats for Sparse
Erik Welch
ewelch@anaconda.com
September 21, 2021
2
Why are we here?
● To try to improve the status quo
○ Matrix Market, FROSTT, and other text-based formats
○ COO: coordinate triples of rows, columns, and values
■ Columnar storage technology is really good!
○ Bespoke, library-specific binary formats
○ TileDB
○ Domain-specific solutions (genomics, bioimaging, neuroscience, etc.)
● To discuss and create a vision for new binary storage formats for sparse
○ “Open standard” formats that are broadly useful and widely accessible
● To determine what is most important to the people here
○ Which languages, libraries, formats, features, technologies, etc.
○ And what do you wish for? Dream big!
● To find partners and team up for the next steps
○ Prototype, analyze, communicate results, and iterate
○ Goals for 3-6 months and beyond
3
Who is this for?
● Library developers and users
● Anybody with large sparse data or graphs
● Hardware creators and vendors
○ For example, to better optimize running workloads on GPUs or other accelerators
● Researchers
● Companies with graph workloads
● Benchmark organizers
● Maintainers of repositories of sparse data
● Virtually anybody who needs to work with sparse data or graphs
4
What is a file format?
● Data written to long term storage
○ Any technically competent person who discovers the file 10 years later should be able to figure out how to load and use it (even if they don’t recognize the format)
● The logical model of a file format
○ Metadata about metadata
■ File extension (such as .zip or .pdf)
■ “Magic number” at beginning of file (NumPy’s .npy is \x93NUMPY)
■ Version number of format
■ Any additional extensions needed to understand the metadata or data
○ Metadata
■ Data type, endianness of data, shape of sparse array, etc.
■ Info about compression
○ Data
■ Contiguous arrays of raw bytes
■ May be chunked or compressed
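As a rough illustration of the "metadata about metadata" layer, the sketch below identifies a file by its leading magic bytes and reads the two version bytes that follow the .npy magic string. The function names `sniff_format` and `npy_version` are made up for this example; only the magic strings themselves are taken from the respective formats.

```python
# Magic byte prefixes of a few well-known formats.
KNOWN_MAGIC = {
    b"\x93NUMPY": "npy",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
}


def sniff_format(raw: bytes):
    """Return a format name if the bytes start with a known magic prefix."""
    for magic, name in KNOWN_MAGIC.items():
        if raw.startswith(magic):
            return name
    return None


def npy_version(raw: bytes):
    """Return (major, minor) from the two bytes after the .npy magic string."""
    if not raw.startswith(b"\x93NUMPY") or len(raw) < 8:
        raise ValueError("not a .npy header")
    return raw[6], raw[7]


# Hand-built prefix of a version 1.0 .npy file (illustrative bytes only).
sample = b"\x93NUMPY" + bytes([1, 0])
print(sniff_format(sample), npy_version(sample))
```

A reader that meets an unfamiliar extension can still fall back on the magic bytes and version number before trying to parse anything else.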
5
An “open” file format is also about community
● More likely to succeed with more partners
● Right now, a small group (or just me) can
make a lot of progress with prototyping
and documenting to bootstrap the effort
○ Along with feedback from LAGraph group
and others
● We’ll want more partners eventually
● Please think about how/when you can help
○ Student projects
○ Volunteer to be beta users, give feedback
○ Participate in this session
● We’ll need a governance model eventually
○ Hopefully! This is a good “problem” to have
○ We’re happy to have this discussion
“If you want to go quickly,
go alone. If you want to
go far, go together.”
African Proverb
6
What are our high level goals?
● Fast enough
● Small enough
● Flexible enough
● Portable enough
● Future-proof enough
● Standardized enough
● “Cloud native” enough
● Simple enough to implement
● Easy enough to use
● Scalable enough
● Able to evolve
● And, of course, able to support whatever optimized format Tim Davis can dream up!
“Pinky, are you pondering what I’m pondering?”
Brain
7
Essential sparse formats to support: rank 1 and 2
Format             Arrays
Sparse vector      indices, values
COO (ijv triples)  rows, columns, values
CSR                indptr, col_indices, values
CSC                indptr, row_indices, values
DCSR (HyperCSR)    rows, indptr, col_indices, values
DCSC (HyperCSC)    columns, indptr, row_indices, values
Many others: BSR, BSRX, DIA, SCS, Skyline, ELL (ITPACK/ELLPACK), hash map, CSB, CSR5, extended row scheme, etc.
Let’s start designing!
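To make the relationship between these layouts concrete, here is a minimal pure-Python sketch that converts COO triples to CSR and then drops empty rows to get DCSR. The function names are illustrative and not taken from any library; the array names follow the slide.

```python
def coo_to_csr(nrows, rows, cols, vals):
    """Convert COO triples to CSR arrays (indptr, col_indices, values)."""
    # Count entries per row, then prefix-sum the counts into indptr.
    indptr = [0] * (nrows + 1)
    for r in rows:
        indptr[r + 1] += 1
    for i in range(nrows):
        indptr[i + 1] += indptr[i]
    # Order the entries by (row, column) so each row is contiguous and sorted.
    order = sorted(range(len(rows)), key=lambda k: (rows[k], cols[k]))
    col_indices = [cols[k] for k in order]
    values = [vals[k] for k in order]
    return indptr, col_indices, values


def csr_to_dcsr(indptr, col_indices, values):
    """Keep pointers only for non-empty rows (DCSR / HyperCSR)."""
    rows = [i for i in range(len(indptr) - 1) if indptr[i + 1] > indptr[i]]
    compact_indptr = [indptr[i] for i in rows] + [indptr[-1]]
    return rows, compact_indptr, col_indices, values


# 4 x 4 matrix with entries (0,1)=10, (0,3)=20, (3,0)=30, given unsorted.
indptr, col_indices, values = coo_to_csr(4, [3, 0, 0], [0, 3, 1], [30.0, 20.0, 10.0])
print(indptr, col_indices, values)
print(csr_to_dcsr(indptr, col_indices, values))
```

CSR stores one pointer per row whether or not the row is empty. DCSR stores pointers only for the rows that have entries, which matters when most rows are empty.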
8
Essential formats for rank N: COO and CSF
http://shaden.io/pub-files/smith2017knl.pdf
CSF is like DCSR/HyperCSR generalized to higher dimensions
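A small sketch may help show why CSF reads as DCSR applied level by level. It builds a rank-3 CSF structure from COO triples in pure Python. The function name and the returned array names (`ind0`, `ptr0`, `ind1`, `ptr1`, `ind2`) are assumptions for illustration, not names from the cited paper.

```python
def coo_to_csf3(i, j, k, vals):
    """Build a rank-3 CSF structure from COO coordinates and values."""
    entries = sorted(zip(i, j, k, vals))
    ind0, ptr0 = [], [0]   # level 0: distinct i values, ranges into level 1
    ind1, ptr1 = [], [0]   # level 1: distinct (i, j) fibers, ranges into leaves
    ind2, values = [], []  # leaves: k coordinates and their values
    for a, b, c, v in entries:
        new_i = not ind0 or ind0[-1] != a
        if new_i:
            ind0.append(a)
            ptr0.append(ptr0[-1])  # open an empty range of level-1 fibers
        if new_i or ind1[-1] != b:
            ind1.append(b)
            ptr1.append(ptr1[-1])  # open an empty range of leaves
            ptr0[-1] += 1          # the current i gained one fiber
        ind2.append(c)
        values.append(v)
        ptr1[-1] += 1              # the current fiber gained one leaf
    return ind0, ptr0, ind1, ptr1, ind2, values


csf = coo_to_csf3([0, 0, 0, 2], [0, 0, 1, 1], [1, 2, 0, 1], [1.0, 2.0, 3.0, 4.0])
print(csf)
```

Each level stores only the indices that actually occur, plus pointers into the next level, just as DCSR stores only non-empty rows.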
9
Logical model of a file
● Metadata
○ Let’s use JSON
○ Versatile
○ Human-readable
○ No need to think about endianness of metadata values
● Data
○ Like a key-value store with binary data
○ This is just a logical model
○ We are free to pack the data into bytes however we wish
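The logical model can be sketched in a few lines: JSON text for the metadata and a dictionary of raw byte strings standing in for the key-value store. The metadata field names (`shape`, `arrays`, `struct_code`, `byteorder`) are hypothetical, chosen only for this illustration.

```python
import json
import struct


def pack_array(code, items):
    """Pack a sequence as little-endian raw bytes (code is a struct format char)."""
    return struct.pack(f"<{len(items)}{code}", *items)


def unpack_array(code, raw):
    """Inverse of pack_array: raw little-endian bytes back to a list."""
    count = len(raw) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", raw))


# Metadata is plain JSON text, so its own values have no endianness.
# Field names here are assumptions for illustration.
metadata_json = json.dumps({
    "shape": [4, 4],
    "arrays": {
        "indptr": {"struct_code": "q", "byteorder": "little"},
        "col_indices": {"struct_code": "q", "byteorder": "little"},
        "values": {"struct_code": "d", "byteorder": "little"},
    },
})

# Data behaves like a key-value store: array name -> contiguous raw bytes.
store = {
    "indptr": pack_array("q", [0, 2, 2, 2, 3]),
    "col_indices": pack_array("q", [1, 3, 0]),
    "values": pack_array("d", [10.0, 20.0, 30.0]),
}


def load(name):
    """Read one array back, using the metadata to interpret its bytes."""
    info = json.loads(metadata_json)["arrays"][name]
    return unpack_array(info["struct_code"], store[name])


print(load("indptr"), load("values"))
```

How `store` is laid out on disk (single archive, directory of files, object storage) is a separate decision from this logical view.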
10
Unified Sparse Storage
Minimal, cohesive set of metadata and storage keys
storage keys
double-compressed         fiber_ind0, fiber_ptr0
compressed                indptr
sparse                    coord0, coord1
Columns: Rank 1, then Rank 2 (CSR, CSC, DCSR, DCSC, COO)
metadata key              Rank 1   CSR      CSC      DCSR     DCSC     COO
fiber_axes                -        -        -        [0]      [0]      -
compressed_axes           -        [0]      [0]      -        -        -
coord_axes                [0]      [1]      [1]      [1]      [1]      [0, 1]
axis_order                [0]      [0, 1]   [1, 0]   [0, 1]   [1, 0]   [0, 1], [1, 0]
num_sorted_coords (opt.)  0 or 1   0 or 1   0 or 1   0 or 1   0 or 1   0, 1, or 2
has_duplicates (opt.)
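One way to read the table above is as a lookup from metadata to format name. The sketch below does this for rank 2. It assumes fiber_axes is set for DCSR and DCSC, compressed_axes for CSR and CSC, and that axis_order [1, 0] marks the column-oriented variants. The function name `classify_rank2` is hypothetical.

```python
import json


def classify_rank2(meta):
    """Map rank-2 metadata to a familiar format name (assumed reading of the table)."""
    fiber = meta.get("fiber_axes", [])
    compressed = meta.get("compressed_axes", [])
    coord = meta.get("coord_axes", [])
    order = meta.get("axis_order", [0, 1])
    if coord == [0, 1] and not fiber and not compressed:
        return "COO"
    if coord != [1]:
        raise ValueError("unrecognized coord_axes for rank 2")
    column_oriented = order == [1, 0]
    if fiber == [0] and not compressed:
        return "DCSC" if column_oriented else "DCSR"
    if compressed == [0] and not fiber:
        return "CSC" if column_oriented else "CSR"
    raise ValueError("unrecognized combination of axes")


meta = json.loads('{"compressed_axes": [0], "coord_axes": [1], "axis_order": [1, 0]}')
print(classify_rank2(meta))
```

Under this reading, the familiar format names are shorthand for particular combinations of a small set of metadata keys.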
11
Unified Sparse Storage: Rank 3
storage keys
double-compressed         fiber_ind0, fiber_ptr0, fiber_ind1, fiber_ptr1
compressed                indptr
sparse                    coord0, coord1, coord2
metadata key              CSF         CSF (alt)   CSR+extra   COO         DCSR+extra
fiber_axes                [0, 1]      [1]         -           -           [0]
compressed_axes           -           [0]         [0]         -           -
coord_axes                [2]         [2]         [1, 2]      [0, 1, 2]   [1, 2]
axis_order
num_sorted_coords (opt.)  0 or 1      0 or 1      0 to 2      0 to 3      0 to 2
has_duplicates (opt.)
12
Unified Sparse Storage: Rank 2 combinations
storage keys
double-compressed         fiber_ind0, fiber_ptr0
compressed                indptr
sparse                    coord0, coord1
metadata key              CSR/COO       DCSR/CSR      DCSR/COO      DCSR/CSR/COO  CSC/COO       DCSC/CSC      DCSC/COO      DCSC/CSC/COO
fiber_axes                -             [0]           [0]           [0]           -             [0]           [0]           [0]
compressed_axes           [0]           [0]           -             [0]           [0]           [0]           -             [0]
coord_axes                [0, 1]        [1]           [0, 1]        [0, 1]        [0, 1]        [1]           [0, 1]        [0, 1]
axis_order                [0, 1]        [0, 1]        [0, 1]        [0, 1]        [1, 0]        [1, 0]        [1, 0]        [1, 0]
num_sorted_coords (opt.)  0 or 1        0 or 1        0 or 1        0 or 1        0 or 1        0 or 1        0 or 1        0 or 1
has_duplicates (opt.)
13
Unified Sparse Storage: Values
● No values?
○ Index-only
● Single values array
○ Typical usage
● Multiple value arrays!
○ Different data types
○ Give them different names?
○ Allows bitmaps
○ Want ability to read only one values array
● Dense array (may be rank N) for each element
○ For example, from embeddings
● Can a fill value be given for the missing elements?
14
How should we pack bytes into a file?
● Straw man proposal: use asar!
○ “asar is a simple extensive archive format, it works like tar that concatenates all files together without compression, while having random access support”
● Support random access
● Use JSON to store files' information
● Very easy to write a parser
| UInt32: header_size | String: header | Bytes: file1 | ... | Bytes: file42 |
Other options
● ZIP file
● LMDB
● HDF4, HDF5, CDF5, NetCDF
● Directory of files on file system
● Directory of files in cloud
○ S3, GCS, Azure, Ceph, etc.
● Zarr
● n5
● ASDF
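The asar-style layout from this slide (header size, JSON header, then concatenated payloads) is small enough to sketch directly. The code follows only that simplified layout and is not the actual asar byte format, which may differ in detail. The function names and the header fields `offset` and `size` are assumptions for illustration.

```python
import io
import json
import struct


def pack_archive(arrays):
    """Concatenate named byte strings behind a JSON header of offsets and sizes."""
    header = {}
    offset = 0
    for name, raw in arrays.items():
        header[name] = {"offset": offset, "size": len(raw)}
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # | UInt32: header_size | String: header | Bytes: file1 | ... |
    return struct.pack("<I", len(header_bytes)) + header_bytes + b"".join(arrays.values())


def read_header(f):
    """Return the parsed header and the byte position where payloads begin."""
    (header_size,) = struct.unpack("<I", f.read(4))
    header = json.loads(f.read(header_size).decode("utf-8"))
    return header, 4 + header_size


def read_entry(f, name):
    """Random access: seek straight to one entry without reading the others."""
    f.seek(0)
    header, data_start = read_header(f)
    entry = header[name]
    f.seek(data_start + entry["offset"])
    return f.read(entry["size"])


archive = pack_archive({"indptr": b"\x00\x01\x02", "values": b"abcdef"})
print(read_entry(io.BytesIO(archive), "values"))
```

Because the header records each entry's offset and size, a reader can fetch a single array with one seek, and the same approach carries over to byte-range requests against cloud object storage.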
15
Quick recap: what do we have so far?
● Arbitrary metadata via JSON
○ Able to evolve and support extensions
● Metadata scheme to specify CSR, DCSR, COO, CSF, etc.
○ Indicates what data is available and how to read it
● Single file archive that supports random access
○ Logically behaves like a key-value store
● Great, but what about extensions?
○ Support other sparse formats
○ Specify compression
○ Split single array into chunks
○ Additional data types (such as datetime)
○ Indicate variant of or option for existing format
Design goal: easily show via feature matrix which formats and extensions are supported by each language and library.
16
File format variations
● CSR
○ “Padded” CSR: add indptr_end so column indices and values can have gaps
■ Cheaper to add or remove values
○ Column indices in a given row can be:
■ unsorted
■ sorted
■ binary tree as a binary heap
○ CSR5
● COO
○ One array per coordinate
■ May have different data types (uint8, uint16, etc.)
○ One array for all coordinates (N x D)
○ Has duplicates or not
○ Lexicographic sort depth
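The "padded" CSR idea can be sketched as follows. The sketch assumes `indptr[i]` marks where row i starts, `indptr[i + 1]` marks the end of the space reserved for it, and `indptr_end[i]` marks where its used entries stop. That reading of indptr_end, and both function names, are assumptions for illustration.

```python
def row_entries(i, indptr, indptr_end, col_indices, values):
    """Return the (column, value) pairs actually stored in row i."""
    start, end = indptr[i], indptr_end[i]
    return list(zip(col_indices[start:end], values[start:end]))


def append_to_row(i, col, val, indptr, indptr_end, col_indices, values):
    """Write into the row's slack space; return False if the row is full."""
    if indptr_end[i] >= indptr[i + 1]:
        return False
    pos = indptr_end[i]
    col_indices[pos] = col
    values[pos] = val
    indptr_end[i] += 1
    return True


# Two rows with room for three entries each; None marks unused slots.
indptr = [0, 3, 6]
indptr_end = [2, 4]
col_indices = [0, 2, None, 1, None, None]
values = [1.0, 2.0, None, 3.0, None, None]

print(append_to_row(0, 3, 9.0, indptr, indptr_end, col_indices, values))
print(row_entries(0, indptr, indptr_end, col_indices, values))
```

Adding an entry touches only one slot and one pointer as long as the row has slack, where unpadded CSR would have to shift every later entry.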
17
File format variations
https://arxiv.org/pdf/1904.03329.pdf
B-CSF
18
What about other properties?
A graph is more than just a sparse matrix
Graph properties
● directed / undirected
● weighted / unweighted
● is adjacency matrix
● is incidence matrix
● is bipartite graph
● is multigraph
● is hypergraph
● has self-edges
● has node attributes
Matrix properties
● is upper triangular
● is lower triangular
● is diagonal
● is symmetric
● is symmetric (only upper triangle)
● is symmetric (only lower triangle)
● is iso-valued
● ndiag
Thank You!
Erik Welch
ewelch@anaconda.com
About Anaconda
With more than 20 million users, Anaconda is the world’s most
popular data science platform and the foundation of modern
machine learning. We pioneered the use of Python for data
science, champion its vibrant community, and continue to
steward open-source projects that make tomorrow’s
innovations possible. Our enterprise-grade solutions enable
corporate, research, and academic institutions around the world
to harness the power of open-source for competitive advantage,
groundbreaking research, and a better world.
Visit https://www.anaconda.com to learn more.

HPEC 2021 sparse binary format
