2. Agenda
I. Metadata
1. What & why
2. Collection approaches
3. ML Metadata: why?
4. ML Metadata: what?
5. Related work
II. Mastro
1. Data Assets
2. Connectors
3. Catalogue Service
4. Feature Store
5. Crawlers
6. UI
7. MVC
III. Quickstart
3. What & Why
● Metadata: “data about other data”
○ main goal is to allow for indexing and retrieval of a resource
○ resources described in terms of attributes and relations to other resources
● e.g. Semantic Web
○ unambiguous naming - Uniform Resource Identifiers (URI)
○ Resource Description Framework (RDF)
■ <subject, predicate, object> triples or <namespace, s, p, o> quadruples
■ knowledge bases as queryable graphs - SPARQL
■ ontologies as shared data models - shared taxonomies of entities and their axioms
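The triple/quadruple idea above can be sketched in a few lines of Go. This is a toy illustration, not any particular RDF library; the `Triple` struct, the `Match` helper and the `urn:`/`dct:` identifiers are all made up for the example.

```go
package main

import "fmt"

// Triple models an RDF <subject, predicate, object> statement.
// Adding a namespace/graph field would turn it into a quadruple.
type Triple struct {
	Subject, Predicate, Object string
}

// Match returns all triples matching the given pattern; an empty
// string acts as a wildcard, mimicking a basic SPARQL triple pattern.
func Match(kb []Triple, s, p, o string) []Triple {
	var out []Triple
	for _, t := range kb {
		if (s == "" || t.Subject == s) &&
			(p == "" || t.Predicate == p) &&
			(o == "" || t.Object == o) {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	kb := []Triple{
		{"urn:dataset:sales", "dct:creator", "urn:team:bi"},
		{"urn:dataset:sales", "dct:format", "parquet"},
		{"urn:dataset:users", "dct:creator", "urn:team:core"},
	}
	// who created the sales dataset?
	fmt.Println(Match(kb, "urn:dataset:sales", "dct:creator", ""))
}
```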
4. Collection approaches
Push
● event-based - push to a remote endpoint (service or queue) on any change to the monitored data
● needs invasive access to each monitored resource (requires code changes)
Pull
● periodic/scheduled crawling of resources
● typically used in search engines - periodically visit a set of root pages and navigate the links from there (only read access required)
[Diagram: pull crawlers navigate from topics down trees of resources and feed a KB; push agents sit on individual resources]
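The push/pull contrast above can be sketched in Go. This is a toy model of the two approaches, not Mastro code: `pushAgent`, `pullCrawler` and `Resource` are invented names for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Resource is anything whose metadata can be collected.
type Resource struct{ Name, Metadata string }

// pushAgent is the invasive approach: the monitored system itself
// calls collect() on every change (requires code changes there).
func pushAgent(changed Resource, collect func(Resource)) {
	collect(changed)
}

// pullCrawler is the non-invasive approach: periodically visit a set
// of root resources with read access only. A single tick stands in
// for a real schedule here.
func pullCrawler(roots []Resource, collect func(Resource)) {
	tick := time.NewTicker(10 * time.Millisecond)
	defer tick.Stop()
	<-tick.C // one scheduled visit
	for _, r := range roots {
		collect(r)
	}
}

func main() {
	kb := map[string]string{} // toy knowledge base
	collect := func(r Resource) { kb[r.Name] = r.Metadata }

	pushAgent(Resource{"users", "v2"}, collect)
	pullCrawler([]Resource{{"sales", "v1"}, {"events", "v3"}}, collect)
	fmt.Println(len(kb), "resources collected")
}
```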
5. ML Metadata: why?
Uber’s journey towards a better data culture
○ Problems: data duplication (different solutions to similar problems), discovery issues (no shared specification), disconnected tools (no downstream usage tracking), logging inconsistencies, lack of process (common practices), lack of ownership and SLAs (accountability and quality)
○ Solutions: ownership, quality monitoring & SLAs, unified processes and tools
■ Data Annotation - according to a shared data model
■ i) static info (ownership, lineage - related pipelines, code, tier), ii) usage (audit information, especially on who modifies the data), iii) quality (available tests and provided metrics and SLAs), iv) cost (resources needed to (re)compute the data), v) references to open issues and bugs
6. ML Metadata: what?
● Catalogue - inventory of data assets
○ asset annotation, discovery and self-service data access (easier interaction across teams and projects)
○ versioning and lineage control (ownership?)
● Metrics Stores - data quality assurance
○ data profiling - extraction of statistics and rules from monitored data (train phase?!)
○ metrics calculation - calculate statistics on incoming data and based on rules (predict phase?!)
○ validation - monitoring/alerting on data drift
● Feature stores - metadata of processed data
○ versioning of processed data
○ online serving - decoupling use cases from processing
● Experiment Tracking & Model Registry - metadata of experiments and their results
○ focus on repeatability and model interoperability (across various libraries and technologies)
7. Related work
● Data Catalogues
○ Apache Atlas (mainly hadoop-related tech)
○ Lyft Amundsen, Uber Databook, LinkedIn DataHub, Netflix Metacat, Airbnb Dataportal
● Quality Metrics Stores
○ (old!) Apache Griffin, AWS deequ, great_expectations, Tensorflow Data Validation
● Feature Stores - https://www.featurestore.org/
○ Feast (Go), SageMaker Feature Store, many more..
● Experiment tracking & model registries - https://mlops.toys/
○ MLflow, BentoML, Seldon, Evidently AI, many more..
○ the first two do both tracking and serving; the latter two do serving and model monitoring (very diverse!)
11. Catalogue Service
● Get Asset by Name or Tags
● Upsert 1 or multiple Assets
https://github.com/data-mill-cloud/mastro/blob/master/doc/CATALOGUE.md
● DAOs for various connectors
12. Feature Store
● A Feature specifies its type, but Go only serializes primitive ones correctly
● Get By Name
● Explicit human-readable versioning using Version
● Implicit versioning with InsertedAt (set when doing PUT)
● Features also pushed to catalogue
https://github.com/data-mill-cloud/mastro/blob/master/doc/FEATURESTORE.md
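The two versioning mechanisms on this slide can be sketched as a Go struct: an explicit, human-readable `Version` set by the caller, plus an implicit `InsertedAt` stamped on PUT. Field and function names are illustrative, not the exact Mastro schema - see the linked FEATURESTORE.md.

```go
package main

import (
	"fmt"
	"time"
)

// Feature carries a value and its declared type; primitive values
// serialize cleanly with Go's encoding/json.
type Feature struct {
	Name     string      `json:"name"`
	Value    interface{} `json:"value"`
	DataType string      `json:"data-type"`
}

// FeatureSet pairs explicit versioning (Version) with implicit
// versioning (InsertedAt, set when doing PUT).
type FeatureSet struct {
	Name       string    `json:"name"`
	Version    string    `json:"version"`
	InsertedAt time.Time `json:"inserted_at"`
	Features   []Feature `json:"features"`
}

// put simulates the upsert path: the caller sets Version,
// the store stamps InsertedAt.
func put(store map[string]FeatureSet, fs FeatureSet) {
	fs.InsertedAt = time.Now().UTC()
	store[fs.Name+"@"+fs.Version] = fs
}

func main() {
	store := map[string]FeatureSet{}
	put(store, FeatureSet{
		Name:     "churn-model-inputs",
		Version:  "v1.2.0",
		Features: []Feature{{Name: "avg_session_len", Value: 42.5, DataType: "float"}},
	})
	fmt.Println(store["churn-model-inputs@v1.2.0"].Version)
}
```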
13. Crawlers
● Any walkable file system or database can be crawled
● A filter on the filename is used to select only manifest files
● Scheduled reconcile loop
a. walk
b. bulk upsert to catalogue
https://github.com/data-mill-cloud/mastro/blob/master/doc/CRAWLERS.md
Available crawlers:
● s3
● hdfs
● hive
● impala
● local (volume)
15. MVC - Mastro Version Control
● Motivation - bring data back to where it should be
○ the file system, rather than a weird combination with git
○ an alternative to dvc and pachyderm
○ bridges the gap between ML dataset versioning and versioning in the DWH (Hudi, Delta, Iceberg)
○ Merkle-tree-based integrity checks are not available for the latter - too expensive for large datasets
● MVC
○ simple wrapper around DFS clients (e.g. S3, HDFS)
○ a manifest metadata file sits alongside the data files - the same format the crawlers can pick up!
○ simple interface - same config as the services (catalogue and feature store)
■ set a config that specifies the data source - e.g. MVC_CONFIG=$PWD/conf/example_s3.yml
■ init a dataset at a destination (e.g. a bucket) - creates a local manifest that can be filled in and is then uploaded
■ new to create a new version
■ add to add files to the current latest version - hashes only the folder/file being added
■ delete to delete an entire version and its metadata
■ check to compute the hash-sum of a local folder - to compare a downloaded copy against the one in the metadata
16. Mastro
Metadata management in Go
Quickstart:
● docker compose (mongo+catalogue+fs+ui)
● k8s deployment
Thanks! That’s all, folks!