Mastro
Metadata management in Go
Andrea Monacchi
Agenda
I. Metadata
1. what & why
2. collection approaches
3. ML Metadata: why?
4. ML Metadata: what?
5. Related work
II. Mastro
1. Data Assets
2. Connectors
3. Catalogue Service
4. Feature Store
5. Crawlers
6. UI
7. MVC
III. Quickstart
What & Why
● Metadata: “data about other data”
○ main goal is to allow for indexing and retrieval of a resource
○ resources described in terms of attributes and relations to other resources
● e.g. Semantic Web
○ unambiguous naming - Uniform Resource Identifiers (URI)
○ Resource Description Framework (RDF)
■ <subject, predicate, object> triples or <namespace, s, p, o> quadruples
■ knowledge bases as queryable graphs - SPARQL
■ ontologies as shared data models - shared taxonomies of entities and their axioms
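To make the triple model concrete, here is a minimal sketch of an in-memory graph with wildcard matching. It is only loosely analogous to a SPARQL basic graph pattern; the Triple type, the Match function and all example.org URIs are assumptions made for illustration.

```go
package main

import "fmt"

// Triple is a <subject, predicate, object> statement. The identifiers are
// plain strings here; in RDF they would be URIs or literals.
type Triple struct {
	Subject, Predicate, Object string
}

// Match returns every triple that agrees with the pattern. An empty string
// in the pattern acts as a wildcard, loosely like a SPARQL variable.
func Match(graph []Triple, pattern Triple) []Triple {
	var out []Triple
	for _, t := range graph {
		if (pattern.Subject == "" || pattern.Subject == t.Subject) &&
			(pattern.Predicate == "" || pattern.Predicate == t.Predicate) &&
			(pattern.Object == "" || pattern.Object == t.Object) {
			out = append(out, t)
		}
	}
	return out
}

// exampleGraph builds a tiny, made-up knowledge base.
func exampleGraph() []Triple {
	return []Triple{
		{"http://example.org/dataset/sales", "http://example.org/ownedBy", "http://example.org/team/analytics"},
		{"http://example.org/dataset/sales", "http://example.org/derivedFrom", "http://example.org/dataset/orders"},
		{"http://example.org/dataset/orders", "http://example.org/ownedBy", "http://example.org/team/backend"},
	}
}

func main() {
	// Which resources have an owner?
	pattern := Triple{Predicate: "http://example.org/ownedBy"}
	for _, t := range Match(exampleGraph(), pattern) {
		fmt.Println(t.Subject, "->", t.Object)
	}
}
```

A quadruple would simply add a namespace (graph) field to the struct and to the pattern.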
Collection approaches
Push
● event-based - push to remote endpoint
(service or queue) for any change on
monitored data
● requires invasive access to every monitored resource (i.e. code changes)
Pull
● periodic/scheduled crawling of resources
● typically used in search engines - periodically visit a set of root pages and follow links from there (only read access required)
[Figure: push vs. pull collection - labels: KB, crawler, topic, resource, agent]
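The two approaches can be contrasted in a few lines. The sketch below is a toy model under stated assumptions: the KB is a map, the topic is a Go channel, and the names Event, KB, Consume and Crawl are invented for this example.

```go
package main

import "fmt"

// Event is a change notification that an agent publishes for a monitored
// resource.
type Event struct {
	Resource string
	Metadata string
}

// KB is a toy knowledge base: resource name -> latest known metadata.
type KB map[string]string

// Consume applies pushed events from the topic until it is closed (push).
func (kb KB) Consume(topic <-chan Event) {
	for ev := range topic {
		kb[ev.Resource] = ev.Metadata
	}
}

// Crawl reads every resource once and records what it finds (pull).
// A real crawler would run this on a schedule; only read access is needed.
func (kb KB) Crawl(resources map[string]string) {
	for name, meta := range resources {
		kb[name] = meta
	}
}

func main() {
	kb := KB{}

	// Push: an agent embedded in the resource publishes each change.
	topic := make(chan Event, 2)
	topic <- Event{Resource: "table_a", Metadata: "schema v1"}
	topic <- Event{Resource: "table_a", Metadata: "schema v2"}
	close(topic)
	kb.Consume(topic)

	// Pull: a crawler visits the resources and reads their current state.
	kb.Crawl(map[string]string{"table_b": "schema v1"})

	fmt.Println(kb["table_a"], "|", kb["table_b"])
}
```

The push side sees every intermediate change but needs an agent inside each resource; the pull side only sees the state at crawl time.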
ML Metadata: why?
Uber’s journey towards better Data culture
○ Problems: data duplication (different solutions to similar problems), discovery issues (no shared
specification), disconnected tools (no downstream usage tracking), logging inconsistencies, lack of
process (common practices), lack of ownership and SLAs (accountability and quality)
○ Solutions: Ownership, quality monitoring & SLAs, unified processes and tools
■ Data Annotation - according to a shared data model
■ i) static info (ownership, lineage - related pipelines, code, tier), ii) usage (audit information, especially on operations modifying the data), iii) quality (available tests, provided metrics and SLAs), iv) cost (resources needed to (re)compute the data), v) references to open issues and bugs
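A shared data model for such annotations could look roughly like the struct below. This is an illustrative sketch only: every type and field name is an assumption, not taken from Uber's or Mastro's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation groups the five kinds of metadata listed above.
// All names are hypothetical.
type Annotation struct {
	Static  StaticInfo `json:"static"`
	Usage   []Access   `json:"usage"`
	Quality Quality    `json:"quality"`
	Cost    Cost       `json:"cost"`
	Issues  []string   `json:"issues"` // references to open issues and bugs
}

// StaticInfo covers ownership and lineage.
type StaticInfo struct {
	Owner     string   `json:"owner"`
	Tier      int      `json:"tier"`
	Pipelines []string `json:"pipelines"`
}

// Access is one audit record.
type Access struct {
	User     string `json:"user"`
	Modifies bool   `json:"modifies"`
}

// Quality lists available tests and agreed SLAs.
type Quality struct {
	Tests []string          `json:"tests"`
	SLAs  map[string]string `json:"slas"`
}

// Cost approximates the resources needed to (re)compute the data.
type Cost struct {
	CPUHours float64 `json:"cpu_hours"`
}

// Writers returns the users whose recorded accesses modified the data.
func (a Annotation) Writers() []string {
	var users []string
	for _, u := range a.Usage {
		if u.Modifies {
			users = append(users, u.User)
		}
	}
	return users
}

func main() {
	a := Annotation{
		Static: StaticInfo{Owner: "team-analytics", Tier: 1, Pipelines: []string{"daily_sales_etl"}},
		Usage:  []Access{{User: "alice", Modifies: true}, {User: "bob"}},
		Issues: []string{"BUG-42"},
	}
	out, err := json.MarshalIndent(a, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("writers:", a.Writers())
}
```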
ML Metadata: what?
● Catalogue - inventory of data assets
○ asset annotation, discovery and self-service data access (easier interaction across teams and projects)
○ versioning and lineage control (ownership?)
● Metrics Stores - data quality assurance
○ data profiling - extraction of statistics and rules from monitored data (train phase?!)
○ metrics calculation - compute statistics on incoming data and evaluate them against the rules (predict phase?!)
○ validation - monitoring/alerting on data drift
● Feature stores - metadata of processed data
○ versioning of processed data
○ online serving - decoupling use cases from processing
● Experiment Tracking & Model Registry - metadata of experiments and their results
○ focus on repeatability and model interoperability (across various libraries and technologies)
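The profiling / metrics calculation / validation cycle of a metrics store can be sketched as follows. This is a deliberately naive illustration: the Profile type, the mean-shift rule and the threshold k are assumptions, and the inputs are assumed to be non-empty.

```go
package main

import (
	"fmt"
	"math"
)

// Profile holds statistics extracted from reference data (profiling).
type Profile struct {
	Mean, Std float64
}

// NewProfile computes the mean and population standard deviation of xs.
// xs is assumed to be non-empty.
func NewProfile(xs []float64) Profile {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / float64(len(xs))
	var sq float64
	for _, x := range xs {
		sq += (x - mean) * (x - mean)
	}
	return Profile{Mean: mean, Std: math.Sqrt(sq / float64(len(xs)))}
}

// Drifted computes the mean of the incoming batch (metrics calculation) and
// reports whether it lies more than k reference standard deviations away
// from the reference mean (validation).
func (p Profile) Drifted(batch []float64, k float64) bool {
	return math.Abs(NewProfile(batch).Mean-p.Mean) > k*p.Std
}

func main() {
	ref := NewProfile([]float64{10, 12, 11, 9, 13})
	fmt.Println(ref.Drifted([]float64{11, 12, 10}, 3)) // batch close to the reference
	fmt.Println(ref.Drifted([]float64{20, 21, 19}, 3)) // batch far from the reference
}
```

Real tools use richer statistics and rules; the point here is only the split between a profile learned once and a check applied to each new batch.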
Related work
● Data Catalogues
○ Apache Atlas (mainly hadoop-related tech)
○ Lyft Amundsen, Uber Databook, LinkedIn DataHub, Netflix Metacat, Airbnb Dataportal
● Quality Metrics Stores
○ (old!) Apache Griffin, AWS Deequ, great_expectations, TensorFlow Data Validation
● Feature Stores - https://www.featurestore.org/
○ Feast (go), SageMaker Feature Store, many more..
● Experiment tracking & model registries - https://mlops.toys/
○ MLflow, BentoML, Seldon, Evidently AI, many more..
○ the first two cover both tracking and serving, the latter two serving and model monitoring (a very diverse landscape!)
Mastro
Metadata management in Go
Data Assets
https://github.com/data-mill-cloud/mastro/blob/master/commons/abstract/asset.go
Connectors
elastic, hdfs, hive, impala, mongo, redis, s3
Connector init from DataSourceDefinition (Yaml)
https://github.com/data-mill-cloud/mastro/tree/master/commons/sources
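One common way to initialise a connector from a declarative definition is a registry of constructors keyed by type. The sketch below illustrates that pattern only; the DataSourceDefinition fields, the Connector interface and the "bucket" setting are assumptions and do not reproduce Mastro's actual types. YAML decoding is left out to stay within the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

// DataSourceDefinition mirrors what a YAML config could decode into.
// The field names are assumptions, not Mastro's actual schema.
type DataSourceDefinition struct {
	Name     string
	Type     string
	Settings map[string]string
}

// Connector is a hypothetical minimal interface every backend implements.
type Connector interface {
	Init(def DataSourceDefinition) error
}

// s3Connector is a stand-in that only validates its settings.
type s3Connector struct{ bucket string }

func (c *s3Connector) Init(def DataSourceDefinition) error {
	b, ok := def.Settings["bucket"]
	if !ok {
		return errors.New("s3: missing bucket setting")
	}
	c.bucket = b
	return nil
}

// factories maps a type name from the config to a constructor.
var factories = map[string]func() Connector{
	"s3": func() Connector { return &s3Connector{} },
}

// NewConnector picks the constructor for def.Type and initialises it.
func NewConnector(def DataSourceDefinition) (Connector, error) {
	f, ok := factories[def.Type]
	if !ok {
		return nil, fmt.Errorf("unknown connector type %q", def.Type)
	}
	c := f()
	if err := c.Init(def); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	def := DataSourceDefinition{
		Name:     "lake",
		Type:     "s3",
		Settings: map[string]string{"bucket": "my-bucket"},
	}
	c, err := NewConnector(def)
	fmt.Println(c != nil, err)
}
```

Adding a backend then means registering one more entry in the map, without touching the callers.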
Catalogue Service
● Get Asset by Name or Tags
● Upsert 1 or multiple Assets
https://github.com/data-mill-cloud/mastro/blob/master/doc/CATALOGUE.md
● DAOs for various connectors
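The catalogue operations listed above (get by name, get by tags, upsert of one or more assets) can be mimicked with an in-memory map. This is a simplified sketch: the Asset type here is a hypothetical stand-in with two fields, and the real service sits behind an HTTP API with a DAO per connector.

```go
package main

import (
	"fmt"
	"sort"
)

// Asset is a simplified, hypothetical stand-in for a catalogue entry.
type Asset struct {
	Name string
	Tags []string
}

// Catalogue keeps assets in memory, keyed by name.
type Catalogue struct{ assets map[string]Asset }

// NewCatalogue returns an empty catalogue.
func NewCatalogue() *Catalogue { return &Catalogue{assets: map[string]Asset{}} }

// Upsert inserts new assets and overwrites existing ones with the same name.
func (c *Catalogue) Upsert(assets ...Asset) {
	for _, a := range assets {
		c.assets[a.Name] = a
	}
}

// GetByName returns the asset and whether it exists.
func (c *Catalogue) GetByName(name string) (Asset, bool) {
	a, ok := c.assets[name]
	return a, ok
}

// GetByTags returns the sorted names of assets carrying every requested tag.
func (c *Catalogue) GetByTags(tags ...string) []string {
	var names []string
	for _, a := range c.assets {
		if hasAll(a.Tags, tags) {
			names = append(names, a.Name)
		}
	}
	sort.Strings(names)
	return names
}

// hasAll reports whether have contains every element of want.
func hasAll(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, h := range have {
		set[h] = true
	}
	for _, w := range want {
		if !set[w] {
			return false
		}
	}
	return true
}

func main() {
	c := NewCatalogue()
	c.Upsert(
		Asset{Name: "sales", Tags: []string{"finance", "daily"}},
		Asset{Name: "orders", Tags: []string{"daily"}},
	)
	fmt.Println(c.GetByTags("daily"))
	fmt.Println(c.GetByTags("daily", "finance"))
}
```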
Feature Store
● A Feature specifies its data type, although Go already serializes primitive types correctly
● Get By Name
● Explicit human-readable versioning using Version
● Implicit versioning with InsertedAt (set when doing PUT)
● Features also pushed to catalogue
https://github.com/data-mill-cloud/mastro/blob/master/doc/FEATURESTORE.md
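The two versioning schemes can be illustrated with a small in-memory store. The sketch assumes a simplified Feature record; only the Version and InsertedAt names come from the slide, everything else (Store, Put, GetByName signatures, the injectable clock) is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Feature is a hypothetical, simplified feature record.
type Feature struct {
	Name       string
	Version    string      // explicit, human-readable version
	DataType   string      // declared type of Value
	Value      interface{} // primitive values serialize as-is
	InsertedAt time.Time   // implicit version, set by the store on Put
}

// Store keeps every inserted version of every feature.
type Store struct {
	now      func() time.Time // injectable clock
	features map[string][]Feature
}

// NewStore returns an empty store; a nil clock falls back to time.Now.
func NewStore(now func() time.Time) *Store {
	if now == nil {
		now = time.Now
	}
	return &Store{now: now, features: map[string][]Feature{}}
}

// Put stamps the feature with the current time and appends it.
func (s *Store) Put(f Feature) Feature {
	f.InsertedAt = s.now()
	s.features[f.Name] = append(s.features[f.Name], f)
	return f
}

// GetByName returns all stored versions of a feature, oldest first.
func (s *Store) GetByName(name string) []Feature {
	return s.features[name]
}

func main() {
	s := NewStore(nil)
	s.Put(Feature{Name: "avg_basket", Version: "v1", DataType: "float", Value: 23.5})
	s.Put(Feature{Name: "avg_basket", Version: "v2", DataType: "float", Value: 24.1})
	for _, f := range s.GetByName("avg_basket") {
		fmt.Println(f.Version, f.Value, f.InsertedAt.Format(time.RFC3339))
	}
}
```

Callers never set InsertedAt themselves, so two inserts with the same Version can still be told apart.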
Crawlers
● Any walkable file system or database can be crawled
● A filter on the filename is used to select only manifest files
● Scheduled reconcile loop
a. walk
b. bulk upsert to catalogue
https://github.com/data-mill-cloud/mastro/blob/master/doc/CRAWLERS.md
Available crawlers:
● s3
● hdfs
● hive
● impala
● local (volume)
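One iteration of the reconcile loop (walk, filter on the manifest filename, bulk upsert) can be sketched with the standard io/fs abstraction. The manifest filename, the Reconcile and FindManifests functions and the upsert callback are assumptions for illustration; scheduling and manifest parsing are omitted.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// manifestName is an assumed filename used by the filter in this sketch.
const manifestName = "MANIFEST.yaml"

// FindManifests walks the file system and returns the paths of manifest files.
func FindManifests(fsys fs.FS) ([]string, error) {
	var found []string
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && path.Base(p) == manifestName {
			found = append(found, p)
		}
		return nil
	})
	return found, err
}

// Reconcile performs one iteration of the loop: walk, then bulk upsert.
// upsert is a placeholder for a call to the catalogue.
func Reconcile(fsys fs.FS, upsert func(paths []string)) error {
	paths, err := FindManifests(fsys)
	if err != nil {
		return err
	}
	upsert(paths)
	return nil
}

// demoFS is an in-memory tree standing in for a volume or bucket.
func demoFS() fs.FS {
	return fstest.MapFS{
		"sales/MANIFEST.yaml":       {Data: []byte("name: sales")},
		"sales/part-0.parquet":      {Data: []byte("binary data")},
		"orders/2021/MANIFEST.yaml": {Data: []byte("name: orders")},
	}
}

func main() {
	err := Reconcile(demoFS(), func(paths []string) {
		fmt.Println("upserting", paths)
	})
	if err != nil {
		fmt.Println("crawl failed:", err)
	}
}
```

Because the walk only depends on fs.FS, the same loop applies to any backend that can be exposed as a walkable tree.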
UI
MVC - Mastro Version Control
● Motivation - bring data back to where it should be
○ on the file system, rather than in an awkward combination with git
○ alternative to dvc and pachyderm
○ gap between ml dataset versioning and versioning in DWH (hudi, delta, iceberg)
○ Merkle-tree-based integrity checks not available for the latter - too expensive for large datasets
● MVC
○ simple wrap to DFS clients (e.g. S3, HDFS)
○ manifest metadata file along with data files - same format that can be crawled by crawlers!
○ simple interface - same config of services (catalogue and featurestore)
■ set the config that specifies the data source - e.g. MVC_CONFIG=$PWD/conf/example_s3.yml
■ init - initialise a dataset at the destination (e.g. a bucket); creates a local manifest that can be filled in and is then uploaded
■ new - create a new version
■ add - add files to the current latest version; only the folder/file being added is hashed
■ delete - delete an entire version and its metadata
■ check - compute the hash sum of a local folder, to compare the downloaded data against the hash stored in the metadata
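The idea behind check can be illustrated with a flat checksum over a folder. This is a sketch under stated assumptions, not MVC's actual algorithm: the hashing scheme (SHA-256 over length-prefixed paths and contents), the HashTree and Check names and the sample tree are all made up for this example, and it is not a Merkle tree.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// HashTree returns a SHA-256 digest over every file path and content in fsys.
// fs.WalkDir visits entries in lexical order, so the digest is deterministic.
func HashTree(fsys fs.FS) (string, error) {
	h := sha256.New()
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		data, err := fs.ReadFile(fsys, p)
		if err != nil {
			return err
		}
		// Length-prefix path and content so boundaries stay unambiguous.
		fmt.Fprintf(h, "%d:%s%d:", len(p), p, len(data))
		h.Write(data)
		return nil
	})
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// Check compares a local tree against the checksum recorded in the metadata.
func Check(local fs.FS, recorded string) (bool, error) {
	sum, err := HashTree(local)
	if err != nil {
		return false, err
	}
	return sum == recorded, nil
}

// tree builds a one-file in-memory folder for the demo.
func tree(content string) fs.FS {
	return fstest.MapFS{"data/part-0.csv": {Data: []byte(content)}}
}

func main() {
	recorded, err := HashTree(tree("hello"))
	if err != nil {
		panic(err)
	}
	ok, err := Check(tree("hello"), recorded)
	fmt.Println(ok, err)
	ok, err = Check(tree("tampered"), recorded)
	fmt.Println(ok, err)
}
```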
Mastro
Metadata management in Go
Quickstart:
● docker compose (mongo+catalogue+fs+ui)
● k8s deployment
Thanks! that’s all folks!

Mastro

  • 1. Mastro Metadata management in Go Andrea Monacchi
  • 2. Agenda I. Metadata 1. what & why 2. collection approaches 3. ML Metadata: why? 4. ML Metadata: what? 5. Related work II. Mastro 1. Data Assets 2. Connectors 3. Catalogue Service 4. Feature Store 5. Crawlers 6. UI 7. MVC III. Quickstart
  • 3. What & Why ● Metadata: “data about other data” ○ main goal is to allow for indexing and retrieval of a resource ○ resources described in terms of attributes and relations to other resources ● e.g. Semantic Web ○ unambiguous naming - Uniform Resource Identifiers (URI) ○ Resource Description Framework (RDF) ■ <subject, predicate, object> triples or <namespace, s, p, o> quadruples ■ knowledge bases as querable graphs - SPARQL ■ ontologies as shared data models - shared taxonomies of entities and their axioms
  • 4. Collection approaches
    Push
    ● event-based - push to a remote endpoint (service or queue) on any change to monitored data
    ● needs invasive access to every monitored resource (requires code changes)
    Pull
    ● periodic/scheduled crawling of resources
    ● typically used in search engines - periodically visit a set of root pages and navigate links from there (only read access required)
    [Diagram labels: KB, crawler, topic, resource, agent]
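The two collection approaches can be contrasted with a small sketch. All names here (`Event`, `KB`, `consume`, `crawl`) are hypothetical: a Go channel stands in for the topic or queue, and a plain map stands in for both the monitored resources and the knowledge base.

```go
package main

import "fmt"

// Event describes a change on a monitored resource.
type Event struct {
	Resource, Change string
}

// KB is a toy knowledge base: resource name -> last known state.
type KB map[string]string

// consume implements the push model: agents publish events to a topic
// and the consumer applies them until the topic is closed.
func consume(topic <-chan Event, kb KB) {
	for ev := range topic {
		kb[ev.Resource] = ev.Change
	}
}

// crawl implements the pull model: the crawler reads the current state
// of every resource it can see. Read access is sufficient.
func crawl(resources map[string]string, kb KB) {
	for name, state := range resources {
		kb[name] = state
	}
}

func main() {
	kb := KB{}

	// Push: the monitored side must be changed to emit events.
	topic := make(chan Event, 2)
	topic <- Event{"table_a", "schema changed"}
	topic <- Event{"table_b", "rows appended"}
	close(topic)
	consume(topic, kb)

	// Pull: a scheduled job would call crawl periodically.
	crawl(map[string]string{"file_c": "present"}, kb)

	fmt.Println(kb)
}
```

The trade-off shows in who does the work: push needs code on the resource side but reports changes as they happen, while pull needs only read access but sees changes no sooner than the next crawl.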
  • 5. ML Metadata: why?
    Uber’s journey towards better Data culture
      ○ Problems: data duplication (different solutions to similar problems), discovery issues (no shared specification), disconnected tools (no downstream usage tracking), logging inconsistencies, lack of process (common practices), lack of ownership and SLAs (accountability and quality)
      ○ Solutions: Ownership, quality monitoring & SLAs, unified processes and tools
        ■ Data Annotation - according to a shared data model
        ■ i) static info (ownership, lineage - related pipelines, code, tier), ii) usage (audit information, especially those modifying the data), iii) quality (available tests and provided metrics and SLAs), iv) cost (resources needed to (re)compute those data), v) reference to open issues and bugs
  • 6. ML Metadata: what?
    ● Catalogue - inventory of data assets
      ○ asset annotation, discovery and self-service data access (easier interaction across teams and projects)
      ○ versioning and lineage control (ownership?)
    ● Metrics Stores - data quality assurance
      ○ data profiling - extraction of statistics and rules from monitored data (train phase?!)
      ○ metrics calculation - calculate statistics on incoming data and based on rules (predict phase?!)
      ○ validation - monitoring/alerting on data drift
    ● Feature stores - metadata of processed data
      ○ versioning of processed data
      ○ online serving - decoupling use cases from processing
    ● Experiment Tracking & Model Registry - metadata of experiments and their results
      ○ focus on repeatability and model interoperability (across various libraries and technologies)
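The profiling, metrics and validation steps of a metrics store can be sketched as follows. This is a deliberately naive illustration and not how any of the tools named in this deck work: the `Profile` type and both functions are hypothetical, and the only "rule" extracted is the observed value range.

```go
package main

import (
	"errors"
	"fmt"
)

// Profile holds statistics extracted from reference data (the "train phase").
type Profile struct {
	Min, Max, Mean float64
}

// BuildProfile computes min, max and mean of the reference data.
func BuildProfile(ref []float64) (Profile, error) {
	if len(ref) == 0 {
		return Profile{}, errors.New("empty reference data")
	}
	p := Profile{Min: ref[0], Max: ref[0]}
	sum := 0.0
	for _, v := range ref {
		if v < p.Min {
			p.Min = v
		}
		if v > p.Max {
			p.Max = v
		}
		sum += v
	}
	p.Mean = sum / float64(len(ref))
	return p, nil
}

// OutOfRange counts incoming values that violate the profiled range
// (the "predict phase"); a non-zero count could trigger an alert.
func OutOfRange(p Profile, batch []float64) int {
	n := 0
	for _, v := range batch {
		if v < p.Min || v > p.Max {
			n++
		}
	}
	return n
}

func main() {
	profile, err := BuildProfile([]float64{10, 12, 11, 13})
	if err != nil {
		panic(err)
	}
	incoming := []float64{11, 12, 40}
	fmt.Printf("profile=%+v violations=%d\n", profile, OutOfRange(profile, incoming))
}
```

Production tools use richer statistics and distribution tests to detect drift; the range check only makes the split between profiling and validation concrete.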
  • 7. Related work
    ● Data Catalogues
      ○ Apache Atlas (mainly Hadoop-related tech)
      ○ Lyft Amundsen, Uber Databook, LinkedIn DataHub, Netflix Metacat, Airbnb Dataportal
    ● Quality Metrics Stores
      ○ (old!) Apache Griffin, AWS deequ, great_expectations, TensorFlow Data Validation
    ● Feature Stores - https://www.featurestore.org/
      ○ Feast (go), SageMaker Feature Store, many more..
    ● Experiment tracking & model registries - https://mlops.toys/
      ○ MLflow, BentoML, Seldon, Evidently AI, many more..
      ○ first 2 do both tracking and serving, latter 2 do serving and model monitoring (very diverse!)
  • 10. Connectors
    ● elastic, hdfs, hive, impala, mongo, redis, s3
    ● Connector init from DataSourceDefinition (YAML)
    https://github.com/data-mill-cloud/mastro/tree/master/commons/sources
  • 11. Catalogue Service
    ● Get Asset by Name or Tags
    ● Upsert 1 or multiple Assets
    ● DAOs for various connectors
    https://github.com/data-mill-cloud/mastro/blob/master/doc/CATALOGUE.md
  • 12. Feature Store
    ● Each Feature specifies its type, but Go correctly serializes primitive ones
    ● Get By Name
    ● Explicit human-readable versioning using Version
    ● Implicit versioning with InsertedAt (set when doing PUT)
    ● Features also pushed to catalogue
    https://github.com/data-mill-cloud/mastro/blob/master/doc/FEATURESTORE.md
  • 13. Crawlers
    ● Any walkable file system or database can be crawled
    ● A filter on the filename is used to select only manifest files
    ● Scheduled reconcile loop
      a. walk
      b. bulk upsert to catalogue
    https://github.com/data-mill-cloud/mastro/blob/master/doc/CRAWLERS.md
    Available crawlers:
    ● s3
    ● hdfs
    ● hive
    ● impala
    ● local (volume)
  • 14. UI
  • 15. MVC - Mastro Version Control
    ● Motivation - bring data back to where it should be
      ○ file system rather than weird combination with git
      ○ alternative to dvc and pachyderm
      ○ gap between ML dataset versioning and versioning in DWH (hudi, delta, iceberg)
      ○ Merkle-tree-based integrity checks not available for the latter - too expensive for large datasets
    ● MVC
      ○ simple wrapper around DFS clients (e.g. S3, HDFS)
      ○ manifest metadata file along with data files - same format that can be crawled by crawlers!
      ○ simple interface - same config as the services (catalogue and featurestore)
        ■ set a config that specifies the data source - e.g. MVC_CONFIG=$PWD/conf/example_s3.yml
        ■ init - initialize a dataset at a destination (e.g. bucket); creates a local manifest that can be filled in and is then uploaded
        ■ new - create a new version
        ■ add - add files to the current latest version; only the folder/file being added is hashed
        ■ delete - delete an entire version and its metadata
        ■ check - compute the hash-sum of a local folder, to compare the downloaded data with the one recorded in the metadata
  • 16. Mastro - Metadata management in Go
    Quickstart:
    ● docker compose (mongo+catalogue+fs+ui)
    ● k8s deployment
    Thanks! That’s all folks!