This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with the Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
StreamSQL Feature Store (Apache Pulsar Summit) - Simba Khadder
Input features are the building blocks of machine learning models. You cannot have a great model without great features. By building on top of Apache Pulsar's infinite retention of events, we built infrastructure to serve features in production and to generate training datasets. It allowed our machine learning teams to change, test, and deploy personalization features at an extraordinary rate to tens of millions of end users.
This talk will discuss:
- What event-sourcing is and why it's so powerful for machine learning infrastructure.
- How we built the StreamSQL feature store on top of Pulsar, Flink, and Cassandra.
- How a feature store accelerates ML development.
TensorFlow Extended: An End-to-End Machine Learning Platform for TensorFlow - Databricks
As machine learning evolves from experimentation to serving production workloads, so does the need to effectively manage the end-to-end training and production workflow including model management, versioning, and serving. Clemens Mewald offers an overview of TensorFlow Extended (TFX), the end-to-end machine learning platform for TensorFlow that powers products across all of Alphabet. Many TFX components rely on the Beam SDK to define portable data processing workflows. This talk motivates the development of a Spark runner for Beam Python.
Managed Feature Store for Machine Learning - Logical Clocks
All hyperscale AI companies build their machine learning platforms around a Feature Store.
A feature is a measurable property of some data sample. It could be, for example, an image pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. A Feature Store is a central place to store curated features within an organization.
Feature Stores are fuel for AI systems: we use them to train machine learning models so that we can make predictions for feature values that we have never seen before.
During this presentation you will learn:
- About the concept of a Feature Store and how it can help manage feature data for enterprises and ease the path of data from backend systems and data lakes to data scientists.
- Our take on Feature Stores, including best practices and use cases:
- How to ensure consistent features in both training and serving
- Governance, access control, and versioning
- How to create training data in the file format of your choice
- How to eliminate inconsistency between features in training and inferencing
Watch the webinar with a demo: https://www.logicalclocks.com/webinars
Modern machine learning systems can be very complex, with many pitfalls: it is very easy to unintentionally introduce technical debt into such a complex structure. One approach to solving some of these anti-patterns is a feature store. A feature store is the missing piece that fills the gap between raw data and machine learning models. Not only does it help you handle technical debt but, even more importantly, it speeds up the time to develop new models.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure - Data Science Milan
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that each can run at different cadences.
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z... - Databricks
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, will demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the Cloud.
Attendees will also learn how to write a deep learning application that leverages Spark to train image recognition models at scale.
Hopsworks at Google AI Huddle, Sunnyvale - Jim Dowling
Hopsworks is a platform for designing and operating end-to-end machine learning pipelines using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store. Hopsworks is open-source.
ADF Gold Nuggets (Oracle Open World 2011) - Lucas Jellema
Gold Nuggets in ADF Faces
ADF Faces is a superior User Interface technology. Just look at Fusion Applications to confirm that statement, or look at one of the hundreds of ADF applications deployed around the world. ACE Directors Chris Muir and Lucas Jellema draw from their experiences on many of these applications to demonstrate a number of the most useful, productive, surprising, even amusing and sometimes quite obscure features in ADF Faces, to offer more insight into the richness of the ADF framework in general, and also to provide very concrete examples that will immediately help you add advanced functionality or benefit from increased productivity. Topics include task flows, push, desktop integration, event handling, reuse, change persistence, UI Shell and more.
Value: learn about tricks and useful features in ADF Faces that will enable attendees to enhance their ADF applications (in terms of visual richness and functionality), increase their productivity and improve their development process - and get inspired about ADF (Faces).
Matthieu Blanc will present spark.ml. Spark 1.2 introduced this new package, which provides a high-level API for building machine learning pipelines. We will walk through the basic concepts of this API with an example.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
CoFX is the framework behind time cockpit (http://www.timecockpit.com). Learn about the data model of CoFX and see how to use it to extend time cockpit.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ... - Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and TensorFlow, Keras and MXNet and PyTorch and Caffe2... a sea of deep learning tools, but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high-level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy-to-use tools is now available with the 'Analytics Zoo'. We'll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do.
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy - Jim Dowling
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
AutoML for Data Science Productivity and Toward Better Digital Decisions - Steven Gustafson
With the increased availability of both cloud computing and AI libraries arrives the opportunity to automatically search for, or optimize, machine learning algorithms. While this technology has been around for almost twenty years and has seen renewed interest lately, only recently has the computing power become widespread enough for a growing community of data scientists to take full advantage of it across many different types of opportunities. Because machine learning still remains a rather challenging discipline for most, I advocate for a more “assistive” approach to AutoML that helps the data scientist learn about different methods within the entire machine learning pipeline, as well as create a knowledge graph of results that can be further mined and explored to gain knowledge and connect with other individuals who are also searching for machine learning pipelines. In this talk, I will present an overview of the approach, published recently in IJCAI and AAAI, and provide new unpublished results demonstrating its effectiveness on public data sets.
Why Apache Flink is the 4G of Big Data Analytics Frameworks - Slim Baltagi
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more detail about:
- What Apache Flink is, how it fits into the Big Data ecosystem, and why it is the 4G (4th Generation) of Big Data analytics frameworks
- How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment
- Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark, and what the benchmarking results are between Apache Flink and those other Big Data analytics frameworks
Data lineage has gained popularity in the Machine Learning community as a way to make models and datasets easier to interpret and to help developers debug their ML pipelines by enabling them to go from a model to the dataset/user who trained it. Data provenance and lineage is the process of building up the history of how a data artifact came to be. This history of derivations and interactions can provide a better context for data discovery, debugging, as well as auditing. In this area, others, such as Google and Databricks, have made small steps.
In the Hopsworks approach presented here, provenance information is collected implicitly through unobtrusive instrumentation of Jupyter notebooks and Python code - what we call 'implicit provenance'.
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ... - Big Data Value Association
The main goal of the session is to showcase approaches that greatly simplify the work of a data analyst when performing data analytics, or when employing machine learning algorithms, over Big Data. The session will include presentations on
(a) How data analytics workflows can be easily and graphically composed, and then optimized for execution,
(b) How raw data with great variety can be easily queried using SQL interfaces, and
(c) How complex machine learning operations can be performed efficiently in distributed settings.
After these presentations, the speakers will participate in a discussion with the audience, in order to discuss further tools that could make the work of a data analyst more simple.
Log Data Analysis Platform by Valentin Kropov - SoftServe
Log Data Analysis Platform is a completely automated system to ingest, process and store huge amounts of log data, based on Flume, Spark, Hadoop, Impala, Hive, ElasticSearch and Kibana.
6th Session - Application areas in the research of advanced statistical tech... - Jürgen Ambrosi
In this session we will see, with the usual hands-on demo approach, how to use the R language to perform value-added analyses.
We will experience first-hand the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals.
In this session we will be joined by Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
How do you create an enterprise data lake for enterprise-wide information storage and sharing? The data lake concept, architecture principles, support for data science, and a review of some use cases.
SnappyData, the Spark Database. A unified cluster for streaming, transactions... - SnappyData
Apache Spark 2.0 offers many enhancements that make continuous analytics quite simple. In this talk, we will discuss many other things that you can do with your Apache Spark cluster. We explain how a deep integration of Apache Spark 2.0 and in-memory databases can bring you the best of both worlds! In particular, we discuss how to manage mutable data in Apache Spark, run consistent transactions at the same speed as state-of-the-art in-memory grids, build and use indexes for point lookups, and run 100x more analytics queries at in-memory speeds. No need to bridge multiple products or manage and tune multiple clusters. We explain how one can take regular Apache Spark SQL OLAP workloads and speed them up by up to 20x using optimizations in SnappyData.
We then walk through several use-case examples, including IoT scenarios, where one has to ingest streams from many sources, cleanse them, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries on this constantly growing data. Rather than stitching together multiple clusters as proposed in Lambda, we walk through a design where everything is achieved in a single, horizontally scalable Apache Spark 2.0 cluster. A design that is simpler, a lot more efficient, and lets you do everything from Machine Learning and Data Science to Transactions and Visual Analytics, all in one single cluster.
Why does big data always have to go through a pipeline? Multiple data copies; slow, complex and stale analytics? We present a unified analytics platform that brings streaming, transactions and ad-hoc OLAP-style interactive analytics to a single in-memory cluster based on Spark.
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap... - nadine39280
Discover the evolution of Apache Hudi within the open-source realm - a community and project pushing the boundaries of data lake possibilities. This presentation delves into Apache Hudi 1.0, a pivotal release reimagining its transactional database layer while honoring its foundational principles. Join us in this transformative journey!
Join the Apache Hudi Community
https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg.
Follow us on LinkedIn and Twitter
https://www.linkedin.com/company/apache-hudi/
https://twitter.com/apachehudi
Apache Flink Overview at SF Spark and Friends - Stephan Ewen
Introductory presentation for Apache Flink, with a bias towards the streaming data analysis features in Flink. Shown at the San Francisco Spark and Friends Meetup.
Swift Parallel Scripting for High-Performance Workflow - Daniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general, and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It has more recently been adapted to serve as a programming model for much finer-grain in-memory workflows on extreme-scale systems, where it can perform task rates in the millions to billions per second.
In this talk, we describe the state of Swift's implementation, present several Swift applications, and discuss ideas for the future evolution of the programming model on which it's based.
Similar to Metadata and Provenance for ML Pipelines with Hopsworks (20)
PyData Berlin 2023 - Mythical ML Pipeline - Jim Dowling
This talk is a mental map for building ML systems as ML Pipelines that are factored into Feature Pipelines, Training Pipelines, and Inference Pipelines.
Building Hopsworks, a cloud-native managed feature store for machine learning - Jim Dowling
Cloud Native London talk about the control layer of Hopsworks.ai and our choice of cloud native services. We built our own multi-tenant services as cloud native services, for the most part.
Hopsworks in the cloud (Berlin Buzzwords 2019) - Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs - Jim Dowling
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs, including AllReduce, Horovod, and how commodity GPU servers, such as DeepLearning11, will gain adoption.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I already have working for real.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. Finally, we had a lovely workshop where the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Leading Change: strategies and insights for effective change management
Metadata and Provenance for ML Pipelines with Hopsworks
1. PROVENANCE FOR MACHINE LEARNING PIPELINES
Dr. Jim Dowling (1,2)
Slides together with Alexandru A. Ormenisan (1,2) and Mahmoud Ismail (1,2)
KTH - Royal Institute of Technology (1)
Logical Clocks AB (2)
2. Growing Consensus on how to manage complexity of AI
[Diagram: the Data -> Model -> Prediction (φ(x)) flow an engineer must support, surrounded by the components to manage: Data Collection, Feature Engineering, Data Validation, Distributed Training, HyperParameter Tuning, Model Serving, A/B Testing, Monitoring, Pipeline Management, and Hardware Management.]
3. Growing Consensus on how to manage complexity of AI
[Same diagram, with the components grouped into an ML PLATFORM (TRAIN and SERVE) and a FEATURE STORE.]
4. What is provenance for ML Pipelines?
[Diagram of an ML Pipeline: Raw Data -> Feature engineering -> Features -> Training -> Models -> Serving.]
8. Feature Stores make existing Data Infrastructure available to Data Scientists and Online Apps
[Diagram: Raw Data and Event Data flow through Data Pipelines into the Data Lake, SQL data, and BI Platforms; Feature Pipelines then populate the Feature Store, which provides FEATURES FOR MODEL TRAINING, SERVES RT FEATURES TO ONLINE MODELS, and provides FEATURES FOR ANALYTICAL MODELS (BATCH).]
9. Feature Pipelines update the Feature Store (2 Databases!) with data from backend Platforms
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
[Diagram: feature pipelines arrive at different cadences - click features every 10 secs, CDC data every 30 secs, user profile updates every hour, featurized weblog data every day, and user-entered features in <2 secs. The Online Feature Store (SQL, <10ms) serves low-latency features to the Online App; the Offline Feature Store (SQL DW on S3/HDFS, TBs/PBs) serves high-latency features to Train and Batch Apps.]
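To make the dual-database pattern concrete, here is a minimal Python sketch. It is illustrative only: the OnlineStore and OfflineStore classes are hypothetical stand-ins for a low-latency key-value store (e.g. NDB) and a scalable historical store (e.g. Hive on S3/HDFS), not the Hopsworks API.

import pandas as pd

class OnlineStore:
    """Stand-in for a low-latency key-value store serving online apps."""
    def __init__(self):
        self._rows = {}
    def put(self, key, row):
        self._rows[key] = row       # upsert: keep only the latest feature values
    def get(self, key):
        return self._rows[key]      # a <10ms point lookup in a real store

class OfflineStore:
    """Stand-in for a scalable store of historical features for training."""
    def __init__(self, path):
        self.path = path
    def append(self, df):
        # a real offline store would append an immutable commit in a columnar format
        df.to_csv(self.path, index=False)

def feature_pipeline(raw, online, offline):
    # one pipeline writes the same engineered features to both stores
    df = raw.assign(clicks_per_min=raw["clicks"] / raw["minutes"])
    offline.append(df)                           # high-latency features for training
    for _, row in df.iterrows():                 # low-latency features for serving
        online.put(row["user_id"], row.to_dict())

raw = pd.DataFrame({"user_id": [1, 2], "clicks": [30, 12], "minutes": [10, 6]})
online, offline = OnlineStore(), OfflineStore("/tmp/features.csv")
feature_pipeline(raw, online, offline)
print(online.get(1))   # the latest feature vector for user 1

The point is the single pipeline feeding two stores: the online store keeps only the latest values per key, while the offline store keeps the full history for training datasets.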
11. Feature Store Concepts in Hopsworks
[Diagram: two Feature Groups - the Titanic Passenger List (name, Pclass, Sex, Survive) and Passenger Bank Account (name, Balance) - are joined on the 'name' key into Train/Test Datasets, saved in a file format of choice (.tfrecord, .npy, .csv, .hdf5, .petastorm, etc) on storage of choice (GCS, Amazon S3, HopsFS).]
Features, FeatureGroups, and Train/Test Datasets are all versioned.
The FeatureGroup abstraction hides the complexity of dealing with 2 databases.
12. Example Ingestion of data into a FeatureGroup
https://docs.hopsworks.ai/

df = spark.read.json("s3://dataset/rain.json")

# do feature engineering on your dataframe
df = df.withColumn('precipitation', (df.val - min)/(max - min))

fg = fs.create_feature_group("rain",
                             version=1,
                             description="Rain features",
                             primary_key=['date', 'location_id'],
                             online_enabled=True)
fg.save(df)
13. Example Creation of Train/Test Data from a Feature Store
https://docs.hopsworks.ai/

# Join features across FeatureGroups. Use "on=[..]" to explicitly enter the JOIN key.
feature_join = (rain_fg.select_all()
                .join(temperature_fg.select_all(), on=["date", "location_id"])
                .join(location_fg.select_all()))

td = fs.create_training_dataset("training_dataset",
                                version=1,
                                data_format="tfrecords",
                                description="Training dataset, TfRecords format",
                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})

# The train/test/validation files are now saved to the filesystem (S3, HDFS, etc)
td.save(feature_join)

# Use the training data as follows:
df = td.read(split="train")
14. The end of the End-to-End ML Pipeline!
ML Pipelines start and stop at the Feature Store.
[Diagram: Raw Data and Event Data flow from the Data Lake through a Feature Pipeline into the FEATURE STORE, and from there to TRAIN/VALIDATE, MODEL SERVING, and MONITOR.]
15. End-to-End ML Pipelines on Hopsworks. Provenance is Collecting Metadata.
[Diagram: Feature Engineering reads from the Data Lake, Warehouse, and Kafka, validates features, and writes them to the Feature Store. Experiments/Development and Model Training consume features; models are deployed to the Model Registry and served on Kubernetes, with Prediction Logs and Monitoring Logs fed back. Code, configuration, artifacts (files), and artifact metadata live in HopsFS with scale-out metadata, which is synced to Elasticsearch.]
16. What is Metadata? Artifacts and Metadata in End-to-End ML Pipelines
Metadata is data that describes other data.
17. Artifacts and Metadata in End-to-End ML Pipelines
[Diagram: artifacts in a File System (S3, HopsFS, etc), metadata in a Metastore (Database).]
Provenance queries raise design questions:
● SQL or Free-Text or Graph?
● Update Throughput?
● Latency of queries?
● Size of Metadata?
19. 3 Mechanisms for Metadata Collection. Polyglot Metadata Storage for Efficient Querying.
[Diagram: metadata flows from File Systems, Databases, Data Warehouses, Message Buses, etc into the Metastore (Database) via three mechanisms: (1) a Crawler Job that Pulls via a (REST) API, (2) a Push / Change Data Capture (CDC) API, and (3) an Application or Job with an Instrumented API. The Metastore is replicated to a Graph DB and to Search (Elastic), behind a Metadata Query API.]
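As a rough sketch of the third mechanism (an instrumented API), the decorator below pushes one metadata event per pipeline step, recording which artifacts were read and written. All names here are hypothetical; this is not a real metadata library.

import functools
import json
import time

EVENTS = []  # stand-in for a metastore or message bus

def track(step_name):
    """Decorator that records which artifacts a pipeline step read and wrote."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(inputs, **kwargs):
            outputs = fn(inputs, **kwargs)
            EVENTS.append({                      # push one provenance event
                "step": step_name,
                "inputs": list(inputs),
                "outputs": list(outputs),
                "ts": time.time(),
            })
            return outputs
        return wrapper
    return decorator

@track("feature_engineering")
def engineer(inputs):
    return ["features/rain_v1"]                  # pretend output artifact

engineer(["s3://dataset/rain.json"])
print(json.dumps(EVENTS, indent=2))

The same event shape works for the push/CDC mechanism; the difference is only whether the storage system or the application emits the events.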
20. Artifacts and Metadata in End-to-End ML Pipelines
[Diagram: the File System (S3, HopsFS, etc) and the Metastore are separate systems, so synchronization between them raises consistency issues.]
21. What is Metadata, Revisited? Artifacts and Metadata in End-to-End ML Pipelines
Metadata is data that describes other data.
Unspoken Assumption: why are Data and Metadata always separate stores?
22. Artifacts and Metadata in End-to-End ML Pipelines
[Diagram: the Artifacts - Raw Data, Features, Experiments (Progs, Logs, Checkpoints), and Models - live in the File System (S3, HopsFS, etc). Their Metadata - Experiments (HParams, Env, Results, Graphs), Feature Stats (Min, Max, Std, Mean, Distrib.), Governance (Privileges, Audit, Retention, etc), and Model Desc (Privileges, Perf, Provenance, etc) - lives in the Metastore (Database).]
23. Mechanism 4: Artifacts and Metadata in the same system - a Unified Metadata Layer (Hopsworks)
[Diagram: the same artifacts - Raw Data, Features, Experiments (Progs, Logs, Checkpoints), and Models - live in HopsFS, and their metadata - Experiments (HParams, Env, Results, Graphs), Feature Stats (Min, Max, Std, Mean, Distrib.), Governance (Privileges, Audit, Retention, etc), and Model Desc (Privileges, Perf, Provenance, etc) - lives in the Metastore (NDB), where each metadata table extends the file system's own metadata.]
24. Mechanism 4: Implicit Provenance
[Diagram: a stack of Libraries, Application, Data platform, and Metadata store, instrumented at different levels.]
Explicit: top-down tracking of provenance. Push/Pull, CDC, or instrumented application or library code. Standalone Metadata Store.
Implicit: bottom-up tracking of provenance. Requires redesigning the platform. Conventions link files to artifacts. Metadata is strongly consistent with the storage platform.
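A minimal sketch of the implicit, bottom-up idea, assuming we are allowed to instrument the platform's I/O entry point (here Python's built-in open, purely for illustration): the pipeline code contains no tracking calls, yet its file accesses are captured.

import builtins

PROVENANCE = []                 # stand-in for the metadata store
_real_open = builtins.open

def _tracked_open(path, mode="r", *args, **kwargs):
    op = "write" if any(m in mode for m in "wxa") else "read"
    PROVENANCE.append({"op": op, "path": str(path)})   # record the access
    return _real_open(path, mode, *args, **kwargs)

builtins.open = _tracked_open   # instrument the platform, not the pipeline

# Unmodified "pipeline" code - its file accesses are captured implicitly.
with open("/tmp/model.txt", "w") as f:
    f.write("weights")
with open("/tmp/model.txt") as f:
    _ = f.read()

builtins.open = _real_open
print(PROVENANCE)   # [{'op': 'write', ...}, {'op': 'read', ...}]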
25. ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
Mahmoud Ismail (1), Mikael Ronström (2), Seif Haridi (1), Jim Dowling (1)
(1) KTH - Royal Institute of Technology, (2) Oracle
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE/ACM CCGrid 2019), May 15th
Tightly coupled Metadata and Data - replicating Metadata to External Systems
[Diagram: HopsFS holds artifacts (files) and artifact metadata in its scale-out metadata layer, which is synced to Elasticsearch.]
26. What is HopsFS?
● Highly scalable next-generation distribution of HDFS
[Diagram: a client issues 'mkdir /Images' and 'write /Images/cat.png'; the namespace (/ -> Images -> cat.png) lives in the metadata layer, and the file's blocks are placed on datanodes DN1, DN3, DN5 out of DN1-DN5.]
28. What is HopsFS?
● Drop-in replacement distribution of HDFS
● 16X - 37X the throughput of HDFS
● 37X larger clusters than HDFS
● 10 times lower latency
33. ePipe
● ePipe is a databus that provides replicated metadata as a service for HopsFS
● ePipe internally
• creates a consistent and correctly ordered change stream for HopsFS metadata
• and delivers the change stream to consumers with low latency (sub-second, near real-time)
34. ePipe: Design Decisions
● Extend HopsFS with a logging table to log file system changes
● Leverage the NDB events API to live-stream changes on the logging table to ePipe
● ePipe enriches the file system events with the appropriate data and publishes the enriched events to the consumers
35. Inodes table and logging table updated in the same Transaction to ensure Consistency/Integrity
[Diagram: HopsFS Namenodes execute Create /f1, Create /f2, Delete /f2, Delete /f1 against NDB. The inodes table (inodeID, name, parentID) holds rows (1, /, 0), (2, f1, 1), (3, f2, 1), while the logging table (name, operation) accumulates (f1, CREATE), (f2, CREATE), (f2, DELETE), (f1, DELETE); both tables are updated in the same transaction.]
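The following sketch shows the same-transaction idea with SQLite standing in for NDB (illustrative only): each mutation of the inodes table and its log entry commit atomically, so the log can never disagree with the file system metadata.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE inodes  (inodeID INTEGER PRIMARY KEY, name TEXT, parentID INTEGER);
    CREATE TABLE logging (name TEXT, operation TEXT);
    INSERT INTO inodes VALUES (1, '/', 0);
""")

def create_file(inode_id, name, parent_id):
    with db:  # one transaction: both inserts commit, or neither does
        db.execute("INSERT INTO inodes VALUES (?, ?, ?)", (inode_id, name, parent_id))
        db.execute("INSERT INTO logging VALUES (?, 'CREATE')", (name,))

def delete_file(name):
    with db:
        db.execute("DELETE FROM inodes WHERE name = ?", (name,))
        db.execute("INSERT INTO logging VALUES (?, 'DELETE')", (name,))

create_file(2, "f1", 1)
create_file(3, "f2", 1)
delete_file("f2")
print(db.execute("SELECT * FROM logging").fetchall())
# [('f1', 'CREATE'), ('f2', 'CREATE'), ('f2', 'DELETE')]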
37. Ordering of Log Entries
[Diagram: HopsFS applies Create f1, Append f1, Create f2, Delete f1, Delete f2 to NDB, and ePipe receives the changes batched into epochs (~100 ms): Epoch1, Epoch2, Epoch3.]
Order across epochs is known: Delete f1 after Create f1, Delete f2 after Create f2, ...
Order within an epoch is not: Create f1 ?? Append f1, Create f2 ?? Delete f1 - potential inconsistencies.
38. NDB Ordering Properties
● Property 1: The epochs are totally ordered.
● Property 2: The changes within the same transaction happen in the same epoch.
● Property 3: The changes on files are ordered only if they are in different epochs, that is, no ordering is guaranteed within the same epoch.
39. Strengthening NDB Ordering Properties
We introduced a version number per inode, which we increment whenever a change occurs to an inode.
[Diagram: the change stream now carries versions - (Create f1, 1), (Append f1, 2), (Create f2, 1), (Delete f2, 2), (Delete f1, 3) - so ePipe can order operations on the same file within an epoch: Append f1 after Create f1, while Create f2 ?? Delete f1 (different files) remains unordered.]
40. ePipe Ordering Properties
● Properties 1, 2 & 3 (inherited from NDB)
● Properties 4 & 5: The version number ensures the serializability of the changes on the same file/directory within epochs.
● Property 6: The order of changes for different files/directories within the same epoch doesn't matter.
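A small illustration of how a consumer exploits these properties: sorting events by (epoch, inode, version) totally orders the operations on each inode, while operations on different inodes within an epoch stay unordered relative to each other, which Property 6 says is fine. The event tuples below are made up.

events = [  # (epoch, inode, version, op) as delivered, unordered within epochs
    (1, "f2", 1, "CREATE"),
    (1, "f1", 2, "APPEND"),
    (1, "f1", 1, "CREATE"),
    (2, "f2", 2, "DELETE"),
    (2, "f1", 3, "DELETE"),
]

for epoch, inode, version, op in sorted(events):
    print(f"epoch={epoch} {op} {inode} (v{version})")
# Within epoch 1, f1's CREATE (v1) is now guaranteed to precede its APPEND (v2).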
45. More about ePipe
● Supports failure recovery thanks to the persistent logging table
• The log entries are deleted only once the associated events are successfully replicated to the downstream consumers.
• At-least-once delivery semantics.
● Pluggable architecture
• For example, filter events based on file name or any other attribute.
● Not limited to HopsFS
• Can be extended to watch other logging tables for different purposes.
46. ePipe Properties
● A databus that provides replicated metadata as a service for HopsFS
● Low overhead on HopsFS
● Low replication lag (sub-second)
● High throughput
● Pluggable architecture
47. What is provenance - ML Pipeline
[Diagram of an ML Pipeline, as before: Raw Data -> Feature engineering -> Features -> Training -> Models -> Serving.]
48. MLFlow Metadata - Explicit API calls

def train(data_path, max_depth, min_child_weight, estimators, model_name):
    X_train, X_test, y_train, y_test = build_data(..)
    mlflow.set_tracking_uri("jdbc:mysql://username:password@host:3306/database")
    mlflow.set_experiment("My Experiment")
    with mlflow.start_run() as run:
        ...
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_child_weight", min_child_weight)
        mlflow.log_param("estimators", estimators)
        with open("test.txt", "w") as f:
            f.write("hello world!")
        mlflow.log_artifact("/full/path/to/test.txt")
        ...
        model.fit(X_train, y_train) # auto-logging
        ...
        mlflow.tensorflow.log_model(model, "tensorflow-model",
                                    registered_model_name=model_name)
49. Hopsworks Metadata - Implicit Metadata

def train(data_path, max_depth, min_child_weight, estimators):
    X_train, X_test, y_train, y_test = build_data(..)
    ...
    print("hello world") # monkeypatched - prints in notebook
    ...
    model.fit(X_train, y_train) # auto-logging
    ...
    # Saves model to "hopsfs://Projects/myProj/models/.."
    hops.export_model(model, "tensorflow", .., model_name)
    ...
    # maggy makes an API call to track this dict
    return {'accuracy': accuracy, 'loss': loss, 'diagram': 'diagram.png'}

from maggy import experiment
experiment.lagom(train, name="My Experiment", ...)
50. What is provenance - Metadata
[Diagram: the ML pipeline (Raw Data -> Feature engineering -> Features -> Training -> Models -> Serving), with pipeline code that explicitly records metadata tuples such as <fg_eng, raw_data, features>:]

In [ ]:
add(fg_eng, raw_data, features)
...
add(training, features, model)
51. Let the platform manage the metadata!
[Diagram: the same ML pipeline (Raw Data -> Feature engineering -> Features -> Training -> Models -> Serving) runs on the Distributed File System (HopsFS); ePipe (with ML Provenance) replicates its metadata to Full Text Search (Elastic).]
52. Systems Challenges - Operations
ML Artifacts (Features, Feature Metadata, Train/Test Datasets, Models, Model Metadata) may consist of thousands of files in the Distributed File System, and so generate thousands of file system operations. Change Data Capture (CDC) must capture only the relevant operations.
53. More context for file system operations?
[Diagram: with only user tags (user: John, user: Alex) on file system operations, are any of these operations related? Certificates (with an AppId) attached to each FS operation add the application context (user: John, app1; user: Alex, app2; user: John, app3), giving an order of operations and richer provenance information.]
54. Richer provenance information
[Diagram: the Distributed File System records Read/Write/Create/Delete/XAttr/Metadata operations. Additional context is layered on top: the Resource Manager (Yarn) adds the Application Context (Application X), the Job Manager (Hopsworks) adds the Job Context, and the Workflow Manager (Airflow) adds the Pipeline Context. This links input/output files via apps, different executions of the same job, and jobs as stages of the same pipeline.]
Each operation is recorded as the tuple <file, op, user_id, app_id, job_id, pipeline_id>.
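A minimal sketch (hypothetical names, not the Hopsworks implementation) of the enriched provenance tuple, and of how the application context lets us walk from a model back to the files it was trained on:

from dataclasses import dataclass

@dataclass(frozen=True)
class ProvOp:
    file: str
    op: str           # READ / WRITE / CREATE / DELETE / XATTR
    user_id: str
    app_id: str       # from the resource manager (e.g. Yarn)
    job_id: str       # from the job manager (e.g. Hopsworks)
    pipeline_id: str  # from the workflow manager (e.g. Airflow)

ops = [
    ProvOp("Features/rain_v1", "READ",  "john", "app3", "train_job", "p1"),
    ProvOp("Models/ResNet",    "WRITE", "john", "app3", "train_job", "p1"),
]

def inputs_of(output_file, ops):
    # link a written artifact to everything read by the same application run
    apps = {o.app_id for o in ops if o.file == output_file and o.op == "WRITE"}
    return [o.file for o in ops if o.app_id in apps and o.op == "READ"]

print(inputs_of("Models/ResNet", ops))   # ['Features/rain_v1']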
57. CDC API - Filtering Mechanisms
● Path based filtering
● Tag based filtering
Example: custom metadata based on HDFS XAttrs, e.g. tags <tutorial>, <debug>. Tags can enable logging of all operations if path based filtering is not easy to set.
58. CDC API - Filtering Mechanisms
● Path based filtering
● Tag based filtering
● Coalesce FS Operations
Example: Read file1, Read file2, ..., Read filen on a Training Dataset coalesce into a single event: Access1.
59. Optimization - FS Operation Coalesce
File operations on an artifact's directory coalesce into artifact-level events:
Parent Create -> Artifact Create
Parent Delete -> Artifact Delete
Children Read -> Artifact Access
Children Create/Delete/Append/Truncate -> Artifact Mutation
[Diagram: Namenodes write the log table (with duplicates) to NDB; a cache per namenode removes duplicates before ePipe consumes the log.]

In [ ]:
hops.load_training_dataset(
    "/Projects/LC/Training_Datasets/ImageNet")
...
hops.save_model("/Projects/LC/Models/ResNet")
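A rough sketch of the coalescing step (illustrative, not the ePipe implementation): operations on children of an artifact directory map to one artifact-level event, and duplicates are dropped, so a thousand file reads under a training dataset become a single artifact access.

from pathlib import PurePosixPath

ARTIFACT_ROOTS = ("/Projects/LC/Training_Datasets", "/Projects/LC/Models")

def artifact_of(path):
    # the artifact is the first directory under a known artifact root
    p = PurePosixPath(path)
    for root in ARTIFACT_ROOTS:
        if str(p).startswith(root + "/"):
            return str(PurePosixPath(root) / p.relative_to(root).parts[0])
    return None

def coalesce(ops):
    """Map (path, op) pairs to deduplicated (artifact, event) pairs."""
    events, seen = [], set()
    for path, op in ops:
        artifact = artifact_of(path)
        if artifact is None:
            continue                         # path-based filtering
        event = (artifact, "ACCESS" if op == "READ" else "MUTATION")
        if event not in seen:                # deduplication (the per-namenode cache)
            seen.add(event)
            events.append(event)
    return events

ops = [(f"/Projects/LC/Training_Datasets/ImageNet/part-{i}", "READ") for i in range(1000)]
print(coalesce(ops))   # [('/Projects/LC/Training_Datasets/ImageNet', 'ACCESS')]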
60. CDC API - Filtering Mechanisms
● Path based filtering
● Tag based filtering
● Coalesce FS Operations
● Filtered Operations

Filesystem Op                 Metadata Stored
Create/Delete                 Artifact existence
XAttr                         Add metadata to artifact
Read                          Artifact used by ..
Children Files Create/Delete  Artifact mutation
Append/Truncate               Artifact mutation
Permissions/ACL               Artifact metadata mutation
61. Hopsworks ML Pipelines
[Diagram: (1) Data -> (2) Develop/Test Feature Pipelines -> Feature Store (versioned commits: Commit-0001, Commit-0002, ..., Commit-0097), managed by a DataOps CI/CD Platform -> (3) Develop Model -> (4) Train/Validate Model (Model Training & Model Validation) -> Model Repository -> (5) Deploy/Monitor (Model Serving & Monitoring), managed by an MLOps CI/CD Platform. Every stage feeds the Metadata Store via CDC events and API calls.]
64. Summary
● Provenance improves understanding of complex ML Pipelines.
● Provenance should not change the core ML pipeline code.
● Provenance facilitates Debugging, Analyzing, Automating and Cleaning of ML Pipelines.
● Provenance and Time Travel facilitate reproducibility of experiments.
● In Hopsworks, we introduced a new mechanism for provenance based on embedded metadata in a scale-out consistent metadata layer.
65. References
● Ormenisan et al, Time-Travel and Provenance for ML Pipelines, USENIX OpML 2020
● Niazi et al, HopsFS, USENIX FAST 2017
● Ismail et al, ePipe, CCGrid 2019
● Small Files in HopsFS, ACM Middleware 2018
● Ismail et al, HopsFS-S3, ACM Middleware 2020
● Meister et al, Oblivious Training Functions, 2020
● Hopsworks