The document discusses Cloudera's enterprise data cloud platform. It notes that data management is spread across multiple cloud and on-premises environments. The platform aims to provide an integrated data lifecycle that is easier to use, manage and secure across various business use cases. Key components include environments, data lakes, data hub clusters, analytic experiences, and a central control plane for management. The platform offers both traditional and container-based consumption options to provide flexibility across cloud, private cloud and on-premises deployment.
Analyzing StackExchange data with Azure Data Lake - BizTalk360
Big data is the new big thing, and storing the data is the easy part; gaining insights from your pile of data is something else entirely. Based on a data dump of the well-known StackExchange websites, we will store & analyse 150+ GB of data with Azure Data Lake Store & Analytics to gain some insights about their users. After that we will use Power BI to give an at-a-glance overview of our learnings.
If you are a developer who is interested in big data, this is your time to shine! We will use our existing SQL & C# skills to analyse everything without having to worry about running clusters.
Getting Started with Databricks SQL Analytics - Databricks
It has long been said that business intelligence needs a relational warehouse, but that view is changing. With the Lakehouse architecture being shouted from the rooftops, Databricks have released SQL Analytics, an alternative workspace for SQL-savvy users to interact with an analytics-tuned cluster. But how does it work? Where do you start? What does a typical Data Analyst’s user journey look like with the tool?
This session will introduce the new workspace and walk through the various key features – how you set up a SQL Endpoint, the query workspace, creating rich dashboards and connecting up BI tools such as Microsoft Power BI.
If you’re truly trying to create a Lakehouse experience that satisfies your SQL-loving Data Analysts, this is a tool you’ll need to be familiar with and include in your design patterns, and this session will set you on the right path.
Spark as a Service with Azure Databricks - Lace Lofranco
Presented at: Global Azure Bootcamp (Melbourne)
Participants will get a deep dive into one of Azure’s newest offerings: Azure Databricks, a fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure. In this session, we will go through Azure Databricks’ key collaboration features, cluster management, and tight data integration with Azure data sources. We’ll also walk through an end-to-end Recommendation System Data Pipeline built using Spark on Azure Databricks.
Delivering Insights from 20M+ Smart Homes with 500M+ Devices - Databricks
We started out processing big data using AWS S3, EMR clusters, and Athena to serve Analytics data extracts to Tableau BI.
However, as our data and team sizes increased, Avro schemas from source data evolved, and we attempted to serve analytics data through web apps, we hit a number of limitations in the AWS EMR and Glue/Athena approach.
This is a story of how we scaled out our data processing and boosted team productivity to meet our current demand for insights from 20M+ Smart Homes and 500M+ devices across the globe, from numerous internal business teams and our 150+ CSP partners.
We will describe lessons learnt and best practices established as we enabled our teams with Databricks autoscaling Job clusters and Notebooks and migrated our Avro/Parquet data to use the Metastore, SQL Endpoints and the SQLA Console, while charting the path to the Delta Lake…
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib). It has built-in integration with many data sources, a workflow scheduler, real-time workspace collaboration, and performance improvements over traditional Apache Spark.
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen - MS Cloud Summit
This document provides an overview and demonstration of Azure Data Lake Store and Azure Data Lake Analytics. The presenter discusses how Azure Data Lake can store and analyze large amounts of data in its native format. Key capabilities of Azure Data Lake Store like unlimited storage, security features, and support for any data type are highlighted. Azure Data Lake Analytics is presented as an elastic analytics service built on Apache YARN that can process large amounts of data. The U-SQL language for big data analytics is demonstrated, along with using Visual Studio and PowerShell for interacting with Azure Data Lake. The presentation concludes with a question and answer section.
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Accelerate Data Science Initiatives: Databricks & Privacera - Databricks
Accelerating Data Science Initiatives with Databricks’ Rapid SQL Analytics and Privacera’s Centralized Data Access Governance.
Databricks’ SQL Analytics helps data teams consolidate and simplify their data architectures. With SQL Analytics, data teams can perform BI and SQL workloads on the same multi-cloud lakehouse architecture enabling data scientists to perform advanced analytics on unstructured and large-scale data. This session will explore how Privacera’s advanced security, privacy, and governance capabilities seamlessly integrate with Databricks’ unified SQL Analytics approach to provide single pane visibility of data analytics from a centralized location. Attendees will learn how to:
Rapidly access data to run high-fidelity analytics
Implement a fully secure solution that ensures productivity, while controlling data access at fine-grained levels (row, column, and file)
Easily enable consistent access policies across all systems and applications
Support true data transparency across enterprises
Comply with stringent industry and privacy regulations like GDPR, LGPD, HIPAA, CCPA, PCI DSS, RTBF, and more with rich auditing and reporting
Using Redash for SQL Analytics on Databricks - Databricks
This talk gives a brief overview with a demo performing SQL analytics with Redash and Databricks. We will introduce some of the new features coming as part of our integration with Databricks following the acquisition earlier this year, along with a demo of the other Redash features that enable a productive SQL experience on top of Delta Lake.
Northwestern Mutual Journey – Transform BI Space to Cloud - Databricks
The volume of available data is growing by the second (to an estimated 175 zettabytes by 2025), and it is becoming increasingly granular in its information. With that change, every organization is moving toward building a data-driven culture. We at Northwestern Mutual share a similar story of driving toward data-driven decisions to improve both efficiency and effectiveness. Legacy system analysis revealed bottlenecks, excesses, duplications, etc. Based on the ever-growing need to analyze more data, our BI team decided to move to a more modern, scalable, cost-effective data platform. As a financial company, data security is as important as ingestion of data. In addition to fast ingestion and compute, we needed a solution that could support column-level encryption and role-based access for different teams to our data lake.
In this talk we describe our journey to move hundreds of ELT jobs from our current MSBI stack to Databricks and to build a data lake (using the Lakehouse pattern), and how we reduced our daily data load time from 7 hours to 2 hours while gaining the capability to ingest more data. We share our experience, challenges, learnings, architecture and design patterns from this huge migration effort, as well as the tools and frameworks our engineers built to ease the learning curve for engineers new to Apache Spark. You will leave this session with a better understanding of what migrating to Apache Spark/Databricks would mean for you and your organization.
Part 3 - Modern Data Warehouse with Azure Synapse - Nilesh Gule
Slide deck of the third part of building a Modern Data Warehouse using Azure. This session covered Azure Synapse, formerly SQL Data Warehouse. We look at the Azure Synapse architecture, external files, and integration with Azure Data Factory.
The recording of the session is available on YouTube
https://www.youtube.com/watch?v=LZlu6_rFzm8&WT.mc_id=DP-MVP-5003170
This document provides an overview of Azure Databricks, including:
- Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure cloud services. It includes Spark SQL, streaming, machine learning libraries, and integrates fully with Azure services.
- Clusters in Azure Databricks provide a unified platform for various analytics use cases. The workspace stores notebooks, libraries, dashboards, and folders. Notebooks provide a code environment with visualizations. Jobs and alerts can run and notify on notebooks.
- The Databricks File System (DBFS) stores files in Azure Blob storage in a distributed file system accessible from notebooks. Business intelligence tools can connect to Databricks clusters via JDBC.
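As a flavor of the notebook experience described above, here is a minimal sketch of reading a file from DBFS with Spark inside a Databricks notebook. The path is hypothetical, and `spark` and `display` are assumed to be the session and helper that Databricks notebooks provide by default.

```python
# Minimal sketch: read a CSV stored in DBFS and expose it to Spark SQL / BI tools.
# `spark` and `display` are provided by the Databricks notebook environment;
# the mount point below is hypothetical.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/raw/sales/2020/*.csv"))   # hypothetical mount point

df.createOrReplaceTempView("sales")             # queryable from Spark SQL and JDBC clients
display(spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region"))
```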
Big data requires a service that can orchestrate and operationalize processes to refine the enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
Modern DW Architecture
- The document discusses modern data warehouse architectures using Azure cloud services like Azure Data Lake, Azure Databricks, and Azure Synapse. It covers storage options like ADLS Gen 1 and Gen 2 and data processing tools like Databricks and Synapse. It highlights how to optimize architectures for cost and performance using features like auto-scaling, shutdown, and lifecycle management policies. Finally, it provides a demo of a sample end-to-end data pipeline.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... - Databricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Big data requires a service that can orchestrate and operationalize processes to refine the enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
Integration Monday - Analysing StackExchange data with Azure Data Lake - Tom Kerkhove
Big data is the new big thing, and storing the data is the easy part; gaining insights from your pile of data is something else entirely.
Based on a data dump of the well-known StackExchange websites, we will store & analyse 150+ GB of data with Azure Data Lake Store & Analytics to gain some insights about their users. After that we will use Power BI to give an at-a-glance overview of our learnings.
If you are a developer who is interested in big data, this is your time to shine! We will use our existing SQL & C# skills to analyse everything without having to worry about running clusters.
The document discusses Azure Data Factory V2 data flows. It will provide an introduction to Azure Data Factory, discuss data flows, and have attendees build a simple data flow to demonstrate how they work. The speaker will introduce Azure Data Factory and data flows, explain concepts like pipelines, linked services, and data flows, and guide a hands-on demo where attendees build a data flow to join customer data to postal district data to add matching postal towns.
The document discusses Azure Data Factory and its capabilities for cloud-first data integration and transformation. ADF allows orchestrating data movement and transforming data at scale across hybrid and multi-cloud environments using a visual, code-free interface. It provides serverless scalability without infrastructure to manage along with capabilities for lifting and running SQL Server Integration Services packages in Azure.
IBM Cloud Day January 2021 - A well architected data lake - Torsten Steinbach
- The document discusses an IBM Cloud Day 2021 event focused on well-architected data lakes. It provides an overview of two sessions on data lake architecture and building a cloud native data lake on IBM Cloud.
- It also summarizes the key capabilities organizations need from a data lake, including visualizing data, flexibility/accessibility, governance, and gaining insights. Cloud data lakes can address these needs for various roles.
Automating Data Quality Processes at Reckitt - Databricks
Reckitt is a fast-moving consumer goods company with a portfolio of famous brands and over 30k employees worldwide. At that scale, small projects can quickly grow into big datasets, and processing and cleaning all that data can become a challenge. To solve that challenge we have created a metadata-driven ETL framework for orchestrating data transformations through parametrised SQL scripts. It allows us to create various paths for our data as well as easily version control them. The approach of standardising incoming datasets and creating reusable SQL processes has proven to be a winning formula. It has helped simplify complicated landing/stage/merge processes and allowed them to be self-documenting.
But this is only half the battle, we also want to create data products. Documented, quality assured data sets that are intuitive to use. As we move to a CI/CD approach, increasing the frequency of deployments, the demand of keeping documentation and data quality assessments up to date becomes increasingly challenging. To solve this problem, we have expanded our ETL framework to include SQL processes that automate data quality activities. Using the Hive metastore as a starting point, we have leveraged this framework to automate the maintenance of a data dictionary and reduce documenting, model refinement, testing data quality and filtering out bad data to a box filling exercise. In this talk we discuss our approach to maintaining high quality data products and share examples of how we automate data quality processes.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control and indexing to your data lakes. We uncover how Delta Lake benefits you and why it matters. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which supports concurrent read/write operations and enables efficient insert, update, delete, and rollback capabilities. It allows background file optimization through compaction and Z-order partitioning, achieving better performance. In this presentation, we will learn about Delta Lake's benefits, how it solves common data lake challenges, and, most importantly, the new Delta Time Travel capability.
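To make the upsert and time travel capabilities mentioned above concrete, here is a minimal PySpark sketch, assuming the open-source delta-spark package is installed; the table path and column names are hypothetical.

```python
# Sketch: write a Delta table, MERGE new rows into it, then read an earlier version.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (SparkSession.builder.appName("delta-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/events_delta"                      # hypothetical location
spark.range(100).withColumnRenamed("id", "event_id").write.format("delta").save(path)

# Upsert (MERGE) with ACID guarantees
updates = spark.range(50, 150).withColumnRenamed("id", "event_id")
(DeltaTable.forPath(spark, path).alias("t")
 .merge(updates.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the table as it was before the MERGE
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())   # 100 rows in version 0, 150 in the current version
```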
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign... - Michael Rys
Presentation by James Baker and myself on running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage (ADLS) at Microsoft Ignite 2020. Covers the modern data warehouse architecture supported by Azure Synapse, integration benefits with ADLS, and features that reduce cost such as Query Acceleration, integration of Spark and SQL processing with shared metadata, and .NET for Apache Spark support.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... - Databricks
Columbia is a data-driven enterprise, integrating data from all line-of-business-systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Machine Learning Data Lineage with MLflow and Delta Lake - Databricks
This document discusses machine learning data lineage using Delta Lake. It introduces Richard Zang and Denny Lee, then outlines the machine learning lifecycle and challenges of model management. It describes how MLflow Model Registry can track model versions, stages, and metadata. It also discusses how Delta Lake allows data to be processed continuously and incrementally in a data lake. Delta Lake uses a transaction log and file format to provide ACID transactions and allow optimistic concurrency control for conflicts.
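A minimal sketch of the model registration and stage-transition flow described above, using the MLflow tracking and Model Registry APIs; the model name, metric, and training data are illustrative.

```python
# Sketch: train a model, log it to MLflow, register it, and promote the new version.
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# Register this run's model and move it through lifecycle stages
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", "iris_classifier")
MlflowClient().transition_model_version_stage(
    name="iris_classifier", version=result.version, stage="Staging")
```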
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat... - Microsoft Tech Community
In this session you will learn how to develop data pipelines in Azure Data Factory and build a cloud-based analytical solution, adopting modern data warehouse approaches with Azure SQL Data Warehouse and implementing incremental ETL orchestration at scale. With the multiple sources and types of data available in an enterprise today, Azure Data Factory enables full integration of data and direct storage in Azure SQL Data Warehouse for powerful, high-performance query workloads that drive a majority of enterprise applications and business intelligence applications.
Feature Store as a Data Foundation for Machine Learning - Provectus
This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts towards things like data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores including leveraging existing AWS services for storage, catalog, query, and more.
The Deagital offer will help improve the data quality of your Historian / Data Lake by functionally qualifying your data (timed measures). This document presents the offer content and why you should choose Deagital. Regards, José Torres, Deagital.
Architect’s Open-Source Guide for a Data Mesh Architecture - Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Data scientists and machine learning practitioners nowadays seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracy. They also use a variety of ML and DL frameworks & languages, and a typical organization may find that this results in a heterogeneous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available for real-time applications without significant latencies? Different techniques are needed for batch (offline) inference and instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Want to know more about the Common Data Model and Service? Do you need to understand the difference between CDS for Apps and CDS for Analytics? Feel free to use these slides and send me your feedback.
Data Con LA 2022 - Pre-Recorded - Simplifying AI/ML using Databricks feature... - Data Con LA
Debu Sinha, Sr Specialist Solutions Architect - AI/ML at Databricks
AI/ ML/ Data Science
1. What are feature stores? 2. Why are they important? 3. Using Databricks and the feature store offering to streamline ML. This holds true for small companies as well. How we frame our approach to AI initiatives will determine their success. Don't worry, I am not a zealot. I will not tell you AI and ML are the cure-all and will solve all your problems. Some tasks are particularly well suited to these techniques, but not all. What I love about them is the fact that they allow us to tackle difficult problems that might otherwise be too daunting.
Gobblin is a unified data ingestion framework developed by LinkedIn to ingest large volumes of data from diverse sources into Hadoop. It provides a scalable and fault-tolerant workflow that extracts data, applies transformations, checks for quality, and writes outputs. Gobblin addresses challenges of operating multiple heterogeneous data pipelines by standardizing various ingestion tasks and metadata handling through its pluggable components.
IBM InfoSphere Data Architect 9.1 - Francis Arnaudiès - IBMInfoSphereUGFR
The document discusses IBM InfoSphere Data Architect, a tool for modeling, relating, and standardizing diverse data assets. It can design and manage enterprise data models, enforce standards, leverage industry data models, and optimize existing investments. The tool is based on the Eclipse platform and allows various users like data architects, database developers, and administrators to be more productive. It provides logical, physical, and dimensional modeling capabilities as well as tools to define and enforce standards to increase quality and governance.
Building the Artificially Intelligent Enterprise - Databricks
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited and specializes in business intelligence/analytics and data management. He discusses building the artificially intelligent enterprise and transitioning to a self-learning enterprise. Some key challenges discussed include the siloed and fractured nature of current data and analytics efforts, with many tools and scripts in use without integration. He advocates sorting out the data foundation, implementing DataOps and MLOps, creating a data and analytics marketplace, and integrating analytics into business processes to drive value from AI.
Technically Speaking: How Self-Service Analytics Fosters Collaboration - Inside Analysis
This document summarizes an upcoming webinar series from Bloor Research Group on enterprise software and business intelligence technologies. The webinars will take place monthly from June to November, covering topics like intelligence, disruption, analytics, integration, databases, and cloud computing. Attendees can ask questions of presenters and get detailed analysis of innovative technologies. The webinars aim to reveal enterprise software characteristics and give vendors a chance to explain their products to analysts.
Conceptual vs. Logical vs. Physical Data Modeling - DATAVERSITY
A model is developed for a purpose. Understanding the strengths of each of the three Data Modeling types will prepare you with a more robust analyst toolkit. The program will describe modeling characteristics shared by each modeling type. Using the context of a reverse engineering exercise, delegates will be able to trace model components as they are used in a common data reengineering exercise that is also tied to a Data Governance exercise.
Learning objectives:
- Understanding the role played by models
- Differentiate appropriate use among conceptual, logical, and physical data models
- Understand the rigor of the round-trip data reengineering analyses
- Apply appropriate use of various Data Modeling types
Recent Gartner and Capgemini studies predict only around 25% of data science projects are successful and only around 15% make it to full-scale production. Of these, many degrade in performance and produce disappointing results within months of implementation. How can focusing on the desired business outcomes and business use cases throughout a data science project help overcome the odds?
Analytix Data Services provides data integration software and services. Its flagship product, AnalytiX Mapping Manager, is an enterprise solution for data mapping and metadata management. It helps customers accelerate project delivery through automating source-to-target mappings, enabling collaboration, and increasing productivity. The solution comprises modules for resource management, system management, and mapping specifications. It has over 50 customers worldwide.
Software engineering practices for the data science and machine learning life... - DataWorks Summit
With the advent of newer frameworks and toolkits, data scientists are now more productive than ever and starting to prove indispensable to enterprises. Typical organizations have large teams of data scientists who build out key analytics assets that are used on a daily basis and are an integral part of live transactions. However, quite a lot of chaos and complexity also gets introduced because of the state of the industry. Many packages used by data scientists are from open source, and even if they are well curated, there is a growing tendency to pick out the cutting-edge or unstable packages and frameworks to accelerate analytics. Different data scientists may use different versions of runtimes, different Python or R versions, or even different versions of the same packages. Data scientists predominantly work on their laptops, and it becomes difficult to reproduce their environments for use by others. Since data science is now a team sport across multiple personas, involving non-practitioners, traditional application developers, execs, and IT operators, how does an enterprise create a platform for productive cross-role collaboration?
Enterprises need a very reliable and repeatable process, especially when it results in something that affects their production environments. They also require a well managed approach that enables the graduation of an asset from development through a testing and staging process to production. Given the pace of businesses nowadays, the process needs to be quite agile and flexible too—even enabling an easy path to reversing a change. Compliance and audit processes require clear lineage and history as well as approval chains.
In the traditional software engineering world, this lifecycle has been well understood and best practices have been followed for ages. But what does it mean when you have non-programmers or users who are not really trained in software engineering philosophies, or who perceive all of this as "big process" roadblocks in their daily work? How do we engage them in a productive manner and yet support enterprise requirements for reliability, tracking, and a clear continuous integration and delivery practice? The presenters, in this session, will bring up interesting techniques based on their user research, real-life customer interviews, and productized best practices. The presenters also invite the audience to share their stories and best practices to make this a lively conversation.
Speaker
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Embarcadero® ER/Studio®, an industry-leading data modeling tool, helps companies discover, document, and re-use data assets. With round-trip database support, data architects have the power to easily reverse-engineer, analyze, and optimize existing databases. Productivity gains and enforcement of organizational standards can be achieved with ER/Studio’s strong collaboration capabilities.
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented What Data Do You Have and Where is it?
For more information on the services offered by Caserta Concepts, visit out website at http://casertaconcepts.com/.
Contexti / Oracle - Big Data: From Pilot to Production - Contexti
The document discusses challenges in moving big data projects from pilots to production. It highlights that pilots have loose SLAs and focus on a few use cases and demonstrated insights, while production requires enforced SLAs, supporting many use cases and delivering actionable insights. Key challenges in the transition include establishing governance, skills, funding models and integrating insights into operations. The document also provides examples of technology considerations and common operating models for big data analytics.
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E... - DATAVERSITY
Many data scientists are well grounded in creating accomplishment in the enterprise, but many come from outside – from academia, from PhD programs and research. They have the necessary technical skills, but it doesn’t count until their product gets to production and in use. The speaker recently helped a struggling data scientist understand his organization and how to create success in it. That turned into this presentation, because many new data scientists struggle with the complexities of an enterprise.
CA ERwin Modeling provides data modeling solutions to help reduce costs and increase ROI. Their next release, r8, will include improved visualization, customization, and productivity features. r8 is focused on balancing usability improvements and database support to increase user satisfaction and ROI. The data modeling market is seeing new players and a diversification in how solutions are used, with CA ERwin striving to scale their products to meet new requirements.
Building a New Platform for Customer Analytics - Caserta
Caserta Concepts and Databricks partner up to bring you this insightful webinar on how a business can choose from all of the emerging big data technologies to figure out which one best fits their needs.
Similar to Feature Store Overview - St. Louis Big Data IDEA Meetup, Aug 2020
Slides from the August 2021 St. Louis Big Data IDEA meeting from Sam Portillo. The presentation covers AWS EMR including comparisons to other similar projects and lessons learned. A recording is available in the comments for the meeting.
- Delta Lake is an open source project that provides ACID transactions, schema enforcement, and time travel capabilities to data stored in data lakes such as S3 and ADLS.
- It allows building a "Lakehouse" architecture where the same data can be used for both batch and streaming analytics.
- Key features include ACID transactions, scalable metadata handling, time travel to view past data states, schema enforcement, schema evolution, and change data capture for streaming inserts, updates and deletes.
Great Expectations is an open-source Python library that helps validate, document, and profile data to maintain quality. It allows users to define expectations about data that are used to validate new data and generate documentation. Key features include automated data profiling, predefined and custom validation rules, and scalability. It is used by companies like Vimeo and Heineken in their data pipelines. While helpful for testing data, it is not intended as a data cleaning or versioning tool. A demo shows how to initialize a project, validate sample taxi data, and view results.
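As a flavor of how such expectations look in code, here is a small sketch using Great Expectations' older pandas-backed API (`ge.from_pandas`); the columns and thresholds are illustrative rather than taken from the talk's demo.

```python
# Sketch: define a couple of expectations against a small DataFrame and validate it.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "passenger_count": [1, 2, 1, 4, None],
    "fare_amount": [7.5, 12.0, 5.25, 30.0, 9.75],
})

gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null("passenger_count")
gdf.expect_column_values_to_be_between("fare_amount", min_value=0, max_value=500)

results = gdf.validate()
print(results.success)   # False here: one passenger_count is null
```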
Automate your data flows with Apache NiFi - Adam Doyle
Apache Nifi is an open source dataflow platform that automates the flow of data between systems. It uses a flow-based programming model where data is routed through configurable "processors". Nifi was donated to the Apache Foundation by the NSA in 2014 and has over 285 processors to interact with data in various formats. It provides an easy to use UI and allows users to string together processors to move and transform data within "flowfiles" through the system in a secure manner while capturing detailed provenance data.
Apache Iceberg Presentation for the St. Louis Big Data IDEA - Adam Doyle
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with engines such as Hive and Spark.
Slides from the January 2021 St. Louis Big Data IDEA meeting by Tim Bytnar regarding using Docker containers for a localized Hadoop development cluster.
Operationalizing Data Science St. Louis Big Data IDEA - Adam Doyle
The document provides an overview of the key steps for operationalizing data science projects:
1) Identify the business goal and refine it into a question that can be answered with data science.
2) Acquire and explore relevant data from internal and external sources.
3) Cleanse, shape, and enrich the data for modeling.
4) Create models and features, test them, and check with subject matter experts.
5) Evaluate models and deploy the best one with ongoing monitoring, optimization, and explanation of results.
Slides from the December 2019 St. Louis Big Data IDEA meetup group. Jon Leek discussed how the St. Louis Regional Data Alliance ingests, stores, and reports on their data.
Tailoring machine learning practices to support prescriptive analytics - Adam Doyle
Slides from the November St. Louis Big Data IDEA. Anthony Melson talked about how to engineer machine learning practices to better support prescriptive analytics.
Synthesis of analytical methods data driven decision-making - Adam Doyle
This document summarizes Dr. Haitao Li's presentation on synthesizing analytical methods for data-driven decision making. It discusses the three pillars of analytics - descriptive, predictive, and prescriptive. Various data-driven decision support paradigms are presented, including using descriptive/predictive analytics to determine optimization model inputs, sensitivity analysis, integrated simulation-optimization, and stochastic programming. An application example of a project scheduling and resource allocation tool for complex construction projects is provided, with details on its optimization model and software architecture.
Data Engineering and the Data Science Lifecycle - Adam Doyle
Everyone wants to be a data scientist. Data modeling is the hottest thing since Tickle Me Elmo. But data scientists don’t work alone. They rely on data engineers to help with data acquisition and data shaping before their model can be developed. They rely on data engineers to deploy their model into production. Once the model is in production, the data engineer’s job isn’t done. The model must be monitored to make sure that it retains its predictive power. And when the model slips, the data engineer and the data scientist need to work together to correct it through retraining or remodeling.
Data engineering Stl Big Data IDEA user group - Adam Doyle
Modern day Data Engineering requires creating reliable data pipelines, architecting distributed systems, designing data stores, and preparing data for other teams.
We’ll describe a year in the life of a Data Engineer who is tasked with creating a streaming data pipeline and touch on the skills necessary to set one up using Apache Spark.
Slides from the April 2019 meeting of the St. Louis Big Data IDEA meetup.
Big Data Retrospective - STL Big Data IDEA Jan 2019 - Adam Doyle
Slides from the STL Big Data IDEA meeting from January 2019. The presenters discussed technologies to continue using, stop using, and start using in 2019.
Enhanced data collection methods can help uncover the true extent of child abuse and neglect. This includes Integrated Data Systems from various sources (e.g., schools, healthcare providers, social services) to identify patterns and potential cases of abuse and neglect.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization Hyperparameters, Regression Trees, Linear Support vector machines.
Did you know that drowning is a leading cause of unintentional death among young children? According to recent data, children aged 1-4 years are at the highest risk. Let's raise awareness and take steps to prevent these tragic incidents. Supervision, barriers around pools, and learning CPR can make a difference. Stay safe this summer!
Open Source Contributions to Postgres: The Basics POSETTE 2024 - ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of May 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
3. Features
“A feature is an individual measurable property or characteristic of a phenomenon being observed… Feature data is used both as input to models during training and when models are served in production.”
https://docs.feast.dev/user-guide/features
Key takeaways
• Features are not data
• Features enumerate information
• Not all features are equal
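To make the "features are not data" takeaway concrete, here is a small, hypothetical pandas example: raw click events are data, while the per-user count over a trailing window is a feature derived from them. The column names and dates are invented for illustration.

```python
# Sketch: derive a feature (clicks in the last 7 days per user) from raw event data.
import pandas as pd

events = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2],
    "event_ts": pd.to_datetime(
        ["2020-08-01", "2020-08-03", "2020-08-02", "2020-08-05", "2020-08-06"]),
})

cutoff = pd.Timestamp("2020-08-07") - pd.Timedelta(days=7)
feature = (events[events["event_ts"] >= cutoff]
           .groupby("user_id").size()
           .rename("clicks_last_7d")        # the measurable property fed to a model
           .reset_index())
print(feature)
```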
4. Feature Engineering
Feature Engineering is the process of extracting features from raw data.
Feature Engineering Techniques
• Imputation
• Handling Outliers
• Binning
• Numerical Transform
• One-Hot Encoding
• Grouping
• Extraction
• Scaling
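A brief sketch of several of the techniques listed above (imputation, one-hot encoding, scaling) applied with scikit-learn; the toy columns and values are illustrative.

```python
# Sketch: impute and scale numeric columns, one-hot encode a categorical column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

raw = pd.DataFrame({
    "age":     [34, None, 52, 23],
    "income":  [52000, 61000, None, 43000],
    "segment": ["gold", "silver", "gold", "bronze"],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = OneHotEncoder(handle_unknown="ignore")

features = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["segment"]),
]).fit_transform(raw)

print(features.shape)   # 4 rows, 2 scaled numeric + 3 one-hot columns
```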
5. Feature Challenges
• Feature Reuse Between Models
• Consistent Feature Definitions
• Latency / Recency
• Environmental Variation
• Unstable Dependencies
• Governance
• Versioning
6. Feature Store (architecture diagram)
Components shown: Feature Store API; Metadata / Model / Predictions store; Offline Data Store; Online Data Store; Batch Engine; Stream Engine; Batch Prediction; Stream Prediction.
7. Feature Store Use Cases
• Retrieve Feature Metadata
• Retrieve Feature Values
• Remove Features
• Store Features
• Stream Store Features
• Stream Retrieve Features
• Feature Versioning
• Model Versioning
• Record Predictions
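The use cases above can be pictured as a small client API. The following is a toy, in-memory sketch; the class and method names are invented for illustration and do not correspond to any specific feature store product.

```python
# Hypothetical, in-memory stand-in for a feature store client exercising a few use cases.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureStoreClient:
    _values: Dict[str, Dict[str, float]] = field(default_factory=dict)
    _metadata: Dict[str, dict] = field(default_factory=dict)
    _predictions: List[dict] = field(default_factory=list)

    def register_feature(self, name: str, description: str, version: int = 1) -> None:
        # "Retrieve Feature Metadata" reads from this registry
        self._metadata[name] = {"description": description, "version": version}

    def store_features(self, entity_id: str, features: Dict[str, float]) -> None:
        self._values.setdefault(entity_id, {}).update(features)

    def retrieve_feature_values(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        row = self._values.get(entity_id, {})
        return {n: row.get(n) for n in names}

    def record_prediction(self, entity_id: str, model: str, version: int, value: float) -> None:
        self._predictions.append(
            {"entity_id": entity_id, "model": model, "version": version, "value": value})


store = FeatureStoreClient()
store.register_feature("clicks_last_7d", "Click count over trailing 7 days")
store.store_features("user_1", {"clicks_last_7d": 12})
print(store.retrieve_feature_values("user_1", ["clicks_last_7d"]))
store.record_prediction("user_1", "churn_model", 3, 0.18)
```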
8. Data Pipeline
• Data engineers interact with a feature store by creating data pipeline definitions.
• Data pipeline definitions combine
– Data Sources
– Business definitions
– Transformation rules
– Streaming/Batch definitions
– Scheduling
• Data pipelines are executed by the feature store engines and stored in online and offline data stores.
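As a rough illustration of what such a pipeline definition might combine, here is a hypothetical declarative sketch; the field names, source path, and SQL are invented and do not follow any particular feature store's schema.

```python
# Hypothetical declarative pipeline definition: source + transformation + mode + schedule.
from dataclasses import dataclass


@dataclass
class FeaturePipeline:
    name: str
    source: str            # where the raw data lives
    transformation: str    # SQL or expression applied by the engine
    mode: str              # "batch" or "streaming"
    schedule: str          # cron-style schedule for batch runs


clicks_pipeline = FeaturePipeline(
    name="clicks_last_7d",
    source="s3://raw/click_events/",          # hypothetical location
    transformation=(
        "SELECT user_id, COUNT(*) AS clicks_last_7d "
        "FROM events WHERE event_ts >= now() - INTERVAL 7 DAYS GROUP BY user_id"),
    mode="batch",
    schedule="0 2 * * *",                      # daily at 02:00
)
print(clicks_pipeline)
```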
9. Feature Registry
• Data scientists interact with the feature store through the Feature Registry.
• They can search for and browse feature definitions.
• They can register data science models as a class of data pipeline.
10. Versioning and Monitoring
• Feature stores can assist with versioning and monitoring data science applications.
• Predictions are recorded in the feature store API, including source data, model used, version of that model, and the rendered prediction.
• Predictions can be compared with reality to determine the accuracy of the models.
• Models and versions are tracked and can be used to determine the lift provided by a particular instance of a model.
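A toy sketch of the prediction-versus-reality comparison described above, using pandas; all entity, model, and outcome values are illustrative.

```python
# Sketch: join logged predictions against observed outcomes and track accuracy per model version.
import pandas as pd

predictions = pd.DataFrame({
    "entity_id": ["u1", "u2", "u3", "u4"],
    "model":     ["churn"] * 4,
    "version":   [2, 2, 3, 3],
    "predicted": [1, 0, 1, 0],
})
actuals = pd.DataFrame({
    "entity_id": ["u1", "u2", "u3", "u4"],
    "actual":    [1, 0, 0, 0],
})

scored = predictions.merge(actuals, on="entity_id")
accuracy_by_version = (scored.assign(correct=scored.predicted == scored.actual)
                             .groupby(["model", "version"])["correct"].mean())
print(accuracy_by_version)   # e.g. version 2 -> 1.0, version 3 -> 0.5
```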
11. Feature Store Implementations
• Open Source
– GoJEK/Google Feast
• Product Offerings
– Logical Clocks Hopsworks
– Scribble Enrich
• Presentations Only
– Uber Michelangelo
– Airbnb Zipline
– SurveyMonkey ML Feature Store
– Netflix Metaflow
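For a flavor of the open-source option listed above, here is a hedged sketch of online feature retrieval with Feast, assuming a newer Feast release and an already-defined, already-applied feature repository; the feature view and entity names follow Feast's quickstart example and may differ in your setup.

```python
# Sketch: fetch online feature values for one entity from an existing Feast repository.
from feast import FeatureStore

store = FeatureStore(repo_path=".")   # directory containing feature_store.yaml

features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(features)
```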