Do compilers look anything like a data pipeline? How do you test data to ensure end-to-end provenance and enforce engineering guarantees for your data products? And what baby steps should you consider when assembling your team?
** Watch the video to accompany these slides: https://www.cloverdx.com/webinars/starting-your-modern-dataops-journey **
- What is "Data Ops" and why should you consider it?
- How to begin your transition to a DevOps and DataOps-style of work
- How agile methodologies, version control, continuous integration, and 'infrastructure as code' can improve the effectiveness of your teams
- How you can use technology like CloverDX to start with DataOps
Discover how to make your development and data analytics processes more efficient and effective by shifting to a Dev/DataOps approach.
More CloverDX webinars: https://www.cloverdx.com/webinars
Twitter: https://twitter.com/cloverdx
LinkedIn: https://www.linkedin.com/company/cloverdx/
Get a free 45-day trial of the CloverDX Data Management Platform: https://www.cloverdx.com/trial-platform
The second speaker at the DOG meetup on 25 February 2021 was Yields.io’s co-founder and CEO, Jos Gheerardyn.
Jos has built the first FinTech platform that uses AI for real-time model testing and validation on an enterprise-wide scale. A zealous proponent of model risk governance & strategy, Jos is on a mission to empower quants, risk managers and model validators with smarter tools to turn model risk into a business driver.
Jos told us more about monitoring data quality.
How an Industrial DataOps Solution Improves OEE With a Time Series Database (InfluxData)
An industrial DataOps solution using a time series database can improve overall equipment effectiveness (OEE) in three key ways:
1. It stores industrial time series data with important context like descriptions, locations, and measurement units that raw industrial data lacks.
2. It standardizes and normalizes inconsistent data points across devices to make the data usable outside of process controls.
3. HighByte Intelligence Hub is presented as an example solution that can collect, standardize, and publish industrial data in real-time to enable analytics and business improvements.
Seven Steps to DataOps @ dataops.rocks conference, Oct 2019 (DataKitchen)
The document outlines seven steps for implementing DataOps to improve data analytics projects: 1) orchestrate the data journey from access to production, 2) add automated tests and monitoring, 3) use version control for code, 4) enable branching and merging of code, 5) use multiple environments, 6) reuse and containerize components, and 7) parameterize processing. It also discusses three additional steps: data architecture, inter- and intra-team collaboration, and process analytics for measurement. The goal of DataOps is to increase project success rates by integrating testing, monitoring, collaboration and automation practices across the entire data and analytics workflow.
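As a hedged illustration of steps 1 and 2 (orchestration plus automated tests), here is a minimal Python sketch; the stage functions and checks are placeholders of mine, not DataKitchen's actual tooling:

# Minimal sketch of DataOps steps 1-2: run the data journey as ordered stages,
# each followed by an automated test that halts the pipeline on failure.
def extract():
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def transform(rows):
    return [{**r, "value": r["value"] * 2} for r in rows]

def run_pipeline():
    rows = extract()
    assert rows, "test failed: extract produced no rows"            # automated test
    out = transform(rows)
    assert all(r["value"] % 2 == 0 for r in out), "test failed: transform invariant"
    return out

print(run_pipeline())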
Batist Leman from Azumuta gave a talk on data capture with man and machine. A keynote on why it is super important to make data capture as easy as possible and how to enable your company to use accurate real-time insights.
Introduction to DataOps and AIOps (or MLOps) (Adrien Blind)
This presentation introduces the audience to DataOps and AIOps practices. It covers organizational and technical aspects, and provides hints for starting your data journey.
Jan van der Vegt: Challenges Faced with Machine Learning in Practice (Lviv Startup Club)
Machine learning projects often fail to make it from development to production. Looking at the full machine learning lifecycle is essential for success. The lifecycle includes development, deployment, infrastructure, monitoring, automation, standardization, lineage, and reproducibility. A machine learning operations (MLOps) platform can provide an end-to-end system view for increased efficiency, collaboration, and trust across the lifecycle. Key takeaways: focus on what is important, and avoid both doing nothing (which fails to scale) and doing everything (which stifles progress).
Contemporary research challenges and applications of service oriented archite... (Dr. Shahanawaj Ahamad)
Service Oriented Architecture (SOA) is a distributed architectural framework that provides service-based solutions for improving the effectiveness of an enterprise's IT infrastructure. In this framework, technical and business processes are implemented as services. A service is an independent software application designed to perform a specific function, with emphasis on loose coupling between interacting services and their components. SOA permits developers to reuse many resources from existing services to form distributed applications. This study investigates and highlights emerging issues in SOA, such as the advancement of service structures, evolution requirements for current-generation applications (mobile-cloud and medical, for example), and mechanisms for interoperable operations. The paper also surveys the practical application domains of SOA and identifies research directions in these domains, pointing out open issues for further work to overcome constraints in current scenarios.
How to Add Security in DataOps and DevOps (Ulf Mattsson)
The emerging DataOps is not just DevOps for data. According to Gartner, DataOps is a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and consumers across an organization.
The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.
This session will discuss how to add Security in DataOps and DevOps.
Data Analytics in your IoT Solution - Fukiat Julnual, Technical Evangelist, Mic... (BAINIDA)
Data Analytics in your IoT Solution, by Fukiat Julnual, Technical Evangelist, Microsoft (Thailand) Limited, at THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE, organized by the Graduate School of Applied Statistics and DATA SCIENCES THAILAND.
The document outlines 8 critical steps for getting started with industrial data collection: 1) Assess equipment and IT systems, 2) Map pain points and objectives, 3) Set a quick-win proof of concept, 4) Form a small dedicated IIoT team, 5) Resist problem-specific solutions, 6) Decide on cloud or on-premise storage, 7) Involve machine suppliers early, and 8) Choose an IIoT integrator wisely. Each step provides questions to consider. The key takeaways are to think big but start small, involve colleagues and partners, and keep the system open rather than locked into one technology.
What is DataOps? Data lineage for DataOps. How can MANTA help? A DataOps case study. DataOps implementation. Free e-books.
Visit us at https://getmanta.com/
Who changed my data? Need for data governance and provenance in a streaming w... (DataWorks Summit)
Enterprises have dealt with data governance over the years, but mostly around master data. With the advent of IoT/web/app streams everywhere in the ecosystem surrounding an enterprise, data-in-motion has become a strong force to reckon with. Data-in-motion passes through several levels of transformation and augmentation before it becomes data-at-rest. Throughout this, it is pertinent to preserve the sanctity of such data, or at least track its provenance through the various changes. This is very important for verticals with strong regulatory and compliance laws around "who changed what."
This session will go into detail around some specific use cases of how data gets changed, how it can be tracked seamlessly and why this is important for certain verticals. This will be presented in two parts. The first part will cover the industry angle to this and its importance weighed in by several regulatory bodies. The second part will address the technology aspect of it and discuss how companies can leverage Apache Atlas and Ranger in conjunction with NiFi and Kafka to embrace data governance and provenance of their data streams.
Speakers
Dinesh Chandrasekhar, Director, Hortonworks
Paige Bartley, Senior Analyst - Data and Enterprise Intelligence, Ovum
Driven by data - Why we need a Modern Enterprise Data Analytics Platform (Arne Roßmann)
In order to turn data into opportunities, you need to build a modern data analytics platform. But because literally everything changes so fast, built-in flexibility is paramount.
This presentation covers:
- how to leverage all your data to generate insights
- the capabilities needed to build a flexible platform
- how to incorporate sustainability requirements
Understanding DataOps and Its Impact on Application Quality (DevOps.com)
Modern-day applications are data-driven and data-rich. The infrastructure your backends run on is a critical aspect of your environment and requires unique monitoring tools and techniques. In this webinar, learn what DataOps is and how critical good DataOps is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
Webinar: Attaining Excellence in Big Data Integration (SnapLogic)
This document discusses best practices for attaining excellence in big data integration. It notes that analytics and integration are top investment areas for big data technologies. There is still uncertainty around which Hadoop tools and distributions to use. The document recommends five best practices: 1) evaluate integration processes, 2) examine new approaches, 3) evaluate technology needs, 4) investigate dedicated integration technology, and 5) gain benefits that outweigh costs. It also discusses using the cloud for big data integration.
How the world of data analytics, science, and insights is failing, and how principles from Agile, DevOps, and Lean are the way forward. #DataOps. Given at DevOps Enterprise Summit 2019.
Global Data Management – a practical framework to rethinking enterprise, oper... (DataWorks Summit)
Global data management is not a newly coined term. However, what it stands for is widening in scope, particularly around data-in-motion and data-at-rest. Significant technology trends such as IoT, cloud, AI/ML, blockchain, and streaming data have given rise to excessive data volumes as well as innovative use cases. The scope of global data management now extends all the way from ingestion, processing, storage, governance, and security to analysis. With a good number of endpoints served through the cloud and major application footprints remaining on-premises, it is pertinent to have a global data management strategy that supports hybrid models and, more specifically, a multi-cloud model.
Many modern businesses struggle to balance the demands of rapidly innovating through new technologies like machine learning with the need to keep data safe and secure, all while responding to a constantly changing regulatory landscape. This puts data stewards, data engineers, architects, data scientists, and analysts under intense pressure as they must contend with existing and new applications, multiple logical and physical data stores and sources, diverse data types, and data spread across several deployment environments.
Attend this session led by Matt Aslett, Research Director at 451 Research and Dinesh Chandrasekhar, Director, Hortonworks to learn more about creating a framework for your enterprise that offers guidance on how to think about global data management—priorities, responsibilities, key stakeholders, compliance, and growth.
Speakers
Dinesh Chandrasekhar, Hortonworks, Director Product Marketing
Matt Aslett, 451 Research, Research Director, Data platforms and Analytics
Washington DC DataOps Meetup -- Nov 2019 (DataKitchen)
This document discusses challenges with current data analytics practices and how adopting a DataOps approach can help address them. It notes that current practices often involve many people using complex, fragmented toolchains which results in high error rates, slow deployment speeds, and an inability to deliver insights at the speed of business. DataOps is presented as a way to transform data analytics by applying practices from DevOps and Lean manufacturing like continuous integration, monitoring, version control systems, and reusable components. The document provides a seven step framework for implementing DataOps along with additional considerations for architecture, metrics, and collaboration.
[Infographic] Cloud Integration Drivers and Requirements in 2015 (SnapLogic)
SnapLogic and TechValidate queried more than 100 U.S. companies with revenues greater than $500 million about the business and technical drivers and barriers for enterprise cloud application adoption in 2015 and beyond.
You can also learn how the SnapLogic Elastic Integration Platform can help by going to www.SnapLogic.com/iPaaS.
DataOps: An Agile Method for Data-Driven Organizations (Ellen Friedman)
DataOps expands DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration for flexibility, fast time to value and an agile workflow for data-intensive applications including machine learning pipelines. (Strata Data San Jose March 2018)
This document outlines seven steps for transitioning from data science to data operations (DataOps):
1. Orchestrate the data science and production workflows.
2. Add testing at each step to monitor quality.
3. Use a version control system to manage code changes.
4. Implement branching and merging to allow parallel development.
5. Maintain separate environments for experiments, development and production.
6. Containerize components and practice environment version control.
7. Parameterize processes to increase flexibility and reuse.
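A minimal sketch of steps 5 and 7 together (environment names and parameters are my own illustration, not the deck's code): the same processing code serves separate environments by swapping a parameter set.

# Parameterized processing (step 7) across separate environments (step 5).
ENVIRONMENTS = {
    "experiment": {"input": "sample.csv", "sample_rate": 0.01},
    "development": {"input": "dev.csv", "sample_rate": 0.10},
    "production": {"input": "full.csv", "sample_rate": 1.00},
}

def process(env: str) -> str:
    params = ENVIRONMENTS[env]
    return f"processing {params['input']} at sample rate {params['sample_rate']}"

for env in ENVIRONMENTS:
    print(env, "->", process(env))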
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems (Ganesan Narayanasamy)
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data science teams tasked with responding to business challenges. This talk will cover the challenges and innovations for AI at scale in industries such as healthcare and automotive, the AI ladder, the AI life cycle, and infrastructure architecture considerations.
Most organisations think that they have poor data quality, but don’t know how to measure it or what to do about it. Teams of data scientists, analysts, and ETL developers are either blindly taking a “garbage in -> garbage out” approach or, worse still, “cleansing” data to fit their limited perspectives. DataOps is a systematic approach to measuring data quality and planning mitigations for bad data.
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent (Seeling Cheung)
Nicholas Berg presented on Seagate's use of big data analytics to manage the large amount of manufacturing data generated from its hard drive production. Seagate collects terabytes of data per day from testing its drives, which it analyzes using Hadoop to improve quality, predict failures, and gain other insights. It faces challenges in integrating this emerging platform due to the rapid evolution of Hadoop and lack of tools to fully leverage large datasets. Seagate is developing its data lake and data science capabilities on Hadoop to better optimize manufacturing and drive design.
AIOps: Anomalies Detection of Distributed Traces (Jorge Cardoso)
An introduction to the field of AIOps, large-scale monitoring, and observability. Provides an example illustrating how deep learning can be used to analyze distributed traces to reveal exactly which component is causing a problem in microservice applications.
Presentation given at the National University of Ireland, Galway (NUI Galway) on 2019.08.20. Thanks to Prof. John Breslin.
Pivotal Big Data Suite: A Technical Overview (VMware Tanzu)
How and why companies like Uber, Netflix, and Airbnb are so successful, what you need in order to become successful in the same way, and how Pivotal can help you with that.
Speaker: Les Klein, EMEA CTO Data, Pivotal
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.... (Databricks)
Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production... (Robert Grossman)
The document discusses lessons learned from moving machine learning algorithms to production environments, referred to as "AnalyticOps". It introduces AnalyticOps as establishing an environment where building, validating, deploying, and running analytic models happens rapidly, frequently, and reliably. A key challenge is deploying analytic models into operations, products, and services. The document discusses strategies for deploying models, including scoring engines that integrate analytic models into operational workflows using a model interchange format. It provides two case studies as examples.
Apache® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models (Anyscale)
Apache Spark has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications?
In this webinar, we will discuss best practices from Databricks on how our customers productionize machine learning models, do a deep dive into actual customer case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java, and SQL.
These are our contributions to the Data Science projects developed in our startup. They are part of partner trainings and of in-house design, development, and testing of course material and concepts in Data Science and Engineering. They cover data ingestion, data wrangling, feature engineering, data analysis, data storage, data extraction, querying data, and formatting and visualizing data for various dashboards. Data is prepared for accurate ML model predictions and generative AI apps.
This is our project work at our startup for Data Science. It is part of our internal training and is focused on data management for AI, ML, and generative AI apps.
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli... (Databricks)
We will present the design and evolution of Nvidia's 100% self-service streaming big-data platform (ETL, analytics, AI training and inferencing) powered by Spark and Nvidia GPUs. We will discuss the architecture, major challenges that we faced, and lessons learned along the way. Nvidia's data platform processes tens of billions of events per day, supporting several Nvidia products like GPU Cloud, GeForce NOW cloud gaming, AI Smart Cities, and DriveSim for self-driving cars. In this talk, we are going to deep-dive on Nvidia's next-generation data platform, with new custom-built frameworks, automation tools, and a monitoring system on top of Spark, empowering our developers to build new Spark-powered applications at the speed of light (SOL) with fully self-service, unified data flows. We will showcase these new tools: a) zero-engineering dashboards; b) out-of-the-box Spark Streaming applications with automated schema management; c) a custom Spark Streaming to Elasticsearch connector with enhanced security; d) GDPR-compliant SQL access control and auditing with a new custom token management framework; e) migration from Logstash clusters to Spark Streaming for log parsing; etc. We will discuss how decoupling the data platform and applications helped us achieve the next level of scale, self-service, and security. Finally, we will demo our platform's App Store, where developers can shop for new apps and deploy them with ease, with automated dashboards, streaming ETL, analytics, monitoring, AI training, and inferencing. Extended description: with structured telemetry events and unstructured logs growing at a 1000% rate year-over-year, it is extremely important to handle this scale with strict SLAs and high reliability while maintaining extremely low latency. We will discuss how we handled these scaling and security concerns to solve business requirements. Additionally, we will be open-sourcing some of our custom Spark frameworks during the talk.
Speakers: Satish Dandu, Rohit Kulkarni
BioThings API: Promoting Best-practices via a Biomedical API Development Ecos... (Chunlei Wu)
Overview of the BioThings project (https://biothings.io), with a highlight of the BioThings Studio tool, a web development environment for building biomedical APIs.
MLOps and Data Quality: Deploying Reliable ML Models in Production (Provectus)
Looking to build a robust machine learning infrastructure to streamline MLOps? Learn from Provectus experts how to ensure the success of your MLOps initiative by implementing Data QA components in your ML infrastructure.
For most organizations, the development of multiple machine learning models, their deployment and maintenance in production are relatively new tasks. Join Provectus as we explain how to build an end-to-end infrastructure for machine learning, with a focus on data quality and metadata management, to standardize and streamline machine learning life cycle management (MLOps).
Agenda
- Data Quality and why it matters
- Challenges and solutions of Data Testing
- Challenges and solutions of Model Testing
- MLOps pipelines and why they matter
- How to expand validation pipelines for Data Quality
1. The document discusses how to repeatedly translate models into business value through interventions and experiments.
2. It identifies key stage-transition tasks (STTs) in the process from defining problems to testing models and proposes tools to standardize, automate, and increase collaboration at each stage.
3. The goal is to increase the "velocity of the vortex" by speeding up the cycle from data to models to experiments and back to improve models.
ALT-F1.BE: The Accelerator (Google Cloud Platform) (Abdelkrim Boujraf)
The Accelerator is an IT infrastructure able to collect and analyze a massive amount of public data on the WWW. The Accelerator leverages the untapped potential of web data with the first solution designed for diverse sectors: completely scalable, available on-premises, and cloud-provider agnostic.
Notes on Deploying Machine-learning Models at Scale (Deep Kayal)
While modeling techniques in machine learning have matured drastically, the deployment of models at scale has been overlooked. These are some learnings I've gathered over the years, which I presented at Cognizant in Amsterdam.
The document discusses model risk management considerations for machine learning models. It begins with an overview of machine learning and artificial intelligence applications in finance. It then covers key elements of model risk management for machine learning such as model governance structure, model lifecycle management, tracking, metadata management, scaling, reproducibility, interpretability, and testing. The presentation concludes with a discussion on quantifying model risk.
Building a Real-Time Security Application Using Log Data and Machine Learning... (Sri Ambati)
Building a Real-Time Security Application Using Log Data and Machine Learning - Karthik Aaravabhoomi
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
A survey on Machine Learning In Production (July 2018) (Arnab Biswas)
What does Machine Learning In Production mean? What are the challenges? How organizations like Uber, Amazon, Google have built their Machine Learning Pipeline? A survey of the Machine Learning In Production Landscape as of July 2018
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf (Jim Dowling)
This document discusses building machine learning systems using serverless services and Python. It introduces the Iris flower classification dataset as a case study. The key steps outlined are to: create accounts on Hopsworks, Modal, and HuggingFace; build and run feature, training and inference pipelines on Modal to classify Iris flowers; and create a predictive user interface using Gradio on HuggingFace to allow users to input Iris flower properties and predict the variety. The document emphasizes that serverless infrastructure allows building operational and analytical ML systems without managing underlying infrastructure.
1) The document discusses how systems engineering methods can be integrated with the AI/ML lifecycle to engineer intelligent systems. It identifies 10 major challenges for this integration, including describing AI/ML model needs and capabilities, integrating AI/ML into specification, verification, and other systems engineering processes.
2) The document proposes concepts for tackling each challenge, such as using standards to describe AI/ML model lifecycles and digital twin environments for verification. It also discusses opportunities like reusing existing AI/ML models and the need to educate new professionals.
3) Key points are that research is active in integrating systems engineering and AI/ML to build safer, more cost-effective cyber-physical systems, and
A case study in using ibm watson studio machine learning services - ibm devel... (Einar Karlsen)
This IBM Developer article shows various ways of predicting customer churn using IBM Watson Studio, ranging from a semi-automated approach using the Model Builder, through a diagrammatic approach using SPSS Modeler Flows, to a fully programmed style using Jupyter notebooks.
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und... (Joachim Schlosser)
1. The document discusses how MATLAB can be used to analyze large amounts of industrial data (i.e. big data) and optimize complex systems through modeling and simulation.
2. It provides an example of how a steel manufacturer used MATLAB to automatically optimize its production schedule, reducing development time by a factor of 10.
3. MATLAB allows rapid prototyping of algorithms on desktop computers and scaling to larger clusters or cloud environments as needed. This enables effective analysis of big data.
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark (Databricks)
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
10. Each Data Team Writes Their Own "Compiler"
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
- Lexical and Syntactic Analysis: data quality, metadata, raw or slightly modelled data
11. Each Data Team Writes Their Own "Compiler"
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
- Lexical and Syntactic Analysis: data quality, metadata, raw or slightly modelled data
- Semantic Analysis: business understanding, model creation and experiments
12. Each Data Team Writes Their Own "Compiler"
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
- Lexical and Syntactic Analysis: data quality, metadata, raw or slightly modelled data
- Semantic Analysis: business understanding, model creation and experiments
- Code Generation
13. Each Data Team Writes Their Own "Compiler"
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
- Lexical and Syntactic Analysis: data quality, metadata, raw or slightly modelled data
- Semantic Analysis: business understanding, model creation and experiments
- Code Generation
- Optimisation: further tests, robustness
(a code sketch of the analogy follows)
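A hedged sketch of the analogy in code (my illustration, not the article's implementation): a data pipeline staged like a compiler, where each stage can reject its input.

# Illustrative pipeline-as-compiler: stage names mirror the slides above;
# the checks and artefacts are placeholders.
def lexical_syntactic(data: dict) -> dict:
    # data quality, metadata, raw or slightly modelled data
    assert "rows" in data, "malformed input: no rows"
    return data

def semantic(data: dict) -> dict:
    # business understanding, model creation and experiments
    data["model"] = f"model over {len(data['rows'])} rows"
    return data

def code_generation(data: dict) -> dict:
    # emit a deployable artefact
    data["artifact"] = data["model"] + " [packaged]"
    return data

def optimisation(data: dict) -> dict:
    # further tests, robustness
    assert data["rows"], "robustness check failed: empty dataset"
    return data

state = {"rows": [1, 2, 3]}
for stage in (lexical_syntactic, semantic, code_generation, optimisation):
    state = stage(state)
print(state["artifact"])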
24. What Kind of Tests?
1. data validator: schema conformance and evolution
a. also a way to document new features used in the pipeline
b. Also about trends and anomalies
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
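As a hedged sketch of what such a data validator does (the linked post covers TFX data validation; this toy version and its schema are my own), a batch can be checked for schema conformance and undeclared columns:

# Minimal data validator: checks each record against an expected schema and
# reports conformance errors plus schema evolution (new, undeclared columns).
from typing import Any

EXPECTED_SCHEMA = {          # hypothetical schema for a clicks dataset
    "user_id": str,
    "event_ts": float,
    "clicks": int,
}

def validate_batch(records: list[dict[str, Any]]) -> list[str]:
    errors = []
    for i, rec in enumerate(records):
        for col, typ in EXPECTED_SCHEMA.items():
            if col not in rec:
                errors.append(f"row {i}: missing column '{col}'")
            elif not isinstance(rec[col], typ):
                errors.append(f"row {i}: '{col}' is {type(rec[col]).__name__}, expected {typ.__name__}")
        for col in rec.keys() - EXPECTED_SCHEMA.keys():
            errors.append(f"row {i}: undeclared column '{col}' (schema evolution? document it)")
    return errors

batch = [{"user_id": "u1", "event_ts": 1.0, "clicks": 3},
         {"user_id": "u2", "event_ts": "oops", "clicks": 1, "country": "SE"}]
for err in validate_batch(batch):
    print(err)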
26. What Kind of Tests?
1. data validator: schema conformance and evolution
a. also a way to document new features used in the pipeline
2. data analyser: basic statistics
a. Bias
b. Feature/distribution skew
c. ...
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
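A hedged sketch of the data analyser idea (my illustration; the skew metric and threshold are assumptions, not from the post): compare basic statistics of a feature between a baseline sample and a fresh sample to surface distribution skew.

# Compares a feature's statistics across two samples to flag skew.
import statistics

def feature_skew(baseline: list[float], fresh: list[float]) -> dict[str, float]:
    return {
        "mean_shift": abs(statistics.mean(fresh) - statistics.mean(baseline)),
        "stdev_ratio": statistics.stdev(fresh) / max(statistics.stdev(baseline), 1e-9),
    }

baseline = [10.0, 12.0, 11.5, 9.8, 10.7]   # training-time sample
fresh = [15.2, 16.0, 14.8, 15.5, 16.3]     # serving-time sample, shifted
report = feature_skew(baseline, fresh)
if report["mean_shift"] > 2.0:             # hypothetical alert threshold
    print("distribution skew detected:", report)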
27. New Feature Engineering: “Instead of deriving the math before feeding the model, we ensure our features comply with certain properties so that the NN can do the math effectively by itself” -- Airbnb
https://arxiv.org/pdf/1810.09591.pdf
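A hedged reading of that idea in code (the log transform below is one the Airbnb paper discusses for power-law features; the sample values are mine): rather than hand-deriving ratios, map a long-tailed raw feature to a smooth, roughly centered scale the network can digest.

import math

def smooth(value: float, median: float) -> float:
    # log((1 + value) / (1 + median)) centers the median at 0 and
    # compresses the power-law tail
    return math.log((1.0 + value) / (1.0 + median))

prices = [80.0, 95.0, 120.0, 600.0]   # long-tailed raw feature
median = 107.5
print([round(smooth(p, median), 3) for p in prices])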
29. What Kind of Tests?
1. data validator: schema conformance and evolution
a. also a way to document new features used in the pipeline
2. data analyser: basic statistics
a. Bias
b. Feature/distribution skew
c. ...
3. model unit tester looks for errors in the training code using
synthetic data (schema-led fuzzing)
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
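A hedged sketch of schema-led fuzzing (schema, ranges, and the stand-in trainer are my own assumptions): generate random records that conform to the declared schema and check that the training code neither crashes nor produces non-finite values.

import math
import random

SCHEMA = {"age": (18, 95), "income": (0.0, 1e6)}   # name -> (lo, hi)

def synth_record(rng: random.Random) -> dict[str, float]:
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in SCHEMA.items()}

def train_step(record: dict[str, float]) -> float:
    # stand-in for real training code; returns a loss value
    return math.log(1.0 + record["income"]) / record["age"]

def test_training_on_synthetic_data() -> None:
    rng = random.Random(42)
    for _ in range(1000):
        loss = train_step(synth_record(rng))
        assert math.isfinite(loss), "training produced a non-finite loss"

test_training_on_synthetic_data()
print("model unit test passed on 1000 synthetic records")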
33. What Kind of Tests?
1. data validator: schema conformance and evolution
a. also a way to document new features used in the pipeline
2. data analyser: basic statistics
a. Bias
b. Feature/distribution skew
c. ...
3. model unit tester looks for errors in the training code using
synthetic data (schema-led fuzzing)
4. monitoring tests check the output of the model to trigger alerts
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
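A hedged sketch of a monitoring test (window size and expected band are illustrative assumptions): watch the model's output stream and trigger an alert when the rolling positive-prediction rate drifts outside an expected range.

from collections import deque

WINDOW, LOW, HIGH = 500, 0.05, 0.40    # hypothetical expected band
recent = deque(maxlen=WINDOW)

def observe(prediction: int) -> None:
    recent.append(prediction)
    if len(recent) == WINDOW:
        rate = sum(recent) / WINDOW
        if not (LOW <= rate <= HIGH):
            print(f"ALERT: positive rate {rate:.2%} outside [{LOW:.0%}, {HIGH:.0%}]")

for p in [1] * 300 + [0] * 200:        # simulated burst of positives
    observe(p)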
34. Monitoring
Model performance dashboard:
1. model output metrics through training, validation, testing, and deployment
2. data input metrics
3. operational telemetry
Image from: https://www.parallelm.com/
38. What is Special about ML?
1. New Artefacts to Manage
a. Data
b. Metadata: Hyperparameters
c. Code: architecture
d. Model: executable software “built from the data”
e. Experiment: Data + metadata + Hyperparams + Code -> Model
2. Different Process
a. Trial and error: Scientific method
b. Reproducibility - traceability
c. Explainability
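A hedged sketch of treating an experiment as a first-class artefact (field names are mine, not the deck's): data + metadata + hyperparameters + code together determine the model, so fingerprinting them buys reproducibility and traceability.

import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Experiment:
    data_hash: str        # fingerprint of the training dataset
    code_version: str     # e.g. git commit of the training code
    hyperparams: tuple    # sorted (name, value) pairs

    def model_id(self) -> str:
        # the model is "built from the data": its id derives from all inputs
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

exp = Experiment(data_hash="3fa9c2", code_version="a1b2c3d",
                 hyperparams=(("lr", 0.01), ("epochs", 10)))
print("model built from this experiment:", exp.model_id())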
41. The Data Version Control Tipping Point
● Datasets can be versioned, branched, and acted upon by versioned code to create new datasets
● Test and file bugs against data
● Enable quality control for compiler steps
● Automated lineage and schema change detection
● Make guarantees about system components
(a minimal sketch of the idea follows)
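A hedged sketch of the core mechanism behind data version control (what tools such as DVC or lakeFS do at scale; names and the toy lineage store are mine): identify each dataset version by a content hash, so transformed outputs trace back to exact inputs.

import hashlib

def dataset_version(rows: list[str]) -> str:
    h = hashlib.sha256()
    for row in rows:
        h.update(row.encode())
    return h.hexdigest()[:10]

raw = ["alice,34", "bob,29"]
v1 = dataset_version(raw)
cleaned = [r.upper() for r in raw]          # a versioned transformation
v2 = dataset_version(cleaned)
lineage = {v2: {"parent": v1, "transform": "uppercase_names"}}
print(f"{v1} -> {v2}", lineage[v2])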
47. Ad Hoc Exploration (stage 1)
- Processes: isolated efforts; no repeatability
- Tools: siloed data; local dev
- Relationships: no business buy-in (transactional); ivory tower
48. Reproducible, but limited (stage 2)
- Processes: repeatability is patchy; poor governance
- Tools: shy centralisation; static reports
- Relationships: heavy, transactional rapport; team management support only
49. Defined, Controlled (stage 3)
- Processes: formal but manually enforced; incipient experimentation
- Tools: good centralization (metadata, access); live retrospective reports
- Relationships: empathy