According to Forrester Research, only 22% of companies are currently seeing a significant return from data science expenditures. Most data science implementations are high-cost IT projects, local applications that are not built to scale for production workflows, or laptop decision support projects that never impact customers. Despite this high failure rate, we keep hearing the same mantra and solutions over and over again. Everybody talks about how to create models, but not many people talk about getting them into production where they can impact customers.
Harvinder Atwal offers an entertaining and practical introduction to DataOps, a new and independent approach to delivering data science value at scale, used at companies like Facebook, Uber, LinkedIn, Twitter, and eBay. The key to adding value through DataOps is to adapt and borrow principles from Agile, Lean, and DevOps. However, DataOps is not just about shipping working machine learning models; it starts with better alignment of data science with the rest of the organization and its goals. Harvinder shares experience-based solutions for increasing your velocity of value creation, including Agile prioritization and collaboration, new operational processes for an end-to-end data lifecycle, developer principles for data scientists, cloud solution architectures to reduce data friction, self-service tools giving data scientists freedom from bottlenecks, and more. The DataOps methodology will enable you to eliminate daily barriers, putting your data scientists in control of delivering ever-faster cutting-edge innovation for your organization and customers.
Understanding DataOps and Its Impact on Application Quality - DevOps.com
Modern-day applications are data-driven and data-rich. The infrastructure your backends run on is a critical aspect of your environment and requires unique monitoring tools and techniques. In this webinar, learn what DataOps is and how critical good DataOps is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
DataOps @ Scale: A Modern Framework for Data Management in the Public Sector - TamrMarketing
Within the last 6 months, U.S. agencies have begun defining a “Data Science Occupational Series”.
This means adding the term “(Data Scientist)” at the end of a job title to increase the odds of finding a candidate who understands data.
Watch the full presentation: https://resources.tamr.com/govdataops
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
Title
DataOps, the secret weapon for delivering AI, data science, and business intelligence value at speed.
Synopsis
● According to recent research, just 7.3% of organisations say the state of their data and analytics is excellent, and only 22% of companies are currently seeing a significant return from data science expenditure.
● Poor returns on data & analytics investment are often the result of applying 20th-century thinking to 21st-century challenges and opportunities.
● Modern data science and analytics require secure, efficient processes to turn raw data from multiple sources and in numerous formats into useful inputs to a data product.
● Developing, orchestrating and iterating modern data pipelines is an extremely complex process requiring multiple technologies and skills.
● Other domains have successfully overcome the challenge of delivering high-quality products at speed in complex environments. DataOps applies proven agile principles, lean thinking and DevOps practices to the development of data products.
● A DataOps approach aligns data producers, analytical data consumers, processes and technology with the rest of the organisation and its goals.
DataOps is a methodology and culture shift that brings the successful combination of development and operations (DevOps) to data processing environments. It breaks down silos between developers, data scientists, and operators, resulting in lean data feature development processes with quick feedback. In this presentation, we will explain the methodology, and focus on practical aspects of DataOps.
Introduction to DataOps and AIOps (or MLOps) - Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational and tech aspects, and provides hints to start your data journey.
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018 - Amazon Web Services
In modern, microservices-based applications, it’s critical to have end-to-end observability of each microservice and the communications between them in order to quickly identify and debug issues. In this session, we cover the techniques and tools to achieve consistent, full-application observability, including monitoring, tracing, logging, and service mesh.
Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of leaving your clusters running accidentally and receiving a huge bill ;)
After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters.
This is part 1 of an 8-part Data Science for Dummies series:
Databricks for dummies
Titanic survival prediction with Databricks + Python + Spark ML
Titanic with Azure Machine Learning Studio
Titanic with Databricks + Azure Machine Learning Service
Titanic with Databricks + MLS + AutoML
Titanic with Databricks + MLFlow
Titanic with DataRobot
Deployment, DevOps/MLops and Operationalization
Differentiate Big Data vs Data Warehouse use cases for a cloud solution - James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines - DATAVERSITY
With the aid of any number of data management and processing tools, data flows through multiple on-prem and cloud storage locations before it’s delivered to business users. As a result, IT teams — including IT Ops, DataOps, and DevOps — are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
Join Stonebranch’s Scott Davis, Global Vice President and Ravi Murugesan, Sr. Solution Engineer to learn how DataOps teams orchestrate their end-to-end data pipelines with a platform approach to managing automation.
Key Learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business — with the observability you require
- Discover how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- See what’s possible with the Stonebranch Universal Automation Center (UAC)
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
DataOps: An Agile Method for Data-Driven Organizations - Ellen Friedman
DataOps expands DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration for flexibility, fast time to value and an agile workflow for data-intensive applications including machine learning pipelines. (Strata Data San Jose March 2018)
What’s New with Databricks Machine Learning - Databricks
In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data, discovering features, and training and managing models in production.
For many decades now, the software industry has attempted to bridge the productivity gap, develop higher-quality code and manage the ever-growing complexity of software-intensive systems. The results have been mixed, and as a result, a great majority of today's software is still written manually by human developers. This is about to change rapidly as recent developments in the field of Artificial Intelligence show promising results. While artists and designers have been taken by surprise by OpenAI’s DALL-E 2’s capabilities in designing unique art, ChatGPT has astonished the rest of the world with its capability of understanding human interaction. AI-assisted coding solutions such as GitHub’s Copilot and Replit’s Ghostwriter, among many others, are rapidly developing in a direction where AI generates new code that runs fast with high quality. Little is known about the true capabilities of AI programmers and their impact on the software development industry, education, and research. This talk sheds light on the current state of ChatGPT, large language models including GPT-4, and AI-assisted coding; highlights the research gaps; and proposes a way forward.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
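To make the platform description concrete, here is a minimal PySpark sketch of the kind of workflow Databricks supports: read a table, assemble features, and fit an MLlib model. The file path and column names are hypothetical, and on Databricks itself a SparkSession named `spark` is already provided.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# On Databricks a SparkSession called `spark` already exists; this line matters only for local runs
spark = SparkSession.builder.appName("databricks-sketch").getOrCreate()

df = spark.read.parquet("/mnt/data/customers.parquet")  # hypothetical mounted path
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")  # hypothetical columns
model = LogisticRegression(labelCol="churned").fit(assembler.transform(df))
print(model.coefficients)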
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Building a Logical Data Fabric using Data Virtualization (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3FF1ubd
In the recent Building the Unified Data Warehouse and Data Lake report by leading industry analysts TDWI, 64% of organizations stated that the objective of a unified data warehouse and data lake is to get more business value, and 84% of organizations polled felt that a unified approach to data warehouses and data lakes was either extremely or moderately important.
In this session, you will learn how your organization can apply a logical data fabric, and how the associated technologies of machine learning, artificial intelligence, and data virtualization can reduce time to value, increasing the overall business value of your data assets.
KEY TAKEAWAYS:
- How a Logical Data Fabric is the right approach to assist organizations to unify their data.
- The advanced features of a Logical Data Fabric that assist with the democratization of data, providing an agile and governed approach to business analytics and data science.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Architect’s Open-Source Guide for a Data Mesh Architecture - Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build.
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ram Dhakne - HostedbyConfluent
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ram Dhakne | Current 2022
A well-architected data lakehouse provides an open data platform that combines streaming with data warehousing, data engineering, data science and ML. This opens a world beyond streaming to solving business problems in real-time with analytics and AI. See how companies like Albertsons have used Databricks and Confluent together to combine Kafka streaming with Databricks for their digital transformation.
In this talk, you will learn:
- The built-in streaming capabilities of a lakehouse
- Best practices for integrating Kafka with Spark Structured Streaming (see the sketch after this list)
- How Albertsons architected their data platform for real-time data processing and real-time analytics
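As a companion to the integration bullet above, here is a minimal Spark Structured Streaming sketch that reads a Kafka topic and echoes the payload. The broker address, topic name and checkpoint path are hypothetical, and the spark-sql-kafka connector package must be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Treat the Kafka topic as an unbounded table (broker and topic are hypothetical)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

# Kafka delivers keys and values as binary, so cast before downstream logic
parsed = events.select(col("value").cast("string").alias("payload"))

# The checkpoint location lets the query recover its position across restarts
query = (parsed.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/orders")
         .start())
query.awaitTermination()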
Sponsored by Data Transformed, the KNIME Meetup was a big success. Please find the slides for Dan's, Tom's, Anand's and Chhitesh's presentations.
Agenda:
Registration & Networking
Keynote – Dan Cox, CEO of Data Transformed
KNIME & Harvest Analytics – Tom Park
Office of State Revenue Case Study – Anand Antony
Using Spark with KNIME – Chhitesh Shrestha
Networking & Drinks
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra... - Big Data Week
We are all aware of the challenges enterprises are having with growing data and siloed data stores. The business is not able to make reliable decisions with untrusted data, and on top of that, it doesn't have access to all data within and outside the enterprise to stay ahead of the competition and make key business decisions.
This session will take a deep dive into the challenges businesses face today and how to build a Modern Data Architecture using emerging technologies such as Hadoop, Spark, NoSQL data stores, MPP data stores, and scalable, cost-effective cloud solutions such as AWS, Azure and Bigstep.
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C... - Caserta
Joe Caserta explores the world of analytics, tech, and AI to paint a picture of where business is headed. This presentation is from the CDAO Exchange in Miami 2018.
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016 - Caserta
Caserta Concepts Founder and President, Joe Caserta, gave this presentation at Strata + Hadoop World 2016 in New York, NY. His session covers path-to-purchase analytics using a data lake and Spark.
For more information, visit http://casertaconcepts.com/
Data Summit Connect Fall 2020 - Rise of DataOps - Ryan Gross
Data governance teams attempt to apply manual control at various points for consistency and quality of the data. By thinking of our machine learning data pipelines as compilers that convert data into executable functions and leveraging data version control, data governance and engineering teams can engineer the data together, filing bugs against data versions, applying quality control checks to the data compilers, and other activities. This talk illustrates how innovations are poised to drive process and cultural changes to data governance, leading to order-of-magnitude improvements.
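To illustrate the idea of filing bugs against data versions and gating pipelines with quality checks, here is a small library-free Python sketch; every name in it is a hypothetical stand-in, not part of any particular governance tool.

def file_bug(data_version, failed_checks):
    # Stand-in for a real bug-tracker integration
    print(f"bug filed against data version {data_version}: {failed_checks}")

def quality_gate(data_version, rows, checks):
    # Run every named check; reject the data version if any fail
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        file_bug(data_version, failures)
        raise ValueError(f"data version {data_version} rejected")
    return rows

rows = [{"id": 1, "amount": 9.99}]
quality_gate("v2024-05-01", rows, {
    "non_empty": lambda r: len(r) > 0,
    "amounts_positive": lambda r: all(x["amount"] > 0 for x in r),
})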
Why Everything You Know About Big Data Is a Lie - Sunil Ranka
As a big data technologist, you can bet that you have heard it all: every crazy claim, myth, and outright lie about what big data is and what it isn't that you can imagine, and probably a few that you can't. If your company has a big data initiative or is considering one, you should be aware of these false statements and the reasons why they are wrong.
Big Data LDN 2018: THE PATH TO ENTERPRISE AI: TALES FROM THE FIELD - Matt Stubbs
Date: 14th November 2018
Location: AI Lab Theatre
Time: 11:50 - 12:20
Speaker: Romain Fouache
Organisation: Dataiku
About: Enterprise AI is a target state where every business process is AI-augmented and every employee is an AI beneficiary. But is that really attainable? And, if so, what is the path to get there? In this talk, Kurt Muehmel, VP Sales Engineering at Dataiku, will share learnings from the field, describing how companies of different sizes and across different sectors have begun this journey. Some are farther along than others, and by making the right decisions now and avoiding stumbling blocks, you can supercharge your quest toward this AI-fuelled future.
In the age of IoT, almost everyone is talking about data lakes. For the most part, we all agree on the value data lakes deliver, but beyond this conceptual agreement, there are still many practical questions that need answers. The key to success comes down to how data lakes are implemented and managed.
Chuck Yarbrough outlines the five keys for creating a data lake, along with strategies for defining, ingesting, governing, managing, and analyzing the data lake in ways that will enable transformative benefits in IoT and other use cases. This session will show how real-world data lake implementations are changing the world. Chuck focuses on automation of the data lake, from ingesting data to managing metadata at scale and applying machine learning to drive significant results. Along the way, Chuck explores tools and procedures that help create a well-organized, well-governed, and well-managed data lake, without the risk of creating a dreaded data swamp. You'll leave armed with the five keys to successfully creating and managing a killer data lake.
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat... - Denodo
This content was presented during the Smart Data Summit Dubai 2015 in the UAE on May 25, 2015, by Jesus Barrasa, Senior Solutions Architect at Denodo Technologies.
In the era of Big Data, IoT, Cloud and Social Media, Information Architects are forced to rethink how to tackle data management and integration in the enterprise. Traditional approaches based on data replication and rigid information models lack the flexibility to deal with this new hybrid reality. New data sources and an increasing variety of consuming applications, like mobile apps and SaaS, add more complexity to the problem of delivering the right data, in the right format, and at the right time to the business. Data Virtualization emerges in this new scenario as the key enabler of agile, maintainable and future-proof data architectures.
Implement an efficient data governance and security strategy with ... - Denodo
Watch full webinar here: https://bit.ly/3lSwLyU
In the era of exploding information spread across different sources, data governance is a key component for guaranteeing the availability, usability, integrity, and security of information. Likewise, the set of processes, roles, and policies it defines allows organizations to reach their objectives while ensuring the efficient use of their data.
Data virtualization is one of the strategic tools for implementing and optimizing data governance. This technology allows companies to create a 360º view of their data and establish security controls and access policies across the entire infrastructure, regardless of format or location. In this way, it brings together multiple data sources, makes them accessible from a single layer, and provides traceability capabilities to monitor changes in the data.
Join this webinar to learn:
- How to accelerate the integration of data from fragmented sources across internal and external systems and obtain a comprehensive view of information.
- How to enable a single, protected data access layer across the entire company.
- How data virtualization provides the pillars for complying with current data protection regulations through data auditing, cataloging, and security.
Against the backdrop of Big Data, the Chief Data Officer, by any name, is emerging as the central player in the business of data, including cybersecurity. The MITCDOIQ Symposium explored the developing landscape, from local organizational issues to global challenges, through case studies from industry, academic, government and healthcare leaders.
Joe Caserta, president at Caserta Concepts, presented "Big Data's Impact on the Enterprise" at the MITCDOIQ Symposium.
Presentation Abstract: Organizations are challenged with managing an unprecedented volume of structured and unstructured data coming into the enterprise from a variety of verified and unverified sources. With that is the urgency to rapidly maximize value while also maintaining high data quality.
Today we start with some history and the components of data governance and information quality necessary for successful solutions. I then bring it all to life with 2 client success stories, one in healthcare and the other in banking and financial services. These case histories illustrate how accurate, complete, consistent and reliable data results in a competitive advantage and enhanced end-user and customer satisfaction.
To learn more, visit www.casertaconcepts.com
This presentation will discuss the stories of 3 companies that span different industries; what challenges they faced and how cloud analytics solved for them; what technologies were implemented to solve the challenges; and how they were able to benefit from their new cloud analytics environments.
The objectives of this session include:
• Detail and explain the key benefits and advantages of moving BI and analytics workloads to the cloud, and why companies shouldn’t wait any longer to make their move.
• Compare the different analytics cloud options companies have, and the pros and cons of each.
• Describe some of the challenges companies may face when moving their analytics to the cloud, and what they need to prepare for.
• Provide the case studies of three companies, what issues they were solving for, what technologies they implemented and why, and how they benefited from their new solutions.
• Learn what to look for when considering a partner and trusted advisor to assist with an analytics cloud migration.
ADV Slides: How to Improve Your Analytic Data Architecture Maturity - DATAVERSITY
Many organizations are immature when it comes to data use. The answer lies in delivering a greater level of insight from data, straight to the point of need. Enter: machine learning.
In this webinar, William will look at categories of organizational response to the challenge across strategy, architecture, modeling, processes, and ethics. Machine learning maturity levels tend to move in harmony across these categories. As a general principle of maturity models, you can’t skip levels in any category, nor can you advance in one category well beyond the others.
Vis-à-vis ML, attaining and retaining momentum up the model is paramount for success. You will ascend the model through concerted efforts delivering business wins utilizing progressive elements of the model, and thereby increasing your machine learning maturity. The model will evolve. No plateaus are comfortable for long.
With ML maturity markers, sequencing, and tactics, this webinar provides a plan for how to build analytic Data Architecture maturity in your organization.
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration on the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus for modernization of its Netezza legacy analytics platform. This involved the use of the Impetus Workload Migration solution—a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell and scheduler scripts to Apache Spark compatible scripts. This delivered substantial savings in time, effort and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate. A hybrid cloud-based big data solution was designed based on that. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performing SCD Type 1 and Type 2 for mission-critical parameters and reloading the transformed data back for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
To analyse why operationalizing AI is so challenging, it’s important to understand the full lifecycle of an AI project, and identify the stakeholders involved.
Through 2023, Gartner estimates that 50% of IT leaders will struggle to move their AI projects past proof of concept (POC) to a production level of maturity.
To reduce this high failure rate, organisations need to build the right roles for AI success. In many organisations, data scientists are still wearing too many hats due to a dearth of talent across other roles.
This session will highlight how, in order to successfully operationalise and scale AI POCs, organisations must build diverse AI roles and skills within a collaborative structure.
Productionising Machine Learning to automate the enterprise. Conference research question: How can you pinpoint which core business processes to transform with increased automation and streamline daily workflows to boost in-house efficiencies?
Machine learning - What they don’t teach you on Coursera ODSC London 2016 - Harvinder Atwal
I’ll show some examples of live models at MoneySuperMarket. However, the main theme will be that there is far more to successful implementation of Machine Learning than just creating good algorithms. There needs to be just as much effort, if not more, put into selling the benefits to the business, working with developers and engineers to put the model into production, building testing into the process, and ongoing maintenance of the solution.
Case Study Interactive: How To Work With Structured And Unstructured Data To Increase Customer Acquisition And Reduce Churn With Relevant Communication
How can analytics improve your attribution model accuracy to highlight and transform your most successful marketing channels?
How can you introduce predictive analytics to increase your customer segmentation competency?
How can insights from consumer data help you to predict customer lifetime value and focus on your top customers?
How can split testing consumer data help to improve your customer offering and boost retention rates?
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to keep growing and supply to evolve, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) plays, alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
DataOps: Nine steps to transform your data science impact (Strata London, May 2018)
1. DataOps: 9 steps to transform your data science impact (21-24 May 2018)
2. // Harvinder Atwal
{"about" : "me"}
{"Current" : "Head of Data Strategy and Advanced Analytics"} // MoneySuperMarket
{"previous" : "Insight Director, Tesco Clubcard"} // dunnhumby
{"previous" : "Senior Manager, Customer Strategy and Insight"} // Lloyds Banking Group
{"previous" : "Senior Operational Research Analyst"} // British Airways
// Web: @harvindersatwal, @gmail.com
3. £2B SAVINGS: 2017 estimated total of UK savings
1993: We started life as Mortgage 2000
24.9M: Adults choose to share their data with us
24 million: Average monthly users (2017)
£323M: Revenue (2017)
989: Product providers
4. Sometimes it’s simple things that work really well: from one version to 1400+ customised variants of the newsletter, a +19% increase in Revenue Per Send.
5. Sometimes it’s more complicated solutions. Same message, but language tailored to the customer’s Financial Attitude:
“Worried about whether you can afford a personal loan? With UK interest rates at record lows, it’s worth checking to see how reasonable the cost could be. Whether you need to borrow to buy something, or you want to bring your existing debts under one roof, have a look at these competitive deals we’ve assembled. Thanks to our Smart Search tool, you can get an idea of the loans you’re likely to be accepted for before you proceed with your application.”
6. Only 22% of companies are currently seeing a significant return from data science expenditures*
*Obligatory conference presentation quote from Gartner/Forrester/McKinsey Consulting. Sorry.
12. Multiple challenges in the process of turning data into value on existing infrastructure:
Business problem → Evaluate available data → Request data access from IT → Request compute resources from IT → Negotiate with IT for requested resources → Wait for resources to be provisioned → Install languages and tools → Configure connectivity, access and security → RAM/CPU availability, scaling, monitoring → Request network config → Change request to install another package → Model building → Compose PowerPoint to share results → Edit Confluence to document work → Negotiate with business stakeholder on deployment timeline → Wait for Data Engineering to implement the model → Test newly implemented model to ensure valid results → Request modifications to model due to unexpected results → Release model to production and schedule → Document release notes and deployment steps → Prepare for change management
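The rest of the deck argues for automating this chain. As a purely illustrative, library-free sketch (every function here is a hypothetical stand-in for a real stage), the same lifecycle expressed as a pipeline that runs end to end without manual hand-offs might look like:

from datetime import datetime, timezone

def ingest(ctx):
    ctx["rows"] = [{"age": 34, "income": 52000}]  # stand-in for a real extract
    return ctx

def validate(ctx):
    # Automated data checks replace the manual "test to ensure valid results" step
    assert all(r["income"] >= 0 for r in ctx["rows"]), "bad income value"
    return ctx

def train(ctx):
    ctx["model"] = "model-" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    return ctx

def deploy(ctx):
    print(f"releasing {ctx['model']} to production")  # stand-in for a real release step
    return ctx

def run_pipeline(stages):
    ctx = {}
    for stage in stages:
        ctx = stage(ctx)  # any failure halts the run and surfaces immediately
    return ctx

run_pipeline([ingest, validate, train, deploy])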
18. Eliminate waste: LEAN THINKING
The Optimist: THE GLASS IS HALF FULL
The Pessimist: THE GLASS IS HALF EMPTY
The Lean Thinker: WHY IS THE GLASS TWICE AS BIG AS IT SHOULD BE?
19. Alignment of data science with the rest of the organisation and its goals
25. Your business already has a hypothesis for what creates value: it’s the Corporate Strategy and Objectives (that everyone is aligned behind). Actively avoid work on anything else.
26. Measurement of everything gives feedback on not just individual deliverables (fast loop) but also the organisation’s hypothesis of what adds value (slow loop)
Situational Awareness → Objectives (Themes) → Strategies (Initiatives) → Tactics (Epics) → Actions (Stories)
[Tree diagram: each Objective branches into several Strategies, each Strategy into several Tactics, and each Tactic into several Actions]
Corporate strategy is broken down into many options (Epics) for Agile delivery
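Not part of the slide, but to make the cascade concrete, here is a hypothetical Python sketch of the hierarchy; every name in it is invented for illustration.

# Hypothetical sketch of the strategy cascade as a tree: each Objective (Theme)
# branches into Strategies (Initiatives), then Tactics (Epics), then Actions (Stories).
strategy_cascade = {
    "objective": "Grow customer savings",  # Theme (invented example)
    "strategies": [
        {
            "initiative": "Personalise the newsletter",
            "epics": [
                {
                    "epic": "Segment customers by financial attitude",
                    "stories": [
                        "Build attitude segmentation model",
                        "A/B test tailored copy vs control",
                    ],
                }
            ],
        }
    ],
}

def count_stories(node: dict) -> int:
    """Fast-loop feedback starts at the leaves: count deliverable Stories."""
    return sum(
        len(epic["stories"])
        for strategy in node["strategies"]
        for epic in strategy["epics"]
    )

print(count_stories(strategy_cascade))  # -> 2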
27. We reduce batch sizes of work and have options to keep flow going
28. Collaboration is key
Shared buy-in from senior management
Organisational behaviour structured around the ideal data-journey model
Shared priorities
Shared trust in data
Shared rewards based on measured outcomes, not outputs
29. Plan → Test & Collect → Model → Embed → Roll Out → Feedback
Pilot test; collect data; build model and identify segments; adjust model to fit the organisation; re-engineer business processes to support segmented execution; train the organisation; create a fast feedback loop.
31. Shortened data cycles to be Agile
[Architecture diagram: Data Scientists own the path from Epic/Story and Data Product Strategy through to Customer Feedback & Iteration. Layers shown: Data Engineering (data sources, stream processing, ETL, DQM); DevOps/Infrastructure (compute instance, container service, distributed compute framework, orchestration and scaling); DB Management (cloud file storage, distributed file system, NoSQL DB, RDBMS, distributed SQL query engine); workbench tooling (data prep/exploration tools, coding workspace & language libraries, machine learning, data visualisation, interactive dashboards/web app development, version/deployment tool); outputs (output files, BI tools, interactive dashboards/web apps, APIs); cross-cutting concerns (knowledge management, security/identity access control, revision control, configuration management, project and data governance, scheduling, resource management/monitoring/auditing).]
34. DataOps is an independent approach to data analytics
Data analytics team moves at lightning speed using highly optimised tools and processes across the whole data lifecycle
Agile collaboration to break down silos and work on “The Right Things” that add value
Lean manufacturing-like focus on eliminating waste & bottlenecks, improving quality, monitoring and control
Iterative project management; continuous delivery; automated test and deployment; monitoring; self-serve; quality; governance; organisational alignment; ease of use; predictability; reproducibility; strategic objectives
38. Trust part 1: Make the “What you do to data” people in the organisation happy
Identity and access management; custom role permissions; audit trail logs; data loss prevention; encryption of data at rest; encryption of data in motion; resource monitoring; firewall rules; resource and object isolation; penetration testing; code encryption and backup; segregation of duties; authorisation protocols; data access and privacy policy; metadata management; data lineage tracking; data stewards and owners.
39. Trust part 2: Make the “What you do with data” people in the organisation happy
Data quality testing; transformation testing; end-user testing; ETL integration testing; metadata testing; data completeness testing; ETL regression testing; incremental ETL testing; reference data testing; ETL performance testing.
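None of these tests appear as code in the deck, but as a hedged sketch, a data completeness check might look like the following Python; the column names, thresholds and data are invented for illustration.

import pandas as pd

def check_completeness(df, required_columns, max_null_fraction=0.01):
    """Data completeness test: flag required columns that are missing
    or have more nulls than the agreed threshold."""
    failures = []
    for col in required_columns:
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif df[col].isna().mean() > max_null_fraction:
            failures.append(f"too many nulls in: {col}")
    return failures

# Invented example data: a loans extract with a null-heavy column.
loans = pd.DataFrame({"customer_id": [1, 2, 3], "apr": [5.9, None, None]})
print(check_completeness(loans, ["customer_id", "apr", "term_months"]))
# -> ['too many nulls in: apr', 'missing column: term_months']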
42. Continuous Integration: commit code regularly
Machine learning pipeline: Data Cleaning (dev branch → master), Feature Extraction (dev branch → master), Model Train (dev branch → master)
Feeds product development (e.g. app, website, marketing system, operational system, dashboard, etc.)
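The deck doesn’t show code, but one way to make a branch-per-stage workflow practical is to keep each pipeline stage a small, separately testable function, so a dev-branch change to one stage can be integrated without touching the others. A minimal Python sketch, with all names and data invented:

import pandas as pd

# Each stage mirrors a branch in the slide: small, pure, and testable on its own.

def clean(raw):
    """Data cleaning stage: drop rows with no target label."""
    return raw.dropna(subset=["label"])

def extract_features(df):
    """Feature extraction stage: derive model inputs from cleaned data."""
    out = df.copy()
    out["spend_per_visit"] = out["spend"] / out["visits"].clip(lower=1)
    return out

def train(features):
    """Model training stage: the stand-in 'model' is just a learned threshold."""
    return {"threshold": features["spend_per_visit"].median()}

raw = pd.DataFrame({"spend": [100, 50, 80], "visits": [4, 0, 2],
                    "label": [1, 0, None]})
model = train(extract_features(clean(raw)))
print(model)  # -> {'threshold': 37.5}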
44. Continuous Delivery and Beyond: Accelerating Deployment
Continuous Integration: Dev → Integration test → Application test (automated); Acceptance test → Production (manual)
Continuous Delivery: Dev → Integration test → Application test → Acceptance test (automated); Production (manual)
Continuous Deployment: Dev → Integration test → Application test → Acceptance test → Production (all automated)
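Not from the deck, but here is a tiny illustrative Python sketch of where the automated/manual boundary sits in each model; the stage names mirror the slide, and the mapping is my reading of the diagram.

# Illustrative only: where automation stops in each delivery model.
STAGES = ["dev", "integration test", "application test", "acceptance test", "production"]

LAST_AUTOMATED = {
    "continuous integration": "application test",
    "continuous delivery": "acceptance test",
    "continuous deployment": "production",
}

def run_pipeline(model):
    last_auto = STAGES.index(LAST_AUTOMATED[model])
    for i, stage in enumerate(STAGES):
        mode = "automated" if i <= last_auto else "manual"
        print(f"{model}: {stage} [{mode}]")

run_pipeline("continuous delivery")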
46. Chemistry is not about tubes
DataOps is not about tools
(but the right ones help)
47. Align your spine
Needs → Values → Principles → Practices → Tools
It all starts at Needs: why does this system exist in the first place?
Needs: “We are here to SATISFY THE NEED to help customers save money and the business to execute its strategy.”
Values: “We OPTIMISE for Speed, Accuracy, Experimentation/Feedback and Security.”
Principles: “We LEVERAGE Agile and Lean PRINCIPLES to change the system and make sure resources work on the right thing.” (How do you know which Principles you want to apply?)
Practices: “We DO Self-Service and DataOps to continuously create VALUE for the customer and business.” (How do you know that the Practices actively help the system?)
Tools: “We use _____ to get our work done.” (How do you know it is the best possible tool?)
Source: Kevin Trethewey, Danie Roux, Joanne Perold
48. Avoid building your own anything or being on the bleeding edge. Cost of Delay is high.
49. Data Scientists need a way to manage their projects end-to-end with self-service data AND ARCHITECTURE
[Repeats the slide 12 flow: Business Problem → Evaluate available data → … → Release model to production and schedule → Document release notes and deployment steps → Prepare for change management]
50. Modern serverless and managed infrastructure makes it easy to create data products: just bring code and data.
A single unified platform reduces data fragmentation, overcomes business silos and helps enforce consistent governance.
51. You can make the data supply chain more efficient by unifying data and tools in one platform
[Diagram elements: main source(s) of truth (core data, other data); ETL and extract/load into data warehouse(s) and an analytics platform (off-load); presentation/service layer(s); flatten/merge columns; source cubes on dimensions; reload data; data sharing; BI tools for descriptive and diagnostic analytics; analytical tools for predictive and prescriptive analytics; microservices.]
52. Data Science Platforms add further self-serve capabilities
Data access, prep and exploration: Jupyter, RStudio, Zeppelin, etc.
Automation and machine learning: run experiments, track and compare results
Delivery and model management: publish APIs, interactive web apps, schedule reports
Collaboration and version control: discover, discuss and build on existing work
Compute environment library: customised software stack
Compute grid: orchestrate hardware for development and deployment
Source: Domino Data Labs
61. #6 KEEP CALM AND BUILD TRUST IN DATA
Put effective data governance, security and testing in place
62. #7 Invest in tools and processes to reduce bottlenecks and increase quality
Managed infrastructure and serverless cloud, automation and Data Science Platforms
64. #9 Organise around the ideal data journey instead of teams
Fewer roles, more end-to-end ownership, less friction
Data journey: Acquire → Process → Store → Manage → Share → Use
Roles spanning the journey: Data Engineering, Data Scientists, Data Analysts, Business Stakeholders
67. The DataOps Data Science Factory
[Diagram: Data Product Strategy generates Epics and Stories; Data flows from the rest of the business into Analytics and out to the Customer, with feedback closing the loop. Supporting practices: Agile collaboration, data governance, automated testing, value measurement, version control, configuration management, self-serve infrastructure, automation, continuous integration.]
var current: {
companyName : "MoneySuperMarket",
position : "Head of Data Strategy"
+ " and Advanced Analytics"
};
var previous1: {
companyName : "Dunnhumby",
position : "Insight Director,"
+ " Tesco Clubcard"
};
var previous2: {
companyName : "Lloyds Banking Group",
position : "Senior Manager"
};
var previous3: {
companyName : "British Airways",
position : "Senior Operational Research Analyst"
};
{"about" : "me"}
var username = "harvindersatwal";
var linkedIn = "/in/" + username;
var twitter = "@" + username;
var email = username + "@gmail.com";
Editor's Notes
The average colleague doesn't want to be a data person any more than I want to be an accountant. You have to hire like Google: data people who happen to make good product owners.
Star Wars is not a metaphor for good vs evil but Waterfall vs Agile.
Too much wastage in the process and hard to impact customers directly
The key to adding value is to adapt and borrow principles from Agile software development, starting with alignment of data science with the rest of the organisation and its goals.
Work only on the organisation’s biggest strategic objectives – the ones stakeholders have aligned behind. Objectives the business hypothesises will add the most value.
We don’t know upfront what is going to work.
DataOps is an automated, process-oriented methodology, used by big data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics.[1] DataOps applies to the entire data lifecycle[2] from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.[3] From a process and methodology perspective, DataOps applies Agile software development, DevOps[3] and the statistical process control used in lean manufacturing, to data analytics.[4]
In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional Waterfall sequential methodology. Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development is used. The Agile methodology is particularly effective in environments where requirements are quickly evolving — a situation well known to data analytics professionals.[5]
DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics.[3]
Like lean manufacturing, DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert.[6]
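As a rough illustration of the SPC idea (not any particular product’s implementation), a Python check might derive control limits from history and alert on anomalies; the metric, data and thresholds below are invented.

import statistics

def control_limits(history, sigmas=3.0):
    """Classic SPC-style limits: mean +/- N standard deviations of past values."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

def check_batch(metric, history):
    lower, upper = control_limits(history)
    if not lower <= metric <= upper:
        # Stand-in for an automated alert (email, chat, paging, etc.)
        print(f"ALERT: row count {metric} outside control limits ({lower:.0f}, {upper:.0f})")

# Invented example: daily row counts of an ingested feed.
past_row_counts = [10120.0, 9980.0, 10050.0, 10010.0, 9940.0, 10100.0]
check_batch(4200.0, past_row_counts)  # triggers the alert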
DataOps seeks to provide the tools, processes, and organizational structures to cope with this significant increase in data.[7] Automation streamlines the daily demands of managing large integrated databases, freeing the data team to develop new analytics in a more efficient and effective way.[9]
DataOps embraces the need to manage many sources of data, numerous data pipelines and a wide variety of transformations.[3] DataOps seeks to increase velocity, reliability, and quality of data analytics.[10] It emphasizes communication, collaboration, integration, automation, measurement and cooperation between data scientists, analysts, data/ETL(extract, transform, load) engineers, information technology (IT), and quality assurance/governance.[11] It aims to help organizations rapidly produce insight, turn that insight into operational tools, and continuously improve analytic operations and performance.[11]
This is sometimes really hard for Data Scientists who experiment with data on laptops to accept.
Add Data and Logic Tests
Version control is the foundation upon which a lot of delivery is built.
At a minimum, reviewers of a publication and future researchers should be able to: 1) download all data and software used to generate the results; 2) run tests and review source code to verify correctness; 3) run a build process to execute the computation.
Version control makes it possible to maintain an archived version of the code used to produce a particular result. Examples include Git and Subversion.
Automated build systems document the high-level structure of a computation: which programs process which data, what outputs they produce, etc. Examples include Make and Ant.
Configuration management tools document the details of the computational environment where the result was produced, including the programming languages, libraries, and system-level software the results depend on.
Examples include package managers like Conda that document a set of packages,
containers like Docker that also document system software,
and virtual machines that actually contain the entire environment needed to run a computation.
In an enterprise setting where multiple data scientists could be working on a single project, the first step to doing data science work that scales is implementing version control, whether that’s GitHub, GitLab, Bitbucket, or another solution. Once your team has the ability to track code changes, the next step is to create a process in which they regularly commit their code to the master branch of your repository.
During development, automated tests make programs more likely to be correct; they also tend to improve code quality. During review, they provide evidence of correctness, and for future researchers they provide what is often the most useful form of documentation. Examples include unittest and nose for Python and JUnit for Java.
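For instance, a minimal unittest sketch of such a data/logic test; the function under test is invented for illustration.

import unittest

def normalise_postcode(raw):
    """Invented example of logic worth pinning down with tests."""
    return raw.strip().upper().replace(" ", "")

class TestNormalisePostcode(unittest.TestCase):
    def test_strips_whitespace_and_uppercases(self):
        self.assertEqual(normalise_postcode(" ec1a 1bb "), "EC1A1BB")

    def test_idempotent(self):
        once = normalise_postcode("EC1A1BB")
        self.assertEqual(normalise_postcode(once), once)

if __name__ == "__main__":
    unittest.main()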
You can move beyond Continuous Integration to make deployment even faster.
Traditionally, data science deployment has been a multi-step process that puts the onus on engineering: Engineers would refactor, test, and automate or schedule a data scientist’s model before slowly rolling it out, sometimes months after it was originally built.
Developers that embrace continuous delivery are pushing new application features or changes into production quickly, sometimes with the click of a button.
Increasingly, cloud and data science platforms are filling this void with features such as the ability to deploy models as APIs or schedule code runs which means that as soon as new development passes your tests it can be deployed into production with no dependencies on other teams.
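As a hedged illustration of the model-as-an-API idea (plain Flask here, not any specific platform’s deployment feature; the model and threshold are stand-ins):

from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features):
    """Stand-in for a trained model; replace with a real predict call."""
    return 1.0 if features.get("spend_per_visit", 0) > 37.5 else 0.0

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json(force=True)
    return jsonify({"prediction": score(features)})

if __name__ == "__main__":
    app.run(port=8080)
    # e.g. curl -X POST localhost:8080/predict -d '{"spend_per_visit": 50}'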
For IT teams managing the systems that support models in production and data science environments, the ability to monitor and add resources as data science work expands — while maintaining system availability — is essential.
But that’s just one application. For IT teams managing the resources needed for every deployed model and data science environment across an entire company, a data science platform that offers cluster management features and the ability for IT to dictate the size of the resources made available to data science teams can go a long way toward achieving continuous operations.
Which brings me on to tools
Just as chemistry is not about the tubes but the process of experimentation. DataOps is not tied to a particular technology, architecture, tool, language or framework.
However, some tools are better at supporting DataOps collaboration, orchestration, agility, quality, security, access and ease of use.
When choosing tools, it is best to never start with the tools themselves.
I like to use the spine model by Trethewey, Roux and Perold.
So to decide on the tool you need to understand the practices you employ; to understand what practices to employ you need to define your principles; to define your principles you need to know your values; and to know your values you need to start with the needs you’re trying to fulfil.
We have a set of clear DataOps Practices we want to employ so we have a clear idea of what tools will be fit for purpose.
http://spinemodel.info/explanation/introduction
But first a bit of advice. You should avoid building your own anything or being on the bleeding edge.
Any technology or tool that is really useful will end up being refined or commoditised and turned into a service. Let someone else find the bugs, be the beta tester or end up in a cul-de-sac.
The other factor to take into account is Cost of Delay.
It’s nearly always ignored in business cases. On paper it may be cheaper to build your own solution. However, the months, or years, you take to do that are months, or years, you’re not benefiting from the solution and are handing to your competitors. And it always takes twice as long to build your own solutions, even after you’ve factored in that it’s going to take you twice as long as you think.
Because one of our principles is that we want to make data cycles shorter and shorter it’s important Data Scientists can self-serve not just the data but also the infrastructure, tools and packages
Modern Cloud architecture makes it very easy to create data products rapidly.
Specifically, the move from Infrastructure and Platform as a Service to Software as a Service and Serverless architecture.
That means you have no hardware or software to configure; you just bring your data and code, and all the scaling and optimisation is done for you.
The other advantage is you can use the same tools for dev and production.
You can also use the same data in dev and production, as in the SaaS or Serverless world there’s no need for separation of environments.
We’re so convinced of the benefits we’re actually moving our Data Analytics stack out of AWS onto Google Cloud Platform.
Here’s an example of GCP reference architecture for big data, which isn’t a million miles from our architecture. There’s absolutely no infrastructure to manage within the environment.
The other thing you can do is use the cloud as a centralised platform helping to break down organisational barriers and makes it easier to enforce governance rules.
Modern cloud takes care of the underlying tools but you can add further levels of abstraction and self-service to the compute infrastructure and data pipeline.
Data Science Platforms provide tools that enable teams to work faster and deploy DataOps methodology very easily from choosing the computer infrastructure and environments to run their code on, to automated version control, collaboration tools and one-click deployment to APIs and Dashboards.
The requirements for this type of platform haven’t gone unnoticed; these are just some of the vendors we looked at before settling on Domino Data Labs.
Each has their strengths and weaknesses, so which one is best depends on your use-cases.
There’s another positive side-effect of going down the DataOps route.
You require fewer roles due to self-service.
There’s no need for specialist DevOps, Infrastructure Engineers, Sys Admins or DBAs.
This reduces friction, hand-offs and bottlenecks.
You’re left with just four key roles: Data Scientists, Data Engineers, Data Analysts (a much under-invested group, as everyone wants to be a Data Scientist) and the Line of Business (the stakeholders, and also those who will help integrate your Data Product into other applications).
Worrying About Artificial Intelligence when you can’t even produce a Sales report is not going to get you very far.
You need to worry about being able to action data instead in alignment with the organisation’s strategy and goals.
80% of the battle is knowing what not to work on.
You should not work on projects but products.
Products are in constant use by consumers and have direct customer and business benefit. The benefit scales according to the number of customers who use them. A data product may be a machine learning model, a segmentation, a recommendation engine, a dashboard. They may be integrated into other products. They have an owner, and you get feedback that helps you improve them through iteration.
They are not one-off adhoc pieces of insight that get filed away.
Velocity is th
We need to solve all the problems with Data Science today:
Hamster Wheel Analytics – Doing busywork for the organisation that makes us feel good because we’re putting in a lot of effort and clients appreciate it, but it is never going to move the needle.
The work we do that’s not repeatable because it was never documented
The aimless crash and burn – Where we explore data to find magical insights without a clear objective or, worse, ones the rest of the business has no interest in.
The Roadblock – Work we do that has no route to the customers because it is blocked by corporate silos, IT, Security, lack of infrastructure, tools or willingness to integrate into an end product and remains on a laptop.
Work we do that does make a customer impact which we can’t measure because the feedback loop was never closed.
Instead we can move to the DataOps world – what I like to call the Data Science Factory.
It starts with alignment with the rest of the business’ strategy to create options for Agile Delivery and collaboration to deliver them.
Rapid delivery of Data Products because the governance, trust in data, self-service and automation are in place.
A path to the end-consumer and feedback to measure value for the next iteration.