The NotPetya, SolarWinds, and Kaseya cybersecurity attacks were all executed by injecting malicious code into software shipped by vendors to thousands of companies. These attacks have made the public more aware of the importance of secure software supply chains. But the path from awareness to ensuring a secure supply chain is long. Developers have gotten used to the convenience of easily downloading third-party software into containers, and it is challenging to tighten supply chain security in a company with a sprawl of open-source components.
Scling is a small data engineering startup, and since we ask our customers to entrust us with their data, we must take security seriously. We have been securing our software supply chain since the company was founded. We have no venture capital, and our customers expect quick development iteration cycles, so we have solved supply chain security with minimal effort and minimal impact on developer productivity. In this presentation, we share how we have addressed the different supply chain attack vectors, e.g. Python and JVM packages, with technical solutions. We will present how we automate third-party software upgrades to stay up to date with security patches while minimising the risk of downloading rogue code.
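As a rough illustration of one such technical control, the sketch below verifies a downloaded package artifact against a hash-pinned lock file before installation, so that a compromised mirror or a hijacked release cannot slip rogue code in unnoticed. The file names and lock-file format are illustrative assumptions, not Scling's actual tooling.

```python
# Minimal sketch, assuming a hash-pinned lock file such as
# {"requests-2.31.0.tar.gz": "sha256:..."}; not Scling's actual tooling.
import hashlib
import json
from pathlib import Path

def verify_artifact(artifact: Path, lock_file: Path = Path("requirements.lock.json")) -> bool:
    """Return True only if the artifact's sha256 matches the pinned digest."""
    pinned = json.loads(lock_file.read_text())
    digest = "sha256:" + hashlib.sha256(artifact.read_bytes()).hexdigest()
    return pinned.get(artifact.name) == digest
```

In practice, package managers offer this protection natively, e.g. pip's --require-hashes mode or Gradle's dependency verification, so automated upgrades mainly need to regenerate and review the pinned hashes.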
DataOps requires a cultural shift that brings the principles of lean manufacturing and DevOps to data analytics. It breaks down silos between developers, data scientists, and operators, resulting in rapid cycle times and low error rates.
At Spotify in 2013, the concept of DataOps did not exist but the Swedish company needed a way to align the people, processes, and technologies of the data organization to accelerate the development of high-quality analytics. The result was a Swedish-style DataOps, influenced by Scandinavian culture and agile principles, that enabled the company to become a true data-driven leader.
The quality of data-powered applications depends not only on code, but also on collected data, as well as models trained on data. This renders traditional quality assurance inadequate. We will take a look in our toolbox for more holistic tactics that bridge the gap between code and data quality assurance.
If we could only predict the future of the software industry, we could make better investments and decisions. We could waste fewer resources on technology and processes we know will not last, or at least be conscious in our decisions to choose solutions with a limited lifetime. It turns out that for data engineering, we can predict the future, because it has already happened. Not in our workplace, but at a few leading companies that are blazing ahead. It has also already happened in the neighbouring field of software engineering, which is two decades ahead of data engineering regarding process maturity. In this presentation, we will glimpse into the future of data engineering. Data engineering has gone from legacy data warehouses with stored procedures, to big data with Hadoop and data lakes, on to a new form of modern data warehouses and low-code tools, aka "the modern data stack". Where does it go from here? We will look at the points where data leaders differ from the crowd and combine them with observations on how software engineering has evolved, to see that it points towards a new, more industrialised form of data engineering - "data factory engineering".
Modern data processing environments resemble factory lines, transforming raw data to valuable data products. The lean principles that have successfully transformed manufacturing are equally applicable to data processing, and are well aligned with the new trend known as DataOps. In this presentation, we will explain how applying lean and DataOps principles can be implemented as technical data processing solutions and processes in order to eliminate waste and improve data innovation speed. We will go through how to eliminate the following types of waste in data processing systems:
* Cognitive waste - unclear source of truth, dependency sprawl, duplication, ambiguity.
* Operational waste - overhead for deployment, upgrades, and incident recovery.
* Delivery waste - friction and delay in development, testing, and deployment.
* Product waste - misalignment with business value, detachment from use cases, push-driven development, vanity quality assurance.
We will primarily focus on technical solutions, but some of the waste mentioned requires organisational refactoring to eliminate.
DataOps is the transformation of data processing from a craft with manual processes to an automated data factory. Lean principles, which have proven successful in manufacturing, are equally applicable for data factories. We will describe how lean principles can be applied in practice for successful data processing.
Garbage in, garbage out - we have all heard about the importance of data quality. Having high quality data is essential for all types of use cases, whether it is reporting, anomaly detection, or avoiding bias in machine learning applications. But where does high quality data come from? How can one assess data quality, improve quality if necessary, and prevent bad quality from slipping in? Obtaining good data quality involves several engineering challenges. In this presentation, we will go through tools and strategies that help us measure, monitor, and improve data quality. We will enumerate the factors in data collection and data processing that can cause data quality issues, and we will show how to use engineering to detect and mitigate data quality problems.
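As a small, hedged illustration of what such measurement can look like in code (my own sketch, not the specific tooling covered in the talk), a batch job can compute a few quality metrics and refuse to publish data that falls outside agreed thresholds; the column names and thresholds below are assumptions.

```python
# Illustrative data quality gate for a batch of records; thresholds are assumptions.
from dataclasses import dataclass

import pandas as pd

@dataclass
class QualityReport:
    rows: int
    null_fraction: float
    duplicate_fraction: float

def assess(df: pd.DataFrame, key: str = "user_id") -> QualityReport:
    rows = len(df)
    null_fraction = float(df[key].isna().mean()) if rows else 1.0
    duplicate_fraction = float(df.duplicated(subset=[key]).mean()) if rows else 0.0
    return QualityReport(rows, null_fraction, duplicate_fraction)

def quality_gate(report: QualityReport, min_rows: int = 1000, max_nulls: float = 0.01) -> None:
    # Failing the job loudly is preferable to silently propagating bad data downstream.
    if report.rows < min_rows or report.null_fraction > max_nulls:
        raise ValueError(f"Data quality gate failed: {report}")
```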
Aws uk ug #8 not everything that happens in vegas stay in vegas - Peter Mounce
This document discusses various topics related to DevOps practices at different companies:
1. Netflix prioritizes speed of innovation and availability over running costs when developing software. They found this approach ended up costing less than expected.
2. Riot Games uses tools like Chef to deploy their massively multiplayer online game League of Legends to the cloud. This helps them solve launch issues and scale efficiently.
3. Many companies like Netflix, Riot Games, and Kickstarter test new code and configurations in production at a large scale to continuously improve their systems and user experience.
4. Centralized logging services are important for developers to more easily monitor systems, debug issues, and reduce time spent on call.
In data science, the scientific part is often forgotten - popular workflows, tools, and practices tend to yield experiments that cannot be repeated. Experiments that are not reliable cannot tell us whether changes improve products or not. What works fine during initial development is inadequate for sustainable development of machine learning products. In this presentation, you will learn:
- Why reproducibility matters for data science.
- The practices and workflows that cause reproducibility problems.
- How to build technical environments and processes that enable reproducibility and iterative development of machine learning products (a minimal sketch of one building block follows below).
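The sketch below is my own minimal illustration of one reproducibility building block, not material from the talk: pinning the sources of randomness and recording the environment of a run so it can be repeated and compared later.

```python
# Minimal sketch (illustrative): pin randomness and record run metadata.
import json
import random
import sys

import numpy as np

def reproducible_run(seed: int = 42) -> dict:
    random.seed(seed)
    np.random.seed(seed)
    # ... train and evaluate the model here ...
    run_metadata = {
        "seed": seed,
        "python": sys.version,
        "numpy": np.__version__,
        "argv": sys.argv,
    }
    with open("run_metadata.json", "w") as f:
        json.dump(run_metadata, f, indent=2)
    return run_metadata
```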
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
The lines between Development and Operations have become blurry, and many skills need to be held by both sides. In this talk we will go through the considerations involved in creating development and production environments, covering Continuous Integration, Continuous Deployment and the buzzword "DevOps", along with some real implementations in the industry. And of course we cannot leave out the real enabler of the whole deal, "The Cloud", which gives us a toolset that makes life much easier when implementing all of these practices.
Did you know that the tech elite does not work at all like you do? Most people don't, and don't want to know. The State of DevOps report concluded that there is a 1000x span in delivery time and reliability between the elite and low performers. There is a similar gap for delivery time of data or ML pipelines to production. The gap in ability to compute datasets is higher, somewhere around a million times. We call this the data divide or the AI divide. It is widening over time, since most companies are not aware of its width.
We will share the principles we applied in the most successful Scandinavian crossing of the data divide. We never explicitly shared or described the principles at the time, nor fully understood them, but it is long overdue to enumerate them explicitly.
The presentation will likely be uncomfortable and surprising, because it does not match what you do and what your vendors say. You will have little practical use for the information, since you cannot apply the principles: they contradict many contemporary trends and popular technologies on the market, and you would be unable to overcome the forces of trends, popularity, and messages from vendors. They worked beautifully for us at the time.
1) Google Cloud provides a global infrastructure with regions launching rapidly around the world. Its network is designed for scale and performance without bottlenecks.
2) BigQuery provides petabyte-scale analytics powered by Colossus storage, Capacitor compression, and the high-bandwidth Jupiter network. It can process queries involving trillions of rows in seconds.
3) Google invests heavily in security, offering layers of protection for networks, applications, and data from threats like DDoS attacks. It also has a large partner ecosystem around compliance, privacy, and security.
DevOpsDays Tel Aviv DEC 2022 | Building A Cloud-Native Platform Brick by Bric... - Haggai Philip Zagury
The overwhelming growth of technologies in the Cloud Native foundation overtook our toolbox and completely changed (well, really enhanced) the Developer Experience.
In this talk, I will share my personal journey from the "Operator to Developer's chair" and the practices that helped me along the way as a Cloud-Native Dev ;)
Last Conference 2017: Big Data in a Production Environment: Lessons Learnt - Mark Grebler
Presentation at the 2017 LAST (Lean, Agile, Systems Thinking) Conference.
A presentation about the challenges involved in building a production Big Data system used directly by customers.
Urs Hoelzle, Vice President, Google
Summary
● Google operates two large backbone networks
○ Internet-facing backbone (user traffic)
○ Datacenter backbone (internal traffic)
● Managing large backbones is hard
● OpenFlow has helped us improve backbone performance and reduce backbone complexity and cost
● I'll tell you how
ONS2015: http://bit.ly/ons2015sd
ONS Inspire! Webinars: http://bit.ly/oiw-sd
Watch the talk (video) on ONS Content Archives: http://bit.ly/ons-archives-sd
Vladislav Supalov introduces data pipeline architecture and workflow engines like Luigi. He discusses how custom scripts are problematic for maintaining data pipelines and recommends using workflow engines instead. Luigi is presented as a Python-based workflow engine that was created at Spotify to manage thousands of daily Hadoop jobs. It provides features like parameterization, email alerts, dependency resolution, and task scheduling through a central scheduler. Luigi aims to minimize boilerplate code and make pipelines testable, versioning-friendly, and collaborative.
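To make the description concrete, here is a minimal, hedged Luigi sketch (my own illustration, not taken from the original slides) showing a parameterised task, an explicit dependency, file targets, and how the scheduler resolves the dependency graph.

```python
# Minimal Luigi pipeline sketch: two tasks linked by requires()/output().
import luigi

class ExtractLogs(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/{self.date}.log")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw log lines...\n")

class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractLogs(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/reports/{self.date}.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(f"{len(src.readlines())} lines processed\n")

# Run with e.g.: luigi --module this_module DailyReport --date 2024-01-01 --local-scheduler
```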
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022 - Hosted by Confluent
What happens to the modern data stack (MDS) and analytics as a whole when streaming becomes accessible? For years, the MDS has been centered around batch-based workflows with dbt at its core, introducing software engineering best practices to analysts. But now with even major data warehouses like Snowflake getting in the game, expanding their streaming capabilities, what does that mean?
In this talk, we will explore what streaming in a batch-based analytics world should look like. How does that change your thoughts about implementing testing and performance optimization in your data pipelines? Do you still need dbt? And the question that we are all asking: do you really need a real-time dashboard?
Fluent 2018: Tracking Performance of the Web with HTTP Archive - Paul Calvano
Have you ever thought about how your site’s performance compares to the web as a whole? Or maybe you’re curious how popular a particular web feature is. How much is too much JavaScript? The HTTP Archive has been keeping track of how the web is built since 2010. It enables you to find answers to questions about the state of the web past and present.
Paul Calvano explores how the HTTP Archive works, how people are using this dataset, and some ways that Akamai has leveraged data within the HTTP Archive to help its customers.
Multiplier Effect: Case Studies in Distributions for Publishers - Jon Peck
Join members from both Four Kitchens and Meredith Agrimedia as they discuss the experience of migration and relaunch of the digital presence of two magazines: Successful Farming at Agriculture.com and WOOD Magazine at woodmagazine.com.
We'll start by discussing the scope of the projects, delve into the commonalities and differences, explore their common advertising and analytics implementation, and analyze the unified distribution that supports both brands. By developing the infrastructure simultaneously, brand-agnostic functionality became a priority which in turn created a more modular and flexible system that facilitated open-sourcing and cross-organizational sharing. Thanks to the codebase approach and experience, the first site took about 6 months and the second took less than 6 weeks.
The document discusses 7 habits of data-effective companies. It describes how companies have evolved through different digital maturity phases, from analog to born-digital. The key differences observed between phases include impact on cost, value extraction, and capabilities. The 7 habits discussed are: treating data processing as an industrial process, focusing on latency and waste reduction, being use case driven and value stream aligned, initially centralizing data, architecting for failure and sharing, treating it as a software engineering problem, and following the Unix philosophy of building specialized components. The document provides examples and illustrations for each habit.
Introduction to Data Engineer and Data Pipeline at Credit OK - Kriangkrai Chaonithi
The document discusses the role of data engineers and data pipelines. It begins with an introduction to big data and why data volumes are increasing. It then covers what data engineers do, including building data architectures, working with cloud infrastructure, and programming for data ingestion, transformation, and loading. The document also explains data pipelines, describing extract, transform, load (ETL) processes and batch versus streaming data. It provides an example of Credit OK's data pipeline architecture on Google Cloud Platform that extracts raw data from various sources, cleanses and loads it into BigQuery, then distributes processed data to various applications. It emphasizes the importance of data engineers in processing and managing large, complex data sets.
These are the slides of the second talk of the first Tech Talk@TransferWise Singapore, which happened on the 23rd of November 2017.
These slides share how TransferWise codebase is moving from a monolith architecture to a microservices architecture.
This document discusses building a data platform in the cloud. It covers the evolution of data platforms from monolithic architectures to distributed event-driven architectures using a data lake. Key aspects of a cloud data platform include collecting and persisting all data in a data lake for standardized access, near real-time processing using streaming technologies, and building the platform using either fully managed or DIY/hybrid approaches on AWS. Design principles focus on event-driven separation of data producers and consumers and choosing the right technology for the problem.
DataOps is a methodology and culture shift that brings the successful combination of development and operations (DevOps) to data processing environments. It breaks down silos between developers, data scientists, and operators, resulting in lean data feature development processes with quick feedback. In this presentation, we will explain the methodology, and focus on practical aspects of DataOps.
This document discusses the steps to building a cloud native practice. It begins with introducing the speaker and what cloud native means. The 12 steps then cover: 1) version control, 2) continuous integration pipelines, 3) stateless applications, 4) containerization, 5) common services, 6) Kubernetes, 7) observability, 8) monitoring, 9) domain-driven design, 10) microservices and serverless architectures, 11) cloud strategies, and 12) reconstructing architectures with a focus on responsibilities of architects and challenges of open source.
The document discusses Intuit's transition to using canary releases in Kubernetes instead of a separate performance environment. It describes how Intuit collects metrics during canary releases to detect performance issues before fully deploying to production. The canary analysis model measures pod resource usage, JVM metrics, and application metrics to compute a score. Intuit aims to refine the model and scale the canary release process by integrating with tools like Argo Rollouts, Prometheus, and a service mesh.
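As a rough, hedged sketch of what a canary scoring model can look like (a simplification of the general idea, not Intuit's actual model), the snippet below compares canary metrics against the baseline, combines them into a weighted score, and gates promotion on a threshold; the metric names and weights are assumptions.

```python
# Illustrative canary scoring: weighted comparison of canary vs baseline metrics.
from typing import Mapping

WEIGHTS = {"cpu": 0.3, "error_rate": 0.5, "latency_p99": 0.2}  # assumed weights

def canary_score(canary: Mapping[str, float], baseline: Mapping[str, float]) -> float:
    """Return a score in [0, 100]; higher means the canary behaves like the baseline."""
    score = 0.0
    for metric, weight in WEIGHTS.items():
        # ratio > 1 means the canary is worse than the baseline for this metric
        ratio = max(canary[metric], 1e-9) / max(baseline[metric], 1e-9)
        score += weight * min(1.0, 1.0 / ratio) * 100
    return score

def promote(canary: Mapping[str, float], baseline: Mapping[str, float], threshold: float = 90.0) -> bool:
    return canary_score(canary, baseline) >= threshold
```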
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/06/how-do-we-enable-edge-ml-everywhere-data-reliability-and-silicon-flexibility-a-presentation-from-edge-impulse/
Zach Shelby, Co-founder and CEO of Edge Impulse, presents the “How Do We Enable Edge ML Everywhere? Data, Reliability and Silicon Flexibility” tutorial at the May 2022 Embedded Vision Summit.
In this talk, Shelby reveals insights from the company’s recent global edge ML developer survey, which identified key barriers to machine learning adoption, and shares the company’s vision for how the industry can overcome these obstacles. Unsurprisingly, the first critical obstacle identified by the survey is data. But the issue isn’t simply a lack of massive datasets, as is often assumed. On the contrary, the biggest opportunities in ML will be enabled by highly custom, industry-specific and even user-specific data. We need to master data lifecycle and active learning techniques that enable developers to move quickly from “zero to dataset.”
The real and perceived inability of today’s ML algorithms to reach the ultra-high accuracy needed in industrial systems is another key barrier. New techniques for explainable ML, better testing, sensor fusion and model fusion will increasingly allow developers to achieve industrial-grade reliability. Finally, in order to accelerate ML adoption in embedded products, we must recognize that most developers can’t immediately upgrade their systems to use the latest chips — a problem that is compounded by today’s chip shortages. To enable ML everywhere, we have to find ways to deploy ML on today’s silicon, while ensuring a smooth transition to new devices with AI acceleration in the future.
End-to-end pipeline agility - Berlin Buzzwords 2024 - Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
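One way to picture end-to-end pipeline testing (a hedged sketch of the general approach, not necessarily the exact technique used in the talk) is to point the workflow orchestrator at a temporary directory and let it resolve and run every task on small test inputs, asserting on the final output.

```python
# Hedged sketch: run a whole (tiny) pipeline end-to-end in a test via the orchestrator.
import tempfile
from pathlib import Path

import luigi

class Ingest(luigi.Task):
    base = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"{self.base}/raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("a\nb\nc\n")

class Aggregate(luigi.Task):
    base = luigi.Parameter()

    def requires(self):
        return Ingest(base=self.base)

    def output(self):
        return luigi.LocalTarget(f"{self.base}/count.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(str(len(src.readlines())))

def test_pipeline_end_to_end():
    with tempfile.TemporaryDirectory() as tmp:
        assert luigi.build([Aggregate(base=tmp)], local_scheduler=True)
        assert Path(tmp, "count.txt").read_text() == "3"
```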
Schema on read is obsolete. Welcome metaprogramming..pdf - Lars Albertsson
How fast can you modify your data collection to include a new field, make all the necessary changes in data processing and storage, and then use that field in analytics or product features? For many companies, the answer is a few quarters, whereas others do it in a day. This data agility latency has a direct impact on companies' ability to innovate with data. Schema-on-read has been a key strategy to lower that latency - as the community has shifted towards storing data outside relational databases, we no longer need to make a series of schema changes through the whole data chain, coordinated between teams to minimise operational risk. Schema-on-read comes with a cost, however. Errors that we used to catch during testing or in early test deployments can now sneak into production undetected and surface as product errors or hard-to-debug data quality problems later than with schema-on-write solutions.
In this presentation, we will show how we have rejected the tradeoff between slow schema change rate and quality to achieve the best of both worlds. By using metaprogramming and versioned pipelines that are tested end-to-end, we can achieve fast schema changes with schema-on-write and the protection of static typing. We will describe the tools in our toolbox - Scalameta, Chimney, Bazel, and custom tools. We will also show how we leverage them to take static typing one step further and differentiate between domain types that share representation, e.g. EmailAddress vs ValidatedEmailAddress or kW vs kWh, while maintaining harmony with data technology ecosystems.
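The tooling described in the talk is Scala-based (Scalameta, Chimney, Bazel). As a language-neutral illustration of the closing idea, distinguishing domain types that share a runtime representation, here is a small Python sketch using typing.NewType; the type names are illustrative, not taken from the talk.

```python
# Illustrative only: distinct domain types over shared representations.
from typing import NewType

EmailAddress = NewType("EmailAddress", str)
ValidatedEmailAddress = NewType("ValidatedEmailAddress", str)
KilowattHours = NewType("KilowattHours", float)

def validate(address: EmailAddress) -> ValidatedEmailAddress:
    if "@" not in address:
        raise ValueError(f"not an email address: {address}")
    return ValidatedEmailAddress(address)

def send_invoice(to: ValidatedEmailAddress, consumed: KilowattHours) -> None:
    # A static type checker rejects a raw EmailAddress or a bare float here,
    # even though both share the same runtime representation.
    print(f"Invoicing {to} for {consumed} kWh")
```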
Industrialised data - the key to AI success.pdf - Lars Albertsson
The DORA research concluded that there are orders of magnitude difference in delivery KPIs between leaders and the incumbents. In this presentation, we will describe the corresponding "data divide" in capabilities in data engineering, and how the leading companies have adopted an industrial approach to data management, enabling them to leap so far ahead. We will explain why "data industrialisation" is a key factor for succeeding going from AI prototypes to sustainable value from AI in production. We will also describe a path for companies outside the technology elite to cross the data divide into the industrialised data realm and share some very honest learnings from helping companies go that path.
Scalameta is a library for static analysis and processing of Scala source code, which supports syntactic and semantic analysis. In this presentation, we explain how Scalameta works, and how you can use Scalameta for custom code analysis. We demonstrate how we have used Scalameta to automate schema management and privacy protection.
How to not kill people - Berlin Buzzwords 2023.pdf - Lars Albertsson
With the rise of artificial intelligence, we give more control of our lives to software. We thereby introduce new risks, and the fatal Uber crash in 2018 is the first example of AI causing an accidental death. It will be up to us as software engineers to build systems safe and reliable enough to entrust with important decisions. Our culture, however, includes praising companies that move fast and break things (Facebook), celebrate principled confrontation (Uber), fake self-driving demonstrations (Tesla), and are right, a lot (Amazon). As an industry, we need to radically improve to meet the challenge, or more people will die.
In this presentation, we will look at aviation - the industry most successful at continuously improving safety - and attempt to learn. We will look at aviation safety principles, compare with similar practices in software engineering, and see how we can translate safety principles that have worked well in aviation to the software engineering domain.
Video: https://youtu.be/IitY9yZFPSA
The document discusses various challenges with artificial intelligence (AI) systems, including ensuring their decisions and outputs are sensible given the inputs. It notes that while machines can handle rules-based decisions, they struggle with new situations unlike humans. When trained on examples, AI may produce unreasonable outputs if the input data is not sensible. Proper anticipation and preparation is needed to address issues like an age-height prediction model providing implausible results for outliers. The document also states that AI is very difficult to develop and protect against adversarial attacks, and that its societal impact will be massive, requiring regulations and collaboration between technology, legal and political fields.
The right side of speed - learning to shift left - Lars Albertsson
Many disciplines are on the wrong side of speed - there is a tradeoff between development speed and security, data science, compliance, etc. Let us look at disciplines that have succeeded in shifting left by integrating with development, and learn successful patterns: testing, DevOps, agile, DataOps.
Mortal analytics - Covid-19 and the problem of data quality - Lars Albertsson
Social media are full of Covid-19 graphs, each pointing to an "obvious" conclusion that fits the author's agenda. Unfortunately, even the official sources publish analytics that point at incorrect conclusions. Bad data quality has become a matter of life and death.
We look at the quality problems with official Covid-19 data presentations. The problems are common in all domains, and solutions are known, but not widespread. We describe tools and patterns that data mature companies use to assess and improve data quality in similar situations. Mastering data quality and data operations is a prerequisite for building sustainable AI solutions, and we will explain how these patterns fit into machine learning product development.
New times, new hype. Buzzwords like big data and Hadoop have been replaced by AI and machine learning. But it's not technology, old or new, nor machine learning that separates the companies that get value from data from the companies that struggle.
When big data was at its peak, several young, technology-intensive companies absorbed big data successfully. They acquired large Hadoop clusters, learned to master data, and created valuable products with machine learning. However, big data has had limited impact at traditional companies, and the list of long and expensive data lake and Hadoop projects is long.
The key to implementing successful projects that transform data into business value is to democratise data - making it accessible and easy to use within an organisation.
Eventually, time will kill your data processing - Lars Albertsson
Race conditions and intermittent failures, daylight saving time, time zones, leap seconds, and overload conditions - time is a factor in many of the most annoying problems in computer systems. Data engineering is not exempt from problems caused by time, but also has a slew of unique problems. In this presentation, we will enumerate the time-related problems that we have seen cause trouble in data processing system components, including data collection, batch processing, workflow orchestration, and stream processing. We will provide examples of time-related incidents, and also tools and tricks to avoid timing issues in data processing systems.
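One of the simplest defences against several of these failure modes (a general pattern, stated here as my own minimal sketch rather than the talk's specific recommendations) is to keep all timestamps inside the pipeline in UTC and convert to local time only at the edges.

```python
# Minimal sketch: UTC everywhere inside the pipeline, local time only at the edges.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def event_time_utc() -> datetime:
    # Record timestamps in UTC at collection time.
    return datetime.now(tz=timezone.utc)

def daily_partition(ts_utc: datetime) -> str:
    # Bucketing by UTC date gives unambiguous, DST-free partitions.
    return ts_utc.strftime("%Y-%m-%d")

def to_local_for_display(ts_utc: datetime, tz_name: str = "Europe/Stockholm") -> str:
    # Convert to a local timezone only when presenting to users.
    return ts_utc.astimezone(ZoneInfo(tz_name)).isoformat()
```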
Eventually, time will kill your data pipeline - Lars Albertsson
The document discusses challenges related to handling time in data pipelines. It begins by introducing different categories of data from a time perspective, including facts, state, and claims data. It then discusses issues related to calendars, time zones, clocks, and how these impact data quality. Specific challenges discussed include joining data with different time scopes, handling late or incomplete data, and expressing business logic that requires looking at historical data. Various patterns and anti-patterns are presented for handling these challenges, such as using offline replicas for database dumps, ingest time bucketing of events, and recursive or striding dependencies to allow for backfilling historical data.
This document discusses using Kubernetes as a data platform. It describes using use case driven development to build the initial platform, focusing on simple use cases that provide value. It also covers onboarding new data sources, an overview of the data platform architecture including data lakes and batch/online services, deployment approaches both on-premise and cloud native, and addressing challenges like GDPR compliance and autoscaling. Lessons learned include selecting cloud infrastructure based on data locations and using Kubernetes for its support and to avoid maintaining separate clusters.
Many companies start their big data and AI journey by hiring a team of data scientists, give them some data, and expect them to work their miracles. Although it may yield results, it is not an efficient way to use data scientists. We will explain the problems that occur, and how to adapt the context to get business value from data scientists.
- Why data science teams might fail to deliver results
- What data scientists need to be efficient
- What talent you need in addition to data scientists
Big data is primarily associated with AI and new technology. It is as much a revolution in cooperation patterns, however. Big data entails the democratisation of data within an organisation, enabling agile, data-driven innovation in a manner that was previously unavailable. Knowing this, how can you work as an organisation to harvest the fruits and what can go wrong?
Privacy and personal integrity have become a focus topic, due to the upcoming GDPR deadline in May 2018. GDPR puts limits on data storage, retention, and access, and also gives users the right to have their data deleted and to get information about the data stored. This constrains technical solutions, and makes it challenging to build systems that efficiently make use of sensitive data. This talk provides an engineering perspective on privacy. We highlight pitfalls and topics that require early attention. We describe technical patterns for complying with the "right to be forgotten" without sacrificing the ability to use data for product features. The content of the talk is based on real world experience from handling privacy protection in large scale data processing environments.
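One commonly cited pattern in this space is crypto-shredding: encrypt each user's personal data with a per-user key and delete the key when the user asks to be forgotten, which renders every stored copy unreadable. The sketch below is my own illustration of that pattern using the cryptography package, not necessarily the specific patterns presented in the talk.

```python
# Crypto-shredding sketch: per-user keys; deleting a key "forgets" that user's data.
from cryptography.fernet import Fernet

class PerUserVault:
    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}  # in practice, a managed key store

    def encrypt(self, user_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(user_id, Fernet.generate_key())
        return Fernet(key).encrypt(plaintext)

    def decrypt(self, user_id: str, token: bytes) -> bytes:
        return Fernet(self._keys[user_id]).decrypt(token)

    def forget(self, user_id: str) -> None:
        # Deleting the key shreds every ciphertext for this user, wherever it is stored.
        self._keys.pop(user_id, None)
```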
Test strategies for data processing pipelines, v2.0 - Lars Albertsson
This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing.
Many companies have data with great potential. There are many ways to go wrong with Big Data projects, however; the difference between a successful and a failed project can be huge, both in cost and return on investment. In this talk, we will describe the most common pitfalls, and how to avoid them. You will learn to:
- Be aware of the existing risk factors in your organisation that may cause a data project to fail.
- Learn how to recognise the most common and costly causes of project failure.
- Learn how to avoid or mitigate project problems in order to ensure return on investment in a lean manner.
This talk provides an engineering perspective on privacy protection. The intended audience is architects, developers, data scientists, and engineering managers that build applications handling user data. We highlight topics that require attention at an early design stage, and go through pitfalls and potentially expensive architectural mistakes. We describe a number of technical patterns for complying with privacy regulations without sacrificing the ability to use data for product features. The content of the talk is based on real world experience from handling privacy protection in large scale data processing environments.
As companies adopt data processing technologies and add data-driven features to user-facing products, the need for effective automated test techniques for data processing applications increase. We go through anatomy of scalable data streaming applications, and how to set up test harnesses for reliable integration testing of such applications. We cover a few common anti-patterns that make asynchronous tests fragile, and corresponding patterns for remediation. We will also mention virtualisation components suitable for our testing scenarios.
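A typical remedy for fragile asynchronous tests (a general pattern, sketched here as my own illustration rather than the talk's exact recommendations) is to replace fixed sleeps with bounded polling for an observable condition; the example call at the bottom uses a hypothetical output_topic helper.

```python
# Bounded polling instead of fixed sleeps in asynchronous integration tests.
import time
from typing import Callable

def await_condition(predicate: Callable[[], bool],
                    timeout_s: float = 30.0,
                    interval_s: float = 0.1) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise AssertionError(f"condition not met within {timeout_s}s")

# Example (output_topic is a hypothetical test helper wrapping the output stream):
# await_condition(lambda: output_topic.message_count() >= 1)
```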
A primer on building real time data-driven products - Lars Albertsson
This document provides an overview of building real-time data products using stream processing. It discusses why stream processing is useful for reacting to data with latencies from 1 second to 1 hour. Key aspects covered include using a unified log to decouple producers and consumers, common stream processing building blocks like filtering and joining, and technologies like Spark Streaming, Kafka Streams, and Flink. The document also addresses challenges like out-of-order events and software bugs, and architectural patterns for handling imperfections in streams.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
3. www.scling.com
What do we contribute?
● Internet, digitalisation + many good little things
● Ability to measure and manipulate populations at scale
● Monetising bad security
○ Stolen CPU cycles → money
○ Ransomware
3
https://spinbackup.com/blog/24-biggest-ransomware-attacks-in-2019/
https://blog.chainalysis.com/reports/2022-crypto-crime-report-preview-ransomware/
https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election
4. www.scling.com
Security vs productivity
● Risk-management rarely wins
● Employees have conflicting definitions of success
4
Revenue generation, features, delivery speed
vs
Security reviews, pentests, password reauthentication, phishing campaigns, firewalls, …
5. www.scling.com
Security AND productivity
A simple recipe for application security:
- While we value items on the right, we value items on the left more.
- Invent alternatives that are aligned with speed
- Give employees aligned definitions of success
5
Left: SSO, password managers, infrastructure as code, hardware MFA, ephemeral containers, …
Right: security reviews, pentests, password reauthentication, phishing campaigns, firewalls, …
7. www.scling.com
Quality and ops
7
Aligning quality with speed: TDD, test automation, XP, agile, cross-functional teams, trunk-based development, continuous integration, continuous delivery, DevOps, dev-friendly ops tooling, containers
8. www.scling.com
Manual, mechanised, industrialised
8
● Manual: muscle-powered, few tools, human touch for every step
● Mechanised: direct human control, machine tools, low investment with direct return
● Industrialised: scaled processes, machine tools; challenges: scale, logistics, legal, organisation, faults, ...
9. www.scling.com
IT craft to factory
9
Craft side: waterfall security, traditional application delivery, traditional operations, traditional QA, hand-managed infrastructure
Factory side: DevSecOps, agile, containers, DevOps, CI/CD, infrastructure as code
10. www.scling.com
Quality, speed - choose two
10
● Toyota: Low defect rates AND high margins per vehicle
● State of DevOps report: High reliability AND high deployment rate
○ We have industrialised software engineering
Quality vs speed → quality AND speed (1000x span in availability metrics)
11. www.scling.com
Themes of good presentations, IMHO
● We have seen lots of X / X from a different angle. Here are some patterns.
● We have context Y. Here is how we work.
● We did a thing Z. Here is what we learnt.
11
We need to share how we work in order to make faster progress.
13. www.scling.com
Data industrialisation
13
Diagram: DW → enterprise big data failures → "modern data stack" (traditional workflows, new technology; the 4GL / UML phase of data engineering) → "data factory engineering". Annotations: ~10 year capability gap, data engineering education.
14. www.scling.com
How data leaders work
14
Diagram: online systems feed data into a data platform & lake; data processed offline in the "data factory" flows back as data innovation & functionality. Scale indicators: 100+K daily datasets, 30% of staff are daily BigQuery users. Value from data!
16. www.scling.com
Efficiency is sacred
● Productivity is our unique selling point
○ Client value from data is unpredictable
○ Clients don't know what they want
○ Quick experiments & pivot
● Minimal operational overhead
○ Pipelines / person
○ Datasets / day / person
● Nothing must undermine our USP
16
17. www.scling.com
Our security strategy
● Invest where it improves productivity
○ Cloud single sign on
○ Cloud identity management
○ Workload identities over secret tokens
○ Hardware multifactor authentication
○ Infrastructure as code
○ Patch management *
● Homogeneity over autonomy
○ Few technologies
○ Few processes
○ Processes encoded in code *
17
● Minimal attack surface *
● Strict asset management
○ Digital assets as code
○ Process to align assets with code
○ Explicit manual asset management
● Lean on Google
18. www.scling.com
Minimising attack surfaces
● Few ecosystems
○ Ubuntu
○ Scala + Spark
○ Python
● Few components
○ Reuse over perfect match
● Few versions
○ Single version per third party component
○ Opens gates to dependency hell *
■ Control or autonomous cells
18
20. www.scling.com
Which version?
● Version specifications
○ Exact version
■ Good for application stability
○ Range
○ Latest
■ Good for patch latency
● Specification choice tradeoffs
○ Provider trust
○ Patch latency
20
● Upgrade tradeoffs
○ Vulnerability patching
○ Rogue code
○ Bugs fixed
○ Bugs introduced
○ Necessary work
● Our goal:
○ Exact version
○ Transitive dependencies locked
○ Automatically updated
● Let's pursue!
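To make the goal of exact versions with locked transitive dependencies enforceable rather than aspirational, a small guard in the build can reject lock files that contain anything other than exact pins. Below is a minimal sketch in Python, assuming a pip-style requirements lock file; the file name, regex, and script are illustrative, not Scling's actual tooling.

```python
# Hypothetical CI guard: fail the build if any dependency in the lock file
# is not pinned to an exact version ("pkg==1.2.3").
import re
import sys

PIN_RE = re.compile(r"^[A-Za-z0-9._\[\]-]+==\S+")

def unpinned_lines(lock_file: str) -> list:
    bad = []
    with open(lock_file) as f:
        for raw in f:
            line = raw.strip()
            # Skip blanks, comments, hash continuation lines, and includes.
            if not line or line.startswith(("#", "--hash", "-r")):
                continue
            if not PIN_RE.match(line):
                bad.append(line)
    return bad

if __name__ == "__main__":
    offending = unpinned_lines(sys.argv[1] if len(sys.argv) > 1 else "requirements.txt")
    if offending:
        print("Not exact-pinned:", *offending, sep="\n  ")
        sys.exit(1)
```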
21. www.scling.com
Levels of up to date
● No new version of A exists
● New A version exists. Application verified ok with upgrade.
● New A version exists. Unclear whether upgrade breaks application.
● New A version exists. Upgrade breaks application.
○ We use a deprecated API.
○ New version has bug.
● New A version exists. Upgrade breaks dependency B.
○ New version of B exists.
○ No new version of B exists.
○ A and B must atomically upgrade
21
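These levels come back later as the outcomes an upgrade bot can actually observe. As a rough illustration (the names are hypothetical, not taken from Scling's code), they can be encoded as a small enumeration that the rest of the automation reasons about:

```python
# Illustrative classification of "levels of up to date" for a dependency A.
from enum import Enum, auto

class UpgradeStatus(Enum):
    NO_NEW_VERSION = auto()      # no new version of A exists
    VERIFIED_OK = auto()         # new A version exists, application verified ok
    UNVERIFIED = auto()          # new A version exists, unclear if it breaks us
    BREAKS_APPLICATION = auto()  # we use a deprecated API, or the new version has a bug
    BREAKS_DEPENDENCY_B = auto() # upgrading A breaks dependency B
```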
22. www.scling.com
A bot friendly task
● There is some order that moves us forward through hell
● Slow trial and error cycle
○ Compile or test takes minutes
● There are bots
○ Dependabot, Scala steward
■ Way too complex (100/20 KLOC, 1000s lines of doc / examples)
○ Do not cover our needs
■ Application correctness
■ Our ecosystems
22
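The core of such a bot is a slow but simple trial-and-error loop: apply one candidate upgrade, let the full test suite judge it, keep it on success and revert on failure. A minimal sketch, assuming a Bazel monorepo and git; the helper names are hypothetical and a real bot needs far more bookkeeping:

```python
# Hypothetical trial-and-error upgrade step. The strong process (all tests
# run on every change) is what makes the test suite a usable oracle here.
import subprocess

def run_tests() -> bool:
    """Run the full test suite; Bazel's caching keeps repeated runs cheap."""
    return subprocess.run(["bazel", "test", "//..."]).returncode == 0

def try_upgrade(apply_upgrade, description: str) -> bool:
    """Apply one candidate upgrade; commit it if tests pass, revert otherwise."""
    apply_upgrade()  # e.g. rewrite one entry in a lock file or version.bzl
    if run_tests():
        subprocess.run(["git", "commit", "-am", f"auto-upgrade: {description}"], check=True)
        return True
    subprocess.run(["git", "checkout", "--", "."], check=True)  # throw the attempt away
    return False
```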
23. www.scling.com
With a strong process
● we can reason and automate
○ Trial and error forward
● Process strength
○ Faulty change is detected before prod
○ Non-code changes unlikely to affect correctness
○ Self-bootstrapping
23
24. www.scling.com
Strong process challenges
● Everything not covered by tests
● Test infrastructure / setup defined by code
○ How to test?
○ How to bootstrap?
● Non-deterministic processes / components
○ Mostly deterministic is ok
24
Extended test suite:
● Testsuite bootstrap
● Continuous deployment testsuite
● Non-production functionality
○ Dev tooling
○ Web
○ …
25. www.scling.com
Our build process
● Monorepo + trunk-based
○ Platforms + all client code and pipelines
○ Single version of platform
● All tests verified* for every change
○ Tests do not require cloud resources
● Build + test speed challenging
○ Spark → seconds of startup time → slow tests
● Simple recipe for speed:
○ Avoid doing things → caching
○ Do things in parallel
25
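The caching half of that recipe boils down to keying work on a hash of its inputs and skipping it when nothing changed; Bazel applies this per build and test action. A toy illustration of the idea in Python, with a made-up cache layout:

```python
# Toy content-addressed cache: skip an action when its inputs are unchanged.
import hashlib
import os

def cache_key(input_files: list) -> str:
    digest = hashlib.sha256()
    for path in sorted(input_files):
        with open(path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()

def run_cached(input_files: list, action, cache_dir: str = ".cache") -> str:
    os.makedirs(cache_dir, exist_ok=True)
    marker = os.path.join(cache_dir, cache_key(input_files))
    if os.path.exists(marker):
        return "cache hit: avoided doing things"
    action()
    open(marker, "w").close()  # record that this exact input set succeeded
    return "executed"
```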
26. www.scling.com
Bazel
● Designed for monorepos & strong process
○ Lazy tree evaluation
○ Isolated sandboxes
● Unmatched performance features
○ Isolation → reliable caching
○ Test result caching
○ Remote caching
○ Parallelism
○ Remote execution
26
● Great for stuff used by Google
● Catching up on
○ Docker
○ Scala
○ Third-party dependencies
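To make the Bazel point concrete: each target declares its sources and dependencies, so Bazel can evaluate the build tree lazily, run tests in isolated sandboxes in parallel, and reuse any cached result whose inputs did not change. A hypothetical BUILD file sketch (Starlark, which is syntactically a Python subset); the target names and paths are invented for the example:

```python
# BUILD - hypothetical targets for one pipeline in the monorepo.
load("@rules_python//python:defs.bzl", "py_library", "py_test")

py_library(
    name = "pipeline",
    srcs = ["pipeline.py"],
    deps = ["//platform:datasets"],  # illustrative internal dependency
)

py_test(
    name = "pipeline_test",
    srcs = ["pipeline_test.py"],
    deps = [":pipeline"],
    size = "small",  # small, hermetic tests benefit most from result caching
)
```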
27. www.scling.com
Dependency version control
● Transitive, locked
○ Python
○ JVM
○ Lock files in version control
● Not transitive, locked
○ Direct downloads
○ Bazel plugins
○ Container base images
○ version.bzl file
■ → bazel, python, bash
27
● Apt packages
○ Latest*
● Some Google components
○ VM base images, misc
○ Latest
● Employee devices
○ Manual
● Unmanaged leftovers
○ SaaS
○ Otherwise minimal exposure
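For the dependencies that have no lock-file ecosystem, a single version.bzl with pinned versions and digests can act as the source of truth, loadable from Bazel and exportable to Python and bash. A minimal sketch with invented names, versions, and digests, not Scling's actual file:

```python
# version.bzl (Starlark) - hypothetical pins for directly downloaded artefacts,
# Bazel plugins, and container base images.
UBUNTU_BASE_IMAGE = "ubuntu:22.04@sha256:<digest pinned here>"
RULES_PYTHON_VERSION = "0.31.0"
RULES_PYTHON_SHA256 = "<sha256 of the release archive>"

# One dict so Bazel rules, Python scripts, and bash wrappers can consume
# the same pins without duplicating them.
VERSIONS = {
    "ubuntu_base_image": UBUNTU_BASE_IMAGE,
    "rules_python": RULES_PYTHON_VERSION,
}
```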
32. www.scling.com
Can we make apt install deterministic?
● apt-get typically provides latest
○ Determined by Packages.gz
○ Download during build breaks determinism & caching?
● Distroless bazel package_manager:
○ Exact Packages.gz specification
○ Debian: Versioned Packages.gz
○ Ubuntu: Only latest Packages.gz
● Compromise on determinism
○ Download Packages.gz before build
○ Caching still ok
● Not running apt scripts seemed to work. For a while.
○ Subtle low-level container failures
○ Abandoned
32
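The compromise described above can be as small as a pre-build step that downloads Packages.gz once and records exact versions for the packages the image needs, so the build itself installs with "pkg=version" pins and stays cacheable. A hedged Python sketch; the mirror URL and package set are illustrative:

```python
# Hypothetical pre-build snapshot: resolve exact apt versions from one
# Packages.gz download, then feed "pkg=version" pins to the image build.
import gzip
import urllib.request

PACKAGES_GZ = "http://archive.ubuntu.com/ubuntu/dists/jammy/main/binary-amd64/Packages.gz"
WANTED = {"curl", "ca-certificates"}  # illustrative package set

def snapshot_versions(url: str, wanted: set) -> dict:
    raw = gzip.decompress(urllib.request.urlopen(url).read()).decode("utf-8")
    versions, name = {}, None
    for line in raw.splitlines():
        if line.startswith("Package: "):
            name = line.split(": ", 1)[1]
        elif line.startswith("Version: ") and name in wanted:
            versions[name] = line.split(": ", 1)[1]
    return versions

if __name__ == "__main__":
    for pkg, version in sorted(snapshot_versions(PACKAGES_GZ, WANTED).items()):
        print(f"{pkg}={version}")  # later consumed as: apt-get install pkg=version
```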
33. www.scling.com
Scling collaboration models
33
● Single unified platform
○ Monorepo + trunk-based process
○ Separate instance per client
○ All test suites run on every change
● Factories are adapted to constraints and important properties
○ Ok: Security, risk, quality, availability, compliance
○ No: Preferred technology, work processes
Refinement factory
● Raw data in
● Valuable data out
● Non-technical clients
● "Easy" domain
Joint factory
● Hybrid teams
● Domain experts
● Data apprentices
● Scling runs data platform
Client factory
● Start as joint factory
● Goal: Client independent
34. www.scling.com
Divided, multi-tenant platform
34
Diagram: Orion, the base data platform, runs on GCP (but is portable to other clouds) with an isolated instance per client. Saturn provides non-essential operational tooling. CLI tools: "ion" and "scli".
40. www.scling.com
Resolution classifications
● No new version of A exists
● New A version exists. Application verified ok with upgrade.
● New A version exists. Unclear whether upgrade breaks application.
● New A version exists. Upgrade breaks application.
○ We use a deprecated API.
○ New version has bug.
● New A version exists. Upgrade breaks dependency B.
○ New version of B exists.
○ No new version of B exists.
○ A and B must atomically upgrade
40
Mapped outcomes from the diagram: not found, success, test failure; several of the failure cases are transient.
47. www.scling.com
Google SLSA evaluation
● Supply-chain Levels for Software Artifacts
○ Maturity model
● SLSA 1: yes
● SLSA 2: yes
● SLSA 3: some
○ Prioritising speed over Ephemeral Environment, Isolated, Non-Falsifiable
● SLSA 4: some
○ Parameterless
○ Dependencies complete (except apt)
47
48. www.scling.com
Concluding remarks
● Challenges?
○ Operational tuning to balance rate vs €
○ Google cloud_sql_proxy patch update took us down
○ Diva dependencies need custom solutions
○ Which test failure to address?
● Future?
○ Upgrade conditional on container scanning?
○ Dead dependency detection?
● Open source? No.
○ Specific to our environment
○ Bot is easy. Just do it.
○ Strong process challenging. But rewarding.
○ Offer: A copy of the code for a C-level lunch date. :-)
48