What is Kedro?
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering best practice, such as modularity, separation of concerns and versioning, and applies them to machine-learning code.
Kedro 2-minute Intro Video: https://youtu.be/KEdmJ2ADy_M
Kedro Docs: https://kedro.readthedocs.io
Kedro GitHub repo: https://github.com/quantumblacklabs/kedro
Meetup: https://www.meetup.com/f7324858-b804-4ed8-ba45-580c262189f1/events/280986950/
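The core idea behind Kedro can be approximated in plain Python: pipeline nodes are pure functions with named inputs and outputs, and a runner executes them in dependency order. The sketch below is conceptual only, not Kedro's actual API; the dataset names ("raw", "clean", "mean") and helper names are hypothetical.

```python
# Conceptual sketch (not Kedro's real API) of the idea Kedro builds on:
# nodes are pure functions with named inputs and outputs, and a runner
# executes them in dependency order, keeping the code modular and testable.

def make_node(func, inputs, outputs):
    """Bundle a pure function with the dataset names it reads and writes."""
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run_pipeline(nodes, catalog):
    """Run nodes whose inputs are available until all outputs are produced."""
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(i in catalog for i in n["inputs"])]
        if not ready:
            raise RuntimeError("Unresolvable dependencies in pipeline")
        for n in ready:
            result = n["func"](*[catalog[i] for i in n["inputs"]])
            catalog[n["outputs"]] = result
            pending.remove(n)
    return catalog

# Hypothetical two-step pipeline: clean raw numbers, then average them.
clean = make_node(lambda xs: [x for x in xs if x is not None], ["raw"], "clean")
mean = make_node(lambda xs: sum(xs) / len(xs), ["clean"], "mean")

# Note the nodes are passed out of order; the runner resolves dependencies.
catalog = run_pipeline([mean, clean], {"raw": [1, 2, None, 3]})
print(catalog["mean"])  # → 2.0
```

Because each node is a pure function keyed by dataset names, the same pipeline definition can be re-run, tested, and extended without the hidden execution-order state of a notebook.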
Boston Data Engineering: Kedro Python Framework for Data Science: Overview and New Features
5. QuantumBlack, a McKinsey company
SET THE SCENE: What are you trying to build?
Insights: data science code that no one will use after your project is complete.
6. SET THE SCENE: What are you trying to build?
1. This clever bedroom storage option
2. This timely workaround
3. This energizing electrical hookup
Source: https://www.buzzfeed.com/philippjahner/craftsmanship-fails
7. SET THE SCENE: What are you trying to build?
Machine Learning Product: data science code that needs to be re-run and maintained.
9. SET THE SCENE: Learning about your experiences
Question time: What type of analytics projects do you focus on?
10. QuantumBlack, a McKinsey company 10
The challenges of creating machine learning products
The Jupyter notebook workflow comes with five “C” challenges
SET THE SCENE
Challenge 1: Collaboration
Multi-user collaboration in a notebook is hard to do because of the recommended one-person/one-notebook workflow.
Challenge 2: Code Reviews
Code reviews, the act of checking each other’s code for mistakes, require extensions of notebook capabilities. This often means that reviews are not done for code written in notebooks.
Challenge 3: Code Quality
Writing unit tests, documenting the codebase and linting (like a grammar check for code) are not easily done in a notebook.
Challenge 4: Caching
The convenience of caching in a notebook sacrifices an accurate notebook execution flow, leading you to believe that your code runs without errors.
Challenge 5: Consistency
Reproducibility in notebooks is a challenge. A 2019 NYU study¹ executed 860k notebooks found in 264k GitHub repositories: 24% of the notebooks completed without error, and only 4% produced the same results.
Source: 1. Pimentel, J., Murta, L., Braganholo, V. and Freire, J. (2019). A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks. [online] Available at: http://www.ic.uff.br/~leomurta/papers/pimentel2019a.pdf [Accessed 23 Sep. 2020].
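Challenge 4 (Caching) can be sketched in plain Python. The snippet below is an illustrative simulation, not Jupyter itself: notebook cells are modelled as functions over a shared namespace, an out-of-order interactive session appears to succeed, and a fresh top-to-bottom run (the equivalent of “Restart & Run All”) fails.

```python
# Simulate notebook cells as functions that read/write a shared namespace.
cells = {
    1: lambda ns: ns.__setitem__("total", ns["x"] + 1),  # cell 1 uses x
    2: lambda ns: ns.__setitem__("x", 10),               # cell 2 defines x
}

# Interactive, out-of-order session: run cell 2, then cell 1 -- "works".
ns = {}
cells[2](ns)
cells[1](ns)
print(ns["total"])  # 11

# Fresh top-to-bottom run (like "Restart & Run All"): cell 1 now fails,
# because x has not been defined yet -- the cached state hid the bug.
ns = {}
try:
    cells[1](ns)
except KeyError as err:
    print(f"cell 1 failed on a clean run: missing {err}")
```

This is exactly the gap the NYU study measures: a notebook that “ran fine” interactively need not run at all from a clean state.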
11. QuantumBlack, a McKinsey company 11
SET THE SCENE
Question time: learning about your experiences
How are Jupyter notebooks used in your analytics workflows?
12. QuantumBlack, a McKinsey company 12
SET THE SCENE
The challenges of creating machine learning products
A workflow beyond notebooks still has challenges
“Data scientists have to learn so many tools to create high-quality code.”
“Everyone works in different ways.”
“No one wants to use the framework I created.”
“It’s tedious to always set up documentation and code-quality tooling for my project.”
“We all have different levels of exposure to software engineering best practice.”
“I have to think about Sphinx, flake8, isort, black, Cookiecutter Data Science, Docker, Python logging, virtual environments, Pytest, configuration and more.”
“I spend a lot of time trying to understand a codebase that I didn’t write.”
“My code will not run on another person’s machine.”
“It takes really long to put code in production and we have to rewrite and restructure large parts of it.”
13. QuantumBlack, a McKinsey company 13
A workflow that is focused on Python scripts
SET THE SCENE
The challenges of creating machine learning products
15. QuantumBlack, a McKinsey company 15
WHAT IS KEDRO?
What is Kedro?
What is it? Why do we use it? What is its impact on MLOps?
Reproducible, maintainable and modular data science: solved
• An open-source Python framework created for data scientists, data engineers and machine-learning engineers
• It borrows concepts from software engineering and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning
• Addresses the main shortcomings of Jupyter notebooks, one-off scripts and glue code through its focus on creating maintainable data science code
• Increases the efficiency of an analytics team
• We use it to build reusable code stores, much as React is used to build design systems
• It won Best Technical Tool or Framework for AI in 2019 (Awards AI) and a merit award for its technical documentation, and is listed on the 2020 ThoughtWorks Technology Radar and the 2020 Data & AI Landscape
• Used at start-ups, major enterprises and in academia
It is developed and maintained by QuantumBlack and is McKinsey’s first open-source product
16. QuantumBlack, a McKinsey company 16
Ships with a CLI and a UI for visualizing data and ML pipelines
WHAT IS KEDRO?
Concepts in Kedro
Nodes & Pipelines: a node is a pure Python function that has an input and an output; a pipeline is a directed acyclic graph, a collection of nodes with defined relationships and dependencies.
Project Template: a series of files and folders derived from Cookiecutter Data Science. Project setup consistency makes it easier for team members to collaborate with each other.
Configuration: removes hard-coded variables from ML code so that it runs locally, in the cloud or in production without major changes. Applies to data, parameters, credentials and logging.
The Catalog: an extensible collection of data, model or image connectors, available with a YAML or code API, that borrow arguments from the Pandas and Spark APIs and more.
17. QuantumBlack, a McKinsey company 17
WHAT IS KEDRO?
Project template
Project structure: Python scripts, configuration, tests, notebooks, project documentation, logs
What is the project template?
• A modifiable series of files and folders
• Built-in support for Python logging, Pytest for unit tests and Sphinx for documentation
What does the project template help you do?
• Spend time documenting your ML approach instead of worrying about how your project is structured
• Spend less time digging around in previous projects for useful code
• Make it easier for collaborators to work with you
18. QuantumBlack, a McKinsey company 18
WHAT IS KEDRO?
Configuration
What is configuration?
• “Settings” for your machine-learning code
• A way to define requirements for data, logging and parameters in different environments
• Helps keep credentials out of your code base
• Keeps all parameters in one place
What does configuration help you do?
• Write machine-learning code that transitions from prototype to production with little effort
• Write generalizable and reusable analytics code that does not require significant modification to be reused
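The principle behind configuration can be sketched in a few lines. Kedro itself keeps parameters in YAML files under `conf/`; this stdlib-only sketch uses JSON purely to illustrate moving “settings” out of the code, and the file contents and parameter names (`test_size`, `random_state`) are made up for the example.

```python
import json
import tempfile

# Write a parameters file, standing in for something like parameters.yml.
params_file = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump({"test_size": 0.2, "random_state": 42}, params_file)
params_file.close()

def train_test_split_sizes(n_rows, params):
    """Compute split sizes from externally supplied parameters,
    rather than from constants hard-coded next to the modelling code."""
    n_test = round(n_rows * params["test_size"])
    return n_rows - n_test, n_test

# The code itself contains no magic numbers; swapping environments
# means swapping the parameters file, not editing the function.
with open(params_file.name) as f:
    params = json.load(f)

print(train_test_split_sizes(100, params))  # (80, 20)
```

The same function then runs unchanged locally, in the cloud or in production, which is exactly the “prototype to production with little effort” point above.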
19. QuantumBlack, a McKinsey company 19
WHAT IS KEDRO?
The Catalog
Integrations in the Catalog: Pandas, Spark, Dask, SQLAlchemy, NetworkX, Matplotlib, Google BigQuery, Google Cloud Storage, AWS Redshift, AWS S3, Azure Blob Storage, Hadoop File System
What is the Catalog?
• Manages the loading and saving of your data
• Available as a code or YAML API
• Versioning is available for file-based systems every time the pipeline runs
• It’s extensible, and we accept new data connectors
What does the Catalog help you do?
• Never write a single line of code that reads from or writes to a file, database or storage system
• Write generalizable and reusable analytics code that does not require significant modification to be reused
• Access data without leaking credentials
20. QuantumBlack, a McKinsey company 20
WHAT IS KEDRO?
Nodes & Pipelines
What is a node?
• A pure Python function that has an input and an output (input dataset → Python function → output dataset)
• Node definitions support multiple inputs for things like table joins, and multiple outputs for things like producing a train/test split
What is a pipeline?
• A directed acyclic graph: a collection of nodes with defined relationships and dependencies, chained so that one node’s output dataset becomes another node’s input dataset
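The node-and-pipeline idea above can be sketched as a toy, data-centric DAG in plain Python. This is not Kedro's implementation, only the concept: nodes are pure functions wired together by named inputs and outputs, and the runner resolves execution order from data availability. All names and values are invented for the example.

```python
def node(func, inputs, outputs):
    """A node: a pure function plus the names of its input/output datasets."""
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run(pipeline, data):
    """Run nodes whose inputs are available until every output exists."""
    pending = list(pipeline)
    while pending:
        ready = [n for n in pending if all(i in data for i in n["inputs"])]
        if not ready:
            raise ValueError("cycle or missing input in pipeline")
        for n in ready:
            data[n["outputs"]] = n["func"](*(data[i] for i in n["inputs"]))
            pending.remove(n)
    return data

# Declaration order does not matter: the join-style node is listed first,
# but it only runs once both of its inputs have been produced
# (cf. multiple inputs for table joins).
pipeline = [
    node(lambda a, b: a + b, ["clean_x", "clean_y"], "total"),
    node(lambda x: x * 2, ["raw_x"], "clean_x"),
    node(lambda y: y + 1, ["raw_y"], "clean_y"),
]
result = run(pipeline, {"raw_x": 3, "raw_y": 4})
print(result["total"])  # 11
```

Because nodes declare what data they need rather than when they run, the same declaration can later be handed to different runners, which is the property Kedro exploits for its deployment modes.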
21. QuantumBlack, a McKinsey company 21
Gives you x-ray vision into your project: you can see exactly how data flows through your pipelines. It is fully automated and focuses on your code base.
WHAT IS KEDRO?
Pipeline visualization
PLUGIN
Demo: quantumblacklabs.github.io/kedro-viz/
22. QuantumBlack, a McKinsey company 22
Flexible deployment
Kedro supports multiple deployment modes
WHAT IS KEDRO?
Deployment modes
Supporting a range of tools, including Amazon SageMaker and AWS Batch
Kedro currently supports:
Single-machine deployment on a production server using:
• A container-based approach using Kedro-Docker
• Packaging a pipeline using kedro package
• A CLI-based approach using the Kedro CLI
Distributed deployment, which allows a Kedro pipeline to run on multiple computers within a network at the same time
23. QuantumBlack, a McKinsey company 23
Where does Kedro fit in the ecosystem?
Kedro is the scaffolding that helps you develop a data and machine-learning pipeline that can be deployed
Philosophy of Kedro
• Kedro focuses on how you work while writing standardized, modular, maintainable and reproducible data science code; it does not focus on how you would like to run it in production
• The responsibility for “What time will this pipeline run?” and “How will I know if it failed?” is left to tools called orchestrators, such as Apache Airflow, Kubeflow and Prefect
• Orchestrators do not focus on the process of producing something that could be deployed, which is what Kedro does
Pipeline stages: Ingest Raw Data → Clean & Join Data → Engineer Features → Train & Validate Model → Deploy Model
24. QuantumBlack, a McKinsey company 24
WHERE DOES KEDRO FIT IN THE ECOSYSTEM?
Competitive Space
We are not showing the complete feature space of all the other tools, but focusing on Kedro’s feature set
Features compared: Project Template, Data Catalog, DAG Workflow, DAG UI, Experiment Tracking, Data Versioning, Scheduler, Monitoring
• Kedro: “Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.”
• ZenML: “ZenML is an extensible, open-source MLOps framework for using production-ready Machine Learning pipelines, in a simple way.”
• Cookiecutter Data Science: “A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.”
• MLflow: “MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.”
• Data catalogs: “Data catalogs provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries.”
• Orchestration platforms: allow users to author, schedule and monitor task-centric data workflows.
• DVC: “DVC is built to make ML models shareable and reproducible. Designed to handle data sets, machine learning models, and metrics as well as code.”
• Pachyderm: “Pachyderm is the data layer that powers your machine learning lifecycle. Automate and unify your MLOps tool chain. With automatic data versioning and data driven pipelines.”
Annotations from the feature matrix: “Data-centric DAG” (Kedro), “Task-centric DAG” (orchestrators), “Basic feature”, “Models”, “Coming soon!”
25. QuantumBlack, a McKinsey company 25
Who is using Kedro?
There are Kedro users across the world, who work at start-ups, major enterprises and academic institutions
WHAT IS OUR IMPACT TO DATE?
6,500+ monthly active users of the Kedro documentation
4,000+ GitHub stars
100+ GitHub contributors
26. QuantumBlack, a McKinsey company 26
OUR SUPPORT CHANNELS
Kedro is actively maintained by QuantumBlack
We are committed to growing the community and to making sure that our users are supported in their standard and advanced use cases.
• Documentation is available on Kedro’s Read the Docs: https://kedro.readthedocs.io/
• The Kedro community is active on GitHub: https://github.com/quantumblacklabs/kedro/ where the team and contributors actively maintain raised feature requests, bug reports and pull requests
• The Kedro community is active on Discord: https://discord.gg/7sTm3y5kKu
28. QuantumBlack, a McKinsey company 28
KEDRO
A Kedro example
An excerpt from the Spaceflight tutorial in the Kedro documentation
29. QuantumBlack, a McKinsey company 29
The Blueprint view
The Data Processing Pipeline of Kedro-Viz
A KEDRO EXAMPLE
kedro viz
2020-10-12 13:47:00,906 - werkzeug - INFO - * Running on http://127.0.0.1:4141/ (Press CTRL+C to quit)
2020-10-12 13:47:01,296 - werkzeug - INFO - 127.0.0.1 - - [12/Oct/2020 13:47:01] "GET /api/main HTTP/1.1" 200 -
2020-10-12 13:47:01,478 - werkzeug - INFO - 127.0.0.1 - - [12/Oct/2020 13:47:01] "GET /manifest.json HTTP/1.1" 304 -
2020-10-12 13:47:03,111 - werkzeug - INFO - 127.0.0.1 - - [12/Oct/2020 13:47:03] "GET /service-worker.js HTTP/1.1" 304 -
30. QuantumBlack, a McKinsey company 30
The Blueprint view
The Data Processing Pipeline of Kedro-Viz
A KEDRO EXAMPLE
What happens in this data pipeline?
1. Input three tables which have data from Companies, Space Shuttles and Customer Reviews
2. Pre-process the Companies and Shuttles data
3. Output the processed Companies and Space Shuttles tables
4. Join the tables to form a master table, using the processed Companies and Space Shuttles tables and the Reviews table
5. Output a master table
31. QuantumBlack, a McKinsey company 31
Adding entries to the Catalog
A KEDRO EXAMPLE
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  layer: raw

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  layer: raw
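What those entries buy you can be sketched with a tiny, in-memory stand-in for the Catalog. This is not Kedro's DataCatalog API: the dataset name `companies` is taken from the YAML above, but the class and its CSV-over-string-buffer storage are invented for the sketch. The point is the contract: nodes ask for data by name, and the catalog owns how it is read and written.

```python
import csv
import io

class TinyCatalog:
    """Toy catalog: dataset name -> load/save over an in-memory CSV.
    Kedro's real Catalog builds such connectors from catalog.yml
    (pandas.CSVDataSet, pandas.ExcelDataSet, ...)."""

    def __init__(self):
        self._stores = {}

    def register(self, name, initial_text=""):
        self._stores[name] = io.StringIO(initial_text)

    def load(self, name):
        self._stores[name].seek(0)
        return list(csv.DictReader(self._stores[name]))

    def save(self, name, rows):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        self._stores[name] = buf

catalog = TinyCatalog()
catalog.register("companies", "id,company_rating\n1,90%\n")
rows = catalog.load("companies")
print(rows)  # [{'id': '1', 'company_rating': '90%'}]
```

Because I/O lives entirely behind `load`/`save`, pipeline code never touches file paths or credentials, which is the "never write a single line of code that reads or writes to a file" claim from the Catalog slide.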
32. QuantumBlack, a McKinsey company 32
Creating a node
A KEDRO EXAMPLE
# nodes.py
def preprocess_companies(df):
    df["company_rating"] = df["company_rating"].apply(_parse_percentage)
    return df
33. QuantumBlack, a McKinsey company 33
Creating a pipeline
A KEDRO EXAMPLE
# pipelines.py
from kedro.pipeline import Pipeline, node

from .nodes import preprocess_companies, preprocess_shuttles, create_master_table

def create_pipeline(**kwargs):
    return Pipeline([
        node(
            func=preprocess_companies,
            inputs="companies",
            outputs="preprocessed_companies"),
        node(
            func=preprocess_shuttles,
            inputs="shuttles",
            outputs="preprocessed_shuttles"),
        node(
            func=create_master_table,
            inputs=["preprocessed_shuttles",
                    "preprocessed_companies", "reviews"],
            outputs="master_table"),
    ])
34. QuantumBlack, a McKinsey company 34
Running a pipeline
A KEDRO EXAMPLE
kedro run
2020-10-12 14:58:02,801 - kedro.io.data_catalog - INFO - Loading data from `shuttles` (ExcelDataSet)...
2020-10-12 14:58:11,592 - kedro.pipeline.node - INFO - Running node: preprocessing_shuttles: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]
2020-10-12 14:58:11,663 - kedro.io.data_catalog - INFO - Saving data to `preprocessed_shuttles` (CSVDataSet)...
2020-10-12 14:58:12,085 - kedro.runner.sequential_runner - INFO - Completed 1 out of 3 tasks
2020-10-12 14:58:12,085 - kedro.io.data_catalog - INFO - Loading data from `companies` (CSVDataSet)...
2020-10-12 14:58:12,118 - kedro.pipeline.node - INFO - Running node: preprocessing_companies: preprocess_companies([companies]) -> [preprocessed_companies]
2020-10-12 14:58:12,162 - kedro.io.data_catalog - INFO - Saving data to `preprocessed_companies` (CSVDataSet)...
2020-10-12 14:58:12,387 - kedro.runner.sequential_runner - INFO - Completed 2 out of 3 tasks
2020-10-12 14:58:12,388 - kedro.io.data_catalog - INFO - Loading data from `preprocessed_shuttles` (CSVDataSet)...
2020-10-12 14:58:12,489 - kedro.io.data_catalog - INFO - Loading data from `preprocessed_companies` (CSVDataSet)...
2020-10-12 14:58:12,521 - kedro.io.data_catalog - INFO - Loading data from `reviews` (CSVDataSet)...
2020-10-12 14:58:12,577 - kedro.pipeline.node - INFO - Running node: create_master_table([preprocessed_companies,preprocessed_shuttles,reviews]) -> [master_table]
2020-10-12 14:58:14,530 - kedro.io.data_catalog - INFO - Saving data to `master_table` (CSVDataSet)...
2020-10-12 14:58:23,737 - kedro.runner.sequential_runner - INFO - Completed 3 out of 3 tasks