In this talk, I will show the range of data engineering challenges in acquiring accurate COVID-19 case data from hundreds of sources for an epidemiological study. I’ll walk you through how we mitigated these challenges using purely open source Python libraries (Great Expectations and Kedro). Together, they bring software engineering best practices to the experimental nature of Machine Learning.
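The kind of check these tools automate can be sketched in plain Python. The example below is a hypothetical illustration only: it uses neither the Great Expectations nor the Kedro API, and it is not the pipeline from the talk. The `CaseRecord` schema, the `validate_case_records` helper, the three rules, and the sample regions are all assumptions made up for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseRecord:
    """One row of cumulative case data (hypothetical schema)."""
    source: str
    report_date: date
    cumulative_cases: int


def validate_case_records(records):
    """Return readable failure messages; an empty list means the batch passed."""
    failures = []
    by_source = defaultdict(list)
    for record in records:
        # Rule 1 (assumed): counts are never negative.
        if record.cumulative_cases < 0:
            failures.append(f"{record.source} {record.report_date}: negative count")
        by_source[record.source].append(record)

    for source, rows in by_source.items():
        ordered = sorted(rows, key=lambda r: r.report_date)
        for prev, curr in zip(ordered, ordered[1:]):
            if curr.report_date == prev.report_date:
                # Rule 2 (assumed): one row per source per day.
                failures.append(f"{source} {curr.report_date}: duplicate report date")
            elif curr.cumulative_cases < prev.cumulative_cases:
                # Rule 3 (assumed): cumulative totals never decrease.
                failures.append(f"{source} {curr.report_date}: cumulative count decreased")
    return failures


# Invented sample data: one consistent batch and one with three problems.
clean_batch = [
    CaseRecord("region_a", date(2020, 3, 1), 10),
    CaseRecord("region_a", date(2020, 3, 2), 12),
]
dirty_batch = [
    CaseRecord("region_a", date(2020, 3, 1), 10),
    CaseRecord("region_a", date(2020, 3, 2), 8),
    CaseRecord("region_b", date(2020, 3, 1), -1),
    CaseRecord("region_b", date(2020, 3, 1), 5),
]

print(validate_case_records(clean_batch))
for failure in validate_case_records(dirty_batch):
    print(failure)
```

Returning a list of failures rather than raising on the first one lets a pipeline step report every problem in a batch at once, which matters when hundreds of sources are refreshed together.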
Cidara is developing long-acting therapeutics designed to improve the standard of care for patients facing serious diseases. The Company’s portfolio is comprised of drug candidates intended to transform existing treatment and prevention paradigms. Its lead Phase 3 antifungal candidate, rezafungin, will report Phase 3 data at the end of 2021. The potential peak sales opportunity for rezafungin in the US is ~$750M. In addition, the Company is developing Drug-Fc Conjugates (DFCs) targeting viral and oncology diseases from Cidara’s proprietary Cloudbreak® platform.
G Medical Innovations Holdings Ltd is a mobile health (mHealth) and e-health company. It develops and markets clinical and consumer medical-grade health monitoring solutions and offers end-to-end support for e-health projects. The company offers a suite of both consumer and clinical grade products and platforms which are positioned to reduce inefficiencies in healthcare delivery, improve access, reduce costs, increase the quality of care, and make healthcare more personalized and precise.
Some hospitals have reported returning to pre-COVID-19 volumes for certain services, but the pandemic continues to affect outpatient and surgical volumes, largely due to workforce capacity constraints.
One in 8 U.S. women will develop invasive breast cancer over her lifetime, with approximately 266,000 new breast cancer patients and 3.1 million breast cancer survivors in 2018. Following breast cancer surgery in the adjuvant setting, a HER2/neu 3+ patient typically receives Herceptin® in the first year, with the hope that their breast cancer will not recur, and with the odds of recurrence slowly decreasing over the first 5 years after surgery. Herceptin® has been shown to reduce recurrence rates from 25% to 12% in the adjuvant setting. In the neoadjuvant setting, a patient receives treatment before surgery and based on the results of a biopsy at surgery, will receive the same or more potent treatment after surgery. Kadcyla® has been shown to reduce recurrence rates from 22% to 11% in the neoadjuvant setting. Accordingly, we believe that GP2 may be used to address the 50% of recurring patients who do not respond to either Herceptin® or Kadcyla®.
Milestone Scientific Inc. (MLSS) is a biomedical technology research and development company that patents, designs, develops and commercializes innovative diagnostic and therapeutic injection technologies and instruments for medical and dental applications. Milestone's computer-controlled systems are designed to make injections precise, efficient, virtually painless, and less expensive. Milestone’s proprietary DPS® Dynamic Pressure Sensing technology® platform advances the development of next-generation devices, regulating flow rate and monitoring pressure from the tip of the needle, through platform extensions for local anesthesia for subcutaneous drug delivery, with specific applications for epidural space identification in regional anesthesia procedures.
Supply and demand shocks in the global economy are affecting businesses during the #covid19 pandemic. In the #BainWebinar "Procurement Best Practices Through COVID-19", our experts will share analyses of the current landscape and the possible actions that will help build a flow better aligned with the global objective.
COVID-19 Fact Base and Potential Implications for Brazil - Completo (Bain & Company Brasil)
A new version of the study published by our local #Covid19 Task Force confirms the plateau scenario for Brazil and shows Brazilian states continuing to converge on the “controlled risk” zone, with ICU occupancy around 70% and steadier contamination levels.
A study on “the impact of data analytics in covid 19 health care system” (Dr. C.V. Suresh Babu)
A Study on “The Impact of Data Analytics in COVID-19 Health Care System”, Presentation slides for International Conference on "Life Sciences: Acceptance of the New Normal", St. Aloysius' College, Jabalpur, Madhya Pradesh, India, 27-28 August, 2021
2020.01.12 OECD STI Outlook launch - Impacts of COVID-19: How STI systems res... (innovationoecd)
On January 12, join OECD iLibrary, the OECD Directorate for Science, Technology and Innovation, and ACRL/Choice for a presentation of the key findings from the new STI Outlook, followed by a conversation with OECD STI Director Andrew Wyckoff and RAND Corporation Senior Policy Researcher Marjory Blumenthal about the implications for research and innovation in the US.
Read more at https://oe.cd/STIO21-EES
New Zealand's success in containing the spread of Covid-19 during the first wave offers a valuable lesson for countries around the world in designing a policy strategy to tackle Covid-19.
COVID-19 Fact Base and Potential Implications for Brazil - Short (Bain & Company Brasil)
Our local #Covid19 Task Force presents an update of its study on the Brazilian scenario and highlights the longevity of the plateau in Brazil, driven mainly by the economic reopening, which can increase contamination, and by local characteristics such as density and public transport use.
2 years in 2 months? Digital acceleration in biopharma (Across Health)
Relive Across Health's first iD.cast: 90 minutes of inspiration on digital acceleration in biopharma
Get food and facts for thought on how biopharma companies can embrace the COVID-19 wake-up call and turn these disruptive times into an opportunity for hyper-acceleration and re-invention. The session featured two highly engaging speakers – and scored an NPS of 65!
- Peter Hinssen, Co-founder Nexxworks, serial entrepreneur, best-selling author, and thought leader on digitalization & innovation
The "VACINE" for the Never Normal and the hourglass model for corporate innovation
- Fonny Schenck, CEO Across Health
2 years in 2 months? Digital acceleration in biopharma
Watch the recorded webinar or download the slides to learn how to be the Phoenix in the biopharma world and create a sustainable competitive superiority: https://bit.ly/2IZDDLv.
2020 Inside The Post Pandemic Playbook - Client Version (R "Ray" Wang)
The Constellation Research Team shares their high level overview of how to play for a post pandemic recovery. Details by business themes can be found in client advisory or paid client inquiry access.
Genetic Technologies Limited is a diversified molecular diagnostics company developing tools for the prediction and assessment of cancer risk to help physicians proactively manage patient health. The Company’s lead products, ‘GeneType for Breast Cancer’ and ‘GeneType for Colorectal Cancer’, are clinically validated risk assessment tests that are first in their class. The Company’s development pipeline includes new tests for Type 2 diabetes, cardiovascular disease, prostate cancer, and melanoma. Listed on the ASX in 2000 and NASDAQ in 2005, Genetic Technologies has been a leader in the development and commercialization of genetic risk assessment technology for 20 years.
Our central thesis has long been that COVID hasn’t dramatically changed the healthcare industry; rather, it has dramatically accelerated trends in the healthcare space that were already simmering before March 2020. Given the slow pace at which the healthcare market typically moves, COVID served as a shock to the system and an accelerator that created a window to drive meaningful change. In this whitepaper, we will examine several changes that were less obvious in the early days of the pandemic and assess their longevity as we (hopefully) move into a post-COVID world.
Hospital Capacity Management: How to Prepare for COVID-19 Patient Surges (Health Catalyst)
Health system resource strain became an urgent concern early in the COVID-19 pandemic. Hard-hit areas exhausted their hospital beds, ventilators, personal protective equipment, staffing, and other life-saving essentials, while other regions scrambled to prepare for inevitable surges. These resource concerns heightened the need for accurate, localized hospital capacity planning. With additional waves of infection in the summer months following the initial spring 2020 crisis, health systems must continue to forecast resource demands for the foreseeable future. An accurate capacity planning tool uses population demographics, governmental policies, local culture, and the physical environment to predict healthcare resource needs and help health systems prepare for surges in patient demand.
ILC webinar: Under the microscope: Comparing countries’ experiences of the CO... (ILC-UK)
COVID-19 has had devastating effects on health systems and economies across the world and has put the importance of the prevention of ill health throughout the life course into sharp focus– from the importance of better pandemic preparedness to the need to promote the overall health of the population.
This ILC webinar is part of our “Delivering prevention in an ageing world” programme.
The panellists presented their country perspectives on how their countries have responded to COVID-19 and what we can learn from the pandemic for the prevention agenda going forward.
A Sustainable Healthcare Emergency Management Framework: COVID-19 and Beyond (Health Catalyst)
With an ever-changing understanding of COVID-19 and a continually fluctuating disease impact, health systems can’t rely on a single, rigid plan to guide their response and recovery efforts. An effective solution is likely a flexible framework that steers hospitals and other providers through four critical phases of a communitywide healthcare emergency:
Prepare for an outbreak.
Prevent transmission.
Recover from an outbreak.
Plan for the future.
The framework must include data-supported surveillance and containment strategies to enhance detection, reduce transmission, and manage capacity and supplies, providing a roadmap to respond to immediate demands and also support a sustainable long-term pandemic response.
On Wednesday, 3 March 2021, ESRI researcher Conor Keegan presented the topic ‘Understanding the drivers of hospital expenditure’ at the conference ‘Irish hospital expenditure beyond the era of COVID-19.’
The conference examined issues relating to expenditure on acute hospital care in Ireland. Findings from recent ESRI research, undertaken as part of the ESRI Research Programme in Healthcare Reform, which is funded by the Department of Health, were presented.
To view the presentation slides and other event details, click here: https://www.esri.ie/events/irish-hospital-expenditure-beyond-the-era-of-covid-19
To view a video of the presentation, click here: https://www.youtube.com/watch?v=cEHsUI0EmQ4
Slides from my 2023 Leadership Institute Talk on CSIRO's Our Future World Report, which identifies seven Global Megatrends that will influence society for the next two decades, with reference to work on digital solutions that help address these megatrends.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Similar to Data Engineers in Uncertain Times: A COVID-19 Case Study
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at New Relic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found that the architectural principles and design choices underlying APM were not a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data: Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends Project Hydrogen by improving big data ETL and AI integration, and it also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training with the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may face the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with the scalable data processing of Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
Understanding key traits of Apache Spark on Kubernetes
Things to know when running Apache Spark on Kubernetes, such as autoscaling
Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
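As a point of reference for the shared “fit” and “transform” abstraction mentioned above, here is a minimal, purely sequential sketch in plain Python. It does not use Ray and is not the abstraction presented in the talk; the `MeanCenter`, `ScaleToUnitMax`, and `Pipeline` classes are hypothetical names chosen for illustration.

```python
from statistics import mean


class MeanCenter:
    """Learns the mean during fit and subtracts it during transform."""

    def fit(self, xs):
        self.mean_ = mean(xs)
        return self

    def transform(self, xs):
        return [x - self.mean_ for x in xs]


class ScaleToUnitMax:
    """Learns the largest absolute value and divides by it."""

    def fit(self, xs):
        # Fall back to 1.0 so an all-zero input does not divide by zero.
        self.scale_ = max(abs(x) for x in xs) or 1.0
        return self

    def transform(self, xs):
        return [x / self.scale_ for x in xs]


class Pipeline:
    """Chains steps: each step is fitted on the output of the previous one."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit(self, xs):
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for step in self.steps:
            xs = step.transform(xs)
        return xs


data = [1.0, 2.0, 3.0, 6.0]
pipeline = Pipeline([MeanCenter(), ScaleToUnitMax()]).fit(data)
scaled = pipeline.transform(data)
```

Because every step exposes the same two methods, a scheduler is free to decide where and when each step runs, which is the property a distributed pipeline abstraction can build on.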
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift caused by operations over change data that are not “abelian groups”.
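The remark about operations that are not “abelian groups” can be made concrete with a small sketch. This is one reading of that phrase, not Zipline's implementation: a sum has an inverse (subtraction), so a delete in a change-data stream can be undone exactly, while a maximum has no inverse and needs extra state. The `SumAgg` and `MaxAgg` classes and the event list are hypothetical.

```python
from collections import Counter


class SumAgg:
    """Sum is invertible: a delete is applied by subtracting the value."""

    def __init__(self):
        self.total = 0

    def insert(self, x):
        self.total += x

    def delete(self, x):
        self.total -= x


class MaxAgg:
    """Max is not invertible: keep a multiset so deletes can be honored."""

    def __init__(self):
        self.values = Counter()

    def insert(self, x):
        self.values[x] += 1

    def delete(self, x):
        self.values[x] -= 1
        if self.values[x] <= 0:
            del self.values[x]

    @property
    def current(self):
        return max(self.values) if self.values else None


# Made-up change-data events: (operation, value).
events = [("insert", 5), ("insert", 9), ("delete", 9), ("insert", 7)]

total, exact_max = SumAgg(), MaxAgg()
naive_max = None  # a running max that has no way to process deletes
for op, value in events:
    getattr(total, op)(value)
    getattr(exact_max, op)(value)
    if op == "insert":
        naive_max = value if naive_max is None else max(naive_max, value)
```

Here the naive running maximum still reports the deleted value, which is the kind of drift that appears when a non-invertible aggregation is maintained incrementally over change data.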
We want to present multiple anti-patterns that utilize Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
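One possible precaution for retries and speculative execution in Niche 2 can be sketched as follows. This is an assumption about how such a counter could be made retry-safe, not the solution presented in the talk: each task overwrites a per-partition hash field instead of incrementing a shared field, so a repeated attempt does not double count. `FakeRedisHash` is an in-memory stand-in for a single Redis hash (its methods mirror the HSET, HINCRBY, and HVALS commands), and `run_task` and the partition data are hypothetical.

```python
class FakeRedisHash:
    """In-memory stand-in for one Redis hash; not a real Redis client."""

    def __init__(self):
        self._fields = {}

    def hset(self, field, value):
        # Overwrites the field, like the Redis HSET command.
        self._fields[field] = value

    def hincrby(self, field, amount):
        # Adds to the field, like the Redis HINCRBY command.
        self._fields[field] = self._fields.get(field, 0) + amount
        return self._fields[field]

    def hvals(self):
        return list(self._fields.values())


def run_task(counter_hash, partition_id, records):
    """Process one partition and record its count idempotently."""
    processed = len(records)
    # Keyed by partition, so a retried or speculative attempt overwrites
    # the same field rather than adding to a shared total.
    counter_hash.hset(f"partition:{partition_id}", processed)


partitions = {0: ["a", "b", "c"], 1: ["d", "e"]}
attempts = [0, 1, 1]  # partition 1 runs twice, e.g. a speculative attempt

safe, naive = FakeRedisHash(), FakeRedisHash()
for pid in attempts:
    run_task(safe, pid, partitions[pid])
    naive.hincrby("total", len(partitions[pid]))  # double counts on retry

safe_total = sum(safe.hvals())
naive_total = naive.hvals()[0]
```

The trade-off is one hash field per partition and a final sum on read, in exchange for counts that stay correct when a task runs more than once.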
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
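The idea of lightweight statistical collection that scales across partitions can be sketched with a mergeable column profile. This does not use or reproduce the whylogs API: `ColumnProfile` is a hypothetical class that keeps count, mean, variance, minimum, and maximum in constant space (Welford's update), and merges two profiles with the standard pairwise formula, which is the property that lets each partition be profiled independently.

```python
import math
from dataclasses import dataclass, replace


@dataclass
class ColumnProfile:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, x: float) -> None:
        """Update the profile with one value (Welford's algorithm)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other: "ColumnProfile") -> "ColumnProfile":
        """Combine two profiles without revisiting the underlying rows."""
        if self.count == 0:
            return replace(other)
        if other.count == 0:
            return replace(self)
        n = self.count + other.count
        delta = other.mean - self.mean
        return ColumnProfile(
            count=n,
            mean=self.mean + delta * other.count / n,
            m2=self.m2 + other.m2 + delta * delta * self.count * other.count / n,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def variance(self) -> float:
        """Population variance of the tracked values."""
        return self.m2 / self.count if self.count else 0.0


# Two made-up partitions profiled separately, then merged.
part_a, part_b = ColumnProfile(), ColumnProfile()
for value in [1.0, 2.0, 3.0]:
    part_a.track(value)
for value in [4.0, 5.0, 6.0, 7.0]:
    part_b.track(value)
merged = part_a.merge(part_b)
```

Because `merge` only needs the two summaries, the per-partition work can run in parallel and the results can be combined in any order.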
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
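Rule (ii) above mentions turning a decision tree into a SQL expression. A minimal sketch of that kind of transformation is shown below; it is not Raven's implementation, and the `Leaf` and `Split` types, the `tree_to_sql` and `predict` helpers, and the example tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"   # branch taken when feature <= threshold
    right: "Node"  # branch taken otherwise


Node = Union[Leaf, Split]


def tree_to_sql(node: Node) -> str:
    """Render a decision tree as a nested SQL CASE expression."""
    if isinstance(node, Leaf):
        return str(node.value)
    return (
        f"CASE WHEN {node.feature} <= {node.threshold} "
        f"THEN {tree_to_sql(node.left)} "
        f"ELSE {tree_to_sql(node.right)} END"
    )


def predict(node: Node, row: dict) -> float:
    """Reference evaluation of the same tree in Python."""
    while isinstance(node, Split):
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.value


# A made-up two-level tree over hypothetical columns.
tree = Split("age", 30, Leaf(0.1), Split("income", 50000, Leaf(0.4), Leaf(0.9)))
sql = tree_to_sql(tree)
```

Once the model is an ordinary expression, the query optimizer can treat it like any other projection, for example by pruning columns the tree never reads.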
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is the complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha) Common Method... (2023240532)
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
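Of the techniques above, skipping in-identical vertices is easy to illustrate in isolation. The sketch below is not the STICD algorithm and is not taken from the report: it is a plain power-iteration PageRank for a graph without dangling nodes, with a hypothetical `share_in_identical` flag that computes the rank once per group of vertices with the same in-links. The example graph is made up, and it assumes no duplicate edges.

```python
from collections import defaultdict


def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=500,
             share_in_identical=False):
    """Return (ranks, number of per-vertex rank computations performed)."""
    vertices = sorted(out_links)
    n = len(vertices)

    # Build in-link lists; iterating sources in sorted order keeps them sorted.
    in_links = defaultdict(list)
    for u in vertices:
        for v in out_links[u]:
            in_links[v].append(u)

    # Group vertices whose in-link lists are identical (or one group each).
    groups = defaultdict(list)
    for v in vertices:
        key = tuple(in_links[v]) if share_in_identical else ("vertex", v)
        groups[key].append(v)

    ranks = {v: 1.0 / n for v in vertices}
    computations = 0
    for _ in range(max_iter):
        new_ranks = {}
        for members in groups.values():
            rep = members[0]
            value = (1 - damping) / n + damping * sum(
                ranks[u] / len(out_links[u]) for u in in_links[rep]
            )
            computations += 1
            for v in members:  # every member of the group shares the value
                new_ranks[v] = value
        error = sum(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if error < tol:
            break
    return ranks, computations


# "b" and "c" both have "a" as their only in-link, so they are in-identical.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
plain_ranks, plain_work = pagerank(graph)
shared_ranks, shared_work = pagerank(graph, share_in_identical=True)
```

Both runs take the same number of iterations and produce the same ranks; the grouped run simply evaluates fewer rank expressions per iteration.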
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, the maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas