
Covid19 Smart Assessment Tool

Built with the support of Compellio during the #EUvsVirus hackathon.
https://covid19smartscreeningtool.launchaco.com

Published in: Data & Analytics


  1. Machine Learning for COVID-19 targeted testing
     A risk assessment tool that optimises the use of COVID-19 tests in order to implement fact-based strategies for deconfinement.
     A project supported by Compellio S.A., 15, côte d'Eich, L-1450 Luxembourg, https://compell.io
     Contributors
     All project contributors worked voluntarily on this project.
       • Christos Avrilionis, Credit Risk and Governance Manager, PayPal
       • Theo Papasternos, Business Manager, Compellio
       • Denis Avrilionis, CEO, Compellio
       • Vivi Tzekou, Machine Learning / Full-Stack Developer
       • Yuri Visovsiouk, Full-Stack Developer
       • Christina Dimopoulou, Business Operations, Compellio
     Disclaimer: the opinions expressed in this publication are those of the authors. They do not express the opinions of any entity whatsoever with which they are affiliated.
     Contacts
     We would be delighted to further discuss this project with you. You can reach us directly at hello@compell.io.
     The way forward
  2. We are currently looking to join forces with governments, health organisations, laboratories, and pharmaceutical companies. Interested parties can fill in the partnership form on the project's website: https://covid19smartscreeningtool.launchaco.com
     Stay healthy,
     The COVID19 Smart Screening Tool Team
     Hacking for #EUvsVirus
     Overview
     The project aims to develop and deploy a software platform that would:
       • Allow an individual to fill in an electronic questionnaire with health and demographic information, securing and protecting sensitive personal data using Compellio's blockchain-enabled registry technology
       • Predict the likelihood of a positive Covid-19 diagnosis for an individual at a given point in time using machine learning (ML) and artificial intelligence (AI)
       • Enable policy makers to build an optimal Covid-19 exit strategy based on the targeted use of Covid-19 tests on high-risk individuals
     This solution will be implemented in 2 phases.
     1. Phase 1: Data collection and modelling
        a. Design the questionnaire
        b. Collect medical and demographic data of a person when that person takes a test for Covid-19, using the questionnaire from step 1.a
        c. Link the test's outcome (Covid-19 positive or negative) to the data collected in step 1.b
        d. Build a machine learning model on data from step 1.c
     2. Phase 2: Deployment and general availability
        a. Use the model from step 1.d to generate a prediction of Covid-19 positiveness of any person
        b. Target Covid-19 tests for persons having a high likelihood of positive Covid-19 diagnosis
        c. Link the test result (Covid-19 positive or negative) to the prediction calculated in step 2.a
        d. Monitor model performance and fine-tune the machine learning model built in step 1.d
     This document illustrates steps 1.b, 1.c, 1.d, 2.a and 2.b using simulated data.
  3. Phase 1.b. Collect medical and demographic data of a person when that person takes a test for Covid-19 using the questionnaire
     As a result of phase 1.a, let's assume that we have a questionnaire of 23 questions about medical and demographic information. For the purpose of this illustration, let's assume that:
       • Each question is referred to as q1, q2, ..., q23
       • Each patient has a unique identifier from 1 to 1000
       • The questionnaire was given to 1000 patients as part of the Covid-19 testing procedure
       • All patients answered all the questions
       • Each answer to each question is a continuous variable (this can be extended to categorical variables as well)
     The output of phase 1.b is a table similar to this (first 20 patients shown):
     In [7]: df_X.head(20)
  4. Out[7]:
     Patient ID        q1        q2        q3        q4        q5        q6        q7        q8        q9       q10       q11       q12       q13       q14
     1            0.86360  -1.00490  -0.32880   0.63232  -0.76382   2.48937   0.30792   0.69260   0.87849   0.39463   1.00259   0.78609  -0.17739   0.77910
     2            0.16313  -0.24986   0.13275  -1.50085  -2.92733   0.20983   0.66074  -0.65538   0.11255   0.59372   0.88890  -0.50316   0.25568   0.76523
     3           -0.73682  -0.55274  -1.18046  -0.42813  -1.34706  -0.50470   0.23098   2.24479  -0.68447  -0.19708  -1.26557  -3.01053   0.82140  -0.71519
     4           -1.04279  -0.65561  -0.76028   0.25715  -0.35681  -2.86298  -2.25225  -5.07991   0.63268   1.13516   0.00391   1.42961   0.43997  -1.27914
     5           -0.51371   0.50819  -0.38651  -1.34063   1.36827  -0.89227   1.27378  -0.07773   0.77796  -0.47807  -2.04526   1.64837  -0.67013   1.54823
     6           -2.55599   2.15763  -0.41735  -0.45903  -0.99303   0.68273   1.86696   0.92724  -0.51942   0.97351   1.01371   0.12639  -0.88108   0.90742
     7            0.25446  -0.66239   0.14898  -1.14714   0.26427  -0.12783  -0.13116   0.68001  -0.22781   0.89755  -0.50767  -0.22261  -0.43984  -1.13151
     8           -1.01046  -0.75284   0.22125  -0.25886  -1.23720  -1.26904  -0.06989  -0.54133   0.54484   0.41281  -0.17778  -2.23708   0.43346  -0.75199
     9            0.21368  -1.49480   0.80215  -0.55766  -2.11991   0.22287  -2.60513   0.86176   0.86479   0.20923  -0.66948  -0.15163   0.98832  -1.26166
     10           0.47976   0.04171  -2.12566   0.07869   0.83110  -0.12056  -1.66437   0.79751  -0.97663   1.29526  -0.57091  -1.01142  -0.88971  -2.32314
     11           1.00421   0.99791   0.76168   0.40136  -0.52947  -1.03565  -0.96048  -0.63995   2.44512   1.08679   1.41085   2.73563   1.36788  -0.69813
     12           0.52490  -0.47379  -0.65672  -1.35932  -2.25998  -2.31555  -0.89348  -4.19650  -0.84165  -0.69299   0.69274   2.07068   0.22034   0.43498
     13           0.63667   1.19087   0.05326   1.21838  -0.08718   2.12081   0.13317   1.77220  -0.62710  -1.29448  -0.33494  -0.95674   0.53233   0.48086
     14          -0.37914  -0.14957   0.76062  -0.83470  -0.77427   0.27242   1.21496   1.95481  -0.16722   0.31711   0.31330  -1.84803   0.40436   2.54233
     15          -0.89963  -0.38678   0.95247   1.40977   2.22997  -2.98061  -0.18962  -5.50990   0.93988   0.17021   0.26472   2.38602  -0.99076   0.25673
     16          -0.17194   0.40791  -0.80500   0.17541   0.35286   1.02670  -0.26927   0.56005  -0.07986   0.04604  -0.95871  -0.59690   1.51679  -1.01261
     17          -2.05931  -1.57735   0.46265  -0.18091   2.63300  -2.58241  -0.87385  -4.06331  -0.57685   0.57172   1.38340   1.93061  -1.13699  -1.81634
     18           0.17660  -1.64372   1.26173  -0.01106  -1.44764   1.86687  -0.28044   1.77416  -1.68730  -0.23558  -0.41543  -0.63237  -0.79932   0.64604
     19          -0.58746  -0.40418   1.12721  -1.26260  -0.27359  -1.22690  -1.83024  -0.53333   0.93635  -0.91030   1.40048   2.55845  -0.12744  -0.34864
     20          -0.00095   0.10829  -0.64400   0.06277  -0.76381  -0.67486   0.11844   2.36569  -0.63694  -0.69341  -1.27323  -0.42776   0.45477   0.90311
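The slides do not say how the simulated answers were generated. As a minimal sketch only, assuming independent standard-normal draws for each answer (the helper name `simulate_answers` and the seed are illustrative, not taken from the project), a table with the same shape as `df_X` could be built like this:

```python
import numpy as np
import pandas as pd


def simulate_answers(n_patients=1000, n_questions=23, seed=0):
    """Build a table of simulated continuous answers.

    Assumption: answers are i.i.d. standard normal; the real project
    may have used a different generator.
    """
    rng = np.random.default_rng(seed)
    values = rng.standard_normal(size=(n_patients, n_questions))
    columns = [f"q{i}" for i in range(1, n_questions + 1)]  # q1 ... q23
    index = pd.RangeIndex(1, n_patients + 1, name="Patient ID")  # IDs 1 ... 1000
    return pd.DataFrame(values, index=index, columns=columns)


df_X = simulate_answers()
print(df_X.shape)
```

Categorical answers, which the slides mention as a possible extension, would need an encoding step that this sketch leaves out.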
  5. Phase 1.c. Link the test's outcome (Covid-19 positive or negative) to the data collected in step 1.b
     Let's assume that:
       • All 1000 patients from phase 1.b have been tested for Covid-19 positiveness using a lab test
       • The tests were done on respiratory samples obtained by a nasopharyngeal swab using real-time reverse transcription polymerase chain reaction (rRT-PCR)
     The data of the test outcome are captured as follows:
       • If a patient is Covid-19 positive, the Covid-19 test outcome is equal to 1
       • If a patient is Covid-19 negative, the Covid-19 test outcome is equal to 0
     The data of the test outcome for the first 20 patients are the following:
  6. In [8]: df_y.head(20)
     Out[8]:
     Patient ID  Covid-19 test outcome
     1                               1
     2                               1
     3                               1
     4                               0
     5                               1
     6                               1
     7                               1
     8                               1
     9                               1
     10                              1
     11                              1
     12                              0
     13                              1
     14                              1
     15                              0
     16                              1
     17                              0
     18                              1
     19                              0
     20                              0
  7. From the table above, we see that:
       • Patient with ID 3 is Covid-19 positive
       • Patient with ID 4 is Covid-19 negative
     Let's assume that the proportion of Covid-19 positive patients is approximately 700 / 1000 (70%).
     In [9]: df_y['Covid-19 test outcome'].value_counts()
     Out[9]:
     1    701
     0    299
     Name: Covid-19 test outcome, dtype: int64
     In [10]: sns.countplot(x='Covid-19 test outcome', data=df_y, color='grey')
              plt.ylabel('Count of patients')
              plt.show()
  8. Then, we link each patient's answers to the questionnaire with the test results. The output of phase 1.c is a table similar to this:
     In [12]: df.head(20)
  9. Out[12]:
     Patient ID  Covid-19 test outcome        q1        q2        q3        q4        q5        q6        q7        q8        q9       q10       q11       q12       q13
     1                               1   0.86360  -1.00490  -0.32880   0.63232  -0.76382   2.48937   0.30792   0.69260   0.87849   0.39463   1.00259   0.78609  -0.17739
     2                               1   0.16313  -0.24986   0.13275  -1.50085  -2.92733   0.20983   0.66074  -0.65538   0.11255   0.59372   0.88890  -0.50316   0.25568
     3                               1  -0.73682  -0.55274  -1.18046  -0.42813  -1.34706  -0.50470   0.23098   2.24479  -0.68447  -0.19708  -1.26557  -3.01053   0.82140
     4                               0  -1.04279  -0.65561  -0.76028   0.25715  -0.35681  -2.86298  -2.25225  -5.07991   0.63268   1.13516   0.00391   1.42961   0.43997
     5                               1  -0.51371   0.50819  -0.38651  -1.34063   1.36827  -0.89227   1.27378  -0.07773   0.77796  -0.47807  -2.04526   1.64837  -0.67013
     6                               1  -2.55599   2.15763  -0.41735  -0.45903  -0.99303   0.68273   1.86696   0.92724  -0.51942   0.97351   1.01371   0.12639  -0.88108
     7                               1   0.25446  -0.66239   0.14898  -1.14714   0.26427  -0.12783  -0.13116   0.68001  -0.22781   0.89755  -0.50767  -0.22261  -0.43984
     8                               1  -1.01046  -0.75284   0.22125  -0.25886  -1.23720  -1.26904  -0.06989  -0.54133   0.54484   0.41281  -0.17778  -2.23708   0.43346
     9                               1   0.21368  -1.49480   0.80215  -0.55766  -2.11991   0.22287  -2.60513   0.86176   0.86479   0.20923  -0.66948  -0.15163   0.98832
     10                              1   0.47976   0.04171  -2.12566   0.07869   0.83110  -0.12056  -1.66437   0.79751  -0.97663   1.29526  -0.57091  -1.01142  -0.88971
     11                              1   1.00421   0.99791   0.76168   0.40136  -0.52947  -1.03565  -0.96048  -0.63995   2.44512   1.08679   1.41085   2.73563   1.36788
     12                              0   0.52490  -0.47379  -0.65672  -1.35932  -2.25998  -2.31555  -0.89348  -4.19650  -0.84165  -0.69299   0.69274   2.07068   0.22034
     13                              1   0.63667   1.19087   0.05326   1.21838  -0.08718   2.12081   0.13317   1.77220  -0.62710  -1.29448  -0.33494  -0.95674   0.53233
     14                              1  -0.37914  -0.14957   0.76062  -0.83470  -0.77427   0.27242   1.21496   1.95481  -0.16722   0.31711   0.31330  -1.84803   0.40436
     15                              0  -0.89963  -0.38678   0.95247   1.40977   2.22997  -2.98061  -0.18962  -5.50990   0.93988   0.17021   0.26472   2.38602  -0.99076
     16                              1  -0.17194   0.40791  -0.80500   0.17541   0.35286   1.02670  -0.26927   0.56005  -0.07986   0.04604  -0.95871  -0.59690   1.51679
     17                              0  -2.05931  -1.57735   0.46265  -0.18091   2.63300  -2.58241  -0.87385  -4.06331  -0.57685   0.57172   1.38340   1.93061  -1.13699
     18                              1   0.17660  -1.64372   1.26173  -0.01106  -1.44764   1.86687  -0.28044   1.77416  -1.68730  -0.23558  -0.41543  -0.63237  -0.79932
     19                              0  -0.58746  -0.40418   1.12721  -1.26260  -0.27359  -1.22690  -1.83024  -0.53333   0.93635  -0.91030   1.40048   2.55845  -0.12744
     20                              0  -0.00095   0.10829  -0.64400   0.06277  -0.76381  -0.67486   0.11844   2.36569  -0.63694  -0.69341  -1.27323  -0.42776   0.45477
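The slides do not show the cell that produces `df`. One plausible way to link answers and outcomes, shown here only as a sketch, is a one-to-one merge on the shared `Patient ID` index. The toy frames below reuse the q1 and q2 values of patients 3 to 5 from the tables above; the use of `pd.merge` is an assumption, not the project's confirmed code.

```python
import pandas as pd

# Toy stand-ins for df_X (answers) and df_y (outcomes), patients 3 to 5 only
ids = pd.Index([3, 4, 5], name="Patient ID")
df_X = pd.DataFrame(
    {"q1": [-0.73682, -1.04279, -0.51371], "q2": [-0.55274, -0.65561, 0.50819]},
    index=ids,
)
df_y = pd.DataFrame({"Covid-19 test outcome": [1, 0, 1]}, index=ids)

# Inner merge on the index; validate raises if a patient ID is duplicated
df = pd.merge(
    df_y, df_X,
    left_index=True, right_index=True,
    how="inner", validate="one_to_one",
)
print(df)
```

An inner merge drops any patient who is missing from either table, which seems the safer default when a label without answers (or the reverse) cannot be used for training.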
  10. The following figure illustrates the pairwise scatterplots of each combination of questions, as well as the distribution of values for each question, on a random 10% sample of the patients. Covid-19 positive patients are shown in orange. Covid-19 negative patients are shown in blue.
      In [13]: sns.set(style="ticks", color_codes=True)
               df_sample = df.sample(frac=0.1, replace=False, random_state=0)
               g = sns.pairplot(df_sample, hue='Covid-19 test outcome')
  11. Phase 1.d. Build a machine learning model on data from step 1.c
      As a best practice, we set aside 20% of the data (200 patients) in order to measure model performance on a subset of data that was not used to fit the model. The split is stratified on the test outcome so that both partitions keep the same proportion of positive patients.
      In [15]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
  12. In [16]: print('The training partition has', X_train.shape[0], 'rows (patients) and', X_train.shape[1], 'inputs (answered questions from the questionnaire)')
               print('The training partition has', y_train.shape[0], 'class labels (results of the Covid-19 test for each patient)')
               print('\n')
               print('The validation partition has', X_test.shape[0], 'rows (patients) and', X_test.shape[1], 'inputs (answered questions from the questionnaire)')
               print('The validation partition has', y_test.shape[0], 'class labels (results of the Covid-19 test for each patient)')
      The training partition has 800 rows (patients) and 23 inputs (answered questions from the questionnaire)
      The training partition has 800 class labels (results of the Covid-19 test for each patient)

      The validation partition has 200 rows (patients) and 23 inputs (answered questions from the questionnaire)
      The validation partition has 200 class labels (results of the Covid-19 test for each patient)
      The proportion of the target variable (Covid-19 test outcome) in the training partition is the following:
      In [17]: pd.Series(y_train).value_counts(normalize=True)
      Out[17]:
      1    0.70125
      0    0.29875
      dtype: float64
      The proportion of the target variable (Covid-19 test outcome) in the validation partition is the following:
      In [18]: pd.Series(y_test).value_counts(normalize=True)
      Out[18]:
      1    0.7
      0    0.3
      dtype: float64
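The proportions above follow from the class counts (701 positive, 299 negative) and the 80/20 stratified split. The arithmetic can be re-derived with the short sketch below; it uses plain rounding as an approximation of what a stratified split does, not scikit-learn's exact allocation algorithm.

```python
# Class counts reported by value_counts() on the full data set
n_total, n_pos = 1000, 701
test_size = 0.2

n_test = int(round(n_total * test_size))  # 200 validation patients
n_train = n_total - n_test                # 800 training patients

# Stratification keeps the positive share (about 70%) in both partitions
pos_train = round(n_pos * n_train / n_total)  # 560.8 rounds to 561
pos_test = n_pos - pos_train                  # 140

print(pos_train / n_train)  # share of positives in the training partition
print(pos_test / n_test)    # share of positives in the validation partition
```

The resulting counts (561 and 140 positives, hence 239 and 60 negatives) match the `support` column of the classification reports for the two partitions.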
  13. The details of how the Covid-19 risk model is fit are not shown. The graphs below show the distribution of the risk score for the training and the validation partitions. The distribution has two distinct spikes: risk scores close to 0 indicate low-risk people, and risk scores close to 1 indicate high-risk people.
      In [31]: sns.distplot(pd.Series(pred_proba_train), kde=False)
               plt.xlabel('Covid-19 risk score')
               plt.ylabel('Count of patients')
               plt.title('Covid-19 risk score for the training partition')
               plt.show()
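Because the fitting step is not shown, the following is only a sketch of how a risk score between 0 and 1 could be produced. It assumes scikit-learn's `LogisticRegression` and synthetic data from `make_classification`; neither choice is confirmed by the slides, and the actual model may be quite different.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 patients, 23 inputs, about 70% positives
X, y = make_classification(
    n_samples=1000, n_features=23, n_informative=5,
    weights=[0.3, 0.7], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative classifier; any model exposing predict_proba would fit here
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Risk score = predicted probability of the positive class (column 1)
pred_proba_train = model.predict_proba(X_train)[:, 1]
pred_proba_test = model.predict_proba(X_test)[:, 1]
print(pred_proba_test[:5].round(3))
```

The variable names `pred_proba_train` and `pred_proba_test` mirror those used in the plotting cells, so the histograms could be drawn from these arrays.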
  14. In [32]: sns.distplot(pd.Series(pred_proba_test), kde=False)
               plt.xlabel('Covid-19 risk score')
               plt.ylabel('Count of patients')
               plt.title('Covid-19 risk score for the validation partition')
               plt.show()
      The following output shows the model performance on the training partition.
  15. In [36]: pred_train = adjusted_classes(pred_proba_train, prior_proba)
               print('Training partition')
               print('\n')
               print(pd.DataFrame(confusion_matrix(y_train, pred_train),
                                  columns=['Predicted Covid-19 = 0', 'Predicted Covid-19 = 1'],
                                  index=['Actual Covid-19 = 0', 'Actual Covid-19 = 1']))
               print('\n')
               print(classification_report(y_train, pred_train))
      Training partition

                           Predicted Covid-19 = 0  Predicted Covid-19 = 1
      Actual Covid-19 = 0                     239                       0
      Actual Covid-19 = 1                       9                     552

                    precision    recall  f1-score   support
                 0       0.96      1.00      0.98       239
                 1       1.00      0.98      0.99       561
          accuracy                           0.99       800
         macro avg       0.98      0.99      0.99       800
      weighted avg       0.99      0.99      0.99       800
      The following output shows the model performance on the validation partition.
  16. In [37]: pred_test = adjusted_classes(pred_proba_test, prior_proba)
               print('Validation partition')
               print('\n')
               print(pd.DataFrame(confusion_matrix(y_test, pred_test),
                                  columns=['Predicted Covid-19 = 0', 'Predicted Covid-19 = 1'],
                                  index=['Actual Covid-19 = 0', 'Actual Covid-19 = 1']))
               print('\n')
               print(classification_report(y_test, pred_test))
      Validation partition

                           Predicted Covid-19 = 0  Predicted Covid-19 = 1
      Actual Covid-19 = 0                      58                       2
      Actual Covid-19 = 1                      12                     128

                    precision    recall  f1-score   support
                 0       0.83      0.97      0.89        60
                 1       0.98      0.91      0.95       140
          accuracy                           0.93       200
         macro avg       0.91      0.94      0.92       200
      weighted avg       0.94      0.93      0.93       200
      The output of phase 1.d is a machine learning model that can be deployed at scale in order to calculate the risk score of any person, on the basis of his/her answers to the questionnaire.
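The helper `adjusted_classes` called in cells 36 and 37 is not defined anywhere in the slides. A hypothetical reconstruction, assuming it simply labels a patient as positive when the risk score reaches a threshold and that `prior_proba` is the share of positives in the training data (both are assumptions), might look like this:

```python
import numpy as np


def adjusted_classes(pred_proba, threshold):
    """Hypothetical reconstruction of the undefined helper.

    Returns 1 where the risk score is at or above the threshold, else 0.
    """
    return (np.asarray(pred_proba) >= threshold).astype(int)


prior_proba = 0.7  # assumption: about 70% of training patients are positive
scores = [0.05, 0.69, 0.70, 0.98]  # made-up risk scores for illustration
print(adjusted_classes(scores, prior_proba))  # [0 0 1 1]
```

Using the class prior rather than 0.5 as the cut-off is one common way to handle imbalanced classes; whether the project did this cannot be confirmed from the slides.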
  17. Phase 2.a. Use the model from step 1.d to generate a prediction of Covid-19 positiveness of any person
      After model deployment (in production), let's assume that 500 previously unknown patients have answered the questionnaire. The data for the first 10 of them are shown below.
      In [45]: pd.set_option('display.precision', 5)
               df_X_score.head(10)
      Out[45]:
      Patient ID        q1        q2        q3        q4        q5        q6        q7        q8        q9       q10       q11       q12       q13       q14
      1001         1.53175  -0.04577   0.24018  -3.01630   3.78771   0.14376   0.10031   1.51729  -1.76818  -1.63343  -0.17644  -0.93879   0.15535   1.46685
      1002        -0.10546   0.27277   0.19986  -2.45632   2.71857  -0.67213   0.65718   1.81098  -1.08473  -2.55134  -0.01416   0.69048  -0.05910   0.62776
      1003        -0.53627   1.91788  -1.27851   0.08680   0.73835   0.83755  -1.04482  -0.79318   0.05228  -0.39619  -0.05698   0.78990   0.75103  -1.12659
      1004         0.22495   1.32366   2.31517   2.10098   0.12442   0.06402   0.13983  -2.27711  -0.34322  -0.45823  -0.86908   1.73886  -1.13832  -1.09103
      1005         1.12422   2.21340   1.03132   2.05760  -3.09501  -3.00650   0.06349  -0.10149   1.79921   1.97837   0.02817  -0.22139  -0.08733  -0.17335
      1006         0.13391  -0.76154   0.83318   1.40468  -1.82245  -0.19431  -0.20189   0.14828  -2.96337  -0.15200  -0.70969   0.09331  -0.62289  -0.52828
      1007         0.55378   1.14685   1.57146  -1.62518   2.78392  -0.22966   1.07731   2.42854  -0.50233   1.09827  -0.25322   0.81109  -1.83957   1.25795
      1008        -1.68730   1.69995  -0.99108   1.42300  -2.63067  -1.44764   0.89630   2.26976  -2.73905   0.17660   0.64604   1.48959  -1.64372  -1.62723
      1009        -1.14398   1.17859   0.54627   0.11912   0.45548  -0.25665  -1.09810  -1.01112   0.41393  -0.73649  -0.62525   0.51227   0.55505  -1.23055
      1010        -1.38965   1.11901  -1.67252   0.45198  -3.02987   0.72636   0.21500  -0.64388  -1.34596  -0.22011   0.13737  -0.67231  -2.50142   0.97951
  18. Distribution of Covid-19 risk score for 500 previously unknown patients
      In [48]: sns.distplot(pd.Series(pred_proba_score), kde=False)
               plt.xlabel('Covid-19 risk score')
               plt.ylabel('Count of patients')
               plt.title('Covid-19 risk score for 500 previously unknown patients')
               plt.show()
      The table below shows the prediction for the first 10 previously unknown patients based on the Covid-19 risk score.
  19. In [50]: pred_score_df = pd.DataFrame(pred_score, index=idx_score, columns=['Predicted Covid-19 test outcome'])
               pred_score_df.head(10)
      Out[50]:
      Patient ID  Predicted Covid-19 test outcome
      1001                                      1
      1002                                      1
      1003                                      1
      1004                                      0
      1005                                      0
      1006                                      1
      1007                                      1
      1008                                      1
      1009                                      0
      1010                                      1
      Patient with ID 1003 has a positive predicted Covid-19 test outcome, while patient 1004 has a negative predicted Covid-19 test outcome.
      Phase 2.b. Target Covid-19 tests for persons having a high likelihood of positive Covid-19 diagnosis
  20. Get the top 20 persons with the highest Covid-19 risk score
      In [51]: pred_proba_score_df = pd.DataFrame(pred_proba_score, index=idx_score, columns=['Predicted Covid-19 risk score'])
               pd.concat([pred_proba_score_df, pred_score_df], axis=1).sort_values(by='Predicted Covid-19 risk score', ascending=False).head(20).drop(columns='Predicted Covid-19 risk score')
  21. Out[51]:
      Patient ID  Predicted Covid-19 test outcome
      1204                                      1
      1057                                      1
      1223                                      1
      1212                                      1
      1374                                      1
      1375                                      1
      1026                                      1
      1386                                      1
      1293                                      1
      1270                                      1
      1407                                      1
      1267                                      1
      1417                                      1
      1272                                      1
      1195                                      1
      1038                                      1
      1322                                      1
      1481                                      1
      1163                                      1
      1225                                      1
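Targeting tests amounts to ranking people by risk score and testing as many of the highest-ranked as the available capacity allows. The sketch below shows this with `Series.nlargest`; the function name `top_k_for_testing` and the five risk scores are made up for illustration and are not the project's data.

```python
import pandas as pd


def top_k_for_testing(risk_scores: pd.Series, k: int) -> pd.Series:
    """Return the k patients with the highest risk score, highest first."""
    return risk_scores.nlargest(k)


# Illustrative risk scores, not taken from the slides
scores = pd.Series(
    [0.91, 0.12, 0.99, 0.45, 0.87],
    index=pd.Index([1001, 1002, 1003, 1004, 1005], name="Patient ID"),
    name="Predicted Covid-19 risk score",
)

top = top_k_for_testing(scores, k=2)  # k = number of tests available
print(top)
```

Here `k` plays the role of the daily test capacity, so the same ranking can serve a lab with 20 tests or with 2000.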
  22. The proposed solution is capable of registering and reporting the following information:
        • Daily count of participants taking the questionnaire
        • Average daily Covid-19 risk score
        • Average Covid-19 risk score by age group
        • Average Covid-19 risk score by geographical region
        • etc.
      In [53]: plt.figure(figsize=(15, 10))
               plt.title("Daily count of participants taking the questionnaire", fontsize=16)
               plt.plot(daily_volume.index, daily_volume['Volume'], color="b", linestyle="-")
               plt.ylabel("Volume", fontsize=14)
               plt.xlabel("Date", fontsize=14)
               plt.ylim(0, 1000)
               plt.xticks(fontsize=14)
               plt.yticks(fontsize=14)
               plt.grid(True)
               plt.show()
  23. In [55]: plt.figure(figsize=(15, 10))
               plt.title("Daily average Covid-19 risk score for predicted positive patients", fontsize=16)
               plt.plot(daily_scores.index, daily_scores['Average Covid-19 risk score'], color="b", linestyle="-")
               plt.ylabel("Average Covid-19 risk score", fontsize=14)
               plt.xlabel("Date", fontsize=14)
               plt.xticks(fontsize=14)
               plt.yticks(fontsize=14)
               plt.ylim(0, 1)
               plt.grid(True)
               plt.show()
  24. In [57]: plt.figure(figsize=(15, 10))
               plt.title("Weekly average Covid-19 risk score by age group", fontsize=16)
               plt.plot(weekly_avg_score_age.index, weekly_avg_score_age['18-39'], color="b", linestyle="-", label='18-39')
               plt.plot(weekly_avg_score_age.index, weekly_avg_score_age['40-59'], color="r", linestyle="-", label='40-59')
               plt.plot(weekly_avg_score_age.index, weekly_avg_score_age['60+'], color="g", linestyle="-", label='60+')
               plt.ylabel("Average Covid-19 risk score", fontsize=14)
               plt.xlabel("Week", fontsize=14)
               plt.xticks(fontsize=14)
               plt.yticks(fontsize=14)
               plt.grid(True)
               plt.legend(loc='best', fontsize=14)
               plt.show()
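The plotting cells above use `daily_volume`, `daily_scores` and `weekly_avg_score_age`, but the slides do not show how these tables are built. As a sketch only, assuming a response log with one row per completed questionnaire (the `responses` frame, its column names and its values are hypothetical), they could be derived with pandas group-by operations:

```python
import pandas as pd

# Hypothetical response log: one row per completed questionnaire
responses = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-04-20", "2020-04-20", "2020-04-21", "2020-04-21", "2020-04-28"]
    ),
    "age_group": ["18-39", "60+", "18-39", "40-59", "60+"],
    "risk_score": [0.2, 0.8, 0.4, 0.5, 0.9],
})

# Daily count of participants
daily_volume = responses.groupby("date").size().rename("Volume").to_frame()

# Daily average risk score (over all participants in this sketch)
daily_scores = (
    responses.groupby("date")["risk_score"].mean()
    .rename("Average Covid-19 risk score").to_frame()
)

# Weekly average risk score, one column per age group (weeks end on Sunday)
weekly_avg_score_age = (
    responses.groupby([pd.Grouper(key="date", freq="W"), "age_group"])["risk_score"]
    .mean()
    .unstack("age_group")
)
print(weekly_avg_score_age)
```

The slide title for `daily_scores` refers to predicted positive patients only; reproducing that would require filtering `responses` on the predicted outcome before averaging, which this sketch omits. A regional breakdown would follow the same pattern with a region column in place of `age_group`.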
