Role of Data Accessibility During Pandemic

This talk focuses on the importance of data access and on how crucial granular, openly available data is for researchers and data teams, whose work it fuels.

We present the research conducted by the DS4C (Data Science for Covid-19) team, who made a large and detailed set of South Korea Covid-19 data available to a wider community. The DS4C dataset was one of the most impactful datasets on Kaggle, with over fifty thousand cumulative downloads and 300 unique contributors. What makes the DS4C dataset so potent is the sheer amount of data collected for each patient. The Korean government has been collecting and releasing patient information with unprecedented levels of detail. The released data includes infected people’s travel routes, the public transport they took, and the medical institutions treating them. This extremely fine-grained detail is what makes the DS4C dataset valuable: it makes it easier for researchers and data scientists to identify trends, find evidence to support hypotheses, track down causes, and gain additional insights. We will cover the data challenges and the impact that making this data available on a public forum had on the community, and conclude with an insightful visual representation.


  1. 1. Role of data accessibility during Pandemic | Vini Jaiswal, Data and Spark Evangelist @ Databricks | Isaac Lee, Researcher @ Carnegie Mellon University
  2. 2. Who are we?
     Isaac Lee ● Chief director | DS4C (Data Science for Covid) ● Software development team lead | Mindslab ● BS in computer science | Carnegie Mellon University
     Vini Jaiswal ● Customer Success Engineer @ Databricks: “Making Data People successful with their data and ML/AI use cases” ● Data Science Engineering Lead - Citi ● Data Intern - Southwest Airlines ● MS in Information Technology & Management - UTDallas
  3. 3. Agenda ▪ Importance of data accessibility ▪ Overview of research conducted by DS4C ▪ Data collection process ▪ Uniqueness of DS4C dataset ▪ Challenges ▪ Research Value ▪ Value of open source community
  4. 4. Agenda Importance of data accessibility Overview of research conducted by Data Science for Covid-19 (DS4C) Value of open source community
  5. 5. Coronavirus disease (COVID-19) Evolution
     Dec 31, 2019: Pneumonia outbreak of unknown cause ● Wuhan Municipal Health Commission made the first public announcement, confirming 27 cases
     Early Jan 2020: Unknown pathogen confirmed as novel coronavirus ● Wuhan confirmed its first 41 cases of COVID-19 along with one reported death ● Spread to other Chinese provinces
     Jan 20, 2020: Human-to-human transmission ● Human-to-human transmission was confirmed by the WHO and Chinese authorities ● Links to Huanan Seafood Wholesale Market ● PPE strongly recommended
     Jan 20 - Feb 11, 2020: Public emergency declared by WHO ● Jan 23: Wuhan goes into lockdown ● Outbreak spread by a factor of 100 to 200 ● Italy had its first confirmed cases on 31 January 2020, two tourists from China ● Feb 11: WHO officially named the disease COVID-19 ● ICTV named the virus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2)
     March 11, 2020: Covid-19 declared a pandemic ● Global cases reported ● Europe became a major hub by March 13, 2020 ● Outbreak started spreading in the US starting March 7, 2020
     As of 28 April 2021, more than 149 million cases have been confirmed, with more than 3.14 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history.
  6. 6. Role of Data Availability
     1 Surge in Covid-19 cases: It became crucial to know the origin, causes of spikes, affected regions and spread reasons
     2 Lack of cohesive data: Sources of data were limited and data quality was a challenge
     3 Data availability on open platforms: Level of detail, including route info, policies, patient data, recovered vs death cases, etc.
     4 Data Science for Covid (DS4C) dataset: DS4C is a non-profit organization founded by data analysts and ML researchers who wanted to contribute to fighting COVID-19.
  7. 7. The South Korea COVID-19 Dataset
  8. 8. Popularity on the world’s biggest data science forum: top three (1st, 2nd, 3rd) among The White House, Johns Hopkins University, and the DS4C dataset
  9. 9. DS4C Tables
     Dataset | Category | Description
     Case | Case | Data of COVID-19 infection cases in South Korea
     PatientInfo | Patient | Epidemiological data of COVID-19 patients in South Korea
     PatientRoute | Patient | Route data of COVID-19 patients in South Korea
     Time | Time Series | Time series data of COVID-19 status in South Korea
     TimeAge | Time Series | Time series data of COVID-19 status in terms of age in South Korea
     TimeGender | Time Series | Time series data of COVID-19 status in terms of gender in South Korea
     TimeProvince | Time Series | Time series data of COVID-19 status in terms of province in South Korea
     Source: https://www.kaggle.com/kimjihoo/coronavirusdataset
     Databricks path: /databricks-datasets/COVID-19/coronavirusdataset
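
A minimal sketch of how the tables listed on this slide might be read once downloaded. It assumes each table is stored as a CSV file named after the table (for example `Time.csv`) inside a local directory; the directory, the helper names `load_table` and `tables_in_category`, and the file naming are assumptions for illustration, not something the slides specify.

```python
import csv
from pathlib import Path

# Table name -> category, as listed on the "DS4C Tables" slide.
DS4C_TABLES = {
    "Case": "Case",
    "PatientInfo": "Patient",
    "PatientRoute": "Patient",
    "Time": "Time Series",
    "TimeAge": "Time Series",
    "TimeGender": "Time Series",
    "TimeProvince": "Time Series",
}


def load_table(base_dir, name):
    """Read one table into a list of dict rows (all values stay strings).

    Assumption: the table lives at <base_dir>/<name>.csv.
    """
    if name not in DS4C_TABLES:
        raise KeyError(f"unknown DS4C table: {name}")
    path = Path(base_dir) / f"{name}.csv"
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def tables_in_category(category):
    """Return the table names belonging to one category, sorted by name."""
    return sorted(n for n, c in DS4C_TABLES.items() if c == category)
```

Calling `tables_in_category("Patient")` returns the two patient-level tables, which are the ones the later slides combine.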
  10. 10. Schema of DS4C dataset
  11. 11. Patient Info Table: DS4C Dataset vs. Johns Hopkins Dataset
  12. 12. Patient Info Table
  13. 13. Mass outbreak cases data
  14. 14. S. Korea’s social distancing policy data
  15. 15. Patient Route Data
  16. 16. Source of Route Data: cellphone GPS locations, credit card history, surveillance camera footage
  17. 17. Patient Route Data | Patient ID 187218746: GPS location; number of contacted people; use of mask; time of visit; results of COVID-19 tests of the contacted people; type of transportation used; type of facility used (e.g. restaurant)
  18. 18. Patient Route Data combined with Patient Info | GPS location: (35.9, 127.7); number of contacted people: 12; use of mask: True; time of visit: March 2nd, 14:32; results of COVID-19 tests of the contacted people; type of transportation used: taxi -> subway -> walked; type of facility used: Gangnam station McDonald's
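
As a rough illustration of how route data can be combined with patient info, the sketch below joins the two on a patient ID. All class and field names are hypothetical and do not come from the DS4C schema; the ID, coordinates, contact count, and mask flag reuse the values shown on the slides, while the age and sex values are invented purely for the example.

```python
from dataclasses import asdict, dataclass


@dataclass
class PatientInfo:
    # Hypothetical field names; the slides only say age and sex are recorded.
    patient_id: str
    sex: str
    age: str


@dataclass
class RouteStop:
    # Hypothetical field names for a single visited location.
    patient_id: str
    latitude: float
    longitude: float
    contacts: int
    wore_mask: bool


def join_routes_with_patients(stops, patients):
    """Inner-join route stops with patient records on patient_id."""
    by_id = {p.patient_id: p for p in patients}
    joined = []
    for stop in stops:
        patient = by_id.get(stop.patient_id)
        if patient is None:
            continue  # no matching patient record, so the stop is dropped
        joined.append({**asdict(patient), **asdict(stop)})
    return joined


if __name__ == "__main__":
    patients = [PatientInfo("187218746", "female", "30s")]  # invented age/sex
    stops = [RouteStop("187218746", 35.9, 127.7, 12, True)]
    for row in join_routes_with_patients(stops, patients):
        print(row)
```

An inner join is used here so that route stops without a known patient are silently skipped; a real pipeline might prefer to log or keep them.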
  19. 19. Uniqueness of DS4C dataset ▪ Patient status (age & sex) ▪ Deaths, new infections, and recovered per day ▪ Infection routes, infection chain network, diagnosed, symptom onset, and cured dates ▪ Patient travel routes ▪ COVID-19 events timeline ▪ COVID-19 preventative policies ▪ Etc.: population flow, hospital beds and medical supplies, major infection stats, vaccine history (BCG, MMR, etc.), physical examination results | Legend: what is unique to the DS4C dataset vs. what everyone else also has
  20. 20. South Korea Covid-19 research analytics
  21. 21. How did the virus travel from the main hub to other countries?
  22. 22. Covid-19 spread in South Korea: First case | Route | Spread reason
  23. 23. Spread reasons
  24. 24. Preventive Measures
  25. 25. Patient demographics
  26. 26. SIR (Susceptible, Infected, Recovered) model • Cases infected from overseas • Flow of SIR • New variable: o(t): number of people infected from overseas at time t | [Diagram: the S -> I -> R flow, shown once on its own and once with an additional overseas input o]
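
The slide does not give the model equations, so the following is only a sketch of one plausible reading: a standard discrete-time SIR update in which the imported cases o(t) are added directly to the infected compartment. The parameter names `beta` (transmission rate) and `gamma` (recovery rate), the daily step, and the function name are assumptions, not the DS4C team's formulation.

```python
def simulate_sir_with_imports(s0, i0, r0, beta, gamma, overseas, days):
    """Daily-step SIR simulation with an overseas inflow term.

    overseas: callable t -> number of infected people arriving on day t.
    Assumption: arrivals enter the infected compartment directly, so the
    total population grows by overseas(t) each day.
    Returns a list of (S, I, R) tuples, one per day including day 0.
    """
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for t in range(days):
        n = s + i + r
        # New local infections, capped so S never goes negative.
        new_infections = min(s, beta * s * i / n) if n > 0 else 0.0
        recoveries = gamma * i
        imported = overseas(t)
        s -= new_infections
        i += new_infections + imported - recoveries
        r += recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Illustrative numbers only: 5 imported cases per day for the first 10 days.
    run = simulate_sir_with_imports(
        10_000, 1, 0, beta=0.3, gamma=0.1,
        overseas=lambda t: 5 if t < 10 else 0, days=30,
    )
    print(run[-1])
```

With `overseas` returning zero every day, this reduces to the ordinary closed-population SIR update, which makes the effect of the extra term easy to isolate.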
  27. 27. Overview of Research conducted by DS4C 1. Challenges 2. Data Engineering 3. Research Value
  28. 28. 1. Decentralized publication 2. Absence of a unified formatting 3. Data embedded in natural language Challenges of Data collection
  29. 29. 01. Decentralized publication: Over 100 counties and cities each publish the data on their own local district website
  30. 30. 02. Absence of unified formatting: Each district’s website uses a different format, so crawling is infeasible
  31. 31. 03. Data embedded in natural language:
     NOT machine comprehensible: “04/17: Patient 136 visited barber shop near Gangnam station exit 3 (15:30), went back home, then went out to Seven-Eleven in front of his house”
     Machine comprehensible: Jack’s Barber Shop (37.566, 126.978) beauty_salon | Jong-ro Seven-Eleven (37.616, 126.961) store
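
To make the gap between the two forms concrete, here is a small sketch that pulls out only the regular parts of such a sentence (date, patient number, clock times) with regular expressions, and shows the target structured records as plain data. The class and function names are hypothetical. Resolving a phrase like "barber shop near Gangnam station exit 3" to a named place with coordinates needs a geocoding or manual step that is not shown here, and this is not the in-house tool the talk mentions.

```python
import re
from dataclasses import dataclass


@dataclass
class RouteRecord:
    # Target "machine comprehensible" form from the slide.
    name: str
    latitude: float
    longitude: float
    place_type: str


# Matches a leading "MM/DD: Patient <number>" prefix.
HEADER = re.compile(r"^(?P<month>\d{2})/(?P<day>\d{2}):\s*Patient\s+(?P<patient>\d+)\b")
# Matches clock times written in parentheses, e.g. "(15:30)".
CLOCK = re.compile(r"\((\d{1,2}):(\d{2})\)")

SAMPLE = (
    "04/17: Patient 136 visited barber shop near Gangnam station exit 3 "
    "(15:30), went back home, then went out to Seven-Eleven in front of his house"
)

# The structured output shown on the slide, entered by hand here.
STRUCTURED = [
    RouteRecord("Jack's Barber Shop", 37.566, 126.978, "beauty_salon"),
    RouteRecord("Jong-ro Seven-Eleven", 37.616, 126.961, "store"),
]


def parse_route_sentence(sentence):
    """Extract date, patient id, and clock times; return None if no header."""
    match = HEADER.match(sentence)
    if match is None:
        return None
    return {
        "month": int(match.group("month")),
        "day": int(match.group("day")),
        "patient_id": match.group("patient"),
        "times": [f"{hour}:{minute}" for hour, minute in CLOCK.findall(sentence)],
    }


if __name__ == "__main__":
    print(parse_route_sentence(SAMPLE))
    for record in STRUCTURED:
        print(record)
```

Only the first visit in the sample carries a time, so the parsed result has a single entry in `times` even though two places appear in the structured form; that mismatch is part of why free text is hard to process automatically.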
  32. 32. Our in-house data engineering tool enabled 15 engineers to process an exponentially growing volume of patient data
  33. 33. Research Value
  34. 34. Vini Jaiswal, Denny Lee: research mentorship; guidance on data engineering; help collaborating with other organizations
  35. 35. Value of Open Source Community
  36. 36. Not even the Korean government has a centralized dataset... 1. Deleted in 3 days 2. Updated sporadically 3. Distributed by over 100 municipal districts 4. No unifying format 5. Data in natural language
  37. 37. Todo: Data anonymization
  38. 38. Isaac Lee @ CMU joongkul@andrew.cmu.edu Please reach out to us if you would like to take part in our effort or collaborate with us
  39. 39. Contact us! collaborate@databricks.com Visit our hub! https://databricks.com/databricks-covid-19-resource-hub Closing remarks
  40. 40. Thank you!
  41. 41. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  42. 42. 1. Importance of data accessibility (Vini) 2. Why is DS4C unique (Isaac) - take a look at the raw data very briefly 3. Use case 1: dashboard (Vini) 4. Use case 2: research papers (Isaac) 5. Data engineering: the 3 steps (Isaac) 6. Value of community (Isaac) 7. Closeout and feedback (Vini)
  43. 43. Data science for covid-19 Who we are: - Non-profit organization founded by data analysts and machine learning researchers - 16 Masters and PhD students from: Carnegie Mellon University, Seoul National University, Hanyang University, Kyunghee University
  44. 44. Mass outbreak cases data | Patient travel routes (unreleased): 1. Synthesized card history + phone GPS + closed-circuit cameras 2. Coordinate and time of every location visited 3. Number of people contacted 4. Wore mask or not | Korea’s social distancing policy data
  45. 45. 3rd most used COVID-19 dataset worldwide | Downloads: 70,000 | Contributors: 300
  46. 46. Started out as a small project with a couple of friends and myself... Now, we have over 20 volunteer data engineers from universities all over Korea, and over $100K in funding from the Korean government, Microsoft, and others
  47. 47. Dozens of world-class research institutions conducting research with DS4C dataset
  48. 48. DS4C dataset was the foundational source for many research papers from world-class research institutions
