2. Principles of Data science
Essential steps
1. Research Topic
2. Research Question
3. Hypothesis
4. Data collection plan
5. Data analysis
6. Data Reporting
Research Question -> Hypothesis -> Experiment / Data collection plan -> Data Analysis -> Conclusion / Data Reporting -> Replication
3. Principles of Data science
Research Topic
Example:
First responders' long-term health is at risk when they combat wildfires for several years.
Can monitoring individual emission exposure help manage long-term health risks and extend their active life?
A problem or a need statement with a broad area of interest.
The majority of first responders suffer from cardiac arrest and trauma.
4. Principles of Data science
Research Question
A clearly articulated list of specific research questions defines the types of data that must be collected.
Example:
RQ1. Are toxic emissions negatively associated with long-term health?
RQ2. Are the current data collection measures useful in monitoring the individual emission burden?
RQ3. Are the current methods of health risk assessment accurate?
5. Principles of Data science
Hypothesis
Example:
Ho3: Current methods of Health risk assessments are effective.
Ha3: Current methods of Health risk assessments are not sufficient.
H0: The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.
Ha: The alternative hypothesis is the hypothesis, used in hypothesis testing, that is contrary to the null hypothesis.
6. Principles of Data science
Hypothesis
What are Type I and Type II errors?
For example:
Ho: Current Health Risk Assessments are effective in associating to toxic emission.
Decision | Ho is true | Ho is false
Reject Ho (p < 0.05) | Type I Error | Correct Conclusion
Fail to Reject Ho (p >= 0.05) | Correct Conclusion | Type II Error
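The decision table on this slide can be written as a small function. This is a minimal sketch and not part of the original deck: the name `classify_decision`, the default `alpha = 0.05`, and the sample p-values are illustrative assumptions. In real research the truth of Ho is unknown, so the function only shows how the four outcomes are labeled.

```python
def classify_decision(p_value: float, null_is_true: bool, alpha: float = 0.05) -> str:
    """Label a test outcome, given the (normally unknown) truth of Ho."""
    reject_null = p_value < alpha  # reject Ho when p < alpha, as on the slide
    if reject_null and null_is_true:
        return "Type I error"       # rejected a true Ho
    if not reject_null and not null_is_true:
        return "Type II error"      # failed to reject a false Ho
    return "Correct conclusion"


# Hypothetical p-values, for illustration only
print(classify_decision(0.03, null_is_true=True))
print(classify_decision(0.20, null_is_true=False))
print(classify_decision(0.20, null_is_true=True))
```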
7. Principles of Data science
Data Collection Plan
Type of Data
1. Act, Behavior, or Events
2. Economic data
3. Organizational data
4. Demographic data
5. Self-identity
6. Cultural knowledge
7. Expert knowledge
8. Personal and psychological traits
9. Hidden social patterns
Data Location
Operational Definition
8. Principles of Data science
Data Collection Plan
Dataset | Who | What | Why | Where | When
Firefighters Dataset | Firefighters research associate | Wildfire events and firefighter's data | To assess the emission exposure | The National Institute for Occupational Safety and Health (NIOSH) | For the period 2008 to 2018
Health Report Dataset | Health report research associate | Firefighters health records | To capture the disease diagnosis | Search Firefighter fatalities in the United States | For the period 2008 to 2018
Data Collection plan for Firefighters dataset
9. Principles of Data science
Data Collection Plan
Sampling techniques
Simple random sample
Clustered sampling
Representative subgroup sampling
Possible sources of uncertainty
Sampling Error
Researcher Bias
Validity of Instrument
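The three sampling techniques listed on this slide can be sketched with the standard library. This is an illustration under stated assumptions, not the method used in the study: the roster, the station and age-group fields, and the function names are hypothetical, and "representative subgroup sampling" is interpreted here as stratified sampling with an equal fraction drawn from each subgroup.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, seed=0):
    """Every member has the same chance of being selected."""
    return random.Random(seed).sample(population, n)


def cluster_sample(population, cluster_of, n_clusters, seed=0):
    """Pick whole clusters at random and keep every member of the chosen clusters."""
    clusters = defaultdict(list)
    for item in population:
        clusters[cluster_of(item)].append(item)
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return [item for key in chosen for item in clusters[key]]


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction from every subgroup so each one is represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical roster: (firefighter_id, station, age_group)
roster = [(i, f"station_{i % 4}", "under_40" if i % 3 else "40_plus") for i in range(60)]

srs = simple_random_sample(roster, 10)
clustered = cluster_sample(roster, lambda r: r[1], n_clusters=2)
stratified = stratified_sample(roster, lambda r: r[2], fraction=0.1)
print(len(srs), len(clustered), len(stratified))
```

A fixed seed keeps the draw reproducible, which helps when a sample has to be audited later for sampling error or researcher bias.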
10. Principles of Data science
Data Management
Themes of concerns of big data
Growing data
Real-time can be complex
Data security
TYPES OF DATA STORAGE (Key Differences)
SQL
• Relational, tabular format
• Schema is essential
• Grow vertically
NoSQL
• Unstructured, semi-structured
• No schema
• Grow horizontally
Examples of SQL databases: MySQL, Oracle, SQLite, Postgres, and MS-SQL.
Examples of NoSQL databases: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j, and CouchDB.
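The schema difference between the two storage types can be shown in a few lines. This is a rough sketch under stated assumptions: the table name, field names, and values are hypothetical, SQLite stands in for a relational database, and plain JSON documents stand in for a document store (no actual NoSQL database is used).

```python
import json
import sqlite3

# SQL: the schema is declared up front and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exposure (firefighter_id INTEGER, event_id TEXT, hours REAL)"
)
conn.executemany(
    "INSERT INTO exposure VALUES (?, ?, ?)",
    [(1, "event_a", 6.5), (1, "event_b", 4.0), (2, "event_a", 8.0)],
)
total_hours = conn.execute(
    "SELECT firefighter_id, SUM(hours) FROM exposure "
    "GROUP BY firefighter_id ORDER BY firefighter_id"
).fetchall()
conn.close()

# Document style: each record describes itself, and fields may differ per record.
documents = [
    {"firefighter_id": 1, "events": [{"id": "event_a", "hours": 6.5}]},
    {"firefighter_id": 2, "events": [], "notes": "sensor offline"},
]
serialized = [json.dumps(doc) for doc in documents]

print(total_hours)
print(serialized[1])
```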
11. Principles of Data science
Data Analysis
Flow of data based on its type to create insights
Descriptive Statistics
Categorical: Calculate Frequency, Distribution
Ordinal: Calculate Mode
Interval-Ratio / Continuous: Calculate Mean, Median, SD
Vary?
No: Report No change
Yes: Inferential Statistics
T-Test | Chi-Squared | Correlation | OLS Regression | Logistic Regression
Report Table, Pie chart, Bar chart
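The descriptive half of this flow, where the data type decides which statistic to compute, can be sketched as follows. The function name `describe`, the type labels, and the sample values are assumptions made for illustration; they are not taken from the study data.

```python
import statistics
from collections import Counter


def describe(values, data_type):
    """Return the descriptive statistics that suit the given data type."""
    if data_type == "categorical":
        return {"frequency": dict(Counter(values))}
    if data_type == "ordinal":
        return {"mode": statistics.mode(values)}
    if data_type == "continuous":
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "sd": statistics.stdev(values),  # sample standard deviation
        }
    raise ValueError(f"unknown data type: {data_type}")


# Hypothetical values, for illustration only
diagnosis = describe(["cardiac", "trauma", "cardiac"], "categorical")
toxicity = describe(["Good", "Moderate", "Moderate", "Unhealthy"], "ordinal")
age = describe([30.0, 40.0, 50.0], "continuous")
print(diagnosis, toxicity, age)
```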
13. Principles of Data science
Data Analysis
Exploratory Data Analysis
Descriptive statistics on Health Risk, Emission level, Exposure duration, and Age
14. Principles of Data science
Data Reporting
The most common data reporting formats in business are as follows:
Research Report
Executive Summary
Short Answers
Slide Presentation
White Paper
15. Principles of Data science
Summary
Basic research design consists of six core steps:
Develop a good research question by identifying a small section of a wider topic that is worth exploring.
Choose a logical structure for the research.
Identify the type of data needed.
Select a data collection method.
Choose the data collection site, that is, the data source.
The research question, the type of data, and the data collection method together lead us to the correct data analysis method to use.
16. Principles of Data science
Ethics in Data Science
Informed Consent: A detailed informed consent form will state the scope of the research and a transparent method, and only the required information will be collected.
Maximize benefit: When accessing the first responders' information, utmost care will be given to maximize benefits and minimize harm.
Enhance Wellbeing: For the most part, this research should enable interventions that are designed solely to enhance the well-being of an individual firefighter or subject and that have a reasonable expectation of success.
Equal Treatment: All participants will get equal treatment, and every measurement will be analyzed with the same method without any bias.
Risk vs Benefit: The assessment of risks and benefits requires a careful collection of relevant data, including any alternate ways of obtaining the benefits sought in the research.
Editor's Notes
This slide deck was created to demonstrate what I learned in this course; some interesting observations are included to show my level of understanding.
The essential steps of data science research are discussed in this presentation. Together, the six steps ensure that all critical elements are considered in the research process and give other researchers a clear view of how the work was done.
The six steps are:
Research Topic: Describes a problem or need statement.
Research Question: A precise list of questions that directly gives clues about the data type, unit of measure, and data source.
Hypothesis: Clearly defines the relationship between the variables. It starts with the baseline assumption that there is no relationship between the independent variable and the dependent variable.
Data collection plan: A suitable method of collecting the data that follows the right sampling methods.
Data analysis: Descriptive and inferential statistics performed on the collected data.
Data Reporting: Various reporting techniques for different audiences.
In recent decades, the Western United States has seen heightened wildfire activity, characterized by a higher frequency of massive wildfires, a longer fire season, larger fire size, and a higher total area burned. With projected temperature increases, soil moisture reduction, and more frequent air stagnation, the burden of wildfires on air quality, public health, and environmental management will likely increase. With state-of-the-art wearable sensors, AI models, and detailed health information, we propose to investigate the impacts of historical and future wildfires on first responders' long-term health risks.
RQ1. Are toxic emissions negatively associated with long-term health?
Study the levels of toxic emissions from past wildfire events and map them to the health records of the first responders to identify any correlation between the datasets. What are the health risks associated with this occupation?
RQ2. Are the current data collection measures useful in monitoring the individual emission burden?
Study the current data collection methods and evaluate their effectiveness in monitoring an individual firefighter's emission burden. Establish how well the current methods calculate the duration of the emission burden.
RQ3. Are the current methods of health risk assessment accurate?
What methods are used to calculate health risks, and how is a specific toxic emission associated with a health risk? What are the thresholds of the emission burden?
Ho1: Toxic emissions do not affect the long-term health risk
Ha1: Toxic emissions have a negative association with the health risk
Ho2: Current data collection methods are not effective in calculating individual emission burden of the firefighters.
Ha2: Current data collection methods are useful in calculating individual emission burden
Ho3: Current methods of Health risk assessments are not accurate.
Ha3: Current methods of Health risk assessments are accurate.
A Type I error is the rejection of a true null hypothesis.
A Type II error is the failure to reject a false null hypothesis.
In the example, the p-value is > 0.05, and the firefighters who worked longer hours in toxic emissions had a higher incidence of health disorders. We therefore fail to reject the null hypothesis that the current methods of health risk assessment are effective in associating health risk with toxic emissions.
Based on the formulated research questions, a retrospective analysis of wildfire events over the last ten years and an anonymized list of firefighters' health records are required. Both quantitative and qualitative data need to be carefully selected from specific wildfire events, including the duration of containment, level of emission, type of sensor used, firefighters' ages, shift schedules, reported injuries, and pre-existing conditions. Longer-term details from health records also need to be collected: firefighters' hospital visits, insurance claims, medicine prescriptions, diagnosis dates, and diagnosis details.
From the data source, a set of vital information will be extracted for each of the wildfire events.
An event is a specific wildfire incident that burned more than 1,000 acres or produced significant structural damage or loss of life.
Exposed-days is the number of days each firefighter worked in a job or at a location with the potential for exposure. It will be derived from the employment date and event date.
Fire-runs is the total number of fire-runs made by each firefighter. It will be derived from the event date per event.
Fire-hours is the total time spent at fires by each firefighter. It will be derived from the exposed hours per day.
The individual emission burden is the total duration of individual emission exposure.
Daily emission burden is the hours of emission burden in a day. A day is 24 hours, starting at 00:00 and ending at 23:59. The emission burden per event is the sum of the daily emission burden over that event.
Level of toxicity is a qualitative assessment based on the pollutants, with six levels: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous.
The first noted date is the date on which a specific disease condition was first diagnosed.
The disease condition is the actual finding of the disease state and its stage.
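The operational definitions above can be turned into a small aggregation. This is only a sketch: the log layout, the identifiers, and the hours are hypothetical. It also makes two simplifying assumptions: exposed-days is counted as the number of distinct logged days, and fire-runs as the number of distinct events, whereas the notes derive them from employment and event dates.

```python
from collections import defaultdict
from datetime import date

# Hypothetical exposure log: (firefighter_id, event_id, day, exposed_hours)
exposure_log = [
    ("ff_01", "event_a", date(2018, 7, 1), 6.0),
    ("ff_01", "event_a", date(2018, 7, 2), 8.0),
    ("ff_01", "event_b", date(2018, 8, 10), 5.0),
    ("ff_02", "event_a", date(2018, 7, 1), 4.0),
]


def summarize(log):
    """Aggregate daily exposure records into per-firefighter measures."""
    days = defaultdict(set)
    events = defaultdict(set)
    fire_hours = defaultdict(float)
    burden_per_event = defaultdict(lambda: defaultdict(float))
    for firefighter, event, day, hours in log:
        days[firefighter].add(day)
        events[firefighter].add(event)
        fire_hours[firefighter] += hours
        # Emission burden per event = sum of the daily burden over that event
        burden_per_event[firefighter][event] += hours
    return {
        firefighter: {
            "exposed_days": len(days[firefighter]),
            "fire_runs": len(events[firefighter]),
            "fire_hours": fire_hours[firefighter],
            "burden_per_event": dict(burden_per_event[firefighter]),
        }
        for firefighter in fire_hours
    }


summary = summarize(exposure_log)
print(summary["ff_01"])
```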
Both data sources will be analyzed quantitatively using two main methods: observations and questionnaires. Careful observation of the types of emission exposure, and quantification of its duration for each firefighter combating the fire, is important. Questionnaires will be developed to assess the emission levels at the event locations. Each identified disease condition and its first noted date will be collected per firefighter. The worsening of a disease condition will be assessed qualitatively from periodic health screening reports. The exposure assessment will be conducted by researchers who are blinded to the healthcare reports, to reduce the likelihood of information bias in the subsequent analyses. The table below shows the high-level plan of who, what, why, where, and when for data collection.
While descriptive and inferential statistics are vital for quantitative analysis, careful sample selection is needed to make a meaningful inference about the population statistics. The document discusses various sampling methods and reviews their relation to the population statistics. Each sampling technique was reviewed, along with my level of confidence that its sample mean reflects the population mean.
The era of big data has resulted in the development and application of technologies and methods aimed at effectively using massive amounts of data to support decision-making and knowledge discovery activities. In this paper, the five Vs of big data (volume, velocity, variety, veracity, and value) are reviewed, as well as new technologies, including NoSQL databases, that have emerged to accommodate the needs of big data initiatives.
Both the SQL and NoSQL databases have their applications, based on the development requirements.
The datasets combine continuous, discrete numerical, geospatial, and categorical data types. A standard frequency of data aggregation will be determined before the analysis to calculate daily emission exposure, emission exposure per wildfire event, and emission exposure over the entire career. The mean and median levels of emission exposure, and the corresponding clinical diagnoses, will be established. An unsupervised clustering of the dataset will be developed based on similar emission exposure and associated health risks. The analysis will help determine an emission exposure threshold that can be used to manage health risk proactively and to develop recommendations on care pathways.
The infographic shows a typical path of different data types in a research activity. From categorical data, we can calculate the frequency and mode before applying a chi-squared test, or a logistic regression in a classification scenario. From interval-ratio or continuous data, we can calculate the mean, median, and standard deviation to see whether there is any variation; when there is none, we report no change and do not reject the null hypothesis. Several options are available for continuous data, depending on the spread and kurtosis.
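As a worked instance of the categorical path, the chi-squared statistic for a 2x2 table can be computed by hand. This is a sketch under stated assumptions: the counts are invented for illustration and are not study results, no continuity correction is applied, and the decision uses the standard critical value 3.841 for one degree of freedom at the 0.05 level.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if exposure and diagnosis were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts. Rows: high / low exposure. Columns: diagnosed / not diagnosed.
observed_counts = [[30, 20], [10, 40]]
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841

statistic = chi_squared_statistic(observed_counts)
reject_null = statistic > CRITICAL_VALUE_DF1_ALPHA_05
print(round(statistic, 3), reject_null)
```

With these invented counts the statistic is well above the critical value, so the null hypothesis of no association would be rejected.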
Finally, an appropriate method of visualization can be used to view and communicate the behavior of the data.
Firefighter's Age: A continuous variable of type float, rounded to the nearest month of the firefighter's age.
Measure of Central Tendency: The mean and median for this sample are close to each other, which suggests a roughly symmetric distribution. The mean is the average, or balancing point, of the values. Since all values in this sample are unique, there is no mode.
Measure of Spread: The range, the difference between the minimum and maximum values, shows the dispersion, but when outliers are present it does not clearly indicate the spread. The standard deviation measures how far values typically lie from the mean. In general, for larger sample sizes, the sampling distribution of the mean approaches a normal distribution.
CO emission varies widely and shows a strong relationship with health risk. The duration of exposure varies significantly across the various health risks.
Diagnostic Condition: The diagnosis reported by the physician. It serves as a qualitative variable describing the state of health and is treated as a discrete variable that maps the health condition of the firefighter.
Research Report: the longest and most comprehensive presentation format.
Executive Summary: one to two pages providing an overview of the findings with a statement of action items.
Short Answers: a statement of action items.
Slide Presentation: designed for an oral presentation that provides some context of the research, the findings, and the action items.
White Paper: a short report that describes the research, the findings, the action items, and how they relate to other needs and broader findings in the research area.
Develop recommendations, possibly for a wearable sensor built to collect and manage emission exposure effectively on an individual basis.
Develop an AI/ML model to identify firefighters who are potentially at risk early on, so that their health conditions can be managed effectively.
After data collection, both datasets require careful mapping of independent variables to identify positive or negative correlations and to study, retrospectively, the impact on overall health risk.
Last but not least, 45 CFR 46 and the Belmont Report summarize the ethical principles identified by the Commission in the course of its deliberations. Scientific research has produced substantial social benefits. It has also posed some troubling ethical questions. The code consists of rules, some general and others specific, that guide investigators and reviewers of research in their work. It was depressing to read how some early research participants were treated unethically, but the history teaches us a systematic way to avoid repeating those unfair practices.
All interested parties, including scientists, research subjects, and reviewers, will be trained on the research scope and the extent of data collection required for this analysis. The main objective is to follow an analytical framework that will guide the resolution of ethical problems arising from research involving firefighters' health reports.
However, some firefighters may not be capable of self-determination, the capacity for self-determination may mature during research participation, and some participants may not be in a position to exercise their liberty because of illness. Subjects are to be treated in an ethical manner, not only by respecting their decisions but also by protecting them from harm and securing their well-being.
Just as the principle of respect for persons finds expression in the requirement for informed consent, and the principle of beneficence in risk/benefit assessment, the principle of justice gives rise to moral requirements that there be fair procedures and outcomes in the selection of firefighters' events and health histories.