This document discusses the basics of descriptive data analysis, including defining different types of variables, coding principles, and univariate data analysis. It describes continuous variables, which are always numeric, and categorical variables such as ordinal, nominal, and dichotomous. The document outlines how to code different variable types using values and labels and provides examples. It also covers cleaning data, calculating basic statistics, and examining the distribution and frequencies of variables to check data quality in univariate analysis.
2. Goals
Describe the steps of descriptive data analysis
Be able to define variables
Understand basic coding principles
Learn simple univariate data analysis
3. Types of Variables
Continuous variables:
Always numeric
Can be any number, positive or negative
Examples: age in years, weight, blood pressure readings, temperature, concentrations of pollutants, and other measurements
Categorical variables:
Information that can be sorted into categories
Types of categorical variables – ordinal, nominal, and dichotomous (binary)
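As a quick illustration, the distinction can be sketched in Python (the sample values below are made up for the example):

```python
# Continuous: numeric measurements; any number, positive or negative
age_years = [34, 51, 27, 68]
temperature_c = [-3.5, 0.0, 21.7]

# Categorical: values that sort into categories
sex = ["Male", "Female", "Female"]         # nominal
rating = ["good", "fair", "excellent"]     # ordinal
attended_picnic = ["Yes", "No", "Yes"]     # dichotomous (binary)

# Continuous variables support arithmetic; categorical ones do not
mean_age = sum(age_years) / len(age_years)
```

Averaging ages is meaningful; "averaging" pet types is not, which is why the two kinds of variables are analyzed differently.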
4. Categorical Variables: Ordinal Variables
Ordinal variable – a categorical variable with some intrinsic order or numeric value
Examples of ordinal variables:
Education (no high school degree, HS degree, some college, college degree)
Agreement (strongly disagree, disagree, neutral, agree, strongly agree)
Rating (excellent, good, fair, poor)
Frequency (always, often, sometimes, never)
Any other scale ("On a scale of 1 to 5...")
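A minimal sketch of how an ordinal scale might be coded so its intrinsic order is preserved (the 1-to-5 numbering is one common convention, not the only one):

```python
# Agreement categories in their natural order
AGREEMENT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Assign codes 1..5 that follow the ordering
code = {label: i + 1 for i, label in enumerate(AGREEMENT)}

answers = ["agree", "neutral", "strongly agree"]
coded = [code[a] for a in answers]

# Because the codes preserve order, comparisons are meaningful
most_agreeable = max(coded)
```

Note that the codes only rank the categories; the distance between "neutral" and "agree" is not necessarily equal to the distance between "agree" and "strongly agree".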
5. Categorical Variables: Nominal Variables
Nominal variable – a categorical variable without an intrinsic order
Examples of nominal variables:
Where a person lives in the U.S. (Northeast, South, Midwest, etc.)
Sex (male, female)
Nationality (American, Mexican, French)
Race/ethnicity (African American, Hispanic, White, Asian American)
Favorite pet (dog, cat, fish, snake)
6. Categorical Variables: Dichotomous Variables
Dichotomous (or binary) variable – a categorical variable with only 2 levels or categories
Often represents the answer to a yes-or-no question
For example:
"Did you attend the church picnic on May 24?"
"Did you eat potato salad at the picnic?"
Anything with only 2 categories
7. Coding
Coding – process of translating information
gathered from questionnaires or other
sources into something that can be analyzed
Involves assigning a value to the information
given—often value is given a label
Coding can make data more consistent:
Example: Question = Sex
Answers = Male, Female, M, or F
Coding will avoid such inconsistencies
8. Coding Systems
Common coding systems (code and label) for
dichotomous variables:
0 = No, 1 = Yes
(1 = value assigned, “Yes” = label of that value)
OR: 1 = No, 2 = Yes
When you assign a value you must also make it clear
what that value means
In the first example above 1 = Yes, but in the second example 1 = No
As long as it is clear how the data are coded, either is fine
You can make it clear by creating a data dictionary to
accompany the dataset
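The data dictionary idea can be sketched in a few lines of Python; the variable names and codes below are hypothetical, not from any real dataset:

```python
# A minimal data dictionary kept alongside the dataset, so anyone can
# see what each coded value means. Variables and codes are hypothetical.
data_dictionary = {
    "SMOKER": {0: "No", 1: "Yes"},
    "SEX": {0: "Male", 1: "Female"},
}

def decode(variable, value):
    """Look up the label attached to a coded value."""
    return data_dictionary[variable][value]

print(decode("SMOKER", 1))  # Yes
print(decode("SEX", 0))     # Male
```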
9. Coding: Dummy Variables
A “dummy” variable is any variable that is coded to
have 2 levels (yes/no, male/female, etc.)
Dummy variables may be used to represent more
complicated variables
Example: # of cigarettes smoked per week--answers total 75
different responses ranging from 0 cigarettes to 3 packs per
week
Can be recoded as a dummy variable:
1=smokes (at all) 0=non-smoker
This type of coding is useful in later stages of
analysis
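A minimal sketch of that recoding in Python; the weekly cigarette counts are invented:

```python
# Recoding a continuous count (cigarettes smoked per week) into a dummy
# variable: 1 = smokes at all, 0 = non-smoker. Sample values are invented.
cigarettes_per_week = [0, 0, 14, 60, 0, 7, 21]

smoker = [1 if n > 0 else 0 for n in cigarettes_per_week]
print(smoker)  # [0, 0, 1, 1, 0, 1, 1]
```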
10. Coding:
Attaching Labels to Values
Many analysis software packages allow you to attach a
label to the variable values
Example: Label 0’s as male and 1’s as female
Makes reading data output easier:
Without label:
Variable SEX   Frequency   Percent
0              21          60%
1              14          40%
With label:
Variable SEX   Frequency   Percent
Male           21          60%
Female         14          40%
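A sketch of how such a labeled frequency table can be produced in plain Python; the counts match the slide, but the data layout and label names are assumptions:

```python
# Producing a labeled frequency table from coded values (0 = Male,
# 1 = Female). The data are invented to match the counts in the slide.
from collections import Counter

sex = [0] * 21 + [1] * 14          # 21 males, 14 females
labels = {0: "Male", 1: "Female"}

counts = Counter(sex)
total = sum(counts.values())
for code in sorted(counts):
    pct = 100 * counts[code] / total
    print(f"{labels[code]:<8} {counts[code]:>3} {pct:.0f}%")
```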
11. Coding- Ordinal Variables
Coding process is similar with other categorical
variables
Example: variable EDUCATION, possible coding:
0 = Did not graduate from high school
1 = High school graduate
2 = Some college or post-high school education
3 = College graduate
Could be coded in reverse order (0=college graduate,
3=did not graduate high school)
For this ordinal categorical variable we want to be
consistent with numbering because the value of the
code assigned has significance
12. Coding – Ordinal Variables
(cont.)
Example of bad coding:
0 = Some college or post-high school education
1 = High school graduate
2 = College graduate
3 = Did not graduate from high school
The data have an inherent order, but the coding does
not follow that order—NOT appropriate
coding for an ordinal categorical variable
13. Coding: Nominal Variables
For coding nominal variables, order makes no
difference
Example: variable RESIDE
1 = Northeast
2 = South
3 = Northwest
4 = Midwest
5 = Southwest
Order does not matter, no ordered value
associated with each response
14. Coding: Continuous Variables
Creating categories from a continuous variable (e.g.,
age) is common
May break down a continuous variable into chosen
categories by creating an ordinal categorical variable
Example: variable = AGECAT
1 = 0–9 years old
2 = 10–19 years old
3 = 20–39 years old
4 = 40–59 years old
5 = 60 years or older
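The AGECAT recoding can be sketched as a small Python function; the cut points follow the categories above, and the sample ages are invented:

```python
# Recoding continuous AGE values into the ordinal AGECAT categories.
def agecat(age):
    if age <= 9:
        return 1   # 0-9 years old
    if age <= 19:
        return 2   # 10-19 years old
    if age <= 39:
        return 3   # 20-39 years old
    if age <= 59:
        return 4   # 40-59 years old
    return 5       # 60 years or older

ages = [4, 15, 27, 44, 71]
print([agecat(a) for a in ages])  # [1, 2, 3, 4, 5]
```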
15. Coding:
Continuous Variables (cont.)
May need to code responses from fill-in-the-blank
and open-ended questions
Example: “Why did you choose not to see a doctor about
this illness?”
One approach is to group together responses with
similar themes
Example: “didn’t feel sick enough to see a doctor”,
“symptoms stopped,” and “illness didn’t last very long”
Could all be grouped together as “illness was not severe”
Also need to code for “don’t know” responses
Typically, “don’t know” is coded as 9
16. Coding Tip
Though you do not code until the data
is gathered, you should think about how
you are going to code while designing
your questionnaire, before you gather
any data. This will help you to collect
the data in a format you can use.
17. Data Cleaning
One of the first steps in analyzing data is to
“clean” it of any obvious data entry errors:
Outliers? (really high or low numbers)
Example: Age = 110 (really 10 or 11?)
Value entered that doesn’t exist for variable?
Example: 2 entered where 1=male, 0=female
Missing values?
Did the person not give an answer? Was answer
accidentally not entered into the database?
18. Data Cleaning (cont.)
May be able to set defined limits when entering data
Prevents entering a 2 when only 1, 0, or missing are
acceptable values
Limits can be set for continuous and nominal
variables
Examples: Only allowing 3 digits for age, limiting words that
can be entered, assigning field types (e.g. formatting dates
as mm/dd/yyyy or specifying numeric values or text)
Many data entry systems allow “double-entry” – i.e.,
entering the data twice and then comparing both
entries for discrepancies
Univariate data analysis is a useful way to check the
quality of the data
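The cleaning checks above can be sketched in Python; the records, the valid-age limit of 0–105, and the message strings are all hypothetical:

```python
# Simple cleaning checks: flag out-of-range ages (possible entry errors),
# values that don't exist for a coded variable, and missing entries (None).
records = [
    {"AGE": 34, "SEX": 1},
    {"AGE": 110, "SEX": 0},   # outlier: really 10 or 11?
    {"AGE": 25, "SEX": 2},    # 2 doesn't exist where 1=male, 0=female
    {"AGE": None, "SEX": 1},  # missing value
]

problems = []
for i, rec in enumerate(records):
    if rec["AGE"] is None:
        problems.append((i, "AGE missing"))
    elif not 0 <= rec["AGE"] <= 105:
        problems.append((i, "AGE out of range"))
    if rec["SEX"] not in (0, 1):
        problems.append((i, "SEX not a valid code"))

for i, msg in problems:
    print(f"record {i}: {msg}")
```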
19. Univariate Data Analysis
Univariate data analysis – explores each
variable in a data set separately
Serves as a good method to check the quality of
the data
Inconsistencies or unexpected results should be
investigated using the original data as the
reference point
Frequencies can tell you if many study
participants share a characteristic of interest
(age, gender, etc.)
Graphs and tables can be helpful
20. Univariate Data Analysis (cont.)
Examining continuous variables can give you
important information:
Do all subjects have data, or are values missing?
Are most values clumped together, or is there a lot
of variation?
Are there outliers?
Do the minimum and maximum values make
sense, or could there be mistakes in the coding?
21. Univariate Data Analysis (cont.)
Commonly used statistics with univariate
analysis of continuous variables:
Mean – average of all values of this variable in
the dataset
Median – the middle of the distribution, the
number where half of the values are above and
half are below
Mode – the value that occurs the most times
Range of values – from minimum value to
maximum value
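These statistics can be computed with Python’s standard statistics module; the values below are invented and do not reproduce the slide’s figure:

```python
# Common univariate statistics for a continuous variable,
# computed with the standard library. The values are invented.
import statistics

values = [2, 28, 28, 33, 36, 41, 84]

print("mean:", statistics.mean(values))      # average of all values
print("median:", statistics.median(values))  # middle of the distribution
print("mode:", statistics.mode(values))      # most frequent value
print("range:", min(values), "to", max(values))
```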
22. Statistics describing a continuous
variable distribution
2 = Minimum
84 = Maximum (an outlier)
28 = Mode (occurs twice)
33 = Mean
36 = Median (50th percentile)
23. Standard Deviation
Figure left: narrowly distributed age values (SD = 7.6)
Figure right: widely distributed age values (SD = 20.4)
24. Distribution and Percentiles
Distribution –
whether most values
occur low in the
range, high in the
range, or grouped in
the middle
Percentiles – the
percent of the
distribution that is
equal to or below a
certain value
[Figure: distribution curves for variable AGE, showing the 25th
percentile at 4 years in one curve and at 6 years in the other]
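Quartiles (the 25th, 50th, and 75th percentiles) can be computed with statistics.quantiles; the ages below are invented:

```python
# Quartiles cut the distribution at the 25th, 50th, and 75th percentiles:
# 25% of values fall at or below the first quartile, and so on.
import statistics

ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

q1, q2, q3 = statistics.quantiles(ages, n=4)
print("25th percentile:", q1)
print("50th percentile (median):", q2)
print("75th percentile:", q3)
```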
25. Analysis of Categorical Data
Distribution of categorical variables should be
examined before more in-depth analyses
Example: variable RESIDE
[Figure: number of people answering the example questionnaire
who reside in 5 regions of the United States]
26. Analysis of Categorical Data (cont.)
Another way to look at the data is to list
the data categories in tables
The table shown gives the same information as
the previous figure, but in a different format

Region      Frequency   Percent
Midwest     16          20%
Northeast   13          16%
Northwest   19          24%
South       24          30%
Southwest   8           10%
Total       80          100%

Table: Number of people answering the sample
questionnaire who reside in 5 regions of the United States
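A sketch of rebuilding this frequency table in Python from the raw category counts; the counts are those shown in the slide:

```python
# Building a frequency/percent table for the categorical variable RESIDE.
# The counts come from the slide's table (n = 80).
counts = {"Midwest": 16, "Northeast": 13, "Northwest": 19,
          "South": 24, "Southwest": 8}

total = sum(counts.values())
for region in sorted(counts):
    pct = 100 * counts[region] / total
    print(f"{region:<10} {counts[region]:>3} {pct:.0f}%")
print(f"{'Total':<10} {total:>3} 100%")
```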
27. Observed vs. Expected Distribution
Education variable
Observed distribution of education levels (top);
expected distribution of education (bottom) (1)
Comparing the graphs shows a more educated
study population than expected
Are the observed data really that different
from the expected data?
Answering that question requires further
exploration with statistical tests
[Figures: observed data on level of education from a hypothetical
questionnaire; data on the education level of the US population
aged 20 years or older, from the US Census Bureau]
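The question of whether observed counts really differ from an expected distribution is typically explored with a chi-square goodness-of-fit statistic. Below is a sketch of computing that statistic by hand; the counts and expected shares are invented for illustration, and a real analysis would compare the result to a chi-square distribution to get a p-value:

```python
# Chi-square goodness-of-fit statistic: sum of (observed - expected)^2 / expected
# across categories. Counts and expected proportions are hypothetical.
observed = [10, 25, 30, 35]                 # counts in 4 education levels
expected_share = [0.15, 0.30, 0.30, 0.25]   # assumed population distribution

n = sum(observed)
expected = [n * p for p in expected_share]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 6.5
```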
28. Conclusion
Defining variables and coding them are the
first steps in data analysis
Simple univariate analysis may be used
with continuous and categorical
variables
Further analysis may require statistical
tests such as chi-squares and other
more extensive data analysis
29. References
1. US Census Bureau. Educational Attainment in the
United States: 2003—Detailed Tables for Current
Population Report, P20-550 (All Races). Available at:
http://www.census.gov/population/www/socdemo/education/cps2003.html.
Accessed December 11, 2006.
Editor's Notes
Unlike the depiction of epidemiologists in some television shows, after gathering data you don’t simply have a brilliant flash of insight and solve the outbreak; you actually have to sit down and analyze that data! It is not the most glamorous part of the epidemiologist’s job, but when the data lead to the source of an outbreak, the analysis is definitely rewarding. This issue of FOCUS will take you through the basic steps of descriptive data analysis, including types of variables, basic coding principles and simple univariate data analysis.
Before delving into analysis, let’s take a moment to discuss variables. This may seem a trivial topic to those with analysis experience, but variables are not a trivial matter. Much like people, variables come in many different sizes and shapes. Most field epidemiology, however, relies on garden-variety continuous and categorical variables.
Continuous variables are always numeric and theoretically can be any number, positive or negative (in reality, this depends upon the variable). Examples of continuous variables are age in years, weight, blood pressure readings, indoor and outdoor temperature, concentrations of pollutants in the air or water, and other measurements.
Categorical variables contain information that can be sorted into categories, rather like sorting information into bins. Every piece of information belongs in one—and only one—bin. There are several types of categorical variables: ordinal, nominal, and dichotomous or binary.