This document provides an overview of data analytics topics including big data, database structure and management, and statistical analysis. It introduces big data concepts like volume, velocity, variety and veracity. It discusses database structure, relationships, and how to manage data through roadmaps and health checks. It also introduces statistical concepts like descriptive statistics, distributions, and regression analysis and how they can be applied in healthcare.
A Hybrid Apporach of Classification Techniques for Predicting Diabetes using ... (ijtsrd)
Diabetes is predicted by a classification technique. The data mining tool WEKA was used to implement the Support Vector Machine (SVM) classifier. The proposed work is framed with the goal of improving model performance. To improve classification accuracy, the Support Vector Machine is combined with feature selection and a percentage split. Trial results demonstrated a marked improvement over the existing Support Vector Machine classifier. This approach enhances classification accuracy and reduces computational time. S. Jaya Mala, "A Hybrid Apporach of Classification Techniques for Predicting Diabetes using Feature Selection", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd27991.pdf
Paper URL: https://www.ijtsrd.com/computer-science/data-miining/27991/a-hybrid-apporach-of-classification-techniques-for-predicting-diabetes-using-feature-selection/s-jaya-mala
Data mining techniques are rapidly being developed for many applications. In recent years, data mining in healthcare has become an emerging field for the research and development of intelligent medical diagnosis systems. Classification is a major research topic in data mining, and decision trees are popular methods for classification. In this paper, several decision tree classifiers are used for the diagnosis of medical datasets: the AD Tree, J48, NB Tree, Random Tree, and Random Forest algorithms. Heart disease, diabetes, and hepatitis disorder datasets are used to test the decision tree models. Aung Nway Oo | Thin Naing, "Decision Tree Models for Medical Diagnosis", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23510.pdf
Paper URL: https://www.ijtsrd.com/computer-science/data-miining/23510/decision-tree-models-for-medical-diagnosis/aung-nway-oo
Independent forces on the biomedical ecosystem are causing a convergence of care, quality measurement, and clinical research at the point of care. The presentation outlines some of the informatics implications of this convergence.
RDAP 16 Poster: Responding to Data Management and Sharing Requirements in the... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Presenter:
Caitlin Bakker, University of Minnesota
The Simulacrum, a Synthetic Cancer Dataset (CongChen35)
This presentation describes the applications of synthetic data to cancer registries' efforts to support understanding of, and research on, cancer while reducing privacy risks to cancer patients.
The Simulacrum imitates some of the data held securely by the Public Health England’s National Cancer Registration and Analysis Service.
The data in the Simulacrum is entirely artificial. It does not contain data about real patients, so users can never identify a real person. It is free to use and allows anyone who wants to use record-level cancer data to do so, safe in the knowledge that while the data feels like the real thing, there is no danger of breaching patient confidentiality.
Using machine learning to improve the user experience in online health care c... (Anja Pilz)
Talk held at the Cologne AI and Machine Learning Meetup #CAIML
DocCheck is a medical community for health care professionals. Doctors, pharmacists, students and other healthcare professionals use this platform for online learning, to exchange with peers and to actively contribute their expertise. They seek detailed information in the extensive medicine wiki DocCheck Flexikon, read the bi-weekly edition of DocCheck News, share and discuss medical images in the image archive DocCheck Pictures, or buy medical products and supplies in the online shop. Each of our user groups has different intentions and interests: A student might want to learn anatomical topics in some order and a cardiologist is usually interested in different news than a pharmacist. The ultimate goal is to find the most relevant and interesting assets for each target group to enable targeted mailing and feed personalization. At this point, to improve user experience, we provide related content across different media types in a fully automated fashion. For instance starting from a medical text about a specific disease, we want to offer the most relevant related articles but also news, pictures, videos or even products from the online shop. In this talk, we will focus on the websites with the highest click frequency: the medicine wiki Flexikon. We will show how we automatically find related assets using both content based models as well as models derived from user behaviour. Both approaches are backed by machine learning techniques, namely Latent Dirichlet Allocation and Association Rule Learning. We will give some technical details and share insights on the practical aspects and pitfalls.
THE TECHNOLOGY OF USING A DATA WAREHOUSE TO SUPPORT DECISION-MAKING IN HEALTH... (ijdms)
This paper describes data warehouse technology in healthcare decision-making and the tools that support it, as applied to cancer diseases. Healthcare executive managers and doctors need information about, and insight into, the existing health data, so as to make decisions more efficiently without interrupting the daily work of an On-Line Transaction Processing (OLTP) system. This is a complex problem during the healthcare decision-making process. To solve it, building a healthcare data warehouse appears efficient. First, this paper explains the concepts of the data warehouse and On-Line Analytical Processing (OLAP). Transforming the data in the data warehouse into a multidimensional data cube is then shown. Finally, an application example illustrates the use of the healthcare data warehouse specific to cancer diseases developed in this study. Executive managers and doctors can view data from more than one perspective with reduced query time, thus making decisions faster and more comprehensively.
Mobilizing informational resources for rare diseases (Maria Shkrob)
Providing comprehensive disease-specific summaries remains a serious challenge, as information is scattered across multiple resources. Elsevier is collaborating with the rare disease charity Findacure to create an informational portal for patients, researchers, and doctors to help find new treatments, increase awareness, and streamline information exchange and education. Using an integrative approach of automated and manual curation of the literature, we constructed a knowledgebase containing an overview of disease mechanisms, targets, drugs, key opinion leaders, and institutions. To demonstrate the utility of this approach, congenital hyperinsulinism will be discussed.
Assessing Research Impact: Bibliometrics, Citations and the H-Index (Fintan Bracken)
Talk presented by Dr. Fintan Bracken at the Mary Immaculate College Research Day on 1st September 2015. The talk looked at assessing and maximising the impact of the arts and humanities research conducted at Mary Immaculate College in Limerick, Ireland.
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Presenters:
Jan Cheetham, University of Wisconsin-Madison
Wendy Kozlowski, Cornell University
Data mining is a powerful method to extract knowledge from data. Raw data poses various challenges that make traditional methods unsuitable for knowledge extraction.
Data mining is expected to handle various data types in all formats.
Medical data mining is a multidisciplinary field combining contributions from medicine and data mining.
Each paper is studied based on six medical tasks: screening, diagnosis, treatment, prognosis, monitoring, and management.
Data for Impact hosted a one-hour webinar sharing guidance for using routine data in evaluations. More: https://www.data4impactproject.org/resources/webinars/routine-data-use-in-evaluation-practical-guidance/
Principles of data collection include the principles, types, sources, and methods of data collection, which will help medical students build their own tools for data collection.
Prof Mendel Singer: Big Data Meets Public Health and Medicine, 2018-12-22 (mjbinstitute)
Presentation by Prof. Mendel Singer of Case Western Reserve University, on the issue of "big data" in health care and policy research. Presented at the Myers-JDC-Brookdale Institute in Jerusalem.
Follow our presentation to learn about the role of statistical analysis in fraud detection. From data mining to clustering, learn the techniques necessary to quickly anticipate and detect health care fraud, waste, and abuse.
A community needs assessment identifies the strengths and resources available in the community to meet the needs of children, youth, and families. The assessment focuses on the capabilities of the community, including its citizens, agencies, and organizations.
Sills MR. Overview of the SAFTINet Program. Presented to the Emergency Department Research Committee, Department of Pediatrics, University of Colorado School of Medicine. 6 January 2015.
Why should we care about integrating data? What should we be trying to achieve? Population Health. The Softer, Human Side of Being “Data Driven” not “Driven By Data." The New Era of Decision Support in Healthcare. Top 10 Challenges To Integrating External Data.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
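An automated data validation check like the one described can start very small. Below is a minimal sketch over a hypothetical list of patient records (the field names and ranges are illustrative; a production pipeline would typically use a dedicated validation framework):

```python
# A minimal automated data-quality check over a hypothetical list of
# patient records; a production pipeline would typically use a dedicated
# validation framework.

def validate_records(records):
    """Return (row_index, problem) pairs found at the source."""
    problems = []
    for i, rec in enumerate(records):
        if not rec.get("patient_id"):
            problems.append((i, "missing patient_id"))
        if not 0 <= rec.get("systolic_bp", -1) <= 300:
            problems.append((i, "systolic_bp out of range"))
    return problems

rows = [
    {"patient_id": "5465", "systolic_bp": 128},
    {"patient_id": "",     "systolic_bp": 410},  # fails both checks
]
print(validate_records(rows))  # [(1, 'missing patient_id'), (1, 'systolic_bp out of range')]
```

Running such checks at ingestion time catches errors at the source, before they propagate downstream.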
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and can thus also reduce iteration time. Road networks often contain chains that can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
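The convergence-skipping idea can be sketched in a few lines of power iteration. This is an illustrative toy, not the STICD implementation; the graph, damping factor, and tolerance are assumptions, and dangling nodes are not handled:

```python
# A minimal sketch of power-iteration PageRank that skips recomputation
# for vertices whose rank has stopped changing, one of the per-iteration
# savings described above.

def pagerank(out_links, d=0.85, tol=1e-10, max_iter=100):
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    rank = {v: 1.0 / n for v in out_links}
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in out_links:
            if v in converged:        # heuristic: treat a settled rank as final
                new_rank[v] = rank[v]
                continue
            s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1 - d) / n + d * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:       # every vertex has converged
            break
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pagerank(g))  # a 3-cycle: every rank settles at 1/3
```

Note the skip is a heuristic: a frozen vertex is not recomputed even if its in-neighbors later move, which is exactly the accuracy/speed trade-off such optimizations manage.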
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects demand, and the changing evolution of supply, to be facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Topics to be covered
• Introducing Big Data
• Big Data in healthcare
• Database structure and management
• Database structure
• How to manage your data
• Statistical analysis in population health management
• Introduction to statistics
• Statistical analysis in healthcare
3. Introducing Big Data
• Information that can’t be processed or analyzed using traditional
processes or tools
• There are four dimensions to Big Data: Volume, Velocity, Variety,
Veracity
• Challenges with Big Data: Capturing, Storing, Searching, Sharing &
Analyzing
4. Introducing Big Data
• Volume
• The amount of data being collected is unprecedented
• The volume of data available is on the rise, while the percent that can be
analyzed is on the decline. This is known as the data blind zone.
• Velocity
• The rate at which the data is being generated needs to be handled
• How quickly is the data arriving and stored?
• How quickly can you process the data?
5. Introducing Big Data
• Variety
• With an increase in quantity comes an increase in the variety of data types
• Issues with storing complex data
• Analyzing all different types of data
• Veracity
• The accuracy of data becomes more important as we use more of it
• Garbage in, garbage out
6. Introducing Big Data
• Big Data challenges:
• Capturing
• Data is initially pulled from all sorts of different places
• Storing
• Data is kept in different locations (virtual or otherwise)
• Security concerns
• Searching
• Having a database capable of handling searches
• Optimizing a database for searches
7. Introducing Big Data
• Big Data challenges:
• Sharing
• There are valid security concerns
• The data variety poses a problem when sharing
• Analyzing
• Extracting the data isn’t easy
• Data variety poses a significant problem
• Sheer volume of data makes it difficult to focus
8. Big Data in Healthcare
• Incentives for big data use are rising
• Movement to evidence-based care
• Increase in available technologies for data collection, analysis and
communication
• The ultimate goal is improving patient health while reducing costs
9. Big Data in Healthcare
• Volume
• Healthcare data is more plentiful than ever
• Velocity
• Data flows in real time and is processed in real time
• Variety
• Billing information and clinical information
• Veracity
• Data accuracy is vital to an organization
10. Big Data in Healthcare
• Challenges
• Mixing healthcare with IT
• The availability of data has exploded
• How do you handle the influx of data?
• Finding the relevant data to mine
12. Database Structure
• A structured set of data held in a computer, especially one that is
accessible in various ways (or not so accessible in some cases).
• Data are organized in database tables, which consist of rows and
columns.
• Each row is called a record, object, or entity. Each column is called a
field or attribute.
• Each column should contain a single data type, but a row can contain
different data types across its columns.
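The table/row/column structure above can be seen directly with Python's built-in sqlite3 module; the patient table and its fields here are illustrative:

```python
# A small table whose columns (fields) each hold one data type and whose
# rows are individual records, using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient (
        patient_id  INTEGER,  -- each column (field/attribute) holds one data type
        last_name   TEXT,
        systolic_bp INTEGER
    )
""")
conn.execute("INSERT INTO patient VALUES (5465, 'Smith', 128)")  # each row is a record
conn.execute("INSERT INTO patient VALUES (5466, 'Jones', 141)")
for row in conn.execute("SELECT * FROM patient"):
    print(row)
# (5465, 'Smith', 128)
# (5466, 'Jones', 141)
```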
14. Database Structure
• Two types of keys, primary and foreign
• A primary key makes a row of data unique; it can be made up of
multiple columns
• A foreign key is a column or group of columns in a relational database
table that provides a link between data in two tables
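Both key types can be sketched with sqlite3; the provider/patient schema below is hypothetical, not a real EMR design:

```python
# Primary key: makes each row unique. Foreign key: links patient rows to
# an existing provider row. Illustrative schema only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE provider (provider_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE patient (
        patient_id  INTEGER PRIMARY KEY,                      -- makes each row unique
        provider_id INTEGER REFERENCES provider(provider_id)  -- links the two tables
    )
""")
conn.execute("INSERT INTO provider VALUES (1, 'Dr. Adams')")
conn.execute("INSERT INTO patient VALUES (5465, 1)")
try:
    conn.execute("INSERT INTO patient VALUES (5466, 99)")  # provider 99 does not exist
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)  # the foreign key constraint blocks the bad row
```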
16. Database Structure
• Database relationships can be of three different types:
• One-to-one
• One-to-many
• Many-to-many
17. Database Structure
• One-to-One Relationships
• A key will appear only once in a related table.
• Example: A patient can only be assigned one primary care provider
18. Database Structure
• One-to-Many Relationships
• Keys from one table will appear multiple times in a related table
• Example: One provider can be assigned multiple patients in paneling
19. Database Structure
• Many-to-Many relationships
• The key value of one table can appear many times in a related table, but the
opposite also holds true!
• Example: A patient can see multiple different providers and a provider can see
multiple different patients
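Relational databases model a many-to-many relationship with a junction table. Here is a sketch with sqlite3, using an illustrative patient_provider table:

```python
# Many-to-many: each patient can link to many providers and vice versa,
# via the patient_provider junction table. Illustrative IDs only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient  (patient_id  INTEGER PRIMARY KEY);
    CREATE TABLE provider (provider_id INTEGER PRIMARY KEY);
    CREATE TABLE patient_provider (      -- the junction table
        patient_id  INTEGER REFERENCES patient(patient_id),
        provider_id INTEGER REFERENCES provider(provider_id),
        PRIMARY KEY (patient_id, provider_id)
    );
    INSERT INTO patient  VALUES (5465), (5466);
    INSERT INTO provider VALUES (1), (2);
    INSERT INTO patient_provider VALUES (5465, 1), (5465, 2), (5466, 1);
""")
# one patient sees many providers...
rows = conn.execute(
    "SELECT provider_id FROM patient_provider WHERE patient_id = 5465"
).fetchall()
print(rows)  # [(1,), (2,)]
```

The composite primary key on the junction table prevents the same patient/provider pair from being linked twice.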
20. How to Manage Your Data
• The importance of managing your database
• Your database is composed of data and is built by the software companies.
You can effectively manage what goes INTO your database.
• It plays an important role in improving the performance of an organization’s
health care systems.
• Collecting, analyzing, interpreting, and acting on data for specific
performance measures allows health care professionals to identify where
systems are falling short, to make corrective adjustments, and to track
outcomes.
21. How to Manage Your Data
• Developing an EMR data roadmap
• First determine what you need to collect
• Next, identify where the data is able to be entered
• Find out who is entering it
• Develop a roadmap of your data using a spreadsheet
• Rows would correspond to the data being collected
• Columns would contain the where and who
22. How to Manage Your Data
• Data roadmap example:
Measure Name | Data Item | Field Name | Employee
Colorectal Cancer | Colonoscopy Result | healthmaintenance.table | MD
Colorectal Cancer | Colonoscopy Date | diagnostichistory.table | MA
Colorectal Cancer | Colonoscopy Document | referralorder.table | RN
Colorectal Cancer | FIT Outside lab result | outsidelabs.table | MA
Colorectal Cancer | FIT Quest lab result | emrlabs.table | MD
Hypertension | Systolic BP | vitalssys.table | MA
Hypertension | Diastolic BP | vitalsdys.table | MA
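A roadmap like the one above can be kept as a plain CSV spreadsheet. Below is a sketch using Python's csv module, with two rows taken from the example:

```python
# Writing the data roadmap as a CSV spreadsheet: one row per data item,
# columns for where the data is entered and who enters it.
import csv
import io

roadmap = [
    # (measure name, data item, field name / where entered, employee / who enters)
    ("Colorectal Cancer", "Colonoscopy Result", "healthmaintenance.table", "MD"),
    ("Hypertension",      "Systolic BP",        "vitalssys.table",         "MA"),
]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Measure Name", "Data Item", "Field Name", "Employee"])
writer.writerows(roadmap)
print(buf.getvalue())
```

In practice you would write to a file (and, per the health-check advice below, add a new tab or file per review rather than deleting columns).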
23. How to Manage Your Data
• Data Health Checks
• They are periodic reviews of your EMR data's integrity
• Establish a timeline for the data health checks; yearly is recommended.
• Assemble your data health check team; members from different departments are
recommended
• Document your data health checks, and don’t delete roadmap columns. Simply add
another tab in your spreadsheet.
24. How to Manage Your Data
• Creating data workflows
• Use the data roadmap to streamline workflows
• Duplicate data entry
• Redundant data workflows
• Too many places to document
• Too many variations in your data types
• Standardize the process
• Involve the end-users in the process
• Use a diverse team, the same team that does the Data Health Checks works
well
25. Statistical Analysis in PHM
• Statistical analysis involves using the scientific method to answer
questions and make decisions
• It involves designing the studies, collecting good data, describing the
data with numbers and graphs, analyzing the data, and then making
conclusions.
26. Introduction to Statistics
• Statistics are everywhere, from healthcare to marketing.
• Usually statistics deals with two different sets of data:
• Population:
• The set of individual persons or objects in which an investigator is primarily
interested during his or her research problem
• Sample:
• That part of the population from which information is collected
27. Introduction to Statistics
• There are two major types of statistics
• Descriptive: methods for organizing and summarizing information
• Inferential: methods for drawing and measuring the reliability of conclusions
about a population
• Descriptive statistics involves graphs, charts, tables, etc.
• Inferential statistics is predictive and includes methods like point
estimation, interval estimation and hypothesis testing
28. Introduction to Statistics
• Descriptive Statistics Example:

PatientID  Tobacco Cessation
5465       Yes
5466       No
5467       Yes
5468       Yes
5469       No
5470       Yes
5471       Yes
5472       Yes
5473       Yes
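The table above can be summarized in a few lines of Python; this sketch simply transcribes the responses from the table and counts them:

```python
from collections import Counter

# Tobacco cessation responses transcribed from the table above
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"]

counts = Counter(responses)
print(counts["Yes"], counts["No"])                   # 7 2
print(round(counts["Yes"] / len(responses) * 100))   # 78 (percent "Yes")
```

Counts and percentages like these are descriptive statistics: they organize and summarize the information without making predictions.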
29. Introduction to Statistics
• Independent and Dependent Variables
• Independent variables are manipulated by an experimenter
• Example: A provider wants to know which medication is best for depression
and has four antidepressants to choose from. Which medication the provider
gives out is the independent variable.
• Dependent variables are the results of the experiment
• Example: After a period of time, the provider interviews the patients to see
what their PHQ score is; the PHQ score is the dependent variable.
30. Introduction to Statistics
• Distribution
• Distribution has to do with the frequency of the data
• Example: You purchase a bag of Skittles. Skittles come in different colors;
how many of each color are found in the bag?
• This is known as a frequency table, which describes the Skittles color
frequencies

Color   Count
Green   15
Blue    8
Yellow  10
Purple  6
Red     12
31. Introduction to Statistics
• Continuous Variables
• Sometimes your data varies continuously, and you never have a clear-cut
data set like in our Skittles example
• When your data is varied, you can build a grouped frequency distribution and
look at your data in histogram form
• Example:
• We’re much better off looking at the data in grouped frequency rather than
looking at each HgbA1c result
HgbA1c Value  Count
<7            253
7 to <8       700
8 to <9       740
≥9            141
32. Introduction to Statistics
• Probability Distributions: Discrete vs Continuous
• Depends on whether they define probabilities associated with discrete
variables or continuous variables.
• Discrete vs. Continuous Variables
• If a variable can take on any value between two specified values, it is called
a continuous variable; otherwise, it is called a discrete variable.
• Example:
• The weight distribution of a patient population is a continuous variable: a
weight can take any value within a range (e.g., 70.25 kg)
• The number of visits a patient makes in a year is a discrete variable: it can
only take whole-number values
33. Introduction to Statistics
• Probability Densities
• A probability density describes how likely a continuous variable is to fall
within a given range of values. The resulting curve is called a continuous
distribution.
• The normal (bell) distribution, a type of continuous distribution, describes
many natural phenomena
34. Introduction to Statistics
• Distribution shapes
• If you folded the figure on the previous slide in half, you would get two equal
halves. However, not all distributions are symmetrical.
• A distribution with a longer “tail” in the positive direction is said to have a
“positive skew”; it is also known as “skewed to the right”:
36. Introduction to Statistics
• All the distributions so far have had one distinct high point or peak.
When distributions have two peaks in the data, this is called a
bimodal distribution:
37. Introduction to Statistics
• Some statistics definitions
• Mean – add up all the numbers and divide by how many numbers there are
• Median – the middle value in the list of numbers; the numbers have to be listed
in numerical order
• Mode – the value that occurs most often
• Range – the difference between the largest and smallest values
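These four definitions can be checked directly with Python’s built-in statistics module, using an illustrative data set:

```python
from statistics import mean, median, mode

values = [7, 3, 9, 3, 5]   # illustrative data set

print(mean(values))               # 5.4
print(median(values))             # 5  (sorted: 3, 3, 5, 7, 9)
print(mode(values))               # 3  (occurs twice)
print(max(values) - min(values))  # 6  -> the range
```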
38. Introduction to Statistics
• Properties of the Normal (Bell) Distribution Curve
• Suppose that the total area under the curve is defined to be 1. You can
multiply that number by 100 and say there is a 100% chance that any value
you can name will be somewhere in the distribution.(Remember: The
distribution extends to infinity in both directions.)
• Similarly, because half the area of the curve is below the mean and half is
above it, you can say that there is a 50 percent chance that a randomly
chosen value will be above the mean and the same chance that it will be
below it.
39. Introduction to Statistics
• A normal curve also has an equal mean, median and mode.
• The mean of a population is denoted by “mu” (μ) and the standard deviation
by “sigma” (σ). The standard deviation describes how spread out the data
points are around the mean.
40. Introduction to Statistics
• In a normal distribution, 68% of the data are between one standard
deviation below the mean and one standard deviation above the
mean. 95% are within two standard deviations of the mean and
99.7% are within three standard deviations of the mean.
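This 68-95-99.7 rule can be verified with Python’s statistics.NormalDist (available in Python 3.8+):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = nd.cdf(k) - nd.cdf(-k)
    print(k, round(p * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```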
42. Introduction to Statistics
• Descriptive Statistic Models
• Graphing data from frequency tables in:
• Pie charts
• Bar Charts
HgbA1c Value  Count
<7            253
7 to <8       700
8 to <9       740
≥9            141
43. Introduction to Statistics
• Descriptive Statistic Models
• Graphing data from linear data tables
• Line Graphs: line graphs are meant to show data over time
44. Introduction to Statistics
• Histograms
• It’s a graphical method for displaying the shape of a distribution, really useful
when looking at large amounts of data.
• Example: We analyzed 10 patients, and we recorded their most recent LDL
values. The values ranged from 57 to 221. We would first create a frequency
table that breaks the values into intervals or parameters.
45. Introduction to Statistics
• Histogram Data set
LDL Intervals  LDL Values
70             65
100            138
130            102
160            221
190            155
               99
               144
               113
               166
               159
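To preview what Excel’s Histogram tool will do with this data, here is a minimal Python sketch that counts the LDL values into the bins above, where each bin holds values up to and including its label:

```python
from bisect import bisect_left

ldl_values = [65, 138, 102, 221, 155, 99, 144, 113, 166, 159]
bins = [70, 100, 130, 160, 190]  # each bin holds values up to and including its label

# counts[i] is the number of values in bin i; the last slot catches values > 190
counts = [0] * (len(bins) + 1)
for v in ldl_values:
    counts[bisect_left(bins, v)] += 1

print(counts)  # [1, 1, 2, 4, 1, 1]
```

The extra final count (the value 221) is what Excel reports in its “More” bin.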
46. Introduction to Statistics
• Things to note about frequency tables:
• Intervals or parameters are also known as bins
• Each bin value in the column is the highest value included in that bin
• To set up your bins, use the Rice rule: set the number of intervals to twice the
cube root of the number of observations.
• In the case of 1,000 observations, the Rice rule yields 20 intervals. In our previous
example, we got the data for 10 patients. The cube root of 10 is about 2.15, and twice
that is about 4.3. We settled on 5 to have more uniform bins. The rule is more of a
guideline, and you can experiment with the bin numbers to get different distribution curves.
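The Rice rule itself is a one-liner; a quick Python sketch:

```python
from math import ceil

def rice_bins(n):
    """Suggested number of histogram bins: twice the cube root of n, rounded up."""
    return ceil(2 * n ** (1 / 3))

print(rice_bins(1000))  # 20
print(rice_bins(10))    # 5
```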
47. Introduction to Statistics
• Creating a Histogram using Excel:
• First, make sure the Analysis ToolPak is enabled.
• Go to File, Options:
48. Introduction to Statistics
• Creating a Histogram using Excel:
• Then, select Add-ins
• At the bottom of the view, select Excel Add-ins, then select Go…
49. Introduction to Statistics
• Creating a Histogram using Excel:
• Afterwards, select the Analysis ToolPak and click OK
• The Data Analysis button now appears under the Data tab on the Excel ribbon
50. Introduction to Statistics
• Creating a Histogram using Excel:
• Select your data set, then click on the Data Analysis button. A list pops up.
Select Histogram from the list
• It will ask you to select the Input Range and the Bin Range. The input range
is the actual values; the bin range is the set intervals
• If you have included the column labels, click on the labels box.
• Then select where you would like your histogram to go (the default is fine),
then click on chart output at the bottom
51. Introduction to Statistics
• If you followed the instructions, you should get a spreadsheet that
looks like this (reduce the gap width to zero to get the columns all
bunched up):
52. Introduction to Statistics
• Histogram applications in healthcare
• Large data sets
• Pareto charts to correctly identify vulnerable populations
• The 80/20 rule can help identify the areas to focus on
• Best when data ranges can vary, as averages are not a good measuring tool
• Examples: Cycle time, lab values, etc. Really any population measure with continuous values
53. Introduction to Statistics
• Regression Analysis
• Linear Regression: At the center of regression is the relationship between two
variables called the dependent and independent variables
• You want to compare two data sets to see what a change in the independent
variable causes in the dependent variable
• Example:
• You notice that the Behavioral Health department is swamped with referrals from
primary care during the winter months. You wonder if there’s some correlation between
the average PHQ-9 scores of the patients, the months of the year, and the amount of
referrals BH is getting.
54. Introduction to Statistics
• You extract some data from your system, and obtain the following
data set
Date       Average PHQ-9 Score  Average Referrals to BH
January 19 60
February 18 57
March 14 48
April 10 35
May 10 22
June 8 20
July 8 15
August 7 15
September 8 14
October 12 15
November 15 35
December 20 53
55. Introduction to Statistics
• Let’s regress.
• Choose Data Analysis again from the Data tab, then choose
Regression from the list
• Enter the dependent variable as the Y range and the independent
variable as the X range
• Click on Line Fit Plots to get a scatter plot that shows how
tight the relationship between the PHQ score and the number of referrals
really is
56. Introduction to Statistics
• You should get the following (there’s more data, but it gets
complicated):
Regression Statistics
Multiple R 0.90733138
R Square 0.823250233
Adjusted R Square 0.805575256
Standard Error 7.928960791
Observations 12
57. Introduction to Statistics
• Our model tells us the following important information:
• Multiple R. This is the correlation coefficient. It tells you how strong the linear
relationship is. For example, a value of 1 means a perfect positive relationship and a
value of zero means no relationship at all. It is the square root of R squared (see #2)
• R squared. This is r², the Coefficient of Determination. It tells you how well the points
fit the regression line. For example, 80% means that 80% of the variation of the y-
values around the mean is explained by the x-values. In other words, 80% of the
values fit the model
• Adjusted R square. The adjusted R-square adjusts for the number of terms in a
model. You’ll want to use this instead of #2 if you have more than one x variable
58. Introduction to Statistics
• How is this useful?
• First of all, you have now supported your theory: in the summer months, when
the average PHQ scores are lower, there are fewer referrals; in the winter
months, when the average PHQ scores are higher, referrals climb
• Use this information to request extra staffing, longer hours, etc. It’s not
conjecture anymore; you have hard data to back it up
• Maybe you can use this information to mount a depression campaign during
the winter months in your clinic. The uses for the data are endless
60. Statistical Analysis In Healthcare
• Currently, there is an abundance of data. There is a real need for
people who can analyze and interpret clinical, operational and
financial data in healthcare
• Statistical analysis includes looking at regression cost models to see
whether particular diagnoses or services increase or decrease costs
• Combining operational and clinical data will yield maximum
knowledge to create better clinical workflows and increase patient
satisfaction
61. Statistical Analysis In Healthcare
• Currently, not many healthcare centers or hospitals use analytics
software on a daily basis
• Statistical analysis of a patient population can help determine where
to focus efforts for maximum impact
• Using social determinants of health as data points, you can also
determine if there are correlations between them and patient
outcomes