This document provides an overview of data analytics topics including big data, database structure and management, and statistical analysis. It introduces big data concepts like volume, velocity, variety and veracity. It discusses database structure, relationships, and how to manage data through roadmaps and health checks. It also introduces statistical concepts like descriptive statistics, distributions, and regression analysis and how they can be applied in healthcare.
A Hybrid Apporach of Classification Techniques for Predicting Diabetes using ... (ijtsrd)
Diabetes is predicted by a classification technique. The data mining tool WEKA was used to implement the Support Vector Machine (SVM) classifier. The proposed work is framed with the goal of improving model performance. To improve classification accuracy, the Support Vector Machine is combined with feature selection and a percentage split. Trial results demonstrated a marked improvement over the existing Support Vector Machine classifier. This approach enhances classification accuracy and reduces computational time. S. Jaya Mala, "A Hybrid Apporach of Classification Techniques for Predicting Diabetes using Feature Selection", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd27991.pdf
Paper URL: https://www.ijtsrd.com/computer-science/data-miining/27991/a-hybrid-apporach-of-classification-techniques-for-predicting-diabetes-using-feature-selection/s-jaya-mala
Data mining techniques are rapidly being developed for many applications. In recent years, data mining in healthcare has become an emerging field for the research and development of intelligent medical diagnosis systems. Classification is a major research topic in data mining, and decision trees are popular methods for classification. In this paper, several decision tree classifiers are used for the diagnosis of medical datasets: the AD Tree, J48, NB Tree, Random Tree, and Random Forest algorithms. Heart disease, diabetes, and hepatitis disorder datasets are used to test the decision tree models. Aung Nway Oo | Thin Naing, "Decision Tree Models for Medical Diagnosis", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23510.pdf
Paper URL: https://www.ijtsrd.com/computer-science/data-miining/23510/decision-tree-models-for-medical-diagnosis/aung-nway-oo
Independent forces on the biomedical ecosystem are causing a convergence of care, quality measurement, and clinical research at the point of care. The presentation outlines some of the informatics implications of this convergence.
RDAP 16 Poster: Responding to Data Management and Sharing Requirements in the... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Presenter:
Caitlin Bakker, University of Minnesota
The Simulacrum, a Synthetic Cancer Dataset (CongChen35)
This presentation describes the applications of synthetic data to cancer registries' efforts to support understanding of, and research on, cancer while reducing privacy risks to cancer patients.
The Simulacrum imitates some of the data held securely by the Public Health England’s National Cancer Registration and Analysis Service.
The data in the Simulacrum is entirely artificial. It does not contain data about real patients, so users can never identify a real person. It is free to use and allows anyone who wants to use record-level cancer data to do so, safe in the knowledge that while the data feels like the real thing, there is no danger of breaching patient confidentiality.
Using machine learning to improve the user experience in online health care c... (Anja Pilz)
Talk held at the Cologne AI and Machine Learning Meetup #CAIML
DocCheck is a medical community for health care professionals. Doctors, pharmacists, students and other healthcare professionals use this platform for online learning, to exchange with peers and to actively contribute their expertise. They seek detailed information in the extensive medicine wiki DocCheck Flexikon, read the bi-weekly edition of DocCheck News, share and discuss medical images in the image archive DocCheck Pictures, or buy medical products and supplies in the online shop. Each of our user groups has different intentions and interests: A student might want to learn anatomical topics in some order and a cardiologist is usually interested in different news than a pharmacist. The ultimate goal is to find the most relevant and interesting assets for each target group to enable targeted mailing and feed personalization. At this point, to improve user experience, we provide related content across different media types in a fully automated fashion. For instance starting from a medical text about a specific disease, we want to offer the most relevant related articles but also news, pictures, videos or even products from the online shop. In this talk, we will focus on the websites with the highest click frequency: the medicine wiki Flexikon. We will show how we automatically find related assets using both content based models as well as models derived from user behaviour. Both approaches are backed by machine learning techniques, namely Latent Dirichlet Allocation and Association Rule Learning. We will give some technical details and share insights on the practical aspects and pitfalls.
THE TECHNOLOGY OF USING A DATA WAREHOUSE TO SUPPORT DECISION-MAKING IN HEALTH... (ijdms)
This paper describes data warehouse technology in healthcare decision-making and the tools that support it, as applied to cancer diseases. Healthcare executive managers and doctors need information about, and insight into, the existing health data, so as to make decisions more efficiently without interrupting the daily work of an On-Line Transaction Processing (OLTP) system. This is a complex problem during the healthcare decision-making process. To solve it, building a healthcare data warehouse appears efficient. First, this paper explains the concepts of the data warehouse and On-Line Analytical Processing (OLAP). Transforming the data in the data warehouse into a multidimensional data cube is then shown. Finally, an application example illustrates the use of the healthcare data warehouse specific to cancer diseases developed in this study. Executive managers and doctors can view data from more than one perspective with reduced query time, thus making decisions faster and more comprehensively.
Mobilizing informational resources for rare diseases (Maria Shkrob)
Providing comprehensive disease-specific summaries remains a serious challenge, as information is scattered across multiple resources. Elsevier is collaborating with the rare disease charity Findacure to create an informational portal for patients, researchers, and doctors to help find new treatments, increase awareness, and streamline information exchange and education. Using an integrative approach of automated and manual curation of the literature, we constructed a knowledgebase containing an overview of disease mechanisms, targets, drugs, key opinion leaders, and institutions. To demonstrate the utility of this approach, congenital hyperinsulinism will be discussed.
Assessing Research Impact: Bibliometrics, Citations and the H-Index (Fintan Bracken)
Talk presented by Dr. Fintan Bracken at the Mary Immaculate College Research Day on 1st September 2015. The talk looked at assessing and maximising the impact of the arts and humanities research conducted at Mary Immaculate College in Limerick, Ireland.
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa... (ASIS&T)
Research Data Access and Preservation Summit, 2016
Atlanta, GA
May 4-7, 2016
Poster session (Wednesday, May 4)
Presenters:
Jan Cheetham, University of Wisconsin-Madison
Wendy Kozlowski, Cornell University
Data mining is a powerful method to extract knowledge from data. Raw data poses various challenges that make traditional methods unsuitable for knowledge extraction.
Data mining is expected to handle various data types in all formats.
Medical data mining is a multidisciplinary field combining contributions from medicine and data mining.
Each paper is studied based on six medical tasks: screening, diagnosis, treatment, prognosis, monitoring, and management.
Data for Impact hosted a one-hour webinar sharing guidance for using routine data in evaluations. More: https://www.data4impactproject.org/resources/webinars/routine-data-use-in-evaluation-practical-guidance/
Principles of data collection include the principles, types, sources, and methods of data collection, which will help medical students build their own tools for data collection.
Prof Mendel Singer: Big Data Meets Public Health and Medicine, 2018-12-22 (mjbinstitute)
Presentation by Prof. Mendel Singer of Case Western Reserve University, on the issue of "big data" in health care and policy research. Presented at the Myers-JDC-Brookdale Institute in Jerusalem.
Follow our presentation to learn about the role of statistical analysis in fraud detection. From data mining to clustering, learn the techniques necessary to quickly anticipate and detect health care fraud, waste, and abuse.
A community needs assessment identifies the strengths and resources available in the community to meet the needs of children, youth, and families. The assessment focuses on the capabilities of the community, including its citizens, agencies, and organizations.
Sills MR. Overview of the SAFTINet Program. Presented to the Emergency Department Research Committee, Department of Pediatrics, University of Colorado School of Medicine. 6 January 2015.
Why should we care about integrating data? What should we be trying to achieve? Population Health. The Softer, Human Side of Being “Data Driven” not “Driven By Data." The New Era of Decision Support in Healthcare. Top 10 Challenges To Integrating External Data.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
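An automated data validation check like the one described can start very small. Below is a minimal sketch over a hypothetical list of patient records (the field names and ranges are illustrative; a production pipeline would typically use a dedicated validation framework):

```python
# A minimal automated data-quality check over a hypothetical list of
# patient records; a production pipeline would typically use a dedicated
# validation framework.

def validate_records(records):
    """Return (row_index, problem) pairs found at the source."""
    problems = []
    for i, rec in enumerate(records):
        if not rec.get("patient_id"):
            problems.append((i, "missing patient_id"))
        if not 0 <= rec.get("systolic_bp", -1) <= 300:
            problems.append((i, "systolic_bp out of range"))
    return problems

rows = [
    {"patient_id": "5465", "systolic_bp": 128},
    {"patient_id": "",     "systolic_bp": 410},  # fails both checks
]
print(validate_records(rows))  # [(1, 'missing patient_id'), (1, 'systolic_bp out of range')]
```

Running such checks at ingestion time catches errors at the source, before they propagate downstream.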
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and can thus also reduce iteration time. Road networks often contain chains that can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
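The convergence-skipping idea can be sketched in a few lines of power iteration. This is an illustrative toy, not the STICD implementation; the graph, damping factor, and tolerance are assumptions, and dangling nodes are not handled:

```python
# A minimal sketch of power-iteration PageRank that skips recomputation
# for vertices whose rank has stopped changing, one of the per-iteration
# savings described above.

def pagerank(out_links, d=0.85, tol=1e-10, max_iter=100):
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    rank = {v: 1.0 / n for v in out_links}
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in out_links:
            if v in converged:        # heuristic: treat a settled rank as final
                new_rank[v] = rank[v]
                continue
            s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1 - d) / n + d * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:       # every vertex has converged
            break
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pagerank(g))  # a 3-cycle: every rank settles at 1/3
```

Note the skip is a heuristic: a frozen vertex is not recomputed even if its in-neighbors later move, which is exactly the accuracy/speed trade-off such optimizations manage.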
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects demand, and the changing evolution of supply, to be facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Topics to be covered
• Introducing Big Data
• Big Data in healthcare
• Database structure and management
• Database structure
• How to manage your data
• Statistical analysis in population health management
• Introduction to statistics
• Statistical analysis in healthcare
3. Introducing Big Data
• Information that can’t be processed or analyzed using traditional
processes or tools
• There are four dimensions to Big Data: Volume, Velocity, Variety,
Veracity
• Challenges with Big Data: Capturing, Storing, Searching, Sharing &
Analyzing
4. Introducing Big Data
• Volume
• The amount of data being collected is unprecedented
• The volume of data available is on the rise, while the percent that can be
analyzed is on the decline. This is known as the data blind zone.
• Velocity
• The rate at which the data is being generated needs to be handled
• How quickly is the data arriving and stored?
• How quickly can you process the data?
5. Introducing Big Data
• Variety
• With an increase in quantity comes an increase in the variety of data types
• Issues with storing complex data
• Analyzing all different types of data
• Veracity
• The accuracy of data becomes more important as we use more of it
• Garbage in, garbage out
6. Introducing Big Data
• Big Data challenges:
• Capturing
• Data is initially pulled from all sorts of different places
• Storing
• Data is kept in different locations (virtual or otherwise)
• Security concerns
• Searching
• Having a database capable of handling searches
• Optimizing a database for searches
7. Introducing Big Data
• Big Data challenges:
• Sharing
• There are valid security concerns
• The data variety poses a problem when sharing
• Analyzing
• Extracting the data isn’t easy
• Data variety poses a significant problem
• Sheer volume of data makes it difficult to focus
8. Big Data in Healthcare
• Incentives for big data use are rising
• Movement to evidence-based care
• Increase in available technologies for data collection, analysis and
communication
• The ultimate goal is improving patient health while reducing costs
9. Big Data in Healthcare
• Volume
• Healthcare data is more plentiful than ever
• Velocity
• Data flows in real time and is processed in real time
• Variety
• Billing information and clinical information
• Veracity
• Data accuracy is vital to an organization
10. Big Data in Healthcare
• Challenges
• Mixing healthcare with IT
• The availability of data has exploded
• How do you handle the influx of data?
• Finding the relevant data to mine
12. Database Structure
• A structured set of data held in a computer, especially one that is
accessible in various ways (or not so accessible in some cases).
• Data are organized in database tables, which consist of rows and
columns.
• Each row is called a record, object, or entity. Each column is called a
field or attribute.
• Each column should contain a single data type, but a row can contain
different data types across its columns.
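The table/row/column structure above can be seen directly with Python's built-in sqlite3 module; the patient table and its fields here are illustrative:

```python
# A small table whose columns (fields) each hold one data type and whose
# rows are individual records, using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient (
        patient_id  INTEGER,  -- each column (field/attribute) holds one data type
        last_name   TEXT,
        systolic_bp INTEGER
    )
""")
conn.execute("INSERT INTO patient VALUES (5465, 'Smith', 128)")  # each row is a record
conn.execute("INSERT INTO patient VALUES (5466, 'Jones', 141)")
for row in conn.execute("SELECT * FROM patient"):
    print(row)
# (5465, 'Smith', 128)
# (5466, 'Jones', 141)
```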
14. Database Structure
• Two types of keys, primary and foreign
• A primary key makes a row of data unique; it can be made up of
multiple columns
• A foreign key is a column or group of columns in a relational database
table that provides a link between data in two tables
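Both key types can be sketched with sqlite3; the provider/patient schema below is hypothetical, not a real EMR design:

```python
# Primary key: makes each row unique. Foreign key: links patient rows to
# an existing provider row. Illustrative schema only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE provider (provider_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE patient (
        patient_id  INTEGER PRIMARY KEY,                      -- makes each row unique
        provider_id INTEGER REFERENCES provider(provider_id)  -- links the two tables
    )
""")
conn.execute("INSERT INTO provider VALUES (1, 'Dr. Adams')")
conn.execute("INSERT INTO patient VALUES (5465, 1)")
try:
    conn.execute("INSERT INTO patient VALUES (5466, 99)")  # provider 99 does not exist
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)  # the foreign key constraint blocks the bad row
```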
16. Database Structure
• Database relationships can be of three different types:
• One-to-one
• One-to-many
• Many-to-many
17. Database Structure
• One-to-One Relationships
• A key will appear only once in a related table.
• Example: A patient can only be assigned one primary care provider
18. Database Structure
• One-to-Many Relationships
• Keys from one table will appear multiple times in a related table
• Example: One provider can be assigned multiple patients in paneling
19. Database Structure
• Many-to-Many relationships
• The key value of one table can appear many times in a related table, but the
opposite also holds true!
• Example: A patient can see multiple different providers and a provider can see
multiple different patients
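Relational databases model a many-to-many relationship with a junction table. Here is a sketch with sqlite3, using an illustrative patient_provider table:

```python
# Many-to-many: each patient can link to many providers and vice versa,
# via the patient_provider junction table. Illustrative IDs only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient  (patient_id  INTEGER PRIMARY KEY);
    CREATE TABLE provider (provider_id INTEGER PRIMARY KEY);
    CREATE TABLE patient_provider (      -- the junction table
        patient_id  INTEGER REFERENCES patient(patient_id),
        provider_id INTEGER REFERENCES provider(provider_id),
        PRIMARY KEY (patient_id, provider_id)
    );
    INSERT INTO patient  VALUES (5465), (5466);
    INSERT INTO provider VALUES (1), (2);
    INSERT INTO patient_provider VALUES (5465, 1), (5465, 2), (5466, 1);
""")
# one patient sees many providers...
rows = conn.execute(
    "SELECT provider_id FROM patient_provider WHERE patient_id = 5465"
).fetchall()
print(rows)  # [(1,), (2,)]
```

The composite primary key on the junction table prevents the same patient/provider pair from being linked twice.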
20. How to Manage Your Data
• The importance of managing your database
• Your database is composed of data and is built by the software companies.
You can effectively manage what goes INTO your database.
• It plays an important role in improving the performance of an organization’s
health care systems.
• Collecting, analyzing, interpreting, and acting on data for specific
performance measures allows health care professionals to identify where
systems are falling short, to make corrective adjustments, and to track
outcomes.
21. How to Manage Your Data
• Developing an EMR data roadmap
• First determine what you need to collect
• Next, identify where the data is able to be entered
• Find out who is entering it
• Develop a roadmap of your data using a spreadsheet
• Rows would correspond to the data being collected
• Columns would contain the where and who
22. How to Manage Your Data
• Data roadmap example:
Measure Name | Data Item | Field Name | Employee
Colorectal Cancer | Colonoscopy Result | healthmaintenance.table | MD
Colorectal Cancer | Colonoscopy Date | diagnostichistory.table | MA
Colorectal Cancer | Colonoscopy Document | referralorder.table | RN
Colorectal Cancer | FIT Outside lab result | outsidelabs.table | MA
Colorectal Cancer | FIT Quest lab result | emrlabs.table | MD
Hypertension | Systolic BP | vitalssys.table | MA
Hypertension | Diastolic BP | vitalsdys.table | MA
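A roadmap like the one above can be kept as a plain CSV spreadsheet. Below is a sketch using Python's csv module, with two rows taken from the example:

```python
# Writing the data roadmap as a CSV spreadsheet: one row per data item,
# columns for where the data is entered and who enters it.
import csv
import io

roadmap = [
    # (measure name, data item, field name / where entered, employee / who enters)
    ("Colorectal Cancer", "Colonoscopy Result", "healthmaintenance.table", "MD"),
    ("Hypertension",      "Systolic BP",        "vitalssys.table",         "MA"),
]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Measure Name", "Data Item", "Field Name", "Employee"])
writer.writerows(roadmap)
print(buf.getvalue())
```

In practice you would write to a file (and, per the health-check advice below, add a new tab or file per review rather than deleting columns).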
23. How to Manage Your Data
• Data Health Checks
• They are periodic reviews of your EMR data's integrity
• Establish a timeline for the data health checks; yearly is recommended.
• Assemble your data health check team; members from different departments are
recommended
• Document your data health checks, and don’t delete roadmap columns. Simply add
another tab in your spreadsheet.
24. How to Manage Your Data
• Creating data workflows
• Use the data roadmap to streamline workflows
• Duplicate data entry
• Redundant data workflows
• Too many places to document
• Too many variations in your data types
• Standardize the process
• Involve the end-users in the process
• Use a diverse team, the same team that does the Data Health Checks works
well
25. Statistical Analysis in PHM
• Statistical analysis involves using the scientific method to answer
questions and make decisions
• It involves designing the studies, collecting good data, describing the
data with numbers and graphs, analyzing the data, and then making
conclusions.
26. Introduction to Statistics
• Statistics are everywhere, from healthcare to marketing.
• Usually statistics deals with two different sets of data:
• Population:
• The set of individual persons or objects in which an investigator is primarily
interested during his or her research problem
• Sample:
• That part of the population from which information is collected
27. Introduction to Statistics
• There are two major types of statistics
• Descriptive: methods for organizing and summarizing information
• Inferential: methods for drawing and measuring the reliability of conclusions
about a population
• Descriptive statistics involves graphs, charts, tables, etc.
• Inferential statistics is predictive and includes methods like point
estimation, interval estimation and hypothesis testing
28. Introduction to Statistics
• Descriptive Statistics Example:

PatientID  Tobacco Cessation
5465       Yes
5466       No
5467       Yes
5468       Yes
5469       No
5470       Yes
5471       Yes
5472       Yes
5473       Yes
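The table above can be summarized in a few lines of Python; this sketch simply transcribes the responses from the table and counts them:

```python
from collections import Counter

# Tobacco cessation responses transcribed from the table above
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"]

counts = Counter(responses)
print(counts["Yes"], counts["No"])                   # 7 2
print(round(counts["Yes"] / len(responses) * 100))   # 78 (percent "Yes")
```

Counts and percentages like these are descriptive statistics: they organize and summarize the information without making predictions.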
29. Introduction to Statistics
• Independent and Dependent Variables
• Independent variables are manipulated by an experimenter
• Example: A provider wants to know which medication is best for depression
and has four antidepressants to choose from. Which medication the provider
gives out is the independent variable.
• Dependent variables are the results of the experiment
• Example: After a period of time, the provider interviews the patients to see
what their PHQ score is; the PHQ score is the dependent variable.
30. Introduction to Statistics
• Distribution
• Distribution has to do with the frequency of the data
• Example: You purchase a bag of Skittles. Skittles come in different colors;
how many of each color are found in the bag?
• This is known as a frequency table, which describes the Skittles color
frequencies

Color   Count
Green   15
Blue    8
Yellow  10
Purple  6
Red     12
31. Introduction to Statistics
• Continuous Variables
• Sometimes your data varies continuously, and you never have a clear-cut
data set like in our Skittles example
• When your data is varied, you can build a grouped frequency distribution and
look at your data in histogram form
• Example:
• We’re much better off looking at the data in grouped frequency rather than
looking at each HgbA1c result
HgbA1c Value  Count
<7            253
7 to <8       700
8 to <9       740
≥9            141
32. Introduction to Statistics
• Probability Distributions: Discrete vs Continuous
• Depends on whether they define probabilities associated with discrete
variables or continuous variables.
• Discrete vs. Continuous Variables
• If a variable can take on any value between two specified values, it is called
a continuous variable; otherwise, it is called a discrete variable.
• Example:
• The weight distribution of a patient population is a continuous variable: a
weight can take any value within a range (e.g., 70.25 kg)
• The number of visits a patient makes in a year is a discrete variable: it can
only take whole-number values
33. Introduction to Statistics
• Probability Densities
• A probability density describes how likely a continuous variable is to fall
within a given range of values. The resulting curve is called a continuous
distribution.
• The normal (bell) distribution, a type of continuous distribution, describes
many natural phenomena
34. Introduction to Statistics
• Distribution shapes
• If you folded the figure on the previous slide in half, you would get two equal
halves. However, not all distributions are symmetrical.
• A distribution with a longer “tail” in the positive direction is said to have a
“positive skew”; it is also known as “skewed to the right”:
36. Introduction to Statistics
• All the distributions so far have had one distinct high point or peak.
When distributions have two peaks in the data, this is called a
bimodal distribution:
37. Introduction to Statistics
• Some statistics definitions
• Mean – add up all the numbers and divide by how many numbers there are
• Median – the middle value in the list of numbers; the numbers have to be listed
in numerical order
• Mode – the value that occurs most often
• Range – the difference between the largest and smallest values
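These four definitions can be checked directly with Python’s built-in statistics module, using an illustrative data set:

```python
from statistics import mean, median, mode

values = [7, 3, 9, 3, 5]   # illustrative data set

print(mean(values))               # 5.4
print(median(values))             # 5  (sorted: 3, 3, 5, 7, 9)
print(mode(values))               # 3  (occurs twice)
print(max(values) - min(values))  # 6  -> the range
```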
38. Introduction to Statistics
• Properties of the Normal (Bell) Distribution Curve
• Suppose that the total area under the curve is defined to be 1. You can
multiply that number by 100 and say there is a 100% chance that any value
you can name will be somewhere in the distribution.(Remember: The
distribution extends to infinity in both directions.)
• Similarly, because half the area of the curve is below the mean and half is
above it, you can say that there is a 50 percent chance that a randomly
chosen value will be above the mean and the same chance that it will be
below it.
39. Introduction to Statistics
• A normal curve also has an equal mean, median and mode.
• The mean of a population is denoted by “mu” (μ) and the standard deviation
by “sigma” (σ). The standard deviation describes how spread out the data
points are around the mean.
40. Introduction to Statistics
• In a normal distribution, 68% of the data are between one standard
deviation below the mean and one standard deviation above the
mean. 95% are within two standard deviations of the mean and
99.7% are within three standard deviations of the mean.
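This 68-95-99.7 rule can be verified with Python’s statistics.NormalDist (available in Python 3.8+):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = nd.cdf(k) - nd.cdf(-k)
    print(k, round(p * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```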
42. Introduction to Statistics
• Descriptive Statistic Models
• Graphing data from frequency tables in:
• Pie charts
• Bar Charts
HgbA1c Value  Count
<7            253
7 to <8       700
8 to <9       740
≥9            141
43. Introduction to Statistics
• Descriptive Statistic Models
• Graphing data from linear data tables
• Line Graphs: line graphs are meant to show data over time
44. Introduction to Statistics
• Histograms
• It’s a graphical method for displaying the shape of a distribution, really useful
when looking at large amounts of data.
• Example: We analyzed 10 patients, and we recorded their most recent LDL
values. The values ranged from 57 to 221. We would first create a frequency
table that breaks the values into intervals or parameters.
45. Introduction to Statistics
• Histogram Data set
LDL Intervals  LDL Values
70             65
100            138
130            102
160            221
190            155
               99
               144
               113
               166
               159
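To preview what Excel’s Histogram tool will do with this data, here is a minimal Python sketch that counts the LDL values into the bins above, where each bin holds values up to and including its label:

```python
from bisect import bisect_left

ldl_values = [65, 138, 102, 221, 155, 99, 144, 113, 166, 159]
bins = [70, 100, 130, 160, 190]  # each bin holds values up to and including its label

# counts[i] is the number of values in bin i; the last slot catches values > 190
counts = [0] * (len(bins) + 1)
for v in ldl_values:
    counts[bisect_left(bins, v)] += 1

print(counts)  # [1, 1, 2, 4, 1, 1]
```

The extra final count (the value 221) is what Excel reports in its “More” bin.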
46. Introduction to Statistics
• Things to note about frequency tables:
• Intervals or parameters are also known as bins
• Each bin value in the column is the highest value included in that bin
• To set up your bins, use the Rice rule: set the number of intervals to twice the
cube root of the number of observations.
• In the case of 1,000 observations, the Rice rule yields 20 intervals. In our previous
example, we got the data for 10 patients. The cube root of 10 is about 2.15, and twice
that is about 4.3. We settled on 5 to have more uniform bins. The rule is more of a
guideline, and you can experiment with the bin numbers to get different distribution curves.
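The Rice rule itself is a one-liner; a quick Python sketch:

```python
from math import ceil

def rice_bins(n):
    """Suggested number of histogram bins: twice the cube root of n, rounded up."""
    return ceil(2 * n ** (1 / 3))

print(rice_bins(1000))  # 20
print(rice_bins(10))    # 5
```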
47. Introduction to Statistics
• Creating a Histogram using Excel:
• First, make sure the Analysis ToolPak is enabled.
• Go to File, Options:
48. Introduction to Statistics
• Creating a Histogram using Excel:
• Then, select Add-ins
• At the bottom of the view, select Excel Add-ins, then select Go…
49. Introduction to Statistics
• Creating a Histogram using Excel:
• Afterwards, select the Analysis ToolPak and click OK
• The Data Analysis button now appears under the Data tab on the Excel ribbon
50. Introduction to Statistics
• Creating a Histogram using Excel:
• Select your data set, then click on the Data Analysis button. A list pops up.
Select Histogram from the list
• It will ask you to select the Input Range and the Bin Range. The input range
is the actual values; the bin range is the set intervals
• If you have included the column labels, click on the labels box.
• Then select where you would like your histogram to go (the default is fine),
then click on chart output at the bottom
51. Introduction to Statistics
• If you followed the instructions, you should get a spreadsheet that
looks like this (reduce the gap width to zero to get the columns all
bunched up):
52. Introduction to Statistics
• Histogram applications in healthcare
• Large data sets
• Pareto charts to correctly identify vulnerable populations
• The 80/20 rule can help identify the areas to focus on
• Best when data ranges can vary, as averages are not a good measuring tool
• Examples: Cycle time, lab values, etc. Really any population measure with continuous values
53. Introduction to Statistics
• Regression Analysis
• Linear Regression: At the center of regression is the relationship between two
variables called the dependent and independent variables
• You want to compare two data sets to see what a change in the independent
variable causes in the dependent variable
• Example:
• You notice that the Behavioral Health department is swamped with referrals from
primary care during the winter months. You wonder if there’s some correlation between
the average PHQ-9 scores of the patients, the months of the year, and the amount of
referrals BH is getting.
54. Introduction to Statistics
• You extract some data from your system, and obtain the following
data set
Date       Average PHQ-9 Score  Average Referrals to BH
January 19 60
February 18 57
March 14 48
April 10 35
May 10 22
June 8 20
July 8 15
August 7 15
September 8 14
October 12 15
November 15 35
December 20 53
55. Introduction to Statistics
• Let’s regress.
• Choose Data Analysis again from the Data tab, then choose
Regression from the list
• Enter the dependent variable as the Y range and the independent
variable as the X range
• Click on Line Fit Plots to get a scatter plot that shows how
tight the relationship between the PHQ score and the number of referrals
really is
56. Introduction to Statistics
• You should get the following (there’s more data, but it gets
complicated):
Regression Statistics
Multiple R 0.90733138
R Square 0.823250233
Adjusted R Square 0.805575256
Standard Error 7.928960791
Observations 12
57. Introduction to Statistics
• Our model tells us the following important information:
• Multiple R. This is the correlation coefficient. It tells you how strong the linear
relationship is. For example, a value of 1 means a perfect positive relationship and a
value of zero means no relationship at all. It is the square root of R squared (see #2)
• R squared. This is r², the Coefficient of Determination. It tells you how well the points
fit the regression line. For example, 80% means that 80% of the variation of the y-
values around the mean is explained by the x-values. In other words, 80% of the
values fit the model
• Adjusted R square. The adjusted R-square adjusts for the number of terms in a
model. You’ll want to use this instead of #2 if you have more than one x variable
58. Introduction to Statistics
• How is this useful?
• First of all, you have now supported your theory: in the summer months, when
the average PHQ scores are lower, there are fewer referrals; in the winter
months, when the average PHQ scores are higher, referrals climb
• Use this information to request extra staffing, longer hours, etc. It’s not
conjecture anymore; you have hard data to back it up
• Maybe you can use this information to mount a depression campaign during
the winter months in your clinic. The uses for the data are endless
60. Statistical Analysis In Healthcare
• Currently, there is an abundance of data. There is a real need for
people who can analyze and interpret clinical, operational and
financial data in healthcare
• Statistical analysis includes looking at regression cost models to see
whether particular diagnoses or services increase or decrease costs
• Combining operational and clinical data will yield maximum
knowledge to create better clinical workflows and increase patient
satisfaction
61. Statistical Analysis In Healthcare
• Currently, not many healthcare centers or hospitals use analytics
software on a daily basis
• Statistical analysis of a patient population can help determine where
to focus efforts for maximum impact
• Using social determinants of health as data points, you can also
determine if there are correlations between them and patient
outcomes