This presentation on Introduction to Statistics helps engineering students review the fundamental topics of statistics. It follows the syllabus of the Institute of Engineering (IOE), which is similar to that of almost all engineering colleges.
Elementary Data Analysis with MS Excel, Day 5, by Redwan Ferdous
This event took place on 16 September 2020 and was arranged by the EMK Center (Makerlab). The title was 'Elementary Data Analysis with MS Excel', and it covered very basic data analysis with MS Excel.
On Day 5, the topics discussed were Hypothesis testing, Statistics, Regression Analysis, T-Test, Z-Test, P-Test, ANOVA, Goal Seek, Pivot Chart, Dashboard, Slicer, Solver, the Data Analysis ToolPak, and peripheral items.
Descriptive statistics are methods of describing the characteristics of a data set. They include calculating things such as the average of the data, its spread, and the shape it produces.
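As a quick illustration, these summaries can be computed with Python's standard library (the sample values below are made up for demonstration):

```python
import statistics

# Hypothetical sample: monthly household expenditures in dollars
data = [210, 225, 198, 240, 305, 215, 222, 260, 231, 204]

center = statistics.mean(data)       # average of the data
middle = statistics.median(data)     # middle value, robust to outliers
spread = statistics.stdev(data)      # sample standard deviation
data_range = max(data) - min(data)   # simplest measure of spread

print(f"mean={center:.1f} median={middle:.1f} "
      f"stdev={spread:.1f} range={data_range}")
```

The shape of the data (skewness, number of modes) is usually judged from a histogram rather than from a single number.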
Elementary Statistics (MATH220)
Assignment:
Statistical Project & Presentation
Purpose:
The purpose of this project is to supplement lecture material by having students do a case study on collecting, analyzing, and interpreting data.
***The best way to understand something is to experience it for yourself.
Guideline for Analyzing Data and Writing a Report
Below is a general outline of the topics that should be included in your report.
1. Introduction. State the topic of your study.
2. Define Population. Define the population that you intend for your study to represent.
3. Define Variable. Define clearly the variable that you obtained during your data collection; this should include information on how the variable is measured and what possible values this variable has.
4. Data Collection. Describe your data collection process, including your data source, your sampling strategy, and what steps you took to avoid bias.
5. Study Design. Describe the procedures you followed to analyze your data.
6. Results: Descriptive Statistics. Give the relevant descriptive statistics for the sample you collected.
7. Results: Statistical Analysis. Describe the results of your statistical analysis.
8. Findings. Interpret the results of your analysis in the context of your original research question. Was your hypothesis supported by your statistical analyses? Explain.
9. Discussion. What conclusions, if any, do you believe you can draw as a result of your study? If the results were not what you expected, what factors might explain your results? What did you learn from the project about the population you studied? What did you learn about the research variable? What did you learn about the specific statistical test you conducted?
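For the statistical-analysis and findings steps, the mechanics of a simple hypothesis test can be sketched as follows (the commute-time data and the 30-minute null value are hypothetical, and a z test is used only to keep the sketch within Python's standard library; a t test is usually more appropriate for a sample this small):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical study: is the mean commute time different from 30 minutes?
sample = [28, 34, 31, 27, 35, 29, 33, 30, 36, 32, 26, 31]
mu0 = 30  # null-hypothesis mean

# Test statistic and two-sided p-value
z = (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z={z:.2f}, p={p_value:.3f}: {decision}")
```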
STAT200: Assignment #2 - Descriptive Statistics Analysis and Writeup - Instructions
STAT200 Introduction to Statistics
Assignment #2: Descriptive Statistics Analysis and Writeup
In the first assignment (Assignment #1: Descriptive Statistics Analysis Data Plan), you developed a
scenario about annual household expenditures and a plan for analyzing the data using descriptive
statistics methods. The purpose of this assignment is to carry out the descriptive statistics analysis plan
and write up the results. The expected outcome of this assignment is a two to three page write-up of
the findings from your analysis as well as a recommendation.
Assignment Steps:
Step #1: Review Feedback from Your Instructor
Before performing any analysis, please make sure to review your instructor’s feedback on Assignment
#1: Descriptive Statistics Data Analysis Plan. Based on the feedback, modify your variables and
selected statistics, graphs, and tables, if needed.
Step #2: Perform Descriptive Statistic Analysis
Task 1: Look at the dataset.
• (Re)Familiarize yourself with the variables. Review Table 1: Variables Selected for the
Analysis you generated for the first assignment as well as your instructor’s feedback. In
addition, look at the data dictionary contained in the data set for information about the
variables.
• Select the variables you need for the analysis.
Task 2: Complete your data analysis, as outlined in your first assignment, with any needed
modifications, based on your instructor’s feedback.
• Calculate Measures of Central Tendency and Variability. Use the information from
Assignment #1 - Table 2. Numerical Summaries of the Selected Variables. Here again,
be sure to see your instructor’s feedback and incorporate into the analysis.
• Prepare Graphs and/or Tables. Use the information from Assignment #1 - Table 3.
Type of Graphs and/or Tables for Selected Variables. Here again, be sure to see your
instructor’s feedback and incorporate into the analysis.
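For the graphs-and-tables task, a simple text-based frequency table can be built directly (the bracket labels below are hypothetical, and Python stands in for the Excel work the course expects):

```python
from collections import Counter

# Hypothetical qualitative variable: household income bracket
brackets = ["low", "middle", "middle", "high", "middle", "low",
            "middle", "high", "low", "middle"]

counts = Counter(brackets)
n = len(brackets)

# Frequency and relative-frequency table
print(f"{'Bracket':<8}{'Freq':>6}{'Rel. freq':>11}")
for bracket, freq in counts.most_common():
    print(f"{bracket:<8}{freq:>6}{freq / n:>11.2f}")
```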
Step #3: Write-up findings using the Provided Template
For this part of the assignment, write a short 2-3 page write-up of the process you followed and the
findings from your analysis. You will describe, in words, the statistical analysis used and present the
results in both statistical/text and graphic formats.
Here are the main sections for this assignment:
✓ Identifying Information. Fill in information on name, class, instructor, and date.
✓ Introduction. For this section, use the same scenario you submitted for the first assignment and
modified using your instructor’s feedback, if needed. Include Table 1 (Table 1: Variables
Selected for the Analysis) you used in Assignment #1 to show the variables you selected for the
analysis.
✓ Data.
The aim of this course is to equip the students with the necessary skills, including both the acquisition of habits of thought and knowledge of the techniques of modern econometrics.
The course is application oriented.
The course also aims to provide students with the ability to use appropriate software in an effective manner.
Course Project: AJ DAVIS DEPARTMENT STORES (Introduction.docx), by vanesaburnand
Course Project: AJ DAVIS DEPARTMENT STORES
Introduction
AJ DAVIS is a department store chain, which has many credit customers and wants to find out more information about these customers. A sample of 50 credit customers is selected with data collected on the following five variables.
1. Location (rural, urban, suburban)
2. Income (in $1,000's—be careful with this)
3. Size (household size, meaning number of people living in the household)
4. Years (the number of years that the customer has lived in the current location)
5. Credit balance (the customer's current credit card balance on the store's credit card, in $).
The data is available in the Doc Sharing Course Project Data Set as an Excel file. You are to copy and paste the data set into a Minitab worksheet.
PROJECT PART A: Exploratory Data Analysis
· Open the file MATH533 Project Consumer.xls from the Course Project Data Set folder in Doc Sharing.
· For each of the five variables, process, organize, present, and summarize the data. Analyze each variable by itself using graphical and numerical techniques of summarization. Use Minitab as much as possible, explaining what the printout tells you. You may wish to use some of the following graphs: stem-and-leaf diagram, frequency or relative frequency table, histogram, boxplot, dotplot, pie chart, bar graph. Caution: Not all of these are appropriate for each of these variables, nor are they all necessary. More is not necessarily better. In addition, be sure to find the appropriate measures of central tendency and measures of dispersion for the above data. Where appropriate, use the five-number summary (the Min, Q1, Median, Q3, Max). Once again, use Minitab as appropriate, and explain what the results mean.
· Analyze the connections or relationships between the variables. There are 10 pairings here (location and income, location and size, location and years, location and credit balance, income and size, income and years, income and balance, size and years, size and credit balance, years and Credit Balance). Use graphical as well as numerical summary measures. Explain what you see. Be sure to consider all 10 pairings. Some variables show clear relationships, while others do not.
· Prepare your report in Microsoft Word (or some other word processing package), integrating your graphs and tables with text explanations and interpretations. Be sure that you have graphical and numerical backup for your explanations and interpretations. Be selective in what you include in the report. I'm not looking for a 20-page report on every variable and every possible relationship (that's 15 things to do). Rather, what I want you to do is to highlight what you see for three individual variables (no more than one graph for each, one or two measures of central tendency and variability (as appropriate), and two or three sentences of interpretation). For the 10 pairings, identify and report only on three of the pairings, again using graphical and numerical summary (as.
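As a sketch of the two analysis tasks above, the five-number summary for one variable and the correlation for one pairing can be computed as follows (the nine customer records are hypothetical, and Python stands in for the Minitab output described in the brief):

```python
from math import sqrt
from statistics import quantiles

# Hypothetical sample: credit balances ($) and incomes ($1,000s)
balance = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600]
income = [22, 28, 35, 39, 47, 52, 58, 66, 71]

# Five-number summary for one variable
q1, q2, q3 = quantiles(balance, n=4, method="inclusive")
five_number = (min(balance), q1, q2, q3, max(balance))

# Pearson correlation for one pairing (income vs. credit balance)
n = len(income)
mx, my = sum(income) / n, sum(balance) / n
cov = sum((x - mx) * (y - my) for x, y in zip(income, balance))
r = cov / sqrt(sum((x - mx) ** 2 for x in income)
               * sum((y - my) ** 2 for y in balance))

print("Min, Q1, Median, Q3, Max:", five_number)
print(f"r(income, balance) = {r:.3f}")
```

A correlation near 1 here would suggest income and balance rise together; a graph (scatterplot) should always accompany the number.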
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning, by QuantUniversity
Anomaly detection (or outlier analysis) is the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection, and monitoring processes in various domains including energy, healthcare, and finance.
ECO 510 Final Project Guidelines and Rubric Overview The final.docx, by jack60216
ECO 510 Final Project Guidelines and Rubric
Overview
The final project for this course is the creation of an empirical analysis for HHS executives.
Economists conduct research, collect and analyze data, monitor economic trends, and develop forecasts. They use their understanding of economic
relationships to advise businesses, industry, and government agencies. For this assignment, you are tasked to be an economist consulting on a study to advise
the U.S. Department of Health and Human Services (HHS) on economic and demographic factors that might influence the amount of physical activity of the
population. Understanding this nexus among economics, demographics, and physical activity will be vital for HHS to be able to focus its limited funding for a
health improvement initiative.
The project is divided into three milestones, which will be submitted at various points throughout the course to scaffold learning and ensure quality final
submissions. These milestones will be submitted in Modules Three, Seven, and Nine. The final project will be submitted in Module Ten.
In this assignment, you will demonstrate your mastery of the following course outcomes:
• Analyze quantitative data through the integration of appropriate and relevant mathematical objects such as matrices, polynomials, graphs, and
derivatives to solve economic problems
• Apply appropriate data collection instruments and measures for planning and conducting economic research
• Analyze statistical data using appropriate statistical methodologies for the study of economic growth, business cycles, and other dynamic business
behavior
• Apply probability theory in making economic decisions to competitively position a business
Prompt
The aim of the HHS initiative, known as NEXI (National Economics of Exercise Initiative), is to increase the amount of leisure-time physical activity of the
population for health and economic benefits. A secondary aim of NEXI is to provide some directions for small businesses, nonprofits, and states to apply for
contracts and grants, which will provide more leisure-time physical activities to the U.S. population.
Therefore, your empirical analysis should answer the following questions:
• What are some economic and demographic factors that influence the level of leisure-time physical activity?
• Which states and which geographical regions might benefit the most from NEXI?
• What kind of businesses can monetarily benefit by creating services for NEXI (based on your findings for the previous questions)? What are the optimal
services?
To perform the statistical tests, you will use Minitab. The four assigned critical tasks are designed to give a basic introduction to some of the main statistical
functions that you will use for your empirical analysis.
Specifically, the following critical elements must be addressed:
I. Research. The research will provide foundational knowledge for you to select the appropriate data sets and qualitativ ...
STAT200: Assignment #3 - Inferential Statistics Analysis and Writeup - Instructions
STAT200 Introduction to Statistics
Assignment #3: Inferential Statistics Analysis and Writeup
Purpose:
The purpose of this assignment is to develop and carry out an inferential statistics analysis plan and
write up the findings. There are two main parts to this assignment:
● Part A: Inferential Statistics Data Plan and Analysis
● Part B: Write up of Results
Part A: Prepare Data Plan, Analyze Data, and Complete Part A of the Assignment #3 Template
➢ Task 1: Select Variables. Review the variables you used for assignments #1 and #2. Select your
qualitative socioeconomic variable as your grouping variable and the two expenditure variables
from the variables used in these previous assignments. Fill in Table 1: Variables Selected for
Analysis with name, description, and type of variable (i.e., qualitative or quantitative).
➢ Task 2: Select and Run a One Sample Confidence Interval Analysis. For one expenditure
variable, select and run the appropriate method for estimating a parameter, based on a statistic
(i.e., confidence interval method). Complete Table 2: Confidence Interval Information and
Results, which follows the format outlined by Kozak and the course’s problem-solving approach,
including:
○ Random variable stated in words
○ Confidence interval method, including rationale and assumptions
○ Method used for analyzing data (e.g., web applets, Excel, TI calculator, etc.)
○ Results obtained
○ Interpretation
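A minimal sketch of the confidence-interval step, using hypothetical annual food expenditures and a z-based interval (with a sample this small, the t-based interval the Kozak text describes would normally be preferred):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical expenditure variable: annual food spending in dollars
spend = [4100, 4550, 3980, 5200, 4720, 4310, 4890, 4460, 5050, 4240]
n = len(spend)

# 95% confidence interval for the mean (z-based sketch)
z_star = NormalDist().inv_cdf(0.975)        # about 1.96
margin = z_star * stdev(spend) / sqrt(n)
lower, upper = mean(spend) - margin, mean(spend) + margin

print(f"95% CI for the mean: ({lower:.0f}, {upper:.0f})")
```

The interpretation step then states, in words, that we are 95% confident the population mean spending lies within this interval.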
➢ Task 3: Select Two Sample Hypothesis Test. Using the second expenditure variable (with the
socioeconomic variable as the grouping variable), select and run the appropriate method for
making decisions about two parameters relative to observed statistics (i.e., two sample
hypothesis test method). Complete Table 3: Two Sample Hypothesis Test Analysis, which
follows the format outlined by Kozak and the course’s problem-solving approach, including:
○ Hypotheses (null and alternative).
○ Two sample hypothesis testing method, including rationale and assumptions
○ Method used for analyzing data (e.g., web applets, Excel, TI calculator, etc.)
○ Results obtained.
○ Interpretation (i.e., Reject the null hypothesis OR Fail to reject null hypothesis)
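The two-sample step can be sketched the same way (hypothetical clothing expenditures grouped by income bracket; a large-sample z statistic is shown because it needs only the Python standard library, whereas the course's two-sample t test would be run in Excel or a web applet):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical groups: annual clothing expenditure ($) by income bracket
low_income = [820, 760, 905, 640, 780, 850, 700, 730, 810, 690]
high_income = [1150, 980, 1240, 1060, 1310, 900, 1180, 1020, 1260, 1100]

# H0: equal group means; Ha: the means differ (two-sided)
se = sqrt(stdev(low_income) ** 2 / len(low_income)
          + stdev(high_income) ** 2 / len(high_income))
z = (mean(low_income) - mean(high_income)) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```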
Step 2: Write Up Results and Complete Part B of the Assignment #3 Template
For this 1 to 2 page section, refer to the inferential statistics data plan and computations done for Part A
of this assignment. Address the following area:
➢ Introduction. Based on the scenario you submitted for the second assignment, provide a brief
description of scenario, including the variables that were used in this analysis. Include a
completed “Table 1: Variables Selected for Analysis” to show the variables you selected for
analysis.
Anomaly detection (or outlier analysis) is the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection, and monitoring processes in various domains including energy, healthcare, and finance. In this talk, we will introduce anomaly detection and discuss the various analytical and machine learning techniques used in this field. Through a case study, we will discuss how anomaly detection techniques could be applied to energy data sets. We will also demonstrate, using R and Apache Spark, an application to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
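One of the simplest outlier-analysis techniques in this family is a robust z-score. A toy sketch (the sensor readings are made up, and the 3.5 cutoff is a common rule of thumb, not a universal constant):

```python
from statistics import median

# Hypothetical sensor readings with one obvious anomaly
readings = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 71.5, 50.1, 49.7, 50.0]

med = median(readings)
mad = median(abs(x - med) for x in readings)  # median absolute deviation

# 0.6745 rescales MAD to match the standard deviation for normal data;
# points whose robust z-score exceeds 3.5 are flagged
anomalies = [x for x in readings if 0.6745 * abs(x - med) / mad > 3.5]

print("anomalies:", anomalies)
```

The median/MAD pair is preferred over mean/stdev here because a large outlier inflates the standard deviation and can mask itself.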
Globus Compute with IRI Workflows - GlobusWorld 2024, by Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Elementary Statistics (MATH220)
Assignment:
Statistical Project & Presentation
Purpose:
The purpose of this project is to supplement lecture material by having the students to do a case study on collecting, analyzing, and interpreting data.
***The best way to understand something is to experience it for yourself.
Guideline for Analyzing Data and Writing a Report
Below is a general outline of the topics that should be included in your report.
1.
Introduction.
State the topic of your study.
2.
Define Population.
Define the population that you intend for your study to represent.
3.
Define Variable.
Define clearly the variable that you obtained during your data collection; this should include information on how the variable is measured and what possible values this variable has.
4.
Data Collection.
Describe your data collection process, including your data source, your sampling strategy, and what steps you took to avoid bias.
5.
Study Design.
Describe the procedures you followed to analyze your data.
6.
Results: Descriptive Statistics.
Give the relevant descriptive statistics for the sample you collected.
7.
Results: Statistical Analysis.
Describe the results of your statistical analysis.
8.
Findings.
Interpret the results of your analysis in the context of your original research question. Was your hypothesis supported by your statistical analyses? Explain.
9.
Discussion.
What conclusions, if any, do you believe you can draw as a result of your study? If the results were not what you expected, what factors might explain your results? What did you learn from the project about the population you studied? What did you learn about the research variable? What did you learn about the specific statistical test you conducted?
.
STAT200: Assignment #2 - Descriptive Statistics Analysis and Writeup - Instructions
Page 1 of 3
STAT200 Introduction to Statistics
Assignment #2: Descriptive Statistics Analysis and Writeup
Assignment #2: Descriptive Statistics Analysis and Writeup
In the first assignment (Assignment #1: Descriptive Statistics Analysis Data Plan), you developed a
scenario about annual household expenditures and a plan for analyzing the data using descriptive
statistic methods. The purpose of this assignment is to carry out the descriptive statistics analysis plan
and write up the results. The expected outcome of this assignment is a two to three page write-up of
the findings from your analysis as well as a recommendation.
Assignment Steps:
Step #1: Review Feedback from Your Instructor
Before performing any analysis, please make sure to review your instructor’s feedback on Assignment
#1: Descriptive Statistics Data Analysis Plan. Based on the feedback, modify variables, tables, and
selected statistics, graphs, and tables, if needed.
Step #2: Perform Descriptive Statistic Analysis
Task 1: Look at the dataset.
• (Re)Familiarize yourself with the variables. Review Table 1: Variables Selected for the
Analysis you generated for the first assignment as well as your instructor’s feedback. In
addition, look at the data dictionary contained in the data set for information about the
variables.
• Select the variables you need for the analysis.
Task 2: Complete your data analysis, as outlined in your first assignment, with any needed
modifications, based on your instructor’s feedback.
• Calculate Measures of Central Tendency and Variability. Use the information from
Assignment #1 - Table 2. Numerical Summaries of the Selected Variables. Here again,
be sure to see your instructor’s feedback and incorporate into the analysis.
• Prepare Graphs and/or Tables. Use the information from Assignment #1 - Table 3.
Type of Graphs and/or Tables for Selected Variables. Here again, be sure to see your
instructor’s feedback and incorporate into the analysis.
STAT200: Assignment #2 - Descriptive Statistics Analysis and Writeup - Instructions
Page 2 of 3
Step #3: Write-up findings using the Provided Template
For this part of the assignment, write a short 2-3 page write-up of the process you followed and the
findings from your analysis. You will describe, in words, the statistical analysis used and present the
results in both statistical/text and graphic formats.
Here are the main sections for this assignment:
✓ Identifying Information. Fill in information on name, class, instructor, and date.
✓ Introduction. For this section, use the same scenario you submitted for the first assignment and
modified using your instructor’s feedback, if needed. Include Table 1 (Table 1: Variables
Selected for the Analysis) you used in Assignment #1 to show the variables you selected for the
analysis.
✓ Data .
The aim of this course is to equip the students with the necessary skills, including both the acquisition of habits of thought and knowledge of the techniques of modern econometrics.
The course is application oriented.
The course also aims to provide students with the ability to use appropriate software in an effective manner.
Course Project AJ DAVIS DEPARTMENT STORESIntroduction.docxvanesaburnand
Course Project: AJ DAVIS DEPARTMENT STORES
Introduction
AJ DAVIS is a department store chain, which has many credit customers and wants to find out more information about these customers. A sample of 50 credit customers is selected with data collected on the following five variables.
1. Location (rural, urban, suburban)
2. Income (in $1,000's—be careful with this)
3. Size (household size, meaning number of people living in the household)
4. Years (the number of years that the customer has lived in the current location)
5. Credit balance (the customers current credit card balance on the store's credit card, in $).
The data is available in Doc Sharing Course Project Data Set as an Excel file. You are to copy and paste the data set into a minitab worksheet.
PROJECT PART A: Exploratory Data Analysis
· Open the file MATH533 Project Consumer.xls from the Course Project Data Set folder in Doc Sharing.
· For each of the five variables, process, organize, present, and summarize the data. Analyze each variable by itself using graphical and numerical techniques of summarization. Use minitab as much as possible, explaining what the printout tells you. You may wish to use some of the following graphs: stem-leaf diagram, frequency or relative frequency table, histogram, boxplot, dotplot, pie chart, bar graph. Caution: Not all of these are appropriate for each of these variables, nor are they all necessary. More is not necessarily better. In addition, be sure to find the appropriate measures of central tendency and measures of dispersion for the above data. Where appropriate use the five number summary (the Min, Q1, Median, Q3, Max). Once again, use minitab as appropriate, and explain what the results mean.
· Analyze the connections or relationships between the variables. There are 10 pairings here (location and income, location and size, location and years, location and credit balance, income and size, income and years, income and balance, size and years, size and credit balance, years and Credit Balance). Use graphical as well as numerical summary measures. Explain what you see. Be sure to consider all 10 pairings. Some variables show clear relationships, while others do not.
· Prepare your report in Microsoft Word (or some other word processing package), integrating your graphs and tables with text explanations and interpretations.Be sure that you have graphical and numerical back up for your explanations and interpretations. Be selective in what you include in the report. I'm not looking for a 20-page report on every variable and every possible relationship (that's 15 things to do). Rather, what I want you do is to highlight what you see for three individual variables(no more than one graph for each, one or two measures of central tendency and variability (as appropriate), and two or three sentences of interpretation). For the 10 pairings, identify and report only on three of the pairings, again using graphical and numerical summary (as.
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningQuantUniversity
Anomaly detection (or Outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. It is used is applications such as intrusion detection, fraud detection, fault detection and monitoring processes in various domains including energy, healthcare and finance.
ECO 510 Final Project Guidelines and Rubric Overview The final.docxjack60216
ECO 510 Final Project Guidelines and Rubric
Overview
The final project for this course is the creation of an empirical analysis for HHS executives.
Economists conduct research, collect and analyze data, monitor economic trends, and develop forecasts. They use their understanding of economic
relationships to advise businesses, industry, and government agencies. For this assignment, you are tasked to be an economist consulting on a study to advise
the U.S. Department of Health and Human Services (HHS) on economic and demographic factors that might influence the amount of physical activity of the
population. Understanding this nexus among economics, demographics, and physical activity will be vital for HHS to be able to focus its limited funding for a
health improvement initiative.
The project is divided into three milestones, which will be submitted at various points throughout the course to scaffold learning and ensure quality final
submissions. These milestones will be submitted in Modules Three, Seven, and Nine. The final project will be submitted in Module Ten.
In this assignment, you will demonstrate your mastery of the following course outcomes:
• Analyze quantitative data through the integration of appropriate and relevant mathematical objects such as matrices, polynomials, graphs, and
derivatives to solve economic problems
• Apply appropriate data collection instruments and measures for planning and conducting economic research
• Analyze statistical data using appropriate statistical methodologies for the study of economic growth, business cycles, and other dynamic business
behavior
• Apply probability theory in making economic decisions to competitively position a business
Prompt
The aim of the HHS initiative, known as NEXI (National Economics of Exercise Initiative), is to increase the amount of leisure-time physical activity of the
population for health and economic benefits. A secondary aim of NEXI is to provide some directions for small businesses, nonprofits, and states to apply for
contracts and grants, which will provide more leisure-time physical activities to the U.S. population.
Therefore, your empirical analysis should answer the following questions:
• What are some economic and demographic factors that influence the level of leisure-time physical activity?
• Which states and which geographical regions might benefit the most from NEXI?
• What kind of businesses can monetarily benefit by creating services for NEXI (based on your findings for the previous questions)? What are the optimal
services?
'v ,..,ellOrm the statistical tests, you will use Minitab. The four assigned critical tasks are designed to give a basic introduction to some of the main statistical
functions that you will use for your empirical analysis.
Specifically, the following critical elements must be addressed:
I. Research. The research will provide foundational knowledge for you to select the appropriate data sets and qualitativ ...
STAT200: Assignment #3 - Inferential Statistics Analysis and Writeup - Instructions
Page 1 of 5
STAT200 Introduction to Statistics
Assignment #3: Inferential Statistics Analysis and Writeup
Purpose:
The purpose of this assignment is to develop and carry out an inferential statistics analysis plan and
write up the findings. There are two main parts to this assignment:
● Part A: Inferential Statistics Data Plan and Analysis
● Part B: Write up of Results
Part A: Prepare Data Plan, Analyze Data, and Complete Part A of the Assignment #3 Template
➢ Task 1: Select Variables. Review the variables you used for assignments #1 and #2. Select your
qualitative socioeconomic variable as your grouping variable and the two expenditure variables
from the variables used in these previous assignments. Fill in Table 1: Variables Selected for
Analysis with name, description, and type of variable (i.e., qualitative or quantitative).
➢ Task 2: Select and Run a One Sample Confidence Interval Analysis. For one expenditure
variable, select and run the appropriate method for estimating a parameter, based on a statistic
(i.e., confidence interval method). Complete Table 2: Confidence Interval Information and
Results, which follows the format outlined by Kozak and the course’s problem-solving approach,
including:
○ Random variable stated in words
○ Confidence interval method, including rationale and assumptions
○ Method used for analyzing data (i.e., web applets, Excel, TI calculator, etc.).
○ Results obtained
○ Interpretation
➢ Task 3: Select Two Sample Hypothesis Test. Using the second expenditure variable (with the
socioeconomic variable as the grouping variable), select and run the appropriate method for
making decisions about two parameters relative to observed statistics (i.e., two sample
STAT200: Assignment #3 - Inferential Statistics Analysis and Writeup - Instructions
Page 2 of 5
hypothesis test method). Complete Table 3: Two Sample Hypothesis Test Analysis, which
follows the format outlined by Kozak and the course’s problem-solving approach, including:
○ Hypotheses (null and alternative).
○ Two sample hypothesis testing method, including rationale and assumptions
○ Method used for analyzing data (i.e., web applets, Excel, TI calculator, etc.).
○ Results obtained.
○ Interpretation (i.e., Reject the null hypothesis OR Fail to reject null hypothesis)
Step 2: Write Up Results and Complete Part B of the Assignment #3 Template
For this 1 to 2 page section, refer to the inferential statistics data plan and computations done for Part A
of this assignment. Address the following area:
➢ Introduction. Based on the scenario you submitted for the second assignment, provide a brief
description of scenario, including the variables that were used in this analysis. Include a
completed “Table 1: Variables Selected for Analysis” to show the variables you selected for
analysis.
2. 1. Bootstrapping
2. Introduction to Regression
3. Simple Linear Regression
4. Summary and Regression Analysis in R
4.1. Formula and Basics
4.2. Examples of Data and Problem
4.3. Visualisation
4.4. Computation
4.5. Interpretation
4.6. Regression Line
4.7. Model Assessment
Content
3. Mid-term Coursework Assignment: 30% of the overall mark
▪ List of five exercises to be performed remotely within a 24-hour
period.
▪ Deadline: 18/03/2022 at 10:00am
Final Coursework Assignment: 70% of the overall mark
▪ Report showing a competent application of quantitative methods
and data analysis concepts learned in our module, exploring a topic
of your own interest.
▪ Word limit: 2000 words.
▪ Deadline: 22/04/2022 at 10:00am
Assessment Profile
4. 1. Instructions and Guidance
• In this Report (2000 words), please proceed as follows:
• Select a topic which you are really interested in exploring.
If you would like me to select one for you, that is completely fine; please just
inform me and I will provide you with a topic to be explored.
• Decide which research question you are going to address.
• Collect data related to your topic and research question.
• Decide which quantitative research method(s) you are going to adopt.
• Perform data analyses applying quantitative research method(s) learnt in this
module on your data using R/ R Studio.
• Detail the method(s) adopted and discuss your findings in your individual Report.
Final Coursework Assignment 70%
5. The structure of this Report should consist of the following brief sections:
• Section 1. Introduction: Briefly mention your topic, question, input data, and
analyses performed;
• Section 2. Data: Detail your dataset, including data source, temporal coverage,
sample size;
• Section 3. Results: Describe the quantitative research methods adopted and data
analyses performed, reporting your results using a complementary chart and
table, discussing your findings;
• Section 4. Conclusion: Summarise your Report, briefly describing the main
quantitative research method adopted as well as your most relevant/ interesting
finding.
• Appendix. Attach an image/ figure (e.g. a code screenshot) evidencing that you
performed your data analyses using R/ R Studio.
Final Coursework Assignment 70% (Cont.)
6. 2. Assessment Rubric with Weighted Criteria
• Following the structure of the Report, five criteria are assessed, each
contributing its respective weight to the overall coursework assignment
mark (totalling 100 points), as follows:
• Section 1. Introduction – weight: 15% of the coursework assignment overall
mark;
• Section 2. Data – weight: 20% of the coursework assignment overall mark;
• Section 3. Results – weight: 40% of the coursework assignment overall mark;
• Section 4. Conclusion – weight: 15% of the coursework assignment overall mark;
• Appendix – weight: 10% of the coursework assignment overall mark.
Final Coursework Assignment 70% (Cont.)
7. Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a particular value
comprising a variable.
An example can be the GDP of Germany in 2020.
Bootstrapping
8. • Bootstrapping is a statistical procedure that resamples a single dataset
to create many simulated samples.
• This process allows us to calculate standard errors, build confidence
intervals, and perform hypothesis testing.
• Both bootstrapping and traditional methods use samples to draw
inferences about populations.
• To accomplish this goal, these procedures treat the single sample that a
study obtains as only one of many random samples that the study could
have collected.
• From a single sample, one can calculate a variety of sample statistics,
such as the mean, median, and standard deviation.
Source: https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
Bootstrapping
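The resampling procedure described above can be sketched in base R. This is an illustrative example only; the sample, its size, and the number of resamples are assumptions, not taken from the slides:

```r
# Bootstrap the mean of a single observed sample (illustrative sketch)
set.seed(42)
sample_data <- rnorm(50, mean = 10, sd = 2)   # the one sample the "study" collected

# Treat the sample as the population: draw many resamples with replacement
# and record each resample's mean
boot_means <- replicate(2000, mean(sample(sample_data, replace = TRUE)))

# Bootstrap standard error and a simple 95% percentile confidence interval
se_boot <- sd(boot_means)
ci_boot <- quantile(boot_means, c(0.025, 0.975))
se_boot
ci_boot
```

The same resampling loop works for the median, standard deviation, or any other sample statistic: only the function applied to each resample changes.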
10. Introduction to Regression
11. Introduction to Regression
Variations of regression analysis
• Simple: One dependent variable (y), the variable to be
predicted, and one independent variable (x)
• Multiple: Two or more independent variables
• Linear: a linear (“straight-line”) connection between
variables
• Nonlinear: a more complex connection (and related
formulas) between variables
Regression analysis aims to identify a mathematical
function that relates two or more variables, so that the
value of one variable may be predicted from given
values of the other(s)
12. A Simple Linear Relationship
(Figure: a straight line y = a + bx plotted on x–y axes, with intercept a and slope b.)
Introduction to Regression (Cont)
13. Simple Linear Regression
14. Basic Concept
• Simple linear regression uses one independent (x) and
one dependent variable (y) and produces a straight line.
• Indicates to what extent the variables are associated;
it does not show cause-and-effect.
Scatter Diagram
• Data plot with the y variable on the vertical axis and the
x variable on the horizontal axis.
Fitting a Line to the Data
• Generally, the line will not fit the data perfectly.
• Need to find the “best-fitting” line.
Simple Linear Regression
15. Objective:
min Σ di² = min Σ (yi − ŷi)²
where
yi = observed value of the dependent variable
ŷi = estimated value of the dependent variable
The least squares criterion identifies the best-fitting
line as the line that minimizes the sum of the
squared vertical distances of points from the line
Simple Linear Regression (Cont)
17. ▪ The slope of the least squares line is calculated as follows:
b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
▪ The intercept of the least squares line is calculated as follows:
a = ȳ − b x̄
where
x = values of the independent variable
y = values of the dependent variable
x̄ = mean of the x values
ȳ = mean of the y values
n = the number of points (observations)
Simple Linear Regression (Cont)
18. The estimated regression equation is defined as follows:
ŷ = a + bx
where
ŷ = estimated value of y for a given value of x
a = intercept
b = slope
Simple Linear Regression (Cont)
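As a quick check, the least squares slope and intercept formulas from the slides above can be computed by hand in R and compared with lm(). The data here are made up purely for illustration:

```r
# Illustrative toy data (not from the slides)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Slope and intercept from the least squares formulas:
# b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2), a = mean(y) - b*mean(x)
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
a <- mean(y) - b * mean(x)

# lm() minimises the same sum of squared residuals, so it returns the same values
fit <- lm(y ~ x)
c(a = a, b = b)
coef(fit)
```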
19. Summary and Regression Analysis in R
20. • The simple linear regression is used to predict a
quantitative outcome y on the basis of one single predictor
variable x.
• The objective is to formulate a model that defines y as a
function of the x variable.
• Once we have built a statistically significant model, it is then
possible to use it for predicting future outcomes on the
basis of new x values.
• Suppose that we want to evaluate the impact of the
advertising budgets of three media (YouTube, Facebook
and newspaper) on future sales.
• This kind of problem can be modelled with linear
regression in R.
Simple Linear Regression in R
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
21. • The mathematical formula of the linear regression can be
written as y = b0 + b1*x + e, where:
▪ b0 and b1 are known as the regression beta coefficients
or parameters, as follows:
▪ b0 is the intercept of the regression line, consisting of
the predicted value when x = 0.
▪ b1 is the slope of the regression line.
▪ e is the error term - also known as the residual errors,
which refers to the part of y that cannot be explained by
the regression model.
Formula and Basics
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
22. • The figure below illustrates the linear regression model,
where:
▪The best-fit regression line is in blue
▪The intercept b0 and the slope b1 are shown in green
▪The error terms (e) are represented by vertical red lines
Formula and Basics (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
23. • From the scatter plot, it can be seen that not all the
data points fall exactly on the fitted regression line.
• Some of the points are above the blue curve and
some are below it.
• Overall, the residual errors (e) have approximately
mean zero.
• The sum of the squares of the residual errors is
called the Residual Sum of Squares or RSS.
• The average variation of points around the fitted
regression line is called the Residual Standard
Error (RSE).
• This is one of the metrics used to evaluate the overall
quality of the fitted regression model.
• The lower the RSE, the better.
Formula and Basics (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
24. • Since the mean error term is zero, the outcome variable y can
be approximately estimated as follows:
y ~ b0 + b1*x
• Mathematically, the beta coefficients (b0 and b1) are
determined so that the RSS is as small as possible.
• This method of determining the beta coefficients is called
least squares regression or ordinary least squares (OLS)
regression.
• Once the beta coefficients are calculated, a t-test is then
performed to check whether or not these coefficients are
statistically significantly different from zero.
• Non-zero beta coefficients mean that there is a statistically
significant relationship between the predictors (x) and the
outcome variable (y).
Formula and Basics (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
25. Load the following required packages:
• tidyverse: For data manipulation and visualisation
• ggpubr: Easily creates publication-ready plots
Loading Required R Packages
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
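In code, loading these packages is simply the following (a sketch; install them first if they are missing):

```r
# install.packages(c("tidyverse", "ggpubr"))  # run once if not yet installed

library(tidyverse)  # data manipulation and visualisation (includes ggplot2)
library(ggpubr)     # easy publication-ready plots
```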
26. • We’ll use the marketing data set [datarium package]. It
contains the impact of three advertising media (YouTube,
Facebook and newspaper) on sales.
• Data are the advertising budget in thousands of pounds along
with the sales.
• The advertising experiment has been repeated 200 times
with different budgets and the observed sales have been
recorded.
• Firstly, install the datarium package
using devtools::install_github("kassambara/datarium")
Examples of Data and Problem
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
27. • Then load and inspect the marketing data as follows:
• We want to predict future sales on the basis of advertising budget
spent on YouTube.
Examples of Data and Problem (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
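The code on this slide appears as a screenshot in the original deck; it is presumably along these lines (the marketing data set ships with the datarium package):

```r
# Load the marketing data set from the datarium package and inspect it
data("marketing", package = "datarium")
head(marketing, 4)
```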
28. • Let’s create a scatter plot displaying the sales units versus YouTube advertising
budget.
• In addition, let’s add a smoothed line, using the following code:
• This graph suggests a linearly
increasing relationship between the sales
and the YouTube variables.
• This is good because one
important assumption of the linear
regression is that the relationship between
the outcome and predictor variables is
linear and additive.
Visualisation
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
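The plotting code (also a screenshot in the original deck) is presumably along these lines; the variable names come from the marketing data set:

```r
library(ggplot2)
data("marketing", package = "datarium")

# Sales versus YouTube advertising budget, with a smoothed trend line
p <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()
p
```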
29. • Let’s also compute the correlation coefficient between the two variables using the R
function cor()
• The correlation coefficient measures the level of the association between two
variables x and y, ranging between -1 (perfect negative correlation: when x
increases, y decreases) and +1 (perfect positive correlation: when x increases, y
increases).
• A value closer to 0 suggests a weak relationship between the variables.
• A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation of
the outcome variable (y) is not explained by the predictor (x).
In such a case, we should probably look for better predictor variables.
• In our example, the correlation coefficient is large enough, so we can continue by
building a linear model of y as a function of x.
Examples of Data and Problem (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
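The cor() call referred to above would look like this (a sketch, assuming the marketing data is loaded as earlier):

```r
data("marketing", package = "datarium")

# Pearson correlation between sales and YouTube advertising budget
r <- cor(marketing$sales, marketing$youtube)
r
```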
30. • The simple linear regression tries to find the best line to predict sales on the
basis of YouTube advertising budget.
• The linear model equation can be written as follows:
sales = b0 + b1 * youtube
• The R function lm() can be used to determine the beta coefficients of the linear
model, as follows:
• The results show the intercept and the beta coefficient for the YouTube variable.
Computation
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
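The lm() call described on this slide is presumably:

```r
data("marketing", package = "datarium")

# Fit sales as a linear function of the YouTube advertising budget
model <- lm(sales ~ youtube, data = marketing)
model   # prints the intercept (b0) and the youtube coefficient (b1)
```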
31. From the output on the previous slide we have the following:
• The estimated regression line equation can be written as follow:
sales = 8.44 + 0.048*youtube
• The intercept b0 is 8.44. It can be interpreted as the predicted sales unit for a zero
YouTube advertising budget.
• Recall that we are operating in units of a thousand pounds. This means that, for a
YouTube advertising budget equal to zero, we can then expect a sale of
8.44 * 1,000 = 8,440 pounds
• The regression beta coefficient for the variable YouTube b1, also known as the slope, is
0.048.
This means that, for a YouTube advertising budget equal to 1,000 pounds, we can expect
an increase of 48 units (0.048*1,000) in sales. That is:
sales = 8.44 + 0.048*1000 = 56.44 units.
• As we are operating in units of thousand pounds, this represents a sale of 56,440
pounds.
Interpretation
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
32. • To add the regression line onto the scatter
plot, you can use the
function stat_smooth() [ggplot2].
• By default, the fitted line is presented with
confidence interval around it. The
confidence bands reflect the uncertainty
about the line.
• If you don’t want to display it, specify the
option se = FALSE in the
function stat_smooth().
Regression Line
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
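A sketch of the code for this plot, assuming the same marketing data and ggplot2:

```r
library(ggplot2)
data("marketing", package = "datarium")

# Scatter plot with the fitted least squares line and its confidence band
p <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth(method = "lm")
# Use stat_smooth(method = "lm", se = FALSE) to hide the confidence band
p
```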
33. • In the previous slides, we built a linear model of sales as a
function of YouTube advertising budget:
sales = 8.44 + 0.048*youtube
• Before using this formula to predict future sales, you should make
sure that this model is statistically significant, that is:
▪There is a statistically significant relationship between the
predictor and the outcome variables
▪The model that we built fits the data at hand very well.
• Therefore, in the next slides we explain how to check the quality
of a linear regression model.
Model Assessment
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
34. • We start by displaying the statistical summary of the model using
the R function summary()
Model Summary
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
The R summary output shows 6 components,
including:
Call. Shows the function call used to compute the
regression model.
Residuals. Provide a quick view of the distribution
of the residuals, which by definition have a mean
zero. Therefore, the median should not be far from
zero, and the minimum and maximum should be
roughly equal in absolute value.
Coefficients. Shows the regression beta
coefficients and their statistical significance.
Predictor variables that are significantly associated
with the outcome variable are marked by stars.
Residual standard error (RSE), R-squared (R2)
and the F-statistic are metrics that are used to
check how well the model fits our data.
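The summary shown on this slide comes from a call like the following (the deck reports an intercept of 8.44, a slope of 0.048 and an RSE of 3.91 for this model):

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

# Full statistical summary: Call, Residuals, Coefficients, RSE, R-squared, F-statistic
s <- summary(model)
s
```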
35. • The coefficients table, in the model statistical summary, shows:
▪The estimates of the beta coefficients.
▪The standard errors (SE), which define the accuracy of the beta
coefficients. For a given beta coefficient, the SE reflects how the
coefficient varies under repeated sampling. It can be used to
compute the confidence intervals and the t-statistic.
▪The t-statistic and the associated p-value, which define the
statistical significance of the beta coefficients.
Coefficients Significance
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
36. • For a given predictor, the t-statistic (and its associated p-value) tests
whether or not there is a statistically significant relationship between a
given predictor and the outcome variable.
• The statistical hypotheses are as follows:
• Null hypothesis (H0): The coefficients are equal to zero (i.e. no relationship
between x and y)
• Alternative Hypothesis (Ha): The coefficients are not equal to zero (i.e. there is
some relationship between x and y)
• Mathematically, for a given beta coefficient (b), the t-test is computed as
t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b.
• The t-statistic measures the number of standard deviations that b is
away from 0. Therefore, a large t-statistic produces a small p-value.
t-statistic and p-values
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
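The computation t = (b - 0)/SE(b) can be reproduced from the coefficients table. This is a sketch; the row and column names are those used by summary.lm for the model fitted earlier:

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

# t = estimate / standard error, for the youtube coefficient
coefs <- summary(model)$coefficients
t_youtube <- coefs["youtube", "Estimate"] / coefs["youtube", "Std. Error"]

# This matches the "t value" column reported by summary()
t_youtube
</imports>
```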
37. • The larger the t-statistic – and, consequently, the lower the p-
value, the more significant the predictor.
• The symbols to the right visually specify the level of significance.
The line below the table shows the definition of these symbols.
For example, one star means 0.01 < p < 0.05. The more the stars
beside the variable’s p-value, the more significant the variable.
• A statistically significant coefficient indicates that there is a
statistically significant association between the predictor (x) and
the outcome (y) variable.
t-statistic and p-values (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
38. • In our example, both the p-values for the intercept and the
predictor variable are highly significant.
• Thus, we can reject the null hypothesis and accept the alternative
hypothesis, which means that there is a significant association
between the predictor and the outcome variables.
• The t-statistic is a very useful guide for whether or not to include
a predictor in a model. High t-statistics (i.e. low p-values near 0)
indicate that a predictor should be retained in a model, while very
low t-statistics indicate a predictor variable could be dropped.
t-statistic and p-values (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
39. • The standard error measures the variability/accuracy of the beta
coefficients.
• It can be used to compute the confidence intervals of the
coefficients.
• For example, the 95% confidence interval for the coefficient b1 is
defined as b1 +/- 2*SE(b1), where:
▪ The lower limits of b1 = b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042
▪ The upper limits of b1 = b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052
• That is, there is approximately a 95% chance that the interval
[0.042, 0.052] will contain the true value of b1.
Standard Errors and Confidence Intervals
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
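R computes exact confidence intervals with confint(); the b1 +/- 2*SE(b1) rule above is an approximation to this (a sketch, using the same model as before):

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

# 95% confidence intervals for the intercept and the youtube slope
ci <- confint(model, level = 0.95)
ci
```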
40. • Once you have identified that at least one predictor variable is
significantly associated with the outcome, you should continue the
diagnostic by checking how well the model fits the data.
• This process is also referred to as the goodness-of-fit.
• The overall quality of the linear regression fit can be assessed
using the following three quantities, displayed in the model
summary:
1. Residual Standard Error (RSE).
2. R-squared (R2)
3. F-statistic
Model Accuracy
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
41. • The RSE (also known as the model sigma) is the residual variation,
representing the average variation of the observations points
around the fitted regression line.
• This is the standard deviation of residual errors.
• RSE provides an absolute measure of patterns in the data that
cannot be explained by the model.
• When comparing two models, a smaller RSE is a good indication
that the model fits the data better.
• Dividing the RSE by the average value of the outcome variable
results in the prediction error rate, which should be as small as
possible.
Model Accuracy 1: Residual Standard Error
(RSE)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
42. • In our example, RSE = 3.91, meaning that the observed sales
values deviate from the true regression line by approximately 3.9
units on average.
• Whether or not an RSE of 3.9 units is an acceptable prediction
error is subjective and depends on the problem context.
• However, we can calculate the percentage error. In our data set,
the mean value of sales is 16.827, and so the percentage error is
3.9/16.827 = 23%.
Model Accuracy 1: Residual Standard Error
(RSE) (Cont.)
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
43. • The R-squared (R2) ranges from 0 to 1 and represents the proportion of
information (i.e. variation) in the data that can be explained by the
model.
• The adjusted R-squared adjusts for the degrees of freedom.
• The R2 measures how well the model fits the data.
• For a simple linear regression, R2 is the square of the Pearson
correlation coefficient.
• A large value of R2 is a good indication. However, because R2
tends to increase as more predictors are added to the model, as in
a multiple linear regression model, you should mainly consider the
adjusted R-squared, which penalises R2 for a higher number of
predictors.
▪ An (adjusted) R2 that is close to 1 indicates that a large proportion of the
variability in the outcome has been explained by the regression model.
▪ A number near 0 indicates that the regression model did not explain much of the
variability in the outcome.
Model Accuracy 2: R-squared and Adjusted R-squared
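The R² definitions above can be sketched as follows, including a check that, for one predictor, R² equals the squared Pearson correlation (plain Python for illustration; the data are made up, not the slides' dataset):

```python
# Minimal sketch: R-squared, adjusted R-squared, and the Pearson
# correlation for a simple linear regression. Illustrative data.

def r_squared(x, y):
    """R2 = 1 - RSS/TSS for the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1 - rss / tss

def adjusted_r_squared(r2, n, p):
    """Penalise R2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r2 = r_squared(x, y)
print(round(r2, 4), round(pearson(x, y) ** 2, 4))  # the two agree
print(round(adjusted_r_squared(r2, n=len(x), p=1), 4))
```

With a single predictor the adjustment is small; its purpose shows when comparing models with different numbers of predictors, where plain R² would always favour the larger model.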
44. • The F-statistic gives the overall significance of the model. It
assesses whether at least one predictor variable has a non-zero
coefficient.
• In a simple linear regression, this test is not very informative,
since it just duplicates the information given by the t-test in
the coefficient table.
• In fact, the F-statistic is the square of the t-statistic: 312.1 =
(17.67)^2. This holds in any model with 1 degree of freedom.
• The F-statistic becomes more important once we start using
multiple predictors as in multiple linear regression.
• A large F-statistic corresponds to a statistically significant p-value
(p < 0.05). In our example, the F-statistic equals 312.14, producing
a p-value of 1.46e-42, which is highly significant.
Model Accuracy 3: F-statistic
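The F-statistic, and its relation F = t² in the one-predictor case noted above, can be sketched like this (plain Python; illustrative data, not the slides' example):

```python
# Minimal sketch: overall F-statistic and slope t-statistic for a
# simple linear regression; with one predictor, F equals t squared.

def f_and_t(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    mse = rss / (n - 2)             # residual mean square, 1 predictor
    f_stat = (tss - rss) / mse      # explained MS over residual MS
    t_stat = b1 / (mse / sxx) ** 0.5  # slope over its standard error
    return f_stat, t_stat

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
f_stat, t_stat = f_and_t(x, y)
print(f"F = {f_stat:.1f}, t^2 = {t_stat ** 2:.1f}")  # the two agree
```

This mirrors the slides' 312.1 = (17.67)^2 identity: the explained sum of squares equals b1²·Sxx, which is exactly the numerator of t².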
45. • After computing a regression model, the first step is to check
whether at least one predictor is significantly associated with
the outcome variable.
• If one or more predictors are significant, the second step is to
assess how well the model fits the data by inspecting the
Residual Standard Error (RSE), the R2 value and the F-statistic.
• These metrics give the overall quality of the model.
Summary
Residual Standard Error (RSE): the closer to zero, the better
R-squared: the larger, the better
F-statistic: the larger, the better
46. • Bootstrapping is a popular statistical procedure that resamples a single dataset to create many
simulated samples in order to calculate standard errors, build confidence intervals, and
perform hypothesis testing.
• Simple linear regression is used to predict a quantitative outcome y on the basis of one single
predictor variable x.
• The objective is to formulate a model that defines y as a function of the x variable.
• Once we have built a statistically significant model, it is then possible to use it for predicting
future outcomes on the basis of new x values.
• The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where
b0 and b1 are the regression parameters and e is the error term that refers to the part of y that
cannot be explained by the regression model.
• The larger the t-statistic and, consequently, the lower the p-value, the more significant the
predictor variable x is.
• The overall quality of the linear regression fit can be assessed using the following three
quantities: the Residual Standard Error (RSE, the closer to zero the better), R-squared (the
larger the better), and the F-statistic (the larger the better).
Takeaways
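The bootstrapping takeaway can be sketched by resampling (x, y) pairs with replacement and refitting the slope each time; the percentile interval of the resampled slopes gives a confidence interval. Plain Python for illustration; the dataset and parameter choices below are assumptions, not from the slides:

```python
# Minimal sketch: a percentile bootstrap confidence interval for the
# slope b1 of a simple linear regression. Illustrative data.
import random

def fit_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05, seed=42):
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) < 2:        # skip degenerate resamples
            continue
        slopes.append(fit_slope(bx, by))
    slopes.sort()
    lo = slopes[int(alpha / 2 * len(slopes))]
    hi = slopes[int((1 - alpha / 2) * len(slopes)) - 1]
    return lo, hi

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
lo, hi = bootstrap_slope_ci(x, y)
print(f"slope = {fit_slope(x, y):.3f}, 95% bootstrap CI = "
      f"[{lo:.3f}, {hi:.3f}]")
```

Resampling whole (x, y) pairs, rather than residuals, is one common bootstrap scheme; it requires no distributional assumptions about the error term e.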
47. • Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge University Press.
• Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly
Media.
• Evans, J. R., Olson, D. L., & Olson, D. L. (2007). Statistics, Data Analysis, and Decision
Modeling. New Jersey: Pearson/Prentice Hall.
• Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business Statistics. Wiley Global
Education.
• http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-
regression-in-r/#examples-of-data-and-problem
• James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction
to Statistical Learning: With Applications in R. Springer Publishing Company,
Incorporated.
• Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative Analysis for
Management, 13e. Prentice Hall.
References
48. Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a single recorded
value of a variable.
An example is the GDP of Germany in 2020.
Any Questions?
49. Thank You!
Thank You!