Lec. 10: Making Assumptions of Missing data

Energy Systems
Modeling and Optimization
Dr. Mohamad Kharseh
Office: G 342
mohamad.kharseh@aurak.ac.ae

Important Functions in Excel
 Frequency f for each class: FREQUENCY(data_array, bins_array)
 The mean: AVERAGE(number1, number2, …)
 The median: MEDIAN(number1, number2, …)
 The mode: MODE(number1, number2, …)
 $z = \frac{a - \mu}{\sigma}$
 P(x ≤ a) = F(a) = NORMSDIST(z) = NORM.DIST(a, mean, standard_dev, TRUE)
 Z-score: zα/2 = NORMSINV(1 − α/2) (NORMSINV(α/2) returns the same value with a negative sign)
 T-score: tα/2 = T.INV(1 − α/2, n − 1) (T.INV(α/2, n − 1) returns the same value with a negative sign)
 The error margin E:
o E = CONFIDENCE(α, σ, n), n ≥ 30
o E = CONFIDENCE.T(α, s, n), n < 30
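The normal-distribution functions listed above have close counterparts outside Excel. The following is a minimal sketch, assuming Python 3.8+ and its standard `statistics.NormalDist` class; the helper names `norm_dist`, `z_critical`, and `confidence` are illustrative and do not come from the lecture. The standard library has no t-distribution, so the t-based functions are not covered here.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def norm_dist(a, mean, standard_dev):
    """Cumulative probability P(x <= a), like NORM.DIST(a, mean, standard_dev, TRUE)."""
    return NormalDist(mean, standard_dev).cdf(a)


def z_critical(alpha):
    """Positive critical value z_{alpha/2}, like NORMSINV(1 - alpha/2)."""
    return std_normal.inv_cdf(1 - alpha / 2)


def confidence(alpha, sigma, n):
    """z-based margin of error, like CONFIDENCE(alpha, sigma, n)."""
    return z_critical(alpha) * sigma / sqrt(n)


print(round(norm_dist(50, 60, 10), 4))     # P(x <= 50) for N(60, 10)
print(round(z_critical(0.05), 4))          # critical value for a 95% confidence level
print(round(confidence(0.05, 10, 25), 4))  # margin of error for sigma = 10, n = 25
```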

Available and Missing Variables
When modeling a system, encountering missing data is common.
 What shall a modeler do in the case of unknown or missing information?

Available and Missing Variables
 When dealing with missing data, it is critical to make sound assumptions so that
the model remains accurate.
 One common strategy for handling such situations is to calculate the average of the
available data from similar existing systems (i.e., to build a sample).
 Use this average as a reasonable estimate for the missing value.

However, be cautious
 Variability: If the collected data vary significantly (high standard deviation),
relying solely on the average may not be accurate.
 Instead, the modeler uses a confidence interval to define the range in which the
missing value is likely to lie.
 In such a case, two approaches exist:
 Normal distribution and z-test: the standard deviation of the population is known and the sample size is
30 or more
 t-distribution and t-test: the standard deviation of the population is unknown or the sample size is
smaller than 30

Continuous Random Variables
 Continuous random variables play a crucial role in probability and statistics, dealing with
scenarios where the variable can take on any value within a specific range. Unlike discrete
variables (which have whole number values), continuous variables represent a spectrum of
possibilities.

Probability
 A probability statement describes the likelihood that a particular value occurs.
 The likelihood is quantified by assigning a number from the interval [0, 1] to the set of
values (or a percentage from 0 to 100%).
 A probability is usually expressed in terms of a random variable, e.g., P(x)=80%.
 Higher numbers indicate that the set of values is more likely.

Probability Distribution Types

Probability Density Function Rules
 The probability distribution of x is described by a density curve.
o f (x) is the probability density function (pdf)
 The probability density cannot be negative, f(x)≥0
 The total area under the curve must be 1.
 The probability of a continuous random variable is not defined at specific values.
 If x is continuous, then for any number c, P(x = c) = 0.
 Instead, it is defined over an interval of values, P(a ≤ x ≤ b).
 P(a ≤ x ≤ b) is the shaded area below the pdf:
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

Cumulative Distribution Function
 F(x) is the cumulative distribution function (cdf), with $dF(x)/dx = f(x)$.
 The probability of any value of x below x0 equals the area under the density curve to the left
of x0:
$$P(x \le x_0) = P(x < x_0) = F(x_0) = \int_{-\infty}^{x_0} f(x)\,dx$$
$$P(x \ge x_0) = 1 - F(x_0) = 1 - \int_{-\infty}^{x_0} f(x)\,dx$$
 For any two numbers a and b with a < b:
$$P(a \le X \le b) = F(b) - F(a) = \int_a^b f(x)\,dx$$

Exercise: Reaction Time
The following cumulative distribution function approximates the time until a chemical reaction
is completed (in milliseconds, ms):
$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - e^{-0.01x} & \text{for } 0 \le x \end{cases}$$
 What is the probability density function?
 What proportion of reactions is complete within 200 ms?
Solution:
$$f(x) = \frac{dF(x)}{dx} = \begin{cases} 0 & \text{for } x < 0 \\ 0.01\,e^{-0.01x} & \text{for } 0 \le x \end{cases}$$
$$P(X < 200) = F(200) = 1 - e^{-2} = 0.8647$$

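The reaction-time exercise can be checked numerically. This is a small sketch using only Python's `math` module; the function names `reaction_cdf` and `reaction_pdf` are illustrative. It evaluates F(200) and compares the analytic density with a finite-difference slope of the cdf at an arbitrarily chosen point (x = 100 ms).

```python
from math import exp


def reaction_cdf(x):
    """F(x) = 1 - exp(-0.01 x) for x >= 0, and 0 otherwise."""
    return 0.0 if x < 0 else 1.0 - exp(-0.01 * x)


def reaction_pdf(x):
    """f(x) = dF/dx = 0.01 exp(-0.01 x) for x >= 0, and 0 otherwise."""
    return 0.0 if x < 0 else 0.01 * exp(-0.01 * x)


# Proportion of reactions complete within 200 ms
p_200 = reaction_cdf(200)

# Central-difference slope of the cdf at x = 100 ms, to compare with the pdf
h = 1e-6
slope_100 = (reaction_cdf(100 + h) - reaction_cdf(100 - h)) / (2 * h)

print(round(p_200, 4))                      # 0.8647
print(abs(slope_100 - reaction_pdf(100)))   # difference should be very small
```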
Normal Distribution
 The famous "bell curve" distribution!
 Key characteristics:
o Symmetrical around the mean (μ).
The total area under the curve is 1.0, so half is above the mean, half is below
o Standard deviation (σ) controls the spread of the curve.
Larger σ indicates a wider spread of values.
o It is usually denoted by N(μ, σ).
 Equation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Cumulative Distribution Function of Normal Distribution
 $F(x_0) = \int_{-\infty}^{x_0} f(x)\,dx$
 P(x ≤ a) = F(a) = NORM.DIST(a, mean, standard_dev, TRUE)
 P(x ≥ a) = 1 − F(a)
 P(a ≤ x ≤ b) = F(b) − F(a)

Standard Normal Distribution
 The standard normal distribution, also called the z-distribution, is a special normal
distribution where the mean is 0 and the standard deviation is 1.
 The CDF doesn't have a simple closed-form expression and usually requires tables.
o In Excel: P(x≤𝑎)=𝐹(𝑎)=NORMSDIST(z)

Unusual values Standard deviation
 Unusual values occur outside the range −2 ≤ z ≤ 2 (or µ − 2σ ≤ x ≤ µ + 2σ)

Example 1: Young Women’s Heights
The height of young women can be defined as a continuous random variable (Y) with
probability distribution N(64, 2.7).
A. What is the probability that a randomly chosen young woman has a height between 68 and
70 inches? P(68 ≤ Y ≤ 70) = ?
$$z_1 = \frac{68 - 64}{2.7} = 1.4815, \qquad z_2 = \frac{70 - 64}{2.7} = 2.2222$$
 P(1.48 ≤ Z ≤ 2.22) = P(Z ≤ 2.22) − P(Z ≤ 1.48) = 0.9869 − 0.9308 = 0.0561
There is about a 5.6% chance that a randomly chosen
young woman has a height between 68 and 70 inches.

Example 1: Young Women’s Heights
The height of young women can be defined as a continuous random variable (Y) with
probability distribution N(64, 2.7).
B. At 71 inches tall, is Mrs. Daniel unusually tall? P(Y ≤ 71) = ?
$$z = \frac{71 - 64}{2.7} = 2.5926 > 2$$
P(Y ≤ 71) = 0.995
Yes, Mrs. Daniel is unusually tall because 99.5% of the
population is shorter than her.
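Both parts of Example 1 can be reproduced without a z-table. The sketch below assumes Python's standard `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=64, sigma=2.7)   # heights in inches, N(64, 2.7)

# Part A: probability of a height between 68 and 70 inches
p_68_70 = heights.cdf(70) - heights.cdf(68)

# Part B: z-score of 71 inches and the share of the population below it
z_71 = (71 - 64) / 2.7
p_below_71 = heights.cdf(71)

print(f"P(68 <= Y <= 70) = {p_68_70:.4f}")
print(f"z(71) = {z_71:.4f}, P(Y <= 71) = {p_below_71:.3f}")
print("unusual" if abs(z_71) > 2 else "not unusual")
```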

Example 2: Time for Charging
The average battery takes 60 minutes (μ) to fully charge, with a standard deviation (σ) of
10 minutes. We can model the charging time with a normal distribution N(μ, σ).
A. What percentage of batteries take between 45 and 75 minutes to fully charge?
B. If a manufacturer claims that its battery takes only 32 minutes to charge, would you
consider this claim unusual?
C. Determine the charging time x for which P(X ≤ x) = 0.98.

Solution
A. The spreadsheet below evaluates the interval with a lower bound of 50 minutes:
P(50 < x < 75) = 0.9332 − 0.1587 = 77.5%.
B. A z-score of −2.8 indicates that the charging time is 2.8 standard
deviations below the mean.
 In normal distributions, most values (around 95%) fall within 2 standard deviations
of the mean. Values beyond this range are less frequent (around 5% in total,
about 2.5% in each tail), so the claim is unusual.
C. We used Goal Seek in Excel to determine the value of x.
Spreadsheet (Example 2):
Part   Quantity      Value
       μ             60.0
       σ             10
A      x             50.0
       z             -1.00
       P(x<50)       0.1587    (ERF check: 0.1586553)
       x             75.0
       z             1.50
       P(x<75)       0.9332    (ERF check: 0.9332)
       P(50<x<75)    77.5%
B      x             32
       z             -2.8000
       P(x<32)       0.3%
C      x             80.56573  (Goal Seek)
       z             2.0566
       P(X<x)        98.0%     (target 0.98)
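The three parts of the charging example can also be computed directly. This is a sketch assuming Python's standard `statistics.NormalDist`; it evaluates part A both for the interval stated in the question (45 to 75 minutes) and for the one used in the spreadsheet (50 to 75 minutes), and it uses the inverse cdf in place of Goal Seek. Because Goal Seek is iterative, its result can differ slightly from the inverse-cdf value.

```python
from statistics import NormalDist

charging = NormalDist(mu=60, sigma=10)   # charging time in minutes

# Part A as stated in the question (45 to 75 minutes)
p_45_75 = charging.cdf(75) - charging.cdf(45)
# Part A as evaluated in the spreadsheet (50 to 75 minutes)
p_50_75 = charging.cdf(75) - charging.cdf(50)

# Part B: z-score and left-tail probability of the claimed 32 minutes
z_32 = (32 - 60) / 10
p_32 = charging.cdf(32)

# Part C: time x with P(X <= x) = 0.98 (closed-form counterpart of Goal Seek)
x_98 = charging.inv_cdf(0.98)

print(f"P(45<x<75) = {p_45_75:.4f}, P(50<x<75) = {p_50_75:.4f}")
print(f"z(32) = {z_32:.1f}, P(x<32) = {p_32:.4f}")
print(f"x for P = 0.98: {x_98:.2f} min")
```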

Example 3: Exam Scores
The scores on the Engineering Statistics Midterm exam can be modeled by a normal
distribution.
A. What is the probability that a randomly chosen engineering student scored between 75 and
90 points on the exam?
B. What is the probability of a student scoring less than 60?
C. What is the probability of a student scoring more than 90?

Student ID Overall Grade
2022005779 48
2021004896 49
2022005709 53
2022005577 56
2022005690 60
2022005436 60
2022005600 67
2022005480 69
2022005802 70
2018003821 72
2021004786 74
2021004893 74
2022005687 74
2022005560 74
2022005359 74
2022005597 75
2022005446 75
2021005070 76
2022005590 76
2022005710 76
2022005479 77
2022005757 78
2022005580 78
2022005581 78
2020004723 78
2022005565 78
2022005402 78
2022005618 79
2022005625 80
2022005401 80
2022005533 80
2022005616 81
2022005350 81
2021005055 81
2017003079 83
2022005700 85
2022005433 85
2022005685 85
2022005558 86
2021004872 86
2022005448 87
2022005678 88
2022005462 88
2021005252 88
2022005636 90
2022005691 91
2023005883 92
2022005535 94
2022005464 95
2021005126 96
2022005413 96
2022005620 97
2022005426 98
2022005444 98
2021004912 100
2022005425 100
2022005663 100
Solution
Scores are approximately normally distributed with a mean (μ) of 80.12 points
and a standard deviation (σ) of 12.35 points.
Part   Quantity      Value
       μ             80.1
       σ             12.34741
A      x             75.0
       z             -0.4149
       P(x<75)       0.3391    (ERF check: 0.339112)
       x             90.0
       z             0.7999
       P(x<90)       0.7881    (ERF check: 0.7881)
       P(75<x<90)    44.902%
B      x             60
       z             -1.6297
       P(x<60)       5.2%
C      x             90
       z             0.7999
       P(x>90)       21.19%
       Observed share of scores above 90 in the data: 12 of 57 = 21.05%
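To see how well the normal model fits the grades, the model probabilities can be compared with the observed shares. The sketch below assumes Python's standard `statistics` module and copies the 57 grades from the table. It uses `pstdev` (the population standard deviation) because that reproduces the spreadsheet's σ = 12.347; this choice is inferred from the numbers and is not stated in the slides.

```python
from statistics import NormalDist, mean, pstdev

grades = [48, 49, 53, 56, 60, 60, 67, 69, 70, 72,
          74, 74, 74, 74, 74, 75, 75, 76, 76, 76,
          77, 78, 78, 78, 78, 78, 78, 79, 80, 80,
          80, 81, 81, 81, 83, 85, 85, 85, 86, 86,
          87, 88, 88, 88, 90, 91, 92, 94, 95, 96,
          96, 97, 98, 98, 100, 100, 100]

mu = mean(grades)
sigma = pstdev(grades)          # population form, matching the spreadsheet value
model = NormalDist(mu, sigma)

p_a = model.cdf(90) - model.cdf(75)   # A: between 75 and 90
p_b = model.cdf(60)                   # B: below 60
p_c = 1 - model.cdf(90)               # C: above 90

# Observed share of students scoring above 90, for comparison with part C
observed_c = sum(g > 90 for g in grades) / len(grades)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"A = {p_a:.4f}, B = {p_b:.4f}, C = {p_c:.4f}, observed C = {observed_c:.4f}")
```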

Example 4
In an electronics lab, PV panels are manufactured with a target capacity of 100 W. However,
due to slight variations in the manufacturing process, the actual capacity of each panel can deviate
by ±2% from the target value. Assume this variation can be modeled using a normal distribution.
A. What is the probability of a panel's capacity being below 98 W?
B. What is the probability of a panel's capacity being below 102 W?
C. What is the probability of a panel meeting the target within the stated accuracy?
D. What is the probability of a panel's capacity exceeding 104 W?
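The slides do not state how the ±2% tolerance maps to the standard deviation. The sketch below ASSUMES that ±2% of 100 W is one standard deviation (σ = 2 W); under a different reading (for example ±2% as a 2σ or 3σ band) the numbers change. It uses Python's standard `statistics.NormalDist`, and the names `TARGET_W` and `SIGMA_W` are illustrative.

```python
from statistics import NormalDist

TARGET_W = 100.0
SIGMA_W = 0.02 * TARGET_W   # ASSUMPTION: the +/-2% tolerance equals one standard deviation

panel = NormalDist(mu=TARGET_W, sigma=SIGMA_W)

p_below_98 = panel.cdf(98)              # A
p_below_102 = panel.cdf(102)            # B
p_within = p_below_102 - p_below_98     # C: between 98 W and 102 W
p_above_104 = 1 - panel.cdf(104)        # D

print(f"A = {p_below_98:.4f}, B = {p_below_102:.4f}")
print(f"C = {p_within:.4f}, D = {p_above_104:.4f}")
```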

Central Limit Theorem
 The Central Limit Theorem (CLT) is a fundamental concept in statistics that describes the
behavior of sample mean drawn from a population, regardless of the shape of the
population's distribution.
 The CLT states that as the sample size increases, the distribution of the sample means
(average of values in each sample) will tend towards a normal distribution
o This is true even if the original population distribution is not normal (e.g., skewed, uniform)
 The normal distribution of the sample mean has
o a mean of $\mu_{\bar{x}} = \mu$
o and a standard error of the sample mean (standard deviation) of $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

From the Book
 If we are sampling from a population that has an unknown probability distribution, the
distribution of the sample mean will still be approximately normal with mean μ and
variance σ2/n if the sample size n is large. The statement is as follows:

Applications of the CLT
 The CLT allows us to apply statistical methods and tools that rely on normal distributions to
data from populations that might not be normally distributed themselves.
o This is incredibly useful because the normal distribution is well-understood and has many
established properties, making it easier to perform calculations and draw inferences from data.
 The CLT allows statisticians to make inferences about population parameters based on
sample data, even when the population distribution is unknown or non-normal.

Example 5
A factory produces metal widgets. The weights of these widgets are known to follow
a uniform distribution between 10 grams and 12 grams.
How does the variability of the average weight change with different sample sizes?
 Solution:
o If we take a small sample (e.g., 3 widgets), the average weight of that sample could be
anywhere between 10 grams and 12 grams, depending on which specific widgets were
chosen. The variability of the means of such small samples will be high.
o According to the CLT, as the sample size increases (e.g., 30 widgets or more), the
distribution of sample means will approach a normal distribution. The variability of these
sample means will become smaller, even though the original weight distribution was
uniform.
# widgets weight σ(n=3) σ(n=30)
1 12 1 0.803
2 10
3 11
4 12
5 10
6 11
7 10
8 12
9 10
10 11
11 12
12 11
13 11
14 11
15 10
16 12
17 11
18 10
19 12
20 11
21 12
22 12
23 12
24 12
25 10
26 11
27 11
28 12
29 11
30 10
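The two σ values in the table appear to be the sample standard deviations of the first 3 weights and of all 30 weights; the sketch below reproduces them with Python's standard `statistics.stdev`. Separately, it computes the standard error σ/√n that the CLT predicts for the mean of n widgets, assuming the stated uniform distribution between 10 g and 12 g, whose standard deviation is (12 − 10)/√12. The function name `standard_error` is illustrative.

```python
from math import sqrt
from statistics import stdev

weights = [12, 10, 11, 12, 10, 11, 10, 12, 10, 11,
           12, 11, 11, 11, 10, 12, 11, 10, 12, 11,
           12, 12, 12, 12, 10, 11, 11, 12, 11, 10]

# Sample standard deviations, as listed in the table
s_first3 = stdev(weights[:3])
s_all = stdev(weights)

# Standard deviation of a uniform(10, 12) population: (b - a) / sqrt(12)
sigma_uniform = (12 - 10) / sqrt(12)


def standard_error(n):
    """Standard deviation of the mean of n widgets, sigma / sqrt(n)."""
    return sigma_uniform / sqrt(n)


print(f"s (first 3) = {s_first3:.3f}, s (all 30) = {s_all:.3f}")
print(f"standard error: n=3 -> {standard_error(3):.3f} g, n=30 -> {standard_error(30):.3f} g")
```

The standard error shrinks by a factor of √10 when the sample grows from 3 to 30 widgets, which is the reduction in variability of the sample mean that the example describes.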

Example 6
 An electronics company manufactures resistors that have a mean resistance of 100 ohms
and a standard deviation of 10 ohms. Find the probability that a random sample of n = 25
resistors will have an average resistance of less than 95 ohms.
 Solution
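One way to work this example is to apply the CLT result stated earlier: the sample mean is approximately normal with mean μ and standard error σ/√n. The sketch below assumes Python's standard `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 10, 25

standard_error = sigma / sqrt(n)        # 10 / 5 = 2 ohms
z = (95 - mu) / standard_error          # (95 - 100) / 2 = -2.5

# Probability that the sample mean falls below 95 ohms
p_mean_below_95 = NormalDist(mu, standard_error).cdf(95)

print(f"standard error = {standard_error}, z = {z}")
print(f"P(sample mean < 95) = {p_mean_below_95:.4f}")   # 0.0062
```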


Introduction
 Suppose you are studying the heights of students at AURAK.
o You take a random sample from the population and establish a mean height of x̄ = 170 cm.
o The mean of x̄ = 170 cm is a point estimate of the population mean.
o A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated
with the estimate.
 What's missing is the degree of uncertainty in this single sample.
 Namely, if you take another sample of students, you will very likely
end up with a mean height that differs from 170 cm.

Confidence Interval (CI)
 CI is a statistical method used to estimate a population parameter (e.g., mean, proportion)
with a certain level of confidence.
 It provides a range of values that are likely to contain the true population parameter.
 The range is expressed as a lower and upper bound, often denoted by +/- a margin of error
around the sample parameter (e.g., sample mean for population mean).

Development Of The Confidence Interval
 We know that the sample mean x̄ is normally distributed with mean μ and variance σ²/n.
 The z-score is given by:
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
 A confidence interval estimate for μ is an interval of the form
L ≤ μ ≤ U
where the end-points L and U are computed from the sample data.

Determining the end-points L and U
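 Suppose that we can determine values of L and U such that the following probability
statement is true:
$$P(L \le \mu \le U) = 1 - \alpha$$
1 − α is called the confidence coefficient.
 Because $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ has a standard normal distribution, we can write
$$P\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$$
 This can be rewritten as
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$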

Guidelines for determining the Interval
• σ is known and n ≥ 30
 E = CONFIDENCE(α, σ, n) = $z_{\alpha/2}\,\sigma/\sqrt{n}$
 zα/2 = NORMSINV(1 − α/2)
• σ is unknown or n < 30
 E = CONFIDENCE.T(α, s, n) = $t_{\alpha/2}\,s/\sqrt{n}$
 tα/2 = T.INV(1 − α/2, d.f.), with d.f. = n − 1
 The mean of the population is given by μ = x̄ ± E

Example 7: Modelling Solar PV System
You are creating a simulation model of a solar PV system for a residential
house. For this purpose, the energy consumption of households is needed.
You collect a random sample of 25 households; the energy consumption
values are listed below.
1. Construct a 95% confidence interval for the monthly energy
consumption of a house.
2. Assume that you want to be on the safe side by sizing the solar PV system
to cover 70% of the population. What energy consumption should be
assumed for the house?
House #   Energy consumption
1 1572
2 1552
3 1431
4 1595
5 1500
6 1493
7 1449
8 1459
9 1506
10 1426
11 1515
12 1575
13 1524
14 1551
15 1432
16 1496
17 1586
18 1562
19 1508
20 1405
21 1533
22 1558
23 1464
24 1482
25 1402

Solution
P(L ≤ μ ≤ U) = 0.95
Quantity   Value          z          Cumulative probability
x̄          1503.040
s          58.08003099
L          1480.27        -1.9600    P(x < 1480.273) = 2.5%
U          1525.807       1.9600     P(x < 1525.807) = 97.5%
P(1480.273 < x < 1525.807) = 95%
Different form
n          25
x̄          1503.04
s          58.08003
z          1.959964
Ez         22.76695       L = 1480, U = 1526
Et         23.97426       L = 1479, U = 1527
Selecting the value of the energy consumption
x̄          1503.040
s          58.08003099
x          1509.13        0.5244     P(x < 1509.131) = 70.0%   (x = x̄ + z·s/√n)
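The confidence-interval part of this solution can be reproduced from the raw data. The sketch below assumes Python's standard `statistics` module and copies the 25 consumption values from the table. Because the standard library has no t-distribution, the t critical value for 24 degrees of freedom (2.0639, what T.INV.2T(0.05, 24) would return) is hard-coded from a t-table as the constant `T_CRIT_24`, which is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

consumption = [1572, 1552, 1431, 1595, 1500, 1493, 1449, 1459, 1506, 1426,
               1515, 1575, 1524, 1551, 1432, 1496, 1586, 1562, 1508, 1405,
               1533, 1558, 1464, 1482, 1402]

n = len(consumption)
x_bar = mean(consumption)
s = stdev(consumption)            # sample standard deviation (n - 1 in the denominator)
standard_error = s / sqrt(n)

# z-based margin of error at a 95% confidence level
z = NormalDist().inv_cdf(1 - 0.05 / 2)
e_z = z * standard_error

# t-based margin of error; critical value taken from a t-table (24 d.f., 95%)
T_CRIT_24 = 2.0639
e_t = T_CRIT_24 * standard_error

print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}")
print(f"z interval: {x_bar - e_z:.0f} to {x_bar + e_z:.0f} (E = {e_z:.2f})")
print(f"t interval: {x_bar - e_t:.0f} to {x_bar + e_t:.0f} (E = {e_t:.2f})")
```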

Error vs. Confidence
 There is a trade-off between acceptable error (or
required precision) and confidence.
o When you are required to be precise, you are less
confident.
o When greater error is allowed, you can be more confident.
 When in Doubt: If you're unsure about the population
standard deviation or the sample size is small, it's safer to use
a t-score test to ensure the robustness of the statistical
inference.
Z-distribution vs. t-distribution

Example 8: T-test
 Repeat Example 7 using CL = 98%.
 What is the error margin in this case?
 Repeat Example 7 using the t-test.
 What is the error margin in this case?
Results:
Case                 E            L       U
z-test, CL = 95%     Ez = 22.8    1480    1526
t-test, CL = 95%     Et = 24.0    1479    1527
z-test, CL = 98%     Ez = 27.0    1476    1530

Example 9: 3DMark TimeSpy
You have conducted a 3DMark TimeSpy test on your laptop and have compared the result with
other laptops with the same GPU and CPU specifications. You have collected the results of the
top 100 of these tests.
1. Calculate the mean, median, and mode of the 3DMark TimeSpy scores. What do these
measures tell you about the distribution of the scores?
2. Calculate the standard deviation of the scores. What does this tell you about the variability
of the scores?

Solution
1. the mean, median, and mode:
 The mean, median, and mode can tell us about the distribution of the scores.
o If the mean, median, and mode are all close to each other, it suggests that the scores are symmetrically distributed.
o If the mean is greater than the mode, it suggests that the scores are right-skewed (i.e., there are a few very high
scores).
o If the mean is less than the median, it suggests that the scores are left-skewed (i.e., there are a few very low scores).
2. The standard deviation, s, tells us about the variability of the scores.
o If s is small (s/x̄ < 10%), it means that most of the scores are close to the mean, indicating that the performance of
the laptops is quite consistent.
o If s is large (s/x̄ > 10%), it means that the scores are spread out over a wider range, indicating more variability in
performance.
 Since the results reveal that s/x̄ = 7%, one can say that the performance of the laptops is quite consistent.
s          Mean   Median   Mode
340.5119   5238   5114     5113

Example 10
A battery manufacturer wishes to investigate the cycle life of its batteries. A sample of 10
batteries, each used for 5000 cycles, revealed a sample mean of 12% degradation in battery performance
with a standard deviation of 2%. Construct a 95 percent confidence interval for the population
mean.
1. Would it be legitimate for the manufacturer to claim that after 5000 cycles the degradation in
battery performance is 10%?
2. Compute the confidence level required for the 10% degradation to become an accepted value
(i.e., to fall within the interval).

Solution
1. The value of 10% is not in the confidence interval. Hence, we conclude that the population
mean is unlikely to be 10%.
2. The required CL is 99%
Quantity   Value
n          10
x̄          12
s          2
CL         95%
α          5%
z          1.959963985
Ez         1.239590065   L = 11, U = 13
Et         1.43          L = 10.6, U = 13.4
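The two answers can be checked with the t-based margin of error E = t·s/√n. The sketch below hard-codes the t critical values for 9 degrees of freedom from a t-table (2.262 for a 95% level and 3.250 for a 99% level), since Python's standard library has no t-distribution; the names `T_CRIT`, `margin_t`, and `contains_10_percent` are illustrative.

```python
from math import sqrt

n, x_bar, s = 10, 12.0, 2.0   # sample size, mean degradation (%), standard deviation (%)

# Two-sided t critical values for n - 1 = 9 degrees of freedom, from a t-table
T_CRIT = {0.95: 2.262, 0.99: 3.250}


def margin_t(confidence_level):
    """t-based margin of error E = t * s / sqrt(n)."""
    return T_CRIT[confidence_level] * s / sqrt(n)


def contains_10_percent(confidence_level):
    """True if the claimed 10% degradation lies inside the confidence interval."""
    e = margin_t(confidence_level)
    return x_bar - e <= 10.0 <= x_bar + e


for cl in (0.95, 0.99):
    e = margin_t(cl)
    print(f"CL = {cl:.0%}: E = {e:.2f}, interval = {x_bar - e:.2f} to {x_bar + e:.2f}, "
          f"contains 10%: {contains_10_percent(cl)}")
```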

Confidence Level and Precision of Estimation
 Our choice of confidence level (CL) is essentially arbitrary.
o If we choose a confidence level of CL = 95%, the length of the confidence interval (CI) is $2(1.96\,\sigma/\sqrt{n}) = 3.92\,\sigma/\sqrt{n}$
o If we choose a higher confidence level, say 99%, the length of the CI is $2(2.58\,\sigma/\sqrt{n}) = 5.16\,\sigma/\sqrt{n}$
 The 99% interval is longer, which is why we are more confident with 99% than with 95%.

Procedures Of Making Assumption
 Select your sample, n readings
 Determine the mean value of the sample, x̄
 Determine the standard deviation of the sample, s
 Select your confidence level: 90%, 95%, or 99%
 Calculate the z-score (σ is known and n ≥ 30) or the t-score (σ is unknown or n < 30)
o As the sample size increases (exceeding 30), z-test and t-test results converge.
 Determine the margin of error, E
 Then, the mean value of the population, μ, is: μ = x̄ ± E

Choice Of Sample Size
 The length of a confidence interval is a measure of the precision of estimation.
 The precision is inversely related to the confidence level.
 This means that in using x̄ to estimate μ, the error |x̄ − μ| is less than or equal to E = zα/2 σ/√n
with the selected confidence.
 Three factors determine the size of a sample:
o The level of confidence selected.
o The maximum allowable error.
o The variation in the population.

Sample Size
 It is desirable to obtain a confidence interval that is short enough for decision-making purposes
and that also has adequate confidence.
 One way to achieve this is by choosing the sample size “n” to be large enough to give a CI of
specified length or precision with given confidence
 Given a confidence level and a maximum error of estimate (error margin), E, the minimum
sample size n needed to estimate the population mean is:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 \quad \text{or} \quad n = \left(\frac{t_{\alpha/2}\,s}{E}\right)^2$$
 Where:
o E is the allowable error
o z (or t) is the z-score (or t-score) corresponding to the selected level of confidence
o s is the standard deviation (of the sample)
o σ is the standard deviation (of the population)

Notice That:
 As the desired length of the interval “2E” decreases, the required sample size “n” increases
for a fixed value of “σ” and specified confidence.
 As “σ” increases, the required sample size n increases for a fixed desired length 2E and
specified confidence.
 As the confidence level decreases, the required sample size “n” decreases for fixed desired
length “2E” and standard deviation “σ”.
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 \quad \text{or} \quad n = \left(\frac{t_{\alpha/2}\,s}{E}\right)^2$$

Example 11
A random sample of 32 textbook prices is taken from a local college bookstore. The mean of
the sample is x̄ = 74.22, and the sample standard deviation is s = 23.44.
1. What is the error margin at a confidence level of 99%?
2. How many books must be included in your sample if you want to be 99% confident that the
sample mean is within $5 of the population mean?
3. Repeat question 2 assuming the standard deviation is s = 24.44.
4. Repeat question 2 using a 95% confidence level.

Solution
zc = 2.575, x̄ = 74.22, s = 23.44
$$n = \left(\frac{z_c\,s}{E}\right)^2 = \left(\frac{2.575 \times 23.44}{5}\right)^2 \approx 145.7$$
Always round up.
2. You should include at least 146 books in your sample.
Question   1        2        3        4
n          32       146      159      85
x̄          74.22    74.22    74.22    74.22
s          23.44    23.44    24.44    23.44
CL         99%      99%      99%      95%
α          1%       1%       1%       5%
z          2.58     2.58     2.58     1.96
Ez         10.67    5.00     4.993    4.983
A 4% increase in s (23.44 to 24.44) raises the required n by 9% (146 to 159).
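The four columns of this solution follow from two formulas, E = z·s/√n and n = (z·s/E)² rounded up. The sketch below assumes Python's standard `statistics.NormalDist` for the z-score; the function names `z_score`, `error_margin`, and `sample_size` are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided critical value z_{alpha/2} for the given confidence level."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)


def error_margin(confidence_level, s, n):
    """Margin of error E = z * s / sqrt(n)."""
    return z_score(confidence_level) * s / sqrt(n)


def sample_size(confidence_level, s, e):
    """Minimum n with margin of error at most e: n = (z * s / e)^2, rounded up."""
    return ceil((z_score(confidence_level) * s / e) ** 2)


print(round(error_margin(0.99, 23.44, 32), 2))   # question 1
print(sample_size(0.99, 23.44, 5))               # question 2
print(sample_size(0.99, 24.44, 5))               # question 3
print(sample_size(0.95, 23.44, 5))               # question 4
```

Rounding up with `ceil` matters: rounding 145.7 down to 145 would leave the margin of error slightly above the $5 target.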

Example 12: from my Students’ SDP
The students want to estimate the cooling requirement of a building. The U-value of the wall is
needed but unknown for the building under investigation. A sample of 9 buildings reveals the
following values of U.
1. What is the length of the CI at a confidence level of 95%?
2. What are the endpoints of the confidence interval of the U-value?
3. What is the required sample size to keep the error below 3%, with CL = 95%?
Sample   U-value (W/m2.K)
1 1.80
2 1.85
3 1.90
4 2.00
5 2.05
6 2.10
7 2.25
8 2.30
9 2.40

Solution
n                     9
mean                  2.07
Standard deviation    0.2093
CL                    95.00%
α                     5.00%
                      t-based    z-based
Score                 2.306      1.9600
E                     0.1609     0.1368
E / mean              7.8%       6.6%
Q1: CI length         0.322      0.27
Q2: Umin              1.911      1.935
    Umax              2.233      2.209
P(x1 ≤ x ≤ x2)        95.0%
Q3: Required E (%)    3%
    Target E          0.062167 W/m2.K
    Required n        61         44
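The table above can be rebuilt from the nine U-values. The sketch below assumes Python's standard `statistics` module; the t-score 2.306 (8 degrees of freedom, 95%) is taken from the solution table rather than computed, because the standard library has no t-distribution. The names `T_SCORE` and `required_n` are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev

u_values = [1.80, 1.85, 1.90, 2.00, 2.05, 2.10, 2.25, 2.30, 2.40]   # W/m2.K

n = len(u_values)
u_mean = mean(u_values)
s = stdev(u_values)

T_SCORE = 2.306                      # from the solution table (8 d.f., CL = 95%)
z_value = NormalDist().inv_cdf(0.975)

# Q1 and Q2: margin of error, interval length, and endpoints (t-based)
e_t = T_SCORE * s / sqrt(n)
ci_length = 2 * e_t
u_min, u_max = u_mean - e_t, u_mean + e_t

# Q3: sample size needed for an error of 3% of the mean
target_e = 0.03 * u_mean


def required_n(score):
    """Minimum sample size n = (score * s / E)^2, rounded up."""
    return ceil((score * s / target_e) ** 2)


print(f"mean = {u_mean:.3f}, s = {s:.4f}, E_t = {e_t:.4f}")
print(f"CI length = {ci_length:.3f}, Umin = {u_min:.3f}, Umax = {u_max:.3f}")
print(f"required n: t-based = {required_n(T_SCORE)}, z-based = {required_n(z_value)}")
```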

Example 13
The plan is to design a PV system to supply the required energy for mosques. Therefore, the energy
consumption is required. However, the energy consumption of the mosque under study is not
known. The area and volume of the mosque are 250 m2 and 750 m3, respectively.
Based on data collected from other mosques, and using a confidence level of 95%:
1. How many mosques need to be included in the survey to estimate the average energy consumption
with an error of 10%?
2. What is the interval of the energy consumption?
Area        Volume       Total Annual Current   Total Annual Power
                         Consumption (A)        Consumption (kW)
867.6       6940.8       170,068                37415
198         1584         28299                  6226
216         1512         54718                  12038
375.24      1200.768     116021                 25525
453.3136    2946.5384    83707                  18416
168.4535    1094.94775   85344                  18776
386.1685    3282.43225   152925                 33644
118.4625    770.00625    67142                  14771
342.21      2908.785     81040                  17829
330         2310         50539                  11119

Solution
Sample   Energy (kWh/m2)
1        43.1
2        31.4
3        55.7
4        68.0
5        40.6
6        111.5
7        87.1
8        124.7
9        52.1
10       33.7
Quantity                 Value
n                        10.00
mean                     64.80
Standard deviation       31.0658
Observation              2000.000
U ≤ Observation          100.0%
CL                       90.00%
α                        10.00%
t-score                  1.833      (z-score 1.645)
Error, CL = 0.90         18.0083    (27.8% of the mean; z-based 16.15882)
EAmin                    46.793
EAmax                    82.809
Target Error (%)         10%
Target Error, CL = 0.90  6.4801
z-score                  1.645
Required n               63
Exercises: From the Book 8-14 & 8-16
 The life in hours of a 75-watt light bulb is known to be normally distributed with σ = 25
hours. A random sample of 20 bulbs has a mean life of x̄ = 1014 hours.
(a) Construct a 95% two-sided confidence interval on the mean life.
(b) Suppose that we wanted the error in estimating the mean life from the two-sided confidence
interval to be five hours at 95% confidence. What sample size should be used?

Solution
Quantity        Value
n               20
x̄               1014
σ               25
CL              95%
α               5%
z               1.96
Ez              10.96
CI length       21.913
L               1003.043
U               1024.957
Target Error    5.0000 hours
z-score         1.960
Required n      97
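The values in the table follow from the z-based formulas used throughout this lecture, since σ is known. As a worked sketch with z rounded to 1.96:

```latex
% (a) Margin of error and 95% two-sided confidence interval
E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{25}{\sqrt{20}} \approx 10.96
\qquad\Rightarrow\qquad
1014 - 10.96 \le \mu \le 1014 + 10.96
\;\;\Rightarrow\;\;
1003.04 \le \mu \le 1024.96

% (b) Sample size for an error of E = 5 hours at 95% confidence (always round up)
n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2
  = \left(\frac{1.96 \times 25}{5}\right)^2 = 96.04
\qquad\Rightarrow\qquad n = 97
```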
