Introduction to Descriptive & Predictive Analytics

CS5122 DESCRIPTIVE & PREDICTIVE ANALYTICS
DILUM BANDARA
DILUM.BANDARA@UOM.LK

Data Analytics, Engineering, &
Science
 Buzzwords that every decision maker either wants to
or is forced to look into
 Data-driven decision making is hard
 Needs right data, fitting tools, skilled analysts, & a
supportive environment
 Data analysts
 Domain experts
 Tool experts
Useful Insights

Objectives
 Given a dataset, to train you to
 Ask Right Questions
 Identify Right Tool(s)
 Derive Right Answers/Insights
 We take a data-driven approach
 First try to derive a set of questions based on data
available for the analysis
 Explore potential techniques to support answering those
questions while using available data
 Derive right answers by interpreting processed data &
visualizations

Source: https://moz.com/blog/when-it-comes-to-analytics-are-you-doing-enough

Descriptive, Predictive, &
Prescriptive Analytics
 Descriptive Analytics
 Use data aggregation & data mining techniques to provide
insight into the past & answer: “What has happened?”
 Predictive Analytics
 Use statistical models & forecasting techniques to
understand the future & answer: “What could happen?”
 Prescriptive Analytics
 Use optimization & simulation algorithms to advise on
possible outcomes & answer: “What should we do?”
Source: https://halobi.com/2014/10/descriptive-predictive-and-prescriptive-analytics-explained/

Example
Descriptive Analytics
◦ Wal-Mart found that on Friday afternoons, young American males who buy
diapers also tend to buy beer
◦ Potential sales of each item can increase, if they are kept close to each other
Predictive analytics
◦ Demand for diapers could increase in mid to late summer as more babies are
expected to be born in the USA.
◦ Make sure expectant mothers are informed of their diaper choices through
advertising, & that production & supply are ready to meet the extra demand
◦ Increased sales
Prescriptive analytics
◦ When to start advertising & when to give discounts?
◦ Help us understand the most effective dates & percentage of discounts that not
only increase sales but also profit

Tools can help
reduce difficulty

Review of Basic
Statistics & Probability

Populations & Samples
 Population
 All items of interest for a particular decision or investigation
 E.g., all Gmail users, all subscribers to Netflix
 Sample
 A subset of the population
 E.g., all Google Apps for Education users, list of customers
who rented a comedy from Netflix in the past year
 Purpose of sampling is to obtain sufficient
information to draw a valid inference about a
population

Sample Space & Events
 Sample Space
 All possible outcomes of an experiment
 E.g., flipping a coin {H, T}
 E.g., rolling a die {1, 2, 3, 4, 5, 6}
 Event
 Any subset of the sample space
 E.g., {H}, {T}, {H, T}, {1}, or {2, 4, 6}
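
As a rough illustration of the definitions above, the sketch below models the die-roll sample space as a Python set and an event as a subset of it. The `probability` helper is hypothetical and assumes all outcomes are equally likely (a fair die), which the slide does not state.

```python
from fractions import Fraction

# Sample space for rolling a six-sided die (assumed fair).
sample_space = {1, 2, 3, 4, 5, 6}

# An event is any subset of the sample space, e.g. "roll an even number".
even = {2, 4, 6}


def probability(event, space):
    """Probability of an event when every outcome is equally likely."""
    if not event <= space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(space))


p_even = probability(even, sample_space)
print(p_even)  # 1/2
```

Using `Fraction` keeps the result exact, which makes it easy to compare against hand calculations.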

Random Variable
 Variable whose value is subject to variations due
to chance
 Discrete random variables
 Toss a coin, roll a die
 Continuous random variables
 Stock value, voltage of a sensor

Measures of Location
 Mean
◦ Population mean: μ = (Σ x_i) / N
◦ Sample mean: x̄ = (Σ x_i) / n
 Median
◦ Middle value of data when sorted from least to greatest
 Mode
◦ Observation that occurs most often
 Midrange
◦ Average of greatest & least values = (max + min)/2
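
The four measures of location can be sketched with Python's standard `statistics` module. The data values are hypothetical, chosen only for illustration; the midrange is computed as the average of the greatest and least values.

```python
import statistics

# Hypothetical observations, for illustration only.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # observation that occurs most often
midrange = (max(data) + min(data)) / 2  # average of greatest and least values

print(mean, median, mode, midrange)
```

With an even number of observations, `statistics.median` averages the two middle values.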

Probability Distribution/Mass
Function

Measures of Dispersion
 Dispersion
 Refers to the degree of variation in data
 Range
 Difference between max & min values
 Interquartile Range (IQR)
 Difference between 3rd and 1st quartiles
 Variance
 Average of squared deviations from the mean
 Standard Deviation (STD)
 Square root of the variance
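
A minimal sketch of these dispersion measures, again using the standard `statistics` module on hypothetical data. Note that several quartile conventions exist; `statistics.quantiles` uses its default "exclusive" method here, so other tools may report a slightly different IQR.

```python
import statistics

# Hypothetical observations, for illustration only.
data = [4, 8, 15, 16, 23, 42]

data_range = max(data) - min(data)

# Quartiles with the module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Population versions divide by N; sample versions divide by n - 1.
pop_var = statistics.pvariance(data)
sample_var = statistics.variance(data)
pop_std = statistics.pstdev(data)
sample_std = statistics.stdev(data)

print(data_range, iqr, pop_var, sample_var, pop_std, sample_std)
```

The sample variance is always at least as large as the population variance computed on the same values, because of the smaller divisor.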

Measures of Dispersion (Cont.)
 z-score
Standard score: the number of STDs an observation lies
above/below the mean, z = (x − mean)/STD
For many data sets encountered in practice (roughly bell-shaped ones):
 ~68% of observations fall within 1 STD of mean
 ~95% fall within 2 STDs
 ~99.7% fall within 3 STDs

 Coefficient of Variation (CV)
 A relative measure of dispersion: CV = STD / mean
 Return to risk = 1/CV

Exercise
Mean & STD of Closing Stock Prices:
 Intel (INTC): Mean = $18.81, STD = $0.50
 General Electric (GE): Mean = $16.19, STD = $0.35
Which stock carries the higher investment risk?
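
One way to approach the exercise is to compare relative rather than absolute dispersion. The sketch below takes CV as STD divided by mean and uses only the figures given above; the function name is hypothetical.

```python
def coefficient_of_variation(mean, std):
    """Relative dispersion: standard deviation as a fraction of the mean."""
    return std / mean


# (mean, STD) of closing prices, taken from the exercise.
stocks = {"INTC": (18.81, 0.50), "GE": (16.19, 0.35)}

cv = {name: coefficient_of_variation(m, s) for name, (m, s) in stocks.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.4f}, return to risk = {1 / value:.1f}")
```

Under this measure Intel shows the larger relative variation (roughly 2.7% of its mean versus about 2.2% for GE), so it would be read as the riskier of the two.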

 Percentiles
 Value below which a given percentage of the observations
in a group falls
Source: www.mathsisfun.com/data/percentiles.html
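
The definition can be read in two directions: from a value to a percentage, and from a percentage to a value. The sketch below shows both on hypothetical test scores. `percentile_value` assumes the nearest-rank convention, which is only one of several in common use; both function names are illustrative.

```python
import math


def percentile_rank(values, x):
    """Percentage of observations strictly below x."""
    below = sum(v < x for v in values)
    return 100 * below / len(values)


def percentile_value(values, p):
    """Nearest-rank percentile: smallest observation with at least p% of
    the data at or below it (p is an integer from 1 to 100)."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


# Hypothetical scores, for illustration only.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(scores, 80))   # 50.0
print(percentile_value(scores, 90))  # 95
```

Interpolating conventions (as used by many spreadsheet and statistics packages) can return values that are not in the data set.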

Measures of Shape
 Skewness
 Describes lack of symmetry
 Coefficient of Skewness
CS < 0 for left-skewed data
CS > 0 for right-skewed data
|CS| > 1 suggests a high degree of skewness
0.5 ≤ |CS| ≤ 1 suggests moderate skewness
|CS| < 0.5 suggests relative symmetry
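
A minimal sketch of the coefficient of skewness, assuming the common moment-based population definition CS = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. The helper names and data are hypothetical.

```python
def coefficient_of_skewness(values):
    """Moment-based skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


def describe_skewness(cs):
    """Apply the rule-of-thumb thresholds listed above."""
    if abs(cs) > 1:
        return "high"
    if abs(cs) >= 0.5:
        return "moderate"
    return "relatively symmetric"


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]  # long tail to the right

for data in (symmetric, right_skewed):
    cs = coefficient_of_skewness(data)
    print(data, round(cs, 3), describe_skewness(cs))
```

Sample-adjusted versions of the coefficient apply a small-sample correction factor and give slightly different numbers.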

Measures of Shape (Cont.)
 Kurtosis
◦ Refers to peakedness or flatness
◦ Coefficient of Kurtosis
 CK < 3 indicates data is somewhat flat with a wide degree of dispersion
 CK > 3 indicates data is somewhat peaked with less dispersion
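
The coefficient of kurtosis can be sketched in the same way, assuming the moment-based population definition CK = m4 / m2^2, under which a normal distribution scores 3. The function name and both data sets are hypothetical.

```python
def coefficient_of_kurtosis(values):
    """Moment-based kurtosis: fourth central moment / (variance ** 2)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2


flat = [1, 2, 3, 4, 5]                      # evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]    # mass at the centre, thin tails

print(coefficient_of_kurtosis(flat))    # below 3
print(coefficient_of_kurtosis(peaked))  # above 3
```

Some libraries report excess kurtosis (CK minus 3) by default, so it is worth checking which convention a tool uses before applying the threshold of 3.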

Measures of Association
 Covariance
◦ Measure of linear association between 2 variables, X & Y
Population: cov(X, Y) = Σ (x_i − μ_X)(y_i − μ_Y) / N
Sample: cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)
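
The population and sample covariance differ only in the divisor (N versus n − 1). The sketch below implements both in one hypothetical function; the paired data are made up for illustration.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired observations.

    sample=True divides by n - 1; sample=False divides by N (population).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2 and sample:
        raise ValueError("sample covariance needs at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical paired data: hours studied vs. quiz score.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

print(covariance(hours, scores, sample=False))  # 4.0
print(covariance(hours, scores, sample=True))   # 5.0
```

A positive value indicates that the two variables tend to move together; the magnitude depends on the units of both variables.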

Measures of Association (Cont.)
 Correlation
◦ Measure of linear association between 2 variables, X & Y
◦ Correlation Coefficient
◦ Doesn’t depend upon units of measurement (unlike
covariance)
Population: ρ_XY = cov(X, Y) / (σ_X σ_Y)
Sample: r_XY = cov(X, Y) / (s_X s_Y)
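
To illustrate that the correlation coefficient does not depend on units, the sketch below computes it for the same hypothetical heights expressed in metres and in centimetres. The function is a plain implementation of the Pearson coefficient; the divisor (N or n − 1) cancels out, so population and sample versions coincide.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: height and weight of five people.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
height_cm = [h * 100 for h in height_m]
weight_kg = [50, 58, 63, 72, 80]

r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)
print(r_m, r_cm)  # the two values agree up to rounding error
```

The covariance of the same data would be 100 times larger in centimetres than in metres, which is why correlation is usually easier to interpret.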

Outliers
 Mean & range are sensitive to outliers
 No standard definition of what constitutes an outlier
 Possible methods to identify outliers are:
 z-scores greater than +3 or less than -3
 extreme outliers are more than 3*IQR to the left of Q1
or right of Q3
 mild outliers are between 1.5*IQR and 3*IQR to the
left of Q1 or right of Q3
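
The IQR-based rules can be sketched as follows. The function name and data are hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method; other quartile conventions can shift the fences slightly.

```python
import statistics


def classify_outliers(values):
    """Split observations into mild and extreme outliers using IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        # Distance outside the Q1..Q3 interval (negative if inside it).
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme


# Hypothetical observations, for illustration only.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 22, 40]

mild, extreme = classify_outliers(data)
print("mild:", mild)
print("extreme:", extreme)
```

Because quartiles are themselves resistant to extreme values, this rule is less affected by the outliers it is trying to detect than the z-score rule, whose mean and STD are both pulled by them.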