2. Data Analytics, Engineering, & Science
Buzzwords that every decision maker either wants
to or is forced to look into
Data-driven decision making is hard
Needs the right data, fitting tools, skilled analysts, & a
supportive environment
Data analysts
Domain experts
Tool experts
Useful Insights
3. Objectives
Given a dataset, the goal is to train you to
Ask Right Questions
Identify Right Tool(s)
Derive Right Answers/Insights
We take a data-driven approach
First, derive a set of questions based on the data
available for the analysis
Explore potential techniques that help answer those
questions using the available data
Derive the right answers by interpreting processed data &
visualizations
5. Descriptive, Predictive, &
Prescriptive Analytics
Descriptive Analytics
Use data aggregation & data mining techniques to provide
insight into the past & answer: “What has happened?”
Predictive Analytics
Use statistical models & forecasting techniques to
understand the future & answer: “What could happen?”
Prescriptive Analytics
Use optimization & simulation algorithms to advise on
possible outcomes & answer: “What should we do?”
Source: https://halobi.com/2014/10/descriptive-predictive-and-prescriptive-analytics-explained/
6. Example
Descriptive Analytics
◦ Wal-Mart found that on Friday afternoons, young American males who buy
diapers also tend to buy beer
◦ Potential sales of each item can increase if they are kept close to each other
Predictive analytics
◦ Demand for diapers could increase in mid to late summer as more babies are
expected to be born in the USA.
◦ Make sure expectant mothers are informed of their diaper choices through
advertising, & production & supply are ready to meet the extra demand
◦ Increased sales
Prescriptive analytics
◦ When to start advertising & when to give discounts?
◦ Help us understand the most effective dates & percentage of discounts that not
only increase sales but also profit
10. Populations & Samples
Population
All items of interest for a particular decision or investigation
E.g., all Gmail users, all subscribers to Netflix
Sample
A subset of the population
E.g., all Google Apps for Education users, list of customers
who rented a comedy from Netflix in the past year
Purpose of sampling is to obtain sufficient
information to draw a valid inference about a
population
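The idea of inferring a population quantity from a sample can be sketched as follows. The population here is synthetic (hypothetical weekly viewing hours drawn from a normal distribution), and simple random sampling is an assumption; the slides do not prescribe a sampling scheme.

```python
import random
import statistics

# Hypothetical population: weekly viewing hours for 10,000 subscribers (synthetic data).
rng = random.Random(42)
population = [rng.gauss(10, 3) for _ in range(10_000)]

# Simple random sample of 200 items, drawn without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# The sample mean serves as an estimate of the (usually unknown) population mean.
print(f"population mean = {population_mean:.2f}, sample mean = {sample_mean:.2f}")
```

A larger sample would typically bring the estimate closer to the population value.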
11. Sample Space & Events
Sample Space
All possible outcomes of an experiment
E.g., flipping a coin {H, T}
E.g., rolling a die {1, 2, 3, 4, 5, 6}
Event
Any subset of the sample space
E.g., {H}, {T}, {H, T}, {1}, or {2, 4, 6}
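Sample spaces and events map naturally onto sets. The sketch below assumes a fair die, so that all outcomes are equally likely; the helper name `probability` is illustrative and not from the slides.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
even = {2, 4, 6}           # event: the roll is even

def probability(event, sample_space):
    """Probability of an event, assuming all outcomes are equally likely."""
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

print(probability(even, die))   # 1/2
```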
12. Random Variable
Variable whose value is subject to variations due
to chance
Discrete random variables
Toss a coin, roll a die
Continuous random variables
Stock value, voltage of a sensor
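As a rough illustration of the difference, discrete and continuous random variables can be simulated with Python's standard library. The 0 to 5 volt sensor range is a hypothetical choice made only for this sketch.

```python
import random

rng = random.Random(0)

# Discrete random variables: a countable set of possible values.
coin = rng.choice(["H", "T"])
die = rng.randint(1, 6)            # integers 1..6, both ends included

# Continuous random variable: any real value in an interval.
voltage = rng.uniform(0.0, 5.0)    # hypothetical sensor range in volts

print(coin, die, round(voltage, 3))
```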
13. Measures of Location
Mean
◦ Population mean
◦ Sample mean
Median
◦ Middle value of data when sorted from least to greatest
Mode
◦ Observation that occurs most often
Midrange
◦ Average of greatest & least values = (max + min)/2
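The four measures can be computed with Python's `statistics` module. The data values are hypothetical, picked so the results are easy to check by hand.

```python
import statistics

# Hypothetical observations.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle of the sorted data (average of 3 and 5 here)
mode = statistics.mode(data)            # most frequent observation
midrange = (max(data) + min(data)) / 2  # average of the greatest and least values

print(mean, median, mode, midrange)
```

Here the mean is 5, the median 4, the mode 3, and the midrange 6; the large value 10 pulls the mean and midrange above the median.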
15. Measures of Dispersion
Dispersion
Refers to the degree of variation in data
Range
Difference between max & min value
Interquartile Range (IQR)
Difference between 3rd and 1st quartiles
Variance
Average of squared deviations from the mean
Standard Deviation (STD)
Square root of the variance
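A minimal sketch of these measures, using hypothetical data. It uses the population forms (`pvariance`, `pstdev`), which match "average of squared deviations"; for a sample, `statistics.variance` divides by n - 1 instead.

```python
import statistics

# Hypothetical observations, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)

# statistics.quantiles uses the "exclusive" method by default; other tools
# may use a different quartile convention and report a slightly different IQR.
q1, _q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

variance = statistics.pvariance(data)   # average squared deviation from the mean
std = statistics.pstdev(data)           # square root of the variance

print(data_range, iqr, variance, std)
```

For this data the mean is 5, the squared deviations sum to 32, so the variance is 32/8 = 4 and the standard deviation is 2.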
16. Measures of Dispersion (Cont.)
z-score
Standard score: the number of STDs an observation lies
above/below the mean
For many data sets encountered in practice (roughly bell-shaped ones):
~68% of observations fall within 1 STD of mean
~95% fall within 2 STDs
~99.7% fall within 3 STDs
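The z-score and the percentages above can be checked with a short simulation. The sketch assumes normally distributed synthetic data; real data sets will match these percentages only approximately.

```python
import random
import statistics

def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

print(z_score(9, mean=5, std=2))   # 2.0

# Synthetic, normally distributed data (an assumption made for this sketch).
rng = random.Random(1)
values = [rng.gauss(0, 1) for _ in range(20_000)]
mu = statistics.fmean(values)
sigma = statistics.pstdev(values)

# Share of observations whose z-score is within k standard deviations.
shares = {}
for k in (1, 2, 3):
    shares[k] = sum(abs(z_score(v, mu, sigma)) <= k for v in values) / len(values)
    print(f"within {k} STD: {shares[k]:.3f}")
```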
17. Measures of Dispersion (Cont.)
Coefficient of Variation
A relative measure of dispersion: CV = STD / mean
Return to risk = 1/CV
18. Exercise
Mean & STD of Closing Stock Prices:
Intel (INTC): Mean = $18.81, STD = $0.50
General Electric (GE): Mean = $16.19, STD = $0.35
Which stock has higher risk of investment?
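One way to approach the exercise is to compare coefficients of variation, assuming the usual definition CV = STD / mean. This is a sketch of that comparison, not necessarily the intended model answer.

```python
def coefficient_of_variation(mean, std):
    """Relative dispersion: standard deviation expressed as a fraction of the mean."""
    return std / mean

cv_intel = coefficient_of_variation(mean=18.81, std=0.50)
cv_ge = coefficient_of_variation(mean=16.19, std=0.35)

print(f"INTC: CV = {cv_intel:.4f}")   # 0.0266
print(f"GE:   CV = {cv_ge:.4f}")      # 0.0216
```

Under this measure Intel shows more variation relative to its mean price, so it would be read as the riskier of the two.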
19. Measures of Dispersion (Cont.)
Percentiles
Value below which a given percentage of the observations
in a group falls
Source: www.mathsisfun.com/data/percentiles.html
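Several conventions exist for computing percentiles. The sketch below uses the nearest-rank method, which is one common choice and an assumption here; the function name and the scores are hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based position in the sorted data
    return ordered[rank - 1]

# Hypothetical test scores.
scores = [55, 60, 62, 70, 71, 75, 80, 85, 90, 95]

print(percentile_nearest_rank(scores, 80))   # 85
print(percentile_nearest_rank(scores, 50))   # 71
```

Interpolating methods, such as the defaults in many statistics packages, can return values that lie between two observations.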
20. Measures of Shape
Skewness
Describes lack of symmetry
Coefficient of Skewness
CS < 0 for left-skewed data
CS > 0 for right-skewed data
|CS| > 1 suggests high degree of skewness
0.5 ≤ |CS| ≤ 1 suggests moderate skewness
|CS| < 0.5 suggests relative symmetry
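The formula for CS is not shown in the text. A minimal sketch, assuming the common moment-based definition (average cubed deviation divided by the cube of the population standard deviation):

```python
import statistics

def coefficient_of_skewness(values):
    """Moment-based skewness: mean cubed deviation / (population STD) ** 3."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    third_moment = sum((v - mu) ** 3 for v in values) / len(values)
    return third_moment / sigma ** 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, symmetric around 3
right_skewed = [1, 1, 1, 2, 10]    # hypothetical, long tail to the right

print(coefficient_of_skewness(symmetric))
print(coefficient_of_skewness(right_skewed))
```

The symmetric data gives a coefficient of zero, while the long right tail produces a positive value.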
21. Measures of Shape (Cont.)
Kurtosis
◦ Refers to peakedness or flatness
◦ Coefficient of Kurtosis
CK < 3 indicates data is somewhat flat with a wide degree of dispersion
CK > 3 indicates data is somewhat peaked with less dispersion
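The formula for CK is likewise not shown. The sketch below assumes the moment-based definition (average fourth-power deviation divided by the fourth power of the population standard deviation), for which normally distributed data gives a value near 3.

```python
import statistics

def coefficient_of_kurtosis(values):
    """Moment-based kurtosis: mean fourth-power deviation / (population STD) ** 4."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    fourth_moment = sum((v - mu) ** 4 for v in values) / len(values)
    return fourth_moment / sigma ** 4

flat = [1, 2, 3, 4, 5]                       # hypothetical, evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]     # hypothetical, most values at the centre

print(coefficient_of_kurtosis(flat))     # about 1.7
print(coefficient_of_kurtosis(peaked))   # about 5.0
```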
22. Measures of Association
Covariance
◦ Measure of linear association between 2 variables, X & Y
Population
Sample
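The population and sample versions differ only in the divisor (N versus n - 1). A short sketch with hypothetical paired data, assuming these standard definitions:

```python
import statistics

def covariance(xs, ys, sample=False):
    """Average product of deviations from the means; divides by n - 1 for a sample."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1 if sample else len(xs))

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(covariance(x, y))                # population covariance: 4.0
print(covariance(x, y, sample=True))   # sample covariance: 5.0
```

A positive value indicates that the two variables tend to move in the same direction; its magnitude depends on the units of X and Y.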
23. Measures of Association
Correlation
◦ Measure of linear association between 2 variables, X & Y
◦ Correlation Coefficient
◦ Doesn’t depend upon units of measurement (unlike
covariance)
Population
Sample
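The correlation coefficient is the covariance divided by the product of the two standard deviations, so the units cancel. The sketch below, with hypothetical data, also rescales X to show that the result does not change.

```python
import math
import statistics

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The 1/N (or 1/(n - 1)) factors cancel, so population and sample forms agree.
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = correlation(x, y)
r_rescaled = correlation([100 * v for v in x], y)   # e.g. metres converted to centimetres

print(r, r_rescaled)   # both about 0.8
```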
25. Outliers
Mean & range are sensitive to outliers
No standard definition of what constitutes an outlier
Possible methods to identify outliers are:
z-scores greater than +3 or less than -3
Extreme outliers are more than 3*IQR below Q1
or above Q3
Mild outliers are between 1.5*IQR and 3*IQR
below Q1 or above Q3
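The IQR-based rules can be sketched as a small classifier. The function name and data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; a different quartile convention could move borderline points across a fence.

```python
import statistics

def iqr_outliers(values):
    """Split values into mild and extreme outliers using the 1.5*IQR and 3*IQR rules."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in sorted(values):
        distance = max(q1 - v, v - q3)   # how far v lies outside [Q1, Q3]
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            mild.append(v)
    return mild, extreme

# Hypothetical observations with one low and one high suspicious value.
data = [2, 10, 11, 12, 12, 13, 13, 14, 15, 19, 40]

mild, extreme = iqr_outliers(data)
print("mild:", mild)        # [2]
print("extreme:", extreme)  # [40]
```

Here Q1 = 11 and Q3 = 15, so IQR = 4: the value 2 lies 9 below Q1 (between 6 and 12), and 40 lies 25 above Q3 (more than 12).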