22UIT303-DATA SCIENCE
Dr. N.G.P Institute of Technology
Coimbatore- 48
Department of Information Technology
COURSE OBJECTIVES:
• To understand techniques and processes of data science.
• Learn and describe the relationship between data.
• Outline an overview of exploratory data analysis.
• Utilize the Python libraries for Data Wrangling.
• Interpret data using visualization techniques in Python.
UNIT I DATA SCIENCE AND STATISTICS
• Data Science:
• Benefits and uses.
• Applications
• Facets of data.
• Data Science Process:
• Overview.
• Defining research goals.
• Retrieving Data.
• Data Preparation.
• Exploratory Data Analysis
• Build the Model.
• Presenting Findings and Building
Applications.
• Statistics
• Basic Statistical Descriptions
of Data
• Types of Data
• Describing Data with Tables
and Graphs
• Describing Data with
Averages.
UNIT II DESCRIBING DATA & RELATIONSHIP
• Correlation
• Scatter Plots
• Correlation Coefficient for
Quantitative Data
• Computational formula for
Correlation Coefficient
• Regression
• Regression Line
• Least Squares Regression Line
• Standard Error of Estimate
• Interpretation of r²
• Multiple Regression Equations
• Regression Towards the Mean
• Logistic Regression
• Estimating Parameters.
UNIT III EXPLORATORY DATA ANALYSIS
• EDA fundamentals.
• Comparing EDA with classical and
Bayesian analysis.
• Software tools for EDA.
• Visual Aids for EDA.
• Data transformation techniques.
• Merging databases, Reshaping and
Pivoting, Grouping Datasets
• Data Aggregation
• Pivot Tables and Cross Tabulations.
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING
• Basics of Numpy arrays
• Aggregations
• Computations on Arrays
• Comparisons, Masks, Boolean
logic
• Fancy Indexing
• Structured Arrays
• Data manipulation with Pandas
• Data Indexing and Selection.
• Operating on Data.
• Missing Data.
• Hierarchical Indexing.
UNIT V DATA VISUALIZATION
• Importing Matplotlib
• Simple Line Plots
• Simple Scatter Plots
• Visualizing Errors
• Density and Contour Plots
• Histograms
• Legends
• Colors
• Subplots
• Text and Annotation
• Customization
• Three Dimensional Plotting
• Geographic Data with Basemap
• Visualization with Seaborn.
COURSE OUTCOMES:
CO1: Understand the data science process and different types of data description.
CO2: Analyze the relationship between data using statistics.
CO3: Perform fundamental exploratory data analysis on a dataset.
CO4: Handle data using the primary tools used for data science in Python.
CO5: Apply visualization libraries in Python to interpret and explore data.
Books
TEXTBOOKS:
1. Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016.
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
REFERENCES:
3. Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data Science”,
CRC Press, 2022.
4. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
5. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
6. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data Visualization:
Foundations, Techniques, and Applications”, 2nd Edition, CRC press, 2015.
UNIT I DATA SCIENCE AND STATISTICS
DATA
• The quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in
the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
BIG DATA
• Big Data is a collection of data that is huge in volume, yet growing
exponentially with time.
• It is data so large and complex that none of the traditional data
management tools can store or process it efficiently.
• In short, big data is ordinary data, but of enormous size.
• Do you know? 10²¹ bytes equal one zettabyte; in other words, one billion
terabytes form a zettabyte.
EXAMPLE OF BIG DATA
• The New York Stock Exchange is an example of Big Data
that generates about one terabyte of new trade data per day.
SOCIAL MEDIA
• Statistics show that 500+ terabytes of new data get ingested into the databases
of the social media site Facebook every day.
• This data is mainly generated through photo and video uploads, message
exchanges, comments, etc.
JET ENGINE
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
• With many thousands of flights per day, data generation reaches many petabytes.
Big Data Data Science
• The widely adopted RDBMS has long been regarded as a one-size-fits-all solution.
• Data science involves using methods to analyze massive amounts of data and
extract the knowledge it contains.
• The relationship between big data and data science is like the relationship
between crude oil and an oil refinery.
• Data science and big data evolved from statistics and traditional data
management.
• The characteristics of big data are often referred to as the three Vs:
• Volume—How much data is there?
• Variety—How diverse are different types of data?
• Velocity—At what speed is new data generated?
These characteristics are complemented with a fourth V,
• Veracity: How accurate is the data?
• These four properties make big data different from the data found in traditional
data management tools.
• Consequently, there are challenges in almost every aspect:
• Data capture,
• Curation,
• Storage,
• Search,
• Sharing,
• Transfer,
• Visualization.
• In addition, big data calls for specialized techniques to extract the insights.
StatisticsData Science
• Data science is an evolutionary extension of statistics capable of
dealing with the massive amounts of data produced today.
• It adds methods from computer science to the repertoire of statistics.
Why the name “data scientist”?
• The main things that set a data scientist apart from a statistician are the ability to
work with big data and experience in machine learning, computing, and algorithm
building.
• Their tools tend to differ too, with data scientist job descriptions more frequently
mentioning the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others.
• Python is a great language for data science because it has many data science
libraries available, and it’s widely supported by specialized software.
Benefits and Uses of Data Science and Big Data
• Data science and big data are used almost everywhere in both
commercial and noncommercial settings.
• The number of use cases is vast.
TYPES OF BIG DATA
1.Structured
2.Unstructured
3.Semi-structured
Facets of data
• Structured.
• Unstructured.
• Natural language.
• Machine-generated.
• Graph-based.
• Audio, video, and images.
• Streaming.
STRUCTURED
• Any data that can be stored, accessed and processed in the form of fixed format is
termed as a ‘structured’ data.
• Over time, talent in computer science has achieved great success in
developing techniques for working with such data (where the format is
well known in advance) and in deriving value from it.
• However, nowadays we are foreseeing issues as such data grows to a
huge extent, with typical sizes in the range of multiple zettabytes.
STRUCTURED
• Data stored in a relational database management system is
one example of ‘structured’ data.
• SQL, or Structured Query Language, is the preferred way to manage
and query data that resides in databases
STRUCTURED
• An ‘Employee’ table in a database is an example of Structured Data.
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000
UNSTRUCTURED
• Any data with unknown form or structure is classified as unstructured data.
• In addition to the size being huge, un-structured data poses multiple challenges in terms of
its processing for deriving value out of it.
• A typical example of unstructured data is a
• Heterogeneous data source containing a combination of simple text files, images, videos etc.
• Organizations nowadays have a wealth of data available to them but, unfortunately, they
don’t know how to derive value from it, since this data is in its raw, unstructured format.
EXAMPLES OF UN-STRUCTURED DATA
• The output returned by ‘Google Search’
SEMI-STRUCTURED
• Semi-structured data can contain both forms of data.
• We can see semi-structured data as structured in form, but it is not actually defined
by, for example, a table definition in a relational DBMS.
• An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
• Personal data stored in an XML file, for example:
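The original slide's XML image isn't reproduced here; the following is a small hypothetical sketch of what such a record might look like, parsed with Python's standard library (the tags and values are invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical personal record: the tags give the data some structure,
# but no rigid relational schema (table definition) is enforced.
doc = """<person>
  <name>Priya Sane</name>
  <age>29</age>
  <department>Finance</department>
</person>"""

root = ET.fromstring(doc)
for field in root:
    print(field.tag, "=", field.text)
```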
Natural language
• Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and sentiment
analysis, but models trained in one domain don’t generalize well to other
domains.
Natural language
• This shouldn’t be a surprise though: Humans struggle with natural language
as well.
• It’s ambiguous by nature.
• The concept of meaning itself is questionable here.
• Have two people listen to the same conversation.
• Will they get the same meaning? The meaning of the same words can vary
when coming from someone upset or joyous.
Machine-Generated Data
• Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
• Machine-generated data is becoming a major data resource and will
continue to do so.
• Wikibon has forecast that the market value of the industrial Internet will
be approximately $540 billion in 2020.
Machine-Generated Data
• IDC (International Data Corporation) has estimated there will be 26 times more connected
things than people in 2020.
• This network is commonly referred to as the Internet of Things.
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
• Examples of machine data are
• Web server logs,
• Call detail records,
• Network event logs,
• Telemetry
Machine-Generated Data
• The machine data would fit nicely in a classic table-structured
database.
• This isn’t the best approach for highly interconnected or
“networked” data, where the relationships between entities have a
valuable role to play.
Graph-based or Network Data
• “Graph data” can be a confusing term because any data can be shown in a graph.
• “Graph” in this case points to “mathematical graph theory”.
• In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graphical
data.
Graph-based or Network Data
• Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of a
person and the shortest path between two people.
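As a rough illustration, here is a minimal sketch using the third-party networkx library (the names and edges are invented); degree centrality stands in for "influence", and shortest_path finds the chain of acquaintances between two people:

```python
import networkx as nx

# A toy social network: nodes are people, edges are friendships.
G = nx.Graph()
G.add_edges_from([
    ("Asha", "Ravi"), ("Ravi", "Meena"),
    ("Meena", "John"), ("Asha", "Meena"),
])

print(nx.degree_centrality(G))              # a simple proxy for influence
print(nx.shortest_path(G, "Asha", "John"))  # shortest chain between two people
```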
Graph-based or Network Data
• Examples of graph-based data can be found on many social media websites
• Facebook
• LinkedIn
• Twitter.
• The power and sophistication come from multiple, overlapping graphs of the same nodes.
• For example,
• Imagine the connecting edges here to show “friends” on Facebook.
• Imagine another graph with the same people which connects business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on Netflix.
• Overlapping the three different-looking graphs makes more interesting questions possible.
Graph-based or Network Data
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
• Graph data poses its challenges, but for a computer interpreting audio and
image data, it can be even more difficult.
Audio, Image, and Video
• Audio, image, and video are data types that pose specific challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to
be challenging for computers.
• MLBAM (Major League Baseball Advanced Media) announced in 2014 that
they’ll increase video capture to approximately 7 TB per game for the purpose
of live, in-game analytics.
• High-speed cameras at stadiums will capture ball and athlete movements to calculate
in real time, for example, the path taken by a defender relative to two baselines.
Audio, Image, and Video
• Recently a company called DeepMind succeeded at creating an algorithm that’s
capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret everything
via a complex process of deep learning.
• It’s a remarkable feat that prompted Google to buy the company for their own
Artificial Intelligence (AI) development plans.
• The learning algorithm takes in data as it’s produced by the computer game; it’s
streaming data.
Streaming Data
• Although streaming data can take almost any of the previous forms, it has an extra property.
• The data flows into the system when an event happens instead of being loaded into a
data store in a batch.
• Although this isn’t really a different type of data, we treat it here as such because you
need to adapt your process to deal with this type of information.
• Examples
• “What’s trending” on Twitter,
• Live sporting or music events, and
• The stock market.
DATA GROWTH OVER THE YEARS
Data Science Process
• Six Steps
Data Science Process
• The first step of this process is setting a research goal.
• The main purpose is making sure all the stakeholders understand the what, how, and
why of the project.
• In every serious project this will result in a project charter.
• The second phase is data retrieval.
• The goal is to have data available for analysis, so this step includes finding suitable
data and getting access to it from the data owner.
• The result is data in its raw form, which probably needs polishing and
transformation before it becomes usable.
Data Science Process
• Data preparation: now that you have the raw data, it’s time to prepare it.
• This includes transforming the data from a raw form into data that’s
directly usable in your models.
• To achieve this, you’ll detect and correct different kinds of errors in the data,
combine data from different data sources, and transform it.
• If you have successfully completed this step, you can progress to data
visualization and modeling.
Data Science Process
• The fourth step is Data Exploration.
• The goal of this step is to gain a deep understanding of the data.
• “Look for patterns, correlations, and deviations based on visual and
descriptive techniques.”
• The insights you gain from this phase will enable you to start modeling.
• The fifth step is model building (“data modeling”).
• It is now that you attempt to gain the insights or make the predictions stated in your
project charter.
Data Science Process
• Now is the time to bring out the heavy guns, but remember research has taught
us that often (but not always) a combination of simple models tends to
outperform one complicated model.
• If you’ve done this phase right, you’re almost done.
Data Science Process
• The last step of the data science model is presenting your results and automating the
analysis, if needed.
• One goal of a project is to change a process and/or make better decisions.
• You may still need to convince the business that your findings will indeed change the
business process as expected.
• This is where you can shine in your influencer role.
• The importance of this step is more apparent in projects on a strategic and tactical level.
• Certain projects require you to perform the business process over and over again, so
automating the project will save time.
Setting the Research Goal
• Data science is mostly applied in the context of an organization.
• When the business asks you to perform a data science project, you’ll first prepare a project charter.
• This charter contains information such as
• What you’re going to research,
• How the company benefits from that,
• What data and resources you need,
• A timetable,
• Deliverables.
• Define research goal.
• Create project charter.
Retrieving Data
• The second step is to collect data.
• You’ve stated in the project charter which data you need and where you can find it.
• In this step you ensure that you can use the data in your program, which means checking
the existence of, quality, and access to the data.
• Data can also be delivered by third-party companies and takes many forms ranging from
Excel spreadsheets to different types of databases.
• Internal Data
• Data Retrieval
• Data Ownership
• External Data
Data Preparation
• Data collection is an error-prone process;
• In this phase you enhance the quality of the data and prepare it for use in
subsequent steps.
• This phase consists of three subphases:
• Data cleansing removes false values from a data source and inconsistencies
across data sources,
• Data integration enriches data sources by combining information from
multiple data sources, and
• Data transformation ensures that the data is in a suitable format for use in
your models.
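A minimal pandas sketch of the three subphases, assuming two hypothetical tables (orders and customers):

```python
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2, 2, None],
                       "amount": ["10", "20", "20", "5"]})
customers = pd.DataFrame({"cust_id": [1, 2], "region": ["South", "North"]})

# Cleansing: drop rows with missing keys, remove duplicate records.
orders = orders.dropna(subset=["cust_id"]).drop_duplicates()
orders["cust_id"] = orders["cust_id"].astype(int)

# Integration: enrich orders with customer information.
combined = orders.merge(customers, on="cust_id", how="left")

# Transformation: cast amount to a numeric type suitable for modeling.
combined["amount"] = pd.to_numeric(combined["amount"])
print(combined)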
Data Exploration
• Data exploration is concerned with building a deeper understanding of your
data.
• You try to understand how variables interact with each other, the distribution of
the data, and whether there are outliers.
• To achieve this we mainly use descriptive statistics, visual techniques, and
simple modeling.
• This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
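A quick first pass in this spirit, sketched with pandas and Matplotlib (the heights reuse an example that appears later in these notes; the weight column is made up):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "height_cm": [139, 145, 150, 145, 136, 150, 152, 144, 138, 138],
    "weight_kg": [40, 45, 50, 46, 38, 52, 54, 47, 41, 42],  # hypothetical values
})

print(df.describe())   # distribution summaries for each variable
print(df.corr())       # pairwise correlations between variables
df.hist()              # visual check of shape and possible outliers
plt.show()
```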
Data modeling or model building
• In this phase you use models, domain knowledge, and insights about the data you found
in the previous steps to answer the research question.
• You select a technique from the fields of statistics, machine learning, operations
research, and so on.
• Building a model is an iterative process that involves
• Selecting the variables for the model,
• Executing the model, and
• Model diagnostics.
Presentation and Automation
• Finally, you present the results to your business.
• These results can take many forms, ranging from presentations to research
reports.
• You’ll need to automate the execution of the process because the business will
want to use the insights you gained in another project or enable an operational
process to use the outcome from your model.
The Science of Statistics
What Is Statistics?
1. Collecting Data
e.g., Survey
2. Presenting Data
e.g., Charts & Tables
3. Characterizing Data
e.g., Average
What Is Statistics?
• Statistics is the science of data.
• It involves collecting, classifying, summarizing, organizing, analyzing,
and interpreting numerical information.
Types of Statistical Applications in
Business
Application Areas
• Economics
• Forecasting
• Demographics
• Sports
• Individual & Team
Performance
• Engineering
• Construction
• Materials
• Business
• Consumer Preferences
• Financial Trends
Statistics: Two Processes
• Describing sets of data
• Drawing conclusions
• (Making estimates, Decisions, Predictions, etc. about sets of data based on
sampling)
Descriptive Statistics
• Descriptive statistics comprises methods for summarizing and organizing the
important features of a dataset.
• It provides simple, quantitative descriptions of the main characteristics of
data, either for a population or a sample, through numbers, tables, graphs,
and charts.
1. Involves
• Collecting Data
• Presenting Data
• Characterizing Data
2. Purpose
• Describe Data
Example: X̄ = 30.5, s² = 113 (from a bar chart of $ amounts, scaled 0–50, across quarters Q1–Q4; figure not reproduced).
Descriptive Statistics
• Describing Data with Tables and Graphs
• Describing Data with Averages
• Describing Variability
• Normal Distributions and Standard (z) Scores
• Describing Relationships: Correlation
• Regression
• Methods that use sample data to draw conclusions or inferences about a
larger population, along with a measure of uncertainty or reliability.
• Involves
• Estimation
• Hypothesis
Testing
• Purpose
• Make decisions about population
characteristics
Inferential Statistics
Fundamental Elements
of Statistics
Fundamental Elements
1. Experimental unit
• Object upon which we collect data
2. Population
• All items of interest
3. Variable
• Characteristic of an individual experimental unit
4. Sample
• Subset of the units of a population
• Mnemonic: the P in Population goes with Parameter; the S in Sample goes with Statistic.
Fundamental Elements
1. Statistical Inference
• Estimate or prediction or generalization about a population
based on information contained in a sample.
2. Measure of Reliability
• Statement (usually quantified) about the degree of uncertainty
associated with a statistical inference.
Four Elements of Descriptive Statistical Problems
1. The population or sample of interest
2. One or more variables (characteristics of the population or sample
units) that are to be investigated
3. Tables, graphs, or numerical summary tools
4. Identification of patterns in the data
Five Elements of Inferential Statistical Problems
1. The population of interest
2. One or more variables (characteristics of the population units)
that are to be investigated
3. The sample of population units
4. The inference about the population based on information
contained in the sample
5. A measure of reliability for the inference
Descriptive Statistical Problems vs. Inferential Statistical Problems

| Aspect / Element | Descriptive Statistical Problems | Inferential Statistical Problems |
|---|---|---|
| 1. Scope of study | Population or sample of interest | Entire population of interest |
| 2. Variables studied | One or more variables (characteristics of the population/sample units) | One or more variables (characteristics of the population units) |
| 3. Data source | Uses the entire dataset available (sample or population) for summarization | Uses a sample from the population |
| 4. Method of analysis | Uses tables, graphs, or numerical summaries to describe the data | Makes an inference or generalization about the population based on the sample |
| 5. Uncertainty measure | Not applicable: no inference, so no measure of reliability needed | Includes a measure of reliability (e.g., confidence level, margin of error, p-value) |
| 6. Goal | To summarize and identify patterns in the observed data | To draw conclusions or predictions about the population from sample data |
Types of Data
Data:
• A collection of actual observations or scores in a survey or an
experiment.
Types:
Qualitative Data
Ranked Data
Quantitative Data
THREE TYPES OF DATA
• Data: A collection of actual observations or scores in a survey or an experiment.
• Qualitative Data: A set of observations where any single observation is a word,
letter, or numerical code that represents a class or category.
• Ranked Data: A set of observations where any single observation is a number
that indicates relative standing.
• Quantitative Data: A set of observations where any single observation is a
number that represents an amount or a count.
Any statistical analysis is performed on data, a collection of actual observations
or scores in a survey or an experiment.
Types of Data
Qualitative Data:
• A set of observations where any single observation is a word, letter, or numerical
code that represents a class or category. {(Yes or No), (Y or N), (0 or 1)}
Ranked Data:
• A set of observations where any single observation is a number that indicates
relative standing within a group.
• {(1st, 2nd, 3rd, ..., 40th)}
Quantitative Data:
• A set of observations where any single observation is a number that represents an
amount or a count.
• {(weights of 238, 170, ..., 185 lbs)}
How to Determine the Data
• To determine the type of data, focus on a single observation in any collection of
observations.
• Example: the weights reported by 53 male students.
Indicate whether each of the following terms is qualitative (because it’s a word,
letter, or numerical code representing a class or category); ranked (because it’s a
number representing relative standing); or quantitative (because it’s a number
representing an amount or a count).
(a) Ethnic Group.
(b) Age.
(c) Family Size.
(d) Academic Major.
(e) Sexual Preference.
(f) IQ score.
(g) Net worth (dollars).
(h) Third-place finish.
(i) Gender.
(j) Temperature.
Indicate whether each of the following terms is qualitative (because it’s a word, letter, or
numerical code representing a class or category); ranked (because it’s a number representing
relative standing); or quantitative (because it’s a number representing an amount or a count).
(a) Ethnic Group - qualitative
(b) Age - quantitative
(c) Family Size - quantitative
(d) Academic Major - qualitative
(e) Sexual Preference - qualitative
(f) IQ Score - quantitative
(g) Net Worth (dollars) - quantitative
(h) Third-Place Finish - ranked
(i) Gender - qualitative
(j) Temperature - quantitative
LEVELS OF MEASUREMENT
• Level of measurement specifies the extent to which a number (or word or letter) actually
represents some attribute and, therefore, has implications for the appropriateness of
various arithmetic operations and statistical procedures.
• There are three levels of measurement (nominal, ordinal, and interval/ratio), paired
with qualitative, ranked, and quantitative data, respectively.
• Measurement of nonphysical characteristics, such as IQ, is only approximately interval.
Qualitative Data and
Nominal Measurement
• The single property of
nominal measurement
is classification—that
is, sorting observations
into different classes or
categories
Ranked Data and
Ordinal Measurement
• The distinctive
property of ordinal
measurement is order.
• The relative standing of ranked data reflects differences in degree based on order.
QUANTITATIVE DATA AND
INTERVAL/RATIO
MEASUREMENT
Often the products of familiar
measuring devices, such as
rulers, clocks, or meters, the
distinctive properties of
interval/ratio measurement
are equal intervals and a true
zero.
Indicate the level of measurement (nominal, ordinal, or interval/ratio) attained
by the following sets of observations or data. When appropriate, indicate that
measurement is only approximately interval.
(A) Height
(B) Religious Affiliation
(C) Score For Psychopathic Tendency
(D) Years Of Education
(E) Military Rank
(F) Vocational Goal
(G) GPA
(H) Marital Status
Indicate the level of measurement (nominal, ordinal, or interval/ratio) attained
by the following sets of observations or data. When appropriate, indicate that
measurement is only approximately interval.
Variable Level of Measurement
(a) Height Interval/Ratio
(b) Religious Affiliation Nominal
(c) Score for Psychopathic Tendency Approximately Interval
(d) Years of Education Interval/Ratio
(e) Military Rank Ordinal
(f) Vocational Goal Nominal
(g) GPA Approximately Interval
(h) Marital Status Nominal
Describing Data with Tables and Graphs
Frequency
• Frequency: the number of times a data item occurs in the series.
• It deals with how frequent a data item is in the series.
Example:
• If the weight of 5 students in a class is exactly 65 kg, then the
frequency of the data item 65 kg is 5.
Frequency Distributions For Quantitative Data
• A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f ) of
occurrence in each class.
• A frequency distribution provides information about the number of
occurrences (frequency) of the distinct values in a dataset, whether
presented as a list, table, or graphical representation.
• Graphic presentation is another way of the presentation of data and
information.
Frequency Distributions
• Usually, graphs are used to present time series and frequency
distribution.
• A frequency distribution helps us to detect any pattern in the data
(assuming a pattern exists) by superimposing some order on the
inevitable variability among observations.
Frequency Distribution
• Many times, it is not easy or feasible to find the frequency of data from a
very large dataset.
• To make sense of the data we make a frequency table and graphs.
• Let us take the example of the heights of ten students in CMs.
Frequency Distribution Table
139, 145, 150, 145, 136, 150, 152, 144, 138, 138
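A minimal sketch that tallies those ten heights with Python's standard library:

```python
from collections import Counter

heights = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]
for value, freq in sorted(Counter(heights).items()):
    print(f"{value} cm: {freq}")
```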
Guidelines For Frequency Distributions
Essential
1. Each observation should be included in one, and only one, class.
   • Example: Use classes like 130–139, 140–149, 150–159, etc.
   • Don’t use overlapping classes like 130–140, 140–150, 150–160.
2. Include all classes, even if no data falls in them.
   • Example: If there’s no data in the class 210–219, still list it with a
     frequency of zero.
3. Make sure all class intervals are the same size.
   • Example: Use 130–139, 140–149, 150–159, etc.
   • Don’t mix sizes like 130–139 and 140–159.
Optional
4. All classes should have both an upper boundary and a lower boundary.
   Example: 240–249. Less preferred would be 240–above, in which no
   maximum value can be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2, 3, ... 10,
   particularly 5 and 10 or multiples of 5 and 10.
   Example: 130–139, 140–149, in which the class interval of 10 is a
   convenient number. Less preferred would be 130–142, 143–155, etc., in
   which the class interval of 13 is not a convenient number.
6. The lower boundary of each class interval should be a multiple of the class
   interval.
   Example:
   • If the class interval is 10, use 130–139, 140–149, etc. (130 and 140 are
     multiples of 10).
   • Not preferred: 135–144, 145–154, etc., because 135 and 145 are not
     multiples of 10.
7. Aim for a total of approximately 10 classes.
   Example:
   • A distribution with 12 classes is okay.
   • Not preferred: 24 classes (too many; makes the table too detailed) or
     3 classes (too few; gives very little information).
How many Classes ?
• The seventh guideline requires a few more comments.
• Try to use around 10 classes to summarize the data clearly.
• Too many classes (like 24 classes with a small interval of 5) can make the
table too detailed and hard to understand. It defeats the purpose of
summarizing the data in a simple way.
• Too few classes (like just 3 classes with a wide interval of 50) can hide
important patterns in the data.
• Aim for a balance—not too many and not too few—to show patterns clearly
while keeping it easy to read.
Gaps between Classes
Unit of Measurement:
• The smallest possible difference between scores.
• In well-constructed frequency tables, the gaps between classes, such as the gap
between 149 and 150, ensure that each observation or score is assigned to one,
and only one, class.
• The size of the gap should always equal one unit of measurement.
• It should always equal the smallest possible difference between scores within a
particular set of data.
• Since the gap is never bigger than one unit of measurement, no score can fall into
the gap.
Examples
• If weights are measured in kilograms to one decimal place, the unit of
measurement is 0.1 kg.
• If age is measured in whole years, the unit is 1 year.
• In temperature measured in Celsius, if values are recorded like 36.5°C,
then the unit is 0.1°C.
Real Limits of Class Intervals
• Real limits are used to find the actual width of a class interval, ensuring that there
are no gaps between adjacent classes.
• How to Find Real Limits:
• Lower Real Limit = Lower class boundary minus half of the unit
• Upper Real Limit = Upper class boundary plus half of the unit
• Example 1:
• Class: 140–149
• Unit of measurement = 1
• Lower real limit = 140 – 0.5 = 139.5
• Upper real limit = 149 + 0.5 = 149.5
• Actual width = 149.5 – 139.5 = 10
• Real limits remove small gaps between classes and help in accurate statistical
calculations like histograms or finding the midpoint and class width.
Constructing Frequency Distributions
1. Find the range, that is, the difference between the largest and smallest
observations. The range of weights in Table 1.1 is 245 − 133 = 112.
2. Find the class interval required to span the range by dividing the range by the
desired number of classes (ordinarily 10).
Example,
• Choose a simple class interval, like 5 or 10.
Example, 10 is a good choice.
• Start the first class at a number that’s a multiple of the interval.
Example: The smallest value is 133, so start at 130 (a multiple of 10).
Constructing Frequency Distributions
• Find the end of the first class by adding the class interval and subtracting 1.
Example: 130 + 10 = 140, then 140 – 1 = 139. So, the first class is 130–139.
• Keep listing classes (like 140–149, 150–159, etc.) until the last class includes
the largest value (245 in this case, so end at 240–249).
• Use tally marks to count how many values fall into each class.
Example: If the value is 160, put a tally next to 160–169.
• Replace tally marks with numbers to show the frequency (how many values
are in each class).Add up all frequencies to get the total.
• Add clear column headings (like “Class Interval” and “Frequency”) and give
your table a title.
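A sketch of the same construction with NumPy, using a sample of the weight data listed later in these notes; the bin edges follow the guidelines above (interval 10, lower boundaries at multiples of 10):

```python
import numpy as np

weights = [160, 168, 133, 170, 150, 165, 158, 165, 193, 169, 245, 160]  # sample only
bins = np.arange(130, 260, 10)          # edges 130, 140, ..., 250
freq, edges = np.histogram(weights, bins=bins)
for lo, f in zip(edges[:-1], freq):
    print(f"{lo}-{lo + 9}: {f}")        # e.g. "130-139: 1"
```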
OUTLIERS
• A very extreme score
• Outliers are data points that are far from other data points. In other words,
they’re unusual values in a dataset.
• Outliers are problematic for many statistical analyses because they can
cause tests to either miss significant findings or distort real results.
Here are some of the more common causes of outliers in datasets:
• Human error while manually entering data, such as a typo.
• Intentional errors, such as dummy outliers included in a dataset to test detection
  methods.
• Sampling errors that arise from extracting or mixing data from inaccurate or varied
  sources.
• Data processing errors that arise from data manipulation or unintended mutations of
  a dataset.
• Measurement errors resulting from instrumental error.
• Experimental errors, from the data extraction process or from experiment planning
  or execution.
• Natural outliers, which occur “naturally” in the dataset rather than resulting from
  one of the errors listed above. Such naturally occurring outliers are known as novelties.
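One common rule of thumb for flagging outliers is the 1.5 × IQR fence; a minimal NumPy sketch with hypothetical incomes (including the extreme summer income mentioned in the exercise that follows):

```python
import numpy as np

incomes = [2000, 2500, 1800, 3000, 2200, 2700, 1900, 2400, 25700]  # hypothetical
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in incomes if x < lower or x > upper])  # -> [25700]
```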
• Identify any outliers in each of the following sets of data collected from nine college
students.
• Outliers are a summer income of $25,700; an age of 61; and a family size of 18. No
outliers for GPA
Types of Frequency Distribution
Grouped frequency
distribution.
Ungrouped frequency
distribution.
Cumulative frequency
distribution.
Relative frequency
distribution.
Relative cumulative
frequency distribution.
Grouped frequency distribution:
• A frequency distribution produced whenever observations
are sorted into classes of more than one value.
• The data is arranged and separated into groups called class
intervals.
• The frequency of data belonging to each class interval is noted
in a frequency distribution table.
• The grouped frequency table shows the distribution of
frequencies in class intervals.
Example
• Marks obtained by 20 students in the test are as follows.
• 5, 10, 20, 15, 5, 20, 20, 15, 15, 15, 10, 10, 10, 20, 15, 5, 18, 18, 18, 18.
• To arrange the data in grouped table we have to make class intervals.
• Thus, we will make class intervals of marks like 0 – 5, 6 – 10, and so
on.
• One Column is of class intervals (marks obtained in test) and the
second column is of frequency (no. of students).
Example

| Marks obtained in Test (class intervals) | No. of Students (Frequency) |
|---|---|
| 0 – 5 | 3 |
| 6 – 10 | 4 |
| 11 – 15 | 5 |
| 16 – 20 | 8 |
| Total | 20 |
Exercise
• Construct a frequency distribution table for the IQ scores of a group of
35 high school dropouts, given as follows:
Solution
Ungrouped Frequency Distribution
• It shows the frequency of each individual data value rather than of groups
of data values.
• A frequency distribution produced whenever observations are sorted into
classes of single values.
• In an ungrouped frequency distribution table, data are not organized into class
intervals; instead, the exact frequency of each individual data value is recorded.
• The table shows two columns:
• One is of marks obtained in the test and the second is of frequency
(no. of students).
Example: Ungrouped Frequency Distribution

Marks obtained by 20 students in the test are as follows:
5, 10, 20, 15, 5, 20, 20, 15, 15, 15, 10, 10, 10, 20, 15, 5, 18, 18, 18, 18.

| Marks obtained in Test | No. of Students |
|---|---|
| 5 | 3 |
| 10 | 4 |
| 15 | 5 |
| 18 | 4 |
| 20 | 4 |
| Total | 20 |
Exercise
• Construct an ungrouped frequency distribution for the ratings that students
in a theater arts appreciation class gave the classic film The Wizard of Oz
on a 10-point scale, ranging from 1 (poor) to 10 (excellent), as follows:
Solution
RELATIVE FREQUENCY DISTRIBUTIONS
• An important variation of the frequency distribution is the relative frequency
distribution.
• Relative frequency distributions show the frequency of each class as a
part or fraction of the total frequency for the entire distribution.
• Instead of raw counts (frequencies), it presents proportions or percentages.
Constructing Relative Frequency Distributions
• To convert a frequency distribution into a relative frequency distribution, divide the
frequency for each class by the total frequency for the entire distribution.
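Using the class frequencies from the percentage table shown shortly, a minimal pandas sketch of this conversion:

```python
import pandas as pd

freq = pd.Series({"0-10": 5, "11-20": 8, "21-30": 12, "31-40": 20, "41-50": 10})
rel = freq / freq.sum()        # proportions, each between 0 and 1
print(rel.round(4))            # e.g. 0.0909 for the class 0-10
print((rel * 100).round(2))    # the same distribution as percentages
```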
Percentages or Proportions?
• Sometimes we prefer to deal with percentages rather than proportions because
percentages usually lack decimal points.
• A proportion always varies between 0 and 1, whereas a percentage always varies
between 0 percent and 100 percent.
• To convert the relative frequencies in Table 2.5 from proportions to percentages,
multiply each proportion by 100; that is, move the decimal point two places to
the right.
• For example, multiply .06 (the proportion for the class 130–139) by 100 to obtain
6 percent.
Relative Frequency Distributions: Percentages

| Class Interval | Frequency | Relative Frequency | Relative Frequency (%) |
|---|---|---|---|
| 0 – 10 | 5 | 5 / 55 = 0.0909 | 9.09% |
| 11 – 20 | 8 | 8 / 55 = 0.1455 | 14.55% |
| 21 – 30 | 12 | 12 / 55 = 0.2182 | 21.82% |
| 31 – 40 | 20 | 20 / 55 = 0.3636 | 36.36% |
| 41 – 50 | 10 | 10 / 55 = 0.1818 | 18.18% |
Relative Frequency Distributions
• Suppose a store sells 55 shirts in total:
• 20 are blue shirts → Proportion = 20/55 = 0.3636
• The manager says, "36.36% of the shirts sold are blue" → Percentage
• Both tell the same story - one in numbers, one in everyday language.
CUMULATIVE FREQUENCY DISTRIBUTIONS
• Cumulative frequency distributions show the total number of observations in
each class and in all lower-ranked classes.
• This type of distribution can be used effectively with sets of scores, such as test
scores for intellectual or academic aptitude, when relative standing within the
distribution assumes primary importance.
• Under these circumstances, cumulative frequencies are usually converted, in
turn, to cumulative percentages.
• Cumulative percentages are often referred to as percentile ranks.
Constructing Cumulative Frequency Distributions
• To convert a frequency distribution into a cumulative frequency
distribution, add to the frequency of each class the sum of the frequencies
of all classes ranked below it.
• This gives the cumulative frequency for that class.
• Begin with the lowest-ranked class in the frequency distribution and work
upward, finding the cumulative frequencies in ascending order.
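A sketch of this running-sum conversion with pandas, using weight-class frequencies reconstructed from the worked example below (the cumulative values 4, 21, and 40 and the total of 53 match the text; treat the individual class counts as approximate):

```python
import pandas as pd

freq = pd.Series({"130-139": 3, "140-149": 1, "150-159": 17, "160-169": 12,
                  "170-179": 7, "180-189": 3, "190-199": 4, "200-209": 2,
                  "210-219": 0, "220-229": 3, "230-239": 0, "240-249": 1})
cum = freq.cumsum()                            # each class plus all lower-ranked classes
cum_pct = (100 * cum / freq.sum()).round(1)    # cumulative percentages (percentile ranks)
print(pd.DataFrame({"f": freq, "cum f": cum, "cum %": cum_pct}))
```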
Constructing Cumulative Frequency Distributions
160, 168, 133, 170, 150, 165, 158, 165
193, 169, 245, 160, 152, 190, 179, 157
226, 160, 170, 180, 150, 156, 190, 156
157, 163, 152, 158, 225, 135, 165, 135
180, 172, 160, 170, 145, 185, 152, 154
205, 151, 220, 166, 159, 156
165, 157, 190, 206, 172, 175
Cumulative Frequency Distributions
• The cumulative frequency for the class
140–149 is 4, since 1 is the frequency for
that class and 3 is the frequency of all
lower-ranked classes.
• The cumulative frequency for the class
150–159 is 21, since 17 is the frequency
for that class and 4 is the sum of the
frequencies of all lower-ranked classes.
Cumulative Percentages
• If relative standing within a distribution is particularly important, then
cumulative frequencies are converted to cumulative percentages.
• A glance at Table 2.6 reveals that 75 percent of all weights are the same as or
lighter than the weights between 170 and 179 lbs.
• To obtain this cumulative percentage (75%), the cumulative frequency of 40
for the class 170–179 should be divided by the total frequency of 53 for the
entire distribution.
Exercise: Cumulative Frequency Distribution
Construct a Cumulative Frequency Distribution for the given score:
55, 62, 47, 88, 90, 78, 84, 67, 74, 92, 58, 80, 63, 70, 49, 86, 73, 91, 75,
61, 60, 85, 77, 69, 64, 53, 71, 83, 79, 66, 68, 76, 82, 87, 59, 52, 65, 81,
50, 46, 45, 89, 48, 72, 54, 56, 62, 57, 93, 95
GRAPHS
• Data can be described clearly and concisely with the aid of a
well-constructed frequency distribution.
• Data can often be described even more vividly, particularly when attempting
to communicate with a general audience, by converting frequency
distributions into graphs.
CONSTRUCTING GRAPHS
• Choose the right graph:
• Use histograms or frequency polygons for quantitative data.
• Use bar graphs for qualitative data or discrete quantitative data.
• Draw axes:
• Start with the horizontal (x) axis, then the vertical (y) axis.
• Make the height of the vertical axis roughly equal to the width of the
horizontal axis.
• Set up class intervals:
• For qualitative or ungrouped data, use the categories from the data.
• For grouped data, create class intervals as you would for a frequency
distribution
• Mark class intervals on the horizontal axis:
• For bar graphs, leave gaps between intervals.
• For histograms and frequency polygons, leave no gaps; spacing may need
trial and error, so use a pencil.
• If there’s a large gap from 0 to the first class, use a wiggly line to show a
break in scale.
• Avoid clutter: label only a few key points on the axis.
• Mark frequencies on the vertical axis:
• Start from 0 and go up to a value equal to or just above the highest
frequency.
• If the smallest frequency is far from 0, use a wiggly line to show a break in
the scale.
• Use simple, evenly spaced numbers for clarity.
• Plot the data:
• Draw bars for bar graphs and histograms to show frequencies.
• For frequency polygons, place dots above the midpoints of each class and
connect them with lines.
• Be sure to anchor both ends of the polygon to the horizontal axis.
• Add labels and title:
• Label both the horizontal and vertical axes clearly.
• Add a title or a brief explanation of what the graph shows.
GRAPHS
• Most common types of graphs for quantitative and qualitative data
• Histogram
• Frequency Polygon
• The Bar Graph
GRAPHS FOR QUANTITATIVE DATA
Histograms
• A bar-type graph for quantitative data.
• The common boundaries between adjacent bars emphasize the
continuity of the data, as with continuous variables.
Features of Histograms
• The horizontal axis (X-axis) shows the class intervals of the data using equal
spacing.
• The vertical axis (Y-axis) shows the frequencies, also with equal spacing.
(The spacing on the vertical axis doesn’t have to match that of the horizontal
axis.)
• The point where the two axes meet is the origin, where both values are 0.
• Numbers on the horizontal axis increase left to right, and on the vertical
axis, bottom to top.
• If there’s a big gap between 0 and the first class (like 130–139), it’s good to
show a wiggly line on the axis to mark a break in scale.
Features of Histograms
• A histogram is made up of bars, where each bar’s height shows the
frequency of that class.
• Histogram bars touch each other to show the data is continuous.
• If you add gaps between bars, it may wrongly suggest the data is discrete
or categorical.
Frequency Polygon
• A line graph for quantitative data that also emphasizes the continuity of
continuous variables.
• An important variation on a histogram is the frequency polygon, or line
graph.
• Frequency polygons may be constructed directly from frequency
distributions.
• Follow the step-by-step transformation of a histogram into a frequency
polygon, as described in panels A, B, C, and D of Figure 2.2.
Steps for Construction of Frequency Polygon
• The panel shows a histogram of weight distribution.
To create a frequency polygon:
• Place a dot at the midpoint of the top of each bar (or at the midpoint of
each class if there are no bars).
• To find a midpoint, add the class limits and divide by 2.
Example: (160 + 169) / 2 = 164.5
• Connect all the dots with straight lines.
• To anchor the polygon:
• Extend the right end to the midpoint of the next empty class (e.g.,
250–259).
• Extend the left end to the midpoint of the first empty class on that side
(e.g., 120–129).
• Finally, remove all the histogram bars, so only the frequency polygon
remains.
• Frequency polygons are useful for comparing two or more frequency
distributions or relative frequency distributions on the same graph.
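A Matplotlib sketch that draws both on one set of axes, using a sample of the weight data from earlier (class midpoints follow the (lower + upper) / 2 rule):

```python
import matplotlib.pyplot as plt
import numpy as np

weights = [160, 168, 133, 170, 150, 165, 158, 165, 193, 169, 245, 160]  # sample only
bins = np.arange(130, 260, 10)
freq, edges = np.histogram(weights, bins=bins)
mids = edges[:-1] + 4.5                 # class midpoints, e.g. (160 + 169) / 2 = 164.5

plt.bar(edges[:-1], freq, width=10, align="edge", edgecolor="black")  # touching bars
plt.plot(mids, freq, marker="o", color="red")  # dots at midpoints, joined by lines
plt.xlabel("Weight (lbs)")
plt.ylabel("Frequency")
plt.show()
```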
Exercise
The following frequency distribution shows the annual incomes in dollars for a
group of college graduates.
(a) Construct a histogram.
(b) Construct a frequency
polygon.
(c) Is this distribution
balanced or lopsided?
Solution
Stem and Leaf Displays
• Another technique for summarizing quantitative data is a stem and leaf
display.
• A stem-and-leaf display is a simple way to organize and visualize
quantitative data to see its distribution.
• It helps in identifying the shape of the data, such as whether it's
symmetric, skewed, or has clusters.
Constructing a Stem Leaf Display
• Each number is split into two parts:
• Stem – the leading digit(s)
• Leaf – the last digit
• For example, for the number 47:
Stem = 4
Leaf = 7
• The stems are listed in a vertical column, and the leaves (remaining
digits) are listed to the right of each stem.
Stem and Leaf Displays
Example:
• Let’s say we have the following
scores:
• 43, 46, 48, 51, 52, 54, 54, 56, 58,
61
• We organize them as follows:
Stem | Leaf
4 | 3 6 8
5 | 1 2 4 4 6 8
6 | 1
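A minimal sketch that reproduces this display with Python's standard library:

```python
from collections import defaultdict

scores = [43, 46, 48, 51, 52, 54, 54, 56, 58, 61]
stems = defaultdict(list)
for s in sorted(scores):
    stems[s // 10].append(s % 10)   # tens digit as stem, units digit as leaf

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```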
Selection of Stems
• Stem values are not limited to units of 10.
• Rule of thumb: Keep the stem long enough to show variation, but short
enough to group meaningfully.
• For 2-digit numbers: Use the tens digit as the stem, units as the leaf.
• 57 → stem = 5, leaf = 7
• For 3-digit numbers: Use the hundreds and tens as the stem.
• 248 → stem = 24, leaf = 8
• For 5-digit numbers: You might use the first 3 digits as stem.
• 42188 → stem = 421, leaf = 88
Exercise
Construct a stem and leaf display for the following IQ scores obtained
from a group of four-year-old children.
TYPICAL SHAPES
• Whether expressed as a histogram, a frequency polygon, or a stem and leaf
display, an important characteristic of a frequency distribution is its shape.
• Figure shows some of the more typical shapes for smoothed frequency
polygons (which ignore the inevitable irregularities of real data).
• Normal
• Bimodal
• Positively Skewed
• Negatively Skewed
Normal
• Any distribution that looks like the bell-shaped curve (as shown in panel A ) can
be analyzed using the normal curve.
• This normal distribution appears in many real-life cases, such as:
• Gestation periods of babies
• Standardized test scores
• Popping times of popcorn kernels
• The normal curve helps us understand and interpret these kinds of data.
Bimodal
• A bimodal distribution (like in panel B) often shows the presence of two
different groups in the same data set.
• For example,
• The age distribution in a neighborhood with mostly new parents and
their babies would have two peaks, forming a bimodal shape.
Positively Skewed Distribution
• A distribution that includes a few extreme observations in the
positive direction (to the right of the majority of observations).
Negatively Skewed
• Negatively Skewed Distribution
• A distribution that includes a few extreme observations in the negative
direction (to the left of the majority of observations).
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
Bar Graph
• A bar-type graph for qualitative data.
• Gaps between adjacent bars emphasize the discontinuous nature of the
data.
Describing Data with Averages
• Averages (measures of central tendency) are values that represent
a typical or central point in a data set.
• Measures of Central Tendency
• Numbers or words that attempt to describe, most generally, the
middle or typical value for a distribution.
Mode
• The value of the most frequent
score or the value that appears most
frequently in a dataset.
Key Properties:
• Can be used for qualitative and
quantitative data.
• A dataset can have no mode, one
mode (unimodal), two modes
(bimodal), or more
(multimodal).
Example:
Data:
[60, 63, 45, 63, 65, 70, 55, 63, 60,
65, 63]
Step 1: Count each value’s frequency
63 appears 4 times
Others less than that
Mode = 63
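The same result with the standard library (statistics.multimode requires Python 3.8+ and also handles bimodal or multimodal data):

```python
from statistics import multimode

data = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
print(multimode(data))   # -> [63]
```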
Median
• The median is the middle value
when all observations are sorted
from smallest to largest.
Data: [60, 63, 45, 63, 65, 70, 55,
63, 60, 65, 63]
Step 1: Sort the data
45, 55, 60, 60, 63, 63, 63, 63, 65,
65, 70
Step 2: Count number of items =
11 (odd number)
Median = 63 (6th value)
• Data: [26.3, 28.7, 27.4, 26.6,
27.4, 26.9]
• Sorted: [26.3, 26.6, 26.9, 27.4,
27.4, 28.7]
• Middle two: 26.9, 27.4
• Median = (26.9 + 27.4) / 2 =
27.15
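Both cases with the standard library:

```python
from statistics import median

print(median([60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]))    # 63 (odd count)
print(round(median([26.3, 28.7, 27.4, 26.6, 27.4, 26.9]), 2))  # 27.15 (even count)
```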
Mean
• The mean is usually referred to as 'the
average’.
Mean for Ungrouped Data
• The mean is the sum of all the values in
the data divided by the total number of
values in the data.
Mean = Sum of all Observations ÷ Total
number of Observations
Example
• 40, 21, 55, 31, 48, 13, 72
• Mean = (40 + 21 + 55 + 31 + 48 + 13 + 72) / 7 = 280 / 7 = 40
• Mean for Grouped Data
• Mean is defined for the grouped data as the
sum of the product of observations (xi) and
their corresponding frequencies (fi) divided
by the sum of all the frequencies (fi).
Example
• Mean = (4×5 + 6×10 + 15×8 + 10×7 +
9×10) ÷ (5 + 10 + 8 + 7 + 10)
= (20 + 60 + 120 + 70 + 90) ÷ 40
= 360 ÷ 40
= 9
X (value): 4, 6, 15, 10, 9
F (frequency): 5, 10, 8, 7, 10
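Both means in a short NumPy sketch; np.average with weights computes the grouped mean Σ(xᵢfᵢ) / Σfᵢ:

```python
import numpy as np

print(np.mean([40, 21, 55, 31, 48, 13, 72]))   # ungrouped mean: 40.0

x = [4, 6, 15, 10, 9]      # observed values
f = [5, 10, 8, 7, 10]      # their frequencies
print(np.average(x, weights=f))                # grouped mean: 9.0
```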

Data science notes for reference/ engineering

  • 1.
    22UIT303-DATA SCIENCE Dr. N.G.PInstitute of Technology Coimbatore- 48 Department of Information Technology
  • 2.
    COURSE OBJECTIVES: • Tounderstand techniques and processes of data science. • Learn and describe the relationship between data. • Outline an overview of exploratory data analysis. • Utilize the Python libraries for Data Wrangling. • Interpret data using visualization techniques in Python.
  • 3.
    UNIT I DATASCIENCE AND STATISTICS • Data Science: • Benefits and uses. • Applications • Facets of data. • Data Science Process: • Overview. • Defining research goals. • Retrieving Data. • Data Preparation. • Exploratory Data Analysis • Build the Model. • Presenting Findings and Building Applications. • Statistics • Basic Statistical Descriptions of Data • Types of Data • Describing Data with Tables and Graphs • Describing Data with Averages.
  • 4.
    UNIT II DESCRIBINGDATA & RELATIONSHIP • Correlation • Scatter Plots • Correlation Coefficient for Quantitative Data • Computational formula for Correlation Coefficient • Regression • Regression Line • Least Squares Regression Line • Standard Error of Estimate • Interpretation of r2 • Multiple Regression Equations • Regression Towards the Mean • Logistic Regression • Estimating Parameters.
  • 5.
    UNIT III EXPLORATORYDATA ANALYSIS • EDA fundamentals. • Comparing EDA with classical and Bayesian analysis. • Software tools for EDA. • Visual Aids for EDA. • Data transformation techniques. • Merging database, Reshaping and Pivoting, Grouping Datasets • Data Aggregation • Pivot Tables and Cross • Tabulations.
  • 6.
    UNIT IV PYTHONLIBRARIES FOR DATA WRANGLING • Basics of Numpy arrays • Aggregations • Computations on Arrays • Comparisons, Masks, Boolean logic • Fancy Indexing • Structured Arrays • Data manipulation with Pandas • Data Indexing and Selection. • Operating on Data. • Missing Data. • Hierarchical Indexing.
  • 7.
    UNIT V DATAVISUALIZATION • Importing Matplotlib • Simple Line Plots • Simple Scatter Plots • Visualizing Errors • Density and Contour Plots • Histograms • Legends • Colors • Subplots • Text and Annotation • Customization • Three Dimensional Plotting • Geographic Data with Basemap • Visualization with Seaborn.
  • 8.
    COURSE OUTCOMES: CO1:Understand thedata science process and different types of data description. CO2: Analyze the relationship between data using statistics. CO3: Perform fundamental exploratory data analysis on dataset. CO4:Handle data using primary tools used for data science in Python. CO5: Apply visualization Libraries in Python to interpret and explore data.
  • 9.
    Books TEXTBOOKS: 1. Davy Cielen,Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016. 2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017. REFERENCE: 3. Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data Science”, CRC Press, 2022. 4. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. 5. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press,2014. 6. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data Visualization: Foundations, Techniques, and Applications”, 2nd Edition, CRC press, 2015.
  • 10.
    UNIT I DATASCIENCE AND STATISTICS
  • 11.
    DATA • The quantities,characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
  • 12.
    BIG DATA • BigData is a collection of data that is huge in volume, yet growing exponentially with time. • It is a data with so large size and complexity that none of traditional data management tools can store it or process it efficiently. • Big data is also a data but with huge size.
  • 14.
    • Do youknow? 1021 bytes equal to 1 zettabyte or one billion terabytes forms a zettabyte.
  • 15.
    EXAMPLE OF BIGDATA • The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade data per day.
  • 16.
    SOCIAL MEDIA • Thestatistic shows that 500+terabytes of new data get ingested into the databases of social media site Facebook, every day. • This data is mainly generated in terms of photo and video uploads, message exchanges, putting comments etc.
  • 17.
    JET ENGINE • Asingle Jet engine can generate 10+terabytes of data in 30 minutes of flight time. • With many thousand flights per day, generation of data reaches up to many Petabytes.
  • 18.
    Big Data DataScience • The widely adopted RDBMS has long been regarded as a one-size-fits-all solution. • Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. • The relationship between big data and data science as being like the relationship between crude oil and an oil refinery. • Data science and big data evolved from statistics and traditional data management.
  • 19.
    • The characteristicsof big data are often referred to as the three Vs: • Volume—How much data is there? • Variety—How diverse are different types of data? • Velocity—At what speed is new data generated? These characteristics are complemented with a fourth V, • Veracity: How accurate is the data? • These four properties make big data different from the data found in traditional data management tools.
  • 20.
    • Consequently, thechallenges are almost in every aspect: • Data capture, • Curation, • Storage, • Search, • Sharing, • Transfer, • Visualization. • In addition, big data calls for specialized techniques to extract the insights.
  • 21.
    StatisticsData Science • Datascience is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. • It adds methods from computer science to the repertoire of statistics.
  • 22.
    Why named asdata scientist? • The main things that set a data scientist apart from a statistician are the ability to work with big data and experience in machine learning, computing, and algorithm building. • Their tools tend to differ too, with data scientist job descriptions more frequently mentioning the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others. • Python is a great language for data science because it has many data science libraries available, and it’s widely supported by specialized software.
  • 23.
    Benefits and Usesof Data Science and Big Data • Data science and big data are used almost everywhere in both commercial and noncommercial settings. • The number of use cases is vast.
  • 24.
    TYPES OF BIGDATA 1.Structured 2.Unstructured 3.Semi-structured
  • 25.
    Facets of data •Structured. • Unstructured. • Natural language. • Machine-generated. • Graph-based. • Audio, video, and images. • Streaming.
  • 26.
STRUCTURED • Any data that can be stored, accessed, and processed in a fixed format is termed "structured" data. • Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it. • However, we now foresee issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
STRUCTURED • Data stored in a relational database management system is one example of structured data. • SQL (Structured Query Language) is the preferred way to manage and query data that resides in databases.
STRUCTURED • An "Employee" table in a database is an example of structured data.

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
UNSTRUCTURED • Any data with an unknown form or structure is classified as unstructured data. • In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. • A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. • Organizations today have a wealth of data available to them but, unfortunately, do not know how to derive value from it, since this data is in its raw, unstructured form.
EXAMPLES OF UNSTRUCTURED DATA • The output returned by "Google Search".
SEMI-STRUCTURED • Semi-structured data can contain both forms of data. • We can see semi-structured data as structured in form, but it is actually not defined by, e.g., a table definition in a relational DBMS. • An example of semi-structured data is data represented in an XML file.
Examples of Semi-structured Data • Personal data stored in an XML file.
Natural language • Natural language is a special type of unstructured data; it's challenging to process because it requires knowledge of specific data science techniques and linguistics. • The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don't generalize well to other domains.
    Natural language • Thisshouldn’t be a surprise though: Humans struggle with natural language as well. • It’s ambiguous by nature. • The concept of meaning itself is questionable here. • Have two people listen to the same conversation. • Will they get the same meaning? The meaning of the same words can vary when coming from someone upset or joyous.
  • 36.
Machine-Generated Data • Machine-generated data is information that's automatically created by a computer, process, application, or other machine without human intervention. • Machine-generated data is becoming a major data resource and will continue to do so. • Wikibon forecast that the market value of the industrial Internet would be approximately $540 billion in 2020.
Machine-Generated Data • IDC (International Data Corporation) estimated there would be 26 times more connected things than people in 2020. • This network is commonly referred to as the Internet of Things. • The analysis of machine data relies on highly scalable tools, due to its high volume and speed. • Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Machine-Generated Data • The machine data would fit nicely in a classic table-structured database. • This isn't the best approach for highly interconnected or "networked" data, where the relationships between entities have a valuable role to play.
Graph-based or Network Data • "Graph data" can be a confusing term because any data can be shown in a graph. • "Graph" in this case refers to mathematical graph theory. • In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects. • Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. • Graph structures use nodes, edges, and properties to represent and store graphical data.
Graph-based or Network Data • Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
Graph-based or Network Data • Examples of graph-based data can be found on many social media websites: Facebook, LinkedIn, and Twitter. • The power and sophistication come from multiple, overlapping graphs of the same nodes. • For example: imagine the connecting edges here showing "friends" on Facebook; imagine another graph with the same people that connects business colleagues via LinkedIn; imagine a third graph based on movie interests on Netflix. • Overlapping the three different-looking graphs makes more interesting questions possible.
Graph-based or Network Data • Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL. • Graph data poses its challenges, but for a computer interpreting audio and image data, it can be even more difficult.
Audio, Image, and Video • Audio, image, and video are data types that pose specific challenges to a data scientist. • Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers. • MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics. • High-speed cameras at stadiums capture ball and athlete movements to calculate, in real time, for example, the path taken by a defender relative to two baselines.
Audio, Image, and Video • Recently a company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games. • This algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning. • It's a remarkable feat that prompted Google to buy the company for its own Artificial Intelligence (AI) development plans. • The learning algorithm takes in data as it's produced by the computer game; it's streaming data.
Streaming Data • Streaming data can take almost any of the previous forms, but it has an extra property: the data flows into the system when an event happens, instead of being loaded into a data store in a batch. • Although this isn't really a different type of data, we treat it here as such because you need to adapt your process to deal with this type of information. • Examples: "what's trending" on Twitter, live sporting or music events, and the stock market.
Data Science Process • The first step of this process is setting a research goal. The main purpose is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this results in a project charter. • The second phase is data retrieval: ensuring data availability. To have data available for analysis, this step includes finding suitable data and getting access to it from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.
Data Science Process • Data preparation: once you have the raw data, it's time to prepare it. • This includes transforming the data from a raw form into data that's directly usable in your models. • To achieve this, you'll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. • If you have successfully completed this step, you can progress to data visualization and modeling.
Data Science Process • The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data: "look for patterns, correlations, and deviations based on visual and descriptive techniques." The insights you gain from this phase will enable you to start modeling. • Finally: model building (also called "data modeling"). It is now that you attempt to gain the insights or make the predictions stated in your project charter.
Data Science Process • Now is the time to bring out the heavy guns, but remember: research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. • If you've done this phase right, you're almost done.
Data Science Process • The last step of the data science model is presenting your results and automating the analysis, if needed. • One goal of a project is to change a process and/or make better decisions. • You may still need to convince the business that your findings will indeed change the business process as expected. • This is where you can shine in your influencer role. • The importance of this step is more apparent in projects on a strategic and tactical level. • Certain projects require you to perform the business process over and over again, so automating the project will save time.
Setting the Research Goal • Data science is mostly applied in the context of an organization. • When the business asks you to perform a data science project, you'll first prepare a project charter. • This charter contains information such as: what you're going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables. • Define the research goal. • Create the project charter.
Retrieving Data • The second step is to collect data. • You've stated in the project charter which data you need and where you can find it. • In this step you ensure that you can use the data in your program, which means checking the existence of, quality of, and access to the data. • Data can also be delivered by third-party companies and takes many forms, ranging from Excel spreadsheets to different types of databases. • Internal data • Data retrieval • Data ownership • External data
Data Preparation • Data collection is an error-prone process. • In this phase you enhance the quality of the data and prepare it for use in subsequent steps. • This phase consists of three subphases: • Data cleansing removes false values from a data source and inconsistencies across data sources, • Data integration enriches data sources by combining information from multiple data sources, and • Data transformation ensures that the data is in a suitable format for use in your models.
Data Exploration • Data exploration is concerned with building a deeper understanding of your data. • You try to understand how variables interact with each other, the distribution of the data, and whether there are outliers. • To achieve this we mainly use descriptive statistics, visual techniques, and simple modeling. • This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
Data Modeling or Model Building • In this phase you use models, domain knowledge, and insights about the data you found in the previous steps to answer the research question. • You select a technique from the fields of statistics, machine learning, operations research, and so on. • Building a model is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics.
Presentation and Automation • Finally, you present the results to your business. • These results can take many forms, ranging from presentations to research reports. • You'll need to automate the execution of the process because the business will want to use the insights you gained in another project or enable an operational process to use the outcome from your model.
The Science of Statistics
What Is Statistics? 1. Collecting data, e.g., surveys 2. Presenting data, e.g., charts & tables 3. Characterizing data, e.g., averages
What Is Statistics? • Statistics is the science of data. • It involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.
Types of Statistical Applications in Business
Application Areas • Economics: forecasting, demographics • Sports: individual & team performance • Engineering: construction, materials • Business: consumer preferences, financial trends
Statistics: Two Processes • Describing sets of data • Drawing conclusions (making estimates, decisions, predictions, etc. about sets of data based on sampling)
Descriptive Statistics • Methods for summarizing and organizing the important features of a dataset. • It provides simple, quantitative descriptions of the main characteristics of data, either for a population or a sample, through numbers (e.g., a mean of 30.5 or a variance of 113), tables, graphs, and charts. 1. Involves: collecting data, presenting data, characterizing data 2. Purpose: describe data
Descriptive Statistics • Describing data with tables and graphs • Describing data with averages • Describing variability • Normal distributions and standard (z) scores • Describing relationships: correlation • Regression
Inferential Statistics • Methods that use sample data to draw conclusions or inferences about a larger population, along with a measure of uncertainty or reliability. • Involves: estimation, hypothesis testing • Purpose: make decisions about population characteristics
Fundamental Elements 1. Experimental unit: the object upon which we collect data 2. Population: all items of interest 3. Variable: a characteristic of an individual experimental unit 4. Sample: a subset of the units of a population • A parameter is a summary measure (e.g., a mean) computed to describe a characteristic of the population; a statistic is a summary measure computed to describe a characteristic of the sample. Remember: P as in Population and Parameter, S as in Sample and Statistic.
Fundamental Elements 1. Statistical inference: an estimate, prediction, or generalization about a population based on information contained in a sample. 2. Measure of reliability: a statement (usually qualified) about the degree of uncertainty associated with a statistical inference.
Four Elements of Descriptive Statistical Problems 1. The population or sample of interest 2. One or more variables (characteristics of the population or sample units) that are to be investigated 3. Tables, graphs, or numerical summary tools 4. Identification of patterns in the data
Five Elements of Inferential Statistical Problems 1. The population of interest 2. One or more variables (characteristics of the population units) that are to be investigated 3. The sample of population units 4. The inference about the population based on information contained in the sample 5. A measure of reliability for the inference
Descriptive vs. Inferential Statistical Problems
1. Scope of study. Descriptive: population or sample of interest. Inferential: entire population of interest.
2. Variables studied. Descriptive: one or more variables (characteristics of the population/sample units). Inferential: one or more variables (characteristics of the population units).
3. Data source. Descriptive: uses the entire dataset available (sample or population) for summarization. Inferential: uses a sample from the population.
4. Method of analysis. Descriptive: tables, graphs, or numerical summaries to describe the data. Inferential: inference or generalization about the population based on the sample.
5. Uncertainty measure. Descriptive: not applicable (no inference, so no measure of reliability needed). Inferential: includes a measure of reliability (e.g., confidence level, margin of error, p-value).
6. Goal. Descriptive: summarize and identify patterns in the observed data. Inferential: draw conclusions or predictions about the population from sample data.
Types of Data • Data: a collection of actual observations or scores in a survey or an experiment. • Types: qualitative data, ranked data, quantitative data.
THREE TYPES OF DATA • Data: a collection of actual observations or scores in a survey or an experiment. • Qualitative data: a set of observations where any single observation is a word, letter, or numerical code that represents a class or category. • Ranked data: a set of observations where any single observation is a number that indicates relative standing. • Quantitative data: a set of observations where any single observation is a number that represents an amount or a count. • Any statistical analysis is performed on data, a collection of actual observations or scores in a survey or an experiment.
Types of Data • Qualitative data: a set of observations where any single observation is a word, letter, or numerical code that represents a class or category. {(Yes or No), (Y or N), (0 or 1)} • Ranked data: a set of observations where any single observation is a number that indicates relative standing within a group. {1st, 2nd, 3rd, ..., 40th} • Quantitative data: a set of observations where any single observation is a number that represents an amount or a count. {weights of 238, 170, ..., 185 lbs}
How to Determine the Type of Data • To determine the type of data, focus on a single observation in any collection of observations. • Example: the weights reported by 53 male students.
Indicate whether each of the following terms is qualitative (because it's a word, letter, or numerical code representing a class or category); ranked (because it's a number representing relative standing); or quantitative (because it's a number representing an amount or a count). (a) Ethnic group (b) Age (c) Family size (d) Academic major (e) Sexual preference (f) IQ score (g) Net worth (dollars) (h) Third-place finish (i) Gender (j) Temperature
Answers: (a) Ethnic group: qualitative (b) Age: quantitative (c) Family size: quantitative (d) Academic major: qualitative (e) Sexual preference: qualitative (f) IQ score: quantitative (g) Net worth (dollars): quantitative (h) Third-place finish: ranked (i) Gender: qualitative (j) Temperature: quantitative
LEVELS OF MEASUREMENT • Level of measurement specifies the extent to which a number (or word or letter) actually represents some attribute and, therefore, has implications for the appropriateness of various arithmetic operations and statistical procedures. • There are three levels of measurement: nominal, ordinal, and interval/ratio, paired with qualitative, ranked, and quantitative data, respectively. • Measurement of nonphysical characteristics, e.g., IQ. • Qualitative data and nominal measurement: the single property of nominal measurement is classification, that is, sorting observations into different classes or categories. • Ranked data and ordinal measurement: the distinctive property of ordinal measurement is order; the relative standing of ranked data reflects differences in degree based on order. • Quantitative data and interval/ratio measurement: often the products of familiar measuring devices, such as rulers, clocks, or meters; the distinctive properties of interval/ratio measurement are equal intervals and a true zero.
Indicate the level of measurement (nominal, ordinal, or interval/ratio) attained by the following sets of observations or data. When appropriate, indicate that measurement is only approximately interval. (a) Height (b) Religious affiliation (c) Score for psychopathic tendency (d) Years of education (e) Military rank (f) Vocational goal (g) GPA (h) Marital status
Answers: (a) Height: interval/ratio (b) Religious affiliation: nominal (c) Score for psychopathic tendency: approximately interval (d) Years of education: interval/ratio (e) Military rank: ordinal (f) Vocational goal: nominal (g) GPA: approximately interval (h) Marital status: nominal
Describing Data with Tables and Graphs • Frequency: the number of times a data item occurs in the series; it deals with how frequent a data item is in the series. • Example: if the weight of 5 students in a class is exactly 65 kg, then the frequency of the data item 65 kg is 5.
Frequency Distributions for Quantitative Data • A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class. • A frequency distribution provides the number of occurrences (frequency) of the distinct values distributed within a given period or interval, in a list, table, or graphical representation. • Graphic presentation is another way of presenting data and information.
Frequency Distributions • Usually, graphs are used to present time series and frequency distributions. • A frequency distribution helps us detect any pattern in the data (assuming a pattern exists) by superimposing some order on the inevitable variability among observations.
Frequency Distribution • Many times, it is not easy or feasible to find the frequency of data from a very large dataset. • To make sense of the data we make a frequency table and graphs. • Let us take the example of the heights of ten students in cm: 139, 145, 150, 145, 136, 150, 152, 144, 138, 138.
Guidelines for Frequency Distributions (Essential) 1. Each observation should be included in one, and only one, class. Example: use classes like 130–139, 140–149, 150–159, etc. Don't use overlapping classes like 130–140, 140–150, 150–160. 2. Include all classes, even if no data falls in them. Example: if there's no data in the class 210–219, still list it with a frequency of zero. 3. Make sure all class intervals are the same size. Example: use 130–139, 140–149, 150–159, etc. Don't mix sizes like 130–139 and 140–159.
Guidelines for Frequency Distributions (Optional) 4. All classes should have both an upper boundary and a lower boundary. Example: 240–249. Less preferred would be 240–above, in which no maximum value can be assigned to observations in this class. 5. Select the class interval from convenient numbers, such as 1, 2, 3, ..., 10, particularly 5 and 10 or multiples of 5 and 10. Example: 130–139, 140–149, in which the class interval of 10 is a convenient number. Less preferred would be 130–142, 143–155, etc., in which the class interval of 13 is not a convenient number.
6. The lower boundary of each class interval should be a multiple of the class interval. Example: if the class interval is 10, use 130–139, 140–149, etc. (130 and 140 are multiples of 10). Not preferred: 135–144, 145–154, etc., because 135 and 145 are not multiples of 10. 7. Aim for a total of approximately 10 classes. Example: a distribution with 12 classes is okay. Not preferred: 24 classes (too many; makes the table too detailed) or 3 classes (too few; gives very little information).
    How many Classes? • The seventh guideline requires a few more comments. • Try to use around 10 classes to summarize the data clearly. • Too many classes (like 24 classes with a small interval of 5) can make the table too detailed and hard to understand. It defeats the purpose of summarizing the data in a simple way. • Too few classes (like just 3 classes with a wide interval of 50) can hide important patterns in the data. • Aim for a balance—not too many and not too few—to show patterns clearly while keeping it easy to read.
Gaps between Classes • Unit of measurement: the smallest possible difference between scores. • In well-constructed frequency tables, the gaps between classes, such as between 149 and 150, show that each observation or score has been assigned to one, and only one, class. • The size of the gap should always equal one unit of measurement, i.e., the smallest possible difference between scores within a particular set of data. • Since the gap is never bigger than one unit of measurement, no score can fall into the gap.
Examples • If weights are measured in kilograms to one decimal place, the unit of measurement is 0.1 kg. • If age is measured in whole years, the unit is 1 year. • For temperature measured in Celsius, if values are recorded like 36.5°C, then the unit is 0.1°C.
Real Limits of Class Intervals • Real limits are used to find the actual width of a class interval, ensuring that there are no gaps between adjacent classes. • How to find real limits: lower real limit = lower class boundary minus half of the unit; upper real limit = upper class boundary plus half of the unit. • Example: for the class 140–149 with a unit of measurement of 1, the lower real limit = 140 − 0.5 = 139.5, the upper real limit = 149 + 0.5 = 149.5, and the actual width = 149.5 − 139.5 = 10. • Real limits remove small gaps between classes and help in accurate statistical calculations such as drawing histograms or finding the midpoint and class width.
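Since this course uses Python, here is a minimal sketch of the rule above; the helper name real_limits is ours, introduced only for illustration.

    def real_limits(lower, upper, unit=1):
        """Return the real (true) limits of a stated class interval.

        lower, upper: stated class boundaries, e.g. 140 and 149.
        unit: smallest possible difference between scores (1 for whole numbers).
        """
        half = unit / 2
        return lower - half, upper + half

    lo, hi = real_limits(140, 149)   # (139.5, 149.5)
    print(lo, hi, hi - lo)           # 139.5 149.5 10.0 -> actual class width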
Constructing Frequency Distributions 1. Find the range, that is, the difference between the largest and smallest observations. The range of weights in Table 1.1 is 245 − 133 = 112. 2. Find the class interval required to span the range by dividing the range by the desired number of classes (ordinarily 10). Choose a simple class interval, like 5 or 10; here, 10 is a good choice. Start the first class at a number that's a multiple of the interval. Example: the smallest value is 133, so start at 130 (a multiple of 10).
Constructing Frequency Distributions • Find the end of the first class by adding the class interval and subtracting 1. Example: 130 + 10 = 140, then 140 − 1 = 139, so the first class is 130–139. • Keep listing classes (like 140–149, 150–159, etc.) until the last class includes the largest value (245 in this case, so end at 240–249). • Use tally marks to count how many values fall into each class. Example: if the value is 160, put a tally next to 160–169. • Replace tally marks with numbers to show the frequency (how many values are in each class). Add up all frequencies to get the total. • Add clear column headings (like "Class Interval" and "Frequency") and give your table a title.
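The tallying steps above map directly onto pandas, one of the wrangling libraries covered later in this course. The sketch below is illustrative rather than from the textbook; the scores list is a hypothetical sample, and the bin logic simply follows the guidelines (start at a multiple of the interval, end one class past the maximum).

    import pandas as pd

    scores = [133, 160, 168, 170, 150, 165, 193, 245, 179, 226, 180, 152]

    interval = 10
    start = (min(scores) // interval) * interval        # 130, a multiple of 10
    stop = ((max(scores) // interval) + 1) * interval   # one class past the max
    edges = list(range(start, stop + interval, interval))

    # pd.cut sorts each observation into exactly one class (guideline 1);
    # right=False makes a bin like [130, 140) match the stated class 130-139.
    classes = pd.cut(scores, bins=edges, right=False)
    freq = pd.Series(classes).value_counts().sort_index()
    print(freq)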
OUTLIERS • A very extreme score. • Outliers are data points that are far from other data points; in other words, they're unusual values in a dataset. • Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.
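The slides define outliers but do not prescribe a detection rule. A common convention (our assumption here, not a method from the textbook) is to flag values more than 1.5 interquartile ranges beyond the quartiles; the $25,700 summer income flagged in the exercise that follows stands out under this rule as well.

    import numpy as np

    def iqr_outliers(data, k=1.5):
        """Flag values beyond k * IQR from the quartiles (a rule of thumb)."""
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return [x for x in data if x < lo or x > hi]

    # Hypothetical summer incomes for nine students, one of them extreme
    incomes = [3200, 4100, 2900, 3800, 3500, 4000, 3600, 3300, 25700]
    print(iqr_outliers(incomes))   # [25700]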
Here are some of the more common causes of outliers in datasets: • Human error while manually entering data, such as a typo. • Intentional errors, such as dummy outliers included in a dataset to test detection methods. • Sampling errors that arise from extracting or mixing data from inaccurate or varied sources. • Data processing errors that arise from data manipulation or unintended mutations of a dataset. • Measurement errors as a result of instrumental error. • Experimental errors, from the data extraction process or experiment planning or execution. • Natural outliers, which occur "naturally" in the dataset, as opposed to resulting from one of the errors listed above; such naturally occurring outliers are known as novelties.
• Identify any outliers in each of the following sets of data collected from nine college students. • Answers: the outliers are a summer income of $25,700, an age of 61, and a family size of 18; there are no outliers for GPA.
Types of Frequency Distribution • Grouped frequency distribution • Ungrouped frequency distribution • Cumulative frequency distribution • Relative frequency distribution • Relative cumulative frequency distribution
Grouped Frequency Distribution • A frequency distribution produced whenever observations are sorted into classes of more than one value. • The data is arranged and separated into groups called class intervals. • The frequency of data belonging to each class interval is noted in a frequency distribution table. • The grouped frequency table shows the distribution of frequencies across class intervals.
Example • Marks obtained by 20 students in a test are as follows: 5, 10, 20, 15, 5, 20, 20, 15, 15, 15, 10, 10, 10, 20, 15, 5, 18, 18, 18, 18. • To arrange the data in a grouped table we have to make class intervals, such as 0–5, 6–10, and so on. • One column holds the class intervals (marks obtained in the test) and the second column holds the frequency (number of students).
Example

Marks obtained in test (class interval)   No. of students (frequency)
0–5                                        3
6–10                                       4
11–15                                      5
16–20                                      8
Total                                      20
Exercise • Construct a frequency distribution table for the IQ scores of a group of 35 high school dropouts.
Ungrouped Frequency Distribution • It shows the frequency of each separate data value rather than of groups of data values. • A frequency distribution produced whenever observations are sorted into classes of single values. • In an ungrouped frequency distribution table, data are not organized into class intervals; instead, the exact frequency of each individual data value is recorded. • The table shows two columns: one with the marks obtained in the test and the second with the frequency (number of students).
Example: Ungrouped Frequency Distribution • Marks obtained by 20 students in the test: 5, 10, 20, 15, 5, 20, 20, 15, 15, 15, 10, 10, 10, 20, 15, 5, 18, 18, 18, 18.

Marks obtained in test   No. of students
5                         3
10                        4
15                        5
18                        4
20                        4
Total                     20
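An ungrouped distribution is just a tally of each distinct value, which Python's collections.Counter does directly. A minimal sketch reproducing the table above:

    from collections import Counter

    # Test marks from the example above
    marks = [5, 10, 20, 15, 5, 20, 20, 15, 15, 15,
             10, 10, 10, 20, 15, 5, 18, 18, 18, 18]

    # Each distinct value becomes its own class (no class intervals)
    freq = Counter(marks)
    for value in sorted(freq):
        print(value, freq[value])
    # 5 3 / 10 4 / 15 5 / 18 4 / 20 4 -- matches the table above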
Exercise • Construct an ungrouped frequency distribution for the ratings given by students in a theater arts appreciation class to the classic film The Wizard of Oz on a 10-point scale, ranging from 1 (poor) to 10 (excellent).
RELATIVE FREQUENCY DISTRIBUTIONS • An important variation of the frequency distribution is the relative frequency distribution. • Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution. • Instead of raw counts (frequencies), it presents proportions or percentages.
Constructing Relative Frequency Distributions • To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency for the entire distribution.
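In Python this conversion is a single division. The sketch below uses the class frequencies from the table shown two slides ahead (total = 55) and also adds the percentage column discussed on the next slide:

    import pandas as pd

    freq = pd.Series({'0-10': 5, '11-20': 8, '21-30': 12,
                      '31-40': 20, '41-50': 10})

    rel = freq / freq.sum()   # proportions, each between 0 and 1
    pct = rel * 100           # percentages, between 0 and 100

    table = pd.DataFrame({'f': freq,
                          'proportion': rel.round(4),
                          'percent': pct.round(2)})
    print(table)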
Percentages or Proportions? • Sometimes we prefer to deal with percentages rather than proportions because percentages usually lack decimal points. • A proportion always varies between 0 and 1, whereas a percentage always varies between 0 percent and 100 percent. • To convert the relative frequencies in Table 2.5 from proportions to percentages, multiply each proportion by 100; that is, move the decimal point two places to the right. • For example, multiply .06 (the proportion for the class 130–139) by 100 to obtain 6 percent.
Relative Frequency Distributions with Percentages

Class Interval   Frequency   Relative Frequency   Relative Frequency (%)
0–10             5           5/55 = 0.0909        9.09%
11–20            8           8/55 = 0.1455        14.55%
21–30            12          12/55 = 0.2182       21.82%
31–40            20          20/55 = 0.3636       36.36%
41–50            10          10/55 = 0.1818       18.18%
Relative Frequency Distributions • Suppose a store sells 55 shirts in total, 20 of which are blue: proportion = 20/55 = 0.3636. • The manager says, "36.36% of the shirts sold are blue": that is the percentage. • Both tell the same story, one in numbers, one in everyday language.
CUMULATIVE FREQUENCY DISTRIBUTIONS • Cumulative frequency distributions show the total number of observations in each class and in all lower-ranked classes. • This type of distribution can be used effectively with sets of scores, such as test scores for intellectual or academic aptitude, when relative standing within the distribution assumes primary importance. • Under these circumstances, cumulative frequencies are usually converted, in turn, to cumulative percentages. • Cumulative percentages are often referred to as percentile ranks.
Constructing Cumulative Frequency Distributions • To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of each class the sum of the frequencies of all classes ranked below it. • This gives the cumulative frequency for that class. • Begin with the lowest-ranked class in the frequency distribution and work upward, finding the cumulative frequencies in ascending order.
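With pandas, the "add all lower-ranked classes" step is a cumulative sum. The sketch below uses the weight distribution from the following slides (class frequencies reconstructed from the stated cumulative values; total = 53) and also computes the cumulative percentages, i.e., percentile ranks:

    import pandas as pd

    freq = pd.Series({'130-139': 3, '140-149': 1, '150-159': 17,
                      '160-169': 12, '170-179': 7, '180-189': 3,
                      '190-199': 4, '200-209': 2, '210-219': 0,
                      '220-229': 3, '230-239': 0, '240-249': 1})

    cum_f = freq.cumsum()                             # add all lower-ranked classes
    cum_pct = (cum_f / freq.sum() * 100).round(1)     # percentile ranks
    print(pd.DataFrame({'f': freq, 'cum f': cum_f, 'cum %': cum_pct}))
    # e.g. 170-179 -> cum f = 40, cum % = 75.5 (the "75 percent" cited below)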
Constructing Cumulative Frequency Distributions • Example data (weights in lbs): 160, 168, 133, 170, 150, 165, 158, 165, 193, 169, 245, 160, 152, 190, 179, 157, 226, 160, 170, 180, 150, 156, 190, 156, 157, 163, 152, 158, 225, 135, 165, 135, 180, 172, 160, 170, 145, 185, 152, 154, 205, 151, 220, 166, 159, 156, 165, 157, 190, 206, 172, 175.
Cumulative Frequency Distributions • The cumulative frequency for the class 140–149 is 4, since 1 is the frequency for that class and 3 is the frequency of all lower-ranked classes. • The cumulative frequency for the class 150–159 is 21, since 17 is the frequency for that class and 4 is the sum of the frequencies of all lower-ranked classes.
Cumulative Percentages • If relative standing within a distribution is particularly important, then cumulative frequencies are converted to cumulative percentages. • A glance at Table 2.6 reveals that 75 percent of all weights are the same as or lighter than the weights between 170 and 179 lbs. • To obtain this cumulative percentage (75%), the cumulative frequency of 40 for the class 170–179 is divided by the total frequency of 53 for the entire distribution.
Exercise: Cumulative Frequency Distribution • Construct a cumulative frequency distribution for the given scores: 55, 62, 47, 88, 90, 78, 84, 67, 74, 92, 58, 80, 63, 70, 49, 86, 73, 91, 75, 61, 60, 85, 77, 69, 64, 53, 71, 83, 79, 66, 68, 76, 82, 87, 59, 52, 65, 81, 50, 46, 45, 89, 48, 72, 54, 56, 62, 57, 93, 95
GRAPHS • Data can be described clearly and concisely with the aid of a well-constructed frequency distribution. • Data can often be described even more vividly, particularly when attempting to communicate with a general audience, by converting frequency distributions into graphs.
CONSTRUCTING GRAPHS • Choose the right graph: use histograms or frequency polygons for quantitative data; use bar graphs for qualitative data or discrete quantitative data. • Draw axes: start with the horizontal (x) axis, then the vertical (y) axis; make the height of the vertical axis roughly equal to the width of the horizontal axis. • Set up class intervals: for qualitative or ungrouped data, use the categories from the data; for grouped data, create class intervals as you would for a frequency distribution.
• Mark class intervals on the horizontal axis: for bar graphs, leave gaps between intervals; for histograms and frequency polygons, leave no gaps (spacing may need trial and error, so use a pencil). If there's a large gap from 0 to the first class, use a wiggly line to show a break in scale. Avoid clutter: label only a few key points on the axis. • Mark frequencies on the vertical axis: start from 0 and go up to a value equal to or just above the highest frequency. If the smallest frequency is far from 0, use a wiggly line to show a break in the scale. Use simple, evenly spaced numbers for clarity.
• Plot the data: draw bars for bar graphs and histograms to show frequencies; for frequency polygons, place dots above the midpoints of each class and connect them with lines, being sure to anchor both ends of the polygon to the horizontal axis. • Add labels and a title: label both the horizontal and vertical axes clearly, and add a title or a brief explanation of what the graph shows.
GRAPHS • The most common types of graphs for quantitative and qualitative data: • Histogram • Frequency polygon • Bar graph
GRAPHS FOR QUANTITATIVE DATA • Histograms: a bar-type graph for quantitative data. • The common boundaries between adjacent bars emphasize the continuity of the data, as with continuous variables.
Features of Histograms • The horizontal axis (x-axis) shows the class intervals of the data using equal spacing. • The vertical axis (y-axis) shows the frequencies, also with equal spacing. (The spacing on the vertical axis doesn't have to match that of the horizontal axis.) • The point where the two axes meet is the origin, where both values are 0. • Numbers on the horizontal axis increase left to right, and on the vertical axis, bottom to top. • If there's a big gap between 0 and the first class (like 130–139), it's good to show a wiggly line on the axis to mark a break in scale.
Features of Histograms • A histogram is made up of bars, where each bar's height shows the frequency of that class. • Histogram bars touch each other to show that the data is continuous. • If you add gaps between bars, it may wrongly suggest the data is discrete or categorical.
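Unit V of this course introduces Matplotlib, so a minimal histogram sketch fits here. The bin edges are placed at the real limits (129.5, 139.5, ...) so that adjacent bars touch, as described above; the weights list is only an illustrative subset of the example data.

    import matplotlib.pyplot as plt

    weights = [160, 168, 133, 170, 150, 165, 158, 165, 193, 169,
               245, 160, 152, 190, 179, 157, 226, 160, 170, 180]

    # Edges at the real limits so bars share boundaries (no gaps)
    edges = [129.5 + 10 * i for i in range(13)]   # 129.5 ... 249.5

    plt.hist(weights, bins=edges, edgecolor='black')
    plt.xlabel('Weight (lbs)')
    plt.ylabel('Frequency')
    plt.title('Histogram of weights')
    plt.show()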
Frequency Polygon • A line graph for quantitative data that also emphasizes the continuity of continuous variables. • An important variation on a histogram is the frequency polygon, or line graph. • Frequency polygons may be constructed directly from frequency distributions. • Follow the step-by-step transformation of a histogram into a frequency polygon, as described in panels A, B, C, and D of Figure 2.2.
Steps for Construction of a Frequency Polygon • The panel shows a histogram of the weight distribution. To create a frequency polygon: • Place a dot at the midpoint of the top of each bar (or at the midpoint of each class if there are no bars). • To find a midpoint, add the class limits and divide by 2. Example: (160 + 169)/2 = 164.5. • Connect all the dots with straight lines.
• To anchor the polygon: extend the right end to the midpoint of the first empty class on that side (e.g., 250–259), and extend the left end to the midpoint of the first empty class on the other side (e.g., 120–129). • Finally, remove all the histogram bars, so only the frequency polygon remains. • Frequency polygons are useful for comparing two or more frequency distributions or relative frequency distributions on the same graph.
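A frequency polygon is simply a line through the (midpoint, frequency) points, anchored to zero at the empty classes on each side. A minimal Matplotlib sketch using the weight distribution discussed earlier (frequencies reconstructed from that example):

    import matplotlib.pyplot as plt

    # Midpoints from the anchor class 120-129 through 250-259
    midpoints = [124.5] + [134.5 + 10 * i for i in range(12)] + [254.5]
    freqs = [0, 3, 1, 17, 12, 7, 3, 4, 2, 0, 3, 0, 1, 0]  # zeros anchor the ends

    plt.plot(midpoints, freqs, marker='o')
    plt.xlabel('Weight (lbs)')
    plt.ylabel('Frequency')
    plt.title('Frequency polygon of weights')
    plt.show()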
Exercise • The following frequency distribution shows the annual incomes in dollars for a group of college graduates. (a) Construct a histogram. (b) Construct a frequency polygon. (c) Is this distribution balanced or lopsided?
Stem and Leaf Displays • Another technique for summarizing quantitative data is a stem and leaf display. • A stem and leaf display is a simple way to organize and visualize quantitative data to see its distribution. • It helps in identifying the shape of the data, such as whether it's symmetric, skewed, or has clusters.
Constructing a Stem and Leaf Display • Each number is split into two parts: the stem (the leading digit or digits) and the leaf (the last digit). • For example, for the number 47: stem = 4, leaf = 7. • The stems are listed in a vertical column, and the leaves (remaining digits) are listed to the right of each stem.
Stem and Leaf Displays • Example: let's say we have the following scores: 43, 46, 48, 51, 52, 54, 54, 56, 58, 61. We organize them as follows:

Stem | Leaf
   4 | 3 6 8
   5 | 1 2 4 4 6 8
   6 | 1
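A basic stem and leaf display is easy to build by splitting each score with integer division and modulo. A minimal sketch reproducing the display above (for two-digit scores only):

    from collections import defaultdict

    scores = [43, 46, 48, 51, 52, 54, 54, 56, 58, 61]

    # Split each score into a stem (tens digit) and a leaf (units digit)
    stems = defaultdict(list)
    for s in sorted(scores):
        stems[s // 10].append(s % 10)

    for stem in sorted(stems):
        print(stem, '|', ' '.join(str(leaf) for leaf in stems[stem]))
    # 4 | 3 6 8
    # 5 | 1 2 4 4 6 8
    # 6 | 1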
Selection of Stems • Stem values are not limited to units of 10. • Rule of thumb: keep the stem long enough to show variation, but short enough to group meaningfully. • For 2-digit numbers, use the tens digit as the stem and the units digit as the leaf: 57 → stem = 5, leaf = 7. • For 3-digit numbers, use the hundreds and tens digits as the stem: 248 → stem = 24, leaf = 8. • For 5-digit numbers, you might use the first 3 digits as the stem: 42188 → stem = 421, leaf = 88.
Exercise • Construct a stem and leaf display for the following IQ scores obtained from a group of four-year-old children.
TYPICAL SHAPES • Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important characteristic of a frequency distribution is its shape. • The figure shows some of the more typical shapes for smoothed frequency polygons (which ignore the inevitable irregularities of real data): • Normal • Bimodal • Positively skewed • Negatively skewed
Normal • Any distribution that looks like the bell-shaped curve (as shown in panel A) can be analyzed using the normal curve. • This normal distribution appears in many real-life cases, such as gestation periods of babies, standardized test scores, and popping times of popcorn kernels. • The normal curve helps us understand and interpret these kinds of data.
Bimodal • A bimodal distribution (like in panel B) often shows the presence of two different groups in the same data set. • For example, the age distribution in a neighborhood with mostly new parents and their babies would have two peaks, forming a bimodal shape.
Positively Skewed • A distribution that includes a few extreme observations in the positive direction (to the right of the majority of observations).
Negatively Skewed • A distribution that includes a few extreme observations in the negative direction (to the left of the majority of observations).
A GRAPH FOR QUALITATIVE (NOMINAL) DATA • Bar graph: a bar-type graph for qualitative data. • Gaps between adjacent bars emphasize the discontinuous nature of the data.
Describing Data with Averages • Averages (measures of central tendency) are values that represent a typical or central point in a data set. • Measures of central tendency: numbers or words that attempt to describe, most generally, the middle or typical value for a distribution.
Mode • The mode is the value of the most frequent score, i.e., the value that appears most frequently in a dataset. • Key properties: it can be used for qualitative and quantitative data, and a dataset can have no mode, one mode (unimodal), two modes (bimodal), or more (multimodal). • Example: data = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]. Count each value's frequency: 63 appears 4 times, more than any other value, so the mode is 63.
Median • The median is the middle value when all observations are sorted from smallest to largest. • Example: data = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]. Sorted: 45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70. The number of items is 11 (odd), so the median is the 6th value: 63. • Example: data = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]. Sorted: [26.3, 26.6, 26.9, 27.4, 27.4, 28.7]. The middle two values are 26.9 and 27.4, so the median = (26.9 + 27.4)/2 = 27.15.
Mean • The mean is usually referred to as "the average". • Mean for ungrouped data: the mean is the sum of all the values in the data divided by the total number of values. Mean = sum of all observations ÷ total number of observations. Example: for 40, 21, 55, 31, 48, 13, 72, the mean = (40 + 21 + 55 + 31 + 48 + 13 + 72)/7 = 280/7 = 40. • Mean for grouped data: the mean is the sum of the products of the observations (xi) and their corresponding frequencies (fi), divided by the sum of all the frequencies (fi). Example, with x = 4, 6, 15, 10, 9 and f = 5, 10, 8, 7, 10: Mean = (4×5 + 6×10 + 15×8 + 10×7 + 9×10) ÷ (5 + 10 + 8 + 7 + 10) = (20 + 60 + 120 + 70 + 90) ÷ 40 = 360 ÷ 40 = 9.
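Python's built-in statistics module computes all three averages directly, and the grouped mean is a frequency-weighted sum. A minimal sketch using the examples above:

    import statistics

    data = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

    print(statistics.mode(data))    # 63, the most frequent score
    print(statistics.median(data))  # 63, the middle value once sorted
    print(statistics.mean(data))    # arithmetic mean of all 11 scores

    # Mean for grouped data: each value weighted by its frequency
    x = [4, 6, 15, 10, 9]
    f = [5, 10, 8, 7, 10]
    grouped_mean = sum(xi * fi for xi, fi in zip(x, f)) / sum(f)
    print(grouped_mean)             # 9.0, matching the worked example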
