Fundamentals of Data science Introduction Unit 1

Fundamentals of Data Science
Unit 1
Prepared By
Dr.P.Sasikumar
Associate Professor, AIML Dept.

Unit # 01
Introduction: What is Data Science?
• Big Data and Data Science hype
• Getting past the hype
• Why now?
• Datafication
• Current landscape of perspectives
• Data Science Jobs
• What is a Data Scientist?
-In Academia
-In Industry

Basic Terminologies
• Data
• It can be
-generated
-collected
-retrieved.
Simulation
Similarity Measures
Data Structures
Algorithms

• Data: facts with no meanings.
• Information: learning from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: the quality of having experience and good judgment; ability to think and foresee.
• Validity: ways to confirm truth.

• Cross-sectional data: observations with no time dimension.
• Temporal data: observations ordered in time (time series).
• Spatial data: takes location into account, e.g. the coordinates determined by a touchscreen phone.
• Spatio-temporal data (GIS): takes both location and change over time into account, e.g. population density.
Scales of Measurement
There are four scales of measurement:
• Nominal: classifies data into categories with no order, e.g. male/female.
• Ordinal: places data in order; values can be numerical or non-numerical, e.g. time of day (dawn, morning, noon, afternoon, evening, night).
• Interval: ordered values with equal spacing between them but no true zero, e.g. temperature in degrees Celsius.
• Ratio: interval data with a true zero, so ratios are meaningful, e.g. weight, height, number of children.
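The four scales above can be sketched in R, the language introduced later in this unit. This is only an illustration: the variable names and values are made up, and the mapping of scales to factor() and plain numeric vectors is one reasonable choice, not the only one.

```r
# Nominal: categories with no order (unordered factor)
sex <- factor(c("male", "female", "female"))

# Ordinal: categories with an order (ordered factor)
day_levels <- c("dawn", "morning", "noon", "afternoon", "evening", "night")
time_of_day <- factor(c("noon", "dawn", "night"),
                      levels = day_levels, ordered = TRUE)

# Ordered factors support comparisons; unordered factors do not
noon_after_dawn <- time_of_day[1] > time_of_day[2]

# Interval: differences are meaningful, ratios are not (no true zero)
temp_celsius <- c(10, 20)
temp_difference <- diff(temp_celsius)

# Ratio: a true zero exists, so "twice as heavy" is meaningful
weight_kg <- c(40, 80)
weight_ratio <- weight_kg[2] / weight_kg[1]
```

Comparing two values of sex with > would return NA with a warning, which is R's way of saying that an order is not defined for nominal data.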

Big Data and Data Science Hype:
Reasons to be skeptical about data science:
• Is data science only the stuff going on at companies like Google, Facebook, and other tech
companies?
• There is a distinct lack of respect for the researchers in academia and industry labs who
have been working on this kind of stuff for years, and whose work is based on decades of earlier work.
• The hype is crazy. In general, hype masks reality and increases the noise-to-signal ratio.
• Statisticians already feel that they are studying and working on the “Science of Data.”

Getting Past the Hype
• Rachel’s experience going from getting a PhD in statistics
to working at Google. In her words:

We have a couple replies to this:
• Sure, there is a difference between industry and academia. But does it really have to
be that way? Why do many courses in school have to be so intrinsically out of touch
with reality?
• Even so, the gap doesn’t represent simply a difference between industry statistics
and academic statistics.
• The general experience of data scientists is that, at their job, they have access to a
larger body of knowledge and methodology, as well as a process, which we now
define as the data science process, that has foundations in both statistics and
computer science.
Around all the hype, in other words, there is a ring of truth: this is something new.

Why Now?
• We have massive amounts of data about many aspects of our lives. What people might not
know is that the “datafication” of our offline behavior has started as well.
• On the Internet, this means Amazon recommendation systems, friend recommendations on
Facebook, film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and assessments
coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data.

Datafication
• In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”, in which they discuss the
concept of datafication.
• They define datafication as a process of “taking all aspects of life and
turning them into data.”
• They follow up their definition in the article with a line that speaks volumes about their
perspective:
“Once we datafy things, we can transform their purpose and turn the
information into new forms of value.”

Examples:
• We quantify friendships with “likes”.
• Twitter (X) datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be datafied.
• When we browse the Web, we are unintentionally datafied through cookies.
• When we walk around in a store, or even on the street, we are being datafied via
sensors, cameras, or Google Glass.
• The degree of intent ranges from taking part in a social media experiment to all-out
surveillance and stalking, but it is all datafication.

Current landscape of perspectives
• On Quora there is a discussion from 2010 about “What is Data Science?”, and here is Metamarkets CEO
Mike Driscoll’s answer:
“Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and
espresso-inspired statistics.”
• Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010.

• Nathan Yau’s 2009 post, “Rise of the Data Scientist”, lists these skills:
1. Statistics (traditional analysis you’re used to thinking about)
2. Data munging (parsing, scraping, and formatting data)
3. Visualization (graphs, tools, etc.)
• ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’ word”, in which
she defends statistics.
• DJ Patil and Jeff Hammerbacher, then at LinkedIn and Facebook respectively, coined the term “data scientist” in 2008.
• Wikipedia finally gained an entry on data science in 2012.

Data Science Jobs
• For three years running, data science has been dubbed “the best job in America.” According
to Stack Overflow, it is one of the highest paying jobs in the software sector.
• The GDPR increased companies’ reliance on data scientists because of the need for
real-time analytics and responsible data storage.
• There are 465 job openings in New York City alone for data scientists.
• LinkedIn recently picked data scientist as its most promising career of 2019. One of the
reasons it got the top spot was that the average salary for people in the role is $130,000.
• The January report from Indeed, one of the top job sites, showed a 29% increase in demand
for data scientists year over year and a 344% increase since 2013 -- a dramatic upswing. But
while demand -- in the form of job postings -- continues to rise sharply, searches by job
seekers skilled in data science grew at a slower pace (14%), suggesting a gap between supply
and demand.

The growth in data scientist job postings on Indeed, from December 2016 to December 2018

What Is a Data Scientist, Really?
Perhaps the most concrete approach is to define data science by its usage.
• In Academia
• An academic data scientist is a scientist, trained in anything from social science to biology, who works
with large amounts of data, and must deal with computational problems posed by the structure, size,
messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.
• In Industry
More generally, a data scientist is someone who knows how to:
• Design experiments.
• Collect, clean, and munge data.
• Understand biases in the data and debug logging output from code.
• Carry out exploratory data analysis, which combines visualization and data sense.
• Find patterns and build models and algorithms.
• Use analyses for decision making.

Data engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by
data scientists.
A data analyst is someone who merely curates meaningful insights from data.
A data scientist is a professional with the capabilities to gather large amounts of data to analyze
and synthesize the information into actionable plans for companies and other organizations.

Statistical Inference
• Statistical inference is the process of using a sample to infer the properties of a population.
Statistical procedures use sample data to estimate the characteristics of the whole population from
which the sample was drawn.
• When studying a phenomenon, such as the effects of a new medication or public opinion,
populations are usually too large to measure fully.
• Consequently, researchers must use a manageable subset of that population to learn about it.
• By using procedures that can make statistical inferences, you can estimate the properties and
processes of a population.
• More specifically, sample statistics can estimate population parameters.
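The last point can be illustrated with a small simulation. This is a sketch under stated assumptions: the "population" is simulated (100,000 normally distributed values with mean 50 and standard deviation 10), and the sample size of 200 is arbitrary.

```r
set.seed(123)  # make the simulation reproducible

# A simulated population; in practice it is usually too large to measure
population <- rnorm(100000, mean = 50, sd = 10)

# Draw a manageable random sample from it
smp <- sample(population, size = 200)

# Parameter (population mean) versus statistic (sample mean)
population_mean <- mean(population)
sample_mean <- mean(smp)

c(parameter = population_mean, statistic = sample_mean)
```

The sample mean will not match the population mean exactly; the gap is the sampling error that inferential methods try to account for.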

How to Make Statistical Inferences
• Process of making a statistical inference requires you to do the following:
• Draw a sample that adequately represents the population.
• Measure your variables of interest.
• Use appropriate statistical methodology to generalize your sample results to the population while
accounting for sampling error.
Common Inferential Methods
• Hypothesis Testing: Uses representative samples to assess two mutually exclusive hypotheses about a
population. Statistically significant results suggest that the sample effect or relationship exists in the
population after accounting for sampling error.
• Confidence Intervals: A range of values likely containing the population value. This procedure
evaluates the sampling error and adds a margin around the estimate, giving an idea of how wrong it
might be.
• Margin of Error: Comparable to a confidence interval but usually for survey results.
• Regression Modeling: An estimate of the process that generates the outcomes in the population.
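Two of the methods above, hypothesis testing and confidence intervals, can be sketched with base R's t.test(). The data here are simulated and the null value of 165 is an arbitrary assumption chosen for illustration.

```r
set.seed(1)

# A hypothetical sample of 50 measurements (e.g. heights in cm)
heights <- rnorm(50, mean = 170, sd = 10)

# One-sample t-test of H0: population mean = 165
result <- t.test(heights, mu = 165)

result$p.value    # small values count as evidence against H0
result$conf.int   # 95% confidence interval for the population mean

# The margin of error is half the width of the confidence interval
margin_of_error <- diff(result$conf.int) / 2
margin_of_error
```

By construction the interval is centred on the sample mean, so the sample mean plus or minus the margin of error reproduces it.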

Example Statistical Inference
• A real flu vaccine study gives an example of making a statistical inference.
Treatment   Flu count   Group size   Percent infections
Placebo     35          325          10.8%
Vaccine     28          813          3.4%
Effect                               7.4%
Study Findings
• From the table above, 10.8% of the unvaccinated got the flu, while only 3.4% of the vaccinated
caught it. The apparent effect of the vaccine is 10.8% – 3.4% = 7.4%
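The arithmetic in the table can be reproduced in R from the raw counts. One caveat: computed from the unrounded counts the difference is about 7.3 percentage points; the 7.4% above comes from subtracting the two rounded percentages. Using prop.test() to judge the difference is one possible choice, not necessarily the method used in the study.

```r
# Counts taken from the table above
flu_count  <- c(placebo = 35, vaccine = 28)
group_size <- c(placebo = 325, vaccine = 813)

# Infection rate in each group
rates <- flu_count / group_size
round(100 * rates, 1)

# Apparent effect of the vaccine, in percentage points
effect <- 100 * (rates[["placebo"]] - rates[["vaccine"]])
effect

# Two-sample test of equal proportions, accounting for sampling error
test_result <- prop.test(x = flu_count, n = group_size)
test_result$p.value
test_result$conf.int  # interval for the difference in proportions
```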

Population and Sample
• In statistics and quantitative methodology, data are collected and selected from a
statistical population using defined procedures. There are two types of data set:
population and sample.
Population
• A population includes all the elements of the data set. A measurable characteristic of the
population, such as the mean or standard deviation, is known as a parameter.
• For example, all people living in India make up the population of India.
There are different types of population. They are:
• Finite Population
• Infinite Population
• Existent Population
• Hypothetical Population
• Let us discuss all the types one by one.

Types
• Finite Population
A finite population, also known as a countable population, is one whose units can be
counted. In other words, it is a population made up of a finite number of individuals or objects.
For statistical analysis, a finite population is more advantageous than an infinite population.
Examples of finite populations are the employees of a company and the potential consumers in a market.
• Infinite Population
An infinite population, also known as an uncountable population, is one in which counting the
units is not possible. An example is the number of germs in a patient’s body.
• Existent Population
An existent population is a population of concrete individuals, that is, one whose units
exist in physical form. Examples are books and students.
• Hypothetical Population
A hypothetical population is one whose units do not exist in physical form. A population
consists of sets of observations or objects that all have something in common, and in some
situations the population is only hypothetical.
Examples are the outcomes of rolling a die or tossing a coin.

Differences between population and sample
Comparison        Population                                                                 Sample
Meaning           Collection of all the units or elements that possess common characteristics   A subgroup of the members of the population
Includes          Each and every element of a group                                          Only a handful of units of the population
Characteristics   Parameter                                                                  Statistic
Data Collection   Complete enumeration or census                                             Sampling or sample survey
Focus on          Identification of the characteristics                                      Making inferences about the population

Sample
• A sample includes one or more observations drawn from the population. A measurable
characteristic of a sample is a statistic.
• Sampling is the process of selecting the sample from the population.
• For example, some of the people living in India form a sample of the population of India.
Basically, there are two types of sampling. They are:
• Probability sampling
• Non-probability sampling

Probability Sampling
• In probability sampling, the population units cannot be selected at the discretion of the
researcher.
• Instead, defined procedures ensure that every unit of the population has a fixed, known
probability of being included in the sample.
• Such a method is also called random sampling.
• Some of the techniques used for probability sampling are:
-Simple random sampling
-Cluster sampling
-Multi-stage sampling
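Two of the probability sampling techniques listed above can be sketched with base R's sample(). The population here is a made-up data frame of 100 units spread over four hypothetical regions; the column names and sizes are assumptions for illustration only.

```r
set.seed(7)

# A hypothetical finite population: 100 units in 4 regions of 25 units each
population <- data.frame(
  id     = 1:100,
  region = rep(c("North", "South", "East", "West"), each = 25)
)

# Simple random sampling: every unit has the same chance of selection
srs_rows <- sample(nrow(population), size = 10)
simple_random_sample <- population[srs_rows, ]

# Cluster sampling: pick whole clusters (regions) at random,
# then keep every unit inside the chosen clusters
chosen_regions <- sample(unique(population$region), size = 2)
cluster_sample <- population[population$region %in% chosen_regions, ]

nrow(simple_random_sample)
table(cluster_sample$region)
```

Multi-stage sampling would combine the two ideas: choose clusters at random first, then draw a random sample within each chosen cluster.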

Non Probability Sampling
• In non-probability sampling, the population units can be selected at the discretion of the researcher.
• These samples rely on human judgment to select units and have no theoretical basis for
estimating the characteristics of the population.
• Some of the techniques used for non-probability sampling are:
-Quota sampling
-Judgment sampling
-Purposive sampling
Population and Sample Examples
• All the people who have ID proof are the population, and the group of people who have only a voter ID
is the sample.
• All the students in a class are the population, whereas the top 10 students in the class are the sample.
• All the members of parliament are the population, and the female members are the
sample.

Statistical Modelling
• A statistical model is a type of mathematical model that comprises the assumptions made
to describe the data generation process.
• The mathematical expressions are general enough that they must include parameters, but the
values of these parameters are not yet known.
• In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters
for data.
• So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship,
you’d write down y = β0 + β1x.
• You don’t know what β0 and β1 are in terms of actual numbers yet, so they’re the parameters.
• Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows,
showing how things affect other things or what happens over time.
• This gives them an abstract picture of the relationships before choosing equations to express them.
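The linear model y = β0 + β1x described above can be fitted in R with lm(), which estimates the unknown parameters from data. This is a minimal sketch on simulated data: the "true" values β0 = 3 and β1 = 2 and the noise level are assumptions chosen so the estimates can be checked.

```r
set.seed(10)

# Simulated data from y = 3 + 2x + noise
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 1)

# Fit the model y = beta0 + beta1 * x
fit <- lm(y ~ x)

# Estimated parameters: "(Intercept)" is beta0, "x" is beta1
estimates <- coef(fit)
estimates
```

The estimates should land close to 3 and 2 but not exactly on them, because the noise term makes every sample slightly different.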

Probability Distributions
What Is Probability?
• Probability denotes the possibility of something happening.
• It is a mathematical concept that measures how likely events are to occur.
• Probability values lie between 0 and 1.
• The definition of probability is the degree to which something is likely to occur.
• This fundamental theory of probability also applies to probability distributions.
What Is a Probability Distribution?
• A probability distribution is a statistical function that describes all the possible values of a
random variable within a given range, together with their probabilities.
• This range is bounded by the minimum and maximum possible values, but where a possible
value falls on the probability distribution is determined by a number of factors.

Probability Distribution
A probability distribution (function) is a list of the probabilities of the values (simple
outcomes) of a random variable.
Ex: Number of heads in two tosses of a coin
For some experiments, the probability of a simple outcome can be
easily calculated using a specific probability function. If y is a simple
outcome and p(y) is its probability, then
\[ 0 \le p(y) \le 1, \qquad \sum_{\text{all } y} p(y) = 1 \]
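The coin example can be worked through in R. Assuming a fair coin, the number of heads in two tosses follows a binomial distribution, so dbinom() gives p(y) for y = 0, 1, 2, and both properties of a probability distribution can be checked directly.

```r
# Possible values of y: number of heads in two tosses
heads <- 0:2

# p(y) for a fair coin (prob = 0.5 is an assumption)
p <- dbinom(heads, size = 2, prob = 0.5)
names(p) <- heads
p   # 0.25 0.50 0.25

# Property 1: every probability lies between 0 and 1
all(p >= 0 & p <= 1)

# Property 2: the probabilities sum to 1
sum(p)
```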

Fitting a model to data
• Many data mining procedures fall within this general framework.
• We illustrate with some of the most common, all of which are based on linear models.
• The crux of the fundamental concept of this chapter is fitting a model to data by finding “optimal”
model parameters.

Classification via mathematical function

Overfitting
• Overfitting occurs when a machine learning model tries to cover all the data points, or more
data points than necessary, in the given dataset.
• Because of this, the model starts capturing the noise and inaccurate values present in the dataset,
which reduces the efficiency and accuracy of the model.
• The chance of overfitting increases the more we train the model.
• Example: overfitting can be understood from a graph of linear regression output in which the
model tries to cover all the data points in the scatter plot. It may look efficient, but in reality
it is not. The goal of a regression model is to find the best-fit line; here no best fit has been
found, so the model will generate prediction errors.
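The effect described above can be sketched by fitting two models to the same simulated data: a straight line and a degree-10 polynomial that chases individual points. The data, the train/test split, and the polynomial degree are all assumptions chosen for illustration.

```r
set.seed(42)

# Simulated data: a linear signal plus noise
d <- data.frame(x = seq(0, 1, length.out = 30))
d$y <- 2 * d$x + rnorm(30, sd = 0.3)

# Alternate rows go to training and test sets
train_idx <- seq(1, 30, by = 2)
test_idx  <- seq(2, 30, by = 2)

# A simple model and a deliberately over-flexible one
simple_fit   <- lm(y ~ x, data = d[train_idx, ])
flexible_fit <- lm(y ~ poly(x, 10), data = d[train_idx, ])

# Root mean squared error of a model on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

errors <- data.frame(
  model = c("linear", "degree 10"),
  train = c(rmse(simple_fit, d[train_idx, ]), rmse(flexible_fit, d[train_idx, ])),
  test  = c(rmse(simple_fit, d[test_idx, ]),  rmse(flexible_fit, d[test_idx, ]))
)
errors
```

The flexible model can never do worse than the line on the training data, because it contains the line as a special case. Overfitting shows up when its test error is clearly larger than its training error.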

How to avoid the Overfitting in Model
• Both overfitting and underfitting degrade the performance of a machine learning
model. The main cause is overfitting, so here are some ways to reduce the
occurrence of overfitting in a model:
• Cross-Validation
• Training with more data
• Removing features
• Early stopping the training
• Regularization
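The first technique in the list, cross-validation, can be sketched in base R without extra packages. This is a minimal 5-fold example on simulated data; the data, the number of folds, and the linear model are assumptions for illustration.

```r
set.seed(3)

# Simulated data set of 60 observations
n <- 60
d <- data.frame(x = runif(n))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.2)

# Randomly assign each row to one of k folds of equal size
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: train on the other folds, test on the held-out fold
cv_rmse <- sapply(1:k, function(i) {
  fit      <- lm(y ~ x, data = d[fold != i, ])
  held_out <- d[fold == i, ]
  sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
})

cv_rmse        # one error estimate per fold
mean(cv_rmse)  # overall cross-validated error
```

Because every observation is held out exactly once, the averaged error is a fairer estimate of performance on unseen data than the training error.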

Basic terms for overfitting
• Signal: the true underlying pattern of the data that helps the machine learning model
learn from the data.
• Noise: unnecessary and irrelevant data that reduces the performance of the model.
• Bias: a prediction error introduced into the model by oversimplifying the machine
learning algorithm. It can also be described as the difference between the predicted values and the actual values.
• Variance: the model's sensitivity to the particular training data. High variance shows up when the
model performs well on the training dataset but poorly on the test dataset.

Basics of R
Introduction
• R is a popular programming language used for statistical computing.
• Its most common uses are data analysis and visualization, graphical representation, and reporting.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand,
and is currently developed by the R Core Team.
• R is freely available under the GNU General Public License, and precompiled binary versions are
provided for various operating systems such as Linux, Windows, and Mac.
• The language was named R partly after the first letter of the first names of its two
authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
• R allows integration with procedures written in C, C++, .NET, Python, or FORTRAN
for efficiency.

Why Use R?
• It is a great resource for data analysis, data visualization, data science and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
• It is easy to draw graphs in R, like pie charts, histograms, box plot, scatter plot
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has a large community support
• It has many packages (libraries of functions) that can be used to solve different problems

Features of R
• As stated earlier, R is a programming language and software environment for statistical analysis,
graphical representation, and reporting.
The following are the important features of R:
• R is a well-developed, simple, and effective programming language that includes conditionals,
loops, and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors, and matrices.
• R provides a large and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on screen or
in print.

R - Environment Setup
1. Installation of R
In Linux (through the terminal):
• Press Ctrl+Alt+T to open a terminal.
• Then run sudo apt-get update
• After that, run sudo apt-get install r-base

In Windows:
Step – 1: Go to CRAN R project website. (Comprehensive R Archive Network )
Step – 2: Click on the Download R for Windows link. https://cran.r-project.org/bin/windows/base/
Step – 3: Click on the base subdirectory link or install R for the first time link.
Step – 4: Click Download R X.X.X for Windows (X.X.X stands for the latest version of R,
e.g. 3.6.1) and save the executable .exe file.
Step – 5: Run the .exe file and follow the installation instructions.
5.a. Select the desired language and then click Next.
5.b. Read the license agreement and click Next.
5.c. Select the components you wish to install (it is recommended to install all the components). Click Next.
5.d. Enter/browse the folder/path you wish to install R into and then confirm by clicking Next.
5.e. Select additional tasks like creating desktop shortcuts etc. then click Next.
5.f. Wait for the installation process to complete.
5.g. Click on Finish to complete the installation

Install RStudio on Windows
Step – 1: With R-base installed, let’s move on to installing RStudio.
To begin, go to the RStudio download page and click on the download button for RStudio Desktop.
Step – 2: Click on the link for the windows version of RStudio and save the .exe file.
Step – 3: Run the .exe and follow the installation instructions.
3.a. Click Next on the welcome window.
3.b. Enter/browse the path to the installation folder and click Next to proceed.
3.c. Select the folder for the start menu shortcut or click on do not create shortcuts and then click
Next.
3.d. Wait for the installation process to complete.
3.e. Click Finish to end the installation

Syntax
1.To output text in R, use single or double quotes:
• Example
"Hello World!"
2.To output numbers, just type the number (without quotes):
• Example
5
10
25
3. To do simple calculations, add numbers together:
Example
5 + 5

R Print Output
1. Print: Unlike many other programming languages, R lets you output a value without using a print
function:
Example
"Hello World!"
• However, R does have a print() function available if you want to use it. This might be useful if you are
familiar with other programming languages, such as Python, which often use the print() function to output
values.
Example
print("Hello World!")
• There are times when you must use the print() function, for example when working with for
loops.
Example
for (x in 1:10) {
  print(x)
}
• It is up to you whether you use the print() function at the top level. However, when your code is
inside an R expression (e.g. inside curly braces {} as in the example above), use the print() function to
output the result.

Comments
• Comments can be used to explain R code and make it more readable. They can also be used to prevent execution when
testing alternative code.
• Comments start with a #. When executing code, R ignores everything from the # to the end of the line.
• This example uses a comment before a line of code:
Example
# This is a comment
"Hello World!"
• This example uses a comment at the end of a line of code:
Example
"Hello World!" # This is a comment
• Comments do not have to be text that explains the code; they can also be used to prevent R from executing code:
Example
# "Good morning!"
"Good night!"
• Multiline comments: unlike other programming languages, such as Java, R has no syntax for multiline
comments. However, we can insert a # at the start of each line to create a multiline comment.

Creating Variables in R
• Variables are containers for storing data values.
• R does not have a command for declaring a variable.
• A variable is created the moment you first assign a value to it. To assign a value to a variable, use
the <- sign. To output (or print) the variable value, just type the variable name:
Example
name <- "John"
age <- 40
name   # output "John"
age    # output 40
• From the example above, name and age are variables, while "John" and 40 are values.
• In other programming languages, it is common to use = as an assignment operator.
• In R, we can use both = and <- as assignment operators.
• However, <- is preferred in most cases because the = operator is not allowed as an assignment in some contexts in R.

Print / Output Variables
• Compared to many other programming languages, you do not have to use a function to
print/output variables in R. You can just type the name of the variable.
• However, R does have a print() function available if you want to use it. This might be useful if you
are familiar with other programming languages, such as Python, which often use a print() function
to output variables.
• And there are times you must use the print() function, for example when working
with for loops (which you will learn more about in a later chapter).
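The three cases described above can be sketched as follows. The variable name and value are made up for illustration.

```r
name <- "John Doe"

# At the top level, typing the variable name prints its value
name

# print() gives the same output explicitly
print(name)

# Inside a for loop nothing is auto-printed, so print() is required
for (x in 1:3) {
  print(x)
}

# capture.output() records what print() would show on screen
printed <- capture.output(print(name))
```

Removing print() from the loop body would leave the loop running silently, which is a common source of confusion for newcomers.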

Concatenate Elements
• You can concatenate, or join, two or more elements by using the paste() function.
• To combine text and a variable, separate them with a comma (,) inside paste().
• You can also use a comma inside paste() to join one variable to another.
• For numbers, the + character works as a mathematical operator.
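The points above can be sketched with a few lines of R. The variable names and strings are made up for illustration.

```r
# Combine text and a variable
text <- "awesome"
sentence <- paste("R is", text)
sentence                         # "R is awesome"

# Join one variable to another
text1 <- "R is"
text2 <- "awesome"
joined <- paste(text1, text2)
joined                           # "R is awesome"

# For numbers, + is ordinary addition
num1 <- 5
num2 <- 10
total <- num1 + num2
total                            # 15
```

paste() puts a space between elements by default; the sep argument changes that, e.g. paste("a", "b", sep = ""). Using + on a number and a string, such as num1 + text, raises an error.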

Multiple Variables
• R allows you to assign the same value to multiple variables in one line:
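A minimal sketch of chained assignment; the variable names and the value are made up for illustration.

```r
# Assign the same value to three variables in one line
var1 <- var2 <- var3 <- "Orange"

var1
var2
var3
```

The assignments are evaluated from right to left: var3 receives the value first, then var2, then var1.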

Variable Names
• A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume). Rules for R variable names are:
• A variable name must start with a letter or a period and can be a
combination of letters, digits, period (.) and underscore (_). If it starts with a period (.), it cannot be
followed by a digit.
• A variable name cannot start with a number or underscore (_).
• Variable names are case-sensitive.
EX: (age, Age and AGE are three different variables)
• Reserved words cannot be used as variable names.
EX: (TRUE, FALSE, NULL, if...)
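One way to check these rules is base R's make.names(), which rewrites any string into a syntactically valid name. If it returns the input unchanged, the name was already valid. The helper is_valid_name() and the candidate names are hypothetical, written for this example.

```r
# Hypothetical helper: TRUE when a string is already a valid variable name
is_valid_name <- function(x) make.names(x) == x

candidates <- c("age", "total_volume", ".rate",  # valid
                "2fast", "_tmp", ".2way",        # invalid first characters
                "TRUE", "if")                    # reserved words

validity <- is_valid_name(candidates)
data.frame(name = candidates, valid = validity)

# Names are case-sensitive: these are three different variables
age <- 1
Age <- 2
AGE <- 3
```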

R - Data Types
• Generally, while programming in any language, you need variables to store
various kinds of information.
• Variables are nothing but reserved memory locations to store values.
• This means that when you create a variable, you reserve some space in memory.
• You may want to store information of various data types such as character, wide character, integer,
floating point, double floating point, Boolean, etc. Based on the data type of a variable, the operating
system allocates memory and decides what can be stored in the reserved memory.
• In contrast to other programming languages like C and Java, in R variables are not declared as
having a data type.
• Variables are assigned R objects, and the data type of the R object becomes the data type of
the variable.

Data Types in R are:
• Each R data type requires a different amount of memory and supports specific operations.
• numeric – (3, 6.7, 121)
• integer – (2L, 42L; where ‘L’ declares this as an integer)
• logical – (TRUE, FALSE)
• complex – (7 + 5i; where ‘i’ is the imaginary unit)
• character – (“a”, “B”, “c is third”, “69”)
• raw – (as.raw(55); as.raw() converts a value to a raw byte)

Data types and the values that each data type can take.
Basic Data Type   Values                                                  Example
Numeric           Set of all real numbers                                 numeric_value <- 3.14
Integer           Set of all integers, Z                                  integer_value <- 42L
Logical           TRUE and FALSE                                          logical_value <- TRUE
Complex           Set of complex numbers                                  complex_value <- 1 + 2i
Character         “a”, “b”, “c”, …, “@”, “#”, “$”, …, “1”, “2”, … etc.    character_value <- "Hello Geeks"
Raw               as.raw()                                                single_raw <- as.raw(255)

Data Types
Data type   Example                          Description
Logical     TRUE, FALSE                      A special data type for data with only two possible values, which can be construed as true/false.
Numeric     12, 32, 112, 5432                A decimal value is called numeric in R, and it is the default computational data type.
Integer     3L, 66L, 2346L                   Here, L tells R to store the value as an integer.
Complex     z = 1+2i, t = 7+3i               A complex value has a real part and an imaginary part; the imaginary part is written with the suffix i.
Character   'a', '"good"', "TRUE", '35.4'    A character is used to represent string values. Objects are converted into character values with the as.character() function.
Raw                                          A raw data type is used to hold raw bytes.

Sample program
# numeric
x <- 10.5
class(x)   # "numeric"
# integer
x <- 1000L
class(x)   # "integer"
# complex
x <- 9i + 3
class(x)   # "complex"
# character/string
x <- "R is exciting"
class(x)   # "character"
# logical
x <- TRUE
class(x)   # "logical"
