Fundamentals of Data Science
Unit 1
Prepared By
Dr.P.Sasikumar
Associate Professor, AIML Dept.
Unit # 01
Introduction: What is Data Science?
• Big Data and Data Science hype
• Getting past the hype
• Why now?
• Datafication
• Current landscape of perspectives
• Data Science Jobs
• What is a Data Scientist?
-In Academia
-In Industry
Basic Terminologies
• Data
• It can be
-generated
-collected
-retrieved.
Simulation
Similarity Measures
Data Structures
Algorithms
• Data: raw facts with no meaning attached.
• Information: learning from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: the quality of having experience and good judgment; ability to think and foresee.
• Validity: ways to confirm truth.
The DIKW Pyramid
• Cross-sectional data: observations taken at a single point in time, with no time dimension.
• Temporal data: observations ordered in time, i.e. time series.
• Spatial data: considers location, e.g. coordinate determination on touchscreen phones.
• Temporal cum spatial data (GIS): considers change in location-based data with the passage of time, e.g. population density.
Scales of Measurement
There are 4 scales of measurement:
• Nominal: classifies data into categories with no order, e.g. male/female.
• Ordinal: gives the order of the data and can be numerical or non-numerical, e.g. time of day (dawn, morning, noon, afternoon, evening, night).
• Interval: ordered values with meaningful differences between them but no true zero, e.g. temperature in degrees Celsius.
• Ratio: like interval, but with a true zero, so ratios are meaningful, e.g. weight, height, number of children.
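As a rough illustration, the four scales can be mirrored with ordinary R objects. This is only a sketch: the variable names and sample values below are invented for the example and are not part of the slides.

```r
# Nominal: categories with no order (unordered factor)
sex <- factor(c("male", "female", "female"))

# Ordinal: categories with a meaningful order (ordered factor)
time_of_day <- factor(c("noon", "dawn", "night"),
                      levels = c("dawn", "morning", "noon",
                                 "afternoon", "evening", "night"),
                      ordered = TRUE)
dawn_before_noon <- time_of_day[2] < time_of_day[1]  # comparisons are allowed

# Interval: differences are meaningful, ratios are not (no true zero)
temp_celsius <- c(10, 20)
temp_diff <- temp_celsius[2] - temp_celsius[1]

# Ratio: a true zero makes ratios meaningful
weight_kg <- c(40, 80)
weight_ratio <- weight_kg[2] / weight_kg[1]
```

Order comparisons such as `<` are defined for the ordered factor but not for the unordered one, which is the practical difference between the ordinal and nominal scales.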
Big Data and Data Science Hype:
Reasons to be skeptical about data science:
• Is data science only the stuff going on in companies like Google, Facebook and other tech companies?
• There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work builds on decades of earlier research.
• The hype is crazy. In general, hype masks reality and increases the noise-to-signal ratio.
• Statisticians already feel that they are studying and working on the “Science of Data.”
Introduction: What is Data Science?
Getting Past the Hype
• Rachel’s experience going from getting a PhD in statistics
to working at Google. In her words:
We have a couple replies to this:
• Sure, there is a difference between industry and academia. But does it really have to
be that way? Why do many courses in school have to be so intrinsically out of touch
with reality?
• Even so, the gap doesn’t represent simply a difference between industry statistics
and academic statistics.
• The general experience of data scientists is that, at their job, they have access to a
larger body of knowledge and methodology, as well as a process, which we now
define as the data science process, that has foundations in both statistics and
computer science.
Around all the hype, in other words, there is a ring of truth: this is something new.
Getting Past the Hype
• We have massive amounts of data about many aspects of our lives and, simultaneously, an abundance of inexpensive computing power. What people might not know is that the “datafication” of our offline behavior has started as well.
• On the Internet, this means Amazon recommendation systems, friend recommendations on Facebook, film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and assessments
coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data.
Why Now?
• In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they discuss the concept of datafication.
They define datafication as a process of “taking all aspects of life and turning them into data.”
• They follow up their definition in the article with a line that speaks volumes about their
perspective:
Once we datafy things, we can transform their purpose and turn the
information into new forms of value.
Datafication
Examples:
• We quantify friendships with “likes”.
• Twitter (X) datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be datafied.
• When we browse the Web, we are datafied unintentionally, through cookies.
• When we walk around in a store, or even on the street, we are being datafied via sensors, cameras, or Google Glass.
• This ranges from taking part in a social media experiment to all-out surveillance and stalking.
But it’s all datafication.
Datafication
For example,
• On Quora there’s a discussion from 2010 about “What is Data Science?” and here’s Metamarkets CEO Mike Driscoll’s answer:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and
espresso-inspired statistics.
• Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010.
Current landscape of perspectives
• Nathan Yau’s 2009 post, “Rise of the Data Scientist”, lists these skills:
1. Statistics (traditional analysis you’re used to thinking about)
2. Data munging (parsing, scraping, and formatting data)
3. Visualization (graphs, tools, etc.)
• ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’ word”, in which
she defends statistics:
• DJ Patil and Jeff Hammerbacher, then at LinkedIn and Facebook respectively, coined the term “data scientist” in 2008.
• Wikipedia finally gained an entry on data science in 2012.
Current landscape of perspectives
Data Science Jobs
• For three years running, data science has been dubbed “the best job in America.” According to Stack Overflow, it is one of the highest paying jobs in the software sector.
• The GDPR increased the reliance companies have on data scientists due to the need for real-
time analytics and storing data responsibly.
• There are 465 job openings in New York City alone for data scientists.
• LinkedIn recently picked data scientist as its most promising career of 2019. One of the
reasons it got the top spot was that the average salary for people in the role is $130,000.
• The January report from Indeed, one of the top job sites, showed a 29% increase in demand
for data scientists year over year and a 344% increase since 2013 -- a dramatic upswing. But
while demand -- in the form of job postings -- continues to rise sharply, searches by job
seekers skilled in data science grew at a slower pace (14%), suggesting a gap between supply
and demand.
The growth in data scientist job postings on Indeed, from December 2016 to December 2018
What Is a Data Scientist, Really?
Perhaps the most concrete approach is to define data science by its usage.
• In Academia
• An academic data scientist is a scientist, trained in anything from social science to biology, who works
with large amounts of data, and must deal with computational problems posed by the structure, size,
messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.
• In Industry
More generally, a data scientist is someone who knows
• How to design experiments.
• How to carry out the process of collecting, cleaning, and munging data.
• These skills are also necessary for understanding biases in the data and for debugging logging output from code.
• How to do exploratory data analysis, which combines visualization and data sense.
• How to find patterns and build models and algorithms.
• How to use analyses for decision making.
Data Engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by Data Scientists.
A Data Analyst is someone who merely curates meaningful insights from data.
A data scientist is a professional with the capabilities to gather large amounts of data to analyze
and synthesize the information into actionable plans for companies and other organizations.
What Is a Data Scientist
Statistical Inference
• Statistical inference is the process of using a sample to infer the properties of a population. Statistical procedures use sample data to estimate the characteristics of the whole population from which the sample was drawn.
• When studying a phenomenon, such as the effects of a new medication or public opinion, populations are usually too large to measure fully.
• Consequently, researchers must use a manageable subset of that population to learn about it.
• By using procedures that can make statistical inferences, you can estimate the properties and
processes of a population.
• More specifically, sample statistics can estimate population parameters.
How to Make Statistical Inferences
• The process of making a statistical inference requires you to do the following:
• Draw a sample that adequately represents the population.
• Measure your variables of interest.
• Use appropriate statistical methodology to generalize your sample results to the population while
accounting for sampling error.
Common Inferential Methods
• Hypothesis Testing: Uses representative samples to assess two mutually exclusive hypotheses about a
population. Statistically significant results suggest that the sample effect or relationship exists in the
population after accounting for sampling error.
• Confidence Intervals: A range of values likely containing the population value. This procedure
evaluates the sampling error and adds a margin around the estimate, giving an idea of how wrong it
might be.
• Margin of Error: Comparable to a confidence interval but usually for survey results.
• Regression Modeling: An estimate of the process that generates the outcomes in the population.
Example Statistical Inference
• A real flu vaccine study provides an example of making a statistical inference.
Treatment | Flu count | Group size | Percent infections
Placebo | 35 | 325 | 10.8%
Vaccine | 28 | 813 | 3.4%
Effect | | | 7.4%
Study Findings
• From the table above, 10.8% of the unvaccinated (placebo) group got the flu, while only 3.4% of the vaccinated group caught it. The apparent effect of the vaccine is 10.8% – 3.4% = 7.4 percentage points.
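The flu figures can be checked with a few lines of R. This is a sketch, not the analysis used in the study: it assumes a two-sample test of proportions (`prop.test`) is an acceptable way to account for sampling error, and the variable names are invented for the example.

```r
# Counts taken from the flu vaccine table
flu_count  <- c(placebo = 35, vaccine = 28)
group_size <- c(placebo = 325, vaccine = 813)

# Sample proportions and their difference (the apparent effect)
infection_rate <- flu_count / group_size
effect <- infection_rate[["placebo"]] - infection_rate[["vaccine"]]

# Two-sample test of equal proportions, with a confidence interval
# for the difference between the two population proportions
result <- prop.test(flu_count, group_size)

round(100 * infection_rate, 1)  # percent infected in each group
round(100 * effect, 1)          # apparent effect in percentage points
result$conf.int                 # interval for the difference
result$p.value                  # evidence against "no difference"
```

Computed from the unrounded counts, the difference is about 7.3 percentage points; the 7.4% in the table comes from subtracting the already rounded percentages.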
Population and Sample
• In statistics, as in quantitative methodology, data are collected and selected from a statistical population using defined procedures. There are two different types of data sets, namely population and sample.
Population
• A population includes all the elements of the data set. A measurable characteristic of the population, such as the mean or standard deviation, is known as a parameter.
• For example, all the people living in India make up the population of India.
There are different types of population. They are:
• Finite Population
• Infinite Population
• Existent Population
• Hypothetical Population
• Let us discuss all the types one by one.
Types
• Finite Population
The finite population is also known as a countable population, in which the units can be counted. In other words, it is the population of all the individuals or objects that are finite in number. For statistical analysis, a finite population is more advantageous than an infinite one. Examples of finite populations are the employees of a company and the potential consumers in a market.
• Infinite Population
The infinite population is also known as an uncountable population, in which counting the units is not possible. An example is the number of germs in a patient’s body, which is uncountable.
• Existent Population
The existent population is defined as the population of concrete individuals. In other words, a population whose units are available in solid form is known as an existent population. Examples are books, students etc.
• Hypothetical Population
A population whose units are not available in solid form is known as a hypothetical population. A population consists of sets of observations, objects etc. that all have something in common, and in some situations the population is only hypothetical. Examples are the outcomes of rolling a die and the outcomes of tossing a coin.
Differences between population and sample
Comparison | Population | Sample
Meaning | Collection of all the units or elements that possess common characteristics | A subgroup of the members of the population
Includes | Each and every element of a group | Only a handful of units of the population
Characteristics | Parameter | Statistic
Data collection | Complete enumeration or census | Sampling or sample survey
Focus on | Identification of the characteristics | Making inferences about the population
Sample
• It includes one or more observations that are drawn from the population and the measurable
characteristic of a sample is a statistic.
• Sampling is the process of selecting the sample from the population.
• For example, some of the people living in India form a sample of the population of India.
Basically, there are two types of sampling. They are:
• Probability sampling
• Non-probability sampling
Probability Sampling
• In probability sampling, the population units cannot be selected at the discretion of the researcher.
• Instead, selection follows set procedures which ensure that every unit of the population has a fixed probability of being included in the sample.
• Such a method is also called random sampling.
• Some of the techniques used for probability sampling are:
-Simple random sampling
-Cluster sampling
-Multi-stage sampling
Non Probability Sampling
• In non-probability sampling, the population units can be selected at the discretion of the researcher.
• Such samples rely on human judgment for selecting units and have no theoretical basis for estimating the characteristics of the population.
• Some of the techniques used for non-probability sampling are:
-Quota sampling
-Judgment sampling
-Purposive sampling
Population and Sample Examples
• All the people who have ID proofs are the population, and the group of people who have only a voter ID with them is the sample.
• All the students in the class are the population, whereas the top 10 students in the class are the sample.
• All the members of the parliament are the population, and the female members present there are the sample.
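Two of the probability sampling techniques named earlier, simple random sampling and cluster sampling, can be sketched in base R with `sample()`. The population of 1000 ID numbers and the 20 equal-sized clusters are assumptions made up for this illustration.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: ID numbers of 1000 people
population <- 1:1000

# Simple random sampling: every unit has the same chance of selection
srs <- sample(population, size = 50)

# Cluster sampling: split the population into 20 hypothetical clusters,
# pick 2 clusters at random, and keep every unit in them
cluster_id <- rep(1:20, each = 50)
chosen <- sample(unique(cluster_id), size = 2)
cluster_sample <- population[cluster_id %in% chosen]

length(srs)             # 50 units drawn without replacement
length(cluster_sample)  # 2 clusters of 50 units each
```

By contrast, the "top 10 students" example above is chosen by judgment rather than by chance, so it would fall under non-probability sampling.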
Statistical Modelling
• A statistical model is a type of mathematical model that comprises the assumptions made to describe the data generation process.
• The mathematical expressions will be general enough that they have to include parameters, but the
values of these parameters are not yet known.
• In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters
for data.
• So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship,
you’d write down y = β0 +β1x.
• You don’t know what β0 and β1 are in terms of actual numbers yet, so they’re the parameters.
• Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows,
showing how things affect other things or what happens over time.
• This gives them an abstract picture of the relationships before choosing equations to express them.
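To make the linear example concrete, the sketch below simulates two columns x and y and estimates β0 and β1 with `lm()`. The "true" values 2 and 0.5, the noise level and the sample size are assumptions chosen only for this illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Simulated data with assumed "true" parameters beta0 = 2, beta1 = 0.5
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Fit the linear model y = beta0 + beta1 * x by least squares
fit <- lm(y ~ x)

# Estimated parameters: "(Intercept)" is beta0, "x" is beta1
estimates <- coef(fit)
estimates
```

The estimates land near, but not exactly on, the values used to generate the data, which is the sense in which parameters are unknown quantities to be estimated.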
Probability Distributions
What Is Probability?
• Probability denotes the possibility of something happening.
• It is a mathematical concept that predicts how likely events are to occur.
• The probability values are expressed between 0 and 1.
• The definition of probability is the degree to which something is likely to occur.
• This fundamental theory of probability is also applied to probability distributions.
What Is a Probability Distribution?
• A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range.
• This range is bounded by the minimum and maximum possible values, but where a possible value falls on the probability distribution is determined by a number of factors.
A probability distribution (function) is a list of the probabilities of the values (simple
outcomes) of a random variable.
Ex: Number of heads in two tosses of a coin
For some experiments, the probability of a simple outcome can be easily calculated using a specific probability function. If y is a simple outcome and p(y) is its probability, then
$0 \le p(y) \le 1$ and $\sum_{\text{all } y} p(y) = 1$.
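For the coin example, the distribution of the number of heads in two tosses can be listed with `dbinom()`. The sketch assumes a fair coin (probability of heads 0.5), which the slide does not state explicitly.

```r
# Y = number of heads in two tosses of a coin assumed to be fair
y <- 0:2
p <- dbinom(y, size = 2, prob = 0.5)   # p(0), p(1), p(2)

data.frame(y = y, p = p)

all(p >= 0 & p <= 1)   # each probability lies between 0 and 1
sum(p)                 # the probabilities sum to 1
```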
Probability Distribution
Fitting a model to data
• Many data mining procedures fall within this general framework.
• We illustrate with some of the most common, all of which are based on linear models.
• The fundamental concept is fitting a model to data by finding “optimal” model parameters.
Classification via mathematical function
Overfitting
• Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset.
• Because of this, the model starts capturing noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
• The chance of overfitting increases the more we train our model.
• Example: Consider a graph of regression output in which the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not. The goal of the regression model is to find the best-fit line, and a curve that chases every point is not a best fit, so it will generate prediction errors.
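The effect described here can be reproduced on simulated data. In this sketch the data-generating line, the noise level, the sample sizes and the degree-10 polynomial are all assumptions chosen for illustration; `make_data` and `mse` are hypothetical helper names.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Simulated data: a straight-line signal plus random noise
make_data <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.3))
}
train <- make_data(15)
test  <- make_data(200)

# Mean squared prediction error of a fitted model on a data set
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

simple   <- lm(y ~ x, data = train)            # best-fit line
flexible <- lm(y ~ poly(x, 10), data = train)  # chases individual points

c(train = mse(simple, train),   test = mse(simple, test))
c(train = mse(flexible, train), test = mse(flexible, test))
```

The flexible model can never have a higher training error than the line, because the line is a special case of it. Its error on the unseen test data is typically much larger, which is the signature of overfitting.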
How to Avoid Overfitting in a Model
• Both overfitting and underfitting degrade the performance of a machine learning model. The main cause, however, is overfitting, so there are some ways by which we can reduce the occurrence of overfitting in our model:
• Cross-Validation
• Training with more data
• Removing features
• Early stopping the training
• Regularization
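Of the remedies listed above, cross-validation is the easiest to sketch in base R. The example below uses 5-fold cross-validation to compare polynomial models of different degree on simulated data; the data, the number of folds and the `cv_error` helper are assumptions made for this illustration.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Simulated data: a straight-line signal plus random noise
n <- 60
x <- runif(n)
dat <- data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.3))

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label per row

# Average held-out mean squared error for a polynomial of given degree
cv_error <- function(degree) {
  errs <- sapply(1:k, function(i) {
    fit <- lm(y ~ poly(x, degree), data = dat[fold != i, ])
    held_out <- dat[fold == i, ]
    mean((held_out$y - predict(fit, newdata = held_out))^2)
  })
  mean(errs)
}

degrees <- 1:6
cv <- sapply(degrees, cv_error)
names(cv) <- degrees
cv   # choose the degree with the lowest cross-validated error
```

Because every model is scored only on rows it did not see during fitting, a model that merely memorizes noise gains no advantage.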
Basic terms for overfitting
• Signal: the true underlying pattern of the data that helps the machine learning model to learn from the data.
• Noise: unnecessary and irrelevant data that reduces the performance of the model.
• Bias: a prediction error introduced into the model by oversimplifying the machine learning algorithm. It shows up as a systematic difference between the predicted values and the actual values.
• Variance: the sensitivity of the model to the particular training data. It shows up when the model performs well on the training dataset but does not perform well on the test dataset.
Basics of R
Introduction
• R is a popular programming language used for statistical computing.
• Its most common uses are data analysis and visualization, graphical representation, and reporting.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Core Team.
• R is freely available under the GNU General Public License, and precompiled binary versions are provided for various operating systems like Linux, Windows and Mac.
• The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
• R allows integration with procedures written in C, C++, .NET, Python or FORTRAN for efficiency.
Why Use R?
• It is a great resource for data analysis, data visualization, data science and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
• It is easy to draw graphs in R, like pie charts, histograms, box plots and scatter plots
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be used to solve different problems
Features of R
• As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting.
The following are the important features of R:
• R is a well-developed, simple and effective programming language which includes conditionals, loops, input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on the computer screen or printed on paper.
R - Environment Setup
1. Installation of R
In Linux: ( Through Terminal )
• Press Ctrl+Alt+T to open Terminal
• Then execute sudo apt-get update
• After that, sudo apt-get install r-base
In Windows:
Step – 1: Go to the CRAN (Comprehensive R Archive Network) R project website.
Step – 2: Click on the Download R for Windows link. https://cran.r-project.org/bin/windows/base/
Step – 3: Click on the base subdirectory link or the install R for the first time link.
Step – 4: Click Download R X.X.X for Windows (X.X.X stands for the latest version of R, e.g. 3.6.1) and save the executable .exe file.
Step – 5: Run the .exe file and follow the installation instructions.
5.a. Select the desired language and then click Next.
5.b. Read the license agreement and click Next.
5.c. Select the components you wish to install (it is recommended to install all the components). Click Next.
5.d. Enter/browse the folder/path you wish to install R into and then confirm by clicking Next.
5.e. Select additional tasks like creating desktop shortcuts etc. then click Next.
5.f. Wait for the installation process to complete.
5.g. Click on Finish to complete the installation
Install RStudio on Windows
Step – 1: With R-base installed, let’s move on to installing RStudio.
To begin, go to the RStudio download page and click on the download button for RStudio Desktop.
Step – 2: Click on the link for the windows version of RStudio and save the .exe file.
Step – 3: Run the .exe and follow the installation instructions.
3.a. Click Next on the welcome window.
3.b. Enter/browse the path to the installation folder and click Next to proceed.
3.c. Select the folder for the start menu shortcut or click on do not create shortcuts and then click
Next.
3.d. Wait for the installation process to complete.
3.e. Click Finish to end the installation
Syntax
1. To output text in R, use single or double quotes:
Example
    "Hello World!"
2. To output numbers, just type the number (without quotes):
Example
    5
    10
    25
3. To do simple calculations, add numbers together:
Example
    5 + 5
R Print Output
Print: Unlike many other programming languages, R lets you output a value without using a print function:
Example
    "Hello World!"
• However, R does have a print() function available if you want to use it. This might be useful if you are familiar with other programming languages, such as Python, which often use the print() function to output values.
Example
    print("Hello World!")
• There are times you must use the print() function to produce output, for example when working with for loops.
Example
    for (x in 1:10) {
      print(x)
    }
• It is up to you whether you want to use the print() function at the top level. However, when your code is inside an R expression (e.g. inside curly braces {} like in the example above), use the print() function to output the result.
Comments
• Comments can be used to explain R code and to make it more readable. They can also be used to prevent execution when testing alternative code.
• Comments start with a #. When executing code, R ignores everything from the # to the end of the line.
• This example uses a comment before a line of code:
Example
    # This is a comment
    "Hello World!"
• This example uses a comment at the end of a line of code:
Example
    "Hello World!" # This is a comment
• Comments do not have to be text that explains the code; they can also be used to prevent R from executing code:
Example
    # "Good morning!"
    "Good night!"
• Multiline comments: Unlike some other programming languages, such as Java, R has no syntax for multiline comments. However, we can insert a # at the start of each line to create a multiline comment.
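A multiline comment built this way might look like the following minimal sketch; the variable name `greeting` is invented for the example.

```r
# This is a comment
# written over
# more than one line
greeting <- "Hello World!"
greeting
```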
Creating Variables in R
• Variables are containers for storing data values.
• R does not have a command for declaring a variable.
• A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name.
• For example, after name <- "John" and age <- 40, name and age are variables, while "John" and 40 are values.
• In other programming languages, it is common to use = as an assignment operator.
• In R, we can use both = and <- as assignment operators.
• However, <- is preferred in most cases because = is not treated as assignment in some contexts in R; inside a function call, for example, it names an argument instead.
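The points above can be tried out directly. This is a small sketch; the use of `mean()` to show the difference between = and <- inside a function call is an illustration chosen here, not something taken from the slides.

```r
name <- "John"   # character value
age <- 40        # numeric value

name   # typing the variable name prints its value
age

# = also works as assignment at the top level
age2 = 40

# Inside a function call the two operators differ:
# mean(x = c(1, 2, 3)) only names the argument x, while
# mean(x <- c(1, 2, 3)) also creates a variable x.
m <- mean(x <- c(1, 2, 3))
x
```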
Print / Output Variables
• Compared to many other programming languages, you do not have to use a function to print/output variables in R. You can just type the name of the variable.
• However, R does have a print() function available if you want to use it. This might be useful if you are familiar with other programming languages, such as Python, which often use a print() function to output variables.
• There are times you must use the print() function to produce output, for example when working with for loops (which you will learn more about in a later chapter).
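The three cases above can be seen side by side in this sketch; the variable name and value are invented for the example.

```r
name <- "John Doe"

name          # auto-prints at the top level
print(name)   # explicit print, same output

# Inside a for loop values are not auto-printed, so print() is needed
for (i in 1:3) {
  print(i)
}
```

Replacing `print(i)` with a bare `i` inside the loop would produce no output at all.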
Concatenate Elements
• You can also concatenate, or join, two or more elements by using the paste() function.
• To combine text and a variable, pass both to paste(), separated by a comma (,).
• You can also use a comma inside paste() to join one variable to another.
• For numbers, the + character works as a mathematical operator.
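A short sketch of these cases follows; the variable names and strings are invented for the example.

```r
text <- "awesome"
combined <- paste("R is", text)   # text plus a variable

text1 <- "R is"
text2 <- "awesome"
joined <- paste(text1, text2)     # variable plus variable

num1 <- 5
num2 <- 10
total <- num1 + num2              # + adds numbers

# paste() puts a space between its arguments by default;
# the sep argument changes the separator
dashed <- paste("R", "is", "fun", sep = "-")

combined
joined
total
dashed
```

Note that + is only defined for numbers here: trying to add a character string to a number with + raises an error rather than concatenating.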
Multiple Variables
• R allows you to assign the same value to multiple variables in one line:
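A minimal sketch of chained assignment, with variable names and a value invented for the example:

```r
# Assign the same value to three variables in one line
var1 <- var2 <- var3 <- "Orange"

var1
var2
var3
```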
Variable Names
• A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules for R variable names are:
• A variable name must start with a letter or a period (.) and can be a combination of letters, digits, period (.) and underscore (_). If it starts with a period, the period cannot be followed by a digit.
• A variable name cannot start with a number or underscore (_).
• Variable names are case-sensitive.
EX: (age, Age and AGE are three different variables)
• Reserved words cannot be used as variable names.
EX: (TRUE, FALSE, NULL, if...)
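One way to check these rules is with base R's `make.names()`, which rewrites any invalid name into a valid one and leaves valid names untouched. The helper `is_valid_name` and the candidate names are hypothetical, written only for this sketch.

```r
# A name is syntactically valid if make.names() leaves it unchanged
is_valid_name <- function(candidate) make.names(candidate) == candidate

candidates <- c("myvar", "my_var", "my.var", ".myvar", "myVar2",
                "2myvar", "_my_var", ".2var", "my-var", "TRUE", "if")
data.frame(name = candidates, valid = is_valid_name(candidates))

# Names are case-sensitive: these are three different variables
age <- 1
Age <- 2
AGE <- 3
```

The first five candidates follow the rules; the rest break them by starting with a digit or underscore, putting a digit after a leading period, containing a hyphen, or being a reserved word.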
R - Data Types
• Generally, while doing programming in any programming language, you need to use various
variables to store various information.
• Variables are nothing but reserved memory locations to store values.
• This means that, when you create a variable you reserve some space in memory.
• You may like to store information of various data types like character, wide character, integer,
floating point, double floating point, Boolean etc. Based on the data type of a variable, the operating
system allocates memory and decides what can be stored in the reserved memory.
• In contrast to other programming languages like C and Java, in R variables are not declared with a data type.
• Variables are assigned R objects, and the data type of the R object becomes the data type of the variable.
Data Types in R are:
• Each R data type requires a different amount of memory and supports its own specific operations.
• numeric – (3, 6.7, 121)
• integer – (2L, 42L; where ‘L’ declares this as an integer)
• logical – (TRUE, FALSE)
• complex – (7 + 5i; where ‘i’ is the imaginary unit)
• character – (“a”, “B”, “c is third”, “69”)
• raw – (as.raw(55); as.raw() converts a value to a raw byte, while raw(n) creates a raw vector of length n)
Data types and the values that each data type can take.
Basic Data Types | Values | Examples
Numeric | Set of all real numbers | numeric_value <- 3.14
Integer | Set of all integers, Z | integer_value <- 42L
Logical | TRUE and FALSE | logical_value <- TRUE
Complex | Set of complex numbers | complex_value <- 1 + 2i
Character | “a”, “b”, “c”, …, “@”, “#”, “$”, …, “1”, “2”, … etc. | character_value <- "Hello Geeks"
Raw | Raw bytes, created with as.raw() | single_raw <- as.raw(255)
Data Types
Data type | Example | Description
Logical | TRUE, FALSE | A special data type for data with only two possible values, which can be construed as true/false.
Numeric | 12, 32, 112, 5432 | Decimal values are called numeric in R; numeric is the default computational data type.
Integer | 3L, 66L, 2346L | Here, L tells R to store the value as an integer.
Complex | Z = 1+2i, t = 7+3i | A complex value in R has a real part and an imaginary part; the imaginary part carries the suffix i.
Character | 'a', "good", "TRUE", '35.4' | In R programming, a character is used to represent string values. We convert objects into character values with the help of the as.character() function.
Raw | | A raw data type is used to hold raw bytes.
Sample program
    # numeric
    x <- 10.5
    class(x)   # "numeric"
    # integer
    x <- 1000L
    class(x)   # "integer"
    # complex
    x <- 9i + 3
    class(x)   # "complex"
    # character/string
    x <- "R is exciting"
    class(x)   # "character"
    # logical
    x <- TRUE
    class(x)   # "logical"

2019 June 27 - Big data and data science
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Data literacy
Data literacyData literacy
Data literacy
 
Bigdata and Hadoop with applications
Bigdata and Hadoop with applicationsBigdata and Hadoop with applications
Bigdata and Hadoop with applications
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Learning Data Analytics
Learning Data AnalyticsLearning Data Analytics
Learning Data Analytics
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data science
 
NCME Big Data in Education
NCME Big Data  in EducationNCME Big Data  in Education
NCME Big Data in Education
 
Module 3 - Improving Current Business with External Data- Online
Module 3 - Improving Current Business with External Data- Online Module 3 - Improving Current Business with External Data- Online
Module 3 - Improving Current Business with External Data- Online
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 

Recently uploaded

edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfgreat91
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Valters Lauzums
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshareraiaryan448
 
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisData Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisBoston Institute of Analytics
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchersdarmandersingh4580
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证pwgnohujw
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证ppy8zfkfm
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationmuqadasqasim10
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证ju0dztxtn
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一fztigerwe
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...ThinkInnovation
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...ssuserf63bd7
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjadimosmejiaslendon
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...yulianti213969
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancingmohamed Elzalabany
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024patrickdtherriault
 

Recently uploaded (20)

edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisData Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 

Fundamentals of Data science Introduction Unit 1

  • 1. Fundamentals of Data Science Unit 1 Prepared By Dr.P.Sasikumar Associate Professor, AIML Dept.
  • 2. Unit # 01 Introduction: What is Data Science?
      • Big Data and Data Science hype
      • Getting past the hype
      • Why now?
      • Datafication
      • Current landscape of perspectives
      • Data Science jobs
      • What is a data scientist?
        - In academia
        - In industry
  • 3. Basic Terminologies
      • Data can be:
        - generated
        - collected
        - retrieved
      • Related topics: Simulation, Similarity Measures, Data Structures, Algorithms
  • 4. The DIKW Pyramid
      • Data: facts with no meaning attached.
      • Information: what is learned from facts.
      • Knowledge: practical understanding of a subject.
      • Understanding: the ability to absorb knowledge and learn to reason.
      • Wisdom: the quality of having experience and good judgment; the ability to think and foresee.
      • Validity: ways to confirm truth.
  • 6. Types of Data
      • Cross-sectional data: data without a time dimension.
      • Temporal data: data recorded over time (time series).
      • Spatial data: considers location, e.g. coordinate determination in touch phones.
      • Temporal cum spatial data (GIS): considers change in a location with the passage of time, e.g. population density.
      Scales of Measurement
      There are 4 scales of measurement:
      • Nominal: classifies data into categories with no order, e.g. male/female.
      • Ordinal: gives the order of the data and can be numerical or non-numerical, e.g. time of day (dawn, morning, noon, afternoon, evening, night).
      • Interval: differences between values are meaningful but there is no true zero, e.g. temperature in degrees Celsius.
      • Ratio: has a true zero, so ratios of values are meaningful, e.g. weight, height, number of children.
  • 7. Introduction: What is Data Science?
      Big Data and Data Science Hype: reasons for skepticism about data science
      • Is data science only the stuff going on in companies like Google, Facebook and other tech companies?
      • There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades of prior work.
      • The hype is crazy. In general, hype masks reality and increases the noise-to-signal ratio.
      • Statisticians already feel that they are studying and working on the “Science of Data.”
  • 8. Getting Past the Hype • Rachel’s experience going from getting a PhD in statistics to working at Google. In her words:
  • 9. Getting Past the Hype
      We have a couple of replies to this:
      • Sure, there is a difference between industry and academia. But does it really have to be that way? Why do many courses in school have to be so intrinsically out of touch with reality?
      • Even so, the gap doesn’t represent simply a difference between industry statistics and academic statistics.
      • The general experience of data scientists is that, at their job, they have access to a larger body of knowledge and methodology, as well as a process, which we now define as the data science process, that has foundations in both statistics and computer science. Around all the hype, in other words, there is a ring of truth: this is something new.
  • 10. Why Now?
      • We have massive amounts of data about many aspects of our lives. What people might not know is that the “datafication” of our offline behavior has started as well.
      • On the Internet, this means Amazon recommendation systems.
      • On Facebook, it means friend recommendations, film and music recommendations, and so on.
      • In finance, this means credit ratings, trading algorithms, and models.
      • In education, this is starting to mean dynamic personalized learning and assessments coming out of places like Knewton and Khan Academy.
      • In government, this means policies based on data.
  • 11. Datafication
      • In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they discuss the concept of datafication, which they define as a process of “taking all aspects of life and turning them into data.”
      • They follow up their definition with a line that speaks volumes about their perspective: “Once we datafy things, we can transform their purpose and turn the information into new forms of value.”
  • 12. Datafication
      Examples:
      • We quantify friendships with “likes”.
      • Twitter (X) datafies stray thoughts.
      • LinkedIn datafies professional networks.
      • When we “like” someone or something online, we are intending to be datafied.
      • When we browse the Web, we are unintentionally datafied through cookies.
      • When we walk around in a store, or even on the street, we are being datafied via sensors, cameras, or Google Glass.
      • Taking part in a social media experiment.
      • All-out surveillance and stalking.
      But it’s all datafication.
  • 13. Current landscape of perspectives
      For example:
      • On Quora there’s a discussion from 2010 about “What is Data Science?”, and here’s Metamarkets CEO Mike Driscoll’s answer: “Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.”
      • Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010.
  • 14. Current landscape of perspectives
      • Nathan Yau’s 2009 post, “Rise of the Data Scientist”, lists skills that include:
        1. Statistics (traditional analysis you’re used to thinking about)
        2. Data munging (parsing, scraping, and formatting data)
        3. Visualization (graphs, tools, etc.)
      • In her 2011 Amstat News article, “Don’t shun the ‘S’ word”, ASA President Nancy Geller defends statistics.
      • DJ Patil and Jeff Hammerbacher, then at LinkedIn and Facebook, respectively, coined the term “data scientist” in 2008.
      • Wikipedia finally gained an entry on data science in 2012.
  • 15. Data Science Jobs
      • For three years running, data science has been dubbed “the best job in America.” According to Stack Overflow, it is one of the highest paying jobs in the software sector.
      • The GDPR increased the reliance companies have on data scientists due to the need for real-time analytics and storing data responsibly.
      • There are 465 job openings in New York City alone for data scientists.
      • LinkedIn recently picked data scientist as its most promising career of 2019. One of the reasons it got the top spot was that the average salary for people in the role is $130,000.
      • The January report from Indeed, one of the top job sites, showed a 29% increase in demand for data scientists year over year and a 344% increase since 2013 -- a dramatic upswing. But while demand -- in the form of job postings -- continues to rise sharply, searches by job seekers skilled in data science grew at a slower pace (14%), suggesting a gap between supply and demand.
  • 16. The growth in data scientist job postings on Indeed, from December 2016 to December 2018
  • 17.
  • 18. What Is a Data Scientist, Really?
      Perhaps the most concrete approach is to define data science by its usage.
      • In Academia
        An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data, and must deal with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.
      • In Industry
        More generally, a data scientist is someone who knows how to:
        - Design experiments.
        - Carry out the process of collecting, cleaning, and munging data, skills that are also necessary for understanding biases in the data and for debugging logging output from code.
        - Do exploratory data analysis, which combines visualization and data sense.
        - Find patterns and build models and algorithms.
        - Use analyses for decision making.
  • 19.
  • 20. What Is a Data Scientist
      • Data engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by data scientists.
      • A data analyst is someone who curates meaningful insights from data.
      • A data scientist is a professional with the capabilities to gather large amounts of data and to analyze and synthesize the information into actionable plans for companies and other organizations.
  • 21. Statistical Inference
      • Statistical inference is the process of using a sample to infer the properties of a population. Statistical procedures use sample data to estimate the characteristics of the whole population from which the sample was drawn.
      • Researchers study a phenomenon, such as the effects of a new medication or public opinion, but populations are usually too large to measure fully.
      • Consequently, researchers must use a manageable subset of that population to learn about it.
      • By using procedures that can make statistical inferences, you can estimate the properties and processes of a population.
      • More specifically, sample statistics can estimate population parameters.
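The idea that a sample statistic estimates a population parameter can be sketched in a few lines of Python. This is only an illustration: the "population" is simulated, and its mean (170) and spread (10) are made-up numbers, not data from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (made-up numbers).
population = [random.gauss(170, 10) for _ in range(100_000)]

# Draw a simple random sample without replacement.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)  # parameter (normally unknown)
sample_mean = statistics.fmean(sample)          # statistic (what we can compute)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice only the sample is available; the population mean is computed here purely to show how close the estimate tends to be.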
  • 22. How to Make Statistical Inferences
      Making a statistical inference requires you to do the following:
      • Draw a sample that adequately represents the population.
      • Measure your variables of interest.
      • Use appropriate statistical methodology to generalize your sample results to the population while accounting for sampling error.
      Common Inferential Methods
      • Hypothesis Testing: uses representative samples to assess two mutually exclusive hypotheses about a population. Statistically significant results suggest that the sample effect or relationship exists in the population after accounting for sampling error.
      • Confidence Intervals: a range of values likely containing the population value. This procedure evaluates the sampling error and adds a margin around the estimate, giving an idea of how wrong it might be.
      • Margin of Error: comparable to a confidence interval but usually for survey results.
      • Regression Modeling: an estimate of the process that generates the outcomes in the population.
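As a rough illustration of a confidence interval and its margin of error, the sketch below uses the normal approximation for a sample mean. The function name `mean_confidence_interval` and the simulated measurements are assumptions made for this example; for small samples a t-based interval would be more appropriate.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # estimated sampling error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)     # about 1.96 for 95%
    margin = z * std_err                               # margin of error
    return mean - margin, mean + margin

random.seed(1)
sample = [random.gauss(50, 5) for _ in range(200)]  # made-up measurements
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Asking for a higher confidence level (for example `confidence=0.99`) widens the interval, which is the trade-off between certainty and precision.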
  • 23. Example Statistical Inference
      • A real flu vaccine study serves as an example of making a statistical inference.
      Study Findings
      Treatment   Flu count   Group size   Percent infections
      Placebo     35          325          10.8%
      Vaccine     28          813          3.4%
      Effect                               7.4%
      • From the table above, 10.8% of the unvaccinated got the flu, while only 3.4% of the vaccinated caught it. The apparent effect of the vaccine is 10.8% – 3.4% = 7.4%.
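The arithmetic behind the table can be checked directly from the counts. The sketch below also adds a normal-approximation 95% confidence interval for the difference in proportions; that interval is an illustration added here, not a result reported on the slide. Note that the unrounded difference is about 7.3 percentage points; the slide's 7.4% comes from subtracting the already rounded percentages.

```python
import math
from statistics import NormalDist

# Counts taken from the table above.
placebo_flu, placebo_n = 35, 325
vaccine_flu, vaccine_n = 28, 813

p_placebo = placebo_flu / placebo_n   # infection rate without the vaccine
p_vaccine = vaccine_flu / vaccine_n   # infection rate with the vaccine
effect = p_placebo - p_vaccine        # difference in proportions

# Standard error of a difference between two independent proportions.
std_err = math.sqrt(p_placebo * (1 - p_placebo) / placebo_n
                    + p_vaccine * (1 - p_vaccine) / vaccine_n)
z = NormalDist().inv_cdf(0.975)
ci = (effect - z * std_err, effect + z * std_err)

print(f"placebo: {p_placebo:.1%}, vaccine: {p_vaccine:.1%}, effect: {effect:.1%}")
print(f"95% CI for the effect: ({ci[0]:.1%}, {ci[1]:.1%})")
```

Because the whole interval lies above zero, the sample difference is unlikely to be explained by sampling error alone under this approximation.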
  • 24. Population and Sample
      • In statistics as well as in quantitative methodology, data are collected and selected from a statistical population with the help of defined procedures. There are two different types of data sets, namely population and sample.
      Population
      • A population includes all the elements of the data set. Measurable characteristics of the population, such as the mean and standard deviation, are known as parameters.
      • For example, all people living in India make up the population of India.
      There are different types of population:
      • Finite population
      • Infinite population
      • Existent population
      • Hypothetical population
  • 25. Types of Population
      • Finite Population: also known as a countable population, it is a population whose individuals or objects can be counted. For statistical analysis, a finite population is more advantageous than an infinite one. Examples: the employees of a company, the potential consumers in a market.
      • Infinite Population: also known as an uncountable population, it is a population whose units cannot be counted. Example: the number of germs in a patient’s body.
      • Existent Population: a population of concrete individuals, that is, one whose units are available in solid form. Examples: books, students.
      • Hypothetical Population: a population whose units are not available in solid form. A population consists of sets of observations, objects, etc. that all have something in common, and in some situations the population is only hypothetical. Examples: the outcome of rolling a die, the outcome of tossing a coin.
  • 26. Differences between population and sample
      Comparison        Population                                                                  Sample
      Meaning           Collection of all the units or elements that possess common characteristics   A subgroup of the members of the population
      Includes          Each and every element of a group                                             Only a handful of units of the population
      Characteristics   Parameter                                                                     Statistic
      Data collection   Complete enumeration or census                                                Sampling or sample survey
      Focus on          Identification of the characteristics                                         Making inferences about the population
  • 27. Sample
      • A sample includes one or more observations drawn from the population. A measurable characteristic of a sample is a statistic.
      • Sampling is the process of selecting the sample from the population.
      • For example, some of the people living in India form a sample of the population of India.
      Basically, there are two types of sampling:
      • Probability sampling
      • Non-probability sampling
  • 28. Probability Sampling
      • In probability sampling, the population units cannot be selected at the discretion (choice) of the researcher.
      • Instead, selection follows procedures which ensure that every unit of the population has a fixed probability of being included in the sample.
      • Such a method is also called random sampling.
      • Some of the techniques used for probability sampling are:
        - Simple random sampling
        - Cluster sampling
        - Multi-stage sampling
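Two of the techniques listed above, simple random sampling and cluster sampling, can be sketched with Python's standard `random` module. The population of 20 people in 4 villages is entirely hypothetical and exists only to make the example concrete.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20 people spread over 4 villages (the clusters).
population = [(f"person_{i}", f"village_{i % 4}") for i in range(20)]

# Simple random sampling: every unit has the same chance of selection.
simple_sample = random.sample(population, 5)

# Cluster sampling: pick whole clusters at random, then keep every unit in them.
clusters = sorted({village for _, village in population})
chosen_clusters = set(random.sample(clusters, 2))
cluster_sample = [unit for unit in population if unit[1] in chosen_clusters]

print(simple_sample)
print(sorted(chosen_clusters), len(cluster_sample))
```

In both cases the researcher controls only the procedure, not which individual units end up in the sample.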
  • 29. Non-Probability Sampling
      • In non-probability sampling, the population units can be selected at the discretion of the researcher.
      • Such samples rely on human judgment for selecting units and have no theoretical basis for estimating the characteristics of the population.
      • Some of the techniques used for non-probability sampling are:
        - Quota sampling
        - Judgment sampling
        - Purposive sampling
      Population and Sample Examples
      • All the people who have ID proofs are the population, and the group of people who have only a voter ID is the sample.
      • All the students in the class are the population, whereas the top 10 students in the class are the sample.
      • All the members of the parliament are the population, and the female members present there are the sample.
  • 30. Statistical Modelling
      • A statistical model is a type of mathematical model that comprises the assumptions made to describe the data generation process.
      • The mathematical expressions will be general enough that they have to include parameters, but the values of these parameters are not yet known.
      • In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data.
      • So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship, you’d write down y = β0 + β1x.
      • You don’t know what β0 and β1 are in terms of actual numbers yet, so they’re the parameters.
      • Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time.
      • This gives them an abstract picture of the relationships before choosing equations to express them.
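Once data are available, the unknown parameters β0 and β1 can be estimated. The sketch below uses ordinary least squares, which is one common choice rather than the only one; the function name `fit_line` and the five data points are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares estimates for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x  # the fitted line passes through the mean point
    return b0, b1

x = [1, 2, 3, 4, 5]              # made-up data column x
y = [3.1, 4.9, 7.2, 8.8, 11.0]   # made-up data column y
b0, b1 = fit_line(x, y)
print(f"estimated beta0 = {b0:.2f}, beta1 = {b1:.2f}")
```

Here `b0` and `b1` are estimates of β0 and β1 computed from the sample; different data would give different estimates of the same underlying parameters.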
  • 31. Probability Distributions
      What Is Probability?
      • Probability denotes the possibility of something happening.
      • It is a mathematical concept that predicts how likely events are to occur.
      • Probability values lie between 0 and 1.
      • Probability is defined as the degree to which something is likely to occur.
      • This fundamental theory of probability is also applied to probability distributions.
      What Is a Probability Distribution?
      • A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range.
      • This range is bounded by the minimum and maximum possible values, but where a possible value falls on the probability distribution is determined by a number of factors.
  • 32. Probability Distribution
      • A probability distribution (function) is a list of the probabilities of the values (simple outcomes) of a random variable.
      • Example: the number of heads in two tosses of a coin.
      • For some experiments, the probability of a simple outcome can be easily calculated using a specific probability function.
      • If y is a simple outcome and p(y) is its probability, then $0 \le p(y) \le 1$ and $\sum_{\text{all } y} p(y) = 1$.
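The coin example can be worked out by enumerating the equally likely outcomes. This sketch assumes a fair coin and uses exact fractions so that the two conditions on p(y) can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Random variable y = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)
distribution = {y: Fraction(c, len(outcomes)) for y, c in sorted(counts.items())}

for y, p in distribution.items():
    print(f"p({y}) = {p}")

# The two defining conditions: each p(y) lies in [0, 1] and they sum to 1.
assert all(0 <= p <= 1 for p in distribution.values())
assert sum(distribution.values()) == 1
```

The resulting distribution assigns 1/4 to zero heads, 1/2 to one head, and 1/4 to two heads.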
  • 33. Fitting a model to data
      • Many data mining procedures fall within this general framework.
      • We illustrate with some of the most common, all of which are based on linear models.
      • The crux of the fundamental concept of this chapter: fitting a model to data by finding “optimal” model parameters.
  • 35. Overfitting
      • Overfitting occurs when a machine learning model tries to cover all the data points, or more data points than required, in the given dataset.
      • Because of this, the model starts fitting the noise and inaccurate values present in the dataset, and these factors reduce the efficiency and accuracy of the model.
      • The chance of overfitting increases the longer we train the model.
      • Example: overfitting can be understood from a graph of regression output in which the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not. The goal of a regression model is to find the best-fit line; a curve that chases every point is not a best fit, so it will generate prediction errors on new data.
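A small experiment can make this concrete. In the sketch below, a "memorizer" (a 1-nearest-neighbour rule that passes through every training point) is compared with a plain straight-line fit on simulated data. The data, the seed, and both helper models are assumptions chosen for illustration; the memorizer stands in for any model that covers all the training points.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Made-up data: a straight-line signal (y = 2x + 1) plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 2) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(200)

for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    print(f"{name:10s} train MSE: {mse(model, train_x, train_y):.2f}  "
          f"test MSE: {mse(model, test_x, test_y):.2f}")
```

The memorizer reaches zero error on the training data because it reproduces the noise exactly, yet on fresh data it typically does worse than the simple line, which is the signature of overfitting.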
  • 36. How to Avoid Overfitting in a Model
      • Both overfitting and underfitting degrade the performance of a machine learning model. The main cause is overfitting, and there are several ways to reduce its occurrence:
        - Cross-validation
        - Training with more data
        - Removing features
        - Early stopping of the training
        - Regularization
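Cross-validation, the first item in the list above, rests on splitting the data into k folds and holding each fold out once. The sketch below shows only that splitting step; the helper name `k_fold_indices` is hypothetical, and it assumes the rows are already in random order (real workflows usually shuffle first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(10, 3):
    print("train:", train, "validation:", validation)
```

A model would be fitted on each `train` split and scored on the matching `validation` split; averaging the k scores gives an estimate of performance on unseen data.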
  • 37. Basic Terms for Overfitting
      • Signal: the true underlying pattern of the data that helps the machine learning model learn from the data.
      • Noise: unnecessary and irrelevant data that reduces the performance of the model.
      • Bias: a prediction error introduced into the model by oversimplifying the machine learning algorithm; equivalently, the difference between the predicted values and the actual values.
      • Variance: if the machine learning model performs well on the training dataset but does not perform well on the test dataset, the model has high variance.
  • 38. Basics of R: Introduction
      • R is a popular programming language used for statistical computing.
      • Its most common uses are analyzing and visualizing data, graphical representation, and reporting.
      • R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
      • R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac.
      • The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
      • R allows integration with procedures written in C, C++, .NET, Python or FORTRAN for efficiency.
  • 39. Why Use R?
      • It is a great resource for data analysis, data visualization, data science and machine learning.
      • It provides many statistical techniques (such as statistical tests, classification, clustering and data reduction).
      • It is easy to draw graphs in R, such as pie charts, histograms, box plots and scatter plots.
      • It works on different platforms (Windows, Mac, Linux).
      • It is open-source and free.
      • It has large community support.
      • It has many packages (libraries of functions) that can be used to solve different problems.
  • 40. Features of R • As stated earlier, R is a programming language and software environment for statistical analysis, graphical representation and reporting. The following are the important features of R: • R is a well-developed, simple and effective programming language which includes conditionals, loops, and input and output facilities. • R has an effective data handling and storage facility. • R provides a suite (set) of operators for calculations on arrays, lists, vectors and matrices. • R provides a large and integrated collection of tools for data analysis. • R provides graphical facilities for data analysis and display, either directly on screen or printed on paper.
  • 41. R - Environment Setup 1. Installation of R on Linux (Debian/Ubuntu, through the terminal) • Press Ctrl+Alt+T to open the terminal • Then execute sudo apt-get update • After that, execute sudo apt-get install r-base
  • 42. On Windows: Step – 1: Go to the CRAN (Comprehensive R Archive Network) R project website. Step – 2: Click on the Download R for Windows link. https://cran.r-project.org/bin/windows/base/ Step – 3: Click on the base subdirectory link or the install R for the first time link. Step – 4: Click Download R X.X.X for Windows (X.X.X stands for the latest version of R, e.g. 3.6.1) and save the executable .exe file. Step – 5: Run the .exe file and follow the installation instructions. 5.a. Select the desired language and then click Next. 5.b. Read the license agreement and click Next. 5.c. Select the components you wish to install (it is recommended to install all the components). Click Next. 5.d. Enter/browse the folder/path you wish to install R into and then confirm by clicking Next. 5.e. Select additional tasks such as creating desktop shortcuts, then click Next. 5.f. Wait for the installation process to complete. 5.g. Click Finish to complete the installation.
  • 43. Install RStudio on Windows Step – 1: With R-base installed, let’s move on to installing RStudio. To begin, go to the RStudio download page and click on the download button for RStudio Desktop. Step – 2: Click on the link for the Windows version of RStudio and save the .exe file. Step – 3: Run the .exe and follow the installation instructions. 3.a. Click Next on the welcome window. 3.b. Enter/browse the path to the installation folder and click Next to proceed. 3.c. Select the folder for the start menu shortcut, or choose not to create shortcuts, and then click Next. 3.d. Wait for the installation process to complete. 3.e. Click Finish to end the installation.
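  Once the installation steps above are finished, a quick check from the R console (or the RStudio console) can confirm that R runs. This is only a suggested sanity check, not part of the official installation procedure; the exact version string printed depends on the installed release.

```r
# Print the version string of the running R installation.
print(R.version.string)

# Packages that ship with base R should load without error.
library(stats)
library(graphics)
```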
  • 44. Syntax 1. To output text in R, use single or double quotes: • Example "Hello World!" 2. To output numbers, just type the number (without quotes): • Example 5 10 25 3. To do simple calculations, add numbers together: • Example 5 + 5
  • 45. R Print Output 1. Print: Unlike many other programming languages, you can output values in R without using a print function: Example "Hello World!" • However, R does have a print() function available if you want to use it. This might be useful if you are familiar with other programming languages, such as Python, which often use the print() function for output. Example print("Hello World!") • There are also times when you must use the print() function, for example when working with for loops. Example • for (x in 1:10) • { print(x) } • Otherwise it is up to you whether to use print(). However, when your code is inside an R expression (e.g. inside curly braces {} as in the example above), use the print() function to output the result.
  • 46. Comments • Comments can be used to explain R code and to make it more readable. They can also be used to prevent execution when testing alternative code. • Comments start with a #. When executing code, R ignores everything from the # to the end of the line. • This example uses a comment before a line of code: • Example • # This is a comment • "Hello World!" • This example uses a comment at the end of a line of code: • Example • "Hello World!" # This is a comment • Comments do not have to be text that explains the code; they can also be used to prevent R from executing code: • Example • # "Good morning!" • "Good night!" • Multiline comments: Unlike other programming languages, such as Java, R has no syntax for multiline comments. However, we can insert a # at the start of each line to create a multiline comment.
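  The multiline-comment workaround described above can be sketched as follows. The variable name greeting is an assumption added so the result can be inspected; the strings come from the slide.

```r
# This is the first line of a multiline comment.
# This is the second line; every line needs its own #.

# "Good morning!"           # commented out, so R never evaluates it
greeting <- "Good night!"   # a comment at the end of a line of code
print(greeting)
```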
  • 47. Creating Variables in R • Variables are containers for storing data values. • R does not have a command for declaring a variable. • A variable is created the moment you first assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name. • For example, after name <- "John" and age <- 40, name and age are variables, while "John" and 40 are values. • In other programming languages, it is common to use = as the assignment operator. • In R, we can use both = and <- as assignment operators. • However, <- is preferred in most cases because = is not treated as assignment in some contexts in R (for example, inside a function call it names an argument).
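  A short sketch of variable creation, using the name and age values from the slide. The variables city, values, total and avg are illustrative assumptions added to show how = and <- behave differently inside a function call.

```r
# A variable is created the moment a value is assigned to it.
name <- "John"
age <- 40

name   # typing the name at top level prints the value
age

# At top level, = also works as an assignment operator.
city = "Auckland"

# Inside a function call, <- still assigns: `values` is created as a side effect.
total <- sum(values <- c(1, 2, 3))

# Inside a function call, = names an argument instead of creating a variable.
avg <- mean(x = c(4, 5, 6))

print(total)
print(avg)
```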
  • 48. Print / Output Variables • Compared to many other programming languages, you do not have to use a function to print/output variables in R. You can just type the name of the variable. • However, R does have a print() function available if you want to use it. This might be useful if you are familiar with other programming languages, such as Python, which often use a print() function to output variables. • There are also times when you must use the print() function, for example when working with for loops (which you will learn more about in a later chapter).
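  The difference between automatic printing and print() can be sketched as below. The variable names are illustrative; capture.output() is used only to make the difference visible as data.

```r
name <- "John"

name          # at top level, R prints the value automatically
print(name)   # print() gives the same output explicitly

# Inside a for loop, a bare value is not printed; print() is required.
silent <- capture.output(for (i in 1:3) i)
shown  <- capture.output(for (i in 1:3) print(i))

length(silent)   # no lines were produced without print()
shown            # one line per iteration with print()
```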
  • 49. Concatenate Elements • You can concatenate, or join, two or more elements by using the paste() function. • To combine text and a variable, pass both to paste(), separated by a comma (,). • You can also use a comma inside paste() to join one variable to another. • For numbers, the + character works as a mathematical operator.
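  A minimal sketch of the concatenation rules above. The variable names and strings are assumptions chosen for illustration; paste() joins its arguments with a single space by default.

```r
# Combine text and a variable with paste(); arguments are separated by commas.
text <- "awesome"
sentence <- paste("R is", text)
print(sentence)

# Join one variable to another.
text1 <- "R is"
text2 <- "awesome"
joined <- paste(text1, text2)
print(joined)

# For numbers, + is a mathematical operator.
num1 <- 5
num2 <- 10
total <- num1 + num2
print(total)

# Using + on text is an error in R, so paste() is needed for strings.
plus_on_text <- tryCatch(text1 + text2, error = function(e) "error")
print(plus_on_text)
```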
  • 50. Multiple Variables • R allows you to assign the same value to multiple variables in one line by chaining assignments.
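  Chained assignment can be sketched as follows; the variable names and the value "Orange" are illustrative assumptions.

```r
# Assign the same value to three variables in one line.
var1 <- var2 <- var3 <- "Orange"

print(var1)
print(var2)
print(var3)
```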
  • 51. Variable Names • A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules for R variable names are: • A variable name must start with a letter or a period (.) and can be a combination of letters, digits, periods (.) and underscores (_). If it starts with a period, the period cannot be followed by a digit. • A variable name cannot start with a number or an underscore (_). • Variable names are case-sensitive (e.g. age, Age and AGE are three different variables). • Reserved words cannot be used as variable names (e.g. TRUE, FALSE, NULL, if, ...).
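  One way to check these naming rules is with base R's make.names(), which returns its input unchanged only when the input is already a syntactically valid name. The helper is_valid_name below is a hypothetical convenience wrapper written for this sketch, not a built-in function.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged.
is_valid_name <- function(n) make.names(n) == n

is_valid_name("age")            # starts with a letter
is_valid_name("total_volume")   # underscore allowed after the first character
is_valid_name(".hidden")        # may start with a period
is_valid_name("2fast")          # cannot start with a digit
is_valid_name("_private")       # cannot start with an underscore
is_valid_name(".2way")          # period followed by a digit is not allowed
is_valid_name("TRUE")           # reserved word
is_valid_name("if")             # reserved word

# Names are case-sensitive: these are three different variables.
age <- 1
Age <- 2
AGE <- 3
```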
  • 52. R - Data Types • In any programming language, you use variables to store various kinds of information. • Variables are reserved memory locations for storing values. • This means that when you create a variable, you reserve some space in memory. • You may want to store information of various data types such as character, wide character, integer, floating point, double floating point, Boolean, etc. Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory. • In contrast to languages like C and Java, variables in R are not declared with a data type. • Instead, variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable.
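  Because a variable takes its type from the object assigned to it, the same variable can hold different types over time. A small sketch, with illustrative names and values:

```r
# The type of a variable follows the object currently assigned to it.
x <- 42L
first_class <- class(x)     # integer

x <- "forty-two"
second_class <- class(x)    # character

x <- TRUE
third_class <- class(x)     # logical

print(c(first_class, second_class, third_class))
```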
  • 53. Data Types in R are: • Each R data type requires a different amount of memory and supports specific operations. • numeric – (3, 6.7, 121) • integer – (2L, 42L; where 'L' declares the value as an integer) • logical – (TRUE, FALSE) • complex – (7 + 5i; where 'i' is the imaginary unit) • character – ("a", "B", "c is third", "69") • raw – (as.raw(55); as.raw() converts a value to a raw byte)
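  Two points in the list above are easy to get wrong: a number without the L suffix is numeric rather than integer, and as.raw() converts a value to a byte whereas raw() creates a vector of a given length. A brief sketch, with illustrative variable names:

```r
# Without the L suffix, a whole number is still stored as numeric (double).
class(2)      # "numeric"
class(2L)     # "integer"
is.integer(2)
is.integer(2L)

# as.raw() converts a value to a single byte, displayed in hexadecimal.
r <- as.raw(55)
class(r)
print(r)               # 55 in decimal is 37 in hexadecimal
as.integer(r)          # converts the byte back to 55

# raw(n) creates a raw vector of length n, filled with zero bytes.
zeros <- raw(3)
length(zeros)
```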
  • 54. Data types and the values that each data type can take (basic data type: values; example) • Numeric: set of all real numbers; numeric_value <- 3.14 • Integer: set of all integers, Z; integer_value <- 42L • Logical: TRUE and FALSE; logical_value <- TRUE • Complex: set of complex numbers; complex_value <- 1 + 2i • Character: "a", "b", "c", …, "@", "#", "$", …, "1", "2", … etc.; character_value <- "Hello Geeks" • Raw: raw bytes, created with as.raw(); single_raw <- as.raw(255)
  • 55. Data Types (data type: example; description) • Logical: TRUE, FALSE; a special data type for data with only two possible values, which can be construed as true/false. • Numeric: 12, 32, 112, 5432; a decimal value is called numeric in R, and it is the default computational data type. • Integer: 3L, 66L, 2346L; here, L tells R to store the value as an integer. • Complex: z = 1+2i, t = 7+3i; a complex value in R has a real part and an imaginary part, where i is the imaginary unit. • Character: 'a', "good", "TRUE", '35.4'; a character value represents a string. Objects are converted into character values with the as.character() function. • Raw: a raw data type is used to hold raw bytes.
  • 56. Sample program • # numeric • x <- 10.5 • class(x) • # integer • x <- 1000L • class(x) • # complex • x <- 9i + 3 • class(x) • # character/string • x <- "R is exciting" • class(x) • # logical • x <- TRUE • class(x)

Editor's Notes

  1. Primary data collection sources include surveys, observations, experiments, questionnaires, personal interviews, etc. In contrast, secondary data collection sources are government publications, websites, books, journal articles and internal records.
  2. Munging – process of cleaning and transforming data prior to use or analysis. Data parsing is converting data from one format to another. Data scraping is a technique where a computer program extracts data from human-readable output coming from another program
  3. An idea or conclusion that is drawn from evidence and reasoning; to conclude or assume.
  4. An example of a linear relationship is the number of hours worked compared to the amount of money earned. 
  5. APT: Advanced Packaging Tool. sudo: "superuser do" or "substitute user do".
  6. It is up to you whether to use the print() function to output values. However, when your code is inside an R expression (for example inside curly braces {} as in the for loop example), use the print() function if you want to output the result.