Data science notes for ASDS calicut 2.pptx

The Current Landscape (with a Little History)
 Metamarkets CEO Mike Driscoll describes it this way: data science, as it’s practiced, is a blend
of Red-Bull-fueled hacking and espresso-inspired statistics.
 But data science is not merely hacking, and it is not merely statistics.
 Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools
and materials, coupled with a theoretical understanding of what’s possible.
 Drew Conway’s Venn diagram of data science

Population and Sample
In statistics, a population is the entire set of items from which you draw data for a statistical study.
It can be a group of individuals, objects, events, organizations, etc. It makes up the data pool for
a study, and you use it to draw conclusions.
An example of a population would be the entire student body at a school. It would contain all the
students who study in that school at the time of data collection. Depending on the problem
statement, data is collected from each of these students. For example, a study might ask which
students in the school speak Hindi.
In this situation, it is easy to collect data. The population is small, can be contacted, and is
willing to provide data, so the data collected is likely to be complete and reliable.
If you had to collect the same data from a larger population, say the entire country of India, it
would be very difficult to draw reliable conclusions because of geographical and accessibility
constraints, not to mention time and resource constraints. A lot of data would be missing or
might be unreliable. Furthermore, due to accessibility issues, marginalized tribes or villages
might not provide data at all, making the data biased towards certain regions or groups.

Sample
 A sample is a subset of the population that you actually collect data from and use to
represent the whole. Ideally, the sample is an unbiased subset that reflects the population
as closely as possible.
 To overcome the constraints of working with a whole population, you can collect data
from a subset of it and treat the results as representative. Because the data comes only
from the groups that took part in the study, it is easier to make complete and reliable. The
results obtained for those groups can then be extrapolated to the population.
 The process of collecting data from a small subsection of the population and
then using it to generalize over the entire set is called sampling.

 Samples are used when:
 The population is too large to collect data from every member.
 Data collected from the whole population would not be reliable.
 The population is hypothetical and unlimited in size.
 Take the example of a study that documents the results of a new medical procedure. It is unknown
how the procedure will affect people across the globe, so a test group is used to find out how people
react to it.
A sample should generally:
 Cover the different variations present in the population and follow a well-defined selection criterion.
 Be unbiased with respect to the properties of the objects being selected.
 Be chosen at random, so that the objects of study are selected fairly.
Say you are looking for a job in the IT sector, so you search online for IT jobs. The first search result
would be for jobs all around the world. But you want to work in India, so you search for IT jobs in India.
This would be your population. It would be impossible to go through and apply for all positions in the
listing. So you consider the top 30 jobs you are qualified for and satisfied with and apply for those. This
is your sample.
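The idea of drawing a random sample and using it to estimate a property of the population can be
sketched in Python. This is only an illustration: the population below is synthetic, and the names
(population, sample) and the seed are assumptions, not part of the original notes.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (synthetic data for illustration).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(18, 60) for _ in range(1000)]

# Simple random sample of 30 items, drawn without replacement.
sample = rng.sample(population, k=30)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean will usually be close to, but not equal to, the population mean; the gap tends to
shrink as the sample grows.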

Statistical Modeling
 Statistical modeling is the process of using statistical models to analyze a set of data. Statistical
models are mathematical representations of the observed data.
 Statistical modeling methods are a powerful tool for understanding consolidated data and
making generalized predictions from it. A statistical model can take the form of a
mathematical equation or a visual representation of the information.
Techniques in Statistical Modeling
There are several statistical modeling techniques used during data exploration. Here are some
of the common techniques:
1. Linear Regression
Linear regression uses a linear equation to model the relationship between a dependent variable
and one or more independent variables. If one independent variable is used to predict the
dependent variable, it is called simple linear regression. If more than one independent
variable is used, it is called multiple linear regression.
For example, a linear regression model may be used to estimate a person’s weight from their
height, age, and gender.
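A minimal sketch of simple linear regression (one independent variable) using the ordinary
least-squares formulas. The heights and weights are made-up numbers, and the function name is
an assumption chosen for this illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 74, 80]

slope, intercept = fit_simple_linear_regression(heights, weights)
predicted = slope * 175 + intercept  # estimated weight for a height of 175 cm
print(f"weight = {slope:.2f} * height + {intercept:.2f}")
print(f"estimate at 175 cm: {predicted:.1f} kg")
```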

2. Classification
Classification assigns data to different categories to allow for more accurate
prediction and analysis. This technique can enable effective analysis of very large data sets.
There are two major techniques under classification:
 Logistic Regression
When the dependent variable is binary (for example, yes/no), logistic regression is used to model
the relationship between that binary outcome and one or more predictor variables. The model
estimates the probability of the outcome. For example, a logistic regression model may be used
to predict whether a patient will have a heart attack based on age, blood pressure, and
cholesterol levels.
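A rough sketch of logistic regression with a single predictor, fitted by plain batch gradient
descent. The data is invented, and the learning rate and number of epochs are arbitrary
assumptions; in practice a library routine would normally be used instead.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up data: a standardized risk score and whether the event occurred (1) or not (0).
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(scores, outcomes)
p_low = sigmoid(w * -2.0 + b)   # predicted probability for a low score
p_high = sigmoid(w * 2.0 + b)   # predicted probability for a high score
print(f"P(event | score=-2) = {p_low:.3f}")
print(f"P(event | score=+2) = {p_high:.3f}")
```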
 Discriminant Analysis
Here, two or more groups are known a priori, and new observations are assigned to one of the known
groups based on their measured features. The distribution of the predictor variable X is modeled
separately in each of the response classes, and Bayes’ theorem is then used to calculate the
probability of each response class given the value of X.
Let us consider an example where discriminant analysis can be used.
 Consider that you are in charge of the loan department at ABC bank. The bank manager asks you to
find a better way to give loans so that bad debt and defaults are reduced. You have a financial
management background, so you decide to use discriminant analysis to understand the problem
and find a solution. A bank’s loan department building a credit risk profile from existing customers
in order to determine whether new loan applicants pose a credit risk is a canonical example of
discriminant analysis.
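The loan example can be sketched with a single predictor. The sketch below assumes X is normally
distributed within each class and estimates a separate mean and standard deviation per class, so it
is closer to quadratic than to linear discriminant analysis. The debt-to-income figures and class
labels are invented for illustration.

```python
import math
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def fit_classes(samples_by_class):
    """Estimate the prior, mean and standard deviation of X for each class."""
    total = sum(len(values) for values in samples_by_class.values())
    return {
        label: (len(values) / total, statistics.mean(values), statistics.stdev(values))
        for label, values in samples_by_class.items()
    }


def posterior(x, params):
    """Bayes' theorem: P(class | x) is proportional to prior * density of x in that class."""
    joint = {
        label: prior * gaussian_pdf(x, mean, std)
        for label, (prior, mean, std) in params.items()
    }
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


# Made-up debt-to-income ratios (%) of past customers, grouped by known outcome.
history = {
    "repaid": [10, 15, 20, 22, 25, 30],
    "defaulted": [35, 40, 45, 50],
}

params = fit_classes(history)
probs = posterior(42, params)              # new applicant with a ratio of 42%
predicted_class = max(probs, key=probs.get)
print(probs, predicted_class)
```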

 Resampling
In this technique, repeated samples are drawn from the original set of data,
creating an empirical sampling distribution based on actual data. It uses experimental
methods, as opposed to analytical methods, to create this sampling
distribution. Provided the samples are drawn without bias, the estimates obtained tend
to be unbiased as well.
Bootstrapping
 Samples are drawn from the original data with replacement, so some observations appear more
than once and others are left out. The observations that were not selected can be used to test
the model. The process is repeated several times and the average score is taken as the
estimate of the model performance.
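A small sketch of bootstrapping applied to the sample mean rather than to a full model, to keep it
short. The observations, the seed, and the choice of 1,000 rounds are assumptions made for this
illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [12, 15, 9, 20, 17, 11, 14, 18, 13, 16]  # made-up observations

bootstrap_means = []
out_of_bag_sizes = []
for _ in range(1000):
    # Draw len(data) indices with replacement.
    indices = [rng.randrange(len(data)) for _ in data]
    resample = [data[i] for i in indices]
    bootstrap_means.append(statistics.mean(resample))
    # Observations never drawn in this round are "out of bag".
    out_of_bag_sizes.append(len(data) - len(set(indices)))

estimate = statistics.mean(bootstrap_means)    # average over all rounds
std_error = statistics.stdev(bootstrap_means)  # spread of the bootstrap estimates
print(f"bootstrap estimate of the mean: {estimate:.2f} (std. error {std_error:.2f})")
```

When a model is being evaluated, the out-of-bag observations of each round play the role of the
test set.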
Cross-Validation
The training data is divided into k parts. Here, k - 1 parts are used as the training set,
and the one remaining part is used as the test set. This is repeated k times, each time with a
different part held out, and the average of the k scores is taken as the performance estimation.
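The k-fold procedure can be sketched as follows. The "model" here simply predicts the mean of the
training values, which is an assumption chosen to keep the example short; the data is made up, and
real data would normally be shuffled before splitting.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(values, k):
    """Score a 'predict the training mean' model by mean absolute error on each fold."""
    scores = []
    for train, test in k_fold_splits(len(values), k):
        prediction = sum(values[i] for i in train) / len(train)
        error = sum(abs(values[i] - prediction) for i in test) / len(test)
        scores.append(error)
    return sum(scores) / k, scores


values = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 7.0]  # made-up targets
average_error, fold_errors = cross_validate(values, k=5)
print(f"errors per fold: {fold_errors}")
print(f"average error:   {average_error:.3f}")
```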

 Non-linear Models
Here the data under observation is modeled using a function that is a non-linear combination of
model parameters and depends on one or more independent variables.
The data is then fitted using a method of successive approximations.
Example: gold price and inflation. Even if gold prices are stable to a great
extent, they are affected by inflation, crude oil prices, etc. The most important factor is
inflation, and at the same time gold prices can influence inflation
instability. Therefore, a deep understanding of the relationship between inflation
and gold price is a prerequisite.
In this case, nonlinear regression analysis is employed to analyze the data. The
dependent variable is gold price, and the independent variable is inflation. The
regression analysis results revealed that inflation impacts the gold price.
 Tree-Based Methods
In a tree-based method, the predictor space is segmented into a number of simple
regions. The set of splitting rules can be summarized in a tree, which gives the
decision-tree method its name. It can be used for both regression and classification
problems. Bagging, boosting, and random forests are some of the
approaches used in this method.
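To show what segmenting the predictor space means, here is a sketch of a single split (a one-level
regression tree, sometimes called a stump). It tries every candidate threshold and keeps the one
with the smallest squared error. The data and function name are assumptions for illustration; a
full tree would repeat this step inside each region.

```python
def best_split(xs, ys):
    """Find the threshold on x that minimizes the total squared error of two region means."""

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    # Skip the smallest x so that both regions are always non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[1]:
            best = (threshold, error, sum(left) / len(left), sum(right) / len(right))
    return best


# Made-up data with an obvious jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.0, 6.0, 5.5, 5.2, 12.0, 11.5, 12.5, 12.2]

threshold, error, left_mean, right_mean = best_split(xs, ys)
print(f"split at x < {threshold}: predict {left_mean:.2f}, otherwise {right_mean:.2f}")
```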

Unsupervised Learning
Unsupervised learning relies on the algorithm to identify patterns in the data. Here
the categories of the data are not known; that is, no labels are present. For example, in
clustering, closely related items are grouped together, making it a method of unsupervised learning.
Since no labels are present, unsupervised learning methods are typically
applied to build a concise representation of the data so we can derive meaningful
insight from it.
For example, if we were releasing a new product, we could use unsupervised
learning methods to identify the target market for it, because there is no
historical information about who the target customers are or
their demographics.
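Clustering can be sketched with k-means on one-dimensional data. The customer ages, the starting
centers, and the number of iterations are all assumptions made for this illustration; k-means is
just one of several clustering methods.

```python
def k_means_1d(values, centers, iterations=20):
    """Plain k-means on 1-D data: assign each value to its nearest center, then update centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


ages = [19, 21, 22, 24, 25, 47, 50, 52, 55, 58]  # made-up customer ages, no labels
centers, clusters = k_means_1d(ages, centers=[20, 30])
print(centers)
print(clusters)
```

No labels were supplied, yet the algorithm separates a younger and an older group.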
Time Series
A time series forecasting model predicts future values based on historical
values. It is used to identify the phenomenon represented by the data, and it can then be
combined with other data to make predictions for the future.
Time series forecasting predicts events over a
sequence of time. It does so by analyzing the trends of the past, on
the assumption that future trends will be similar to historical trends. It is used
across many fields of study, in applications including astronomy.
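One of the simplest forecasting rules, a moving average, illustrates predicting the next value from
recent history. This is a sketch of that one rule, not of time series modeling in general; the
sales figures and the window size are invented.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


monthly_sales = [100, 104, 101, 108, 112, 115]  # made-up historical values
forecast = moving_average_forecast(monthly_sales, window=3)
print(f"forecast for next month: {forecast:.1f}")
```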

Neural Networks
Modeled loosely on the human brain, these are algorithms designed to identify
patterns in the data. Neural networks have non-linear elements that process
information, called neurons. These are arranged in layers and normally executed in
parallel. Neural networks are being increasingly used to make predictions and
classifications as they have minimal demands on assumptions and model structure
and can approximate a wide range of models.

Probability Distribution
A probability distribution is a mathematical function that gives the probabilities
of occurrence of the possible outcomes of an experiment.
There are many types of probability distribution. The following five are among those
most used in data science:
 Normal distribution
 Binomial distribution
 Bernoulli distribution
 Uniform distribution
 Poisson distribution

Normal distribution
The normal distribution is the most important distribution because it approximately fits many
natural phenomena.
 For instance: height, blood pressure, IQ scores, etc.
 The normal distribution is also called the Gaussian distribution.
 In graphical form, the normal distribution appears as a "bell curve".
A normal distribution is a type of continuous probability distribution in which
most data points cluster toward the middle of the range, while the rest taper off
symmetrically toward either extreme. The middle of the range is also known as
the mean of the distribution.
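The clustering around the mean and the symmetry can be checked with NormalDist from Python's
standard statistics module. The mean of 100 and standard deviation of 15 are assumed values on an
IQ-style scale, used only for illustration.

```python
from statistics import NormalDist

# Hypothetical IQ-style scale: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

within_one_sd = iq.cdf(115) - iq.cdf(85)   # share of values within 1 standard deviation
within_two_sd = iq.cdf(130) - iq.cdf(70)   # share of values within 2 standard deviations
symmetric = abs(iq.pdf(90) - iq.pdf(110)) < 1e-12  # density is symmetric about the mean

print(f"within 1 sd: {within_one_sd:.4f}")
print(f"within 2 sd: {within_two_sd:.4f}")
print(f"symmetric about the mean: {symmetric}")
```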

Binomial Distribution
 The binomial distribution is a discrete distribution.
 It represents the probability of x successes in n trials,
given a success probability p in each trial.
 A distribution is called binomial if it satisfies the conditions below:
1. There is a fixed number of trials.
2. Each trial has only two possible outcomes.
3. The trials are independent.
4. The probabilities of success and failure remain the same in every trial.
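Under these conditions, the probability of exactly x successes in n trials is
C(n, x) * p^x * (1 - p)^(n - x). A short sketch, using a fair coin tossed 10 times as an assumed
example:

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Assumed example: probability of exactly 3 heads in 10 tosses of a fair coin.
p_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities over all possible counts (0..10) add up to 1.
total = sum(binomial_pmf(x, 10, 0.5) for x in range(11))

print(f"P(3 heads in 10 tosses) = {p_three_heads:.4f}")
```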

Bernoulli Distribution
 The Bernoulli distribution is the simplest of all the distributions.
 It is similar to the binomial distribution. The only difference is that it takes only one
trial, while the binomial distribution considers n trials.
 It has only two possible outcomes, i.e. success or failure.
Let’s consider a random variable X with a single parameter p, which represents the
probability of occurrence of the event.
 Its probability mass function is given as:
P[X=1]=p
P[X=0]=1-p
Where,
X=1 indicates the event has occurred
X=0 indicates the event has not occurred
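A brief sketch of the two probabilities above, together with a simulation of repeated single
trials. The value p = 0.3, the seed, and the number of simulated trials are assumptions for
illustration.

```python
import random


def bernoulli_pmf(x, p):
    """P[X=1] = p and P[X=0] = 1 - p; any other value of x is not a valid outcome."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable takes only the values 0 and 1")
    return p if x == 1 else 1 - p


rng = random.Random(1)  # fixed seed so the run is repeatable
p = 0.3                 # assumed probability that the event occurs

# Simulate 10,000 independent single trials.
trials = [1 if rng.random() < p else 0 for _ in range(10_000)]
observed_rate = sum(trials) / len(trials)

print(f"P[X=1] = {bernoulli_pmf(1, p)}, observed share of 1s = {observed_rate:.3f}")
```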

Uniform Distribution
 A distribution is said to be uniform if all the outcomes of an event
have equal probabilities.
 The uniform distribution is also called the rectangular distribution.
 The expected value of a uniform distribution tells us little about which outcome will occur.
 Since each outcome is equally likely, the mean and variance, although well defined, are
hard to interpret for individual outcomes.
 For the same reason, it has no predictive power.
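As a concrete case, a fair six-sided die follows a discrete uniform distribution: each face has
probability 1/6. The sketch below (the die is an assumed example) computes the mean and variance,
which exist as numbers even though they do not single out any outcome as more likely.

```python
import statistics

faces = [1, 2, 3, 4, 5, 6]              # outcomes of a fair die
probability = 1 / len(faces)            # every outcome has the same probability

mean = statistics.mean(faces)           # expected value of one roll
variance = statistics.pvariance(faces)  # population variance of one roll

print(f"P(each face) = {probability:.4f}")
print(f"mean = {mean}, variance = {variance:.4f}")
```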

Poisson Distribution
 The Poisson distribution is a discrete probability distribution.
 It is a distribution of counts, i.e. the number of times an event
occurs in a given interval of time.
 It can be used to compute the probability of a given number of
events occurring in a specific interval of time.
 For example, if a call center receives 50 calls per hour on average, then using the Poisson
distribution we can compute the probability of getting 20 calls in the next 30 minutes.
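The call center example can be worked through with the Poisson formula
P(X = k) = e^(-lambda) * lambda^k / k!. The sketch assumes that 50 calls per hour is the long-run
average rate, so the expected number of calls in 30 minutes is 25.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected in the interval."""
    return exp(-lam) * lam ** k / factorial(k)


calls_per_hour = 50                    # assumed average rate from the example
rate_30_min = calls_per_hour * 0.5     # expected calls in a 30-minute interval
p_20_calls = poisson_pmf(20, rate_30_min)

print(f"P(exactly 20 calls in 30 minutes) = {p_20_calls:.4f}")
```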
