Data science

1. What is data? Categories Of Data.
2. What Is Statistics? Basic Terminologies In Statistics.
Data: In computing, data is information that has been translated into a form that is efficient for movement or processing.
Relative to today's computers and transmission media, data is information converted into binary digital form. It is
acceptable for data to be used as a singular subject or a plural subject. Raw data is a term used to describe data in its most
basic digital format.
Categories of Data: Data can be divided into two categories:
1. Qualitative Data (Nominal Data, Ordinal Data)
2. Quantitative Data (Discrete Data, Continuous Data)
Qualitative data deals with characteristics and descriptors that can’t be easily measured, but can be observed subjectively.
Quantitative data deals with numbers and things you can measure objectively.
Statistics: Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and
presentation.
Basic Terminologies In Statistics: Before you dive deep into Statistics, it is important that you understand the basic
terminologies used in Statistics. The two most important terminologies in statistics are population and sample.
Population: A collection or set of individuals or objects or events whose properties are to be analysed.
Sample: A subset of the population is called ‘Sample’. A well-chosen sample
will contain most of the information about a particular population parameter.

3. Sampling Techniques?
4. Types of Statistics? Descriptive Statistics? Understanding Descriptive Statistics? Measures of the center
& Measures of The Spread?
Sampling Techniques: Sampling is one of the most important factors determining the accuracy of a research or survey result: any flaw in the
sample is reflected directly in the final result. Many techniques exist for gathering samples, and the right one depends on the need and the
situation. They fall into two groups:
1. Probability Sampling (Random, Systematic, Stratified)
2. Non-Probability Sampling (Snowball, Quota, Convenience, Judgmental)
Types of Statistics: There are 2 well-defined types of statistics: 1. Descriptive Statistics 2. Inferential Statistics
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire
population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability
(spread).
With inferential statistics, you take data from samples and make generalizations about a population. This means taking a statistic from your
sample data (for example the sample mean) and using it to say something about a population parameter (for example the population mean).
Hypothesis tests are a common tool of inferential statistics.
Understanding Descriptive Statistics: Descriptive Statistics is broken down into 2 categories: 1. Measures of Central Tendency 2. Measures
of Variability (spread)
Measures of the center are statistical measures that summarize a dataset with a single typical value. There are three main measures of center.
Mean: The average of all the values in a sample. Median: The central value of the sorted sample. Mode: The value that occurs most often in
the sample.
A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population. The main
measures of spread are the range, the interquartile range, the variance and the standard deviation.
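The measures of center and spread listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the `sample` values are hypothetical and chosen only for illustration, and the quartiles use the module's default ("exclusive") method, which is one of several conventions.

```python
import statistics

# Hypothetical sample, used only for illustration.
sample = [2, 4, 4, 5, 7, 9, 11]

# Measures of center
mean = statistics.mean(sample)      # arithmetic average
median = statistics.median(sample)  # middle value of the sorted data
mode = statistics.mode(sample)      # most frequent value

# Measures of spread
data_range = max(sample) - min(sample)
# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1
variance = statistics.variance(sample)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sample)      # square root of the sample variance

print(mean, median, mode, data_range, iqr, variance, round(std_dev, 3))
```

For a whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by n instead of n - 1.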

5. Information Gain & Entropy. Confusion Matrix
6. What is Probability? What is the relation between Statistics and Probability?
Entropy measures the impurity or uncertainty present in the data. It can be measured by using the below formula:
H(S) = − Σ (i = 1 to N) p_i · log2(p_i)
where: S – set of all instances in the dataset; N – number of distinct class values; p_i – probability (proportion) of class i in S.
Information Gain (IG) indicates how much “information” a particular feature/variable gives us about the final outcome. It can
be measured by using the below formula:
IG(A, S) = H(S) − Σ (j ∈ v) (|S_j| / |S|) · H(S_j) = H(S) − H(A, S)
Here:
H(S) – entropy of the whole dataset S; |S_j| – number of instances with value j of an attribute A;
|S| – total number of instances in dataset S; v – set of distinct values of an attribute A;
H(S_j) – entropy of the subset of instances with value j of attribute A; H(A, S) – weighted entropy of S after splitting on attribute A.
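Entropy and information gain can be sketched in a few lines of Python. The formulas assumed here are the standard ones, H(S) = −Σ p_i · log2(p_i) and IG(A, S) = H(S) − Σ (|S_j| / |S|) · H(S_j); the function names and the toy `weather`/`buys` data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the distinct class values in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(A, S) = H(S) - sum(|S_j| / |S| * H(S_j)) over the values j of attribute A."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)  # group labels by attribute value
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: the attribute separates the two classes perfectly.
weather = ["sunny", "sunny", "rainy", "rainy"]
buys = ["yes", "yes", "no", "no"]

h = entropy(buys)                     # two equally likely classes: 1 bit
ig = information_gain(weather, buys)  # a perfect split removes all uncertainty
print(h, ig)
```

An attribute that tells us nothing about the class would leave every subset as mixed as the whole dataset, giving an information gain of 0.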
A Confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect
predictions are summarized with count values and broken
down by each class. This is the key to the confusion matrix.
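As an illustration, a confusion matrix for a two-class problem can be built by counting (actual, predicted) pairs. This is a minimal sketch; the labels and predictions are hypothetical, and "spam" is assumed to be the positive class.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels and model predictions.
actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]

cm = confusion_matrix(actual, predicted)
tp = cm[("spam", "spam")]  # true positives: spam predicted as spam
fn = cm[("spam", "ham")]   # false negatives: spam predicted as ham
fp = cm[("ham", "spam")]   # false positives: ham predicted as spam
tn = cm[("ham", "ham")]    # true negatives: ham predicted as ham

accuracy = (tp + tn) / len(actual)
print(tp, fn, fp, tn, round(accuracy, 3))
```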
Probability is the measure of how likely an event will occur. To be more precise probability is the ratio of desired outcomes to
total outcomes: (desired outcomes) / (total outcomes).
Relation between Probability and Statistics: Probability and Statistics are related fields. Probability is a mathematical
method used for statistical analysis. Therefore we can say that probability and statistics are interconnected branches of
mathematics that deal with analyzing the relative frequency of events.

7. Terminologies in Probability 8. Probability Distribution
Terminologies In Probability: Before you dive deep into the concepts of probability, it is important that you understand the
basic terminologies used in probability:
Random Experiment: An experiment or a process for which the outcome cannot be predicted with certainty.
Sample space: The entire possible set of outcomes of a random experiment is the sample space of that experiment.
Event: One or more outcomes of an experiment is called an event. It is a subset of sample space. There are two types of events
in probability:
Disjoint Event: Disjoint Events do not have any common outcomes. For example, a single card drawn from a deck cannot be a king and a queen.
Non – Disjoint Event: Non-Disjoint Events can have common outcomes. For example, a student can get 100 marks in statistics and 100 marks in probability
Probability Distribution:
Here we shall focus on three main topics: 1. Probability Density Function 2. Normal Distribution
3. Central Limit Theorem.
Probability Density Function: The probability distribution of a continuous variable is described by a probability density
function (PDF), also called a probability density or just density. The integral of the probability density function over a particular
interval gives you the probability that a random variable takes a value in this interval.
Normal Distribution: In normally distributed data, there is a constant proportion of data points lying under the curve between
the mean and a specific number of standard deviations from the mean. Thus, for a normal distribution, almost all values lie
within 3 standard deviations of the mean.
The Central Limit Theorem (CLT) states that, for a population with any distribution that has a finite variance, the distribution
of the sample mean approaches a normal distribution provided the sample size n is large enough. The following properties hold:
Sampling distribution mean (μ_x̄) = Population mean (μ); Sampling distribution's standard deviation (standard error) = σ/√n ≈ s/√n.
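The two properties can be checked with a small simulation. This is a sketch under stated assumptions: the population is a hypothetical exponential distribution with mean 1 and standard deviation 1 (clearly not normal), and the sample size `n` and number of repetitions `num_samples` are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 50              # size of each sample
num_samples = 2000  # number of repeated samples

# Draw many samples from a skewed population and record each sample mean.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)  # should be close to mu = 1
observed_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = 1.0 / n ** 0.5                # sigma / sqrt(n)

print(round(mean_of_means, 3), round(observed_se, 3), round(theoretical_se, 3))
```

A histogram of `sample_means` would look approximately bell-shaped even though the underlying population is strongly skewed.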

9. Types Of Probability
10. Bayes’Theorem
Types Of Probability: 1. Marginal Probability: The probability of an event occurring (p(A)), unconditioned on any other
events. For example, the probability that a card drawn from a standard 52-card deck is a 3: p(three) = 4/52 = 1/13.
2. Joint Probability is a measure of two events happening at the same time, i.e., p(A and B), The probability of event A and
event B occurring. It is the probability of the intersection of two or more events. The probability of the intersection of A and
B may be written p(A ∩ B). For example, the probability that a card is a four and red =p(four and red) = 2/52=1/26.
3. Conditional Probability: The probability of an event or outcome given that another event or outcome has occurred.
The conditional probability of an event B is the probability that B will occur given that an event A has already occurred.
* p(B|A) is the probability of event B occurring, given that event A occurs.
* In general, and in particular when A and B are dependent events, the conditional probability is given by: P(B|A) = P(A and B) / P(A), provided P(A) > 0.
* If A and B are independent events, this reduces to: P(B|A) = P(B).
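The card examples for marginal, joint and conditional probability can be verified by enumerating a standard 52-card deck. This is a minimal sketch using exact fractions; the helper names `prob`, `is_four` and `is_red` are hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) outcomes

def prob(event):
    """Probability of an event = favourable outcomes / total outcomes."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_four(card):
    return card[0] == "4"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

p_three = prob(lambda card: card[0] == "3")                       # marginal
p_four_and_red = prob(lambda card: is_four(card) and is_red(card))  # joint
p_four_given_red = p_four_and_red / prob(is_red)                  # conditional

print(p_three, p_four_and_red, p_four_given_red)
```

Because rank and colour are independent, the conditional probability of a four given a red card equals the marginal probability of a four.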
Bayes’ Theorem: Bayes' theorem is an extension of conditional probability. Conditional probability helps us to determine
the probability of A given B, denoted by P(A|B). Bayes' theorem says that if we know P(A|B) then we can determine P(B|A),
given that P(A) and P(B) are known to us:
P(B|A) = P(A|B) · P(B) / P(A)
In the above equation:
P(A|B): Conditional probability of event A occurring, given the event B
P(A): Probability of event A occurring
P(B): Probability of event B occurring
P(B|A): Conditional probability of event B occurring, given the event A
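A worked example makes the theorem concrete. The sketch below assumes the standard form P(B|A) = P(A|B) · P(B) / P(A) and a hypothetical diagnostic test, where B is "has the disease" and A is "tests positive"; all numbers are invented for illustration.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Hypothetical numbers for a diagnostic test.
p_disease = 0.01            # P(B): prior probability of having the disease
p_pos_given_disease = 0.95  # P(A|B): positive test given disease
p_pos_given_healthy = 0.05  # P(A|not B): false positive rate

# Law of total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 3))
```

Even with a fairly accurate test, the probability of disease after a positive result stays low here because the disease is rare.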

11. Statistical inference 12. What is Point Estimation?
13. What is Interval Estimation? 14. Hypothesis Testing
15. Estimating Level of Confidence
Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability. It is
assumed that the observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive
statistics.
The process of point estimation involves the utilization of the value of a statistic that is obtained with the help of sample data to
determine the best estimate of the corresponding unknown parameter of the population. Two important terminologies on Point
Estimation are:
I. Estimator: A function f(x) of the sample, that is used to find out the estimate.
II. Estimate: The Realised value of an estimator.
Interval Estimation involves computing an interval, or range of values, within which the parameter is most likely to be located.
For example, we say that we are 95% confident that the average customer satisfaction lies between 6 and 8.
Hypothesis testing is a part of statistical analysis, where we test the assumptions made regarding a population parameter. It is
generally used when we want to compare a single group with an external standard, or two or more groups with each other.
Estimating Level Of Confidence: In statistics, a confidence interval (CI) is a type of estimate computed from the statistics of
the observed data. It proposes a range of plausible values for an unknown parameter. The confidence level is the long-run
proportion of intervals built this way, over repeated samples, that would contain the unknown parameter. A confidence
interval is constructed in four steps:
1. Identify a Sample Statistic: Choose the statistic that you will use to estimate a population parameter (ex: mean of the sample).
2. Select a Confidence Level: The confidence level describes the uncertainty of a sampling method (commonly 90%, 95% or 99%).
3. Find the Margin of Error: Margin of Error = Critical Value × Standard Error of the statistic.
4. Specify the Confidence Interval: Confidence Interval = Sample Statistic ± Margin of Error.
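The steps for building a confidence interval can be sketched as follows. The `scores` data are hypothetical, and the sketch assumes a normal (z) critical value for simplicity; for a small sample like this one, a t critical value would usually be more appropriate.

```python
import statistics
from statistics import NormalDist

# Hypothetical customer-satisfaction scores.
scores = [6.2, 7.1, 6.8, 7.4, 6.9, 7.3, 6.5, 7.0, 7.2, 6.6]

# Step 1: sample statistic (here the sample mean).
n = len(scores)
sample_mean = statistics.mean(scores)

# Step 2: confidence level.
confidence = 0.95

# Step 3: margin of error = critical value * standard error.
standard_error = statistics.stdev(scores) / n ** 0.5
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
margin = z * standard_error

# Step 4: confidence interval = sample statistic +/- margin of error.
ci = (sample_mean - margin, sample_mean + margin)
print(round(sample_mean, 2), tuple(round(v, 2) for v in ci))
```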

16. What is the central limit theorem and why is it important?
17. What is sampling? How many sampling methods do you know?
27. What is selection bias?
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample
size gets larger, regardless of the population's distribution. Sample sizes equal to or greater than 30 are often considered
sufficient for the CLT to hold.
The Central Limit Theorem is important for statistics because it allows us to safely assume that the sampling distribution of
the mean will be normal in most cases. This means that we can take advantage of statistical techniques that assume a normal
distribution, as we will see in the next section.
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to
identify patterns and trends in the larger data set being examined. There are 2 types of sampling techniques.
Probability Sampling: In probability sampling, every element of the population has a known, non-zero chance of being selected
(an equal chance in simple random sampling). Probability sampling gives us the best chance to create a sample that is truly
representative of the population.
Non-Probability Sampling: In non-probability sampling, not all elements have a known or equal chance of being selected.
Consequently, there is a significant risk of ending up with a non-representative sample which does not produce generalizable
results.
Selection bias: Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way
that proper randomization is not achieved, thereby failing to ensure that the sample obtained is representative of the
population intended to be analyzed.

18. What is the difference between type I vs type II error?
Basis for comparison | Type I error | Type II error
Definition | The error caused by rejecting a null hypothesis when it is true. | The error that occurs when the null hypothesis is not rejected when it is false.
Also termed | Type I error is equivalent to a false positive. | Type II error is equivalent to a false negative.
Meaning | It is a false rejection of a true hypothesis. | It is the false acceptance of an incorrect hypothesis.
Symbol | The probability of a type I error is denoted by α. | The probability of a type II error is denoted by β.
Probability | The probability of type I error is equal to the level of significance. | The probability of type II error is equal to one minus the power of the test.
Reduced | It can be reduced by decreasing the level of significance. | It can be reduced by increasing the level of significance or the sample size.
Cause | It is caused by luck or chance. | It is caused by a smaller sample size or a less powerful test.
What is it? | Type I error is similar to a false hit. | Type II error is similar to a miss.
Hypothesis | Type I error is associated with rejecting the null hypothesis. | Type II error is associated with failing to reject the null hypothesis.

19. What is linear regression? What do the terms p-value, coefficient, and r-squared value mean?
What is the significance of each of these components?
20. What are the assumptions required for linear regression?
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more
explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is
called multiple linear regression.
In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical
hypothesis test, assuming that the null hypothesis is correct. The p-value is used as an alternative to rejection points to provide
the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger
evidence in favor of the alternative hypothesis.
Coefficient indicates the direction and size of the relationship between a predictor variable and the response variable. A negative sign
indicates that as the predictor variable (x) increases, the response variable (y) decreases.
R-squared (R²) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained
by an independent variable or variables in a regression model.
A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of
factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear
regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a
positive or negative relationship.
Assumptions of Linear Regression: There are four assumptions associated with a linear regression model:
*Linearity: The relationship between X and the mean of Y is linear.
*Homoscedasticity: The variance of the residuals is the same for any value of X.
*Independence: Observations are independent of each other (no auto-correlation).
*Normality: For any fixed value of X, the residuals are normally distributed.
For multiple linear regression, there should also be no or little multicollinearity among the explanatory variables.
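Simple linear regression and R-squared can be computed directly from their least-squares definitions. This is a minimal sketch in plain Python; the house `sizes` and `prices` are hypothetical values, and the function names are chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Proportion of the variance of y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: house size (square metres) and price (thousands).
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 300, 350]

slope, intercept = fit_line(sizes, prices)
r2 = r_squared(sizes, prices, slope, intercept)
print(slope, intercept, round(r2, 4))
```

The positive slope says that price rises with size in this toy data, and an R² close to 1 says the line explains almost all of the variation in price.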

21. What is Data Science? 22. Who is a Data Scientist?
23. What are the skills of a Data Scientist? 24. What does a Data Scientist do?
25. Data acquisition, Data Preparation, Data mining, Model Building, Model Maintenance.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and
insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of
application domains.
A data scientist is a professional responsible for collecting, analyzing and interpreting extremely large amounts of data. The data
scientist role is an offshoot of several traditional technical roles, including mathematician, scientist, statistician and computer
professional.
Skills of a Data Scientist: As a Data Scientist, you'll be responsible for jobs that span three domains of skills: statistical/mathematical
reasoning, business communication/leadership, and programming. Below are seven essential skills for data scientists:
*Python programming *R programming *Hadoop platform *SQL databases
*Machine learning and AI *Data visualization *Business strategy
What a Data Scientist does: In simple terms, a data scientist's job is to analyze data for actionable insights. Specific tasks include identifying
the data-analytics problems that offer the greatest opportunities to the organization.
Data acquisition: Data acquisition is the process of sampling signals that measure real-world physical conditions and converting the
resulting samples into digital numeric values that can be manipulated by a computer.
Data preparation is the act of manipulating raw data into a form that can readily and accurately be analyzed, e.g. for business
purposes.
Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine
learning, statistics, and database systems.
The model building process involves setting up ways of collecting data, understanding and paying attention to what is important in
the data to answer the questions you are asking, finding a statistical, mathematical or a simulation model to gain understanding and
make predictions.
Model Maintenance: Model maintenance is the ongoing monitoring of a deployed model and, when its performance degrades because
the data or the business conditions have changed, retraining or updating it.

26. List the differences between supervised and unsupervised learning.
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be categorized into Clustering and Association problems.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning model generally produces more accurate results, since labels are available to guide and check it. | Unsupervised learning model may give less accurate results as compared to supervised learning.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering (for example K-means) and the Apriori algorithm.

28. What is bias-variance trade-off? 30. What do you understand by the term normal distribution?
29. What is a confusion matrix? 31. What is correlation and covariance in statistics?
32. What is the difference between point estimates and confidence interval?
The Bias-Variance Trade-Off: Data scientists building machine learning algorithms are forced to make decisions about the
level of bias and variance in their models. A model that exhibits small variance and high bias will underfit the target, while a
model with high variance and little bias will overfit the target.
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect
predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix: it shows
the ways in which the classification model is confused when it makes predictions.
Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean,
showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, normal
distribution will appear as a bell curve. In normally distributed data, there is a constant proportion of data points lying under
the curve between the mean and a specific number of standard deviations from the mean. Thus, for a normal distribution,
almost all values lie within 3 standard deviations of the mean.
Correlation and Covariance in statistics: Both are widely used in data science to compare two variables. Covariance measures
how two random variables vary together: it is positive when they tend to increase together and negative when one tends to
decrease as the other increases, but its size depends on the units of the variables. Correlation is covariance scaled by the two
standard deviations, so it always lies between -1 and 1 and states how strongly a change in one variable is associated with a
change in the other.
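The difference between the two measures shows up clearly in code. This sketch uses the sample covariance (dividing by n - 1) and the Pearson correlation; the temperature and sales figures are hypothetical.

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = covariance(xs, xs) ** 0.5
    std_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical data: temperature and ice cream sales.
temps = [20, 25, 30, 35]
sales = [100, 150, 200, 250]
sales_in_cents = [s * 100 for s in sales]

cov = covariance(temps, sales)
corr = correlation(temps, sales)

# Changing the units changes the covariance but not the correlation.
cov_cents = covariance(temps, sales_in_cents)
corr_cents = correlation(temps, sales_in_cents)
print(round(cov, 2), round(corr, 2), round(cov_cents, 2), round(corr_cents, 2))
```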
Point Estimates and Confidence Interval: Using descriptive and inferential statistics, you can make two types of estimates
about the population: point estimates and interval estimates.
* A point estimate is a single value estimate of a parameter. For instance, a sample mean is a point estimate of a population mean.
* An interval estimate gives you a range of values where the parameter is expected to lie. A confidence interval is the most common type of interval
estimate.
Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.

33. What is the goal of A/B testing? 34. What is P-value?
35. What are the differences between over-fitting and under-fitting?
36. How to combat overfitting and underfitting? 37. What is regularization? Why is it useful?
A/B testing goal: An A/B test compares two variants (A and B) on randomly assigned groups to decide whether the difference in a
chosen metric is real or due to chance. It enables us to quantify our effect size and errors, and so calculate the probability
that we have made a type I or type II error. Only once we understand the true effect size and robustness of
our results can we proceed to making business-impact decisions.
P-value: “A p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample
mean difference between two compared groups) would be equal to or more extreme than its observed value.” In other words, the
p-value is the probability, under the null hypothesis, that random chance alone would produce data at least as extreme as what
was observed. We calculate the p-value for a sample statistic (for example, the sample mean).
Over-fitting and Under-fitting: Overfitting is a modeling error which occurs when a function is too closely fit to a limited set
of data points. Underfitting refers to a model that can neither model the training data nor generalize to new data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting
occurs if the model or algorithm shows low variance but high bias.
Combat overfitting and underfitting:
Handling Overfitting: Get more training data. Use regularization: methods like L1 and L2 reduce overfitting by modifying the
cost function. Use dropout: dropout is a regularization technique that prevents neural networks from overfitting and, unlike
L1 and L2, modifies the network itself.
Handling Underfitting: Increase the size or number of parameters in the model. Increase the complexity of the model.
Increase the training time, until the cost function is minimized.
Regularization: This is a form of regression that constrains, regularizes or shrinks the coefficient estimates towards zero. In
other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
A well-chosen regularization significantly reduces the variance of the model without a substantial increase in its bias. The
tuning parameter λ used in regularization techniques such as L1 and L2 controls this trade-off between bias and variance.

38. Explain how a ROC curve works?
39. Why we generally use soft-max (or sigmoid) non-linearity function as last operation in-network? Why ReLU in an
inner layer?
40. How does data cleaning play a vital role in the analysis?
ROC curve: A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary
classifiers. It was first used in signal detection theory but is now used in many other areas such as medicine, radiology, natural
hazards and machine learning. The ROC curve is produced by calculating and plotting the true positive rate against the false
positive rate for a single classifier at a variety of thresholds. For example, in logistic regression, the threshold is applied to the
predicted probability of an observation belonging to the positive class.
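The construction of a ROC curve can be sketched by sweeping the threshold over the predicted scores and recording one (false positive rate, true positive rate) point per threshold. The `labels` and `scores` below are hypothetical, and the area under the curve is computed with the trapezoidal rule as one common summary.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical true classes (1 = positive) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points, round(area, 3))
```

A classifier that ranks every positive above every negative has an area of 1, while random scoring gives about 0.5.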
Soft-max (or sigmoid) non-linearity function as last operation: It is used because it takes in a vector of real numbers and returns
a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever, there are
no constraints). Then the i’th component of Softmax(x) is:
Softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
It should be clear that the output is a probability distribution: each element is non-negative and
the sum over all components is 1.
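A minimal softmax sketch in plain Python, assuming the standard definition exp(x_i) / Σ_j exp(x_j). Subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result; the input vector is hypothetical.

```python
from math import exp

def softmax(x):
    """Map a list of real numbers to a probability distribution."""
    shift = max(x)  # subtracting the max avoids overflow in exp()
    exps = [exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
print([round(p, 3) for p in probs], round(sum(probs), 6))
```

Larger inputs receive larger probabilities, and the outputs always sum to 1, which is why softmax suits the last layer of a classifier.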
Data cleaning Vital role: Data cleaning can help in analysis because:
* Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work
with.
* Data cleaning helps to increase the accuracy of the model in machine learning.
* It is a cumbersome process: as the number of data sources increases, the time taken to clean the data increases
sharply because of the number of sources and the volume of data generated by these sources.
* It might take up to 80% of the time for just cleaning data, making it a critical part of the analysis task.

41. Differentiate between univariate, bivariate and multivariate analysis?
42. Explain Star Schema? 43. What is cluster sampling?
44. What is systematic sampling?
* Univariate statistics summarize only one variable at a time. The analysis of univariate data is thus the simplest form of
analysis since the information deals with only one quantity that changes. It does not deal with causes or relationships and the
main purpose of the analysis is to describe the data and find patterns that exist within it. An example of univariate data
is height.
* Bivariate statistics compare two variables. The analysis of this type of data deals with causes and relationships and the
analysis is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice
cream sales in the summer season.
* Multivariate statistics compare more than two variables, so multivariate analysis is the analysis of three or more variables.
There are many ways to perform multivariate analysis depending on your goals.
A star schema is a database organizational structure optimized for use in a data warehouse or business intelligence that uses a
single large fact table to store transactional or measured data, and one or more smaller dimensional tables that store attributes
about the data. Examples of fact data include sales price, sale quantity, and time, distance, speed and weight measurements.
Cluster sampling is a probability sampling technique where researchers divide the population into multiple groups (clusters)
for research. Researchers then select random groups with a simple random or systematic random sampling technique for data
collection and data analysis. An example of single-stage cluster sampling – An NGO wants to create a sample of girls
across five neighboring towns to provide education. Using single-stage sampling, the NGO randomly selects towns
(clusters) to form a sample and extend help to the girls deprived of education in those towns.
Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected
according to a random starting point but with a fixed, periodic interval. This interval, called the sampling interval, is calculated
by dividing the population size by the desired sample size.
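Systematic sampling as described above can be sketched in a few lines. The population of 100 numbered members and the sample size of 10 are hypothetical, and the sketch assumes the population size is a multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Pick a random start, then take every k-th member (k = sampling interval)."""
    interval = len(population) // sample_size  # population size / desired sample size
    start = rng.randrange(interval)            # random starting point in the first interval
    return [population[start + i * interval] for i in range(sample_size)]

population = list(range(1, 101))  # hypothetical members numbered 1..100
sample = systematic_sample(population, 10, random.Random(42))
print(sample)
```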

45. What are eigenvectors and eigenvalues?
46. Can you explain the difference between a Validation set and a Test set?
47. Explain cross-validation? 48. What is machine learning?
49. What is supervised learning? 50. What is unsupervised learning?
In principal component analysis, the eigenvectors of the data's covariance matrix are called principal axes or principal
directions of the data. Projections of the data on the principal axes are called principal components. We reduce the
dimensionality of data by projecting it onto fewer principal directions than its original dimensionality.
The eigenvalue is the scaling factor by which the eigenvector is contracted or elongated. Mathematically, a non-zero vector x is an
eigenvector of a square matrix A if: Ax = λx, with λ (pronounced “lambda”) being the eigenvalue corresponding to the eigenvector x.
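The defining relation Ax = λx can be checked numerically for a small matrix. This sketch uses a hypothetical 2×2 matrix whose eigenpairs are easy to verify by hand; the helper names are chosen for illustration.

```python
def mat_vec(a, x):
    """Multiply matrix a (a list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def is_eigenpair(a, x, lam, tol=1e-9):
    """Check whether a @ x equals lam * x component by component."""
    return all(abs(ax_i - lam * x_i) <= tol for ax_i, x_i in zip(mat_vec(a, x), x))

A = [[2, 1],
     [1, 2]]

print(mat_vec(A, [1, 1]))           # [3, 3]: x is stretched by a factor of 3
print(is_eigenpair(A, [1, 1], 3))   # eigenvalue 3 with eigenvector (1, 1)
print(is_eigenpair(A, [1, -1], 1))  # eigenvalue 1 with eigenvector (1, -1)
print(is_eigenpair(A, [1, 0], 2))   # (1, 0) is not an eigenvector of A
```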
Validation set: A set of examples used to tune the hyperparameters of a classifier, for example to choose the number of hidden units
in a neural network. The validation set is used for choosing between model configurations.
Test set: A set of examples used only to assess the performance of a fully-specified classifier. These are the recommended
definitions and usages of the terms. The test set is used to evaluate the performance of the final model on unseen (real-world) data.
Cross-validation is a technique for assessing how the results of a statistical analysis generalise to an independent data set. It is a
technique for evaluating machine learning models by training several models on subsets of the available input data and
evaluating them on the complementary subset of the data.
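The splitting scheme behind k-fold cross-validation can be sketched without any library. The function name `k_fold_indices` and the choice of 10 samples and 3 folds are hypothetical; in practice the indices are usually shuffled first, which is omitted here to keep the output easy to follow.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]  # the complementary subset
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each sample is used for validation exactly once, and the k validation scores are typically averaged to estimate how the model generalises.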
Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence
based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
Supervised learning is a machine learning approach that's defined by its use of labeled datasets. These datasets are designed
to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labeled inputs and outputs, the
model can measure its accuracy and learn over time.
Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal
of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any
supervision. Instead, it finds patterns in the data on its own.

51. What is ‘naive’ in a naive bayes? 52. What is PCA? When do you use it?
53. Explain SVM algorithm in detail? 54. What are the support vectors in SVM?
55. What are the different kernels in SVM?
‘Naive’ in a Naive Bayes: Naive Bayes applies Bayes' rule under the assumption that the features are independent of one another
given the class, meaning that the value of one feature does not influence the values of the other features. This assumption rarely
holds exactly in real data, and this is why we call the algorithm “naive”.
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of
large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
SVM: A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group
classification problems. After being given sets of labeled training data for each category, an SVM model is able to categorize
new examples. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (N being the number of
features) that distinctly classifies the data points. The dimension of the hyperplane depends upon the number of features. If the
number of input features is two, then the hyperplane is just a line. SVM works relatively well when there is a clear margin of
separation between classes. SVM is effective in high dimensional spaces, including cases where the number of dimensions is
greater than the number of samples. SVM is relatively memory efficient, because the decision function uses only a subset of the
training points (the support vectors).
Support Vectors in SVM: Support vectors are data points that are closer to the hyperplane and influence the position and orientation
of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the
position of the hyperplane. These are the points that help us build our SVM.
Types of Kernel Functions in SVM:
*Polynomial Kernel Function. *Gaussian RBF Kernel Function.
*Sigmoid Kernel Function (also called the Hyperbolic Tangent Kernel Function). *Linear Kernel Function.
*Graph Kernel Function. *String Kernel Function. *Tree Kernel Function.
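As a rough illustration of the numeric kernels in the list above, the sketch below writes four of them as plain functions of two feature vectors. The parameter names and default values (degree, coef0, gamma) are assumptions chosen for the example; the graph, string and tree kernels work on structured inputs and are not shown.

```python
import math


def linear_kernel(x, y):
    """Plain dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) raised to the given degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    """Hyperbolic tangent of a scaled dot product."""
    return math.tanh(gamma * linear_kernel(x, y) + coef0)
```

A kernel returns a similarity score for a pair of points, which lets an SVM separate classes that are not linearly separable in the original feature space.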

56. What are the most known ensemble algorithms?
57. Explain decision tree algorithm in detail.
58. What are entropy and information gain in decision tree algorithm?
Ensemble algorithm: Ensemble modeling is a process where multiple diverse base models are used to predict an outcome.
The motivation for using ensemble models is to reduce the generalization error of the prediction. Voting and averaging are
two of the easiest ensemble methods. They are both easy to understand and implement. Voting is used for classification and
averaging is used for regression. In both methods, the first step is to create multiple classification/regression models using
some training dataset.
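A minimal sketch of the two combining rules described above, assuming the individual models have already produced their predictions. The function names are illustrative only.

```python
from collections import Counter


def majority_vote(predictions):
    """Classification: return the label predicted by the most models.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the models' numeric predictions."""
    return sum(predictions) / len(predictions)
```

For example, three classifiers predicting "spam", "ham", "spam" give "spam", and three regressors predicting 2.0, 4.0 and 6.0 give 4.0.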
Decision Tree Algorithm: A decision tree is a flowchart-like diagram that shows the various outcomes from a series of
decisions. It can be used as a decision-making tool, for research analysis, or for planning strategy. A primary advantage for
using a decision tree is that it is easy to follow and understand. A decision tree is a very specific type of probability tree that
enables you to make a decision about some kind of process. It can be of two types-
Categorical Variable Decision Tree: A decision tree that has a categorical target variable is called a Categorical
Variable Decision Tree.
Continuous Variable Decision Tree: A decision tree that has a continuous target variable is called a Continuous Variable
Decision Tree.
Important Terminology related to Decision Trees:
*Root Node * Splitting * Decision Node * Leaf / Terminal Node
*Pruning * Branch / Sub-Tree * Parent and Child Node
Entropy and Information gain: Entropy measures the impurity (or surprise) of a set of labels. Information gain is the reduction
in entropy achieved by transforming a dataset and is often used in training decision trees. It is calculated by comparing the
entropy of the dataset before and after the transformation, that is, the decrease in entropy after the dataset is split on an
attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
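The calculation can be sketched as follows, assuming Shannon entropy in bits and a split that partitions the parent labels into child groups. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent_labels, child_groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent_labels)
    weighted_after = sum(len(group) / total * entropy(group) for group in child_groups)
    return entropy(parent_labels) - weighted_after
```

A parent with two "yes" and two "no" labels has entropy 1 bit. A split into two pure branches has gain 1, while a split that leaves both branches half "yes" and half "no" has gain 0.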

59. What is logistic regression? State an example when you have used logistic regression recently.
60. What is linear regression? 61. What are the drawbacks of the linear model?
62. What is the difference between regression and classification ML techniques?
63. During analysis, how do you treat missing values?
Logistic regression: It is used when the outcome is binary or dichotomous in nature and the classes can be separated reasonably
well by a linear decision boundary. That means logistic regression is usually used for binary classification problems. Binary
classification refers to predicting an output variable that takes one of two discrete classes. Logistic regression is used when the
dependent variable (target) is categorical. For example: to predict whether an email is spam (1) or not (0), or whether a tumor
is malignant (1) or not (0).
Linear regression: Linear regression analysis is used to predict the value of a variable based on the value of another variable.
The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's
value is called the independent variable.
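As a small worked example, the sketch below fits a straight line to one independent variable using ordinary least squares. The function name fit_line and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The sample points lie exactly on y = 2x + 1, so the fitted slope is 2 and the intercept is 1. New values of the dependent variable are then predicted as slope * x + intercept.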
Drawbacks of the linear model: The main limitation of linear regression is the assumption of linearity between the dependent
variable and the independent variables. It assumes that there is a straight-line relationship between the dependent and
independent variables, which is often incorrect for real-world data.
Difference between Regression and Classification in Machine Learning: The main difference between regression and
classification algorithms is that regression algorithms are used to predict continuous values such as price, salary, age, etc.,
and classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or
Not Spam, etc.
How missing values are treated during analysis: When dealing with missing data, data scientists can use two primary
methods: imputation or the removal of data. The imputation method develops reasonable guesses for missing
data. It's most useful when the percentage of missing data is low. Ways to handle missing values in machine learning:
*Deleting rows with missing values. * Imputing missing values for a continuous variable.
* Imputing missing values for a categorical variable. * Other imputation methods.
* Using algorithms that support missing values. * Prediction of missing values.
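Two of the options in the list above can be sketched in a few lines, assuming missing values are represented as None. Using the mean as the guess for a continuous variable is one common choice, not the only one, and the function names are illustrative.

```python
def drop_missing_rows(rows):
    """Removal: keep only the rows that have no missing (None) values."""
    return [row for row in rows if None not in row]


def impute_mean(values):
    """Imputation for a continuous variable: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

For a categorical variable, the same pattern is often used with the most frequent category in place of the mean.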

64. How will you define the number of clusters in a clustering algorithm?
65. What is ensemble learning?
66. Describe in brief any type of ensemble learning?
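Define the number of clusters in a clustering algorithm: Clustering or cluster analysis is a machine learning technique
which groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters, consisting
of similar data points". Some algorithms, such as K-Means, need the number of clusters to be set in advance; it is commonly
chosen by comparing several candidate values with the elbow method or the silhouette score. Others, such as DBSCAN, infer
the number of clusters from the data. A list of 10 of the more popular algorithms is as follows:
Affinity Propagation, Agglomerative Clustering, BIRCH, DBSCAN,
K-Means, Mini-Batch K-Means, Mean Shift, OPTICS,
Spectral Clustering, Mixture of Gaussians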
Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and
combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the
performance of a model (classification, prediction, function approximation, etc.). In statistics and machine learning, ensemble
methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the
constituent learning algorithms alone.
Describe in brief one type of ensemble learning: Ensemble methods are a machine learning technique that combines several
base models in order to produce one optimal predictive model. One widely used type is the random forest, described next.
Random Forest Models. Random Forest models can be thought of as BAGGing (bootstrap aggregating) with a slight tweak. When
deciding where to split and how to make decisions, BAGGed decision trees have the full set of features at their disposal.
Therefore, although the bootstrapped samples may be slightly different, the data is largely going to break off at the same
features throughout each model. In contrast, Random Forest models decide where to split based on a random selection of
features. Rather than splitting at similar features at each node throughout, Random Forest models implement a level of
differentiation because each tree will split based on different features. This level of differentiation provides a more diverse
ensemble to aggregate over, which tends to produce a more accurate predictor. Similar to BAGGing, bootstrapped subsamples
are pulled from a larger dataset and a decision tree is formed on each subsample. However, each decision tree is split on
different features.

67. What is a random forest? How does it work?
68. How do you work towards a random forest?
69. What cross-validation technique would you use on a time series data set?
Random forests & How it works: Random forests or random decision forests are an ensemble learning method for classification, regression
and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random
forest is the class selected by most trees. Working of Random Forest Algorithm-
Step 1 − First, start with the selection of random samples from a given dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every
decision tree.
Step 3 − In this step, voting will be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final prediction result.
Towards a random forest: The random forest is an algorithm consisting of many decision trees. It uses bagging and feature
randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate
than that of any individual tree.
Step 1: In a random forest, n random records are taken (with replacement) from the data set having k records.
Step 2: An individual decision tree is constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting for classification or averaging for regression.
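The two sources of randomness in the steps above (random records per tree and random features per split) can be sketched as below. Building the trees themselves is left out. The helper names, the seed and the feature names are hypothetical and only serve the illustration.

```python
import random


def bootstrap_sample(records, n, rng):
    """Step 1: draw n records at random, with replacement."""
    return [rng.choice(records) for _ in range(n)]


def random_feature_subset(features, m, rng):
    """Feature randomness: pick m distinct candidate features for a split."""
    return rng.sample(features, m)


rng = random.Random(0)                            # fixed seed so the sketch is repeatable
records = list(range(10))                         # stand-ins for k = 10 data records
features = ["age", "income", "height", "weight"]  # hypothetical feature names

samples = [bootstrap_sample(records, 5, rng) for _ in range(3)]  # one sample per tree
subset = random_feature_subset(features, 2, rng)
```

Because sampling is done with replacement, the same record can appear more than once in a tree's sample, while other records are left out of it.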
Cross-validation technique on a time series data set: The method that can be used for cross-validating a time-series model is
cross-validation on a rolling basis. Start with a small subset of data for training, forecast the later data points and then check the
accuracy of the forecasted data points. Because the order of observations matters, the model is always trained on data that
precedes the data it is tested on. The two primary approaches are:
1) Setting aside a validation time period and a holdout period. Build models on the data before those periods, and test
performance on those periods.
2) A multi-fold variation of the option described above.
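A minimal sketch of rolling cross-validation, assuming an expanding training window and a fixed forecast horizon. The function and parameter names are illustrative.

```python
def rolling_origin_splits(n_points, initial_train, horizon):
    """Return (train_idx, test_idx) pairs where the test block always follows the training block."""
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_points:
        train_idx = list(range(0, train_end))                   # everything seen so far
        test_idx = list(range(train_end, train_end + horizon))  # the next block to forecast
        splits.append((train_idx, test_idx))
        train_end += horizon                                    # roll the origin forward
    return splits


splits = rolling_origin_splits(n_points=6, initial_train=2, horizon=2)
```

With six observations this gives two folds: train on points 0-1 and test on 2-3, then train on points 0-3 and test on 4-5. No fold ever trains on data from the future.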

70. What do you mean by deep learning?
71. What is the difference between machine learning and deep learning?
Deep Learning: At a very basic level, deep learning is a machine learning technique. It teaches a computer to filter inputs
through layers to learn how to predict and classify information. Observations can be in the form of images, text, or sound. The
inspiration for deep learning is the way that the human brain filters information.
Machine learning vs Deep learning: Machine learning means computers learning from data using algorithms to perform a
task without being explicitly programmed. Deep learning uses a complex structure of algorithms modeled on the human brain.
This enables the processing of unstructured data such as documents, images and text.
Machine Learning | Deep Learning
Machine Learning is a superset of Deep Learning. | Deep Learning is a subset of Machine Learning.
Machine Learning is an evolution of AI. | Deep Learning is an evolution of Machine Learning. Basically, it is how deep the machine learning is.
Machine learning consists of thousands of data points. | Big Data: millions of data points.
The data represented in Machine Learning is quite different as compared to Deep Learning, as it uses structured data. | The data representation used in Deep Learning is quite different, as it uses neural networks (ANN).
Machine Learning is highly used to stay in the competition and learn new things. | Deep Learning solves complex machine learning issues.
Algorithms are detected by data analysts to examine specific variables in data sets. | Algorithms are largely self-depicted on data analysis once they're put into production.
Outputs: numerical value, like a classification or a score. | Outputs: anything from numerical values to free-form elements, such as free text and sound.

72. What, in your opinion, is the reason for the popularity of deep learning in recent times?
73. What is reinforcement learning? 74. What are artificial neural networks?
75. Describe the structure of artificial neural networks? 76. How are weights initialized in a network?
Opinion on the popularity of deep learning: Deep learning has lately gained much popularity due to its superior accuracy
when trained with huge amounts of data. The software industry is nowadays moving towards machine intelligence, and
machine learning has become necessary in every sector as a way of making machines intelligent. Neural networks themselves
are not a brand-new idea; what has changed is that we now have access to a lot more computational power and a lot more data.
Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an
environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine
learning paradigms, alongside supervised learning and unsupervised learning. Reinforcement Learning (RL) is the science of
decision making. It is about learning the optimal behavior in an environment to obtain maximum reward.
Artificial Neural Network(ANN): Neural networks, also known as artificial neural networks (ANNs) or simulated neural
networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure
are inspired by the human brain, mimicking the way that biological neurons signal to one another.
The structure of artificial neural network: An artificial neural network consists of a collection of simulated neurons. Each
neuron is a node which is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections.
Each link has a weight, which determines the strength of one node's influence on another. Artificial Neural Networks (ANN)
are multi-layer fully-connected neural nets. They consist of an input layer, one or more hidden
layers, and an output layer. ... Training this deep neural network means learning the weights associated with all the edges.
Weights Initialized in a Network: Weight initialization is a procedure to set the weights of a neural network to small random
values that define the starting point for the optimization (learning or training) of the neural network model.
Step-1: Initialization of Neural Network: Initialize weights and biases.
Step-2: Forward propagation: Using the given input X, weights W, and biases b, for every layer we compute a linear combination of inputs and
weights (Z) and then apply an activation function to the linear combination (A).
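Step-2 can be sketched for a single fully-connected layer as below. The weights are hard-coded here for illustration (in Step-1 they would be small random values), and the sigmoid activation and the function name dense_forward are assumptions of the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(X, W, b, activation):
    """Compute Z = W.X + b for every neuron, then A = activation(Z)."""
    Z = [sum(w * x for w, x in zip(row, X)) + bias for row, bias in zip(W, b)]
    A = [activation(z) for z in Z]
    return Z, A


X = [1.0, 2.0]                    # input vector
W = [[0.5, -0.25], [1.0, 1.0]]    # one row of weights per neuron
b = [0.0, -3.0]                   # one bias per neuron
Z, A = dense_forward(X, W, b, sigmoid)
```

With these values both neurons get Z = 0, so both activations are sigmoid(0) = 0.5. The output A of one layer becomes the input X of the next.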

77. What is the cost function? 78. What are hyperparameters?
79. What will happen if the learning rate is set inaccurately (too low or too high)?
80. What are the different layers on CNN? 81. What is pooling on CNN, and how does it work?
Cost Function: The cost function is a way of evaluating the performance of our algorithm/model. It takes both the
outputs predicted by the model and the actual outputs and calculates how wrong the model was in its prediction. It outputs
a higher number the more our predictions differ from the actual values.
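Mean squared error is one common cost function for regression and is used here only as an example; the text above does not single out a specific one.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


perfect = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
close = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 3.0])
far = mean_squared_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

The cost is zero for perfect predictions and grows as the predictions move away from the actual values.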
Hyperparameters are parameters whose values control the learning process and determine the values of model parameters
that a learning algorithm ends up learning. Some examples of model hyperparameters include:
* The learning rate for training a neural network.
* The C and sigma hyperparameters for support vector machines.
* The k in k-nearest neighbors.
Learning rate is set inaccurately: If your learning rate is set too low, training will progress very slowly as you are making
very tiny updates to the weights in your network. However, if your learning rate is set too high, it can cause undesirable
divergent behavior in your loss function.
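The effect can be seen on a toy problem. The sketch below minimizes f(x) = x^2 (gradient 2x), a function chosen purely for illustration, with three different learning rates.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(x) = x**2 from the given start point and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # gradient descent update
    return x


too_low = run_gradient_descent(0.001)   # barely moves away from the start
reasonable = run_gradient_descent(0.1)  # gets close to the minimum at 0
too_high = run_gradient_descent(1.1)    # overshoots further on every step
```

After 20 steps the low rate is still near the starting point 1.0, the moderate rate is close to 0, and the high rate has moved further from the minimum than where it started.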
The different layers of a CNN: A convolutional neural network is usually described as having three main types of layers: the
convolutional layer, the pooling layer, and the fully-connected layer. Some descriptions count the ReLU correction (activation)
layer as a fourth type. Each of these layers has different parameters that can be optimized and performs a different task on the
input data.
Pooling in a CNN and how it works: Convolutional layers in a convolutional neural network systematically apply learned filters
to input images in order to create feature maps that summarize the presence of those features in the input. A pooling layer is a
layer added after the convolutional layer; a common CNN architecture has a number of convolution and pooling layers stacked
one after the other. The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and
summarising the features lying within the region covered by the filter, for example by taking the maximum (max pooling) or the
average (average pooling) of each region.
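A minimal sketch of max pooling on a single channel, assuming a 2x2 window with stride 2. The function name and the sample feature map are made up for the example.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Slide a size x size window over the map and keep the maximum of each region."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 6, 5],
]
pooled = max_pool_2d(feature_map)
```

The 4x4 feature map is reduced to the 2x2 map [[6, 4], [7, 9]], keeping the strongest response in each region while halving the width and height.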

82. What is the difference between epoch, batch, and iteration in deep learning?
83. What are recurrent neural networks (RNNs)? 84. What is a multi-layer perceptron (MLP)?
Epochs: One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.
Since one epoch is too big to feed to the computer at once, we divide it into several smaller batches.
Batch Size: The total number of training examples present in a single batch. Batch size and number of batches are two different
things.
Iterations: The number of iterations is the number of batches needed to complete one epoch, so the number of batches equals
the number of iterations for one epoch. For example, a dataset of 2,000 examples with a batch size of 500 takes 4 iterations to
complete one epoch.
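The relationship is a single division, rounded up so that a final partial batch still counts as one iteration. The function name is an illustrative assumption.

```python
import math


def iterations_per_epoch(n_examples, batch_size):
    """Number of batches (and therefore iterations) needed to see every example once."""
    return math.ceil(n_examples / batch_size)


even_split = iterations_per_epoch(2000, 500)    # 4 full batches
with_remainder = iterations_per_epoch(2050, 500)  # 4 full batches plus one partial batch
```

Training for several epochs multiplies this figure: 10 epochs of 4 iterations each means 40 weight updates in total.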
A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time series data or
data that involves sequences. Ordinary feed forward neural networks are only meant for data points which are independent of
each other. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output
depending on the previous computations. For example, if the sequence we care about is a sentence of 3 words, the
network would be unrolled into a 3-layer neural network, one layer for each word.
How an LSTM network works: An LSTM has a similar control flow to a recurrent neural network. It processes data, passing on
information as it propagates forward. The differences are the operations within the LSTM's cells. These operations
allow the LSTM to keep or forget information. For example, in order to train an LSTM network to generate text, we must first
preprocess our text data so that it can be consumed by the network. Since a neural network takes vectors as input,
we need a way to convert the text into vectors.
Multi-layer Perceptron (MLP): A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set
of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed graph
between the input and output layers. A single perceptron consists of an input layer and an output layer which are fully
connected; an MLP adds one or more hidden layers between them. Once the calculated output at a hidden layer has been
pushed through the activation function, it is passed to the next layer in the MLP by taking the dot product with the
corresponding weights. A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP). It has at least
3 layers, including one hidden layer. If it has more than 1 hidden layer, it is called a deep ANN.

85. Explain gradient descent 86. What is exploding gradients?
87. What is vanishing gradients? 88. What is back propagation and explain it works?
89. What is an auto-encoder?
Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. Gradient descent is
simply used in machine learning to find the values of a function's parameters (coefficients) that minimize a cost function as
far as possible. The Gradient descent algorithm multiplies the gradient by a number (Learning rate or Step size) to determine
the next point. For example: having a gradient with a magnitude of 4.2 and a learning rate of 0.01, then the gradient descent
algorithm will pick the next point 0.042 away from the previous point.
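The update rule and the arithmetic in the example above can be sketched as follows. The target function f(x) = (x - 3)^2 and the helper names are assumptions made for illustration.

```python
def gradient_step(point, gradient, learning_rate):
    """Move against the gradient by learning_rate * gradient."""
    return point - learning_rate * gradient


def minimize(gradient_fn, start, learning_rate, steps):
    """Repeat the gradient step a fixed number of times."""
    x = start
    for _ in range(steps):
        x = gradient_step(x, gradient_fn(x), learning_rate)
    return x


# The example from the text: gradient 4.2 and learning rate 0.01 move the point by 0.042.
next_point = gradient_step(1.0, 4.2, 0.01)

# f(x) = (x - 3)**2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = minimize(lambda x: 2 * (x - 3), start=0.0, learning_rate=0.1, steps=100)
```

Starting from 1.0, the next point is about 0.958, and the repeated updates on the second function converge towards x = 3.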
Exploding gradients are a problem in which large error gradients accumulate and result in very large updates to neural network
model weights during training. Gradients are used during training to update the network weights, and this
process typically works best when these updates are small and controlled.
Vanishing gradients: The term vanishing gradient refers to the fact that in a feedforward network (FFN) the backpropagated
error signal typically decreases (or increases) exponentially as a function of the distance from the final layer (source: Random
Walk Initialization for Training Very Deep Feedforward Networks). The reason for the vanishing gradient is that during
backpropagation, the gradients of early layers (layers near the input layer) are obtained by multiplying the gradients of later
layers (layers near the output layer). When these factors are smaller than one, their product shrinks towards zero as the number
of layers grows.
Backpropagation & how it works: Backpropagation is a way of propagating the total loss back through the neural network to
determine how much of the loss every node is responsible for, and subsequently updating the weights in the direction that
reduces the loss. Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a
neural network based on the error obtained in the previous iteration. Proper tuning of the weights allows you to reduce error
rates and make the model reliable by increasing its generalization.
An auto-encoder is a type of neural network that can be used to learn a compressed representation of raw data. An autoencoder
is composed of an encoder and a decoder sub-models. The encoder compresses the input and the decoder attempts to recreate
the input from the compressed version provided by the encoder.

90. What is the role of the activation function? 91. What is a Boltzmann machine?
92. What is dropout and batch normalization? 93. How is logistic regression done?
Role of the Activation Function: Simply put, an activation function is a function that is added into an artificial neural network
in order to help the network learn complex patterns in the data. When comparing with a neuron-based model that is in our
brains, the activation function is at the end deciding what is to be fired to the next neuron. Activation functions are a critical
part of the design of a neural network. The choice of activation function in the hidden layer will control how well the network
model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the
model can make.
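Three widely used activation functions are sketched below as plain Python functions. The choice of these three is an assumption for illustration; the text above does not name specific functions.

```python
import math


def sigmoid(z):
    """Squashes any input into (0, 1); common in binary-classification output layers."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Passes positive inputs through unchanged and sets negative inputs to zero."""
    return max(0.0, z)


def tanh(z):
    """Squashes any input into (-1, 1)."""
    return math.tanh(z)
```

All three are non-linear, which is what lets a stack of layers model patterns that a purely linear combination of inputs cannot.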
Boltzmann machine: A Boltzmann machine is a type of stochastic recurrent neural network. It is a Markov random field. It
was translated from statistical physics for use in cognitive science. Boltzmann machines are typically used to solve different
computational problems. For example, for a search problem, the weights on the connections can be fixed and used to
represent the cost function of the optimization problem.
Dropout and Batch Normalization: Dropout is a technique where randomly selected neurons are ignored during training. This
means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any
weight updates are not applied to the neuron on the backward pass. Dropout is meant to block information from certain neurons
completely to make sure the neurons do not co-adapt. Batch normalization is a technique for training very deep neural networks
that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and
dramatically reducing the number of training epochs required to train deep networks. When the two are combined, one argument
is that batch normalization should come after dropout, because otherwise information from the dropped neurons is still passed
on through the normalization statistics.
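Both techniques can be sketched on a single list of activations. The rescaling of the kept activations ("inverted dropout"), the epsilon term and the function names are assumptions of this sketch; the learnable scale and shift that usually follow batch normalization are left out.

```python
import math
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the rest (inverted dropout)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def batch_normalize(batch, eps=1e-5):
    """Standardize one feature across a mini-batch to roughly zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(variance + eps) for v in batch]


dropped = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=random.Random(42))
normalized = batch_normalize([1.0, 2.0, 3.0])
```

Dropout is applied only during training. At prediction time all neurons are kept, which is why the kept activations are rescaled in this variant.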
How logistic regression is done: Logistic regression uses an equation as the representation, very much like linear regression.
Input values (x) are combined linearly using weights or coefficient values (referred to by the Greek letter beta) and passed
through a logistic function to predict an output value (y). Logistic regression is a statistical model that in its basic form uses a
logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis,
logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
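The prediction step can be sketched for a single input variable as below. The coefficient values are made up for illustration; in practice they are estimated from training data, and the 0.5 threshold is a common default rather than a requirement.

```python
import math


def predict_probability(x, beta0, beta1):
    """Logistic function applied to the linear combination beta0 + beta1 * x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def predict_class(x, beta0, beta1, threshold=0.5):
    """Turn the probability into a binary class label (1 or 0)."""
    return 1 if predict_probability(x, beta0, beta1) >= threshold else 0


# Hypothetical coefficients: the decision boundary sits at x = 2.
beta0, beta1 = -2.0, 1.0
```

With these coefficients an input of 3.0 is assigned class 1 and an input of 1.0 is assigned class 0, because the linear combination is positive in the first case and negative in the second.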
