2. What is Data Science?
Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from data in various forms, both structured and unstructured.
or
Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to understand and analyze
actual phenomena with data. It employs techniques and theories drawn
from many fields within the context of mathematics, statistics,
information science, and computer science.
5. Data Science Applications:
Internet Search:
How does Google return such accurate search results within a fraction of a second?
Data Science!
Recommendation Systems:
From "people you may know" on Facebook or LinkedIn to "people who've bought
this product also liked…" on Amazon to your daily curated playlists on Spotify to even
"suggested videos" on YouTube, everything is fueled by Data Science.
Image/Speech/Character Recognition:
This pretty much goes without saying. What do you think is the brain behind "Siri", if
not Data Science? Also, how do you think Facebook recognizes your friend when you
upload a photo with them? It's not magic; it's science: Data Science.
6. Data Science Applications:
Gaming:
EA Sports, Sony, Nintendo, Zynga, and other giants in this domain have taken it
upon themselves to take your gaming experience to an altogether new level.
Games are now developed and improved using Machine Learning algorithms so
that they can adapt as you move up to higher levels.
Price Comparison Websites:
These websites are fueled by data. For them, the more the merrier. The data is
fetched from the relevant websites using APIs. PriceGrabber, PriceRunner,
Junglee, Shopzilla are some such websites.
7. Data Science life cycle:
1 - Business Understanding
2 - Data Mining
3 - Data Cleaning
4 - Data Exploration
5 - Feature Engineering
6 - Predictive Modeling
7 - Data Visualization
11. Python Libraries:
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• Scikit-Learn
Visualization libraries
• Matplotlib
• Seaborn
and many more …
12. Python Libraries:
NumPy:
introduces objects for multidimensional arrays and matrices, as well as
functions that make it easy to perform advanced mathematical and statistical
operations on those objects; provides vectorization of mathematical operations
on arrays and matrices, which significantly improves performance
SciPy:
collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
Pandas:
adds data structures and tools designed to work with table-like data and
provides tools for data manipulation: reshaping, merging, sorting, slicing,
aggregation etc.
13. Python Libraries:
SciKit-Learn:
provides machine learning algorithms: classification, regression, clustering,
model validation etc.
Matplotlib:
Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats.
Seaborn:
based on Matplotlib, provides a high-level interface for drawing attractive
statistical graphics
15. NumPy:
The fundamental library needed for scientific computing with Python is called
NumPy.
This Open Source library contains:
⢠a powerful N-dimensional array object
⢠advanced array slicing methods (to select array elements)
⢠convenient array reshaping methods
⢠basic linear algebra functions
⢠basic Fourier transforms
⢠sophisticated random number capabilities
NumPy can be extended with C-code for functions where performance is highly
time critical. NumPy is a hybrid of the older NumArray and Numeric packages,
and is meant to replace them both.
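As a rough sketch of the capabilities listed above (the array contents below are invented purely for illustration):

import numpy as np

# Create an N-dimensional array and reshape it
a = np.arange(12)                      # 0, 1, ..., 11
m = a.reshape(3, 4)                    # 3x4 matrix

# Slicing: select the second row and the last column
row = m[1, :]
col = m[:, -1]

# Vectorized math: no explicit Python loops needed
scaled = m * 2.0 + 1.0

# Basic linear algebra, random numbers, and a Fourier transform
square = m @ m.T                       # 3x3 matrix product
rng = np.random.default_rng(seed=0)    # reproducible random numbers
noise = rng.normal(size=(3, 4))
spectrum = np.fft.fft(a)

print(square)
print(scaled + noise)
print(spectrum[:3])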
16. Installing Numpy:
If you have Anaconda, you can simply install NumPy from your terminal or
command prompt using:
conda install numpy
If you do not have Anaconda on your computer, install NumPy from your
terminal using:
pip install numpy
19. Pandas:
Pandas is an open source Python library that is built on top of NumPy. It allows
you to do fast analysis as well as data cleaning and preparation.
An easy way to think of Pandas is to look at it as Python's version of
Microsoft Excel.
Pandas works well with data from a wide variety of sources such as Excel
sheets, CSV files, SQL databases, or even web pages.
Pandas provides tools for data manipulation: reshaping, merging, sorting,
slicing, aggregation etc., and allows handling missing data.
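A minimal sketch of these tools; the small sales table and the file names mentioned in the last comment are made up for illustration only:

import numpy as np
import pandas as pd

# A small table-like DataFrame (stands in for an Excel sheet or CSV file)
sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "revenue": [100, 250, np.nan, 300],          # one missing value
})

# Handling missing data
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Slicing and sorting
jan = sales[sales["month"] == "Jan"]
ordered = sales.sort_values("revenue", ascending=False)

# Aggregation: total revenue per city
totals = sales.groupby("city")["revenue"].sum().reset_index()

# Merging with another table
targets = pd.DataFrame({"city": ["Delhi", "Mumbai"], "target": [400, 500]})
report = totals.merge(targets, on="city")
print(report)

# Reading real data would look like pd.read_csv("sales.csv") or pd.read_excel("sales.xlsx")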
20. Installing Pandas:
If you have Anaconda, you can simply install Pandas from your terminal or
command prompt using:
conda install pandas
If you do not have Anaconda on your computer, install Pandas from your
terminal using:
pip install pandas
23. Matplotlib:
Data visualization is a very important part of data analysis. You can use it to
explore your data. If you understand your data well, you'll have a better chance
of finding some insights.
In python we use Matplotlib and Seaborn for data visualization.
There are many types of visualizations.
Some of the most famous are: line plot, scatter plot, histogram, box plot, bar
chart, and pie chart.
How do we choose the right visualization?
First, we need to do some exploratory data analysis. After we know the
shape of the data, the data types, and other useful statistical information, it
will be easier to pick the right visualization type.
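As a short sketch (the data is randomly generated just for the example), here are a few of the plot types above in Matplotlib, plus a Seaborn box plot:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(seed=1)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=2, size=50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, 2 * x)        # line plot
axes[0].set_title("Line plot")
axes[1].scatter(x, y)         # scatter plot
axes[1].set_title("Scatter plot")
axes[2].hist(y, bins=10)      # histogram
axes[2].set_title("Histogram")
plt.tight_layout()
plt.show()

# Seaborn builds on Matplotlib and adds statistical plots, e.g. a box plot
sns.boxplot(x=y)
plt.show()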
24. Installing Matplotlib:
If you have Anaconda, you can simply install matplotlib from your terminal or
command prompt using:
conda install matplotlib
If you do not have Anaconda on your computer, install Matplotlib from your
terminal using:
pip install matplotlib
27. What is Machine Learning?
Machine Learning is the field of study that gives computers the
capability to learn without being explicitly programmed. ML is one of
the most exciting technologies that one would have ever come across.
As is evident from the name, it gives the computer something that makes it
more similar to humans: the ability to learn.
30. Supervised Machine Learning:
• Supervised learning is where you have input variables (X) and an output
variable (Y) and you use an algorithm to learn the mapping function from the
input to the output.
Y = f(X)
• The goal is to approximate the mapping function so well that when you have
new input data (X), you can predict the output variable (Y) for that data.
• Supervised learning can be further grouped into:
• Classification: A classification problem is when the output variable is a
category, such as "red" or "blue", or "disease" and "no disease".
• Regression: A regression problem is when the output variable is a real value,
such as "dollars" or "weight".
31. Unsupervised Machine Learning:
• Unsupervised learning is where you only have input data (X) and no
corresponding output variables.
• The goal for unsupervised learning is to model the underlying structure or
distribution in the data in order to learn more about the data.
• Unsupervised learning problems can be further grouped into:
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
34. Linear Regression:
• Linear regression is a basic and commonly used type of predictive analysis.
The overall idea of regression is to examine a linear relationship between
the input variables (X) and the single output variable (Y).
• When there is a single input variable (X), the method is referred to as
"simple linear regression".
• When there are multiple input variables (X1, X2, …), the statistics
literature often refers to the method as "multiple linear regression".
• The regression line minimizes the sum of squared residuals, so the method is
also known as "Ordinary Least Squares (OLS)".
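A minimal sketch with scikit-learn; the house sizes and prices below are made-up numbers used only to show the API:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[650], [800], [1000], [1200], [1500]])   # single input variable (size in sq. ft.)
y = np.array([70.0, 85.0, 105.0, 125.0, 150.0])        # output variable (price in thousands)

model = LinearRegression()      # fitted by ordinary least squares (OLS)
model.fit(X, y)

print(model.intercept_, model.coef_)   # constant and slope of the regression line
print(model.predict([[1100]]))         # predicted price for a new house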
36. Applications of Linear Regression:
Many continuous-variable prediction tasks rely on linear regression, for example:
• Stock Exchange
• Weather Forecast
• Flight time prediction
• House price prediction
• Finance
• Econometrics
39. Logistic Regression:
• Logistic Regression is a classification algorithm.
• It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a
set of independent variables, i.e. to represent a binary / categorical outcome.
• You can also think of logistic regression as a special case of linear regression
where the outcome variable is categorical.
40. Working of Logistic Regression:
Y = b0 + b1X1 + b2X2 + … + bnXn + E
Y = Dependent variable
b0 = Constant
b1 = Coefficient of variable X1
X1 = Independent variable
E = Error term
This linear combination is then passed through the logistic (sigmoid) function,
1 / (1 + e^(-Y)), which turns it into a probability between 0 and 1.
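A minimal sketch, assuming scikit-learn; the hours-studied data is invented just to show how the linear combination becomes a probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied (illustrative)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # fail (0) / pass (1)

clf = LogisticRegression()
clf.fit(X, y)

# By hand: z = b0 + b1*X1, then probability = 1 / (1 + exp(-z))
z = clf.intercept_ + clf.coef_[0] * 4.5
print(1 / (1 + np.exp(-z)))

# The same probability of class 1 from scikit-learn (second column)
print(clf.predict_proba([[4.5]]))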
41. Applications of Logistic Regression:
Many categorical prediction tasks are handled with Logistic Regression, for example:
• Image Segmentation & Categorization
• Handwriting Recognition
• Spam filtering (Spam/Ham)
• Cell Imaging (Cancer/Normal)
• Production line Scan (Good/Defective)
43. K-Means Clustering (Unsupervised Machine Learning)
44. K-Means:
• K-Means clustering is one of the simplest and most popular unsupervised
machine learning algorithms.
• Typically, unsupervised algorithms make inferences from datasets using
only input vectors, without referring to known, or labelled, outcomes.
• You define a target number k, which refers to the number of centroids you need
in the dataset. A centroid is the imaginary or real location representing the
center of a cluster.
• In other words, the K-Means algorithm identifies k centroids, and
then allocates every data point to the nearest cluster, while keeping the
clusters as compact as possible.
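A minimal sketch with scikit-learn; the six 2-D points are made up so that they form two obvious groups:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # k = 2 centroids
kmeans.fit(X)

print(kmeans.cluster_centers_)            # centre of each cluster
print(kmeans.labels_)                     # cluster assigned to every data point
print(kmeans.predict([[0, 0], [9, 9]]))   # nearest cluster for new points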
47. Applications of K-Means:
K-Means is widely used to find structure in unlabelled data and make
predictions from it, for example in:
• Optical Character Recognition
• Biometrics
• Diagnostic Systems
• Military Applications
50. K Nearest Neighbor :
KNN can be used for classification. An object is classified by a majority vote of
its neighbors, with the object being assigned to the class most common
among its k nearest neighbors. It can also be used for regression.
It does not use the training data points to do any generalization. In other
words, there is no explicit training phase or it is very minimal.
The KNN algorithm is based on feature similarity: how closely the features of an
out-of-sample point resemble those of our training set determines how we classify
that data point.
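A minimal sketch using scikit-learn and the standard iris data set; the choice of k = 5 is only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" essentially just stores the examples

# A new sample gets the majority class of its 5 nearest neighbours
print(knn.predict(X_test[:3]))
print(knn.score(X_test, y_test))   # accuracy on unseen data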
51. K Nearest Neighbor :
Example of k-NN classification. The test sample (at the centre of the circles) should be classified either to the
first class of blue squares or to the second class of red triangles. If k = 3 (inner circle) it is assigned to the
second class because there are 2 triangles and only 1 square inside the inner circle. If, for example, k = 5
(outer circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
52. Applications of KNN:
Credit ratings: collecting financial characteristics and comparing people with
similar financial features to a database. By the very nature of a credit rating,
people who have similar financial details would be given similar credit ratings.
Should the bank give a loan to an individual? Would an individual default on his
or her loan? Is that person closer in characteristics to people who defaulted or
did not default on their loans?
In political science: classifying a potential voter as a "will vote" or "will not vote",
or as a "vote Democrat" or "vote Republican".
More advanced examples include handwriting recognition (as in OCR), image
recognition and even video recognition.
55. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a technique used to emphasize
variation and bring out strong patterns in a data set.
It is often used to make data easy to explore and visualize.
PCA reduces dimensionality by finding a new set of components, which are
composites of the original features but are uncorrelated with each other.
PCA is also known as a dimension-reduction algorithm.
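A minimal sketch with scikit-learn, reducing the four iris features to two principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 original features per flower

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2): dimension reduced from 4 to 2
print(pca.explained_variance_ratio_)   # share of variation captured by each component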
57. Applications of PCA:
Principal component analysis (PCA) is often used to reduce the dimension of
data before applying more sophisticated data analysis methods such as non-
linear classification algorithms or independent component analysis.
• Multivariate anomaly detection in health care.
• Finance: stock portfolio analysis.
60. Decision Tree:
A decision tree is a learning algorithm that constructs a set of
decisions based on training data.
Decision trees are popular because:
⢠They are naturally non-linear, so you can use them to solve
complex problems
⢠They are easy to visualize
⢠How they work is easily explained
⢠They can be used for regression (predict a number) and
classification (predict a class)
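A minimal sketch with scikit-learn; max_depth=3 is an arbitrary choice to keep the printed tree small:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # the learned set of if/else decisions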
66. Random Forest:
A random forest is an ensemble method based on decision trees.
1. Construct N decision trees
⢠Randomly sample a subset of the training data (with
replacement)
⢠Construct/train a decision tree using the decision tree
algorithm and the sampled subset of data
2. Predict by asking all trees in the forest for their opinion
⢠For regression problems, take the mean (average) of all
treesâ predictions
⢠For classification problems, take the mode of all treesâ
predictions.
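A minimal sketch of the two steps above using scikit-learn (100 trees is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the training data;
# for classification the final answer is the mode (majority vote) of the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))
# For regression, RandomForestRegressor averages the trees' predictions instead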
70. Applications of Random Forest:
Random Forests are among the most widely used classification and
prediction tools, for example in:
• Business Management
• Customer Relationship Management
• Fraudulent Statement Detection
• Engineering
• Energy Consumption
• Healthcare Management
• Fault Diagnosis
73. Support Vector Machine:
Support Vector Machine is a supervised machine learning algorithm which
can be used for both classification and regression, however it is mostly used for
classification problems.
"Separation of classes. That's what SVM does." It cleanly separates the two
classes by finding a line / hyperplane (in a multidimensional space) that
separates the classes.
74. Working of SVM:
Figure: (a) draw a line that separates the black circles from the blue squares; (b) the chosen cut divides the samples into two classes.
75. Tuning parameters in SVM:
• Kernel
• Regularization
• Gamma
• Margin
Kernel:
The learning of the hyperplane in linear SVM is done by transforming the
problem using some linear algebra. This is where the kernel comes into play.
76. Regularization:
The regularization parameter (often termed the C parameter in Python's sklearn
library) tells the SVM optimization how much you want to avoid misclassifying
each training example.
Left: low regularization value, right: high regularization value
77. Gamma:
The gamma parameter defines how far the influence of a single training example
reaches, with low values meaning "far" and high values meaning "close".
In other words, with low gamma, points far away from the plausible separating line
are considered in the calculation for the separating line, whereas with high gamma
only the points close to the plausible line are considered in the calculation.
78. Margin:
The margin is the distance between the separating line and the closest points of each class.
A good margin is one where this separation is large for both classes. The images
below give a visual example of good and bad margins. A good margin allows the
points to be in their respective classes without crossing over to the other class.
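A minimal sketch showing where these tuning parameters appear in scikit-learn; the particular kernel, C, and gamma values are illustrative, not recommendations:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel chooses the transformation, C controls regularization, gamma the reach
svm = SVC(kernel="rbf", C=1.0, gamma=0.1)
svm.fit(X_train, y_train)

print(svm.score(X_test, y_test))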
81. Neural Networks:
Which is better: a computer or a brain? Ask most people if they want a brain like
a computer and they'd probably jump at the chance. But look at the kind of
work scientists have been doing over the last couple of decades and you'll find
many of them have been trying hard to make their computers more like
brains!
How?
With the help of neural networks: computer programs assembled from
hundreds, thousands, or millions of artificial brain cells that learn and behave
in a remarkably similar way to human brains. What exactly are neural
networks? How do they work? Let's take a closer look!
82. Artificial Neural Network:
Artificial Neural Networks are computing systems inspired by biological
neural networks. Such systems learn to do tasks by considering examples,
generally without task-specific programming.
An ANN is based on a collection of connected units called artificial neurons.
Typically, neurons are organized in layers. Different layers may perform
different kinds of transformations on their inputs.
85. Working of Neural Networks:
Input layer - contains the units (artificial neurons) which receive input from
the outside world, i.e. the data the network will learn about, recognize, or
otherwise process.
Output layer - contains the units that respond with what the network has
learned about the task.
Hidden layer - these units sit between the input and output layers. The job of a
hidden layer is to transform the input into something that the output units can
use in some way.
Most neural networks are fully connected, which means each hidden
neuron is connected to every neuron in the previous (input) layer and to
the next (output) layer.
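A minimal NumPy sketch of one forward pass through such a fully connected network; the layer sizes, weights, and input values are invented for illustration, and training (e.g. backpropagation) is omitted:

import numpy as np

rng = np.random.default_rng(seed=0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A tiny fully connected network: 3 inputs -> 4 hidden units -> 1 output
x = np.array([0.5, -1.2, 3.0])              # input layer: data from the outside world

W_hidden = rng.normal(size=(4, 3))          # every hidden neuron connects to every input
b_hidden = np.zeros(4)
hidden = sigmoid(W_hidden @ x + b_hidden)   # hidden layer transforms the input

W_out = rng.normal(size=(1, 4))             # the output neuron connects to every hidden unit
b_out = np.zeros(1)
output = sigmoid(W_out @ hidden + b_out)    # output layer: the network's response

print(output)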