2. What is Data Science?
Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from data in various forms, both structured and unstructured.
or
Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to understand and analyze
actual phenomena with data. It employs techniques and theories drawn
from many fields within the context of mathematics, statistics,
information science, and computer science.
5. Data Science Applications:
Internet Search:
How does Google return such accurate search results within a fraction of a second?
Data Science!
Recommendation Systems:
From "people you may know" on Facebook or LinkedIn to "people who've bought
this product also liked…" on Amazon to your daily curated playlists on Spotify to even
"suggested videos" on YouTube, everything is fueled by Data Science.
Image/Speech/Character Recognition:
This pretty much goes without saying. What do you think is the brain behind "Siri", if
not Data Science? Also, how do you think Facebook recognizes your friend when you
upload a photo with them? It's not magic; it's science: Data Science.
6. Data Science Applications:
Gaming:
EA Sports, Sony, Nintendo, Zynga, and other giants in this domain have taken it
upon themselves to take your gaming experience to an altogether new level.
Games are now developed and improved using Machine Learning algorithms so
that they can adapt as you move up to higher levels.
Price Comparison Websites:
These websites are fueled by data. For them, the more the merrier. The data is
fetched from the relevant websites using APIs. PriceGrabber, PriceRunner,
Junglee, Shopzilla are some such websites.
7. Data Science life cycle:
1 - Business Understanding
2 - Data Mining
3 - Data Cleaning
4 - Data Exploration
5 - Feature Engineering
6 - Predictive Modeling
7 - Data Visualization
11. Python Libraries:
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• Scikit-Learn
Visualization libraries
• Matplotlib
• Seaborn
and many more …
12. Python Libraries:
NumPy:
introduces objects for multidimensional arrays and matrices, as well as
functions that make it easy to perform advanced mathematical and statistical
operations on those objects; provides vectorization of mathematical operations
on arrays and matrices, which significantly improves performance
SciPy:
collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
Pandas:
adds data structures and tools designed to work with table-like data and
provides tools for data manipulation: reshaping, merging, sorting, slicing,
aggregation etc.
13. Python Libraries:
SciKit-Learn:
provides machine learning algorithms: classification, regression, clustering,
model validation etc.
Matplotlib:
Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats.
Seaborn:
based on Matplotlib, provides a high-level interface for drawing attractive
statistical graphics
15. NumPy:
The fundamental library needed for scientific computing with Python is called
NumPy.
This Open Source library contains:
⢠a powerful N-dimensional array object
⢠advanced array slicing methods (to select array elements)
⢠convenient array reshaping methods
⢠basic linear algebra functions
⢠basic Fourier transforms
⢠sophisticated random number capabilities
NumPy can be extended with C-code for functions where performance is highly
time critical. NumPy is a hybrid of the older NumArray and Numeric packages,
and is meant to replace them both.
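As a rough sketch of the capabilities listed above (the array contents below are invented purely for illustration):

import numpy as np

# Create an N-dimensional array and reshape it
a = np.arange(12)                      # 0, 1, ..., 11
m = a.reshape(3, 4)                    # 3x4 matrix

# Slicing: select the second row and the last column
row = m[1, :]
col = m[:, -1]

# Vectorized math: no explicit Python loops needed
scaled = m * 2.0 + 1.0

# Basic linear algebra, random numbers, and a Fourier transform
square = m @ m.T                       # 3x3 matrix product
rng = np.random.default_rng(seed=0)    # reproducible random numbers
noise = rng.normal(size=(3, 4))
spectrum = np.fft.fft(a)

print(square)
print(scaled + noise)
print(spectrum[:3])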
16. Installing Numpy:
If you have Anaconda, you can simply install NumPy from your terminal or
command prompt using:
conda install numpy
If you do not have Anaconda on your computer, install NumPy from your
terminal using:
pip install numpy
19. Pandas:
Pandas is an open source Python library that is built on top of NumPy. It allows
you to do fast analysis as well as data cleaning and preparation.
An easy way to think of Pandas is to look at it as Python's version of
Microsoft Excel.
Pandas works well with data from a wide variety of sources such as Excel
sheets, CSV files, SQL databases, or even web pages.
Pandas provides tools for data manipulation: reshaping, merging, sorting,
slicing, aggregation etc., and allows handling missing data.
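A minimal sketch of these tools; the small sales table and the file names mentioned in the last comment are made up for illustration only:

import numpy as np
import pandas as pd

# A small table-like DataFrame (stands in for an Excel sheet or CSV file)
sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "revenue": [100, 250, np.nan, 300],          # one missing value
})

# Handling missing data
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Slicing and sorting
jan = sales[sales["month"] == "Jan"]
ordered = sales.sort_values("revenue", ascending=False)

# Aggregation: total revenue per city
totals = sales.groupby("city")["revenue"].sum().reset_index()

# Merging with another table
targets = pd.DataFrame({"city": ["Delhi", "Mumbai"], "target": [400, 500]})
report = totals.merge(targets, on="city")
print(report)

# Reading real data would look like pd.read_csv("sales.csv") or pd.read_excel("sales.xlsx")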
20. Installing Pandas:
If you have Anaconda, you can simply install Pandas from your terminal or
command prompt using:
conda install pandas
If you do not have Anaconda on your computer, install Pandas from your
terminal using:
pip install pandas
23. Matplotlib:
Data visualization is a very important part of data analysis. You can use it to
explore your data. If you understand your data well, you'll have a better chance
of finding some insights.
In python we use Matplotlib and Seaborn for data visualization.
There are many types of visualizations.
Some of the most famous are: line plot, scatter plot, histogram, box plot, bar
chart, and pie chart.
How do we choose the right visualization?
First, we need to do some exploratory data analysis. After we know the
shape of the data, the data types, and other useful statistical information, it
will be easier to pick the right visualization type.
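As a short sketch (the data is randomly generated just for the example), here are a few of the plot types above in Matplotlib, plus a Seaborn box plot:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(seed=1)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=2, size=50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, 2 * x)        # line plot
axes[0].set_title("Line plot")
axes[1].scatter(x, y)         # scatter plot
axes[1].set_title("Scatter plot")
axes[2].hist(y, bins=10)      # histogram
axes[2].set_title("Histogram")
plt.tight_layout()
plt.show()

# Seaborn builds on Matplotlib and adds statistical plots, e.g. a box plot
sns.boxplot(x=y)
plt.show()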
24. Installing Matplotlib:
If you have Anaconda, you can simply install matplotlib from your terminal or
command prompt using:
conda install matplotlib
If you do not have Anaconda on your computer, install Matplotlib from your
terminal using:
pip install matplotlib
27. What is Machine Learning?
Machine Learning is the field of study that gives computers the
capability to learn without being explicitly programmed. ML is one of
the most exciting technologies that one would have ever come across.
As is evident from the name, it gives the computer something that makes it
more similar to humans: the ability to learn.
30. Supervised Machine Learning:
• Supervised learning is where you have input variables (X) and an output
variable (Y) and you use an algorithm to learn the mapping function from the
input to the output.
Y = f(X)
• The goal is to approximate the mapping function so well that when you have
new input data (X), you can predict the output variable (Y) for that data.
• Supervised learning can be further grouped into:
• Classification: A classification problem is when the output variable is a
category, such as "red" or "blue", or "disease" and "no disease".
• Regression: A regression problem is when the output variable is a real value,
such as "dollars" or "weight".
31. Unsupervised Machine Learning:
• Unsupervised learning is where you only have input data (X) and no
corresponding output variables.
• The goal for unsupervised learning is to model the underlying structure or
distribution in the data in order to learn more about the data.
• Unsupervised learning problems can be further grouped into:
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
34. Linear Regression:
• Linear regression is a basic and commonly used type of predictive analysis.
The overall idea of regression is to examine a linear relationship between
the input variables (X) and the single output variable (Y).
• When there is a single input variable (X), the method is referred to as
"simple linear regression".
• When there are multiple input variables (X1, X2, …), the statistics
literature often refers to the method as "multiple linear regression".
• The regression line minimizes the sum of squared residuals, so the method is
also known as "Ordinary Least Squares (OLS)".
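A minimal sketch with scikit-learn; the house sizes and prices below are made-up numbers used only to show the API:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[650], [800], [1000], [1200], [1500]])   # single input variable (size in sq. ft.)
y = np.array([70.0, 85.0, 105.0, 125.0, 150.0])        # output variable (price in thousands)

model = LinearRegression()      # fitted by ordinary least squares (OLS)
model.fit(X, y)

print(model.intercept_, model.coef_)   # constant and slope of the regression line
print(model.predict([[1100]]))         # predicted price for a new house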
36. Applications of Linear Regression:
Many continuous-variable prediction tasks rely on linear regression, for example:
• Stock Exchange
• Weather Forecast
• Flight time prediction
• House price prediction
• Finance
• Econometrics
39. Logistic Regression:
• Logistic Regression is a classification algorithm.
• It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a
set of independent variables, i.e. to represent a binary / categorical outcome.
• You can also think of logistic regression as a special case of linear regression
where the outcome variable is categorical.
40. Working of Logistic Regression:
Y = b0 + b1X1 + b2X2 + … + bnXn + E
Y = Dependent variable
b0 = Constant
b1 = Coefficient of variable X1
X1 = Independent variable
E = Error term
This linear combination is then passed through the logistic (sigmoid) function,
1 / (1 + e^(-Y)), which turns it into a probability between 0 and 1.
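A minimal sketch, assuming scikit-learn; the hours-studied data is invented just to show how the linear combination becomes a probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied (illustrative)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # fail (0) / pass (1)

clf = LogisticRegression()
clf.fit(X, y)

# By hand: z = b0 + b1*X1, then probability = 1 / (1 + exp(-z))
z = clf.intercept_ + clf.coef_[0] * 4.5
print(1 / (1 + np.exp(-z)))

# The same probability of class 1 from scikit-learn (second column)
print(clf.predict_proba([[4.5]]))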
41. Applications of Logistic Regression:
Many categorical prediction tasks are handled with Logistic Regression, for example:
• Image Segmentation & Categorization
• Handwriting Recognition
• Spam filtering (Spam/Ham)
• Cell Imaging (Cancer/Normal)
• Production line Scan (Good/Defective)
43. K-Means Clustering (Unsupervised Machine Learning)
44. K-Means:
• K-Means clustering is one of the simplest and most popular unsupervised
machine learning algorithms.
• Typically, unsupervised algorithms make inferences from datasets using
only input vectors, without referring to known, or labelled, outcomes.
• You define a target number k, which refers to the number of centroids you need
in the dataset. A centroid is the imaginary or real location representing the
center of a cluster.
• In other words, the K-Means algorithm identifies k centroids, and
then allocates every data point to the nearest cluster, while keeping the
clusters as compact as possible.
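A minimal sketch with scikit-learn; the six 2-D points are made up so that they form two obvious groups:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # k = 2 centroids
kmeans.fit(X)

print(kmeans.cluster_centers_)            # centre of each cluster
print(kmeans.labels_)                     # cluster assigned to every data point
print(kmeans.predict([[0, 0], [9, 9]]))   # nearest cluster for new points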
47. Applications of K-Means:
K-Means is widely used to find structure in unlabelled data and make
predictions from it, for example in:
• Optical Character Recognition
• Biometrics
• Diagnostic Systems
• Military Applications
50. K Nearest Neighbor :
KNN can be used for classification. An object is classified by a majority vote of
its neighbors, with the object being assigned to the class most common
among its k nearest neighbors. It can also be used for regression.
It does not use the training data points to do any generalization. In other
words, there is no explicit training phase or it is very minimal.
The KNN algorithm is based on feature similarity: how closely the features of an
out-of-sample point resemble those of our training set determines how we classify
that data point.
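A minimal sketch using scikit-learn and the standard iris data set; the choice of k = 5 is only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" essentially just stores the examples

# A new sample gets the majority class of its 5 nearest neighbours
print(knn.predict(X_test[:3]))
print(knn.score(X_test, y_test))   # accuracy on unseen data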
51. K Nearest Neighbor :
Example of k-NN classification. The test sample (at the centre of the circles) should be classified either to the
first class of blue squares or to the second class of red triangles. If k = 3 (inner circle) it is assigned to the
second class because there are 2 triangles and only 1 square inside the inner circle. If, for example, k = 5
(outer circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
52. Applications of KNN:
Credit ratings: collecting financial characteristics and comparing people with
similar financial features to a database. By the very nature of a credit rating,
people who have similar financial details would be given similar credit ratings.
Should the bank give a loan to an individual? Would an individual default on his
or her loan? Is that person closer in characteristics to people who defaulted or
did not default on their loans?
In political science: classifying a potential voter as a "will vote" or "will not vote",
or as a "vote Democrat" or "vote Republican".
More advanced examples include handwriting recognition (as in OCR), image
recognition and even video recognition.
55. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a technique used to emphasize
variation and bring out strong patterns in a data set.
It is often used to make data easy to explore and visualize.
PCA reduces dimensionality by finding a new set of components, which are
composites of the original features but are uncorrelated with each other.
PCA is also known as a dimension-reduction algorithm.
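A minimal sketch with scikit-learn, reducing the four iris features to two principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 original features per flower

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2): dimension reduced from 4 to 2
print(pca.explained_variance_ratio_)   # share of variation captured by each component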
57. Applications of PCA:
Principal component analysis (PCA) is often used to reduce the dimension of
data before applying more sophisticated data analysis methods such as non-
linear classification algorithms or independent component analysis.
• Multivariate anomaly detection in health care.
• Finance: stock portfolio analysis.
60. Decision Tree:
A decision tree is a learning algorithm that constructs a set of
decisions based on training data.
Decision trees are popular because:
⢠They are naturally non-linear, so you can use them to solve
complex problems
⢠They are easy to visualize
⢠How they work is easily explained
⢠They can be used for regression (predict a number) and
classification (predict a class)
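A minimal sketch with scikit-learn; max_depth=3 is an arbitrary choice to keep the printed tree small:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # the learned set of if/else decisions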
66. Random Forest:
A random forest is an ensemble method based on decision trees.
1. Construct N decision trees
⢠Randomly sample a subset of the training data (with
replacement)
⢠Construct/train a decision tree using the decision tree
algorithm and the sampled subset of data
2. Predict by asking all trees in the forest for their opinion
⢠For regression problems, take the mean (average) of all
treesâ predictions
⢠For classification problems, take the mode of all treesâ
predictions.
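A minimal sketch of the two steps above using scikit-learn (100 trees is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the training data;
# for classification the final answer is the mode (majority vote) of the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))
# For regression, RandomForestRegressor averages the trees' predictions instead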
70. Applications of Random Forest:
Random Forests are among the most widely used classification and
prediction tools, for example in:
• Business Management
• Customer Relationship Management
• Fraudulent Statement Detection
• Engineering
• Energy Consumption
• Healthcare Management
• Fault Diagnosis
73. Support Vector Machine:
Support Vector Machine is a supervised machine learning algorithm which
can be used for both classification and regression, however it is mostly used for
classification problems.
"Separation of classes. That's what SVM does." It cleanly separates the two
classes by finding a line / hyperplane (in a multidimensional space) that
separates the classes.
74. Working of SVM:
Figure: (a) draw a line that separates the black circles from the blue squares; (b) the chosen cut divides the samples into two classes.
75. Tuning parameters in SVM:
• Kernel
• Regularization
• Gamma
• Margin
Kernel:
The learning of the hyperplane in linear SVM is done by transforming the
problem using some linear algebra. This is where the kernel comes into play.
76. Regularization:
The regularization parameter (often termed the C parameter in Python's sklearn
library) tells the SVM optimization how much you want to avoid misclassifying
each training example.
Left: low regularization value, right: high regularization value
77. Gamma:
The gamma parameter defines how far the influence of a single training example
reaches, with low values meaning "far" and high values meaning "close".
In other words, with low gamma, points far away from the plausible separating line
are considered in the calculation for the separating line, whereas with high gamma
only the points close to the plausible line are considered in the calculation.
78. Margin:
The margin is the distance between the separating line and the closest points of each class.
A good margin is one where this separation is large for both classes. The images
below give a visual example of good and bad margins. A good margin allows the
points to be in their respective classes without crossing over to the other class.
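A minimal sketch showing where these tuning parameters appear in scikit-learn; the particular kernel, C, and gamma values are illustrative, not recommendations:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel chooses the transformation, C controls regularization, gamma the reach
svm = SVC(kernel="rbf", C=1.0, gamma=0.1)
svm.fit(X_train, y_train)

print(svm.score(X_test, y_test))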
81. Neural Networks:
Which is better: a computer or a brain? Ask most people if they want a brain like
a computer and they'd probably jump at the chance. But look at the kind of
work scientists have been doing over the last couple of decades and you'll find
many of them have been trying hard to make their computers more like
brains!
How?
With the help of neural networks: computer programs assembled from
hundreds, thousands, or millions of artificial brain cells that learn and behave
in a remarkably similar way to human brains. What exactly are neural
networks? How do they work? Let's take a closer look!
82. Artificial Neural Network:
Artificial Neural Networks are computing systems inspired by biological
neural networks. Such systems learn to do tasks by considering examples,
generally without task-specific programming.
An ANN is based on a collection of connected units called artificial neurons.
Typically, neurons are organized in layers. Different layers may perform
different kinds of transformations on their inputs.
85. Working of Neural Networks:
Input layer - contains the units (artificial neurons) which receive input from
the outside world, i.e. the data the network will learn about, recognize, or
otherwise process.
Output layer - contains the units that respond with what the network has
learned about the task.
Hidden layer - these units sit between the input and output layers. The job of a
hidden layer is to transform the input into something that the output units can
use in some way.
Most neural networks are fully connected, which means each hidden
neuron is connected to every neuron in the previous (input) layer and to
the next (output) layer.
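A minimal NumPy sketch of one forward pass through such a fully connected network; the layer sizes, weights, and input values are invented for illustration, and training (e.g. backpropagation) is omitted:

import numpy as np

rng = np.random.default_rng(seed=0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A tiny fully connected network: 3 inputs -> 4 hidden units -> 1 output
x = np.array([0.5, -1.2, 3.0])              # input layer: data from the outside world

W_hidden = rng.normal(size=(4, 3))          # every hidden neuron connects to every input
b_hidden = np.zeros(4)
hidden = sigmoid(W_hidden @ x + b_hidden)   # hidden layer transforms the input

W_out = rng.normal(size=(1, 4))             # the output neuron connects to every hidden unit
b_out = np.zeros(1)
output = sigmoid(W_out @ hidden + b_out)    # output layer: the network's response

print(output)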