This document provides an introduction to machine learning, including definitions of key terminology such as features, training data, test data, validation data, vectors, similarity, supervised learning, unsupervised learning, dimensionality reduction, and overfitting. It then outlines the typical steps in a model building cycle, including data collection, cleaning, preprocessing, sampling, model building, deployment, and improvement. Several common supervised learning methods are listed, such as linear regression, logistic regression, decision trees, and ensemble methods. Unsupervised learning methods like k-means clustering and hierarchical clustering are also introduced. The document concludes by discussing similarity measures, dimensionality reduction techniques, recommendations, text mining, and the vector space model.
2. 2
Machine Learning: In simple terms, a set of techniques for learning
patterns from data
- These techniques rest on statistical assumptions about the
data
- Conceptually, these techniques can be applied to many
forms of data
- Machine learning models are built on training data (a proportion
of the actual data) and are then used to predict patterns in
unseen data
Statistical Model: The outcome of a machine learning process is an
entity called a model, often referred to as a statistical model
Terminology
3. 3
Feature (or) Dimension: Feature, dimension, variable, attribute, and
property all refer to a characteristic of the data
Ex: {age, height, gender} are features of a User
Training Data (60-80%): Sampled data used for building the model
Test Data (20%): Sampled data used for testing the model
Validation Data (20%): Sampled data (unseen) used for validating
the model
Terminology cont ..
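The training/test/validation split described above can be sketched as follows. This is a minimal illustration, assuming a 60/20/20 split; the function name split_dataset, the fixed seed, and the toy rows are hypothetical and not taken from the slides.

```python
import random


def split_dataset(rows, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle rows, then cut them into train, test, and validation lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs (assumption).
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


# Toy data: 100 row identifiers standing in for real records.
rows = list(range(100))
train, test, validation = split_dataset(rows)
```

Shuffling before cutting is what makes the three sets random samples; each row ends up in exactly one set.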
4. 4
Vector: A vector is a multi-dimensional representation of a data
point
- Each row in a data matrix is a vector
Similarity: A measure of how close two data points are in the
vector space
Ex: Euclidean distance, cosine similarity, etc.
Terminology cont ..
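As a rough sketch of the two measures named above, the snippet below treats each row of a small matrix as a vector and compares rows. The data values are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined if either vector is all zeros


# Each row of the matrix is one data point (vector); values are illustrative.
matrix = [
    [3.0, 4.0],
    [6.0, 8.0],
    [4.0, -3.0],
]

distance_01 = euclidean_distance(matrix[0], matrix[1])
cosine_01 = cosine_similarity(matrix[0], matrix[1])
cosine_02 = cosine_similarity(matrix[0], matrix[2])
```

Rows 0 and 1 point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not 0; rows 0 and 2 are perpendicular, so their cosine similarity is 0.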
5. 5
Supervised Learning: Modeling techniques used when you have
labeled data
Ex: Customer segmentation using classification
Unsupervised Learning: Modeling techniques used when you don't
have labeled data; they rely on the natural structure
of the data
Ex: Customer segmentation using clustering
Dimensionality Reduction: Techniques that reduce M dimensions
to N dimensions, where M > N,
- such that the N dimensions still explain most of the variation in the data,
- so that computation and interpretation become easier.
Terminology cont ..
6. 6
Overfitting: If a model is tuned too closely to the training data, it won't
be able to predict unseen data accurately. This situation is
called overfitting.
Terminology cont ..
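One way to see overfitting on a toy problem: a 1-nearest-neighbor model memorizes every training label, including noisy ones, while a simple threshold rule ignores the noise. Everything below (the data, the noise positions, the threshold of 5) is a made-up illustration, not a result from the slides.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def threshold_predict(x, threshold=5.0):
    """A much simpler model: predict 1 when x is at or above the threshold."""
    return 1 if x >= threshold else 0


def accuracy(predictions, labels):
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Underlying rule: label is 1 when x >= 5. Labels at x=3 and x=7 are noise.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]
# Unseen points with clean labels.
test_x = [1.4, 2.9, 3.2, 6.8, 7.1, 8.6]
test_y = [0, 0, 0, 1, 1, 1]

knn_train_acc = accuracy(
    [nearest_neighbor_predict(train_x, train_y, x) for x in train_x], train_y)
knn_test_acc = accuracy(
    [nearest_neighbor_predict(train_x, train_y, x) for x in test_x], test_y)
simple_train_acc = accuracy([threshold_predict(x) for x in train_x], train_y)
simple_test_acc = accuracy([threshold_predict(x) for x in test_x], test_y)
```

The memorizing model scores perfectly on the training data but poorly on the unseen points, while the simpler rule scores lower on training data and higher on unseen data. That gap between training and unseen accuracy is the symptom to watch for.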
7. 7
Typical steps of a model building cycle include, but are not limited to:
1. Data collection: collecting data from sources
2. Data cleaning: Dealing with missing values etc.
3. Pre-processing: Outliers & transformations
4. Random sampling: train, test, and validation sets
5. Model building: iterative process
   1. Feature selection: selecting the subset of features that best
      explains the data
   2. Validation: finalize model summaries
   3. Model selection: model comparison & final model
6. Model deployment: For predicting unseen data
7. Feedback & model improvement
Model building cycle
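Steps 2 and 3 of the cycle above (cleaning and pre-processing) might look like the sketch below. It assumes missing values are marked as None, fills them with the mean, and applies min-max scaling as the transformation; these are just one possible choice for each step, and the values are illustrative.

```python
def impute_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Pre-processing: rescale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - low) / (high - low) for v in values]


# One feature column with a missing entry (illustrative values).
raw_values = [2.0, None, 4.0, 6.0]
cleaned = impute_missing_with_mean(raw_values)
scaled = min_max_scale(cleaned)
```

In practice the mean and the min/max would be computed on the training set only and then reused on the test and validation sets, so that no information leaks from unseen data.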
8. 8
A few supervised learning methods to explore
1. Multiple Linear Regression
2. Logistic Regression
3. Decision Tree
   1. CART
   2. CHAID
4. Ensemble Methods
   1. Bagging
   2. Boosting
5. KNN
6. Naïve Bayes
Supervised Learning
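As a small taste of the first method in the list, here is a least-squares fit with a single feature. It is a simplification of multiple linear regression (one input variable instead of several), and the data points are invented for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labeled training data: each x comes with a known target y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
prediction = slope * 5.0 + intercept  # predict the target for unseen x = 5
```

The labels (ys) are what make this supervised: the model is fitted so that its predictions match known answers.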
9. 9
A few unsupervised learning methods to explore
1. K-means clustering
2. Hierarchical clustering
Unsupervised learning
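A minimal k-means sketch is shown below. It assumes the caller supplies the initial centroids (real implementations usually pick them randomly or with a seeding heuristic) and uses squared Euclidean distance; the points are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(
                    sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Two visibly separate groups; no labels are provided.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, initial_centroids=[points[0], points[3]])
```

No labels are used anywhere: the grouping comes only from how the points sit relative to each other.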
10. 10
A few similarity measures to explore
1. Euclidean distance
2. Cosine similarity
3. Pearson correlation
4. Jaccard similarity
5. Tanimoto distance
Similarity measures
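Two of the measures listed above, sketched with made-up inputs: Jaccard similarity on sets and Pearson correlation on numeric vectors. For sets (binary features), the Tanimoto coefficient reduces to the same formula as Jaccard similarity.

```python
import math


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets.

    Assumes at least one of the sets is non-empty.
    """
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def pearson_correlation(xs, ys):
    """Linear correlation in [-1, 1]; assumes neither input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Items bought by two hypothetical users.
jaccard = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
# Ratings that rise together, and ratings that move in opposite directions.
pearson_up = pearson_correlation([1, 2, 3], [2, 4, 6])
pearson_down = pearson_correlation([1, 2, 3], [6, 4, 2])
```

Jaccard only looks at which items are present, while Pearson looks at how numeric values move together, so the right choice depends on the kind of features in the data.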
11. 11
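A few dimensionality reduction methods to explore
- PCA (Principal Component Analysis)
- Factor Analysis
- SVD (Singular Value Decomposition)
Dealing with sparsity
- Min Hashing
- LSH (Locality-Sensitive Hashing)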
Dimensionality reduction
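To make PCA concrete, the sketch below reduces 2 dimensions to 1. It finds the first principal component with power iteration on the covariance matrix, which is a simple stand-in for the eigen-decomposition or SVD a library would use. The data, the starting vector, and the iteration count are assumptions for illustration.

```python
import math


def covariance_matrix(points):
    """Return the sample covariance matrix and the mean-centered points."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize.

    Assumes the starting vector is not orthogonal to the top component.
    """
    dims = len(cov)
    vector = [1.0] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector


# Two strongly correlated features (M = 2), reduced to one (N = 1).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
cov, centered = covariance_matrix(points)
component = first_principal_component(cov)

# Project every centered point onto the component: one number per point.
projected = [sum(c * v for c, v in zip(component, row)) for row in centered]
projected_variance = sum(z * z for z in projected) / (len(points) - 1)
explained_ratio = projected_variance / (cov[0][0] + cov[1][1])
```

Here explained_ratio is the share of the total variance kept by the single new dimension, which is the sense in which the reduced representation "explains most variation in the data".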
12. 12
Collaborative Filtering
- Item based
- User based
- Slope One
Challenges
- Cold start problem
- Curse of dimensionality
- Outliers
Frequent items/association rules
Recommendations
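A sketch of the basic (unweighted) Slope One scheme mentioned above: the predicted rating for an item is the user's rating of another item plus the average difference between the two items across users who rated both. The user names, item names, and ratings are hypothetical.

```python
def slope_one_predict(ratings, user, target_item):
    """Predict a rating with basic (unweighted) Slope One.

    Returns None when no other item shares a co-rating with the target.
    """
    estimates = []
    for item, user_rating in ratings[user].items():
        if item == target_item:
            continue
        # Rating differences from every user who rated both items.
        diffs = [other[target_item] - other[item]
                 for other in ratings.values()
                 if target_item in other and item in other]
        if diffs:
            deviation = sum(diffs) / len(diffs)
            estimates.append(user_rating + deviation)
    if not estimates:
        return None
    return sum(estimates) / len(estimates)


# Hypothetical user -> {item: rating} data on a 1-5 scale.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}

predicted = slope_one_predict(ratings, "lucy", "A")
```

The raw prediction is not clipped, so it can fall outside the rating scale; a caller would typically clamp it. A user or item with no overlapping ratings yields None, which is the cold start problem in miniature.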
13. 13
Text Mining
1. NLP approach (building language-dependent models)
2. Machine Learning approach:
documents are converted into a vector space model, and
machine learning techniques are then applied to them to solve
problems.
Vector space model
- documents => data points
- words in the documents => features
Feature, document pairs
<feature, document, TF*IDF>
TF = normalized Term Frequency
IDF = Inverse Document Frequency
Text Mining
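The <feature, document, TF*IDF> triples above can be produced with a sketch like the following. It assumes TF is the word count divided by document length and IDF is log(N / document frequency); other TF and IDF variants exist. The document names and texts are made up.

```python
import math


def tf_idf_triples(documents):
    """Return (feature, document, TF*IDF) triples for a dict of texts."""
    # Naive tokenization: lowercase and split on whitespace (assumption).
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for words in tokenized.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    triples = []
    for name, words in tokenized.items():
        for word in sorted(set(words)):
            tf = words.count(word) / len(words)       # normalized term frequency
            idf = math.log(n_docs / doc_freq[word])   # inverse document frequency
            triples.append((word, name, tf * idf))
    return triples


documents = {
    "d1": "the cat sat",
    "d2": "the dog sat",
    "d3": "the cat ran",
}
triples = tf_idf_triples(documents)
weights = {(word, doc): score for word, doc, score in triples}
```

A word that appears in every document, such as "the" here, gets a weight of zero, while a word unique to one document gets the highest weight. Each document's weights form its vector in the vector space model.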