A New Era
DataLab Classes
1 Introduction
Here, we propose the following schedule of classes.
The modules are organized as follows:
1. Python: Basic Python; Data Structures
2. Basic Mathematical Modules: Linear Algebra/Regression; Basic Probability; Basic Optimization Methods
3. Feature Engineering
4. Supervised Learning: Basic, Intermediate, and Advanced Supervised Learning; Advanced Optimization; Advanced Probability
5. Unsupervised Learning: Basic, Intermediate, and Advanced Unsupervised Learning
6. Data Mining: Basic and Advanced Data Mining
Part I
Python Module
2 Basic Python
2.1 History of Python
1. Python was conceived in the late 1980s.
2. Created by Guido van Rossum
(a) The Benevolent Dictator for Life!
3. Python reached version 1.0 in January 1994.
4. The major new features included in this release were the functional programming tools:
(a) lambda - creates small anonymous functions.
(b) map - applies a given function to each element of a list.
(c) filter - keeps the items that pass a test function.
(d) reduce - applies a function cumulatively to the items of a list, reducing them to a single value.
5. Standard Implementations
(a) CPython
(b) Jython (1997)
(c) IronPython (2006) - .NET
(d) PyPy (2007) written in RPython
6. Exotic Implementation
(a) Stackless Python (2000) does not depend on the C call stack.
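The functional tools listed above can be illustrated with a minimal sketch (the variable names are purely illustrative):

```python
# lambda, map, filter, and reduce on a small list; reduce lives in
# functools since Python 3.
from functools import reduce

nums = [1, 2, 3, 4, 5]
square = lambda x: x * x                          # anonymous function

squares = list(map(square, nums))                 # apply square to each element
evens = list(filter(lambda x: x % 2 == 0, nums))  # keep elements passing the test
total = reduce(lambda a, b: a + b, nums)          # fold the list into one value

print(squares)  # [1, 4, 9, 16, 25]
print(evens)    # [2, 4]
print(total)    # 15
```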
2.2 The Basic Structures
2.2.1 Variables
There is a list of things that need to be covered:
1. Assigning Values to Variables.
2. Multiple Assignment
3. Standard Data Types
(a) Numbers
(b) String
(c) List
(d) Tuple
(e) Dictionary
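The points above fit in a few lines (the values chosen are just examples):

```python
# Assignment, multiple assignment, and the five standard data types.
x = 10                      # Number (int)
a, b = 1, 2                 # multiple assignment
a, b = b, a                 # swap via tuple unpacking
s = "DataLab"               # String
xs = [1, 2, 3]              # List (mutable)
t = (1, 2, 3)               # Tuple (immutable)
d = {"key": 1}              # Dictionary
print(a, b)                 # 2 1
```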
A New Era 5
6. DataLab Classes
2.2.2 Python Basic Operators
Here, we need to review:
1. Arithmetic Operators
2. Comparison (Relational) Operators
3. Assignment Operators
4. Logical Operators
5. Bitwise Operators
6. Membership Operators
7. Identity Operators
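One example per operator family (a quick reference sketch, not an exhaustive list):

```python
# Python's seven operator families in miniature.
print(7 // 2, 7 % 2, 2 ** 3)     # arithmetic: floor division, modulo, power
print(3 < 5, 3 == 5)             # comparison
n = 10
n += 5                           # assignment operator: n is now 15
print(True and False, not True)  # logical
print(6 & 3, 6 | 3, 6 ^ 3)       # bitwise: and, or, xor
print(2 in [1, 2, 3])            # membership
xs = [1]
ys = xs
print(ys is xs)                  # identity: same object, not just equal
```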
2.2.3 Control Flow Structures
1. The If Statement
2. Loop Instructions
(a) For Instruction
(b) While Instruction
(c) Techniques
3. Else in loops
4. The range instruction
5. break and continue Statements
6. pass Statements
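Most of these constructs appear together in the classic prime-listing example (a small sketch, assuming nothing beyond the standard library):

```python
# for/else, break, and range in one small prime finder: the else clause
# of a for loop runs only when the loop finished without a break.
primes = []
for n in range(2, 20):
    for d in range(2, n):
        if n % d == 0:
            break              # n is composite: leave the inner loop
    else:
        primes.append(n)       # no divisor found, so n is prime
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19]
```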
2.2.4 Defining Functions
Here, we introduce the classic syntax for defining functions using:
1. The Fibonacci function
Then, we look at
1. Default Argument Values
2. Keyword Arguments
3. Arbitrary Argument Lists
In addition, we look at small expressions using lambda expressions.
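The topics above can be combined in one sketch (the helper `summarize` is a made-up example, not part of any course material):

```python
# Fibonacci with a default argument, plus keyword and arbitrary arguments.
def fib(n=10):
    """Return the Fibonacci numbers below n."""
    result, a, b = [], 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result

def summarize(label, *values, sep=", "):
    # *values collects arbitrary positional arguments into a tuple;
    # sep is a keyword-only argument with a default value.
    return label + ": " + sep.join(str(v) for v in values)

print(fib())                      # [0, 1, 1, 2, 3, 5, 8]
print(summarize("fib", *fib(5)))  # fib: 0, 1, 1, 2, 3
print((lambda x: x + 1)(41))      # 42
```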
2.2.5 Functional Programming - Lambda Calculus
Review the difference between:
1. Procedural
2. Declarative
3. Object-Oriented
4. Functional
3 Data Structures in Python
3.1 Lists
Review the most basic methods of a list:
1. append(x)
2. extend(x)
3. remove(x)
4. pop()
5. index(x)
6. count(x)
7. sort()
8. reverse()
Then, given all these operations, it is possible to use a list as:
1. Stacks
2. Queues
Here, you need to be careful about the difference between emulating these structures with a list and using a native implementation.
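A minimal sketch of the emulation-vs-native point, using `collections.deque` as the native double-ended queue:

```python
from collections import deque

# Stack: list append/pop at the right end are O(1), so a plain list works.
stack = [1, 2]
stack.append(3)
top = stack.pop()            # 3

# Queue: popping from the front of a list is O(n), because every element
# shifts; collections.deque gives O(1) at both ends.
queue = deque([1, 2])
queue.append(3)
front = queue.popleft()      # 1
```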
3.2 Tuples
Here, we will review the Tuple.
3.3 Sets
Sets are unordered collections with no duplicate elements.
3.4 Dictionaries
Another useful data type built into Python is the dictionary. It can be seen as a set of key:value pairs with unique keys.
Here, we will see how to replace a case/switch statement with a dictionary.
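The dictionary-dispatch idea looks like this (the command names and handlers are invented for illustration):

```python
# A dictionary of functions plays the role of a case/switch statement.
def on_start():
    return "starting"

def on_stop():
    return "stopping"

dispatch = {"start": on_start, "stop": on_stop}

def handle(command):
    # dict.get supplies the "default" branch of the switch
    return dispatch.get(command, lambda: "unknown command")()

print(handle("start"))   # starting
print(handle("reboot"))  # unknown command
```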
3.5 Hash Tables
Here, we will review some of the basics of:
1. Hash Functions
2. Hash Tables
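A toy sketch of both ideas at once (Python's built-in `dict` is far more sophisticated, but the mechanism is the same):

```python
# A hash table with separate chaining: a hash function maps each key to
# a bucket, and collisions within a bucket are resolved by a short list.
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # hash function -> bucket index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite an existing key
                return
        bucket.append((key, value))            # otherwise chain a new pair

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("pi", 3.14)
print(table.get("pi"))  # 3.14
```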
3.6 Trees
As you can imagine, trees are a natural evolution of linked lists. Thus, we will review how to:
1. Build a tree.
2. Traverse a tree.
3. Build expression trees.
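The first two points can be sketched with a small binary search tree (the node layout is one common choice, not the only one):

```python
# Build a binary search tree by repeated insertion, then traverse it
# in order, which visits the values in sorted order.
class Node:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def insert(root, value):
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def inorder(root):
    if root is None:
        return []
    return inorder(root.left) + [root.value] + inorder(root.right)

root = None
for v in [5, 3, 8, 1, 4]:
    root = insert(root, v)
print(inorder(root))  # [1, 3, 4, 5, 8]
```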
3.7 Graphs
Look at the basics of building graphs using lists and dictionaries. Then, look at:
1. Breadth-First Search
2. Depth-First Search
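Both searches on a dictionary-based graph (the four-node graph is an arbitrary toy example):

```python
from collections import deque

# A graph stored as a dictionary of adjacency lists.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def bfs(start):
    order, seen, frontier = [], {start}, deque([start])
    while frontier:
        node = frontier.popleft()          # FIFO: explore level by level
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

def dfs(start, seen=None):
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for nxt in graph[start]:               # recurse: go deep before wide
        if nxt not in seen:
            order += dfs(nxt, seen)
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D']
print(dfs("A"))  # ['A', 'B', 'D', 'C']
```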
3.8 Divide and Conquer
Here, review the basics of divide and conquer using
1. Merge Sort
2. Quick Sort
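Merge sort is the cleaner of the two to sketch: split, recurse, then merge the sorted halves.

```python
# Merge sort: a textbook divide-and-conquer algorithm.
def merge_sort(xs):
    if len(xs) <= 1:
        return xs                          # base case: already sorted
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:            # take the smaller front element
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]   # append whichever half remains

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```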
Part II
Basic Mathematical Modules
4 Linear Algebra [1]
4.1 Vector Spaces
1. Using Vector to Represent Samples
2. Basis of Vector Spaces
3. Sub-spaces
4. Linear independence
4.2 System of Linear Equations
1. Elementary row operations and elementary matrices
2. Example: computing the inverse matrix
3. Square Matrices and Inverses
4. Determinants
5. Derivatives of Matrices
4.3 Basis
1. Orthonormal Basis
2. Eigenvalues and Eigenvectors
4.4 Applications
1. Linear Regression
2. Graph Theory
3. Sparse Matrices
5 Probability and Statistical Theory [2]
5.1 Introduction
1. Probability Definition
2. Basic Set Operations
3. Counting
5.2 Bayes Theorem
1. Conditional probability
2. Independence of events
5.3 Distributions
1. Random Variables
2. Distributions
(a) Continuous
(b) Discrete
(c) Examples in Machine Learning
3. Mean and variance
4. Joint distributions
5. Covariance and Correlation
5.4 Applications
1. Bayesian inference
2. Principal Component Analysis
3. Analysis of Samples
6 Optimization [3]
6.1 Introduction
1. Formulation
2. Example: Least-Squares Error
3. Continuous versus Discrete Optimization
4. Constrained and Unconstrained Optimization
6.2 Basic Optimization
1. How to recognize a minimum?
2. Search using Derivatives
(a) Gradient Descent
3. Line Search
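The gradient-descent idea can be illustrated on a tiny least-squares problem (toy data and a fixed step size, chosen only for the sketch):

```python
# Gradient descent on a one-dimensional least-squares problem:
# minimize f(w) = sum_i (w * x_i - y_i)^2, whose gradient is
# f'(w) = 2 * sum_i x_i * (w * x_i - y_i).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]          # generated by y = 3x, so the minimizer is w = 3

w, step = 0.0, 0.01
for _ in range(500):
    grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys))
    w -= step * grad          # move against the gradient

print(round(w, 4))  # 3.0
```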
Part III
Feature Engineering
1. Introduction
2. Types of Data
(a) Categorical
(b) Numerical
3. Extracting meaning from the data
4. Encoding techniques of categorical variables:
(a) Feature hashing and bin-counting
5. Feature engineering for numeric data:
(a) Filtering, binning, scaling, log transforms, and power transforms
(b) Statistical analysis of the data
6. Natural text techniques:
(a) bag-of-words, n-grams.
7. The Cycle of feature engineering
(a) Forward selection, backward selection, etc.
8. Non-Linear Featurization and Model Stacking
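Two of the encoding techniques above on toy data (the categories and sentence are invented for illustration):

```python
# One-hot encoding of a categorical variable, and a bag-of-words count
# for a short text.
colors = ["red", "green", "red"]
vocab = sorted(set(colors))                       # ['green', 'red']
one_hot = [[1 if c == v else 0 for v in vocab] for c in colors]
print(one_hot)   # [[0, 1], [1, 0], [0, 1]]

text = "the cat sat on the mat"
bag = {}
for word in text.split():                         # bag-of-words: word counts
    bag[word] = bag.get(word, 0) + 1
print(bag["the"])  # 2
```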
Part IV
Supervised Learning
7 Basic Supervised Learning
7.1 Introduction [4, 5, 6]
1. What is Learning?
2. The Learning Process
7.2 Linear Classifiers [4, 5, 6, 7, 8]
1. What is a Classifier?
2. Introduction to Linear Classifiers
3. Mean-Square Error Linear Estimation
4. Regularization
7.3 Probability Classifiers [4, 5, 6, 7, 8]
1. Discriminant Functions
2. Naive Bayes
7.4 Advanced Optimization
1. Differentiable Convex Functions
2. Newton’s Method
3. Quasi-Newton’s Methods
4. Constrained Optimization
(a) Lagrange Multipliers
(b) The Karush-Kuhn-Tucker Conditions
(c) Duality
7.5 Advanced Probability
1. Central limit theorem
2. Bayes Theorem
(a) Likelihood Distribution
3. Bayesian inference
(a) Maximum A Posteriori
4. Generative Models
5. Point Estimation
6. Dirichlet Processes
8 Intermediate Supervised Learning
8.1 Probability Models for Supervised Classification
1. Logistic Regression
2. Expectation Maximization
3. Maximum A Posteriori in the LASSO
8.2 Important Issues [4, 5, 6]
1. Statistical Analysis of Classifiers
2. Bias-Variance Dilemma
3. The Confusion Matrix
4. K-Fold Cross-Validation
8.3 Kernel Based Classifiers [9]
1. Introduction
2. Support Vector Machines
9 Advanced Supervised Learning
9.1 Graphical Model Based Classifiers [4, 6, 9]
1. Decision Trees.
2. Neural Networks
• Perceptron
• Multilayer Perceptron
3. Deep Neural Networks
(a) CNN
9.2 Data Preparation
1. Feature selection [4, 5, 6]
(a) Preprocessing
(b) Statistical Methods
(c) Class Separability
(d) Feature Subset Selection.
2. Feature Generation [4, 5, 6, 17]
(a) Introduction
(b) Fisher Method
(c) Principal Component Analysis
(d) Dimensionality Reduction
9.3 Combining Classifiers [4, 10]
1. Average Rules
2. Majority Voting Rule
3. A Bayesian viewpoint
4. Boosting
Part V
Unsupervised Learning
10 Basic Unsupervised Learning [11, 7]
1. Introduction
2. Proximity Measures.
3. Admissible Cost Functions.
4. Basic Clustering Algorithms:
(a) K-Means
(b) K-Centers
(c) K-Medians
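The basic clustering loop behind K-Means can be sketched in one dimension (a minimal illustration with toy points, not a production implementation):

```python
# K-Means in 1-D: alternate between assigning each point to its nearest
# center and recomputing each center as the mean of its cluster.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:                   # assignment step
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]   # update step
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans(points, [0.0, 20.0]))  # [2.0, 11.0]
```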
11 Intermediate Unsupervised Learning [11]
1. Hierarchical Clustering
(a) Graph Model For Hierarchical Clustering
2. Partitional Clustering
(a) Grouping Based on Dissimilarities
(b) Grouping Based on Cluster Volume
3. Hierarchical Clustering vs. Partitional Clustering
12 Advanced Unsupervised Learning [11, 6]
1. Large Data Set Clustering:
(a) Density-Based: DBSCAN
(b) Grid-Based: STING, CLIQUE
2. Strategies for Large Data Sets
(a) One-pass strategies
(b) Summarization strategies
(c) Sampling/batch strategies
(d) Approximation strategies
(e) Divide and conquer strategies
Part VI
Data Mining
13 Basic Data Mining
13.1 Mining the Web for Structured Data [12, 13, 14, 15]
1. Introduction
2. Why do we want to deal with data from the web?
3. Extracting information from the web
13.2 Frequent Itemsets and Association Rules [12, 15]
1. The Market-Basket Problem
2. The basic algorithm
3. Improvements
14 Advanced Data Mining
14.1 Near Neighbor Search in High Dimensional Data [14, 16]
1. Locality Sensitive Hashing
2. Near Neighbor Search
14.2 Structure of the Webgraph [13]
1. The PageRank Algorithm
2. Topic-Specific PageRank
3. TrustRank
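The PageRank iteration can be sketched by power iteration on a tiny link graph (the three-page graph is invented; the damping factor 0.85 follows the original formulation):

```python
# Power-iteration PageRank: each page spreads its rank evenly over its
# out-links, damped toward a uniform teleport distribution.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iters):
        new = {page: (1 - damping) / n for page in links}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)   # split rank over out-links
            for target in outgoing:
                new[target] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

Page C collects rank from both A and B, so it ends up on top even though every page starts equal.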
References
[1] G. Strang, Introduction to Linear Algebra. Wellesley, MA: Wellesley-Cambridge Press, fourth ed., 2009.
[2] R. Ash, Basic Probability Theory. Dover Books on Mathematics Series, Dover Publications, Incorporated,
2012.
[3] J. Nocedal and S. J. Wright, Numerical Optimization. New York, NY, USA: Springer, second ed., 2006.
[4] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ,
USA: Springer-Verlag New York, Inc., 2006.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition). Wiley-Interscience, 2000.
[6] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Fourth Edition. Academic Press, 4th ed., 2008.
[7] S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective. Academic Press, 1st ed., 2015.
[8] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications
in R. Springer Publishing Company, Incorporated, 2014.
[9] S. Haykin, Neural Networks and Learning Machines. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2008.
[10] R. Rojas, “AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting,” 2009.
[11] S. Wierzchoń and M. Kłopotek, Modern Algorithms of Cluster Analysis, vol. 34. Springer, 2017.
[12] J. Han, Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
2011.
[13] A. N. Langville and C. D. Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings.
Princeton, NJ, USA: Princeton University Press, 2006.
[14] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. New York, NY, USA: Cambridge University
Press, 2011.
[15] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition
(Morgan Kaufmann Series in Data Management Systems). San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 2005.
[16] H. Samet, Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in
Computer Graphics and Geometric Modeling). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
2005.
[17] R. E. Schapire and Y. Freund, Boosting: Foundations and algorithms. MIT press, 2012.