A Machine Learning Primer

Machine Learning for Big Data
Prof. Dr. Eirini Ntoutsi
FG Intelligent Systems
Faculty of Electrical Engineering and Computer Science
Leibniz University Hannover & L3S Research Center
Introduction to Machine Learning

Overview
 A Machine Learning primer
 Machine Learning in the real world
Prof. Dr. Eirini Ntoutsi - Introduction to Machine Learning

A Machine Learning primer

What is Machine Learning?
 ML “gives computers the ability to learn without being explicitly
programmed” (Arthur Samuel, 1959)
 We don’t codify the solution. We don’t even know it!
 Data is the key, together with the learning algorithm
[Figure labels: Data, Algorithms, Models, Automatic decision making]
How can we build computer programs that
automatically improve with experience?

How do machines learn?
 A machine is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.
Tom Mitchell, Machine Learning book
 Example
 Task T: Recognize good and bad products in a production system (e.g., a
drilling machine)
 Experience E: instances of good and bad products
 Performance measure P: % of correctly identified products
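The performance measure P from this example can be sketched in a few lines. The label lists below are made-up illustration data, and the function name `accuracy` is an assumption, not something defined in the slides.

```python
def accuracy(predicted, actual):
    """Percentage of instances whose predicted label matches the true label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Hypothetical labels for five products (illustration only).
actual = ["good", "good", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "bad"]

score = accuracy(predicted, actual)
print(f"P = {score:.1f}% correctly identified products")
```

In Mitchell's terms, the machine "learns" if this score goes up as it sees more instances of good and bad products (experience E).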

(Machine) Learning from experience/feedback
 Experience comes in the form of data (the so-called instances or examples)
from the specific problem/application
 In our example, instances correspond to certain characteristics of the
product, e.g.,
 Shape descriptors
 Weight
 Roughness of the surface
 …
 Besides the instance description, we might also have feedback on those
instances from some “teacher”/“expert”
 E.g., whether the produced product is good or bad
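One possible way to represent such an instance in code is sketched below. The class name, field names, units and values are all hypothetical; the slides only name the kinds of characteristics (shape descriptors, weight, surface roughness) and the optional feedback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductInstance:
    # Hypothetical numeric features describing one produced item.
    shape_descriptor: float
    weight_g: float
    surface_roughness: float
    # Feedback from the "teacher": "good" or "bad"; None if no feedback exists.
    label: Optional[str] = None


def has_feedback(instance: ProductInstance) -> bool:
    """True if a teacher/expert provided a label for this instance."""
    return instance.label is not None


labeled = ProductInstance(0.82, 12.5, 0.03, label="good")
unlabeled = ProductInstance(0.79, 12.9, 0.05)

print(has_feedback(labeled), has_feedback(unlabeled))
```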

(Machine) Learning from experience/feedback
 Based on the feedback, we can distinguish between:
 Direct-feedback instances (supervised learning)
 the correct response/label is provided for each instance by the “teacher”
 e.g., good or bad product
 No-feedback instances (unsupervised learning)
 no evaluation/label of the instance is provided, since there is no “teacher”
 e.g., no information on whether a product is good or bad, just the description of the
product/instance
 Indirect-feedback instances (reinforcement learning)
 less feedback is given: the teacher does not provide the proper action, only an
evaluation of the chosen action

Unsupervised learning
 Unsupervised learning/ Descriptive:
 Only a description of the instances is available
 No feedback/labels are available
 The goal is to discover groups of similar instances
 Typical examples: clustering, association rules, outlier detection
[Figure: scatter plot with axes Width [cm] and Height [cm] showing Cluster 1 and Cluster 2 (nails, paper clips)]

instance   width [cm]   height [cm]
1          2.6          4.5
2          3.7          7.3
3          4.1          6.5
4          8.5          8.1
5          9.5          5.5
…          …            …
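As a rough illustration of discovering groups without labels, here is a minimal k-means sketch on the five instances from the table. The choice of k = 2 and the initial centroids (first and last instance) are assumptions made for this example; k-means is only one of many clustering algorithms.

```python
import math

# (width, height) in cm, taken from the table above.
points = [(2.6, 4.5), (3.7, 7.3), (4.1, 6.5), (8.5, 8.1), (9.5, 5.5)]


def k_means(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Assumption: start from the first and the last instance as initial centroids.
labels, centroids = k_means(points, centroids=[points[0], points[-1]])
print(labels, centroids)
```

Note that the algorithm never sees a class label; the two groups emerge from distances alone, and the result depends on k and on the initialization.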

Unsupervised learning: Clustering
 A huge variety of clustering algorithms
 Partitioning methods (k-Means)
 Grid-based methods (CLIQUE)
 Density-based methods (DBSCAN)
 Hierarchical methods
 Constraint-based methods
 Model-based methods (EM)

Supervised learning
 Supervised learning/ Predictive:
 A description of the instances and their class labels is available
(training set)
 The goal is to learn a mapping from the instances to the class labels,
i.e., to predict the class label of a future, unseen instance
 Typical examples: classification, regression, outlier detection
[Figure: scatter plot with axes Width [cm] and Height [cm] showing labeled groups (Screw, Nails, Paper clips) and a new, unlabeled object]

instance   width [cm]   height [cm]   class
1          2.6          4.5           A
2          3.7          7.3           A
3          4.1          6.5           A
4          8.5          8.1           B
5          9.5          5.5           B
…          …            …             …
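To make the idea of predicting the class of a new object concrete, the sketch below applies a k-nearest-neighbour vote to the training set from the table. The new object's measurements and the choice k = 3 are assumptions for illustration; any other classifier could be substituted.

```python
import math
from collections import Counter

# ((width, height) in cm, class label), taken from the table above.
training = [
    ((2.6, 4.5), "A"),
    ((3.7, 7.3), "A"),
    ((4.1, 6.5), "A"),
    ((8.5, 8.1), "B"),
    ((9.5, 5.5), "B"),
]


def knn_predict(query, data, k=3):
    """Majority vote among the k training instances closest to the query."""
    neighbours = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


new_object = (8.0, 7.0)  # hypothetical measurements of an unseen object
predicted = knn_predict(new_object, training)
print(f"new object {new_object} -> class {predicted}")
```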

Supervised learning: classification
 A huge variety of classification algorithms
 Decision trees
 k-nearest neighbours
 Support vector machines
 Neural networks
 Bayesian classifiers
 Ensembles

Supervised learning: classification
 Different methods lead to different partitionings of the feature space
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

Supervised learning: regression
 Similar to classification, but the target to be learned is continuous rather
than discrete.
 Goal: predict the value of a continuous-valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency.
Given this data (house prices by size), a friend has a house of 750
square feet: how much can they expect to get for it?
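A minimal sketch of the linear case, using ordinary least squares with a single feature. The sizes and prices below are invented for illustration (the data from the original figure is not available here), so the resulting estimate says nothing about real house prices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: size in square feet, price in thousands.
sizes = [500, 1000, 1500, 2000]
prices = [160, 240, 360, 440]

slope, intercept = fit_line(sizes, prices)
predicted_price = intercept + slope * 750
print(f"estimated price for 750 sq ft: {predicted_price:.1f} (thousands)")
```

A nonlinear model of dependency would replace the straight line with, for example, a polynomial in the house size.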

Reinforcement learning
 The learning machine interacts with its environment via actions
 Minimal feedback is provided regarding how the learning machine is
performing
 Feedback in terms of reward
 The goal of the learning machine (the agent) is to learn a policy that
maximizes the expected reward
Source: https://en.wikipedia.org/wiki/Reinforcement_learning
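The interaction loop can be sketched with tabular Q-learning, one common reinforcement learning algorithm. The environment below (a corridor of five cells with a reward of 1 at the right end) and all hyperparameters are assumptions invented for this example.

```python
import random

N_STATES = 5        # corridor cells 0..4
GOAL = 4            # reaching this cell ends the episode
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: returns (next_state, reward)."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Epsilon-greedy: explore sometimes, and break ties randomly.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = step(state, action)
            # The only feedback is the reward; no "correct action" is given.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy for the non-terminal cells.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

The learned policy is read off the table by picking, in each cell, the action with the highest estimated return.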

Machine Learning in the Real World

ML in the real world
 Traditional ML assumptions
 The datasets are small and fit in memory
 Data is of a single type (e.g., numerical or text or images)
 For supervised learning
 The classes are well represented in the population (class balance)
 Labels are available for all instances (fully supervised learning)
 The quality of data (and labels) is good
 Data is stationary (the whole dataset is known in advance and there are no
changes in the data characteristics with time)
 …

Real world data
 The data manifest all Vs of big data
Source: http://blog.eoda.de/wp-content/uploads/2013/10/dv1.jpg

Big Data
Tackling the traditional restrictive ML assumptions
 The myth: we have big (labeled) data
 The reality:
 Huge amounts of unlabeled data
 Only a small amount of labeled data
 Goal: Use both labeled and unlabeled
data for training
 Related ML areas
 Semi-supervised learning
[Figure labels: Unlabeled, Labeled]
Iosifidis & Ntoutsi, Large scale sentiment annotation with limited labels, KDD 2017
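One simple semi-supervised strategy is self-training: a model trained on the few labeled instances labels the unlabeled ones it is confident about, and is then retrained. The sketch below is not necessarily the approach of the cited paper; the one-dimensional data, the nearest-centroid model and the confidence margin are all assumptions for illustration.

```python
def self_train(labeled, unlabeled, margin=2.0, max_rounds=10):
    """Self-training with a nearest-centroid model on 1-D data.

    An unlabeled value is pseudo-labeled only if it is at least `margin`
    closer to the nearest class centroid than to the second nearest one.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        classes = sorted({y for _, y in labeled})
        centroids = {}
        for c in classes:
            values = [x for x, y in labeled if y == c]
            centroids[c] = sum(values) / len(values)
        confident = []
        for x in pool:
            dists = sorted((abs(x - centroids[c]), c) for c in classes)
            (d_best, best), (d_second, _) = dists[0], dists[1]
            if d_second - d_best >= margin:
                confident.append((x, best))
        if not confident:  # nothing new to learn from: stop
            break
        labeled.extend(confident)
        newly_labeled = {x for x, _ in confident}
        pool = [x for x in pool if x not in newly_labeled]
    return labeled, pool


# Hypothetical data: two labeled values and five unlabeled ones.
seed_labels = [(1.0, "neg"), (9.0, "pos")]
unlabeled_values = [2.0, 3.0, 7.5, 8.0, 5.2]

enlarged, still_unlabeled = self_train(seed_labels, unlabeled_values)
print(enlarged, still_unlabeled)
```

Ambiguous values stay unlabeled here, which limits the risk of reinforcing the model's own mistakes.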

 The myth: data is stationary
 The reality:
 Data are collected over time and their characteristics
might change → data streams
 Goal: maintain valid models of the population
 Related ML areas:
 stream mining, adaptive ML
[Zhang et al, Journal Neurocomputing 2012]
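Why a model must be maintained over a stream can be shown with a toy experiment. The synthetic label stream below (all 1s, then all 0s), the warm-up length and the window size are assumptions; the "model" is just a majority-class predictor, evaluated in a test-then-train fashion.

```python
from collections import deque

# Synthetic stream of class labels with an abrupt change halfway through.
stream = [1] * 20 + [0] * 20
WARMUP = 10      # instances available before predictions start
WINDOW = 5       # size of the sliding window for the adaptive model


def majority(labels):
    """Majority class of a collection of 0/1 labels (ties go to 1)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0


# Static model: trained once on the warm-up data, never updated.
static_prediction = majority(stream[:WARMUP])

# Adaptive model: majority class over the most recent WINDOW labels.
window = deque(stream[WARMUP - WINDOW:WARMUP], maxlen=WINDOW)

static_hits = 0
window_hits = 0
for label in stream[WARMUP:]:
    static_hits += static_prediction == label
    window_hits += majority(window) == label  # test first ...
    window.append(label)                      # ... then train

n_test = len(stream) - WARMUP
print(f"static model:   {static_hits}/{n_test} correct")
print(f"adaptive model: {window_hits}/{n_test} correct")
```

The static model keeps predicting the old majority class after the change, while the windowed model recovers after a few instances.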

 Similar example for stream clustering
[Figure: clusters at times T1, T2 and T3, with the transitions: cluster expands, cluster shrinks, cluster is split]
Spiliopoulou, Ntoutsi, Theodoridis & Schult, MONIC and Followups on
Modeling and Monitoring Cluster Transitions, ECML PKDD 2013

Summary
 Machine learning is an exciting field with a huge variety of learning tasks
and algorithms for each task
 Different methods come with different assumptions, strengths and
limitations.
 The selection of the right method (and correct parameterization) is important
and requires a deep understanding of the methods and of the problem at hand
 A close cooperation with domain experts is required.
 Production systems (industry in general) pose new challenges for ML
due to their data complexity (volume, velocity, veracity, variety, value, …)
 “Factories are AI’s next frontier”, Andrew Ng
 https://www.technologyreview.com/s/609770/andrew-ng-says-factories-are-ais-next-frontier/
 The startup Landing AI is to work closely with manufacturers such as Foxconn, the
world’s largest contract manufacturer and maker of Apple’s iPhones

Thank you for your attention!
Questions/Comments?
Prof. Dr. Eirini Ntoutsi
FG Intelligent Systems
Faculty of Electrical Engineering and Computer Science
Leibniz University Hannover & L3S Research Center
http://www.kbs.uni-hannover.de/~ntoutsi/
ntoutsi@l3s.de
