INDUSTRIAL MACHINE LEARNING
Grigorios Tsoumakas,
School of Informatics,
Aristotle University of Thessaloniki
OUTLINE
What is Machine Learning?
Industrial Applications of Machine Learning
DEFINITIONS OF ML
Machine learning is the subfield of computer science that gives
computers the ability to learn without being explicitly programmed
Arthur Samuel, 1959
A computer program is said to learn from experience 𝐸 with respect to
some class of tasks 𝑇 and performance measure 𝑃 if its performance at
tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸
Tom Mitchell, 1997
MAIN TASKS
Supervised Learning
• Input variables 𝒙
• Output variable 𝑦
• Mapping function 𝑦 = 𝑓(𝒙)
Unsupervised Learning
• Input variables 𝒙
• Learn more about the data
Reinforcement Learning
• Agent acting in an environment so as to maximize cumulative reward
http://www.isaziconsulting.co.za/machinelearning.html
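As an illustration of the supervised setting, where a mapping 𝑦 = 𝑓(𝒙) is estimated from labelled examples, here is a minimal sketch. It assumes 𝑓 is a straight line with one input variable and uses made-up toy data; the function names are hypothetical and not taken from the slides.

```python
# Minimal sketch of supervised learning: estimate f in y = f(x) from labelled pairs.
# Assumptions (not from the slides): f is a line y = w*x + b, and the data is a toy set.

def fit_line(xs, ys):
    """Ordinary least squares for a single input variable; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(model, x):
    """Apply the learned mapping to a new input."""
    w, b = model
    return w * x + b


# Toy training set generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
model = fit_line(xs, ys)
```

Calling `predict(model, 5.0)` then applies the learned mapping to an input that was not in the training set.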
OTHER TASKS
Association Rules
• Items X => Items Z
Anomaly Detection
• Identify unusual data points
Recommender Systems
• Predict the rating that a user would give to an item
…
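To make the association rule notation X => Z concrete, the sketch below computes the two standard measures of a rule, support and confidence, over a handful of transactions. The basket contents and function names are illustrative assumptions.

```python
# Sketch: support and confidence of an association rule X => Z.
# The baskets below are made-up data for illustration only.

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Here bread and butter appear together in 2 of 4 baskets, and butter appears in 2 of the 3 baskets that contain bread.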
ALGORITHMS / APPROACHES / TRIBES
Discriminative vs Generative
• 𝑝(𝑦|𝑥) vs 𝑝(𝑦, 𝑥)
Lazy vs Eager
• Lazy learners do no learning until a test instance arrives
Parametric vs Non-Parametric
• Parametric representations don’t grow with more training data; non-parametric ones do
The 5 Tribes of ML
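One way to see both the lazy and the non-parametric property at once is a k-nearest-neighbours classifier, sketched below. This is an illustrative choice, not an algorithm named on the slide; the class name and toy points are assumptions.

```python
# Sketch of a lazy, non-parametric learner: k-nearest neighbours.
# "Training" only stores the examples; all work happens when a test instance arrives.
from collections import Counter


class KNearestNeighbours:
    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def fit(self, xs, ys):
        # No model is built here: the representation is the training set itself,
        # so it grows with every additional training example.
        self.examples = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Sort stored examples by squared Euclidean distance to the query point.
        by_distance = sorted(
            self.examples,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
        )
        votes = Counter(label for _, label in by_distance[: self.k])
        return votes.most_common(1)[0][0]


# Toy data: two well-separated groups of 2-D points.
knn = KNearestNeighbours(k=3).fit(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
    ["a", "a", "a", "b", "b", "b"],
)
```

An eager, parametric learner would instead compress the training set into a fixed number of parameters at fit time and discard the examples.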
SL: LINEAR MODELS, SVMS, TREES AND NNS
Pedro Domingos. 2012. A few useful things to
know about machine learning. Commun. ACM 55
“MORE DATA BEATS A CLEVERER ALGORITHM”
The Economist. Facebook post, May 5th, 2017
“Those who gather the most data will
dominate the digital landscapes of the future”
SL: LINEAR MODELS, SVMS, TREES AND NNS
Pedro Domingos. 2012. A few useful things to
know about machine learning. Commun. ACM 55
“LEARN MANY MODELS, NOT JUST ONE”
Anthony Goldbloom. Kaggle CEO. Oct 2015.
“As long as Kaggle has been around, it
has almost always been ensembles of
decision trees that have won
competitions. It used to be random forest
that was the big winner, but over the last
six months a new algorithm called
XGBoost has cropped up, and it’s winning
practically every competition in the
structured data category.”
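The idea of learning many models can be sketched with the simplest form of ensemble, a majority vote. This is plain voting, not random forest or XGBoost, and the three hand-written rules below are hypothetical stand-ins for trained models.

```python
# Sketch of "learn many models, not just one": combine weak models by majority vote.
# Assumption: the rules below are placeholders for models that would normally be trained.
from collections import Counter


def majority_vote(models, x):
    """Return the label predicted by most of the models for input x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Three weak binary classifiers, each looking at the input differently.
models = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.0 else 0,
]
```

For the input `(0.9, 0.2)` the second rule is outvoted by the other two. Random forests and gradient boosting build on the same principle but train their trees on resampled data or on the errors of earlier trees.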
CLUSTERING: KMEANS, LDA
DIMENSIONALITY REDUCTION: PCA, SVD
LANGUAGES, LIBRARIES, TOOLS & APIS
METHODOLOGIES
http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
Pedro Domingos. 2012. A few useful things to
know about machine learning. Commun. ACM 55
“FEATURE ENGINEERING IS THE KEY”
“Data scientists spend 50-80% of their
time in data collection and preparation”
https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
OUTLINE
What is Machine Learning?
Industrial Applications of Machine Learning
WHAT HAS CHANGED?
Faster distributed systems
The explosion in computing power has
allowed us to use machine learning to
tackle ever more complex problems
Exponential data growth
The explosion of data being
captured and stored has allowed us
to apply machine learning to an
ever-expanding range of domains
The amount of collected data is
doubling every 12 months and
will reach 44 zettabytes by 2020
NATURAL GAS LOAD FORECASTING
Collaboration with Gas Supply Company of
Thessaloniki & Thessaly
The problem
• Daily statements of day-ahead demand must be submitted to the regulatory entity
• Actual consumption must lie within a percentage of the statement (e.g. 10%); otherwise financial penalties are imposed
Similar framework in the electricity domain
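The tolerance rule described above can be sketched as a simple check. The symmetric band, the default of 10%, and the function names are assumptions for illustration; the actual regulatory rule may differ in detail.

```python
# Sketch of the tolerance rule: is actual consumption within a percentage of
# the day-ahead statement? Assumes a symmetric band around the statement.

def relative_deviation(statement, actual):
    """Signed deviation of actual consumption from the statement, as a fraction."""
    return (actual - statement) / statement


def within_tolerance(statement, actual, tolerance=0.10):
    """True if actual consumption lies within +/- tolerance of the statement."""
    return abs(actual - statement) <= tolerance * statement


# Hypothetical daily values in arbitrary energy units.
ok_day = within_tolerance(1000.0, 1080.0)    # 8% above the statement
fine_day = within_tolerance(1000.0, 1150.0)  # 15% above the statement
```

Under this reading, a forecasting model is judged less by its average error than by how often its day-ahead statement falls outside the band.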
SCREENSHOT
UNDERSTANDING ACADEMIC PUBLICATIONS
Collaboration with Atypon Inc.
• Online content hosting and management software
• Atypon is home to more than one-third of the world’s English-language professional and scholarly journals; clients include Elsevier, IEEE, MIT Press, Oxford University Press, Taylor & Francis, …
Some of the things we do
• Automated semantic indexing of articles and figures
• Information extraction (e.g. funding information)
• Question answering
PubMed Central
UNDERSTANDING ACADEMIC PUBLICATIONS
PubMed
• 10,876,004 abstracts (18 GB)
• 26,563 MeSH terms, ~13 per abstract on average
[Chart: yearly series, 1950 to 2013; vertical axis from 0 to 1,200,000]
INDUSTRY – ACADEMIA PARTNERSHIPS
Industry-funded research & development
• Staff, senior researchers, and PhD students
Pro bono exploratory work
• MSc theses
National and EU funding
THE END… OR THE BEGINNING?
