Analytics: How to predict anything
A short overview

WHITE PAPER
Dr. Udo Göbel
All brand names, trademarks, product names, abbreviations thereof and logos are the property of their
respective companies and are protected as such. All third-party protected brand names and trademarks
are subject in all respects to the regulations of the relevant registration law and the rights of possession
of the respective registered owner. The mere appearance of names without designation as proprietary
shall not be construed as meaning that trademarks are not protected by the rights of third parties.
© CONTACT Software GmbH. All rights reserved. Changes may have come into effect after copy deadline
for this document. All information supplied without guarantee.
Table of contents

1	Introduction
2	A historical perspective
	Analytics trends over the past 200 years (abbreviated)
	Ancient times
	Consolidation and axiomatization
	Computer time
	AI
	Stochastic control
	Financial mathematics
	Modern times
	Neural networks
	Change of paradigm
	Prediction, prediction, prediction
	Summary
3	Analytics landscape
	Algorithmic purpose
	Regression
	Classification
	Anomaly detection
	Clustering
	Reinforcement learning
	Dimensional reduction
	Software landscape
4	The data science pipeline
	The basic pipeline
	Feature engineering
5	Data science in product businesses
	Products and digital twin
	Conclusion
6	Bibliography, online resources and references
	Learning data science
1 Introduction
The last few years have seen an explosion of interest in analytic methods for exploring data. Drivers of this development have been trends in the availability of data ("big data"), especially regarding the large volumes of data publicly available from social networks, breakthroughs in specific areas of algorithmic research (e.g. neural networks applied to classification), the general availability of ever increasing storage and computing capacities at rapidly falling costs (i.e. Moore's law in connection with cloud computing) and, recently, the rise of stunningly cheap devices able to connect to the internet and to provide steadily increasing amounts of data about their environment and their own status (i.e. IoT, the internet of things). These factors have contributed to the recent notion of data science as an independent scientific endeavour apart from the classical statistics curriculum, and more specifically to the notion of data scientist as a job title for analysts featuring the necessary skill set of programming, statistics and algorithmic expertise to navigate and exploit the new opportunities.

What sets this hype cycle surrounding the topic of analytics apart from many others we have had before in the IT industry is that it is not coming our way as an isolated topic, as compared to hype cycles like XML, ESB, SOAP or Web 2.0, which we have witnessed popping up and fading out over the years. Instead our topic is driven by, and at the convergence point of, four megatrends: cloud computing, IoT, big data and algorithmic computing. These areas have become, or are well on the way to becoming, an overwhelming success in a very short time frame and are already delivering on their promises. The latter point is best illustrated by the rise of cloud computing over the past five years, now generating billions of dollars of revenue for Amazon (AWS), Microsoft (Azure) and Google (GCP). Therefore understanding the current hype cycle surrounding analytics and taking appropriate action is a concern for every company, especially those considering themselves technology leaders in their respective market segments and exposed to some or all of the driving megatrends.

Since analytics in itself is a huge topic and the megatrends are big topics in their own right, it is easy to get lost in the details and be overwhelmed by the sheer volume of news and stories coming in every day. We will try here to give a simple overview of the field, with emphasis on describing the main factors influencing past, current and future developments, in order to enable the reader to assess the importance of the topic and dig deeper if needed. Subsequently we elucidate some practical issues concerning data science and its importance in the field of PLM (Product Lifecycle Management) and give some insight into the roadmap of CIM Database concerning analytics.
Figure 1: Current megatrends, their mutual dependencies and their influence on analytics.
2 A historical perspective
Analytics trends over the
past 200 years (abbreviated)
To put things in perspective let us briefly review how we got here and what happened to the ancient field of statistics to rise from a 100-year sleep (at least regarding the public attention it has received during its existence; the research community wouldn't agree with that). Analytics, which up to 1990 was considered the realm of applied statistics, is an ancient topic going back more than 200 years. Its birth date may be given as the moment when Gauss applied the method of least squares to calculate the orbits of minor planets by fitting orbit parameters to observations. The prolific Gauss, who usually didn't bother to publish "minor" work, considered it important enough to state his priority of discovery after Legendre published the method independently in 1805.1 Not surprisingly, due to its usefulness the method is still one of the most heavily used tools to fit data to some empirical model. In modern terms it may be described as consisting of an empirical model, e.g. a linear model like

$$y = a + b\,x + \varepsilon,$$
where x is the input (in various contexts also given fancier names such as predictor, regressor, independent variable, controlled variable etc.), y is the output (response, regressand, outcome etc.) and ε an error term due to inaccuracies of measurement or noise. The parameters a and b of the model are initially unknown and our goal is to determine them as well as we can from our measurements of the input x and output y in many observations. The method of least squares gives an explicit formula for this, taking the observations (x_i, y_i) as input and delivering an estimate for the parameters; e.g. the estimate for b is given by:

$$\hat{b} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
In addition, according to the statistical model for the error term, accuracy guarantees can be computed in a probabilistic manner; e.g. the statistician's statement may be: "The parameter b is contained in the interval $(\hat{b} - c,\ \hat{b} + c)$ with probability 95%" (of course the actual stated probability depends on the width of the interval c: it is somewhat obvious that it is more likely that the "real" parameter b is contained in an interval with wide error margins, i.e. big c, than in a small interval). For more than a hundred years these ingredients would be the hallmarks of the statistical science:
■	 Data generated from experiment
■	 A method for producing an estimate
in explicit form
■	 A (probability) model for the data
■	 Accuracy guarantees justifying the use
of the method
The last step is usually called inference. Without it, our estimates from data lose much of their appeal, especially in the physical sciences, since we cannot be sure if their use is sensible at all: the fancy algorithm we have used might just have produced a number no better than guessing.
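To make the formula above concrete, here is a minimal sketch (not part of the original paper) that computes the least-squares estimates on synthetic data with numpy; the "true" parameter values are assumptions chosen purely for illustration.

```python
# Illustrative sketch: least-squares estimates of a and b in the model
# y = a + b*x + eps, using the closed-form formula for b-hat given above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)            # observed inputs x_i
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   # outputs with assumed a=2.0, b=0.5 plus noise

b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()
print(f"estimated a = {a_hat:.3f}, estimated b = {b_hat:.3f}")
```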
Figure 2: The modern field of data science at the intersection of statistics, machine learning and computer science.
Ancient times
During the first 150 years of the development of the statistical science sketched above, from Gauss up to the 1900s, the methods were, as expected, ad hoc and application specific. Gauss, Laplace and Pascal contributed a good deal of ideas revolving
around concrete problems with the notable excep-
tion of Bayes who formulated the celebrated Bayes’
theorem2 which lay dormant for a long time only to
make a vigorous return when modern computational
power was finally available to give full power to its
many useful applications in modern statistical
analysis.
Consolidation and
axiomatization
The first half of the twentieth century saw the mathe-
matization of the discipline via the axiomatic formu-
lation by Kolmogorov which provides the solid frame-
work to this day. The great statisticians Fisher and
Pearson contributed now standard ideas and many
of the standard statistical methods in use today. The
body of this work up to 1950 is basically what can be
found in school books today: Hypothesis testing,
confidence intervals, p-values, χ2–testing etc. It
cannot be denied that the rigorous mathematization
of the science gave the discipline a slightly bureau-
cratic and boring flavour probably stifling the playful
advancement of the following period. All these
methods operated on small data sets and were
optimized for computing with pencil and paper.
Computer time
The second half of the twentieth century saw the rise of computer programs. Whereas the goal of former times was to avoid computations in a field where gathering data and evaluating it was a cumbersome task, the advent of computers changed this attitude, albeit slowly. The breakthrough was proba-
bly the invention of the bootstrap method. The idea
sounds simple and intuitive looking back now, but
sounded certainly crazy then: Instead of throwing
heavy mathematics at each special case the boot-
strap method simply resamples from the existing
dataset to get at probabilistic estimates of parame-
ters.
Of course this adds a considerable computational
overhead to the procedure. But the big advantage is
that the procedure is very simple, automatic and
universal in the sense that it is immediately applica-
ble to arbitrary complicated curves instead of only
linear regression for which the classical formula
provided the analytical solution. Arguably as simple
as it is this shift to unabashed usage of modern
computer power ushered in the era of computers and
software in the statistical sciences ultimately result-
ing in software packages such as SAS or R and a more
playful and experimental approach to statistics.
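As a hedged illustration of the resampling idea (a sketch, not the paper's own code), the following recomputes the slope estimate from the linear model above on bootstrap resamples and reads off an approximate 95% interval from the resampled estimates:

```python
# Bootstrap sketch: resample the observed (x_i, y_i) pairs with replacement and
# recompute the slope estimate each time; the spread of the resampled estimates
# replaces the classical, formula-based error analysis.
import numpy as np

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

boot = []
for _ in range(2000):                       # 2000 bootstrap resamples
    idx = rng.integers(0, len(x), len(x))   # sample indices with replacement
    boot.append(slope(x[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])   # approximate 95% interval for b
print(f"bootstrap 95% interval for b: ({lo:.3f}, {hi:.3f})")
```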
Another point of view, coming into its own only with the advent of powerful computers, is bayesian statistics, as mentioned above. Whereas its origins lie with Bayes' theorem, the full force of its application is
hard to understand from its original formulation.

Figure 3: A timeline of computational statistics since its inception, with a rough measure of the computer power needed for the methods, in MIPS (million instructions per second).

The
gist of the bayesian method is to regard everything as
a probability statement. A simple example is estimat-
ing the probability p of a coin coming up heads for a
coin suspected of being not fair (i.e. p ≠ 0.5, the coin
comes up more often heads than tails or vice versa).
In the classical line of thought one would conduct a
hypothesis test with the null hypothesis being the
coin is fair and accept or reject this depending on the
outcome of an extensive coin flipping experiment. In
the bayesian view the “real” value of p is itself a
random variable and if you flip the coin say a hun-
dred times and it comes up 65 times heads up you
are interested in the probability of this outcome given
the value of the observed number of heads i.e. P(p |
k=65) (read: Probability of p given number of heads is
65). Of course you don’t know p but according to
Bayes’ theorem this can be deconstructed to:
Now the first term in the numerator is given by the
binomial distribution and up to a factor not depend-
ing on p is given by
If we factor in a probability distribution for P(p)
(called the prior in bayesian statistics) we have a
functional form for the desired distribution P(p|k=65)
(the posterior in bayesian paralance). The “real” value
of p can be estimated from this by e.g. the maximum
or the expectation value. The real value of the meth-
od comes up when further experiments are conduct-
ed. We can go on and use the calculated distribution
of p up to now as the new prior P(p) and calculate
further refinements on the go. Note that these tricks
are only possible because the parameter p, which is
an unknown but fixed quantity in classical statistics,
was regarded as a random variable itself. This in-
nocuous method has revolutionized the way statis-
tics is done especially in the field of parameter
estimation. Usually there are many parameters
continuously distributed leading to multidimensional
integrals and therefore bayesian statistics is practical-
ly tractable only with considerable use of computer
power. A further element necessitating a change of
view or an enlargement of the toolset was modern
science in the form of the human genome project
leading to ever larger arrays of genome sequences
(micro arrays) on which simultaneous testing of e.g.
genome expression levels was conducted. The
analysis and inference on the resulting data was
impossible without computers and led to modern
methods such as HMM (Hidden Markov Models),
MCMC (Markov Chain Monte Carlo) and more, also
impossible to conduct without modern computer
power. In summary the field of statistics evolved
slowly but steadily from a theoretically driven field to
a very successful practical endeavour heavily domi-
nated by computational tools.
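A minimal sketch of the coin example, assuming a uniform prior on p: with 65 heads in 100 flips the posterior is a Beta(66, 36) distribution, and a further experiment simply updates the Beta parameters (the numbers in the second update are invented for illustration).

```python
# Posterior for the coin example: with a uniform prior the posterior after
# observing 65 heads in 100 flips is proportional to p^65 (1-p)^35, i.e. Beta(66, 36).
from scipy import stats

heads, tails = 65, 35
posterior = stats.beta(1 + heads, 1 + tails)   # uniform prior = Beta(1, 1)

p_max = heads / (heads + tails)                # posterior maximum (with uniform prior)
print(f"posterior mean: {posterior.mean():.3f}")
print(f"posterior max:  {p_max:.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")

# sequential update with a second (hypothetical) experiment: 40 heads in 70 new flips
posterior2 = stats.beta(1 + heads + 40, 1 + tails + 30)
print(f"updated mean:   {posterior2.mean():.3f}")
```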
Figure 4: Bootstrap example. Sample points were generated from f(x) = 5x − 3x² + 0.3x³ with added Gaussian noise; the fitting curves are obtained by resampling from the sample points.
AI

In contrast to the steady evolution of statistics, AI enjoyed a comparatively roller-coaster history with many ups and downs. Artificial intelligence came into being with the advent of the first computers and as a result had much closer ties to computer science and algorithmic experiments than to statistics and probability. Interestingly, the idea of neural networks was formulated3 even before the first computers were available, and a viable algorithm for training (backpropagation) was proposed in the early days of computer usage.4 But the history of AI is less glorious in delivering on its promises. This was largely due to much higher expectations which simply couldn't be delivered upon with the available computer power before the millennium. Therefore the idea of neural network computing had to wait 30 years to be put into practice due to the lack of computational power. During the intervening years AI was a science characterized by big hopes and big setbacks leading to several (failed) hype cycles, coining the term AI winter (cf. nuclear winter)5, i.e. phases during which funding was drastically cut back and the reputation of the field badly damaged. But since the millennium the field has made a comeback, especially in the form of machine learning (see Modern times below).

Stochastic control

In the late 50s and early 60s Kalman and Bucy6 formulated a framework based on SDEs (stochastic differential equations) to accurately estimate the state of a dynamical system whose parameters can only be measured with considerable noise. Their formulation was a breakthrough which found immediate application in the Apollo space program for tracking the position of the space capsule. Nowadays it is a ubiquitous tool for tracking the positions of spacecraft, planes, trucks or drones and steering them in a reliable way.
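As a hedged, toy-scale illustration of the state estimation idea (a discrete 1-D sketch, not the Kalman–Bucy SDE formulation itself), the following filter tracks position and velocity from noisy position readings; all numbers are invented for the example.

```python
# Minimal 1-D Kalman filter: an object moves at constant velocity, only noisy
# position readings are observed, and the filter keeps a running estimate of
# position and velocity together with their uncertainty.
import numpy as np

dt = 1.0
F = np.array([[1, dt], [0, 1]])       # state transition: position += velocity*dt
H = np.array([[1.0, 0.0]])            # we observe the position only
Q = 0.01 * np.eye(2)                  # process noise covariance
R = np.array([[4.0]])                 # measurement noise covariance

x = np.array([[0.0], [0.0]])          # initial state estimate (position, velocity)
P = np.eye(2)                         # initial estimate covariance

rng = np.random.default_rng(0)
true_positions = 0.5 * np.arange(50)  # the object really moves at velocity 0.5
for z in true_positions + rng.normal(0, 2.0, 50):
    x, P = F @ x, F @ P @ F.T + Q                    # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
    x = x + K @ (np.array([[z]]) - H @ x)            # update with measurement z
    P = (np.eye(2) - K @ H) @ P

print(f"estimated velocity: {x[1, 0]:.2f} (true value 0.5)")
```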
Financial mathematics

The SDE framework was brilliantly applied in the field of finance, which underwent a revolution in the 1970s when Fischer Black and Myron Scholes solved the problem of pricing European-style options by assuming that the underlying asset in the form of a stock is governed by Brownian motion and postulating that there is an efficient market free of arbitrage. The resulting PDE for the price V(t)7 of the option as a function of time,

$$0 = \frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + r S \frac{\partial V}{\partial S} - r V,$$

though certainly understandable only with considerable mathematical expertise and solvable only with heavy use of computer power, was soon applied all over and the options market underwent an explosion.

The explosion and complete revolution of the financial market, and its ensuing meltdown in 2008, for which the extension of these models and their unlimited use without proper risk management is at least partially responsible, demonstrate the power (and the danger) of (blindly) applying algorithms to practical problems.
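For a European call option the PDE above has a well-known closed-form solution; the following is a minimal sketch with illustrative parameter values, not code from the white paper.

```python
# Closed-form Black-Scholes price of a European call with strike K and maturity T.
from math import log, sqrt, exp
from statistics import NormalDist

def bs_call(S, K, T, r, sigma):
    """Price of a European call under the Black-Scholes model."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    N = NormalDist().cdf
    return S * N(d1) - K * exp(-r * T) * N(d2)

# e.g. stock at 100, strike 105, one year to maturity, 1% rate, 20% volatility
print(f"call price: {bs_call(100, 105, 1.0, 0.01, 0.20):.2f}")
```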
Figure 5: A timeline of AI and machine learning with its various failures and the computational power needed to use the various methods. Computer power in red means that the methods would have needed this much power but it wasn't available at the time.
This digression into a seemingly unconnected side-
line of statistics is quite informative as a template for
things to come in fields which are just about entering
a phase of mathematization and algorithmic innova-
tion.
Modern times
The years 2000 to 2010 brought a renewed interest in
AI, breakthroughs in neural networks and an explo-
sion in large scale computing available to the public
in the form of cheap GPUs and cloud computing.
Especially from 2010 onwards a shift to algorithmic
computing and a playful expansion of their scope in
various niche domains can be seen with image
classification and speech recognition probably the
most prominent.
Neural networks
The current hype surrounding neural networks as
probably the most visible part of the movement to
automatic prediction can be traced back to some
iconic problems where their application has brought
spectacular improvements. One of these problems is
image classification. By 2010 a large body of images had been gathered and made freely available for research by professors from Stanford University. Since 2010 the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)9 has been organized. One of the tasks is classifying an image according to given categories, e.g. dogs, cats etc. Whereas during the first two years "classical" algorithms like SVM (Support Vector Machine) and bayesian methods dominated the competition and were a mixed success regarding the ability to classify images correctly, in 2012 a participating team applied neural network algorithms and dramatically lowered the error rate, i.e. the proportion of misclassified pictures.
During the next years a combination of ever more
refined neural networks in connection with massive
amounts of computing power in the form of GPUs led
to spectacular improvements to the point that the
error rate is now comparable to or even better than
human error rates.10
Since 2014 more than 90% of all entries in the contest
use massive amounts of computer power in the form
of GPUs some teams being sponsored by NVIDIA.
Change of paradigm
It is to be noted that the process of generating insight
from data via neural networks or more generally
machine learning follows a somewhat different path
than that established by the classical statistical
sciences:
■	 Data generated or collected “somehow” (often gen-
erated especially for the express purpose of ana-
lyzing, e.g. the corpus for analyzing language or the
pictures within image classification contests)
■	 An algorithm to generate insight e.g. by classification
■	 Measuring the score of the algorithm as a measure of success

Steps two and three often go in cycles to tune the bells and whistles of the algorithm. Note the absence of any probability model and in consequence the lack of any accuracy guarantees here: the score might become very good by tuning the parameters but we don't really know what this means regarding the robustness of the algorithm when new data comes in.

Figure 6: Daily average option volume in millions of trades since 1973.⁸
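A hedged sketch of the data / algorithm / score workflow described in the list above, using scikit-learn on a small built-in image data set; note that the only quality measure is the score on held-back data, with no probabilistic guarantee attached.

```python
# Machine-learning workflow: collect data, fit an algorithm, and judge it purely
# by a score on held-back data -- no probability model is involved.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # small image classification data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")   # the "score" step
```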
A second characteristic notable especially with the
use of neural networks is best explained when com-
pared to expert systems en vogue primarily in the 70s
and early 80s of the last millennium (old style AI).
While expert systems were rule based meaning that
at least in principle their mode of operation is under-
standable (although in practice due to the number of
rules in working systems they are not) neural net-
works pose a new challenge because they function as
a perfect black box. Especially problematic is this
shift when something unforeseen occurs. E.g. consid-
er an autonomous car running on a neural network
crashing into something. If this were due to a rule
based expert system one would simply follow the
path of rules through the decision tree taken by the
system and add or change a rule to ameliorate the
behavior. With a neural network the only answer is:
Add a few thousand crashes of this type and retrain
the network to avoid them.
Due to the success of machine learning in various
fields combined with their promise of delivering
immediate business value there is an increasing
willingness to trust the computer beyond any reason.
Signs of what can go wrong surfaced in the neural
network world in 2014¹² when Goodfellow et al.
showed that imperceptible changes to images can
fool a neural network image classifier to misclassify
images in an arbitrary way. Follow up work has
shown that this is not restricted to the niche of image
classifiers.13 These examples show that neural net-
works after a lot of training may perform better than
humans on a corpus of sample images but can go
wrong in subtle ways not yet fully understood and
perhaps even more disturbing may be fooled at will
by determined attackers.
Figure 7: ImageNet contest, error rates for classification.¹¹ Classification error fell from 28% in 2010 to a few percent by 2016; the switch from SVM-based methods to neural networks came in 2012.

Prediction, prediction, prediction

The common denominator of this long and varied history of statistics and machine learning in the broadest sense, and of its various developments and methods, is the goal to peek into the unknown: prediction. Prediction can take on various forms depending on the domain:
■	 prediction of an unknown value to be discovered
or at least pinpointed later (old style statistics e.g.
determine the fraction of faulty products coming
from a production line)
■	 prediction of what an object is (classification, e.g.
tag an image by processing with a neural network)
■	 prediction of the future (e.g. option pricing, predict
the price of an asset in the near future)
It is to be expected that the various fields will cross-
fertilize each other and we will see huge expansions
of each method outside their respective fields.
Summary
As detailed above data science today is at the conver-
gence point of several subdisciplines of statistics and
artificial intelligence. At this unique point in history
we are at the convergence of several trends: For the
first time in history the needed computer power to
solve practical machine learning problems and large
scale statistical problems is commonly and cheaply
available (at least as compared to prior use of spe-
cialized super computers) in the form of cheap
specialised hardware (graphic processing units) or as
a service (cloud computing) thus enabling large scale
experimenting and modelling of big data sets for
everyone. The old ideas in machine learning have
been refined and researched to the point that they
solve rather complex and practical problems like
image processing and natural language processing in
a highly efficient and quite usable manner delivering
immediate business value.
The pendulum from academic mathematization to
free experimenting with algorithmic approaches to
data has swung back to the practical side. The emergence of social networks and big shopping platforms has provided data volumes in quantities never seen before. The advent of the internet of things will further explode the volume of data in need of being analyzed, offering huge business opportunities for new digital businesses.
It is not hard to see that these developments will
accelerate from here on. We will see an ever greater
reliance on powerful computers tackling an increas-
ing breadth of problems. Especially the field of
artificial intelligence in the guise of modern machine
learning is now able to fulfill its promises for the first
time in its roller-coaster history. Thus we stand at a
historical point in time where the computational
challenges and algorithmic ideas meet for the first
time with adequate computer resources available for
everyone. Innovation and ideas in algorithmic com-
puting are no longer limited by computer power.
Perhaps the history of the financial industry since the 70s can be a lesson in how an industry evolves when modeling and mathematization of business pro-
cesses take place on a large scale. There is no holding
back from this point: Machine learning will explode in
the years to come.
Figure 8: The pictures in the left column in both panels are correctly classified by a neural network (e.g. the yellow bus as "bus", the wildcat as "cheetah", the white dog as "dog" etc.). The pictures in the right column in the panels are classified uniformly as "ostrich", the difference between left and right consisting of some pixel values depicted in the middle. Humans cannot see a difference here between left and right columns, showcasing the phenomenon that the neural network "thinks" quite differently than humans do and may be made to think in arbitrary ways with knowledge of the inner workings of the network.
3 Analytics landscape
After assessing the history and importance of analytics, let's map out the analytics landscape by various differentiators to give an impression of the breadth and purpose of today's analytic toolset.

Depending on method or purpose, the landscape may be broken up by classification of algorithms or by the intended purpose of algorithms. On the other hand, as always in the software industry, there are plenty of software frameworks available for use with various infrastructures.

Algorithmic purpose

The easiest differentiator is algorithmic purpose. Here we can classify the algorithms by what we want to achieve. The categories available are broadly as follows:

■	 Regression
■	 Classification
■	 Anomaly detection
■	 Clustering
■	 Reinforcement learning
■	 Dimensional reduction

Regression

With regression techniques one attempts to estimate the relationship between variables for the purpose of prediction or forecasting. Examples are standard curves in medicine for vital variables to predict health status or examining a timeline with the purpose of predicting a future value.

Classification

Classification aims to categorize objects with given features. These may be pictures with various motifs which the computer has to tag with labels, or customers with certain properties who are categorized according to potential business value.

Anomaly detection

In anomaly detection we are trying to detect outliers in data not conforming to an expected pattern or behaviour. This may be seen as a subfield of classification but it has its own unique methods setting it largely apart from classification tasks.

Clustering

Clustering algorithms partition data points into clusters with similar features. This is akin to classification, the key difference being that the labels are missing and the algorithm has to come up with the clusters by itself. Imagine a group of people in a social network with many features (number of posts, age, number of friends, content of posts, gender etc.) and giving it to a machine learning algorithm which should come up with "similar" people without telling the computer what you mean by "similar" (see the sketch at the end of this section).

Reinforcement learning

Mainly applied in robot control, machine movement problems and game theoretical problems where the computer should learn how to best behave in an initially unknown environment by specifying certain rewards if it achieves various goals.

Dimensional reduction

Methods often used in a preliminary step of data preprocessing to reduce huge feature sets to a manageable subset by selecting the most important ones or constructing various forms of combined features giving the essential characteristics without having to use the whole feature set.

Depending on the task at hand one or more methods from these categories have to be picked, but typically naturally occurring data is messy and a mixture of various methods has to be employed to be successful.
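As a hedged illustration of the clustering case described above (with synthetic "user features" invented for the example), k-means groups the users without being told what "similar" means:

```python
# Clustering sketch: hypothetical numeric "user features" are grouped into clusters
# without any labels telling the algorithm what "similar" means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# columns: number of posts, age, number of friends (all synthetic)
users = np.column_stack([
    rng.poisson(20, 300),
    rng.integers(16, 70, 300),
    rng.poisson(150, 300),
])

X = StandardScaler().fit_transform(users)        # put features on a comparable scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))                        # how many users ended up in each cluster
```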
Software landscape

As of 2017 there are already many players in the field competing for market share. The existing software may be roughly categorized by the specificity of the task it tries to solve. On one end of the spectrum there are already solutions for very specific industries like retail or finance; on the other end we have broad frameworks for data science bundling the main algorithms for ease of use by a data scientist.

It is to be noted that most basic data science relies on open source software. Especially important here are basic statistical libraries, where two languages dominate: R14 and Python.

R is a statistical language with emphasis on classical statistics but has also been enriched with libraries for machine learning.

Python has evolved as the main language data scientists use to develop and analyze models, especially for machine learning. Noteworthy are the data science libraries pandas15 for statistical analysis and scikit-learn16 for machine learning.

All relevant libraries for Python, including interfaces to R, are bundled in the Anaconda17 framework, which is a complete solution for doing data science.
Figure 9: Chart of available frameworks and tools in the field of machine learning. The chart appeared in the O'Reilly article "The current state of machine intelligence", which is updated yearly; this is the third installment in their series.
4 The data science pipeline
The basic pipeline
Data Science consists foremost of cleaning and
analyzing often messy data.
The first step is cleaning the data, which consists of getting and importing the data in the first place (often laborious work that has earned the apt name wrangling in the community), followed by some preliminary cleaning. The second step involves modeling. This comprises exploration, preprocessing and the actual modeling and goes in cycles until a satisfactory model has been found describing the data well enough. After that it is essential to validate the model for various reasons. One problem is overfitting the data with the model, meaning that the model describes the data we have very well but will behave very badly on future data. Usually this quality assurance is done by keeping some of the data back as a test data set only to be used in the validation step.
Finally we arrive at a result from which business value
may be generated by automating a process or deplo-
ying the model on a machine using it to drive some
process.
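A minimal end-to-end sketch of the pipeline on a tiny synthetic data set (not CONTACT's implementation): wrangle and cleanse with pandas, model, and validate on held-back test data.

```python
# Basic pipeline sketch: cleanse -> model -> validate on held-back data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"temperature": rng.uniform(20, 80, 200)})
df["load"] = 3.0 * df["temperature"] + rng.normal(0, 5, 200)
df.loc[df.index % 20 == 0, "load"] = np.nan        # simulate messy data: missing readings

df = df.dropna()                                   # cleansing step
X, y = df[["temperature"]], df["load"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # modelling step
print("R^2 on held-back test data:", model.score(X_test, y_test))   # validation step
```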
Feature engineering
Much of the modelling phase especially in machine
learning consists of preparing the data in a form
ingestible by the usual algorithms which expect their
data in the form of numerical features.
E.g. if you want to feed a timeline of sensor data into
an anomaly detection algorithm it doesn’t make
sense to give single data points as input. Instead you
may first define time windows on the time series and
within each time window extract features like mean,
standard deviation, minimum, maximum etc. In this
way the machine may be able to figure out which of
the time windows shows an anomalous behaviour.
This sort of work has been aptly named feature engineering, and the skillful application of features to data determines to a great extent success or failure in the field of machine learning.18
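A hedged sketch of this windowing approach with pandas; the window size and the injected anomaly are illustrative assumptions.

```python
# Feature engineering sketch: cut a sensor time series into fixed-size windows and
# compute summary features per window, which could then be fed to an anomaly detector.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = pd.Series(rng.normal(0, 1, 1000))
signal.iloc[600:620] += 6            # inject an anomalous stretch for illustration

window = 50
features = signal.groupby(signal.index // window).agg(["mean", "std", "min", "max"])
print(features.sort_values("max", ascending=False).head())   # the anomalous window stands out
```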
Figure 10
The data science pipeline
5 Data science in product businesses
Products and digital twin
We give an overview of what the analytics landscape consists of today, what tools are available and how to go about actually putting them to work.
Product development and product services are an especially interesting field for future data science expansion. Users, sensors and tools agglomerate tons of data, and with the advent of IoT the task of collecting, structuring and analyzing that data, reporting insights and executing measures based on these insights will become even more important.
The main tool to connect data in the field to the abstract product is the digital twin, which is a proxy for each
physical instance of a product in the field, together with a connection to the virtual model used in product
development.
CONTACT offers tested and widely used open source libraries for data science via integration with the Anaconda framework. Also, the data model is enhanced by all necessary tools, notably the digital twin, to function as the coordinating hub for product data science.
For data connection cloud connectors are available and allow seamless integration with field devices directly
or for big data scenarios via cloud data paths through cloud providers like AWS or Azure.
Conclusion
We have given some rough sketches of what the analytics landscape consists of today, what tools are available and how to go about actually putting the available tools to work. Further suggestions can be found in the bibliography and the list of online resources.
Figure 11
CONTACT stack for
complete data science
and IoT architecture
6 Bibliography, online resources and references
Some suggestions for further reading and hints for
digging deeper into the available literature. Of course
the opinions expressed here reflect the biases of the
author:
Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference. Cambridge: Cambridge University Press.
A scholarly but enjoyable text with an historical
account of statistics from which the author took most
historical references. Knowledge of statistical me-
thods is assumed if you want to follow the text, but
the historical account may be read without paying
too much attention to the math.
Evans, L. C. (2013). An Introduction to Stochastic Differential Equations. American Mathematical Society.
A little known gem introducing
stochastic differential equations in under 150 pages.
If you want to learn something about SDEs and have
a little training in probability theory this is a pleasure
to read. (If you are new to the entire field: SDEs are
used in financial math; they are not needed for neural
networks).
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
A fine book mainly mapping out the theory of neural
networks. This book is quite self-contained requiring
not many prerequisites and gives you a primer in
linear algebra and probability theory before diving
into machine learning. Combined with a practical
book like Raschka you’re quickly up to speed.
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Heidelberg: Spektrum Akademischer Verlag.
A good overview of classical statistics with a modern
concise presentation of bayesian and likelihood
methods. The author summarizes the main points
without getting lost in mathematical detail. An
update in English under the title "Applied statistical
inference” (2013) is also available from the same
author.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An Introduction to Statistical Learning. New York: Springer.
A standard reference for the field of machine lear-
ning. Appropriate and enjoyable for beginners. An
accompanying online course at Stanford University
can be attended (for free). Strongly recommended.
Raschka, S. (2016). Python Machine Learning. Birmingham: Packt Publishing Ltd.
A practical book especially for python coders and the
anaconda framework giving you a hands on learning
experience. Combined with Goodfellow this may be a
good start into the field of machine learning.
Campolieti, G., & Makarov, R. N. (2014). Financial Mathematics: A Comprehensive Treatment. Boca Raton: Chapman and Hall/CRC Financial Mathematics Series.
A truly comprehensive reference to modern financial
mathematics if you are really interested. The advan-
ced chapters on SDEs and option valuation are
probably not understandable without a university
degree in math, but it is one of the best references in
the field.
Learning data science
There are a number of very good free online resour-
ces available:
■	 edX
a platform for MOOCs (massive open online courses); search for data science courses
■	 Newsletters at O'Reilly
There are newsletters for data science and AI one can subscribe to at http://www.oreilly.com/data/newsletter.html and http://www.oreilly.com/ai/newsletter.html which feature links to interesting resources and stories
For beginning data science especially learning R and
Python frameworks one may consult DataCamp
which specializes in data science courses. Subscrip-
tion fees are low and some courses are free.
Software resources are freely available, mainly for Python and R. The Anaconda framework from Continuum Analytics is a bundle which already contains the main machine learning libraries like scikit-learn, numpy and pandas.
R is freely available with many tutorials available
online.
References

1 https://en.wikipedia.org/wiki/Least_squares
2 https://en.wikipedia.org/wiki/Bayes%27_theorem
3 McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133
4 https://en.wikipedia.org/wiki/Artificial_neural_network
5 https://en.wikipedia.org/wiki/AI_winter
6 https://en.wikipedia.org/wiki/Kalman_filter
7 The option price V(t) is dependent on the stock price S(t), its volatility σ and the interest rate r. The movement of the stock is governed by a stochastic differential equation dS = r S dt + σ S dW, where W is a so-called Wiener process modelling Brownian motion.
8 Source: OCC historical volumes of trade, https://www.theocc.com/webapps/historical-volume-query
9 http://www.image-net.org/about-stats
10 For an interesting overview of the challenge see https://arxiv.org/abs/1409.0575v3
11 Data taken from the project site of ImageNet (http://image-net.org); see the official results published for each year. The displayed error rates are featured in the table "classification + localization ordered by classification" and have been rounded to the nearest integer in percent. The labeling of the algorithmic switch from SVM to NN is an oversimplification by the author: actual algorithms used in the contest are complicated mixes of various methods; see the project site for details.
12 Explaining and harnessing adversarial examples, Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy, http://arxiv.org/abs/1312.6199
13 Adversarial Perturbations Against Deep Neural Networks for Malware Classification, Kathrin Grosse et al., https://arxiv.org/abs/1606.04435
14 See the R project site https://www.r-project.org/ for more information and downloads
15 http://pandas.pydata.org/
16 http://scikit-learn.org/stable/
17 https://www.continuum.io/
18 See for example: Mastering Feature Engineering: Principles and Techniques for Data Scientists, Alice Zheng, O'Reilly, 2017
www.contact-software.com

 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

Data Anayltics: How to predict anything

  • 1. A short overview Analytics: How to predict anything WHITE PAPER Dr. Udo Göbel
  • 2. All brand names, trademarks, product names, abbreviations thereof and logos are the property of their respective companies and are protected as such. All third-party protected brand names and trademarks are subject in all respects to the regulations of the relevant registration law and the rights of possession of the respective registered owner. The mere appearance of names without designation as proprietary shall not be construed as meaning that trademarks are not protected by the rights of third parties. © CONTACT Software GmbH. All rights reserved. Changes may have come into effect after copy deadline for this document. All information supplied without guarantee.
  • 3. 1 Introduction 4 2 A historical perspective 6 Analytics trends over the past 200 years (abbreviated) 7 Ancient times 8 Consolidation and axiomatization 8 Computer time 8 AI 10 Stochastic control 10 Financial mathematics 10 Modern times 11 Neural networks 11 Change of paradigm 11 Prediction, prediction, prediction 12 Summary 13 3 Analytics landscape 14 Algorithmic purpose 15 Regression 15 Classification 15 Anomaly detection 15 Clustering 15 Reinforcement learning 15 Dimensional reduction 15 Software landscape 16 4 The data science pipeline 17 The basic pipeline 18 Feature engineering 18 5 Data science in product business 19 Products and digital twin 20 Conclusion 20 6 Bibliography, online resources and references 21 Learning data science 22 Table of contents
and recently the rise of stunningly cheap devices able to connect to the internet and provide steadily increasing amounts of data about their environment and their own status (i.e. IoT, the internet of things). These factors have contributed to the recent notion of data science as an independent scientific endeavour apart from the classical statistics curriculum, and more specifically to the notion of the data scientist as a job title for analysts who combine the necessary skills in programming, statistics and algorithms to navigate and exploit the new opportunities.

What sets the hype cycle surrounding analytics apart from many others in the IT industry is that it is not coming our way as an isolated topic, unlike hype cycles such as XML, ESB, SOAP or Web 2.0, which we have watched pop up and fade out over the years. Instead our topic is driven by, and sits at the convergence point of, the four megatrends introduced above.

Figure 1: Current megatrends, their mutual dependencies and their influence on analytics.
A historical perspective

Analytics trends over the past 200 years (abbreviated)

To put things in perspective, let us briefly review how we got here and how the ancient field of statistics rose from a 100-year sleep (at least with regard to the public attention it has received; the research community would not agree with that characterization). Analytics, which up to 1990 was considered the realm of applied statistics, is an old topic going back more than 200 years. Its birth date may be set when Gauss applied the method of least squares to calculate the orbits of minor planets by fitting orbit parameters to observations. The prolific Gauss, who usually did not bother to publish "minor" work, considered it important enough to claim priority of discovery after Legendre published the method independently in 1805.1 Not surprisingly, given its usefulness, the method is still one of the most heavily used tools for fitting data to an empirical model.

In modern terms it may be described as consisting of an empirical model, e.g. a linear model like

$y = a + b x + \varepsilon$

where x is the input (in various contexts also given fancier names such as predictor, regressor, independent variable or controlled variable), y is the output (response, regressand, outcome etc.) and $\varepsilon$ is an error term due to measurement inaccuracies or noise. The parameters a and b of the model are initially unknown, and our goal is to determine them as well as we can from many observed pairs of input x and output y. The method of least squares gives an explicit formula for this, taking the observations $(x_i, y_i)$ as input and delivering an estimate for the parameters; e.g. the estimate for b is given by

$\hat{b} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

In addition, according to the statistical model for the error term, accuracy guarantees can be computed in a probabilistic manner. The statistician's statement may be: "The parameter b is contained in the interval $(\hat{b} - c,\ \hat{b} + c)$ with probability 95%." (Of course the actual stated probability depends on the width c of the interval: it is more likely that the "real" parameter b is contained in an interval with wide error margins, i.e. large c, than in a small one.)

For more than a hundred years these ingredients were the hallmarks of statistical science:

■ Data generated from experiment
■ A method for producing an estimate in explicit form
■ A (probability) model for the data
■ Accuracy guarantees justifying the use of the method

The last step is usually called inference. Without it our estimates from data lose much of their appeal, especially in the physical sciences, since we cannot be sure whether their use is sensible at all: the fancy algorithm we have used might just have produced a number no better than guessing.

Figure 2: The modern field of data science at the intersection of statistics, machine learning and computer science.
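The least-squares estimate above is straightforward to reproduce in code. The following is a minimal sketch in Python; the synthetic data and the "true" parameter values are illustrative assumptions, not taken from the paper:

```python
# Minimal least-squares sketch for the linear model y = a + b*x + eps.
# The "true" parameters and the noise level below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)   # synthetic observations (x_i, y_i)

# Closed-form least-squares estimates for the slope b and the intercept a
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

With 100 noisy observations the estimates typically land close to the chosen values a = 2.0 and b = 0.5.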
Ancient times

During the first 150 years of the development of statistical science sketched above, from Gauss up to the 1900s, the methods were, as one would expect, ad hoc and application specific. Gauss, Laplace and Pascal contributed a good deal of ideas revolving around concrete problems, with the notable exception of Bayes, who formulated the celebrated Bayes' theorem2, which lay dormant for a long time, only to make a vigorous return once modern computational power finally became available to unlock its many useful applications in modern statistical analysis.

Consolidation and axiomatization

The first half of the twentieth century saw the mathematization of the discipline via the axiomatic formulation by Kolmogorov, which provides the solid framework to this day. The great statisticians Fisher and Pearson contributed now-standard ideas and many of the standard statistical methods in use today. The body of this work up to 1950 is basically what can be found in textbooks today: hypothesis testing, confidence intervals, p-values, χ²-testing etc. It cannot be denied that the rigorous mathematization gave the discipline a slightly bureaucratic and boring flavour, probably stifling the playful advancement of the following period. All these methods operated on small data sets and were optimized for computing with pencil and paper.

Computer time

The second half of the twentieth century saw the rise of computer programs. Whereas the goal of former times was to avoid computations in a field where gathering and evaluating data was a cumbersome task, the advent of computers changed this attitude, albeit slowly. The breakthrough was probably the invention of the bootstrap method. The idea sounds simple and intuitive looking back now, but it certainly sounded crazy then: instead of throwing heavy mathematics at each special case, the bootstrap method simply resamples from the existing dataset to arrive at probabilistic estimates of parameters. Of course this adds considerable computational overhead to the procedure. But the big advantage is that the procedure is simple, automatic and universal in the sense that it is immediately applicable to arbitrarily complicated curves, not only to the linear regression for which the classical formula provided the analytical solution. Arguably, simple as it is, this shift to unabashed use of modern computer power ushered in the era of computers and software in the statistical sciences, ultimately resulting in software packages such as SAS or R and in a more playful and experimental approach to statistics.

Figure 3: A timeline of computational statistics since its inception, with a rough measure of the computer power needed for the methods, in MIPS (million instructions per second).
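To make the bootstrap idea concrete, here is a minimal sketch: resample the observation pairs with replacement, refit each resample, and read off an empirical interval for the slope. The data set and the number of resamples are illustrative assumptions:

```python
# Bootstrap sketch: resample (x, y) pairs with replacement and refit the slope,
# giving an empirical distribution (and interval) for the estimate without any
# case-specific distributional mathematics. Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

def slope(xs, ys):
    return np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)

boot_slopes = []
for _ in range(2000):
    idx = rng.integers(0, x.size, size=x.size)       # resample indices with replacement
    boot_slopes.append(slope(x[idx], y[idx]))

low, high = np.percentile(boot_slopes, [2.5, 97.5])  # rough 95% interval for the slope
print(f"bootstrap 95% interval for b: ({low:.3f}, {high:.3f})")
```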
Another point of view, coming into its full right only with the advent of powerful computers, is Bayesian statistics, as mentioned above. Whereas its origins lie with Bayes' theorem, the full force of its application is hard to see from the original formulation. The gist of the Bayesian method is to regard everything as a probability statement. A simple example is estimating the probability p of a coin coming up heads for a coin suspected of being unfair (i.e. p ≠ 0.5: the coin comes up heads more often than tails, or vice versa). In the classical line of thought one would conduct a hypothesis test with the null hypothesis that the coin is fair, and accept or reject it depending on the outcome of an extensive coin-flipping experiment. In the Bayesian view the "real" value of p is itself a random variable, and if you flip the coin, say, a hundred times and it comes up heads 65 times, you are interested in the distribution of p given the observed number of heads, i.e. P(p | k = 65) (read: probability of p given that the number of heads is 65). Of course you do not know p, but according to Bayes' theorem this can be deconstructed to

$P(p \mid k = 65) = \dfrac{P(k = 65 \mid p)\, P(p)}{P(k = 65)}$

Now the first term in the numerator is given by the binomial distribution and, up to a factor not depending on p, is

$P(k = 65 \mid p) \propto p^{65} (1 - p)^{35}$

If we factor in a probability distribution for P(p) (called the prior in Bayesian statistics), we have a functional form for the desired distribution P(p | k = 65) (the posterior in Bayesian parlance). The "real" value of p can be estimated from this by, e.g., the maximum or the expectation value. The real value of the method shows when further experiments are conducted: we can use the distribution of p calculated so far as the new prior P(p) and compute further refinements as we go. Note that these tricks are only possible because the parameter p, which is an unknown but fixed quantity in classical statistics, is regarded as a random variable itself. This innocuous method has revolutionized the way statistics is done, especially in the field of parameter estimation. Usually there are many continuously distributed parameters, leading to multidimensional integrals, and therefore Bayesian statistics is practically tractable only with considerable use of computer power.

A further element necessitating a change of view, or an enlargement of the toolset, was modern science in the form of the human genome project, which led to ever larger arrays of genome sequences (microarrays) on which simultaneous testing of, e.g., gene expression levels was conducted. Analysis and inference on the resulting data was impossible without computers and led to modern methods such as HMM (Hidden Markov Models), MCMC (Markov Chain Monte Carlo) and more, likewise impossible to apply without modern computer power. In summary, the field of statistics evolved slowly but steadily from a theoretically driven field into a very successful practical endeavour heavily dominated by computational tools.

Figure 4: Bootstrap example; sample points generated from f(x) = 5x − 3x² + 0.3x³ with added Gaussian noise, the fitted curves obtained by resampling from the sample points.
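Returning to the coin example, the posterior can be approximated on a grid in a few lines. This is a minimal sketch with a uniform prior; the grid resolution and the choice of prior are assumptions made only for illustration:

```python
# Grid approximation of the Bayesian coin example: posterior over p after
# observing 65 heads in 100 flips, starting from a uniform prior.
import numpy as np

p_grid = np.linspace(0, 1, 1001)
prior = np.ones_like(p_grid)                    # uniform prior P(p), an assumption
likelihood = p_grid**65 * (1 - p_grid)**35      # P(k=65 | p) up to a constant factor
posterior = likelihood * prior
posterior /= posterior.sum()                    # normalise on the discrete grid

p_map = p_grid[np.argmax(posterior)]            # posterior maximum (MAP estimate)
p_mean = (p_grid * posterior).sum()             # posterior expectation value
print(f"MAP estimate: {p_map:.3f}, posterior mean: {p_mean:.3f}")
# The posterior can now serve as the prior for the next batch of coin flips,
# which is the sequential updating described in the text.
```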
AI

In contrast to the steady evolution of statistics, AI has had a comparative roller-coaster history with many ups and downs. Artificial intelligence came into being with the advent of the first computers and, as a result, had much closer ties to computer science and algorithmic experiments than to statistics and probability. Interestingly, the idea of neural networks was formulated3 even before the first computers were available, and a viable training algorithm (backpropagation) was proposed in the early days of computer usage.4 But the history of AI is less glorious when it comes to delivering on its promises. This was largely due to much higher expectations, which simply could not be met with the computer power available before the millennium. The idea of neural network computing therefore had to wait 30 years to be put into practice for lack of computational power. During the intervening years AI was a science characterized by big hopes and big setbacks, leading to several (failed) hype cycles and coining the term AI winter (cf. nuclear winter)5, i.e. phases during which funding was drastically cut back and the reputation of the field badly damaged. But since the millennium the field has made a comeback, especially in the form of machine learning (see Modern times below).

Figure 5: A timeline of AI and machine learning with its various failures and the computational power needed to use the various methods. Computer power shown in red means the method would have needed this much power, but it was not available at the time.

Stochastic control

In the late 50s and early 60s Kalman and Bucy6 formulated a framework based on SDEs (stochastic differential equations) to accurately estimate the state of a dynamical system whose parameters can only be measured with considerable noise. Their formulation was a breakthrough which found immediate application in the Apollo space program for tracking the position of the space capsule. Nowadays it is a ubiquitous tool for tracking the positions of spacecraft, planes, trucks or drones and steering them in a reliable way.

Financial mathematics

The framework of SDEs was brilliantly applied in the field of finance, which underwent a revolution in the 1970s when Fischer Black and Myron Scholes solved the problem of pricing European-style options by assuming that the underlying asset, in the form of a stock, is governed by Brownian motion and by postulating an efficient market free of arbitrage. The resulting PDE for the option price V(t) (see reference 7) as a function of time,

$0 = \dfrac{\partial V}{\partial t} + \tfrac{1}{2}\sigma^2 S^2 \dfrac{\partial^2 V}{\partial S^2} + r S \dfrac{\partial V}{\partial S} - r V$

though certainly understandable only with considerable mathematical expertise, and solvable only with heavy use of computer power, was soon applied everywhere and the options market underwent an explosion. The explosion and complete revolution of the financial market, and its ensuing meltdown in 2008, for which the extension of these models and their unlimited use without proper risk management is at least partially responsible, demonstrate the power (and the danger) of (blindly) applying algorithms to practical problems.
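For the plain European call, the Black-Scholes PDE above has a well-known closed-form solution which can be evaluated directly. The following sketch uses the standard call-price formula with purely illustrative parameter values:

```python
# Closed-form Black-Scholes price of a European call option.
# S: spot price, K: strike, T: time to maturity in years, r: interest rate, sigma: volatility.
from math import exp, log, sqrt
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

# Parameter values are illustrative assumptions only.
print(f"call price: {bs_call(S=100.0, K=105.0, T=1.0, r=0.01, sigma=0.2):.2f}")
```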
This digression into a seemingly unconnected sideline of statistics is quite informative as a template for what is to come in fields that are just entering a phase of mathematization and algorithmic innovation.

Figure 6: Daily average option volume in millions of trades since 1973 (see reference 8).

Modern times

The years 2000 to 2010 brought a renewed interest in AI, breakthroughs in neural networks and an explosion in large-scale computing available to the public in the form of cheap GPUs and cloud computing. Especially from 2010 onwards a shift to algorithmic computing and a playful expansion of its scope into various niche domains can be seen, with image classification and speech recognition probably the most prominent.

Neural networks

The current hype surrounding neural networks, probably the most visible part of the movement towards automatic prediction, can be traced back to some iconic problems where their application has brought spectacular improvements. One of these problems is image classification. By 2010 a large body of images had been gathered and made freely available for research by professors at Stanford University. Since 2010 the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)9 has been organized. One of the tasks is classifying an image according to given categories, e.g. dogs, cats etc. Whereas during the first two years "classical" algorithms like SVM (Support Vector Machines) and Bayesian methods dominated the competition and were a mixed success regarding the ability to classify images correctly, in 2012 a participating team applied neural network algorithms and dramatically lowered the error rate, i.e. the share of misclassified pictures. Over the following years a combination of ever more refined neural networks and massive amounts of computing power in the form of GPUs led to spectacular improvements, to the point that the error rate is now comparable to or even better than human error rates.10 Since 2014 more than 90% of all entries in the contest use massive amounts of computer power in the form of GPUs, some teams being sponsored by NVIDIA.

Change of paradigm

It should be noted that the process of generating insight from data via neural networks, or more generally via machine learning, follows a somewhat different path than the one established by the classical statistical sciences:

■ Data generated or collected "somehow" (often generated for the express purpose of analysis, e.g. a language corpus or the pictures in image classification contests)
■ An algorithm to generate insight, e.g. by classification
■ Measuring the score of the algorithm as a measure of success

Steps two and three often go in cycles to tune the bells and whistles of the algorithm. Note the absence of any probability model and, in consequence, the lack of any accuracy guarantees: the score may become very good by tuning the parameters, but we do not really know what this means for the robustness of the algorithm when new data comes in.

A second characteristic, notable especially with the use of neural networks, is best explained by comparison with the expert systems en vogue primarily in the 70s and early 80s of the last millennium (old-style AI). While expert systems were rule based, meaning that at least in principle their mode of operation is understandable (although in practice, due to the number of rules in working systems, it is not), neural networks pose a new challenge because they function as a perfect black box. This shift is especially problematic when something unforeseen occurs. Consider, for example, an autonomous car running on a neural network crashing into something. If this were due to a rule-based expert system, one would simply follow the path of rules through the decision tree taken by the system and add or change a rule to ameliorate the behavior. With a neural network the only answer is: add a few thousand crashes of this type to the data and retrain the network to avoid them.

Due to the success of machine learning in various fields, combined with its promise of delivering immediate business value, there is an increasing willingness to trust the computer beyond any reason. Signs of what can go wrong surfaced in the neural network world in 2014, when Goodfellow et al.12 showed that imperceptible changes to images can fool a neural network image classifier into misclassifying images in an arbitrary way. Follow-up work has shown that this is not restricted to the niche of image classifiers.13 These examples show that neural networks, after a lot of training, may perform better than humans on a corpus of sample images but can go wrong in subtle ways not yet fully understood and, perhaps even more disturbing, may be fooled at will by determined attackers.

Figure 7: ImageNet contest, classification error rates by year, showing the sharp drop after the switch from SVM-based methods to neural networks.11
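As a concrete illustration of this loop (data in, fit an algorithm, measure a score on held-back data), here is a minimal sketch using scikit-learn's bundled digits data set; the choice of classifier and the split ratio are illustrative assumptions, not prescriptions from the paper:

```python
# Minimal sketch of the machine-learning loop: collect data, fit an algorithm,
# measure a score on held-back data. Classifier choice and split are illustrative.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)   # the "algorithm" step
clf.fit(X_train, y_train)
score = accuracy_score(y_test, clf.predict(X_test))              # the "score" step
print(f"accuracy on held-back data: {score:.3f}")
# Note: a high score is not an accuracy guarantee in the classical probabilistic sense;
# it says nothing formal about robustness once genuinely new data arrives.
```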
Prediction, prediction, prediction

The common denominator of this long and varied history of statistics and machine learning in the broadest sense, and of its various developments and methods, is the goal to peek into the unknown: prediction. Prediction can take various forms depending on the domain:

■ prediction of an unknown value to be discovered or at least pinpointed later (old-style statistics, e.g. determining the fraction of faulty products coming from a production line)
■ prediction of what an object is (classification, e.g. tagging an image by processing it with a neural network)
■ prediction of the future (e.g. option pricing, predicting the price of an asset in the near future)

It is to be expected that the various fields will cross-fertilize each other and that each method will expand far beyond its original field.

Summary

As detailed above, data science today sits at the convergence point of several subdisciplines of statistics and artificial intelligence. At this unique point in history several trends converge: for the first time, the computer power needed to solve practical machine learning problems and large-scale statistical problems is commonly and cheaply available (at least compared with the earlier use of specialized supercomputers), in the form of cheap specialized hardware (graphics processing units) or as a service (cloud computing), enabling large-scale experimentation and modelling of big data sets for everyone. The old ideas in machine learning have been refined and researched to the point that they solve rather complex and practical problems like image processing and natural language processing in a highly efficient and quite usable manner, delivering immediate business value.

The pendulum has swung back from academic mathematization to free experimentation with algorithmic approaches to data. The emergence of social networks and big shopping platforms has provided data volumes in quantities never seen before. The advent of the internet of things will further explode the data volume in need of analysis, offering huge business opportunities for new digital businesses.

It is not hard to see that these developments will accelerate from here on. We will see an ever greater reliance on powerful computers tackling an increasing breadth of problems. Especially the field of artificial intelligence, in the guise of modern machine learning, is now able to fulfill its promises for the first time in its roller-coaster history. Thus we stand at a historical point in time where the computational challenges and the algorithmic ideas meet, for the first time, adequate computer resources available to everyone. Innovation and ideas in algorithmic computing are no longer limited by computer power. Perhaps the history of the financial industry since the 70s can serve as a lesson in how an industry evolves when modeling and mathematization of business processes take place on a large scale. There is no holding back from this point: machine learning will explode in the years to come.

Figure 8: The pictures in the left column of both panels are correctly classified by a neural network (e.g. the yellow bus as "bus", the wildcat as "cheetah", the white dog as "dog" etc.). The pictures in the right column are uniformly classified as "ostrich"; the difference between left and right consists of some pixel values depicted in the middle. Humans cannot see a difference between the left and right columns, showcasing that the neural network "thinks" quite differently than humans do and may be made to think in arbitrary ways with knowledge of the inner workings of the network.
Analytics landscape

After assessing the history and importance of analytics, let us map out the analytics landscape by various differentiators to give an impression of the breadth and purpose of today's analytic toolset. Depending on method or purpose, the landscape may be broken up by the classification of algorithms or by the intended purpose of algorithms. On the other hand, as always in the software industry, there are plenty of software frameworks available for use with various infrastructures.

Algorithmic purpose

The easiest differentiator is algorithmic purpose. Here we can classify algorithms by what we want to achieve. The categories are broadly as follows:

■ Regression
■ Classification
■ Anomaly detection
■ Clustering
■ Reinforcement learning
■ Dimensional reduction

Regression

With regression techniques one attempts to estimate the relationship between variables for the purpose of prediction or forecasting. Examples are standard curves for vital variables in medicine used to predict health status, or examining a timeline with the purpose of predicting a future value.

Classification

Classification aims to categorize objects with given features. These may be pictures with various motifs which the computer has to tag with labels, or customers with certain properties who are categorized according to potential business value.

Anomaly detection

In anomaly detection we try to detect outliers in data that do not conform to an expected pattern or behaviour. This may be seen as a subfield of classification, but it has its own methods that set it largely apart from classification tasks.

Clustering

Clustering algorithms partition data points into clusters with similar features. This is akin to classification, the key difference being that the labels are missing and the algorithm has to come up with the clusters by itself. Imagine describing a group of people in a social network by many features (number of posts, age, number of friends, content of posts, gender etc.) and handing that data to a machine learning algorithm which should come up with groups of "similar" people without being told what "similar" means.

Reinforcement learning

Reinforcement learning is mainly applied in robot control, machine movement problems and game-theoretical problems where the computer should learn how best to behave in an initially unknown environment by being given certain rewards when it achieves various goals.

Dimensional reduction

These are methods often used in a preliminary data preprocessing step to reduce huge feature sets to a manageable subset, either by selecting the most important features or by constructing combined features that capture the essential characteristics without having to use the whole feature set.

Depending on the task at hand, one or more methods from these categories have to be picked; but naturally occurring data is typically messy, and a mixture of methods usually has to be employed to be successful.
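To make the clustering category concrete, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the data set and the chosen number of clusters are illustrative assumptions:

```python
# Minimal clustering sketch: KMeans groups unlabeled points by similarity.
# Synthetic data for illustration; the labels returned by make_blobs are ignored,
# since clustering by definition works without them.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])        # cluster assignments the algorithm came up with itself
print(km.cluster_centers_)    # learned cluster centres
```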
Software landscape

As of 2017 there are already many players in the field competing for market share. The existing software may be roughly categorized by the specificity of the task it tries to solve. At one end of the spectrum there are solutions for very specific industries like retail or finance; at the other end we have broad frameworks for data science bundling the main algorithms for ease of use by a data scientist.

It is worth noting that most basic data science relies on open source software. Especially important here are basic statistical libraries, where two languages dominate: R14 and Python. R is a statistical language with an emphasis on classical statistics, but it has also been enriched with libraries for machine learning. Python has evolved into the main language data scientists use to develop and analyze models, especially for machine learning. Noteworthy are the data science libraries pandas15 for statistical analysis and scikit-learn16 for machine learning. All relevant libraries for Python, including interfaces to R, are bundled in the Anaconda17 framework, which is a complete solution for doing data science.

Figure 9: Chart of available frameworks and tools in the field of machine learning. The chart appeared in the O'Reilly article "The current state of machine intelligence", which is updated yearly; this is the third installment in the series.
The data science pipeline

The basic pipeline

Data science consists foremost of cleaning and analyzing often messy data. The first step is cleaning the data, which includes getting and importing the data in the first place, an often laborious task that has earned the apt name wrangling in the community. After some preliminary cleaning, the second step is modeling. This comprises exploration, preprocessing and the actual modeling, and it goes in cycles until a satisfactory model has been found that describes the data well enough. After that it is essential to validate the model, for various reasons. One problem is overfitting the data with the model, meaning that the model describes the data we have very well but will behave very badly on future data. Usually this quality assurance is done by holding back some of the data as a test data set to be used only in the validation step. Finally we arrive at a result from which business value may be generated by automating a process or by deploying the model on a machine and using it to drive some process.

Feature engineering

Much of the modelling phase, especially in machine learning, consists of preparing the data in a form ingestible by the usual algorithms, which expect their data as numerical features. For example, if you want to feed a timeline of sensor data into an anomaly detection algorithm, it does not make sense to provide single data points as input. Instead you may first define time windows on the time series and, within each time window, extract features like mean, standard deviation, minimum, maximum etc. In this way the machine may be able to figure out which of the time windows shows anomalous behaviour. This sort of work has been aptly named feature engineering, and the skillful application of features to data determines to a great extent success or failure in the field of machine learning.18

Figure 10: The data science pipeline (wrangle and clean the data; explore, preprocess and model; validate; turn the validated results into actionable insight via automation and deployment).
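A minimal sketch of the time-window feature engineering described above, using pandas; the sensor series, the window length and the injected anomaly are illustrative assumptions, not data from the paper:

```python
# Feature engineering sketch: turn a raw sensor time series into one feature
# vector (mean, std, min, max) per 30-minute window. All data is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.Series(rng.normal(0.0, 1.0, 1000),
               index=pd.date_range("2017-01-01", periods=1000, freq="min"))
ts.iloc[500:520] += 6.0                                  # inject an anomalous stretch

features = ts.resample("30min").agg(["mean", "std", "min", "max"])
print(features.head())
# Each row now describes one time window; an anomaly detector fed with these rows
# can flag the window containing the injected spike far more easily than it could
# from single raw data points.
```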
Data science in product businesses
Products and digital twin

We have given an overview of what the analytics landscape consists of today, which tools are available and how to go about actually putting them to work. Product development and product services are an especially interesting field for future data science expansion. Users, sensors and tools agglomerate tons of data, and with the advent of IoT the tasks of collecting, structuring and analyzing that data, reporting insights and executing measures based on these insights will become even more important. The main tool to connect data in the field to the abstract product is the digital twin, which is a proxy for each physical instance of a product in the field, together with a connection to the virtual model used in product development.

CONTACT offers tested and widely used open source libraries for data science through integration with the Anaconda framework. Also, the data model is enhanced by all necessary tools, notably the digital twin, to function as the coordinating hub for product data science. For data connection, cloud connectors are available and allow seamless integration with field devices directly or, for big data scenarios, via cloud data paths through cloud providers like AWS or Azure.

Conclusion

We have given some rough sketches of what the analytics landscape consists of today, what tools are available and how to go about actually putting the available tools to work. Further suggestions can be found in the bibliography and the list of online resources.

Figure 11: CONTACT stack for a complete data science and IoT architecture.
Bibliography, online resources and references

Some suggestions for further reading and hints for digging deeper into the available literature. Of course the opinions expressed here reflect the biases of the author.

Efron, B., Hastie, T. (2016). Computer Age Statistical Inference. Cambridge: Cambridge University Press.
A scholarly but enjoyable text with a historical account of statistics, from which the author took most historical references. Knowledge of statistical methods is assumed if you want to follow the text, but the historical account may be read without paying too much attention to the math.

Evans, L. C. (2013). An Introduction to Stochastic Differential Equations. American Mathematical Society.
A little-known gem introducing stochastic differential equations in under 150 pages. If you want to learn something about SDEs and have a little training in probability theory, this is a pleasure to read. (If you are new to the entire field: SDEs are used in financial math; they are not needed for neural networks.)

Goodfellow, I. (2016). Deep Learning.
MIT Press.
A fine book mainly mapping out the theory of neural networks. It is quite self-contained, requiring few prerequisites, and gives you a primer in linear algebra and probability theory before diving into machine learning. Combined with a practical book like Raschka you are quickly up to speed.

Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Heidelberg: Spektrum Akademischer Verlag.
A good overview of classical statistics with a modern, concise presentation of Bayesian and likelihood methods. The author summarizes the main points without getting lost in mathematical detail. An English update under the title "Applied Statistical Inference" (2013) is also available from the same author.

James, G., Witten, D., Hastie, T., Tibshirani, R. (2017). An Introduction to Statistical Learning. New York: Springer.
A standard reference for the field of machine learning. Appropriate and enjoyable for beginners. An accompanying online course at Stanford University can be attended (for free). Strongly recommended.

Raschka, S. (2016). Python Machine Learning. Birmingham: Packt Publishing Ltd.
A practical book, especially for Python coders and the Anaconda framework, giving a hands-on learning experience. Combined with Goodfellow this may be a good start into the field of machine learning.

Campolieti, G., Makarov, R. N. (2014). Financial Mathematics: A Comprehensive Treatment. Boca Raton: Chapman and Hall/CRC Financial Mathematics Series.
A truly comprehensive reference to modern financial mathematics if you are really interested. The advanced chapters on SDEs and option valuation are probably not understandable without a university degree in math, but it is one of the best references in the field.

Learning data science

There are a number of very good free online resources available:

■ edX, a platform for MOOCs (massive open online courses); search for data science courses.
■ Newsletters at O'Reilly: there are newsletters for data science and AI that one can subscribe to at http://www.oreilly.com/data/newsletter.html and http://www.oreilly.com/ai/newsletter.html, which feature links to interesting resources and stories.

For beginning data science, especially learning the R and Python frameworks, one may consult DataCamp, which specializes in data science courses. Subscription fees are low and some courses are free.

Software resources are freely available, mainly for Python and R:

■ The Anaconda framework from Continuum Analytics is a bundle which already contains the main machine learning libraries like scikit-learn, numpy and pandas.
■ R is freely available, with many tutorials available online.
References

1 https://en.wikipedia.org/wiki/Least_squares
2 https://en.wikipedia.org/wiki/Bayes%27_theorem
3 McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133.
4 https://en.wikipedia.org/wiki/Artificial_neural_network
5 https://en.wikipedia.org/wiki/AI_winter
6 https://en.wikipedia.org/wiki/Kalman_filter
7 The option price V(t) depends on the stock price S(t), its volatility σ and the interest rate r. The movement of the stock is governed by the stochastic differential equation $dS = r S\, dt + \sigma S\, dW$, where W is a so-called Wiener process modelling Brownian motion.
8 Source: OCC historical volumes of trade, https://www.theocc.com/webapps/historical-volume-query
9 http://www.image-net.org/about-stats
10 For an interesting overview of the challenge see https://arxiv.org/abs/1409.0575v3
11 Data taken from the ImageNet project site (http://image-net.org); see the official results published for each year. The displayed error rates are taken from the table "classification + localization ordered by classification" and have been rounded to the nearest integer percent. The labeling of the algorithmic switch from SVM to NN is an oversimplification by the author: the actual algorithms used in the contest are complicated mixes of various methods, see the project site for details.
12 Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, "Explaining and Harnessing Adversarial Examples", http://arxiv.org/abs/1312.6199
13 Kathrin Grosse et al., "Adversarial Perturbations Against Deep Neural Networks for Malware Classification", https://arxiv.org/abs/1606.04435
14 See the R project site https://www.r-project.org/ for more information and downloads.
15 http://pandas.pydata.org/
16 http://scikit-learn.org/stable/
17 https://www.continuum.io/
18 See for example: Alice Zheng, Mastering Feature Engineering: Principles and Techniques for Data Scientists, O'Reilly, 2017.