Commercial Analytics for the Python Modeler
Leveraging the IMSL® C Numerical
Library
Technical Note PY1122 by Visual Numerics, Inc.
February 2009
Visual Numerics, Inc.
2500 Wilcrest Drive, Suite 200
Houston, TX 77042
USA
www.vni.com
TABLE OF CONTENTS
Abstract
Commercial Analytics
  IMSL Numerical Libraries
A Motivating Example
  Beginning the Forecasting
  Choosing a Forecasting Method
Moving to Production
References
About the Author
Abstract
Data modelers are increasingly finding that the Python language is incredibly rich and powerful
for rapid data exploration and prototypical application development. However, the numerous
constraints that often exist in production‐quality application development (e.g. integrating with
existing software and hardware; IT requirements that software tools be documented and
supported; computational speed and scalability requirements; consistent error handling in
numerical analyses) sometimes necessitate that Python code be translated to a language like C or
Java. In such cases, porting prototypical modeling work and results to a production‐quality
application can be painful, as there is rarely a one‐to‐one translation of functionality from the
prototyping environment to the production environment. The following discussion explores the
benefits of using Python with commercial analytics: the strengths of Python are best leveraged
when the analytics used in modeling are not only robust, documented and supported, but also
are easily transferable to the production application development process.
Commercial Analytics
Commercial analytics software, such as Visual Numerics' IMSL Numerical Libraries, empowers
software developers to add analytical capabilities to their applications while avoiding both
re‐inventing the wheel by writing complex math or statistical routines and introducing risk by
using open source routines which may not be sufficiently tested or documented.
Commercial analytics also leverage extensive numerical algorithm development expertise. For
example, Visual Numerics focuses on the development, sales and maintenance of analytical
algorithms, supporting a wide range of languages and platforms, and has decades of experience
working with thousands of customers. Thus, a commercial option like the IMSL Libraries
provides a wide breadth of available algorithms, significant depth of capabilities, extensive
internal testing, and market‐seasoned support by the vendor, all informed by a large and
diverse user community.
IMSL Numerical Libraries
The IMSL Numerical Libraries are implemented natively in the widely used computer
programming languages of C, Java, C#/.NET, and Fortran. Software developers embed
algorithms from these libraries into their software applications using their preferred
programming language. The first IMSL Library for the Fortran language was released in 1971,
followed by a C language version originally called C/Base in 1991, a Java language version in
2002, a C# language version in 2004, and a Python language version in 2008. Over time, the
libraries have grown in supported languages, supported platforms, and in depth and breadth of
available math and statistics algorithms.
The table below outlines general math and statistics areas found in the IMSL Numerical
Libraries. These areas illustrate a broad and full‐featured spectrum of critical analytic
capabilities that can be used to build advanced numerical software applications in many fields.
Table 1: IMSL Numerical Libraries Math & Statistics Areas
Mathematics                      Statistics
Matrix Operations                Basic Statistics
Linear Algebra                   Time Series & Forecasting
Eigensystems                     Nonparametric Tests
Interpolation & Approximation    Correlation & Covariance
Numerical Quadrature             Data Mining
Differential Equations           Regression
Nonlinear Equations              Analysis of Variance
Optimization                     Transforms
Special Functions                Goodness of Fit
Finance & Bond Calculations      Distribution Functions
                                 Random Number Generation
                                 Neural Networks
The IMSL C Numerical Library contains more than 450 mathematical and statistical routines, all
of which are fully documented, tested and supported. The PyIMSL™ Studio product wraps this
functionality in the Python language, includes additional routines for data access and
preprocessing, and packages key Python development tools such as the Eclipse IDE and the
highly interactive IPython shell. PyIMSL Studio is the first and only commercially‐available
numerical analysis application development environment designed for deploying mathematical
and statistical prototype models into production applications.
A Motivating Example
Accurately forecasting sales, supply, and demand is a common yet critical task faced by today's
businesses. While there are numerous methods for generating forecasts, most use historical
time series data as the basis for estimating future values of the variables of interest. A
complicating aspect of forecasting is that numerous series are often collected across a variety of
variables, and it may be impossible for a practitioner to continually assess and model all of the
data. Accordingly, automated forecasting systems must be developed. The following discussion
considers two powerful time series forecasting methods—Neural Networks and Autoregressive
Integrated Moving Average (ARIMA) models—set within a context that customers are
increasingly sharing with Visual Numerics: production quality analysis tools must be developed
in C but languages like Python are used to more rapidly explore data and develop prototypes.
The following example uses data from the 2008 Time Series Forecasting Competition for
Computational Intelligence, also known as the NN5 competition, organized by Dr. Sven F. Crone
from the Lancaster Centre for Forecasting [1]. The data are time series from 111 different
Automatic Teller Machines (ATMs) across England, with all series containing daily cash
withdrawal amounts, starting on March 18, 1996 and ending on March 22, 1998 (735 values).
The task is to predict the next 56 days (eight weeks) of ATM activity, for which the Competition
also provides the actual withdrawal values for comparison. The business impact of forecast
accuracy is noted in a description of the Competition’s motivation [2]:
“If the forecasts are flawed, they induce costs: if the forecast is too high unused money
is stored in the ATM incurring costs to the institution; if the ATM runs out of cash, profit
is lost and customers are dissatisfied”.
The data and competition instructions can be freely accessed from
http://www.neural-forecasting-competition.com/datasets.htm.
Data from 11 of the 111 total series are chosen as a subset on which to explore the
performance of a model‐based forecasting method and a nonparametric one (ARIMA and
Neural Networks, respectively). Python and PyIMSL Studio are then used to quickly apply the
methods to the data and assess the forecast performance. Once a superior method is identified
from the test data, production quality code can be confidently written, as the same key
numerical algorithms used by the Python modeler during exploration are available to the C
developer during application development.
Beginning the Forecasting
The first step is to read the data and replace missing values with reasonable estimates (known
as imputation). This is accomplished with the PyIMSL Studio modules
imsl.studio.data.asciiRead and imsl.studio.data.impute, respectively. With the
data read in, the next step is to visualize the data and determine how to set up the parameters
for the neural network and ARIMA configurations. Two of the 11 series are plotted below:
[1] http://www.lums.lancs.ac.uk/forecasting
[2] http://www.neural-forecasting-competition.com/motivation.htm
Figure 1: Plots of time series of ATM withdrawals for series NN5‐101 and NN5‐110
These two series each display cyclic behavior, but with different periods. Series NN5‐101
appears to have a slight, increasing trend, while series NN5‐110 does not appear to have any
trend. We can further assess these series by looking at a correlogram, which plots the data's
autocorrelations and partial autocorrelations (or, how values in the time series depend on past
values). To generate the data for the correlograms we can call
imsl.stat.autoCorrelation and imsl.stat.partialAutocorrelation. The
correlogram simply plots these results.
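To make the idea of an autocorrelation concrete, the sketch below computes sample autocorrelations in plain Python. It is not the PyIMSL implementation; the function name autocorrelation, the toy series, and the bound 1.96/sqrt(n) (a common large‐sample approximation for the "significance" lines on a correlogram) are assumptions chosen for illustration.

```python
def autocorrelation(series, max_lag):
    """Return the sample autocorrelations r_0 .. r_max_lag of a numeric sequence."""
    n = len(series)
    mean = sum(series) / float(n)
    deviations = [x - mean for x in series]
    variance_sum = sum(d * d for d in deviations)
    acf = []
    for lag in range(max_lag + 1):
        # Sum of products of deviations that are `lag` steps apart
        cov_sum = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
        acf.append(cov_sum / variance_sum)
    return acf


# Hypothetical toy series with an exact 7-day cycle (not NN5 data)
weekly = [10, 12, 11, 13, 20, 30, 25] * 8
acf = autocorrelation(weekly, 14)

# Approximate bound beyond which a lag is usually flagged as "significant"
significance_bound = 1.96 / len(weekly) ** 0.5
significant_lags = [lag for lag in range(1, 15) if abs(acf[lag]) > significance_bound]
```

For a series with a strong weekly cycle, lags 7 and 14 stand out in significant_lags, which is the kind of pattern a correlogram makes visible.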
Figure 2: Correlograms for series NN5‐101 and NN5‐110
As suggested by the plots of the raw data, series NN5‐101 and NN5‐110 have very
different correlation structures. Skilled statisticians use correlograms to help inform decisions
about how to model time series (red bars extending beyond the dashed blue lines correspond
to lags that have a "significant" effect). Stepping back to recall our problem context, we
know that the ultimate goal is to create an automated forecasting tool, in which case viewing
plots of individual time series, and their correlograms, is not possible. For our exploration it
may be sufficient to simply use the context from which the data came to inform our
decisions about how to set up the modeling.
For the neural network modeling (via imsl.stat.mlff_network) we'll assume that much of
the variability in the ATM withdrawals data is due to weekly patterns (e.g., people withdrawing
more money on the weekends while shopping). This assumption is borne out in much of the
data. For example, when using the plot window's magnifying glass icon to zoom in on a three‐
week period in series NN5‐101 we see:
Figure 3: Zoomed in view of series NN5‐101
We can create training data for the neural network using
imsl.stat.time_series_filter, knowing that the assumption of a weekly pattern is not
optimal for modeling all 11 (or all 111) series but may be a good place to start testing the
performance of a general neural network forecasting system.
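As an illustration of what weekly‐pattern training data can look like, the sketch below builds input/target pairs in which each day's value is predicted from the previous seven days. This is one common arrangement of lagged inputs and is only assumed to resemble what the filtering step produces; the function name make_lagged_training_set and the toy data are hypothetical.

```python
def make_lagged_training_set(series, n_lags=7):
    """Pair each value with the n_lags values that precede it.

    Returns (inputs, targets): inputs[i] holds n_lags consecutive values and
    targets[i] is the value that immediately follows them.
    """
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(list(series[t - n_lags:t]))
        targets.append(series[t])
    return inputs, targets


# Hypothetical three weeks of daily values (not NN5 data)
daily = list(range(1, 22))
X, y = make_lagged_training_set(daily, n_lags=7)
```

With 21 observations and 7 lags there are 14 training patterns; the first pairs days 1 to 7 with day 8.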
The ARIMA framework of time series modeling allows for great flexibility in how the data are
modeled. ARIMA models can be implemented in PyIMSL Studio with imsl.stat.autoArima.
This powerful time series modeling routine automatically handles missing data, detects and
adjusts outliers, takes necessary preprocessing steps when data are nonstationary, and
generates forecasts. Using imsl.stat.autoArima, we can easily consider yearly patterns in
ATM withdrawals (e.g., more activity the week before Christmas every year) in addition to the
weekly‐dependence modeled by the neural network.
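One of the preprocessing steps mentioned above, differencing a nonstationary series, can be sketched in a few lines of plain Python. This is a minimal illustration and not the library's internal procedure; the function name difference and the toy series are assumptions.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: a repeating weekly pattern plus a linear trend
pattern = [10, 12, 11, 13, 20, 30, 25]
trended = [pattern[t % 7] + 0.5 * t for t in range(28)]

simple_diff = difference(trended, lag=1)   # removes the trend, keeps weekly swings
weekly_diff = difference(trended, lag=7)   # removes the weekly pattern as well
```

Here weekly differencing leaves a constant series (3.5 for every day), which shows why a lag of 7 is a natural candidate when the data have a weekly cycle.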
To assess the forecast performance of each method, we'll leave off the last 56 values of each of
the 11 series modeled by the neural network and ARIMA as a "holdout" set with which to
validate the 56‐day forecasts. To make the assessment quick and objective, we'll compute two
increasingly used measures of forecast accuracy:
• Symmetric Mean Absolute Percentage Error (SMAPE): As the name implies, this
measure is in units of percentage error. In practice, SMAPE values around 20% are
respectable. Deeper research and effort can often yield forecasts with single digit
SMAPEs, but there are no general thresholds of forecast quality, since issues such as
how the forecast will be used, the length of the forecast desired, and the nature of the
data lead to differing forecasting goals.
• Mean Absolute Scaled Error (MASE): This measure scales the out‐of‐sample forecast
error of the method in question (e.g. neural networks or ARIMA) by the in‐sample forecast
error of a naïve, random walk forecast (the one‐step‐ahead forecast value is
simply the previous value in the series). As with the SMAPE measure, there are no clear
thresholds defining "good" and "bad" forecasts. Rather, the measure provides another
way of assessing forecast performance across different methods.
An excellent discussion of these measures, as well as others, can be found in Hyndman and
Koehler, 2006.
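The two measures can be sketched in plain Python as follows. Several SMAPE variants are in use; the version here, which divides each absolute error by the average of the absolute actual and forecast values, is an assumption and may differ from the exact formula used for the plots in this note. The function names, the toy series, and the naive stand‐in forecast are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Treat a 0/0 term as zero error
        terms.append(0.0 if denom == 0 else abs(a - f) / denom)
    return 100.0 * sum(terms) / len(terms)


def mase(train, actual, forecast):
    """Out-of-sample mean absolute error scaled by the in-sample
    mean absolute error of a one-step random walk forecast."""
    naive_errors = [abs(train[t] - train[t - 1]) for t in range(1, len(train))]
    naive_mae = sum(naive_errors) / float(len(naive_errors))
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / float(len(actual))
    return forecast_mae / naive_mae


# Hypothetical series; the last two values play the role of the holdout set
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
holdout_size = 2
train, holdout = series[:-holdout_size], series[-holdout_size:]

# Stand-in forecast: repeat the last training value
forecast = [train[-1]] * holdout_size

smape_value = smape(holdout, forecast)
mase_value = mase(train, holdout, forecast)
```

A MASE below 1 means the forecast errors on the holdout set are smaller, on average, than the in‐sample errors of the random walk forecast.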
As a quick way of assessing forecast performance differences, we'll plot the SMAPE and MASE
values for each method and each time series against each other. In the present context, these
two measures only apply to out‐of‐sample forecast accuracy. We should also examine the
residuals from the ARIMA and neural network in‐sample fits to make sure that the results are
reasonable. However, the results of this exploration are not considered in this discussion; focus
is given to the SMAPE and MASE measures of forecast accuracy.
Choosing a Forecasting Method
In the SMAPE and MASE plots below, one line (green) exists for the forecasts from
imsl.stat.mlffNetworkForecast and two lines exist for the results from
imsl.stat.autoArima, with one line (red) corresponding to the forecast having adjusted for
any effects of outliers and the other line (blue) corresponding to not adjusting for these
effects.
Figure 4: Plots of SMAPE and MASE accuracy measures for three forecasting methods applied
to 11 time series
Plotting the SMAPE and MASE measures clearly yields different pictures of forecast
performance between the two methods. However, the general conclusions seem consistent:
the neural network yields a better 56‐day forecast for about half of the 11 series modeled. For
the other half, for which the ARIMA method yielded better forecasts, the ability to
adjust for outliers was critically important in one instance (series NN5‐103). Across all series,
there are several instances for which imsl.stat.autoArima's ability to adjust for outlier
effects yielded better forecasts than the ARIMA model that did not adjust for outliers.
Importantly, the neural network method generated forecasts about 10 times faster than the
ARIMA approach, although much can be done to confine the search space used by
imsl.stat.autoArima to increase its speed.
At this point we might feel that the neural network approach is the better method to proceed
with, but one key point should be considered: the model‐based approach of ARIMA
automatically yields confidence intervals for the forecasts, whereas the nonparametric neural
network approach does not (although there are simulation‐based methods for generating
confidence intervals for nonparametric forecasts). Depending upon what types of decisions will
be made from the forecasts, having these confidence intervals can be critically important.
Consider the forecasts for series NN5‐110, for which each method performed poorly (SMAPEs
of approximately 30%). The neural network forecast is displayed below:
Figure 5: Neural Network forecast for series NN5‐110
Without knowing what the actual data are for this 56‐day time frame, we have no easy way of
assessing the quality of the forecasted values. Now consider the same series forecasted by
imsl.stat.autoArima:
Figure 6: ARIMA forecast for series NN5‐110
Here, the 5‐95% confidence interval is shaded yellow, and captures much of the actual
observed data. We can use the width of this confidence band to inform how to use the
forecasts for decision making, and these bands are the automatic, theoretically‐grounded result
of using imsl.stat.autoArima.
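A band like the one described above can be derived from point forecasts and their standard errors. The sketch below assumes normally distributed forecast errors and reads "5‐95%" as the 5th and 95th percentiles; both are assumptions, as are the function name confidence_band and the toy numbers, and the routine's own computation may differ.

```python
from statistics import NormalDist


def confidence_band(forecasts, std_errors, lower=0.05, upper=0.95):
    """Return (low, high) bounds per forecast, assuming normal forecast errors."""
    z_low = NormalDist().inv_cdf(lower)    # about -1.645
    z_high = NormalDist().inv_cdf(upper)   # about +1.645
    return [(f + z_low * se, f + z_high * se)
            for f, se in zip(forecasts, std_errors)]


# Hypothetical forecasts, standard errors, and later-observed values
forecasts = [10.0, 12.0, 11.0]
std_errors = [2.0, 2.5, 3.0]
observed = [9.0, 17.0, 11.5]

band = confidence_band(forecasts, std_errors)
covered = sum(1 for (low, high), y in zip(band, observed) if low <= y <= high)
```

Counting how many observed values fall inside the band, as covered does here, is a simple check on whether the interval width is plausible for decision making.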
If speed were our only concern, we might be inclined to use
imsl.stat.mlffNetworkForecast in our production forecasting tool. However, given that
• most of the forecasts from imsl.stat.autoArima were very close to, or better than,
those from imsl.stat.mlffNetworkForecast
• the flexibility of having automatic outlier detection can be very beneficial
• the forecasts from ARIMA yield confidence intervals
we choose imsl.stat.autoArima as the best method to use in our forecasting
application.
Moving to Production
At this point we are confident that the functionality in imsl.stat.autoArima can be
embedded in an automatic forecasting application. In typical scenarios, the C application
developer would now need to either find or write C routines to replicate the functionality used
by the data modeler during the prototyping and exploration work. This process has significant
risk, as an ARIMA implementation found in C may not be robust, well‐tested or documented,
and writing an ARIMA forecasting routine from scratch would be time consuming and difficult.
Having used Python and PyIMSL Studio, however, the forecasting provided by
imsl.stat.autoArima is easily moved to the C production environment via the IMSL C
Numerical Library routine imsls_f_auto_arima. The code examples below demonstrate the
ease with which the same powerful ARIMA forecasting algorithm can be consistently and
confidently used in both the Python modeling work and the C application development. It should
be noted that imsl.stat.autoArima (and imsls_f_auto_arima) has numerous
optional arguments that may be useful in various contexts. The arguments shown below
represent a subset of these that are relevant to the present example.
Python code snippet
from imsl.stat.autoArima import autoArima

#Assume the time series (series) has already been read in from some source
Npredict = 56              #Forecast the next 56 days (eight weeks)

p_initial = range(13)      #AR(0 to 12)
q_initial = range(4)       #MA(0 to 3)
s_initial = [1,2]          #Consider a max of two orders of differencing.
d_initial = [0,1,7,365]    #Consider models with no differencing, simple
                           #differencing, weekly differencing, and yearly-
                           #seasonal differencing.
method = 2                 #Grid search over the *_initial parameter space

#Output arguments (cf. the corresponding output arrays in the C snippet)
model = [-1,-1,-1,-1]
forecast = [-1]
forecast_outFree = [-1]
outFreeSeries = [-1]

times = range(len(series)) #Any gaps in this monotonic sequence would
                           #be an indicator of missing data that
                           #imsl.stat.autoArima should impute with an
                           #AR(p) model estimate
params = autoArima(times, series, method=method,
                   model=model, pInitial=p_initial,
                   qInitial=q_initial, sInitial=s_initial,
                   dInitial=d_initial, numPredict=Npredict,
                   outFreeSeries=outFreeSeries,
                   outFreeForecast=forecast_outFree,
                   outlierForecast=forecast)
Analogous C code snippet
#include <stdlib.h>
#include <imsls.h>

int n_obs = 735;
int n_predict = 56;
int i, *times;
/* As in the Python example code, assume the time series x (n_obs float
   values) has already been read in from some source
*/
int model[4];
int n_s_initial = 2, n_d_initial = 4;
int n_p_initial = 13, n_q_initial = 4;
int s_initial[2] = {1,2};
int d_initial[4] = {0,1,7,365};
int p_initial[13] = {0,1,2,3,4,5,6,7,8,9,10,11,12};
int q_initial[4] = {0,1,2,3};
float params[12+3+1];
/* Possibly 12 AR parameters, 3 MA parameters, and 1 constant
*/
float *outfree_series, *outlierfree_forecast, *outlier_forecast;

outfree_series = (float *)malloc((n_obs*2) * sizeof(float));
/* First n_obs values are original series, next n_obs values are the
   corresponding values having (possibly) been adjusted for the
   effects of outliers
*/
outlierfree_forecast = (float *)malloc((n_predict*3) * sizeof(float));
/* The first n_predict values are the forecasts, the next n_predict
   values are the forecast standard errors, and the last n_predict
   values contain the psi weights of the infinite-order moving
   average form of the model
*/
outlier_forecast = (float *)malloc((n_predict*3) * sizeof(float));
times = (int *)malloc(n_obs * sizeof(int));
for (i=0; i<n_obs; i++)
{
    times[i] = i;
}
imsls_f_auto_arima(n_obs, times, x,
                   IMSLS_MODEL, model,
                   IMSLS_METHOD, 2,
                   IMSLS_P_INITIAL, n_p_initial, p_initial,
                   IMSLS_Q_INITIAL, n_q_initial, q_initial,
                   IMSLS_S_INITIAL, n_s_initial, s_initial,
                   IMSLS_D_INITIAL, n_d_initial, d_initial,
                   IMSLS_NUM_PREDICT, n_predict,
                   IMSLS_OUT_FREE_SERIES_USER, outfree_series,
                   IMSLS_OUTLIER_FORECAST_USER, outlier_forecast,
                   IMSLS_OUT_FREE_FORECAST_USER, outlierfree_forecast,
                   IMSLS_RETURN_USER, params,
                   0);
References
Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy, International
Journal of Forecasting, 22, 679‐688, 2006.
About the Author
Josh Hemann is a Statistical Advisor at Visual Numerics. He works closely with customers, as
well as colleagues within Visual Numerics, on exploratory data analysis and analytical
application development problems. In previous roles with the company Josh has been involved
in software development, consulting, training and customer support for the PV‐WAVE® Family
of products. His background is in environmental modeling and neural networks.