Overview of Predictive Learning
Transcript

  • 1. NEW FORMULATIONS FOR PREDICTIVE LEARNING Vladimir Cherkassky Dept. Electrical & Computer Eng. University of Minnesota Email cherkass@ece.umn.edu Plenary talk at ICANN-2001 VIENNA, AUSTRIA, August 22, 2001
  • 2. 2 OUTLINE 1. MOTIVATION and BACKGROUND Historical background New application needs Methodological & scientific importance 2. STANDARD FORMULATIONS Standard formulations of the learning problem Possible directions for new formulations VC-theory framework 3. EXAMPLE NEW FORMULATIONS Transduction: learning function at given points Signal Processing: signal denoising Financial engineering: market timing Spatial data mining
  • 3. 3 4. SUMMARY 1.1 HISTORICAL BACKGROUND • The problem of predictive learning: GIVEN past data + reasonable assumptions ESTIMATE unknown dependency for future predictions • Driven by applications (NOT theory): medicine biology: genomics financial engineering (i.e., program trading, hedging) signal/image processing data mining (for marketing) ........ • Math disciplines: function approximation pattern recognition statistics
  • 4. 4 optimization ……. CURRENT STATE-OF-THE-ART (in predictive learning) • Disconnect between theory and practical methods • Cookbook of learning methods (neural networks, soft computing…) • Terminological confusion (reflecting conceptual confusion ?) • REASONS for the above - objective (inherent problem complexity) - historical: existing data-analytical tools developed for classical formulations
  • 5. 5 - subjective (fragmentation in science & engineering) OBJECTIVE FACTORS • Estimation/ Learning with finite samples = inherently complex problem (ill-posed) • Characterization of uncertainty - statistical (frequentist) - statistical (Bayesian) - fuzzy - etc. • Representation of a priori knowledge (implicit in the choice of a learning method) • Learning in Physical vs Social Systems - social systems may be inherently unpredictable
  • 6. 6 1.2 APPLICATION NEEDS (vs standard formulations) • Learning methods assume standard formulations - classification - regression - density estimation - hidden component analysis • Theoretical basis for existing methods - parametric (linear parameterization) - asymptotic (large-sample) • TYPICAL ASSUMPTIONS - stationary distributions - i.i.d. samples - finite training sample / very large test sample - particular cost function (i.e. squared loss)
  • 7. 7 • DO NOT HOLD for MODERN APPLICATIONS 1.3 METHODOLOGICAL & SCIENTIFIC NEEDS • Different objectives for - scientific discipline - engineering - (academic) marketing Science is an attempt to make the chaotic diversity of our sensory experiences correspond to a logically uniform system of thoughts A. Einstein, 1942 • Scientific approach - a few (primary) abstract concepts - math description (quantifiable, measurable)
  • 8. 8 - describes objective reality (via repeatable experiments) • Predictive learning - not a scientific field (yet) - this inhibits (true) progress - understanding of major issues • Characteristics of a scientific discipline (1) problem setting / main concepts (2) solution approach (= inductive principle) (3) math proofs (4) constructive implementation ( = learning algorithm) (5) applications NOTE: current focus on (3), (4), (5), NOT on (1) or (2) • Distinction between (1) and (2)
  • 9. 9 - necessary for scientific approach - currently lacking TWO APPROACHES TO PREDICTIVE LEARNING • The PROBLEM GIVEN past data + reasonable assumptions ESTIMATE unknown dependency for future predictions • APPROACH 1 (most common in Soft Computing) (a) Apply one’s favourite learning method, i.e., neural net, Bayesian, Maximum Likelihood, SVM etc. (b) Report results; suggest heuristic improvements (c) Publish a paper (optional) • APPROACH 2 (VC-theory) (a) Problem formulation reflecting application needs
  • 10. 10 (b) Select appropriate learning method (c) Report results; publish paper etc.
  • 11. 11 2. STANDARD FORMULATIONS for PREDICTIVE LEARNING [Diagram: a Generator of X-values feeds the System, which produces output y; the Learning machine observes (X, y) and outputs the approximation f*(X)] • LEARNING as function estimation = choosing the ‘best’ function from a given set of functions [Vapnik, 1998] • Formal specs: Generator of random X-samples from unknown probability distribution System (or teacher) that provides output y for every X-value according to (unknown) conditional distribution P(y|X) Learning machine that implements a set of approximating functions f(X, w) chosen a priori, where w is a set of parameters
  • 12. 12 • The problem of learning/estimation: Given a finite number of samples (Xi, yi), choose from a given set of functions f(X, w) the one that best approximates the true output Loss function L(y, f(X,w)) is a measure of discrepancy (error) Expected Loss (Risk) R(w) = ∫ L(y, f(X, w)) dP(X, y) The goal is to find the function f(X, w0) that minimizes R(w) when the distribution P(X,y) is unknown. Important special cases (a) Classification (Pattern Recognition) output y is categorical (class label) approximating functions f(X,w) are indicator functions For two-class problems the common loss is L(y, f(X,w)) = 0 if y = f(X,w), L(y, f(X,w)) = 1 if y ≠ f(X,w) (b) Function estimation (regression) output y and f(X,w) are real-valued functions Commonly used squared loss measure: L(y, f(X,w)) = [y − f(X,w)]²
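A minimal Python sketch of these two loss functions and a finite-sample (empirical) estimate of the risk R(w); the toy labels and predictions are hypothetical placeholders for the output of some model f(X, w), and NumPy is assumed.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # classification loss: 0 if the predicted label matches, 1 otherwise
    return (y != y_hat).astype(float)

def squared_loss(y, y_hat):
    # regression loss: [y - f(X, w)]^2
    return (y - y_hat) ** 2

def empirical_risk(loss, y, y_hat):
    # finite-sample stand-in for R(w) = integral of L(y, f(X, w)) dP(X, y)
    return float(np.mean(loss(y, y_hat)))

# hypothetical toy data
y_class, f_class = np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])
y_reg,   f_reg   = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.3])
print(empirical_risk(zero_one_loss, y_class, f_class))  # 0.25
print(empirical_risk(squared_loss, y_reg, f_reg))       # ~0.047
```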
  • 13. 13 LIMITATIONS of STANDARD FORMULATION • Standard formulation of the learning problem (a) global function estimation (= large test set) (b) i.i.d. training/test samples (c) standard regression/classification (loss function) (d) stationary distributions → possible research directions (a) new formulations of the learning problem (e.g., transduction) (b) non i.i.d. data (c) alternative cost (error) measures (d) non-stationary distributions
  • 14. 14 VC LEARNING THEORY • Statistical theory for finite-sample estimation • Focus on predictive non-parametric formulation • Empirical Risk Minimization approach • Methodology for model complexity control (SRM) References on VC-theory: V. Vapnik, The Nature of Statistical Learning Theory, Springer 1995 V. Vapnik, Statistical Learning Theory, Wiley, 1998 V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and Methods, Wiley, 1998 B. Scholkopf et al (eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999 IEEE Trans on Neural Networks, Special Issue on VC learning theory and its applications, Sept. 1999
  • 15. 15 CONCEPTUAL CONTRIBUTIONS of VC-THEORY • Clear separation between problem statement solution approach (i.e. inductive principle) constructive implementation (i.e. learning algorithm) - all 3 are usually mixed up in application studies. • Main principle for solving finite-sample problems Do not solve a given problem by indirectly solving a more general (harder) problem as an intermediate step - usually not followed in statistics, neural networks and applications. Example: maximum likelihood methods • Worst-case analysis for learning problems Theoretical analysis of learning should be based on the worst-case scenario (rather than average-case).
  • 16. 16 3. EXAMPLES of NEW FORMULATIONS TOPICS COVERED Transduction: learning function at given points Signal Processing: signal denoising Financial engineering: market timing Spatial data mining FOCUS ON Motivation / application domain (Formal) problem statement Why this formulation is important Rather than solution approach (learning algorithm)
  • 17. 17 COMPONENTS of PROBLEM FORMULATION
  • 18. 18 [Diagram: application needs (loss function; input, output, and other variables; training/test data; admissible models) feed into the formal problem statement, guided by learning theory] TRANSDUCTION FORMULATION
  • 19. 19 • Standard learning problem: Given: training data (Xi, yi), i = 1,…n Estimate: (unknown) dependency everywhere in X R(w) = ∫ L(y, f(X, w)) dP(X, y) → min ~ global function estimation ~ large test data set [Diagram: Generator of X-values → System → output y, observed by the Learning machine that outputs f*(X), as on slide 11] • Usually need to make predictions at a few points → estimating function everywhere may be wasteful
  • 20. 20 • Transduction approach [Vapnik, 1995] learning/training = induction operation/lookup = deduction [Diagram: a priori knowledge (assumptions) at the top; induction maps training data to an estimated function; deduction maps the estimated function to predicted outputs; transduction maps training data to predicted outputs directly] • Problem statement (Transduction): Given: training data (Xi, yi), i = 1,…n Estimate: y-values at given points Xn+j, j = 1,…k Optimal prediction: minimize R(w) = (1/k) ∑ Loss[yn+j, f(Xn+j, w)], summed over j = 1,…k
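A minimal sketch of the transductive setting, assuming NumPy: predictions are produced only at the k given points X(n+1),…,X(n+k), and no global function is estimated. The 1-nearest-neighbour rule below is purely illustrative, not the SRM/SVM-based scheme referenced in the talk.

```python
import numpy as np

def transduce_nn(X_train, y_train, X_given):
    """Predict y only at the given points X_given (shape (k, d));
    X_train has shape (n, d). No global function f(X, w) is built."""
    preds = []
    for x in X_given:
        j = np.argmin(np.linalg.norm(X_train - x, axis=1))  # closest training point
        preds.append(y_train[j])
    return np.array(preds)

def transductive_risk(y_given, preds, loss=lambda a, b: (a - b) ** 2):
    # R(w) = (1/k) * sum_j Loss[y_{n+j}, f(X_{n+j}, w)]
    return float(np.mean(loss(y_given, preds)))
```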
  • 21. 21 CURRENT STATE-OF-THE-ART • Vapnik [1998] provides - theoretical analysis of transductive inference - SRM formulation - SVM solution approach • Transduction yields better predictions than the standard approach • Potentially plenty of applications • No practical applications reported to date
  • 22. 22 SIGNAL DENOISING [Figure: two example noisy signals, one sampled at 1000 points and one at 100 points; plot axes only]
  • 23. 23 SIGNAL DENOISING PROBLEM STATEMENT • REGRESSION FORMULATION real-valued function estimation (with squared loss) • DIFFERENCES (from standard formulation) - fixed sampling rate (no randomness in X-space) - training data X-values = test data X-values - non i.i.d. data • SPECIFICS of this formulation → computationally efficient orthogonal estimators, e.g. - Discrete Fourier Transform (DFT) - Discrete Wavelet Transform (DWT)
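A minimal sketch of such an orthogonal estimator, assuming NumPy: the uniformly sampled signal is denoised by keeping only the largest-magnitude DFT coefficients, with the number kept acting as the complexity parameter (the VC-based model selection discussed on the next slides is not shown here).

```python
import numpy as np

def denoise_dft(x, keep=8):
    """Hard-threshold denoising in the DFT basis: keep the `keep`
    largest-magnitude coefficients and zero out the rest."""
    coeffs = np.fft.rfft(x)
    small = np.argsort(np.abs(coeffs))[:-keep]  # indices of all but the largest `keep`
    coeffs[small] = 0.0
    return np.fft.irfft(coeffs, n=len(x))

# hypothetical usage: a noisy sine sampled at a fixed rate
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)
estimate = denoise_dft(noisy, keep=8)
```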
  • 24. 24
  • 25. 25 RESEARCH ISSUES FOR SIGNAL DENOISING • CONCEPTUAL SIGNIFICANCE Signal denoising formulation = nonlinear estimator suitable for ERM • MODEL SELECTION / COMPLEXITY CONTROL V. Cherkassky and X. Shao (2001), Signal estimation and denoising using VC-theory, Neural Networks, 14, 37-52 • NON I.I.D. DATA • SPECIFYING GOOD STRUCTURES - especially for 2D signals (images)
  • 26. 26 FINANCIAL ENGINEERING: MARKET TIMING • MOTIVATION increasing market volatility, i.e. typical daily fluctuations: - 1-2% for SP500 - 2-4% for NASDAQ 100 Example of intraday volatility: NASDAQ 100 on March 14, 2001 • AVAILABILITY of MARKET DATA easy to quantify loss function hard to formulate good (robust) trading strategies
  • 27. 27 TRADING VIA PREDICTIVE LEARNING • GENERAL SETTING [Diagram: input indicators x → predictive model y = f(x) → trading decision (buy/sell/hold) → market → gain/loss] • CONCEPTUAL ISSUES: problem setting binary decision / continuous loss • PRACTICAL ISSUES - choice of trading security (equity) - selection of input variables X - timing of trading (e.g., daily, weekly, etc.) These issues are most important for practical success BUT cannot be formalized
  • 28. 28 FUNDS EXCHANGE APPROACH • 100% INVESTED OR IN CASH (daily trading) trading decisions made at the end of each day [Diagram: a proprietary exchange strategy issues daily buy/sell decisions that switch the account between an index or mutual fund and a money-market position] optimal trade-off between preservation of principal and short-term gains • PROPRIETARY EXCHANGE STRATEGIES - selection of mutual funds - selection of predictor (input) variables
  • 29. 29 Ref: V. Cherkassky, F. Mulier and A. Sheng , Funds Exchange: an approach for risk and portfolio management, in Proc. Conf. Computational Intelligence for Financial Engineering, N.Y., 2000
  • 30. 30 • EXAMPLE International Fund (actual trading) results for a 6-month period:
                           Fund exchange   Buy-and-Hold
     cumulative return     16.5%           5.7%
     market exposure       55%             100%
     number of trades      25              1
    market exposure = % of days the account was invested; number of trades (25) = 25 buy and 25 sell transactions
  • 31. 31 [Chart: value of exchange vs. buy-and-hold; two series (exchange, buy_hold); y-axis: account value, roughly 98 to 118; x-axis: date, 20-May-99 to 25-Jan-00]
  • 32. 32 FORMULATION FOR FUNDS EXCHANGE [Diagram: input indicators x → predictive model y = f(x) → trading decision (buy/sell/hold) → market → gain/loss] GIVEN p_i a time series of daily closing prices (of a fund), q_i = (p_i − p_{i−1}) / p_{i−1} the daily % change, X_i a time series of daily values of predictor variables, y_i = f(x_i, w) an indicator decision function, where y_i = 1 means BUY and y_i = 0 means SELL • OBJECTIVE maximize the total return over an n-day period: Q(w) = ∑ f(x_i, w) q_i, summed over i = 1,…n
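A minimal sketch of the objective Q(w), assuming NumPy; the price series and the 0/1 decision sequence below are hypothetical placeholders for the output of some decision function f(x, w).

```python
import numpy as np

def daily_pct_change(prices):
    # q_i = (p_i - p_{i-1}) / p_{i-1}
    prices = np.asarray(prices, dtype=float)
    return np.diff(prices) / prices[:-1]

def total_return(decisions, q):
    # Q(w) = sum_i f(x_i, w) * q_i, with f(x_i, w) = 1 (BUY/invested) or 0 (SELL/cash)
    return float(np.sum(np.asarray(decisions, dtype=float) * q))

# hypothetical usage
prices = [100.0, 101.0, 99.0, 100.5, 102.0]
q = daily_pct_change(prices)          # one q_i per trading day
decisions = np.array([1, 0, 1, 1])    # placeholder indicator values y_i
print(total_return(decisions, q))
```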
  • 33. 33 SPECIFICS of FUNDS EXCHANGE FORMULATION • COMBINES indicator decision functions (BUY or SELL) continuous cost function (cumulative return) • NOTE - classification formulation (= predicting direction) is not meaningful because classification cost does not reflect the magnitude of price change - regression formulation (= predicting next-day price) not appropriate either • Training/ Test data not i.i.d. random variables • No constructive learning methods available
  • 34. 34 SPATIAL DATA MINING • MOTIVATION: geographical/spatial data LAW of GEOGRAPHY: nearby things are more similar than distant things Examples: real estate data, US census data, crime rate data CRIME RATE DATA in Columbus, OH Output (response) variable: crime rate in a given neighborhood Input (predictor) variables: ave. household income, house price Spatial domain (given): 49 neighborhoods in Columbus Training data: 49 samples (each sample = neighborhood) The PROBLEM: Using training data, estimate a predictive model (regression-type) accounting for spatial correlation, i.e., the crime rate in adjacent neighborhoods should be similar.
  • 35. 35 PROBLEM STATEMENT GIVEN X input (explanatory) variables, y output (response) variable, f(X_s, w) approximating functions, (X_s, y_s) training data defined on domain S, where S = given spatial domain (2D fixed grid) Spatial correlation: X or y values close in S are similar • SPATIAL CORRELATION VIA LOSS FUNCTION Loss (Risk) R(s) = prediction loss + spatial loss Define N(s) = given neighborhood around each location s; then R(s) = (y_s − f(x_s, w))² + (ȳ_s − f(x_s, w))², where ȳ_s = (1/|N(s)|) ∑ y_k (summed over k in N(s)) is the average spatial response at location s
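A minimal sketch of the combined risk, assuming NumPy and a given neighborhood list per location; the array `preds` stands for the values f(x_s, w) produced by some hypothetical model.

```python
import numpy as np

def spatial_risk(y, preds, neighbors):
    """Sum over locations s of prediction loss + spatial loss:
    (y_s - f(x_s, w))^2 + (ybar_s - f(x_s, w))^2,
    where ybar_s is the average response over the neighborhood N(s)."""
    total = 0.0
    for s in range(len(y)):
        ybar_s = float(np.mean([y[k] for k in neighbors[s]]))  # ave spatial response
        total += (y[s] - preds[s]) ** 2 + (ybar_s - preds[s]) ** 2
    return total

# hypothetical usage: 4 locations on a tiny grid
y = np.array([3.0, 2.5, 4.0, 3.5])
preds = np.array([2.8, 2.7, 3.9, 3.4])
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(spatial_risk(y, preds, neighbors))
```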
  • 36. 36 4. SUMMARY and DISCUSSION Perfection of means and confusion of goals seem – in my opinion – to characterize our age A. Einstein, 1942 IMPORTANCE OF • NEW FORMULATIONS for PREDICTIVE LEARNING • PRACTICAL SIGNIFICANCE • NEW DIRECTIONS for RESEARCH • CONNECTION BETWEEN THEORY and PRACTICE DISCUSSION / QUESTIONS