Pattern Recognition in the Stock Market


Published in: Business, Economy & Finance
1. Pattern Recognition in the Stock Market

2. Introduction
3. Motivation
   - Our time is limited; better not to waste it all working
   - A lifestyle costs money
   - Build something that does the job for you

4. MetaTrader
   - Online trading platform
   - Lets you trade foreign currencies, stocks, and indexes
   - MetaQuotes Language (MQL), similar to C, lets you program buy and sell orders
   - Can be linked with dynamic-link libraries (DLLs)

5. Pattern Recognition
   - Pattern recognition aims to classify data (patterns) based either on a priori knowledge or on statistical information extracted from the patterns.
   - The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space.
   - To understand is to perceive patterns.

6. SVM
7. Linear Support Vector Machines
   - A direct marketing company wants to sell a new book: "The Art History of Florence"
   - Example from Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green (2003)
   - Problem: how to identify buyers and non-buyers using two variables:
     - Months since last purchase
     - Number of art books purchased
   (Scatter plot: ∆ buyers, ● non-buyers; axes: months since last purchase vs. number of art books purchased)

8. Linear SVM: Separable Case
   - Main idea of SVM: separate the groups by a line.
   - However, there are infinitely many lines that have zero training error...
   - ...which line should we choose?

9. Linear SVM: Separable Case
   - SVMs use the idea of a margin around the separating line.
   - The thinner the margin, the more complex the model.
   - The best line is the one with the largest margin.

10. Linear SVM: Separable Case
   - The line having the largest margin is: w1·x1 + w2·x2 + b = 0
   - where
     - x1 = months since last purchase
     - x2 = number of art books purchased
   - Note:
     - w1·xi1 + w2·xi2 + b ≥ +1 for i ∈ ∆
     - w1·xj1 + w2·xj2 + b ≤ −1 for j ∈ ●
   (Plot: the margin is bounded by the lines w1·x1 + w2·x2 + b = +1 and w1·x1 + w2·x2 + b = −1)

11. Linear SVM: Separable Case
   - The width of the margin is 2 / ||w||, where ||w|| = sqrt(w1² + w2²).
   - Note: to maximize the margin, minimize ||w||, i.e. minimize ||w||²/2.

13. Linear SVM: Separable Case
   - The optimization problem for SVM is: minimize ||w||²/2
   - subject to:
     - w1·xi1 + w2·xi2 + b ≥ +1 for i ∈ ∆
     - w1·xj1 + w2·xj2 + b ≤ −1 for j ∈ ●
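A minimal numerical sketch of this optimization, using stochastic sub-gradient descent on the soft-margin primal objective (with a large C to approximate the separable case). The data points, C, learning rate, and epoch count are all illustrative assumptions, not values from the slides:

```python
import numpy as np

def svm_train(X, y, C=10.0, lr=0.01, epochs=2000):
    """Linear SVM via sub-gradient descent on ||w||^2/2 + C * sum of hinge losses."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:        # point violates the margin
                w -= lr * (w - C * y[i] * X[i])  # hinge sub-gradient + regularizer
                b += lr * C * y[i]
            else:                                # only the regularizer pulls on w
                w -= lr * w
    return w, b

# Hypothetical buyers (∆, +1) vs non-buyers (●, -1) in the two features
# (months since last purchase, number of art books purchased).
X = np.array([[2., 4.], [3., 5.], [1., 6.], [10., 1.], [12., 0.], [9., 2.]])
y = np.array([1, 1, 1, -1, -1, -1])
np.random.seed(0)
w, b = svm_train(X, y)
pred = np.sign(X @ w + b)   # all six training points end up correctly classified
```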
14. Linear SVM: Separable Case
   - "Support vectors" are the points that lie on the boundaries of the margin.
   - The decision surface (line) is determined only by the support vectors; all other points are irrelevant.

15. Linear SVM: Nonseparable Case
   - Nonseparable case: there is no line that separates the two groups without error.
   - Here, SVM minimizes L(w, C) = Complexity + Errors = ||w||²/2 + C·Σ ξi
   - subject to:
     - w1·xi1 + w2·xi2 + b ≥ +1 − ξi for i ∈ ∆
     - w1·xj1 + w2·xj2 + b ≤ −1 + ξj for j ∈ ●
     - ξi, ξj ≥ 0
   - Training set: 1000 targeted customers (∆ buyers, ● non-buyers)
   - Maximize the margin while minimizing the training errors.

16. Linear SVM: Nonseparable Case
   - Vectors xi with labels yi = ±1
   - Margin and error (slack) vectors

17. Linear SVM: The Role of C
   - Bigger C: thinner margin, fewer training errors (better fit on the data), increased complexity
   - Smaller C: wider margin, more training errors (worse fit on the data), decreased complexity
   - C varies both complexity and empirical error by affecting the optimal w and the optimal number of training errors
   (Plots: the fitted classifier for C = 5 vs. C = 1)
18. Non-linear SVMs
   - Transform x → φ(x)
   - The linear algorithm depends only on the dot products x·xi, hence the transformed algorithm depends only on φ(x)·φ(xi)
   - Use a kernel function K(xi, xj) such that K(xi, xj) = φ(xi)·φ(xj)

19. Nonlinear SVM: Nonseparable Case
   - Map into a higher-dimensional space
   - Optimization task: minimize L(w, C)
   - subject to the same margin constraints for ∆ and ●, written in the transformed space

20. Nonlinear SVM: Nonseparable Case
   - Map the data into a higher-dimensional space: R² → R³
   (Plot: an XOR-like configuration of the four points (1,1), (−1,1), (1,−1), (−1,−1), with ∆ and ● alternating, which no line can separate in R²)

21. Nonlinear SVM: Nonseparable Case
   - Find the optimal hyperplane in the transformed space
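A tiny sketch of this mapping idea on the four XOR-style points from the previous slide. The feature map φ(x1, x2) = (x1, x2, x1·x2) and the class labeling are assumptions chosen to illustrate how a hyperplane in R³ separates what no line in R² can:

```python
# Four XOR-style points: not linearly separable in R^2, but after the
# (assumed) feature map phi they are split in R^3 by the plane z = 0,
# i.e. the hyperplane with w = (0, 0, 1), b = 0.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [+1, +1, -1, -1]          # assumed labeling of the two classes

def phi(x1, x2):
    return (x1, x2, x1 * x2)       # third coordinate carries the XOR structure

w, b = (0.0, 0.0, 1.0), 0.0        # separating hyperplane in the feature space
preds = [1 if sum(wi * zi for wi, zi in zip(w, phi(*p))) + b > 0 else -1
         for p in points]
```

Here the "kernel trick" would let us use K(x, x') = φ(x)·φ(x') without ever forming φ explicitly.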
22. Nonlinear SVM: Nonseparable Case
   - Observe the decision surface in the original space (optional)

23. Nonlinear SVM: Nonseparable Case
   - Dual formulation of the (primal) SVM minimization problem:
     - Primal: minimize over w, b, ξ the objective ||w||²/2 + C·Σi ξi, subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0
     - Dual: maximize over α the objective Σi αi − (1/2)·Σi Σj αi αj yi yj (xi·xj), subject to 0 ≤ αi ≤ C and Σi αi yi = 0

24. Nonlinear SVM: Nonseparable Case
   - Dual formulation with a kernel: maximize Σi αi − (1/2)·Σi Σj αi αj yi yj K(xi, xj), subject to 0 ≤ αi ≤ C and Σi αi yi = 0, where K(xi, xj) = φ(xi)·φ(xj) is the kernel function

25. Solving
   - Construct and minimize the Lagrangian
   - Take derivatives w.r.t. w and b, and equate them to 0
     - The Lagrange multipliers αi are called "dual variables"
     - Each training point has an associated dual variable
   - The parameters are expressed as a linear combination of training points: w = Σi αi yi xi
   - Only the support vectors have non-zero αi
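The derivative steps summarized above can be written out for the separable case (the slide's own equations were images, so this is the standard derivation):

```latex
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
  - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right],
  \qquad \alpha_i \ge 0

\frac{\partial L}{\partial \mathbf{w}}
  = \mathbf{w} - \sum_{i} \alpha_i y_i \mathbf{x}_i = 0
  \;\Rightarrow\; \mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i

\frac{\partial L}{\partial b}
  = -\sum_{i} \alpha_i y_i = 0
  \;\Rightarrow\; \sum_{i} \alpha_i y_i = 0
```

Substituting w back into L eliminates w and b and yields the dual objective of the previous slides, in which the training data appear only through dot products xi·xj.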
27. Applications
   - Handwritten digit recognition
     - Of interest to the US Postal Service
     - A 4% error rate was obtained
     - Only about 4% of the training data were support vectors
   - Text categorisation
   - Face detection
   - DNA analysis
   - ...

28. Architecture of SVMs
   - Nonlinear classifier (using a kernel)
   - Decision function: f(x) = sign( Σi αi yi K(x, xi) + b )
   - The coefficients αi are computed as the solution of a quadratic program
30. Artificial Neural Networks

31. Neural Network
   - Taxonomy of neural network architectures
   - The architecture of a neural network refers to the arrangement of the connections between neurons (processing elements), the number of layers, and the flow of signals in the network.
   - There are two main categories of neural network architecture: feed-forward and feedback (recurrent) networks.

32. Neural Network
   - Feed-forward network: the multilayer perceptron

33. Neural Network
   - Recurrent network

34. Multilayer Perceptron (MLP)
   (Diagram: MLP structure with input vector x1 … xn, a hidden layer h1, h2, and an output layer O1; each neuron (processing element) computes y = Σ wi·xi and outputs F(y))
35. Backpropagation Learning
   - Architecture:
     - A feed-forward network with at least one layer of non-linear hidden nodes, i.e. number of layers L ≥ 2 (not counting the input layer)
     - The node function is differentiable; most common: the sigmoid function
   - Learning: supervised, error-driven, using the generalized delta rule
   - We call this type of net a BP net
   - Topics: the weight update rule (gradient descent approach), practical considerations, variations of BP nets, applications

37. Backpropagation Learning
   - Notation:
     - Weights: two weight matrices, w(1,0) from the input layer (0) to the hidden layer (1), and w(2,1) from the hidden layer (1) to the output layer (2); e.g., w(1,0)21 is the weight from node 1 in layer 0 to node 2 in layer 1
     - Training samples: pairs {(xp, dp), p = 1, …, P}, so it is supervised learning
     - Input pattern: xp = (xp,1, …, xp,n)
     - Output pattern: op = (op,1, …, op,K)
     - Desired output: dp = (dp,1, …, dp,K)
     - Error: lp,j = dp,j − op,j, the error for output node j when xp is applied
     - Sum square error: E = Σp Σj (dp,j − op,j)²
     - This error drives learning (changes to w(1,0) and w(2,1))

38. Backpropagation Learning
   - The sigmoid function again: S(net) = 1 / (1 + e^(−net))
     - Differentiable: S′(net) = S(net)·(1 − S(net))
     - When |net| is sufficiently large, it moves into one of the two saturation regions, behaving like a threshold or ramp function.
   - Chain rule of differentiation

39. Backpropagation Learning
   - Forward computing:
     - Apply an input vector x to the input nodes
     - Compute the output vector x(1) of the hidden layer
     - Compute the output vector o of the output layer
     - The network is thus a map from input x to output o
   - Objective of learning:
     - Modify the two weight matrices to reduce the sum square error for the given P training samples as much as possible (to zero if possible)

40. Backpropagation Learning
   - Idea of BP learning:
     - Updating weights in w(2,1) (hidden to output layer): the delta rule, as in a single-layer net, using the sum square error
     - The delta rule is not applicable to updating weights in w(1,0) (input to hidden layer) because we do not know the desired values for the hidden nodes
     - Solution: propagate the errors at the output nodes down to the hidden nodes; these computed errors on the hidden nodes drive the update of the weights in w(1,0) (again by the delta rule), hence the name error Back Propagation (BP) learning
     - How to compute the errors on the hidden nodes is the key
     - Error backpropagation can be continued downward if the net has more than one hidden layer
     - Proposed first by Werbos (1974); current formulation by Rumelhart, Hinton, and Williams (1986)
41. Backpropagation Learning
   - Generalized delta rule:
     - Consider sequential learning mode: for a given sample (xp, dp)
     - Update weights by gradient descent: Δw = −η·∂E/∂w
     - For a weight in w(2,1): Δw(2,1)kj = −η·∂E/∂w(2,1)kj
     - For a weight in w(1,0): Δw(1,0)ji = −η·∂E/∂w(1,0)ji
     - Derivation of the update rule for w(2,1): since E is a function of lk = dk − ok, ok is a function of netk, and netk is a function of w(2,1)kj, by the chain rule:
       ∂E/∂w(2,1)kj = −2·(dk − ok)·S′(netk)·x(1)j

42. Backpropagation Learning
   - Derivation of the update rule for w(1,0)ji: consider hidden node j
     - The weight w(1,0)ji influences netj; node j sends x(1)j to all K output nodes, so all K terms in E are functions of w(1,0)ji
     - By the chain rule:
       ∂E/∂w(1,0)ji = −Σk 2·(dk − ok)·S′(netk)·w(2,1)kj · S′(netj)·xi

43. Backpropagation Learning
   - Update rules:
     - For outer-layer weights w(2,1): Δw(2,1)kj = η·δk·x(1)j, where δk = (dk − ok)·S′(netk)
     - For inner-layer weights w(1,0): Δw(1,0)ji = η·μj·xi, where μj = ( Σk δk·w(2,1)kj )·S′(netj)
     - (The constant factor 2 from the gradient is absorbed into the learning rate η.)
   - The term Σk δk·w(2,1)kj is the weighted sum of the errors from the output layer.
44. Note: if S is the logistic function, then S′(x) = S(x)·(1 − S(x)).
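The chain-rule derivation above can be sanity-checked numerically. This sketch builds a tiny 2-2-1 sigmoid network, computes the inner-layer gradient via the δ and μ error terms of the update rules, and compares it against a finite difference; the network size and sample values are arbitrary illustrative choices:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w1, w2, x):
    """2-2-1 net: hidden activations h and output o, both sigmoid."""
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2))) for j in range(2)]
    o = sigmoid(sum(w2[0][j] * h[j] for j in range(2)))
    return h, o

def error(w1, w2, x, d):
    _, o = forward(w1, w2, x)
    return (d - o) ** 2

def bp_gradient(w1, w2, x, d):
    """Analytic dE/dw1[0][0] via delta_k = (d_k - o_k)S'(net_k) and
    mu_j = (sum_k delta_k w2[k][j]) S'(net_j) from the slides."""
    h, o = forward(w1, w2, x)
    delta = (d - o) * o * (1 - o)                 # output-layer error term
    mu0 = delta * w2[0][0] * h[0] * (1 - h[0])    # hidden-node error term
    return -2 * mu0 * x[0]                        # dE/dw1[0][0]

random.seed(1)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w2 = [[random.uniform(-1, 1) for _ in range(2)]]
x, d = [0.5, -0.3], 1.0

# Central finite difference on the same weight.
eps = 1e-6
w1[0][0] += eps;     e_plus = error(w1, w2, x, d)
w1[0][0] -= 2 * eps; e_minus = error(w1, w2, x, d)
w1[0][0] += eps
numeric = (e_plus - e_minus) / (2 * eps)
analytic = bp_gradient(w1, w2, x, d)   # agrees with numeric to ~1e-9
```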
45. Backpropagation Learning
   - Pattern classification: an example
     - Classification of myoelectric signals
       - Input pattern: 2 features, normalized to real values between −1 and 1
       - Output patterns: 3 classes
     - Network structure: 2-5-3 (2 input nodes, 1 hidden layer of 5 nodes, 3 output nodes)
     - η = 0.95, α = 0.4 (momentum)
     - Error bound e = 0.05
     - 332 training samples
     - Maximum iterations = 20,000
     - When training stopped, 38 patterns remained misclassified

46. (Plot: the 38 misclassified patterns)
48. Strengths of BP Learning
   - Great representation power
     - Any L2 function can be represented by a BP net
     - Many such functions can be approximated by BP learning (gradient descent approach)
   - Easy to apply
     - Only requires that a good set of training samples is available
     - Does not require substantial prior knowledge or deep understanding of the domain itself (ill-structured problems)
     - Tolerates noise and missing data in training samples (graceful degradation)
   - The core of the learning algorithm is easy to implement
   - Good generalization power
     - Often produces accurate results for inputs outside the training set

49. Deficiencies of BP Learning
   - Learning often takes a long time to converge
     - Complex functions often need hundreds or thousands of epochs
   - The net is essentially a black box
     - It may provide a desired mapping between input and output vectors (x, o), but it does not explain why a particular x is mapped to a particular o
     - It thus cannot provide an intuitive (e.g., causal) explanation of the computed result
     - This is because the hidden nodes and the learned weights have no clear semantics: what is learned are operational parameters, not general, abstract knowledge of a domain
   - Unlike many statistical methods, there is no theoretically well-founded way to assess the quality of BP learning
     - What is the confidence level of an output o computed from input x using such a net?
     - What is the confidence level of a trained BP net with final error E (which may or may not be close to zero)?

50.
   - Problems with the gradient descent approach
     - It only guarantees reducing the total error to a local minimum (E may not be reduced to zero)
       - It cannot escape from that local minimum
       - Not every function that is representable can be learned
     - How bad this is depends on the shape of the error surface: too many valleys/wells make it easy to get trapped in local minima
     - Possible remedies:
       - Try nets with different numbers of hidden layers and hidden nodes (they may lead to different error surfaces, some better than others)
       - Try different initial weights (different starting points on the surface)
       - Force escape from local minima by random perturbation (e.g., simulated annealing)

51.
   - Generalization is not guaranteed even if the error is reduced to 0
     - Over-fitting/over-training problem: the trained net fits the training samples perfectly (E reduced to 0) but does not give accurate outputs for inputs outside the training set
     - Possible remedies:
       - More and better samples
       - Using a smaller net if possible
       - Using a larger error bound (forced early termination)
       - Introducing noise into the samples: modify (x1, …, xn) to (x1 + α1, …, xn + αn), where the αi are small random displacements
       - Cross-validation:
         - leave some (~10%) of the samples out as test data (not used for weight updates)
         - periodically check the error on the test data
         - stop learning when the error on the test data starts to increase

52.
   - Network paralysis with the sigmoid activation function
     - Saturation regions: the input to a node may fall into a saturation region when some of its incoming weights become very large during learning; consequently, the weights stop changing no matter how long training continues
     - Possible remedies:
       - Use non-saturating activation functions
       - Periodically normalize all weights

53.
   - Learning performance (accuracy, speed, and generalization) is highly dependent on a set of learning parameters: initial weights, learning rate, number of hidden layers, number of hidden nodes, ...
   - Most of them can only be determined empirically (via experiments)
54. Practical Considerations
   - A good BP net requires more than the core learning algorithm; many parameters must be carefully selected to ensure good performance.
   - Although the deficiencies of BP nets cannot be completely cured, some of them can be eased by practical means.
   - Initial weights (and biases):
     - Random, e.g. in [−0.05, 0.05], [−0.1, 0.1], or [−1, 1]
       - Avoid bias in weight initialization
     - Normalize the weights of the hidden layer (w(1,0)) (Nguyen-Widrow):
       - Randomly assign initial weights to all hidden nodes
       - For each hidden node j, normalize its weight vector

55.
   - Training samples:
     - The quality and quantity of the training samples often determine the quality of the learning results
     - The samples must collectively represent the problem space well
       - Random sampling
       - Proportional sampling (with prior knowledge of the problem space)
     - Number of training patterns needed: there is no theoretically ideal number
       - Baum and Haussler (1989): P = W/e, where W is the total number of weights to be trained (depends on the net structure) and e is the acceptable classification error rate
       - If the net can be trained to correctly classify (1 − e/2)·P of the P training samples, then the classification accuracy of the net is 1 − e for input patterns drawn from the same sample space
       - Example: W = 27 and e = 0.05 give P = 540. If we can train the network to correctly classify (1 − 0.05/2) × 540 = 526 of the samples, the net will work correctly 95% of the time on other inputs from the same space.
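The Baum–Haussler rule of thumb and the worked example above reduce to a couple of lines of arithmetic:

```python
def baum_haussler_samples(W, e):
    """Rule-of-thumb number of training patterns: P = W / e,
    for W trainable weights and acceptable error rate e."""
    return W / e

W, e = 27, 0.05                          # values from the slide
P = baum_haussler_samples(W, e)          # 540 training patterns
needed_correct = int((1 - e / 2) * P)    # 526 must be classified correctly
```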
56.
   - How many hidden layers, and how many hidden nodes per layer?
     - Theoretically, one hidden layer (possibly with many hidden nodes) is sufficient to represent any L2 function
     - There are no theoretical results on the minimum necessary number of hidden nodes
     - Practical rule of thumb (n = number of input nodes, m = number of hidden nodes):
       - For binary/bipolar data: m = 2n
       - For real data: m >> 2n
     - Multiple hidden layers with fewer nodes may train faster for similar quality in some applications

57.
   - Example: compressing character bitmaps
     - Each character is represented by a 7×9-pixel bitmap, i.e. a binary vector of dimension 63
     - 10 characters (A–J) are used in the experiment
     - Error range:
       - tight: 0.1 (off: 0–0.1; on: 0.9–1.0)
       - loose: 0.2 (off: 0–0.2; on: 0.8–1.0)
     - Relationship between the number of hidden nodes, the error range, and the convergence rate:
       - relaxing the error range may speed up training
       - increasing the number of hidden nodes (up to a point) may speed up training
       - error range 0.1, 10 hidden nodes: 400+ epochs
       - error range 0.2, 10 hidden nodes: 200+ epochs
       - error range 0.1, 20 hidden nodes: 180+ epochs
       - error range 0.2, 20 hidden nodes: 90+ epochs
       - no noticeable speed-up when the number of hidden nodes increases beyond 22

58.
   - Other applications
     - Medical diagnosis
       - Input: manifestations (symptoms, lab tests, etc.)
       - Output: possible disease(s)
       - Problems:
         - no causal relations can be established
         - hard to determine what should be included as inputs
       - Current work focuses on more restricted diagnostic tasks, e.g. predicting prostate cancer or hepatitis B from standard blood tests
     - Process control
       - Input: environmental parameters
       - Output: control parameters
       - Learns ill-structured control functions

59.
   - Stock market forecasting
     - Input: financial factors (CPI, interest rate, etc.) and stock quotes of previous days (or weeks)
     - Output: forecast of stock prices or stock indices (e.g., S&P 500)
     - Training samples: stock market data of the past few years
   - Consumer credit evaluation
     - Input: personal financial information (income, debt, payment history, etc.)
     - Output: credit rating
   - And many more
   - Keys to successful application:
     - Careful design of the input vector (including all important features), which requires some domain knowledge
     - Obtaining good training samples, which costs time and money

60. Summary of BP Nets
   - Architecture
     - Multi-layer, feed-forward (full connection between nodes in adjacent layers, no connections within a layer)
     - One or more hidden layers with a non-linear activation function (most commonly sigmoid functions)
   - BP learning algorithm
     - Supervised learning (samples (xp, dp))
     - Approach: gradient descent to reduce the total error (which is why it is also called the generalized delta rule)
     - Error terms at the output nodes, and error terms at the hidden nodes (which is why it is called error BP)
     - Ways to speed up the learning process:
       - Adding momentum terms
       - Adaptive learning rate (delta-bar-delta)
       - Quickprop
     - Generalization (cross-validation test)

61.
   - Strengths of BP learning
     - Great representation power
     - Wide practical applicability
     - Easy to implement
     - Good generalization power
   - Problems of BP learning
     - Learning often takes a long time to converge
     - The net is essentially a black box
     - Gradient descent only guarantees a local minimum error
     - Not every representable function can be learned
     - Generalization is not guaranteed even if the error is reduced to zero
     - There is no well-founded way to assess the quality of BP learning
     - Network paralysis may occur (learning stops)
     - Learning parameters can only be selected by trial and error
     - BP learning is non-incremental (to include new training samples, the network must be retrained with all old and new samples)
62. Experiments

63. Stock Prediction
   - Stock prediction is a difficult task because stock data is very noisy and time-varying.
   - The efficient market hypothesis claims that the future price of a stock is not predictable from publicly available information.
   - However, this theory has been challenged by many studies, and several researchers have successfully applied machine learning approaches such as neural networks to stock prediction.

64. Is the Market Predictable?
   - Efficient Market Hypothesis (EMH) (Fama, 1965): the stock market is efficient in that current market prices reflect all information available to traders, so future changes cannot be predicted from past prices or publicly available information.
   - Murphy's law: anything that can go wrong will go wrong.
     - Fama et al. (1988) showed that 25% to 40% of the variance in stock returns over periods of three to five years is predictable from past returns.
     - Pesaran and Timmermann (1999) concluded that the UK stock market had been predictable over the past 25 years.
     - Saad (1998) successfully employed different neural network models to predict the short-term trend of various stocks.

65. Optimistic report
66. Implementation
   - This paper investigates SVM, MLP and RBF networks for the task of predicting the future trend of three major stock indices:
     a) Kuala Lumpur Composite Index (KLCI)
     b) Hong Kong Hang Seng Index
     c) Nikkei 225 stock index
     using inputs based on technical indicators.
   - The problem is approached as two-class pattern classification, formulated specifically to assist investors in making trading decisions.
   - The classifier is asked to recognise investment opportunities that can give a return of r% or more within the next h days; here r = 3% and h = 10 days.

67. System Block Diagram
   - The classifier predicts whether an increment of more than 3% in the stock index can be achieved within the next 10-day period.
   (Block diagram: daily historical data → technical analysis indicators → classifier → "Increment achievable? Yes / No")

68. Data Used
   - Kuala Lumpur Stock Index (KLCI) for the period 1992–1997

69. Data Used
   - Hang Seng Index (20/4/1992 – 1/9/1997)

70. Data Used
   - Nikkei 225 stock index (20/4/1982 – 1/9/1987)

71. Input to Classifier
   - Table 1: description of the inputs to the classifier, xi, i = 1, 2, …, 12; n = 15
   - DLN(t) = sign[q(t) − q(t−N)] · ln( q(t)/q(t−N) + 1 )   (1)
   - q(t) is the index level on day t, and DLN(t) is the actual input to the classifier.
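Equation (1) can be sketched directly. The parenthesisation of the log argument follows the slide as printed, which is an assumption about the intended formula, and the index levels below are made up:

```python
import math

def dl(q, t, N):
    """Input transform of Eq. (1), implemented as printed on the slide:
    DL_N(t) = sign[q(t) - q(t-N)] * ln(q(t)/q(t-N) + 1),
    where q is a sequence of daily index levels."""
    change = q[t] - q[t - N]
    sign = (change > 0) - (change < 0)
    return sign * math.log(q[t] / q[t - N] + 1)

# Hypothetical index levels for five consecutive days
q = [1000.0, 1010.0, 1005.0, 1020.0, 1015.0]
x = dl(q, t=4, N=3)   # 3-day lagged input feature at day 4 (positive: index rose)
```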
72. Prediction Formulation
   - Let y(t) be the stock index level on day t, and ymax(t) the maximum upward movement of the stock index value within the period from t to t + h.

73. Prediction Formulation
   - Classification: the prediction of the stock trend is formulated as a two-class classification problem:
     - yr(t) > r%  →  Class 2
     - yr(t) ≤ r%  →  Class 1
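The two-class target above is a one-line thresholding rule (with r = 3% from the experiments):

```python
def label(y_r, r=3.0):
    """Two-class target from the slide: Class 2 if the maximum upward
    move y_r(t) (in percent) exceeds r, else Class 1."""
    return 2 if y_r > r else 1

classes = [label(v) for v in (4.2, 3.0, 1.5)]   # one positive, two negatives
```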
    69. 74. Prediction Formulation <ul><li>Classification </li></ul><ul><li>Let ( x i , y i ) 1<i<N be a set of N training examples, each input example x i  R n n=15 being the dimension of the input space, belongs to a class labelled by y i   +1,-1  . </li></ul>Y i =-1 Y i =+1
    70. 75. Performance Measure <ul><li>True Positive (TP) is the number of positive-class examples correctly predicted as positive. </li></ul><ul><li>False Positive (FP) is the number of negative-class examples wrongly predicted as positive. </li></ul><ul><li>False Negative (FN) is the number of positive-class examples wrongly predicted as negative. </li></ul><ul><li>True Negative (TN) is the number of negative-class examples correctly predicted as negative. </li></ul>
    71. 76. Performance Measure <ul><li>Accuracy = (TP+TN) / (TP+FP+TN+FN) </li></ul><ul><li>Precision = TP / (TP+FP) </li></ul><ul><li>Recall rate (sensitivity) = TP / (TP+FN) </li></ul><ul><li>F1 = 2 * Precision * Recall / (Precision + Recall) </li></ul>
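The four metrics follow directly from the confusion-matrix counts; a quick sketch:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (sensitivity) and F1 from the
    confusion-matrix counts defined on the previous slide."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)    # fraction of predicted positives that are real
    recall = tp / (tp + fn)       # fraction of real positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Precision and recall matter more than raw accuracy here because the positive class (a >3% rise) is the rarer, tradable one.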
    72. 77. Testing Method A rolling window method is used to capture the training and test data: Train = 600 data points, Test = 400 data points. [Diagram: successive train/test windows rolling forward through the time series]
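The rolling window scheme can be sketched as follows. The slides give only the window sizes; the roll step (one full test block per move) is our assumption:

```python
def rolling_windows(n_samples, train=600, test=400):
    """Successive (train, test) index ranges for a rolling-window
    evaluation: fit on `train` consecutive days, test on the next
    `test` days, then roll the whole window forward by `test` days
    (the step size is our assumption, not stated on the slide)."""
    windows = []
    start = 0
    while start + train + test <= n_samples:
        windows.append((range(start, start + train),
                        range(start + train, start + train + test)))
        start += test
    return windows
```

Keeping the test block strictly after the training block avoids look-ahead bias, which is the point of rolling rather than shuffling time-series data.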
    73. 78. Experiment and Result <ul><li>Experiments are conducted to predict the stock trend of three major stock indexes: KLCI, Hangseng and Nikkei. </li></ul><ul><li>SVM, MLP and RBF networks are used to make trend predictions based on both classification and regression approaches. </li></ul><ul><li>A hypothetical trading system is simulated to find the annualised profit generated from the given predictions. </li></ul>
    74. 79. Experiment and Result
    75. 80. Trading Performance <ul><li>A hypothetical trading system is used. </li></ul><ul><li>When a positive prediction is made, one unit of money is invested in a portfolio reflecting the stock index. If the stock index increases by more than r% (r = 3%) within the next h days (h = 10) at day t, the investment is sold at the index price of day t. If not, the investment is sold on day t+1 regardless of the price. A transaction fee of 1% is charged for every transaction made. </li></ul><ul><li>Performance is measured as the annualised rate of return. </li></ul>
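The trading rule above can be simulated in a few lines. The exact forced-exit day and the charging of the fee on both the buy and the sell are our reading of the slide, not confirmed by it:

```python
def simulate(prices, predictions, r=3.0, h=10, fee=0.01):
    """Hypothetical trading system: on a positive prediction at day t,
    invest one unit; sell at the first day in (t, t+h] where the index
    has gained more than r%, otherwise at the end of the h-day window
    (our reading). A 1% fee is charged per transaction (buy and sell)."""
    profit = 0.0
    t = 0
    while t < len(prices) - h:
        if predictions[t] == +1:
            entry = prices[t]
            exit_price = prices[t + h]           # default: forced exit
            for d in range(t + 1, t + h + 1):
                if (prices[d] - entry) / entry * 100.0 > r:
                    exit_price = prices[d]       # target reached early
                    break
            profit += exit_price / entry - 1.0 - 2 * fee
            t += h                               # no overlapping positions
        t += 1
    return profit
```

With the 2% round-trip cost, a correct positive prediction on a move that barely clears 3% nets only about 1%, which is why precision dominates the profitability of the system.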
    76. 81. Trading Performance <ul><li>Classifier Evaluation Using Hypothetical Trading System </li></ul>
    77. 82. Trading Performance
    78. 83. Experiment and Result <ul><li>Classification Result </li></ul>
    79. 84. Experiment and Result <ul><li>The results show better performance of the neural network techniques when compared to the K-nearest-neighbour classifier. SVM shows better overall performance on average than the MLP and RBF networks on most of the performance metrics used. </li></ul>
    80. 85. Experiment and Result <ul><li>Comparison of Receiver Operating Characteristic (ROC) curves </li></ul>
    81. 86. Experiment and Result <ul><li>Area under the ROC curve (AUC) </li></ul>
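AUC figures like the ones on this slide can be computed directly from classifier scores via the rank (Mann-Whitney) statistic; this is a generic sketch, not the authors' code:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank statistic: the probability
    that a randomly chosen positive example scores higher than a
    randomly chosen negative one; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing, so it is the natural baseline against which the SVM, MLP and RBF curves are compared.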
    82. 87. Conclusion <ul><li>We have investigated the SVM, MLP and RBF network as classifiers and regressors to assess their potential in the stock trend prediction task. </li></ul><ul><li>The support vector machine (SVM) has shown better performance than the MLP and RBF networks. </li></ul><ul><li>The SVM classifier with probabilistic output outperforms the MLP and RBF networks in terms of the error-reject tradeoff. </li></ul><ul><li>Both the classification and regression models can be used for a profitable trend prediction system. The classification model has the advantage that a pattern rejection scheme can be incorporated. </li></ul>
    83. 88. This report
    84. 89. Implementation <ul><li>OnlineSVR by Francesco Parrella </li></ul><ul><li>http://onlinesvr.altervista.org/ </li></ul><ul><li>BPN by Karsten Kutza </li></ul><ul><li>http://www.neural-networks-at-your-fingertips.com/ </li></ul>
    85. 90. Results <ul><li>Basically zero correlation between the predictions and the actual outcomes </li></ul><ul><li>Suffered from many technical failures </li></ul><ul><li>We still have faith that these methods, when applied correctly, can predict the future better than a random guess </li></ul><ul><li>We tried many BPN topologies and many input representations for the SVM; it looks like the secret does not lie there </li></ul><ul><li>Future investigation: use wavelet/noiselet coefficients as inputs </li></ul>
    86. 91. References <ul><li>http://www.cs.unimaas.nl/datamining/slides2009/svm_presentation.ppt </li></ul><ul><li>http://merlot.stat.uconn.edu/~lynn/svm.ppt </li></ul><ul><li>http://www.csee.umbc.edu/~ypeng/F09NN/lecture-notes/NN-Ch3.ppt </li></ul><ul><li>Google, Wikipedia and others </li></ul>