Expert Trading Systems: Modeling Financial Markets with Kernel Regression / Edition 1

by John R. Wolberg

ISBN-10: 0471345083
ISBN-13: 9780471345084
Pub. Date: 01/12/2000
Publisher: Wiley

Hardcover

$89.50

Overview

With the proliferation of computer programs to predict market direction, professional traders and sophisticated individual investors have increasingly turned to mathematical modeling to develop predictive systems. Kernel regression is a popular data modeling technique that can yield useful results fast.

* Provides the data modeling methodology used to develop trading systems
* Shows how to design, test, and measure the significance of results

John R. Wolberg (Haifa, Israel) is a professor of mechanical engineering at the Technion-Israel Institute of Technology. He does research and consulting in data modeling in the financial services area.

Product Details

ISBN-13: 9780471345084
Publisher: Wiley
Publication date: 01/12/2000
Series: Wiley Trading, #89
Pages: 256
Product dimensions: 6.34(w) x 9.47(h) x 0.92(d)

About the Author

JOHN R. WOLBERG, PhD, is a professor of mechanical engineering at the Technion-Israel Institute of Technology in Haifa, Israel. An expert in financial data modeling, he does research and consulting for leading financial institutions, and has worked with some of the pioneers of computerized trading. Dr. Wolberg holds a bachelor's degree in mechanical engineering from Cornell University and a PhD in nuclear engineering from MIT.

Read an Excerpt

Note: The Figures and/or Tables mentioned in this sample chapter do not appear on the Web.

1
INTRODUCTION
1.1 DATA MODELING

The concept of a mathematical model has been with us for thousands of years, going back to the ancient Egyptians and Greeks who were known for their mathematical ability. They used mathematical expressions to quantify physical ideas. For example, we are all familiar with Pythagoras's theorem:

C = sqrt( A^2 + B^2) (1.1)

which is used to compute the length of the hypotenuse of a right triangle once the lengths of the other two sides are known. For this example, C is called the dependent variable and A and B are the independent variables. To generalize the concept of a mathematical model, let us use the notation:

Y = f( X) (1.2)
where Y is the dependent variable and X is the independent variable. For the more general case, both X and Y might be vectors.

The function f might be expressed as a mathematical equation, or it might simply represent a surface that describes the relationship between the dependent and independent variables. Data modeling is a process in which data is used to determine a mathematical model. Although there are many data modeling techniques, all of them can be classified into two very broad categories: parametric and nonparametric methods. Parametric methods are those techniques that start from a known functional form for f( X). Probably the most well-known parametric method is the method of least squares. Most books on numerical analysis include a discussion of least squares, but usually the discussion is limited to linear least squares. The more general nonlinear theory is extremely powerful and is an excellent modeling technique if f( X) has a known functional form. For many problems in science and engineering, f( X) is known or can be postulated based on theoretical considerations. For such cases, the task of the data modeling process is to determine the unknown parameters of f( X) and perhaps some measure of their uncertainties.

As an example of this process, consider the following experiment: the count rate of a radioactive isotope is measured as a function of time. Equation (1.2) for this experiment can be expressed as

Y = A e^(-kx) + B (1.3)

where the dependent variable Y is the count rate (i. e., counts per unit time), A is the amplitude of the count rate originating from the isotope (i. e., counts per unit time at x equal to zero), k is the unknown decay constant, x is the independent variable (which for this experiment is time), and B is the background count rate. This is a very straightforward experiment, and nonlinear least squares can be used to determine the values of A, k, and B that best fit the experimental data. As a further bonus of this modeling technique, uncertainty estimates of the unknown parameters are also determined as part of the analysis.
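To make this concrete, here is a minimal sketch of such a fit in Python using SciPy's nonlinear least squares routine (curve_fit). The synthetic data, starting guesses, and parameter values are invented purely for illustration; only the functional form comes from Equation (1.3).

```python
import numpy as np
from scipy.optimize import curve_fit

def count_rate(x, A, k, B):
    """The model of Eq. (1.3): decaying count rate plus a constant background."""
    return A * np.exp(-k * x) + B

# Synthetic "measurements" for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)                                    # time
y = count_rate(x, A=100.0, k=0.7, B=5.0) + rng.normal(0.0, 2.0, x.size)

# Nonlinear least squares returns the best-fit parameters and their
# covariance matrix, from which uncertainty estimates of A, k, and B follow.
params, cov = curve_fit(count_rate, x, y, p0=[50.0, 1.0, 1.0])
uncertainties = np.sqrt(np.diag(cov))
```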

In some problems, however, f( X) is not known. For example, let's assume we wish to develop a mathematical model that gives the probability of rain tomorrow. We can propose a list of potential predictors (i. e., elements of a vector X) that can be used in the model, but there is really no known functional form for f( X). Currently, a tremendous amount of interest has been generated in weather forecasting, but the general approach is to develop computer models based on data that yield predictions but are not based on simple analytical functional forms for f( X).

Another area in which known functional forms for f( X) are not practical is financial markets. It would be lovely to discover a simple equation to predict the price of gold tomorrow or next week, but so far no one has ever successfully accomplished this task (or if they have, they are not talking about it). Nevertheless, billions of dollars are invested daily based on computer models for predicting movement in the financial markets. The modeling techniques for such problems are typically nonparametric and are sometimes referred to as data-driven methods. Within this broad class, two subclasses can be identified that cover most of the nonparametric analyses being performed today: neural networks and nonparametric regression.

The emphasis in this book is on nonparametric methods and in particular on the nonparametric kernel regression method. One problem with this class of methods is that they can be very computer intensive. Emphasis will therefore be placed on developing very efficient algorithms for data modeling using kernel regression. We live in a complicated nonlinear world, and it is often necessary to use multiple dimensions to develop a model with a reasonable degree of predictive power. Thus our discussion must include the development of multidimensional models. We must also consider how one goes about evaluating a model. Can it be used for predicting values of Y? How good are the predictions? These are the sorts of questions considered in the following chapters.

1.2 THE HILLS OF THE GALILEE PROBLEM

From my office here in Haifa, I can look out the window and see the hills of the Galilee. I've used these hills to pose a problem to students: design a program that estimates height as a function of position for any point in the Galilee. Assume that we are limited to 10,000 sample data points. Let's say we limit the problem to a square that starts from a point in Haifa port and extends 30 kilometers east and 30 kilometers north. Anyone who has seen this part of the globe knows that the Galilee is quite irregular. It includes valleys, small hills, and some larger hills that might even be called mountains.

The first simple-minded approach is to fit a grid over the entire area and distribute the data points evenly throughout this area. Spreading 10,000 points over 900 square kilometers implies a separation distance of 300 meters. This might be reasonable if we were trying to develop a model for elevation in the middle of Kansas, and it might also be reasonable for some of the valleys of the Galilee, but it is certainly not a reasonable mesh size for some of the more mountainous areas. Ideally, we would like to concentrate our points in the hillier regions and use fewer points in the flatter regions. But by doing this we introduce a new level of complexity: how do we estimate height as a function of position for a nonuniform grid?

The uniform grid suffers from a lack of resolution but allows the user an incredibly simple data structure: a two-dimensional matrix of heights. Thus, to find the height at point (X, Y), all we need to do is calculate where this point falls in the matrix. For example, consider the point X = 12342 and Y = 18492 where X is the distance in meters going east from our 0,0 point and Y is the distance in the northward direction. If we denote a point in the matrix as (I, J) and the matrix as H, then (X, Y) is located northeast of point I = 41, J = 61. We can then use simple two-dimensional linear interpolation to determine the height at (X, Y) using the four surrounding points:

HEIGHT( X, Y) = Σ (i=0 to 1) Σ (j=0 to 1) f( I+i, J+j) (1.4)

where

f( n, m) = H( n, m) * W( X, Y, n, m) (1.5)

and the weight terms W( X, Y, n, m) are calculated as follows:

W( X, Y, I, J) = (300 - X + 300 I) * (300 - Y + 300 J) / (300 * 300) (1.6)
W( X, Y, I+1, J) = (X - 300 I) * (300 - Y + 300 J) / (300 * 300) (1.7)
W( X, Y, I, J+1) = (300 - X + 300 I) * (Y - 300 J) / (300 * 300) (1.8)
W( X, Y, I+1, J+1) = (X - 300 I) * (Y - 300 J) / (300 * 300) (1.9)
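The following short sketch implements the uniform-grid lookup and the bilinear interpolation of Equations (1.4) through (1.9). The function name and the zero-filled height matrix are assumptions made only for illustration.

```python
import numpy as np

def height_uniform_grid(H, x, y, spacing=300.0):
    """Bilinear interpolation of Eqs. (1.4)-(1.9) on a uniform grid.

    H[i, j] holds the sampled height at easting i*spacing, northing j*spacing.
    """
    I = int(x // spacing)
    J = int(y // spacing)
    u = (x - spacing * I) / spacing        # fractional position within the cell
    v = (y - spacing * J) / spacing
    # Weights of the four surrounding grid points; they sum to 1.
    weights = {
        (I, J): (1 - u) * (1 - v),
        (I + 1, J): u * (1 - v),
        (I, J + 1): (1 - u) * v,
        (I + 1, J + 1): u * v,
    }
    return sum(H[i, j] * w for (i, j), w in weights.items())

# Example: a grid covering the 30 km by 30 km square at 300 m spacing.
H = np.zeros((101, 101))
print(height_uniform_grid(H, 12342.0, 18492.0))
```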

The problem with this simple-minded approach is illustrated in Figure 1.1. The estimate of the height of the hilltop just northeast of point (I, J) would be clearly underestimated. If the points are chosen such that we concentrate points in the hilliest regions, we introduce a number of problems:

  1. If we use N points to estimate HEIGHT( X, Y), how do we choose the N points?
  2. How many points should we use (i. e., what should be the value of N)?
  3. Once we have found N points, what can we do to estimate the height at X, Y?

The simple data structure used for a uniform grid is no longer applicable. One approach is simply to list the data in an array of length 10,000 and width 3 (i. e., column 1 is X( i), column 2 is Y( i), and column 3 is H( i)). If our choice of N points is the N nearest neighbors to (X, Y), then we will have to run through the entire list to find these nearest points. This would require 10,000 calculations of distance to identify the N nearest neighbors.

We could improve our search using the same 10,000 by 3 matrix by just sorting the data on one of the columns (either the X or Y column). Then to find the N nearest neighbors we could use a binary search to find the region of the matrix in which to concentrate our search. For example, let's say that the value X = 12342 falls between rows 4173 and 4174 of the A matrix (after sorting in the X direction). In mathematical terms, the following inequalities are satisfied:

A( 4173, 1) <= X and A( 4174, 1) > X (1.10)

where column 1 of the A matrix is the X values of each point. Let's say that we use the four closest points for our estimation of HEIGHT( X, Y) (i. e., N = 4); then we need some rule to decide which rows of the matrix should be considered. One simple heuristic is to consider the 10 points below X and the 10 points above X (i. e., rows 4164 to 4183). The distance squared (Dsqr) from row i to the point X, Y is simply:

Dsqr = (X - A( i, 1))^2 + (Y - A( i, 2))^2 (1.11)

We use Dsqr rather than distance to avoid the need to take unnecessary square roots. The 20 values of Dsqr are compared, and the points represented by the smallest four are chosen. It is possible that some or all of the four nearest points to (X, Y) fall outside this range of rows. However, more than likely we will find some, if not all, of the four nearest neighbors using this method. By sorting the data, we reduce the number of values of Dsqr that must be determined from 10,000 to 20, which represents a tremendous saving in compute time (if we plan to use the program to determine elevation at many points).
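A sketch of this search heuristic follows. The function name, the window of 10 rows on each side, and the randomly generated sample points are illustrative assumptions; the essential ideas (binary search on the sorted column, squared distances, a fixed window of 20 candidates) are those described above.

```python
import numpy as np

def nearest_neighbors_sorted(A, x, y, n=4, half_window=10):
    """Heuristic n-nearest-neighbor search on a matrix A sorted by its X column.

    A has columns [X, Y, H].  A binary search locates x in the sorted first
    column, and only half_window rows on each side are examined, so roughly
    20 squared distances are computed instead of 10,000.
    """
    row = int(np.searchsorted(A[:, 0], x))
    lo = max(row - half_window, 0)
    hi = min(row + half_window, A.shape[0])
    window = A[lo:hi]
    dsqr = (x - window[:, 0]) ** 2 + (y - window[:, 1]) ** 2   # Eq. (1.11)
    return window[np.argsort(dsqr)[:n]]                        # the n closest rows found

# Example with 10,000 random sample points over the 30 km by 30 km square.
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 30000.0, size=(10000, 3))
A = A[np.argsort(A[:, 0])]                                     # sort on the X column
four_nearest = nearest_neighbors_sorted(A, 12342.0, 18492.0)
```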

Compared to the uniform grid, this choice of data structure results in a factor of 3 increase in the size of the matrix (i. e., 30,000 numbers instead of only 10,000). We also require an increase in computational time because we must now initiate a search to find the N nearest neighbors. This is the price we must pay to improve the accuracy of the computed values of elevation while limiting the size of the data array to 10,000 points.

Once the N nearest points have been located, the next problem to be addressed is the task of converting the heights of these points into an estimate of HEIGHT (X, Y). A number of approaches can be used to solve this problem, and they are discussed in later chapters. The main purpose of this discussion is to introduce the concept of estimating the value of a dependent variable (in this case HEIGHT) as a function of one or several independent variables (in this case X and Y) when the relationship is complex. Trying to determine one global equation that relates HEIGHT as a function of X and Y is useless. The best one could do is to find separate equations for many subregions in the 30 by 30 kilometer space. Using a parametric technique such as the method of least squares, an equation for HEIGHT as a function of X and Y within each region could be determined. An alternative approach is to use a nonparametric method such as kernel regression. This method is developed and discussed in the following chapters.

1.3 MODELING FINANCIAL MARKETS

The purpose of developing models for financial markets is to end up with a means for making market predictions. One typically attempts to develop models for predicting price changes or market volatility. The hope associated with such efforts is to use the model (or models) as the basis for a computerized trading system. To minimize equity drawdowns, most computerized trading systems use separate models for different markets and perhaps for different trading frequencies. By trading several models simultaneously, the equity decreases due to one model are hopefully balanced by equity increases from other models. Thus one would expect a smoother portfolio equity curve than what one might get by trading a single model. This well-known concept is called diversification and is discussed in most books on finance. 1 Data is fed into the models, and the system issues trading directives. The directives are usually in the form of buy and sell signals.

Using past history, one can simulate the performance of a system based on the issued signals. Thus for each model an equity curve can be generated for the simulated time period. One can look at the equity curves of each model separately and develop strategies for combining the various models into a multimodel trading system with a single combined equity curve. Whether one looks at the individual equity curves or the combined equity curve, measures of performance are required. Obviously, one must look at profitability (for example, annual rate of return), but it should be emphasized that profitability is not the sole measure of the value of a trading system. Typically, one also computes the risk associated with the system and then combines profitability and risk into some measure of performance. There are many definitions of risk. In his Nobel Prize-winning work on Modern Portfolio Theory, Markowitz used the standard deviations of equity changes as the measure of risk associated with an equity curve. 2 Many other definitions of risk are in use, some of which are included in a discussion of measures of performance in Section 1.4.

One danger associated with the modeling process described here is that if enough combinations of models and markets are tried, we will end up with a "successful" combination that only works for the modeling data but is not based on models with real predictive power. For such cases we can expect failure when the models are applied to unseen data. To protect against this possibility, one should test the entire system using unseen data (i. e., data not used in the modeling process). Only if it performs well on this data should one actually start using it to trade real funds.

The task of developing a model for a financial market is quite different from the Hills of the Galilee problem discussed in the previous section. That problem exhibited the following characteristics:

  1. The number of independent variables (i. e., 2) was known.
  2. For every combination of the independent variables (i. e., X and Y), there was one correct value of the dependent variable (i. e., H).
  3. The dependent variable H could be modeled as a function of X and Y, and with enough data points the model could be made to be as accurate as desired.

The modeling of financial markets is quite different in all of these respects.

  1. The number of independent variables required to develop a model with a reasonable degree of predictive power is unknown. Indeed, one does not even know if it is possible to obtain a decent model with the available data.
  2. For a given set of independent variables, there is no guarantee that if the values of the independent variables are the same for two data points, the values of the dependent variable will also be the same. In other words, for each combination of the independent variables there is a range of possible values of the dependent variable.
  3. Regardless of the number of available data points, there is no hope of converging to a model free of error. (In other words, we assume that our final model will contain some noise. Our hope is to obtain a model in which the signal is strong enough that the predictive power of the model will be of some value.)

Financial markets can be characterized as having a low signal-to-noise ratio. In other words, a large fraction of the change in price from one time period to the next appears to be a random shock. In addition, the small signal typically varies in a highly nonlinear manner over the modeling space. Often, however, the random shock is not totally random if one considers other related time series. By bringing in more relevant information (i. e., related time series), a greater fraction of the price changes can be explained. The interesting aspect of financial market modeling is that one does not need to obtain a high degree of predictability to develop a successful trading system. For example, if models could be developed for a dozen different markets and each model could consistently explain 5 percent of the variance in the price changes (see Section 1.4 for a definition of Variance Reduction), a highly profitable trading system could be developed based on these models!

As an example of the modeling process, consider the problem of predicting a future price of gold. What are the independent variables? The best that an analyst can do is to propose a set of candidate predictors. Each of these predictors must be backward looking. In other words, their values must be known at the times the predictions are made. How does one come up with a decent set of candidate predictors? A massive body of research has been devoted to this problem. Many articles and books have been written on this subject for many different financial markets. This book concentrates on how to select a model once a set of candidate predictors has been proposed. The focus of the book is on evaluating the candidate predictors individually and together once they have been specified. However, some comments are made in Chapter 2 regarding the specification of candidate predictors.

The number of candidate predictors available for developing prediction models is limited only by the imagination of the analyst. The primary source of candidate predictors is from the time series being modeled. For the gold model, one would first use the gold price series as a source of candidate predictors. Often a number of predictors might be variations on the same theme. For the gold example, the most obvious choices of candidate predictors are past changes in gold prices. For example, the relative change in the prices of gold over one, two, and three time periods can be selected as the first three candidate predictors:

  1. X1 = 1 - LAG( GOLD, 1)/ GOLD
  2. X2 = 1 - LAG( GOLD, 2)/ GOLD
  3. X3 = 1 - LAG( GOLD, 3)/ GOLD

The variable GOLD represents a time series of gold prices, and the LAG operator returns the series being lagged by the number of records indicated as the second parameter. The next theme might be ratios based on the current price of gold and moving averages of gold prices. For example:

  1. X4 = GOLD/ MA( GOLD, 3) - 1
  2. X5 = GOLD/ MA( GOLD, 10) - 1
  3. X6 = GOLD/ MA( GOLD, 50) - 1

The operator MA is a moving average over the number of time periods indicated by the second parameter. The candidate predictor X4 is the deviation from 1 of the ratio of the current price of gold divided by the moving average of gold over the last three time periods. The candidate predictor X5 is similar to X4 but is based on a longer time period, and X6 is based on the longest time period. Negative values of these three candidate predictors mean that the latest price of gold is less than the three moving averages. Next, we might start considering data from other markets: (e. g., X7 = 1 - LAG( S& P, 1)/ S& P). The variable S& P represents a time series of the S& P price index. When one starts considering all the possible predictors that might influence the future change in the price of gold, the set of candidate predictors can become huge.
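One way these six candidate predictors might be computed is sketched below using pandas, where the LAG operator is taken to correspond to shifting the series and MA to a simple rolling mean. The function name and the assumption that the prices arrive as a pandas Series are illustrative only.

```python
import pandas as pd

def candidate_predictors(gold: pd.Series) -> pd.DataFrame:
    """Candidate predictors X1-X6 derived from a series of gold prices."""
    X = pd.DataFrame(index=gold.index)
    # Relative price changes over one, two, and three periods.
    X["X1"] = 1 - gold.shift(1) / gold
    X["X2"] = 1 - gold.shift(2) / gold
    X["X3"] = 1 - gold.shift(3) / gold
    # Deviation of the current price from its moving averages.
    X["X4"] = gold / gold.rolling(3).mean() - 1
    X["X5"] = gold / gold.rolling(10).mean() - 1
    X["X6"] = gold / gold.rolling(50).mean() - 1
    return X
```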

A reasonable approach to modeling when the number of candidate predictors is large is to consider subspaces. For example, assume we have 100 candidate predictors. We might first consider all 1D (one-dimensional) spaces. We try to find a model based on X1 (i. e., Y = f( X1)), then f( X2), up to f( X100). After all 1D spaces have been considered, we then proceed to 2D spaces. If we examine all 2D combinations (i. e., f( X1, X2) up to f( X99, X100)), then we must consider 100*99/2 = 4950 different 2D spaces. Proceeding to 3D spaces, the number of combinations increases dramatically. For 100 candidate predictors, the total number of 3D spaces is 100*99*98/6 = 161700. Clearly, some sort of strategy must be selected that limits the process to an examination of only the spaces that offer the greatest probability of success.
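These counts can be checked with a few lines of Python; the variable names below are illustrative.

```python
from itertools import combinations
from math import comb

n = 100
assert comb(n, 2) == 4950      # number of 2D spaces
assert comb(n, 3) == 161700    # number of 3D spaces

# Enumerating the 2D spaces themselves (pairs of predictor indices):
spaces_2d = list(combinations(range(1, n + 1), 2))
```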

One question that comes to mind is the upper limit for the dimensionality of the model. How far should the process be continued? Three dimensions? Four dimensions? For a given number of data points, as the dimensionality of the model increases, the sparseness of the data also increases. To discuss sparseness from a more quantitative point of view, let's first define a region in a space as a portion of the space in which the signs of the values of the various X's comprising the space do not change. For example, if we have N data points spread out in a 2D space, then we have an average of N/4 points per region. Let's say we are looking at the 2D space made up of candidate predictors X5 and X17. Assume further that both X5 and X17 have been normalized so that their means are zero. The four regions are:

  1. X5 > 0 and X17 > 0
  2. X5 < 0 and X17 > 0
  3. X5 > 0 and X17 < 0
  4. X5 < 0 and X17 < 0

Now if we spread the same N points throughout a 3D space, we have eight separate regions and the average number of points per region is reduced to N/8. Generalizing this concept to a d-dimensional space, the number of points per region is N/2^d. (In other words, for every increase by one in the dimensionality of the model, the average number of data points per region is halved.) Thus we can conclude that the value of N (the number of available data points for development of the model) can be used to set the maximum dimensionality of the model. All we have to do is to set a minimum value for the average number of data points per region. Let's say we have 1000 points and we want at least an average value of 10 points per region. The maximum dimensionality of the model would thus be determined by setting 1000/2^d to 10, which leads to a value of dmax equal to 6. (Increasing d to 7 leads to an average number of data points per region of 1000/128 < 10.) If the distribution of points in a particular dimension is not normal, then the definition of region must be modified. Nevertheless, the conclusion is the same: With increasing dimensionality, an exponentially increasing N is needed to maintain the data density. It should be mentioned that certain modeling strategies allow a greater number of variables to be included within the model without causing excessive sparseness. For example, the methods of principal components and factor analysis are well-known techniques for addressing this problem. Another approach is to use multistage modeling, a technique discussed in Chapter 4.
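Under these assumptions (regions defined by sign changes and a required minimum average number of points per region), the maximum dimensionality can be computed directly; the helper below is a sketch, not part of the original text.

```python
from math import floor, log2

def max_dimensionality(n_points: int, min_per_region: int = 10) -> int:
    """Largest d such that n_points / 2**d >= min_per_region."""
    return floor(log2(n_points / min_per_region))

print(max_dimensionality(1000))   # 6, since 1000/64 >= 10 but 1000/128 < 10
```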

In summary, the process of modeling financial markets can be described as follows:

  1. Specify a list of candidate predictors (X's) and gather the appropriate data to compute the X's for the time period that is proposed for modeling. It is not necessary to require that all data be in one concurrent time period.
  2. For the same time period (or periods), compute the values of the dependent variable Y (the quantity to be modeled).
  3. Determine a maximum value for the dimensionality of the model.
  4. Specify a criterion (or criteria) for evaluating a particular model space. Some well-known criteria are discussed in the next section.
  5. Specify a strategy for exploring the spaces.
  6. For each space compute the value of the criterion (or criteria) used to select the "best" model (or models).
  7. If there is sufficient data, test the "best" models using this remaining unseen data (i. e., data that has not yet been used in the modeling process).

There are many details that are not included in this list. However, this list covers the general concepts required to model financial markets. Specifics regarding these various stages in the modeling process are considered in later chapters.

1.4 EVALUATING A MODEL

Typically, when a nonparametric method is used for data modeling, a variety of models are proposed and then some sort of procedure must be used that permits selection of the best model. When there are a number of candidate predictors, combinations of the predictors are often examined to see which space is best. In this section several popular definitions of best model are considered.

A number of different strategies may be used to evaluate a model. If sufficient data are available, the usual choice is to use some of the data as learning data (i. e., to generate or train the model) and some of the data for testing. Since one typically looks at many different models in the search for the best model, one question that should be asked is the following: If we finally end up with a good model, is it real, or did it just happen by chance after looking at many potential models (i. e., spaces and parameters)? A strategy used to answer this question is to save yet a third set of data: the evaluation data set. This data is used only after the modeling process has been completed. The final model is applied to this data to see if the model succeeds for unseen data.

What do we mean by the best model? The definition of best is, of course, problem dependent. There are several different well-known definitions, but the modeling process need not be limited to these standard definitions. It is useful, however, to mention some of the most popular modeling criteria. A very useful and popular criterion for data modeling is Variance Reduction (VR). If the purpose of our model is to predict Y, once a model has been proposed, it can be used to predict Y for a series of ntst test data points. In other words, for test point i there is a known value of Y( i) and a value that is calculated using the model: Ycalc( i). VR is computed as follows:

VR = 100 * (1 - Σ (i=1 to ntst) (Y( i) - Ycalc( i))^2 / Σ (i=1 to ntst) (Y( i) - Yavg)^2) (1.12)

In this equation, Yavg is the average value of all the values of Y( i). This equation shows that VR is the percentage of the variance in the data that is explained by the model. A perfect model (i. e., a model that yields values of Ycalc( i) = Y( i) for all test points) has a value of VR equal to 100 percent. A value of VR close to zero means that the model has little or no predictive power.
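A direct implementation of Equation (1.12) might look like the following sketch (the function name is assumed for illustration):

```python
import numpy as np

def variance_reduction(y, ycalc):
    """Variance Reduction of Eq. (1.12), in percent."""
    y = np.asarray(y, dtype=float)
    ycalc = np.asarray(ycalc, dtype=float)
    sse = np.sum((y - ycalc) ** 2)        # squared errors about the model
    sst = np.sum((y - y.mean()) ** 2)     # squared deviations about Yavg
    return 100.0 * (1.0 - sse / sst)
```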

An interesting question is, what can we say about the expected value of VR for a model with absolutely no predictive power? Let's say we have a random time series of X and Y values. We divide the data into learning and test data sets. The values of Yavg for the two sets should be close, but there will be a difference. Any modeling technique that attempts to calculate values of Ycalc for the test data based on the learning set data will be slightly biased due to this slight difference in the means. Thus one can expect a slightly negative VR rather than a value of zero. (This point is treated in greater detail in Appendix B.)

Another criterion often used in data modeling is the root mean square error (RMSE). The computation of RMSE is similar to VR:

RMSE = sqrt( Σ (i=1 to ntst) (Y( i) - Ycalc( i))^2 / ntst) (1.13)

The problem with RMSE is that it has the dimensions of Y, whereas VR is a dimensionless quantity. For example, if two models are compared, one for predicting changes in the price of gold and the other for predicting changes in the price of silver, a comparison of the two values of RMSE will be meaningless. These RMSE values would have to be normalized in some way that would introduce the relative prices of gold and silver. By contrast, a comparison of the two values of VR gives a direct indication regarding the relative worth of the two models.

Both VR and RMSE suffer from the problem of outliers. That is, if the data include some points that are far from the model surface, then these points tend to have a disproportionate effect on the evaluation of the model. To reduce the effect of outliers, a number of "robust" modeling criteria have been proposed. Probably the most popular criterion in this category is the root median square error (RMedSE):

RMedSE^2 = med(( Y( i) - Ycalc( i))^2) (1.14)

where med is the median operator. To turn RMedSE into a dimensionless variable similar to VR, the median variance reduction (MVR) can be defined as follows:

MVR = 100 * (1 - med(( Y( i) - Ycalc( i))^2) / med(( Y( i) - Yavg)^2)) (1.15)

Robust criteria provide a straightforward answer to the problem of outliers, but they add to the computational complexity of the modeling process. The computation of RMedSE and MVR requires a sort of the values of (Y( i) - Ycalc( i))^2. Sorting typically requires time of order N*log2( N). Depending on N (the number of test points), this can be an important factor in determining the compute time required for the modeling process. An additional problem is that the values of Ycalc( i) must be saved in order to compute RMedSE or MVR. In contrast, VR and RMSE can be determined at almost no extra cost, and they do not even require saving the values of Ycalc( i).
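For reference, Equations (1.13) through (1.15) might be implemented as in the following sketch; the function names are illustrative.

```python
import numpy as np

def rmse(y, ycalc):
    """Root mean square error, Eq. (1.13)."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(ycalc)) ** 2))

def rmedse(y, ycalc):
    """Root median square error, Eq. (1.14): less sensitive to outliers."""
    return np.sqrt(np.median((np.asarray(y) - np.asarray(ycalc)) ** 2))

def mvr(y, ycalc):
    """Median variance reduction, Eq. (1.15), in percent."""
    y = np.asarray(y, dtype=float)
    ycalc = np.asarray(ycalc, dtype=float)
    return 100.0 * (1.0 - np.median((y - ycalc) ** 2) / np.median((y - y.mean()) ** 2))
```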

Another measure of the performance of a model is Fraction Same Sign (FSS). The FSS value is the fraction of the ntst predictions in which the signs of Ycalc( i) and Y( i) are the same. If we can effectively predict the sign of Y( i), the development of a useful trading strategy is straightforward. Since the computation of FSS is so simple, it is useful to include this output parameter in any output report regardless of whether or not it is used as the modeling criterion. One problem with FSS as defined above is that values of Ycalc( i) and Y( i) close to zero are treated the same as values far from zero. In Section 6.6 a variation on the definition of FSS is introduced that avoids this shortcoming.
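Under the basic definition given above, FSS reduces to a one-line computation (the function name is assumed):

```python
import numpy as np

def fraction_same_sign(y, ycalc):
    """Fraction of the ntst test points where Ycalc(i) and Y(i) share a sign."""
    return float(np.mean(np.sign(y) == np.sign(ycalc)))
```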

A well-known modeling criterion, correlation coefficient (CC), measures the degree of correlation between Ycalc( i) and Y( i). CC is defined as:

CC = Σ (i=1 to ntst) (Y( i) - Yavg) * (Ycalc( i) - Ycalcavg) / sqrt( Σ (i=1 to ntst) (Y( i) - Yavg)^2 * Σ (i=1 to ntst) (Ycalc( i) - Ycalcavg)^2) (1.16)

where Ycalcavg is the average of the Ycalc( i) values.

To actually compute CC there is no need to use Eq. (1.16). This equation requires two passes through the data: the first pass is required to compute the average values, and the second pass is then required to sum the differences. A direct single pass calculation is performed as follows:

SUMYYC = sum( Y * Ycalc) - sum( Y) * sum( Ycalc)/ ntst (1.17)
SUMY2 = sum( Y^2) - (sum( Y))^2/ ntst (1.18)
SUMYC2 = sum( Ycalc^2) - (sum( Ycalc))^2/ ntst (1.19)
CC = SUMYYC/ sqrt( SUMY2 * SUMYC2) (1.20)

The sum operator is the scalar sum of all terms of the vector argument. The squaring operator on a vector yields a vector output of the same length as the input but with each element squared.
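A sketch of the single-pass calculation of Equations (1.17) through (1.20) follows; the function name is illustrative.

```python
import numpy as np

def correlation_coefficient(y, ycalc):
    """Single-pass correlation coefficient, Eqs. (1.17)-(1.20)."""
    y = np.asarray(y, dtype=float)
    ycalc = np.asarray(ycalc, dtype=float)
    ntst = y.size
    sumyyc = np.sum(y * ycalc) - np.sum(y) * np.sum(ycalc) / ntst   # Eq. (1.17)
    sumy2 = np.sum(y ** 2) - np.sum(y) ** 2 / ntst                  # Eq. (1.18)
    sumyc2 = np.sum(ycalc ** 2) - np.sum(ycalc) ** 2 / ntst         # Eq. (1.19)
    return sumyyc / np.sqrt(sumy2 * sumyc2)                         # Eq. (1.20)
```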

There is an important weakness in using the correlation coefficient as a modeling criterion for financial markets. A high degree of correlation can exist, and yet the predicted values of Y( i) (i. e., Ycalc( i)) might not be particularly useful. Consider, for example, a case in which Ycalc( i) ranges between 1 and 1.2, while the values of Y( i) range between -2 and 2. Even if the correlation coefficient is significant, there is no meaningful interpretation of the resulting values of Ycalc( i). The model is always predicting that the value of Y( i) will be positive, even though the range includes negative values. Correlation can still be used as a modeling criterion if the coefficient is defined as Correlation Coefficient through the Origin (CCO). This parameter assumes a linear relationship between Y( i) and Ycalc( i), but the assumption is that the relationship goes through the origin of the Y - Ycalc plane. CCO is computed as follows:

CCO = sum( Y * Ycalc)/ sqrt( sum( Y^2) * sum( Ycalc^2)) (1.21)

Values of CCO, like the standard correlation coefficient, range between -1 and 1. A value of 1 implies that all values fall exactly on a line in the Y - Ycalc plane that goes through the 0,0 point. Furthermore, positive values of Ycalc( i) correspond to positive values of Y( i), and negative values correspond to negative values. The value of CCO is not related to the slope of the line.
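Equation (1.21) translates directly into code, for example:

```python
import numpy as np

def cco(y, ycalc):
    """Correlation Coefficient through the Origin, Eq. (1.21)."""
    y = np.asarray(y, dtype=float)
    ycalc = np.asarray(ycalc, dtype=float)
    return np.sum(y * ycalc) / np.sqrt(np.sum(y ** 2) * np.sum(ycalc ** 2))
```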

When modeling financial markets, an alternative approach is to use a modeling criterion based on trading performance. Rather than measuring how close the values of Ycalc( i) are to Y( i), the values of Ycalc( i) can be used to generate trades. Using the trades, one can generate an equity curve and then use this curve to evaluate the quality of the predictions. In this manner, a trading-based measure of performance can be obtained that is then used in the selection process needed to choose the best set of candidate predictors. A number of parameters must be specified to describe how the system trades based on the values of Ycalc( i). Examples of such parameters are buy and sell thresholds for entering and exiting a trade. The problem with this approach is the tendency to overfit. By varying the various thresholds through a wide range of values, one can end up with results that are excellent in the testing period but are disappointing when used in production.

This section defines some of the better known measures of performance. It should be clear that many other measures of performance are possible. Section 6.6 considers this subject in greater detail and discusses some measures of particular application to financial market modeling.

1.5 NONPARAMETRIC METHODS

Nonparametric methods of data modeling predate the modern computer age. In the 1920s two of the giants of statistics (Sir R. A. Fisher and E. S. Pearson) debated the value of such methods. 3 Fisher correctly pointed out that a parametric approach is inherently more efficient. Pearson was also correct in stating that if the true relationship between X and Y is unknown, then an erroneous specification in the function f( X) introduces a model bias that might be disastrous.

Hardle includes a number of examples of successful non-parametric models, the most impressive of which is the relationship between change in height (cm/ year) and age of women (Figure 1.2). 3 A previously undetected growth spurt at around age 8 was noted when the data was modeled using a nonparametric smoother. 4 To measure such an effect using parametric techniques, one would have to anticipate this result and include a suitable term in f( X).

Clearly, one can combine nonparametric and parametric modeling techniques. A possible strategy is to use nonparametric methods on an exploratory basis and then use the results to specify a parametric model. However, as the dimensionality of the model and the complexity of the surface increase, the hope of specifying a parametric model becomes more and more remote. For financial market modeling, parametric models are not really feasible. As a result, considerable interest has been shown in applying nonparametric methods to financial market modeling. Recent books on the subject include Bauer (1994), 5 Gately (1996), 6 and Refenes (1995). 7

Neural networks, as a nonparametric modeling tool, are particularly attractive for time series modeling. The basic architecture of a typical neural network is based on interconnected elements called neurons, as shown in Figure 1.3. The input vector may include any number of variables. All the internal elements are interconnected, and the final output for a given combination of the input parameters is a predicted value of Y. Weighting coefficients are associated with the inputs to each element. If a particular interaction has no influence on the model output, the associated weight for the element should be close to zero. As new values of Y become available, they can be fed back into the network to update the weights. Thus the neural network can be adaptive for time series modeling: in other words, the model has the ability to change over time.

One major problem is associated with neural network modeling of financial markets: the huge amount of computer time required to generate a model. If one wishes to use tens or even hundreds of thousands of data records and hundreds of candidate predictors, the required computer time is monumental. To have any hope of success, techniques are required to preprocess the data in order to reduce the number of candidate predictors to a reasonable amount. The definition of reasonable varies, of course, depending on the available computing power. However, regardless of the hardware available, preprocessing strategies are essential to successfully apply neural networks to financial market modeling. Use of kernel regression is an alternative modeling strategy that can be many orders of magnitude faster than more computer-intensive methods such as neural networks. Kernel regression techniques can be used to very rapidly obtain the information-rich subsets of the total candidate predictor space. These subspaces can in turn be used as inputs to a neural network modeling program. For a comparison of neural network and kernel regression modeling, see Appendix D.

1.6 FUNDAMENTAL VERSUS TECHNICAL ANALYSIS

Chapter 1 of Jack Schwager's book Fundamental Analysis is entitled "The Great Fundamental versus Technical Analysis Debate." 8 In that chapter Schwager defines fundamental analysis as analysis involving the use of economic data (e. g., production, consumption, disposable income) to forecast prices. He defines technical analysis as analysis based primarily on the study of patterns in the price data itself (and perhaps volume and open interest data).

The popularity of these differing approaches has evolved over time. In the early 1970s, most serious financial analysts regarded technical analysis with disdain. But as the decade wore on, the huge price trends that developed in the commodities markets were highly favorable for trend-following techniques. Technical analysis was ideally suited to capture these movements, and this approach became the order of the day. By the late 1980s, technical analysis was the dominant approach for making trading decisions. However, nothing lasts forever, and as Schwager noted, "the general market behavior became increasingly erratic." Choppy markets are notoriously unfriendly toward trend followers, and this had a real damping effect on technical analysis. Once again, fundamental analysis became more popular. What seems to have evolved is a tendency to combine fundamental analysis for making longer-term predictions with technical analysis for shorter-term market timing.

The question that one might ask is, Where does the approach discussed in this book fit into the grand scheme of things? Is it technical analysis or fundamental analysis? Clearly, we are looking for patterns, so this might be construed as a technical approach. On the other hand, the analyst is encouraged to use economic data as well as basic price, volume, and open interest data. Once one has the ability to look at hundreds of candidate predictors, there is no need to limit the search for a model to the confines of the very simple price and volume-related indicators. Relevant series such as interest rates and currency exchange rates are fair game for the analyst. Perhaps one should say that the distinction between fundamental and technical analysis becomes moot when one is using a multivariate prediction approach to modeling.

In the last few years, the power of the computer has grown so enormously that the type of analysis proposed in this book has become increasingly cost effective. By using the types of algorithms described in Chapter 4, vast numbers of potential subspaces can be examined in a relatively short period of time. I recently did some consulting work in which I searched for a model using 658 candidate predictors based on 2377 data records. All of the 658 X's were first examined individually. The best 35 (on the basis of Variance Reduction) were then used to form two-dimensional spaces using all other variables. The number of 2D spaces examined was 35* 34/ 2 + 35*( 658 - 35) = 22400 (i. e., all pairs of the best 35 plus all pairs of each of the 35 with the remaining 623). The 50 best 2D spaces were then used to create 3D spaces. A grand total of 55737 spaces were examined in less than two hours using a relatively slow computer (a Pentium 100). The average time per space was about 0.1 second. Another analysis was based on a combined database of 48 stocks over 2041 days (i. e., a total of 48* 2041 = 97968 data records). This data set included 23 X's, and a total of 524 spaces were examined in about 6000 seconds. Even for this huge data set, the average time per space was only about 11 seconds. These rates can already be improved by over a factor of five going to the faster processors on the market today. With symmetric multiprocessor hardware about to become standard, even greater speeds can be expected in the near future.

NOTES

1. See J. C. Francis, Investments, Analysis and Management (New York: McGraw-Hill, 1980).

2. H. Markowitz, "Portfolio Selection," Journal of Finance (March 1952).

3. W. Hardle, Applied Nonparametric Regression (Cambridge, UK: Cambridge University Press, 1990).

4. T. Gasser, H. G. Muller, W. Kohler, L. Molianari, and A. Prader, "Nonparametric Regression Analysis of Growth Curves," Annals of Statistics 12 (1984): 210- 229.

5. R. J. Bauer, Genetic Algorithms and Investment Strategies (New York: John Wiley & Sons, 1994).

6. E. Gately, Neural Networks for Financial Forecasting (New York: John Wiley & Sons, 1996).

7. A. P. Refenes, Neural Networks in the Capital Markets (New York: John Wiley & Sons, 1995).

8. J. Schwager, Fundamental Analysis (New York: John Wiley & Sons, 1995).

Table of Contents

Data Modeling of Time Series.

Kernel Regression.

High-Performance Kernel Regression.

Kernel Regression Software Performance.

Modeling Strategies.

Creating Trading Systems.

Appendices.

Bibliography.

Index.