## The Hwang dataset

DELVE version created by Radford Neal, April 1997.
Original use by Jenq-Neng Hwang, et al, 1994.

This dataset gives the values of five test functions over a common two-dimensional domain. The values are provided both without noise and with Gaussian noise added. This artificial data is intended for use in testing regression methods.

This dataset is categorized as "historical", which means that it is provided primarily in order to allow comparisons with previously published results.

## Original use of these functions.

These functions were first used for testing in the following paper:

Hwang, J.-N., Lay, S.-R., Maechler, M., Martin, R. D., and Schimert, J. (1994) Regression modeling in back-propagation and projection pursuit learning, IEEE Transactions on Neural Networks, vol. 5, no. 3, pp. 342-353.

The five functions were chosen as being illustrative of different types:

1. simple interaction function
2. radial function
3. harmonic function
4. additive function
5. complicated interaction function

The functions are all real-valued, and are defined on the two-dimensional domain [0,1] X [0,1].

According to the paper by Hwang, et al, each of the functions was defined with a scaling factor incorporated so that the standard deviation of its values over a 2,500-point grid would be one, and with values translated so that they are all non-negative over the domain. (Note: this is part of the function definition; it is not a later normalization.) It does appear to be true that the functions never take on negative values. However, the standard deviation is not one when these functions are evaluated over a 50x50 grid. The standard deviations instead range from 0.9118 to 0.9941 (these values are computed by the gen.c program). Something apparently was wrong with the computations over this grid, or with the subsequent adjustment of the function definitions.

Hwang, et al generated a single set of 225 independent training points, randomly drawn from the uniform distribution over [0,1] X [0,1]. Various methods were applied to learning each function, using a single training set of function values at these 225 points, with or without added noise (Gaussian, with standard deviation 0.25). Note that since only a single training set was used, it is not possible to determine the variability of the results of Hwang, et al with respect to random choice of training set.
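The sampling scheme just described can be sketched as follows. The formula for the simple interaction function (function 1), g1(x1, x2) = 10.391((x1 - 0.4)(x2 - 0.6) + 0.36), is taken from the paper by Hwang, et al; the random seed and the use of NumPy's generator are illustrative choices, not details of the original experiment.

```python
import numpy as np

# Simple interaction function (function 1), as given by Hwang, et al (1994)
def g1(x1, x2):
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

rng = np.random.default_rng(0)  # seed is an arbitrary illustrative choice

# 225 training inputs drawn uniformly from [0,1] x [0,1]
X = rng.uniform(0.0, 1.0, size=(225, 2))

# Noise-free targets, and targets with Gaussian noise of sd 0.25 added
y_clean = g1(X[:, 0], X[:, 1])
y_noisy = y_clean + rng.normal(0.0, 0.25, size=225)
```

Note that Hwang, et al used one fixed training set of this form for all methods, which is why the variability over training sets cannot be assessed from their results.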

Hwang, et al evaluated the performance of their methods in terms of squared-error loss, with respect to the true function values (even when the training data was noisy), evaluated over a grid of 10,000 points. They expressed this squared-error loss in terms of "Fraction of Variance Unexplained", or FVU, which is the squared-error loss divided by the variance of the (noise-free) values over the grid of 10,000 points. As noted above, the functions were supposed to be defined to have a variance of one over a grid of 2,500 points. One would expect the variance over a slightly finer grid to differ from this only slightly, so the FVU should be almost identical to the squared-error loss. However, as noted above, the functions do not actually have a variance of one over the 2,500-point grid (though the variance is fairly close to one), so the normalization to FVU will in fact have a non-negligible effect.

The statistics over the 10,000-point grid (computed by gen.c) for the five functions are as follows:

| Function    | 1      | 2      | 3      | 4      | 5      |
|-------------|--------|--------|--------|--------|--------|
| Mean        | 3.6368 | 2.0868 | 4.2659 | 2.1603 | 2.7033 |
| Variance    | 0.9296 | 0.9878 | 0.8348 | 0.9773 | 0.9877 |
| 1/Variance  | 1.0757 | 1.0123 | 1.1979 | 1.0233 | 1.0124 |
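As a partial check of these figures, the simple interaction function (function 1), given by Hwang, et al as g1(x1, x2) = 10.391((x1 - 0.4)(x2 - 0.6) + 0.36), can be evaluated over a grid. The sketch below assumes the 10,000-point grid consists of the 100x100 cell midpoints (i + 0.5)/100 in each coordinate; the exact grid construction in gen.c may differ, but this assumption reproduces the tabulated mean and variance for function 1, and also confirms that the function is never negative.

```python
import numpy as np

# Simple interaction function (function 1), as given by Hwang, et al (1994)
def g1(x1, x2):
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

# Assumed grid: the 100 x 100 cell midpoints over [0,1] x [0,1]
ticks = (np.arange(100) + 0.5) / 100
x1, x2 = np.meshgrid(ticks, ticks)
vals = g1(x1, x2)

print(f"mean     = {vals.mean():.4f}")  # table gives 3.6368 for function 1
print(f"variance = {vals.var():.4f}")   # table gives 0.9296 for function 1
print(f"min      = {vals.min():.4f}")   # non-negative, as claimed above
```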

To convert the results of Hwang, et al (given as FVU) to plain squared error loss, one should presumably multiply the FVU by the variance shown above, assuming that Hwang, et al computed the variance correctly on the 10,000-point grid (unlike the apparently flawed computations on the 2,500-point grid).
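This conversion (and its inverse) is a single multiplication per function, using the per-function variances from the table above:

```python
# Variances of the five functions over the 10,000-point grid,
# from the table above (computed by gen.c)
VARIANCE = {1: 0.9296, 2: 0.9878, 3: 0.8348, 4: 0.9773, 5: 0.9877}

def fvu_to_squared_error(fvu, function):
    """FVU = squared error / variance, so squared error = FVU * variance."""
    return fvu * VARIANCE[function]

def squared_error_to_fvu(squared_error, function):
    return squared_error / VARIANCE[function]
```

For example, an FVU of 0.1 reported for function 3 corresponds to a plain squared-error loss of 0.1 * 0.8348 = 0.08348.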

The detailed forms of the five functions can be found in the paper by Hwang, et al, or by examining the source of the generation program used to create the DELVE dataset (gen.c). Note that Hwang, et al give two formulas for the "harmonic" function, one in terms of complex numbers, one directly in terms of reals. These two formulas are not, in fact, equivalent. The figure in the paper shows the function defined by the formula in terms of reals, and this is the form used in generating the DELVE dataset. Also note that the plot in the paper for the first function has the scale for x2 reversed in comparison with the plots of the other functions.

## Other uses of these test functions.

These five functions have been used by several other authors to test nonparametric regression methods. These uses include at least the following:
• Cherkassky, V., Gehring, D., and Mulier, F. (1996) Comparison of adaptive methods for function estimation from samples, IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 969-984.
• Denison, D. G. T., Mallick, B. K., and Smith, A. F. M. (1996) Bayesian MARS, preprint.
• Holmes, C. C. and Mallick, B. K. (1997) Bayesian Radial Basis Functions of unknown dimension, preprint.
• Kwok, T.-Y. and Yeung, D.-Y. (1996) Use of bias term in projection pursuit learning improves approximation and convergence properties, IEEE Transactions on Neural Networks, vol. 7, no. 5, pp. 1168-1183.
• Roosen, C. B. and Hastie, T. J. (1994) Automatic smoothing spline projection pursuit, Journal of Computational and Graphical Statistics, vol. 3, pp. 235-248.

Note that each author may do things in a somewhat different way, so care is required when comparing results.

## Usage of these test functions in DELVE.

These functions are incorporated into a single DELVE dataset. Each case in this dataset contains the coordinates of a point randomly drawn from the input domain, the noise-free values of the five functions at that point, and the same values plus independent Gaussian noise of standard deviation 0.25. Ten prototasks are defined: for each of the five functions, one for predicting the noise-free values and one for predicting the noisy values. A hierarchical scheme for test sets is used (with random cases, not a grid of values as Hwang, et al used). Sixteen instances (training and test sets) are provided for each task.

One should note that the scaling and translation incorporated into the definition of the functions may not correspond to a realistic learning scenario. The default DELVE treatment (unless a method overrides it) is to encode the targets in normalized form, which effectively removes the information that the true standard deviation is guaranteed to be (approximately) one and that negative target values are impossible. The dataset specification for this data has been set up to indicate that the targets are non-negative. The constraint on the standard deviation is not expressed in the specification. A learning method could conceivably do better by exploiting this information, but this should probably be regarded as cheating.

In the DELVE environment, the raw squared-error loss is available (from mstats), along with a "standardized" loss, based on the test set statistics. For the tasks with noise-free targets, it appears that the raw squared error loss should be multiplied by 1/Variance from the table above to obtain an FVU value that will be comparable to the results of Hwang, et al. For tasks with noisy targets, the raw squared-error loss that mstats outputs is for prediction of noisy targets. It appears that Hwang, et al tested instead for prediction of the noise-free targets. To convert DELVE results on noisy data for comparison, one should subtract the noise variance of 0.0625 from the raw squared-error loss output by mstats before multiplying by 1/Variance.
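Under this interpretation, converting a raw mstats squared-error loss to an FVU comparable with the figures of Hwang, et al might look as follows; the noise variance 0.25^2 = 0.0625 is subtracted only for the noisy-target tasks, and the 1/Variance values are those from the table above.

```python
NOISE_VARIANCE = 0.25 ** 2  # = 0.0625

# 1/Variance values over the 10,000-point grid, from the table above
INV_VARIANCE = {1: 1.0757, 2: 1.0123, 3: 1.1979, 4: 1.0233, 5: 1.0124}

def delve_loss_to_fvu(raw_squared_error, function, noisy_targets):
    """Convert a raw mstats squared-error loss to an approximate FVU.

    For noisy-target tasks, first subtract the noise variance so that the
    loss refers to prediction of the noise-free function values (as in
    Hwang, et al); then rescale by 1/Variance for the given function.
    """
    se = raw_squared_error
    if noisy_targets:
        se -= NOISE_VARIANCE
    return se * INV_VARIANCE[function]
```

For example, a raw loss of 0.10 on the noise-free task for function 1 gives an FVU of 0.10 * 1.0757 = 0.10757; the same raw loss on the noisy task gives (0.10 - 0.0625) * 1.0757 = 0.04033875.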

Of course, the training sets used by DELVE are not the same as the one used by Hwang, et al (even though they are of the same size, 225 cases), and the DELVE test sets are randomly drawn, rather than being over a fixed grid. These differences limit the ability to make comparisons. Formal significance tests will not be possible, but the results should be comparable as far as the expected values of the losses are concerned.

In addition to the usual std.prior specification for each prototask, a nonoise.prior specification is provided for the prototasks in which the true target is to be predicted. The nonoise.prior specifies as prior information that there is no noise in the target. Of course, this makes no difference unless the learning method being tested is defined to behave differently depending on this prior information.

Last updated: 22 April 1997. Comments and questions to: delve@cs.toronto.edu