Non-linear principal component analysis technique using neural networks: learning dependencies in data sets.

Ms. A. Perrino

April 22, 1997

Abstract

In this thesis a nonlinear technique is developed for the analysis of dependencies in data sets. The approach can be summarized as follows. The given data set X is a subset of some N-dimensional Euclidean space $\mathbb{R}^N$. The elements of the data set may represent measurements in some water system, or the results of a model. For this data set a classifier is constructed in the form of a nonlinear distance function $d: \mathbb{R}^N \to [0,1]$ that is zero (or close to zero) for all points in X, and gradually increases to 1 for points at increasing distance from X.

The main steps in the algorithm to determine this classifier involve the construction of a 'virtual' set of data points that are located in the orthogonal complement of X. For each point of the virtual set the distance d to the original data set X is determined. The orthogonal complements of X (if present) can be found by repeated local applications of PCA to subsets of X. The union of the original data set and the virtual set forms an extended data set, and every data point x of this extended data set is assigned a distance $d(x) \in [0,1]$ to X. To find an explicit prescription (and generalization) of the distance function $d(\cdot)$, a multilayer perceptron neural network is applied.
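The construction of the extended data set can be illustrated with a minimal sketch. It is not the thesis implementation: the neighbourhood size `k`, the variance threshold `var_ratio`, the offsets along the low-variance directions, the normalisation `scale`, and the function names are all assumptions made for illustration, and the unit circle stands in for a real data set.

```python
import numpy as np

def local_pca_normals(X, k=10, var_ratio=0.05):
    """For each point, run PCA on its k nearest neighbours and keep the
    directions whose variance is small relative to the largest one.
    These approximate the local orthogonal complement of X."""
    normals = []
    for x in X:
        idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        nbrs = X[idx] - X[idx].mean(axis=0)
        cov = nbrs.T @ nbrs / (k - 1)
        evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
        small = evals < var_ratio * evals[-1]   # hypothetical selection rule
        normals.append(evecs[:, small].T)       # rows are low-variance directions
    return normals

def build_extended_set(X, k=10, offsets=(0.1, 0.2, 0.4), scale=0.4):
    """Return the extended points P and their target distances d in [0, 1]."""
    virtual = []
    for x, dirs in zip(X, local_pca_normals(X, k)):
        for n in dirs:
            for t in offsets:
                virtual.append(x + t * n)       # shift to both sides of X
                virtual.append(x - t * n)
    V = np.array(virtual).reshape(-1, X.shape[1])
    # Distance of every virtual point to the nearest original data point.
    dist = np.linalg.norm(V[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    P = np.vstack([X, V])
    d = np.concatenate([np.zeros(len(X)), np.clip(dist / scale, 0.0, 1.0)])
    return P, d

# Toy data set: 100 points on the unit circle in the plane.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
P, d = build_extended_set(X)
```

In this toy case the low-variance direction found at each point is close to the radial direction, so the virtual points lie just inside and outside the circle. The pairs `(P, d)` are the kind of training set to which a multilayer perceptron could then be fitted.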

As a result, a surface in $\mathbb{R}^N$ is found that provides a best fit for the original data set. This surface is of the form $\{ x \in \mathbb{R}^N \mid F(x) = 0 \}$, i.e. the set of points where the function F vanishes.

The present approach is applied to several analytical tests and to data obtained from a numerical model. It turns out that the technique is capable of modelling non-linear dependencies in data sets. In this way it forms an important extension of linear data analysis techniques such as PCA or regression. Apart from this, the approach has shown its potential for several important practical problems such as data reduction, smoothing, classification and inverse modelling.

The study demonstrated that multilayer perceptron neural networks can be used in data analysis to model non-linear relations among the data samples. For the training set of this neural network, the original data set must be appropriately extended using standard PCA. In this way data reduction to a lower-dimensional space can be performed. The method was used in the analysis of the data sets generated by the SOBEK system in modelling of the river Waal. Several other applications also show the applicability and high potential of the method.
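As a rough illustration of how a multilayer perceptron can learn a distance function with values in [0, 1], the following sketch trains a small network on a toy extended data set. Everything specific here is an assumption and not taken from the thesis: the extended set is generated analytically (radially scaled copies of a circle) rather than by PCA, the network has one tanh hidden layer with a sigmoid output, and the learning rate, layer width, and epoch count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy extended data set: points on the unit circle (target 0) plus radially
# scaled copies whose target grows with the distance to the circle.
theta = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
radii = np.array([0.5, 0.75, 1.0, 1.25, 1.5])
P = np.vstack([r * circle for r in radii])
d = np.repeat(np.clip(np.abs(radii - 1.0) / 0.5, 0.0, 1.0), len(circle))

def init_mlp(n_in, n_hidden, rng):
    """One hidden layer; weights scaled by the fan-in."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, P):
    """Return hidden activations and the sigmoid output in (0, 1)."""
    H = np.tanh(P @ params["W1"] + params["b1"])
    z = H @ params["W2"] + params["b2"]
    return H, (1.0 / (1.0 + np.exp(-z)))[:, 0]

def train(params, P, d, lr=0.2, epochs=3000):
    """Full-batch gradient descent on the mean squared error."""
    losses = []
    n = len(P)
    for _ in range(epochs):
        H, y = forward(params, P)
        err = y - d
        losses.append(float(np.mean(err ** 2)))
        g_out = (2.0 / n) * err * y * (1.0 - y)   # gradient at the output pre-activation
        g_W2 = H.T @ g_out[:, None]
        g_b2 = g_out.sum(keepdims=True)
        g_pre = (g_out[:, None] @ params["W2"].T) * (1.0 - H ** 2)
        g_W1 = P.T @ g_pre
        g_b1 = g_pre.sum(axis=0)
        params["W1"] -= lr * g_W1
        params["b1"] -= lr * g_b1
        params["W2"] -= lr * g_W2
        params["b2"] -= lr * g_b2
    return losses

params = init_mlp(2, 16, rng)
losses = train(params, P, d)
_, pred = forward(params, P)
```

The sigmoid output keeps every prediction inside the unit interval, which matches the range of the distance function. Points where the trained network returns values near zero approximate the fitted surface.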
