4 Neural Networks as Universal Function Approximators
Neural networks are a flexible class of parametric models which form the basis for a wide range of algorithms in machine learning. Their widespread success is due both to the ease and efficiency of 'training' them, that is, optimizing over their parameters using the backpropagation algorithm, and to their ability to approximate any smooth function. Recall that a feed-forward neural network is simply a combination of multiple 'neurons' such that the output of one neuron is fed as the input to another neuron. More precisely, a neuron $\nu_i$ is a computational unit that takes in an input vector $x$ and returns a weighted combination of the inputs, plus the bias associated with the neuron, passed through an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, that is: $\nu_i(x; w_i, b_i) := \sigma(w_i \cdot x + b_i)$. With this notation, we can define a layer $\ell$ of a neural network, $N_\ell$, that takes in a $d_I$-dimensional input and returns a $d_O$-dimensional output as
$$N_\ell(x) := \left(\nu^\ell_1(x), \nu^\ell_2(x), \dots, \nu^\ell_{d_O}(x)\right) = \sigma(W_\ell^\top x + b_\ell), \qquad x \in \mathbb{R}^{d_I},$$
with $W_\ell \in \mathbb{R}^{d_I \times d_O}$ defined as $[w^\ell_1, \dots, w^\ell_{d_O}]$, and $b_\ell \in \mathbb{R}^{d_O}$ defined as $[b^\ell_1, \dots, b^\ell_{d_O}]^\top$.
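The neuron and layer definitions can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the text does not fix the activation function, so the logistic sigmoid is assumed here, and the names `sigmoid`, `neuron`, and `layer` as well as the numeric values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid; an assumed choice for the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation=sigmoid):
    """Single neuron: activation(w . x + b) for an input vector x."""
    return activation(sum(wk * xk for wk, xk in zip(w, x)) + b)


def layer(x, W, b, activation=sigmoid):
    """Layer output sigma(W^T x + b).

    W is a d_I x d_O matrix stored as a list of d_I rows, so column j
    holds the weight vector of neuron j; b has length d_O.
    """
    d_out = len(b)
    columns = [[row[j] for row in W] for j in range(d_out)]
    return [neuron(x, columns[j], b[j], activation) for j in range(d_out)]


# Hypothetical example: 3-dimensional input, 2-dimensional output.
x = [1.0, -2.0, 0.5]
W = [[0.1, 0.4],
     [0.2, -0.3],
     [-0.5, 0.0]]
b = [0.0, 1.0]
out = layer(x, W, b)
print(out)
```

Storing $W$ with one column per neuron mirrors the definition $[w^\ell_1, \dots, w^\ell_{d_O}]$, so applying the layer is the same as evaluating each neuron on the shared input.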
Here $w^\ell_j$ and $b^\ell_j$ refer to the weight vector and the bias associated with neuron $\nu^\ell_j$ in layer $\ell$, and the activation function $\sigma$ is applied pointwise. An $L$-layer (feed-forward) neural network $F_{L\text{-layer}}$ is then defined as a network consisting of network layers $N_1, \dots, N_L$, where the input to layer $i$ (for $i \ge 2$) is the output of layer $i - 1$. By convention, the input to the first layer (layer 1) is the actual input data $x_1, \dots, x_n$.
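The composition of layers described above can be sketched as a simple loop. This is a minimal illustration, assuming a logistic sigmoid activation (the text leaves $\sigma$ unspecified); the helper names `layer` and `feed_forward` and the parameter values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid; an assumed choice for the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(x, W, b, activation=sigmoid):
    """One layer: sigma(W^T x + b), with W stored as d_in rows of d_out entries."""
    return [
        activation(sum(W[k][j] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def feed_forward(x, params, activation=sigmoid):
    """Apply layers 1..L in order; each layer consumes the previous layer's output."""
    h = x
    for W, b in params:
        h = layer(h, W, b, activation)
    return h


# Hypothetical 2-layer network: 3 inputs -> 2 hidden units -> 1 output.
params = [
    ([[0.1, 0.4], [0.2, -0.3], [-0.5, 0.0]], [0.0, 1.0]),
    ([[1.0], [-1.0]], [0.5]),
]
y = feed_forward([1.0, -2.0, 0.5], params)
print(y)
```

The only constraint linking consecutive layers is dimensional: the output size of layer $i - 1$ must equal the input size of layer $i$.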
(ii) Consider a single-layer feed-forward neural network that takes a $d_I$-dimensional input and returns a $d_O$-dimensional output, defined as $F_{1\text{-layer}}(x) := \sigma(W^\top x + b)$ for some $d_I \times d_O$ weight matrix $W$ and a $d_O \times 1$ vector $b$. Given a training dataset $(x_1, y_1), \dots, (x_n, y_n)$, we can define the average error (with respect to the network parameters) of predicting $y_i$ from input example $x_i$ as:
$$E(W, b) := \frac{1}{n} \sum_{i=1}^{n} \left\| F_{1\text{-layer}}(x_i) - y_i \right\|^2$$
(Note: the gradient of $E$ with respect to $W$ and $b$ can be used in a descent-type procedure to minimize this error and learn a good setting of the weight matrix that can predict $y_i$ from $x_i$.)
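To make the descent-type procedure concrete, the sketch below evaluates the average error and runs plain gradient descent on a toy dataset. It is an illustration under several assumptions, not a prescribed method: the activation is taken to be the logistic sigmoid, the gradient is estimated by central finite differences as a stand-in for the analytic gradient, and the step size, iteration count, dataset, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid; an assumed choice for the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def f_one_layer(x, W, b):
    """Single-layer network sigma(W^T x + b); W has d_I rows and d_O columns."""
    return [
        sigmoid(sum(W[k][j] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def average_error(W, b, data):
    """E(W, b) = (1/n) * sum_i ||F(x_i) - y_i||^2."""
    total = 0.0
    for x, y in data:
        pred = f_one_layer(x, W, b)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(data)


def numerical_gradient(W, b, data, eps=1e-6):
    """Central-difference estimate of dE/dW and dE/db (not the analytic gradient)."""
    grad_W = [[0.0] * len(b) for _ in W]
    grad_b = [0.0] * len(b)
    for k in range(len(W)):
        for j in range(len(b)):
            old = W[k][j]
            W[k][j] = old + eps
            e_plus = average_error(W, b, data)
            W[k][j] = old - eps
            e_minus = average_error(W, b, data)
            W[k][j] = old  # restore the perturbed entry
            grad_W[k][j] = (e_plus - e_minus) / (2 * eps)
    for j in range(len(b)):
        old = b[j]
        b[j] = old + eps
        e_plus = average_error(W, b, data)
        b[j] = old - eps
        e_minus = average_error(W, b, data)
        b[j] = old  # restore the perturbed entry
        grad_b[j] = (e_plus - e_minus) / (2 * eps)
    return grad_W, grad_b


def gradient_descent(W, b, data, step=0.5, iters=200):
    """Repeatedly step against the estimated gradient; works on copies of W and b."""
    W = [row[:] for row in W]
    b = b[:]
    for _ in range(iters):
        g_W, g_b = numerical_gradient(W, b, data)
        for k in range(len(W)):
            for j in range(len(b)):
                W[k][j] -= step * g_W[k][j]
        for j in range(len(b)):
            b[j] -= step * g_b[j]
    return W, b


# Hypothetical toy data: the target depends only on the second input coordinate.
data = [
    ([0.0, 0.0], [0.1]),
    ([0.0, 1.0], [0.9]),
    ([1.0, 0.0], [0.1]),
    ([1.0, 1.0], [0.9]),
]
W0 = [[0.0], [0.0]]
b0 = [0.0]
E_before = average_error(W0, b0, data)
W1, b1 = gradient_descent(W0, b0, data)
E_after = average_error(W1, b1, data)
print(E_before, E_after)
```

Finite differences need two error evaluations per parameter, so this approach only suits very small networks; a closed-form gradient computes the same update far more cheaply.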