Lecture 7

The Multilayer Perceptron

The perceptron can only learn simple problems: it places a hyperplane in pattern space and moves that plane until the error is reduced. Unfortunately, this is only useful if the problem is linearly separable.

A linearly separable problem is one in which the classes can be separated by a single hyperplane.

Many problems are not linearly separable. To solve these we use a Multi-Layer Perceptron (MLP), in which one layer of units feeds into the next. Consider the following problem.

The XOR problem

This is the simplest problem that cannot be solved by a perceptron.

For two inputs x1 and x2, the output is the exclusive OR of the inputs: 1 when exactly one of the inputs is 1, and 0 otherwise.

In pattern space the four input patterns form the corners of a square: (0,0) and (1,1) belong to class 0, while (0,1) and (1,0) belong to class 1.

These classes cannot be separated by a single line; a solution needs two lines, with class 1 lying in the region between them.

A two-layer Multi-Layer Perceptron, with two hidden units feeding a single output unit, can solve this problem.
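
To make this concrete, here is a minimal C sketch (not the lecture's training code, and with hand-chosen rather than learned weights) that solves XOR with step-function units; the two hidden units implement the two separating lines:

#include <stdio.h>

/* Step (threshold) activation. */
static int step(double x) { return x > 0.0 ? 1 : 0; }

int main(void) {
  int x1, x2;
  for (x1 = 0; x1 <= 1; x1++)
    for (x2 = 0; x2 <= 1; x2++) {
      int h1 = step(x1 + x2 - 0.5);  /* fires unless both inputs are 0 */
      int h2 = step(x1 + x2 - 1.5);  /* fires only when both inputs are 1 */
      int y  = step(h1 - h2 - 0.5);  /* fires between the two lines */
      printf("%d XOR %d = %d\n", x1, x2, y);
    }
  return 0;
}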

The shapes of the regions in pattern space that can be separated by a Multi-Layer Perceptron are summarised below.

  Layers   Separable regions
  1        Half-spaces bounded by a single hyperplane
  2        Convex regions
  3        Arbitrary regions

We can see that a three-layer MLP can learn arbitrary regions, while a two-layer MLP can only learn convex regions. (A region is convex if the straight line joining any two points in the region lies entirely inside the region; if some such line passes outside the region, the region is not convex.)

Training the MLP

Last time we saw that the delta rule can be used to train a perceptron. When training the MLP, the error (delta) must be propagated back through the layers; this is called error back-propagation, or just backpropagation.
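
In symbols (matching the code below, where the learning constant $\eta$ = ek is folded into each delta, $o_i$ is a unit's output, and $t_i$ is the target):

$\delta_i = \eta\, o_i (1 - o_i)(t_i - o_i)$  (output layer)

$\delta_i = \eta\, o_i (1 - o_i) \sum_k \delta_k w_{ki}$  (hidden layers; the sum runs over the units k of the next layer)

$\Delta w_{ij} = o_j\, \delta_i$  (where $o_j$ is the output feeding weight $w_{ij}$)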

The following procedure can be used to train a backpropagation network. The units are assumed to use the sigmoid activation function, whose derivative at output o is o(1 - o); this is where the output*(1.0-output) factors in the deltas come from.

layers is the number of layers
units[l] is the number of units in layer l
weights is the number of weights into each unit
t[i] is target output i
input[j] is network input j
n[l][i] is unit i in layer l
n[l][i].output is the output of that unit
n[l][i].delta is the delta of that unit
n[l][i].weight[j] is weight j of that unit
ek is the learning constant
void adapt() {
  int i, j, k, l;

  /* Backward pass: compute a delta for every unit, starting
     at the output layer and working back to the first layer. */
  for (l = layers - 1; l >= 0; l--)
    for (i = 0; i < units[l]; i++)
      if (l == layers - 1)
        /* Output layer: delta = ek * o * (1 - o) * (t - o). */
        n[l][i].delta =
          ek * n[l][i].output *
          (1.0 - n[l][i].output) *
          (t[i] - n[l][i].output);
      else {
        /* Hidden layer: sum the deltas of the units in the
           next layer, weighted by the connecting weights.
           (Note the loop must run over units[l+1], the size
           of the next layer, not units[l].) */
        n[l][i].delta = 0.0;
        for (k = 0; k < units[l + 1]; k++)
          n[l][i].delta +=
            n[l + 1][k].delta *
            n[l + 1][k].weight[i];
        n[l][i].delta = n[l][i].delta *
          ek * n[l][i].output *
          (1.0 - n[l][i].output);
      }

  /* Weight update: each weight changes in proportion to its
     unit's delta and the output of the unit that feeds it. */
  for (l = layers - 1; l >= 1; l--)
    for (i = 0; i < units[l]; i++)
      for (j = 0; j < weights; j++)
        n[l][i].weight[j] +=
          n[l - 1][j].output *
          n[l][i].delta;

  /* The first layer is fed by the network input rather than
     by the outputs of another layer. */
  for (i = 0; i < units[0]; i++)
    for (j = 0; j < weights; j++)
      n[0][i].weight[j] +=
        input[j] * n[0][i].delta;
}
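
Note that adapt() assumes every unit's output field has already been set by a forward pass through the network. A minimal sketch of such a pass, using the same (assumed) data structures and the sigmoid activation implied by the deltas, might look like this:

#include <math.h>

/* Sigmoid activation; its derivative, o * (1 - o), is the
   factor used in the deltas of adapt(). */
double sigmoid(double x) {
  return 1.0 / (1.0 + exp(-x));
}

void forward() {
  int i, j, l;
  double sum;

  for (l = 0; l < layers; l++)
    for (i = 0; i < units[l]; i++) {
      sum = 0.0;
      for (j = 0; j < weights; j++)
        sum += n[l][i].weight[j] *
               (l == 0 ? input[j] : n[l - 1][j].output);
      n[l][i].output = sigmoid(sum);
    }
}
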
When this algorithm is applied to the XOR problem we get the following output. Note how the outputs start near 0.5 and move steadily towards their targets; after roughly 1,000 iterations each output is within about 0.06 of its target.
iteration no 0, inputs 0 1, target 1, output 0.477995 
iteration no 20, inputs 1 0, target 1, output 0.447816 
iteration no 40, inputs 0 0, target 0, output 0.450292 
iteration no 60, inputs 1 0, target 1, output 0.549096 
iteration no 80, inputs 0 0, target 0, output 0.460706 
iteration no 100, inputs 1 0, target 1, output 0.507636 
iteration no 120, inputs 0 1, target 1, output 0.571619 
iteration no 140, inputs 0 0, target 0, output 0.451493 
iteration no 160, inputs 0 1, target 1, output 0.570574 
iteration no 180, inputs 1 0, target 1, output 0.575979 
iteration no 200, inputs 0 1, target 1, output 0.744079 
iteration no 220, inputs 1 1, target 0, output 0.233541 
iteration no 240, inputs 0 1, target 1, output 0.755600 
iteration no 260, inputs 1 1, target 0, output 0.185273 
iteration no 280, inputs 0 1, target 1, output 0.788309 
iteration no 300, inputs 1 1, target 0, output 0.167068 
iteration no 320, inputs 0 0, target 0, output 0.123461 
iteration no 340, inputs 1 1, target 0, output 0.132892 
iteration no 360, inputs 1 1, target 0, output 0.133583 
iteration no 380, inputs 1 1, target 0, output 0.116641 
iteration no 400, inputs 0 0, target 0, output 0.088269 
iteration no 420, inputs 1 0, target 1, output 0.861810 
iteration no 440, inputs 1 1, target 0, output 0.102406 
iteration no 460, inputs 0 0, target 0, output 0.080179 
iteration no 480, inputs 0 0, target 0, output 0.075584 
iteration no 500, inputs 1 0, target 1, output 0.884442 
iteration no 520, inputs 1 0, target 1, output 0.892789 
iteration no 540, inputs 0 1, target 1, output 0.923969 
iteration no 560, inputs 0 0, target 0, output 0.064146 
iteration no 580, inputs 1 1, target 0, output 0.071938 
iteration no 600, inputs 1 1, target 0, output 0.075764 
iteration no 620, inputs 1 1, target 0, output 0.074536 
iteration no 640, inputs 1 1, target 0, output 0.069014 
iteration no 660, inputs 1 1, target 0, output 0.066534 
iteration no 680, inputs 1 0, target 1, output 0.918422 
iteration no 700, inputs 1 0, target 1, output 0.924860 
iteration no 720, inputs 1 1, target 0, output 0.065864 
iteration no 740, inputs 1 1, target 0, output 0.052634 
iteration no 760, inputs 1 0, target 1, output 0.927081 
iteration no 780, inputs 0 0, target 0, output 0.050964 
iteration no 800, inputs 0 1, target 1, output 0.948869 
iteration no 820, inputs 0 0, target 0, output 0.049082 
iteration no 840, inputs 0 0, target 0, output 0.048074 
iteration no 860, inputs 1 1, target 0, output 0.057916 
iteration no 880, inputs 1 1, target 0, output 0.056088 
iteration no 900, inputs 0 1, target 1, output 0.954659 
iteration no 920, inputs 1 1, target 0, output 0.057337 
iteration no 940, inputs 1 0, target 1, output 0.944243 
iteration no 960, inputs 0 0, target 0, output 0.045653 
iteration no 980, inputs 1 0, target 1, output 0.946199

Example: NETtalk

The problem

Convert English text into the vowel and consonant sounds (phonemes) that make up speech. For example, consider the words Albany and Albania, or misled and bedraggled: in each pair, very similar spellings are pronounced quite differently.

Traditional Approach

  1. Create if-then rules to encode all the regularities in the language.
  2. Maintain a database of exceptions to these rules.
  3. Build a production system to resolve conflicts.
For example, a ``c'' can be pronounced either like an ``s'', as in center, icy, and city, or like a ``k'', as in cat and crumb. Here is a rule that might encode this difference:

  if the ``c'' is followed by ``e'', ``i'', or ``y''
  then pronounce it like an ``s''
  else pronounce it like a ``k''

Neural Network Approach

Allow a system to learn how to pronounce words by giving it lots of examples and correcting its mistakes.
 
  1. Choose an architecture.
  2. Choose a representation for the input and output (a sketch of one possible input encoding follows this list).
  3. Determine a training set. 16,000 words were chosen randomly from the 20,000 words in Webster's Dictionary.
  4. Train the system. Give the network each word in the training set several times, indicating the errors for learning.
  5. Test the system. The network was able to respond correctly to 90% of the 4,000 remaining words from the dictionary.
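
As an illustration of step 2, here is a hedged sketch of one possible input encoding, assuming (as in the original NETtalk) a seven-letter window of text presented one-of-N coded, with the network predicting the phoneme of the centre letter. The constants and the char_index() helper are illustrative assumptions, not the lecture's specification:

#include <ctype.h>

#define WINDOW   7    /* letters presented at once */
#define ALPHABET 29   /* 26 letters plus space and two punctuation codes */

/* Hypothetical mapping from a character to an index 0..ALPHABET-1;
   the real NETtalk coding differs in detail. */
static int char_index(char c)
{
  if (isalpha((unsigned char)c))
    return tolower((unsigned char)c) - 'a';   /* 0..25 */
  if (c == ' ')
    return 26;
  return (c == '.') ? 27 : 28;                /* crude punctuation coding */
}

/* One-of-N encode a seven-letter window into WINDOW*ALPHABET inputs;
   the network's output is the phoneme for the centre letter. */
void encode_window(const char *text, double *input)
{
  int w, a;
  for (w = 0; w < WINDOW; w++)
    for (a = 0; a < ALPHABET; a++)
      input[w * ALPHABET + a] = (char_index(text[w]) == a) ? 1.0 : 0.0;
}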

Advantages of the Network Approach
 

Steps in the training of NETtalk