Lecture 5
Pattern Recognition
Classification
Feature Extraction:

Convert a raw pattern into a feature vector.

Reduce redundancy in the pattern.

e.g. convert image to line drawing.
Classification

Assign the feature vector to a class.

Pattern space (or feature space) must be partitioned through training.
Training and Testing

The system is trained using a finite set of patterns, the training set.

If the correct classification for these patterns is known then this is
Supervised Learning, otherwise it is Unsupervised Learning.

The system is evaluated using a different set of patterns, the test
set.
Overtraining and Undertraining.
It is easy to learn the training set perfectly: just memorise every
pattern. Unfortunately this is of little use for classifying the test
set; this is overtraining. Similarly, if the system learns too little
about the training set then it will classify badly; this is undertraining.
The system must neither overgeneralise nor undergeneralise about patterns.
Bayesian Classification

The probability of event A: Pr(A)

The joint probability of A and B: Pr(A, B)

Conditional probability of B given A: Pr(B | A) = Pr(A, B) / Pr(A)

For independent events:
Pr(B | A) = Pr(B)
i.e. Pr(A, B) = Pr(A) Pr(B)
Bayes Rule
Pr(A, B) = Pr(B, A)
Pr(B | A) Pr(A) = Pr(A | B) Pr(B)
so
Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
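Bayes rule can be checked numerically. The sketch below builds a small made-up joint distribution over two binary events (the probabilities are invented for illustration, not from the lecture) and verifies that both sides of the rule agree.

```python
# Numerical check of Bayes rule on an invented joint distribution
# over two binary events A and B.
joint = {
    (True, True): 0.12,
    (True, False): 0.28,
    (False, True): 0.18,
    (False, False): 0.42,
}

pr_A = sum(p for (a, b), p in joint.items() if a)  # Pr(A) = 0.40
pr_B = sum(p for (a, b), p in joint.items() if b)  # Pr(B) = 0.30
pr_B_given_A = joint[(True, True)] / pr_A          # Pr(B | A) = Pr(A, B) / Pr(A)
pr_A_given_B = joint[(True, True)] / pr_B          # Pr(A | B) = Pr(A, B) / Pr(B)

# Bayes rule: Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
lhs = pr_B_given_A
rhs = pr_A_given_B * pr_B / pr_A
print(lhs, rhs)  # the two sides agree
```

With these particular numbers the events also happen to be independent: Pr(A, B) = 0.12 = 0.40 * 0.30 = Pr(A) Pr(B).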
For pattern classification:
if x is the feature vector and w_{i} is class i:

Pr(w_{i}) is the prior probability of the pattern
being in class i.

p(x | w_{i}) is the probability
density function for the feature vector within class i.

p(x) = Sum_{i}( p(x | w_{i}) Pr(w_{i}) )

From Bayes Rule:
Pr(w_{i} | x) = p(x | w_{i}) Pr(w_{i}) / p(x)
where:
Pr(w_{i} | x) is the posterior probability for class i.

The optimal decision is the i for which Pr(w_{i} | x) is largest.
Bayesian theory gives us the optimum decision boundary but in practice
we don't have enough information to calculate this boundary.
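The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own example: the class names, priors, and 1-D Gaussian class-conditional densities below are all invented assumptions.

```python
import math

# Invented example: two classes with 1-D Gaussian class-conditional
# densities p(x | w_i) and prior probabilities Pr(w_i).
classes = {
    # class: (prior Pr(w_i), mean, std dev of p(x | w_i))
    "w1": (0.6, 0.0, 1.0),
    "w2": (0.4, 2.0, 1.0),
}

def gaussian_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and std dev sigma.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    # p(x) = Sum_i p(x | w_i) Pr(w_i)
    px = sum(prior * gaussian_pdf(x, mu, sd) for prior, mu, sd in classes.values())
    # Bayes rule: Pr(w_i | x) = p(x | w_i) Pr(w_i) / p(x)
    return {w: prior * gaussian_pdf(x, mu, sd) / px
            for w, (prior, mu, sd) in classes.items()}

def classify(x):
    post = posterior(x)
    return max(post, key=post.get)  # the i for which Pr(w_i | x) is largest
```

For instance, `classify(-1.0)` returns `"w1"` (near the w1 mean) and `classify(3.0)` returns `"w2"`. In practice the densities and priors would have to be estimated from data, which is exactly the information we usually lack.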
Nearest Neighbour Classification
A pattern is classified as belonging to the class of the training pattern
that is closest to it. To measure closeness use a distance metric.
For a feature vector x = {x_{1}, x_{2}, x_{3}, ..., x_{n}}
and a training pattern t = {t_{1}, t_{2}, t_{3}, ..., t_{n}}

Euclidean distance:
D^{2} = Sum_{i}( (x_{i} - t_{i})^{2} )

Dot Product Distance:
D = Sum_{i}( x_{i} * t_{i} )

Angle between vectors:
cos D = Sum_{i}( x_{i} * t_{i} ) / ( |x| * |t| )
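The three measures, and nearest-neighbour classification using the Euclidean one, can be written out directly for plain Python lists. The small training set in the usage below is invented for illustration.

```python
import math

def euclidean_sq(x, t):
    # D^2 = Sum_i (x_i - t_i)^2
    return sum((xi - ti) ** 2 for xi, ti in zip(x, t))

def dot_product(x, t):
    # D = Sum_i x_i * t_i  (a similarity: larger means more alike)
    return sum(xi * ti for xi, ti in zip(x, t))

def angle(x, t):
    # cos D = Sum_i x_i t_i / (|x| |t|), so D = acos of that ratio
    return math.acos(dot_product(x, t)
                     / (math.sqrt(dot_product(x, x)) * math.sqrt(dot_product(t, t))))

def nearest_neighbour(x, training):
    # training: list of (feature_vector, class_label) pairs.
    # Return the label of the closest training pattern (squared Euclidean
    # distance preserves the ordering, so the square root is unnecessary).
    vec, label = min(training, key=lambda vt: euclidean_sq(x, vt[0]))
    return label
```

For example, with `training = [([0, 0], "a"), ([1, 1], "a"), ([5, 5], "b")]`, the query `nearest_neighbour([4, 4], training)` returns `"b"`.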
Efficiency

Training is trivial: just store all patterns. This may require a lot of
storage.

Classification may be time consuming since all stored patterns must be
compared.
Optimality

Nearest Neighbour is not optimal. Under simple assumptions it can be shown
that its error probability is bounded by twice the optimal Bayesian error.
Nearest neighbour classification is prone to errors due to rogue patterns.
A rogue pattern is a mislabeled pattern.
K Nearest Neighbour
To reduce the problem caused by rogue patterns, use not just the nearest
neighbour but a group of them.
Using K neighbours, take a majority vote of their classes to give a
classification.
Optimality

As K gets larger (with the training set growing accordingly) this method
approaches the optimal Bayesian decision.
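The majority vote is a one-liner with `collections.Counter`. The sketch below uses squared Euclidean distance; the training set, including its deliberately mislabeled "rogue" point, is invented to show why K > 1 helps.

```python
from collections import Counter

def knn_classify(x, training, k=3):
    # training: list of (feature_vector, class_label) pairs.
    def dist_sq(t):
        return sum((xi - ti) ** 2 for xi, ti in zip(x, t))
    # Take the K closest training patterns and let them vote.
    neighbours = sorted(training, key=lambda vt: dist_sq(vt[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]  # majority class among the K nearest

# Invented data: a cluster of "a" patterns containing one rogue pattern
# mislabeled as "b", plus a genuine "b" cluster far away.
training = [([0.0, 0.0], "a"), ([0.2, 0.1], "a"), ([0.1, 0.3], "a"),
            ([0.06, 0.05], "b"),                   # rogue (mislabeled) pattern
            ([5.0, 5.0], "b"), ([5.2, 5.1], "b")]

print(knn_classify([0.05, 0.05], training, k=1))   # "b" -- fooled by the rogue
print(knn_classify([0.05, 0.05], training, k=3))   # "a" -- the rogue is outvoted
```

With K = 1 the rogue pattern alone decides the class; with K = 3 its two genuine neighbours outvote it. Odd values of K avoid ties in the two-class case.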
Speeding up nearest neighbour methods.
The biggest problem with this method is the time it takes to calculate
the distances to the training examples. Possible solutions are:

Only store examples near the decision boundary.

Use a precomputed search tree and branch and bound to search for the nearest
neighbour.