Artificial Intelligence - Introduction

Can Machines Think?

This is a difficult question and to answer it we need to define intelligence and thinking.

Some aspects of intelligence:

In the 1950's, a test was proposed by Alan Turing to determine whether a machine is intelligent or not - The Turing Test.

The Turing Test

The basis of the Turing test is for a person to attempt to distinguish between a machine and a human being. For fairness, the machine and the human are isolated from the person carrying out the test, and messages are exchanged via a keyboard and screen. If the person cannot distinguish between the computer and the human being, then the computer must be intelligent.  Each year, there is a Turing test contest called the Loebner Prize. This is a transcript from the winner for 1998.

This test is often criticised because it only tests a limited aspect of intelligence.

Some people think that even if a machine could pass the Turing Test it may still not be intelligent.

The Chinese Room Problem

A (non-Chinese-speaking) person is locked in a room and given the algorithm for a program that could pass the Turing test in Chinese. He/she is asked questions in Chinese, applies the algorithm by hand, and feeds the answers back. The room will appear intelligent, but the person inside understands no Chinese, so is there any intelligence present?

This problem is criticised because it may well be possible for the complete system (i.e., the room and the person inside) to be intelligent without the person being intelligent.

Some people say that passing the Turing test is sufficient to prove intelligence but it is not necessary to prove intelligence. In other words, a machine may fail the Turing Test but still be intelligent.

There are plenty of examples of computer systems that perform tasks that would require intelligence if they were performed by a human being.

Types of AI Tasks

One possible classification of AI tasks is into three classes: mundane tasks, formal tasks, and expert tasks.

Mundane Tasks

Formal Tasks

Expert Tasks

Rule-based systems -
         if (conditions) then action
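A minimal sketch of this dispatch idea in Python (the facts and rules here are hypothetical placeholders, not part of any particular expert system):

```python
# A tiny rule-based system: each rule pairs a condition (a predicate
# over the known facts) with an action name.
rules = [
    (lambda facts: "brake_lights_ahead" in facts, "start_braking"),
    (lambda facts: "light_is_green" in facts, "accelerate"),
]

def choose_action(facts):
    """Return the action of the first rule whose condition holds."""
    for condition, action in rules:
        if condition(facts):
            return action
    return "no_op"

print(choose_action({"brake_lights_ahead"}))  # -> start_braking
```

Real expert systems add conflict resolution and chaining, but the if-condition-then-action core is the same.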

AI's Underlying assumption:

The Physical Symbol System Hypothesis. In other words:

Computers (Turing Machines) have the power for general intelligent action.

Example of AI problem solving.

Problem: A farmer has a hungry fox, a fat goose, and a bag of grain. The farmer needs to cross a river, but his boat can only carry two things at a time: the farmer and one item.

Constraints: The fox and the goose cannot be left together. The goose and the grain cannot be left together.

How to cross the river?

English language representation is hard to solve.

Try visual/graphical representation:
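As a sketch of how the resulting state space can be searched mechanically, here is a hypothetical breadth-first search over safe configurations (the state encoding and function names are my own choices, not part of the notes):

```python
from collections import deque

ITEMS = {"fox", "goose", "grain"}

def safe(bank):
    # A bank without the farmer must not pair fox/goose or goose/grain.
    if "farmer" in bank:
        return True
    return not ({"fox", "goose"} <= bank or {"goose", "grain"} <= bank)

def successors(left):
    """States reachable by one boat trip; 'left' is the set on the left bank."""
    everyone = {"farmer"} | ITEMS
    here = left if "farmer" in left else everyone - left
    for cargo in [set()] + [{x} for x in ITEMS & here]:
        moved = {"farmer"} | cargo
        new_left = left - moved if "farmer" in left else left | moved
        if safe(new_left) and safe(everyone - new_left):
            yield frozenset(new_left)

def solve():
    start = frozenset({"farmer"} | ITEMS)   # everyone on the left bank
    goal = frozenset()                      # everyone on the right bank
    frontier, parent = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:                   # reconstruct the path of states
            path = []
            while state is not None:
                path.append(set(state))
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)

print(solve())
```

Breadth-first search returns the shortest safe crossing plan: seven boat trips (goose over, return, fox over, goose back, grain over, return, goose over).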

To solve this problem we need only follow the tree from its root node to any leaf node.

Section 01


AI is one of the newest disciplines, formally initiated in 1956 when the name was coined. However, the study of intelligence is one of the oldest disciplines, going back approximately 2000 years. The advent of computers made it possible for the first time for people to test the models they proposed for learning, reasoning, perceiving, etc.

What is AI?

Figure 1.1

Definitions may be organised into four categories.

Acting humanly

The first proposal for success in building a program that acts humanly was the Turing Test. To be considered intelligent, a program must be able to act sufficiently like a human to fool an interrogator. A human interrogates the program and another human via a terminal simultaneously. If, after a reasonable period, the interrogator cannot tell which is which, the program passes.

To pass this test requires:

This test avoids physical contact and concentrates on "higher level" mental faculties. A total Turing test would require the program to also do:

Thinking Humanly

This requires "getting inside" of the human mind to see how it works and then comparing our computer programs to this. This is what cognitive science attempts to do. Another way to do this is to observe a human problem solving and argue that one's programs go about problem solving in a similar way.

Example: GPS (General Problem Solver) was an early computer program that attempted to model human thinking. The developers were not so much interested in whether or not GPS solved problems correctly. They were more interested in showing that it solved problems like people, going through the same steps and taking around the same amount of time to perform those steps.

Thinking Rationally

Aristotle was one of the first to attempt to codify "thinking". His syllogisms provided patterns of argument structure that always give correct conclusions, given correct premises.

Example: All computers use energy. Using energy always generates heat. Therefore, all computers generate heat.

This initiated the field of logic. Formal logic was developed in the late nineteenth century. This was the first step toward enabling computer programs to reason logically.

By 1965, programs existed that could, given enough time and memory, take a description of the problem in logical notation and find the solution, if one existed. The logicist tradition in AI hopes to build on such programs to create intelligence.

There are two main obstacles to this approach: First, it is difficult to make informal knowledge precise enough to use the logicist approach particularly when there is uncertainty in the knowledge. Second, there is a big difference between being able to solve a problem in principle and doing so in practice.

Acting Rationally: The rational agent approach

Acting rationally means acting so as to achieve one's goals, given one's beliefs. An agent is just something that perceives and acts.

In the logical approach to AI, the emphasis is on correct inferences. This is often part of being a rational agent because one way to act rationally is to reason logically and then act on one's conclusions. But this is not all of rationality, because agents often find themselves in situations where there is no provably correct thing to do, yet they must do something.

There are also ways to act rationally that do not seem to involve inference, e.g., reflex actions.

The study of AI as rational agent design has two advantages:

  1. It is more general than the logical approach because correct inference is only a useful mechanism for achieving rationality, not a necessary one.
  2. It is more amenable to scientific development than approaches based on human behaviour or human thought because a standard of rationality can be defined independent of humans.

Achieving perfect rationality in complex environments is not possible because the computational demands are too high. However, we will study perfect rationality as a starting place.

Foundations of AI


Philosophers staked out most of the important ideas of AI, but to move to a formal science requires a level of mathematical formalism in three main areas: computation, logic and probability.

Mathematicians have proved that there exists an algorithm to prove any true statement in first-order logic. However, if one adds the principle of induction required to capture the semantics of the natural numbers, then this is no longer the case. Specifically, the incompleteness theorem showed that in any language expressive enough to describe the properties of the natural numbers, there are true statements that are undecidable: their truth cannot be established by any algorithm.

Analogously, Turing showed that there are some functions that no Turing machine can compute.

Although undecidability and noncomputability are important in the understanding of computation, the notion of intractability has had much greater impact on computer science and AI. A class of problems is called intractable if the time required to solve instances of the class grows at least exponentially with the size of the instances.

Exponential vs. polynomial. In between is nondeterministic polynomial.

Even moderately sized instances of intractable problem classes cannot be solved in reasonable amounts of time. Therefore, one should strive to divide the overall problem of generating intelligent behaviour into tractable subproblems rather than intractable ones.

Another important concept from mathematics is problem reduction. A reduction is a general transformation from one class of problems to another such that the solutions to the first class can be found by reducing them to problems in the second class and then solving those.

One notion for recognizing intractable problems is that of NP-completeness. NP is the class of problems that can be solved in nondeterministic polynomial time. Any problem class to which an NP-complete problem can be reduced is likely to be intractable.

Probability is the principal mathematical tool that we have to represent and reason about uncertainty. Bayes proposed a rule for updating subjective probabilities in the light of new evidence. This rule forms the basis of the modern approach to uncertain reasoning in AI.

Decision theory combines probability theory with utility theory (which provides a framework for specifying the preferences of an agent) to give a general theory that can distinguish good actions from bad ones.


The principal characteristic of cognitive psychology is that the brain possesses and processes information. The claim is that beliefs, goals, and reasoning steps can be useful components of a theory of human behaviour. The knowledge-based agent has three key steps:

  1. Stimulus is translated into an internal representation
  2. The representation is manipulated by cognitive processes to derive new internal representations
  3. These are translated into actions

Having a theory of how humans successfully process natural language is an AI-complete problem - if we could solve this problem then we would have created a model of intelligence.

Much of the early work in knowledge representation was done in support of programs that attempted natural language understanding.

The History of AI

In the early days of AI there was great optimism that intelligent computers were just a few decades off. However, the problem proved to be far more difficult than anticipated. Today most researchers in AI are smart enough not to make predictions. Also, many are not really concerned with creating intelligence; rather, they are concerned with creating computer programs more intelligent than those that currently exist.

The microworlds approach to AI was pioneered in the 1960's and tried to solve problems in limited domains.

The ANALOGY program could solve geometric analogy problems such as this

The most famous microworld is the Blocks World (1970's). A command such as "Pick up the red block" could be used to manipulate the world.

Figure 1.3

The microworld approach turned out to have problems because the advances made in writing programs for microworlds turned out not to be generalisable.

Early work in the Logicist camp also had problems because of the use of weak methods (these use weak information about a domain). However, knowledge intensive approaches have been more successful. A key development from the Logicist tradition was knowledge-based systems in the 1980's.

In the late 1980's Neural Networks became fashionable again (they had been popular in the 60's) due to improved learning algorithms and faster processors.

AI Time Line

1943 - McCulloch and Pitts propose modelling neurons using on/off devices.
1950's - Claude Shannon and Alan Turing try to write chess playing programs.
1956 - John McCarthy coins the name "Artificial Intelligence".
1960's - Logic Theorist, GPS, microworlds, neural networks.
1971 - NP-Completeness theory (Cook and Karp) casts doubt on the general applicability of AI methods.
1970's - Knowledge based systems and expert systems.
1980's - AI techniques in widespread use, neural networks rediscovered.
1990's - Deep Blue wins against world chess champion. Image and Speech recognition become practical.
  Section 02

Intelligent Agents

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.

Our aim is to design agents.

A rational agent is one that performs the actions that cause the agent to be most successful.

We use the term performance measure for the criteria that determine how successful an agent is. We will insist on an objective performance measure imposed by some authority.

Example Consider the case of an agent that is supposed to vacuum a dirty floor. A plausible performance measure would be amount of dirt cleaned in a certain period of time. A more sophisticated measure would include the amount of electricity consumed and amount of noise generated.

We need to be careful to distinguish between rationality and omniscience. If an agent is omniscient, it knows the actual outcomes of its actions. Rationality is concerned with expected success given what has been perceived. In other words, we cannot blame an agent for not taking into account something it could not perceive or for failing to take an action that it is not capable of taking.

What is rational at any given time depends on four things:

The ideal rational agent:

For each possible percept sequence, an ideal rational agent should do whatever action is expected to maximize its performance measure, on the basis of the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Ideal mapping from percept sequences to actions

For an ideal agent, we can simply make a table of the action it should take in response to each possible percept sequence. (For most agents, this would be an infinite table.) This table is called a mapping from the percept sequences to actions.

Specifying which action an agent ought to take in response to any given percept sequence provides a design for an ideal agent.

It is, of course, possible to specify the mapping for an ideal agent without creating a table for every possible percept sequence.

Example: The sqrt agent

The percept sequence for this agent is a sequence of keystrokes representing a number, and an action is to display a number on a screen. The ideal mapping, when the percept is a positive number x, is to display a positive number z such that z^2 = x. This specification does not require the designer to actually construct a table of square roots.

Algorithms exist that make it possible to encode the ideal sqrt agent very compactly. It turns out that the same is true for much more general agents.
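For example, Newton's method encodes the ideal square-root mapping in a few lines (a sketch; the starting guess and tolerance are arbitrary choices):

```python
def sqrt_agent(x, tolerance=1e-12):
    """Map a positive-number percept x to an action: display z with z*z = x."""
    z = x if x >= 1 else 1.0          # any positive starting guess works
    while abs(z * z - x) > tolerance * x:
        z = (z + x / z) / 2           # Newton's update for f(z) = z*z - x
    return z

print(sqrt_agent(2.0))  # -> approximately 1.4142135623730951
```

A few lines of code thus stand in for an infinite table of percept-to-action pairs.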

One more requirement for agents: Autonomy

If an agent's actions are based completely on built-in knowledge, such that it need pay no attention to its percepts, then we say that the agent lacks autonomy.

An agent's behaviour can depend both on its built-in knowledge and its experience. A system is autonomous if its behaviour is determined by its own experience.

It seems likely that the most successful agents will have some built-in knowledge and will also have the ability to learn.

Structure of Intelligent Agents

Now we start talking about the insides of agents.

The job of AI is to design the agent program: a function that implements the agent mapping from percepts to actions. We assume this program will run on some sort of computing device called the architecture.

The architecture makes the percepts from the sensors available to the agent program, runs the program, and feeds the program's action choices to the effectors as they are generated.

Before we design an agent program, we must have a pretty good idea of the possible percepts and actions, what goals or performance measure the agent is supposed to achieve, and what sort of environment it will operate in.

Example A robot designed to inspect parts as they go by on a conveyer belt can make use of a number of simplifying assumptions: that the lighting will always be the same, that the only thing that will be on the conveyer belt are parts of a certain kind, and there are only two actions: accept and reject.

In contrast, some software agents (softbots) exist in rich unlimited domains, e.g., a robot designed to fly a 747 flight simulator or one designed to scan on-line news sources and show the interesting items to its customers. For the latter to do well, it must be able to process natural language, learn the interests of its customers and must be able to dynamically change its plans as news sources become available or unavailable.

Agent Programs

All agent programs have roughly the same skeleton; they accept percepts from the environment and generate actions.

Each agent uses some internal data structure that is updated as new percepts arrive. These data structures are operated on by the agent's decision-making procedures to generate an action choice, which is then passed to the architecture for execution. Good data structures are often very important in AI.

Agents will receive only single percepts as input. It is up to the agent to build up the percept sequence in memory if it so desires. In some environments it is possible to be quite successful without storing the percept sequence, and in complex domains, it is infeasible to store the complete sequence.

Why not do table lookup?

An agent that can play chess would need a table with about 35^100 entries.

Furthermore, such an agent has no autonomy at all because the calculation of best actions is entirely built-in. If the environment changed in some way, the agent would be entirely lost.

Learning in the context of such a large table is hopeless.

Example: The taxi driver agent

The full taxi driver task is extremely open-ended - there is no limit to the novel situations that can arise.

We start with percepts, actions, and goals

The taxi driver will need to know where it is, what else is on the road and how fast it is going. This information can be obtained from the percepts provided by one or more controllable cameras, a speedometer and odometer. To control the vehicle properly it should have an accelerometer. It will need to know the state of the vehicle, so it will need the usual arrays of engine and electrical sensors. It might also have instruments like a GPS to give exact position with respect to an electronic map. It might also have infrared or sonar sensors. Finally, it will need some way for the customer to communicate destination.

The actions will include control over the engine through the accelerator pedal and control over steering and braking. Some way of talking to passengers and perhaps some way to communicate with other vehicles.

Performance measures?

Getting to the correct destination, minimizing fuel consumption and wear and tear, minimizing trip time and cost, minimizing traffic violations and disturbances of other drivers, maximizing safety and passenger comfort. Some of these goals conflict and so there will be trade-offs.

Operating environment?

City streets? highways? snow and other road hazards? driving on right or left?

The more controlled the environment, the easier the problem.

We will now consider four types of agent programs:

Simple reflex agents

Constructing a lookup table is out of the question. The visual input from a simple camera comes in at the rate of 50 megabytes per second, so the lookup table for an hour of driving would have about 2^(60*60*50M) entries. However, we can summarize certain portions of the table by noting certain commonly occurring input/output associations. For example, if the car in front brakes, then the driver should also brake.

In other words, some processing is done on the visual input to establish the condition "brake lights in front are on", and this triggers some established connection to the action "start braking". Such a connection is called a condition-action rule, written as: if car-in-front-is-braking then initiate-braking.
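A simple reflex agent built from such rules can be sketched as a percept-interpretation step followed by a rule lookup (the conditions, actions, and percept fields are illustrative, not from any real driving system):

```python
def interpret_input(percept):
    """Abstract the raw percept into a condition symbol (hypothetical)."""
    if percept.get("brake_lights_in_front"):
        return "car-in-front-is-braking"
    return "clear-road"

# Condition-action rules: the whole policy is this lookup table.
RULES = {
    "car-in-front-is-braking": "initiate-braking",
    "clear-road": "maintain-speed",
}

def simple_reflex_agent(percept):
    condition = interpret_input(percept)
    return RULES[condition]

print(simple_reflex_agent({"brake_lights_in_front": True}))  # -> initiate-braking
```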

Agents that keep track of the world

Simple reflex agents only work if the correct action can be chosen based only on the current percept.

Even for the simple braking rule above, we need some sort of internal description of the world state. (To determine if the car in front is braking, we would probably need to compare the current image with the previous to see if the brake light has come on.)

Another example

From time to time the driver looks in the rear view mirror to check on the location of nearby vehicles. When the driver is not looking in the mirror, vehicles in the next lane are invisible. However, in order to decide on a lane change requires that the driver know the location of vehicles in the next lane.

This problem illustrates that for any complex domain, sensors do not provide access to the complete world state. In such domains, the agent must maintain an internal state that it updates as new sensor information becomes available.

Updating the state requires the agent to have two kinds of information: First, it needs information about how the world changes over time. Second, it needs information about how its actions affect the world.
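This update step can be sketched as an agent that folds each percept into an internal state (a toy version of the braking example above; all names are illustrative):

```python
class ReflexAgentWithState:
    """Chooses actions from an internal state, not just the current percept."""
    def __init__(self):
        self.state = {"prev_brake_lights": False}

    def update_state(self, percept):
        # Compare the current percept with the remembered state: braking
        # is detected when the lights change from off to on.
        braking_started = (percept["brake_lights"]
                           and not self.state["prev_brake_lights"])
        self.state["prev_brake_lights"] = percept["brake_lights"]
        return braking_started

    def act(self, percept):
        if self.update_state(percept):
            return "initiate-braking"
        return "maintain-speed"

agent = ReflexAgentWithState()
print(agent.act({"brake_lights": True}))   # lights just came on -> initiate-braking
print(agent.act({"brake_lights": True}))   # still on, already noted -> maintain-speed
```

Note that the same percept produces different actions depending on the stored state, which a simple reflex agent cannot do.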

Goal-based agents

Knowing about the state of the world is not always enough for the agent to know what to do next. For example, at an intersection, the taxi driver can either turn left, right, or go straight. Which turn it should make depends on where it is trying to get to: its goal.

Goal information describes states that are desirable and that the agent should try to achieve.

The agent can combine goal information with information about what its actions achieve to plan sequences of actions that achieve those goals.

Search and Planning are the subfields of AI devoted to finding action sequences that achieve goals.

Decision making of this kind is fundamentally different from condition-action rules, in that it involves consideration of the future. In the reflex agent design this information is not used because the designer has precomputed the correct action for the various cases. A goal-based agent could reason that if the car in front has its brake lights on, it will slow down. From the way in which the world evolves, the only action that will achieve the goal of not hitting the braking car is to slow down. To do so requires hitting the brakes. The goal-based agent is more flexible but takes longer to decide what to do.

Utility-based agents

Goals alone are not enough to generate high-quality behaviour. For example, there are many action sequences that will get the taxi to its destination, but some are quicker, safer, more reliable, cheaper, etc.

Goals just provide a crude distinction between "happy" and "unhappy" states whereas a more general performance measure should allow a comparison of different world states. "Happiness" of an agent is called utility.

Utility can be represented as a function that maps states into real numbers. The larger the number the higher the utility of the state.

A complete specification of the utility function allows rational decisions in two kinds of cases where goals have trouble. First, when there are conflicting goals, only some of which can be achieved (e.g., speed vs. safety), the utility function specifies the appropriate trade-off. Second, when there are several goals that the agent can aim for, none of which can be achieved with certainty, utility provides a way in which the likelihood of success can be weighed up against the importance of the goals.
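As a toy sketch, a taxi utility function might map states to real numbers like this (the weights and state fields are invented purely for illustration):

```python
def taxi_utility(state):
    """Map a world state to a real number; larger means better."""
    u = 0.0
    u += 100.0 if state["at_destination"] else 0.0
    u -= 2.0 * state["trip_minutes"]         # prefer shorter trips
    u -= 10.0 * state["traffic_violations"]  # strongly prefer legality
    return u

fast_but_illegal = {"at_destination": True, "trip_minutes": 10, "traffic_violations": 2}
slow_but_legal   = {"at_destination": True, "trip_minutes": 15, "traffic_violations": 0}

# The weights encode the trade-off: here legality outweighs five minutes saved.
print(taxi_utility(slow_but_legal) > taxi_utility(fast_but_illegal))  # -> True
```

Both states satisfy the goal "reach the destination", yet the utility function still distinguishes them, which is exactly what goals alone cannot do.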


Now we look at how agents couple with the environment.

Properties of environments

Environment Programs

The environment simulator captures the basic relationship between agents and environments: it takes one or more agents as input and arranges to repeatedly give each agent the right percepts and receive back an action. The simulator then updates the environment based on the actions and possibly other dynamic processes in the environment that are not considered to be agents.

The environment is therefore defined by an initial state and an update function. Obviously, an agent that works in a simulated environment ought also to work in the real thing.

Run-Eval-Environment returns the performance measure for a single environment defined by a single initial state and a particular update function. Usually, an agent is designed to work in an environment class, a whole set of different environments. So, in order to measure the performance of an agent, we need an environment generator that selects particular environments in which to run the agent. We are then interested in average performance.
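A minimal sketch of such a Run-Eval-Environment loop (the signature and the toy counter environment are assumptions for illustration, not the textbook's code):

```python
def run_eval_environment(state, update_fn, agents, percept_fn, performance_fn, steps):
    """Simulate 'agents' in an environment for a fixed number of steps
    and return the performance measure of the final state."""
    for _ in range(steps):
        # Give each agent its percept and collect the chosen actions.
        actions = [agent(percept_fn(state, i)) for i, agent in enumerate(agents)]
        # The update function applies the actions (and any other dynamics).
        state = update_fn(state, actions)
    return performance_fn(state)

# Toy usage: one agent pushes a counter toward 10 and then stops.
agent = lambda percept: 1 if percept < 10 else 0
score = run_eval_environment(
    state=0,
    update_fn=lambda s, acts: s + sum(acts),
    agents=[agent],
    percept_fn=lambda s, i: s,
    performance_fn=lambda s: s,
    steps=20,
)
print(score)  # -> 10
```

Note that the environment's `state` lives entirely inside the simulator; any state the agent keeps is its own, which is the separation warned about below.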

A possible confusion arises between the state variable in the environment simulator and the state variable in the agent itself. These must be kept strictly separate!

Section 03

Solving Problems by Searching

Simple reflex agents are limited because they cannot plan ahead. They also have no knowledge of what their actions do nor of what they are trying to achieve.
Now we investigate one type of goal-based agent called a problem-solving agent. This type of agent decides what to do by finding sequences of actions that lead to desirable states.
How can an agent formulate an appropriate view of the problem it faces?

Problem-Solving Agents
Intelligent agents are supposed to act in such a way that the environment goes through a sequence of states that maximizes the performance measure.
Unfortunately, this specification is difficult to translate into a successful agent design. The task is simplified if the agent can adopt a goal and aim to satisfy it.

Example Suppose the agent is in Auckland and wishes to get to Wellington. There are a number of factors to consider, e.g., cost, speed, and comfort of the journey.

Goals such as this help to organize behaviour by limiting the objectives that the agent is trying to achieve. Goal formulation, based on the current situation, is the first step in problem solving. In addition to formulating a goal, the agent may wish to decide on some other factors that affect the desirability of different ways of achieving the goal.

We will consider a goal to be a set of states - just those states in which the goal is satisfied.
Actions can be viewed as causing transitions between states.
How can the agent decide on what types of actions to consider?

Problem formulation is the process of deciding what actions and states to consider. For now let us assume that the agent will consider actions at the level of driving from one city to another.

The states will then correspond to being in particular towns along the way.

The agent has now adopted the goal of getting to Wellington, so unless it is already there, it must transform the current state into the desired one. Suppose that there are three roads leaving Auckland but that none of them lead directly to Wellington. What should the agent do? If it does not know the geography it can do no better than to pick one of the roads at random.

However, suppose the agent has a map of the area. The purpose of a map is to provide the agent with information about the states it might get itself into and the actions it can take. The agent can use the map to consider subsequent steps of a hypothetical journey that will eventually reach its goal.

In general, an agent with several intermediate options of unknown value can decide what to do by first examining different possible sequences of actions that lead to states of known value and then choosing the best one.

This process is called search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out. This is called the execution phase. Hence, we have a simple "formulate, search, execute" design for the agent.

Formulating Problems
Formulating problems is an art. First, we look at the different amounts of knowledge that an agent can have concerning its actions and the state that it is in. This depends on how the agent is connected to its environment.

There are four essentially different types of problems.

Knowledge and problem types
Let us consider the vacuum world - we need to clean the world using a vacuum cleaner. For the moment we will simplify it even further and suppose that the world has just two locations. In this case there are eight possible world states. There are three possible actions: left, right, and suck. The goal is to clean up all the dirt, i.e., the goal is equivalent to the set of states {7,8}.

First, suppose that the agent's sensors give it enough information to tell exactly which state it is in (i.e., the world is completely accessible) and suppose it knows exactly what each of its actions does.

Then it can calculate exactly what state it will be in after any sequence of actions. For example, if the initial state is 5, then it can calculate the result of the action sequence {right, suck}.

This simplest case is called a single-state problem.

Now suppose the agent knows all of the effects of its actions, but world access is limited. For example, in the extreme case, the agent has no sensors so it knows only that its initial state is one of the set {1,2,3,4,5,6,7,8}. In this simple world, the agent can succeed even though it has no sensors. It knows that the action {right} will cause it to be in one of the states {2,4,6,8}. In fact, it can discover that the sequence {right, suck, left, suck} is guaranteed to reach a goal state no matter what the initial state is.

In this case, when the world is not fully accessible, the agent must reason about sets of states that it might get to, rather than single states. This is called the multiple-state problem.
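This belief-state reasoning can be checked directly. Here states are encoded as (location, dirt-left, dirt-right) triples; the 1-8 numbering above is just one labelling of these eight triples:

```python
from itertools import product

# A state is (loc, dirt_left, dirt_right); loc is "L" or "R".
ALL_STATES = list(product("LR", [True, False], [True, False]))

def result(state, action):
    loc, dl, dr = state
    if action == "left":
        return ("L", dl, dr)
    if action == "right":
        return ("R", dl, dr)
    if action == "suck":  # deterministic suck: cleans the current square
        return (loc, False, dr) if loc == "L" else (loc, dl, False)

def update_belief(belief, action):
    """Propagate a set of possible states through one action."""
    return {result(s, action) for s in belief}

belief = set(ALL_STATES)   # sensorless: could be in any of the 8 states
for action in ["right", "suck", "left", "suck"]:
    belief = update_belief(belief, action)

# Every remaining possible state is clean: the sequence is guaranteed to work.
print(all(not dl and not dr for _, dl, dr in belief))  # -> True
```

After the four actions the belief set has collapsed to a single clean state, confirming the guarantee without the agent ever sensing anything.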

The case of ignorance about the effects of actions can be treated similarly.

Example Suppose the suck action sometimes deposits dirt when there is none. Then if the agent is in state 4, sucking will place it in one of {2,4}. There is still a sequence of actions that is guaranteed to reach the goal state.

However, sometimes ignorance prevents the agent from finding a guaranteed solution sequence. Suppose that the agent has the nondeterministic suck action as above and that it has a position sensor and a local dirt sensor. Suppose the agent is in one of the states {1,3}. The agent might formulate the action sequence {suck, right, suck}. The first action would change the state to one of {5,7}; moving right would then change the state to {6,8}. If the agent is, in fact, in state 6, the plan will succeed, otherwise it will fail. It turns out there is no fixed action sequence that guarantees a solution to this problem.

The agent can solve the problem if it can perform sensing actions during execution. For example, starting from one of {1,3}: first suck dirt, then move right, then suck only if there is dirt there. In this case the agent must calculate a whole tree of actions rather than a single sequence, i.e., plans now have conditionals in them that are based on the results of sensing actions.

For this reason, we call this a contingency problem. Many problems in the real world are contingency problems. This is why most people keep their eyes open when walking and driving around.

Single-state and multiple-state problems can be handled by similar search techniques. Contingency problems require more complex algorithms. They also lend themselves to an agent design in which planning and execution are interleaved.

We will only consider cases where guaranteed solutions consist of a single action sequence.

The last type of problem is the exploration problem. In this type of problem, the agent has no information about the effects of its actions. The agent must experiment, gradually discovering what its actions do and what sorts of states exist. This is search in the real world rather than in a model.

Well-defined problems and solutions

We have seen that the basic elements of a problem definition are the states and actions. To capture these ideas more precisely, we need the following:

Together these define the state space of the problem: the set of all states reachable from the initial state by any sequence of actions.

A path in the state space is simply any sequence of actions leading from one state to another.

It may also be the case that one solution is preferable to another, even though they both reach the goal. To capture this idea, we use the notion of path cost. Together the initial state, operator set, goal test, and path cost function define a problem.
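These four components can be written down directly (a sketch; the route-finding data and function names are invented for illustration):

```python
class Problem:
    """A problem: initial state, operators, goal test, and path cost."""
    def __init__(self, initial_state, successors, goal_states, step_cost):
        self.initial_state = initial_state
        self.successors = successors   # state -> iterable of (action, next_state)
        self.goal_states = goal_states
        self.step_cost = step_cost     # (state, action, next_state) -> number

    def goal_test(self, state):
        return state in self.goal_states

    def path_cost(self, path):
        """Sum of step costs along a path of (state, action, state) triples."""
        return sum(self.step_cost(s, a, s2) for s, a, s2 in path)

# Illustrative route-finding instance (the road map is made up).
roads = {"Auckland": [("drive", "Hamilton")],
         "Hamilton": [("drive", "Wellington")]}
p = Problem("Auckland", lambda s: roads.get(s, []), {"Wellington"},
            lambda s, a, s2: 1)
print(p.goal_test("Wellington"))  # -> True
```

A search algorithm then needs nothing beyond this interface: it expands states with `successors`, stops on `goal_test`, and compares solutions with `path_cost`.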

To deal with multiple-state problems, a problem consists of an initial state set; a set of operators specifying for each action the set of states reached from any given state; and a goal test and path cost function. The state space is replaced by the state set space.
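As a concrete illustration, the four components of a single-state problem can be bundled into a small data structure. This is only a sketch in Python; all names here are illustrative, not from the text.

```python
# A minimal problem definition: initial state, operators (given here as a
# successor function), goal test, and path cost. Names are illustrative.
class Problem:
    def __init__(self, initial, successors, is_goal, step_cost):
        self.initial = initial          # the starting state
        self.successors = successors    # state -> [(action, next_state)]
        self.is_goal = is_goal          # state -> bool
        self.step_cost = step_cost      # (state, action, next_state) -> number

# Toy example: find a path from 0 to 4 on the number line, moving +1 or -1.
line = Problem(
    initial=0,
    successors=lambda s: [("right", s + 1), ("left", s - 1)],
    is_goal=lambda s: s == 4,
    step_cost=lambda s, a, s2: 1,
)
print(line.is_goal(4))  # True
```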

Measuring problem-solving performance
The effectiveness of a search can be measured in at least three ways:

The total cost of a search is the sum of the path cost and search cost.
What makes problem solving an art is deciding what goes into the description of states and operators and what should be left out. The process of removing detail is called abstraction. Without it, intelligent agents would be swamped.

Example Problems

We will talk both about toy problems and real-world problems. Toy problems are used to illustrate and exercise various techniques. Real-world problems are usually much harder and we usually care about the solutions.

Toy problems
The 8-puzzle problem

One important trick is to notice that rather than use operators such as "move the 3 tile into the blank space," it is more sensible to have operators such as "the blank space changes places with the tile to its left." This is because there are fewer of the latter kind of operator. This leads to the following formulation:

The 8-queens problem
The goal of this problem is to place 8 queens on the board so that none can attack the others. The following is not a solution!

 There are two main kinds of formulations. The incremental formulation involves placing queens one by one on the board. The complete-state formulation starts with all 8 queens on the board and moves them around until a solution is found. In the incremental formulation, with any of the 64 squares available for each queen, we have 64^8 possible sequences to investigate. A more sensible choice would use the fact that placing a queen where it is already under attack cannot work, because subsequent placings will not undo the attack. So, try the following instead: It is easy to see that the actions given can generate only states with no attacks; but sometimes no actions will be possible. For example, after making the first seven choices (left to right) in the above figure, there is no action available. However, in this formulation there are only 2057 possible sequences to investigate.
The right formulation makes a big difference to the size of the search space.

Vacuum World

State Space with sensors

State space with no sensors (self loops omitted) - initial state is centre-top

Real-world problems

Route finding, e.g., find a route from Auckland to Wellington.

The Travelling Salesperson Problem is a famous touring problem in which each city in a network of connected cities must be visited exactly once. The goal is to find the shortest tour. This problem is NP-hard.

VLSI layout - a VLSI chip can have millions of gates, and the positioning and connections of every gate are crucial to the successful operation of the chip. CAD tools are used in every phase of the process. Two difficult subtasks are cell layout and channel routing. The goal in cell layout is to place every cell on the chip with no overlap and enough room in between to place connecting wires. There are additional constraints, such as placing particular cells near each other. Channel routing is the process of placing the connecting wires.

Robot navigation - Guide a robot about Mars.

Assembly sequencing - Robot assembly of complex objects.

Searching for Solutions

The idea behind most search techniques is to maintain and extend a set of partial solution paths.

Generating action sequences
To solve the route planning example problem, the first step is to test if the current state is the goal state. If not, we must consider some other states. This is done by applying the operators to the current state, thereby generating a new set of states. This process is called expanding the state. Whenever there are multiple possibilities, a choice must be made about which one to consider further.

This is the essence of search - choosing one option and putting the others aside for later, in case the first choice does not lead to a solution. We continue choosing, testing, and expanding until a solution is found or until there are no more states that can be expanded. The choice of which state to expand first is determined by the search strategy.

It is helpful to think of the process as building a search tree that is superimposed over the state space. The root of the tree is a search node corresponding to the initial state. The leaves of the tree are nodes that do not have successors either because they are goal states or because no operators can be applied to them.

The general search algorithm is described as follows:

Data structures for search trees
We will assume a search tree node is a five component structure with the following components:

Remember that a node is a bookkeeping data structure used to represent a search tree for a particular problem instance. This is what makes a node different from a state.
Nodes have depths and parents, whereas states do not.

The EXPAND function is responsible for calculating each of the five components above for the nodes that it generates.
We also need to represent the collection of nodes that are waiting to be expanded - this collection is called the fringe or frontier.
For efficiency, the collection of nodes on the fringe will be implemented as a queue using the following operations:

Different varieties of queuing functions produce different search algorithms.

Now we can write a more formal version of the general search algorithm.
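A minimal sketch of this formulation in Python, with the queuing function passed in as a parameter. All names are illustrative, not from the text.

```python
from collections import deque

# A search-tree node: a state plus the bookkeeping that distinguishes a
# node from a state (parent, action, path cost, depth).
class Node:
    def __init__(self, state, parent=None, action=None, path_cost=0):
        self.state, self.parent, self.action = state, parent, action
        self.path_cost = path_cost
        self.depth = 0 if parent is None else parent.depth + 1

# EXPAND computes the components of each successor node.
def expand(node, successors, step_cost):
    return [Node(s2, node, a, node.path_cost + step_cost(node.state, a, s2))
            for a, s2 in successors(node.state)]

# GENERAL-SEARCH: the queuing function decides the search strategy.
def general_search(initial, successors, is_goal, step_cost, enqueue):
    fringe = deque([Node(initial)])
    while fringe:
        node = fringe.popleft()
        if is_goal(node.state):
            return node
        enqueue(fringe, expand(node, successors, step_cost))
    return None
```

Passing a queuing function that appends new nodes at the end of the fringe yields breadth-first search; putting them at the front yields depth-first search.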

Search Strategies

We will evaluate search strategies in terms of four criteria: completeness, time complexity, space complexity, and optimality. Uninformed (or blind) search strategies have no information about the number of steps or the path cost from the current state to the goal.
In a route finding problem, given several choices of cities to go to next, uninformed search strategies have no way to prefer any particular choices.
Informed (or heuristic) search strategies use considerations to prefer choices. For example, in the route finding problem with a map, if a choice is in the direction of the goal city, prefer it.

Even though uninformed search is less effective than heuristic search, uninformed search is still important because there are many problems for which information used to make informed choices is not available.

We now present six uninformed search strategies:

Breadth-first search
In this strategy, the root node is expanded first, then all of the nodes generated by the root are expanded before any of their successors. Then these successors are all expanded before any of their successors.

In general, all the nodes at depth d in the search tree are expanded before any nodes at d+1. This kind of search can be implemented by using the queuing function ENQUEUE-AT-END.
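A sketch of breadth-first search in Python, with ENQUEUE-AT-END realised as appending to the right of a double-ended queue. The toy state space is made up for illustration.

```python
from collections import deque

# Breadth-first search: new nodes go to the END of the queue, so all
# nodes at depth d are expanded before any node at depth d+1.
def bfs(start, goal, neighbours):
    frontier = deque([(start, [start])])     # (state, path so far)
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for s2 in neighbours(state):
            frontier.append((s2, path + [s2]))   # ENQUEUE-AT-END
    return None

# Toy example: move +1 or -1 along the integers 0..9.
nbrs = lambda s: [s2 for s2 in (s - 1, s + 1) if 0 <= s2 <= 9]
print(bfs(2, 5, nbrs))  # [2, 3, 4, 5]
```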

Breadth-first search is complete and optimal provided the path cost is a nondecreasing function of the depth of the node. Let's look at the amount of time and memory it requires. Consider a hypothetical search tree with b successors for every node. Such a tree is said to have a branching factor of b. If the first solution is at depth d, then the maximum number of nodes expanded before reaching a solution is 1 + b + b^2 + b^3 + ... + b^d.

This is not good. For example consider the following table for b=10, 1 node/ms, 100 bytes/node:

Note that memory requirements are a bigger problem here than execution time.
In general, exponential complexity search problems cannot be solved for any but the smallest instances.

Uniform cost search (Dijkstra's Algorithm)
This strategy modifies breadth-first search to account for general path cost. Now the lowest cost node on the fringe is always the first one expanded (recall that the cost of a node is the cost of the path to that node).

When certain conditions are met, the first solution found is guaranteed to be the cheapest.

Consider an example

Uniform cost search finds the cheapest solution provided the cost of a path never decreases as it gets longer, i.e., g(SUCCESSOR(n)) >= g(n) for every node n.
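A sketch of uniform cost search using a priority queue ordered on g(n). The graph and costs are invented for illustration.

```python
import heapq

# Uniform cost search: always expand the frontier node with the lowest
# path cost g(n).
def uniform_cost(start, goal, edges):
    frontier = [(0, start, [start])]         # (g, state, path)
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if state == goal:
            return g, path
        for s2, cost in edges.get(state, []):
            heapq.heappush(frontier, (g + cost, s2, path + [s2]))
    return None

# The direct edge A->C costs 5, but the route via B costs only 2.
edges = {"A": [("B", 1), ("C", 5)], "B": [("C", 1)], "C": []}
print(uniform_cost("A", "C", edges))  # (2, ['A', 'B', 'C'])
```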

Depth-first search

This search strategy always expands one node to the deepest level of the tree. Only when a dead end is encountered does the search back up and expand nodes at shallower levels.

This strategy can be implemented by calling GENERAL-SEARCH with a queuing function that always puts the newest node on the front (ENQUEUE-AT-FRONT).
Depth-first search has modest memory requirements, storing only a single path from the root to the current node along with the remaining unexpanded sibling nodes for each node on the path. For a state space with branching factor b and maximum depth m, depth-first search requires storage of only b·m nodes, in contrast to b^d for breadth-first search. Using the same assumptions as our analysis of breadth-first search, depth-first search would require 12 kilobytes instead of 111 terabytes at depth 12, a factor of about 10 billion less space.

For problems that have a lot of solutions, depth-first search is often faster than breadth-first search because it has a good chance of finding a solution after exploring only a small portion of the search space.

However, depth-first search can get stuck on a wrong path in infinite search spaces.

It is common to implement depth-first search with a recursive function that calls itself on each of its children in turn. In this case, the queue is stored implicitly in the local state of each invocation on the calling stack.

Depth-limited search
This strategy avoids the pitfalls of depth-first search by imposing a cut-off on the maximum depth of a path. Depth-first search is used to search to the given depth.
Depth-limited search is complete (provided the limit is deep enough) but nonoptimal, and if we choose a depth limit that is too shallow it is not even complete.

Iterative deepening search
The hard part about depth-limited search is picking the right depth limit. Most of the time we will not know what to pick for a depth limit. Iterative deepening search is a strategy that side-steps the issue of choosing the depth limit by trying all possible limits in increasing order.

In effect, this search strategy keeps the best features of both breadth-first and depth-first search. It has the modest memory requirements of depth-first search but is optimal and complete like breadth-first search. Also it does not get stuck in dead ends.

Iterative deepening may seem wasteful because many states are expanded multiple times. For most problems, however, the overhead of this multiple expansion is actually rather small because a majority of the nodes are at the bottom of the tree.

So instead of:

1 + b + b^2 + ... + b^(d-1) + b^d

we have:

(d+1)·1 + d·b + (d-1)·b^2 + ... + 2·b^(d-1) + 1·b^d

For b=10 and d=5, the numbers are 111,111 and 123,456. So iterative deepening is 11% more costly than depth-limited in this case.
In terms of complexity, iterative deepening has the same asymptotic time complexity as breadth-first, but has space complexity of only O(b·d).
In general, iterative deepening is the preferred search method when there is a large search space and the depth of the solution is unknown.
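A sketch of depth-limited search and the iterative deepening loop around it. The toy state space is invented for illustration.

```python
# Depth-limited search as a recursive function, then iterative deepening,
# which simply retries with limits 0, 1, 2, ...
def depth_limited(state, goal, neighbours, limit, path):
    if state == goal:
        return path
    if limit == 0:
        return None                      # cut-off reached
    for s2 in neighbours(state):
        result = depth_limited(s2, goal, neighbours, limit - 1, path + [s2])
        if result is not None:
            return result
    return None

def iterative_deepening(start, goal, neighbours, max_limit=50):
    for limit in range(max_limit + 1):   # try all limits in increasing order
        result = depth_limited(start, goal, neighbours, limit, [start])
        if result is not None:
            return result
    return None

nbrs = lambda s: [s + 1, s * 2]          # a toy infinite state space
print(iterative_deepening(1, 10, nbrs))  # [1, 2, 4, 5, 10]
```

Because the limit grows one level at a time, the first solution found is a shallowest one, even though each iteration is a depth-first search.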

Bidirectional Search
The idea here is to search forward from the initial state and backward from the goal state simultaneously. When it works well, the search meets in the middle. For problems where the branching factor is b in both directions, bidirectional search can make a big difference.

The solution will be found in O(b^(d/2)) steps because both searches have to go only half way. For example, if b=10 and d=6, breadth-first generates 1,111,111 nodes, whereas bidirectional search generates only 2,222 nodes. Even though this sounds great in theory, there are several issues that must be addressed before this algorithm can be implemented.

The complexity figure assumes that the process of testing for intersection of the two frontiers can be done in constant time. Also, in order to have the two searches meet, the frontier of one of them must be kept entirely in memory, so the space complexity is O(b^(d/2)).

Comparison of Search Strategies

Avoiding Repeated States

We have so far ignored one of the most important complications to the search process: the possibility of wasting time by expanding states that have already been encountered. For many problems, this is unavoidable. Any problem where operators are reversible has this problem. For example in route planning problems, the search spaces are infinite, but if we prune some of the repeated states, we can cut the tree down to a finite size.

Even when the tree is finite, avoiding repeated states can yield exponential speedups. The classic example is a space that contains only m+1 states, where the tree of possible paths contains 2^m branches.

There are three ways to deal with repeated states, in increasing order of effectiveness and complexity:

To implement this last option, search algorithms often make use of a hash table that stores all the nodes that are generated. This makes checking for repeated nodes fairly efficient. Whether or not to do this depends on how "loopy" the state space is.
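A sketch of this approach: the set of already-expanded states is kept in a hash set, so each repeated-state check takes constant time on average. The example space is made up.

```python
from collections import deque

# Breadth-first graph search with an "explored" hash set: a state that
# has already been expanded is never expanded again.
def graph_search(start, goal, neighbours):
    frontier, explored = deque([(start, [start])]), set()
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        if state in explored:
            continue                      # repeated state: skip it
        explored.add(state)
        for s2 in neighbours(state):
            frontier.append((s2, path + [s2]))
    return None

# On a "loopy" space (+1/-1 on the integers), the explored set stops the
# search from re-expanding the same states over and over.
nbrs = lambda s: [s - 1, s + 1]
print(graph_search(0, 3, nbrs))  # [0, 1, 2, 3]
```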

Constraint satisfaction search

A constraint satisfaction problem (CSP) is a special kind of search problem in which states are defined by the values of a set of variables and the goal test specifies a set of constraints that the values must obey.

Example The 8-queens problem can be posed as a CSP in which the variables are the locations of the eight queens; the possible values are squares on the board; and the constraints state that no two queens can be in the same row, column, or diagonal.

A solution to a CSP specifies values for all of the variables such that the constraints are met.

Cryptoarithmetic and VLSI layout can be posed as CSPs.

Constraints can be unary, binary, or higher-order:

Constraints can also be either absolute or preference.

Each variable vi in a CSP has a domain Di which is the set of possible values that the variable can take on. The domain can be either discrete or continuous.

A unary constraint specifies the allowable subset of a variable's domain and a binary constraint specifies the allowable subset of the cross-product of the two domains. In discrete CSPs with finite domains, constraints can be represented simply by enumerating the allowable combinations of values.

Example in 8-queens, the no attack constraint can be represented by a set of pairs of allowable values: {<1,3>,<1,4>,<1,5>,...,<2,4>,<2,5>,...}.

Constraints involving continuous variables cannot be enumerated in this way. Here we consider only CSPs with discrete, absolute, binary or unary constraints.

The initial state will be the state in which all variables are unassigned. Operators will assign a value to a variable from the set of possible values. The goal test will check if all variables are assigned and all constraints satisfied.

We can use depth-limited search because all solutions are of depth n, where n is the number of variables.

In the most naive implementation, any unassigned variable in a state can be assigned a value by an operator. Hence, the branching factor is the sum of the cardinalities of the different domains, i.e., |D1| + |D2| + ... + |Dn|, e.g., 64 in the 8-queens problem (8 variables with 8 values each). Then the search space would be 64^8.
But note that the order of assignments makes no difference to the final solution. So we can choose a value for a single variable at each node. This results in a search space of size |D1| x |D2| x ... x |Dn|, or 8^8 in 8-queens.

A straightforward depth-limited search will examine all of these possible states. CSPs include some well-known NP-complete problems, and so in the worst case any algorithm to solve CSPs will be exponential.

In most real problems, however, we can take advantage of problem structure to eliminate large parts of the search space.

In CSPs, the goal test is a set of constraints, which can be considered individually rather than all together. Depth-limited search on a CSP wastes time searching when constraints have already been violated. For example, suppose we put the first two queens in the top row. DLS will examine all 8^6 possibilities for the remaining queens before discovering that no solution exists in that subtree.

A substantial improvement is to insert a test before the successor generation step to check whether any constraint has been violated by the variable assignments up to this point. The resulting algorithm backtracks immediately to try something else. This is called backtracking search.

Backtracking still has some failings. Suppose the squares chosen for the first six queens make it impossible to place the seventh queen, because they attack all 8 squares in the last column. Backtracking will still try all possible placings of the seventh queen. Forward checking avoids this problem by looking ahead to detect unsolvability. Each time a variable is instantiated, forward checking deletes from the domains of the as-yet-uninstantiated variables all of those values that conflict with the variables assigned so far. If any of the domains becomes empty, then the search backtracks immediately.
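A sketch of backtracking search with forward checking on n-queens, with one variable (queen) per column and rows as domain values. This is a minimal illustration, not an optimised solver.

```python
# Backtracking with forward checking: assigning a queen deletes the
# conflicting rows from the domains of the as-yet-unassigned columns,
# and the search backtracks as soon as any domain becomes empty.
def n_queens(n):
    def attacks(c1, r1, c2, r2):
        return r1 == r2 or abs(r1 - r2) == abs(c1 - c2)

    def solve(col, domains, placed):
        if col == n:
            return placed
        for row in domains[col]:
            # Forward checking: prune the later columns' domains.
            pruned = {c: [r for r in domains[c] if not attacks(col, row, c, r)]
                      for c in range(col + 1, n)}
            if all(pruned[c] for c in pruned):    # no empty domain
                result = solve(col + 1, {**domains, **pruned}, placed + [row])
                if result is not None:
                    return result
        return None                               # backtrack

    return solve(0, {c: list(range(n)) for c in range(n)}, [])

sol = n_queens(8)
print(sol)  # a valid placement (row per column), e.g. [0, 4, 7, 5, 2, 6, 1, 3]
```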

Forward checking is a special case of arc consistency checking. A state is arc-consistent if every variable has a value in its domain that is consistent with each of the constraints on that variable. This can be achieved by successive deletion of values that are inconsistent with some constraint. As values are deleted, other values may become inconsistent because they relied on the deleted values. This behaviour is called constraint propagation: choices are gradually narrowed down. Sometimes arc consistency is all that is needed to completely solve a problem.

Informed Search Methods

Unfortunately, uninformed search methods are very inefficient in most cases. We now show how heuristic search strategies - ones that use problem specific knowledge - can find solutions more efficiently.

Best-first search

If we use the GENERAL-SEARCH algorithm of the previous chapter, the only place where knowledge can be applied is in the queuing function, which determines the node to expand next. The knowledge used to make this determination is provided by an evaluation function that returns a number that describes the desirability of expanding the node.
When the nodes are ordered so that the one with the best evaluation is expanded first, the resulting strategy is called best-first search. It can be implemented as:

Note that the name best-first search is actually a misnomer because if it really always expanded the best node first, there would be no search at all.
What it, in fact, does is choose next the node that appears to be the best.
Just as there is a whole family of GENERAL-SEARCH algorithms with different ordering functions, there is a whole family of BEST-FIRST-SEARCH algorithms with different evaluation functions. These algorithms typically use some estimated measure of the cost of the solution and try to minimize it. One such measure is the cost of the path so far, which we have already used. In order to focus the search, the measure must incorporate some estimate of the cost of the path from a state to the goal.

Minimize estimated cost to reach a goal: Greedy search
The simplest best-first strategy is to minimize the estimated cost to reach the goal, i.e., always expand the node that appears to be closest to the goal. A function that calculates cost estimates is called an heuristic function.

A best-first search that uses h to select the next node to expand is called a greedy search.

To get an idea of what an heuristic function looks like, let's look at a particular problem. Here we can use as h the straight-line distance to the goal. To do this, we need the map co-ordinates of each city. This heuristic works because roads tend to head in more or less of a straight line.

This figure shows the progress of a greedy search to find a path from Arad to Bucharest.

For this problem, greedy search has minimal search cost because no node off the solution path is expanded. However, it does not find the optimal path: the path it found via Sibiu and Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti.
Hence, the algorithm always chooses what looks locally best, rather than worrying about whether or not it will be best in the long run. (This is why it's called greedy search.)
Greedy search is susceptible to false starts. Consider the problem of getting from Iasi to Fagaras. h suggests that Neamt be expanded first, but it is a dead end. The solution is to go first to Vaslui and then continue to Urziceni, Bucharest and Fagaras.
Note that if we are not careful to detect repeated states, the solution will never be found - the search will oscillate between Neamt and Iasi.
Greedy search resembles dfs in the way that it prefers to follow a single path to the goal and back up only when a dead end is encountered. It suffers from the same defects as dfs - it is not optimal, and it is incomplete because it can start down an infinite path and never try other possibilities.
The worst-case time complexity for greedy search is O(b^m), where m is the maximum depth of the search. Its space complexity is the same as its time complexity, but the worst case can be substantially reduced with a good heuristic function.

Minimizing the total path cost: A* search
Recall that uniform cost search minimizes the cost of the path so far, g(n): it is optimal and complete, but can be very inefficient. We can combine the use of g(n) and h(n) simply by summing them:

f(n) = g(n) + h(n) = the estimated cost of the cheapest solution through n

Hence, if we are trying to find the cheapest solution, a reasonable thing to try is the node with the lowest value for f. It is possible to show that this strategy is complete and optimal, given a simple restriction on the h function: h must never overestimate the cost to reach the goal. Such an h is called an admissible heuristic.

If h is admissible, f(n) never overestimates the actual cost of the best solution through n.

Best-first search using f and an admissible h is known as A* search.

Notice that hsld (straight-line distance) is an admissible heuristic because the shortest path between two points is a straight line. The following diagram shows the first few steps of the A* search for Bucharest using hsld.
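A sketch of A* in Python on a small invented graph, ordering the frontier on f(n) = g(n) + h(n) with an admissible h. The graph, costs and h-values are made up for illustration.

```python
import heapq

# A* search: the frontier is a priority queue ordered on f = g + h.
def a_star(start, goal, edges, h):
    frontier = [(h(start), 0, start, [start])]   # (f, g, state, path)
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return g, path
        for s2, cost in edges.get(state, []):
            g2 = g + cost
            heapq.heappush(frontier, (g2 + h(s2), g2, s2, path + [s2]))
    return None

# Made-up graph: A->B->G costs 5, A->C->G costs 7.
edges = {"A": [("B", 2), ("C", 2)], "B": [("G", 3)], "C": [("G", 5)]}
# Admissible estimates: h never exceeds the true remaining cost.
h = {"A": 4, "B": 3, "C": 2, "G": 0}.get
print(a_star("A", "G", edges, h))  # (5, ['A', 'B', 'G'])
```

Note that C (f = 4) is expanded before B (f = 5), but the goal is still reached first along the cheaper path through B, because the admissible h guarantees the first solution popped is optimal.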

Notice that A* prefers to expand from Rimnicu Vilcea rather than Fagaras. Even though Fagaras is closer to Bucharest, the path taken to get to Fagaras is not as efficient in getting close to Bucharest as the path taken to get to Rimnicu.

The behaviour of A* search
An important thing to notice about the example A* search above: along any path from the root, the f-cost never decreases. This fact holds true for almost all admissible heuristics. An heuristic with this property is said to exhibit monotonicity.

Monotonicity is not quite the same as admissible. In a few cases, heuristics are admissible, but not monotonic.

When our heuristics are not monotonic, we will make them so; hence we can assume that if an heuristic is admissible, it is monotonic.

Now, we can take a look at what is going on in A*. If f never decreases along any path out of the root, we can conceptually draw contours in the state space.

Because A* expands the leaf node of lowest f, an A* search fans out from the start node, adding nodes in concentric bands of increasing f-cost.
With uniform-cost search (A* using h=0), the bands will be "circular" around the start node. With more accurate heuristics, the bands will stretch towards the goal state and become narrowly focused around the optimal path. If we define f* to be the cost of the optimal solution path, we can say: A* expands all nodes with f(n) < f*, and it may expand some nodes with f(n) = f* before finding a goal state.

Hence, the first solution found must be the optimal one because nodes in all subsequent contours will have higher f-cost and hence higher g-cost.
A* search is also complete because as we add bands of increasing f, we must eventually reach a band where f is equal to the cost of the path to a goal state.

A* is optimally efficient for any given admissible heuristic function. This is because any algorithm that does not expand all nodes in the contours between the root and the goal runs the risk of missing the optimal solution.

Complexity of A*
The catch with A* is that even though it's complete, optimal and optimally efficient, it still can't always be used, because for most problems the number of nodes within the goal contour search space is still exponential in the length of the solution.

Similarly to breadth-first search, however, the major difficulty with A* is the amount of space that it uses.

Heuristic Functions

First, we will look at some different heuristics for the 8-puzzle problem. A typical solution to the puzzle has around 20 steps. The branching factor is about 3. Hence, an exhaustive search to depth 20 will look at about 3^20 = 3.5 x 10^9 states.

By keeping track of repeated states, this number can be cut down to 9!=362,880 different arrangements of 9 squares. We need a heuristic to further reduce this number.

If we want to find shortest solutions, we need a function that never overestimates the number of steps to the goal. Here are two possibilities:

h1 = the number of tiles that are in the wrong position. This is admissible because every misplaced tile must be moved at least once.

h2 = the sum of the city block (Manhattan) distances of the tiles from their goal positions. This is admissible because every move brings a tile at most one square closer to its goal.
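The two standard 8-puzzle heuristics, referred to below as h1 (misplaced tiles) and h2 (city block distance), can be sketched as follows, assuming states are 9-tuples listing the tiles row by row with 0 for the blank (the goal layout chosen here is one common convention):

```python
# Assumed goal layout: blank in the top-left corner, tiles 1..8 in order.
GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)

def h1(state):
    """Number of tiles out of place (the blank does not count)."""
    return sum(1 for i, t in enumerate(state) if t != 0 and t != GOAL[i])

def h2(state):
    """Sum of city block distances of each tile from its goal square."""
    total = 0
    for i, t in enumerate(state):
        if t != 0:
            g = GOAL.index(t)
            total += abs(i // 3 - g // 3) + abs(i % 3 - g % 3)
    return total

s = (1, 0, 2, 3, 4, 5, 6, 7, 8)   # one move away from the goal
print(h1(s), h2(s))  # 1 1
```

Since a misplaced tile is always at least one square from its goal, h2(n) >= h1(n) for every state n, which is why h2 dominates h1 below.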

The effect of heuristic accuracy on performance
One way to characterize the quality of an heuristic is the effective branching factor b*. If the total number of nodes expanded by A* for a particular problem is N and the solution depth is d, then b* is the branching factor that a uniform tree of depth d would have to have in order to contain N nodes. Example: if A* finds a solution at depth 5 using 52 nodes, the effective branching factor is 1.91.

Usually, the effective branching factor of an heuristic is fairly constant over problem instances and, hence, experimental measurements of b* on a small set of problems provides a good approximation of the heuristic's usefulness.

A well designed heuristic should have a value of b* that is close to 1. The following is a table of the performance of A* with h1 and h2 above on 100 randomly generated problems. It shows that h2 is better than h1 and that both are much better than iterative deepening search.

Note that h2 is always a better heuristic function than h1.

Inventing heuristic functions
Is it possible for heuristic functions to be invented mechanically?

In many cases, the answer is yes. In the 8-puzzle case, h1 and h2 can be considered perfectly accurate path lengths for simplified versions of the 8-puzzle problem. That is, if the rules of the puzzle were changed so that a tile could be moved anywhere instead of just into an opening, h1 would accurately give the number of steps in the shortest solution.
Is there a simplified problem in which h2 gives correct path lengths?
Yes. If a tile could move a square in any direction, even onto an occupied square.
A problem with less restrictions on its operators is called a relaxed problem. It is often the case that the cost of an exact solution to a relaxed problem is a good heuristic for the original problem.
If a problem definition is written down in a formal language, it is possible to automatically generate relaxations. Again in the 8-puzzle, the operators can be described as:

A tile can move from square A to square B if A is adjacent to B and B is blank.

One can generate three relaxed problems by removing conditions:

(a) A tile can move from square A to square B if A is adjacent to B.
(b) A tile can move from square A to square B if B is blank.
(c) A tile can move from square A to square B.

One problem with generating new heuristic functions is that there is often no single best heuristic.

It turns out that if there are several choices for heuristic functions with none clearly dominating the others, and they are all admissible, then they can be combined as

h(n) = max(h1(n), ..., hm(n))

This composite heuristic uses whichever function is most accurate on the node in question, and is itself admissible.

Often it is possible to pick out features of a state that contribute to its heuristic evaluation, even if it is hard to say what the contribution should be. For example, the goal in chess is to checkmate and the relevant features include number of pieces of each kind belonging to each side, the number of pieces that are attacked by opponent's pieces, etc.

Usually, the evaluation function is assumed to be a linear combination of the features and even if we don't know the weights, it is possible to use a learning algorithm to acquire them.

We have not considered the cost of computing the heuristic functions in our discussion. So long as it is about the cost of expanding a node, this is not an issue; but when the cost is as high as expanding several hundred nodes, it can no longer be ignored.

Heuristics for constraint satisfaction
We have so far examined variants of depth first search for solving CSPs. Now we extend the analysis by considering heuristics for selecting a variable to instantiate and for choosing a value for the variable.

Suppose we are trying to colour the following map with three colours so that no two adjacent regions have the same colour.

Clearly the best next move is to colour region E first because the only possible colour for it is blue. All the other regions have a choice and we might make the wrong one requiring us to backtrack.

This idea is called the most-constrained-variable heuristic. It is used with forward checking. At each point in the search, the variable with the fewest possible values is chosen to have a value assigned next. In this way the branching factor in the search tends to be minimized. When this heuristic is used in solving the n-queens problem, the maximum solvable value of n goes from around 30 for forward checking alone to around 100.

The most-constraining-variable heuristic attempts to reduce the branching factor on future choices by assigning a value to the variable that is involved in the largest number of constraints on other unassigned variables.

Once a variable has been selected, a value must still be chosen for it. For example, suppose we decide to assign a value to region C after A and B. Red is a better choice than blue because it leaves more freedom for future choices. This is the least-constraining-value heuristic - choose a value that rules out the smallest number of values in variables connected to the current variable by constraints. When applied to the n-queens problem, it allows problems up to n=1000 to be solved.
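A sketch of the most-constrained-variable heuristic on a small invented map-colouring problem. The regions and adjacencies are made up for illustration.

```python
# Map colouring: always assign next the region with the fewest remaining
# legal colours (the most-constrained variable).
neighbours = {"A": {"B", "C"}, "B": {"A", "C", "D"},
              "C": {"A", "B", "D", "E"}, "D": {"B", "C", "E"},
              "E": {"C", "D"}}
COLOURS = ["red", "green", "blue"]

def legal(region, colour, assignment):
    return all(assignment.get(n) != colour for n in neighbours[region])

def solve(assignment):
    unassigned = [r for r in neighbours if r not in assignment]
    if not unassigned:
        return assignment
    # Most-constrained variable: fewest legal colours left.
    region = min(unassigned,
                 key=lambda r: sum(legal(r, c, assignment) for c in COLOURS))
    for colour in COLOURS:
        if legal(region, colour, assignment):
            result = solve({**assignment, region: colour})
            if result is not None:
                return result
    return None                           # backtrack

print(solve({}))  # a complete colouring with no adjacent regions alike
```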

Memory Bounded Search

IDA* is a logical extension of iterative deepening search and SMA* is similar to A* but restricts the queue size to fit into the available memory.
Iterative deepening A* search (IDA*)

Uses the same trick we used to make breadth-first search memory efficient.
In this algorithm each iteration is dfs, just as in regular iterative deepening. However, the search is modified to use an f-cost limit rather than a depth limit.
Hence, each iteration expands all nodes inside the contour for the current f-cost, peeping over the contour to find out where the next contour lies.
Once the search inside a contour is completed, a new iteration is started using a new f-cost for the next contour.

IDA* is complete and optimal, but because it is depth-first it only requires space proportional to the longest path that it explores. b*d is a good estimate of storage requirements.

The algorithm's time complexity depends on the number of different values that the h function can take on. The city block distance function used in the 8-puzzle takes on one of a small number of integer values. So typically, f only increases two or three times along any solution path. Hence, IDA* only goes through 2 or 3 iterations. Because IDA* does not need to maintain a queue of nodes, its treatment of nodes is faster.
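A sketch of IDA* on a small invented weighted graph: each depth-first iteration is cut off at the current f-cost limit, and the new limit is the smallest f-value that peeped over the old contour.

```python
# IDA*: depth-first iterations bounded by an f-cost limit.
def ida_star(start, goal, edges, h):
    def dfs(state, g, limit, path):
        f = g + h(state)
        if f > limit:
            return None, f               # "peep over" the contour
        if state == goal:
            return path, f
        smallest = float("inf")          # smallest f beyond the limit
        for s2, cost in edges.get(state, []):
            found, f2 = dfs(s2, g + cost, limit, path + [s2])
            if found is not None:
                return found, f2
            smallest = min(smallest, f2)
        return None, smallest

    limit = h(start)
    while True:
        found, limit = dfs(start, 0, limit, [start])
        if found is not None:
            return found
        if limit == float("inf"):
            return None                  # no solution exists

# Made-up graph and admissible heuristic values.
edges = {"A": [("B", 2), ("C", 2)], "B": [("G", 3)], "C": [("G", 5)]}
h = {"A": 4, "B": 3, "C": 2, "G": 0}.get
print(ida_star("A", "G", edges, h))  # ['A', 'B', 'G']
```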

Unfortunately, IDA* has difficulty in more complex domains. In the travelling salesperson problem, for example, the heuristic value is different for every state. As a result, the search is much more like iterative deepening: if A* expands N nodes, IDA* will have to go through N iterations and will therefore expand 1+2+...+N = O(N^2) nodes.

One way around this problem is to increase the f-cost limit by a fixed amount ε on each iteration, so that the total number of iterations is proportional to 1/ε. This can reduce the search cost at the expense of returning solutions that can be worse than optimal by at most ε. Such an algorithm is called ε-admissible.

SMA* search

IDA*'s difficulties in some problem spaces can be traced to using too little memory. Between iterations, it retains only a single number, the current f-cost limit.

SMA* (Simplified Memory-bounded A*) can make use of all available memory to carry out the search. SMA* has the following properties: it will utilise whatever memory is made available to it; it avoids repeated states as far as its memory allows; it is complete if the available memory is sufficient to store the shallowest solution path; and it is optimal if enough memory is available to store the shallowest optimal solution path.

When SMA* needs to make space on its queue, it drops a node from the queue. It prefers to drop nodes that are unpromising - that is nodes with high f-cost. To avoid reexploring subtrees that it has dropped from memory, it retains in the ancestor nodes information about the quality of the best path in the forgotten subtree. In this way, it only regenerates the subtree when all other paths have been shown to look worse than the path it has forgotten.

Example  A hand simulation of SMA* illustrates how it works.

In this case, there is enough memory for the shallowest optimal solution path. If J had had a cost of 19 instead of 24, however, SMA* still would not have been able to find it because the solution path contains four nodes.

In this case, it would return D which is a reachable, non-optimal solution.

Given a reasonable amount of memory, SMA* can solve significantly more difficult problems than A* without incurring significant overhead in terms of extra nodes generated. It performs well on problems with highly connected state spaces and real-valued heuristics, on which IDA* has difficulty.

Iterative Improvement Algorithms

Start with a complete configuration and make modifications to improve quality. Example: n-queens

For this kind of algorithm, consider a grid containing all of the problem states. The height of any point corresponds to the value of the evaluation function on the state at that point.

The idea of iterative improvement algorithms is to move around the grid, trying to find the highest peaks, which are optimal solutions.

These algorithms usually keep track only of the current state and do not look ahead beyond immediate neighbors.

There are two major classes of iterative improvement algorithms: Hill-climbing algorithms try to make changes that improve the current state. Simulated annealing algorithms sometimes make changes that make things worse.

Hill-Climbing Search

Always move in a direction of increasing quality. This simple policy has three well-known drawbacks: local maxima (peaks that are lower than the global maximum, from which every move is downhill), plateaux (flat areas of the state space where the evaluation function gives no direction), and ridges (slopes that none of the available single moves climbs directly).
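The basic policy can be sketched in a few lines. This is an illustrative implementation, not code from the notes; the `value` and `neighbours` functions at the end are a toy example (an integer parabola with its peak at x = 7).

```python
# Minimal hill-climbing sketch: repeatedly move to the best-valued neighbour,
# stopping when no neighbour improves on the current state.
def hill_climb(state, value, neighbours):
    while True:
        best = max(neighbours(state), key=value, default=None)
        if best is None or value(best) <= value(state):
            return state            # local maximum (or edge of a plateau)
        state = best

# Toy example: climb an integer parabola whose peak is at x = 7.
value = lambda x: -(x - 7) ** 2
neighbours = lambda x: [x - 1, x + 1]
peak = hill_climb(0, value, neighbours)   # → 7
```

On this convex toy landscape hill-climbing always reaches the global maximum; the drawbacks listed above appear only when the landscape has multiple peaks, plateaux or ridges.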

Simulated Annealing

Allow hill-climbing to take some downhill steps to escape local maxima.

The innermost loop of simulated annealing is quite similar to hill-climbing. Instead of picking the best move, however, it picks a random move. If the move actually improves the situation, it is executed. Otherwise, the algorithm makes the move with some probability less than 1. The probability decreases exponentially with the badness of the move - the amount ΔE by which the evaluation is worsened.

A second parameter T is also used to determine the probability. At higher values of T, bad moves are more likely to be allowed. As T tends towards 0, they become more and more unlikely, until the algorithm behaves like hill-climbing. The schedule input determines the value of T as a function of how many cycles have already been completed.

Simulated annealing was developed from an explicit analogy with annealing - the process of gradually cooling a liquid until it freezes. The value function corresponds to the total energy of the atoms in the material and T corresponds to the temperature. The schedule determines the rate at which the temperature is lowered. Individual moves in the state space correspond to random fluctuations due to thermal noise. One can prove that if the temperature is lowered sufficiently slowly, the material will attain a lowest-energy configuration. This corresponds to the statement that if schedule lowers T slowly enough, the algorithm will find a global optimum.
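The loop described above can be sketched as follows. This is an illustrative implementation, not the notes' own algorithm: the cooling schedule, the cycle limit, and the book-keeping of the best state seen so far are all assumptions added for the demonstration.

```python
import math
import random

# Simulated annealing sketch: a random move is always accepted if it improves
# the value; a worsening move of size dE is accepted with probability e^(dE/T).
def simulated_annealing(state, value, neighbours, schedule, max_cycles=10000):
    best = state                            # remember the best state seen (an extra
    for t in range(max_cycles):             # convenience, not part of the core loop)
        T = schedule(t)
        if T <= 0:
            break
        nxt = random.choice(neighbours(state))
        dE = value(nxt) - value(state)      # dE < 0 means the move is worse
        if dE > 0 or random.random() < math.exp(dE / T):
            state = nxt
        if value(state) > value(best):
            best = state
    return best

# Toy run on an integer parabola with peak at x = 7, using exponential cooling.
random.seed(0)
value = lambda x: -(x - 7) ** 2
neighbours = lambda x: [x - 1, x + 1]
schedule = lambda t: 10 * (0.95 ** t)       # assumed cooling schedule
result = simulated_annealing(0, value, neighbours, schedule)
```

At high T early on, downhill moves are frequently accepted; as T decays the loop degenerates into hill-climbing, mirroring the annealing analogy above.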

Applications to constraint satisfaction problems

CSPs can be solved by iterative improvement methods by first assigning values to all the variables and then applying modification operators to move the configuration towards a solution. These operators simply assign a different value to a variable, e.g., 8-queens.

Algorithms that solve CSPs in this fashion are called heuristic repair methods. In choosing a new value for a variable, the most obvious heuristic is to select the value that results in the minimum number of conflicts with other variables. This is called the min-conflicts heuristic.
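The min-conflicts heuristic for the 8-queens example mentioned above can be sketched as follows. This is an illustrative implementation under the usual one-queen-per-column encoding; the step limit and random seed are assumptions.

```python
import random

# Min-conflicts sketch for n-queens: queens[c] is the row of the queen in
# column c, so only row conflicts and diagonal conflicts can occur.
def conflicts(queens, col, row):
    return sum(1 for c, r in enumerate(queens)
               if c != col and (r == row or abs(r - row) == abs(c - col)))

def min_conflicts(n, max_steps=100000):
    queens = [random.randrange(n) for _ in range(n)]   # complete initial assignment
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(queens, c, queens[c]) > 0]
        if not conflicted:
            return queens                              # no conflicts: a solution
        col = random.choice(conflicted)
        # repair: move this queen to the row that minimises its conflicts
        queens[col] = min(range(n), key=lambda r: conflicts(queens, col, r))
    return None

random.seed(1)
solution = min_conflicts(8)
```

Note the heuristic-repair shape: start from a complete (conflicted) configuration and apply value-changing operators, rather than building a solution variable by variable.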

Section 05


Games require different search methods since we are often dealing with an adversary.

Game Tree

A tree in which the nodes denote board configurations, and the branches indicate how one board configuration can be transformed into another by a single move.  A single move by one player is often called a ply; a pair of plies, one by each player, makes up one full move.

Need a way to rank game outcomes. The ranking function is called a utility function.  A simple utility function may give -1,0 or +1 for lose, draw and win.
Consider noughts and crosses with two players called MAX and MIN. If MAX is X and goes first then the tree is as follows.

MAX will always try to maximise the utility function of the terminal state reached, and MIN will always try to minimise it.
An exhaustive search of the game tree for a solution is usually not possible.
Consider the following simple game tree.
The utility function is only applied to terminal nodes. If MAX makes move A1 then MIN should make move A11. Thus the result of making move A1 is a value of 3 for the utility function. Similarly, A2 leads to a value of 2 and A3 to a value of 2. MAX wants to maximise the utility function and so A1 is the best move to make.

Minimax consists of 5 steps.

  1. Generate the complete game tree.
  2. Apply the utility function to all the terminal states.
  3. Use the utility of the terminal states to calculate a utility value for their parents, taking the maximum or the minimum depending on whether MAX or MIN moves at that depth.
  4. Continue up to root node.
  5. Choose move with highest value from root node.
The chosen move is called the minimax decision because it maximises the utility under the assumption that the opponent will play perfectly to try and minimise it.

If the maximum depth of the tree is m and there are b legal moves at each node, then the time complexity of minimax is O(b^m).  The algorithm is depth first, so the memory requirement is only O(bm).
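The five steps above can be sketched directly. This is an illustrative implementation over an explicit game tree; the leaf values below are chosen to reproduce the example discussed earlier (A1 leads to a minimum of 3, A2 and A3 each to a minimum of 2), and the specific leaves beyond those minima are assumptions.

```python
# Minimax sketch over an explicit game tree: a tree is either a terminal
# utility value or a list of subtrees (each index corresponds to a move).
def minimax(tree, maximising=True):
    if not isinstance(tree, list):
        return tree                      # terminal state: apply the utility function
    values = [minimax(sub, not maximising) for sub in tree]
    return max(values) if maximising else min(values)

# Root is MAX; each sublist is a MIN node. A1 -> 3, A2 -> 2, A3 -> 2,
# so the minimax decision for MAX is A1, with value 3.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
best = minimax(tree)                     # → 3
```

This evaluates every leaf, which is exactly the wastefulness that alpha-beta pruning addresses later.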

Evaluation functions

Minimax assumes that we can search all the way to the terminal states; this is often not practical.

Instead of doing this we could use a heuristic evaluation function to estimate which moves lead to better terminal states. A cutoff test must then be used to limit the search.

Choosing a good evaluation function is very important to the efficiency of this method. The evaluation function must agree with the utility function for terminal states and it must be a good predictor of the terminal values. If the evaluation function were infallible then no search would be necessary: just choose the move that gives the best position. The better the evaluation function, the less search needs to be done.

e.g. the following shows some evaluations of chess positions.

The evaluation function is often a linear combination of features.

Cutting off search

The easiest way to stop the search is to have a fixed depth or to use iterative deepening.

Heuristic Continuation

You search for N ply in a tree, and find a very good move. But, perhaps, if you had searched just one ply further you would have discovered
that this move is actually very bad.

In general:

The analysis may stop just before your opponent captures one of your pieces, or just before you capture one of your opponent's pieces.

This is called the horizon effect: a good (or bad) move may be just over the horizon.

The Singular Extension Heuristic:

Search should continue as long as one move's static value stands out from the rest. If we don't use this heuristic, we risk harm from the Horizon Effect.

e.g. consider the following chess position. Black is ahead in material, but if white can reach the eighth row with its pawn then it can win. Black can stall this for some time by checking the white king, and so may never see that this is a losing position.

Alpha Beta Pruning

Using the Minimax algorithm allows a chess playing program to look ahead about 3 or 4 ply. This is roughly as good as an intermediate human player, but still fairly easy to beat. Minimax, however, is wasteful because it evaluates some branches of the game tree that it need not. The principle behind Alpha-Beta pruning is:

If you have an idea that is surely bad, do not take time to see how terrible it is.

Consider the following example:

Search proceeds exactly as for minimax, evaluating A11, A12, A13 and A1 until, after looking at A21, we find that A2 must have a utility of at most 2. We now know that A2 is a bad choice because A1 has a higher utility, and we need not evaluate any more branches below A2.

The following diagram shows this for a general case.

If we have a better choice m, either at the parent node of n or at any choice point further up, then n will never be reached in actual play.

Minimax is depth first, so all that need be remembered is the best choice found so far for the player (MAX) and the best choice found so far for the opponent (MIN). These are called alpha and beta respectively.

The following algorithm illustrates this.
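A possible rendering of alpha-beta in code, keeping the alpha and beta values described above, is sketched below. This is an illustrative implementation over an explicit game tree (a tree is either a terminal utility value or a list of subtrees); the leaf values reuse the earlier three-move example and are partly assumptions.

```python
# Alpha-beta sketch: alpha is the best value found so far for MAX, beta the
# best for MIN; a branch is pruned once it cannot affect the final decision.
def alphabeta(tree, alpha=float('-inf'), beta=float('inf'), maximising=True):
    if not isinstance(tree, list):
        return tree                      # terminal state
    if maximising:
        v = float('-inf')
        for sub in tree:
            v = max(v, alphabeta(sub, alpha, beta, False))
            alpha = max(alpha, v)
            if alpha >= beta:
                break                    # beta cut-off
        return v
    v = float('inf')
    for sub in tree:
        v = min(v, alphabeta(sub, alpha, beta, True))
        beta = min(beta, v)
        if alpha >= beta:
            break                        # alpha cut-off
    return v

# Same example as before: after seeing the leaf 2 under A2, the remaining
# leaves of A2 are pruned, and the root value is still 3.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
value = alphabeta(tree)                  # → 3
```

The pruned search returns exactly the minimax value; only the amount of work changes.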

Effectiveness of alpha-beta pruning.

What are the maximum savings possible?

Suppose the tree is ordered as follows:

^                ____________________________o____________________________
                /                            |                            \
v      ________o________              _______o________             ________o________
      /        |        \           /        |        \           /        |        \
^    o         o         o         o         o         o         o         o         o
   / | \     / | \     / | \     / | \     / | \     / | \     / | \     / | \     / | \
 14 15 16  17 18 19  20 21 22  13 14 15  26 27 28  29 30 31  12 13 14  35 36 37  38 39 40
  *  *  *  *         *         *  *  *                       *  *  *
Only those nodes marked (*) need be evaluated.
How many static evaluations are needed?

If b is the branching factor (3 above) and d is the depth (3 above)

s = 2b^(d/2) - 1                    if d is even

s = b^((d+1)/2) + b^((d-1)/2) - 1   if d is odd

For our tree d=3, b=3 and so s = 3^2 + 3^1 - 1 = 11.

This is only for the perfectly arranged tree. It gives a lower bound on the number of evaluations of approximately 2b^(d/2). The worst case is b^d (plain minimax).
In practice, for reasonable games, the complexity is O(b^(3d/4)). Using minimax with alpha-beta pruning allows us to look ahead about half as far again as without.

Games of Chance

Many games, such as backgammon, involve chance.

To represent this game we need to add chance nodes to the game tree. In the tree below chance nodes are shown as circles. Below each circle are the possible outcomes of the dice throw with the probability of the throw occurring. (1/36 for a double, 1/18 otherwise).

Now each position has no definite outcome, only an expected outcome. The expected value for a chance node C above a max node is:

expectimax(C) = Σ_i P(d_i) · max over s in S(C, d_i) of utility(s)

where P(d_i) is the probability of dice roll d_i and S(C, d_i) is the set of positions reachable from C by a legal move given that roll. The expectimin value is similarly defined for a chance node above a min node.

Complexity of expectiminimax

Minimax has a complexity of O(b^m), where b is the branching factor and m is the depth. If there are now also n possible chance outcomes at each level, then the complexity becomes O(b^m n^m).
This extra cost can make games of chance very difficult to solve.
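The expectiminimax computation itself is a small extension of minimax. Below is an illustrative sketch (not the notes' own algorithm); the tiny example game and its probabilities are assumptions chosen to make the arithmetic obvious.

```python
# Expectiminimax sketch: a node is ('max', subtrees), ('min', subtrees),
# ('chance', [(probability, subtree), ...]), or a terminal utility value.
def expectiminimax(node):
    if not isinstance(node, tuple):
        return node                      # terminal utility value
    kind, children = node
    if kind == 'max':
        return max(expectiminimax(c) for c in children)
    if kind == 'min':
        return min(expectiminimax(c) for c in children)
    # chance node: probability-weighted average of the outcomes
    return sum(p * expectiminimax(c) for p, c in children)

# Tiny example: MAX chooses between a certain 2 and a fair coin flip
# between 0 and 5; the flip has expected value 2.5, so it is preferred.
game = ('max', [2, ('chance', [(0.5, 0), (0.5, 5)])])
value = expectiminimax(game)             # → 2.5
```

Each chance node multiplies the branching as the complexity formula above describes, which is why even shallow lookahead is expensive in games like backgammon.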

State of the art

The following diagram suggests that by 1997 a chess program would be as good as the best human. This is in fact what happened: in 1997, IBM's Deep Blue defeated the world champion, Garry Kasparov, in a six-game match.
Deep Blue has 32 PowerPC CPUs and 256 dedicated hardware devices for evaluating board positions. It also has 100 years' worth of grandmaster chess games and an opening and closing move database.

see: for more details of Deep Blue

  Section 06

Agents that reason logically

We will now look at knowledge-based agents; these implement a view of agents in which they can be seen as knowing about their world and reasoning about their possible courses of action.

A knowledge-based agent needs to know: the current state of the world, how to infer unseen properties from percepts, how the world evolves over time, what it wants to achieve, and what its own actions do in various circumstances.

The basic elements of a reasoning agent's design are: a formal language in which knowledge can be expressed and a means of carrying out reasoning in such a language. These two elements constitute a logic.

A knowledge-based agent

The central component of a knowledge-based agent is its knowledge base (KB). A KB is a set of representations of facts about the world. Often the individual units of a KB are called sentences.

There must be a way to add new sentences to the knowledge base and a way to query what is known. We will call these standard functions TELL and ASK, respectively.

Determining what follows from a KB is the job of the inference mechanism.

Like all other agents, this agent takes a percept as input and returns an action.

The agent maintains a KB which may initially contain some background knowledge.

In the process of answering this query, logical reasoning is used to prove which action is better than all others, given what the agent knows and what its goals are.

Make-percept-sentence takes a percept and a time t and returns a sentence representing the fact that the agent perceived the percept at time t.

Make-action-query takes a time and returns a sentence that is suitable for asking what action should be performed at that time.
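The TELL/ASK loop just described can be sketched as follows. This is an illustrative stub, not a working reasoner: the class name, the sentence representation, and the placeholder ASK (which would in reality invoke the inference mechanism) are all assumptions.

```python
# Sketch of a generic knowledge-based agent. Sentences are represented as
# plain tuples purely for illustration.
def make_percept_sentence(percept, t):
    return ('percept', percept, t)       # "the agent perceived `percept` at time t"

def make_action_query(t):
    return ('action-query', t)           # "what action should be performed at time t?"

class KnowledgeBasedAgent:
    def __init__(self, background_knowledge=()):
        self.kb = list(background_knowledge)   # may initially contain background knowledge
        self.t = 0

    def tell(self, sentence):
        self.kb.append(sentence)

    def ask(self, query):
        # Stand-in: a real agent would use logical inference here to prove
        # which action is better than all others, given the KB and its goals.
        return ('no-op', query)

    def agent_program(self, percept):
        self.tell(make_percept_sentence(percept, self.t))
        action = self.ask(make_action_query(self.t))
        self.tell(('did', action, self.t))     # record the chosen action
        self.t += 1
        return action
```

The loop's shape — TELL the percept, ASK for an action, TELL that the action was taken — is the whole interface; everything interesting happens inside the inference mechanism behind ASK.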

At any point, we can describe a knowledge-based agent at three levels: the knowledge level (what the agent knows), the logical level (the sentences that encode that knowledge) and the implementation level (the data structures and algorithms that operate on those sentences).

The agent's initial program is built by adding sentences one at a time to the knowledge base. Provided that the representation language makes it easy to express this knowledge in the form of sentences, this simplifies the construction problem significantly. This is called the declarative approach to system building.

The Wumpus World

We now introduce a simple world.  An agent must explore a cave consisting of rooms connected by passageways. Lurking somewhere in the cave is a beast (the wumpus) that eats anyone who enters its room. Some rooms contain bottomless pits, and one room contains gold.

The agent's task is to find the gold and return to square [1,1].

The percepts, actions and goals in the wumpus world are as follows: the agent perceives a stench in squares adjacent to the wumpus, a breeze in squares adjacent to a pit, a glitter in the square containing the gold, a bump when it walks into a wall, and a scream when the wumpus is killed. Its actions are to go forward, turn left, turn right, grab the gold, shoot its single arrow, and climb out of the cave. The agent must do well over a class of wumpus problems. We will assume a 4x4 grid surrounded by walls. The locations of the gold and the wumpus are chosen randomly, with a uniform distribution over the squares other than [1,1]. In addition, each square other than the start can be a pit, with probability 0.2.

In most environments in this class, it is possible for the agent to return successfully with the gold. However, in some the agent must choose between going home empty handed or taking a chance that could lead to death or the gold. And in about 21% of the environments, there is no way the agent can get a positive score.

Acting and reasoning in the wumpus world

Why does an agent that is successful in this world need to be able to reason? The following figure shows the state of the agent's knowledge after it has received its initial percept in the world above.

From the fact that there was no stench or breeze in [1,1], the agent can infer that [1,2] and [2,1] are free of dangers. From the fact that it is still alive, it can infer that [1,1] is also OK. A cautious agent will only move into a square that it knows is OK. Let us suppose the agent moves to [2,1], giving us (b) above.

The agent detects a breeze in [2,1], so there must be a pit in a neighbouring square, either [2,2] or [3,1]. (The notation P? indicates a possible pit.) The pit cannot be in [1,1] because the agent has already been there and did not fall in. At this point, there is only one known square that is OK and has not yet been visited. So the prudent agent turns back and goes to [1,1] and then [1,2], giving us (a) below:

The agent detects a stench in [1,2], which means there must be a wumpus nearby. But the wumpus cannot be in [1,1] and it cannot be in [2,2] (or the agent would have detected a stench when it was in [2,1]), so the wumpus must be in [1,3]. More interesting is the lack of a breeze in [1,2]: it means there is no pit in [2,2], so the pit must be in [3,1]. ...

Representation, Reasoning and Logic

Together, representation and reasoning support the operation of a knowledge-based agent. The object of a knowledge representation is to express knowledge in computer-manipulable form such that it can be used to help agents perform well. A knowledge representation language has two aspects: syntax, which describes the possible configurations that can constitute sentences, and semantics, which determines the facts in the world to which the sentences refer.

Languages with precisely defined syntax and semantics are called logics. In a logic, we can develop inference mechanisms for an agent that uses the language. A sentence inside a computer is not a fact; it is a representation of a fact. An inference mechanism must ensure that it allows new sentences to be derived from existing ones only when the new fact is true whenever the existing ones are.

We want to generate sentences that are necessarily true given that the old sentences are true. This relationship between sentences is called entailment.

How do we build sound inference procedures? By examining the semantics of logical languages, we can extract what is called a proof theory of the language.

Example Consider the sentence E=mc². The syntax of the "equation language" allows two expressions to be connected by =. An expression can be a symbol or a number, a concatenation of two expressions, two expressions joined by +, and so on.

The semantics of the language says that the two expressions on each side of the = refer to the same quantity; that the concatenation of two expressions refers to the quantity that is the product of the two quantities referred to by each of the expressions; and so on.

From the semantics, we can show that a new sentence can be generated by, for example, concatenating the same expression to both sides of the equation: ET=mc²T.


Representations need to be good at the job for which they are intended, but what does this mean? We start by looking at some more familiar examples than representation languages.

Programming languages are good for describing algorithms and concrete data structures. It is easy to imagine using a 4x4 array to represent the wumpus world and world[2,2]<-pit is a natural way to place a pit in square [2,2] of a world. However, when we want to say less specific things, we run into trouble. For example, "there is a pit in [2,2] or [3,1]" or "there is a wumpus somewhere."

The reason we run into this problem is that programming languages are designed to completely describe the state of the computer and how it changes as a program executes. But we would like to be able to describe situations where complete information is not available. A language that does not allow us to do this is not sufficiently expressive.

A good knowledge representation language should be expressive and concise so that what we say today will still be understandable tomorrow. It should be unambiguous and independent of context. Also, it should be possible to design an inference procedure for it.


In logic, the meaning of a sentence is what it states about the world. So, how do we establish the correspondence between facts and sentences? In order to say what a sentence means, an interpretation must be provided to say what fact it corresponds to. Without an interpretation, a sentence is meaningless. This idea is easiest to see in fabricated languages.

The languages that we will deal with are all compositional - the meaning of a sentence is a function of the meaning of its parts. Just as the meaning of x² + y² is determined from the meanings of x² and y², we want the meaning of "s1 and s2" to be determined from the meanings of "s1" and "s2".

A sentence is true under a particular interpretation if the state of affairs it represents is the case. Hence, truth depends both on the interpretation and the actual state of affairs in the world.


"S1,2" would be true under the interpretation in which it means that there is a stench in [1,2], in the world we explored earlier, but it would be false in worlds that do not have a stench in [1,2]. Also it would be false in our example world if its interpretation were "there is a breeze in [1,2]."


We are concerned with sound inference which is called logical inference or deduction. Logical inference is a process that implements the entailment relation between sentences. We begin with the idea of necessary truth.

Validity and satisfiability

A sentence is valid or necessarily true if and only if it is true under all possible interpretations in all possible worlds.

Example "There is a stench at [1,1] implies there is a stench at [1,1]." By contrast, "There is an open area in the square in front of me or there is a wall in the square in front of me" is not valid.

A sentence is satisfiable if and only if there is some interpretation in some world that makes it true, e.g., "there is a wumpus at [1,2]". If there is no interpretation and no world in which a sentence is true, it is called unsatisfiable.

Example (an unsatisfiable sentence) "There is a wall in front of me if and only if there is no wall in front of me."

Inference in computers

It turns out that validity and unsatisfiability are crucial to the ability of a computer to reason logically.

A computer does not know the interpretation you are using for the sentences in the knowledge base, nor does it know anything more than what appears in the knowledge base (it has no direct experience of the world).

Example Suppose we ask if it is OK to move to square [2,2]. The computer does not know what OK means, nor does it know what a wumpus or a pit is. So it cannot reason informally as we do. All it can do is see whether KB ⊨ "[2,2] is OK". The inference procedure must show that the sentence "If KB is true then [2,2] is OK" is a valid sentence. If it can show this, then [2,2] is OK under any interpretation and model.

Hence, the powerful thing about formal inference is that it can be used to derive valid conclusions even if the reasoner does not know the intended interpretation.


As we have said, logics consist of the following: a syntax, a semantics and a proof theory.

We will look first at propositional logic (0-order logic) and then at first-order logic with equality.

In propositional logic symbols represent whole propositions (what we have been calling facts), e.g., D might have the interpretation, "the wumpus is dead". These symbols can be combined into more complicated sentences using boolean connectives.

First-order logic commits to the representation of the world in terms of objects and predicates on those objects as well as connectives (as in propositional logic) and quantifiers.

It is illuminating to consider logics in the light of their ontological and epistemological commitments. Ontological commitments have to do with the nature of reality. For example, propositional logic assumes that there are facts that either hold or do not hold in the world. First-order logic further assumes that the world consists of objects with certain relationships between them that either hold or do not hold. Special purpose logics make further commitments.

Epistemological commitments have to do with the possible states of knowledge an agent can have using various types of logics. In propositional and first-order logic, a sentence represents a fact and the agent either believes the sentence to be true, believes it to be false or cannot make a determination. Hence, these logics have three possible states of belief about any sentence. Systems using probability theory can have any degree of belief in a sentence, ranging from 0 (false) to 1 (true).

Propositional Logic


The symbols are True and False, propositional symbols such as P or Q, the logical connectives ¬, ∧, ∨, => and <=>, and parentheses. All sentences are made by putting these symbols together using the following rules:

BNF for propositional logic:

<Sentence> := True | False
            | <PropositionSymbol>
            | ¬<Sentence>
            | <Sentence> <Connective> <Sentence>
            | (<Sentence>)
<Connective> := ∧ | ∨ | => | <=>

  Precedence of operators (highest first) is: ¬, ∧, ∨, =>, <=>.


A propositional symbol can be interpreted as any fact. The propositional constants have the obvious interpretations.

For a complex sentence the meaning (as a function of the truth values of the constituent parts) is given by truth tables.

The truth table for => may be confusing at first. This is because, in English, "if P then Q" has a causal interpretation, i.e., P causes Q. This is not the case in propositional logic, where, "if P then Q" means that when P is the case I claim that Q is also. When P is not the case, I am making no claim.

Validity and inference

Truth tables can also be used to test the validity of sentences.


Since we can check validity of a sentence using truth tables, we can check whether a conclusion follows from some premises by checking whether or not Premises => Conclusion is valid.
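The truth-table check can be mechanised directly. Below is an illustrative sketch (not from the notes) in which sentences are represented as Python functions of a model, i.e. a dict mapping proposition symbols to truth values.

```python
from itertools import product

# Truth-table check that Premises => Conclusion is valid: the conclusion must
# hold in every row (model) in which all the premises hold.
def entails(symbols, premises, conclusion):
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False               # a counterexample row
    return True

# Example: from P => Q and P we may conclude Q (modus ponens) ...
premises = [lambda m: (not m['P']) or m['Q'], lambda m: m['P']]
ok = entails(['P', 'Q'], premises, lambda m: m['Q'])       # → True

# ... but from P => Q and Q we may NOT conclude P (affirming the consequent).
premises2 = [lambda m: (not m['P']) or m['Q'], lambda m: m['Q']]
bad = entails(['P', 'Q'], premises2, lambda m: m['P'])     # → False
```

The loop enumerates all 2^n rows, which is exactly the exponential cost discussed under "Complexity of Propositional Inference" below.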



Any world in which a sentence is true under a particular interpretation is called a model of that sentence under that interpretation.

The more we claim, the fewer models there will be of those claims, e.g., there are fewer models of "There is a pit in [1,2] and there is a Wumpus in [4,2]" than of "There is a pit in [1,2]."

Models are very important in logic because of the relationship between models and entailment. If KB ⊨ α, then every model of KB is also a model of α. Hence, whenever KB is true, so also is α.

We could define the meaning of a sentence by means of set operations on sets of models. For example, the set of models of P ∧ Q is the intersection of the models of P and the models of Q.

Rules of inference for propositional logic

The process by which the soundness of an inference rule is established through truth tables can be extended to entire classes of inferences. There are certain patterns of inferences that occur over and over, and their soundness can be shown once and for all. Then the pattern can be captured in what is called an inference rule. Once this has been done, the rule can be used to make inferences without going through the tedious process of building truth tables.

We have already seen the notation α ⊢ β to say that β can be derived from α by inference. An alternate notation,

  α
 ───
  β

emphasizes that this is not a sentence, but rather an inference rule. Here α and β match sentences.

Whenever something in the knowledge base matches the pattern above the line, the inference rule concludes the sentence below the line.

An inference rule is sound if the conclusion is true in all cases in which the premises are true.

Some example inference rules:

Modus Ponens: from α => β and α, infer β.

And-Elimination: from a conjunction α1 ∧ α2 ∧ ... ∧ αn, infer any conjunct αi.

Double-Negation Elimination: from ¬¬α, infer α.

Unit Resolution: from α ∨ β and ¬β, infer α.
Complexity of Propositional Inference

Since there are 2^n rows in a truth table over n proposition symbols, one might suspect that the complexity of propositional inference is exponential. One might wonder if there is a polynomial-time procedure for propositional inference. This very problem was the motivation for the class of NP-complete problems.

There is a useful class of sentences for which a polynomial-time inference procedure exists. This is the class of Horn sentences. These sentences have the form:

P1 ∧ P2 ∧ ... ∧ Pn => Q

where each of the Pi and Q are nonnegated atoms.

An Agent for the Wumpus World

Let S1,1 mean there is a stench in [1,1].

Let B1,1 mean there is a breeze in [1,1].

Let W1,1 mean there is a Wumpus in [1,1].

Let P1,1 mean there is a pit in [1,1].

What we know about the wumpus world so far, from the percepts in the squares visited, is: ¬S1,1, ¬B1,1, ¬S2,1, B2,1, S1,2, ¬B1,2.
How can we express the rules about Wumpuses and Pits?

r1: ¬S1,1 => ¬W1,1 ∧ ¬W1,2 ∧ ¬W2,1

r2: ¬S2,1 => ¬W1,1 ∧ ¬W2,1 ∧ ¬W2,2 ∧ ¬W3,1

r3: S1,2 => W1,3 ∨ W1,2 ∨ W2,2 ∨ W1,1

Now to find the wumpus using the inference rules:

Modus ponens with ¬S1,1 and r1 gives:

¬W1,1 ∧ ¬W1,2 ∧ ¬W2,1

And-elimination gives:

¬W1,1    ¬W1,2    ¬W2,1

Modus ponens followed by and-elimination with ¬S2,1 and r2 gives:

¬W2,2    ¬W3,1

Modus ponens with S1,2 and r3 gives:

W1,3 ∨ W1,2 ∨ W2,2 ∨ W1,1

Apply unit resolution where α is W1,3 ∨ W1,2 ∨ W2,2 and β is W1,1 (we have ¬W1,1):

W1,3 ∨ W1,2 ∨ W2,2

Apply unit resolution where α is W1,3 ∨ W1,2 and β is W2,2 (we have ¬W2,2):

W1,3 ∨ W1,2

Apply unit resolution where α is W1,3 and β is W1,2 (we have ¬W1,2):

W1,3

We have found the wumpus!!
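The same conclusion can be reached by model checking instead of rule application. The sketch below (illustrative, not from the notes) enumerates every assignment of truth values to the wumpus-location symbols and confirms that every model consistent with the knowledge base makes W1,3 true.

```python
from itertools import product

# Model-checking version of the wumpus hunt: the KB combines the consequents
# obtained from ¬S1,1 with r1, ¬S2,1 with r2, and S1,2 with r3.
symbols = ['W11', 'W12', 'W13', 'W21', 'W22', 'W31']

def kb_holds(m):
    r1 = not m['W11'] and not m['W12'] and not m['W21']
    r2 = not m['W11'] and not m['W21'] and not m['W22'] and not m['W31']
    r3 = m['W13'] or m['W12'] or m['W22'] or m['W11']
    return r1 and r2 and r3

# KB entails W1,3 iff W13 is true in every model in which the KB holds.
entailed = all(m['W13']
               for values in product([True, False], repeat=len(symbols))
               for m in [dict(zip(symbols, values))]
               if kb_holds(m))                             # → True
```

This illustrates the entailment-via-models view from the previous section: the rule-based derivation and the model enumeration agree.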


First-Order Logic

Propositional logic can only represent facts about the world.
First-order logic describes a world which consists of objects and properties (or predicates) of those objects.
Among objects, various relations hold, e.g., Parent(Martin, Zac).
A function is a relation in which there is only one value for a given input.

Objects: people, houses, numbers, planets,...
Relations: parent, brother-of, greater-than,...
Properties: red, round, prime,...
Functions: father-of, one-more-than

"One plus one equals two."
"Squares neighbouring the Wumpus are smelly."
First-order logic is universal in the sense that it can express anything that can be programmed.

Syntax and semantics

First-order logic has sentences just like propositional logic and, in addition, it has terms, which represent objects. Constant symbols, variables and function symbols are used to build terms and quantifiers and predicate symbols are used to build sentences.

<Sentence> := <AtomicSentence>
                          | <Sentence> <Connective> <Sentence>
                          | <Quantifier> <Variable>,... <Sentence>
                          | ¬<Sentence>
                          | (<Sentence>)

<Atomic Sentence> := <Predicate>(<Term>,...)
                         | <Term> = <Term>

<Term> := <Function>(<Term>,...)
                          | <Constant>
                          | <Variable>

<Connective> := ∧ | ∨ | <=> | =>
<Quantifier> := ∀ | ∃
<Constant> := Martin | 59302 | Cat | X | ...
<Variable> := a | x | s | ...
<Predicate> := Before | Likes | Raining | Fails | ...
<Function> := Father | Hairof | 304gradefor | ...

BNF for first-order logic

Constant symbols: A, B, C, 1, John,...
 Each constant symbol names exactly one object in the world, not all objects need to have names and some can have more than one name.

Predicate symbols: Round, Brother,...
A predicate symbol refers to a particular relation in the model. For example, Brother is a binary relation symbol, so the relation it refers to must also be binary, i.e., it must hold or fail to hold between pairs of objects.

Function symbols: Cosine, Fatherof, Leftlegof
A functional relation relates an object to exactly one other object. The last element in the tuple is the value of the function for the other elements. e.g. Officeof(Martin,sc1.40)

A term is a logical expression that refers to an object. Constant symbols are terms. Terms can also be constructed from function symbols and constant symbols, e.g., Fatherof(John).

The formal semantics of terms is as follows: an interpretation specifies a functional relation referred to by the function symbol and objects referred to by the constant symbols. Hence, a function term refers to the (n+1)th object in a tuple whose first n elements are the objects referred to by the arguments of the function.

Atomic sentences
An atomic sentence is formed from a predicate symbol followed by a parenthesized list of terms; for example, Brother(Richard,John) states that the object referred to by Richard is the brother of the object referred to by John.
Atomic sentences can have arguments that are complex terms:

Complex sentences
We can use logical connectives to construct more complex sentences. The semantics of these is the same as in propositional logic.

Brother(Richard,John) ∧ Brother(John,Richard) is true just in case John is the brother of Richard and Richard is the brother of John.
Older(John,30) ∨ Younger(John,30) is true when John is older than 30 or he is younger than 30.

We now have a logic that allows objects; quantifiers let us express properties of entire collections of objects.
There are two quantifiers in first-order logic: universal and existential.

Universal quantification (∀)
Using this we can say things like, "All cats are mammals":

∀x Cat(x) => Mammal(x)

Notice that => rather than ∧ is used with ∀: writing ∀x Cat(x) ∧ Mammal(x) would claim that every object is both a cat and a mammal. Universal statements are true if they are true for every individual in the world. They can be thought of as an infinite conjunction.

Existential Quantification (∃)
Where universal quantification makes statements about every object, existential quantification makes statements about some object. To say, for example, that Spot has a sister that is a cat, we write:

∃x Sister(x, Spot) ∧ Cat(x)

∃x P is true if P is true for some object in the world. It can be thought of as an infinite disjunction.

Nested quantifiers
Very complex statements can be made if one nests quantifiers. Starting simply, without mixing types of quantifiers, we can say things like

We can also mix quantifiers, as in "Everybody is good at something"

Connections between ∀ and ∃
There is an intimate connection between the two quantifiers. To see this, consider the sentence

"For all x, x does not like Deceitful Politicians."
Another way to say this is, "There does not exist an x that likes Deceitful Politicians." This holds in general because ∀ is a conjunction over all objects and ∃ is a disjunction over all objects. In fact, all of the following are also true:
∀x ¬P ⇔ ¬∃x P        ¬(P ∧ Q) ⇔ ¬P ∨ ¬Q
¬∀x P ⇔ ∃x ¬P        ¬(P ∨ Q) ⇔ ¬P ∧ ¬Q
∀x P ⇔ ¬∃x ¬P        P ∧ Q ⇔ ¬(¬P ∨ ¬Q)
∃x P ⇔ ¬∀x ¬P        P ∨ Q ⇔ ¬(¬P ∧ ¬Q)
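Over a finite domain, ∀ behaves like a conjunction (Python's all) and ∃ like a disjunction (any), so the dualities can be checked directly. A small sketch; the domain and the property P are arbitrary illustrative choices:

```python
# Checking the quantifier dualities over a small finite domain.
# In a finite world "for all x" is a conjunction and "there exists x"
# is a disjunction, so the dualities follow from De Morgan's laws.

domain = [1, 2, 3, 4, 5]

def P(x):
    return x % 2 == 0  # an arbitrary property: "x is even"

# forall x not P  <=>  not exists x P
assert all(not P(x) for x in domain) == (not any(P(x) for x in domain))

# not forall x P  <=>  exists x not P
assert (not all(P(x) for x in domain)) == any(not P(x) for x in domain)

# forall x P  <=>  not exists x not P
assert all(P(x) for x in domain) == (not any(not P(x) for x in domain))
```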
Often the equality symbol is included as a special symbol. This is because the notion of equality is so important in our thinking. With this symbol, we can write things like Father(John)=Bill, whose intended meaning is that the object that is the father of John is the same object as Bill.
Equality can also be thought of as an ordinary binary relation symbol, so the interpretation of = is a set of pairs.
Equality can be used to say that there are two or more individuals with a particular property: "There is an x and a y that are sisters of Spot and they are not the same individual."
The equality symbol can also be used to restrict the number of objects that have a certain property, for example, "Every pair of objects with property P are equal." This statement restricts there to being at most one object with property P.
A shorthand that is often used is ∃!x King(x), which means that there is exactly one x such that King(x) is true.

Using First-Order Logic

In knowledge representation, a domain is a section of the world about which we wish to express some knowledge.
Here we give some simple examples of using FOL to encode domains.

The Kinship domain

∀m,c Mother(c)=m ⇔ Female(m) ∧ Parent(m,c)
∀w,h Husband(h,w) ⇔ Male(h) ∧ Spouse(h,w)
∀x Male(x) ⇔ ¬Female(x)
∀g,c Grandparent(g,c) ⇔ ∃p Parent(g,p) ∧ Parent(p,c)
∀x,y Sibling(x,y) ⇔ x≠y ∧ ∃p Parent(p,x) ∧ Parent(p,y)

Axioms, Definitions and Theorems
Axioms capture the basic facts about a domain, e.g., the examples above are axioms. The axioms are then used to prove theorems.
We can have too few or too many axioms. If we have too few, then we cannot prove theorems that we expect should actually follow in a domain. If we have too many axioms, then some axioms follow from (combinations of) others. An axiom is said to be independent of a set of axioms if it does not follow from that set.
None of the primitive predicates (functions) are defined, instead we give partial specifications of the properties of these.

Asking questions and getting answers

To add sentences to the knowledge base, we call TELL, e.g.,
   TELL(KB, ∀m,c Mother(c)=m ⇔ Female(m) ∧ Parent(m,c))
This goes both for axioms (as above) and for specific facts about a particular situation. With these facts added, we can ASK whether a particular sentence follows from the knowledge base and receive a yes/no answer. We can also ask questions that return additional information in the answers, such as
   ASK(KB, ∃x Child(x,Spot)).
Here we do not want simply a yes/no answer. In addition we would like to know a term for x denoting an object in the domain. In general, for a query involving existentially quantified variables, we want to know bindings for those variables. Hence, ASK returns a binding list, e.g., {x/boots}.

Logical Agents for the Wumpus World

We will consider three agent architectures for the Wumpus World: reflex agents that merely classify their percepts and act on that classification, model-based agents that construct an internal representation of the world and use it to act, and goal-based agents that form goals and try to achieve them.

The first step is to define the interface between the agent and the world. The percept sequence must contain both the percepts and the time at which they occurred, otherwise the agent will not be able to tell when things were observed. We will use integers for time steps so a typical percept sentence would be

The agent's action must be one of:
Turn(Right), Turn(Left), Forward, Shoot, Grab, Release, Climb
To determine which one is best, we create a query such as ∃a Action(a,5). If we set things up correctly, such a query will return a binding list such as {a/Grab}.

A Simple Reflex Agent

The simplest type of rule connects actions directly with percepts; these rules resemble reflexes. For example, if the agent sees a glitter, it should perform a Grab. The connection between percepts and actions can be mediated by rules for perception, which turn the immediate perceptual input into more useful forms. A connection can then be made from these predicates to action choices with separate rules.

Limitations of simple reflex agents
Simple reflex agents react only to the current percept. Therefore, they cannot know what was discovered in previous steps and are doomed to a life of wandering aimlessly.

Representing Change in the World

We would like to have rules that refer to previous times. The best way to do this is to maintain an internal model of the world.
The most straightforward way to deal with change is simply to change the knowledge base, erasing and adding sentences as the agent explores its environment. This approach solves some of the problems we saw with the propositional agent. However, it does not allow the agent to keep track of the past.

Situation Calculus
Situation Calculus is the name for a particular way of describing change in FOL. It conceives of the world as a sequence of situations, each of which is a snapshot of the state of the world. Situations are generated from previous situations by actions.
Every relation whose truth can change over time is handled by giving an extra situation argument to the corresponding predicate symbol. By convention, we make the situation argument the last one. Hence, instead of At(Agent,Location), we would have sentences such as At(Agent,[1,1],S0) and At(Agent,[1,2],S1).
Relations or properties that do not change over time need not have the extra argument, e.g., WallAt([0,1]).
The next step is to represent how the world changes from one situation to the next. One way to do this is to have a function result that maps actions and states to new states. Then we would have:

Actions are described by stating their effects: we specify the properties of the situation that results from doing the action. These axioms are called effect axioms. Alone they are not sufficient to keep track of whether the agent is holding the gold. We also need to say that if the agent is holding something in a state then it is still holding it in the state that results from any action other than Release. Axioms like these are called frame axioms. The combination of effect and frame axioms provides a complete description of how the world changes in response to an agent's actions.

A more elegant representation combines the effects and frame axioms into a single axiom that describes how to compute the next time step. These axioms have the structure

The use of <=> ensures that a predicate will be true if it is made true or stays true and that it will be false in all other cases.

∀a,x,s Holding(x,Result(a,s)) ⇔ [(a=Grab ∧ Present(x,s) ∧ Portable(x)) ∨ (Holding(x,s) ∧ a≠Release)]

This type of axiom is called a successor-state axiom. One such axiom is needed for each predicate that may change its value over time. The axiom must list all the ways in which the predicate can become true and all the ways in which it can become false.
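As a rough illustration, a successor-state axiom for Holding can be evaluated over situations built up as Result(a,s) terms. The following Python sketch is a toy model only: the object name, the set of portable objects, and the simplistic Present predicate are invented for illustration.

```python
# A toy model of the Holding successor-state axiom. Situations are
# represented as nested tuples (action, previous_situation), with "S0"
# as the initial situation. The world below is invented for illustration.

PORTABLE = {"gold"}

def present(x, s):
    # Illustrative assumption: the gold is present wherever the agent is;
    # a fuller model would also track locations.
    return x == "gold"

def holding(x, s):
    if s == "S0":
        return False                       # holding nothing initially
    action, prev = s                       # s = Result(action, prev)
    # The successor-state axiom: Holding becomes true via Grab, and
    # stays true under any action other than Release.
    return (action == "Grab" and present(x, prev) and x in PORTABLE) \
        or (holding(x, prev) and action != "Release")

s1 = ("Grab", "S0")
s2 = ("Forward", s1)
s3 = ("Release", s2)
```

Evaluating holding("gold", s) down the chain shows the Grab making it true, Forward preserving it, and Release making it false.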

Keeping track of location

In the Wumpus world, the agent cannot directly perceive its location, so it needs to keep track of where it has been. It's also a good idea for it to keep track of what it has seen in previous locations. The agent needs to know its location and orientation:

The agent also needs to know how the locations are arranged. Its "map" consists of the values of the function LocationToward, which takes a location and a direction and gives the location that is one step forward in the given direction. From the map it is possible to tell which square is directly ahead of the agent. The agent also needs to know whatever is known about the contents of the locations; assume that the perimeter locations contain walls and the rest do not. The agent needs to know what the actions do to locations, and finally what the actions do to orientations.
  In addition to keeping track of location and the gold, the agent should also keep track of whether the Wumpus is alive or dead. We also need to describe shoot.

Deducing Hidden Properties of the World

From the perceptual information we obtain in situations, we can infer properties of locations. Neither Breezy nor Smelly need situation arguments because pits and Wumpuses do not move around.

We need to write some rules that relate various aspects of single world states (as opposed to across states). There are two main kinds of such rules:

Why do we need both causal and diagnostic rules? It would seem that diagnostic rules are enough. However, it is very tricky to ensure that they derive the strongest possible conclusions from the available information. For example, the absence of stench or breeze implies that adjacent squares are OK: but sometimes a square can be OK even when smells and breezes abound. The model-based rule is probably the best was to represent safety.

Why do we need both causal and diagnostic rules? It would seem that diagnostic rules are enough. However, it is very tricky to ensure that they derive the strongest possible conclusions from the available information. For example, the absence of stench or breeze implies that adjacent squares are OK, but sometimes a square can be OK even when smells and breezes abound. The model-based rule is probably the best way to represent safety. The important thing to remember is that if the axioms correctly and completely describe the way the world works and the way percepts are produced, the inference procedure will correctly infer the strongest possible description of the world state given the available percepts.

Preference Among Actions

A problem with the Wumpus world knowledge base that we have built so far is that it is difficult to decide which action is best among a number of possibilities. For example, to decide between a forward and a grab, axioms describing when it is OK to move to a square would have to mention glitter. This is not modular. We can solve this problem by separating facts about actions from facts about goals. This way our agent can be reprogrammed just by asking it to achieve different goals.

The first step is to describe the desirability of actions independent of each other. In doing this we will use a simple scale: actions can be Great, Good, Medium, Risky, or Deadly. Obviously, the agent should always do the best action it can find:

We use this action quality scale in the following way. Up to the point where it finds the gold, the basic strategy for our agent will be: Once the gold is found the policy must change. At this point the agent should have the goal of safely reaching square [1,1] as quickly as possible. The presence of an explicit goal allows the agent to work out a sequence of actions that will achieve the goal. There are at least three ways to find such a sequence:

Lecture 5

Pattern Recognition


Feature Extraction:

Classification - Training and Testing - Over-training and Under-training

It is easy to learn the training set: just remember every pattern. Unfortunately this is of little use for classifying the test set - this is over-training. Similarly, if the system learns too little about the training set then it will classify badly.

The system must not overgeneralise or undergeneralise about patterns.

Bayesian Classification

For independent events:

Bayes Rule

For pattern classification:

where x is the feature vector and wi is class i.

Pr(wi) is the prior probability of the pattern being in class i.

p(x | wi) is the probability density function for the feature vector.
p(x) = Σi p(x | wi) Pr(wi)

From Bayes Rule:

The optimal decision is the class i for which Pr(wi | x) is largest.

Bayesian theory gives us the optimum decision boundary but in practice we don't have enough information to calculate this boundary.
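The decision rule can be sketched in a few lines of Python. The priors and class-conditional probabilities below are invented for illustration; in practice they must be estimated from data (which is exactly where the difficulty lies).

```python
# A minimal Bayes classifier for a single discrete feature x in {0, 1}.
# The priors and likelihoods are illustrative values, not real estimates.

priors = {"w1": 0.6, "w2": 0.4}                 # Pr(wi)
likelihood = {"w1": {0: 0.8, 1: 0.2},           # p(x | wi)
              "w2": {0: 0.3, 1: 0.7}}

def posterior(cls, x):
    # Bayes rule: Pr(wi | x) = p(x | wi) Pr(wi) / p(x)
    px = sum(likelihood[c][x] * priors[c] for c in priors)
    return likelihood[cls][x] * priors[cls] / px

def classify(x):
    # optimal decision: the class with the largest posterior
    return max(priors, key=lambda c: posterior(c, x))
```

For x = 0 the posterior favours w1 (0.48 vs 0.12 before normalisation); for x = 1 it favours w2.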

Nearest Neighbour Classification

A pattern is classified as belonging to the class of the training pattern that is closest to it. To measure closeness use a distance metric.

For a feature vector x = {x1, x2, x3,.....,xn}

and a training pattern t = {t1, t2, t3,.....,tn}

  1. Euclidean distance: D² = Σi (xi - ti)²
  2. Dot product distance: D = Σi xi ti
  3. Angle between vectors (cosine): D = Σi xi ti / (|x| |t|)



Nearest neighbour classification is prone to errors due to rogue patterns. A rogue pattern is a mislabelled pattern.

K Nearest Neighbour

To eliminate the problem caused by rogue patterns use not just the nearest neighbour but a group of them.

Using K neighbours, take a majority vote of their classes to give a classification.
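The nearest-neighbour and K-nearest-neighbour rules can be sketched as follows; the training patterns and the choice K=3 are illustrative.

```python
# k-nearest-neighbour classification with Euclidean distance.
# The labelled training patterns below are invented for illustration.
from collections import Counter
import math

train = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([0.2, 0.1], "A"),
         ([1.0, 1.0], "B"), ([0.9, 1.1], "B")]

def euclidean(x, t):
    return math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

def knn_classify(x, k=3):
    # sort training patterns by distance, take a majority vote of the
    # classes of the k nearest (k=1 gives plain nearest neighbour)
    nearest = sorted(train, key=lambda p: euclidean(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k > 1 a single rogue (mislabelled) pattern near the query can be outvoted by its correctly labelled neighbours.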


Speeding up nearest neighbour methods.

The biggest problem with this method is the time it takes to calculate the distances to the training examples.
Possible solutions are:

Lecture 6

Neural Networks

Consider the plane shown above; this can be used to make decisions about patterns in the x-y plane. For every point in the plane x-y there is a corresponding value for z where (x,y,z) lies on the angled plane. If z is negative then the point lies above the dotted line, otherwise it is below the dotted line.

The equation for z is

We can generalise this for more dimensions

o is the output, w are the weights and i is the input.

This can be drawn as shown below.

Generally we don't care about the size of the output, it will be a yes/no answer for a class; so use the following:


The function in the box is called the sigmoid function and is:

If t is the expected output then there will be an error between t and o. To train the system we need to minimise this error. For a number of these functions there will be one error - the sum of the individual errors.

To train the classifier, just change each w so that e becomes smaller.

Gradient Descent





Weight updating rule:

where η is a parameter used to control the speed of gradient descent. Note that we want to reduce the error, so the minus sign disappears.

For theta:


This device is sometimes called a perceptron. The training rule is called the delta rule.
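A minimal runnable sketch of delta-rule training for a single sigmoid unit, here learning logical AND (a linearly separable problem). The learning rate, initialisation range and epoch count are illustrative choices.

```python
# Delta-rule training of one sigmoid unit on logical AND.
# Learning rate, weight ranges and epoch count are illustrative.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(patterns, eta=0.5, epochs=5000):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in range(2)]
    theta = random.uniform(-0.5, 0.5)          # the threshold
    for _ in range(epochs):
        for x, t in patterns:
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - theta)
            # delta rule: step each weight down the error gradient
            delta = (t - o) * o * (1 - o)
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            theta -= eta * delta
    return w, theta

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, theta = train(AND)
```

After training, the unit's output is above 0.5 only for the input (1,1), i.e. it has placed its hyperplane correctly for AND.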

Perceptron Learning Algorithm

The Multilayer Perceptron

This perceptron can only learn simple problems.

Lecture 7

The Multilayer Perceptron

The perceptron can only learn simple problems. It can place a hyperplane in pattern space and move the plane until the error is reduced. Unfortunately this is only useful if the problem is linearly separable.

A linearly separable problem is one in which the classes can be separated by a single hyperplane.

It is often the case that a problem is not linearly separable. To solve these we use a Multi-Layer Perceptron (MLP) where one layer feeds into the next. Consider the following problem.

The XOR problem

This is the simplest problem that can not be solved by a perceptron.

For two inputs x1 and x2, the output is the exclusive OR of the inputs.

The pattern space for this problem looks like this:

This cannot be solved using a single line, the solution uses two lines:

A two layer Multi-Layer Perceptron to solve this problem looks like this:

The shape of regions in pattern space that can be separated by a Multi-Layer Perceptron is shown in the following table.

We can see that a three layer MLP can learn arbitrary areas while a two layer MLP can learn convex regions. (if you can draw a line from any point in the region to any other in the region and the line passes out of the region then that region is not convex).

Training the MLP

Last time we saw that the delta rule can be used to train a perceptron. When training the MLP, the error (delta) must be propagated back through the layers. This is called error back-propagation, or just backpropagation.

The following procedure can be used to train a backpropagation network.

t is the target
units[l] is the number of units in layer l
n[l][i] is unit i in layer l
n[l][i].output is the output
n[l][i].delta is the delta
n[l][i].weight[j] is weight j
ek is the learning constant
The adapt() routine works backwards through the layers: it computes a delta for each output unit from the target and the output, computes a delta for each hidden unit from the weighted deltas of the layer above, and then adjusts every weight by the learning constant ek times the unit's delta times the corresponding input.
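This procedure can be sketched as a small runnable program. The 2-2-1 network shape, learning rate, weight ranges and variable names below are illustrative choices, not values taken from the notes.

```python
# Backpropagation for a 2-2-1 network on the XOR problem (sketch).
# Network size, learning rate and initialisation are illustrative.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
random.seed(1)
# two hidden units and one output unit; the last weight of each is a bias
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    hid = [sigmoid(u[0] * x[0] + u[1] * x[1] + u[2]) for u in w_hid]
    out = sigmoid(w_out[0] * hid[0] + w_out[1] * hid[1] + w_out[2])
    return hid, out

def epoch(eta=0.5):
    # one pass over the training set; returns the total squared error
    err = 0.0
    for x, t in XOR:
        hid, out = forward(x)
        err += (t - out) ** 2
        d_out = (t - out) * out * (1 - out)          # output-layer delta
        d_hid = [d_out * w_out[i] * h * (1 - h)      # back-propagated deltas
                 for i, h in enumerate(hid)]
        for i in range(2):                           # update output weights
            w_out[i] += eta * d_out * hid[i]
        w_out[2] += eta * d_out
        for i in range(2):                           # update hidden weights
            w_hid[i][0] += eta * d_hid[i] * x[0]
            w_hid[i][1] += eta * d_hid[i] * x[1]
            w_hid[i][2] += eta * d_hid[i]
    return err
```

Calling epoch() repeatedly drives the error down, giving runs similar to the iteration trace below.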
When this algorithm is applied to the XOR problem, we get the following output.
iteration no 0, inputs 0 1, target 1, output 0.477995 
iteration no 20, inputs 0 0, target 1, output 0.447816 
iteration no 40, inputs 1 0, target 0, output 0.450292 
iteration no 60, inputs 0 0, target 1, output 0.549096 
iteration no 80, inputs 1 0, target 0, output 0.460706 
iteration no 100, inputs 0 0, target 1, output 0.507636 
iteration no 120, inputs 0 1, target 1, output 0.571619 
iteration no 140, inputs 1 0, target 0, output 0.451493 
iteration no 160, inputs 0 1, target 1, output 0.570574 
iteration no 180, inputs 0 0, target 1, output 0.575979 
iteration no 200, inputs 0 1, target 1, output 0.744079 
iteration no 220, inputs 1 0, target 0, output 0.233541 
iteration no 240, inputs 0 1, target 1, output 0.755600 
iteration no 260, inputs 1 1, target 0, output 0.185273 
iteration no 280, inputs 0 1, target 1, output 0.788309 
iteration no 300, inputs 1 1, target 0, output 0.167068 
iteration no 320, inputs 1 0, target 0, output 0.123461 
iteration no 340, inputs 1 1, target 0, output 0.132892 
iteration no 360, inputs 1 1, target 0, output 0.133583 
iteration no 380, inputs 1 1, target 0, output 0.116641 
iteration no 400, inputs 1 0, target 0, output 0.088269 
iteration no 420, inputs 0 0, target 1, output 0.861810 
iteration no 440, inputs 1 1, target 0, output 0.102406 
iteration no 460, inputs 1 0, target 0, output 0.080179 
iteration no 480, inputs 1 0, target 0, output 0.075584 
iteration no 500, inputs 0 0, target 1, output 0.884442 
iteration no 520, inputs 0 0, target 1, output 0.892789 
iteration no 540, inputs 0 1, target 1, output 0.923969 
iteration no 560, inputs 1 0, target 0, output 0.064146 
iteration no 580, inputs 1 1, target 0, output 0.071938 
iteration no 600, inputs 1 1, target 0, output 0.075764 
iteration no 620, inputs 1 1, target 0, output 0.074536 
iteration no 640, inputs 1 1, target 0, output 0.069014 
iteration no 660, inputs 1 1, target 0, output 0.066534 
iteration no 680, inputs 0 0, target 1, output 0.918422 
iteration no 700, inputs 0 0, target 1, output 0.924860 
iteration no 720, inputs 1 1, target 0, output 0.065864 
iteration no 740, inputs 1 0, target 0, output 0.052634 
iteration no 760, inputs 0 0, target 1, output 0.927081 
iteration no 780, inputs 1 0, target 0, output 0.050964 
iteration no 800, inputs 0 1, target 1, output 0.948869 
iteration no 820, inputs 1 0, target 0, output 0.049082 
iteration no 840, inputs 1 0, target 0, output 0.048074 
iteration no 860, inputs 1 1, target 0, output 0.057916 
iteration no 880, inputs 1 1, target 0, output 0.056088 
iteration no 900, inputs 0 1, target 1, output 0.954659 
iteration no 920, inputs 1 1, target 0, output 0.057337 
iteration no 940, inputs 0 0, target 1, output 0.944243 
iteration no 960, inputs 1 0, target 0, output 0.045653 
iteration no 980, inputs 0 0, target 1, output 0.946199

Example: NETtalk

The problem

Convert English text into the vowel and consonant sounds (phonemes) that make up speech. For example consider the following words: albany, Albania, misled and bedraggled.

Traditional Approach

  1. Create if-then rules to encode all the regularities in the language.
  2. Maintain a database of exceptions to these rules.
  3. Build a production system to resolve conflicts.
For example, a ``c'' can either be pronounced like an ``s'' as in center, icy, and city or a ``k'' as in cat and crumb. Here is a rule that might encode this difference:

Neural Network Approach

Allow a system to learn how to pronounce words by giving it lots of examples and correcting its mistakes.
  1. Choose an architecture.
  2. Choose a representation for the input and output.
  3. Determine a training set. 16,000 words were chosen randomly from the 20,000 words in Webster's Dictionary.
  4. Train the system. Give the network each word in the training set several times, and indicate the errors for learning.
  5. Test the system. The network was able to respond correctly to 90% of the 4,000 remaining words from the dictionary.

Advantages of the Network Approach

Steps in the training of NETtalk

Lecture 8

Unsupervised Learning

So far we have always known the correct classification for every member of the training set; this is not always the case.

Patterns are often created by processes we are not able to describe well. In these cases we can try to build a classifier based on inherent properties of the patterns.

This is Unsupervised Learning


If we take speech signals and transform them to the frequency domain we get a speech spectrogram.

The spectrogram is a two dimensional signal showing how the frequencies change over time.

The spectrogram above shows the phrase "In nineteen ninety two".
For human speech we find that the signal consists of two or three main bands called formants. We could try to classify speech by using the positions of the formants (this is not easy because there is wide variation in the way the formants change for a particular sound).

Since there is some inherent structure in the signal we can use unsupervised learning to reduce the quantity of information in the formants.

We need to construct a classifier that minimises not the error between output and target, but the error in the fit of a particular model to the data.

If we use a very complex model it will be able to fit the data quite well.

An example model is a set of points in the pattern space, each point represents the centre of a cluster. An algorithm must be devised to move the points so that an error is reduced.

The K means algorithm:

Assume that there are K clusters in the data, i.e. K classes.
  1. Choose small random values for each cluster mean m1(0)...mK(0).
  2. Use the training set to assign a cluster to each training pattern: a pattern y is in cluster i at time t - Ci(t) - if d(y, mi(t)) ≤ d(y, mj(t)) for all j, where d is some distance metric; i.e. y is in cluster i if i has the closest mean.
  3. Calculate new values for m1(t)...mK(t). For the Euclidean distance, the new mean for cluster i is just the mean of all the patterns assigned to class i.
  4. Repeat from 2 until the means change by less than some threshold, i.e. the clusters are not moving much.
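The algorithm can be sketched in Python for one-dimensional patterns; the data set, K=2 and the fixed iteration count (rather than a convergence test) are illustrative simplifications.

```python
# K-means clustering of one-dimensional patterns (sketch).
# The data and K=2 are illustrative; real data is usually multi-dimensional.
import random

def kmeans(patterns, K, iterations=20):
    random.seed(0)
    means = random.sample(patterns, K)      # initialise from the data
    for _ in range(iterations):
        # step 2: assign each pattern to its closest mean
        clusters = [[] for _ in range(K)]
        for y in patterns:
            i = min(range(K), key=lambda k: abs(y - means[k]))
            clusters[i].append(y)
        # step 3: recompute each mean from its assigned patterns
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return sorted(means)

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
```

On this data the two means settle near the two obvious groups, around 1.0 and 5.0.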
Clustering is applicable to problems where


ISODATA is an algorithm based on K means that does not need to know the number of clusters. It starts with one cluster and splits/merges clusters according to certain parameters.
  1. Minimum distance between two cluster centres below which they are merged.
  2. Maximum standard deviation for a cluster above which it is split into two clusters.
  3. Minimum proportion of training patterns in a cluster.
Amongst many others.

To modify the K means algorithm we need to add an extra step which can split or merge clusters.

Clustering algorithms perform dimensionality reduction: they take a high dimensional pattern space and produce a lower dimensional space. Hopefully the lower dimensional space still contains most of the information from the original space.

Image Processing


What is an Image?

What is Image Processing?

Image Characteristics

Image Processing Characteristics

Image Processing Operations

Image Coding

Image coding reduces the storage requirement for an image by removing redundant information. This may be done in two ways.

Exact Coding

The image may be reconstructed exactly. Standard coding techniques may be used.
Compression of 2:1 is typically achieved.

Fidelity Coding

Applications of Image Coding

Image Enhancement

An image may need to be processed so that it may be viewed. For example, an X-ray image may be displayed with greater contrast in some regions so that very faint features can be seen.



Applications of Image Enhancement

Image Restoration

If an image has known faults then these may often be corrected. A blurred image may be 'deblurred' or a noisy image may have noise removed. Restoration often relies on knowledge of the process that caused the degradation.

Original Image

Noisy Image

Noise removal applied once

Noise Removal applied twice

Original Image

Sharpened Image

Image Segmentation

Sometimes similar areas in an image must be identified, for example from an aerial photograph the land and the sea may need to be segmented.

Original Image

Segmented Image

Feature detection

Often, only part of an image is of interest.

Feature detection creates an image that contains information about the presence of certain types of features. Edges are common features.

Original Image

Edges in Image

Feature Detection using the Hough Transform

Image of Road

Hough Transform of Road Image

Applications of Image Restoration, Segmentation and Feature Detection.


Lecture 10

Convolution operations

The convolution operation is a weighted sum of all the pixels in the neighbourhood. For example, shown below is an area in an image and a weight array:

If I is a 3x3 neighbourhood from the image, and M is an array the same size as I, then the pixel I'11 that replaces I11 is found using:

M is called the convolution mask, and by choosing values for the elements of M, many different operations may be performed. The convolution mask may be of any size, but is commonly 3x3, 5x5 or 7x7. The sum of the elements of M is commonly either 0 or 1. In order to see why, consider an area in an image with pixels of a uniform grey level.

If the sum of the elements in M is 0 then the operator produces an image in which all pixels are zero. Thus such operators are used to detect properties of non uniform regions, for example edges.

If the sum of the elements in M is 1 then the operator produces an image with all pixels unchanged. Thus such operators are used to change the pixels in an image that are in non uniform areas. They may be used, for example, to blur or sharpen an image.

The symbol used to denote a convolution is a star (*).
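The operation at a single pixel can be sketched directly, using the two kinds of mask discussed above (coefficients summing to 1 and to 0); the uniform image fragment is illustrative.

```python
# Applying a 3x3 convolution mask at a single pixel (sketch).
# The image fragment and mask values are illustrative.

def convolve_pixel(I, M, r, c):
    # weighted sum of the 3x3 neighbourhood centred on (r, c)
    return sum(I[r + dr][c + dc] * M[dr + 1][dc + 1]
               for dr in (-1, 0, 1) for dc in (-1, 0, 1))

image = [[10, 10, 10],
         [10, 10, 10],
         [10, 10, 10]]           # a uniform grey-level region

average = [[1/9] * 3 for _ in range(3)]          # coefficients sum to 1
laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # coefficients sum to 0
```

On the uniform region the averaging mask leaves the pixel unchanged (10), while the zero-sum mask produces 0, as the text predicts.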


Convolution Masks for common operations


If all the mask coefficients are positive, then the result will be a blurred image.

The mask to the left uses the average in the neighbourhood. The mask to the right uses a weighted average. Commonly, the weighted average is calculated using a Gaussian distribution so that there is no sudden discontinuity in the mask.


The gradient in an image is important because it gives information about edges. The first and second derivative are often used in edge detection.

First derivative

An image is a two dimensional function, thus the gradient at any point is a vector

The gradient at any point in an image is a vector given by:

The gradient in the x direction in an image I at a point x,y may be approximated as:

gx(x,y)=I(x+0.5,y)-I(x-0.5,y) and, since x and y are integers

gx(x-0.5,y)=I(x,y)-I(x-1,y) or 2gx(x,y)=I(x+1,y)-I(x-1,y)

Similarly the gradient in the y direction is:

gy(x,y-0.5)=I(x,y)-I(x,y-1) or 2gy(x,y)=I(x,y+1)-I(x,y-1)

The convolution masks for this operation are:

In the x direction In the y direction

Other masks are:

Prewitt                                     Sobel
x mask       y mask         x mask              y mask
Isotropic                                  Roberts
x mask        y mask
The magnitude and direction of the gradient may be found using

Often, only the magnitude of the gradient is required. In this case an approximation is usually used.

This needs only a single mask





Second derivative

The magnitude of the gradient may be used to detect edges by thresholding. However, the width of the line in the thresholded image will be proportional to the strength of the edge. If the position of the edge is required then the second derivative can be used. Any place where the second derivative crosses zero will be on an edge.

The centre of the edge occurs at the zero crossing of the second derivative


In one dimension, an approximation to the second derivative can be obtained by using the approximation to the first derivative and applying it again.

f'(x)= f(x+0.5)-f(x-0.5)


f''(x) = f(x+1) - f(x) - (f(x) - f(x-1)) = f(x+1) - 2f(x) + f(x-1)


In two dimensions, the second derivative is the Laplacian: ∇²I = ∂²I/∂x² + ∂²I/∂y².

Convolution Masks that will approximate the Laplacian or the negative Laplacian are:

Combining convolution masks

The following mask will give the average for the pixels in a 3x3 neighbourhood.

If this is applied to an image and then the Laplacian is applied, the result will be equivalent to using the following mask.

Thus, two convolutions may be replaced by one, and vice versa. This is useful when two small convolution masks can be used to replace one large mask. This is only possible for certain masks. If a mask can be split into two one dimensional masks it is said to be separable. The Laplacian operator is not separable, but a similar mask is:






followed by




is equivalent to








In order to be separable each row in the mask must be a multiple of all other rows and each column must be a multiple of all other columns.

Using separable masks it is possible to use large neighbourhood sizes because the number of computations is proportional to the width of the neighbourhood and not its area.
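The equivalence can be checked numerically for the separable 3x3 averaging mask, which splits into a horizontal and a vertical pass of [1 1 1]/3; the image values are illustrative.

```python
# Demonstrating that a separable 3x3 averaging mask gives the same
# result as two one-dimensional passes. The image is illustrative.

image = [[1, 2, 3, 4, 5],
         [2, 3, 4, 5, 6],
         [3, 4, 5, 6, 7],
         [4, 5, 6, 7, 8],
         [5, 6, 7, 8, 9]]

def conv2d_at(I, r, c):
    # full 3x3 average in one step: 9 multiply-adds per pixel
    return sum(I[r + dr][c + dc] for dr in (-1, 0, 1)
               for dc in (-1, 0, 1)) / 9.0

def rows_then_cols_at(I, r, c):
    # horizontal 1D average followed by vertical 1D average:
    # 3 + 3 = 6 operations per pixel, and the saving grows with mask width
    row_avg = lambda rr, cc: (I[rr][cc - 1] + I[rr][cc] + I[rr][cc + 1]) / 3.0
    return (row_avg(r - 1, c) + row_avg(r, c) + row_avg(r + 1, c)) / 3.0
```

Both routes give identical results at every interior pixel, while the separable form needs a number of operations proportional to the mask width rather than its area.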

Edge Detection Using the Laplacian

The zero crossings of Laplacian

As described earlier, the edges in an image correspond to the zero crossings of the Laplacian. These may be found by examining the pixels in a 2x2 neighbourhood, i.e.





if ac<t or ab<t or ad<t there is an edge at a

The value of t must be set so that noise pixels do not produce edges.

Laplacian of a Gaussian

The Laplacian locates edges, but it does not differentiate between edges of different widths. In practice some edges are more important than others. For example, given a face image, a single hair will generate an edge, as will the border between face and background. In order to detect edges with a large width, a larger convolution mask is necessary.

If the image is blurred by a Gaussian then the Laplacian will only detect wide edges. These two operations are usually combined by finding the Laplacian of a Gaussian operator.


This may be extended to two dimensions and used to generate a convolution mask.

Non Linear Neighbourhood Operations.

The convolution operation is a linear process. If there is a conditional step in the process of obtaining a pixel value from a neighbourhood, then it is said to be non linear.


The maximum value in the neighbourhood is chosen. The maximum operation causes light areas to expand; it is sometimes called the dilation or expand operation. It is useful for displaying data which contains isolated points and in segmentation.


This operation involves choosing the minimum in the neighbourhood. Sometimes called an erode or shrink operation, it is often used in conjunction with the maximum operator to perform fuzzy morphological operations on grey scale images.

Median Filter

The most commonly used non linear neighbourhood operator is the median filter.

The pixel is replaced by the median value from within the neighbourhood. The pixel values must be sorted and the middle value chosen. For example, given the following neighbourhood:







These values are sorted to give





In this case the median value is 20.

The median filter is important because it removes pixels with values that are not similar to those in the neighbourhood while preserving edges. To illustrate this, consider the one dimensional case shown below.

The example to the left shows a feature that has a size less than half the width of the neighbourhood. In this case there is no point where the median value will be taken from inside the feature. Thus it is removed.

The example to the right shows a feature that has a size greater than half the width of the neighbourhood. In this case the median will always be the same as the value in the centre of the neighbourhood. This feature remains intact.

This property of the median filter is important because other noise removal methods, for example blurring, will change edge features.

The disadvantage of the median filter is that it is computationally intensive because it involves sorting. It is possible to use a moving histogram to reduce the number of calculations for large neighbourhood sizes.
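A minimal sketch of the filter described above (pure Python, borders left unchanged for simplicity):

```python
def median_filter(image, size=3):
    """size x size median filter; border pixels are left unchanged."""
    half = size // 2
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(half, h - half):
        for x in range(half, w - half):
            values = sorted(image[y + dy][x + dx]
                            for dy in range(-half, half + 1)
                            for dx in range(-half, half + 1))
            out[y][x] = values[len(values) // 2]  # middle of sorted list
    return out

# The isolated noisy pixel (100) is replaced by a value typical of its
# neighbourhood, while the step edge between the 10s and 50s survives.
noisy = [[10, 10, 50, 50],
         [10, 100, 50, 50],
         [10, 10, 50, 50],
         [10, 10, 50, 50]]
clean = median_filter(noisy)
```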

Example of the Median Filter

Original Image 1/4 of all pixels replaced by random values

3x3 Median Filter applied once 3x3 Median Filter applied twice

The Mode Filter

The mode filter replaces a pixel with the most common pixel value in the neighbourhood. This is only useful if there are a small number of grey levels. If grey levels are used to represent segmented regions then noise can be removed by using the mode filter.

K Nearest Neighbour

Instead of using all the pixels in the neighbourhood, use the K pixels closest in value to the central pixel and replace it with the average of these. For example, if K=6:

    15  35  18
    20  10  20
    30 100  40
The 6 pixels nearest 10 are:




10,15,18,20,20 and 30




Average = 113/6 ≈ 19

This operator performs a smoothing operation without including extreme values (the 100 in this case).
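A sketch of the operator for a single pixel; the 3x3 neighbourhood and the use of absolute difference as the "closeness" measure follow the worked example above:

```python
def knn_average(image, y, x, k=6):
    """Replace pixel (y, x) by the mean of the k neighbourhood values
    closest to it (3x3 neighbourhood, central pixel included)."""
    centre = image[y][x]
    values = [image[y + dy][x + dx]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    nearest = sorted(values, key=lambda v: abs(v - centre))[:k]
    return sum(nearest) / k

neighbourhood = [[15, 35, 18],
                 [20, 10, 20],
                 [30, 100, 40]]
# The 6 values nearest the central 10 are 10, 15, 18, 20, 20 and 30,
# so the result is 113/6, roughly 19; the outlying 100 is ignored.
result = knn_average(neighbourhood, 1, 1)
```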

Maximum Homogeneity

Calculate the average for a number of neighbourhoods around a pixel and use the closest to the pixel value.

Calculate the average in each neighbourhood n1..n5 and use the closest of these to the original pixel value.
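The original figure defining neighbourhoods n1..n5 is not reproduced here; the sketch below assumes five 3x3 windows containing the pixel (centred, and shifted up, down, left and right), which is one common choice:

```python
def max_homogeneity(image, y, x):
    """Average five 3x3 neighbourhoods containing (y, x), then return
    the average closest to the pixel's own value."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    centre = image[y][x]
    best = None
    for oy, ox in offsets:
        values = [image[y + oy + dy][x + ox + dx]
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        avg = sum(values) / len(values)
        if best is None or abs(avg - centre) < abs(best - centre):
            best = avg
    return best

# The pixel at (2, 2) sits on a boundary between a region of 10s and a
# region of 50s; the chosen average comes from its own (10) side.
img = [[10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10],
       [10, 10, 10, 50, 50],
       [10, 10, 50, 50, 50],
       [10, 50, 50, 50, 50]]
smoothed = max_homogeneity(img, 2, 2)
```

Because the selected neighbourhood lies mostly within one region, this operator smooths noise without blurring across edges.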

Image Enhancement using the Laplacian

The Laplacian operation may be used to enhance an image.

f(x) f''(x) f(x)-f''(x)

If the second derivative of a function is subtracted from the function then sharp transitions in the function are emphasised as shown above. Similarly, the Laplacian may be used to enhance the edges in an image.
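A sketch of this enhancement using the common 4-neighbour discrete Laplacian. In practice the result would be clipped back into the display range; that step is omitted here:

```python
def sharpen(image):
    """Subtract a discrete 4-neighbour Laplacian from the image;
    border pixels are left unchanged."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            out[y][x] = image[y][x] - lap
    return out

# A step edge: pixels on either side of the transition are pushed
# further apart, emphasising the edge.
step = [[10, 10, 50, 50],
        [10, 10, 50, 50],
        [10, 10, 50, 50]]
sharp = sharpen(step)
```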

Image Laplacian Edge Enhancement


Original                    Sharpened using Laplacian

Binary neighbourhood operators

Often it is necessary to process binary images. Since each pixel may take only two values, the pixel is primarily used to denote the presence or absence of a feature. This feature may be the result of previous processing, for example edge detection. Often the binary image gives information about the shape of an object. Neighbourhood operators for binary images are called morphological operators because they deal with shape information.


In order to use morphological operators it is necessary to decide how one pixel is connected to another pixel.

Eight Connected

The simplest method for deciding if a pixel is joined to one of its neighbours, is to look at all eight of the neighbours. If a pixel in this 8 neighbourhood is set, then it is joined to the central pixel. Eight connectedness correctly connects diagonal lines but it also connects the background across a diagonal line. Thus when eight connectedness is used the background must be treated differently to the foreground.

Four Connected

Pixels are four connected if they are joined to the left, right, above or below, but not diagonally.


Eight Connected Figure    Four Connected Figure

Shrink and Expand

Two important morphological operators are shrink and expand. Shrink will clear the pixels that have any neighbours clear. Expand will set the pixels that have any neighbours set.
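A direct sketch of the two operators on a 0/1 image, using 8-connected neighbourhoods. Treating pixels outside the image as clear is an assumption of this sketch:

```python
def expand(image):
    """Set every pixel that has any set pixel in its 8-neighbourhood."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and image[ny][nx]:
                        out[y][x] = 1
    return out

def shrink(image):
    """Clear every pixel that has any clear 8-neighbour (pixels
    outside the image count as clear)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w) or not image[ny][nx]:
                        out[y][x] = 0
    return out

blob = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
grown = expand(blob)    # the set area spreads over the whole image
shrunk = shrink(blob)   # the small blob disappears
```

Close and Open then follow directly as `shrink(expand(image))` and `expand(shrink(image))` respectively.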

Close and Open

An expand followed by a shrink is called a Close because it will remove small gaps between objects.

A shrink followed by an expand is called an Open because it will keep small gaps between objects open.

Shrink, Expand, Open and Close are illustrated below.

original expand shrink close open

Shrink and expand are analogous to the neighbourhood operators min and max.

The edges in an image may be found by taking the original away from the expanded image, or taking the shrunk image away from the original. These correspond to the outer 4 connected edge and the inner 4 connected edge respectively.


The shrink operator may be used to reduce the size of a region of set pixels, but if it is repeatedly applied then the region will eventually disappear. If the shape of the region is important then, rather than shrinking to nothing, an operator should shrink the region to its skeleton.

The skeleton (or medial axis) is a set of points which are equidistant from the two or more closest edge points in the image.

Skeletonisation may be performed by repeatedly applying an algorithm which thins the image until the skeleton remains.


Binary morphological operators may be implemented by using a lookup table. The pixels in the neighbourhood are combined to form a binary word. This word will be between 0 and 511, for a 3x3 neighbourhood. This word is used to address a table which contains a new value for the central pixel.
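The table-driven scheme might be sketched as follows. The bit ordering within the 9-bit word is an arbitrary choice, and the table shown implements expand (set the pixel if any bit of the word is set):

```python
def neighbourhood_word(image, y, x):
    """Pack the 3x3 neighbourhood of (y, x) into a 9-bit word (0..511),
    scanning left-to-right, top-to-bottom."""
    word = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            word = (word << 1) | image[y + dy][x + dx]
    return word

# A 512-entry table implementing 'expand': the new central pixel is set
# whenever the word is non-zero.
expand_table = [1 if w != 0 else 0 for w in range(512)]

img = [[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0]]
new_centre = expand_table[neighbourhood_word(img, 1, 1)]
```

Any 3x3 binary operator (shrink, thinning steps, etc.) can be implemented by filling the 512-entry table differently, so one pass of table lookups replaces per-pixel logic.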

Image Transforms - Hough Transform

The Hough Transform

The Hough transform is a technique which can be used to isolate features of a particular shape within an image.

Because it requires that the desired features be specified in some parametric form, the classical Hough transform is most commonly used for the detection of regular curves such as lines, circles, ellipses, etc.

A generalized Hough transform can be employed in applications where a simple analytic description of a feature(s) is not possible.

The Hough transform has many applications, as most manufactured parts (and many anatomical parts investigated in medical imagery) contain feature boundaries which can be described by regular curves or straight lines.

The main advantage of the Hough transform is that it is tolerant of gaps in feature boundary descriptions and is relatively unaffected by image noise.

How It Works

The Hough technique is useful for computing a global description of a feature(s) (where the number of solution classes need not be known a priori), given (possibly noisy) local measurements.

The motivating idea behind the Hough technique for line detection is that each input measurement (e.g. coordinate point) indicates its contribution to a globally consistent solution (e.g. the physical line which gave rise to that image point).

As a simple example, consider the common problem of fitting a set of line segments to a set of discrete image points (e.g. pixel locations output from an edge detector). The diagram below shows some possible solutions to this problem.

a) Coordinate points. b) and c) Possible straight line fittings.

We can analytically describe a line segment in a number of forms. However, a convenient equation for describing a set of lines uses the parametric or normal notation:

    x·cos θ + y·sin θ = r

where r is the length of a normal from the origin to this line and θ is the orientation of r with respect to the X-axis. (See Figure 2.) For any point (x, y) on this line, r and θ are constant.

Figure 2 Parametric description of a straight line.

In an image analysis context, the coordinates of the point(s) of edge segments (i.e. (x_i, y_i)) in the image are known and therefore serve as constants in the parametric line equation, while r and θ are the unknown variables we seek. If we plot the possible (r, θ) values defined by each (x_i, y_i), points in cartesian image space map to curves (i.e. sinusoids) in the polar Hough parameter space. This point-to-curve transformation is the Hough transformation for straight lines. When viewed in Hough parameter space, points which are collinear in the cartesian image space become readily apparent as they yield curves which intersect at a common (r, θ) point.

The transform is implemented by quantizing the Hough parameter space into finite intervals or accumulator cells (i.e. a multidimensional array). As the algorithm runs, each (x_i, y_i) is transformed into a discretized (r, θ) curve and the accumulator cells which lie along this curve are incremented.

Peaks in the accumulator array represent strong evidence that a corresponding straight line exists in the image.
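The voting procedure for straight lines can be sketched as follows; the accumulator resolution and the quantisation of r are illustrative choices:

```python
import math

def hough_lines(points, width, height, r_steps=100, theta_steps=180):
    """Accumulate votes in (r, theta) space for a list of (x, y) edge
    points, using r = x*cos(theta) + y*sin(theta)."""
    r_max = math.hypot(width, height)
    acc = [[0] * theta_steps for _ in range(r_steps)]
    for x, y in points:
        for t in range(theta_steps):
            theta = t * math.pi / theta_steps
            r = x * math.cos(theta) + y * math.sin(theta)
            # map r from [-r_max, r_max] onto an accumulator row index
            r_index = int((r + r_max) / (2 * r_max) * (r_steps - 1))
            acc[r_index][t] += 1
    return acc

# Collinear points (the horizontal line y = 5) all vote for a common
# cell, producing a peak equal to the number of points.
points = [(x, 5) for x in range(20)]
acc = hough_lines(points, 20, 20)
peak = max(max(row) for row in acc)
```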

We can use this same procedure to detect other features with analytical descriptions. For instance, in the case of circles, the parametric equation is

    (x - a)^2 + (y - b)^2 = r^2

where a and b are the coordinates of the center of the circle and r is the radius. In this case, the computational complexity of the algorithm begins to increase as we now have three coordinates in the parameter space and a 3-D accumulator. (In general, the computation and the size of the accumulator array increase polynomially with the number of parameters. Thus, the basic Hough technique described here is only practical for simple curves.)

The Hough transform can be used to identify the parameter(s) of a curve which best fits a set of given edge points.

This edge description is commonly obtained by using an edge detector such as the zero crossings of the Laplacian. The edge image may be noisy, i.e. it may contain multiple edge fragments corresponding to a single whole feature.

Since the output of an edge detector defines only where features are in an image, the work of the Hough transform is to determine both what the features are (i.e. to detect the feature(s) for which it has a parametric description) and how many of them exist in the image.

In order to illustrate the Hough transform in detail, we begin with the simple image of two occluding rectangles,


An edge detector can produce a set of boundary descriptions for this part, as shown in


Here we see the overall boundaries in the image, but this result tells us nothing about the identity (and quantity) of feature(s) within this boundary description. In this case, we can use the Hough (line detecting) transform to detect the eight separate straight lines segments of this image and thereby identify the true geometric structure of the subject.

If we use these edge/boundary points as input to the Hough transform, a curve is generated in polar (r, θ) space for each edge point in cartesian space. The accumulator array, when viewed as an intensity image, looks like


Enhancing the contrast of this image allows us to see the patterns of information contained in the low intensity pixel values.


Although r and θ are notionally polar coordinates, the accumulator space is plotted rectangularly.

Note that the accumulator space wraps around at the vertical edge of the image such that, in fact, there are only 8 real peaks.

Curves generated by collinear points in the gradient image intersect at (r, θ) peaks in the Hough transform space. These intersection points characterize the straight line segments of the original image.

There are a number of methods which one might employ to extract these bright points, or local maxima, from the accumulator array.

For example, a simple method involves thresholding and then applying some thinning to the isolated clusters of bright spots in the accumulator array image.

Here we use a relative threshold to extract the unique (r, θ) points corresponding to each of the straight line edges in the original image.

(In other words, we take only those local maxima in the accumulator array whose values are equal to or greater than some fixed percentage of the global maximum value.)

Mapping back from Hough transform space (i.e. de-Houghing) into cartesian space yields a set of line descriptions of the image subject. By overlaying this image on an inverted version of the original, we can confirm the result that the Hough transform found the 8 true sides of the two rectangles.


The accuracy of alignment of detected and original image lines is not perfect due to the quantization of the accumulator array.

Note also that the lines generated by the Hough transform are infinite in length. If we wish to identify the actual line segments which generated the transform parameters, further image analysis is required in order to see which portions of these infinitely long lines actually have points on them.

To illustrate the Hough technique's robustness to noise, the edge description has been corrupted by noise.


The result, plotted in Hough space, is


De-Houghing this result (and overlaying it on the original) yields


(As in the above case, the relative threshold is 40%.)

The sensitivity of the Hough transform to gaps in the feature boundary can be investigated by transforming the following image


The Hough representation is


The de-Houghed image is


In this case, because the accumulator space did not receive as many entries as in previous examples, only 7 peaks were found, but these are all structurally relevant lines.

We will now show some examples with natural images. In the first case, we have a city scene where the buildings are obstructed in fog,


If we want to find the true edges of the buildings, an edge detector cannot recover this information very well, as shown in:


However, the Hough transform can detect some of the straight lines representing building edges within the obstructed region. The accumulator space representation of the original image is shown here:


If we set the relative threshold to 70%, we get the following de-Houghed image


Only a few of the long edges are detected here, and there is a lot of duplication where many lines or edge fragments are nearly collinear. Applying a more generous relative threshold, i.e. 50%, yields


This yields more of the expected lines, but at the expense of many spurious lines arising from the many collinear edge fragments.

Another example comes from a remote sensing application. Here we would like to detect the streets in the image


of a reasonably rectangular city sector. We can edge detect the image as follows:


However, street information is not available as output of the edge detector alone. The image


shows that the Hough line detector is able to recover some of this information. Because the contrast in the original image is poor, a limited set of features (i.e. streets) is identified.

Common Variants

Generalized Hough Transform

The generalized Hough transform is used when the shape of the feature that we wish to isolate does not have a simple analytic equation describing its boundary. In this case, instead of using a parametric equation of the curve, we use a look-up table to define the relationship between the boundary positions and orientations and the Hough parameters.

For example, suppose that we know the shape and orientation of the desired feature. (See Figure 3.) We can specify an arbitrary reference point (x_r, y_r) within the feature, with respect to which the shape of the feature is defined by the distance r and angle β of lines drawn from boundary points to this reference point. Our look-up table (i.e. R-table) will consist of these distance and direction pairs, indexed by the orientation ω of the boundary.

Description of R-table components.

The Hough transform space is now defined in terms of the possible positions of the shape in the image, i.e. the possible ranges of (x_r, y_r). In other words, the transformation is defined by:

    x_r = x + r·cos β
    y_r = y + r·sin β

(The r and β values are derived from the R-table for particular known orientations ω.) If the orientation of the desired feature is unknown, this procedure is complicated by the fact that we must extend the accumulator by incorporating an extra parameter to account for changes in orientation.