April 2003

As always, feedback is welcome.

Reasoning about Uncertainty, Bayesian Belief Networks

Certainty Factors

The first expert systems were research efforts in the medical diagnostic domain. The researchers quickly found out that dealing with, and propagating, uncertainty was a critical issue for medical systems. Typically medical experts hedged their opinions with words like "likely" and "may".

The researchers invented the concept of certainty factors (CFs) to represent this uncertainty. These were numbers between 0 and 100. A given bit of data had a CF associated with it. For example the symptom "patient has a cough" might have a CF of 70 if its not clear how bad the patient's cough is.

Rules would also have CFs, so the rule that "if a patient has a cough then the patient has a cold" might also have a CF, say 60. The CF of the symptom is then mathematically combined with the CF of the rule to come up with a CF for the final diagnosis.

An diagnostic expert system could then, given a set of symptoms, generate a number of possible diagnoses and the CF associated with each.

(Note that uncertainty is only critical for certain types of applications. The expert system in a doctor's office that helps fill out insurance forms needs to say exactly what form to fill out, not provide a menu of probable choices ranked by possibility.)

The formula for propagating CFs was developed by "feel", tweaked until the medical experts were comfortable with the results.

Many expert system tools provide CFs as part of their knowledge representation and reasoning engine.

Bayesian Belief Networks

It wasn't long before more rigorous types pointed out that probability theory already had all the tools necessary for propagating belief, in Bayesian theory. And it was mathematically precise, not like the ad hoc CFs.

Thrown on the defensive, the CF creators argued that there is so much uncertainty in the uncertainty estimates in an expert system that it hardly matters if they are propagated with mathematical precision. If a doctor estimates that a cough implies a cold with certainty 60, that could really be anywhere from 40 to 80.

Moving to the offensive, the CF creators noted that it was much more complex to use probability theory.

Complexity has never slowed the research community down, and a wealth of work has been done on Bayesian belief networks (BBNs). These use rigorous mathematics to propagate belief (probabilities) between connected nodes in a network.

Using a BBN, the cough/cold example would be represented: P(cough, 0.7), P(cold | cough, 0.6). The first is just the probability the symptom is a cough. The second is the conditional probability that the disease is a cold given the symptom cough.

But, it would also be necessary to include P(not cough, 0.3) and P(cold | not cough, 0.2) for the formulas to be used to accurately determine P(cold). That is because the probability of the patient having a cold is a combination of the probability given a cough as a symptom and the probability of having a cold if not cough is a symptom.

And if the node describing the conditional probability of a cold, given a cough, were to be connected to other nodes, then we would also need to specify P(not cold|cough) and P(not cold|not cough).

It is this requirement for a matrix of conditional probabilities that adds overhead to the development of BBNs.

But, the potential of a BBN is fantastic. All the factors in a particular problem domain can be connected together in dependency networks, with the conditional probabilities at each node specified as above. Then, as evidence about a situation is gathered, that evidence, which itself is expressed as a probability, can be propagated through the entire belief network, updating the beliefs of all the other nodes.

This means propagating both backwards and forwards, which is where the real power of Bayes' theorem comes to bear. It states that one can compute the P(A|B) from P(B|A) by the formula:

P(A|B) = P(B|A) * P(A) / P(B).

This equation has another very important implication. It overcomes the CF creator's objection that estimates of uncertainty are themselves extremely uncertain. Consider the cough/cold example.

It is difficult to determine the probability that someone has a cold given they have a cough. A doctor will go by feel, indicating a cold is likely if the patient has a cough.

But, it is easy to measure exactly how many patients that visit the doctor have colds, how many have the symptoms of a cough, and, most important, how many who have colds have a cough.

That is, whereas determining P(cold|cough) is hard, we can use statistics to exactly determine P(cough|cold).

Armed with Bayes theorem, and P(cold) and P(cough) in the doctor's office, we can then precisely compute P(cold | cough) given P(cough | cold).

P(cold | cough) = P(cough | cold) * P(cold) / P(cough).

It is no longer necessary to estimate what "likely" means.

One can imagine great networks of linked conditional probabilities used to model different application domains, that react to evidence, propagating beliefs, and that is exactly what people do with Bayesian belief networks (BBNs).

There are many deployed applications using BBNs, and there are a number of commercial products available as well.

But there are technical problems, the first being that propagation is computationally too hard given a significant number of nodes -- unless you make assumptions and limitations as to the type of network you build.

The second is that it is difficult to get all of the probability data for the nodes. As the CF creators pointed out, what good is it if the data is wrong?

Researchers are working on both these areas, with learning algorithms for developing BBNs from statistical data, and better algorithms for propagating beliefs.

See the links below to learn more about what can and has been done with BBNs, and where the current research is going.

Fuzzy Logic

Fuzzy logic remains as a topic for a later date, but I'll briefly mention it here for completeness.

Fuzzy logic is another approach to reasoning with uncertainty, in which variables in a system, instead of having probabilities, have a degree of truth associated with them. So rather than being true (1.0) or false (0.0) a fuzzy variable might be "most likely true" (0.8). Like with both probability theory and CFs, there are rules for combining and propagating fuzzy truth values.

One can immediately see an advantage of fuzzy logic over probability in the cough/cold situation. It's not really that the patient has a probability of a cough, but rather a degree of cough, from mild to severe. This is naturally represented as a fuzzy truth value.

But one can also immediately see the disadvantage, because the result of a diagnostic system should be the probability that the patient has a cold, not a degree of cold.

If it were simple, it wouldn't be AI.

Code Corner - Probability and Bayes

Prolog is a popular language for AI for a very simple reason. It has built in pattern-matching (unification) and search (backtracking) algorithms. And much of AI is exactly about pattern-matching and search -- searching for patterns in if-then rules for an inference engine; searching for patterns and moves in game playing; searching for patterns in vision and natural language application; etc. etc.

Because the programmer doesn't have to code the pattern-matching and search in a Prolog application, the code is often very concise and easy to read. Here's an example from some simple experiments with Bayes theorem and probability, where a few lines of code can be used to answer queries about the probability of various events.

A simple example relating cancer and smoking, similar to the cold/cough analysis above, came from Norman Fenton's tutorial. We can write a small Prolog program that lets us play with that data and Bayes theorem.

First we represent the basic probability facts as Prolog facts:

p(patient(cancer), 0.1).
p(patient(smoker), 0.5).

We then represent the conditional probabilities, using a ^ operator:

cp(patient(smoker) ^ patient(cancer), 0.8).

Next we write query rules that can be used to find various probabilities. There are three rules, covering the case where the probability is known, the conditional probability is known, or we can compute the conditional probability using Bayes theorem. Words beginning with upper case letters are variables in Prolog. Note that the third rule recursively calls itself to get the individual probabilities.

getp(A,P) :-
p(A,P), !.
getp(A^B, P) :-
cp(A^B, P), !.
getp(A^B, P) :-
cp(B^A, Pba),
getp(A, Pa),
getp(B, Pb),
P is Pba * Pa / Pb.

We now have a simple system for representing and querying knowledge expressed as probabilities. We can consult these facts and rules into a Prolog listener and try it:

?- getp(patient(cancer), P).
P = 0.1 

?- getp(patient(smoker)^patient(cancer), P).
P = 0.8 

?- getp(patient(cancer)^patient(smoker), P).
P = 0.16 

One of the fun things about reading probability papers is they have Rube Goldberg like examples. Here's a more complex set of a chain of facts, probably relating to the author's vacation. This one comes from an online lecture available from Temple University.

p(bonus, 0.6).

cp(money ^ bonus, 0.8).
cp(money ^ not bonus, 0.3).

cp(hawaii ^ money, 0.7).
cp(Hawaii ^ not money, 0.1).

cp(san_francisco ^ money, 0.2).
cp(san_francisco ^ not money, 0.5).

cp(wierd_people ^ san_francisco, 0.95).
cp(wierd_people ^ not san_francisco, 0.6).

cp(surfing ^ Hawaii, 0.75).
cp(surfing ^ not Hawaii, 0.2).

Note that in this case, except for bonus, we don't always have a direct probability that we can use to get P(A) when needed. There is another formula that can be used to compute P(A) from P(A|B) when needed.

P(A) = sum( P(A|B) * P(B) )

So in the example, if we wanted to compute P(surfing), it depends on P(Hawaii) which depends on P(money) which depends on P(bonus). We can expand our query rules to include linked probabilities like this. Prolog's recursion handles this nicely. (Prolog's findall finds all the instances and stores them in a list; we add a sum rule to add the elements in the list.)

We also need a rule to calculate not P, which probability theory says is 1 - P.

getp(not A, P) :-
getp(A, Pn),
P is 1 - Pn, !.
getp(A, P) :-
findall(P, (
    cp(A ^ B, P1),
    getp(B, P2),
    P is P1*P2
    ), L),
sum(L, P).

sum(L, Sum) :-
sum(L, 0, Sum).
sum([], Sum, Sum).
sum([A|Z], Acc, Sum) :-
Acc2 is Acc + A,
!, sum(Z, Acc2, Sum).

We can try this in a listener:

?- getp(surfing, P).
P = 0.453 
We can also change the probability of the bonus, to see how that affects the probability of surfing.

?- retract(p(bonus,_)).

?- assert(p(bonus,0.3)).

?- getp(surfing, P).
P = 0.4035 

That let's us see top down affects. What about rippling Bayes theorem back up to see what the fact that someone went surfing means for the probability they had a bonus? That's harder and requires propagating beliefs through the network. But its not too hard for graphs without loops, and there is a relatively straightforward algorithm for doing it that was originally published by Judea Pearl in 1988 in a book called Probabilistic Reasoning in Intelligent Systems, published by Morgan Kaufman.

There is a concise explanation of the algorithm in Facundo Bromberg's final project report, which describes a Java implementation of the algorithm. The Temple University tutorial points to a Prolog program which does the same. The Prolog code is relatively obtuse unless you have the paper version of the algorithm to study as well.

Work in AI is all about knowledge representation and reasoning engines. Those two aspects are clear in this little example. The p and cp facts used above are a knowledge representation language for a probability network. The getp rules are a reasoning engine that uses that knowledge representation.

Links

The Best of AI Expert

The June 1994 issue had three excellent articles on neural networks.

Object-Oriented Neural Networks [Jun 1994] - Ivo Vondrak provides a clean object-oriented design for implementing your own neural network application.

Tuning Neural Networks with Genetic Algorithms [Jun 1994] - Kurzweil has suggested the way to build an artificial brain is to have neural networks tuned with genetic algorithms, and some of the leading edge game-playing software is using the same approach. Dan Murray provides details of the approach in this article.

Activating Neural Networks: Part I [Jun 1994] - Anthony Kempka looks into the details of activation functions, the little equations that decide when a neuron fires. He discusses a variety of functions and their impact on the behavior of a neural net, but also points out that the use of backpropagation for learning severely restricts the types of functions that can be used.

AI in Medicine

This is a chapter from Enrico Coiera's book, "Medical Informatics, The Internet and Telemedicine". It provides a good introduction to the topic of AI in medicine.

Elsevier, which published a number of learned journals, has one called Artificial Intelligence in Medicine. This is its home page.

Quantum Computing

http://tph.tuwien.ac.at/~oemer/qcl.html Bernhard Γ–mer is working on Quantum Computer Language (QCL) which is aimed at overcoming some of the notational and usability issues that have confused those, including this editor, attempting to wrap their brains around programming with bits that are 0 and 1 at the same time. He has software that can be downloaded and used, although I suspect it just simulates the quantum computer, so it can be run on Newtonian machines.

Bayesian Belief Networks

http://www.dcs.qmul.ac.uk/~norman/SERENE_Help/start.htm - Norman Fenton has included an excellent, not too technical, tutorial on probability and Bayesian belief networks (BBNs) in the documentation on SERENE, which is a safety and risk evaluation system that uses a BBN. There is a fascinating section on the difficulties people have in grasping realistic probabilities. The tutorial includes a good section on other resources for BBN information.

Temple University has a number of interesting lectures available online. This is one on Bayesian belief networks and includes working Prolog source code for belief propagation in networks where there is only one path between nodes.

Facundo Bromberg's AI Final Project Report has a clear, concise summary of the algorithm for belief propagation in networks without loops, or directed acyclic graphs (DAGs), which was originally published in Judea Pearl's 1988 book, Probabilistic Reasoning in Intelligent Systems, published by Morgan Kauffman.

http://www.auai.org/ - Home page for Association for Uncertainty in Artificial Intelligence, lots of good links and pointers.

Kevin Murphy has written an excellent in depth look at belief networks that doesn't shy away from equations and the difficult issues. It also has numerous links to other resources on BBNs.