System combination for inference, not decision

A common reaction to the unsupervised inference algorithms discussed here is “Oh, that is just system combination, what is so new about that?” System combination usually refers to the use of multiple decision makers or recognizers to reach a consensus decision. Examples include the ROVER system in speech recognition and the fusion system benchmarked at the 2003 Reliable Information Access (RIA) Workshop at MITRE.

Incidentally, I was the representative for the Center for Intelligent Information Retrieval at the RIA Workshop, and the fusion system included the Lemur-derived system we contributed to the workshop. It was my first direct experience with system combination for decision. A comment by Warren Greiff about that fusion experiment has always stuck in the back of my mind. We all worked together in a common room and, upon seeing the results of the fusion experiment, he said: “And, of course, the combined system beats any single system.”

If system combination is so great, why is it used so little? One reason may be that system combination as currently practiced is a decision algorithm. We collect a bunch of opinions about what something should be labeled and then combine those opinions to form our final vote. The Condorcet Jury Theorem is the first quantification of how successful we can expect a majority decision algorithm to be. Count up the votes and give the decision to the choice that has the majority or, in the case of multiple labels, to the plurality.
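As a minimal sketch, a plurality decision over a single voting pattern can be written in a few lines (the pattern here is the five-recognizer example used throughout this section):

```python
from collections import Counter

def plurality_vote(labels):
    """Return the label with the most votes (a plurality decision).

    With only two labels this reduces to majority voting; ties are
    broken arbitrarily by Counter's ordering.
    """
    return Counter(labels).most_common(1)[0][0]

print(plurality_vote(["A", "B", "B", "A", "A"]))  # prints "A"
```

Every occurrence of the pattern gets the same hard label; no information about how often that pattern was actually wrong survives the decision.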

The novelty of our unsupervised inference algorithms is that we are not combining systems to make decisions but to make statistical inferences. That is a big difference. If you take all cases in the data where five recognizers vote with labels $\{A,B,B,A,A\}$ and decide that they should all be considered as having label $A$, that is very different from using the frequency of all label voting patterns to infer that 83% of the time that the recognizers vote this way the true label is $A$ and 17% of the time it is label $B$.

This point about separating algorithms for inference from algorithms for decision is made in the first chapter of Christopher Bishop’s “Pattern Recognition and Machine Learning” book.

One clear advantage of inference over decision is that you get better measurements of properties of the data. For example, in majority voting you assign all cases of the pattern $\{A,B,B,A,A\}$ to label $A$. You proceed this way with all the label voting patterns and then tally the prevalence of label $A$ in the data. That estimate is worse than one obtained by using an unsupervised inference algorithm to estimate the actual prevalences of the labels. This should be intuitively plausible: an inference algorithm apportions fractional counts of each pattern to the labels, rather than assigning a pattern's entire count to its majority label.
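Here is a small sketch of the two prevalence estimates side by side. The patterns, their counts, and their per-pattern posteriors are all hypothetical numbers for illustration; in practice the posteriors would come from the unsupervised inference algorithm:

```python
from collections import Counter

# Hypothetical data: how often each voting pattern occurred, and an
# assumed posterior over the true label for each pattern.
patterns = {
    ("A", "B", "B", "A", "A"): {"count": 100, "posterior": {"A": 0.83, "B": 0.17}},
    ("B", "B", "B", "A", "B"): {"count": 50,  "posterior": {"A": 0.05, "B": 0.95}},
}
total = sum(info["count"] for info in patterns.values())

def majority(pattern):
    return Counter(pattern).most_common(1)[0][0]

# Decision-based prevalence: a pattern's whole count goes to its
# majority label.
decision_prev = {"A": 0.0, "B": 0.0}
for pattern, info in patterns.items():
    decision_prev[majority(pattern)] += info["count"] / total

# Inference-based prevalence: fractional counts apportioned by the
# per-pattern posterior.
inferred_prev = {"A": 0.0, "B": 0.0}
for pattern, info in patterns.items():
    for label, prob in info["posterior"].items():
        inferred_prev[label] += prob * info["count"] / total
```

With these numbers the decision estimate puts the prevalence of $A$ at $100/150 \approx 0.67$, while the fractional apportioning gives $0.57$: the hard-label tally overstates $A$ by every minority vote it discarded.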

Of course, we get to have our cake and eat it too! If you are able to get good estimates for the prevalences of the labels in the data along with the conditional recognition probabilities, then you can use Bayes Theorem to make the statistically optimal decision for each observed label voting pattern.
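The final step can be sketched as follows. The prevalences and per-recognizer accuracies below are assumed numbers, and the sketch additionally assumes the recognizers err independently given the true label (a common simplification, not something the inference algorithms here require):

```python
# Assumed label prevalences and per-recognizer accuracies.
prevalence = {"A": 0.6, "B": 0.4}
accuracy = [0.9, 0.7, 0.7, 0.8, 0.85]  # one entry per recognizer

def posterior(pattern):
    """P(true label | pattern) via Bayes Theorem, assuming the
    recognizers are conditionally independent given the true label."""
    scores = {}
    for label, prior in prevalence.items():
        likelihood = prior
        for vote, acc in zip(pattern, accuracy):
            # In this two-label setting, a wrong vote happens with
            # probability (1 - accuracy).
            likelihood *= acc if vote == label else (1 - acc)
        scores[label] = likelihood
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}

def bayes_decision(pattern):
    """Statistically optimal decision: the maximum-posterior label."""
    post = posterior(pattern)
    return max(post, key=post.get)
```

For the pattern $\{A,B,B,A,A\}$ this picks $A$, but unlike majority voting it does so with a calibrated posterior attached, and with different prevalences or accuracies it could overrule the majority.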