System combination for inference not decision

A common reaction to the unsupervised inference algorithms discussed here is “Oh, that is just system combination, what is so new about that?” System combination usually refers to the use of multiple decision makers/recognizers to reach a consensus decision. Examples include the ROVER system in speech recognition and the fusion system benchmarked at the 2003 Reliable Information Access (RIA) Workshop at MITRE.

Incidentally, I was the representative for the Center for Intelligent Information Retrieval at the RIA Workshop and the fusion system included the Lemur-derived system we contributed to the workshop. It was my first direct experience with system combination for decision. A comment by Warren Greiff about that fusion experiment has always stuck in the back of my mind. We all worked together in a common room and upon seeing the results of the fusion experiment he said: “And, of course, the combined system beats any single system.”

If system combination is so great, why is it used so little? One reason may be that system combination as currently practiced is a decision algorithm. We collect a bunch of opinions of what something should be labeled as and then we combine the opinions to form our final vote. The Condorcet Jury Theorem is the first quantification of how successful we expect to be with a majority decision algorithm. Count up the votes and give the decision to the choice that has the majority or, in the case of multiple labels, to the plurality.

The novelty of our unsupervised inference algorithms is that we are not combining systems to make decisions but to make statistical inferences. That is a big difference. If you take all cases in the data where five recognizers vote with labels $\{A,B,B,A,A\}$ and decide that they should all be considered as having label $A$, that is very different from using the frequency of all label voting patterns to infer that 83% of the time that the recognizers vote this way the true label is $A$ and 17% of the time it is label $B$.

This point about separating algorithms for inference from algorithms for decision is made in the first chapter of Christopher Bishop’s “Pattern Recognition and Machine Learning” book.

One clear advantage of inference over decision is that you get better measurements of the properties of the data. For example, in majority voting you assign all cases of the pattern $\{A,B,B,A,A\}$ to label $A$. You proceed this way with all the label voting patterns and then tally the prevalence of label $A$ in the data. That estimate is worse than one obtained with an unsupervised inference algorithm for the actual prevalences of the labels. This should be intuitively obvious: the inference algorithm apportions fractional counts of each pattern to the labels, rather than handing every count of a pattern to its plurality winner.
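To make that concrete: if, say, the pattern $\{A,B,B,A,A\}$ shows up 1,000 times in the data and inference says the true label for that pattern is $A$ 83% of the time, fractional apportioning credits 830 of those counts to $A$ and 170 to $B$, while majority voting credits all 1,000 to $A$ and overstates the share of $A$ among those cases.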

Of course, we get to have our cake and eat it too! If you are able to get good estimates for the prevalences of the labels in the data along with the conditional recognition probabilities, then you can use Bayes Theorem to make the statistically optimal decision for each observed label voting pattern.
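Here is a minimal Python sketch of that last step. The numbers are purely illustrative: five independent recognizers assumed to be 80% accurate on both labels, with equal label prevalences. (With these particular numbers the posterior for $\{A,B,B,A,A\}$ works out to 80/20 rather than the 83/17 of the example above.)

```python
# Hypothetical inferred quantities for a two-label task with five
# independent recognizers; the numbers are illustrative only.
prevalence = {"A": 0.5, "B": 0.5}
# Assumed per-recognizer accuracy, taken equal for both labels.
accuracies = [0.8, 0.8, 0.8, 0.8, 0.8]

def posterior(votes):
    """Posterior over true labels for a voting pattern, via Bayes Theorem."""
    scores = {}
    for true_label, p_true in prevalence.items():
        likelihood = p_true
        for acc, vote in zip(accuracies, votes):
            likelihood *= acc if vote == true_label else 1.0 - acc
        scores[true_label] = likelihood
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

print(posterior(("A", "B", "B", "A", "A")))  # {'A': 0.8, 'B': 0.2}
```

The statistically optimal decision for this pattern is $A$, but unlike a pure decision algorithm, the inference also tells us how often that decision will be wrong.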

Accuracy is cheap, precision expensive

Dawid and Skene noted in their 1979 paper, where they presented an unsupervised inference algorithm, that the data likelihood has degenerate maxima. The same thing occurs for the unsupervised inference equations based on frequencies of label voting patterns.

A simple example will illustrate the degenerate solutions to the unsupervised inference problem. Consider three independent recognizers that are 80% accurate for both labels of a two-label task. And for ease of discussion, let us assume that the labels are equally prevalent in the data stream, $p(A)=50\%$ and $p(B)=50\%$.

The pattern $\{B,A,A\}$ occurs with frequency
$$\begin{split}
& p_1(B|A)\,p_2(A|A)\,p_3(A|A)\,p(A) + p_1(B|B)\,p_2(A|B)\,p_3(A|B)\,p(B) \\
&\quad = 0.20 \times 0.80 \times 0.80 \times 0.5 + 0.80 \times 0.20 \times 0.20 \times 0.5 \\
&\quad = 0.08.
\end{split}$$
The same frequency would be observed if the recognizers were all 20% accurate.
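The degeneracy is easy to verify by brute force. A small sketch, using the same setup as above (three independent recognizers, equally prevalent labels):

```python
from itertools import product

def pattern_frequency(pattern, accuracy, prevalence=0.5):
    """Frequency of a voting pattern for three independent recognizers
    that are `accuracy` correct on both labels of an A/B task."""
    freq = 0.0
    for true_label, p_true in (("A", prevalence), ("B", 1.0 - prevalence)):
        term = p_true
        for vote in pattern:
            term *= accuracy if vote == true_label else 1.0 - accuracy
        freq += term
    return freq

# Every one of the 8 voting patterns has the same frequency whether
# the recognizers are 80% or 20% accurate.
for pattern in product("AB", repeat=3):
    assert abs(pattern_frequency(pattern, 0.8) -
               pattern_frequency(pattern, 0.2)) < 1e-12

print(pattern_frequency(("B", "A", "A"), 0.8))  # 0.08, as computed above
```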

I once gave a talk at a company on the subject of unsupervised inference, and this degeneracy was the objection raised by an audience member: if degenerate solutions exist, how can one claim that unsupervised inference is fully possible? The answer is that it is not. If anybody were to claim that such an algorithm existed, it would be tantamount to claiming that one could build a perfect detector.

The algorithms discussed here are useful and practical because in most situations we have prior knowledge about the quality of our recognizers. For example, one does not usually deploy recognizers that are worse than uniform guessers. A lot of research and development has gone into building generally good recognizers that seem to be robust to changes in the data stream. If that is the case, then one can deploy the algorithms discussed here and monitor their operational swings.

Besides, knowing that the recognizers are 80% or 20% accurate is a heck of a lot better than not knowing anything about their performance. This degeneracy is not too much of a problem because accuracy is cheap. A data stream monitored sporadically would allow one to gauge which solution is correct. Since these two solutions are the only ones that explain the frequency of label voting patterns, a few samples will quickly decide between them. This point about accuracy being cheap but precision being costly was made in our ICML 2008 paper on unsupervised inference for the regression task.

An unsupervised inference algorithm together with light human monitoring is much cheaper and more precise than continual human monitoring.

MathML rendering issues ameliorated

We have enabled the MathJax JavaScript library to render the math equations in the blog. This should take care of most rendering issues in Safari and IE. Please let us know if it does not.

Unsupervised inference for WER without any correct transcripts

One of the neat applications of having an algorithm for unsupervised inference in the labeling task is that you can estimate statistical quantities that are of interest to users of machine learning systems. One such statistic is the word error rate (WER) for sequential data. For example, you may be assessing the quality of DNA sequencers and want to use the WER – one minus the sequencer’s average label accuracy over the data processed – rather than the conditional recognition probabilities for each of the labels.

Let us rephrase that last statement mathematically. For an $L$-label task, the WER combines the prevalences of the labels with the conditional recognition probabilities of the recognizer,
$$\text{WER} = \sum_{\ell \in \mathcal{L}} p(\ell) \left( \sum_{\ell_r \neq \ell} p(\ell_r | \ell) \right).$$
The average label accuracy would be one minus this expression.
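As a sketch of the bookkeeping, here is the WER computation given the prevalences and conditional recognition probabilities. The numbers are made up for illustration: a single recognizer that is 90% accurate on label $A$ and 70% accurate on label $B$, with the labels at 60/40 prevalence.

```python
def word_error_rate(prevalence, confusion):
    """WER = sum over true labels of p(label) * p(mislabeling | label).

    prevalence: dict mapping each label to p(label)
    confusion:  dict mapping (recognized, true) to p(recognized | true)
    """
    return sum(p * sum(confusion[(r, t)] for r in prevalence if r != t)
               for t, p in prevalence.items())

prevalence = {"A": 0.6, "B": 0.4}
confusion = {("A", "A"): 0.9, ("B", "A"): 0.1,
             ("A", "B"): 0.3, ("B", "B"): 0.7}
print(word_error_rate(prevalence, confusion))  # 0.6*0.1 + 0.4*0.3 = 0.18
```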

One can readily accommodate recognizers that make deletion and insertion errors by introducing a null label, denoted by $\mathcal{N}$. Deletion errors would be associated with conditional probabilities of the form $p(\mathcal{N} | \ell \ne \mathcal{N})$. Insertion errors would be associated with the conditional probabilities $p(\ell \ne \mathcal{N} | \mathcal{N})$.

By taking a bunch of outputs from a collection of sequencers (e.g. DNA sequencers, speech recognizers) and aligning them to each other, you would be able to infer the WER of each sequencer’s transcript: you have turned it into an $(L+1)$-label problem, and we have an unsupervised inference algorithm for all the quantities necessary to calculate the WER.
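As a toy illustration (the sequences here are hypothetical), three aligned outputs might look like this, with $\mathcal{N}$ marking a gap:

| sequencer 1 | sequencer 2 | sequencer 3 | true label |
|-------------|-------------|-------------|------------|
| A | A | A | ? |
| C | C | G | ? |
| $\mathcal{N}$ | T | T | ? |
| G | G | $\mathcal{N}$ | ? |

Each aligned position is one row of the familiar voting table, so the machinery for the labeling task applies unchanged.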

I don’t know about you, but I think there is something magical about the idea that the average accuracy or WER of sequencers can be estimated even when we have no idea what the correct sequence is. This could be extremely useful in settings where one needs to monitor sequencers without human supervision.

Unsupervised inference for the ranking task

We want to take a detour on our discussions of the algorithms for unsupervised inference for the labeling task to point out a side benefit of this technology: the ability to measure the accuracy of rankers without any knowledge of the correct rank for the items!

Consider four ranking algorithms given the task of repeatedly ranking three items. The output of their ranking decisions can be represented as a table:

| ranker 1 | ranker 2 | ranker 3 | ranker 4 | true rank |
|----------|----------|----------|----------|-----------|
| a,c,b    | b,c,a    | c,a,b    | c,b,a    | ?         |
| f,e,d    | d,e,f    | f,e,d    | e,d,f    | ?         |
| g,h,i    | h,g,i    | g,h,i    | g,h,i    | ?         |

We implicitly assume that this table extends downward for a large number of ranking decisions. The unsupervised inference problem for the ranking task is to estimate the performance of these rankers without any knowledge of the correct ranking for the items.

The way to solve this problem is to map it into an abstract labeling task using the permutation group. Select any of the rankers as your reference ranker. Replace its ranking decision with the identity element of the $S_n$ permutation group. In this case we have three items that are being repeatedly ranked so the group to consider is $S_3$. Let us arbitrarily pick the second ranker and transform the table accordingly:

| ranker 1 | ranker 2 | ranker 3 | ranker 4 | true rank |
|----------|----------|----------|----------|-----------|
| a,c,b    | $\pi_1$  | c,a,b    | c,b,a    | ?         |
| f,e,d    | $\pi_1$  | f,e,d    | e,d,f    | ?         |
| g,h,i    | $\pi_1$  | g,h,i    | g,h,i    | ?         |

where $\pi_1$ denotes the identity element for $S_n$.

We can continue in this manner for all the rankers. Using the ordering of the second ranker as the reference, we can replace another ranker’s ordering by the permutation group element that relates the two. This is always possible because the permutations form a group: there is always a group element that takes you from one ordering of the three items to any other ordering of the same items.

| ranker 1 | ranker 2 | ranker 3 | ranker 4 | true rank |
|----------|----------|----------|----------|-----------|
| $\pi_{\text{whatever}}$ | $\pi_1$ | $\pi_{\text{whatever}}$ | $\pi_{\text{whatever}}$ | ? |
| $\pi_{\text{whatever}}$ | $\pi_1$ | $\pi_{\text{whatever}}$ | $\pi_{\text{whatever}}$ | ? |
| $\pi_{\text{whatever}}$ | $\pi_1$ | $\pi_{\text{whatever}}$ | $\pi_{\text{whatever}}$ | ? |

I have not bothered to fill out the table precisely, but you get the idea: for each ranker other than the reference, we can find the permutation element in $S_n$ that relates its ranking to that of the arbitrarily chosen reference ranker.
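For concreteness, here is a minimal Python sketch of that mapping. The representation is my own choice, not canonical: rankings are tuples, and a permutation is a tuple of indices into the reference ordering.

```python
def relating_permutation(reference, ranking):
    """Permutation (as an index tuple) taking the reference ordering to `ranking`."""
    return tuple(reference.index(item) for item in ranking)

def apply_permutation(perm, ordering):
    return tuple(ordering[i] for i in perm)

reference = ("b", "c", "a")                              # ranker 2, first row
perm = relating_permutation(reference, ("a", "c", "b"))  # ranker 1, first row
print(perm)                                              # (2, 1, 0)
assert apply_permutation(perm, reference) == ("a", "c", "b")

# The reference ranker always maps to the identity element pi_1:
assert relating_permutation(reference, reference) == (0, 1, 2)
```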

That is it. We are done. The unknown correct ranking would also have to be transformed into one of the elements of the permutation group. The fact that we don’t know the correct “label” does not matter! We have an unsupervised inference algorithm for the labeling task. By turning the ranking task into an abstract labeling problem involving the members of the permutation group, we can solve the problem.

In general, for the task of ranking $n$ items, you would have to solve an $n!$-label task. The prevalences of the labels are not really meaningful except insofar as they allow us to calculate the overall accuracy of the rankers. The only remaining decision is how many rankers you are going to use. Why?

The table of abstract labels that we have constructed has only one label type in the second column. Therefore, not all possible $(n!)^R$ label voting patterns show up in that table. The elegant way to deal with this is to throw out that column. You have then mapped the $n$-item ranking task with $R$ rankers to the labeling task with $n!$ labels using $R-1$ labelers.
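Putting it all together, a small sketch that maps the original table of four rankers to the abstract labeling task, using the same toy data as above:

```python
def to_label_task(table, reference_index):
    """Map each row of rankings to permutation 'labels', dropping the
    reference ranker's column (it is always the identity element)."""
    labeled = []
    for row in table:
        reference = row[reference_index]
        labeled.append(tuple(
            tuple(reference.index(item) for item in ranking)
            for i, ranking in enumerate(row) if i != reference_index))
    return labeled

table = [(("a","c","b"), ("b","c","a"), ("c","a","b"), ("c","b","a")),
         (("f","e","d"), ("d","e","f"), ("f","e","d"), ("e","d","f")),
         (("g","h","i"), ("h","g","i"), ("g","h","i"), ("g","h","i"))]

# Four rankers of three items become three abstract labelers
# over the 3! = 6 permutation labels.
for row in to_label_task(table, reference_index=1):
    print(row)
```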