The window of opportunity for unsupervised inference

In our last post we derived a comparison between the number of observable label voting pattern frequencies and the number of parameters we would need to solve for in the case of independent recognizers. That led us to the concept of a minimum number of recognizers needed for unsupervised inference. It so happens that we can prove that the minimum is always greater than two, a somewhat satisfying result reminiscent of the idea of triangulation in geometry. More than two decision makers are always needed whenever we want to infer the accuracy of their labeling performance without any knowledge of the ground truth.

In this post we want to talk about the possibility of carrying out unsupervised inference even when the recognizers are correlated. As far as we know, our approach is unique in dealing with this situation, not only because the inference equations associated with the label voting pattern frequencies give us much more information than traditional approaches such as maximizing the log likelihood of the data, but also because we can quantify when recognizers are too correlated. Unsupervised inference is not always possible, and a viable, practical algorithm for carrying it out needs to be able to detect when the recognizers have become too correlated.

The statistical quantities relevant to a discussion about the quality of the recognizers are the conditional recognition probabilities,
$$p(\{\ell_1,\ell_2,\ell_3,\ldots\} | \ell_t).$$
For each of the $L$ labels in a particular task, the $R$-dimensional table of these conditional probabilities sums to one,
$$\sum_{e} p(e | \ell_t) = 1,$$
where $e=\{\ell_1,\ell_2,\ell_3,\ldots\}$ is a label voting pattern event and we sum over all possible label voting patterns. The most general conditional probability table for a given label would then require that we estimate $L^R -1$ parameters. Compare this to the case of independent recognizers where the table only requires that we estimate $R*(L-1)$ parameters. We immediately deduce that the unsupervised inference problem where $L^R-1$ parameters are required for each label is not solvable since the number of parameters we have to estimate,
$$L(L^R-1) + (L-1),$$
is always greater than the number of inference equations we have,
$$L^R -1.$$
Restating what we have just said,
$$L(L^R - 1) + (L-1) \gt L^R -1,$$
since
$$L^{R+1} \gt L^R,$$
for any $L \ge 2$ and $R \gt 1$.
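
To make the counting concrete, here is a minimal Python sketch of the comparison (the helper functions are ours, named only for illustration); for every $L \ge 2$ and $R \gt 1$ the fully general table comes up short:

```python
# Compare unknown parameters with independent inference equations when the
# conditional recognition probabilities are left fully general (correlated).
def general_parameters(L, R):
    # one full voting-pattern table (L^R - 1 free entries) per true label,
    # plus L - 1 free prevalence parameters
    return L * (L**R - 1) + (L - 1)

def independent_equations(L, R):
    # the L^R voting-pattern frequencies must sum to one
    return L**R - 1

for L in (2, 3):
    for R in (2, 3, 4):
        p, e = general_parameters(L, R), independent_equations(L, R)
        print(f"L={L} R={R}: parameters={p} > equations={e}")
```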

While we have been discussing unsupervised inference assuming that all $L^R$ label voting patterns are observable, it should be clear to the reader that the real issue is: does the system of inference equations allow for unique solutions? If you consider the case of completely correlated recognizers, only $L$ label voting patterns would be observed (the patterns where their votes are equal). But we still need to solve for $L(L-1) + (L-1)$ statistical parameters and this is always greater than the $L-1$ linearly independent inference equations.

Having discussed at length when unsupervised inference is impossible, we now come to the title of this post: the window of opportunity for unsupervised inference. Whenever the number of parameters needed to specify the prevalence of labels and the conditional recognition probabilities, call it $P$, lies in the range
$$(L-1)*(R*L + 1) \le P \le L^R -1,$$
the unsupervised inference problem in the labeling task is solvable. The lower limit corresponds to the case of completely independent recognizers and the upper limit corresponds to the maximum number of linearly independent inference equations.
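
A small sketch of how one might test whether a proposed model falls inside this window; the function name and the example values are illustrative, not part of any library:

```python
# Check whether a model with P free parameters falls inside the window of
# opportunity for unsupervised inference.
def in_window(P, L, R):
    lower = (L - 1) * (R * L + 1)   # fully independent recognizers
    upper = L**R - 1                # available independent inference equations
    return lower <= P <= upper

# Binary task, three recognizers: the independent model sits exactly on the edge.
L, R = 2, 3
P_independent = (L - 1) * (R * L + 1)
print(in_window(P_independent, L, R))   # True, since 7 <= 7 <= 7
```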

In our next post we will begin detailing the application of this approach to a set of eight information retrieval engines used during the 2003 Reliable Information Access (RIA) Workshop at MITRE.

The Marriage Lemma

In our previous post we explained how each of the frequencies of the label voting patterns leads to an equation. On one side of an equation is the observable frequency of the particular label voting pattern. These frequencies are readily observable and no knowledge of the correct labels for the data points is needed to measure them. On the other side is an expression with the unknown quantities we would like to know: the prevalence of the labels in the data and the conditional recognition probabilities.

In this post we want to prove the marriage lemma for the labeling task: two recognizers will never be able to solve the unsupervised inference problem. This lemma has shown up in our previous work related to unsupervised inference for the regression task as in, for example, this ICML 2008 paper. The proof is based on comparing the number of unknown parameters with the number of inference equations.

The number of inference equations is generally $L^R$, where $L$ is the number of labels and $R$ is the number of recognizers/annotators/labelers. Since we are using frequencies of the label voting patterns, the frequencies must sum to one and hence only $L^R -1$ of the equations are independent.

We now count the number of parameters we need to solve for. Given $L$ labels, we need $L$ prevalence parameters. Again, the prevalences sum to one, so $L-1$ prevalence parameters are needed. The best unsupervised inference setting is one where all the recognizers are independent. This should be intuitively obvious to the reader. Independence would mean that the conditional recognition probabilities factor into individual recognizer conditional probabilities. For example, for three recognizers
$$p(\{\ell_1,\ell_2,\ell_3 \} | \ell_t) = p(\ell_1 | \ell_t) p(\ell_2 | \ell_t) p(\ell_3 | \ell_t).$$
So for each recognizer we need to solve for $L*(L-1)$ conditional probabilities: conditioned on a specific true label, the conditional probabilities sum to one, so $(L-1)$ parameters are needed, and there are $L$ possible true labels, hence the $L*(L-1)$ parameters for each independent recognizer. With $R$ recognizers we have to solve for $R*L*(L-1)$ conditional recognition probabilities. Putting it all together, the number of unknown parameters when carrying out unsupervised inference with independent recognizers is
$$(L-1)*(R*L + 1).$$
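
The factorization and the resulting parameter count can be sanity-checked numerically. The sketch below uses made-up conditional probabilities for three binary recognizers and is meant only to illustrate the bookkeeping, not any particular solver:

```python
import math
from itertools import product

# Made-up per-recognizer conditional probabilities p(vote | true label) for a
# binary task with labels "A" and "B"; three independent recognizers.
recognizers = [
    {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}},
    {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}},
    {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.1, "B": 0.9}},
]
labels = ("A", "B")

# Under independence the probability of a voting pattern factors recognizer by
# recognizer, and for each true label the L^R pattern probabilities sum to one.
for true_label in labels:
    total = sum(
        math.prod(r[true_label][vote] for r, vote in zip(recognizers, votes))
        for votes in product(labels, repeat=len(recognizers))
    )
    print(true_label, total)   # 1.0 for each true label

# Free parameters: (L - 1) per recognizer per true label, plus (L - 1) prevalences.
L, R = len(labels), len(recognizers)
print(R * L * (L - 1) + (L - 1))   # (L-1)*(R*L + 1) = 7 here
```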

The inference equations for independent recognizers will become solvable (not necessarily uniquely!) whenever
$$L^R -1 \ge (L-1)*(R*L + 1).$$
To be mathematically strict, an argument based on counting parameters versus equations does not allow us to prove that solutions exist. So we turn the argument around and say that the problem of unsupervised inference is unsolvable for the labeling task whenever
$$L^R -1 \lt (L-1)*(R*L + 1).$$

The marriage lemma follows from setting $R=2$ since
$$L^2 -1 \lt (L-1)*(2*L + 1)$$
is always true for $L \ge 2.$ Since it may be news to some readers that unsupervised inference in the labeling task is possible at all, the lemma becomes interesting once we point out that the problem can indeed be solved. Consider the case of two labels and three independent recognizers. In that case the comparison between the number of inference equations and unknown parameters becomes
$$7 \ge 7,$$
so the problem is solvable.
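
A short tabulation, again with illustrative helper functions, reproduces both halves of the argument: two recognizers never have enough equations, while two labels and three independent recognizers land exactly on the boundary:

```python
# Tabulate independent inference equations against unknown parameters for
# independent recognizers.
def equations(L, R):
    return L**R - 1

def parameters(L, R):
    return (L - 1) * (R * L + 1)

# Two recognizers never have enough equations, whatever the number of labels.
for L in range(2, 7):
    assert equations(L, 2) < parameters(L, 2)
    print(f"R=2, L={L}: {equations(L, 2)} equations < {parameters(L, 2)} parameters")

# Two labels, three independent recognizers: exactly enough.
print(equations(2, 3), parameters(2, 3))   # 7 7
```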

The utility of our technology is that it can readily accommodate the ubiquitous case of correlated recognizers. Independence is a rare quality in most systems deployed today. This may be due to a paradigm that assumes that the best way to carry out annotations is to have a single system that is the most accurate possible. This leads to a kitchen sink approach that time and time again has resulted in highly correlated systems in evaluation settings such as TREC. Our technology opens another door: if you can measure the degree of correlation between systems, you can reward development strategies that increase independence rather than accuracy. In our next post we will discuss correlation between recognizers and the window of opportunity for unsupervised inference implied by the counting argument discussed here.

Equations for unsupervised inference

In our previous post, we discussed the observables in an unsupervised setting: the label voting patterns of the recognizers. In this post we will discuss how we can use the frequency of these voting patterns to create a system of equations that can be used for unsupervised inference. We will use the case of three recognizers carrying out a binary labeling task for ease of discussion but hasten to add that the formalism works for any number of labels.

Take one of the eight possible label voting patterns that the three recognizers can produce. We will use the label voting pattern $\{A,B,A\}$ where $A$ and $B$ represent our two labels. The inference equation for this voting pattern is,
$$\color{red} f_{\{A,B,A\}} = p(\{A,B,A\} | A)p(A) + p(\{A,B,A\} | B)p(B).$$
The frequency of the $\{A,B,A\}$ voting pattern equals the number of times the recognizers were presented with a data point whose correct label is $A$ and they output the labels $\{A,B,A\}$, plus the number of times they were presented with a data point whose correct label is $B$ and output that same voting pattern, divided by the total number of data points. Similar equations can be written for each of the eight label voting patterns.
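
As a rough sketch of how such an equation is assembled in practice, the snippet below measures the left-hand side from made-up, unlabeled votes and evaluates the right-hand side in the factored, independent-recognizer form with placeholder values standing in for the unknowns:

```python
from collections import Counter

# Made-up votes from three recognizers on ten unlabeled documents.
votes = [
    ("A", "B", "A"), ("A", "A", "A"), ("B", "B", "B"), ("A", "B", "A"),
    ("A", "A", "B"), ("B", "B", "A"), ("A", "A", "A"), ("A", "B", "A"),
    ("B", "B", "B"), ("A", "A", "A"),
]

# Left-hand side: the observable frequency of the {A,B,A} pattern.
counts = Counter(votes)
f_ABA = counts[("A", "B", "A")] / len(votes)
print(f_ABA)   # 0.3, measured without any ground-truth labels

# Right-hand side, written in the factored, independent-recognizer form with
# placeholder values for the unknowns we would actually be solving for.
p_A = 0.6
accuracy = {"A": [0.9, 0.8, 0.7], "B": [0.8, 0.9, 0.8]}   # p(vote = true | true)

def pattern_prob(pattern, true_label):
    prob = 1.0
    for acc, vote in zip(accuracy[true_label], pattern):
        prob *= acc if vote == true_label else 1 - acc
    return prob

rhs = (pattern_prob(("A", "B", "A"), "A") * p_A
       + pattern_prob(("A", "B", "A"), "B") * (1 - p_A))
print(rhs)
```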

Each of these equations relates an observable, the frequency of a label voting pattern, that we can measure without any knowledge of the correct labels for the data points to two types of statistical quantities we would like to know: the prevalence of the labels in the data, $p(\ell)$, and the conditional recognition probabilities, $p(\text{votes} | \ell).$

We can immediately deduce from these equations that carrying out unsupervised inference when the recognizers are completely correlated is impossible. For example, if the three recognizers are completely correlated they would output just two patterns, $\{A,A,A\}$ and $\{B,B,B\}$. This observation could be explained by one of two cases: either the recognizers are perfect, and the observed frequency of the two voting events gives us the prevalence of the $A$ and $B$ labels in the data, or they are imperfect, and there is an infinity of solutions for the prevalences and conditional recognition probabilities.

The interesting science occurs in the window of opportunity that these equations create. We will show in our next post that there is a minimum number of recognizers needed to carry out unsupervised inference, and that as long as the recognizers are not too correlated, it is possible to solve the joint unsupervised inference problem for the labeling task.

Voting Patterns in the Labeling Task

Consider a collection of recognizers/annotators/labelers that have to label a data set. The problem of unsupervised inference in this task is being able to infer the true prevalence of the labels in the data and the conditional recognition probabilities for the recognizers.

The most abstract formulation of this problem treats the recognizers as black boxes. All we get to see about the decision process of each recognizer is its label for each data point. This abstraction is very helpful in practice. Treating the recognizers as black boxes allows us to develop algorithms that can readily accommodate heterogeneous decision makers. For example, it is common practice to have human annotators decide a gold standard in situations where computer algorithms can also label. Instead of accepting that the human annotators are perfect (numerous studies have shown they are not!), an algorithm that treats everyone as a black box can pool computer and human decisions and output an accuracy for all of them.

The only information available in this abstract setting is the label voting pattern that the collection of recognizers outputs for each data point. For example, if we have three recognizers carrying out a binary labeling task, such as information retrieval engines deciding if a document is relevant to a search query, the number of possible voting patterns is equal to $8$. In general, for $R$ recognizers trying to label data with one of $L$ labels, the number of observable label voting patterns is $L^R$. There could be fewer than this number of patterns, for example in the case of speech recognizers that have insertion and deletion errors. In other words, the event space of label voting patterns depends on the type of errors that the recognizers make. We will discuss the event space of recognizers that have deletion and insertion errors in a future post.
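
For readers who like to see the event space explicitly, a few lines of Python (purely illustrative) enumerate the $L^R$ patterns for the binary, three-recognizer example:

```python
from itertools import product

# Enumerate the L^R possible label voting patterns for R recognizers and L labels.
labels = ("A", "B")            # L = 2
R = 3                          # three recognizers
patterns = list(product(labels, repeat=R))
print(len(patterns))           # 8 = L**R
print(patterns[:3])            # [('A', 'A', 'A'), ('A', 'A', 'B'), ('A', 'B', 'A')]
```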

The frequency of each of these label voting patterns is a direct observable available for any algorithm that tries to carry out unsupervised inference. In our next post, we will show how each of these observable frequencies results in an inference equation. And it is the collection of these inference equations that determines if a particular labeling task is solvable in an unsupervised setting.

Joint Unsupervised Inference in the Labeling Task

Labeling data is a common task in today’s world. Is there a face in this photograph? What are the base pairs in a DNA snippet? Which documents are relevant to a search query? Thanks to the ingenuity of many researchers, we currently enjoy a wide variety of recognizers for any labeling task. Likewise, we frequently enjoy large amounts of data. In the convergence of those two opportunities, lots of data with many recognizers to label it, lies the possibility of solving the chicken-and-egg problem of determining the prevalence of the labels in the data while jointly inferring the accuracy of the recognizers.

Interestingly, the problem of unsupervised inference is not universally solvable. If the recognizers are strongly correlated, it is impossible to solve the problem. Likewise, if too few recognizers are used, the problem becomes indeterminate. In general, one requires three or more recognizers to carry out unsupervised inference.

Our technology uses the frequencies of the label voting patterns observed among the recognizers to set up a system of inference equations for the unknown label prevalences and the conditional recognition probabilities of the recognizers. When the recognizers are sufficiently independent, the problem is solvable as either a set of polynomial equations or a least squares problem.
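
As a rough illustration of the least-squares route, the sketch below assumes independent binary recognizers and fits the prevalence and per-recognizer accuracies to synthetic voting-pattern frequencies with scipy.optimize.least_squares. It is a toy version of the idea, not our production solver; note that the binary independent model has a label-swapped twin solution, so the starting point matters:

```python
import numpy as np
from itertools import product
from scipy.optimize import least_squares

labels = (0, 1)                          # binary task
R = 3                                    # three recognizers
patterns = list(product(labels, repeat=R))

def model_frequencies(theta):
    # theta = [p(label 0), then p(recognizer i votes label t | true label t)]
    prevalence = (theta[0], 1 - theta[0])
    accuracy = theta[1:].reshape(2, R)   # accuracy[t, i] = p(vote = t | true = t)
    freqs = []
    for pattern in patterns:
        f = 0.0
        for t in labels:
            term = prevalence[t]
            for i, vote in enumerate(pattern):
                term *= accuracy[t, i] if vote == t else 1 - accuracy[t, i]
            f += term
        freqs.append(f)
    return np.array(freqs)

def residuals(theta, observed):
    return model_frequencies(theta) - observed

# Synthetic "observed" voting-pattern frequencies generated from known values.
true_theta = np.array([0.6, 0.9, 0.8, 0.7, 0.8, 0.9, 0.8])
observed = model_frequencies(true_theta)

fit = least_squares(residuals, x0=np.full(7, 0.55), args=(observed,), bounds=(0.0, 1.0))
print(fit.x.round(3))   # should land near true_theta (or its label-swapped twin)
```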

One important feature of our technology is that it can be applied even when the recognizers are correlated. Given enough recognizers, increasing levels of correlation between the recognizers can be fitted. In addition, we acquire, through Bayes’ theorem, the ability to turn the inferred values into a better system combination in which the best recognizers have their label decisions weighted more heavily than mediocre ones.

In coming posts, we will discuss the intricacies of our technology and show how it can be extended to tasks such as measuring the accuracy of a collection of rankers.