In our previous post we explained how each of the frequencies of the label voting patterns leads to an equation. On one side of the equation is the observable frequency of the particular voting pattern; these frequencies can be measured directly, with no knowledge of the correct labels for the data points. On the other side is an expression in the unknown quantities we want to recover: the prevalence of the labels in the data and the conditional recognition probabilities.
In this post we want to prove the marriage lemma for the labeling task: two recognizers will never be able to solve the unsupervised inference problem. This lemma has appeared in our previous work on unsupervised inference for the regression task, for example in this ICML 2008 paper. The proof is based on comparing the number of unknown parameters with the number of inference equations.
The number of inference equations is $L^R$, where $L$ is the number of labels and $R$ is the number of recognizers/annotators/labelers. Since we are using frequencies of the label voting patterns, the equations must sum to one and hence only $L^R - 1$ of them are independent.
We now count the number of parameters we need to solve for. Given $L$ labels, there are $L$ prevalence parameters; since the prevalences sum to one, only $L-1$ of them are free. The best setting for unsupervised inference is one where all the recognizers are independent, since independence means the conditional recognition probabilities factor into individual recognizer conditional probabilities. For example, for three recognizers
$$p(\{\ell_1,\ell_2,\ell_3 \} | \ell_t) = p(\ell_1 | \ell_t) p(\ell_2 | \ell_t) p(\ell_3 | \ell_t).$$
So for each recognizer we need to solve for $L*(L-1)$ conditional probabilities: conditioned on a specific true label, the conditional probabilities sum to one, so $L-1$ parameters are needed, and there are $L$ possible true labels, giving $L*(L-1)$ parameters per independent recognizer. With $R$ recognizers, we have to solve for $R*L*(L-1)$ conditional recognition probabilities in total. Putting it all together, the number of unknown parameters when carrying out unsupervised inference with independent recognizers is
$$(L-1)*(R*L + 1).$$
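To make the independence factorization concrete, here is a minimal sketch of the forward model it implies: given a prevalence over true labels and one confusion table per recognizer, the observable frequency of each voting pattern is a prevalence-weighted product of per-recognizer conditionals. The function name, label names, and numbers below are ours, chosen purely for illustration.

```python
import itertools


def pattern_frequencies(prevalence, confusions):
    """Forward model for independent recognizers.

    prevalence: dict mapping true label -> p(true label)
    confusions: one dict per recognizer mapping
                (observed label, true label) -> p(observed | true)

    Returns a dict mapping each voting pattern (a tuple, one vote
    per recognizer) to its predicted observable frequency.
    """
    labels = list(prevalence)
    freqs = {}
    for pattern in itertools.product(labels, repeat=len(confusions)):
        total = 0.0
        for true in labels:
            # p({l_1, ..., l_R} | true) factors under independence
            p = prevalence[true]
            for vote, conf in zip(pattern, confusions):
                p *= conf[(vote, true)]
            total += p
        freqs[pattern] = total
    return freqs


# Hypothetical example: 2 labels, 3 identical recognizers.
prev = {"a": 0.7, "b": 0.3}
conf = {("a", "a"): 0.9, ("b", "a"): 0.1,
        ("a", "b"): 0.2, ("b", "b"): 0.8}
freqs = pattern_frequencies(prev, [conf, conf, conf])

# The 2**3 = 8 pattern frequencies sum to one, which is why only
# 2**3 - 1 of the inference equations are independent.
assert abs(sum(freqs.values()) - 1.0) < 1e-12
```

Solving the unsupervised inference problem means running this map in reverse: from the measured `freqs`, recover `prev` and the `conf` tables.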
The inference equations for independent recognizers will become solvable (not necessarily uniquely!) whenever
$$L^R -1 \ge (L-1)*(R*L + 1).$$
To be mathematically strict, an argument based on counting parameters versus equations does not allow us to prove that solutions exist. So we turn the argument around and say that the problem of unsupervised inference is unsolvable for the labeling task whenever
$$L^R -1 \lt (L-1)*(R*L + 1).$$
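The counting argument is easy to mechanize. Below is a small sketch (function names are ours) that compares the two sides of the inequality; note that passing the check is only a necessary condition for solvability, not a sufficient one, per the caveat above.

```python
def num_equations(L, R):
    """Independent inference equations: L**R voting-pattern
    frequencies, minus one because they sum to one."""
    return L**R - 1


def num_parameters(L, R):
    """Unknowns for independent recognizers: (L-1) free prevalences
    plus L*(L-1) conditional probabilities for each of R recognizers,
    i.e. (L-1)*(R*L + 1)."""
    return (L - 1) * (R * L + 1)


def possibly_solvable(L, R):
    """Counting-argument check: necessary, not sufficient."""
    return num_equations(L, R) >= num_parameters(L, R)


# Marriage lemma: two recognizers never suffice, for any label count.
assert not any(possibly_solvable(L, 2) for L in range(2, 100))

# Smallest solvable configuration: two labels, three recognizers (7 >= 7).
assert possibly_solvable(2, 3)
```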
The marriage lemma follows from setting $R=2$ since
$$L^2 -1 \lt (L-1)*(2*L + 1)$$
is always true for $L \ge 2.$ Since it may be news to some readers that unsupervised inference in the labeling task is possible at all, the lemma becomes interesting once we point out that the problem can in fact be solved. Consider the case of two labels and three independent recognizers. In that case the comparison between the number of inference equations and the number of unknown parameters becomes
$$7 \ge 7,$$
so the problem is solvable.
The utility of our technology is that it can readily accommodate the ubiquitous case of correlated recognizers. Independence is a rare quality among the systems deployed today. This may be due to a paradigm that assumes the best way to carry out annotations is a single system that is as accurate as possible. This leads to a kitchen-sink approach that, time and time again, has resulted in highly correlated systems in evaluation settings such as TREC. Our technology opens another door: if you can measure the degree of correlation between systems, you can reward development strategies that increase independence rather than accuracy. In our next post we will discuss correlation between recognizers and the window of opportunity for unsupervised inference implied by the counting argument discussed here.