Independence and its usefulness, part I

Independence of decision makers is precious in an unsupervised inference setting. Sometimes we are better off with independent recognizers that individually perform worse than with highly correlated recognizers that have higher accuracy. The question comes down to how much computation you have to do to obtain a desired level of uncertainty. Before I get into that comparison in a later post, I wanted to discuss some other points of view on independence in the context of multiple annotators.

We start with an interesting entry in Panos Ipeirotis’s blog about the benefits and drawbacks of independence. Ipeirotis and his collaborators have done a lot of work at the intersection of machine learning and crowdsourcing platforms like Mechanical Turk. For example, when handing out an annotation task on Mechanical Turk, should one get more than one opinion on the annotation of some data points? Ideally, you would want to minimize your labeling costs, so handing out non-overlapping tasks to your labor force is cheapest: no data point gets seen by more than one person. But annotators are not perfect. If you knew how error-prone they were (an unsupervised inference application!), when would it pay to get more than one opinion because it reduces your uncertainty about the labels by a great deal? Check out their paper for details on the solution.

Ipeirotis uses the concept of independence to discuss three issues about collective decision making: correlation, modularity, and communication. In this post I will start with the first topic — statistically correlated decision makers.

Independence in the context of correlation refers to decision makers making their errors independently of one another. Ipeirotis points to a paper by Clemen and Winkler that looked at aggregating continuous estimators with Gaussian noise. Assuming that all the estimators are unbiased, they ask: how much does the variance of the combined estimate decrease when you aggregate their opinions? The surprising answer (to me) is
$$
n_{\text{independent}} = \frac{k}{1 + (k-1) \rho}
$$
where Clemen and Winkler quantify the reduction in variance for $k$ dependent estimators by the number of independent estimators that would have produced the same reduction. To make the comparison fair, we assume that all estimators have the same individual variance. The case of independent estimators corresponds to $\rho = 0$. In that case, all the estimators are at “full power,” if you will, and we have
$$
n_{\text{independent}} = k
$$

What I find surprising about the formula is that the moment the estimators cease to be independent, $\rho \ne 0$, the reduction in variance is clamped at a finite value! As $k \rightarrow \infty$, the formula becomes
$$
n_{\text{independent}} = \frac{1}{\rho}.
$$
I would have expected some power-law decrease in the effectiveness of the dependent estimators. Instead, for the case $\rho=0.1$ the dependent estimators are never better than ten independent ones, no matter how many you have! A million dependent estimators are just as good as ten; so are a billion. It does not matter.
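To see the ceiling numerically, here is a minimal Python sketch of the Clemen and Winkler formula (the function name is mine):

```python
def effective_independent(k, rho):
    """Number of independent, unbiased, equal-variance estimators that give
    the same variance reduction as k estimators with common correlation rho."""
    return k / (1 + (k - 1) * rho)

# With rho = 0.1 the benefit saturates near 1/rho = 10.
for k in (2, 10, 100, 1_000_000):
    print(k, round(effective_independent(k, rho=0.1), 2))
# 2 1.82
# 10 5.26
# 100 9.17
# 1000000 10.0
```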

Granted, they are assuming that all the estimators are correlated with each other. If you made any real effort at creating different estimators, this would be easy to circumvent; for example, different meters could be built around different physical principles. But in the context of human annotators, this is not possible, so I can see why even a little correlation in your annotators could make it uneconomical to aggregate estimates: one annotator gets you most of the information you need.

In our next post we will discuss Ipeirotis’s second concept of independence — modularity. This will allow us to bring into our discussion the topic of machine learning reductions.

Average WER without any correct transcripts: the large V algorithm

The vocabulary size of some recognizers can be quite large. Dictation speech recognition systems can recognize on the order of 50 thousand words. Carrying out an unsupervised inference calculation for such a large number of words/labels would be prohibitive. With $V=50,000$, one would have to fit about $10^{14}$ frequency terms when considering the output of three recognizers – an impossible task.

Granted, the effective number of recognition events would be much smaller since most vocabulary confusions would not occur. A speech recognizer typically confuses a given word with only 20 or so others. This would result in about $C \cdot 20^R$ expected patterns, where $C$ is the number of confusability classes. For three recognizers and thousands of confusability classes this would still require us to fit millions of frequencies!
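To put rough numbers on those counts, a quick back-of-the-envelope calculation in Python (the confusability-class count is a made-up illustrative value):

```python
V, R = 50_000, 3            # vocabulary size, number of recognizers
print(V ** R)               # ~1.25e14 frequency terms over the full vocabulary

C, confusable = 1_000, 20   # hypothetical confusability classes, words per class
print(C * confusable ** R)  # 8,000,000 patterns once confusions are localized
```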

The way to solve this problem is to abandon label-specific knowledge and settle for the average WER. In this post we will show that it is possible to do so by turning any large $V$ unsupervised inference problem into one with $(R+1)$ labels. In practice, we have not given up much. Most development is driven by average measures like this, and that is a reasonable strategy whenever the costs associated with errors are roughly constant. I have sat in many meetings where the decision to deploy one speech recognizer versus another was based solely on the average WER ranking.

Let’s start with the basics. We define the word error rate of a single recognizer for a specific label, $\ell_t$, as
$$\sum_{\ell \ne \ell_t} p(\ell | \ell_t),$$
which is more easily expressed in terms of the accuracy for that label,
$$1 - p(\ell_t | \ell_t).$$
The recognizer’s average WER is the per-label WERs weighted by the prevalence of the labels,
$$\sum_{\ell_t \in \mathcal{L}} \left(\sum_{\ell \ne \ell_t} p(\ell | \ell_t) \right) p(\ell_t),$$
or again, in terms of average label accuracy,
$$1 - \sum_{\ell_t \in \mathcal{L}} p(\ell_t | \ell_t)\, p(\ell_t).$$
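If we did have label-level statistics, the bookkeeping behind these formulas would be straightforward. Here is a minimal Python sketch, with a made-up three-label confusion matrix and prevalences (not from any real recognizer):

```python
import numpy as np

def average_wer(confusion, prevalence):
    """Average WER = 1 - sum_t p(l_t | l_t) * p(l_t).

    confusion[t, j] = p(label j | true label t), rows sum to one;
    prevalence[t]   = p(true label t), sums to one.
    """
    return 1.0 - float(np.sum(np.diag(confusion) * prevalence))

confusion = np.array([[0.90, 0.05, 0.05],
                      [0.10, 0.80, 0.10],
                      [0.20, 0.10, 0.70]])
prevalence = np.array([0.5, 0.3, 0.2])
print(average_wer(confusion, prevalence))  # 1 - (0.45 + 0.24 + 0.14) ~= 0.17
```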

As always, it is necessary, and easiest, to first talk about the case of independent recognizers. We assume that we have the aligned output of multiple such recognizers. We will use three recognizers to illustrate the algorithm, although a count of inference equations versus parameters shows that at least $R=5$ recognizers are required to actually solve the task. The table below shows a hypothetical sample of their output:

Recognizer 1 | Recognizer 2 | Recognizer 3 | truth
The | Day | The | ?
quick | quick | quick | ?
lawn | brown | brown | ?
sox | pox | lox | ?

We begin the transformation by replacing all the output of an arbitrary recognizer by the abstract label $\alpha$. We’ll pick the first recognizer so the recognizer output now looks like this:

Recognizer 1 | Recognizer 2 | Recognizer 3 | truth
$\alpha$ | Day | The | ?
$\alpha$ | quick | quick | ?
$\alpha$ | brown | brown | ?
$\alpha$ | pox | lox | ?

We proceed by transforming the output of a second recognizer; we choose recognizer 2. Whenever its output agrees with the first recognizer’s, we place an $\alpha$ label. If its output disagrees, we put a $\beta$ label. The output now looks like this:

Recognizer 1 | Recognizer 2 | Recognizer 3 | truth
$\alpha$ | $\beta$ | The | ?
$\alpha$ | $\alpha$ | quick | ?
$\alpha$ | $\beta$ | brown | ?
$\alpha$ | $\beta$ | lox | ?

Finally, we transform the output of the third recognizer. Three possible labels can be assigned. The label $\alpha$ is assigned if it agrees with the first recognizer. If not, $\beta$ is assigned if it agrees with the second one. And finally, a $\gamma$ label is assigned if its output does not agree with either of the other recognizers. The final state of the table is now:

Recognizer 1 | Recognizer 2 | Recognizer 3 | truth
$\alpha$ | $\beta$ | $\alpha$ | ?
$\alpha$ | $\alpha$ | $\alpha$ | ?
$\alpha$ | $\beta$ | $\beta$ | ?
$\alpha$ | $\beta$ | $\gamma$ | ?

The crucial point is that the same algorithm can be extended to the unknown “truth” column. The correct output may agree with some of the recognizers or it may not, so we need to introduce a fourth abstract label, $\delta$, to capture this last case. For our example, the last column would look as follows:

Recognizer 1 | Recognizer 2 | Recognizer 3 | truth
$\alpha$ | $\beta$ | $\alpha$ | $\alpha$
$\alpha$ | $\alpha$ | $\alpha$ | $\alpha$
$\alpha$ | $\beta$ | $\beta$ | $\beta$
$\alpha$ | $\beta$ | $\gamma$ | $\delta$

But, of course, we do not know the last column because the ground truth is not available to us. If we did, we could immediately calculate the WER! But we have succeeded in mapping any large $V$ recognition task to an abstract $(R+1)$-label problem. Since we have an unsupervised inference algorithm for that task – we can calculate the average WER no matter how large $V$ becomes.
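For concreteness, here is a minimal sketch of the relabeling step as I read it, assuming the recognizer outputs (and, when available, the true words) arrive as aligned rows; the function name and spelled-out Greek labels are my own:

```python
GREEK = ["alpha", "beta", "gamma", "delta", "epsilon"]

def to_abstract_labels(row):
    """Relabel one aligned row (R recognizer outputs, optionally followed by
    the true word): each entry gets the abstract label of the first position
    in the row that produced the same word."""
    return [GREEK[row.index(word)] for word in row]

rows = [["The", "Day", "The"],
        ["quick", "quick", "quick"],
        ["lawn", "brown", "brown"],
        ["sox", "pox", "lox"]]
for row in rows:
    print(to_abstract_labels(row))
# ['alpha', 'beta', 'alpha']
# ['alpha', 'alpha', 'alpha']
# ['alpha', 'beta', 'beta']
# ['alpha', 'beta', 'gamma']
```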

For the rest of the post, we will assume that the unsupervised inference calculation has been performed and we have available the single recognizer marginals (the recognizers don’t need to be independent!). Let us look at how we would calculate the average WER for the first recognizer.

Its WER is given by the sum of the prevalences of all but the $\alpha$ label,
$$\sum_{\lambda \ne \alpha} p(\lambda).$$
Why? Since our transformation substituted every one of the first recognizer’s entries with $\alpha$, its conditional recognition probabilities for the abstract labels must be of the form
$$\{ p_{1}(\alpha | \lambda) =1 \}_{\lambda \in \Lambda},$$
where $\Lambda$ represents the collection of abstract labels used to transform the recognition outputs. Therefore its average WER is
\begin{align}
\sum_{\lambda \ne \alpha} p_{1}(\alpha | \lambda)\, p(\lambda) &= \sum_{\lambda \ne \alpha} 1 \cdot p(\lambda) \\
&= \sum_{\lambda \ne \alpha} p(\lambda).
\end{align}
It is easier to express this last expression in terms of the average accuracy of the recognizer,
$$1 - \sum_{\lambda \in \{\alpha\}} p_{1}(\lambda | \lambda)\, p(\lambda) = 1 - p(\alpha).$$
In general, for recognizer $r$ the average WER is given by
$$1 - \sum_{\lambda \in \Lambda_r} p_{r}(\lambda | \lambda)\, p(\lambda),$$
with the set $\Lambda_r = \{\lambda_1, \lambda_2, \ldots, \lambda_r \}$ representing the $r$ labels that can appear in recognizer $r$’s output.

The key is that we have transformed the average WER, which involves events over the correct labels, into an expression involving labels that capture the pattern of agreement and disagreement between the recognizers. The average WER can be calculated because the large space of recognition events over $V$ labels can be partitioned into a much smaller space where abstract labels record only the agreements and disagreements.
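Putting the general formula to work, a minimal sketch with made-up abstract-label marginals (the prevalences and conditionals below are illustrative only, not the output of a real inference run):

```python
def recognizer_wer(prevalence, diag_conditional):
    """WER_r = 1 - sum over lambda of p_r(lambda | lambda) * p(lambda).

    prevalence       : dict abstract label -> p(lambda), from unsupervised inference
    diag_conditional : dict abstract label -> p_r(lambda | lambda) for recognizer r;
                       labels the recognizer can never emit are simply omitted.
    """
    return 1.0 - sum(diag_conditional.get(lam, 0.0) * p
                     for lam, p in prevalence.items())

# Recognizer 1 always outputs alpha, so p_1(alpha | alpha) = 1 and its
# average WER collapses to 1 - p(alpha).
prevalence = {"alpha": 0.82, "beta": 0.10, "gamma": 0.05, "delta": 0.03}
print(recognizer_wer(prevalence, {"alpha": 1.0}))  # ~= 0.18
```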

Pretty nifty, eh! Another reason why our company motto is “Sufficiently Advanced Technologies”. Magic is possible when you give up specific details for averaged ones.

Unsupervised inference for sequential statistics

In our previous posts, we have been talking about unsupervised inference of prevalence statistics for data. Given a labeled collection of data, how often does label $\ell$ appear? For example, in a DNA sample, how often do we see a particular nucleotide? How often does a particular face appear in a video sample? Sometimes, however, these simple statistics are not enough. We may want to know more than point statistics.

This is particularly so in the case of sequential data, e.g. the output of a speech recognizer. In such cases we would want to answer questions like: how often does the word pair “New York” get recognized correctly? Answering this requires that we infer two things: the prevalence of the phrase in the audio and the ability of the recognizers transcribing it to output such a pair.

In other words, we now want the prevalence of label pairs in the data and their related conditional recognition probabilities. So we would be interested in quantities like
$$p(\ell_i,\ell_j)$$
and
$$p_{r}(\ell_k, \ell_l | \ell_i, \ell_j)$$

The way to infer these statistics is to apply the same frequency-counting argument we used for point statistics to the label-pair events associated with these second-order statistics. So to infer something about the probability of labels following each other in a sequential data record, we just count the frequency of the $L^R \cdot L^R$ possible events that occur in the labeling output of the recognizers.

Considerations similar to those we have discussed for the case of point statistics show that a necessary condition for this unsupervised inference problem to be solvable is that the number of recognizers must satisfy
$$L^{2R} - 1 \ge (L^2 - 1)(R L^2 + 1).$$
The greater the gap between the two sides of this inequality, the greater our ability to also solve the problem with correlated recognizers. Furthermore, since the pairs can be formed from any two positions, we could repeatedly take statistics of labels one apart, two apart, and so on, to build an autocorrelation function in cases where that makes sense for the data.
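Here is a sketch of the counting step, assuming aligned outputs and using toy binary labels of my own; the event at each position is the $R$-tuple of (label at $i$, label at $i+\mathrm{lag}$) pairs:

```python
from collections import Counter

def pair_event_counts(aligned_outputs, lag=1):
    """Count joint pair events across R aligned recognizers.

    aligned_outputs[r][i] is recognizer r's label at position i.  Each event
    is the R-tuple of (label_i, label_{i+lag}) pairs, one per recognizer,
    so there are (L^R) * (L^R) possible events.
    """
    length = len(aligned_outputs[0])
    counts = Counter()
    for i in range(length - lag):
        counts[tuple((out[i], out[i + lag]) for out in aligned_outputs)] += 1
    return counts

outputs = [list("aabab"), list("ababb"), list("aabbb")]  # three binary recognizers
print(pair_event_counts(outputs, lag=1))
```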

But the magic does not stop there. In principle, we would be able to calculate any $n$-th order statistic by looking at the frequency of $(L^R)^n$ recognition events, and the minimum number of recognizers needed to carry out unsupervised inference would have as a necessary condition (not sufficient!) the following inequality:
$$L^{nR} - 1 \ge (L^n - 1)(R L^n + 1).$$
The only practical limit would be the maximum number of variables that your global non-linear optimizer can handle. Given that the current state of the art in that technology is on the order of thousands of variables, second- and third-order statistics should be obtainable.
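As a sanity check on the counting condition, a small sketch that searches for the smallest $R$ satisfying the inequality above (remember, it is necessary but not sufficient):

```python
def min_recognizers(L, n):
    """Smallest R with L**(n*R) - 1 >= (L**n - 1) * (R * L**n + 1),
    the necessary (not sufficient) counting condition above."""
    R = 1
    while L ** (n * R) - 1 < (L ** n - 1) * (R * L ** n + 1):
        R += 1
    return R

for n in (1, 2, 3):
    print(n, min_recognizers(L=2, n=n))  # binary labels: 3 recognizers at each order
```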

Bias is cheap to fix, variance expensive

The title of this post is another way to express the title of the previous post in terms that may be more familiar to people knowledgeable about Machine Learning. To make a physical analogy that would be familiar to physical scientists: the DC offset of a voltmeter is easy to fix by installing a biasing field controlled by an external knob available to the user. It is not so easy to give the user a knob that makes the voltage reading arbitrarily precise. Each additional digit of precision has an increasing marginal cost; going from 2 to 3 precision digits costs less than going from 3 to 4.