Independence of decision makers is precious in an unsupervised inference setting. Sometimes we are better off with independent recognizers that are individually less accurate than with highly correlated recognizers that are individually more accurate. The question comes down to how much computation you have to do to reach a desired level of uncertainty. Before I get into that comparison in a later post, I wanted to discuss some other points of view on independence in the context of multiple annotators.
We start with Panos Ipeirotis, who has an interesting entry on his blog about the benefits and drawbacks of independence. Ipeirotis and collaborators have done a lot of work at the intersection of machine learning and crowdsourcing platforms like Mechanical Turk. For example, when handing out an annotation task on Mechanical Turk, should one get more than one opinion on the annotation of some data points? Ideally you would minimize your labeling costs, and handing out non-overlapping tasks to your labor force is the cheapest option: no data point gets seen by more than one person. But annotators are not perfect. If you knew how error-prone they were (an unsupervised inference application!), when would it pay to get more than one opinion because the extra labels reduce your uncertainty enough to be worth the cost? Check out their paper for details on the solution.
Ipeirotis uses the concept of independence to discuss three issues about collective decision making: correlation, modularity, and communication. In this post I will start with the first topic — statistically correlated decision makers.
Independence in the context of correlation refers to decision makers making their errors independently of one another. Ipeirotis points out a paper by Clemen and Winkler that looked at aggregating continuous estimators with Gaussian noise. Assuming that all the estimators are unbiased, they ask: how much does the variance of the combined estimate decrease when you aggregate their opinions? The surprising answer (to me) is
$$
n_{\text{independent}} = \frac{k}{1 + (k-1) \rho}
$$
where Clemen and Winkler express the reduction in variance achieved by $k$ dependent estimators as the number of independent estimators that would have produced the same reduction. Here $\rho$ is the common pairwise correlation between the estimators’ errors, and to make the comparison fair, we assume that all estimators have the same variance by themselves. The case of independent estimators corresponds to $\rho = 0$. In that case, all the estimators are at “full power,” if you will, and we have
$$
n_{\text{independent}} = k
$$
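Where does that expression come from? Here is a quick sketch under the assumptions above (every estimator unbiased with the same variance $\sigma^2$, a common pairwise correlation $\rho$, and the opinions combined by a simple average); Clemen and Winkler’s analysis is more general, but this special case shows the mechanics. The variance of the average of $k$ such estimators is
$$
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} X_i\right)
= \frac{1}{k^2}\left[k\,\sigma^2 + k(k-1)\,\rho\,\sigma^2\right]
= \sigma^2\,\frac{1 + (k-1)\rho}{k},
$$
and setting this equal to $\sigma^2 / n_{\text{independent}}$, the variance of an average of $n_{\text{independent}}$ independent estimators, and solving gives the formula above.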
What I find surprising about the formula is that the moment the recognizers cease to be independent ($\rho \ne 0$), the achievable reduction in variance is capped at a finite value! As $k \rightarrow \infty$, the effective number of independent estimators approaches
$$
n_{\text{independent}} = \frac{1}{\rho}.
$$
I would have expected some power-law decrease in the effectiveness of the dependent estimators, not a hard ceiling. For example, with $\rho=0.1$ the dependent estimators will never beat ten independent ones, no matter how many of them you have! A million dependent estimators are just as good as ten independent ones, and a billion the same. It does not matter.
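To convince myself of the plateau, here is a quick numerical check (a sketch in Python with NumPy, not code from the paper; the shared-component construction of equicorrelated errors and the function names are my own). It averages $k$ unit-variance, equicorrelated estimators, converts the variance of the average back into an equivalent number of independent estimators, and compares that to the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_n(k, rho):
    """Effective number of independent estimators for k equicorrelated ones."""
    return k / (1 + (k - 1) * rho)

def simulated_effective_n(k, rho, n_trials=20_000):
    """Monte Carlo check: average k unit-variance estimators whose errors share
    a common pairwise correlation rho, then convert the variance of that average
    into the equivalent number of independent unit-variance estimators."""
    # Equicorrelated errors via a shared component:
    #   X_i = sqrt(rho)*Z + sqrt(1-rho)*E_i  =>  Var(X_i) = 1, Corr(X_i, X_j) = rho
    shared = rng.standard_normal((n_trials, 1))
    private = rng.standard_normal((n_trials, k))
    errors = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private
    var_of_average = errors.mean(axis=1).var()
    # Averaging n independent unit-variance estimators gives variance 1/n.
    return 1.0 / var_of_average

rho = 0.1
for k in (1, 10, 100, 1000):
    print(f"k={k:>5}  formula={effective_n(k, rho):6.3f}  "
          f"simulated={simulated_effective_n(k, rho):6.3f}")
# Both columns level off near 1/rho = 10 as k grows.
```

Both the formula and the simulation flatten out near $1/\rho = 10$: past a point, adding more correlated estimators buys you essentially nothing.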
Granted, they are assuming that all the estimators are correlated with each other. If you made any real effort at building diverse estimators, this would be easy to circumvent; different meters, for example, could be built on different physical principles. But in the context of human annotators that kind of engineered diversity is not available, so I can see why even a little correlation among your annotators could make it uneconomical to aggregate many opinions: one annotator already gets you most of the information you are ever going to get.
In our next post we will discuss Ipeirotis’s second concept of independence — modularity. This will allow us to bring into our discussion the topic of machine learning reductions.