Labeling data is a common task in today’s world. Is there a face in this photograph? What are the base pairs in a DNA snippet? Which documents are relevant to a search query? Thanks to the ingenuity of many researchers, we currently enjoy a wide variety of recognizers for any labeling task. Likewise, we frequently enjoy large amounts of data. In the convergence of those two opportunities – lots of data and many recognizers to label it – lies the possibility of solving a chicken-and-egg problem: determining the prevalence of the labels in the data while jointly inferring the accuracy of the recognizers.
Interestingly, the problem of unsupervised inference is not universally solvable. If the recognizers are strongly correlated, it cannot be solved. Likewise, if too few recognizers are used, the problem becomes underdetermined. For binary labels, a simple count shows why: R recognizers yield 2^R − 1 independent voting-pattern frequencies but 2R + 1 unknowns (the label prevalence plus two conditional accuracies per recognizer), and the two counts first balance at R = 3, with seven of each. In general, one requires three or more recognizers to carry out unsupervised inference.
Our technology uses the observed frequencies of the recognizers’ label voting patterns to set up inference equations for the unknown label prevalences and each recognizer’s conditional probabilities of correct recognition. When the recognizers are sufficiently independent, the problem is solvable either as a set of polynomial equations or as a least-squares problem.
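To make the setup concrete, here is a minimal sketch of the least-squares route for binary labels. Three conditionally independent recognizers are simulated (the prevalence and accuracy numbers are illustrative assumptions, not outputs of our method), their voting-pattern frequencies are tallied, and the seven unknowns are recovered with SciPy’s generic `least_squares` solver:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

# Hypothetical ground truth used only to simulate data; in practice these
# are exactly the unknowns the inference recovers.
p_true = 0.3                          # label prevalence
sens = np.array([0.85, 0.80, 0.90])   # P(vote = 1 | label = 1) per recognizer
spec = np.array([0.75, 0.85, 0.80])   # P(vote = 0 | label = 0) per recognizer

# Simulate votes from three conditionally independent binary recognizers.
n = 100_000
labels = rng.random(n) < p_true
votes = np.where(labels[:, None],
                 rng.random((n, 3)) < sens,     # positive items
                 rng.random((n, 3)) >= spec     # negative items
                 ).astype(int)

# Observed frequency of each of the 8 voting patterns 000 ... 111.
observed = np.bincount(votes @ np.array([4, 2, 1]), minlength=8) / n

def predicted(theta):
    """Pattern frequencies implied by (prevalence, sensitivities, specificities)."""
    p, s, t = theta[0], theta[1:4], theta[4:7]
    freqs = np.empty(8)
    for k in range(8):
        bits = [(k >> b) & 1 for b in (2, 1, 0)]
        pos = np.prod([s[i] if bits[i] else 1 - s[i] for i in range(3)])
        neg = np.prod([1 - t[i] if bits[i] else t[i] for i in range(3)])
        freqs[k] = p * pos + (1 - p) * neg
    return freqs

# Fit the 7 unknowns to the 8 observed frequencies.
theta0 = np.array([0.5, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])
fit = least_squares(lambda th: predicted(th) - observed,
                    theta0, bounds=(0.0, 1.0))
p_hat = fit.x[0]
print(f"estimated prevalence: {p_hat:.3f} (true: {p_true})")
```

Note the label-switching symmetry of such models: swapping the prevalence for its complement while exchanging sensitivities with one-minus-specificities leaves every pattern frequency unchanged, which is why the fit above starts on the better-than-chance branch.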
One important feature of our technology is that it can be applied even when the recognizers are correlated: given enough recognizers, increasing levels of correlation between them can be fitted. In addition, Bayes’ theorem lets us turn the inferred values into a better system combination, one in which the label decisions of the best recognizers are weighted more heavily than those of mediocre ones.
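Once a prevalence and per-recognizer accuracies are in hand (the numbers below are the same illustrative assumptions as before, standing in for inferred values), Bayes’ theorem gives the posterior probability of a positive label for any observed vote pattern, a sketch of the kind of combination described above:

```python
import numpy as np

# Illustrative, assumed-known values; in practice these would come from
# the unsupervised inference step.
p = 0.3                               # label prevalence
sens = np.array([0.85, 0.80, 0.90])   # P(vote = 1 | label = 1)
spec = np.array([0.75, 0.85, 0.80])   # P(vote = 0 | label = 0)

def posterior_positive(votes):
    """P(label = 1 | votes) via Bayes' theorem, assuming conditional independence."""
    votes = np.asarray(votes)
    like_pos = np.prod(np.where(votes == 1, sens, 1 - sens))
    like_neg = np.prod(np.where(votes == 1, 1 - spec, spec))
    num = p * like_pos
    return num / (num + (1 - p) * like_neg)

# Two positive votes, yet the posterior stays just below even odds: the
# prior prevalence is low and the dissenting recognizer is the strongest.
print(posterior_positive([1, 1, 0]))
```

This is how inferred accuracies become decision weights: each recognizer contributes its own likelihood ratio to the posterior, so an accurate recognizer moves the combined decision more than a mediocre one.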
In coming posts, we will discuss the intricacies of our technology and show how it can be extended to tasks such as measuring the accuracy of a collection of rankers.