Unsupervised inference for graph statistics

I’ve been reading about Machine Learning Reductions and trying to understand their similarity to the unsupervised inference transformations we use – for example, turning an inference problem on rankings into one on labels, as detailed in this post. The beauty of such transformations is that if you can turn an unsupervised inference problem into one involving labels, you can solve it. And given the generality of the concept of labels, you can solve a lot of problems. In this post we detail another application of this concept: how to carry out unsupervised inference for algorithms that return graphs.

Suppose that you wanted to know how good those graphs were, but you had no ground truth to use. If you settled for a statistic of the graphs, you could still do it. For example, let us ask how accurately the algorithms lay down a correct connection between two nodes. That corresponds to labeling every node pair with one of two labels: edge present in the graph, or edge absent from it. For every data instance that caused the algorithms to spit out a graph annotation, you would have a collection of node pairs to which each algorithm assigned one label or the other. In other words, you would have the decision patterns of a binary labeling task. Modeling the pairwise decision correlations between the annotators takes at least 6 of them.
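Here is a minimal sketch of that reduction, assuming each algorithm’s output graph is represented as a set of node-pair tuples; the representation and the function name are ours, chosen for illustration:

```python
from itertools import combinations

def edge_decisions(nodes, graphs):
    """For every node pair, record each algorithm's binary decision:
    1 if the pair is an edge in its graph, 0 if it is absent.
    `graphs` is a list of edge sets, one per algorithm."""
    return {pair: [int(pair in g) for g in graphs]
            for pair in combinations(sorted(nodes), 2)}

# Two algorithms annotating the same three nodes:
decisions = edge_decisions({"a", "b", "c"},
                           [{("a", "b")}, {("a", "b"), ("b", "c")}])
# {('a', 'b'): [1, 1], ('a', 'c'): [0, 0], ('b', 'c'): [0, 1]}
```

Each dictionary value is one row of the binary decision patterns that the label-based inference machinery consumes.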

A meter for information retrieval accuracy

Data Engines sells shiny new meters. We have no idea how to make information retrieval algorithms more accurate or better. We just know how to measure their retrieval accuracy when no ground truth for the correct relevance judgments is available. So we think of our algorithms for unsupervised inference as meters – sort of like voltmeters and ammeters. In our case, our meters measure how accurate your retrieval engines are. This number is just a statistic of the unknown ground truth – nothing more.

The interesting part comes when you think about the “nothing less” counterpart of what we just said. What can you accomplish when you are able to rank your retrievers correctly? For one, it gives you access to Bayes’ theorem to make better decisions than majority vote. And by the very same Bayes’ theorem, you get a probability estimate of your uncertainty – a particularly good way to spot the instances in your data for which your algorithms have less discriminative power. So, while we agree that ground truth can never be discovered by any automatic method, we can deduce statistics of the unknown ground truth. And in many cases, that is better than not having that measurement at all.
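To make the Bayes’ theorem point concrete, here is a toy sketch – not our meter itself. It assumes the annotators err independently (a naive-Bayes assumption we add for illustration) and that their accuracies have already been estimated:

```python
from math import prod

def bayes_vote(votes, accuracies, prior=0.5):
    """Posterior probability that the true label is 1, given each
    annotator's 0/1 vote and an estimate of its accuracy.
    Assumes annotators err independently (a naive-Bayes assumption)."""
    p1 = prior * prod(a if v == 1 else 1 - a for v, a in zip(votes, accuracies))
    p0 = (1 - prior) * prod(a if v == 0 else 1 - a for v, a in zip(votes, accuracies))
    return p1 / (p1 + p0)

# Majority vote says 1, but the lone dissenter is far more accurate:
print(bayes_vote([1, 1, 0], [0.6, 0.6, 0.99]))  # ~0.02, i.e. label 0
```

Majority vote would pick label 1 here; weighting the votes by estimated accuracy picks label 0, and the posterior doubles as the per-instance uncertainty estimate mentioned above.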


How to tell hammers from spammers in Amazon Mechanical Turk

Reasoning about probability is hard. God knows I have made my share of mistakes. A recent paper by Karger, Oh, and Shah on Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems offers another example: it claims that spamming workers in a crowdsourcing platform like Amazon Mechanical Turk would be 50% correct. This is incorrect. They could be anywhere from 0 to 100 percent correct. Accuracy cannot distinguish shirking workers from conscientious ones on these platforms, but our technology can.

Let me first explain with some simple math why accuracy is not a sign of honest work. We all make mistakes. So we all accept that honest workers will also make them. Suppose that you now send out a crowdsourcing task that consists of deciding if an Internet photograph is appropriate for children. This is a two-label task: yes or no. The results come back and worker A gets it 85% correct and worker B gets it 80% correct. Who is the spammer? Are they both conscientious? Well, there is not enough information in what I just said to answer either of these questions.

To understand this, let me make up a possible scenario. The task you sent out consists of 85% age appropriate photos. A spammer could spend a little time noticing that most of the photographs trend this way and just proceed to answer that they are all age appropriate. Their final accuracy would be 100% * 85% = 85%.

An honest worker on the same data could be 100% accurate on the age inappropriate photos and 76.4% correct on the appropriate ones. This would make their overall performance 80% accurate ( 85%*76.4% + 15%*100% ). The honest worker is less accurate than the spammer on this task! Accuracy is NOT a measure of spamming in crowdsourcing!
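The arithmetic in both scenarios is the same prevalence-weighted blend, which a few lines make explicit (the numbers are the ones from the scenario above):

```python
def overall_accuracy(prevalence, acc_appropriate, acc_inappropriate):
    """Accuracy as a prevalence-weighted blend of per-label accuracies."""
    return prevalence * acc_appropriate + (1 - prevalence) * acc_inappropriate

print(overall_accuracy(0.85, 1.000, 0.0))  # spammer: 0.85
print(overall_accuracy(0.85, 0.764, 1.0))  # honest worker: ~0.80
```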

What tells hammers from spammers is how they answer their tasks. You have to look at all of their answers and see how they align. This is not the place to explain how our wonderful technology does this (see our other posts). But you can get an intuitive sense of how it can work by realizing that being honest aligns workers in their answers. To say it tongue in cheek, the truth biases them. In the hypothetical case above, the spammer answered that all of the age inappropriate photos were OK while the honest one did not.
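That alignment is not how our technology works, but a toy agreement statistic on made-up answers already shows the effect:

```python
import numpy as np

def agreement_matrix(answers):
    """Pairwise rate at which workers gave the same answer.
    `answers` is a (workers x tasks) array of 0/1 labels."""
    n = answers.shape[0]
    return np.array([[(answers[i] == answers[j]).mean() for j in range(n)]
                     for i in range(n)])

# Hypothetical answers on ten photos (1 = age appropriate):
answers = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # spammer: all "appropriate"
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],  # honest worker
    [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],  # honest worker
])
print(agreement_matrix(answers))
```

The two honest workers agree with each other 90% of the time, more than either agrees with the spammer, whose agreement with everyone is pinned to the prevalence of appropriate photos.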

Contact us if you are interested in finding out how we can improve your crowdsourcing platform.


The Marriage Lemma again

I picked up Algebraic Statistics this morning and discovered a simple proof of the Marriage Lemma on the first page of the Introduction. The book, by Pistone, Riccomagno and Wynn, dates to 2001. The subtitle describes its mathematical bent well: “Computational Commutative Algebra in Statistics”. Basically, they show that polynomial algebra problems abound in statistics, and the book explains the machinery from algebraic geometry that can be used to gain insight into traditional statistical techniques.

The problem they start with in the Introduction is the “simplest example possible”, as they say. Here is their description:

Suppose that two people (small enough) stand together on a bathroom scale. Our model is that the measurement is additive, so that if there is no error, and $\theta_1$ and $\theta_2$ are the two weights, the reading should be
$$
Y = \theta_1 + \theta_2
$$
Without any other information it is not possible to estimate, or compute, the individual weights $\theta_1$ and $\theta_2$. If there is an unknown zero error $\theta_0$ then $Y=\theta_1 + \theta_2 + \theta_0$ and we are in worse trouble.

This simple linear algebra problem (and its solution) is the core of the Marriage Lemma: two recognizers alone can never tell how accurate either of them is. The weighing problem described in the book is equivalent if we think of accuracies as weights. Let me explain.

Clearly, if we are forced to weigh two people together whenever we use the scale, a single pair could never determine their individual weights. To state it explicitly: the single linear equation $Y=\theta_1 + \theta_2$, for which we have only one measurement, cannot be solved for the two unknown weights. This is the Marriage Lemma: two people living together will never be able to figure out their individual weights when all they have is this two-at-a-time scale.

But if three people live together, they can do so. This is equivalent to our machine learning claim that three independent recognizers can measure their own accuracies without any knowledge of the correct labels/values for the data.

The following equations apply to the case of finding the weights of three people when you are forced to use a two-person scale:
\begin{align}
Y_{12} &= \theta_1 + \theta_2 \\
Y_{13} &= \theta_1 + \theta_3 \\
Y_{23} &= \theta_2 + \theta_3
\end{align}
You weigh the three possible pairs, and you get three linear equations in three unknowns. Assuming the scale is perfect, this system always has a solution. When noise is present in the scale, you minimize the sum of squared residuals of the equations.
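Subtracting pairs of equations gives the closed-form solution
$$
\theta_1 = \frac{Y_{12} + Y_{13} - Y_{23}}{2}, \quad
\theta_2 = \frac{Y_{12} + Y_{23} - Y_{13}}{2}, \quad
\theta_3 = \frac{Y_{13} + Y_{23} - Y_{12}}{2}.
$$
A minimal numpy sketch covers both the exact and the noisy case (the readings are made up for illustration):

```python
import numpy as np

# Made-up pair readings for illustration: Y12, Y13, Y23
y = np.array([290.0, 310.0, 320.0])

# Each row marks which two people stood on the scale together
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# lstsq returns the exact solution for a perfect scale and the
# least-squares fit when the readings are noisy
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print(weights)  # [140. 150. 170.]
```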

[update]: The case of a two-person scale is not as far-fetched as it sounds. It would apply to a scale that only records weights above some non-zero value. For example, take a scale that only gives readings above 200 pounds. A group of three people, each weighing between 100 and 200 pounds, could then use this algorithm to figure out their individual weights.