In our previous posts, we have been talking about unsupervised inference of prevalence statistics for data: given a labeled collection of data, how often does label $\ell$ appear? For example, in a DNA sample, how often do we see a particular nucleotide? How often does a particular face appear in a video sample? Sometimes, however, these simple statistics are not enough. We may want to know more than point statistics.
This is particularly so in the case of sequential data, e.g. the output of a speech recognizer. In such cases we would want to answer questions like: how often does the word pair “New York” get recognized correctly? Answering this requires inferring two things about the recognizers transcribing the audio: the prevalence of the phrase in the audio and their ability to output such a pair.
In other words, we now want the prevalence of label pairs in the data and the recognizers' related conditional recognition probabilities. So we would be interested in quantities like the pair prevalence
$$p(\ell_i,\ell_j)$$
and
$$p_{r}(\ell_k, \ell_l \mid \ell_i, \ell_j),$$
the probability that recognizer $r$ outputs the label pair $(\ell_k, \ell_l)$ when the true pair is $(\ell_i, \ell_j)$.
The way to infer these statistics is to apply the same frequency-counting argument we used for point statistics to the label-pair events associated with these second-order statistics. So to infer something about the probability of labels following each other in a sequential data record, we just count the frequencies of the $L^R \cdot L^R = L^{2R}$ possible events that occur in the labeling output of the recognizers.
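As a minimal sketch of this counting step, consider a toy setup (the data and alphabet below are hypothetical, not from the original posts): $R = 2$ recognizers label the same sequence over $L = 2$ labels, and we tally how often each joint label-pair event occurs at consecutive positions.

```python
from collections import Counter

# Hypothetical toy data: R = 2 recognizers, each outputs a label sequence
# over the same items, with labels drawn from {"a", "b"} (L = 2).
recognizer_outputs = [
    ["a", "b", "b", "a", "a"],   # recognizer 1
    ["a", "b", "a", "a", "b"],   # recognizer 2
]

# A second-order event is the R-tuple of label pairs produced at
# consecutive positions (i, i+1): one of (L^2)^R = L^(2R) outcomes.
pair_events = Counter(
    tuple((seq[i], seq[i + 1]) for seq in recognizer_outputs)
    for i in range(len(recognizer_outputs[0]) - 1)
)

# Normalize counts into observed event frequencies.
total = sum(pair_events.values())
frequencies = {event: count / total for event, count in pair_events.items()}
```

These observed frequencies are the raw material the unsupervised inference works from; with two labels and two recognizers there are $2^{4} = 16$ possible joint pair events, of which only a few occur in so short a sequence.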
Considerations similar to those we have discussed for the case of point statistics show that a necessary condition for this unsupervised inference problem to be solvable is that the number of independent observable frequencies must be at least the number of unknowns: the $L^2 - 1$ pair prevalences plus the $R \cdot L^2(L^2 - 1)$ conditional recognition probabilities. The number of recognizers must therefore satisfy
$$L^{2R} - 1 \ge (L^2 - 1)(R\,L^2 + 1).$$
The greater the gap between the two sides of this inequality, the greater our ability to also solve the problem with correlated recognizers. Furthermore, since the pair can be taken at any two positions, we could repeatedly collect statistics of labels one apart, two apart, etc. to build an auto-correlation function in cases where that makes sense for the data.
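A quick way to explore this condition is to compute the gap directly. The sketch below (function name is ours, for illustration) scans for the smallest $R$ that satisfies the pair inequality at a given $L$:

```python
def pair_condition_gap(L, R):
    """Gap in the necessary pair condition:
    (L^(2R) - 1) - (L^2 - 1)*(R*L^2 + 1).
    Non-negative means the condition is met."""
    return (L ** (2 * R) - 1) - (L ** 2 - 1) * (R * L ** 2 + 1)

# With L = 2 labels, scan for the smallest R meeting the condition.
L = 2
R = 1
while pair_condition_gap(L, R) < 0:
    R += 1
```

For a binary alphabet this search ends at $R = 3$: three recognizers already give $2^{6} - 1 = 63$ independent frequencies against $39$ unknowns.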
But the magic does not stop there. In principle, we could calculate any $n$-th order statistic by looking at the frequency of the $(L^R)^n = L^{nR}$ recognition events, and the minimum number of recognizers needed to carry out unsupervised inference would have as a necessary condition (not sufficient!) the following inequality:
$$L^{nR} - 1 \ge (L^n - 1)(R\,L^n + 1).$$
The only practical limit would be the maximum number of variables that your global non-linear optimizer can handle. Given that the current state of the art in that technology is on the order of thousands of variables, second- and third-order statistics would be possible to obtain.
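To make that practicality claim concrete, the general $n$-th order condition and the optimizer's variable count can be sketched as follows (function names are ours, for illustration):

```python
def nth_order_gap(L, R, n):
    """Gap in the necessary condition L^(nR) - 1 >= (L^n - 1)*(R*L^n + 1)."""
    return (L ** (n * R) - 1) - (L ** n - 1) * (R * L ** n + 1)

def min_recognizers(L, n, R_max=100):
    """Smallest R meeting the necessary condition, by direct search."""
    for R in range(1, R_max + 1):
        if nth_order_gap(L, R, n) >= 0:
            return R
    return None

def num_unknowns(L, R, n):
    """Variables the optimizer must solve for: the L^n - 1 prevalences
    plus the recognizers' conditionals, i.e. (L^n - 1)*(R*L^n + 1)."""
    return (L ** n - 1) * (R * L ** n + 1)
```

For example, with $L = 2$ and $n = 3$, three recognizers already satisfy the necessary condition, and the resulting $175$ unknowns are comfortably within the reach of solvers handling thousands of variables.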