Average WER without any correct transcripts: the large $V$ algorithm

The vocabulary size of some recognizers can be quite large. Dictation speech recognition systems can recognize on the order of 50 thousand words. Carrying out an unsupervised inference calculation for such a large number of words/labels would be prohibitive. With $V=50,000$, one would have to fit about $10^{14}$ frequency terms when considering the output of three recognizers – an impossible task.

Granted, the effective number of recognition events would be much smaller since most vocabulary confusions never occur. A speech recognizer typically confuses a given word with only 20 or so others. This would result in about $C \cdot 20^R$ expected patterns, where $C$ is the number of confusability classes and $R$ the number of recognizers. For three recognizers and thousands of confusability classes, this would still require us to fit millions of frequencies!
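To make the counting concrete, here is the arithmetic under those assumptions (a 50,000-word vocabulary, confusion sets of about 20 words, and, as a hypothetical, $C = 1{,}000$ confusability classes):
$$V^3 = 50{,}000^3 \approx 1.25 \times 10^{14} \quad \text{versus} \quad C \cdot 20^R = 1{,}000 \cdot 20^3 = 8 \times 10^6.$$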

The way to solve this problem is to abandon label-specific knowledge and settle for the average WER. In this post we will show that it is possible to do so by turning any large $V$ unsupervised inference problem into one with $(R+1)$ labels. In practice, we have not given up much. Most development is driven by average measures like this, and it is a reasonable development strategy whenever the costs associated with errors are roughly constant. I have sat in many meetings where the decision to deploy one speech recognizer versus another was based solely on the average WER ranking.

Let’s start with the basics. We define the word error rate of a single recognizer for a specific label, $\ell_t$, as
$$\sum_{\ell \ne \ell_t} p(\ell | \ell_t).$$
This is easier to express in terms of the accuracy for the label,
$$1 - p(\ell_t | \ell_t).$$
The recognizer’s average WER is the sum of the individual WERs weighted by the prevalence of the labels,
$$\sum_{\ell_t \in \mathcal{L}} \left(\sum_{\ell \ne \ell_t} p(\ell | \ell_t) \right) p(\ell_t).$$
Or again, in terms of average label accuracy,
$$1 - \sum_{\ell_t \in \mathcal{L}} p(\ell_t | \ell_t)\, p(\ell_t).$$
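To ground these two formulas, here is a small Python sketch with a hypothetical three-word vocabulary; the confusion matrix and prevalences below are made-up numbers chosen purely for illustration:

```python
import numpy as np

# Hypothetical 3-word vocabulary. Row t, column l holds p(l | l_t),
# the probability the recognizer outputs word l when the truth is l_t.
conf = np.array([[0.90, 0.05, 0.05],
                 [0.10, 0.80, 0.10],
                 [0.20, 0.20, 0.60]])
prevalence = np.array([0.5, 0.3, 0.2])  # p(l_t) for each true label

# Average accuracy: each label's self-recognition probability weighted
# by how often the label occurs. Average WER is its complement.
avg_accuracy = float(np.sum(np.diag(conf) * prevalence))
avg_wer = 1.0 - avg_accuracy
print(f"{avg_wer:.2f}")  # 1 - (0.45 + 0.24 + 0.12) = 0.19
```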

As always, it is necessary, and easiest, to first talk about the case of independent recognizers. We assume that we have the aligned output of multiple such recognizers. We will use three recognizers to illustrate the algorithm, although when one compares the number of inference equations to the number of parameters needed to capture the statistics, at least $R=5$ recognizers are required to solve the task. The table below shows a hypothetical sample of their output:

| Recognizer 1 | Recognizer 2 | Recognizer 3 | truth |
|---|---|---|---|
| The | Day | The | ? |
| quick | quick | quick | ? |
| lawn | brown | brown | ? |
| sox | pox | lox | ? |

We begin the transformation by replacing all the output of an arbitrary recognizer with the abstract label $\alpha$. We’ll pick the first recognizer, so the recognizer output now looks like this:

| Recognizer 1 | Recognizer 2 | Recognizer 3 | truth |
|---|---|---|---|
| $\alpha$ | Day | The | ? |
| $\alpha$ | quick | quick | ? |
| $\alpha$ | brown | brown | ? |
| $\alpha$ | pox | lox | ? |

We proceed by transforming a second recognizer’s output; we choose the second one. Whenever its output agrees with the first recognizer’s, we place an $\alpha$ label. If its output disagrees, we place the $\beta$ label. The output now looks like this:

| Recognizer 1 | Recognizer 2 | Recognizer 3 | truth |
|---|---|---|---|
| $\alpha$ | $\beta$ | The | ? |
| $\alpha$ | $\alpha$ | quick | ? |
| $\alpha$ | $\beta$ | brown | ? |
| $\alpha$ | $\beta$ | lox | ? |

Finally, we transform the output of the third recognizer. Three possible labels can be assigned. The label $\alpha$ is assigned if its output agrees with the first recognizer’s. If not, $\beta$ is assigned if it agrees with the second one’s. And finally, a $\gamma$ label is assigned if its output does not agree with any of the previously transformed recognizers. The final state of the table is now:

| Recognizer 1 | Recognizer 2 | Recognizer 3 | truth |
|---|---|---|---|
| $\alpha$ | $\beta$ | $\alpha$ | ? |
| $\alpha$ | $\alpha$ | $\alpha$ | ? |
| $\alpha$ | $\beta$ | $\beta$ | ? |
| $\alpha$ | $\beta$ | $\gamma$ | ? |
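In code, the whole transformation is a one-pass relabeling of each aligned position. Here is a minimal Python sketch (the function name and the spelled-out label names are mine, not from any library): each output gets the abstract label of the earliest output it matches, or the next unused label.

```python
ABSTRACT = ["alpha", "beta", "gamma", "delta", "epsilon"]  # extend for more recognizers

def abstract_labels(row):
    """Map one aligned position (a tuple of recognizer outputs) to
    abstract agreement/disagreement labels."""
    labels, seen = [], []  # seen: distinct outputs in order of first appearance
    for word in row:
        if word not in seen:
            seen.append(word)
        labels.append(ABSTRACT[seen.index(word)])
    return labels

rows = [("The", "Day", "The"), ("quick", "quick", "quick"),
        ("lawn", "brown", "brown"), ("sox", "pox", "lox")]
for row in rows:
    print(abstract_labels(row))
# ['alpha', 'beta', 'alpha']
# ['alpha', 'alpha', 'alpha']
# ['alpha', 'beta', 'beta']
# ['alpha', 'beta', 'gamma']
```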

The crucial point is that the same algorithm can be extended to the unknown “truth” column. The correct output may agree with some of the recognizers or it may not. So we need to introduce a fourth abstract label, $\delta$, to capture the case where the truth disagrees with all three recognizers. For our example, the last column would look as follows:

| Recognizer 1 | Recognizer 2 | Recognizer 3 | truth |
|---|---|---|---|
| $\alpha$ | $\beta$ | $\alpha$ | $\alpha$ |
| $\alpha$ | $\alpha$ | $\alpha$ | $\alpha$ |
| $\alpha$ | $\beta$ | $\beta$ | $\beta$ |
| $\alpha$ | $\beta$ | $\gamma$ | $\delta$ |

But, of course, we do not know the last column because the ground truth is not available to us. If we did, we could immediately calculate the WER! But we have succeeded in mapping any large $V$ recognition task to an abstract $(R+1)$-label problem. Since we have an unsupervised inference algorithm for that task, we can calculate the average WER no matter how large $V$ becomes.
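Before moving on, a quick supervised sanity check (using the made-up truth column above and the `abstract_labels` sketch from earlier) shows why the mapping loses nothing for average WER: a recognizer’s word matches the truth exactly when their abstract labels match, so the word-level WER and the abstract-label WER coincide.

```python
rows = [("The", "Day", "The"), ("quick", "quick", "quick"),
        ("lawn", "brown", "brown"), ("sox", "pox", "lox")]
truth = ["The", "quick", "brown", "fox"]

# Relabel each row with its truth word appended; lab[-1] is the truth label.
labeled = [abstract_labels(row + (t,)) for row, t in zip(rows, truth)]
for r in range(3):
    wer_words = sum(row[r] != t for row, t in zip(rows, truth)) / len(rows)
    wer_labels = sum(lab[r] != lab[-1] for lab in labeled) / len(rows)
    assert wer_words == wer_labels
    print(f"recognizer {r + 1}: average WER = {wer_words:.2f}")
# recognizer 1: average WER = 0.50
# recognizer 2: average WER = 0.50
# recognizer 3: average WER = 0.25
```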

For the rest of the post, we will assume that the unsupervised inference calculation has been performed and that the single recognizer marginals are available to us (the recognizers don’t need to be independent!). Let us look at how we would calculate the average WER for the first recognizer.

Its WER is given by the sum of the prevalences of all but the $\alpha$ label:
$$\sum_{\lambda \ne \alpha} p(\lambda)$$
Why? Since our transformation of the recognition output substituted all of its entries with $\alpha$, its conditional recognition probabilities for the abstract labels must be of the form
$$\{ p_{1}(\alpha | \lambda) =1 \}_{\lambda \in \Lambda},$$
where $\Lambda$ represents the collection of abstract labels used to transform the recognition outputs. Therefore its average WER is
\begin{align}
\sum_{\lambda \ne \alpha} p_{1}(\alpha | \lambda)\, p(\lambda) &= \sum_{\lambda \ne \alpha} 1 \cdot p(\lambda) \\
&= \sum_{\lambda \ne \alpha} p(\lambda)
\end{align}
It is easier to express this last expression in terms of the average accuracy of the recognizer,
$$1 - \sum_{\lambda \in \{\alpha\}} p_{1}(\lambda | \lambda)\, p(\lambda) = 1 - p(\alpha).$$
In general, for recognizer $r$ the average WER is given by
$$1 - \sum_{\lambda \in \Lambda_r} p_{r}(\lambda | \lambda)\, p(\lambda),$$
with the set $\Lambda_r = \{\lambda_1, \lambda_2, \ldots, \lambda_r \}$ representing the $r$ labels that can appear in recognizer $r$’s output.
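A minimal sketch of this final step, assuming the unsupervised inference has already produced the abstract-label prevalences and the per-recognizer self-agreement marginals (all numbers below are hypothetical stand-ins, not inferred values):

```python
# Hypothetical abstract-label prevalences p(lambda) from the inference step.
p_lambda = {"alpha": 0.55, "beta": 0.25, "gamma": 0.15, "delta": 0.05}

# Hypothetical self-agreement marginals p_r(lambda | lambda). Recognizer r
# can only emit the first r abstract labels; recognizer 1 always emits
# alpha, so p_1(alpha | alpha) = 1 exactly.
p_self = [
    {"alpha": 1.0},
    {"alpha": 0.90, "beta": 0.95},
    {"alpha": 0.85, "beta": 0.70, "gamma": 0.90},
]

for r, marginals in enumerate(p_self, start=1):
    accuracy = sum(marginals[lam] * p_lambda[lam] for lam in marginals)
    print(f"recognizer {r}: average WER = {1.0 - accuracy:.3f}")
```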

The key is that we have transformed the average WER, which involves events over the correct labels, into an expression involving labels that capture the pattern of agreement and disagreement between the recognizers. The average WER can be calculated because the large space of recognition events over $V$ labels can be partitioned into a much smaller space whose abstract labels record only the agreements and disagreements.

Pretty nifty, eh! Another reason why our company motto is “Sufficiently Advanced Technologies”. Magic is possible when you give up specific details for averaged ones.