UMass Amherst Linguistics Department Colloquium
February 9, 2006
WHISC 4.3

Lexical entropy, finite state optimality, and learning from surface forms alone

Jason Riggle
University of Chicago

The problem of ranking Optimality Theoretic constraints in a fashion consistent with a training sample composed of ⟨input, output⟩ pairs has been solved with a variety of algorithms (e.g. Tesar & Smolensky 2002, Boersma & Hayes 1999). The real-world problem of learning from output forms alone, however, still presents many challenges. Chief among these is that a given set of surface forms can be consistent with a range of ⟨possible-input, possible-grammar⟩ pairs. While knowledge of meaning and morphology can help adjudicate among the hypotheses, several researchers have proposed that properties of the grammar hypotheses themselves can serve as heuristics before such knowledge is available (e.g. Prince & Tesar’s (1999) restrictiveness metric (R-measure) or Smolensky’s (1996) default MARKEDNESS >> FAITHFULNESS ranking).

I propose another strategy for adjudicating among grammars without recourse to morphological information. It is based not on the formal properties of the constraint rankings themselves but on information-theoretic properties of the input set that each candidate grammar assigns to the training sample. If learners choose grammars whose associated input sets have the highest entropy (are least ordered), then they will select grammars that maximally characterize patterns in the training sample as consequences of the grammar rather than as accidents of the lexicon. In this strategy, learners assume that all segment types (unigrams), segment pairs (bigrams), triplets, patterns of comparable complexity, etc. are equiprobable as inputs. This needn’t be true of the mature lexicon, but it represents a null hypothesis that places the onus on the grammar to account for all patterns. This idea is central to Zellig Harris’ work (1942 et seq.) and encodes the same insight as Smolensky’s (1996) Richness of the Base hypothesis, in that grammars with maximally entropic input lexicons impose the least structure on those inputs.

To test this learning strategy, the grammar model must be such that the learner can invert the grammar and, for each observed surface form, generate a set of ⟨input, ranking⟩ pairs to work with. I show how this can be done in finite state models of Optimality Theory by generating for each input the range of optimal ⟨output, ranking⟩ pairs – the contenders – and then working backwards from observed outputs to ⟨input, ranking⟩ pairs. With this, it is possible to evaluate the amount of structure in the lexicon in each ⟨lexicon, grammar⟩ hypothesis. Because the set of such pairs can grow quite large, however, it is untenable to store an entire lexicon with each hypothesis. One simple way to compress the lexical hypotheses is to reduce them to bigram and unigram counts. Though this restricts the detectable structure to local patterns, it is adequate for many phonological phenomena. The unigram/bigram counts can then, at any point, be cashed out as an estimate of the entropy of the lexicon.
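
The compression step can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes that segments are single characters, that `#` marks word edges, and that entropy is estimated as the plain Shannon entropy of the relative frequencies. The names `ngram_counts` and `entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def ngram_counts(inputs):
    """Compress a hypothesized input lexicon into unigram and bigram counts.

    Assumption: each character is one segment; '#' pads the word edges
    so that edge-adjacent segments also appear in bigrams.
    """
    unigrams, bigrams = Counter(), Counter()
    for form in inputs:
        padded = "#" + form + "#"
        unigrams.update(form)                      # count each segment
        bigrams.update(zip(padded, padded[1:]))    # count adjacent pairs
    return unigrams, bigrams


def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Under this sketch, a hypothesized input set such as `["ab", "ba"]` yields a higher bigram entropy than `["aa", "aa"]`, so a learner using the heuristic would prefer the grammar paired with the former.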

I illustrate the proposal with two case studies involving constraints on syllable structure. In the smaller of the two studies, using lexical entropy as a heuristic to pick grammars always leads to the ‘right’ hypothesis once the learner has observed enough data. In the larger of the two studies, the number of viable hypotheses can grow into the thousands, so the learner must retain only the n best (most entropic) hypotheses. In this latter case the right hypothesis can be swamped by highly entropic alternatives that allow a subset of the surface forms allowed by the correct hypothesis. Though the lexical entropy heuristic leads to errors in such cases, the errors are exactly the kinds of overgeneralizations to subset languages that the learner can recover from upon observing further data. What the case studies show is that, if the learner is able to invert the grammar, the heuristic of lexical entropy can help the learner pick a hypothesis that maximally encodes observed patterns, and that even when it leads the learner astray, the resulting errors are of this recoverable kind.
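
Retaining only the n best hypotheses can be sketched as a simple pruning step. This is an illustrative assumption about how such pruning might look, not the procedure from the case studies: each hypothesis is represented here as a hypothetical `(label, entropy_estimate)` pair, and `prune_hypotheses` is an invented name.

```python
import heapq


def prune_hypotheses(hypotheses, n):
    """Keep the n hypotheses whose lexicons have the highest entropy estimate.

    hypotheses: iterable of (label, entropy_estimate) pairs (assumed shape).
    Returns the survivors ordered from most to least entropic.
    """
    return heapq.nlargest(n, hypotheses, key=lambda h: h[1])
```

With a small n, a correct but less entropic hypothesis can be dropped at this step, which is the swamping effect described above.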