Tuesday, April 6, 2010

How many words do we need? Corpora, frequency and language learning

This is a good continuation of my previous posts.

Corpus size, word frequency and extensive reading

Corpora comparison by frequency

Word frequency and incidental learning

Native speaker language input

The inescapable case for extensive reading

The nicest thing about it is that I didn't have to work on this one myself.

In linguistics, a language corpus is a machine-searchable collection of examples of written and spoken language use. Corpus linguistics aims to discover patterns of authentic language use through analysis of actual usage. A good corpus will give a snapshot of modern language use. This will eventually result in new, comprehensive corpus-based reference grammars, textbooks and dictionaries. Teachers may improvise their own corpora for particular purposes. Yawn. Anyway, that's a short intro.

I'm mostly looking at corpora as adequate, accessible and entertaining snapshots of language use, and I am concerned with word frequency as a factor in practical vocabulary and grammar acquisition. So: a corpus as an entertaining snapshot of language, large enough to provide enough repetitions of the most frequent items across a variety of fields and to facilitate natural acquisition. Or should we grab a dictionary and a grammar and plough through a smaller sample? Or maybe both? Obviously, learning something involves internalizing it; you do not want to simply end up with a passive version of a native speaker's active vocabulary. How many repetitions does it take for a word to enter active vocabulary? Is actual language use more important than input at this stage?

But I digress.

A note on corpus and lexicon size
Johanna Nichols, UC Berkeley



How much material is needed for minimal, normal, or optimal documentation of a language? How large should a text corpus be?

Cheng 2000, 2002 (and earlier works) shows that, for both English and Chinese, any given author uses a maximum of about 4000-8000 lemmatized words. A book of about 100,000 running words of text reaches this maximum in some cases. Cheng quotes Francis & Kucera 1982 to the effect that the Brown corpus (1,000,000 running words) contains over 30,000 lemmatized words, of which just under 6000 occur more than 5 times.

Cheng's interpretation is that 8000 words is about the maximum for an individual's actively known vocabulary. Of the authors he has surveyed, only Shakespeare reaches an exceptional 10,000 words. Other sources cite vocabulary sizes ranging up to 80,000 words for the individual, but this is passive vocabulary, some of it known only in the sense of being understood in context. Cheng 2000 shows that, from 93 BC to the present, the size of Chinese dictionaries increases regularly but the size of the individual author's vocabulary remains at a constant 4000-8000 characters.
Cheng's results suggest a measure of adequacy for lexical documentation: it should reach the range of an individual's active vocabulary, and it should be compiled from extensive enough materials to include the entire active vocabulary for at least one good speaker and preferably several good speakers.
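
The kind of count Cheng reports (distinct words in a text, and how many of them recur more than 5 times) can be sketched in a few lines. This is only a rough illustration, not Cheng's method: it counts lowercase wordforms rather than true lemmas, and the function name `vocabulary_profile`, the tokenizing pattern and the toy sample are all assumptions made for the example.

```python
import re
from collections import Counter

def vocabulary_profile(text, min_count=6):
    """Count running words, distinct word types, and types seen at least min_count times.

    Assumption: a "word" is a lowercase alphabetic string (internal apostrophes
    allowed). This is a crude stand-in for real lemmatization.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(tokens)
    frequent = sum(1 for c in counts.values() if c >= min_count)
    return {
        "running_words": len(tokens),
        "types": len(counts),
        "frequent_types": frequent,
    }

# Toy sample; a real test would use a book-length text of about 100,000 words.
sample = "the cat sat on the mat and the dog sat on the log " * 3
profile = vocabulary_profile(sample, min_count=6)
print(profile)
```

With `min_count=6` the last figure corresponds to "occurs more than 5 times". Run over a single long book, the `types` figure is the rough analogue of the 4000-8000 range discussed above, with the caveat that unlemmatized counts will come out higher.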

I note that a fair amount of Cheng's English corpus appears to be literature for young adults. The sources with the higher lemma figures are writers writing for a full-fledged audience, e.g. Mark Twain; nonfiction writers appear to mostly fall in here. So Cheng's figure of 8000 may be a minimum, in that inclusion of more varied genres would almost certainly expand it. Also, the figure probably includes distinctly fewer technical terms than the average user knows actively. Finally, what Cheng surveys in this paper is not the given author's whole oeuvre but just one large work or (e.g. for Mark Twain) a collection of short works. That said, evidently one needs close to 100,000 words per individual to have any chance of capturing that individual's entire active lexical range. That would be about 17 real-time hours of recorded speech.

Running words of text

The Uppsala corpus of Russian (1,000,000 words) yields zero or near-zero frequencies of some morphological forms. Timberlake 2004:6 searched for the two attested instrumental singular forms of Russian tysjacha 'thousand' and got zero returns for the less common one, while searching the entire Internet returned thousands of hits for each.

Therefore, a desideratum for corpora to be used for close syntactic work would be at least a million words, preferably at least ten million. Ten thousand words or even fewer will suffice to attest the basic patterns. However, anything at all - even just a few sentences - is enormously valuable.

Monson et al. 2004 measure the rate at which new wordforms show up in running text and find that, for the polysynthetic language Mapudungun, there is no fall-off in the steep rate of increase of wordforms even after 1,000,000 running words. In the Spanish translation of this corpus, by contrast, the rate is much flatter.
Thus it appears that the number of words of running text needed for work on inflection and other morphology varies with the inflectional complexity of a language.
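
The measurement described above, the rate at which new wordforms keep appearing as the text runs on, can be approximated with a simple growth curve. This is a minimal sketch under stated assumptions, not the procedure Monson et al. used: tokens are taken as given, and the function name `growth_curve` and the toy token list are made up for illustration.

```python
def growth_curve(tokens, step):
    """Return (running_words, distinct_wordforms) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy input; a real comparison would run this over two corpora of similar size.
toy = "a b a c a b d a b c a e".split()
print(growth_curve(toy, step=4))
```

A curve that keeps climbing steeply would correspond to the Mapudungun result, and one that flattens out early to the Spanish translation.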

4. Real-time value of corpus sizes

Based on the Berkeley Ingush corpus... and on the corpus size reported by Monson et al. 2004, I calculate that an hour of transcribed recorded speech contains about 6000 words. (The figure might be somewhat higher for languages with less inflected and therefore shorter words.)

(Comment: this matches my estimates for some TV sitcoms)

Transcribed recorded hours needed at this rate for various corpus sizes:

1 million words (Brown size): 170 hours
10 million words: 1,700 hours
100 million words (BNC size): 17,000 hours
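
The conversion behind these figures is a single division by the estimated 6000 words per transcribed hour. A small sketch, in which the constant and function names are my own and the rate is the estimate quoted above rather than a measured value:

```python
WORDS_PER_HOUR = 6000  # estimate quoted in the text; varies by language

def hours_needed(running_words, words_per_hour=WORDS_PER_HOUR):
    """Transcribed recording hours needed to reach a given corpus size."""
    return running_words / words_per_hour

for size in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{size:>11,} words: about {hours_needed(size):,.0f} hours")
```

The unrounded results are about 17, 167, 1,667 and 16,667 hours; the list above rounds the last three to 170, 1,700 and 17,000.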

Recommended corpus sizes in running words.

Figures recommended here are for quality recordings, transcribed, glossed, and adequately commented -- that is, provided with fluent speaker judgments on the meaning of the material and the identity of the lexical items, and additional judgments on the kind of question that is likely to arise as a linguist works on the material.

Minimal documentation: Something like 1000 clauses excluding those with the most common verb (if any verb is substantially more common than others, as 'be' is in medieval Slavic texts). To be safe, 2000 clauses (this more than provides for excluding the most common verb). This would be several thousand to ten thousand running words. This appears to be minimally adequate for capturing major inflectional categories and major clause types, in moderately synthetic languages; for a highly synthetic or polysynthetic language more material is needed.

Basic documentation: About 100,000 running words, which appears to be the threshold figure adequate for capturing the typical good speaker's overall active vocabulary.

Good documentation: A million-word corpus. 150-200 hours of good-quality recorded text, up to about 20 hours per speaker, from a variety of speakers on a variety of topics in a variety of genres.

At 20 hours/speaker this is 10 speakers. Also, by Cheng's criteria, 100,000 words/speaker is 10 speakers for a million-word corpus. In reality, though, it is highly desirable to get more than 10 speakers (and also highly desirable to get the full 20 hours or 100,000 words from each of several speakers).

Excellent documentation: At least an order of magnitude larger than good; i.e. at least 10,000,000 words (1500-2000 recorded hours).

Full documentation: The sobering examples of the research experiences of Timberlake and Ruppenhofer (mentioned above) show that even 100,000,000 words is at least an order of magnitude too small to capture phenomena that, though of low frequency, are in the competence of ordinary native speakers. That would represent at least 20,000 recorded hours, and it is too low by an order of magnitude.

Assuming that a typical speaker hears speech for about 8 hours per day, the typical exposure is around 3000 hours per year. Assuming that full ordinary linguistic competence (i.e. not highly educated competence but ordinary adult lexical competence) is reached by one's mid-twenties, that would represent 75,000 hours. For written languages, add to that some unknown amount representing reading. Extraordinary linguistic competence -- that of a genius like Shakespeare or a highly educated modern reader -- requires wide reading, attentive listening to a wide range of selected good speakers, and a good memory.

On these various criteria it would take well over a billion (a thousand million) running words, and over 100,000 carefully chosen recorded hours, to just begin to approach the lifetime exposure of a good young adult speaker. Unfortunately, field documentation cannot hope to reach these levels.
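
The exposure arithmetic in the last two paragraphs can be checked with a short calculation. The inputs are assumptions for the sketch: 8 hours a day for 365 days a year, 25 years as a reading of "mid-twenties", and the 6000 words per hour estimated earlier. The function name `lifetime_exposure` is hypothetical.

```python
HOURS_PER_DAY = 8       # listening time assumed in the text
YEARS = 25              # assumption: "mid-twenties" taken as 25 years
WORDS_PER_HOUR = 6000   # transcription-rate estimate quoted earlier

def lifetime_exposure(hours_per_day=HOURS_PER_DAY, years=YEARS,
                      words_per_hour=WORDS_PER_HOUR):
    """Return (hours per year, total hours, total words heard)."""
    hours_per_year = hours_per_day * 365
    total_hours = hours_per_year * years
    total_words = total_hours * words_per_hour
    return hours_per_year, total_hours, total_words

hours_per_year, total_hours, total_words = lifetime_exposure()
print(f"{hours_per_year:,} hours/year, {total_hours:,} hours, {total_words:,} words")
```

This gives 2,920 hours a year and 73,000 hours in total, in line with the rounded 3000 and 75,000 above. Spoken input alone at this rate comes to a bit under half a billion words, so the larger figure in the text presumably rests on the additional criteria it mentions, such as reading.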


marc said...

Hi Reineke,

You can quite easily get a decent corpus of "spoken" language by analysing film and television subtitles. We have done so for English, French, Dutch, Spanish, and Chinese. The frequency measures predict word reading times better than frequency estimates on the basis of books. You find the English data (including an article that may interest you) on:



reineke said...

Hi Marc

I mentioned subtlexus in the previous posts. I even arranged a boxing match: Subtlexus vs Brown :)

Subtlexus is certainly a great resource.