Sunday, April 26, 2009

corpus size, word frequency and extensive reading

A corpus is assembled by collecting examples of written and spoken language. It is a descriptive sample of the language, not a prescriptive one. Modern dictionaries are created using such resources and are a distillation of the data. A national corpus, created by linguists, can contain several hundred million words and may use different types of annotation: metatextual (information about the text), morphological, accentual and semantic. Such corpora are nowadays mostly stored as texts in digital form for easy reference. A spoken language corpus may also be stored as audio and video.

The British National Corpus (BNC) is a 100-million-word corpus of written and spoken English. It is not the largest, but it is probably the world's most carefully designed and scrupulously annotated corpus. The written texts (90 million words) were taken from a range of fiction and non-fiction domains, usually dating back no earlier than 1975. The spoken samples (10 million words) include material from different contexts and regions produced by speakers of different ages and social backgrounds.
The Cobuild project was set up jointly by Collins (now HarperCollins) and the University of Birmingham. The 20-million word corpus, published in 1987, was of sufficient size to be the basic data resource for numerous reference and teaching works. Renamed the Bank of English in 1991, it now collects over 450 million words.
The German corpus has over 3.3 billion searchable words.
The Corpus del Español is a corpus of 100 million words of the Spanish language. It has been funded by the National Endowment for the Humanities in the USA.
The Corpus of Spoken Israeli Hebrew (CoSIH), which started in 2000 at Tel Aviv University, aims to provide a representative corpus of Hebrew (5 million words) as spoken today by different groups in society, taking into account such factors as sex, age, profession, social and economic background and education.
The Corpus of Spontaneous Japanese contains approximately 650 hours of spontaneous speech of various kinds, recorded between 1999 and 2003.
The Croatian Corpus currently has 101.3 million words.
The Michigan Corpus of Academic Spoken English (MICASE) contains nearly 200 hours (over 1.7 million words) of recorded and transcribed English.

Corpus linguistics is the study of language as expressed in samples (corpora). It is mainly the domain of linguists, language teachers, computational linguists and other NLP (natural language processing) people. Corpus samples are now seen as very important both in linguistics and in language learning, as they expose learners to the kinds of authentic language use that they will encounter in real life situations. Although recognized by educators as a potentially useful tool, until recently corpus application has been limited because the concordance examples retrieved have been deemed too difficult for beginning level learners to understand. From corpus linguistics we have concordance-based teaching and data-driven learning, two examples of technology-enhanced language learning.

Building one’s own language corpus and language learning media library.

I am primarily concerned here with the language corpus as an adequate snapshot of the target language and word frequency as a factor in practical vocabulary and grammar acquisition. Computer searchability is important but secondary. Examples of language usage can easily be looked up from much larger online corpora. A good corpus provides an authentic representation of the target language, and contains enough words to give us a trustworthy sample of clusters and collocations.

An ideal language learner’s library would collect authentic material in which the learner’s attention is directed to the subject matter and away from the form in which it is expressed. It would focus on authentic material even when dealing with tasks such as the acquisition of grammatical structures and lexical items. It would be a representative and well-balanced collection of texts and multimedia items. To be representative of the language as a whole, a corpus should contain samples of all major text types and be in some way proportional to their usage in ‘everyday language’. Large linguistic corpora contain things like transcripts and recordings of “real” telephone conversations and other unpalatable items. While allowing for sufficient representation of “everyday” language, artistic and entertainment value would be a primary concern when putting together a private library. This private corpus would need to be sufficiently large, but also lean and mean. General corpora contain samples of academic, literary and spoken language. Literary sources contain a wealth of idioms and metaphors. According to one study, adding quite a small domain-specific corpus to a large general one considerably boosts performance (common sense, btw).

Corpus size

How large should our corpus be in terms of providing an authentic representation of the target language, and containing enough words to give us a trustworthy sample of clusters and collocations? How large a multimedia library do we need in order to allow users to “pick up” the most frequent vocabulary through extensive reading and listening?

The role of corpora in foreign language and especially ESL teaching is unfortunately mostly limited to “learner corpora”, i.e. collections of texts written by non-native speakers (containing errors etc.). Some teachers develop their own corpora for use in teaching. For ESL, a couple of informed recommendations range from a general corpus of one to two million words (Ball 1996) to one of over ten million words (University of Birmingham).

Compiling monolingual and parallel corpora:

The average literary novel has around 100,000 words, at about 300-400 words per printed page. Words per page may range from 250 in a large-print book to 600 in an academic tome. The average novel length is about 300-400 pages, with 15-20 words per sentence.

Pride and Prejudice 480 pages
Words: 120,528 (29% have more)
Sentences: 5,878 (35% have more)
Words per Sentence: 20.5 (42% have more)

Effi Briest (German) 304 pages
Words: 98,455
Sentences: 5,501
Words per Sentence: 17.9

Anna Karenina (translated) 864 pages
Words: 334,108 (1% have more)
Sentences: 19,345 (2% have more)
Words per Sentence: 17.3

James Clavell's Shogun
Words: 428,978
Sentences: 43,647
Words per Sentence: 9.8

Bravo Two Zero 416 pages
Words: 133,775
Sentences: 10,212 (11% have more)
Words per Sentence: 13.1

Puppet Masters 352 pages
Words: 96,154
Sentences: 8,073
Words per Sentence: 11.9
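
For anyone who wants to produce similar figures for their own texts, here is a minimal sketch in Python. It is not the method behind the numbers above (the source of those counts is not stated); the function name text_stats and both splitting rules are my own assumptions, and a naive sentence splitter like this one will miscount around abbreviations such as "Mr." or "etc.".

```python
import re

def text_stats(text):
    """Return (word_count, sentence_count, words_per_sentence) for a text."""
    # Words: runs of letters/digits, optionally joined by an apostrophe (crude rule).
    words = re.findall(r"[^\W_]+(?:['’][^\W_]+)*", text)
    # Sentences: split after '.', '!' or '?' when followed by whitespace (naive rule).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words_per_sentence = len(words) / len(sentences) if sentences else 0.0
    return len(words), len(sentences), round(words_per_sentence, 1)

sample = "It is a truth universally acknowledged. Is it? He made no answer."
print(text_stats(sample))  # (12, 3, 4.0)
```

Run on a whole novel saved as plain text, the counts should land in the same ballpark as the listed ones, but they will not match exactly because every tool tokenizes a little differently.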

An average literary novel would have around 100k words and 5,000-7,000 sentences. Ten such novels would have one million words and 60,000 sentences.
20-25 "average" novels = 10,000 pages, 120,000 sentences, 2.5 million words.
Henry James' corpus (20 main novels) has some 2.4 million words.
200 such novels = over 1,000,000 sentences and more than 20,000,000 words.
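
The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The per-novel defaults are the rough averages quoted in this post (100k words, and 6,000 sentences as the midpoint of 5,000-7,000); the function names are only illustrative.

```python
import math

def corpus_estimate(novels, words_per_novel=100_000, sentences_per_novel=6_000):
    """Rough size of a corpus built from a given number of 'average' novels."""
    return {
        "novels": novels,
        "words": novels * words_per_novel,
        "sentences": novels * sentences_per_novel,
    }

def novels_needed(target_words, words_per_novel=100_000):
    """How many 'average' novels are needed to reach a target word count."""
    return math.ceil(target_words / words_per_novel)

for n in (10, 25, 200):
    print(corpus_estimate(n))

# The ESL recommendations mentioned earlier: two million and ten million words.
print(novels_needed(2_000_000), novels_needed(10_000_000))
```

With these assumptions a two-million-word corpus is about 20 novels and a ten-million-word one about 100, which is well within reach of a private library.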

Some 150-160 words per minute is the industry standard for most voiceovers and verbatim closed captioning of sitcoms and similar programs. Audiobooks are recommended to be read at 150-160 words per minute, which is the range in which people comfortably hear and vocalize words. In a sample of 318 English-language movies, the average movie had 8,937 running words and a running time of 113.5 minutes.
SUBTLEXUS, a corpus of some 51 million running words, was compiled from 8,388 films and television episodes. The average movie was around 7,000 words.

Going back to the frequency and extensive reading issue, according to several research papers most words need to be encountered more than 20 times in order for the learner to have a significant chance of learning them through exposure. Opinions differ as to the effectiveness of extensive reading. The available data supports the notion that words can be learned incidentally from context. However, researchers disagree as to the effectiveness of this approach – some argue that few new words appear to be learned from this type of exposure, and half of those that are learned are soon lost. Most vocabulary experiments were short in duration and did not therefore take into account constant study and reinforcement.

There are 6,318 words with more than 800 occurrences in the whole 100M-word BNC.
In a corpus of 100 million words, the top 25,000 words would have at least 100 occurrences in English, German, Italian and Russian. In the 18m Cobuild corpus (1986), there were about 19,800 lemmas (potential dictionary headwords) with 10+ tokens. The 120m Cobuild Bank of English corpus (1993) had 45,150, and the 450m Bank of English corpus (2001) had 93,000.
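
Counts like "words with at least N occurrences" are easy to get for a private corpus. The sketch below is only an illustration with made-up helper names (frequency_list, types_with_at_least); note that it counts raw word forms, not lemmas, so it would overstate the number of dictionary headwords unless the text is lemmatized first with a separate tool.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count lower-cased word forms (letters only) in a text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

def types_with_at_least(freqs, min_count):
    """Number of distinct word forms occurring at least min_count times."""
    return sum(1 for count in freqs.values() if count >= min_count)

sample = "the cat sat on the mat and the dog sat on the cat"
freqs = frequency_list(sample)
print(freqs.most_common(3))
print(types_with_at_least(freqs, 2))  # 4: the, cat, sat, on
```

On a real corpus one would call types_with_at_least(freqs, 10) or types_with_at_least(freqs, 100) to get figures comparable in spirit to the ones quoted above.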

In a frequency list based on 29,213,800 words from TV and movie scripts and transcripts, 10,000 words occur more than 77 times. In order to cover 29 million words one would need to watch some 3,000 hours of video. Word no. 8,000, “moss”, is repeated 111 times. Word no. 20,000, “dashboard”, would be heard 24 times. Some 26,000 words in all would get at least 15 repetitions. Word no. 25,991 is “amicable”. Word no. 40,000, “imperious”, would be practiced 6 times. The problem with movie and sitcom scripts is that they contain a lot of words that are never spoken on screen. A 52-minute episode of a US drama shrank from 9,300 words to a little over 4,500 words after all the instructions for the actors were removed. A 22-minute episode of the Addams Family was a little over 2,200 words long after all the extra stuff was removed.
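
The viewing-time estimate depends heavily on the assumed speech rate, so here is a small sketch that makes the assumption explicit. The rates tried are the 150-160 words per minute voiceover standard and the roughly 100 words per minute implied by the Addams Family episode (2,200 words in 22 minutes); the function names are my own, and the scaling assumes words are spread evenly through the material, which real vocabulary is not.

```python
CORPUS_WORDS = 29_213_800  # size of the TV and movie frequency list above

def viewing_hours(total_words, words_per_minute):
    """Hours of video needed to hear total_words at a given speech rate."""
    return total_words / words_per_minute / 60

def expected_repetitions(count_in_corpus, corpus_words, words_heard):
    """Scale a word's corpus count down to a smaller amount of listening."""
    return count_in_corpus * words_heard / corpus_words

def words_for_encounters(target, count_in_corpus, corpus_words):
    """Running words one must hear to expect `target` encounters of a word."""
    return target * corpus_words / count_in_corpus

for wpm in (160, 150, 100):
    print(wpm, "wpm:", round(viewing_hours(CORPUS_WORDS, wpm)), "hours")

# "dashboard" occurs 24 times in the full list; after a sixth of the material:
print(expected_repetitions(24, CORPUS_WORDS, CORPUS_WORDS / 6))
# Running words needed to expect 20 encounters of "dashboard":
print(round(words_for_encounters(20, 24, CORPUS_WORDS)))
```

At 160 words per minute the figure comes out close to the 3,000 hours mentioned above; at the slower rate of actual on-screen dialogue it would be considerably more.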

Some 8,000 word families are sufficient to cover 98% of fiction and 98% of movies (see The Lexical Coverage of Movies). Very high frequency words will be more common in speech than writing (TV). Lower frequency items will be more common in writing than speech, in rough proportion to frequency (literature).
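
Coverage figures of this kind come from a cumulative sum over a frequency list. The following is a toy sketch, not the procedure used in the cited study: the counts are invented, the function names are hypothetical, and the unit counted is whatever keys the Counter (word forms here, whereas the study counts word families).

```python
from collections import Counter

def coverage(freqs, top_n):
    """Share of all running words covered by the top_n most frequent items."""
    total = sum(freqs.values())
    covered = sum(count for _, count in freqs.most_common(top_n))
    return covered / total

def rank_for_coverage(freqs, target):
    """Smallest rank whose cumulative coverage reaches target, or None."""
    total = sum(freqs.values())
    running = 0
    for rank, (_, count) in enumerate(freqs.most_common(), start=1):
        running += count
        if running / total >= target:
            return rank
    return None

# Invented counts, for illustration only.
toy = Counter({"the": 50, "of": 30, "cat": 15, "moss": 4, "amicable": 1})
print(coverage(toy, 2))              # 0.8
print(rank_for_coverage(toy, 0.98))  # 4
```

Applied to a real frequency list, rank_for_coverage(freqs, 0.98) gives the point at which 98% coverage "cuts in" for that particular corpus and counting unit.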


Keith said...

Did you write this yourself? This is a very impressive article. It deserves a good follow-up discussion!

As you know, I've been building a small corpus of Chinese TV, although not scientifically or anything.

frenkeld said...

//according to several research papers most words need to be encountered more than 20 times in order for the learner to have a significant chance of learning them through exposure//

The question I've always had about acquiring vocabulary through extensive reading is the number of times it takes seeing a word before one('s brain) figures out its meaning, versus the number of additional times one then has to see the word to memorize it after the meaning has been "decoded" for the first time.

Anonymous said...

I don't understand this word frequency preoccupation of yours, it's really unnecessary to bother so much about it.

Here are my reasons:

1.“according to several research papers most words need to be encountered more than 20 times in order for the learner to have a significant chance of learning them through exposure”
Yes 20 times, but nobody bothers to clarify that this is valid only for "memorizing" out of context words. Most of the studies that claim this are made on students memorizing flashcards.

2.“some argue that few new words appear to be learned from this type of exposure, and half of those that are learned are soon lost”
Not very convenient. Have you heard of the psychological phenomenon "Переживание успеха" (sorry I don't know how it should be translated in English, maybe something like "Experiencing the success"). Few famous polyglots like Kato Lomb say that after you work out the meaning of the word by yourself from the context it sticks in your mind for decades thanks to this phenomenon.

3.“Most vocabulary experiments were short in duration and did not therefore take into account constant study and reinforcement”
Not just most of them but probably all of them, and there is also insensible testing methodology.

4.The chances of learning a new word grow as your language proficiency grows, i.e. if you need a few encounters to learn a new word in the beginning stages, it's not impossible to learn the less frequent words from only one encounter when you are very proficient in the new language. Or, you may need 20 encounters to learn, let's say, the 1568th, 1947th or 2136th word, but you may need only one encounter to learn the 5678th, 7247th or 8004th word. You see, the fewer such words there are on the page (or tape, whatever), the stronger your attention on them is (you are comfortable with the rest of the words, so you don't bother much with them and your attention is not split across too many words), and the stronger your attention on them, the easier it is to remember them well.


Igor Efremov

Anonymous said...

Sorry I forgot this:

"Some 8,000 word families are sufficient to cover 98% of fiction and 98% of movies"

My figures are a little bit different from that:
-The 8000th word covers 99.19% of a general native text;
-The 98% level cuts in at the 5262nd most frequent word
(Data from a lemmatized BNC range x frequency rankings)