The most frequent 4,000 word families from the BNC provide 95% coverage of new texts, a level that translated into “adequate comprehension” (1 unknown word in 20, roughly one per 2 lines) for only “some learners” in Hu and Nation's study. Most learners, however, did not achieve adequate comprehension even with 95% coverage. For most learners, 98% coverage was necessary to achieve adequate comprehension of fiction. For reading to be considered a pleasurable activity, some researchers (Hirsh and Nation, 1992) suggest that 98-99% coverage may be necessary (one unknown word in every 50-100 running words). About 7,000 word families are needed to reach 98% coverage (Nation, 2006).
Word coverage:
80% = 1 unknown word in 5
90% = 1 in 10 (roughly one per line)
95% = 1 in 20 (roughly one per 2 lines)
98% = 1 in 50 (roughly eight unknown words per 400-word page)
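The coverage-to-density figures above follow from simple arithmetic; here is a minimal Python sketch (the 400-word page is the same convention used above, and the function name is mine):

```python
def unknown_word_density(coverage: float, words_per_page: int = 400) -> dict:
    """Translate a lexical coverage percentage into how often an
    unknown word is met, assuming unknown words are evenly spread."""
    unknown_fraction = 1 - coverage / 100
    return {
        "one_in_n_words": round(1 / unknown_fraction),
        "unknown_per_page": round(unknown_fraction * words_per_page, 1),
    }

for cov in (80, 90, 95, 98):
    print(cov, unknown_word_density(cov))
```

At 98% coverage this reproduces the figures above: one unknown word in 50 running words, about eight per 400-word page.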
A collection of excerpts regarding vocabulary acquisition:
“The results showed that knowledge of the most frequent 3,000 word families plus proper nouns and marginal words provided 95.76% coverage, and knowledge of the most frequent 6,000 word families plus proper nouns and marginal words provided 98.15% coverage of movies. Both American and British movies reached 95% coverage at the 3,000 word level. However, American movies reached 98% coverage at the 6,000 word level while British movies reached 98% coverage at the 7,000 word level. The vocabulary size necessary to reach 95% coverage of the different genres ranged from 3,000 to 4,000 word families plus proper nouns and marginal words, and 5,000 to 10,000 word families plus proper nouns and marginal words to reach 98% coverage.”
The Lexical Coverage of Movies
Stuart Webb and Michael P. H. Rodgers
“A corpus of one million words would probably have over 60,000 instances of the word the but is unlikely to include any of the following: gastronomic, plagiarism, incoherent, reassuring, preach all of which have a frequency rating of well under one-hit-per-million-words, yet could hardly be described as obscure.”
HLT Magazine
"My one's bigger than your one"
http://www.hltmag.co.uk/jul01/idea.htm
“The source text consisted of three months (approximately 5 million words) of Le Monde
#sentences 167,359
#words (total) 4,244,810
Less than 20% of the distinct words account for over 95% of all word occurrences. In fact, 40% (about 35,000 words) occurred only once in the text, and 60% of the words appeared at most 3 times. This effect is even more pronounced for syllables, where the roughly 20% most common syllables account for 98% of all syllable occurrences.”
http://www.limsi.fr/~lamel/euro91.pdf
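The Le Monde figures (a large share of word types occurring only once) are easy to reproduce for any text. A sketch using Python's standard library, with naive whitespace tokenization rather than whatever tokenizer the LIMSI study used:

```python
from collections import Counter

def corpus_stats(text: str) -> dict:
    """Type/token counts and the share of hapax legomena
    (word types occurring exactly once) in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapax_share": hapaxes / len(counts),
    }

print(corpus_stats("the cat sat on the mat the end"))
```

On a corpus of any real size the same Zipfian shape appears: a few types account for most tokens, while a large fraction of types are hapaxes.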
“Given this enormous amount of material, you might expect to find a lot of frequent idioms. If so, you would be disappointed. Simpson and Mendis found only 8 idioms that occurred more than 10 times (ranging from 10-17 times) in their corpus of nearly 2 million words/197 hours. Another 107 occur 1.2-2.4 times per million words. Liu, with an even larger corpus (roughly 6 million words) and a more generous definition, found only 47 items with a frequency of 50 or more tokens per million words. Another 107 had a frequency of 11-49 per million words and the other 148 had a frequency of 2-19 per million words. That’s a total of only 302 idioms, which strikes me as not only a relatively limited number, but also a very teachable number. The lack of many common idioms makes the task of teaching idioms both easier and harder. It is easier because we can focus our teaching on those idioms that are fairly frequent.”
http://www.nystesol.org/pub/idiom_archive/idiom_summer2005.html
The effect of frequency of occurrence on incidental word learning.
“It should also be pointed out that the volume of text that would need to be read to meet an unknown word increases with reading ability level. This is because rarer words are met less frequently and thus more text has to be read to meet an unknown word the required number of times. This also has implications for the amount of text that needs to be read.”
http://nflrc.hawaii.edu/rfl/October2003/waring/waring.html
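Waring's point about volume can be made concrete. If a word occurs at a given per-million rate and its tokens are spread roughly evenly, the expected number of meetings is proportional to the amount read. A back-of-the-envelope sketch (function names are mine; the even-spread assumption ignores the burstiness of real texts):

```python
def expected_meetings(freq_per_million: float, words_read: int) -> float:
    """Expected encounters with a word while reading `words_read` running words."""
    return freq_per_million * words_read / 1_000_000

def words_needed(freq_per_million: float, meetings: int) -> int:
    """Running words to read to expect a given number of encounters."""
    return round(meetings * 1_000_000 / freq_per_million)

# A word at 10 occurrences per million needs 800,000 running words
# of reading before 8 meetings can be expected.
print(words_needed(10, 8))
```

Halve the frequency and the required reading volume doubles, which is exactly why rarer words demand disproportionately more text.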
Beyond A Clockwork Orange: Acquiring Second Language Vocabulary through Reading
“The frequency of words in the language as a whole was also investigated; Brown (1993) found overall frequency to be a better predictor of incidental vocabulary growth than frequency in the specific texts her subjects read. The third explanatory variable was learner vocabulary size. It was assumed that knowing more words would assure better global comprehension of the text and, as a result, more incidental word acquisition. Laufer (1989, 1992) found evidence of a strong relationship between measures of learner vocabulary size and text comprehension.”
The Mayor of Casterbridge listening/reading experiment:
"Unfortunately, the experimental support for incidental vocabulary acquisition through reading in a second language is weak and plagued by methodological flaws..."
"The first study claiming to show that second language vocabulary learning occurs incidentally through reading is a well known experiment by Saragi, Nation and Meister (1978). They tested native speakers of English who had read Anthony Burgess's A Clockwork Orange on their understanding of many of the Russian-based slang words that occur in the novel. They found that the subjects were able to correctly identify the meanings of most of these nadsat words on a surprise multiple-choice test, especially the frequently occurring ones. But it seems strange to equate the circumstances of this study with second language learning. Here, native speakers of English used contexts which they must have fully understood to infer, for example, that droog meant friend; but making such connections is probably much harder for readers in a foreign language for whom many words in the context may be unknown or only partially known.
The mean number of words subjects acquired in the experiment was 68.4, amounting to about three quarters of the 90 words tested. But replications of this study with second language learners have not managed to reproduce these impressive results (see Table 1 below)... Dupuy and Krashen (1993) report a larger gain of almost seven words, but this higher than usual result may have little to do with reading since their experiment also involved viewing a video..."
"The (Mayor of Casterbridge) novel is one of a series of simplified classics published by Nelson for learners of English who know approximately 2000 basewords... 21,232 words of the simplified Mayor of Casterbridge text: subjects followed along in their books while the entire text was read aloud in class by the teacher... The remaining 34 (students) appeared to be absorbed by the story of secret love, dissolution and remorse, and tears were shed for the mayor when he met his lonely death at the end... The knowledge gain of five of the 23 means that about 22 per cent of the words that could have been learned were learned; in other words, there was an average pick-up rate of about one new word in every five..."
"Laufer (1982, 1989) claims that readers need a sight recognition of at least 95 percent of the words in a text for it to be comprehensible enough for meanings of unknown words to be inferred.”
"As far as implications for vocabulary learning are concerned, the experiment makes a stronger case for incidental acquisition than was made in the earlier Clockwork Orange replication studies. Subjects who read a full-length book recognized the meanings of new words at a higher rate than in previous studies with shorter texts, and built associations between new words as well... Cobb (1997) found that encountering new words in multiple contexts resulted in a deeper, more transferable knowledge of words than the usual strategy of studying short definitions."
"...But even though it may be possible to develop better resources for incidental learning, the study suggests that extensive reading is not a very effective way for learners who have a mean vocabulary size of around 3000 words to expand their lexicons...In brief, the experiment indicates that teachers of low intermediate learners of English can expect vocabulary growth from reading a simplified novel to be small and far from universal… In the last two decades, it has often been assumed that incidental acquisition was a sufficient strategy to take care of learners' lexical needs, to the point that explicit vocabulary instruction effectively disappeared from many coursebooks and vocabulary acquisition became "a neglected aspect of language learning" (Meara 1980:221). The present study suggests that the power of incidental acquisition may have been overestimated.
...Nagy, Herman and Anderson (1985) propose that for children learning English as their first language, school reading can account for the acquisition of thousands of new words each year. Even though the incidental pick-up rate was found to be low, large gains occur, they argue, because children encounter millions of words annually. But this is hardly applicable to beginning second language learners; for the subjects of this study, encountering one million words would entail reading fifty graded readers the size of The Mayor of Casterbridge - a worthy but unattainable goal for most learners at this level.”
“The results of this study point to several things. Firstly, the data support the notion that words can be learned incidentally from context. However, these data suggest that few new words appear to be learned from this type of reading, and half of those that are learned are soon lost....Assuming an optimistic scenario in which reading fifty novels per year was possible ...even if yearly gains increased marginally with increased vocabulary size, it would take many years to acquire incidentally the 5,000 most frequent word families of English, the figure which has been proposed as the minimum knowledge base needed for learners of English to be able to infer the meanings of new words they encounter in normal, unsimplified texts (Hirsh & Nation 1992, Laufer 1989)...
That is not to say that low intermediate learners should never read, but that teaching decisions should be based on an adequate account of what they can gain from their reading. Through reading extensively, they will probably enrich their knowledge of the words they already know, increase lexical access speeds, build network linkages between words, and more, but as this study has shown, only a few new words will be acquired. Therefore, it seems clear that in the early stages of their second language acquisition, learners should direct a considerable portion of their energies to using intentional strategies to learn high frequency vocabulary, in preparation for the day when they will know enough words and can read in enough volume for more substantial incidental benefits to accrue.”
http://www.er.uqam.ca/nobel/r21270/cv/Casterbridge.html
Incidental vocabulary acquisition from reading, reading-while-listening, and listening to stories
"The results showed that new words could be learned incidentally in all 3 modes, but that most words were not learned. Items occurring more frequently in the text were more likely to be learned and were more resistant to decay. The data demonstrated that, on average, when subjects were tested by unprompted recall, the meaning of only 1 of the 28 items met in either of the reading modes, and the meaning of none of the items met in the listening-only mode, would be retained after 3 months...
...The subjects, it seems, displayed a critical lack of familiarity with spoken English. As they listened to the story, they had to pay constant attention to a stream of speech whose speed they could not control. Because they were incapable of processing the phonological information as fast as the stream of speech, they may have failed to recognize many of the spoken forms of words that they already knew in their written forms."
http://nflrc.hawaii.edu/rfl/October2008/brown/brown.pdf
Current Research and Practice in Teaching Vocabulary
Alan Hunt and David Beglar
“In the long run, most words in both first and second languages are probably learned incidentally, through extensive reading and listening (Nagy, Herman, & Anderson, 1985). Several recent studies have confirmed that incidental L2 vocabulary learning through reading does occur (Chun & Plass 1996; Day, Omura, & Hiramatsu, 1991; Hulstijn, Hollander & Greidanus, 1996; Knight, 1994; Zimmerman, 1997). Although most research concentrates on reading, extensive listening can also increase vocabulary learning (Elley, 1989). Nagy, Herman, & Anderson (1985) concluded that (for native speakers of English) learning vocabulary from context is a gradual process, estimating that, given a single exposure to an unfamiliar word, there was about a 10% chance of learning its meaning from context. Likewise, L2 learners can be expected to require many exposures to a word in context before understanding its meaning...The incidental learning of vocabulary may eventually account for a majority of advanced learners' vocabulary; however, intentional learning through instruction also significantly contributes to vocabulary development (Nation, 1990; Paribakht & Wesche, 1996; Zimmerman, 1997). Explicit instruction is particularly essential for beginning students whose lack of vocabulary limits their reading ability. Coady (1997b) calls this the beginner's paradox. He wonders how beginners can "learn enough words to learn vocabulary through extensive reading when they do not know enough words to read well" (p. 229). His solution is to have students supplement their extensive reading with study of the 3,000 most frequent words until the words' form and meaning become automatically recognized (i.e., "sight vocabulary"). The first stage in teaching these 3,000 words commonly begins with word-pairs in which an L2 word is matched with an L1 translation... 
Translation has a necessary and useful role for L2 learning, but it can hinder learners' progress if it is used to the exclusion of L2-based techniques. Prince (1996) found that both "advanced" and "weaker" learners could recall more newly learned words using L1 translations than using L2 context. However, "weaker" learners were less able to transfer knowledge learned from translation into an L2 context. Prince claims that weaker learners require more time when using an L2 context as they have less developed L2 networks and are slower to use syntactic information... “Understanding of a word acquired from meeting it in context in extensive reading is ‘fragile knowledge’, and may not be internalized long-term if there are no further encounters with it; but it is still useful...Vocabulary lists can be an effective way to quickly learn word-pair translations (Nation, 1990). However, it is more effective to use vocabulary cards, because learners can control the order in which they study the words (Atkinson, 1972). Also, additional information can easily be added to the cards. When teaching unfamiliar vocabulary, teachers need to consider the following:
1. Learners need to do more than just see the form (Channell, 1988). They need to hear the pronunciation and practice saying the word aloud as well (Ellis & Beaton, 1993; Fay and Cutler, 1977; Siebert, 1927). The syllable structure and stress pattern of the word are important because they are two ways in which words are stored in memory (Fay and Cutler, 1977).
2. Start by learning semantically unrelated words. Also avoid learning words with similar forms (Nation, 1990) and closely related meanings (Higa, 1963; Tinkham, 1993) at the same time... Likewise, words with similar, opposite, or closely associated (e.g., types of fruit, family members) meanings may interfere with one another if they are studied at the same time.
3. It is more effective to study words regularly over several short sessions than to study them for one or two longer sessions. As most forgetting occurs immediately after initial exposure to the word (Pimsleur, 1967), repetition and review should take place almost immediately after studying a word for the first time.
4. Study 5-7 words at a time, dividing larger numbers of words into smaller groups.
5. Use activities like the keyword technique to promote deeper mental processing and better retention (Craik and Lockhart, 1972). Associating a visual image with a word helps learners remember the word. “
“Provide opportunities for developing fluency with known vocabulary.
Fluency building activities recycle already known words in familiar grammatical and organizational patterns so that students can focus on recognizing or using words without hesitation. “
http://www.jalt-publications.org/tlt/files/98/jan/hunt.html
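Nagy, Herman, and Anderson's roughly 10% single-exposure estimate quoted above implies a simple compounding model: if each exposure is treated as an independent 10% chance (a strong idealization not claimed by the authors), the probability a word is known after n exposures is 1 - (1 - 0.1)^n:

```python
def p_learned(n_exposures: int, p_per_exposure: float = 0.10) -> float:
    """Chance a word's meaning is picked up after n context exposures,
    treating each exposure as an independent Bernoulli trial."""
    return 1 - (1 - p_per_exposure) ** n_exposures

for n in (1, 5, 10, 20):
    print(n, round(p_learned(n), 2))   # 1: 0.1, 5: 0.41, 10: 0.65, 20: 0.88
```

Even under this optimistic independence assumption, ten exposures give only about a two-thirds chance of learning, consistent with the many-exposures findings quoted elsewhere in this post.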
At what rate do learners learn and retain new vocabulary from reading a graded reader?
Rob Waring
"The results show that words can be learned incidentally but that most of the words were not learned. More frequent words were more likely to be learned and were more resistant to decay. The data suggest that, on average, the meaning of only one of the 25 items will be remembered after three months, and the meaning of none of the items that were met fewer than eight times will be remembered three months later. The data thus suggest that very little new vocabulary is retained from reading one graded reader, and that a massive amount of graded reading is needed to build new vocabulary...
...This suggests that it is far more difficult to pick up words from listening-only than from either the reading-only or reading-while-listening modes. There was, however, no significant difference between the reading-only and reading-while-listening modes.
…This suggests that meanings are lost faster than the other types of word knowledge tested here.”
Number of meetings needed to learn a word
“As we saw in the introduction, previous estimates of the number of times it takes to learn a word from reading varied considerably. It is clear from this research that it is very difficult to pin a number on this age-old question. It seems much more complex than a simple single figure. From the results of this experiment, it seems that to have a 50% chance of recognizing a word form again three months later, learners have to meet the word at least eight times. Similar results were found for prompted recognition. However, for unprompted form-meaning recognition (i.e., word learning) there is only a 10% to 15% chance that the word's meaning will be remembered after three months even if it was met more than 18 times. If the word was met fewer than 5 times, the chance is next to zero. This is rather disappointing because it suggests that we do not learn a lot of new words from our reading even with a 96% coverage rate. There are several reasons why this might be so. Firstly, the learners are presumably focused on comprehending and enjoying the story rather than on the words themselves. The words were not made explicit by bolding or highlighting in any way, as is the case in natural reading. Because of this, the learners are not being forced to notice them and their awareness of the words is not being raised. Some recent research has suggested that the noticing of a form is an essential step in word learning (Schmidt, 1990)... Thirdly, the reason for the low vocabulary retention rate may have simply been that there were too few chances to learn the words. As we have seen, it takes much more than one meeting of a word to learn it from reading. Moreover, even words met more than fifteen times in the text still have only a 40% chance of being learned. This seems to suggest that it would take well over 20 or even 30 meetings for most of those words to be learned.”
http://nflrc.hawaii.edu/rfl/October2003/waring/waring.html
“A number of studies have shown that second language learners acquire vocabulary through reading, but only relatively small amounts. However, most of these studies used only short texts, measured only the acquisition of meaning, and did not credit partial learning of words.”
“The results showed that knowledge of 65% of the target words was enhanced in some way, for a pickup rate of about 1 of every 1.5 words tested. Spelling was strongly enhanced, even from a small number of exposures. Meaning and grammatical knowledge were also enhanced, but not to the same extent. Overall, the study indicates that more vocabulary acquisition is possible from extensive reading than previous studies have suggested.”
“There is no frequency point where meaning acquisition is assured, but by about 10+ exposures, there does seem to be a discernible rise in the learning rate. However, even after 20+ exposures, the meaning of some words eluded G, echoing Grabe and Stoller's (1997) point that some words simply seem hard to learn.”
“As a whole, the results are consistent with those of Schmitt (1998), who found that it is possible for L2 learners to have other kinds of word knowledge without having acquired knowledge of the word's meaning.”
“...the role of frequency of occurrence in the texts in the enhancement of the three types of word knowledge... As mentioned before, it seems that spelling knowledge can be gained with even a few exposures. Meaning does not seem to be as affected by frequency as much as one might expect, with 2-19 text occurrences yielding uptake rates ranging between 16-36% when we take the nouns and verbs together. Only at the extremes of frequency do we see a noticeable effect. Single encounters produced hardly any learning of meaning at all (3.4%), while it took 20+ occurrences to lead to a noticeable increase in uptake rates (60%). Only in the case of grammar (when articles and prepositions are considered together) was there a relatively steady increase of learning along the frequency scale. Overall, only when words were seen twenty or more times was there a good chance of all three word knowledge facets being enhanced.”
http://nflrc.hawaii.edu/rfl/April2006/pigada/pigada.html
“Chun and Plass' (1996) study of American university students learning German found that unfamiliar words were most efficiently learned when both pictures and text were available for students. This was more effective than text alone or combining text and video, possibly because learners can control the length of time spent viewing the pictures.”
http://www.jalt-publications.org/tlt/files/98/jan/hunt.html
Beyond raw frequency: Incidental vocabulary acquisition in extensive reading:
“However, words of lower frequency were better learned than words of higher frequency when the meanings of the lower frequency words were crucial for meaning comprehension.”
"...a richer sense of a word is learned through contextualized input. Furthermore, the incidental acquirer not only acquires word meanings but also increases his or her chances to get a feel for collocations and colligations that are not easily learned by learners of English as a foreign language (Bahns & Eldaw, 1993); therefore, learning can be facilitated by repeated exposure to words that go together (cf. Lewis, 1993; Nattinger & DeCarrico, 1992, for the importance of learning lexical phrases)...
“It does not seem feasible to define a number of exposures that is sufficient for successful acquisition, such as at least 10 exposures (Saragi et al., 1978) or 5–16 exposures (Nation, 1990). As Henriksen (1999, p. 314) pointed out, word acquisition seems to be able to range “over continua of lexical knowledge” from partial recognition knowledge to productive use ability, depending on how many and what kinds of exposures are needed for successful acquisition. The observation that some words that do not appear frequently, but are nevertheless acquired and retained, apparently because they are salient and significant to a story, is highly interesting. We suggest that the rate of incidental vocabulary learning is not simply related to the raw frequency of specific words in the language. We further propose that learning is a consequence of noticing and the conscious learning of words that are important in the narrative. (Schmidt, 2001).“
http://nflrc.hawaii.edu/rfl/October2008/kweon/kweon.pdf
WHAT DOES FREQUENCY HAVE TO DO WITH GRAMMAR?
“Another reason, as Larsen-Freeman (2002) and Ellis (2002b) point out, is that if second language learning were simply a matter of acquiring the most frequently occurring patterns of the target language (TL), then English language learners (ELLs) would be proficient in their uses of the definite and indefinite articles, the most frequently occurring free morphemes in English. This, of course, is not the case. It is clear that the frequency of input is not the only factor involved in learning a second language; however, we believe it plays a significant role. Ultimately, we hope to show that high-frequency constructions provide more exemplars for L2 learners to make generalizations from than low-frequency constructions, and that this directly relates to the number and kind of L2 learner errors.”
“ELLs will tend to produce more errors with low frequency constructions…”
Monday, April 27, 2009
corpora comparison by frequency
The Brown University Standard Corpus of Present-Day American English (Brown Corpus) was compiled by Henry Kucera and W. Nelson Francis at Brown University as a general corpus in 1961. The corpus contains 1,014,312 words sampled from 15 text categories: press (politics, sports, culture, financial; 44 texts); editorial (letters to the editor, etc.); theatre and book reviews; religious texts; skills and hobbies; "popular lore" (48 texts); biography and memoirs (75 texts); government documents (30 texts); learned (natural science, medicine, mathematics, humanities, technology; 80 texts); general fiction (29 texts); mystery and detective fiction (24 texts); adventure and western (29 texts); romance and love story (29 texts); and humor (9 texts). The Brown Corpus is made up of 500 texts of about 2,000 words each. The first American Heritage Dictionary (1969) was based on the Brown Corpus; it was the first dictionary to be compiled using evidence gleaned from corpus linguistics.
"The" alone constitutes nearly 7% of the Brown Corpus, and about half of its total vocabulary of roughly 50,000 distinct words occurs only once in the corpus.
NCFWD - a corpus of nineteenth-century fiction written between 1830 and 1870 (approximately 2.2 million words)
The Dickens Corpus – some 4.6 million running words
NCFWD and Dickens corpus data taken from Investigating Dickens’ Style by Masahiro Hori.
SUBTLEXUS, compiled by Brysbaert & New on the basis of American subtitles: a corpus of 8,388 films and television episodes with a total of 51 million running words (16.1M from television series, 14.3M from films before 1990, and 20.6M from films after 1990).
USA films from 1900-1990 (2046 files)
USA films from 1990-2007 (3218 files)
USA television series (4575 files)
There are 4,554 examples of gentleman in the Dickens Corpus (4.6 million words), 825 in the NCFWD (2.2 million words), 2,777 in the entire Cobuild (200 million words), and 2,135 in SUBTLEXUS. Per million:
Dickens: 968
NCFWD 375
Cobuild 13.9
SUBTLEXUS: 42
Dickens Oliver Twist: 332
Thackeray's Vanity Fair: 269
Jane Austen’s Emma 36
Brontë sisters:
Jane Eyre: 39
Wuthering Heights: 13
Agnes Grey: 22
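The per-million figures above are just raw counts scaled by corpus size; here is a one-liner for checking them, using the Cobuild numbers given above:

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw frequency count to occurrences per million running words."""
    return round(count / corpus_size * 1_000_000, 1)

# "gentleman": 2,777 hits in the 200-million-word Cobuild
print(per_million(2_777, 200_000_000))   # -> 13.9
```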
19th>20th century sign o’ the times
DICKENS, NCFWD, BROWN and SUBTLEXUS compared (frequency per million words; Brown frequency rank in parentheses)
Man
Dickens: 2037
NCFWD: 1587
Brown: 1210 (no. 81)
SUBTLEXUS: 1099
Old
Dickens: 1973
NCFWD: 1335
Brown: 660 (no. 140)
SUBTLEXUS: 609
Hand
Dickens: 1289
NCFWD: 871
Brown: 431
SUBTLEXUS: 280
Head
Dickens: 1212
NCFWD: 616
Brown: 404 (no. 201)
SUBTLEXUS: 371
Face
Dickens: 1075
NCFWD: 765
Brown: 371 (no. 245)
SUBTLEXUS: 289
Eyes
Dickens: 985
NCFWD: 816
Brown: 401 (no. 214)
SUBTLEXUS: 221
Dear
Dickens: 1284
NCFWD: 790
Brown: 54 (no. 2040)
SUBTLEXUS: 223
Life
Dickens: 711
NCFWD: 854
Brown: 715 (no. 127)
SUBTLEXUS: 797
Room
Dickens: 954
NCFWD: 981
Brown: 384 (no. 232)
SUBTLEXUS: 440
Lady
Dickens: 834
NCFWD: 1284
Brown: 80 (no. 1328)
SUBTLEXUS: 217
Another
Dickens: 829
NCFWD: 566
Brown: 684 (no. 133)
SUBTLEXUS: 509
Night
Dickens: 1079
NCFWD: 649
Brown: 411 (no. 209)
SUBTLEXUS: 866
Door
Dickens: 986
NCFWD: 614
Brown: 312 (no. 295)
SUBTLEXUS: 292
Boy
Dickens: 563
NCFWD: 333
Brown: 242 (no. 384)
SUBTLEXUS: 530
Manner
Dickens: 547
NCFWD: 285
Brown: 124 (no. 831)
SUBTLEXUS: 12
Child
Dickens: 538
NCFWD: 338
Brown: 213 (no. 435)
SUBTLEXUS: 158
Seemed
Dickens: 535
NCFWD: 569
Brown: 332 (no. 274)
SUBTLEXUS: 54
Yet
Dickens: 590
NCFWD: 864
Brown: 419 (no. 202)
SUBTLEXUS: 342
Let
Dickens: 656
NCFWD: 726
Brown: 384 (no. 231)
SUBTLEXUS: 2,419
Done
Dickens: 656
NCFWD: 597
Brown: 320 (no. 283)
SUBTLEXUS: 485
Half
Dickens: 618
NCFWD: 580
Brown: 275 (no. 337)
SUBTLEXUS: 199
People
Dickens: 592
NCFWD: 668
Brown: 847 (no. 106)
SUBTLEXUS: 1103
Love
Dickens: 420
NCFWD: 775
Brown: 232 (no. 397)
SUBTLEXUS: 1,115
Only
Dickens: 978
NCFWD: 1502
Brown: 1747 (no. 62)
SUBTLEXUS: 1084
Returned
Dickens: 846
NCFWD: 264
Brown: 115 (return: 180)
SUBTLEXUS: 25 (return: 92)
Replied
Dickens: 823
NCFWD: 299
Brown: 57 (reply: 42)
SUBTLEXUS: 1 (reply: 5)
Slowly
Dickens: 178
NCFWD: 117
Brown: 115 (no. 900); slow: 60 (no. 1817)
SUBTLEXUS: 25; slow: 76
Softly
Dickens: 101
NCFWD: 36
Brown: 31 (no. 3425); soft: 62
SUBTLEXUS: 5; soft: 1126
Easily
Dickens: 100
NCFWD: 79
Brown: 106 (no. 981); easy: 125
SUBTLEXUS: 23; easy: 266
Gradually
Dickens: 94
NCFWD: 49
Brown: 51 (no. 2125)
Quickly
Dickens: 92
NCFWD: 70
Brown: 89 (no. 1169); quick: 68
SUBTLEXUS: 57; quick: 109
Hastily
Dickens: 87
NCFWD: 45
Brown: not in the top 5,000 (fewer than 19)
SUBTLEXUS: 1 (haste: 2)
Gently
Dickens: 83
NCFWD: 59
Brown: 31 (no. 3441); gentle: 27
SUBTLEXUS: 9; gentle: 17
Quietly
Dickens: 78
NCFWD: 85
Brown: 48 (no. 2250); quiet: 76
SUBTLEXUS: 12; quiet: 117
Carefully
Dickens: 65
NCFWD: 56
Brown: 87 (no. 1213); careful: 62; care: 162
SUBTLEXUS: 24; careful: 109; care: 485
Heartily
Dickens: 54
NCFWD: 26
Brown: not in the top 5,000
SUBTLEXUS: 1
Steadily
Dickens: 47
NCFWD: 19
Brown: 22 (no. 4499); steady: 41
SUBTLEXUS: 1; steady: 23
Frequently
Dickens: 42
NCFWD: 52
Brown: 91 (no. 1146); frequent: 34
SUBTLEXUS: 3; frequent: 2
Thoughtfully
Dickens: 39
NCFWD: 5
Brown: not in the top 5,000 (neither is thoughtful; fewer than 19)
SUBTLEXUS: 1; thoughtful: 8
Eagerly
Dickens: 37
NCFWD: 49
Brown: not in the top 5,000; eager: 27 (no. 3772)
SUBTLEXUS: 1; eager: 7
Freely
Dickens: 35
NCFWD: 24
Brown: 22 (no. 4476); free: 260 (no. 358)
SUBTLEXUS: 4; free: 178
Happily
Dickens: 32
NCFWD: 27
Brown: 20 (no. 4836); happy: 98 (no. 1069)
SUBTLEXUS: 10; happy: 333
Cheerfully
Dickens: 32
NCFWD: 18
Brown: not in the top 5,000 (neither is cheerful)
SUBTLEXUS: 1; cheerful: 4
Sharply
Dickens: 31
NCFWD: 25
Brown: 38 (no. 2827); sharp: 72
SUBTLEXUS: 1; sharp: 24
Silently
Dickens: 30
NCFWD: 30
Brown: not in the top 5,000; silent: 49 (no. 2229)
Seriously
Dickens: 27
NCFWD: 45
Brown: 46 (no. 2368); serious: 116 (no. 883)
Angrily
Dickens: 26
NCFWD: 12
Brown: not in the top 5,000; angry: 45 (no. 2430)
SUBTLEXUS: 0.4; angry: 59
Sternly
Dickens: 26
NCFWD: 12
Brown: not in the top 5,000; stern: 23 (no. 4295)
SUBTLEXUS: 0.1; stern: 6
Timidly
Dickens: 26
NCFWD: 19
Brown: not in the top 5,000 (neither is timid)
SUBTLEXUS: 0.1; timid: 2
SUBTLEXUS vs Brown (frequency per million words)
This 7,979 vs 5,146
Now 3202 vs 1314
Be 5746 vs 6376
Was 5654 vs 9815
Been 1737 vs 2473
In 9,773 vs 21,345
Out 3865 vs 2096
Me 9,242 vs 1183
My 6763 vs 1319
Mine 251 vs 59
Can 5,247 vs 1,772
Could 1629 vs 1599
Should 1062 vs 888
Will 2124 vs 2244
Would 1768 vs 2715
There 4348 vs 2725
But 4,418 vs 4381
By 1340 vs 5307
He 7,637 vs 9,542
Him 3484 vs 2619
So 4244 vs 1985
Go 3793 vs 626
Goes 217 vs 89
Going 2123 vs 399
Went 411 vs 507
Gone 297 vs 195
Like 3,999 vs 1290
Likes 76 vs 20
Liked 79 vs 58
How 3056 vs 836
If 3541 vs 2199
Just 4,749 vs 872
Get 4583 vs 749
Gets 223 vs 66
Got 3306 vs 482
Gotten 54 vs n/a (less than 19)
Had 1676 vs 5,131
Come 3141 vs 630
Comes 229 vs 137
Came 464 vs 622
Coming 527 vs 174
They 4102 vs 3619
See 2557 vs 772
Saw 403 vs 352
Seen 385 vs 279
Time 1959 vs 1601
Let 2419 vs 384
Did 2341 vs 1044
From 2039 vs 4370
Want 2759 vs 329
Wants 307 vs 71
Wanted 502 vs 226
Think 2691 vs 433
thinks 103 vs 23
Thought 809 vs 516
thinking 281 vs 145
Take 1891 vs 611
Took 342 vs 426
Taken 281 vs 139
Look 1947 vs 399
looks: 311 vs 78
looked 121 vs 361
Some 1727 vs 1617
Then 1490 vs 1377
Why 2248 vs 404
Where 1830 vs 938
Too 1372 vs 833
More 1299 vs 2216
Down 1490 vs 895
Yes 1997 vs 144
Tell 1724 vs less than 19
Little 1446 vs 831
Thing 1088 vs 333
Mean 1244 vs 199
Said 1109 vs 1961
Sure 1100 vs 264
First 840 vs 1361
Put 829 vs 437
Please 1101 vs 62
Mexico 31 vs 19
Wildlife 2 vs 19
victims 23 vs 19
Father 555 vs 183
Mother 480 vs 216
English 74 vs 195
hasn't 91 vs 20
Tuesday 24 vs 59
January 7 vs 53
Halloween 13 vs n/a
Keith 0 vs 21
Economical 0.33 vs 22
Arrested 35 vs 19
Run 350 vs 217
Court 101 vs 230
Office 2O4 vs 255
Planet 39 vs. 21
Planets 4 vs 22
Political 22 vs 258
Theoretical 2 vs 21
sixty 5 vs 21
Troops 19.3 vs 53
College 85 vs 267
"The" constitutes nearly 7% of the Brown Corpus. About half of the total vocabulary of about 50,000 words are words that occur only once in the corpus.
NCFWD: a corpus of nineteenth-century fiction written between 1830 and 1870 (approximately 2.2 million words).
The Dickens Corpus: some 4.6 million running words.
NCFWD and Dickens corpus data taken from Investigating Dickens' Style by Masahiro Hori.
SUBTLEXUS: compiled by Brysbaert & New from American subtitles; 8,388 films and television episodes with a total of 51 million running words (16.1M from television series, 14.3M from films before 1990, and 20.6M from films after 1990).
USA films from 1900-1990 (2046 files)
USA films from 1990-2007 (3218 files)
USA television series (4575 files)
There are 4,554 examples of gentleman in the Dickens Corpus (4.6 million words), 825 in the NCFWD (2.2 million words), 2,777 in the entire Cobuild (200 million words) and 2,135 in SUBTLEXUS. Per million:
Dickens: 968
NCFWD: 375
Cobuild: 13.9
SUBTLEXUS: 42
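The per-million figures are simply the raw count divided by the corpus size in millions, which is what makes corpora of very different sizes comparable. A quick check against the counts quoted above:

```python
def per_million(count, corpus_size_millions):
    """Normalize a raw frequency count to occurrences per million words."""
    return count / corpus_size_millions

# 'gentleman' counts and corpus sizes quoted above
print(round(per_million(825, 2.2)))      # NCFWD: 375
print(round(per_million(2777, 200), 1))  # Cobuild: 13.9
print(round(per_million(2135, 51)))      # SUBTLEXUS: 42
```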
Dickens' Oliver Twist: 332
Thackeray's Vanity Fair: 269
Jane Austen's Emma: 36
The Brontë sisters:
Jane Eyre: 39
Wuthering Heights: 13
Agnes Grey: 22
19th > 20th century: a sign o' the times.
DICKENS, NCFWD, BROWN and SUBTLEXUS compared (Frequency per million words)
Brown: frequency rank number in parentheses
Man
Dickens: 2037
NCFWD: 1587
Brown: 1210 (no. 81)
SUBTLEXUS: 1099
Old
Dickens: 1973
NCFWD: 1335
Brown: 660 (no. 140)
SUBTLEXUS: 609
Hand
Dickens: 1289
NCFWD: 871
Brown: 431
SUBTLEXUS: 280
Head
Dickens: 1212
NCFWD: 616
Brown: 404 (no. 201)
SUBTLEXUS: 371
Face
Dickens: 1075
NCFWD: 765
Brown: 371 (no. 245)
SUBTLEXUS: 289
Eyes
Dickens: 985
NCFWD: 816
Brown: 401 (no. 214)
SUBTLEXUS: 221
Dear
Dickens: 1284
NCFWD: 790
Brown: 54 (no. 2040)
SUBTLEXUS: 223
Life
Dickens: 711
NCFWD: 854
Brown: 715 (no. 127)
SUBTLEXUS: 797
Room
Dickens: 954
NCFWD: 981
Brown: 384 (no. 232)
SUBTLEXUS: 440
Lady
Dickens: 834
NCFWD: 1284
Brown: 80 (no. 1328)
SUBTLEXUS: 217
Another
Dickens: 829
NCFWD: 566
Brown: 684 (no. 133)
SUBTLEXUS: 509
Night
Dickens: 1079
NCFWD: 649
Brown: 411 (no. 209)
SUBTLEXUS: 866
Door
Dickens: 986
NCFWD: 614
Brown: 312 (no. 295)
SUBTLEXUS: 292
Boy
Dickens: 563
NCFWD: 333
Brown: 242 (no. 384)
SUBTLEXUS: 530
Manner
Dickens: 547
NCFWD: 285
Brown: 124 (no. 831)
SUBTLEXUS: 12
Child
Dickens: 538
NCFWD: 338
Brown: 213 (no. 435)
SUBTLEXUS: 158
Seemed
Dickens: 535
NCFWD: 569
Brown: 332 (no. 274)
SUBTLEXUS: 54
Yet
Dickens: 590
NCFWD: 864
Brown: 419 (no. 202)
SUBTLEXUS: 342
Let
Dickens: 656
NCFWD: 726
Brown: 384 (no. 231)
SUBTLEXUS: 2,419
Done
Dickens: 656
NCFWD: 597
Brown: 320 (no. 283)
SUBTLEXUS: 485
Half
Dickens: 618
NCFWD: 580
Brown: 275 (no. 337)
SUBTLEXUS: 199
People
Dickens: 592
NCFWD: 668
Brown: 847 (no. 106)
SUBTLEXUS: 1103
Love
Dickens: 420
NCFWD: 775
Brown: 232 (no. 397)
SUBTLEXUS: 1,115
Only
Dickens: 978
NCFWD: 1502
Brown: 1747 (no. 62)
SUBTLEXUS: 1084
Returned
Dickens: 846
NCFWD: 264
Brown: 115 (return: 180)
SUBTLEXUS: 25 (return: 92)
Replied
Dickens: 823
NCFWD: 299
Brown: 57 (reply: 42)
SUBTLEXUS: 1 (reply: 5)
Slowly
Dickens: 178
NCFWD: 117
Brown: 115 (no. 900) Slow: 60 (no. 1817)
SUBTLEXUS: 25 Slow: 76
Softly
Dickens: 101
NCFWD: 36
Brown: 31 (no. 3425) Soft: 62
SUBTLEXUS: 5 Soft: 1126
Easily
Dickens: 100
NCFWD: 79
Brown: 106 (no. 981) Easy: 125
SUBTLEXUS: 23 Easy: 266
Gradually
Dickens: 94
NCFWD: 49
Brown: 51 (no. 2125)
Quickly
Dickens: 92
NCFWD: 70
Brown: 89 (no. 1169) Quick: 68
SUBTLEXUS: 57 Quick: 109
Hastily
Dickens: 87
NCFWD: 45
Brown: not in the top 5,000 (less than 19)
SUBTLEXUS: 1 (Haste: 2)
Gently
Dickens: 83
NCFWD: 59
Brown: 31 (no. 3441) Gentle: 27
SUBTLEXUS: 9 Gentle: 17
Quietly
Dickens: 78
NCFWD: 85
Brown: 48 (no. 2250) Quiet: 76
SUBTLEXUS: 12 Quiet: 117
Carefully
Dickens: 65
NCFWD: 56
Brown: 87 (no. 1213) Careful: 62 Care: 162
SUBTLEXUS: 24 Careful: 109 Care: 485
Heartily
Dickens: 54
NCFWD: 26
Brown: not in the top 5,000
SUBTLEXUS: 1
Steadily
Dickens: 47
NCFWD: 19
Brown: 22 (no. 4499) Steady: 41
SUBTLEXUS: 1 Steady: 23
Frequently
Dickens: 42
NCFWD: 52
Brown: 91 (no. 1146) Frequent: 34
SUBTLEXUS: 3 Frequent: 2
Thoughtfully
Dickens: 39
NCFWD: 5
Brown: not in the top 5,000; neither is Thoughtful (less than 19)
SUBTLEXUS: 1 Thoughtful: 8
Eagerly
Dickens: 37
NCFWD: 49
Brown: not in the top 5,000 Eager: 27 (no. 3772)
SUBTLEXUS: 1 Eager: 7
Freely
Dickens: 35
NCFWD: 24
Brown: 22 (no. 4476) Free: 260 (no. 358)
SUBTLEXUS: 4 Free: 178
Happily
Dickens: 32
NCFWD: 27
Brown: 20 (no. 4836) Happy: 98 (no. 1069)
SUBTLEXUS: 10 Happy: 333
Cheerfully
Dickens: 32
NCFWD: 18
Brown: not in the top 5,000 (neither is Cheerful)
SUBTLEXUS: 1 Cheerful: 4
Sharply
Dickens: 31
NCFWD: 25
Brown: 38 (no. 2827) Sharp: 72
SUBTLEXUS: 1 Sharp: 24
Silently
Dickens: 30
NCFWD: 30
Brown: not in the top 5,000 Silent: 49 (no. 2229)
Seriously
Dickens: 27
NCFWD: 45
Brown: 46 (no. 2368) Serious: 116 (no. 883)
Angrily
Dickens: 26
NCFWD: 12
Brown: not in the top 5,000 Angry: 45 (no. 2430)
SUBTLEXUS: 0.4 Angry: 59
Sternly
Dickens: 26
NCFWD: 12
Brown: not in the top 5,000 Stern: 23 (no. 4295)
SUBTLEXUS: 0.1 Stern: 6
Timidly
Dickens: 26
NCFWD: 19
Brown: not in the top 5,000 (neither is Timid)
SUBTLEXUS: 0.1 Timid: 2
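The pattern in these adverb lists is easier to see as ratios: Victorian narrative prose is saturated with -ly manner adverbs that barely occur in subtitle speech. A quick sketch using a few per-million figures copied from the lists above:

```python
# Per-million frequencies copied from the adverb lists above
dickens = {"slowly": 178, "hastily": 87, "heartily": 54, "timidly": 26}
subtlexus = {"slowly": 25, "hastily": 1, "heartily": 1, "timidly": 0.1}

for word in dickens:
    ratio = dickens[word] / subtlexus[word]
    print(f"{word}: roughly {ratio:.0f}x more frequent in Dickens")
```

A learner whose input is mostly subtitles will meet "timidly" hundreds of times less often than a Dickens reader, which is the argument for mixing literary sources into a private corpus.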
SUBTLEXUS vs Brown
This 7,979 vs 5,146
Now 3202 vs 1314
Be 5746 vs 6376
Was 5654 vs 9815
Been 1737 vs 2473
In 9,773 vs 21,345
Out 3865 vs 2096
Me 9,242 vs 1183
My 6763 vs 1319
Mine 251 vs 59
Can 5,247 vs 1,772
Could 1629 vs 1599
Should 1062 vs 888
Will 2124 vs 2244
Would 1768 vs 2715
There 4348 vs 2725
But 4,418 vs 4381
By 1340 vs 5307
He 7,637 vs 9,542
Him 3484 vs 2619
So 4244 vs 1985
Go 3793 vs 626
Goes 217 vs 89
Going 2123 vs 399
Went 411 vs 507
Gone 297 vs 195
Like 3,999 vs 1290
Likes 76 vs 20
Liked 79 vs 58
How 3056 vs 836
If 3541 vs 2199
Just 4,749 vs 872
Get 4583 vs 749
Gets 223 vs 66
Got 3306 vs 482
Gotten 54 vs n/a (less than 19)
Had 1676 vs 5,131
Come 3141 vs 630
Comes 229 vs 137
Came 464 vs 622
Coming 527 vs 174
They 4102 vs 3619
See 2557 vs 772
Saw 403 vs 352
Seen 385 vs 279
Time 1959 vs 1601
Let 2419 vs 384
Did 2341 vs 1044
From 2039 vs 4370
Want 2759 vs 329
Wants 307 vs 71
Wanted 502 vs 226
Think 2691 vs 433
Thinks 103 vs 23
Thought 809 vs 516
Thinking 281 vs 145
Take 1891 vs 611
Took 342 vs 426
Taken 281 vs 139
Look 1947 vs 399
Looks 311 vs 78
Looked 121 vs 361
Some 1727 vs 1617
Then 1490 vs 1377
Why 2248 vs 404
Where 1830 vs 938
Too 1372 vs 833
More 1299 vs 2216
Down 1490 vs 895
Yes 1997 vs 144
Tell 1724 vs less than 19
Little 1446 vs 831
Thing 1088 vs 333
Mean 1244 vs 199
Said 1109 vs 1961
Sure 1100 vs 264
First 840 vs 1361
Put 829 vs 437
Please 1101 vs 62
Mexico 31 vs 19
Wildlife 2 vs 19
Victims 23 vs 19
Father 555 vs 183
Mother 480 vs 216
English 74 vs 195
hasn't 91 vs 20
Tuesday 24 vs 59
January 7 vs 53
Halloween 13 vs n/a
Keith 0 vs 21
Economical 0.33 vs 22
Arrested 35 vs 19
Run 350 vs 217
Court 101 vs 230
Office 204 vs 255
Planet 39 vs 21
Planets 4 vs 22
Political 22 vs 258
Theoretical 2 vs 21
Sixty 5 vs 21
Troops 19.3 vs 53
College 85 vs 267
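A paired list like this reads more easily as skew ratios. A small sketch with a handful of pairs copied from the list above (treating the two columns as comparable per-million rates is an assumption, though Brown, at roughly one million words, gives per-million counts by construction):

```python
# (SUBTLEXUS, Brown) frequency pairs copied from the list above
pairs = {
    "me": (9242, 1183),
    "yes": (1997, 144),
    "by": (1340, 5307),
    "political": (22, 258),
}

def skew(subtlex, brown):
    """Ratio > 1: the word leans toward subtitles (speech);
    ratio < 1: it leans toward Brown (edited writing)."""
    return subtlex / brown

for word, (s, b) in pairs.items():
    side = "speech" if skew(s, b) > 1 else "writing"
    print(f"{word}: {skew(s, b):.1f} ({side})")
```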
Sunday, April 26, 2009
corpus size, word frequency and extensive reading
A corpus is assembled by collecting examples of written and spoken language. It is a descriptive sample of the language, not a prescriptive one. Modern dictionaries are created using such resources and are a distillation of the data. A national corpus, created by linguists, can contain several hundred million words and may use different types of annotation: metatextual (information about the text), morphological, accentual and semantic. Such corpora are nowadays mostly stored as texts in digital form for easy reference. A spoken language corpus may also be stored as audio and video.
The British National Corpus (BNC) is a 100-million-word corpus of written and spoken English. It is not the largest, but it is probably the world's most carefully designed and scrupulously annotated corpus. The written texts (90 million words) were taken from a range of fiction and non-fiction domains, usually dating from 1975 onwards. The spoken samples (10 million words) include material from different contexts and regions, produced by speakers of different ages and social backgrounds.
The Cobuild project was set up jointly by Collins (now HarperCollins) and the University of Birmingham. The 20-million-word corpus, published in 1987, was large enough to serve as the basic data resource for numerous reference and teaching works. Renamed the Bank of English in 1991, it has since grown to over 450 million words.
The German Reference Corpus (DeReKo) has over 3.3 billion searchable words.
The Corpus del Español is a 100-million-word corpus of Spanish, funded by the National Endowment for the Humanities in the USA.
The Corpus of Spoken Israeli Hebrew (CoSIH), started in 2000 at Tel Aviv University, aims to provide a representative corpus (5 million words) of Hebrew as spoken today by different groups in society, taking into account factors such as sex, age, profession, social and economic background, and education.
The Corpus of Spontaneous Japanese contains approximately 650 hours of spontaneous speech of various kinds, recorded between 1999 and 2003.
The Croatian National Corpus currently has 101.3 million words.
Michigan Corpus of Academic Spoken English (MICASE) has recorded and transcribed nearly 200 hours (over 1.7 million words) of English.
Corpus linguistics is the study of language as expressed in samples (corpora). It is mainly the domain of linguists, language teachers, computational linguists and other NLP (natural language processing) people. Corpus samples are now seen as very important both in linguistics and in language learning, as they expose learners to the kinds of authentic language use that they will encounter in real life situations. Although recognized by educators as a potentially useful tool, until recently corpus application has been limited because the concordance examples retrieved have been deemed too difficult for beginning level learners to understand. From corpus linguistics we have concordance-based teaching and data-driven learning, two examples of technology-enhanced language learning.
Building one’s own language corpus and language learning media library.
I am primarily concerned here with the language corpus as an adequate snapshot of the target language, and with word frequency as a factor in practical vocabulary and grammar acquisition. Computer searchability is important but secondary; examples of language usage can easily be looked up in much larger online corpora. A good corpus provides an authentic representation of the target language and contains enough words to give us a trustworthy sample of clusters and collocations. An ideal language learner's library would collect authentic material in which the learner's attention is directed to the subject matter and away from the form in which it is expressed. It would focus on authentic material even when dealing with tasks such as the acquisition of grammatical structures and lexical items, and it would be a representative, well-balanced collection of texts and multimedia items. To be representative of the language as a whole, a corpus should contain samples of all major text types, in some way proportional to their share of everyday language. Large linguistic corpora contain things like transcripts and recordings of "real" telephone conversations and other unpalatable items. While allowing for sufficient representation of everyday language, artistic and entertainment value would be a primary concern when putting together a private library. This private corpus would need to be sufficiently large, but also lean and mean. General corpora contain samples of academic, literary and spoken language; literary sources contain a wealth of idioms and metaphors. According to one study, adding a quite small domain-specific corpus to a large general one considerably boosts performance (common sense, btw).
Corpus size
How large should our corpus be if it is to provide an authentic representation of the target language and contain enough words to give us a trustworthy sample of clusters and collocations? How large a multimedia library do we need in order to allow users to "pick up" the most frequent vocabulary through extensive reading and listening?
The role of corpora in foreign language and especially ESL teaching is unfortunately mostly limited to "learner corpora": collections of texts written by non-native speakers (containing errors, etc.). Some teachers develop their own corpora for use in teaching. For ESL, informed recommendations range from a general corpus of one to two million words (Ball 1996) to one of over ten million words (University of Birmingham).
Compiling monolingual and parallel corpora:
The average literary novel has around 100,000 words, at roughly 300-400 words per printed page. Words per page may range from 250 in a large-print book to 600 in an academic tome. The average novel runs about 300-400 pages, with sentences of 15-20 words.
Pride and Prejudice 480 pages
Words: 120,528 (29% have more)
Sentences: 5,878 (35% have more)
Words per Sentence: 20.5 (42% have more)
Effi Briest (German) 304 pages
Words: 98,455
Sentences: 5,501
Words per Sentence: 17.9
Anna Karenina (translated) 864 pages
Words: 334,108 (1% have more)
Sentences: 19,345 (2% have more)
Words per Sentence: 17.3
James Clavell's Shogun
Words: 428,978
Sentences: 43,647
Words per Sentence: 9.8
Bravo Two Zero 416 pages
Words: 133,775
Sentences: 10,212 (11% have more)
Words per Sentence: 13.1
Puppet Masters 352 pages
Words: 96,154
Sentences: 8,073
Words per Sentence: 11.9
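Statistics like the ones above can be generated for any plain-text e-book. A rough sketch; the regex-based sentence splitter is deliberately naive (it will miscount around abbreviations and dialogue):

```python
import re

def novel_stats(text):
    """Return (words, sentences, words_per_sentence) for a plain text."""
    words = re.findall(r"[A-Za-z']+", text)
    # naive: a sentence ends at '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words), len(sentences), len(words) / len(sentences)

w, s, wps = novel_stats("It was a dark night. The wind howled! Who was there?")
print(w, s, round(wps, 1))  # 11 words, 3 sentences, 3.7 words per sentence
```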
An average literary novel has around 100,000 words and 5,000-7,000 sentences; ten such novels would contain one million words and some 60,000 sentences.
20-25 "average" novels = 10,000 pages, 120,000 sentences, 2.5 million words.
Henry James' corpus (20 main novels) has some 2.4 million words.
200 such novels = over 1,000,000 sentences and more than 20,000,000 words.
Some 150-160 words per minute is the industry standard for most voiceovers and for verbatim closed captioning of sitcoms and similar programs. Audiobook narration is likewise recommended to run at 150-160 words per minute, the range at which people comfortably hear and vocalize words. In a sample of 318 English-language movies, the average movie had 8,937 running words and a running time of 113.5 minutes.
SUBTLEXUS, some 51 million running words, was compiled from 8,388 films and television episodes. The average movie was around 7,000 words.
Going back to the frequency and extensive reading issue: according to several research papers, most words need to be encountered more than 20 times for the learner to have a significant chance of learning them through exposure. Opinions differ as to the effectiveness of extensive reading. The available data support the notion that words can be learned incidentally from context. However, researchers disagree about how effective this is: some argue that few new words appear to be learned from this type of exposure, and that half of those that are learned are soon lost. Most vocabulary experiments were short in duration and therefore did not take into account constant study and reinforcement.
There are 6,318 words with more than 800 occurrences in the whole 100M-word BNC.
In a corpus of 100 million words, the top 25,000 words would have at least 100 occurrences in English, German, Italian and Russian. In the 18-million-word Cobuild corpus (1986), there were about 19,800 lemmas (potential dictionary headwords) with 10+ tokens; in the 120-million-word Cobuild Bank of English corpus (1993), 45,150; in the 450-million-word Bank of English corpus (2001), 93,000.
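These thresholds follow from simple arithmetic: expected occurrences = per-million rate times corpus size in millions. So the 800-occurrence cutoff in the 100-million-word BNC corresponds to a rate of 8 per million. A sketch:

```python
def expected_occurrences(per_million, corpus_size_millions):
    """Expected number of hits for a word at a given per-million rate."""
    return per_million * corpus_size_millions

print(expected_occurrences(8, 100))    # the BNC cutoff above: 800
print(expected_occurrences(0.5, 100))  # a much rarer word: 50.0
```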
In a frequency list based on 29,213,800 words from TV and movie scripts and transcripts, 10,000 words occur more than 77 times. To cover those 29 million words one would need to watch some 3,000 hours of video. Word no. 8,000, "moss", is repeated 111 times. Word no. 20,000, "dashboard", would be heard 24 times. Some 26,000 words in all would get at least 15 repetitions. Word no. 25,991 is "amicable". Word no. 40,000, "imperious", would be practiced 6 times. The problem with movie and sitcom scripts is that they contain a lot of words that are never spoken on screen: a 52-minute episode of a US drama shrank from 9,300 words to a little over 4,500 words after all the instructions for the actors were removed, and a 22-minute episode of the Addams Family was a little over 2,200 words long after all the extra material was stripped out.
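The 3,000-hour estimate is consistent with the 150-160 words-per-minute speech rate quoted earlier:

```python
def hours_to_hear(total_words, words_per_minute=160):
    """Hours of listening needed to hear a given number of running words."""
    return total_words / (words_per_minute * 60)

print(round(hours_to_hear(29_213_800)))  # about 3,043 hours at 160 wpm
```

At the slower 150 wpm end of the range the figure rises to roughly 3,250 hours.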
Some 8,000 word families are sufficient to cover 98% of fiction and 98% of movies (see The Lexical Coverage of Movies). Very high-frequency words are more common in speech than in writing (hence TV); lower-frequency items are more common in writing than in speech, in rough proportion to frequency (hence literature).
Friday, April 24, 2009
update - April 09
I need to keep a better record of what I’ve been doing.
April:
French:
Dumas Les Trois Mousquetaires
Balzac La Peau de Chagrin
Stendhal : La Chartreuse de Parme
German :
Die drei ???: several radio dramas
Uh, something else that I cannot remember.
At least 100 hours of German.
Italian:
A couple of movies.
Russian: Nothing
Die drei ??? is about teenage detectives. I dislike the setting (Kalifornien) and the use of American names but otherwise it’s a pleasant way to spend time learning German.
I also have several TKKG radio dramas in a similar vein, and a few classics of world literature (adapted as radio dramas for young audiences).
Both above-mentioned series also appear in book form and are extremely popular in Germany. The audio dramas are published by Europa.
Of course, this is not "literature"; it's full of clichés and weak plot points. I would, however, highly recommend listening to radio dramas as a means to crack the spoken language.
One concern – I am more likely to repeatedly listen to an audiobook than a radio drama.
As a side note – all the language learning bloggers I’ve been following (and I’m an avid reader) frequently mention their brains. My brain this, my brain that. It’s hilarious.
Sunday, April 12, 2009
update
First three months of 2009.
So far I have read:
12 books in German (about 120 pages each).
1 book in Russian (around 1200 pages).
I have looked up every single unknown word in German. Unsurprisingly, I keep looking up certain stuff.
I am also done with all of the German audiobooks that were on my mp3 player.
April/May - I intend to concentrate on German and French.
I keep talking about French. It's a beautiful language I can enjoy without any birthing pains, and yet I keep neglecting it and spreading myself thin with other languages. I intend to have lots of fun with French. So far I have read a third of Les Trois Mousquetaires by Alexandre Dumas. It's been a blast.
One thing that is interfering with my language studies: I am exercising actively. I am also on a diet. I have lost some 25 pounds (February-April). Another 20 to go.