Monday, January 11, 2016

How to learn ten thousand words

This is a companion post to Word frequency and incidental learning

While there are countless words in the English language, knowledge of the 3,000 to 5,000 most frequent word families will yield lexical coverage of 95-98%, depending on the language content. A vocabulary of this size provides a good basis for comprehension and language use. This core vocabulary also provides a good basis for further, comfortable vocabulary acquisition through extensive reading and listening, which is the core mantra of the comprehensible input crowd (i.e., Krashen et al.).

How do we go about developing a core vocabulary of some 3000 word families? Pick up any decent comprehensive course like the Assimil series and it will likely include the bulk of the required vocabulary. People occasionally "pick up" languages. If you're lucky, your previous linguistic knowledge will provide plenty of cognates. Speakers of multiple languages may possess a passive knowledge of cognates rivaling that of native speakers with little or no involvement in their new target language. Not everyone is equally successful at recognizing and exploiting cognates, but we'll leave that for another time.

Successful language learners will learn the bulk of their vocabulary through repeated encounters in different contexts. Words and their shades of meaning are learned gradually. According to some published research, in order to have a high probability of learning a new word from context, you need to encounter it between five and twenty times*.

According to the statistical analysis table developed by Dr. Rob Waring:
  • To meet all the 3000 most frequent words in English 1, 5, 10 and 20 times, you’d need to read or otherwise meet 47,300, 236,700, 473,000, and 947,000 words, respectively.
  • To meet all the 5000 most frequent words in English 1, 5, 10 and 20 times, you’d need to read 132,100, 661,000, 1,321,000, and 2,642,000 words.
  • To meet all the 10,000 most frequent words in English 1, 5 and 10 times, you’d need to read 632,000, 3,164,000 and 6,328,947 words, respectively.
To meet the most frequent 10,000 words 20 times, you would need to read 12,657,895 running words, the equivalent of approximately 100 books the length of Pride and Prejudice. At 140 words per minute, you would hear 12.7 million words after about 1,507 hours of listening to audiobooks.
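
As a rough check on the arithmetic above, here is a minimal Python sketch. The constants are the figures quoted in this post, plus the Pride and Prejudice word count from Amazon's text stats; the function names are purely illustrative.

```python
# Figures quoted in the post; the function names are hypothetical helpers.
WORDS_FOR_10K_X20 = 12_657_895        # running words to meet the top 10,000 words 20 times
PRIDE_AND_PREJUDICE_WORDS = 120_528   # word count per Amazon's text stats
AUDIOBOOK_WPM = 140                   # narration speed assumed in the post


def books_needed(total_words, words_per_book):
    """Number of books of a given length that add up to total_words."""
    return total_words / words_per_book


def listening_hours(total_words, words_per_minute):
    """Hours of listening needed to hear total_words at a steady pace."""
    return total_words / words_per_minute / 60


print(round(books_needed(WORDS_FOR_10K_X20, PRIDE_AND_PREJUDICE_WORDS)))  # 105
print(round(listening_hours(WORDS_FOR_10K_X20, AUDIOBOOK_WPM)))           # 1507
```

Swapping in your own reading or listening speed gives a personal estimate, though real progress will depend on how often you actually meet each word.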

Different types of language material may have very different word frequencies. Per 1 million words, the word "Dear" occurs 1,284 times in the Dickens corpus, 54 times in the Brown corpus (a compendium of a variety of sources), and 223 times in SUBTLEX-US (movies and TV shows). "Me" occurs 9,242 times in SUBTLEX-US vs. 1,183 in Brown. See other comparisons here and here. Based on the foregoing, a listening and reading strategy involving substantial blocks of different types of language content may lead to a wider vocabulary and long-term vocabulary retention.
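
To get a feel for what these per-million figures mean in practice, the sketch below estimates how many running words you would have to get through before meeting a word a given number of times. It assumes, unrealistically, that the word is spread evenly through the text, and the function name is hypothetical, so treat the results as ballpark numbers only.

```python
# Per-million frequencies quoted above for the word "Dear".
DEAR_PER_MILLION = {
    "Dickens": 1284,
    "Brown": 54,
    "movies and TV": 223,
}


def words_to_meet(per_million, encounters):
    """Expected running words before a word turns up `encounters` times,
    assuming (unrealistically) an even spread through the text."""
    return encounters * 1_000_000 / per_million


for corpus, rate in DEAR_PER_MILLION.items():
    estimate = words_to_meet(rate, 20)
    print(f"{corpus}: about {estimate:,.0f} words for 20 encounters")
```

On these assumptions, the same word could take more than twenty times as much reading to meet 20 times in one kind of material as in another, which is the argument for mixing content types.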

*Over the course of the past 20 years, the suggested number of exposures needed to retain a word has increased from an initial 5-6 to the current recommendation of 15-20. Some recent research suggests that incidental acquisition of vocabulary can happen "extremely fast even with complete beginners in a FL" with "as little as two exposures to new words." Moreover, the researchers found that "the impact of exposure was not constant across number of exposures, but rather decreased following the initial encounters." See the first link.

The Role of Repeated Exposure to Multimodal Input in Incidental Acquisition of Foreign Language Vocabulary by Marie-Josée Bisson, Walter J. B. van Heuven, Kathy Conklin and Richard J. Tunney
Words are Learned Incrementally over Multiple Exposures by Steven A Stahl
Why should we build up a Start-up vocabulary quickly? by Rob Waring
The inescapable case for extensive reading by Rob Waring
Lexical Threshold revisited: Lexical text coverage, learners' vocabulary size and reading comprehension by Batia Laufer and Geke C Ravenhorst-Kalovski
Vocabulary Size, Text Coverage and Word Lists by Paul Nation and Robert Waring
At what rate do learners learn and retain new vocabulary from reading a graded reader? by Rob Waring and Misako Takaki
Vocabulary Demands of Television Programs by Stuart Webb and Michael P. H. Rodgers
Effect of Frequency and Idiomaticity on Second Language Reading Comprehension by Ron Martinez

Sunday, January 10, 2016

The average book length

An average English-language literary novel has around 80,000-100,000 words (depending on genre) and 5,000-7,000 sentences. An average 5.5 x 8.5 inch trade book may have approximately 300 words per page, and a 6 x 9 inch book may have 350 words per page. The smaller, cheaper mass-market paperbacks (usually 4.125 x 6.75 inches) pack on average 250 words per printed page, but longer books may pack more than 350 words per printed page.
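
For a quick page-count estimate from these averages, a minimal Python sketch follows. The words-per-page values are the rough figures above, and the helper name is hypothetical; actual page counts vary with typeface, margins and layout.

```python
import math

# Rough words-per-page averages quoted above.
WORDS_PER_PAGE = {
    "mass market (4.125 x 6.75 in)": 250,
    "trade (5.5 x 8.5 in)": 300,
    "trade (6 x 9 in)": 350,
}


def estimated_pages(word_count, words_per_page):
    """Pages needed for word_count words, rounded up to a whole page."""
    return math.ceil(word_count / words_per_page)


for trim_size, wpp in WORDS_PER_PAGE.items():
    print(f"80,000 words, {trim_size}: about {estimated_pages(80_000, wpp)} pages")
```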

The median length for all Amazon books is about 64,000 words. Here are some book-length stats taken from Amazon's text stats a couple of years ago.

Pride and Prejudice
Words: 120,528 (29% have more)
Sentences: 5,878 (35% have more)
Words per Sentence: 20.5 (42% have more)

Effi Briest (German)
Words: 98,455
Sentences: 5,501
Words per Sentence: 17.9

Anna Karenina (translated)
Words: 334,108 (1% have more)
Sentences: 19,345 (2% have more)
Words per Sentence: 17.3

James Clavell's Shogun
Words: 428,978
Sentences: 43,647
Words per Sentence: 9.8

Animal Farm
29,966 words (75% of books have more words)

Ethan Frome
30,191 words (75% of books have more words)

The Crying of Lot 49
46,573 words (64% of books have more words)

47,192 words (64% of books have more words)

Lord of the Flies
62,481 words (51% of books have more words)

Brave New World
64,531 words (50% of books have more words)

The Adventures of Tom Sawyer
70,570 words (45% of books have more words)

Portnoy’s Complaint
78,535 words (41% of books have more words)

112,473 words (21% of books have more words)

Madame Bovary
117,963 words (18% of books have more words)

Mansfield Park
159,344 words (9% of books have more words)

209,117 words (4% of books have more words)

262,869 words (2% of books have more words)

310,593 words (2% of books have more words)

War and Peace
544,406 words (0% of books have more words)

The European Reading Challenge

January 1, 2016 to January 31, 2017

Hosted by Rose City Reader.

I intend to read five books by five different European authors in their native languages. For good measure, I also entered the Classics Challenge 2016.

Thursday, January 7, 2016

Senator Says Maryland's Italian State Motto Is Sexist

Senator Bryan Simonaire (R-Anne Arundel) has filed a bill that would update the nearly 400-year-old phrase “Fatti maschii, parole femine,” which literally translates into “manly deeds, womanly words,” to something more gender-neutral. Senator Simonaire proposes to change the official state motto to the English “Strong deeds, gentle words,” which is also Maryland's current officially cited translation of the motto.

Maryland is the only US state with a motto in Italian. In 1993 a move to change the English translation of Maryland's Italian motto from "Manly Deeds, Womanly Words" to "Strong Deeds, Gentle Words" passed a House committee but never made it to the House floor. In 2001, the official translation was changed to the aforementioned gentler, politically correct version.

The Washington Post chimes in:

How a ‘sexist’ quote from 16th-century pope became Maryland’s state motto

The Maryland motto is sexist in any language (Opinion)

See also No longer manly, state seal uses gentle words (The Baltimore Sun, January 12, 2001).

Sunday, August 24, 2014

Global job search by language skill

Search performed by keyword ("English", "ingles", etc.) as of 8/24/2014 on Indeed's country-specific search engines. Figures are slightly rounded. The generic search term "assistant" was added to gauge the English-language job search.

United States

"assistant" 331000
Spanish 87300
French 8500
Portuguese 7060
Chinese 6100
Japanese 4800
German 4300
Mandarin 4000
Italian 3200
Korean 2100
Russian 1900
Arabic 1600


"assistant" 18900
French 15000
Mandarin/Chinese 1470/540
Spanish 1200
Italian 630
German 500
Japanese 500
Portuguese 350
Russian 340
Arabic 240
Hindi 200

United Kingdom

"assistant" 133630
French 11500
German 10000
Italian 5400
Spanish 5000
Russian 1950
Mandarin 1250
Arabic 1000
Japanese 1500
Portuguese 1150
Hindi 310
Korean 300


"assistant" 3400
German 2120
French 1160
Spanish 500
Italian 450
Arabic 160
Portuguese 140
Russian 120
Polish 120
Japanese 90
Chinese/Mandarin 60/20


anglais (English) 35000
allemand (German) 3500
espagnol (Spanish) 1900
italien (Italian) 1400
russe (Russian) 500
portugais (Portuguese) 500
chinois (Chinese) 450
néerlandais (Dutch) 430
arabe (Arabic) 270
japonais (Japanese) 250
coreen (Korean) 50


Englisch (English) 133000
Französisch (French) 5100
Spanisch (Spanish) 2,100
Italienisch (Italian) 1600
Russisch (Russian) 1400
Niederländisch (Dutch) 840
Polnisch (Polish) 640
Chinesisch (Chinese) 500
Turkisch (Turkish) 470
Japanisch (Japanese) 500
Portugiesisch (Portuguese) 300
Arabisch (Arabic) 300
Schwedisch (Swedish) 250
Koreanisch (Korean) 50
Kroatisch (Croatian) 50


inglese (English) 29300
tedesco (German) 4200
francese (French) 3040
spagnolo (Spanish) 1300
russo (Russian) 1030
cinese (Chinese) 540
arabo (Arabic) 200
giapponese (Japanese) 140
portoghese (Portuguese) 180
croato (Croatian) 40
turco (Turkish) 40
coreano (Korean) 20


angielski (English) 15900
niemiecki (German) 8500
francuski (French) 2100
włoski (Italian) 1200
rosyjski (Russian) 800
hiszpański (Spanish) 600
portugalski (Portuguese) 220
japoński (Japanese) 90
chiński (Chinese) 80
koreański (Korean) 30


английский (English) 61,000
немецкий (German) 7,700
итальянский (Italian) 5900
Японский (Japanese) 4300
Французский (French) 3900
китайский (Chinese) 2500
испанский (Spanish) 1350
корейский (Korean) 850
турецкий (Turkish) 600
польский (Polish) 500
португальский (Portuguese) 70

Netherlands (Dutch/English search)

Engels/English 34400/34400
Duitse/German 6750/1500
Frans/French 2300/1200
Spaans/Spanish 320/400
Italiaanse/Italian 150/220
Russisch/Russian 50/100
Japanse/Japanese 35/100
Arabisch/Arabic 60
Chinees/Chinese 50/120
Turkse/Turkish 40/45
Mandarijn/Mandarin 20/20
Bahasa -

Sweden (Swedish/English search)

engelska/English 7200/2400
tyska/German 470/110
norska/Norwegian 390/70
finska/Finnish 370/440
danska/Danish 320/90
franska/French 210/70
spanska/Spanish 140/30
ryska/Russian 60/10
kinesiska/Chinese 50/20
portugisiska/Portuguese 24/4


inglés 12600
francés 2030
alemán (German) 1220
portugués 320
italiano 300
chino (Chinese) 230
ruso (Russian) 220
arabe 70
japones 50
coreano 10


ingles (English) 6900
francês (French) 3000
espanhol (Spanish) 1400
alemão (German) 610
italiano 460
chinês 80
russo 70
árabe 45
polaco (Polish) 40
japonês 20


İngilizce (English) 10500
alman/ca (German) 100/540
rusça (Russian) 470
arapça (Arabic) 380
fransız/ca (French) 43/210
İtalyan/ca (Italian) 60/80
İspanyol/ca (Spanish)  17/84


ingles 14200
alemán 280
italiano 150
chino 140
francés 134
japones 120
portugués 100
mandarin 14
ruso 10


inglés 7700
chino/mandarin 240/40
francés 160
japones 140
italiano 77
portugués 60
coreano 35
alemán 33


ingles 2100
francés 90
italiano 65
portugués 60
alemán 20
chino/mandarin 3/20


Inglés 7,100
Portugués 630
Chino 100
Italiano 50
Francés 50
Alemán 50
Japones 10
Ruso 10
Arabe 10


inglês 19500
espanhol 3,300
francês 1,700
alemão 1,100
italiano 400
japonês 200
árabe 100
chinês 100
russo 30


"assistant" 16000
English 33400
Hindi 7950
German 600
French 550
Japanese 450
Chinese/Mandarin 340/80
Spanish 300
Italian 150
Arabic 150
Portuguese 100
Korean 100
Russian 90

China (Chinese-language search)

英语 (English) 452,900
日本 (Japanese) 12,600
韩国 (Korean) 7,050
德国 (German) 4100
法国 (French) 2800
俄罗斯 (Russian) 1300
阿拉伯语 (Arabic) 1200
意大利 (Italian) 2,650
西班牙 (Spanish) 960
印地文 (Hindi) 540
葡萄牙 (Portuguese) 255
印尼语 (Bahasa) 200

China (English-language search)

English 452,700
French 4,000
German 4,700
Spanish 3600
Japanese 1,300
Korean 500
Italian 150
Russian 130
Portuguese 36

(The search for "English" is affected by English-language translations of Chinese job posts, but one would assume that the job posts are translated into English for a good reason.)

Japan (Japanese-language search)

英語 (English) 132,568
中国の (Chinese) 29100
韓国人 (Korean) 5400
スペイン語 (Spanish) 1060
フランス語 (French) 920
イタリア語 (Italian) 670
ドイツ語 (German) 460
ポルトガル語の (Portuguese) 360
ロシア語 (Russian) 340
アラビア語 (Arabic) 64
ヒンディー語 (Hindi) 30

South Africa

English 16800
Afrikaans 8200
French 700
German 600
Zulu 500
Portuguese 460
Xhosa 300
Spanish 200
Italian 200
Chinese/Mandarin 90/80
Sotho 90
Arabic 90
Japanese 70
Russian 50
Swahili 17
Korean 10
Hindi 6


"assistant" 17000
English 20650
Mandarin/Chinese 5800/4700
Japanese 2600
Korean 490
German 480
Bahasa 425
French 270
Italian 120
Spanish 110
Russian 35
Portuguese 16


English 15800
Bahasa 6360
Chinese/Mandarin 6200/3640
Japanese 1470
Korean 100
German 90
Hindi 65
French 60
Spanish 26
Italian 14
Russian 13
Portuguese 8


English 10200
Bahasa 3730
Mandarin/Chinese 1120/220
Japanese 700
French 120
Korean 110
Italian 60
German 50
Dutch 30
Spanish 20
Arabic 15
Russian 10
Portuguese 10


English 13000
Japanese 950
Mandarin/Chinese 620/400
Korean 520
Arabic 320
Tagalog 285
Spanish 350
French 340
German 230
Bahasa 160
Italian 150
Portuguese 125
Russian 70


"assistant" 10200
Italian 550
Mandarin/Chinese 500/400
French 440
Japanese 310
German 90
Spanish 80
Korean 70
Arabic 40
Russian 15
Portuguese 15
Bahasa 6

New Zealand

"assistant" 1200
Mandarin/Chinese 60/40
French 50
Japanese 30
German 24
Spanish 14
Italian 10
Hindi 4
Russian 4
Portuguese 1

Wednesday, December 5, 2012

Keeping English in Indonesian Schools

Salim Osman - Straits Times | December 04, 2012

"After weeks of review, Indonesia's Education Ministry eventually succumbed to societal pressure that English lessons be retained in elementary schools.

This about-face should be good news for parents. But it is not unexpected given the national swing towards English as an important foreign language in recent years, which the government has acknowledged.

Deputy Education Minister Musliar Kasim announced in late September that English would be scrapped for lower elementary pupils in the next school year beginning July as part of a curriculum revamp.

It was part of efforts by the ministry to ease the workload of pupils by reducing the number of subjects from ten to six. It would involve the scrapping of English, science and social studies in favor of religion, nationalism, Bahasa Indonesia, mathematics, art and sports.

With English dropped, pupils could concentrate on strengthening their Bahasa Indonesia — the country's national language — imbibing national values and picking up knowledge on science incorporated in other subjects. They would study English as a compulsory subject when they reached lower secondary or high school.

But the decision to leave out English was unpopular from the start not only among parents and language teachers but also several education departments in the regions. They debated the issue for many weeks to persuade the government to retain the language.

Parents wanted their children to have a head start in the language, seen as having higher economic value than Dutch, the language of their colonial masters. They feared their children's English lessons would be disrupted by the new curriculum.

"The scrapping of English is a retrogressive step," the head of West Kalimantan's provincial government education department, Alexius Akim, told Kompas daily.

The decision also had language teachers worried about their future as they were specifically recruited to teach English to primary school pupils.

But in a volte-face last month, Musliar announced that English would not be scrapped after all. "Schools would be allowed to offer the subject but as an elective instead. It should not be made compulsory," he said in a statement to Kompas and the Jakarta Globe.

Unlike previously, when he said that it would be "haram" or illegal to hold English lessons, Musliar made it clear that his ministry would not stop schools from offering the subject to pupils."

Brits "lazy" when it comes to learning foreign languages

By David Howells | 21 Nov 2012
Brits have been dubbed "lazy linguists" after many admitted not learning the language of the country to which they're headed, reports.
A new poll published by foreign exchange provider VIDAFX found that just one in ten British travellers make any effort to learn snippets of the language before heading to another country.

In total, just five per cent said they would learn the translations for simple words such as hello, please, thank you, water and beer. A further five per cent said they would learn more complex words and phrases.

Whilst the results show a disinterest in interacting from British travellers, it also highlights potential risks for those using hire cars when on holiday as their poor grasp of the native language could cause trouble when out and about on the roads.

When quizzed on exactly why they don't take the time to learn the language, many claimed it was simply because English is so widely spoken outside of the UK. This, they said, meant there was "no point" in learning another language. Others, meanwhile, blamed shyness for not learning, with many fearing they'd be embarrassed by incorrect use of words or mispronunciation. "English tourists are renowned the world over for being particularly poor at languages," a spokesperson for VIDAFX told

"While for many holidaymakers there really is no need as such to learn the local language, it was good to report that one in 20 tourists tried their best to communicate with locals - regardless of whether they could've got by without doing so."

Comment: One in 20, huh? Good news indeed.