Thursday, October 7, 2010

word frequency: TV vs books

And why you need both (and more). Below is a frequency list of words occurring in a collection of movies and TV series vs. a collection of texts (a possible sample of extensive reading). More on this and many other related things here.

SUBTLEXUS vs Brown (TV vs books)

Word frequency per million words of running text:

This 7,979 vs 5,146
Now 3,202 vs 1,314
Was 5,654 vs 9,815
In 9,773 vs 21,345
Out 3,865 vs 2,096
Me 9,242 vs 1,183
My 6,763 vs 1,319
Mine 251 vs 59
Can 5,247 vs 1,772
Will 2,124 vs 2,244
Would 1,768 vs 2,715
There 4,348 vs 2,725
But 4,418 vs 4,381
By 1,340 vs 5,307
He 7,637 vs 9,542
Him 3,484 vs 2,619
So 4,244 vs 1,985
Go 3,793 vs 626
Goes 217 vs 89
Going 2,123 vs 399
Went 411 vs 507
Gone 297 vs 195
Like 3,999 vs 1,290
Likes 76 vs 20
Liked 79 vs 58
How 3,056 vs 836
If 3,541 vs 2,199
Just 4,749 vs 872
Get 4,583 vs 749
Gets 223 vs 66
Got 3,306 vs 482
Had 1,676 vs 5,131
Come 3,141 vs 630
Comes 229 vs 137
Came 464 vs 622
Coming 527 vs 174
They 4,102 vs 3,619
See 2,557 vs 772
Saw 403 vs 352
Seen 385 vs 279
Time 1,959 vs 1,601
Let 2,419 vs 384
Did 2,341 vs 1,044
From 2,039 vs 4,370
Want 2,759 vs 329
Wants 307 vs 71
Wanted 502 vs 226
Think 2,691 vs 433
Thinks 103 vs 23
Thought 809 vs 516
Thinking 281 vs 145
Take 1,891 vs 611
Took 342 vs 426
Taken 281 vs 139
Look 1,947 vs 399
Looks 311 vs 78
Looked 121 vs 361
Some 1,727 vs 1,617
Then 1,490 vs 1,377
Why 2,248 vs 404
Where 1,830 vs 938
Too 1,372 vs 833
More 1,299 vs 2,216
Down 1,490 vs 895
Yes 1,997 vs 144
Tell 1,724 vs 268
Little 1,446 vs 831
Thing 1,088 vs 333
Mean 1,244 vs 199
Said 1,109 vs 1,961
Sure 1,100 vs 264
First 840 vs 1,361
Put 829 vs 437
Please 1,101 vs 62
Mexico 31 vs 19
Wildlife 2 vs 19
Victims 23 vs 19
Father 555 vs 183
Mother 480 vs 216
English 74 vs 195
Hasn't 91 vs 20
Tuesday 24 vs 59
January 7 vs 53
Halloween 13 vs n/a
Economical 0.33 vs 22
Arrested 35 vs 19
Run 350 vs 217
Court 101 vs 230
Office 204 vs 255
Planet 39 vs 21
Planets 4 vs 22
Political 22 vs 258
Theoretical 2 vs 21
Sixty 5 vs 21
Troops 19 vs 53
College 85 vs 267
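
For anyone who wants to play with the numbers, here is a minimal Python sketch of what "per million words" means and how the two columns can be compared. The function names are my own, made up for illustration, and the four rows are simply copied from the list above; nothing here comes from the corpora's own tools.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million running words."""
    return raw_count * 1_000_000 / corpus_size


def tv_to_books_ratio(tv_fpm: float, books_fpm: float) -> float:
    """How many times more frequent a word is on TV than in books."""
    return tv_fpm / books_fpm


# A few rows copied from the list above: word -> (TV, books), per million words.
rows = {
    "me": (9242, 1183),
    "please": (1101, 62),
    "by": (1340, 5307),
    "political": (22, 258),
}

for word, (tv, books) in rows.items():
    ratio = tv_to_books_ratio(tv, books)
    side = "TV" if ratio > 1 else "books"
    print(f"{word:10s} TV/books ratio = {ratio:.1f} (more common in {side})")
```

A ratio well above 1 marks conversational words ("me", "please"); a ratio well below 1 marks words that mostly live in written text ("by", "political").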

One million words is about 170 hours of audio or a dozen average literary novels. A sample of TV documentaries would have a much higher percentage of popular science words that are rare in TV series, words like "theoretical" and "wildlife". "Focused reading" (gotta love the old man) of teen novels would have a higher percentage of useful general words and, if one is reading a series of novels, some more specialized but still general vocabulary (e.g. horses, magic). Read a few newspapers and you can get exposure to words like "troops" (unfortunately) and "court", along with examples of their usage. Magazines may expose the reader to some very useful everyday vocabulary or to a very specialized one. Of course newspapers are written in a different manner from magazines, and the two differ from fiction. Good textbooks should include samples from all of these, but textbooks alone are insufficient.
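
To put the "170 hours per million words" figure to work, here is a rough back-of-the-envelope sketch in Python. It assumes, unrealistically, that a word is spread evenly through the material, and the function names are hypothetical ones I picked for the example. The only inputs are the 170-hour figure and the per-million frequencies from the list.

```python
HOURS_PER_MILLION_WORDS = 170  # figure quoted in the paragraph above


def words_per_minute(hours_per_million: float = HOURS_PER_MILLION_WORDS) -> float:
    """Average speech rate implied by 'one million words in N hours'."""
    return 1_000_000 / (hours_per_million * 60)


def hours_for_encounters(freq_per_million: float, encounters: int,
                         hours_per_million: float = HOURS_PER_MILLION_WORDS) -> float:
    """Hours of viewing needed to expect `encounters` occurrences of a word,
    assuming the word is evenly distributed (a simplifying assumption)."""
    millions_of_words_needed = encounters / freq_per_million
    return millions_of_words_needed * hours_per_million


print(f"Implied speech rate: {words_per_minute():.0f} words per minute")

# "Wildlife" occurs about 2 times per million words in the TV sample.
print(f"Hours of TV to meet 'wildlife' 10 times: {hours_for_encounters(2, 10):.0f}")

# "Please" occurs about 1,101 times per million words in the TV sample.
print(f"Hours of TV to meet 'please' 10 times: {hours_for_encounters(1101, 10):.1f}")
```

Under these assumptions a rare word like "wildlife" takes hundreds of hours of TV series to meet ten times, which is the whole argument for mixing sources.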

SUBTLEXUS was compiled by Brysbaert & New on the basis of American subtitles. It is a corpus of 8,388 films and television episodes with a total of 51 million running words (16.1M from television series, 14.3M from films before 1990, and 20.6M from films after 1990).

The Brown University Standard Corpus of Present-Day American English (Brown Corpus) contains 1,014,312 words sampled from many categories: press (politics, sports, culture, financial, theatre and book reviews), religious texts, skills and hobbies, biographies, memoirs, government documents, natural science, medicine, math, humanities, technology, mystery and detective fiction, adventure and western, romance and love story, humor. The Brown Corpus is made up of 500 texts of about 2,000 words each. The first American Heritage Dictionary (1969) was based on the Brown Corpus.
