Monday, April 27, 2009

corpora comparison by frequency

The Brown University Standard Corpus of Present-Day American English (Brown Corpus) was compiled by Henry Kučera and W. Nelson Francis at Brown University as a general corpus of American English texts published in 1961. The corpus contains 1,014,312 words sampled from 15 text categories, including: press (politics, sports, culture, financial; 44 texts); editorial (letters to the editor etc.); theatre and book reviews; religious texts; skills and hobbies; “popular lore” (48 texts); biography and memoirs (75 texts); government documents (30 texts); learned (natural science, medicine, math, humanities, technology; 80 texts); general fiction (29 texts); mystery and detective fiction (24 texts); adventure and western (29 texts); romance and love story (29 texts); humor (9 texts). The Brown Corpus is made up of 500 texts of about 2,000 words each. The first American Heritage Dictionary (1969) was based on the Brown Corpus; it was the first dictionary to be compiled using evidence gleaned from corpus linguistics.

"The" accounts for nearly 7% of all the running words in the Brown Corpus. About half of the total vocabulary of about 50,000 distinct words occur only once in the corpus.
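The two figures above (the share of the most frequent word, and the share of the vocabulary that occurs only once) are easy to compute for any tokenised text. The following is a minimal sketch, not the method used for the Brown Corpus itself; the function names and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def top_word_share(tokens):
    """Return the most frequent token and its share of all running words."""
    counts = Counter(tokens)
    word, n = counts.most_common(1)[0]
    return word, n / len(tokens)

def hapax_share(tokens):
    """Return the share of distinct words (types) that occur exactly once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Toy data (hypothetical); a real corpus would be tokenised and lower-cased first.
sample = "the cat sat on the mat and the dog sat too".split()

print(top_word_share(sample))  # most frequent word and its share of tokens
print(hapax_share(sample))     # share of types that occur only once
```

Note that the first figure is a share of tokens (running words), while the second is a share of types (distinct words), which is why the two percentages are not directly comparable.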

NCFWD - a corpus of nineteenth-century fiction written between 1830 and 1870 (approximately 2.2 million words)

The Dickens Corpus – some 4.6 million running words

NCFWD and Dickens corpus data taken from: Investigating Dickens’ Style by Masahiro Hori.

SUBTLEXUS was compiled by Brysbaert & New on the basis of American subtitles: a corpus of 8,388 films and television episodes with a total of 51 million running words (16.1M from television series, 14.3M from films before 1990, and 20.6M from films after 1990).
USA films from 1900-1990 (2046 files)
USA films from 1990-2007 (3218 files)
USA television series (4575 files)

There are 4,554 examples of "gentleman" in the Dickens Corpus (4.6 million words), 825 in the NCFWD (2.2 million words), 2,777 in the entire Cobuild (200,000,000 words) and 2,135 in SUBTLEXUS. Per million words:

Dickens: 968
NCFWD: 375
Cobuild: 13.9
SUBTLEXUS: 42
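The per-million figures are raw counts divided by corpus size and multiplied by one million. Here is a minimal sketch of that normalisation using the counts and the rounded corpus sizes quoted above; the function and variable names are assumptions. With the rounded size of 4.6 million words the Dickens rate comes out at about 990 rather than 968, so the 968 figure was presumably calculated from a more exact (slightly larger) word count; that is a guess, not something the source states.

```python
def per_million(count, corpus_size_words):
    """Normalise a raw count to occurrences per million running words."""
    return count / corpus_size_words * 1_000_000

# (raw count of "gentleman", corpus size in words) as quoted in the text.
# Corpus sizes are the rounded figures given above, not exact token counts.
gentleman = {
    "Dickens": (4_554, 4_600_000),
    "NCFWD": (825, 2_200_000),
    "Cobuild": (2_777, 200_000_000),
    "SUBTLEXUS": (2_135, 51_000_000),
}

rates = {name: per_million(c, n) for name, (c, n) in gentleman.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per million")
```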

Dickens's Oliver Twist: 332
Thackeray's Vanity Fair: 269
Jane Austen’s Emma: 36
Brontë sisters:
Jane Eyre: 39
Wuthering Heights: 13
Agnes Grey: 22

19th>20th century sign o’ the times

DICKENS, NCFWD, BROWN and SUBTLEXUS compared (Frequency per million words)
Brown: frequency rank number in parentheses

Man
Dickens: 2037
NCFWD: 1587
Brown: 1210 (no. 81)
SUBTLEXUS: 1099

Old
Dickens: 1973
NCFWD: 1335
Brown: 660 (no. 140)
SUBTLEXUS: 609

Hand
Dickens: 1289
NCFWD: 871
Brown: 431
SUBTLEXUS: 280

Head
Dickens: 1212
NCFWD: 616
Brown: 404 (no. 201)
SUBTLEXUS: 371

Face
Dickens: 1075
NCFWD: 765
Brown: 371 (no. 245)
SUBTLEXUS: 289

Eyes
Dickens: 985
NCFWD: 816
Brown: 401 (no. 214)
SUBTLEXUS: 221

Dear
Dickens: 1284
NCFWD: 790
Brown: 54 (no. 2040)
SUBTLEXUS: 223

Life
Dickens: 711
NCFWD: 854
Brown: 715 (no. 127)
SUBTLEXUS: 797

Room
Dickens: 954
NCFWD: 981
Brown: 384 (no. 232)
SUBTLEXUS: 440

Lady
Dickens: 834
NCFWD: 1284
Brown: 80 (no. 1328)
SUBTLEXUS: 217

Another
Dickens: 829
NCFWD: 566
Brown: 684 (no. 133)
SUBTLEXUS: 509

Night
Dickens: 1079
NCFWD: 649
Brown: 411 (no. 209)
SUBTLEXUS: 866

Door
Dickens: 986
NCFWD: 614
Brown: 312 (no. 295)
SUBTLEXUS: 292

Boy
Dickens: 563
NCFWD: 333
Brown: 242 (no. 384)
SUBTLEXUS: 530

Manner
Dickens: 547
NCFWD: 285
Brown: 124 (no. 831)
SUBTLEXUS: 12

Child
Dickens: 538
NCFWD: 338
Brown: 213 (no. 435)
SUBTLEXUS: 158

Seemed
Dickens: 535
NCFWD: 569
Brown: 332 (no. 274)
SUBTLEXUS: 54

Yet
Dickens: 590
NCFWD: 864
Brown: 419 (no. 202)
SUBTLEXUS: 342

Let
Dickens: 656
NCFWD: 726
Brown: 384 (no. 231)
SUBTLEXUS: 2419

Done
Dickens: 656
NCFWD: 597
Brown: 320 (no. 283)
SUBTLEXUS: 485

Half
Dickens: 618
NCFWD: 580
Brown: 275 (no. 337)
SUBTLEXUS: 199

People
Dickens: 592
NCFWD: 668
Brown: 847 (no. 106)
SUBTLEXUS: 1103

Love
Dickens: 420
NCFWD: 775
Brown: 232 (no. 397)
SUBTLEXUS: 1115

Only
Dickens: 978
NCFWD: 1502
Brown: 1747 (no. 62)
SUBTLEXUS: 1084

Returned
Dickens: 846
NCFWD 264
Brown: 115 (return: 180)
SUBTLEXUS: 25 (return: 92)

Replied
Dickens: 823
NCFWD: 299
Brown: 57 (reply: 42)
SUBTLEXUS: 1 (reply: 5)

Slowly
Dickens: 178
NCFWD: 117
Brown: 115 (no. 900) Slow: 60 (no. 1817)
SUBTLEXUS: 25 Slow: 76

Softly
Dickens: 101
NCFWD: 36
Brown: 31 (no. 3425) Soft: 62
SUBTLEXUS: 5 Soft: 1126

Easily
Dickens: 100
NCFWD: 79
Brown: 106 (no. 981) Easy: 125
SUBTLEXUS: 23 Easy: 266

Gradually
Dickens: 94
NCFWD: 49
Brown: 51 (no. 2125)

Quickly
Dickens: 92
NCFWD: 70
Brown: 89 (no.1169) Quick: 68
SUBTLEXUS: 57 Quick: 109

Hastily
Dickens: 87
NCFWD: 45
Brown: n/a, not in the top 5,000 (less than 19)
SUBTLEXUS: 1 (haste: 2)

Gently
Dickens: 83
NCFWD: 59
Brown: 31 (no.3441) Gentle: 27
SUBTLEXUS: 9 Gentle: 17

Quietly
Dickens: 78
NCFWD: 85
Brown: 48 (no.2250) Quiet: 76
SUBTLEXUS: 12 Quiet: 117

Carefully
Dickens: 65
NCFWD: 56
Brown: 87 (no.1213) Careful: 62 care: 162
SUBTLEXUS: 24 Careful: 109 Care: 485

Heartily
Dickens: 54
NCFWD: 26
Brown: not in the top 5,000
SUBTLEXUS: 1

Steadily
Dickens: 47
NCFWD: 19
Brown: 22 (no. 4499) Steady: 41
SUBTLEXUS: 1 Steady: 23

Frequently
Dickens: 42
NCFWD: 52
Brown: 91 (no.1146) Frequent: 34
SUBTLEXUS: 3 Frequent: 2

Thoughtfully
Dickens: 39
NCFWD: 5
Brown: not in the top 5,000; neither is thoughtful (less than 19)
SUBTLEXUS: 1 Thoughtful: 8

Eagerly
Dickens: 37
NCFWD: 49
Brown: not in the top 5,000 Eager: 27 (no. 3772)
SUBTLEXUS: 1 Eager: 7

Freely
Dickens: 35
NCFWD: 24
Brown: 22 (no 4476) Free: 260 (no.358)
SUBTLEXUS: 4 Free: 178

Happily
Dickens: 32
NCFWD: 27
Brown: 20 (no. 4836) Happy: 98 (no. 1069)
SUBTLEXUS: 10 Happy: 333

Cheerfully
Dickens: 32
NCFWD: 18
Brown: not in the top 5,000; neither is cheerful
SUBTLEXUS: 1 Cheerful: 4

Sharply
Dickens: 31
NCFWD: 25
Brown: 38 (no.2827) Sharp: 72
SUBTLEXUS: 1 Sharp: 24

Silently
Dickens: 30
NCFWD: 30
Brown: not in the top 5,000 Silent: 49 (no. 2229)

Seriously
Dickens: 27
NCFWD: 45
Brown: 46 (no.2368) Serious: 116 (no.883)

Angrily
Dickens: 26
NCFWD: 12
Brown: not in the top 5,000. Angry: 45 (no.2430)
SUBTLEXUS: 0.4 Angry: 59

Sternly
Dickens: 26
NCFWD: 12
Brown: not in the top 5,000 Stern: 23 (no. 4295)
SUBTLEXUS: 0.1 Stern: 6

Timidly
Dickens: 26
NCFWD: 19
Brown: not in the top 5,000 (neither is “timid”)
SUBTLEXUS: 0.1 Timid: 2


SUBTLEXUS vs Brown (frequency per million words)

This 7,979 vs 5,146
Now 3202 vs 1314
Be 5746 vs 6376
Was 5654 vs 9815
Been 1737 vs 2473
In 9,773 vs 21,345
Out 3865 vs 2096
Me 9,242 vs 1183
My 6763 vs 1319
Mine 251 vs 59
Can 5,247 vs 1,772
Could 1629 vs 1599
Should 1062 vs 888
Will 2124 vs 2244
Would 1768 vs 2715
There 4348 vs 2725
But 4,418 vs 4381
By 1340 vs 5307
He 7,637 vs 9,542
Him 3484 vs 2619
So 4244 vs 1985
Go 3793 vs 626
Goes 217 vs 89
Going 2123 vs 399
Went 411 vs 507
Gone 297 vs 195
Like 3,999 vs 1290
Likes 76 vs 20
Liked 79 vs 58
How 3056 vs 836
If 3541 vs 2199
Just 4,749 vs 872
Get 4583 vs 749
Gets 223 vs 66
Got 3306 vs 482
Gotten 54 vs n/a (less than 19)
Had 1676 vs 5,131
Come 3141 vs 630
Comes 229 vs 137
Came 464 vs 622
Coming 527 vs 174
They 4102 vs 3619
See 2557 vs 772
Saw 403 vs 352
Seen 385 vs 279
Time 1959 vs 1601
Let 2419 vs 384
Did 2341 vs 1044
From 2039 vs 4370
Want 2759 vs 329
Wants 307 vs 71
Wanted 502 vs 226
Think 2691 vs 433
Thinks 103 vs 23
Thought 809 vs 516
Thinking 281 vs 145
Take 1891 vs 611
Took 342 vs 426
Taken 281 vs 139
Look 1947 vs 399
Looks 311 vs 78
Looked 121 vs 361
Some 1727 vs 1617
Then 1490 vs 1377
Why 2248 vs 404
Where 1830 vs 938
Too 1372 vs 833
More 1299 vs 2216
Down 1490 vs 895
Yes 1997 vs 144
Tell 1724 vs less than 19
Little 1446 vs 831
Thing 1088 vs 333
Mean 1244 vs 199
Said 1109 vs 1961
Sure 1100 vs 264
First 840 vs 1361
Put 829 vs 437
Please 1101 vs 62
Mexico 31 vs 19
Wildlife 2 vs 19
Victims 23 vs 19
Father 555 vs 183
Mother 480 vs 216
English 74 vs 195
Hasn't 91 vs 20
Tuesday 24 vs 59
January 7 vs 53
Halloween 13 vs n/a
Keith 0 vs 21
Economical 0.33 vs 22
Arrested 35 vs 19
Run 350 vs 217
Court 101 vs 230
Office 204 vs 255
Planet 39 vs 21
Planets 4 vs 22
Political 22 vs 258
Theoretical 2 vs 21
Sixty 5 vs 21
Troops 19.3 vs 53
College 85 vs 267
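
One way to read a list like this is to turn each pair into a ratio and sort by it, so that the words most typical of subtitles (spoken dialogue) end up at one end and the words most typical of Brown (written prose) at the other. The sketch below does this for a handful of pairs copied from the list; the choice of a base-2 log ratio and all names are assumptions for illustration, not something the source uses.

```python
import math

# (SUBTLEXUS, Brown) frequencies per million words, copied from the list above.
pairs = {
    "please": (1101, 62),
    "yes": (1997, 144),
    "but": (4418, 4381),
    "by": (1340, 5307),
    "political": (22, 258),
}

def log2_ratio(subtlex, brown):
    """Positive: more frequent in SUBTLEXUS; negative: more frequent in Brown."""
    return math.log2(subtlex / brown)

# Sort from most subtitle-typical to most Brown-typical.
ranked = sorted(pairs, key=lambda w: log2_ratio(*pairs[w]), reverse=True)

for word in ranked:
    s, b = pairs[word]
    print(f"{word:10s} {s:5d} vs {b:5d}  log2 ratio {log2_ratio(s, b):+.2f}")
```

A log ratio is used here only because it treats "ten times more" and "ten times less" symmetrically; words with a zero or "less than 19" entry would need separate handling.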
