Why & how to use frequency lists to learn words
Tom Cobb
(cobb.tom at uqam.ca)
Frequency-based wordlists can
help you expand your English vocabulary by telling you
which words you should try to learn. These lists
contain the words that are very common in English, but
that you are unlikely to discover in a random or natural
manner. When learning your L1 (first language), you had lots
of time at your disposal to discover all of the common
words of your language and to learn them without trying.
But in a second language there is simply not enough
time for this to occur. Why? Because many common words
and phrases are nonetheless not all that common, occurring
only a few times per million words of natural text.
How many million words of English are you likely to
read this year? Moreover, several encounters with each
word (probably about ten) are needed for stable learning
to occur.
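To put rough numbers on this, here is a minimal Python sketch. It assumes, purely for illustration, that a word turns up at an even rate throughout whatever you read. The function name and the sample rate of 5 occurrences per million words are assumptions made for the example, not figures from the research cited on this page.

```python
import math

def words_to_read(encounters_needed, occurrences_per_million):
    """Estimate how many running words must be read to meet a word
    the given number of times.

    Assumes (unrealistically) that the word is spread evenly through the text.
    """
    if occurrences_per_million <= 0:
        raise ValueError("occurrences_per_million must be positive")
    return math.ceil(encounters_needed * 1_000_000 / occurrences_per_million)

# Hypothetical word occurring 5 times per million words; ten meetings wanted.
print(words_to_read(10, 5))  # 2000000
```

Under these assumptions, ten meetings with such a word would take about two million words of reading, which is why waiting for natural encounters is slow.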
Why would you want to know all or most of the highest
frequency words of English? For the simple reason that
English, like any other language, has the habit of recycling
a relatively small number of words over and over again,
and if you know these words then your reading power
can be enhanced dramatically for a relatively modest
learning investment.
With a random or 'discovery' approach to lexical growth,
you will learn many words that are rare and relatively
useless for you, yet you will fail to notice the words
that recur often enough to repay the effort of learning.
The word lists presented on this website are the result
of more than 50 years' work and are based on large-scale
computational analysis of English text and speech corpora.
They are intended to deliver the main words of English
to you in a shortened time frame, and deliver along
with them enough contextual and definitional information
to get solid learning underway.
Different words | Percent of word tokens in average text
86,741          | 100
43,831          | 99.0
6,000           | 89.9
5,000           | 88.6
4,000           | 86.7
3,000           | 84
2,000           | 79.7
1,000           | 72.0
10              | 23.7
Table 1
How small a number of words are these 'main words'
that are recycled over and over in English (or any other
language)? Suppose your goal is to read academic English
texts with good comprehension, and to use reading as
a way to expand your vocabulary still further. In that
case, your first goal should be to make sure you know
the 2000 most frequent word families of English (headwords
and their main inflections and derivations), because
these words make up roughly 80% of the individual words
(word tokens) in any English text. This can be seen
in Table 1, which shows data from the Brown corpus
(which can be accessed from this site), as cited in
Nation (1990), p. 17, and Nation (2001), p. 15.
Table 1 shows us that in English just a few word types
account for most of the word tokens in any text. Ten
words account for 23.7% of the ink on any page (repeated
words like "the" and "of"). Just
1000 word families account for more than 70% of the
words or ink, and 2000 account for about 80%. So you
need to find out what these 2000 word families are and
be sure you know them.
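The idea of coverage can be tried out on any text of your own. The following is a minimal Python sketch, not the method used to build Table 1: it counts simple word types (distinct spellings) rather than word families, and the function name and the sample sentence are invented for the example.

```python
import re
from collections import Counter

def coverage_of_top_types(text, n):
    """Return the percentage of word tokens in `text` that belong to
    its n most frequent word types."""
    # Crude tokenizer: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return 100.0 * covered / len(tokens)

# Invented sample sentence: 13 tokens, 8 different word types.
sample = "the cat sat on the mat and the dog sat on the rug"
for n in (1, 2, 8):
    print(n, round(coverage_of_top_types(sample, n), 1))
```

Even in this tiny sample, the single most frequent type ("the") accounts for almost a third of the tokens, which is the same pattern Table 1 shows on a much larger scale.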
You could, of course, wait and meet these words "naturally"
in the normal course of reading the texts that interest
you, but this takes a long time. An alternative is to
meet these words in convenient lists provided on this
website. While it is true that nothing can replace the
experience of meeting new words in rich natural contexts,
some of this experience has been reproduced for you
here by linking the word lists to a computer program
called a "concordance", and from there to the
dictionary Wordnet.
A concordance provides several contexts for each word,
derived from a large collection of texts called a corpus.
Is reading these computer contexts as useful as meeting
words in natural contexts? Probably not, but research
by Cobb (1997) suggests that using computer concordances
can get the learning process off to a good start.
After the first 2000 words
However, Table 1 also presents some bad news about
vocabulary growth. It suggests that after you have learned
the most frequent 2000 words of English, then simply
continuing to accumulate more words on a frequency basis
gives a much lower rate of return. You could learn another
3000 words (up to the 5000 frequency mark) and increase
the amount of black ink coverage in an average text
by only about 9 percentage points (from 79.7% to 88.6%).
The graph below, which is just the table above turned on
its head, dramatizes the drop-off in coverage after 2000
words.
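The drop-off can also be read directly from the numbers. The short Python sketch below takes the coverage figures from Table 1 (10 to 6,000 words) and prints how many percentage points each further step adds; the variable and function names are chosen just for this example.

```python
# (number of words known, percent of tokens covered), copied from Table 1.
TABLE_1 = [
    (10, 23.7), (1000, 72.0), (2000, 79.7), (3000, 84.0),
    (4000, 86.7), (5000, 88.6), (6000, 89.9),
]

def marginal_gains(rows):
    """Return (from_words, to_words, gain) for each consecutive pair of rows,
    where gain is the extra coverage in percentage points."""
    gains = []
    for (low_words, low_cov), (high_words, high_cov) in zip(rows, rows[1:]):
        gains.append((low_words, high_words, round(high_cov - low_cov, 1)))
    return gains

for low_words, high_words, gain in marginal_gains(TABLE_1):
    print(f"{low_words:>5} -> {high_words:>5} words: +{gain} points")
```

Each thousand words after the first 2000 buys less coverage than the thousand before it.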
It is not obvious how to proceed after you have reached
the 2000 mark. However, it seems clear that knowing
2000 words or 80% of the words in an average text is
not sufficient for either comprehension of academic
texts or for further independent vocabulary acquisition
through reading such texts.
Here is what a text looks like to someone who knows
the most frequent 2000 words and no others. Words that
are not on the 2000 list have been replaced by gaps:
If _____ planting rates are _____ with planting _____
satisfied in each _____ and the forests milled at
the earliest opportunity, the _____ wood supplies
could further increase to about 36 million _____ meters
_____ in the period 2001-2015. (Nation, 1990, p. 242.)
(Text A: 80% of words known)
Text A has 40 words, seven of which are unknown, or
(7/40 =) 17.5%. It seems clear that someone reading this
text would get some idea of the topic, but not exactly
what was being said about the topic.
Here is the same text with 95% of its words known,
or 5% unknown:
If current planting rates are maintained with planting
targets satisfied in each _____ and the forests milled
at the earliest opportunity, the available wood supplies
could further _____ to about 36 million cubic meters
annually in the period 2001-2015.
(Text B: 95% of words known)
In Text B, the main idea of the text is reasonably
clear. And the concepts needed to fill the two remaining
gaps are also clear, so that if these had been new words
instead of gaps there is a good chance the words would
have been understood through inference.
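The share of unknown words in Texts A and B can be checked with a few lines of Python. This is only a sketch: it assumes gaps are written as runs of underscores, and it splits "2001-2015" into two tokens so that the total comes to the 40 words counted above. The function and variable names are made up for the example.

```python
import re

GAP = re.compile(r"_{2,}")                        # a gap is two or more underscores
TOKEN = re.compile(r"_{2,}|[A-Za-z0-9']+")        # a gap, or a run of letters/digits

def percent_unknown(gapped_text):
    """Return the percentage of tokens in the text that are gaps."""
    tokens = TOKEN.findall(gapped_text)
    if not tokens:
        return 0.0
    gaps = sum(1 for token in tokens if GAP.fullmatch(token))
    return 100.0 * gaps / len(tokens)

text_a = (
    "If _____ planting rates are _____ with planting _____ "
    "satisfied in each _____ and the forests milled at "
    "the earliest opportunity, the _____ wood supplies "
    "could further increase to about 36 million _____ meters "
    "_____ in the period 2001-2015."
)
text_b = (
    "If current planting rates are maintained with planting "
    "targets satisfied in each _____ and the forests milled "
    "at the earliest opportunity, the available wood supplies "
    "could further _____ to about 36 million cubic meters "
    "annually in the period 2001-2015."
)

print(percent_unknown(text_a), percent_unknown(text_b))
```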
In fact, research has shown that reading in a second
language is reliably successful, and supports further
vocabulary acquisition, when 95% of the individual words
(word tokens) in a text are known. With fewer than that,
the reader does not have enough to go on (Laufer 1989;
1992; Hirsh and Nation, 1992).
Two strategies for the journey from 80% to 95%
After learning the most frequent 2000 words, you can
adopt one of two strategies for further vocabulary acquisition.
Strategy 1 is simply to carry on up the slope
of Figure 1, past the 2000 hump, learning words at the
3000, 4000, 5000 zones and far beyond. Any learner is
bound to adopt this strategy to some extent -- thinking
about and looking up interesting new words encountered
randomly in newspapers, books, or movies.
However, there are some problems with this strategy.
First, the learning task is enormous. The learner gets
close to the 90% mark (88.6% in Table 1) only at 5000
words, that is, after another 3000 new words have been
learned. Second, as well as being numerous, these words
are difficult to learn because
they are relatively infrequent and are not encountered
over and over again. Third, while the first 2000 words
have been identified and made somewhat easy to learn
(e.g., on this website), useful frequency lists at the
3000, 4000, 5000 zones and beyond are not available
at present.
Strategy 2 is to take advantage of research
that has been done to target the vocabulary needed for
the different purposes a learner might have for reading.
Nation and his colleagues in New Zealand have analysed
academic texts and determined that across domains there
are certain words that, while not necessarily frequent
in the language at large, are very frequent in academic
texts. These are normally Greco-Latin terms like "probability,"
"conclusion," and "hypothesis." There are approximately
570 of these words, and they have been brought together
as the Academic Word List (AWL). This list appears on
this website and is included in the diagnostic tests.
The good news is that the 2000 list and the AWL together,
a combined list of 2570 words, can bring the coverage
of an academic text up to approximately 90%. In other
words, if you know the first 2000 plus 570 AWL words,
then you know about 90% of the words you will meet in
any academic text. For support for these claims, see the
examples of computer text analysis with VocabProfile on
this site. You will see that there is a reasonably reliable
profile of texts by frequency zone, with the AWL words
claiming a larger share as the texts you select become
more "academic." For the rest of the journey (90% to 95%),
for the moment you are pretty much on your own. But you
have an adequate base for inferences and look-ups.
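A profile of a text by frequency zone can be sketched in Python as follows. This is not the VocabProfile program itself, only an illustration of the idea: the word sets are tiny toy stand-ins for the real 2000 list and the AWL, and the matching is on exact word forms, whereas the real lists are organized as word families.

```python
import re

def frequency_profile(text, bands):
    """Return the percentage of tokens falling in each named band.

    `bands` is an ordered list of (name, set_of_words); a token is credited
    to the first band that contains it, otherwise to "off-list".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {name: 0 for name, _ in bands}
    counts["off-list"] = 0
    for token in tokens:
        for name, words in bands:
            if token in words:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    total = len(tokens)
    if total == 0:
        return counts
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}

# Toy stand-ins for the real lists (assumptions for illustration only).
toy_bands = [
    ("first 2000", {"the", "of", "supports"}),
    ("AWL", {"hypothesis", "conclusion"}),
]
print(frequency_profile("The hypothesis supports the conclusion of the ecologist.", toy_bands))
```

With the real lists in place of the toy sets, the three percentages would give the kind of profile described above.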
As for how to use these lists to learn words, if you go
to the ListLearn page you will find all these lists
connected by simple
mouseclicks to both a speech engine and a large corpus
of natural English. In other words, each word can be
heard and met in a diverse set of contexts. Such an
encounter should get the learning process well under
way, and then of course you will more readily recognize
these words when you meet them again and further learning
will occur.
In conclusion, students learning English anywhere on
the planet can use this website to test themselves on
how well they know the 2000 and AWL lists,
fill any significant gaps they find, and make their
way toward basic lexical competence.
(Tip: You could learn words by frequency in WordHacker!)
References
Cobb, T. (1997). From concord to lexicon: Development and
test of a corpus-based lexical tutor. Montreal: Concordia
University, PhD dissertation.
Cognitive Science Lab, Princeton University. Wordnet:
A lexical database for English.
Hirsh, D. & Nation, P. (1992). What vocabulary
size is needed to read unsimplified texts for pleasure?
Reading in a Foreign Language, 8 (2), 689-696.
Laufer, B. (1992). How much lexis is necessary for
reading comprehension? In P.J. Arnaud & H. Béjoint
(Eds.), Vocabulary and applied linguistics (pp.
126-132). London: Macmillan.
Nation, P. (1990). Teaching and learning vocabulary.
New York: Newbury House.
Nation, P. (2001). Learning vocabulary in another
language. New York: Cambridge University Press.