Estimating Character Knowledge
A few months ago I started a project to learn 3000 Chinese characters, but it struck me that I only had a vague notion of how many I already knew, so I decided to estimate that number before I started.
I downloaded a character frequency list[1], sorted it by frequency and tested myself on ten randomly[2] selected characters from each interval of 100 characters, i.e. 10 from characters 1–100, 10 from characters 101–200 and so on.
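The sampling step can be sketched in a few lines of Python. This is only an illustration: the function name `sample_per_interval` and the placeholder list are made up, and a real run would load the frequency-sorted characters from the downloaded file instead.

```python
import random


def sample_per_interval(chars, interval=100, k=10, seed=None):
    """Pick k items at random from each consecutive block of `interval` items.

    `chars` is assumed to be sorted by frequency already, most common first.
    """
    rng = random.Random(seed)
    samples = []
    for start in range(0, len(chars), interval):
        block = chars[start:start + interval]
        # The last block may be shorter than `interval`.
        samples.append(rng.sample(block, min(k, len(block))))
    return samples


# Placeholder data: ranks 1..3500 stand in for the actual characters.
ranks = list(range(1, 3501))
test_sets = sample_per_interval(ranks, seed=1)
print(len(test_sets))  # number of intervals to test
```

Fixing the seed makes the sample reproducible, which helps if the test is to be repeated later with a different sample.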
To make things less straightforward I also rated how well I knew each character on a scale from 0 to 3, where 3 means I really know the character; if, for instance, I know what it means but forget the tone, it might get a 2, and so on. For each interval I added up the character scores, normalized the sum to a scale from 0 to 10 and plotted it against the number of the interval in the following diagram.
Calculating the area under the line (adjusting for the way the axes are graded) gives 1734, which seems reasonable. In fact, when people ask me how many characters I know I usually say about 2000, because I can read a paper with some effort[3]. In a way the estimate is rather conservative, since it gives about equal weight to meaning and pronunciation, and to read you only have to know the meaning.
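As a rough sketch of the arithmetic, assume each interval contributes its normalized score as a fraction of the 100 characters it covers. That is a simple rectangle sum, which may differ slightly from the area under the plotted line. The function names and the example scores below are invented for illustration and are not the actual test data.

```python
def normalized_score(scores, max_score=3):
    """Scale the summed 0..max_score ratings of one interval to 0..10."""
    return 10 * sum(scores) / (max_score * len(scores))


def estimate_known(interval_scores, interval=100):
    """Estimate the number of known characters from per-interval ratings.

    Each interval's 0..10 score is read as the known fraction of that
    interval and multiplied by the interval size.
    """
    return sum(normalized_score(s) / 10 * interval for s in interval_scores)


# Made-up ratings for three intervals of ten tested characters each.
example = [
    [3] * 10,                        # everything fully known
    [3, 3, 3, 3, 3, 2, 2, 2, 2, 2],  # half known except for the tone
    [0] * 10,                        # nothing known
]
print(round(estimate_known(example)))  # 183
```

Scoring only the meaning (for example, counting every rating of 2 or more as known) would give a higher figure, which is why the weighted estimate is on the conservative side.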
The last character I know in the list is 涮 shuan (rinse) at place 3458. I also know the banker’s numerals, which are even less common, but I decided they don’t count. It would be an interesting exercise to find out how long the tail is, but the result would depend heavily on the frequency list used[1].
My first plan was to redo the test with a different sample when I’d been studying for a while, but that might not be necessary; my spaced repetition software (Skritter) gives me up-to-date, and quite addictive, statistics of my performance.
[1] Unfortunately it isn’t the same list that I’m learning from (word frequency lists are notoriously dependent on the underlying corpus once you get past the 1000 or so most common words).
[2] OK, OK, I simply decided to test the first 10 characters in each 100-character interval.
[3] The conventional wisdom among Chinese learners is that you need about 3000 characters to read a paper comfortably. It’s probably true, though…