A recent post on Patrick Vennebush's blog Math Jokes for 4 Mathy Folks asserted that the rule of thumb “I before E except after C” was “total bullshit.” This got me thinking: the “I before E except after C” rule (let’s just call it the IEC rule) was almost certainly developed without any research at all, based only on the subjective experience of educators. It’s not a horrible rule, but surely we can use data to construct better rules of this kind (for lack of a better term, I’ll be referring to these as “trigram spelling rules”). We have the technology, and I have nothing better to do with my night off :)
The long version
You can find my full methodology and analysis in the following IPython notebook:
http://nbviewer.ipython.org/gist/dmarx/b6a095d2b161eccb18a3
The short version
I used a larger word list than Patrick (233,621 words), but my analysis still corroborated his. I observed the following bigram and trigram frequencies for the IEC rule:
Bigram | Count | Trigram | Count |
---|---|---|---|
ie: | 3950 | cie: | 256 |
ei: | 2607 | cei: | 156 |
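As a rough sketch of how counts like these can be produced, the snippet below tallies every two- and three-letter substring across a list of words. The helper name `ngram_counts` and the six-word `vocab` are illustrative assumptions, not the word list or the code from the notebook, and it counts every occurrence of a substring rather than the number of words containing it.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every length-n substring across an iterable of words."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts

# Tiny stand-in vocabulary (an assumption; the post used 233,621 words).
vocab = ["science", "receive", "believe", "ceiling", "species", "weird"]

bigrams = ngram_counts(vocab, 2)
trigrams = ngram_counts(vocab, 3)

for gram, table in [("ie", bigrams), ("ei", bigrams),
                    ("cie", trigrams), ("cei", trigrams)]:
    print(gram, table[gram])
```

Swapping `vocab` for a real word list should give counts comparable to the table above, though the exact numbers depend on the list and on how it is cleaned.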
I thought that although the IEC rule doesn’t work when we look at the unique words in our vocabulary, it might still hold if we weight bigrams and trigrams by how often the words containing them appear in written text. Here are the frequencies for the IEC rule in the Brown corpus:
Bigram | Count | Trigram | Count |
---|---|---|---|
ie: | 13275 | cie: | 1310 |
ei: | 5677 | cei: | 485 |
Nope, still no good.
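The difference between the two analyses is only in the weighting: each word contributes once in the vocabulary counts, but once per occurrence in the usage counts. Here is a minimal sketch of the usage-weighted version, assuming a plain list of tokens. The function name `weighted_ngram_counts` and the sample sentence are made up for illustration; with NLTK installed and the corpus downloaded, the tokens could instead come from `nltk.corpus.brown.words()`.

```python
from collections import Counter

def weighted_ngram_counts(tokens, n):
    """Count length-n substrings, weighting each word by its frequency."""
    # Word frequencies over alphabetic tokens only, case-folded.
    word_freq = Counter(t.lower() for t in tokens if t.isalpha())
    counts = Counter()
    for word, freq in word_freq.items():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += freq
    return counts

# Hypothetical sample text standing in for a real corpus.
tokens = ("Their friend received a piece of the ancient ceiling "
          "and their science kit").split()

usage_bigrams = weighted_ngram_counts(tokens, 2)
usage_trigrams = weighted_ngram_counts(tokens, 3)
print(usage_bigrams["ie"], usage_bigrams["ei"])
print(usage_trigrams["cie"], usage_trigrams["cei"])
```

Note that "their" appears twice in the sample, so its "ei" is counted twice here but would be counted once in a vocabulary-based analysis.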
Instead of the IEC rule, here are some alternatives (taken from my vocabulary analysis, not the word usage analysis). For each rule of the form “A before B except after C” below, the bigram frequency proportion
\[ \frac{\text{count}(AB)}{\text{count}(AB)+\text{count}(BA)} \]
is at least \(\frac{1}{2}\), and the Laplace-smoothed trigram frequency ratio
\[ \frac{1+\text{count}(CBA)}{1+\text{count}(CAB)} \]
is maximized:
P before E except after C
Bigram | Count | Trigram | Count |
---|---|---|---|
pe: | 8052 | cpe: | 0 |
ep: | 5053 | cep: | 955 |
E before U except after Q
Bigram | Count | Trigram | Count |
---|---|---|---|
eu: | 2620 | qeu: | 0 |
ue: | 1981 | que: | 949 |
I before C except after I
Bigram | Count | Trigram | Count |
---|---|---|---|
ic: | 26140 | iic: | 1 |
ci: | 6561 | ici: | 1830 |
T before E except after M
Bigram | Count | Trigram | Count |
---|---|---|---|
te: | 27265 | mte: | 2 |
et: | 11743 | met: | 2684 |
R before D except after N
Bigram | Count | Trigram | Count |
---|---|---|---|
rd: | 3641 | nrd: | 0 |
dr: | 2738 | ndr: | 808 |
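The selection criterion used for the rules above (bigram proportion of at least 1/2, then the largest smoothed trigram ratio) can be sketched as a brute-force search over letter triples. This is a sketch under my own assumptions rather than the notebook's code: `ngram_counts`, `best_rules`, and the toy `words` list are hypothetical names and data.

```python
from collections import Counter
from itertools import permutations
from string import ascii_lowercase

def ngram_counts(words, n):
    """Count every length-n substring across an iterable of words."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts

def best_rules(words, top=5, min_share=0.5):
    """Rank 'A before B except after C' rules by smoothed trigram ratio."""
    bi = ngram_counts(words, 2)
    tri = ngram_counts(words, 3)
    rules = []
    for a, b in permutations(ascii_lowercase, 2):
        ab, ba = bi[a + b], bi[b + a]
        # Keep only pairs where AB is at least as common as BA.
        if ab + ba == 0 or ab / (ab + ba) < min_share:
            continue
        for c in ascii_lowercase:
            # Laplace-smoothed ratio: (1 + count(CBA)) / (1 + count(CAB)).
            score = (1 + tri[c + b + a]) / (1 + tri[c + a + b])
            rules.append((score, a, b, c))
    rules.sort(reverse=True)
    return rules[:top]

# Toy word list (an assumption) in which "pe" and "ep" are equally common.
words = ["pet", "pen", "pea", "accept", "except", "concept"]
for score, a, b, c in best_rules(words, top=3):
    print(f"{a.upper()} before {b.upper()} except after {c.upper()}: {score:.2f}")
```

Raising `min_share` is one way to demand that the "A before B" half of a rule holds more strongly before the exception is considered.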
Update
After I posted this article, there was some discussion suggesting that the optimal rules should focus on vowel placement and have a higher bigram ratio than the 1/2 threshold I used. Here are two “better” rules that satisfy these conditions:
O before U except after Q
Bigram | Count | Trigram | Count |
---|---|---|---|
ou: | 12144 | qou: | 0 |
uo: | 671 | quo: | 122 |
I before O except after J
Bigram | Count | Trigram | Count |
---|---|---|---|
io: | 15247 | jio: | 0 |
oi: | 4040 | joi: | 95 |
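To see how these two rules compare with the IEC rule on both criteria, the following sketch plugs the vocabulary counts from the tables in this post into the two formulas. The function and variable names are my own; only the counts come from the tables.

```python
def bigram_share(ab, ba):
    """Proportion of AB among all AB and BA occurrences."""
    return ab / (ab + ba)

def smoothed_ratio(cba, cab):
    """Laplace-smoothed ratio (1 + count(CBA)) / (1 + count(CAB))."""
    return (1 + cba) / (1 + cab)

# (count(AB), count(BA), count(CAB), count(CBA)), copied from the tables.
rules = {
    "I before E except after C": (3950, 2607, 256, 156),
    "O before U except after Q": (12144, 671, 0, 122),
    "I before O except after J": (15247, 4040, 0, 95),
}

summary = {
    name: (bigram_share(ab, ba), smoothed_ratio(cba, cab))
    for name, (ab, ba, cab, cba) in rules.items()
}

for name, (share, ratio) in summary.items():
    print(f"{name}: bigram share = {share:.3f}, trigram ratio = {ratio:.2f}")
```

A trigram ratio below 1 means the "except after C" spelling is actually the less common one, which is the case for the IEC rule with these counts.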