David Marx

A recent post on Patrick Vennebush's blog, Math Jokes 4 Mathy Folks, asserted that the rule of thumb “I before E except after C” was “total bullshit.” This got me thinking: the “I before E except after C” rule (let’s just call it the IEC rule) was almost certainly developed without any research at all, based only on the subjective experience of educators. It’s not a horrible rule, but we can certainly construct better rules of this kind more intelligently (for lack of a better term, I’ll be referring to these as “trigram spelling rules”). We have the technology, and I have nothing better to do with my night off :)

The long version

You can find my full methodology and analysis in the following IPython notebook:

http://nbviewer.ipython.org/gist/dmarx/b6a095d2b161eccb18a3

The short version

I used a larger word list than Patrick (233,621 words), but my analysis still corroborated his. I observed the following bigram and trigram frequencies for the IEC rule:

Bigram Count Trigram Count
ie: 3950 cie: 256
ei: 2607 cei: 156
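
For readers who want to reproduce counts like these, here is a minimal sketch of the counting step. It is not the code from the notebook: the helper name `count_ngrams` and the toy word list are my own illustration, and I'm assuming every overlapping occurrence inside each lowercased word is counted once.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count every overlapping run of n letters across an iterable of words."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

# Tiny illustrative vocabulary, NOT the 233,621-word list used in the post.
toy_words = ["science", "receive", "believe", "species", "weird", "ceiling"]

bigrams = count_ngrams(toy_words, 2)
trigrams = count_ngrams(toy_words, 3)

for gram in ("ie", "ei"):
    print(gram, bigrams[gram])
for gram in ("cie", "cei"):
    print(gram, trigrams[gram])
```

Swapping `toy_words` for a real word list (one word per entry) would give a table of the same shape as the one above.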

I thought that although the IEC rule doesn’t work when we look at the unique words in our vocabulary, it might hold true if we look at trigram and bigram frequencies across word usage in written text. Here are the frequencies for the IEC rule in the Brown corpus:

Bigram Count Trigram Count
ie: 13275 cie: 1310
ei: 5677 cei: 485

Nope, still no good.
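
To make "no good" concrete, the counts from the two tables above can be turned into shares. This is just arithmetic on the published numbers; the helper `share` and the dictionary layout are my own illustration.

```python
# Counts copied from the two IEC tables above.
counts = {
    "vocabulary": {"ie": 3950, "ei": 2607, "cie": 256, "cei": 156},
    "brown": {"ie": 13275, "ei": 5677, "cie": 1310, "cei": 485},
}

def share(a, b):
    """Fraction that a makes up of a + b."""
    return a / (a + b)

for name, c in counts.items():
    ie_share = share(c["ie"], c["ei"])     # how often "I before E" holds overall
    cei_share = share(c["cei"], c["cie"])  # how often "except after C" holds
    print(f"{name}: ie share = {ie_share:.3f}, cei share after c = {cei_share:.3f}")
```

In both data sets "ie" is the majority spelling overall, but "cei" is the minority spelling after "c", which is exactly where the rule says it should win.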

Instead of the IEC rule, here are some alternatives (taken from my vocabulary analysis, not the word usage analysis). For each rule of the form “A before B except after C” below, the bigram ratio

\[ \frac{\text{count}(AB)}{\text{count}(AB)+\text{count}(BA)} \]

is at least \(\frac{1}{2}\), and the Laplace-smoothed trigram ratio

\[ \frac{1+\text{count}(CBA)}{1+\text{count}(CAB)} \]

is maximized:

P before E except after C

Bigram Count Trigram Count
pe: 8052 cpe: 0
ep: 5053 cep: 955

E before U except after Q

Bigram Count Trigram Count
eu: 2620 qeu: 0
ue: 1981 que: 949

I before C except after I

Bigram Count Trigram Count
ic: 26140 iic: 1
ci: 6561 ici: 1830

T before E except after M

Bigram Count Trigram Count
te: 27265 mte: 2
et: 11743 met: 2684

R before D except after N

Bigram Count Trigram Count
rd: 3641 nrd: 0
dr: 2738 ndr: 808
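
The selection criteria behind these rules can be sketched roughly as follows. This is my own illustrative reconstruction, not the notebook's code: the function names, the `top` parameter, and the brute-force loop over all letter triples are assumptions. The two ratio functions follow the definitions given before the rules.

```python
from collections import Counter
from itertools import permutations
import string

def count_ngrams(words, n):
    """Count every overlapping run of n letters across an iterable of words."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def bigram_ratio(bigrams, a, b):
    """count(AB) / (count(AB) + count(BA)); 0.0 when neither bigram occurs."""
    ab, ba = bigrams[a + b], bigrams[b + a]
    return ab / (ab + ba) if ab + ba else 0.0

def trigram_ratio(trigrams, a, b, c):
    """Laplace-smoothed (1 + count(CBA)) / (1 + count(CAB))."""
    return (1 + trigrams[c + b + a]) / (1 + trigrams[c + a + b])

def rank_rules(words, min_bigram_ratio=0.5, top=5):
    """Return the `top` highest-scoring (trigram_ratio, A, B, C) tuples."""
    bigrams = count_ngrams(words, 2)
    trigrams = count_ngrams(words, 3)
    letters = string.ascii_lowercase
    scored = []
    for a, b in permutations(letters, 2):
        # "A before B" must hold at least as often as "B before A".
        if bigram_ratio(bigrams, a, b) < min_bigram_ratio:
            continue
        for c in letters:  # C may equal A or B, as in "I before C except after I"
            scored.append((trigram_ratio(trigrams, a, b, c), a, b, c))
    scored.sort(reverse=True)
    return scored[:top]

# Worked check against the first table: P before E except after C.
print(bigram_ratio(Counter({"pe": 8052, "ep": 5053}), "p", "e"))
print(trigram_ratio(Counter({"cep": 955}), "p", "e", "c"))
```

The add-one smoothing is what keeps the trigram ratio finite when the "wrong" trigram (such as "cpe") never occurs in the word list.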

Update

After posting this article, there was some discussion that the optimal rules should focus on vowel placement and have a higher bigram ratio than the 1/2 threshold I used. Here are two “better” rules that satisfy these conditions:

O before U except after Q

Bigram Count Trigram Count
ou: 12144 qou: 0
uo: 671 quo: 122

I before O except after J

Bigram Count Trigram Count
io: 15247 jio: 0
oi: 4040 joi: 95
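
As a quick sanity check, here is a small sketch that recomputes both ratios for every rule in this post from the published counts, then applies a vowel-only filter with a stricter bigram threshold. The 0.75 cutoff is a hypothetical value I picked for illustration; the discussion did not settle on a specific number.

```python
VOWELS = set("aeiou")

# (A, B, C, count(AB), count(BA), count(CAB), count(CBA)), copied from the tables in this post.
rules = [
    ("p", "e", "c", 8052, 5053, 0, 955),
    ("e", "u", "q", 2620, 1981, 0, 949),
    ("i", "c", "i", 26140, 6561, 1, 1830),
    ("t", "e", "m", 27265, 11743, 2, 2684),
    ("r", "d", "n", 3641, 2738, 0, 808),
    ("o", "u", "q", 12144, 671, 0, 122),
    ("i", "o", "j", 15247, 4040, 0, 95),
]

MIN_BIGRAM_RATIO = 0.75  # hypothetical stricter threshold, not stated in the post

def summarize(rule):
    """Compute both ratios for one rule and note whether A and B are both vowels."""
    a, b, c, ab, ba, cab, cba = rule
    return {
        "rule": f"{a.upper()} before {b.upper()} except after {c.upper()}",
        "bigram_ratio": ab / (ab + ba),
        "trigram_ratio": (1 + cba) / (1 + cab),
        "vowel_pair": a in VOWELS and b in VOWELS,
    }

summaries = [summarize(r) for r in rules]
kept = [
    s["rule"]
    for s in summaries
    if s["vowel_pair"] and s["bigram_ratio"] >= MIN_BIGRAM_RATIO
]

for s in summaries:
    print(f'{s["rule"]}: bigram {s["bigram_ratio"]:.3f}, trigram {s["trigram_ratio"]:.1f}')
print("kept:", kept)
```

With this cutoff only the two rules from the update survive: the "E before U" rule is a vowel pair but its bigram ratio is too close to 1/2. The trade-off is that the surviving rules have smaller trigram ratios than the original five.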