I Before E Except After C

A recent post on Patrick Vennebush blog Math Jokes for 4 Mathy Folks asserted that the rule of thumb “I before E except after C” was “total bullshit.” This got me thinking: the “I before E except after C” rule (let’s just call it the IEC rule) was almost certainly developed without any research at all, just based on the subjective experience of educators. It’s not a horrible rule, but certainly we can more intelligently construct better rules of this kind (for lack of a better term, I’ll be refering to these as “trigram spelling rules”). We have the technology, and I have nothing better to do with my night off :)

The long version

You can find my full methodology and analysis in the following IPython notebook:

http://nbviewer.ipython.org/gist/dmarx/b6a095d2b161eccb18a3

The short version

I used a larger word list than Patrick (233,621 words), but my analysis still corroborated his. I observed the following bigram and trigram frequencies for the IEC rule:

Bigram	Count	Trigram	Count
ie:	3950	cie:	256
ei:	2607	cei:	156

I thought that perhaps although the IEC rule doesn’t work when we look at the unique words in our vocabulary, perhaps it might hold true if we look at trigram and bigram frequencies across word usage in written text. Here are the frequencies for the IEC rule in the Brown corpus:

Bigram	Count	Trigram	Count
ie:	13275	cie:	1310
ei:	5677	cei:	485

Nope, still no good.

Instead of the IEC rule, here are some alternatives (taken from my vocabulary analysis, not the word usage analysis). For each rule of the form “A before B except after C” below, the bigram frequency percentage

\[ \frac{count(AB)}{count(AB)+count(BA)} \]

is at least \(\frac{1}{2}\), and the laplace smoothed trigram frequency ratio

\[ \frac{ (1+count(cba)) }{ (1+count(cab)) } \]

is maximized:

P before E except after C

Bigram	Count	Trigram	Count
pe:	8052	cpe:	0
ep:	5053	cep:	955

E before U except after Q

Bigram	Count	Trigram	Count
eu:	2620	qeu:	0
ue:	1981	que:	949

I before C except after I

Bigram	Count	Trigram	Count
ic:	26140	iic:	1
ci:	6561	ici:	1830

T before E except after M

Bigram	Count	Trigram	Count
te:	27265	mte:	2
et:	11743	met:	2684

R before D except after N

Bigram	Count	Trigram	Count
rd:	3641	nrd:	0
dr:	2738	ndr:	808

Update

After posting this article, there was some discussion that the optimal rules should focus on vowel placement and have a higher bigram ratio than the 1/2 threshold I used. Here are two “better” rules that satisfy these condiitons:

O before U except after Q

Bigram	Count	Trigram	Count
ou:	12144	qou:	0
uo:	671	quo:	122

I before O except after J

Bigram	Count	Trigram	Count
io:	15247	jio:	0
oi:	4040	joi:	95

David Marx

I Before E Except After C

The long version

The short version

Update

You might also enjoy (View all posts)