How to Normalize: Karma edition
The problem I was considering Wednesday morning was the following: if you take a word's frequency in the text you're analyzing, compare it to its frequency in a corpus (say, Google Books from a certain point in time), multiply it by the log of this other thing, and divide it by the fifth root of something else after adding in an offset in order to... the point is, once you're done with your computations, you end up with a pretty arbitrary-looking number that falls somewhere on a scale from zero to who the heck knows.
So I thought, wouldn't it be nice to normalize those numbers? Some sort of normalization that would bring them in line as something meaningful, as measured against some kind of standard. My coworker suggested that the right standard might be something like "a word that occurs an average number of times in the corpus". That average turned out to be 223,037 occurrences ("the", for comparison, occurs a little less than 8.7 billion times). As a number, 223,037 wasn't bad, but it's nice to get a handle on what it means, so we looked for words that appeared as close to that number of times as possible.
The first three we found, in order of closeness, were "Wien", "bombed", and "parol", none of which struck us as something you'd want to name your metric after. Then we hit the fourth word, which turned out to be exactly the kind of word we wanted to describe our normalization method.
That word, with 223,058 occurrences, is "normalization".
Sometimes, things just work.
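For anyone who wants to repeat the hunt on their own corpus, here is a rough sketch of the search described above: average the per-word counts, then rank words by how close their count is to that average. The function name `closest_to_mean` and the toy counts are made up for illustration; they are not our actual code or the real Google Books numbers.

```python
from statistics import mean


def closest_to_mean(counts, k=4):
    """Return the mean count and the k words whose counts lie nearest to it.

    `counts` maps each word to its number of occurrences in the corpus.
    Ties in distance are broken alphabetically so the result is deterministic.
    """
    avg = mean(counts.values())
    ranked = sorted(counts, key=lambda word: (abs(counts[word] - avg), word))
    return avg, ranked[:k]


# Toy counts, invented for this example (not real corpus data).
toy_counts = {
    "the": 1000,
    "of": 500,
    "bombed": 260,
    "parol": 230,
    "karma": 10,
}

avg, nearest = closest_to_mean(toy_counts, k=3)
print(avg, nearest)
```

On a real corpus you would build `counts` from the corpus's own word-frequency table, and the average is dragged upward by a handful of very common words like "the", which is why the words nearest to it tend to be fairly obscure ones.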