
How to Normalize: Karma edition

Word frequency is something I deal with a lot in my work. It's the basis of some fairly fundamental information: how often does Word X show up in the text you're analyzing? Is that more or less often than you'd expect Word X to show up? It's by no means the way we measure everything, but it's at the very least a good benchmark.

The problem I was considering Wednesday morning was the following: you take a word's frequency in the text you're analyzing, compare it to its frequency in a reference corpus (say, Google Books from a certain point in time), multiply it by the log of this other thing, divide it by the fifth root of something else after adding in an offset, in order to... the point is, once you're done with your computations, you end up with a pretty arbitrary-looking number that falls somewhere on a scale from zero to who the heck knows.
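Just to make the shape of the problem concrete, here's a toy version of that kind of score. The actual transformations above are deliberately hand-waved, so every function name, parameter, and constant below is invented for illustration, not our real computation:

```python
import math

def toy_score(text_freq, corpus_freq, other_thing=42.0, something_else=1e6, offset=7.0):
    """A deliberately arbitrary relevance score: a frequency ratio,
    scaled by the log of one made-up quantity and the fifth root of
    another (plus an offset). All the knobs here are hypothetical."""
    ratio = text_freq / corpus_freq
    return ratio * math.log(other_thing) / (something_else + offset) ** (1 / 5)

# Prints some positive number. Is it big? Small? Compared to what?
print(toy_score(text_freq=0.004, corpus_freq=0.001))
```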

So I thought, wouldn't it be nice to normalize those numbers? Some sort of normalization that would render them as something meaningful, measured against some kind of standard. My coworker suggested that the right standard might be something like "a word that occurs an average number of times in the corpus", which turned out to be 223,037 occurrences ("the", for comparison, occurs a little less than 8.7 billion times). As a number, 223,037 wasn't bad, but it's nice to get a handle on what it means, so we looked at the words whose counts came as close to that number as possible.
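In code, the idea is simple. Assuming the corpus is loaded as a word-to-count mapping (the names below are mine, not our actual pipeline), the baseline and the words near it fall out in a few lines:

```python
from statistics import mean

def normalization_baseline(counts):
    """Average occurrence count across all word types in the corpus."""
    return mean(counts.values())

def words_nearest(counts, target, n=5):
    """The n word types whose counts sit closest to the target count."""
    return sorted(counts, key=lambda w: abs(counts[w] - target))[:n]

# With the full corpus in `counts`, something like
#   baseline = normalization_baseline(counts)   # ~223,037 here
#   words_nearest(counts, baseline)
# is the search that produced the list below.
```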

The first three we found, in order of closeness, were "Wien", "bombed", and "parol", none of which struck us as something you'd want to name your metric after. Then we hit the fourth word, which turned out to be exactly the kind of word we wanted for our normalization method.

That word, with 223,058 occurrences, is "normalization".

Sometimes, things just work.