eternally stressed semanticist (cqs) wrote,

How not to remove suffixes

Sadly, I don't have much interesting to report about my work. Most of the interesting things I've discovered have been on particular not-yet-discussable projects, and mostly I've just been frustrated with the state of natural language processing. I'm sure it has made advances since I last looked in on it in college, but I'm having a little trouble seeing them. (I'm sure tf*idf is great and all, but I'd like it better if it were working.)

So the best I can do, in the tradition of this video courtesy of Heidi Harley, is to observe that a coworker has found an example of bad suffix removal that probably beats anything I've found so far. I was merely amused to learn that a particularly common term in one set of data was the singular city of "Lo Angele", the parser having helpfully stripped away the plural suffixes; but he discovered that the parser did the same for the less-than-superlative city of Budap.
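For the curious, here is a minimal sketch of how this sort of thing can happen. It is not the parser in question, whose rules I can't show; the suffix list, the function names, and the little set of protected place names are all made up for illustration. The point is only that a stripper which chops "-est" and "-s" off any token, without asking whether the token is a regular word, will produce exactly these cities.

```python
# Hypothetical suffix rules; the real parser's rules are not known.
SUFFIXES = ("est", "s")

# Hypothetical list of tokens that should never be stemmed (assumption).
PROTECTED = {"los", "angeles", "budapest"}


def naive_strip(token: str) -> str:
    """Strip the first matching suffix, with no check on what the token is."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def naive_stem(text: str) -> str:
    """Apply naive_strip to every whitespace-separated token."""
    return " ".join(naive_strip(tok) for tok in text.split())


def guarded_stem(text: str) -> str:
    """Same as naive_stem, but leave protected tokens untouched."""
    return " ".join(
        tok if tok.lower() in PROTECTED else naive_strip(tok)
        for tok in text.split()
    )


print(naive_stem("Los Angeles"))    # Lo Angele
print(naive_stem("Budapest"))       # Budap
print(guarded_stem("Los Angeles"))  # Los Angeles
```

A lookup list like this is only one possible guard, and probably not the best one; it just shows that the stripper needs some way of knowing a proper noun when it sees one.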
