The intensionality of "alleged"

"Alleged" is well-known in semantics as a word that introduces reference to possible worlds: just as a "former senator" isn't a senator (who is former), but rather someone who at a prior time was a senator, an "alleged criminal" is someone who isn't (necessarily) an actual criminal, but merely someone who, in the worlds compatible with what the alleger believes, is a criminal.

Which makes the following sub-header from the front page of a news site particularly odd:

A witness saw Christopher Piantedosi allegedly stab his ex-girlfriend in their daughter's room via a videochat on an iPad.

It's sensible to say that Piantedosi allegedly stabbed his ex-girlfriend; it means that, according to police, Piantedosi stabbed his ex-girlfriend. But I'm not at all sure what it would mean for someone to see Piantedosi allegedly stab someone. (She saw the police allege that he stabbed her?) Perhaps this witness can see into possible worlds, in which case we really need to get her into a lab to do some experimental semantics.

How to Normalize: Karma edition

Word frequency is something I deal with a lot in my work. It's the basis of some fairly fundamental information: how often does Word X show up in the text you're analyzing? Is that more or less often than you'd expect Word X to show up? It's by no means the way we measure everything, but it's at the very least a good benchmark.

The problem I was considering Wednesday morning was the following: if you take a word's frequency in the text you're analyzing, compare it to its frequency in a corpus (say, the Google Books corpus from a certain point in time), multiply it by the log of this other thing and divide it by the fifth root of something else after adding in an offset in order to...the point is, once you're done with your computations, you end up with a pretty arbitrary looking number that falls somewhere on a scale from zero to who the heck knows.

So I thought, wouldn't it be nice to normalize those numbers? Some sort of normalization that would bring them in line as something meaningful, as measured against some kind of standard. My coworker suggested that the right standard might be something like "a word that occurs an average number of times in the corpus", which turned out to be 223,037 ("the", for comparison, occurs a little less than 8.7 billion times). As a number, 223,037 wasn't bad, but it's nice to get a handle on what that means, so we looked at words that appeared as close to that number of times as possible.
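
As a sketch, the normalization itself is just a ratio against that average count; the numbers below are the ones from our corpus, but the function name is my own:

```python
# Normalize raw corpus counts against the average count per word,
# so that a score of 1.0 means "occurs an average number of times".
AVERAGE_COUNT = 223_037  # mean occurrences per word in the corpus discussed above

def normalized_frequency(count, average=AVERAGE_COUNT):
    """Express a raw occurrence count as a multiple of the average."""
    return count / average

# "the", at roughly 8.7 billion occurrences, scores around 39,000;
# a word occurring exactly the average number of times scores 1.0.
```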

The first three we found, in order of closeness, were "Wien", "bombed", and "parol", none of which struck us as something you'd want to name your metric after. Then we hit the fourth word, which turned out to be exactly the kind of word we wanted for our normalization method.

That word, with 223,058 occurrences, is "normalization".

Sometimes, things just work.

How not to remove suffixes

Sadly, I don't have much interesting to report about my work, because most of the interesting things I've discovered have been on particular not-yet-discussable projects; and because mostly I've just been frustrated with the state of natural language processing, which I'm sure has made advances since the last time I looked in on it when I was in college, but I'm having a little trouble seeing them. (I'm sure tf*idf is great and all, but I'd like it better if it were working.)
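
For what it's worth, plain tf*idf is simple enough to sketch in a few lines (the toy documents here are invented, and real implementations add smoothing and other refinements):

```python
import math

def tf_idf(term, doc, docs):
    """Plain tf*idf: how often the term occurs in this document,
    weighted by how rare the term is across the whole collection."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * math.log(len(docs) / df)    # assumes df > 0

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird and the bee".split(),
]
# "cat" appears in only one document, so it outscores the ubiquitous
# "the", whose idf is log(3/3) = 0.
```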

So the best I can do, in the tradition of this video courtesy of Heidi Harley, is to observe that a coworker has found an example of bad suffix removal that probably beats anything I've found so far. I was merely amused to learn that a particularly common term in one set of data was the singular city of "Lo Angele", the parser having helpfully stripped away the plural suffixes; but he discovered that the parser did the same for the less-than-superlative city of Budap.
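
I don't know what that parser actually does under the hood, but a suffix stripper as deliberately naive as the following reproduces both errors:

```python
# A deliberately naive suffix stripper: treat every final "est" as a
# superlative and every final "s" as a plural, and strip it off.
def naive_strip(word):
    for suffix in ("est", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def strip_phrase(phrase):
    return " ".join(naive_strip(w) for w in phrase.split())

# strip_phrase("Los Angeles") -> "Lo Angele"
# naive_strip("Budapest")     -> "Budap"
```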

Spell checking and the internet

I've spent the last week noodling around with spelling correction in Python, with no particularly good results. (I might need to do more than noodle, if I want good results.) Part of the problem is deciding what to do with unfamiliar words—and if your client wants you to be searching on Pepsi-Cola (not an actual example), you kind of want references to "Pespi" to be corrected to "Pepsi"...but without any references to "Cole Porter" being corrected to "Cola Porter".
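
One Norvig-style sketch of that kind of constrained correction (the vocabularies here are invented examples): only correct toward the client's terms, and never touch a word you already know.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, targets, known):
    """Correct toward a client vocabulary, but leave known words alone."""
    w = word.lower()
    if w in known or w in targets:
        return word
    candidates = targets & edits1(w)
    return candidates.pop() if len(candidates) == 1 else word

targets = {"pepsi", "cola"}   # the client's brand terms (invented example)
known = {"cole", "porter"}    # ordinary words we must never "correct"
```

On this scheme "Pespi" gets corrected (a transposition away from "pepsi"), while "Cole" survives untouched even though it's one edit from "cola", because it's a known word.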

Today's lesson, though, while looking through capitalized phrases in a corpus, and finding "beret syndrome": no amount of spell-checking will help you when someone refers to Gion Beret Syndrome.

Thanks, Python Style Guide!

So, this journal having been lying fallow while I transitioned from academia to real, live, paying jobs, I'm now thinking of reviving it for the occasional work-related post.

As a potential starting post, then: I'm reading the Python code style guide, which I've never read before, but now that I'm writing code other people will have to read, it seems like a good idea to get accustomed to it. I'm finding it a mix of things that are good ideas, things that don't strike me as particularly useful or necessary, and snottiness about using complete sentences and writing in English. On this last point, the sentence that really struck me:
When writing English, Strunk and White apply.
I'll set aside the subject-verb agreement. (I know that US and British English differ on things like "the crowd is..." vs. "the crowd are...", but even in British English, wouldn't "Strunk and White" be taken as a single unit, insofar as it's a single book, and therefore use the singular verb "applies"?) Instead, what I find really striking is the dangling modifier: is it supposed to be Strunk and White who are writing English? A proper, Strunk-and-White-sanctioned sentence would say "When writing English, you should follow Strunk and White" or "When you are writing English, Strunk and White apply" or "In English, Strunk and White apply". I'd even be willing to grant them "When writing English, Strunk and White apply to what you write", which I believe S&W and its adherents would object to, because at least there's something in the sentence ("you") for "when writing English" to modify. But the sentence as it stands? Unacceptable by any standard.

Meanwhile, I'm going to go back to writing comments however I darned well care to. Er, to however I darned well care.

I love children! Sauteed in a little...

I don't have a huge problem with dangling participles; I'll often say things like "Speaking as a linguist, that sentence isn't grammatical", where it's obviously not the sentence that's speaking as a linguist. (See various Language Log posts from Geoff Pullum arguing that avoiding dangling participles is typically a matter of politeness as opposed to grammar.) But the following sentence from the government's anti-childhood-obesity website is priceless:
Cauliflower? No problem. Roasted with garlic and olive oil, the kids happily munched as if they were fries.
It doesn't help that the only plural antecedent for "they" is "the kids"; but, man, they really ought to have a copyeditor over there.


Dear Mango Languages,

Insofar as I'm not fluent in at least three languages, nor am I certain I want to work 50-60 hour weeks as a contract worker, I doubt I'm going to be applying for your job in any case. But in the meantime, in case I do become fluent in another language or two and find myself wanting to work long hours, perhaps you should explain your core values a little.

I mean, it's a good thing that on your blog, you have a post that says: "Mango Languages has six core values that we all believe very strongly in: Quality, Entrepreneurial Spirit, Positive Attitude, Innovation, Integrity, and Fundipline. Let’s chat today about Entrepreneurial Spirit." Perhaps you should consider that it isn't Entrepreneurial Spirit that people need an explanation of.

Yours etc.

A sentence too poorly constructed to succeed

Lake Superior State University has, as is its wont, released its annual "please pay attention to us" Banished Words List. As Arnold Zwicky puts it, "it's a steaming pile of intemperate peeving". Picking apart the gripes is like shooting fish in a barrel—they hate "czar" because it's "long used by the media" and "tweet" because it's new, and so forth. (Heck, the complaint about the latter that "I don't know a single non-celebrity who actually uses ['tweet']" just makes the people behind this look old and grumpy.) They hate "app" because it's "yt another abrv"; presumably they say "mobile vulgus" and "taximeter cabriolet" instead of "mob" and "taxicab", but more to the point, Merriam-Webster dates the word to 1987, so they're coming to this fight a little late....

Right, sorry, enough barrelfish shooting. The point I'd actually wanted to make was about their quote from Claire Shefchik in favor of banning "too big to fail": "Just for the record, nothing's too big to fail unless the government lets it." That is to say, nothing's too big to fail, so no matter how big something is, it can still fail, unless the government lets it fail, in which case it's...wait, if the government lets it fail, then it's too big to fail, i.e., it can't fail? Or does she mean that unless the government lets it be too big to fail, then—except that to let something be too big to fail, it has to already be too big to fail, so...

I'm pretty sure that Ms. Shefchik's statement is in fact gibberish. But that didn't stop these defenders of the Queen's English from citing it approvingly. Dolts.

Sanity check

To those reading this who feel qualified to answer: in the sentence Mary took physics in high school, is Mary went to high school an entailment or a presupposition?