Bad Text and Part of Speech Tagging

I’ve recently become fascinated with some aspects of Natural Language Processing (NLP), having worked on some of them at my day job.

One of the key steps a computer program needs in order to understand natural language is called Part of Speech Tagging (POS tagging, or POST).

Basically, in the POS tagging phase, the computer assigns a part of speech (noun, verb, adjective, etc.) to each word of the specified text, which lets it figure out what the text is about and run later analysis on it.
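To make the idea concrete, here is a minimal Python sketch of tagging by dictionary lookup. It is only an illustration: the lexicon, the tag names and the toy_pos_tag function are all made up for this example, and real taggers rely on statistical models trained on annotated text rather than a hand-written word list.

```python
import re

# Hypothetical toy lexicon (an assumption for this example);
# a real tagger learns this kind of knowledge from an annotated corpus.
TOY_LEXICON = {
    "the": "DET",
    "a": "DET",
    "dog": "NOUN",
    "cat": "NOUN",
    "chased": "VERB",
    "sleeps": "VERB",
    "lazy": "ADJ",
    "quickly": "ADV",
}


def toy_pos_tag(text):
    """Return (word, tag) pairs; words missing from the lexicon get "UNK"."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [(tok, TOY_LEXICON.get(tok.lower(), "UNK")) for tok in tokens]


print(toy_pos_tag("The lazy dog sleeps"))
# A misspelled word is simply not found, so it falls through to "UNK".
print(toy_pos_tag("teh lazy dgo sleeps"))
```

Even in this tiny version, a single spelling mistake is enough to lose the tag for a word, which is the kind of error that gets carried into every later step.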

The POS step is crucial because its output feeds the rest of the process of reasoning about what the text means. Every mistake made at this stage is carried forward, leaving the end result way off target.

The problem with most POS taggers (see a list of most of the free ones here) is that they assume the text you are trying to tag is grammatically correct and (hopefully) free of spelling mistakes. Proper casing (upper and lower case of words and letters) also matters, since it helps distinguish various parts of speech. The other type of POS tagger performs unsupervised learning and can be trained to work with various text types.
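As a small illustration of why casing matters, here is a deliberately naive Python heuristic that treats any capitalized word after the first one as a proper noun. The function name guess_proper_nouns and the rule itself are assumptions made up for this example, not how any particular tagger works.

```python
def guess_proper_nouns(sentence):
    """Naive rule: a capitalized word that is not sentence-initial is a proper noun."""
    words = sentence.split()
    return [
        word.strip(".,!?")
        for index, word in enumerate(words)
        if index > 0 and word[:1].isupper()
    ]


# Properly cased text: the heuristic picks out the names.
print(guess_proper_nouns("I bought an Apple laptop in Paris."))
# All lowercase: nothing is found.
print(guess_proper_nouns("i bought an apple laptop in paris."))
# All uppercase: almost every word is flagged.
print(guess_proper_nouns("I BOUGHT AN APPLE LAPTOP IN PARIS."))
```

The same sentence gives three different answers depending only on casing, so any signal a tagger draws from capitalization disappears or turns misleading on carelessly cased text.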

The problems begin when the text is not grammatically correct, contains spelling mistakes, is badly punctuated, and has missing or incorrect casing.

These problems are most common on the Internet and have several causes:

English is not the native tongue of a large share of Internet users, which makes grammar, spelling and punctuation mistakes a bit more common.
A portion of today’s young Internet users (and I’m not being judgmental here) use a lot of Internet shortcuts along with improper casing and grammar.

The big challenge is to still be able to understand what the text is about in spite of these problems.

Since the mistakes vary from person to person (and possibly from group to group, which might make things easier, though I haven’t done or seen any research on that yet), pre-training your POS tagger is not very useful: the error rate will still be quite high. Running an unsupervised learning algorithm on each of these texts would be time consuming and might return strange results, because quite a few different error types can appear in the text.

Handling one sentence, or just a set of keywords as in search engines, is relatively easier than figuring out what a block of text (a couple of sentences, a paragraph or even a set of paragraphs) is about.

I’ve been experimenting with various techniques for extracting more meaningful results from badly formed English text. Some of them are not POS tagging in the traditional sense (i.e. tagging each and every word in the text), but rather a way of figuring out the most interesting words in a block of text, the ones that might imply what the text is really about.
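One very simple way to pick "interesting" words without tagging every word is to count how often each non-trivial word appears. The Python sketch below shows that idea only as an illustration; it is not one of the techniques referred to above, and the stopword list, the length cutoff and the candidate_keywords function are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "an", "is", "are", "my", "or", "and", "to", "of", "in", "it"}


def candidate_keywords(text, top_n=5):
    """Return the most frequent non-stopword tokens, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(
        token for token in tokens if token not in STOPWORDS and len(token) > 2
    )
    return [word for word, _ in counts.most_common(top_n)]


messy = "my guitar AMP is broken. need new Guitar amp or fix the amp??"
print(candidate_keywords(messy, top_n=2))
```

Because the text is lowercased and punctuation is discarded before counting, the result does not depend on casing or sentence structure. Misspelled words would still be counted as separate words, though.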

The goal of my experimentation is to develop an algorithm that outputs a set of words or phrases in various combinations. Using a co-occurrence database containing statistics about different words, I can then map those words and phrases to the main areas that the supplied text talks about.
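Here is a rough Python sketch of what that last step could look like. Everything in it is an assumption made for illustration: the shape of the co-occurrence data (word, then topic, then count), the numbers, and the rank_topics function. A real co-occurrence database would be far larger and its statistics would come from actual text.

```python
from collections import Counter

# Hypothetical co-occurrence statistics (made-up numbers): how often each
# word was seen together with a given topic area.
TOY_COOCCURRENCE = {
    "guitar": {"music": 40, "shopping": 5},
    "amp": {"music": 25, "electronics": 10},
    "strings": {"music": 15, "programming": 12},
    "price": {"shopping": 30},
}


def rank_topics(candidate_words, cooccurrence=TOY_COOCCURRENCE):
    """Sum the co-occurrence counts of each candidate word per topic, best first."""
    scores = Counter()
    for word in candidate_words:
        # Words missing from the database contribute nothing.
        for topic, count in cooccurrence.get(word.lower(), {}).items():
            scores[topic] += count
    return scores.most_common()


print(rank_topics(["Guitar", "amp", "strings", "zzz"]))
```

In this toy data, "strings" alone is ambiguous between music and programming, but together with "guitar" and "amp" the music area clearly wins. Unknown or misspelled words such as "zzz" are simply skipped.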

In later posts I’ll try to describe various algorithms I’ve been experimenting with that should make it easier to work out the main subject of a block of grammatically improper, badly spelled and wrongly cased text.

If any of you who actually read this post know people who are working on the subject (or similar aspects of it), or can point me to some interesting articles about it, please leave a comment.

Feel free to leave a comment if you’d like to discuss some of these aspects.