annotation.org

Tools for Natural Language Processing


I could have been a contender


Reading Bob Carpenter's blog today, I was introduced to a series of posts about part-of-speech evaluation by Matthew Wilkens on his blog, Work Product. He looked at LingPipe, Stanford's tagger, MorphAdorner, and TreeTagger. I couldn't help but be a little disappointed that he didn't include OpenNLP Tools in his POS tagger evaluation, especially since in many ways he is the target audience for OpenNLP: someone who's willing to do a little programming, but wants the tools to work on most domains out of the box with minimal futzing, and who wants a true open-source solution.

From the looks of his conclusions and evaluations, we would have fared pretty well. The criteria he is interested in are: Accuracy, Tagset, Training Data, Speed, License/Source Code/Cost, Thread Safety, and Input/Output. I'm not entirely clear what his evaluation procedures were for accuracy, so it's hard to say for sure, but our accuracy numbers on Wall Street Journal (96.8%) and Brown (98.3%) seem to put us in the ballpark of the 97.0% number his evaluation showed. The OpenNLP tagger is trained on the Wall Street Journal, the Brown Corpus, and about 8k words of narrative text, so it should be a reasonable match for Wilkens' target domain of literary text. Our tagset is the Penn Treebank tagset. Matthew says he prefers the Brown or MorphAdorner tagset:

Out of the box, then, I think MorphAdorner and NUPOS win for literary work, with LingPipe/Brown a reasonably close second. Stanford and TreeTagger use the significantly smaller Penn tagset, which seems less suitable for my needs.

While I can't speak for the MorphAdorner tagset, the Brown set has always struck me as just a more lexicalized version of the Penn Treebank tagset. I suspect that a classifier could map word/brown_tag to ptb_tag quite easily, but I'll have to try that for myself. Perhaps I can get a new source of nearly human-tagged Brown data out of that work.

Speed-wise, I think we do quite well. On my home machine, which is slower than the one used in the evaluation, a quick test shows the tagger going through 8500 words a second. This would place us second in his evaluation, behind TreeTagger. License-, code-, and TCO-wise, I think we win hands down. We're LGPL, have regular releases, are not tied to a research grant or a particular grad student, and have been around for a while. As of the 1.4 release the underlying model code is thread-safe, so you can reuse the same model in multiple threads; the model accounts for most of the memory usage of any of our tools. Finally, on the input/output front, we support setting the encoding of the data and have a pretty simple format.
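
To make the thread-safety point concrete, here is a minimal sketch of the sharing pattern in plain Java. It does not use the OpenNLP API: `Model`, `Tagger`, and `tagAll` are hypothetical stand-ins invented for illustration, and the "model" is just a small read-only lexicon. The assumption is only that the large model is never modified after loading, so one instance can be handed to a cheap tagger object in each thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedModelDemo {
    /** Hypothetical stand-in for a large, read-only tagging model. */
    static final class Model {
        private final Map<String, String> lexicon;

        Model(Map<String, String> lexicon) {
            // Defensive copy plus unmodifiable view: nothing mutates after construction.
            this.lexicon = Collections.unmodifiableMap(new HashMap<>(lexicon));
        }

        String tagFor(String word) {
            String tag = lexicon.get(word);
            return tag != null ? tag : "NN"; // toy fallback for unknown words
        }
    }

    /** Hypothetical lightweight tagger: cheap to create, one per task. */
    static final class Tagger {
        private final Model model;

        Tagger(Model model) {
            this.model = model;
        }

        List<String> tag(List<String> words) {
            List<String> tags = new ArrayList<>(words.size());
            for (String word : words) {
                tags.add(model.tagFor(word));
            }
            return tags;
        }
    }

    /** Tags each sentence on a thread pool; every task reads the same Model instance. */
    static List<List<String>> tagAll(Model model, List<List<String>> sentences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> sentence : sentences) {
                Callable<List<String>> task = () -> new Tagger(model).tag(sentence);
                futures.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // results come back in input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while tagging", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("tagging task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("the", "DT");
        lexicon.put("ran", "VBD");
        Model shared = new Model(lexicon); // loaded once, shared by all threads

        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("the", "dog", "ran"),
                Arrays.asList("the", "cat", "ran"));
        System.out.println(tagAll(shared, sentences, 2));
    }
}
```

The design choice here is that only the small `Tagger` wrapper is created per task, while the expensive object is loaded once. With a real toolkit, check its documentation for which classes are actually safe to share before relying on this pattern.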

I think the bigger question this poses for OpenNLP is how to advertise better. OpenNLP has a decent-sized user base. In the last 12 months, we had just under 12,000 downloads, but this post lets us know we're still missing some core members of our audience. Part of what I think is missing is documentation and other writing about OpenNLP. While there are a number of things in the works, including a white paper with evaluation data for the research types and a book targeted at developers, I'll try to make an immediate impact with this blog post.