annotation.org

Tools for Natural Language Processing

  • Increase font size
  • Default font size
  • Decrease font size
Home Blog NLP (Natural Language Processing) Average Perceptron Algorithm Better Than Average.

Average Perceptron Algorithm Better Than Average.

E-mail Print PDF

I recently finished up an initial implementation of an average perceptron trainer and a simple version of applying it to sequences as described in Collins 2002.  There have been a couple of pleasant surprises which have some interesting implications for OpenNLP and the development of models in general.

While this work has been around for while, I wasn't particularly interested in what seemed like small improvements in sequence tagging.  I've tried alternatives to Maxent-GIS in the past such as gaussian smoothing, mallet's parameter estimation, conjugent gradient methods, etc and have typically found that the improvements don't scale to larger data sets either in terms of run time performance or accuracy.  This was also my reaction to the feature set used in Toutanova and Manning, 2000 and 2003 (with Klein and Singer);  an extreme trade-off of run time performance for accuracy. Matthew Wilkens reports results that indicate that it is pretty slow comparatively. This occurs a lot in academic work which has publication as its currency for success and not working systems.

Fortunately, a number of people started using average perceptron for ranking and mentioned that its implementation was pretty trivial.  A parse re-ranker is something that OpenNLP could use.  It's parser works ok for one without re-ranking, but most modern parses use re-ranking to improve parsing accuracy.  This made looking into average perceptron worth while.  I figured I'd start with pos-tagging as I already had something to compare it to.

When I took the time to to decode the math in the paper, I discovered that the implementation was in fact pretty easy and that even without the sequence modifications, performance is decent.  This would be uninteresting except that it is also balls fast to train.  For instance, it takes the better part of a day to train the maxent version of the pos-tagger on 1.5 million words.  It takes under 6 minutes to train an average-perceptron version.  The difference in performance on section 00 of the WSJ for your day of cpu cycles is about a .2%.

  No tag dictionaryTag dictionary
 maxent96.50% 96.58%
 average-perceptron95.22% 96.38%


It's also much faster to run as there is no converting to and from the log domain as their is in determining best sequences in maxent models. 

 No tag dictionary
Tag dictionary
maxent
8705 w/s
9750 w/s
 average-perceptron10600 w/s13500 w/s

This is nice because this should also apply to models built using the sequence-based updating scheme.  Unfortunately, these models currently take a solid day to train as they re-tag every sentence for every iteration, but there is hope for optimization.

In the future I'll probably use regular average perceptron model when I'm trying to figure out how to model stuff and then try several learners once I'm pretty happy with my features.  This should make development easier, but also has interesting implications for feature selection.