annotation.org

Tools for Natural Language Processing

Annotation.org Blog

Average Perceptron Algorithm Better Than Average.


I recently finished up an initial implementation of an average perceptron trainer and a simple version of applying it to sequences as described in Collins 2002.  There have been a couple of pleasant surprises which have some interesting implications for OpenNLP and the development of models in general.

While this work has been around for a while, I wasn't particularly interested in what seemed like small improvements in sequence tagging.  I've tried alternatives to Maxent-GIS in the past, such as Gaussian smoothing, MALLET's parameter estimation, conjugate gradient methods, etc., and have typically found that the improvements don't scale to larger data sets in terms of either run-time performance or accuracy.  This was also my reaction to the feature set used in Toutanova and Manning, 2000 and 2003 (with Klein and Singer): an extreme trade-off of run-time performance for accuracy.  Matthew Wilkens reports results indicating that it is pretty slow comparatively.  This occurs a lot in academic work, which has publication as its currency for success and not working systems.

Fortunately, a number of people started using average perceptron for ranking and mentioned that its implementation was pretty trivial.  A parse re-ranker is something that OpenNLP could use.  Its parser works OK for one without re-ranking, but most modern parsers use re-ranking to improve parsing accuracy.  This made looking into average perceptron worthwhile.  I figured I'd start with POS tagging as I already had something to compare it to.

When I took the time to decode the math in the paper, I discovered that the implementation was in fact pretty easy and that even without the sequence modifications, accuracy is decent.  This would be uninteresting except that it is also balls fast to train.  For instance, it takes the better part of a day to train the maxent version of the POS tagger on 1.5 million words.  It takes under 6 minutes to train an average-perceptron version.  The difference in accuracy on section 00 of the WSJ for your day of CPU cycles is about 0.2% when a tag dictionary is used.

                      No tag dictionary   Tag dictionary
  maxent              96.50%              96.58%
  average-perceptron  95.22%              96.38%
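
For readers who haven't seen the algorithm, the non-sequence version mentioned above can be sketched in a few dozen lines.  This is a minimal illustration under stated assumptions, not the OpenNLP trainer: the class name AveragedPerceptronSketch, the string features, and the naive averaging (summing the full weight table after every example) are all choices made here for readability.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative multiclass averaged perceptron over string features; not OpenNLP code. */
class AveragedPerceptronSketch {
    private final int numLabels;
    // weights.get(feature)[label] is the current weight for that feature/label pair.
    private final Map<String, double[]> weights = new HashMap<>();
    // totals accumulates the weights after every training example so they can be averaged.
    private final Map<String, double[]> totals = new HashMap<>();

    AveragedPerceptronSketch(int numLabels) {
        this.numLabels = numLabels;
    }

    /** Scores each label as a plain sum of active feature weights and returns the best one. */
    private int best(Map<String, double[]> w, List<String> features) {
        double[] scores = new double[numLabels];
        for (String f : features) {
            double[] row = w.get(f);
            if (row == null) continue; // an unseen feature contributes nothing
            for (int y = 0; y < numLabels; y++) scores[y] += row[y];
        }
        int best = 0;
        for (int y = 1; y < numLabels; y++) {
            if (scores[y] > scores[best]) best = y;
        }
        return best;
    }

    void train(List<List<String>> examples, int[] gold, int iterations) {
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < examples.size(); i++) {
                List<String> features = examples.get(i);
                int guess = best(weights, features);
                if (guess != gold[i]) {
                    // Mistake-driven update: reward the gold label, penalize the guess.
                    for (String f : features) {
                        double[] row = weights.computeIfAbsent(f, k -> new double[numLabels]);
                        row[gold[i]] += 1.0;
                        row[guess] -= 1.0;
                    }
                }
                // Naive averaging: add the current weights to the running totals.
                for (Map.Entry<String, double[]> e : weights.entrySet()) {
                    double[] sum = totals.computeIfAbsent(e.getKey(), k -> new double[numLabels]);
                    for (int y = 0; y < numLabels; y++) sum[y] += e.getValue()[y];
                }
            }
        }
    }

    /** Predicts with the summed weights; dividing by the step count would not change the arg-max. */
    int predict(List<String> features) {
        return best(totals, features);
    }

    public static void main(String[] args) {
        AveragedPerceptronSketch p = new AveragedPerceptronSketch(2); // 0 = NOUN, 1 = VERB
        List<List<String>> x = Arrays.asList(
                Arrays.asList("w=dog", "prev=the"),
                Arrays.asList("w=running", "suf=ing"));
        p.train(x, new int[] {0, 1}, 5);
        System.out.println(p.predict(Arrays.asList("w=jumping", "suf=ing")));
    }
}
```

A real trainer would avoid re-summing every weight on every example (for instance by tracking when each weight last changed), but the update rule itself is no more complicated than this.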


It's also much faster to run, as there is no converting to and from the log domain as there is in determining best sequences in maxent models.

                      No tag dictionary   Tag dictionary
  maxent              8705 w/s            9750 w/s
  average-perceptron  10600 w/s           13500 w/s

This is nice because this should also apply to models built using the sequence-based updating scheme.  Unfortunately, these models currently take a solid day to train as they re-tag every sentence for every iteration, but there is hope for optimization.
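
The log-domain point can be made concrete with a small sketch.  It assumes the usual formulations (a perceptron sequence score is the sum of raw linear scores, while a maxent sequence score is the sum of log probabilities after normalizing each token's scores) and is not a profile of OpenNLP's actual decoder; the class and method names are made up for illustration.

```java
/** Contrasts the arithmetic needed to score one tag sequence under each model type. */
class SequenceScoreSketch {

    /** Perceptron: add up the raw linear score of the chosen tag at each token. */
    static double perceptronScore(double[][] tagScores, int[] chosenTags) {
        double total = 0.0;
        for (int i = 0; i < chosenTags.length; i++) {
            total += tagScores[i][chosenTags[i]];
        }
        return total;
    }

    /** Maxent: exponentiate and normalize per token, then go back to logs to sum. */
    static double maxentLogProb(double[][] tagScores, int[] chosenTags) {
        double total = 0.0;
        for (int i = 0; i < chosenTags.length; i++) {
            double z = 0.0;
            for (double s : tagScores[i]) {
                z += Math.exp(s); // out of the log domain to build the normalizer
            }
            double p = Math.exp(tagScores[i][chosenTags[i]]) / z;
            total += Math.log(p); // back into the log domain to combine tokens
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] scores = {{1.0, 2.0}, {3.0, 0.0}};
        int[] tags = {1, 0};
        System.out.println(perceptronScore(scores, tags));
        System.out.println(maxentLogProb(scores, tags));
    }
}
```

The perceptron path does one addition per token; the maxent path does an exp per candidate tag plus a log per token, which is where the extra run-time cost would come from under these assumptions.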

In the future I'll probably use a regular average perceptron model when I'm trying to figure out how to model stuff and then try several learners once I'm pretty happy with my features.  This should make development easier, but also has interesting implications for feature selection.

Last Updated on Tuesday, 11 August 2009 19:59
 

I could have been a contender


Reading Bob Carpenter's blog today, I was introduced to a series of posts about part-of-speech evaluation by Matthew Wilkens in his blog, Work Product.  I couldn't help but be a little disappointed that he didn't include OpenNLP Tools in his POS tagger evaluation.  He looked at LingPipe, Stanford's tagger, MorphAdorner, and TreeTagger.  This is especially the case since in many ways this guy is the target audience for OpenNLP: someone who's willing to do a little programming, but wants the tools to work on most domains out of the box with minimal futzing, and who wants a true open-source solution.

From the looks of his conclusions and evaluations we would have fared pretty well.  The criteria he is interested in are: Accuracy, Tagset, Training Data, Speed, License/Source Code/Cost, Thread Safety and Input/Output.  I'm not entirely clear what his evaluation procedures were for accuracy, so it's hard to say for sure, but our accuracy numbers on Wall Street Journal (96.8%) and Brown (98.3%) seem to put us in the ballpark of the 97.0% number his evaluation showed.  The OpenNLP tagger is trained on Wall Street Journal, Brown Corpus, and about 8k words of narrative text, so it should be a reasonable match for Wilkens' target domain of literary text.  Our tagset is the Penn Treebank tagset.  Matthew says he prefers the Brown or MorphAdorner tagset:

Out of the box, then, I think MorphAdorner and NUPOS win for literary work, with LingPipe/Brown a reasonably close second. Stanford and TreeTagger use the significantly smaller Penn tagset, which seems less suitable for my needs.

While I can't speak for the MorphAdorner tagset, the Brown set has always struck me as just a more lexicalized version of the Penn Treebank tagset.  I suspect that a classifier could map word/brown_tag to ptb_tag quite easily, but I'll have to try that for myself.  Perhaps I can get a new source of nearly human-tagged Brown data out of that work.  Speed-wise I think we do quite well.  On my home machine, which is slower than the one used in the evaluation, a quick test shows the tagger going through 8500 words a second.  This would place us second in his evaluation behind TreeTagger.  License-, code-, and TCO-wise I think we win hands down.  We're LGPL, have regular releases, are not tied to a research grant or a particular grad student, and have been around for a while.  As of the 1.4 release the underlying model code is thread-safe, so you can re-use the same model in multiple threads; since the model accounts for most of the memory usage of any of our tools, that matters.  Finally, on the input/output front, we support setting the encoding of the data and have a pretty simple format.

I think the bigger question this poses for OpenNLP is how to advertise better.  OpenNLP has a decent-sized user base.  In the last 12 months, we had just under 12,000 downloads, but this post lets us know we're still missing some core members of our audience.  Part of what I think is missing is documentation and other writing about OpenNLP.  While there are a number of things in the works, including a white paper for the research types with evaluation data and a book targeted at developers, I'll try to make an immediate impact with this blog post.

Last Updated on Monday, 10 August 2009 22:57
 

Be a contributor


A fellow developer on OpenNLP, Jörn, recently pointed me to a blog post where Daniel McLaren provides a nice tutorial to OpenNLP.  I went looking for more posts like this and found a crabby blog post about a lack of documentation on OpenNLP from someone who didn't find the README file and spent their time kvetching about it.  Admittedly it was harder to find when they wrote their post, but there are plenty of forum posts pointing to it.  While our documentation could definitely use improvement (and is being improved), the crabby post misses the point on a number of levels.

The first point missed is that our mediocre documentation might be intentional.  Historically, the documentation for OpenNLP has been intentionally selective.  This software is targeted at developers.  Parsing and the other things OpenNLP does are not applications in and of themselves, so the only users are developers.  Command-line tools are provided with examples of how to run them, and they generate usage information.  After that I expect that anyone using the software:

  1. Knows how to compile it
  2. Knows how to set their classpath
  3. Knows that the command-line tools must have a main() in them
  4. Can read that main() and the javadoc to figure out how to use them

The main()s are pretty straightforward, as Daniel McLaren's code shows.  Previously, I've considered providing more documentation, but I've been worried that it will just invite more forum questions from people I can't really help, and I spend a good deal of time answering those now.

I could justify a number of other points in response to the crabby post like why the models aren't included in the source package, licensing differences with the other parsers he mentions, what it means to be a successful open source project, yada, yada, but here is the real point I want to make.

The crabby post makes a good point when it says: "Maybe Daniel and Thomas Morton (author of OpenNLP) should talk."  My question to the author and the other complainers out there is:

  1. What did you do about that?
  2. Did your actions make any difference in addressing your underlying concerns?
  3. Can you say that you caused that difference to be made?

The guy who told me about Daniel's page, Jörn, is a contributor.  Seeing Daniel's post has me looking at incorporating some of what Daniel did in upcoming documentation, as I now see that it provides something without inviting a lot of ill-formed questions.  I found the crabby post while looking for other content like Daniel's.

The crabby blogger also has a post on using BufferedInputStream to optimize model loading, but that's where it ended.  Marc Schröder made the same observation, but actually put it in a bug post.  That made a difference and will be in an upcoming incremental release.
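
For anyone wondering what the BufferedInputStream observation amounts to, here is a minimal sketch.  The file layout (a count followed by that many UTF strings) and the class name are invented for illustration and are not OpenNLP's model format; the point is only where the buffer sits in the stream chain.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Reads a toy binary file through a buffer instead of hitting the file for every field. */
class BufferedModelLoadSketch {

    /** Writes a count followed by each string in DataOutputStream's UTF format. */
    static void writeStrings(Path file, String... values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(values.length);
            for (String v : values) {
                out.writeUTF(v);
            }
        }
    }

    /** Reads the strings back; the BufferedInputStream turns many small reads into few large ones. */
    static String[] readStrings(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int n = in.readInt();
            String[] result = new String[n];
            for (int i = 0; i < n; i++) {
                result[i] = in.readUTF();
            }
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("toy-model", ".bin");
        writeStrings(tmp, "prev=the", "suf=ing");
        System.out.println(String.join(",", readStrings(tmp)));
        Files.delete(tmp);
    }
}
```

Without the buffer, each readInt() and readUTF() call can turn into its own small read against the underlying file, which adds up quickly for a model with many parameters.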

Do people get that they are using free software contributed in people's free time, and just how stingy it is to complain without taking the next obvious actions to make a difference?  If there wasn't such a huge untapped potential for others to contribute and make a difference wherever they choose to, I wouldn't even bother writing this.

Consider being a contributor in life!  It's probably no more effort than what you put into complaining and it makes the world a better place.

Last Updated on Monday, 10 August 2009 22:56
 

String Pooling in Java


The named entity detector in opennlp.tools  has always been a little chunkier than I thought it should be. "I wouldn't go so far as to call the brother fat", but it could use a little less memory. So what can we do?

The basis for this is that the 7 models that are loaded to find people, locations, dates, etc. all share the same context generator.  This means that if you run them on the same text, almost all of their features will be identical.  However, when the models which are built from these contexts are loaded, this relationship is ignored and each context gets a new String object allocated on the heap even if that String exists elsewhere.

One option is to add a new GISModel constructor that takes an existing model and then reuses the context strings from the passed-in model when they exist.  This turns out to be fairly complicated as the contexts are in a hash, and while the hash is happy to tell you that it contains a key, there isn't an easy way to get the reference to the key that the hash is using in its underlying data structure.  You could ask for all the keys and hope that it gives you the object references you want, but then you need a way to store them and retrieve them.  Most data structures which are good for this have the same issue we're trying to get around.  There is likely some way to get these references, but before I spent too much time looking for it, I tried something else.
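
The awkwardness is easy to reproduce outside of any model code.  The sketch below first shows a set confirming a value is present without handing back its own instance, and then one conceivable workaround that the post does not pursue: a map from each string to itself.  The class name StringPoolSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hand-rolled pool: maps each string value to the first instance seen with that value. */
class StringPoolSketch {
    private final Map<String, String> canonical = new HashMap<>();

    /** Returns the first instance seen with this value, remembering s if it is new. */
    String pool(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        String a = new String("prev=the"); // two equal strings, two heap objects
        String b = new String("prev=the");

        Set<String> keys = new HashSet<>();
        keys.add(a);
        System.out.println(keys.contains(b)); // the set knows the value is present...
        System.out.println(a == b);           // ...but b is still a separate object

        StringPoolSketch pool = new StringPoolSketch();
        String first = pool.pool(a);
        String second = pool.pool(b);
        System.out.println(first == second);  // both callers now share one reference
    }
}
```

Note that this still requires every loader to be handed the same pool object, which is the kind of explicit wiring between models that makes this route more work than it first appears.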

A second option is to normalize the string references before they get put in the model.  It turns out that the model reader nicely abstracts the reading of strings with a readUTF() method.  Simply replace this method with one that returns a reference-normalized string and we're good to go.  All that is needed is some sort of string pooling object.  I suspected there would be some Apache Commons code to do this, but after a quick Google search, I found that a solution is actually built into Java's String class.  String contains an intern() method which returns a canonical representation of a String if one exists, or creates one if it doesn't.  This has the nice property that a context can be shared across models without actually specifying which models are sharing what.  The coding is also trivial: extend an existing GISModelReader, override readUTF() with { return super.readUTF().intern(); } and voilà!
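
The shape of that override can be shown with stand-in classes.  StubModelReader and PooledStubModelReader below are hypothetical and only mimic a reader that exposes readUTF(); they are not the real GISModelReader API, whose constructors and other methods are not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a model reader that exposes readUTF(). */
class StubModelReader {
    private final DataInputStream in;

    StubModelReader(InputStream in) {
        this.in = new DataInputStream(in);
    }

    String readUTF() throws IOException {
        return in.readUTF(); // allocates a fresh String on every call
    }
}

/** Same reader, but every string comes back as the JVM's canonical instance. */
class PooledStubModelReader extends StubModelReader {
    PooledStubModelReader(InputStream in) {
        super(in);
    }

    @Override
    String readUTF() throws IOException {
        return super.readUTF().intern();
    }
}

class InternSketchDemo {
    /** Encodes one string in the format DataInputStream.readUTF() expects. */
    static byte[] utfBytes(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = utfBytes("prev=the");
        // Two "models" loading the same context string from separate streams.
        String a = new PooledStubModelReader(new ByteArrayInputStream(data)).readUTF();
        String b = new PooledStubModelReader(new ByteArrayInputStream(data)).readUTF();
        System.out.println(a == b);
    }
}
```

Because intern() uses a pool owned by the JVM, the two readers never need to know about each other, which is the property the paragraph above is pointing at.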

So what does all this mean to the named entity detector?  To load all 7 models previously took 350M.  With the new PooledGISModelReader class implemented in all its one-line-of-new-code glory, you can do it in 200M.  There are likely some other areas for optimization, but for this release, I'm happy with this improvement alone.

Last Updated on Monday, 10 August 2009 22:56
 

I am not a web developer...


I am not a web developer.  From afar, web development seems easy and not particularly complicated.  I always envisioned myself as a "real" application developer or at least a server-side developer who was above just generating HTML with code and whatever else web developers do.  It's been over ten years since I've worked directly with web technologies.  While I've read about these technologies and worked with people and on projects that used them, it's not the same as from-scratch development.  After this Van-Winkleian absence from the technology, I was in for a rude awakening.

The last time I developed for the web was in 1995, when knowing Perl and HTML were really all you needed to do web development.  Even then, all I did was debug an existing application by tweaking some Perl code to produce better reports and not generate any warnings.  At the time the basics of web development were simple albeit tedious, and a good understanding of Perl and knowing where to find httpd logs went a long way.  Today the amount of technology that one typically interacts with to do even basic web development is daunting.  Coming from a world where one seeks to avoid war, even if it is just a file extension, I now feel I've lost my innocence to .war files and the trenches of the "web.xml" file.

In the past couple of weeks, I've learned the basics of configuring Apache, Tomcat, Postgres, and Eclipse for web development.  This was mostly a tedious process of looking on the web, trying something seemingly random, watching it not work, rinse and repeat.  For some reason (and unlike xorg.conf files), I mostly found non-working configuration files published to the web, and well, they make poor examples.  Perhaps with my lack of experience, I should have purchased a book, or several, and done a lot more prep-reading.  This only occurs to me now, but still may not have helped because I tended to choose new versions of most technologies.  I mean, who wants to learn the old/proven technologies?  This definitely made things harder, as user documentation and Q&A are often less prevalent or no longer accurate.  Who knew config.xml files are formatted differently in Tomcat 5.5 than in 5.0?  (this guy)  Fortunately, this part of web development seems to consist of one-off tasks which, once overcome, won't haunt me for too long.

The programming side of web development was less painful.  After an initial foray into Struts 2.0, I ended up using the Google Web Toolkit (GWT) for most of my application, with JSP to handle logins and some other preliminaries.  GWT translates client-side Java coded against Google's widgets into browser-agnostic JavaScript with AJAX functionality, which is just cool.  Working with GWT was mostly fun and productive as it leverages my existing Java and Swing skills.  I was able to do most of my application design and build a mock-up reasonably quickly by using stubs where database functionality would eventually go.  Layout was a pain, but that's the devil I know, with the added twist of including CSS attributes.  These unfortunately still have a voodoo-like quality for me.  Once I finished most of the layout and page flows, I designed my first database schema and wrote my first SQL queries.  This is kinda embarrassing after almost 20 years of coding, but in previous work, I was either insulated from the database by others or used the file system or Berkeley DB Perl bindings to get the job done.  Fortunately, this was not particularly difficult, as the skills for designing a schema seem similar to creating data structures with pointers, and SQL is pretty straightforward.  Additionally, my wife is familiar with both these tasks, so in this case I did have reference materials.

The one benefit of having experience in development, even if not Web development, is that I have developed a sense of when things are ugly.  For instance, early on, I knew I didn't want very much logic on my JSP pages.  I don't have any hands-on experience to tell me why, but my intuitions about the maintainability of such pages kept yelling "yuck!".  When one page had a solution with a lot of client-side logic, I just kept searching until I found something I could stomach.  I ended up finding solutions to checking for a logged-in user and form validation which were much cleaner than the first examples I found. Likewise in Tomcat, modifying server.xml seemed like a poor deployment choice so I kept searching until I found talk of the context.xml file which allows this information to be deployed with the application.  Finally, I found a good description of wrapping exceptions so just one gets displayed on a default error page (hmm pretty).  Yesterday, I reached a milestone when I was able to register myself as a user in my application via the website, form validation, table insert, and everything.  I sat there looking at my silent Tomcat logs like a proud father.

I'm now looking forward to finishing up the mostly programming tasks that remain and with which I'm much more familiar.  This has been a good learning experience, and probably like any good learning experience, a little humbling.  Previously I might have said, "I am not a web developer" with an arrogance about the lack of complexity in the task.  Today, I say it with a tone of respect for the people who have paid their dues and generously written about them on the web, and to the statement "I am not a web developer" I append a determined "...yet".

Last Updated on Monday, 10 August 2009 22:56