annotation.org

Tools for Natural Language Processing

String Pooling in Java

The named entity detector in opennlp.tools has always been a little chunkier than I thought it should be. "I wouldn't go so far as to call the brother fat", but it could use a little less memory. So what can we do?

The root of the problem is that the 7 models loaded to find people, locations, dates, etc. all share the same context generator.  This means that if you run them on the same text, almost all of their features will be identical.  However, when the models built from these contexts are loaded, this relationship is ignored, and each model allocates a new String object on the heap for every context, even if an equal String already exists elsewhere.
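To make the duplication concrete, here is a minimal sketch of what happens when the same context string is read from two different streams. The feature name "w=john" and the class name are made up for illustration; the real model readers do more than this, but the String allocation behaves the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DuplicateContextDemo {
    // Write a string with writeUTF and read it back with readUTF,
    // roughly as a binary model file would be written and loaded.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return new DataInputStream(new ByteArrayInputStream(buf.toByteArray())).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "w=john" is a hypothetical context string shared by two models.
        String fromModelA = roundTrip("w=john");
        String fromModelB = roundTrip("w=john");
        System.out.println(fromModelA.equals(fromModelB)); // true: same characters
        System.out.println(fromModelA == fromModelB);      // false: two separate heap objects
    }
}
```

Multiply that second object by every shared context and by 7 models, and the extra memory adds up.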

One option is to add a new GISModel constructor that takes an existing model and then reuses the context strings from the passed-in model when they exist.  This turns out to be fairly complicated: the contexts are stored as keys in a hash, and while the hash is happy to tell you that it contains a key, there isn't an easy way to get the reference to the key object that the hash is using in its underlying data structure.  You could ask for all the keys and hope that it gives you the object references you want, but then you need a way to store them and retrieve them.  Most data structures which are good for this have the same issue we're trying to get around.  There is likely some way to get these references, but before I spent too much time looking for it, I tried something else.
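For what it's worth, one way around the "the hash won't hand back its key" problem is to store each string as both key and value, so a lookup returns the stored reference. This is only a sketch of that idea, not what I ended up doing; the StringPool class and its canonical() method are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class StringPool {
    // Each string maps to itself, so a lookup returns the pooled reference.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first-seen instance equal to s, adding s if none exists yet.
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```

Every model reader would still need access to the same StringPool instance, which is exactly the kind of plumbing I was hoping to avoid.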

A second option is to normalize the string references before they get put in the model.  It turns out that the model reader nicely abstracts the reading of strings with a readUTF() method.  Simply replace this method with one that returns a reference-normalized string and we're good to go.  All that is needed is some sort of string pooling object.  I suspected there would be some Apache Commons code to do this, but after a quick Google search, I found that a solution is actually built into Java's String class.  String contains an intern() method which returns the canonical representation of a String if one exists, or creates one if it doesn't.  This has the nice property that a context can be shared across models without actually specifying which models are sharing what.  The coding is also trivial: extend an existing GISModelReader, override readUTF() with { return super.readUTF().intern(); }, and voilà!
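The override pattern looks roughly like the sketch below. PlainReader is a stand-in I've invented here so the example compiles on its own; it is not the actual GISModelReader, and the real class has its own constructors and other read methods. Only the readUTF() override with intern() reflects the change described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a model reader; only the readUTF() hook matters here.
class PlainReader {
    private final DataInputStream in;
    PlainReader(byte[] data) { in = new DataInputStream(new ByteArrayInputStream(data)); }
    public String readUTF() throws IOException { return in.readUTF(); }
}

// The pooled variant: the one line of new code is the intern() call.
class PooledReader extends PlainReader {
    PooledReader(byte[] data) { super(data); }
    @Override
    public String readUTF() throws IOException { return super.readUTF().intern(); }
}

class PooledReaderDemo {
    // Encode a single string the way writeUTF would store it in a model file.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read one string from a reader, wrapping the checked exception.
    static String read(PlainReader reader) {
        try {
            return reader.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode("w=john"); // hypothetical context string
        String plainA = read(new PlainReader(data));
        String plainB = read(new PlainReader(data));
        String pooledA = read(new PooledReader(data));
        String pooledB = read(new PooledReader(data));
        System.out.println(plainA == plainB);   // false: one object per reader
        System.out.println(pooledA == pooledB); // true: both readers share one object
    }
}
```

Note that the two pooled readers never know about each other; the JVM's intern table does the sharing.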

So what does all this mean to the named entity detector?  Loading all 7 models previously took 350M. With the new PooledGISModelReader class implemented in all its one-line-of-new-code glory, you can do it in 200M. There are likely some other areas for optimization, but for this release, I'm happy with this improvement alone.