annotation.org

Tools for Natural Language Processing

Annotation.org Blog

I am not a web developer...


I am not a web developer.  From afar, web development seems easy and not particularly complicated.  I always envisioned myself as a "real" application developer, or at least a server-side developer, who was above just generating HTML with code and whatever else web developers do.  It's been over ten years since I've worked directly with web technologies.  While I've read about these technologies and worked with people and on projects that used them, it's not the same as from-scratch development.  After this Van-Winkleian absence from the technology, I was in for a rude awakening.

The last time I developed for the web was in 1995, when knowing Perl and HTML were really all you needed to do web development.  Even then, all I did was debug an existing application by tweaking some Perl code to produce better reports and not generate any warnings.  At the time the basics of web development were simple, albeit tedious, and a good understanding of Perl and knowing where to find httpd logs went a long way.  Today the amount of technology that one typically interacts with to do even basic web development is daunting.  Coming from a world where one seeks to avoid war, even if it is just a file extension, I now feel I've lost my innocence to .war files and the trenches of the "web.xml" file.

In the past couple of weeks, I've learned the basics of configuring Apache, Tomcat, Postgres, and Eclipse for web development.  This was mostly a tedious process of looking on the web, trying something seemingly random, finding it doesn't work, rinse and repeat.  For some reason (and unlike xorg.conf files), I mostly found non-working configuration files published to the web, and well, they make poor examples.  Perhaps with my lack of experience, I should have purchased a book, or several, and done a lot more prep-reading.  This only occurs to me now, but it still may not have helped because I tended to choose new versions of most technologies.  I mean, who wants to learn the old/proven technologies?  This definitely made things harder, as user documentation and Q&A are often less prevalent or no longer accurate.  Who knew config.xml files are formatted differently in Tomcat 5.5 than in 5.0?  (this guy)  Fortunately, this part of web development seems to consist of one-off tasks which, once overcome, won't haunt me for too long.

The programming side of web development was less painful.  After an initial foray into Struts 2.0, I ended up using the Google Web Toolkit (GWT) for most of my application, with JSP to handle logins and some other preliminaries.  GWT translates client-side Java coded against Google's widgets into browser-agnostic JavaScript with AJAX functionality, which is just cool.  Working with GWT was mostly fun and productive as it leverages my existing Java and Swing skills.  I was able to do most of my application design and build a mock-up reasonably quickly by using stubs where database functionality would eventually go.  Layout was a pain, but that's the devil I know, with the added twist of including CSS attributes.  These unfortunately still have a voodoo-like quality for me.  Once I finished most of the layout and page flows, I designed my first database schema and wrote my first SQL queries.  This is kinda embarrassing after almost 20 years of coding, but in previous work, I was either insulated from the database by others, or used the file system or Berkeley DB Perl bindings to get the job done.  Fortunately, this was not particularly difficult, as the skills for designing a schema seem similar to creating data structures with pointers, and SQL is pretty straightforward.  Additionally, my wife is familiar with both these tasks, so in this case I did have reference materials.

The one benefit of having experience in development, even if not Web development, is that I have developed a sense of when things are ugly.  For instance, early on, I knew I didn't want very much logic on my JSP pages.  I don't have any hands-on experience to tell me why, but my intuitions about the maintainability of such pages kept yelling "yuck!".  When one page had a solution with a lot of client-side logic, I just kept searching until I found something I could stomach.  I ended up finding solutions to checking for a logged-in user and form validation which were much cleaner than the first examples I found. Likewise in Tomcat, modifying server.xml seemed like a poor deployment choice so I kept searching until I found talk of the context.xml file which allows this information to be deployed with the application.  Finally, I found a good description of wrapping exceptions so just one gets displayed on a default error page (hmm pretty).  Yesterday, I reached a milestone when I was able to register myself as a user in my application via the website, form validation, table insert, and everything.  I sat there looking at my silent Tomcat logs like a proud father.

I'm now looking forward to finishing up the mostly programming tasks that remain and with which I'm much more familiar.  This has been a good learning experience, and probably like any good learning experience, a little humbling.  Previously I might have said, "I am not a web developer" with an arrogance about the lack of complexity in the task.  Today, I say it with a tone of respect for the people who have paid their dues and generously written about them on the web, and to the statement "I am not a web developer" I append a determined "...yet".

Last Updated on Monday, 10 August 2009 22:56
 

String Pooling in Java


The named entity detector in opennlp.tools  has always been a little chunkier than I thought it should be. "I wouldn't go so far as to call the brother fat", but it could use a little less memory. So what can we do?

The basis for this is that the 7 models that are loaded to find people, locations, dates, etc. all share the same context generator.  This means that if you run them on the same text, almost all of their features will be identical.  However, when the models which are built from these contexts are loaded, this relationship is ignored and each context is allocated as a new String object on the heap, even if an equal String already exists elsewhere.
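To make the duplication concrete, here is a minimal sketch.  It assumes the strings are decoded with something like DataInputStream.readUTF(), and the feature name "w=Smith" and the class name are made up for illustration; this is not the actual opennlp loading code.  Reading the same bytes twice yields two equal but separate String objects:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only: two "models" decoding the same context string.
class DuplicateContextDemo {
    // Encodes one string in the format DataOutputStream.writeUTF produces.
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes one string; every call builds a fresh String object.
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode("w=Smith");          // hypothetical context string
        String fromPersonModel = decode(data);
        String fromLocationModel = decode(data);
        System.out.println(fromPersonModel.equals(fromLocationModel)); // true: same content
        System.out.println(fromPersonModel == fromLocationModel);      // false: two heap objects
    }
}
```

Multiply that second object by every shared context and by 7 models, and the extra memory adds up.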

One option is to add a new GISModel constructor that takes an existing model and then reuses the context strings from the passed-in model when they exist.  This turns out to be fairly complicated, as the contexts are in a hash, and while the hash is happy to tell you that it contains a key, there isn't an easy way to get the reference to the key that the hash is using in its underlying data structure.  You could ask for all the keys and hope that it gives you the object references you want, but then you need a way to store them and retrieve them.  Most data structures which are good for this have the same issue we're trying to get around.  There is likely some way to get these references, but before I spent too much time looking for it, I tried something else.
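As a rough illustration of the problem, here is a sketch using a plain java.util.HashMap as a stand-in for the model's hash (the real model may use a different structure, and the class and method names are invented for this example).  The map confirms it holds an equal key, but getting at the stored key object means walking the key set:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a HashMap answers "do you contain this key?"
// but has no direct call that hands back the key object it stores.
class KeyReferenceDemo {
    // Returns the key instance held by the map, or null if absent.
    // This is a linear scan over all keys, which is the awkward part.
    static String storedKey(Map<String, ?> map, String probe) {
        for (String key : map.keySet()) {
            if (key.equals(probe)) {
                return key;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String original = new String("w=Smith");   // hypothetical context string
        Map<String, Integer> contexts = new HashMap<>();
        contexts.put(original, 0);

        String probe = new String("w=Smith");      // equal content, different object
        System.out.println(contexts.containsKey(probe));            // true
        System.out.println(probe == original);                      // false
        System.out.println(storedKey(contexts, probe) == original); // true, after a scan
    }
}
```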

A second option is to normalize the string references before they get put in the model.  It turns out that the model reader nicely abstracts the reading of strings with a readUTF() method.  Simply replace this method with one that returns a reference-normalized string and we're good to go.  All that is needed is some sort of string pooling object.  I suspected there would be some apache.commons code to do this, but after a quick Google search, I found that a solution is actually built into Java's String class.  String contains an intern() method which returns a canonical representation of a String if one exists, or creates one if it doesn't.  This has the nice property that a context can be shared across models without actually specifying which models are sharing what.  The coding is also trivial; extend an existing GISModelReader, override readUTF() with { return super.readUTF().intern(); } and voilà!
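For readers who want to try the idea outside of opennlp, here is a self-contained sketch.  SketchModelReader is a hypothetical stand-in for the real model reader (I'm assuming only that it funnels string reads through an overridable readUTF() method), and the other names are likewise made up:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a model reader; not the opennlp class.
class SketchModelReader {
    private final DataInputStream in;

    SketchModelReader(InputStream source) {
        this.in = new DataInputStream(source);
    }

    // All string reads go through this one method.
    protected String readUTF() throws IOException {
        return in.readUTF();
    }
}

// The one-line change: hand back the pooled (interned) instance.
class PooledSketchModelReader extends SketchModelReader {
    PooledSketchModelReader(InputStream source) {
        super(source);
    }

    @Override
    protected String readUTF() throws IOException {
        return super.readUTF().intern();
    }
}

class PooledReaderDemo {
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads one string with either the plain or the pooled reader.
    static String readOnce(byte[] data, boolean pooled) {
        InputStream source = new ByteArrayInputStream(data);
        SketchModelReader reader =
                pooled ? new PooledSketchModelReader(source) : new SketchModelReader(source);
        try {
            return reader.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode("w=Smith"); // hypothetical context string
        System.out.println(readOnce(data, false) == readOnce(data, false)); // false: two copies
        System.out.println(readOnce(data, true) == readOnce(data, true));   // true: one shared copy
    }
}
```

Because the pool lives inside String itself, two readers that know nothing about each other still end up sharing the same object.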

So what does all this mean to the named entity detector?  To load all 7 models previously took 350M. With the new PooledGISModelReader class implemented in all its one line of new code glory, you can do it in 200M. There are likely some other areas for optimization, but for this release, I am happy with this improvement alone.

Last Updated on Monday, 10 August 2009 22:56