OpenNLP Documentation

Introduction

The OpenNLP project is the home of a set of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech (POS) tagging, chunking, parsing, named-entity detection, and coreference resolution.

These tools are not inherently useful by themselves, but can be integrated with other software to assist in the processing of text. As such, the intended audience of much of this documentation is software developers who are familiar with software development in Java.

In its previous life, OpenNLP held the common infrastructure code for the opennlp.grok project. The work previously done can be found in the final release of that project, available on the main project page.

What follows covers:

  1. Installing The Build Tools
  2. Building Instructions
  3. Maven Repository
  4. Downloading Models
  5. Running the Tools
  6. Training the Tools
  7. Bug Reports

Installing The Build Tools

The OpenNLP build system is based on Maven.

The current version and installation instructions can be found here:
http://maven.apache.org/download.html

Building Instructions

The build instructions only work for the source distribution of OpenNLP, because the binary distribution does not contain any source code.

Ok, let's build the code. First, make sure your current working directory is the one where the pom.xml file is located. If you have Maven installed, you can simply run "mvn install" from this directory.

If everything is set up correctly, this command compiles, tests, and installs OpenNLP into your local Maven repository. The tests can take a few minutes; be patient. A copy of the jar file will also be placed in the newly created target folder.

Maven Repository

OpenNLP is also distributed via a Maven repository. If you want to use OpenNLP with Maven, you should add our repository to your pom.xml file.

OpenNLP Repository:
<repository>
  <id>opennlp.sf.net</id>
  <url>http://opennlp.sourceforge.net/maven2</url>
</repository>

In order to use OpenNLP in your project, you must declare a dependency on opennlp.tools. The dependency on the opennlp.maxent project will be resolved automatically by Maven.

OpenNLP dependency:
<dependency>
  <groupId>opennlp</groupId>
  <artifactId>tools</artifactId>
  <version>1.5.0</version>
</dependency>
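
Putting the two snippets together, a minimal pom.xml for a project that uses OpenNLP might look like the sketch below. The com.example/myapp coordinates are placeholders for your own project; only the repository and dependency sections come from this document.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Placeholder coordinates: replace with your own project's values -->
  <groupId>com.example</groupId>
  <artifactId>myapp</artifactId>
  <version>1.0-SNAPSHOT</version>

  <repositories>
    <repository>
      <id>opennlp.sf.net</id>
      <url>http://opennlp.sourceforge.net/maven2</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>opennlp</groupId>
      <artifactId>tools</artifactId>
      <version>1.5.0</version>
    </dependency>
  </dependencies>
</project>
```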

Downloading Models

Models have been trained for the various components and are required unless one wishes to create models exclusively from one's own annotated data. The models can be downloaded via the "Models" link at opennlp.sourceforge.net. The models are large (especially the ones for the parser), so you may want to fetch only the specific ones you need.

Running the Tools

To run any of these tools on UNIX just type

bin/opennlp

or on Windows

java -jar opennlp-tools-1.5.0.jar

inside the opennlp directory of the binary distribution. The tools will then print out a list of possible commands for the various components.

Further instructions on how to use these tools can be found in our wiki.

Some of the components require their input to have been processed by the previous component in the pipeline. Most of the tools take a single argument, which is the location of the model for that component. The exceptions are coref, which requires a model directory, and the name finder, which can take a list of models.

Examples: These examples are just that, examples, and are not a recommendation that you run the tools this way; doing so is in fact very inefficient. However, they should give you an idea of what the tools can do and the kind of input they assume. Developers should look at the ComponentNameTool classes inside the opennlp.tools.cmdline package to see how to set up a particular component for use.
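
As a rough sketch of what embedding a component looks like, the following loads a sentence detector model through the opennlp.tools API. The model path assumes the "models" sub-directory used in the examples below; consult the Javadoc for the exact class signatures in your version.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectDemo {
    public static void main(String[] args) throws Exception {
        // Load the sentence model downloaded earlier (path is an assumption)
        InputStream modelIn = new FileInputStream("models/en-sent.bin");
        try {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Split a raw string into sentences
            String[] sentences = detector.sentDetect("First sentence. Second one.");
            for (String s : sentences) {
                System.out.println(s);
            }
        } finally {
            modelIn.close();
        }
    }
}
```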

The examples below use the English models and assume that you downloaded the models into a sub-directory called "models".

Sentence Detection and Tokenization:

bin/opennlp SentenceDetector models/en-sent.bin < text |
bin/opennlp TokenizerME models/en-token.bin

POS Tagging:

bin/opennlp SentenceDetector models/en-sent.bin < text |
bin/opennlp TokenizerME models/en-token.bin |
bin/opennlp POSTagger models/en-pos-maxent.bin

Phrase Chunking:

bin/opennlp SentenceDetector models/en-sent.bin < text |
bin/opennlp TokenizerME models/en-token.bin |
bin/opennlp POSTagger models/en-pos-maxent.bin |
bin/opennlp Chunker models/en-chunker.bin

Parsing:

bin/opennlp SentenceDetector models/en-sent.bin < text |
bin/opennlp TokenizerME models/en-token.bin |
bin/opennlp Parser models/en-parser.bin

Name Finding:

bin/opennlp SentenceDetector models/en-sent.bin < text |
bin/opennlp SimpleTokenizer |
bin/opennlp TokenNameFinder models/en-ner-person.bin models/en-ner-location.bin

English Coreference:

Note: for this example the opennlp jar, the maxent jar, and the jwnl jar must be on the classpath.
java opennlp.tools.lang.english.SentenceDetector \
  opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
  opennlp.models/english/tokenize/EnglishTok.bin.gz |
java -Xmx350m opennlp.tools.lang.english.TreebankParser -d \
  opennlp.models/english/parser |
java -Xmx350m opennlp.tools.lang.english.NameFinder -parse \
  opennlp.models/english/namefind/*.bin.gz |
java -Xmx200m -DWNSEARCHDIR=$WNSEARCHDIR \
  opennlp.tools.lang.english.TreebankLinker opennlp.models/english/coref

In the example above, $WNSEARCHDIR is the location of the "dict" directory of your WordNet 3.0 installation. WordNet can be obtained from the WordNet project site.

Training the Tools

There are training tools for all components except the coref component. Please consult the help message of each tool and the Javadoc to figure out how to train the tools.

The tutorials in our wiki might also be helpful.

Some modules also support training via the WordFreak opennlp.plugin v1.4 (http://wordfreak.sourceforge.net/plugins.html).

Note: In order to train a model you need all of the training data; there is currently no mechanism to update the models distributed with the project with additional data.
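
As an illustration, training a tokenizer model from the command line might look like the following. The flag names here are an assumption based on the 1.5 command line tools; run the trainer without arguments to see its actual help message before relying on them.

```shell
# Hedged sketch: verify flag names against "bin/opennlp TokenizerTrainer" help output.
# en-token.train is your own annotated training file.
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en \
  -data en-token.train -model en-token.bin
```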

Bug Reports

Please report bugs in the bug section of the OpenNLP SourceForge site:

sourceforge.net/tracker/?group_id=3368&atid=103368

Note: Incorrect automatic annotation of a specific piece of text does not constitute a bug. The best way to address such errors is to provide annotated data on which the automatic annotator/tagger can be trained, so that it might learn not to make those mistakes in the future.