The OpenNLP project is now the home of a set of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution.
These tools are not end-user applications in themselves, but can be integrated with other software to assist in the processing of text. As such, the intended audience of much of this documentation is software developers who are familiar with software development in Java.
In its previous life, OpenNLP hosted the common infrastructure code for the opennlp.grok project. That earlier work can be found in the final release of that project, available on the main project page.
What follows covers:
The current version and installation instructions can be found here:
The build instructions apply only to the source distribution of OpenNLP; the binary distribution does not contain any source code.
Now let's build the code. First, make sure your current working directory is the one containing the pom.xml file. If you have Maven installed, you can simply run "mvn install" from this directory.
If everything is set up correctly, this command compiles, tests, and installs OpenNLP into your local Maven repository. The tests can take a few minutes, so be patient. A copy of the jar file is also placed in the newly created target folder.
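Concretely, the build boils down to the commands below. The directory name is illustrative; use your own checkout location.

```shell
# Change into the directory containing pom.xml
# (the directory name here is illustrative).
cd opennlp

# Compile, run the tests, and install OpenNLP into the
# local Maven repository (typically ~/.m2/repository).
mvn install

# A copy of the jar also lands in the newly created target folder.
ls target/*.jar
```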
OpenNLP is also distributed via a Maven repository. If you want to use OpenNLP with Maven, add our repository to your pom.xml file.
In order to use OpenNLP in your project, you must declare a dependency on opennlp.tools. The dependency on the opennlp.maxent project is then resolved automatically by Maven.
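A dependency declaration along these lines can then be added to your pom.xml. The coordinates shown are illustrative and may differ depending on the OpenNLP version and repository you use, so verify them against the repository before relying on them:

```xml
<dependencies>
  <dependency>
    <!-- Coordinates are illustrative; check the repository for exact values. -->
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.5.0</version>
  </dependency>
</dependencies>
```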
Pre-trained models are available for the various components and are required unless you wish to create your own models exclusively from your own annotated data. These models can be downloaded by clicking here or via the "Models" link at opennlp.sourceforge.net. The models are large (especially the ones for the parser), so you may want to fetch only the specific ones you need.
To run any of these tools on UNIX
or on Windows
java -jar opennlp-tools-1.5.0.jar
inside the opennlp directory of the binary distribution. The tools will then print out a list of possible commands for the various components.
Further instructions on how to use these tools can be found in our wiki.
Some of the components require input that has already been processed by the previous component. Most of them take a single argument: the location of the model for that component. The exceptions are coref, which requires a model directory, and the name finder, which can also take a list of models.
Examples: These examples are just that, examples, and are not a recommendation that you run the tools this way; doing so is in fact very inefficient. However, they should give you an idea of what the tools can do and the kind of input they assume. Developers should look at the ComponentNameTool classes inside the opennlp.tools.cmdline package to see how to set up a particular component for use.
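As a sketch of what such programmatic use looks like, the following loads a sentence-detector model and applies it. The model path is an assumption (it matches the "models" sub-directory used in the examples below), and the class/main scaffolding is ours, not part of OpenNLP:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectExample {
    public static void main(String[] args) throws Exception {
        // Assumed path: the en-sent.bin model downloaded into a
        // local "models" sub-directory.
        InputStream modelIn = new FileInputStream("models/en-sent.bin");
        try {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);
            // Split raw text into one sentence per array element.
            String[] sentences = detector.sentDetect(
                "OpenNLP is a machine learning toolkit. It handles common NLP tasks.");
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        } finally {
            modelIn.close();
        }
    }
}
```

The other components follow the same pattern: load the model from a stream, construct the corresponding ME class, and call its processing method.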
The examples below use the English models and assume that you downloaded the models into a sub-directory called "models".
bin/opennlp SentenceDetector models/en-sent.bin < text | bin/opennlp TokenizerME models/en-token.bin
bin/opennlp SentenceDetector models/en-sent.bin < text | bin/opennlp TokenizerME models/en-token.bin | bin/opennlp POSTagger models/en-pos-maxent.bin
bin/opennlp SentenceDetector models/en-sent.bin < text | bin/opennlp TokenizerME models/en-token.bin | bin/opennlp POSTagger models/en-pos-maxent.bin | bin/opennlp Chunker models/en-chunker.bin
bin/opennlp SentenceDetector models/en-sent.bin < text | bin/opennlp TokenizerME models/en-token.bin | bin/opennlp Parser models/en-parser.bin
bin/opennlp SentenceDetector models/en-sent.bin < text | bin/opennlp SimpleTokenizer | bin/opennlp TokenNameFinder models/en-ner-person.bin models/en-ner-location.bin
java opennlp.tools.lang.english.SentenceDetector \
    opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
    opennlp.models/english/tokenize/EnglishTok.bin.gz |
java -Xmx350m opennlp.tools.lang.english.TreebankParser -d \
    opennlp.models/english/parser |
java -Xmx350m opennlp.tools.lang.english.NameFinder -parse \
    opennlp.models/english/namefind/*.bin.gz |
java -Xmx200m -DWNSEARCHDIR=$WNSEARCHDIR \
    opennlp.tools.lang.english.TreebankLinker opennlp.models/english/coref
In the example above, $WNSEARCHDIR is the location of the "dict" directory of your WordNet 3.0 installation. WordNet can be obtained here.
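Before running the coreference linker, set $WNSEARCHDIR to point at WordNet's "dict" directory; the installation path below is only an example:

```shell
# Example only: adjust to wherever WordNet 3.0 is installed on your system.
export WNSEARCHDIR=/usr/local/WordNet-3.0/dict
```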
There are training tools for all components except the coref component. Please consult the help message of each tool and the Javadoc to figure out how to train them.
The tutorials in our wiki might also be helpful.
The following modules currently support training via the WordFreak opennlp.plugin v1.4 (http://wordfreak.sourceforge.net/plugins.html).
Note: In order to train a model you need all of the training data; there is currently no mechanism to update the models distributed with the project with additional data.
Please report bugs at the bug section of the OpenNLP sourceforge site:
Note: An incorrect automatic annotation on a specific piece of text does not constitute a bug. The best way to address such errors is to provide annotated data on which the automatic annotator/tagger can be trained, so that it might learn not to make these mistakes in the future.