java.lang.Object
  opennlp.tools.tokenize.TokenizerME
public class TokenizerME
A Tokenizer for converting raw text into separated tokens. It uses
Maximum Entropy to make its decisions. The features are loosely
based on Jeff Reynar's UPenn thesis "Topic Segmentation:
Algorithms and Applications", which is available from his
homepage.
This tokenizer needs a statistical model to tokenize text; it reproduces
the tokenization observed in the training data used to create the model.
The TokenizerModel class encapsulates the model and provides
methods to create it from its binary representation.
A tokenizer instance is not thread safe. Each thread must instantiate its
own tokenizer; the tokenizers can share one TokenizerModel instance
to save memory.
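The threading rule above can be illustrated without the library itself. In the sketch below, `SharedModel` and `SketchTokenizer` are hypothetical stand-ins for TokenizerModel and TokenizerME (the stand-in simply splits on whitespace), and `tokenizeAll` is an illustrative helper, not part of this API. It shows one possible arrangement: a single model shared by all threads, and one tokenizer per thread obtained through a `ThreadLocal`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadTokenizerSketch {

    // Hypothetical stand-in for TokenizerModel: immutable, so it is safe to share.
    static final class SharedModel {
        final String delimiterRegex = "\\s+";
    }

    // Hypothetical stand-in for TokenizerME: keeps per-call state, so it is not thread safe.
    static final class SketchTokenizer {
        private final SharedModel model;
        private int lastTokenCount; // mutable state, analogous to stored token probabilities

        SketchTokenizer(SharedModel model) {
            this.model = model;
        }

        String[] tokenize(String s) {
            String[] tokens = s.trim().split(model.delimiterRegex);
            lastTokenCount = tokens.length;
            return tokens;
        }
    }

    // Illustrative helper: tokenizes every sentence on a small thread pool.
    static List<String[]> tokenizeAll(List<String> sentences, int threads) {
        SharedModel model = new SharedModel(); // one model for all threads
        // One tokenizer per thread, created lazily on first use in that thread.
        ThreadLocal<SketchTokenizer> tokenizer =
                ThreadLocal.withInitial(() -> new SketchTokenizer(model));

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String[]>> futures = new ArrayList<>();
            for (String sentence : sentences) {
                futures.add(pool.submit(() -> tokenizer.get().tokenize(sentence)));
            }
            List<String[]> results = new ArrayList<>();
            for (Future<String[]> future : futures) {
                results.add(future.get()); // results keep the input order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real classes, the same shape would apply: load the TokenizerModel once, then construct a TokenizerME from it inside each thread.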
To train a new model, the train(String, ObjectStream, boolean) method
can be used.
Sample usage:
InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String tokens[] = tokenizer.tokenize("A sentence to be tokenized.");
See Also:
Tokenizer, TokenizerModel, TokenSample
Field Summary | |
---|---|
static Pattern | alphaNumeric - Alpha-Numeric Pattern |
static String | NO_SPLIT - Constant indicates no token split. |
static String | SPLIT - Constant indicates a token split. |
Constructor Summary | |
---|---|
TokenizerME(TokenizerModel model) |
Method Summary | |
---|---|
double[] | getTokenProbabilities() - Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String). |
String[] | tokenize(String s) - Splits a string into its atomic parts. |
Span[] | tokenizePos(String d) - Tokenizes the string. |
static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization) - Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations. |
static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, int cutoff, int iterations) - Trains a model for the TokenizerME. |
boolean | useAlphaNumericOptimization() - Returns the value of the alpha-numeric optimization flag. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String SPLIT
public static final String NO_SPLIT
public static final Pattern alphaNumeric
Constructor Detail |
---|
public TokenizerME(TokenizerModel model)
Method Detail |
---|
public double[] getTokenProbabilities()
Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String).
public Span[] tokenizePos(String d)
Tokenizes the string.
Parameters:
d - The string to be tokenized.
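tokenizePos(String) reports where each token sits in the input rather than returning the token text. The sketch below illustrates that relationship with plain Java. `SketchSpan` is a hypothetical stand-in for the Span type, assuming a start offset that is inclusive and an end offset that is exclusive, and whitespace splitting stands in for the model's decisions.

```java
import java.util.ArrayList;
import java.util.List;

class SpanSketch {

    // Hypothetical stand-in for Span: [start, end) character offsets into the input.
    static final class SketchSpan {
        final int start;
        final int end;

        SketchSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        String coveredText(CharSequence text) {
            return text.subSequence(start, end).toString();
        }
    }

    // Whitespace-only stand-in for tokenizePos(String): one span per run of non-whitespace.
    static List<SketchSpan> tokenizePos(String d) {
        List<SketchSpan> spans = new ArrayList<>();
        int tokenStart = -1;
        for (int i = 0; i <= d.length(); i++) {
            boolean atBoundary = i == d.length() || Character.isWhitespace(d.charAt(i));
            if (!atBoundary && tokenStart < 0) {
                tokenStart = i; // first character of a new token
            } else if (atBoundary && tokenStart >= 0) {
                spans.add(new SketchSpan(tokenStart, i)); // end offset is exclusive
                tokenStart = -1;
            }
        }
        return spans;
    }

    // Mapping spans back to strings gives the tokenize(String) view of the same result.
    static String[] toStrings(List<SketchSpan> spans, String d) {
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = spans.get(i).coveredText(d);
        }
        return tokens;
    }
}
```

Offsets are useful when the original text must be preserved, for example to highlight tokens in place or to align annotations with the untouched input.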
public static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, int cutoff, int iterations) throws IOException
Trains a model for the TokenizerME.
Parameters:
languageCode - the language of the natural text
samples - the samples used for the training
useAlphaNumericOptimization - if true, alpha numerics are skipped
cutoff - number of times a feature must be seen to be considered
iterations - number of iterations to train the maxent model
Returns:
the trained TokenizerModel
Throws:
IOException - if an IOException is thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
public static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization) throws IOException, ObjectStreamException
Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations.
Parameters:
languageCode - the language of the natural text
samples - the samples used for the training
useAlphaNumericOptimization - if true, alpha numerics are skipped
Returns:
the trained TokenizerModel
Throws:
IOException - if an IOException is thrown during IO operations on a temp file which is created during training.
ObjectStreamException - if reading from the ObjectStream fails
public boolean useAlphaNumericOptimization()
Returns the value of the alpha-numeric optimization flag.
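The alpha-numeric optimization flag determines whether chunks made only of letters and digits are passed to the model at all. The sketch below shows one way such a pre-filter could look. The regular expression is an assumption (this page names the alphaNumeric field but does not show its value), and `needsModel` is a hypothetical helper, not a method of this class.

```java
import java.util.regex.Pattern;

class AlphaNumericSketch {

    // Assumed regex; the page documents the alphaNumeric field but not its pattern.
    static final Pattern ALPHA_NUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Hypothetical helper: true when a whitespace-delimited chunk would still
    // have to be examined by the model for internal split points.
    static boolean needsModel(String chunk, boolean useAlphaNumericOptimization) {
        if (chunk.length() < 2) {
            return false; // a single character cannot be split further
        }
        if (useAlphaNumericOptimization && ALPHA_NUMERIC.matcher(chunk).matches()) {
            return false; // letters and digits only: kept as one token, model skipped
        }
        return true;
    }
}
```

Under these assumptions, a chunk such as "tokenized." still goes to the model because of the trailing period, while "sentence" is skipped when the flag is true. The trade-off is speed against the ability to split inside purely alphanumeric strings.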
public String[] tokenize(String s)
Splits a string into its atomic parts.
Specified by:
tokenize in interface Tokenizer
Parameters:
s - The string to be tokenized.
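The cutoff parameter of train(...) is described as the number of times a feature must be seen to be considered. A minimal sketch of that idea follows. `applyCutoff` and the feature strings are hypothetical, and treating the threshold as inclusive (count >= cutoff) is an assumption drawn from the parameter description, not from the training code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CutoffSketch {

    // Hypothetical helper: keeps only the features observed at least `cutoff` times.
    static Set<String> applyCutoff(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum); // count each occurrence
        }
        counts.values().removeIf(count -> count < cutoff); // drop rare features
        return counts.keySet();
    }
}
```

A higher cutoff yields a smaller model that ignores rare evidence; with little training data, the default of 5 may discard most features, which is one reason the five-argument train method exposes the value.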