Contains classes related to finding tokens or words in a string.


Interface Summary
Detokenizer A Detokenizer merges tokens back to their untokenized representation.
TokenContextGenerator Interface for TokenizerME context generators.
Tokenizer The interface for tokenizers, which segment a string into its tokens.
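The Tokenizer contract above can be sketched in a minimal self-contained example. The class and record names below are illustrative only (not OpenNLP's classes); the idea is that tokenize(String) returns the token strings while tokenizePos(String) returns their character offsets:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    /** A token's [start, end) character offsets, analogous to a span type. */
    record Span(int start, int end) {}

    /** Hypothetical interface mirroring the tokenize/tokenizePos contract. */
    interface Tokenizer {
        String[] tokenize(String s);
        Span[] tokenizePos(String s);
    }

    /** Illustrative whitespace-splitting implementation of the contract. */
    static class WhitespaceSketch implements Tokenizer {
        public Span[] tokenizePos(String s) {
            List<Span> spans = new ArrayList<>();
            int start = -1;
            for (int i = 0; i <= s.length(); i++) {
                boolean ws = i == s.length() || Character.isWhitespace(s.charAt(i));
                if (!ws && start < 0) start = i;   // a token begins here
                if (ws && start >= 0) {            // a token ends here
                    spans.add(new Span(start, i));
                    start = -1;
                }
            }
            return spans.toArray(new Span[0]);
        }

        public String[] tokenize(String s) {
            Span[] spans = tokenizePos(s);
            String[] tokens = new String[spans.length];
            for (int i = 0; i < spans.length; i++)
                tokens[i] = s.substring(spans[i].start(), spans[i].end());
            return tokens;
        }
    }

    public static void main(String[] args) {
        Tokenizer t = new WhitespaceSketch();
        System.out.println(String.join("|", t.tokenize("  Hello  world ")));
        // Hello|world
    }
}
```

Returning spans rather than strings is useful when the tokens must later be mapped back to positions in the original text, as TokenSample does.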

Class Summary
DefaultTokenContextGenerator Generates events for maxent decisions for tokenization.
DictionaryDetokenizer A rule-based detokenizer.
SimpleTokenizer Performs tokenization using character classes.
TokenizerEvaluator The TokenizerEvaluator measures the performance of the given Tokenizer with the provided reference TokenSamples.
TokenizerME A Tokenizer for converting raw text into separated tokens.
TokenizerModel The TokenizerModel is the model used by a learnable Tokenizer.
TokenizerStream The TokenizerStream uses a tokenizer to tokenize the input string and output TokenSamples.
TokenSample A TokenSample is text with token spans.
TokenSampleStream This class is a stream filter which reads in string encoded samples and creates TokenSamples out of them.
TokSpanEventStream This class reads the TokenSamples from the given Iterator and converts the TokenSamples into Events which can be used by the maxent library for training.
WhitespaceTokenizer This tokenizer uses whitespace to tokenize the input text.
WhitespaceTokenStream This stream formats TokenSamples into whitespace-separated token strings.

Enum Summary
Detokenizer.DetokenizationOperation This enum contains an operation for every token to merge the tokens together to their detokenized form.
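How a per-token merge operation drives detokenization can be illustrated with a short self-contained sketch. The enum values and the merge loop below are modeled on the operations described above but are not OpenNLP's implementation:

```java
public class DetokenizeSketch {

    // Illustrative subset of per-token detokenization operations.
    enum Operation { MERGE_TO_LEFT, MERGE_TO_RIGHT, NO_OPERATION }

    /** Joins tokens, suppressing the space where an operation demands a merge. */
    static String detokenize(String[] tokens, Operation[] ops) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Insert a space unless this token merges left
            // or the previous token merges right.
            if (i > 0 && ops[i] != Operation.MERGE_TO_LEFT
                      && ops[i - 1] != Operation.MERGE_TO_RIGHT) {
                text.append(' ');
            }
            text.append(tokens[i]);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"\"", "Hello", "world", "!", "\""};
        Operation[] ops = {
            Operation.MERGE_TO_RIGHT, // opening quote attaches to the next token
            Operation.NO_OPERATION,
            Operation.NO_OPERATION,
            Operation.MERGE_TO_LEFT,  // "!" attaches to the previous token
            Operation.MERGE_TO_LEFT   // closing quote attaches to the previous token
        };
        System.out.println(detokenize(tokens, ops)); // "Hello world!"
    }
}
```

A rule-based detokenizer such as DictionaryDetokenizer assigns one such operation to each token before the tokens are joined.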

Package Description

Contains classes related to finding tokens or words in a string. All tokenizers implement the Tokenizer interface. Currently there are the learnable TokenizerME, the WhitespaceTokenizer, and the SimpleTokenizer, which is a character-class tokenizer.
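The character-class approach mentioned above can be sketched as follows. This is a self-contained illustration of the idea behind SimpleTokenizer, not OpenNLP's code: a new token starts whenever the character class (letter, digit, whitespace, other) changes, and whitespace runs are discarded:

```java
import java.util.ArrayList;
import java.util.List;

public class CharClassSketch {

    /** Maps a character to a coarse class: 0=whitespace, 1=letter, 2=digit, 3=other. */
    private static int classOf(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3; // punctuation and everything else
    }

    /** Splits the string at every character-class boundary, dropping whitespace. */
    static String[] tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            if (i == s.length() || classOf(s.charAt(i)) != classOf(s.charAt(start))) {
                if (!s.isEmpty() && classOf(s.charAt(start)) != 0) {
                    tokens.add(s.substring(start, i)); // keep non-whitespace runs
                }
                start = i;
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", tokenize("It's 2010!")));
        // It|'|s|2010|!
    }
}
```

Unlike this fixed rule, the learnable TokenizerME decides each potential split with a trained maxent model, so it can keep strings like "It's" together or apart depending on the training data.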

Copyright © 2010. All Rights Reserved.