opennlp.tools.sentdetect
Class SentenceDetectorME

java.lang.Object
  extended by opennlp.tools.sentdetect.SentenceDetectorME
All Implemented Interfaces:
SentenceDetector

public class SentenceDetectorME
extends Object
implements SentenceDetector

A sentence detector for splitting up raw text into sentences.

A maximum entropy model is used to evaluate the characters ".", "!", and "?" in a string to determine if they signify the end of a sentence.
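
The candidate-selection step described above can be sketched without the library. This is a minimal illustration, not the class's actual implementation: the class name `EndOfSentenceCandidates` and its `find` method are hypothetical, and the sketch only locates the characters that a trained model would then classify as a split or not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: finds the characters that a
// sentence model would have to classify as "split" or "no split".
class EndOfSentenceCandidates {
    // The three candidate characters named in the class description.
    private static final String CANDIDATE_CHARS = ".!?";

    /** Returns the index of every '.', '!' or '?' in the text, in order. */
    static List<Integer> find(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (CANDIDATE_CHARS.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // The period after "Dr" is a candidate too; deciding that it is
        // not a sentence end is the model's job, not this scan's.
        System.out.println(find("Dr. Smith arrived. Was he late?"));
    }
}
```

The scan reports every candidate, including the one inside the abbreviation, which is why a classifier is needed on top of it.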


Field Summary
static String NO_SPLIT
          Constant indicating no sentence split.
static String SPLIT
          Constant indicating a sentence split.
protected  boolean useTokenEnd
           
 
Constructor Summary
SentenceDetectorME(SentenceModel model)
          Initializes the current instance.
SentenceDetectorME(SentenceModel model, Factory factory)
           
 
Method Summary
 double[] getSentenceProbabilities()
          Returns the probabilities associated with the most recent call to sentDetect().
protected  boolean isAcceptableBreak(String s, int fromIndex, int candidateIndex)
          Allows subclasses to stop an overzealous (read: poorly trained) model from flagging obvious non-breaks as breaks, based on some boolean determination of a break's acceptability.
static void main(String[] args)
          Trains a new sentence detection model.
 String[] sentDetect(String s)
          Detect sentences in a String.
 Span[] sentPosDetect(String s)
          Detect the positions of the sentences in a String.
static SentenceModel train(String languageCode, ObjectStream<SentenceSample> samples, boolean useTokenEnd, Dictionary abbreviations)
           
static SentenceModel train(String languageCode, ObjectStream<SentenceSample> samples, boolean useTokenEnd, Dictionary abbreviations, int cutoff, int iterations)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SPLIT

public static final String SPLIT
Constant indicating a sentence split.

See Also:
Constant Field Values

NO_SPLIT

public static final String NO_SPLIT
Constant indicating no sentence split.

See Also:
Constant Field Values

useTokenEnd

protected boolean useTokenEnd

Constructor Detail

SentenceDetectorME

public SentenceDetectorME(SentenceModel model)
Initializes the current instance.

Parameters:
model - the SentenceModel

SentenceDetectorME

public SentenceDetectorME(SentenceModel model,
                          Factory factory)

Method Detail

sentDetect

public String[] sentDetect(String s)
Detect sentences in a String.

Specified by:
sentDetect in interface SentenceDetector
Parameters:
s - The string to be processed.
Returns:
A string array containing individual sentences as elements.

sentPosDetect

public Span[] sentPosDetect(String s)
Detect the positions of the sentences in a String.

Specified by:
sentPosDetect in interface SentenceDetector
Parameters:
s - The string to be processed.
Returns:
An array of Span objects, one per detected sentence, each giving the position of that sentence in s.
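
To show how position results relate to the strings returned by sentDetect, here is a small library-free sketch. It uses plain `{start, end}` integer pairs as a hypothetical stand-in for Span objects and assumes the end offset is exclusive, as in `String.substring`; the class name `SpanOffsets` and the offsets in `main` are made up for illustration.

```java
// Hypothetical stand-in for working with Span results: each int pair is
// {start, end} with an exclusive end offset (an assumption of this sketch).
class SpanOffsets {
    /** Cuts one substring out of the text for each {start, end} pair. */
    static String[] slice(String text, int[][] offsets) {
        String[] sentences = new String[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            sentences[i] = text.substring(offsets[i][0], offsets[i][1]);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "First one. Second one.";
        // Hand-written offsets, not produced by a real model.
        int[][] offsets = { {0, 10}, {11, 22} };
        for (String sentence : slice(text, offsets)) {
            System.out.println(sentence);
        }
    }
}
```

Working with offsets rather than strings keeps the link back to the original text, which is useful when later annotations need character positions.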

getSentenceProbabilities

public double[] getSentenceProbabilities()
Returns the probabilities associated with the most recent call to sentDetect().

Returns:
The probability of each sentence returned by the most recent call to sentDetect. If not applicable, an empty array is returned.
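
One possible use of the per-sentence probabilities is to flag uncertain splits for review. The sketch below is library-free and hypothetical: it takes the sentence array and the probability array as plain parameters, assumes they are parallel (element i of one belongs to element i of the other), and the class name, method name, and threshold are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: pairs each sentence with its
// probability and collects the ones the model was least sure about.
class LowConfidenceSentences {
    /** Returns the sentences whose probability is strictly below the threshold. */
    static List<String> below(String[] sentences, double[] probabilities, double threshold) {
        if (sentences.length != probabilities.length) {
            // Covers the "not applicable" case, where the probability array is empty.
            throw new IllegalArgumentException("arrays must have the same length");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            if (probabilities[i] < threshold) {
                flagged.add(sentences[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Made-up values; real ones would come from the detector.
        String[] sentences = { "A.", "B.", "C." };
        double[] probabilities = { 0.5, 0.75, 0.25 };
        System.out.println(below(sentences, probabilities, 0.5));
    }
}
```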

isAcceptableBreak

protected boolean isAcceptableBreak(String s,
                                    int fromIndex,
                                    int candidateIndex)
Allows subclasses to stop an overzealous (read: poorly trained) model from flagging obvious non-breaks as breaks, based on some boolean determination of a break's acceptability.

The implementation here always returns true, which means that the MaxentModel's outcome is taken as is.

Parameters:
s - the string in which the break occurred.
fromIndex - the start of the segment currently being evaluated
candidateIndex - the index of the candidate sentence ending
Returns:
true if the break is acceptable
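
As an illustration of the kind of check a subclass might place in an override, the following library-free sketch rejects a break when the token ending at the candidate is a known abbreviation. It is a sketch under assumptions: the abbreviation list is invented, `candidateIndex` is taken to be the index of the terminator character itself, and the method is written as a static helper in a hypothetical class rather than as an actual override.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper mirroring the parameter list of isAcceptableBreak.
class AbbreviationBreakCheck {
    // Invented abbreviation list, for illustration only.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Dr.", "Mr.", "Mrs.", "Prof."));

    /**
     * Returns false when the whitespace-delimited token that ends at
     * candidateIndex is a known abbreviation, true otherwise.
     */
    static boolean isAcceptableBreak(String s, int fromIndex, int candidateIndex) {
        // Walk back to the start of the token, but never before fromIndex.
        int tokenStart = candidateIndex;
        while (tokenStart > fromIndex
                && !Character.isWhitespace(s.charAt(tokenStart - 1))) {
            tokenStart--;
        }
        // Include the terminator character in the token.
        String token = s.substring(tokenStart, candidateIndex + 1);
        return !ABBREVIATIONS.contains(token);
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived. Was he late?";
        System.out.println(isAcceptableBreak(text, 0, 2));   // period after "Dr"
        System.out.println(isAcceptableBreak(text, 0, 17));  // period after "arrived"
    }
}
```

A veto of this kind only removes breaks the model proposed; it cannot add breaks the model missed.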

train

public static SentenceModel train(String languageCode,
                                  ObjectStream<SentenceSample> samples,
                                  boolean useTokenEnd,
                                  Dictionary abbreviations)
                           throws IOException
Throws:
IOException

train

public static SentenceModel train(String languageCode,
                                  ObjectStream<SentenceSample> samples,
                                  boolean useTokenEnd,
                                  Dictionary abbreviations,
                                  int cutoff,
                                  int iterations)
                           throws IOException
Throws:
IOException

main

public static void main(String[] args)
                 throws IOException

Trains a new sentence detection model.

Usage: opennlp.tools.sentdetect.SentenceDetectorME data_file new_model_name (iterations cutoff)?

Parameters:
args - the command-line arguments, in the order given in the usage line above
Throws:
IOException


Copyright © 2010. All Rights Reserved.