edu.cmu.taghelpertools.middle_layer.util
Class SimpleFeatureSpaceBuilder

java.lang.Object
  extended by edu.cmu.taghelpertools.middle_layer.util.SimpleFeatureSpaceBuilder

public class SimpleFeatureSpaceBuilder
extends java.lang.Object

SimpleFeatureSpaceBuilder is a simple implementation that covers the common needs of building a feature space from either a list of texts or a .dic file. Example:


 ...
 
 SimpleFeatureSpaceBuilder simpleBuilder = new SimpleFeatureSpaceBuilder(); 
 simpleBuilder.build(texts);  // texts is an ArrayList<String>
 Iterator itr = simpleBuilder.getBinaryFeatureIterator(text);  // text is a single String
  
 ...
 

Author:
Hao-Chuan Wang

Constructor Summary
SimpleFeatureSpaceBuilder()
          Instantiating a SimpleFeatureSpaceBuilder object with the default feature selection options: punctuation, unigrams, bigrams, POS bigrams, line length, contains non-stopwords, remove rare features (threshold=5), remove stopwords=true, stemming=true
SimpleFeatureSpaceBuilder(java.lang.String options)
          Instantiating a SimpleFeatureSpaceBuilder object with customized feature selection options
 
Method Summary
 java.util.ArrayList build(java.util.ArrayList<java.lang.String> texts)
          Building an attribute space from the list of texts from which attributes should be extracted
 java.util.ArrayList build(java.lang.String dicFilePath)
          Building an attribute space from a dic file
 java.util.Iterator getBinaryFeatureIterator()
          Retrieve the binary dimensions of the attribute space
 java.util.Iterator getBinaryFeatureIterator(java.lang.String text)
          Retrieve the binary attributes which are true for the given text
 java.util.ArrayList getBinaryFeatures()
          Retrieve the binary dimensions of the attribute space
 java.util.Iterator getNumericFeatureIterator()
          Retrieving an iterator over the names of all existing numeric attributes
 java.util.ArrayList getNumericFeatures()
          Retrieving a list of names of all existing numeric attributes
 java.util.ArrayList<java.lang.Double> getNumericFeatureValues(java.lang.String text, java.util.ArrayList featureNames)
          Retrieving values of numeric features for the given text.
 java.lang.String printoutOptions()
          Print out the feature selection options and also return the string that has been printed
 void writeDicFile(java.lang.String dicFilePath)
          Output the built attribute space to a dictionary file, which can be reloaded and reused later
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleFeatureSpaceBuilder

public SimpleFeatureSpaceBuilder(java.lang.String options)
Instantiating a SimpleFeatureSpaceBuilder object with customized feature selection options

Parameters:
options - a list of feature selection options separated by spaces

options:

 -stop  --remove stopwords 
 -stem  --do stemming
 -rr            --remove rare words
 -rt (integer value)    
                --set the threshold of rare word removal to an integer value, 
                specifying this option automatically turns on -rr
 -f {uni, bi, posbi, punc, ll, cns}  
                --set whether to include the following features: unigrams, bigrams, 
                POS bigrams, punctuation, line length, and containing non-stopwords
 -lang {eng, ger, chi}
                --set the language of the dataset; the default language setting can be specified in
                default_lang_setting.txt (English, German, or Chinese (extra module required; 
                the Chinese module is licensed separately))

example (remove stopwords, do stemming, set language to German, include unigram, bigram and pos bigram features):

-stop -stem -lang ger -f uni -f bi -f posbi

Also see edu.cmu.taghelpertools.middle_layer.Tester for a concrete example
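
The option syntax above can be illustrated with a small stand-alone parser. This is only a sketch of how such a space-separated option string could be tokenized; it is not the TagHelperTools implementation. The class OptionStringSketch, its fields, and its error handling are hypothetical, and the threshold default of 5 is assumed to mirror the default documented for the no-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of TagHelperTools.
class OptionStringSketch {
    boolean removeStopwords;                 // set by -stop
    boolean stemming;                        // set by -stem
    boolean removeRare;                      // set by -rr, or implied by -rt
    int rareThreshold = 5;                   // assumption: mirrors the documented default threshold
    String language = null;                  // null means "not specified in the option string"
    final List<String> features = new ArrayList<>();  // values given with -f

    // Splits the option string on whitespace and interprets each flag.
    static OptionStringSketch parse(String options) {
        OptionStringSketch parsed = new OptionStringSketch();
        String trimmed = options.trim();
        if (trimmed.isEmpty()) {
            return parsed;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            switch (tokens[i]) {
                case "-stop":
                    parsed.removeStopwords = true;
                    break;
                case "-stem":
                    parsed.stemming = true;
                    break;
                case "-rr":
                    parsed.removeRare = true;
                    break;
                case "-rt":
                    // Specifying a threshold automatically turns on rare-word removal.
                    parsed.rareThreshold = Integer.parseInt(argument(tokens, ++i, "-rt"));
                    parsed.removeRare = true;
                    break;
                case "-f":
                    parsed.features.add(argument(tokens, ++i, "-f"));
                    break;
                case "-lang":
                    parsed.language = argument(tokens, ++i, "-lang");
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + tokens[i]);
            }
        }
        return parsed;
    }

    // Returns the value following a flag, or fails if the string ends early.
    private static String argument(String[] tokens, int index, String flag) {
        if (index >= tokens.length) {
            throw new IllegalArgumentException(flag + " requires a value");
        }
        return tokens[index];
    }

    public static void main(String[] args) {
        OptionStringSketch o = parse("-stop -stem -lang ger -f uni -f bi -f posbi");
        System.out.println("language=" + o.language + " features=" + o.features);
    }
}
```

Note that each feature is passed with its own -f flag, as in the documented example, rather than as a comma-separated list.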


SimpleFeatureSpaceBuilder

public SimpleFeatureSpaceBuilder()
Instantiating a SimpleFeatureSpaceBuilder object with the default feature selection options:

 punctuation, 
 unigrams, 
 bigrams, 
 POS bigrams, 
 line length, 
 contains non-stopwords, 
 remove rare features (threshold=5), 
 remove stopwords=true, 
 stemming=true 
 

Method Detail

build

public java.util.ArrayList build(java.lang.String dicFilePath)
Building an attribute space from a dic file

Parameters:
dicFilePath - the path of the input dictionary file
Returns:
all the attributes loaded from the dictionary file (including binary and numeric attributes)

writeDicFile

public void writeDicFile(java.lang.String dicFilePath)
Output the built attribute space to a dictionary file, which can be reloaded and reused later

Parameters:
dicFilePath - the path of the output dictionary file

printoutOptions

public java.lang.String printoutOptions()
Print out the feature selection options and also return the string that has been printed

Returns:
String that has been printed

build

public java.util.ArrayList build(java.util.ArrayList<java.lang.String> texts)
Building an attribute space from the list of texts from which attributes should be extracted

Parameters:
texts - a list of strings
Returns:
all the attributes extracted from the texts (including binary and numeric attributes)

getBinaryFeatureIterator

public java.util.Iterator getBinaryFeatureIterator()
Retrieve the binary dimensions of the attribute space

Returns:
java.util.Iterator -- the iterator of binary attributes

getBinaryFeatures

public java.util.ArrayList getBinaryFeatures()
Retrieve the binary dimensions of the attribute space

Returns:
java.util.ArrayList -- the list of binary attributes

getBinaryFeatureIterator

public java.util.Iterator getBinaryFeatureIterator(java.lang.String text)
Retrieve the binary attributes which are true for the given text

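Parameters:
text - the text whose true binary attributes are retrieved
Returns:
java.util.Iterator -- the iterator of binary attributes that are true for the given text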
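
A common next step is to turn the attributes that are true for a text into a 0/1 vector aligned with the full binary attribute space. The sketch below shows that step on plain collections, without calling TagHelperTools. It assumes the iterators yield attribute names as String objects, which this page does not state; the class BinaryVectorSketch and the sample attribute names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of TagHelperTools.
class BinaryVectorSketch {
    // allFeatures: every binary attribute in the space, in a fixed order.
    // activeFeatures: the attributes that are true for one text.
    // Assumption: both are represented as attribute-name strings.
    static int[] toVector(List<String> allFeatures, Iterator<String> activeFeatures) {
        Set<String> active = new HashSet<>();
        while (activeFeatures.hasNext()) {
            active.add(activeFeatures.next());
        }
        int[] vector = new int[allFeatures.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = active.contains(allFeatures.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Placeholder attribute names, invented for this example.
        List<String> space = Arrays.asList("hello", "world", "hello_world");
        Iterator<String> trueForText = Arrays.asList("hello_world", "hello").iterator();
        System.out.println(Arrays.toString(toVector(space, trueForText)));
    }
}
```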

getNumericFeatureIterator

public java.util.Iterator getNumericFeatureIterator()
Retrieving an iterator over the names of all existing numeric attributes

Returns:
java.util.Iterator -- the iterator of numeric attributes

getNumericFeatures

public java.util.ArrayList getNumericFeatures()
Retrieving a list of names of all existing numeric attributes

Returns:
java.util.ArrayList -- the list of numeric attributes

getNumericFeatureValues

public java.util.ArrayList<java.lang.Double> getNumericFeatureValues(java.lang.String text,
                                                                     java.util.ArrayList featureNames)
Retrieving values of numeric features for the given text. Note that currently the only possible numeric feature is line length. The value of line length returned by this method is normalized against the maximum and minimum values of line length observed in the corpus (if available).

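Parameters:
text - the text for which numeric feature values are computed
featureNames - the names of the numeric features to retrieve
Returns:
the list of values (as Double objects), in the same order as the provided list of feature names.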
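
The normalization of line length described above can be pictured as min-max scaling. The exact formula used by the library is not given on this page, so the sketch below is an assumption: it maps a length into the range 0 to 1 using the corpus minimum and maximum, and returns 0 when the range is empty. The class LineLengthNormalizationSketch and its method are hypothetical.

```java
// Hypothetical helper for illustration only; not part of TagHelperTools.
class LineLengthNormalizationSketch {
    // Min-max scaling: (value - min) / (max - min).
    // Assumption: a degenerate range (max <= min) maps to 0.0.
    static double normalize(double lineLength, double corpusMin, double corpusMax) {
        if (corpusMax <= corpusMin) {
            return 0.0;
        }
        return (lineLength - corpusMin) / (corpusMax - corpusMin);
    }

    public static void main(String[] args) {
        // A line of length 15 in a corpus whose lengths range from 10 to 20.
        System.out.println(normalize(15, 10, 20));
    }
}
```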