cic.wordnet
Class WordNet

java.lang.Object
  extended by cic.wordnet.WordNet

public final class WordNet
extends java.lang.Object

A simple connector to WordNet 3.0. Before using it, you must set the path of CICWN with WordNet.setPath(String). After that, WordNet.loadDataBase(String) is the first method that should be executed. This connector is somewhat slow to load because it reads the counts used for IDF calculation.

Author:
Francisco Viveros-Jiménez

Field Summary
(package private) static java.util.ArrayList<KeyString> exceptions
          Contains memory map for irregular morphs.
(package private) static int glossCount
          Number of loaded synsets.
(package private) static int maxCollocationSize
          Maximum word size from collocations stored in WordNet.
(package private) static java.lang.String path
          The path of CICWN
(package private) static java.util.ArrayList<java.lang.String> prepositions
          Contains the list of prepositions stored in the preposition file.
(package private) static java.util.ArrayList<KeyArray> synMaps
          Contains the mapping between a lemma and its possible synsets.
(package private) static java.util.ArrayList<ParsedSynset> synsets
          Contains WordNet synsets with lemmatized glosses.
(package private) static java.util.ArrayList<KeyString> wordCounts
          Contains the frequency of each lemma on the loaded samples.
 
Constructor Summary
WordNet()
           
 
Method Summary
static java.util.ArrayList<java.io.File> getAllFiles(java.io.File source)
          Simple utility method for getting all the files nested inside a folder and its subfolders.
static java.util.ArrayList<KeyString> getExceptions()
          Returns list with irregular morphs.
static int getGlossCount()
          Returns a count of the synsets in WordNet.
static java.util.ArrayList<ParsedSynset> getGlosses()
          Returns glosses list.
static double getIDF(java.lang.String lemma)
          Retrieve IDF for a lemma.
static java.util.ArrayList<ParsedSynset> getLemma(java.lang.String lemma)
          This method is similar to the wn command of WordNet. getLemma takes a lemma in the format "lemma_P".
static int getMaxCollocationSize()
          Returns the maximum word size from collocations stored in WordNet.
static java.lang.String getPOS(int pos)
          Returns the POS tag letter for a numeric POS code.
static int getPOS(java.lang.String pos)
          Returns the numeric POS code for a POS tag.
static java.util.ArrayList<java.lang.String> getPrepositions()
          Returns a list with the prepositions.
static ParsedSynset getSynset(java.lang.String sid)
          Retrieve a synset by its synsetId using binary search over glosses mapping.
static java.util.ArrayList<KeyArray> getSynsets()
          Returns lemma/synsets memory mapping.
static boolean hasPrepositions(java.lang.String morph)
          Method for detecting if a collocation has a preposition in it.
static java.util.ArrayList<java.util.ArrayList<java.lang.String>> lemmatize(java.lang.String line, edu.stanford.nlp.tagger.maxent.MaxentTagger tagger)
          Lemmatizer that uses Morphy as the morphological processor and the Stanford Log-linear Part-Of-Speech Tagger for tagging.
private static void loadCountsFromFile(java.io.FileReader input)
          Loads a count file.
static void loadDataBase(java.lang.String sampleSources)
          Reads files in the Resources/wordnet folder and creates the memory mapping for all the terms in WordNet.
private static void loadSamplesFromSource(java.io.FileReader input)
          Load the samples from a parsed source.
static void loadWordNet()
          loadWordNet exists to allow parsing new SemCor files. It loads synset information, mappings, and relations.
static void main(java.lang.String[] args)
           
static java.util.ArrayList<java.lang.String> Morphy(java.lang.String morph, int pos)
          Implementation of WordNet Morphological processor.
static java.util.ArrayList<java.lang.String> Morphy(java.lang.String morph, java.lang.String postag)
          Implementation of WordNet Morphological processor.
private static ParsedSynset parseGloss(java.lang.String line, int pos)
          Method for extracting a synset from a line of WordNet data.
static void parseSamplesFromSemCor(java.lang.String source)
          This method parses the samples and counts for a valid SemCor format source.
static void parseSamplesFromWordNet()
          Utility method for parsing WordNet glosses and samples.
private static java.util.ArrayList<java.lang.String> parseSenseMap(java.lang.String[] tokens, java.lang.String pos)
          Method for extracting the possible synsets of a lemma.
static void setPath(java.lang.String path)
           
static java.util.ArrayList<java.util.ArrayList<java.lang.String>> softLemmatize(java.lang.String line, edu.stanford.nlp.tagger.maxent.MaxentTagger tagger)
          Open-class words extracted with the Stanford Log-linear Part-Of-Speech Tagger.
static java.util.ArrayList<java.lang.String> Transform(java.lang.String morph, int pos)
          Implementation of WordNet's Morphy rules of detachment.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

path

static java.lang.String path
The path of CICWN


wordCounts

static java.util.ArrayList<KeyString> wordCounts
Contains the frequency of each lemma on the loaded samples.


synMaps

static java.util.ArrayList<KeyArray> synMaps
Contains the mapping between a lemma and its possible synsets. Lemmas have the format lemma_P, where P is the POS tag (N, V, A, R). Synsets have the format synsetId_P. This list is always sorted by the lemma field.
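
The key format can be illustrated with a small helper. This is only a sketch: LemmaKey is a hypothetical class, not part of CICWN, and it assumes that the POS letter always follows the last underscore, so that collocations such as look_up keep their internal underscores.

```java
final class LemmaKey {
    final String lemma;
    final char pos;

    private LemmaKey(String lemma, char pos) {
        this.lemma = lemma;
        this.pos = pos;
    }

    /** Builds a key such as "dog_N" from a lemma and a POS letter. */
    static String format(String lemma, char pos) {
        return lemma + "_" + pos;
    }

    /** Splits a key such as "look_up_V" into the lemma "look_up" and the POS letter 'V'. */
    static LemmaKey parse(String key) {
        int cut = key.lastIndexOf('_');
        // The underscore must exist, must not be first, and must be followed by exactly one letter.
        if (cut <= 0 || cut != key.length() - 2) {
            throw new IllegalArgumentException("Expected lemma_P, got: " + key);
        }
        char pos = key.charAt(cut + 1);
        if ("NVAR".indexOf(pos) < 0) {
            throw new IllegalArgumentException("Unknown POS tag in: " + key);
        }
        return new LemmaKey(key.substring(0, cut), pos);
    }
}
```

Splitting on the last underscore rather than the first is the design choice that matters here, because collocations already use "_" in place of white space.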


synsets

static java.util.ArrayList<ParsedSynset> synsets
Contains WordNet synsets with lemmatized glosses. Lemmatization was done using the methods provided in this class. This list allows searching by synsetId and is sorted by that field.


exceptions

static java.util.ArrayList<KeyString> exceptions
Contains the memory map for irregular morphs. Morphs are stored in the following format: ("irregular_morph","base_morph1[base_morphN]*"). The exceptions list is sorted by irregular_morph.


prepositions

static java.util.ArrayList<java.lang.String> prepositions
Contains the list of prepositions stored in the preposition file. Prepositions are used for detecting some verb collocations in Morphy. The prepositions list is sorted.


glossCount

static int glossCount
Number of loaded synsets.


maxCollocationSize

static int maxCollocationSize
Maximum word size from collocations stored in WordNet.

Constructor Detail

WordNet

public WordNet()
Method Detail

getSynsets

public static java.util.ArrayList<KeyArray> getSynsets()
Returns lemma/synsets memory mapping.

Returns:
synsets

getGlosses

public static java.util.ArrayList<ParsedSynset> getGlosses()
Returns glosses list.

Returns:
glosses

getExceptions

public static java.util.ArrayList<KeyString> getExceptions()
Returns list with irregular morphs.

Returns:
exceptions

getPrepositions

public static java.util.ArrayList<java.lang.String> getPrepositions()
Returns a list with the prepositions.

Returns:
prepositions

getGlossCount

public static int getGlossCount()
Returns a count of the synsets in WordNet.

Returns:
glossCount

getMaxCollocationSize

public static int getMaxCollocationSize()
Returns the maximum word size from collocations stored in WordNet.

Returns:
maxCollocationSize

getSynset

public static ParsedSynset getSynset(java.lang.String sid)
Retrieve a synset by its synsetId using binary search over glosses mapping.

Parameters:
sid - The synsetId in format "Number_P".
Returns:
The corresponding synset object.
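
A binary search over a list sorted by synsetId could look like the sketch below. Entry is a hypothetical stand-in for ParsedSynset, the ids are placeholders, and returning null for a missing id is an assumption; this page does not say what getSynset does in that case.

```java
import java.util.List;

final class SynsetLookupSketch {
    /** Hypothetical stand-in for ParsedSynset: only an id and a gloss. */
    static final class Entry {
        final String id;
        final String gloss;

        Entry(String id, String gloss) {
            this.id = id;
            this.gloss = gloss;
        }
    }

    /** Returns the entry whose id equals sid, or null. The list must be sorted by id. */
    static Entry find(List<Entry> sortedById, String sid) {
        int low = 0;
        int high = sortedById.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            int cmp = sortedById.get(mid).id.compareTo(sid);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return sortedById.get(mid);
            }
        }
        return null; // assumption: a missing id yields null
    }
}
```

The lookup is only correct if the list is kept sorted by the same String ordering used in the comparison.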

getIDF

public static double getIDF(java.lang.String lemma)
Retrieves the IDF for a lemma. Lemmas are in the format "lemma_P". IDF is calculated by treating each ParsedSynset as a document. Note that even when SemCor is loaded, each sample is added to a single ParsedSynset object; that is, samples are attached to their corresponding synset rather than counted as separate documents.

Parameters:
lemma - The lemma to look for.
Returns:
A double with the calculated IDF value
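
The exact formula used by getIDF is not stated on this page. As an illustration only, the sketch below assumes the common definition idf = ln(N / df), where N is the number of documents (here, one set of lemma_P keys per synset) and df is the number of documents containing the lemma. IdfSketch and its zero result for unseen lemmas are hypothetical choices.

```java
import java.util.List;
import java.util.Set;

final class IdfSketch {
    /**
     * Computes ln(N / df) for a lemma over a collection of documents.
     * Returns 0.0 when the lemma occurs in no document (an arbitrary choice for this sketch).
     */
    static double idf(String lemma, List<Set<String>> documents) {
        int documentFrequency = 0;
        for (Set<String> document : documents) {
            if (document.contains(lemma)) {
                documentFrequency++;
            }
        }
        if (documentFrequency == 0) {
            return 0.0;
        }
        return Math.log((double) documents.size() / documentFrequency);
    }
}
```

Under this definition a lemma that appears in every document gets an IDF of 0, and rarer lemmas get larger values.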

loadWordNet

public static void loadWordNet()
                        throws java.lang.Exception
loadWordNet exists to allow parsing new SemCor files. It loads synset information, mappings, and relations, but it does not load glosses and samples. After executing loadWordNet, you can call parseSamplesFromSemCor(String) to add a new sample corpus.

Throws:
java.lang.Exception

loadDataBase

public static void loadDataBase(java.lang.String sampleSources)
                         throws java.lang.Exception
Reads files in the Resources/wordnet folder and creates the memory mapping for all the terms in WordNet. This method loads the samples for the bag of words. CICWN includes three sample sources: (1) WNGlosses (WordNet glosses), (2) WNSamples (WordNet samples), and (3) SemCor. You can load your own sample sources if you parse them first with the method parseSamplesFromSemCor(String).

Parameters:
sampleSources - A string with the sources that will form the bag of words. Some valid examples are: "WNGlosses", "WNGlosses;WNSamples", "WNGlosses;WNSamples;SemCor;yoursamplesource"
Throws:
java.lang.Exception
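
The sampleSources argument is a semicolon-separated list of source names. How loadDataBase splits it internally is not documented here; the sketch below is one plausible reading, with a hypothetical helper that trims white space and skips empty names.

```java
import java.util.ArrayList;
import java.util.List;

final class SampleSourcesSketch {
    /** Splits a string such as "WNGlosses;WNSamples" into its individual source names. */
    static List<String> split(String sampleSources) {
        List<String> names = new ArrayList<>();
        for (String name : sampleSources.split(";")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) { // tolerate stray separators such as "a;;b"
                names.add(trimmed);
            }
        }
        return names;
    }
}
```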

loadCountsFromFile

private static void loadCountsFromFile(java.io.FileReader input)
                                throws java.lang.Exception
Loads a count file. A count file records how many times a word appears in a sample corpus. CICWN comes bundled with three sample corpora: WNGlosses, WNSamples, and SemCor.

Parameters:
input - The count file.
Throws:
java.lang.Exception

parseSamplesFromSemCor

public static void parseSamplesFromSemCor(java.lang.String source)
                                   throws java.lang.Exception
This method parses the samples and counts for a source in valid SemCor format. After using this method, you can load its samples by passing the source name to loadDataBase(String).

Parameters:
source - The name of the file or folder that contains files in valid SemCor format. An error will be raised if a non-SemCor file is present in the source folder.
Throws:
java.lang.Exception

getAllFiles

public static java.util.ArrayList<java.io.File> getAllFiles(java.io.File source)
                                                     throws java.lang.Exception
Simple utility method for getting all the files nested inside a folder and its subfolders.

Parameters:
source - The analyzed folder.
Returns:
An ArrayList containing all the files found.
Throws:
java.lang.Exception
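
A recursive traversal of this kind can be sketched with java.io.File alone. FileWalkSketch is a hypothetical name, and returning an empty list for an unreadable or missing folder is a choice made for this sketch; getAllFiles itself may throw instead.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class FileWalkSketch {
    /** Returns every regular file under source, including files in nested subfolders. */
    static List<File> allFiles(File source) {
        List<File> found = new ArrayList<>();
        collect(source, found);
        return found;
    }

    private static void collect(File node, List<File> found) {
        if (node.isFile()) {
            found.add(node);
            return;
        }
        File[] children = node.listFiles(); // null if node is not a readable directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            collect(child, found);
        }
    }
}
```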

parseSamplesFromWordNet

public static void parseSamplesFromWordNet()
                                    throws java.lang.Exception
Utility method for parsing WordNet glosses and samples. Avoid using it unless you have modified WordNet or erased the files in the samples and counts resource folders.

Throws:
java.lang.Exception

loadSamplesFromSource

private static void loadSamplesFromSource(java.io.FileReader input)
                                   throws java.lang.Exception
Load the samples from a parsed source.

Parameters:
input - The loaded source.
Throws:
java.lang.Exception

getLemma

public static java.util.ArrayList<ParsedSynset> getLemma(java.lang.String lemma)
This method is similar to the wn command of WordNet. getLemma takes a lemma in the format "lemma_P". First, base forms are retrieved with Morphy. Then, senses are retrieved for the corresponding base forms.

Parameters:
lemma - The lemma to look for.
Returns:
An ArrayList with the senses for the lemma. An empty ArrayList will be returned if the lemma wasn't found.

Morphy

public static java.util.ArrayList<java.lang.String> Morphy(java.lang.String morph,
                                                           java.lang.String postag)
Implementation of WordNet Morphological processor. See WordNet's Morphy for further details.

Parameters:
morph - The word to be processed.
postag - The POS tag of the word ("N","V","A","R").
Returns:
A list with the possible corresponding lemmas found in WordNet.

Morphy

public static java.util.ArrayList<java.lang.String> Morphy(java.lang.String morph,
                                                           int pos)
Implementation of WordNet Morphological processor. See WordNet's Morphy for further details.

Parameters:
morph - The word to be processed.
pos - The POS tag of the word ("N=0","V=1","A=2","R=3").
Returns:
A list with the possible corresponding lemmas found in WordNet.
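
The overall flow of a Morphy-style lookup can be sketched as follows: consult the list of irregular forms, then keep only those detachment candidates that actually exist in the lexicon. MorphySketch, its parameters, and the order of the steps are assumptions made for illustration; the toy maps stand in for the exceptions and synMaps fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MorphySketch {
    /**
     * @param morph      surface form, for example "axes"
     * @param exceptions irregular form to base forms (toy stand-in for the exceptions list)
     * @param lexicon    lemmas known for this POS (toy stand-in for the lemma/synset mapping)
     * @param candidates base-form candidates produced by the rules of detachment
     * @return the base forms found, without duplicates, in the order they were discovered
     */
    static List<String> baseForms(String morph,
                                  Map<String, List<String>> exceptions,
                                  Set<String> lexicon,
                                  List<String> candidates) {
        Set<String> result = new LinkedHashSet<>(); // keeps insertion order, drops duplicates
        List<String> irregular = exceptions.get(morph);
        if (irregular != null) {
            result.addAll(irregular);
        }
        if (lexicon.contains(morph)) {
            result.add(morph); // the word may already be a base form
        }
        for (String candidate : candidates) {
            if (lexicon.contains(candidate)) { // discard candidates that are not real lemmas
                result.add(candidate);
            }
        }
        return new ArrayList<>(result);
    }
}
```

With toy data in which "axes" maps to the irregular bases "axis" and "ax", and the detachment candidates include "axe", the sketch yields the three lemmas axis, ax, and axe.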

Transform

public static java.util.ArrayList<java.lang.String> Transform(java.lang.String morph,
                                                              int pos)
Implementation of WordNet's Morphy rules of detachment. See WordNet's Morphy for further details.

Parameters:
morph - The word to be processed.
pos - The POS tag of the word ("N=0","V=1","A=2","R=3").
Returns:
A list with the possible base forms for the word. The List could contain duplicates and invalid words.
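
The rules of detachment are suffix substitutions. The sketch below uses the suffix table from WordNet's Morphy documentation with the POS codes used in this class (N=0, V=1, A=2, R=3); whether Transform applies exactly this table is an assumption, and DetachmentSketch is a hypothetical name. As with Transform, the output may contain duplicates and invalid words.

```java
import java.util.ArrayList;
import java.util.List;

final class DetachmentSketch {
    // Each rule is {suffix to remove, ending to append}.
    private static final String[][] NOUN_RULES = {
        {"s", ""}, {"ses", "s"}, {"xes", "x"}, {"zes", "z"},
        {"ches", "ch"}, {"shes", "sh"}, {"men", "man"}, {"ies", "y"}
    };
    private static final String[][] VERB_RULES = {
        {"s", ""}, {"ies", "y"}, {"es", "e"}, {"es", ""},
        {"ed", "e"}, {"ed", ""}, {"ing", "e"}, {"ing", ""}
    };
    private static final String[][] ADJECTIVE_RULES = {
        {"er", ""}, {"est", ""}, {"er", "e"}, {"est", "e"}
    };

    /** Applies every matching rule for the given POS code and returns all candidates. */
    static List<String> detach(String morph, int pos) {
        String[][] rules;
        switch (pos) {
            case 0: rules = NOUN_RULES; break;
            case 1: rules = VERB_RULES; break;
            case 2: rules = ADJECTIVE_RULES; break;
            default: rules = new String[0][]; // adverbs have no detachment rules
        }
        List<String> candidates = new ArrayList<>();
        for (String[] rule : rules) {
            String suffix = rule[0];
            if (morph.length() > suffix.length() && morph.endsWith(suffix)) {
                String stem = morph.substring(0, morph.length() - suffix.length());
                candidates.add(stem + rule[1]);
            }
        }
        return candidates;
    }
}
```

For example, detach("axes", 0) yields the candidates axe and ax, while detach("tries", 1) yields trie, try, trie, and tri, which shows both a duplicate and an invalid word that a later lexicon check would need to filter out.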

hasPrepositions

public static boolean hasPrepositions(java.lang.String morph)
Method for detecting if a collocation has a preposition in it.

Parameters:
morph - The collocation to process. White spaces must be replaced with "_".
Returns:
true if the collocation has a preposition in it.
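
Because the prepositions list is kept sorted, the check can be done with a binary search per word. This is a sketch under assumptions: PrepositionSketch is hypothetical, the preposition list is passed in rather than read from the preposition file, and the sample prepositions are illustrative.

```java
import java.util.Collections;
import java.util.List;

final class PrepositionSketch {
    /**
     * Returns true if any "_"-separated word of the collocation is a preposition.
     * sortedPrepositions must be in natural String order for binarySearch to work.
     */
    static boolean hasPreposition(String collocation, List<String> sortedPrepositions) {
        for (String word : collocation.split("_")) {
            if (Collections.binarySearch(sortedPrepositions, word) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```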

parseSenseMap

private static java.util.ArrayList<java.lang.String> parseSenseMap(java.lang.String[] tokens,
                                                                   java.lang.String pos)
Method for extracting the possible synsets of a lemma.

Parameters:
tokens - Array containing the values of a line.split(" ") operation on a line of a WordNet index.POS file.
pos - The POS tag of the current file.
Returns:
An ArrayList with the corresponding synsets of a lemma.

parseGloss

private static ParsedSynset parseGloss(java.lang.String line,
                                       int pos)
Method for extracting a synset from a line of a WordNet data.POS file.

Parameters:
line - The line to be processed.
pos - The POS tag of the WordNet file.
Returns:
A synset object.

getPOS

public static java.lang.String getPOS(int pos)
Returns the POS tag letter for a numeric POS code.

Parameters:
pos - The POS tag of the word ("N=0","V=1","A=2","R=3").
Returns:
The POS tag, or an empty string if pos is not a valid code.

lemmatize

public static java.util.ArrayList<java.util.ArrayList<java.lang.String>> lemmatize(java.lang.String line,
                                                                                   edu.stanford.nlp.tagger.maxent.MaxentTagger tagger)
Lemmatizer that uses Morphy as the morphological processor and the Stanford Log-linear Part-Of-Speech Tagger for tagging.

Parameters:
line - The text to be processed.
tagger - An instance of the Stanford MaxentTagger.
Returns:
An ArrayList that contains, for each word, an ArrayList of its possible lemmas. Most words have a single corresponding lemma in WordNet, but there are some exceptions. For example: axes_N -> (axis_N, ax_N, axe_N).

softLemmatize

public static java.util.ArrayList<java.util.ArrayList<java.lang.String>> softLemmatize(java.lang.String line,
                                                                                       edu.stanford.nlp.tagger.maxent.MaxentTagger tagger)
Open-class words extracted with the Stanford Log-linear Part-Of-Speech Tagger. This lemmatizer returns the tagged words obtained by the tagger.

Parameters:
line - The text to be processed.
tagger - An instance of the Stanford MaxentTagger.
Returns:
An ArrayList that contains the tagged words.

getPOS

public static int getPOS(java.lang.String pos)
Returns the numeric POS code for a POS tag.

Parameters:
pos - The POS tag of the word ("N=0","V=1","A=2","R=3","W=3").
Returns:
The numeric POS code, or -1 if pos is an invalid tag.
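
The two getPOS overloads convert between tag letters and numeric codes. The mapping documented on this page (N=0, V=1, A=2, R=3, with W also mapping to 3) can be sketched as below. PosCodes is a hypothetical class, and the assumption that tags are single upper-case letters is made only for this illustration.

```java
final class PosCodes {
    private static final String TAGS = "NVAR"; // position in this string = numeric code

    /** Returns the numeric code for a tag letter, or -1 for an invalid tag. */
    static int toCode(String tag) {
        if (tag == null || tag.length() != 1) {
            return -1;
        }
        if (tag.equals("W")) {
            return 3; // this page lists W with the same code as R
        }
        return TAGS.indexOf(tag.charAt(0)); // indexOf yields -1 when the letter is unknown
    }

    /** Returns the tag letter for a numeric code, or an empty string for an invalid code. */
    static String toTag(int code) {
        if (code < 0 || code >= TAGS.length()) {
            return "";
        }
        return String.valueOf(TAGS.charAt(code));
    }
}
```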

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Throws:
java.lang.Exception

setPath

public static void setPath(java.lang.String path)