-----------------------
Files in this package:
-----------------------


get_features.pl:
The main file for feature extraction.

Features:
Position, lexical forms and context, phrase information, dependency
information, and overlap features. Keyword and pattern features can be
enabled by uncommenting the related parts of the script.

Input:
Input 1: Sentences with target proteins annotated by <prot> protein name <\prot>.
Input 2: Keyword list (used only when testing keyword and pattern features).
Input 3: Output of the Chunklink script (http://ilk.uvt.nl/sabine/chunklink/).
Input 4: Output of the Minipar parser.

To get Input 3, first parse Input 1 using Collins' parser and then process it
using the Chunklink script.

To get Input 4, parse Input 1 using the Minipar parser.
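
The annotation format of Input 1 can be illustrated with a minimal sketch
that strips the protein tags from one sentence and records where each
protein sits. The helper name extract_proteins, the whitespace-separated
tags, and the 0-based inclusive token indexes are all assumptions made for
illustration; get_features.pl may tokenize and index differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this package): remove <prot> ... <\prot>
# tags from a sentence and return the plain tokens together with the
# [start, end] token span of every annotated protein.
# Assumptions: tags are separated from words by whitespace, and spans are
# 0-based, inclusive indexes into the tag-free token list.
sub extract_proteins {
    my ($sentence) = @_;
    my (@tokens, @spans);
    my $start;
    for my $tok (split ' ', $sentence) {
        if ($tok eq '<prot>') {
            $start = scalar @tokens;    # the next word opens the span
        }
        elsif ($tok eq '<\prot>' || $tok eq '</prot>') {
            # Close the span only if it contains at least one word.
            push @spans, [ $start, $#tokens ]
                if defined $start && $#tokens >= $start;
            undef $start;
        }
        else {
            push @tokens, $tok;
        }
    }
    return (\@tokens, \@spans);
}

my ($tokens, $spans) = extract_proteins(
    'We show that <prot> protein A <\prot> binds <prot> protein B <\prot> .'
);
for my $span (@$spans) {
    my ($from, $to) = @$span;
    printf "%s [%d,%d]\n", join(' ', @{$tokens}[ $from .. $to ]), $from, $to;
}
```

The sketch accepts both <\prot> (as written above) and </prot> as the
closing tag, since only the former is confirmed by this file.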

Output:
Output 1: The feature file. Each line is a feature vector of a data point.
Output 2: Each line maps the index of a sentence to the indexes of all the data
	  points (in the feature file) extracted from that sentence.
Output 3: Each line gives the start and end indexes of the two proteins in a
	  sentence that make up one data point.
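
The relationship between the three outputs can be sketched as follows: a
data point is a pair of proteins in one sentence, Output 2 groups data-point
indexes by sentence, and Output 3 records the spans of each pair. The helper
name enumerate_pairs, the sample spans, the choice to enumerate every
unordered pair, and the space-separated line layout are assumptions for
illustration only; the actual file formats are defined by get_features.pl.

```perl
use strict;
use warnings;

# Hypothetical input: for each sentence, the [start, end] spans of its
# annotated proteins (sample values, not real data).
my @sentences = (
    [ [3, 4], [6, 7], [10, 10] ],    # sentence 0: three proteins
    [ [0, 0], [5, 6] ],              # sentence 1: two proteins
);

# Hypothetical helper: number every unordered protein pair as one data point.
# Returns the sentence-to-data-point mapping (like Output 2) and the span
# list of each pair (like Output 3).
sub enumerate_pairs {
    my ($sents) = @_;
    my (@mapping, @pairs);
    my $point = 0;                   # running data-point index
    for my $si (0 .. $#$sents) {
        my @spans = @{ $sents->[$si] };
        my @ids;
        for my $i (0 .. $#spans - 1) {
            for my $j ($i + 1 .. $#spans) {
                push @pairs, [ @{ $spans[$i] }, @{ $spans[$j] } ];
                push @ids, $point++;
            }
        }
        push @mapping, [ $si, @ids ];
    }
    return (\@mapping, \@pairs);
}

my ($mapping, $pairs) = enumerate_pairs(\@sentences);
print "Sentence index followed by its data-point indexes:\n";
print join(' ', @$_), "\n" for @$mapping;
print "Start and end indexes of both proteins, one data point per line:\n";
print join(' ', @$_), "\n" for @$pairs;
```

Under these assumptions, line k of the pair listing describes the same data
point as line k of the feature file.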


dependency.pm, outmostphrase.pm:
Modules used by get_features.pl.

noun.list, verb.list:
Used only when testing keyword and pattern features.

Note: Before running "get_features.pl", run the following scripts to
      generate the files that "get_features.pl" uses.

get_protein_unigram_list.pl:
Extract protein symbols and unigrams from annotated abstracts (e.g. annotation
from UofT Texas).


get_phraseheadpath_list.pl:
Extract phrase heads and phrase paths from the output of
chunklink_2-2-2000_for_conll.pl.


get_depend_list.pl:
Extract the dependent word and its part-of-speech tag from the output of the Minipar parser.




