Splice dataset

The information below is a replica of the notes for the Splice dataset from the UCI repository of machine learning databases.

1. Title of Database:

Primate splice-junction gene sequences (DNA) with associated imperfect domain theory

2. Sources:

  1. Creators:
  2. Donor: G. Towell, M. Noordewier, and J. Shavlik, {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu
  3. Date received: 1/1/92

3. Past Usage:

  1. machine learning:
  2. attributes predicted: given a position in the middle of a window of 60 DNA sequence elements (called "nucleotides" or "base-pairs"), decide whether this is
    a) an "intron -> exon" boundary (ie) [These are sometimes called "acceptors"]
    b) an "exon -> intron" boundary (ei) [These are sometimes called "donors"]
    c) neither (n)
  3. Results of the study indicated that machine learning techniques (neural networks, nearest neighbor, and the contributors' KBANN system) performed as well as or better than classification based on canonical pattern matching (the method used in the biological literature).
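These notes do not spell out which canonical patterns were used as the baseline. As a rough illustration only, the sketch below uses the widely cited GT-AG rule (introns typically begin with GT and end with AG) as a stand-in. It also assumes that the candidate junction lies exactly between the 30th and 31st element of the window. The function name `canonical_guess` is hypothetical, and this is not the procedure evaluated in the study.

```python
def canonical_guess(window):
    """Guess a class for a 60-base window using only the GT-AG rule.

    Assumption: the candidate junction sits between window[29] and window[30].
    """
    if len(window) != 60:
        raise ValueError("expected a window of 60 nucleotides")
    upstream, downstream = window[:30], window[30:]
    if downstream.startswith("GT"):
        return "ei"  # an intron would begin right after the junction
    if upstream.endswith("AG"):
        return "ie"  # an intron would end right before the junction
    return "n"


# Made-up windows for illustration only (not taken from the dataset).
guess_ei = canonical_guess("A" * 30 + "GT" + "A" * 28)
guess_ie = canonical_guess("C" * 28 + "AG" + "C" * 30)
guess_n = canonical_guess("A" * 60)
```

A rule this simple matches many windows that are not true junctions, since GT and AG occur frequently by chance. Real pattern-based classifiers typically use longer consensus patterns around the junction.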

4. Relevant Information Paragraph:

Splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as "acceptors" while EI borders are referred to as "donors".)

This dataset has been developed to help evaluate a "hybrid" learning algorithm (KBANN) that uses examples to inductively refine preexisting knowledge. Using a "ten-fold cross-validation" methodology on 1000 examples randomly selected from the complete set of 3190, the following error rates were produced by various ML algorithms (all experiments run at the Univ of Wisconsin, sometimes with local implementations of published algorithms).

System          Neither      EI       IE
KBANN              4.62    7.56     8.47
BACKPROP           5.29    5.74    10.75
PEBLS              6.86    8.18     7.55
PERCEPTRON         3.99   16.32    17.41
ID3                8.84   10.58    13.99
COBWEB            11.80   15.04     9.46
Near. Neigh.      31.11   11.65     9.09
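The evaluation protocol described above (1000 examples drawn at random from 3190, then ten-fold cross-validation) can be sketched as index bookkeeping. This is a minimal sketch using only the standard library. The function name, the seed, and the way folds are formed are assumptions, not details taken from the original experiments.

```python
import random


def ten_fold_splits(n_total=3190, n_sample=1000, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    # Randomly select n_sample distinct example indices from the complete set.
    sample = rng.sample(range(n_total), n_sample)
    # Deal the sampled indices into n_folds disjoint folds of near-equal size.
    folds = [sample[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test


splits = list(ten_fold_splits())
```

With these defaults each split trains on 900 examples and tests on the remaining 100, and every sampled example is used for testing exactly once. The error rate per class would then be averaged over the ten test folds.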

Type of domain: non-numeric, nominal (one of A, G, T, C)
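Because the attributes are nominal rather than numeric, learners that expect numeric input need some encoding of the four symbols. One common choice is a one-hot encoding with four indicator values per nucleotide. The notes do not say which encoding the systems listed above used, so the sketch below is an assumption, including the symbol order in `NUCLEOTIDES`.

```python
NUCLEOTIDES = "AGTC"  # symbol order is arbitrary; chosen to match the notes


def one_hot(sequence):
    """Encode a nucleotide string as a flat list of 0/1 values, four per base."""
    encoded = []
    for base in sequence:
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol: {base!r}")
        encoded.extend(1 if base == symbol else 0 for symbol in NUCLEOTIDES)
    return encoded


# Made-up short sequence for illustration only.
vector = one_hot("GATTC")
```

Under this encoding a full 60-element window becomes a vector of 240 binary values.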

5. Number of Instances: 3190

6. Number of Attributes: 61

7. Attribute information:

Attribute   Description
1           One of {n ei ie}, indicating the class.
2-61        The remaining 60 fields are the sequence, starting at position -30 and ending at position +30.
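The mapping from field number to sequence position can be made explicit. Since 60 fields cover the range -30 to +30, the sketch below assumes there is no position 0, so the labels run -30..-1 and then +1..+30. It also assumes a record is already available as a list of 61 string fields. The function names and the example record are hypothetical.

```python
def field_positions(window=60):
    """Return position labels -30..-1, +1..+30 for the sequence fields."""
    half = window // 2
    # Assumption: position 0 is skipped, giving exactly `window` labels.
    return [p for p in range(-half, half + 1) if p != 0]


def split_record(fields):
    """Split a 61-field record into (class_label, {position: nucleotide})."""
    if len(fields) != 61:
        raise ValueError(f"expected 61 fields, got {len(fields)}")
    label, bases = fields[0], fields[1:]
    if label not in {"n", "ei", "ie"}:
        raise ValueError(f"unknown class label: {label!r}")
    return label, dict(zip(field_positions(), bases))


# Made-up record for illustration only (not taken from the dataset).
example = ["ei"] + list("ACGT" * 15)
label, by_position = split_record(example)
```

Indexing by position keeps the candidate junction easy to locate: it falls between `by_position[-1]` and `by_position[1]`.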

8. Missing Attribute Values: none

9. Class Distribution:

EI        767 (24%)
IE        768 (24%)
Neither  1655 (52%)
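The class counts can be checked against the stated total of 3190 instances, and each class share recomputed from the counts. A small sketch of that arithmetic follows. The variable names are illustrative.

```python
# Class counts as listed in the notes.
counts = {"EI": 767, "IE": 768, "Neither": 1655}

total = sum(counts.values())
# Share of each class as a percentage of all instances.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:8s}{counts[name]:5d}  {share:.1f}%")
```

Roughly half of the instances belong to the "Neither" class, so a classifier that always predicts "Neither" already gets about half of the cases right. This is a useful baseline when reading the error rates.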

10. Modifications for Delve

  1. The name attribute in the original UCI distribution was deleted.
  2. Cases containing any of the D, N, S and R categories have been deleted because these categories have very low incidence (see the attribute frequency table above).
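The second modification amounts to a simple filter over the cases. The sketch below assumes each case is held as a `(label, sequence)` pair with the sequence as a string. The function name and the example cases are hypothetical, and the actual procedure used for Delve is not described in these notes.

```python
RARE_CATEGORIES = set("DNSR")


def drop_rare_cases(cases):
    """Keep only (label, sequence) cases whose sequence has no D, N, S or R."""
    return [(label, seq) for label, seq in cases
            if not RARE_CATEGORIES & set(seq)]


# Made-up short sequences for illustration only.
cases = [("ei", "ACGTAC"), ("n", "ACNTAC"), ("ie", "GGRTCA"), ("n", "TTGACA")]
kept = drop_rare_cases(cases)
```

After filtering, every remaining attribute takes one of the four values A, G, T or C, which matches the domain type stated in section 4.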

Last Updated 8 October 1996
Comments and questions to: delve@cs.toronto.edu