The Abalone dataset

The information is a replica of the notes for the abalone dataset from the UCI repository.

1. Title of Database:

Abalone data

2. Sources:

(a) Original owners of database:
Marine Resources Division
Marine Research Laboratories - Taroona
Department of Primary Industry and Fisheries, Tasmania
GPO Box 619F, Hobart, Tasmania 7001, Australia
(contact: Warwick Nash +61 02 277277, wnash@dpi.tas.gov.au)

(b) Donor of database:
Sam Waugh (Sam.Waugh@cs.utas.edu.au)
Department of Computer Science, University of Tasmania
GPO Box 252C, Hobart, Tasmania 7001, Australia

(c) Date received: December 1995

3. Past Usage:

Sam Waugh (1995) "Extending and benchmarking Cascade-Correlation", PhD thesis, Computer Science Department, University of Tasmania.

-- Test set performance (final 1044 examples, first 3133 used for training):
24.86% Cascade-Correlation (no hidden nodes)
26.25% Cascade-Correlation (5 hidden nodes)
21.5% C4.5
0.0% Linear Discriminant Analysis
3.57% k=5 Nearest Neighbour
(Problem encoded as a classification task)

-- Data set samples are highly overlapped. Further information is required
to separate the classes completely using affine combinations. Other restrictions to the data set were examined.
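
The positional split quoted above (first 3133 examples for training, final 1044 for testing) can be sketched in Python as follows. This is a minimal illustration, assuming the examples are held in a list in original file order; the helper name split_abalone is hypothetical and not part of the original notes.

```python
N_TRAIN = 3133  # first 3133 examples, as stated in the notes
N_TEST = 1044   # final 1044 examples, as stated in the notes


def split_abalone(rows):
    """Split rows (in original file order) into (train, test) by position."""
    if len(rows) != N_TRAIN + N_TEST:
        raise ValueError(f"expected {N_TRAIN + N_TEST} rows, got {len(rows)}")
    return rows[:N_TRAIN], rows[N_TRAIN:]


# Stand-in data: row indices instead of real measurements.
train, test = split_abalone(list(range(4177)))
print(len(train), len(test))  # 3133 1044
```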

David Clark, Zoltan Schreter, Anthony Adams "A Quantitative Comparison of Dystal and Backpropagation", submitted to the Australian Conference on Neural Networks (ACNN'96). Data set treated as a 3-category classification problem (grouping ring classes 1-8, 9 and 10, and 11 on).

-- Test set performance (3133 training, 1044 testing as above):
64% Backprop
55% Dystal

-- Previous work (Waugh, 1995) on same data set:
61.40% Cascade-Correlation (no hidden nodes)
65.61% Cascade-Correlation (5 hidden nodes)
59.2% C4.5
32.57% Linear Discriminant Analysis
62.46% k=5 Nearest Neighbour
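
The 3-category encoding described above (ring classes 1-8, 9 and 10, and 11 on) might be expressed as below. This is a sketch only: the function name ring_group and the integer labels 0, 1, 2 are assumptions chosen for illustration, not the encoding used in the cited papers.

```python
def ring_group(rings):
    """Map a ring count to a category: 0 (rings 1-8), 1 (rings 9-10), 2 (rings 11 and above)."""
    if rings < 1:
        raise ValueError("ring count must be a positive integer")
    if rings <= 8:
        return 0
    if rings <= 10:
        return 1
    return 2


print([ring_group(r) for r in (1, 8, 9, 10, 11, 29)])  # [0, 0, 1, 1, 2, 2]
```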

4. Relevant Information Paragraph:

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).
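
Since the continuous values were divided by 200, multiplying by 200 should recover the measurement units given in the attribute information (mm and grams). A minimal Python sketch, assuming a record is a dict keyed by attribute name; SCALE, CONTINUOUS and unscale are illustrative names, and the sample record is built from the Max values in the statistics table rather than taken from an actual example.

```python
SCALE = 200.0  # divisor stated in the notes

# Attributes listed as "continuous" in the attribute information.
CONTINUOUS = (
    "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight",
)


def unscale(row):
    """Return a copy of row with continuous attributes multiplied back by 200."""
    return {k: (v * SCALE if k in CONTINUOUS else v) for k, v in row.items()}


# Illustrative record only: Max values for Length and Whole weight.
restored = unscale({"Sex": "M", "Length": 0.815, "Whole weight": 2.826, "Rings": 29})
print(round(restored["Length"], 1), round(restored["Whole weight"], 1))  # 163.0 565.2
```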

Data comes from an original (non-machine-learning) study:

Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)

5. Number of Instances: 4177

6. Number of Attributes: 8

7. Attribute information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

Name            Data Type   Meas.   Description
Sex             nominal             M, F, and I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years
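
The attribute list above can be turned into a small record parser that also applies the rings + 1.5 age rule. The sketch below assumes one comma-separated example per line with fields in the listed order; that file layout, the names COLUMNS and parse_record, and the sample line are assumptions for illustration and are not specified in these notes.

```python
COLUMNS = [
    "Sex", "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight", "Rings",
]


def parse_record(line):
    """Parse one comma-separated example into a dict and add an estimated Age."""
    fields = line.strip().split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    for name in COLUMNS[1:-1]:            # the seven continuous attributes
        record[name] = float(record[name])
    record["Rings"] = int(record["Rings"])
    record["Age"] = record["Rings"] + 1.5  # rings + 1.5 gives the age in years
    return record


# Illustrative line in the assumed format.
record = parse_record("M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15")
print(record["Sex"], record["Rings"], record["Age"])  # M 15 16.5
```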

Statistics for numeric domains:

        Length  Diam    Height  Whole   Shucked Viscera Shell   Rings
Min     0.075   0.055   0.000   0.002   0.001   0.001   0.002   1
Max     0.815   0.650   1.130   2.826   1.488   0.760   1.005   29
Mean    0.524   0.408   0.140   0.829   0.359   0.181   0.239   9.934
SD      0.120   0.099   0.042   0.490   0.222   0.110   0.139   3.224
Correl  0.557   0.575   0.557   0.540   0.421   0.504   0.628   1.0

8. Missing Attribute Values: None

9. Class Distribution:

Class Examples
1 1
2 1
3 15
4 57
5 115
6 259
7 391
8 568
9 689
10 634
11 487
12 267
13 203
14 126
15 103
16 67
17 58
18 42
19 32
20 26
21 14
22 6
23 9
24 2
25 1
26 1
27 2
29 1
Total 4177
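
As a consistency check, the class distribution above can be used to recompute the total number of examples and the mean and standard deviation of the Rings column reported in the statistics for numeric domains. A short Python sketch; CLASS_COUNTS is simply the table above transcribed, and the population form of the standard deviation is an assumption (the notes do not say which form was used).

```python
import math

# Ring count -> number of examples, transcribed from the class distribution.
CLASS_COUNTS = {
    1: 1, 2: 1, 3: 15, 4: 57, 5: 115, 6: 259, 7: 391, 8: 568,
    9: 689, 10: 634, 11: 487, 12: 267, 13: 203, 14: 126, 15: 103,
    16: 67, 17: 58, 18: 42, 19: 32, 20: 26, 21: 14, 22: 6, 23: 9,
    24: 2, 25: 1, 26: 1, 27: 2, 29: 1,
}

total = sum(CLASS_COUNTS.values())
mean = sum(r * n for r, n in CLASS_COUNTS.items()) / total
variance = sum(n * (r - mean) ** 2 for r, n in CLASS_COUNTS.items()) / total
sd = math.sqrt(variance)

print(total, round(mean, 3), round(sd, 3))  # 4177 9.934 3.224
```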

10. Modifications for Delve

A single prototask, age, has been defined to predict the number of rings. The prototask treats the output as a continuous variable, even though it is a positive integer with a maximum value of 29 (see class distribution above).



Last Updated 8 October 1996
Comments and questions to: delve@cs.toronto.edu