The "demo" dataset was invented to serve as an example for the Delve manual and as a test case for Delve software and for software that applies a learning procedure to Delve datasets. To those ends, it has a variety of attributes. The rule for generating cases is based on various stereotypical notions, which may or may not have any basis in reality, so that the characteristics of the data are easier to remember and not completely arbitrary.
Each case in the dataset describes a person from an imaginary population, with attributes giving the person's sex, age, income, number of siblings, and favourite colour (from among pink, blue, red, green, and purple). The attributes of each person are generated in the order given, independently of the attributes of other persons.
If you want, you can download the dataset demo.tar.gz, or get a list of learning methods that have been run on this dataset.
The dataset was generated by Radford Neal specifically for the Delve project.
In detail, the procedure used to generate the attributes is as follows:
Sex is chosen randomly with probability 0.53 for female.
Age is set to the absolute value of a normal random variate with mean zero, and a standard deviation of 40 for females and 30 for males.
The number of siblings is picked by truncating an exponentially distributed variate to an integer, with the mean of the exponential distribution being 3*(1+age)/(3+age).
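The three rules above can be sketched in a few lines of Python. This is only an illustration of the stated procedure, not the code actually used to generate the dataset; the function name sample_sex_age_siblings and the string labels for sex are assumptions made for the example.

```python
import random

def sample_sex_age_siblings(rng):
    """Draw sex, age, and number of siblings for one imaginary person."""
    # Female with probability 0.53, otherwise male.
    sex = "female" if rng.random() < 0.53 else "male"
    # Age is the absolute value of a zero-mean normal variate,
    # with standard deviation 40 for females and 30 for males.
    sd = 40.0 if sex == "female" else 30.0
    age = abs(rng.gauss(0.0, sd))
    # Siblings: exponential variate with mean 3*(1+age)/(3+age),
    # truncated to an integer. expovariate takes the rate, i.e. 1/mean.
    mean_siblings = 3.0 * (1.0 + age) / (3.0 + age)
    siblings = int(rng.expovariate(1.0 / mean_siblings))
    return sex, age, siblings

rng = random.Random(0)  # seeded only so that runs are repeatable
for _ in range(3):
    print(sample_sex_age_siblings(rng))
```

Note that the mean of the exponential rises from 1 at age zero towards 3 for large ages, so older persons tend to have more siblings.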
Income is the sum of employment income and other income, but only the total income is recorded, not the two components.
To determine employment income, an unobserved binary "working" flag is first selected randomly. For males, the probability of working is:
for females, it is:
If the person is working, their employment income is drawn from the exponential distribution with mean
where C is 30000 for females and 40000 for males. If the person is not working, their employment income is zero.
The person's other income has an exponential distribution with mean of age*100. The value randomly picked from this distribution is added to the employment income (if any) to give the total income, which is the only number recorded.
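As a rough sketch of how the recorded total is formed, the function below adds an exponentially distributed other income to an employment income supplied by the caller. The working probabilities and the mean of the employment income are not reproduced in this text, so they are deliberately left out; the function name total_income and the handling of an age of exactly zero are assumptions of the example.

```python
import random

def total_income(age, employment_income, rng):
    """Return employment income plus a randomly drawn other income."""
    # Other income is exponential with mean age*100.
    mean_other = age * 100.0
    # Assumption: an age of exactly zero gives no other income.
    # (The text does not discuss this case; it avoids dividing by zero.)
    if mean_other > 0.0:
        other = rng.expovariate(1.0 / mean_other)
    else:
        other = 0.0
    # Only this sum is recorded in the dataset, not the two components.
    return employment_income + other

rng = random.Random(0)
# A 40-year-old who is not working (employment income of zero).
print(total_income(40.0, 0.0, rng))
```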
The person's favourite colour is determined as follows. First, an unobserved binary "childlike" value is selected randomly, with the probability of the person being childlike being 1/(1+exp(age-10)). If the person is a childlike female, her favourite colour is pink with probability 0.9, and is otherwise drawn from the following distribution:
Note that pink gets a second chance here. If the person is a childlike male, his favourite colour is blue with probability 0.9, and is otherwise drawn from the distribution shown above (with blue getting a second chance). If the person is not childlike, their favourite colour is purple with probability 1/(1+exp(-(income-80000)/10000)), and is otherwise once again drawn from the distribution above (with purple thus getting a second chance).
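The two-stage colour rule can be sketched as follows. This is an illustration under stated assumptions rather than the original generator: the shared fallback distribution is not reproduced in this text, so it is passed in as the hypothetical parameter base_weights, and the uniform weights in the usage lines are placeholders, not the dataset's actual probabilities. The function names are likewise invented for the example.

```python
import math
import random

COLOURS = ["pink", "blue", "red", "green", "purple"]

def p_childlike(age):
    """Probability of being childlike: 1/(1+exp(age-10))."""
    return 1.0 / (1.0 + math.exp(age - 10.0))

def p_purple(income):
    """Probability that a non-childlike person picks purple outright."""
    return 1.0 / (1.0 + math.exp(-(income - 80000.0) / 10000.0))

def favourite_colour(sex, age, income, base_weights, rng):
    """Draw a favourite colour; base_weights is the fallback distribution."""
    childlike = rng.random() < p_childlike(age)
    if childlike:
        # Childlike females prefer pink, childlike males prefer blue.
        preferred = "pink" if sex == "female" else "blue"
        p_preferred = 0.9
    else:
        preferred = "purple"
        p_preferred = p_purple(income)
    if rng.random() < p_preferred:
        return preferred
    # Otherwise draw from the shared distribution over all five colours,
    # which gives the preferred colour its second chance.
    return rng.choices(COLOURS, weights=base_weights, k=1)[0]

rng = random.Random(0)
placeholder_weights = [1, 1, 1, 1, 1]  # assumption: NOT the real distribution
print(favourite_colour("female", 6.0, 0.0, placeholder_weights, rng))
print(favourite_colour("male", 45.0, 95000.0, placeholder_weights, rng))
```

Both probabilities are logistic functions: a person is childlike with probability one half at age 10, and a non-childlike person picks purple outright with probability one half at an income of 80000.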
The demo dataset has five prototasks, named according to the attribute to be predicted: age, colour, income, sex, siblings.
Last Updated 26 September 1996. Comments and questions to: delve@cs.toronto.edu