BUILDING AN ADDITIVE MODEL USING TWO DIFFUSION TREES. Here, I give an example of how a distribution with an additive structure can be modeled using more than one diffusion tree. The data for this example (in file 'adata') consists of 30 cases, each with eight binary variables. It was manually constructed so that the first four variables were unreleated to the last four variables. With respect to the first four variables, there are two groups, in which these variables tend to all be "1", or all be "0". Similarly, there are two groups with respect to the last four variables. The combined effect is that there are four overall groups, corresponding to patterns of 00000000, 11110000, 00001111, and 11111111. We could model this data as a mixture, or using a single Dirichlet diffusion tree. Even though this fails to capture the division into two sets of variables, it does work fairly well. Here is one way this could be done with a diffusion tree model: > dft-spec alog.dft1 0 8 / 0.2:4:2 - 1 > model-spec alog.dft1 binary > data-spec alog.dft1 0 8 2 / adata . > dft-gen alog.dft1 fix 2 > mc-spec alog.dft1 repeat 50 gibbs-latent slice-positions met-terminals\ > gibbs-sigmas > dft-mc alog.dft1 100 This is similar to what is done in the example of Ex-mixdft-b.doc. The 'dft-gen' command fixes the diffusion standard deviations to 2 initially. The subsequent 'gibbs-latent' command will produce values for the latent variables that fit the data reasonably, starting the chain off in from a reasonable state. (Without the 'dft-gen' command, the latent variables start off rather small.) The above commands take 82 seconds on the system used (see Ex-system.doc). We can look at the tree found using 'dft-display' with the "-g" option, or using the following command: > dft-dendrogram alog.dft1 100 alabels | ghostview - The 'dft-dendrogram' command produces a Postscript picture of the tree, which is here piped into the ghostview utility (it could instead by printed). This command shows the final tree, at iteration 100, and uses the labels for cases in the file 'alabels', which labels the four patterns mentioned above with A, B, C, and D . You should be able to see that cases with each of the four patterns are mostly grouped together in subtrees. To better capture the structure of this data, we can use a model with two trees, specified as follows: > dft-spec alog.dft2 0 8 / 0.1:4:1 - 1 / 0.1:4:1 - 1 > model-spec alog.dft2 binary > data-spec alog.dft2 0 8 2 / adata . The two sets of specifications in 'dft-spec' after the slashes (identical here) are for two Dirichlet diffusion trees, which are generated independently in the prior distribution. The values at the terminal nodes of these trees are added together to produce the latent values for the cases, which define the probabilities of the variables being "0" or "1". We hope that this will allow one tree to specialize in modeling the first four variables and the other tree to specialize in modeling the last four. Here are the commands used to sample from the posterior distribution: > dft-gen alog.dft2 fix 1.4 > mc-spec alog.dft2 create-latent repeat 50 gibbs-locations sample-latent \ > slice-positions met-terminals \ > gibbs-sigmas > dft-mc alog.dft2 100 Here again, the 'dft-gen' command helps start the chain off in a reasonable state. The 'create-latent' operation ensures that latent values exist, which is necessary when there is more than one tree when the model is non-Gaussian. These commands take 56 seconds on the system used (see Ex-system.doc). We can look at the resulting hyperparameters as follows: > dft-display alog.dft2 DIFFUSION TREE MODEL IN FILE "alog.dft2" WITH INDEX 100 PARAMETERS OF TREE 1 Standard deviation parameters for diffusion process 0.247: 1.380 0.796 7.847 2.466 0.312 0.169 0.554 0.182 Divergence function parameters: - 1.0000 - PARAMETERS OF TREE 2 Standard deviation parameters for diffusion process 0.147: 0.057 0.350 0.106 0.148 4.766 691.193 4.756 0.956 Divergence function parameters: - 1.0000 - We see that the first tree has mostly larger diffusion standard deviations for the first four variables than the last four. This is reversed for the second tree, for which the diffusion standard deviations are small for the first four variables and large for the last four. This is consistent with the two trees dividing up the modeling task. (In another run, the roles of the two trees might be reversed.) We can view the two trees as follows: > dft-dendrogram alog.dft2 100 1 alabels | ghostview - > dft-dendrogram alog.dft2 100 2 alabels | ghostview - This should show that one tree divides the A and B cases from the C and D cases, whereas the other tree divides the A and C cases from the B and D cases. This shows that the two-tree model has discovered the structure of the data.