ClusterTests

This MATLAB script generates reference data for testing the vMAT functions that cluster observations.

Input Test Data

First off, we need some data to play with. Since we don’t have anything more interesting at hand for the moment, let’s start with a repeatable random sample from a normal distribution.

X = gallery('normaldata', [10 3], 13, 'double');   % 10 observations of 3 variables
cmap = gray; cmap(:, 1:3) = .95;                   % uniform light-gray colormap
makeHtmlTable(X, [], seqlabels('obs', 10), seqlabels('x', 3), cmap, 4);
         x1         x2         x3
obs1     1.203      0.06809    -0.5938
obs2     0.6676     -0.3744    0.2813
obs3     1.737      1.532      -1.225
obs4     -0.474     -0.1009    0.02343
obs5     1.677      -0.7548    0.8096
obs6     -1.28      -0.6805    -0.009285
obs7     -0.9427    1.165      -0.7701
obs8     0.2707     1.155      0.4899
obs9     -1.429     0.7309     -2.161
obs10    -0.5094    1.479      0.406
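
The "repeatable" part of the sample can be checked directly. The following is a minimal sketch, assuming only the gallery call used above; the alternative value 14 and the variable names are chosen here for illustration.

```matlab
% Two calls with the same size and the same third argument (13)
% should return identical samples.
A = gallery('normaldata', [10 3], 13, 'double');
B = gallery('normaldata', [10 3], 13, 'double');
sameSample = isequal(A, B);

% A different third argument (14, picked arbitrarily) should give a
% different sample of the same shape.
C = gallery('normaldata', [10 3], 14, 'double');
differentSample = ~isequal(A, C);

fprintf('same argument identical: %d, new argument differs: %d\n', ...
    sameSample, differentSample);
```

This is why the test data can be regenerated at any time without storing a random number generator state.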

pdist and linkage

Now we can compute the pairwise distances between observations (pdist) and then the hierarchical cluster tree (linkage). The dendrogram plot provides a nice visual summary of the hierarchical cluster tree. (It’s only here for informational purposes, as vMAT doesn’t have any plotting functions.)

NOTE: Don’t forget that vMAT_pdist wants X' (the transpose of X), and that vMAT_linkage has zero-based indexes in the first two columns. The data in the MAT file includes the variable Zv, which holds the MATLAB result converted to zero-based indexes, for validating the test results.
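
To make the pairwise distances and the transposed layout from the note concrete, here is a rough sketch that recomputes the Euclidean distances by hand from X' and compares them with pdist. The names Xt, Ymanual and maxDiff are illustrative only, and nothing here calls vMAT itself.

```matlab
X = gallery('normaldata', [10 3], 13, 'double');
Xt = X';                          % 3-by-10: one observation per column
n = size(X, 1);

% pdist returns the distances in the order (2,1), (3,1), ..., (n,1),
% (3,2), ..., (n,n-1); the nested loops below visit pairs in that order.
Ymanual = zeros(1, n * (n - 1) / 2);
k = 0;
for i = 1:n-1
    for j = i+1:n
        k = k + 1;
        % Euclidean distance between observations i and j (columns of Xt)
        Ymanual(k) = sqrt(sum((Xt(:, i) - Xt(:, j)) .^ 2));
    end
end

maxDiff = max(abs(Ymanual - pdist(X)));   % expected to be at rounding level
fprintf('largest difference from pdist: %g\n', maxDiff);
```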

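Y = pdist(X);                      % pairwise Euclidean distances (the pdist default)
Z = linkage(Y);                    % single-linkage tree (the linkage default)
Zv = [Z(:,1:2)-1 Z(:,3)];          % shift the index columns to zero-based
makeHtmlTable(Zv,[], ...
    seqlabels('link', size(Zv, 1)), {'idx1', 'idx2', 'distance'}, cmap, 4);
dendrogram(Z);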
         idx1    idx2    distance
link1    7       9       0.8487
link2    3       5       0.9934
link3    0       1       1.117
link4    4       12      1.201
link5    11      13      1.202
link6    6       10      1.292
link7    14      15      1.533
link8    8       16      1.536
link9    2       17      1.681
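
Reading the zero-based table takes a little practice: indexes 0 to 9 are the observations, and link k creates a new cluster numbered 9 + k. The sketch below walks the table and lists the observations under each link. The values are copied from the output above (rounded to four digits), and the members cell array is a helper introduced here for illustration.

```matlab
% Zero-based linkage table copied from the published output.
Zv = [ 7  9 0.8487
       3  5 0.9934
       0  1 1.117
       4 12 1.201
      11 13 1.202
       6 10 1.292
      14 15 1.533
       8 16 1.536
       2 17 1.681];

n = size(Zv, 1) + 1;            % 10 observations
members = num2cell(0:n-1);      % members{c+1}: zero-based leaves under node c

for k = 1:n-1
    a = Zv(k, 1);
    b = Zv(k, 2);
    % Link k creates zero-based node n-1+k, stored at cell index n+k.
    members{n + k} = sort([members{a + 1}, members{b + 1}]);
    % Print with one-based labels so they match obs1..obs10 above.
    fprintf('link%d -> cluster %d = obs {%s}\n', k, n - 1 + k, ...
        num2str(members{n + k} + 1));
end
```

The last link always contains every observation, which is a quick sanity check on any linkage output.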

inconsistent

Two methods of clustering are supported in vMAT (for now): ‘cutoff:’ applies a threshold to the inconsistency coefficient, so the number of clusters it produces depends on the data, and ‘maxclust:’ returns at most a specified number of clusters. (The MATLAB cluster function of course has a number of additional options, but these are two commonly useful ones.)

When you’re actually using vMAT_cluster it’s not necessary to call the vMAT_inconsistent function explicitly; it’s called from vMAT_cluster as necessary. But for the purpose of testing, we’re going to call it so we can validate that vMAT’s function matches the output of MATLAB’s. (If the two implementations are inconsistent [sic], the clustering is likely to come out differently as well.) The inconsistency coefficient is explained in the documentation for the inconsistent function.

Wv = inconsistent(Z);              % columns: mean, std, n, coefficient
ZW = [Zv Wv];                      % zero-based links next to their statistics
makeHtmlTable(ZW, [], ...
    seqlabels('link', size(Z, 1)), ...
    {'idx1', 'idx2', 'distance', 'mean', 'std', 'n', 'coefficient'}, cmap, 4);
         idx1    idx2    distance    mean      std         n    coefficient
link1    7       9       0.8487      0.8487    0           1    0
link2    3       5       0.9934      0.9934    0           1    0
link3    0       1       1.117       1.117     0           1    0
link4    4       12      1.201       1.159     0.05934     2    0.7071
link5    11      13      1.202       1.132     0.1202      3    0.58
link6    6       10      1.292       1.07      0.3135      2    0.7071
link7    14      15      1.533       1.342     0.1711      3    1.114
link8    8       16      1.536       1.534     0.002258    2    0.7071
link9    2       17      1.681       1.609     0.1026      2    0.7071
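
The columns above can be reproduced by hand, which may help when debugging a mismatch. This is a sketch under the assumption that the default depth of 2 is in effect, so each link is compared only with the links that are its direct children; W and h are names chosen here, and this is not how vMAT_inconsistent is necessarily written.

```matlab
X = gallery('normaldata', [10 3], 13, 'double');
Z = linkage(pdist(X));
n = size(Z, 1) + 1;              % number of observations

W = zeros(n - 1, 4);             % mean, std, n, coefficient
for k = 1:n-1
    h = Z(k, 3);                 % height of link k itself
    for c = Z(k, 1:2)            % one-based children of link k
        if c > n                 % child is itself a link, not an observation
            h(end + 1) = Z(c - n, 3);
        end
    end
    W(k, 1) = mean(h);
    W(k, 2) = std(h);
    W(k, 3) = numel(h);
    if W(k, 2) > 0               % coefficient stays 0 when there is no spread
        W(k, 4) = (Z(k, 3) - W(k, 1)) / W(k, 2);
    end
end

maxDiff = max(max(abs(W - inconsistent(Z))));
fprintf('largest difference from inconsistent: %g\n', maxDiff);
```

For example, link4 joins obs5 with the cluster from link3, so it averages the heights 1.201 and 1.117 to get the mean of 1.159 shown in the table.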

cluster

Now, finally, we’re ready to see how the data clusters. The cluster numbers are shifted to start at zero, as was done for the link indexes in Zv.

VCv = cluster(Z, 'cutoff', [.5 .75]) - 1;      % one column per cutoff value
VMv = cluster(Z, 'maxclust', [3 4]) - 1;       % one column per maximum cluster count
Summary = [VCv VMv];
makeHtmlTable(Summary, [], ...
    seqlabels('obs', size(Z, 1) + 1), {'cutoff: 0.5', 'cutoff: 0.75', 'maxclust: 3', 'maxclust: 4'}, cmap, 4);
         cutoff: 0.5    cutoff: 0.75    maxclust: 3    maxclust: 4
obs1     5              0               1              0
obs2     5              0               1              0
obs3     4              2               2              3
obs4     1              0               1              0
obs5     0              0               1              0
obs6     1              0               1              0
obs7     2              3               1              1
obs8     6              3               1              1
obs9     3              1               0              2
obs10    6              3               1              1
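
The ‘cutoff:’ column can be understood as follows: a link's observations form one cluster only if that link and every link below it have a coefficient under the cutoff. The sketch below applies that rule by hand for a cutoff of 0.75 and compares the grouping with MATLAB's. The label numbers it assigns are arbitrary and differ from those in the table, so the comparison checks who shares a cluster rather than the numbers themselves. All variable names are illustrative.

```matlab
X = gallery('normaldata', [10 3], 13, 'double');
Z = linkage(pdist(X));
W = inconsistent(Z);
n = size(Z, 1) + 1;
cutoff = 0.75;

ok = false(n - 1, 1);        % ok(k): link k and all links below it pass
labels = (1:n)';             % start with every observation on its own
members = num2cell(1:n);     % members{c}: observations under node c

for k = 1:n-1
    kids = Z(k, 1:2);
    below = true;
    for c = kids
        if c > n             % child is a link; it must have passed too
            below = below && ok(c - n);
        end
    end
    ok(k) = below && W(k, 4) < cutoff;
    members{n + k} = [members{kids(1)}, members{kids(2)}];
    if ok(k)
        labels(members{n + k}) = n + k;   % merge under an arbitrary label
    end
end

% Compare partitions through their co-membership matrices.
T = cluster(Z, 'cutoff', cutoff);
samePartition = isequal(bsxfun(@eq, labels, labels'), bsxfun(@eq, T, T'));
fprintf('clusters found: %d, matches cluster(): %d\n', ...
    numel(unique(labels)), samePartition);
```

Comparing co-membership instead of raw labels is also a reasonable option for a unit test if two implementations might number their clusters differently.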

cluster-normaldata-10x3-13.mat

Save the data in a MAT file, so it can be loaded by the unit test cases.

save('cluster-normaldata-10x3-13', '-v6', ...
    'X', 'Zv', 'Wv', 'VCv', 'VMv');
attach('cluster-normaldata-10x3-13.mat');
cluster-normaldata-10x3-13.mat
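
Before relying on the file in a test case, it can be worth confirming that the variables survive a save and load unchanged. A minimal sketch, assuming a scratch file from tempname rather than the real file name, and using only two of the five variables:

```matlab
X = gallery('normaldata', [10 3], 13, 'double');
Z = linkage(pdist(X));
Zv = [Z(:, 1:2) - 1, Z(:, 3)];

fname = [tempname '.mat'];            % scratch file, not the published one
save(fname, 'X', 'Zv', '-v6');        % same MAT-file version as above

S = load(fname);                      % struct with one field per variable
roundTripOk = isequal(S.X, X) && isequal(S.Zv, Zv);
delete(fname);

fprintf('round trip intact: %d\n', roundTripOk);
```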
