Wednesday, December 3, 2014

Clustering Ambiguity II: A Deeper Dive into the Archania.

library(proxy) #For bit vector (dis)similarity measures
library(cluster) #For hierarchical clustering and dendrogram formatting 
library(fpc)   #For relative differences between clusterings.
clusterMeasure <- "Jaccard"
clusterMethod <- "complete"

The following data is from http://blog.revolutionanalytics.com/2014/01/predictive-models-in-r-clustered-by-tag-similarity-1.html by Max Kuhn, though I have modified it slightly: the row names are the actual predictive model function names in R from various packages, the full names are stored as strings in the second column of the .csv file. A tag is a feature of a predictive model, such as, is it used for classification?, or regression?, or both?. There are 41 tag features, and 143 predictive model functions in R.

tag_data <- read.csv("model-tags2.csv", header=TRUE,stringsAsFactors = FALSE)
head(tag_data[,2:5]) #A small snippet of the data without the model names in full (column 1)
##          Classification Regression Bagging Bayesian.Model
## ada                   1          0       0              0
## avNNet                1          1       1              0
## bagEarth              1          1       1              0
## bagFDA                1          0       1              0
## bayesglm              1          1       0              0
## bdk                   1          1       0              0
head(tag_data[,1]) #The corresponding model names in full
## [1] "Boosted Tree (ada)"                            
## [2] "Model Averaged Neural Networks (avNNet)"       
## [3] "Bagged MARS (bagEarth)"                        
## [4] "Bagged Flexible Discriminant Analysis (bagFDA)"
## [5] "Bayesian Generalized Linear Model (bayesglm)"  
## [6] "Self-Organizing Maps (bdk)"
lentag_data <- dim(tag_data)[[1]] #Get the number of models (the number of items in the data set)
#Set the column indices
columnIndices <- 2:42
#These will have to be changed if the input data is changed
#This `dist` function is from the proxy package 
D <- dist(tag_data[,columnIndices], method = clusterMeasure)
#Convert `dist` output to a matrix object
Dm <- as.matrix(D)
DmOriginal <- Dm  #Save the original matrix, since we will create the matrix, Dm, again and again with permutations.
PModelCompleteClust <- agnes(Dm,diss=TRUE,clusterMethod)

set.seed(13)
#Generate a permutation
a.permutation <- sample(lentag_data) 
#Permute the order of the tag data
tag_dataPermutation <- tag_data[a.permutation,columnIndices]
#Show a small section of the now permuted data
tag_dataPermutation[1:4,1:5]
##          Classification Regression Bagging Bayesian.Model Boosting
## qrf                   0          1       1              0        0
## gcvEarth              1          1       0              0        0
## lda2                  1          0       0              0        0
## C5.0Cost              1          0       0              0        1
#Create the new dissimilarity matrix
Dpermutation <- dist(tag_dataPermutation, method = clusterMeasure)
#Convert the dist() vector output to a full matrix
DmPermutation <- as.matrix(Dpermutation)
par(cex=0.3,col="red")
PModelCompleteClustPermute <- agnes(DmPermutation,diss=TRUE,clusterMethod)

Markov Chain Clustering (MCL) is a method that takes a probability matrix formed from the graph representation of the similarity matrix. The intuition is that of graph ebb and flow, a type of tidal flow that settles into the natural clusters of the graph. This flow is a result of expansion and inflation operations on the probability matrix that leads to a steady state. The clustering process uses matrix multiplication repeatedly, so it is therefore O(kN3), where k is the number of iterations to arrive at a steady state. k is typically small, but O(N3) is not. Matrix multiplication can be parallelized, so given enough resources the algorithm will scale – to a point. Inflation is also done repeatedly and is therefore O(kN2). That process in and of itself would be more typical of the computational complexity of many common clustering algorithms.

One would expect that MCL returns the natural clusters without knowing the number of clusters in advance. That is given the the inflation parameter and the number of iterations sufficient for the matrix to reach a steady state. Howevever, a change in the inflation parameter can alter the clusterings, not necessarily in a way that can be readily systemitized (what, say, is the best or reasonable choice of the inflation parameter with respect to the data matrix?). It also can return, though rarely, overlapping clusters. One would expect that given the above data that it would return the occasional overlapping clustering for some choice of inflation parameter.

Do the different disjoint clusterings due to permuting the input differ from those of the hierarchical clusterings very much like the hierarchical clusterings differ from themselves? The answer seems to be no, or it is extremely unlikely. The function below “group.mcl.clusters()” seems to preclude overlapping clusters more generally. Below I run the MCL code with different permutations of the input to no effect. The clusterings are the same each time, given the same input parameters to MCL.

First, we can therefore check the difference between the hierarchical clustering levels and the MCL results and then also see if there are corresponding ties (overlapping). The following code is due to Abhik Seal that can be found here: http://chemin-abs.blogspot.com/2014/10/lean-markov-chain-clustering-in-r.html I have modified it slightly to improve its running time – slightly. I have also modified the code that prints out the clusters from the MCL algorithm: I have modified it to save the clusters to a list of the indices with the names of the associated items to be clusters.

# Algorithm to perform MCL Clustering in R 
# Add the identity matrix to the matrix which indicates self loops 
add.selfloops <- function (M) {
  LM<-M+diag(dim(M)[1])
  return (LM);
}

# Inflation step of MCL
inflate <- function (M,inf) {
  M <- M^(inf)
  return (M)
}

# Normalize the matrix by column
norm <- function (M) {
  return (t(t(M)/apply(M,2,sum)))
}

# MCL procedure
mcl <- function (M,   # Matrix
                 inf,   # Inflation value
                 iter,   # Number of iterations
                 verbose = T
) 
  { 
  lenNSquared <- (dim(M)[1])^2
  for (i in 1:iter) {
    old.M <- M
    M.norm <- norm(M)
    M <- M.norm%*%M.norm
    M <- inflate(M, inf)
    M <- norm(M)
    # Check to reach steady state vector
    if (sum(old.M == M) == lenNSquared) {
      break;
    }
    if (verbose) {
      print (paste ("iteration", i));
    } 
  }
  return (list(M,i));
}

# get the grouping using the matrix names and row indices after running the iteration

group.mcl.clusters <- function (M, verbose=FALSE) {
  M.names <- row.names(M)
  lenN <- dim(M)[1]
  clustered.nodes <- vector(mode = "logical", length = lenN)
  clusterList <- list()
  for (i in 1:lenN) {
    nonzeros <- which(M[i,] != 0)
    #To print the names of the items per cluster, include the following commented code.
    #nodes <- M.names[nonzeros]
    if (length(nonzeros) > 0 && !clustered.nodes[nonzeros]) {
      if(verbose) print (i)
      clusterList <- append(clusterList,list(nonzeros))
      clustered.nodes[nonzeros] = T                   
    }
  }
  return(clusterList)
}

## End of functions by Abhik Seal


#This adds the self loops as the Dm matrix is a dissimilarity matrix, but it also generates #a similarity matrix that is really the graph adjacency matrix.  
data <- add.selfloops(Dm)
data <- add.selfloops(1.0-data)#Use the similarity matrix of the tag data.
#1.0 - data removes the previous self loop, so that it is put back in after the matrix is now a similarity matrix.

#Set the iterations sufficiently high in case some of the inflation parameters require a lot of iterations
inf <- 2
iter<-3000
infs <- 40 # March through different inflation settings to see how this changes the number of clusters
results <- matrix(0,nrow = infs,ncol = 8)
MCLClusters <- numeric(lentag_data)
for(i in 1:infs){

  clusterS <- mcl(data,inf,iter, verbose = F) #,
  clusterList <- group.mcl.clusters(clusterS[[1]])

  #Comparing it to clusterings from above, given that the MCL returns a 
  #non-trivial partition of clusters.

  cutsize <- length(clusterList)  #Number of clusters returned by MCL fed as cutsize to cut the hierarchy from the hierarchical clustering
  if(cutsize > 1 && cutsize < lentag_data){
    Clusters <- cutree(PModelCompleteClust,cutsize)
    ClustersPermute <- cutree(PModelCompleteClustPermute,cutsize)

    #Convert the MCL cluster list into the form of the 
    #clustering vector returned by cutree().
    for(j in 1:cutsize){
      MCLClusters[unlist(clusterList[[j]])] <- j
    }

    CompareClusteringsMCL <- cluster.stats(Dm,Clusters,MCLClusters)
    results[i,1] <- cutsize
    results[i,2] <- inf
    results[i,3] <- CompareClusteringsMCL$corrected.rand
    results[i,4] <- CompareClusteringsMCL$vi

    CompareClusteringsPermuteMCL <- cluster.stats(Dm,
                                         ClustersPermute[order(a.permutation)],MCLClusters)

    results[i,5] <- CompareClusteringsPermuteMCL$corrected.rand
    results[i,6] <- CompareClusteringsPermuteMCL$vi
    results[i,7] <- length(unlist(clusterList)) 
    results[i,8] <- clusterS[[2]]
  } else {
    results[i,1] <- cutsize
    results[i,2] <- inf
    results[i,7] <- length(unlist(clusterList)) #Number items to be clustered (no overlapping)
    results[i,8] <- clusterS[[2]] #number of iterations
  }
  inf <- inf + 0.5
}
dimnames(results)[[2]] <- c("Cutsize","inf","corRand","vi","perm.corRand",
                            "perm.vi","No.Items","MCLIterations")

Here is a table of the results:

kable(results, digits = 3)
Cutsize inf corRand vi perm.corRand perm.vi No.Clusters MCLIterations
1.000 2.000 0.000 0.000 0.000 0.000 143.000 17.000
1.000 2.500 0.000 0.000 0.000 0.000 143.000 14.000
2.000 3.000 -0.108 0.793 -0.108 0.793 143.000 13.000
2.000 3.500 -0.109 0.835 -0.109 0.835 143.000 12.000
2.000 4.000 -0.108 0.842 -0.108 0.842 143.000 12.000
3.000 4.500 0.271 0.971 0.271 0.971 143.000 11.000
3.000 5.000 0.293 0.932 0.293 0.932 143.000 9.000
3.000 5.500 0.293 0.932 0.293 0.932 143.000 9.000
3.000 6.000 0.293 0.932 0.293 0.932 143.000 8.000
3.000 6.500 0.293 0.932 0.293 0.932 143.000 8.000
4.000 7.000 0.588 0.844 0.627 0.779 143.000 3000.000
5.000 7.500 0.419 1.101 0.737 0.782 143.000 3000.000
5.000 8.000 0.433 1.075 0.754 0.756 143.000 8.000
6.000 8.500 0.462 1.121 0.471 1.087 143.000 8.000
6.000 9.000 0.466 1.131 0.473 1.100 143.000 3000.000
6.000 9.500 0.486 1.084 0.493 1.059 143.000 3000.000
6.000 10.000 0.486 1.084 0.493 1.059 143.000 8.000
6.000 10.500 0.486 1.084 0.493 1.059 143.000 7.000
7.000 11.000 0.498 1.095 0.469 1.146 143.000 7.000
9.000 11.500 0.450 1.271 0.457 1.220 143.000 8.000
9.000 12.000 0.450 1.271 0.457 1.220 143.000 3000.000
9.000 12.500 0.450 1.271 0.457 1.220 143.000 6.000
9.000 13.000 0.450 1.271 0.457 1.220 143.000 6.000
9.000 13.500 0.450 1.271 0.457 1.220 143.000 6.000
9.000 14.000 0.450 1.271 0.457 1.220 143.000 3000.000
9.000 14.500 0.450 1.271 0.457 1.220 143.000 3000.000
9.000 15.000 0.456 1.227 0.463 1.176 143.000 3000.000
9.000 15.500 0.462 1.223 0.469 1.173 143.000 7.000
9.000 16.000 0.462 1.223 0.469 1.173 143.000 3000.000
9.000 16.500 0.462 1.223 0.469 1.173 143.000 6.000

The rand index differences are not that far removed from the difference between the rand index for differing hierarchical clusterings due to permuting the data.

We can show any difference (there are none) in pair-wise comparisons of permutations of the data with MCL:

#Create matrix of clusterings
set.seed(23)
NPermutations <- 100
permutations <- matrix(0,nrow=NPermutations,ncol=lentag_data)
for(i in 1:NPermutations){
  #Generate a permutation
  permutations[i,]<- sample(lentag_data)
}

inf <- 3
iter <- 2000
clusteringResultsMatrix <- matrix(0,nrow=NPermutations,ncol=lentag_data)
for(i in 1:NPermutations){
  #Permute the order of the tag data
  tag_dataPermutation <- tag_data[permutations[i,],columnIndices]
  #Create the new dissimilarity matrix
  Dpermutation <- dist(tag_dataPermutation, method = clusterMeasure)
  #Convert the dist() vector output to a full matrix
  Dm <- as.matrix(Dpermutation)

  data <- add.selfloops(Dm)
  data <- add.selfloops(1.0-data)

  clusterS <- mcl(data,inf,iter, verbose = F) #,
  clusterList <- group.mcl.clusters(clusterS[[1]],verbose=FALSE)
  #print(length(clusterList))
  for(j in 1:length(clusterList)){
    MCLClusters[unlist(clusterList[[j]])] <- j
  }
  clusteringResultsMatrix[i,] <- MCLClusters 
}

RandIndices <- numeric(NPermutations*(NPermutations-1)/2)
k <- 1
for(i in 1:(NPermutations-1)){
  clustOrderPerm.i <- clusteringResultsMatrix[i,order(permutations[i,])]
  for(j in (i+1):NPermutations){
    #Dm is in the original order, so the clustering results due to the permutations
    #have to be put in the original order (the inverse permutation).
    RandIndices[k] <- cluster.stats(DmOriginal,clustOrderPerm.i,
                        clusteringResultsMatrix[j,order(permutations[j,])])$corrected.rand
    k <- k + 1
  }
}

summary(RandIndices)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

Nope.

Let's see what kind of distribution of rand indices are obtained with successive random permutations of the data with the use of the hierarchical clustering. If we use 100 permutations, then we will need to run 4950 pair-wise comparisons. Plus, it would be nice to show this across some smaller set of non-trivial cuts of the dendrograms. In this case, something like cuts at 2, …, 22, say. We can then use a set of violin plots to display the successive distributions of the corrected rand index for each cut.

#We can use the permutations generated above

start <- 2
stop  <- 22
NPermutations <- 100
allpairsN <- NPermutations*(NPermutations-1)/2
RandIndices <- numeric(allpairsN)
RandIndicesMatrix <- matrix(0, nrow = stop-start+1,ncol=allpairsN)
clusteringResultsMatrix <- matrix(0,nrow=NPermutations,
                                      ncol=lentag_data)

Here is the code that generates the all-pairs comparisons of different permutations and at different cuts of the hierarchy. This code takes some time to run with just 100 permutations on a single core. Your clock time may vary, but something like 30 to 60 minutes.

for(j in start:stop){
  cutsize <- j

  for(i in 1:NPermutations){
    #Permute the order of the tag data
    tag_dataPermutation <- tag_data[permutations[i,],columnIndices]
    #Create the new dissimilarity matrix
    Dpermutation <- dist(tag_dataPermutation, method = clusterMeasure)
    #Convert the dist() vector output to a full matrix
    Dpermutation <- as.matrix(Dpermutation)

    clusteringResultsMatrix[i,] <- cutree(agnes(Dpermutation,
                                                diss=TRUE,clusterMethod),cutsize) 
  }

  k <- 1
  for(p in 1:(NPermutations-1)){
    clustOrderPerm.p <- clusteringResultsMatrix[p,order(permutations[p,])]
    for(q in (p+1):NPermutations){
      #Dm is in the original order, so the clustering results due to the permutations
      #have to be put in the original order (the inverse permutation).
      RandIndices[k] <- cluster.stats(DmOriginal,clustOrderPerm.p,
                        clusteringResultsMatrix[q,order(permutations[q,])])$corrected.rand
      k <- k + 1
    }
  }
  RandIndicesMatrix[j-start+1,] <- RandIndices
}

Showing the results is simple:

library(ggplot2)

NumOfCuts <- stop-start+1

CorrectedRand <- as.vector(RandIndicesMatrix)

CutNames <- vector(mode="character",length=allpairsN)

for(j in 0:(NumOfCuts-1)){ 
  CutNames[(j*allpairsN)+(1:allpairsN)] <- rep(paste("cut",as.character(j+start),
                                                        sep=""),allpairsN)
}

RanIndMat <- data.frame(CutNames, CorrectedRand)
p <- ggplot(RanIndMat,aes(factor(CutNames),CorrectedRand))
p + geom_violin(fill = "grey80", colour = "#3366FF")

plot of chunk plotOutput

RanIndMat1.9 <- RanIndMat[1:39600,]
RanIndMat10.19 <- RanIndMat[39601:89100,]
RanIndMat20.22 <- RanIndMat[89101:length(RanIndMat[,1]),]

p <- ggplot(RanIndMat1.9,aes(factor(CutNames),CorrectedRand))
p + geom_violin(fill = "grey80", colour = "#3366FF")

plot of chunk plotOutput

p <- ggplot(RanIndMat10.19,aes(factor(CutNames),CorrectedRand))
p + geom_violin(fill = "grey80", colour = "#3366FF")

plot of chunk plotOutput

p <- ggplot(RanIndMat20.22,aes(factor(CutNames),CorrectedRand))
p + geom_violin(fill = "grey80", colour = "#3366FF")

plot of chunk plotOutput

#For all of the rand indices, irrespective of the cut:
p <- ggplot(RanIndMat,aes(factor(1),CorrectedRand))
p + geom_violin(fill = "grey80", colour = "#3366FF")

plot of chunk plotOutput

#The corrected.rand indices are only of certain values:
uniqueRandValues <- unique(RanIndMat[,2])
lenRandValues <- length(uniqueRandValues)

lenRandValues
## [1] 224
plot(1:lenRandValues,sort(uniqueRandValues))

plot of chunk plotOutput

Now let's see if different permutations cause different cuts to be the best cut, but using Kelley's level selection measure, from the maptree library function kgs(). First create a lot of random permutations of the data.

set.seed(41)
NPermutations <- 5000
permutations <- matrix(0,nrow=NPermutations,ncol=lentag_data)
for(i in 1:NPermutations){
  #Generate a permutation
  permutations[i,]<- sample(lentag_data)
}

The following code runs all of the clusterings and kgs code, and outputs the results. It too takes some time (e.g., 20 minutes of clock time for 5000 permutations).

library(maptree)
## Loading required package: rpart
bestCutNo.Clusters <- matrix(0,nrow=NPermutations,ncol=2)
for(i in 1:NPermutations){
  #Permute the order of the tag data
  tag_dataPermutation <- tag_data[permutations[i,],columnIndices]
  #Create the new dissimilarity matrix
  Dpermutation <- dist(tag_dataPermutation, method = clusterMeasure)
  #Find the best cut (level partition of the hierarchy == number of clusters)  
  b <- kgs (agnes(Dpermutation,diss=TRUE,clusterMethod), Dpermutation, maxclust=20)
  minb <- min(b)
  bestCutNo.Clusters[i,1] <- as.numeric(names(which(b==minb)))
  bestCutNo.Clusters[i,2] <- minb
}

table(bestCutNo.Clusters[,1])
## 
##    5    6 
##  388 4612
#   5    6 
#  388 4612 

#Here is the distribution of Kelley index.
hist(bestCutNo.Clusters[,2],50) 

plot of chunk BestCuts

summary(bestCutNo.Clusters[bestCutNo.Clusters[,1] == 6,2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.22    8.22    8.22    8.22    8.22    8.27
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  8.221   8.222   8.223   8.223   8.223   8.265 
summary(bestCutNo.Clusters[bestCutNo.Clusters[,1] == 5,2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.61    7.61    7.62    7.62    7.62    7.62
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  7.615   7.615   7.617   7.619   7.624   7.624 

#It turns out that 6 clusters is more prevalent, but 5 clusters has the smaller index.  This is #relative however to the number of clusters.  There are different index values even for 5 and 6 #however, so the permutations impact that as well.

The take away from all this is that, though relatively rare in this instance, the input order of the data can change the clustering at specific cuts of the hierarchy – and those differences can be substantial; and, even the level selection can be impacted by the change in input order. This is somewhat annoying on both counts. Claims made about clustering such data must include the fact that these types of results are possible and may be subject to revision simply by a change of input order.

MCL seems largely impervious to ties in proximity, and appears to find the natural clustering – if you can find the right inflation parameter. The hierarchical clusterings primarily found 6 clusters as the natural clustering as did MCL, but 5 clusters appears to be a good clustering as well, for both MCL and the hierarchical clustering – other hierarchical clustering methods, and similarity measures may prove to have different results, so this is just a taste of the archania.