Tuesday, October 14, 2014

Document Topic Co-clustering, MDS Mantel Test, 2D-3D Onion Layered Convex Hulls


The data for this little exercise in the use of 3D in R comes from the Open Science Code Fest project “TopicViz.” 2606 documents were analyzed with two forms of text analysis, producing a classification of the corpus into 87 specific topics, and 16 more general topics. Naturally, there is some relationship between the two classifications – each of the 16 topics can be distinguished by a preponderence roughly 4-10 of the 87 topics. This can be shown by way of a co-clustering of the 87 by 16 cross-table of document occurences. Nevertheless, there is still a fair amount of noise.

First, the code below will show the coclustering as a heatmap with dendrograms. Not shown: the top cophenetic correlations were used to determine best type of hierarchical clustering method for the clustering of the the row and col space. The dark regions denote the most consistent groupings between the 87 and 16 topic classes.

#Create cross table of 16 topics with 87 topics
crossTable <- table(table1[,2:3]) #87 rows by 16 columns
#Generate distance matrices (87X16 and 16X87)
maxcrossTable <- max(crossTable) #Set grayscale to the maximum number of articles in the (i,j) entries of the cross table. 
DM <- dist(crossTable)
TDM <- dist(t(crossTable))

#Color the individual topics to distinguish the rows and columns of the heatmap
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
colorPaletteCol <- getPalette(16)
colorPaletteRow <- getPalette(87)

#Take the methods with the best cophenetic correlation for both the row space and the column space
clusterMethodRow <- "centroid"
clusterMethodCol <- "average"

  Rowv = as.dendrogram(hclust(DM,method=clusterMethodRow)),
  Colv = as.dendrogram(hclust(TDM,method=clusterMethodCol)),
  RowSideColors = colorPaletteRow,
  ColSideColors = colorPaletteCol,
  xlab=paste(clusterMethodRow,"clustering\n","16 General Topics"),
  ylab=paste(clusterMethodCol,"clustering\n","87 Specific Topics"))

plot of chunk CrossTableCoClustering

Here is shown a 2D MDS plot, where the 2606 document are laid out using the 87 topic classification with MDS, and the document points are jittered so that those points in common are spread out locally around that point. The document points are then colored by a 16 topic coloring. 2D convex hull onion layers (4) are then plotted transparently for a specific topic from the 16 topic coloring (in this case topic 8). The topic points are highlighted (see the “alpha” setting for the selected topic is 0.99 and the remaining points are “alpha” = 0.2 in the ggplot code below).

#Jitter MDS coordinates found for the 87 specific topics via the tsne MDS of the topic cross table
tableRows <- dim(table1)[[1]]
coords.mds <- matrix(0,nrow=tableRows,ncol=3,byrow=TRUE);
completeDisMatrix <- dist(crossTable)
MDSTopics <- tsne(completeDisMatrix,k=3) #smacofSym(completeDisMatrix,ndim=2)$conf 
jitterAmount <- (max(MDSTopics) + abs(min(MDSTopics)))/100 #Jitter the points by 1% of the total range of min and max of MDSTopics
#TODO:  quasi-random local placement for 87 regions based on the jitterAmount scale.
for(i in 1:tableRows){
  coords.mds[i,1] <- jitter(MDSTopics[which(table1[i,2] == 1:87),1],amount = jitterAmount)
  coords.mds[i,2] <- jitter(MDSTopics[which(table1[i,2] == 1:87),2],amount = jitterAmount)
  coords.mds[i,3] <- table1[i,3] #Get 16 colors
#Plot jittered points and show convex hull for topic "8" of 16 topics
#First interpolate colors from color brewer palette of 12 colors to get 16 colors
colorPalette <- getPalette(16) #Same as above
coords.mds <- data.frame(coords.mds)
#Add in column names
dimnames(coords.mds)[[2]] <- c("x", "y", "Colors")

topicPolygon <- 8
#Example topic number of the 16 topics to generate the convex hull with transparent fill.
#Create first hull
hull <- cbind(coords.mds[coords.mds[,3]==topicPolygon,1],
indices <- which(coords.mds[,3]==topicPolygon) #Needed for ggplot subsetting of data frame.
hpts <- chull(hull)
hpts <- c(hpts,hpts[1])

#Create function to get the new convex hull onion layer from previous onion layer
onionLayer <- function(indices, hull, hpts){
  hull2 <- hull[-hpts,]
  indices2 <- indices[-hpts] 
  hpts2 <- chull(hull2)
  layer <- c(hpts2,hpts2[1])
  list(indices2, hull2, layer, indices2[layer])

#Generate 4 onion layers aside from the first hull
#TODO:  create recursive function with user defined stopping criteria (number of layers) and failsafe number of hull points greater than or equal to 3
AnOnionLayer1 <- onionLayer(indices, hull, hpts)
AnOnionLayer2 <- onionLayer(AnOnionLayer1[[1]], AnOnionLayer1[[2]], 
AnOnionLayer3 <- onionLayer(AnOnionLayer2[[1]], AnOnionLayer2[[2]], 
AnOnionLayer4 <- onionLayer(AnOnionLayer3[[1]], AnOnionLayer3[[2]], 

#Plot all jittered points with the tsne 87 point layout and show the convex hull and 4 interior convex hull onion layers
polygonFillColor <- colorPalette[topicPolygon] 
ggplot(coords.mds, aes(x=x, y=y))+
    geom_point(position = position_jitter(w=1,h=1),alpha=0.2, 
    geom_point(data=coords.mds[indices,1:2], alpha=0.99, color=polygonFillColor)+
    #TODO:  for loop through number of layers (see previous TODO)
    xlab("MDS Dim 1")+
    ylab("MDS Dim 2")

plot of chunk ConvexHullOnionLayers

#Note, once in a while tsne will get stuck in a local min and the results #will come out very odd, where a group of points will be placed at a #considerable distance from the rest of the points.  

Find an MDS method that best distinguishes the full distance matrix in 2D and 3D. First show a scree plot of the classical (linear) MDS method. The eigenvalues fall off somewhat slowly, such that the linear method probably won’t be so great in just 2 or 3 dimensions – still the first 3 dimensions captures nearly 40% of the (orthogonal) variance in the n-1 (87-1 = 86) possible dimensional space.

#Show scree plot of MDS of crossTable
mmds1 <- mmds(as.matrix(completeDisMatrix), pc = 15)
scree.plot(mmds1$eigen.perc, lab = TRUE, 
           title = "Scree plot of metric MDS\nCross Table Distance Matrix")

Generate Mantel correlation between the complete distance matrix and the various MDS methods to show which method(s) have the highest correlation. This hideous construction of the results data frame of the results could be done much more elegantly (e.g., by looping through a function list) if all of the MDS functions had the same parameters and return objects – alas, neither is true.

MDSResultsDFrame <- data.frame(MDSMethods = 
                    c("cmdscale","isoMDS","sammon", "smacofSym","tsne"),
                    TwoD = c(
ThreeD = c(
## initial  value 55.779265 
## iter   5 value 54.410214
## iter  10 value 52.655049
## iter  15 value 50.868053
## iter  20 value 49.061078
## final  value 46.218401 
## converged
## Initial stress        : 0.54337
## stress after  10 iters: 0.27543, magic = 0.018
## stress after  20 iters: 0.19419, magic = 0.009
## stress after  22 iters: 0.18855
## initial  value 40.717644 
## final  value 38.439091 
## converged
## Initial stress        : 0.39697
## stress after   9 iters: 0.18408
#Some of the MDS methods have output stats and iteration information.  The #Also note, Mantel test results for those MDS methods that depend upon a stochastic process may differ slightly from run to run.

Show the Table of MDS Mantel test results.

MDSMethods TwoD ThreeD
cmdscale 0.5853 0.7301
isoMDS 0.6903 0.7213
sammon 0.7461 0.7387
smacofSym 0.9172 0.9479
tsne 0.5240 0.6059

The following is the 3D convex hull prep code and functions.

coords3D.tsne <- matrix(0,nrow=tableRows,ncol=4,byrow=TRUE);
dimnames(coords3D.tsne)[[2]] <- c("MDS1", "MDS2", "MDS3","Topics16")
MDSTopics3D <- tsne(completeDisMatrix,k=3)
#Note that the tsne (and other nonlinear methods) optimize with respect to the dimension, so taking the first two dimensions of the 3D output is not the same as running tsne on just two dimensions.
jitterAmount <- (max(MDSTopics3D) + abs(min(MDSTopics3D)))/100 
for(i in 1:tableRows){
    #100 is added to each coordinate before jittering to expand the default jittering
    coords3D.tsne[i,1] <- jitter(MDSTopics3D[which(table1[i,2] == 1:87),1]+100,factor=1)
    coords3D.tsne[i,2] <- jitter(MDSTopics3D[which(table1[i,2] == 1:87),2]+100,factor=1)
    coords3D.tsne[i,3] <- jitter(MDSTopics3D[which(table1[i,2] == 1:87),3]+100,factor=1)
    coords3D.tsne[i,4] <- table1[i,3]
colorPalette16 <- colorPalette[coords3D.tsne[,4]]
#Functions to draw the 2D hulls.
drawhullLinesXY <- function(j,hull,hpts,repmins,TopicColor){
   rgl.lines(hull[hpts[j:(j+1)],1], hull[hpts[j:(j+1)],2], repmins, col=TopicColor, lty=2)
drawhullLinesYZ <- function(j,hull,hpts,repmins,TopicColor){
   rgl.lines(repmins,hull[hpts[j:(j+1)],1], hull[hpts[j:(j+1)],2], col=TopicColor, lty=2)
drawhullLinesXZ <- function(j,hull,hpts,repmins,TopicColor){
   rgl.lines(hull[hpts[j:(j+1)],1], repmins, hull[hpts[j:(j+1)],2], col=TopicColor, lty=2)

findHulls2D <- function(clustermembership,x,y,repmins,coordsflag,TopicColor){
   hull <- cbind(x[clustermembership],y[clustermembership])
   hpts <- chull(hull)
   hpts <- c(hpts,hpts[1])
   if(coordsflag == 1){
             drawhullLinesXY, hull, hpts, repmins,TopicColor)
   } else if(coordsflag == 2){
             drawhullLinesYZ, hull, hpts, repmins,TopicColor)
   } else {
             drawhullLinesXZ, hull, hpts, repmins,TopicColor)

PlotClusters3D <- function(data, clusters, numHulls, colors, Topic, onehull){
    clusterd <- clusters == Topic #Find topic points
    TopicColor <- colorPalette[Topic]
    for(i in 1:numHulls){       
        Points3D <- cbind(data[clusterd,1], 
        indices <- which(clusterd)
        ts.surf <- t(convhulln(Points3D))
        #Plot the 3D convex hulls with considerable transparency (alpha=.15)
        analpha <- 0.2
        if(onehull == 0){
        else if(i == onehull){
          #Plot convex hull of specified onion layer
          #Plot Alpha Shape of specificed onion layer
          x <- cbind(Points3D[ts.surf,1],Points3D[ts.surf,2],
          ashapealpha <- 66 
          ashape3d.obj <- ashape3d(x, alpha = ashapealpha, pert=TRUE)
          plot(ashape3d.obj, clear = FALSE, transparency=0.25)

        #Onion layer plotting: substract out previous hull here 
        clusterd[indices[sort(unique(as.vector(ts.surf)))]] <- FALSE

Plot the onion layered 3D convex hulls of a specific topic from the 16 topics.

topicPolygon <- 8 #Set the topic to be viewed (1-16)
firstfocus <- open3d() #Open the viewing window
par3d(FOV=11)  #Set the field of view -- this setting is nearly isometric.
bg3d("dark gray")  #Set the background color

plot3d(coords3D.tsne[,1], coords3D.tsne[,2], coords3D.tsne[,3], xlab = "MDS 1", 
       ylab="MDS 2", zlab="MDS 3",col=colorPalette16)  #Plot the points in 16 topic colors
PlotClusters3D(coords3D.tsne[,1:3],coords3D.tsne[,4], 4, 
               colorPalette, topicPolygon, 0)  #Create the 2D and 3D convex hulls

Show a 3D alpha hull of one of the onion layers.

secondfocus <- open3d()
bg3d("dark gray")
plot3d(coords3D.tsne[,1], coords3D.tsne[,2], coords3D.tsne[,3], xlab = "MDS 1", 
       ylab="MDS 2", zlab="MDS 3",col=colorPalette[coords3D.tsne[,4]])
PlotClusters3D(coords3D.tsne[,1:3],coords3D.tsne[,4], 2, colorPalette, topicPolygon, 2)
## Warning: The general position assumption is not satisfied
## Perturbation of the data set was required
## Device  2  : alpha =  66