CLUSTERING

Clustering is an unsupervised learning method that groups data according to proximity or “similarity.” While there are many different clustering methods, the general framework uses some distance metric to compare the “closeness,” or similarity, between data vectors and then assigns data points to like groups accordingly. The two clustering techniques explored in the following analyses are partitional clustering and hierarchical clustering.

Partitional clustering encompasses the most common and widely used clustering techniques. This iterative process builds clusters around a set of randomly or strategically selected “centroids” (i.e., cluster centers): data points are assigned to the cluster whose centroid they are closest to, each centroid is then recalculated as the mean of all data points contained within its cluster, and the assignment process is repeated.
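
As a rough illustration of the process described above, the sketch below implements a single assign-and-update cycle in R. This is a minimal sketch, not the implementation used in this analysis; the function name kmeans_step, the data matrix X, and the centroids matrix are all hypothetical.

    # One iteration of the k-means loop described above (illustrative sketch).
    # X: numeric data matrix (one row per point); centroids: k x ncol(X) matrix.
    kmeans_step <- function(X, centroids) {
      # Assign each point to its nearest centroid (squared Euclidean distance)
      assignments <- apply(X, 1, function(x) {
        which.min(colSums((t(centroids) - x)^2))
      })
      # Recalculate each centroid as the mean of the points assigned to it
      new_centroids <- t(sapply(seq_len(nrow(centroids)), function(j) {
        colMeans(X[assignments == j, , drop = FALSE])
      }))
      list(assignments = assignments, centroids = new_centroids)
    }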

In the scope of text exploration, clustering can be used to identify groupings of similar documents and/or to validate existing document groupings. For example, with existing knowledge of the origin or subreddit associated with each Reddit post, clustering results can be compared against those known labels to see whether the posts naturally group by subreddit.

Hierarchical Clustering

First, hierarchical clustering can be applied to an aggregated version of the data to identify possible similarities between the threads as a whole. While some subreddits tend toward clearly GMO-supporting or anti-GMO posters, other threads offer a more ambiguous space for general discussion. The following two subreddit descriptions show an example of a thread description that invites a more explicit bias versus one that could invite members from either or both sides of the debate.

On the other hand, even with a somewhat explicit leaning, other factors such as sarcasm, misinterpretation, and trolling could make it difficult to determine the trending opinion behind posts in any one subreddit.

A preliminary hierarchical clustering model can reveal potential similarities between two or more of the subreddits. By aggregating the data by subreddit, the threads can be analyzed as cohesive groups of text (rather than analyzing each post individually). The condensed, simplified Document Term Matrix contains four observations (rows), each representing a subreddit and the term frequency totals for all text/posts gathered, across 763 terms of interest.
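
A minimal sketch of how this aggregation might be produced in R, assuming dtm is the post-level Document Term Matrix and subreddit is the vector of thread labels (both names are hypothetical):

    # Sum the term counts across all posts from the same subreddit,
    # collapsing the post-level DTM into one row per thread
    dtm_agg <- rowsum(as.matrix(dtm), group = subreddit)
    dim(dtm_agg)   # 4 rows (subreddits) x 763 terms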

Aggregated Document Term Matrix (pictured: sample snapshot)

Note that the non-standardized Document Term Matrix generated using CountVectorizer was used for this analysis. Despite variation in total word count across the Reddit threads, there is no need to normalize the matrix prior to clustering, due to the distance metric used.

Cosine similarity is a non-Euclidean distance metric that compares vectors according to the angle formed between them. Normalization is encapsulated within the distance calculation itself, so standardizing the data prior to generating the dissimilarity matrix would be redundant. Additionally, unlike Euclidean metrics, cosine distance is bounded and does not inflate with greater dimensionality, which makes it especially useful for text data (text data typically involves extensive vocabularies, and the sheer number of terms leads to high-dimension DTMs and TDMs).
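
A sketch of how the cosine dissimilarities might be computed and passed to hclust in R, reusing the hypothetical dtm_agg from above (the Ward linkage is an assumption; the analysis text does not specify one):

    # Cosine similarity: dot product of each pair of row vectors divided
    # by the product of their magnitudes; dissimilarity = 1 - similarity
    m <- as.matrix(dtm_agg)
    norms <- sqrt(rowSums(m^2))
    cos_sim <- (m %*% t(m)) / (norms %o% norms)
    cos_dist <- as.dist(1 - cos_sim)

    # Hierarchical clustering on the cosine dissimilarities
    hc <- hclust(cos_dist, method = "ward.D2")
    plot(hc)   # dendrogram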

Results & Conclusions

The dendrogram to the right shows the results of hierarchical clustering (using the 'hclust' algorithm). While these results should be taken with a grain of salt, this preliminary analysis suggests two clusters: one grouping the r/GMO and r/GMOFacts subreddits together, and one grouping the r/GMOfree and r/GMOMyths subreddits together. While the considerably small quantity of data used could limit the significance of these results, these similarities could begin to explain thread-specific factors such as the trending opinion/viewpoint or the nature of the subject matter that tends to be shared in each thread.

Partitional Clustering

Condensing all of the posts from each subreddit forces many significant assumptions about the similarities between posts within a single subreddit. However, it is quite possible that the assumed trends do not exist within the sampled text data (or at all)! In order to observe the similarities between posts without bias from preconceived labels or groupings, the labels can be completely removed and partitional clustering can be carried out on the unaggregated data.

Although using four clusters seems intuitive based on the four subreddit categories, in order to eliminate bias and honor exploration, three unsupervised calculation methods are used to determine the optimal number of clusters.

Elbow Method

Unfortunately, this method does not seem to reveal an obvious optimal k-value. There appears to be a steady decline in the Total Within Sum of Squares across the tested cluster quantities, with a possible elbow at k=8.
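
One common way to generate this plot in R is fviz_nbclust from the factoextra package; the package choice and the k.max value are assumptions, and dtm is again the hypothetical unlabeled post-level matrix:

    library(factoextra)

    # Total Within Sum of Squares for k = 1..10 (elbow plot)
    fviz_nbclust(as.matrix(dtm), kmeans, method = "wss", k.max = 10)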

Silhouette Method

The silhouette method gives a much clearer indication for selecting a k-value: the average silhouette width peaks at five clusters, so according to this method, k=5 should be used.
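
The same (assumed) function can produce the silhouette plot:

    # Average silhouette width for k = 2..10; the peak marks the
    # suggested number of clusters (k = 5 in this analysis)
    fviz_nbclust(as.matrix(dtm), kmeans, method = "silhouette", k.max = 10)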

Gap Statistic

The gap statistic method compares the total within-cluster variation for different values of k against its expected value under a null reference distribution of the data, selecting the k for which the observed clustering structure deviates most from that reference.
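
A sketch of the computation using clusGap from the cluster package (the package choice and the number of reference samples B are assumptions):

    library(cluster)

    # Gap statistic: compares the observed within-cluster dispersion to
    # that expected under a null (uniform) reference distribution
    gap <- clusGap(as.matrix(dtm), FUNcluster = kmeans, K.max = 10, B = 50)
    print(gap, method = "firstmax")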

Results & Conclusions

By applying the "kmeans" clustering algorithm in R with 5 clusters, the following groupings resulted (a sketch of the call is shown below):
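
A minimal sketch of that call (the seed and the nstart value are assumptions):

    set.seed(42)   # reproducible starting centroids
    km <- kmeans(as.matrix(dtm), centers = 5, nstart = 25)
    km$cluster     # cluster assignment for each post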

Due to the high dimensionality of the data, an over-simplified two-dimensional representation fails to show the true clustering logic. Therefore, the clusters are better represented in the following data table. Note that the colors represent each cluster. The vectors are displayed as transaction-style data (this inclusion of terms provides visual insight into the subject matter of the clusters, but does NOT account for term frequency, which is the basis on which the vectors were clustered). The subreddit labels have also been added back into the data table for conclusion-drawing purposes.

High Level Conclusions

The summary table above shows the number of vectors (i.e., Reddit posts) in each cluster, aggregated by subreddit. It appears that many of the posts ended up in cluster 2, while very few were assigned to cluster 1. This imbalance makes logical sense: it is expected that many of the posts would contain common or similar subject matter, while others might branch out to less commonly discussed aspects of the conversation. A closer look at the individual vectors in each cluster is required to draw more conclusive results.
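
A summary like this can be produced by cross-tabulating the cluster assignments against the restored labels; a sketch reusing the hypothetical km and subreddit objects from above:

    # Count how many posts from each subreddit landed in each cluster
    table(subreddit = subreddit, cluster = km$cluster)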