Association Rule Mining
Association Rule Mining (ARM) is a method for identifying items that frequently appear together. For text data, it is typically used to find subject matter that co-occurs or falls under a common topic. ARM operates on transaction data, so each document (row) should contain a list of the words found within that document. Transaction data does not account for term frequency: each term is listed only once within a transaction, regardless of how many times it occurs in the document. The model outputs "rules," or associations between subgroups of items. Each rule expresses association through three core metrics:
Support - the likelihood that the item(s) occur together in a transaction, regardless of order (i.e. how common or rare the item set is)
Confidence - the likelihood that the item(s) on the right-hand side of the rule occur in a transaction, given that the item(s) on the left-hand side occur (i.e. how common or rare item(s) are with respect to other item(s))
Lift - how strongly items are associated with one another (a value of 1 indicates that the items are independent, i.e. no association)
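The three metrics above can be illustrated with a small plain-Python sketch. This is not the arules workflow used for this analysis. The example documents and the helper names (to_transaction, support, confidence, lift) are assumptions made purely for illustration.

```python
# Hypothetical toy documents; each one becomes a single transaction.
documents = [
    "golden rice is a genetic engineering example",
    "genetic engineering of rice and rice crops",
    "survey question about food",
    "genetic engineering survey",
]


def to_transaction(text):
    """Return the distinct words in a document (term frequency is dropped)."""
    return frozenset(text.lower().split())


transactions = [to_transaction(doc) for doc in documents]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs.

    Assumes lhs occurs in at least one transaction (otherwise this divides by zero).
    """
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence of lhs -> rhs divided by the support of rhs (1 means independent)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


print(support({"genetic", "engineering"}, transactions))
print(confidence({"rice"}, {"golden"}, transactions))
print(lift({"rice"}, {"golden"}, transactions))
```

Note that "rice" appears twice in the second toy document but only once in its transaction, which mirrors the point that transaction data ignores term frequency.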
3D Network | Top by Lift Association Rules
Drag and zoom on the graphic below to better understand the connections between highly associated words.
Top Rules by Support
Top Rules by Confidence
Top Rules by Lift
Conclusions & Next Steps
The generated rules revealed some fairly intuitive associations, such as "genetic" + "engineering." Additionally, the association between "golden" + "rice" makes sense, as golden rice is a very commonly referenced example of widespread implementation of genetically modified crops (see introduction page for details). Similarly, words such as "survey," "project," and "question," which are less relevant to the specific subject matter, point to other individuals using Reddit to gauge social perception of this hot topic. This makes logical sense and ties back to the original motivation for using the platform to explore the social and scientific narrative of GMOs.
Data
Originally, the ARM model was fit to text data containing both unigrams and bigram word pairings. However, this created redundancy: the bigram terms were ALWAYS found in the same post as their individual component words, so most rules containing a single bigram added little to no new insight. For this reason, bigram terms were eliminated from the model.
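A minimal sketch of why the bigram rules were redundant, assuming bigrams are added to each transaction alongside the individual words. The posts, the underscore naming of bigrams, and the helper names are hypothetical and are not taken from the original pipeline.

```python
def tokens_with_bigrams(text):
    """Return the distinct unigrams plus adjacent-word bigrams of a post."""
    words = text.lower().split()
    bigrams = [f"{first}_{second}" for first, second in zip(words, words[1:])]
    return frozenset(words + bigrams)


# Hypothetical posts, for illustration only.
posts = ["golden rice trial", "rice field", "golden rice debate"]
transactions = [tokens_with_bigrams(post) for post in posts]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


# Any post containing the bigram necessarily contains both of its words,
# so these rules always reach the maximum possible confidence.
print(confidence({"golden_rice"}, {"golden"}, transactions))
print(confidence({"golden_rice"}, {"rice"}, transactions))
```

Because the rule from a bigram to either of its component words holds by construction, it carries no information beyond what the unigram rules already show.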
It is possible that removing the more obviously associated words and the platform-specific words could reveal greater insight.
One interesting association that can be seen in the 3-node cluster above is the relationship between "crop," "animal," and "food." These three often represent the main subcategories of GMO use. Furthermore, from a purely scientific viewpoint, these different uses seem to pose very different threats, motivations, and potential side effects.
Association Rule Mining was conducted in R via the arules package, and the rules were pruned using a minimum support of 1% and a minimum confidence of 2%. The rules were then sorted by each of three metrics: support, confidence, and lift.
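The pruning and sorting steps can be sketched in plain Python as follows. This is a simplified stand-in, not the arules implementation: it only enumerates rules with one word on each side, whereas a full miner also considers larger item sets. The transactions and the helper top_rules are hypothetical. Only the two thresholds come from the text.

```python
from itertools import permutations

MIN_SUPPORT = 0.01     # 1% minimum support, as stated in the text
MIN_CONFIDENCE = 0.02  # 2% minimum confidence, as stated in the text

# Hypothetical transactions, for illustration only.
transactions = [
    frozenset({"genetic", "engineering", "crop"}),
    frozenset({"genetic", "engineering"}),
    frozenset({"golden", "rice", "crop"}),
    frozenset({"survey", "question"}),
]


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


vocabulary = sorted(set().union(*transactions))

rules = []
for lhs, rhs in permutations(vocabulary, 2):
    pair_support = support({lhs, rhs})
    if pair_support < MIN_SUPPORT:
        continue  # prune rare pairs (this also guarantees support(lhs) > 0)
    conf = pair_support / support({lhs})
    if conf < MIN_CONFIDENCE:
        continue  # prune low-confidence rules
    rules.append({
        "lhs": lhs,
        "rhs": rhs,
        "support": pair_support,
        "confidence": conf,
        "lift": conf / support({rhs}),
    })


def top_rules(metric, n=5):
    """Return the n rules with the highest value of the given metric."""
    return sorted(rules, key=lambda rule: rule[metric], reverse=True)[:n]


for metric in ("support", "confidence", "lift"):
    print(metric, [(r["lhs"], r["rhs"]) for r in top_rules(metric, n=3)])
```

Sorting the same pruned rule set three ways tends to surface different rules: support favors frequent pairs, while lift favors pairs that are rare overall but almost always appear together.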