NAIVE BAYES

Naive Bayes is a probabilistic machine learning algorithm commonly used for classification tasks, such as sentiment analysis. As with other Bayesian methods, "probabilistic" here means the algorithm calculates the probability of each data point belonging to a given class based on counts of feature occurrences in the training data. Unlike many other supervised models, Naive Bayes provides probability values that indicate the confidence of its predictions. The algorithm is based on Bayes' theorem and rests on the assumption that the features are independent of one another. This assumption may or may not hold in practice, hence the "naive" in its name. The model works by calculating the probability that a data point belongs to each class, based on the probability of its features appearing in that class, and then selecting the class with the highest probability. Naive Bayes is fast, efficient, and requires relatively little training data compared to other algorithms, making it a popular choice in many machine learning applications. However, its assumption of feature independence can limit its performance on certain complex datasets.
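The classify-by-highest-probability idea above can be sketched in a few lines. This is a minimal toy example, assuming scikit-learn (the original toolchain is not stated) and made-up data, not the survey data:

```python
# Minimal Naive Bayes sketch (scikit-learn assumed; toy data, not the survey data).
from sklearn.naive_bayes import GaussianNB

# Toy features: [price, age]; labels: 1 = chose GMO plant, 0 = did not.
X = [[10, 25], [12, 30], [30, 55], [28, 60], [11, 22], [29, 58]]
y = [1, 1, 0, 0, 1, 0]

model = GaussianNB()
model.fit(X, y)

# predict_proba exposes the per-class probabilities Naive Bayes computes;
# predict simply returns the class with the highest probability.
probs = model.predict_proba([[13, 27]])[0]
pred = model.predict([[13, 27]])[0]
print(pred, probs)
```

The probability output is what distinguishes Naive Bayes from models that return only a hard class label: the two entries of `probs` sum to 1 and quantify the model's confidence.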

To choose or not to choose, GMOs?

...THAT is the question! Naive Bayes is particularly versatile because it can be applied to mixed data types, which makes it a natural fit for the GMO perception survey data (see the Data page for a full description). The data contains information about each surveyed individual, including:

  • Plant type - the type of fruit tree or plant variety

  • Heirloom - branding indicating age-old plant varieties

  • Income - Average income of individual

  • Price - purchasing cost of the plant

  • Age - age of individual being surveyed

  • Sex - biological sex of individual being surveyed

  • Household size - number of individuals (kids + adults) in the surveyed individual's household

  • Education - education level on a scale of 1 to 5 (level definitions unknown)

  • Ethnicity - ethnicity of individual, represented by a value 1-6 (ethnicity definitions unknown)

  • Veriflora - branding indicating sustainable growing practices

  • Objective knowledge - individual's knowledge of GMOs according to a 3-question quiz

  • Subjective knowledge - individual's self-declared (i.e. perceived) knowledge of GMOs

The target variable 'GMO' contains a binary value representing whether the individual selected the GMO plant (1) or not (0). The data was split into Training and Testing subsets. Note that there was a major imbalance between GMO classes 1 and 0: there were far more 'not selecting' than 'selecting' entries, because the pool of plant options usually contained more non-GMO choices. The data was therefore split in a way that balances the GMO classes, so that the model is trained to recognize both cases.
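A class-balanced train/test split of the kind described here can be sketched with a stratified split. This assumes scikit-learn and pandas, and the DataFrame and column names below are hypothetical stand-ins for the real survey data:

```python
# Sketch of a class-balanced train/test split (scikit-learn assumed;
# the DataFrame and column names are hypothetical stand-ins).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Price": [10, 12, 30, 28, 11, 29, 14, 27],
    "Age":   [25, 30, 55, 60, 22, 58, 33, 49],
    "GMO":   [1, 1, 0, 0, 1, 0, 1, 0],
})

X = df[["Price", "Age"]]
y = df["GMO"]

# stratify=y keeps the GMO 1/0 ratio the same in both subsets,
# so the model sees both classes during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.value_counts().to_dict())
```

Stratification preserves the class ratio in both subsets; when the original pool is itself imbalanced, resampling (over- or under-sampling the training set) is a common complementary step.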

- Full Model: All Predictors -

Initially, a multinomial Naive Bayes model was applied to the data to predict the likelihood of an individual selecting a GMO plant. However, the model performed very poorly (see confusion matrix).

When all 6 predictor variables were used, the model predicted at:

60.77% Accuracy
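The fit-and-evaluate step behind this accuracy figure can be sketched as follows. This assumes scikit-learn, and the synthetic count-style data below merely stands in for the encoded survey features; the printed numbers are not the study's results:

```python
# Sketch of fitting multinomial Naive Bayes and reading off accuracy and
# the confusion matrix (scikit-learn assumed; data is synthetic).
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(100, 6))   # 6 predictors, as in the full model
y_train = rng.integers(0, 2, size=100)        # GMO: 1 = selected, 0 = not
X_test = rng.integers(0, 5, size=(40, 6))
y_test = rng.integers(0, 2, size=40)

nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, cols: predicted
print(acc)
print(cm)
```

The confusion matrix is the more informative of the two outputs here: with imbalanced classes, a single accuracy number can mask poor performance on the minority class.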

- Reduced Model -

In order to improve the predictive power of the model, a feature selection process was carried out using Logistic Regression. This revealed that only the following variables were significant:

['Heirloom', 'Price', 'Sex', 'Subjective Knowledge']

Sure enough, when Naive Bayes was applied to the reduced model (4 predictor variables), it predicted with improved accuracy:

68.21% Accuracy

Despite the slight improvement, Naive Bayes overall does not predict likelihood to select GMO plants very effectively (based on the given data). It is worth noting that the original study from which the data was sourced did not seek to predict plant choice, but rather to explore individuals' perceived versus actual knowledge of genetically modified organisms. The study contained multiple clinical trial types, and the breadth of its data therefore lent itself to the goals and questions of this project.

While attributes such as plant price and the presence of the Heirloom label may seem like intuitive indicators of an individual's likelihood to purchase GMO products, the subjective knowledge attribute is worth further exploration. True to the research goals of the original study, and to the discussion explored throughout the textual analyses of this project, subjective knowledge (i.e. what people believe they know) largely shapes resistance to GMOs as a whole. In this way, the model reduction above further emphasizes the significant polarization in the genetically modified organisms debate.

Text Data: NewsAPI.org

An additional exploration was conducted using text data obtained from NewsAPI.org. Article descriptions were collected for the following 3 keywords: "organic," "pesticide," and "livestock." These words were selected because they cover 3 major uses or topics where genetic modification becomes relevant. Because Naive Bayes models become quite complex with high-dimensional data, the goal of this exploration is to find simple topics that can be effectively categorized/identified with a small vocabulary of associated terms (in this way the number of variables can be reduced to reasonably fit a NB model).
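The label-vectorize-fit pipeline can be sketched as below. This assumes scikit-learn, and the snippets are made-up stand-ins for real NewsAPI.org article descriptions:

```python
# Sketch of the text pipeline: label descriptions by keyword, vectorize
# them into a document-term matrix, and fit multinomial Naive Bayes
# (scikit-learn assumed; the documents are made-up stand-ins).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "certified organic produce sold at the farmers market",
    "organic labeling rules for grocery stores",
    "new pesticide ban proposed after crop study",
    "pesticide residue found on imported fruit",
    "livestock feed prices rise for cattle ranchers",
    "disease outbreak hits livestock herds",
]
labels = ["organic", "organic", "pesticide", "pesticide", "livestock", "livestock"]

# Document-term matrix: one row per description, one column per term.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(dtm, labels)
pred = clf.predict(vectorizer.transform(["organic vegetables at the market"]))[0]
print(pred)
```

Each column of the document-term matrix is one feature, which is why vocabulary size directly drives model dimensionality; restricting topics keeps the matrix small enough for Naive Bayes to fit sensibly.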

Article Descriptions collected from NewsAPI.org (labeled)
Article data vectorized & transformed to Document Term Matrix
- Results -

An initial glance at the confusion matrix might suggest somewhat reasonable predictions, indicated by the correctly predicted values along the diagonal (particularly the high accuracy of the "organic" label predictions). However, a closer look reveals the high number of incorrect predictions off the diagonal, which points to much less impressive model performance.

This indicates that although the selected keywords are hot topics of discussion within the GMO conversation, there is likely significant crossover in the terminology across the different topics. It is therefore difficult to distinguish (i.e. predict) which topic a particular headline belongs to.
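The diagonal-versus-off-diagonal reading can be made precise by computing per-class recall (the diagonal entry divided by its row sum). The matrix below is hypothetical, chosen to mimic the pattern described: strong "organic" predictions, weak elsewhere; it is not the actual result:

```python
# Per-class recall from a 3x3 confusion matrix (values hypothetical,
# illustrating the diagonal vs. off-diagonal reading, not actual results).
import numpy as np

labels = ["organic", "pesticide", "livestock"]
cm = np.array([
    [30,  5,  5],   # true organic
    [10, 15, 15],   # true pesticide
    [12, 14, 14],   # true livestock
])

# Recall for each class = correct predictions / all true members of the class.
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(labels, recall):
    print(f"{name}: {r:.2f}")

overall = cm.diagonal().sum() / cm.sum()
print(f"overall accuracy: {overall:.2f}")
```

A breakdown like this makes the asymmetry explicit: one strong class can prop up the overall accuracy while the other classes are barely better than chance.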