DataPrep/EDA

The following dataset contains records from a study on public perception of genetically modified organisms (GMOs). The study recognized that perception research has generally focused on GM foods and consumable product packaging. However, with the increasingly popular ‘grow your own food’ movement and the growing prevalence of genetic modification in all facets of agriculture, this particular study focuses on social perception of GM plants. Rather than showing test subjects food products marked with GMO or non-GMO labeling, the study examines an earlier stage of the GM pipeline using food-producing plants. In analyzing public perception of a controversy riddled with misinformation and conflicting agendas, a shift in modality can offer deeper insight into people’s opinions. In other words, while the thought of “genetically engineered” food might be rapidly dismissed based on common stigmas, plants offer a more literal example of GMOs at work (through crops).

Overall, the original study empirically analyzes customers’ perceived knowledge of, versus willingness to pay (“WTP”) for, genetically modified organisms. The dataset contains records from 1680 study participants. In short, the participants were shown a series of 16 choice scenarios in which they chose “Option A,” “Option B” or “Neither A nor B.” The choice scenarios consisted of plant pictures with product characteristics/information listed. The following image contains one example of the choice scenarios:

The plants shown included a mixture of ornamental house plants and food-producing fruit plants.

As discussed previously, much of the debate and controversy surrounding GMO implementation stems from social stigma and agenda (good or bad) rather than scientific exploration or backing. Therefore, public perception data could offer insight into the forces that determine such stigma. Given the experimental and demographic attributes contained within the survey data, the dataset can be used to identify attributes that impact one’s perception of GMOs, and ultimately to predict an individual’s likelihood of choosing or not choosing genetically modified products. The data was obtained from the API of Harvard University's data repository, which is built on Dataverse (an open-source web application). In addition to the API, Dataverse offers Python and R packages with extensive functionality to simplify the data mining process. The datafile can be accessed in .tab format using the Dataverse access URL listed below.

The dataverse package contains various methods for getting data, depending on the file’s available metadata. In this case, the data was pulled into RStudio by referencing the file’s unique identification number as well as the Harvard server.
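The original import was done in R with the dataverse package. As a rough Python equivalent, the sketch below rebuilds the access URL from the server name and file ID quoted on this page and parses the tab-delimited response using only the standard library. The function names (`datafile_url`, `parse_tab`, `fetch_rows`) are made up for illustration and are not part of any Dataverse client.

```python
import csv
import io
from urllib.request import urlopen

# Server and file ID taken from the access URL quoted in this write-up.
SERVER = "dataverse.harvard.edu"
FILE_ID = 4641884


def datafile_url(server: str, file_id: int) -> str:
    """Build the Dataverse file-access URL from the server name and file ID."""
    return f"https://{server}/api/access/datafile/{file_id}"


def parse_tab(text: str) -> list[dict[str, str]]:
    """Parse tab-delimited text into one dict per row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_rows(server: str = SERVER, file_id: int = FILE_ID) -> list[dict[str, str]]:
    """Download the .tab file and parse it (requires network access)."""
    with urlopen(datafile_url(server, file_id)) as response:
        return parse_tab(response.read().decode("utf-8"))
```

Calling `fetch_rows()` should return one dictionary per raw record, assuming the file is still served at that address and is UTF-8 encoded.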

Study on Perceived Subjective vs Objective Knowledge of GM Crops

DATA PREP & EDA

https://dataverse.harvard.edu/api/access/datafile/4641884

https://www.scsglobalservices.com/services/veriflora-certified-sustainably-grown

https://www.nongmoproject.org/

The study includes plants with various combinations of known, verified labels, including Veriflora (a certification for sustainably grown plants) and Heirloom (a designation verifying the preservation of age-old plant varieties). The non-GMO plant options are marked by either text or the Non-GMO Project logo (pictured below).

Link to full Study & Published Report:

Perceived subjective versus objective knowledge: Consumer valuation of genetically modified certification on food producing plants

>> Click here to access the full Data Import Code (.rmd) <<

Rihn A, Khachatryan H, Wei X. Perceived subjective versus objective knowledge: Consumer valuation of genetically modified certification on food producing plants. PLoS One. 2021 Aug 19;16(8):e0255406. doi: 10.1371/journal.pone.0255406. PMID: 34411110; PMCID: PMC8376035.

The raw dataset contains 40320 records (rows) across 73 attributes (columns). Although the data contains both quantitative and categorical attributes, the raw dataset lists 67 attributes as numeric and 6 as character. Within R, the data was transformed into a dataframe, written to a CSV file and transferred into Python for data cleaning.
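One plausible reason for the numeric/character split is that categorical attributes coded as 0/1 are stored as numbers. The sketch below illustrates this on a tiny made-up sample by labeling a column numeric whenever every non-empty value parses as a number. The column names are placeholders rather than the study's headers, and the rule is a simplification of how R infers types.

```python
import csv
import io


def is_number(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def column_kinds(rows: list[dict[str, str]]) -> dict[str, str]:
    """Label a column 'numeric' if every non-empty value parses as a number."""
    kinds = {}
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] != ""]
        all_numeric = bool(values) and all(is_number(v) for v in values)
        kinds[name] = "numeric" if all_numeric else "character"
    return kinds


# Made-up sample: 'choice' is categorical in meaning but coded as 0/1.
sample = "id,choice,region\n1,0,South\n2,1,West\n"
rows = list(csv.DictReader(io.StringIO(sample)))
kinds = column_kinds(rows)
print(kinds)
```

Here the binary `choice` column is counted as numeric even though it encodes a category, which is why a type count alone understates how many attributes are categorical.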

^ Snapshot of raw data prior to cleaning (curated & written to CSV in R) ^

Click here to download full Raw Dataset

^ Snapshot of data after cleaning (cleaned & written to CSV in Python) ^

Click here to download full Cleaned Dataset

Full Data Cleaning Code (.py)

DIMENSIONALITY REDUCTION

Rows: In the source study, each trial is represented by 3 records, one for each of the 3 options given to the participant (this is because many of the attributes are represented as individual binary columns). For the purposes of this exploration, which focuses specifically on whether the participant chose GMO, each trial can be compressed to a single row containing only the plant option selected by the subject. This transformation reduced the dataset by two-thirds.
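The row compression described above can be sketched as follows. This is a minimal illustration, not the project's cleaning code: the column names (`participant_id`, `scenario`, `option`, `gmo`, `chosen`) are hypothetical, and it assumes each trial carries a 0/1 flag marking the selected option.

```python
# Hypothetical column names: the real dataset's headers may differ.
TRIAL_KEYS = ("participant_id", "scenario")
CHOSEN = "chosen"  # assumed flag: 1 if the participant picked this option, else 0


def keep_chosen(rows):
    """Collapse each trial's option rows into the single row that was chosen."""
    selected = {}
    for row in rows:
        if int(row[CHOSEN]) == 1:
            trial = tuple(row[key] for key in TRIAL_KEYS)
            if trial in selected:
                raise ValueError(f"trial {trial} has more than one chosen option")
            selected[trial] = row
    return list(selected.values())


# Made-up example: two trials with three options each (A, B, neither).
rows = [
    {"participant_id": 1, "scenario": 1, "option": "A", "gmo": 1, "chosen": 0},
    {"participant_id": 1, "scenario": 1, "option": "B", "gmo": 0, "chosen": 1},
    {"participant_id": 1, "scenario": 1, "option": "neither", "gmo": 0, "chosen": 0},
    {"participant_id": 1, "scenario": 2, "option": "A", "gmo": 1, "chosen": 1},
    {"participant_id": 1, "scenario": 2, "option": "B", "gmo": 0, "chosen": 0},
    {"participant_id": 1, "scenario": 2, "option": "neither", "gmo": 0, "chosen": 0},
]
collapsed = keep_chosen(rows)
print(len(rows), "->", len(collapsed))
```

If every trial has exactly three records and one selection, the same step would take the 40320 raw rows down to 40320 / 3 = 13440.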

Columns: Throughout the cleaning process, a total of 49 attributes were dropped. Many of the columns contained redundant or irrelevant information. Ultimately, the attributes were dropped for one of three overarching reasons:

Irrelevance: the attribute serves a purpose within the original use case but does not contribute to modeling or to gaining insight into perception of GMOs (e.g., survey id).
Redundancy/repetition: the information contained within the column is better represented by another existing column, or the data was transformed to condense the attribute into a form better suited to machine learning.
Calculated metric/statistic: the dataset contains many attributes that were generated from the collected data. However, these metrics pertain to the specific calculations and insights of the original study. Because they were derived from the collected data, they do not offer insight beyond what can be extracted from the raw data.
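One way to keep the three drop reasons explicit in code is to group the column names by reason and remove them in a single pass. The sketch below assumes placeholder column names: only the survey id example comes from the description above, and the rest are hypothetical.

```python
# Placeholder column names grouped by the reason for dropping them.
# Only "survey_id" echoes an example from the text; the rest are hypothetical.
DROP_REASONS = {
    "irrelevance": ["survey_id"],
    "redundancy": ["option_a_flag", "option_b_flag"],
    "calculated": ["wtp_estimate"],
}


def drop_columns(rows, reasons=DROP_REASONS):
    """Return copies of the rows without any column listed under a drop reason."""
    to_drop = {name for names in reasons.values() for name in names}
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Made-up single record to show the effect.
rows = [
    {"survey_id": 7, "option_a_flag": 1, "option_b_flag": 0,
     "wtp_estimate": 2.5, "gmo": 1, "age": 34},
]
cleaned = drop_columns(rows)
print(sorted(cleaned[0]))
```

Keeping the reasons as dictionary keys documents why each column was removed. Applied to the real data, dropping 49 of the 73 raw attributes leaves 24 of the original columns, before counting any columns created during transformation.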

DATA TYPE