Data

Much of the debate surrounding GMOs in food and plants stems from the starkly contrasting claims in circulation. These arguments are tied to scientific study, social narrative and experience, medical unknowns and economic agendas, making it difficult to determine the true root of the contention. Textual analysis can therefore offer insight into:

The potential benefits of GM crops
The downsides and harms of GM crops
Where, ultimately, people disagree

Text data can be leveraged to answer the following questions:

What is science saying about GMOs?
What is society saying about GMOs?
What are the driving forces behind both strong support for and strong opposition to GM crops?

Genetically Modified Crops: A Textual Exploration

NewsAPI.org API

BT CROPS VS HERBICIDE RESISTANT CROPS

NewsAPI.org is a JSON-based API that fetches current news articles from across the internet for a given keyword query.

Preliminary research suggests that many of the negative associations with GM crops stem from the harmful impact of herbicide-resistant crops, while the implications and downsides of Bt crops appear far less prominent. Although individuals on both sides of the debate tend to apply their biases to GM crops as a whole, analyzing these two crop types individually could reveal separate narratives.

For an initial exploration of these two topics, the keywords “Glyphosate” and “Bacillus thuringiensis” were used to fetch articles via the News API. The following URL shows a sample of the API endpoint with a query search for "Glyphosate":
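As a rough sketch of how such a query URL could be assembled in Python: the snippet below assumes the NewsAPI v2 "everything" search endpoint with its `q`, `language`, `pageSize` and `apiKey` parameters. The helper name `build_newsapi_url`, the parameter values and the `YOUR_API_KEY` placeholder are illustrative and are not taken from the project code.

```python
from urllib.parse import urlencode

# Assumption: the NewsAPI v2 "everything" search endpoint.
NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_newsapi_url(keyword, api_key, page_size=100, language="en"):
    """Return a request URL for one keyword query (hypothetical helper)."""
    params = {
        "q": keyword,           # search keyword, e.g. "Glyphosate"
        "language": language,   # assumed filter, not stated in the write-up
        "pageSize": page_size,  # number of articles requested per page
        "apiKey": api_key,      # personal key issued by NewsAPI.org
    }
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"


# One URL per keyword used in this exploration.
keywords = ("Glyphosate", "Bacillus thuringiensis")
urls = [build_newsapi_url(k, "YOUR_API_KEY") for k in keywords]
```

Each URL can then be requested with any HTTP client; the response body is JSON.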

The requests fetched about 150 articles across the two queries, whose descriptions were saved as individual text files. This corpus was then manually sorted to keep only the most relevant articles, which reduced it to 24 documents.

^ Snapshot of raw data from NewsAPI.org (JSON format) ^
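Saving each article description to its own text file could look roughly like the sketch below. It assumes the response JSON carries an `articles` list whose items have a `description` field; the helper name, the file naming scheme and the sample response are placeholders for illustration only.

```python
import tempfile
from pathlib import Path


def save_descriptions(response, out_dir, label):
    """Write every non-empty article description to its own .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, article in enumerate(response.get("articles", [])):
        description = article.get("description")
        if not description:
            continue  # skip articles that have no description text
        path = out_dir / f"{label}_{i:03d}.txt"  # hypothetical naming scheme
        path.write_text(description, encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder response, not real NewsAPI output.
sample_response = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"title": "Placeholder title", "description": "Placeholder description."},
        {"title": "No description", "description": None},
    ],
}
saved = save_descriptions(sample_response, tempfile.mkdtemp(), "glyphosate")
```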

The remaining documents were then condensed into a CSV file made up of two columns:

Text description
Keyword labels (i.e. Glyphosate and Bacillus thuringiensis)
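A minimal sketch of the two-column layout described above, using Python's standard `csv` module. The column names `text` and `label` and the sample rows are assumptions made for illustration; the actual file may use different headers.

```python
import csv
import io


def write_corpus_csv(rows, handle):
    """Write (description, keyword label) pairs as a two-column CSV."""
    writer = csv.writer(handle)
    writer.writerow(["text", "label"])  # hypothetical column names
    for text, label in rows:
        writer.writerow([text, label])


# Placeholder rows standing in for the 24 retained documents.
rows = [
    ("placeholder description about glyphosate", "Glyphosate"),
    ("placeholder description about Bt crops", "Bacillus thuringiensis"),
]
buffer = io.StringIO()  # swap for open("corpus.csv", "w", newline="") on disk
write_corpus_csv(rows, buffer)
```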

During this transformation the text descriptions were cleaned to remove numeric values, punctuation, extra spaces and paragraph breaks, and meaningless jargon.
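The cleaning steps listed above could be approximated as follows. This is only a sketch: the `JARGON` set is a hypothetical example of terms to drop, and the exact rules applied to the corpus may differ.

```python
import re
import string

# Hypothetical examples of low-value tokens to drop.
JARGON = {"chars", "href", "nbsp"}

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_description(text, jargon=JARGON):
    """Strip digits, punctuation, extra whitespace and jargon tokens."""
    text = re.sub(r"\d+", " ", text)   # remove numeric values
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    tokens = text.split()               # collapses spaces and line breaks
    return " ".join(t for t in tokens if t.lower() not in jargon)
```

For example, `clean_description("Glyphosate  use rose 15% in 2019!")` returns `"Glyphosate use rose in"`.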

From here the descriptions were tokenized and transformed into a document-term matrix using CountVectorizer. The cleaned dataset contains the top 50 features, stored as a matrix of term frequencies:

Full Python Code

Reddit API

Like other social media sites, Reddit provides direct insight into the conversations, opinions and talking points surrounding trending (as well as obscure) topics. When someone seeks advice, conversation, collaboration or general information on a topic, they can likely find a subreddit devoted to it. The Reddit API offers direct access to the thoughts of the masses.

The following data was fetched using a query for the top 100 hottest posts (i.e. high-engagement posts) from four relevant subreddits:

r/GMO
r/GMOMyths
r/GMOFacts
r/GMOfree
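The query for hot posts in these subreddits could be sketched as below. The snippet assumes Reddit's OAuth listing endpoint `https://oauth.reddit.com/r/<subreddit>/hot` with a `limit` parameter, and the usual listing layout in which posts sit under `data.children[].data`. The helper names and the sample listing are hypothetical, and no request is actually sent here.

```python
from urllib.parse import urlencode

SUBREDDITS = ["GMO", "GMOMyths", "GMOFacts", "GMOfree"]


def hot_listing_url(subreddit, limit=100):
    """Build the URL for a subreddit's hot listing (hypothetical helper)."""
    return f"https://oauth.reddit.com/r/{subreddit}/hot?{urlencode({'limit': limit})}"


def posts_from_listing(listing, label):
    """Flatten a listing response into rows of {label, text}."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # Combine the title with the post body, if there is one.
        text = f"{post.get('title', '')} {post.get('selftext', '')}".strip()
        rows.append({"label": label, "text": text})
    return rows


urls = {name: hot_listing_url(name) for name in SUBREDDITS}

# Placeholder listing, not real Reddit output.
sample_listing = {
    "kind": "Listing",
    "data": {"children": [{"kind": "t3", "data": {"title": "Placeholder title", "selftext": ""}}]},
}
rows = posts_from_listing(sample_listing, "GMO")
```

Each URL would be requested with the bearer token obtained during authentication, and the subreddit name is carried along as the label.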

These subreddits serve as the labels for the finalized dataset. Each label has an assumed viewpoint (for, against, or general). Linking each post to its label should reveal the sentiment common to posts from each subreddit. According to its formal description, the general r/GMO subreddit covers all topics and advancements related to the subject matter. Its posts could therefore be passed to a trained model to identify which side each poster supports.

Before accessing Reddit data, an authentication request must be made using one's personal login and app credentials. If this request is successful, access tokens are granted in the following format:
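A minimal sketch of such an authentication request, assuming Reddit's OAuth2 password grant: the app's client ID and secret go in an HTTP Basic header, and the login is posted to `https://www.reddit.com/api/v1/access_token`. The helper names and all credential values are placeholders, and the assumption that the JSON reply exposes an `access_token` field should be checked against Reddit's documentation. The request object is built but not sent.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"


def build_token_request(client_id, client_secret, username, password, user_agent):
    """Prepare (but do not send) the token request for a script-type app."""
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urlencode(
        {"grant_type": "password", "username": username, "password": password}
    ).encode()
    headers = {"Authorization": f"Basic {credentials}", "User-Agent": user_agent}
    return Request(TOKEN_URL, data=body, headers=headers, method="POST")


def bearer_headers(token_response, user_agent):
    """Turn a parsed token response into headers for later API calls."""
    return {
        "Authorization": f"bearer {token_response['access_token']}",
        "User-Agent": user_agent,
    }


# Placeholder credentials for illustration only.
request = build_token_request("id", "secret", "user", "pass", "gmo-text-study/0.1")
```

Sending `request` with `urllib.request.urlopen` and parsing the JSON reply yields the token dictionary that `bearer_headers` expects.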

Full Reddit Python Code

The cleaned data was vectorized first using CountVectorizer:

A second document-term matrix was generated by vectorizing the text with TfidfVectorizer, in order to account for lengthier Reddit posts:
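To illustrate the difference between the two matrices, the sketch below vectorizes the same text with both scikit-learn classes. The posts are placeholders and both vectorizers use their default settings, which may not match the project's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder posts of very different lengths.
posts = [
    "gmo labeling debate",
    "long post about gmo crops and gmo safety and gmo regulation "
    "covering herbicide use and crop yields in detail",
    "bt corn reduces insecticide spraying",
]

# Raw term counts: long posts produce larger values.
count_dtm = CountVectorizer().fit_transform(posts)

# TF-IDF weights: rows are L2-normalized by default, so post length
# no longer dominates, and terms common to many posts are down-weighted.
tfidf_dtm = TfidfVectorizer().fit_transform(posts)

# Sum of squared weights per post; each is 1.0 under the default norm.
row_norms_squared = tfidf_dtm.multiply(tfidf_dtm).sum(axis=1)
```

Both matrices share the same shape and vocabulary; only the cell values differ.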

CountVectorizer DTM Data

TfidfVectorizer DTM Data