SUPPORT VECTOR MACHINES

Overview: Support Vector Machines (SVMs) are a family of supervised learning algorithms used for classification and regression, and they are particularly useful for complex datasets with high-dimensional features. The main objective of an SVM is to find a hyperplane that best separates the data into different classes. In their basic form, SVMs are linear separators: they find a linear decision boundary that divides the input space into two regions, one for each class. This decision boundary is called a hyperplane, and it is defined as the set of points where the dot product of the input vector with a weight vector equals a threshold value. The weight vector determines the orientation of the hyperplane, and the threshold value determines its location in the input space.
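As a concrete illustration of this decision rule, the sketch below classifies points by which side of the hyperplane they fall on. The weight vector, threshold, and example points are arbitrary values chosen for illustration, not values learned from data:

```python
import numpy as np

# Illustrative values, not learned from data: w sets the hyperplane's
# orientation, b sets its location (the threshold in w·x = b).
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    # A point's class is determined by the sign of w·x - b,
    # i.e. by which side of the hyperplane it lies on.
    return 1 if np.dot(w, x) - b > 0 else -1

print(classify(np.array([1.0, 0.0])))   # 2.0 - 0.5 = 1.5 > 0, so class 1
print(classify(np.array([0.0, 1.0])))   # -1.0 - 0.5 = -1.5 < 0, so class -1
```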

Kernel functions enable SVMs to operate in a higher-dimensional feature space where the data can be more easily separated. A kernel computes the dot product between two vectors in that feature space without ever computing the vectors' coordinates explicitly. The dot product is critical here: because the SVM optimization depends on the data only through dot products, the kernel lets the algorithm perform its computations in the feature space at low cost.

Common Kernel Function Examples:

  • Polynomial kernel function - Commonly used for non-linear classification problems by casting data into higher-dimensional space using a polynomial function. The degree of the polynomial determines the complexity of the decision boundary. The polynomial kernel function is defined as: K(x, y) = (x^T y + c)^d (where x and y are the input vectors, d is the degree of the polynomial, and c is a constant term).

  • Radial Basis Function (RBF) - casts input data into a higher-dimensional feature space using a Gaussian function. The RBF kernel function is defined as: K(x, y) = exp(-γ ||x-y||^2) where x and y are the input vectors, γ is a parameter that determines the width of the Gaussian function, and ||x-y|| is the Euclidean distance between the two vectors. The RBF kernel function is particularly useful when the decision boundary is non-linear and complex.
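Both kernels above can be written directly from their definitions. A minimal NumPy sketch, with example vectors and parameter values chosen arbitrarily for illustration:

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # Polynomial kernel: K(x, y) = (x^T y + c)^d
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    # RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
print(poly_kernel(x, y))   # (1 + 1)^2 = 4.0
print(rbf_kernel(x, y))    # exp(-0.5 * 1) ≈ 0.6065
```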

In the context of text data, SVMs can be used for tasks such as text classification, sentiment analysis, and spam detection. To use SVMs on text data, the text first needs to be transformed into numerical features, a process called vectorization or feature extraction. One common method is the bag-of-words (BoW) representation, which represents each document as a vector of word frequencies. Once the text data has been transformed into numerical features, an SVM can be trained on labeled data to create a classification model. During training, the SVM finds the hyperplane that best separates the data into different classes, with the objective of maximizing the margin between the hyperplane and the closest data points. Those closest points are called support vectors, and they are what give SVMs their name: the learned hyperplane is determined entirely by these points. After training, the SVM model can classify new text data into categories based on the hyperplane it has learned. SVMs have been shown to be effective in text classification tasks such as sentiment analysis and spam detection, as they can handle high-dimensional feature spaces and work well with limited training data.
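One way to wire this up with scikit-learn is a pipeline that chains bag-of-words vectorization with a linear-kernel SVM. The toy documents and labels below are invented for illustration; the actual corpus used in this project is the GMO article data described in the next section:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus invented purely for illustration.
docs = [
    "gmo crops are safe and increase yields",
    "biotech improves food security",
    "gmo foods are dangerous and untested",
    "avoid genetically modified ingredients",
]
labels = ["pro", "pro", "anti", "anti"]

# Bag-of-words vectorization followed by a linear-kernel SVM.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(docs, labels)

print(model.predict(["gmo crops increase yields"])[0])
```

On this tiny, cleanly separable corpus the query shares most of its words with a "pro" document, so the pipeline classifies it as pro.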

Data Prep: To gather data for sentiment analysis on the topic of Genetically Modified Organisms (GMOs), articles and blogs that are either for or against GMOs were scraped using web scraping tools. Web scraping, the process of extracting data from websites, is commonly used to collect data for sentiment analysis. In this case, articles and blogs related to GMOs were identified and extracted from online sources that lean either anti-GMO or pro-GMO, and were curated into a PRO and an ANTI corpus. The sources listed below were selected based on their known leanings and their popularity among readers. The collected data was then pre-processed, which involves cleaning and transforming the raw text into a numeric bag-of-words format that can be used for sentiment analysis.

By creating a "pro-GMO" corpus and an "anti-GMO" corpus, an SVM model can learn the language and subject matter characteristic of each side, and then classify documents as either pro or anti based on the language used in the text.

Sources With Generally PRO-GMO Stance:

  1. Genetic Literacy Project: This website provides articles and resources on genetics, biotechnology, and GMOs, with a focus on science-based information.

  2. GMO Answers: This is a website run by the Council for Biotechnology Information, which is a trade association representing biotech companies. The site provides information and answers to common questions about GMOs.

  3. Biofortified: This is a non-profit organization that promotes evidence-based information about genetic engineering and biotechnology in agriculture.

  4. Alliance for Science: This is a global initiative based at Cornell University that aims to promote access to scientific information about biotechnology and its potential to address global challenges.

Sources With Generally ANTI-GMO Stance:

  1. Natural News: This website promotes alternative medicine and natural health, and often publishes articles that are critical of GMOs.

  2. GMO Free USA: This organization advocates for GMO labeling and transparency, and provides information on the potential risks and negative impacts of GMOs.

  3. Food Babe: This website is run by a popular blogger and food activist who is vocal about the potential risks associated with consuming GMOs and other processed foods.

  4. Organic Consumers Association: This organization promotes organic farming and food production, and is critical of GMOs and industrial agriculture.

  5. Non-GMO Project: A non-profit organization that is generally opposed to GMOs in food and agriculture, and advocates for increased transparency in the labeling of GMOs in products.

Data Prep

Four document term matrices were generated from the article data (each with different variations of vectorizers, tokenizers and parameters). Ultimately, the data processing that yielded the best SVM results was a CountVectorizer with a WordNetLemmatizer applied.

Results

Three support vector machine models, each with a different kernel, were applied to the data to assess their predictive power.

Model 1: Linear Kernel

Applying an SVM model with a simple linear kernel resulted in perfectly accurate predictions. While perfect accuracy can often indicate an over-fit or otherwise suspect model, it is less alarming in this case given the very small test set and the simplicity of a two-class problem. A linear kernel can be a good fit here due to the large number of features (i.e., terms); conversely, more complex kernels could overfit and therefore lose generalizable predictive power.

Model 2: Radial Basis Function

The RBF kernel performed the poorest of the three models. While one inaccurate prediction is not necessarily grounds for rejecting this kernel, RBF was expected to be a viable option for the data because it provides greater flexibility when there is overlap between the classes. That is the case here, since many of the topics and subjects in the text are discussed by both the supporting and the opposing sides of the GMO debate.

Model 3: 3rd Degree Polynomial Kernel

Polynomial kernel support vector machine models of various degrees were applied and compared. The models achieved the same perfect accuracy described above for 2nd- and 3rd-degree polynomials, but performed poorly with a 4th-degree polynomial kernel.