Building an NLP Sentiment Analysis Pipeline In Python
Sentiment Analysis is the process of determining whether a piece of writing is positive, negative or neutral. A sentiment analysis system for text analysis generally combines natural language processing (NLP) and machine learning techniques to assign weighted sentiment scores to the entities, topics, themes and categories within a particular sentence or phrase.
Sentiment analysis helps data scientists within large enterprises gauge public opinion, conduct nuanced market research, monitor brand and product reputation, and understand customer experiences.
Today I will be using Python's classic machine learning library, scikit-learn (sklearn), to walk you through a basic sentiment analysis problem. Keep in mind that the approach taken here is quite modest; however, I believe it is a good starting point, and it will provide you with a decent framework for approaching your own sentiment analysis problems in the future.
We will start with three different datasets where we have a collection of user reviews:
- IMDb — movie user reviews
- Amazon — product user reviews
- Yelp — Restaurant user reviews
The datasets we are using are coded with a 0 when the review is negative and with a 1 when the review is positive. Let’s get started!
Table Of Contents
- Download the Data
- Data Cleaning
- Defining a Transformer
- Feature Engineering
- Train Test Split
- Defining Our Classifier
- Creating a Pipeline
- Evaluating Our Model
1. Download The Data
The first step is to simply load the text data using pandas. As you can see, each review is labelled with a 0, indicating a negative review, or a 1, indicating a positive review.
In the next step we rename our columns so they are easily accessible: ‘Message’ for the text data and ‘Target’ for the training labels, which is what we will be aiming to predict from the text once we have trained our machine learning model.
Notice that we only have 2,745 rows, so the dataset we are using is relatively small.
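A minimal sketch of this loading and renaming step, assuming the three tab-separated files come from the UCI "Sentiment Labelled Sentences" collection (adjust the file names to match your own copies):

```python
import pandas as pd

# Assumed file names from the UCI "Sentiment Labelled Sentences" dataset
files = ['imdb_labelled.txt', 'amazon_cells_labelled.txt', 'yelp_labelled.txt']

# Each file is tab-separated: review text in the first column, 0/1 label in the second
frames = [pd.read_csv(f, sep='\t', header=None) for f in files]
df = pd.concat(frames, ignore_index=True)

# Rename the columns so the text is 'Message' and the label is 'Target'
df.columns = ['Message', 'Target']

print(df.shape)
df.head()
```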
2. Data Cleaning
Now it is time to clean the dataset in order to prepare it for our machine learning model.
In natural language processing and text processing we need to convert our textual data from its raw form into a numerical representation so that a machine learning algorithm can make sense out of it. However, before doing this we need to remove unhelpful noise from our dataset through data cleaning.
A token is a word, punctuation symbol, whitespace, etc., and tokenization refers to the process of identifying each word, space or symbol in a document as a token.
We also have the following features that we can get per token:
- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape — capitalization, punctuation, digits.
- is alpha: Does the token consist of alphabetic characters?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?
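For instance, here is a quick, illustrative way to inspect these attributes by running a sentence through spaCy (this assumes the small English model en_core_web_sm has been installed with python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The movie was surprisingly good!")

# Print the per-token attributes listed above
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
          token.dep_, token.shape_, token.is_alpha, token.is_stop)
```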
Below we are writing our own function to tokenize and clean our dataset appropriately. Again, the reason for this is to remove unhelpful noise from our dataset.
This function will:
- Tokenize each word.
- Lemmatize each token, e.g. going → go, went → go
- Convert everything to lowercase
- Remove stop words
Stop words are extremely common words that add little to our analysis and can be removed, e.g. if, and, but, or.
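A minimal sketch of such a function, here called spacy_tokenizer (the function name and the use of spaCy's en_core_web_sm model are assumptions), could look like this:

```python
import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")
punctuations = string.punctuation

def spacy_tokenizer(sentence):
    """Tokenize, lemmatize, lowercase and remove stop words/punctuation."""
    doc = nlp(sentence)
    # Lemmatize and lowercase each token
    tokens = [token.lemma_.lower().strip() for token in doc]
    # Drop stop words and punctuation symbols
    return [tok for tok in tokens if tok not in STOP_WORDS and tok not in punctuations]
```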
3. Defining a Transformer
Next we will define our own transformer. Sklearn provides a library of transformers which may do the following to feature representations:
- Clean: Preprocessing data
- Reduce: Dimensionality reduction
- Expand: Kernel approximation
- Generate: Feature extraction
Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
We will be using the TransformerMixin class from sklearn to create our own transformer.
Our class will override the transform, fit and get_params methods with our own implementations. We will also pass in a function called clean_text() that strips whitespace and converts the text to lowercase for easier analysis.
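A sketch of what this custom transformer could look like; the class name Predictors and the exact behaviour of clean_text() are assumptions based on the description above:

```python
from sklearn.base import TransformerMixin

def clean_text(text):
    # Strip surrounding whitespace and convert to lowercase
    return text.strip().lower()

class Predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Apply clean_text to every document in X
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn here, so just return self (required for Pipeline use)
        return self

    def get_params(self, deep=True):
        return {}
```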
4. Feature Engineering
Vectorization with Bag of Words and TF-IDF
When we classify text we end up with text snippets matched with their respective labels. As mentioned previously, we need to convert our text into something that can be represented numerically.
There are different tools for this such as Bag of Words and TF-IDF.
Bag of Words
Bag of words converts text into a matrix of word occurrences within a given document. It focuses on whether or not given words occur in the document, and it generates a matrix that we might see referred to as a BoW matrix or a document-term matrix.
We can generate a BoW matrix for our text data by using sklearn's CountVectorizer. In the code below, we're telling CountVectorizer to use the custom spacy_tokenizer function we built as its tokenizer, and defining the ngram range we want.
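A possible setup might look like the following (restricting the model to unigrams with ngram_range=(1, 1) is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Use our custom spaCy-based tokenizer and single words (unigrams) only
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
```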
TF-IDF
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
It is often used as a weighting factor in information retrieval, text mining, and user modelling. The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
TF-IDF is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use TF-IDF.
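Swapping in TF-IDF is a small change with sklearn's TfidfVectorizer; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# With sklearn's defaults (smooth_idf=True), each term weight is
# tf(t, d) * (ln((1 + n) / (1 + df(t))) + 1), and each row is L2-normalised.
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
```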
5. Train-Test Split
In machine learning, we always need to split our dataset into train and test sets. We will use one to train the model and the other to check how the model performs. Luckily, sklearn comes with a built-in function for this.
The split is done randomly, but we can set a seed value to make it reproducible during development. The usual split is 80% train and 20% test.
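A sketch of the split, assuming the DataFrame and the 'Message'/'Target' column names from step 1 and an arbitrary seed of 42:

```python
from sklearn.model_selection import train_test_split

X = df['Message']  # the review text
y = df['Target']   # 0 = negative, 1 = positive

# Hold out 20% of the data for testing; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```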
6. Defining Our Classifier
When choosing a classifier we are choosing the strategy for how our model will learn from the data. Since we are doing classification (positive or negative), we need to choose an algorithm that is a classifier.
For this purpose we will use a simple MLP classifier.
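A minimal sketch of the classifier definition; the hidden layer size and iteration limit are untuned assumptions:

```python
from sklearn.neural_network import MLPClassifier

# A small multilayer perceptron with one hidden layer of 50 units
classifier = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=42)
```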
7. Creating a Pipeline
We are going to create a pipeline that:
- Cleans and preprocesses the text using our predictors class from above.
- Vectorizes the words with BoW or TF-IDF to create word matrices from our text.
- Loads the classifier, which performs the algorithm we have chosen to classify the sentiments.
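Putting those three steps together might look like the following (the step names are illustrative):

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('cleaner', Predictors()),    # custom cleaning transformer from step 3
    ('vectorizer', bow_vector),   # or tfidf_vector from step 4
    ('classifier', classifier)    # the MLP defined in step 6
])

# Fit the whole pipeline on the training data
pipe.fit(X_train, y_train)
```

Because everything is wrapped in one Pipeline, the cleaning and vectorization fitted on the training data are applied automatically whenever we call predict on new text.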
8. Model Evaluation
Now that we have trained our model, let's look at how it performs! First of all we need to predict the results on the test set using our model:
Let’s check the results for each sample:
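A sketch of both steps, predicting on the held-out test set and printing a few sample predictions:

```python
# Predict labels for the test set
predicted = pipe.predict(X_test)

# Inspect the first few predictions alongside the original reviews
for review, label in list(zip(X_test, predicted))[:5]:
    sentiment = "positive" if label == 1 else "negative"
    print(f"{sentiment}: {review}")
```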
Now we can evaluate the model by looking at three main performance metrics:
- Accuracy: Refers to the percentage of the total predictions our model makes that are correct.
- Precision: Describes the ratio of true positives to true positives plus false positives in our predictions.
- Recall: Describes the ratio of true positives to true positives plus false negatives in our predictions.
Finally, we can compute these scores to see how well our model has performed.
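A sketch using sklearn's metrics module (assuming the predicted array from the previous step):

```python
from sklearn import metrics

print("Accuracy: ", metrics.accuracy_score(y_test, predicted))
print("Precision:", metrics.precision_score(y_test, predicted))
print("Recall:   ", metrics.recall_score(y_test, predicted))
```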