Comment Classification

Multi-label comment classification

To leverage the power of the platform, you can build a general application for a machine learning scenario and run it on different datasets, while also reusing nodes built in previous apps or shared by the community. As an example, we built a Baseet application for multi-label comment classification that takes a dataset of comments with their associated labels.

Comment classification is a form of text classification, one of the major tasks in Natural Language Processing (NLP), with broad applications such as sentiment analysis and topic labeling. The field is developing rapidly at every stage, from feature engineering to the classification model itself. Still, any such app has the same main parts: read the data, preprocess it, encode it, build and train a model, and measure the model's performance.

Hence, we built two Baseet applications for the multi-label comment classification task: a training app and a prediction app.

The following screenshot shows the training app on Baseet, followed by the details of its nodes:

Import dataset Node: This is a built-in Baseet node that imports a dataset into the app, either from your own datasets stored in the datasets dashboard, from datasets shared by the community, or by uploading a new dataset.

Read tabular data Node: Reads any tabular data file based on its column separator (delimiter) and outputs it as a pandas DataFrame so that it is easy to work with.
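As a rough illustration of what such a node might do internally, the sketch below reads a delimited file into a pandas DataFrame. The function name `read_tabular`, the column names, and the sample rows are assumptions made for this example and are not taken from the Baseet node itself.

```python
import io

import pandas as pd


def read_tabular(source, delimiter=","):
    """Read a delimited text file (or file-like object) into a DataFrame."""
    return pd.read_csv(source, sep=delimiter)


# Tiny in-memory sample standing in for an uploaded dataset file.
# Column names and rows are hypothetical.
sample = io.StringIO(
    "id\tcomment_text\ttoxic\tinsult\n"
    "1\tThanks for the helpful answer\t0\t0\n"
    "2\tYou are an idiot\t1\t1\n"
)

df = read_tabular(sample, delimiter="\t")
print(df.head())
```

Passing the delimiter explicitly lets the same function handle comma-separated, tab-separated, or other delimited files.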

Comments preprocessing Node: Offers the main text preprocessing steps, such as removing stop words, removing punctuation, and lemmatizing words.
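The preprocessing steps could look roughly like the following sketch. It uses only the Python standard library, a deliberately tiny stop-word list, and an optional `lemmatize` callable; all of these are assumptions for illustration. A real node would likely rely on a fuller stop-word list and a proper lemmatizer (for example, the one shipped with NLTK).

```python
import string

# Deliberately small, illustrative stop-word list (an assumption, not Baseet's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "it", "this", "that"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_comment(text, lemmatize=None):
    """Lowercase, strip punctuation, drop stop words, and optionally lemmatize."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if lemmatize is not None:
        # `lemmatize` is any callable mapping a token to its base form.
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


cleaned = preprocess_comment("This is the WORST answer, ever!!!")
print(cleaned)
```

Keeping the lemmatizer as a pluggable callable means the same function can run with or without an external NLP library.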

Comments encoding Node: Converts the text to numeric vectors. We offer two options: the CountVectorizer method and the TfidfVectorizer (TF-IDF) method.

Test Comments encoding Node: Encodes the test data text using the transformer fitted while encoding the training data text.
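The relationship between the two encoding nodes can be sketched with scikit-learn, whose `CountVectorizer` and `TfidfVectorizer` classes match the option names above. Whether Baseet uses scikit-learn internally is an assumption here, and the helper `make_vectorizer` and the sample texts are hypothetical. The key point is that the vectorizer is fitted on the training text only and then reused, unchanged, on the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def make_vectorizer(method="tfidf"):
    """Return an unfitted vectorizer for the chosen encoding method."""
    if method == "count":
        return CountVectorizer()
    if method == "tfidf":
        return TfidfVectorizer()
    raise ValueError(f"Unknown encoding method: {method!r}")


# Hypothetical, already-preprocessed comments.
train_texts = ["great helpful answer", "stupid useless answer", "helpful and kind"]
test_texts = ["useless answer", "completely unseen words"]

vectorizer = make_vectorizer("tfidf")
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary from training text
X_test = vectorizer.transform(test_texts)        # reuses that vocabulary, no refitting

print(X_train.shape, X_test.shape)
```

Because the test data is only transformed, both matrices share the same columns, and words that never appeared in training are simply ignored.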

Multi-label comments model-1 Node: As stated previously, there is a large variety of approaches that can be used to classify text. In this app, we used a model based on Naïve Bayes and Logistic Regression to classify comments.
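The text does not spell out how Naïve Bayes and Logistic Regression are combined. One common way, shown below purely as an assumption, is to scale each feature by a Naïve Bayes log-count ratio and then fit one logistic regression per label. The function names, the hyperparameters (`alpha`, `C`), and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def nb_log_ratio(X, y, alpha=1.0):
    """Smoothed Naive Bayes log-count ratio of each feature for one binary label."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    p = alpha + np.asarray(X[pos].sum(axis=0)).ravel()
    q = alpha + np.asarray(X[neg].sum(axis=0)).ravel()
    return np.log((p / p.sum()) / (q / q.sum()))


def fit_nb_lr(X, Y, C=4.0):
    """Fit one NB-scaled logistic regression per label column of Y."""
    models = []
    for j in range(Y.shape[1]):
        r = nb_log_ratio(X, Y[:, j])
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X.multiply(r.reshape(1, -1)).tocsr(), Y[:, j])
        models.append((r, clf))
    return models


def predict_proba_nb_lr(models, X):
    """Return an (n_samples, n_labels) array of per-label probabilities."""
    cols = [clf.predict_proba(X.multiply(r.reshape(1, -1)).tocsr())[:, 1] for r, clf in models]
    return np.column_stack(cols)


# Toy data: two hypothetical labels, "toxic" and "insult".
texts = [
    "you are stupid and useless",
    "thanks for the helpful answer",
    "this is garbage nonsense",
    "great explanation very clear",
    "stupid idiot go away",
    "I appreciate the kind reply",
]
Y = np.array([[1, 1], [0, 0], [1, 0], [0, 0], [1, 1], [0, 0]])

X = TfidfVectorizer().fit_transform(texts)
models = fit_nb_lr(X, Y)
probs = predict_proba_nb_lr(models, X)
print(probs.round(2))
```

Training one binary classifier per label is what makes the setup multi-label: a comment can receive several labels at once, or none.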

Measure model performance Node: Evaluates the trained model on the test data by computing the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) for each label and averaging the results.
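The metric described above can be sketched with scikit-learn's `roc_auc_score`, computed once per label and then averaged. The helper name `mean_roc_auc`, the label names, and the toy scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_roc_auc(Y_true, Y_score, label_names):
    """Compute ROC-AUC for each label column, plus the unweighted mean."""
    per_label = {
        name: float(roc_auc_score(Y_true[:, j], Y_score[:, j]))
        for j, name in enumerate(label_names)
    }
    return per_label, float(np.mean(list(per_label.values())))


# Toy ground truth and predicted probabilities for two hypothetical labels.
Y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
Y_score = np.array([[0.9, 0.2], [0.1, 0.4], [0.8, 0.3], [0.3, 0.7]])

per_label, mean_auc = mean_roc_auc(Y_true, Y_score, ["toxic", "insult"])
print(per_label, mean_auc)
```

Each label needs both positive and negative examples in the test data, otherwise ROC-AUC is undefined for that label. The unweighted mean computed here corresponds to what scikit-learn calls macro averaging.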

Save file Node: This node saves the vectorizer (transformer or vocabulary file) to the files dashboard so it can be used later in prediction apps.

Save Model Node: Saves the weights of the model as a pickle file to be used later in prediction apps.
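Saving and later restoring these artifacts can be sketched with Python's standard `pickle` module. The file names, the temporary directory, and the stand-in objects below are assumptions; in the app, the saved objects would be the fitted vectorizer and the trained model, and the destination would be the Baseet dashboards.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize any picklable object to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load an object previously saved with save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-ins for the fitted vectorizer vocabulary and the model weights.
vocabulary = {"answer": 0, "helpful": 1, "useless": 2}
weights = {"toxic": [0.1, -0.8, 1.2], "insult": [0.0, -0.5, 0.9]}

out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vectorizer.pkl")  # hypothetical file name
model_path = os.path.join(out_dir, "model.pkl")     # hypothetical file name

save_pickle(vocabulary, vec_path)
save_pickle(weights, model_path)

restored_vocab = load_pickle(vec_path)
restored_weights = load_pickle(model_path)
print(restored_vocab, restored_weights)
```

Pickle files can execute arbitrary code when loaded, so only load files that come from a trusted source.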

Some nodes are repeated in this app because we assume the training data and test data are in separate files, so each is imported, read, preprocessed, and encoded separately. Take care to apply the same configuration to both datasets.

To use the trained model to classify new comments, we build a prediction app based on the model generated by the training app. The following screenshot shows the prediction app on Baseet, followed by the details of its nodes:

Import file Node: Imports the transformer file generated from encoding the training data.

Import model Node: Imports the model weights file generated from the training data.

Classify comment Node: Classifies the encoded comment into the data labels.
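The prediction flow can be sketched end to end as follows: load the saved transformer and model, encode a new comment with the training-time vocabulary, and report which labels apply. This is a minimal sketch under several assumptions: the model here is a plain one-vs-rest logistic regression standing in for the trained model, in-memory bytes stand in for the imported files, and the label names, threshold, and `classify_comment` helper are hypothetical.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["toxic", "insult"]  # hypothetical label names


def classify_comment(comment, vectorizer, model, labels, threshold=0.5):
    """Return per-label scores and the labels whose score reaches the threshold."""
    X = vectorizer.transform([comment])  # encode with the training-time vocabulary
    probs = model.predict_proba(X)[0]    # one probability per label
    scores = {label: float(p) for label, p in zip(labels, probs)}
    predicted = [label for label, p in scores.items() if p >= threshold]
    return scores, predicted


# Stand-in for the training app: fit a tiny vectorizer and model.
train_texts = [
    "you are stupid and useless",
    "thanks for the helpful answer",
    "this is garbage nonsense",
    "great explanation very clear",
    "stupid idiot go away",
    "I appreciate the kind reply",
]
train_labels = np.array([[1, 1], [0, 0], [1, 0], [0, 0], [1, 1], [0, 0]])
fitted_vectorizer = TfidfVectorizer()
fitted_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
fitted_model.fit(fitted_vectorizer.fit_transform(train_texts), train_labels)

# Stand-in for the saved files: serialize to bytes, then load them back.
loaded_vectorizer = pickle.loads(pickle.dumps(fitted_vectorizer))
loaded_model = pickle.loads(pickle.dumps(fitted_model))

scores, predicted = classify_comment("stupid idiot", loaded_vectorizer, loaded_model, LABELS)
print(scores, predicted)
```

The threshold controls how readily a label is assigned; lowering it yields more labels per comment at the cost of more false positives.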