Stackoverflow Tags Prediction

Ravi Kumar Banda
5 min read · Feb 12, 2020

1. Real world Use-Case:

We are given question information, such as the Title and Body, and our job is to predict the suitable tags. Getting the tags right matters: a correctly tagged question reaches the right developer community and gets answered faster.

2. Mapping to Machine Learning Techniques:

In this use-case, we are trying to predict the tags for a particular question. We have two input features ("Title" and "Body", i.e. x1 and x2) and one or more target labels (y1, y2, y3, …, yn). It's a classification problem: if the target variable Y has 2 categories it is binary classification, and if it has more than 2 categories it is multi-class classification. Our use-case is neither, because a single question can carry several tags at once (for example, a question about parsing JSON in Python can be tagged with both "python" and "json"); this is called a Multi-label Classification problem. You can read more about multi-class and multi-label algorithms here.

3. Data Collection:

Go to the link and you will see the Stack Exchange Data Explorer for the Stack Overflow site, where you can query the data you want. In the right-side pane you can see the database schema (the "Posts" table and its attributes).

You can write your own query to fetch the result. I've used the query below to fetch 5000 random records from the table.

SELECT TOP 5000 Id, Title, Body, Tags
FROM Posts
WHERE Title IS NOT NULL AND Body IS NOT NULL
ORDER BY NEWID() -- NEWID() gives a per-row random value on the Data Explorer's SQL Server backend; RAND() is evaluated only once per query

Once the query runs, you can download the result as a .csv file.

4. Exploratory Data Analysis:

(i) Loading the Data: Read the data using the Pandas library.
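Something like the sketch below works; the file name QueryResults.csv is just what my Data Explorer download happened to be called, so adjust it to your own file.

import pandas as pd

# Load the CSV downloaded from the Data Explorer.
data = pd.read_csv("QueryResults.csv")

print(data.shape)   # expected (5000, 4): Id, Title, Body, Tags
print(data.head())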

(ii). Checking for Null, Duplicates:

We have already included a null check for Title and Body in the SELECT query itself, Id can't be null because it is the unique key, and Tags should not be null since every question carries at least one tag. Still, to be on the safe side, check whether there are any duplicates or null values.
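A quick sanity check might look like this (a minimal sketch, assuming the DataFrame from the previous step is called data):

# Count missing values per column and duplicate question Ids.
print(data.isnull().sum())
print(data.duplicated(subset="Id").sum())

# Drop anything that slipped through, just to be safe.
data = data.drop_duplicates(subset="Id")
data = data.dropna(subset=["Title", "Body", "Tags"])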

(iii) Text Preprocessing: We have HTML tags and special characters in Title, Body, and Tags. We need to be careful here: there are tags like ".net" and "C#" whose special characters carry meaning, so Tags needs its own, separate cleaning step.
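Here is a rough sketch of that cleaning. I'm assuming the Tags column comes in the Posts table's "<tag1><tag2>" format, and I'm using BeautifulSoup to strip HTML from Title and Body; the exact regex is my own choice, kept loose so that tags like "c#", "c++", and ".net" survive.

import re
from bs4 import BeautifulSoup

def clean_text(text):
    # Strip HTML, lowercase, and drop special characters (keeping +, # and .).
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    text = text.lower()
    text = re.sub(r"[^a-z0-9+#.\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_tags(tags):
    # '<javascript><jquery>' -> 'javascript jquery'
    return " ".join(re.findall(r"<(.*?)>", tags))

data["Title"] = data["Title"].apply(clean_text)
data["Body"] = data["Body"].apply(clean_text)
data["Tags"] = data["Tags"].apply(clean_tags)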

(Don't be confused by the number 4999; in my initial run I had taken only 4999 records, but in your case the shape should be (5000, number of tags).) We can convert this matrix into a Pandas DataFrame to see the Bag of Words (BoW) table, as in the sketch below.
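One way to build that table (a sketch, assuming the cleaned Tags column now holds space-separated tag names; get_feature_names_out needs a recent scikit-learn, older versions call it get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# One column per unique tag, one row per question.
tag_vectorizer = CountVectorizer(tokenizer=lambda s: s.split(), token_pattern=None)
tag_bow = tag_vectorizer.fit_transform(data["Tags"])
print(tag_bow.shape)   # (5000, number of unique tags) in your run

# Small enough data to view as a dense DataFrame.
tag_bow_df = pd.DataFrame(tag_bow.toarray(),
                          columns=tag_vectorizer.get_feature_names_out())
print(tag_bow_df.head())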

Let's see how many times each tag appears in the data.

Let's look at the top 20 tags in our data.
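A short pandas sketch covers both, again assuming Tags is a space-separated string per question:

# Flatten every question's tag string into one long Series of single tags.
all_tags = data["Tags"].str.split().explode()
tag_counts = all_tags.value_counts()

print(tag_counts)            # how many times each tag appears
print(tag_counts.head(20))   # the 20 most frequent tags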

Since our problem is multi-label classification, X will be an N×M matrix and Y an N×T matrix, where N = number of records, M = number of features (unique words), and T = number of tags. Let's create the Y matrix:
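The conclusion below mentions CountVectorizer() for the Y matrix, so a binary version of the same vectorizer is a reasonable sketch here (binary=True turns counts into 0/1 indicators; the variable names are mine):

from sklearn.feature_extraction.text import CountVectorizer

# Y[i, j] = 1 if question i carries tag j, else 0.
tag_binarizer = CountVectorizer(tokenizer=lambda s: s.split(),
                                token_pattern=None, binary=True)
multilabel_y = tag_binarizer.fit_transform(data["Tags"])
print(multilabel_y.shape)   # (N, T)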

Now our Y matrix (multilabel_y) is ready. Let's split the dataset into training (80% of the data) and testing (20% of the data) sets.
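For example, with scikit-learn's train_test_split (random_state=42 is just my choice for reproducibility):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data[["Title", "Body"]], multilabel_y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)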

It's time to create the X matrix. In x_train and x_test we still have "Title" and "Body" as separate features, so let's combine the two and build X from the merged text. I used the TF-IDF technique to create the X matrix (alternatively we can use CountVectorizer(), HashingVectorizer, etc.); we can try different techniques later to improve the accuracy of our model.
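A sketch of that step; the max_features and ngram_range values are my assumptions, not the article's exact settings:

from sklearn.feature_extraction.text import TfidfVectorizer

# Combine Title and Body into one text field per question.
train_text = x_train["Title"] + " " + x_train["Body"]
test_text = x_test["Title"] + " " + x_test["Body"]

# Fit TF-IDF on the training text only, then transform both splits.
tfidf = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
x_train_tfidf = tfidf.fit_transform(train_text)
x_test_tfidf = tfidf.transform(test_text)
print(x_train_tfidf.shape, x_test_tfidf.shape)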

5. Model Building

Now we have the data in the required format, so it's time to jump into model creation. I've chosen OneVsRestClassifier wrapped around SGDClassifier for this multi-label classification problem.
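A minimal sketch of that combination; the SGDClassifier hyperparameters (hinge loss, alpha, l2 penalty) are my assumptions, not necessarily the original run's settings:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

# One binary, SGD-trained linear classifier per tag.
clf = OneVsRestClassifier(
    SGDClassifier(loss="hinge", alpha=1e-5, penalty="l2", random_state=42),
    n_jobs=-1,
)
clf.fit(x_train_tfidf, y_train)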

6. Model Validation:

For a binary classification problem, general validation metrics like ROC-AUC and accuracy work well. But for our multi-label classification problem we need to look at precision, recall, and the micro and macro F1 scores.
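Computing them might look like this:

from sklearn.metrics import f1_score, precision_score, recall_score

y_pred = clf.predict(x_test_tfidf)

# Micro averaging weights every (question, tag) decision equally;
# macro averaging treats every tag equally, however rare it is.
print("precision (micro):", precision_score(y_test, y_pred, average="micro"))
print("recall (micro):", recall_score(y_test, y_pred, average="micro"))
print("f1 (micro):", f1_score(y_test, y_pred, average="micro"))
print("f1 (macro):", f1_score(y_test, y_pred, average="macro"))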

7. Saving the TF-IDF Vectorizer, Model, and List of Tags that our model has been trained on.
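For example, with joblib (the file names here are arbitrary placeholders):

import joblib

joblib.dump(tfidf, "tfidf_vectorizer.pkl")    # fitted TF-IDF vectorizer
joblib.dump(clf, "onevsrest_sgd_model.pkl")   # trained model
joblib.dump(list(tag_binarizer.get_feature_names_out()),
            "tag_names.pkl")                  # tag vocabulary, in column order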

Let's test our model by giving it some random input, say "Facing Problem with javascript".

We need to convert the above pred_result to a list and then pass it to a function that maps the prediction back to the tag labels.
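Put together, a small helper along these lines does the mapping (predict_tags is a hypothetical name; the prediction comes back as a 1 x T indicator row, sparse or dense depending on the scikit-learn version, hence the issparse check):

import numpy as np
from scipy import sparse

def predict_tags(question_text):
    # Vectorize the raw question, predict, and map the 0/1 row back to tag names.
    features = tfidf.transform([question_text])
    pred_result = clf.predict(features)
    if sparse.issparse(pred_result):
        pred_result = pred_result.toarray()
    flags = np.ravel(pred_result).tolist()
    tag_names = tag_binarizer.get_feature_names_out()
    return [tag for tag, flag in zip(tag_names, flags) if flag == 1]

print(predict_tags("Facing Problem with javascript"))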

Conclusion:

We took live, real-world data from the Stack Overflow website, did text preprocessing, used CountVectorizer() to create the Y matrix and the TF-IDF technique for the X matrix, and trained a OneVsRestClassifier with SGDClassifier. The model achieved a micro F1 score of 0.25.

Further Tasks: We can do a lot more to improve model performance, such as trying different X matrix creation techniques (TF-IDF with different n-gram ranges, CountVectorizer(), HashingVectorizer) and different models (OneVsRestClassifier with Logistic Regression, SVM, Naive Bayes, etc.).

Full Source Code is available here

Please give a thumbs up if you like the article, and share your thoughts in the comment section.
