A Data Scientist's Guide to Semi-Supervised Learning

Semi-supervised learning is a type of machine learning that data science and machine learning practitioners rarely discuss, yet it has an important role to play.

What is Semi-Supervised Learning?

Put simply, semi-supervised learning is a technique that uses both labeled and unlabeled data to train a predictive model. Typically, you have a small amount of labeled data and a large amount of unlabeled data.

"Dealing with the situation where relatively few labeled training points are available, but a large number of unlabeled points are given, it is directly relevant to a multitude of practical problems where it is relatively expensive to produce labeled data"-Page xiii, Semi-Supervised Learning, 2006.

Semi-supervised learning combines elements of supervised learning and unsupervised learning when training a model.

Why is Semi-Supervised Learning Important?

Working on a machine learning project, especially a supervised learning problem, usually requires a lot of labeled data so that your model can learn and perform well. But sometimes it is impossible to get more labeled data than you already have.

The most common option is to manually label all the unlabeled data and combine it with the existing labeled data. But this option is very expensive: you will need to hire people to label the data, and the labeling will take a lot of time before you can continue with your data science project.

With semi-supervised learning, you can skip most of the labeling task and rely on the machine learning algorithm to learn from both labeled and unlabeled data. By making use of the unlabeled data, a semi-supervised learning algorithm can often help you achieve better performance than a supervised learning algorithm that can learn only from the labeled data.

In this article, we will look at two semi-supervised algorithms that you can use in your data science project.

1. Self-Training Meta-Estimator

Version 0.24 of scikit-learn introduced a self-training implementation for semi-supervised learning called SelfTrainingClassifier. SelfTrainingClassifier can be used with any supervised classifier that can return probability estimates for each class.

This means any such supervised classifier can function as a semi-supervised classifier, allowing it to learn from unlabeled observations in the dataset.

Note: The unlabeled values in the target column must have a value of -1.
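To make the idea of self-training more concrete, here is a minimal hand-written sketch of a pseudo-labeling loop. It is not scikit-learn's implementation: the self_train helper, the LogisticRegression base classifier, and the threshold and max_iter values are all assumptions chosen for illustration. The loop fits the classifier on the labeled points, predicts probabilities for the unlabeled ones, and adopts the predictions it is confident about as new labels.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def self_train(clf, X, y, threshold=0.75, max_iter=10):
    """Illustrative pseudo-labeling loop (hypothetical helper, not scikit-learn code)."""
    y = y.copy()
    for _ in range(max_iter):
        labeled = y != -1
        if labeled.all():
            break  # nothing left to pseudo-label
        clf.fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[~labeled])
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no prediction is confident enough to adopt
        unlabeled_idx = np.flatnonzero(~labeled)
        # Adopt the most probable class as the label of each confident point.
        y[unlabeled_idx[confident]] = clf.classes_[proba.argmax(axis=1)[confident]]
    # Final fit on every point that now carries a label.
    labeled = y != -1
    clf.fit(X[labeled], y[labeled])
    return clf, y


X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y = y_true.copy()
y[rng.rand(len(y)) < 0.3] = -1  # hide roughly 30% of the labels

clf, y_filled = self_train(LogisticRegression(max_iter=1000), X, y)
print("unlabeled before:", (y == -1).sum(), "after:", (y_filled == -1).sum())
```

The original labels are never overwritten; only points marked with -1 can receive a pseudo-label.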

Let’s understand more about how it works in the following example.

Import the required packages.

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

In this example, we will use the iris dataset and the support vector machine algorithm as the supervised classifier (SVC implements fit and, when created with probability=True, predict_proba).

Then we load the dataset and randomly select some of the observations to be unlabeled.

rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1

The unlabeled observations now have a value of -1 in the target column.
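If you want to check the result of the masking step yourself, a quick count of the target values helps. The snippet below is a small sketch that repeats the masking on a copy of the labels (the mask and target names are introduced here for illustration) and prints how many observations fall into each class, including the -1 "unlabeled" marker.

```python
import numpy as np
from sklearn import datasets

rng = np.random.RandomState(42)
iris = datasets.load_iris()
mask = rng.rand(iris.target.shape[0]) < 0.3  # True for points we will hide
target = iris.target.copy()                  # copy so the true labels stay intact
target[mask] = -1

# Count how many observations carry each target value (-1 means unlabeled).
values, counts = np.unique(target, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
print("share unlabeled:", mask.mean())
```

Because each point is hidden with probability 0.3, the share of unlabeled points is close to, but not exactly, 30%.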

Create an instance of the supervised estimator.

svc = SVC(probability=True, gamma="auto")

Create an instance of the self-training meta-estimator and pass svc as its base_estimator (scikit-learn releases from 1.6 onward name this parameter estimator instead).

self_training_model = SelfTrainingClassifier(base_estimator=svc)

Finally, we can train self_training_model on the iris dataset that has some unlabeled observations.

self_training_model.fit(iris.data, iris.target)

SelfTrainingClassifier(base_estimator=SVC(gamma='auto', probability=True))
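Once the model is fitted, you may want to see what it did with the unlabeled points. The sketch below refits the same kind of model on a copy of the labels (so the true labels remain available for comparison) and inspects the fitted attributes transduction_, labeled_iter_ and termination_condition_. The variable names hidden, y and y_true are assumptions made for this example, and the base classifier is passed positionally so the snippet does not depend on the keyword name.

```python
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y_true = datasets.load_iris(return_X_y=True)
rng = np.random.RandomState(42)
hidden = rng.rand(y_true.shape[0]) < 0.3
y = y_true.copy()
y[hidden] = -1  # mark the hidden points as unlabeled

model = SelfTrainingClassifier(SVC(probability=True, gamma="auto"))
model.fit(X, y)

# transduction_: label used for each training point (-1 if it was never labeled).
still_unlabeled = int((model.transduction_ == -1).sum())
# labeled_iter_: 0 = labeled from the start, k > 0 = pseudo-labeled in iteration k,
# -1 = never labeled.
pseudo_labeled = int((model.labeled_iter_ > 0).sum())
print("pseudo-labeled:", pseudo_labeled, "still unlabeled:", still_unlabeled)
print("stopped because:", model.termination_condition_)

# Compare predictions on the hidden points with the labels we held back.
hidden_accuracy = float((model.predict(X[hidden]) == y_true[hidden]).mean())
print("accuracy on hidden labels:", round(hidden_accuracy, 3))
```

Keeping a copy of the true labels is only possible in a demonstration like this one; in a real project the hidden labels would not exist.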

2. Label Propagation Algorithm

Label propagation is a semi-supervised learning algorithm that assigns labels to unlabeled data points by propagating labels through the dataset.

The algorithm works by building a graph that connects the data points in the dataset based on the distance between them. Each node in the graph holds a label distribution that is influenced by the nearby data points it is connected to.

The propagation process is repeated to strengthen the labels assigned to unlabeled data points in the dataset.
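The description above can be sketched in a few lines of NumPy. This is a toy version under stated assumptions, not scikit-learn's implementation: the propagate_labels helper, the fully connected RBF graph, and the gamma and n_iter values are all chosen for illustration. Each step spreads the label distributions along the graph edges and then resets the labeled points to their known labels.

```python
import numpy as np
from sklearn.datasets import load_iris


def propagate_labels(X, y, gamma=20.0, n_iter=200):
    """Toy label propagation on a fully connected RBF graph (hypothetical helper)."""
    labeled = y != -1
    classes = np.unique(y[labeled])

    # Edge weights: the closer two points are, the larger the weight.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dists)
    T = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix

    # One-hot label distributions for labeled points, zeros for unlabeled ones.
    F = np.zeros((X.shape[0], classes.size))
    F[labeled] = (y[labeled][:, None] == classes[None, :]).astype(float)
    known = F[labeled].copy()

    for _ in range(n_iter):
        F = T @ F           # each node takes the weighted average of its neighbors
        F[labeled] = known  # clamp the points whose labels we actually know
    return classes[F.argmax(axis=1)]


X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
y = y_true.copy()
y[rng.rand(len(y)) < 0.3] = -1  # hide roughly 30% of the labels

pred = propagate_labels(X, y)
print("accuracy on hidden labels:", (pred[y == -1] == y_true[y == -1]).mean())
```

A fixed number of iterations keeps the sketch short; a fuller version would stop once the label distributions no longer change.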

To learn more about the algorithm, I recommend reading the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani, Learning From Labeled and Unlabeled Data with Label Propagation.

Note: The algorithm is typically used for classification problems.

The implementation of the algorithm is very similar to the previous one. Make sure all the targets of unlabeled data points have a value of -1. Then you can import and train the algorithm on the dataset.

from sklearn.semi_supervised import LabelPropagation

Then we load the dataset and randomly select some of the observations to be unlabeled.

rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1

Then create an instance of the algorithm.

label_prop_model = LabelPropagation()

Finally, we can train label_prop_model on the iris dataset that has some unlabeled observations.

label_prop_model.fit(iris.data, iris.target)
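After fitting, the labels that the algorithm inferred are available on the model. The following sketch repeats the steps above on a copy of the labels (the hidden, y and y_true names are assumptions for this example) and reads the fitted attributes transduction_ and label_distributions_, then compares the inferred labels with the ones we held back.

```python
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation

X, y_true = datasets.load_iris(return_X_y=True)
rng = np.random.RandomState(42)
hidden = rng.rand(y_true.shape[0]) < 0.3
y = y_true.copy()
y[hidden] = -1  # mark the hidden points as unlabeled

model = LabelPropagation()
model.fit(X, y)

inferred = model.transduction_               # label assigned to every training point
distributions = model.label_distributions_   # one row of class weights per point

# Compare the inferred labels of the hidden points with the held-back truth.
hidden_accuracy = float((inferred[hidden] == y_true[hidden]).mean())
print("accuracy on hidden labels:", round(hidden_accuracy, 3))
print("label distribution of the first hidden point:", distributions[hidden][0])
```

For data that was not part of training, model.predict can be used in the same way as with any other scikit-learn classifier.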

Final Thoughts on Semi-Supervised Learning

In this article, you have learned what semi-supervised learning is, why it is important, and two semi-supervised algorithms you can try in your data science project.

If you want to learn more about semi-supervised learning, I recommend you read the following books.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

