Introduction to Machine Learning Algorithms: Logistic Regression

Written by grohith327 | Published 2018/05/28

Build your Logistic Regression model from scratch

Logistic regression is probably the most famous machine learning algorithm after linear regression. In a lot of ways, linear regression and logistic regression are similar, but the biggest difference lies in what they are used for: linear regression algorithms are used to predict/forecast continuous values, while logistic regression is used for classification tasks. If you are shaky on the concepts of linear regression, check this out. There are many classification tasks that people do routinely, for example, classifying whether an email is spam or not, whether a tumour is malignant or benign, or whether a website is fraudulent or not. These are typical examples where machine learning algorithms can make our lives a lot easier. A really simple, rudimentary and useful algorithm for classification is the logistic regression algorithm. Now, let's take a deeper look at logistic regression.

Sigmoid Function (Logistic Function)

The logistic regression algorithm also uses a linear equation with independent predictors to predict a value. The predicted value can be anywhere between negative infinity and positive infinity, but we need the output of the algorithm to be a class variable, i.e. 0 (no) or 1 (yes). Therefore, we squash the output of the linear equation into the range [0, 1]. To squash the predicted value between 0 and 1, we use the sigmoid function.

Linear Equation and Sigmoid Function

$$z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n, \qquad g(z) = \frac{1}{1 + e^{-z}}$$

Squashed output h

$$h = g(z)$$

We take the output z of the linear equation and feed it to the function g(z), which returns a squashed value h; the value of h will lie in the range 0 to 1. To understand how the sigmoid function squashes values within this range, let's visualize the graph of the sigmoid function.

Sigmoid Function graph

As you can see from the graph, the sigmoid function asymptotically approaches y = 1 for large positive values of x and asymptotically approaches y = 0 for large negative values of x.
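To see the squashing numerically, here is a minimal sketch (using numpy, as in the code later on) that evaluates the sigmoid at a few points:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# [4.53978687e-05 2.68941421e-01 5.00000000e-01 7.31058579e-01 9.99954602e-01]
```

Large negative inputs map to values near 0, large positive inputs to values near 1, and 0 maps exactly to 0.5.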

Cost Function

Since we are trying to predict class values, we cannot use the same cost function used in the linear regression algorithm. Therefore, we use a logarithmic loss function to calculate the cost of misclassification.
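In its standard piecewise form, the cost for a single training example with prediction h and label y is:

$$\mathrm{cost}(h, y) = \begin{cases} -\log(h) & \text{if } y = 1 \\ -\log(1 - h) & \text{if } y = 0 \end{cases}$$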

Since calculating gradients from the piecewise equation above is difficult, the cost function can be rewritten as a single expression, as below.
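For $m$ training examples with predictions $h^{(i)}$ and labels $y^{(i)}$, the combined form is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h^{(i)}\right) \right]$$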

Calculating Gradients

We take partial derivatives of the cost function with respect to each parameter (theta_0, theta_1, …) to obtain the gradients. With the help of these gradients, we can update the values of theta_0, theta_1, …. To understand the equations below, you will need some calculus.

Gradients

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_j^{(i)} \quad \left(\text{with } x_0^{(i)} = 1 \text{ for the intercept}\right)$$

Each parameter is then updated as $\theta_j := \theta_j - \alpha \cdot \partial J(\theta) / \partial \theta_j$, where $\alpha$ is the learning rate.

But if you are not able to follow them, you can ask me, or you can just take them as they are.

Code

Now that we have formulated the necessary equations, let's write the code. We will use only the numpy library to build the model from scratch; I believe this gives a clearer picture of what is happening under the hood. We will use the Iris dataset to train and test the algorithm.

We load the data using the pandas library. The Iris dataset has three target values ('Iris-virginica', 'Iris-setosa', 'Iris-versicolor'). Since we would like to implement a binary classification algorithm, I decided to drop the rows with the target value Iris-virginica, which leaves only two target classes to predict. We then extract the independent and dependent variables from the dataset, as sketched below, before moving on to preparing the training and testing data.
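A minimal sketch of this step, assuming the dataset sits in a local iris.csv with four feature columns followed by a species column (the file name and column names are assumptions):

```python
import numpy as np
import pandas as pd

# Load the Iris dataset; file name and column names are assumed here
data = pd.read_csv('iris.csv', header=None,
                   names=['sepal_length', 'sepal_width',
                          'petal_length', 'petal_width', 'species'])

# Drop the rows with target value Iris-virginica, leaving two classes
data = data[data['species'] != 'Iris-virginica']

# Independent variables (features) and dependent variable (target)
X = data.iloc[:, :4].values
y = np.where(data['species'] == 'Iris-setosa', 0, 1).reshape(-1, 1)
```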

We shuffle the data and split it into training and testing sets, with 90 examples in the training data and 10 examples in the test data. There are four predictors in the dataset, so we extract each feature and store it in an individual vector, as shown below.
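A sketch of the shuffling and splitting, continuing from the variables above:

```python
# Shuffle the 100 remaining examples and split them 90/10
indices = np.random.permutation(len(X))
X, y = X[indices], y[indices]

X_train, y_train = X[:90], y[:90]
X_test, y_test = X[90:], y[90:]

# Store each of the four predictors in its own column vector
x1, x2, x3, x4 = [X_train[:, i].reshape(-1, 1) for i in range(4)]
```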

We initialize the parameters (theta_0, theta_1, …) to 0. During each epoch, we compute values using the linear equation, squash them into the range 0 to 1, and then calculate the cost. From the cost function, we calculate the gradients for each parameter and update the parameters by subtracting the gradients multiplied by alpha, the learning rate of the algorithm. After 10,000 epochs, our algorithm should have converged to a minimum, and we can test it with the test data.
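A sketch of the training loop; alpha = 0.01 is an assumed value, and, following the article, each parameter is stored as a 90x1 vector whose entries all remain equal (which explains the slicing in the next step):

```python
m = len(x1)      # number of training examples (90)
alpha = 0.01     # learning rate (assumed value)
epochs = 10000

# One 90x1 parameter vector per feature, plus the intercept theta_0;
# updates broadcast, so every entry of a vector holds the same value
theta_0 = np.zeros((m, 1))
theta_1 = np.zeros((m, 1))
theta_2 = np.zeros((m, 1))
theta_3 = np.zeros((m, 1))
theta_4 = np.zeros((m, 1))

costs = []
for epoch in range(epochs):
    # Linear equation, then squash the output with the sigmoid
    z = theta_0 + theta_1 * x1 + theta_2 * x2 + theta_3 * x3 + theta_4 * x4
    h = 1 / (1 + np.exp(-z))

    # Logarithmic loss over all training examples
    cost = -(1 / m) * np.sum(y_train * np.log(h)
                             + (1 - y_train) * np.log(1 - h))
    costs.append(cost)

    # Gradients of the cost, then the update: theta -= alpha * gradient
    error = h - y_train
    theta_0 -= alpha * (1 / m) * np.sum(error)
    theta_1 -= alpha * (1 / m) * np.sum(error * x1)
    theta_2 -= alpha * (1 / m) * np.sum(error * x2)
    theta_3 -= alpha * (1 / m) * np.sum(error * x3)
    theta_4 -= alpha * (1 / m) * np.sum(error * x4)
```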

We prepare the test data features in the same way as the training data. We also slice the values of theta_0, theta_1, theta_2, theta_3 and theta_4 from 90x1 down to 10x1, since the parameters were stored as vectors sized to the training set and the number of testing examples is only 10. We then calculate the predicted test classes and check the accuracy of our model.
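A sketch of the evaluation step, continuing from the variables above:

```python
# Extract each test feature into its own column vector
t1, t2, t3, t4 = [X_test[:, i].reshape(-1, 1) for i in range(4)]
n_test = len(t1)

# Slice the 90x1 parameter vectors down to 10x1 and predict
z_test = (theta_0[:n_test] + theta_1[:n_test] * t1 + theta_2[:n_test] * t2
          + theta_3[:n_test] * t3 + theta_4[:n_test] * t4)
h_test = 1 / (1 + np.exp(-z_test))

# Threshold the squashed values at 0.5 to obtain class predictions
predictions = (h_test >= 0.5).astype(int)
print('Accuracy:', np.mean(predictions == y_test))
```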

Accuracy score of our model

Our model is able to achieve 100% accuracy. Even though logistic regression is a pretty powerful algorithm, the dataset we have used isn't really complex, which is why such a perfect score is possible. We can also visualize how the cost function value fell as our model trained for 10,000 epochs.

Cost function value over the 10,000 training epochs

Now, you might be wondering why it takes this many lines of code to implement a simple algorithm. To save us from typing so many lines, we can use the scikit-learn library, which has a built-in class for logistic regression that we can just import and use.
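A sketch of the scikit-learn version, reusing the X and y arrays from the data-loading step (the exact arguments in the original listing are not shown here, so library defaults are assumed):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same 90/10 split as before: 100 examples, 4 features, 2 classes
X_train, X_test, y_train, y_test = train_test_split(X, y.ravel(), test_size=0.1)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out examples
```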

Over 50 lines of code have been reduced to under 10. We also get 100% accuracy through the scikit-learn library's logistic regression class.

Conclusion

Logistic regression is a simple algorithm that can be used for binary/multiclass classification tasks. I think by now you will have obtained a basic understanding of how the logistic regression algorithm works. Hope this article was helpful :)

