Machine Learning with JavaScript: Part 2

Written by abhisheksoni2720 | Published 2017/06/22


Diving deep into Supervised Learning 🏊

This is Part 2 of the ongoing series Machine Learning with JavaScript. Here's Part 1.

It's kNN time.

kNN stands for k-Nearest-Neighbours, a supervised learning algorithm that can be used for classification as well as regression problems. First, we are gonna say hello to kNN, but if you want, you can skip ahead to the code. GitHub Repository: Machine Learning with JS.

How does kNN Algorithm work?

kNN decides the class of a new data point by looking at its k nearest neighbors and picking the class that the majority of them belong to.

If the neighbors of a new data point break down as NY: 7, NJ: 0, IN: 4, then the class of the new data point will be NY.
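To make that rule concrete, here's a tiny, purely illustrative majority-vote sketch (this is not part of any library, and the function name is made up):

```js
// Pick the most frequent label among the k nearest neighbors' labels.
function majorityVote(neighborLabels) {
  const counts = {};
  let winner = null;
  for (const label of neighborLabels) {
    counts[label] = (counts[label] || 0) + 1;
    if (winner === null || counts[label] > counts[winner]) {
      winner = label;
    }
  }
  return winner;
}

// 7 NY neighbors vs. 4 IN neighbors => 'NY'
console.log(majorityVote(['NY', 'NY', 'IN', 'NY', 'IN', 'NY', 'IN', 'NY', 'IN', 'NY', 'NY']));
```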

Let's say you work at a Post Office and your job is to organize and distribute letters among Postmen so as to minimize the number of trips to the different neighborhoods. And since we are just imagining stuff, we can assume that there are only seven different neighborhoods. This is a kind of classification problem. You need to divide the letters into classes, where classes here refer to Upper East Side, Downtown Manhattan, and so on.


If you love wasting time and resources, you might give one letter from every neighborhood to each Postman, and hope that they meet each other in the same neighborhood and discover your corrupt plan. That's the worst kind of distribution you could achieve.

On the other hand, you could organize the letters based on which addresses are close to each other.

You might start with "If it's within a three-block range, give it to the same Postman." That number of nearest blocks is where **k** comes from. You can keep increasing the number of blocks until you hit an efficient distribution. That's the most efficient value of k for your classification problem.

So, based on some parameter(s), like the address of the house here, you classified whether a letter belongs to Downtown Manhattan, Times Square, et cetera. (I am not good with names, so bear with me.)

kNN in practice | Code

As we did in the last tutorial, we are going to use ml.js's KNN module to train our kNearestNeighbors classifier. Every Machine Learning problem needs data, and we are gonna use the Iris dataset in this tutorial.

The Iris dataset consists of the petal and sepal measurements of three different types of irises (Setosa, Versicolour, and Virginica), along with a field signifying each sample's type.

Step 1. Install the libraries

```
$ yarn add ml-knn csvtojson prompt
```

Or, if you like npm:

```
$ npm install ml-knn csvtojson prompt
```

- [ml-knn](https://github.com/mljs/knn): k-Nearest-Neighbours
- [csvtojson](https://github.com/Keyang/node-csvtojson): to parse the data
- [prompt](https://github.com/flatiron/prompt): to allow user prompts for predictions

Step 2. Initialize the library and load the data

The Iris dataset is provided by the University of California, Irvine and is available here. However, because of the way it's organized, you are gonna have to copy the content in the browser (Select All | Copy) and paste it into a file named iris.csv. You can name it whatever you want, except that the extension must be .csv.

Now, initialize the library and load the data. I am assuming you already have an empty npm project set up, but if you are not familiar with that, here's a quick intro.
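The original post embedded this code as a gist; here's a minimal sketch of what the setup might look like. The variable and file names are illustrative, except seperationSize, which the article refers to below:

```js
const KNN = require('ml-knn');
const csv = require('csvtojson');
const prompt = require('prompt');

const csvFilePath = 'iris.csv'; // the file you pasted the dataset into

// Our own header names, since the raw CSV has no header row
const names = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'type'];

let knn;            // the classifier, created in Step 4
let seperationSize; // index where the training set ends and the test set begins

let data = [];
let X = [], y = []; // features and numeric labels, filled in Step 3
let trainingSetX = [], trainingSetY = [], testSetX = [], testSetY = [];
```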

The header names are used for visualization and understanding. They will be removed later.

Also, seperationSize is used to split the data into training and test datasets.

Cool, eh?

We imported the csvtojson package, and now we are going to use its fromFile method to load the data. (Since our data doesn't have a header row, we are providing our own header names.)

We are pushing each row to the data variable, and when the process is done, we are setting the seperationSize to 0.7 times the number of samples in our dataset. Note that if the size of the training set is too small, the classifier may not perform as well as it would with a larger one.
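A sketch of the loading step. One caveat: the original streamed rows through csvtojson's event API, while recent csvtojson versions resolve a promise with all rows at once, which is what this sketch uses:

```js
csv({ noheader: true, headers: names })
  .fromFile(csvFilePath)
  .then((rows) => {
    data = rows;                        // one object per CSV record
    seperationSize = 0.7 * data.length; // 70% training, 30% test
    data = shuffleArray(data);          // defined below
    dressData();                        // Step 3
  });
```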

Since our dataset is sorted with respect to types (console.log it to confirm), the shuffleArray function is used to, well, shuffle the dataset to allow splitting. (If you don't shuffle, you might end up with a model which works fine for the first two classes, but fails with the third.)

Here's how it is defined. I got it from an answer over at StackOverflow.
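For reference, the widely used Durstenfeld (Fisher-Yates) shuffle from that answer looks like this:

```js
// In-place Durstenfeld (Fisher-Yates) shuffle
function shuffleArray(array) {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // random index in [0, i]
    [array[i], array[j]] = [array[j], array[i]];   // swap elements i and j
  }
  return array;
}
```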

Step 3. Dress Data (yet again)

Our data is organized as follows:

```js
{
  sepalLength: '5.1',
  sepalWidth: '3.5',
  petalLength: '1.4',
  petalWidth: '0.2',
  type: 'Iris-setosa'
}
```

There are two things we need to do to our data before we serve it to the kNN classifier:

  1. Turn the String values to floats. (parseFloat)
  2. Turn the type into numbered classes. (Computers like numbers, you know?)

If you are not familiar with Sets, they are just like their mathematical counterparts, as in they can't have duplicate elements, and their elements do not have an index. (As opposed to Arrays.)

And a Set can easily be converted to an Array using the spread operator. (Going the other way, an Array can be converted to a Set by passing it to the Set constructor.)
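Putting both steps together, the dressing code might look like this (the helper name dressData and the split variables follow the names introduced in Step 2; treat the exact shape as a sketch):

```js
function dressData() {
  // Gather the unique class names; a Set drops duplicates for us
  const types = new Set();
  data.forEach((row) => types.add(row.type));
  const typesArray = [...types]; // e.g. ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

  data.forEach((row) => {
    // 1. String measurements -> floats
    X.push([
      parseFloat(row.sepalLength),
      parseFloat(row.sepalWidth),
      parseFloat(row.petalLength),
      parseFloat(row.petalWidth),
    ]);
    // 2. Type string -> numbered class (its index in typesArray)
    y.push(typesArray.indexOf(row.type));
  });

  trainingSetX = X.slice(0, seperationSize);
  trainingSetY = y.slice(0, seperationSize);
  testSetX = X.slice(seperationSize);
  testSetY = y.slice(seperationSize);

  train(); // Step 4
}
```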

Step 4. Train your model and then test it

Data has been dressed, wands at the ready. Expelliarmus:
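A sketch of the training step. One caveat: the paragraph below describes a separate train method, which matches the ml-knn API at the time the article was written; in current ml-knn releases the training data and options are passed straight to the constructor, which is what this sketch does:

```js
function train() {
  // Train on the 70% split; k is passed as an option (the value here is illustrative)
  knn = new KNN(trainingSetX, trainingSetY, { k: 7 });
  test();
}
```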

The train method takes two mandatory arguments: the input data, such as petal length and sepal width, and its actual class, such as Iris-setosa. It also takes an optional options parameter, which is just a JS object that can be passed to tweak the internal parameters of the algorithm. We are passing the value of **k** as an option. The default value of k is 5.

Now that our model has been trained, let's see how it performs on the test set. Mainly, we are interested in the number of misclassifications that occur. (That is, the number of times it predicts the input to be one class when it's actually another.)

The error is calculated as follows: we use the humble for-loop to loop over the test set and check whether the predicted output is equal to the actual output. Every mismatch is a misclassification.
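A sketch of that loop, using the split variables from Step 3:

```js
function test() {
  const result = knn.predict(testSetX); // predicted class for every test sample
  let misclassifications = 0;

  for (let i = 0; i < testSetY.length; i++) {
    if (result[i] !== testSetY[i]) {
      misclassifications++; // predicted class differs from the actual class
    }
  }

  console.log(`Test Set Size = ${testSetX.length} and number of ` +
              `Misclassifications = ${misclassifications}`);
  predict(); // Step 5 (optional)
}
```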

Step 5. (Optional) Start Predicting

It's time to have some prompts and predictions.

Feel free to skip this step, if you don't want to test out the model on new input.
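A sketch of the prompt-driven prediction. The property names with spaces reproduce the prompts shown in the sample run below; the glue code itself is illustrative:

```js
function predict() {
  const temp = [];
  prompt.start();
  prompt.get(['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], (err, result) => {
    if (err) throw err;
    for (const key in result) {
      temp.push(parseFloat(result[key])); // read the four measurements as floats
    }
    // predict() takes a dataset, so wrap the single sample in an array
    console.log(`With ${temp} -- type = ${knn.predict([temp])[0]}`);
  });
}
```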

Step 6. Boom-shaw-shey-Done. 🚀

If you followed the steps, your index.js should combine all the snippets above into one file. (The full version is in the GitHub repository linked below.)

Go fire up a terminal 💻 and run node index.js.

```
$ node index.js
Test Set Size = 45 and number of Misclassifications = 2
prompt: Sepal Length: 1.7
prompt: Sepal Width: 2.5
prompt: Petal Length: 0.5
prompt: Petal Width: 3.4
With 1.7,2.5,0.5,3.4 -- type = 2
```

Well done. That's your kNN algorithm at work, classifying like a charm. 💹

All the code is on GitHub: machine-learning-with-js

A huge aspect of the kNN algorithm is the value of k, which is referred to as a hyperparameter. To paraphrase this answer on Quora, hyperparameters are "a kind of parameter that cannot be directly learned from the regular training process. These parameters express 'higher-level' properties of the model, such as its complexity or how fast it should learn."

k defines how many blocks in the neighborhood of the address should be considered to classify it.


I am working on the ml-knn module and hopefully, the process of choosing k will be automated pretty soon.

If you are kinda excited and want to see what this can do, you can go to UC Irvine Machine Learning Repository and use your classifier on a different dataset. (That repository has hundreds.)

PS: To get the latest articles in this series, keep an eye on my profile, or you could cut yourself some slack and follow me. 😄

Thanks for reading! If you liked it, hit the green button ❤️ to let others know about how powerful JS is and why it shouldn't be lagging behind when it comes to Machine Learning.

