Finding Non-Unique Elements in JavaScript

Written by michaelromanov | Published 2017/11/08

It is surprising how algorithms are everywhere, yet we modern developers often don’t care about them. I was convinced this knowledge was only for language and framework core development teams.

— Who cares how Array.sort() works if it does its job and does it fast enough, right?

So with this mindset I went to the Facebook Hacker Cup last year hoping to win a t-shirt, and failed in the first round. I was disappointed, started to think about how I could improve, and slowly began diving into the world of algorithms.

Long story short, let’s consider a simple example:

We have an array of integer numbers and the task is to find all non-unique elements in it, e.g. for [1, 2, 2, 3] the answer is [2, 2]: every occurrence of a value that appears more than once, kept in the original order.

Naive Solution

Just think for a second how you would do it, and let me show you my first working solution:

// "distinct" holds every value seen so far,
// "unique" holds values seen exactly once so far.
var length = data.length;
var unique = [];
var distinct = [];

for (var i = 0; i < length; i++) {
    var elem = data[i];
    var uniqueInd = unique.indexOf(elem);

    // Seen before: it is no longer unique.
    if (uniqueInd > -1) {
        unique.splice(uniqueInd, 1);
    }

    // First time we see this value.
    if (distinct.indexOf(elem) === -1) {
        distinct.push(elem);
        unique.push(elem);
    }
}

// Remove everything that ended up unique, leaving only the non-unique elements.
for (var i = length - 1; i > -1; i--) {
    if (unique.indexOf(data[i]) > -1) {
        data.splice(i, 1);
    }
}

return data;

At the time I wrote that code I didn’t know functional JavaScript methods like map or reduce, so it looked long. But it worked, and it made me super proud of myself. I spent less than a day on it.

I posted it to https://js.checkio.org/mission/non-unique-elements/ to compare with other solutions.

Nice Solution

Once you submit your own code there you can check out how other guys approached it.

And this beauty was on the top of the list:

return data.filter(function(a){return data.indexOf(a) !== data.lastIndexOf(a)});

It’s short and readable. The idea behind it is brilliant: if the first and the last occurrence of an element are the same, then this element occurs only once, so it is unique.

Just look at this and compare it to what I had. Do you see the gap?

I was so impressed that I started doing other competitive programming quizzes, which at one point brought me to big O notation.

Fast Solution

To quickly understand why big O matters, look at this chart taken from Steven Skiena’s book:

Imagine our algorithm has O(n * n) complexity. That would mean it takes ~16.7 minutes to process an array of 1 million numbers. Isn’t that too much? I would say it is, considering that in the Facebook Hacker Cup you have only 5 minutes to upload your results once you get the input data.
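Where do the ~16.7 minutes come from? A quick back-of-the-envelope check, assuming roughly 10^9 simple operations per second (the rate the table is based on):

// Rough estimate, assuming ~10^9 simple operations per second.
var n = 1e6;                    // 1 million numbers
var operations = n * n;         // an O(n * n) algorithm does ~10^12 operations
var seconds = operations / 1e9; // ~1000 seconds
console.log(seconds / 60);      // ~16.7 minutes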

And is our nice solution still that nice when we look at it from an algorithmic complexity point of view? Unfortunately, just like my first solution, it is not. Let’s see why:

indexOf and lastIndexOf take n operations in the worst case (say we have an array of all unique numbers) because they basically iterate through the whole array. And we execute them inside filter, which is itself a loop of n, so we do n operations n times, which is O(n * n). Oops. That is unacceptably slow.

One way to fix this is to use sorting. Basic algorithm theory tells us we can sort with O(n * log(n)) complexity, which is, if you check the table above, much faster than O(n * n) and takes a reasonable amount of time even for 1 billion records.
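In JavaScript that just means calling the built-in sort, which engines typically implement with an O(n * log(n)) algorithm. One thing to keep in mind: sort a copy and pass a numeric comparator, because the default sort compares numbers as strings.

// Sort a copy numerically; without the comparator, [10, 1, 2]
// would be sorted as strings into [1, 10, 2] instead of [1, 2, 10].
var sorted = data.slice().sort(function (a, b) { return a - b; });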

Once we sort the list we can use another basic algorithms trick: replacing the indexOf and lastIndexOf calls with binary search. The idea behind binary search is that on each step we split the searchable range into 2 pieces and continue searching in only one of them. We simply check whether the number in the middle of the range is bigger or smaller than the one we are searching for, and then choose the left or the right part of the data correspondingly, knowing that it is sorted. We repeat this until we are down to a range of only one element, which either is the element we are searching for or it is not, and in the latter case our element is not in the array at all.

To complete this we need x steps, where 2 to the power of x is n (because on each iteration we divide the amount of data by 2), which gives us O(log(n)) complexity for both indexOf and lastIndexOf. And because O(log(n)) + O(log(n)) = O(log(n)), we get O(log(n)) complexity altogether for the indexOf and lastIndexOf calls.

Finally, we have a loop of n (because we iterate through the whole array with filter) with O(log(n)) work inside each iteration, and that gives us O(n * log(n)) overall complexity.

That is the same as the sorting complexity, so in the end we have O(n * log(n)) + O(n * log(n)) = O(n * log(n)) for the whole algorithm. Cool! It’s much faster.

Now we can handle even 1 billion records in less than a minute.

And this is how we can implement indexOf with O(log(n)) complexity on a sorted array:

function newIndexOf(array, elem, startIndex, endIndex) {
    // The range has shrunk to a single element: either it is the one
    // we are looking for or the value is not in the array at all.
    if (startIndex === endIndex) {
        if (array[startIndex] === elem) {
            return startIndex;
        } else {
            return -1;
        }
    }

    var median = Math.floor((startIndex + endIndex) / 2);

    if (elem <= array[median]) {
        return newIndexOf(array, elem, startIndex, median);
    } else {
        return newIndexOf(array, elem, median + 1, endIndex);
    }
}

The lastIndexOf implementation will be almost the same, but instead of the ≤ condition we will use <.
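For completeness, here is how that counterpart might look (just a sketch, and the name newLastIndexOf is mine; note that besides flipping the condition, the midpoint is rounded up and kept in the right half so the search range always shrinks):

function newLastIndexOf(array, elem, startIndex, endIndex) {
    if (startIndex === endIndex) {
        if (array[startIndex] === elem) {
            return startIndex;
        } else {
            return -1;
        }
    }

    // Round the midpoint up, otherwise a two-element range would never shrink.
    var median = Math.ceil((startIndex + endIndex) / 2);

    if (elem < array[median]) {
        return newLastIndexOf(array, elem, startIndex, median - 1);
    } else {
        return newLastIndexOf(array, elem, median, endIndex);
    }
}

And this is roughly how the whole O(n * log(n)) version could be wired together (again just a sketch, with names of my own choosing, reusing the two helpers above):

function nonUniqueFast(data) {
    // Sort a copy so the output keeps the original order of data.
    var sorted = data.slice().sort(function (a, b) { return a - b; });
    var last = sorted.length - 1;

    // An element is non-unique if its first and last positions
    // in the sorted copy differ.
    return data.filter(function (elem) {
        return newIndexOf(sorted, elem, 0, last) !==
               newLastIndexOf(sorted, elem, 0, last);
    });
}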

An Even Faster Solution

But if we are talking about a web server under high load, even ~30 seconds seems to be too much. Can we make it better?

Every time I think about advanced array utilities I think of the lodash.js library. And this is how they propose to do it:

_.transform(_.countBy(array), function(result, count, value) {
    if (count > 1) result.push(value);
}, []);

First, we use the countBy method to count the occurrences of each element. It gives us a JS object which has the elements as keys and their occurrence counts as values. countBy is essentially a reduce over the array and takes O(n) time because it basically iterates through the whole array once.
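For example (just to illustrate the shape of the object countBy produces):

_.countBy([1, 2, 2, 3]);
// => { '1': 1, '2': 2, '3': 1 }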

Then with transform we go through all the keys of that object and take those whose values (occurrence counts) are greater than 1, which again gives us O(n) time complexity because we cannot have more than n keys in this object.

And because O(n) + O(n) = O(n) we have O(n) complexity as a final result. And if you check the table above you’ll see that we can process 1 billion records 30 times faster now. Wow!
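If you do not want to pull in lodash, the same counting idea is easy to sketch in plain JavaScript (the function name nonUniqueLinear is mine; unlike the transform version above, which collects each non-unique value once, this keeps every occurrence, like the filter-based solutions):

function nonUniqueLinear(data) {
    var counts = {};

    // First pass: count how many times each value occurs, O(n).
    for (var i = 0; i < data.length; i++) {
        counts[data[i]] = (counts[data[i]] || 0) + 1;
    }

    // Second pass: keep only values that occur more than once, O(n).
    return data.filter(function (elem) {
        return counts[elem] > 1;
    });
}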

Conclusion

Somehow my short analysis confirms that frameworks and utility libraries like lodash.js usually take care of algorithmic time complexity, so we can trust them in that sense.

At the same time, it is easy to fall into the trap of nice-looking clever code which will in the end significantly decrease our app’s performance.

So even though we may not create complex algorithms ourselves, it is good to know how to calculate the complexity of a given one, because then we can be confident there will be no surprises when at some point we decide to scale our application to larger volumes.

If you’re really interested, I highly recommend Steven Skiena’s classes and book for learning, as well as the https://js.checkio.org/ and https://www.codewars.com/ platforms to get some practice. And maybe see you at the next programming competition? ;)

