Creating more accurate search results for small sites!

Written by ethan.jarrell | Published 2017/12/21
Tech Story Tags: javascript | html | web-development | data-science | programming


In a previous post, I was working on a small site where you can search Donald Trump’s recent speeches and get statistics on how often he uses certain words or phrases, when he uses them, and where he most frequently says certain things. For reference, here’s the working prototype of the site I’m building:

https://trumpspeechdata.herokuapp.com/

Now, suppose I want to add a feature that lets you search for a topic and get back any speeches on that topic. Have you ever searched for something on a site and gotten results that weren’t close to what you actually had in mind? We’ve probably all had that happen. With tons and tons of data, what I’m about to suggest might not be feasible, but for a smaller data set like this site’s, it’s reasonable and lets us return more accurate search results.

Here’s the data I have so far, if you want to follow along:

https://www.dropbox.com/s/u4vuwazx609uvvw/trumpspeeches.json?dl=0

Our current data structure looks like this:

{
    "speechtitle": "Title of Speech",
    "speechdate": "Date of Speech",
    "speechlocation": "Location of Speech",
    "text": "Entire transcript of Speech"
}

And if a user searches for a specific topic, we’ll probably want to return all of that data to the user, if their search matches the topic of the speech. To start off, let’s use this User Story:

Bob, a user, searches for “Budget” so he can view speeches made by Donald Trump on the budget.

One way we could do this would be the following:

let matchingSpeeches = [];
for (var i = 0; i < api.length; i++) {
    // Keep the speech if its title contains the search term
    if (api[i].speechtitle.indexOf(inputValue) > -1) {
        matchingSpeeches.push([
            api[i].speechtitle,
            api[i].speechdate,
            api[i].speechlocation,
            api[i].text,
        ]);
    }
}

This basically says, “if the speech title contains the word or phrase searched for, push it into the matching speeches array”. Then we would be able to format the contents of the array and show them to the user. Back to Bob though: would this get him what he wants? Yes, as long as “Budget” is included in the speech title. But what if there’s a speech on the budget, but “Budget” doesn’t appear in the speech title? Sorry Bob, you’re SOL.
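One detail worth noting: `indexOf` is case-sensitive, so a search for “budget” would miss a title containing “Budget”. Here is a minimal sketch of the same title search with both sides lowercased. The `searchByTitle` helper and the sample entries are made up for illustration and are not part of the site's code.

```javascript
// Hypothetical sample data in the same shape as the JSON above.
const api = [
  {
    speechtitle: "Sample Speech About the Budget",
    speechdate: "Date of Speech",
    speechlocation: "Location of Speech",
    text: "Entire transcript of Speech",
  },
  {
    speechtitle: "Remarks by President Trump at Tax Reform Event",
    speechdate: "September 2017",
    speechlocation: "Indiana",
    text: "speech text here",
  },
];

// Return every speech whose title contains the query, ignoring case.
function searchByTitle(speeches, inputValue) {
  const query = inputValue.trim().toLowerCase();
  if (query === "") {
    return []; // an empty query would otherwise match every title
  }
  return speeches.filter(function (speech) {
    return speech.speechtitle.toLowerCase().indexOf(query) > -1;
  });
}

const titleMatches = searchByTitle(api, "budget");
console.log(titleMatches.map(function (speech) {
  return speech.speechtitle;
}));
```

This version returns the whole speech object instead of an array of fields, which is just a convenience for the sketch.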

Maybe we could do the same thing as above, but also search the speech text, like this:

let matchingSpeeches = [];
for (var i = 0; i < api.length; i++) {
    // Keep the speech if the title OR the transcript contains the search term
    if (api[i].speechtitle.indexOf(inputValue) > -1 || api[i].text.indexOf(inputValue) > -1) {
        matchingSpeeches.push([
            api[i].speechtitle,
            api[i].speechdate,
            api[i].speechlocation,
            api[i].text,
        ]);
    }
}

Here we’re saying, “okay, if the search value appears in either the speech title or the text of the speech, we’ll push that into the array.” Better, right? Well, yes and no. What if a speech is all about abortion or gun control, but mentions the budget once? The speech isn’t really about the budget at all, but we’re still returning it to Bob and making him sort through that mess. Or what if the speech is about the budget, but uses another word, like “Spending”, throughout and never actually mentions “Budget”? We could put Bob in a scenario where he gets a speech on abortion, but doesn’t get the one on spending. Not a great thing for our end user.

Here’s another idea. Let’s add a field to our data structure called “tags”. Then, for each speech, we can add tags for its topics. For instance, let’s take this entry from our JSON data:

{
    "speechtitle": "Remarks by President Trump at Tax Reform Event",
    "speechdate": "September 2017",
    "speechlocation": "Indiana",
    "text": "speech text here"
}

We could modify that to the following:

{
    "speechtitle": "Remarks by President Trump at Tax Reform Event",
    "speechtags": ["budget", "taxes"],
    "speechdate": "September 2017",
    "speechlocation": "Indiana",
    "text": "speech text here"
}

Then when Bob makes his search, we could use our code from earlier, loop over the tags instead, and return any speech with a tag that matches the search input. However, even though this could be more targeted and in theory make our search results better, we could still run into issues. For example, what if Bob searches for “Spending” instead of “Budget”? Even though the two are close, this speech wouldn’t be sent to Bob because the query doesn’t match the tag.

Here’s one way we could solve that problem. What we want to do is boil the many popular search terms down to a few tags. So if a user searches for “Spending”, “Budget”, “Tax Reform” or “Deficit”, we will still send the user the results with the “Budget” tag, since that’s a pretty close match. To do that, we’re going to build a word object and put any words we want in it. The structure will look something like this:

var mapObj = {
    "a" : "b",
    "c" : "b",
    "d" : "b",
    "e" : "b",
}

The idea here is that if our user, Bob, searches for “a”, we’ll give him “b”. If he searches for “c” or “d” or “e”, we’re still going to give him “b”. This is just what we’ve described above. Basically, if he searches for “spending” we’ll return “budget”. And if he searches for “tax reform” or “deficit” or “budget”, we’ll still return the results containing “budget”, since that’s still a good match.
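To make the mapping concrete, here is a rough sketch that does nothing more than look a query up in the object. The `lookupTag` helper is hypothetical, and the terms are the budget-related ones mentioned above. It only works when the whole query is exactly one of the keys, which is why something more flexible is still needed to match real input.

```javascript
// Search term -> tag, using the budget-related terms from the example above.
const mapObj = {
  "spending": "budget",
  "tax reform": "budget",
  "deficit": "budget",
  "budget": "budget",
};

// lookupTag is a hypothetical helper for illustration only.
// It returns the tag for a query, or null when the query is not a known term.
function lookupTag(query) {
  const key = query.trim().toLowerCase();
  // hasOwnProperty avoids accidental hits on inherited names like "toString"
  return Object.prototype.hasOwnProperty.call(mapObj, key) ? mapObj[key] : null;
}

console.log(lookupTag("Spending"));     // "budget"
console.log(lookupTag("Tax Reform"));   // "budget"
console.log(lookupTag("chicken soup")); // null
```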

Now we’ll need to add some regex to match the input string. Here’s what the code could look like:

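var mapObj = {
    "spending" : "budget",
    "tax reform" : "budget",
    "deficit" : "budget",
    "budget" : "budget",
};
// Build one case-insensitive pattern out of all the search terms
var re = new RegExp(Object.keys(mapObj).join("|"), "gi");
// str is the user's search input; swap each matched term for its tag
var keyWord = str.replace(re, function(matched) {
    return mapObj[matched.toLowerCase()];
});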

We’re using regex to find any known term in what the user searched for, and then replace it with its tag. So now we can use the variable “keyWord” in our code from earlier, and use the speechtags field we created:

let matchingSpeeches = [];
for (var i = 0; i < api.length; i++) {
    // speechtags is an array, so indexOf looks for an exact tag match
    if (api[i].speechtags.indexOf(keyWord) > -1) {
        matchingSpeeches.push([
            api[i].speechtitle,
            api[i].speechdate,
            api[i].speechlocation,
            api[i].text,
        ]);
    }
}
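One caveat with the replace approach: `replace` only swaps the matched part of the input, so a query like “government spending” becomes “government budget”, which is not an exact tag. A possible variation, sketched below, pulls the first known term out of the query with `match` instead. The helper names (`escapeRegExp`, `toKeyWord`, `searchByTag`) and the second sample entry are assumptions for this sketch, not part of the original code.

```javascript
const mapObj = {
  "spending": "budget",
  "tax reform": "budget",
  "deficit": "budget",
  "budget": "budget",
};

// Hypothetical sample data; the first entry mirrors the tagged example above.
const api = [
  {
    speechtitle: "Remarks by President Trump at Tax Reform Event",
    speechtags: ["budget", "taxes"],
    speechdate: "September 2017",
    speechlocation: "Indiana",
    text: "speech text here",
  },
  {
    speechtitle: "Placeholder speech on another topic",
    speechtags: ["healthcare"],
    speechdate: "Date of Speech",
    speechlocation: "Location of Speech",
    text: "speech text here",
  },
];

// Escape regex metacharacters so every term is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Longest terms first, so a longer phrase wins over a shorter term it contains.
const terms = Object.keys(mapObj).sort(function (a, b) {
  return b.length - a.length;
});
const re = new RegExp("\\b(" + terms.map(escapeRegExp).join("|") + ")\\b", "i");

// Return the tag for the first known term in the query, or null if none is found.
function toKeyWord(inputValue) {
  const found = inputValue.match(re);
  return found ? mapObj[found[1].toLowerCase()] : null;
}

// Return every speech tagged with the query's tag.
function searchByTag(speeches, inputValue) {
  const keyWord = toKeyWord(inputValue);
  if (keyWord === null) {
    return [];
  }
  return speeches.filter(function (speech) {
    return speech.speechtags.indexOf(keyWord) > -1;
  });
}

console.log(toKeyWord("government spending")); // "budget"
console.log(searchByTag(api, "Tax Reform plans").length); // 1
```

Whether to extract the first term or insist on an exact query is a design choice; this sketch simply favors returning something for longer queries.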

Now, as I mentioned earlier, this might not work well on a really large scale, but I think it works in this scenario because we have a pretty limited scope. Because the topic is politics, there are only so many search terms a user might enter. And we can always add some code that still returns something to the user if we don’t get any matches for their input. For example, we probably won’t have a lot of data available if Bob searches for “chicken soup”.

Since the possible search terms are somewhat limited, we can extend our search object with as many terms as we like, to match all the tags we are using, like so:

var mapObj = {
    "spending" : "budget",
    "tax reform" : "budget",
    "deficit" : "budget",
    "budget" : "budget",

    "abortion" : "abortion",
    "women's rights" : "abortion",
    "pro life" : "abortion",
    "pro choice" : "abortion",

    "healthcare" : "healthcare",
    "obamacare" : "healthcare",
    "health reform" : "healthcare",
    "medicaid" : "healthcare",
};
var re = new RegExp(Object.keys(mapObj).join("|"), "gi");
// str is the user's search input
var keyWord = str.replace(re, function(matched) {
    return mapObj[matched.toLowerCase()];
});
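As for returning something when nothing matches (the “chicken soup” case), one possible shape is sketched below. The `knownTopics` and `buildResponse` helpers, the message wording, the shortened map, and the sample entry are all assumptions made for illustration.

```javascript
// A shortened version of the search object, to keep the sketch small.
const mapObj = {
  "spending": "budget",
  "deficit": "budget",
  "budget": "budget",
  "obamacare": "healthcare",
  "healthcare": "healthcare",
};

// Hypothetical sample data with only the fields this sketch needs.
const api = [
  {
    speechtitle: "Remarks by President Trump at Tax Reform Event",
    speechtags: ["budget", "taxes"],
  },
];

// Collect the distinct tags that the search object can produce.
function knownTopics(map) {
  const tags = Object.keys(map).map(function (term) {
    return map[term];
  });
  return tags.filter(function (tag, i) {
    return tags.indexOf(tag) === i; // keep only the first occurrence of each tag
  });
}

// Return the matching speeches, or a hint listing the topics we do know about.
function buildResponse(speeches, keyWord) {
  const results = speeches.filter(function (speech) {
    return speech.speechtags.indexOf(keyWord) > -1;
  });
  if (results.length > 0) {
    return { results: results, message: "" };
  }
  return {
    results: [],
    message: "No speeches found for \"" + keyWord + "\". Try one of: " +
      knownTopics(mapObj).join(", "),
  };
}

console.log(buildResponse(api, "chicken soup").message);
// No speeches found for "chicken soup". Try one of: budget, healthcare
```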

Then we could go back and add tags to each speech to match the keywords we’re using. We could do that manually, especially if our data set is small, or we could use JavaScript or Python or whatever, but I won’t cover that here. Again, even with this approach, we would still run into some problems, but it’s not bad if you’re looking for a quick way to make the search results you’re returning more targeted, especially for smaller data sets.
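For what it’s worth, here is one rough way the automated tagging could look in JavaScript. It counts how often the terms for each tag appear in a transcript and suggests a tag once the count reaches a threshold. The `suggestTags` helper, the `minMentions` threshold, and the sample text are assumptions for this sketch, and a human would probably still want to review the suggestions.

```javascript
const mapObj = {
  "spending": "budget",
  "tax reform": "budget",
  "deficit": "budget",
  "budget": "budget",
  "healthcare": "healthcare",
  "obamacare": "healthcare",
};

// Escape regex metacharacters so every term is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Suggest every tag whose terms appear at least minMentions times in the text.
// minMentions is an assumed tuning knob, not something from the original post.
function suggestTags(text, map, minMentions) {
  const counts = {};
  Object.keys(map).forEach(function (term) {
    const re = new RegExp("\\b" + escapeRegExp(term) + "\\b", "gi");
    const hits = (text.match(re) || []).length;
    counts[map[term]] = (counts[map[term]] || 0) + hits;
  });
  return Object.keys(counts).filter(function (tag) {
    return counts[tag] >= minMentions;
  });
}

// Made-up sample text: three budget-related terms, one healthcare term.
const sampleText = "Our budget is strong. Spending is under control and " +
  "the deficit will shrink. We will also look at healthcare.";

console.log(suggestTags(sampleText, mapObj, 2)); // [ 'budget' ]
```

Requiring more than one mention is a crude guard against the earlier problem of a speech that brings up the budget once without really being about it.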

Ethan Jarrell: My background is in graphic design, and I have spent the last 10 years doing both digital and print design for a… (www.upwork.com)


Published by HackerNoon on 2017/12/21