We've all come across warnings when visiting suspicious websites. Your browser or search engine might even block you from entering, displaying a message that this site may harm your device. But what if the site you're trying to visit is not flagged as malicious?

According to SiteLock's 2022 Security Report, 92% of infected websites are not blacklisted by search engines. This means that businesses and individuals are vulnerable to attack when they visit these sites.

There are a number of reasons why search engines are missing infected sites. Firstly, it can take weeks or even months for a website to be identified as malicious. This is because attackers are constantly changing their tactics to evade detection. Secondly, many businesses don't realize their site has been hacked until it's too late. And thirdly, even if a website is flagged, there's no guarantee that users will avoid it.

How AI can secure the web

So what can be done to protect businesses and users from these threats? Just as cybercriminals use AI to automate their attacks, so too can we use AI to defend businesses. This isn't merely theory: an IEEE analysis of AI-based malware detection techniques concluded that they "provide significant advantages" in accuracy, speed, and scalability.

For example, SafeDNS uses "continuous machine learning," achieving 98% precision in detecting malware. They use a "database of malware" to fuel machine learning models that analyze data to look for new patterns of behavior that could indicate a threat. This allows them to identify threats quickly and effectively, before they can do any damage.

If we want to stay one step ahead of cybercriminals, we need to use AI to defend our businesses. Recent research is a wake-up call - it's time to take action and invest in AI-powered solutions.

Detecting Malware - A Python Proof of Concept

There are many ways to detect and protect against malware. In this section, we'll take a look at one such method: using Python to detect malware based on a dataset of executable files. View the full associated code here.

The dataset we'll be using is Kaggle's "Malware Executable Detection" dataset. It consists of 373 samples of executable files, 301 of which are malicious and 72 of which are non-malicious. As you can see, the dataset is imbalanced: malicious files outnumber non-malicious ones by roughly four to one.

There are 531 features represented in the dataset, from F1 to F531, and a label column stating whether the file is malicious or non-malicious. To keep the demo simple, we'll pass every feature column to the model rather than selecting a subset.

We'll start by importing the necessary libraries for our demo. We'll be using the pandas, numpy, and scikit-learn libraries:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

Next, we'll load in the dataset:

df = pd.read_csv('uci_malware_detection.csv')
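
Before modelling, it helps to take a quick look at what was loaded. The following is a minimal sketch of such a check; the `summarize` helper and the tiny `toy` frame are hypothetical and exist only so the example runs without the CSV file. It assumes the label column is named `Label`, as in the rest of this walkthrough.

```python
import pandas as pd


def summarize(df: pd.DataFrame, label_col: str = "Label") -> dict:
    """Collect a few basic facts worth checking before training a model."""
    return {
        "shape": df.shape,                                       # (rows, columns)
        "class_counts": df[label_col].value_counts().to_dict(),  # class balance
        "duplicate_rows": int(df.duplicated().sum()),            # repeats of an earlier row
        "missing_values": int(df.isna().sum().sum()),            # NaNs across all columns
    }


# Tiny stand-in frame with made-up values (not taken from the real dataset).
toy = pd.DataFrame({
    "F1": [1, 0, 1, 1],
    "F2": [0, 0, 1, 1],
    "Label": ["malicious", "non-malicious", "malicious", "malicious"],
})

summary = summarize(toy)
print(summary)
```

On the real data, the same call would simply be `summarize(df)`, which should surface the class imbalance and any duplicate rows.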

Before training, we'll map the labels from strings to numbers, remove duplicate rows, and then split the data into training and testing sets:

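# Encode the string labels as integers (0 = malicious, 1 = non-malicious)
df['Label'] = df['Label'].map({'malicious': 0, 'non-malicious': 1})
# Note: keep=False drops every copy of a duplicated row, not just the extras
df = df.drop_duplicates(keep=False)

# Separate features from the target, then hold out 20% of the rows for testing
X, y = df.drop("Label", axis=1), df["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)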
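
Because the classes are imbalanced, a plain random split can leave the test set with a slightly different class mix than the training set. One common option, not used in the code above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo`, both made up) that only mimic the 301:72 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: random binary features, labels in the article's 301:72 ratio.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(373, 5))
y_demo = np.array([0] * 301 + [1] * 72)  # 0 = malicious, 1 = non-malicious

# stratify=y_demo keeps the class proportions nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1: ", y_te.mean())
```

With only 72 non-malicious samples in total, keeping the proportions aligned makes the test metrics a little less dependent on the luck of the split.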

We're now ready to build our models. We'll be using a simple logistic regression model:

# Raise max_iter above the default so the solver has room to converge
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

We can now evaluate our model's performance on the testing set:

# Predict hard 0/1 labels for the held-out files
y_pred = lr_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
# ROC-AUC here is computed from the hard labels rather than from probabilities
print('ROC-AUC score', roc_auc_score(y_test, y_pred))
print('Confusion matrix:\n ', confusion_matrix(y_test, y_pred))

Running this code gives us the following output:

0.9864864864864865
ROC-AUC score 0.9705882352941176
Confusion matrix:
 [[57  0]
 [ 1 16]]
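
One detail worth noting about the ROC-AUC figure: `roc_auc_score` was given hard 0/1 predictions. For a binary problem that makes it equal to balanced accuracy (the mean of the per-class recalls), which here is (1 + 16/17) / 2 ≈ 0.9706. Passing probability scores from `predict_proba` instead lets the metric reflect how well the model ranks files. The following is a minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the malware dataset), so its numbers will not match the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% class 0, 20% class 1).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)         # 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, scores)

print("AUC from hard labels:", auc_from_labels)
print("balanced accuracy:   ", balanced_accuracy_score(y_te, hard_labels))
print("AUC from scores:     ", auc_from_scores)
```

The first two printed values coincide, while the third uses the full ranking information and is usually the number people mean when they quote ROC-AUC.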

Ultimately, the model misclassifies only one of the 74 files in the test set, so both precision and recall are high. Not bad! Of course, this is just a proof of concept: the test set is small, and the real-world situation is orders of magnitude more complex. At scale, AI systems trained on big data can make a real difference in the fight against malware.
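
To make the precision and recall claim concrete, the per-class figures can be read straight off the confusion matrix printed above. This is a small sketch using only NumPy; it assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with class 0 = malicious and class 1 = non-malicious as in the label mapping.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[57, 0],
               [1, 16]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of each class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of each class
accuracy = np.trace(cm) / cm.sum()

for cls, name in enumerate(["malicious", "non-malicious"]):
    print(f"{name}: precision={precision[cls]:.3f}, recall={recall[cls]:.3f}")
print(f"accuracy={accuracy:.4f}")
# malicious: precision=0.983, recall=1.000
# non-malicious: precision=1.000, recall=0.941
# accuracy=0.9865
```

In other words, no malicious file in the test set was missed, and the single error was a non-malicious file flagged as malicious.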
