Using the LDA Algorithm for Websites

Written by maxim.savonin, CEO at KeenEthics | Published by HackerNoon on 2021/06/05

TLDR: Latent Dirichlet Allocation (LDA) searches for recurring word clusters across the documents you're working with and groups them into so-called unique topics: topics that occur at some frequency in the documents. LDA is widely used for online content generation and is particularly helpful in sales and marketing. To get accurate results, you first have to remove stop words from the dataset, along with leftover HTML, CSS, JavaScript, and other words unrelated to the analysis.

Have you ever had to find unique topics in a set of documents? If you have, then you’ve probably worked with Latent Dirichlet Allocation (LDA).
This is how LDA works:
The algorithm searches for recurring word clusters across the documents you're working with and groups them into so-called unique topics. These are topics that occur at some frequency in the documents.
LDA is widely used for online content generation. This algorithm is particularly helpful in sales and marketing.
Suppose that you want to create a logo for your company. How can you find the one that will best represent your business and attract clients? The answer is to study the existing cases and review the competitors’ strategies.
To create an effective logo, you should gather data from competitors’ websites. That is, you need to parse websites and apply LDA to generate goal-related topic clusters for you. Based on the results you get, you will be able to outline the keywords most frequently associated with your sphere and create an attractive logo.
Or suppose that you want to optimize your content based on SEO research. Applying LDA will allow you to gather unique topic clusters with keywords in each cluster. Based on the results, you can adjust your content to make it semantically valuable for search engines.
You can use LDA not only for websites but for any type of textual data.
So, if you want to use LDA to generate unique topics, you need to have a large number of documents or websites to use for your analysis.
But that’s not all: to get accurate results, you have to remove stop words from the dataset first. Stop words are the words that bear no value in the analysis, like “the,” “your,” “him,” “a,” and so on.
Apart from deleting stop words, you have to delete leftovers of HTML, CSS, and JavaScript, SEO boilerplate, and other words not related to the LDA analysis. Does that seem too complicated? Let’s lay it all out.
In this article, we will discuss how to generate unique topics with LDA when parsing websites.

How to Start Using the LDA Algorithm

To parse websites with LDA, you first need to import the necessary libraries for file processing and for cleaning HTML, CSS, and JavaScript out of the text. You'll also import the LDA analyzer:
import os
import re
import gensim
import nltk
import time
import random
import itertools
import pandas as pd
from string import digits
from wordcloud import WordCloud
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from tqdm.notebook import tqdm, trange
import gensim.corpora as corpora
from bs4 import BeautifulSoup
Next, select the number of websites you need to analyze. The larger and more representative your sample, the more accurate the LDA results will be.
Here, both the quantity and the quality of the sample matter. That is, you should create a uniform sample of websites that may contain the necessary keywords for a topic generation.
Then, choose the folder that will be used to read the documents:
entries = os.listdir('./dataset_htmls/')
Next, read the chosen websites and conduct a preliminary cleaning of the HTML, CSS, JavaScript:
def cleanDocument(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove <script> and <style> tags together with their contents
    for script in soup(["script", "style"]):
        script.extract()

    # Collapse the remaining text into clean, non-empty lines
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text


start_time = time.time()

files_number = len(entries)  # number of HTML files to process
files = [0] * files_number   # flags files that failed to parse
texts = []

for i in range(files_number):
    with open("./dataset_htmls/" + entries[i], "r", encoding='utf-8') as f:
        try:
            text = cleanDocument(f.read())
            text = re.sub(r'\b\w{1,2}\b', '', text)  # drop 1-2 character words
            texts.append(text)
        except Exception:
            files[i] = " "
            print("problem with file " + entries[i])

df = pd.DataFrame(data={'text': texts})

print("--- %s seconds ---" % (time.time() - start_time))
When you're using LDA, keep in mind that there are a large number of frameworks for creating websites, like Angular, Vue, React, WordPress, and more. There are also many ready-made website builders, like Wix or Tilda, that let you build websites with a no-code approach. And each developer uses their own best practices to build websites.
Most problems occur because of the so-called “dirty hacks” that developers use. Such hacks let markup slip past the parser and analyzer, which poses a problem for the LDA analysis.

How the LDA Algorithm Works

Let’s start with CSS. Make sure you detect the following ways of defining styles (each can be stripped out, as shown in the sketch below):
  • Inline CSS — use of the style attribute in HTML tags
  • Internal CSS — use of a <style> block in the document head
  • External CSS — use of a <link> tag to load a CSS file
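Here is a minimal sketch of how all three cases could be handled with BeautifulSoup (the strip_css helper is hypothetical and is not part of the article's pipeline):
def strip_css(html):
    soup = BeautifulSoup(html, "html.parser")
    # Internal CSS: <style> blocks anywhere in the document
    for tag in soup.find_all("style"):
        tag.decompose()
    # External CSS: <link rel="stylesheet"> tags
    for tag in soup.find_all("link", rel="stylesheet"):
        tag.decompose()
    # Inline CSS: style="..." attributes on individual tags
    for tag in soup.find_all(style=True):
        del tag["style"]
    return str(soup)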
For HTML, it's important to delete tags but not to delete the content of these tags. The same goes for JavaScript. One way to do so is to use regular expressions:
# Remove <style> and <script> blocks together with their contents first,
df['text'] = df['text'].map(lambda x: re.sub(r'(?s)<style.*?>.*?</style>', '', x))
df['text'] = df['text'].map(lambda x: re.sub(r'(?s)<script.*?>.*?</script>', '', x))
# then strip the remaining tags while keeping their inner text
df['text'] = df['text'].map(lambda x: re.sub(r'<[^<]+?>', '', x))
Yet, there is a problem with such methods: single (self-closing) tags, JavaScript plugins, and the many different ways of building web applications mean that regular expressions alone cannot completely remove HTML, CSS, and JavaScript. You can easily verify this by inspecting a site's input files.
This is why, for this article, I used the bs4 library. To test the cleanup, I chose 1,000 websites, and no significant bugs with removing web elements were detected. So you can also use this library and remove web elements that are not related to the LDA analysis via the cleanDocument function.
The next step is to delete very short words of one or two characters, which rarely carry meaning. Use the following regular expression to solve this problem:
text = re.sub(r'\b\w{1,2}\b', '', text)
All cleaned text is recorded to the df dataframe. To evaluate the work of the algorithm, look at the running time: processing 100 websites takes about 5 seconds.
Your next task is to clean out stop words. As I mentioned, stop words are words that bear no value for the analysis, like “the,” “you,” “himself,” and so on. To clean them out, download a list of standard stop words for the English language.
Apart from stop words, consider the standard boilerplate used on nearly every website, such as “contact us,” “cookies,” “confirm,” “back,” and so on. Add those words to your list of stop words. When cleaning, also review the "standard" set of words used in SEO optimization.
other_stop_words = [
    'contact', 'cookies', 'confirm', 'website', 'share',
]

nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words.extend(other_stop_words)

def sent_to_words(sentences):
    # Tokenize each document into lowercase words, dropping punctuation
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

def remove_stopwords(texts):
    # Drop every token that appears in the extended stop-word list
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

data = df['text'].values.tolist()
data_words = list(sent_to_words(data))

data_words = remove_stopwords(data_words)
Congratulations! You are halfway there. What’s next? You need to form a word dictionary for the LDA algorithm:
id2word = corpora.Dictionary(data_words)  # map each word to an integer id
texts = data_words
corpus = [id2word.doc2bow(text) for text in texts]  # bag-of-words per document

num_topics = 10
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

print(lda_model.print_topics())
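If the raw print_topics() output is hard to read, gensim's show_topic() method returns the top (word, probability) pairs for a topic, which you can format yourself. A small sketch:
# Print each topic's ten most probable words on one line
for topic_id in range(num_topics):
    top_words = lda_model.show_topic(topic_id, topn=10)
    print(topic_id, ", ".join(word for word, prob in top_words))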
Important note: to get precise results, you need to build a good dictionary of stop words for cleaning the text. To evaluate the output of this procedure, count word frequencies across all the text. This will let you determine which words occur most frequently:
from collections import Counter

def freq(data_words):
    # Count every word in one pass instead of calling
    # data_words.count() for each unique word
    counts = Counter(data_words)
    return pd.DataFrame(list(counts.items()), columns=['Word', 'Frequency'])

data_words = itertools.chain.from_iterable(data_words)
data_words = list(data_words)
frequency = freq(data_words)
frequency = frequency.sort_values(by="Frequency", ascending=False)
frequency.to_csv('frequency.csv')
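As a side note, the WordCloud library imported at the start can turn the same frequency information into a quick visual check. A minimal sketch, assuming the flattened data_words list from above (the output file name is arbitrary):
# Build a word cloud from the cleaned, flattened token list
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(data_words))
wc.to_file("wordcloud.png")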

The Results of LDA Analysis

Let’s review the results. As an example, we'll use 10 websites that provide courses for learning programming languages.
Based on the algorithm above, I parsed the 10 websites and removed all words unrelated to the analysis.
You can see the first result of the LDA analysis in the following image:
Below, you can see the results of frequency analysis:
Frequency analysis provides information for SEO optimization where words can be used as search keys. However, to find a cluster of similar and "valuable" words, you should clean the text of base words. In our case, base words include:
'course', 'learn', 'programming', 'lpa', 'code', 'courses', 'java', 'learning', 'get', 'one', 'python', 'see', 'pluralsight', 'web', 'org', 'become', 'javascript', 'android', 'may', 'site', 'developer.'
You can identify these words based on frequency analysis or the results of the LDA algorithm after n iterations.
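One way to run such an iteration, reusing the helpers defined earlier, looks like this (a sketch; the base-word list comes from the example above and would differ for your dataset):
# Fold the identified base words into the stop-word list and retrain
stop_words.extend(['course', 'learn', 'programming', 'code', 'courses',
                   'java', 'learning', 'python', 'web', 'javascript',
                   'android', 'developer'])

data_words = remove_stopwords(list(sent_to_words(data)))  # re-clean the raw text
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(text) for text in data_words]
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word,
                                       num_topics=num_topics)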
After cleaning out the base words we get the following result:
In fact, the output result is a set of topics with the following set of words:
Team, students, read, review, free, skills, reviews, path, data, project, build, masterclass, instructor, job, etc. 
Keep in mind that you need to clean the topics manually and iteratively until you get the desired results. Also note that such algorithms typically work with word roots: endings, plural forms, and duplicate word forms are discarded.
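This is where the PorterStemmer imported at the start comes in: applying it to every token before rebuilding the dictionary collapses plural forms and endings into a common root. A minimal sketch:
# Reduce every token to its stem so that word variants merge
stemmer = PorterStemmer()
data_words = [[stemmer.stem(word) for word in doc] for doc in data_words]
print(stemmer.stem("courses"), stemmer.stem("learning"))  # -> cours learn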
That’s it — if you are interested in more info, I welcome you to check out the full code.

Wrapping Up

You can see that LDA can be an effective way of topic modeling. Based on the results above, you can easily choose the right word combination for your logo.
You can also optimize your existing content and adjust it to perform better based on the results.
The advantage of LDA is its flexibility: the model is purely mathematical, so you can adjust its behavior to your needs.
The disadvantage, though, is that you need to polish the results manually. While this is more time-consuming, you get more precise results in the end. So if you deal with marketing and SEO, using LDA is a perfect way to dive deeper into content generation.
Special thanks to Volodia Andrushchak, Machine Learning Engineer at @KeenEthics, for co-authoring this article. If you are interested in other useful reading on the Internet of Things and Artificial Intelligence, check out other articles by Volodia: https://keenethics.com/volodya-andrushchak?article=2697
Thank you for reading, and I sincerely hope that you found it helpful!


