Targeting and fake followers: how can we solve the most acute problems of influencer marketing?

Written by arthur-suilin | Published 2019/09/04
Tech Story Tags: prometeus | influencer-marketing | instagram | topic-modeling | lookalikes | fake-followers | identify-fake-followers | hackernoon-top-story

TLDR: Instagram bloggers lead by a wide margin: 69% of marketers promote products with their help. Kylie Jenner reportedly makes a fabulous $1 million per paid Instagram post. Luka Sabbat was sued for failing to live up to an agreement to promote Snap Spectacles on his Instagram account; the scandal only raised his fame, and with almost 2 million followers his prices have increased. A nano-influencer (with fewer than 10 thousand followers) has undeniable advantages as an ambassador for your product on Instagram: engagement of such followers is almost 7 times higher than for top influencers.

Influencer marketing prospects are genuinely impressive. The market has doubled over the last year and a half and now exceeds $2 billion. No wonder: bloggers give brands access to a huge audience on social networks, and Instagram bloggers lead by a wide margin, with 69% of marketers promoting products with their help. However, there are obstacles to the growth of this market, because of which small businesses rarely risk using the services of Instagram stars: their prices are sky-high, guarantees are not always provided, and selecting relevant influencers is too complicated. Data scientist Arthur Suilin presents the problems of this new marketing industry and their solutions. Editor and co-author: Egor Perezhogin.

Stubborn Instagram stars with crazy high prices

Advertising on the pages of famous influencers costs a staggering sum. For example, Kylie Jenner reportedly makes a fabulous $1 million per paid Instagram post; TV advertising with a Hollywood actress is far cheaper. But prices are not the only thing that scares advertisers off. Top influencers sometimes behave like capricious stars, and they get away with it…
Screenshot © Instagram / Luka Sabbat
Luka Sabbat was sued for failing to live up to an agreement to promote Snap Spectacles on his Instagram account. PR Consulting Inc. is seeking reimbursement of the $45,000 paid upfront plus another $45,000 in damages. Under the agreement, the 20-year-old Instagram blogger was obliged to make four posts featuring Snap Spectacles on his page with 1.4 million followers. But he failed to make some of the posts and did not provide PRC with analytics for his first Instagram Story. Even if Sabbat loses the case, he wins strategically: the scandal has only raised his fame. He now has almost 2 million followers, and his prices have risen accordingly.

Simple fake follower tricks, or a potato with 10 thousand followers

Lena Katz, a branded content strategist, conducted a telling experiment. She photographed an ordinary baked potato, created an Instagram account for it, and within a couple of weeks supplied it with 10 thousand fake followers. The vegetable became a star, receiving a pile of likes and comments.
Any Internet marketer can look at this page and think, "For some mysterious reason, people are following the potato, it has influence."
Lena laughs, "Actually, I just bought all those subscribers and their engagement."
Screenshot © Instagram / PotatoMcTato
The illusion of engagement is created not only through the direct purchase of followers. Bloggers unite in mutual support groups where everyone follows everyone else, zealously churning out comments and likes. Mommy bloggers are especially vigorous in this respect.

1000 Alexis Baker VS 1 Kardashian: numerical advantage

The more followers, the lower their engagement: experts have shown that this simple rule holds almost without exception.
This brings us, step by step, to the point that a nano-influencer (with fewer than 10 thousand followers) has undeniable advantages as an ambassador for your product on Instagram.
  • Followers trust them more, because many are friends, relatives, acquaintances, or people who know them personally;
  • Engagement of such followers is almost 7 times higher than that of top influencers' audiences;
  • Nano-influencers' prices are low, and many will agree to work for free, simply for samples of your products.
Obvious.ly Marketing Agency CEO Mae Karwowski aptly remarked, "You’re able to place a lot of really small bets rather than, ‘We’re going to work with Kim Kardashian.’"
Screenshots © Instagram / kimkardashian / alexisbakerrr
Left: Kim Kardashian — 146 million subscribers, post price from $500,000
Right: Alexis Baker — 3,262 subscribers, post price equal to a piece of pizza
Obviously, the nano-influencer marketing strategy is promising for large brands, interesting for medium-sized companies, and potentially irreplaceable for small businesses.
But there is a problem: how do you find nano- and micro-influencers whose audience is relevant to you?

An example of how to waste your advertising budget

The collapse of an advertising campaign launched by the cosmetics brand Clarins is a case in point. Its marketers paid for posts by four top influencers from the UAE (each with 100+ thousand followers). Only two of them delivered an acceptable ROI. Deep.Social, the company where the author of this article previously worked, conducted an in-depth analysis of the pages of the "failed" bloggers. The explanation turned out to be remarkably simple: 80% of their followers were men from third-world countries, an audience that, in principle, would never spend money on cosmetics. In other words, the influencers were chosen incorrectly.
Noorstars — one of the most popular influencers from the UAE | Screenshot © Instagram / noorstars
In fact, most of the problems of promoting products through Instagram influencers come down to just two purely technical issues: high-quality targeting and identifying fake followers. Marketers advertising through Google AdWords can focus their campaigns by keywords, targeting an arbitrarily narrow thematic segment of an audience. Instagram has no such feature; it offers billions of hashtags and millions of influencers, many of whom have fake followers.
How do you choose the most relevant bloggers, whose audience is real and suitable for your particular product? Existing techniques are imperfect. Let us consider them in detail, starting with targeting.

Targeting: problems and solutions

Impassable jungles of thematic analysis

The obvious way to select the right bloggers is to use thematic trees. Most advertising systems, including Facebook Ads, rely on them. Building such a tree is time-consuming and proceeds in two stages:
1: Marketers manually compose a thematic tree. For example, take the topic Meal. Obviously, a Thai restaurant in London’s Walthamstow district needs not just Meal but the entire thematic chain Meal: Restaurants: London: Walthamstow, and so on. In practice, the number of topics interesting to advertisers is effectively infinite: the narrower the niche an advertiser works in, the more detailed the division by topics they require. Thus the tree grows enormously.
2: This subjectively "grown" tree is then turned into a network that allows catching bloggers tagged with the required topics. The more topics are selected, the harder it is to find a blogger matching all of them. Correlating each blogger with each topic of the tree requires either manual work or machine learning systems; in both cases, because the tree itself is subjective, the risk of human error is high.
The more topics there are, the larger the tree grows, and the more effort it takes to keep it up to date (new trends arise every day, new topics emerge and old ones fade away) and to refresh the corresponding training samples for machine learning.
Small thematic trees with truncated branches, on the other hand, are inflexible and provide unacceptably coarse filters for finding bloggers.

Hashtag focusing will not work

What if we focus an Instagram advertising campaign with a set of keywords, as in AdWords, but using hashtags instead of keywords? This solution is obviously wrong:
1. Bloggers writing about cars post tens or even hundreds of different hashtags: #car, #auto, #fastcars, #wheels, #drive, #bmw, #audi, etc.
2. A blogger may use the hashtag #car incidentally, just once, for example after photographing an interesting car.
3. Bloggers often add popular hashtags such as #cat purely to draw attention to their posts, without any specific meaning.
Selecting bloggers by their tags alone will not work correctly; smarter methods are required.

Thematic modeling: impressive theory and mediocre practice

Modern natural language processing includes a subject area called topic modeling. Consider a very primitive social network whose users have only two basic interests: Food and Japan. If we rate the strength of these interests on a scale from 0 to 1, any hashtag used by bloggers on this virtual social network can be placed on the following 2D diagram.
Sample topical 2D diagram
Obviously, any hashtag is described by a pair of numbers (from 0 to 1) corresponding to its X and Y coordinates in this 2D topical space. Using the diagram, we can compute the centroid (the "averaged" coordinates in topic space) of a particular post with several tags. The centroid's X and Y coordinates correspond to the post's relevance to the topics Food and Japan respectively: the closer a coordinate is to 1, the higher the relevance. By averaging the centroids of all of a blogger's posts, we can understand which topics are generally relevant to his or her content.
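To make the centroid arithmetic concrete, here is a minimal Python sketch with hypothetical 2D coordinates for a handful of tags (real coordinates would come from a fitted model, not be hand-assigned):

```python
import numpy as np

# Hypothetical (Food, Japan) coordinates for a few tags.
tag_coords = {
    "#sushi":  np.array([0.90, 0.80]),
    "#ramen":  np.array([0.85, 0.70]),
    "#tokyo":  np.array([0.10, 0.95]),
    "#recipe": np.array([0.95, 0.05]),
}

def post_centroid(tags):
    """Average the topic coordinates of all known tags in one post."""
    vectors = [tag_coords[t] for t in tags if t in tag_coords]
    return np.mean(vectors, axis=0)

def blogger_centroid(posts):
    """Average the centroids of all of a blogger's posts."""
    return np.mean([post_centroid(p) for p in posts], axis=0)

posts = [["#sushi", "#tokyo"], ["#ramen", "#recipe"]]
print(post_centroid(posts[0]))    # relevance of one post to (Food, Japan)
print(blogger_centroid(posts))    # the blogger's overall topical profile
```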
In real topic modeling, not two but hundreds of topics are used, and the corresponding tags live in a high-dimensional space. Let us look at the following table with the results of modeling 15 topics using the BigARTM library.
Topics | Top tags
As you can see, a reasonable structure clearly emerges, but the topics are far from perfect. The reason is that topic modeling is designed for documents containing hundreds or thousands of words, while in our case most posts have only 2-3 tags. As a result, for the Prometeus Network project our development team chose a different modeling method that is simpler and more powerful at the same time.

TopicTensor model: theory

The key advantage of topic modeling is the interpretability of its results: any word or tag in a post can be weighted against each considered topic, showing how close the post is to it.
But this plus turns into a minus, because it limits the number of topics that can be considered, while on Instagram their number is practically infinite. Once we drop the requirement of a fixed number of topics, machine learning for blogger selection becomes much more efficient.
We get a model that is essentially close to the well-known Word2Vec. Each tag $w$ is represented as a vector $v_w$ in $N$-dimensional space.
The degree of similarity between tags $w$ and $w'$ (i.e. how close their topics are) can be calculated as a dot product:
$s(w, w') = v_w \cdot v_{w'}$
as a Euclidean distance (the smaller, the more similar):
$d(w, w') = \lVert v_w - v_{w'} \rVert$
or as a cosine similarity:
$s(w, w') = \dfrac{v_w \cdot v_{w'}}{\lVert v_w \rVert \, \lVert v_{w'} \rVert}$
The task of the model during training is to find tag representations that are useful for one of the following predictions:
  • Based on one tag, predict which other tags will appear in the post (Skip-gram architecture);
  • Based on all post tags except one, predict the missing tag (CBOW, "bag-of-words" architecture);
  • Take two random tags from the post and, based on the first, predict the second.
All these predictions boil down to the same setup: there is a target tag $w_t$ that needs to be predicted, and a context $c$ represented by one or more tags from the post.
The model should maximize the probability of the tag given the context, which can be written as a softmax criterion:
$P(w_t \mid c) = \dfrac{\exp\big(s(w_t, c)\big)}{\sum_{w \in W} \exp\big(s(w, c)\big)}$
But computing the softmax over the entire set of tags $W$ is expensive (a million or more tags can participate in training), so alternative methods are used instead. They boil down to having a positive example $w^{+}$ that must be predicted, and randomly sampled negative examples $w^{-}_1, \dots, w^{-}_k$ that exemplify what should not be predicted.
Negative examples should be sampled from the same tag frequency distribution as the training data.
The loss function for a set of examples can take the form of a binary classification (negative sampling, as in classic Word2Vec):
$L = -\log \sigma\big(s(w^{+}, c)\big) - \sum_{i=1}^{k} \log \sigma\big(-s(w^{-}_i, c)\big)$
or work as a ranking loss, comparing pairwise "compatibility" with the context between the positive and the negative examples:
$L = \sum_{i=1}^{k} \ell\big(s(w^{+}, c),\, s(w^{-}_i, c)\big)$
where $\ell(\cdot, \cdot)$ is a ranking function, most often the max margin (hinge) loss with margin $\mu$:
$\ell(s^{+}, s^{-}) = \max\big(0,\ \mu - s^{+} + s^{-}\big)$
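As an illustration only (not the project's training code), the max margin ranking loss over one positive tag and several sampled negatives could be computed like this, assuming cosine similarity as $s(\cdot,\cdot)$:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_loss(context_vec, pos_vec, neg_vecs, margin=0.05):
    """Penalize negatives whose similarity to the context comes within
    `margin` of the positive tag's similarity (hinge / max margin loss)."""
    s_pos = cosine(context_vec, pos_vec)
    return sum(max(0.0, margin - s_pos + cosine(context_vec, n)) for n in neg_vecs)

rng = np.random.default_rng(0)
ctx, pos = rng.normal(size=200), rng.normal(size=200)
negs = rng.normal(size=(10, 200))           # 10 sampled negative tags
print(max_margin_loss(ctx, pos, negs))
```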
The TopicTensor model is also equivalent to matrix factorization, but instead of the "document-word" matrix (as in topic modeling), here the "context-tag" matrix is factorized, which for some types of predictions turns into a "tag-tag" co-occurrence matrix.

Practical implementation of TopicTensor V1.0

Several possible ways to implement the model were considered: custom TensorFlow code, custom PyTorch code, the Gensim library, and the StarSpace library. The last option was chosen, as it required minimal modification (all the necessary functionality is already there), gives high quality, and parallelizes almost linearly across any number of cores (32- and 64-core machines were used to speed up training). By default StarSpace uses the max margin ranking loss and cosine distance as the metric of vector proximity. Subsequent experiments with hyperparameters showed that these defaults are optimal.
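The project itself used StarSpace, but as a rough stand-in the same idea of skip-gram training over tag "sentences" can be sketched with gensim's Word2Vec (gensim >= 4.0 assumed; the data and parameter values below are purely illustrative):

```python
from gensim.models import Word2Vec

# Each training "sentence" is simply the list of hashtags from one post.
posts_tags = [
    ["#bmw", "#car", "#drive"],
    ["#sushi", "#tokyo", "#japan"],
    # in practice: millions of real posts
]

model = Word2Vec(
    sentences=posts_tags,
    vector_size=200,  # dimensionality of the tag embeddings
    window=100,       # large window: all tags of a post act as one context
    min_count=1,      # keep everything here; raise on real data to drop rare tags
    sg=1,             # skip-gram: predict co-occurring tags from a given tag
    negative=10,      # number of sampled negative examples per positive
    workers=4,        # parallel training threads
)

print(model.wv.most_similar("#bmw", topn=5))
```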

Results

The resulting embeddings showed excellent separation of topics, good generalization ability, and resistance to spam tags. A demo sample of the top 10K tags (English only) is available in the Embedding Projector. After opening the link, switch to t-SNE mode (the tab in the lower left) and wait about 500 iterations until the 3D projection is built. It is best viewed in Color by = logcnt mode. If you do not want to wait, there is a Bookmarks section in the lower right corner; select Default there and the precomputed projection will load immediately.

Topic formation examples

Let’s start with the simplest case: set a topic with a single tag and find the top 50 most relevant tags.

Topic set by the tag #bmw

Tags are colored according to relevance. Tag size is proportional to its popularity.
As you can see, TopicTensor did a fine job of shaping the BMW topic and found many relevant tags that most people don’t even know exist.
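A minimal sketch of such a lookup, assuming a matrix of trained tag embeddings (the variable names and shapes below are hypothetical):

```python
import numpy as np

def top_relevant(query_tag, tags, emb, k=50):
    """Return the k tags whose vectors are closest (by cosine) to the query tag.

    tags: list of tag strings; emb: (len(tags), dim) array of trained tag vectors.
    """
    emb_norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = emb_norm[tags.index(query_tag)]
    scores = emb_norm @ query                  # cosine scores against every tag
    order = np.argsort(-scores)
    return [(tags[i], float(scores[i])) for i in order if tags[i] != query_tag][:k]
```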

Topic set by the tags #bmw, #audi, #mercedes, #vw

Let’s complicate the task and form a topic from several German auto brands (find the tags closest to the sum of the input tag vectors):
This example shows TopicTensor’s ability to generalize: it understood that we mean cars in general (tags #car, #cars). It also understood that preference should be given to German cars (tags circled in red), and it added the "missing" tags: #porsche (another German auto brand) and spelling variants that were not in the input: #mercedesbenz, #benz, and #volkswagen.

Topic set by the tag #apple

Let’s complicate the task even more and create a topic based on the ambiguous tag #apple, which can refer both to the brand and to the fruit. It can be seen that the brand theme dominates; however, the fruit theme is also present in the form of the tags #fruit, #apples, and #pear.
Let’s try to isolate a clean "fruit" topic. To do so, we add a few tags related to the Apple brand with negative weight and look for the tags closest to the weighted sum of the input tag vectors (by default each weight equals one):
It can be seen that the negative weights removed the brand theme, and only the fruit theme remained.
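The same lookup extends naturally to weighted combinations of tags; below is a sketch in which negative weights push a sub-topic away from the result (the specific tags and weights are illustrative, not the ones used for the screenshots):

```python
import numpy as np

def topic_from_weighted_tags(weighted_tags, tags, emb, k=50):
    """Build a topic as a weighted sum of tag vectors and return the closest tags.

    weighted_tags: dict of tag -> weight, e.g. {"#apple": 1.0, "#iphone": -1.0};
    a negative weight subtracts that tag's theme from the resulting topic.
    """
    emb_norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = sum(w * emb_norm[tags.index(t)] for t, w in weighted_tags.items())
    query = query / np.linalg.norm(query)
    scores = emb_norm @ query
    top = np.argsort(-scores)[:k]
    return [(tags[i], float(scores[i])) for i in top]
```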

Topic set by the tag #mirror

TopicTensor is aware that the same concept can be expressed by different words in different languages, as the #mirror example shows. Alongside the English mirror and reflection, it returned зеркало and отражение in Russian, espejo and reflejo in Spanish, espelho and reflexo in Portuguese, specchio and riflesso in Italian, and spiegel and spiegelung in German.

Topic set by the tag #boobs

The last example shows that casual themes work as well as branded ones :)

Selection of bloggers

For each blogger, the posts are analyzed and the vectors of all the tags they contain are summed:
$\beta = \dfrac{1}{|posts|} \sum_{i=1}^{|posts|} \dfrac{1}{|tags_i|} \sum_{t \in tags_i} v_t$
where $|posts|$ is the number of posts and $|tags_i|$ is the number of tags in the $i$-th post.
The resulting vector $\beta$ is the blogger's topic. Then the bloggers whose topic vectors are closest to the topic vector defined by the user are found; the list is sorted by relevance and returned to the user.
The score additionally takes into account the blogger's popularity and the number of tags in the posts; otherwise, bloggers with a single post containing a single user-specified tag would rise to the top. The final score by which bloggers are sorted combines relevance, popularity, and tag count, weighted by the empirically selected coefficients λ, ϕ, τ, each lying in the interval 0 … 1.
Calculating the cosine distance across the entire array of bloggers (several million accounts participate in the selection) takes considerable time. To speed up the selection, the NMSLIB (Non-Metric Space Library) library was used, which reduced the search time by an order of magnitude. NMSLIB pre-builds indices over the vector coordinates, which makes it possible to find the top closest vectors much faster, computing the cosine distance only for the candidates for which it makes sense.
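A minimal sketch of such an NMSLIB index over blogger topic vectors (the HNSW method and the parameter values below are illustrative assumptions):

```python
import numpy as np
import nmslib

# Topic vectors beta for every blogger (random placeholders here).
blogger_vectors = np.random.rand(10000, 200).astype(np.float32)

# Build an approximate nearest-neighbour index with cosine similarity.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(blogger_vectors)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=False)

# Query: a topic vector built from the user's input tags.
query = np.random.rand(200).astype(np.float32)
ids, dists = index.knnQuery(query, k=100)   # top-100 candidate bloggers
```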

Topic Lookalikes

The vectors β calculated for blogger selection can also be used to compare bloggers with each other. In fact, lookalike search is the same blogger selection, except that instead of a vector built from input tags, the topic vector β of a user-specified blogger serves as the query. The output is a list of bloggers whose topics are closest to that blogger's, ordered by relevance.
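Continuing the NMSLIB sketch above (and assuming its `index` and `blogger_vectors`), a lookalike query simply swaps the tag-built query vector for an existing blogger's β:

```python
# Lookalikes: query the same index with an existing blogger's topic vector.
seed_id = 42                                  # hypothetical internal blogger id
ids, dists = index.knnQuery(blogger_vectors[seed_id], k=101)
lookalikes = [int(i) for i in ids if i != seed_id][:100]
```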

Fixed topics

In TopicTensor, as already mentioned, there are no explicitly defined topics. Nevertheless, correlating posts and bloggers with a fixed set of topics is necessary to simplify search or to rank bloggers within individual topics. This raises the problem of extracting fixed topics from the tag vector space.
To solve it, unsupervised learning was chosen, both to avoid subjectivity in defining the possible topics and to save resources: reviewing hundreds of thousands of tags (even 10% of them) and assigning topics to them by hand would be a great deal of manual work.
The most obvious way to extract topics is to cluster the vector representations of tags, one cluster = one topic. Clustering was carried out in two stages, because no algorithm yet exists that can effectively find clusters in 200-dimensional space.
At the first stage, dimensionality reduction was carried out using UMAP. UMAP is, in a sense, an improved t-SNE (although based on entirely different principles); it works faster and better preserves the topology of the original data. The dimensionality was reduced to 5D, cosine distance was used as the distance metric, and the remaining hyperparameters were chosen based on the results of the clustering (the second stage).
An example of clustering tags in 3D space. Different clusters are marked with different colors (colors are not unique and can be repeated for different clusters).
At the second stage, clustering was performed with the HDBSCAN algorithm. The clustering results (English tags only) can be seen on GitHub. Clustering produced about 500 topics (the UMAP and clustering parameters can vary the number of topics within wide limits), with 70-80% of tags assigned to clusters. A visual check showed good coherence within topics and no noticeable correlation between clusters.
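A sketch of this two-stage pipeline with the umap-learn and hdbscan packages (the hyperparameter values here are illustrative, not the ones used in the project):

```python
import numpy as np
import umap
import hdbscan

# tag_embeddings: matrix of trained tag vectors (10K tags x 200 dims here).
tag_embeddings = np.random.rand(10000, 200).astype(np.float32)

# Stage 1: reduce 200D -> 5D with UMAP, using cosine distance.
reducer = umap.UMAP(n_components=5, metric="cosine", n_neighbors=30, min_dist=0.0)
low_dim = reducer.fit_transform(tag_embeddings)

# Stage 2: density-based clustering; each cluster is a candidate topic.
# A label of -1 means the tag was left unassigned (noise).
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, min_samples=10)
topic_labels = clusterer.fit_predict(low_dim)

n_topics = topic_labels.max() + 1
coverage = float((topic_labels >= 0).mean())
print(f"{n_topics} topics, {coverage:.0%} of tags assigned to a cluster")
```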
However, for practical use the clusters need further work: assembling them into a tree, removing useless ones (for example, a cluster of personal names, a cluster of negative emotions, a cluster of commonly used words), and merging some clusters into a single topic.

TopicTensor V2.0 — possible improvements

The main disadvantage of TopicTensor V1.0 is that its coverage is far from 100%: not all bloggers use hashtags, and not all of those who do write anything meaningful in them.
There are three main ways to expand coverage:
1) Analyzing photo content. The theme of a blog is clearly defined by its photos (in fact, they set it), so a computer vision model trained to output a topic vector from a photo could partially replace the tags.
2) If we assume that bloggers with similar audiences should have similar topics, we can infer the topics of bloggers who do not use tags through audience lookalikes, provided there are bloggers with both a similar audience and tags.
3) Analyzing the text content of posts.

What is the difference between fake likes and fair likes?

Fair likes are sent by people who really like a particular post. It resonates with them because their sphere of interests is close to the topic or to the blogger's personality. Fake likes come from people who do not actually care about the post at all.
How can we tell whether a person really likes the topic of a particular post? By analyzing the kinds of posts he or she likes, his or her subscriptions, and so on, machine learning can estimate with good accuracy the *probability* of that person liking a particular post or subscribing to a particular blogger.
If a like comes from a real, fair account, its *probability* is high. If it was created artificially from an account run by click farms (fake followers), its *probability* is low. This makes it possible to catch, for example, cases when a young mother (who would normally like posts about the household and raising children) suddenly likes assorted fishing equipment.
Accordingly, an account whose likes and subscriptions have a higher average *probability* is more trustworthy than an account with a low *probability*.
This method of assessing account quality is objective and representative, since the AI model is trained on data from existing subscriptions and likes, sampled across all of Instagram, without any manually created heuristic rules.
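The scoring model itself is not published in the article; purely as a conceptual sketch, a like *probability* could be derived from the similarity between a follower's interest vector and a post's topic vector, and accounts could then be ranked by their average *probability*:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def like_probability(user_vec, post_vec, scale=5.0):
    """Higher when the post's topic matches the user's interest profile.
    `scale` is an illustrative calibration factor, not a trained parameter."""
    u = user_vec / np.linalg.norm(user_vec)
    p = post_vec / np.linalg.norm(post_vec)
    return sigmoid(scale * float(u @ p))

def account_quality(user_vec, liked_post_vecs):
    """Average like probability over an account's observed likes.
    Consistently low values suggest likes that do not fit the account's interests."""
    return float(np.mean([like_probability(user_vec, p) for p in liked_post_vecs]))
```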

Conclusions & suggestions

The development of influencer marketing is currently limited mainly by technical barriers: developers have not yet given marketers effective tools for selecting relevant bloggers and identifying fake followers, so the risk of wasting an advertising budget remains high. The TopicTensor model (used in the Prometeus network) aims to solve this problem. Test it for yourself at the following addresses: http://tt-demo.suilin.ru/ , https://demo.prometeus.io/.

Written by arthur-suilin | CEO of Promeθeus Labs | Serial entrepreneur and data scientist with over 15 years of experience.
Published by HackerNoon on 2019/09/04