What I learnt from building a chatbot without a framework (Part 2/2)

Written by kevinze | Published 2016/10/09
Tech Story Tags: bots | chatbots | messaging | lessons-learned | software-development


In Part 1, I mentioned that there is great learning value in interfacing a bot with a bot API directly, without going through an external bot building framework.

I explained that it was a good way to learn about various interactions and messages on different chat platforms such as Telegram or Messenger, which enables us to optimize the user experience on individual platforms. Furthermore, if we do use a cross-platform framework eventually, we will be more aware of its limitations.

In this follow-on post, I will share my learning experiences in parsing messages, handling timezones, and dealing with multiple edge cases.

I simplified message parsing

The platform may send different kinds of messages to the bot. For example, a message that contains text from the user is different from a message that contains a picture and a caption. We need to parse and extract the contents of the received payload, which is typically a JSON object received via an HTTP POST. Let us see some code examples written in Python.

def post(self):
    """Called when the bot gets a POST request."""
    update = json.loads(self.request.body)

    chat = None
    if ("message" in update
            and update["message"] is not None
            and "chat" in update["message"]
            and update["message"]["chat"] is not None):
        chat = update["message"]["chat"]

    if chat is None:
        return

    # More parsing ...

To extract a nested property safely, we need to verify that the parent property exists and has a non-null value. Python evaluates the conditions in the above if statement from left to right and stops at the first one that evaluates to False (short-circuit evaluation), so the later conditions are never run against a missing or null parent.

These checks are needed for properties that are optional, i.e. they may not be present in the payload. Besides, even if they are present, they may be null. We may avoid such laborious checks only if the API guarantees that a property is mandatory and has a non-null value.

However, the above approach is too verbose and hard to maintain. For example, what if we miss a particular check? The problem grows as we extract more properties, such as the chat_id and timestamp, which may be nested even deeper in the JSON object.

To solve this problem, we can replace the if statement with a try-except clause.

def post(self):
    update = json.loads(self.request.body)

    try:
        chat = update["message"]["chat"]
    except (KeyError, TypeError):
        # A property is missing (KeyError) or a parent is null (TypeError).
        chat = None

    if chat is None:
        return

    # More parsing ...

If the “message” or “chat” property does not exist, the lookup raises an exception (a KeyError). If the “message” property is null, an exception is raised as well (a TypeError, because None cannot be indexed).
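To check which exception each case raises, here is a small sketch with made-up payloads. The helper name failure_type is hypothetical and only exists for this illustration.

```python
def failure_type(update):
    """Return the name of the exception raised by the nested lookup, or None."""
    try:
        update["message"]["chat"]
    except (KeyError, TypeError) as exc:
        return type(exc).__name__
    return None


# Made-up payloads covering the cases described above.
payloads = {
    "missing message": {},
    "null message": {"message": None},
    "missing chat": {"message": {}},
    "complete": {"message": {"chat": {"id": 42}}},
}

for label, payload in payloads.items():
    print(label, "->", failure_type(payload))
```

Catching only these two exception types, rather than everything, keeps unrelated bugs visible.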

Although this approach is better, the post method still gets overburdened with many low level parsing details. Furthermore, duplicating try-excepts while parsing more properties is a bad idea.

To solve these problems, I created a separate Update class. It is essentially a lightweight wrapper class around the JSON update, exposing only the properties that we care about in the application.

class Update(object):
    def __init__(self, raw_update):
        self.update = json.loads(raw_update)
        self.chat = self._get_prop_value("chat")
        ...

    def _get_prop_value(self, alias):
        # The only place where missing or null properties are handled.
        try:
            return self._resolve(alias)
        except (KeyError, TypeError):
            return None

    def _resolve(self, alias):
        if alias == "chat":
            return self.update["message"]["chat"]
        ...

...

def post(self):
    update = Update(self.request.body)

    if update.chat is None:
        return
    ...

As you can see from the code snippet above, the try-except clause is put into the _get_prop_value method which can be reused. Therefore, there is no need to duplicate it. Furthermore, it becomes remarkably simple to use the Update object after it is created. We just need to access the desired property via the dot notation.
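For readers who want something to run, below is a minimal, self-contained variation of the wrapper. It is a sketch and not my exact code: it swaps the if chain in _resolve for a lookup table (the _PATHS name and its entries are assumptions made for this example), and the payloads are made up.

```python
import json


class Update(object):
    """Lightweight wrapper exposing only the properties the bot needs."""

    # Hypothetical alias -> path table; add an entry per property you need.
    _PATHS = {
        "chat": ("message", "chat"),
        "chat_id": ("message", "chat", "id"),
        "text": ("message", "text"),
    }

    def __init__(self, raw_update):
        self.update = json.loads(raw_update)
        for alias in self._PATHS:
            setattr(self, alias, self._get_prop_value(alias))

    def _get_prop_value(self, alias):
        # Missing keys raise KeyError, null parents raise TypeError.
        try:
            return self._resolve(alias)
        except (KeyError, TypeError):
            return None

    def _resolve(self, alias):
        # Walk down the JSON object one key at a time.
        value = self.update
        for key in self._PATHS[alias]:
            value = value[key]
        return value


full = Update('{"message": {"chat": {"id": 42}, "text": "/start"}}')
print(full.chat_id, full.text)

empty = Update('{"message": null}')
print(empty.chat, empty.chat_id)
```

With a table, supporting a new property is a one-line change, at the cost of hiding the paths a little further from the caller.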

When I succeeded in simplifying the code, I was overjoyed! Although the first version worked, it constantly nagged at me for an improvement. I even thought of creating a Parser object to extract the required information. However, everything clicked into place when I saw how I could put a wrapper around the original JSON object. I learnt that refactoring can be a matter of perspective.

I simplified timezone handling

To recap, the main function of my bot was to send daily Twitter trends to users. However, when should this daily update be sent?

I could ask the user for a preferred time, but I would need to ask for the timezone as well. These two steps complicate the onboarding experience, and many users may not even attempt or complete these two steps.

After weighing the pros and cons, I figured that it was not worth it to allow the user to set a precise time. However, I still needed to set a sensible default time.

The simple solution that I came up with is to record the current Coordinated Universal Time (UTC) hour as the preferred update hour when the user starts the subscription, i.e. when the user sends a /start message. Recording the preferred UTC hour does not require any knowledge of the user’s specific timezone. I then scheduled my updater to run every hour and update subscribers who should be updated at that hour.

To elaborate, if a user sends /start at 8:30am (UTC), the preferred update hour will be saved as 8. When the updater runs at 8am the following day, this user will be updated.

There are some details that need to be taken care of to get this right. For example, only users who did not recently issue a /start command should be updated. In addition, since the scheduler may invoke the updater slightly earlier than expected, we may want to schedule it a few minutes after the hour, or round the current time to the nearest hour, such that we always update the right set of users.
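The idea can be sketched as follows. This is an illustration under assumptions rather than my production code: the function names, the subscriber dictionaries and the 12-hour threshold for "recently issued a /start command" are all invented for the example.

```python
from datetime import datetime, timedelta, timezone


def preferred_hour(start_time):
    """UTC hour recorded when the user sends /start."""
    return start_time.astimezone(timezone.utc).hour


def nearest_hour(now):
    """Round to the nearest hour, so a run at 07:58 counts as 08:00."""
    rounded = now.replace(minute=0, second=0, microsecond=0)
    if now.minute >= 30:
        rounded += timedelta(hours=1)
    return rounded


def due_subscribers(subscribers, now, min_age=timedelta(hours=12)):
    """Return the ids of subscribers to update in this hourly run."""
    slot = nearest_hour(now)
    return [
        s["id"] for s in subscribers
        if preferred_hour(s["started_at"]) == slot.hour
        # Skip users who issued /start too recently (assumed threshold).
        and slot - s["started_at"] >= min_age
    ]


utc = timezone.utc
subscribers = [
    {"id": "a", "started_at": datetime(2016, 10, 8, 8, 30, tzinfo=utc)},
    {"id": "b", "started_at": datetime(2016, 10, 9, 7, 40, tzinfo=utc)},
]
# The scheduler fires two minutes early; rounding still selects the 8am users.
print(due_subscribers(subscribers, datetime(2016, 10, 9, 7, 58, tzinfo=utc)))
```

All datetimes here are timezone-aware and in UTC, which is what keeps the comparison independent of the user's local timezone.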

I evaluated technical edge cases

There are many edge cases to consider with regard to the conversational user experience; these are best addressed in a separate post. Instead, I will elaborate on three technical edge cases that I encountered when building my Telegram bot without a framework.

1. Retry policy

Every platform should have a retry policy to resend messages when they are not successfully acknowledged by the bot.

When Telegram sends the user’s message to the bot using a webhook, it expects an HTTP 200 acknowledgement from the bot, to indicate that the bot has successfully received the message. If the bot fails to give this reply, the platform may resend the same message at a later time.

During development, we can view the retry policy in action by logging requests and doing the following:

  1. Deploy a version of the bot that raises an error, or returns HTTP 500
  2. Send a few messages from the user
  3. Deploy a version of the bot without the error some time later

Sample logs showing that the platform resends failed messages

Through the examination of the application logs, I realized that Telegram employs an exponential backoff policy, i.e. it increases the time delay between subsequent retries. However, if the bot still fails to acknowledge a message after some number of attempts, Telegram will stop trying to resend the message. That being said, if any other message in the future is acknowledged, it will try to resend older messages that have not been acknowledged yet. Note that these details are not officially documented by Telegram.
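As a generic illustration of what exponential backoff means, the sketch below computes the delay before each retry. The base delay, factor and number of attempts are invented values for the example; they are not Telegram's actual parameters, which are undocumented.

```python
def backoff_delays(base_seconds, factor, max_attempts):
    """Delay before each retry: base, base * factor, base * factor**2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(max_attempts)]


# Hypothetical values: start at 1 second and double on every retry.
print(backoff_delays(base_seconds=1, factor=2, max_attempts=5))  # [1, 2, 4, 8, 16]
```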

Although retries are great, there is one problem. If they occur, the bot may not receive messages in the exact order in which the user had first sent them. This brings us to the next edge case.

2. Message ordering

In normal cases, the bot should receive user messages in the same order that the user had sent them. However, there are at least two scenarios in which the bot does not process messages in the correct order.

The first scenario occurs when failed messages are resent according to the platform’s retry policy. For example, the user sends A then B. The bot fails to acknowledge A, but receives B successfully. Following that, it receives A when the platform resends it. As a result, the bot processed B then A, which is the wrong order.

The second scenario occurs when the same user sends many messages within a very short period of time (e.g. 3 or more messages per second). A similar burst may be common in a group chat, where several people in the same group chat concurrently.

When handling a large number of user messages, we typically spin up multiple instances to handle requests concurrently. Requests may be sent to different queues to wait for their turn. Thus, there is a small probability that a later message from the user gets processed first. There is also a small probability that two messages from the same user get handled simultaneously, which makes message ordering ambiguous.

Why message ordering may be important

Message ordering is not that important if the bot only gives information and can fulfill requests in one step.

However, it is important if the bot needs to set user data over multiple messages, as it will affect the accuracy of that data. Consider the example below.

User: A
Bot: Got it.
User: B.
Bot: Would you like to order B1 or B2?

However, if the order is reversed, this would have happened.

User: A
User: B
Bot Receives: B
Bot: B1 or B2?
Bot Receives: A
Bot: Sorry, we don’t have A. Would you like to order B1 or B2?

At this point, the user will be utterly confused.

Worse still, if A was a legitimate answer to “B1 or B2?”, then the bot would have wrongly recorded A as the answer.

How can we solve this problem?

One way to solve this problem is to simply discard late arrivals. We can determine if a message is late by comparing its timestamp with the timestamp of the last user message that the bot received.
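A minimal sketch of this check is shown below. The class name is hypothetical, and the state lives in an in-memory dictionary only to keep the example short; a bot running on several instances would need to keep it in shared storage. Timestamps are assumed to be integers such as the Unix time found in the message's date field.

```python
class LateMessageFilter(object):
    """Tracks the newest timestamp seen per chat and flags older messages."""

    def __init__(self):
        self._latest = {}  # chat_id -> timestamp of the newest accepted message

    def accept(self, chat_id, timestamp):
        """Return True if the message should be processed, False if it is late."""
        latest = self._latest.get(chat_id)
        if latest is not None and timestamp < latest:
            return False
        self._latest[chat_id] = timestamp
        return True


late_filter = LateMessageFilter()
# The user sent A (t=100) then B (t=105), but B reaches the bot first.
print(late_filter.accept(chat_id=1, timestamp=105))  # B is processed
print(late_filter.accept(chat_id=1, timestamp=100))  # A arrives late
```

Messages with equal timestamps are accepted here, since a timestamp with one-second resolution cannot order them.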

In addition, before any critical action is taken after a series of messages, the bot can ask the user for a final confirmation.

Can we prevent this problem?

Messages may be processed in the wrong order when the retry policy is invoked. To prevent this from happening, the bot should successfully acknowledge all messages, unless it runs into an error that it cannot or should not recover from.
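One way to picture this is a thin wrapper around the message handler that decides which HTTP status to return. The sketch below is an assumption-laden illustration: UnrecoverableError and handle_webhook are hypothetical names, and how the status code is actually sent depends on the web framework in use.

```python
import json
import logging


class UnrecoverableError(Exception):
    """Hypothetical marker for failures where a platform retry is wanted."""


def handle_webhook(raw_body, process):
    """Return the HTTP status code to send back to the platform."""
    try:
        update = json.loads(raw_body)
        process(update)
    except UnrecoverableError:
        return 500  # do not acknowledge, so the platform resends later
    except Exception:
        # Log the failure but acknowledge, so the message is not resent.
        logging.exception("Update could not be processed; acknowledging anyway")
    return 200


def broken_handler(update):
    raise KeyError("text")


print(handle_webhook('{"message": {"text": "hi"}}', broken_handler))
```

The trade-off is that an acknowledged message which failed to process is lost unless the bot logs or stores it for later inspection.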

Keeping retries to a minimum can be achieved by

  • Testing extensively
  • Separating development and production environments
  • Scaling adequately to maintain responsiveness

The problem may also occur if the same user sends too many messages within a very short time. In a one-on-one chat, this should not happen unless the user is spamming the bot (consider giving a warning to a spammer). However, if the bot needs to be able to chat in a group, the bot may legitimately receive many messages from the group at once.

Consider the act of setting a group reminder. To prevent Bob from interfering when Alice is trying to set the details of the group reminder, the bot could treat anything that Bob says as a separate request. For this to work, the bot probably needs to allocate separate memory for each user in the group. In addition, it can commit changes to the shared group memory at an appropriate time.
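A rough sketch of this arrangement keeps a draft per (group, user) pair and only touches the shared group state on commit. The class, method and field names are made up for this example, and plain dictionaries stand in for whatever storage the bot really uses.

```python
class GroupReminderSession(object):
    """Per-user drafts inside a group, committed to shared group state."""

    def __init__(self):
        self.drafts = {}     # (chat_id, user_id) -> reminder details so far
        self.reminders = {}  # chat_id -> list of committed reminders

    def set_detail(self, chat_id, user_id, key, value):
        """Record one detail in this user's own draft."""
        self.drafts.setdefault((chat_id, user_id), {})[key] = value

    def commit(self, chat_id, user_id):
        """Move the user's draft into the shared group memory."""
        draft = self.drafts.pop((chat_id, user_id), None)
        if draft is not None:
            self.reminders.setdefault(chat_id, []).append(draft)
        return draft


session = GroupReminderSession()
ALICE, BOB, GROUP = 1, 2, 10
session.set_detail(GROUP, ALICE, "text", "Standup")
session.set_detail(GROUP, BOB, "text", "Lunch")  # Bob does not disturb Alice
session.set_detail(GROUP, ALICE, "time", "09:00")
session.commit(GROUP, ALICE)
print(session.reminders[GROUP])
```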

Other than preventing the problem discussed here, this arrangement grants the bot superior context awareness, because it has knowledge of individual users in the group and the group itself.

3. Contention

So far, we have seen that the retry policy and an influx of messages can cause messages to be processed in the wrong order. However, they may also cause database contention, which results in inaccurate updates to the database.

Suppose that the bot receives the user’s message, makes a change to the database, and generates a reply. Does this change take effect immediately?

We can write synchronously to the database, which may provide a guarantee that the change has been committed before the reply is generated. However, depending on the database, that change may take some time to become visible to another database query.

Consider a scenario where the bot is supposed to record the chat history in a list. Denote the existing history as H.

After message m1 arrives, the history is supposed to be saved as H + m1. A couple of milliseconds later, message m2 arrives as well. Since the change is not visible yet, the queried history is still H, so the history is saved as H + m2. What is disconcerting is that the user will think that nothing went wrong because the bot replied to both messages as per normal.

In order to address this problem, we can apply the same mitigating measures as described in the previous edge case. Furthermore, we need to understand how our chosen database deals with read and write consistency, and how atomic transactions can be performed, especially if strong consistency is an important requirement.
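To make the lost update concrete, here is a deterministic sketch that replays the interleaving by hand. VersionedStore is a hypothetical in-memory stand-in for a database row, and the version check shown is one common technique (optimistic concurrency), not something specific to any database or to my bot.

```python
class VersionedStore(object):
    """In-memory stand-in for a database row with a version counter."""

    def __init__(self, history):
        self.history = list(history)
        self.version = 0

    def read(self):
        return list(self.history), self.version

    def write(self, history):
        """Unconditional write: the last writer wins."""
        self.history = list(history)
        self.version += 1

    def write_if_unchanged(self, history, expected_version):
        """Write only if nobody else has written since we read."""
        if self.version != expected_version:
            return False
        self.write(history)
        return True


# Lost update: both handlers read H before either of them writes.
store = VersionedStore(["H"])
seen_by_m1, _ = store.read()
seen_by_m2, _ = store.read()
store.write(seen_by_m1 + ["m1"])
store.write(seen_by_m2 + ["m2"])
print(store.history)  # m1 has been overwritten

# Same interleaving with a version check: the second writer re-reads and retries.
store = VersionedStore(["H"])
seen_by_m1, v1 = store.read()
seen_by_m2, v2 = store.read()
store.write_if_unchanged(seen_by_m1 + ["m1"], v1)
if not store.write_if_unchanged(seen_by_m2 + ["m2"], v2):
    fresh, version = store.read()
    store.write_if_unchanged(fresh + ["m2"], version)
print(store.history)
```

Many databases offer an equivalent through transactions or conditional writes; which mechanism applies depends on the database chosen.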

General comments on edge cases

Before investing time in implementing various preventive or mitigating measures for edge cases, it is important to evaluate the likelihood of their occurrence and the impact on the user. If the risk and impact are not small, it is best to be upfront about the time and effort required to understand and handle them.

To conclude

I have learnt so much from building a bot without a framework. It enabled me to discover specific things about the chosen bot platform (Telegram) and made me think about how to simplify message parsing and timezone handling. Moreover, I learnt about edge cases and how to handle them.

If you have not seen Part 1, click here.

If you find this post insightful and want to see more, please ❤ and follow me!


Published by HackerNoon on 2016/10/09