Timestamp extraction from abstract and definite time specifications in online promotions’ text

Written by bpraveenk | Published 2016/11/20

“A timestamp is a sequence of characters or encoded information identifying when a certain event occurred, usually giving date and time of day, sometimes accurate to a small fraction of a second.” (Source: Wikipedia, Nov 23, 2016)

Abstract time specifications along with definite time-frame references are common in the text of online promotions, especially when mentioning details related to an offer's validity. Even in daily conversation, time is not always described in exact terms: natural language is full of abstract time specifications, both written and spoken. For example, we often use phrases like "time is running out" or "a few hours". However, it is not straightforward to quantify the urgency here and pinpoint a timestamp based on such colloquial time representations.

When the task is to sort different promotions by their validity, natural language processing techniques come in very handy for figuring out when a coupon (promotion) can be used, based on how long the promotion is valid. Following are some example sentences extracted from promotions data collected from different sources.

“Offer expires in 7 days. Exclusions apply. Discount taken at checkout. See disclaimer for details”

“Preferred Customer Private Sale — Thursday Feb. 11 at 2 pm — To redeem use code TODAY at checkout.”

“Limited Time Sale Events Powered by HauteLook”

Given a reference date (the time when the promotion was sent), the first two sentences specify the exact date and time when the promotion is valid. The last sentence does not give such a concrete time frame, yet it clearly indicates that the sale event is valid only for a short duration. It is not uncommon to find such sentences in promotions data, where the exact end time (coupon expiry) or begin time (when the coupon comes into effect) is not clearly specified.

Modeling sequences of words — A simple rule-based approach that looks for specific sequences of words to determine the duration can quickly become unmanageable, given the number of possible combinations of tokens that determine the timestamp (month, year, date, hour, minute, etc.). The combinations only increase if we have to determine both the start and end time of a promotion. The goal is to minimize the number of hard-coded rules by exploiting patterns, or the common underlying structure, used in sentences that specify time, and to craft rules against that structure.

Consider the sentence "Valid from 10th October to 15th November". There are two timestamps in the sentence: one is 10th October and the other is 15th November. The promotion is valid between the mentioned dates, and the start and end dates can be determined by looking at the word sequence before the actual dates in the sentence. The words "Valid from" indicate that 10th Oct is the start date and "to" indicates that 15th Nov is the end date. This example shows the importance of using context (the surrounding words) to determine the duration of validity of a promotion.

Point in time vs duration — The three sentences mentioned above also highlight another aspect of time description in general text. Phrases like "in 7 days" and "limited time" indicate a duration with respect to a reference date, while phrases like "15th November" (from the earlier example) indicate the exact point in time when a coupon's validity starts or ends. Based on this observation we can come up with a grammar (or set of labels) to describe the time structure in a given sentence.

Grammar (labels for describing words in a sentence that indicate time) — The labeling scheme follows this design:

  • Any label starting with the letter 'B' corresponds to the Begin (start) time of a promotion's validity, and 'E' corresponds to the End time.
  • The last letter of a label indicates the word type: 'S' means the word is a string and 'N' means the word is an integer or digit.
  • "TP" in a label indicates a "time point" (point in time) and "TD" indicates a "time duration".
  • The characters "D", "H", "M", "Y", and "m" are assigned to "Day", "Hour", "Month", "Year", and "minute" respectively.
  • "AP" in a tag indicates AM/PM and "Z" indicates a time zone.
  • A single P or S in the time duration labels (TDPS, TDSS) indicates 'prefix' and 'suffix' respectively. Duration is normally specified in a sentence as a (prefix, suffix) tuple: for example '7 weeks', '2 days', 'Nine hours'.

The complete list of labels is hence — TDPS, TDPN, TDSS, (B|E)TPDN, (B|E)TPDS, (B|E)TPMS, (B|E)TPMN, (B|E)TPHS, (B|E)TPHN, (B|E)TPYN, (B|E)TPZ, (B|E)TPAP
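
To make the scheme concrete, the following minimal Python sketch (my own illustration, not code from the project) expands the naming convention above into the complete label list:

    # Illustrative only: expand the label scheme into the full label list.
    duration_labels = ["TDPS", "TDPN", "TDSS"]
    time_point_fields = ["DN", "DS", "MS", "MN", "HS", "HN", "YN", "Z", "AP"]
    time_point_labels = [
        prefix + "TP" + field
        for prefix in ("B", "E")         # B = begin time, E = end time
        for field in time_point_fields   # day, month, hour, year, zone, am/pm
    ]
    all_labels = duration_labels + time_point_labels
    print(all_labels)  # ['TDPS', 'TDPN', 'TDSS', 'BTPDN', 'BTPDS', ..., 'ETPAP']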

All words corresponding to a tag ending in "N" are numbers. Some of the time-zone strings are "et", "pt", "pdt", etc. Time duration suffixes are always strings (e.g. 'hours', 'weeks', 'days') and duration prefixes can be either strings or digits (e.g. 2, "two", "limited", "few"). The hour strings include words like "midnight" and "noon", and the day strings include "sunday", "saturday", "tomorrow", etc. Month strings mention month names in different formats, e.g. "jan", "january", "apr". Special days like "father's day" and "memorial day" are labeled as (TDPS, TDSS) time duration prefix-suffix tuples. Following are the labeled versions of the three sentences mentioned earlier. In cases where the begin and end times are not mentioned separately (the 2nd sentence), I have used the "E"(nd) labels, as the offer's validity ends on that day or at that point in time.

“Offer expires in 7_TDPN days._TDSS Exclusions apply. Discount taken at checkout. See disclaimer for details”

“Preferred Customer Private Sale — Thursday_ETPDS Feb._ETPMS 11_ETPDN at 2_ETPHN pm_ETPAP — To redeem use code TODAY at checkout.”

“Limited_TDPS Time_TDSS Sale Events Powered by HauteLook”
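
As a rough illustration of how such labeled sentences can be read back into (word, label) pairs, here is a small sketch. It is not the project's parser; it simply splits "word_LABEL" tokens on the last underscore and defaults everything else to an "NA" label:

    # Illustrative sketch: split "word_LABEL" tokens into (word, label) pairs.
    def parse_labeled_sentence(sentence):
        pairs = []
        for token in sentence.split():
            word, sep, label = token.rpartition("_")
            if sep and label.strip("<>").isupper():
                pairs.append((word, label.strip("<>")))
            else:
                pairs.append((token, "NA"))   # unlabeled word
        return pairs

    print(parse_labeled_sentence("Offer expires in 7_TDPN days._TDSS"))
    # [('Offer', 'NA'), ('expires', 'NA'), ('in', 'NA'), ('7', 'TDPN'), ('days.', 'TDSS')]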

After establishing the need for a structure/grammar definition and context specification, the next step is to pick a model that can model the sequence of words in a sentence and predict a label for each word. Rules can then be written against the labels, which are far fewer in number than the entire set of words in the English vocabulary (the vocabulary in the training data, to be precise). The simplest model we can use for a sequence of words is an HMM.

Supervised HMM Model —

A hidden Markov model (HMM) is a Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. Here the hidden sequence of states is made of labels from the grammar, which are considered unobserved (hidden), and each state emits a word (observed) with a certain emission probability. Each state or label in a given sequence depends on the "n" previous labels according to a transition probability distribution; the choice of "n" determines the order of the HMM. Fig. 1 shows the HMM states for the sentence "2 day sale". The list of labels I mentioned earlier did not contain words like "SALE", yet here the word "sale" is both an observed word and a label. This goes back to the idea of a context window described in the "Modeling sequences of words" section. The next two sections explain this idea further and introduce an innovative way to deal with unknown words.
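
For a supervised first-order HMM, training amounts to counting transitions between consecutive labels and emissions of words from labels over the labeled sentences, and then normalizing those counts. The sketch below is a simplified illustration under those assumptions, not the project's implementation:

    # Illustrative sketch of supervised first-order HMM parameter estimation.
    from collections import defaultdict

    def train_hmm(labeled_sentences):
        # labeled_sentences: a list of sentences, each a list of (word, label) pairs
        transition_counts = defaultdict(lambda: defaultdict(int))
        emission_counts = defaultdict(lambda: defaultdict(int))
        for pairs in labeled_sentences:
            prev = "<START>"
            for word, label in pairs:
                transition_counts[prev][label] += 1        # P_t numerator
                emission_counts[label][word.lower()] += 1  # P_e numerator
                prev = label

        def normalize(counts):
            probs = {}
            for key, row in counts.items():
                total = sum(row.values())
                probs[key] = {k: c / total for k, c in row.items()}
            return probs

        return normalize(transition_counts), normalize(emission_counts)

Sentences parsed with the earlier sketch could be fed in directly, for example train_hmm([parse_labeled_sentence(s) for s in labeled_sentences]).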

Context window and pseudo labels — To understand this concept better, let us look at the labeled sentence "This promotion expires in 10_TDPS days_TDSS offer valid only till 20_ETPDN th November_ETPMS but for our preferred customer the offer is valid till end of November_ETPMS". The phrases around the labeled words, like "expires in", "valid only till", and "end of", are indicators that the words that follow them might contain time-related information. A context window tries to identify such phrases and assign importance to them by making them labels themselves. For example, if the context window is 2, any word that is within 2 words before or after an actual labeled word also becomes a labeled word, and its label is the word itself. This step is done before learning transition and emission probabilities on the training data. The complete labeled sentence after applying a context window of size 2 looks like this:

“This_<NA> promotion_<NA> expires_<EXPIRES> in_<IN> 10_<TDPS> days_<TDSS> offer_<OFFER> valid_<VALID> only_<ONLY> till_<TILL> 20_<ETPDN> th_<TH> November_<ETPMS> but_<BUT> for_<FOR> our_<NA> preferred_<NA> customer_<NA> the_<NA> offer_<NA> is_<NA> valid_<NA> till_<NA> end_<END> of_<OF> November_<ETPMS>”

Notice that the word "valid" is labeled "<VALID>" in one instance and "<NA>" in the other, so a given word can have more than one label depending on the context it is used in, which in this case is defined by the size of the window. The label "<NA>" seems to appear out of nowhere; the next section is devoted entirely to this mysterious label and its importance in addressing the "problem of unknowns".
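
The relabeling itself is mechanical. The sketch below is my own reconstruction of the idea rather than the project's code: grammar labels are kept, words within the window around them become pseudo labels (the word itself), and everything else gets "<NA>". With a window of 2 it reproduces the <NA> and pseudo-label assignments shown in the example above.

    # Illustrative sketch: apply a context window to a grammar-labeled sentence.
    def apply_context_window(pairs, window=2):
        # pairs: [(word, label), ...] where unlabeled words carry "NA"
        grammar_positions = [i for i, (_, lab) in enumerate(pairs) if lab != "NA"]
        relabeled = []
        for i, (word, lab) in enumerate(pairs):
            if lab != "NA":
                relabeled.append((word, "<%s>" % lab))           # keep the grammar label
            elif any(abs(i - j) <= window for j in grammar_positions):
                relabeled.append((word, "<%s>" % word.upper()))  # pseudo label from the word
            else:
                relabeled.append((word, "<NA>"))                 # outside every context window
        return relabeled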

<NA> and the problem of unknowns — An important consideration when using models like HMMs to model a sequence is to make sure they generalize well. By generalization, I mean scenarios where a sentence with unseen words is given to the model and we have to predict the labels for that unknown sequence. This has generally been a hard problem to solve, and different methods have been proposed, such as using a maximum entropy (MaxEnt) model or grouping words into categories based on certain similarities.

Here I have used a new method to address the problem of predicting labels for unknown or unseen words. The inspiration is drawn from the fact that words outside the context window (as defined above) normally do not carry much information about the actual timestamp embedded in a sentence. For example, in the sentence mentioned earlier, words like "this", "promotion", and "customer" do not contribute to identifying the coupon's expiry time. We can take such words (which are outside the context window) and mark them as "unknowns", which I do by assigning them the label "<NA>". The unknown emission probability is obtained by add-1 (Laplace) smoothing on (label, word) pairs where the label is <NA>: the emission probability of an unknown word is P_e(<UNK>|<NA>) = 1/(N + C), where N is the total number of times the (<NA>, word) combination is encountered in the training data and C is the number of unique words the "<NA>" label is assigned to. The probability of seeing an unknown word for any label is then replaced with the emission probability of the (<NA>, <UNK>) label-word pair.
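
In code, this smoothed value is a single number computed from the <NA> counts. A minimal sketch, assuming a dict of word counts for the <NA> label:

    # Illustrative sketch of the add-1 smoothed unknown-word emission probability.
    def unknown_emission_prob(na_word_counts):
        # na_word_counts: {word: count} for every word labeled <NA> in training
        N = sum(na_word_counts.values())   # total (<NA>, word) occurrences
        C = len(na_word_counts)            # number of unique words labeled <NA>
        return 1.0 / (N + C)               # P_e(<UNK>|<NA>)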

I use Viterbi decoding to infer the label sequence from the observed word sequence, and it is essential to define the unknown-word emission probabilities before performing the inference task. If the emission probability is zero because we never saw the word in the training vocabulary, it is not possible to decode the sentence, as all transition paths between states will carry a "zero" value. This means it would be impossible to decode sentences that contain words which appear only in the test data and not in the training set.
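
For reference, here is a compact first-order Viterbi sketch that uses probability tables of the shape produced by the training sketch above, with the P_e(<UNK>|<NA>) fallback for words outside the training vocabulary. It is illustrative only and does not mirror HMModel.py:

    # Illustrative first-order Viterbi decoder with an unknown-word fallback.
    def viterbi(words, labels, trans_probs, emit_probs, p_unk, vocab):
        def emit(lab, word):
            if word not in vocab:                        # unseen word: use P_e(<UNK>|<NA>)
                return p_unk
            return emit_probs.get(lab, {}).get(word, 0.0)

        # best[i][lab] = (score of best path ending in lab at position i, backpointer)
        best = [dict() for _ in words]
        for lab in labels:
            start_p = trans_probs.get("<START>", {}).get(lab, 0.0)
            best[0][lab] = (start_p * emit(lab, words[0]), None)
        for i in range(1, len(words)):
            for lab in labels:
                e = emit(lab, words[i])
                score, prev = max(
                    ((best[i - 1][pl][0] * trans_probs.get(pl, {}).get(lab, 0.0) * e, pl)
                     for pl in labels),
                    key=lambda t: t[0],
                )
                best[i][lab] = (score, prev)

        # backtrack from the highest-scoring final state
        lab = max(best[-1], key=lambda l: best[-1][l][0])
        path = [lab]
        for i in range(len(words) - 1, 0, -1):
            lab = best[i][lab][1]
            path.append(lab)
        return list(reversed(path))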

Using a randomly sampled (with replacement) 80% of the data as training data (around 960 sentences) and the remaining 20% as test data (around 240 sentences), I calculated the average metrics on the test data using 3-fold cross validation. The following table shows the average labeling accuracy on the true labels (labels from the grammar) for different context window sizes.

Pay attention to the "% Decoding Failed" column: it gives the percentage of sentences in the test data for which the model could not infer a label sequence based on the available training data. Decoding of a sentence fails in the following scenarios:

  • As mentioned earlier, when a word is unknown and the emission probability P_e(unknown_word|label) = 0 for all labels in the training data, all paths in the trellis go to zero. We have taken care of this scenario by using the P_e(<UNK>|<NA>) value as the emission probability.
  • Let's say we have one path with non-zero weights up to the point where we see word w_t. Let the index of w_t be 3, which means it is the 3rd word in the sentence. Let's also assume, for a 1st-order HMM, that the non-zero-weight path in the trellis so far comprises the sequence ((label1, word1), (label2, word2)). If the transition probability P_t(label3|label2) != 0 and is zero for all other transitions from label2, i.e. P_t(label_i|label2) = 0 for all i != 3, and P_e(w_t|label3) = 0, then the weight of the path becomes zero before reaching the last word in the sentence. In general, for more than one path: if every path that carries a non-zero weight in the trellis before seeing w_t can only transition, with non-zero probability, to a state (label) with zero emission probability for that word (i.e. P_e(w_t|label) = 0), then decoding fails, as the weight of all paths drops to zero before reaching the last word.

Two rows in the table above have % decoding failed at (or close to) zero. This is achieved in the following situations:

  • Zero context window — When the context window size is zero, all the words that carry a label from the grammar keep it, and every other word gets the label <NA>. This happens because the label of a given word does not depend on its neighboring words (or labels). In the sentence "sale_NA ends_NA in_NA 2_TDPN days_TDSS", only "2 days" is labeled and the rest of the words get "NA" as the default label. Given that there is a very limited set of labels in this case, the emission probability of a known word from one of these labels is unlikely to be zero. As before, emission probabilities for unknown words are obtained using P_e(<UNK>|<NA>). When the context window is zero, the average label prediction accuracy is quite low, which is expected, as context information is not taken into account here.
  • Relying on transition probabilities when the emission probability is zero — Note the row marked IgnoreVocab in the table. The HMM order and context-window size are the same as in row 3. The only difference is that whenever a known word (say w_t) is encountered and its emission probability is zero for a label, that probability is replaced with P_e(<UNK>|<NA>). This means there is no longer any possibility that all paths in the trellis drop to zero weight before reaching the last word: if there are only n labels (states) we can transition to and each of them has the same emission probability for the word (namely P_e(<UNK>|<NA>)), then the transition to subsequent states depends purely on the transition probabilities. This ensures there are no (label, word) tuples with zero emission probability. However, this method is not entirely reliable. For example, suppose there are two states we can transition to, label1 and label2, with P_e(w_t|label2) = 0 and P_e(w_t|label1) != 0; we then replace P_e(w_t|label2) with P_e(<UNK>|<NA>). If P_e(<UNK>|<NA>) > P_e(w_t|label1), there is a chance that during the inference step the decoder will pick label2 (whose emission probability was substituted) over label1 as the target label for the word. The expectation is therefore that P_e(<UNK>|<NA>) is very small and only breaks the tie when all paths are heading to zero weight because of a zero emission probability. The run time also increases significantly, from 80 seconds (HMM order = 1 and window size = 2) to 2300 seconds for decoding 240 sentences, because no path can be eliminated by a zero emission probability during the decoding or inference step. A small sketch of this fallback follows this list.
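
Here is that sketch (illustrative only, not the HMModel.py implementation): for a known word, a zero emission probability is simply replaced with P_e(<UNK>|<NA>) so no trellis path can collapse to zero.

    # Illustrative "IgnoreVocab" emission lookup: never return zero.
    def emit_ignore_vocab(label, word, emit_probs, p_unk):
        prob = emit_probs.get(label, {}).get(word, 0.0)
        return prob if prob > 0.0 else p_unk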

Another extreme case is a context window of -1, where there is no <NA> label and every word has a distinct label (either a label from the grammar, or the word itself). In that case the accuracy is high for the sentences that were decoded correctly, but almost 50% of the sentences could not be decoded at all: a classic example of overfitting. A window size of 2 with order 1 seems to perform best in terms of average label accuracy and average end time predictions. Average begin time prediction accuracy is almost always high because of the way the translator is designed: it starts with the reference date and iteratively tries to figure out the best begin date from the sentence (see the next section, Translation layer).

Translation layer

Once the labels are obtained using the decoding technique described above, the next step is to parse these labels and figure out the actual timestamp. This translation step is mainly rule-driven, and the rules are compiled against a limited set of labels rather than the entire set of words in the training vocabulary. A few things to keep in mind when designing a translation layer for promotions data:

  • Promotions are always forward looking — Given a reference date when a promotion was sent, it is highly unlikely that the promotion has already expired. This means references like "weekday", "weekend", "thursday", etc. are used in a current or future frame of reference.
  • A promotion can contain multiple dates embedded in it; for example, in the sentence "This promotion expires in 10_TDPS days_TDSS offer valid only till 20_ETPDN th November_ETPMS but for our preferred customer the offer is valid till end of November_ETPMS", the promotion is valid till 30th November for preferred customers and till the 20th for others. In the translation layer I try to extract the largest range, i.e. from the earliest start time to the farthest end time.
  • I assume certain standard translations for abstract phrases like ('limited', 'time') and ('few', 'days'), which are duration tuples (TDPS, TDSS). The standard translations, for example, map ('limited', 'time') to ('2', 'days') and ('last', 'chance') to ('1', 'day'). A hypothetical sketch of these maps follows this list.
  • Another form of duration tuple is special calendar days like (mothers', day), (memorial, day), and (black, friday). For these days a standard calendar service can be used to figure out the exact date.
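
The following is a hypothetical sketch of the two kinds of maps described in the last two items. The actual timeDurationOffsetAliases map in the translator may differ, and the ('few', 'days') value below is my own guess:

    # Hypothetical alias maps; only the values quoted in the text are from the post.
    duration_aliases = {
        ("limited", "time"): (2, "days"),
        ("last", "chance"): (1, "day"),
        ("few", "days"): (3, "days"),      # assumed value
    }

    special_days_2016 = {                  # ideally served by a calendar service
        ("black", "friday"): "2016-11-25",
        ("memorial", "day"): "2016-05-30",
    }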

The following 2 graphs (for HMM order 1, context window size 2) compare the predicted and actual start and end times, sorted by actual start time and time to expiry. Points where the predicted and actual times do not match show up as triangles departing from the solid line. The time on the y-axis is Unix epoch time.

It is not difficult to see that when the predicted labels are incorrect, the translation layer cannot make a good estimate of the start and end times of a promotion. However, given the different time signatures in a sentence, the hope is that even if only a subset of the labels is correctly predicted, we can still get an approximate estimate of the embedded timestamp.

Extending to general use-cases — The framework I have built for estimating the validity period of a promotion can be broken down into three main steps:

  • Parse — Process sentences and convert them into a standard form so they can be passed to the model.
  • Predict Labels — Choose an appropriate context window to incorporate context information and to take care of unknown words and labels. In this step we use a sequence model to predict the labels in a sentence.
  • Translate — This step involves applying rules on the predicted labels so that we can estimate the start and end time of a promotion which determines its validity.

These steps can in general be applied to sentences that are not necessarily from promotions data. For example, in chat bot conversations it is crucial to figure out the timestamp as accurately as possible when a user asks for something to be done at a certain time.

Consider the following 4 sentences, where the first column shows the reference time, i.e. the point in time when the sentence was spoken, and the tuple output shows the begin and end times. Looking at the predicted end times, it is clear that the model trained on promotions data can be used to predict the timestamp for a sentence outside the promotions domain. In the second example, the begin time is after the end time; however, the end time accurately indicates when the event took place. This is also an example of a sentence where we do not necessarily have an accurate begin time for an event.

  • 11/17/16 14:00, The class starts in 2 hours — (11/17/2016 14:0:0, 11/17/2016 16:0:0)
  • 11/17/16 14:00, Midterm was held on 16th November at 5 pm — (11/17/2016 14:0:0, 11/16/2016 17:0:0)
  • 11/17/16 14:00,The deadline to submit the writeup is tomorrow — (11/17/2016 14:0:0, 11/18/2016 23:59:59)
  • 11/17/16 14:00, It may rain on the weekend — (11/17/2016 14:0:0, 11/20/2016 23:59:59)

As mentioned earlier, the translator implementation also has a bearing on the accuracy of the end time and start time predictions. The method employed, or the rules designed to parse the predicted labels, determines how the start and end times are calculated.

Code

I have provided an initial implementation of the parser, HMM model, and the translator here: Abstract Time Parser

  • Main.py contains the code to invoke the parser, which internally calls the model to predict the labels for the words in a sentence. The translator class "BeginEndTime" is also invoked in the main function.
  • Training data — labeledsentences.csv is available under the data folder. The ref time, begin time, end time, and labeled sentence columns are self-explanatory. Ignore the comments column and the ETU column: the former indicates some special cases and ETU means End Time Unknown. Wherever the end time is unknown I have plugged in an educated guess in its place.
  • The General Sentences file contains a very small set of example sentences that are not in a promotions context.
  • Unlabeled Sentences are available here — this file serves as an example illustrating the format the input file needs to be in for the code to parse it correctly.
  • The results folder contains sample output/results files; the description of each output file is in the Main.py code documentation.

Some Assumptions I have made in the code and scope for improvement

  • Since the promotions data I have is from online promotions sent during 2015 and 2016, I have a check to see if the predicted year falls in this range.
  • Given the nature of the training data, the date format followed is MM/DD/YY; however, this might not always be true depending on the region the promotion is sent to. I have also assumed a uniform timezone.
  • I have made an assumption about the duration of abstract phrases like ('limited', 'time'). These are available in the timeDurationOffsetAliases map specified in the translator implementation. Instead of setting each to one number, it could be replaced with a duration distribution per phrase, obtained by analyzing the validity of promotions from different vendors wherever these phrases are used.
  • Similarly, a special map is maintained for important calendar dates like holidays and festivals. For now this is hard-coded to 2016 dates in the translator file. This can easily be replaced by a generic holiday/special-calendar service that returns a valid date for different years.
  • The translator code first sets the begin time to the reference time and then, based on each predicted label, tries to figure out the month, year, hour, and day values by applying the defined rules to each word separately. This can end up creating a date where the combination of values is wrong. For example, if a sentence mentions two dates, "2 Feb and 30 March", and the predicted labels mark 30 as the day the promotion begins with Feb as the corresponding month, and 2 as the promotion end date, then the translator will output 30th Feb as the begin date, which is not a valid date. There are validation checks in the code to make sure the returned date is valid (a tiny validation sketch follows this list), but the code can definitely be improved to return multiple candidate begin and end dates and choose the best among them, which is more of a parallel approach than a cascaded one.
  • As mentioned earlier, promotions are 'forward looking'. If one wants to implement a parser/labeler/translator for abstract time references in the past, e.g. "last week" or "2 weeks before", this code as it is won't work for that use-case. I believe it should not be very difficult to implement either: with some additional labels in the grammar and changes to the translation logic, one can achieve it with the current setup.
  • In the Viterbi decoding step, when there are two previous states that are equally good to reach the current state from, the code in HMModel.py just picks one of the two randomly. This can definitely be changed to preserve multiple best previous states at each step.
  • Of course, picking the right names and formats for functions is not my core strength, so excuse me if the function and class names are bizarrely named.
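
The date-validity check mentioned in the translator item above can be as simple as letting the standard library reject impossible combinations. A tiny illustrative sketch:

    # Illustrative date validity check: rejects combinations like 30th Feb.
    from datetime import datetime

    def is_valid_date(year, month, day):
        try:
            datetime(year, month, day)
            return True
        except ValueError:
            return False

    print(is_valid_date(2016, 2, 30))   # False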

Further Improvements — I will end this blog by discussing possible improvements to the approach presented here for building what I call an "abstract time parser".

  • Repetitions — There is a mysterious label "TPRD" that you will find in the predicted labels (if you look closely at the training data). The intention was to capture repeated events, like every day/week/hour. The labeling part does put in some effort to identify such sequences; however, I have not implemented the code to translate them (I will leave that for another day).
  • Since we are talking about models for sequences, other fairly advanced models to look at are CRFs (Conditional Random Fields) and RNNs (Recurrent Neural Networks), which consider the full sentence instead of just the preceding words to determine the label of an individual word.
  • In general the model does a decent job on simple sentences, but bear in mind that the approach discussed here only cares about one begin time and one end time value. If there are multiple dates or timestamps embedded in a sentence and one wants to extract them all, see the Code section above where I have discussed how such improvements could be implemented.

Acknowledgements

Special thanks to Professor Kyunghyun Cho for reviewing the post and for providing valuable feedback and mentorship on the overall project I am working on: applying natural language processing, machine learning and deep learning techniques to analyze online marketing data and provide on-demand, intelligent and relevant promotions to the end user while improving the overall conversion rate for the seller.

