Predicting the likelihood of a customer making repeat purchases using logistic regression

Written by emmanuels | Published 2019/04/04
Tech Story Tags: machine-learning | hackernoon-top-story | exploratory-data-analysis | logistic-regression | data-visualization

For a business, getting a customer is exciting, not only because it helps you ‘secure the bag’ by bringing in much-needed revenue, but also because it creates an opportunity to build loyalty with this newfound customer, which in turn could help you ‘secure more bags’ through repeat purchases.

Safe Bag

Assuming your business continues driving and optimising the action(s) that have the biggest positive impact on creating repeat purchases, you create more organic brand ambassadors who will refer friends to your business at no extra cost to you. Additionally, depending on the sizes of the purchases made by these repeat customers, this will increase average customer lifetime value. All else being constant, the higher the average lifetime value, the more wiggle room you have to spend on acquiring new customers through existing and/or new channels. You can then use this to coax your investors into giving you more money to move into a cooler office space, give your employees catered lunches and buy a bigger pool table to ‘attract new talent’.

As a starting point, you need to understand which factors within the customer journey matter most in increasing the likelihood of a customer making a repeat purchase and then: i.) work to optimise those factors and ii.) perhaps start testing personalised email marketing to customers who are more likely to become repeat customers to ensure that they buy from your business again - this may of course include understanding the types of products they are more inclined to be interested in.

With this context in mind, I decided to analyse a Kaggle dataset on a Brazilian e-commerce platform, Olist, with an exploratory data analysis section to understand more about the data itself, user behaviour and potentially valuable trends, and a machine learning section that uses a classification algorithm to predict the likelihood of a customer making repeat purchases based on their past behaviour on the platform.

Exploratory Data Analysis

With the EDA process I focused on answering questions around: i.) how efficient each marketing channel has been for Olist, looking at the conversion rates for each channel and the average length of time it takes each channel to convert users from first contact to ‘closing’ the lead, ii.) the median value of customers originating from each channel, to get more clarity on how much money should be allocated to acquiring customers per channel, iii.) the median time delta between the purchase date and the date on which the order is delivered to the customer, per state, and iv.) the time delta between the actual delivery time and the estimated delivery time.

import pandas as pd
import numpy as np
import seaborn as sns
from functools import reduce

#closed_deals, mark_qualified, seller_data and order_list are the Olist CSVs loaded beforehand
merged_list = pd.merge(closed_deals,mark_qualified,on='mql_id',how='outer').fillna(0)
frames = [merged_list,seller_data,order_list]
seller_merged = reduce(lambda left, right: pd.merge(left,right,on='seller_id',how='outer'),frames).fillna(0)

#calculate difference between won date and first contact date
seller_merged['time_to_close'] = pd.to_datetime(seller_merged['won_date']) - pd.to_datetime(seller_merged['first_contact_date'])

#extract days from time to close
seller_merged['days_to_close'] = seller_merged['time_to_close'].apply(lambda x: x.days)

Since the dataset is broken up into several tables on Kaggle, the first step I took was merging the tables, opting for a full outer join. As in SQL, a full outer join keeps all rows from both tables, matching them where the key values align. Merging this and the other tables required an understanding of the data schema laid out on Kaggle.

Data Schema

Each double-headed arrow between two tables indicates the common key connecting them; this is the key I used to merge each pair of tables and build up the full dataset.

After joining the relevant tables and calculating the difference between the won date (the date the lead was closed) and the date of first contact, I created a lambda function to extract the number of days from this time delta. When it comes to measuring the median time it took to close a marketing lead, using days as the measure seemed more appropriate given the timelines involved in this dataset.

#closing success rate by point of origin
seller_merged['closed'] = np.where(seller_merged['won_date']!=0, "True","False")

#percentage of deals closed/not closed by point of origin
closed_or_not = seller_merged.groupby(['origin', 'closed']).agg({'closed': 'size'})
ax = closed_or_not.groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).unstack().plot.bar(stacked=True)

#move the legend outside the plot area and label each bar segment with its percentage
ax.legend(bbox_to_anchor=(1, 1))
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    ax.annotate('{:.0f} %'.format(height), (p.get_x()+.15*width, p.get_y()+.4*height))

To visualise the rate at which leads coming through each channel convert into paying customers, I created a column that returns either True or False depending on whether or not the won date contained a 0. This was followed by a lambda function dividing the counts of closed and unclosed leads by the total number of leads that came through each channel. I expressed those numbers as percentages by multiplying each one by a hundred and plotted them on a 100% stacked bar plot.

Visualising the efficiency of each marketing channel

From the visualisation it appears that leads that come in through paid channels are closed at a higher rate than those from organic channels. This is presumably because there is stronger intent to purchase if a user types in an associated keyword, sees an ad and actually clicks it. We cannot, however, simply attribute these conversions to paid search, for example, because it may be the case that other marketing channels created awareness and paid search simply drove the last converting click.

A dog jaded from seeing banner ads

Display seems to be amongst the worst performers but, as indicated in the previous paragraph, channels such as display may drive initial awareness which then leads to the customer clicking a paid search ad at a later stage. It may be more appropriate to measure display with other metrics, for example brand recall.

Visualising the median time it takes to close a lead

#length of time to close deals by origin (closed leads only)
closing_deals = seller_merged.query('closed == "True"')

#median days to close per channel, used to label the violin plot
medians = closing_deals.groupby('origin')['days_to_close'].median().values
median_labels = [f'{m} days' for m in medians]
pos = range(len(medians))

ax = sns.violinplot(x='origin',y='days_to_close',data=closing_deals)
for tick,label in zip(pos,ax.get_xticklabels()):
    ax.text(pos[tick],medians[tick]+0.5,median_labels[tick],
            horizontalalignment='center',size='xx-large',color='b',weight='semibold')
sns.set(rc={'figure.figsize':(20,15)})

The Seaborn library offers a violin plot that is very intuitive in helping visualise not only the median time deltas but also the full range of time deltas for closing a lead. I assumed that any negative values were incorrectly recorded and filtered these from my visualisation, leaving only values where the time delta was equal to or greater than zero. I assumed that doing this would also filter out leads that did not have a ‘won date’ and therefore had not been closed or converted to a paying client.

Time to close each lead

Interestingly, the visualisation shows that the median time it takes to close a paid search lead is 7 days, one of the lowest of all the channels, while social and email sit at a staggering 42 and 35 days respectively. It should be noted that 0 and unknown both represent unknown channels and were only included in these visualisations to fully represent the dataset as it is.

Median customer value by channel

The next part of the EDA process involved visualising the median lifetime value of customers per channel with the end goal of trying to understand which marketing channel produces higher value customers.

#average value by origin
geolocation = pd.read_csv(r"/Users/emmanuelsibanda/Downloads/olist_geolocation_dataset.csv")
customer_db = pd.read_csv(r"/Users/emmanuelsibanda/Downloads/olist_customers_dataset.csv")
orders_db = pd.read_csv(r"/Users/emmanuelsibanda/Downloads/olist_orders_dataset.csv")
#merge on zip code
geolocation = geolocation.rename(columns={'geolocation_zip_code_prefix':'zip_code'})
customer_db = customer_db.rename(columns={'customer_zip_code_prefix':'zip_code'})
seller_merged = seller_merged.rename(columns={'seller_zip_code_prefix':'zip_code'})
#crashing my ram
seller_customer = pd.merge(seller_merged,customer_db,on='zip_code',how='outer').fillna(0)

In order to be able to carry out this analysis, I needed to merge my existing table with the customer database that contained customer ids.

#unique identifiers for duplicate zero values
seller_customer['customer_id'] =seller_customer['customer_id'].mask(seller_customer['customer_id'] == 0 , seller_customer['customer_id'].eq(0).cumsum())

#currency conversions
seller_customer['usd_price'] = seller_customer['price'].apply(lambda x: x*0.26)
seller_db = pd.DataFrame(seller_customer.groupby(['origin','customer_id'])['usd_price'].sum()).reset_index()
ax_chart = seller_db.groupby('origin')['usd_price'].mean().plot.bar()
ax_chart.set_xlabel("Point of origin")
ax_chart.set_ylabel("Mean Customer Value")
for x in ax_chart.patches:
  ax_chart.text(x.get_x()+.04,x.get_height()+10,\
               f'${round(x.get_height(),2)}',fontsize=11,color='black',
               rotation=45)

One problem with the dataset was that a lot of customer_ids were recorded as zero, even though these were presumably different users. Simply using a groupby without rectifying this error would count all the customer ids of zero as one user, pushing up average customer lifetime values. To rectify this, I made the assumption that every customer id recorded as zero was a unique user and added unique identifiers to differentiate between each zero, adding n to each zero, where n is the numerical order in which the zero appears, e.g. the first zero becomes 0+1, the second zero 0+2 and so on.

For ease of understanding, I converted the transaction price of each purchase from Brazilian Real to US Dollars. At the time of the analysis the rate stood at approximately 0.26, with 1 Brazilian Real equivalent to about 26 US cents. This was done for ease of reading, as the numbers otherwise seemed a lot higher than they actually were.

So much money

I then returned the mean total amount spent by customers per channel, or the mean lifetime customer value per channel.

While means are not robust to outliers, returning the median in this context didn’t seem appropriate since a majority of leads did not result in conversions and would therefore have values of zero- giving a median of zero for many of the channels. From looking at the chart presented, paid search appears to be returning a significant mean customer lifetime value- about 401% more than the mean value of customers coming from social channels. It should be noted however, that the unknown variable in this analysis is customer acquisition costs. It may happen that it costs significantly more to acquire customers through paid search. The real value is the delta between mean customer lifetime value and mean customer acquisition costs per channel- the larger the delta the more effective the channel.

The time it takes to complete an order

#time to deliver orders
seller_customer = pd.merge(seller_customer,orders_db,on='customer_id',how='outer').fillna(0)

In order to look for any bottlenecks in the logistics process (from purchase to delivery) I had to start off by carrying out an outer merge with the orders table.

#convert the relevant date columns to datetime
def reformat_date(df, column):
    df[column] = pd.to_datetime(df[column])

date_columns = ['order_purchase_timestamp','order_approved_at','order_delivered_carrier_date',
                'order_delivered_customer_date','order_estimated_delivery_date']
for column in date_columns:
    reformat_date(seller_customer, column)

#calculating time deltas in whole hours for each phase of the purchase-to-delivery cycle
seller_customer['purchase_to_approval'] = (seller_customer['order_approved_at']-seller_customer['order_purchase_timestamp']).apply(lambda x: x.total_seconds()//3600)
seller_customer['time_to_completion'] = (seller_customer['order_delivered_customer_date']-seller_customer['order_purchase_timestamp']).apply(lambda x: x.total_seconds()//3600)
seller_customer['estimated_actual'] = (seller_customer['order_estimated_delivery_date']-seller_customer['order_delivered_customer_date']).apply(lambda x: x.total_seconds()//3600)

#keep only rows with positive time deltas
seller_customer2 = seller_customer[((seller_customer.purchase_to_approval > 0) & (seller_customer.time_to_completion > 0))]

Since I needed to analyse delivery times, I created a function converting the relevant date columns to a datetime format and proceeded to calculate the time delta for each of the phases in the purchase-to-delivery cycle, including the time delta between the estimated time of delivery and the actual time of delivery. I filtered out rows where the time to completion was below zero, making the assumption that these records had been incorrectly captured.

Presumably the delivery times would differ from state to state. In order to capture these differences I grouped the dataset by state and returned both the median times and the range of delivery times per state, as sketched below. I again opted for the violin plot from the seaborn library to capture the full range of delivery times.
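The snippet for this per-state grouping is not included above, but it can be reproduced along the following lines. This is only a minimal sketch: it assumes the customer_state column from olist_customers_dataset survived the earlier merges and reuses the filtered seller_customer2 frame.

#sketch: distribution of purchase-to-delivery times per state
#assumes customer_state came across from the customers table during the merges
state_deliveries = seller_customer2.copy()

#time_to_completion was computed in hours, so convert to days for readability
state_deliveries['days_to_completion'] = state_deliveries['time_to_completion'] / 24

ax = sns.violinplot(x='customer_state', y='days_to_completion', data=state_deliveries)
ax.set_xlabel("Customer state")
ax.set_ylabel("Days from purchase to delivery")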

Delivery times per state

I then followed this up with a visualisation of the time delta between estimated delivery times and actual delivery times, the assumption being that constantly underestimating delivery times could have an adverse effect on customer satisfaction.
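The original code for this comparison isn't shown either; a comparable chart could be produced with a sketch like the one below, reusing the state_deliveries frame from the previous sketch (the estimated_actual column was computed earlier as the estimated delivery date minus the actual delivery date, in hours).

#sketch: gap between estimated and actual delivery dates per state, in days
state_deliveries['estimated_actual_days'] = state_deliveries['estimated_actual'] / 24
ax = sns.boxplot(x='customer_state', y='estimated_actual_days', data=state_deliveries)
ax.set_xlabel("Customer state")
ax.set_ylabel("Estimated minus actual delivery time (days)")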

Estimated time v actual delivery time

As evidenced in the boxplots, there is generally a substantial time delta between the estimated delivery date and the actual delivery date. This can range from an underestimation of the actual delivery time of 9 days to about 23 days.

Predicting repeat customers

With this part of the analysis I made the assumption that predicting repeat customers based on observations made from their first purchases would reveal trends that could potentially help identify customers who are more likely to make repeat purchases. This could help me build a model that I could use in the future to predict whether or not a new customer is likely to make a repeat purchase and tailor my marketing efforts to the customer’s profile.

#column of number of purchases made
seller_customer['no_purchases'] = seller_customer.groupby('customer_id')['customer_id'].transform('count')

#create column for repeat and non repeat purchases
seller_customer['repeat'] = seller_customer['no_purchases'].apply(lambda x: 1 if x > 1 else 0)

In preparation for this classification task, I started off by flagging customers who have made repeat purchases with 1 (True) and non-repeat customers with 0 (False).

Categorical to numerical

#categorical to numerical
seller_customer['customer_city'] = seller_customer['customer_city'].astype('category')
seller_customer['customer_city'] = seller_customer['customer_city'].cat.codes

#origin to numerical
seller_customer['origin'] = seller_customer['origin'].astype('category')
seller_customer['origin'] = seller_customer['origin'].cat.codes

#lead behaviour profile
seller_customer['lead_type']=seller_customer['lead_type'].astype('category')
seller_customer['lead_type']=seller_customer['lead_type'].cat.codes

#removing purchase dates with value 0
seller_customer_clean = seller_customer[seller_customer['order_purchase_timestamp'] != 0]

#sort orders by purchase date
seller_customer_sorted = seller_customer_clean.sort_values(by='order_purchase_timestamp')

#return first purchase made by each customer
seller_customer_sorted.drop_duplicates(subset ="customer_id",keep = 'first', inplace = True)

Since I will be using logistic regression for this classification task, I needed to convert the categorical variables I would use as features into numerical variables. I then removed all orders with a purchase date of zero, as no valid date can be recorded as zero. Since I want to make observations based on the first purchase made by each customer, I sorted the orders by purchase date and used the drop_duplicates function to keep only the first order made by each customer.

Correlation Matrix

In order to identify potential features for predicting whether or not a customer would make repeat purchases, I returned a correlation matrix showing the correlations between all the numerical variables in my dataset, along the lines of the sketch below.
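The snippet for the matrix itself isn't in the original code; a minimal sketch using seaborn's heatmap (the exact styling of the article's chart may differ) could look like this:

#sketch: correlation matrix of the numerical columns, rendered as a heatmap
import matplotlib.pyplot as plt

corr = seller_customer_sorted.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation matrix of numerical variables')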

Correlation Matrix

The idea behind returning the correlation matrix for all numerical variables is to get a better idea of the potential features by understanding the associations existing in the dataset. This also plays the additional role of mitigating multicollinearity by revealing where dependencies lie within the dataset. This is especially useful since I am planning on using a linear model (transformed by the logistic function) for this task.

What stands out from the correlation matrix is the strong association between review score, freight value and the likelihood of being a repeat customer. These attributes appear to offer the strongest predictive power for my logistic regression model. However, because of the seemingly strong dependency between some of the variables, I was selective about which of them to include as features.

Building the logistic regression model

Although regression algorithms are typically best suited for predicting continuous variables, logistic regression returns the probability of a binary event occurring. While we can technically use a linear regression algorithm for the same task, the problem is that with linear regression you fit a straight ‘best fit’ line through your sample. Using this for classification problems based on the probability of an event occurring could give you results where the probability is below 0 or over 1 (best explained in the Stanford Machine Learning specialisation on Coursera).

1 represents the maximum value of the S-curve and t denotes the steepness of the curve

With logistic regression we use the sigmoid function to convert the straight line into an S-shaped curve that does not go below 0 or above 1, returning probabilities between these two bounds.

Logistic function s-curve
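As an aside, the S-curve comes from the sigmoid function itself; a minimal sketch (not part of the original analysis) that defines and plots it:

#sketch: the logistic (sigmoid) function squashes any real number into (0, 1)
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.xlabel('z (linear combination of the features)')
plt.ylabel('sigmoid(z)')
plt.title('Logistic function S-curve')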

In the context of my dataset, the logistic regression model returns the probability of a customer being a repeat customer; if that probability exceeds 0.5, the customer is classified as a repeat customer. I chose this algorithm because of its ease of interpretability and because it seemed well suited to the task.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

#defining features and target
X = seller_customer[['origin','review_score']].values
y = seller_customer['repeat'].values

#hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logisticModel = LogisticRegression()

#training the logistic regression model
logisticModel.fit(X_train,y_train)

#predicting with the model
predicts = logisticModel.predict(X_test)

#test model quality: per-class metrics and overall accuracy
print(classification_report(y_test,predicts))
print("Accuracy:", accuracy_score(y_test, predicts))

Classification report and accuracy score
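As mentioned above, the 0.5 cut-off is what predict applies under the hood for a binary problem; a quick sanity check on the fitted model, reusing the variables from the block above, could look like this:

#sketch: the predicted class corresponds to thresholding the probability of the
#positive (repeat) class at 0.5
probabilities = logisticModel.predict_proba(X_test)[:, 1]
manual_predicts = (probabilities > 0.5).astype(int)
print((manual_predicts == predicts).all())  #should print True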

The classification report in my code displays several measures for each class: precision, recall, f1-score and support. Taken together, these are used to assess the quality of the predictions made by a classification algorithm. Precision shows, for each class, what percentage of the predictions made for that class turned out to be correct. Recall shows what percentage of the actual instances of each class the classifier managed to find. The f1-score is the harmonic mean of precision and recall, and support is the number of true instances of each class in the test set. Translating this into intuitive results, the model can predict with about 91.05% accuracy whether or not a customer is likely to be a repeat customer, based on the assumptions made through my analysis (including the assumptions defining what a repeat customer is). The model is better at predicting that a person will not be a repeat customer, with the precision of those predictions at about 95%. It may be possible to generalise this model to new customers interacting with the brand to predict their likelihood of making repeat purchases, with the prediction largely driven by the customer's review score and the freight value paid for the product.
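For readers who want to trace where the precision and recall figures come from, the underlying confusion matrix can be printed directly; a short sketch reusing y_test and predicts:

#sketch: the confusion matrix behind the precision and recall figures
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predicts)
print(cm)
#rows are the true classes (0 = non-repeat, 1 = repeat), columns the predicted
#classes; precision for a class reads down its column, recall across its row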

Thus concludes my berceuse.

Feel free to reach out with any criticism, feedback, advice and tickets to Avengers Endgame on my Twitter @Emmoemm

