/ #Clustering #t-SNE 

Segmenting with Mixed Type Data - A Case Study Using K-Medoids on Subscription Data

With the new year, I started to look for new employment opportunities and even managed to land a handful of final stage interviews before it all grounded to a halt following the corona-virus pandemic. Invariably, as part of the selection process I was asked to analyse a set of data and compile a number of data driven-recommendations to present in my final meeting.

In this post I retrace the steps I took for one of the take home analysis I was tasked with and revisit clustering, one of my favourite analytic methods. Only this time the set up is a lot closer to a real-world situation in that the data I had to analyse came with a mix of categorical and numerical feature. Simply put, this could not be tackled with a bog-standard K-means algorithm as it’s based on pairwise Euclidean distances and has no direct application to categorical data.

The data

library(tidyverse)
library(readxl)
library(skimr)
library(knitr) 
library(janitor)
library(cluster)
library(Rtsne)

The data represents all online acquisitions in February 2018 and their subscription status 3 months later (9th June 2018) for a Fictional News Aggregator Subscription business. It contains a number of parameters describing each account, like account creation date, campaign attributed to acquisition, payment method and length of product trial. A full description of the variables can be found in the Appendix

I got hold of this dataset in the course of a recruitment selection process as I was asked to carry out an analysis and present results and recommendations that could helps improve their products sign up in my final meeting. Although fictitious in nature, I thought it best to further anonimise the dataset by changing names and values of most of the variables as well as removing several features that were of no use to the analysis

The data was raw and required some cleansing and manipulation before it could be used for analysis. You can find the anonimised dataset on my GitHub profile, along with the scripts I used to cleanse the data and perform the analysis

Data Exploration

This is a vital part of any data analysis study, often overlooked in favour of the fancier and “sexier” modelling/playing with the algos. In fact, I find it one of the most creative phases of any project, where you “join the dots” among the different variables and start formulating interesting hypothesis for you to test later on

data_clean <- readRDS("../00_data/data_clean.rds")

Here I’m loading the cleaned data before I explore a selection of the features, run through a number of considerations to set up the analysis as well as show you some of the interesting insight I gathered. You can find the post covering data cleansing and formatting on my webpage @ Segmenting with Mixed Type Data - Initial data inspection and manupulation

In this project I’m also testing quite a few of the adorn_ and the tabyl functions from the janitor library, a family of functions created to help expedite the initial data exploration and cleaning but also very useful to create and format summary tables.

Country of Residence

Let’s start with the geographic distribution of subscribers, the overwhelming majority of whom are UK based

Cancellations

The dynamics of cancellations are definitely something that warrants further investigation (outside the scope of this study). In fact, Failed payments represent roughly 10% of total subscriptions and a staggering 38% when looking at cancellations alone

data_clean %>% 
   tabyl(canc_reason, status) %>%
   adorn_totals(c("row", "col")) %>% 
   adorn_percentages("col") %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   adorn_ns() %>%
   adorn_title("combined") %>%
   kable(align = 'c')
canc_reason/status Active Cancelled Total
- 100.0% (3510) 18.1% (224) 78.6% (3734)
Competitor 0.0% (0) 1.5% (18) 0.4% (18)
Editorial 0.0% (0) 3.6% (44) 0.9% (44)
Failed Payment 0.0% (0) 37.9% (469) 9.9% (469)
Lack of time 0.0% (0) 19.5% (242) 5.1% (242)
Other 0.0% (0) 6.5% (81) 1.7% (81)
Price 0.0% (0) 8.2% (101) 2.1% (101)
UX Related 0.0% (0) 4.8% (60) 1.3% (60)
Total 100.0% (3510) 100.0% (1239) 100.0% (4749)

RECOMMENDATION 1: Review causes of high failed payment rate as it could result in a 10% boost in overall subscriptions

Payment Methods

Credit Card is the preferred payment method for over two fifths (61%) of subscribers

data_clean %>% 
   tabyl(payment_method) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   arrange(desc(n)) %>%  
   kable(align = 'c')
payment_method n percent
Credit Card 2905 61.2%
PayPal 1230 25.9%
Direct Debit 614 12.9%

Campaigns

Three acquisition campaigns drove the largest majority of subscriptions in February 2018

The analysis is going to focus on these top 3 campaigns, which account for over 73% of total acquisitions. Adding additional campaigns to the analysis would likely see them all lumped into one “hybrid” cluster and result in less focused insight.

data_clean %>% 
   tabyl(campaign_code) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   arrange(desc(n)) %>% 
   top_n(10, wt = n) %>% 
   kable(align = 'c')
campaign_code n percent
57572 1546 32.6%
56472 1061 22.3%
56972 885 18.6%
51372 178 3.7%
51472 154 3.2%
57272 141 3.0%
47672 104 2.2%
53872 104 2.2%
48272 95 2.0%
52972 92 1.9%

Trial Length

Potential subscribes have a number of options available but the analysis is going to take into account only customers that took the 1-month trial before sign-up.

Given that all acquisitions refer to a particular month and their current status 3 MONTHS LATER, this will ensure a clear cut window for the analysis AND cover the vast majority of subscriptions (nearly 90%)

data_clean %>% 
   tabyl(trial_length) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   select(-valid_percent) %>% 
   kable(align = 'c')
trial_length n percent
1M 4233 89.1%
2M 1 0.0%
3M 95 2.0%
6M 7 0.1%
NA 413 8.7%

Product Pricing

Zooming in on the top 3 campaigns for clarity, we can see the two price points of £6.99 for the Standard product and £15 for the Premium one.

Surprisingly, campaign 56472 is attributed to the acquisition of both products. It is good practice to keep acquisition campaigns more closely aligned to a single product as it would ensure better accountability and understanding of acquisition dynamics.

data_clean %>% 
   filter(campaign_code %in% c('57572', '56472', '56972')) %>% 
   filter(trial_length == '1M') %>% 
   group_by(product_group, campaign_code,
            contract_monthly_price) %>% 
   count() %>% 
   ungroup() %>%
   arrange(campaign_code) %>% 
   kable(align = 'c')
product_group campaign_code contract_monthly_price n
Standard 56472 6.99 1061
Premium 56972 15.00 264
Standard 56972 6.99 587
Standard 57572 6.99 1521

Subscription Rate

For the majority of the top 10 campaigns, sign up after trial hovers in the 74-76% range. This offers a good benchmark for the general subscription levels we can expect after trial.

data_clean %>% 
   # selecting top n campaigns by number of 
   filter(campaign_code %in% c(
      '57572', '56472', '56972', '51372', '57272',
      '51472', '47672', '53872', '52972', '54072'
                               )) %>% 
   filter(trial_length == '1M') %>%
   group_by(campaign_code, status) %>% 
   count() %>% 
   ungroup() %>% 
   # lag to allign past (lagging) obs to present obs
   mutate(lag_1 = lag(n, n = 1)) %>% 
   # sort out NAs
   tidyr::fill(lag_1, .direction = "up") %>% 
   # calculate cancelled to total rate perc difference
   mutate(act_to_tot = 
            (lag_1 / (n + lag_1)))  %>% 
   filter(status == 'Cancelled') %>%
   select(campaign_code, act_to_tot) %>% 
   arrange(desc(campaign_code)) %>% 
   adorn_pct_formatting(rounding = "half up", digits = 1) %>% 
   kable(align = 'c')
campaign_code act_to_tot
57572 75.5%
57272 75.6%
56972 72.5%
56472 75.8%
54072 86.4%
53872 71.3%
52972 73.9%
51472 48.1%
51372 68.8%
47672 74.8%

The low subscription rate for campaign 51472 may be due to the higher contract price after trial (contract_monthly_price) of £9.65 compared to £6.99 seen for other Standard product contracts.

data_clean %>% 
   filter(campaign_code == '51472') %>% 
   filter(trial_length == '1M') %>% 
   group_by(contract_monthly_price, product_group, campaign_code) %>% 
   count() %>% 
   ungroup() %>% 
   kable(align = 'c')
contract_monthly_price product_group campaign_code n
9.65 Standard 51472 128

RECOMMENDATION 2: Investigate different price points for standard product as lower price point could boost take up

Analysis Structure

In this analysis I want to understand what factors make a customer subscribe after a trial and to do so, I need to identify what approach is best suited for the type and cut of data I have at my disposal, and to define a measure of success.

  • APPROACH: The data is a snapshot in time and as such does not lend itself to any time-based type of analysis. In such cases, one of the best suited approaches to extract insight is clustering, which is especially good when you have no prior domain knowledge to guide you. However, the fact that data is a mix of categorical and numerical features requires a slightly different approach, which I discuss in the next section.

  • MEASURE: The perfect candidate to measure success is Subscription Rate defined as Active / Total Subscribers. The an focus on 1-month trial subscribers also ensures a clear cut window for analysis as their status reflects whether they signed up 3 MONTHS AFTERWARDS.

Methodology

When I started to research cluster analysis with mix categorical and numerical data, I came across an excellent post on Towards Data Science entitled Clustering on mixed type data, from which I borrowed the core analysis coding and adjusted it to my needs. In his article, Thomas Filaire shows how to use the PAM clustering algorithm (Partitioning Around Medoids) to perform the clustering and the silhouette coefficient to select the optimal number of clusters.

The K-medoid, also know as PAM, is a clustering algorithm similar to the more popular K-means algorithm. K-means and K-medoids work in similar ways in that they create groups in your data and work on distances (often referred to as dissimilarities) as they ensure that elements in each group are very similar to one another by minimising the distance within each cluster. However, K-medoids has the advantage of working on distances other than numerical and lends itself well to analyse mixed-type data that include both numerical and categorical features.

I’m going to calculate the dissimilarities between observations with the help of the daisy function from the cluster library, which allows to choose between a number of methods (“euclidean”, “manhattan” and “gower”) to run the calculations. The Gower’s distance (1971) is of particular interest to us as it computes dissimilarities on a [0 1] range regardless of whether the input is numerical or categorical, hence making them comparable.

Clustering

I’m going to segment the subscription data by the following 5 dimensions:

  • Status: Active / Cancelled
  • Product Group: Standard / Premium
  • Top 3 Campaigns by Subscription Numbers: 56472 / 56972 / 57572
  • Payment Method: Credit Card / Direct Debit / PayPal
  • Contract Monthly Price: £6.99 / £15

I’ll start with selecting the cut of the data I need for the clustering. I’m keeping account_id for my reference but will not pass it to the algorithm.

clust_data <-
   data_clean %>% 
   # filtering by the top 3 campaigns
   filter(campaign_code %in% c('57572', '56472', '56972')) %>%
   # selecting 1M trial subscriptions
   filter(trial_length == '1M') %>% 
   # select features to cluster by
   select(account_id, status, product_group, campaign_code, 
          contract_monthly_price, payment_method
          ) %>% 
   # setting all features as factors
   mutate(
        account_id           = account_id %>% as_factor(),
        status                 = status %>% as_factor(),
        product_group          = product_group %>% as_factor(),
        campaign_code          = campaign_code %>% as_factor(),
        contract_monthly_price = contract_monthly_price %>% as_factor(),
        payment_method         = payment_method %>% as_factor()
    )

Then, I compute the Gower distance with the daisy function from the cluster package. The “Gower’s distance” would automatically be selected if some features in the data are not numeric but I prefer to spell it out anyway.

gower_dist <- 
  clust_data %>% 
    # de-select account_id
    select(2:6) %>%   
    daisy(metric = "gower")

And that’s that! You’re set to go!

Assessing the Clusters

There are a number of methods to establish the optimal number of clusters to use but in this study I’m using the silhouette coefficient, which contrasts the average distance of elements in the same cluster with average distance of elements in other clusters. In other words, each additional cluster is adding “compactness” to the individual segment, bringing their elements closer together, whilst “moving” each cluster further apart from one another.

First, I’m calculating the Partition Around Medoids (a.k.a. PAM) using the pam function from the cluster library. The one number to keep an eye on is the average silhouette width (or avg.width), which I’m storing away in the sil_width parameter.

sil_width <- c(NA)

for (i in 2:8) {  
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)  
  sil_width[i] <- pam_fit$silinfo$avg.width  
}

In a business context, we want a number of clusters to be both meaningful and easy to handle, (i.e. 2 to 8) and 5-cluster configuration seems a good starting point to investigate.

sil_width %>% 
  as_tibble() %>% 
   rowid_to_column() %>% 
   filter(rowid %in% c(2:8)) %>% 
   ggplot(aes(rowid, value)) +
   geom_line(colour  = 'black', size = 0.7) +
   geom_point(colour = 'black', size = 1.3) +
   theme_minimal() +
   labs(title = 'Silhouette Widths of k-medoid Clusters',
        x     = "Number of clusters",
        y     = 'Silhouette Width') +
   theme(plot.title = element_text(hjust = 0.5))

When deciding on the optimal number of clusters, DO NOT rely exclusively on the output of mathematical methods. Make sure you combine it with domain knowledge and your own judgement as often adding an extra segment (say, going from 5 to 6) may only add complexity for no extra insight.

Visualising the Segments with t-SNE

Now that I have a potential optimal number of clusters, I want to visualise them. To do so, I use the t-SNE ( t-distributed stochastic neighbour embedding), a dimensionality reduction technique that assists with cluster visualisation in a similar way to Principal Component Analysis and UMAP.

First, I pull the partitioning data for the 5-cluster configuration from gower_dist

pam_fit <- 
  gower_dist %>% 
    # diss = TRUE to treat argument as dissimilarity matrix
    pam(k = 5, diss = TRUE)

Then I construct the 2D projection of my 5-dimensional space with Rtsne and plot it

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)


tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering)) %>% 

  # plot
  ggplot(aes(x = X, y = Y, colour = cluster)) +
  geom_point() +
  theme_light() +
  labs(title       = 't-SNE 2D Projections of k-medoid Clusters')  +
  theme(plot.title = element_text(hjust = 0.5))

Aside from a few elements, there is a general good separation between clusters as well as closeness of elements within clusters, which confirms the segmentation relevance

Evaluating the Clusters

At this stage of a project you get to appreciate the main difference between Supervised and Unsupervised approaches. With the latter is the algorithm that does all the leg work but it’s still up to you to interpret the results and understand what the algorithm has found.

To properly evaluate the clusters, you want to first pull all information into one tibble that you can easily manipulate. I start by appending the cluster number from the pam_fit list to the clust_data.

pam_results <- clust_data %>%
   # append the cluster information onto clust_data
  mutate(cluster = pam_fit$clustering) %>% 
   # sort out variable order
  select(account_id, cluster, everything()) %>% 
   # attach some extra data features from the data_clean file
  left_join(select(data_clean, 
                   c(account_id, canc_reason)), by = 'account_id') %>%
  mutate_if(is.character, funs(factor(.)))

NOTE THAT I’m also bringing in canc_reason from data_clean, a dimension that I did NOT use in the calculations. It does not always work but sometimes it may reveal patterns you have not explicitly considered.

From experience I found that a very good exercise is to print out the summaries (on screen or paper) of each single cluster - here’s an example for cluster 2.

Lay all the summaries before you and let your inner Sherlock Holmes comes out to play! The story usually unfolds before your very eyes when you start comparing the differences between clusters.

pam_results %>% 
  filter(cluster == 2) %>%
  select(-c(account_id , cluster, contract_monthly_price)) %>% 
  summary() 
status product_group campaign_code payment_method canc_reason
Cancelled:202 Standard:900 56472: 0 Direct Debit: 0 - :738
Active :698 Premium : 0 57572:900 Credit Card :900 Failed Payment: 81
56972: 0 PayPal : 0 Lack of time : 46
Price : 16
Editorial : 8
Other : 5
(Other) : 6

When you have your story, you may want to frame it in a nice table or graph. Here I’m showing the code I used to assemble my own overall summary and then discuss the results but feel free to tailor it to your own taste and needs!

First I calculate subscription rate (my measure of success) and failed payment rate for each cluster…

# subscription rate - my measure of success... 
subscr <- 
   pam_results %>%
   group_by(cluster, status) %>% 
   count() %>% 
   ungroup() %>% 
   # lag to allign past (lagging) obs to present obs
   mutate(lag_1 = lag(n, n = 1)) %>% 
   # sort out NAs
   tidyr::fill(lag_1, .direction = "up") %>% 
   # calculate active to total rate perc difference
   mutate(sub_rate =  (n / (n + lag_1)))  %>% 
   filter(status == 'Active') %>%
   select(Cluster = cluster, Subsc.Rate = sub_rate) %>% 
   adorn_pct_formatting(rounding = "half up", digits = 1)

# ... and failed payment rate
fail_pymt <-
   pam_results %>%
   group_by(cluster , canc_reason) %>% 
   count() %>% 
   ungroup() %>%
   group_by(cluster) %>% 
   mutate(tot = sum(n)) %>% 
   # calculate failed payment rate to total 
   mutate(fail_pymt_rate = (n / tot)) %>% 
   ungroup() %>%
   filter(canc_reason == 'Failed Payment') %>%
   select(Cluster = cluster, Failed.Pymt.Rate = fail_pymt_rate) %>%
   adorn_pct_formatting(rounding = "half up", digits = 1)

Then, I bring all together in a handy table so that we can have a closer look

pam_results %>% 
   group_by(Cluster   = cluster, 
            Campaign  = campaign_code, 
            Product   = product_group, 
            Mth.Price = contract_monthly_price) %>% 
   summarise(Cluster.Size = n()) %>% 
   ungroup() %>%  
   arrange(Campaign) %>% 
   # attach subscription rate from subscr
   left_join(select(subscr, c(Cluster, Subsc.Rate)), by = 'Cluster') %>% 
   # attach failed payment rate from fail_pymt
   left_join(select(fail_pymt, c(Cluster, Failed.Pymt.Rate)), by = 'Cluster') %>% 
   kable(align = 'c')
Cluster Campaign Product Mth.Price Cluster.Size Subsc.Rate Failed.Pymt.Rate
1 56472 Standard 6.99 1061 75.8% 11.0%
2 57572 Standard 6.99 900 77.6% 9.0%
5 57572 Standard 6.99 621 72.5% 15.0%
3 56972 Standard 6.99 587 76.0% 9.9%
4 56972 Premium 15 264 64.8% 9.5%

Having focused on the top 3 campaigns was a good tactic and resulted in clear separation of the groups, which meant that the K-medoid discovered sub-groups within 2 campaigns.

Let’s take a look at campaign 57572 first:

  • cluster 5 has a lower Subscription Rate than most campaigns (74-76%, remember?) AND higher Failed Payment Rate

  • Further inspection reveals that the lower subscription rate is associated with Direct Debit and PayPal payments

.

pam_results %>% 
   group_by(Cluster     = cluster, 
            Pymt.Method = payment_method) %>% 
   summarise(Cluster.Size = n()) %>% 
   ungroup() %>%  
   filter(Cluster %in% c(2,5)) %>%
   # attach subscription rate from subscr
   left_join(select(subscr, c(Cluster, Subsc.Rate)), by = 'Cluster') %>%
   kable(align = 'c')
Cluster Pymt.Method Cluster.Size Subsc.Rate
2 Credit Card 900 77.6%
5 Direct Debit 165 72.5%
5 PayPal 456 72.5%

RECOMMENDATION 3: there may be potential to incentivise Credit Card payments as they seem to associate with higher Subscription Rate

Let’s move on to campaign 56972:

  • We already found this campaign is attributed to acquisition of both products, and has a lower overall subscription rate of 72.5% if compared to our benchmark of 74-76%

  • K-medoid split this out into two groups, revealing that premium product has a lower subscription rate than its standard counterpart

Cluster Campaign Product Mth.Price Cluster.Size Subsc.Rate Failed.Pymt.Rate
1 56472 Standard 6.99 1061 75.8% 11.0%
2 57572 Standard 6.99 900 77.6% 9.0%
5 57572 Standard 6.99 621 72.5% 15.0%
3 56972 Standard 6.99 587 76.0% 9.9%
4 56972 Premium 15 264 64.8% 9.5%

RECOMMENDATION 4: there may be potential to investigate different price points for Premium product that could help boost take up

RECOMMENDATION 5: acquisition campaigns should be more closely aligned to a single product to ensure accountability and better understanding of acquisition dynamics

Summary of recommendations

  • Review and solve high failed payment rate - This could result in a 10% potential boost in overall subscriptions

  • Incentivise credit card payments - Subscriptions associated with this means of payment have a higher sign up rate that other payments methods

  • Investigate different price points for both standard & premium products - Lower price point could boost take up

  • Keep campaigns more closely aligned to product - This would ensure better accountability and understanding of acquisition dynamics

Closing thoughts

In this project I revisited clustering, one of my favourite analytic methods, to explore and analyse a real-world dataset that included a mix of categorical and numerical feature. This required a different approach from the classical K-means algorithm that cannot be no directly applied to categorical data.

Instead, I used the K-medoids algorithm, also known as PAM (Partitioning Around Medoids), that has the advantage of working on distances other than numerical and lends itself well to analyse mixed-type data.

The silhouette coefficient helped to establish the optimal number of clusters, whilst t-SNE ( t-distributed stochastic neighbour embedding), a dimensionality reduction technique akin Principal Component Analysis and UMAP, unveiled good separation between clusters as well as closeness of elements within clusters, confirming the segmentation relevance.

Finally, I condensed the insight generated from the analysis into a number of actionable and data-driven recommendations that, applied correctly, could help improve product sign up.

Code Repository

The full R code and all relevant files can be found on my GitHub profile @ K Medoid Clustering

References

Appendix

Table 1 – Variable Definitions

Attribute Description
Account ID Unique account ID
Created Date Date of original account creation
Country Country of account holder
Status Current status - active/inactive
Product Group Product type
Payment Frequency Most subscriptions are a 1 year contract term - payable annually or monthly
Campaign Code Unique identifier for campaign attributed to acquisition
Start Date Start date of the trial
End Date Scheduled end of term
Cancellation Date Date of instruction to cancel
Cancellation Reason Reason given for cancellation
Monthly Price Current monthly price of subscription
Contract Monthly Price Price after promo period
Trial Length Length of trial
Trial Monthly Price Price during trial
Payment Method Payment method