Segmenting with Mixed Type Data - A Case Study Using K-Medoids on Subscription Data
With the new year, I started looking for new employment opportunities and even managed to land a handful of final-stage interviews before it all ground to a halt with the coronavirus pandemic. Invariably, as part of the selection process I was asked to analyse a set of data and compile a number of data-driven recommendations to present in my final meeting.
In this post I retrace the steps I took for one of the take-home analyses I was tasked with and revisit clustering, one of my favourite analytic methods. Only this time the set-up is a lot closer to a real-world situation in that the data I had to analyse came with a mix of categorical and numerical features. Simply put, this could not be tackled with a bog-standard K-means algorithm, as K-means is based on pairwise Euclidean distances and has no direct application to categorical data.
The data
library(tidyverse)
library(readxl)
library(skimr)
library(knitr)
library(janitor)
library(cluster)
library(Rtsne)
The data represents all online acquisitions in February 2018 and their subscription status 3 months later (9th June 2018) for a fictional News Aggregator Subscription business. It contains a number of parameters describing each account, like account creation date, campaign attributed to acquisition, payment method and length of product trial. A full description of the variables can be found in the Appendix.
I got hold of this dataset during a recruitment selection process, where I was asked to carry out an analysis and present results and recommendations that could help improve the company’s product sign-ups in my final meeting. Although fictitious in nature, I thought it best to further anonymise the dataset by changing the names and values of most of the variables, as well as removing several features that were of no use to the analysis.
The data was raw and required some cleansing and manipulation before it could be used for analysis. You can find the anonymised dataset on my GitHub profile, along with the scripts I used to cleanse the data and perform the analysis.
Data Exploration
This is a vital part of any data analysis study, often overlooked in favour of the fancier and “sexier” modelling/playing with the algos. In fact, I find it one of the most creative phases of any project, where you “join the dots” among the different variables and start formulating interesting hypotheses to test later on.
data_clean <- readRDS("../00_data/data_clean.rds")
Here I’m loading the cleaned data before exploring a selection of the features, running through a number of considerations to set up the analysis, and showing you some of the interesting insights I gathered. You can find the post covering data cleansing and formatting on my webpage @ Segmenting with Mixed Type Data - Initial data inspection and manipulation
In this project I’m also testing quite a few of the adorn_ and tabyl functions from the janitor library, a family of functions created to help expedite initial data exploration and cleaning, but also very useful for creating and formatting summary tables.
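As a taster, here’s a minimal sketch of the tabyl/adorn_ pattern I use throughout the post, run on a made-up tibble (purely for illustration, not part of the subscription data):
# toy data, for illustration only
toy <- tibble(
  status  = c('Active', 'Active', 'Cancelled', 'Active'),
  product = c('Standard', 'Premium', 'Standard', 'Standard')
)

toy %>%
  tabyl(product, status) %>%                    # cross-tabulate the two variables
  adorn_totals("row") %>%                       # append a totals row
  adorn_percentages("col") %>%                  # convert counts to column percentages
  adorn_pct_formatting(rounding = "half up", digits = 1) %>%
  adorn_ns()                                    # re-attach the underlying counts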
Country of Residence
Let’s start with the geographic distribution of subscribers, the overwhelming majority of whom are UK based.
Cancellations
The dynamics of cancellations definitely warrant further investigation (outside the scope of this study). In fact, Failed Payments represent roughly 10% of total subscriptions and a staggering 38% when looking at cancellations alone.
data_clean %>%
tabyl(canc_reason, status) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(rounding = "half up", digits = 1) %>%
adorn_ns() %>%
adorn_title("combined") %>%
kable(align = 'c')
canc_reason/status | Active | Cancelled | Total |
---|---|---|---|
- | 100.0% (3510) | 18.1% (224) | 78.6% (3734) |
Competitor | 0.0% (0) | 1.5% (18) | 0.4% (18) |
Editorial | 0.0% (0) | 3.6% (44) | 0.9% (44) |
Failed Payment | 0.0% (0) | 37.9% (469) | 9.9% (469) |
Lack of time | 0.0% (0) | 19.5% (242) | 5.1% (242) |
Other | 0.0% (0) | 6.5% (81) | 1.7% (81) |
Price | 0.0% (0) | 8.2% (101) | 2.1% (101) |
UX Related | 0.0% (0) | 4.8% (60) | 1.3% (60) |
Total | 100.0% (3510) | 100.0% (1239) | 100.0% (4749) |
RECOMMENDATION 1: Review the causes of the high failed payment rate, as fixing it could result in a boost of up to 10% in overall subscriptions
Payment Methods
Credit Card is the preferred payment method for over three fifths (61%) of subscribers.
data_clean %>%
tabyl(payment_method) %>%
adorn_pct_formatting(rounding = "half up", digits = 1) %>%
arrange(desc(n)) %>%
kable(align = 'c')
payment_method | n | percent |
---|---|---|
Credit Card | 2905 | 61.2% |
PayPal | 1230 | 25.9% |
Direct Debit | 614 | 12.9% |
Campaigns
Three acquisition campaigns drove the majority of subscriptions in February 2018.
The analysis is going to focus on these top 3 campaigns, which account for over 73% of total acquisitions. Adding additional campaigns to the analysis would likely see them all lumped into one “hybrid” cluster and result in less focused insight.
data_clean %>%
tabyl(campaign_code) %>%
adorn_pct_formatting(rounding = "half up", digits = 1) %>%
arrange(desc(n)) %>%
top_n(10, wt = n) %>%
kable(align = 'c')
campaign_code | n | percent |
---|---|---|
57572 | 1546 | 32.6% |
56472 | 1061 | 22.3% |
56972 | 885 | 18.6% |
51372 | 178 | 3.7% |
51472 | 154 | 3.2% |
57272 | 141 | 3.0% |
47672 | 104 | 2.2% |
53872 | 104 | 2.2% |
48272 | 95 | 2.0% |
52972 | 92 | 1.9% |
Trial Length
Potential subscribers have a number of trial options available, but the analysis is going to take into account only customers who took the 1-month trial before sign-up.
Given that all acquisitions refer to a particular month and their current status 3 MONTHS LATER, this ensures a clear-cut window for the analysis AND covers the vast majority of subscriptions (nearly 90%).
data_clean %>%
tabyl(trial_length) %>%
adorn_pct_formatting(rounding = "half up", digits = 1) %>%
select(-valid_percent) %>%
kable(align = 'c')
trial_length | n | percent |
---|---|---|
1M | 4233 | 89.1% |
2M | 1 | 0.0% |
3M | 95 | 2.0% |
6M | 7 | 0.1% |
NA | 413 | 8.7% |
Product Pricing
Zooming in on the top 3 campaigns for clarity, we can see the two price points of £6.99 for the Standard product and £15 for the Premium one.
Surprisingly, campaign 56972 is attributed to the acquisition of both products. It is good practice to keep acquisition campaigns closely aligned to a single product, as this ensures better accountability and a clearer understanding of acquisition dynamics.
data_clean %>%
filter(campaign_code %in% c('57572', '56472', '56972')) %>%
filter(trial_length == '1M') %>%
group_by(product_group, campaign_code,
contract_monthly_price) %>%
count() %>%
ungroup() %>%
arrange(campaign_code) %>%
kable(align = 'c')
product_group | campaign_code | contract_monthly_price | n |
---|---|---|---|
Standard | 56472 | 6.99 | 1061 |
Premium | 56972 | 15.00 | 264 |
Standard | 56972 | 6.99 | 587 |
Standard | 57572 | 6.99 | 1521 |
Subscription Rate
For the majority of the top 10 campaigns, the sign-up rate after trial hovers in the 74-76% range. This offers a good benchmark for the general subscription levels we can expect after a trial.
data_clean %>%
# selecting the top 10 campaigns by number of subscriptions
filter(campaign_code %in% c(
'57572', '56472', '56972', '51372', '57272',
'51472', '47672', '53872', '52972', '54072'
)) %>%
filter(trial_length == '1M') %>%
group_by(campaign_code, status) %>%
count() %>%
ungroup() %>%
# lag to align past (lagging) obs to present obs
mutate(lag_1 = lag(n, n = 1)) %>%
# sort out NAs
tidyr::fill(lag_1, .direction = "up") %>%
# calculate the active-to-total (subscription) rate
mutate(act_to_tot =
(lag_1 / (n + lag_1))) %>%
filter(status == 'Cancelled') %>%
select(campaign_code, act_to_tot) %>%
arrange(desc(campaign_code)) %>%
adorn_pct_formatting(rounding = "half up", digits = 1) %>%
kable(align = 'c')
campaign_code | act_to_tot |
---|---|
57572 | 75.5% |
57272 | 75.6% |
56972 | 72.5% |
56472 | 75.8% |
54072 | 86.4% |
53872 | 71.3% |
52972 | 73.9% |
51472 | 48.1% |
51372 | 68.8% |
47672 | 74.8% |
The low subscription rate for campaign 51472 may be due to its higher contract price after trial (contract_monthly_price) of £9.65, compared to the £6.99 seen for other Standard product contracts.
data_clean %>%
filter(campaign_code == '51472') %>%
filter(trial_length == '1M') %>%
group_by(contract_monthly_price, product_group, campaign_code) %>%
count() %>%
ungroup() %>%
kable(align = 'c')
contract_monthly_price | product_group | campaign_code | n |
---|---|---|---|
9.65 | Standard | 51472 | 128 |
RECOMMENDATION 2: Investigate different price points for the Standard product, as a lower price point could boost take-up
Analysis Structure
In this analysis I want to understand what factors make a customer subscribe after a trial. To do so, I need to identify the approach best suited to the type and cut of data at my disposal, and to define a measure of success.
APPROACH: The data is a snapshot in time and as such does not lend itself to any time-based type of analysis. In such cases, one of the best-suited approaches to extract insight is clustering, which is especially good when you have no prior domain knowledge to guide you. However, the fact that the data is a mix of categorical and numerical features requires a slightly different approach, which I discuss in the next section.
MEASURE: The perfect candidate to measure success is the Subscription Rate, defined as Active / Total Subscribers. The focus on 1-month trial subscribers also ensures a clear-cut window for the analysis, as their status reflects whether they signed up 3 MONTHS AFTERWARDS.
Methodology
When I started to research cluster analysis with mixed categorical and numerical data, I came across an excellent post on Towards Data Science entitled Clustering on mixed type data, from which I borrowed the core analysis code and adjusted it to my needs. In his article, Thomas Filaire shows how to use the PAM (Partitioning Around Medoids) clustering algorithm to perform the clustering and the silhouette coefficient to select the optimal number of clusters.
K-medoids, also known as PAM, is a clustering algorithm similar to the more popular K-means. The two work in similar ways: they both partition your data into groups and work on distances (often referred to as dissimilarities), ensuring that elements in each group are very similar to one another by minimising the distance within each cluster. However, K-medoids has the advantage of working on distances other than the Euclidean, which makes it well suited to analysing mixed-type data that includes both numerical and categorical features.
I’m going to calculate the dissimilarities between observations with the help of the daisy function from the cluster library, which lets you choose between a number of metrics (“euclidean”, “manhattan” and “gower”) to run the calculations. Gower’s distance (1971) is of particular interest here as it computes dissimilarities on a [0, 1] range regardless of whether the input is numerical or categorical, hence making them comparable.
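To get a feel for how Gower’s distance works, here’s a minimal sketch on a made-up, three-row mixed-type tibble (purely illustrative, not part of the subscription data):
# toy mixed-type data, for illustration only
toy <- tibble(
  status  = factor(c('Active', 'Cancelled', 'Active')),
  payment = factor(c('Credit Card', 'PayPal', 'PayPal')),
  price   = c(6.99, 6.99, 15.00)
)

# categorical features contribute 0 (match) or 1 (mismatch), numerical
# ones the absolute difference scaled by the feature's range; daisy then
# averages the per-feature scores into a single [0, 1] dissimilarity
toy %>% daisy(metric = "gower")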
Clustering
I’m going to segment the subscription data by the following 5 dimensions:
- Status: Active / Cancelled
- Product Group: Standard / Premium
- Top 3 Campaigns by Subscription Numbers: 56472 / 56972 / 57572
- Payment Method: Credit Card / Direct Debit / PayPal
- Contract Monthly Price: £6.99 / £15
I’ll start by selecting the cut of the data I need for the clustering. I’m keeping account_id for my own reference but will not pass it to the algorithm.
clust_data <-
data_clean %>%
# filtering by the top 3 campaigns
filter(campaign_code %in% c('57572', '56472', '56972')) %>%
# selecting 1M trial subscriptions
filter(trial_length == '1M') %>%
# select features to cluster by
select(account_id, status, product_group, campaign_code,
contract_monthly_price, payment_method
) %>%
# setting all features as factors
mutate(across(everything(), as_factor))
Then, I compute the Gower distance with the daisy function from the cluster package. Gower’s distance would automatically be selected when some features in the data are not numeric, but I prefer to spell it out anyway.
gower_dist <-
clust_data %>%
# de-select account_id
select(-account_id) %>%
daisy(metric = "gower")
And that’s that! You’re set to go!
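As an optional sanity check (my addition, not part of the original workflow), you can coerce the dissimilarity object to a plain matrix and inspect an extreme pair, just to confirm that Gower’s distance behaves sensibly on this data:
# optional sanity check: coerce the dissimilarity object to a matrix...
gower_mat <- as.matrix(gower_dist)

# ...and print the most dissimilar pair of accounts
clust_data[which(gower_mat == max(gower_mat), arr.ind = TRUE)[1, ], ]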
Assessing the Clusters
There are a number of methods to establish the optimal number of clusters, but in this study I’m using the silhouette coefficient, which contrasts the average distance of elements within the same cluster with the average distance to elements in other clusters. In other words, each additional cluster should add “compactness” to the individual segments, bringing their elements closer together, whilst “moving” the clusters further apart from one another.
First, I’m calculating the Partitioning Around Medoids (a.k.a. PAM) for each candidate number of clusters using the pam function from the cluster library. The one number to keep an eye on is the average silhouette width (or avg.width), which I’m storing away in the sil_width vector.
sil_width <- c(NA)
for (i in 2:8) {
pam_fit <- pam(gower_dist, diss = TRUE, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
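As an aside, the same search can be written in a more functional style with purrr (already loaded as part of the tidyverse); just note that, unlike the loop above, this version produces a 7-element vector indexed from 1, so the plotting code would need adjusting accordingly:
# functional-style equivalent of the loop above (indexed 1..7 for k = 2..8)
sil_width_alt <- map_dbl(2:8, ~ pam(gower_dist, diss = TRUE, k = .x)$silinfo$avg.width)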
In a business context, we want the number of clusters to be both meaningful and easy to handle (i.e. 2 to 8), and the 5-cluster configuration seems a good starting point to investigate.
sil_width %>%
as_tibble() %>%
rowid_to_column() %>%
filter(rowid %in% c(2:8)) %>%
ggplot(aes(rowid, value)) +
geom_line(colour = 'black', size = 0.7) +
geom_point(colour = 'black', size = 1.3) +
theme_minimal() +
labs(title = 'Silhouette Widths of k-medoid Clusters',
x = "Number of clusters",
y = 'Silhouette Width') +
theme(plot.title = element_text(hjust = 0.5))
When deciding on the optimal number of clusters, DO NOT rely exclusively on the output of mathematical methods. Make sure you combine it with domain knowledge and your own judgement as often adding an extra segment (say, going from 5 to 6) may only add complexity for no extra insight.
Visualising the Segments with t-SNE
Now that I have a potential optimal number of clusters, I want to visualise them. To do so, I use t-SNE (t-distributed stochastic neighbour embedding), a dimensionality reduction technique that assists with cluster visualisation in a similar way to Principal Component Analysis and UMAP.
First, I pull the partitioning data for the 5-cluster configuration from gower_dist:
pam_fit <-
gower_dist %>%
# diss = TRUE to treat argument as dissimilarity matrix
pam(k = 5, diss = TRUE)
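One nice side effect of K-medoids over K-means is that each cluster centre is an actual observation in the data, so we can look the representative accounts up directly; a quick optional peek, using the id.med element of the pam object:
# the medoids are real accounts: one representative observation per cluster
clust_data[pam_fit$id.med, ]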
Then I construct the 2D projection of my 5-dimensional space with Rtsne and plot it:
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering)) %>%
# plot
ggplot(aes(x = X, y = Y, colour = cluster)) +
geom_point() +
theme_light() +
labs(title = 't-SNE 2D Projections of k-medoid Clusters') +
theme(plot.title = element_text(hjust = 0.5))
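One caveat worth flagging (my addition, not in the original analysis): t-SNE is stochastic, so the layout will differ from run to run unless you fix the random seed beforehand. The seed value itself is arbitrary:
# for a reproducible t-SNE layout, fix the seed before calling Rtsne
set.seed(42)   # any fixed value will do
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)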
Aside from a few elements, there is generally good separation between clusters as well as closeness of elements within clusters, which confirms the relevance of the segmentation.
Evaluating the Clusters
At this stage of a project you get to appreciate the main difference between supervised and unsupervised approaches. With the latter, it’s the algorithm that does all the legwork, but it’s still up to you to interpret the results and understand what the algorithm has found.
To properly evaluate the clusters, you first want to pull all the information into one tibble that you can easily manipulate. I start by appending the cluster number from the pam_fit list to clust_data.
pam_results <- clust_data %>%
# append the cluster information onto clust_data
mutate(cluster = pam_fit$clustering) %>%
# sort out variable order
select(account_id, cluster, everything()) %>%
# attach some extra data features from the data_clean file
left_join(select(data_clean,
c(account_id, canc_reason)), by = 'account_id') %>%
mutate(across(where(is.character), as.factor))
NOTE THAT I’m also bringing in canc_reason from data_clean, a dimension that I did NOT use in the calculations. This trick does not always pay off, but sometimes it reveals patterns you have not explicitly considered.
From experience, I find that a very good exercise is to print out the summaries (on screen or paper) of each single cluster - here’s an example for cluster 2.
Lay all the summaries before you and let your inner Sherlock Holmes come out to play! The story usually unfolds before your very eyes when you start comparing the differences between clusters.
pam_results %>%
filter(cluster == 2) %>%
select(-c(account_id , cluster, contract_monthly_price)) %>%
summary()
status | product_group | campaign_code | payment_method | canc_reason |
---|---|---|---|---|
Cancelled:202 | Standard:900 | 56472: 0 | Direct Debit: 0 | - :738 |
Active :698 | Premium : 0 | 57572:900 | Credit Card :900 | Failed Payment: 81 |
 | | 56972: 0 | PayPal : 0 | Lack of time : 46 |
 | | | | Price : 16 |
 | | | | Editorial : 8 |
 | | | | Other : 5 |
 | | | | (Other) : 6 |
When you have your story, you may want to frame it in a nice table or graph. Here I’m showing the code I used to assemble my own overall summary before discussing the results, but feel free to tailor it to your own taste and needs!
First, I calculate the subscription rate (my measure of success) and the failed payment rate for each cluster…
# subscription rate - my measure of success...
subscr <-
pam_results %>%
group_by(cluster, status) %>%
count() %>%
ungroup() %>%
# lag to align past (lagging) obs to present obs
mutate(lag_1 = lag(n, n = 1)) %>%
# sort out NAs
tidyr::fill(lag_1, .direction = "up") %>%
# calculate the active-to-total (subscription) rate
mutate(sub_rate = (n / (n + lag_1))) %>%
filter(status == 'Active') %>%
select(Cluster = cluster, Subsc.Rate = sub_rate) %>%
adorn_pct_formatting(rounding = "half up", digits = 1)
# ... and failed payment rate
fail_pymt <-
pam_results %>%
group_by(cluster , canc_reason) %>%
count() %>%
ungroup() %>%
group_by(cluster) %>%
mutate(tot = sum(n)) %>%
# calculate failed payment rate to total
mutate(fail_pymt_rate = (n / tot)) %>%
ungroup() %>%
filter(canc_reason == 'Failed Payment') %>%
select(Cluster = cluster, Failed.Pymt.Rate = fail_pymt_rate) %>%
adorn_pct_formatting(rounding = "half up", digits = 1)
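As an aside, the lag() trick relies on the Active/Cancelled rows appearing in a fixed order within each cluster; a more defensive variant (my sketch, not the original code) pivots the statuses into columns and computes the rate explicitly:
# alternative: spread statuses into columns, then compute the rate explicitly
pam_results %>%
  count(cluster, status) %>%
  tidyr::pivot_wider(names_from = status, values_from = n, values_fill = 0) %>%
  mutate(sub_rate = Active / (Active + Cancelled))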
Then, I bring it all together in a handy table so that we can have a closer look.
pam_results %>%
group_by(Cluster = cluster,
Campaign = campaign_code,
Product = product_group,
Mth.Price = contract_monthly_price) %>%
summarise(Cluster.Size = n()) %>%
ungroup() %>%
arrange(Campaign) %>%
# attach subscription rate from subscr
left_join(select(subscr, c(Cluster, Subsc.Rate)), by = 'Cluster') %>%
# attach failed payment rate from fail_pymt
left_join(select(fail_pymt, c(Cluster, Failed.Pymt.Rate)), by = 'Cluster') %>%
kable(align = 'c')
Cluster | Campaign | Product | Mth.Price | Cluster.Size | Subsc.Rate | Failed.Pymt.Rate |
---|---|---|---|---|---|---|
1 | 56472 | Standard | 6.99 | 1061 | 75.8% | 11.0% |
2 | 57572 | Standard | 6.99 | 900 | 77.6% | 9.0% |
5 | 57572 | Standard | 6.99 | 621 | 72.5% | 15.0% |
3 | 56972 | Standard | 6.99 | 587 | 76.0% | 9.9% |
4 | 56972 | Premium | 15 | 264 | 64.8% | 9.5% |
Focusing on the top 3 campaigns was a good tactic and resulted in a clear separation of the groups; notably, K-medoids discovered sub-groups within 2 of the campaigns.
Let’s take a look at campaign 57572 first: cluster 5 has a lower Subscription Rate than most campaigns (74-76%, remember?) AND a higher Failed Payment Rate. Further inspection reveals that the lower subscription rate is associated with Direct Debit and PayPal payments.
pam_results %>%
group_by(Cluster = cluster,
Pymt.Method = payment_method) %>%
summarise(Cluster.Size = n()) %>%
ungroup() %>%
filter(Cluster %in% c(2,5)) %>%
# attach subscription rate from subscr
left_join(select(subscr, c(Cluster, Subsc.Rate)), by = 'Cluster') %>%
kable(align = 'c')
Cluster | Pymt.Method | Cluster.Size | Subsc.Rate |
---|---|---|---|
2 | Credit Card | 900 | 77.6% |
5 | Direct Debit | 165 | 72.5% |
5 | PayPal | 456 | 72.5% |
RECOMMENDATION 3: there may be potential to incentivise Credit Card payments, as they seem to be associated with a higher Subscription Rate
Let’s move on to campaign 56972: we already found that this campaign is attributed to the acquisition of both products and has a lower overall subscription rate of 72.5% compared to our benchmark of 74-76%. K-medoids split it into two groups, revealing that the Premium product has a markedly lower subscription rate than its Standard counterpart:
Cluster | Campaign | Product | Mth.Price | Cluster.Size | Subsc.Rate | Failed.Pymt.Rate |
---|---|---|---|---|---|---|
3 | 56972 | Standard | 6.99 | 587 | 76.0% | 9.9% |
4 | 56972 | Premium | 15 | 264 | 64.8% | 9.5% |
RECOMMENDATION 4: there may be potential to investigate different price points for Premium product that could help boost take up
RECOMMENDATION 5: acquisition campaigns should be more closely aligned to a single product to ensure accountability and better understanding of acquisition dynamics
Summary of recommendations
Review and solve the high failed payment rate - This could result in a potential 10% boost in overall subscriptions
Incentivise credit card payments - Subscriptions associated with this means of payment have a higher sign-up rate than other payment methods
Investigate different price points for both Standard & Premium products - A lower price point could boost take-up
Keep campaigns more closely aligned to a single product - This would ensure better accountability and understanding of acquisition dynamics
Closing thoughts
In this project I revisited clustering, one of my favourite analytic methods, to explore and analyse a real-world dataset that included a mix of categorical and numerical features. This required a different approach from the classical K-means algorithm, which cannot be directly applied to categorical data.
Instead, I used the K-medoids algorithm, also known as PAM (Partitioning Around Medoids), which has the advantage of working on distances other than the Euclidean and lends itself well to analysing mixed-type data.
The silhouette coefficient helped to establish the optimal number of clusters, whilst t-SNE (t-distributed stochastic neighbour embedding), a dimensionality reduction technique akin to Principal Component Analysis and UMAP, revealed good separation between clusters as well as closeness of elements within clusters, confirming the relevance of the segmentation.
Finally, I condensed the insights generated from the analysis into a number of actionable, data-driven recommendations that, applied correctly, could help improve product sign-ups.
Code Repository
The full R code and all relevant files can be found on my GitHub profile @ K Medoid Clustering
References
For the article that inspired my foray into K-medoids clustering, see this excellent TDS post by Thomas Filaire: Clustering on mixed type data
For a tidy and fully-featured approach to counting things, see the tabyls Function Vignette
Appendix
Table 1 – Variable Definitions
Attribute | Description |
---|---|
Account ID | Unique account ID |
Created Date | Date of original account creation |
Country | Country of account holder |
Status | Current status - Active/Cancelled |
Product Group | Product type |
Payment Frequency | Most subscriptions are a 1 year contract term - payable annually or monthly |
Campaign Code | Unique identifier for campaign attributed to acquisition |
Start Date | Start date of the trial |
End Date | Scheduled end of term |
Cancellation Date | Date of instruction to cancel |
Cancellation Reason | Reason given for cancellation |
Monthly Price | Current monthly price of subscription |
Contract Monthly Price | Price after promo period |
Trial Length | Length of trial |
Trial Monthly Price | Price during trial |
Payment Method | Payment method |