Market Basket Analysis - Part 3 of 3 - A Shiny Product Recommender with Improved Collaborative Filtering
My objective for this piece of work is to carry out a Market Basket Analysis as an end-to-end data science project. I have split the output into three parts, of which this is the third and last, organised as follows:
In the first chapter, I will source, explore and format a complex dataset suitable for modelling with recommendation algorithms.
For the second part, I will apply various machine learning algorithms for Product Recommendation and select the best performing model. This will be done with the support of the recommenderlab package.
In the third and final instalment, I will implement the selected model in a Shiny Web Application.
Introduction
In the course of the research I carried out for this project, I came across several pieces of work, all of which provided great inspiration and insight for my Shiny Application. A special mention goes to Jekaterina Novikova, PhD, for her Movie Recommender System, and Prof. Michael Hahsler, who created a Joke Recommender. They used the MovieLens and the Jester datasets respectively, both of which come as default with recommenderlab.
However, both authors employed recommenderlab as the engine for their Shiny App, which famously requires a long time to calculate predictions on large datasets. The solution adopted by Jekaterina to make calculations more manageable was to take a sample of the dataset to reduce the size of the rating matrix. However, this may prove detrimental to prediction accuracy.
Fortunately, I came across this brilliant Kaggle kernel by Philipp Spachtholz, who not only carried out a salient analysis on a non-Kaggle dataset but, crucially for me, also built a Shiny-based Book Recommender system using much faster collaborative filtering code.
Philipp drew his inspiration from this blog post by SmartCat Consulting, which describes how to use an Improved Collaborative Filtering code and all the associated functions contained in the companion GitHub repository. In particular, the repository includes the similarity_measures.R functions for calculating similarity matrices, and the cf_algorithm.R file with the collaborative filtering algorithm and prediction function.
In this final part of the project, I will describe how I used these functions in my Shiny implementation.
Loading the Packages
library(tidyverse)
library(knitr)
library(Matrix)
library(recommenderlab)
The Data
In this section I am using the retail dataset from Part 2 to prepare the data files needed for the Shiny App. Note that in Part 2 I carried out the extra formatting step of removing those orders that contained the same item more than once.
glimpse(retail)
## Observations: 517,354
## Variables: 10
## $ InvoiceNo <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 53...
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LA...
## $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12...
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1....
## $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 1...
## $ Country <fct> United Kingdom, United Kingdom, United Kingdom, Un...
## $ Date <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2...
## $ Time <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...
For the app deployment I need to create 2 data files: past_orders_matrix and item_list.
past_orders_matrix is a user-item sparse matrix containing the history of past orders. This is needed in the Shiny server.R file for all the calculations.
past_orders_matrix <-
retail %>%
# Select only needed variables
select(InvoiceNo, Description) %>%
# Add a column of 1s
mutate(value = 1) %>%
# Spread into user-item format
spread(Description, value, fill = 0) %>%
select(-InvoiceNo) %>%
# Convert to matrix
as.matrix() %>%
# Convert to class "dgCMatrix"
as("dgCMatrix")
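As an aside, tidyr has since superseded spread() with pivot_wider(). A sketch of the same transformation in the newer idiom (assuming tidyr >= 1.1.0, where values_fill accepts a scalar) would be:

```r
# Same user-item sparse matrix, using pivot_wider() instead of spread()
past_orders_matrix <- retail %>%
  select(InvoiceNo, Description) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Description, values_from = value,
              values_fill = 0) %>%
  select(-InvoiceNo) %>%
  as.matrix() %>%
  as("dgCMatrix")
```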
I save the file for use in the app.
saveRDS(past_orders_matrix,
file = "past_orders_matrix.rds")
item_list is a list of all the products available to purchase. This will feed into the Shiny ui.R file to make the products list available for selection.
# Creating a unique items list
item_list <- retail %>%
select(Description) %>%
unique()
I save the list for use in the app.
saveRDS(item_list,
file = "item_list.rds")
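To illustrate how item_list can feed the product selection in ui.R, here is a minimal sketch. The widget and output IDs (customer_basket, recommendations) are hypothetical placeholders, not the actual IDs used in the deployed app:

```r
# ui.R - minimal sketch (widget IDs are hypothetical)
library(shiny)

item_list <- readRDS("item_list.rds")

fluidPage(
  titlePanel("Product Recommender"),
  selectizeInput("customer_basket", "Select your products:",
                 choices  = item_list$Description,
                 multiple = TRUE),
  actionButton("submit", "Complete Your Purchase"),
  tableOutput("recommendations")
)
```

selectizeInput renders a searchable multi-select, which scales reasonably well to a product list of this size.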
Improved Collaborative Filtering
To show how the Improved Collaborative Filtering works, I am fitting the best performing model found in Part 2, the item-based CF, on the same made-up order. I am also doing the same using recommenderlab to compare the performance of the two approaches.
First, I re-create the made-up order using the same randomly selected products.
customer_order <- c("GREEN REGENCY TEACUP AND SAUCER",
"SET OF 3 BUTTERFLY COOKIE CUTTERS",
"JAM MAKING SET WITH JARS",
"SET OF TEA COFFEE SUGAR TINS PANTRY",
"SET OF 4 PANTRY JELLY MOULDS")
Next, I put new_order into a user-item matrix format.
# put in a matrix format
new_order <- item_list %>%
# Add a 'value' column with 1's for customer order items
mutate(value = as.numeric(Description %in% customer_order)) %>%
# Spread into sparse matrix format
spread(key = Description, value = value) %>%
# Change to a matrix
as.matrix() %>%
# Convert to class "dgCMatrix"
as("dgCMatrix")
Then, I add the new_order to the past_orders_matrix as its first entry.
# Bind the new order to the past orders matrix and transpose,
# so that items sit on the rows as required by the prediction function
all_orders_dgc <- t(rbind(new_order, past_orders_matrix))
Now, I need to set a number of parameters required by the Improved CF to work.
# Set range of items to calculate predictions for - here I select them all
items_to_predict <- 1:nrow(all_orders_dgc)
# Set current user to 1, which corresponds to new_order
users <- c(1)
# Set prediction indices
prediction_indices <- as.matrix(expand.grid(items_to_predict, users = users))
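To see what prediction_indices holds, here is a toy example with just 3 items and a single user: expand.grid() pairs every item index with the user index, producing one (item, user) coordinate per prediction to compute.

```r
# Toy example: 3 items, user 1
as.matrix(expand.grid(1:3, users = 1))
#      Var1 users
# [1,]    1     1
# [2,]    2     1
# [3,]    3     1
```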
I load the algorithm implementations and similarity calculations.
# Load algorithm implementations and similarity calculations
source("cf_algorithm.R")
source("similarity_measures.R")
And finally I can fit the item-based CF model with the Improved CF and check the runtime.
start <- Sys.time()
recomm <- predict_cf(all_orders_dgc, prediction_indices,
"ibcf", FALSE, cal_cos, 3, FALSE, 4000, 2000)
end <- Sys.time()
cat('runtime', end - start)
## runtime 0.630003
Wow! That was lightning fast!
Let’s now run the item-based CF model with recommenderlab and compare performances.
# Convert `all_orders_dgc` to class "realRatingMatrix"
# signature(from = "dgTMatrix", to = "realRatingMatrix")
all_orders_brm <- as(all_orders_dgc, "realRatingMatrix")
# Run the IBCF model on recommenderlab
start <- Sys.time()
recomm <- Recommender(all_orders_brm,
method = "IBCF",
param = list(k = 5))
end <- Sys.time()
cat('runtime', end - start)
## runtime 12.75939
The speed gain is around 20 times and is consistent with what Philipp Spachtholz witnessed in his work. This is rather promising for the Shiny App!
Deploying the App
For the App deployment I went for a Proof of Concept approach. My focus has been on speed of execution and on getting right all the calculations shown in this article to power the server
side of the App. This reflects on what is currently a minimalistic User Interface, which simply features product selection and a Complete Your Purchase action button.
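To give an idea of how the pieces above fit together on the server side, here is a simplified sketch. The input and output IDs (customer_basket, recommendations) are hypothetical, and the shape of the object returned by predict_cf() is assumed to be an items-by-users score matrix, following the SmartCat repository; the deployed app contains more wiring than shown here.

```r
# server.R - simplified sketch (input/output IDs are hypothetical)
library(shiny)
library(Matrix)

source("cf_algorithm.R")         # predict_cf()
source("similarity_measures.R")  # cal_cos() and friends

past_orders_matrix <- readRDS("past_orders_matrix.rds")
item_list          <- readRDS("item_list.rds")

function(input, output) {
  output$recommendations <- renderTable({
    req(input$customer_basket)
    # Encode the customer's selection as a 1-row binary user vector,
    # aligned with the columns of past_orders_matrix
    new_order <- Matrix(
      as.numeric(colnames(past_orders_matrix) %in% input$customer_basket),
      nrow = 1, sparse = TRUE)
    # Bind to past orders and transpose (items on rows), as in the article
    all_orders <- t(rbind(new_order, past_orders_matrix))
    # Predict scores for all items for user 1 (the new order)
    indices <- as.matrix(expand.grid(1:nrow(all_orders), users = 1))
    recomm  <- predict_cf(all_orders, indices,
                          "ibcf", FALSE, cal_cos, 3, FALSE, 4000, 2000)
    # Surface the top-scoring suggestions (assumed score layout)
    scores <- recomm[, 1]
    top <- head(order(scores, decreasing = TRUE), 10)
    data.frame(Suggestion = rownames(all_orders)[top])
  })
}
```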
Here’s a link to the Product Recommender for your perusal.
I will continue to work on the User Interface to add features that improve customer experience and make it more of a final product.
Here are a few ideas I’m toying with at the moment:
- Add Product Price and ability to select Product Quantity
- Add Product Image - with 4,000 items to choose from, this is a mini project in its own right!
- Ability to select item from list using first letter of Product Name
- Enhance the UI visuals with shinyjs and/or html
- Research how to implement a Hybrid approach, given that not all combinations of products currently return a suggestion
Comments
I have to admit that I’ve genuinely enjoyed working on this Market Basket Analysis project. Recommendation systems are a fascinating field of research with real-world applications and I feel that I’ve just scratched the surface.
I have also really appreciated learning the ropes of Shiny App development, which turned out to be more straightforward than I initially thought: reactivity is a key concept to get your head around, and it forces you to think of the User Interface and Server as two sides of the same coin.
The main consideration for me is that the potential is massive: even small companies with an online presence can benefit from implementing the most basic of recommendation systems. With only a few lines of code, one can improve customer experience, promote customer loyalty and boost sales.
Code Repository
The full R code can be found on my GitHub profile.
References
- For Recommenderlab Package see: https://cran.r-project.org/package=recommenderlab
- For Recommenderlab Package Vignette see: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
- For the SmartCat Improved Collaborative Filter see: https://www.smartcat.io/blog/2017/improved-r-implementation-of-collaborative-filtering/