
Market Basket Analysis - Part 1 of 3: Data Preparation and Exploratory Data Analysis

My objective for this piece of work is to carry out a Market Basket Analysis as an end-to-end data science project. I have split the output into three parts, of which this is the first, organised as follows:

  1. In the first chapter, I will source, explore and format a complex dataset suitable for modelling with recommendation algorithms.

  2. For the second part, I will apply various machine learning algorithms for Product Recommendation and select the best performing model. This will be done with the support of the recommenderlab package.

  3. In the third and final instalment, I will implement the selected model in a Shiny Web Application.

Introduction

Market Basket Analysis or MBA (also referred to as Affinity Analysis) is a set of data mining and data analysis techniques used to understand customers’ shopping behaviours and uncover relationships among the products that they buy. By analysing the likes of retail basket composition, transactional data, or browsing history, these methods can be used to suggest to customers items they might be interested in. Such recommendations can promote customer loyalty, increase conversion rates by helping customers find relevant products faster, and boost cross-selling and up-selling by suggesting additional items or services.

Recommendation systems are widely employed across several industries. Online retailer Amazon uses these methods to recommend to their customers products that other people often buy together (‘customers who bought this item also bought…’). Every Monday since 2015, entertainment company Spotify has suggested to its users a tailored playlist of 30 songs they’ve never listened to, based on their consumption history. And in 2009 Netflix famously awarded a $1M Grand Prize (The Netflix Prize) to the team that most improved the predictions of its movie recommendation system.

Market Basket Analysis also finds applications in such areas as:

  • Promotions: MBA may give indications on how to structure promotion calendars. When two items (say, torch & batteries) are often bought together, a retailer may decide to have only one item on promotion to boost sales of the other.

  • Store Design: store layout can be shaped to place closer together items which are more often bought at the same time (e.g. milk & cereals, shampoo & conditioner, meat & vegetables, etc.)

  • Loyalty Programmes: when a customer has been inactive for some time, he/she can be prompted to re-engage with customised offers based on their purchase history.

Types of recommendation systems

There are several types of recommendation systems, each with their advantages and drawbacks.

I’ve arranged below a brief overview of some of the best-known and most widely used methods.

Association Rules

Association Rules analysis looks at items most frequently purchased together to discover strong patterns in the data. Rules associating any pair of purchased items are established using three metrics: “Support”, “Confidence”, and “Lift”:

  • Support is the probability of the two items being purchased together: P(Item1 & Item2)
  • Confidence, for a rule Item1 → Item2, is Support divided by the probability of buying Item1: P(Item1 & Item2) / P(Item1)
  • Lift is Confidence divided by the probability of buying Item2: Confidence / P(Item2)

The rules with the highest Lift identify the strongest associations, and a Lift above 1 means two items are bought together more often than chance would suggest. One drawback is that this approach doesn’t take the user’s purchase history into consideration.
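To make these three metrics concrete, here is a minimal worked sketch on five made-up baskets (all item names and numbers are invented for illustration), computing Support, Confidence and Lift for the candidate rule “torch → batteries”:

```r
# Toy example on five hypothetical baskets: computing Support, Confidence
# and Lift for the candidate rule "torch -> batteries"
baskets <- list(
  c("torch", "batteries"),
  c("torch", "batteries", "bulb"),
  c("torch"),
  c("batteries", "bulb"),
  c("bulb")
)

# Proportion of baskets containing a given item
contains <- function(item) sapply(baskets, function(b) item %in% b)
p_torch     <- mean(contains("torch"))                          # 3/5
p_batteries <- mean(contains("batteries"))                      # 3/5
support     <- mean(contains("torch") & contains("batteries"))  # 2/5

confidence <- support / p_torch        # P(both) / P(torch)      = 2/3
lift       <- confidence / p_batteries # Confidence / P(batteries)
```

Here Lift works out at (2/3) / (3/5) ≈ 1.11, just above 1, so torch and batteries appear together slightly more often than independence would predict.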

Popular Items

As the name suggests, this approach bases its recommendations on general purchasing trends: suggestions come from a list of the most frequently purchased items that the customer has not yet bought. The main limitation of this approach is that the advice may not be entirely relevant to the individual customer.

Content-based Filtering

A content-based recommendation system investigates a client’s purchase history to understand his/her preferences and recommends items similar to those the client has liked in the past. However, this approach relies on detailed transaction information, which may not always be available.

Collaborative Filtering

When collaborative filtering is used, the recommendation system looks at either the clients’ purchase history or the clients’ product rating history. These are compiled into a user-item ratings matrix, with items (e.g. products) in the columns and users (e.g. customers or orders) in the rows. A measure of similarity is then calculated using methods like Cosine, Pearson or Jaccard similarity, and used to rank items by their likelihood of purchase.

Two main methodologies are used:

User-based collaborative filtering (UBCF) focuses on similarities between users’ market basket compositions to find other users with similar baskets. Similarity measures are then calculated and used to shape the suggestions list.

Item-based collaboartive filtering (IBCF) employes a similar appsoach but focuses instead on similarities between items that are frequently purchased together.

Hybrid Systems

Hybrid systems try to combine some of the approaches to overcome their limitations. There is a rich literature available online, in particular on the combining of collaborative filtering and content-based filtering, which in some cases have shown to be more effective than the pure approaches.

The Wikipedia article on Hybrid recommender systems also makes a very good read on the subject.

Data

The data for this project comes from the UCI Machine Learning Repository, an online archive of large datasets which includes a wide variety of data types, analysis tasks, and application areas.

In this project I’m using the Online Retail dataset, donated to UCI in 2015 by the School of Engineering at London South Bank University. This is a transnational dataset containing transactions occurring between 01/Dec/2010 and 09/Dec/2011 for a UK-based and registered online retail company. The company sells mainly unique all-occasion gifts and many of their customers are wholesalers.

For clarity and completeness, I’m including the variable descriptions here:

  • InvoiceNo: Invoice number; a 6-digit integral number. Indicates a cancellation if it starts with ‘C’
  • StockCode: Product (item) code; a 5-digit integral number
  • Description: Product (item) name
  • Quantity: The quantity of each product (item) per transaction (numeric)
  • InvoiceDate: Invoice date and time when each transaction was generated
  • UnitPrice: Unit price; product price per unit in sterling (numeric)
  • CustomerID: Customer number; a 5-digit integral number unique to each customer
  • Country: The name of the country where each customer resides

I’ve picked this particular dataset because of its “real life” nature, which shows in the many manually entered annotations and adjustment codes I found in the Description field. There are even personal comments left by the company’s employees, who likely used the database on a day-to-day basis.

It’s reasonable to assume that the dataset came straight from the company’s database with very few alterations. In my experience, this is consistent with the state in which an analyst is most likely to receive data from a client they are conducting a piece of analysis for.

Data Preparation

Loading the Packages

# Importing libraries
library(data.table)           # fast data manipulation
library(readxl)               # reading Excel files
library(tidyverse)            # data wrangling and visualisation
library(lubridate)            # date and time handling
library(skimr)                # summary statistics
library(knitr)                # dynamic report generation
library(treemap)              # treemap visualisations

Loading and Inspecting the Data

# import raw data file and trim leading and trailing whitespaces
retail <- read_excel("Online Retail.xlsx", trim_ws = TRUE)

The dataset consists of just over 540,000 observations spread across 8 variables. A few Descriptions and several CustomerIDs are missing, and there are also some odd negative Quantity and UnitPrice values that would be worth investigating. It’s also worth noting that InvoiceDate is of POSIXct format, from which information about Date and Time of purchase can be extracted.

# First glance at the data
retail %>%  skim()
## Skim summary statistics
##  n obs: 541909 
##  n variables: 8 
## 
## -- Variable type:character --------------------------------------------------------------------------
##     variable missing complete      n min max empty n_unique
##      Country       0   541909 541909   3  20     0       38
##  Description    1454   540455 541909   1  35     0     4211
##    InvoiceNo       0   541909 541909   6   7     0    25900
##    StockCode       0   541909 541909   1  12     0     4070
## 
## -- Variable type:numeric ----------------------------------------------------------------------------
##    variable missing complete      n     mean      sd        p0      p25
##  CustomerID  135080   406829 541909 15287.69 1713.6   12346    13953   
##    Quantity       0   541909 541909     9.55  218.08 -80995        1   
##   UnitPrice       0   541909 541909     4.61   96.76 -11062.06     1.25
##       p50      p75  p100
##  15152    16791    18287
##      3       10    80995
##      2.08     4.13 38970
## 
## -- Variable type:POSIXct ----------------------------------------------------------------------------
##     variable missing complete      n        min        max     median
##  InvoiceDate       0   541909 541909 2010-12-01 2011-12-09 2011-07-19
##  n_unique
##     23260

Cancellations

The very handy Variables Description, which I’ve included above, tells us that if the InvoiceNo starts with the letter ‘C’, it indicates a cancellation.

retail %>% 
  filter(grepl("^C", InvoiceNo)) %>% 
  summarise(Total = n())
## # A tibble: 1 x 1
##   Total
##   <int>
## 1  9288

Cancellations are not needed for this analysis, so they can be removed.

retail  <- retail %>% 
  filter(!grepl("^C", InvoiceNo)) 

# CHECK: total row count - 532,621

Negative Quantities

When filtering by non-positive Quantity, the Description field shows what looks like a series of manually entered comments (e.g. “thrown away”, “Unsaleable”, “damaged”, “?”). Given that UnitPrice is also set to zero for all of them, it is safe to assume that these were adjustment codes.

retail %>% 
  filter(Quantity <= 0) %>% 
  group_by(Description, UnitPrice) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  ungroup()
## # A tibble: 139 x 3
##    Description            UnitPrice count
##    <chr>                      <dbl> <int>
##  1 <NA>                           0   862
##  2 check                          0   120
##  3 damages                        0    45
##  4 damaged                        0    42
##  5 ?                              0    41
##  6 sold as set on dotcom          0    20
##  7 Damaged                        0    14
##  8 thrown away                    0     9
##  9 Unsaleable, destroyed.         0     9
## 10 ??                             0     7
## # ... with 129 more rows

Were this a live project, I would typically sense-check this assumption with whoever provided the data. In this case, I will simply remove all rows with non-positive Quantity.

retail  <- retail %>% 
  filter(Quantity > 0)

# CHECK: total row count - 531,285

Non-Product StockCodes

Turning my attention to StockCode, I notice a handful of non-product related codes (‘Postage’, ‘Bank Charges’, ‘Gift Vouchers’, etc.).

# Non-product related codes
stc <- c('AMAZONFEE', 'BANK CHARGES', 'C2', 'DCGSSBOY', 'DCGSSGIRL',
         'DOT', 'gift_0001_', 'PADS', 'POST')
retail %>%  
  filter(grepl(paste(stc, collapse = "|"), StockCode))  %>% 
  group_by(StockCode, Description) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  ungroup()
## # A tibble: 19 x 3
##    StockCode    Description                        count
##    <chr>        <chr>                              <int>
##  1 POST         POSTAGE                             1126
##  2 DOT          DOTCOM POSTAGE                       708
##  3 C2           CARRIAGE                             141
##  4 DCGSSGIRL    GIRLS PARTY BAG                       13
##  5 BANK CHARGES Bank Charges                          12
##  6 DCGSSBOY     BOYS PARTY BAG                        11
##  7 gift_0001_20 Dotcomgiftshop Gift Voucher £20.00     9
##  8 gift_0001_10 Dotcomgiftshop Gift Voucher £10.00     8
##  9 gift_0001_30 Dotcomgiftshop Gift Voucher £30.00     7
## 10 gift_0001_50 Dotcomgiftshop Gift Voucher £50.00     4
## 11 PADS         PADS TO MATCH ALL CUSHIONS             4
## 12 POST         <NA>                                   4
## 13 gift_0001_40 Dotcomgiftshop Gift Voucher £40.00     3
## 14 AMAZONFEE    AMAZON FEE                             2
## 15 C2           <NA>                                   1
## 16 DOT          <NA>                                   1
## 17 gift_0001_10 <NA>                                   1
## 18 gift_0001_20 to push order througha s stock was     1
## 19 gift_0001_30 <NA>                                   1

These can all be removed.

retail <- filter(retail, !grepl(paste(stc, collapse = "|"), StockCode))

# CHECK: total row count - 529,228

Description

Focusing now on the Description field, there are an additional 50 or so manually entered annotations that need removing. In one case an employee has even vented their frustration at one of their co-workers (“alan hodge cant mamage this section”), misspellings and all!

# Additional adjustment codes to remove
descr <- c( "check", "check?", "?", "??", "damaged", "found", 
            "adjustment", "Amazon", "AMAZON", "amazon adjust", 
            "Amazon Adjustment", "amazon sales", "Found", "FOUND",
            "found box", "Found by jackie ", "Found in w/hse", "dotcom",
            "dotcom adjust", "allocate stock for dotcom orders ta", "FBA",
            "Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
            "wrongly sold (22719) barcode", "wrongly marked 23343",
            "dotcomstock", "rcvd be air temp fix for dotcom sit", "Manual",
            "John Lewis", "had been put aside", "for online retail orders",  
            "taig adjust", "amazon", "incorrectly credited C550456 see 47",
            "returned", "wrongly coded 20713", "came coded as 20713", 
            "add stock to allocate online orders", "Adjust bad debt",
            "alan hodge cant mamage this section", "website fixed",
            "did  a credit  and did not tick ret", "michel oops",
            "incorrectly credited C550456 see 47", "mailout", "test",
            "Sale error",  "Lighthouse Trading zero invc incorr", "SAMPLES",
            "Marked as 23343", "wrongly coded 23343","Adjustment", 
            "rcvd be air temp fix for dotcom sit", "Had been put aside."
          )

Filtering out the unwanted entries.

retail <- retail %>% 
  filter(!Description %in% descr)

# CHECK: total row count - 528,732

Last but not least, there are also nearly 600 NAs in Description.

sum(is.na(retail$Description))
## [1] 584

Potentially, I could have used StockCode to derive the missing entries, but given their small number (around 0.1% of the total) I’ve opted to just remove them.

retail <- retail %>% 
  filter(!is.na(Description))

# CHECK: total row count - 528,148
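For completeness, here is a minimal sketch (not applied to the analysis, and using a made-up toy tibble) of how missing Descriptions could have been derived from StockCode, by borrowing the most frequent Description seen for the same code:

```r
# Sketch only: fill a missing Description with the most frequent
# Description observed for the same StockCode ('toy' is made-up data)
library(tidyverse)

toy <- tibble(
  StockCode   = c("A1", "A1", "A1", "B2"),
  Description = c("RED MUG", "RED MUG", NA, NA)
)

# Most common non-missing Description per StockCode
lookup <- toy %>%
  filter(!is.na(Description)) %>%
  count(StockCode, Description) %>%
  group_by(StockCode) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(StockCode, Imputed = Description)

# Replace NAs where a lookup value exists; codes never seen with a
# Description (like "B2") remain NA
filled <- toy %>%
  left_join(lookup, by = "StockCode") %>%
  mutate(Description = coalesce(Description, Imputed)) %>%
  select(-Imputed)
```

Note that this only helps for StockCodes that appear elsewhere with a valid Description, which is another reason simple removal was the pragmatic choice here.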

Customer ID

There is still a significant number of NAs in CustomerID, which I will leave as they are.

retail$CustomerID %>%  
  skim()
## 
## Skim summary statistics
## 
## -- Variable type:numeric ----------------------------------------------------------------------------
##  variable missing complete      n    mean      sd    p0   p25   p50   p75
##         .  131778   396370 528148 15301.6 1709.98 12346 13975 15159 16803
##   p100
##  18287

As I will discuss in the modelling part, for the analysis I need to arrange data in a user-item format, where “users” can be either customers or orders. Given that there are almost 5 times as many Orders as there are Customers, I am going to use InvoiceNo for orders in the analysis, which should make for a richer information set.

sapply(retail[,c('InvoiceNo','CustomerID')], function(x) length(unique(x)))
##  InvoiceNo CustomerID 
##      19792       4336
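As a preview of that format, here is a small sketch (with made-up transactions; names and invoice numbers are invented) of casting data into a binary order-by-product matrix, with InvoiceNo playing the role of the “user”:

```r
# Made-up mini transaction list, cast to a binary user-item layout
# (orders in the rows, products in the columns, 1 = item in the order)
library(tidyverse)

toy <- tibble(
  InvoiceNo   = c(536365, 536365, 536366, 536366, 536367),
  Description = c("MUG", "CANDLE", "MUG", "BUNTING", "CANDLE")
)

ratings <- toy %>%
  distinct(InvoiceNo, Description) %>%
  mutate(value = 1L) %>%
  pivot_wider(names_from = Description, values_from = value, values_fill = 0L)
```

In Part 2 the same reshaping will be done on the full dataset and converted into a recommenderlab rating matrix.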

Final touches

There are a couple of housekeeping tasks to sort out and I am ready to go!

retail <- retail %>%
# Setting 'Description' and 'Country' as factors
  mutate(Description = as.factor(Description)) %>%
  mutate(Country = as.factor(Country)) %>% 
# Changing 'InvoiceNo' type to numeric
  mutate(InvoiceNo = as.numeric(InvoiceNo)) %>% 
# Extracting 'Date' and 'Time' from 'InvoiceDate'
  mutate(Date = as.Date(InvoiceDate)) %>% 
  mutate(Time = as.factor(format(InvoiceDate,"%H:%M:%S"))) 

glimpse(retail)
## Observations: 528,148
## Variables: 10
## $ InvoiceNo   <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 53...
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LA...
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12...
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1....
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 1...
## $ Country     <fct> United Kingdom, United Kingdom, United Kingdom, Un...
## $ Date        <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2...
## $ Time        <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...

Exploratory Data Analysis

And I am now ready to start exploring the dataset’s different features.

What items do people buy more often?

The heart-shaped tea light holder is the most popular item.

retail %>% 
  group_by(Description) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  arrange(desc(count)) %>% 
  ggplot(aes(x = reorder(Description, count), y = count)) +
  geom_bar(stat = "identity", fill = "royalblue", colour = "blue") +
  labs(x = "", y = "Top 10 Best Sellers") +
  coord_flip() +
  theme_grey(base_size = 12)

The top 10 most sold products represent around 3% of the total items sold by the company.

retail %>% 
  group_by(Description) %>% 
  summarize(count = n()) %>% 
  mutate(pct = (count/sum(count))*100) %>% 
  arrange(desc(pct)) %>% 
  ungroup() %>% 
  top_n(10, wt = pct)
## # A tibble: 10 x 3
##    Description                        count   pct
##    <fct>                              <int> <dbl>
##  1 WHITE HANGING HEART T-LIGHT HOLDER  2327 0.441
##  2 JUMBO BAG RED RETROSPOT             2115 0.400
##  3 REGENCY CAKESTAND 3 TIER            2019 0.382
##  4 PARTY BUNTING                       1707 0.323
##  5 LUNCH BAG RED RETROSPOT             1594 0.302
##  6 ASSORTED COLOUR BIRD ORNAMENT       1489 0.282
##  7 SET OF 3 CAKE TINS PANTRY DESIGN    1399 0.265
##  8 PACK OF 72 RETROSPOT CAKE CASES     1370 0.259
##  9 LUNCH BAG  BLACK SKULL.             1328 0.251
## 10 NATURAL SLATE HEART CHALKBOARD      1263 0.239

What time of day do people buy more often?

Lunchtime is the preferred time for shopping online, with the majority of orders placed between 12 noon and 3pm.

retail %>% 
  ggplot(aes(hour(hms(Time)))) + 
  geom_bar(fill = "#E69F00", colour = "red") +
  labs(x = "Hour of Day", y = "") +
  theme_grey(base_size = 12)

What day of the week do people buy more often?

Orders peak on Thursdays, with no orders processed on Saturdays.

retail %>% 
  ggplot(aes(wday(Date, 
                  week_start = getOption("lubridate.week.start", 1)))) + 
  geom_bar(fill = "forest green", colour = "dark green") +
  labs(x = "Day of Week", y = "") +
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
  theme_grey(base_size = 14)

How many items does each customer buy?

In the large majority of orders, customers purchase on average between 2 and 15 units of each item, with a peak at 2.

retail %>% 
  group_by(InvoiceNo) %>% 
  summarise(n = mean(Quantity)) %>%
  ggplot(aes(x = n)) +
  geom_histogram(bins = 100000, fill = "purple", colour = "black") + 
  coord_cartesian(xlim = c(0,100)) +
  scale_x_continuous(breaks = seq(0,100,10)) +
  labs(x = "Average Number of Items per Purchase", y = "") +
  theme_grey(base_size = 14)

What is the average value per order?

The bulk of orders have a value below £20, with the distribution showing a double peak, one at £6 and a more pronounced one at £17.

retail %>% 
  mutate(Value = UnitPrice * Quantity) %>% 
  group_by(InvoiceNo) %>% 
  summarise(n = mean(Value)) %>%
  ggplot(aes(x = n)) +
  geom_histogram(bins = 200000, fill = "firebrick3", colour = "sandybrown") + 
  coord_cartesian(xlim = c(0,100)) +
  scale_x_continuous(breaks = seq(0,100,10)) +
  labs(x = "Average Value per Purchase", y = "") + 
  theme_grey(base_size = 14)

Which countries do they sell their goods to?

Five sixths of the orders come from the United Kingdom.

treemap(retail,
        index      = c("Country"),
        vSize      = "Quantity",
        title      = "",
        palette    = "Set2",
        border.col = "grey40")

Comments

This concludes the data preparation and visualisation part of the project. I have removed Cancellations, eliminated negative Quantity and UnitPrice values, got rid of NAs in Description, and created two new variables, Date and Time. A total of 13,761 rows (roughly 2.5% of the initial count) were discarded, and the dataset now has 528,148 observations.

Code Repository

The full R code can be found on my GitHub profile.
