/ #Data Wrangling 

Loading, Merging and Joining Datasets

This is the minimal coding necessary to assemble various data feeds and sort out the likes of variables naming & new features creation plus some general housekeeping tasks. It includes general housekeeping tasks like sorting variables names, creating essential features and sorting out variables order

I will continue to add to this code should the need arise for other features to be created.

The Dataset

library(tidyverse)
library(lubridate)
library(readr)

The dataset I’m using here accompanies a Redbooks publication called Building 360-Degree Information Applications which is available as a free PDF download. Also available as a free download are the excel files in the Additional Material section to follow along with the exercises in the book.

The data covers 3 & 12 years worth of sales orders for the Sample Outdoors Company, a fictitious B2B outdoor equipment retailer enterprise. The data comes with details about the products they sell as well as their customers (which are retailers in their case). The data is a subset of the GSDB database, a realistic and feature-rich database created by IBM® to support demos for their products.

Load, Merge and Join

I will start with sales transactions as they are distributed across almost 900 separate excel files.

First, I load and merge them at once from a single folder and bind them by row into an orders file.

orders <- list.files(path = "order-details",
           pattern = "*.csv", 
           full.names = T) %>% 
   map_df(~readr::read_csv(.))

NOTE that all files have the same format with each single variable sitting on the same column in each file

orders %>% glimpse()
## Observations: 446,023
## Variables: 45
## $ order_date            <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code      <dbl> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method_it       <chr> "Reparto vendite", "Reparto vendite", "F...
## $ order_method_en       <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_method_nl       <chr> "Vertegenwoordiger", "Vertegenwoordiger"...
## $ order_detail_code     <dbl> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ order_method_no       <chr> "Salgsbesøk", "Salgsbesøk", "Faks", "Fak...
## $ order_method_el       <chr> "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "F...
## $ unit_cost             <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_method_tc       <chr> "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", ...
## $ order_detail_code_2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer_name         <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb      <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ order_method_cs       <chr> "Návšteva prodejce", "Návšteva prodejce"...
## $ order_method_hu       <chr> "Kiszállásos eladás", "Kiszállásos eladá...
## $ order_method_ms       <chr> "Lawatan jualan", "Lawatan jualan", "Fak...
## $ order_method_th       <chr> "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E41><U+0E1F><U+0E01><U+0E0B><U+0E4C>",...
## $ quantity              <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2        <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code     <dbl> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_method_id       <chr> "Kunjungan penjualan", "Kunjungan penjua...
## $ order_number          <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ order_method_sc       <chr> "<U+9500><U+552E><U+9762><U+8BBF>", "<U+9500><U+552E><U+9762><U+8BBF>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", ...
## $ order_method_fi       <chr> "Myyntikäynti", "Myyntikäynti", "Faksi",...
## $ order_method_ru       <chr> "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0424><U+0430><U+043A><U+0441>", "<U+0424>...
## $ retailer_site_code    <dbl> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price       <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ order_method_da       <chr> "Sælgerbesøg", "Sælgerbesøg", "Fax", "Fa...
## $ order_method_pl       <chr> "Wizyta", "Wizyta", "Faks", "Faks", "Fak...
## $ order_method_de       <chr> "Verkaufsbesuch", "Verkaufsbesuch", "Fax...
## $ order_method_ko       <chr> "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", ...
## $ ship_date             <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ order_method_sv       <chr> "Säljbesök", "Säljbesök", "Fax", "Fax", ...
## $ order_method_fr       <chr> "Visite d'un représentant", "Visite d'un...
## $ order_method_code_2   <dbl> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ unit_price            <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_method_pt       <chr> "Visita vendas", "Visita vendas", "Fax",...
## $ order_method_es       <chr> "Visita de ventas", "Visita de ventas", ...
## $ order_close_date      <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <dbl> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ order_method_code     <dbl> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ order_method_ja       <chr> "<U+5E97><U+982D>", "<U+5E97><U+982D>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "...
## $ product_number        <dbl> 41110, 69110, 75110, 76110, 85110, 65110...

A bit of a clean up is required as there are 45 orders variables but most of them are not necessary. For instance, order_method is available in many languages but I only need the one in English.

I start by removing all order_method other than English.

orders <- 
   orders %>% 
   rename(
      order_method = order_method_en,
      retailer = retailer_name
      )  %>% 
   select(
      -contains('order_method_')
      ) 
orders %>% glimpse()
## Observations: 446,023
## Variables: 21
## $ order_date            <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code      <dbl> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method          <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code     <dbl> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost             <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer              <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb      <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity              <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2        <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code     <dbl> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number          <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code    <dbl> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price       <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date             <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price            <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date      <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <dbl> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number        <dbl> 41110, 69110, 75110, 76110, 85110, 65110...

Then I load the products file, which, as it says on the tin, contains the products information

# load products file
products <- 
   read_csv('products.csv')
products %>% glimpse()
## Observations: 274
## Variables: 20
## $ product_description <chr> "The Venue has a fun and fashionable overs...
## $ base_product_number <dbl> 125, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ product_line        <chr> "Personal Accessories", "Camping Equipment...
## $ production_cost     <dbl> 0.00, 4.00, 9.22, 15.93, 5.00, 34.97, 85.1...
## $ product_image       <chr> "P65PA3WT10.jpg", "P01CE1CG1.jpg", "P02CE1...
## $ product_type        <chr> "Watches", "Cooking Gear", "Cooking Gear",...
## $ color               <chr> "Red", "Clear", "Brown", "Unspecified", "S...
## $ gross_margin        <dbl> 0.00, 0.33, 0.23, 0.28, 0.28, 0.30, 0.28, ...
## $ product_line_code   <dbl> 993, 991, 991, 991, 991, 991, 991, 991, 99...
## $ product_name        <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ introduction_date   <dttm> 2005-04-01, 1995-02-15, 1995-02-15, 1995-...
## $ title               <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ product_type_code   <dbl> 960, 951, 951, 951, 951, 951, 951, 951, 95...
## $ product_size        <chr> "One-size", "10 liters", "2 liters", "15-p...
## $ brand               <chr> "Relax", "TrailChef", "TrailChef", "TrailC...
## $ discontinued_date   <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ product_brand_code  <dbl> 756, 701, 701, 701, 701, 701, 701, 701, 70...
## $ product_size_code   <dbl> 853, 808, 807, 825, 804, 823, 824, 845, 84...
## $ product_color_code  <dbl> 922, 908, 906, 924, 923, 923, 923, 923, 91...
## $ product_number      <dbl> 125150, 1110, 2110, 3110, 4110, 5110, 6110...

Finally, the retailers file

# load retailer info
retailers <- 
   read_csv('retailers.csv')
retailers %>% glimpse()
## Observations: 847
## Variables: 13
## $ REGION_CODE        <dbl> 710, 710, 710, 710, 710, 710, 710, 710, 710...
## $ REGION_EN          <chr> "Americas", "Americas", "Americas", "Americ...
## $ COUNTRY_CODE       <dbl> 1004, 1004, 1004, 1004, 1004, 1004, 1004, 1...
## $ COUNTRY_KEY        <dbl> 90002, 90002, 90002, 90002, 90002, 90002, 9...
## $ COUNTRY_EN         <chr> "Canada", "Canada", "Canada", "Canada", "Ca...
## $ RETAILER_KEY       <dbl> 6789, 6789, 6789, 6790, 6790, 6790, 6790, 6...
## $ RETAILER_CODE      <dbl> 1189, 1189, 1189, 1190, 1190, 1190, 1190, 1...
## $ RETAILER_NAME      <chr> "Rock Steady", "Rock Steady", "Rock Steady"...
## $ RETAILER_SITE_KEY  <dbl> 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5...
## $ RETAILER_SITE_CODE <dbl> 20001, 20002, 20003, 20004, 20005, 20006, 2...
## $ RTL_CITY           <chr> "Winnipeg", "Montr\xe9al", "Fredericton", "...
## $ RETAILER_TYPE_CODE <dbl> 6, 6, 6, 8, 8, 8, 8, 3, 2, 2, 2, 2, 2, 1, 7...
## $ RETAILER_TYPE_EN   <chr> "Outdoors Shop", "Outdoors Shop", "Outdoors...

As R is case sensitive, I’m setting retailers column names to lower-case

retailers <- 
   retailers %>% 
   rename_all(tolower) 

Now I can joining orders, products and retailers info into one data frame

df <- 
   orders %>% 
   left_join(products, by = ('product_number')) %>% 
   left_join(retailers, by = ('retailer_site_code')) 

Creating essential features & some housekeeping tasks

One long piece of code to sort all in one go.

I’ve added comments to each block to explain what it’s doing.

orders_tbl   <- 
   
    # create revenue, total product cost and gross profit
   df %>%
      mutate(
         revenue       = quantity * unit_price,
         tot_prod_cost = quantity * unit_cost,
         gross_profit  = revenue - tot_prod_cost
      ) %>% 
      
      # replacing NAs in the return_count variable
   replace_na(list(return_count = 0)) %>% 
      
    # Shorten product line names for readablility and ease of use in charts and code
    
   mutate(
      prod_line = case_when(
         product_line == 'Camping Equipment' ~ 'Camping Eqpt',
         product_line == 'Golf Equipment' ~ 'Golf Eqpt',
         product_line == 'Mountaineering Equipment' ~ 'Mountain Eqpt',
         product_line == 'Personal Accessories' ~ 'Personal Acces',
         product_line == 'Outdoor Protection' ~ 'Outdoor Prot',
         TRUE ~ product_line
      ),
      
      prod_line_2 = case_when(
         product_line == 'Camping Equipment' ~ 'Camping Eqpt',
         product_line == 'Golf Equipment' ~ 'Golf Eqpt',
         product_line == 'Mountaineering Equipment' ~ 'Mountain Eqpt',
         product_line == 'Personal Accessories' ~ 'Personal Acces',
         product_line == 'Outdoor Protection' ~ 'Personal Acces',
         TRUE ~ product_line
      ),
      
      # create alternative regional group
      region2 = case_when(
         country_en %in% c('United Kingdom', 'France', 'Spain',  
                        'Netherlands','Belgium','Switzerland') ~ 'West Europe',
         country_en %in% c('Germany', 'Italy', 'Finland',  
                        'Austria','Sweden','Denmark') ~ 'East Europe',
         TRUE ~ region_en
      )
   ) %>% 
   
   # create financial years
   mutate(
      ord_date = ymd(order_date),
      fin_year = case_when(
         between(ord_date, ymd("2004-07-01"), ymd('2005-06-30')) ~ 'FY_04_05',
         between(ord_date, ymd("2005-07-01"), ymd('2006-06-30')) ~ 'FY_05_06',
         between(ord_date, ymd("2006-07-01"), ymd('2007-06-30')) ~ 'FY_06_07',
         TRUE ~ 'other'
      ),
      
      # create all quarters
      quarter_all = case_when(
         between(ord_date, ymd("2004-01-01"), ymd('2004-03-31')) ~ '04_Q1',
         between(ord_date, ymd("2004-04-01"), ymd('2004-06-30')) ~ '04_Q2',
         between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
         between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
         between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
         between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
         between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
         between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
         between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
         between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
         between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
         between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
         between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
         between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
         between(ord_date, ymd("2007-07-01"), ymd('2007-09-30')) ~ '07_Q3',
         TRUE ~ 'other'
      ),
      
      # create selected quarters
      quarter_sel = case_when(
         between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
         between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
         between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
         between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
         between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
         between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
         between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
         between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
         between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
         between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
         between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
         between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
         TRUE ~ 'other'
      )
      
      
   ) %>% 
   
   # reorder columns and rename a few
      select(
      order_number,
      order_date,
      order_close_date,
      order_ship_date = ship_date,
      fin_year,
      quarter_all,
      quarter_sel,
      order_method,
      retailer,
      retailer_code,
      retailer_type   = retailer_type_en,
      region          = region_en,
      region2,
      country         = country_en,
      city            = rtl_city,
      promotion_code,
      return          = return_count,
      quantity,
      unit_price,
      unit_sale_price,
      unit_cost,
      unit_prod_cost  = production_cost,
      unit_gross_marg = gross_margin,
      revenue,
      tot_prod_cost,
      gross_profit,
      prod_numb       = product_number,
      prod_line,
      prod_line_2,
      prod_type       = product_type,
      prod_name       = product_name,
      brand,
      color,
      size            = product_size,
      intro_date      = introduction_date,
      halt_date       = discontinued_date
   ) 
orders_tbl %>% glimpse()
## Observations: 446,023
## Variables: 36
## $ order_number     <dbl> 100001, 100001, 100002, 100002, 100002, 10000...
## $ order_date       <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 2004-01-...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 2004-01-...
## $ order_ship_date  <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 2004-01-...
## $ fin_year         <chr> "other", "other", "other", "other", "other", ...
## $ quarter_all      <chr> "04_Q1", "04_Q1", "04_Q1", "04_Q1", "04_Q1", ...
## $ quarter_sel      <chr> "other", "other", "other", "other", "other", ...
## $ order_method     <chr> "Sales visit", "Sales visit", "Fax", "Fax", "...
## $ retailer         <chr> "Kavanagh Sports", "Kavanagh Sports", "Ar fre...
## $ retailer_code    <dbl> 1398, 1398, 1760, 1760, 1760, 1760, 1760, 176...
## $ retailer_type    <chr> "Sports Store", "Sports Store", "Outdoors Sho...
## $ region           <chr> "Americas", "Americas", "Americas", "Americas...
## $ region2          <chr> "Americas", "Americas", "Americas", "Americas...
## $ country          <chr> "United States", "United States", "Brazil", "...
## $ city             <chr> "San Clara", "San Clara", "Sao Paulo", "Sao P...
## $ promotion_code   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ return           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ quantity         <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 1107, 8...
## $ unit_price       <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81.55, ...
## $ unit_sale_price  <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75.84, ...
## $ unit_cost        <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.00, 2....
## $ unit_prod_cost   <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.00, 3....
## $ unit_gross_marg  <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50, 0.6...
## $ revenue          <dbl> 8983.04, 10120.00, 19389.78, 6969.44, 26610.4...
## $ tot_prod_cost    <dbl> 3998.72, 4571.48, 12960.00, 4047.16, 13058.78...
## $ gross_profit     <dbl> 4984.32, 5548.52, 6429.78, 2922.28, 13551.62,...
## $ prod_numb        <dbl> 41110, 69110, 75110, 76110, 85110, 65110, 100...
## $ prod_line        <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_line_2      <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_type        <chr> "Lanterns", "Eyewear", "Knives", "Knives", "N...
## $ prod_name        <chr> "Flicker Lantern", "Polar Ice", "Edge Extreme...
## $ brand            <chr> "Firefly", "Polar", "Extreme", "Edge", "Extre...
## $ color            <chr> "Orange", "Black", "Red", "Black", "Black", "...
## $ size             <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "Unspe...
## $ intro_date       <dttm> 2005-01-10, 1995-02-15, 2000-10-26, 1995-02-...
## $ halt_date        <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Remove original files and save

Last but not least, I can drop all original files and save the orders_tbl

# remove original files
rm(df, products, retailers, orders)

# Save orders as RDS
orders_tbl %>%
   write_rds("orders_tbl.rds")

Code Repository

The full R code can be found on my GitHub profile

NOTE: given their size, the RDS file and the order-details folder had to be compress before uploading them on my Github profile