Loading, Merging and Joining Datasets
This is the minimal coding necessary to assemble various data feeds and sort out the likes of variables naming & new features creation plus some general housekeeping tasks. It includes general housekeeping tasks like sorting variables names, creating essential features and sorting out variables order
I will continue to add to this code should the need arise for other features to be created.
The Dataset
library(tidyverse)
library(lubridate)
library(readr)
The dataset I’m using here accompanies a Redbooks publication called Building 360-Degree Information Applications which is available as a free PDF download. Also available as a free download are the excel files in the Additional Material section to follow along with the exercises in the book.
The data covers 3 & 1⁄2 years worth of sales orders
for the Sample Outdoors Company, a fictitious B2B outdoor equipment retailer enterprise. The data comes with details about the products
they sell as well as their customers (which are retailers
in their case). The data is a subset of the GSDB database, a realistic and feature-rich database created by IBM® to support demos for their products.
Load, Merge and Join
I will start with sales transactions as they are distributed across almost 900 separate excel files.
First, I load and merge them at once from a single folder and bind them by row into an orders
file.
orders <- list.files(path = "order-details",
pattern = "*.csv",
full.names = T) %>%
map_df(~readr::read_csv(.))
NOTE that all files have the same format with each single variable sitting on the same column in each file
orders %>% glimpse()
## Observations: 446,023
## Variables: 45
## $ order_date <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <dbl> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method_it <chr> "Reparto vendite", "Reparto vendite", "F...
## $ order_method_en <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_method_nl <chr> "Vertegenwoordiger", "Vertegenwoordiger"...
## $ order_detail_code <dbl> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ order_method_no <chr> "Salgsbesøk", "Salgsbesøk", "Faks", "Fak...
## $ order_method_el <chr> "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "F...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_method_tc <chr> "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", ...
## $ order_detail_code_2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer_name <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ order_method_cs <chr> "Návšteva prodejce", "Návšteva prodejce"...
## $ order_method_hu <chr> "Kiszállásos eladás", "Kiszállásos eladá...
## $ order_method_ms <chr> "Lawatan jualan", "Lawatan jualan", "Fak...
## $ order_method_th <chr> "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E41><U+0E1F><U+0E01><U+0E0B><U+0E4C>",...
## $ quantity <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <dbl> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_method_id <chr> "Kunjungan penjualan", "Kunjungan penjua...
## $ order_number <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ order_method_sc <chr> "<U+9500><U+552E><U+9762><U+8BBF>", "<U+9500><U+552E><U+9762><U+8BBF>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", ...
## $ order_method_fi <chr> "Myyntikäynti", "Myyntikäynti", "Faksi",...
## $ order_method_ru <chr> "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0424><U+0430><U+043A><U+0441>", "<U+0424>...
## $ retailer_site_code <dbl> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ order_method_da <chr> "Sælgerbesøg", "Sælgerbesøg", "Fax", "Fa...
## $ order_method_pl <chr> "Wizyta", "Wizyta", "Faks", "Faks", "Fak...
## $ order_method_de <chr> "Verkaufsbesuch", "Verkaufsbesuch", "Fax...
## $ order_method_ko <chr> "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", ...
## $ ship_date <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ order_method_sv <chr> "Säljbesök", "Säljbesök", "Fax", "Fax", ...
## $ order_method_fr <chr> "Visite d'un représentant", "Visite d'un...
## $ order_method_code_2 <dbl> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_method_pt <chr> "Visita vendas", "Visita vendas", "Fax",...
## $ order_method_es <chr> "Visita de ventas", "Visita de ventas", ...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <dbl> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ order_method_code <dbl> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ order_method_ja <chr> "<U+5E97><U+982D>", "<U+5E97><U+982D>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "...
## $ product_number <dbl> 41110, 69110, 75110, 76110, 85110, 65110...
A bit of a clean up is required as there are 45 orders
variables but most of them are not necessary. For instance, order_method
is available in many languages but I only need the one in English.
I start by removing all order_method
other than English.
orders <-
orders %>%
rename(
order_method = order_method_en,
retailer = retailer_name
) %>%
select(
-contains('order_method_')
)
orders %>% glimpse()
## Observations: 446,023
## Variables: 21
## $ order_date <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <dbl> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code <dbl> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <dbl> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code <dbl> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <dbl> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number <dbl> 41110, 69110, 75110, 76110, 85110, 65110...
Then I load the products
file, which, as it says on the tin, contains the products information
# load products file
products <-
read_csv('products.csv')
products %>% glimpse()
## Observations: 274
## Variables: 20
## $ product_description <chr> "The Venue has a fun and fashionable overs...
## $ base_product_number <dbl> 125, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ product_line <chr> "Personal Accessories", "Camping Equipment...
## $ production_cost <dbl> 0.00, 4.00, 9.22, 15.93, 5.00, 34.97, 85.1...
## $ product_image <chr> "P65PA3WT10.jpg", "P01CE1CG1.jpg", "P02CE1...
## $ product_type <chr> "Watches", "Cooking Gear", "Cooking Gear",...
## $ color <chr> "Red", "Clear", "Brown", "Unspecified", "S...
## $ gross_margin <dbl> 0.00, 0.33, 0.23, 0.28, 0.28, 0.30, 0.28, ...
## $ product_line_code <dbl> 993, 991, 991, 991, 991, 991, 991, 991, 99...
## $ product_name <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ introduction_date <dttm> 2005-04-01, 1995-02-15, 1995-02-15, 1995-...
## $ title <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ product_type_code <dbl> 960, 951, 951, 951, 951, 951, 951, 951, 95...
## $ product_size <chr> "One-size", "10 liters", "2 liters", "15-p...
## $ brand <chr> "Relax", "TrailChef", "TrailChef", "TrailC...
## $ discontinued_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ product_brand_code <dbl> 756, 701, 701, 701, 701, 701, 701, 701, 70...
## $ product_size_code <dbl> 853, 808, 807, 825, 804, 823, 824, 845, 84...
## $ product_color_code <dbl> 922, 908, 906, 924, 923, 923, 923, 923, 91...
## $ product_number <dbl> 125150, 1110, 2110, 3110, 4110, 5110, 6110...
Finally, the retailers
file
# load retailer info
retailers <-
read_csv('retailers.csv')
retailers %>% glimpse()
## Observations: 847
## Variables: 13
## $ REGION_CODE <dbl> 710, 710, 710, 710, 710, 710, 710, 710, 710...
## $ REGION_EN <chr> "Americas", "Americas", "Americas", "Americ...
## $ COUNTRY_CODE <dbl> 1004, 1004, 1004, 1004, 1004, 1004, 1004, 1...
## $ COUNTRY_KEY <dbl> 90002, 90002, 90002, 90002, 90002, 90002, 9...
## $ COUNTRY_EN <chr> "Canada", "Canada", "Canada", "Canada", "Ca...
## $ RETAILER_KEY <dbl> 6789, 6789, 6789, 6790, 6790, 6790, 6790, 6...
## $ RETAILER_CODE <dbl> 1189, 1189, 1189, 1190, 1190, 1190, 1190, 1...
## $ RETAILER_NAME <chr> "Rock Steady", "Rock Steady", "Rock Steady"...
## $ RETAILER_SITE_KEY <dbl> 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5...
## $ RETAILER_SITE_CODE <dbl> 20001, 20002, 20003, 20004, 20005, 20006, 2...
## $ RTL_CITY <chr> "Winnipeg", "Montr\xe9al", "Fredericton", "...
## $ RETAILER_TYPE_CODE <dbl> 6, 6, 6, 8, 8, 8, 8, 3, 2, 2, 2, 2, 2, 1, 7...
## $ RETAILER_TYPE_EN <chr> "Outdoors Shop", "Outdoors Shop", "Outdoors...
As R is case sensitive, I’m setting retailers
column names to lower-case
retailers <-
retailers %>%
rename_all(tolower)
Now I can joining orders
, products
and retailers
info into one data frame
df <-
orders %>%
left_join(products, by = ('product_number')) %>%
left_join(retailers, by = ('retailer_site_code'))
Creating essential features & some housekeeping tasks
One long piece of code to sort all in one go.
I’ve added comments to each block to explain what it’s doing.
orders_tbl <-
# create revenue, total product cost and gross profit
df %>%
mutate(
revenue = quantity * unit_price,
tot_prod_cost = quantity * unit_cost,
gross_profit = revenue - tot_prod_cost
) %>%
# replacing NAs in the return_count variable
replace_na(list(return_count = 0)) %>%
# Shorten product line names for readablility and ease of use in charts and code
mutate(
prod_line = case_when(
product_line == 'Camping Equipment' ~ 'Camping Eqpt',
product_line == 'Golf Equipment' ~ 'Golf Eqpt',
product_line == 'Mountaineering Equipment' ~ 'Mountain Eqpt',
product_line == 'Personal Accessories' ~ 'Personal Acces',
product_line == 'Outdoor Protection' ~ 'Outdoor Prot',
TRUE ~ product_line
),
prod_line_2 = case_when(
product_line == 'Camping Equipment' ~ 'Camping Eqpt',
product_line == 'Golf Equipment' ~ 'Golf Eqpt',
product_line == 'Mountaineering Equipment' ~ 'Mountain Eqpt',
product_line == 'Personal Accessories' ~ 'Personal Acces',
product_line == 'Outdoor Protection' ~ 'Personal Acces',
TRUE ~ product_line
),
# create alternative regional group
region2 = case_when(
country_en %in% c('United Kingdom', 'France', 'Spain',
'Netherlands','Belgium','Switzerland') ~ 'West Europe',
country_en %in% c('Germany', 'Italy', 'Finland',
'Austria','Sweden','Denmark') ~ 'East Europe',
TRUE ~ region_en
)
) %>%
# create financial years
mutate(
ord_date = ymd(order_date),
fin_year = case_when(
between(ord_date, ymd("2004-07-01"), ymd('2005-06-30')) ~ 'FY_04_05',
between(ord_date, ymd("2005-07-01"), ymd('2006-06-30')) ~ 'FY_05_06',
between(ord_date, ymd("2006-07-01"), ymd('2007-06-30')) ~ 'FY_06_07',
TRUE ~ 'other'
),
# create all quarters
quarter_all = case_when(
between(ord_date, ymd("2004-01-01"), ymd('2004-03-31')) ~ '04_Q1',
between(ord_date, ymd("2004-04-01"), ymd('2004-06-30')) ~ '04_Q2',
between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
between(ord_date, ymd("2007-07-01"), ymd('2007-09-30')) ~ '07_Q3',
TRUE ~ 'other'
),
# create selected quarters
quarter_sel = case_when(
between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
TRUE ~ 'other'
)
) %>%
# reorder columns and rename a few
select(
order_number,
order_date,
order_close_date,
order_ship_date = ship_date,
fin_year,
quarter_all,
quarter_sel,
order_method,
retailer,
retailer_code,
retailer_type = retailer_type_en,
region = region_en,
region2,
country = country_en,
city = rtl_city,
promotion_code,
return = return_count,
quantity,
unit_price,
unit_sale_price,
unit_cost,
unit_prod_cost = production_cost,
unit_gross_marg = gross_margin,
revenue,
tot_prod_cost,
gross_profit,
prod_numb = product_number,
prod_line,
prod_line_2,
prod_type = product_type,
prod_name = product_name,
brand,
color,
size = product_size,
intro_date = introduction_date,
halt_date = discontinued_date
)
orders_tbl %>% glimpse()
## Observations: 446,023
## Variables: 36
## $ order_number <dbl> 100001, 100001, 100002, 100002, 100002, 10000...
## $ order_date <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 2004-01-...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 2004-01-...
## $ order_ship_date <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 2004-01-...
## $ fin_year <chr> "other", "other", "other", "other", "other", ...
## $ quarter_all <chr> "04_Q1", "04_Q1", "04_Q1", "04_Q1", "04_Q1", ...
## $ quarter_sel <chr> "other", "other", "other", "other", "other", ...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fax", "...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "Ar fre...
## $ retailer_code <dbl> 1398, 1398, 1760, 1760, 1760, 1760, 1760, 176...
## $ retailer_type <chr> "Sports Store", "Sports Store", "Outdoors Sho...
## $ region <chr> "Americas", "Americas", "Americas", "Americas...
## $ region2 <chr> "Americas", "Americas", "Americas", "Americas...
## $ country <chr> "United States", "United States", "Brazil", "...
## $ city <chr> "San Clara", "San Clara", "Sao Paulo", "Sao P...
## $ promotion_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ return <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ quantity <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 1107, 8...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81.55, ...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75.84, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.00, 2....
## $ unit_prod_cost <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.00, 3....
## $ unit_gross_marg <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50, 0.6...
## $ revenue <dbl> 8983.04, 10120.00, 19389.78, 6969.44, 26610.4...
## $ tot_prod_cost <dbl> 3998.72, 4571.48, 12960.00, 4047.16, 13058.78...
## $ gross_profit <dbl> 4984.32, 5548.52, 6429.78, 2922.28, 13551.62,...
## $ prod_numb <dbl> 41110, 69110, 75110, 76110, 85110, 65110, 100...
## $ prod_line <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_line_2 <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_type <chr> "Lanterns", "Eyewear", "Knives", "Knives", "N...
## $ prod_name <chr> "Flicker Lantern", "Polar Ice", "Edge Extreme...
## $ brand <chr> "Firefly", "Polar", "Extreme", "Edge", "Extre...
## $ color <chr> "Orange", "Black", "Red", "Black", "Black", "...
## $ size <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "Unspe...
## $ intro_date <dttm> 2005-01-10, 1995-02-15, 2000-10-26, 1995-02-...
## $ halt_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Remove original files and save
Last but not least, I can drop all original files and save the orders_tbl
# remove original files
rm(df, products, retailers, orders)
# Save orders as RDS
orders_tbl %>%
write_rds("orders_tbl.rds")
Code Repository
The full R code can be found on my GitHub profile
NOTE: given their size, the RDS file and the order-details folder had to be compress before uploading them on my Github profile