/ #Data Wrangling #Data Exploration 

Loading, Merging and Joining Several Datasets - PostgreSQL EDT

This is the coding necessary to assemble the various data feeds and sort out the likes of variables naming & new features creation plus some general housekeeping tasks.

In order to simulate normal working conditions I would face if the data was stored on a database, I’ve uploaded the excel files onto a local PostgreSQL database that I’ve created on my machine. I am going to go through the steps I followed to set up a connection between RStudio and said database and extract the information I needed.

I am also going to carry out some general housekeeping tasks like sorting variables names, creating essential features and sorting out variables order.

The Dataset

library(tidyverse)
library(lubridate)
library(odbc)       # connecting to ODBC databases using DBI interface
library(DBI)        # database interface
library(RPostgres)  
library(knitr)

The dataset I’m using here accompanies a Redbooks publication called Building 360-Degree Information Applications which is available as a free PDF download. Also available as a free download are the excel files in the Additional Material section to follow along with the exercises in the book.

The data covers 3 & 12 years worth of sales orders for the Sample Outdoors Company, a fictitious B2B outdoor equipment retailer enterprise. The data comes with details about the products they sell as well as their customers (which are retailers in their case). The data is a subset of the GSDB database, a realistic and feature-rich database created by IBM® to support demos for their products.

Create the connection

Detailed instructions on how to work databases in R can be found on Databases using R. This RStudio resource file is quite simply a treasure trove of information, spanning from setting up a connection with a vast array of open source and proprietary databases & running queries to creating visualisations and dashboards.

First thing I need to do is to fetch the information required to establish a link to my database. In everyday work settings this information would be provided by your database administrator but given that I’ve created the database myself and I’m hosting it on my machine, I’ve retrieved all information myself from the main page on pgAdmin, the GUI tool for PostgreSQL:

Select PostgresSQL then go to Properties and look for the Connection tab

I can now create a connection to the database with the dbConnect command

mycon <- 
   DBI::dbConnect(RPostgres::Postgres(),  # constructs the driver
                     dbname = "customer_analytics", # the name of the schema
                     host = "localhost",  # host name (local in my case)
                     port = "5432",       # usual port name
                     user = 'postgres',   # user name
                     password = "your_PW" # PW to 'connect to server'
                     )

Once the connection is established, I use dbListTables to inspect the database content

#list all tables
table_chr <- DBI::dbListTables(mycon)

table_chr

## [1] "orders"    "products"  "retailers"

Now I use tbl() to take a reference to the orderstable…

orders_db <- 
   dplyr::tbl(mycon, "orders")

…and take a look at it as I normally would with glimpse

orders_db %>% glimpse()

## Observations: ??
## Variables: 45
## Database: postgres  [postgres@localhost:5432/customer_analytics]
## $ order_date            <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code      <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method_it       <chr> "Reparto vendite", "Reparto vendite", "F...
## $ order_method_en       <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_method_nl       <chr> "Vertegenwoordiger", "Vertegenwoordiger"...
## $ order_detail_code     <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ order_method_no       <chr> "Salgsbesøk", "Salgsbesøk", "Faks", "Fak...
## $ order_method_el       <chr> "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "F...
## $ unit_cost             <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_method_tc       <chr> "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", ...
## $ order_detail_code_2   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer_name         <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb      <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ order_method_cs       <chr> "Návšteva prodejce", "Návšteva prodejce"...
## $ order_method_hu       <chr> "Kiszállásos eladás", "Kiszállásos eladá...
## $ order_method_ms       <chr> "Lawatan jualan", "Lawatan jualan", "Fak...
## $ order_method_th       <chr> "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E41><U+0E1F><U+0E01><U+0E0B><U+0E4C>",...
## $ quantity              <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2        <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code     <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_method_id       <chr> "Kunjungan penjualan", "Kunjungan penjua...
## $ order_number          <int> 100001, 100001, 100002, 100002, 100002, ...
## $ order_method_sc       <chr> "<U+9500><U+552E><U+9762><U+8BBF>", "<U+9500><U+552E><U+9762><U+8BBF>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", ...
## $ order_method_fi       <chr> "Myyntikäynti", "Myyntikäynti", "Faksi",...
## $ order_method_ru       <chr> "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0424><U+0430><U+043A><U+0441>", "<U+0424>...
## $ retailer_site_code    <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price       <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ order_method_da       <chr> "Sælgerbesøg", "Sælgerbesøg", "Fax", "Fa...
## $ order_method_pl       <chr> "Wizyta", "Wizyta", "Faks", "Faks", "Fak...
## $ order_method_de       <chr> "Verkaufsbesuch", "Verkaufsbesuch", "Fax...
## $ order_method_ko       <chr> "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", ...
## $ ship_date             <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ order_method_sv       <chr> "Säljbesök", "Säljbesök", "Fax", "Fax", ...
## $ order_method_fr       <chr> "Visite d'un représentant", "Visite d'un...
## $ order_method_code_2   <int> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ unit_price            <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_method_pt       <chr> "Visita vendas", "Visita vendas", "Fax",...
## $ order_method_es       <chr> "Visita de ventas", "Visita de ventas", ...
## $ order_close_date      <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ order_method_code     <int> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ order_method_ja       <chr> "<U+5E97><U+982D>", "<U+5E97><U+982D>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "...
## $ product_number        <int> 41110, 69110, 75110, 76110, 85110, 65110...

This mostly looks like a regular tibble with two important differences:

1- The reference to the remote source on a PostgreSQL database is shown

2- The number of Observations is not showing

The latter is due to the fact that any operation carried out on the _db ending tables is simply creating a “reference” to the queries you want to carry out on the database. The actual queries are executed in one go at the very end, when you send the collect request to the database - more on this later on.

I start by removing all order_method other than English.

orders_db <- 
   orders_db %>% 
   rename(
      order_method = order_method_en,
      retailer = retailer_name
      )  %>% 
   select(
      -contains('order_method_')
      ) 
orders_db %>% glimpse()

## Observations: ??
## Variables: 21
## Database: postgres  [postgres@localhost:5432/customer_analytics]
## $ order_date            <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code      <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method          <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code     <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost             <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer              <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb      <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity              <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2        <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code     <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number          <int> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code    <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price       <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date             <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price            <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date      <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number        <int> 41110, 69110, 75110, 76110, 85110, 65110...

Then I take a reference to the products table, which, as it says on the tin, contains the products information

# load products file
products_db <- 
   tbl(mycon, "products")
products_db %>% glimpse()

## Observations: ??
## Variables: 20
## Database: postgres  [postgres@localhost:5432/customer_analytics]
## $ product_description <chr> "The Venue has a fun and fashionable overs...
## $ base_product_number <int> 125, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ product_line        <chr> "Personal Accessories", "Camping Equipment...
## $ production_cost     <dbl> 0.00, 4.00, 9.22, 15.93, 5.00, 34.97, 85.1...
## $ product_image       <chr> "P65PA3WT10.jpg", "P01CE1CG1.jpg", "P02CE1...
## $ product_type        <chr> "Watches", "Cooking Gear", "Cooking Gear",...
## $ color               <chr> "Red", "Clear", "Brown", "Unspecified", "S...
## $ gross_margin        <dbl> 0.00, 0.33, 0.23, 0.28, 0.28, 0.30, 0.28, ...
## $ product_line_code   <int> 993, 991, 991, 991, 991, 991, 991, 991, 99...
## $ product_name        <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ introduction_date   <date> 2005-04-01, 1995-02-15, 1995-02-15, 1995-...
## $ title               <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ product_type_code   <int> 960, 951, 951, 951, 951, 951, 951, 951, 95...
## $ product_size        <chr> "One-size", "10 liters", "2 liters", "15-p...
## $ brand               <chr> "Relax", "TrailChef", "TrailChef", "TrailC...
## $ discontinued_date   <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ product_brand_code  <int> 756, 701, 701, 701, 701, 701, 701, 701, 70...
## $ product_size_code   <int> 853, 808, 807, 825, 804, 823, 824, 845, 84...
## $ product_color_code  <int> 922, 908, 906, 924, 923, 923, 923, 923, 91...
## $ product_number      <int> 125150, 1110, 2110, 3110, 4110, 5110, 6110...

Finally, the retailers table

# load retailer info
retailers_db <- 
   tbl(mycon, "retailers")
retailers_db %>% glimpse()

## Observations: ??
## Variables: 13
## Database: postgres  [postgres@localhost:5432/customer_analytics]
## $ region_code        <int> 710, 710, 710, 710, 710, 710, 710, 710, 710...
## $ region_en          <chr> "Americas", "Americas", "Americas", "Americ...
## $ country_code       <int> 1004, 1004, 1004, 1004, 1004, 1004, 1004, 1...
## $ country_key        <int> 90002, 90002, 90002, 90002, 90002, 90002, 9...
## $ country_en         <chr> "Canada", "Canada", "Canada", "Canada", "Ca...
## $ retailer_key       <int> 6789, 6789, 6789, 6790, 6790, 6790, 6790, 6...
## $ retailer_code      <int> 1189, 1189, 1189, 1190, 1190, 1190, 1190, 1...
## $ retailer_name      <chr> "Rock Steady", "Rock Steady", "Rock Steady"...
## $ retailer_site_key  <int> 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5...
## $ retailer_site_code <int> 20001, 20002, 20003, 20004, 20005, 20006, 2...
## $ rtl_city           <chr> "Winnipeg", "Montreal", "Fredericton", "Mon...
## $ retailer_type_code <int> 6, 6, 6, 8, 8, 8, 8, 3, 2, 2, 2, 2, 2, 1, 7...
## $ retailer_type_en   <chr> "Outdoors Shop", "Outdoors Shop", "Outdoors...

Now I can joining orders_db, products_db and retailers_db info into one data frame

df_db <- 
   orders_db %>% 
   left_join(products_db, by = ('product_number')) %>% 
   left_join(retailers_db, by = ('retailer_site_code')) 
df_db %>% glimpse()

## Observations: ??
## Variables: 52
## Database: postgres  [postgres@localhost:5432/customer_analytics]
## $ order_date            <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code      <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method          <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code     <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost             <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer              <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb      <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity              <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2        <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code     <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number          <int> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code    <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price       <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date             <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price            <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date      <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number        <int> 41110, 69110, 75110, 76110, 85110, 65110...
## $ product_description   <chr> "Simple to use, just requires a candle t...
## $ base_product_number   <int> 41, 69, 75, 76, 85, 65, 100, 90, 95, 79,...
## $ product_line          <chr> "Camping Equipment", "Personal Accessori...
## $ production_cost       <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.0...
## $ product_image         <chr> "P41CE1LT5.jpg", "P69PA3EW11.jpg", "P75P...
## $ product_type          <chr> "Lanterns", "Eyewear", "Knives", "Knives...
## $ color                 <chr> "Orange", "Black", "Red", "Black", "Blac...
## $ gross_margin          <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50...
## $ product_line_code     <int> 991, 993, 993, 993, 993, 993, 994, 994, ...
## $ product_name          <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ introduction_date     <date> 2005-01-10, 1995-02-15, 2000-10-26, 199...
## $ title                 <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ product_type_code     <int> 955, 961, 962, 962, 964, 960, 967, 965, ...
## $ product_size          <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "...
## $ brand                 <chr> "Firefly", "Polar", "Extreme", "Edge", "...
## $ discontinued_date     <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ product_brand_code    <int> 705, 712, 706, 713, 706, 711, 720, 706, ...
## $ product_size_code     <int> 848, 851, 813, 817, 800, 852, 803, 802, ...
## $ product_color_code    <int> 918, 902, 922, 902, 902, 922, 908, 908, ...
## $ region_code           <int> 710, 710, 710, 710, 710, 710, 710, 710, ...
## $ region_en             <chr> "Americas", "Americas", "Americas", "Ame...
## $ country_code          <int> 1003, 1003, 1021, 1021, 1021, 1021, 1021...
## $ country_key           <int> 90001, 90001, 90004, 90004, 90004, 90004...
## $ country_en            <chr> "United States", "United States", "Brazi...
## $ retailer_key          <int> 6998, 6998, 7360, 7360, 7360, 7360, 7360...
## $ retailer_code         <int> 1398, 1398, 1760, 1760, 1760, 1760, 1760...
## $ retailer_name         <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_site_key     <int> 5477, 5477, 5837, 5837, 5837, 5837, 5837...
## $ rtl_city              <chr> "San Clara", "San Clara", "Sao Paulo", "...
## $ retailer_type_code    <int> 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ retailer_type_en      <chr> "Sports Store", "Sports Store", "Outdoor...

I’m happy with the queries I’ve ran and I use collect() to pull all data into a local tibble. This is the stage when the queries are actually applied to the database and, depending on their complexity, number of calculations and amount of data involved, it may take some time to complete.

df <- 
   df_db %>% 
   collect()
df %>% 
   glimpse()

## Observations: 446,023
## Variables: 52
## $ order_date            <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code      <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method          <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code     <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost             <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer              <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb      <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity              <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2        <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code     <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number          <int> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code    <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price       <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date             <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price            <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date      <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number        <int> 41110, 69110, 75110, 76110, 85110, 65110...
## $ product_description   <chr> "Simple to use, just requires a candle t...
## $ base_product_number   <int> 41, 69, 75, 76, 85, 65, 100, 90, 95, 79,...
## $ product_line          <chr> "Camping Equipment", "Personal Accessori...
## $ production_cost       <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.0...
## $ product_image         <chr> "P41CE1LT5.jpg", "P69PA3EW11.jpg", "P75P...
## $ product_type          <chr> "Lanterns", "Eyewear", "Knives", "Knives...
## $ color                 <chr> "Orange", "Black", "Red", "Black", "Blac...
## $ gross_margin          <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50...
## $ product_line_code     <int> 991, 993, 993, 993, 993, 993, 994, 994, ...
## $ product_name          <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ introduction_date     <date> 2005-01-10, 1995-02-15, 2000-10-26, 199...
## $ title                 <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ product_type_code     <int> 955, 961, 962, 962, 964, 960, 967, 965, ...
## $ product_size          <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "...
## $ brand                 <chr> "Firefly", "Polar", "Extreme", "Edge", "...
## $ discontinued_date     <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ product_brand_code    <int> 705, 712, 706, 713, 706, 711, 720, 706, ...
## $ product_size_code     <int> 848, 851, 813, 817, 800, 852, 803, 802, ...
## $ product_color_code    <int> 918, 902, 922, 902, 902, 922, 908, 908, ...
## $ region_code           <int> 710, 710, 710, 710, 710, 710, 710, 710, ...
## $ region_en             <chr> "Americas", "Americas", "Americas", "Ame...
## $ country_code          <int> 1003, 1003, 1021, 1021, 1021, 1021, 1021...
## $ country_key           <int> 90001, 90001, 90004, 90004, 90004, 90004...
## $ country_en            <chr> "United States", "United States", "Brazi...
## $ retailer_key          <int> 6998, 6998, 7360, 7360, 7360, 7360, 7360...
## $ retailer_code         <int> 1398, 1398, 1760, 1760, 1760, 1760, 1760...
## $ retailer_name         <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_site_key     <int> 5477, 5477, 5837, 5837, 5837, 5837, 5837...
## $ rtl_city              <chr> "San Clara", "San Clara", "Sao Paulo", "...
## $ retailer_type_code    <int> 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ retailer_type_en      <chr> "Sports Store", "Sports Store", "Outdoor...

Don’t forget to disconnect from the database

dbDisconnect(mycon)

Creating essential features & some housekeeping tasks

One long piece of code to sort all in one go.

I’ve added comments to each block to explain what it’s doing.

orders_tbl   <- 
   
    # create revenue, total product cost and gross profit
   df %>%
      mutate(
         production_cost = quantity * unit_cost,
         revenue         = quantity * unit_sale_price,
         planned_revenue = quantity * unit_price,
         gross_profit    = revenue  - production_cost
      ) %>% 
      
      # replacing NAs in the return_count variable
   replace_na(list(return_count = 0)) %>% 
      
    # Shorten product line names for readablility and ease of use in charts and code
    
   mutate(
      prod_line = case_when(
         product_line == 'Camping Equipment' ~ 'Camping_Eqpt',
         product_line == 'Golf Equipment' ~ 'Golf_Eqpt',
         product_line == 'Mountaineering Equipment' ~ 'Mountain_Eqpt',
         product_line == 'Personal Accessories' ~ 'Personal_Acces',
         product_line == 'Outdoor Protection' ~ 'Outdoor_Prot',
         TRUE ~ product_line
      ),
      
      prod_line_2 = case_when(
         product_line == 'Camping Equipment' ~ 'Camping_Eqpt',
         product_line == 'Golf Equipment' ~ 'Golf_Eqpt',
         product_line == 'Mountaineering Equipment' ~ 'Mountain_Eqpt',
         product_line == 'Personal Accessories' ~ 'Personal_Acces',
         product_line == 'Outdoor Protection' ~ 'Personal_Acces',
         TRUE ~ product_line
      ),
      
      # create alternative regional group
      region2 = case_when(
         country_en %in% c('United Kingdom', 'France', 'Spain',  
                        'Netherlands','Belgium','Switzerland') ~ 'West_Europe',
         country_en %in% c('Germany', 'Italy', 'Finland',  
                        'Austria','Sweden','Denmark') ~ 'East_Europe',
         TRUE ~ region_en
      )
   ) %>% 
   
   # create financial years
   mutate(
      ord_date = ymd(order_date),
      fin_year = case_when(
         between(ord_date, ymd("2004-07-01"), ymd('2005-06-30')) ~ 'FY_04_05',
         between(ord_date, ymd("2005-07-01"), ymd('2006-06-30')) ~ 'FY_05_06',
         between(ord_date, ymd("2006-07-01"), ymd('2007-06-30')) ~ 'FY_06_07',
         TRUE ~ 'other'
      ),
      
      # create all quarters
      quarter_all = case_when(
         between(ord_date, ymd("2004-01-01"), ymd('2004-03-31')) ~ '04_Q1',
         between(ord_date, ymd("2004-04-01"), ymd('2004-06-30')) ~ '04_Q2',
         between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
         between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
         between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
         between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
         between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
         between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
         between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
         between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
         between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
         between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
         between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
         between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
         between(ord_date, ymd("2007-07-01"), ymd('2007-09-30')) ~ '07_Q3',
         TRUE ~ 'other'
      ),
      
      # create selected quarters
      quarter_sel = case_when(
         between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
         between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
         between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
         between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
         between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
         between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
         between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
         between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
         between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
         between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
         between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
         between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
         TRUE ~ 'other'
      )
      
      
   ) %>% 
   
   # reorder columns and rename a few
      select(
      order_number,
      order_date,
      order_close_date,
      order_ship_date = ship_date,
      fin_year,
      quarter_all,
      quarter_sel,
      order_method,
      retailer,
      retailer_code,
      retailer_site   = retailer_site_key,
      retailer_type   = retailer_type_en,
      region          = region_en,
      region2,
      country         = country_en,
      city            = rtl_city,
      promotion_code,
      return          = return_count,
      quantity,
      unit_price,
      unit_sale_price,
      unit_cost,
      unit_prod_cost  = production_cost,
      unit_gross_marg = gross_margin,
      revenue,
      planned_revenue,
      production_cost,
      gross_profit,
      prod_numb       = product_number,
      prod_line,
      prod_line_2,
      prod_type       = product_type,
      prod_name       = product_name,
      brand,
      color,
      size            = product_size,
      intro_date      = introduction_date,
      halt_date       = discontinued_date
   ) 
orders_tbl %>% glimpse()

## Observations: 446,023
## Variables: 37
## $ order_number     <int> 100001, 100001, 100002, 100002, 100002, 10000...
## $ order_date       <date> 2004-01-12, 2004-01-12, 2004-01-12, 2004-01-...
## $ order_close_date <date> 2004-02-17, 2004-02-17, 2004-01-19, 2004-01-...
## $ order_ship_date  <date> 2004-01-19, 2004-02-17, 2004-01-19, 2004-01-...
## $ fin_year         <chr> "other", "other", "other", "other", "other", ...
## $ quarter_all      <chr> "04_Q1", "04_Q1", "04_Q1", "04_Q1", "04_Q1", ...
## $ quarter_sel      <chr> "other", "other", "other", "other", "other", ...
## $ order_method     <chr> "Sales visit", "Sales visit", "Fax", "Fax", "...
## $ retailer         <chr> "Kavanagh Sports", "Kavanagh Sports", "Ar fre...
## $ retailer_code    <int> 1398, 1398, 1760, 1760, 1760, 1760, 1760, 176...
## $ retailer_site    <int> 5477, 5477, 5837, 5837, 5837, 5837, 5837, 583...
## $ retailer_type    <chr> "Sports Store", "Sports Store", "Outdoors Sho...
## $ region           <chr> "Americas", "Americas", "Americas", "Americas...
## $ region2          <chr> "Americas", "Americas", "Americas", "Americas...
## $ country          <chr> "United States", "United States", "Brazil", "...
## $ city             <chr> "San Clara", "San Clara", "Sao Paulo", "Sao P...
## $ promotion_code   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ return           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ quantity         <int> 256, 92, 162, 172, 74, 90, 422, 3252, 1107, 8...
## $ unit_price       <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81.55, ...
## $ unit_sale_price  <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75.84, ...
## $ unit_cost        <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.00, 2....
## $ unit_prod_cost   <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.00, 3....
## $ unit_gross_marg  <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50, 0.6...
## $ revenue          <dbl> 8983.04, 10120.00, 19389.78, 6969.44, 26610.4...
## $ tot_prod_cost    <dbl> 3998.72, 4571.48, 12960.00, 4047.16, 13058.78...
## $ gross_profit     <dbl> 4984.32, 5548.52, 6429.78, 2922.28, 13551.62,...
## $ prod_numb        <int> 41110, 69110, 75110, 76110, 85110, 65110, 100...
## $ prod_line        <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_line_2      <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_type        <chr> "Lanterns", "Eyewear", "Knives", "Knives", "N...
## $ prod_name        <chr> "Flicker Lantern", "Polar Ice", "Edge Extreme...
## $ brand            <chr> "Firefly", "Polar", "Extreme", "Edge", "Extre...
## $ color            <chr> "Orange", "Black", "Red", "Black", "Black", "...
## $ size             <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "Unspe...
## $ intro_date       <date> 2005-01-10, 1995-02-15, 2000-10-26, 1995-02-...
## $ halt_date        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Remove original files and save

Last but not least, I can drop all original files and save the orders_tbl

# Save orders as RDS
orders_tbl %>%
   write_rds("../00_data/orders_tbl.rds")

Code Repository

The full R code can be found on my GitHub profile