Loading, Merging and Joining Several Datasets - PostgreSQL EDT
This is the coding necessary to assemble the various data feeds and sort out the likes of variables naming & new features creation plus some general housekeeping tasks.
In order to simulate normal working conditions I would face if the data was stored on a database, I’ve uploaded the excel files onto a local PostgreSQL database that I’ve created on my machine. I am going to go through the steps I followed to set up a connection between RStudio and said database and extract the information I needed.
I am also going to carry out some general housekeeping tasks like sorting variables names, creating essential features and sorting out variables order.
The Dataset
library(tidyverse)
library(lubridate)
library(odbc) # connecting to ODBC databases using DBI interface
library(DBI) # database interface
library(RPostgres)
library(knitr)
The dataset I’m using here accompanies a Redbooks publication called Building 360-Degree Information Applications which is available as a free PDF download. Also available as a free download are the excel files in the Additional Material section to follow along with the exercises in the book.
The data covers 3 & 1⁄2 years worth of sales orders
for the Sample Outdoors Company, a fictitious B2B outdoor equipment retailer enterprise. The data comes with details about the products
they sell as well as their customers (which are retailers
in their case). The data is a subset of the GSDB database, a realistic and feature-rich database created by IBM® to support demos for their products.
Create the connection
Detailed instructions on how to work databases in R can be found on Databases using R. This RStudio resource file is quite simply a treasure trove of information, spanning from setting up a connection with a vast array of open source and proprietary databases & running queries to creating visualisations and dashboards.
First thing I need to do is to fetch the information required to establish a link to my database. In everyday work settings this information would be provided by your database administrator but given that I’ve created the database myself and I’m hosting it on my machine, I’ve retrieved all information myself from the main page on pgAdmin
, the GUI tool for PostgreSQL:
Select PostgresSQL
then go to Properties
and look for the Connection
tab
I can now create a connection to the database with the dbConnect
command
mycon <-
DBI::dbConnect(RPostgres::Postgres(), # constructs the driver
dbname = "customer_analytics", # the name of the schema
host = "localhost", # host name (local in my case)
port = "5432", # usual port name
user = 'postgres', # user name
password = "your_PW" # PW to 'connect to server'
)
Once the connection is established, I use dbListTables
to inspect the database content
#list all tables
table_chr <- DBI::dbListTables(mycon)
table_chr
## [1] "orders" "products" "retailers"
Now I use tbl()
to take a reference to the orders
table…
orders_db <-
dplyr::tbl(mycon, "orders")
…and take a look at it as I normally would with glimpse
orders_db %>% glimpse()
## Observations: ??
## Variables: 45
## Database: postgres [postgres@localhost:5432/customer_analytics]
## $ order_date <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method_it <chr> "Reparto vendite", "Reparto vendite", "F...
## $ order_method_en <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_method_nl <chr> "Vertegenwoordiger", "Vertegenwoordiger"...
## $ order_detail_code <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ order_method_no <chr> "Salgsbesøk", "Salgsbesøk", "Faks", "Fak...
## $ order_method_el <chr> "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "F...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_method_tc <chr> "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", ...
## $ order_detail_code_2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer_name <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ order_method_cs <chr> "Návšteva prodejce", "Návšteva prodejce"...
## $ order_method_hu <chr> "Kiszállásos eladás", "Kiszállásos eladá...
## $ order_method_ms <chr> "Lawatan jualan", "Lawatan jualan", "Fak...
## $ order_method_th <chr> "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E41><U+0E1F><U+0E01><U+0E0B><U+0E4C>",...
## $ quantity <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_method_id <chr> "Kunjungan penjualan", "Kunjungan penjua...
## $ order_number <int> 100001, 100001, 100002, 100002, 100002, ...
## $ order_method_sc <chr> "<U+9500><U+552E><U+9762><U+8BBF>", "<U+9500><U+552E><U+9762><U+8BBF>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", ...
## $ order_method_fi <chr> "Myyntikäynti", "Myyntikäynti", "Faksi",...
## $ order_method_ru <chr> "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0424><U+0430><U+043A><U+0441>", "<U+0424>...
## $ retailer_site_code <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ order_method_da <chr> "Sælgerbesøg", "Sælgerbesøg", "Fax", "Fa...
## $ order_method_pl <chr> "Wizyta", "Wizyta", "Faks", "Faks", "Fak...
## $ order_method_de <chr> "Verkaufsbesuch", "Verkaufsbesuch", "Fax...
## $ order_method_ko <chr> "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", ...
## $ ship_date <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ order_method_sv <chr> "Säljbesök", "Säljbesök", "Fax", "Fax", ...
## $ order_method_fr <chr> "Visite d'un représentant", "Visite d'un...
## $ order_method_code_2 <int> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_method_pt <chr> "Visita vendas", "Visita vendas", "Fax",...
## $ order_method_es <chr> "Visita de ventas", "Visita de ventas", ...
## $ order_close_date <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ order_method_code <int> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ order_method_ja <chr> "<U+5E97><U+982D>", "<U+5E97><U+982D>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "...
## $ product_number <int> 41110, 69110, 75110, 76110, 85110, 65110...
This mostly looks like a regular tibble with two important differences:
1- The reference to the remote source on a PostgreSQL
database is shown
2- The number of Observations
is not showing
The latter is due to the fact that any operation carried out on the _db
ending tables is simply creating a “reference” to the queries you want to carry out on the database. The actual queries are executed in one go at the very end, when you send the collect
request to the database - more on this later on.
I start by removing all order_method
other than English.
orders_db <-
orders_db %>%
rename(
order_method = order_method_en,
retailer = retailer_name
) %>%
select(
-contains('order_method_')
)
orders_db %>% glimpse()
## Observations: ??
## Variables: 21
## Database: postgres [postgres@localhost:5432/customer_analytics]
## $ order_date <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number <int> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number <int> 41110, 69110, 75110, 76110, 85110, 65110...
Then I take a reference to the products
table, which, as it says on the tin, contains the products information
# load products file
products_db <-
tbl(mycon, "products")
products_db %>% glimpse()
## Observations: ??
## Variables: 20
## Database: postgres [postgres@localhost:5432/customer_analytics]
## $ product_description <chr> "The Venue has a fun and fashionable overs...
## $ base_product_number <int> 125, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ product_line <chr> "Personal Accessories", "Camping Equipment...
## $ production_cost <dbl> 0.00, 4.00, 9.22, 15.93, 5.00, 34.97, 85.1...
## $ product_image <chr> "P65PA3WT10.jpg", "P01CE1CG1.jpg", "P02CE1...
## $ product_type <chr> "Watches", "Cooking Gear", "Cooking Gear",...
## $ color <chr> "Red", "Clear", "Brown", "Unspecified", "S...
## $ gross_margin <dbl> 0.00, 0.33, 0.23, 0.28, 0.28, 0.30, 0.28, ...
## $ product_line_code <int> 993, 991, 991, 991, 991, 991, 991, 991, 99...
## $ product_name <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ introduction_date <date> 2005-04-01, 1995-02-15, 1995-02-15, 1995-...
## $ title <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ product_type_code <int> 960, 951, 951, 951, 951, 951, 951, 951, 95...
## $ product_size <chr> "One-size", "10 liters", "2 liters", "15-p...
## $ brand <chr> "Relax", "TrailChef", "TrailChef", "TrailC...
## $ discontinued_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ product_brand_code <int> 756, 701, 701, 701, 701, 701, 701, 701, 70...
## $ product_size_code <int> 853, 808, 807, 825, 804, 823, 824, 845, 84...
## $ product_color_code <int> 922, 908, 906, 924, 923, 923, 923, 923, 91...
## $ product_number <int> 125150, 1110, 2110, 3110, 4110, 5110, 6110...
Finally, the retailers
table
# load retailer info
retailers_db <-
tbl(mycon, "retailers")
retailers_db %>% glimpse()
## Observations: ??
## Variables: 13
## Database: postgres [postgres@localhost:5432/customer_analytics]
## $ region_code <int> 710, 710, 710, 710, 710, 710, 710, 710, 710...
## $ region_en <chr> "Americas", "Americas", "Americas", "Americ...
## $ country_code <int> 1004, 1004, 1004, 1004, 1004, 1004, 1004, 1...
## $ country_key <int> 90002, 90002, 90002, 90002, 90002, 90002, 9...
## $ country_en <chr> "Canada", "Canada", "Canada", "Canada", "Ca...
## $ retailer_key <int> 6789, 6789, 6789, 6790, 6790, 6790, 6790, 6...
## $ retailer_code <int> 1189, 1189, 1189, 1190, 1190, 1190, 1190, 1...
## $ retailer_name <chr> "Rock Steady", "Rock Steady", "Rock Steady"...
## $ retailer_site_key <int> 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5...
## $ retailer_site_code <int> 20001, 20002, 20003, 20004, 20005, 20006, 2...
## $ rtl_city <chr> "Winnipeg", "Montreal", "Fredericton", "Mon...
## $ retailer_type_code <int> 6, 6, 6, 8, 8, 8, 8, 3, 2, 2, 2, 2, 2, 1, 7...
## $ retailer_type_en <chr> "Outdoors Shop", "Outdoors Shop", "Outdoors...
Now I can joining orders_db
, products_db
and retailers_db
info into one data frame
df_db <-
orders_db %>%
left_join(products_db, by = ('product_number')) %>%
left_join(retailers_db, by = ('retailer_site_code'))
df_db %>% glimpse()
## Observations: ??
## Variables: 52
## Database: postgres [postgres@localhost:5432/customer_analytics]
## $ order_date <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number <int> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number <int> 41110, 69110, 75110, 76110, 85110, 65110...
## $ product_description <chr> "Simple to use, just requires a candle t...
## $ base_product_number <int> 41, 69, 75, 76, 85, 65, 100, 90, 95, 79,...
## $ product_line <chr> "Camping Equipment", "Personal Accessori...
## $ production_cost <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.0...
## $ product_image <chr> "P41CE1LT5.jpg", "P69PA3EW11.jpg", "P75P...
## $ product_type <chr> "Lanterns", "Eyewear", "Knives", "Knives...
## $ color <chr> "Orange", "Black", "Red", "Black", "Blac...
## $ gross_margin <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50...
## $ product_line_code <int> 991, 993, 993, 993, 993, 993, 994, 994, ...
## $ product_name <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ introduction_date <date> 2005-01-10, 1995-02-15, 2000-10-26, 199...
## $ title <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ product_type_code <int> 955, 961, 962, 962, 964, 960, 967, 965, ...
## $ product_size <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "...
## $ brand <chr> "Firefly", "Polar", "Extreme", "Edge", "...
## $ discontinued_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ product_brand_code <int> 705, 712, 706, 713, 706, 711, 720, 706, ...
## $ product_size_code <int> 848, 851, 813, 817, 800, 852, 803, 802, ...
## $ product_color_code <int> 918, 902, 922, 902, 902, 922, 908, 908, ...
## $ region_code <int> 710, 710, 710, 710, 710, 710, 710, 710, ...
## $ region_en <chr> "Americas", "Americas", "Americas", "Ame...
## $ country_code <int> 1003, 1003, 1021, 1021, 1021, 1021, 1021...
## $ country_key <int> 90001, 90001, 90004, 90004, 90004, 90004...
## $ country_en <chr> "United States", "United States", "Brazi...
## $ retailer_key <int> 6998, 6998, 7360, 7360, 7360, 7360, 7360...
## $ retailer_code <int> 1398, 1398, 1760, 1760, 1760, 1760, 1760...
## $ retailer_name <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_site_key <int> 5477, 5477, 5837, 5837, 5837, 5837, 5837...
## $ rtl_city <chr> "San Clara", "San Clara", "Sao Paulo", "...
## $ retailer_type_code <int> 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ retailer_type_en <chr> "Sports Store", "Sports Store", "Outdoor...
I’m happy with the queries I’ve ran and I use collect()
to pull all data into a local tibble. This is the stage when the queries are actually applied to the database and, depending on their complexity, number of calculations and amount of data involved, it may take some time to complete.
df <-
df_db %>%
collect()
df %>%
glimpse()
## Observations: 446,023
## Variables: 52
## $ order_date <date> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <int> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code <int> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity <int> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <int> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <int> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number <int> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code <int> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date <date> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date <date> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <int> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number <int> 41110, 69110, 75110, 76110, 85110, 65110...
## $ product_description <chr> "Simple to use, just requires a candle t...
## $ base_product_number <int> 41, 69, 75, 76, 85, 65, 100, 90, 95, 79,...
## $ product_line <chr> "Camping Equipment", "Personal Accessori...
## $ production_cost <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.0...
## $ product_image <chr> "P41CE1LT5.jpg", "P69PA3EW11.jpg", "P75P...
## $ product_type <chr> "Lanterns", "Eyewear", "Knives", "Knives...
## $ color <chr> "Orange", "Black", "Red", "Black", "Blac...
## $ gross_margin <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50...
## $ product_line_code <int> 991, 993, 993, 993, 993, 993, 994, 994, ...
## $ product_name <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ introduction_date <date> 2005-01-10, 1995-02-15, 2000-10-26, 199...
## $ title <chr> "Flicker Lantern", "Polar Ice", "Edge Ex...
## $ product_type_code <int> 955, 961, 962, 962, 964, 960, 967, 965, ...
## $ product_size <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "...
## $ brand <chr> "Firefly", "Polar", "Extreme", "Edge", "...
## $ discontinued_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ product_brand_code <int> 705, 712, 706, 713, 706, 711, 720, 706, ...
## $ product_size_code <int> 848, 851, 813, 817, 800, 852, 803, 802, ...
## $ product_color_code <int> 918, 902, 922, 902, 902, 922, 908, 908, ...
## $ region_code <int> 710, 710, 710, 710, 710, 710, 710, 710, ...
## $ region_en <chr> "Americas", "Americas", "Americas", "Ame...
## $ country_code <int> 1003, 1003, 1021, 1021, 1021, 1021, 1021...
## $ country_key <int> 90001, 90001, 90004, 90004, 90004, 90004...
## $ country_en <chr> "United States", "United States", "Brazi...
## $ retailer_key <int> 6998, 6998, 7360, 7360, 7360, 7360, 7360...
## $ retailer_code <int> 1398, 1398, 1760, 1760, 1760, 1760, 1760...
## $ retailer_name <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_site_key <int> 5477, 5477, 5837, 5837, 5837, 5837, 5837...
## $ rtl_city <chr> "San Clara", "San Clara", "Sao Paulo", "...
## $ retailer_type_code <int> 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ retailer_type_en <chr> "Sports Store", "Sports Store", "Outdoor...
Don’t forget to disconnect from the database
dbDisconnect(mycon)
Creating essential features & some housekeeping tasks
One long piece of code to sort all in one go.
I’ve added comments to each block to explain what it’s doing.
orders_tbl <-
# create revenue, total product cost and gross profit
df %>%
mutate(
production_cost = quantity * unit_cost,
revenue = quantity * unit_sale_price,
planned_revenue = quantity * unit_price,
gross_profit = revenue - production_cost
) %>%
# replacing NAs in the return_count variable
replace_na(list(return_count = 0)) %>%
# Shorten product line names for readablility and ease of use in charts and code
mutate(
prod_line = case_when(
product_line == 'Camping Equipment' ~ 'Camping_Eqpt',
product_line == 'Golf Equipment' ~ 'Golf_Eqpt',
product_line == 'Mountaineering Equipment' ~ 'Mountain_Eqpt',
product_line == 'Personal Accessories' ~ 'Personal_Acces',
product_line == 'Outdoor Protection' ~ 'Outdoor_Prot',
TRUE ~ product_line
),
prod_line_2 = case_when(
product_line == 'Camping Equipment' ~ 'Camping_Eqpt',
product_line == 'Golf Equipment' ~ 'Golf_Eqpt',
product_line == 'Mountaineering Equipment' ~ 'Mountain_Eqpt',
product_line == 'Personal Accessories' ~ 'Personal_Acces',
product_line == 'Outdoor Protection' ~ 'Personal_Acces',
TRUE ~ product_line
),
# create alternative regional group
region2 = case_when(
country_en %in% c('United Kingdom', 'France', 'Spain',
'Netherlands','Belgium','Switzerland') ~ 'West_Europe',
country_en %in% c('Germany', 'Italy', 'Finland',
'Austria','Sweden','Denmark') ~ 'East_Europe',
TRUE ~ region_en
)
) %>%
# create financial years
mutate(
ord_date = ymd(order_date),
fin_year = case_when(
between(ord_date, ymd("2004-07-01"), ymd('2005-06-30')) ~ 'FY_04_05',
between(ord_date, ymd("2005-07-01"), ymd('2006-06-30')) ~ 'FY_05_06',
between(ord_date, ymd("2006-07-01"), ymd('2007-06-30')) ~ 'FY_06_07',
TRUE ~ 'other'
),
# create all quarters
quarter_all = case_when(
between(ord_date, ymd("2004-01-01"), ymd('2004-03-31')) ~ '04_Q1',
between(ord_date, ymd("2004-04-01"), ymd('2004-06-30')) ~ '04_Q2',
between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
between(ord_date, ymd("2007-07-01"), ymd('2007-09-30')) ~ '07_Q3',
TRUE ~ 'other'
),
# create selected quarters
quarter_sel = case_when(
between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
TRUE ~ 'other'
)
) %>%
# reorder columns and rename a few
select(
order_number,
order_date,
order_close_date,
order_ship_date = ship_date,
fin_year,
quarter_all,
quarter_sel,
order_method,
retailer,
retailer_code,
retailer_site = retailer_site_key,
retailer_type = retailer_type_en,
region = region_en,
region2,
country = country_en,
city = rtl_city,
promotion_code,
return = return_count,
quantity,
unit_price,
unit_sale_price,
unit_cost,
unit_prod_cost = production_cost,
unit_gross_marg = gross_margin,
revenue,
planned_revenue,
production_cost,
gross_profit,
prod_numb = product_number,
prod_line,
prod_line_2,
prod_type = product_type,
prod_name = product_name,
brand,
color,
size = product_size,
intro_date = introduction_date,
halt_date = discontinued_date
)
orders_tbl %>% glimpse()
## Observations: 446,023
## Variables: 37
## $ order_number <int> 100001, 100001, 100002, 100002, 100002, 10000...
## $ order_date <date> 2004-01-12, 2004-01-12, 2004-01-12, 2004-01-...
## $ order_close_date <date> 2004-02-17, 2004-02-17, 2004-01-19, 2004-01-...
## $ order_ship_date <date> 2004-01-19, 2004-02-17, 2004-01-19, 2004-01-...
## $ fin_year <chr> "other", "other", "other", "other", "other", ...
## $ quarter_all <chr> "04_Q1", "04_Q1", "04_Q1", "04_Q1", "04_Q1", ...
## $ quarter_sel <chr> "other", "other", "other", "other", "other", ...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fax", "...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "Ar fre...
## $ retailer_code <int> 1398, 1398, 1760, 1760, 1760, 1760, 1760, 176...
## $ retailer_site <int> 5477, 5477, 5837, 5837, 5837, 5837, 5837, 583...
## $ retailer_type <chr> "Sports Store", "Sports Store", "Outdoors Sho...
## $ region <chr> "Americas", "Americas", "Americas", "Americas...
## $ region2 <chr> "Americas", "Americas", "Americas", "Americas...
## $ country <chr> "United States", "United States", "Brazil", "...
## $ city <chr> "San Clara", "San Clara", "Sao Paulo", "Sao P...
## $ promotion_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ return <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ quantity <int> 256, 92, 162, 172, 74, 90, 422, 3252, 1107, 8...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81.55, ...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75.84, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.00, 2....
## $ unit_prod_cost <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.00, 3....
## $ unit_gross_marg <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50, 0.6...
## $ revenue <dbl> 8983.04, 10120.00, 19389.78, 6969.44, 26610.4...
## $ tot_prod_cost <dbl> 3998.72, 4571.48, 12960.00, 4047.16, 13058.78...
## $ gross_profit <dbl> 4984.32, 5548.52, 6429.78, 2922.28, 13551.62,...
## $ prod_numb <int> 41110, 69110, 75110, 76110, 85110, 65110, 100...
## $ prod_line <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_line_2 <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_type <chr> "Lanterns", "Eyewear", "Knives", "Knives", "N...
## $ prod_name <chr> "Flicker Lantern", "Polar Ice", "Edge Extreme...
## $ brand <chr> "Firefly", "Polar", "Extreme", "Edge", "Extre...
## $ color <chr> "Orange", "Black", "Red", "Black", "Black", "...
## $ size <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "Unspe...
## $ intro_date <date> 2005-01-10, 1995-02-15, 2000-10-26, 1995-02-...
## $ halt_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Remove original files and save
Last but not least, I can drop all original files and save the orders_tbl
# Save orders as RDS
orders_tbl %>%
write_rds("../00_data/orders_tbl.rds")
Code Repository
The full R code can be found on my GitHub profile