Basic ETL with R and SQL
August 11, 2019
Create a Calendar Table from Point In Time Data
Today we will dive into ETL by taking the following steps:
- Extract data from a point in time table
- Transform the extract into a monthly table
- Load the data back to the database
For this exercise we will use R and SQLite. I like the simplicity of SQLite. It is very fast for this type of database and easy to install, though there are some drawbacks that we will touch upon.
The raw R script is here.
These are the CRAN packages we will need:
library(tidyverse)
library(RSQLite)
library(lubridate)
We will start by installing the ficticious hrsample HR database (hrsample blog). If you went through my earlier blog entry, Multi-Condition Data Retrieval, you can continue to use that same database and skip this next step.
#devtools::install_github("harryahlas/hrsample")
hrsample::hrsampleCreateSQLite("my_db.sqlite3")
Extract
Let’s connect to the database.
con <- dbConnect(SQLite(),'my_db.sqlite3')
The first table we will need to extract is the “deskhistory” table. We will import it into an object called dh
.
dh <- dbGetQuery(con, "SELECT * FROM DESKHISTORY")
Let’s look a little closer at a sample of the dh
(deskhistory) table.
sample_employee_num <- sample(dh$employee_num,1)
dh %>% filter(employee_num == sample_employee_num)
employee_num desk_id desk_id_start_date desk_id_end_date termination_flag promotion_flag
1 14447 535 2003-12-12 2006-10-26 0 0
2 14447 535 2006-10-27 2009-07-23 0 1
3 14447 535 2009-07-24 2011-11-04 0 1
4 14447 535 2011-11-05 2014-12-15 1 1
The deskhistory data is a transactional, point-in-time table. It shows one row for each instance an employee was in a desk_id
(aka position). It includes a start date desk_id_start_date
and end date desk_id_end_date
.
Our goal is to convert this point in time data in to a monthly snapshot, which can be very useful for reporting and analysis. The monthly table will have one row for each month the employee is in a desk_id
.
Transform
One of the drawbacks of SQLite is that it does not store date formats. So our first transform will be to convert the date fields from text into dates. We will also remove the promotion_flag
variable since it is not needed here.
dh <- dh %>%
mutate(desk_id_start_date = as.Date(desk_id_start_date),
desk_id_end_date = as.Date(desk_id_end_date)) %>%
select(-promotion_flag)
Next, let’s create an empty data frame that we will populate with our monthly calendar data.
dh_trend <- tibble()
We need to create a list of all months that will be included in our calendar table data.
start_date <- as.Date("1999-01-01")
end_date <- as.Date("2018-12-31")
month_list <- seq.Date(start_date, end_date, by = "month")
month_list[1:5]
[1] "1999-01-01" "1999-02-01" "1999-03-01" "1999-04-01" "1999-05-01"
Now we are ready to do the real work - creating the new data. To do so, we will create a for loop that:
- Iterates through each month in our list of months
- Identifies the employees that were active as of the last day of that month
- Identifies the employees that terminated during that month
- Adds rows to our table for these active and terminated employees
- Adds a
trend_month
field representing month the employee was in the company
Here is the code for our loop. We’ll review it in more detail in a minute.
for (i in 1:length(month_list)) {
dh_trend_active <- dh %>%
filter(desk_id_start_date <= ceiling_date(month_list[i], "month") - 1,
desk_id_end_date >= ceiling_date(month_list[i], "month") - 1) %>%
filter(termination_flag == 0 | desk_id_end_date != ceiling_date(month_list[i], "month") - 1) %>%
mutate(trend_month = month_list[i],
termination_flag = 0)
dh_trend_term <- dh %>%
filter(termination_flag == 1,
desk_id_end_date <= ceiling_date(month_list[i], "month") - 1,
desk_id_end_date >= month_list[i]
) %>%
mutate(trend_month = month_list[i])
dh_trend <- bind_rows(dh_trend, dh_trend_active, dh_trend_term)
}
Let’s take a look at a small sample of our new monthly calendar data:
dh_trend %>% sample_n(3)
# A tibble: 3 x 6
employee_num desk_id desk_id_start_date desk_id_end_date termination_flag trend_month
1 2690 524 2010-11-04 2015-01-17 0 2014-09-01
2 762 503 2006-04-07 2009-02-23 0 2006-10-01
3 24996 813 2002-08-23 2007-08-06 0 2003-02-01
Every employee has a row for each month they were active. Note the new trend_month
field. A more detailed look at the for loop code is below. Feel free to skip ahead.
Inside the for loop
The first piece of the for loop specifies how many loops to run. This is the number of months in our list of months (month_list
)
for (i in 1:length(month_list)) {
Then we create a temporary data frame dh_trend_active
that captures active employees for the month in the current iteration. The code looks at the desk_id_start_date
and desk_id_end_date
columns to determine if an employee was active during that month.
dh_trend_active <- dh %>%
filter(desk_id_start_date <= ceiling_date(month_list[i], "month") - 1,
desk_id_end_date >= ceiling_date(month_list[i], "month") - 1) %>%
The code above excludes employees who terminated during the month unless they terminated on the last day of the month. We need to exclude those end of month terminations here, otherwise they will be doublecounted with the terminations piece coming up. So we’ll filter them out with this “link” in the dplyr chain:
filter(termination_flag == 0
| desk_id_end_date != ceiling_date(month_list[i], "month") - 1) %>%
The next piece of this chain creates a new column for the month and recodes the termination_flag
to 0 since these employees were all active during that month.
mutate(trend_month = month_list[i],
termination_flag = 0)
The loop repeats a nearly identical process for terminated employees. The key difference is that we are filtering for “termination” rows: where the job was the employee’s last prior to terminating (termination_flag == 1
).
dh_trend_term <- dh %>%
filter(termination_flag == 1,
desk_id_end_date <= ceiling_date(month_list[i], "month") - 1,
desk_id_end_date >= month_list[i]
) %>%
mutate(trend_month = month_list[i])
The last piece of the loop binds the new rows to our dh_trend
table:
dh_trend <- bind_rows(dh_trend, dh_trend_active, dh_trend_term)
Additional Transformation
Our new monthly calendar data produced by the for loop looks okay but we can enhance it. We can make it fit with the SQLite specs and make it more useful by adding additional fields.
As mentioned earlier, SQLite does not have great support for date formats. So we have to reformat the three date columns back to text.
dh_trend$desk_id_start_date <- format(dh_trend$desk_id_start_date, "%Y-%m-%d")
dh_trend$desk_id_end_date <- format(dh_trend$desk_id_end_date, "%Y-%m-%d")
dh_trend$trend_month <- format(dh_trend$trend_month, "%Y-%m-%d")
To increase the value of this calendar data, we’ll add job names to our data. Each desk_id
has a job assigned to it. That assignment is on the deskjob
table.
dj <- dbGetQuery(con, "SELECT * FROM DESKJOB")
dh_trend <- dh_trend %>%
left_join(dj)
Adding an organizational rollup hierarchy will make this table more valuable for reporting. This organizational information will enable us to report on individual lines of business/business groups. We can retrieve org information from the rollup
table.
ru <- dbGetQuery(con, "SELECT * FROM ROLLUP")
While the rollup table has a lot of useful information that will come in handy for monthly reporting, we can remove the CEO level columns since they are the same for everyone and do not add value.
ru <- ru %>%
select(-lvl00_desk_id, - lvl00_org)
Now join the hierarchy rollup data:
dh_trend <- dh_trend %>%
left_join(ru, by = c("desk_id" = "lvl04_desk_id"))
Let’s also add the employee’s name from the employeeinfo
table.
ei <- dbGetQuery(con, "SELECT * FROM EMPLOYEEINFO")
dh_trend <- dh_trend %>%
left_join(ei %>% select(employee_num,
last_name,
first_name))
Our last transformation is to simply reorder some of the columns.
dh_trend <- dh_trend %>%
select(trend_month,
employee_num,
last_name,
first_name,
job_name,
depth,
everything())
Our data is now transformed into calendar format with all the desired fields and formatting. Let’s take a look before we load the data.
dh_trend %>% sample_n(3)
# A tibble: 3 x 17
trend_month employee_num last_name first_name job_name depth desk_id desk_id_start_d~ desk_id_end_date
1 2001-09-01 21823 Carthen Leesa Salespe~ 4 511 1999-01-01 2001-11-04
2 2005-02-01 37797 Karley Jessica Product~ 4 1096 2005-02-06 2007-05-08
3 1999-03-01 18859 Manvelito Rogelio Consult~ 4 224 1999-01-01 2004-06-19
# ... with 8 more variables: termination_flag , lvl01_desk_id , lvl01_org , lvl02_desk_id ,
# lvl02_org , lvl03_desk_id , lvl03_org , lvl04_org
Load
Our last step is to upload the data to a new table to our database. We will call the table employee_trend
. I love how easy it is to load data using R:
dbWriteTable(con, "employee_trend", dh_trend, overwrite = TRUE)
Finally, let’s retrieve a sample of our results.
et_sample <- dbGetQuery(con, "SELECT * FROM EMPLOYEE_TREND ORDER BY RANDOM() LIMIT 3")
glimpse(et_sample)
Observations: 3
Variables: 17
$ trend_month "2002-02-01", "2008-08-01", "2007-10-01"
$ employee_num 26962, 33111, 9524
$ last_name "Reade", "Turri", "Douet"
$ first_name "Melvyn", "Denver", "Keegan"
$ job_name "Developer", "Paralegal", "Salesperson"
$ depth 4, 4, 4
$ desk_id 1032, 875, 504
$ desk_id_start_date "2001-11-30", "2006-06-28", "2007-06-25"
$ desk_id_end_date "2002-08-02", "2008-12-21", "2008-06-29"
$ termination_flag 0, 0, 0
$ lvl01_desk_id 7, 6, 4
$ lvl01_org "Technology", "Legal", "Sales"
$ lvl02_desk_id 39, 29, 25
$ lvl02_org "Tchnlgy - Data Science", "Lgl - Product", "Sls - South"
$ lvl03_desk_id 169, 138, 61
$ lvl03_org "Tchnlgy - Dt Scnc - RAA", "Lgl - Prdct - PSI", "Sls - Sth - Florida"
$ lvl04_org "Tchnlgy - Dt Scnc - RAA - IC01", "Lgl - Prdct - PSI - IC01", "Sls - Sth - Florida -...
Looks good! I hope you found this exercise valuable. Thank you for taking the time to read this. Please don’t hesitate to reach out to me via Twitter with any questions/comments/suggestions. And don’t forget to close the connection:
dbDisconnect(con)