Use R to add a Data Disclaimer to an Excel Report

April 7, 2019

I work with confidential and restricted data in a legal/regulatory environment. Data needs to be accurate and secure. A robust disclaimer can go a long way in helping meet those needs.

In this post we will use R to:

  • Wrangle data from the hrsample database
  • Create a basic human resources summary report in Excel
  • Add a simple data disclaimer tab to our report

You can find the R script here: https://github.com/harryahlas/sample-hr-database/blob/master/analysis/data_disclaimer_tab.R.

Ever in Our Favor

My team often distributes our data in Excel format. If you are cringing, no worries, bear with me. Legal and regulatory reporting is often better left vanilla. Excel/PDF/csv etc reports are more suitable for this type of work than, say, a dynamic Tableau dashboard or Shiny app.

The good news is we can turn this archaic Excel reporting practice in our favor. We can leverage Excel’s ability to have multiple tabs in one workbook to create a data disclaimer. The disclaimer will accomplish two things:

  1. Clearly explain to our client how, if at all, s/he can distribute our data.
  2. Help the client understand the data, mitigating risk and reducing the chance of confusion.

I can’t emphasize enough how critical it is to proactively clarify anything that might potentially confuse the client. The disclaimer tab is the perfect opportunity to do so. The disclaimer tab can go a long way in avoiding confusion and - as a result - save time and money down the line. We’ll see an example of this in a bit.

The Data

For this report, we will use the hrsample database (more info here), which can be installed with devtools::install_github("harryahlas/hrsample"). Let’s say a legal team at this imaginary company wants to know what the company’s job headcount and turnover looked like in 2009.

Let’s begin with the R script. First things first, let’s load packages, create a report name/number for tracking as well as an as of date:

library(hrsample)
library(tidyverse)
library(scales)
library(lubridate)
library(openxlsx)

report_name <- "PA73405 - Attrition by Job 2009"
as_of_date <- Sys.Date()

Next, let’s take a closer look at our data from hrsample. There are two key tables we are interested in here. We can use deskjob_table and deskhistory_table in conjunction to determine all the employees’ movement information. deskjob_table - this data associates each desk/position to a specific job.

knitr::kable(deskjob_table[sample(nrow(deskjob_table),5),])
|     | desk_id|job_name    |
|:----|-------:|:-----------|
|777  |     777|Salesperson |
|1039 |    1039|Analyst     |
|423  |     423|Salesperson |
|503  |     503|Salesperson |
|1123 |    1123|Developer   |

deskhistory_table - this data captures the start and end dates for each employee’s job. It also captures whether the job resulted in a termination and if the job came as the result of a promotion.

knitr::kable(deskhistory_table[sample(nrow(deskhistory_table),5),])
|      | employee_num| desk_id|desk_id_start_date |desk_id_end_date | termination_flag| promotion_flag|
|:-----|------------:|-------:|:------------------|:----------------|----------------:|--------------:|
|687   |        27083|     687|1999-01-01         |2000-05-11       |                0|              0|
|10015 |        22657|     677|2018-12-14         |2999-01-01       |                1|              1|
|1327  |        36399|     888|2000-03-17         |2002-10-06       |                1|              1|
|6431  |        18508|    1069|2011-06-20         |2012-12-19       |                0|              0|
|4962  |        24300|     424|2008-02-23         |2013-05-17       |                1|              0|

You’ll notice both tables share a variable desk_id. You can think of desk_id as the employee’s unique desk number. The desk_id_start_date and desk_id_end_date variables refer to the dates the employee was at that “desk”. If the employee switched to a new job somewhere else in the company they would likely switch to a new desk_id. The desk_id is a critical key: from desk_id we can determine what job, location, manager, line of business, etc. belongs to the employee assigned to that desk_id.

Summarize Data with Tidyverse

Now that we have the raw data, let’s use some standard dplyr/tidyr functions to join the two datasets and create a simple data frame that summarizes turnover and headcount by job. First, we will create a new object called hcto_summary (headcount/turnover summary) by joining our two tables. The default join will be on desk_id.

hcto_summary <- deskhistory_table %>% 
  left_join(deskjob_table) %>%

Next, we’ll filter for all rows that have data for 2009.

  filter(desk_id_start_date <= as.Date("2009-12-31"),
         desk_id_end_date >= as.Date("2009-01-01")) %>%

Some employees will have multiple jobs in 2009. Since we only want to count each employee once, let’s take their most recent job for this report. We have to account for the fact that some employees would have switched jobs in 2009. These employees would have multiple entries during that year, probably for different desk_ids. This is something we will call out on our disclaimer tab later on in order to avoid confusion down the line.

In order to count the most recent job for each employee who was active in 2009, we can sort the data with the most recent desk_id_end_date and then group_by(employee_num) %>% filter(row_number() == 1) in order to select the most recent of those jobs. This is a nifty trick I use a lot, and we will be sure to document it in the disclaimer.

  arrange(desc(desk_id_end_date)) %>% 
  group_by(employee_num) %>% 
  filter(row_number() == 1) %>% 
  ungroup() %>%

Next, we’ll clean up some of the variables and names to make the data easier for the end user to digest. Here we will also calculate our key metric: if an employee’s last day in a desk (desk_id_end_date) was in 2009 and they terminated, then they will be flagged as a termination.

  mutate(year = "2009",
         termination_flag = if_else(termination_flag == 1 & year(desk_id_end_date) == 2009, 
                                    "Terminated", 
                                    "DidNotTerminate")) %>% 

The last piece is some wrangling, starting with a count of jobs separating out terminations. The spread function will pivot the data so there is a column for Terminated and a column for DidNotTerminate. We’ll also calculate total Headcount and TerminationRate columns.

  count(job_name, termination_flag) %>% 
  spread(termination_flag, n, fill = 0) %>% 
  mutate(Headcount =  Terminated + DidNotTerminate,
         TerminationRate = percent(Terminated / Headcount)) %>% 
  arrange(desc(Terminated / Headcount))

Please note spread will be replaced with pivot_wide in the near future!

Here is all the code together. I love tidyverse.

hcto_summary <- deskhistory_table %>% 
  left_join(deskjob_table) %>% 
  filter(desk_id_start_date <= as.Date("2009-12-31"),
         desk_id_end_date >= as.Date("2009-01-01")) %>% 
  arrange(desc(desk_id_end_date)) %>% 
  group_by(employee_num) %>% 
  filter(row_number() == 1) %>% 
  ungroup() %>% 
  mutate(year = "2009",
         termination_flag = if_else(termination_flag == 1 & year(desk_id_end_date) == 2009, 
                                    "Terminated", 
                                    "DidNotTerminate")) %>% 
  count(year, job_name, termination_flag) %>% 
  spread(termination_flag, n, fill = 0) %>% 
  mutate(Headcount =  Terminated + DidNotTerminate,
         TerminationRate = percent(Terminated / Headcount)) %>% 
  arrange(desc(Terminated / Headcount))

Here is the output data that we want to send to the clients:

knitr::kable(hcto_summary)
|year |job_name          | DidNotTerminate| Terminated| Headcount|TerminationRate |
|:----|:-----------------|---------------:|----------:|---------:|:---------------|
|2009 |Paralegal         |              18|          5|        23|21.7%           |
|2009 |Regional Leader   |             136|         27|       163|16.6%           |
|2009 |Salesperson       |             522|         94|       616|15.3%           |
|2009 |Consultant        |              85|         14|        99|14.1%           |
|2009 |Department Leader |              33|          5|        38|13.2%           |
|2009 |Attorney          |              41|          6|        47|12.8%           |
|2009 |Project Manager   |              70|         10|        80|12.5%           |
|2009 |Product Manager   |              79|         10|        89|11.2%           |
|2009 |Analyst           |              72|          9|        81|11.1%           |
|2009 |Developer         |             115|         13|       128|10.2%           |
|2009 |Business Leader   |               7|          0|         7|0.0%            |
|2009 |CEO               |               1|          0|         1|0.0%            |

Data Disclaimer Tab

With the retrieval and manipulation behind us, the data is almost ready to export. We are now at the data disclaimer piece. Before continuing, I need to mention this disclaimer is very basic and for demonstration purposes only. Always consult your legal team for approval of external disclaimer verbiage.

This disclaimer will include distribution information as well as information about how the data was pulled. This is our shot to clarify the way we are counting employees (only count the most recent 2009 job). The client typically doesn’t have time to think about these types of details, so making this clarification now will prepare your client for questions and potentially save you from urgent follow up scrambles that no one enjoys.

I have packaged this example’s verbiage into a data frame:

disclaimer_info <- data.frame(Information = 
  c("Source: hrsample",
    paste("Data as of", as_of_date, "."),
    "Data includes all employees rolling up to CEO Danica Hammel who were active at any point from Jan 1, 2009 through December 31, 2009.",
    "If the employee had multiple jobs during 2009, only the most recent job is counted.",
    "Data is confidential and should be shared on a need to know basis only.",
    "Do not distribute externally."))  

Note that the text is enclosed in a concatenation c("xxx", "yyy", "zzz"). The openxlsx package will look for the commas between quotations and determine that words that come after the commas go into a new row in Excel.

Export to Excel

I tend to favor the openxlsx package over others for exporting to Excel. It seems less error prone because it is not dependent on Java. With that, let’s start by creating a new Excel workbook object wb and adding two tabs: the first for the data summary and the second for the disclaimer.

wb <- createWorkbook()
addWorksheet(wb, report_name)
addWorksheet(wb, "Data Disclaimer")

Now we can write our summary and disclaimer data frames to each sheet. The numbers 1 and 2 below are the sheet numbers.

writeDataTable(wb, 1, hcto_summary)
writeDataTable(wb, 2, disclaimer_info)

Next, let’s add just a touch of formatting to sheet #2, our disclaimer tab. Let’s wrap the items in the first column and set the column width of the first row to 50 pixels.

addStyle(wb, 2, style = createStyle(wrapText = TRUE), rows = 1:7, cols = 1)
setColWidths(wb, 2, 1, widths = 50)

Last, we’ll save the workbook with the report name and date and we are done!

saveWorkbook(wb, paste0(report_name, as_of_date, ".xlsx"), TRUE)

Final Thought - Why Automate?

For a one time report this may seem like a lot of work to add one tab with only 7 rows of text. Why not just input the text directly into Excel?

I would agree this is overkill for one report. However, there is a good chance that this code can be reused later. It could be packaged into a function as well, depending on your need.

The value really rears its head when you start producing multiple reports - super handy! That is something I will be tackling on a future post and look forward to sharing with you.

Thanks for reading - I would appreciate any feedback.

comments powered by Disqus