Efficient Filesearching with R

November 24, 2019

Search files for specific text in specific filetypes

This is a simple script I use often and think it is worth sharing. It is really good for searching for text within specific types of files. Here is a link to the code and to the github repository. The full code is also available at the bottom of this page.

When to Use this Script

On occasion, there are times where you need to look through old work for something. For example, maybe there is a change to field in a database and you need to understand how it impacts your prior work. You may want to know which old scripts have that field name.

One option might be to search every file in your drive for a text phrase. However, you run into two problems when you search through all files:

  • A search through every file in a drive could take a very long time to run (days).
  • The result set you return may be larger than what you want. You may be only interested in .sql scripts but this could be returning other filetypes such as text files or Word documents. Using this script, you can speed up the search and have a better set of results.

An Example

For this exercise, please open a new RStudio project:

  1. Open RStudio
  2. File > New Project… > Version Control > Git
  3. Under Repository URL paste https://github.com/harryahlas/file_search.git
  4. Click Create Project
  5. Under the Files tab click file_search.R.

In this example, let’s say we want to know which files in the sql-data-warehouse-samples-master directory contain either “security” or “exec”. However, we are only interested in SQL scripts and READMEs - files ending in either .sql or .md.

Let’s walk through the code. I am a big fan of the tidyverse workflow; we’ll start by loading this package.

library(tidyverse)

In this example, we’ll search the sql-data-warehouse-samples-master folder (note: these files are sourced from https://github.com/Microsoft/sql-data-warehouse-samples). Of course you can switch this to any directory or drive you choose.

directory_to_search <- "sql-data-warehouse-samples-master"

Next we’ll specify the types of files we want to search. We are only interested in SQL and README files, so we will look for files that end in either .sql or .md.

extension_types <- c(".sql", ".md")

Now we will specify the terms that we want to search for within each SQL or README file, “security” and “exec”.

search_terms <- c("security", "exec")

Later we will use the grepl() function to search through files. This next section of code formats our search terms to fit within grepl().

extension_types_re <- paste0(extension_types, "$", collapse = "|") 
search_terms_re <- paste0(search_terms, collapse = "|")

This is the end result of the formatting:

> extension_types_re
[1] ".sql$|.md$"
> search_terms_re
[1] "security|exec"

Next, we’ll get a list of all the files in the directory, including child folders. We’ll put them into a data frame and then filter for filenames ending in .sql or .md. By searching for only certain filetypes, you could save a lot of time, especially if it is a larger repository.

files_to_search <- list.files(directory_to_search, recursive = TRUE, full.names = TRUE) %>% 
  tibble(filenames = .) %>% 
  filter(grepl(extension_types_re, filenames, ignore.case = TRUE)) 

Let’s look at files_to_search.

files_to_search %>% head()
# A tibble: 6 x 1
  filenames                                                                                                  
                                                                                                        
1 sql-data-warehouse-samples-master/arm-templates/sqlDwAutoScaler/SqlDwAutoScaler/ScaleSqlDw/readme.md       
2 sql-data-warehouse-samples-master/arm-templates/sqlDwAutoScaler/SqlDwAutoScaler/ScaleSqlDwByTimer/readme.md
3 sql-data-warehouse-samples-master/arm-templates/sqlDwTimerScaler/README.md                                 
4 sql-data-warehouse-samples-master/samples/adf/management-operations/Readme.md                              
5 sql-data-warehouse-samples-master/samples/adf/Readme.md                                                    
6 sql-data-warehouse-samples-master/samples/automation/RefreshReplicatedTable/README.md 

To start our last chunk of code, we read in each of the files we identified and look for our search terms. We’ll wrap the read_file() function in a new function read_file_try() in order to handle errors. We’ll then run read_file_try() on each observation of files_to_search. Then we filter for any file that has the text we are looking for. Finally, we will use unnest() to put the text into its own variable.

# try wrapper
read_file_try <- function(file_name) {
  errortest <- try(read_file(file_name), T)
  output <- if_else(class(errortest) == "try-error", "error", as.character(errortest))
}

# Read files and search
files_identified <- files_to_search %>% 
  add_row(filenames = "x") %>% 
  mutate(file_text = map(filenames, read_file_try)) %>% 
  filter(grepl(search_terms_re, file_text, ignore.case = TRUE)) %>% 
  unnest(file_text)

And here is our output:

files_identified$filenames %>% head()
[1] "sql-data-warehouse-samples-master/arm-templates/sqlDwAutoScaler/SqlDwAutoScaler/ScaleSqlDw/readme.md"       
[2] "sql-data-warehouse-samples-master/arm-templates/sqlDwAutoScaler/SqlDwAutoScaler/ScaleSqlDwByTimer/readme.md"
[3] "sql-data-warehouse-samples-master/samples/adf/management-operations/Readme.md"                              
[4] "sql-data-warehouse-samples-master/samples/adf/Readme.md"                                                    
[5] "sql-data-warehouse-samples-master/samples/scripts/deployments/GetCreateStatement_Table.sql"                 
[6] "sql-data-warehouse-samples-master/samples/scripts/monitor/concurrency/running_queued_queries_slots.sql" 

Let me know if you have any feedback - feel free to reach out on Twitter or GitHub.

Full Code

library(tidyverse)

# Identify directory to search
directory_to_search <- "sql-data-warehouse-samples-master"

# Identify filetypes/suffixes to search
extension_types <- c(".sql", ".md")

# Identify search terms
search_terms <- c("security", "exec")

# Format search terms and extensions for grepl filter
extension_types_re <- paste0(extension_types, "$", collapse = "|") 
search_terms_re <- paste0(search_terms, collapse = "|")

# View
extension_types_re
search_terms_re

# Identify files to search
files_to_search <- list.files(directory_to_search, recursive = TRUE, full.names = TRUE) %>% 
  tibble(filenames = .) %>% 
  filter(grepl(extension_types_re, filenames, ignore.case = TRUE)) 

# try wrapper
read_file_try <- function(file_name) {
  errortest <- try(read_file(file_name), T)
  output <- if_else(class(errortest) == "try-error", "error", as.character(errortest))
}

# Read files and search
files_identified <- files_to_search %>% 
  add_row(filenames = "x") %>% 
  mutate(file_text = map(filenames, read_file_try)) %>% 
  filter(grepl(search_terms_re, file_text, ignore.case = TRUE)) %>% 
  unnest(file_text)

files_identified$filenames %>% head()
comments powered by Disqus