This document discusses using purrr to iterate through and read in multiple Excel files stored in a directory into R. It provides two methods: 1) Saving the Excel files as a list after importing, and 2) Storing the Excel files as a nested data frame with the file paths and data. The document also demonstrates how to read all sheets within an Excel file by mapping over the sheet names.
Iteration with purrr
Reading in many Excel files in a directory
Have you come across a situation where you had to read many Excel files into R?
Luckily, purrr allows you to iterate through your Excel files and programmatically get them into your R session.
With a few easy steps you can harness the awesome power of purrr:
1. Method 1
a. Saving Excel files as a list
2. Method 2
a. Storing Excel sheets as a nested data frame
Setup
Load libraries:
• tidyverse
• readxl
• fs (gives a rock solid cross-platform interface to the filesystem)
library(tidyverse)
library(readxl)
library(fs)
Setup (continued)
The fs::dir_info() function returns information on all files in a directory and stores it in a tibble.
I wrap my expressions in parentheses to assign the result to a variable AND print it to the console, for example:
(x <- sum(y))
# supply the dir_info function a path to your Excel files directory:
(xl_paths <- fs::dir_info('excel_sheets/'))
# A tibble: 2 x 18
path type size permissions modification_time user group
<fs::path> <fct> <fs:> <fs::perms> <dttm> <chr> <chr>
1 excel_she~ file 15.8M rw- 2019-09-04 19:33:13 <NA> <NA>
2 excel_she~ file 68.7K rw- 2019-09-21 23:00:40 <NA> <NA>
# ... with 11 more variables: device_id <dbl>, hard_links <dbl>,
# special_device_id <dbl>, inode <dbl>, block_size <dbl>, blocks <dbl>,
# flags <int>, generation <dbl>, access_time <dttm>, change_time <dttm>,
# birth_time <dttm>
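A directory often holds more than Excel workbooks, so it can help to keep only the .xlsx paths before reading anything. The following is a minimal sketch, not part of the original workflow: it assumes the folder of example workbooks bundled with readxl (located with readxl_example()) stands in for your own directory, and the names xl_dir and xlsx_paths are illustrative.

```r
library(fs)
library(readxl)

# Assumption: the readxl example folder stands in for your own directory,
# e.g. 'excel_sheets/'.
xl_dir <- path_dir(readxl_example("datasets.xlsx"))

# glob keeps only the paths that match the wildcard pattern, so .xls files
# and anything else in the folder are skipped.
xlsx_paths <- dir_ls(xl_dir, glob = "*.xlsx")
xlsx_paths
```

dir_ls() returns the paths directly, so no pull() step is needed; dir_info() accepts the same glob argument if you also want the file metadata.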
Saving excel files as a list
Below is a method for storing all imported Excel files as a list:
• Pull the path column from the xl_paths tibble using the pull() function from dplyr
• This gives the paths to the Excel files you want to import as a character vector
(xl_paths_chr_vector <- xl_paths %>%
  pull(path)
)
excel_sheets/factory.xlsx excel_sheets/prest_meet.xlsx
# use these paths and apply the map function to iterate the read_excel function over the character vector
# there are many methods for achieving the same results but I will use an anonymous function:
xl_paths_chr_vector %>%
  map( ~ read_excel(.)) %>%
  # this will return a list of 2
  # let's give the sheets their original names with the set_names() function
  set_names(xl_paths_chr_vector)
$`excel_sheets/factory.xlsx`
# A tibble: 275,415 x 15
SZFROM SZTO WEIGHT TRANSDATE TRANSTIME PN
<chr> <dbl> <dbl> <dttm> <dttm> <chr>
1 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 08:52:00 C340~
2 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 07:45:00 C340~
3 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 09:12:00 C340~
4 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 08:18:00 C340~
5 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 08:23:00 C340~
6 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 08:10:00 C340~
7 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 08:03:00 C340~
8 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 07:28:00 C340~
9 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 09:36:00 C340~
10 <NA> NA 0 2019-09-04 00:00:00 2019-09-04 08:20:00 C340~
# ... with 275,405 more rows, and 9 more variables: PACKLINE <chr>,
# MARKETVA <chr>, VA <chr>, BINNO <dbl>, BG <chr>, FC <chr>,
# `IF(BINTRANS.TRANSTIME>"06:00" AND BINTRANS.TRANSTIME<"18:30", "DAG",
# "NAG")` <chr>, `DATE (IF (TRANSTIME>="00:00" AND TRANSTIME<"06:00" ,
# TRANSDATE-1,TRANSDATE) )` <dttm>, ...15 <lgl>
$`excel_sheets/prest_meet.xlsx`
# A tibble: 205 x 18
DATE W_DAY WEEK_NUM MONTH_NUM `LINE 1_KG` `LINE 2_KG`
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-03-11 00:00:00 MON 11 3 32528.6999~ 37936.0999~
2 2019-03-12 00:00:00 TUE 11 3 40674.9999~ 56930.3999~
3 2019-03-13 00:00:00 WED 11 3 39505.1999~ 58524.5999~
4 2019-03-14 00:00:00 THU 11 3 35589.3999~ 16834.2000~
5 2019-03-15 00:00:00 FRI 11 3 22113.5999~ 12266.6000~
6 2019-03-16 00:00:00 SAT NA NA NA NA
7 2019-03-17 00:00:00 SUN NA NA NA NA
8 2019-03-18 00:00:00 MON 12 3 34105.1999~ 18727.0999~
9 2019-03-19 00:00:00 TUE 12 3 42978.0999~ 30003.3999~
10 2019-03-20 00:00:00 WED 12 3 34028.4999~ 23582.4999~
# ... with 195 more rows, and 12 more variables: `LINE 3_KG` <chr>,
# STD_CARTN <chr>, SHIFT <chr>, WORKERS_1 <chr>, WORKERS_2 <chr>,
# WORKERS_3 <chr>, STD_CRTN_PACKER <chr>, WEIGHT_WORK_HOUR <chr>,
# `STD_CRTN PER MAN_HOUR` <chr>, L1_WEIGHT_PACKER_SHIFT <chr>,
# L2_WEIGHT_PACKER_SHIFT <chr>, L3_WEIGHT_PACKER_SHIFT <chr>
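Since there are several ways to reach the same list, here is one hedged variation. It assumes two example workbooks bundled with readxl (located with readxl_example()) stand in for your own files; the names paths, via_formula and via_function are illustrative. Naming the path vector first lets map() carry the names through, and passing read_excel directly does the same job as the anonymous function.

```r
library(tidyverse)
library(readxl)

# Assumption: readxl's bundled example workbooks stand in for your own files.
paths <- c(readxl_example("datasets.xlsx"), readxl_example("clippy.xlsx"))

# set_names() with no second argument names each element after itself,
# and map() keeps those names on the list it returns.
via_formula <- paths %>%
  set_names() %>%
  map(~ read_excel(.x))

# Passing the function itself is equivalent to the anonymous function above.
via_function <- paths %>%
  set_names() %>%
  map(read_excel)

names(via_function)
```

Both lists are named by file path, so a single workbook can be picked out with via_function[[paths[1]]].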
Storing excel sheets as a nested data frame
Admittedly, I prefer this method for two reasons:
• It’s less coding
• It keeps everything organized in a tabular fashion (notice that the type of the data column is <list>)
xl_paths %>%
  # select only the path column
  select(path)
# A tibble: 2 x 1
path
<fs::path>
1 excel_sheets/factory.xlsx
2 excel_sheets/prest_meet.xlsx
xl_paths %>%
  # select only the path column
  select(path) %>%
  # create a new column called data and populate it with the excel sheets using the map function
  mutate(data = path %>% map(read_excel))
# A tibble: 2 x 2
path data
<fs::path> <list>
1 excel_sheets/factory.xlsx <tibble [275,415 x 15]>
2 excel_sheets/prest_meet.xlsx <tibble [205 x 18]>
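Once the data sits in a list column, you can keep mapping over it without leaving the table. The sketch below is illustrative only: it assumes readxl's bundled example workbooks stand in for your own files, and the names nested, n_rows, n_cols and first_file are hypothetical.

```r
library(tidyverse)
library(readxl)

# Assumption: readxl's bundled example workbooks stand in for your own files.
nested <- tibble(
  path = c(readxl_example("datasets.xlsx"), readxl_example("clippy.xlsx"))
) %>%
  mutate(data = map(path, read_excel))

# map_int() returns an integer vector, so each summary fits in a plain column
# alongside the nested data.
nested <- nested %>%
  mutate(
    n_rows = map_int(data, nrow),
    n_cols = map_int(data, ncol)
  )
nested

# Pull a single data frame back out of the list column by position.
first_file <- nested$data[[1]]
```

Keeping the path next to its data makes it easy to see which file each row count belongs to.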
Reading every sheet of an excel file
You can also use the map function to read every sheet of an Excel file into R.
# list all the sheet names with readxl::excel_sheets()
(sheet_names <- excel_sheets('excel_sheets/prest_meet.xlsx'))
[1] "FORMULAS" "DAILY" "WEEKLY" "MONTHLY"
sheet_names %>%
  map( ~ read_excel(path = 'excel_sheets/prest_meet.xlsx', sheet = .)) %>%
  # give the list the names of the sheets
  set_names(sheet_names)
$FORMULAS
# A tibble: 205 x 18
DATE W_DAY WEEK_NUM MONTH_NUM `LINE 1_KG` `LINE 2_KG`
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-03-11 00:00:00 MON 11 3 32528.6999~ 37936.0999~
2 2019-03-12 00:00:00 TUE 11 3 40674.9999~ 56930.3999~
3 2019-03-13 00:00:00 WED 11 3 39505.1999~ 58524.5999~
4 2019-03-14 00:00:00 THU 11 3 35589.3999~ 16834.2000~
5 2019-03-15 00:00:00 FRI 11 3 22113.5999~ 12266.6000~
6 2019-03-16 00:00:00 SAT NA NA NA NA
7 2019-03-17 00:00:00 SUN NA NA NA NA
8 2019-03-18 00:00:00 MON 12 3 34105.1999~ 18727.0999~
9 2019-03-19 00:00:00 TUE 12 3 42978.0999~ 30003.3999~
10 2019-03-20 00:00:00 WED 12 3 34028.4999~ 23582.4999~
# ... with 195 more rows, and 12 more variables: `LINE 3_KG` <chr>,
# STD_CARTN <chr>, SHIFT <chr>, WORKERS_1 <chr>, WORKERS_2 <chr>,
# WORKERS_3 <chr>, STD_CRTN_PACKER <chr>, WEIGHT_WORK_HOUR <chr>,
# `STD_CRTN PER MAN_HOUR` <chr>, L1_WEIGHT_PACKER_SHIFT <chr>,
# L2_WEIGHT_PACKER_SHIFT <chr>, L3_WEIGHT_PACKER_SHIFT <chr>
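The same sheet-by-sheet read can be written so the file path appears only once. This is a minimal sketch, assuming the datasets.xlsx workbook bundled with readxl stands in for your own file; xl_file and all_sheets are illustrative names.

```r
library(tidyverse)
library(readxl)

# Assumption: a readxl example workbook stands in for your own file.
xl_file <- readxl_example("datasets.xlsx")

all_sheets <- xl_file %>%
  excel_sheets() %>%
  # name each sheet after itself so the resulting list is named by sheet
  set_names() %>%
  # each sheet name fills read_excel()'s sheet argument; path is fixed
  map(read_excel, path = xl_file)

names(all_sheets)
```

Because the sheet names are set before map() runs, there is no need for a separate set_names() call at the end of the pipe.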