patternpythonMinor
Plotting level and temperature data from many Excel sheets
Viewed 0 times
temperatureexcelsheetslevelplottingmanyandfromdata
Problem
I've been working on a code that reads in all the sheets of an Excel workbook, where the first two columns in each sheet are "Date" and "Time", and the next two columns are either "Level" and "Temperature, or "LEVEL" and "TEMPERATURE". The code works, but I am working on improving my coding clarity and efficiency, so any advice in those regards would be greatly appreciated.
My function
I started getting more comfortable with
Next, I would like to plot all of my data. So, I
My function
- Reads in the data to a list of dataframes
- Gets rid of any NA columns that were accidentally read in.
- Combines "Date" and "Time" to "DateTime" for each dataframe
- Rounds "DateTime" to the nearest 5 minutes for each dataframe
- Replaces "Date" and "Time" in each dataframe with "DateTime".
I started getting more comfortable with
lapply, but am wondering if I can improve the code efficiency at all instead of have so many lines with lapply.library(readxl)
library(plyr)
read_excel_allsheets <- function(filename) {
sheets <- readxl::excel_sheets(filename)
data <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
names(data) <- sheets
clean <- lapply(data, function(y) y[, colSums(is.na(y)) == 0])
date <- lapply(clean, "[[", 1)
time <- lapply(clean, "[[", 2)
time <- lapply(time, function(z) format(z, format = "%H:%M"))
datetime <- Map(paste, date, time)
datetime <- lapply(datetime, function(a) as.POSIXct(a, format = "%Y-%m-%d %H:%M"))
rounded <- lapply(datetime, function(b) as.POSIXlt(round(as.numeric(b)/(5*60))*(5*60),origin='1970-01-01'))
addDateTime <- mapply(cbind, clean, "DateTime" = rounded, SIMPLIFY = F)
final <- lapply(addDateTime, function(z) z[!(names(z) %in% c("Date", "Time"))])
return(final)
}Next, I would like to plot all of my data. So, I
- Tun my code for a file
- Combine the list of dataframes into one dataframe while maintaining an "ID" for each dataframe as a column
- Combine the lowercase and uppercase versions of the variable columns.
- Add two new columns that split the "ID". Each ID is s
Solution
There are a few cons with writing code that uses
Instead, I would suggest you write a function to process a single sheet and only loop once using that function (disclaimer: code could not be tested!):
Here are other improvements I made:
calls. I note that your
lapply/Map/mapply at each step. You probably know that these functions are disguised for loops so it is as-if you where running for (i in 1:n) {...} at each line of your code. This will likely be:- slower than a single loop (and
forloops are reputably slow in R...)
- memory inefficient: in the event that each iteration creates a lot of data, that data will be stored in memory for all spreadsheets at the same time
- more cumbersome to test and debug
- more typing
Instead, I would suggest you write a function to process a single sheet and only loop once using that function (disclaimer: code could not be tested!):
library(readxl)
read_excel_allsheets <- function(filename) {
process_sheet <- function(sheet, filename) {
data <- readxl::read_excel(filename, sheet = sheet)
data <- data[, colSums(is.na(data)) == 0])
names(data) <- tolower(data)
stopifnot(c("date", "time", "level", "temperature") %in% names(data))
datetime <- with(data,
as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M"))
rounded <- as.POSIXlt(round(as.numeric(datetime) / (5 * 60)) * (5 * 60),
origin = '1970-01-01'))
cbind(datetime = rounded, data[c("level", "temperature")])
}
sheets <- readxl::excel_sheets(filename)
Map(process_sheet, sheets, filename)
}
mysheets <- read_excel_allsheets(filename)
df <- do.call(rbind, Map(cbind, id = names(mysheets), mysheets))
df$plot <- substr(df$id, start = 1, stop = 2)
df$exp <- substr(df$id, start = 3, stop = 4)Here are other improvements I made:
- Rather than carry the weird assumptions about your data column names into the second part of your code, I took care of making uniform names in the data reading/cleaning section. For that, I converted all names to lowercase using
tolower.
- I added a
stopifnot()to check that each sheet contains the four columns you need. Always try to gather in your head the assumptions that your code is relying on, then add code to check these assumptions and die if they are not met.
- Everywhere I needed a column from the data, I referred to it using its name (e.g.
"date") rather than its position (e.g.1). Doing so will always make your code more readable and more robust to data changes.
- I replaced
plyr::ldplywith a base solution. It helps remove a dependency to a 3rd party package but if you prefer to keep dplyr, it's ok too.
- I replaced the
gsubcalls with simpler (easier to read)substr
calls. I note that your
gsub calls were probably returning the expected results but you are doing something weird. To use \\1, your regex pattern should really use a set of parenthesis () to catch something that will be stored inside \\1. It would have made more sense to not use regex captures and just use the empty string "" as the replacement instead of "\\1".Code Snippets
library(readxl)
read_excel_allsheets <- function(filename) {
process_sheet <- function(sheet, filename) {
data <- readxl::read_excel(filename, sheet = sheet)
data <- data[, colSums(is.na(data)) == 0])
names(data) <- tolower(data)
stopifnot(c("date", "time", "level", "temperature") %in% names(data))
datetime <- with(data,
as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M"))
rounded <- as.POSIXlt(round(as.numeric(datetime) / (5 * 60)) * (5 * 60),
origin = '1970-01-01'))
cbind(datetime = rounded, data[c("level", "temperature")])
}
sheets <- readxl::excel_sheets(filename)
Map(process_sheet, sheets, filename)
}
mysheets <- read_excel_allsheets(filename)
df <- do.call(rbind, Map(cbind, id = names(mysheets), mysheets))
df$plot <- substr(df$id, start = 1, stop = 2)
df$exp <- substr(df$id, start = 3, stop = 4)Context
StackExchange Code Review Q#109284, answer score: 4
Revisions (0)
No revisions yet.