HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Plotting level and temperature data from many Excel sheets

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
temperatureexcelsheetslevelplottingmanyandfromdata

Problem

I've been working on a code that reads in all the sheets of an Excel workbook, where the first two columns in each sheet are "Date" and "Time", and the next two columns are either "Level" and "Temperature, or "LEVEL" and "TEMPERATURE". The code works, but I am working on improving my coding clarity and efficiency, so any advice in those regards would be greatly appreciated.

My function

  • Reads in the data to a list of dataframes



  • Gets rid of any NA columns that were accidentally read in.



  • Combines "Date" and "Time" to "DateTime" for each dataframe



  • Rounds "DateTime" to the nearest 5 minutes for each dataframe



  • Replaces "Date" and "Time" in each dataframe with "DateTime".



I started getting more comfortable with lapply, but am wondering if I can improve the code efficiency at all instead of have so many lines with lapply.

library(readxl)
library(plyr)

  read_excel_allsheets <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  data <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  names(data) <- sheets
  clean <- lapply(data, function(y) y[, colSums(is.na(y)) == 0])
  date <- lapply(clean, "[[", 1)
  time <- lapply(clean, "[[", 2)
  time <- lapply(time, function(z) format(z, format = "%H:%M"))
  datetime <- Map(paste, date, time)
  datetime <- lapply(datetime, function(a) as.POSIXct(a, format = "%Y-%m-%d %H:%M"))
  rounded <- lapply(datetime, function(b) as.POSIXlt(round(as.numeric(b)/(5*60))*(5*60),origin='1970-01-01'))
  addDateTime <- mapply(cbind, clean, "DateTime" = rounded, SIMPLIFY = F)
  final <- lapply(addDateTime, function(z) z[!(names(z) %in% c("Date", "Time"))])
  return(final)
}


Next, I would like to plot all of my data. So, I

  • Tun my code for a file



  • Combine the list of dataframes into one dataframe while maintaining an "ID" for each dataframe as a column



  • Combine the lowercase and uppercase versions of the variable columns.



  • Add two new columns that split the "ID". Each ID is s

Solution

There are a few cons with writing code that uses lapply/Map/mapply at each step. You probably know that these functions are disguised for loops so it is as-if you where running for (i in 1:n) {...} at each line of your code. This will likely be:

  • slower than a single loop (and for loops are reputably slow in R...)



  • memory inefficient: in the event that each iteration creates a lot of data, that data will be stored in memory for all spreadsheets at the same time



  • more cumbersome to test and debug



  • more typing



Instead, I would suggest you write a function to process a single sheet and only loop once using that function (disclaimer: code could not be tested!):

library(readxl)

read_excel_allsheets <- function(filename) {

  process_sheet <- function(sheet, filename) {
    data <- readxl::read_excel(filename, sheet = sheet)
    data <- data[, colSums(is.na(data)) == 0])
    names(data) <- tolower(data)
    stopifnot(c("date", "time", "level", "temperature") %in% names(data))

    datetime <- with(data,
      as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M"))
    rounded <- as.POSIXlt(round(as.numeric(datetime) / (5 * 60)) * (5 * 60),
                          origin = '1970-01-01'))

    cbind(datetime = rounded, data[c("level", "temperature")])
  }

  sheets <- readxl::excel_sheets(filename)
  Map(process_sheet, sheets, filename)
}

mysheets <- read_excel_allsheets(filename)
df <- do.call(rbind, Map(cbind, id = names(mysheets), mysheets))
df$plot <- substr(df$id, start = 1, stop = 2)
df$exp  <- substr(df$id, start = 3, stop = 4)


Here are other improvements I made:

  • Rather than carry the weird assumptions about your data column names into the second part of your code, I took care of making uniform names in the data reading/cleaning section. For that, I converted all names to lowercase using tolower.



  • I added a stopifnot() to check that each sheet contains the four columns you need. Always try to gather in your head the assumptions that your code is relying on, then add code to check these assumptions and die if they are not met.



  • Everywhere I needed a column from the data, I referred to it using its name (e.g. "date") rather than its position (e.g. 1). Doing so will always make your code more readable and more robust to data changes.



  • I replaced plyr::ldply with a base solution. It helps remove a dependency to a 3rd party package but if you prefer to keep dplyr, it's ok too.



  • I replaced the gsub calls with simpler (easier to read) substr


calls. I note that your gsub calls were probably returning the expected results but you are doing something weird. To use \\1, your regex pattern should really use a set of parenthesis () to catch something that will be stored inside \\1. It would have made more sense to not use regex captures and just use the empty string "" as the replacement instead of "\\1".

Code Snippets

library(readxl)

read_excel_allsheets <- function(filename) {

  process_sheet <- function(sheet, filename) {
    data <- readxl::read_excel(filename, sheet = sheet)
    data <- data[, colSums(is.na(data)) == 0])
    names(data) <- tolower(data)
    stopifnot(c("date", "time", "level", "temperature") %in% names(data))

    datetime <- with(data,
      as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M"))
    rounded <- as.POSIXlt(round(as.numeric(datetime) / (5 * 60)) * (5 * 60),
                          origin = '1970-01-01'))

    cbind(datetime = rounded, data[c("level", "temperature")])
  }

  sheets <- readxl::excel_sheets(filename)
  Map(process_sheet, sheets, filename)
}

mysheets <- read_excel_allsheets(filename)
df <- do.call(rbind, Map(cbind, id = names(mysheets), mysheets))
df$plot <- substr(df$id, start = 1, stop = 2)
df$exp  <- substr(df$id, start = 3, stop = 4)

Context

StackExchange Code Review Q#109284, answer score: 4

Revisions (0)

No revisions yet.