HiveBrain v1.2.0

Import data from XML files into data.frame

Submitted by: @import:stackexchange-codereview

Problem

I have a number of XML files containing data I would like to analyse. Each XML contains data in a format similar to this:

<?xml version='1.0' encoding='UTF-8'?>
<build>
  ...
  <number>1276</number>
  <timestamp>1447062398490</timestamp>
  <startTime>1447062398538</startTime>
  <result>ABORTED</result>
  <duration>539722</duration>
  <charset>UTF-8</charset>
  <keepLog>false</keepLog>
  <builtOn></builtOn>
  <workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
  <hudsonVersion>1.624</hudsonVersion>
</build>


These are build.xml files generated by the continuous integration server, Jenkins. The files themselves lack some important data that I would like, such as the Jenkins job name or the build number that produced the XML. The job and build IDs are encoded in the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml
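To make the path layout concrete, here is a minimal sketch (the example path and job name are hypothetical) of pulling the two identifiers out of a single filename with base R:

```r
# Hypothetical path following the layout described above
path <- "jenkins/jobs/clean-caches/builds/42/build.xml"

# Drop the filename, split the directory part on the separator,
# then read from the end: the last directory is the build number,
# the third-from-last is the job name
parts <- rev(strsplit(dirname(path), "/")[[1]])
build <- parts[[1]]   # "42"
job   <- parts[[3]]   # "clean-caches"
```

Reading from the reversed vector keeps the indexing stable no matter how many directories sit above jenkins/.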

I would like to create a data frame containing the job name, build number, duration, and result.

My code to achieve this is the following:

library(XML)

filenames <- list.files("jenkins/jobs", pattern = "^build\\.xml$",
                        recursive = TRUE, full.names = TRUE)
...

Which gives me a data frame that looks like this:

job build result duration
1 clean-caches 37 SUCCESS 248701
2 clean-caches 38 FAILURE 1200049
3 clean-caches 39 FAILURE 1200060
4 clean-caches 40 FAILURE 1200123
5 clean-caches 41 SUCCESS 358024
6 clean-caches 42 SUCCESS 130462


This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.

My concerns:

  • Repeated code: the code blocks that generate the job and build vectors are identical, and the same goes for duration and result. If I decide to import more nodes from the XML, I'll end up repeating even more code.

  • Several iterations must be made over my list of files. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.

Solution

With only a few XML files to read, you could have done something like this to address your concerns, where each file is read only once, but all of them are loaded into memory at the same time:

subdirs <- strsplit(dirname(filenames),
                    split = .Platform$file.sep)
subdirs <- lapply(subdirs, rev)
job   <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)
xmls <- lapply(filenames, xmlParse)
duration <- sapply(xmls, xpathSApply, "//duration", xmlValue)
result   <- sapply(xmls, xpathSApply, "//result", xmlValue)
build.data <- data.frame(job, build, result, duration)


With thousands of files though, it makes more sense to process the files one by one and only keep the useful information before moving from one file to the next. It also makes sense to write a function to process each file. It could be:

build.info <- function(file, xml_fields = c("duration", "result")) {
   res <- list()
   # process filepath
   subdirs <- rev(unlist(strsplit(dirname(file),
                                  split = .Platform$file.sep)))
   res$job   <- subdirs[[3]]
   res$build <- subdirs[[1]]
   # process xml data
   doc <- xmlTreeParse(file)
   build <- doc$doc$children$build
   res[xml_fields] <- lapply(build[xml_fields], xmlValue)
   # return as a data.frame
   as.data.frame(res)
}


Note how the function returns a one-row data.frame. You can then call the function on all files via lapply and bind all the outputs together:

build.data <- do.call(rbind, lapply(filenames, build.info))


With a few changes, you can write a more general function that takes one or more files and does the binding itself (like file.info does):

build.info <- function(file, xml_fields = c("duration", "result")) {
   stopifnot(length(file) > 0L)
      res <- list()
      # process filepath
      subdirs <- rev(unlist(strsplit(dirname(file),
                                     split = .Platform$file.sep)))
      res$job   <- subdirs[[3]]
      res$build <- subdirs[[1]]
      # process xml data
      doc <- xmlTreeParse(file)
      build <- doc$doc$children$build
      res[xml_fields] <- lapply(build[xml_fields], xmlValue)
      # return data.frame
      as.data.frame(res)
   } else {
      do.call(rbind, lapply(file, build.info))
   }
}

build.data <- build.info(filenames)
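One caveat worth adding (my note, not part of the original answer): xmlValue returns character data, and data.frame may turn those strings into factors on older R versions, so the duration column is not yet numeric. A short sketch of tidying build.data before analysis:

```r
# convert the extracted strings to the types you actually want;
# as.character() first guards against factor columns in older R
build.data$build    <- as.integer(as.character(build.data$build))
build.data$duration <- as.numeric(as.character(build.data$duration))

# e.g. mean build duration (Jenkins stores milliseconds) per result
tapply(build.data$duration, build.data$result, mean)
```

After the conversion, the usual summaries (mean, aggregate, plotting) work directly on the duration column.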


Context

StackExchange Code Review Q#116442, answer score: 3
