Import data from XML files into data.frame
Problem
I have a number of XML files containing data I would like to analyse. Each XML contains data in a format similar to this:
```xml
<build>
  <queueId>1276</queueId>
  <timestamp>1447062398490</timestamp>
  <startTime>1447062398538</startTime>
  <result>ABORTED</result>
  <duration>539722</duration>
  <charset>UTF-8</charset>
  <keepLog>false</keepLog>
  <workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
  <hudsonVersion>1.624</hudsonVersion>
</build>
```
These are build.xml files generated by the continuous integration server, Jenkins. The files themselves lack some important data that I would like, such as the Jenkins job name, or the number of the build that created the XML. The job and build ids are encoded into the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml.

I would like to create a data frame containing job name, build number, duration, and result.
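To illustrate that encoding, a single such path (a hypothetical example, written with forward slashes for portability) decomposes like this:

```r
# Hypothetical path following the jobs/${JOB_NAME}/builds/${BUILD_NUMBER} layout
path <- "jenkins/jobs/clean-caches/builds/37/build.xml"
subdirs <- rev(unlist(strsplit(dirname(path), "/")))
job <- subdirs[[3]]    # "clean-caches"
build <- subdirs[[1]]  # "37"
```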
My code to achieve this is the following:
```r
library(XML)
filenames
```
Which gives me a data frame that looks like this:
```
           job build  result duration
1 clean-caches    37 SUCCESS   248701
2 clean-caches    38 FAILURE  1200049
3 clean-caches    39 FAILURE  1200060
4 clean-caches    40 FAILURE  1200123
5 clean-caches    41 SUCCESS   358024
6 clean-caches    42 SUCCESS   130462
```
This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.
My concerns:
- Repeated code: the code blocks that generate the `job` and `build` vectors are identical, and the same goes for `duration` and `result`. If I decide to import more nodes from the XML, I'll end up repeating even more code.
- Several iterations must be made over my list of files. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.
Solution
With a few XML files to read, you could have done something like this to address your concerns, where each file is read only once, but all of them are loaded into memory at the same time:
```r
subdirs <- strsplit(dirname(filenames),
                    split = .Platform$file.sep)
subdirs <- lapply(subdirs, rev)
job <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)
xmls <- lapply(filenames, xmlParse)
duration <- sapply(xmls, xpathSApply, "//duration", xmlValue)
result <- sapply(xmls, xpathSApply, "//result", xmlValue)
build.data <- data.frame(job, build, result, duration)
```

With thousands of files though, it makes more sense to process the files one by one and only keep the useful information before moving from one file to the next. It also makes sense to write a function to process each file. It could be:
```r
build.info <- function(file, xml_fields = c("duration", "result")) {
  res <- list()
  # process filepath
  subdirs <- rev(unlist(strsplit(dirname(file),
                                 split = .Platform$file.sep)))
  res$job <- subdirs[[3]]
  res$build <- subdirs[[1]]
  # process xml data
  doc <- xmlTreeParse(file)
  build <- doc$doc$children$build
  res[xml_fields] <- lapply(build[xml_fields], xmlValue)
  # return as a data.frame
  as.data.frame(res)
}
```

See how the function returns a one-row data.frame. Then you can call the function on all files via
`lapply` and bind all the outputs together:

```r
build.data <- do.call(rbind, lapply(filenames, build.info))
```

With a few changes, you can write a more general function that will take one or more files and do the binding itself (like
`file.info` does):

```r
build.info <- function(file, xml_fields = c("duration", "result")) {
  stopifnot(length(file) > 0L)
  if (length(file) == 1L) {
    res <- list()
    # process filepath
    subdirs <- rev(unlist(strsplit(dirname(file),
                                   split = .Platform$file.sep)))
    res$job <- subdirs[[3]]
    res$build <- subdirs[[1]]
    # process xml data
    doc <- xmlTreeParse(file)
    build <- doc$doc$children$build
    res[xml_fields] <- lapply(build[xml_fields], xmlValue)
    # return data.frame
    as.data.frame(res)
  } else {
    do.call(rbind, lapply(file, build.info, xml_fields = xml_fields))
  }
}
```
```r
build.data <- build.info(filenames)
```
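The `do.call(rbind, lapply(...))` idiom that `build.info` relies on can be seen in isolation on toy one-row data frames (hypothetical data, purely illustrative):

```r
# Each call produces a one-row data.frame; rbind stacks them into one frame
rows <- lapply(1:3, function(i) {
  data.frame(build = i,
             result = if (i %% 2 == 0) "FAILURE" else "SUCCESS")
})
combined <- do.call(rbind, rows)  # a 3-row data.frame
```

`do.call` is needed because `rbind` takes the frames as separate arguments, not as a single list.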
Context
StackExchange Code Review Q#116442, answer score: 3