Using Reddit API in R
Problem
I'm scraping some comments from Reddit using the Reddit JSON API and R. Because the data is nested rather than flat, extracting it is a little tricky, but I've found a way.
To give you a flavour of what I'm having to do, here is a brief example:
    # fromJSON() is not base R; a list-returning JSON package such as rjson
    # is assumed to be loaded (the original post does not say which one)
    x = "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json" # example url
    rawdat = readLines(x,warn=F) # reading in the data
    rawdat = fromJSON(rawdat) # formatting
    dat_list = repl = rawdat[[2]][[2]][[2]] # this will be used later
    sq = seq(dat_list)[-1]-1 # number of comments
    txt = unlist(lapply(sq,function(x)dat_list[[x]][[2]][[14]])) # comments (not replies)
    # loop time:
    for(a in sq){
      repl = tryCatch(repl[[a]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to comment a
      if(length(repl)>0){ # in case there are no replies
        sq = seq(repl)[-1]-1 # number of replies
        txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) # this is what I want
        # next level down
        for(b in sq){
          repl = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to reply b of comment a
          if(length(repl)>0){
            sq = seq(repl)[-1]-1
            txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]])))
          }
        }
      }
    }

The above example gets all comments, the first level of replies to each comment, and the second level of replies (i.e. replies to each of the replies). The nesting could go much deeper, though, so I'm trying to figure out an efficient way of handling it. To go one level deeper manually, what I'm having to do is this:
- Copy the following code from the last loop:

    for(b in sq){
      repl = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL)
      if(length(repl)>0){
        sq = seq(repl)[-1]-1
        txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]])))
      }
    }

- Paste that code right after the line that starts with txt = ... and change b in the loop to c.

Solution
Here are my main recommendations:
- use recursion
- use names instead of list indices; for example, node$data$replies$data$children reads much better than node[[2]][[5]][[2]][[2]], and it is also more robust to changes in the data
- use well-named variables so your code reads easily
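To illustrate the point about names versus indices, here is a minimal sketch on a small hand-built list. The node below is hypothetical: its fields (kind, data, body, replies, children, author) only mimic the shape assumed in this answer and are not real Reddit output. It suggests that both styles return the same element until a field is added, after which only the name-based access keeps working.

```r
# Hypothetical, hand-built node; values are made up for illustration.
node <- list(
  kind = "t1",
  data = list(
    body = "parent comment",
    replies = list(
      kind = "Listing",
      data = list(children = list(
        list(kind = "t1", data = list(body = "first reply", replies = ""))
      ))
    )
  )
)

# Both expressions reach the same list of child nodes.
by.name  <- node$data$replies$data$children
by.index <- node[[2]][[2]][[2]][[1]]
print(identical(by.name, by.index))

# Simulate a data change: a new field appears before `body`,
# shifting every positional index inside `data` by one.
shifted <- node
shifted$data <- c(list(author = "someone"), node$data)

# Name-based access is unaffected by the extra field.
by.name.shifted <- shifted$data$replies$data$children

# Index-based access now lands on the comment text and fails;
# tryCatch() turns the resulting error into NULL.
by.index.shifted <- tryCatch(shifted[[2]][[2]][[2]][[1]],
                             error = function(e) NULL)
print(is.null(by.index.shifted))
```

The index-based path depends on the order of fields, which is why a seemingly harmless addition breaks it.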
Now for the code:
    url <- "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json"
    rawdat <- fromJSON(readLines(url, warn = FALSE))
    main.node <- rawdat[[2]]$data$children

    get.comments <- function(node) {
      comment     <- node$data$body
      replies     <- node$data$replies
      reply.nodes <- if (is.list(replies)) replies$data$children else NULL
      return(list(comment, lapply(reply.nodes, get.comments)))
    }

    txt <- unlist(lapply(main.node, get.comments))
    length(txt)
    # [1] 199
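Since the result above depends on a live thread, the recursion can also be tried offline. The following is a rough sketch, not part of the original answer: make.node and mock.nodes are hypothetical names, and the mock tree assumes that a comment without replies carries an empty string in its replies field, which is the case the is.list() check guards against.

```r
# Same recursive function as in the answer.
get.comments <- function(node) {
  comment     <- node$data$body
  replies     <- node$data$replies
  reply.nodes <- if (is.list(replies)) replies$data$children else NULL
  return(list(comment, lapply(reply.nodes, get.comments)))
}

# Hypothetical helper: builds one comment node in the assumed shape.
make.node <- function(body, children = list()) {
  replies <- if (length(children) > 0) {
    list(kind = "Listing", data = list(children = children))
  } else {
    ""  # assumption: no replies is represented by an empty string
  }
  list(kind = "t1", data = list(body = body, replies = replies))
}

# Mock thread: A has two replies, one of which has a reply of its own.
mock.nodes <- list(
  make.node("A", list(
    make.node("A1", list(make.node("A1a"))),
    make.node("A2")
  )),
  make.node("B")
)

# Flatten the nested result into a character vector, depth first.
txt <- unlist(lapply(mock.nodes, get.comments))
print(txt)
```

Because each call handles one node and delegates its children to itself, the same function covers any depth without copying loops.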
Context
StackExchange Code Review Q#61602, answer score: 7