Using Reddit API in R

Submitted by: @import:stackexchange-codereview

Problem

I'm scraping some comments from Reddit using the Reddit JSON API and R. Since the data is nested rather than flat, extracting it is a little tricky, but I've found a way.

To give you a flavour of what I'm having to do, here is a brief example:

x = "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json" # example url
rawdat   = readLines(x,warn=F) # reading in the data
rawdat   = fromJSON(rawdat) # formatting
dat_list = repl = rawdat[[2]][[2]][[2]] # this will be used later
sq       = seq(dat_list)[-1]-1 # number of comments
txt      = unlist(lapply(sq,function(x)dat_list[[x]][[2]][[14]])) # comments (not replies)

# loop time:

for(a in sq){
  repl  = tryCatch(repl[[a]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to comment a

  if(length(repl)>0){ # in case there are no replies
    sq  = seq(repl)[-1]-1 # number of replies
    txt    = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) # this is what I want

    # next level down
    for(b in sq){
      repl  = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to reply b of comment a

      if(length(repl)>0){
        sq  = seq(repl)[-1]-1
        txt    = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]])))   
      }
    }
  }
}


The above example gets all comments, the first level of replies to each of these comments and the second level of replies (i.e. replies to each of the replies), but this could go down much deeper, so I'm trying to figure out an efficient way of handling this. To achieve this manually, what I'm having to do is this:

- Copy the following code from the last loop:

for(b in sq){
  repl  = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL)

  if(length(repl)>0){
    sq  = seq(repl)[-1]-1
    txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]])))   
  }
}


- Paste that code right after the line that starts with txt = ... and change b in the loop to c.

Solution

Here are my main recommendations:

  • use recursion
  • use names instead of list indices: for example, node$data$replies$data$children reads much better than node[[2]][[5]][[2]][[2]], and it is also more robust to changes in the data
  • use well-named variables so your code reads easily
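
The point about names being more robust than indices can be illustrated with a small sketch. The node below is hypothetical: its field names (kind, data, body) mirror the ones used in this answer, but the values and the extra score field are made up for illustration.

```r
# Hypothetical comment node; values are invented for this example.
node <- list(kind = "t1", data = list(author = "alice", body = "first!"))

by.index.before <- node[[2]][[2]]   # positional lookup: depends on field order
by.name.before  <- node$data$body   # named lookup: says what it reads

# Insert a new field ahead of the existing ones, as an API change might.
node$data <- c(list(score = 10), node$data)

by.index.after <- node[[2]][[2]]    # now picks up the author, not the body
by.name.after  <- node$data$body    # still the comment body

print(c(by.index.before, by.index.after))
print(c(by.name.before, by.name.after))
```

The positional lookup keeps running after the change but quietly returns a different field, which is harder to notice than an error.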



Now for the code:

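# fromJSON() must come from a JSON package that returns nested lists;
# the post does not say which one (rjson and RJSONIO both provide it).
url       <- "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json"
rawdat    <- fromJSON(readLines(url, warn = FALSE))
main.node <- rawdat[[2]]$data$children   # top-level comment nodes

# Return a node's comment text followed by the text of all its replies, at any depth.
get.comments <- function(node) {
   comment     <- node$data$body
   replies     <- node$data$replies
   # replies is only a list when the comment actually has replies
   reply.nodes <- if (is.list(replies)) replies$data$children else NULL
   return(list(comment, lapply(reply.nodes, get.comments)))
}

# unlist() flattens the nested result into one character vector
txt <- unlist(lapply(main.node, get.comments))
length(txt)
# [1] 199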
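
To try the recursion without a network call, here is a minimal sketch that runs the same function on a hand-built list. The helper make.node and the toy thread are hypothetical. They only imitate the data, body, replies and children fields used above, and they assume that a comment without replies carries a non-list value (an empty string here) in replies, which is what the is.list() check guards against.

```r
# Same recursive function as in the answer.
get.comments <- function(node) {
  comment     <- node$data$body
  replies     <- node$data$replies
  reply.nodes <- if (is.list(replies)) replies$data$children else NULL
  list(comment, lapply(reply.nodes, get.comments))
}

# Hypothetical helper: builds a node shaped like the ones the function expects.
# A node without children gets "" as its replies (assumed stand-in for "no replies").
make.node <- function(body, children = NULL) {
  replies <- if (is.null(children)) "" else list(data = list(children = children))
  list(data = list(body = body, replies = replies))
}

# Toy thread, three levels deep under the first comment.
toy.thread <- list(
  make.node("top A", list(
    make.node("reply A1", list(make.node("reply A1a"))),
    make.node("reply A2")
  )),
  make.node("top B")
)

txt <- unlist(lapply(toy.thread, get.comments))
print(txt)
```

Each comment is followed by its own replies before the next sibling, so the flattened vector comes out in depth-first order, and no loop has to be added per nesting level.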

Context

StackExchange Code Review Q#61602, answer score: 7
