Tutorial: Is it possible to reuse generated columns in ddply?


I have a script where I'm using ddply, as in the following example:

    ddply(df, .(col), function(x) data.frame(
      col1 = some_function(x$y),
      col2 = some_other_function(x$y)
    ))

Within ddply, is it possible to reuse col1 without calling the entire function again?

For example, something like this:

    ddply(df, .(col), function(x) data.frame(
      col1 = some_function(x$y),
      col2 = some_other_function(x$y),
      col3 = col1 * col2
    ))


You've got a whole function to play with! Doesn't have to be a one-liner! This should work:

    ddply(df, .(col), function(x) {
      tmp <- some_other_function(x$y)
      data.frame(
        col1 = some_function(x$y),
        col2 = tmp,
        col3 = tmp
      )
    })
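
To make this concrete for the col1 * col2 case from the question, here is a minimal sketch that stores both intermediate results in temporaries before building the data frame. The toy data and the use of mean() and sd() as stand-ins for some_function and some_other_function are assumptions made purely for illustration.

```r
library(plyr)

# Toy data, made up for illustration
df <- data.frame(
  col = rep(c("a", "b"), each = 3),
  y   = c(1, 2, 3, 10, 20, 30)
)

res <- ddply(df, .(col), function(x) {
  # Compute each value once per group, then reuse it
  c1 <- mean(x$y)  # stand-in for some_function
  c2 <- sd(x$y)    # stand-in for some_other_function
  data.frame(col1 = c1, col2 = c2, col3 = c1 * c2)
})

print(res)
```

Each function is called only once per group, and col3 is derived from the stored results.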


This appears to be a good candidate for data.table using the scoping rules of the j component. See FAQ 2.8 for details.

From the FAQ:

No anonymous function is passed to the j. Instead, an anonymous body is passed to the j.

So, for your case:

    library(data.table)
    DT <- as.data.table(df)
    DT[, {
      col1 <- some_function(y)
      col2 <- some_other_function(y)
      col3 <- col1 * col2
      list(col1 = col1, col2 = col2, col3 = col3)
    }, by = col]

or, slightly more directly:

    DT[, list(
      col1 = col1 <- some_function(y),
      col2 = col2 <- some_other_function(y),
      col3 = col1 * col2
    ), by = col]

This avoids one repetition each of col1 and col2, and two of col3; reducing repetition is something we strive for in data.table. The = followed by <- might look cumbersome at first, but it allows the following syntactic sugar:

    DT[, list(
      "Projected return (%)"      = col1 <- some_function(y),
      "Investment ($m)"           = col2 <- some_other_function(y),
      "Return on Investment ($m)" = col1 * col2
    ), by = col]

where the output can be sent directly to LaTeX or HTML, for example.
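
For anyone who wants to try the braces form of j on real numbers, here is a small sketch. The toy table and the choice of mean() and sd() in place of some_function and some_other_function are assumptions for illustration, not part of the original question.

```r
library(data.table)

# Toy data, made up for illustration
DT <- data.table(
  col = rep(c("a", "b"), each = 3),
  y   = c(1, 2, 3, 10, 20, 30)
)

out <- DT[, {
  col1 <- mean(y)  # stand-in for some_function
  col2 <- sd(y)    # stand-in for some_other_function
  # The last expression in the body is what j returns for each group
  list(col1 = col1, col2 = col2, col3 = col1 * col2)
}, by = col]

print(out)
```

The variables assigned inside the braces are local to the evaluation of j for each group, so they never appear as columns unless they are included in the returned list.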


I don't think that's possible, but it shouldn't matter too much, because at that point it's not an aggregation function anymore. For example:

    # use summarize() in ddply()
    data.means <- ddply(data, .(groups), summarize,
                        mean = mean(x), sd = sd(x), n = length(x))
    # derive the remaining columns from the aggregated result
    data.means$se <- data.means$sd / sqrt(data.means$n)
    data.means$Upper <- data.means$mean + (data.means$se * 1.96)
    data.means$Lower <- data.means$mean - (data.means$se * 1.96)

So I didn't calculate the SEs directly, but it wasn't so bad calculating it outside of ddply(). If you really wanted to, you could also do

ddply(data, .(groups), summarize, se = sd(x) / sqrt(length(x)))  

Or, to put it in terms of your example:

    ddply(df, .(col), summarize,
      col1 = some_function(y),
      col2 = some_other_function(y),
      col3 = some_function(y) * some_other_function(y)
    )
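
The "compute it afterwards" approach from the standard-error example carries over to the col3 case as well: aggregate col1 and col2 per group, then derive col3 from the result with ordinary vectorised arithmetic. This is only a sketch; the toy data and the use of mean() and sd() as stand-ins for some_function and some_other_function are assumptions.

```r
library(plyr)

# Toy data, made up for illustration
df <- data.frame(
  col = rep(c("a", "b"), each = 3),
  y   = c(1, 2, 3, 10, 20, 30)
)

# Step 1: one row per group, each function called once
agg <- ddply(df, .(col), summarize,
             col1 = mean(y),  # stand-in for some_function
             col2 = sd(y))    # stand-in for some_other_function

# Step 2: derive col3 outside ddply(), reusing the aggregated columns
agg$col3 <- agg$col1 * agg$col2

print(agg)
```

Neither function is evaluated twice this way, at the cost of one extra line outside the ddply() call.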
