>I think that in this case, moving the calculation of scale outside of Forward() may not actually give any noticeable runtime improvement

I have the same opinion on this one.
>creating the mask object is going to take some amount of memory

For this one, I am not sure how well Armadillo can optimize it. Would it do something like

    arma::mat randomMat; // build a random matrix
    mask = randomMat;    // copy the data of randomMat into mask

or would it overwrite the data of mask directly, since mask has already allocated its buffer?

I did a small test on this:

    #include <armadillo>

    int main() {
        arma::mat temp;
        temp = arma::randu<arma::mat>(1000, 1000);
    }

Then I opened Task Manager on Windows 8 and watched the memory usage: it always stays at 8.3MB, which is roughly the size of a single 1000x1000 matrix of doubles (about 8MB).
So I guess Armadillo 5.x (or the compiler) is smart enough to avoid an extra memory allocation in this case.

