<p><b>@rcurtin</b> commented on this pull request.</p>

<p>This looks great to me; thank you for taking the time to make these changes.  This will be a nice improvement to mlpack's DET implementation.  I have a few comments, so let me know what you think and we can go from there.</p><hr>

<p>In <a href="https://github.com/mlpack/mlpack/pull/802#pullrequestreview-4957462">src/mlpack/core/arma_extend/Mat_extra_bones.hpp</a>:</p>

<pre style='color:#555'>&gt; @@ -12,6 +12,15 @@
 template&lt;typename Archive&gt;
 void serialize(Archive&amp; ar, const unsigned int version);
+/**
+ * These will help us refer the proper vector / column types, only with
+ * specifying the matrix type we want to use.
+ */
+
+typedef Col&lt;elem_type&gt;   vec_type;
+typedef Col&lt;elem_type&gt;   col_type;
+typedef Row&lt;elem_type&gt;   row_type;
</pre>

<p>This is a nice idea, and we should consider submitting something like this upstream, or at least starting a discussion with the Armadillo maintainer.</p>
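<p>To illustrate why member typedefs like these are convenient, here is a minimal sketch using only the standard library.  <code>ToyMat</code> and <code>ColumnSum</code> are hypothetical names made up for this example; they are not Armadillo or mlpack code.  The point is just that generic code can name the vector type given only the matrix type.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a matrix class; this is NOT arma::Mat.
template<typename eT>
struct ToyMat
{
  typedef eT elem_type;
  // Member typedefs in the spirit of the ones added in the diff above.
  typedef std::vector<eT> vec_type;
  typedef std::vector<eT> col_type;
  typedef std::vector<eT> row_type;

  std::size_t n_rows, n_cols;
  std::vector<eT> mem; // Column-major storage, zero-initialized.

  ToyMat(std::size_t r, std::size_t c) : n_rows(r), n_cols(c), mem(r * c) { }

  eT& operator()(std::size_t r, std::size_t c) { return mem[c * n_rows + r]; }

  // Copy column c out into the matching column type.
  col_type col(std::size_t c) const
  {
    return col_type(mem.begin() + c * n_rows, mem.begin() + (c + 1) * n_rows);
  }
};

// Generic code names the vector type through the matrix type alone.
template<typename MatType>
typename MatType::elem_type ColumnSum(const MatType& m, const std::size_t c)
{
  const typename MatType::vec_type v = m.col(c);
  typename MatType::elem_type sum = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    sum += v[i];
  return sum;
}
```

<p>A caller would write something like <code>ColumnSum(m, 1)</code> for a <code>ToyMat&lt;double&gt; m</code> without ever spelling out the vector type.</p>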

<hr>

<p>In <a href="https://github.com/mlpack/mlpack/pull/802#pullrequestreview-4957462">src/mlpack/methods/det/dt_utils_impl.hpp</a>:</p>

<pre style='color:#555'>&gt;      prunedSequence.push_back(treeSeq);
     oldAlpha = alpha;
     alpha = dtree.PruneAndUpdate(oldAlpha, dataset.n_cols, useVolumeReg);
     // Some sanity checks.  It seems that on some datasets, the error does not
     // increase as the tree is pruned but instead stays the same---hence the
     // &quot;&lt;=&quot; in the final assert.
-    Log::Assert((alpha &lt; std::numeric_limits&lt;double&gt;::max()) ||
-        (dtree.SubtreeLeaves() == 1));
+    Log::Assert((alpha &lt; std::numeric_limits&lt;double&gt;::max()) || (dtree.SubtreeLeaves() == 1));
</pre>

<p>This line is over 80 characters; we should wrap it in accordance with the style guide:<br>

 <a href="https://github.com/mlpack/mlpack/wiki/DesignGuidelines">https://github.com/mlpack/mlpack/wiki/DesignGuidelines</a></p>

<p>I think some other lines are now too long as well.</p>

<hr>

<p>In <a href="https://github.com/mlpack/mlpack/pull/802#pullrequestreview-4957462">src/mlpack/methods/det/dt_utils_impl.hpp</a>:</p>

<pre style='color:#555'>&gt;        cvDTree.PruneAndUpdate(cvOldAlpha, train.n_cols, useVolumeReg);
     }
     // Compute test values for this state of the tree.
     double cvVal = 0.0;
     for (size_t i = 0; i &lt; test.n_cols; ++i)
     {
-      arma::vec testPoint = test.unsafe_col(i);
+      typename MatType::vec_type testPoint = test.unsafe_col(i);
       cvVal += cvDTree.ComputeValue(testPoint);
</pre>

<p>Can we do <code>cvDTree.ComputeValue(test.col(i))</code> here?  It would probably require templatizing <code>ComputeValue()</code> to accept arbitrary vector types.  My concern is that sparse datasets don't have the <code>unsafe_col()</code> method.</p>
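<p>A rough sketch of what I mean by templatizing, using only the standard library.  <code>ToyLeaf</code> is a hypothetical one-node stand-in, not the real <code>DTree</code>, and the bounding-box check is only there to give the function something to do; the relevant part is that <code>ComputeValue()</code> takes any vector-like type that provides <code>operator[]</code>.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-node stand-in for a density estimation tree; this is NOT
// mlpack's DTree.
struct ToyLeaf
{
  std::vector<double> minVals;
  std::vector<double> maxVals;
  double density;

  // Accept any vector-like type that provides operator[].
  template<typename VecType>
  double ComputeValue(const VecType& query) const
  {
    for (std::size_t d = 0; d < minVals.size(); ++d)
    {
      if (query[d] < minVals[d] || query[d] > maxVals[d])
        return 0.0; // The query lies outside the bounding box.
    }
    return density;
  }
};
```

<p>With a signature like this, the caller can pass whatever the matrix type returns for a column (dense or sparse) without first copying it into a specific vector type.</p>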

<hr>

<p>In <a href="https://github.com/mlpack/mlpack/pull/802#pullrequestreview-4957462">src/mlpack/methods/det/dtree_impl.hpp</a>:</p>

<pre style='color:#555'>&gt;  
   const size_t points = end - start;
   double minError = logNegError;
   bool splitFound = false;
   // Loop through each dimension.
-  for (size_t dim = 0; dim &lt; maxVals.n_elem; dim++)
+#ifdef _WIN32
+  #pragma omp parallel for default(shared)
+  for (intmax_t dim = 0; dim &lt; (intmax_t) maxVals.n_elem; ++dim)
+#else
+  #pragma omp parallel for default(shared)
+  for (size_t dim = 0; dim &lt; maxVals.n_elem; ++dim)
+#endif
   {
     // Have to deal with REAL, INTEGER, NOMINAL data differently, so we have to
     // think of how to do that...
</pre>

<p>I think we can remove this code comment now.  With your refactoring, this handles real and integer data, though it doesn't really handle nominal data.  I don't think Pari's paper even discussed handling nominal data in density estimation trees (although the extension should be straightforward... kind of), so I don't think we need to worry about that.</p>

<hr>

<p>In <a href="https://github.com/mlpack/mlpack/pull/802#pullrequestreview-4957462">src/mlpack/methods/det/dtree_impl.hpp</a>:</p>

<pre style='color:#555'>&gt;      if ((actualMinDimError &gt; minError) &amp;&amp; dimSplitFound)
     {
-      // Calculate actual error (in logspace) by adding terms back to our
-      // estimate.
-      minError = actualMinDimError;
-      splitDim = dim;
-      splitValue = dimSplitValue;
-      leftError = std::log(dimLeftError) - 2 * std::log((double) data.n_cols)
-          - volumeWithoutDim;
-      rightError = std::log(dimRightError) - 2 * std::log((double) data.n_cols)
-          - volumeWithoutDim;
-      splitFound = true;
+      {
</pre>

<p>Why the extra braces?</p>

<hr>

<p>In <a href="https://github.com/mlpack/mlpack/pull/802#pullrequestreview-4957462">src/mlpack/methods/det/dtree_impl.hpp</a>:</p>

<pre style='color:#555'>&gt; -    dimVec = arma::sort(dimVec);
-
-    // Find the best split for this dimension.  We need to figure out why
-    // there are spikes if this minLeafSize is enforced here...
-    for (size_t i = minLeafSize - 1; i &lt; dimVec.n_elem - minLeafSize; ++i)
+    // Get the values for splitting. The old implementation:
+    //   dimVec = data.row(dim).subvec(start, end - 1);
+    //   dimVec = arma::sort(dimVec);
+    // could be quite inefficient for sparse matrices, due to copy operations (3).
+    // This one has custom implementation for dense and sparse matrices.
+
+    std::vector&lt;SplitItem&gt; splitVec = details::ExtractSplits(data,
+                                                             dim,
+                                                             start,
+                                                             end,
+                                                             minLeafSize);
</pre>

<p>As far as I can tell, the reason for the <code>ExtractSplits</code> function is that the <code>sort()</code> method is not available for sparse matrices.  Suppose that <code>sort()</code> did exist for sparse matrices (e.g. suppose I sat down and wrote it, which I might need to do shortly!).  Then we could do this...</p>

<pre><code>typename MatType::row_type dimVec = data.row(dim).subvec(start, end - 1);
dimVec = arma::sort(dimVec);

// Iterate over all possible values.
typename MatType::row_type::const_row_col_iterator it;
for (it = dimVec.begin_row_col(); it != dimVec.end_row_col(); ++it)
{
  // Check the split to the left side of the point that *it represents, if it exists.
  if (it-&gt;col() &gt; 0)
  {
    // do checking for split between dimVec[it-&gt;col() - 1] and dimVec[it-&gt;col()]...
  }

  // If this is the last stored element, also check the split to the right, if applicable.
  // There's probably a cleaner way to write this code.
  typename MatType::row_type::const_row_col_iterator it2 = it;
  if ((++it2) == dimVec.end_row_col())
  {
    // do checking for split between dimVec[it-&gt;col()] and dimVec[it-&gt;col() + 1]...
  }
}
</code></pre>

<p>Note that the <code>row_col_iterator</code> will only "stop" at points that are actually represented in memory.  So for sparse matrices it will skip over zero elements.  I think that the <code>row_col_iterator</code> is not actually documented by Armadillo... I should submit a patch for that...</p>
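<p>To make the "only stops at stored points" behaviour concrete, here is a small standard-library sketch.  <code>ToySparseRow</code> and <code>VisitedColumns</code> are hypothetical names for this example, and a <code>std::map</code> keyed by column merely imitates a sparse row; this is not how Armadillo implements its iterator.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sparse row: only nonzero entries are stored, keyed by column.
// This only imitates the "stop at stored elements" behaviour described above;
// it is NOT Armadillo's row_col_iterator.
typedef std::map<std::size_t, double> ToySparseRow;

// Return the columns that an iterator over the stored entries would visit.
std::vector<std::size_t> VisitedColumns(const ToySparseRow& row)
{
  std::vector<std::size_t> cols;
  for (ToySparseRow::const_iterator it = row.begin(); it != row.end(); ++it)
    cols.push_back(it->first); // Implicit zeros are never visited.
  return cols;
}
```

<p>For a row with nonzeros only in columns 1 and 4, the loop visits exactly those two columns, so any split candidates that involve the implicit zeros have to be handled by looking at the gaps between visited columns.</p>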

<p>What do you think?  Would this approach work?  If so, I will implement <code>SpMat::sort()</code> (it should be pretty straightforward; I think I have a good idea of how to do it).  That would allow us to avoid having separate code for the dense and sparse cases (I like to push specialized code like that to Armadillo wherever possible).</p>
