[mlpack-git] (blog) master: KeonWeekThree (bab2edf)

Mon Jun 13 07:37:41 EDT 2016

Repository : https://github.com/mlpack/blog
On branch  : master
Link       : https://github.com/mlpack/blog/compare/19cae4ee6106486ce352d2ef6bd6468f900a221f...f2dfdbb794dc318ba5de69acae94c6fb6d6a52eb

>---------------------------------------------------------------

commit bab2edf4a781a2aa27f61d3e87d1991d6a2f1f46
Author: Keon Kim <kwk236 at gmail.com>
Date:   Mon Jun 13 20:37:41 2016 +0900

    KeonWeekThree


>---------------------------------------------------------------

bab2edf4a781a2aa27f61d3e87d1991d6a2f1f46
 content/blog/KeonWeekThree.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/content/blog/KeonWeekThree.md b/content/blog/KeonWeekThree.md
new file mode 100644
index 0000000..6d8c9e2
--- /dev/null
+++ b/content/blog/KeonWeekThree.md
@@ -0,0 +1,29 @@
+Title: Dataset and Experimentation Tools : Week-3 Highlights
+Date: 2016-06-13 18:00:00
+Tags: gsoc, dataset, data
+Author: Keon Kim
+
+Last week, I planned to finalize the handling of missing variables and the imputation strategies.
+Tham gave me advice and ideas for implementing the Imputer and DatasetMapper classes.
+As a result, I was able to:
+
+1) Rewrite and finalize the Imputer class, the DatasetMapper class, and the CLI executable that provides imputation methods for missing variables.
+I modularized the mapping policies and imputation strategies so that they can be used interchangeably.
+
+2) Implement utility functions: one-hot-encoding, standard-scale (standardization), and min-max-scale (normalization).
+
+One concern I have is that some of the features I planned are already implemented in the Armadillo library or in mlpack.
+
+I think I have spent much of my time so far reading and analyzing the code.
+As a result, I am getting used to the style of mlpack and of C++ in general.
+Next week, I will:
+
+1) Refine and make pull requests for one-hot-encoding and min-max-scale.
+
+2) Start working on a CLI executable for statistical analysis.
+
+3) Plan and implement a proof of concept for a function that scans through a file and detects faults (it can be used independently or before data::Load()).
+   I have to think about how to reuse or modularize the code in data::Load(), since it already has tokenizers.
+
+4) Start worrying about how to treat datetime variables.
+(As of now, mlpack fails to map values like "1993.05.12" or "1993/05/12". It reads only the leading "1993" as a number and discards the rest.)