[mlpack-git] (blog) master: keon week ten eleven (18f4672)
gitdub at mlpack.org
Mon Aug 8 15:04:41 EDT 2016
Repository : https://github.com/mlpack/blog
On branch : master
Link : https://github.com/mlpack/blog/compare/0d5896c23a70a7d20d8d030ecc19c38e90bf866e...18f4672bc8ccdeb7f0f1894f9f5fc856b1aedbe0
Author: Keon Kim <kwk236 at gmail.com>
Date: Tue Aug 9 04:04:41 2016 +0900
keon week ten eleven
content/blog/KeonWeekTenEleven.md | 97 +++++++++++++++++++++++++++++++++++++++
1 file changed, 97 insertions(+)
diff --git a/content/blog/KeonWeekTenEleven.md b/content/blog/KeonWeekTenEleven.md
new file mode 100644
@@ -0,0 +1,97 @@
+Title: Dataset and Experimentation Tools : Week - 10 and 11 Highlights
+Date: 2016-07-26 16:00:00
+Tags: gsoc, dataset, data
+Author: Keon Kim
+I've been building a preprocess_validate CLI executable,
+a simple app that prints warnings for possibly invalid values in a dataset.
+The output of this program looks like the example below, which is ultimately
+what we've been trying to achieve from the
+[WARN ] Possibly problematic value at point 1, categorical feature 0 :
+5 (numeric value in categorical feature)
+[WARN ] Invalid value at point 4, numerical feature 1 :
+[WARN ] Invalid value at point 1, numerical feature 2 : a
+[WARN ] Invalid value at point 4, numerical feature 2 : b
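To make the warning format above concrete, here is a minimal sketch of how one such line could be assembled. The `Finding` struct and `FormatWarning` function are hypothetical names invented for this illustration. They are not part of mlpack or of the preprocess_validate code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical record of one suspicious token (not an mlpack type).
struct Finding {
  std::size_t point;        // index of the data point (row of the input file)
  std::size_t feature;      // index of the feature (column of the input file)
  bool categoricalFeature;  // true if the feature is treated as categorical
  std::string token;        // the raw text that looked wrong
};

// Build one warning line in the same layout as the sample output.
std::string FormatWarning(const Finding& f) {
  std::ostringstream out;
  if (f.categoricalFeature) {
    out << "[WARN ] Possibly problematic value at point " << f.point
        << ", categorical feature " << f.feature << " : " << f.token
        << " (numeric value in categorical feature)";
  } else {
    out << "[WARN ] Invalid value at point " << f.point
        << ", numerical feature " << f.feature << " : " << f.token;
  }
  return out.str();
}
```

An empty token, such as a blank cell, simply leaves nothing after the colon, which matches the second warning in the sample.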
+It took me longer to build because I tried several approaches for
+1) In the first approach, I made a class named Validator that acts similarly
+ to the Imputer class.
+2) I found that it is hard to track where the missing values are if there are
+ more than two missing values, since every invalid value is turned
+ into "nan". So I tried hacking the DatasetMapper class so that it could
+ store where the invalid values are.
+3) The next decision was to make a new validate_policy, which prints warnings
+ as it goes through the tokens of each dimension
+ (though it still has some issues to fix).
+4) Lastly, to fundamentally fix the problem, I suggest the approach
+The current `maps` object for DatasetMapper can be described as a map of the form
+`map<dimension, pair<bimap<string, MappedType>, numMappings>>`
+(numMappings usually being a numeric primitive type.)
+I think the process of having multiple map policies can be simplified by having
+two mapping objects. For validation and imputation purposes, we could have
+another mapper (I will call it invalidMaps for now), which would look like:
+// MapType = map<dimension, pair<bimap<string, MappedType>, numMappings>>;
+// InvalidMapType = map<string, std::pair<dimension, point>>;
+invalidMaps and maps serve two different purposes.
+maps is used as usual (mapping a categorical feature to a numeric feature).
+invalidMaps is used as a temporary holder for future imputation. Both x and y
+coordinates have to be stored in order to track the invalid values, since
+every invalid value is turned into NaN.
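The two-container idea can be sketched in plain C++ as follows. This is only an illustration under several assumptions. `SimpleBimap`, `TwoMapSketch`, `MapCategory` and `RecordInvalid` are hypothetical names, the bimap is replaced by two ordinary maps so that the sketch needs no Boost, `std::size_t` stands in for the dimension, point and MappedType types, and a `std::multimap` is used for invalidMaps because the same token can be invalid at more than one position.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for bimap<string, MappedType>: two maps kept in sync.
struct SimpleBimap {
  std::unordered_map<std::string, std::size_t> forward;   // token -> id
  std::unordered_map<std::size_t, std::string> backward;  // id -> token
};

// MapType: dimension -> (bimap, numMappings).
using MapType =
    std::unordered_map<std::size_t, std::pair<SimpleBimap, std::size_t>>;

// InvalidMapType: token -> (dimension, point). A multimap is an assumption
// made here so that repeated invalid tokens are not overwritten.
using InvalidMapType =
    std::multimap<std::string, std::pair<std::size_t, std::size_t>>;

struct TwoMapSketch {
  MapType maps;                // the usual categorical mappings
  InvalidMapType invalidMaps;  // positions of values that became NaN

  // Return the id of a categorical token, creating a new id if needed.
  std::size_t MapCategory(const std::string& token, std::size_t dimension) {
    auto& entry = maps[dimension];  // numMappings starts at 0
    auto it = entry.first.forward.find(token);
    if (it != entry.first.forward.end())
      return it->second;
    const std::size_t id = entry.second++;
    entry.first.forward.emplace(token, id);
    entry.first.backward.emplace(id, token);
    return id;
  }

  // Remember where an invalid token was seen so it can be reported or
  // imputed later.
  void RecordInvalid(const std::string& token, std::size_t dimension,
                     std::size_t point) {
    invalidMaps.emplace(token, std::make_pair(dimension, point));
  }
};
```

Keeping the two containers separate means the categorical mapping logic never has to know about invalid values. It only has to be told to skip them.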
+I made [commits in this branch](https://github.com/keonkim/mlpack/commits/check)
+to test its usability.
+The code I am referring to is "validate_policy" written in
+I wrote it only for testing, so the code still has a lot to improve.
+When I run the code with the following dataset using validate policy:
+a, 2, 3
+NULL, 6, a
+b, 9, 1
+a, 2, 3
+c, , b
+The matrix that data::Load() produces from the above data is:
+[INFO ] 3 mappings in dimension 0.
+[INFO ] 0 mappings in dimension 1.
+[INFO ] 0 mappings in dimension 2.
+[DEBUG] 0 nan 1.0000e+00 0 2.0000e+00
+[DEBUG] 2.0000e+00 6.0000e+00 9.0000e+00 2.0000e+00 nan
+[DEBUG] 3.0000e+00 nan 1.0000e+00 3.0000e+00 nan
+"3 mappings in dimension 0" indicates that
+it successfully mapped (a->0, b->1, c->2).
+NULL was not mapped because I set it as one of the user-defined missingValues.
+All `nan`s are recorded in the invalidMaps object, which can later be used for
+printing errors or for imputation.
+I think this is an intuitive approach,
+and it can replace the use of all other mapping policies.
+I think this way we can make mlpack more user-friendly by reducing
+the number of new concepts users have to learn.