One Hot Encoding
Question: One-hot encoding converts label data to numeric data by transforming a one-dimensional list of labels into a matrix in which each entry is a binary value representing label occurrence: '1' means the label occurs, and '0' means it does not. It is preferable to label encoding when the labels have no inherent order or weight (e.g. medium does not have greater weight than small). Applied to a table, one categorical column is converted to N binary columns, where N is the number of distinct values in that column. Define a function 'ohenc' that takes a table and the name of a symbol column, and returns the table with that column removed and the N new columns appended. Each new column name should be the old column name followed by an underscore and the label (e.g. a size column becomes size_s, size_m, size_l, size_xl, etc.). Each new column should contain binary data representing the occurrence of its label, so each row has exactly one 1 across the new columns.
More Information:
https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding
Example:
q)show shirts:([]sku:0 0 0 0 1 1;size:`s`s`s`s`l`xl;color:`red`blue`yellow`green`white`black;price:29.95 29.95 29.95 29.95 14.99 14.99)
sku size color  price
---------------------
0   s    red    29.95
0   s    blue   29.95
0   s    yellow 29.95
0   s    green  29.95
1   l    white  14.99
1   xl   black  14.99
q)ohenc[shirts;`size] // encode one feature
sku color  price size_s size_l size_xl
--------------------------------------
0   red    29.95 1      0      0
0   blue   29.95 1      0      0
0   yellow 29.95 1      0      0
0   green  29.95 1      0      0
1   white  14.99 0      1      0
1   black  14.99 0      0      1
q)ohenc/[shirts;`size`color] // encode multiple features
sku price size_s size_l size_xl color_red color_blue color_yellow color_green color_white color_black
-----------------------------------------------------------------------------------------------------
0   29.95 1      0      0       1         0          0            0           0           0
0   29.95 1      0      0       0         1          0            0           0           0
0   29.95 1      0      0       0         0          1            0           0           0
0   29.95 1      0      0       0         0          0            1           0           0
1   14.99 0      1      0       0         0          0            0           1           0
1   14.99 0      0      1       0         0          0            0           0           1
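
One possible way to write 'ohenc' is sketched below. It is not the only solution and is not taken from the source. It assumes the named column holds symbols, produces boolean columns (which display as 1 and 0 in a table), and orders the new columns by first appearance of each label. The local names v, d and n are illustrative.

```q
// Sketch of one possible ohenc; t is a table, c the name of a symbol column (assumption).
ohenc:{[t;c]
  v:t c;                              / values of the column to encode
  d:distinct v;                       / distinct labels, in order of first appearance
  n:`$(string[c],"_"),/:string d;     / new names, e.g. `size_s`size_l`size_xl
  (enlist[c] _ t),'flip n!d=\:v       / drop c, then append one boolean column per label
  }
```

Because the function takes the table first and the column name second, applying it with over, as in ohenc/[shirts;`size`color], encodes several columns in turn. A column of another type would need its labels converted to strings differently when building the names.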