One Hot Encoding

Question: One-hot encoding converts label data to numeric data by transforming a one-dimensional list of labels into a matrix in which each entry is a binary value representing label occurrence: '1' means the label occurs, while '0' means it does not. It is preferable to label encoding when the labels carry no inherent order or weight (e.g. medium does not have greater weight than small). When applied to a table, one categorical column is converted to N binary columns, where N is the number of distinct values in the column. Define a function 'ohenc' that takes a table and the name of a symbol column, and returns the table with that column removed and the N new columns appended. Each new column name should be the old column name, an underscore, and the label (e.g. a size column becomes size_s, size_m, size_l, size_xl, etc.). Each new column should contain binary data representing the occurrence of its label, so every row has exactly one '1' across the new columns.

More Information:

https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding

Example

                                
q)show shirts:([]sku:0 0 0 0 1 1;size:`s`s`s`s`l`xl;color:`red`blue`yellow`green`white`black;price:29.95 29.95 29.95 29.95 14.99 14.99)
sku size color  price 
--------------------- 
0   s    red    29.95 
0   s    blue   29.95 
0   s    yellow 29.95 
0   s    green  29.95 
1   l    white  14.99 
1   xl   black  14.99 
q)ohenc[shirts;`size] // encode one feature 
sku color  price size_s size_l size_xl 
-------------------------------------- 
0   red    29.95 1      0      0 
0   blue   29.95 1      0      0 
0   yellow 29.95 1      0      0 
0   green  29.95 1      0      0 
1   white  14.99 0      1      0 
1   black  14.99 0      0      1 
q)ohenc/[shirts;`size`color] // encode multiple features 
sku price size_s size_l size_xl color_red color_blue color_yellow color_green color_white color_black 
----------------------------------------------------------------------------------------------------- 
0   29.95 1      0      0       1         0          0            0           0           0 
0   29.95 1      0      0       0         1          0            0           0           0 
0   29.95 1      0      0       0         0          1            0           0           0 
0   29.95 1      0      0       0         0          0            1           0           0 
1   14.99 0      1      0       0         0          0            0           1           0 
1   14.99 0      0      1       0         0          0            0           0           1
                                
                            

Solution
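
One possible approach, offered as a minimal sketch rather than the reference solution: take the distinct values of the column, build the new column names from the old name, an underscore, and each value, compare every distinct value against the column to get one boolean vector per value, and join the result onto the original table with the encoded column dropped. The local names d and n are illustrative only, and the new columns are assumed to be acceptable as booleans, which display as 1 and 0 in a table.

```q
// Sketch of ohenc: t is a table, c is the name (symbol atom) of a symbol column
ohenc:{[t;c]
  d:distinct t c;                      // distinct labels, in order of first appearance
  n:`$(string[c],"_"),/:string d;      // new column names, e.g. `size_s`size_l`size_xl
  (enlist[c] _ t),'flip n!d=\:t c}     // drop c, then append one boolean column per label
```

Because the function takes the table first and the column name second, it can be folded over a list of column names with ohenc/[shirts;`size`color], as in the example above.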

Tags:
functions, machine learning