Split Training and Testing Sets

Question: When training machine learning models, typically at least two sets of data are used: the training set and the testing set. These sets are distinct subsets from a pool of data. Define a function 'stt' that takes in a list or table and a test set ratio, splits the data into two lists/tables based on the test set ratio, and returns a two item list with the first item representing the training set and the second item representing the test set.

Example

                                
                                q)show t:([]s:100?`AAPL`MSFT`IBM`TSLA;p:100?400)
s    p
--------
AAPL 372
AAPL 326
TSLA 329
IBM  330
TSLA 50
..
q)s:stt[t;.2]
q)s
+`s`p!(`AAPL`AAPL`TSLA`IBM`IBM`IBM`AAPL`TSLA`TSLA`IBM`MSFT`TSLA`AAPL`AAPL`IBM`IBM`TSLA`IBM`MSFT`TSLA`MSFT`TSLA`AAPL`IBM`AAPL`MSFT`AAPL`IBM`AAPL`AAPL`IBM`MSFT`TSLA`TSLA`AAPL`AAPL`TSLA`MSFT`MSFT`TSLA..
+`s`p!(`AAPL`TSLA`MSFT`IBM`MSFT`TSLA`MSFT`IBM`MSFT`MSFT`MSFT`TSLA`IBM`MSFT`TSLA`TSLA`IBM`TSLA`MSFT`AAPL;213 50 58 97 232 152 67 351 25 323 13 263 224 29 2 383 114 187 137 393)
q)s[0]
s    p
--------
AAPL 372
AAPL 326
TSLA 329
IBM  330
IBM  251
..
q)count s[0]
80
q)count s[1]
20
q)stt[`a`b`c`d`e`f`g;.2]
`a`b`c`d`f`g
,`e
                                
                            

Solution

Tags:
functions machine learning
Searchable Tags
algorithms api architecture asynchronous c csv data structures dictionaries disk feedhandler finance functions ingestion ipc iterators machine learning math multithreading optimizations realtime shared library sql statistics streaming strings tables temporal utility websockets