Stratified Sampling
Question: Stratified sampling is a sampling method that divides a population into smaller subgroups known as strata. The size of each stratum in the sample should be proportional to the size of the corresponding stratum in the population. Define a function 'sfsmpl' that takes a list or table, a column name on which to base the strata (for table data), an interval for binning numeric strata values, and a sample size, and returns a sample of the data whose strata sizes are proportional to those of the population. The strata values can be numeric or symbols. As the examples show, an argument that does not apply is passed as the null symbol.
More Information:
https://en.wikipedia.org/wiki/Stratified_sampling
Example
q)d1:([]gender:1000000?`M`F;age:1+1000000?65;score:(1500+200000?901),800000?1500)
// table, strata by gender values
q)select count each age from 0!`gender xgroup sfsmpl[d1;`gender;`;1000]
age
---
500
500
// table, strata by score values 500 increment bins
q)sfsmpl[d1;`score;500;1000]
gender age score
----------------
M      44  1623
F      38  1703
F      23  1578
M      13  1561
F      52  1910
F      5   1814
F      33  1645
F      25  1719
M      19  1501
F      29  1853
M      16  1958
F      51  1698
F      21  1895
M      53  1525
M      8   1761
..
q)select count score by 500 xbar score from sfsmpl[d1;`score;500;1000]
score| score
-----| -----
0    | 267
500  | 266
1000 | 267
1500 | 111
2000 | 89
// list, 500 increment bins
q)count each group 500 xbar sfsmpl[d1`score;`;500;1000]
1500| 111
2000| 89
500 | 266
1000| 267
0   | 267
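One possible approach, offered as a minimal sketch rather than the reference solution: label every row with its stratum, give each stratum a quota of (sample size) x (stratum size) / (population size), and draw that many rows without replacement from each stratum. The question does not say how fractional quotas should be rounded; this sketch assumes the largest-remainder method so that the quotas always add up to the requested sample size. The local names v, s, g, exact and alloc are illustrative only. Because the definition spans several lines, it is meant to be loaded from a script file rather than typed at the q) prompt.

```q
// sketch of sfsmpl; the rounding rule (largest remainder) is an assumption
sfsmpl:{[data;col;bin;n]
  v:$[98h=type data;data col;data];              / values that define the strata
  s:$[null bin;v;bin xbar v];                    / stratum label for every row
  g:value group s;                               / row indices of each stratum
  exact:n*(count each g)%count v;                / exact proportional quota per stratum
  alloc:floor exact;                             / whole part of each quota
  alloc[(n-sum alloc)#idesc exact-alloc]+:1;     / leftover units go to the largest remainders
  data raze (neg alloc)?'g}                      / draw without replacement within each stratum
```

Rows come back grouped by stratum, in the order in which each stratum first appears in the input. The sketch assumes the sample size does not exceed the population size; otherwise the draw without replacement fails.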