An exhaustive search regression built on base R
Usage
regsearch(
data,
dependent,
independent,
minvar = 1,
maxvar,
family,
topN = 0,
interactions = FALSE,
multi = FALSE,
...
)
Arguments
- data
A `data.frame` that contains a dependent variable and the independent variables.
- dependent
The dependent variable for the regression.
- independent
A vector of independent variables to be used. These must match the column names from `data`. These can also include interaction terms made from column names from `data`. This allows for specific interaction terms to be used, rather than every possible interaction as is done with `interactions = TRUE`.
- minvar
(Optional) The minimum number of independent variables to be used in the regression. Defaults to 1.
- maxvar
The maximum number of independent variables to be used in the regression. Must be equal to or less than the number of independent variables. If interaction terms are used, they count as one independent variable.
- family
The type of regression. Passed to `glm`. See `glm` for more information.
- topN
(Optional) The number of top-ranked regressions whose summaries are printed when the run completes. Defaults to 0.
- interactions
(Optional) A boolean indicating whether or not interaction terms should be used. Defaults to `FALSE`.
- multi
(Optional) A boolean indicating whether or not multithreading should be used. Defaults to `FALSE`. It is highly recommended to use multithreading.
- ...
(Optional) Additional arguments to be passed to `glm`.
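To make `minvar` and `maxvar` concrete, the following is a minimal sketch of how an exhaustive search could enumerate its candidate formulas with base R. The helper `enumerate_formulas` is hypothetical and is not the package's own code. Writing an interaction term as `"ind_3*ind_4"` is also an assumption, based on how formulas appear in the example output.

```r
# Hypothetical helper (not part of the package): build every formula string
# that uses between minvar and maxvar of the candidate terms.
enumerate_formulas <- function(dependent, independent, minvar = 1,
                               maxvar = length(independent)) {
  stopifnot(minvar >= 1, maxvar <= length(independent), minvar <= maxvar)
  combos <- lapply(seq(minvar, maxvar), function(k) {
    # simplify = FALSE returns a list with one character vector per combination
    utils::combn(independent, k, simplify = FALSE)
  })
  combos <- unlist(combos, recursive = FALSE)
  vapply(combos, function(terms) {
    paste(dependent, "~", paste(terms, collapse = " + "))
  }, character(1))
}

vars <- c("ind_1", "ind_2", "ind_3", "ind_4")
formulas <- enumerate_formulas("dependent", vars, minvar = 1, maxvar = 4)
length(formulas)  # 15, i.e. sum(choose(4, 1:4))

# An explicit interaction term is treated as a single candidate (assumed syntax).
with_interaction <- enumerate_formulas("dependent", c("ind_1", "ind_3*ind_4"),
                                       minvar = 1, maxvar = 2)
with_interaction
```

With four candidate variables and `minvar = 1`, `maxvar = 4`, this gives 15 formulas, which matches the count reported in the first example below. The number of combinations grows quickly as variables are added, which is why a small `maxvar` can matter.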
Value
Returns a `data.table` with one row per regression run. The rows are sorted in descending order by `rSquare` divided by the mean p-value, which tends to push better-fitting regressions to the top of the list. The columns are:
- `formula`
The regression formula used.
- `aic`
The AIC (Akaike information criterion) for the regression.
- `bic`
The BIC (Bayesian information criterion) for the regression.
- `rSquare`
The calculated R-squared value for the regression.
- `warn`
Currently unused.
- independent
One column per independent variable or interaction term, holding the p-value of that term in each regression. The value is `NA` when the term is not part of that regression's formula.
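The ranking rule described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper `score_fit` is hypothetical, `rSquare` is assumed to be the deviance-based value `1 - deviance / null deviance`, and the intercept's p-value is assumed to be left out of the mean. Both assumptions appear consistent with the example output on this page, but they are not confirmed by it.

```r
set.seed(1)  # arbitrary seed, only so that the sketch is repeatable
dt <- data.frame(dependent = sample(c(0, 1), 100, replace = TRUE),
                 ind_1 = runif(100),
                 ind_2 = runif(100))

# Hypothetical helper: fit one formula and return the values used for ranking.
score_fit <- function(formula_text, data, family = "binomial") {
  fit <- glm(as.formula(formula_text), family = family, data = data)
  pvals <- summary(fit)$coefficients[, 4]           # fourth column holds p-values
  pvals <- pvals[names(pvals) != "(Intercept)"]     # assumption: intercept excluded
  r_square <- 1 - fit$deviance / fit$null.deviance  # assumption: deviance-based
  data.frame(formula = formula_text, aic = AIC(fit), bic = BIC(fit),
             rSquare = r_square, score = r_square / mean(pvals))
}

candidates <- c("dependent ~ ind_1",
                "dependent ~ ind_2",
                "dependent ~ ind_1 + ind_2")
results <- do.call(rbind, lapply(candidates, score_fit, data = dt))

# Highest score first, mirroring the descending sort described above.
results <- results[order(results$score, decreasing = TRUE), ]
results
```

Dividing by the mean p-value rewards models whose terms are individually significant, so a model with a slightly lower `rSquare` can still rank above a larger model that carries weak terms.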
Examples
# Creating dummy data
dt <- data.frame("dependent" = sample(c(0, 1), 100, replace = TRUE),
                 "ind_1" = runif(100, 0, 1),
                 "ind_2" = runif(100, 0, 1),
                 "ind_3" = runif(100, 0, 1),
                 "ind_4" = runif(100, 0, 1))
# Without interaction terms and multithreading
## Two top results
regsearch(dt, "dependent", c("ind_1", "ind_2", "ind_3", "ind_4"),
          minvar = 1, maxvar = 4, family = "binomial", topN = 2)
#> Warning: Missing 'interactions' argument. Defaulting to FALSE.
#> Warning: Missing 'multi' argument. Defaulting to FALSE.
#> [1] "Assembling regresions..."
#> [1] "Creating 15 formulas. Please be patient, this may take a while."
#> [1] "Creating regressions..."
#> [1] "Running 15 regressions. Please be patient, this may take a while."
#> [1] "Running regressions..."
#>
#> Call:
#> glm(formula = as.formula(x), family = family, data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.4949 -1.1981 0.9401 1.1158 1.3594
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.5011 0.5154 -0.972 0.331
#> ind_1 0.6554 0.7359 0.891 0.373
#> ind_4 0.7032 0.7656 0.918 0.358
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 138.27 on 99 degrees of freedom
#> Residual deviance: 136.51 on 97 degrees of freedom
#> AIC: 142.51
#>
#> Number of Fisher Scoring iterations: 4
#>
#>
#> Call:
#> glm(formula = as.formula(x), family = family, data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.462 -1.221 0.939 1.128 1.372
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.5605 0.5882 -0.953 0.341
#> ind_1 0.6461 0.7374 0.876 0.381
#> ind_2 0.1348 0.6406 0.210 0.833
#> ind_4 0.6959 0.7662 0.908 0.364
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 138.27 on 99 degrees of freedom
#> Residual deviance: 136.46 on 96 degrees of freedom
#> AIC: 144.46
#>
#> Number of Fisher Scoring iterations: 4
#>
#> formula aic bic rSquare warn
#> 1: dependent ~ + ind_1 + ind_4 142.5053 150.3208 0.01276 NA
#> 2: dependent ~ + ind_1 + ind_2 + ind_4 144.4609 154.8816 0.01308 NA
#> 3: dependent ~ + ind_1 + ind_3 + ind_4 144.5023 154.9230 0.01278 NA
#> 4: dependent ~ + ind_4 141.3038 146.5141 0.00698 NA
#> 5: dependent ~ + ind_1 + ind_2 + ind_3 + ind_4 146.4584 159.4842 0.01310 NA
#> 6: dependent ~ + ind_1 141.3572 146.5675 0.00660 NA
#> 7: dependent ~ + ind_2 + ind_4 143.2334 151.0489 0.00749 NA
#> 8: dependent ~ + ind_3 + ind_4 143.2697 151.0852 0.00723 NA
#> 9: dependent ~ + ind_1 + ind_2 143.2936 151.1092 0.00706 NA
#> 10: dependent ~ + ind_2 + ind_3 + ind_4 145.2018 155.6225 0.00772 NA
#> 11: dependent ~ + ind_1 + ind_3 143.3571 151.1726 0.00660 NA
#> 12: dependent ~ + ind_1 + ind_2 + ind_3 145.2936 155.7143 0.00706 NA
#> 13: dependent ~ + ind_2 + ind_3 144.1523 151.9678 0.00085 NA
#> 14: dependent ~ + ind_2 142.1719 147.3823 0.00070 NA
#> 15: dependent ~ + ind_3 142.2475 147.4578 0.00016 NA
#> xIntercept ind_1 ind_2 ind_3 ind_4
#> 1: 0.3308671 0.3731085 NA NA 0.3583815
#> 2: 0.3406195 0.3809417 0.8332921 NA 0.3637671
#> 3: 0.4019967 0.3824943 NA 0.9569377 0.3575983
#> 4: 0.6202945 NA NA NA 0.3285351
#> 5: 0.3955666 0.3899512 0.8339157 0.9594924 0.3630483
#> 6: 0.5776676 0.3414683 NA NA NA
#> 7: 0.5817247 NA 0.7908741 NA 0.3352401
#> 8: 0.6359655 NA NA 0.8536337 0.3255300
#> 9: 0.5479413 0.3504516 0.8009956 NA NA
#> 10: 0.5894053 NA 0.7944590 0.8589336 0.3322281
#> 11: 0.6558815 0.3471964 NA 0.9928694 NA
#> 12: 0.6091045 0.3558050 0.8010777 0.9959040 NA
#> 13: 0.9547657 NA 0.7578417 0.8886398 NA
#> 14: 0.9527065 NA 0.7552192 NA NA
#> 15: 0.8802303 NA NA 0.8827580 NA
## No top results ('multi' is left unset, hence the warning below)
regsearch(dt, "dependent", c("ind_1", "ind_2", "ind_3", "ind_4"),
          minvar = 1, maxvar = 4, family = "binomial",
          topN = 0, interactions = FALSE)
#> Warning: Missing 'multi' argument. Defaulting to FALSE.
#> [1] "Assembling regresions..."
#> [1] "Creating 15 formulas. Please be patient, this may take a while."
#> [1] "Creating regressions..."
#> [1] "Running 15 regressions. Please be patient, this may take a while."
#> [1] "Running regressions..."
#> formula aic bic rSquare warn
#> 1: dependent ~ + ind_1 + ind_4 142.5053 150.3208 0.01276 NA
#> 2: dependent ~ + ind_1 + ind_2 + ind_4 144.4609 154.8816 0.01308 NA
#> 3: dependent ~ + ind_1 + ind_3 + ind_4 144.5023 154.9230 0.01278 NA
#> 4: dependent ~ + ind_4 141.3038 146.5141 0.00698 NA
#> 5: dependent ~ + ind_1 + ind_2 + ind_3 + ind_4 146.4584 159.4842 0.01310 NA
#> 6: dependent ~ + ind_1 141.3572 146.5675 0.00660 NA
#> 7: dependent ~ + ind_2 + ind_4 143.2334 151.0489 0.00749 NA
#> 8: dependent ~ + ind_3 + ind_4 143.2697 151.0852 0.00723 NA
#> 9: dependent ~ + ind_1 + ind_2 143.2936 151.1092 0.00706 NA
#> 10: dependent ~ + ind_2 + ind_3 + ind_4 145.2018 155.6225 0.00772 NA
#> 11: dependent ~ + ind_1 + ind_3 143.3571 151.1726 0.00660 NA
#> 12: dependent ~ + ind_1 + ind_2 + ind_3 145.2936 155.7143 0.00706 NA
#> 13: dependent ~ + ind_2 + ind_3 144.1523 151.9678 0.00085 NA
#> 14: dependent ~ + ind_2 142.1719 147.3823 0.00070 NA
#> 15: dependent ~ + ind_3 142.2475 147.4578 0.00016 NA
#> xIntercept ind_1 ind_2 ind_3 ind_4
#> 1: 0.3308671 0.3731085 NA NA 0.3583815
#> 2: 0.3406195 0.3809417 0.8332921 NA 0.3637671
#> 3: 0.4019967 0.3824943 NA 0.9569377 0.3575983
#> 4: 0.6202945 NA NA NA 0.3285351
#> 5: 0.3955666 0.3899512 0.8339157 0.9594924 0.3630483
#> 6: 0.5776676 0.3414683 NA NA NA
#> 7: 0.5817247 NA 0.7908741 NA 0.3352401
#> 8: 0.6359655 NA NA 0.8536337 0.3255300
#> 9: 0.5479413 0.3504516 0.8009956 NA NA
#> 10: 0.5894053 NA 0.7944590 0.8589336 0.3322281
#> 11: 0.6558815 0.3471964 NA 0.9928694 NA
#> 12: 0.6091045 0.3558050 0.8010777 0.9959040 NA
#> 13: 0.9547657 NA 0.7578417 0.8886398 NA
#> 14: 0.9527065 NA 0.7552192 NA NA
#> 15: 0.8802303 NA NA 0.8827580 NA
# With interaction terms, without multithreading ('multi' is left unset)
## One top result
regsearch(dt, "dependent", c("ind_1", "ind_2", "ind_3", "ind_4"),
          minvar = 1, maxvar = 4, family = "binomial",
          topN = 1, interactions = TRUE)
#> Warning: Missing 'multi' argument. Defaulting to FALSE.
#> [1] "Gathering variables..."
#> [1] "WARNING: Using interaction terms without multithreading may take a very long time"
#> [1] "Assembling regresions..."
#> [1] "Creating 385 formulas. Please be patient, this may take a while."
#> [1] "Creating regressions..."
#> [1] "Running 105 regressions. Please be patient, this may take a while."
#> [1] "Running regressions..."
#>
#> Call:
#> glm(formula = as.formula(x), family = family, data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6655 -1.1921 0.8124 1.1342 1.4298
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.6204 0.7692 0.807 0.420
#> ind_3 -1.5993 1.3131 -1.218 0.223
#> ind_4 -1.5214 1.6048 -0.948 0.343
#> ind_3:ind_4 4.5781 2.8658 1.598 0.110
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 138.27 on 99 degrees of freedom
#> Residual deviance: 134.62 on 96 degrees of freedom
#> AIC: 142.62
#>
#> Number of Fisher Scoring iterations: 4
#>
#> formula aic bic rSquare warn
#> 1: dependent ~ + ind_3*ind_4 142.6173 153.0380 0.02641 NA
#> 2: dependent ~ + ind_3*ind_4 + ind_1 144.0950 157.1209 0.03019 NA
#> 3: dependent ~ + ind_3*ind_4 + ind_2 144.4206 157.4464 0.02783 NA
#> 4: dependent ~ + ind_3*ind_4 + ind_1 + ind_2 145.9382 161.5692 0.03132 NA
#> 5: dependent ~ + ind_1*ind_4 + ind_3*ind_4 145.6978 161.3288 0.03306 NA
#> ---
#> 101: dependent ~ + ind_2*ind_3 + ind_1 147.2936 160.3194 0.00706 NA
#> 102: dependent ~ + ind_2 + ind_3 144.1523 151.9678 0.00085 NA
#> 103: dependent ~ + ind_2 142.1719 147.3823 0.00070 NA
#> 104: dependent ~ + ind_2*ind_3 146.1511 156.5718 0.00085 NA
#> 105: dependent ~ + ind_3 142.2475 147.4578 0.00016 NA
#> xIntercept ind_1 ind_2 ind_3 ind_4 ind_1.ind_2 ind_1.ind_3
#> 1: 0.4199248 NA NA 0.2232378 0.3431217 NA NA
#> 2: 0.6678149 0.4705920 NA 0.2220420 0.3624368 NA NA
#> 3: 0.5224838 NA 0.6576698 0.2091700 0.3202841 NA NA
#> 4: 0.7486319 0.4880024 0.6923043 0.2097495 0.3405966 NA NA
#> 5: 0.9716514 0.3596372 NA 0.2027834 0.7627338 NA NA
#> ---
#> 101: 0.7407833 0.3561515 0.9149675 0.9962740 NA NA NA
#> 102: 0.9547657 NA 0.7578417 0.8886398 NA NA NA
#> 103: 0.9527065 NA 0.7552192 NA NA NA NA
#> 104: 0.9503169 NA 0.8640432 0.9223537 NA NA NA
#> 105: 0.8802303 NA NA 0.8827580 NA NA NA
#> ind_1.ind_4 ind_2.ind_3 ind_2.ind_4 ind_3.ind_4 ind_3.ind_2 ind_4.ind_2
#> 1: NA NA NA 0.1101492 NA NA
#> 2: NA NA NA 0.1273546 NA NA
#> 3: NA NA NA 0.1024724 NA NA
#> 4: NA NA NA 0.1194127 NA NA
#> 5: 0.5303457 NA NA NA NA NA
#> ---
#> 101: NA 0.9935548 NA NA NA NA
#> 102: NA NA NA NA NA NA
#> 103: NA NA NA NA NA NA
#> 104: NA 0.9717019 NA NA NA NA
#> 105: NA NA NA NA NA NA
#> ind_4.ind_3
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: 0.107964
#> ---
#> 101: NA
#> 102: NA
#> 103: NA
#> 104: NA
#> 105: NA
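The `multi` argument is described above only as enabling multithreading. As a rough illustration of the general idea, the sketch below fits a handful of formulas across worker processes using the base `parallel` package. How `regsearch` distributes its work internally is not documented here, so the helper `fit_aic`, the worker count, and the choice of `parLapply` are all assumptions.

```r
library(parallel)

set.seed(1)  # arbitrary seed, only so that the sketch is repeatable
dt <- data.frame(dependent = sample(c(0, 1), 100, replace = TRUE),
                 ind_1 = runif(100),
                 ind_2 = runif(100))

formulas <- c("dependent ~ ind_1",
              "dependent ~ ind_2",
              "dependent ~ ind_1 + ind_2")

# Hypothetical helper: fit one formula and return its AIC.
fit_aic <- function(formula_text, data) {
  fit <- glm(as.formula(formula_text), family = "binomial", data = data)
  AIC(fit)
}

cl <- makeCluster(2)  # two worker processes; adjust to the machine
aics <- parLapply(cl, formulas, fit_aic, data = dt)  # data is sent to each worker
stopCluster(cl)

data.frame(formula = formulas, aic = unlist(aics))
```

Each regression is independent of the others, so the work splits cleanly across workers. The benefit is largest when interaction terms push the number of formulas into the hundreds, as in the last example above.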